LAMP - Language Group

LAMP Seminar
Language and Media Processing Laboratory
Conference Room 4406
A.V. Williams Building
University of Maryland

Tuesday Oct 20, 1998
1:00 PM

Appearance-Based Recognition of Document Images

Christian K. Shin (**)

ABSTRACT

In this talk, I will present a method for partitioning a scanned document image into two kinds of regions: (i) regions encompassing running text, i.e., text formatted in paragraphs and columns; and (ii) regions encompassing text formatted in other layout structures, such as headings, lists, and tables. Once these regions are recognized, the structure of a document image is described in terms of its parts, their visual features, and the relations among them. These structural descriptions are then compared to search for similar images and, eventually, to recognize genres. Structural matching is appealing for comparing scanned documents because, while the spatial arrangement of the page components is standardized within a given genre, the component shapes may vary greatly between documents. Business letters, for example, include highly variable standard components - author and recipient addresses, main body, signature - in a conventional spatial arrangement.

I will focus my discussion on which features we recognize from a segmented scanned image, and how a user can define additional features using a simple feature definition language. I will also describe how structural page matching based on these recognized features is applied to interactive, appearance-directed search over large corpora of scanned documents, focusing on two considerations. First, the organization of the corpus with respect to appearance is not unique but transient, induced by the user's task and perspective. Second, the corpus continually evolves as new documents, perhaps in new genres, are added. In this context the structural approach has the added appeal that it is an open technology: it represents page images in terms that are often perceptually intuitive to the end user, thus lending itself to convenient, interactive customization and extension at a hierarchy of computational levels. I will describe these levels of tailorability as implemented in The Integrator, a set of WWW-based interfaces for hybrid document search.

REFERENCES
----------

1. Jeanette Blomberg, Lucy Suchman, and Randall Trigg, "Reflections on a Work-Oriented Design Project", Human-Computer Interaction, 1996, Volume 11, pp. 237-265

2. James Mahoney, "Detecting the Running Text in a Page Image", Internal Technical Report, Systems and Practices Laboratory, Xerox PARC

3. James Mahoney, Jeanette Blomberg, Randall Trigg, and Christian Shin, "System for Dynamically Specifying Layout Components of Document Images", U.S. Patent Application (Filed in 1997), Xerox PARC

(**) Christian spent the last two summers at Xerox Palo Alto Research Center (Xerox PARC), where he participated in a project named "the Integrator project" in the Systems and Practices Laboratory (SPL). The goal of the project is to integrate (as the name indicates) document image analysis technology into a system that is capable of recognizing document types or genres from the visual characteristics of documents. As a result, a web-based interface has been developed that supports image-based search and genre recognition over heterogeneous document corpora.