In this talk, I will present a method for partitioning a scanned
document image into two kinds of regions: (i) regions encompassing
running text, i.e., text formatted in paragraphs and columns; and
(ii) regions encompassing text formatted in other layout structures,
such as headings, lists, and tables. Once these regions are recognized,
the structure of a document image is described in terms of its parts,
their visual features, and the relations among them.
Such structural descriptions are compared to search for similar
images and, ultimately, to recognize genres. This structural matching
is appealing for comparing scanned documents because while the
spatial arrangement of the page components is standardized within
a given genre, the component shapes may vary greatly between documents.
Business letters, for example, include highly variable standard
components - author and recipient addresses, main body, signature
- in a conventional spatial arrangement.
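The idea of comparing pages by spatial arrangement rather than by component shape can be illustrated with a minimal sketch. The abstract does not specify the representation or the similarity measure, so everything here is an assumption made for illustration: the `Region` type, the coarse above/below and left/right relations computed from region centers, and the Jaccard overlap used in `similarity`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A labeled page region; coordinates are fractions of page size (hypothetical)."""
    label: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height

    @property
    def center(self):
        return (self.x + self.w / 2, self.y + self.h / 2)


def relation(a, b):
    """Coarse spatial relation of region a to region b, based on their centers."""
    (ax, ay), (bx, by) = a.center, b.center
    vertical = "above" if ay < by else "below"
    horizontal = "left_of" if ax < bx else "right_of"
    return (vertical, horizontal)


def describe(regions):
    """Structural description: the set of (label, relation, label) triples."""
    return {
        (a.label, relation(a, b), b.label)
        for a in regions
        for b in regions
        if a is not b
    }


def similarity(page_a, page_b):
    """Jaccard overlap of two structural descriptions (an assumed measure)."""
    da, db = describe(page_a), describe(page_b)
    if not da and not db:
        return 1.0
    return len(da & db) / len(da | db)


# Two business letters: same arrangement, different component sizes.
letter_a = [
    Region("sender_address", 0.60, 0.05, 0.30, 0.10),
    Region("recipient_address", 0.10, 0.20, 0.30, 0.10),
    Region("body", 0.10, 0.35, 0.80, 0.40),
    Region("signature", 0.55, 0.80, 0.30, 0.08),
]
letter_b = [
    Region("sender_address", 0.55, 0.04, 0.40, 0.12),
    Region("recipient_address", 0.08, 0.22, 0.35, 0.08),
    Region("body", 0.10, 0.34, 0.80, 0.50),
    Region("signature", 0.55, 0.88, 0.35, 0.06),
]
# Same components, but with the signature moved to the top left of the page.
rearranged = [
    Region("signature", 0.10, 0.02, 0.30, 0.06),
    Region("sender_address", 0.60, 0.10, 0.30, 0.10),
    Region("recipient_address", 0.10, 0.25, 0.30, 0.10),
    Region("body", 0.10, 0.40, 0.80, 0.40),
]

print(similarity(letter_a, letter_b))
print(similarity(letter_a, rearranged))
```

Because the description records only labels and relative positions, the two letters match fully despite their differing component sizes, while the rearranged page scores lower.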
I will focus my discussion on which features we recognize in
a segmented scanned image, and how a user can define additional
features using a simple feature definition language. I will also
describe how structural page matching based on these recognized
features is applied to interactive, appearance-directed search
over large corpora of scanned documents, focusing on two considerations.
First, the organization of the corpus with respect to appearance
is not unique but transient, induced by the user's task and perspective.
Second, the corpus continually evolves as new documents, perhaps
in new genres, are added. In this context the structural approach
has the added appeal that it is an open technology: it represents
page images in terms that are often perceptually intuitive to
the end user, thus lending itself to convenient, interactive customization
and extension at a hierarchy of computational levels. I will describe
these levels of tailorability as implemented in The Integrator,
a set of WWW-based interfaces for hybrid document search.
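To make the notion of user-defined features concrete, here is a small sketch of how new features might be composed from simpler tests on segmented regions. The syntax of the actual feature definition language is not given in this abstract, so Python combinators stand in for it; `Region`, `kind_is`, `within_top`, `all_of`, and `FeatureRegistry` are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Region:
    """A segmented region; coordinates are fractions of page size (hypothetical)."""
    kind: str  # "running_text" or "other"
    x: float
    y: float
    w: float
    h: float


Predicate = Callable[[Region], bool]


def kind_is(kind: str) -> Predicate:
    """Primitive test: the region has the given kind."""
    return lambda r: r.kind == kind


def within_top(fraction: float) -> Predicate:
    """Primitive test: the region lies entirely in the top part of the page."""
    return lambda r: r.y + r.h <= fraction


def all_of(*preds: Predicate) -> Predicate:
    """Combinator: every given test must hold."""
    return lambda r: all(p(r) for p in preds)


class FeatureRegistry:
    """Holds named, user-defined features and applies them to a page."""

    def __init__(self) -> None:
        self._features: Dict[str, Predicate] = {}

    def define(self, name: str, predicate: Predicate) -> None:
        self._features[name] = predicate

    def detect(self, regions: List[Region]) -> Dict[str, List[Region]]:
        return {
            name: [r for r in regions if pred(r)]
            for name, pred in self._features.items()
        }


# A user adds a "letterhead" feature without touching the segmentation code.
registry = FeatureRegistry()
registry.define("letterhead", all_of(kind_is("other"), within_top(0.15)))
registry.define("main_body", kind_is("running_text"))

page = [
    Region("other", 0.10, 0.02, 0.80, 0.08),
    Region("running_text", 0.10, 0.30, 0.80, 0.50),
    Region("other", 0.60, 0.85, 0.30, 0.05),
]
found = registry.detect(page)
print({name: len(matches) for name, matches in found.items()})
```

Defining features as compositions of perceptually meaningful tests is one way to keep them intuitive to an end user; the real system may organize its levels of customization quite differently.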
(**)
Christian spent the last two summers at the Xerox Palo Alto Research Center
(Xerox PARC), and participated in a project named "the Integrator
project" in the Systems and Practices Laboratory (SPL). The goal
of the project is to integrate (as the name indicates) document
image analysis technology into a system that is capable of recognizing
document types or genres from the visual characteristics of documents.
As a result, a web-based interface has been developed that supports
image-based search and genre recognition over heterogeneous document
corpora.