Overview
For document decomposition, salient regions in a document can
take the form of text, graphics, or half-tones, and can be of
nearly any shape or size. A first approximation to these regions
is obtained from a page decomposition module, so that specialized
processing can be applied to individual components.

The decomposition is clearly class-dependent, and unless a specific
model is available to guide the analysis, the correct descriptions
of the region may not always be obtained at the pixel or component
levels. Consider, for example, the problem of table interpretation.
A valid decomposition may label a "table" region appropriately,
but depending on the complexity of the model, a structural analysis
may require a more complete description of the column, spacing,
and separator components. For this reason, it is not claimed that
the decomposition is complete, but that it divides the document
into components which act as a guide to the interpretation process.

A representation is under development which allows the description
of regions in the document according to their physical characteristics
(e.g., text, graphics, and half-tones), which can later be augmented
with appropriate semantic labels.
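
As an illustration only, such a representation might be sketched as follows. The names `Region`, `PhysicalType`, `bbox`, and `unlabeled` are hypothetical and do not come from the representation under development; the sketch only assumes that each region carries a physical type from the start and a semantic label that is filled in later.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class PhysicalType(Enum):
    # Physical characteristics named in the text.
    TEXT = "text"
    GRAPHICS = "graphics"
    HALFTONE = "half-tone"


@dataclass
class Region:
    # (x0, y0, x1, y1) in pixels; an assumed convention, not from the source.
    bbox: Tuple[int, int, int, int]
    physical_type: PhysicalType
    # Left empty by decomposition; augmented later during interpretation.
    semantic_label: Optional[str] = None


def unlabeled(regions: List[Region]) -> List[Region]:
    """Return the regions that have not yet received a semantic label."""
    return [r for r in regions if r.semantic_label is None]


# Decomposition yields physical descriptions only.
regions = [
    Region((50, 40, 800, 90), PhysicalType.TEXT),
    Region((50, 120, 400, 500), PhysicalType.HALFTONE),
    Region((420, 120, 800, 500), PhysicalType.GRAPHICS),
]

# Interpretation later augments a region with a semantic label.
regions[0].semantic_label = "title"
pending = unlabeled(regions)
```

Keeping the semantic label optional reflects the point that the decomposition is not claimed to be complete: a region can exist with only its physical description until the interpretation process labels it.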

For general document understanding problems, in which little is known
a priori about the contents of the document, the processes of decomposition,
derivation of document class, and logical component labeling are
interdependent. Beginning with a candidate decomposition of the
document, as described above, it is possible to establish a hierarchy
of abstraction which extends from the physical entities (syntactic
components) up through the logical entities (semantic labels).
This parallels the scene description hierarchy in general
computer vision, where the low-level information is at the pixel
level and the high-level description involves the identification
of objects, their components, and their relationships with other objects.
The analysis task can be viewed as the derivation of a meaningful
instantiation of this hierarchy based on information about
the layout of the document and a model space which describes valid
structured and logical document organizations.

The structural analysis of documents involves, more specifically, the
derivation of the logical or semantic meaning of a set of salient
fields or regions within a document. In general the problem involves
using attributes and structural relationships of the document to label
document components within the contextual rules dictated by the
document class or type (memo, letter, journal article, newspaper,
etc.).

Humans' ability to label these components in a meaningful way is due,
in part, to their ability to understand the functionality of the
document. Once the intent of the document is known, the document can be
associated with a document class, and a model space can be invoked which
defines a general description of which types of components are expected
and how they may be arranged. A process can then be undertaken
in which the model is instantiated and components are labeled
in a way which is consistent with the model expectations. If the
class of documents is known, the interpretation is constrained
by the layout characteristics which make the document an instance
of that class.
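
A minimal sketch of class-constrained labeling is given below, assuming the document class is already known. The `LETTER_MODEL` dictionary, its layout thresholds, and the `label_regions` helper are invented for illustration and are not the models used in this research; a region receives a label only when exactly one model expectation matches it.

```python
from typing import Callable, Dict, List, Optional, Tuple

# (x0, y0, x1, y1) with y growing downward; an assumed convention.
Box = Tuple[int, int, int, int]

# Hypothetical model for a "letter" class: each expected component is
# paired with a layout test over (box, page_height).
LETTER_MODEL: Dict[str, Callable[[Box, int], bool]] = {
    "letterhead": lambda b, h: b[3] <= 0.2 * h,
    "signature": lambda b, h: b[1] >= 0.8 * h,
    "body": lambda b, h: b[1] >= 0.2 * h and b[3] <= 0.8 * h,
}


def label_regions(
    boxes: List[Box],
    page_height: int,
    model: Dict[str, Callable[[Box, int], bool]],
) -> List[Optional[str]]:
    """Label each box with the single model component it satisfies.

    Boxes matching no component, or more than one, are left as None.
    """
    labels: List[Optional[str]] = []
    for box in boxes:
        matches = [name for name, test in model.items() if test(box, page_height)]
        labels.append(matches[0] if len(matches) == 1 else None)
    return labels


boxes = [(50, 20, 800, 150), (50, 250, 800, 800), (400, 900, 800, 980)]
labels = label_regions(boxes, 1000, LETTER_MODEL)
```

Leaving ambiguous regions unlabeled, rather than guessing, keeps the result consistent with the model expectations; a fuller system would presumably resolve such cases with additional structural relationships.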

The analysis of structured documents relies on three components: 1)
a meaningful decomposition of the document into primitive physical
entities, 2) the association of the document with a class of documents
which can be used to guide the analysis, and 3) the labeling of
individual components as to their logical meanings and their relationships
with each other.

This research is centered around the idea of using document models
to perform a reconciliation of information from the low-level
layout relationships (bottom-up) and from the high-level model
space constraints of the domain (top-down). Information which
is recovered about the document structure constrains the model
space, and in turn, the constrained model space dictates what
is looked for in the document.
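
The reconciliation loop can be sketched roughly as follows. This is an assumption-laden toy: the `MODEL_SPACE` contents and the `constrain` and `expected_next` helpers are hypothetical, and each model is reduced to a bare set of expected component names, which is far simpler than a real document model.

```python
from typing import Dict, Set

# Hypothetical model space: each document class lists the components it expects.
MODEL_SPACE: Dict[str, Set[str]] = {
    "memo": {"to_from_block", "subject_line", "body"},
    "letter": {"letterhead", "salutation", "body", "signature"},
    "journal_article": {"title", "abstract", "body", "references"},
}


def constrain(models: Dict[str, Set[str]], observed: Set[str]) -> Dict[str, Set[str]]:
    """Bottom-up step: keep only models that expect every observed component."""
    return {name: parts for name, parts in models.items() if observed <= parts}


def expected_next(models: Dict[str, Set[str]], observed: Set[str]) -> Set[str]:
    """Top-down step: components the surviving models predict but that are not yet found."""
    predicted: Set[str] = set()
    for parts in models.values():
        predicted |= parts
    return predicted - observed


observed = {"body"}
models = constrain(MODEL_SPACE, observed)   # "body" alone rules nothing out
observed.add("salutation")
models = constrain(models, observed)        # recovered structure constrains the model space
todo = expected_next(models, observed)      # constrained space dictates what to look for
```

Each pass narrows the model space from recovered structure and then uses the narrowed space to decide which components to search for next, mirroring the two directions of information flow described above.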

Automatic model generation may be achievable when sufficiently detailed
information is available about the style specification used to
generate a type of document. Although individual models in the
model space could be separately hand-generated, automatic generation
of models will permit the rapid incorporation of a much wider
range of document types.