Page Decomposition and Structural Analysis
 


Overview

For document decomposition, salient regions in a document can take the form of text, graphics, or half-tones, and can be of nearly any shape or size. A page decomposition module provides a first approximation to these regions, so that individual components can be passed to specialized processing.

The decomposition is clearly class-dependent, and unless a specific model is available to guide the analysis, the correct description of a region may not always be obtained at the pixel or component level. Consider, for example, the problem of table interpretation. A valid decomposition may label a "table" region appropriately, but depending on the complexity of the model, a structural analysis may require a more complete description of the column, spacing, and separator components. For this reason, the decomposition is not claimed to be complete; rather, it divides the document into components which act as a guide to the interpretation process.

A representation is under development which allows the description of regions in the document according to their physical characteristics (e.g., text, graphics, and half-tones), which can later be augmented with appropriate semantic labels.
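As an illustration only, such a representation might be sketched as follows. The names PhysicalType, Region, and semantic_label are hypothetical and do not come from the laboratory's system; the sketch simply shows a region that is first described by its physical characteristics and later augmented with a semantic label.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class PhysicalType(Enum):
    """Physical region types named in the text."""
    TEXT = "text"
    GRAPHICS = "graphics"
    HALFTONE = "half-tone"


@dataclass
class Region:
    """A page region described by physical characteristics (hypothetical layout)."""
    x: int
    y: int
    width: int
    height: int
    physical_type: PhysicalType
    # Assumption: the semantic label is optional and assigned by a later stage.
    semantic_label: Optional[str] = None
    # Sub-regions, e.g. the columns and separators inside a table region.
    children: List["Region"] = field(default_factory=list)

    def labeled(self) -> bool:
        """True once a semantic label has been attached."""
        return self.semantic_label is not None


# Output of a decomposition step: physical descriptions only.
page = [
    Region(50, 40, 500, 60, PhysicalType.TEXT),
    Region(50, 120, 240, 300, PhysicalType.HALFTONE),
]

# A later structural analysis step augments a region with a semantic label.
page[0].semantic_label = "title"
```

Keeping the semantic label separate from the physical type lets the decomposition stand on its own when no document model is available.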

For general document understanding problems, in which little is known a priori about the contents of the document, the processes of decomposition, derivation of document class, and logical component labeling are interdependent. Beginning with a candidate decomposition of the document, as described above, it is possible to establish a hierarchy of abstraction which extends from the physical entities (syntactic components) up through the logical entities (semantic labels). In general, this parallels a scene description hierarchy in general computer vision, where the low-level information is at the pixel level and the high-level description involves the identification of objects, their components, and their relationships with other objects. The analysis task can be viewed as the derivation of a meaningful instantiation of this hierarchy, based on information about the layout of the document and a model space which describes valid structural and logical document organizations.

The structural analysis of documents involves, more specifically, the derivation of the logical or semantic meaning of a set of salient fields or regions within a document. In general, the problem involves using attributes and structural relationships of the document to label document components within the contextual rules dictated by the document class or type (memo, letter, journal article, newspaper, etc.).

Humans' ability to label these components in a meaningful way is due, in part, to their ability to understand the functionality of the document. Once the intent of the document is known, the document can be associated with a document class, and a model space can be invoked which defines a general description of which types of components are expected and how they may be arranged. A process can then be undertaken in which the model is instantiated and components are labeled in a way which is consistent with the model expectations. If the class of documents is known, the interpretation is constrained by the layout characteristics which make the document an instance of that class.

The analysis of structured documents relies on three components: 1) a meaningful decomposition of the document into primitive physical entities, 2) the association of the document with a class of documents which can be used to guide the analysis, and 3) the labeling of individual components as to their logical meanings and their relationships with each other.
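The third component, labeling under the guidance of a class model, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method used in this research: the LETTER_MODEL rules, the thresholds, and the label_components function are all hypothetical, and a component is reduced to its physical type and vertical extent as fractions of the page height.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical component: (physical_type, top, bottom), with top and bottom
# given as fractions of the page height (0.0 = top edge, 1.0 = bottom edge).
Component = Tuple[str, float, float]

# Hypothetical model for the "letter" class: each logical label is paired with
# a predicate over physical type and position. Thresholds are illustrative.
LETTER_MODEL: Dict[str, Callable[[Component], bool]] = {
    "letterhead": lambda c: c[0] in ("text", "graphics") and c[2] <= 0.15,
    "body": lambda c: c[0] == "text" and c[1] >= 0.15 and c[2] <= 0.85,
    "signature": lambda c: c[0] in ("text", "graphics") and c[1] >= 0.85,
}


def label_components(
    components: List[Component],
    model: Dict[str, Callable[[Component], bool]],
) -> List[Optional[str]]:
    """Assign a logical label only where exactly one model rule accepts."""
    labels: List[Optional[str]] = []
    for component in components:
        matches = [name for name, accepts in model.items() if accepts(component)]
        # Zero or several matches are left unlabeled rather than guessed.
        labels.append(matches[0] if len(matches) == 1 else None)
    return labels


components = [
    ("graphics", 0.02, 0.10),
    ("text", 0.20, 0.80),
    ("text", 0.88, 0.95),
]
print(label_components(components, LETTER_MODEL))
# prints ['letterhead', 'body', 'signature']
```

Leaving ambiguous components unlabeled reflects the point made earlier that a decomposition acts as a guide to interpretation and is not assumed to be complete.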

This research is centered around the idea of using document models to perform a reconciliation of information from the low-level layout relationships (bottom-up) and from the high-level model space constraints of the domain (top-down). Information which is recovered about the document structure constrains the model space, and in turn, the constrained model space dictates what is looked for in the document.
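The interplay between the two directions can be illustrated with a toy sketch. The model space contents, the component names, and the reconcile function below are assumptions made for the example only; each model is reduced to the set of components it expects, which is far simpler than a real document model.

```python
from typing import Dict, Set, Tuple

# Hypothetical model space: document class -> components that class expects.
MODEL_SPACE: Dict[str, Set[str]] = {
    "memo": {"header_block", "body"},
    "letter": {"letterhead", "body", "signature"},
    "journal_article": {"title", "abstract", "two_column_body"},
}


def reconcile(
    observed: Set[str], model_space: Dict[str, Set[str]]
) -> Tuple[Dict[str, Set[str]], Set[str]]:
    """One reconciliation step between layout evidence and the model space."""
    # Bottom-up: recovered structure constrains the model space. A model
    # survives only if it expects every component observed so far.
    surviving = {
        name: expected
        for name, expected in model_space.items()
        if observed <= expected
    }
    # Top-down: the constrained model space dictates what to look for next,
    # namely components some surviving model expects but which are not yet seen.
    to_seek = set().union(*surviving.values()) - observed
    return surviving, to_seek


# With only a body found, two classes remain plausible.
models, seek = reconcile({"body"}, MODEL_SPACE)
print(sorted(models), sorted(seek))
# prints ['letter', 'memo'] ['header_block', 'letterhead', 'signature']

# Finding a signature narrows the model space to a single class.
models, seek = reconcile({"body", "signature"}, MODEL_SPACE)
print(sorted(models), sorted(seek))
# prints ['letter'] ['letterhead']
```

In practice the step would be repeated, with each newly located component fed back in as evidence until the model space stops shrinking.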

Automatic model generation may be achievable when sufficiently detailed information is available about the style specification used to generate a type of document. Although individual models in the model space could be separately hand-generated, automatic generation of models will permit the rapid incorporation of a much wider range of document types.







© Copyright 2001, Language and Media Processing Laboratory, University of Maryland, All rights reserved.