LAMP - Language Group

LAMP Seminar
Language and Media Processing Laboratory
Conference Room 4424
A.V. Williams Building
University of Maryland

November 21, 2000, 1:00 PM
Christian Shin

University of Maryland

The Roles of Document Structure in Document Image Retrieval and Classification

ABSTRACT

Current document management and database systems provide text search and retrieval, but generally cannot exploit the logical and physical structure of documents. In this talk, we define a general framework for describing the physical and logical structure of documents, and describe a general system for document image retrieval that makes use of that structure. We discuss the use of structural similarity for retrieval: we define a measure of structural similarity between document images based on content area overlap, and compare similarity ratings based on this measure with human relevance judgments. Finally, we investigate document type classification using features related to physical layout structure, with both decision-tree and self-organizing map classifiers; in these experiments, too, ground truth was provided by human judgments.
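
To make the idea of a similarity measure based on content area overlap concrete, here is a minimal illustrative sketch. It is not the measure presented in the talk, whose definition is not given in this abstract. It assumes each document image is described by a list of content-zone rectangles in page-normalized coordinates, marks the grid cells those zones cover, and scores two pages by the ratio of shared to combined covered cells (a Jaccard ratio). The names `occupancy` and `overlap_similarity`, the box format, and the grid size are all hypothetical choices made for this illustration.

```python
from typing import List, Set, Tuple

# Hypothetical zone format: (x0, y0, x1, y1), each value in [0, 1],
# i.e. coordinates normalized by page width and height.
Box = Tuple[float, float, float, float]


def occupancy(boxes: List[Box], grid: int = 50) -> Set[Tuple[int, int]]:
    """Return the set of grid cells whose centers fall inside any zone."""
    cells = set()
    for x0, y0, x1, y1 in boxes:
        for row in range(grid):
            cy = (row + 0.5) / grid  # vertical center of this cell
            if not (y0 <= cy < y1):
                continue
            for col in range(grid):
                cx = (col + 0.5) / grid  # horizontal center of this cell
                if x0 <= cx < x1:
                    cells.add((row, col))
    return cells


def overlap_similarity(a: List[Box], b: List[Box], grid: int = 50) -> float:
    """Jaccard overlap of covered content area (an assumed stand-in measure)."""
    cells_a = occupancy(a, grid)
    cells_b = occupancy(b, grid)
    union = cells_a | cells_b
    if not union:
        return 1.0  # two empty layouts are treated as identical
    return len(cells_a & cells_b) / len(union)


# Example layouts: a page whose content fills the top half,
# and a page whose content fills the whole page.
top_half = [(0.0, 0.0, 1.0, 0.5)]
full_page = [(0.0, 0.0, 1.0, 1.0)]
score = overlap_similarity(top_half, full_page)
```

A score near 1 means the two pages place content in nearly the same regions, and a score near 0 means their content areas barely coincide. Ratings of this kind are what one would compare against human relevance judgments.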