Current
document management and database systems provide text search and
retrieval capabilities, but generally lack the ability to utilize
the documents' logical and physical structures. In this talk,
we define a general framework for describing the physical and
logical structure of documents, and describe a general system
for document image retrieval that is able to make use of document
structure. We discuss the use of structural similarity for retrieval,
define a measure of structural similarity between document
images based on content area overlap, and compare similarity
ratings based on this measure with human relevance judgments.
Finally, we investigate document type classification using features
related to physical layout structure, with both decision-tree
and self-organizing map classifiers; in these experiments, too,
ground truth was provided by human judgments.
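
To make the idea of an overlap-based similarity measure concrete, here is a
minimal sketch. It is not the measure defined in the talk, whose exact form
is not given in this abstract. The sketch assumes that each page is
described by rectangular content areas on a common integer page grid, and it
scores two pages by the fraction of covered grid cells they share (a Jaccard
ratio). The names `covered_cells` and `overlap_similarity` and the example
layouts are hypothetical.

```python
from typing import Iterable, Set, Tuple

# A content area as (x0, y0, x1, y1) in integer grid units,
# covering cells x0 <= x < x1 and y0 <= y < y1 (an assumed representation).
Box = Tuple[int, int, int, int]


def covered_cells(boxes: Iterable[Box]) -> Set[Tuple[int, int]]:
    """Return the set of grid cells covered by any content area."""
    return {
        (x, y)
        for x0, y0, x1, y1 in boxes
        for x in range(x0, x1)
        for y in range(y0, y1)
    }


def overlap_similarity(page_a: Iterable[Box], page_b: Iterable[Box]) -> float:
    """Jaccard overlap of the content areas of two pages, in [0, 1]."""
    cells_a = covered_cells(page_a)
    cells_b = covered_cells(page_b)
    union = cells_a | cells_b
    if not union:
        # Two empty pages are treated as identical (an assumption).
        return 1.0
    return len(cells_a & cells_b) / len(union)


# Hypothetical layouts on a 100 x 100 grid.
two_column = [(10, 10, 90, 20), (10, 30, 45, 90), (55, 30, 90, 90)]
single_block = [(10, 10, 90, 90)]

print(overlap_similarity(two_column, two_column))
print(overlap_similarity(two_column, single_block))
```

A score like this could be used to rank stored page images against a query
layout. Whether the resulting ranking agrees with human relevance judgments
is an empirical question, which is the comparison the abstract describes.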