Overview
Effectively querying document image databases is a
problem that presents interesting challenges, primarily because
of the rich structure of the underlying documents. Unlike traditional
databases in which structures and relationships between elements
are well defined, an unprocessed document image contains a wealth
of information, but is only represented initially by a set of
pixels. For traditional databases, the underlying structure allows
the user to form queries on specific fields or relationships between
them, and retrieve a list of entries which match a specific query.
Ideally, image databases should allow the same basic level of
access, although it is often difficult to define queries in a
quantitative way to facilitate such operations.
Work has recently begun on a joint project between the Universities
of Maryland and Oulu on the development of a system for Intelligent
Document Image Retrieval (IDIR). The IDIR architecture will provide
close connections with, and utilization of, document analysis
and image processing techniques, advanced computing and networking,
and current approaches to database management. The system design
is aggressively modularized to ease the development of the
individual parts that make up the complete solution. These
components include interface specifications, multipurpose feature
extraction, an integrated, efficient query language, physical retrieval
from an object-oriented database, and delivery of retrieved objects.
The main technical accomplishment of this project is that we have
developed methods for spatial indexing of document components
and defined measures of document similarity based on these methods.
This essentially allows us to define document image similarity
based on structure. We have used these techniques to query small databases
for images that appear "similar" to a given query image with respect
to structure. We can demonstrate their effectiveness for identifying
documents such as title pages, bibliographies, and advertisements
by example.
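As a rough illustration of what structure-based similarity could look like, the sketch below compares two pages by the spatial layout of their components. It is not the IDIR method: the `Box` type, the grid-occupancy index, the 8 x 8 grid size, and the function names are all assumptions made for this example. Each page is assumed to be already segmented into component bounding boxes in page-relative coordinates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Bounding box of one page component, in page-relative units (0..1).

    Hypothetical type for this sketch; not part of IDIR.
    """
    x0: float
    y0: float
    x1: float
    y1: float


def occupancy_grid(boxes, rows=8, cols=8):
    """Build a coarse spatial index: per-cell fraction of area covered by boxes."""
    grid = [[0.0] * cols for _ in range(rows)]
    cell_w, cell_h = 1.0 / cols, 1.0 / rows
    for b in boxes:
        for r in range(rows):
            for c in range(cols):
                cx0, cy0 = c * cell_w, r * cell_h
                overlap_w = min(b.x1, cx0 + cell_w) - max(b.x0, cx0)
                overlap_h = min(b.y1, cy0 + cell_h) - max(b.y0, cy0)
                if overlap_w > 0 and overlap_h > 0:
                    covered = (overlap_w * overlap_h) / (cell_w * cell_h)
                    # Cap at 1.0 so overlapping boxes do not over-count a cell.
                    grid[r][c] = min(1.0, grid[r][c] + covered)
    return grid


def layout_similarity(boxes_a, boxes_b, rows=8, cols=8):
    """Similarity in [0, 1]: one minus the mean absolute grid difference."""
    ga = occupancy_grid(boxes_a, rows, cols)
    gb = occupancy_grid(boxes_b, rows, cols)
    total = sum(
        abs(ga[r][c] - gb[r][c]) for r in range(rows) for c in range(cols)
    )
    return 1.0 - total / (rows * cols)


def rank_by_layout(query_boxes, database):
    """Query by example: return (name, score) pairs, most similar first."""
    scores = [
        (name, layout_similarity(query_boxes, boxes))
        for name, boxes in database.items()
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

Calling `rank_by_layout` with the boxes of a query page and a dictionary that maps page names to their boxes returns the stored pages ordered by layout similarity. A sparse, title-page-like query should therefore rank other sparse pages above a page filled with body text. A fixed grid is only one simple choice of spatial index, picked here to keep the example short.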
Project Status
Recent work has focused on the development of spatial indexing techniques
for document images. The current plan is to incorporate these
indexing techniques and test the entire system on heterogeneous
databases. Furthermore, we are integrating other indexing mechanisms
such as imperfect OCR and the results of page classification.