Overview
Document imaging technology has developed to the point where it
is not uncommon for organizations to scan large numbers of documents
into databases with little or no index information. This may be
done for archival purposes with the index as simple as a case
number, or with the ultimate goal of automatically extracting
index information for content-based queries. Maintaining the integrity
of such a database is difficult, especially in a distributed environment
where copies of the same documents may be scanned at different
times.
We
are currently addressing the case of image-variant documents,
where multiple instances of an effectively identical original
source are scanned for incorporation into a database. The original
documents may have been written on, stapled, torn, taped, or may
have pages missing or a cover added. The document may have been
copied repeatedly, so different-generation copies are involved.
The document may have been scanned at different times and on different
devices, so resolution, illumination, and contrast are also issues.
Skew and translation may introduce further distortion.
Our
approach converts a representative line of text in a document
image into a signature using a shape coding technique. The technique
labels each symbol in the line according to very simple shape
properties, such as whether it is an ascender, a descender, limited
to the x-line, multi-component, or punctuation.
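The shape coding idea can be illustrated with a minimal sketch. The
real system labels connected components in a scanned image; the sketch
below stands in for that step by classifying plain-text characters, and
the code letters (A, D, x, M, P), the character sets, and the function
names are all assumptions made for illustration, not the project's
actual alphabet or implementation.

```python
# Hypothetical character classes approximating what an image-based
# shape coder might see for a typical Latin typeface.
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gpqy")
MULTI_COMPONENT = set("ij")  # dotted letters have two components


def shape_code(ch):
    """Map one character to an illustrative shape code."""
    if ch in MULTI_COMPONENT:
        return "M"  # multi-component symbol
    if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
        return "A"  # rises above the x-line
    if ch in DESCENDERS:
        return "D"  # drops below the baseline
    if ch.isalpha():
        return "x"  # confined to the x-line
    return "P"      # everything else is treated as punctuation


def signature(line):
    """Build a signature string for a text line, ignoring whitespace."""
    return "".join(shape_code(ch) for ch in line if not ch.isspace())
```

Under these assumptions, calling signature("The dog, i.") yields the
string AAxAxDPMP. Whitespace is dropped here only to keep the sketch
short; whether word boundaries are encoded is not stated in this
overview.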
The
string of shape codes is then a signature for the document, and
is used to index into a large table/database of previously processed
documents. A second level of robustness is added by indexing based
on n-grams of the signature, rather than attempting to use a line
index based on the entire string. Each of the shape coded n-grams
is extracted from a sliding window of size w across the signature
and each serves as an index key into the database. A single dropped
or inserted code will affect only the small number of keys whose
windows contain it, not the entire signature. When a set of keys is presented for
indexing, each key results in a set of hits from the database.
Each hit is counted as a vote for the resulting document, and
a ranked set is returned.
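The n-gram indexing and voting scheme described above can be sketched
as follows. This is a toy in-memory version under stated assumptions:
the class name SignatureIndex, the default window size, and the use of
a Python dictionary in place of a real database are all hypothetical
and chosen only to make the mechanism concrete.

```python
from collections import Counter, defaultdict


def ngrams(signature, w):
    """Return every length-w substring from a sliding window."""
    return [signature[i:i + w] for i in range(len(signature) - w + 1)]


class SignatureIndex:
    """Toy inverted index from shape-code n-grams to document ids."""

    def __init__(self, w=5):
        self.w = w                      # sliding window size (assumed)
        self.table = defaultdict(set)   # n-gram key -> set of doc ids

    def add(self, doc_id, signature):
        """Register every n-gram of the signature as a key for doc_id."""
        for key in ngrams(signature, self.w):
            self.table[key].add(doc_id)

    def query(self, signature):
        """Each key hit casts one vote; return documents ranked by votes."""
        votes = Counter()
        for key in ngrams(signature, self.w):
            for doc_id in self.table.get(key, ()):
                votes[doc_id] += 1
        return votes.most_common()
```

As a usage example, index two signatures with w=3 and query with a
copy of the first that has one code dropped. Only the windows that
overlapped the missing code fail to match, so the original document
still collects most of the votes and is ranked first.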
The
system is able to deal with differences between scanned documents
such as resolution, skew and image quality. The approach has a
number of advantages over OCR or other recognition-based methods,
including speed and robustness to imaging distortions. The fact
that we use an indexing mechanism allows the system to be scaled
to operate on millions of documents. The system we have implemented
is currently being integrated by a number of government agencies
for use in document database management.
A
number of agencies have shown great interest in this project and
its underlying application to the problem of duplicate detection
for declassification. The current system is a prototype which
provides the basic functionality including document management,
image processing, database indexing and ranked retrieval. There
are a number of desirable research avenues which are being explored
as a result of this project. They include the ability to perform
shape coding on degraded documents, the ability to identify structured
noise, and general methods for document image indexing.
We
plan to extend this work both with respect to technical performance
and to new applications such as general document image
database management.