Duplicate Document Image Detection

Overview

Document imaging technology has developed to the point where it is not uncommon for organizations to scan large numbers of documents into databases with little or no index information. This may be done for archival purposes with the index as simple as a case number, or with the ultimate goal of automatically extracting index information for content-based queries. Maintaining the integrity of such a database is difficult, especially in a distributed environment where copies of the same documents may be scanned at different times.

We are currently addressing the case of image-variant documents, where multiple instances of an effectively identical original source are scanned for incorporation into a database. The original documents may have been written on, stapled, torn, or taped, or may have pages missing or a cover added. The document may have been copied repeatedly, so copies of different generations are involved. It may also have been scanned at different times and on different devices, so resolution, illumination, and contrast vary as well. Skew and translation may introduce further distortion.

Our approach converts a representative line of text in a document image into a signature using a shape coding technique. Shape coding attempts to label each symbol in the line according to very simple shape properties, such as whether it is an ascender, a descender, limited to the x-line, multi-component, or punctuation.
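As a rough illustration of shape coding, the sketch below labels the characters of a text string rather than symbols segmented from an image. This is a simplification for readability: the actual system works on scanned images, and the code alphabet ("A", "D", "x", "m", "p"), the character sets, the decision to drop spaces, and the function names are all assumptions made for this example, not the system's own.

```python
# Hypothetical shape classes, approximated from typical Latin letterforms.
ASCENDERS = set("bdfhklt")        # rise above the x-line
DESCENDERS = set("gpqy")          # fall below the baseline
MULTI_COMPONENT = set("ij")       # more than one connected component
PUNCTUATION = set(".,;:!?'\"-()")


def shape_code(ch):
    """Return a one-letter shape label for a character, or '' to skip it."""
    if ch in MULTI_COMPONENT:
        return "m"
    if ch in PUNCTUATION:
        return "p"
    if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
        return "A"
    if ch in DESCENDERS:
        return "D"
    if ch.isalpha():
        return "x"                # confined to the x-line
    return ""                     # whitespace and unknown symbols are dropped


def signature(line):
    """Concatenate the shape labels of every symbol in a line of text."""
    return "".join(shape_code(ch) for ch in line)


print(signature("dog tip"))       # AxDAmD
```

Because many different characters share a label, the signature is much coarser than recognized text, which is what lets it tolerate degraded symbols.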

The string of shape codes is then a signature for the document, and is used to index into a large table/database of previously processed documents. A second level of robustness is added by indexing on n-grams of the signature, rather than attempting to use a line index based on the entire string. Each shape-coded n-gram is extracted from a sliding window of size w moved across the signature, and each serves as an index key into the database. A single dropped or inserted code therefore affects only the few keys whose windows overlap it, not the entire signature. When a set of keys is presented for indexing, each key returns a set of hits from the database. Each hit is counted as a vote for the corresponding document, and a ranked set of documents is returned.
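The indexing and voting scheme described above can be sketched with an in-memory table. This is a minimal illustration only: the class name SignatureIndex, the default window size, the use of a Python dictionary in place of a database, and the sample signatures are all assumptions for the example.

```python
from collections import Counter, defaultdict


def ngrams(signature, w):
    """All substrings of length w taken from a window slid across the signature."""
    return [signature[i:i + w] for i in range(len(signature) - w + 1)]


class SignatureIndex:
    """Hypothetical in-memory stand-in for the n-gram keyed document table."""

    def __init__(self, w=5):
        self.w = w
        self.table = defaultdict(set)     # n-gram key -> ids of documents containing it

    def add(self, doc_id, signature):
        for key in ngrams(signature, self.w):
            self.table[key].add(doc_id)

    def query(self, signature):
        """Count one vote per matching key and return documents ranked by votes."""
        votes = Counter()
        for key in ngrams(signature, self.w):
            for doc_id in self.table.get(key, ()):
                votes[doc_id] += 1
        return votes.most_common()


index = SignatureIndex(w=3)
index.add("a", "AxDAmDxxA")
index.add("b", "xxxxAAxxx")

# Query with a copy of document "a" in which one code ("D") was dropped.
print(index.query("AxDAmxxA"))            # [('a', 4), ('b', 1)]
```

In this run the dropped code breaks only the keys whose window covered it, so document "a" still collects four of the six possible votes and ranks first.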

The system is able to deal with differences between scanned documents such as resolution, skew and image quality. The approach has a number of advantages over OCR or other recognition-based methods, including speed and robustness to imaging distortions. The fact that we use an indexing mechanism allows the system to be scaled to operate on millions of documents. The system we have implemented is currently being integrated by a number of government agencies for use in document database management.

A number of agencies have shown great interest in this project and its underlying application to the problem of duplicate detection for declassification. The current system is a prototype which provides the basic functionality, including document management, image processing, database indexing and ranked retrieval. Several research avenues are being explored as a result of this project. They include the ability to perform shape coding on degraded documents, the ability to identify structured noise, and general methods for document image indexing.

We plan to extend this work both with respect to technical performance and to new applications such as general document image database management.







© Copyright 2001, Language and Media Processing Laboratory, University of Maryland, All rights reserved.