University of Maryland, College Park, Maryland USA
and
Centre for Strategic Infocomm Technologies, Singapore
CLEAT: A CLassification, Enhancement and Analysis Toolkit for Heterogeneous Document Image Collections
Overview
The analysis of large heterogeneous collections of document images touches almost every aspect of document image processing. Written language takes many forms that differ in presentation and content, yet a trained reader interprets them with little effort. The goal of document analysis is to make an informed interpretation of the intended message of that visual language.
In this project, we will develop specific modules of interest to the sponsors for Triage, Enhancement, Segmentation, and Content Labeling. The work will be carried out by researchers in the Laboratory for Language and Media Processing (LAMP) at the University of Maryland and integrated with an existing infrastructure for document image analysis. The proposal discusses the problems in depth and lays out a roadmap for addressing them. We anticipate that further conversations with the sponsors will focus the directions outlined here.
We assume that, at the lowest level, we are given an image that may contain useful document-related content. Our goal is first to determine whether the image contains document content, and then to enhance and process it until it carries sufficient layout metadata to support downstream content processing such as optical character recognition. To support focused research, we will develop the necessary tools, gather ground truth, visualize results, and provide efficient implementations of the algorithms we develop.
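The processing order described above (triage first, then enhancement, then layout analysis) can be illustrated with a minimal Python sketch. This is not the CLEAT API: the names `PageResult` and `run_pipeline`, and the three stage functions passed in, are hypothetical placeholders chosen only to show how the stages might be chained.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class PageResult:
    """Hypothetical record of what the pipeline learned about one image."""
    is_document: bool = False
    enhanced: bool = False
    zones: List[str] = field(default_factory=list)


def run_pipeline(
    image: Any,
    is_document: Callable[[Any], bool],
    enhance: Callable[[Any], Any],
    segment: Callable[[Any], List[str]],
) -> PageResult:
    """Chain the stages in the order given in the overview.

    All three stage functions are supplied by the caller; this sketch
    assumes nothing about how they are implemented.
    """
    result = PageResult()

    # Triage: stop early if the image holds no document content.
    if not is_document(image):
        return result
    result.is_document = True

    # Enhancement: clean the image before any layout analysis.
    cleaned = enhance(image)
    result.enhanced = True

    # Segmentation and labeling: produce layout metadata (zone labels)
    # that downstream processing such as OCR could consume.
    result.zones = segment(cleaned)
    return result


if __name__ == "__main__":
    # Stand-in stages operating on a toy "image" (a plain dict).
    page = {"ink_ratio": 0.12, "noisy": True}
    outcome = run_pipeline(
        page,
        is_document=lambda img: img["ink_ratio"] > 0.05,
        enhance=lambda img: {**img, "noisy": False},
        segment=lambda img: ["machine_print", "signature"],
    )
    print(outcome)
```

Passing the stages in as functions keeps the ordering logic separate from any particular algorithm, so each module can be swapped or evaluated independently.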
System Flow Diagram (PDF)
Proposal
Summary of Milestones
Phase 1:
- Deliver completed CLEAT data collection - DONE
- Provide ground truth for subset of data including signatures, stamps, logos, handwritten text, and machine-printed text. - DONE
- Provide document describing evaluation framework. - DONE
Phase 2:
- Deliver completed ground truthing and visualization tool for CLEAT metadata. - DONE
- Deliver Prototype version of CLEAT Software API Modules:
  - Document Image Enhancement - SEE UMDAPI
  - Document Text/Image Text/Non-Text Discrimination
  - Page Layout Similarity Ranking - DONE (included in DocLib)
  - Page Layer Segmentation and Zone Labeling - SEE UMDAPI
  - Content Labeling of Signatures, Annotations, Stamps and Logos - DONE (included in DocLib)
- Provide results of CLEAT API run on CLEAT datasets.
- Provide preliminary evaluation report.
- Provide basic API documentation.
Phase 3:
- Deliver Final version of CLEAT API.
- Provide training on use of CLEAT.
- Provide complete evaluation results on CLEAT data.
- Provide complete documentation of API.
- Provide feasibility report for system extensions.
Meetings
- Maryland (May 9-11, 2007)
Agenda
- Singapore (October 16-17, 2007)
Presentations
Reports
Software (by Task)
- UMD API
- DocLib
- GEDI
- By Task
  - Enhancement
  - Image Classification
  - Layer Separation
  - Layout Similarity
  - Content Labeling of Signatures, Annotations, Stamps and Logos
- Complete New Distribution
- Supplemental Software (02/18/2008)
- New Version of DocID
Data
- Database Overview (PDF)
- Dataset was delivered by FedEx on DVD
- Ground truth for the 50,000-page document collection to be available in October 2007