University of Maryland, College Park, Maryland, USA

and

Centre for Strategic Infocomm Technologies, Singapore

CLEAT: A CLassification, Enhancement and Analysis Toolkit for Heterogeneous Document Image Collections

Overview

The challenges in analyzing large heterogeneous collections of document images ultimately encompass almost every aspect of document image processing. Written language takes many forms that differ in presentation and content, yet a trained individual can interpret this visual language with little effort. The goal of document analysis is ultimately to make an informed interpretation of the message that the visual language is intended to convey.

In this project, we will develop specific modules of interest to the sponsors related to Triage, Enhancement, Segmentation, and Content Labeling. The work will be carried out by researchers in the Laboratory for Language and Media Processing (LAMP) at the University of Maryland and integrated with an existing infrastructure for document image analysis. The proposal contains an in-depth discussion of the problems and lays out a roadmap for addressing them. We anticipate that further conversations with the sponsors will focus the directions outlined here.

We assume that, at the lowest level, we are given an image that may contain useful document-related content. Our goal is first to determine whether the image contains document content, and then to enhance and process it until sufficient layout metadata is available to support downstream content processing such as optical character recognition. To support focused research, we will develop the necessary tools, gather ground truth, visualize results, and provide efficient implementations of the algorithms we develop.

Summary of Milestones

Phase 1:

Deliver completed CLEAT data collection - DONE

Provide ground truth for a subset of the data, including signatures, stamps, logos, handwritten text, and machine-printed text. - DONE

Provide document describing evaluation framework. - DONE

Phase 2:

Deliver completed ground truthing and visualization tool for CLEAT metadata. - DONE

Deliver Prototype version of CLEAT Software API Modules:

Document Image Enhancement - SEE UMDAPI
Document Text/Image Text/Non-Text Discrimination
Page Layout Similarity Ranking - DONE (included in DocLib)
Page Layer Segmentation and Zone Labeling - SEE UMDAPI
Content Labeling of Signatures, Annotations, Stamps and Logos - DONE (included in DocLib)

Provide results of CLEAT API run on CLEAT datasets.

Provide preliminary evaluation report.

Provide basic API documentation.

Phase 3:

Deliver Final version of CLEAT API.

Provide training on use of CLEAT.

Provide complete evaluation results on CLEAT data.

Provide complete documentation of API.

Provide feasibility report for system extensions.

Software (by Task)

UMD API

DocLib

Distribution

GEDI

Distribution

By Task

Enhancement
- Included in UMDAPI
Image Classification
- MATLAB and DocLib Components
- Dataset for Testing
- Local Binary Pattern (LBP Version - faster)
Layer Separation
- Included in UMDAPI
Layout Similarity
- Included in DocLib
Content Labeling of Signatures, Annotations, Stamps and Logos
- Included in DocLib
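
The Image Classification task above lists a Local Binary Pattern (LBP) component. As a rough illustration of what such a texture feature computes, the following is a minimal sketch of the basic 3x3 LBP operator in plain Python. It is not the CLEAT or DocLib implementation; the function names `lbp_code` and `lbp_histogram`, the neighbour ordering, and the list-of-lists grayscale image format are assumptions made for this example.

```python
from typing import List

Image = List[List[int]]  # assumed format: rows of grayscale pixel values

# Offsets of the 8 neighbours, clockwise starting at the top-left pixel.
# The ordering is an arbitrary choice for this sketch.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(image: Image, row: int, col: int) -> int:
    """Return the 8-bit LBP code of an interior pixel.

    Each neighbour that is at least as bright as the centre pixel
    sets one bit of the code.
    """
    center = image[row][col]
    code = 0
    for bit, (dr, dc) in enumerate(NEIGHBOURS):
        if image[row + dr][col + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(image: Image) -> List[float]:
    """Return the normalized 256-bin histogram of LBP codes.

    Border pixels are skipped because they lack a full 3x3 neighbourhood.
    """
    hist = [0.0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            hist[lbp_code(image, r, c)] += 1.0
    total = sum(hist)
    if total == 0:
        return hist  # image too small to contain any interior pixel
    return [h / total for h in hist]
```

A histogram of this kind could serve as a fixed-length feature vector for a classifier that separates document images from non-document images, although the features and classifier actually used in CLEAT are not described on this page.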

Complete New Distribution

Download Here (12/13/07)
This distribution replaces the old DocLib and adds drivers for logo and signature detection.

Supplemental Software (02/18/2008)

Download Here

New Version of DocID

Download Here (New March 7th)
