LAMP - Language Group

LAMP Seminar
Language and Media Processing Laboratory
Conference Room 4406
A.V. Williams Building
University of Maryland

Thursday, May 29th, 11 am
GEOCR: Good Enough OCR

Larry Spitz
Document Recognition Technologies, Inc.
Palo Alto, CA
L.Spitz@ieee.org

ABSTRACT

Traditional OCR applies lexical post-processing to help resolve errors produced by the upstream character recognition processes. We change that model by incorporating lexical information very early in recognition. The result is an OCR system whose principal attributes are high speed of operation and tunability to the lexical content of the documents to which it is applied. GEOCR relies on the transformation of the text image into character shape codes, a rapid and robust process, and on special lexica, indexed by the "shapes" of words, that record the character ambiguities present within particular word shape classifications. We rely on the structure of language and on the high percentage of singleton mappings between the shape codes and the characters in the words. Considerable ambiguity is removed by simple lookup in the specially tuned and structured lexicon and substitution on a character-by-character basis. Ambiguity is further reduced by template matching using exemplars derived from surrounding text, taking advantage of the local consistency of font, face and size as well as image quality.
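
The shape-indexed lookup described in the abstract can be illustrated with a small sketch. The abstract does not give the GEOCR shape code alphabet or lexicon layout, so everything below is an assumption made for illustration: the five shape classes (A, x, g, i, j), the function names shape_code, word_shape, build_lexicon and resolve, and the toy word list are all hypothetical. The sketch starts from text rather than from an image, and it omits the template-matching step entirely.

```python
from collections import defaultdict

# Hypothetical, simplified shape classes (NOT the actual GEOCR alphabet):
#   A = tall character (ascender, capital, or digit)
#   g = character with a descender
#   i = dotted x-height character, j = dotted character with a descender
#   x = plain x-height character
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gpqy")


def shape_code(ch):
    """Map one character to its (assumed) shape class."""
    if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
        return "A"
    if ch in DESCENDERS:
        return "g"
    if ch == "i":
        return "i"
    if ch == "j":
        return "j"
    return "x"


def word_shape(word):
    """Concatenate the shape codes of every character in a word."""
    return "".join(shape_code(c) for c in word)


def build_lexicon(words):
    """Index a word list by word shape: shape -> set of words with that shape."""
    lexicon = defaultdict(set)
    for w in words:
        lexicon[word_shape(w)].add(w)
    return lexicon


def resolve(shape, lexicon):
    """Substitute characters position by position.

    A position where every candidate word agrees gets that character;
    a position where candidates differ is left as '?'. Returns None
    when the shape is not in the lexicon.
    """
    candidates = lexicon.get(shape)
    if not candidates:
        return None
    out = []
    for pos in range(len(shape)):
        chars = {w[pos] for w in candidates}
        out.append(chars.pop() if len(chars) == 1 else "?")
    return "".join(out)


# Toy lexicon, chosen only to show both outcomes.
lexicon = build_lexicon(["the", "cat", "eat", "dog"])
print(resolve("AAx", lexicon))  # singleton mapping: only "the" has this shape
print(resolve("xxA", lexicon))  # "cat" and "eat" share a shape; first letter stays open
```

In this toy lexicon the shape AAx maps to a single word, so lookup alone recovers every character. The shape xxA is shared by "cat" and "eat", so the last two characters are fixed and only the first remains ambiguous; in the approach described above, that remaining position is the kind of case left to template matching against exemplars from the surrounding text.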