LAMP - Language Group

LAMP Seminar
Language and Media Processing Laboratory
Conference Room 4406
A.V. Williams Building
University of Maryland

Tuesday, March 23, 1998
1:00 PM

Script and Language Identification for
Optical Character Recognition

Doe-Wan Kim

ABSTRACT

Most Optical Character Recognition (OCR) systems assume that the script and the language of the document being processed are known. The identity of the script is used to select feature extraction routines and classifiers; the identity of the language is used to select lexicons, character and word bigram probabilities, etc.

When the script and language of a document are not known a priori, or when a single document page can contain multiple languages, these assumptions do not hold. It is therefore important to detect the script and the language of a document before performing OCR. In this talk I will summarize two papers on Asian script and language identification:

1. A. Lawrence Spitz, "Script and Language Determination from Document Images", Proceedings of Symposium on Document Image Understanding Technology, April 1997

2. Judith Hochberg, Michael Cannon, Patrick Kelly, and James White, "Page Segmentation Using Script Identification Vectors: A First Look", Proceedings of Symposium on Document Image Understanding Technology, April 1997