Most Optical Character Recognition (OCR) systems assume that the script
and the language of the document being processed are known. The identity
of the script is used to select feature extraction routines and
classifiers; the identity of the language is used to select lexicons,
character and word bi-gram probabilities, etc.
In situations where the script and language of a document are not
known a priori, or where a document page can contain multiple
languages, these assumptions do not hold. Thus it is important to
detect the script and the language of a document prior to performing
OCR. In this talk I will summarize two papers on Asian script and
language identification:
1. A. Lawrence Spitz, "Script and Language Determination from
   Document Images", Proceedings of Symposium on Document Image
   Understanding Technology, April 1997.
2. Judith Hochberg, Michael Cannon, Patrick Kelly, and James White,
   "Page Segmentation Using Script Identification Vectors: A
   First Look", Proceedings of Symposium on Document Image
   Understanding Technology, April 1997.