LAMP - Language Group

LAMP Seminar
Language and Media Processing Laboratory
Conference Room 2120
A.V. Williams Building
University of Maryland

October 3, 2000, 1:00 PM
Daniel Lopresti

Bell Labs, Lucent Technologies Inc.

Searching Noisy Data Using Noisy Queries

ABSTRACT

As technologies for speech, handwriting, and printed character recognition become more prevalent and less obtrusive, one can imagine situations where they will be applied entirely in the background, without imposing on the user, for the purposes of indexing and retrieval. This scenario, however, raises the issue of coping with undetected, uncorrected recognition errors. Consider the problem of querying by voice a database that was created from faxed documents. To accomplish this task, we must contend with errors from the speech recognition (ASR) process, a completely different class of errors from the OCR process, and the issue of judging the similarity between spoken and printed keywords. In this talk, I'll describe a new formalism, known as cross-domain approximate string matching, for resolving these disparate constraints. We have formulated this as an optimization problem and developed a polynomial-time algorithm for its solution, along with several variations. I'll conclude by presenting the results of a recent experiment showing how cross-domain string matching can improve the effectiveness of retrieval when searching a database of scanned, OCR'ed documents using handwritten queries. (This is joint work with Gordon Wilfong, also of Bell Labs.)
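
For readers unfamiliar with approximate string matching, the following is a minimal sketch of the classical single-domain version: a weighted edit distance computed by dynamic programming. It is not the cross-domain algorithm presented in the talk, which the abstract does not specify. The function names (edit_distance, rank_keywords), the sub_cost parameter, and the idea of making common recognition confusions cheaper are all assumptions introduced here for illustration only.

```python
def edit_distance(query, target, sub_cost=None, indel_cost=1.0):
    """Classical weighted edit distance between two strings.

    sub_cost(a, b) is a hypothetical hook returning the cost of
    substituting character a with character b. It defaults to unit cost.
    """
    if sub_cost is None:
        sub_cost = lambda a, b: 0.0 if a == b else 1.0

    m, n = len(query), len(target)
    # prev[j] holds the cost of aligning the first i-1 query characters
    # with the first j target characters.
    prev = [j * indel_cost for j in range(n + 1)]
    for i in range(1, m + 1):
        curr = [i * indel_cost] + [0.0] * n
        for j in range(1, n + 1):
            curr[j] = min(
                prev[j] + indel_cost,       # delete a query character
                curr[j - 1] + indel_cost,   # insert a target character
                prev[j - 1] + sub_cost(query[i - 1], target[j - 1]),
            )
        prev = curr
    return prev[n]


def rank_keywords(query, keywords, sub_cost=None):
    """Return keywords ordered from most to least similar to the query."""
    return sorted(keywords, key=lambda k: edit_distance(query, k, sub_cost))
```

The running time is proportional to the product of the two string lengths. A cost function that charges less for a plausible recognition confusion (for example, the letter "l" read as the digit "1") makes a noisy string such as "be11" rank closest to "bell" among similar keywords. Handling two different error sources at once, as in the voice-query-over-OCR scenario above, is the harder problem the talk addresses.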