LAMP - Language Group

LAMP Seminar
Language and Media Processing Laboratory
Conference Room 4406
A.V. Williams Building
University of Maryland

AUGUST 30, 2000, 1:00 PM
Yuen-Hsien Tseng

University of Maryland, College Park.

Chinese OCR Text Retrieval: Strategies and Their Performance Evaluation

ABSTRACT

The advent of the World Wide Web has made dissemination of digital information easier than ever. Such accessibility has inspired many information providers to digitize their data for networked information services. Although future information is likely to be available in fully digital form, the digitization, indexing, and searching of retrospective paper materials are not easy tasks.

Motivated by a digitization project for a large Chinese news clipping collection, this talk will present our efforts to provide information access to such a collection. In particular, we will focus on information retrieval using OCR text converted from the news clipping images. Based on a test collection of over 8000 OCR documents, different indexing methods, retrieval models, and strategies are examined and their performance is compared. This OCR test collection also contains 30 English queries translated by domain experts from the corresponding Chinese ones, so experiments in cross-language OCR text retrieval are possible; some results from these experiments will also be discussed in this talk.

About the Presenter:
Dr. Yuen-Hsien Tseng is from the Department of Library and Information Science at Fu Jen Catholic University in Taiwan. He received his Ph.D. from the Department of Computer Science and Information Engineering, National Taiwan University. His recent research interests include information retrieval for retrospective data and content-based music retrieval. Dr. Tseng has received several Research Awards from the National Science Council, the Academic Research Award from Fu Jen Catholic University in 1998, and the Second Prize Award for "Automatic Cataloguing and Searching" from the Fritz Kutter-Fonds, Zurich, in 1999.