The advent of the World Wide Web has made dissemination of digital information
easier than ever. Such accessibility has inspired many information
providers to digitize their data for networked information services.
Although future information is likely to be available in fully digital
form, digitizing, indexing, and searching retrospective paper
materials are not easy tasks.
Motivated by a digitization project of a large Chinese news clipping collection,
this talk will present our efforts in providing information access
to such a collection. In particular, we will focus on information
retrieval through the use of OCR text converted from the news
clipping images. Based on a test collection of over 8000 OCR documents,
different indexing methods, retrieval models, and strategies are
examined and their performance is compared. This OCR test collection
also contains 30 English queries translated by domain experts
from the corresponding Chinese ones, which makes experiments in cross-language
OCR text retrieval possible; some results from these experiments will
also be discussed in this talk.
About the Presenter:
Dr. Yuen-Hsien Tseng is from the Department of Library and Information
Science at Fu Jen Catholic University in Taiwan. He received his
Ph.D. from the Department of Computer Science and Information
Engineering, National Taiwan University. His recent research interests
include information retrieval for retrospective data and content-based
music retrieval. Dr. Tseng has received several Research
Awards from the National Science Council, the Academic Research Award
from Fu Jen Catholic University in 1998, and the Second Prize
Award for "Automatic Cataloguing and Searching" from
the Fritz Kutter-Fonds, Zurich, in 1999.