UMIACS Computational Linguistics Colloquium, February 26, 1998

WORD SENSE DISAMBIGUATION FOR LARGE TEXT DATABASES


Robert Krovetz


NEC Research Institute


February 26, 1998, 4pm, AVW Room 4406


Most retrieval systems represent documents and queries by the words they contain, and rank documents based on the words in common with the query. Because words are ambiguous, this can cause documents to be retrieved that are not relevant. In addition, a document can be relevant even if it does not mention the exact words used in the query. A user is generally not interested in the words, but in the concepts that those words represent. We report on an analysis of lexical ambiguity in information retrieval test collections, and on experiments to determine the utility of word meanings for separating relevant from non-relevant documents.

Our research has examined different sources of evidence for distinguishing meanings. These sources can serve as a mechanism for splitting meanings apart, as well as for bringing them together. For example, morphology separates author/authorize and universe/university; it brings together burglar/burglarize and sincere/sincerity. Similar distinctions hold for phrases and part of speech. Any effort to deal with word meanings in information retrieval must take these distinctions into account.

We will present the results of experiments with these sources of evidence, and our proposals for future work.


For the colloquium series schedule, see the UMD Computational Linguistics Colloquium Series web page at http://umiacs.umd.edu/~resnik/cl_colloquium/. If you are interested in meeting with the speaker, please contact Mari Broman Olsen (molsen@umiacs.umd.edu) or Philip Resnik (resnik@umiacs.umd.edu).