UMIACS Computational Linguistics Colloquium Series, December 4, 1997


Automatically Learning Natural Language from On-Line Text Corpora


Eric Brill
The Johns Hopkins University

Robust automated natural language processing has the potential to revolutionize the way we interact with machines and process information. To date, the major bottleneck inhibiting the creation of robust and accurate natural language processing systems has been linguistic knowledge acquisition: how a machine can obtain the linguistic sophistication necessary to process language accurately. We have been pursuing research that will help break this knowledge acquisition bottleneck. As an alternative to manually providing a machine with linguistic knowledge, we have been exploring machine learning techniques for automatically acquiring this knowledge from on-line text corpora. In particular, we have developed a learning method called transformation-based learning. This technique has been applied to many core natural language processing tasks, such as part-of-speech tagging, document segmentation, spelling correction, discourse labeling, and parsing, and has been incorporated into systems for machine translation, information retrieval, and message understanding. In addition to state-of-the-art accuracy, this approach offers very fast processing speed and seamless integration of human-derived and machine-learned knowledge. In the first part of this talk, we will introduce transformation-based learning and present some recent results using this method for natural language acquisition.
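
To make the idea concrete, the sketch below is a stripped-down transformation-based learner for part-of-speech tagging: start from a baseline tagging, then greedily learn error-correcting rules. The toy corpus, the single rule template ("change tag A to B when the preceding tag is C"), and the stopping condition are illustrative assumptions for exposition only; the actual system uses a much richer set of templates and far larger corpora.

    from collections import Counter, defaultdict

    # Toy training corpus of (word, gold_tag) sentences -- invented data.
    corpus = [
        [("the", "DET"), ("can", "NOUN"), ("rusted", "VERB")],
        [("the", "DET"), ("dog", "NOUN"), ("can", "VERB"), ("run", "VERB")],
        [("a", "DET"), ("can", "NOUN"), ("fell", "VERB")],
    ]

    # Baseline annotator: tag each word with its most frequent tag.
    freq = defaultdict(Counter)
    for sent in corpus:
        for word, tag in sent:
            freq[word][tag] += 1
    baseline = {w: c.most_common(1)[0][0] for w, c in freq.items()}

    def apply_rules(rules):
        """Tag the corpus with the baseline, then apply each rule in order."""
        tagged = [[baseline[w] for w, _ in sent] for sent in corpus]
        for a, b, prev in rules:
            for tags in tagged:
                for i in range(1, len(tags)):
                    # Template: change tag a to b when the previous tag is prev.
                    if tags[i] == a and tags[i - 1] == prev:
                        tags[i] = b
        return tagged

    def error_count(tagged):
        return sum(t != gold
                   for sent, tags in zip(corpus, tagged)
                   for (_, gold), t in zip(sent, tags))

    # Greedy learning loop: try every instantiation of the template, keep the
    # rule with the largest net error reduction, and stop when nothing helps.
    tagset = sorted({t for counts in freq.values() for t in counts})
    rules = []
    while True:
        best_rule, best_err = None, error_count(apply_rules(rules))
        for a in tagset:
            for b in tagset:
                for prev in tagset:
                    if a != b:
                        err = error_count(apply_rules(rules + [(a, b, prev)]))
                        if err < best_err:
                            best_rule, best_err = (a, b, prev), err
        if best_rule is None:
            break
        rules.append(best_rule)

    print("learned rules:", rules)
    print("remaining errors:", error_count(apply_rules(rules)))

On this toy corpus the learner corrects the baseline's mistagging of "can" by learning a single contextual rule; the ordered rule list it produces is also directly human-readable, which is one source of the approach's easy integration of hand-written and machine-learned knowledge.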

The second part of the talk will focus on exploiting the complementary strengths of different learning methods to further improve our ability to automatically acquire linguistic knowledge. Over the last decade, a large number of diverse machine learning techniques have been applied to the problem of acquiring linguistic knowledge. While these techniques differ significantly in attributes such as training and run time, size of the learned knowledge, and ease of implementation, they differ surprisingly little in their accuracy. We will examine this issue in detail, taking lexical disambiguation as a case study. We will demonstrate how, despite nearly indistinguishable error rates, the complementary generalization strengths of the different learning algorithms can be exploited to derive a new lexical disambiguator that achieves significantly better results than any individual learner.
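
The following minimal sketch illustrates why complementary errors matter, using three hypothetical disambiguators with identical individual accuracy combined by simple majority voting. The predictions and labels are invented for illustration, and majority voting merely stands in for whatever combination method the talk presents.

    from collections import Counter

    # Invented gold senses for eight occurrences of an ambiguous word.
    gold = ["river", "money", "money", "river", "money", "river", "money", "river"]

    # Three hypothetical learners, each 6/8 correct but wrong on different examples.
    preds = {
        "decision_list": ["river", "money", "river", "river", "money", "river", "river", "river"],
        "naive_bayes":   ["money", "money", "money", "river", "money", "money", "money", "river"],
        "tbl":           ["river", "river", "money", "money", "money", "river", "money", "river"],
    }

    def accuracy(p):
        return sum(a == b for a, b in zip(p, gold)) / len(gold)

    for name, p in preds.items():
        print(f"{name}: {accuracy(p):.2f}")          # each learner: 0.75

    # Majority vote: each example gets the label most learners agree on.
    vote = [Counter(col).most_common(1)[0][0] for col in zip(*preds.values())]
    print(f"majority vote: {accuracy(vote):.2f}")    # combined: 1.00

Because no two learners err on the same example, the vote recovers the correct sense every time; the gain from combination is bounded by how decorrelated the individual learners' errors are, not by their (nearly identical) individual accuracies.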


Bio: Eric Brill is an Assistant Professor of Computer Science and a member of the Center for Language and Speech Processing at Johns Hopkins University. He received his B.S. in Mathematics from the University of Chicago in 1987, his M.S. in Computer Science from the University of Texas at Austin in 1989, and his Ph.D. in Computer Science from the University of Pennsylvania in 1993. After graduating, he spent a year as a Research Scientist in the Spoken Language Systems Group at MIT. His research interests include Natural Language and Speech Processing, Machine Learning, Artificial Intelligence, Shakespearean Authorship, and the Voynich Manuscript.