UMIACS Computational Linguistics Colloquium Series,
December 4, 1997
Automatically Learning Natural Language from On-Line Text Corpora
Eric Brill
The Johns Hopkins University
Robust automated natural language processing has the potential to
revolutionize the way we interact with machines and process
information. To date, the major bottleneck inhibiting the creation of
robust and accurate natural language processing systems has been the
problem of linguistic knowledge acquisition: how a machine can obtain
the linguistic sophistication necessary for accurate processing of
language. We have been pursuing research which will help break this
knowledge acquisition bottleneck. As an alternative to manually
providing a machine with linguistic knowledge, we have been exploring
machine learning techniques for automatically learning this knowledge
from on-line text corpora. In particular, we have developed a
learning method called transformation-based learning. This technique
has been applied to many core natural language processing tasks, such
as part-of-speech tagging, document segmentation, spelling correction,
discourse labeling, and parsing, and has been incorporated into
systems for machine translation, information retrieval, and message
understanding. In addition to state-of-the-art accuracy, this
approach offers the advantages of very fast processing
speed and seamless integration of human-derived and machine-learned
knowledge. In the first part of this talk, we will introduce
transformation-based learning and present some recent results using
this method for natural language acquisition.
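As a concrete illustration, the sketch below shows the core loop of
transformation-based learning applied to part-of-speech tagging:
start from a simple baseline annotation, then greedily learn an
ordered list of rewrite rules, each chosen to maximally reduce error
on the training corpus. The toy data, the single rule template
("change tag A to B when the previous tag is C"), and all names are
illustrative assumptions, not the exact implementation described in
the talk.

    # A minimal sketch of transformation-based, error-driven learning
    # for part-of-speech tagging. The single rule template ("change
    # tag A to B when the previous tag is C") and the toy data are
    # illustrative assumptions, not Brill's exact implementation.

    def baseline(sentences, lexicon, default="NN"):
        """Initial annotation: most-frequent tag per word."""
        return [[lexicon.get(w, default) for w, _ in s]
                for s in sentences]

    def net_gain(rule, sentences, tags):
        """Errors fixed minus errors introduced by applying `rule`."""
        frm, to, prev = rule
        gain = 0
        for sent, ts in zip(sentences, tags):
            for i in range(1, len(sent)):
                if ts[i] == frm and ts[i - 1] == prev:
                    gold = sent[i][1]
                    gain += (to == gold) - (frm == gold)
        return gain

    def apply_rule(rule, tags):
        """Rewrite tags left to right wherever the rule matches."""
        frm, to, prev = rule
        for ts in tags:
            for i in range(1, len(ts)):
                if ts[i] == frm and ts[i - 1] == prev:
                    ts[i] = to
        return tags

    def learn(sentences, lexicon, min_gain=1):
        """Greedily learn an ordered list of transformation rules."""
        tags = baseline(sentences, lexicon)
        rules = []
        while True:
            # Propose candidate rules only at sites that are wrong.
            candidates = {(ts[i], s[i][1], ts[i - 1])
                          for s, ts in zip(sentences, tags)
                          for i in range(1, len(s)) if ts[i] != s[i][1]}
            best = max(candidates,
                       key=lambda r: net_gain(r, sentences, tags),
                       default=None)
            if best is None or net_gain(best, sentences, tags) < min_gain:
                break
            rules.append(best)
            tags = apply_rule(best, tags)
        return rules

    # Toy usage: "can" is mistagged as a modal after a determiner.
    corpus = [[("the", "DT"), ("can", "NN"), ("rusted", "VBD")],
              [("you", "PRP"), ("can", "MD"), ("run", "VB")]]
    lexicon = {"the": "DT", "can": "MD", "rusted": "VBD",
               "you": "PRP", "run": "VB"}
    print(learn(corpus, lexicon))  # -> [("MD", "NN", "DT")]

Because the learned knowledge is an ordered list of readable rules, a
human can inspect, edit, or augment it directly, which is the source
of the seamless integration of human-derived and machine-learned
knowledge noted above.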
The second part of the talk will focus on exploiting complementary
strengths of different learning methods to further improve our ability
to automatically acquire linguistic knowledge. Over the last decade, a
large number of diverse machine learning techniques have been applied
to the problem of acquiring linguistic knowledge. While these
techniques differ significantly in attributes such as training and
run time, the size of the learned knowledge, and ease of
implementation, they differ surprisingly little in accuracy. We will
examine this
issue in detail, looking at lexical disambiguation as a case study.
We will demonstrate how, despite the nearly indistinguishable error
rates, the complementary generalization strengths of the different
machine learning algorithms can be exploited to derive a new lexical
disambiguator that achieves significantly better results than those
achieved by any individual learner.
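One simple way to exploit such complementary errors is sketched
below: combine the outputs of several taggers by per-token majority
vote, breaking ties in favor of the single best learner. The
tie-breaking policy and the toy data are assumptions for
illustration; the combination method presented in the talk may
differ.

    # A minimal sketch of exploiting complementary classifiers via
    # per-token majority voting. The tie-breaking policy (defer to
    # the single best learner) and the toy data are assumptions, not
    # the specific combination method presented in the talk.

    from collections import Counter

    def combine(predictions, best=0):
        """Vote over several taggers' outputs, one tag per token.

        predictions: one tag sequence per classifier, all the same
        length. Ties go to the classifier at index `best` (e.g. the
        strongest single learner on held-out data).
        """
        combined = []
        for votes in zip(*predictions):
            top = Counter(votes).most_common()
            if len(top) > 1 and top[0][1] == top[1][1]:
                combined.append(votes[best])  # tie: trust best learner
            else:
                combined.append(top[0][0])
        return combined

    # Toy usage: three taggers disagree on the second token.
    print(combine([["DT", "NN", "VBD"],
                   ["DT", "MD", "VBD"],
                   ["DT", "NN", "VBD"]]))  # -> ['DT', 'NN', 'VBD']

Voting of this kind helps only when the learners' errors are at least
partly uncorrelated, which is exactly the complementarity the case
study examines.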
Bio: Eric Brill is an Assistant Professor of Computer Science and a
member of the Center for Language and Speech Processing at Johns
Hopkins University. He received his B.S. in Mathematics from the
University of Chicago in 1987, his M.S. in Computer Science from the
University of Texas at Austin in 1989, and his Ph.D. in Computer
Science from the University of Pennsylvania in 1993. After
graduating, he was a
Research Scientist in the Spoken Language Systems Group at MIT for one
year. His research interests include Natural Language and Speech
Processing, Machine Learning, Artificial Intelligence, Shakespearean
Authorship, and the Voynich Manuscript.