Smoothing


In this assignment you will get some hands-on experience with smoothing for language modeling. You can either implement a smoothing technique yourself, or you are free to use the SRI Language Modeling Toolkit, also known as srilm.

Turn in your code as a tarball, and turn in a hardcopy of the writeup by e-mail. This assignment is due at 5pm Friday, March 9, 2007.

  1. For reference, download Stanley F. Chen and Joshua T. Goodman. An Empirical Study of Smoothing Techniques for Language Modeling. Technical Report TR-10-98, Computer Science Group, Harvard University, 1998. For present purposes, the relevant parts are Section 1 and Section 2 up through Section 2.8 (page 18), excluding Sections 2.3 and 2.5.

  2. Download the version of the Bible at http://www.umiacs.umd.edu/~resnik/parallel/bible/English/bib.EN. Separate the full set of verses into a training set that includes the first five books (GEN, EXO, LEV, NUM, DEU) and a first test set that includes the rest. Discard everything that's not a Bible verse (a seg element), and lowercase and tokenize the text. (You can use any tokenizer you like. Tokenizing just on common punctuation and whitespace is fine; no need to get fancy. You may already have done this step for a previous assignment.) To keep life simple in terms of the boundary case, add two instances of a special start token (e.g. @START@) to the beginning of both sets. (Doing this for all data sets guarantees that P(w1) and P(w2|w1) are both equal to 1.)

    For the writeup, Item 1: Briefly describe your preprocessing, esp. what your tokenizer does.
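As a rough illustration of the preprocessing in this step, here is a minimal Python sketch. It assumes the verse text has already been extracted from the seg elements into plain strings; the function names `tokenize` and `preprocess` and the particular regular expression are illustrative choices, not a required design.

```python
import re

START = "@START@"  # the special start token suggested above

def tokenize(text):
    """Lowercase the text, then split it into words and single punctuation marks.

    A word is a run of letters/digits, optionally followed by an
    apostrophe suffix (so "don't" stays one token). Any other
    non-whitespace character becomes its own token.
    """
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?|[^\sa-z0-9]", text.lower())

def preprocess(verses):
    """Return one flat token list, with two start tokens at the beginning."""
    tokens = [START, START]
    for verse in verses:
        tokens.extend(tokenize(verse))
    return tokens
```

For example, `preprocess(["In the beginning, God created."])` yields the two start tokens followed by `in`, `the`, `beginning`, `,`, `god`, `created`, `.`. Whatever tokenizer you choose, apply exactly the same one to the training set and to both test sets.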

  3. Download Adventures of Sherlock Holmes as a second test set. Preprocess the same way that you did in the previous step. N.B. You may already have done this step for a previous assignment.

    *See the note at the end regarding alternative training and test sets.

  4. Train a trigram language model on the training set, smoothing using some version of Good-Turing (GT), Katz backoff (Katz), or Kneser-Ney (KN).

    You are free to use srilm for this. If you do your own implementation, you'll want to make sure to work with log probabilities rather than raw probabilities to avoid underflow problems. I would predict that GT is significantly more straightforward than either Katz or KN, since you don't need to worry about combining with or backing off to estimates for lower-order n-grams.

    For the writeup, Item 2: Describe how you did your smoothing. If you used srilm, be explicit about which parameter settings you used. If you did your own implementation, give any "missing details" of what you implemented. (E.g. for Good-Turing, how did you compute S(Nr)? Did you keep hapax legomena, or did you treat n-grams seen only once as if they were never observed? Etc.)
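To make the "missing details" concrete, here is a minimal Python sketch of one way to compute Good-Turing adjusted counts, r* = (r + 1) S(N_{r+1}) / S(N_r). It assumes a plain least-squares fit of log N_r against log r as the smoothing function S, which is only one of several reasonable choices, and it keeps n-grams seen once. The function names and the input format (a dict from n-gram to count) are hypothetical.

```python
import math
from collections import Counter

def count_of_counts(ngram_counts):
    """Map r -> N_r, the number of distinct n-grams seen exactly r times."""
    return Counter(ngram_counts.values())

def fit_log_linear(n_r):
    """Fit log N_r = a + b * log r by least squares and return S as a function of r.

    Assumption: needs at least two distinct values of r, otherwise
    the variance of log r is zero and the slope is undefined.
    """
    xs = [math.log(r) for r in n_r]
    ys = [math.log(n) for n in n_r.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    a = mean_y - b * mean_x
    return lambda r: math.exp(a + b * math.log(r))

def good_turing_adjusted_counts(ngram_counts):
    """Return ({r: r*}, probability mass reserved for unseen n-grams)."""
    n_r = count_of_counts(ngram_counts)
    smoothed = fit_log_linear(n_r)
    total = sum(ngram_counts.values())
    r_star = {r: (r + 1) * smoothed(r + 1) / smoothed(r) for r in n_r}
    p_unseen = n_r[1] / total  # N_1 / N, the usual Good-Turing estimate
    return r_star, p_unseen
```

Because S replaces the raw N_r values, the adjusted counts will generally not sum to exactly the remaining probability mass, so the resulting estimates should be renormalized. How you handle this is the kind of detail worth reporting in Item 2.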

  5. Evaluation.

    For the writeup, Item 3: Which of the two test sets is more likely, according to your trained model? Briefly discuss a few instances (e.g. individual n-grams, phrases, or sentences) where the model does and does not make good predictions.
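For the evaluation, a minimal Python sketch of scoring a test set follows. It assumes a hypothetical callable `log_prob(word, history)` that returns the natural-log trigram probability from your smoothed model, and a token list that begins with the two start tokens. Since the two test sets differ in length, the sketch reports perplexity alongside the total log probability as a per-token measure; that is a suggestion for making the comparison easier, not a requirement of the assignment.

```python
import math

def score(tokens, log_prob):
    """Return (total log probability, perplexity) for a token sequence.

    tokens[0] and tokens[1] are assumed to be start tokens, so
    prediction begins at index 2. log_prob(word, history) is a
    hypothetical interface: history is the pair of preceding tokens,
    and the result is a natural-log probability.
    """
    total = 0.0
    for i in range(2, len(tokens)):
        total += log_prob(tokens[i], (tokens[i - 2], tokens[i - 1]))
    n_predicted = len(tokens) - 2
    perplexity = math.exp(-total / n_predicted)
    return total, perplexity
```

As a sanity check, a model that assigns probability 0.25 to every word should give a perplexity of 4 on any test sequence. Keeping the per-token log probabilities around (rather than only the sum) also makes it easy to find the n-grams where the model predicts well or badly.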


Note regarding alternative training and test sets

The assignment uses the Bible and Sherlock Holmes because you're likely to have worked with these on previous assignments. However, here are two variants on the assignment that are closer to the real world and therefore, perhaps, more interesting. Feel free to do one of these instead. (Make sure you still turn in answers to Items 1 and 2, above, in your hardcopy. See below for the alternative versions of Item 3.)

Either of these alternatives is worth 10% extra credit.

Using language modeling for authorship identification

The Federalist Papers are an important piece of American history, but there have been well-known disputes regarding the authorship of some of them. This on-line version of the Federalist Papers identifies an author for each one following "the consensus of scholars". The Wikipedia list of Federalist papers identifies the twelve whose authorship is disputed.

Using language modeling to rescore machine translation hypotheses

One common current use of language models is to rescore the output of systems that produce hypotheses in the form of English text -- for example, speech recognizers or machine translation systems.