Turn in your code as a tarball, and turn in a hardcopy of the writeup by e-mail. This assignment is due at 5pm Friday, March 9, 2007.
For the writeup, Item 1: Briefly describe your preprocessing, esp. what your tokenizer does.
*See note at the end regarding alternative training and test sets
You are free to use srilm for this. If you do your own implementation, you'll want to make sure to work with log probabilities rather than raw probabilities to avoid underflow problems. I would predict that GT is significantly more straightforward than either Katz or KN, since you don't need to worry about combining with or backing off to estimates for lower order n-grams.
For the writeup, Item 2: Describe how you did your smoothing. If you used srilm, be explicit about which parameter settings you used. If you did your own implementation, give any "missing details" of what you implemented. (E.g. for Good-Turing, how did you compute S(Nr)? Did you keep hapax legomena, or did you treat ngrams seen only once as if they were never observed? Etc.)
For the writeup, Item 3: Which of the two test sets is more likely, according to your trained model? Briefly discuss a few instances (e.g. individual n-grams, phrases, or sentences) where the model does and does not make good predictions.
Either of these alternatives is worth 10% extra credit.
Alternative Item 3 for your writeup: For each of the disputed papers, which trigram model best predicts it? How well do your predictions match the consensus of scholars?
As an example, see this output file, which contains 1000 hypotheses for a sentence in the NIST 2003 Chinese-English MTeval data set, ranked from best to worst. Each line contains three elements separated by "|||": (a) a sentence ID, (b) a tokenized, lowercase English output from an MT system, and (c) a sequence of feature values from the MT system that go into determining the total score for the hypothesis. We'll be learning more about MT system features later this semester, but you might be interested in nothing that the first feature is the log probability that was assigned by the system's language model. You might also be interested in looking at a correct English translation.
Alternative Item 3 for the writeup: report the number of translations in each of those three categories, and discuss an example or two in each category in terms of why you think the translation got worse, better, or stayed the same.