Assignment 4


Note: I've deliberately made this week's required assignment pretty easy, and the extra credit quite high, to provide a strong incentive for you to do more than the minimum and get hands-on with one of the extra credit suggestions.

Required:

  1. Prove that cross entropy H(p,q) = H(p) + D(p||q). (Yes, by this I mean simply using the definitions out of Manning and Schuetze and showing that the left side and the right side are equal to each other.)

    Notice that since D(p||q) must be greater than or equal to 0, this establishes that cross entropy is an upper bound on entropy.
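
    For reference, the definitions you need (this should match the notation in Manning and Schuetze; any logarithm base works as long as you use it consistently) are:

        H(p)     = - sum_x p(x) log p(x)
        H(p, q)  = - sum_x p(x) log q(x)
        D(p||q)  =   sum_x p(x) log ( p(x) / q(x) )

    where the sums run over the outcomes x with p(x) > 0.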

  2. This problem uses the same corpora as last week's assignment. Construct two n-gram language models: M_d from the Democratic speeches and M_r from the Republican speeches. Treating Obama's speech as a test set, compare the perplexity of M_d and M_r. Which one is the better language model?

    Using a unigram model is fine, and last week's homework should already give you most of what you need; a minimal sketch of the unigram approach appears below. If you're interested in being more realistic, a bigram or trigram model would be more interesting and plausible, although of course these data sets are unrealistically small. If you feel this assignment is too easy and you want to do something more challenging and real, you could try the SRI language modeling toolkit (SRILM), KenLM, or another of the language modeling toolkits widely used in NLP (and I imagine NLTK has decent LM support in Python), and/or use different corpora that might give you interesting results.
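
    To make the unigram route concrete, here is a minimal sketch in Python. It assumes the speeches are plain text files split on whitespace (the file names below are placeholders -- substitute whatever you used last week) and uses crude add-one smoothing so unseen test words don't get zero probability; feel free to tokenize and smooth however you did in last week's assignment.

        import math
        from collections import Counter

        def train_unigram(tokens, alpha=1.0):
            """Return an add-alpha smoothed unigram probability function."""
            counts = Counter(tokens)
            total = sum(counts.values())
            vocab = len(counts) + 1  # +1 crudely reserves mass for unseen words
            def prob(word):
                return (counts[word] + alpha) / (total + alpha * vocab)
            return prob

        def perplexity(prob, test_tokens):
            """Perplexity = 2 ** (average negative log2 probability per token)."""
            log2_prob = sum(math.log2(prob(w)) for w in test_tokens)
            return 2 ** (-log2_prob / len(test_tokens))

        def read_tokens(path):
            return open(path, encoding="utf-8").read().lower().split()

        m_d = train_unigram(read_tokens("democratic_speeches.txt"))  # placeholder path
        m_r = train_unigram(read_tokens("republican_speeches.txt"))  # placeholder path
        test = read_tokens("obama_speech.txt")                       # placeholder path
        print("Perplexity under M_d:", perplexity(m_d, test))
        print("Perplexity under M_r:", perplexity(m_r, test))

    The model that assigns the test speech the lower perplexity is the one that fits it better.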


Optional extra credit:

I strongly recommend that, for up to an additional 50% extra credit on this assignment (i.e. a "very high pass", which would be worth 150% of a high pass), you do one of these three mini-projects.