Please turn in a hardcopy for this assignment. I don't need to see your code. (Although if it's really clean, please e-mail it to me, or send me a URL pointing to it, so I can use it as a solution!) Notice that the point values for the problems add up to 85; grades will be normalized to be out of 100%.
As an aside for the mathematically inclined, here is a pointer to several derivations of the definition of entropy based on axioms about properties that a measure of uncertainty should have. See also a concise summary of properties of entropy. The nice presentation/proof I mentioned in class is in A.I. Khinchin, Mathematical Foundations of Information Theory, Dover, New York, 1957.
A note about notation. If X is a random variable whose distribution is (p1,p2,...,pk) then we usually write H(X) for H(p1,p2,...,pk). I will occasionally write H(p) instead of H(X), where p abbreviates p1,p2,...,pk.
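(For concreteness, recall that for a distribution p = (p1,p2,...,pk), H(p) = -sum_i p_i log_2 p_i, measured in bits. Not part of the assignment, but here is a minimal Python sketch of that formula if you want a sanity check; the name entropy is just illustrative.)

    import math

    def entropy(p):
        """H(p) in bits: -sum_i p_i * log2(p_i), treating 0 * log 0 as 0."""
        return -sum(pi * math.log2(pi) for pi in p if pi > 0)

    print(entropy([1/2, 1/4, 1/8, 1/8]))   # 1.75 bits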
     h0    h1    h2    h3    h4    h5    h6    h7
p    1/2   1/4   1/8   1/16  1/64  1/64  1/64  1/64

Recall that an optimal encoding for this distribution (one that uses the fewest bits on average when communicating horse race outcome events) can be created using Huffman coding (see also this interesting discussion by Huffman's nephew):
         h0   h1   h2    h3     h4       h5       h6       h7
codeP    0    10   110   1110   111100   111101   111110   111111
Show, without any numerical computations, that if you use codeP, the expected value E_p[length(h)] (i.e., the expected value of length(h) under distribution p) is equal to H(p), the entropy of the distribution. (Partial credit for doing the two computations explicitly and showing you get the same number.)
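(If you take the partial-credit route and do the two computations explicitly, a short sketch along these lines is one way to check your arithmetic; it is not something you need to hand in.)

    import math

    p     = {'h0': 1/2, 'h1': 1/4, 'h2': 1/8, 'h3': 1/16,
             'h4': 1/64, 'h5': 1/64, 'h6': 1/64, 'h7': 1/64}
    codeP = {'h0': '0', 'h1': '10', 'h2': '110', 'h3': '1110',
             'h4': '111100', 'h5': '111101', 'h6': '111110', 'h7': '111111'}

    expected_length = sum(p[h] * len(codeP[h]) for h in p)     # E_p[length(h)] under codeP
    entropy         = -sum(p[h] * math.log2(p[h]) for h in p)  # H(p)
    print(expected_length, entropy)                            # the two values coincide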
     h0    h1    h2    h3    h4    h5    h6    h7
q    1/8   1/8   1/8   1/8   1/8   1/8   1/8   1/8

Since the horses all have the same probability, the best code we can invent will obviously use the same number of bits for each horse. This means that the smallest number of bits we can use is log_2(8) = 3 bits:
         h0    h1    h2    h3    h4    h5    h6    h7
codeQ    000   001   010   011   100   101   110   111

Now imagine that we communicate lots of horse race outcomes using codeQ. Encoding winning horses using codeQ, what is E_p[length(h)]?
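(Again only as an optional sanity check, not part of what you turn in: the same kind of sketch works here, with the winners still drawn from p but encoded with codeQ.)

    p     = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]      # true distribution over h0..h7
    codeQ = ['000', '001', '010', '011', '100', '101', '110', '111']

    # Expected number of bits per race when winners drawn from p are encoded with codeQ.
    print(sum(pi * len(cw) for pi, cw in zip(p, codeQ)))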
Note: you'll need to do something about zero counts. Smooth using Laplace ("add one", Manning and Schuetze Section 6.2.2) to avoid zero probabilities. You can assume that the total vocabulary is the union of all the unigrams in the three texts.
Hint for less experienced programmers (or more experienced programmers who don't feel like doing it themselves): you can use one of the files created when you ran Stats in the Collocations assignment to get the unigram frequencies.
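(In case it helps once you have unigram counts, whether from your own code or from the Stats output mentioned above, here is a rough sketch of add-one smoothing as described in the note. The names counts_a, counts_b, counts_c and the toy strings are purely hypothetical stand-ins for the unigram counts of the three texts.)

    from collections import Counter

    def laplace_unigram_probs(counts, vocab):
        """Add-one (Laplace) smoothed unigram probabilities:
        P(w) = (count(w) + 1) / (N + |V|), so unseen words get a small nonzero mass."""
        total = sum(counts.values())
        return {w: (counts.get(w, 0) + 1) / (total + len(vocab)) for w in vocab}

    # Hypothetical toy counts standing in for the unigram counts of the three texts.
    counts_a = Counter("a cat sat".split())
    counts_b = Counter("a dog sat".split())
    counts_c = Counter("the dog ran".split())

    # Total vocabulary = union of all unigrams seen in the three texts.
    vocab = set(counts_a) | set(counts_b) | set(counts_c)

    probs_a = laplace_unigram_probs(counts_a, vocab)
    assert abs(sum(probs_a.values()) - 1.0) < 1e-9   # smoothed distribution sums to 1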
Questions to answer on paper: