Please turn in a hardcopy for this assignment. I don't need to see your code. (Although if it's really clean, please e-mail it to me, or send me a URL pointing to it, so I can use it as a solution!) Notice that the point values for the problems add up to 85; grades will be normalized to be out of 100%.
As an aside for the mathematically inclined, here is a pointer to several derivations of the definition of entropy based on axioms about properties that a measure of uncertainty should have. See also a concise summary of properties of entropy. The nice presentation/proof I mentioned in class is in A.I. Khinchin, Mathematical Foundations of Information Theory, Dover, New York, 1957.
A note about notation. If X is a random variable whose distribution is (p1,p2,...,pk) then we usually write H(X) for H(p1,p2,...,pk). I will occasionally write H(p) instead of H(X), where p abbreviates p1,p2,...,pk.
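(For concreteness, recall that for a distribution p = (p1,p2,...,pk), H(p) = -sum_i p_i log_2 p_i, measured in bits. Not part of the assignment, but here is a minimal Python sketch of that formula if you want a sanity check; the name entropy is just illustrative.)

    import math

    def entropy(p):
        """H(p) in bits: -sum_i p_i * log2(p_i), treating 0 * log 0 as 0."""
        return -sum(pi * math.log2(pi) for pi in p if pi > 0)

    print(entropy([1/2, 1/4, 1/8, 1/8]))   # 1.75 bits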
     h0    h1    h2    h3    h4    h5    h6    h7
p    1/2   1/4   1/8   1/16  1/64  1/64  1/64  1/64

Recall that an optimal encoding for this distribution (one that uses the fewest bits on average when communicating horse race outcome events) can be created using Huffman coding (see also this interesting discussion by Huffman's nephew):
         h0   h1   h2    h3     h4       h5       h6       h7
codeP    0    10   110   1110   111100   111101   111110   111111
Show, without any numerical computations, that if you use codeP, the expected value E_p[length(h)] (i.e., the expected value of length(h) under distribution p) is equal to H(p), the entropy of the distribution. (Partial credit for doing the two computations explicitly and showing you get the same number.)
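(If you take the partial-credit route and do the two computations explicitly, a short sketch along these lines is one way to check your arithmetic; it is not something you need to hand in.)

    import math

    p     = {'h0': 1/2, 'h1': 1/4, 'h2': 1/8, 'h3': 1/16,
             'h4': 1/64, 'h5': 1/64, 'h6': 1/64, 'h7': 1/64}
    codeP = {'h0': '0', 'h1': '10', 'h2': '110', 'h3': '1110',
             'h4': '111100', 'h5': '111101', 'h6': '111110', 'h7': '111111'}

    expected_length = sum(p[h] * len(codeP[h]) for h in p)     # E_p[length(h)] under codeP
    entropy         = -sum(p[h] * math.log2(p[h]) for h in p)  # H(p)
    print(expected_length, entropy)                            # the two values coincide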
     h0    h1    h2    h3    h4    h5    h6    h7
q    1/8   1/8   1/8   1/8   1/8   1/8   1/8   1/8

Since the horses all have the same probability, the best code we can invent will obviously use the same number of bits for each horse. This means that the smallest number of bits we can use is log_2(8) = 3 bits:
         h0    h1    h2    h3    h4    h5    h6    h7
codeQ    000   001   010   011   100   101   110   111

Now imagine that we communicate lots of horse race outcomes using codeQ. Encoding winning horses using codeQ, what is E_p[length(h)]?
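(Again only as an optional sanity check, not part of what you turn in: the same kind of sketch works here, with the winners still drawn from p but encoded with codeQ.)

    p     = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]      # true distribution over h0..h7
    codeQ = ['000', '001', '010', '011', '100', '101', '110', '111']

    # Expected number of bits per race when winners drawn from p are encoded with codeQ.
    print(sum(pi * len(cw) for pi, cw in zip(p, codeQ)))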
Note: you'll need to do something about zero counts. Smooth using Laplace ("add one", Manning and Schuetze Section 6.2.2) to avoid zero probabilities. You can assume that the total vocabulary is the union of all the unigrams in the three texts.
Hint for less experienced programmers (or more experienced programmers who don't feel like doing it themselves): you can use one of the files created when you ran Stats in the Collocations assignment to get the unigram frequencies.
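(In case it helps once you have unigram counts, whether from your own code or from the Stats output mentioned above, here is a rough sketch of add-one smoothing as described in the note. The names counts_a, counts_b, counts_c and the toy strings are purely hypothetical stand-ins for the unigram counts of the three texts.)

    from collections import Counter

    def laplace_unigram_probs(counts, vocab):
        """Add-one (Laplace) smoothed unigram probabilities:
        P(w) = (count(w) + 1) / (N + |V|), so unseen words get a small nonzero mass."""
        total = sum(counts.values())
        return {w: (counts.get(w, 0) + 1) / (total + len(vocab)) for w in vocab}

    # Hypothetical toy counts standing in for the unigram counts of the three texts.
    counts_a = Counter("a cat sat".split())
    counts_b = Counter("a dog sat".split())
    counts_c = Counter("the dog ran".split())

    # Total vocabulary = union of all unigrams seen in the three texts.
    vocab = set(counts_a) | set(counts_b) | set(counts_c)

    probs_a = laplace_unigram_probs(counts_a, vocab)
    assert abs(sum(probs_a.values()) - 1.0) < 1e-9   # smoothed distribution sums to 1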
Questions to answer on paper: