Assignment 3 for Ling645/CMSC723: Solution

Assignment 3 Solution

Make up an interesting (or entertaining) example showing that Pr(x|y) can be very different from Pr(y|x). (Should take 5 minutes)

  Note: Any solution that mentions dice, balls, urns, or rain is not
  entertaining or interesting, but we didn't take off points for that!

  A good example (adapted from a student's paper):  

    Let X = renting an apartment,
    Let Y = being a graduate student.

    P(X|Y) is very high, since we can't afford to buy a house and a 
    cardboard box on the sidewalk wouldn't protect our textbooks.

    P(Y|X) is fairly low overall, since many people rent apartments
    but not many people in the general population are graduate students.

Use the definition of conditional probability to derive Bayes's Theorem. It may be a while since you used your high-school algebra, but that's all this requires. (If you have not completed this in 10 minutes, don't bother doing it, just look at the solution when I give it out.)

  Bayes's Theorem: P(B|A) = P(A|B) * P(B) / P(A)

  Proof:

  (1) P(A|B) = P(A,B) / P(B)            [Def. of conditional probability]
  (2) P(B|A) = P(A,B) / P(A)            [Def. of conditional probability]
  (3)  implies P(A,B) = P(B|A) * P(A)   [Multiplied both sides of (2) by P(A)]
  (4) Therefore                         [Using (3) to replace P(A,B) in (1)]
       P(B|A) = [P(B|A) * P(A)] / P(B).

Do Exercise 6.1 in Jurafsky and Martin. That is, show the formula for the maximum likelihood estimate of probability in a trigram model (Should take 5-10 minutes)


A maximum likelihood estimate (MLE) of probability is computed by
taking the raw count and dividing by the total, so for a general
N-gram model the relevant maximum likelihood estimates are:

  (1)  P(w[n-N+1],...,w[n-1],n) = Count(w[n-N+1],...,w[n-1],w[n])/Total
  (2)  P(w[n-N+1],...,w[n-1])   = Count(w[n-N+1],...,w[n-1])/Total

What we want is the conditional probability

                                          P(w[n-N+1],...,w[n-1],n)
  (3)  P(w[n] | w[n-N+1],...,w[n-1])  =  -------------------------
                                          P(w[n-N+1],...,w[n-1])

Substituting (1) and (2) into the numerator and denominator of (3),
the Total in the numerator and denominator cancel out, so we get the
general formula for the MLE of probability in an N-gram model:

                                         Count(w[n-N+1],...,w[n-1],w[n])
  (4)  P(w[n] | w[n-N+1],...,w[n-1]) =  ---------------------------------
                                          Count(w[n-N+1],...,w[n-1])

Note that Jurafsky and Martin give you (4), so you
didn't actually need to go through steps (1)-(3) by yourself!  They're
included here so you can fully understand what's going on.

Given (4), you need only show how this general formula applies for a
trigram model, where N = 3:

                                    Count(w[n-2],w[n-1],w[n])
  P(w[n] | w[n-2],...,w[n-1]) =  ---------------------------------
                                     Count(w[n-2],w[n-1])

Note that throughout I have used the notation w[i] to indicate that w
has subscript i.