Assignment 0 -- a "starter" assignment for Ling/CMSC 773


Please turn in hardcopy at the start of next class.

  1. Pick a phenomenon involving things that can be counted, and investigate empirically whether the frequency distribution of those things is or is not Zipfian. For example, one could count bigrams in the Book of Genesis, or average daily visitors to Web sites, or populations of American cities. However, you should not pick any of those things, or any other examples that we discussed in class or that you already know to be Zipfian. Pick something new, and the more interesting/fun the better. You should argue that the phenomenon is or is not well described by Zipf's law, supporting your argument empirically, e.g. by means of a graph.
    Note added at the end of the semester... On this assignment I encouraged judging the, um, Zipfianness? of a phenomenon by looking at how linear the frequency versus rank plot looked on a log-log scale. However, there are convincing arguments that "looking for a straight line on a log-log plot, and even finding one with high r-squared, is simply not a reliable way of checking whether a distribution is a power-law" (http://cscs.umich.edu/~crshalizi/weblog/457.html). The same author in a later post goes on to talk, somewhat entertainingly, about the right way to go about it (http://cscs.umich.edu/~crshalizi/weblog/491.html), with pointers to a detailed paper and code (http://www.santafe.edu/~aaronc/powerlaws/).
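
As a starting point for the empirical part, here is a minimal Python sketch of the rank-frequency bookkeeping and the log-log line fit described above. The function names `rank_frequency` and `loglog_slope` and the toy data are made up for illustration. Keep the caveat from the note in mind: a slope near -1 on a log-log plot is only a rough visual heuristic, not a reliable test for a power law.

```python
import math
from collections import Counter


def rank_frequency(items):
    """Return the frequencies of the distinct items, most common first.

    The position in the returned list (starting at 1) is the rank.
    """
    counts = Counter(items)
    return sorted(counts.values(), reverse=True)


def loglog_slope(freqs):
    """Least-squares fit of log(frequency) against log(rank).

    Returns (slope, intercept). Needs at least two distinct items,
    otherwise there is no line to fit.
    """
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    tokens = "a a a a a a b b b c c d".split()
    freqs = rank_frequency(tokens)
    slope, intercept = loglog_slope(freqs)
    print("frequencies by rank:", freqs)
    print("log-log slope:", slope)
```

An idealized Zipfian distribution, where frequency is proportional to 1/rank, would give a slope of -1 here. Real data will be messier, which is where the argument in your write-up comes in.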

  2. Do Manning and Schuetze Exercise 2.5.

  3. Here's this question as I phrased it in class.

    Suppose you have been given a "black box" function A that will take any two words (assume I mean tokens) and return a statistical association score for that pair of words, where a higher score means the two words tend to be associated with each other. (I'm deliberately leaving that notion vague.) If you had a large set of French-English sentence translations (i.e. parallel text, e.g. Canadian Parliament proceedings in French and English), how would you use those translations plus black-box function A to build a French-English bilingual dictionary?

    And here's the way I meant to say it.

    Suppose I give you a technique that will take a large set of word (token) pairs of the form (x, y) and allow you to create a "black box" function A, which takes as input any two words and returns a statistical association score for that pair of words, where a higher score means the two words tend to be associated with each other. (I'm deliberately leaving that notion vague.) For example, if you used my technique on all the bigrams in the Book of Genesis, you might get high scores for word pairs like (of,the) and (said,unto), because these word pairs tend to be "sticky", appearing together more often than you'd expect by chance, in that text. Now, if you had a large set of French-English sentence translations (i.e. parallel text, e.g. Canadian Parliament proceedings in French and English), how would you use those translations, plus the technique I've given you, to build a French-English bilingual dictionary?

    The second version is more interesting, but feel free to answer either. The main point is to get you thinking about ways to use statistical association measures, because we're going to cover that idea next class.
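
To make the "technique that gives you a black box" idea concrete, here is a small Python sketch of one possible instance. It is only an assumption for illustration: the assignment deliberately leaves the association score vague, and this sketch swaps in pointwise mutual information (PMI) estimated from raw counts. The name `make_association_scorer` is hypothetical, and the bigram data is a toy example.

```python
import math
from collections import Counter


def make_association_scorer(pairs):
    """Build a scoring function A(x, y) from an iterable of (x, y) pairs.

    Assumption (not from the assignment): the score is pointwise mutual
    information, log2(P(x, y) / (P(x) * P(y))), estimated from counts.
    """
    pairs = list(pairs)
    total = len(pairs)
    pair_counts = Counter(pairs)
    left_counts = Counter(x for x, _ in pairs)
    right_counts = Counter(y for _, y in pairs)

    def A(x, y):
        joint = pair_counts[(x, y)]
        if joint == 0:
            # The pair was never observed together.
            return float("-inf")
        # count(x, y) * N / (count(x) * count(y)) equals
        # P(x, y) / (P(x) * P(y)) under relative-frequency estimates.
        return math.log2(joint * total / (left_counts[x] * right_counts[y]))

    return A


if __name__ == "__main__":
    # Toy token sequence, invented for illustration; pairs are its bigrams.
    tokens = "and god said unto them and god said let there be light".split()
    bigrams = list(zip(tokens, tokens[1:]))
    A = make_association_scorer(bigrams)
    print(A("god", "said"), A("said", "light"))
```

Higher scores mean the pair shows up together more often than the individual counts alone would suggest, which matches the "sticky" intuition above. How to feed such a technique with parallel text is the part left to you.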