Assignment 3

In addition to electronic submission, please also turn in a hardcopy! I plan to grade the hardcopy and will consult the electronic version only if I need to. Hardcopy assignments can be slipped under my office door, AVW 3143, or turned in early in class on Wednesday if you finish by then.

Finite-state automata

  1. Consider the tokenizer in /afs/glue.umd.edu/class/fall2006/ling/723/0101/public/hw3/tokenize1.pl. Copy it to one of your own directories and run it by piping some text into standard input, i.e., at the Unix shell prompt (%), type:
    % ./tokenize1.pl < example.txt
    You can ignore the top and bottom of the Perl source; all that matters is the regular expression in the middle:
    [A-Za-z]+    # words
    | [0-9]+     # numbers
    | 's | --    # some special symbols
    | [^ ]       # individual symbols
    

    Note that whitespace and comments are allowed, and ignored, inside the regular expression (except inside square brackets, and except when escaped by a backslash).

    1. Run the tokenizer on /afs/glue.umd.edu/class/fall2006/ling/723/0101/public/hw3/fireswamp.txt.
    2. Extend the tokenizer to handle acronyms like R.O.U.S. It should be able to handle any acronym, not just a few. Provide your program listing and show that it works on fireswamp.txt. Show that it works on other acronyms by running on the following sentence:
      • "Not Spew," said Hermione impatiently. "It's S.P.E.W. Stands for the Society for the Promotion of Elfish Welfare."
    3. Identify another problem in fireswamp.txt and fix it. Provide your program listing and the output on fireswamp.txt.
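    One way to see how the acronym extension might work is to mirror the Perl regular expression in Python, whose re.VERBOSE flag behaves like Perl's /x modifier (whitespace and comments ignored outside character classes). This is only a sketch of one possible solution, not the expected Perl listing; the added acronym alternative is placed first so it wins over the plain-word alternative:

    ```python
    import re

    # Mirrors the Perl regex from tokenize1.pl, plus an acronym alternative.
    # re.VERBOSE ignores whitespace and comments, like Perl's /x modifier.
    TOKEN_RE = re.compile(r"""
          (?:[A-Za-z]\.){2,}   # acronyms: two or more letter-period pairs
        | [A-Za-z]+            # words
        | [0-9]+               # numbers
        | 's | --              # some special symbols
        | [^ ]                 # individual symbols (whitespace in [] is literal)
    """, re.VERBOSE)

    def tokenize(text):
        return TOKEN_RE.findall(text)

    print(tokenize("It's S.P.E.W. Stands for the Society."))
    # ['It', "'s", 'S.P.E.W.', 'Stands', 'for', 'the', 'Society', '.']
    ```

    The acronym alternative must come before [A-Za-z]+; regex alternation tries branches left to right, so otherwise "S" would match as a word and the periods would be split off.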

  2. Extend your implementation of depth-first search from last week to simulate a nondeterministic finite-state automaton (NFA):
    1. Change your graph representation to allow arc labels. Encode the following NFA:
    2. Change the "push" step of the algorithm to push (state, position) pairs instead of just nodes (see J&M book, p. 44). Provide your program listing.
    3. Demonstrate that your algorithm works correctly using the following example strings and a few others:
      • aaa
      • bbb
      • aab
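    The depth-first simulation described above can be sketched as follows. Since the NFA figure from the handout is not reproduced here, the sketch encodes a stand-in NFA (strings over {a, b} ending in "ab") -- substitute the arcs from your handout. The representation and the (state, position) agenda follow the ND-RECOGNIZE idea from J&M p. 44:

    ```python
    # Stand-in NFA (accepts strings over {a,b} ending in "ab"); replace with
    # the arcs from the handout's figure. Arcs: (state, symbol) -> successors;
    # the symbol '' marks an epsilon arc.
    NFA = {
        (0, 'a'): [0, 1],
        (0, 'b'): [0],
        (1, 'b'): [2],
    }
    START, ACCEPT = 0, {2}

    def accepts(nfa, start, accept, s):
        # Depth-first search over (state, position) pairs: the agenda holds
        # partial runs; pop one, succeed if it is an accepting state at the
        # end of the input, otherwise push all successor pairs.
        agenda = [(start, 0)]
        seen = set()
        while agenda:
            state, pos = agenda.pop()          # LIFO pop = depth-first
            if pos == len(s) and state in accept:
                return True
            if (state, pos) in seen:           # avoid re-exploring pairs
                continue
            seen.add((state, pos))
            if pos < len(s):
                for nxt in nfa.get((state, s[pos]), []):
                    agenda.append((nxt, pos + 1))
            for nxt in nfa.get((state, ''), []):   # epsilon arcs keep pos
                agenda.append((nxt, pos))
        return False

    for w in ["aaa", "bbb", "aab"]:
        print(w, accepts(NFA, START, ACCEPT, w))
    ```

    The seen set is an optional safeguard: with epsilon arcs, revisiting the same (state, position) pair could otherwise loop forever.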
Language models

  3. In this exercise, we are going to play with some real language models using the simple programs shown in class. They are in /afs/glue.umd.edu/class/fall2006/ling/723/0101/public/hw3/ and are called make-lm.py, lm-score.py, and lm-generate.py. You can copy them to your own directory or run them where they are. It is assumed below that you have copied them to your own directory. There are also some corpora in /afs/glue.umd.edu/class/fall2006/ling/723/0101/public/texts. Please don't copy these, as they are sizable.

    1. Create a bigram language model out of the Princess Bride script:

      % ./make-lm.py -n 2 /afs/glue.umd.edu/class/fall2006/ling/723/0101/public/texts/princess.txt -o princess.lm
      Have a look at the language model file. It should look like:
      sword. </s>     1.000000
      memory. </s>    1.000000
      at seeing       0.009709
      at all  0.019417
      at least,       0.009709
      at her. 0.009709
      at her, 0.019417
      at hand,        0.009709
      at last,        0.019417
      at Humperdinck. 0.009709
      

      Each line has an n-gram and a probability: for example, the first line gives P(</s>|sword.). Notice that we have not done any tokenization -- you are welcome to try it with a tokenizer, but for consistency's sake we will all try untokenized text first.
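      The source of make-lm.py is not shown here, but a bigram model of this kind is just maximum-likelihood estimation: count each bigram, divide by the count of its first word. A minimal sketch (illustrative only, not make-lm.py's actual code), using <s> and </s> markers as in the file format above:

      ```python
      from collections import Counter

      def bigram_lm(lines):
          # MLE bigram estimates over whitespace-split tokens, one sentence
          # per line, with <s>/</s> sentence markers:
          # P(next|prev) = count(prev next) / count(prev as a first word).
          pair_counts = Counter()
          prev_counts = Counter()
          for line in lines:
              toks = ['<s>'] + line.split() + ['</s>']
              for prev, nxt in zip(toks, toks[1:]):
                  pair_counts[(prev, nxt)] += 1
                  prev_counts[prev] += 1
          return {(p, n): c / prev_counts[p]
                  for (p, n), c in pair_counts.items()}

      lm = bigram_lm(["as you wish", "as you wish as you like"])
      print(lm[('you', 'wish')])   # count(you wish)/count(you) = 2/3
      ```

      Note there is no smoothing here: any bigram absent from the training text gets probability zero, which is why lm-score.py can fail on unseen word pairs.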

    2. Use the probabilities in the language model file to calculate the probability of the string
      I'm not left-handed.
      Use grep to find the n-grams in the language model file, and of course you may use a calculator to multiply the probabilities together. Don't forget the start and stop symbols. Show the individual n-gram probabilities you used. Check your work against the computer's:
      % ./lm-score.py -m princess.lm
      I'm not left-handed.
      [Press control-D; you may have to press it twice]
      I'm not left-handed.     => exp(-n.nnnnn)

      The computer gives the answer as a logarithm (base 10) because in practice these probabilities can get very small.
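      The hand calculation lm-score.py automates looks like this: multiply P(w_i|w_{i-1}) along the sentence, including the start and stop symbols, and report the base-10 log. The bigram probabilities below are invented for illustration; the real values come from grepping princess.lm:

      ```python
      import math

      # Toy probabilities, NOT taken from princess.lm -- substitute the
      # values you find with grep.
      bigram_p = {
          ('<s>', "I'm"): 0.02,
          ("I'm", 'not'): 0.10,
          ('not', 'left-handed.'): 0.05,
          ('left-handed.', '</s>'): 1.00,
      }

      def sentence_logprob(probs, sentence):
          toks = ['<s>'] + sentence.split() + ['</s>']
          logp = 0.0
          for prev, nxt in zip(toks, toks[1:]):
              # Summing logs is the same as taking the log of the product,
              # but avoids underflow for long sentences.
              logp += math.log10(probs[(prev, nxt)])
          return logp

      print(sentence_logprob(bigram_p, "I'm not left-handed."))
      ```

      With the toy values this gives log10(0.02 * 0.10 * 0.05 * 1.00) = -4.0, i.e., a probability of 10^-4.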

    3. Give the random generator a try:
      % ./lm-generate.py -m princess.lm -r 10
      The -r 10 specifies the number of random sentences to generate.
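      Conceptually, generation is the model run in reverse: start at <s>, repeatedly sample the next word from the conditional distribution given the previous word, and stop at </s>. A sketch of what lm-generate.py might do (its source is not shown here; the model format follows the bigram_lm convention of (prev, next) keys):

      ```python
      import random

      def generate(bigram_probs, max_len=30):
          # Sample a sentence from a bigram model: walk from <s>, drawing
          # each next word in proportion to P(next|prev), until </s>.
          words, prev = [], '<s>'
          for _ in range(max_len):
              choices = [(n, p) for (w, n), p in bigram_probs.items()
                         if w == prev]
              if not choices:
                  break
              nxts, ps = zip(*choices)
              prev = random.choices(nxts, weights=ps)[0]
              if prev == '</s>':
                  break
              words.append(prev)
          return ' '.join(words)

      # A deterministic toy model: only one arc out of each word.
      toy = {('<s>', 'as'): 1.0, ('as', 'you'): 1.0,
             ('you', 'wish'): 1.0, ('wish', '</s>'): 1.0}
      print(generate(toy))   # as you wish
      ```

      The max_len cap is just a safety valve so a model with a cycle and no reachable </s> cannot generate forever.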

    4. Now for the interesting part: create some more language models on the same text, using different values of n (use the -n option). Try at least n=1 and n=3 or 4. For these new language models, look up the probability of the string again -- what happens? Generate some random sentences -- does the quality go up or down? Explain.