java -jar weka.jar weka.classifiers.trees.J48 -t weather.arff
If you are running via ssh to another machine, you might get an error message regarding the X11 display. To get past this error, use the following command line instead:
java -Djava.awt.headless=true -classpath THISDIRECTORY/weka.jar weka.classifiers.trees.J48 -t data/weather.arff
where THISDIRECTORY is the full path for the directory.
[ resorts/NPS ] has/VBZ suspended/VBN [ interest_6/NN payments/NNS ] on/IN [ its/PP$ bonds/NNS ] ,/, pending/VBG [ a/DT debt/NN ] restructuring/NN ./.
you can see that the word "interest" has been used in sense interest_6. (FYI, "Resorts" is the name of a company. These data are all from the Wall Street Journal.)
Write a program that takes this data set, and creates an .arff file in which the class to be predicted is one of {interest_1, interest_2, ..., interest_6}, and there are four features (i.e. attributes other than the predicted class): prevword, the word to the left of "interest", nextword, the word to the right of it, prevtag, the part-of-speech tag of the word to the left, and nexttag, the part-of-speech tag of the word to the right. For present purposes, you should treat the square brackets as if they're not there. (They mark noun phrase boundaries.)
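As a rough illustration of the feature extraction described above, here is a minimal Python sketch for a single tagged line. It is not the course's code skeleton. The function names, the "NONE" padding value used when "interest" is the first or last token, and the assumption that the target may also appear in a plural form such as interests_5 are all hypothetical choices made for this example.

```python
import re

# Assumed shape of the target token: "interest_6" or (hypothetically) "interests_5".
TARGET = re.compile(r"^(interests?)_(\d)$")


def split_token(token):
    """Split 'word/TAG' at the last slash, so './.' gives ('.', '.')."""
    word, _, tag = token.rpartition("/")
    return word.lower(), tag


def extract_instance(line, pad="NONE"):
    """Return (prevword, nextword, prevtag, nexttag, sense), or None if no target."""
    # Treat the noun-phrase brackets as if they were not there.
    tokens = [t for t in line.split() if t not in ("[", "]")]
    # Keep only word/TAG tokens; anything without a slash is skipped.
    pairs = [split_token(t) for t in tokens if "/" in t]
    for i, (word, _tag) in enumerate(pairs):
        match = TARGET.match(word)
        if match:
            prev = pairs[i - 1] if i > 0 else (pad, pad)
            nxt = pairs[i + 1] if i + 1 < len(pairs) else (pad, pad)
            return prev[0], nxt[0], prev[1], nxt[1], "interest_" + match.group(2)
    return None
```

Calling extract_instance on the example sentence shown earlier yields the left neighbor suspended/VBN and the right neighbor payments/NNS, with class interest_6.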
There are some subtleties to deal with -- for example, WEKA seems to want nominal attributes (i.e. listing all the possible values for each feature) rather than "string" attributes, and it also doesn't like attribute values like "u.s.a." that contain punctuation. See this code skeleton, which is based on what I threw together quickly; it will save you time to see what I did to deal with these sorts of issues.
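One possible way to handle both subtleties, sketched in Python under stated assumptions: each instance is a tuple (prevword, nextword, prevtag, nexttag, sense), every character other than a letter, digit, or underscore is replaced by an underscore, and each nominal attribute lists exactly the values observed in the data. The helper names sanitize and to_arff are made up for this example and are not taken from the code skeleton.

```python
import re


def sanitize(value):
    """Replace every character other than letters, digits, and '_' with '_'."""
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", value)
    return cleaned or "EMPTY"  # guard against an empty attribute value


def to_arff(instances, relation="interest"):
    """Build ARFF text with nominal attributes from 5-tuples of strings."""
    names = ["prevword", "nextword", "prevtag", "nexttag", "sense"]
    rows = [[sanitize(v) for v in inst] for inst in instances]
    lines = ["@relation " + relation, ""]
    for col, name in enumerate(names):
        # A nominal attribute lists all of its possible values in braces.
        values = sorted({row[col] for row in rows})
        lines.append("@attribute %s {%s}" % (name, ",".join(values)))
    lines += ["", "@data"]
    lines += [",".join(row) for row in rows]
    return "\n".join(lines) + "\n"
```

Note that this sanitization is lossy: "u.s.a." becomes "u_s_a_", and single punctuation tokens such as "," and "." both collapse to "_". Whether that matters for the classifier is something to check against your own data.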
Once you've got an .arff file, see how well a decision tree classifier (J48) does at selecting the correct sense of "interest" from that limited context. Turn in the output of WEKA running with classifier weka.classifiers.trees.J48 (same as what you ran above, but with your new .arff file), showing the decision tree, performance on the training data, and cross-validation results. Discuss the decision tree and the results. For example, which senses of "interest" is it having the hardest time distinguishing, based on the confusion matrix? Are there any patterns of usage that were picked up very well even by features paying attention to this extremely limited context?
Optionally: instead of distinguishing senses interest_1 through interest_6, collapse the first four senses into interestA and the last two into interestB. This two-way distinction is roughly interest as "caring about something" versus interest in its financial sense. The decision tree will be a whole lot smaller and easier to understand.
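The collapsing step can be sketched as a small relabeling function applied to each class label before the .arff file is written. The function name collapse_sense is a hypothetical choice for this example.

```python
def collapse_sense(sense):
    """Map interest_1..interest_4 to interestA and interest_5..interest_6 to interestB."""
    number = int(sense.rsplit("_", 1)[1])
    if not 1 <= number <= 6:
        raise ValueError("unexpected sense label: " + sense)
    return "interestA" if number <= 4 else "interestB"
```

With this mapping, the class attribute in the .arff header would list only the two values interestA and interestB.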
Optionally: instead of the above .arff format, create an .arff file where the predictive features (i.e. the non-class attributes) all have values 0 or 1. For example, an attribute nextword_money would have the value 1 if (and only if) the word after "interest" was "money", and prevtag_jj would have the value 1 if (and only if) the word before "interest" was tagged as an adjective (tag JJ). Notice how much easier this representation makes it to define a new feature we might call near10_money, whose value is 1 if and only if the word "money" appears within ten words of "interest" in either direction, or a new feature like s_money, whose value is 1 if and only if the word "money" appears in the same sentence. Looking at the data, do you think features of this kind would be likely to be useful?
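A minimal sketch of the 0/1 representation, assuming each sentence has already been split into parallel lists of lowercased words and tags, and that you have chosen a vocabulary and a tag set to generate attributes for. The function name binary_features and its parameters are hypothetical. The attribute names follow the pattern used above (nextword_money, prevtag_jj, near10_money, s_money).

```python
def binary_features(words, tags, i, vocab, tagset, window=10):
    """Return a dict of 0/1 features for the target word at position i."""
    prev_w = words[i - 1] if i > 0 else None
    next_w = words[i + 1] if i + 1 < len(words) else None
    prev_t = tags[i - 1] if i > 0 else None
    next_t = tags[i + 1] if i + 1 < len(tags) else None
    # Words within `window` positions on either side, excluding the target itself.
    nearby = set(words[max(0, i - window):i] + words[i + 1:i + 1 + window])
    # All other words in the same sentence.
    others = set(words[:i] + words[i + 1:])

    feats = {}
    for w in vocab:
        feats["prevword_" + w] = int(prev_w == w)
        feats["nextword_" + w] = int(next_w == w)
        feats["near%d_%s" % (window, w)] = int(w in nearby)
        feats["s_" + w] = int(w in others)
    for t in tagset:
        feats["prevtag_" + t.lower()] = int(prev_t == t)
        feats["nexttag_" + t.lower()] = int(next_t == t)
    return feats
```

Each key of the returned dictionary would become one attribute in the .arff file, and adding a new kind of feature only means adding more keys rather than changing the set of values of an existing attribute.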
Optionally: try a couple of different classifiers. WEKA makes this really easy -- it's basically just a different java class specified on the command line. Feel free to share information with classmates about how to get other classifiers to run. Do other classifiers do better or worse than the decision tree classifier?
15% Extra Credit: try adding some features other than the previous and next words/tags, and report the results. Did they help? (Note: this wouldn't be a valid experimental result, of course, since you didn't test on previously-unseen data!)