Pang and Lee point out quite a few variations on the theme, but the central premise underlying most of this work is that spans of text often convey internal state, such as a positive or negative opinion, an emotional reaction, or an author's perspective on a topic, and that such a state can be thought of as a label for the text. The simplest example (and probably the best researched) would be opinion detection in movie or product reviews: if the text says "I thought this movie was terrible!", we have overt information that a subjective statement is being made (I thought) as well as overt information about the polarity of an opinion (terrible). Significantly more challenging are scenarios where the goal is to label an underlying state in texts that are not overtly subjective, e.g. in a marital counseling session, "The dishwasher broke yesterday" and "My husband broke the dishwasher yesterday" are both statements of fact, but they convey vastly different underlying perspectives on the event. Sometimes the underlying state we're interested in is not positive or negative, but is instead a contrast based on group membership or ideology; for example, a posting on someone's blog might be labeled liberal or conservative.
In this project, your team will do a piece of end-to-end research on this kind of labeling. This will involve preprocessing corpora, making choices about features to include in text representations, training classifiers, evaluating performance, and writing a project summary.
Note that this project is designed so that good results might actually be publishable in a workshop or even conference paper. What you're being given here is not a textbook problem; rather, it's part of the very much open problem of how to do better sentiment analysis. That has advantages and disadvantages. On the positive side, it's more fun. On the negative side, this is open territory and it's possible that unforeseen problems will crop up with the assignment -- either in how it's formulated, or with the materials I give you, or in system issues. If that happens, let me know and we'll adjust accordingly.
Note that I have stored the corpus in a really inefficient way: documents are duplicated across the train/test split in each fold. This keeps the structure of the directories really simple but hugely wastes space. It would make a lot more sense to keep just one copy of each document and then define train/test splits with pointers (e.g. symbolic links) to the actual documents.
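If you want to restructure the corpus yourself, here's a minimal sketch of the symlink approach; the directory and file names are hypothetical, and this assumes a Unix-like filesystem where symbolic links are available:

```python
import os

def link_split(doc_dir, split_dir, filenames):
    """Create a train/test split as symbolic links into a single
    canonical copy of the corpus, instead of duplicating files."""
    os.makedirs(split_dir, exist_ok=True)
    for name in filenames:
        target = os.path.abspath(os.path.join(doc_dir, name))
        link = os.path.join(split_dir, name)
        if not os.path.exists(link):
            os.symlink(target, link)
```

Each fold's train/ and test/ directories then cost almost nothing in space, and reading a linked file transparently reads the canonical document.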
WEKA. The Weka toolkit is one of the best known and most widely available machine learning packages. It supports a wide range of supervised learning techniques, including most of the ones we have discussed in class. Weka comes with both a graphical user interface and a command-line interface, as well as a Java API. The basic idea in using Weka is to represent your learning problem using an .arff file, within which each instance is represented as a feature vector. (The header of the file identifies the types of the features as well as the feature that constitutes the class being predicted.) Once you've got your data into the .arff format, it's very easy to try out different learning algorithms and/or different parameters for the same algorithm.
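To make the format concrete, here's a minimal sketch that writes labeled bag-of-words instances to an .arff file. The relation and feature names are made up for illustration; in practice you might instead feed Weka raw strings and use its StringToWordVector filter:

```python
def write_arff(path, instances, relation="reviews"):
    """Write labeled bag-of-words instances to a minimal .arff file.
    `instances` is a list of (feature_dict, label) pairs, where each
    feature_dict maps a word to a numeric count."""
    vocab = sorted({f for feats, _ in instances for f in feats})
    labels = sorted({lab for _, lab in instances})
    with open(path, "w") as out:
        out.write("@RELATION %s\n\n" % relation)
        for f in vocab:
            out.write("@ATTRIBUTE %s NUMERIC\n" % f)
        out.write("@ATTRIBUTE class {%s}\n\n@DATA\n" % ",".join(labels))
        for feats, lab in instances:
            row = [str(feats.get(f, 0)) for f in vocab]
            out.write(",".join(row + [lab]) + "\n")
```

The header declares one numeric attribute per vocabulary word plus a nominal class attribute; each @DATA row is one instance's feature vector followed by its label.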
As a quick and easy starting point, you might want to try out Carolyn Rose's TagHelper package. TagHelper is a wrapper around Weka (with its own GUI and command-line interfaces) that builds in many of the feature extraction choices most commonly used in text processing, e.g. tokenization, lowercasing, unigrams, bigrams, stemming, part-of-speech tags, etc. The basics are incredibly simple: you put your data into a two-column spreadsheet where the first column is Code (i.e. label) and the second column is Text (i.e. the text being classified), you tell it which feature extraction options you want to use (or just use its defaults), and then you run it. It will do feature extraction (creating .arff files for you) and, by default, will do evaluation on your dataset via (IIRC) 10-fold cross-validation. It's also possible to use separate training and test sets. HOWEVER, be warned that in the past some people have had some trouble with TagHelper's scalability.
MALLET. The MALLET toolkit is another very nice package. It's not as well documented as Weka, but it has some decent "quick start" Web pages with useful examples, fairly readable Java source code (or so I'm told), and an online discussion group that is fairly actively monitored by MALLET team members, who actually seem to be pretty responsive. MALLET overlaps with Weka a bit, but it also supports sequence modeling (conditional random fields, a.k.a. CRFs) and unsupervised topic modeling (Latent Dirichlet Allocation, a.k.a. LDA). And again, you can get your data into a fairly standard format and play with different parameterizations for learning, there's a Java API, etc. (But no GUI.)
Others. There are a variety of other toolkits out there for specific approaches such as maximum entropy modeling, support vector machines, and decision tree learning, and I know LingPipe implements subjectivity and sentiment analysis. It also appears that NLTK offers useful machine learning toolkit functionality (decision tree, maximum entropy, and naive Bayes classifiers, and an interface to Weka), although I'm not particularly familiar with it. There are various other lists of machine learning toolkits out there; probably one of the best to look at would be Hal Daume's list of useful machine learning links and software.
Discussion of machine learning packages is welcome on the class forum, and I'm happy to inquire with my students and former students about their experiences with packages you're considering using.
Why bother doing this instead of just doing a standard Naive Bayes model (like you can find in Weka and other toolkits)? Well, Pedersen and Bruce -- and practically everyone else who uses Gibbs sampling, it would appear -- adopt a uniform prior for the word distributions. Remember how the Beta(1,1) distribution is just a uniform distribution for a coin-flip parameter? For distributions over a vocabulary of words, the natural uninformative prior is a Dirichlet(1,1,...,1) distribution, where the width of the vector is the size |V| of the vocabulary. It's the same idea as the Beta(1,1) prior, just going from 2 alternatives to |V| alternatives: this prior introduces no bias at all regarding which words start out as more likely. I conjecture that you can do better by using an informed prior, specifically by using a Dirichlet prior informed by a subjectivity lexicon (see below). For example, the prior distribution for the NEGATIVE label could give lower weight to words in the lexicon like good or adore that are associated with positive sentiments.
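To make the contrast concrete, here's a sketch of a multinomial Naive Bayes trainer where each class's Dirichlet pseudo-counts come from a per-class prior function. A constant function returning 1.0 recovers the uniform Dirichlet(1,...,1) prior; a lexicon-informed function can return smaller pseudo-counts for opposite-polarity words. The labels, lexicon, and function signatures are all my own illustrative choices, not any toolkit's API:

```python
import math
from collections import Counter

def train_nb(docs, priors):
    """docs: list of (token_list, label) pairs.
    priors: dict mapping label -> function(word) -> Dirichlet
    pseudo-count for that word under that class."""
    vocab = {w for toks, _ in docs for w in toks}
    counts = {}
    for toks, lab in docs:
        counts.setdefault(lab, Counter()).update(toks)
    model = {}
    for lab, c in counts.items():
        pseudo = {w: priors[lab](w) for w in vocab}
        total = sum(c.values()) + sum(pseudo.values())
        # log p(word | label) with Dirichlet smoothing
        model[lab] = {w: math.log((c[w] + pseudo[w]) / total) for w in vocab}
    return model

def classify(model, toks):
    """Pick the label maximizing the sum of log word probabilities
    (unknown words are ignored; class priors assumed equal)."""
    return max(model, key=lambda lab: sum(model[lab].get(w, 0.0) for w in toks))
```

With a subjectivity lexicon in hand, an informed prior for the NEGATIVE class might be something like `lambda w: 0.1 if w in positive_lexicon else 1.0`, which biases the NEGATIVE word distribution away from words like good or adore before any data is seen.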
The syntactic features introduced by Greene and Resnik would also be something to consider including, especially since I can provide code to extract them. The basic idea here is that the syntactic form of a sentence carries information about the semantic "framing" being adopted by the author, which can be connected to underlying sentiment. For example, "My husband broke the dishwasher" would give rise to features including break-TRANS and obj-dishwasher, indicating respectively that break was used transitively and that dishwasher appears as an object. (We didn't use triples like break-obj-dishwasher because of data sparseness issues.) In contrast, "The dishwasher broke" would give rise to, among others, the feature break-NOOBJ, indicating that the verb break was used without an overt direct object. Because syntactic transitivity is associated with some highly relevant semantic properties (e.g. causation, intended action, and change-of-state in the object), the transitivity-indicating feature encourages an interpretation of the event that foregrounds the husband's causal role, the fact that the dishwasher was strongly affected by the event, etc. If breaking the dishwasher is an undesirable outcome, then the transitive statement encourages an interpretation of the event connected with negativity toward the husband; the inchoative version (no object) de-emphasizes the properties associated with that interpretation.
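Just to illustrate the shape of these features (this is not the actual extraction code I'll provide, and it assumes parses have already been reduced to a hypothetical list of (head, relation, dependent) triples):

```python
def syntactic_features(triples):
    """Sketch of Greene-and-Resnik-style syntactic features from a
    dependency parse given as (head, relation, dependent) triples.
    Emits verb-TRANS and obj-noun features for transitive uses, and
    verb-NOOBJ when the verb appears without an overt direct object."""
    verbs = {h for h, rel, d in triples if rel == "subj"}
    objects = {h: d for h, rel, d in triples if rel == "obj"}
    feats = []
    for v in verbs:
        if v in objects:
            feats.append(v + "-TRANS")
            feats.append("obj-" + objects[v])
        else:
            feats.append(v + "-NOOBJ")
    return sorted(feats)
```

So "My husband broke the dishwasher" (triples (broke, subj, husband) and (broke, obj, dishwasher)) yields broke-TRANS and obj-dishwasher, while the inchoative "The dishwasher broke" yields broke-NOOBJ.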
What are some possible extensions of this idea? Well, Greene and I did not use any external knowledge about the verb. We left it to the machine learning to figure out which features would push the label in which directions; e.g. one might expect break-TRANS to show up more in one kind of document and the same feature with a more positive verb, say rescue-TRANS, to show up in the opposite kind of document. But I think it would be interesting to explore whether the subjectivity lexicon could be used in conjunction with these syntactic features in some way in order to capture generalizations based on verb types (perhaps adding features like negativepolarity-TRANS, positivepolarity-TRANS, etc. in addition to the verb-specific features?).
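One way that generalization might be sketched: given verb-specific features and a (hypothetical) subjectivity lexicon mapping verbs to polarity classes, emit class-level backoff features alongside the verb-specific ones:

```python
def polarity_backoff(feats, verb_lexicon):
    """Add polarity-class backoff features: if a hypothetical
    subjectivity lexicon maps a verb to 'negative' or 'positive',
    then e.g. break-TRANS also emits negativepolarity-TRANS.
    Verb-specific features are kept, so the learner can use both."""
    extra = []
    for f in feats:
        verb, _, frame = f.rpartition("-")
        if verb in verb_lexicon:
            extra.append(verb_lexicon[verb] + "polarity-" + frame)
    return feats + extra
```

The learner then sees evidence at two levels of granularity: break-TRANS is sparse, but negativepolarity-TRANS pools counts across all negative-polarity verbs.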
As another thought, it might be interesting to look at the extent to which the author's choice of syntactic frame for the verb differs from its most conventional use. For example, if break is used in an active transitive 5 times as frequently as the passive, based on syntactic analysis in some reference corpus (Penn Treebank?), then its use in the passive seems like it should receive strong weight, while a transitive use might not be telling us anything particularly significant about how the author is framing the situation. I really think this ought to help in identifying places where "spin" is taking place using syntactic structures.
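A sketch of that weighting, under the assumption that you have verb-frame counts from some reference corpus (the counts and frame names below are invented); add-alpha smoothing keeps unseen frames finite:

```python
import math

def frame_weight(verb, frame, ref_counts, alpha=1.0):
    """Weight a verb-frame feature by how unexpected that frame is
    for that verb in a reference corpus. ref_counts[verb][frame] are
    counts from the reference corpus; the weight is -log p(frame|verb)
    under add-alpha smoothing, so rarer frames get larger weights."""
    counts = ref_counts.get(verb, {})
    frames = set(counts) | {frame}
    total = sum(counts.values()) + alpha * len(frames)
    p = (counts.get(frame, 0) + alpha) / total
    return -math.log(p)
```

If break shows up transitively 50 times and in the passive 10 times in the reference corpus, a passive use of break gets a noticeably larger weight than a transitive one, matching the intuition that the unconventional frame is more likely to signal deliberate framing by the author.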
Those are just some pet ideas I've been thinking about. Yes, I think it would be very cool to see some of you guys try them out. But I'm also very open to creative thinking on your part about other features that might be useful for one or more of the classification problems.
In terms of evaluation measure, one would generally use simple accuracy: did the classifier's label on the test item match the "ground truth" for that item?
That said, an important part of the evaluation is going beyond the numbers to an analysis of why things turned out the way they did. (And a good analysis can be as important as a positive result, in terms of good research, even if it makes it harder to get a paper accepted.) One form of analysis might be an error analysis for an individual classifier/featureset combination, trying to identify generalizations about what it does well or what it does poorly. Another form of analysis might break the errors into false positives and false negatives, or into other buckets, in order to seek insight into what's working, what's not, and how it could be improved.
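A minimal sketch of accuracy plus the false-positive/false-negative breakdown, assuming a binary task with one label of interest (the "POS" default is just an illustrative choice):

```python
def evaluate(gold, predicted, positive="POS"):
    """Accuracy, plus a breakdown of errors into false positives and
    false negatives with respect to one label of interest."""
    assert len(gold) == len(predicted)
    pairs = list(zip(gold, predicted))
    correct = sum(g == p for g, p in pairs)
    fp = sum(1 for g, p in pairs if p == positive and g != positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    return {"accuracy": correct / len(gold),
            "false_positives": fp,
            "false_negatives": fn}
```

The interesting analysis then starts from those buckets: pull out the actual false-positive and false-negative documents and read them, looking for patterns the feature set is missing.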
For the remaining 25 points, each team member should anonymously rate each other team member as follows -- making sure to look at the definitions below to see what the numeric scales are supposed to mean.
Collaboration: 10 means that this person was great to collaborate with and you'd be eager to collaborate with them again, and 1 means you definitely would avoid collaborating with this person again. Give 5 as an average rating for someone who was fine as a collaborator, but for whom you wouldn't feel strongly about either seeking them out or avoiding them as a collaborator in the future.
Contribution: 10 means that this person did their part and more, over and above the call of duty. 1 means that this person really did not contribute what they were supposed to. Give 5 as an average, "did what they were expected to" rating. Note that this is a subjective rating relative to what a person was expected to contribute. If five people were contributors and each did a fantastic job on their pieces, then all five could conceivably get a rating of 10; you would not slice up the pie 5 ways and give each person a 2 because they each did 20% of the work! It is your job as a group to work out what the expected contributions will be, to make sure everyone is comfortable with the relative sizes of the contributions, and to recalibrate expectations if you discover you need to. Try to keep things as equitable as possible, but if one person's skills mean they could do a total of 10% of the work compared to another person's 15%, and everyone is ok with this, then both contributors can certainly get a score of higher than 5 if they both do their parts and do them well. If you need help breaking up tasks, agreeing on expectations, etc., I would be happy to meet with the group to assist in working these things out.
Effort: A rating of 3 should be average, with 5 as superior effort (whether or not they succeeded) and 1 as didn't put in the effort. A rating below 3 would not be expected if the person's contribution was 5 or better. If a person just didn't manage to contribute what they were expected to, but you think they did everything in their power to make it happen, you could give them a top rating for effort even while giving them a low contribution score.
Unlike the real world, which is not very forgiving, this is a controlled setting that involves the guidance of an instructor, who can be very forgiving. Remember that the activity is, first and foremost, a collaborative learning activity, with the emphasis on learning. If there are problems or issues of any kind, let me know sooner rather than later, and I will help to get them worked out. Also feel free to use the mailing list or discussion forum: the presence of multiple teams does not mean that you are competing with each other. (I considered adding extra credit for the team with the best results, but I specifically decided against it because I would much rather see a spirit of collaboration not only within teams but at the level of the entire class.)