Context-Sensitive Adaptation of Phrase Probabilities Using Phrase Sense Disambiguation

Note: this version is not final until I say it is.

Introduction

This project applies familiar supervised learning ideas from work on word sense disambiguation (WSD) to the problem of disambiguating source-language "phrases" for purposes of machine translation. You will essentially be replicating a technique introduced by Marine Carpuat and Dekai Wu.

The general setting for this work is the problem of machine translation. State-of-the-art statistical MT systems are trained using a large corpus of parallel sentence pairs in the source and target language, typically also using an automatically-created "best" alignment of source words with target words. For example, the training data might include the pair

  deseo saludar la intervención importante y sustancial del presidente prodi .
  i too would like to welcome mr prodi 's forceful and meaningful intervention .

with the alignment

  0-2 0-3 1-5 8-6 8-7 9-7 7-8 6-9 5-10 4-11 6-11 3-12 10-13

The pair 1-5 in the alignment represents an alignment link between token 1 of the source (saludar) and token 5 of the target (welcome). Notice that tokens are numbered starting with 0. Also notice that alignments are not necessarily one-to-one; e.g. you can see that Spanish token 0 (deseo) is aligned with English tokens 2 and 3 (would like).
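
To make the format concrete, here is a minimal Python sketch (the function name is ours, not part of any provided tooling) that parses an alignment string into (source, target) index pairs:

  def parse_alignment(line):
      """Parse an alignment string like '0-2 0-3 1-5' into a list of
      (source_index, target_index) integer pairs."""
      pairs = []
      for link in line.split():
          src, tgt = link.split("-")
          pairs.append((int(src), int(tgt)))
      return pairs

  links = parse_alignment("0-2 0-3 1-5 8-6 8-7 9-7 7-8 6-9 5-10 4-11 6-11 3-12 10-13")
  print([t for s, t in links if s == 0])  # tokens aligned to deseo: [2, 3]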

In these statistical systems, the training process usually includes the creation of a "phrase table", where "phrase" means a sub-sequence of tokens, not necessarily the sort of things that a linguist would call a phrase. By analyzing the aligned training corpus, systems automatically extract phrase table entries like the following:

  deseo saludar ||| i must congratulate ||| 0.00714286 2.33525e-05 0.166667 2.03231e-05 2.718
  deseo saludar ||| i should like ||| 7.03284e-05 3.10014e-06 0.166667 0.000497919 2.718
  deseo saludar ||| i welcome ||| 0.000550812 0.000134179 0.333333 0.100133 2.718
  deseo saludar ||| like to commend ||| 0.0333333 0.000293642 0.166667 0.000208066 2.718
  deseo saludar ||| like to congratulate ||| 0.00147275 5.07773e-05 0.166667 0.000208066 2.718

These entries contain a source phrase, a target translation, and then a set of numerical "features". We will concern ourselves in this project with just one of those features, the conditional probability p(e|f). In these particular phrase table entries, this is the third numerical feature, so you can see, for example, that

p(i welcome | deseo saludar) = 0.333333

Crucially, this probability is not sensitive to context. These phrase table entries tell the system that regardless of the context -- e.g. surrounding words in the sentence -- the phrase deseo saludar has a 33.3% chance of translating as i welcome.
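
If you want to pull p(e|f) out of the phrase table yourself, here is a rough sketch (assuming the three-field "|||"-separated format shown above, with p(e|f) as the third numerical feature):

  def load_p_e_given_f(path):
      """Map (source_phrase, target_phrase) -> p(e|f), read as the
      third numerical feature on each phrase table line."""
      table = {}
      with open(path, encoding="utf-8") as fh:
          for line in fh:
              f_phrase, e_phrase, feats = [x.strip() for x in line.split("|||")][:3]
              table[(f_phrase, e_phrase)] = float(feats.split()[2])
      return table

  # load_p_e_given_f("phrase-table")[("deseo saludar", "i welcome")] -> 0.333333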

The question we ask in this project: is it possible to come up with better probabilities for an instance of the Spanish phrase, if you pay attention to the context?

Context-Sensitive Adaptation as WSD

With a moment's thought, it should be clear that this is very similar to the WSD problem, just mapping Spanish phrases to English phrases rather than mapping Spanish words to some Spanish sense inventory like EuroWordNet. The phrase table entries above (assuming they're exhaustive) tell us that the Spanish phrase deseo saludar has five "senses" in English. The probability p(e|f) can be viewed as a context-insensitive prior probability distribution over the senses. (A predecessor to Carpuat and Wu's work attempted a similar approach at the word level: Clara Cabezas and Philip Resnik, "Using WSD Techniques for Lexical Selection in Statistical Machine Translation", Technical report CS-TR-4736/LAMP-TR-124/UMIACS-TR-2005-42, July 2005. The phrase-based approach works better.)

So, how can we create context-sensitive phrase translation probabilities? Here is what I see as the general framework:

  1. From a Spanish-English training set, create a phrase table like the one above, and use that to define a set of disambiguation contexts. This has been done for you. Relevant files are in an XML format like the following:
      <phrases id="Spanish-English, priors from training set">
       <phrase f="el hombre" spans="0(0,2)+2(5,7)+...">
        <translations>
         <e prob="0.31858393">man</e>
         <e prob="0.15929197">the man</e>
         <e prob="0.10176998">men</e>
         <e prob="0.02876109">humans</e>
         <e prob="0.02654869">people</e>
         <e prob="0.02654869">human beings</e>
        </translations>
        <h>4.33165213</h>
       </phrase>
       <phrase f="con el hombre" spans="12(14,17)">
        <translations>
         <e prob="0.42857143">with men</e>
         <e prob="0.14285714">for women</e>
         <e prob="0.14285714">human beings</e>
         ...
        </translations>
        <h>2.12808528</h>
       </phrase>
       ...
      </phrases>


    Each phrase element is a Spanish phrase together with all of its possible English "senses" based on the training data. The spans attribute identifies where this Spanish phrase occurs: in this particular XML file, we see that el hombre occurred in sentence 0 from tokens 0 to 2, in sentence 2 from tokens 5 to 7, etc. Because the prior probabilities p(e|f) are not sensitive to context, we save space by listing all of them within a single phrase element. When you create context-sensitive probabilities, the XML files for the devtest and test data will need to contain a separate phrase element for each individual instance; that is, when you pay attention to context, each context where the phrase occurs has to have its own prob= distribution. Those probabilities should obviously sum to 1.

    The h element is the entropy of the distribution (in bits). In the above example, el hombre has a higher entropy prior distribution than con el hombre. This basically tells us that el hombre is harder to translate.
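
    Computing the entropy yourself is easy if you need it; a minimal sketch:

      import math

      def entropy_bits(probs):
          """H = -sum p * log2(p), in bits, over a discrete distribution."""
          return -sum(p * math.log2(p) for p in probs if p > 0)

      # A uniform distribution over 4 senses has entropy log2(4) = 2 bits;
      # the more peaked the distribution, the lower the entropy.
      print(entropy_bits([0.25, 0.25, 0.25, 0.25]))  # 2.0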

  2. Choose a supervised classification technique that you will use, e.g. naive Bayes, decision trees, SVMs, maximum entropy, etc. This project is not about implementing supervised classification techniques, so you should use an existing software package. Some obvious possibilities include WEKA, SVM-Light, and Mallet. I will inquire with people who have taken this class before to see if there are any particular "gotchas" or recommendations.

  3. Define relevant features of a disambiguation instance. With the information you're given, you could define features like the identities of the words immediately to the left and right of the Spanish phrase, or a bag of the other words in the sentence (see the sketch below).

    In addition, it should be straightforward to run the FreeLing Spanish analyzer on the Spanish data, in order to include features based on part-of-speech tags or even syntactic dependencies. Exploration of the feature space is one of the main points of the assignment. [Update: Jacob Devlin was kind enough to do this run and the tags are now part of the dataset available to the class.]
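
    To make this concrete, here is a rough sketch of a context-window feature extractor (the window size and feature-name scheme are arbitrary choices, not a prescription):

      def context_features(tokens, start, end, window=3):
          """Features for a phrase instance at tokens[start:end]:
          positional left/right neighbors plus a bag of the other
          words in the sentence."""
          feats = {}
          for i in range(max(0, start - window), start):
              feats["left_%d=%s" % (start - i, tokens[i])] = 1
          for i in range(end, min(len(tokens), end + window)):
              feats["right_%d=%s" % (i - end + 1, tokens[i])] = 1
          for tok in tokens[:start] + tokens[end:]:
              feats["bow=%s" % tok] = 1
          return feats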

  4. For a given test set (i.e. the XML file for either the devtest or the test set), possibly select which subset of the Spanish phrases you're going to try to disambiguate, i.e. which phrase entries you're going to pay attention to. It would be great to disambiguate every one of them, but that might turn out to require a lot of classifiers, and you might not have enough time or computational resources; how many you can handle will depend on the software package you use and its memory and disk space requirements. You should design your code to be able to scale up to everything, but if you can't do all of them, make your choices strategically. The greatest impact would come from improving the disambiguation of frequent Spanish phrases for which the prior distribution has high entropy, so you will need to find the right balance between frequency and entropy; a crude scoring sketch follows.
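
    For example (the frequency-times-entropy score here is just an illustration, not a required criterion, and the function name is ours):

      def select_phrases(phrase_stats, k):
          """phrase_stats: Spanish phrase -> (num_instances, prior_entropy).
          Rank by frequency * entropy and keep the top k phrases."""
          scored = sorted(phrase_stats.items(),
                          key=lambda item: item[1][0] * item[1][1],
                          reverse=True)
          return [phrase for phrase, _ in scored[:k]]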

  5. Create training data for your classifier. This means that in the training set, for each instance of a Spanish phrase that you're trying to disambiguate, you need to identify the "correct" English phrase. A simple hack would be to iterate through the possible English phrases and just do a substring match to see which one occurs in the English sentence. A better hack would be to use the alignment information to narrow down the relevant subsequences of the English sentence.
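
    As a sketch of the alignment-based version (a heuristic, not the actual phrase extractor used to build the table): gather the English tokens linked to the Spanish span and see whether they form one of the candidate translations.

      def label_instance(eng_tokens, links, start, end, candidates):
          """Guess the 'correct' English phrase for a Spanish phrase
          spanning source tokens [start, end), given alignment links
          as (source_index, target_index) pairs. Returns a matching
          candidate translation, or None if the heuristic fails."""
          targets = sorted(t for s, t in links if start <= s < end)
          if not targets:
              return None
          aligned = " ".join(eng_tokens[targets[0]:targets[-1] + 1])
          return aligned if aligned in candidates else None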

    The right solution to this problem would be to use the same Spanish-to-English phrase mapper that was used to create the phrase table in the first place. I'm working on getting you a program that will do this, and hopefully I can have it for you by the time you need it.

    Once you've identified the correct answer for each training instance, you can create your pairs, i.e. data pairing feature vectors with correct labels, as suitable for the classifier software you've chosen.

  6. Train your classifier.

  7. Run your classifier on devtest data. Convert the output into the XML format. E.g.
      <phrases id="Spanish-English, context-sensitive probabilities for devtest set">
       <phrase f="el hombre" spans="21(16,18)">
        <translations>
         <e prob="0.01858393">man</e>
         <e prob="0.35929197">the man</e>
         <e prob="0.00001769">men</e>
         <e prob="0.02876109">humans</e>
         ...
        </translations>
        <h>0</h>
       </phrase>
       ...
      </phrases>

    This means that for the instance of el hombre appearing from tokens 16 to 18 in sentence 21 of the devtest data, the classifier assigned the probability distribution as specified by prob=. For classifiers that don't assign probability distributions, e.g. SVMs, it would not be unreasonable to take confidence values assigned by the classifier, normalize, and call it a probability distribution.
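
    For example, a sketch of that normalization (scores must be non-negative; raw SVM margins would need to be shifted or exponentiated first):

      def normalize(scores):
          """Turn per-sense confidence scores into a distribution
          summing to 1; fall back to uniform if all scores are 0."""
          total = sum(scores.values())
          if total == 0:
              return {e: 1.0 / len(scores) for e in scores}
          return {e: s / total for e, s in scores.items()}

      print(normalize({"man": 3.0, "the man": 1.0}))  # {'man': 0.75, 'the man': 0.25}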

  8. Evaluation. How do we decide whether or not your context-sensitive probabilities are better than the prior? You'll be given a program that takes the following arguments:
        The XML file containing the prior probabilities p(e|f) from the phrase table
        The XML file containing the adapted probabilities p(e|f) for a particular test set
        Test set: Target (English) sentences
        Test set: Source (Spanish) sentences
        Word alignment file for this test set
        Verboseness of output (0 for summary, 1 for by-sentence info, 2 for debugging details)
    
    and computes perplexity for the test set it is given. The lower the perplexity, the better you've done.
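
    To make the metric concrete, here is a rough sketch of the computation (the evaluation program you'll be given may differ in details): perplexity is 2 raised to the average negative log probability assigned to the correct translations.

      import math

      def perplexity(probs_of_correct):
          """probs_of_correct: the probability your model assigned to the
          correct translation of each test instance.
          perplexity = 2 ** (-(1/N) * sum(log2(p)))"""
          n = len(probs_of_correct)
          return 2 ** (-sum(math.log2(p) for p in probs_of_correct) / n)

      # Assigning 0.5 to every correct answer yields perplexity 2.0.
      print(perplexity([0.5, 0.5, 0.5]))  # 2.0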

    Note: for any cases not included in your XML file, e.g. because you could not disambiguate everything, the evaluation program just uses the prior distribution. So if you don't disambiguate anything at all, your perplexity would stay the same. It would be good for you to beat this baseline if you can, but note that this is not a textbook exercise. There are no guarantees that you will get positive results. That's ok. The goal is for you to get hands-on experience on a real research problem.

What to Turn In

Advice (for what it's worth): for implementation purposes, think about starting by working with (possibly artificial) subsets of the data and resources, and implement and debug on those before you go to full scale. Although you do have to keep scalability issues in mind, this approach is likely to save you a lot of headaches and time.

Grading

The group will receive a grade-in-common out of 75 points.

For the remaining 25 points, each team member should anonymously rate each other team member as follows, making sure to look at the definitions below to see what the numeric scales are supposed to mean.

Collaboration: 10 means that this person was great to collaborate with and you'd be eager to collaborate with them again, and 1 means you definitely would avoid collaborating with this person again. Give 5 as an average rating for someone who was fine as a collaborator, but about whom you wouldn't feel strongly either way about seeking them out or avoiding them as a collaborator in the future.

Contribution: 10 means that this person did their part and more, over and above the call of duty. 1 means that this person really did not contribute what they were supposed to. Give 5 as an average, "they did what they were expected to" rating. Note that this is a subjective rating relative to what a person was expected to contribute. If five people were contributors and each did a fantastic job on their pieces, then all five could conceivably get a rating of 10; you would not slice up the pie 5 ways and give each person a 2 because they each did 20% of the work! It is your job as a group to work out what the expected contributions will be, to make sure everyone is comfortable with the relative sizes of the contributions, and to recalibrate expectations if you discover you need to. Try to keep things as equitable as possible, but if one person's skills mean they could do a total of 10% of the work compared to another person's 15%, and everyone is ok with this, then both contributors can certainly get a score of higher than 5 if they both do their parts and do them well. If you need help breaking up tasks, agreeing on expectations, etc., I would be happy to meet with the group to assist in working these things out.

Effort: A rating of 3 should be average, with 5 as "superior effort" (whether or not they succeeded) and 1 as "didn't put in the effort." A rating below 3 would not be expected if the person's contribution was 5 or better. If a person just didn't manage to contribute what they were expected to, but you think they did everything in their power to make it happen, you could give them a top rating for effort even while giving them a low contribution score.

A Final Note

This project is ambitious. It attempts to give you an experience doing something real, not just a textbook exercise. The task has not been run end-to-end before at UMD. That means that there might be unanticipated problems, situations where people do not receive inputs they need to get their part done, intra-team politics, interpersonal issues, and who knows what else -- just like in the real world. It also means that something new and interesting might come out of it, which is pretty cool.

Unlike the real world, which is not very forgiving, this is a controlled setting that involves the guidance of an instructor, who can be. Remember that the activity is, first and foremost, a collaborative learning activity, with the emphasis on learning; if there are problems or issues of any kind, let me know sooner rather than later, and I will help to get them worked out. Also feel free to use the mailing list. The presence of multiple teams does not mean that you are competing with each other.