The general setting for this work is the problem of machine translation. State-of-the-art statistical MT systems are trained using a large corpus of parallel sentence pairs in the source and target language, typically also using an automatically-created "best" alignment of source words with target words. For example, the training data might include the pair
deseo saludar la intervención importante y sustancial del presidente prodi .
i too would like to welcome mr prodi 's forceful and meaningful intervention .
with the alignment
0-2 0-3 1-5 8-6 8-7 9-7 7-8 6-9 5-10 4-11 6-11 3-12 10-13
The pair 1-5 in the alignment represents an alignment link between token 1 of the source (saludar) and token 5 of the target (welcome). Notice that tokens are numbered starting with 0. Also notice that alignments are not necessarily one-to-one; e.g. you can see that Spanish token 0 (deseo) is aligned with English tokens 2 and 3 (would like).
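To make the indexing concrete, here is a minimal sketch (the helper name is hypothetical) of parsing such an alignment string into link pairs and grouping the targets of each source token:

```python
# Sketch: parse an alignment string like "0-2 0-3 1-5 ..." into
# (source_index, target_index) pairs. Indices are 0-based, and a token
# may participate in several links (alignments are not one-to-one).

def parse_alignment(alignment_str):
    """Return a list of (source_index, target_index) integer pairs."""
    links = []
    for link in alignment_str.split():
        src, tgt = link.split("-")
        links.append((int(src), int(tgt)))
    return links

links = parse_alignment("0-2 0-3 1-5 8-6 8-7 9-7 7-8 6-9 5-10 4-11 6-11 3-12 10-13")

# Collect all target tokens linked to each source token; e.g. Spanish
# token 0 (deseo) maps to English tokens 2 and 3 (would, like).
targets_of = {}
for s, t in links:
    targets_of.setdefault(s, []).append(t)
```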
In these statistical systems, the training process usually includes the creation of a "phrase table", where "phrase" means a sub-sequence of tokens, not necessarily the sort of things that a linguist would call a phrase. By analyzing the aligned training corpus, systems automatically extract phrase table entries like the following:
deseo saludar ||| i must congratulate ||| 0.00714286 2.33525e-05 0.166667 2.03231e-05 2.718
deseo saludar ||| i should like ||| 7.03284e-05 3.10014e-06 0.166667 0.000497919 2.718
deseo saludar ||| i welcome ||| 0.000550812 0.000134179 0.333333 0.100133 2.718
deseo saludar ||| like to commend ||| 0.0333333 0.000293642 0.166667 0.000208066 2.718
deseo saludar ||| like to congratulate ||| 0.00147275 5.07773e-05 0.1666667 0.000208066 2.718
These entries contain a source phrase, a target translation, and then a set of numerical "features". We will concern ourselves in this project with just one of those features, the conditional probability p(e|f). In these particular phrase table entries, this is the 3rd numerical feature, so you can see, for example, that
p(i welcome | deseo saludar) = 0.333333
Crucially, this probability is not sensitive to context. These phrase table entries tell the system that regardless of the context -- e.g. surrounding words in the sentence -- the phrase deseo saludar has a 33.3% chance of translating as i welcome.
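As a sketch of how you might read such entries (the helper name is hypothetical; the choice of the 3rd feature follows the description above):

```python
# Sketch: read phrase table lines with "|||"-separated fields and pull out
# the prior p(e|f), which in the entries above is the 3rd numerical feature.

def load_priors(lines):
    """Map each source phrase f to a dict of {target phrase e: p(e|f)}."""
    priors = {}
    for line in lines:
        fields = [fld.strip() for fld in line.split("|||")]
        f, e, scores = fields[0], fields[1], fields[2].split()
        priors.setdefault(f, {})[e] = float(scores[2])  # 3rd feature = p(e|f)
    return priors

table = [
    "deseo saludar ||| i welcome ||| 0.000550812 0.000134179 0.333333 0.100133 2.718",
    "deseo saludar ||| like to commend ||| 0.0333333 0.000293642 0.166667 0.000208066 2.718",
]
priors = load_priors(table)
# priors["deseo saludar"]["i welcome"] is 0.333333, as in the text above
```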
The question we ask in this project: is it possible to come up with better probabilities for an instance of the Spanish phrase, if you pay attention to the context?
So, how can we create context-sensitive phrase translation probabilities? Here is what I see as the general framework:
<phrases id="Spanish-English, priors from training set">
  <phrase f="el hombre" spans="0(0,2)+2(5,7)+...">
    <translations>
      <e prob="0.31858393">man</e>
      <e prob="0.15929197">the man</e>
      <e prob="0.10176998">men</e>
      <e prob="0.02876109">humans</e>
      <e prob="0.02654869">people</e>
      <e prob="0.02654869">human beings</e>
    </translations>
    <h>4.33165213</h>
  </phrase>
  <phrase f="con el hombre" spans="12(14,17)">
    <translations>
      <e prob="0.42857143">with men</e>
      <e prob="0.14285714">for women</e>
      <e prob="0.14285714">human beings</e>
      ...
    </translations>
    <h>2.12808528</h>
  </phrase>
  ...
</phrases>
Each phrase element is a Spanish phrase together with all of its possible English "senses" based on the training data. The spans attribute identifies where this Spanish phrase occurs. In this particular XML file, we see that el hombre occurred in sentence 0 from tokens 0 to 2, in sentence 2 from tokens 5 to 7, etc. Because the prior probabilities p(e|f) are not sensitive to context, we save space by listing all of them within a single phrase element. When you create context sensitive probabilities, the XML file for devtest and test data will need to contain a separate phrase element for each individual instance; that is, when you pay attention to context, each context where the phrase occurs has to have its own prob= distribution. Those probabilities should obviously sum to 1.
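When you come to emit the per-instance elements, something along these lines would do (the element and attribute names follow the XML examples in this handout; the helper function itself is hypothetical). Note how it normalizes the scores so the prob= values sum to 1:

```python
# Sketch: build one per-instance <phrase> element, normalizing the model's
# raw scores into a probability distribution before writing prob= values.
from xml.sax.saxutils import escape

def phrase_element(f, span, scored_translations):
    """scored_translations: list of (english, raw_score) for this one instance."""
    total = sum(score for _, score in scored_translations)
    lines = ['<phrase f="%s" spans="%s">' % (escape(f, {'"': '&quot;'}), span),
             '  <translations>']
    for e, score in scored_translations:
        lines.append('    <e prob="%.8f">%s</e>' % (score / total, escape(e)))
    lines.append('  </translations>')
    lines.append('</phrase>')
    return "\n".join(lines)

xml = phrase_element("el hombre", "21(16,18)",
                     [("the man", 3.0), ("man", 1.0)])
```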
The h element is the entropy of the distribution (in bits). In the above example, el hombre has a higher-entropy prior distribution than con el hombre, which tells us that el hombre is harder to translate: its prior leaves more uncertainty about which English sense is correct.
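The entropy computation itself is a one-liner; here is a sketch. (The file's value for el hombre reflects the full distribution, including the translations elided above, so don't expect to reproduce 4.33165213 from just the entries shown.)

```python
# Sketch: entropy in bits of a p(e|f) distribution, as in the <h> element.
import math

def entropy_bits(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

# A uniform distribution over 4 senses has entropy log2(4) = 2 bits;
# a one-hot distribution has entropy 0, i.e. nothing left to disambiguate.
entropy_bits([0.25, 0.25, 0.25, 0.25])  # 2.0
```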
In addition, it should be straightforward to run the FreeLing Spanish analyzer on the Spanish data, in order to include features based on part-of-speech tags or even syntactic dependencies. Exploration of the feature space is one of the main points of the assignment. [Update: Jacob Devlin was kind enough to do this run and the tags are now part of the dataset available to the class.]
The right solution to this problem would be to use the same Spanish-to-English phrase mapper that was used to create the phrase table in the first place. I'm working on getting you a program that will do this, and hopefully I can have it for you by the time you need it.
Once you've identified the correct answer for each training
instance, you can create your labeled training data.
Note: for any cases not included in your XML file, e.g. because you
could not disambiguate everything, the evaluation program just uses the prior distribution.
So if you don't disambiguate anything at all, your perplexity would stay the same. It would
be good for you to beat this baseline if you can, but note that this is not a textbook exercise.
There are no guarantees that you will get positive results. That's ok. The goal is for you to
get hands-on experience on a real research problem.
<phrases id="Spanish-English, context-sensitive probabilities for devtest set">
<phrase f="el hombre" spans="21(16,18)">
<translations>
<e prob="0.01858393">man</e>
<e prob="0.35929197">the man</e>
<e prob="0.00001769">men</e>
<e prob="0.02876109">humans</e>
...
</translations>
<h>0</h>
</phrase>
...
</phrases>
This means that for the instance of el hombre appearing from tokens 16 to 18 in sentence 21
of the devtest data, the classifier assigned the probability distribution as specified by prob=.
For classifiers that don't assign probability distributions, e.g. SVMs, it would not be unreasonable
to take confidence values assigned by the classifier, normalize, and call it a probability distribution.
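One reasonable way to do that normalization is a softmax, sketched below; it has the advantage of handling negative values such as SVM decision margins (simply dividing by the sum only works when all scores are nonnegative). Other normalization schemes would be equally defensible.

```python
# Sketch: turn raw classifier confidence values (possibly negative, e.g.
# SVM decision margins) into a probability distribution via a softmax.
import math

def softmax(scores):
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

probs = softmax([2.1, -0.3, 0.7])
# probs are all positive, sum to 1, and preserve the ranking of the scores
```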
The evaluation program takes the following inputs:

  - The XML file containing prior probabilities p(e|f) from the phrase table
  - The XML file containing the adapted probabilities p(e|f) for a particular test set
  - Test set: Target (English) sentences
  - Test set: Source (Spanish) sentences
  - Word alignment file for this test set
  - Verbosity of output (0 for summary, 1 for per-sentence info, 2 for debugging details)

and computes perplexity for the test set it is given. The lower the perplexity,
the better you've done.
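To build intuition for the metric, here is a sketch of how such a perplexity computation presumably works: average the log2 probability your distributions assign to the correct translation of each test instance, then exponentiate. (The actual evaluation program may differ in its details.)

```python
# Sketch: perplexity over the probabilities assigned to correct answers.
import math

def perplexity(correct_probs):
    """correct_probs: p assigned to the true translation, one per instance."""
    avg_log2 = sum(math.log2(p) for p in correct_probs) / len(correct_probs)
    return 2.0 ** (-avg_log2)

# Putting all mass on the right answers gives perplexity 1 (best possible);
# hedging uniformly over 4 senses gives perplexity 4.
perplexity([1.0, 1.0])    # 1.0
perplexity([0.25, 0.25])  # 4.0
```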
For the remaining 25 points, each team member should anonymously rate each other team member as follows, making sure to look at the definitions below to see what the numeric scales are supposed to mean.
Collaboration: 10 means that this person was great to collaborate with and you'd be eager to collaborate with them again, and 1 means you definitely would avoid collaborating with this person again. Give 5 as an average rating for someone who was fine as a collaborator, but for whom you wouldn't feel strongly about either seeking them out or avoiding them as a collaborator in the future.
Contribution: 10 means that this person did their part and more, over and above the call of duty. 1 means that this person really did not contribute what they were supposed to. Give 5 as an average, "they did what they were expected to" rating. Note that this is a subjective rating relative to what a person was expected to contribute. If five people were contributors and each did a fantastic job on their pieces, then all five could conceivably get a rating of 10; you would not slice up the pie 5 ways and give each person a 2 because they each did 20% of the work! It is your job as a group to work out what the expected contributions will be, to make sure everyone is comfortable with the relative sizes of the contributions, and to recalibrate expectations if you discover you need to. Try to keep things as equitable as possible, but if one person's skills mean they could do a total of 10% of the work compared to another person's 15%, and everyone is ok with this, then both contributors can certainly get a score of higher than 5 if they both do their parts and do them well. If you need help breaking up tasks, agreeing on expectations, etc., I would be happy to meet with the group to assist in working these things out.
Effort: A rating of 3 should be average, with 5 as "superior effort" (whether or not they succeeded) and 1 as "didn't put in the effort". A rating below 3 would not be expected if the person's contribution was 5 or better. If a person just didn't manage to contribute what they were expected to, but you think they did everything in their power to make it happen, you could give them a top rating for effort even while giving them a low contribution score.
This project is ambitious. It attempts to give you an experience doing something real, not just a textbook exercise. The task has not been run end-to-end before at UMD. That means that there might be unanticipated problems, situations where people do not receive inputs they need to get their part done, intra-team politics, interpersonal issues, and who knows what else -- just like in the real world. It also means that something new and interesting might come out of it, which is pretty cool.
Unlike the real world, which is not very forgiving, this is a controlled setting that involves the guidance of an instructor, who can be. Remember that the activity is, first and foremost, a collaborative learning activity, with the emphasis on learning; if there are problems or issues of any kind, let me know sooner rather than later, and I will help to get them worked out. Also feel free to use the mailing list. The presence of multiple teams does not mean that you are competing with each other.