Phrase-based MT
- Download Koehn, "PHARAOH: A Beam Search Decoder for Phrase-Based
Statistical Machine Translation", and read pages 21-23 (i.e., from
the start of Section 2 to the end of Section 2.2).
- As an exercise in phrase extraction, consider the
intersection alignment from Figure 3 (page 23). To keep
this problem small, we will use only the part of the
alignment that covers the strings "slap the green witch" and
"bofetada a la bruja verde" -- that is, just the 4x5 grid in the
lower right-hand corner. (Don't do this exercise on the
entire intersection alignment!) Apply the phrase extraction
heuristic we discussed in class, i.e. extract all phrase pairs
consistent with the alignment, and list them in a table of e/f pairs.
- Compare your table of e/f pairs with the set of phrase pairs in
Figure 5 on page 25, considering only pairs from the same 4x5
sub-grid. What are the differences? Give at least one example
of a phrase pair that you get from one alignment and not the
other, and explain why.
- Extra credit (20%). Consider Figure 2 on page
22. Expand the expression for p(fbar_{1..I}|ebar_{1..I}) -- i.e. the
last equation just before Section 2.2 begins -- for this particular
instance. This is a translation model score that includes both
phrase-to-phrase translation probability and phrase distortion.
Since you don't have a phrase table with probabilities, leave the phi
expressions symbolic rather than converting them to numbers.
You can, however, replace each instance of d(a_i - b_{i-1}) with an
actual number. So the expansion will have the form:
(phi(...)*d * phi(...)*d * ... * phi(...)*d) where each
d is a number. (Note: for any given phrase, "start position"
and "end position" refer to the positions of the first word and the
last word in the phrase, respectively. E.g. "nach Kanada" has start
position 4 and end position 5, assuming words are numbered from 1.)
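
To make the consistency criterion in the phrase extraction exercise concrete, here is a minimal brute-force sketch in Python. It is only an illustration, not the implementation used in PHARAOH or in class. The function name extract_phrase_pairs, the max_len parameter, and the toy sentence pair with its alignment are all assumptions made up for this example (deliberately not the Figure 3 grid). A phrase pair is treated as consistent if it contains at least one alignment point and no alignment point links a word inside one phrase to a word outside the other.

```python
from itertools import product


def spans(n, max_len):
    """All (start, end) index pairs, inclusive, of length <= max_len."""
    return [(s, t) for s in range(n) for t in range(s, min(s + max_len, n))]


def extract_phrase_pairs(e_words, f_words, alignment, max_len=7):
    """Return every (e_phrase, f_phrase) pair consistent with the alignment.

    alignment is a set of 0-based (e_index, f_index) alignment points.
    """
    pairs = set()
    for (e1, e2), (f1, f2) in product(spans(len(e_words), max_len),
                                      spans(len(f_words), max_len)):
        # Require at least one alignment point inside the candidate box.
        has_point = any(e1 <= e <= e2 and f1 <= f <= f2 for e, f in alignment)
        # Reject the box if any point has one end inside and the other outside.
        crossing = any((e1 <= e <= e2) != (f1 <= f <= f2) for e, f in alignment)
        if has_point and not crossing:
            pairs.add((" ".join(e_words[e1:e2 + 1]),
                       " ".join(f_words[f1:f2 + 1])))
    return pairs


# Hypothetical toy example: "does" is unaligned, "not" aligns to "ne" and "pas".
e_words = "he does not sleep".split()
f_words = "il ne dort pas".split()
alignment = {(0, 0), (2, 1), (2, 3), (3, 2)}

pairs = extract_phrase_pairs(e_words, f_words, alignment)
for e_phrase, f_phrase in sorted(pairs):
    print(f"{e_phrase} ||| {f_phrase}")
```

On this toy input, ("not", "ne") is not extracted, because "not" is also aligned to "pas", which lies outside the candidate foreign phrase. The unaligned word "does" can be attached to a neighboring phrase, which is why both ("he", "il") and ("he does", "il") appear.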
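
The bookkeeping for the distortion arguments a_i - b_{i-1} can be sketched the same way. The helper below is hypothetical (the name distortion_args is not from the paper), and it assumes b_0 = 0, i.e. that the position before the first foreign word counts as position 0. The segmentation shown is invented for illustration and is not the one in Figure 2.

```python
def distortion_args(f_spans):
    """Compute a_i - b_{i-1} for each phrase, in English output order.

    f_spans lists the (start, end) positions, 1-based and inclusive, of the
    foreign phrase translated into the i-th English phrase.
    Assumption: b_0 = 0 for the first phrase.
    """
    args = []
    prev_end = 0  # b_0, assumed to be 0
    for start, end in f_spans:
        args.append(start - prev_end)  # a_i - b_{i-1}
        prev_end = end                 # becomes b_i for the next phrase
    return args


# Hypothetical segmentation of a 6-word foreign sentence, in the order the
# English phrases are produced: words 1, then 3-4, then 2, then 5-6.
example = [(1, 1), (3, 4), (2, 2), (5, 6)]
print(distortion_args(example))
```

Under these assumptions a phrase that directly follows the previous one gets the argument 1, a forward jump gets a larger positive value, and a jump back to earlier foreign words gets a negative value.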