Abstract | In this paper, we discuss how we apply automatically generated semantic knowledge to benefit statistical machine translation (SMT). Currently, almost all statistical machine translation systems rely heavily on
memorizing translations of phrases. Some systems attempt to go further and generalize these learned phrase
translations into templates using empirically derived information about word alignments and a small amount
of syntactic information, if any. Several issues in an SMT pipeline could be addressed by the
application of semantic knowledge, if such knowledge were easily available. One important such issue
is reference sparsity. The fundamental problem that translation systems face is that there
is no such thing as the correct translation for any sentence. In fact, any given source sentence can often be
translated into the target language in many valid ways. Since there can be many “correct answers,” almost
all models employed by SMT systems require, in addition to a large bitext, a held-out development set
composed of multiple high-quality, human-authored reference translations in the target language in order to
tune their parameters relative to a translation quality metric.1 There are several reasons that this requirement
is not an easy one to satisfy. First, with a few exceptions—notably NIST’s annual MT evaluations—most
new MT research data sets are provided with only a single reference translation. Second, obtaining multiple
reference translations in rapid-development, low-density source language scenarios (e.g., Oard, 2003) is
likely to be severely constrained (or made entirely impractical) by time, cost, and the limited availability
of qualified translators.
|