Using Comparable Corpora to Adapt a Translation Model to Domains
Hiroyuki Kaji, Takashi Tsunakawa, Daisuke Okada
Department of Computer Science, Shizuoka University
3-5-1 Johoku, Naka-ku, Hamamatsu-shi, 432-8011, Japan
{kaji, tuna}@inf.shizuoka.ac.jp
Abstract
Statistical machine translation (SMT) requires a large parallel corpus, which is available only for restricted language pairs and domains.
To expand the language pairs and domains to which SMT is applicable, we created a method for estimating translation
pseudo-probabilities from bilingual comparable corpora. The essence of our method is to calculate pairwise correlations between the
words associated with a source-language word, presently restricted to a noun, and its translations; word translation
pseudo-probabilities are calculated based on the assumption that the more associated words a translation is correlated with, the higher
its translation probability. We also describe a method we created for calculating noun-sequence translation pseudo-probabilities based
on occurrence frequencies of noun sequences and constituent-word translation pseudo-probabilities. Then, we present a framework for
merging the translation pseudo-probabilities estimated from in-domain comparable corpora with a translation model learned from an
out-of-domain parallel corpus. Experiments using Japanese and English comparable corpora of scientific paper abstracts and a
Japanese-English parallel corpus of patent abstracts showed promising results; the BLEU score was improved to some degree by
incorporating the pseudo-probabilities estimated from the in-domain comparable corpora. Future work includes optimizing the
parameters and extending the method to estimate translation pseudo-probabilities for verbs.
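
The two ideas summarized in the abstract, scoring each translation by how many associated words it is correlated with and then merging the result with an out-of-domain translation model, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the function names, the correlation threshold, the normalization by total support, the linear interpolation with a single weight, and the toy correlation values. Noun-sequence pseudo-probabilities are not covered.

```python
from collections import defaultdict


def word_pseudo_probs(correlation, threshold=0.0):
    """Estimate word translation pseudo-probabilities (illustrative only).

    correlation maps (associated_word, translation) -> correlation score.
    A translation gains one unit of support for every associated word
    whose score exceeds the threshold; supports are then normalized.
    The threshold and the normalization are assumptions of this sketch.
    """
    support = defaultdict(int)
    translations = set()
    for (_assoc, trans), score in correlation.items():
        translations.add(trans)
        if score > threshold:
            support[trans] += 1
    total = sum(support.values())
    if total == 0:
        # No evidence for any candidate: fall back to a uniform distribution.
        return {t: 1.0 / len(translations) for t in translations}
    return {t: support[t] / total for t in translations}


def merge(p_out, p_in, weight=0.5):
    """Linearly interpolate an out-of-domain model with in-domain
    pseudo-probabilities (a hypothetical merging scheme)."""
    keys = set(p_out) | set(p_in)
    return {
        k: (1.0 - weight) * p_out.get(k, 0.0) + weight * p_in.get(k, 0.0)
        for k in keys
    }


# Toy data: one source noun with two candidate translations.
toy_correlation = {
    ("voltage", "battery"): 0.8,
    ("voltage", "cell"): 0.3,
    ("charge", "battery"): 0.6,
    ("charge", "cell"): 0.0,
    ("electrode", "battery"): 0.5,
    ("electrode", "cell"): 0.4,
}
in_domain = word_pseudo_probs(toy_correlation)
merged = merge({"battery": 0.2, "cell": 0.8}, in_domain, weight=0.5)
```

With the toy values, the translation supported by more associated words receives the larger in-domain pseudo-probability, and the interpolation weight controls how far the merged estimate moves away from the out-of-domain model.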