Using Comparable Corpora to Adapt a Translation Model to Domains

Hiroyuki Kaji, Takashi Tsunakawa, Daisuke Okada
Department of Computer Science, Shizuoka University
3-5-1 Johoku, Naka-ku, Hamamatsu-shi, 432-8011, Japan
{kaji, tuna}@inf.shizuoka.ac.jp

Abstract
Statistical machine translation (SMT) requires a large parallel corpus, which is available only for restricted language pairs and domains.
To expand the language pairs and domains to which SMT is applicable, we created a method for estimating translation
pseudo-probabilities from bilingual comparable corpora. The essence of our method is to calculate pairwise correlations between the
words associated with a source-language word, presently restricted to a noun, and its translations; word translation
pseudo-probabilities are calculated based on the assumption that the more associated words a translation is correlated with, the higher
its translation probability. We also describe a method we created for calculating noun-sequence translation pseudo-probabilities based
on occurrence frequencies of noun sequences and constituent-word translation pseudo-probabilities. Then, we present a framework for
merging the translation pseudo-probabilities estimated from in-domain comparable corpora with a translation model learned from an
out-of-domain parallel corpus. Experiments using Japanese and English comparable corpora of scientific paper abstracts and a
Japanese-English parallel corpus of patent abstracts showed promising results; the BLEU score was improved to some degree by
incorporating the pseudo-probabilities estimated from the in-domain comparable corpora. Future work includes an optimization of the
parameters and an extension to estimate translation pseudo-probabilities for verbs.
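
The core assumption stated above (a translation correlated with more of the source word's associated words should receive a higher pseudo-probability) can be illustrated with a toy sketch. The abstract gives no formulas, so everything below is an assumption made for illustration: the function names `pseudo_probabilities` and `interpolate`, the use of normalized support counts as the score, the English glosses standing in for source-language associated words, and the linear interpolation used as a generic stand-in for the merging framework. It is not the authors' actual estimation or merging procedure.

```python
def pseudo_probabilities(associated_words, correlated):
    """Toy word translation pseudo-probabilities (illustrative assumption).

    associated_words: words associated with the source-language noun.
    correlated: maps each candidate translation to the set of associated
        words it is correlated with (assumed to be computed elsewhere
        from the comparable corpora).
    Returns a dict mapping each translation to its normalized support count.
    """
    assoc = set(associated_words)
    # Support = number of associated words the translation is correlated with.
    support = {t: len(assoc & set(words)) for t, words in correlated.items()}
    total = sum(support.values())
    if total == 0:
        # No evidence at all: fall back to a uniform distribution.
        n = len(support)
        return {t: 1.0 / n for t in support} if n else {}
    return {t: count / total for t, count in support.items()}


def interpolate(p_in_domain, p_out_domain, weight=0.5):
    """Linear interpolation of two translation tables (a generic technique,
    assumed here; the abstract does not specify the merging framework)."""
    keys = set(p_in_domain) | set(p_out_domain)
    return {
        k: weight * p_in_domain.get(k, 0.0)
        + (1.0 - weight) * p_out_domain.get(k, 0.0)
        for k in keys
    }


# Hypothetical data: English glosses stand in for source-language words.
associated = ["loan", "interest", "river", "account"]
correlated = {
    "bank": {"loan", "interest", "account"},
    "embankment": {"river"},
}
in_domain = pseudo_probabilities(associated, correlated)
out_domain = {"bank": 0.5, "embankment": 0.5}  # hypothetical out-of-domain model
merged = interpolate(in_domain, out_domain, weight=0.5)
```

In this toy data, "bank" is supported by three of the four associated words and "embankment" by one, so the in-domain scores favor "bank"; the interpolation weight then controls how far the out-of-domain table is pulled toward those scores.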