Creating a Reusable English-Chinese Parallel Corpus for Bilingual Dictionary Construction
Hercules Dalianis, Hao-chun Xing, Xin Zhang
Department of Computer and Systems Sciences (DSV)
KTH-Stockholm University
Forum 100, 164 40 Kista, Sweden
E-mail: {hercules,haoc-xin,xin-zhan}@dsv.su.se
Abstract
This paper describes an experiment in which we construct an English-Chinese parallel corpus, apply the Uplug word alignment
tool to the corpus, and finally produce and evaluate an English-Chinese word list. The Stockholm English-Chinese Parallel Corpus
(SEC) was created by downloading English-Chinese parallel texts from a Chinese web site containing law texts that have been
manually translated from Chinese to English. The parallel corpus contains 104 563 Chinese characters, equivalent to 59 918 Chinese
words, and the corresponding English corpus contains 75 766 English words. However, Chinese writing does not use any
delimiters to mark word boundaries, so we had to carry out word segmentation as a preprocessing step on the Chinese corpus.
Moreover, since the parallel corpus was downloaded from the Internet, it is noisy with regard to the alignment between corresponding
translated sentences. We therefore spent 60 hours of manual work aligning the sentences in the English and Chinese parallel corpus
before performing automatic word alignment using Uplug. The word alignment with Uplug was carried out from English to Chinese.
Nine respondents evaluated the resulting English-Chinese word list, restricted to entries with a frequency of three or higher, and
we obtained an accuracy of 73.1 percent.
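
The last step of the abstract (keeping only word pairs with a frequency of three or higher and then scoring them) can be illustrated with a small sketch. This is not the authors' code and does not use Uplug: the function names `build_word_list` and `accuracy`, the toy word pairs, and the idea of one correct/incorrect verdict per pair are all assumptions made for illustration. The abstract does not say how the nine respondents' judgments were combined.

```python
from collections import Counter


def build_word_list(aligned_pairs, min_freq=3):
    """Count (english, chinese) alignment links and keep the frequent ones.

    aligned_pairs: iterable of (english_word, chinese_word) tuples, one per
    link produced by some word aligner (hypothetical input format).
    """
    counts = Counter(aligned_pairs)
    return {pair: n for pair, n in counts.items() if n >= min_freq}


def accuracy(judgments):
    """Fraction of word pairs judged correct.

    judgments: dict mapping a word pair to True (correct) or False.
    Assumes a single combined verdict per pair.
    """
    if not judgments:
        return 0.0
    return sum(judgments.values()) / len(judgments)


# Toy alignment links, invented for this example.
links = (
    [("law", "法律")] * 4
    + [("court", "法院")] * 3
    + [("state", "国家")] * 2   # below the threshold, will be dropped
    + [("law", "规定")] * 3
)

word_list = build_word_list(links, min_freq=3)

# Hypothetical verdicts for the pairs that survived the threshold.
judgments = {
    ("law", "法律"): True,
    ("court", "法院"): True,
    ("law", "规定"): False,
}
score = accuracy(judgments)
```

A frequency threshold like this trades coverage for precision: rare links are more likely to be alignment noise, so dropping them shrinks the word list but tends to raise the share of correct entries.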