


Improved Sentence Alignment for Building a Parallel Subtitle Corpus



Jörg Tiedemann
University of Groningen
Abstract
This paper presents ongoing work on creating an extensive multilingual parallel corpus of movie
subtitles. The corpus currently contains roughly 23,000 pairs of aligned subtitles
covering about 2,700 movies in 29 languages. Subtitles mainly consist of transcribed
speech, sometimes in a very condensed form. Insertions, deletions, and paraphrases are very
frequent, which makes subtitles a challenging data set to work with, especially when applying
automatic sentence alignment. Standard alignment approaches rely on translation consistency,
either in terms of length, in terms of term translations, or in a combination of both. In this
paper, we show that these approaches are not applicable to subtitles, and we propose a new
alignment approach based on time overlaps that is specifically designed for subtitles. In our
experiments, we obtain a significant improvement in alignment accuracy compared to standard
length-based approaches.
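
To illustrate the general idea of aligning by time overlap, the following is a minimal sketch and not the implementation evaluated in this paper. It assumes that each subtitle segment carries a start and end time in seconds, that the segments of each language are sorted by time and do not overlap one another, and that both subtitle files are synchronized to the same video. The names `Segment`, `overlap`, `align_by_overlap`, and the threshold `min_ratio` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # display start time in seconds
    end: float    # display end time in seconds
    text: str


def overlap(a: Segment, b: Segment) -> float:
    """Length in seconds of the time span shared by two segments."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def align_by_overlap(src, tgt, min_ratio=0.5):
    """Group source and target segments whose time spans overlap.

    Two segments are linked when their overlap covers at least
    `min_ratio` of the shorter segment (an assumed heuristic).
    Linked segments are merged into alignment units, so one unit
    may be 1:1, 1:2, 2:1, and so on. Segments without any link
    are left unaligned and do not appear in the result.
    """
    links = []
    for i, s in enumerate(src):
        for j, t in enumerate(tgt):
            shorter = min(s.end - s.start, t.end - t.start)
            if shorter > 0 and overlap(s, t) / shorter >= min_ratio:
                links.append((i, j))

    # Because both lists are time-ordered and internally non-overlapping,
    # a new link can only extend the most recent group.
    groups = []
    for i, j in links:
        if groups and (i in groups[-1][0] or j in groups[-1][1]):
            groups[-1][0].add(i)
            groups[-1][1].add(j)
        else:
            groups.append(({i}, {j}))
    return [(sorted(s_ids), sorted(t_ids)) for s_ids, t_ids in groups]


if __name__ == "__main__":
    english = [
        Segment(0.0, 2.0, "Hello."),
        Segment(2.5, 5.0, "How are you today?"),
        Segment(9.0, 10.0, "[music]"),
    ]
    german = [
        Segment(0.2, 2.1, "Hallo."),
        Segment(2.6, 3.8, "Wie geht es"),
        Segment(3.9, 5.1, "dir heute?"),
    ]
    for s_ids, t_ids in align_by_overlap(english, german):
        print([english[i].text for i in s_ids], "<->", [german[j].text for j in t_ids])
```

In this toy example, the second English segment is grouped with two German segments because both fall inside its time span, while the final English segment has no counterpart and stays unaligned. The sketch uses no length or lexical information at all, which is the contrast with the standard approaches described above; in practice, subtitle files for the same movie may differ in timing offset and speed, which this sketch does not handle.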