Tanaka_Corpus


http://www.edrdg.org/wiki/index.php/Tanaka_Corpus

This page provides some brief documentation for the Tanaka Corpus of parallel Japanese-English sentences, and in particular the modification and editing that has been carried out to enable use of the corpus as a source of examples in the WWWJDIC dictionary server and other systems.

The corpus was compiled by Professor Yasuhito Tanaka at Hyogo University and his students, as described in his Pacling2001 paper. At Pacling2001 Professor Tanaka released copies of the corpus, and stated that it is in the public domain. According to Professor Christian Boitet, Professor Tanaka did not think the collection was of a very good standard. (Sadly, Prof. Tanaka died in early 2003.)

At the 2002 Papillon workshop in Tokyo, Professor Boitet included a copy of the corpus on a CD distributed to participants, and suggested that it might serve as a source of examples in a dictionary. Jim Breen realised it had the potential to be a source of example sentences in the WWWJDIC server. He edited, reformatted and indexed the corpus, and linked it at the word level to the dictionary function in the server (see below).

The inclusion of the Corpus in the WWWJDIC server exposed it to a wide audience, and a number of other systems incorporated the corpus into their operation. It also began to be used in some research projects in natural language processing.

In 2006 the Corpus was incorporated into the Tatoeba Project being developed by Trang Ho to provide a sentence-based multi-lingual resource. That project is now the "home" of the Corpus.
