Corpus

Reuters Corpora

http://corpus.leeds.ac.uk/

Reuters Corpora (RCV1, RCV2, TRC2)

In 2000, Reuters Ltd made available a large collection of Reuters News stories for use in research and development of natural language processing, information retrieval, and machine learning systems. This corpus, known as "Reuters Corpus, Volume 1" or RCV1, is significantly larger than the older, well-known Reuters-21578 collection heavily used in the text classification community.

In the fall of 2004, NIST took over distribution of RCV1 and any future Reuters Corpora. You can now obtain these datasets by sending a request to NIST and signing the agreements below.

What's available

RCV1

Reuters Corpus, Volume 1, English language, 1996-08-20 to 1997-08-19 (Release date 2000-11-03, Format version 1, correction level 0)

This is distributed on two CDs and contains about 810,000 English-language Reuters news stories. The uncompressed files require about 2.5 GB of storage.

RCV2

Reuters Corpus, Volume 2, Multilingual Corpus, 1996-08-20 to 1997-08-19 (Release date 2005-05-31, Format version 1, correction level 0)

This is distributed on one CD and contains over 487,000 Reuters News stories in thirteen languages (Dutch, French, German, Chinese, Japanese, Russian, Portuguese, Spanish, Latin American Spanish, Italian, Danish, Norwegian, and Swedish). The stories are NOT PARALLEL, but are written by local reporters in each language. These stories are contemporaneous with RCV1, but some languages do not cover the entire time period.

TRC2

Thomson Reuters Text Research Collection (TRC2)

The TRC2 corpus comprises 1,800,370 news stories (2,871,075,221 bytes) covering the period from 2008-01-01 00:00:03 to 2009-02-28 23:54:14. It was initially made available to participants of the 2009 blog track at the Text Retrieval Conference (TREC) to supplement the BLOGS08 corpus, which contains the results of a large blog crawl carried out at the University of Glasgow. TRC2 is distributed via web download.
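As a rough check when planning the download, the story count and byte count quoted for TRC2 imply an average of roughly 1,600 bytes per story. A minimal sketch of that arithmetic follows, assuming the byte count is the total size of the collection as distributed (the description does not say whether it is compressed); the function name is illustrative only.

```python
# Figures quoted in the TRC2 description.
TRC2_STORIES = 1_800_370
TRC2_BYTES = 2_871_075_221


def average_story_bytes(total_bytes: int, story_count: int) -> float:
    """Return the mean number of bytes per story."""
    return total_bytes / story_count


avg = average_story_bytes(TRC2_BYTES, TRC2_STORIES)
print(f"{avg:.0f} bytes per story on average")
print(f"{TRC2_BYTES / 1024**3:.2f} GiB in total")
```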

The stories in the Reuters Corpus are under the copyright of Reuters Ltd and/or Thomson Reuters, and their use is governed by the following agreements:

Organizational agreement
This agreement must be signed by the person responsible for the data at your organization, and sent to NIST.
Individual agreement
This agreement must be signed by all researchers using the Reuters Corpus at your organization, and kept on file at your organization.

Getting the corpus

  1. Download and print the Organizational and Individual agreement forms above.
  2. Send the Organizational form to NIST as a scanned PDF file:

    Complete the Reuters Organizational form and email a PDF of it to:
    reuters-request@nist.gov
    In your email, include the following:
    Subject: request for Reuters corpus
    Body: your name, your complete postal address, and whether you are requesting RCV1, RCV2, TRC2, or all three.
    (Do not include other correspondence in this message.)
  3. Complete the Individual agreement form and keep it on file at your organization.
  4. Subject to our approval, you will receive the corpus CDs by mail (for RCV1 and RCV2) and/or a download URL, login, and password by email (for TRC2). Please allow seven business days for a response.
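
The request email in step 2 can be assembled programmatically. The sketch below uses Python's standard `email.message.EmailMessage` and only builds the message; it does not send it. The recipient address and subject line come from the steps above. The function name `build_request`, its parameters, the body layout, and the attachment filename are assumptions made for illustration, not anything NIST prescribes.

```python
from email.message import EmailMessage

REQUEST_ADDRESS = "reuters-request@nist.gov"    # address given in step 2
REQUEST_SUBJECT = "request for Reuters corpus"  # subject given in step 2
VALID_CORPORA = ("RCV1", "RCV2", "TRC2")


def build_request(name, postal_address, corpora, agreement_pdf, sender=None):
    """Build (but do not send) the request email.

    agreement_pdf is the raw bytes of the scanned, signed
    Organizational agreement.
    """
    unknown = [c for c in corpora if c not in VALID_CORPORA]
    if not corpora or unknown:
        raise ValueError(f"corpora must be a non-empty subset of {VALID_CORPORA}")

    msg = EmailMessage()
    if sender:
        msg["From"] = sender
    msg["To"] = REQUEST_ADDRESS
    msg["Subject"] = REQUEST_SUBJECT

    # The body carries only the three requested items, with no other
    # correspondence. This exact layout is an assumption.
    msg.set_content(
        f"Name: {name}\n"
        f"Postal address: {postal_address}\n"
        f"Corpora requested: {', '.join(corpora)}\n"
    )

    # Attach the scanned form as a PDF; the filename is hypothetical.
    msg.add_attachment(
        agreement_pdf,
        maintype="application",
        subtype="pdf",
        filename="reuters_organizational_agreement.pdf",
    )
    return msg
```

To use it, read the scanned form with `open(path, "rb").read()`, pass the bytes as `agreement_pdf`, and hand the returned message to whatever mail client or `smtplib` session your organization uses.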

If you have already obtained some of the Reuters corpora and wish to obtain others, send an email to reuters-request@nist.gov. Please provide the name of your organization, the month and year in which you requested RCV1, RCV2, or TRC2, and the corpus you are interested in receiving. An Organizational agreement must be on file at NIST.



Reuters Corpora Resources

The article,

Lewis, D. D.; Yang, Y.; Rose, T.; and Li, F. RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, 5:361-397, 2004. http://www.jmlr.org/papers/volume5/lewis04a/lewis04a.pdf.

provides an extensive description of RCV1 and its category coding. Several on-line appendices, including tokenized versions of the collection, can be found via David Lewis' website.