.. _corpora:

*******************
Sussex NLTK Corpora
*******************

We provide access to a number of corpora that are not distributed with NLTK.
Each corpus has its own corpus reader in the :mod:`corpus_readers` module.

====================
Amazon Review Corpus
====================

The Amazon Review Corpus consists of user-written product reviews from
amazon.com. The corpus has four categories, ``dvd``, ``book``, ``kitchen``
and ``electronics``. Each category is divided into three sentiment classes,
``positive``, ``negative`` and ``neutral``, according to the true sentiment
expressed in the review. The sentiment of each review has been determined
automatically from the number of stars the reviewer gave the product. The
positive and negative sentiment classes contain 1000 reviews each; the
neutral sentiment class contains a varying number of documents for each
category.

The corpus is provided as raw text. Any sentence segmentation or
tokenisation is done using tools provided by NLTK.

==============
Reuters Corpus
==============

The Reuters corpus is a subset of the entire RCV1 corpus. It contains news
articles from the period 1996-08-20 to 1997-08-19. There are two categories,
``finance`` and ``sport``, determined according to the categorisation given
by Reuters.

The corpus is provided as raw text. Any sentence segmentation or
tokenisation is done using tools provided by NLTK.

==============
Twitter Corpus
==============

The Twitter corpus is a collection of tweets related to Team GB during the
2012 London Olympics. The tweets were collected between the 7th and 8th of
August 2012 and are ordered by time.

The corpus is provided as raw text. Any sentence segmentation is done using
tools provided by NLTK. Tokenisation is done using the CMU Twitter-specific
tokeniser.

==============
Medline Corpus
==============

The Medline corpus contains abstracts of scientific medical papers.

The corpus is provided as raw text. Any sentence segmentation is done using
tools provided by NLTK. Tokenisation is done using the CMU Twitter-specific
tokeniser.

==========================
Wall Street Journal Corpus
==========================

The Wall Street Journal corpus is a subset of the Penn Treebank and contains
news articles from the Wall Street Journal. The corpus is provided sentence
segmented, tokenised and part-of-speech tagged.
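The Amazon Review Corpus section above notes that each review's sentiment
class is derived automatically from its star rating. A minimal sketch of such
a mapping, assuming the common convention that 4-5 stars count as positive,
1-2 stars as negative and 3 stars as neutral (the exact thresholds used for
this corpus are not stated above, so treat them as an illustration only):

```python
def sentiment_from_stars(stars):
    """Map a 1-5 star rating to a sentiment class.

    The thresholds here are an assumption; the corpus documentation
    does not specify which ratings map to which class.
    """
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return "neutral"


# Example: bucket a small batch of (review text, stars) pairs by the
# sentiment class derived from their star ratings.
reviews = [("Great blender", 5), ("Broke in a week", 1), ("It is okay", 3)]
by_class = {}
for text, stars in reviews:
    by_class.setdefault(sentiment_from_stars(stars), []).append(text)
```

After running, ``by_class`` groups the review texts under the three class
labels used by the corpus (``positive``, ``negative`` and ``neutral``).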