We provide access to a number of corpora that are not distributed with NLTK. Each corpus has its own corpus reader in the corpus_readers module.
The Amazon Review Corpus consists of user-written product reviews from amazon.com.
The corpus has four product categories: dvd, book, kitchen and electronics. Each category is divided into three sentiment classes (positive, negative and neutral) according to the sentiment expressed in the review. The review sentiment has been determined automatically from the number of stars the reviewer gave the product.
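The star-based labelling described above can be sketched as follows. This is a minimal illustration only: the function name `sentiment_from_stars` and the cut-off values (1-2 stars negative, 3 neutral, 4-5 positive) are assumptions made for the example, not the documented rule used to build the corpus.

```python
def sentiment_from_stars(stars, negative_max=2, positive_min=4):
    """Map a 1-5 star rating to a sentiment class.

    The default cut-offs are illustrative assumptions; the thresholds
    actually used to label the corpus are not stated in this document.
    """
    if not 1 <= stars <= 5:
        raise ValueError(f"star rating out of range: {stars}")
    if stars <= negative_max:
        return "negative"
    if stars >= positive_min:
        return "positive"
    return "neutral"


# Label a few hypothetical ratings.
labels = [sentiment_from_stars(s) for s in (1, 3, 5)]
print(labels)
```

Keeping the thresholds as parameters makes it easy to try a different split if the assumed one does not match the data.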
The positive and negative sentiment classes contain 1000 reviews. The neutral sentiment class contains a varying number of reviews for each category.
The corpus is provided as raw text. Any sentence segmentation and/or tokenisation is done using tools provided by NLTK.
The Reuters corpus is a subset of the entire RCV1 corpus. It contains news articles from the period 1996-08-20 to 1997-08-19.
There are two categories: finance and sport. The categories follow the categorisation given by Reuters.
The corpus is provided as raw text. Any sentence segmentation and/or tokenisation is done using tools provided by NLTK.
The Twitter corpus is a collection of tweets related to teamGB during the 2012 London Olympics. The tweets were collected between the 7th and 8th of August and are ordered by time.
The corpus is provided as raw text. No sentence segmentation is done. Tokenisation is done using the CMU Twitter-specific tokeniser.
The Medline corpus contains abstracts of scientific medical papers.
The corpus is provided as raw text. Any sentence segmentation is done using tools provided by NLTK. Tokenisation is done using the CMU Twitter-specific tokeniser.
The Wall Street Journal corpus is a subset of the Penn Treebank and contains news articles from the Wall Street Journal.
The corpus is provided sentence-segmented, tokenised and part-of-speech tagged.
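To illustrate what sentence-segmented, tokenised and part-of-speech tagged data looks like once loaded, the sketch below parses lines in a `token/TAG` notation into lists of (token, tag) pairs. The on-disk format, the helper name `parse_tagged_sentence` and the two sample sentences are assumptions made up for this example; the actual corpus reader in corpus_readers may expose the data differently.

```python
def parse_tagged_sentence(line, sep="/"):
    """Split a 'token/TAG token/TAG ...' line into (token, tag) pairs.

    The last separator in each item is used, so a token that itself
    contains the separator (e.g. '1/2/CD') keeps its full text.
    """
    pairs = []
    for item in line.split():
        token, found, tag = item.rpartition(sep)
        if not found:
            # No separator present: keep the token and leave it untagged.
            token, tag = item, None
        pairs.append((token, tag))
    return pairs


# Hypothetical sample lines, one sentence per line (not taken from the corpus).
sample = [
    "The/DT market/NN rose/VBD ./.",
    "Analysts/NNS were/VBD surprised/VBN ./.",
]

# One list of (token, tag) pairs per sentence.
tagged_sents = [parse_tagged_sentence(s) for s in sample]
for sent in tagged_sents:
    print(sent)
```

A list of sentences, each a list of (token, tag) tuples, is the shape NLTK's own tagged corpus readers conventionally return, which is why the sketch uses it.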