Sussex NLTK Corpora

We provide access to a number of corpora that are not distributed with NLTK. Each corpus has its own corpus reader in the corpus_readers module.

Amazon Review Corpus

The Amazon Review Corpus consists of user-written product reviews from amazon.com.

The corpus has four categories: dvd, book, kitchen and electronics. Each category is divided into three sentiment classes (positive, negative and neutral) according to the sentiment expressed in the review. The sentiment of each review has been determined automatically from the number of stars the reviewer gave the product.
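The star-based labelling described above can be sketched as a simple mapping. The thresholds below (1-2 stars negative, 3 stars neutral, 4-5 stars positive) and the function name are assumptions made for illustration; the exact cut-offs used to build the corpus are not stated here.

```python
def sentiment_from_stars(stars: int) -> str:
    """Map a 1-5 star rating to a sentiment class.

    The thresholds are assumed for illustration only; they are not
    taken from the corpus documentation.
    """
    if not 1 <= stars <= 5:
        raise ValueError(f"expected a rating from 1 to 5, got {stars}")
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"


# Label a few hypothetical ratings.
labels = [sentiment_from_stars(s) for s in (1, 3, 5)]
print(labels)
```

Labels derived this way reflect the rating rather than a manual reading of the text, so a review's wording and its class can occasionally disagree.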

The positive and negative sentiment classes each contain 1000 reviews. The neutral sentiment class contains a varying number of reviews for each category.

The corpus is provided as raw text. Any sentence segmentation and/or tokenisation is done using tools provided by NLTK.

Reuters Corpus

The Reuters corpus is a subset of the entire RCV1 corpus. It contains news articles from the period 1996-08-20 to 1997-08-19.

There are two categories: finance and sport. The categories have been determined according to the categorisation given by Reuters.

The corpus is provided as raw text. Any sentence segmentation and/or tokenisation is done using tools provided by NLTK.

Twitter Corpus

The Twitter corpus is a collection of tweets related to teamGB during the 2012 London Olympics. The tweets were collected between the 7th and 8th of August and are ordered by time.

The corpus is provided as raw text. No sentence segmentation is done. Tokenisation is done using the CMU Twitter-specific tokeniser.

Medline Corpus

The Medline corpus contains abstracts of scientific medical papers.

The corpus is provided as raw text. Any sentence segmentation is done using tools provided by NLTK. Tokenisation is done using the CMU Twitter-specific tokeniser.

Wall Street Journal Corpus

The Wall Street Journal corpus is a subset of the Penn Treebank and contains news articles from the Wall Street Journal.

The corpus is provided sentence-segmented, tokenised and part-of-speech tagged.
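As a rough illustration of what sentence-segmented, tokenised and tagged data looks like, the sketch below uses the common NLTK convention of representing each sentence as a list of (token, tag) pairs. The sample sentences and the helper names are hypothetical, and the representation is an assumption; the actual return type of the corpus reader is not described here.

```python
from collections import Counter

# Hypothetical sample: each sentence is a list of (token, tag) pairs,
# with tags drawn from the Penn Treebank tag set.
tagged_sents = [
    [("Stocks", "NNS"), ("rose", "VBD"), (".", ".")],
    [("Analysts", "NNS"), ("expect", "VBP"), ("gains", "NNS"), (".", ".")],
]


def tokens(sent):
    """Drop the tags and return only the tokens of one sentence."""
    return [token for token, _tag in sent]


def tag_counts(sents):
    """Count how often each part-of-speech tag occurs across sentences."""
    return Counter(tag for sent in sents for _token, tag in sent)


print(tokens(tagged_sents[0]))
print(tag_counts(tagged_sents).most_common())
```

Because segmentation, tokenisation and tagging are already present in the data, no further preprocessing is needed before counting tags or extracting tokens.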
