sussex_nltk Package

The Sussex NLTK package provides extensions to the functionality provided by the standard NLTK distribution, along with additional corpora.

cmu Module

The CMU module provides access to the Carnegie Mellon (CMU) twitter tokenizer and part-of-speech tagger. It is used internally by other modules in the sussex_nltk package and should not be called directly.

sussex_nltk.cmu.tag(sents, java_options='-Xmx1g')[source]

Tags a list of sentences using the CMU twitter part-of-speech tagger.

corpus_readers Module

The corpus readers module provides access to corpora not included in the standard NLTK distribution. The corpora have been prepared by the Text Analytics Group at the University of Sussex.

class sussex_nltk.corpus_readers.AmazonReview(data)[source]

Bases: object

format_sentences_string(word_limit=70)[source]
rating()[source]
raw()[source]
sents()[source]
tagged_sents()[source]
tokenise_segment(word_limit=0)[source]
words()[source]
class sussex_nltk.corpus_readers.AmazonReviewCorpusReader(fileids='.*\.review')[source]

Bases: nltk.corpus.reader.api.CorpusReader

The reader provides access to user-written product reviews from amazon.com.

The corpus is categorised into 'dvd', 'book', 'kitchen' and 'electronics', and each category is further divided into three sentiment classes: 'positive', 'negative' and 'neutral'.

Each category contains 1000 reviews for the 'positive' and 'negative' sentiment classes.
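
For illustration, a minimal usage sketch follows. It assumes the corpus data is installed where the reader expects to find it, and the exact form of the returned values (for example from rating()) may differ.

    from sussex_nltk.corpus_readers import AmazonReviewCorpusReader

    # A reader over the whole Amazon review corpus.
    reader = AmazonReviewCorpusReader()

    # Restrict the reader to positive reviews in the 'dvd' category.
    positive_dvd = reader.positive(domains=['dvd'])
    print(positive_dvd.enumerate())        # number of review documents
    print(positive_dvd.enumerate_sents())  # number of sentences

    # documents() yields AmazonReview objects.
    for review in positive_dvd.documents():
        print(review.rating())              # the review's star rating
        print(list(review.words())[:10])    # the first ten tokens
        break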

attach_srl_data(srl_path, output_dir, replace_self=False)[source]
category(cat)[source]

Returns a new AmazonReviewCorpusReader over the specified category.

cat should be one of 'kitchen', 'dvd', 'book', 'electronics'.

documents(start=None, end=None)[source]

Generator over the documents in the corpus.

Yields AmazonReview objects.

enumerate()[source]

Returns the number of review documents in the corpus.

enumerate_sents()

Returns the number of sentences in the corpus.

negative(domains=['books', 'dvd', 'electronics', 'kitchen'])[source]

Returns a new AmazonReviewCorpusReader over the negative reviews.

domains should be a list of categories.

positive(domains=['books', 'dvd', 'electronics', 'kitchen'])[source]

Returns a new AmazonReviewCorpusReader over the positive reviews.

domains should be a list of categories.

pre_process_corpus(output_dir, replace_self=False)[source]
raw()[source]

Generator to return the raw text of the reviews.

raw_documents(start=None, end=None)
sample_sents(samplesize=2000)
sample_words(candno, samplesize=900)[source]

Returns a random sample of words from the corpus.

The sample is generated by selecting samplesize documents from the corpus and flattening these documents into a list of strings.

candno is used as the seed to a random number generator to ensure unique samples from the corpus.

samplesize is the number of documents that should be sampled from the corpus.

The method will raise a ValueError if samplesize is larger than the population size.
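
A sketch of seeded sampling; the candidate number 12345 below is only a placeholder.

    from sussex_nltk.corpus_readers import AmazonReviewCorpusReader

    reader = AmazonReviewCorpusReader()

    # The candidate number seeds the random number generator, so the same
    # candno always produces the same sample of tokens.
    tokens = reader.sample_words(12345, samplesize=900)
    print(len(tokens))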

sample_words_by_sents(samplesize=2000)
sents()[source]

Generator to return all sentences, each as a list of strings.

unlabeled(domains=['books', 'dvd', 'electronics', 'kitchen'])[source]

Returns a new AmazonReviewCorpusReader over the unlabeled reviews.

domains should be a list of categories.

words()[source]

Generator to return all words as a flat list.

class sussex_nltk.corpus_readers.CompressedCorpusReader(fileids='.*\.txt', data_folder='')[source]

Bases: nltk.corpus.reader.api.CorpusReader

A corpus reader for accessing corpora in a gzip format.

enumerate()[source]

Returns the number of documents in the corpus.

enumerate_sents()[source]

Returns the number of sentences in the corpus.

raw(fileids=None)[source]

Returns a generator object over the raw documents in the corpus.

The documents are returned as raw text strings in the order they are stored in the corpus file.

fileids is an optional list of file ids that can be used to restrict the corpus files from which the text is generated.

sample_sents(samplesize=2000)

Return a list of sentences sampled from the corpus.

The method selects random sentences (uniformly) from the corpus up to samplesize and returns those as a list of lists of strings.

sample_words(candno, samplesize=2000)[source]

Returns a random sample of words in the corpus.

The sample is generated by selecting samplesize documents from the corpus and flattening these documents into a list of strings.

candno is used as the seed to a random number generator to ensure unique samples from the corpus.

samplesize is the number of documents that should be sampled from the corpus.

The method will raise a ValueError if samplesize is larger than the population size.

sample_words_by_documents(candno, samplesize=2000)

Returns a random sample of words in the corpus.

The sample is generated by selecting samplesize documents from the corpus and flattening these documents into a list of strings.

candno is used as the seed to a random number generator to ensure unique samples from the corpus.

samplesize is the number of documents that should be sampled from the corpus.

The method will raise a ValueError if samplesize is larger than the population size.

sample_words_by_sents(samplesize=2000)

Return a list of words sampled by sentence from the corpus.

The method selects random sentences (uniformly) from the corpus up to samplesize and returns those as a list of strings.

sents(fileids=None)[source]

A generator over the sentences in the corpus.

The generator iterates over all the sentences in the corpus, document by document; the order of the documents is determined by the order in which they are returned from the file system. Document boundaries are not marked in the generator. Each produced item is one sentence as a list of strings.

fileids is an optional list of file ids that can be used to restrict the corpus files from which the sentences are generated.

words(fileids=None)[source]

Returns a generator of the tokens in the corpus.

The generator iterates over all the sentences in the corpus, document by document; the order of the documents is determined by the order in which they are returned from the file system. Document boundaries are not marked in the generator. The produced items form a flat sequence of strings.

fileids is an optional list of file ids that can be used to restrict the corpus files from which the tokens are generated.

class sussex_nltk.corpus_readers.MedlineCorpusReader(fileids='.*\.gz')[source]

Bases: sussex_nltk.corpus_readers.CompressedCorpusReader

The Medline corpus reader provides access to abstracts of medical research papers.
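
The methods inherited from CompressedCorpusReader can be used as in the sketch below; it assumes the corpus data is installed and treats sents() as an iterable of token lists, as documented above.

    from sussex_nltk.corpus_readers import MedlineCorpusReader

    reader = MedlineCorpusReader()
    print(reader.enumerate())        # number of abstracts (documents)
    print(reader.enumerate_sents())  # number of sentences

    # sents() is a generator; each item is one sentence as a list of tokens.
    first_sentence = next(iter(reader.sents()))
    print(first_sentence)

    # A uniform random sample of 100 sentences.
    sampled = reader.sample_sents(samplesize=100)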

class sussex_nltk.corpus_readers.ReutersCorpusReader(fileids='.*\.gz')[source]

Bases: sussex_nltk.corpus_readers.CompressedCorpusReader

The ReutersCorpusReader provides access to a subset of the RCV1 corpus.

The categories provided by the reader are 'finance' and 'sport'. The documents are stored in a raw format, i.e. they are not sentence-segmented or POS-tagged.

See the RCV1 corpus: <http://about.reuters.com/researchandstandards/corpus/>

category(cat)[source]

Returns a new ReutersCorpusReader over the specified category.

cat should be either 'finance' or 'sport'.

finance()[source]

Returns a ReutersCorpusReader restricted to the 'finance' category.

sport()[source]

Returns a ReutersCorpusReader restricted to the 'sport' category.
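
An illustrative sketch; either route below should produce an equivalently restricted reader.

    from sussex_nltk.corpus_readers import ReutersCorpusReader

    reader = ReutersCorpusReader()

    # Two equivalent ways of restricting the reader to one category.
    finance = reader.finance()
    sport = reader.category('sport')

    print(finance.enumerate(), sport.enumerate())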

class sussex_nltk.corpus_readers.TwitterCorpusReader(fileids='.*\.gz')[source]

Bases: sussex_nltk.corpus_readers.CompressedCorpusReader

Provides access to tweets about teamGB collected during the London 2012 Olympics.

The corpus spans a roughly 24-hour period between the 7th and 8th of August 2012.
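
A small sketch of reading the tweets; the reader exposes the CompressedCorpusReader interface described above.

    from sussex_nltk.corpus_readers import TwitterCorpusReader

    tweets = TwitterCorpusReader()
    print(tweets.enumerate())   # number of tweets (documents)

    # The first sentence of the first tweet, as a list of tokens.
    print(next(iter(tweets.sents())))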

class sussex_nltk.corpus_readers.WSJCorpusReader(fileids='.*\.mrg')[source]

Bases: nltk.corpus.reader.api.CorpusReader

The WSJCorpusReader provides access to a subsample of the Penn Treebank.

See the Penn Treebank: <http://www.cis.upenn.edu/~treebank/>

enumerate()[source]

Returns the number of documents in the corpus.

enumerate_sents()

Returns the number of sentences in the corpus.

raw(fileids=None)[source]

Returns a generator object over the raw documents in the corpus.

The documents are returned as raw text strings in the order they are stored in the corpus file. Any markup the documents may contain is removed.

fileids is an optional list of file ids that can be used to restrict the corpus files from which the text is generated.

sample_sents(samplesize=2000)
sample_words(candno, samplesize=2000)[source]
sample_words_by_documents(candno, samplesize=1000)

Returns a random sample of words from the corpus.

The sample is generated by selecting samplesize documents from the corpus and flattening these documents into a list of strings.

candno is used as the seed to a random number generator to ensure unique samples from the corpus.

samplesize is the number of documents that should be sampled from the corpus.

The method will raise a ValueError if samplesize is larger than the population size.

sample_words_by_sents(samplesize=2000)
sents(fileids=None)[source]

Returns a list of lists of strings, one list per sentence in the corpus.

The method loads each document in the corpus in turn, sentence- and word-tokenizes it, and returns a structure of the form [[str, str, str], [str, str, str]], where each sublist is a sentence and each str is a token in that sentence.

tagged_sents(fileids=None)[source]

Returns a list of lists of tuples, one list per sentence in the corpus.

The method loads each document in the corpus in turn, sentence- and word-tokenizes it, and returns a structure of the form [[(str, str), (str, str)]], where each sublist is a sentence and each (str, str) tuple is a (token, pos_tag) pair in that sentence.

tagged_words(fileids=None)[source]

Returns a flat list of the tagged words of the corpus.

The method returns a list of tuples where each tuple is a (word,pos_tag) pair.

words(fileids=None)[source]

Returns a flat list of the words in the corpus.
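
An illustrative sketch of the tagged interface; it assumes the Penn Treebank sample is installed and treats tagged_sents() as an iterable.

    from sussex_nltk.corpus_readers import WSJCorpusReader

    wsj = WSJCorpusReader()

    # Each tagged sentence is a list of (token, pos_tag) pairs.
    first_sentence = next(iter(wsj.tagged_sents()))
    for token, pos_tag in first_sentence:
        print(token, pos_tag)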

sussex_nltk.corpus_readers.get_srl_sent(srl_file)[source]
sussex_nltk.corpus_readers.pre_process_file(paths)[source]
sussex_nltk.corpus_readers.reviews_from_file(fileid, count=[0], start=None, end=None)[source]

spell Module

sussex_nltk.spell.dictionary(dict_type='aspell', dict_language='en_GB')[source]

stats Module

sussex_nltk.stats.expected_sentiment_tokens(tokens, _n_norm=500)[source]

Calculates the expected number of sentiment bearing tokens per _n_norm tokens.

sussex_nltk.stats.expected_token_freq(tokens, word, _n_norm=5000)[source]

Calculates the expected frequency of each item in words per _n_norm tokens.

sussex_nltk.stats.normalised_lexical_diversity(tokens, _n_norm=500)[source]

Calculates the average lexical diversity per _n_norm tokens.

sussex_nltk.stats.percentage(count, total)[source]
sussex_nltk.stats.prob_short_sents(sents)[source]

Calculates the probability of a sentence having two or fewer tokens.

sussex_nltk.stats.sample_from_corpus(corpus, sample_size)[source]
sussex_nltk.stats.zipf_dist(freqdist, num_of_ranks=50, show_values=True)[source]

Given a frequency distribution object, ranks all types in order of frequency of occurrence (where rank 1 is the most frequent type) and plots rank against frequency of occurrence. If num_of_ranks=20, then the 20 highest-ranked types are plotted. If show_values=True, the frequency values are displayed above the bars.
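
A sketch combining the stats helpers with an NLTK frequency distribution; the candidate number and sample size are placeholders.

    from nltk.probability import FreqDist

    from sussex_nltk.corpus_readers import ReutersCorpusReader
    from sussex_nltk.stats import normalised_lexical_diversity, zipf_dist

    # Sample tokens from the finance portion of the Reuters corpus.
    reader = ReutersCorpusReader().finance()
    tokens = list(reader.sample_words(12345, samplesize=500))

    # Average lexical diversity per 500-token chunk.
    print(normalised_lexical_diversity(tokens, _n_norm=500))

    # Plot frequency against rank for the 50 most frequent types.
    zipf_dist(FreqDist(tokens), num_of_ranks=50, show_values=True)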

tag Module

The tag module provides access to the Stanford and Carnegie Mellon twitter part-of-speech taggers.

The Stanford tagger has four different models trained on data that has been preprocessed differently.

  • wsj-0-18-bidirectional-distsim.tagger Trained on WSJ sections 0-18 using a bidirectional architecture and including word shape and distributional similarity features.

    Penn Treebank tagset.

    Performance: 97.28% correct on WSJ 19-21 (90.46% correct on unknown words)

  • wsj-0-18-left3words.tagger Trained on WSJ sections 0-18 using the left3words architecture and includes word shape features.

    Penn tagset.

    Performance: 96.97% correct on WSJ 19-21 (88.85% correct on unknown words)

  • english-left3words-distsim.tagger Trained on WSJ sections 0-18 and extra parser training data using the left3words architecture and includes word shape and distributional similarity features.

    Penn tagset.

  • english-bidirectional-distsim.tagger Trained on WSJ sections 0-18 using a bidirectional architecture and including word shape and distributional similarity features.

    Penn Treebank tagset.

sussex_nltk.tag.stanford_tag(sent, model='wsj-0-18-bidirectional-distsim')[source]

Uses the Stanford POS tagger to tag a sentence.

model should be one of 'wsj-bidirectional-distsim', 'wsj-left3words-distsim', 'wsj-bidirectional', 'wsj-left3words'.
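
A hedged usage sketch; whether sent is a raw string or a pre-tokenised list of strings, and the exact output format, are assumptions (a list of (token, tag) pairs is typical).

    from sussex_nltk.tag import stanford_tag

    # Assumed: sent is a pre-tokenised sentence (a list of strings).
    sent = ['The', 'plot', 'was', 'predictable', '.']
    tagged = stanford_tag(sent, model='wsj-bidirectional-distsim')
    print(tagged)   # e.g. [('The', 'DT'), ('plot', 'NN'), ...]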

sussex_nltk.tag.stanford_tag_batch(sents, model=None)[source]
sussex_nltk.tag.twitter_tag(sent)[source]

Tags a sentence using the CMU twitter part-of-speech tagger.

sussex_nltk.tag.twitter_tag_batch(sents)[source]

Tags a list of sentences using the CMU twitter part-of-speech tagger.
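
A similar sketch for the CMU twitter tagger; the tweet tokens are made up, and the assumption that input is a pre-tokenised sentence also applies here.

    from sussex_nltk.tag import twitter_tag, twitter_tag_batch

    tweet = ['gooo', 'teamGB', '!!', '#olympics']
    print(twitter_tag(tweet))

    # Tag several sentences in a single call.
    print(twitter_tag_batch([tweet, ['so', 'proud', 'of', '@teamGB']]))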

tokenize Module

sussex_nltk.tokenize.twitter_tokenize(sent, root=None)[source]

Tokenizes a sentence using the CMU twitter tokenizer.

sussex_nltk.tokenize.twitter_tokenize_batch(sents)[source]

Tokenizes a list of sentences using the CMU twitter tokenizer.
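
A usage sketch; the example tweets are made up, and the input is assumed to be a raw, untokenised string.

    from sussex_nltk.tokenize import twitter_tokenize, twitter_tokenize_batch

    raw_tweet = "gooo teamGB!! #olympics"
    print(twitter_tokenize(raw_tweet))

    # Tokenise several raw tweets in one call.
    print(twitter_tokenize_batch([raw_tweet, "so proud of @teamGB"]))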
