The Sussex NLTK package provides extensions to the functionality provided by the standard NLTK distribution, along with additional corpora.
The CMU module provides access to the Carnegie Mellon twitter tokenizer. It is used internally by other modules in the sussex_nltk package and should not be called directly.
The corpus readers module provides access to corpora not included in the standard NLTK distribution. The corpora have been prepared by the Text Analytics Group at the University of Sussex.
Bases: nltk.corpus.reader.api.CorpusReader
The reader provides access to user-written product reviews on amazon.com.
The corpus is categorised into 'dvd', 'book', 'kitchen' and 'electronics', and each category is further divided into three sentiment classes: 'positive', 'negative' and 'neutral'.
Each category contains 1000 reviews for the 'positive' and 'negative' sentiment classes.
Returns a new AmazonReviewCorpusReader over the specified category.
cat should be one of 'kitchen', 'dvd', 'book' or 'electronics'.
Generator over the documents in the corpus.
Yields AmazonReview objects.
Returns the number of sentences in the corpus.
Returns a new AmazonReviewCorpusReader over the negative reviews.
domains should be a list of categories.
Returns a new AmazonReviewCorpusReader over the positive reviews.
domains should be a list of categories.
Returns a random sample of words from the corpus.
The sample is generated by selecting samplesize documents from the corpus and flattening these documents into a list of strings.
candno is used as the seed to a random number generator to ensure unique samples from the corpus.
samplesize is the number of documents that should be sampled from the corpus.
The method will raise a ValueError if samplesize is larger than the population size.
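The sampling contract described above can be sketched in plain Python. The helper below is illustrative only, not the reader's actual implementation; the names candno and samplesize follow the documentation.

```python
import random

def sample_words(documents, candno, samplesize):
    """Sample samplesize documents, seeded by candno, then flatten
    the sampled documents into a single list of token strings."""
    if samplesize > len(documents):
        # Mirrors the documented behaviour: the sample cannot
        # exceed the population size.
        raise ValueError("samplesize is larger than the population size")
    rng = random.Random(candno)  # candno seeds the generator,
                                 # so each candidate number gives
                                 # a reproducible sample
    sampled = rng.sample(documents, samplesize)
    return [token for doc in sampled for token in doc]

# Toy corpus: each document is a list of token strings.
docs = [["good", "dvd"], ["bad", "book"], ["great", "kitchen", "scales"]]
words = sample_words(docs, candno=1234, samplesize=2)
```

Because the generator is re-seeded with candno on every call, repeated calls with the same arguments return the same sample.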
Bases: nltk.corpus.reader.api.CorpusReader
A corpus reader for accessing corpora stored in gzip format.
Returns a generator object over the raw documents in the corpus.
The documents are returned as a raw text string in the order they are stored in the corpus file.
fileids is an optional list of file ids that restricts the corpus files from which the output is generated.
Return a list of sentences sampled from the corpus.
The method selects random sentences (uniformly) from the corpus up to samplesize and returns those as a list of list of strings.
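The "up to samplesize" behaviour can be sketched with a self-contained helper; this is an illustration of the described semantics, not the reader's own code.

```python
import random

def sample_sents(sentences, samplesize):
    """Uniformly sample up to samplesize sentences. Each sentence is a
    list of token strings, so the result is a list of lists of strings."""
    k = min(samplesize, len(sentences))  # "up to samplesize"
    return random.sample(sentences, k)

sents = [["Hello", "world"], ["Good", "book"], ["Bad", "dvd", "!"]]
picked = sample_sents(sents, 2)
```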
Returns a random sample of words in the corpus.
The sample is generated by selecting samplesize documents from the corpus and flattening these documents into a list of strings.
candno is used as the seed to a random number generator to ensure unique samples from the corpus.
samplesize is the number of documents that should be sampled from the corpus.
The method will raise a ValueError if samplesize is larger than the population size.
Returns a random sample of words in the corpus.
The sample is generated by selecting samplesize documents from the corpus and flattening these documents into a list of strings.
candno is used as the seed to a random number generator to ensure unique samples from the corpus.
samplesize is the number of documents that should be sampled from the corpus.
The method will raise a ValueError if samplesize is larger than the population size.
Return a list of words sampled by sentence from the corpus.
The method selects random sentences (uniformly) from the corpus up to samplesize and returns their tokens as a flat list of strings.
A generator over the sentences in the corpus.
The generator iterates over all the sentences in the corpus, visiting documents in the order they are returned from the file system. Document boundaries are not marked. Each yielded sentence is a list of strings.
fileids is an optional list of file ids that restricts the corpus files from which the output is generated.
Returns a generator of the tokens in the corpus.
The generator iterates over all the sentences in the corpus, visiting documents in the order they are returned from the file system. Document boundaries are not marked. The output is a flat sequence of strings.
fileids is an optional list of file ids that restricts the corpus files from which the output is generated.
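The difference between the sentence-level and token-level generators can be illustrated with an in-memory stand-in for the corpus. The real readers stream from gzip files; the nested corpus structure below is hypothetical.

```python
# A corpus as documents -> sentences -> tokens.
corpus = [
    [["The", "drug", "worked"], ["Trials", "continue"]],  # document 1
    [["Results", "were", "mixed"]],                       # document 2
]

def sents(docs):
    """Yield one list of token strings per sentence, documents in order.
    Document boundaries are not marked in the output."""
    for doc in docs:
        for sentence in doc:
            yield sentence

def tokens(docs):
    """Yield individual token strings: the flattened version of sents()."""
    for sentence in sents(docs):
        for token in sentence:
            yield token

all_sents = list(sents(corpus))    # 3 sentences, each a list of strings
all_tokens = list(tokens(corpus))  # 8 token strings, boundaries lost
```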
Bases: sussex_nltk.corpus_readers.CompressedCorpusReader
The Medline corpus reader provides access to abstracts of medical research papers.
Bases: sussex_nltk.corpus_readers.CompressedCorpusReader
The ReutersCorpusReader provides access to a subset of the RCV1 corpus.
The categories provided by the reader are 'finance' and 'sport'. The documents are stored in a raw format, i.e. they are not sentence-segmented or POS-tagged.
RCV1 corpus: http://about.reuters.com/researchandstandards/corpus/
Bases: sussex_nltk.corpus_readers.CompressedCorpusReader
Provides access to tweets about Team GB collected during the London 2012 Olympics.
The corpus spans a roughly 24-hour period between the 7th and 8th of August 2012.
Bases: nltk.corpus.reader.api.CorpusReader
The WSJCorpusReader provides access to a subsample of the Penn Treebank.
Penn Treebank: http://www.cis.upenn.edu/~treebank/
Returns the number of documents in the corpus.
Returns a generator object over the raw documents in the corpus.
The documents are returned as raw text strings in the order they are stored in the corpus file. Any markup the documents may contain is removed.
fileids is an optional list of file ids that restricts the corpus files from which the output is generated.
Returns a random sample of words from the corpus.
The sample is generated by selecting samplesize documents from the corpus and flattening these documents into a list of strings.
candno is used as the seed to a random number generator to ensure unique samples from the corpus.
samplesize is the number of documents that should be sampled from the corpus.
The method will raise a ValueError if samplesize is larger than the population size.
Returns a list of lists of strings, one list per sentence in the corpus.
The method loads each document in the corpus in turn, sentence- and word-tokenizes it, and returns a structure of the form [[str, str, str], [str, str, str]], where each sublist is a sentence and each str is a token in that sentence.
Returns a list of lists of tuples, one list per sentence in the corpus.
The method loads each document in the corpus in turn, sentence- and word-tokenizes it, and returns a structure of the form [[(str, str), (str, str)]], where each sublist is a sentence and each (str, str) tuple is a (token, pos_tag) pair in that sentence.
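The return shape can be seen in a small example. The tagged sentences below are invented for illustration, but the structure matches the description above.

```python
# tagged_sents()-style output: sentences of (token, pos_tag) pairs.
tagged = [
    [("Stocks", "NNS"), ("rose", "VBD"), (".", ".")],
    [("The", "DT"), ("market", "NN"), ("rallied", "VBD")],
]

# Typical consumption: collect all tokens with a given tag prefix,
# e.g. nouns (Penn Treebank tags starting with 'NN').
nouns = [tok for sent in tagged for tok, tag in sent if tag.startswith("NN")]
# nouns -> ['Stocks', 'market']
```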
Calculates the expected number of sentiment bearing tokens per _n_norm tokens.
Calculates the expected frequency of each item in words per _n_norm tokens.
Calculates the average lexical diversity per _n_norm tokens.
Calculates the probability of a sentence of 2 or fewer tokens.
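These normalised statistics can be sketched for a toy corpus. The variable names below are illustrative, not the module's actual functions; n_norm plays the role of _n_norm above.

```python
from collections import Counter

sents = [["I", "loved", "it"], ["Terrible", "!"], ["Would", "buy", "again"]]
tokens = [t for s in sents for t in s]
n_norm = 1000  # normalisation constant

# Expected frequency of each word per n_norm tokens:
# relative frequency scaled to a corpus of n_norm tokens.
counts = Counter(tokens)
expected = {w: c / len(tokens) * n_norm for w, c in counts.items()}

# Probability of a sentence of 2 or fewer tokens.
p_short = sum(1 for s in sents if len(s) <= 2) / len(sents)
```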
Given a frequency distribution object, rank all types in order of frequency of occurrence (where rank 1 is the most frequent type), and plot the ranks against the frequency of occurrence. If num_of_ranks=20, then 20 types will be plotted. If show_values=True, the bar values are displayed above the bars.
The tag module provides access to the Stanford and Carnegie Mellon twitter part-of-speech taggers.
The Stanford tagger has four different models trained on data that has been preprocessed differently.
wsj-0-18-bidirectional-distsim.tagger Trained on WSJ sections 0-18 using a bidirectional architecture and including word shape and distributional similarity features.
Penn Treebank tagset.
Performance: 97.28% correct on WSJ 19-21 (90.46% correct on unknown words)
wsj-0-18-left3words.tagger Trained on WSJ sections 0-18 using the left3words architecture and includes word shape features.
Penn tagset.
Performance: 96.97% correct on WSJ 19-21 (88.85% correct on unknown words)
english-left3words-distsim.tagger Trained on WSJ sections 0-18 and extra parser training data using the left3words architecture and includes word shape and distributional similarity features.
Penn tagset.
english-bidirectional-distsim.tagger Trained on WSJ sections 0-18 using a bidirectional architecture and including word shape and distributional similarity features.
Penn Treebank tagset.