Parser Comparison - Context-Free Grammar (CFG) Data

The files listed below contain the grammars, lexicons, and sets of test sentences used in the experiments performed in this study. The CT grammar and lexicon were derived from a CFG (courtesy of John Dowding, SRI International) compiled from a task-specific unification grammar written for CommandTalk, an SRI-developed spoken-language interface to a military simulation system. The ATIS grammar and lexicon (courtesy of George Heidorn and Eric Ringger, Microsoft Research) were extracted from a treebank of the DARPA ATIS3 training sentences. The PT grammar (courtesy of Eugene Charniak, Brown University) was extracted from the Penn Treebank. Various transformations, including removal of cycles, were performed on these grammars to get them into a uniform notation suitable for the experiments reported in the study. The root symbol for all three grammars is the token SIGMA.

A standard test set is supplied for each of the three grammars. The test set for the CT grammar is a set of sentences made up by the CommandTalk developers to test the functionality of the system. The test set for the ATIS grammar is a randomly selected subset of the DARPA ATIS3 development test set. (The full ATIS corpus is distributed by the Linguistic Data Consortium.) The test set for the PT grammar is a set of preterminal strings randomly generated from a probabilistic version of the grammar, with the probabilities based on the frequency of the bracketings occurring in the training data. These strings were then filtered for length to make it possible to conduct experiments in a reasonable amount of time, given the high degree of ambiguity of the grammar.
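The way the PT test set was produced (sampling from a probabilistic grammar, then filtering by length) can be illustrated with a minimal sketch. This is not the procedure actually used in the study: the toy grammar, the counts, the function names, and the length cutoff are all made up for illustration.

```python
import random

# Hypothetical toy grammar: nonterminal -> list of (expansion, count).
# The counts stand in for bracketing frequencies observed in training data.
TOY_GRAMMAR = {
    "SIGMA": [(("S",), 1)],
    "S": [(("NP", "VP"), 1)],
    "NP": [(("det", "noun"), 3), (("NP", "pp"), 1)],
    "VP": [(("verb",), 2), (("verb", "NP"), 2)],
}


def sample(grammar, root, rng):
    """Expand `root` left to right, picking each expansion in proportion to its count."""
    stack, output = [root], []
    while stack:
        symbol = stack.pop()
        if symbol not in grammar:          # preterminal: emit it unchanged
            output.append(symbol)
            continue
        expansions, counts = zip(*grammar[symbol])
        rhs = rng.choices(expansions, weights=counts)[0]
        stack.extend(reversed(rhs))        # leftmost symbol is expanded next
    return output


def generate_test_set(grammar, root, n, max_len, seed=0):
    """Keep sampling until n strings of at most max_len preterminals are found."""
    rng = random.Random(seed)
    kept = []
    while len(kept) < n:
        string = sample(grammar, root, rng)
        if len(string) <= max_len:         # the length filter
            kept.append(string)
    return kept


test_set = generate_test_set(TOY_GRAMMAR, "SIGMA", n=5, max_len=6)
for string in test_set:
    print(" ".join(string))
```

Each printed line is a space-separated preterminal string, which is the same shape as a line of a sentence file.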

The terminals of the grammars are preterminal lexical categories rather than words. Preterminals were generated automatically, by grouping together all the words that could occur in exactly the same contexts in all grammar rules, to eliminate lexical ambiguity. Since the test set for the PT grammar is already a set of preterminal strings, a dummy lexicon was created that simply assigns each preterminal itself as its preterminal category. This was done to keep the process of looking up the preterminal category of each lexical item uniform in all the experiments in the study.
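The grouping step described above can be sketched as follows. The sketch assumes that the contexts a word can occur in are already available as a set of labels per word. That representation, the sample words, and the function name are assumptions made for illustration, not a description of how the study computed its preterminals.

```python
from collections import defaultdict


def group_words(word_contexts):
    """Group words whose sets of contexts are exactly identical."""
    groups = defaultdict(list)
    for word, contexts in word_contexts.items():
        # frozenset makes the context set usable as a dictionary key
        groups[frozenset(contexts)].append(word)
    return sorted(sorted(words) for words in groups.values())


# Hypothetical input: each word mapped to the contexts it can appear in.
WORD_CONTEXTS = {
    "flight": {"N"},
    "fare": {"N"},
    "book": {"N", "V"},
    "show": {"V"},
}

classes = group_words(WORD_CONTEXTS)
print(classes)
```

In this toy input, "book" shares some but not all contexts with the other words, so it ends up in a class of its own. Every word therefore belongs to exactly one class, which is what removes lexical ambiguity.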

Some statistics on the grammars and test sets are contained in the following table:

	CT Grammar	ATIS Grammar	PT Grammar
Rules	24,456	4,592	15,039
Nonterminals	3,946	192	38
Terminals	1,032	357	47
# Test Sentences	162	98	30
Average Length	8.3	11.4	5.7
# Grammatical	150	70	30
Average # Parses	5.4	940	7.2 X 10^27

Note that for the CT and ATIS sets, not all sentences are accepted by the corresponding grammars (150 of 162 and 70 of 98, respectively). The most striking difference among the three grammars is the degree of ambiguity. The CT grammar has relatively low ambiguity, the ATIS grammar may be considered highly ambiguous, and the PT grammar can only be called massively ambiguous.

FILE FORMATS

Grammar, lexicon, and test sentence files are all in plain ASCII text format. Some of the files include comment lines beginning with the semicolon character (";") that carry required notices.

The grammar files (except for ct-grammar-original) are in the form of blocks, separated by blank lines, defining the productions expanding each nonterminal. The first line in each block contains only the nonterminal symbol whose productions are defined by that block. Each remaining line of the block consists of a space-separated sequence of nonterminals and preterminals defining a possible expansion of the nonterminal in question. Tokens beginning with upper-case characters are nonterminals; all other tokens are preterminals. For example, the block

In the lexicon files, each line contains a lexical item followed by a space followed by its preterminal category. Note that lexical items and preterminal categories are not necessarily distinct symbols. In these lexicons, whenever there is only one lexical item in a given preterminal category, the lexical item itself is used as the symbol for the preterminal category.
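A lexicon in this format can be read with a few lines of code. The sketch below is only illustrative: the sample entries and function names are made up, and it assumes that neither field contains a space and that no lexical item begins with a semicolon.

```python
def parse_lexicon(text):
    """Map each lexical item to its preterminal category."""
    lexicon = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and ";" comment lines (assumption: no lexical
        # item itself starts with a semicolon).
        if not line or line.startswith(";"):
            continue
        item, category = line.split()   # assumption: exactly two fields
        lexicon[item] = category
    return lexicon


def to_preterminals(tokens, lexicon):
    """Look up the preterminal category of each lexical token."""
    return [lexicon[token] for token in tokens]


# Made-up entries, not taken from any of the distributed lexicon files.
SAMPLE_LEXICON = """\
; hypothetical notice line
boston city_name
denver city_name
from from
"""

lexicon = parse_lexicon(SAMPLE_LEXICON)
categories = to_preterminals("from boston".split(), lexicon)
print(categories)
```

The entry "from from" shows the case where a preterminal category has a single member and the lexical item itself serves as the category symbol. The dummy PT lexicon consists entirely of such entries.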

In the sentence files, each line consists of the lexical tokens of a single sentence, separated by spaces. Punctuation marks are treated as lexical tokens, and are present only where required by the corresponding grammar.
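The block layout of the grammar files described earlier in this section can be read in a similar way. The following is a minimal sketch under the stated format only. The sample grammar and the function names are made up, and the code has not been run against the distributed files.

```python
def parse_grammar(text):
    """Parse block-format grammar text into {nonterminal: [expansion, ...]}."""
    grammar = {}
    lhs = None                        # nonterminal of the block being read
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith(";"):      # comment line carrying a notice
            continue
        if not line:                  # a blank line ends the current block
            lhs = None
            continue
        if lhs is None:               # first line of a block: the nonterminal
            lhs = line
            grammar.setdefault(lhs, [])
        else:                         # every other line: one expansion
            grammar[lhs].append(line.split())
    return grammar


def is_nonterminal(token):
    """Tokens beginning with an upper-case character are nonterminals."""
    return token[:1].isupper()


# Made-up sample, not taken from any of the distributed grammar files.
SAMPLE_GRAMMAR = """\
; hypothetical notice line
SIGMA
S

S
NP VP
NP VP pp

NP
det noun
"""

grammar = parse_grammar(SAMPLE_GRAMMAR)
for lhs, expansions in grammar.items():
    for rhs in expansions:
        print(lhs, "->", " ".join(rhs))
```

Storing each expansion as a list of tokens keeps the nonterminal/preterminal distinction recoverable through `is_nonterminal` without a separate symbol table.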

NAMES OF FILES

The names of the three sets of grammars, lexicons, and test sentences used in the study are as follows:

	Grammar	Lexicon	Test Sentences
CT	ct-grammar-eval	ct-eval-lex	ct-sentences
ATIS	atis-grammar	atis-lex	atis-sentences
PT	pt-grammar	pt-lex	pt-sentences

In addition, the file ct-grammar-original includes the original CT grammar (incorporating lexical information) as it came from SRI. Please note the restrictions on use of the CT data as indicated by comments in the corresponding files.

Author Bob Moore, Microsoft Research. Maintained by John Carroll, University of Sussex. Last modified 31 Jul 01.