Parser Comparison - Context-Free Grammar (CFG) Data


OVERVIEW

The files iwpt2000-rev2.ps and iwpt2000-rev2.pdf are the PostScript and PDF forms of a revised version of "Improved Left-Corner Chart Parsing for Large Context-Free Grammars," by Robert C. Moore, presented at the Sixth International Workshop on Parsing Technologies, ITC-IRST, Trento, Italy, 23-25 February 2000. The abstract reads:

We develop an improved form of left-corner chart parsing for large context-free grammars, introducing improvements that result in significant speed-ups compared to previously-known variants of left-corner parsing. We also compare our method to several other major parsing approaches, and find that our improved left-corner parsing method outperforms each of these across a range of grammars. Finally, we also describe a new technique for minimizing the extra information needed to efficiently recover parses from the data structures built in the course of parsing.

The files listed below contain the grammars, lexicons, and sets of test sentences used in the experiments performed in this study. The CT grammar and lexicon were derived from a CFG (courtesy of John Dowding, SRI International) compiled from a task-specific unification grammar written for CommandTalk, an SRI-developed spoken-language interface to a military simulation system. The ATIS grammar and lexicon (courtesy of George Heidorn and Eric Ringger, Microsoft Research) were extracted from a treebank of the DARPA ATIS3 training sentences. The PT grammar (courtesy of Eugene Charniak, Brown University) was extracted from the Penn Treebank. Various transformations, including removal of cycles, were performed on these grammars to get them into a uniform notation suitable for the experiments reported in the study. The root symbol for all three grammars is the token SIGMA.

A standard test set is supplied for each of the three grammars. The test set for the CT grammar is a set of sentences made up by the CommandTalk developers to test the functionality of the system. The test set for the ATIS grammar is a randomly selected subset of the DARPA ATIS3 development test set. (The full ATIS corpus is distributed by the Linguistic Data Consortium.) The test set for the PT grammar is a set of preterminal strings randomly generated from a probabilistic version of the grammar, with the probabilities based on the frequency of the bracketings occurring in the training data. These strings were then filtered for length to make it possible to conduct experiments in a reasonable amount of time, given the high degree of ambiguity of the grammar.

The terminals of the grammars are preterminal lexical categories rather than words. Preterminals were generated automatically, by grouping together all the words that could occur in exactly the same contexts in all grammar rules, to eliminate lexical ambiguity. Since the test set for the PT grammar is already a set of preterminal strings, a dummy lexicon was created that simply assigns each preterminal itself as its preterminal category. This was done to keep the process of looking up the preterminal category of each lexical item uniform in all the experiments in the study.
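
The dummy lexicon for the PT grammar can be illustrated with a minimal Python sketch. The function name dummy_lexicon_lines and the sample preterminal symbols are hypothetical, chosen only for illustration, and the output assumes the one-entry-per-line lexicon layout described under FILE FORMATS.

```python
def dummy_lexicon_lines(preterminals):
    """Map every preterminal to itself, one "item category" entry per line."""
    # sorted(set(...)) drops duplicates and gives a stable order.
    return [f"{p} {p}" for p in sorted(set(preterminals))]


# Hypothetical preterminal symbols, not taken from pt-grammar.
print("\n".join(dummy_lexicon_lines(["dt", "nn", "vbz", "nn"])))
```

With such a lexicon, looking up a token of a PT test string returns the token itself, so the same lookup code can serve all three test sets.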

Some statistics on the grammars and test sets are contained in the following table:

                    CT Grammar  ATIS Grammar   PT Grammar
Rules                   24,456         4,592       15,039
Nonterminals             3,946           192           38
Terminals                1,032           357           47
# Test Sentences           162            98           30
Average Length             8.3          11.4          5.7
# Grammatical              150            70           30
Average # Parses           5.4           940  7.2 x 10^27

Note that for the CT and ATIS sets, not all sentences are accepted by the corresponding grammars. The most striking difference among the three grammars is the degree of ambiguity. The CT grammar has relatively low ambiguity, the ATIS grammar may be considered highly ambiguous, and the PT grammar can only be called massively ambiguous.

FILE FORMATS

Grammar, lexicon, and test sentence files are all in plain ASCII text format. Some of the files include comment lines, beginning with the semicolon character (";"), that carry required notices.

The grammar files (except for ct-grammar-original) are in the form of blocks, separated by blank lines, defining the productions expanding each nonterminal. The first line in each block contains only the nonterminal symbol whose productions are defined by that block. Each remaining line of the block consists of a space-separated sequence of nonterminals and preterminals defining a possible expansion of the nonterminal in question. Tokens beginning with upper-case characters are nonterminals; all other tokens are preterminals. For example, the block

NP
det NBAR
NP POSTNOMMOD

would define productions more conventionally written as

NP -> det NBAR
NP -> NP POSTNOMMOD
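
As an illustration of the block format, the following is a minimal Python sketch of a reader for it; it is not the code used in the study. The names parse_grammar and is_nonterminal are hypothetical. The sketch assumes that comment lines start with ";" and can be skipped wherever they occur, and that a block is not interrupted by blank lines.

```python
def parse_grammar(text):
    """Return {nonterminal: [expansion, ...]}; each expansion is a token list."""
    productions = {}
    block = []
    # The trailing "" acts as a sentinel that flushes the final block.
    for raw in text.splitlines() + [""]:
        line = raw.strip()
        if line.startswith(";"):
            continue  # comment line carrying a notice (assumed skippable)
        if line:
            block.append(line)
            continue
        if block:
            # First line of the block is the nonterminal being defined;
            # every remaining line is one expansion.
            head, *bodies = block
            productions.setdefault(head, []).extend(b.split() for b in bodies)
            block = []
    return productions


def is_nonterminal(token):
    """Tokens beginning with an upper-case character are nonterminals."""
    return token[:1].isupper()


sample = "NP\ndet NBAR\nNP POSTNOMMOD\n"
for head, expansions in parse_grammar(sample).items():
    for rhs in expansions:
        print(head, "->", " ".join(rhs))
```

Run on the NP block above, the sketch prints the two productions in the conventional arrow notation.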

In the lexicon files, each line contains a lexical item followed by a space followed by its preterminal category. Note that lexical items and preterminal categories are not necessarily distinct symbols. In these lexicons, whenever there is only one lexical item in a given preterminal category, the lexical item itself is used as the symbol for the preterminal category.
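
A lexicon in this format can be read into a dictionary, as in the Python sketch below. The function name load_lexicon and the sample entries are hypothetical. The sketch assumes that the preterminal category is the text after the last space on the line and that ";" comment lines may be present.

```python
def load_lexicon(text):
    """Return {lexical_item: preterminal_category} from lexicon file text."""
    lexicon = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and comment lines
        # Assumption: the category is whatever follows the last space.
        # A line with no space raises ValueError here.
        item, category = line.rsplit(" ", 1)
        lexicon[item] = category
    return lexicon


# Hypothetical entries: "the" and "a" share a category, while "please" is
# the only item in its category and so serves as its own category symbol.
sample = "the det\na det\nplease please\n"
print(load_lexicon(sample))
```

A plain dictionary is sufficient because the preterminal categories were constructed to eliminate lexical ambiguity, so each lexical item has exactly one category.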

In the sentence files, each line consists of the lexical tokens of a single sentence, separated by spaces. Punctuation marks are treated as lexical tokens, and are present only where required by the corresponding grammar.
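
Converting a sentence line into the preterminal string that a parser would consume can be sketched in Python as follows. The function name sentence_to_preterminals, the sample lexicon, and the sample sentence are all hypothetical. How the study handled tokens missing from the lexicon is not stated, so returning None for them is purely an assumption of this sketch.

```python
def sentence_to_preterminals(line, lexicon):
    """Split one sentence line on whitespace and look up each token."""
    # Tokens absent from the lexicon map to None (an assumption; see above).
    return [lexicon.get(token) for token in line.split()]


# Hypothetical lexicon and sentence; the punctuation mark is an ordinary token.
lexicon = {"show": "show", "the": "det", "flights": "noun", ".": "."}
print(sentence_to_preterminals("show the flights .", lexicon))
```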

NAMES OF FILES

The names of the three sets of grammars, lexicons, and test sentences used in the study are as follows:

        Grammar           Lexicon       Test Sentences
CT      ct-grammar-eval   ct-eval-lex   ct-sentences
ATIS    atis-grammar      atis-lex      atis-sentences
PT      pt-grammar        pt-lex        pt-sentences

In addition, the file ct-grammar-original contains the original CT grammar (incorporating lexical information) as it came from SRI. Please note the restrictions on use of the CT data as indicated by comments in the corresponding files.


Author Bob Moore, Microsoft Research. Maintained by John Carroll, University of Sussex. Last modified 31 Jul 01.