Efficiency in Large-scale Parsing Systems - Parser Comparison

Evaluation Methodology and Data


Interest in large-scale, grammar-based parsing has increased markedly in recent years, in response to the complexity of language-based application tasks such as speech-to-speech translation, and enabled by efforts in large-scale, collaborative grammar engineering and in the induction of statistical grammars and parsers from treebanks. Parser throughput is an important consideration in real-world applications, since throughput tends to degrade as grammar size increases. Investigation of efficient approaches to parsing is therefore an important topic of research.

The primary method of assessing the efficiency of a parsing algorithm is empirical: the algorithm is implemented, a grammar is loaded, and the time taken to parse a set of test sentences is recorded. Within a single research effort this technique allows the efficiency of different algorithms to be compared. However, reliable comparison with other researchers' results is not possible unless (1) the same grammars and test data are used, and (2) it is possible to relate parse times obtained across different implementation languages and computing platforms.

To date there have been a few studies (Carroll 1994; van Noord 1997; Natural Language Engineering 6(1), 2000) into parsing efficiency that have used common grammars and test data, but comparing parse times across the different languages and machines used in the studies is problematic.

One of the goals of the workshop Efficiency in Large-scale Parsing Systems at COLING 2000 (proceedings now available) was to discuss possible approaches to comparing parser efficiency. One methodology that gained broad agreement (proposed by Bob Moore in his paper in the proceedings) is, in addition to using common grammars and data, to have standard implementations of a reference algorithm in all implementation languages of interest, so that studies can report timings relative to the speed of this reference parser. This would allow the influences of different implementation languages and computing platforms to be factored out. So, in summary, the proposed methodology for comparing parser efficiency is: (1) parse a common set of test sentences with a common grammar, and (2) report parse times relative to those of a standard reference parser implemented in the same language and run on the same machine.
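
As an informal illustration of step (2), the short Python sketch below computes such a relative timing. The parser functions and their interface are hypothetical, standing in for whichever candidate and reference implementations are being compared; this is not part of any of the reference implementations described on this page.

    import time

    def total_parse_time(parse, sentences):
        # Total wall-clock time taken to parse a list of test sentences
        # with the given parse function (candidate or reference parser).
        start = time.perf_counter()
        for sentence in sentences:
            parse(sentence)
        return time.perf_counter() - start

    def relative_parse_time(candidate_parse, reference_parse, sentences):
        # Parse time of the candidate parser divided by that of the reference
        # parser on the same sentences and the same machine; values below 1.0
        # mean the candidate is faster than the reference implementation.
        return (total_parse_time(candidate_parse, sentences) /
                total_parse_time(reference_parse, sentences))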

Ideally, comparisons should be made on the basis of testing with several grammars, to guard against the possibility that the algorithm is (whether by accident or design) "tuned" to one of them. Also see Salzberg (1997) for pitfalls to avoid in comparative evaluation.

A number of large natural language grammars have been used in work of this kind, most being freely available. A few reference implementations are also available, although more are needed (contributions are welcome and can be emailed to John Carroll, University of Sussex). These resources are listed below, together with brief descriptions.


Context-Free Grammar (CFG)

Bob Moore has provided three CFGs (called CT, ATIS, and PT) that were used in a comparative evaluation of several parsers (Moore 2000). The grammars are independently motivated by analyses of natural-language corpora or actual applications of natural language processing. The CT grammar and lexicon were derived from a CFG compiled from a task-specific unification grammar written for CommandTalk, an SRI-developed spoken-language interface to a military simulation system. The ATIS grammar and lexicon were extracted from a treebank of the DARPA ATIS3 training sentences. The PT grammar was extracted from the Penn Treebank. The most significant variation among the grammars is the degree of ambiguity of the test sets associated with each grammar. The CT test set has 5.4 parses/sentence with the CT grammar, the ATIS test set has 940 parses/sentence with the ATIS grammar, and the PT test set has 7.2 × 10^27 parses/sentence with the PT grammar.

NEW! Reference implementations in Perl of four parsing algorithms (two left-corner parsing algorithms, a variant of the Cocke-Kasami-Younger algorithm, and Tomita's generalized LR parsing algorithm, in an LR(0) version) are also available. They can be downloaded from the Microsoft Research demos & downloads site (click on the item "Context-Free Parsing Algorithms").
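
For illustration, here is a minimal CKY-style recognizer in Python, written for this page rather than taken from the reference implementations above (which are in Perl). It assumes a grammar in Chomsky normal form; the same dynamic program also counts derivations, which is how ambiguity figures like those quoted for the CT, ATIS and PT test sets can in principle be obtained without enumerating individual parse trees. The toy grammar at the end is purely hypothetical.

    from collections import defaultdict

    def cky_parse_counts(words, lexicon, binary_rules, start="S"):
        # lexicon:      dict mapping a word to a set of lexical categories
        # binary_rules: dict mapping (B, C) to the set of categories A with rule A -> B C
        # Returns the number of parses of `words` rooted in `start`
        # (0 means the sentence is not in the language of the grammar).
        n = len(words)
        # chart[(i, j)] maps a category to its number of derivations over words[i:j]
        chart = defaultdict(lambda: defaultdict(int))
        for i, word in enumerate(words):
            for cat in lexicon.get(word, ()):
                chart[(i, i + 1)][cat] += 1
        for width in range(2, n + 1):
            for i in range(n - width + 1):
                j = i + width
                for k in range(i + 1, j):                      # split point
                    for b, b_count in chart[(i, k)].items():
                        for c, c_count in chart[(k, j)].items():
                            for a in binary_rules.get((b, c), ()):
                                chart[(i, j)][a] += b_count * c_count
        return chart[(0, n)][start]

    # A tiny hypothetical grammar, purely for illustration:
    lexicon = {"people": {"NP"}, "fish": {"V", "NP"}, "tanks": {"NP", "V"}}
    binary_rules = {("V", "NP"): {"VP"}, ("NP", "VP"): {"S"}}
    print(cky_parse_counts(["people", "fish", "tanks"], lexicon, binary_rules))   # prints 1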


Definite Clause Grammar (DCG)

The Alvey NL Tools (ANLT) contain a large, wide-coverage sentence grammar of English written in a unification-based metagrammatical formalism resembling GPSG. Carroll (1994) and van Noord (1997) describe parser evaluations using this grammar, compiled out into an object DCG and applied to a set of test sentences. The DCG comprises some 780 rules, with each category containing on average around 30 nodes. The test set consists of 129 relatively short sentences (a representative sample of the test suite originally written by the grammar developer), plus 100 longer sentences (constructed especially for parser efficiency testing).

The ANLT and van Noord's HDrug system contain parsers (in Lisp and Prolog respectively) that could be used as reference implementations.


Head-Driven Phrase Structure Grammar (HPSG)

The LinGO English grammar is a broad-coverage HPSG, containing roughly 8,000 types and 64 lexical and grammar rules, with an average feature structure size of around 300 nodes. Three main test sets have been used for parser evaluation with this grammar, the largest - of 2,100 items - having been extracted from VerbMobil corpora of transcribed speech and balanced with respect to sentence length. Evaluations using this grammar are reported in Natural Language Engineering 6(1) 2000.
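
To give a rough sense of why feature structure size matters for parsing cost, the Python sketch below (a toy illustration written for this page, not the data structures of any system mentioned here) encodes feature structures as nested dictionaries and unifies them recursively. It ignores types and reentrancy, which a real HPSG implementation must handle, but it shows that the work done at each rule application grows with the number of nodes in the structures being combined.

    def unify(fs1, fs2):
        # Non-destructive unification of two feature structures encoded as
        # nested dicts, with atomic values as strings. Returns the unified
        # structure, or None if the two structures are incompatible.
        if isinstance(fs1, str) or isinstance(fs2, str):
            return fs1 if fs1 == fs2 else None
        result = dict(fs1)
        for feature, value in fs2.items():
            if feature in result:
                sub = unify(result[feature], value)
                if sub is None:
                    return None          # clash on this feature: unification fails
                result[feature] = sub
            else:
                result[feature] = value
        return result

    # Hypothetical toy example: agreement features on a noun phrase
    np = {"CAT": "np", "AGR": {"NUM": "sg"}}
    subj = {"CAT": "np", "AGR": {"NUM": "sg", "PER": "3"}}
    print(unify(np, subj))   # {'CAT': 'np', 'AGR': {'NUM': 'sg', 'PER': '3'}}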

A freely-available reference implementation (in Lisp) is the LKB system parser.


Lexical Functional Grammar (LFG)

In the Xerox-led ParGram collaboration, wide-coverage grammars of English, French, German and a number of other languages are being developed in parallel in the LFG framework; all of the grammars are based on a common set of linguistic principles and a commonly agreed set of grammatical features. A previous version of the English grammar was used by Maxwell & Kaplan (1993) in experiments on parsing efficiency. Unfortunately, the grammars are not publicly distributed at present.


Lexicalized Tree Adjoining Grammar (LTAG)

The XTAG system grammar is a large-scale, feature-structure-based lexicalized tree adjoining grammar of English, developed at the University of Pennsylvania over the past ten years or so. The XTAG Tools contain a parser.


Maintained by John Carroll, University of Sussex. Last modified 23 Apr 03.