7.4.3 Command-Line Options

The adaptive top-down parser atd-parser supports the following command-line options:

--det-interim

Create a separate set of output files at each iteration of determinization of a top-down template grammar. The files have the suffix ‘.ITER’, where ITER is an iteration index. By default, the parser overwrites the same output files at each iteration.

--det-niter-goal=INT

The number of determinization iterations to perform. The actual number of iterations cannot exceed the number of terminal symbols removable from a top-down template grammar without reducing its terminal symbol coverage.

--det-niter-keep=INT

The number of most recent determinization iterations for which to keep aggregate statistics. If that number is equal to 1, use statistics from the last iteration only. If that number is greater than the number of iterations already performed, keep aggregate statistics for all iterations performed so far. The default value is the goal number of iterations.
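
The retention rule above can be pictured with a short Python sketch. It is only an illustration: the function name and the per-iteration dictionaries are hypothetical, and it assumes that "aggregate statistics" means summed frequency counts, which this section does not state.

```python
from collections import Counter, deque


def aggregate_last(per_iteration_counts, keep):
    """Sum the counts of the last `keep` iterations (hypothetical helper).

    Assumption: aggregation is a plain sum of per-iteration frequency counts.
    `keep` is expected to be at least 1.
    """
    # A bounded deque silently drops the oldest entries, so if fewer than
    # `keep` iterations exist, all of them are retained.
    recent = deque(per_iteration_counts, maxlen=keep)
    total = Counter()
    for counts in recent:
        total.update(counts)
    return total


history = [{"a": 1}, {"a": 2, "b": 1}, {"b": 5}]
last_two = aggregate_last(history, 2)
last_one = aggregate_last(history, 1)
everything = aggregate_last(history, 10)
```

With keep equal to 1 only the last iteration contributes, and with keep larger than the history length every performed iteration contributes.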

--det-niter-max=INT

Limit on the number of iterations of determinization of a top-down template grammar. The parser stops on reaching the limit. If the limit is less than the goal number of iterations, iterative determinization terminates prematurely. The special value 0 means no limit. The default value is 0.
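
As a rough illustration, the interaction between the goal number of iterations, the number of removable terminal symbols, and this limit might be sketched as follows. The function and parameter names are hypothetical, and the sketch assumes that the three constraints combine as a simple minimum; the parser's actual stopping logic may take other factors into account.

```python
def planned_iterations(goal, removable, limit=0):
    """Estimate how many determinization iterations would run.

    goal      -- value of --det-niter-goal
    removable -- number of removable terminal symbols (upper bound)
    limit     -- value of --det-niter-max, where 0 means no limit
    """
    n = min(goal, removable)
    if limit > 0:
        # A positive limit below the goal ends determinization early.
        n = min(n, limit)
    return n


bounded_by_symbols = planned_iterations(goal=10, removable=4)
bounded_by_limit = planned_iterations(goal=10, removable=40, limit=3)
unlimited = planned_iterations(goal=10, removable=40, limit=0)
```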

--det-ratio=FLOAT

Minimum ratio of the number of terminal symbols the parser attempts to remove from terminal symbol classes at each determinization iteration to the number of removable terminal symbols. The default value is 0.

--det-shape=linear|exp

The shape of a function for the number of terminal symbols removable from terminal symbol classes at a given iteration of determinization of a top-down template grammar:

linear

Remove approximately equal numbers of terminal symbols at each iteration.

exp

Remove more terminal symbols in early iterations and fewer terminal symbols in late iterations.

The default value is ‘linear’.
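
The difference between the two shapes can be illustrated with a small Python sketch. This is not the parser's formula: the function name is hypothetical, and the halving weights used for the exp shape are an assumption chosen only to show a schedule that removes more symbols early and fewer late.

```python
from itertools import accumulate


def removal_schedule(removable, niter, shape="linear"):
    """Split `removable` symbols over `niter` iterations (illustrative only)."""
    if shape == "linear":
        weights = [1.0] * niter
    elif shape == "exp":
        # Assumption: each iteration gets half the weight of the previous one.
        weights = [2.0 ** -i for i in range(niter)]
    else:
        raise ValueError(f"unknown shape: {shape!r}")

    cumulative = list(accumulate(weights))
    total = cumulative[-1]
    schedule, assigned = [], 0
    for c in cumulative:
        # Rounding cumulative targets keeps the schedule summing to `removable`.
        target = round(removable * (c / total))
        schedule.append(target - assigned)
        assigned = target
    return schedule


linear = removal_schedule(12, 4, "linear")
exponential = removal_schedule(15, 4, "exp")
```

In both cases the per-iteration counts add up to the number of removable symbols; only their distribution over the iterations differs.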

--eos-marker

Enable the end-of-stream marker $$. It is a special terminal symbol that a top-down template grammar consumes after consuming all terminal symbols from a training terminal symbol sequence or the current parse unit, before it finishes parsing a start nonterminal symbol. If this option is omitted, the parser stops processing a training terminal symbol sequence or the current parse unit when it attempts to consume the next terminal symbol and no terminal symbols remain.

--fq-span=window|total

Event history span for output frequencies of PCFG productions and terminal symbol n-grams:

window

Event history window.

total

Entire event history.

The default value is ‘total’.
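
The two spans can be contrasted with a brief Python sketch. The names below are hypothetical, and the sketch assumes that the event history window covers the most recent events only; this section does not define the window more precisely.

```python
from collections import Counter


def event_frequencies(events, span="total", window=0):
    """Count events over the whole history or over a trailing window.

    Assumption: the window consists of the `window` most recent events.
    """
    if span == "total":
        selected = events
    elif span == "window":
        selected = events[-window:] if window > 0 else []
    else:
        raise ValueError(f"unknown span: {span!r}")
    return Counter(selected)


events = ["A", "B", "A", "A", "C", "B"]
total_counts = event_frequencies(events, "total")
window_counts = event_frequencies(events, "window", window=3)
```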

--input-parse-units

Interpret an input terminal symbol sequence as consisting of parse units separated by empty lines. By default, interpret an input terminal symbol sequence as a single stream of terminal symbols.
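
The following Python sketch shows one way to split an input into parse units at empty lines. It is an illustration rather than the parser's reader: the function name is hypothetical, lines are kept as opaque strings because the tokenization of terminal symbols is not described here, and treating whitespace-only lines as empty is an assumption.

```python
def split_parse_units(text):
    """Group non-empty lines into parse units separated by empty lines."""
    units, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            # An empty line closes the unit collected so far;
            # consecutive empty lines do not create empty units.
            units.append(current)
            current = []
    if current:
        units.append(current)
    return units


units = split_parse_units("a b\nc\n\nd\n\n\ne f\n")
```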

--kt=FLOAT

The temperature of the environment state identification engine. The default value is 2.

-N, --npass=INT

The number of training passes. The parser reinitializes the environment state identification engine at each training pass and accumulates PCFG production frequencies for all training passes at a determinization iteration. Using multiple training passes usually requires a shorter training terminal symbol sequence to achieve the same quality of grammar synthesis. Performing multiple training passes in parallel is possible in principle but is not implemented. The default value is 1.

-n, --nstep=INT

The length of a training terminal symbol sequence. If an input terminal symbol sequence ends earlier, the parser processes it again from the beginning. For repeated processing to be correct, an input terminal symbol sequence should consist of a whole number of parse units, that is, it should end with a complete parse unit. In parse unit mode (see the option --input-parse-units), parsing finishes at a parse unit boundary. By default, the length of a training terminal symbol sequence is equal to the length of an input terminal symbol sequence.
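
The wrap-around behavior can be sketched in Python as follows. The function name is hypothetical, terminal symbols are represented as plain strings, and the sketch does not model parse unit mode, where parsing finishes at a parse unit boundary.

```python
from itertools import cycle, islice


def training_sequence(symbols, nstep=None):
    """Build a training sequence of `nstep` symbols from an input sequence.

    If the input is shorter than `nstep`, it is reread from the beginning.
    Without `nstep`, the training sequence equals the input sequence.
    """
    if nstep is None:
        return list(symbols)
    return list(islice(cycle(symbols), nstep))


wrapped = training_sequence(["a", "b", "c"], 7)
truncated = training_sequence(["a", "b", "c"], 2)
default = training_sequence(["a", "b", "c"])
```

The wrapped case shows why the input should end with a complete parse unit: the first symbol of the input directly follows the last one.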

--od[=FILE]

Dump a determinized top-down template grammar to a file or stdout. The grammar does not contain subexpressions starting with empty terminal symbol classes ‘[]’. This option turns on iterative determinization of a top-down template grammar.

--ode[=FILE]

Dump a determinized top-down template grammar to a file or stdout. The grammar can contain subexpressions starting with empty terminal symbol classes ‘[]’. This option turns on iterative determinization of a top-down template grammar.

--oe[=FILE]

Dump terminal symbol expansion sequences for nonterminal symbols of a top-down template grammar to a file or stdout.

--og[=FILE]

Dump an initial context-free grammar to a file or stdout.

--on[=FILE]

Dump terminal symbol n-grams to a file or stdout.

--oo[=FILE]

Dump parse statistics to a file or stdout. See “Parse Statistics Format of Top-Down Parser” for more information.

--op[=FILE]

Dump a learned PCFG to a file or stdout.

--or[=FILE]

Dump a residual regular expression grammar to a file or stdout.

--os[=FILE]

Dump a combined sequence consisting of pairs of terminal symbols from an actual (training) terminal symbol sequence and a predicted terminal symbol sequence to a file or stdout. See pcfg-predict-eval for the format of a combined sequence. This option turns on terminal symbol prediction mode.

--ot[=FILE]

Dump PCFG parse trees to a file or stdout.

--plain-pcfg

Remove output and put-back terminal symbols from a learned PCFG.

--predict

Predict the next symbol in a training terminal symbol sequence on processing each symbol from it. This option turns on terminal symbol prediction mode.

--profile-tol=FLOAT

Generate output signal choice trees with the specified maximum difference between the desired profile probabilities of output signals and their actual profile probabilities. The special value 0 means that Huffman trees are used as output signal choice trees. The default value is 0.005.

--recurs=left|right

Recursion type for context-free grammars to dump: left or right. The default value is ‘left’.

-i, --seed=INT

A seed for the pseudo-random number generator. A non-negative value switches parser operation to adaptive mode. A negative value switches parser operation to random mode. The default value is 0.

--simplify

Partially simplify a determinized regular expression grammar or a residual regular expression grammar, or fully simplify a learned PCFG. By default, do not simplify the grammars.

--stack-size=INT

Maximum nesting level of nonterminal symbols of a top-down template grammar while parsing. Parsing aborts on exceeding that level. The default value is 32.

--terse

Use a condensed format to dump productions of regular expression grammars. By default, dump the productions in an indented format.

--ww=INT

The length (width) of the cycle event history window and grammar event history window counted in terminal symbols. By default, the parser sets and adjusts that length automatically.