Next: , Previous: , Up: topdown   [Contents][Index]


7.3.4 Iterative Determinization

Iterative grammar determinization is a process of learning a partially or fully deterministic regular expression grammar by iterative removing terminal symbols from look-ahead terminal symbol classes in a template regular expression grammar.

While removing the terminal symbols, the parser preserves terminal symbol coverage for the unions of look-ahead terminal symbol classes at the beginnings of control transfer branches—if removing a terminal symbol results in lesser coverage, the parser skips the removal. Providing the same terminal symbol coverage ensures that parsing never aborts on an unexpected terminal symbol when using a learned grammar to parse any terminal symbol sequence conforming to the template grammar.

To iteratively determinize a template regular expression grammar, use the command line

$ topdown -i random_seed [ -n seq_len ] [ --det-niter-goal=num_iters ]  \
          --od=DET_GRAM_FILE --oo=LOG_FILE                              \
          [ additional options ] REGEX_GRAM_FILE SYM_SEQ_FILE

The parameter DET_GRAM_FILE specifies an output file with a determinized regular expression grammar. See Parsing a Token Sequence for the description of REGEX_GRAM_FILE, SYM_SEQ_FILE, LOG_FILE, random_seed, and seq_len parameters.

The parameter num_iters specifies the number of determinization iterations to perform. The actual number of iterations is less than num_iters if the number of removable terminal symbols is less than num_iters. If the number of removable terminal symbols is greater than num_iters, the parser removes multiple terminal symbols on some or all iterations. At each iteration, the parser processes a training terminal symbol sequence from the beginning to end.

Note: determinizing a template regular expression grammar containing ‘*’ quantifiers might not lead to acceptable results.

At every determinization iteration, the parser dumps to a log file a block of lines like this:

Iteration 3:
P: prob_gram ..., prob_term ..., prob_wpredict ..., prob_npredict ..., cycle_period 97
T: prob_gram ..., prob_term ..., prob_wpredict ..., prob_npredict ..., cycle_period 87
n_term_removable 8, n_term_removed 1, iter_time 0s

An iteration index (e.g. 3) goes after ‘Iteration’.

A line beginning with ‘P: ’ (“Pass”) contains parse statistics collected for a current iteration. A line beginning with ‘T: ’ (“Total”) contains aggregate parse statistics for all iterations up to the current iteration. See Parsing a Token Sequence for the description of fields of both lines.

The parameter ‘n_term_removable’ specifies the total number (e.g. 8) of symbols removable from terminal symbol classes in a current determinized regular expression grammar without reducing its terminal symbol coverage. The parameter ‘n_term_removed’ less than or equal to ‘n_term_removable’ specifies the number of terminal symbols actually removed at this iteration. The parameter ‘iter_time’ is a rounded number of seconds (e.g. ‘0s’) the iteration was executing.

Command-line options applicable to iterative template grammar determinization are the following:

--det-fixed=INT

The minimum number of symbols to attempt to remove from terminal symbol classes at each iteration. The default value is 1.

--det-interim

Create a separate set of output files at each iteration. The files have the suffix ‘.ITER’, where ITER is an iteration index. By default, the parser overwrites output files for a current determinized grammar at each iteration.

--det-niter-goal=INT

The number of iterations to perform. The actual number of iterations cannot exceed the number of terminal symbols removable from the template regular expression grammar without reducing its terminal symbol coverage. At each iteration, the parser tries to remove approximately equal numbers of terminal symbols with the goal to remove all unnecessary terminal symbols by the end of last iteration.

--det-niter-max=INT

Limit on the number of iterations. The parser stops after performing a specified number of iterations. Special value 0 means no limit. The default value is 0.

--det-ratio=FLOAT

Minimum ratio for the number of symbols to attempt to remove at each iteration relative to the number of removable terminal symbols. The default value is 0.

--od=FILE

Write a determinized regular expression grammar to FILE at the end of each iteration. On passing the option --det-interim, create a file with an iteration index suffix. If FILE is ‘-’, write the grammar to stdout. The grammar does not include branches disabled using the empty terminal symbol class ‘[]’. This option turns on iterative grammar determinization mode.

--ode=FILE

Dump a determinized regular expression grammar containing branches disabled using the empty terminal symbol class ‘[]’. In other respects, this option is similar to the option --od=FILE and turns on iterative grammar determinization mode.

--qd[=NONT]

Dump a production for a nonterminal symbol NONT at the left-hand side to a file specified by the option --od=FILE. You can pass multiple options --qd=NONT to dump productions for multiple nonterminal symbols in a determinized regular expression grammar. If the option --od=FILE not supplied, dump requested productions to stdout. If NONT not supplied, dump the entire determinized regular expression grammar. This option turns on iterative grammar determinization mode.

--qde[=NONT]

This option is similar to --qd[=NONT] but corresponds to the option --ode=FILE for dumping productions containing branches disabled using the empty terminal symbol class ‘[]’. This option turns on iterative grammar determinization mode.

The following command-line options are also applicable to other parser operation modes:

--simplify

Partially simplify a determinized regular expression grammar or residual regular expression grammar (see Residual Regex Grammar) or fully simplify a final PCFG (see Learned PCFG). By default, do not simplify those grammars.

--terse

Dump regular expressions in the productions of a determinized regular expression grammar or residual regular expression grammar (see Residual Regex Grammar) in condensed format. By default, dump those regular expressions in indented format.


Next: , Previous: , Up: topdown   [Contents][Index]