7.3.2 Parsing a Token Sequence

To parse a training terminal symbol (token) sequence read from a file SYM_SEQ_FILE using a template regular expression grammar read from a file REGEX_GRAM_FILE, use the command line

$ topdown -i random_seed [ -n seq_len ] [ --oo=LOG_FILE ]      \
          [ additional options ] REGEX_GRAM_FILE SYM_SEQ_FILE

The file SYM_SEQ_FILE contains (unquoted) terminal symbols separated by spaces and/or newlines.

See the descriptions of the -i random_seed and -n seq_len options later in this subsection.

If a filename REGEX_GRAM_FILE or SYM_SEQ_FILE is ‘-’, the parser reads file content from stdin. If a filename LOG_FILE is ‘-’, the parser writes file content to stdout.

When creating output files, the parser writes intermediate output to a file with the suffix ‘.tmp’ and renames this temporary file to the target file after it finishes writing the output. This approach avoids leaving output files with incomplete content if program execution aborts.
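
The write-then-rename pattern described above can be illustrated with a minimal Python sketch. This is not the parser's own code: the function name write_atomically is hypothetical, and the fsync call is an assumption about how one might make the data durable before the rename.

```python
import os
import tempfile


def write_atomically(target_path: str, content: str) -> None:
    """Write content to target_path via a '.tmp' file and a final rename."""
    tmp_path = target_path + ".tmp"  # suffix taken from the text above
    with open(tmp_path, "w", encoding="utf-8") as tmp_file:
        tmp_file.write(content)
        tmp_file.flush()
        os.fsync(tmp_file.fileno())  # assumption: flush data to disk first
    # On POSIX, os.replace is atomic when both paths are on one file system,
    # so a reader sees either no target file or a complete one.
    os.replace(tmp_path, target_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as work_dir:
        log_path = os.path.join(work_dir, "example.log")
        write_atomically(log_path, "example content\n")
        with open(log_path, encoding="utf-8") as log_file:
            print(log_file.read(), end="")
```

If the process dies before the rename, only the ‘.tmp’ file is left behind, and the target file is either absent or still holds its previous content.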

On passing the option --oo=LOG_FILE, the parser writes a log file LOG_FILE with a line like this:

$ topdown -i1 --oo=common_segm.log common_segm_2.rg common_segm.seq
$ cat common_segm.log
[0]: prob_gram 0.59900845, prob_term 0.66974218, prob_wpredict 0.85428619,
prob_npredict 0.84400000, cycle_period 81

The fields of this line (split into two lines in this example) have the following meaning:

[idx]

Template grammar index. Using multiple template grammars improves predicting the next terminal symbols in the training terminal symbol sequence. The options --predict and --os=FILE turn on predicting the next terminal symbols using an ensemble of template grammars. See Predicted Token Sequence for the description of this mode and a command-line format for specifying multiple template grammars.

prob_gram

The probability of parsing the training terminal symbol sequence by a learned grammar. That probability is the weighted probability of all productions in a full learned PCFG where weights are production frequencies.

prob_term

The weighted probability of productions containing terminal symbol sequences in a full learned PCFG where weights are production frequencies. The probability prob_gram is the weighted probability of prob_term and productions that do not represent terminal symbol sequences.

prob_wpredict

The sum of estimated probabilities of correctly predicted terminal symbols in the training terminal symbol sequence divided by the length of that sequence.

prob_npredict

The actual number of correctly predicted terminal symbols in the training terminal symbol sequence divided by the length of that sequence.

cycle_period

The average length of a cycle in which the same group of terminal symbols, terminal symbol classes, or ‘.’ from the template grammar parses the same symbol sequence from the training terminal symbol sequence. This average is equal to the total cycle length, counted in terminal symbols, divided by the number of cycles. The parser increments the total cycle length and the number of cycles only if the previous occurrence of that group parsed a different terminal symbol sequence.
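
A log record with these fields is straightforward to post-process. The following Python sketch parses one record into a dictionary. The function name parse_log_record is hypothetical, and the field layout is inferred only from the single example shown above, so other records may need adjustments.

```python
import re

# "[idx]:" followed by comma-separated "name value" pairs (layout assumed
# from the example record in this subsection).
LOG_PREFIX = re.compile(r"^\[(\d+)\]:\s*(.*)$")


def parse_log_record(record: str) -> dict:
    """Parse one log record into a dict mapping field names to numbers."""
    # Rejoin a record that was wrapped over several lines for display.
    joined = " ".join(record.split())
    match = LOG_PREFIX.match(joined)
    if match is None:
        raise ValueError(f"not a log record: {record!r}")
    fields = {"idx": int(match.group(1))}
    for item in match.group(2).split(","):
        item = item.strip()
        if not item:
            continue
        name, value = item.split()
        # Values with a decimal point become floats, the rest integers.
        fields[name] = float(value) if "." in value else int(value)
    return fields


if __name__ == "__main__":
    example = (
        "[0]: prob_gram 0.59900845, prob_term 0.66974218, "
        "prob_wpredict 0.85428619,\n"
        "prob_npredict 0.84400000, cycle_period 81"
    )
    for name, value in parse_log_record(example).items():
        print(name, value)
```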

To only check the correctness of a template grammar, use the command line

$ topdown REGEX_GRAM_FILE

The following command-line options are applicable to parsing a terminal symbol sequence:

--kt=FLOAT

The temperature of the environment state identification engine. The default value is 2.

-N, --npass=INT

The number of training passes. The parser reinitializes the environment state identification engine at each training pass and accumulates PCFG production frequencies over all training passes. With multiple training passes, a shorter training terminal symbol sequence is usually enough to achieve the same quality of grammar synthesis. Performing multiple training passes in parallel is possible in principle but is not implemented. The default value is 1.

-n, --nstep=INT

The length of the training terminal symbol sequence. If an input terminal symbol sequence specified by SYM_SEQ_FILE ends earlier, the parser processes this input sequence again from the beginning. For this repeated processing to be correct, the input sequence should consist of an integer number of parse units, that is, end with a complete parse unit. By default, the length of a training terminal symbol sequence is equal to the length of an input terminal symbol sequence.

-i, --seed=INT

A seed for the pseudo-random number generator. A non-negative value turns on adaptive parser operation. A negative value turns on random parser operation. The default value is 0.

--oo=FILE

The name of a log file to which the parser dumps parse statistics. The special name ‘-’ means to print the log to stdout.

--stack-size=INT

The maximum nesting level of nonterminal symbols in parse trees generated while parsing the training terminal symbol sequence. Parsing aborts on exceeding that level. The default value is 32.

--ww=INT

The length (width) of the cycle event history window and the grammar event history window, counted in terminal symbols. By default, the parser sets and adjusts that length automatically.
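
To illustrate how the -n, --nstep option relates the input sequence to the training sequence, here is a minimal Python sketch of the wrap-around behaviour. The function names read_symbols and training_sequence are hypothetical. The text above only describes the case where the input sequence is shorter than the requested length; stopping early when it is longer is an assumption of this sketch.

```python
from itertools import cycle, islice


def read_symbols(text: str) -> list:
    """Split file content into terminal symbols on spaces and newlines."""
    return text.split()


def training_sequence(input_symbols, seq_len=None):
    """Return the symbols processed for a given training sequence length."""
    if not input_symbols:
        raise ValueError("input sequence is empty")  # guard added for the sketch
    if seq_len is None:
        # Default: the training length equals the input length.
        seq_len = len(input_symbols)
    # Restart from the beginning whenever the input sequence runs out.
    return list(islice(cycle(input_symbols), seq_len))


if __name__ == "__main__":
    symbols = read_symbols("a b\nc")
    print(training_sequence(symbols))      # input used once
    print(training_sequence(symbols, 7))   # input repeated until 7 symbols
```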

Parsing aborts when the parser encounters an unexpected terminal symbol. In this case, the parser dumps a parse stack trace.

For example, using the template grammar

cat >unexpect.rg <<EOF

S: . A . ;
A: . . ( "a" B | "s" B ) ;
B: . . . ( C | D ) ;
C: "c" . . D ;
D: "d" . ;

EOF

to parse the sequence ‘a b c s e f g h i j’ results in the following stack trace:

$ topdown unexpect.rg - <<< 'a b c s e f g h i j'
STACK TRACE
#2 [0]  S: . >>> A <<< .
        a
#1 [1]  A: . . ("a" B | "s" >>> B <<<)
        b c s
#0 [4]  B: . . . (>>> C | D <<<)
        e f g
Unexpected terminal symbol [7]: h

The bottom line indicates the position (‘[7]’, counted from 0) of the unexpected symbol in the training terminal symbol sequence and the symbol itself (‘h’). Each pair of lines above the bottom line corresponds to one parse stack frame.

A number after ‘#’ indicates the index of a parse stack frame, where frame ‘#0’ is the current stack frame, frame ‘#1’ is the previous stack frame, and so on. A number in square brackets (e.g. ‘[4]’) after the index of a parse stack frame indicates the position of the first terminal symbol consumed while parsing the nonterminal symbol corresponding to that parse stack frame. The nonterminal symbol itself, terminated by ‘:’ (e.g. ‘B:’), follows this number. The regular expression for the nonterminal symbol (e.g. ‘. . . (>>> C | D <<<)’) comes next. In the regular expression, the markers ‘>>>’ and ‘<<<’ enclose the subexpression in which the parse error occurred.

The second line in a pair of lines for a parse stack frame lists terminal symbols (e.g. ‘e f g’) consumed while parsing a regular expression before encountering a parse error.
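
The stack trace layout described above is regular enough to parse mechanically. The following Python sketch does so. The function name parse_stack_trace and the dictionary keys are hypothetical, and the patterns are inferred only from the trace shown in this subsection (for instance, nonterminal names are assumed to consist of word characters).

```python
import re

FRAME_RE = re.compile(r"^#(\d+)\s+\[(\d+)\]\s+(\w+):\s*(.*)$")
LAST_RE = re.compile(r"^(?:Unexpected terminal symbol|Terminal symbol) \[(\d+)\]: (\S+)$")
MARK_RE = re.compile(r">>>\s*(.*?)\s*<<<")

SAMPLE = '''STACK TRACE
#2 [0]  S: . >>> A <<< .
        a
#1 [1]  A: . . ("a" B | "s" >>> B <<<)
        b c s
#0 [4]  B: . . . (>>> C | D <<<)
        e f g
Unexpected terminal symbol [7]: h
'''


def parse_stack_trace(text: str) -> dict:
    """Split a stack trace into its header, frames, and final symbol."""
    lines = text.strip().splitlines()
    trace = {"kind": lines[0].strip(), "frames": [], "position": None, "symbol": None}
    index = 1
    while index < len(lines):
        frame = FRAME_RE.match(lines[index])
        last = LAST_RE.match(lines[index])
        if frame:
            consumed = []
            # The second line of a pair is indented and lists consumed symbols.
            if index + 1 < len(lines) and lines[index + 1][:1].isspace():
                consumed = lines[index + 1].split()
                index += 1
            marked = MARK_RE.search(frame.group(4))
            trace["frames"].append({
                "frame": int(frame.group(1)),
                "start": int(frame.group(2)),
                "nonterminal": frame.group(3),
                "regex": frame.group(4).strip(),
                "marked": marked.group(1) if marked else None,
                "consumed": consumed,
            })
        elif last:
            trace["position"] = int(last.group(1))
            trace["symbol"] = last.group(2)
        index += 1
    return trace


if __name__ == "__main__":
    result = parse_stack_trace(SAMPLE)
    for entry in result["frames"]:
        print(entry["frame"], entry["nonterminal"], entry["start"], entry["consumed"])
    print(result["position"], result["symbol"])
```

In the sample, the start position of frame ‘#0’ plus the number of symbols it consumed (4 + 3) equals the position reported on the bottom line (7).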

A stack overflow produces a similar stack trace. For example, using the template grammar

cat >stack_overflow.rg <<EOF

S: . A ;
A: . . B ;
B: . . . C ;
C: . . . . A ;

EOF

to parse the sequence ‘a b c d e f g h i j k l m n’ with stack size 4 results in the following stack trace:

$ topdown --stack-size=4 stack_overflow.rg - <<< 'a b c d e f g h i j k l m n'
STACK OVERFLOW
#3 [0]  S: . >>> A <<<
        a
#2 [1]  A: . . >>> B <<<
        b c
#1 [3]  B: . . . >>> C <<<
        d e f
#0 [6]  C: . . . . >>> A <<<
        g h i j
Terminal symbol [10]: k

The bottom line indicates the position (‘[10]’) of the terminal symbol that led to creating a new parse stack frame, which in turn exceeded the maximum allowed number of parse stack frames. The terminal symbol itself (‘k’) follows that position.
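
The frame positions in this trace can be checked by hand with a small Python simulation. This is only a sketch for this particular grammar, in which every rule is a run of ‘.’ followed by one nonterminal. It is not the parser's algorithm, the names RULES and simulate_descent are hypothetical, and it does not model what the parser does at the end of the input.

```python
# Hand-encoded form of stack_overflow.rg:
# nonterminal -> (number of leading '.', nonterminal that follows them)
RULES = {"S": (1, "A"), "A": (2, "B"), "B": (3, "C"), "C": (4, "A")}


def simulate_descent(rules, start, symbols, stack_size):
    """Return (frames, overflow_position) for a chain of nested rules.

    frames is a list of (nonterminal, position of its first symbol);
    overflow_position is None if the nesting limit is never exceeded.
    """
    frames = []
    position = 0
    current = start
    while position < len(symbols):
        if len(frames) == stack_size:
            # Entering one more nonterminal would exceed the limit.
            return frames, position
        wildcards, following = rules[current]
        frames.append((current, position))
        position += wildcards  # each '.' consumes one terminal symbol
        current = following
    return frames, None


if __name__ == "__main__":
    symbols = "a b c d e f g h i j k l m n".split()
    frames, overflow_at = simulate_descent(RULES, "S", symbols, stack_size=4)
    for name, start_position in frames:
        print(name, start_position)
    if overflow_at is not None:
        print("overflow at", overflow_at, symbols[overflow_at])
```

With a stack size of 4, the simulation stops at the same position and symbol as the trace above, after entering S, A, B, and C at positions 0, 1, 3, and 6.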
