To learn a PCFG and residual top-down grammar by iterative determinization of a top-down template grammar, use the commands:
atd-parser -iPOSITIVE_RANDOM_SEED [ -nTRAINING_SEQ_LENGTH ] \
--det-niter-goal=NUMBER_OF_ITERATIONS --od=DETERMINIZED_TD_GRAMMAR_FILE \
--oo[=LOG_FILE] INPUT_TD_GRAMMAR_FILE INPUT_SEQ_FILE
atd-parser --op=LEARNED_PCFG_FILE --oo --simplify \
DETERMINIZED_TD_GRAMMAR_FILE INPUT_SEQ_FILE
atd-parser --or=RESIDUAL_TD_GRAMMAR_FILE [ --simplify ] \
DETERMINIZED_TD_GRAMMAR_FILE INPUT_SEQ_FILE
In the above command line formats:
INPUT_TD_GRAMMAR_FILE
    The name of a file containing an input top-down template grammar. See Top-Down Template Grammar for more information. The filename ‘-’ means stdin.
DETERMINIZED_TD_GRAMMAR_FILE
    The name of a file containing a determinized top-down template grammar. The filename ‘-’ means stdin or stdout.
INPUT_SEQ_FILE
    The name of a file containing a sequence of terminal symbol names separated by whitespace characters. The filename ‘-’ means stdin.
See Command-Line Options for descriptions of the options and option arguments that appear in the command line formats.
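The three command lines above can also be assembled by a script. The following is a minimal Python sketch, not part of atd-parser: the helper name build_commands, all file names, the seed 1, and the iteration count 50 are hypothetical, and only options that appear in the command line formats above are used.

```python
def build_commands(td_grammar, seq_file, det_grammar, pcfg_file,
                   residual_file, seed=1, niter=50, log_file=None):
    """Return the three atd-parser argument lists (hypothetical helper)."""
    # --oo takes an optional log file argument: --oo or --oo=LOG_FILE.
    oo = "--oo" if log_file is None else f"--oo={log_file}"
    # Step 1: determinize the input top-down template grammar.
    determinize = [
        "atd-parser", f"-i{seed}",
        f"--det-niter-goal={niter}", f"--od={det_grammar}",
        oo, td_grammar, seq_file,
    ]
    # Step 2: learn a PCFG from the determinized grammar.
    learn_pcfg = [
        "atd-parser", f"--op={pcfg_file}", "--oo", "--simplify",
        det_grammar, seq_file,
    ]
    # Step 3: write the residual top-down grammar.
    residual = [
        "atd-parser", f"--or={residual_file}",
        det_grammar, seq_file,
    ]
    return determinize, learn_pcfg, residual


# All file names below are placeholders chosen for illustration.
commands = build_commands("input.rg", "train.seq", "det.rg",
                          "learned.pcfg", "residual.rg", seed=1, niter=50)
for argv in commands:
    print(" ".join(argv))
```

Each list can be passed to subprocess.run on a system where atd-parser is installed; the sketch itself only builds and prints the argument lists.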
When the option --oo[=FILE] is passed, the parser writes either overall parse statistics or parse statistics for each determinization iteration to a file or to stdout.
Overall parse statistics look like this:
[0]: p_td 0.29952870, p_rd 0.71248166, p_wp 0.56714492, p_np 0.56526226, cp 893
The fields of this line have the following meanings:
[idx]
    Top-down template grammar index.
    It is possible to supply a list of top-down template grammars to the parser. In this case, the options --predict and --os[=FILE] turn on prediction of the next terminal symbols using an ensemble of top-down template grammars, and the overall parse statistics contain one line for each top-down template grammar in the list. These lines are followed by a line similar to this one:
    p_epredict 0.27998545
    This line contains the observed probability of a terminal symbol being correctly predicted by the ensemble of top-down template grammars.
p_td
    The probability of parsing a training terminal symbol sequence by a learned PCFG. It is the weighted probability of all productions in the PCFG, where the weights are production frequencies.
p_rd
    The weighted probability of productions with terminal symbols on the right-hand side in a learned PCFG, where the weights are production frequencies.
    The probability p_td is the weighted probability of p_rd and of productions with nonterminal symbols on the right-hand side.
p_wp
    The sum of the estimated probabilities of terminal symbols correctly predicted in a training terminal symbol sequence, divided by the sequence length.
p_np
    The actual number of terminal symbols correctly predicted in a training terminal symbol sequence, divided by the sequence length.
cp
    The average cycle length, counted in terminal symbols consumed from a training terminal symbol sequence. The average cycle length equals the total length of cycles divided by the number of cycles. A cycle here is parsing the same training terminal symbol subsequence by the same group of terminal symbols, terminal symbol classes, and ‘.’ contained in a top-down template grammar. The parser increments the total length of cycles and the number of cycles only if the previous occurrence of that group parsed a different training terminal symbol subsequence.
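For post-processing a log, a line of overall parse statistics can be split into its fields. The following is a minimal Python sketch, assuming the line has exactly the layout of the sample shown above; the function name parse_overall_line is hypothetical and is not part of atd-parser.

```python
import re

# Matches "name value" pairs such as "p_td 0.29952870" or "cp 893".
FIELD_RE = re.compile(r"(\w+)\s+([0-9.]+)")


def parse_overall_line(line):
    """Parse one overall-statistics line into (grammar index, field dict)."""
    # Everything before the first ':' is the bracketed grammar index.
    head, _, rest = line.partition(":")
    idx = int(head.strip().strip("[]"))
    fields = {}
    for name, value in FIELD_RE.findall(rest):
        # cp appears as an integer in the sample; the p_* fields are
        # probabilities and are kept as floats.
        fields[name] = int(value) if name == "cp" else float(value)
    return idx, fields


sample = ("[0]: p_td 0.29952870, p_rd 0.71248166, "
          "p_wp 0.56714492, p_np 0.56526226, cp 893")
idx, stats = parse_overall_line(sample)
print(idx, stats)
```

The sketch does not handle the p_epredict line that follows the per-grammar lines when a list of grammars is supplied.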
Parse statistics for a determinization iteration look like this:
Iteration 24 of 50 (48.0 %):
P: p_td 0.21248701, p_rd 0.33389827, p_wp 0.67163360, p_np 0.63650000, cp 709
T: p_td 0.09378993, p_rd 0.11626595, p_wp 0.63718393, p_np 0.59475417, cp 728
n_removable_rd 1283, n_removed_rd 48(+6), iter_time 1s
The first line contains the ordinal number of the current iteration, the goal number of iterations, and the percentage of completed iterations. The line beginning with ‘P:’ (“Pass”) contains parse statistics for the current iteration. The line beginning with ‘T:’ (“Total”) contains parse statistics aggregated over the most recent iterations up to the current iteration. The option --det-niter-keep=INT specifies how many of the most recent iterations are aggregated. See the table above for descriptions of the fields in the lines beginning with ‘P:’ and ‘T:’.
The fourth line contains the following fields:
n_removable_rd
    The number of terminal symbols that can be removed from terminal symbol classes of a top-down template grammar at the current determinization iteration without reducing the terminal symbol coverage of the top-down template grammar.
n_removed_rd
    The number (less than or equal to n_removable_rd) of terminal symbols actually removed from terminal symbol classes of a top-down template grammar at the current determinization iteration and, in round brackets, the number of implicitly removed terminal symbols.
    Implicitly removed terminal symbols are unreachable terminal symbols and terminal symbols not present in an input terminal symbol sequence and a set of reachable output terminal symbols.
iter_time
    The approximate number of seconds the determinization iteration took to execute.
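The per-iteration record described above can be read back from a log with a few regular expressions. The following is a minimal Python sketch, assuming the record has the same fields and order as the sample shown above; the function name parse_iteration and the dictionary keys are hypothetical.

```python
import re

# The five statistics fields shared by the "P:" and "T:" lines.
STAT_RE = re.compile(r"\b(p_td|p_rd|p_wp|p_np|cp)\s+([0-9.]+)")


def parse_iteration(text):
    """Parse one per-iteration statistics record into a dict."""
    head = re.search(r"Iteration\s+(\d+)\s+of\s+(\d+)", text)
    removable = re.search(r"n_removable_rd\s+(\d+)", text)
    # n_removed_rd is printed as "48(+6)": explicit count, then implicit.
    removed = re.search(r"n_removed_rd\s+(\d+)\(\+(\d+)\)", text)
    seconds = re.search(r"iter_time\s+(\d+)s", text)
    record = {
        "iteration": int(head.group(1)),
        "goal": int(head.group(2)),
        "n_removable_rd": int(removable.group(1)),
        "n_removed_rd": int(removed.group(1)),
        "n_implicitly_removed_rd": int(removed.group(2)),
        "iter_time": int(seconds.group(1)),
    }
    # Fields before the "T:" marker belong to "P:", the rest to "T:".
    pass_part, _, total_part = text.partition("T:")
    record["P"] = {n: float(v) for n, v in STAT_RE.findall(pass_part)}
    record["T"] = {n: float(v) for n, v in STAT_RE.findall(total_part)}
    return record


sample = """Iteration 24 of 50 (48.0 %):
P: p_td 0.21248701, p_rd 0.33389827, p_wp 0.67163360, p_np 0.63650000, cp 709
T: p_td 0.09378993, p_rd 0.11626595, p_wp 0.63718393, p_np 0.59475417, cp 728
n_removable_rd 1283, n_removed_rd 48(+6), iter_time 1s
"""
rec = parse_iteration(sample)
print(rec["iteration"], rec["goal"], rec["P"]["p_np"], rec["T"]["p_np"])
```

Because the sketch searches the whole record rather than individual lines, it does not depend on where the line breaks fall.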
The parser dumps a stack trace on stack overflow or on encountering an unexpected terminal symbol.
Example for an Unexpected Terminal Symbol
Using the file unexpect.rg containing the top-down template grammar
S: . A . ;
A: . . ( "a" B | "s" B ) ;
B: . . . ( C | D ) ;
C: "c" . . D ;
D: "d" . ;
to parse the terminal symbol sequence ‘a b c s e f g h i j’ results in the following stack trace:
$ atd-parser unexpect.rg - <<< 'a b c s e f g h i j'
STACK TRACE
#2 [0] S: . >>> A <<< .
"a"
#1 [1] A: . . ("a" B | "s" >>> B <<<)
"b" "c" "s"
#0 [4] B: . . . (>>> C | D <<<)
"e" "f" "g"
Unexpected terminal symbol [7]: "h"
The bottom line of the stack trace gives the offset (‘[7]’) of the unexpected terminal symbol in the training terminal symbol sequence, followed by the unexpected terminal symbol itself ("h").
The pairs of lines above the bottom line correspond to parse stack frames.
The number after ‘#’ is the index of a parse stack frame: frame ‘#0’ is the current stack frame, frame ‘#1’ is the previous stack frame, and so on. The number in square brackets (e.g. ‘[4]’) after the frame index is the offset in the training terminal symbol sequence at which parsing of a nonterminal symbol began. The nonterminal symbol itself, followed by ‘:’ (e.g. ‘B:’), comes after the number in square brackets, and the regular expression for that nonterminal symbol (e.g. ‘. . . (>>> C | D <<<)’) follows it. In the regular expression, the markers ‘>>>’ and ‘<<<’ enclose the subexpression in which the parse error occurred.
The second line in a pair of lines for a parse stack frame lists terminal symbols (e.g. ‘"e" "f" "g"’) consumed while parsing the regular expression before encountering a parse error.
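The layout described above is regular enough to be read by a script. The following is a minimal Python sketch, assuming each parse stack frame occupies exactly two lines (a header line, then the consumed terminal symbols) and that the trace ends with a bottom line of the form shown in the example; the function name parse_stack_trace and the dictionary keys are hypothetical.

```python
import re

# "#<frame index> [<offset>] <nonterminal>: <regular expression>"
FRAME_RE = re.compile(r"#(\d+)\s+\[(\d+)\]\s+(\w+):\s+(.*)")
# Bottom line, e.g.: Unexpected terminal symbol [7]: "h"
BOTTOM_RE = re.compile(r'erminal symbol \[(\d+)\]: "([^"]*)"')


def parse_stack_trace(text):
    """Return (frames, (offset, symbol)) parsed from a stack trace dump."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    frames, bottom = [], None
    i = 0
    while i < len(lines):
        frame = FRAME_RE.match(lines[i])
        if frame and i + 1 < len(lines):
            # The subexpression between '>>>' and '<<<' holds the error.
            marked = re.search(r">>>\s*(.*?)\s*<<<", frame.group(4))
            frames.append({
                "index": int(frame.group(1)),
                "offset": int(frame.group(2)),
                "nonterminal": frame.group(3),
                "marked": marked.group(1) if marked else None,
                # The line after a frame header lists consumed terminals.
                "consumed": re.findall(r'"([^"]*)"', lines[i + 1]),
            })
            i += 2
            continue
        found = BOTTOM_RE.search(lines[i])
        if found:
            bottom = (int(found.group(1)), found.group(2))
        i += 1
    return frames, bottom


trace = '''STACK TRACE
#2 [0] S: . >>> A <<< .
"a"
#1 [1] A: . . ("a" B | "s" >>> B <<<)
"b" "c" "s"
#0 [4] B: . . . (>>> C | D <<<)
"e" "f" "g"
Unexpected terminal symbol [7]: "h"
'''
frames, bottom = parse_stack_trace(trace)
for f in frames:
    print(f["index"], f["offset"], f["nonterminal"], f["marked"], f["consumed"])
print(bottom)
```

The bottom-line pattern omits the leading letter of "terminal" so that it matches either capitalization.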
Example for Stack Overflow
Using the file stack_overflow.rg containing the top-down template grammar
S: . A ;
A: . . B ;
B: . . . C ;
C: . . . . A ;
to parse the sequence ‘a b c d e f g h i j k l m n’ with stack size 4 results in the following stack trace:
$ atd-parser --stack-size=4 stack_overflow.rg - <<< 'a b c d e f g h i j k l m n'
STACK OVERFLOW
#3 [0] S: . >>> A <<<
"a"
#2 [1] A: . . >>> B <<<
"b" "c"
#1 [3] B: . . . >>> C <<<
"d" "e" "f"
#0 [6] C: . . . . >>> A <<<
"g" "h" "i" "j"
Terminal symbol [10]: "k"
The bottom line of the stack trace gives the offset (‘[10]’) in the training terminal symbol sequence of the terminal symbol that caused a new parse stack frame to be created, which exceeded the maximum allowed number of parse stack frames.
The terminal symbol itself ("k") follows the offset.
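In both examples above, each offset in square brackets equals the previous frame's offset plus the number of terminal symbols that frame consumed. The following Python sketch reproduces that arithmetic. It is an observation about these two traces only, not a documented guarantee of the parser, and the function name running_offsets is hypothetical.

```python
def running_offsets(consumed_counts, start=0):
    """Accumulate per-frame consumed-symbol counts into sequence offsets."""
    offsets = [start]
    for count in consumed_counts:
        offsets.append(offsets[-1] + count)
    return offsets


# Stack overflow example: frames S, A, B, C consumed 1, 2, 3, 4 symbols.
overflow = running_offsets([1, 2, 3, 4])
print(overflow)    # [0, 1, 3, 6, 10]: frame offsets, then the offset of "k"

# Unexpected terminal symbol example: frames S, A, B consumed 1, 3, 3 symbols.
unexpected = running_offsets([1, 3, 3])
print(unexpected)  # [0, 1, 4, 7]: frame offsets, then the offset of "h"
```

The last element of each list matches the offset reported on the bottom line of the corresponding stack trace.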