To learn a source PCFG by iterative determinization of a top-down template grammar converted from a bottom-up template grammar, use the commands:
abu-parser -iPOSITIVE_RANDOM_SEED [ -nTRAINING_SEQ_LENGTH ] --viterbi \
--det-niter-goal=NUMBER_OF_ITERATIONS --det-niter-keep=1 \
--randomize-det-iter --od=DETERMINIZED_TD_GRAMMAR_FILE \
--oo[=LOG_FILE] INPUT_TD_GRAMMAR_FILE INPUT_PARSE_UNITS_FILE
abu-parser --viterbi --ops=LEARNED_PCFG_FILE --oo --simplify \
DETERMINIZED_TD_GRAMMAR_FILE INPUT_PARSE_UNITS_FILE
To generate a random source PCFG for a top-down template grammar converted from a bottom-up template grammar, use the command:
abu-parser -iNEGATIVE_RANDOM_SEED --viterbi --ops=RANDOM_PCFG_FILE \
INPUT_TD_GRAMMAR_FILE INPUT_PARSE_UNITS_FILE
In the above command line formats:
INPUT_TD_GRAMMAR_FILE
    The name of a file containing an input top-down template grammar (see Top-Down Template Grammar) converted by the rege-bottom-up tool (see rege-bottom-up) from a bottom-up template grammar with source PCFG markup. The filename ‘-’ means stdin.
DETERMINIZED_TD_GRAMMAR_FILE
    The name of a file containing a determinized top-down template grammar. The filename ‘-’ means stdin or stdout.
INPUT_PARSE_UNITS_FILE
    The name of a file containing input parse units separated by empty lines. The filename ‘-’ means stdin. An input parse unit is a sequence of terminal symbol names separated by whitespace characters. A single newline character counts as whitespace inside a parse unit; two or more consecutive newline characters separate two parse units.
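The parse units format described above can be illustrated with a minimal Python sketch. It is not part of abu-parser; the function name `split_parse_units` and the terminal symbol names in the sample are hypothetical, and treating whitespace-only lines as empty lines is an assumption.

```python
def split_parse_units(text):
    """Split raw text into parse units, each a list of terminal symbol names."""
    units = []
    current = []
    for line in text.splitlines():
        tokens = line.split()  # any run of whitespace separates symbol names
        if tokens:
            # A single newline only continues the current parse unit.
            current.extend(tokens)
        elif current:
            # An empty line (assumption: also a whitespace-only line)
            # closes the current parse unit.
            units.append(current)
            current = []
    if current:
        units.append(current)
    return units


if __name__ == "__main__":
    # Hypothetical terminal symbol names, two parse units.
    sample = "det noun verb\nadj noun\n\n\nnoun verb\n"
    for unit in split_parse_units(sample):
        print(unit)
```

Several consecutive empty lines are treated here like a single separator, so no empty parse units are produced.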
See Command-Line Options for descriptions of the options and arguments mentioned in the command line formats.
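As a rough illustration of how the placeholders in the command line formats are filled in, the following Python sketch assembles the argument lists for the two learning commands and for the random PCFG command. It only builds and prints the commands and does not run abu-parser. The function names and all file names are hypothetical, and spelling a negative seed as `-i-7` is an assumption derived from the `-iNEGATIVE_RANDOM_SEED` format.

```python
import shlex


def learn_commands(seed, n_iterations, td_grammar, parse_units,
                   det_grammar, learned_pcfg, seq_length=None, log_file=None):
    """Build the two command lines used to learn a source PCFG."""
    if seed <= 0:
        raise ValueError("learning expects a positive random seed")
    first = ["abu-parser", f"-i{seed}"]
    if seq_length is not None:
        first.append(f"-n{seq_length}")  # optional training sequence length
    first += [
        "--viterbi",
        f"--det-niter-goal={n_iterations}",
        "--det-niter-keep=1",
        "--randomize-det-iter",
        f"--od={det_grammar}",
        "--oo" if log_file is None else f"--oo={log_file}",
        td_grammar,
        parse_units,
    ]
    second = ["abu-parser", "--viterbi", f"--ops={learned_pcfg}", "--oo",
              "--simplify", det_grammar, parse_units]
    return first, second


def random_pcfg_command(seed, td_grammar, parse_units, random_pcfg):
    """Build the command line that generates a random source PCFG."""
    if seed >= 0:
        raise ValueError("random PCFG generation expects a negative seed")
    return ["abu-parser", f"-i{seed}", "--viterbi", f"--ops={random_pcfg}",
            td_grammar, parse_units]


if __name__ == "__main__":
    # All file names below are placeholders chosen for this sketch.
    first, second = learn_commands(42, 50, "input.td", "units.txt",
                                   "det.td", "learned.pcfg", log_file="det.log")
    print(shlex.join(first))
    print(shlex.join(second))
    print(shlex.join(random_pcfg_command(-7, "input.td", "units.txt",
                                         "random.pcfg")))
```

Keeping the commands as argument lists avoids shell quoting problems if they are later passed to `subprocess.run`.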
When the option --oo[=FILE] is passed, the parser writes either overall parse statistics or parse statistics for each determinization iteration to a file or to stdout.
Overall parse statistics look like this:
p_td 0.53287683, p_so 0.26498707, p_rd 0.68541645, cp 17
The fields of this line have the following meanings:
p_td
    The probability of parsing the training parse units with a learned top-down PCFG. It is the weighted probability of all productions in the PCFG, where the weights are production frequencies.
p_so
    The probability of parsing the training parse units with a learned source PCFG. It is the weighted probability of all productions in the PCFG, where the weights are production frequencies.
p_rd
    The weighted probability of the productions with terminal symbols on the right-hand side in a learned top-down PCFG, where the weights are production frequencies. The probability p_td is the weighted probability of p_rd and of the productions with nonterminal symbols on the right-hand side in the learned top-down PCFG.
cp
    The average cycle length, counted in training parse units. It is equal to the total length of cycles divided by the number of cycles. A cycle here means leading out the same production of a source PCFG. On leading out a production of a source PCFG, the parser increments the total length of cycles and the number of cycles only if the right-hand side previously led out for the nonterminal symbol on the left-hand side of the production had a different index.
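For post-processing a log, the overall statistics line can be read into a dictionary. The sketch below is not part of abu-parser; the function name `parse_overall_stats` is hypothetical, and the field layout is inferred only from the sample line shown above.

```python
import re

# Field names taken from the sample line; the numeric format is an assumption.
FIELD_RE = re.compile(r"(p_td|p_so|p_rd|cp)\s+(\d+(?:\.\d+)?)")


def parse_overall_stats(line):
    """Return a dict mapping field names to numbers for one statistics line."""
    stats = {}
    for name, value in FIELD_RE.findall(line):
        # cp is printed as an integer in the sample; the rest are probabilities.
        stats[name] = int(value) if name == "cp" else float(value)
    return stats


if __name__ == "__main__":
    sample = "p_td 0.53287683, p_so 0.26498707, p_rd 0.68541645, cp 17"
    print(parse_overall_stats(sample))
```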
Parse statistics for a determinization iteration look like this:
Iteration 30 of 50 (60.0 %):
P: p_td 0.68479463, p_so 0.34150993, p_rd 0.93313912, cp 35
T: p_td 0.68479463, p_so 0.34150993, p_rd 0.93313912, cp 35
n_removable_rd 129, n_removed_rd 6(+0), iter_time 1s
The first line contains the ordinal number of the current iteration, the goal number of iterations, and the percentage of completed iterations. The line beginning with ‘P:’ (“Pass”) contains parse statistics for the current iteration. The line beginning with ‘T:’ (“Total”) contains aggregate parse statistics for the last iterations up to and including the current iteration. In the example, the lines beginning with ‘P:’ and ‘T:’ are equal because the option --det-niter-keep=1 was passed. See the table above for the description of the fields in these lines.
The fourth line contains the following fields:
n_removable_rd
    The number of terminal symbols that can be removed from terminal symbol classes of the top-down template grammar at the current determinization iteration without reducing the terminal symbol coverage of the grammar.
n_removed_rd
    The number (less than or equal to n_removable_rd) of terminal symbols actually removed from terminal symbol classes of the top-down template grammar at the current determinization iteration and, in parentheses, the number of implicitly removed terminal symbols. Implicitly removed terminal symbols are unreachable terminal symbols and terminal symbols not present in input parse units and a set of reachable output terminal symbols.
iter_time
    The approximate number of seconds the determinization iteration took to execute.
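A determinization iteration record can be read in a similar way, for example to track n_removed_rd across iterations. The following Python sketch is not part of abu-parser; the function name `parse_iteration` and the result keys are hypothetical, and the regular expressions are inferred only from the sample record shown above (including the assumption that the implicit count is always printed as `(+N)`).

```python
import re

HEADER_RE = re.compile(r"Iteration (\d+) of (\d+) \(([\d.]+) ?%\):")
STATS_RE = re.compile(r"(p_td|p_so|p_rd|cp) (\d+(?:\.\d+)?)")
# Captures the text of the 'P:' part and of the 'T:' part of a record.
PASS_TOTAL_RE = re.compile(r"P:(.*?)T:(.*?)n_removable_rd", re.S)
COUNTS_RE = re.compile(
    r"n_removable_rd (\d+), n_removed_rd (\d+)\(\+(\d+)\), iter_time (\d+)s")


def _stats(text):
    """Convert 'name value' pairs into a dict of numbers."""
    return {name: (int(value) if name == "cp" else float(value))
            for name, value in STATS_RE.findall(text)}


def parse_iteration(record):
    """Parse one iteration record (all of its lines joined in one string)."""
    header = HEADER_RE.search(record)
    parts = PASS_TOTAL_RE.search(record)
    counts = COUNTS_RE.search(record)
    if not (header and parts and counts):
        raise ValueError("not a determinization iteration record")
    return {
        "iteration": int(header.group(1)),
        "goal": int(header.group(2)),
        "percent": float(header.group(3)),
        "pass": _stats(parts.group(1)),
        "total": _stats(parts.group(2)),
        "n_removable_rd": int(counts.group(1)),
        "n_removed_rd": int(counts.group(2)),
        "n_implicitly_removed_rd": int(counts.group(3)),
        "iter_time_s": int(counts.group(4)),
    }


if __name__ == "__main__":
    sample = (
        "Iteration 30 of 50 (60.0 %):\n"
        "P: p_td 0.68479463, p_so 0.34150993, p_rd 0.93313912, cp 35\n"
        "T: p_td 0.68479463, p_so 0.34150993, p_rd 0.93313912, cp 35\n"
        "n_removable_rd 129, n_removed_rd 6(+0), iter_time 1s\n"
    )
    print(parse_iteration(sample))
```

Because the patterns search the whole record, the sketch works whether the record is spread over several lines or joined into one.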