pcfg-predict-eval ¶This tool evaluates the accuracy of prediction of every next symbol in a terminal symbol sequence by comparing the number of correctly predicted terminal symbols with a calculated upper bound on the number of terminal symbols predictable using a specified PCFG.
The file predict_eval.pcfg contains the following PCFG:
$ cat predict_eval.pcfg S: "a" B "d" ; B: "b" | "c" "c" ;
Suppose, this PCFG has generated the following terminal symbol sequence stored in the file actual.seq:
$ cat actual.seq a c c d a b d
In this terminal symbol sequence, an upper bound on the number of correctly predictable terminal symbols is 6, as all 4 terminal symbols outside of subsequences ‘c c’ and ‘b’ are constant according to the PCFG, the sum of average numbers of correctly predictable terminal symbols at the second and sixth positions is 0.5+0.5=1, and the second ‘c’ always follows the first ‘c’ thereby adding 1 to the number of correctly predictable terminal symbols.
An upper bound on the number of correctly predictable terminal symbols divided by sequence length is prob_wpredict_max=6.0/7=0.85714286.
While reading the terminal symbol sequence, a test program was trying to predict the next terminal symbol the PCFG generates. Suppose, the test program has predicted the following terminal symbol sequence and stored it in the file predicted.seq:
$ cat predicted.seq a c b d a d d
In this terminal symbol sequence, the number of correctly predicted terminal symbols is 5, because the terminal symbols ‘b’ and ‘d’ at the third and sixth positions differ from the terminal symbols ‘c’ and ‘b’ at the same positions in the actual terminal symbol sequence in the file actual.seq, and terminal symbols at all other positions are equal.
The number of correctly predicted terminal symbols divided by sequence length is prob_npredict_actual=5.0/7=0.71428571.
The following command calculates and prints the aforementioned numbers prob_wpredict_max and prob_npredict_actual:
$ pcfg-predict-eval predict_eval.pcfg actual.seq predicted.seq
{
"seq_len" : 7,
"wpredict_max" : 6.00000000,
"npredict_actual" : 5,
"prob_wpredict_max" : 0.85714286,
"prob_npredict_actual" : 0.71428571
}
Use the command line formats:
pcfg-predict-eval [options] PCFG_FILE ACTUAL_SEQ_FILE PREDICTED_SEQ_FILE pcfg-predict-eval [options] PCFG_FILE COMBINED_SEQ_FILE
where:
PCFG_FILEA PCFG (see PCFG Format) without left recursion. Along with a terminal symbol sequence generated according to the PCFG, they specify an upper bound on the number of correctly predictable terminal symbols. The tool may hang if the PCFG has an empty production.
ACTUAL_SEQ_FILEAn actual terminal symbol sequence.
It is a terminal symbol sequence produced by a PCFG in PCFG_FILE.
The sequence consists of (unquoted) terminal symbol names separated by spaces or single newline characters.
Multiple consecutive newline characters separate parse units.
PREDICTED_SEQ_FILEA predicted terminal symbol sequence for calculating prediction accuracy.
The sequence consists of (unquoted) terminal symbol names separated by spaces or single newline characters.
Multiple consecutive newline characters separate parse units.
The number of terminal symbols in the sequence must be equal to the number of terminal symbols in a sequence in ACTUAL_SEQ_FILE.
Parse unit segmentations in the two sequences must be at the same terminal symbol positions.
Specify this argument equal to ACTUAL_SEQ_FILE to only calculate an upper bound on the number of terminal symbols correctly predictable in a terminal symbol sequence.
COMBINED_SEQ_FILEA terminal symbol sequence combined from an actual terminal symbol sequence and predicted terminal symbol sequence. The combined sequence consists of terminal symbol pairs separated by spaces or single newline characters. Multiple consecutive newline characters separate parse units.
The first element of a pair is a quoted actual terminal symbol. The second element of a pair is a quoted predicted terminal symbol or ‘?’ if a predicted terminal symbol is unknown. Use the escape character ‘\’ to insert ‘"’ or ‘\’ into a quoted terminal symbol name. Allowed delimiters between an actual terminal symbol name and predicted terminal symbol name:
Any actual terminal symbol and predicted terminal symbol.
The actual terminal symbol equal to the predicted terminal symbol.
The actual terminal symbol different from the predicted terminal symbol.
The fields of JSON output have the following meaning:
seq_lenThe number of terminal symbols in the actual or predicted terminal symbol sequence.
wpredict_maxAn upper bound on the number of correctly predictable terminal symbols.
npredict_actualThe number of correctly predicted terminal symbols contained in the predicted terminal symbol sequence.
npredict_ngramThe number of terminal symbols correctly predicted in the actual terminal symbol sequence by applying the n-gram approach. The tool prints this field on passing the option --ngrams.
wpredict_randThe number of correctly predicted terminal symbols in random mode.
The tool prints this field on passing the option --prob-rand=FLOAT.
This field is equal to the argument of that option multiplied by the field seq_len.
prob_wpredict_maxThis field is equal to wpredict_max/seq_len.
prob_npredict_actualThis field is equal to npredict_actual/seq_len.
prob_npredict_ngramThis field is equal to npredict_ngram/seq_len.
The tool prints the field on passing the option --ngrams.
prob_npredict_randThis field is equal to the argument of --prob-rand=FLOAT option. The tool prints the field on passing this option.
efficiency_ngram, %This field is equal to (npredict_actual-npredict_ngram)*100/(wpredict_max-npredict_ngram) except for the following special cases: if npredict_actual<npredict_ngram, the field is equal to 0; if npredict_actual>wpredict_max, the field is equal to 100.
The program prints the field on passing the option --ngrams.
efficiency_rand, %This field is equal to (prob_npredict_actual-prob_npredict_rand)*100/(prob_wpredict_max-prob_npredict_rand) except for the following special cases: if prob_npredict_actual<prob_npredict_rand, the field is equal to 0; if prob_npredict_actual>prob_wpredict_max, the field is equal to 100.
The program prints the field on passing the option --prob-rand=FLOAT.
The option --intermediate dumps details about calculating an upper bound on the number of correctly predictable terminal symbols while processing every terminal symbol from an actual and predicted terminal symbol sequence.
Example
$ pcfg-predict-eval --intermediate predict_eval.pcfg actual.seq predicted.seq
step 0: actual "a", predicted "a", prob_max 1.00000000, npredict 1, wpredict_max 1.00000000
step 1: actual "c", predicted "c", prob_max 0.50000000, npredict 2, wpredict_max 1.50000000
step 2: actual "c", predicted "b", prob_max 1.00000000, npredict 2, wpredict_max 2.50000000
step 3: actual "d", predicted "d", prob_max 1.00000000, npredict 3, wpredict_max 3.50000000
step 4: actual "a", predicted "a", prob_max 1.00000000, npredict 4, wpredict_max 4.50000000
step 5: actual "b", predicted "d", prob_max 0.50000000, npredict 4, wpredict_max 5.00000000
step 6: actual "d", predicted "d", prob_max 1.00000000, npredict 5, wpredict_max 6.00000000
{
"seq_len" : 7,
"wpredict_max" : 6.00000000,
"npredict_actual" : 5,
"prob_wpredict_max" : 0.85714286,
"prob_npredict_actual" : 0.71428571
}
For every pair of terminal symbols read from the actual and predicted terminal symbol sequence, the tool prints a line with the following fields:
stepThe index (position) of a terminal symbol in the actual or predicted terminal symbol sequence. Indices start with 0.
actualA terminal symbol read from the actual terminal symbol sequence.
predictedA terminal symbol read from the predicted terminal symbol sequence.
prob_maxThe probability of correct prediction of a terminal symbol in the field actual based on the input PCFG and previous terminal symbols read from the actual terminal symbol sequence.
npredictThe number of times terminal symbols in the fields actual and predicted were equal (up to the current line).
wpredict_maxThe sum of prob_max fields up to the current line.
The option --dump-stacks along with --intermediate prints survived stacks of expanding the start nonterminal symbol of an input PCFG before processing every terminal symbol from the actual terminal symbol sequence and after processing the last terminal symbol from that sequence.
Example
$ pcfg-predict-eval --dump-stacks --intermediate predict_eval.pcfg actual.seq predicted.seq 1.00000000 "a" <- S [0] step 0: actual "a", predicted "a", prob_max 1.00000000, npredict 1, wpredict_max 1.00000000 0.50000000 "b" <- B [1] <- S [0] 0.50000000 "c" <- B [1] <- S [0] step 1: actual "c", predicted "c", prob_max 0.50000000, npredict 2, wpredict_max 1.50000000 1.00000000 "c" <- B [1] <- S [0] step 2: actual "c", predicted "b", prob_max 1.00000000, npredict 2, wpredict_max 2.50000000 1.00000000 "d" <- S [0] step 3: actual "d", predicted "d", prob_max 1.00000000, npredict 3, wpredict_max 3.50000000 1.00000000 "a" <- S [4] step 4: actual "a", predicted "a", prob_max 1.00000000, npredict 4, wpredict_max 4.50000000 0.50000000 "b" <- B [5] <- S [4] 0.50000000 "c" <- B [5] <- S [4] step 5: actual "b", predicted "d", prob_max 0.50000000, npredict 4, wpredict_max 5.00000000 1.00000000 "d" <- S [4]
step 6: actual "d", predicted "d", prob_max 1.00000000, npredict 5, wpredict_max 6.00000000 1.00000000 "a" <- S [7]
The description of every survived stack (e.g. ‘0.50000000 "b" <- B [5] <- S [4]’) begins with its probability followed by stack content in reverse order—from a terminal symbol (in the actual terminal symbol sequence) to the start nonterminal symbol. Numbers in square brackets after nonterminal symbols are indices of terminal symbols (in the actual terminal symbol sequence) that initiated expansions of the nonterminal symbols. A number in square brackets can precede another number in square brackets indicating a sub-step incremented on encountering productions with empty right-hand sides for a nonterminal symbol at the left-hand side.
The option --dump-terms along with --intermediate prints probabilities of possible terminal symbols calculated before processing every terminal symbol from the actual terminal symbol sequence and after processing the last terminal symbol of that sequence.
Example
$ pcfg-predict-eval --dump-terms --intermediate predict_eval.pcfg actual.seq predicted.seq 1.00000000 "a" * step 0: actual "a", predicted "a", prob_max 1.00000000, npredict 1, wpredict_max 1.00000000 0.50000000 "b" * 0.50000000 "c" * step 1: actual "c", predicted "c", prob_max 0.50000000, npredict 2, wpredict_max 1.50000000 1.00000000 "c" * step 2: actual "c", predicted "b", prob_max 1.00000000, npredict 2, wpredict_max 2.50000000 1.00000000 "d" * step 3: actual "d", predicted "d", prob_max 1.00000000, npredict 3, wpredict_max 3.50000000 1.00000000 "a" * step 4: actual "a", predicted "a", prob_max 1.00000000, npredict 4, wpredict_max 4.50000000 0.50000000 "b" * 0.50000000 "c" *
step 5: actual "b", predicted "d", prob_max 0.50000000, npredict 4, wpredict_max 5.00000000 1.00000000 "d" *
step 6: actual "d", predicted "d", prob_max 1.00000000, npredict 5, wpredict_max 6.00000000 1.00000000 "a" *
The description of a possible terminal symbol (e.g. ‘0.50000000 "b" *’) begins with its probability followed by the terminal symbol itself.
If the probability is equal to a maximum probability among probabilities of all possible terminal symbols, the description ends with ‘*’.
The field prob_max is equal to the maximum probability.
If the actual terminal symbol sequence conforms to an input PCFG, every terminal symbol of that sequence belongs to a set of possible terminal symbols determined just before reading the terminal symbol from the sequence. If the terminal symbol does not belong to the set, the tool reports an error ‘step INDEX: no expansion stacks survived on processing terminal symbol STRING’.
You can combine the option --dump-terms with --dump-stacks.
The pcfg-predict-eval tool supports the following command line options:
Print survived stacks of expanding the start nonterminal symbol of a PCFG. On passing the option --intermediate, the tool prints the survived stacks before processing every terminal symbol from the actual terminal symbol sequence and after processing the last terminal symbol of that sequence. Without the option --intermediate, the tool prints the stacks survived after processing the last terminal symbol only. See “Dumping Survived Stacks”, for more information. You can combine the option --dump-stacks with --dump-terms. By default, do not print the survived stacks.
Print probabilities of possible terminal symbols. On passing the option --intermediate, the tool prints the probabilities before processing every terminal symbol from the actual terminal symbol sequence and after processing the last terminal symbol of that sequence. Without the option --intermediate, the tool prints the probabilities after processing the last terminal symbol only. See “Dumping Possible Terminal Symbols”, for more information. You can combine the option --dump-terms with --dump-stacks. By default, do not print probabilities of possible terminal symbols.
Print information about processing steps, survived stacks, and probabilities of possible terminal symbols for every processed terminal symbol from the actual terminal symbol sequence. See “Dumping Processing Steps”, for fields printed in every processing step. By default, only print survived stacks and probabilities of possible terminal symbols after processing the last terminal symbol.
Process not more than a specified number of terminal symbols from the actual terminal symbol sequence or predicted terminal symbol sequence or process not more than a specified number of symbol pairs from the combined terminal symbol sequence. By default, process all sequence symbols.
Calculate the number of terminal symbols correctly predicted in the actual terminal symbol sequence based on collected n-grams and calculate the efficiency (accuracy) of prediction using the n-gram approach. This mode slows down processing terminal symbol sequences. By default, do not apply the n-gram approach and do not print JSON fields related to n-grams.
Dump processing results to a specified file. By default, dump processing results to stdout.
The probability of a terminal symbol correctly predicted in random mode. This is a lower bound to compare with the probability of a terminal symbol correctly predicted in adaptive mode to calculate prediction efficiency (accuracy). By default, do not calculate prediction efficiency relative to the probability of a terminal symbol correctly predicted in random mode.
If possible, limit the length of every line of printed processing results by a specified number of characters. By default, do not limit the line lengths.