7.10 pcfg-predict-eval

This tool evaluates the accuracy of prediction of every next symbol in a terminal symbol sequence by comparing the number of correctly predicted terminal symbols with a calculated upper bound on the number of terminal symbols predictable using a specified PCFG.

Example

The file predict_eval.pcfg contains the following PCFG:

$ cat predict_eval.pcfg

  S: "a" B "d"
  ;

  B: "b"
   | "c" "c"
  ;

Suppose, this PCFG has generated the following terminal symbol sequence stored in the file actual.seq:

$ cat actual.seq

  a c c d a b d

In this terminal symbol sequence, an upper bound on the number of correctly predictable terminal symbols is 6, as all 4 terminal symbols outside of subsequences ‘c c’ and ‘b’ are constant according to the PCFG, the sum of average numbers of correctly predictable terminal symbols at the second and sixth positions is 0.5+0.5=1, and the second ‘c’ always follows the first ‘c’ thereby adding 1 to the number of correctly predictable terminal symbols. An upper bound on the number of correctly predictable terminal symbols divided by sequence length is prob_wpredict_max=6.0/7=0.85714286.

While reading the terminal symbol sequence, a test program was trying to predict the next terminal symbol the PCFG generates. Suppose, the test program has predicted the following terminal symbol sequence and stored it in the file predicted.seq:

$ cat predicted.seq

  a c b d a d d

In this terminal symbol sequence, the number of correctly predicted terminal symbols is 5, because the terminal symbols ‘b’ and ‘d’ at the third and sixth positions differ from the terminal symbols ‘c’ and ‘b’ at the same positions in the actual terminal symbol sequence in the file actual.seq, and terminal symbols at all other positions are equal. The number of correctly predicted terminal symbols divided by sequence length is prob_npredict_actual=5.0/7=0.71428571.

The following command calculates and prints the aforementioned numbers prob_wpredict_max and prob_npredict_actual:

$ pcfg-predict-eval predict_eval.pcfg actual.seq predicted.seq

  {
      "seq_len"              : 7,
      "wpredict_max"         : 6.00000000,
      "npredict_actual"      : 5,
      "prob_wpredict_max"    : 0.85714286,
      "prob_npredict_actual" : 0.71428571
  }

Synopsis

Use the command line formats:

pcfg-predict-eval [options] PCFG_FILE ACTUAL_SEQ_FILE PREDICTED_SEQ_FILE
pcfg-predict-eval [options] PCFG_FILE COMBINED_SEQ_FILE

where:

PCFG_FILE

A PCFG (see PCFG Format) without left recursion. Along with a terminal symbol sequence generated according to the PCFG, they specify an upper bound on the number of correctly predictable terminal symbols. The tool may hang if the PCFG has an empty production.

ACTUAL_SEQ_FILE

An actual terminal symbol sequence. It is a terminal symbol sequence produced by a PCFG in PCFG_FILE. The sequence consists of (unquoted) terminal symbol names separated by spaces or single newline characters. Multiple consecutive newline characters separate parse units.

PREDICTED_SEQ_FILE

A predicted terminal symbol sequence for calculating prediction accuracy. The sequence consists of (unquoted) terminal symbol names separated by spaces or single newline characters. Multiple consecutive newline characters separate parse units. The number of terminal symbols in the sequence must be equal to the number of terminal symbols in a sequence in ACTUAL_SEQ_FILE. Parse unit segmentations in the two sequences must be at the same terminal symbol positions. Specify this argument equal to ACTUAL_SEQ_FILE to only calculate an upper bound on the number of terminal symbols correctly predictable in a terminal symbol sequence.

COMBINED_SEQ_FILE

A terminal symbol sequence combined from an actual terminal symbol sequence and predicted terminal symbol sequence. The combined sequence consists of terminal symbol pairs separated by spaces or single newline characters. Multiple consecutive newline characters separate parse units.

The first element of a pair is a quoted actual terminal symbol. The second element of a pair is a quoted predicted terminal symbol or ‘?’ if a predicted terminal symbol is unknown. Use the escape character ‘\’ to insert ‘"’ or ‘\’ into a quoted terminal symbol name. Allowed delimiters between an actual terminal symbol name and predicted terminal symbol name:

:

Any actual terminal symbol and predicted terminal symbol.

==

The actual terminal symbol equal to the predicted terminal symbol.

!=

The actual terminal symbol different from the predicted terminal symbol.

Output JSON Fields

The fields of JSON output have the following meaning:

seq_len

The number of terminal symbols in the actual or predicted terminal symbol sequence.

wpredict_max

An upper bound on the number of correctly predictable terminal symbols.

npredict_actual

The number of correctly predicted terminal symbols contained in the predicted terminal symbol sequence.

npredict_ngram

The number of terminal symbols correctly predicted in the actual terminal symbol sequence by applying the n-gram approach. The tool prints this field on passing the option --ngrams.

wpredict_rand

The number of correctly predicted terminal symbols in random mode. The tool prints this field on passing the option --prob-rand=FLOAT. This field is equal to the argument of that option multiplied by the field seq_len.

prob_wpredict_max

This field is equal to wpredict_max/seq_len.

prob_npredict_actual

This field is equal to npredict_actual/seq_len.

prob_npredict_ngram

This field is equal to npredict_ngram/seq_len. The tool prints the field on passing the option --ngrams.

prob_npredict_rand

This field is equal to the argument of --prob-rand=FLOAT option. The tool prints the field on passing this option.

efficiency_ngram, %

This field is equal to (npredict_actual-npredict_ngram)*100/(wpredict_max-npredict_ngram) except for the following special cases: if npredict_actual<npredict_ngram, the field is equal to 0; if npredict_actual>wpredict_max, the field is equal to 100. The program prints the field on passing the option --ngrams.

efficiency_rand, %

This field is equal to (prob_npredict_actual-prob_npredict_rand)*100/(prob_wpredict_max-prob_npredict_rand) except for the following special cases: if prob_npredict_actual<prob_npredict_rand, the field is equal to 0; if prob_npredict_actual>prob_wpredict_max, the field is equal to 100. The program prints the field on passing the option --prob-rand=FLOAT.

Dumping Processing Steps

The option --intermediate dumps details about calculating an upper bound on the number of correctly predictable terminal symbols while processing every terminal symbol from an actual and predicted terminal symbol sequence.

Example

$ pcfg-predict-eval --intermediate predict_eval.pcfg actual.seq predicted.seq
step 0: actual "a", predicted "a", prob_max 1.00000000, npredict 1, wpredict_max 1.00000000
step 1: actual "c", predicted "c", prob_max 0.50000000, npredict 2, wpredict_max 1.50000000
step 2: actual "c", predicted "b", prob_max 1.00000000, npredict 2, wpredict_max 2.50000000
step 3: actual "d", predicted "d", prob_max 1.00000000, npredict 3, wpredict_max 3.50000000
step 4: actual "a", predicted "a", prob_max 1.00000000, npredict 4, wpredict_max 4.50000000
step 5: actual "b", predicted "d", prob_max 0.50000000, npredict 4, wpredict_max 5.00000000
step 6: actual "d", predicted "d", prob_max 1.00000000, npredict 5, wpredict_max 6.00000000

{
    "seq_len"              : 7,
    "wpredict_max"         : 6.00000000,
    "npredict_actual"      : 5,
    "prob_wpredict_max"    : 0.85714286,
    "prob_npredict_actual" : 0.71428571
}

For every pair of terminal symbols read from the actual and predicted terminal symbol sequence, the tool prints a line with the following fields:

step

The index (position) of a terminal symbol in the actual or predicted terminal symbol sequence. Indices start with 0.

actual

A terminal symbol read from the actual terminal symbol sequence.

predicted

A terminal symbol read from the predicted terminal symbol sequence.

prob_max

The probability of correct prediction of a terminal symbol in the field actual based on the input PCFG and previous terminal symbols read from the actual terminal symbol sequence.

npredict

The number of times terminal symbols in the fields actual and predicted were equal (up to the current line).

wpredict_max

The sum of prob_max fields up to the current line.

Dumping Survived Stacks

The option --dump-stacks along with --intermediate prints survived stacks of expanding the start nonterminal symbol of an input PCFG before processing every terminal symbol from the actual terminal symbol sequence and after processing the last terminal symbol from that sequence.

Example

$ pcfg-predict-eval --dump-stacks --intermediate predict_eval.pcfg actual.seq predicted.seq
1.00000000  "a" <- S [0]

step 0: actual "a", predicted "a", prob_max 1.00000000, npredict 1, wpredict_max 1.00000000

0.50000000  "b" <- B [1] <- S [0]
0.50000000  "c" <- B [1] <- S [0]

step 1: actual "c", predicted "c", prob_max 0.50000000, npredict 2, wpredict_max 1.50000000

1.00000000  "c" <- B [1] <- S [0]

step 2: actual "c", predicted "b", prob_max 1.00000000, npredict 2, wpredict_max 2.50000000

1.00000000  "d" <- S [0]

step 3: actual "d", predicted "d", prob_max 1.00000000, npredict 3, wpredict_max 3.50000000

1.00000000  "a" <- S [4]

step 4: actual "a", predicted "a", prob_max 1.00000000, npredict 4, wpredict_max 4.50000000

0.50000000  "b" <- B [5] <- S [4]
0.50000000  "c" <- B [5] <- S [4]

step 5: actual "b", predicted "d", prob_max 0.50000000, npredict 4, wpredict_max 5.00000000

1.00000000  "d" <- S [4]

step 6: actual "d", predicted "d", prob_max 1.00000000, npredict 5, wpredict_max 6.00000000

1.00000000  "a" <- S [7]

The description of every survived stack (e.g. ‘0.50000000 "b" <- B [5] <- S [4]’) begins with its probability followed by stack content in reverse order—from a terminal symbol (in the actual terminal symbol sequence) to the start nonterminal symbol. Numbers in square brackets after nonterminal symbols are indices of terminal symbols (in the actual terminal symbol sequence) that initiated expansions of the nonterminal symbols. A number in square brackets can precede another number in square brackets indicating a sub-step incremented on encountering productions with empty right-hand sides for a nonterminal symbol at the left-hand side.

Dumping Possible Terminal Symbols

The option --dump-terms along with --intermediate prints probabilities of possible terminal symbols calculated before processing every terminal symbol from the actual terminal symbol sequence and after processing the last terminal symbol of that sequence.

Example

$ pcfg-predict-eval --dump-terms --intermediate predict_eval.pcfg actual.seq predicted.seq
1.00000000  "a" *

step 0: actual "a", predicted "a", prob_max 1.00000000, npredict 1, wpredict_max 1.00000000

0.50000000  "b" *
0.50000000  "c" *

step 1: actual "c", predicted "c", prob_max 0.50000000, npredict 2, wpredict_max 1.50000000

1.00000000  "c" *

step 2: actual "c", predicted "b", prob_max 1.00000000, npredict 2, wpredict_max 2.50000000

1.00000000  "d" *

step 3: actual "d", predicted "d", prob_max 1.00000000, npredict 3, wpredict_max 3.50000000

1.00000000  "a" *

step 4: actual "a", predicted "a", prob_max 1.00000000, npredict 4, wpredict_max 4.50000000

0.50000000  "b" *
0.50000000  "c" *

step 5: actual "b", predicted "d", prob_max 0.50000000, npredict 4, wpredict_max 5.00000000

1.00000000  "d" *

step 6: actual "d", predicted "d", prob_max 1.00000000, npredict 5, wpredict_max 6.00000000

1.00000000  "a" *

The description of a possible terminal symbol (e.g. ‘0.50000000 "b" *’) begins with its probability followed by the terminal symbol itself. If the probability is equal to a maximum probability among probabilities of all possible terminal symbols, the description ends with ‘*’. The field prob_max is equal to the maximum probability.

If the actual terminal symbol sequence conforms to an input PCFG, every terminal symbol of that sequence belongs to a set of possible terminal symbols determined just before reading the terminal symbol from the sequence. If the terminal symbol does not belong to the set, the tool reports an error ‘step INDEX: no expansion stacks survived on processing terminal symbol STRING’.

You can combine the option --dump-terms with --dump-stacks.

All Command-Line Options

The pcfg-predict-eval tool supports the following command line options:

--dump-stacks

Print survived stacks of expanding the start nonterminal symbol of a PCFG. On passing the option --intermediate, the tool prints the survived stacks before processing every terminal symbol from the actual terminal symbol sequence and after processing the last terminal symbol of that sequence. Without the option --intermediate, the tool prints the stacks survived after processing the last terminal symbol only. See “Dumping Survived Stacks”, for more information. You can combine the option --dump-stacks with --dump-terms. By default, do not print the survived stacks.

--dump-terms

Print probabilities of possible terminal symbols. On passing the option --intermediate, the tool prints the probabilities before processing every terminal symbol from the actual terminal symbol sequence and after processing the last terminal symbol of that sequence. Without the option --intermediate, the tool prints the probabilities after processing the last terminal symbol only. See “Dumping Possible Terminal Symbols”, for more information. You can combine the option --dump-terms with --dump-stacks. By default, do not print probabilities of possible terminal symbols.

--intermediate

Print information about processing steps, survived stacks, and probabilities of possible terminal symbols for every processed terminal symbol from the actual terminal symbol sequence. See “Dumping Processing Steps”, for fields printed in every processing step. By default, only print survived stacks and probabilities of possible terminal symbols after processing the last terminal symbol.

-n INT

Process not more than a specified number of terminal symbols from the actual terminal symbol sequence or predicted terminal symbol sequence or process not more than a specified number of symbol pairs from the combined terminal symbol sequence. By default, process all sequence symbols.

--ngrams

Calculate the number of terminal symbols correctly predicted in the actual terminal symbol sequence based on collected n-grams and calculate the efficiency (accuracy) of prediction using the n-gram approach. This mode slows down processing terminal symbol sequences. By default, do not apply the n-gram approach and do not print JSON fields related to n-grams.

-o FILE

Dump processing results to a specified file. By default, dump processing results to stdout.

--prob-rand=FLOAT

The probability of a terminal symbol correctly predicted in random mode. This is a lower bound to compare with the probability of a terminal symbol correctly predicted in adaptive mode to calculate prediction efficiency (accuracy). By default, do not calculate prediction efficiency relative to the probability of a terminal symbol correctly predicted in random mode.

-R, --margin-right=INT

If possible, limit the length of every line of printed processing results by a specified number of characters. By default, do not limit the line lengths.