

Learned PCFG

A learned PCFG is a PCFG based on an initial context-free grammar (see Initial Context-Free Grammar) generated for a template regular expression grammar. Productions in this PCFG have frequencies accumulated while parsing a training terminal symbol sequence. Use the following command-line format to dump a learned PCFG:

$ topdown --qp[=NONT1] ... --qp[=NONTn] [--op=FILE]               \
          [--fp=fq_min_prod] [--ft=fq_min_term]                   \
          [--nlp=num_lower_prod] [--nlt=num_lower_term]           \
          [--nup=num_upper_prod] [--nut=num_upper_term]           \
          [--pp=prob_min_prod] [--pt=prob_min_term]               \
          [--fq-span=window] [--recurs=right] [--remove-putback]  \
          [--simplify] [--term1] REGEX_GRAM_FILE SYM_SEQ_FILE

The following command-line options are applicable to dumping a learned PCFG:

--fp=fq_min_prod

The minimum frequency a production must have to be included in the learned PCFG. This option does not filter productions representing possible terminal symbol sequences for terminal symbol placeholder sequences (those with nonterminal symbols _X_iT and _X_iTj on the left-hand side). When the option --simplify is passed, the parser filters productions before simplifying the PCFG. The default value is 0.

--fq-span=window|total

Event history span for accumulating production frequencies to include in the learned PCFG:

window

Event history window. See the description of the --ww=INT option in Parsing a Token Sequence.

total

Entire event history.

The default value is “total”.

--ft=fq_min_term

The minimum frequency a production must have to be included in the learned PCFG. This option filters only productions representing possible terminal symbol sequences for terminal symbol placeholder sequences (those with nonterminal symbols _X_iT and _X_iTj on the left-hand side). When the option --simplify is passed, the parser filters productions before simplifying the PCFG. The default value is 0.

--nlp=num_lower_prod

If possible, include in the learned PCFG at least the specified number of right-hand sides for every nonterminal symbol on the left-hand side. This option does not filter productions representing possible terminal symbol sequences for terminal symbol placeholder sequences (those with nonterminal symbols _X_iT and _X_iTj on the left-hand side). When the option --simplify is passed, the parser filters productions before simplifying the PCFG. The default value is 0.

--nlt=num_lower_term

If possible, include in the learned PCFG at least the specified number of right-hand sides for every nonterminal symbol on the left-hand side. This option filters only productions representing possible terminal symbol sequences for terminal symbol placeholder sequences (those with nonterminal symbols _X_iT and _X_iTj on the left-hand side). When the option --simplify is passed, the parser filters productions before simplifying the PCFG. The default value is 0.

--nup=num_upper_prod

Include in the learned PCFG at most the specified number of right-hand sides for every nonterminal symbol on the left-hand side. The parser retains the most probable right-hand sides. This option does not filter productions representing possible terminal symbol sequences for terminal symbol placeholder sequences (those with nonterminal symbols _X_iT and _X_iTj on the left-hand side). When the option --simplify is passed, the parser filters productions before simplifying the PCFG. By default, there is no limit.

--nut=num_upper_term

Include in the learned PCFG at most the specified number of right-hand sides for every nonterminal symbol on the left-hand side. The parser retains the most probable right-hand sides. This option filters only productions representing possible terminal symbol sequences for terminal symbol placeholder sequences (those with nonterminal symbols _X_iT and _X_iTj on the left-hand side). When the option --simplify is passed, the parser filters productions before simplifying the PCFG. By default, there is no limit.

--op=FILE

Write the learned PCFG to FILE. If FILE is ‘-’, write the PCFG to stdout. This option queries the learned PCFG.

--pp=prob_min_prod

The minimum probability a production must have to be included in the learned PCFG. This option does not filter productions representing possible terminal symbol sequences for terminal symbol placeholder sequences (those with nonterminal symbols _X_iT and _X_iTj on the left-hand side). When the option --simplify is passed, the parser filters productions before simplifying the PCFG. The default value is 0.

--pt=prob_min_term

The minimum probability a production must have to be included in the learned PCFG. This option filters only productions representing possible terminal symbol sequences for terminal symbol placeholder sequences (those with nonterminal symbols _X_iT and _X_iTj on the left-hand side). When the option --simplify is passed, the parser filters productions before simplifying the PCFG. The default value is 0.

--qp[=NONT]

Dump learned PCFG productions for the nonterminal symbol NONT, and for the auxiliary nonterminal symbols it uses, to the file specified by the option --op=FILE. The nonterminal symbol must belong to the set of nonterminal symbols of the template regular expression grammar. You can pass multiple --qp=NONT options to dump productions for multiple nonterminal symbols. If the option --op=FILE is not supplied, dump the queried productions to stdout. If NONT is not supplied, dump the entire learned PCFG. This option queries the learned PCFG.

--remove-putback

Remove from the learned PCFG auxiliary nonterminal symbols for terminal symbol placeholder sequences processed in put-back mode. Those nonterminal symbols have the suffix ‘~’ in the right-hand sides of productions: _X_iT~ or _X_iTj~. See Put-back Terminal Symbols for more information. By default, do not remove nonterminal symbols for terminal symbol placeholder sequences processed in put-back mode.

--term1

For every nonterminal symbol on the left-hand side, retain only the most probable right-hand sides, such that the retained right-hand sides begin with unique terminal symbols. This mode applies only to productions representing possible terminal symbol sequences for terminal symbol placeholder sequences (those with nonterminal symbols _X_iT and _X_iTj on the left-hand side). By default, the right-hand sides of productions with a specific nonterminal symbol on the left-hand side can start with duplicate terminal symbols.
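The selection described for --term1 can be pictured with a short Python sketch. It is an illustration under stated assumptions, not the parser's implementation: the function name, the data layout, and the tie-breaking rule are all hypothetical, and the sketch does not say whether the tool renormalizes probabilities afterwards.

```python
def retain_unique_first_terminal(rhs_list):
    """Keep the most probable right-hand side per distinct first terminal.

    rhs_list holds (symbols, probability) pairs for one left-hand side.
    On equal probabilities the alternative seen first wins (an assumption).
    """
    best = {}
    for symbols, prob in rhs_list:
        first = symbols[0]
        if first not in best or prob > best[first][1]:
            best[first] = (symbols, prob)
    # Return the survivors ordered from most to least probable.
    return sorted(best.values(), key=lambda item: item[1], reverse=True)


# Hypothetical right-hand sides; two of them start with "a".
alternatives = [
    (("a", "b", "c"), 0.50),
    (("a", "c", "b"), 0.30),
    (("b", "c", "a"), 0.20),
]
for symbols, prob in retain_unique_first_terminal(alternatives):
    print(" ".join(symbols), prob)
```

In this made-up input, the second alternative is dropped because a more probable right-hand side already starts with "a".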

See Initial Context-Free Grammar for the description of the --recurs=left|right option. See Iterative Determinization for the description of the --simplify option.
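The frequency, probability, and count limits described above can be sketched in Python for the right-hand sides of a single left-hand-side nonterminal. This is a minimal sketch, not the parser's code. The function and parameter names are hypothetical, and two points are assumptions rather than documented behavior: that a probability is a right-hand side's share of the total frequency, and that "if possible" for the lower limit means topping up with the most probable alternatives that the thresholds rejected. In the tool, --fp, --pp, --nlp, and --nup apply to one group of productions and --ft, --pt, --nlt, and --nut to the other; the sketch handles one group at a time.

```python
def filter_rhs(rhs_list, fq_min=0, prob_min=0.0, num_lower=0, num_upper=None):
    """Filter the right-hand sides of one left-hand-side nonterminal.

    rhs_list is a list of (symbols, frequency) pairs. The result is a list
    of (symbols, frequency, probability) triples, most probable first.
    """
    total = sum(freq for _, freq in rhs_list)
    if total == 0:
        return []
    ranked = sorted(
        ((symbols, freq, freq / total) for symbols, freq in rhs_list),
        key=lambda item: item[2],
        reverse=True,
    )
    # Minimum frequency and minimum probability thresholds.
    kept = [item for item in ranked if item[1] >= fq_min and item[2] >= prob_min]
    # Assumed lower limit: both thresholds grow with frequency, so `kept`
    # is a prefix of `ranked` and can be extended by slicing.
    if len(kept) < num_lower:
        kept = ranked[:num_lower]
    # Upper limit: keep only the most probable alternatives.
    if num_upper is not None:
        kept = kept[:num_upper]
    return kept


# Frequencies of the _A_1T3 alternatives from the example dump in this section.
a_1t3 = [('"a" "b" "c"', 173), ('"c" "a" "b"', 63), ('"b" "c" "a"', 16)]
print(len(filter_rhs(a_1t3, fq_min=20)))
print(len(filter_rhs(a_1t3, fq_min=100, num_lower=2)))
print(len(filter_rhs(a_1t3, prob_min=0.05, num_upper=1)))
```

With these inputs the three calls keep two, two, and one right-hand sides respectively: the second call shows the lower limit restoring an alternative that the frequency threshold alone would have removed.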

An example of dumping a learned PCFG follows. See Terminal Symbol Expansions for the content of the expan.rg and expan1.seq files.

$ topdown -N10 --qp expan.rg expan1.seq
S: A A     [0.73446328]  // 130
 | _S_1T5  [0.26553672]  // 47
;  // 177

A: _A_1T3  // 258
;

_A_1T3: "a" "b" "c"  [0.68650794]  // 173  0.77216188
      | "c" "a" "b"  [0.25000000]  // 63   0.03027511
      | "b" "c" "a"  [0.06349206]  // 16   0.00044992
;  // 252

_S_1T5: "a" "b" "c" "a" "b"  [0.39130435]  // 18  0.31862270
      | "c" "a" "b" "c" "a"  [0.30434783]  // 14  0.09068896
      | "b" "c" "a" "b" "c"  [0.30434783]  // 14  0.09068895
;  // 46

See pcfg-generate-seq for the PCFG format. See Initial Context-Free Grammar for the format of generated nonterminal symbol names beginning with ‘_’.

A comment at the end of each right-hand side contains its frequency. A comment after ‘;’ contains the sum of frequencies of all right-hand sides.
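In the example dump, each bracketed probability appears to equal the frequency of the right-hand side divided by the sum printed after ‘;’. This is an observation drawn from the numbers shown, not a statement about the parser's internals. The Python sketch below recomputes the values for two of the productions; the data structure and the function name are hypothetical.

```python
# Frequencies copied from the S and _A_1T3 productions in the example dump.
dump = {
    "S": [("A A", 130), ("_S_1T5", 47)],
    "_A_1T3": [('"a" "b" "c"', 173), ('"c" "a" "b"', 63), ('"b" "c" "a"', 16)],
}


def relative_frequencies(alternatives):
    """Return the frequency sum and each right-hand side's share of it."""
    total = sum(freq for _, freq in alternatives)
    return total, [(rhs, freq / total) for rhs, freq in alternatives]


for lhs, alternatives in dump.items():
    total, probs = relative_frequencies(alternatives)
    print(f"{lhs}:  // {total}")
    for rhs, prob in probs:
        # Eight decimal places, the precision used in the dump.
        print(f"    {rhs}  [{prob:.8f}]")
```

The recomputed sums (177 and 252) and the probabilities rounded to eight decimal places agree with the values shown in the dump for these two productions.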

A fractional number after the frequency of a right-hand side representing a possible terminal symbol sequence for a terminal symbol placeholder sequence (with a nonterminal symbol _X_iT or _X_iTj on the left-hand side) is the score of this right-hand side. The parser discards right-hand sides with lower scores during iterative determinization.

Note: simplifying a learned PCFG by passing the option --simplify might remove scores from right-hand sides of productions.

