7.3 PCFG Format

The pcfg-generate-seq and pcfg-predict-eval tools take a PCFG (probabilistic context-free grammar) as input. Some other tools can generate PCFGs on request.

Example

This is a PCFG:

S: "a"              [0.5]
 | "b" "b"          [0.33]
 | "c" "c" "c" D_1  [0.17]
;

D_1: "delta"
   | "delta" D_1
;

Nonterminal symbols are unquoted sequences of English letters, digits, and the character ‘_’, starting with an English letter or ‘_’. In the above example, the nonterminal symbols are S and D_1.

Terminal symbols are character sequences quoted using single or double quotation marks. Use the escape character ‘\’ to insert a single or double quotation mark or ‘\’ into a terminal symbol. In the above example, the terminal symbols are ‘a’, ‘b’, ‘c’, and ‘delta’.
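
The two symbol classes described above can be recognized with regular expressions. The following Python sketch is illustrative only and is not part of the tools: the names NONTERMINAL, TERMINAL, and unquote are hypothetical, and it assumes that ‘\’ followed by any character stands for that character, which covers the escapes named above but may be more permissive than the actual tools.

```python
import re

# Nonterminal: English letters, digits, and '_', starting with a letter or '_'.
NONTERMINAL = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

# Terminal: text in single or double quotes; a backslash escapes the next character.
TERMINAL = re.compile(r'"(?:[^"\\]|\\.)*"' r"|'(?:[^'\\]|\\.)*'")


def unquote(token: str) -> str:
    """Strip the surrounding quotes and resolve backslash escapes (assumed rule)."""
    return re.sub(r"\\(.)", r"\1", token[1:-1])


print(bool(NONTERMINAL.fullmatch("D_1")))    # True
print(bool(NONTERMINAL.fullmatch("1D")))     # False: starts with a digit
print(unquote(r'"say \"hi\""'))              # say "hi"
```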

Each group of productions for a nonterminal symbol starts with that symbol followed by the character ‘:’, lists one or more right-hand sides, and ends with the character ‘;’. The nonterminal symbol on the left-hand side of the first production in a PCFG is its start nonterminal symbol.

The character ‘|’ delimits right-hand sides of productions that have the same nonterminal symbol on their left-hand side. Alternatively, such productions can be written separately, each ending with ‘;’.

A production can have a relative probability specified in square brackets after its right-hand side, before the ‘|’ or ‘;’. The default relative probability is 1.
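
One plausible reading of "relative probability" is that the values for a nonterminal symbol are weights, each divided by the sum of the weights of all productions for that symbol. The text above does not state how the tools normalize, so the Python sketch below is an assumption and not the tools' documented behavior; the function name normalize is hypothetical.

```python
def normalize(weights):
    """Scale relative probabilities so that they sum to 1 (assumed interpretation)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("relative probabilities must have a positive sum")
    return [w / total for w in weights]


# D_1 in the first example has two productions without an explicit value,
# so each defaults to a relative probability of 1.
print(normalize([1, 1]))       # [0.5, 0.5]

# Hypothetical weights that do not sum to 1 as written.
print(normalize([2, 1, 1]))    # [0.5, 0.25, 0.25]
```

Under this reading, the values 0.5, 0.33, and 0.17 given for S already sum to 1 and are left essentially unchanged.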

Example

The notation

S: "a"              [0.5]  
 | "b" "b"          [0.33]
 | "c" "c" "c" D_1  [0.17]
;

is equivalent to

S: "a"              [0.5]  ;
S: "b" "b"          [0.33] ;
S: "c" "c" "c" D_1  [0.17] ;
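
The equivalence of the two notations can be checked with a small reader for the format. The Python sketch below is a minimal illustration built only from the rules in this section; it is not the parser used by the tools, and the names TOKEN, tokenize, parse, COMBINED, and SEPARATE are hypothetical. It keeps terminal symbols in their quoted form and does no error recovery.

```python
import re

TOKEN = re.compile(
    r"\s*("
    r"[A-Za-z_][A-Za-z0-9_]*"    # nonterminal symbol
    r'|"(?:[^"\\]|\\.)*"'        # double-quoted terminal symbol
    r"|'(?:[^'\\]|\\.)*'"        # single-quoted terminal symbol
    r"|\[[^\]]*\]"               # [relative probability]
    r"|[:|;]"                    # punctuation
    r")"
)


def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def parse(text):
    """Return {nonterminal: [(right-hand side, relative probability), ...]}."""
    tokens = tokenize(text)
    rules, i = {}, 0
    while i < len(tokens):
        lhs = tokens[i]
        if i + 1 >= len(tokens) or tokens[i + 1] != ":":
            raise ValueError(f"expected ':' after {lhs!r}")
        i += 2
        rhs, weight = [], 1.0            # default relative probability is 1
        while True:
            if i >= len(tokens):
                raise ValueError("missing ';' at end of production")
            tok = tokens[i]
            i += 1
            if tok in ("|", ";"):        # end of one right-hand side
                rules.setdefault(lhs, []).append((tuple(rhs), weight))
                rhs, weight = [], 1.0
                if tok == ";":
                    break
            elif tok.startswith("["):    # quoted terminals start with a quote
                weight = float(tok[1:-1])
            else:
                rhs.append(tok)
    return rules


COMBINED = '''
S: "a"              [0.5]
 | "b" "b"          [0.33]
 | "c" "c" "c" D_1  [0.17]
;
'''

SEPARATE = '''
S: "a"              [0.5]  ;
S: "b" "b"          [0.33] ;
S: "c" "c" "c" D_1  [0.17] ;
'''

print(parse(COMBINED) == parse(SEPARATE))   # True
```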