The pcfg-generate-seq and pcfg-predict-eval tools take a PCFG (probabilistic context-free grammar) as input.
Some other tools can generate PCFGs, if requested.
Example
This is a PCFG:
S: "a" [0.5] | "b" "b" [0.33] | "c" "c" "c" D_1 [0.17] ; D_1: "delta" | "delta" D_1 ;
Nonterminal symbols are unquoted sequences of English letters, digits, and the character ‘_’ that start with an English letter or ‘_’.
In the above example, the nonterminal symbols are S and D_1.
Terminal symbols are character sequences quoted using single or double quotation marks. Use the escape character ‘\’ to insert a single or double quotation mark or ‘\’ into a terminal symbol. In the above example, the terminal symbols are ‘a’, ‘b’, ‘c’, and ‘delta’.
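The lexical rules above can be sketched as two regular expressions. This is only an illustration in Python, not the tools' actual scanner: the names NONTERMINAL, TERMINAL, and unquote are hypothetical, and the sketch assumes that a backslash followed by any character stands for that character.

```python
import re

# Nonterminal: an English letter or '_' followed by letters, digits, or '_'.
NONTERMINAL = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

# Terminal: text in single or double quotes; a backslash escapes the next character.
TERMINAL = re.compile(r'"(?:[^"\\]|\\.)*"|\'(?:[^\'\\]|\\.)*\'')


def unquote(token):
    """Strip the surrounding quotes and resolve backslash escapes (assumed rule)."""
    return re.sub(r"\\(.)", r"\1", token[1:-1])
```

With these definitions, NONTERMINAL.fullmatch("D_1") succeeds while NONTERMINAL.fullmatch("1D") does not, and unquote applied to the seven-character token "a\"b" yields the three characters a"b.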
The productions for a nonterminal symbol are written as the nonterminal symbol, followed by the character ‘:’, followed by one or more right-hand sides, and terminated by the character ‘;’. The nonterminal symbol on the left-hand side of the first production in a PCFG is its start symbol.
The character ‘|’ separates right-hand sides of productions that share the same left-hand-side nonterminal symbol. Alternatively, each production can be written separately, each ending with ‘;’.
A production can have a relative probability specified in square brackets after its right-hand side, before the ‘|’ or ‘;’. The default relative probability is 1.
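The text does not say how relative probabilities are turned into actual probabilities. A plausible reading, assumed here, is that each value is divided by the sum of the values of all productions with the same left-hand side. The helper normalize below is a hypothetical illustration of that assumption, not part of the tools.

```python
def normalize(weights):
    """Scale relative probabilities so they sum to 1 (assumed interpretation)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("relative probabilities must have a positive sum")
    return [w / total for w in weights]


# D_1 in the example has two right-hand sides, each with the default value 1.
d1 = normalize([1, 1])

# Relative probabilities need not sum to 1 beforehand.
other = normalize([2, 1, 1])
```

Under this assumption, each of the two D_1 productions gets probability 0.5, and relative probabilities 2, 1, 1 become 0.5, 0.25, 0.25.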
Example
The notation
S: "a" [0.5] | "b" "b" [0.33] | "c" "c" "c" D_1 [0.17] ;
is equivalent to
S: "a" [0.5] ; S: "b" "b" [0.33] ; S: "c" "c" "c" D_1 [0.17] ;
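The equivalence of the two notations can be checked with a small reader for the syntax described in this section. The following Python sketch is an assumption-laden illustration, not the parser used by the tools: the names TOKEN and parse_pcfg are hypothetical, whitespace between tokens is assumed to be insignificant, and terminals are kept in their quoted form.

```python
import re

# One token: nonterminal, quoted terminal, [relative probability], or ':', '|', ';'.
TOKEN = re.compile(r'''
    \s*(?:
      (?P<nt>[A-Za-z_][A-Za-z0-9_]*)
    | (?P<t>"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*')
    | \[(?P<p>[^\]]*)\]
    | (?P<op>[:|;])
    )''', re.VERBOSE)


def parse_pcfg(text):
    """Return {lhs: [(rhs_symbols, relative_probability), ...]} in source order."""
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise ValueError(f"unexpected input at offset {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()

    grammar = {}
    lhs, rhs, weight = None, [], 1.0
    it = iter(tokens)
    for kind, value in it:
        if lhs is None:
            # A new production group must start with "Nonterminal :".
            if kind != "nt" or next(it, None) != ("op", ":"):
                raise ValueError("expected a nonterminal followed by ':'")
            lhs = value
        elif kind == "p":
            weight = float(value)
        elif kind == "op" and value in "|;":
            # '|' and ';' both close the current right-hand side.
            grammar.setdefault(lhs, []).append((tuple(rhs), weight))
            rhs, weight = [], 1.0
            if value == ";":
                lhs = None
        elif kind == "op":
            raise ValueError("unexpected ':' inside a right-hand side")
        else:
            rhs.append(value)
    if lhs is not None:
        raise ValueError("missing ';' at end of input")
    return grammar


combined = parse_pcfg('S: "a" [0.5] | "b" "b" [0.33] | "c" "c" "c" D_1 [0.17] ;')
separate = parse_pcfg('S: "a" [0.5] ; S: "b" "b" [0.33] ; S: "c" "c" "c" D_1 [0.17] ;')
same = combined == separate
```

Both notations produce the same three productions for S, so same is True. Because Python dictionaries preserve insertion order, the first key of the returned dictionary is the nonterminal of the first production, which is the start symbol.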