

8.1 pcfg-generate-seq

This program generates a random terminal symbol sequence according to a specified PCFG. The program expands the start nonterminal symbol of this PCFG to produce a parse unit. As long as the sequence limit has not been reached, the program expands the start nonterminal symbol again to produce the next parse unit.

In the usual case, invoke the program using the command line

$ pcfg-generate-seq -i random_seed -n len_term -o SYM_SEQ_FILE PCFG_FILE

The parameter random_seed specifies a seed for the pseudo-random number generator. The parameter len_term specifies the maximum number of terminal symbols in the generated sequence. By default, if the generated sequence does not end on a parse unit boundary, the program shortens the sequence so that it contains a whole number of parse units.

The parameter SYM_SEQ_FILE specifies an output file for the generated sequence. If the option -o SYM_SEQ_FILE is absent, or SYM_SEQ_FILE is ‘-’, the program writes the generated sequence to stdout. By default, the output consists of (unquoted) terminal symbols separated by spaces, with lines wrapped at a right margin of 70 characters.
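The default output layout can be illustrated with a short Python sketch. It is not the program's own code: the function name wrap_terms is hypothetical, and the sketch assumes that a line may hold up to margin characters and that a terminal symbol longer than the margin is left on a line of its own.

```python
def wrap_terms(terms, margin=70, sep=" "):
    """Greedily pack terminal symbols into lines no longer than margin."""
    if margin == 0:                      # 0 means no right margin
        return sep.join(terms)
    lines, current = [], ""
    for term in terms:
        candidate = term if not current else current + sep + term
        if current and len(candidate) > margin:
            lines.append(current)        # line is full; start a new one
            current = term
        else:
            current = candidate          # an over-long single term stays alone
    if current:
        lines.append(current)
    return "\n".join(lines)
```

Calling wrap_terms(terms, margin=0) returns the whole sequence on a single line, which mirrors the special value 0 of the right margin.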

The parameter PCFG_FILE specifies an input file containing a PCFG. If PCFG_FILE is ‘-’, the program reads the PCFG from stdin. An example PCFG is below:

cat >sample.pcfg <<EOF

S: "a"            [0.5]  
 | "b" "b"        [0.33]
 | "c" "c" "c" D  [0.17]
;

D: "delta"
 | "delta" D
;

EOF

Nonterminal symbols are unquoted sequences of English letters, digits, and the character ‘_’ that start with an English letter or ‘_’. Terminal symbols are sequences of characters enclosed in single or double quotation marks. Use the escape character ‘\’ to insert a single or double quotation mark or ‘\’ into a terminal symbol.
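The lexical rules above can be sketched in Python as follows. The names NONTERMINAL_RE, is_nonterminal, and unquote_terminal are hypothetical, and the sketch assumes that ‘\’ simply makes the next character literal; it does not check for a stray unescaped quotation mark inside the symbol.

```python
import re

# Letters, digits, and '_'; the first character is a letter or '_'.
NONTERMINAL_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def is_nonterminal(token: str) -> bool:
    """Return True if token is a well-formed nonterminal symbol name."""
    return NONTERMINAL_RE.fullmatch(token) is not None

def unquote_terminal(token: str) -> str:
    """Strip the surrounding quotation marks and resolve backslash escapes."""
    if len(token) < 2 or token[0] not in "'\"" or token[-1] != token[0]:
        raise ValueError(f"not a quoted terminal: {token!r}")
    body = token[1:-1]
    # Assumption: a backslash makes the following character literal.
    return re.sub(r"\\(.)", r"\1", body)
```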

The productions for a nonterminal symbol are written as that symbol, followed by the character ‘:’, followed by one or more right-hand sides, and terminated by the character ‘;’. The nonterminal symbol at the left-hand side of the first production in a PCFG is the start nonterminal symbol of this PCFG. Use the character ‘|’ to separate the right-hand sides of productions that have the same nonterminal symbol at their left-hand side. Alternatively, you can write those productions as separate ones, each ending with ‘;’. You can specify the relative probability of a production in square brackets after its right-hand side, before ‘|’ or ‘;’. The default relative probability is 1.
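As an illustration of relative probabilities, the following Python sketch assumes that the program normalizes the values given for the productions of one nonterminal symbol by their sum. The manual does not spell out this step, and the function name production_probabilities is hypothetical.

```python
def production_probabilities(weights):
    """Turn the relative probabilities of one nonterminal's productions
    into probabilities that sum to 1 (assumed normalization)."""
    total = sum(weights)
    return [w / total for w in weights]

# S in sample.pcfg: the values written in square brackets.
print(production_probabilities([0.5, 0.33, 0.17]))

# D in sample.pcfg: no brackets, so each production has the default value 1.
print(production_probabilities([1, 1]))  # [0.5, 0.5]
```

Under this assumption, both productions of D in the example are equally likely.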

Example output is below:

$ pcfg-generate-seq -i1 -n100 -o sample.seq sample.pcfg
$ cat sample.seq
a c c c delta delta delta a a c c c delta a a a a a a b b a c c c
delta delta delta a b b b b a a c c c delta a b b b b c c c delta a b
b a a c c c delta b b b b b b c c c delta delta a b b b b c c c delta
delta delta delta b b a a a a c c c delta delta b b a b b c c c delta
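The generation process described at the start of this section can be sketched in Python. This is an illustration under stated assumptions, not the program's implementation: the grammar from sample.pcfg is hard-coded as a dictionary, the names GRAMMAR, expand, and generate are hypothetical, and Python's random module stands in for the program's pseudo-random number generator, so the same seed does not reproduce the output shown above.

```python
import random

# sample.pcfg as a dict: nonterminal -> list of (right-hand side, relative probability).
# Any symbol that is not a key is treated as a terminal, so this toy
# representation cannot hold a terminal spelled like a nonterminal.
GRAMMAR = {
    "S": [(["a"], 0.5), (["b", "b"], 0.33), (["c", "c", "c", "D"], 0.17)],
    "D": [(["delta"], 1.0), (["delta", "D"], 1.0)],
}

def expand(symbol, rng):
    """Expand one symbol into a list of terminal symbols."""
    if symbol not in GRAMMAR:            # terminal symbol
        return [symbol]
    rhs_list = [rhs for rhs, _ in GRAMMAR[symbol]]
    weights = [w for _, w in GRAMMAR[symbol]]
    rhs = rng.choices(rhs_list, weights=weights, k=1)[0]
    terms = []
    for sym in rhs:
        terms.extend(expand(sym, rng))
    return terms

def generate(max_terms, seed=0, start="S"):
    """Collect whole parse units while they fit into max_terms terminals."""
    rng = random.Random(seed)
    units, total = [], 0
    while True:
        unit = expand(start, rng)        # one parse unit
        if total + len(unit) > max_terms:
            break                        # drop the unit that would overflow
        units.append(unit)
        total += len(unit)
    return units
```

For example, generate(100, seed=1) returns a list of parse units whose combined length is at most 100 terminal symbols, which corresponds to the default behavior of ending the sequence on a parse unit boundary.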

The program supports the following command-line options:

-l INT

The maximum length of an output terminal symbol sequence, in characters. That length includes newline characters and delimiters between terminal symbols. No limit by default.

-n INT

The maximum length of an output terminal symbol sequence, in terminal symbols. No limit by default.

-N INT

The maximum length of an output terminal symbol sequence, in parse units. No limit by default.

-o FILE

Output a generated terminal symbol sequence to FILE. If FILE is ‘-’, output the generated sequence to stdout. By default, output the sequence to stdout.

-R, --margin-right=INT

If possible, limit the length of every line in an output terminal symbol sequence to the specified number of characters. The special value 0 means no right margin. The default value is 70.

-i, --seed=INT

A seed for the pseudo-random number generator. The default value is 0.

--separate-parse-units

If there is a right margin, separate generated parse units with empty lines. If there is no right margin, start every parse unit on a new line. The option -R, --margin-right=INT sets or removes the right margin. By default, do not separate generated parse units in a special way.

--separator-term=STR

Separate terminal symbols using a specified string. By default, separate the terminal symbols by spaces.

--truncate[=no|parse-unit|term]

The truncation mode for a generated character sequence whose length exceeds the maximum length specified by the options -l INT, -n INT, or -N INT:

no

Ensure that the generated character sequence ends with a complete parse unit.

parse-unit

Permit the truncation of the last parse unit in the generated character sequence but ensure that the sequence ends with a complete terminal symbol name.

term

Permit the truncation of the last terminal symbol name in the generated character sequence.

If the option argument is not specified, the program uses --truncate=parse-unit. If the option itself is not specified, the program ensures that the generated character sequence ends with a complete parse unit.
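The three truncation modes can be compared with a small Python sketch for a character limit such as the one set by -l INT. It is a simplified illustration, not the program's code: the function name truncate is hypothetical, and the sketch assumes a single-line output, so it ignores the newline characters that the real limit also counts.

```python
def truncate(units, max_chars, mode="no", sep=" "):
    """Cut a sequence of parse units down to at most max_chars characters.

    units is a list of parse units, each a list of terminal symbol names.
    mode is "no", "parse-unit", or "term", mirroring the values of --truncate.
    """
    terms = [term for unit in units for term in unit]
    if mode == "term":
        # May cut in the middle of a terminal symbol name.
        return sep.join(terms)[:max_chars]
    if mode == "parse-unit":
        pieces = [[term] for term in terms]   # cut only between terminals
    elif mode == "no":
        pieces = units                        # cut only between parse units
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    kept = []
    for piece in pieces:
        candidate = sep.join(kept + piece)
        if len(candidate) > max_chars:
            break
        kept = kept + piece
    return sep.join(kept)
```

With the parse units [["a"], ["c", "c", "c", "delta", "delta"], ["b", "b"]] and a limit of 16 characters, the mode "no" keeps only "a", the mode "parse-unit" keeps "a c c c delta", and the mode "term" keeps "a c c c delta de".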

You need to pass at least one option: -l INT, -n INT, or -N INT.

