8.3 mk-rg-vit.sh

This script generates template regular expression grammars for segmenting symbol sequences into words. The script has the following command-line format:

$ mk-rg-vit.sh max_word_len max_segment_len n_prod_1 ... n_prod_N

Script arguments:

max_word_len

Maximum word length in terminal symbols.

max_segment_len

Maximum length of a word segment, in terminal symbols. A word segment is a run of consecutive ‘.’ symbols inside a regular expression and corresponds to part of a word. Set this argument equal to max_word_len to generate word segments of every allowed length.

n_prod_I

The maximum number of candidate terminal symbols to consider at position I in a word. Each candidate terminal symbol determines how the remaining part of the word is parsed. Arguments n_prod_J that are absent from the command line for J>I default to n_prod_I, the last value given.
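The defaulting rule for the n_prod_I arguments can be illustrated with a small shell sketch. The function expand_n_prod below is hypothetical and is not part of mk-rg-vit.sh; it only mirrors the rule described above, and it assumes that positions run from 1 to max_word_len.

```shell
# Hypothetical helper (not part of mk-rg-vit.sh): print the effective
# n_prod value for each word position 1..max_word_len, repeating the
# last value given when later arguments are absent.
# Assumption: at least one n_prod value is supplied.
expand_n_prod() {
    max_word_len=$1
    shift
    last=
    pos=1
    out=
    while [ "$pos" -le "$max_word_len" ]; do
        # Consume the next explicit value if one remains;
        # otherwise keep reusing the last one seen.
        if [ "$#" -gt 0 ]; then
            last=$1
            shift
        fi
        out="${out:+$out }$last"
        pos=$((pos + 1))
    done
    printf '%s\n' "$out"
}

expand_n_prod 4 2      # prints: 2 2 2 2
expand_n_prod 4 3 2    # prints: 3 2 2 2
```

Under this reading, passing a single value such as 2 applies the same limit at every position in a word.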

The script writes the generated regular expression grammar to stdout.

Example:

$ mk-rg-vit.sh 4 3 2
S: .
 | .
 | . L3_0
 | . L3_1
 | . .
 | . .
 | . . L2_0
 | . . L2_1
 | . . .
 | . . .
;

L2_0: .
    | .
    | . .
    | . .
;

L2_1 = L2_0 ;

L3_0: .
    | .
    | . ( .
        | .
        | . .
        | . .
        )
    | . ( .
        | .
        | . .
        | . .
        )
    | . .
    | . .
    | . . .
    | . . .
;

L3_1 = L3_0 ;

Nonterminal symbols Li_j correspond to remaining parts of a word, where i is the maximum length of the remaining part in terminal symbols, and j is the index of a copy of that remaining part. To reduce the size of the generated grammar, the script defines productions for copies with indices greater than 0 as shallow copies of the production with index 0 (see Shallow Production Copies).

See Examples for how to use mk-rg-vit.sh to generate template regular expression grammars for grammar learning.