9.2 mk-rg-vit.sh

This script generates top-down template grammars for segmenting symbol sequences into words. The script has the following command line format:

mk-rg-vit.sh MAX_WORD_LENGTH MAX_SEGMENT_LENGTH N_PROD_1 ... N_PROD_n

The script arguments:

MAX_WORD_LENGTH

Maximum word length in terminal symbols.

MAX_SEGMENT_LENGTH

Maximum length of a word segment in terminal symbols. A word segment is a sequence of ‘.’ symbols inside a regular expression and corresponds to a part of a word. Set this argument to MAX_WORD_LENGTH to generate word segments of all allowed lengths.

N_PROD_i

The maximum number of possible terminal symbols to consider at position i in a word. Each possible terminal symbol determines how the remaining part of the word is parsed. If N_PROD_i is the last such argument on the command line, the script treats every absent N_PROD_j with j > i as equal to N_PROD_i.
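The defaulting rule for N_PROD_i can be illustrated with a short shell sketch. The function n_prod_at is a hypothetical helper written for this illustration only; it is not part of mk-rg-vit.sh, and the script may implement the rule differently.

```shell
# Hypothetical helper (not part of mk-rg-vit.sh): print the effective
# N_PROD value for word position POS (1-based), given only the N_PROD
# arguments that were actually supplied.
# Usage: n_prod_at POS N_PROD_1 [N_PROD_2 ...]
n_prod_at() {
    pos=$1
    shift
    if [ "$#" -eq 0 ]; then
        echo "n_prod_at: at least one N_PROD value is required" >&2
        return 1
    fi
    i=1
    for value in "$@"; do
        last=$value
        # Stop as soon as the requested position is reached.
        if [ "$i" -eq "$pos" ]; then
            break
        fi
        i=$((i + 1))
    done
    # If POS lies past the supplied arguments, the loop ran to the end
    # and "last" holds the final supplied value, which is reused.
    echo "$last"
}
```

With the N_PROD arguments 3 2, the call n_prod_at 1 3 2 prints 3, n_prod_at 2 3 2 prints 2, and n_prod_at 4 3 2 also prints 2, because positions past the second one reuse the last supplied value.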

The script writes the generated regular expression grammar to stdout.

Example

$ mk-rg-vit.sh 4 3 2
S: .
 | .
 | . L3_0
 | . L3_1
 | . .
 | . .
 | . . L2_0
 | . . L2_1
 | . . .
 | . . .
;

L2_0: .
    | .
    | . .
    | . .
;

L2_1 = L2_0 ;

L3_0: .
    | .
    | . ( .
        | .
        | . .
        | . .
        )
    | . ( .
        | .
        | . .
        | . .
        )
    | . .
    | . .
    | . . .
    | . . .
;

L3_1 = L3_0 ;

Nonterminal symbols Li_j correspond to remaining parts of a word, where i is the maximum length of the remaining part in terminal symbols and j is the index of a copy of the remaining part. To reduce the size of the generated grammar, the script defines each copy with an index greater than 0 as a shallow copy of the production with index 0 (see Cloning Nonterminal Symbols).
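The naming scheme and the clone definitions can be sketched in shell as follows. The function emit_clones is a hypothetical helper, not part of mk-rg-vit.sh. The number of copies is passed in explicitly; in the example above it happens to equal N_PROD, but that relationship is an assumption drawn from that single example.

```shell
# Hypothetical helper (not part of mk-rg-vit.sh): print the clone
# definitions for the nonterminal L<LEN>, given COPIES copies in total.
# Copy 0 carries the full production; copies 1..COPIES-1 refer to it.
# Usage: emit_clones LEN COPIES
emit_clones() {
    len=$1
    copies=$2
    j=1
    while [ "$j" -lt "$copies" ]; do
        echo "L${len}_${j} = L${len}_0 ;"
        j=$((j + 1))
    done
}
```

For instance, emit_clones 3 2 prints the single line L3_1 = L3_0 ; which matches the clone definition in the example output, and emit_clones 2 1 prints nothing, because a single copy needs no clones.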

See Examples for how to use mk-rg-vit.sh to generate top-down template grammars for dividing terminal symbol sequences into words.