mk-rg-vit.sh
This script generates top-down template grammars for segmenting symbol sequences into words. The script has the following command line format:
mk-rg-vit.sh MAX_WORD_LENGTH MAX_SEGMENT_LENGTH N_PROD_1 ... N_PROD_n
The script arguments:
MAX_WORD_LENGTH
Maximum word length in terminal symbols.
MAX_SEGMENT_LENGTH
Maximum word segment length in terminal symbols.
A word segment is a sequence of ‘.’ inside a regular expression.
A word segment corresponds to a part of a word.
Set this argument to MAX_WORD_LENGTH to generate word segments of every allowed length.
N_PROD_i
The maximum number of possible terminal symbols to consider at position i in a word.
Each possible terminal symbol determines how the remaining part of the word is parsed.
If the arguments N_PROD_j are absent from the command line for all j>i, the script treats them as equal to N_PROD_i, the last value given.
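The defaulting rule for the N_PROD arguments can be illustrated with a small sketch. The function expand_n_prod below is hypothetical and is not part of mk-rg-vit.sh. It also assumes that one value is needed for each position from 1 to MAX_WORD_LENGTH, which the description above does not state explicitly.

```shell
#!/bin/sh
# Hypothetical helper, not taken from mk-rg-vit.sh.
# Expands a partial N_PROD list to one value per word position,
# assuming positions run from 1 to MAX_WORD_LENGTH.
# Usage: expand_n_prod MAX_WORD_LENGTH N_PROD_1 [N_PROD_2 ...]
expand_n_prod() {
    max_len=$1
    shift
    last=
    out=
    pos=1
    while [ "$pos" -le "$max_len" ]; do
        if [ "$#" -gt 0 ]; then
            last=$1        # an explicit value exists for this position
            shift
        fi
        # Once the arguments run out, $last still holds the final explicit value.
        out="${out:+$out }$last"
        pos=$((pos + 1))
    done
    printf '%s\n' "$out"
}

expand_n_prod 4 2      # prints: 2 2 2 2
expand_n_prod 4 3 2    # prints: 3 2 2 2
```

In the second call, positions 3 and 4 have no explicit argument, so both inherit the value 2 given for position 2.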
The script writes the generated regular expression grammar to stdout.
Example
$ mk-rg-vit.sh 4 3 2
S: .
| .
| . L3_0
| . L3_1
| . .
| . .
| . . L2_0
| . . L2_1
| . . .
| . . .
;
L2_0: .
| .
| . .
| . .
;
L2_1 = L2_0 ;
L3_0: .
| .
| . ( .
| .
| . .
| . .
)
| . ( .
| .
| . .
| . .
)
| . .
| . .
| . . .
| . . .
;
L3_1 = L3_0 ;
Nonterminal symbols Li_j correspond to remaining parts of a word, where i is the maximum length of a remaining part in terminal symbols, and j is the index of a copy of the remaining part.
To reduce the size of the generated grammar, productions for copies with indices greater than 0 are defined as shallow copies of the production with index 0 (see Cloning Nonterminal Symbols).
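The clone lines in the example output (L2_1 = L2_0 ; and L3_1 = L3_0 ;) follow a regular pattern, which the sketch below reproduces. The function emit_clones and its two parameters are hypothetical names chosen for illustration; how mk-rg-vit.sh actually produces these lines is not shown here.

```shell
#!/bin/sh
# Hypothetical helper, not taken from mk-rg-vit.sh.
# Prints one clone definition for each copy index 1..COPIES-1 of the
# nonterminal L<LEN>, each referring back to the copy with index 0.
# Usage: emit_clones LEN COPIES
emit_clones() {
    len=$1
    copies=$2
    j=1
    while [ "$j" -lt "$copies" ]; do
        printf 'L%s_%s = L%s_0 ;\n' "$len" "$j" "$len"
        j=$((j + 1))
    done
}

emit_clones 3 2    # prints: L3_1 = L3_0 ;
```

Only the production with index 0 is written out in full, so each additional copy costs a single line in the grammar.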
See Examples for how to use the script mk-rg-vit.sh to generate top-down template grammars for dividing terminal symbol sequences into words.