This script generates template regular expression grammars for segmenting symbol sequences into words. The script has the following command-line format:
$ mk-rg-vit.sh max_word_len max_segment_len n_prod_1 ... n_prod_N
max_word_len: Maximum word length in terminal symbols.
max_segment_len: Maximum word segment length in terminal symbols. A word segment is a sequence of ‘.’ inside a regular expression. The word segment corresponds to a part of a word. Set this argument equal to max_word_len to generate word segments of all allowed lengths.
n_prod_I: The maximum number of possible terminal symbols to consider at position I in a word. Each possible terminal symbol determines how the remaining part of the word is parsed. If n_prod_I is the last such argument on the command line, the script treats every absent n_prod_J with J>I as equal to n_prod_I.
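The defaulting rule for the n_prod arguments can be sketched as a small shell function. This is only an illustration of the rule as stated above, not code taken from mk-rg-vit.sh; the function name n_prod_at and its calling convention are assumptions made for this example.

```shell
# n_prod_at POS N_PROD_1 [N_PROD_2 ...]
# Hypothetical helper (not part of mk-rg-vit.sh): prints the n_prod value
# that applies at word position POS. Positions beyond the supplied list
# reuse the last value that was given.
# Note: POSIX sh has no local variables, so pos, last, i and value are global.
n_prod_at() {
    pos=$1
    shift
    last=
    i=1
    for value in "$@"; do
        last=$value
        # Stop once the requested position has been reached.
        if [ "$i" -ge "$pos" ]; then
            break
        fi
        i=$((i + 1))
    done
    printf '%s\n' "$last"
}

n_prod_at 1 2 5   # prints 2 (n_prod_1)
n_prod_at 2 2 5   # prints 5 (n_prod_2)
n_prod_at 4 2 5   # prints 5 (n_prod_4 is absent, so n_prod_2 is reused)
```

With a single value, as in the invocation "mk-rg-vit.sh 4 3 2", every position resolves to that value.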
The script dumps a generated regular expression grammar to stdout.
$ mk-rg-vit.sh 4 3 2
S: . | . | . L3_0 | . L3_1 | . . | . . | . . L2_0 | . . L2_1 | . . . | . . . ;
L2_0: . | . | . . | . . ;
L2_1 = L2_0 ;
L3_0: . | . | . ( . | . | . . | . . ) | . ( . | . | . . | . . ) | . . | . . | . . . | . . . ;
L3_1 = L3_0 ;
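The nonterminals Li_j correspond to the remaining parts of a word, where i is the maximum length of the remaining part in terminal symbols and j is the index of a copy of that remaining part. For example, L3_1 is copy 1 of the remaining part that is at most 3 terminal symbols long.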
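When post-processing a generated grammar, it can help to split an Li_j name back into its two numbers. The following is a minimal sketch that relies only on the naming scheme described above; the helper name parse_nonterminal is hypothetical and is not provided by the script.

```shell
# parse_nonterminal NAME
# Hypothetical helper: splits a name such as L3_1 into the maximum
# remaining length (3) and the copy index (1), printed as "3 1".
parse_nonterminal() {
    name=$1
    # Loose shape check: "L", a digit, an underscore, and a digit after it.
    case $name in
        L[0-9]*_[0-9]*) ;;
        *) echo "not an Li_j name: $name" >&2; return 1 ;;
    esac
    rest=${name#L}        # drop the leading "L":       "3_1"
    max_len=${rest%%_*}   # text before the underscore: "3"
    copy_idx=${rest#*_}   # text after the underscore:  "1"
    printf '%s %s\n' "$max_len" "$copy_idx"
}

parse_nonterminal L3_1   # prints "3 1"
```

Names that do not follow the scheme, such as the start symbol S, are rejected with a non-zero return status.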
To reduce the size of a generated grammar, the script defines productions for copies with indices greater than 0 as shallow copies of the production with index 0 (see Shallow Production Copies).
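In the sample output, shallow copies appear as "NAME = NAME ;" rather than as a full production body. The sketch below lists such copies from grammar text read on stdin. It assumes one production per line, which is an assumption about the output layout, and the function name list_shallow_copies is hypothetical. The meaning of a shallow copy is defined in Shallow Production Copies; this example only looks at the "=" syntax.

```shell
# Sample grammar fragment in the layout assumed here: one production per line.
grammar='L2_0: . | . | . . | . . ;
L2_1 = L2_0 ;
L3_1 = L3_0 ;'

# list_shallow_copies
# Hypothetical helper: reads grammar text on stdin and prints
# "copy <- source" for every line of the exact form "NAME = NAME ;".
list_shallow_copies() {
    awk 'NF == 4 && $2 == "=" && $4 == ";" { print $1 " <- " $3 }'
}

printf '%s\n' "$grammar" | list_shallow_copies
# prints:
# L2_1 <- L2_0
# L3_1 <- L3_0
```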
See Examples for using the script mk-rg-vit.sh to generate template regular expression grammars for grammar learning.