9.4 rege-asm

This program is for debugging the generation of an assembler program and a context-free grammar for a regular expression, such as one located on the right-hand side of a production of a regular expression grammar.

The instruction set of a generated assembler program includes instructions for analyzing look-ahead terminal symbols, consuming terminal symbols, incrementing frequencies of productions of a context-free grammar, transferring control to subroutines for parsing nonterminal symbols, and returning control from the subroutines. For more information, see Assembler Instruction Set.

A sequence of subexpressions in a regular expression results in a sequence of code blocks for the subexpressions in a generated assembler program. Assembler code blocks for the ‘?’ and ‘*’ quantifiers and for sets of alternatives separated by ‘|’ have a specific structure. See Assembler Program Structure, for more information.

Run the program using one of the following command line formats.

  1. Dumping an assembler program containing simplified instructions for a regular expression:
    qsmm-example-rege-asm [ --nterm-min=INT ] REGEX
    
  2. Dumping an assembler program containing normal instructions for a regular expression:
    qsmm-example-rege-asm [ --nterm-min=INT ] [ --eos-marker ] --dump-asm=extended REGEX
    
  3. Dumping a context-free grammar for a regular expression:
    qsmm-example-rege-asm --dump-gram[=specific|replace] [ --recurs=right ] REGEX
    
  4. Dumping statistics on a regular expression:
    qsmm-example-rege-asm --dump-stats REGEX
    

The argument REGEX is a regular expression. Refer to Productions, for the regular expression syntax.
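
The four command line formats above can be assembled programmatically. The following is a minimal sketch, not part of the package: the function name rege_asm_argv and its mode labels are hypothetical, and only the options documented in this section are used.

```python
def rege_asm_argv(regex, mode="simple", nterm_min=None, eos_marker=False,
                  gram=None, recurs=None):
    """Build an argument list for qsmm-example-rege-asm.

    mode is a hypothetical label for one of the four documented formats:
    "simple", "extended", "gram", or "stats".
    """
    argv = ["qsmm-example-rege-asm"]
    if mode in ("simple", "extended"):
        # Formats 1 and 2 accept an optional --nterm-min=INT.
        if nterm_min is not None:
            argv.append(f"--nterm-min={nterm_min}")
        if mode == "extended":
            # --eos-marker appears only in format 2.
            if eos_marker:
                argv.append("--eos-marker")
            argv.append("--dump-asm=extended")
    elif mode == "gram":
        # Format 3: the option argument and --recurs are optional.
        argv.append("--dump-gram" if gram is None else f"--dump-gram={gram}")
        if recurs is not None:
            argv.append(f"--recurs={recurs}")
    elif mode == "stats":
        argv.append("--dump-stats")
    else:
        raise ValueError(f"unknown mode: {mode}")
    # REGEX is always the last argument.
    argv.append(regex)
    return argv
```

Passing the resulting list to subprocess.run avoids shell quoting problems with regular expressions that contain quotes, brackets, or ‘*’.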

Example

$ qsmm-example-rege-asm --dump-asm=extended '([ "a" "b" ] D D .)* [ "b" "c" ]'
        ; BEG: ([ "a" "b" ] D D .)* [ "b" "c" ]
        prod    "E", 0
        ; BEG: ([ "a" "b" ] D D .)*
r1:
        ; FIRST: [ "a" "b" "c" ]
        peek    1
        joe     0, b1   ; "a"
        joe     1, t1_1 ; "b"
        jmp     s1      ; "c"
t1_1:
        ; "b"
        jprob   0.5, b1
        jmp     s1
b1:
        prod    "_E_1A", 1
                        ; stochastic on "b"
        ; BEG: [ "a" "b" ] D D .
        rd      "_E_1A", 1, 1, 1
                        ; [ "a" "b" ]
        call    D, 4
        call    D, 5
        rd      "_E_1A", 1, 4, 1
                        ; .
        ; END: [ "a" "b" ] D D .
        jmp     r1
s1:
        prod    "_E_1A", 0
                        ; stochastic on "b"
        ; END: ([ "a" "b" ] D D .)*
        rd      "E", 0, 1, 1
                        ; [ "b" "c" ]
        ; END: ([ "a" "b" ] D D .)* [ "b" "c" ]
        ret
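
The branch at label r1 in the listing above decides whether to enter the body of the ‘*’ group (label b1) or to leave the loop (label s1). The sketch below models only that decision in Python. It assumes, based on the comments in the listing, that peek inspects the next terminal symbol, that joe jumps when the outcome equals its first operand, and that jprob jumps with the given probability. The function name loop_decision is hypothetical.

```python
import random


def loop_decision(lookahead, rng=random):
    """Return "b1" to enter the loop body or "s1" to leave the loop."""
    if lookahead == "a":
        # Only the loop body can start with "a".
        return "b1"
    if lookahead == "b":
        # "b" can start both the loop body and the trailing [ "b" "c" ],
        # so the choice is stochastic: jprob 0.5, b1.
        return "b1" if rng.random() < 0.5 else "s1"
    if lookahead == "c":
        # Only the trailing [ "b" "c" ] can match "c".
        return "s1"
    raise ValueError(f"terminal symbol not in FIRST set: {lookahead!r}")
```

The ambiguity on "b" is the reason why the listing marks both prod "_E_1A" instructions as stochastic on "b".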

The program rege-asm supports the following command line options:

--dump-asm[=simple|extended]

Dump an assembler program for probabilistic parsing of a terminal symbol sequence according to the regular expression:

simple

Dump an assembler program containing simplified instructions. The regular expression cannot contain terminal symbol classes or specific terminal symbols, but it can contain ‘.’.

extended

Dump an assembler program containing normal instructions and instructions for setting up a correspondence between parts of the assembler program and the productions of a context-free grammar for the regular expression. The regular expression can contain terminal symbol classes and specific terminal symbols.

On omitting the option or its argument, the program rege-asm uses --dump-asm=simple.

--dump-gram[=specific|dot|replace]

Dump a context-free grammar for the regular expression:

specific

Dump the grammar where auxiliary nonterminal symbols _E_iT and _E_iTj replace terminal symbol sequences (groups) containing at least one terminal symbol class or ‘.’.

dot

Dump the grammar where auxiliary nonterminal symbols _E_iT and _E_iTj replace terminal symbol sequences (groups) containing at least one terminal symbol class.

replace

Dump the grammar where auxiliary nonterminal symbols _E_iT and _E_iTj replace any terminal symbol sequence.

In _E_iT and _E_iTj, i is the ordinal number of an auxiliary nonterminal symbol, and j is the length of the replaced sequence, present only if the length is greater than 1.

On omitting the option argument, the program uses --dump-gram=dot. On omitting the option, the program does not dump a context-free grammar.
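
The three --dump-gram modes and the naming of auxiliary nonterminal symbols can be summarized in a short sketch. This restates the descriptions above and is not the program's actual implementation. The function names replaces_group and aux_name are hypothetical.

```python
def replaces_group(mode, has_class, has_dot):
    """Tell whether an auxiliary nonterminal replaces a terminal sequence.

    has_class: the sequence contains at least one terminal symbol class.
    has_dot: the sequence contains at least one '.'.
    """
    if mode == "specific":
        return has_class or has_dot
    if mode == "dot":
        return has_class
    if mode == "replace":
        # Any terminal symbol sequence is replaced.
        return True
    raise ValueError(f"unknown --dump-gram mode: {mode}")


def aux_name(i, length):
    """Form _E_iT for a sequence of length 1 and _E_iTj for length j > 1."""
    suffix = str(length) if length > 1 else ""
    return f"_E_{i}T{suffix}"
```

For example, under this reading the first auxiliary nonterminal symbol for a single terminal symbol is _E_1T, and the second one for a sequence of three terminal symbols is _E_2T3.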

--dump-stats

Dump statistics on the regular expression.

--eos-marker

Enable the use of the end-of-stream marker $$ in the regular expression. The end-of-stream marker becomes an extra element of the set of known terminal symbols.

--nterm-min=INT

The minimum number of terminal symbols. With --dump-asm=extended, the program rege-asm generates an assembler program that references the terminal symbols contained in the regular expression and, if the option --eos-marker is passed, the end-of-stream marker $$. Pass --nterm-min=INT to generate an assembler program for a larger set of terminal symbols with --dump-asm=extended, or for a specified number of terminal symbols with --dump-asm=simple. The default minimum number of terminal symbols for generated assembler programs is 2.
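
One way to read the description of --nterm-min is as a lower bound on the number of terminal symbols in the generated program. The sketch below encodes that reading. The use of max is an assumption derived from the word "minimum", not a confirmed detail of the program, and the function name terminal_count is hypothetical.

```python
def terminal_count(dump_asm, regex_terminals=(), eos_marker=False, nterm_min=2):
    """Estimate the number of terminal symbols in a generated program."""
    if dump_asm == "simple":
        # The regular expression names no specific terminal symbols,
        # so the count comes from --nterm-min alone.
        return nterm_min
    if dump_asm == "extended":
        # Distinct terminal symbols in the regular expression, plus $$
        # when --eos-marker is passed.
        referenced = len(set(regex_terminals)) + (1 if eos_marker else 0)
        # Assumption: --nterm-min can only enlarge this set.
        return max(referenced, nterm_min)
    raise ValueError(f"unknown --dump-asm mode: {dump_asm}")
```

For the regular expression in the example above, which references "a", "b", and "c", this reading gives 3 terminal symbols with the default --nterm-min.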

--recurs=left|right

Recursion type for the productions of a context-free grammar dumped on passing the option --dump-gram[=specific|dot|replace]: left or right. By default, the program generates left-recursive productions.