7.6 rege-markup-cfg

This tool marks up the productions of a source PCFG in a bottom-up template grammar. The productions can have either terminal symbols or nonterminal symbols at the right-hand side.

For source PCFG productions with terminal symbols at the right-hand side, the tool adds terminal symbol segment names (see Named Segments). These names act as the nonterminal symbols at the left-hand side of those productions.

Example

If a production right-hand side contains the expression

( . . ) ( . . . )

the rege-markup-cfg tool may convert it to the expression

_S_1T2: . . _S_2T3: . . .

where the terminal symbol segment names _S_1T2 and _S_2T3 act as nonterminal symbols of a source PCFG.
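
The naming pattern in the example above can be sketched in a few lines of Python. This is only an illustration, not the tool's implementation: the scheme `_<LHS>_<index>T<count>`, with the count omitted for a single terminal, is inferred from the examples in this section, and the helper names `segment_name` and `mark_terminal_groups` are hypothetical.

```python
def segment_name(lhs: str, index: int, n_terminals: int) -> str:
    """Build a segment name such as _S_1T2.

    Assumption (inferred from the examples, not from the tool's source):
    the count suffix is omitted when the group holds a single terminal.
    """
    count = "" if n_terminals == 1 else str(n_terminals)
    return f"_{lhs}_{index}T{count}"


def mark_terminal_groups(lhs: str, group_sizes: list[int]) -> str:
    """Render groups of '.' terminals, each prefixed with its segment name."""
    parts = []
    for index, size in enumerate(group_sizes, start=1):
        dots = " ".join("." for _ in range(size))
        parts.append(f"{segment_name(lhs, index, size)}: {dots}")
    return " ".join(parts)


# The expression ( . . ) ( . . . ) inside a production for S:
print(mark_terminal_groups("S", [2, 3]))  # _S_1T2: . . _S_2T3: . . .
```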

For source PCFG productions with nonterminal symbols at the right-hand side, rege-markup-cfg adds the markers #l… (see Marking a Left-Hand Side) and #r… (see Marking a Right-Hand Side).

Example

The rege-markup-cfg tool converts the production

S: A ( B
     | C
     | D
     ) E F
 | G (H I)* J
;

to the production

S #l-S: A #l-_S_1C ( B #r0
                   | C #r1
                   | D #r2
                   ) E F #r0
      | G (#l-_S_2A H I #r0)* J #r1
;
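
To make the numbering in this example easier to follow, here is a minimal Python sketch that reproduces it on a single line. It is not the tool's algorithm: the `Group` class and `mark_up` function are hypothetical, and the suffixes `C` (group of alternatives) and `A` (group under ‘*’) are assumptions read off the example. Each list of alternatives gets #r markers counted from 0, and each group gets a fresh #l marker whose name carries a per-production counter.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Group:
    """A parenthesized subexpression (hypothetical helper type)."""
    alternatives: list   # each alternative is a list of symbols or Groups
    star: bool = False   # True for ( ... )*


def mark_up(lhs: str, alternatives: list) -> str:
    counter = count(1)   # numbers the generated group names within this production

    def render_alts(alts: list) -> str:
        # Every alternative is closed by #r<i>, counting from 0 within its own list.
        return " | ".join(f"{render_seq(a)} #r{i}" for i, a in enumerate(alts))

    def render_seq(items: list) -> str:
        return " ".join(render_item(item) for item in items)

    def render_item(item) -> str:
        if isinstance(item, str):
            return item
        n = next(counter)
        if item.star:
            # Assumed from the example: the #l marker sits inside the starred group.
            return f"(#l-_{lhs}_{n}A {render_alts(item.alternatives)})*"
        return f"#l-_{lhs}_{n}C ( {render_alts(item.alternatives)} )"

    return f"{lhs} #l-{lhs}: {render_alts(alternatives)} ;"


production = [
    ["A", Group([["B"], ["C"], ["D"]]), "E", "F"],
    ["G", Group([["H", "I"]], star=True), "J"],
]
print(mark_up("S", production))
```

The printed line matches the converted production above once its line breaks are removed.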

To mark up a subexpression under the ‘?’ quantifier in a source PCFG production, rege-markup-cfg replaces it with two alternatives separated by ‘|’: the first alternative is empty, and the second is the subexpression that was under the ‘?’ quantifier.

Example

The rege-markup-cfg tool converts the production

S: A (B C)? D
;

to the production

S #l-S: A #l-_S_1Q ( #r0
                   | B C #r1
                   ) D #r0
;
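
The ‘?’ rewriting can be sketched as follows. This is an illustration under assumptions, not the tool's code: `expand_optional` and `render_optional` are hypothetical names, and the `Q` suffix in the generated name is taken from the example above.

```python
def expand_optional(body: list[str]) -> list[list[str]]:
    """Rewrite (body)? as two alternatives: the empty one first, then body."""
    return [[], list(body)]


def render_optional(lhs: str, index: int, body: list[str]) -> str:
    """Render the marked-up group that replaces (body)? in a production for lhs."""
    marked = []
    for i, alternative in enumerate(expand_optional(body)):
        # The empty alternative consists of its #r marker only.
        marked.append(" ".join(alternative + [f"#r{i}"]))
    return f"#l-_{lhs}_{index}Q ( " + " | ".join(marked) + " )"


# The subexpression (B C)? inside a production for S:
print(render_optional("S", 1, ["B", "C"]))  # #l-_S_1Q ( #r0 | B C #r1 )
```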

Example

$ cat sample-rege-markup-cfg.rg

  S: A B
  ;

  A: .
   | . .
   | . . .
   | . . . .
  ;

  B = A ;

$ rege-markup-cfg sample-rege-markup-cfg.rg

  S #l-S: A B #r0
  ;

  A #l-A: _A_1T: . #r0
        | _A_2T2: . . #r1
        | _A_3T3: . . . #r2
        | _A_4T4: . . . . #r3
  ;

  B #l-B: _B_1T: . #r0
        | _B_2T2: . . #r1
        | _B_3T3: . . . #r2
        | _B_4T4: . . . . #r3
  ;

Synopsis

In the usual case, use the following command-line format:

rege-markup-cfg [ -o OUTPUT_BU_GRAMMAR_FILE ] [ --nont-class ] INPUT_BU_GRAMMAR_FILE

where INPUT_BU_GRAMMAR_FILE is the name of a file containing an input bottom-up template grammar (see Bottom-Up Template Grammar); the filename ‘-’ means stdin.

Command-Line Options

The rege-markup-cfg tool supports the following command-line options:

--error-nont

Allow using the error nonterminal symbol in a regular expression grammar. See Error Nonterminal Symbol, for more information.

--nont-class

Allow using nonterminal symbol classes in a regular expression grammar. See Nonterminal Symbol Classes, for more information.

--out-cfg=FILE

Dump a context-free grammar to a file. The filename ‘-’ means stdout. The context-free grammar is the source PCFG without production probabilities and without productions that have terminal symbols at the right-hand side. The markup added to the regular expression grammar corresponds to this source PCFG. This option conflicts with the option --nont-class. By default, the context-free grammar is not dumped.

-o, --out-gram-markup=FILE

Dump a regular expression grammar containing the markup of source PCFG productions to a file. By default, dump the regular expression grammar to stdout.

--recurs=left|right

Recursion type for a context-free grammar dumped using the option --out-cfg=FILE: left or right. The default value is ‘left’.
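
When the tool is driven from a script, the option rules above can be checked before the command is run. The following Python sketch only assembles an argument list from the options documented in this section and does not execute anything. The function `build_argv` and its parameter names are hypothetical and are not part of rege-markup-cfg.

```python
def build_argv(input_file, out_markup=None, out_cfg=None, recurs=None,
               nont_class=False, error_nont=False):
    """Assemble a rege-markup-cfg command line from the documented options."""
    # Documented rule: --out-cfg conflicts with --nont-class.
    if out_cfg is not None and nont_class:
        raise ValueError("--out-cfg conflicts with --nont-class")
    # Documented values for --recurs are 'left' (the default) and 'right'.
    if recurs not in (None, "left", "right"):
        raise ValueError("--recurs must be 'left' or 'right'")

    argv = ["rege-markup-cfg"]
    if error_nont:
        argv.append("--error-nont")
    if nont_class:
        argv.append("--nont-class")
    if out_cfg is not None:
        argv.append(f"--out-cfg={out_cfg}")   # '-' means stdout
    if out_markup is not None:
        argv += ["-o", out_markup]
    if recurs is not None:
        argv.append(f"--recurs={recurs}")
    argv.append(input_file)                   # '-' means stdin
    return argv


print(build_argv("sample-rege-markup-cfg.rg", out_markup="marked.rg",
                 out_cfg="source.cfg", recurs="right"))
```

The file names `marked.rg` and `source.cfg` are placeholders. The resulting list can be passed to `subprocess.run` on a system where the tool is installed.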