This program is a probabilistic adaptive top-down parser. It processes a token sequence from left to right multiple times to synthesize a grammar based on a template grammar supplied in regular expression format. The token sequence consists of parse units immediately following one another, where each parse unit corresponds to an expansion of the start nonterminal symbol of the template grammar. The synthesis method is iterative determinization of the template grammar.
The parser can dump a PCFG (Probabilistic Context-Free Grammar) converted from a template regular expression grammar, a learned PCFG, a learned regular expression grammar, token sequences consumed for its nonterminal symbols, and a predicted token sequence for the learned grammar. See pcfg-generate-seq for the description of a program that generates a token sequence according to a specified PCFG. See pcfg-predict-eval for the description of a program that tests the accuracy of grammar synthesis by comparing the number of correctly predicted tokens with the maximum number of correctly predicted tokens estimated for a PCFG and a token sequence generated for this PCFG.
The parser internally generates adaptive assembler routines for the nonterminal symbols of the template grammar. Those assembler routines parse token sequences for corresponding nonterminal symbols and increment the frequencies of PCFG productions generated for the template grammar. Changing the frequencies of productions results in changing the spur. An assembler routine calls other assembler routines for nested nonterminal symbols. See Assembler Instruction Set for the description of an instruction set used in the assembler routines. See Assembler Program Structure for the description of their building blocks.
See mk-rg-vit.sh for the description of a script that generates a template grammar for dividing a text into words. This template grammar provides one-level segmenting a token sequence for probabilistic parsing.
Note: because of inefficient memory consumption by QSMM, the parser supports only simple template grammars.
rege-test described in Auxiliary Programs help in testing the parser and understanding how it works.
The files topdown.c, coverage.h, coverage.c, gram_pcfg.h, gram_pcfg.c, gram_rege.h, gram_rege.c, least_sq.h, least_sq.c, mat_norm.h, mat_norm.c, pcfg_base.h, pcfg_base.c, pcfg_subset.h, pcfg_subset.c, rege_ast.h, rege_ast.c, rege_first.h, rege_first.c, rege_prg.h, and rege_prg.c in the subdirectory samples in the root of the package distribution contain parser source code.
|• Template Grammar Format|
|• Parsing a Token Sequence|
|• Output Information|
|• Iterative Determinization|