This program performs function analogous to the
predict-test example program, but learns a signal sequence specified by a probabilistic context-free grammar (PCFG), not by a probabilistic finite automaton.
langlearn-test program can be considered as a more obvious example of learning a language than the
predict-test program where the language is defined in an indirect way.
As distinct from the
predict-test program, which can use several types of probability profiles defined by internally generated assembler programs that are loaded into single-node models, the
langlearn-test program uses plain actors instead of single-node models and the uniform probability profile.
predict-test program that profile has index 0.
An example of a file with a description of probabilistic context-free grammar is shown below.
S: A S ; A: "a" B "a" "a" C "a" ; B: "b" [0.67] | "b" B [0.33] ; C: "c" [0.67] | "c" C [0.33] ;
Files, which contain sample grammar descriptions, are located in directory samples/gram of the package source tree. The above example is contained in the file samples/gram/gram7.
Nonterminal symbols are sequences of alphanumeric characters and the character ‘_’, which start with a letter or ‘_’. Terminal symbols are quoted (with single or double quotes) sequences of characters; the escape character ‘\’ can be used to insert a quote or the character ‘\’ in these sequences. Every production starts with a nonterminal symbol followed by the character ‘:’ and ends with the character ‘;’. A nonterminal symbol at the left hand side of the first production in a grammar becomes the start nonterminal symbol of the grammar. Right hand sides of productions with the same nonterminal symbol at their left hand sides can be delimited by ‘|’. However, they can also be specified as multiple productions that end with ‘;’. A positive probability or relative probability of production can be specified in square brackets after the right hand side of production before characters ‘|’ or ‘;’. By default, the relative probability equal to 1 is used.
When applying grammar productions, left recursion is suppressed. Therefore, it is recommended not to use such recursion in input grammars, possibly replacing it with right recursion.
To evaluate the efficiency of learning a language, the number of correctly predicted terminal symbols in the sequence is compared with a lower bound and an upper bound. The lower bound is equal to the number of symbols correctly predicted using the classical n-gram approach. The upper bound is equal to the maximum possible number of symbols that can be predicted by any means. To calculate the upper bound, a set of stacks is kept, which contain information on nested productions that can be used to emit terminal symbols. One of those symbols becomes the next terminal symbol of the sequence. With every stack a probability is associated, which makes possible calculating exact probabilities of next terminal symbols. After actual emitting the next terminal symbol, only those stacks survive, which could emit that terminal symbol. The stacks survived then can be split into several next generation stacks that correspond to possible alternatives of selecting next grammar productions. Identical stacks are merged with summing up probabilities associated with them. The upper bound on the number of terminal symbols that can be correctly predicted as the next symbol of the sequence is taken equal to the maximum probability among exact probabilities of next terminal symbols. The upper bound on the number of terminal symbols that can be correctly predicted in a sequence is the sum of upper bounds calculated for all symbols in the sequence.
The maximum allowed size of the set of stacks is specified by the option -C, --nstack-max=INT. When the actual size of the set exceeds the maximum allowed size, the program prints the number of sequence symbols that were generated up to the moment and terminates with an error. For some grammars the stacks are spawned faster than killed, so the size of the set will not be limited by a certain fixed number. In such cases, providing a large value for the option -C INT sometimes solves the problem for specific sequence length. However, for some grammars the argument of option -C INT needs to be very large even for relatively short sequences, and an upper bound on the number of correctly predicted symbols will be impractical to calculate.
The length of every sequence of terminal symbols generated by the example program is limited by the argument of option -n, --nstep-pass=INT (default value is 10000). The length will be shorter if expanding a start nonterminal symbol finishes earlier. That is why many of sample grammars included in the package distribution begin with a recursive production of form ‘S: A S’, so the sequence of terminal symbols produced by the grammar never terminates. However, for big values of option -n INT this approach should be avoided, because it results in ever increasing lengths of stacks used for calculating exact probabilities of next terminal symbols. Because long stacks require long memory blocks to hold them, splitting the stacks takes ever increasing time for block copying operations. To greatly limit the number of frames in the stacks when generating a sequence of a certain length, the recursion can be replaced with a number of nested productions as shown below.
S: A1 A1 A1 A1 A1 A1 A1 A1 A1 A1; A1: A2 A2 A2 A2 A2 A2 A2 A2 A2 A2; A2: A3 A3 A3 A3 A3 A3 A3 A3 A3 A3; A3: A A A A A A A A A A; ...
Such productions will generate sequence
S that consists of 10000 copies of subsequence
In comparison with the
predict-test program, the
langlearn-test program has a few simplifications.
The first one is that actors created by the
langlearn-test program do not use positive countervailing spur incremented every time a correct symbol is predicted.
The second simplification is that when large actors are created, the use of the automatic spur by small actors associated with the large actors is turned off.
The third simplification is that actors use a relative probability function of type
In other respects, the
langlearn-test program identifies current sequence states and makes the prediction of the next sequence signal in the same way as the
predict-test program does when it uses a probability profile of type 0.
You should invoke the
langlearn-test program with an argument, which specifies the name of a file with a grammar description.
The program understands the following command-line options.
The maximum allowed size of the set of stacks used to calculate exact probabilities of next terminal symbols. If is greater than 0, then the maximum exact probability of the next terminal symbol will be used as an upper bound on the number of terminal symbols that can be correctly predicted as the next symbol of the sequence. If is equal to 0, then probability 1 will be used as the upper bound. Default value is 0. In the description of test modes is denoted by ‘Max. stacks’.
A seed to initialize pseudorandom number generators. Default value is 0. In the description of test modes is denoted by ‘Random seed’.
The length of n-grams for predicting the next terminal symbol in the sequence using the classical n-gram approach. The number of symbols correctly predicted in such a way is taken as a lower bound on the number of terminal symbols that can be correctly predicted in the sequence. Value 0 means predicting a terminal symbol that occurs most frequently in a sequence processed so far. Default value is 0. In the description of test modes is denoted by ‘N-gram length’.
Use large actors with specified tree arity. Default value is 2. In the description of test modes is denoted by ‘Large’ and ‘Tree arity’.
The number of actors and the number of sequence states tracked by every actor. The actors operate concurrently in separate threads. In the description of test modes are denoted by ‘Actors’ and ‘States’.
The maximum length of every sequence of terminal symbols to generate according to the input grammar. The actual length can be shorter if expanding a start nonterminal symbol finishes earlier. Default value is 10000. In the description of test modes is denoted by ‘Max. length’.
A file to write program output to instead of stdout.
Perform the test specified number of passes. In the description of test modes is denoted by ‘Passes’.
When invoked with the option -t INT, the program prints a test log, an example of which is shown below. The log begins with a description of test modes used. Then there goes a table, where each row corresponds to a test pass, with a summary row. Finally, there is printed the standard deviation of values in the ‘% ef’ column of the table.
$ ./langlearn-test -i1 -t10 -C3 -L -M6,16 gram/gram7 Grammar file: gram/gram7 Terminals: 3 Passes: 10 Max. length: 10000 N-gram length: 0 Actors: 6 States: 16 Large: on Tree arity: 2 Max. stacks: 3 Random seed: 1 pass length earned ngram exact % ef ---- -------- -------- -------- ------------ -------- 1 10000 7676 5744 8595.850 67.745 2 10000 7019 5725 8589.250 45.178 3 10000 7738 5718 8587.600 70.393 4 10000 7590 5675 8573.080 66.078 5 10000 5765 5706 8583.310 2.051 6 10000 7549 5703 8581.990 64.120 7 10000 6733 5726 8590.240 35.158 8 10000 6868 5746 8596.840 39.357 9 10000 7253 5782 8608.060 52.051 10 10000 7646 5743 8595.520 66.713 TOTL 100000 71837 57268 85901.740 50.881 stddev ef: 21.397
Columns of the table have the following meaning.
A test pass index. For the summary row, ‘TOTL’ is printed here.
The length of the generated sequence of terminal symbols. Is equal to the argument of option -n INT or to a lesser value if expanding a start nonterminal symbol finishes earlier. In the summary row—the sum of values in the column.
The number of terminal symbols correctly predicted according to the algorithm. In the summary row—the sum of values in the column.
The number of terminal symbols correctly predicted using the classical n-gram approach with n-gram length equal to the argument of option -l INT. Is a lower bound used when calculating values in the ‘% ef’ column. In the summary row—the sum of values in the column.
The number of terminal symbols that can be correctly predicted by any means. Is an upper bound used when calculating values in the ‘% ef’ column. In the summary row—the sum of values in the column.
The efficiency of prediction of sequence symbols equal to (earned-ngram)/(exact-ngram)*100% if earned≥ngram and exact>ngram, or to 0 otherwise. For the summary row, is calculated using corresponding values in that row.
When invoked without the option -t INT, but with the option -M NACTOR,NSTATE specified, the program prints the stream of pairs of terminal symbols
When invoked without options -t INT and -M NACTOR,NSTATE, the program prints a stream of terminal symbols produced by an input grammar. An example of such program output is shown below.
$ ./langlearn-test -n100 gram/gram7 a b a a c c a a b b a a c c c a a b a a c a a b a a c a a b a a c c a a b a a c a a b a a c a a b a a c c a a b a a c a a b a a c a a b a a c c a a b a a c c a a b a a c a a b b a a c a a b b b a a c
Reference results of program invocation are represented below. The ‘Lower’ column corresponds to the ‘ngram’ column of test log, and the ‘Upper’ column corresponds to the ‘exact’ column. Values in ‘Lower’, ‘Earned’, ‘Upper’, and ‘% ef’ columns were taken from summary rows of test logs.
Gram. file Options nStep Model Lower Earned Upper % ef ========== ================ ======= ======= ====== ======= ======= ==== gram/gram1 -i1 -t100 -C2 -L -n10000 -M2,16 62497 553800 875000 60.5 -M4,16 615529 68.1 -M8,16 695858 78.0
gram/gram2 -i1 -t100 -C2 -L -n10000 -M3,16 31009 514649 937500 53.4 -M6,16 602224 63.0 -M12,16 704557 74.3
gram/gram3 -i1 -t100 -C2 -L -n10000 -M1,16 100184 456693 800015 50.9 -M2,16 591133 70.1 -M4,16 622143 74.6
gram/gram4 -i1 -t100 -C3 -L -n10000 -M2,32 492575 708138 916612 50.8 -M4,32 772630 66.0 -M8,32 815670 76.2
gram/gram5 -i1 -t100 -C5 -L -n10000 -M4,16 332270 631169 893936 53.2 -M8,16 713483 67.9 -M16,16 767706 77.5
gram/gram6 -i1 -t100 -C6 -L -n40000 -M14,16 998909 1905534 2777543 51.0 -M28,16 2031198 58.0 -M56,16 2145155 64.4
gram/gram7 -i1 -t100 -C3 -L -n10000 -M6,16 572876 723772 859081 52.7 -M12,16 768748 68.4 -M24,16 799322 79.1
gram/gram8 -i1 -t100 -C3 -L -n10000 -M1,16 400000 746964 933286 65.1 -M2,16 794785 74.0 -M4,16 826898 80.1 ========== ================ ======= ======= ====== ======= ======= ====
As it can be seen from the table above, the correctness of the prediction of sequence symbols is increased by using more actors that operate concurrently.
make builds the example program if the QSMM package is configured by the configure script to use the POSIX threads API.
See the file INSTALL in the root of the package distribution for information on the configure script.