Previous: , Up: Example Programs   [Contents][Index]


7.12 langlearn-test

This program performs function analogous to the predict-test example program, but learns a signal sequence specified by a probabilistic context-free grammar (PCFG), not by a probabilistic finite automaton. Thus, the langlearn-test program can be considered as a more obvious example of learning a language than the predict-test program where the language is defined in an indirect way. As distinct from the predict-test program, which can use several types of probability profiles defined by internally generated assembler programs that are loaded into single-node models, the langlearn-test program uses plain actors instead of single-node models and the uniform probability profile. In the predict-test program that profile has index 0.

An example of a file with a description of probabilistic context-free grammar is shown below.

S: A S
;

A: "a" B "a" "a" C "a"
;

B: "b"    [0.67]
 | "b" B  [0.33]
;

C: "c"    [0.67]
 | "c" C  [0.33]
;

Files, which contain sample grammar descriptions, are located in directory samples/gram of the package source tree. The above example is contained in the file samples/gram/gram7.

Nonterminal symbols are sequences of alphanumeric characters and the character ‘_’, which start with a letter or ‘_’. Terminal symbols are quoted (with single or double quotes) sequences of characters; the escape character ‘\’ can be used to insert a quote or the character ‘\’ in these sequences. Every production starts with a nonterminal symbol followed by the character ‘:’ and ends with the character ‘;’. A nonterminal symbol at the left hand side of the first production in a grammar becomes the start nonterminal symbol of the grammar. Right hand sides of productions with the same nonterminal symbol at their left hand sides can be delimited by ‘|’. However, they can also be specified as multiple productions that end with ‘;’. A positive probability or relative probability of production can be specified in square brackets after the right hand side of production before characters ‘|’ or ‘;’. By default, the relative probability equal to 1 is used.

When applying grammar productions, left recursion is suppressed. Therefore, it is recommended not to use such recursion in input grammars, possibly replacing it with right recursion.

To evaluate the efficiency of learning a language, the number of correctly predicted terminal symbols in the sequence is compared with a lower bound and an upper bound. The lower bound is equal to the number of symbols correctly predicted using the classical n-gram approach. The upper bound is equal to the maximum possible number of symbols that can be predicted by any means. To calculate the upper bound, a set of stacks is kept, which contain information on nested productions that can be used to emit terminal symbols. One of those symbols becomes the next terminal symbol of the sequence. With every stack a probability is associated, which makes possible calculating exact probabilities of next terminal symbols. After actual emitting the next terminal symbol, only those stacks survive, which could emit that terminal symbol. The stacks survived then can be split into several next generation stacks that correspond to possible alternatives of selecting next grammar productions. Identical stacks are merged with summing up probabilities associated with them. The upper bound on the number of terminal symbols that can be correctly predicted as the next symbol of the sequence is taken equal to the maximum probability among exact probabilities of next terminal symbols. The upper bound on the number of terminal symbols that can be correctly predicted in a sequence is the sum of upper bounds calculated for all symbols in the sequence.

The maximum allowed size of the set of stacks is specified by the option -C, --nstack-max=INT. When the actual size of the set exceeds the maximum allowed size, the program prints the number of sequence symbols that were generated up to the moment and terminates with an error. For some grammars the stacks are spawned faster than killed, so the size of the set will not be limited by a certain fixed number. In such cases, providing a large value for the option -C INT sometimes solves the problem for specific sequence length. However, for some grammars the argument of option -C INT needs to be very large even for relatively short sequences, and an upper bound on the number of correctly predicted symbols will be impractical to calculate.

The length of every sequence of terminal symbols generated by the example program is limited by the argument of option -n, --nstep-pass=INT (default value is 10000). The length will be shorter if expanding a start nonterminal symbol finishes earlier. That is why many of sample grammars included in the package distribution begin with a recursive production of form ‘S: A S’, so the sequence of terminal symbols produced by the grammar never terminates. However, for big values of option -n INT this approach should be avoided, because it results in ever increasing lengths of stacks used for calculating exact probabilities of next terminal symbols. Because long stacks require long memory blocks to hold them, splitting the stacks takes ever increasing time for block copying operations. To greatly limit the number of frames in the stacks when generating a sequence of a certain length, the recursion can be replaced with a number of nested productions as shown below.

S:  A1 A1 A1 A1 A1 A1 A1 A1 A1 A1;
A1: A2 A2 A2 A2 A2 A2 A2 A2 A2 A2;
A2: A3 A3 A3 A3 A3 A3 A3 A3 A3 A3;
A3: A A A A A A A A A A;
...

Such productions will generate sequence S that consists of 10000 copies of subsequence A.

In comparison with the predict-test program, the langlearn-test program has a few simplifications. The first one is that actors created by the langlearn-test program do not use positive countervailing spur incremented every time a correct symbol is predicted. The second simplification is that when large actors are created, the use of the automatic spur by small actors associated with the large actors is turned off. The third simplification is that actors use a relative probability function of type QSMM_RELPROB_BUILTIN3. In other respects, the langlearn-test program identifies current sequence states and makes the prediction of the next sequence signal in the same way as the predict-test program does when it uses a probability profile of type 0.

You should invoke the langlearn-test program with an argument, which specifies the name of a file with a grammar description. The program understands the following command-line options.

-C, --nstack-max=INT

The maximum allowed size of the set of stacks used to calculate exact probabilities of next terminal symbols. If is greater than 0, then the maximum exact probability of the next terminal symbol will be used as an upper bound on the number of terminal symbols that can be correctly predicted as the next symbol of the sequence. If is equal to 0, then probability 1 will be used as the upper bound. Default value is 0. In the description of test modes is denoted by ‘Max. stacks’.

-i, --seed=INT

A seed to initialize pseudorandom number generators. Default value is 0. In the description of test modes is denoted by ‘Random seed’.

-l, --ngram-length=INT

The length of n-grams for predicting the next terminal symbol in the sequence using the classical n-gram approach. The number of symbols correctly predicted in such a way is taken as a lower bound on the number of terminal symbols that can be correctly predicted in the sequence. Value 0 means predicting a terminal symbol that occurs most frequently in a sequence processed so far. Default value is 0. In the description of test modes is denoted by ‘N-gram length’.

-L, --large[=INT]

Use large actors with specified tree arity. Default value is 2. In the description of test modes is denoted by ‘Large’ and ‘Tree arity’.

-M, --model=NACTOR,NSTATE

The number of actors and the number of sequence states tracked by every actor. The actors operate concurrently in separate threads. In the description of test modes are denoted by ‘Actors’ and ‘States’.

-n, --nstep-pass=INT

The maximum length of every sequence of terminal symbols to generate according to the input grammar. The actual length can be shorter if expanding a start nonterminal symbol finishes earlier. Default value is 10000. In the description of test modes is denoted by ‘Max. length’.

-o FILE

A file to write program output to instead of stdout.

-t, --test=INT

Perform the test specified number of passes. In the description of test modes is denoted by ‘Passes’.

When invoked with the option -t INT, the program prints a test log, an example of which is shown below. The log begins with a description of test modes used. Then there goes a table, where each row corresponds to a test pass, with a summary row. Finally, there is printed the standard deviation of values in the ‘% ef’ column of the table.

$ ./langlearn-test -i1 -t10 -C3 -L -M6,16 gram/gram7
 Grammar file: gram/gram7
    Terminals: 3
       Passes: 10
  Max. length: 10000
N-gram length: 0
       Actors: 6
       States: 16
        Large: on
   Tree arity: 2
  Max. stacks: 3
  Random seed: 1

pass   length   earned    ngram        exact     % ef
---- -------- -------- -------- ------------ --------
   1    10000     7676     5744     8595.850   67.745
   2    10000     7019     5725     8589.250   45.178
   3    10000     7738     5718     8587.600   70.393
   4    10000     7590     5675     8573.080   66.078
   5    10000     5765     5706     8583.310    2.051
   6    10000     7549     5703     8581.990   64.120
   7    10000     6733     5726     8590.240   35.158
   8    10000     6868     5746     8596.840   39.357
   9    10000     7253     5782     8608.060   52.051
  10    10000     7646     5743     8595.520   66.713

TOTL   100000    71837    57268    85901.740   50.881

stddev ef:  21.397

Columns of the table have the following meaning.

pass

A test pass index. For the summary row, ‘TOTL’ is printed here.

length

The length of the generated sequence of terminal symbols. Is equal to the argument of option -n INT or to a lesser value if expanding a start nonterminal symbol finishes earlier. In the summary row—the sum of values in the column.

earned

The number of terminal symbols correctly predicted according to the algorithm. In the summary row—the sum of values in the column.

ngram

The number of terminal symbols correctly predicted using the classical n-gram approach with n-gram length equal to the argument of option -l INT. Is a lower bound used when calculating values in the ‘% ef’ column. In the summary row—the sum of values in the column.

exact

The number of terminal symbols that can be correctly predicted by any means. Is an upper bound used when calculating values in the ‘% ef’ column. In the summary row—the sum of values in the column.

% ef

The efficiency of prediction of sequence symbols equal to (earned-ngram)/(exact-ngram)*100% if earnedngram  and exact>ngram, or to 0 otherwise. For the summary row, is calculated using corresponding values in that row.

When invoked without the option -t INT, but with the option -M NACTOR,NSTATE specified, the program prints the stream of pairs of terminal symbols

actual_symbol:predicted_symbol

When invoked without options -t INT and -M NACTOR,NSTATE, the program prints a stream of terminal symbols produced by an input grammar. An example of such program output is shown below.

$ ./langlearn-test -n100 gram/gram7
a b a a c c a a b b a a c c c a a b a a c a a b a a c a a b a a c c a
a b a a c a a b a a c a a b a a c c a a b a a c a a b a a c a a b a a
c c a a b a a c c a a b a a c a a b b a a c a a b b b a a c

Reference results of program invocation are represented below. The ‘Lower’ column corresponds to the ‘ngram’ column of test log, and the ‘Upper’ column corresponds to the ‘exact’ column. Values in ‘Lower’, ‘Earned’, ‘Upper’, and ‘% ef’ columns were taken from summary rows of test logs.

Gram. file  Options           nStep    Model     Lower   Earned    Upper  % ef
==========  ================  =======  =======  ======  =======  =======  ====
gram/gram1  -i1 -t100 -C2 -L  -n10000  -M2,16    62497   553800   875000  60.5
                                       -M4,16            615529           68.1
                                       -M8,16            695858           78.0
                                                   
gram/gram2  -i1 -t100 -C2 -L  -n10000  -M3,16    31009   514649   937500  53.4
                                       -M6,16            602224           63.0
                                       -M12,16           704557           74.3
                                                   
gram/gram3  -i1 -t100 -C2 -L  -n10000  -M1,16   100184   456693   800015  50.9
                                       -M2,16            591133           70.1
                                       -M4,16            622143           74.6
                                                   
gram/gram4  -i1 -t100 -C3 -L  -n10000  -M2,32   492575   708138   916612  50.8
                                       -M4,32            772630           66.0
                                       -M8,32            815670           76.2

gram/gram5  -i1 -t100 -C5 -L  -n10000  -M4,16   332270   631169   893936  53.2
                                       -M8,16            713483           67.9
                                       -M16,16           767706           77.5
                                                   
gram/gram6  -i1 -t100 -C6 -L  -n40000  -M14,16  998909  1905534  2777543  51.0
                                       -M28,16          2031198           58.0
                                       -M56,16          2145155           64.4
                                                   
gram/gram7  -i1 -t100 -C3 -L  -n10000  -M6,16   572876   723772   859081  52.7
                                       -M12,16           768748           68.4
                                       -M24,16           799322           79.1
                                                   
gram/gram8  -i1 -t100 -C3 -L  -n10000  -M1,16   400000   746964   933286  65.1
                                       -M2,16            794785           74.0
                                       -M4,16            826898           80.1
==========  ================  =======  =======  ======  =======  =======  ====

As it can be seen from the table above, the correctness of the prediction of sequence symbols is increased by using more actors that operate concurrently.

The command make builds the example program if the QSMM package is configured by the configure script to use the POSIX threads API. See the file INSTALL in the root of the package distribution for information on the configure script.


Previous: , Up: Example Programs   [Contents][Index]