This program learns a state model, which corresponds to a signal sequence specified by a probabilistic finite automaton that has output signals, but does not have input ones. Learning efficiency is tested by measuring the ability of the program to predict the next signal of the sequence. An example of a file with a description of probabilistic finite automaton is shown below.
Non-empty line #1 Non-empty line #2 ... Non-empty line #n 3 4 0 0/0 1/1 0/0 2/1 0/0 0/1 0/1 3/1 0/0 0/0 0/0 0/1
All lines before the first empty line are ignored. The first line after that empty line has format
number_of_signals number_of_states initial_state_index
Indices of signals and states start from 0.
Starting from the second line after the empty line there go descriptions of transitions between states during which signals are emitted. The second line corresponds to transitions from state 0, the third—from state 1, the fourth—from state 2, and so on. Each line contains number pairs that correspond to signals. The first pair corresponds to signal 0, the second—to signal 1, the third—to signal 2, and so on. A pair has format
and specifies a transition from a source state to a target state with indicated relative probability, during which a signal is emitted.
Files, which contain example descriptions of probabilistic finite automatons, are located in directory samples/dist of the package source tree.
The program creates at least one single-node model which action emission matrice defines deterministic choice of actions for all node states.
The environment state identification engine of every model uses a relative probability function of type
QSMM_RELPROB_BUILTIN2 and two spur types with the normal way of spur perception.
Spur of type 0 is the automatic spur.
Spur of type 1 has weight 1 and is incremented by 1 when the model correctly predicts the next sequence signal.
When a large actor is used for the environment state identification engine, the weight of the automatic spur of a small actor associated with the large actor and the weight of the automatic spur of the large actor (of type 0) are set to 0.5 to countervail spur of type 1 of the large actor.
Every single-node model is executed in a separate thread that communicates with the main program thread using the Side API. Units of communication are signals. At every step of processing a signal sequence, the next predicted signal is received from each single-node model and the actual next sequence signal is sent to those models.
An important improvement made in QSMM version 1.16 for the example program is a new method of how the overall result of next signal prediction is computed on the basis of the next signal predicted by each single-node model. There is used a vector of signal weights, which is zeroed every time before predicting the next sequence signal. A signal predicted by every single-node model causes the increment of a corresponding element of that vector. For the overall result of next signal prediction there is taken index i of an element of the vector that gives maximum value of weight[i]*tmd/fq[i]. Here weight[i] is the value of the element, tmd is the length of a signal sequence processed by the example program so far, and fq[i] is the number of occurrences of signal i in that signal sequence.
For every single-node model, the program uses one of four built-in probability profiles. Those probability profiles are represented by internally generated assembler programs and specify possible transitions between states depending on signals received in the states. To every state there corresponds a vector of frequencies of signals received in the state during model execution. For the next signal predicted in the state, a signal that has the maximum frequency is taken.
The basic probability profile, which has index 0, is a uniform profile with all states connected, where from every state there are transitions to all other states upon receiving every signal. The structure of such profile with four states is represented in Figure 7.2. Dots denote the states, and lines denote groups of state transitions. The number of state transitions represented by every line in the figure is equal to the number of signals multiplied by two because in this profile every transition can be made in two directions.
Probability profile 1 is suitable for learning sequences of signals, which consist of independent segments of similar structure. The profile has a rectangular form with the initial state to the left. Every state has a reset transition to the initial state. The structure of probability profile 1 with the initial state, three rows, and five columns is represented in Figure 7.3. This example profile supports learning sequences, which consist of independent segments of similar structure that have length up to 6. Dots denote the states, and arrows denote groups of state transitions, where each state transition corresponds to a received signal.
Probability profile 2 is completely deterministic one. Its states are n-grams. This profile could be used to compare the efficiency of next signal prediction using stochastic algorithms and using the classical n-gram approach. The structure of the profile for n-gram length 3 and two signals is represented in Figure 7.4. Rectangles denote the states. The rectangle of the initial state has label “Start” inside it. Rectangles of other states contain n-gram contents. Arrows indicate state transitions and are labeled by indices of signals that trigger the transitions.
Probability profile 3 is based on probability profile 2, but has a number of states that correspond to every n-gram. That number is fixed and along with n-gram length are the parameters for creating the profile. The representation of probability profile 3 would be analogous to the representation of probability profile 2, but would contain several substates in place of every state and would have every arrow split into multiple ones. Every pair of states of probability profile 2 with an arrow between them would be replaced with two groups of substates, where each substate from the first group is connected with each substate from the second group by an arrow that has the same direction. Unfortunately, the example figure of probability profile 3 is not provided because of its substantial complexity.
You should invoke the
predict-test program with an argument, which specifies the name of a file containing the description of a probabilistic finite automaton that defines the signal sequence.
The program understands the following command-line options.
Dump the source text of an assembler program that corresponds to the first model structure (probability profile) descriptor specified by the option -M.
Dump the disassembled learned program of every model to a file named “PREFIXi,” where i is a model index. If PREFIX not specified, then ‘prg_disasm_’ will be used for it.
Dump the state transition matrix of every model to a file named “PREFIXi,” where i is a model index. If PREFIX not specified, then ‘mat_goto_’ will be used for it.
A seed to initialize pseudo-random number generators. Default value is 0. In the description of test modes is denoted by ‘Random seed’.
Use a large actor for the environment state identification engine of every model. In the description of test modes is denoted by ‘Large’.
At least one model structure (probability profile) descriptor. If several descriptors are given, they must be delimited by ‘;’, and STR should be put in quotes. Every descriptor must be specified in one of the following formats.
State transition profile 0 with nstate states.
State transition profile 1 with a rectangular state transition matrix that has nrow rows and ncol columns.
Deterministic state transition profile 2 with n-grams of lengths up to ngram_sz.
State transition profile 3 with n-grams of lengths up to ngram_sz and nstate_ngram states per every n-gram.
In the description of test modes is denoted by ‘Models’ and ‘model[i]’, where i is a model index.
The number of sequence signals to generate using the probabilistic finite automaton (specified by a program argument) and predict at each test pass. Default value is 10000. In the description of test modes is denoted by ‘Steps per pass’.
A file to write program output to instead of stdout.
Perform the test specified number of passes. In the description of test modes is denoted by ‘Passes’.
When the option --dump-disasm[=PREFIX] is specified, disassemble the learned program of every model using an assembler program template. The assembler program template corresponds to an assembler program that can be dumped with the help of option --dump-asm.
When invoked with the option -t INT, the program prints a test log, an example of which is shown below. The log begins with a description of test modes used that includes parameters of state transition profiles specified by the option -M STR. Then there goes a table, where each row corresponds to a test pass, with a summary row. Finally, there is printed the standard deviation of values in ‘% efr’ and ‘% efa’ columns of the table.
$ ./predict-test -i1 -t10 -M1,5,40 -n640000 dist/dist1 Dist. file: dist/dist1 Signals: 10 Dist. states: 18 Passes: 10 Steps per pass: 640000 Large: off Random seed: 1 Models: 1 model: type=1, nrow=5, ncol=40 pass earned random maximal % efr % efa ---- -------- ------ ------------ -------- ------- 1 532581 63825 640000.000 834.439 81.357 2 602249 64123 640000.000 939.209 93.445 3 635244 63973 640000.000 992.988 99.174 4 567070 64337 640000.000 881.406 87.331 5 617353 64130 640000.000 962.659 96.067 6 591651 63792 640000.000 927.469 91.609 7 527643 64022 640000.000 824.159 80.493 8 637622 63784 640000.000 999.658 99.587 9 602574 64029 640000.000 941.095 93.502 10 635594 64214 640000.000 989.806 99.235 TOTL 5949581 640229 6400000.000 929.290 92.180 stddev efr: 63.568 stddev efa: 7.068
Columns of the table have the following meaning.
A test pass index. For the summary row, ‘TOTL’ is printed here.
The number of correctly predicted sequence signals. In the summary row—the sum of values in the column.
The number of coincidences of sequence signals with uniformly distributed random signals. In the summary row—the sum of values in the column.
The number of sequence signals that will be correctly predicted if the program every time predicts a signal, which has the maximum probability in a row of emission matrix of probabilistic finite automaton. The row corresponds to the current state of the automaton. In the summary row—the sum of values in the column.
Relative efficiency equal to (earned/random)*100%. For the summary row, is calculated using corresponding values in that row.
Actual efficiency equal to (earned-random)/(maximal-random)*100%. For the summary row, is calculated using corresponding values in that row.
When invoked without the option -t INT, but with the option -M STR specified, the program prints the stream of pairs of signal identifiers
The number of pairs is equal to the number of steps at a test pass (see the option -n INT).
When invoked without options -t INT and -M STR, the program prints a stream of signal identifiers produced by the probabilistic finite automaton. The number of signal identifiers is equal to the number of steps at a test pass (see the option -n INT). An example of such program output is shown below.
$ ./predict-test -n100 dist/dist1 0 1 2 3 4 5 6 7 8 9 8 7 6 5 4 3 2 1 0 1 2 3 4 5 6 7 8 9 8 7 6 5 4 3 2 1 0 1 2 3 4 5 6 7 8 9 8 7 6 5 4 3 2 1 0 1 2 3 4 5 6 7 8 9 8 7 6 5 4 3 2 1 0 1 2 3 4 5 6 7 8 9 8 7 6 5 4 3 2 1 0 1 2 3 4 5 6 7 8 9
Reference results of program invocation are represented below. Values of ‘% efr’ and ‘% efa’ were taken from summary rows of test logs.
Dist. file Opts. Prefix Model nStep % efr % efa nStep % efr % efa ========== ============ =============== ====== ===== ===== ======== ===== ===== dist/dist1 -i1 -t100 -L -M0,40 -n8000 616.0 57.5 -n640000 824.7 80.5 -M"0,40[x4]" -n8000 837.7 82.2
-i1 -t100 -M1,5,45 -n8000 638.8 60.0 -n640000 894.0 88.2 -M"1,5,45[x6]" -n8000 909.7 90.2
-M3,1,4 -n8000 861.4 84.8 -n640000 936.8 93.0 -M"3,1,4[x4]" -n8000 959.3 95.7
dist/dist2 -i1 -t100 -L -M0,40 -n8000 815.6 79.5 -n640000 868.9 85.5 -M"0,40[x16]" -n8000 868.4 85.3
dist/dist3 -i1 -t100 -L -M0,40 -n8000 741.6 71.2 -n640000 862.8 84.7 -M"0,40[x8]" -n8000 868.8 85.3
-i1 -t100 -M1,5,45 -n8000 710.8 67.8 -n640000 831.7 81.3 -M"1,5,45[x4]" -n8000 854.5 83.7
dist/dist4 -i1 -t100 -M0,40 -n8000 303.3 50.6 -n640000 362.8 65.7 -M"0,40[x16]" -n8000 375.8 68.6
dist/dist5 -i1 -t100 -L -M3,1,8 -n8000 405.4 71.8 -n640000 448.5 81.8 dist/dist6 -i1 -t100 -M3,1,8 -n8000 304.7 51.2 -n640000 382.4 70.6 dist/dist7 -i1 -t100 -L -M1,5,45 -n8000 232.8 33.3 -n640000 272.9 43.2
dist/dist8 -i1 -t100 -M1,5,45 -n8000 268.1 25.9 -n640000 476.1 57.9
-M"1,5,45[x16]" -n8000 503.5 62.1
-M3,1,8 -n8000 454.3 54.5 -n640000 588.7 75.2 ======================= =============== ====== ===== ===== ======== ===== =====
Note: in the ‘Model’ column, notation ‘-M"...[xn]"’ means repeating an expression denoted by ellipses n times with inserting delimiter ‘;’ between copies of the expression. For example, notation ‘-M"3,1,4[x4]"’ is to be expanded to ‘-M"3,1,4;3,1,4;3,1,4;3,1,4"’.
As it can be seen from the table above, in many cases, loss of efficiency of signal prediction for a shorter length of a signal sequence can be compensated by using more models of the same structure. However, for some probabilistic finite automatons, adding more models of the same structure does not provide appreciable increase in the number of correctly predicted signals.
make builds the example program if the QSMM package is configured by the configure script to use the POSIX threads API.
See the file INSTALL in the root of the package distribution for information on the configure script.