This subsection contains examples of learning PCFGs using the bottom-up template grammars in the files bu-1-14-uni-no-quant.rg and bu-uni-alt-1.rg located in the directory $prefix/share/qsmm/samples/gram, installed from the directory samples/gram in the package distribution.
The bottom-up template grammar bu-1-14-uni-no-quant.rg is the unification of all PCFGs in the files *.pcfg in that directory except for the file 7.pcfg. The bottom-up template grammar bu-uni-alt-1.rg represents the same unification in an alternative way.
To reproduce the examples, execute the preparation commands listed below in a temporary directory.
See rege-markup-cfg, rege-vit, and rege-bottom-up for descriptions of the referenced tools.
$ mkdir seq log td-learn bu-learn
$ prefix=/usr # use a different prefix, if necessary
$ rege-markup-cfg -o markup-bu-1-14-uni-no-quant.rg \
"$prefix/share/qsmm/samples/gram/bu-1-14-uni-no-quant.rg"
$ rege-vit -o vit-markup-bu-1-14-uni-no-quant.rg --cfg-markup \
16 markup-bu-1-14-uni-no-quant.rg
$ rege-bottom-up -o td-1-14-uni-no-quant.rg --cfg-markup --template \
vit-markup-bu-1-14-uni-no-quant.rg
$ rege-markup-cfg -o markup-bu-uni-alt-1.rg --nont-class \
"$prefix/share/qsmm/samples/gram/bu-uni-alt-1.rg"
$ rege-vit -o vit-markup-bu-uni-alt-1.rg --cfg-markup 16 markup-bu-uni-alt-1.rg
$ rege-bottom-up -o td-uni-alt-1.rg --cfg-markup --template \
vit-markup-bu-uni-alt-1.rg
The above commands convert the bottom-up template grammars $prefix/share/qsmm/samples/gram/bu-1-14-uni-no-quant.rg and $prefix/share/qsmm/samples/gram/bu-uni-alt-1.rg to the top-down template grammars td-1-14-uni-no-quant.rg and td-uni-alt-1.rg, respectively.
See Files {bu,td}-1-14-uni-no-quant.rg, for contents of the former two grammars. See Files {bu,td}-uni-alt-1.rg, for contents of the latter two grammars.
Continue by executing commands for the examples below.
The examples use the auxiliary programs described in pcfg-generate-seq and pcfg-predict-eval.
$ cat "$prefix/share/qsmm/samples/gram/1.pcfg"
S: A B
;
A: "a" "b" "c" "d"
| "d" "c" "b" "a"
;
B: "e" "f" "g" "h"
| "h" "g" "f" "e"
;
$ pcfg-generate-seq -i1 -n10000 --separate-parse-units \
-o seq/1-10k-unit.seq "$prefix/share/qsmm/samples/gram/1.pcfg"
Using the Unified Template Grammar
$ abu-parser -i1 -n20000 --viterbi --det-niter-goal=50 --det-niter-keep=1 \
--randomize-det-iter --od=td-learn/1a_det.rg --oo=log/1a.log \
td-1-14-uni-no-quant.rg seq/1-10k-unit.seq
$ abu-parser -i1 --viterbi --ops=bu-learn/1a_out.pcfg --oo --simplify \
td-learn/1a_det.rg seq/1-10k-unit.seq
p_td 0.82778218, p_so 0.75789709, p_rd 0.75789709, cp 3
$ cat bu-learn/1a_out.pcfg
S: A B // 1250
;
A: "d" "c" "b" "a" [0.508] // 635
| "a" "b" "c" "d" [0.492] // 615
; // 1250
B: "h" "g" "f" "e" [0.508] // 635
| "e" "f" "g" "h" [0.492] // 615
; // 1250
Using the Alternative Template Grammar
$ abu-parser -i1 -n80000 --viterbi --det-niter-goal=100 --det-niter-keep=1 \
--randomize-det-iter --od=td-learn/1b_det.rg --oo=log/1b.log \
td-uni-alt-1.rg seq/1-10k-unit.seq
$ abu-parser -i1 --viterbi --ops=bu-learn/1b_out.pcfg --oo --simplify \
td-learn/1b_det.rg seq/1-10k-unit.seq
p_td 0.84953738, p_so 0.82036536, p_rd 0.82036536, cp 3
$ cat bu-learn/1b_out.pcfg
S: B1 B2 // 1250
;
B1: W4_2 // 1250
;
B2: W4_3 [0.508] // 635
| W4_4 [0.492] // 615
; // 1250
W4_2: "d" "c" "b" "a" [0.508] // 635
| "a" "b" "c" "d" [0.492] // 615
; // 1250
W4_3: "h" "g" "f" "e" // 635
;
W4_4: "e" "f" "g" "h" // 615
;
$ cat "$prefix/share/qsmm/samples/gram/2.pcfg"
S: L00 L01
;
L00: L10 L11
| L11 L10
;
L01: L12 L13
| L13 L12
;
L10: "a" "b" "c" "d"
;
L11: "e" "f" "g" "h"
;
L12: "d" "c" "b" "a"
;
L13: "h" "g" "f" "e"
;
$ pcfg-generate-seq -i1 -n10000 --separate-parse-units \
-o seq/2-10k-unit.seq "$prefix/share/qsmm/samples/gram/2.pcfg"
Using the Unified Template Grammar
$ abu-parser -i1 -n10000 --viterbi --det-niter-goal=50 --det-niter-keep=1 \
--randomize-det-iter --od=td-learn/2a_det.rg --oo=log/2a.log \
td-1-14-uni-no-quant.rg seq/2-10k-unit.seq
$ abu-parser -i1 --viterbi --ops=bu-learn/2a_out.pcfg --oo --simplify \
td-learn/2a_det.rg seq/2-10k-unit.seq
p_td 0.85728227, p_so 0.88162347, p_rd 1.00000000, cp 3
$ cat bu-learn/2a_out.pcfg
S: L00 L01 // 625
;
L00: L10 L11 [0.5136] // 321
| L11 L10 [0.4864] // 304
; // 625
L01: L13 L12 [0.504] // 315
| L12 L13 [0.496] // 310
; // 625
L10: "e" "f" "g" "h" // 625
;
L11: "a" "b" "c" "d" // 625
;
L12: "d" "c" "b" "a" // 625
;
L13: "h" "g" "f" "e" // 625
;
Using the Alternative Template Grammar
$ abu-parser -i1 -n10000 --viterbi --det-niter-goal=50 --det-niter-keep=1 \
--randomize-det-iter --od=td-learn/2b_det.rg --oo=log/2b.log \
td-uni-alt-1.rg seq/2-10k-unit.seq
$ abu-parser -i1 --viterbi --ops=bu-learn/2b_out.pcfg --oo --simplify \
td-learn/2b_det.rg seq/2-10k-unit.seq
p_td 0.82776840, p_so 0.88162347, p_rd 1.00000000, cp 0
$ cat bu-learn/2b_out.pcfg
S: B1 B2 // 625
;
B1: W4_1 W4_3 [0.5136] // 321
| W4_3 W4_1 [0.4864] // 304
; // 625
B2: W4_2 W4_4 [0.504] // 315
| W4_4 W4_2 [0.496] // 310
; // 625
W4_1: "e" "f" "g" "h" // 625
;
W4_2: "h" "g" "f" "e" // 625
;
W4_3: "a" "b" "c" "d" // 625
;
W4_4: "d" "c" "b" "a" // 625
;
$ cat "$prefix/share/qsmm/samples/gram/3.pcfg"
S: A B
;
A: "a" "b" "c"
| "d" "c" "b" "a"
;
B: "e"
| "f" "e"
;
$ pcfg-generate-seq -i1 -n10000 --separate-parse-units \
-o seq/3-10k-unit.seq "$prefix/share/qsmm/samples/gram/3.pcfg"
Using the Alternative Template Grammar
$ abu-parser -i1 -n20000 --viterbi --det-niter-goal=50 --det-niter-keep=1 \
--randomize-det-iter --od=td-learn/3b_det.rg --oo=log/3b.log \
td-uni-alt-1.rg seq/3-10k-unit.seq
$ abu-parser -i1 --viterbi --ops=bu-learn/3b_out.pcfg --oo --simplify \
td-learn/3b_det.rg seq/3-10k-unit.seq
p_td 0.90778245, p_so 0.79357088, p_rd 1.00000000, cp 0
$ cat bu-learn/3b_out.pcfg
S: W2_1 W3_2 [0.25727182] // 513
| B1 W2_9 [0.25325978] // 505
| B1 B2 [0.25125376] // 501
| W2_1 B4 [0.23821464] // 475
; // 1994
B1: W4_2 // 1006
;
B2: W1_1 // 501
;
B4: W2_5 // 475
;
W1_1: "e" // 501
;
W2_1: "a" "b" // 988
;
W2_5: "c" "e" // 475
;
W2_9: "f" "e" // 505
;
W3_2: "c" "f" "e" // 513
;
W4_2: "d" "c" "b" "a" // 1006
;
$ cat "$prefix/share/qsmm/samples/gram/5.pcfg"
S: B C
| B C C
| D C B
| D D C
;
B: "a" "a"
;
C: "c" "b" "c"
;
D: "d" "b" "b" "d"
;
$ pcfg-generate-seq -i1 -n10000 --separate-parse-units \
-o seq/5-10k-unit.seq "$prefix/share/qsmm/samples/gram/5.pcfg"
Using the Unified Template Grammar
$ abu-parser -i1 -n40000 --viterbi --det-niter-goal=50 --det-niter-keep=1 \
--randomize-det-iter --od=td-learn/5a_det.rg --oo=log/5a.log \
td-1-14-uni-no-quant.rg seq/5-10k-unit.seq
$ abu-parser -i1 --viterbi --ops=bu-learn/5a_out.pcfg --oo --simplify \
td-learn/5a_det.rg seq/5-10k-unit.seq
p_td 1.00000000, p_so 0.68039575, p_rd 1.00000000, cp 4
$ cat bu-learn/5a_out.pcfg
S: B C C [0.26477935] // 318
| D C B [0.26311407] // 316
| D D C [0.24562864] // 295
| B C [0.22647794] // 272
; // 1201
B: "a" "a" // 906
;
C: "c" "b" "c" // 1519
;
D: "d" "b" "b" "d" // 906
;
Using the Alternative Template Grammar
$ abu-parser -i1 -n640000 --viterbi --det-niter-goal=50 --det-niter-keep=1 \
--randomize-det-iter --od=td-learn/5b_det.rg --oo=log/5b.log \
td-uni-alt-1.rg seq/5-10k-unit.seq
$ abu-parser -i1 --viterbi --ops=bu-learn/5b_out.pcfg --oo --simplify \
td-learn/5b_det.rg seq/5-10k-unit.seq
p_td 1.00000000, p_so 0.80941299, p_rd 1.00000000, cp 5
$ cat bu-learn/5b_out.pcfg
S: W2_1 W3_2 W3_2 [0.26477935] // 318
| W4_5 W3_2 W2_1 [0.26311407] // 316
| W4_5 W4_5 W3_2 [0.24562864] // 295
| W2_1 W3_2 [0.22647794] // 272
; // 1201
W2_1: "a" "a" // 906
;
W3_2: "c" "b" "c" // 1519
;
W4_5: "d" "b" "b" "d" // 906
;
$ cat "$prefix/share/qsmm/samples/gram/6.pcfg"
S: B C
| B D D
| B E E E
;
B: "a" "b"
| "b" "b"
;
C: "a" "c"
| "c" "c"
;
D: "a" "d"
| "d" "d"
;
E: "a" "e"
| "e" "e"
;
$ pcfg-generate-seq -i1 -n80000 --separate-parse-units \
-o seq/6-80k-unit.seq "$prefix/share/qsmm/samples/gram/6.pcfg"
Using the Unified Template Grammar
$ abu-parser -i1 -n80000 --viterbi --det-niter-goal=50 --det-niter-keep=1 \
--randomize-det-iter --od=td-learn/6a_det.rg --oo=log/6a.log \
td-1-14-uni-no-quant.rg seq/6-80k-unit.seq
$ abu-parser -i1 --viterbi --ops=bu-learn/6a_out.pcfg --oo --simplify \
td-learn/6a_det.rg seq/6-80k-unit.seq
p_td 0.82044141, p_so 0.57534250, p_rd 0.98219605, cp 6
$ cat bu-learn/6a_out.pcfg
S: B E E E [0.33058223] // 4406
| B D D [0.17136855] // 2284
| B C C [0.16851741] // 2246
| "a" "b" "a" "c" [0.08883553] // 1184
| "b" "b" "c" "c" [0.08133253] // 1084
| "a" "b" "c" "c" [0.08013205] // 1068
| "b" "b" "a" "c" [0.07923169] // 1056
; // 13328
B: "b" "b" [0.51242167] // 4579
| "a" "b" [0.48757833] // 4357
; // 8936
C: "a" "d" [0.75] // 3369
| "d" "d" [0.25] // 1123
; // 4492
D: "d" "d" [0.7515324] // 3433
| "a" "d" [0.2484676] // 1135
; // 4568
E: "e" "e" [0.50174005] // 6632
| "a" "e" [0.49825995] // 6586
; // 13218
Using the Alternative Template Grammar
$ abu-parser -i1 -n80000 --viterbi --det-niter-goal=50 --det-niter-keep=1 \
--randomize-det-iter --od=td-learn/6b_det.rg --oo=log/6b.log \
td-uni-alt-1.rg seq/6-80k-unit.seq
$ abu-parser -i1 --viterbi --ops=bu-learn/6b_out.pcfg --oo --simplify \
td-learn/6b_det.rg seq/6-80k-unit.seq
p_td 0.78683710, p_so 0.64378074, p_rd 0.93880311, cp 19
$ cat bu-learn/6b_out.pcfg
S: W2_12 B2 [0.17534514] // 2337
| B3 B2 [0.16454082] // 2193
| B3 W2_4 [0.08883553] // 1184
| W4_5 B5 B5 [0.08583433] // 1144
| W4_5 B5 W2_1 [0.08238295] // 1098
| W2_12 W2_7 [0.08133253] // 1084
| B3 W2_7 [0.08013205] // 1068
| W2_12 W2_4 [0.07923169] // 1056
| B3 W2_1 B6 B6 [0.02183373] // 291
| B3 B6 B6 W2_1 [0.02108343] // 281
| B3 W2_1 W2_1 W2_1 [0.0210084] // 280
| B3 B6 W2_1 W2_1 [0.0202581] // 270
| B3 B6 B6 B6 [0.01995798] // 266
| B3 W2_1 B6 W2_1 [0.01995798] // 266
| B3 B6 W2_1 B6 [0.01988295] // 265
| B3 W2_1 W2_1 B6 [0.01838235] // 245
; // 13328
B2: W4_3 [0.50419426] // 2284
| W4_4 [0.49580574] // 2246
; // 4530
B3: W2_2 // 6609
;
B5: W2_6 // 3386
;
B6: W2_9 // 3253
;
W2_1: "a" "e" // 4337
;
W2_2: "a" "b" // 6609
;
W2_4: "a" "c" // 2240
;
W2_6: "e" "e" [0.66302422] // 2245
| "a" "e" [0.33697578] // 1141
; // 3386
W2_7: "c" "c" // 2152
;
W2_9: "e" "e" // 3253
;
W2_12: "b" "b" // 4477
;
W4_3: "d" "d" "d" "d" [0.5030648] // 1149
| "d" "d" "a" "d" [0.4969352] // 1135
; // 2284
W4_4: "a" "d" "a" "d" [0.5] // 1123
| "a" "d" "d" "d" [0.5] // 1123
; // 2246
W4_5: "b" "b" "e" "e" [0.50579839] // 1134
| "b" "b" "a" "e" [0.49420161] // 1108
; // 2242
To evaluate the correctness of the learned PCFG above, we can compare the probability of correctly predicting a terminal symbol in the parse units in the file seq/6-80k-unit.seq using the original PCFG in the file $prefix/share/qsmm/samples/gram/6.pcfg with the same probability obtained using the learned PCFG in the file bu-learn/6b_out.pcfg.
$ pcfg-predict-eval "$prefix/share/qsmm/samples/gram/6.pcfg" \
seq/6-80k-unit.seq seq/6-80k-unit.seq | grep prob_wpredict_max
"prob_wpredict_max" : 0.69436388,
$ pcfg-predict-eval bu-learn/6b_out.pcfg \
seq/6-80k-unit.seq seq/6-80k-unit.seq | grep prob_wpredict_max
"prob_wpredict_max" : 0.70663111,
The probability of correctly predicting a terminal symbol is even slightly lower for the original PCFG than for the learned one. The reason for this small difference might be the pseudo-randomness of the parse units in the file seq/6-80k-unit.seq.
$ cat "$prefix/share/qsmm/samples/gram/8.pcfg"
S: C B C B
| C C B B
| C B B C
;
B: "a" "b"
;
C: "a" "b" "c"
;
$ pcfg-generate-seq -i1 -n10000 --separate-parse-units \
-o seq/8-10k-unit.seq "$prefix/share/qsmm/samples/gram/8.pcfg"
Using the Unified Template Grammar
$ abu-parser -i1 -n320000 --viterbi --det-niter-goal=100 --det-niter-keep=1 \
--randomize-det-iter --od=td-learn/8a_det.rg --oo=log/8a.log \
td-1-14-uni-no-quant.rg seq/8-10k-unit.seq
$ abu-parser -i1 --viterbi --ops=bu-learn/8a_out.pcfg --oo --simplify \
td-learn/8a_det.rg seq/8-10k-unit.seq
p_td 0.89477660, p_so 0.77924711, p_rd 0.94080204, cp 4
$ cat bu-learn/8a_out.pcfg
S: D C B // 1000
;
B: "b" "a" "b" [0.352] // 352
| "b" "c" "a" "b" [0.337] // 337
| "b" "a" "b" "c" [0.311] // 311
; // 1000
C: "b" "a" [0.648] // 648
| "b" "c" "a" [0.352] // 352
; // 1000
D: "a" "b" "c" "a" // 1000
;
To evaluate the correctness of the learned PCFG above, we can compare the probability of correctly predicting a terminal symbol in the parse units in the file seq/8-10k-unit.seq using the original PCFG in the file $prefix/share/qsmm/samples/gram/8.pcfg with the same probability obtained using the learned PCFG in the file bu-learn/8a_out.pcfg.
$ pcfg-predict-eval "$prefix/share/qsmm/samples/gram/8.pcfg" \
seq/8-10k-unit.seq seq/8-10k-unit.seq | grep prob_wpredict_max
"prob_wpredict_max" : 0.93426667,
$ pcfg-predict-eval bu-learn/8a_out.pcfg \
seq/8-10k-unit.seq seq/8-10k-unit.seq | grep prob_wpredict_max
"prob_wpredict_max" : 0.91651161,
The probability of correctly predicting a terminal symbol is greater for the original PCFG than for the learned one. To express the correctness of the learned PCFG as a percentage, we can take as 0% correctness the probability of correctly predicting a terminal symbol in the parse units in the file seq/8-10k-unit.seq using a random PCFG, and as 100% correctness the probability obtained using the original PCFG in the file $prefix/share/qsmm/samples/gram/8.pcfg.
$ abu-parser -i-1 --viterbi --ops=bu-learn/8_rand.pcfg \
td-1-14-uni-no-quant.rg seq/8-10k-unit.seq
$ pcfg-predict-eval bu-learn/8_rand.pcfg \
seq/8-10k-unit.seq seq/8-10k-unit.seq | grep prob_wpredict_max
"prob_wpredict_max" : 0.64024028,
$ bc <<< 'scale=1; max=0.93426667; learned=0.91651161; rand=0.64024028;
(learned-rand)*100/(max-rand)'
93.9
Using the Alternative Template Grammar
$ abu-parser -i1 -n320000 --viterbi --det-niter-goal=50 --det-niter-keep=1 \
--randomize-det-iter --od=td-learn/8b_det.rg --oo=log/8b.log \
td-uni-alt-1.rg seq/8-10k-unit.seq
$ abu-parser -i1 --viterbi --ops=bu-learn/8b_out.pcfg --oo --simplify \
td-learn/8b_det.rg seq/8-10k-unit.seq
p_td 0.91735279, p_so 0.84526556, p_rd 0.96239338, cp 2
$ cat bu-learn/8b_out.pcfg
S: W2_1 W3_1 W2_1 W3_1 [0.648] // 648
| W2_1 W3_1 W3_1 W2_1 [0.352] // 352
; // 1000
W2_1: "a" "b" // 2000
;
W3_1: "c" "a" "b" [0.8445] // 1689
| "a" "b" "c" [0.1555] // 311
; // 2000
To evaluate the correctness of the learned PCFG above, we can compare the probability of correctly predicting a terminal symbol in the parse units in the file seq/8-10k-unit.seq using the original PCFG in the file $prefix/share/qsmm/samples/gram/8.pcfg with the same probability obtained using the learned PCFG in the file bu-learn/8b_out.pcfg.
$ pcfg-predict-eval "$prefix/share/qsmm/samples/gram/8.pcfg" \
seq/8-10k-unit.seq seq/8-10k-unit.seq | grep prob_wpredict_max
"prob_wpredict_max" : 0.93426667,
$ pcfg-predict-eval bu-learn/8b_out.pcfg \
seq/8-10k-unit.seq seq/8-10k-unit.seq | grep prob_wpredict_max
"prob_wpredict_max" : 0.94543205,
The probability of correctly predicting a terminal symbol is lower for the original PCFG than for the learned one. This may indicate that the original PCFG is less optimal than the learned PCFG.
$ cat "$prefix/share/qsmm/samples/gram/12.pcfg"
W: "h" "e" A [0.67]
| "d" [0.20]
| "i" "k" "j" [0.13]
;
A: "e" "h" "j" [0.67]
| "c" [0.33]
;
$ pcfg-generate-seq -i1 -n10000 --separate-parse-units \
-o seq/12-10k-unit.seq "$prefix/share/qsmm/samples/gram/12.pcfg"
Using the Alternative Template Grammar
$ abu-parser -i1 -n10000 --viterbi --det-niter-goal=50 --det-niter-keep=1 \
--randomize-det-iter --od=td-learn/12b_det.rg --oo=log/12b.log \
td-uni-alt-1.rg seq/12-10k-unit.seq
$ abu-parser -i1 --viterbi --ops=bu-learn/12b_out.pcfg --oo --simplify \
td-learn/12b_det.rg seq/12-10k-unit.seq
p_td 0.95077196, p_so 0.77508709, p_rd 1.00000000, cp 6
$ cat bu-learn/12b_out.pcfg
S: W2_1 B7 [0.44929577] // 1276
| W2_1 B2 [0.22676056] // 644
| W1_2 [0.18908451] // 537
| W3_3 [0.13485915] // 383
; // 2840
B2: W1_1 // 644
;
B7: W3_1 // 1276
;
W1_1: "c" // 644
;
W1_2: "d" // 537
;
W2_1: "h" "e" // 1920
;
W3_1: "e" "h" "j" // 1276
;
W3_3: "i" "k" "j" // 383
;