In versionMappings we have constructed edge features that map the nodes from one version of the data to the next.
In this notebook we use those edges to study what happened to the function feature of phrases.
We explore how the values of the function feature have changed and whether phrase boundaries have changed.
The function feature was called phrase_function in version 3.
In order to see whether phrase boundaries have changed, we follow the omap@ edges from phrases in one version to their counterparts in the next version.
We make use of the dissimilarity values that are attached to such edges.
If there is no value, or the value is 0, we have a match without a boundary change.
All other dissimilarities imply that boundaries have changed.
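The rule above can be illustrated with a small sketch. The helper `boundary_changed` and the toy edge list are hypothetical and are not part of the notebook; they only assume that each edge arrives as a (target node, dissimilarity) pair in which the dissimilarity may be missing.

```python
def boundary_changed(dissimilarity):
    """Return True when a mapping edge signals a phrase boundary change.

    A missing value (None) and the value 0 both count as a match
    without a boundary change; every other value counts as a change.
    """
    return bool(dissimilarity)


# Hypothetical edges: (target phrase node, dissimilarity) pairs.
toy_edges = [(1001, None), (1002, 0), (1003, 2), (1004, 7)]

changed = [node for (node, dis) in toy_edges if boundary_changed(dis)]
unchanged = [node for (node, dis) in toy_edges if not boundary_changed(dis)]
```

Here `changed` collects the targets whose boundaries moved and `unchanged` collects the exact matches.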
For the sake of presentation, we start with the result cells; they should be run after the other cells. The computation starts at "Start the program here" below.
function values
for (v, w) in reversed(phraseMapping):  # noqa F821
    caption(1, "Phrase function change from version {} to {}".format(v, w))  # noqa F821
    featureDiff(v, w, "FUNCTION")  # noqa F821
############################################################################################## # # # 1m 13s Phrase function change from version 3 to 2017 # # # ##############################################################################################
3\2017 | Adju | Cmpl | Conj | EPPr | ExsS | Exst | Frnt | IntS | Intj | Loca | ModS | Modi | NCoS | NCop | Nega | Objc | PrAd | PrcS | PreC | PreO | PreS | Pred | PtcO | Ques | Rela | Subj | Supp | Time | Voct |
Adju | 5438 | 142 | 15 | 18 | 17 | 169 | 2 | 171 | 84 | 111 | 4 | 9 | 2 | 177 | 3 | 20 | 1 | ||||||||||||
Cmpl | 186 | 22005 | 9 | 11 | 25 | 11 | 168 | 27 | 110 | 5 | 1 | 3 | 1 | 160 | 3 | 8 | 3 | ||||||||||||
Conj | 101 | 51 | 33064 | 10 | 17 | 151 | 105 | 3 | 10 | 2 | 3 | 19 | 73 | 29 | 10 | 1 | |||||||||||||
ExsS | 7 | ||||||||||||||||||||||||||||
Exst | 2 | 2 | 90 | 1 | 3 | 7 | |||||||||||||||||||||||
Frnt | 8 | 7 | 755 | 1 | 9 | 12 | 1 | 45 | 5 | ||||||||||||||||||||
IntS | 161 | 1 | |||||||||||||||||||||||||||
Intj | 1 | 1 | 17 | 1199 | 6 | 5 | 9 | 3 | 5 | 5 | 12 | 1 | |||||||||||||||||
IrpC | 1 | 18 | 3 | ||||||||||||||||||||||||||
IrpO | 1 | 3 | 120 | 5 | 4 | 2 | |||||||||||||||||||||||
IrpP | 4 | 1 | 142 | 4 | 3 | ||||||||||||||||||||||||
IrpS | 1 | 1 | 2 | 3 | 3 | 2 | 184 | ||||||||||||||||||||||
Loca | 18 | 80 | 2 | 1992 | 49 | 17 | 7 | 1 | 18 | ||||||||||||||||||||
ModS | 25 | 2 | 2 | ||||||||||||||||||||||||||
Modi | 61 | 80 | 11 | 28 | 24 | 21 | 2271 | 1 | 40 | 141 | 18 | 36 | 1 | 21 | 254 | 210 | |||||||||||||
NegS | 50 | ||||||||||||||||||||||||||||
Nega | 21 | 7 | 13 | 1 | 24 | 415 | 4234 | 5 | 7 | 3 | 1 | 39 | 2 | 1 | 2 | ||||||||||||||
Objc | 32 | 81 | 2 | 18 | 7 | 15 | 15351 | 32 | 40 | 3 | 182 | 4 | 8 | ||||||||||||||||
PreC | 80 | 80 | 5 | 1 | 3 | 26 | 2 | 19 | 7 | 1 | 1 | 63 | 25 | 13567 | 834 | 1 | 4 | 167 | 6 | 14 | |||||||||
PreO | 9 | 1 | 1 | 1 | 6 | 2 | 5348 | 234 | 13 | 66 | 1 | 9 | |||||||||||||||||
PreS | 1 | 5 | 619 | 2 | 12 | ||||||||||||||||||||||||
Pred | 2 | 3 | 3 | 1 | 1 | 4 | 206 | 42 | 7 | 56155 | 8 | 19 | |||||||||||||||||
PtSp | 1 | 1 | |||||||||||||||||||||||||||
PtcO | 1 | 5 | 87 | 14 | 1 | ||||||||||||||||||||||||
Ques | 11 | 9 | 3 | 2 | 1 | 2 | 21 | 9 | 797 | 19 | 25 | ||||||||||||||||||
Rela | 1 | 3 | 591 | 3 | 2 | 1 | 11 | 1 | 1 | 4829 | 4 | ||||||||||||||||||
Subj | 25 | 16 | 2 | 6 | 55 | 1 | 9 | 10 | 1 | 1 | 123 | 8 | 2 | 175 | 3 | 2 | 1 | 1 | 15 | 1 | 21238 | 9 | 36 | ||||||
Supp | 65 | 110 | 1 | 5 | 7 | 3 | 140 | ||||||||||||||||||||||
Time | 15 | 5 | 9 | 9 | 1 | 1 | 47 | 22 | 8 | 1 | 2 | 1 | 36 | 2745 | 1 | ||||||||||||||
Unkn | 3992 | 7979 | 11890 | 18 | 5 | 49 | 360 | 73 | 401 | 620 | 8 | 1029 | 22 | 166 | 1761 | 7260 | 55 | 5 | 5584 | 14 | 13 | 42 | 4 | 363 | 1356 | 10507 | 32 | 1219 | 775 |
Voct | 1 | 2 | 6 | 3 | 24 | 821 |
############################################################################################## # # # 1m 15s Phrase function change from version 2016 to 2017 # # # ##############################################################################################
2016\2017 | Adju | Cmpl | Conj | EPPr | ExsS | Exst | Frnt | IntS | Intj | Loca | ModS | Modi | NCoS | NCop | Nega | Objc | PrAd | PrcS | PreC | PreO | PreS | Pred | PtcO | Ques | Rela | Subj | Supp | Time | Voct |
Adju | 9508 | 12 | 2 | 5 | 2 | 5 | 4 | 2 | 6 | ||||||||||||||||||||
Cmpl | 16 | 30002 | 4 | 13 | 1 | 1 | |||||||||||||||||||||||
Conj | 1 | 46135 | 3 | 3 | 1 | ||||||||||||||||||||||||
EPPr | 21 | ||||||||||||||||||||||||||||
ExsS | 14 | ||||||||||||||||||||||||||||
Exst | 143 | ||||||||||||||||||||||||||||
Frnt | 1 | 1119 | 1 | 1 | 9 | ||||||||||||||||||||||||
IntS | 251 | ||||||||||||||||||||||||||||
Intj | 1621 | ||||||||||||||||||||||||||||
Loca | 2 | 2621 | |||||||||||||||||||||||||||
ModS | 35 | ||||||||||||||||||||||||||||
Modi | 3738 | 32 | 216 | ||||||||||||||||||||||||||
NCoS | 101 | ||||||||||||||||||||||||||||
NCop | 595 | ||||||||||||||||||||||||||||
Nega | 6047 | ||||||||||||||||||||||||||||
Objc | 2 | 6 | 5 | 2 | 22627 | 1 | 19 | 7 | |||||||||||||||||||||
PrAd | 242 | ||||||||||||||||||||||||||||
PrcS | 8 | ||||||||||||||||||||||||||||
PreC | 6 | 4 | 8 | 1 | 2 | 1 | 19333 | 1 | 12 | ||||||||||||||||||||
PreO | 5402 | 1 | |||||||||||||||||||||||||||
PreS | 886 | ||||||||||||||||||||||||||||
Pred | 57069 | ||||||||||||||||||||||||||||
PtcO | 162 | ||||||||||||||||||||||||||||
Ques | 1 | 1203 | |||||||||||||||||||||||||||
Rela | 1 | 1 | 6327 | ||||||||||||||||||||||||||
Subj | 1 | 1 | 3 | 5 | 19 | 3 | 13 | 1 | 31907 | 1 | 1 | ||||||||||||||||||
Supp | 178 | ||||||||||||||||||||||||||||
Time | 6 | 1 | 3850 | ||||||||||||||||||||||||||
Voct | 1 | 2 | 1605 |
############################################################################################## # # # 1m 16s Phrase function change from version 4b to 2016 # # # ##############################################################################################
4b\2016 | Adju | Cmpl | Conj | EPPr | ExsS | Exst | Frnt | IntS | Intj | Loca | ModS | Modi | NCoS | NCop | Nega | Objc | PrAd | PrcS | PreC | PreO | PreS | Pred | PtcO | Ques | Rela | Subj | Supp | Time | Voct |
Adju | 9477 | 31 | 1 | 5 | 1 | 1 | 1 | 11 | 1 | 6 | 1 | 8 | 2 | 1 | |||||||||||||||
Cmpl | 39 | 29921 | 1 | 6 | 8 | 1 | 2 | 41 | 24 | 11 | 7 | ||||||||||||||||||
Conj | 1 | 46124 | 1 | 2 | 1 | ||||||||||||||||||||||||
EPPr | 9 | ||||||||||||||||||||||||||||
ExsS | 14 | ||||||||||||||||||||||||||||
Exst | 143 | ||||||||||||||||||||||||||||
Frnt | 1087 | 25 | |||||||||||||||||||||||||||
IntS | 251 | ||||||||||||||||||||||||||||
Intj | 1621 | ||||||||||||||||||||||||||||
Loca | 3 | 53 | 2613 | 1 | 4 | 2 | 4 | ||||||||||||||||||||||
ModS | 35 | ||||||||||||||||||||||||||||
Modi | 1 | 3980 | 1 | 3 | 1 | ||||||||||||||||||||||||
NCoS | 101 | ||||||||||||||||||||||||||||
NCop | 594 | 1 | |||||||||||||||||||||||||||
Nega | 1 | 1 | 1 | 6040 | |||||||||||||||||||||||||
Objc | 7 | 24 | 10 | 4 | 3 | 2 | 22596 | 2 | 14 | 1 | 60 | ||||||||||||||||||
PrAd | 2 | 235 | 1 | ||||||||||||||||||||||||||
PrcS | 8 | ||||||||||||||||||||||||||||
PreC | 15 | 5 | 1 | 1 | 1 | 1 | 1 | 10 | 3 | 19327 | 1 | 1 | 25 | ||||||||||||||||
PreO | 1 | 5404 | 30 | ||||||||||||||||||||||||||
PreS | 1 | 855 | 10 | ||||||||||||||||||||||||||
Pred | 1 | 1 | 57068 | 4 | |||||||||||||||||||||||||
PtcO | 162 | ||||||||||||||||||||||||||||
Ques | 1204 | ||||||||||||||||||||||||||||
Rela | 3 | 6325 | 1 | ||||||||||||||||||||||||||
Subj | 3 | 1 | 1 | 11 | 17 | 1 | 1 | 14 | 1 | 1 | 14 | 1 | 31811 | 1 | |||||||||||||||
Supp | 2 | 9 | 176 | ||||||||||||||||||||||||||
Time | 1 | 1 | 9 | 3835 | |||||||||||||||||||||||||
Voct | 2 | 2 | 1607 |
############################################################################################## # # # 1m 17s Phrase function change from version 4 to 4b # # # ##############################################################################################
4\4b | Adju | Cmpl | Conj | EPPr | ExsS | Exst | Frnt | IntS | Intj | Loca | ModS | Modi | NCoS | NCop | Nega | Objc | PrAd | PrcS | PreC | PreO | PreS | Pred | PtcO | Ques | Rela | Subj | Supp | Time | Voct |
Adju | 8061 | 94 | 13 | 7 | 10 | 206 | 1 | 155 | 82 | 65 | 5 | 1 | 8 | 1 | 186 | 3 | 17 | ||||||||||||
Cmpl | 77 | 27606 | 9 | 2 | 10 | 8 | 65 | 7 | 105 | 3 | 1 | 1 | 5 | 86 | 2 | 3 | |||||||||||||
Conj | 44 | 39 | 45936 | 17 | 10 | 6 | 110 | 1 | 19 | 1 | 3 | 74 | 42 | 7 | 1 | ||||||||||||||
EPPr | 4 | ||||||||||||||||||||||||||||
ExsS | 14 | ||||||||||||||||||||||||||||
Exst | 143 | 1 | |||||||||||||||||||||||||||
Frnt | 1 | 5 | 1007 | 1 | 2 | 5 | 5 | 3 | |||||||||||||||||||||
IntS | 250 | ||||||||||||||||||||||||||||
Intj | 1624 | 1 | 3 | ||||||||||||||||||||||||||
Loca | 7 | 18 | 2433 | 43 | 5 | 4 | 2 | ||||||||||||||||||||||
ModS | 35 | 1 | |||||||||||||||||||||||||||
Modi | 39 | 19 | 6 | 1 | 4 | 3526 | 11 | 24 | 13 | 2 | 34 | 19 | 15 | ||||||||||||||||
NCoS | 101 | ||||||||||||||||||||||||||||
NCop | 13 | 2 | 2 | 587 | 2 | 3 | |||||||||||||||||||||||
Nega | 6 | 1 | 4 | 1 | 4 | 6039 | 1 | 2 | 2 | 4 | 1 | ||||||||||||||||||
Objc | 15 | 35 | 20 | 1 | 5 | 13 | 20672 | 22 | 26 | 1 | 4 | 2 | 60 | 3 | |||||||||||||||
PrAd | 1 | 1 | 79 | 1 | 2 | ||||||||||||||||||||||||
PrcS | 8 | ||||||||||||||||||||||||||||
PreC | 36 | 41 | 7 | 11 | 5 | 4 | 35 | 12 | 17550 | 19 | 1 | 4 | 77 | 2 | 8 | ||||||||||||||
PreO | 1 | 4 | 1 | 5 | 5434 | 76 | 1 | 1 | 1 | 2 | |||||||||||||||||||
PreS | 1 | 777 | 1 | ||||||||||||||||||||||||||
Pred | 1 | 1 | 1 | 1 | 4 | 4 | 57042 | 9 | 13 | ||||||||||||||||||||
PtcO | 1 | 161 | 4 | ||||||||||||||||||||||||||
Ques | 15 | 9 | 4 | 24 | 18 | 1156 | 18 | 31 | |||||||||||||||||||||
Rela | 3 | 1 | 62 | 3 | 1 | 7 | 13 | 6239 | 12 | 1 | |||||||||||||||||||
Subj | 15 | 12 | 3 | 5 | 18 | 3 | 9 | 1 | 50 | 5 | 79 | 1 | 4 | 18 | 28763 | 4 | 16 | ||||||||||||
Supp | 60 | 49 | 2 | 2 | 3 | 180 | |||||||||||||||||||||||
Time | 2 | 1 | 8 | 1 | 1 | 40 | 2 | 5 | 1 | 1 | 6 | 3489 | |||||||||||||||||
Unkn | 1391 | 2357 | 65 | 113 | 235 | 167 | 2 | 1 | 2043 | 13 | 1840 | 1 | 2 | 3 | 1 | 10 | 2957 | 4 | 337 | 79 | |||||||||
Voct | 2 | 2 | 1 | 17 | 1504 |
############################################################################################## # # # 1m 18s Phrase function change from version 3 to 4 # # # ##############################################################################################
3\4 | Adju | Cmpl | Conj | EPPr | ExsS | Exst | Frnt | IntS | Intj | Loca | ModS | Modi | NCoS | NCop | Nega | Objc | PrAd | PrcS | PreC | PreO | PreS | Pred | PtcO | Ques | Rela | Subj | Supp | Time | Unkn | Voct |
Adju | 6067 | 74 | 15 | 6 | 10 | 31 | 43 | 19 | 65 | 1 | 43 | 15 | 2 | |||||||||||||||||
Cmpl | 90 | 22418 | 12 | 5 | 14 | 6 | 1 | 79 | 26 | 21 | 3 | 1 | 1 | 71 | 3 | 6 | 2 | |||||||||||||
Conj | 87 | 27 | 33540 | 6 | 8 | 154 | 36 | 2 | 5 | 2 | 3 | 18 | 39 | 22 | 7 | 1 | ||||||||||||||
ExsS | 7 | |||||||||||||||||||||||||||||
Exst | 2 | 90 | 1 | 3 | 9 | |||||||||||||||||||||||||
Frnt | 8 | 2 | 1 | 785 | 8 | 12 | 22 | 5 | ||||||||||||||||||||||
IntS | 161 | 1 | ||||||||||||||||||||||||||||
Intj | 1 | 2 | 17 | 1199 | 16 | 5 | 9 | 3 | 5 | 5 | 1 | 1 | ||||||||||||||||||
IrpC | 1 | 18 | 3 | |||||||||||||||||||||||||||
IrpO | 1 | 1 | 130 | 1 | 2 | |||||||||||||||||||||||||
IrpP | 2 | 4 | 1 | 139 | 7 | 3 | ||||||||||||||||||||||||
IrpS | 1 | 3 | 2 | 2 | 188 | |||||||||||||||||||||||||
Loca | 12 | 17 | 2 | 2119 | 6 | 11 | 3 | 1 | 14 | |||||||||||||||||||||
ModS | 26 | 2 | 1 | |||||||||||||||||||||||||||
Modi | 35 | 62 | 7 | 28 | 24 | 18 | 2567 | 1 | 40 | 135 | 15 | 2 | 241 | 46 | ||||||||||||||||
NegS | 50 | |||||||||||||||||||||||||||||
Nega | 10 | 6 | 10 | 1 | 24 | 433 | 4242 | 5 | 3 | 3 | 1 | 36 | 1 | 1 | ||||||||||||||||
Objc | 22 | 44 | 14 | 8 | 2 | 5 | 15588 | 10 | 15 | 2 | 1 | 93 | 2 | 7 | ||||||||||||||||
PreC | 48 | 50 | 10 | 3 | 18 | 2 | 14 | 7 | 1 | 1 | 49 | 11 | 13797 | 817 | 2 | 1 | 108 | 7 | 13 | |||||||||||
PreO | 9 | 1 | 1 | 1 | 2 | 5463 | 141 | 12 | 67 | 6 | ||||||||||||||||||||
PreS | 7 | 630 | 2 | 2 | ||||||||||||||||||||||||||
Pred | 1 | 4 | 1 | 1 | 1 | 1 | 2 | 202 | 42 | 7 | 56189 | 9 | 4 | |||||||||||||||||
PtSp | 1 | 1 | ||||||||||||||||||||||||||||
PtcO | 1 | 4 | 1 | 90 | 11 | 1 | ||||||||||||||||||||||||
Ques | 2 | 2 | 1 | 2 | 1 | 2 | 1 | 885 | 2 | 1 | ||||||||||||||||||||
Rela | 1 | 3 | 569 | 2 | 1 | 8 | 1 | 4903 | ||||||||||||||||||||||
Subj | 19 | 10 | 4 | 40 | 1 | 7 | 8 | 1 | 1 | 89 | 3 | 1 | 143 | 2 | 3 | 2 | 2 | 1 | 21407 | 5 | 28 | |||||||||
Supp | 15 | 44 | 1 | 3 | 4 | 264 | ||||||||||||||||||||||||
Time | 15 | 5 | 2 | 1 | 1 | 1 | 19 | 22 | 6 | 2 | 38 | 2807 | 1 | |||||||||||||||||
Unkn | 2726 | 5589 | 12119 | 4 | 5 | 50 | 214 | 72 | 402 | 385 | 8 | 910 | 22 | 164 | 1772 | 5243 | 22 | 5 | 3756 | 10 | 8 | 36 | 7 | 369 | 1401 | 7609 | 29 | 803 | 11416 | 692 |
Voct | 1 | 1 | 8 | 2 | 2 | 1 | 1 | 8 | 835 |
for (v, w) in reversed(phraseMapping):  # noqa F821
    caption(1, "Phrase boundary change from version {} to {}".format(v, w))  # noqa F821
    showStats(v, w)  # noqa F821
############################################################################################## # # # 1m 30s Phrase boundary change from version 3 to 2017 # # # ##############################################################################################
dissimilarity | number of phrases |
0 | 251551 |
1 | 29 |
2 | 26 |
3 | 22 |
4 | 13 |
5 | 10 |
6 | 5 |
7 | 6 |
8 | 1 |
9 | 13 |
10 | 3 |
11 | 4 |
12 | 4 |
13 | 1 |
14 | 3 |
15 | 1 |
16 | 1 |
17 | |
18 | |
19 | 2 |
20 | 1 |
21 | |
22 | |
23 | 1 |
24 | |
25 | |
26 | 1 |
27 | |
28 | 1 |
############################################################################################## # # # 1m 30s Phrase boundary change from version 2016 to 2017 # # # ##############################################################################################
dissimilarity | number of phrases |
0 | 253073 |
1 | 29 |
2 | 26 |
3 | 22 |
4 | 13 |
5 | 10 |
6 | 5 |
7 | 6 |
8 | 1 |
9 | 13 |
10 | 3 |
11 | 4 |
12 | 4 |
13 | 1 |
14 | 3 |
15 | 1 |
16 | 1 |
17 | |
18 | |
19 | 2 |
20 | 1 |
21 | |
22 | |
23 | 1 |
24 | |
25 | |
26 | 1 |
27 | |
28 | 1 |
############################################################################################## # # # 1m 30s Phrase boundary change from version 4b to 2016 # # # ##############################################################################################
dissimilarity | number of phrases |
0 | 252881 |
1 | 128 |
2 | 82 |
3 | 65 |
4 | 26 |
5 | 16 |
6 | 11 |
7 | 14 |
8 | 11 |
9 | 5 |
10 | 3 |
11 | 2 |
12 | 1 |
13 | 1 |
14 | |
15 | |
16 | 1 |
17 | |
18 | 1 |
19 | |
20 | 1 |
############################################################################################## # # # 1m 30s Phrase boundary change from version 4 to 4b # # # ##############################################################################################
dissimilarity | number of phrases |
0 | 250751 |
1 | 750 |
2 | 745 |
3 | 618 |
4 | 372 |
5 | 305 |
6 | 188 |
7 | 141 |
8 | 123 |
9 | 77 |
10 | 67 |
11 | 64 |
12 | 43 |
13 | 41 |
14 | 27 |
15 | 22 |
16 | 15 |
17 | 17 |
18 | 20 |
19 | 15 |
20 | 11 |
21 | 9 |
22 | 3 |
23 | 4 |
24 | 5 |
25 | 2 |
26 | |
27 | 5 |
28 | 2 |
29 | 5 |
30 | 2 |
31 | 3 |
32 | 1 |
33 | 2 |
34 | |
35 | 1 |
36 | 1 |
37 | 1 |
############################################################################################## # # # 1m 30s Phrase boundary change from version 3 to 4 # # # ##############################################################################################
dissimilarity | number of phrases |
0 | 250346 |
1 | 2837 |
2 | 1164 |
3 | 788 |
4 | 457 |
5 | 287 |
6 | 166 |
7 | 127 |
8 | 86 |
9 | 61 |
10 | 66 |
11 | 33 |
12 | 39 |
13 | 19 |
14 | 22 |
15 | 16 |
16 | 16 |
17 | 10 |
18 | 10 |
19 | 7 |
20 | 4 |
21 | 2 |
22 | |
23 | 3 |
24 | 1 |
25 | |
26 | |
27 | |
28 | |
29 | |
30 | 1 |
Start the program here.
import os # noqa 402
import collections # noqa 402
from functools import reduce # noqa 402
from utils import caption # noqa 402
from tf.fabric import Fabric # noqa 402
from IPython.display import HTML, display # noqa 402
We specify our versions and the subtle differences between them as far as they are relevant.
REPO = os.path.expanduser("~/github/etcbc/bhsa")
baseDir = "{}/tf".format(REPO)
tempDir = "{}/_temp".format(REPO)
versions = """
3
4
4b
2016
2017
""".strip().split()
# The "" entry holds the default feature names; the "3" entry overrides them.
versionInfoSpec = {
    "": dict(
        OCC="g_word",
        LEX="lex",
        FUNCTION="function",
    ),
    "3": dict(
        OCC="text_plain",
        LEX="lexeme",
        FUNCTION="phrase_function",
    ),
}
versionInfo = {}
defaults = versionInfoSpec[""].items()
for (i, v) in enumerate(versions):
    # The first version has no predecessor, hence no mapping feature.
    versionInfo.setdefault(v, {})["OMAP"] = (
        "" if i == 0 else "omap@{}-{}".format(versions[i - 1], v)
    )
    versionInfo[v].update(versionInfoSpec.get("", {}))
    versionInfo[v].update(versionInfoSpec.get(v, {}))
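To make the effect of this set-up concrete, here is a minimal stand-alone sketch of the same defaults-plus-overrides idea. The function `resolve` and the variable names are hypothetical; the feature names are the ones given above, and only three versions are used to keep the example short.

```python
def resolve(version_names, defaults, overrides):
    """Build per-version settings: defaults first, then version-specific overrides."""
    info = {}
    for i, v in enumerate(version_names):
        settings = dict(defaults)
        settings.update(overrides.get(v, {}))
        # The first version has no predecessor, so it gets an empty mapping name.
        settings["OMAP"] = (
            "" if i == 0 else "omap@{}-{}".format(version_names[i - 1], v)
        )
        info[v] = settings
    return info


toy_defaults = {"OCC": "g_word", "LEX": "lex", "FUNCTION": "function"}
toy_overrides = {
    "3": {"OCC": "text_plain", "LEX": "lexeme", "FUNCTION": "phrase_function"}
}
resolved = resolve(["3", "4", "4b"], toy_defaults, toy_overrides)
```

With these inputs, `resolved["3"]["FUNCTION"]` is `"phrase_function"`, while `resolved["4"]["FUNCTION"]` falls back to `"function"` and `resolved["4"]["OMAP"]` is `"omap@3-4"`.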
Load all versions in one go, with the version mapping feature if present.
TF = {}
api = {}
for (i, v) in enumerate(versions):
    # Expose this version's feature names as the globals OCC, LEX, FUNCTION, OMAP.
    for (param, value) in versionInfo[v].items():
        globals()[param] = value
    caption(4, "Version -> {} <- loading ...".format(v))
    TF[v] = Fabric(locations="{}/{}".format(baseDir, v), modules=[""])
    api[v] = TF[v].load(" ".join((OCC, LEX, FUNCTION, OMAP)))  # noqa F821
..............................................................................................
. 0.00s Version -> 3 <- loading ... .
..............................................................................................
This is Text-Fabric 3.0.9
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data : https://github.com/Dans-labs/text-fabric-data
118 features found and 0 ignored
0.00s loading features ...
| 0.12s B lexeme from /Users/dirk/github/etcbc/bhsa/tf/3
| 0.22s B text_plain from /Users/dirk/github/etcbc/bhsa/tf/3
| 0.08s B phrase_function from /Users/dirk/github/etcbc/bhsa/tf/3
| 0.00s Feature overview: 115 for nodes; 2 for edges; 1 configs; 7 computed
4.99s All features loaded/computed - for details use loadLog()
..............................................................................................
. 5.00s Version -> 4 <- loading ... .
..............................................................................................
This is Text-Fabric 3.0.9
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data : https://github.com/Dans-labs/text-fabric-data
104 features found and 0 ignored
0.00s loading features ...
| 0.14s B g_word from /Users/dirk/github/etcbc/bhsa/tf/4
| 0.12s B lex from /Users/dirk/github/etcbc/bhsa/tf/4
| 0.07s B function from /Users/dirk/github/etcbc/bhsa/tf/4
| 6.25s T omap@3-4 from /Users/dirk/github/etcbc/bhsa/tf/4
| 0.00s Feature overview: 98 for nodes; 5 for edges; 1 configs; 7 computed
12s All features loaded/computed - for details use loadLog()
..............................................................................................
. 17s Version -> 4b <- loading ... .
..............................................................................................
This is Text-Fabric 3.0.9
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data : https://github.com/Dans-labs/text-fabric-data
103 features found and 0 ignored
0.00s loading features ...
| 0.16s B g_word from /Users/dirk/github/etcbc/bhsa/tf/4b
| 0.14s B lex from /Users/dirk/github/etcbc/bhsa/tf/4b
| 0.07s B function from /Users/dirk/github/etcbc/bhsa/tf/4b
| 6.33s T omap@4-4b from /Users/dirk/github/etcbc/bhsa/tf/4b
| 0.00s Feature overview: 97 for nodes; 5 for edges; 1 configs; 7 computed
12s All features loaded/computed - for details use loadLog()
..............................................................................................
. 29s Version -> 2016 <- loading ... .
..............................................................................................
This is Text-Fabric 3.0.9
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data : https://github.com/Dans-labs/text-fabric-data
108 features found and 0 ignored
0.00s loading features ...
| 0.15s B g_word from /Users/dirk/github/etcbc/bhsa/tf/2016
| 0.12s B lex from /Users/dirk/github/etcbc/bhsa/tf/2016
| 0.08s B function from /Users/dirk/github/etcbc/bhsa/tf/2016
| 6.56s T omap@4b-2016 from /Users/dirk/github/etcbc/bhsa/tf/2016
| 0.00s Feature overview: 102 for nodes; 5 for edges; 1 configs; 7 computed
12s All features loaded/computed - for details use loadLog()
..............................................................................................
. 41s Version -> 2017 <- loading ... .
..............................................................................................
This is Text-Fabric 3.0.9
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data : https://github.com/Dans-labs/text-fabric-data
114 features found and 0 ignored
0.00s loading features ...
| 0.48s B g_word from /Users/dirk/github/etcbc/bhsa/tf/2017
| 0.16s B lex from /Users/dirk/github/etcbc/bhsa/tf/2017
| 0.10s B function from /Users/dirk/github/etcbc/bhsa/tf/2017
| 6.50s T omap@2016-2017 from /Users/dirk/github/etcbc/bhsa/tf/2017
| 0.00s Feature overview: 108 for nodes; 5 for edges; 1 configs; 7 computed
13s All features loaded/computed - for details use loadLog()
def tableText(table):
    # Render a list of rows as an HTML table in the notebook output.
    return display(
        HTML(
            "<table><tr>{}</tr></table>".format(
                "</tr><tr>".join(
                    "<td>{}</td>".format("</td><td>".join(str(_) for _ in row))
                    for row in table
                )
            )
        )
    )
Here is a function that gets the counterparts of phrases between versions.
phraseMapping is keyed by a (source version, target version) pair and then by node in the source version.
The value is what the omap@ edge feature yields for that node: the counterparts in the target version, each paired with its dissimilarity.
Source nodes for which the lookup yields None are left out. The classification by dissimilarity happens later, in showStats.
phraseMapping = collections.OrderedDict()


def getPhrases(v, w):
    V = api[v]
    W = api[w]
    mapVW = "omap@{}-{}".format(v, w)
    vKey = (v, w)
    phraseMapping[vKey] = {}
    phrases = phraseMapping[vKey]
    for n in V.F.otype.s("phrase"):
        # ms holds the (target node, dissimilarity) pairs for source phrase n
        ms = W.Es(mapVW).f(n)
        if ms is not None:
            phrases[n] = ms
We also want to see the evolution in one big leap, so we construct a mapping from the first version to the last, just by composing the individual omap@ mappings into a single stride.
Picking a phrase and following it through the versions might lead to multiple counterparts. When that happens, we choose the one with the highest similarity (that is, the lowest dissimilarity) and ignore the rest.
def composeMap(curMap, newStep):
    resultMap = {}
    for (n, ms) in curMap.items():
        # Pick the single counterpart, or else the one with the lowest dissimilarity.
        theM = (
            ms[0][0] if len(ms) == 1 else sorted(ms, key=lambda x: (x[1], x[0]))[0][0]
        )
        # The result keeps the pairs of newStep, so the dissimilarities
        # in a composed map are those of the last step only.
        resultMap[n] = newStep[theM]
    return resultMap


def getFirstLastMapping():
    if len(versions) <= 2:
        return {}
    curMap = phraseMapping[(versions[0], versions[1])]
    for i in range(2, len(versions)):
        caption(0, "mapping from {} to {}".format(versions[0], versions[i]))
        curMap = composeMap(curMap, phraseMapping[(versions[i - 1], versions[i])])
    phraseMapping[(versions[0], versions[-1])] = curMap
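The composition step can be tried on toy data without loading any corpus. The sketch below is not the notebook's code: `best_target` and `compose` are hypothetical names, the node numbers are invented, and two choices are assumptions made for the sake of a self-contained example: a missing dissimilarity is treated as 0, and phrases whose chosen counterpart has no entry in the next step are skipped.

```python
def best_target(targets):
    """Pick the target with the lowest dissimilarity; ties go to the lowest node."""
    return min(targets, key=lambda pair: (pair[1] or 0, pair[0]))[0]


def compose(first, second):
    """Follow each source node through `first`, then look up the result in `second`."""
    composed = {}
    for source, targets in first.items():
        middle = best_target(targets)
        if middle in second:  # assumption: skip phrases that cannot be followed further
            composed[source] = second[middle]
    return composed


# Toy mappings: source node -> tuple of (target node, dissimilarity) pairs.
step_a = {1: ((10, 0),), 2: ((20, 3), (21, 1))}
step_b = {10: ((100, 0),), 21: ((210, 2),)}
leap = compose(step_a, step_b)
```

Node 2 has two counterparts in `step_a`; the one with dissimilarity 1 (node 21) wins, so `leap[2]` is taken from `step_b[21]`. As in the notebook's composition, the dissimilarities in `leap` are those of the last step.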
def showStats(v, w):
    vKey = (v, w)
    phrases = phraseMapping[vKey]
    # Collect the target phrases per dissimilarity; a missing value counts as 0.
    dists = {}
    for (n, ms) in phrases.items():
        for (m, dis) in ms:
            dists.setdefault(dis or 0, set()).add(m)
    stats = collections.Counter()
    for (dis, ms) in dists.items():
        stats[dis] = len(ms)
    table = []
    table.append(["dissimilarity", "number of phrases"])
    for dis in range(0, max(stats) + 1):
        table.append([dis, stats.get(dis, "")])
    tableText(table)
We visualize the changes in the values of the function feature by generating a matrix, with old values in the row headers and new values in the column headers.
Each cell holds the number of times that the old value of its row has changed into the new value of its column.
def featureDiff(v, w, feat):
    V = api[v]
    W = api[w]
    vKey = (v, w)
    vFeat = versionInfo[v][feat]
    wFeat = versionInfo[w][feat]
    phrases = phraseMapping[vKey]
    # combis[old value][new value] = number of mapped phrase pairs
    combis = {}
    for (n, ms) in phrases.items():
        vVal = V.Fs(vFeat).v(n)
        for (m, dis) in ms:
            wVal = W.Fs(wFeat).v(m)
            combis.setdefault(vVal, collections.Counter())[wVal] += 1
    vValues = sorted(combis.keys())
    wValues = sorted(
        reduce(set.union, [set(combis[vVal]) for vVal in vValues], set())
    )
    table = []
    table.append(["{}\\{}".format(v, w)] + wValues)
    for vVal in vValues:
        table.append([vVal] + [str(combis[vVal].get(wVal, "")) for wVal in wValues])
    tableText(table)
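The matrix construction can be checked on a handful of invented value pairs. This is a minimal sketch, independent of Text-Fabric: `cross_tabulate` is a hypothetical helper, and the pairs are made up for illustration, not taken from the data.

```python
import collections


def cross_tabulate(pairs):
    """Count (old value, new value) pairs and lay them out as a matrix of rows."""
    counts = collections.defaultdict(collections.Counter)
    for old, new in pairs:
        counts[old][new] += 1
    rows = sorted(counts)
    cols = sorted({new for counter in counts.values() for new in counter})
    table = [["old\\new"] + cols]
    for old in rows:
        # An empty string marks a combination that never occurs.
        table.append([old] + [counts[old].get(new, "") for new in cols])
    return table


# Invented (old function, new function) pairs of mapped phrases.
toy_pairs = [("Subj", "Subj"), ("Subj", "Objc"), ("Unkn", "Subj"), ("Subj", "Subj")]
matrix = cross_tabulate(toy_pairs)
```

The first row of `matrix` is the header; each later row starts with an old value and continues with the counts per new value. Values on the diagonal correspond to phrases whose function stayed the same.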
We collect all data in a big data structure.
caption(4, "Collecting data")
for (i, w) in enumerate(versions):
    if i == 0:
        continue
    v = versions[i - 1]
    caption(0, "\t{:<4} => {:<4}".format(v, w))
    getPhrases(v, w)
caption(0, "\t{:<4} => {:<4}".format(versions[0], versions[-1]))
getFirstLastMapping()
caption(0, "Done")
..............................................................................................
. 55s Collecting data .
..............................................................................................
| 55s 3 => 4
| 57s 4 => 4b
| 58s 4b => 2016
| 1m 00s 2016 => 2017
| 1m 02s 3 => 2017
| 1m 02s mapping from 3 to 4b
| 1m 02s mapping from 3 to 2016
| 1m 02s mapping from 3 to 2017
| 1m 02s Done