This is heavily modeled on the PyTorch tutorial: https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html
We use fastai libraries extensively to make dataloading and training easier
This is a list of surnames and their ethnicities
#!wget https://download.pytorch.org/tutorial/data.zip
#!unzip -o data.zip
fastai imports pandas and all sorts of other goodies
from fastai import *
from fastai.text import *
from unidecode import unidecode
import string
Limit pandas display to 20 rows to prevent dataframes from taking up too much of the output.
pd.options.display.max_rows = 20
Read in the data; the names for each language are in a separate file
path = Path('data/names')
!ls {path}
Arabic.txt English.txt Irish.txt Polish.txt Spanish.txt Chinese.txt French.txt Italian.txt Portuguese.txt Vietnamese.txt Czech.txt German.txt Japanese.txt Russian.txt Dutch.txt Greek.txt Korean.txt Scottish.txt
!head -n5 {path}/Arabic.txt
Khoury
Nahas
Daher
Gerges
Nazari
names = []
for p in path.glob('*.txt'):
    lang = p.name[:-4]
    with open(p) as f:
        names += [(lang, l.strip()) for l in f]
df = pd.DataFrame(names, columns=['cl', 'name'])
It's always worth doing some sanity checks on your data (even supposedly clean tutorial data).
No matter how good your model is: garbage in, garbage out.
df.head()
cl | name | |
---|---|---|
0 | Korean | Ahn |
1 | Korean | Baik |
2 | Korean | Bang |
3 | Korean | Byon |
4 | Korean | Cha |
len(df)
20074
What letters outside of ASCII are in the names?
foreign_chars = Counter(_ for _ in ''.join(list(df.name)) if _ not in string.ascii_letters)
foreign_chars.most_common()
[(' ', 115), ("'", 87), ('-', 25), ('ö', 24), ('é', 22), ('í', 14), ('ó', 13), ('ä', 13), ('á', 12), ('ü', 11), ('à', 10), ('ß', 9), ('ú', 7), ('ñ', 6), ('ò', 3), ('Ś', 3), ('1', 3), (',', 3), ('è', 2), ('ã', 2), ('ù', 1), ('ì', 1), ('ż', 1), ('ń', 1), ('ł', 1), ('ą', 1), ('Ż', 1), ('/', 1), (':', 1), ('Á', 1), ('\xa0', 1), ('õ', 1), ('É', 1), ('ê', 1), ('ç', 1)]
A few of these look suspicious. (Note that contains interprets its pattern as a regular expression, so joining the characters with '|' matches any one of them.)
suss_chars = [':', '/', '\xa0', ',', '1']
df[df.name.str.contains('|'.join(suss_chars))]
cl | name | |
---|---|---|
2494 | Czech | Maxa/B |
2590 | Czech | Rafaj1 |
2703 | Czech | Urbanek1 |
2732 | Czech | Whitmire1 |
3214 | Chinese | Lu: |
14506 | Russian | Jevolojnov, |
15347 | Russian | Lysansky, |
15366 | Russian | Lytkin, |
18052 | Russian | To The First Page |
Most of these look like legitimate names with extra junk (except 'To The First Page'). Since it's so few names it's easiest just to drop them.
df = df[~df.name.str.contains('|'.join(suss_chars))]
Single quotes and spaces are common
df[df.name.str.contains("'| ")]
cl | name | |
---|---|---|
369 | Italian | D'ambrosio |
371 | Italian | D'amore |
372 | Italian | D'angelo |
373 | Italian | D'antonio |
374 | Italian | De angelis |
375 | Italian | De campo |
376 | Italian | De felice |
377 | Italian | De filippis |
378 | Italian | De fiore |
379 | Italian | De laurentis |
... | ... | ... |
18161 | Russian | V'Yurkov |
19061 | Russian | Zasyad'Ko |
19740 | Portuguese | D'cruz |
19741 | Portuguese | D'cruze |
19743 | Portuguese | De santigo |
19858 | French | D'aramitz |
19864 | French | De la fontaine |
19869 | French | De sauveterre |
20051 | French | St martin |
20052 | French | St pierre |
150 rows × 2 columns
Since hyphens mainly join multiple last names (and are pretty rare) we won't lose heaps by dropping them.
df[df.name.str.contains('-')]
cl | name | |
---|---|---|
2982 | Chinese | Au-Yong |
3088 | Chinese | Ou-Yang |
3089 | Chinese | Ow-Yang |
10156 | Russian | Abdank-Kossovsky |
10639 | Russian | Amet-Han |
11221 | Russian | Bagai-Ool |
11757 | Russian | Bei-Bienko |
11787 | Russian | Beknazar-Yuzbashev |
11790 | Russian | Bekovich-Cherkassky |
11904 | Russian | Bestujev-Lada |
... | ... | ... |
11952 | Russian | Bim-Bad |
12209 | Russian | Chyrgal-Ool |
13071 | Russian | Galkin-Vraskoi |
13307 | Russian | Gorbunov-Posadov |
16687 | Russian | Porai-Koshits |
17222 | Russian | Shah-Nazaroff |
17430 | Russian | Shirinsky-Shikhmatov |
17748 | Russian | Tsann-Kay-Si |
17999 | Russian | Tzann-Kay-Si |
18315 | Russian | Van-Puteren |
23 rows × 2 columns
df = df[~df.name.str.contains('-')]
Let's normalise all non-ASCII characters to ASCII equivalents.
This makes our classification problem a little harder in practice: any name containing a ß is almost surely German, whereas "ss" could occur in many languages. But it also reduces the set of characters we need to represent the names.
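A quick illustration of what unidecode does (the third string is an illustrative German word, not a name from our data):

```python
from unidecode import unidecode

# Accented characters map to their closest ASCII equivalents
print(unidecode('Abbà'))     # Abba
print(unidecode('Séverin'))  # Severin
print(unidecode('Straße'))   # Strasse -- the German eszett expands to 'ss'
```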
df['ascii_name'] = df.name.apply(unidecode)
df[df.name != df.ascii_name]
cl | name | ascii_name | |
---|---|---|---|
100 | Italian | Abbà | Abba |
112 | Italian | Abelló | Abello |
160 | Italian | Airò | Airo |
195 | Italian | Alò | Alo |
238 | Italian | Azzarà | Azzara |
300 | Italian | Bovér | Bover |
445 | Italian | Giùgovaz | Giugovaz |
461 | Italian | Làconi | Laconi |
462 | Italian | Laganà | Lagana |
463 | Italian | Lagomarsìno | Lagomarsino |
... | ... | ... | ... |
19912 | French | Géroux | Geroux |
19920 | French | Guérin | Guerin |
19924 | French | Hébert | Hebert |
19949 | French | Lécuyer | Lecuyer |
19951 | French | Lefévre | Lefevre |
19955 | French | Lémieux | Lemieux |
19960 | French | Lévêque | Leveque |
19961 | French | Lévesque | Levesque |
19965 | French | Maçon | Macon |
20047 | French | Séverin | Severin |
156 rows × 3 columns
Let's check case: I expect names to be in CamelCase.
These seem to be mistakes.
df[~df.ascii_name.str.contains("^[A-Z][^A-Z]*(?:[' -][A-Z][^A-Z]*)*$")]
cl | name | ascii_name | |
---|---|---|---|
2508 | Czech | MonkoAustria | MonkoAustria |
2677 | Czech | StrakaO | StrakaO |
3266 | Vietnamese | an | an |
df = df[df.ascii_name.str.contains("^[A-Z][^A-Z]*(?:[' -][A-Z][^A-Z]*)*$")]
Let's lowercase the ascii_names
df['ascii_name'] = df.ascii_name.str.lower()
Check we've normalised correctly.
ascii_chars = Counter(''.join(list(df.ascii_name)))
ascii_chars.most_common()
[('a', 16511), ('o', 11120), ('e', 10768), ('i', 10416), ('n', 9943), ('r', 8245), ('s', 7980), ('h', 7673), ('k', 6902), ('l', 6704), ('v', 6301), ('t', 5939), ('u', 4725), ('m', 4343), ('d', 3894), ('b', 3641), ('y', 3604), ('g', 3209), ('c', 3068), ('z', 1928), ('f', 1774), ('p', 1707), ('j', 1346), ('w', 1125), (' ', 112), ('q', 98), ("'", 87), ('x', 72)]
In practice a surname could have multiple ethnicities, but we'd have to be really careful of how we use this in training.
If we end up with e.g. 'Michel' as French in the training dataset, but German in the validation set our model has no hope of getting it right (and we may discard an actually good model).
We could handle this by:

1. Treating it as a multi-label problem and predicting every plausible ethnicity for a name.
2. Assigning each ambiguous name to its most frequent ethnicity.
3. Dropping the ambiguous names entirely.

Without any information about frequency we can't do (2), and (1) is a harder problem, so we'll stick to (3).
name_classes = df.\
groupby('ascii_name').\
nunique().cl.sort_values(ascending=False)
name_classes.head(20)
ascii_name
michel     6
adam       5
albert     5
abel       5
martin     5
simon      5
ventura    4
costa      4
jordan     4
han        4
salomon    4
samuel     4
klein      4
franco     4
wang       4
oliver     4
garcia     3
horn       3
lim        3
rose       3
Name: cl, dtype: int64
df[df.name == 'Michel']
cl | name | ascii_name | |
---|---|---|---|
872 | Polish | Michel | michel |
2077 | Dutch | Michel | michel |
3489 | Spanish | Michel | michel |
6163 | German | Michel | michel |
8709 | English | Michel | michel |
19978 | French | Michel | michel |
About 1 in 40 of our names has multiple classes (most of these were already ambiguous before normalisation too)
len(name_classes), sum(name_classes > 1) / len(name_classes)
(17380, 0.027445339470655927)
While some names like Abel do genuinely occur commonly in multiple countries, a lot of the ambiguity is systematic: Korean and Chinese have a lot of overlap, as do English and Scottish. While this makes some linguistic sense it will make it hard to build a reliable classifier.
Note that most names only occur once; so we can't pick a "most common" frequency class.
with pd.option_context('display.max_rows', 60):
    print(df[df.ascii_name.isin(name_classes[name_classes > 1].index)].groupby(['ascii_name', 'cl']).count())
                     name
ascii_name cl
abel       English      1
           French       1
           German       1
           Russian      1
           Spanish      1
abello     Italian      1
           Spanish      1
abraham    English      1
           French       1
abreu      Portuguese   1
           Spanish      1
adam       English      1
           French       1
           German       1
           Irish        1
           Russian      1
adams      English      1
           Russian      1
adamson    English      1
           Russian      1
adler      English      1
           German       1
           Russian      1
aitken     English      1
           Scottish     1
albert     English      1
           French       1
           German       1
           Russian      1
           Spanish      1
...                   ...
wilson     English      1
           Scottish     1
winter     English      1
           German       1
wolf       English      1
           German       1
wong       Chinese      1
           English      1
woo        Chinese      1
           Korean       1
wood       Czech        1
           English      1
           Scottish     1
wright     English      1
           Scottish     1
yan        Chinese      2
           Russian      1
yang       Chinese      1
           English      1
           Korean       1
yim        Chinese      1
           Korean       1
you        Chinese      1
           Korean       1
young      English      1
           Scottish     1
yun        Chinese      1
           Korean       1
zambrano   Italian      1
           Spanish      1

[1051 rows x 1 columns]
Rather than finding the "right" ethnicity the easy thing to do is to remove all ambiguous cases.
df = df[~df.ascii_name.isin(name_classes[name_classes > 1].index)]
We need exactly one row per pair; if separate copies appear in the training and validation set we'll get a higher validation accuracy than is reasonable.
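To see why duplicates across the split are a problem, here's a toy sketch (with made-up names): a "model" that simply memorises its training rows scores perfectly on any validation row that duplicates a training row.

```python
# Toy illustration (made-up data): a pure lookup-table "model"
train = {'khoury': 'Arabic', 'smith': 'English', 'sato': 'Japanese'}
valid = [('khoury', 'Arabic'), ('smith', 'English')]  # duplicates of training rows

preds = [train.get(name) for name, _ in valid]
accuracy = sum(p == cl for (_, cl), p in zip(valid, preds)) / len(valid)
print(accuracy)  # 1.0 -- perfect, but only because of the leak
```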
Some names occur very frequently.
counts = df.assign(n=1).groupby(['ascii_name', 'cl']).count().sort_values('n', ascending=False)
counts.head(n=20)
name | n | ||
---|---|---|---|
ascii_name | cl | ||
tahan | Arabic | 28 | 28 |
fakhoury | Arabic | 28 | 28 |
koury | Arabic | 27 | 27 |
nader | Arabic | 27 | 27 |
sarraf | Arabic | 26 | 26 |
hadad | Arabic | 26 | 26 |
kassis | Arabic | 26 | 26 |
antar | Arabic | 26 | 26 |
shadid | Arabic | 25 | 25 |
cham | Arabic | 25 | 25 |
mifsud | Arabic | 25 | 25 |
nahas | Arabic | 24 | 24 |
gerges | Arabic | 24 | 24 |
ganim | Arabic | 23 | 23 |
tuma | Arabic | 23 | 23 |
to the first page | Russian | 23 | 23 |
atiyeh | Arabic | 23 | 23 |
malouf | Arabic | 23 | 23 |
sayegh | Arabic | 22 | 22 |
naifeh | Arabic | 22 | 22 |
Let's remove the "To The First Page" junk (probably an artifact of the page the data was scraped from).
df = df[df.ascii_name != 'to the first page']
There are no multiples in English, and a lot in Arabic. It seems like a data entry error rather than meaningful.
counts.assign(multiple=counts.n > 1, rows=1).groupby('cl').sum().sort_values('n', ascending=False)
name | n | multiple | rows | |
---|---|---|---|---|
cl | ||||
Russian | 9326 | 9326 | 35.0 | 9263 |
English | 3359 | 3359 | 0.0 | 3359 |
Arabic | 1892 | 1892 | 103.0 | 103 |
Japanese | 983 | 983 | 1.0 | 982 |
Italian | 665 | 665 | 5.0 | 660 |
German | 613 | 613 | 33.0 | 578 |
Czech | 480 | 480 | 16.0 | 464 |
Dutch | 255 | 255 | 10.0 | 244 |
Chinese | 219 | 219 | 19.0 | 200 |
Spanish | 214 | 214 | 2.0 | 212 |
French | 213 | 213 | 3.0 | 210 |
Greek | 195 | 195 | 2.0 | 192 |
Irish | 170 | 170 | 6.0 | 164 |
Polish | 124 | 124 | 1.0 | 123 |
Korean | 61 | 61 | 0.0 | 61 |
Vietnamese | 56 | 56 | 1.0 | 55 |
Portuguese | 32 | 32 | 0.0 | 32 |
Scottish | 1 | 1 | 0.0 | 1 |
It makes sense to drop the duplicates and only keep a single row per ascii_name and cl.
df = df.drop_duplicates(['ascii_name', 'cl'])
len(df)
16902
It's worth checking if the shortest and longest names make sense.
They look reasonable.
df.assign(len=df.name.str.len()).sort_values('len')
cl | name | ascii_name | len | |
---|---|---|---|---|
3265 | Vietnamese | An | an | 2 |
50 | Korean | Oh | oh | 2 |
1150 | Japanese | Ii | ii | 2 |
54 | Korean | Ra | ra | 2 |
3891 | Arabic | Ba | ba | 2 |
57 | Korean | Ri | ri | 2 |
69 | Korean | Si | si | 2 |
71 | Korean | So | so | 2 |
3311 | Vietnamese | To | to | 2 |
85 | Korean | Yi | yi | 2 |
... | ... | ... | ... | ... |
11475 | Russian | Bakhtchivandzhi | bakhtchivandzhi | 15 |
10191 | Russian | Abdulladzhanoff | abdulladzhanoff | 15 |
17299 | Russian | Shakhnazaryants | shakhnazaryants | 15 |
11393 | Russian | Baistryutchenko | baistryutchenko | 15 |
14965 | Russian | Katzenellenbogen | katzenellenbogen | 16 |
2228 | Dutch | Vandroogenbroeck | vandroogenbroeck | 16 |
14947 | Russian | Katsenellenbogen | katsenellenbogen | 16 |
19552 | Greek | Chrysanthopoulos | chrysanthopoulos | 16 |
2841 | Irish | Maceachthighearna | maceachthighearna | 17 |
6380 | German | Von grimmelshausen | von grimmelshausen | 18 |
16902 rows × 4 columns
The dataset is very unbalanced.
I doubt there's enough data to tackle Portuguese (which will be close to Spanish) or Scottish (which will be close to English)
df.groupby('cl').name.count().sort_values(ascending=False)
cl
Russian       9262
English       3359
Japanese       982
Italian        660
German         578
Czech          464
Dutch          244
Spanish        212
French         210
Chinese        200
Greek          192
Irish          164
Polish         123
Arabic         103
Korean          61
Vietnamese      55
Portuguese      32
Scottish         1
Name: name, dtype: int64
df[df.cl.isin(['Scottish'])]
cl | name | ascii_name | |
---|---|---|---|
3711 | Scottish | Hay | hay |
Let's remove the rarest classes; we're not likely to have enough data to guess them.
df = df[~df.cl.isin(['Scottish', 'Portuguese'])]
Note Russian contains variant transliterations to English like Abaimoff and Abaimov (which both correspond to Абаимов).
But this doesn't quite explain its high frequency: it seems a lot more Russian data was extracted.
(Side note: Chebyshev can also be spelt e.g. Chebychev, Tchebycheff, Tschebyschef)
df[df.cl == 'Russian']
cl | name | ascii_name | |
---|---|---|---|
10112 | Russian | Ababko | ababko |
10113 | Russian | Abaev | abaev |
10114 | Russian | Abagyan | abagyan |
10115 | Russian | Abaidulin | abaidulin |
10116 | Russian | Abaidullin | abaidullin |
10117 | Russian | Abaimoff | abaimoff |
10118 | Russian | Abaimov | abaimov |
10119 | Russian | Abakeliya | abakeliya |
10120 | Russian | Abakovsky | abakovsky |
10121 | Russian | Abakshin | abakshin |
... | ... | ... | ... |
19510 | Russian | Zolotavin | zolotavin |
19511 | Russian | Zolotdinov | zolotdinov |
19512 | Russian | Zolotenkov | zolotenkov |
19513 | Russian | Zolotilin | zolotilin |
19514 | Russian | Zolotkov | zolotkov |
19515 | Russian | Zolotnitsky | zolotnitsky |
19516 | Russian | Zolotnitzky | zolotnitzky |
19517 | Russian | Zozrov | zozrov |
19518 | Russian | Zozulya | zozulya |
19519 | Russian | Zukerman | zukerman |
9262 rows × 3 columns
We want our final model to work well on any language.
But if we pick our validation set uniformly at random from the data we're likely to get many Russian names and not many Vietnamese names, which isn't a good test of this.
So instead we'll build our validation set from an equal number of names from each class.
df = df.reset_index(drop=True)
df
cl | name | ascii_name | |
---|---|---|---|
0 | Korean | Ahn | ahn |
1 | Korean | Baik | baik |
2 | Korean | Bang | bang |
3 | Korean | Byon | byon |
4 | Korean | Cha | cha |
5 | Korean | Cho | cho |
6 | Korean | Choe | choe |
7 | Korean | Choi | choi |
8 | Korean | Chun | chun |
9 | Korean | Chweh | chweh |
... | ... | ... | ... |
16859 | French | Travere | travere |
16860 | French | Traverse | traverse |
16861 | French | Travert | travert |
16862 | French | Tremblay | tremblay |
16863 | French | Tremble | tremble |
16864 | French | Victors | victors |
16865 | French | Villeneuve | villeneuve |
16866 | French | Vipond | vipond |
16867 | French | Voclain | voclain |
16868 | French | Yount | yount |
16869 rows × 3 columns
counts = df.groupby('cl').name.count().sort_values(ascending=False)
counts
cl
Russian       9262
English       3359
Japanese       982
Italian        660
German         578
Czech          464
Dutch          244
Spanish        212
French         210
Chinese        200
Greek          192
Irish          164
Polish         123
Arabic         103
Korean          61
Vietnamese      55
Name: name, dtype: int64
valid_size = 30 # We'll pick 30 at random from each subclass
train_size = 500 # For a balanced training set we'll pick 500 at random with replacement
np.random.seed(6011)
valid_idx = []
for cl in counts.keys():
    # Random sample of size "valid_size" for each class
    valid_idx += list(df[df.cl == cl].sample(valid_size).index)
df['valid'] = False
df.loc[valid_idx, 'valid'] = True
Let's also create a balanced training set as an alternative to using everything not in validation
np.random.seed(7012)
balanced_idx = []
for cl in counts.keys():
    # Random sample of size "train_size" for each class from the data outside of the validation set
    balanced_idx += list(df[(df.cl == cl) & ~df.valid].sample(train_size, replace=True).index)
Note the balanced index contains all 25 (= 55 - 30) Vietnamese names outside of the validation set, but only 486 of the Russian names (because we sampled randomly with replacement there will be a few double ups).
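A quick sketch of why sampling with replacement behaves this way: 500 draws from a pool smaller than 500 must contain duplicates (pigeonhole), while a small pool is almost certain to be fully covered.

```python
import numpy as np

rng = np.random.default_rng(0)
pool = np.arange(25)  # e.g. the 25 Vietnamese names outside validation
sample = rng.choice(pool, size=500, replace=True)

# 500 draws from 25 items: duplicates are guaranteed by the pigeonhole principle
print(len(sample), len(set(sample.tolist())))  # 500, at most 25 distinct
```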
df.loc[balanced_idx].groupby('cl').nunique().sort_values('ascii_name', ascending=False)
cl | name | ascii_name | valid | |
---|---|---|---|---|
cl | ||||
Russian | 1 | 486 | 486 | 1 |
English | 1 | 459 | 459 | 1 |
Japanese | 1 | 383 | 383 | 1 |
Italian | 1 | 357 | 357 | 1 |
German | 1 | 330 | 330 | 1 |
Czech | 1 | 295 | 295 | 1 |
Dutch | 1 | 195 | 195 | 1 |
French | 1 | 172 | 172 | 1 |
Spanish | 1 | 170 | 170 | 1 |
Chinese | 1 | 158 | 158 | 1 |
Greek | 1 | 153 | 153 | 1 |
Irish | 1 | 129 | 129 | 1 |
Polish | 1 | 93 | 93 | 1 |
Arabic | 1 | 73 | 73 | 1 |
Korean | 1 | 31 | 31 | 1 |
Vietnamese | 1 | 25 | 25 | 1 |
Let's record our balanced set in the dataframe: this will make it easy to reload at a later point.
df['bal'] = 0
for k, v in Counter(balanced_idx).items():
    df.loc[k, 'bal'] += v
df.head()
cl | name | ascii_name | valid | bal | |
---|---|---|---|---|---|
0 | Korean | Ahn | ahn | False | 13 |
1 | Korean | Baik | baik | True | 0 |
2 | Korean | Bang | bang | False | 13 |
3 | Korean | Byon | byon | False | 15 |
4 | Korean | Cha | cha | True | 0 |
We can always retrieve the indexes from the dataframe
idx = []
for k, v in zip(df.index, df.bal):
    idx += [k]*v
sorted(balanced_idx) == idx
True
df.to_csv('names_clean.csv', index=False)
The first benchmark is random guessing/always guessing the same class.
Since the validation set is balanced, the expected accuracy is 1/(number of classes) = 1/16 ~ 6.25%
df = pd.read_csv('names_clean.csv')
valid_idx = df[df.valid].index
train_idx = df[~df.valid].index
bal_idx = []
for k, v in zip(df.index, df.bal):
    bal_idx += [k]*v
Check that neither the training data nor the balanced training data contains any names in the validation set
train_intersect_valid = sum(df.iloc[train_idx].ascii_name.isin(df.iloc[valid_idx].ascii_name))
bal_intersect_valid = sum(df.iloc[bal_idx].ascii_name.isin(df.iloc[valid_idx].ascii_name))
train_intersect_valid, bal_intersect_valid
(0, 0)
Make sure the data looks right
df.iloc[train_idx].groupby('cl').nunique().sort_values('ascii_name', ascending=False)
cl | name | ascii_name | valid | bal | |
---|---|---|---|---|---|
cl | |||||
Russian | 1 | 9232 | 9232 | 1 | 3 |
English | 1 | 3329 | 3329 | 1 | 4 |
Japanese | 1 | 952 | 952 | 1 | 5 |
Italian | 1 | 630 | 630 | 1 | 5 |
German | 1 | 548 | 548 | 1 | 5 |
Czech | 1 | 434 | 434 | 1 | 6 |
Dutch | 1 | 214 | 214 | 1 | 9 |
Spanish | 1 | 182 | 182 | 1 | 10 |
French | 1 | 180 | 180 | 1 | 9 |
Chinese | 1 | 170 | 170 | 1 | 9 |
Greek | 1 | 162 | 162 | 1 | 10 |
Irish | 1 | 134 | 134 | 1 | 10 |
Polish | 1 | 93 | 93 | 1 | 11 |
Arabic | 1 | 73 | 73 | 1 | 13 |
Korean | 1 | 31 | 31 | 1 | 13 |
Vietnamese | 1 | 25 | 25 | 1 | 16 |
df.iloc[bal_idx].groupby('cl').nunique().sort_values('ascii_name', ascending=False)
cl | name | ascii_name | valid | bal | |
---|---|---|---|---|---|
cl | |||||
Russian | 1 | 486 | 486 | 1 | 2 |
English | 1 | 459 | 459 | 1 | 3 |
Japanese | 1 | 383 | 383 | 1 | 4 |
Italian | 1 | 357 | 357 | 1 | 4 |
German | 1 | 330 | 330 | 1 | 4 |
Czech | 1 | 295 | 295 | 1 | 5 |
Dutch | 1 | 195 | 195 | 1 | 8 |
French | 1 | 172 | 172 | 1 | 8 |
Spanish | 1 | 170 | 170 | 1 | 9 |
Chinese | 1 | 158 | 158 | 1 | 8 |
Greek | 1 | 153 | 153 | 1 | 9 |
Irish | 1 | 129 | 129 | 1 | 9 |
Polish | 1 | 93 | 93 | 1 | 11 |
Arabic | 1 | 73 | 73 | 1 | 13 |
Korean | 1 | 31 | 31 | 1 | 13 |
Vietnamese | 1 | 25 | 25 | 1 | 16 |
df.iloc[valid_idx].groupby('cl').nunique().sort_values('ascii_name', ascending=False)
cl | name | ascii_name | valid | bal | |
---|---|---|---|---|---|
cl | |||||
Arabic | 1 | 30 | 30 | 1 | 1 |
Chinese | 1 | 30 | 30 | 1 | 1 |
Czech | 1 | 30 | 30 | 1 | 1 |
Dutch | 1 | 30 | 30 | 1 | 1 |
English | 1 | 30 | 30 | 1 | 1 |
French | 1 | 30 | 30 | 1 | 1 |
German | 1 | 30 | 30 | 1 | 1 |
Greek | 1 | 30 | 30 | 1 | 1 |
Irish | 1 | 30 | 30 | 1 | 1 |
Italian | 1 | 30 | 30 | 1 | 1 |
Japanese | 1 | 30 | 30 | 1 | 1 |
Korean | 1 | 30 | 30 | 1 | 1 |
Polish | 1 | 30 | 30 | 1 | 1 |
Russian | 1 | 30 | 30 | 1 | 1 |
Spanish | 1 | 30 | 30 | 1 | 1 |
Vietnamese | 1 | 30 | 30 | 1 | 1 |
Picking any one class in validation will give 1/16 = 6.25%
(df[df.valid].cl == 'Korean').sum() / df.valid.sum()
0.0625
A reasonable way to guess a language is by the frequency of characters and pairs of characters.
For example 'cz' is very rare in English, but quite common in the Slavic languages.
name = 'zozrov'
A function to count the occurrences of sequences of one, two or three letters (these sequences are generally called "n-grams", particularly when referring to sequences of words).
def ngrams(s, n=1):
    parts = [s[i:] for i in range(n)]  # e.g. ['zozrov', 'ozrov', 'zrov']
    return Counter(''.join(_) for _ in zip(*parts))
ngrams(name, 1), ngrams(name, 2), ngrams(name, 3)
(Counter({'z': 2, 'o': 2, 'r': 1, 'v': 1}), Counter({'zo': 1, 'oz': 1, 'zr': 1, 'ro': 1, 'ov': 1}), Counter({'zoz': 1, 'ozr': 1, 'zro': 1, 'rov': 1}))
df = df.assign(letters=df.ascii_name.apply(ngrams))
df = df.assign(bigrams=df.ascii_name.apply(ngrams, n=2))
df = df.assign(trigrams=df.ascii_name.apply(ngrams, n=3))
df.head()
cl | name | ascii_name | valid | bal | letters | bigrams | trigrams | |
---|---|---|---|---|---|---|---|---|
0 | Korean | Ahn | ahn | False | 13 | {'a': 1, 'h': 1, 'n': 1} | {'ah': 1, 'hn': 1} | {'ahn': 1} |
1 | Korean | Baik | baik | True | 0 | {'b': 1, 'a': 1, 'i': 1, 'k': 1} | {'ba': 1, 'ai': 1, 'ik': 1} | {'bai': 1, 'aik': 1} |
2 | Korean | Bang | bang | False | 13 | {'b': 1, 'a': 1, 'n': 1, 'g': 1} | {'ba': 1, 'an': 1, 'ng': 1} | {'ban': 1, 'ang': 1} |
3 | Korean | Byon | byon | False | 15 | {'b': 1, 'y': 1, 'o': 1, 'n': 1} | {'by': 1, 'yo': 1, 'on': 1} | {'byo': 1, 'yon': 1} |
4 | Korean | Cha | cha | True | 0 | {'c': 1, 'h': 1, 'a': 1} | {'ch': 1, 'ha': 1} | {'cha': 1} |
Let's try to guess the ethnicity using Naive Bayes.
TL;DR: This is a really simple model that works quite well and will give a good benchmark.
This uses "Bayes Rule" which uses the data to answer questions like: "given the name contains the bigram 'ah' what's the probability it's Korean?".
The "Naive" part means that that we assume all these probabilities are independent (knowing it contains 'ah' doesn't tell you anything about the fact it contains 'hn'). Even though this definitely isn't true, it's often a reasonable approximation.
This makes it really fast and simple to fit a model and often works well.
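As a rough sketch of what a multinomial Naive Bayes classifier computes under the hood (the counts and priors below are made-up numbers, and Laplace smoothing stands in for sklearn's alpha parameter):

```python
import math

# Toy bigram counts per class (made-up numbers, purely for illustration)
counts = {
    'Korean':  {'ah': 30, 'hn': 20, 'ng': 50},
    'English': {'ah': 5,  'hn': 1,  'ng': 40},
}
priors = {'Korean': 0.5, 'English': 0.5}
vocab = {b for c in counts.values() for b in c}

def score(bigrams, cl, alpha=1.0):
    """log P(class) + sum of log P(bigram | class), with Laplace smoothing."""
    total = sum(counts[cl].values())
    s = math.log(priors[cl])
    for b in bigrams:
        s += math.log((counts[cl].get(b, 0) + alpha) / (total + alpha * len(vocab)))
    return s

# 'ahn' contains the bigrams 'ah' and 'hn'; Korean scores higher
print(score(['ah', 'hn'], 'Korean') > score(['ah', 'hn'], 'English'))  # True
```

The class with the highest score wins; independence is exactly the step where the per-bigram log-probabilities are simply summed.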
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction import DictVectorizer
vd1 = DictVectorizer(sparse=False)
vd2 = DictVectorizer(sparse=False)
vd3 = DictVectorizer(sparse=False)
y = df.cl
letters = vd1.fit_transform(df.letters)
bigrams = vd2.fit_transform(df.bigrams)
trigrams = vd3.fit_transform(df.trigrams)
The letters matrix contains the number of times each of the 28 letters occurs (e.g. number of spaces, number of apostrophes, number of 'a', ...).
vd1.get_feature_names()[:10]
[' ', "'", 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
letters
array([[0., 0., 1., 0., ..., 0., 0., 0., 0.], [0., 0., 1., 1., ..., 0., 0., 0., 0.], [0., 0., 1., 1., ..., 0., 0., 0., 0.], [0., 0., 0., 1., ..., 0., 0., 1., 0.], ..., [0., 0., 0., 0., ..., 0., 0., 0., 0.], [0., 0., 0., 0., ..., 0., 0., 0., 0.], [0., 0., 1., 0., ..., 0., 0., 0., 0.], [0., 0., 0., 0., ..., 0., 0., 1., 0.]])
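As a small illustration of what DictVectorizer is doing (on toy dicts, not our data): each distinct key becomes a column, sorted alphabetically, and missing keys become 0.

```python
from sklearn.feature_extraction import DictVectorizer

v = DictVectorizer(sparse=False)
X = v.fit_transform([{'a': 1, 'h': 1, 'n': 1},   # e.g. letter counts for 'ahn'
                     {'a': 2, 'b': 1}])          # e.g. letter counts for 'baa'
print(v.feature_names_)  # ['a', 'b', 'h', 'n']
print(X)                 # [[1. 0. 1. 1.]
                         #  [2. 1. 0. 0.]]
```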
Similarly, bigrams and trigrams contain the number of times each sequence of 2 or 3 letters occurs.
vd2.get_feature_names()[:5], vd2.get_feature_names()[-5:]
([' a', ' b', ' c', ' e', ' f'], ['zu', 'zv', 'zw', 'zy', 'zz'])
letters.shape, bigrams.shape, trigrams.shape, y.shape
((16869, 28), (16869, 623), (16869, 5794), (16869,))
How good a model can we get looking at individual letters (e.g. using the fact that 'z' occurs much more frequently in Chinese than in English names)?
letter_nb = MultinomialNB()
letter_nb.fit(letters[train_idx],y[train_idx])
bal_letter_nb = MultinomialNB()
bal_letter_nb.fit(letters[bal_idx],y[bal_idx])
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
The model trained on the balanced set does much better than random: around 33%, versus around 15% for the full, unbalanced training set.
letter_pred = letter_nb.predict(letters[valid_idx])
bal_letter_pred = bal_letter_nb.predict(letters[valid_idx])
(letter_pred == y[valid_idx]).mean(), (bal_letter_pred == y[valid_idx]).mean()
(0.14791666666666667, 0.33541666666666664)
Let's write a function to test the Naive Bayes on any dataset; fitting on the whole dataset and the balanced dataset separately.
def nb(x):
    model = MultinomialNB()
    model.fit(x[train_idx], y[train_idx])
    preds = model.predict(x[valid_idx])
    acc_train = (preds == y[valid_idx]).mean()

    model = MultinomialNB()
    model.fit(x[bal_idx], y[bal_idx])
    preds = model.predict(x[valid_idx])
    acc_bal = (preds == y[valid_idx]).mean()
    return acc_train, acc_bal
nb(letters)
(0.14791666666666667, 0.33541666666666664)
Using bigrams and a balanced training set gives a much better prediction performance 53% (up from the baseline of 6.25%).
nb(bigrams)
(0.35833333333333334, 0.5291666666666667)
Adding letters doesn't make much difference (which isn't surprising).
nb(np.concatenate((letters, bigrams), axis=1))
(0.3854166666666667, 0.5166666666666667)
Trigrams alone also perform worse.
nb(trigrams)
(0.33958333333333335, 0.4895833333333333)
Let's try every combination with trigrams:
nb(np.concatenate((letters, trigrams), axis=1))
(0.24375, 0.5083333333333333)
nb(np.concatenate((bigrams, trigrams), axis=1))
(0.36875, 0.5416666666666666)
nb(np.concatenate((letters, bigrams, trigrams), axis=1))
(0.32916666666666666, 0.55625)
None of them significantly outperforms the simple bigram model (which has 623 parameters; we could probably remove some of the uncommon ones without too many problems).
Let's remove the bigrams that occur less than twice in the balanced training set, as they have practically no value (and there are over 100 of them).
common_bigrams = (bigrams[bal_idx].sum(axis=0)) >= 2
common_bigrams.sum()
503
common_bigram_index = [i for i, t in enumerate(common_bigrams) if t]
bigrams_min = bigrams[:, common_bigram_index]
bigrams_min.shape
(16869, 503)
bigram_model = MultinomialNB()
bigram_model.fit(bigrams_min[bal_idx], y[bal_idx])
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
We get around 53% accuracy.
bigram_pred = bigram_model.predict(bigrams_min[valid_idx])
(bigram_pred == y[valid_idx]).mean()
0.5291666666666667
bigram_prob = bigram_model.predict_proba(bigrams_min[valid_idx])
bigram_prob.max(axis=1)
array([244, 9, 20, 18, ..., 86, 422, 143, 431])
bigram_preds = (df
                .iloc[valid_idx]
                .assign(pred = bigram_pred)[['name', 'cl', 'pred']]
                .assign(prob = bigram_prob.max(axis=1)))
bigram_preds.sort_values('prob', ascending=False).head(15)
name | cl | pred | prob | |
---|---|---|---|---|
16557 | Kotsiopoulos | Greek | Greek | 1.000000 |
2012 | Rooijakker | Dutch | Dutch | 1.000000 |
16470 | Akrivopoulos | Greek | Greek | 0.999999 |
826 | Warszawski | Polish | Polish | 0.999998 |
1997 | Romeijnders | Dutch | Dutch | 0.999997 |
16478 | Antonopoulos | Greek | Greek | 0.999995 |
813 | Sokolowski | Polish | Polish | 0.999994 |
839 | Zdunowski | Polish | Polish | 0.999950 |
1996 | Romeijn | Dutch | Dutch | 0.999917 |
16497 | Chrysanthopoulos | Greek | Greek | 0.999895 |
2053 | Sneijers | Dutch | Dutch | 0.999792 |
2031 | Schwarzenberg | Dutch | German | 0.999774 |
795 | Rudawski | Polish | Polish | 0.999751 |
16715 | De sauveterre | French | French | 0.999604 |
1160 | Kawagichi | Japanese | Japanese | 0.999600 |
The names it's least confident about typically seem to be quite short
bigram_preds.sort_values('prob', ascending=True).head(15)
name | cl | pred | prob | |
---|---|---|---|---|
2907 | Do | Vietnamese | Irish | 0.176679 |
24 | Mo | Korean | Japanese | 0.179534 |
47 | So | Korean | Korean | 0.188088 |
45 | Si | Korean | Greek | 0.190236 |
13775 | Prigojin | Russian | Italian | 0.191639 |
41 | Seok | Korean | French | 0.197154 |
5 | Cho | Korean | German | 0.202991 |
1091 | Isobe | Japanese | English | 0.206442 |
46 | Sin | Korean | Italian | 0.218300 |
5332 | Ingram | English | Spanish | 0.220205 |
2935 | Ta | Vietnamese | Japanese | 0.226875 |
2700 | Ban | Chinese | Vietnamese | 0.228022 |
1697 | Togo | Japanese | Japanese | 0.236658 |
4 | Cha | Korean | Irish | 0.239172 |
3445 | Graner | German | Spanish | 0.240844 |
The names it's most confidently wrong with:
bigram_preds[bigram_preds.cl != bigram_preds.pred].sort_values('prob', ascending=False).head(15)
name | cl | pred | prob | |
---|---|---|---|---|
2031 | Schwarzenberg | Dutch | German | 0.999774 |
16578 | Malihoudis | Greek | Arabic | 0.992311 |
16576 | Louverdis | Greek | French | 0.990256 |
4758 | Fairbrace | English | Irish | 0.987143 |
3743 | Spellmeyer | German | English | 0.976530 |
16468 | Adamou | Greek | Arabic | 0.973496 |
3009 | De la fuente | Spanish | French | 0.969431 |
3011 | De leon | Spanish | French | 0.964697 |
3263 | Boulos | Arabic | Greek | 0.962321 |
16513 | Egonidis | Greek | Italian | 0.954264 |
2478 | Suchanka | Czech | Japanese | 0.949000 |
2515 | Weichert | Czech | German | 0.946457 |
5476 | Keene | English | Dutch | 0.944270 |
3511 | Jaeger | German | Dutch | 0.938891 |
3174 | Attia | Arabic | Italian | 0.935905 |
Our very simple system does great on Japanese and Russian, but relatively poorly on Vietnamese where our data is most sparse (but still much better than random).
(bigram_preds
 .assign(yes=bigram_preds.cl == bigram_preds.pred)
 .groupby('cl')
 .yes
 .mean()
 .sort_values(ascending=False))
cl
Japanese      0.866667
Russian       0.733333
Polish        0.666667
Irish         0.666667
Dutch         0.633333
Italian       0.600000
Greek         0.533333
German        0.500000
English       0.500000
Spanish       0.466667
French        0.466667
Arabic        0.433333
Czech         0.400000
Chinese       0.400000
Korean        0.366667
Vietnamese    0.233333
Name: yes, dtype: float64
from sklearn.metrics import confusion_matrix
bigram_pred
array(['Arabic', 'Irish', 'German', 'Dutch', ..., 'Russian', 'Polish', 'Irish', 'Italian'], dtype='<U10')
cm = confusion_matrix(y[valid_idx], bigram_pred, labels=y.unique())
cm
array([[11, 1, 0, 6, ..., 0, 0, 2, 1], [ 0, 18, 0, 1, ..., 2, 1, 2, 1], [ 0, 1, 20, 1, ..., 0, 0, 0, 0], [ 1, 0, 0, 26, ..., 1, 0, 0, 0], ..., [ 1, 2, 0, 1, ..., 15, 0, 1, 1], [ 0, 2, 1, 1, ..., 1, 22, 0, 0], [ 0, 2, 1, 1, ..., 0, 1, 16, 3], [ 0, 3, 1, 0, ..., 1, 1, 4, 14]])
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=90)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()
Vietnamese is often confused for Chinese (which makes sense) and Irish (which doesn't). Korean is often confused for Japanese. Spanish is often confused for Italian.
plt.figure(figsize=(12,12))
plot_confusion_matrix(cm, y.unique())
Confusion matrix, without normalization
bigram_preds[bigram_preds.cl == 'Vietnamese'].sort_values('prob').head(20)
name | cl | pred | prob | |
---|---|---|---|---|
2907 | Do | Vietnamese | Irish | 0.176679 |
2935 | Ta | Vietnamese | Japanese | 0.226875 |
2924 | Luc | Vietnamese | Vietnamese | 0.253900 |
2944 | Ton | Vietnamese | Vietnamese | 0.282859 |
2930 | Pho | Vietnamese | Dutch | 0.296166 |
2910 | Ly | Vietnamese | Russian | 0.298872 |
2915 | Doan | Vietnamese | Chinese | 0.307318 |
2916 | Dam | Vietnamese | Arabic | 0.325199 |
2906 | Dang | Vietnamese | Chinese | 0.388393 |
2900 | Pham | Vietnamese | Arabic | 0.396346 |
2928 | Nghiem | Vietnamese | English | 0.428147 |
2932 | Quach | Vietnamese | Vietnamese | 0.450314 |
2922 | Lac | Vietnamese | Irish | 0.450851 |
2938 | Thi | Vietnamese | Chinese | 0.455845 |
2902 | Hoang | Vietnamese | Korean | 0.487704 |
2927 | Mach | Vietnamese | Irish | 0.505350 |
2917 | Dao | Vietnamese | Irish | 0.512267 |
2923 | Lieu | Vietnamese | French | 0.515336 |
2939 | Than | Vietnamese | Chinese | 0.530805 |
2937 | Thai | Vietnamese | Irish | 0.580569 |
So our baseline is 53%. Let's see if we can do better with deep learning.
Load in the dataframe and extract indexes for the training, validation, and balanced training sets.
df = pd.read_csv('names_clean.csv')
valid_idx = df[df.valid].index
train_idx = df[~df.valid].index
bal_idx = []
for k, v in zip(df.index, df.bal):
    bal_idx += [k]*v
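The loop above oversamples by repeating each row index bal times (df.bal presumably holds a per-row repetition count from the earlier cleaning step). The same trick in isolation, with hypothetical counts:

```python
# Hypothetical repetition counts: row 0 appears 3x, row 1 is dropped, row 2 appears once.
counts = {0: 3, 1: 0, 2: 1}

bal_idx_toy = []
for k, v in counts.items():
    bal_idx_toy += [k] * v   # repeat index k, v times (v == 0 drops the row)
```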
As of December 2018, fastai only has word-level tokenizers; we'll have to create our own letter tokenizer. The fastai library injects BOS markers (xxbos) at the start of every string; we'll have to parse them separately.
class LetterTokenizer(BaseTokenizer):
    "Character level tokenizer function."
    def __init__(self, lang): pass

    def tokenizer(self, t:str) -> List[str]:
        out = []
        i = 0
        while i < len(t):
            if t[i:].startswith(BOS):
                out.append(BOS)
                i += len(BOS)
            else:
                out.append(t[i])
                i += 1
        return out

    def add_special_cases(self, toks:Collection[str]): pass
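The tokenizer's core logic can be seen in isolation, without fastai, as a plain function (BOS hard-coded to fastai's 'xxbos' marker): split into single characters, but keep the multi-character marker whole.

```python
BOS = 'xxbos'  # fastai's beginning-of-string marker

def letter_tokenize(t):
    """Split into single characters, but keep the BOS marker as one token."""
    out, i = [], 0
    while i < len(t):
        if t[i:].startswith(BOS):
            out.append(BOS)
            i += len(BOS)
        else:
            out.append(t[i])
            i += 1
    return out

tokens = letter_tokenize('xxbos ahn')
```

This matches what we'll see later when inspecting a tokenized training example.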
We create a vocab of all ASCII letters, and a character tokenizer that doesn't do any specific processing.
itos = [UNK, BOS] + list(string.ascii_lowercase + " -'")
vocab=Vocab(itos)
tokenizer = Tokenizer(LetterTokenizer, pre_rules=[], post_rules=[])
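The Vocab maps tokens to integer ids via its itos list, with anything outside the vocab falling back to UNK. A self-contained sketch of that numericalization (with fastai's UNK and BOS spelled out as 'xxunk'/'xxbos'):

```python
UNK, BOS = 'xxunk', 'xxbos'
itos = [UNK, BOS] + list('abcdefghijklmnopqrstuvwxyz' + " -'")
stoi = {s: i for i, s in enumerate(itos)}

def numericalize(tokens):
    # Unseen tokens (e.g. accented letters) fall back to the UNK index, 0.
    return [stoi.get(t, 0) for t in tokens]

ids = numericalize([BOS, 'a', 'h', 'n', 'é'])
```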
We can create a data pipeline using the TextDataBunch.from_df constructor. mark_fields puts an extra xxfld marker between each field of text; since we only have 1 field this is unnecessary.
train_df = df.iloc[train_idx, [0,2]]
valid_df = df.iloc[valid_idx, [0,2]]
train_df.head()
cl | ascii_name | |
---|---|---|
0 | Korean | ahn |
2 | Korean | bang |
3 | Korean | byon |
10 | Korean | gil |
11 | Korean | gu |
data = TextClasDataBunch.from_df(path='.', train_df=train_df, valid_df=valid_df,
                                 tokenizer=tokenizer, vocab=vocab,
                                 mark_fields=False)
data.show_batch()
text | target |
---|---|
v o n g r i m m e l s h a u s e n | German |
m a c e a c h t h i g h e a r n a | Irish |
c h k h a r t i s h v i l i | Russian |
t z e h m i s t r e n k o | Russian |
c h e p t y g m a s h e v | Russian |
Or we can create it using the data block API, which uses the processors to tokenize and numericalize the input.
processors = [TokenizeProcessor(tokenizer=tokenizer, mark_fields=False),
              NumericalizeProcessor(vocab=vocab)]

data = (TextList
        .from_df(df, cols=[2], processor=processors)
        .split_by_idxs(train_idx=train_idx, valid_idx=valid_idx)
        .label_from_df(cols=0)
        .databunch(bs=32))
data.show_batch()
text | target |
---|---|
v o n g r i m m e l s h a u s e n | German |
p a r a s k e v o p o u l o s | Greek |
d z h a v a h i s h v i l i | Russian |
s h a h n a z a r y a n t s | Russian |
m o g i l n i c h e n k o | Russian |
Counter(_.obj for _ in data.valid_ds.y)
Counter({'Korean': 30, 'Italian': 30, 'Polish': 30, 'Japanese': 30, 'Dutch': 30, 'Czech': 30, 'Irish': 30, 'Chinese': 30, 'Vietnamese': 30, 'Spanish': 30, 'Arabic': 30, 'German': 30, 'English': 30, 'Russian': 30, 'Greek': 30, 'French': 30})
Counter(_.obj for _ in data.train_ds.y).most_common()
[('Russian', 9232), ('English', 3329), ('Japanese', 952), ('Italian', 630), ('German', 548), ('Czech', 434), ('Dutch', 214), ('Spanish', 182), ('French', 180), ('Chinese', 170), ('Greek', 162), ('Irish', 134), ('Polish', 93), ('Arabic', 73), ('Korean', 31), ('Vietnamese', 25)]
Check that no text appears in both the validation and training sets.
valid_set = set(_.text for _ in data.valid_ds.x)
for _ in data.train_ds.x:
    assert _.text not in valid_set, _.text
trainiter = iter(data.train_dl)
batch, cl = next(trainiter)
batch2, cl2 = next(trainiter)
cl, len(cl)
(tensor([ 6, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13], device='cuda:0'), 16)
batch.shape
torch.Size([20, 16])
The first 20 letters run down the batch, front-padded with BOS (xxbos); we have 16 names across.
Somehow it looks like we also have an extra space at the beginning of each name that wasn't in the input data.
(Note this is different from what the fastai wrappers will give you; they concatenate the data and split it into 16 chunks.)
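To stack variable-length names into one rectangular batch, shorter names are padded at the front so every column has the longest name's length. A minimal sketch of that padding, using the xxbos token as the filler (as fastai appears to do here):

```python
def front_pad(seqs, pad_token='xxbos'):
    """Pad shorter sequences at the front so all rows share the longest length."""
    longest = max(len(s) for s in seqs)
    return [[pad_token] * (longest - len(s)) + s for s in seqs]

padded = front_pad([['a', 'h', 'n'], ['b', 'a', 'n', 'g']])
```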
pd.options.display.max_columns = 100
(pd.DataFrame([[vocab.itos[y] for y in x] for x in batch])
   .T
   .assign(category=[data.classes[_] for _ in cl])
   .T)
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos |
1 | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | |
2 | v | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos |
3 | o | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos |
4 | n | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos |
5 | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | |||
6 | g | t | l | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | |||||||
7 | r | c | e | t | m | b | c | z | b | p | ||||||
8 | i | h | i | c | i | a | h | h | a | a | g | s | a | b | v | v |
9 | m | a | h | h | n | k | a | e | h | t | r | h | w | a | y | i |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
11 | e | t | e | h | a | h | t | o | h | i | s | n | o | h | c | c |
12 | l | o | n | l | z | t | o | k | i | o | h | d | r | t | h | h |
13 | s | r | b | a | e | a | r | h | v | r | e | e | k | i | e | e |
14 | h | i | e | k | t | n | i | o | a | k | l | r | h | g | s | p |
15 | a | z | r | o | d | o | z | v | n | o | e | o | a | a | l | o |
16 | u | h | g | v | i | w | h | t | d | v | v | v | n | r | a | l |
17 | s | s | s | s | n | s | s | s | z | s | s | i | o | e | v | s |
18 | e | k | k | k | o | k | k | e | h | k | k | c | f | e | o | k |
19 | n | y | y | y | v | i | y | v | i | y | y | h | f | v | v | y |
category | German | Russian | Russian | Russian | Russian | Russian | Russian | Russian | Russian | Russian | Russian | Russian | Russian | Russian | Russian | Russian |
21 rows × 16 columns
[vocab.itos[_] for _ in data.train_ds[0][0].data]
['xxbos', ' ', 'a', 'h', 'n']
list(df.iloc[0,1])
['A', 'h', 'n']
Note the length of strings varies between batches.
(pd.DataFrame([[vocab.itos[y] for y in x] for x in batch2])
   .T
   .assign(category=[data.classes[_] for _ in cl2])
   .T)
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos |
1 | ||||||||||||||||
2 | b | m | r | f | m | b | d | j | m | b | k | u | m | m | a | b |
3 | e | c | i | e | a | a | e | e | o | a | i | f | a | e | n | a |
4 | n | g | d | r | s | j | m | f | r | l | n | i | k | a | s | b |
5 | i | r | g | r | a | e | a | f | a | b | c | m | i | d | e | u |
6 | t | o | w | a | o | n | k | e | n | o | h | k | o | h | l | r |
7 | e | r | a | r | k | o | i | r | d | n | i | i | k | r | m | i |
8 | z | y | y | o | a | v | s | s | i | i | n | n | a | a | i | n |
category | Spanish | English | English | Italian | Japanese | Russian | Greek | English | Italian | Italian | English | Russian | Japanese | Irish | Italian | Russian |
vocab.textify(batch2[:,0])
'xxbos b e n i t e z'
data.show_batch(ds_type=DatasetType.Valid)
text | target |
---|---|
c h r y s a n t h o p o u l o s | Greek |
v o n i n g e r s l e b e n | German |
s c h w a r z e n b e r g | Dutch |
d e s a u v e t e r r e | French |
a r e c h a v a l e t a | Spanish |
The torch nn.RNN expects the data to be one-hot encoded.
one_hot = torch.eye(len(vocab.itos))
one_hot[batch][:2]
tensor([[[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 
0., 0., 0., 0., 0., 0.]], [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.], [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]])
one_hot[batch].shape
torch.Size([20, 16, 31])
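Indexing an identity matrix by token id is a compact way to one-hot encode: row i of the identity is exactly the one-hot vector for id i. In miniature, without torch:

```python
V = 5  # vocab size (31 in the notebook)
eye = [[1.0 if i == j else 0.0 for j in range(V)] for i in range(V)]

# Indexing the identity matrix by token id yields that id's one-hot row.
encoded = [eye[i] for i in [3, 0]]
```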
Here's how we could do it without storing the one_hot matrix in memory.
def one_hot_fly(y, length=len(vocab.itos)):
    shape = list(y.shape)
    assert len(shape) == 2
    tensor = torch.zeros(shape + [length])
    for i, row in enumerate(y):
        for j, val in enumerate(row):
            tensor[i][j][val] = 1.
    return tensor
(one_hot[batch] == one_hot_fly(batch)).all()
tensor(1, dtype=torch.uint8)
Using matrix operations is ~250 times faster at this size than the double for loop.
%timeit one_hot[batch]
%timeit one_hot_fly(batch)
36.1 µs ± 2.61 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) 8.91 ms ± 210 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
n_letters = len(vocab.itos)
n_hidden = 128
n_output = df.cl.nunique()
n_letters, n_output
(31, 16)
We use an RNN to take our sequence of letters in and calculate the hidden state
rnn = nn.RNN(input_size=n_letters,
             hidden_size=n_hidden,
             num_layers=1,
             nonlinearity='relu',
             dropout=0.)
output, hidden = rnn(one_hot[batch])
output.shape, hidden.shape
(torch.Size([20, 16, 128]), torch.Size([1, 16, 128]))
lo = nn.Linear(n_hidden, n_output)
preds = lo(output)
preds.shape
torch.Size([20, 16, 16])
cl
tensor([ 6, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13], device='cuda:0')
nn.functional.softmax(preds[-1], dim=1).argmax(dim=1)
tensor([8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8])
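preds[-1] holds one raw score per class for each name; softmax exponentiates and normalizes the scores into probabilities, and argmax picks the winner. A small standalone version of that step:

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
best = probs.index(max(probs))
```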
one_hot = torch.eye(len(vocab.itos))
class MyLetterRNN(nn.Module):
    def __init__(self, dropout=0., n_layers=1, n_input=n_letters, n_hidden=n_hidden, n_output=n_output):
        super().__init__()
        self.one_hot = torch.eye(n_input).cuda()
        self.rnn = nn.RNN(input_size=n_input,
                          hidden_size=n_hidden,
                          num_layers=n_layers,
                          nonlinearity='relu',
                          dropout=dropout)
        self.lo = nn.Linear(n_hidden, n_output)

    def forward(self, input):
        rnn, _ = self.rnn(self.one_hot[input])
        out = self.lo(rnn)
        return out[-1]
rnn = MyLetterRNN().cuda()
out = rnn(batch)
out.argmax(dim=1), cl
(tensor([6, 6, 6, 6, 1, 6, 6, 1, 6, 6, 6, 1, 6, 1, 1, 6], device='cuda:0'), tensor([ 6, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13], device='cuda:0'))
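Under the hood, nn.RNN with nonlinearity='relu' applies the Elman recurrence h_t = relu(W_ih x_t + W_hh h_{t-1} + b) at every letter. A toy sketch with tiny hypothetical weight matrices (biases omitted), keeping only the final hidden state, just as MyLetterRNN keeps out[-1]:

```python
def relu(v):
    return [max(0.0, x) for x in v]

def rnn_step(x, h, W_ih, W_hh):
    """One Elman step: h_new = relu(W_ih @ x + W_hh @ h)."""
    pre = [sum(wi * xi for wi, xi in zip(row_i, x)) +
           sum(wh * hi for wh, hi in zip(row_h, h))
           for row_i, row_h in zip(W_ih, W_hh)]
    return relu(pre)

# Hypothetical 2-dim input, 2-dim hidden state.
W_ih = [[1.0, 0.0], [0.0, 1.0]]
W_hh = [[0.5, 0.0], [0.0, 0.5]]
h = [0.0, 0.0]
for x in [[1.0, 0.0], [0.0, 1.0]]:   # a length-2 "sequence" of letters
    h = rnn_step(x, h, W_ih, W_hh)
# Only this final h would feed the linear classifier.
```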
Fit the model
F.cross_entropy(out, cl)
tensor(2.7832, device='cuda:0', grad_fn=<NllLossBackward>)
learn = Learner(data, rnn, loss_func=F.cross_entropy, metrics=[accuracy])
learn.lr_find()
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
learn.recorder.plot()
learn.fit_one_cycle(10, max_lr=3e-2)
epoch | train_loss | valid_loss | accuracy |
---|---|---|---|
1 | 0.811252 | 2.653636 | 0.260417 |
2 | 0.928508 | 3.329767 | 0.216667 |
3 | 0.830531 | 3.436638 | 0.218750 |
4 | 0.947136 | 3.056552 | 0.202083 |
5 | 0.878935 | 3.361734 | 0.210417 |
6 | 0.818984 | 3.208372 | 0.214583 |
7 | 0.811538 | 2.896590 | 0.252083 |
8 | 0.745542 | 3.237130 | 0.283333 |
9 | 0.753505 | 2.819807 | 0.302083 |
10 | 0.763112 | 2.878011 | 0.297917 |
learn.lr_find()
learn.recorder.plot()
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
learn.save('char_rnn_1')
learn.fit_one_cycle(5, 3e-3)
epoch | train_loss | valid_loss | accuracy |
---|---|---|---|
1 | 0.696038 | 2.910773 | 0.304167 |
2 | 0.734545 | 2.814250 | 0.306250 |
3 | 0.634951 | 2.827829 | 0.295833 |
4 | 0.636780 | 2.758662 | 0.312500 |
5 | 0.696148 | 2.838843 | 0.312500 |
learn.save('char_rnn_1_final')
This is abysmal; 31% is much worse than 52% from the simple Naive Bayes bigram model.
Does it improve if we add another layer?
learn = Learner(data, MyLetterRNN(n_layers=2), loss_func=F.cross_entropy, metrics=[accuracy])
learn.lr_find()
learn.recorder.plot()
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
learn.fit_one_cycle(20, max_lr=1e-2)
epoch | train_loss | valid_loss | accuracy |
---|---|---|---|
1 | 0.929248 | 3.101529 | 0.189583 |
2 | 0.695901 | 2.869615 | 0.250000 |
3 | 0.745567 | 2.520683 | 0.316667 |
4 | 0.620927 | 3.530135 | 0.262500 |
5 | 0.742575 | 2.512531 | 0.318750 |
6 | 0.723677 | 2.616584 | 0.343750 |
7 | 0.839355 | 2.454891 | 0.335417 |
8 | 0.817186 | 2.794391 | 0.291667 |
9 | 0.653000 | 2.695168 | 0.302083 |
10 | 0.683367 | 2.637764 | 0.358333 |
11 | 0.611877 | 2.308675 | 0.333333 |
12 | 0.586979 | 2.296229 | 0.352083 |
13 | 0.611386 | 2.224956 | 0.381250 |
14 | 0.580687 | 2.247524 | 0.383333 |
15 | 0.512482 | 2.244857 | 0.387500 |
16 | 0.516693 | 2.303736 | 0.412500 |
17 | 0.409016 | 2.413911 | 0.412500 |
18 | 0.435291 | 2.442951 | 0.422917 |
19 | 0.386392 | 2.507006 | 0.425000 |
20 | 0.352908 | 2.518786 | 0.433333 |
It looks like the fit has converged, again at a much worse result than our Naive Bayes bigrams.
But that was trained using a balanced dataset; maybe that will help with RNNs too.
learn.recorder.plot_losses()
learn.save('char_rnn_2_p0')
prob, targ = learn.get_preds()
Counter(data.classes[_.item()] for _ in prob.argmax(dim=1)).most_common()
[('English', 141), ('Russian', 97), ('Chinese', 54), ('Italian', 43), ('Japanese', 38), ('Greek', 25), ('German', 22), ('Czech', 12), ('French', 11), ('Dutch', 11), ('Spanish', 10), ('Polish', 10), ('Korean', 4), ('Vietnamese', 2)]
Even though the balanced set is a subset of the training set (and throws away a lot of data), the model trained on it performs much better on the balanced validation set.
This is because, on the whole training set, heuristics like "when in doubt, guess Russian/English" and "it's almost never Vietnamese" work well, but they are terrible on our balanced validation set.
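We can put numbers on that heuristic: with the class counts from the earlier cell, always guessing Russian scores over half on the imbalanced training set, but can never beat 1/16 on a balanced 16-class validation set.

```python
# Class counts copied from the training-set cell above (16,389 names total).
total_train = (9232 + 3329 + 952 + 630 + 548 + 434 + 214 + 182 +
               180 + 170 + 162 + 134 + 93 + 73 + 31 + 25)
russian = 9232

train_acc = russian / total_train  # "always guess Russian" on the imbalanced training set
valid_acc = 1 / 16                 # the same heuristic on a balanced 16-class validation set
```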
data = (TextList
        .from_df(df, cols=[2], processor=processors)
        .split_by_idxs(train_idx=bal_idx, valid_idx=valid_idx)
        .label_from_df(cols=0)
        .databunch(bs=1024))
Counter(_.obj for _ in data.valid_ds.y)
Counter({'Korean': 30, 'Italian': 30, 'Polish': 30, 'Japanese': 30, 'Dutch': 30, 'Czech': 30, 'Irish': 30, 'Chinese': 30, 'Vietnamese': 30, 'Spanish': 30, 'Arabic': 30, 'German': 30, 'English': 30, 'Russian': 30, 'Greek': 30, 'French': 30})
Counter(_.obj for _ in data.train_ds.y).most_common()
[('Korean', 500), ('Italian', 500), ('Polish', 500), ('Japanese', 500), ('Dutch', 500), ('Czech', 500), ('Irish', 500), ('Chinese', 500), ('Vietnamese', 500), ('Spanish', 500), ('Arabic', 500), ('German', 500), ('English', 500), ('Russian', 500), ('Greek', 500), ('French', 500)]
(pd.DataFrame({'x': [_.text for _ in data.train_ds.x], 'y': [_.obj for _ in data.train_ds.y]})
   .groupby('y')
   .nunique()
   .sort_values('x', ascending=False))
x | y | |
---|---|---|
y | ||
Russian | 486 | 1 |
English | 459 | 1 |
Japanese | 383 | 1 |
Italian | 357 | 1 |
German | 330 | 1 |
Czech | 295 | 1 |
Dutch | 195 | 1 |
French | 172 | 1 |
Spanish | 170 | 1 |
Chinese | 158 | 1 |
Greek | 153 | 1 |
Irish | 129 | 1 |
Polish | 93 | 1 |
Arabic | 73 | 1 |
Korean | 31 | 1 |
Vietnamese | 25 | 1 |
valid_set = set(_.text for _ in data.valid_ds.x)
for _ in data.train_ds.x:
    assert _.text not in valid_set, _.text
learn = Learner(data, MyLetterRNN(n_layers=2), loss_func=F.cross_entropy, metrics=[accuracy])
learn.lr_find()
learn.recorder.plot()
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
Note that our balanced dataset is about half the size of our training dataset, which is worth keeping in mind when comparing numbers of epochs and runtimes.
len(train_idx) / len(bal_idx)
2.048625
We only get around ~51% accuracy on a balanced test set (similar to the Naive Bayes)
learn.fit_one_cycle(30, max_lr=3e-2)
epoch | train_loss | valid_loss | accuracy |
---|---|---|---|
1 | 2.773524 | 2.755105 | 0.256250 |
2 | 2.756307 | 2.651738 | 0.252083 |
3 | 2.653947 | 2.520690 | 0.222917 |
4 | 2.586744 | 2.220565 | 0.277083 |
5 | 2.418116 | 2.046903 | 0.358333 |
6 | 2.238451 | 2.456228 | 0.300000 |
7 | 2.139630 | 1.903009 | 0.404167 |
8 | 1.966441 | 1.928596 | 0.441667 |
9 | 1.810611 | 1.840151 | 0.439583 |
10 | 1.651867 | 1.802120 | 0.445833 |
11 | 1.560050 | 1.871820 | 0.450000 |
12 | 1.477005 | 1.901750 | 0.479167 |
13 | 1.461073 | 2.092149 | 0.406250 |
14 | 1.403493 | 2.019508 | 0.462500 |
15 | 1.279949 | 1.945947 | 0.470833 |
16 | 1.140172 | 2.048897 | 0.504167 |
17 | 0.996752 | 2.194067 | 0.472917 |
18 | 0.863288 | 2.319874 | 0.500000 |
19 | 0.765463 | 2.229700 | 0.491667 |
20 | 0.672631 | 2.348388 | 0.516667 |
21 | 0.577968 | 2.426682 | 0.506250 |
22 | 0.488059 | 2.629519 | 0.508333 |
23 | 0.406027 | 2.711037 | 0.520833 |
24 | 0.335120 | 2.838989 | 0.512500 |
25 | 0.274594 | 2.929183 | 0.510417 |
26 | 0.225410 | 3.002161 | 0.506250 |
27 | 0.185401 | 3.071880 | 0.506250 |
28 | 0.154014 | 3.097990 | 0.506250 |
29 | 0.130493 | 3.111022 | 0.510417 |
30 | 0.113012 | 3.112732 | 0.510417 |
It's starting to overfit and so could perhaps do with some regularization.
learn.recorder.plot_losses()
This model is a little worse in accuracy than the Naive Bayes bigram model.
But our neural network is much more computationally intensive and has about 4 times as many parameters!
sum(len(_) for _ in learn.model.parameters())
1056
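A caveat on that count: len(p) on a tensor returns only the size of its first dimension, so 1056 is a sum of leading dimensions, not the true parameter total; in PyTorch, sum(p.numel() for p in model.parameters()) would give the real figure. The distinction in miniature, with a nested list standing in for one weight tensor:

```python
# A 128x31 weight "matrix" as nested lists, standing in for one RNN parameter.
weight = [[0.0] * 31 for _ in range(128)]

rows = len(weight)                          # what len(p) reports: 128
elements = sum(len(row) for row in weight)  # what p.numel() reports: 128 * 31
```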
Adding 50% dropout increases our accuracy a little above what we got with Naive Bayes; to 55%.
learn = Learner(data, MyLetterRNN(n_layers=2, dropout=0.5), loss_func=F.cross_entropy, metrics=[accuracy])
learn.fit_one_cycle(30, max_lr=3e-2)
epoch | train_loss | valid_loss | accuracy |
---|---|---|---|
1 | 2.770683 | 2.746132 | 0.206250 |
2 | 2.737838 | 2.597004 | 0.302083 |
3 | 2.625530 | 2.253647 | 0.272917 |
4 | 2.453825 | 2.077854 | 0.350000 |
5 | 2.315888 | 1.959774 | 0.372917 |
6 | 2.161286 | 3.241577 | 0.212500 |
7 | 2.172942 | 1.954330 | 0.412500 |
8 | 2.082301 | 1.910599 | 0.429167 |
9 | 1.960596 | 1.690655 | 0.493750 |
10 | 1.844915 | 1.704401 | 0.495833 |
11 | 1.749084 | 1.702470 | 0.462500 |
12 | 1.649408 | 1.623091 | 0.491667 |
13 | 1.574916 | 1.608889 | 0.506250 |
14 | 1.477580 | 1.634902 | 0.529167 |
15 | 1.378102 | 1.681566 | 0.497917 |
16 | 1.285141 | 1.700663 | 0.506250 |
17 | 1.198917 | 1.739364 | 0.510417 |
18 | 1.111936 | 1.807743 | 0.520833 |
19 | 1.044053 | 1.796086 | 0.495833 |
20 | 0.981998 | 1.776384 | 0.522917 |
21 | 0.910434 | 1.867425 | 0.522917 |
22 | 0.834056 | 1.867015 | 0.522917 |
23 | 0.770921 | 1.860593 | 0.537500 |
24 | 0.708352 | 1.918845 | 0.543750 |
25 | 0.653929 | 1.950722 | 0.522917 |
26 | 0.603225 | 1.974899 | 0.531250 |
27 | 0.564611 | 1.989453 | 0.535417 |
28 | 0.534487 | 2.007739 | 0.539583 |
29 | 0.512126 | 2.006823 | 0.545833 |
30 | 0.493982 | 2.006345 | 0.547917 |
learn.recorder.plot_losses()
Using our default hidden size of 128 gets 54%.
learn = Learner(data, MyLetterRNN(n_layers=2, dropout=0.5, n_hidden=128), loss_func=F.cross_entropy, metrics=[accuracy])
learn.fit_one_cycle(15, max_lr=3e-2)
epoch | train_loss | valid_loss | accuracy |
---|---|---|---|
1 | 2.769207 | 2.736875 | 0.191667 |
2 | 2.700270 | 2.432153 | 0.291667 |
3 | 2.527224 | 2.165822 | 0.352083 |
4 | 2.411032 | 2.005153 | 0.375000 |
5 | 2.233243 | 1.797880 | 0.427083 |
6 | 2.073015 | 1.878045 | 0.414583 |
7 | 1.945470 | 1.800441 | 0.425000 |
8 | 1.808628 | 1.668809 | 0.466667 |
9 | 1.670717 | 1.669584 | 0.456250 |
10 | 1.536347 | 1.619124 | 0.514583 |
11 | 1.387313 | 1.561667 | 0.512500 |
12 | 1.237852 | 1.506838 | 0.543750 |
13 | 1.109924 | 1.549464 | 0.541667 |
14 | 1.008990 | 1.554212 | 0.543750 |
15 | 0.929697 | 1.554622 | 0.543750 |
Doubling to 256 doesn't change performance.
learn = Learner(data, MyLetterRNN(n_layers=2, dropout=0.5, n_hidden=256), loss_func=F.cross_entropy, metrics=[accuracy])
learn.fit_one_cycle(15, max_lr=3e-2)
epoch | train_loss | valid_loss | accuracy |
---|---|---|---|
1 | 2.778366 | 2.746171 | 0.233333 |
2 | 2.692814 | 2.430204 | 0.266667 |
3 | 2.557370 | 2.121804 | 0.335417 |
4 | 2.400152 | 1.958430 | 0.383333 |
5 | 2.266069 | 1.900910 | 0.377083 |
6 | 2.139609 | 1.769911 | 0.437500 |
7 | 1.997191 | 1.776130 | 0.443750 |
8 | 1.849745 | 1.595855 | 0.481250 |
9 | 1.695713 | 1.665297 | 0.462500 |
10 | 1.545231 | 1.611652 | 0.491667 |
11 | 1.396923 | 1.523717 | 0.518750 |
12 | 1.269320 | 1.593659 | 0.531250 |
13 | 1.159536 | 1.635850 | 0.518750 |
14 | 1.062906 | 1.653824 | 0.529167 |
15 | 0.984885 | 1.646750 | 0.531250 |
Halving to 64 definitely hurts; 128 does seem to be a sweet spot.
learn = Learner(data, MyLetterRNN(n_layers=2, dropout=0.5, n_hidden=64), loss_func=F.cross_entropy, metrics=[accuracy])
learn.fit_one_cycle(15, max_lr=3e-2)
epoch | train_loss | valid_loss | accuracy |
---|---|---|---|
1 | 2.788145 | 2.761199 | 0.089583 |
2 | 2.748204 | 2.577398 | 0.239583 |
3 | 2.619068 | 2.174743 | 0.316667 |
4 | 2.437004 | 2.133431 | 0.362500 |
5 | 2.297863 | 1.931483 | 0.372917 |
6 | 2.146514 | 1.836011 | 0.410417 |
7 | 2.041750 | 1.737024 | 0.437500 |
8 | 1.896687 | 1.610484 | 0.477083 |
9 | 1.764543 | 1.659021 | 0.462500 |
10 | 1.635953 | 1.572796 | 0.493750 |
11 | 1.525941 | 1.614364 | 0.489583 |
12 | 1.413064 | 1.567258 | 0.497917 |
13 | 1.319074 | 1.581343 | 0.483333 |
14 | 1.237835 | 1.610361 | 0.502083 |
15 | 1.175101 | 1.607811 | 0.502083 |
Finally, 3 layers also gives a worse result.
learn = Learner(data, MyLetterRNN(n_layers=3, dropout=0.5), loss_func=F.cross_entropy, metrics=[accuracy])
learn.fit_one_cycle(30, max_lr=3e-2)
epoch | train_loss | valid_loss | accuracy |
---|---|---|---|
1 | 2.777945 | 2.767582 | 0.108333 |
2 | 2.763745 | 2.700912 | 0.177083 |
3 | 2.724584 | 2.610098 | 0.170833 |
4 | 2.610290 | 2.289089 | 0.293750 |
5 | 2.476536 | 2.088696 | 0.347917 |
6 | 2.339663 | 1.914408 | 0.404167 |
7 | 2.205508 | 1.885391 | 0.425000 |
8 | 2.120229 | 1.957143 | 0.341667 |
9 | 2.053982 | 1.699707 | 0.397917 |
10 | 1.987532 | 1.688885 | 0.441667 |
11 | 1.908170 | 1.695390 | 0.462500 |
12 | 1.848755 | 1.654914 | 0.456250 |
13 | 1.779561 | 1.624753 | 0.460417 |
14 | 1.697022 | 1.597392 | 0.475000 |
15 | 1.646311 | 1.599763 | 0.475000 |
16 | 1.582875 | 1.559585 | 0.477083 |
17 | 1.531765 | 1.559109 | 0.475000 |
18 | 1.491278 | 1.601305 | 0.462500 |
19 | 1.446864 | 1.503486 | 0.483333 |
20 | 1.387920 | 1.531969 | 0.508333 |
21 | 1.325619 | 1.495371 | 0.520833 |
22 | 1.257018 | 1.581387 | 0.531250 |
23 | 1.193253 | 1.517283 | 0.537500 |
24 | 1.138225 | 1.559087 | 0.529167 |
25 | 1.086946 | 1.572238 | 0.539583 |
26 | 1.040665 | 1.561826 | 0.525000 |
27 | 1.001137 | 1.584307 | 0.520833 |
28 | 0.970200 | 1.590697 | 0.516667 |
29 | 0.947225 | 1.590535 | 0.516667 |
30 | 0.927288 | 1.590184 | 0.516667 |
Let's build our own RNN; instead of one-hot encoding we'll use an nn.Embedding.
data = (TextList
        .from_df(df, cols=[2], processor=processors)
        .split_by_idxs(train_idx=bal_idx, valid_idx=valid_idx)
        .label_from_df(cols=0)
        .databunch(bs=1024))
valid_data_set = set(tuple(_[0].data) for _ in data.valid_ds)
for datum in data.train_ds:
    assert tuple(datum[0].data) not in valid_data_set, datum
x, y = next(iter(data.train_dl))
x.shape, y.shape
(torch.Size([19, 512]), torch.Size([512]))
x.shape[-1]
512
class Model(nn.Module):
    def __init__(self, n_input, n_hidden, n_output, bn=False):
        super().__init__()
        self.n_hidden = n_hidden
        self.i_h = nn.Embedding(n_input, n_hidden)
        self.bn = nn.BatchNorm1d(n_hidden) if bn else None
        self.o_h = nn.Linear(n_hidden, n_output)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.reset()

    def forward(self, x):
        # I'm not quite sure why the batch size seems to change to 720 in validation...
        if self.h.shape[0] != x.shape[1]:
            self.reset(x.shape[1])
        h = self.h
        x = self.i_h(x)
        for xi in x:
            h = h + xi
            h = self.h_h(h)
            h = F.relu(h)
            if self.bn:
                h = self.bn(h)
        self.h = h.detach()
        o = self.o_h(h)
        return o

    def reset(self, size=None):
        size = size or 1
        self.h = torch.zeros(size, self.n_hidden).cuda()
model = Model(n_letters, n_hidden, n_output).cuda()
learn = Learner(data, model, loss_func=F.cross_entropy, metrics=[accuracy])
learn.lr_find()
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
learn.recorder.plot()
This simple RNN seems to work better than the one we built with nn.RNN, even though we're only using one layer and haven't implemented dropout.
The big difference is that we're using an embedding layer instead of one-hot encoding, which gives us an extra bunch of parameters to fit.
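An embedding lookup is mathematically the same as multiplying a one-hot vector by the embedding matrix, just cheaper: selecting a row instead of doing a matmul. A tiny check with a hypothetical 4-token vocab and 2-dim embeddings:

```python
# A hypothetical 4x2 embedding table.
E = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

def one_hot_matmul(i, E):
    """one_hot(i) @ E: the one-hot vector selects row i of E."""
    oh = [1.0 if j == i else 0.0 for j in range(len(E))]
    return [sum(o * E[j][k] for j, o in enumerate(oh)) for k in range(len(E[0]))]

# The embedding "lookup" E[2] gives the same vector without any multiplication.
row = one_hot_matmul(2, E)
```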
learn.fit_one_cycle(20, 7e-3)
epoch | train_loss | valid_loss | accuracy |
---|---|---|---|
1 | 2.743902 | 2.623438 | 0.239583 |
2 | 2.569348 | 2.255966 | 0.333333 |
3 | 2.323031 | 1.929247 | 0.397917 |
4 | 2.109766 | 1.880335 | 0.406250 |
5 | 1.942279 | 1.689045 | 0.460417 |
6 | 1.745093 | 1.724750 | 0.452083 |
7 | 1.561630 | 1.589873 | 0.508333 |
8 | 1.374822 | 1.634860 | 0.525000 |
9 | 1.204314 | 1.628457 | 0.518750 |
10 | 1.037950 | 1.674552 | 0.552083 |
11 | 0.891771 | 1.778297 | 0.552083 |
12 | 0.772682 | 1.856418 | 0.541667 |
13 | 0.666088 | 1.887347 | 0.581250 |
14 | 0.572181 | 1.933961 | 0.545833 |
15 | 0.487959 | 2.034596 | 0.568750 |
16 | 0.415125 | 2.025807 | 0.556250 |
17 | 0.356673 | 2.072717 | 0.558333 |
18 | 0.310153 | 2.125560 | 0.560417 |
19 | 0.271913 | 2.125447 | 0.560417 |
20 | 0.244666 | 2.124661 | 0.558333 |
learn.save('rnn-bal-1')
data.classes
['Arabic', 'Chinese', 'Czech', 'Dutch', 'English', 'French', 'German', 'Greek', 'Irish', 'Italian', 'Japanese', 'Korean', 'Polish', 'Russian', 'Spanish', 'Vietnamese']
Let's save the data classes; this will be useful if we want to make predictions.
with open('data.classes', 'wb') as f:
    pickle.dump(data.classes, f)
Let's save the model data directly.
with open('models/rnn-bal-1.model', 'wb') as f:
    pickle.dump(model.state_dict(), f)
And read it back in.
with open('models/rnn-bal-1.model', 'rb') as f:
    state = pickle.load(f)
model.load_state_dict(state)
model = Model(n_letters, n_hidden, n_output, bn=True).cuda()
learn = Learner(data, model, loss_func=F.cross_entropy, metrics=[accuracy])
learn.lr_find()
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
Batch norm makes the learning surface much smoother.
learn.recorder.plot()
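What batch norm does per feature can be sketched in a few lines: subtract the batch mean and divide by the batch standard deviation (eps guards against division by zero). The real layer also learns an affine scale and shift, omitted in this sketch.

```python
import math

def batch_norm_1d(xs, eps=1e-5):
    """Normalize a batch of scalars to roughly zero mean, unit variance."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]

normed = batch_norm_1d([1.0, 2.0, 3.0, 4.0])
```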
learn.fit_one_cycle(20, 3e-2)
epoch | train_loss | valid_loss | accuracy |
---|---|---|---|
1 | 2.507374 | 2.190547 | 0.354167 |
2 | 2.235302 | 1.959421 | 0.387500 |
3 | 2.003024 | 1.930297 | 0.445833 |
4 | 1.824707 | 1.977895 | 0.454167 |
5 | 1.707330 | 2.065223 | 0.470833 |
6 | 1.562633 | 2.085219 | 0.481250 |
7 | 1.450853 | 2.300199 | 0.481250 |
8 | 1.306937 | 2.536774 | 0.475000 |
9 | 1.191327 | 2.522710 | 0.483333 |
10 | 1.058434 | 2.229035 | 0.533333 |
11 | 0.928646 | 2.419605 | 0.533333 |
12 | 0.800899 | 2.562785 | 0.514583 |
13 | 0.701665 | 2.489831 | 0.533333 |
14 | 0.613110 | 2.690704 | 0.522917 |
15 | 0.528683 | 2.681259 | 0.537500 |
16 | 0.458904 | 2.672714 | 0.539583 |
17 | 0.393891 | 2.499606 | 0.541667 |
18 | 0.341363 | 2.683054 | 0.545833 |
19 | 0.300728 | 2.611176 | 0.547917 |
20 | 0.272150 | 2.567088 | 0.550000 |
In this case we actually get a very similar fit.
learn.recorder.plot_losses()
Adding a little regularisation using weight decay seems to help; we get 59% accuracy.
Maybe dropout could help more.
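As a sketch of what that might look like (hypothetical: this DropoutModel variant isn't part of the notebook, and p=0.2 is a guess that would need tuning), we could apply dropout to the hidden state at each step:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DropoutModel(nn.Module):
    """Hypothetical variant of this notebook's Model with dropout on the hidden state."""
    def __init__(self, n_input, n_hidden, n_output, p=0.2):
        super().__init__()
        self.n_hidden = n_hidden
        self.i_h = nn.Embedding(n_input, n_hidden)   # letter embedding
        self.h_h = nn.Linear(n_hidden, n_hidden)     # hidden-to-hidden
        self.drop = nn.Dropout(p)                    # regularises the hidden state
        self.o_h = nn.Linear(n_hidden, n_output)     # hidden-to-output
        self.reset()

    def forward(self, x):
        if self.h.shape[0] != x.shape[1]:
            self.reset(x.shape[1])
        h = self.h
        for xi in self.i_h(x):           # iterate over the sequence dimension
            h = F.relu(self.h_h(h + xi))
            h = self.drop(h)             # identity in eval() mode
        self.h = h.detach()
        return self.o_h(h)

    def reset(self, size=1):
        self.h = torch.zeros(size, self.n_hidden)
```

Dropout and weight decay regularise differently (zeroing activations vs. shrinking weights), so they can be combined.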
model = Model(n_letters, n_hidden, n_output, bn=True).cuda()
learn = Learner(data, model, loss_func=F.cross_entropy, metrics=[accuracy])
learn.fit_one_cycle(20, 1e-2, wd=0.1)
epoch | train_loss | valid_loss | accuracy |
---|---|---|---|
1 | 2.683858 | 2.500692 | 0.277083 |
2 | 2.455149 | 2.113877 | 0.352083 |
3 | 2.241700 | 1.903266 | 0.406250 |
4 | 2.017066 | 1.779788 | 0.454167 |
5 | 1.849892 | 1.776859 | 0.493750 |
6 | 1.691735 | 1.798783 | 0.479167 |
7 | 1.523797 | 1.714493 | 0.520833 |
8 | 1.337385 | 1.671328 | 0.541667 |
9 | 1.208459 | 1.819937 | 0.535417 |
10 | 1.074747 | 1.773041 | 0.535417 |
11 | 0.935993 | 1.755780 | 0.568750 |
12 | 0.815807 | 1.749104 | 0.562500 |
13 | 0.713215 | 1.812801 | 0.566667 |
14 | 0.620390 | 1.839610 | 0.575000 |
15 | 0.550001 | 1.766414 | 0.591667 |
16 | 0.485581 | 1.801695 | 0.597917 |
17 | 0.429470 | 1.961521 | 0.587500 |
18 | 0.386420 | 1.903309 | 0.597917 |
19 | 0.346661 | 1.830627 | 0.593750 |
20 | 0.319924 | 1.856341 | 0.593750 |
How does fastai's built-in learner compare?
How long are the names?
df.ascii_name.str.len().describe()
count 16869.000000 mean 7.409509 std 2.050366 min 2.000000 25% 6.000000 50% 7.000000 75% 9.000000 max 18.000000 Name: ascii_name, dtype: float64
learn = text_classifier_learner(data, bptt=30)
learn.lr_find()
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
learn.recorder.plot(skip_end=10)
Note we don't necessarily expect this to do well, because its parameters are tuned for processing medium-sized documents a word at a time.
However it reaches 67%, which far outperforms our RNN model without any parameter tuning.
learn.fit_one_cycle(20, max_lr=7e-3)
epoch | train_loss | valid_loss | accuracy |
---|---|---|---|
1 | 2.760123 | 2.778821 | 0.062500 |
2 | 2.630543 | 2.745926 | 0.075000 |
3 | 2.513042 | 2.573858 | 0.160417 |
4 | 2.355199 | 2.121467 | 0.300000 |
5 | 2.195587 | 1.810203 | 0.387500 |
6 | 2.012244 | 1.566222 | 0.472917 |
7 | 1.909551 | 1.693872 | 0.460417 |
8 | 1.752974 | 1.615144 | 0.527083 |
9 | 1.620254 | 1.322627 | 0.581250 |
10 | 1.416614 | 1.251798 | 0.625000 |
11 | 1.226631 | 1.297575 | 0.610417 |
12 | 1.056211 | 1.230383 | 0.641667 |
13 | 0.922565 | 1.201090 | 0.664583 |
14 | 0.791524 | 1.235106 | 0.656250 |
15 | 0.657713 | 1.220895 | 0.683333 |
16 | 0.552160 | 1.252036 | 0.666667 |
17 | 0.460102 | 1.207947 | 0.666667 |
18 | 0.411827 | 1.196497 | 0.670833 |
19 | 0.370949 | 1.197413 | 0.668750 |
20 | 0.330069 | 1.188975 | 0.677083 |
learn.recorder.plot_losses()
learn.save('fastai_bal')
From the IMDB example we know for word level data pretraining the encoder gives much better results (albeit on much bigger datasets). Let's see if it improves things here.
data_lm = (TextList
           .from_df(df, cols=[2], processor=processors)
           .random_split_by_pct(0.1)
           .label_for_lm()
           .databunch(bs=32))
data_lm.show_batch()
idx | text |
---|---|
0 | z h i h a r e v i t c h xxbos s m o l a k xxbos n o s c h e n k o xxbos c r o w n xxbos t o k a e v xxbos o r i o l xxbos d j a n i b e |
1 | p i s k o t i n xxbos o ' c a l l a g h a n n xxbos e o g h a n xxbos e n o k i xxbos s h a n a u r i n xxbos c h k h a r t i s h v i l |
2 | j i m a xxbos t z a g o l o v xxbos l i c h m a n xxbos c o w l e y xxbos b a g d a s a r o f f xxbos w a t e r f i e l d xxbos n e l l i xxbos |
3 | t a l m u d xxbos m a r t z e n k o xxbos r i p l e y xxbos z a v o r i n xxbos g e i g e r xxbos v r a z e l xxbos r e y e r xxbos r o |
4 | o v s a e v xxbos g r e a v e s xxbos r e k u n xxbos y u z v i s h i n xxbos t c h e k m a s o v xxbos s o n e xxbos g r u s h e t s k y xxbos |
learn = language_model_learner(data_lm, drop_mult=0.5)
learn.lr_find()
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
learn.recorder.plot(skip_end=10)
learn.fit_one_cycle(4, max_lr=1e-2)
epoch | train_loss | valid_loss | accuracy |
---|---|---|---|
1 | 2.655349 | 2.249993 | 0.325342 |
2 | 2.187761 | 1.975275 | 0.401822 |
3 | 2.007302 | 1.886618 | 0.426303 |
4 | 1.868937 | 1.847373 | 0.435982 |
learn.save('letter_lang')
learn.save_encoder('letter_enc')
TEXT = "ho"
N_WORDS = 4
N_SENTENCES = 5
print("\n".join(learn.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))
ho u s i m ho n o b a ho n e r e ho n a r a ho v a b e
TEXT = "tr"
print("\n".join(learn.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))
tr i n o r tr e r o e tr e n e n tr i r i s tr u p u t
learn = text_classifier_learner(data, bptt=30)
learn.load_encoder('letter_enc')
learn.lr_find()
learn.recorder.plot()
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
In this case pretraining the encoder gives a worse result.
Maybe it's because the language model was on the entire (unbalanced) dataset? Or wasn't well trained enough?
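One hypothetical way to test the first idea would be to balance the dataframe before building data_lm. This balance_classes helper is not the notebook's actual balancing code (which isn't shown here), just a sketch using pandas resampling:

```python
import pandas as pd

# Hypothetical sketch: resample so every class contributes the same number of
# names (sampling with replacement, so smaller classes are upsampled).
def balance_classes(df, n=1000, seed=42):
    return (df.groupby('cl', group_keys=False)
              .apply(lambda g: g.sample(n, replace=True, random_state=seed))
              .reset_index(drop=True))
```

The balanced frame could then be fed to TextList.from_df in place of df when building the language-model databunch.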
learn.fit_one_cycle(20, max_lr=2e-2)
epoch | train_loss | valid_loss | accuracy |
---|---|---|---|
1 | 2.623937 | 2.702015 | 0.302083 |
2 | 2.410758 | 2.412743 | 0.337500 |
3 | 2.200685 | 1.927255 | 0.393750 |
4 | 2.046425 | 2.312107 | 0.316667 |
5 | 1.865316 | 2.074282 | 0.393750 |
6 | 1.777486 | 2.281461 | 0.360417 |
7 | 1.689011 | 2.557259 | 0.302083 |
8 | 1.596411 | 2.346404 | 0.370833 |
9 | 1.566185 | 2.441514 | 0.341667 |
10 | 1.473553 | 1.770901 | 0.429167 |
11 | 1.411398 | 1.677306 | 0.502083 |
12 | 1.352370 | 1.966482 | 0.404167 |
13 | 1.323129 | 3.021722 | 0.310417 |
14 | 1.271584 | 2.182109 | 0.389583 |
15 | 1.224382 | 1.864778 | 0.450000 |
16 | 1.196533 | 1.896098 | 0.447917 |
17 | 1.156429 | 1.960691 | 0.435417 |
18 | 1.108229 | 1.840390 | 0.456250 |
19 | 1.136375 | 2.003758 | 0.429167 |
20 | 1.125715 | 1.961390 | 0.443750 |
With a bit of tuning we can make a much smaller model that trains faster and is almost as good.
learn = text_classifier_learner(data, bptt=30, emb_sz=200, nh=300, nl=2)
learn.fit_one_cycle(15, max_lr=1e-2, moms=(0.2, 0.1))
epoch | train_loss | valid_loss | accuracy |
---|---|---|---|
1 | 2.748441 | 2.779718 | 0.062500 |
2 | 2.560826 | 2.709974 | 0.122917 |
3 | 2.413826 | 2.418473 | 0.325000 |
4 | 2.248409 | 1.827642 | 0.397917 |
5 | 2.131097 | 1.928447 | 0.385417 |
6 | 2.005232 | 1.580826 | 0.510417 |
7 | 1.831701 | 1.488690 | 0.510417 |
8 | 1.711539 | 1.291139 | 0.600000 |
9 | 1.528764 | 1.403310 | 0.572917 |
10 | 1.353717 | 1.199514 | 0.627083 |
11 | 1.200141 | 1.201450 | 0.658333 |
12 | 1.080066 | 1.182234 | 0.641667 |
13 | 0.959931 | 1.155729 | 0.650000 |
14 | 0.873939 | 1.152237 | 0.656250 |
15 | 0.815623 | 1.167501 | 0.654167 |
learn.save('fastai_min')
learn.load('fastai_min')
None
prob, target, losses = learn.get_preds(with_loss=True)
pred = np.array([data.classes[_] for _ in prob.argmax(dim=1)])
target = np.array([data.classes[_] for _ in target])
x, y = list(learn.data.valid_dl)[0]
y = np.array([data.classes[_] for _ in y])
len(y), len(prob)
(480, 480)
names = np.array([''.join([vocab.itos[x] for x in l if x != 1][1:]) for l in zip(*x)])
I certainly think we could do better, but let's call it good enough.
loss_val, idx = losses.topk(10)
list(zip(names[idx], pred[idx], target[idx], loss_val))
[('simonis', 'Greek', 'Dutch', tensor(6.7631)), ('dam', 'Korean', 'Vietnamese', tensor(6.4027)), ('jelen', 'Dutch', 'Polish', tensor(6.1876)), ('cha', 'Vietnamese', 'Korean', tensor(6.1856)), ('hayden', 'Dutch', 'Irish', tensor(6.1785)), ('chmiel', 'French', 'Polish', tensor(5.8600)), ('blanxart', 'English', 'Spanish', tensor(5.8351)), ('chicken', 'Dutch', 'Czech', tensor(5.6187)), ('attia', 'Spanish', 'Arabic', tensor(5.4960)), ('ton', 'Korean', 'Vietnamese', tensor(5.4933))]
confuse = sklearn.metrics.confusion_matrix(target, pred, labels=data.classes)
def most_confused(n):
    top = []
    for i, row in enumerate(confuse):
        for j, cell in enumerate(row):
            if i == j:
                continue
            if cell >= n:
                top.append([data.classes[i], data.classes[j], cell])
    return sorted(top, key=lambda x: x[2], reverse=True)
Most of the confusion is between similar language families, which is a good sign:
most_confused(3)
[['Vietnamese', 'Chinese', 13], ['Korean', 'Chinese', 11], ['Spanish', 'Italian', 8], ['Polish', 'Czech', 6], ['Chinese', 'Korean', 4], ['Czech', 'German', 4], ['English', 'French', 4], ['English', 'German', 4], ['German', 'English', 4], ['Vietnamese', 'Korean', 4], ['Arabic', 'German', 3], ['Czech', 'Spanish', 3], ['Dutch', 'English', 3], ['English', 'Dutch', 3], ['German', 'Dutch', 3], ['Irish', 'Chinese', 3], ['Korean', 'Vietnamese', 3], ['Polish', 'German', 3], ['Spanish', 'French', 3], ['Vietnamese', 'Irish', 3]]
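Another useful view of the confusion matrix is per-class recall: the diagonal (correct predictions) divided by each row's total, since sklearn's convention puts true classes on the rows. This per_class_recall helper is an addition, not from the notebook:

```python
import numpy as np

# Per-class recall from a confusion matrix whose rows are true classes:
# correct counts on the diagonal divided by the total for each true class.
def per_class_recall(confuse):
    confuse = np.asarray(confuse, dtype=float)
    return np.diag(confuse) / confuse.sum(axis=1)
```

Low-recall classes are the ones the top-confusions list will mostly be drawn from.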
plt.figure(figsize=(6,6))
plot_confusion_matrix(confuse, data.classes)
Confusion matrix, without normalization
Let's set everything up from scratch, the way we would when deploying it in an external app.
from fastai import *
from fastai.text import *
from unidecode import unidecode
import string
with open('data.classes', 'rb') as f:
    classes = pickle.load(f)
classes
['Arabic', 'Chinese', 'Czech', 'Dutch', 'English', 'French', 'German', 'Greek', 'Irish', 'Italian', 'Japanese', 'Korean', 'Polish', 'Russian', 'Spanish', 'Vietnamese']
class LetterTokenizer(BaseTokenizer):
    "Character level tokenizer function."
    def __init__(self, lang): pass
    def tokenizer(self, t:str) -> List[str]:
        t = unidecode(t).lower()  ## Decode in tokenizer (ideally would be a separate preprocessor)
        out = []
        i = 0
        while i < len(t):
            if t[i:].startswith(BOS):
                out.append(BOS)
                i += len(BOS)
            else:
                out.append(t[i])
                i += 1
        return out
    def add_special_cases(self, toks:Collection[str]): pass
itos = [UNK, BOS] + list(string.ascii_lowercase + " -'")
vocab=Vocab(itos)
tokenizer=Tokenizer(LetterTokenizer, pre_rules=[], post_rules=[])
empty = pd.DataFrame({'text':'', 'cl':classes})
empty
text | cl | |
---|---|---|
0 | Arabic | |
1 | Chinese | |
2 | Czech | |
3 | Dutch | |
4 | English | |
5 | French | |
6 | German | |
7 | Greek | |
8 | Irish | |
9 | Italian | |
10 | Japanese | |
11 | Korean | |
12 | Polish | |
13 | Russian | |
14 | Spanish | |
15 | Vietnamese |
processors = [TokenizeProcessor(tokenizer=tokenizer, mark_fields=False),
              NumericalizeProcessor(vocab=vocab)]
data = TextList.from_df(empty, processor=processors).no_split().label_from_df(cols='cl').databunch(bs=2)
learn = text_classifier_learner(data, bptt=30, emb_sz=200, nh=300, nl=2)
learn.load('fastai_min')
None
Check it's not in the training set
!grep -ir '^Wu' data/names
learn.predict('Wu') # Chinese
(Category Chinese, tensor(1), tensor([1.5628e-04, 8.2566e-01, 1.5067e-04, 7.8611e-04, 5.4162e-03, 6.0296e-06, 1.6915e-03, 7.5540e-05, 1.2069e-03, 2.6427e-05, 2.3748e-03, 1.0973e-01, 4.8861e-02, 1.4435e-04, 4.1366e-05, 3.6734e-03]))
def predictions(name):
    return sorted(zip(classes, (_.item() for _ in learn.predict(name)[2])),
                  key=lambda x: x[1], reverse=True)
How does it do in practice?
predictions("Wojtyła")[:5] # Polish
[('Polish', 0.9237067699432373), ('Czech', 0.0443655289709568), ('Vietnamese', 0.00837066862732172), ('Spanish', 0.006321582943201065), ('Chinese', 0.0038041360676288605)]
predictions("Dvořák")[:5] # Czech
[('Czech', 0.866197407245636), ('Polish', 0.10485668480396271), ('Russian', 0.026793140918016434), ('Korean', 0.00042316922917962074), ('Japanese', 0.00041837719618342817)]
predictions("Gaddafi")[:5] # Arabic
[('Italian', 0.8921791911125183), ('Russian', 0.0403725765645504), ('Japanese', 0.025237590074539185), ('Spanish', 0.015365079045295715), ('Arabic', 0.01298986654728651)]
predictions('Goethe')[:5] # German
[('Dutch', 0.35977253317832947), ('German', 0.21167784929275513), ('French', 0.18858234584331512), ('English', 0.1542830914258957), ('Irish', 0.02631647326052189)]
Sometimes it does badly even when the name is in the source data (it may not have ended up in the training split).
!grep -Er 'Pascal|Pham' data/names
data/names/Korean.txt:Kim data/names/Japanese.txt:Kimio data/names/Japanese.txt:Kimiyama data/names/Japanese.txt:Kimura data/names/Vietnamese.txt:Pham data/names/Vietnamese.txt:Kim data/names/English.txt:Kimber data/names/English.txt:Kimble data/names/French.txt:Pascal
predictions("Pascal")[:5] # French
[('Spanish', 0.9103199243545532), ('Italian', 0.07529251277446747), ('Polish', 0.005148978438228369), ('Greek', 0.0036756773479282856), ('Czech', 0.0034040361642837524)]
predictions("Pham")[:5] # Vietnamese
[('Dutch', 0.2937869727611542), ('Vietnamese', 0.13536876440048218), ('English', 0.11861108988523483), ('French', 0.07520488649606705), ('Irish', 0.0743907243013382)]
But sometimes it gets it right
!grep -ir '^Meijer' data/names
predictions("Meijer")[:5] # Dutch
[('Dutch', 0.9885940551757812), ('German', 0.008121304214000702), ('Czech', 0.0013009845279157162), ('Korean', 0.00039360582013614476), ('English', 0.0003091120161116123)]
predictions('Wójcik')[:5] # Polish
[('Polish', 0.9900423288345337), ('Czech', 0.004603931214660406), ('Chinese', 0.002563303103670478), ('Korean', 0.0009222071967087686), ('Dutch', 0.0005194866680540144)]
This model is not bad; but definitely sub-human.
What does it think about our ambiguous "Michel"?
predictions('Michel')[:7]
[('Czech', 0.6636908054351807), ('German', 0.11761205643415451), ('English', 0.061124321073293686), ('Irish', 0.04792550206184387), ('Polish', 0.027553826570510864), ('French', 0.026092179119586945), ('Dutch', 0.020014718174934387)]
class Model(nn.Module):
    def __init__(self, n_input, n_hidden, n_output, bn=False, use_cuda=False):
        super().__init__()
        self.i_h = nn.Embedding(n_input, n_hidden)
        self.bn = nn.BatchNorm1d(n_hidden) if bn else None
        self.o_h = nn.Linear(n_hidden, n_output)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.use_cuda = use_cuda
        self.reset()
    def forward(self, x):
        # I'm not quite sure why the batch size seems to change to 720 in validation...
        if self.h.shape[0] != x.shape[1]:
            self.reset(x.shape[1])
        h = self.h
        x = self.i_h(x)
        for xi in x:
            h += xi
            h = self.h_h(h)
            h = F.relu(h)
            if self.bn:
                h = self.bn(h)
        self.h = h.detach()
        o = self.o_h(h)
        return o
    def reset(self, size=None):
        size = size or 1
        self.h = torch.zeros(size, n_hidden)
        if self.use_cuda:
            self.h = self.h.cuda()
n_letters = len(vocab.itos)
n_hidden = 128
n_output = len(classes)
model = Model(n_letters, n_hidden, n_output)
with open('models/rnn-bal-1.model', 'rb') as f:
    state = pickle.load(f)
model.load_state_dict(state)
model = model.cpu()
model = model.eval()
for param in model.parameters():
    param.requires_grad = False
name = 'Wójcik' # Polish
decode = BOS + unidecode(name)
decode
'xxbosWojcik'
tokens = tokenizer.process_all([decode])[0]
tokens
['xxbos', 'w', 'o', 'j', 'c', 'i', 'k']
nums = vocab.numericalize(tokens)
nums
[1, 24, 16, 11, 4, 10, 12]
x = torch.tensor([nums]).transpose(1,0)
x
tensor([[ 1], [24], [16], [11], [ 4], [10], [12]])
result = model(x).detach()
result
tensor([[ -8.0967, -6.4112, 7.0235, 4.0332, 1.4341, -15.9867, -0.3718, -9.9789, -17.9649, -7.5314, -2.7700, -0.1450, -3.1552, 1.9744, -14.5163, -15.3465]])
probs = F.softmax(result[0], dim=0)
probs
tensor([2.5545e-07, 1.3781e-06, 9.4171e-01, 4.7340e-02, 3.5193e-03, 9.5652e-11, 5.7831e-04, 3.8891e-08, 1.3231e-11, 4.4956e-07, 5.2559e-05, 7.2556e-04, 3.5759e-05, 6.0412e-03, 4.1617e-10, 1.8145e-10])
for prob, idx in zip(*probs.topk(3)):
    print(f'{classes[idx]}: Probability {prob:0.2%}')
Czech: Probability 94.17% Dutch: Probability 4.73% Russian: Probability 0.60%
def get_probs(name):
    decode = BOS + unidecode(name)
    tokens = tokenizer.process_all([decode])[0]
    nums = vocab.numericalize(tokens)
    x = torch.tensor([nums]).transpose(1, 0)
    model.reset()
    result = model(x).detach()
    probs = F.softmax(result[0], dim=0)
    return probs

def print_top_probs(name, n=3):
    probs = get_probs(name)
    for prob, idx in zip(*probs.topk(n)):
        print(f'{classes[idx]}: Probability {prob:0.2%}')
In reality the model doesn't do great by human standards
print_top_probs('Goethe') # German
Irish: Probability 72.69% English: Probability 16.42% Japanese: Probability 3.84%
print_top_probs('Jinping') # Chinese
German: Probability 51.32% English: Probability 34.13% Chinese: Probability 9.00%
print_top_probs('Kim') # Korean
Korean: Probability 41.02% Russian: Probability 22.39% Dutch: Probability 16.08%
print_top_probs('Đặng') # Vietnamese
Korean: Probability 41.59% Chinese: Probability 34.24% Vietnamese: Probability 14.89%
print_top_probs('Zahir') # Arabic
Arabic: Probability 91.50% Russian: Probability 4.34% Czech: Probability 2.74%
It's also possible to use learn.load to load in the model, if you make some fake data. We need at least 2 rows or it will complain.
empty = pd.DataFrame([[' ']]*2)
empty
0 | |
---|---|
0 | |
1 |
processors = [TokenizeProcessor(tokenizer=tokenizer, mark_fields=False),
              NumericalizeProcessor(vocab=vocab)]
data = TextList.from_df(empty, processor=processors).no_split().label_const().databunch(bs=2)
model = Model(n_letters, n_hidden, n_output)
learn = Learner(data, model)
learn = learn.load('rnn-bal-1')
learn.model = learn.model.eval().cpu()
for param in learn.model.parameters():
    param.requires_grad = False
x, _ = data.one_item('Dvořák') # Czech
learn.model.reset()
probs = F.softmax(learn.model(x.cpu()), dim=-1)
probs
tensor([[1.2497e-05, 3.4860e-06, 5.5647e-01, 1.3278e-04, 1.5334e-03, 4.7975e-04, 7.6774e-05, 6.1160e-06, 4.5827e-08, 2.0946e-05, 1.2422e-04, 5.2584e-06, 4.1456e-01, 2.5337e-02, 1.2437e-03, 2.2771e-11]])
for prob, idx in zip(*probs[0].topk(3)):
    print(f'{classes[idx]}: Probability {prob:0.2%}')
Czech: Probability 55.65% Polish: Probability 41.46% Russian: Probability 2.53%
The idea is to dig into the representation in the 50-dimensional activation near the end of the model and use it to compare names.
Two names are similar if they are close together in this embedding space. It's not totally obvious that the RMS distance is appropriate for this, but it's what we'll use.
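The distance function d used in closest below is defined earlier in the notebook; as a minimal sketch (an assumption — the exact form used may differ), an RMS distance between two embedding vectors could look like:

```python
import numpy as np

# Sketch of a root-mean-square distance between two embedding vectors.
# The notebook's actual d may instead be a plain Euclidean or squared distance.
def rms_distance(a, b):
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.mean((a - b) ** 2)))
```

Any monotone transform of Euclidean distance gives the same nearest-neighbour ranking, so the choice mostly affects the printed values, not which names come out closest.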
from fastai.callbacks.hooks import *
df = pd.read_csv('names_clean.csv')
df.head()
cl | name | ascii_name | valid | bal | |
---|---|---|---|---|---|
0 | Korean | Ahn | ahn | False | 13 |
1 | Korean | Baik | baik | True | 0 |
2 | Korean | Bang | bang | False | 13 |
3 | Korean | Byon | byon | False | 15 |
4 | Korean | Cha | cha | True | 0 |
data = TextList.from_df(df, cols='ascii_name', processor=processors).no_split().label_from_df('cl').databunch(bs=1024)
# model = Model(n_letters, n_hidden, n_output).cuda()
# learn = Learner(data, model)
# learn = learn.load('rnn-bal-1')
learn = text_classifier_learner(data, bptt=30, emb_sz=200, nh=300, nl=2)
learn.load('fastai_min')
None
Let's look at the structure of our model
list(learn.model.named_children())
[('0', MultiBatchRNNCore( (encoder): Embedding(31, 200, padding_idx=1) (encoder_dp): EmbeddingDropout( (emb): Embedding(31, 200, padding_idx=1) ) (rnns): ModuleList( (0): WeightDropout( (module): LSTM(200, 300) ) (1): WeightDropout( (module): LSTM(300, 200) ) ) (input_dp): RNNDropout() (hidden_dps): ModuleList( (0): RNNDropout() (1): RNNDropout() ) )), ('1', PoolingLinearClassifier( (layers): Sequential( (0): BatchNorm1d(600, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (1): Dropout(p=0.4) (2): Linear(in_features=600, out_features=50, bias=True) (3): ReLU(inplace) (4): BatchNorm1d(50, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (5): Dropout(p=0.1) (6): Linear(in_features=50, out_features=16, bias=True) ) ))]
Let's capture the output of the 50-dimensional embedding layer near the end of the model
layer = 17
list(learn.model.modules())[layer]
Linear(in_features=600, out_features=50, bias=True)
def embed(x):
    # with hook_output(list(learn.model.children())[-1]) as hook_a:
    with hook_output(list(learn.model.modules())[layer]) as hook_a:
        preds = learn.predict(x)
    return hook_a.stored[0]
%time df = df.assign(embed = df.name.apply(embed))
CPU times: user 32.2 s, sys: 2.22 s, total: 34.4 s Wall time: 34.4 s
df.head()
cl | name | ascii_name | valid | bal | embed | |
---|---|---|---|---|---|---|
0 | Korean | Ahn | ahn | False | 13 | [tensor(0.7824, device='cuda:0'), tensor(2.431... |
1 | Korean | Baik | baik | True | 0 | [tensor(0., device='cuda:0'), tensor(0., devic... |
2 | Korean | Bang | bang | False | 13 | [tensor(0., device='cuda:0'), tensor(1.9753, d... |
3 | Korean | Byon | byon | False | 15 | [tensor(0., device='cuda:0'), tensor(4.9263, d... |
4 | Korean | Cha | cha | True | 0 | [tensor(0., device='cuda:0'), tensor(0., devic... |
def closest(name, n=10):
    e = embed(name)
    dist = [d(e, _) for _ in df.embed]
    for idx in np.argsort(dist)[:n]:  # take the n nearest (was hard-coded to 10)
        print(f'{df.name.iloc[idx.item()]} ({df.cl.iloc[idx.item()]}): {dist[idx]}')
It's not immediately clear in what sense these are similar, but it doesn't seem random to me
%time closest('Ahn')
Ahn (Korean): 0.0 Hor (Chinese): 74.33894348144531 Gul (Russian): 78.02725219726562 Noh (Korean): 88.23501586914062 Hon (Russian): 89.40960693359375 Ryu (Korean): 90.15037536621094 Byon (Korean): 92.82701110839844 Jermy (English): 93.4688720703125 Bishop (English): 96.69548034667969 Heron (English): 97.36801147460938 CPU times: user 1.46 s, sys: 188 ms, total: 1.65 s Wall time: 1.65 s
closest('Ruder')
Turner (English): 34.640953063964844 Raizer (Russian): 41.138641357421875 Reuter (German): 45.08790588378906 Gunter (English): 46.127220153808594 Mendel (German): 48.00371551513672 Render (English): 48.202239990234375 Raeburn (English): 52.79540252685547 Rosenberger (German): 53.13862228393555 Rebinder (Russian): 53.502662658691406 Rosser (English): 54.7049446105957
closest('Gugger')
Oelberg (German): 28.07500457763672 Mordberg (Russian): 29.14569854736328 Gramberg (Russian): 30.82413101196289 Engman (Russian): 31.429853439331055 Burman (English): 33.723426818847656 Bumgarner (German): 33.76372528076172 Egger (German): 35.34405517578125 Großer (German): 35.70815658569336 Ranger (English): 35.80017852783203 Grainger (English): 36.01886749267578
closest('Thomas')
Manus (Irish): 49.601768493652344 Jemaitis (Russian): 56.250274658203125 Horos (Russian): 63.35132598876953 Klimes (Czech): 73.13825225830078 Bertsimas (Greek): 73.65045166015625 Tsogas (Greek): 79.87809753417969 Simonis (Dutch): 85.69441223144531 Honjas (Greek): 86.7238998413086 Mihelyus (Russian): 87.06715393066406 Grotus (Russian): 88.79036712646484
closest('Ross')
East (English): 59.85662841796875 Gammer (English): 60.93962097167969 Gale (English): 61.32642364501953 Gass (German): 65.68402099609375 Abrams (English): 68.60626983642578 Groer (Russian): 72.51294708251953 Bannister (English): 72.60062408447266 Glencross (English): 73.78807830810547 Moss (English): 74.06442260742188 Gander (English): 75.61122131347656
closest('Wu')
Wan (Chinese): 55.373172760009766 Wei (Chinese): 63.00840759277344 Won (Chinese): 96.11245727539062 Gwang (Korean): 103.33255004882812 Gwock (Chinese): 118.81124114990234 Weng (Chinese): 131.87939453125 Wane (English): 141.58985900878906 Twigg (English): 147.58078002929688 Wain (English): 153.88088989257812 Gowing (English): 156.93133544921875
closest('Chebyshev')
Cheryshev (Russian): 4.010871410369873 Dobryshev (Russian): 23.908416748046875 Chalyshev (Russian): 23.938615798950195 Chanyshev (Russian): 28.119571685791016 Cherushov (Russian): 28.66720962524414 Tchanyshev (Russian): 30.515729904174805 Chehov (Russian): 30.542831420898438 Tchalyshev (Russian): 34.40039825439453 Yachmentsev (Russian): 36.298770904541016 Jerebyatiev (Russian): 36.888675689697266