This is heavily modeled on the Pytorch tutorial: https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html
We use fastai libraries extensively to make dataloading and training easier
This is a list of surnames and their ethnicities
#!wget https://download.pytorch.org/tutorial/data.zip
#!unzip -o data.zip
fastai imports pandas and all sorts of other goodies
from fastai import *
from fastai.text import *
from unidecode import unidecode
import string
Reduce the display to 20 rows to prevent tables from taking up too much of the output.
pd.options.display.max_rows = 20
Read in the data; the names for each language are in a separate file
path = Path('data/names')
!ls {path}
Arabic.txt English.txt Irish.txt Polish.txt Spanish.txt Chinese.txt French.txt Italian.txt Portuguese.txt Vietnamese.txt Czech.txt German.txt Japanese.txt Russian.txt Dutch.txt Greek.txt Korean.txt Scottish.txt
!head -n5 {path}/Arabic.txt
Khoury
Nahas
Daher
Gerges
Nazari
names = []
for p in path.glob('*.txt'):
    lang = p.name[:-4]  # strip the '.txt' extension to get the language
    with open(p) as f:
        names += [(lang, l.strip()) for l in f]
df = pd.DataFrame(names, columns=['cl', 'name'])
It's always worth doing some sanity checks on your data (even supposedly clean tutorial data).
No matter how good your model is: garbage in, garbage out.
df.head()
cl | name | |
---|---|---|
0 | Korean | Ahn |
1 | Korean | Baik |
2 | Korean | Bang |
3 | Korean | Byon |
4 | Korean | Cha |
len(df)
20074
What characters outside of ASCII letters appear in the names?
foreign_chars = Counter(_ for _ in ''.join(list(df.name)) if _ not in string.ascii_letters)
foreign_chars.most_common()
[(' ', 115), ("'", 87), ('-', 25), ('ö', 24), ('é', 22), ('í', 14), ('ó', 13), ('ä', 13), ('á', 12), ('ü', 11), ('à', 10), ('ß', 9), ('ú', 7), ('ñ', 6), ('ò', 3), ('Ś', 3), ('1', 3), (',', 3), ('è', 2), ('ã', 2), ('ù', 1), ('ì', 1), ('ż', 1), ('ń', 1), ('ł', 1), ('ą', 1), ('Ż', 1), ('/', 1), (':', 1), ('Á', 1), ('\xa0', 1), ('õ', 1), ('É', 1), ('ê', 1), ('ç', 1)]
A few of these look suspicious. (Note the use of a regular expression in contains to check for each of the characters.)
suss_chars = [':', '/', '\xa0', ',', '1']
df[df.name.str.contains('|'.join(suss_chars))]
cl | name | |
---|---|---|
2494 | Czech | Maxa/B |
2590 | Czech | Rafaj1 |
2703 | Czech | Urbanek1 |
2732 | Czech | Whitmire1 |
3214 | Chinese | Lu: |
14506 | Russian | Jevolojnov, |
15347 | Russian | Lysansky, |
15366 | Russian | Lytkin, |
18052 | Russian | To The First Page |
Most of these look like legitimate names with extra junk (except 'To The First Page'). Since it's so few names it's easiest just to drop them.
df = df[~df.name.str.contains('|'.join(suss_chars))]
Single quotes and spaces are common
df[df.name.str.contains("'| ")]
cl | name | |
---|---|---|
369 | Italian | D'ambrosio |
371 | Italian | D'amore |
372 | Italian | D'angelo |
373 | Italian | D'antonio |
374 | Italian | De angelis |
375 | Italian | De campo |
376 | Italian | De felice |
377 | Italian | De filippis |
378 | Italian | De fiore |
379 | Italian | De laurentis |
... | ... | ... |
18161 | Russian | V'Yurkov |
19061 | Russian | Zasyad'Ko |
19740 | Portuguese | D'cruz |
19741 | Portuguese | D'cruze |
19743 | Portuguese | De santigo |
19858 | French | D'aramitz |
19864 | French | De la fontaine |
19869 | French | De sauveterre |
20051 | French | St martin |
20052 | French | St pierre |
150 rows × 2 columns
Since hyphens mainly join multiple last names (and are pretty rare) we won't lose heaps by dropping them.
df[df.name.str.contains('-')]
cl | name | |
---|---|---|
2982 | Chinese | Au-Yong |
3088 | Chinese | Ou-Yang |
3089 | Chinese | Ow-Yang |
10156 | Russian | Abdank-Kossovsky |
10639 | Russian | Amet-Han |
11221 | Russian | Bagai-Ool |
11757 | Russian | Bei-Bienko |
11787 | Russian | Beknazar-Yuzbashev |
11790 | Russian | Bekovich-Cherkassky |
11904 | Russian | Bestujev-Lada |
... | ... | ... |
11952 | Russian | Bim-Bad |
12209 | Russian | Chyrgal-Ool |
13071 | Russian | Galkin-Vraskoi |
13307 | Russian | Gorbunov-Posadov |
16687 | Russian | Porai-Koshits |
17222 | Russian | Shah-Nazaroff |
17430 | Russian | Shirinsky-Shikhmatov |
17748 | Russian | Tsann-Kay-Si |
17999 | Russian | Tzann-Kay-Si |
18315 | Russian | Van-Puteren |
23 rows × 2 columns
df = df[~df.name.str.contains('-')]
Let's normalise all non-ASCII characters to ASCII equivalents.
This makes our classification problem harder in practice: any name containing a ß is almost surely German, whereas "ss" could occur in many languages. But it also reduces the set of characters we need to represent our language.
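For example, a quick illustration of what unidecode does (these two names are just for demonstration):
unidecode('Groß'), unidecode('Muñoz')
# -> ('Gross', 'Munoz')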
df['ascii_name'] = df.name.apply(unidecode)
df[df.name != df.ascii_name]
cl | name | ascii_name | |
---|---|---|---|
100 | Italian | Abbà | Abba |
112 | Italian | Abelló | Abello |
160 | Italian | Airò | Airo |
195 | Italian | Alò | Alo |
238 | Italian | Azzarà | Azzara |
300 | Italian | Bovér | Bover |
445 | Italian | Giùgovaz | Giugovaz |
461 | Italian | Làconi | Laconi |
462 | Italian | Laganà | Lagana |
463 | Italian | Lagomarsìno | Lagomarsino |
... | ... | ... | ... |
19912 | French | Géroux | Geroux |
19920 | French | Guérin | Guerin |
19924 | French | Hébert | Hebert |
19949 | French | Lécuyer | Lecuyer |
19951 | French | Lefévre | Lefevre |
19955 | French | Lémieux | Lemieux |
19960 | French | Lévêque | Leveque |
19961 | French | Lévesque | Levesque |
19965 | French | Maçon | Macon |
20047 | French | Séverin | Severin |
156 rows × 3 columns
Let's check case: I expect names to be in CamelCase.
These seem to be mistakes.
df[~df.ascii_name.str.contains("^[A-Z][^A-Z]*(?:[' -][A-Z][^A-Z]*)*$")]
cl | name | ascii_name | |
---|---|---|---|
2508 | Czech | MonkoAustria | MonkoAustria |
2677 | Czech | StrakaO | StrakaO |
3266 | Vietnamese | an | an |
df = df[df.ascii_name.str.contains("^[A-Z][^A-Z]*(?:[' -][A-Z][^A-Z]*)*$")]
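For reference, here's how that pattern breaks down (as annotated comments):
# ^[A-Z]                   starts with a capital letter
# [^A-Z]*                  followed by any number of non-capitals
# (?:[' -][A-Z][^A-Z]*)*   optionally more capitalised words joined by ', space or -
# $                        and nothing else to the end of the string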
Let's lowercase the ascii_names
df['ascii_name'] = df.ascii_name.str.lower()
Check that we've normalised correctly.
ascii_chars = Counter(''.join(list(df.ascii_name)))
ascii_chars.most_common()
[('a', 16511), ('o', 11120), ('e', 10768), ('i', 10416), ('n', 9943), ('r', 8245), ('s', 7980), ('h', 7673), ('k', 6902), ('l', 6704), ('v', 6301), ('t', 5939), ('u', 4725), ('m', 4343), ('d', 3894), ('b', 3641), ('y', 3604), ('g', 3209), ('c', 3068), ('z', 1928), ('f', 1774), ('p', 1707), ('j', 1346), ('w', 1125), (' ', 112), ('q', 98), ("'", 87), ('x', 72)]
In practice a surname could have multiple ethnicities, but we'd have to be really careful of how we use this in training.
If we end up with e.g. 'Michel' as French in the training dataset, but German in the validation set, our model has no hope of getting it right (and we may discard an actually good model).
We could handle this by:
1. treating it as a multi-label problem and predicting every ethnicity a name belongs to,
2. keeping only the most frequent ethnicity for each name, or
3. dropping the ambiguous names entirely.
Without any information about frequency we can't do (2), and (1) is a harder problem, so we'll stick to (3).
name_classes = (df
    .groupby('ascii_name')
    .nunique()
    .cl
    .sort_values(ascending=False))
name_classes.head(20)
ascii_name
michel     6
adam       5
albert     5
abel       5
martin     5
simon      5
ventura    4
costa      4
jordan     4
han        4
salomon    4
samuel     4
klein      4
franco     4
wang       4
oliver     4
garcia     3
horn       3
lim        3
rose       3
Name: cl, dtype: int64
df[df.name == 'Michel']
cl | name | ascii_name | |
---|---|---|---|
872 | Polish | Michel | michel |
2077 | Dutch | Michel | michel |
3489 | Spanish | Michel | michel |
6163 | German | Michel | michel |
8709 | English | Michel | michel |
19978 | French | Michel | michel |
About 1 in 40 of our names has multiple classes (and most of these were already ambiguous before normalisation).
len(name_classes), sum(name_classes > 1) / len(name_classes)
(17380, 0.027445339470655927)
Some names like Abel do seem to occur commonly in multiple countries.
It also seems like Korean and Chinese have a lot of overlap, as do English and Scottish. While this makes some linguistic sense, it will make it hard to build a reliable classifier.
Note that most names only occur once; so we can't pick a "most common" frequency class.
with pd.option_context('display.max_rows', 60):
print(df[df.ascii_name.isin(name_classes[name_classes > 1].index)].groupby(['ascii_name', 'cl']).count())
                       name
ascii_name cl
abel       English        1
           French         1
           German         1
           Russian        1
           Spanish        1
abello     Italian        1
           Spanish        1
abraham    English        1
           French         1
abreu      Portuguese     1
           Spanish        1
adam       English        1
           French         1
           German         1
           Irish          1
           Russian        1
adams      English        1
           Russian        1
adamson    English        1
           Russian        1
adler      English        1
           German         1
           Russian        1
aitken     English        1
           Scottish       1
albert     English        1
           French         1
           German         1
           Russian        1
           Spanish        1
...                      ...
wilson     English        1
           Scottish       1
winter     English        1
           German         1
wolf       English        1
           German         1
wong       Chinese        1
           English        1
woo        Chinese        1
           Korean         1
wood       Czech          1
           English        1
           Scottish       1
wright     English        1
           Scottish       1
yan        Chinese        2
           Russian        1
yang       Chinese        1
           English        1
           Korean         1
yim        Chinese        1
           Korean         1
you        Chinese        1
           Korean         1
young      English        1
           Scottish       1
yun        Chinese        1
           Korean         1
zambrano   Italian        1
           Spanish        1

[1051 rows x 1 columns]
Rather than finding the "right" ethnicity the easy thing to do is to remove all ambiguous cases.
df = df[~df.ascii_name.isin(name_classes[name_classes > 1].index)]
We need exactly one row per (ascii_name, cl) pair; if separate copies appear in the training and validation sets we'll get a higher validation accuracy than is reasonable.
Some names occur very frequently.
counts = df.assign(n=1).groupby(['ascii_name', 'cl']).count().sort_values('n', ascending=False)
counts.head(n=20)
name | n | ||
---|---|---|---|
ascii_name | cl | ||
tahan | Arabic | 28 | 28 |
fakhoury | Arabic | 28 | 28 |
koury | Arabic | 27 | 27 |
nader | Arabic | 27 | 27 |
sarraf | Arabic | 26 | 26 |
hadad | Arabic | 26 | 26 |
kassis | Arabic | 26 | 26 |
antar | Arabic | 26 | 26 |
shadid | Arabic | 25 | 25 |
cham | Arabic | 25 | 25 |
mifsud | Arabic | 25 | 25 |
nahas | Arabic | 24 | 24 |
gerges | Arabic | 24 | 24 |
ganim | Arabic | 23 | 23 |
tuma | Arabic | 23 | 23 |
to the first page | Russian | 23 | 23 |
atiyeh | Arabic | 23 | 23 |
malouf | Arabic | 23 | 23 |
sayegh | Arabic | 22 | 22 |
naifeh | Arabic | 22 | 22 |
Let's remove the "To The First Page" junk (probably some artifact of where the data was scraped from)
df = df[df.ascii_name != 'to the first page']
There are no multiples in English, and a lot in Arabic. This looks like a data entry artifact rather than anything meaningful.
counts.assign(multiple=counts.n > 1, rows=1).groupby('cl').sum().sort_values('n', ascending=False)
name | n | multiple | rows | |
---|---|---|---|---|
cl | ||||
Russian | 9326 | 9326 | 35.0 | 9263 |
English | 3359 | 3359 | 0.0 | 3359 |
Arabic | 1892 | 1892 | 103.0 | 103 |
Japanese | 983 | 983 | 1.0 | 982 |
Italian | 665 | 665 | 5.0 | 660 |
German | 613 | 613 | 33.0 | 578 |
Czech | 480 | 480 | 16.0 | 464 |
Dutch | 255 | 255 | 10.0 | 244 |
Chinese | 219 | 219 | 19.0 | 200 |
Spanish | 214 | 214 | 2.0 | 212 |
French | 213 | 213 | 3.0 | 210 |
Greek | 195 | 195 | 2.0 | 192 |
Irish | 170 | 170 | 6.0 | 164 |
Polish | 124 | 124 | 1.0 | 123 |
Korean | 61 | 61 | 0.0 | 61 |
Vietnamese | 56 | 56 | 1.0 | 55 |
Portuguese | 32 | 32 | 0.0 | 32 |
Scottish | 1 | 1 | 0.0 | 1 |
It makes sense to drop the duplicates and only have a single row per ascii_name and cl.
df = df.drop_duplicates(['ascii_name', 'cl'])
len(df)
16902
It's worth checking if the shortest and longest names make sense.
They look reasonable.
df.assign(len=df.name.str.len()).sort_values('len')
cl | name | ascii_name | len | |
---|---|---|---|---|
3265 | Vietnamese | An | an | 2 |
50 | Korean | Oh | oh | 2 |
1150 | Japanese | Ii | ii | 2 |
54 | Korean | Ra | ra | 2 |
3891 | Arabic | Ba | ba | 2 |
57 | Korean | Ri | ri | 2 |
69 | Korean | Si | si | 2 |
71 | Korean | So | so | 2 |
3311 | Vietnamese | To | to | 2 |
85 | Korean | Yi | yi | 2 |
... | ... | ... | ... | ... |
11475 | Russian | Bakhtchivandzhi | bakhtchivandzhi | 15 |
10191 | Russian | Abdulladzhanoff | abdulladzhanoff | 15 |
17299 | Russian | Shakhnazaryants | shakhnazaryants | 15 |
11393 | Russian | Baistryutchenko | baistryutchenko | 15 |
14965 | Russian | Katzenellenbogen | katzenellenbogen | 16 |
2228 | Dutch | Vandroogenbroeck | vandroogenbroeck | 16 |
14947 | Russian | Katsenellenbogen | katsenellenbogen | 16 |
19552 | Greek | Chrysanthopoulos | chrysanthopoulos | 16 |
2841 | Irish | Maceachthighearna | maceachthighearna | 17 |
6380 | German | Von grimmelshausen | von grimmelshausen | 18 |
16902 rows × 4 columns
The dataset is very unbalanced.
I doubt there's enough data to tackle Portuguese (which will be close to Spanish) or Scottish (which will be close to English)
df.groupby('cl').name.count().sort_values(ascending=False)
cl
Russian       9262
English       3359
Japanese       982
Italian        660
German         578
Czech          464
Dutch          244
Spanish        212
French         210
Chinese        200
Greek          192
Irish          164
Polish         123
Arabic         103
Korean          61
Vietnamese      55
Portuguese      32
Scottish         1
Name: name, dtype: int64
df[df.cl.isin(['Scottish'])]
cl | name | ascii_name | |
---|---|---|---|
3711 | Scottish | Hay | hay |
Let's remove the rarest classes; we're not likely to have enough data to guess them.
df = df[~df.cl.isin(['Scottish', 'Portuguese'])]
Note that Russian contains variant transliterations to English like Abaimoff and Abaimov (which both correspond to Абаимов).
But this doesn't quite explain its high frequency: it seems a lot more Russian data was simply collected.
(Side note: Chebyshev can also be spelt e.g. Chebychev, Tchebycheff, Tschebyschef)
df[df.cl == 'Russian']
cl | name | ascii_name | |
---|---|---|---|
10112 | Russian | Ababko | ababko |
10113 | Russian | Abaev | abaev |
10114 | Russian | Abagyan | abagyan |
10115 | Russian | Abaidulin | abaidulin |
10116 | Russian | Abaidullin | abaidullin |
10117 | Russian | Abaimoff | abaimoff |
10118 | Russian | Abaimov | abaimov |
10119 | Russian | Abakeliya | abakeliya |
10120 | Russian | Abakovsky | abakovsky |
10121 | Russian | Abakshin | abakshin |
... | ... | ... | ... |
19510 | Russian | Zolotavin | zolotavin |
19511 | Russian | Zolotdinov | zolotdinov |
19512 | Russian | Zolotenkov | zolotenkov |
19513 | Russian | Zolotilin | zolotilin |
19514 | Russian | Zolotkov | zolotkov |
19515 | Russian | Zolotnitsky | zolotnitsky |
19516 | Russian | Zolotnitzky | zolotnitzky |
19517 | Russian | Zozrov | zozrov |
19518 | Russian | Zozulya | zozulya |
19519 | Russian | Zukerman | zukerman |
9262 rows × 3 columns
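We can spot some of these transliteration pairs directly. A quick throwaway check (the names russian and pairs are made up for this snippet): names ending in 'off' whose 'ov' spelling also appears.
russian = set(df[df.cl == 'Russian'].ascii_name)
pairs = [(n, n[:-3] + 'ov') for n in sorted(russian)
         if n.endswith('off') and n[:-3] + 'ov' in russian]
pairs[:3]  # e.g. includes ('abaimoff', 'abaimov')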
We want our final model to work well on any language.
But if we pick our validation set uniformly at random from the data we're likely to get many Russian names and not many Vietnamese names, which isn't a good test of this.
So instead we'll take our validation set from an equal number from each subclass.
df = df.reset_index(drop=True)
df
cl | name | ascii_name | |
---|---|---|---|
0 | Korean | Ahn | ahn |
1 | Korean | Baik | baik |
2 | Korean | Bang | bang |
3 | Korean | Byon | byon |
4 | Korean | Cha | cha |
5 | Korean | Cho | cho |
6 | Korean | Choe | choe |
7 | Korean | Choi | choi |
8 | Korean | Chun | chun |
9 | Korean | Chweh | chweh |
... | ... | ... | ... |
16859 | French | Travere | travere |
16860 | French | Traverse | traverse |
16861 | French | Travert | travert |
16862 | French | Tremblay | tremblay |
16863 | French | Tremble | tremble |
16864 | French | Victors | victors |
16865 | French | Villeneuve | villeneuve |
16866 | French | Vipond | vipond |
16867 | French | Voclain | voclain |
16868 | French | Yount | yount |
16869 rows × 3 columns
counts = df.groupby('cl').name.count().sort_values(ascending=False)
counts
cl
Russian       9262
English       3359
Japanese       982
Italian        660
German         578
Czech          464
Dutch          244
Spanish        212
French         210
Chinese        200
Greek          192
Irish          164
Polish         123
Arabic         103
Korean          61
Vietnamese      55
Name: name, dtype: int64
valid_size = 30 # We'll pick 30 at random from each subclass
train_size = 500 # For a balanced training set we'll pick 500 at random with replacement
np.random.seed(6011)
valid_idx = []
for cl in counts.keys():
    # Random sample of size "valid_size" from each class
    valid_idx += list(df[df.cl == cl].sample(valid_size).index)
df['valid'] = False
df.loc[valid_idx, 'valid'] = True
Let's also create a balanced training set as an alternative to using everything not in validation
np.random.seed(7012)
balanced_idx = []
for cl in counts.keys():
    # Random sample of size "train_size" from each class, drawn (with replacement)
    # from the data outside the validation set
    balanced_idx += list(df[(df.cl == cl) & ~df.valid].sample(train_size, replace=True).index)
Note the balanced index contains all 25 (= 55 − 30) Vietnamese names outside of the validation set, but only 486 of the Russian names (since we sampled randomly with replacement there will be a couple of double-ups).
df.loc[balanced_idx].groupby('cl').nunique().sort_values('ascii_name', ascending=False)
cl | name | ascii_name | valid | |
---|---|---|---|---|
cl | ||||
Russian | 1 | 486 | 486 | 1 |
English | 1 | 459 | 459 | 1 |
Japanese | 1 | 383 | 383 | 1 |
Italian | 1 | 357 | 357 | 1 |
German | 1 | 330 | 330 | 1 |
Czech | 1 | 295 | 295 | 1 |
Dutch | 1 | 195 | 195 | 1 |
French | 1 | 172 | 172 | 1 |
Spanish | 1 | 170 | 170 | 1 |
Chinese | 1 | 158 | 158 | 1 |
Greek | 1 | 153 | 153 | 1 |
Irish | 1 | 129 | 129 | 1 |
Polish | 1 | 93 | 93 | 1 |
Arabic | 1 | 73 | 73 | 1 |
Korean | 1 | 31 | 31 | 1 |
Vietnamese | 1 | 25 | 25 | 1 |
Let's record our balanced set in the dataframe: this will make it easy to reload at a later point.
df['bal'] = 0
for k, v in Counter(balanced_idx).items():
    df.loc[k, 'bal'] += v
df.head()
cl | name | ascii_name | valid | bal | |
---|---|---|---|---|---|
0 | Korean | Ahn | ahn | False | 13 |
1 | Korean | Baik | baik | True | 0 |
2 | Korean | Bang | bang | False | 13 |
3 | Korean | Byon | byon | False | 15 |
4 | Korean | Cha | cha | True | 0 |
We can always retrieve the indexes from the dataframe
idx = []
for k, v in zip(df.index, df.bal):
    idx += [k]*v
sorted(balanced_idx) == idx
True
df.to_csv('names_clean.csv', index=False)
The first benchmark is random guessing/always guessing the same class.
The expected accuracy is 1/(number of classes) = 1/16 ≈ 6.25%.
df = pd.read_csv('names_clean.csv')
valid_idx = df[df.valid].index
train_idx = df[~df.valid].index
bal_idx = []
for k, v in zip(df.index, df.bal):
    bal_idx += [k]*v
Check that neither the training data nor the balanced training data contains any names from the validation set.
train_intersect_valid = sum(df.iloc[train_idx].ascii_name.isin(df.iloc[valid_idx].ascii_name))
bal_intersect_valid = sum(df.iloc[bal_idx].ascii_name.isin(df.iloc[valid_idx].ascii_name))
train_intersect_valid, bal_intersect_valid
(0, 0)
Make sure the data looks right
df.iloc[train_idx].groupby('cl').nunique().sort_values('ascii_name', ascending=False)
cl | name | ascii_name | valid | bal | |
---|---|---|---|---|---|
cl | |||||
Russian | 1 | 9232 | 9232 | 1 | 3 |
English | 1 | 3329 | 3329 | 1 | 4 |
Japanese | 1 | 952 | 952 | 1 | 5 |
Italian | 1 | 630 | 630 | 1 | 5 |
German | 1 | 548 | 548 | 1 | 5 |
Czech | 1 | 434 | 434 | 1 | 6 |
Dutch | 1 | 214 | 214 | 1 | 9 |
Spanish | 1 | 182 | 182 | 1 | 10 |
French | 1 | 180 | 180 | 1 | 9 |
Chinese | 1 | 170 | 170 | 1 | 9 |
Greek | 1 | 162 | 162 | 1 | 10 |
Irish | 1 | 134 | 134 | 1 | 10 |
Polish | 1 | 93 | 93 | 1 | 11 |
Arabic | 1 | 73 | 73 | 1 | 13 |
Korean | 1 | 31 | 31 | 1 | 13 |
Vietnamese | 1 | 25 | 25 | 1 | 16 |
df.iloc[bal_idx].groupby('cl').nunique().sort_values('ascii_name', ascending=False)
cl | name | ascii_name | valid | bal | |
---|---|---|---|---|---|
cl | |||||
Russian | 1 | 486 | 486 | 1 | 2 |
English | 1 | 459 | 459 | 1 | 3 |
Japanese | 1 | 383 | 383 | 1 | 4 |
Italian | 1 | 357 | 357 | 1 | 4 |
German | 1 | 330 | 330 | 1 | 4 |
Czech | 1 | 295 | 295 | 1 | 5 |
Dutch | 1 | 195 | 195 | 1 | 8 |
French | 1 | 172 | 172 | 1 | 8 |
Spanish | 1 | 170 | 170 | 1 | 9 |
Chinese | 1 | 158 | 158 | 1 | 8 |
Greek | 1 | 153 | 153 | 1 | 9 |
Irish | 1 | 129 | 129 | 1 | 9 |
Polish | 1 | 93 | 93 | 1 | 11 |
Arabic | 1 | 73 | 73 | 1 | 13 |
Korean | 1 | 31 | 31 | 1 | 13 |
Vietnamese | 1 | 25 | 25 | 1 | 16 |
df.iloc[valid_idx].groupby('cl').nunique().sort_values('ascii_name', ascending=False)
cl | name | ascii_name | valid | bal | |
---|---|---|---|---|---|
cl | |||||
Arabic | 1 | 30 | 30 | 1 | 1 |
Chinese | 1 | 30 | 30 | 1 | 1 |
Czech | 1 | 30 | 30 | 1 | 1 |
Dutch | 1 | 30 | 30 | 1 | 1 |
English | 1 | 30 | 30 | 1 | 1 |
French | 1 | 30 | 30 | 1 | 1 |
German | 1 | 30 | 30 | 1 | 1 |
Greek | 1 | 30 | 30 | 1 | 1 |
Irish | 1 | 30 | 30 | 1 | 1 |
Italian | 1 | 30 | 30 | 1 | 1 |
Japanese | 1 | 30 | 30 | 1 | 1 |
Korean | 1 | 30 | 30 | 1 | 1 |
Polish | 1 | 30 | 30 | 1 | 1 |
Russian | 1 | 30 | 30 | 1 | 1 |
Spanish | 1 | 30 | 30 | 1 | 1 |
Vietnamese | 1 | 30 | 30 | 1 | 1 |
Picking any one class in validation will give 1/16 = 6.25%
(df[df.valid] == 'Korean').cl.sum() / df.valid.sum()
0.0625
A reasonable way to guess a language is by the frequency of characters and pairs of characters.
For example 'cz' is very rare in English, but quite common in the Slavic languages.
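We can sanity-check that against our data; a quick look at the share of names containing 'cz' in each class (a throwaway check, not part of the modelling):
(df
 .assign(has_cz=df.ascii_name.str.contains('cz'))
 .groupby('cl')
 .has_cz.mean()
 .sort_values(ascending=False)
 .head())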
name = 'zozrov'
A function to count the occurrences of sequences of one, two or three letters (in general these sequences are called "n-grams", particularly when referring to sequences of words).
def ngrams(s, n=1):
    parts = [s[i:] for i in range(n)]  # e.g. for n=3: ['zozrov', 'ozrov', 'zrov']
    return Counter(''.join(_) for _ in zip(*parts))
ngrams(name, 1), ngrams(name, 2), ngrams(name, 3)
(Counter({'z': 2, 'o': 2, 'r': 1, 'v': 1}), Counter({'zo': 1, 'oz': 1, 'zr': 1, 'ro': 1, 'ov': 1}), Counter({'zoz': 1, 'ozr': 1, 'zro': 1, 'rov': 1}))
df = df.assign(letters=df.ascii_name.apply(ngrams))
df = df.assign(bigrams=df.ascii_name.apply(ngrams, n=2))
df = df.assign(trigrams=df.ascii_name.apply(ngrams, n=3))
df.head()
cl | name | ascii_name | valid | bal | letters | bigrams | trigrams | |
---|---|---|---|---|---|---|---|---|
0 | Korean | Ahn | ahn | False | 13 | {'a': 1, 'h': 1, 'n': 1} | {'ah': 1, 'hn': 1} | {'ahn': 1} |
1 | Korean | Baik | baik | True | 0 | {'b': 1, 'a': 1, 'i': 1, 'k': 1} | {'ba': 1, 'ai': 1, 'ik': 1} | {'bai': 1, 'aik': 1} |
2 | Korean | Bang | bang | False | 13 | {'b': 1, 'a': 1, 'n': 1, 'g': 1} | {'ba': 1, 'an': 1, 'ng': 1} | {'ban': 1, 'ang': 1} |
3 | Korean | Byon | byon | False | 15 | {'b': 1, 'y': 1, 'o': 1, 'n': 1} | {'by': 1, 'yo': 1, 'on': 1} | {'byo': 1, 'yon': 1} |
4 | Korean | Cha | cha | True | 0 | {'c': 1, 'h': 1, 'a': 1} | {'ch': 1, 'ha': 1} | {'cha': 1} |
Let's try to guess the language using Naive Bayes.
TL;DR: this is a really simple model that works quite well and will give us a good benchmark.
It applies Bayes' rule to answer questions from the data like: "given that the name contains the bigram 'ah', what's the probability it's Korean?".
The "Naive" part means we assume all these probabilities are independent (knowing a name contains 'ah' doesn't tell you anything about whether it contains 'hn'). Even though this definitely isn't true, it's often a reasonable approximation.
This makes the model really fast and simple to fit, and it often works well.
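To make that concrete: for count features, the per-class score factorises into a prior plus a sum of per-feature log probabilities. A minimal sketch of the scoring (nb_log_scores is a made-up name, not sklearn's API):
def nb_log_scores(counts, class_log_prior, feature_log_prob):
    # log P(class | x) is proportional to
    # log P(class) + sum_i x_i * log P(feature_i | class)
    # counts: (n_samples, n_features) -> scores: (n_samples, n_classes)
    return class_log_prior + counts @ feature_log_prob.T
After fitting, sklearn exposes the two learned terms as model.class_log_prior_ and model.feature_log_prob_, so this can be checked against the model's own predictions.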
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction import DictVectorizer
vd1 = DictVectorizer(sparse=False)
vd2 = DictVectorizer(sparse=False)
vd3 = DictVectorizer(sparse=False)
y = df.cl
letters = vd1.fit_transform(df.letters)
bigrams = vd2.fit_transform(df.bigrams)
trigrams = vd3.fit_transform(df.trigrams)
The letters matrix contains the number of times each of the 28 letters occurs (e.g. number of spaces, number of apostrophes, number of 'a', ...).
vd1.get_feature_names()[:10]
[' ', "'", 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
letters
array([[0., 0., 1., 0., ..., 0., 0., 0., 0.], [0., 0., 1., 1., ..., 0., 0., 0., 0.], [0., 0., 1., 1., ..., 0., 0., 0., 0.], [0., 0., 0., 1., ..., 0., 0., 1., 0.], ..., [0., 0., 0., 0., ..., 0., 0., 0., 0.], [0., 0., 0., 0., ..., 0., 0., 0., 0.], [0., 0., 1., 0., ..., 0., 0., 0., 0.], [0., 0., 0., 0., ..., 0., 0., 1., 0.]])
Similarly bigrams and trigrams contain the number of times each sequence of 2 or 3 letters occurs
vd2.get_feature_names()[:5], vd2.get_feature_names()[-5:]
([' a', ' b', ' c', ' e', ' f'], ['zu', 'zv', 'zw', 'zy', 'zz'])
letters.shape, bigrams.shape, trigrams.shape, y.shape
((16869, 28), (16869, 623), (16869, 5794), (16869,))
How good a model can we get looking at individual letters (e.g. 'z' occurs much more frequently in Chinese names than in English ones)?
letter_nb = MultinomialNB()
letter_nb.fit(letters[train_idx],y[train_idx])
bal_letter_nb = MultinomialNB()
bal_letter_nb.fit(letters[bal_idx],y[bal_idx])
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
The balanced set does much better than random, at around 33%; the full unbalanced training set only manages around 15%.
letter_pred = letter_nb.predict(letters[valid_idx])
bal_letter_pred = bal_letter_nb.predict(letters[valid_idx])
(letter_pred == y[valid_idx]).mean(), (bal_letter_pred == y[valid_idx]).mean()
(0.14791666666666667, 0.33541666666666664)
Let's write a function to test the Naive Bayes on any dataset; fitting on the whole dataset and the balanced dataset separately.
def nb(x):
    model = MultinomialNB()
    model.fit(x[train_idx], y[train_idx])
    preds = model.predict(x[valid_idx])
    acc_train = (preds == y[valid_idx]).mean()

    model = MultinomialNB()
    model.fit(x[bal_idx], y[bal_idx])
    preds = model.predict(x[valid_idx])
    acc_bal = (preds == y[valid_idx]).mean()
    return acc_train, acc_bal
nb(letters)
(0.14791666666666667, 0.33541666666666664)
Using bigrams and a balanced training set gives much better prediction performance: 53% (up from the baseline of 6.25%).
nb(bigrams)
(0.35833333333333334, 0.5291666666666667)
Adding letters doesn't make much difference (which isn't surprising, since the bigram counts already carry most of the single-letter information).
nb(np.concatenate((letters, bigrams), axis=1))
(0.3854166666666667, 0.5166666666666667)
Trigrams alone also perform worse
nb(trigrams)
(0.33958333333333335, 0.4895833333333333)
Let's try every combination with trigrams:
nb(np.concatenate((letters, trigrams), axis=1))
(0.24375, 0.5083333333333333)
nb(np.concatenate((bigrams, trigrams), axis=1))
(0.36875, 0.5416666666666666)
nb(np.concatenate((letters, bigrams, trigrams), axis=1))
(0.32916666666666666, 0.55625)
None of them significantly outperforms the simple bigram model (which has 623 features; we could probably remove some of the uncommon ones without much loss).
Let's remove the bigrams that occur fewer than twice in the balanced training set, as they carry practically no signal (there are over 100 of them).
common_bigrams = (bigrams[bal_idx].sum(axis=0)) >= 2
common_bigrams.sum()
503
common_bigram_index = [i for i, t in enumerate(common_bigrams) if t]
bigrams_min = bigrams[:, common_bigram_index]
bigrams_min.shape
(16869, 503)
bigram_model = MultinomialNB()
bigram_model.fit(bigrams_min[bal_idx], y[bal_idx])
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
We get around 53% accuracy.
bigram_pred = bigram_model.predict(bigrams_min[valid_idx])
(bigram_pred == y[valid_idx]).mean()
0.5291666666666667
bigram_prob = bigram_model.predict_proba(bigrams_min[valid_idx])
bigram_prob.max(axis=1)
bigram_preds = (df
.iloc[valid_idx]
.assign(pred = bigram_pred)[['name', 'cl', 'pred']]
.assign(prob = bigram_prob.max(axis=1)))
bigram_preds.sort_values('prob', ascending=False).head(15)
name | cl | pred | prob | |
---|---|---|---|---|
16557 | Kotsiopoulos | Greek | Greek | 1.000000 |
2012 | Rooijakker | Dutch | Dutch | 1.000000 |
16470 | Akrivopoulos | Greek | Greek | 0.999999 |
826 | Warszawski | Polish | Polish | 0.999998 |
1997 | Romeijnders | Dutch | Dutch | 0.999997 |
16478 | Antonopoulos | Greek | Greek | 0.999995 |
813 | Sokolowski | Polish | Polish | 0.999994 |
839 | Zdunowski | Polish | Polish | 0.999950 |
1996 | Romeijn | Dutch | Dutch | 0.999917 |
16497 | Chrysanthopoulos | Greek | Greek | 0.999895 |
2053 | Sneijers | Dutch | Dutch | 0.999792 |
2031 | Schwarzenberg | Dutch | German | 0.999774 |
795 | Rudawski | Polish | Polish | 0.999751 |
16715 | De sauveterre | French | French | 0.999604 |
1160 | Kawagichi | Japanese | Japanese | 0.999600 |
The names it's least confident about: they typically seem to be quite short.
bigram_preds.sort_values('prob', ascending=True).head(15)
name | cl | pred | prob | |
---|---|---|---|---|
2907 | Do | Vietnamese | Irish | 0.176679 |
24 | Mo | Korean | Japanese | 0.179534 |
47 | So | Korean | Korean | 0.188088 |
45 | Si | Korean | Greek | 0.190236 |
13775 | Prigojin | Russian | Italian | 0.191639 |
41 | Seok | Korean | French | 0.197154 |
5 | Cho | Korean | German | 0.202991 |
1091 | Isobe | Japanese | English | 0.206442 |
46 | Sin | Korean | Italian | 0.218300 |
5332 | Ingram | English | Spanish | 0.220205 |
2935 | Ta | Vietnamese | Japanese | 0.226875 |
2700 | Ban | Chinese | Vietnamese | 0.228022 |
1697 | Togo | Japanese | Japanese | 0.236658 |
4 | Cha | Korean | Irish | 0.239172 |
3445 | Graner | German | Spanish | 0.240844 |
The names it's most confidently wrong about:
bigram_preds[bigram_preds.cl != bigram_preds.pred].sort_values('prob', ascending=False).head(15)
name | cl | pred | prob | |
---|---|---|---|---|
2031 | Schwarzenberg | Dutch | German | 0.999774 |
16578 | Malihoudis | Greek | Arabic | 0.992311 |
16576 | Louverdis | Greek | French | 0.990256 |
4758 | Fairbrace | English | Irish | 0.987143 |
3743 | Spellmeyer | German | English | 0.976530 |
16468 | Adamou | Greek | Arabic | 0.973496 |
3009 | De la fuente | Spanish | French | 0.969431 |
3011 | De leon | Spanish | French | 0.964697 |
3263 | Boulos | Arabic | Greek | 0.962321 |
16513 | Egonidis | Greek | Italian | 0.954264 |
2478 | Suchanka | Czech | Japanese | 0.949000 |
2515 | Weichert | Czech | German | 0.946457 |
5476 | Keene | English | Dutch | 0.944270 |
3511 | Jaeger | German | Dutch | 0.938891 |
3174 | Attia | Arabic | Italian | 0.935905 |
Our very simple system does great on Japanese and Russian, but relatively poorly on Vietnamese where our data is most sparse (but still much better than random).
(bigram_preds
 .assign(yes=bigram_preds.cl == bigram_preds.pred)
 .groupby('cl')
 .yes
 .mean()
 .sort_values(ascending=False))
cl
Japanese      0.866667
Russian       0.733333
Polish        0.666667
Irish         0.666667
Dutch         0.633333
Italian       0.600000
Greek         0.533333
German        0.500000
English       0.500000
Spanish       0.466667
French        0.466667
Arabic        0.433333
Czech         0.400000
Chinese       0.400000
Korean        0.366667
Vietnamese    0.233333
Name: yes, dtype: float64
from sklearn.metrics import confusion_matrix
bigram_pred
array(['Arabic', 'Irish', 'German', 'Dutch', ..., 'Russian', 'Polish', 'Irish', 'Italian'], dtype='<U10')
cm = confusion_matrix(y[valid_idx], bigram_pred, labels=y.unique())
cm
array([[11, 1, 0, 6, ..., 0, 0, 2, 1], [ 0, 18, 0, 1, ..., 2, 1, 2, 1], [ 0, 1, 20, 1, ..., 0, 0, 0, 0], [ 1, 0, 0, 26, ..., 1, 0, 0, 0], ..., [ 1, 2, 0, 1, ..., 15, 0, 1, 1], [ 0, 2, 1, 1, ..., 1, 22, 0, 0], [ 0, 2, 1, 1, ..., 0, 1, 16, 3], [ 0, 3, 1, 0, ..., 1, 1, 4, 14]])
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=90)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()
Vietnamese is often confused for Chinese (which makes sense) and Irish (which doesn't). Korean is often confused for Japanese. Spanish is often confused for Italian.
plt.figure(figsize=(12,12))
plot_confusion_matrix(cm, y.unique())
Confusion matrix, without normalization
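Since our validation classes are balanced the raw counts are comparable; for unbalanced data we could instead plot per-class rates using the normalize option:
plot_confusion_matrix(cm, y.unique(), normalize=True)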
bigram_preds[bigram_preds.cl == 'Vietnamese'].sort_values('prob').head(20)
name | cl | pred | prob | |
---|---|---|---|---|
2907 | Do | Vietnamese | Irish | 0.176679 |
2935 | Ta | Vietnamese | Japanese | 0.226875 |
2924 | Luc | Vietnamese | Vietnamese | 0.253900 |
2944 | Ton | Vietnamese | Vietnamese | 0.282859 |
2930 | Pho | Vietnamese | Dutch | 0.296166 |
2910 | Ly | Vietnamese | Russian | 0.298872 |
2915 | Doan | Vietnamese | Chinese | 0.307318 |
2916 | Dam | Vietnamese | Arabic | 0.325199 |
2906 | Dang | Vietnamese | Chinese | 0.388393 |
2900 | Pham | Vietnamese | Arabic | 0.396346 |
2928 | Nghiem | Vietnamese | English | 0.428147 |
2932 | Quach | Vietnamese | Vietnamese | 0.450314 |
2922 | Lac | Vietnamese | Irish | 0.450851 |
2938 | Thi | Vietnamese | Chinese | 0.455845 |
2902 | Hoang | Vietnamese | Korean | 0.487704 |
2927 | Mach | Vietnamese | Irish | 0.505350 |
2917 | Dao | Vietnamese | Irish | 0.512267 |
2923 | Lieu | Vietnamese | French | 0.515336 |
2939 | Than | Vietnamese | Chinese | 0.530805 |
2937 | Thai | Vietnamese | Irish | 0.580569 |
So our baseline is 53%. Let's see if we can do better with deep learning.
Load in the dataframe and extract indexes for the training, validation and balanced training sets.
df = pd.read_csv('names_clean.csv')
valid_idx = df[df.valid].index
train_idx = df[~df.valid].index
bal_idx = []
for k, v in zip(df.index, df.bal):
    bal_idx += [k]*v
As of December 2018 fastai only has word-level tokenizers; we'll have to create our own letter tokenizer.
The fastai library injects BOS markers (xxbos) at the start of every string; we'll have to parse them separately.
class LetterTokenizer(BaseTokenizer):
    "Character level tokenizer function."
    def __init__(self, lang): pass

    def tokenizer(self, t:str) -> List[str]:
        out = []
        i = 0
        while i < len(t):
            # Emit the BOS marker as a single token rather than letter by letter
            if t[i:].startswith(BOS):
                out.append(BOS)
                i += len(BOS)
            else:
                out.append(t[i])
                i += 1
        return out

    def add_special_cases(self, toks:Collection[str]): pass
We create a vocab of all ASCII letters, and a character tokenizer that doesn't do any specific processing.
itos = [UNK, BOS] + list(string.ascii_lowercase + " -'")
vocab=Vocab(itos)
tokenizer=Tokenizer(LetterTokenizer, pre_rules=[], post_rules=[])
We can create a data pipeline using the TextClasDataBunch.from_df constructor.
mark_fields puts an extra xxfld marker between each field of text; since we only have 1 field this is unnecessary.
train_df = df.iloc[train_idx, [0,2]]
valid_df = df.iloc[valid_idx, [0,2]]
train_df.head()
cl | ascii_name | |
---|---|---|
0 | Korean | ahn |
2 | Korean | bang |
3 | Korean | byon |
10 | Korean | gil |
11 | Korean | gu |
data = TextClasDataBunch.from_df(path='.', train_df=train_df, valid_df=valid_df,
tokenizer=tokenizer, vocab=vocab,
mark_fields=False)
data.show_batch()
text | target |
---|---|
v o n g r i m m e l s h a u s e n | German |
m a c e a c h t h i g h e a r n a | Irish |
c h k h a r t i s h v i l i | Russian |
t z e h m i s t r e n k o | Russian |
c h e p t y g m a s h e v | Russian |
Or we can create it using the data block API.
This uses the processors to tokenize and numericalize the input.
processors = [TokenizeProcessor(tokenizer=tokenizer, mark_fields=False),
NumericalizeProcessor(vocab=vocab)]
data = (TextList
.from_df(df,
cols=[2],
processor=processors)
.split_by_idxs(train_idx=train_idx, valid_idx=valid_idx)
.label_from_df(cols=0)
.databunch(bs=32))
data.show_batch()
text | target |
---|---|
v o n g r i m m e l s h a u s e n | German |
p a r a s k e v o p o u l o s | Greek |
d z h a v a h i s h v i l i | Russian |
s h a h n a z a r y a n t s | Russian |
m o g i l n i c h e n k o | Russian |
Counter(_.obj for _ in data.valid_ds.y)
Counter({'Korean': 30, 'Italian': 30, 'Polish': 30, 'Japanese': 30, 'Dutch': 30, 'Czech': 30, 'Irish': 30, 'Chinese': 30, 'Vietnamese': 30, 'Spanish': 30, 'Arabic': 30, 'German': 30, 'English': 30, 'Russian': 30, 'Greek': 30, 'French': 30})
Counter(_.obj for _ in data.train_ds.y).most_common()
[('Russian', 9232), ('English', 3329), ('Japanese', 952), ('Italian', 630), ('German', 548), ('Czech', 434), ('Dutch', 214), ('Spanish', 182), ('French', 180), ('Chinese', 170), ('Greek', 162), ('Irish', 134), ('Polish', 93), ('Arabic', 73), ('Korean', 31), ('Vietnamese', 25)]
Check that no text appears in both the validation and training sets
valid_set = set(_.text for _ in data.valid_ds.x)
for _ in data.train_ds.x:
assert _.text not in valid_set, _.text
trainiter = iter(data.train_dl)
batch, cl = next(trainiter)
batch2, cl2 = next(trainiter)
cl, len(cl)
(tensor([ 6, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13], device='cuda:0'), 16)
batch.shape
torch.Size([20, 16])
The letters run down the batch (20 rows for this batch), padded at the front with BOS; we have 16 names across.
Somehow it looks like we also have an extra space at the beginning of each name that wasn't in the input data.
(Note this is different to what the fastai wrappers will give you; they concatenate the data and split it into 16 chunks.)
pd.options.display.max_columns = 100
(pd
.DataFrame([[vocab.itos[y] for y in x] for x in batch])
.T
.assign(category=[data.classes[_] for _ in cl])
.T)
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos |
1 | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | |
2 | v | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos |
3 | o | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos |
4 | n | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos |
5 | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | |||
6 | g | t | l | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | |||||||
7 | r | c | e | t | m | b | c | z | b | p | ||||||
8 | i | h | i | c | i | a | h | h | a | a | g | s | a | b | v | v |
9 | m | a | h | h | n | k | a | e | h | t | r | h | w | a | y | i |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
11 | e | t | e | h | a | h | t | o | h | i | s | n | o | h | c | c |
12 | l | o | n | l | z | t | o | k | i | o | h | d | r | t | h | h |
13 | s | r | b | a | e | a | r | h | v | r | e | e | k | i | e | e |
14 | h | i | e | k | t | n | i | o | a | k | l | r | h | g | s | p |
15 | a | z | r | o | d | o | z | v | n | o | e | o | a | a | l | o |
16 | u | h | g | v | i | w | h | t | d | v | v | v | n | r | a | l |
17 | s | s | s | s | n | s | s | s | z | s | s | i | o | e | v | s |
18 | e | k | k | k | o | k | k | e | h | k | k | c | f | e | o | k |
19 | n | y | y | y | v | i | y | v | i | y | y | h | f | v | v | y |
category | German | Russian | Russian | Russian | Russian | Russian | Russian | Russian | Russian | Russian | Russian | Russian | Russian | Russian | Russian | Russian |
21 rows × 16 columns
[vocab.itos[_] for _ in data.train_ds[0][0].data]
['xxbos', ' ', 'a', 'h', 'n']
list(df.iloc[0,1])
['A', 'h', 'n']
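The extra space comes from fastai joining the xxbos marker and the name with a space before tokenizing, which our character tokenizer then faithfully emits. A quick check (assuming the processor hands the tokenizer the string 'xxbos ahn'):
LetterTokenizer('en').tokenizer('xxbos ahn')
# -> ['xxbos', ' ', 'a', 'h', 'n']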
Note the length of strings varies between batches.
(pd
.DataFrame([[vocab.itos[y] for y in x] for x in batch2])
.T
.assign(category=[data.classes[_] for _ in cl2])
.T)
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos | xxbos |
1 | ||||||||||||||||
2 | b | m | r | f | m | b | d | j | m | b | k | u | m | m | a | b |
3 | e | c | i | e | a | a | e | e | o | a | i | f | a | e | n | a |
4 | n | g | d | r | s | j | m | f | r | l | n | i | k | a | s | b |
5 | i | r | g | r | a | e | a | f | a | b | c | m | i | d | e | u |
6 | t | o | w | a | o | n | k | e | n | o | h | k | o | h | l | r |
7 | e | r | a | r | k | o | i | r | d | n | i | i | k | r | m | i |
8 | z | y | y | o | a | v | s | s | i | i | n | n | a | a | i | n |
category | Spanish | English | English | Italian | Japanese | Russian | Greek | English | Italian | Italian | English | Russian | Japanese | Irish | Italian | Russian |
vocab.textify(batch2[:,0])
'xxbos b e n i t e z'
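Comparing the shapes of the two batches confirms this; each batch is padded to the length of its longest name:
batch.shape, batch2.shape
# -> (torch.Size([20, 16]), torch.Size([9, 16])) for the two batches above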
data.show_batch(ds_type=DatasetType.Valid)
text | target |
---|---|
c h r y s a n t h o p o u l o s | Greek |
v o n i n g e r s l e b e n | German |
s c h w a r z e n b e r g | Dutch |
d e s a u v e t e r r e | French |
a r e c h a v a l e t a | Spanish |
The torch nn.RNN expects the data to be one-hot encoded (a float vector per letter, rather than an integer index)
one_hot = torch.eye(len(vocab.itos))
one_hot[batch][:2]
tensor([[[0., 1., 0.,  ..., 0., 0., 0.],
         [0., 1., 0.,  ..., 0., 0., 0.],
         ...,
         [0., 1., 0.,  ..., 0., 0., 0.]],

        [[0., 0., 0.,  ..., 1., 0., 0.],
         [0., 1., 0.,  ..., 0., 0., 0.],
         ...,
         [0., 1., 0.,  ..., 0., 0., 0.]]])
one_hot[batch].shape
torch.Size([20, 16, 31])
Here's how we could do it without storing the one_hot matrix in memory.
def one_hot_fly(y, length=len(vocab.itos)):
    shape = list(y.shape)
    assert len(shape) == 2
    tensor = torch.zeros(shape + [length])
    for i, row in enumerate(y):
        for j, val in enumerate(row):
            tensor[i][j][val] = 1.
    return tensor
(one_hot[batch] == one_hot_fly(batch)).all()
tensor(1, dtype=torch.uint8)
Using matrix operations is ~250 times faster at this size than the double for loop.
%timeit one_hot[batch]
%timeit one_hot_fly(batch)
36.1 µs ± 2.61 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) 8.91 ms ± 210 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
n_letters = len(vocab.itos)
n_hidden = 128
n_output = df.cl.nunique()
n_letters, n_output
(31, 16)
We use an RNN to take in our sequence of letters and calculate a hidden state at each step.
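For reference, with nonlinearity='relu' each step of nn.RNN computes h_t = relu(W_ih x_t + b_ih + W_hh h_(t-1) + b_hh), where x_t is the one-hot encoded letter at step t and h_t is the hidden state carried forward.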
rnn = nn.RNN(input_size=n_letters,
hidden_size=n_hidden,
num_layers=1,
nonlinearity='relu',
dropout=0.)
output, hidden = rnn(one_hot[batch])
output.shape, hidden.shape
(torch.Size([20, 16, 128]), torch.Size([1, 16, 128]))
lo = nn.Linear(n_hidden, n_output)
preds = lo(output)
preds.shape
torch.Size([20, 16, 16])
cl
tensor([ 6, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13], device='cuda:0')
nn.functional.softmax(preds[-1], dim=1).argmax(dim=1)
tensor([8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8])
one_hot = torch.eye(len(vocab.itos))
class MyLetterRNN(nn.Module):
    def __init__(self, dropout=0., n_layers=1, n_input=n_letters, n_hidden=n_hidden, n_output=n_output):
        super().__init__()
        self.one_hot = torch.eye(n_input).cuda()
        self.rnn = nn.RNN(input_size=n_input,
                          hidden_size=n_hidden,
                          num_layers=n_layers,
                          nonlinearity='relu',
                          dropout=dropout)
        self.lo = nn.Linear(n_hidden, n_output)

    def forward(self, input):
        # One-hot encode, run the RNN over the whole sequence,
        # then classify from the output at the final timestep
        rnn_out, _ = self.rnn(self.one_hot[input])
        out = self.lo(rnn_out)
        return out[-1]
rnn = MyLetterRNN().cuda()
out = rnn(batch)
out.argmax(dim=1), cl
(tensor([6, 6, 6, 6, 1, 6, 6, 1, 6, 6, 6, 1, 6, 1, 1, 6], device='cuda:0'), tensor([ 6, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13], device='cuda:0'))
Fit the model
F.cross_entropy(out, cl)
tensor(2.7832, device='cuda:0', grad_fn=<NllLossBackward>)
learn = Learner(data, rnn, loss_func=F.cross_entropy, metrics=[accuracy])
learn.lr_find()
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
learn.recorder.plot()
learn.fit_one_cycle(10, max_lr=3e-2)
epoch | train_loss | valid_loss | accuracy |
---|---|---|---|
1 | 0.811252 | 2.653636 | 0.260417 |
2 | 0.928508 | 3.329767 | 0.216667 |
3 | 0.830531 | 3.436638 | 0.218750 |
4 | 0.947136 | 3.056552 | 0.202083 |
5 | 0.878935 | 3.361734 | 0.210417 |
6 | 0.818984 | 3.208372 | 0.214583 |
7 | 0.811538 | 2.896590 | 0.252083 |
8 | 0.745542 | 3.237130 | 0.283333 |
9 | 0.753505 | 2.819807 | 0.302083 |
10 | 0.763112 | 2.878011 | 0.297917 |
learn.lr_find()
learn.recorder.plot()
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
learn.save('char_rnn_1')
learn.fit_one_cycle(5, 3e-3)
epoch | train_loss | valid_loss | accuracy |
---|---|---|---|
1 | 0.696038 | 2.910773 | 0.304167 |
2 | 0.734545 | 2.814250 | 0.306250 |
3 | 0.634951 | 2.827829 | 0.295833 |
4 | 0.636780 | 2.758662 | 0.312500 |
5 | 0.696148 | 2.838843 | 0.312500 |
learn.save('char_rnn_1_final')
This is abysmal; 31% is much worse than the 53% from the simple Naive Bayes bigram model.
Does it improve if we add another layer?
learn = Learner(data, MyLetterRNN(n_layers=2), loss_func=F.cross_entropy, metrics=[accuracy])
learn.lr_find()
learn.recorder.plot()
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
learn.fit_one_cycle(20, max_lr=1e-2)
epoch | train_loss | valid_loss | accuracy |
---|---|---|---|
1 | 0.929248 | 3.101529 | 0.189583 |
2 | 0.695901 | 2.869615 | 0.250000 |
3 | 0.745567 | 2.520683 | 0.316667 |
4 | 0.620927 | 3.530135 | 0.262500 |
5 | 0.742575 | 2.512531 | 0.318750 |
6 | 0.723677 | 2.616584 | 0.343750 |
7 | 0.839355 | 2.454891 | 0.335417 |
8 | 0.817186 | 2.794391 | 0.291667 |
9 | 0.653000 | 2.695168 | 0.302083 |
10 | 0.683367 | 2.637764 | 0.358333 |
11 | 0.611877 | 2.308675 | 0.333333 |
12 | 0.586979 | 2.296229 | 0.352083 |
13 | 0.611386 | 2.224956 | 0.381250 |
14 | 0.580687 | 2.247524 | 0.383333 |
15 | 0.512482 | 2.244857 | 0.387500 |
16 | 0.516693 | 2.303736 | 0.412500 |
17 | 0.409016 | 2.413911 | 0.412500 |
18 | 0.435291 | 2.442951 | 0.422917 |
19 | 0.386392 | 2.507006 | 0.425000 |
20 | 0.352908 | 2.518786 | 0.433333 |
It looks like the fit has converged, again at a much worse result than our Naive Bayes bigrams.
But that was trained using a balanced dataset; maybe that will help with RNNs too.
learn.recorder.plot_losses()
learn.save('char_rnn_2_p0')
prob, targ = learn.get_preds()
Counter(data.classes[_.item()] for _ in prob.argmax(dim=1)).most_common()
[('English', 141), ('Russian', 97), ('Chinese', 54), ('Italian', 43), ('Japanese', 38), ('Greek', 25), ('German', 22), ('Czech', 12), ('French', 11), ('Dutch', 11), ('Spanish', 10), ('Polish', 10), ('Korean', 4), ('Vietnamese', 2)]
Even though the balanced set is a subset of the training set (and throws away a lot of data), the model performs much better on the balanced validation set with it.
This is because on the whole training set heuristics like "when in doubt, guess Russian/English" and "it's almost never Vietnamese" are good, but are terrible on our validation set.
data = (TextList
.from_df(df,
cols=[2],
processor=processors)
.split_by_idxs(train_idx=bal_idx, valid_idx=valid_idx)
.label_from_df(cols=0)
.databunch(bs=1024))
Counter(_.obj for _ in data.valid_ds.y)
Counter({'Korean': 30, 'Italian': 30, 'Polish': 30, 'Japanese': 30, 'Dutch': 30, 'Czech': 30, 'Irish': 30, 'Chinese': 30, 'Vietnamese': 30, 'Spanish': 30, 'Arabic': 30, 'German': 30, 'English': 30, 'Russian': 30, 'Greek': 30, 'French': 30})
Counter(_.obj for _ in data.train_ds.y).most_common()
[('Korean', 500), ('Italian', 500), ('Polish', 500), ('Japanese', 500), ('Dutch', 500), ('Czech', 500), ('Irish', 500), ('Chinese', 500), ('Vietnamese', 500), ('Spanish', 500), ('Arabic', 500), ('German', 500), ('English', 500), ('Russian', 500), ('Greek', 500), ('French', 500)]
(pd.DataFrame({'x': [_.text for _ in data