Information
According to the article "MultiFiT: Efficient Multi-lingual Language Model Fine-tuning" (September 10, 2019), the QRNN architecture and the SentencePiece tokenizer give better results than AWD-LSTM and the spaCy tokenizer, respectively. Therefore, they have been used in this notebook to train a French bidirectional language model on a Wikipedia corpus of 100 million tokens.
Moreover, the hyperparameter values given at the end of the article have been used as well.
Wikipedia corpus
Hyperparameter values
The French bidirectional language model should be retrained with the MultiFiT hyperparameter values: 4 QRNN layers (instead of 3 AWD-LSTM layers), 1550 hidden units per layer (instead of 1152), no dropout, a batch size of 50, etc.
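As an illustrative sketch (not this notebook's actual code), the MultiFiT values above can be written as a fastai-style config dictionary. The keys mirror fastai v1's `awd_lstm_lm_config`; `emb_sz=400` is an assumption carried over from fastai's defaults, since the note above does not state it:

```python
# Hypothetical MultiFiT-style config, mirroring fastai v1's awd_lstm_lm_config keys.
# Values follow the MultiFiT paper; emb_sz is fastai's default (an assumption here).
multifit_config = dict(
    emb_sz=400,       # embedding size (fastai default)
    n_hid=1550,       # 1550 hidden units per layer (instead of 1152)
    n_layers=4,       # 4 QRNN layers (instead of 3 AWD-LSTM layers)
    qrnn=True,        # QRNN cells instead of LSTM cells
    pad_token=1,
    tie_weights=True,
    out_bias=True,
)
# "no dropout" is then obtained by passing drop_mult=0. to language_model_learner
```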
from fastai import *
from fastai.text import *
from fastai.callbacks import *
import matplotlib.pyplot as plt
%reload_ext autoreload
%autoreload 2
%matplotlib inline
# batch size to be chosen according to your GPU
# bs=48
# bs=24
bs=50
torch.cuda.set_device(0)
import fastai
print(f'fastai: {fastai.__version__}')
print(f'cuda: {torch.cuda.is_available()}')
fastai: 1.0.57
cuda: True
!python -m fastai.utils.show_install
```text
=== Software ===
python        : 3.7.4
fastai        : 1.0.57
fastprogress  : 0.1.21
torch         : 1.2.0
nvidia driver : 410.104
torch cuda    : 10.0.130 / is available
torch cudnn   : 7602 / is enabled

=== Hardware ===
nvidia gpus   : 1
torch devices : 1
  - gpu0      : 16130MB | Tesla V100-SXM2-16GB

=== Environment ===
platform      : Linux-4.9.0-9-amd64-x86_64-with-debian-9.9
distro        : #1 SMP Debian 4.9.168-1+deb9u5 (2019-08-11)
conda env     : base
python        : /opt/anaconda3/bin/python
sys.path      : /home/jupyter/tutorials/fastai/course-nlp
/opt/anaconda3/lib/python37.zip
/opt/anaconda3/lib/python3.7
/opt/anaconda3/lib/python3.7/lib-dynload
/opt/anaconda3/lib/python3.7/site-packages
/opt/anaconda3/lib/python3.7/site-packages/IPython/extensions
```
Please make sure to include opening/closing ``` when you paste into forums/github to make the reports appear formatted as code sections.
Optional package(s) to enhance the diagnostics can be installed with: pip install distro
Once installed, re-run this utility to get the additional information.
data_path = Config.data_path()
This will create a {lang}wiki folder, containing a {lang}wiki text file with the Wikipedia contents. (For other languages, replace {lang} with the appropriate code from the list of Wikipedias.)
lang = 'fr'
name = f'{lang}wiki'
path = data_path/name
path.mkdir(exist_ok=True, parents=True)
lm_fns2 = [f'{lang}_wt_sp15', f'{lang}_wt_vocab_sp15']
lm_fns2_bwd = [f'{lang}_wt_sp15_bwd', f'{lang}_wt_vocab_sp15_bwd']
from nlputils import split_wiki,get_wiki
from nlputils2 import *
%%time
get_wiki(path,lang)
extracting...
CPU times: user 52 ms, sys: 12 ms, total: 64 ms
Wall time: 17min 30s
%%time
split_wiki(path,lang)
(progress counter: 0 100000 200000 … 20500000)
CPU times: user 37.7 s, sys: 14.8 s, total: 52.6 s
Wall time: 1min 19s
PosixPath('/home/jupyter/.fastai/data/frwiki/docs')
path.ls()
[PosixPath('/home/jupyter/.fastai/data/frwiki/frwiki-latest-pages-articles.xml'), PosixPath('/home/jupyter/.fastai/data/frwiki/frwiki'), PosixPath('/home/jupyter/.fastai/data/frwiki/wikiextractor'), PosixPath('/home/jupyter/.fastai/data/frwiki/frwiki-latest-pages-articles.xml.bz2'), PosixPath('/home/jupyter/.fastai/data/frwiki/log')]
!head -n4 {path}/{name}
<doc id="3" url="https://fr.wikipedia.org/wiki?curid=3" title="Antoine Meillet"> Antoine Meillet Paul Jules Antoine Meillet, né le à Moulins (Allier) et mort le à Châteaumeillant (Cher), est le principal linguiste français des premières décennies du . Il est aussiphilologue.
This function splits the single Wikipedia file into a separate file per article, which is often easier to work with.
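`split_wiki` comes from the course's `nlputils` module. A minimal sketch of the idea — splitting on the `<doc … title="…">` markers that wikiextractor emits, as seen in the `head` output above — might look like this; the regex and filename sanitization are assumptions, not the actual implementation:

```python
import re
from pathlib import Path

def split_wiki_sketch(wiki_file: Path, dest: Path) -> int:
    """Split one wikiextractor dump into one .txt file per article.
    Returns the number of articles written. (Illustrative sketch only.)"""
    dest.mkdir(exist_ok=True)
    title_re = re.compile(r'<doc id="\d+" url="[^"]+" title="([^"]+)">')
    out, count = None, 0
    with wiki_file.open(encoding='utf-8') as f:
        for line in f:
            m = title_re.match(line)
            if m:                                   # start of a new article
                if out: out.close()
                name = re.sub(r'[^a-zA-Z0-9_]', '_', m.group(1))
                out = (dest / f'{name}.txt').open('w', encoding='utf-8')
                count += 1
            elif line.startswith('</doc>'):
                continue                            # skip closing markers
            elif out:
                out.write(line)
    if out: out.close()
    return count
```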
# %%time
# folder = "docs"
# clean_files(path,folder)
CPU times: user 1min 31s, sys: 28.7 s, total: 2min Wall time: 8min 37s
dest = path/'docs'
dest.ls()[:5]
[PosixPath('/home/jupyter/.fastai/data/frwiki/docs/Min_y_.txt'), PosixPath('/home/jupyter/.fastai/data/frwiki/docs/Darius_Johnson_Odom.txt'), PosixPath('/home/jupyter/.fastai/data/frwiki/docs/Guillaume_Cherel.txt'), PosixPath('/home/jupyter/.fastai/data/frwiki/docs/Henk_Badings.txt'), PosixPath('/home/jupyter/.fastai/data/frwiki/docs/Henri_de_Virel.txt')]
!head -n4 {dest}/'Darius_Johnson_Odom.txt'
Darius Earvin Johnson-Odom, né le , à Youngsville, en Caroline du Nord, est un joueur américain de basket-ball. Il évolue au poste d'arrière. Johnson-Odom passe trois saisons à l'université de Marquette avant d'être sélectionné à la de la draft 2012 de la NBA par les Mavericks de Dallas, qui le transfèrent immédiatement aux Lakers de Los Angeles. Le 15 septembre 2012, il signe son contrat rookie avec les Lakers. Johnson-Odom est envoyé, plusieurs fois durant la saison 2012-2013, chez les D-Fenders de Los Angeles, l'équipe de D-League affiliée aux Lakers.
%%time
num_files, num_tokens = get_num_tokens(dest)
print(f'{num_files} files - {num_tokens} tokens')
512659 files - 492596078 tokens
CPU times: user 57.1 s, sys: 18.9 s, total: 1min 16s
Wall time: 2min 49s
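`get_num_tokens` comes from the custom `nlputils2` module, which is not shown here. As an assumption about its behavior, it presumably counts whitespace-separated tokens across the article files, roughly like this sketch:

```python
from pathlib import Path

def get_num_tokens_sketch(folder: Path):
    """Count the .txt files and whitespace-separated tokens under a folder.
    Sketch of the custom nlputils2.get_num_tokens (assumed behavior)."""
    files = list(folder.glob('*.txt'))
    num_tokens = sum(len(f.read_text(encoding='utf-8').split()) for f in files)
    return len(files), num_tokens
```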
%%time
path_corpus = get_corpus(dest, path, num_tokens, obj_tokens=1e8)
files copied to the new corpus folder: /home/jupyter/.fastai/data/frwiki/corpus_100000000
CPU times: user 13.3 s, sys: 8.85 s, total: 22.1 s
Wall time: 22.1 s
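`get_corpus` is likewise a custom `nlputils2` helper. Judging from the output above (a `corpus_100000000` folder containing roughly `obj_tokens` tokens), a plausible sketch — and only a sketch, the real function may sample files differently — is to copy article files until the token target is reached:

```python
import shutil
from pathlib import Path

def get_corpus_sketch(src: Path, dst_parent: Path, obj_tokens: int) -> Path:
    """Copy .txt files from src until ~obj_tokens tokens are gathered.
    Illustrative sketch of the custom nlputils2.get_corpus (assumed behavior)."""
    dst = dst_parent / f'corpus_{int(obj_tokens)}'
    dst.mkdir(exist_ok=True)
    total = 0
    for f in sorted(src.glob('*.txt')):
        if total >= obj_tokens:          # stop once the target is reached
            break
        total += len(f.read_text(encoding='utf-8').split())
        shutil.copy(f, dst / f.name)
    return dst
```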
%%time
# VERIFICATION of the number of words in the corpus folder
num_files_corpus, num_tokens_corpus = get_num_tokens(path_corpus)
print(f'{num_files_corpus} files - {num_tokens_corpus} tokens')
252898 files - 100716190 tokens
CPU times: user 13.9 s, sys: 2.92 s, total: 16.8 s
Wall time: 2min 9s
# change name of the corpus
!mv {path}/'corpus_100000000' {path}/'corpus2_100'
dest = path/'corpus2_100'
%%time
data = (TextList.from_folder(dest, processor=[OpenFileProcessor(), SPProcessor(max_vocab_sz=15000)])
.split_by_rand_pct(0.1, seed=42)
.label_for_lm()
.databunch(bs=bs, num_workers=1))
CPU times: user 41min 54s, sys: 20.3 s, total: 42min 14s Wall time: 20min 14s
data.save(f'{path}/{lang}_databunch_corpus2_100_sp15')
len(data.vocab.itos),len(data.train_ds)
(15000, 227609)
len(data.vocab.itos),len(data.vocab.stoi)
(15000, 15000)
data.train_ds.x[1]
Text ▁xxbos ▁xxmaj ▁dar ius ▁xxmaj ▁ e ar vin ▁johnson - o dom , ▁né ▁le ▁ , ▁à ▁xxmaj ▁young s ville , ▁en ▁xxmaj ▁caroline ▁du ▁xxmaj ▁nord , ▁est ▁un ▁joueur ▁américain ▁de ▁basket - ball . ▁xxmaj ▁il ▁évolue ▁au ▁poste ▁d ' arrière . ▁johnson - o dom ▁passe ▁trois ▁saisons ▁à ▁l ' université ▁de ▁xxmaj ▁marque tte ▁avant ▁d ' être ▁sélectionné ▁à ▁la ▁de ▁la ▁draft ▁2012 ▁de ▁la ▁xxup ▁nba ▁par ▁les ▁xxmaj ▁ma ve rick s ▁de ▁xxmaj ▁dallas , ▁qui ▁le ▁trans f èrent ▁immédiatement ▁aux ▁xxmaj ▁la ker s ▁de ▁xxmaj ▁los ▁xxmaj ▁angeles . ▁xxmaj ▁le ▁15 ▁septembre ▁2012, ▁il ▁signe ▁son ▁contrat ▁ro ok ie ▁avec ▁les ▁xxmaj ▁la ker s . ▁johnson - o dom ▁est ▁envoyé , ▁plusieurs ▁fois ▁durant ▁la ▁saison ▁2012-2013 , ▁chez ▁les ▁d - f en der s ▁de ▁xxmaj ▁los ▁xxmaj ▁angeles , ▁l ' équipe ▁de ▁d - le a gue ▁affilié e ▁aux ▁xxmaj ▁la ker s . ▁xxmaj ▁le ▁7 ▁janvier ▁2013, ▁johnson - o dom ▁est ▁coupé ▁par ▁les ▁xxmaj ▁la ker s . ▁xxmaj ▁c ' était ▁le ▁dernier ▁jour ▁pour ▁les ▁équipes ▁xxup ▁nba ▁pour ▁libérer ▁des ▁joueurs ▁dont ▁le ▁contrat ▁n ' est ▁pas ▁garanti ▁avant ▁que ▁leur ▁contrat ▁de vienne ▁garanti ▁jusqu ' à ▁la ▁fin ▁de ▁la ▁saison . ▁xxmaj ▁il ▁joue ▁quatre ▁matches ▁et ▁seulement ▁six ▁minutes ▁au ▁total ▁avec ▁les ▁xxmaj ▁la ker s , ▁passant ▁la ▁plupart ▁de ▁son ▁temps ▁en ▁d - le a gue ▁où ▁il ▁est ▁le ▁meilleur ▁marque ur ▁des ▁d - f en der s , ▁avec ▁21 ▁points ▁par ▁match . ▁xxmaj ▁le ▁24 ▁janvier ▁2013, ▁johnson - o dom ▁part ▁en ▁xxmaj ▁russie ▁où ▁il ▁signe ▁avec ▁le ▁xxmaj ▁ sparta k ▁saint - pétersbourg ▁pour ▁le ▁reste ▁de ▁la ▁saison ▁2012-2013 . ▁xxmaj ▁il ▁participe ▁avec ▁les ▁xxmaj ▁celtic s ▁de ▁xxmaj ▁boston ▁à ▁la ▁xxmaj ▁summer ▁xxmaj ▁league ▁2013 ▁d ' orlando . ▁xxmaj ▁le ▁25 ▁septembre ▁2013, ▁il ▁rejoint ▁les ▁xxmaj ▁la ker s ▁de ▁xxmaj ▁los ▁xxmaj ▁angeles ▁pour ▁participer ▁au ▁camp ▁d ' entraînement . ▁xxmaj ▁toutefois , ▁ils ▁le ▁libère nt ▁le ▁16 ▁octobre . ▁xxmaj ▁le ▁18 ▁octobre ▁2013, ▁il ▁signe ▁en ▁xxmaj ▁chine ▁au ▁ . 
▁xxmaj ▁en ▁novembre ▁2013, ▁après ▁quatre ▁matches ▁de ▁saison ▁régulière , ▁il ▁quitte ▁les ▁xxmaj ▁blue ▁xxmaj ▁wh ales . ▁xxmaj ▁le ▁3 ▁janvier ▁2014, ▁il ▁rejoint ▁l ' armor ▁de ▁xxmaj ▁ spring field ▁en ▁d - le a gue . ▁xxmaj ▁le ▁14 ▁mars ▁2014, ▁il ▁signe ▁un ▁contrat ▁de ▁dix ▁jours ▁avec ▁les ▁ 76 ers ▁de ▁xxmaj ▁philadelphie . ▁xxmaj ▁le ▁24 ▁mars ▁2014, ▁il ▁ne ▁signe ▁pas ▁de ▁second ▁contrat ▁de ▁dix ▁jours ▁à ▁la ▁fin ▁du ▁premier . ▁johnson - o dom ▁détient ▁actuellement ▁le ▁record ▁du ▁plus ▁grand ▁nombre ▁de ▁tirs ▁tenté s ▁sans ▁en ▁réussi r ▁un ▁seul ▁en ▁xxup ▁nba ▁avec ▁11 ▁tirs . ▁xxmaj ▁le ▁2 ▁août ▁2014, ▁il ▁signe ▁en ▁xxmaj ▁italie , ▁au ▁xxmaj ▁c ant ù ▁pour ▁la ▁saison ▁2014-2015 . ▁xxmaj ▁le ▁11 ▁décembre ▁2014, ▁il ▁est ▁nommé ▁xxup ▁m v p ▁de ▁la ▁de ▁l ' euro coup e . ▁xxmaj ▁le ▁14 ▁juin ▁2015, ▁il ▁signe ▁en ▁xxmaj ▁turquie ▁au ▁xxmaj ▁tra b zon s por ▁xxmaj ▁basket ball ▁pour ▁la ▁saison ▁2015-2016 . ▁xxmaj ▁le ▁28 ▁décembre ▁2015, ▁il ▁signe ▁en ▁xxmaj ▁grèce , ▁à ▁l ' olympia k ó s ▁pour ▁le ▁reste ▁de ▁la ▁saison ▁2015-2016 ▁et ▁le ▁xxmaj ▁top ▁16 ▁de ▁l ' euro ligu e . ▁xxmaj ▁le ▁11 ▁juin ▁2016, ▁il ▁retourne ▁en ▁xxmaj ▁italie ▁où ▁il ▁signe ▁à ▁xxmaj ▁sa s s ari . ▁xxmaj ▁les ▁statistiques ▁en ▁matchs ▁universitaires ▁de ▁xxmaj ▁dar ius ▁johnson - o dom ▁sont ▁les ▁suivantes ▁:
%%time
data = (TextList.from_folder(dest, processor=[OpenFileProcessor(), SPProcessor(max_vocab_sz=15000)])
.split_by_rand_pct(0.1, seed=42)
.label_for_lm()
.databunch(bs=bs, num_workers=1, backwards=True))
data.save(f'{path}/{lang}_databunch_corpus2_100_sp15_bwd')
CPU times: user 42min 18s, sys: 23.5 s, total: 42min 42s Wall time: 20min 30s
%%time
data = load_data(path, f'{lang}_databunch_corpus2_100_sp15', bs=bs)
CPU times: user 3.94 s, sys: 1.81 s, total: 5.75 s Wall time: 12.2 s
config = awd_lstm_lm_config.copy()
config['qrnn'] = True
%%time
perplexity = Perplexity()
learn = language_model_learner(data, AWD_LSTM, config=config, drop_mult=0., pretrained=False,
metrics=[error_rate, accuracy, perplexity]).to_fp16()
CPU times: user 248 ms, sys: 36 ms, total: 284 ms Wall time: 284 ms
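The `Perplexity` metric used here (from the fastai/course utilities) is simply the exponential of the average cross-entropy loss in nats, which is why the perplexity column in the training tables below equals exp(valid_loss). A minimal sketch:

```python
import math

def perplexity(avg_cross_entropy: float) -> float:
    """Perplexity = exp(average cross-entropy loss in nats)."""
    return math.exp(avg_cross_entropy)

# e.g. the first training epoch below reports valid_loss 3.493251,
# and exp(3.493251) ≈ 32.89, matching its perplexity column
```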
learn.lr_find()
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
learn.recorder.plot()
lr = 3e-3
lr *= bs/48 # Scale learning rate by batch size
%%time
learn.unfreeze()
wd = 0.01
learn.fit_one_cycle(10, lr, wd=wd, moms=(0.8,0.7),
callbacks=[ShowGraph(learn),
SaveModelCallback(learn.to_fp32(), monitor='accuracy', name='bestmodel_sp15')])
epoch | train_loss | valid_loss | error_rate | accuracy | perplexity | time |
---|---|---|---|---|---|---|
0 | 3.456712 | 3.493251 | 0.652252 | 0.347747 | 32.892696 | 28:46 |
1 | 3.243398 | 3.353261 | 0.640550 | 0.359450 | 28.595928 | 29:06 |
2 | 3.228111 | 3.336011 | 0.638784 | 0.361217 | 28.106716 | 29:10 |
Better model found at epoch 0 with accuracy value: 0.34774690866470337.
Better model found at epoch 1 with accuracy value: 0.3594495952129364.
Better model found at epoch 2 with accuracy value: 0.36121678352355957.
The GCP instance stopped. We need to reload bestmodel_sp15 and the vocab to continue training.
mdl_path = path/'models'
mdl_path.mkdir(exist_ok=True)
lm_fns2[0] = 'bestmodel_sp15'
data.vocab.save(mdl_path/(lm_fns2[1] + '.pkl'))
lm_fns2
['bestmodel_sp15', 'fr_wt_vocab_sp15']
config = awd_lstm_lm_config.copy()
config['qrnn'] = True
%%time
perplexity = Perplexity()
learn = language_model_learner(data, AWD_LSTM, config=config, drop_mult=0., pretrained_fnames=lm_fns2,
metrics=[error_rate, accuracy, perplexity]).to_fp16()
CPU times: user 4.43 s, sys: 1.5 s, total: 5.94 s Wall time: 30.4 s
learn.lr_find()
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
learn.recorder.plot()
lr = 3e-4
lr *= bs/48 # Scale learning rate by batch size
%%time
learn.unfreeze()
wd = 0.01
learn.fit_one_cycle(7, lr, wd=wd, moms=(0.8,0.7),
callbacks=[ShowGraph(learn),
SaveModelCallback(learn.to_fp32(), monitor='accuracy', name='bestmodel_sp15')])
epoch | train_loss | valid_loss | error_rate | accuracy | perplexity | time |
---|---|---|---|---|---|---|
0 | 3.068131 | 3.126621 | 0.609930 | 0.390071 | 22.796721 | 28:29 |
1 | 2.995493 | 3.070316 | 0.602313 | 0.397686 | 21.548635 | 28:26 |
2 | 3.011955 | 3.043591 | 0.598430 | 0.401570 | 20.980440 | 28:24 |
3 | 2.978858 | 3.023281 | 0.595229 | 0.404770 | 20.558710 | 28:24 |
4 | 2.960861 | 3.005773 | 0.592423 | 0.407577 | 20.201872 | 28:26 |
5 | 2.940080 | 2.995852 | 0.590565 | 0.409436 | 20.002399 | 28:27 |
6 | 2.937475 | 2.993789 | 0.590111 | 0.409888 | 19.961227 | 28:23 |
Better model found at epoch 0 with accuracy value: 0.3900712728500366.
Better model found at epoch 1 with accuracy value: 0.39768603444099426.
Better model found at epoch 2 with accuracy value: 0.40157046914100647.
Better model found at epoch 3 with accuracy value: 0.40476998686790466.
Better model found at epoch 4 with accuracy value: 0.40757742524147034.
Better model found at epoch 5 with accuracy value: 0.40943607687950134.
Better model found at epoch 6 with accuracy value: 0.4098880887031555.
CPU times: user 2h 36min 55s, sys: 42min 44s, total: 3h 19min 39s
Wall time: 3h 19min 22s
Save the pretrained model and vocab:
mdl_path = path/'models'
mdl_path.mkdir(exist_ok=True)
learn.to_fp32().save(mdl_path/lm_fns2[0], with_opt=False)
learn.data.vocab.save(mdl_path/(lm_fns2[1] + '.pkl'))
data = load_data(path, f'{lang}_databunch_corpus2_100_sp15_bwd', bs=bs, backwards=True)
config = awd_lstm_lm_config.copy()
config['qrnn'] = True
%%time
perplexity = Perplexity()
learn = language_model_learner(data, AWD_LSTM, config=config, drop_mult=0., pretrained=False,
metrics=[error_rate, accuracy, perplexity]).to_fp16()
CPU times: user 832 ms, sys: 36 ms, total: 868 ms Wall time: 868 ms
learn.lr_find()
LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
learn.recorder.plot()
lr = 3e-3
lr *= bs/48 # Scale learning rate by batch size
%%time
learn.unfreeze()
wd = 0.01
learn.fit_one_cycle(10, lr, wd=wd, moms=(0.8,0.7),
callbacks=[ShowGraph(learn),
SaveModelCallback(learn.to_fp32(), monitor='accuracy', name='bestmodel_sp15_bwd')])
epoch | train_loss | valid_loss | error_rate | accuracy | perplexity | time |
---|---|---|---|---|---|---|
0 | 3.464687 | 3.499862 | 0.588392 | 0.411609 | 33.110847 | 28:31 |
1 | 3.290530 | 3.359655 | 0.577429 | 0.422572 | 28.779305 | 28:36 |
2 | 3.279709 | 3.345511 | 0.576788 | 0.423212 | 28.375286 | 28:39 |
3 | 3.218174 | 3.291578 | 0.570586 | 0.429414 | 26.885269 | 28:37 |
4 | 3.188569 | 3.235373 | 0.564242 | 0.435757 | 25.415869 | 28:37 |
5 | 3.065708 | 3.165833 | 0.555168 | 0.444832 | 23.708519 | 28:39 |
6 | 3.036142 | 3.089983 | 0.545404 | 0.454596 | 21.976547 | 28:20 |
7 | 2.999072 | 3.019598 | 0.535763 | 0.464237 | 20.483038 | 28:26 |
8 | 2.888412 | 2.969108 | 0.528063 | 0.471937 | 19.474483 | 28:36 |
Better model found at epoch 0 with accuracy value: 0.4116089344024658.
Better model found at epoch 1 with accuracy value: 0.42257198691368103.
Better model found at epoch 2 with accuracy value: 0.42321211099624634.
Better model found at epoch 3 with accuracy value: 0.4294137954711914.
Better model found at epoch 4 with accuracy value: 0.4357571601867676.
Better model found at epoch 5 with accuracy value: 0.4448317289352417.
Better model found at epoch 6 with accuracy value: 0.4545958638191223.
Better model found at epoch 7 with accuracy value: 0.4642367959022522.
Better model found at epoch 8 with accuracy value: 0.47193723917007446.
The GCP instance stopped. We need to reload bestmodel_sp15_bwd and the vocab to continue training.
mdl_path = path/'models'
mdl_path.mkdir(exist_ok=True)
lm_fns2_bwd[0] = 'bestmodel_sp15_bwd'
data.vocab.save(mdl_path/(lm_fns2_bwd[1] + '.pkl'))
# mdl_path = path/'models'
# mdl_path.mkdir(exist_ok=True)
# learn.to_fp32().save(mdl_path/lm_fns2_bwd[0], with_opt=False)
# learn.data.vocab.save(mdl_path/(lm_fns2_bwd[1] + '.pkl'))
Note: the architecture used for our French LM is based on 4 QRNN layers with about 46 million parameters. This kind of architecture can be sufficient to fine-tune another LM on a specific corpus in order to ultimately create a text classifier (the ULMFiT method), but it is not sufficient to create an efficient text generator (a model such as GPT-2 or BERT would be a better choice). Moreover, the SentencePiece tokenizer used in this notebook implements subword units (e.g., byte-pair encoding (BPE) [Sennrich et al.] and unigram language model [Kudo]), so it can fall back to characters from its vocabulary instead of whole words.
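To illustrate why a subword vocabulary can always tokenize unseen words, here is a toy greedy longest-match segmenter. The vocabulary and the algorithm are purely hypothetical — SentencePiece's actual BPE and unigram-LM algorithms are more sophisticated — but the character-fallback property is the same:

```python
def segment(word: str, vocab: set) -> list:
    """Greedy longest-match subword segmentation (toy illustration).
    Falls back to single characters, so no word is ever out-of-vocabulary
    as long as its characters can be emitted as pieces."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):        # try the longest match first
            if word[i:j] in vocab or j == i + 1:  # single char is the fallback
                pieces.append(word[i:j])
                i = j
                break
    return pieces

# segment('linguiste', {'ling', 'uiste'}) → ['ling', 'uiste']
# segment('xyz', set())                   → ['x', 'y', 'z']  (character fallback)
```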
%%time
data = load_data(path, f'{lang}_databunch_corpus2_100_sp15', bs=bs)
CPU times: user 3.98 s, sys: 1.07 s, total: 5.05 s Wall time: 5.05 s
config = awd_lstm_lm_config.copy()
config['qrnn'] = True
# LM without pretraining
learn = language_model_learner(data, AWD_LSTM, config=config, pretrained=False)
# LM pretrained in English
learn_en = language_model_learner(data, AWD_LSTM, pretrained=True)
# LM pretrained in French
learn_fr = language_model_learner(data, AWD_LSTM, config=config, pretrained_fnames=lm_fns2)
TEXT = "Nadal a gagné le tournoi de" # original text
N_WORDS = 100 # number of words to predict following the TEXT
N_SENTENCES = 1 # number of different predictions
print("\n".join(learn.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))
Nadal a gagné le tournoi de ▁2011. sncb ▁auprès ▁visible ▁transcription ▁natal griff ▁constituant ▁tétra ▁donnée ▁entière 产 anniversaire ▁journée 马 pend ▁mull ἄ ▁culminant 興 ▁frappe ▁tard ภ agrément ▁joué ὸ hui voy ▁baltique ▁ferdinand cara ▁climatique ե ime ▁déterminé ▁entre ር л fils ▁fièvre chart équipe ▁intégrée avance ▁grands ▁compagnies gonal pa ▁». џ ille ▁sub étalon ▁in ▁pistolet etto ▁poser bot ▁théo ▁capacité „ ▁dép ▁plante original în ▁width ▁mathématique ▁criminel ▁condé ▁sortant ▁iv ح স ▁met ▁taxi post cro ▁600 ̞ ▁descend ▁vente ▁renonce ca architecte gou vez ▁menacée ura ▁rat ▁prolongation ▁chrétiens ▁wi ▁maryland ▁restent ▁profond elle ▁tunnel ▁manga ▁accepta diff
TEXT = "Nadal a gagné le tournoi de" # original text
N_WORDS = 100 # number of words to predict following the TEXT
N_SENTENCES = 1 # number of different predictions
print("\n".join(learn_en.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))
Nadal a gagné le tournoi de , and the pin - round for the introduction of the modern - day offensive , as well as a ▁funérailles - style version of the " e " in the version of the article " the world of the old , modern . " ( c . 2000 ) . " i ' m not just a rich man , but an old friend . i just see it as the right one , " he had a friend , " and just a good friend . " " i ' m not a good man " , he
TEXT = "Nadal a gagné le tournoi de" # original text
N_WORDS = 100 # number of words to predict following the TEXT
N_SENTENCES = 1 # number of different predictions
print("\n".join(learn_fr.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))
Nadal a gagné le tournoi de ▁xxmaj ▁ s ot chi ▁en ▁cours ▁de ▁saison ▁ - ▁1981 ▁à ▁xxmaj ▁milan . ▁xxmaj ▁il ▁est ▁surtout ▁connu ▁pour ▁l ' ac cumul ation ▁de ▁l ' équipe ▁de ▁xxmaj ▁russie ▁xxmaj ▁en ly s ▁xxmaj ▁ pet ter rö th ▁et ▁xxmaj ▁wi ki pé dia ▁xxmaj ▁ gel der k omm odor e ▁qui ▁ont ▁été ▁prêté s ▁à ▁la ▁xxmaj ▁province ▁de ▁xxmaj ▁ s omo gy , ▁qui ▁l ' a ▁nommé ▁gouverneur ▁de ▁l ' ancienne ▁ université ▁de ▁xxmaj ▁ t bili s si . ▁xxmaj ▁il ▁ s ' agit