fastai2
Things work a little differently in fastai2 for text compared to the other two modules (vision and tabular). In today's lesson we'll explore the high-level API for text and understand what makes it different.
!pip install fastai2 nbdev --quiet
!pip show fastai2
from fastai2.text.all import *
We're going to use a subset of the IMDB dataset, a sentiment-analysis dataset where the goal is to predict whether a review is positive or negative:
path = untar_data(URLs.IMDB_SAMPLE)
What's in our path?
path.ls()
(#1) [Path('/root/.fastai/data/imdb_sample/texts.csv')]
df = pd.read_csv(path/'texts.csv')
df.head()
  | label | text | is_valid |
---|---|---|---|
0 | negative | Un-bleeping-believable! Meg Ryan doesn't even look her usual pert lovable self in this, which normally makes me forgive her shallow ticky acting schtick. Hard to believe she was the producer on this dog. Plus Kevin Kline: what kind of suicide trip has his career been on? Whoosh... Banzai!!! Finally this was directed by the guy who did Big Chill? Must be a replay of Jonestown - hollywood style. Wooofff! | False |
1 | positive | This is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script.<br /><br />But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two countries. There is som... | False |
2 | negative | Every once in a long while a movie will come along that will be so awful that I feel compelled to warn people. If I labor all my days and I can save but one soul from watching this movie, how great will be my joy.<br /><br />Where to begin my discussion of pain. For starters, there was a musical montage every five minutes. There was no character development. Every character was a stereotype. We had swearing guy, fat guy who eats donuts, goofy foreign guy, etc. The script felt as if it were being written as the movie was being shot. The production value was so incredibly low that it felt li... | False |
3 | positive | Name just says it all. I watched this movie with my dad when it came out and having served in Korea he had great admiration for the man. The disappointing thing about this film is that it only concentrate on a short period of the man's life - interestingly enough the man's entire life would have made such an epic bio-pic that it is staggering to imagine the cost for production.<br /><br />Some posters elude to the flawed characteristics about the man, which are cheap shots. The theme of the movie "Duty, Honor, Country" are not just mere words blathered from the lips of a high-brassed offic... | False |
4 | negative | This movie succeeds at being one of the most unique movies you've seen. However this comes from the fact that you can't make heads or tails of this mess. It almost seems as a series of challenges set up to determine whether or not you are willing to walk out of the movie and give up the money you just paid. If you don't want to feel slighted you'll sit through this horrible film and develop a real sense of pity for the actors involved, they've all seen better days, but then you realize they actually got paid quite a bit of money to do this and you'll lose pity for them just like you've alr... | False |
Alright! So what's the general plan for training?
We're just going to cover how to use the API here; in the next lesson we'll get into the nitty-gritty of what's going on under the hood. For now, let's look at how to build this LM DataLoader using the DataBlock API:
TextBlock
As it's a text problem, we'll probably want something like a TextBlock, right? Well, we can't simply do this:
block = [TextBlock]
This won't throw any errors yet, so what's the problem?
doc(TextBlock)
class TextBlock [source]
TextBlock(tok_tfm, vocab=None, is_lm=False, seq_len=72, min_freq=3, max_vocab=60000, special_toks=None) :: TransformBlock
A TransformBlock for texts
TextBlock needs to know how we plan to tokenize our words (our tok_tfm), whether we already have a vocab to reuse, whether this is a language model, our sequence length, and a few other parameters. So there's a lot going on there!
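For example, building the block by hand might look like this (a minimal sketch, assuming we want the default SpacyTokenizer run over the text column; Tokenizer.from_df builds the tok_tfm for us):
tok = Tokenizer.from_df('text')                        # tokenize the 'text' column with the defaults shown above
manual_block = TextBlock(tok, is_lm=True, seq_len=72)  # now TextBlock has the tok_tfm it needs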
Along with this, there are a few class methods available too:
doc(TextBlock.from_df)
TextBlock.from_df [source]
TextBlock.from_df(text_cols, vocab=None, is_lm=False, seq_len=72, min_freq=3, max_vocab=60000, tok_func='SpacyTokenizer', rules=None, sep=' ', n_workers=2, mark_fields=None, res_col_name='text', **kwargs)
Build a TextBlock from a dataframe using text_cols
doc(TextBlock.from_folder)
TextBlock.from_folder [source]
TextBlock.from_folder(path, vocab=None, is_lm=False, seq_len=72, min_freq=3, max_vocab=60000, tok_func='SpacyTokenizer', rules=None, extensions=None, folders=None, output_dir=None, n_workers=2, encoding='utf8', **kwargs)
Build a TextBlock from a path
So if we don't necessarily want to define everything ourselves, we can use the quick and easy from_df and from_folder class methods. We'll use from_df here.
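For reference, if your texts lived in folders of .txt files (the layout of the full IMDB dataset), the equivalent call would be a one-liner. A sketch we won't run here, since untar_data(URLs.IMDB) downloads the full dataset:
# lm_block_from_folder = TextBlock.from_folder(untar_data(URLs.IMDB), is_lm=True)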
But wait, what is this res_col_name? It's the column our tokenized text will be written to. This becomes very important, as our get_x is going to want to pull from this column rather than from the column holding our untokenized input. So let's build a TextBlock for our problem; so we can see a different output, we'll change our res_col_name to tok_text:
lm_block = TextBlock.from_df('text', is_lm=True, res_col_name='tok_text')
For the rest of our DataBlock, we want our get_x to read that res_col_name column, and our splitter to split our data 90%/10%.
Note: the more data you can train your LM on, the better.
dblock = DataBlock(blocks=lm_block,
get_x=ColReader('tok_text'),
splitter=RandomSplitter(0.1))
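If the block ever refuses to build, DataBlock.summary will walk through the pipeline step by step on the raw source and show what each transform produces (a handy optional check; it re-runs tokenization, so it takes a moment):
dblock.summary(df)   # optional: trace each step of the pipeline and print its intermediate results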
And now we build the DataLoaders. We also need to declare how long our sequence length is going to be here.
Note: we'll also set num_workers to 4; the rule of thumb is 4 workers per GPU (source).
%%time
dls = dblock.dataloaders(df, bs=64, seq_len=72, num_workers=4)
CPU times: user 766 ms, sys: 79.2 ms, total: 845 ms Wall time: 3.47 s
Let's look at a batch of data, both in a raw form and as a show_batch:
dls.show_batch(max_n=3)
  | text | text_ |
---|---|---|
0 | xxbos i xxunk this movie a xxunk days xxunk … what the xxunk was that ? \n\n i like movies with xxmaj xxunk xxunk , they are xxunk and xxunk . xxmaj when i xxunk a xxunk of this xxunk and xxunk i xxunk great , this one could be really good … some xxunk for xxunk or xxunk xxunk movies … but xxunk then i xxunk a xxunk and xxunk xxunk | i xxunk this movie a xxunk days xxunk … what the xxunk was that ? \n\n i like movies with xxmaj xxunk xxunk , they are xxunk and xxunk . xxmaj when i xxunk a xxunk of this xxunk and xxunk i xxunk great , this one could be really good … some xxunk for xxunk or xxunk xxunk movies … but xxunk then i xxunk a xxunk and xxunk xxunk it |
1 | xxunk . xxmaj the xxunk is , i xxunk n't xxunk this film at all xxunk . \n\n xxmaj it 's the xxunk of xxunk you xxunk xxunk to see xxunk on a xxunk as a xxunk xxunk xxunk and as an xxunk in xxunk xxunk xxunk , xxunk xxunk . and just xxunk the xxunk , it xxunk on some xxunk . xxmaj as a xxunk xxunk of film though , | . xxmaj the xxunk is , i xxunk n't xxunk this film at all xxunk . \n\n xxmaj it 's the xxunk of xxunk you xxunk xxunk to see xxunk on a xxunk as a xxunk xxunk xxunk and as an xxunk in xxunk xxunk xxunk , xxunk xxunk . and just xxunk the xxunk , it xxunk on some xxunk . xxmaj as a xxunk xxunk of film though , it |
2 | were xxunk xxunk on such a xxunk of a movie … if you can even xxunk it a movie . xxbos xxmaj the only xxunk xxunk of this film is the xxunk xxunk … xxunk , this movie was xxunk . xxmaj the acting was xxunk xxunk , and the xxunk xxunk was xxunk and very xxunk . xxmaj the xxunk was xxunk , but it was very hard to xxunk xxunk | xxunk xxunk on such a xxunk of a movie … if you can even xxunk it a movie . xxbos xxmaj the only xxunk xxunk of this film is the xxunk xxunk … xxunk , this movie was xxunk . xxmaj the acting was xxunk xxunk , and the xxunk xxunk was xxunk and very xxunk . xxmaj the xxunk was xxunk , but it was very hard to xxunk xxunk to |
So this looks a bit odd. What happened?
xb,yb = next(iter(dls[0]))
xb[0]
tensor([ 2, 25, 0, 0, 8, 0, 38, 0, 0, 86, 16, 0, 66, 11, 12, 27, 0, 86, 25, 0, 9, 0, 21, 18, 26, 11, 12, 102, 73, 48, 86, 25, 0, 19, 8, 0, 92, 0, 16, 0, 10, 8, 9, 26, 11, 0, 11, 27, 14, 0, 0, 10, 8, 9, 0, 28, 0, 0, 9, 0, 12, 0, 0, 13, 9, 0, 0, 11, 12, 9, 0, 17], device='cuda:0')
yb[0]
tensor([ 25, 0, 0, 8, 0, 38, 0, 0, 86, 16, 0, 66, 11, 12, 27, 0, 86, 25, 0, 9, 0, 21, 18, 26, 11, 12, 102, 73, 48, 86, 25, 0, 19, 8, 0, 92, 0, 16, 0, 10, 8, 9, 26, 11, 0, 11, 27, 14, 0, 0, 10, 8, 9, 0, 28, 0, 0, 9, 0, 12, 0, 0, 13, 9, 0, 0, 11, 12, 9, 0, 17, 0], device='cuda:0')
This is where that Numericalize transform comes into play: each token in the vocabulary gets converted into a number for us to pass to our model. Also:
xb[0].shape, yb[0].shape
(torch.Size([72]), torch.Size([72]))
We can see that since we passed in a seq_len of 72, each individual text input is 72 tokens long!
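To sanity-check, we can map a few of those ids back to tokens using the vocab stored on the DataLoaders (a quick sketch):
' '.join(dls.vocab[int(i)] for i in xb[0][:15])   # turn the first 15 token ids of the batch back into strings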
Cool, can we train already?
We have a special Learner for language models, language_model_learner, which we'll use. All you need to do is pass in your DataLoaders, specify the arch, and pass in some metrics:
doc(language_model_learner)
language_model_learner [source]
language_model_learner(dls, arch, config=None, drop_mult=1.0, pretrained=True, pretrained_fnames=None, loss_func=None, opt_func='Adam', lr=0.001, splitter='trainable_params', cbs=None, metrics=None, path=None, model_dir='models', wd=None, wd_bn_bias=False, train_bn=True, moms=(0.95, 0.85, 0.95))
Create a Learner with a language model from dls and arch.
Potentially, if you have your own pre-trained base model you want to use, you can pass it in via pretrained_fnames; otherwise we'll use a model pretrained on WikiText-103.
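As a sketch of what that would look like (with made-up file names, and assuming pretrained_fnames expects the weights file and the vocab file sitting in your model directory):
# Hypothetical only: 'my_lm_wgts' (.pth) and 'my_lm_vocab' (.pkl) are made-up names for illustration.
# learn = language_model_learner(dls, AWD_LSTM,
#                                pretrained_fnames=['my_lm_wgts', 'my_lm_vocab'],
#                                metrics=[accuracy, Perplexity()])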
For our metrics we'll be using both accuracy and Perplexity. What is perplexity? It's a way of measuring how well our model deals with uncertainty in language:
Perplexity metric in NLP is a way to capture the degree of 'uncertainty' a model has in predicting (assigning probabilities to) some text. Lower the entropy (uncertainty), lower the perplexity. If a model, which is trained on good blogs and is being evaluated on similarly looking good blogs, assigns higher probability, we say the model has lower perplexity than a model which assigns lower probability. source
doc(Perplexity)
class Perplexity [source]
Perplexity() :: AvgLoss
Perplexity (exponential of cross-entropy loss) for Language Models
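Concretely, perplexity is just the exponential of the cross-entropy loss. A minimal PyTorch sketch with made-up logits:
import torch
import torch.nn.functional as F

logits  = torch.randn(4, 10)            # made-up predictions: 4 tokens over a 10-word vocab
targets = torch.tensor([1, 3, 0, 7])    # made-up target token ids

ce  = F.cross_entropy(logits, targets)  # average cross-entropy (in nats)
ppl = torch.exp(ce)                     # perplexity = exp(cross-entropy); lower is better
ce, ppl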
So now let's build our Learner. For an arch we'll use the AWD_LSTM (we'll explore it in more detail later on), pretrained on WikiText:
learn = language_model_learner(dls, AWD_LSTM, metrics=[accuracy, Perplexity()])
For training our model, we'll simply use fine_tune
for a few epochs (mostly due to the small sample size):
learn.fine_tune(5)
epoch | train_loss | valid_loss | accuracy | perplexity | time |
---|---|---|---|---|---|
0 | 3.058100 | 2.642663 | 0.390180 | 14.050574 | 00:10 |
epoch | train_loss | valid_loss | accuracy | perplexity | time |
---|---|---|---|---|---|
0 | 2.833319 | 2.535338 | 0.411559 | 12.620695 | 00:12 |
1 | 2.716148 | 2.460023 | 0.418889 | 11.705077 | 00:12 |
2 | 2.635717 | 2.426812 | 0.421273 | 11.322727 | 00:12 |
3 | 2.590852 | 2.411244 | 0.423066 | 11.147822 | 00:12 |
4 | 2.569319 | 2.407586 | 0.423487 | 11.107115 | 00:12 |
That's about what we want in accuracy for a language model, ~30-40%; don't expect it to go much higher than that. Remember: we're predicting the next word given the previous ones, which is a hard task! Now that we have our LM, we want to save those embeddings away.
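Before we do, a quick sanity check: we can ask the language model to continue a prompt (a quick sketch; with this tiny sample, expect plenty of xxunk tokens):
learn.predict("This movie was", n_words=20, temperature=0.75)   # generate 20 words after the prompt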
Embeddings?
learn.model[0].encoder
Embedding(184, 400, padding_idx=1)
These embeddings! They're essentially our equivalent of ImageNet weights. Along with this, for the downstream task we'll also want our vocab, so we'll actually go ahead and save our DataLoaders too.
How do we do this? Using torch.save
and save_encoder
:
torch.save(learn.dls, 'lm_dls.pth')
learn.save_encoder('fine_tuned_enc')
Our encoder has now been saved to the models directory. Let's work on our downstream task:
Now, classification is what we actually want to do: is this review positive or negative? Let's look at our DataFrame one more time to figure out how to frame our DataBlock:
df.head()
  | label | text | is_valid |
---|---|---|---|
0 | negative | Un-bleeping-believable! Meg Ryan doesn't even look her usual pert lovable self in this, which normally makes me forgive her shallow ticky acting schtick. Hard to believe she was the producer on this dog. Plus Kevin Kline: what kind of suicide trip has his career been on? Whoosh... Banzai!!! Finally this was directed by the guy who did Big Chill? Must be a replay of Jonestown - hollywood style. Wooofff! | False |
1 | positive | This is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script.<br /><br />But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two countries. There is som... | False |
2 | negative | Every once in a long while a movie will come along that will be so awful that I feel compelled to warn people. If I labor all my days and I can save but one soul from watching this movie, how great will be my joy.<br /><br />Where to begin my discussion of pain. For starters, there was a musical montage every five minutes. There was no character development. Every character was a stereotype. We had swearing guy, fat guy who eats donuts, goofy foreign guy, etc. The script felt as if it were being written as the movie was being shot. The production value was so incredibly low that it felt li... | False |
3 | positive | Name just says it all. I watched this movie with my dad when it came out and having served in Korea he had great admiration for the man. The disappointing thing about this film is that it only concentrate on a short period of the man's life - interestingly enough the man's entire life would have made such an epic bio-pic that it is staggering to imagine the cost for production.<br /><br />Some posters elude to the flawed characteristics about the man, which are cheap shots. The theme of the movie "Duty, Honor, Country" are not just mere words blathered from the lips of a high-brassed offic... | False |
4 | negative | This movie succeeds at being one of the most unique movies you've seen. However this comes from the fact that you can't make heads or tails of this mess. It almost seems as a series of challenges set up to determine whether or not you are willing to walk out of the movie and give up the money you just paid. If you don't want to feel slighted you'll sit through this horrible film and develop a real sense of pity for the actors involved, they've all seen better days, but then you realize they actually got paid quite a bit of money to do this and you'll lose pity for them just like you've alr... | False |
So we'll want another ColReader to grab our label, and we'll split our data by the is_valid column. First, let's get everything ready by loading in those DataLoaders from earlier:
lm_dls = torch.load('lm_dls.pth')
We need these because we want to reuse the vocab and sequence length from the language model:
blocks = (TextBlock.from_df('text', res_col_name='tok_text', seq_len=lm_dls.seq_len,
vocab=lm_dls.vocab), CategoryBlock())
Then we'll build our DataBlock
:
imdb_class = DataBlock(blocks=blocks,
get_x=ColReader('tok_text'),
get_y=ColReader('label'),
splitter=ColSplitter(col='is_valid'))
And finally the DataLoaders
:
dls = imdb_class.dataloaders(df, bs=64)
Those with a keen eye will notice we called .vocab here, but before, .vocab was how we grabbed our classes! So how do we get the classes now? Since Categorize (the transform that CategoryBlock adds) is stored as an attribute, we can grab our classes that way:
dls.categorize.vocab
(#2) ['negative','positive']
text_classifier_learner
Next up is the text_classifier_learner. This is what we'll use to build our AWD_LSTM classifier:
doc(text_classifier_learner)
text_classifier_learner [source]
text_classifier_learner(dls, arch, seq_len=72, config=None, pretrained=True, drop_mult=0.5, n_out=None, lin_ftrs=None, ps=None, max_len=1440, y_range=None, loss_func=None, opt_func='Adam', lr=0.001, splitter='trainable_params', cbs=None, metrics=None, path=None, model_dir='models', wd=None, wd_bn_bias=False, train_bn=True, moms=(0.95, 0.85, 0.95))
Create a Learner with a text classifier from dls and arch.
learn = text_classifier_learner(dls, AWD_LSTM, metrics=[accuracy])
Now we have a spot for our pretrained embeddings, right? Let's look for it in our learn:
learn.model[0].module.encoder
Embedding(184, 400, padding_idx=1)
It's right there! So we have a load_encoder function that will copy our fine-tuned weights over to it:
doc(learn.load_encoder)
TextLearner.load_encoder [source]
TextLearner.load_encoder(file, device=None)
Load the encoder file from the model directory, optionally ensuring it's on device
learn.load_encoder('fine_tuned_enc');
Note: it will automatically assume that your encoder is saved under learn.path in the models folder and has the extension .pth
So now we have our frozen model and our pretrained weights. Let's train!
We're going to use the gradual-unfreezing training methodology that Jeremy Howard and Sebastian Ruder came up with for fine-tuning (ULMFiT):
lr = learn.lr_find(suggestions=True)
lr
SuggestedLRs(lr_min=0.04365158379077912, lr_steep=0.0014454397605732083)
That's about 4e-2, so I'll agree with it; let's train on this schedule:
lr = 0.04365158379077912
First one epoch completely frozen:
learn.fit_one_cycle(1, lr)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 0.725329 | 0.660920 | 0.610000 | 00:03 |
Then we freeze everything except the last two parameter groups and adjust our lr:
doc(learn.freeze_to)
adj = 2.6**4; adj
45.69760000000001
This adjustment factor is how we will divide our lr across layer groups during fitting (we'll also lower the learning rate itself between stages):
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(lr/adj, lr))
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 0.728662 | 0.663544 | 0.595000 | 00:05 |
Then -3:
learn.freeze_to(-3)
lr /= 2
learn.fit_one_cycle(1, slice(lr/adj, lr))
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 0.631015 | 0.583745 | 0.680000 | 00:05 |
learn.unfreeze()
lr /= 5
learn.fit_one_cycle(2, slice(lr/adj, lr))
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 0.492473 | 0.531331 | 0.725000 | 00:08 |
1 | 0.448703 | 0.533241 | 0.735000 | 00:08 |
73.5% is as high as we got with a subset of only 1,000 texts! Not too shabby; with the full dataset we can expect around 94.5% accuracy (the result reported in the paper).
Now let's show an example prediction and, for fun, compare fastai2 to fastinference.
Warning: the ULMFiT model will not export to ONNX
%%time
out = learn.predict(df.iloc[0]['text'])
CPU times: user 39.6 ms, sys: 2.15 ms, total: 41.7 ms Wall time: 46.4 ms
out
('negative', tensor(0), tensor([0.6171, 0.3829]))
dl = learn.dls.test_dl(df.iloc[:1]['text'])
%%time
preds = learn.get_preds(dl=dl)
CPU times: user 33.9 ms, sys: 52.5 ms, total: 86.4 ms Wall time: 127 ms
preds
(tensor([[0.6171, 0.3829]]), None)
What about fastinference
?
!pip install fastinference --quiet
from fastinference.inference import *
%%time
out = learn.predict(df.iloc[0]['text'])
CPU times: user 22.9 ms, sys: 4.6 ms, total: 27.5 ms Wall time: 33.7 ms
out
[['negative'], array([[0.6171321, 0.3828678]], dtype=float32)]
%%time
preds = learn.get_preds(dl=dl)
CPU times: user 20.3 ms, sys: 52.6 ms, total: 72.9 ms Wall time: 105 ms
We can still shave off some time!
preds
[['negative'], array([[0.6171321, 0.3828678]], dtype=float32)]
And we still get our class labels back each time too.
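If we do want to deploy this classifier, even though ONNX is off the table we can still export the whole Learner the standard fastai way (a sketch; export pickles the model together with the DataLoaders transforms, and load_learner reloads it elsewhere):
learn.export('imdb_classifier.pkl')              # pickle the model + its transforms
inf_learn = load_learner('imdb_classifier.pkl')  # reload for inference later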