In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.learner import *

import torchtext
from torchtext import vocab, data
from torchtext.datasets import language_modeling

from fastai.rnn_reg import *
from fastai.rnn_train import *
from fastai.nlp import *
from fastai.lm_rnn import *

import dill as pickle
import spacy

Language modeling

Data

The large movie review dataset contains a collection of 50,000 reviews from IMDB. The dataset contains an even number of positive and negative reviews. The authors considered only highly polarized reviews. A negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. Neutral reviews are not included in the dataset. The dataset is divided into training and test sets of 25,000 labeled reviews each.

The sentiment classification task consists of predicting the polarity (positive or negative) of a given text.
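As a quick illustration of the labeling rule described above, here is a minimal sketch. The function name and the 'neg'/'pos' strings are our own choices for illustration, not something shipped with the dataset.

```python
def polarity(score):
    """Map a 1-10 review score to a sentiment label using the thresholds above."""
    if score <= 4:
        return 'neg'
    if score >= 7:
        return 'pos'
    return None  # neutral scores (5 and 6) are not included in the dataset

labels = [polarity(s) for s in (1, 4, 5, 7, 10)]
print(labels)
```

Scores of 5 and 6 map to None here, mirroring the fact that neutral reviews were left out.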

However, before we try to classify sentiment, we will simply try to create a language model; that is, a model that can predict the next word in a sentence. Why? Because our model first needs to understand the structure of English, before we can expect it to recognize positive vs negative sentiment.

So our plan of attack is the same as we used for Dogs v Cats: pretrain a model to do one thing (predict the next word), and then fine-tune it to do something else (classify sentiment).

Unfortunately, there are no good pretrained language models available to download, so we need to create our own. To follow along with this notebook, we suggest downloading the dataset from this location on files.fast.ai.

In [2]:
PATH='data/aclImdb/'

TRN_PATH = 'train/all/'
VAL_PATH = 'test/all/'
TRN = f'{PATH}{TRN_PATH}'
VAL = f'{PATH}{VAL_PATH}'

%ls {PATH}
imdbEr.txt  imdb.vocab  README  test/  train/

Let's look inside the training folder...

In [3]:
trn_files = !ls {TRN}
trn_files[:10]
Out[3]:
['0_0.txt',
 '0_3.txt',
 '0_9.txt',
 '10000_0.txt',
 '10000_4.txt',
 '10000_8.txt',
 '1000_0.txt',
 '10001_0.txt',
 '10001_10.txt',
 '10001_4.txt']

...and at an example review.

In [4]:
review = !cat {TRN}{trn_files[6]}
review[0]
Out[4]:
"I have to say when a name like Zombiegeddon and an atom bomb on the front cover I was expecting a flat out chop-socky fung-ku, but what I got instead was a comedy. So, it wasn't quite was I was expecting, but I really liked it anyway! The best scene ever was the main cop dude pulling those kids over and pulling a Bad Lieutenant on them!! I was laughing my ass off. I mean, the cops were just so bad! And when I say bad, I mean The Shield Vic Macky bad. But unlike that show I was laughing when they shot people and smoked dope.<br /><br />Felissa Rose...man, oh man. What can you say about that hottie. She was great and put those other actresses to shame. She should work more often!!!!! I also really liked the fight scene outside of the building. That was done really well. Lots of fighting and people getting their heads banged up. FUN! Last, but not least Joe Estevez and William Smith were great as the...well, I wasn't sure what they were, but they seemed to be having fun and throwing out lines. I mean, some of it didn't make sense with the rest of the flick, but who cares when you're laughing so hard! All in all the film wasn't the greatest thing since sliced bread, but I wasn't expecting that. It was a Troma flick so I figured it would totally suck. It's nice when something surprises you but not totally sucking.<br /><br />Rent it if you want to get stoned on a Friday night and laugh with your buddies. Don't rent it if you are an uptight weenie or want a zombie movie with lots of flesh eating.<br /><br />P.S. Uwe Boil was a nice touch."

Sounds like I'd really enjoy Zombiegeddon...

Now we'll check how many words are in the dataset.

In [5]:
!find {TRN} -name '*.txt' | xargs cat | wc -w
17486581
In [6]:
!find {VAL} -name '*.txt' | xargs cat | wc -w
5686719

Before we can analyze text, we must first tokenize it. This refers to the process of splitting a sentence into an array of words (or more generally, into an array of tokens).
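To make the idea concrete before bringing in a real tokenizer, here is a rough regex-based sketch. It is only an assumption-laden toy (the function name and the pattern are ours); a proper tokenizer such as spacy handles many more cases, and splits contractions differently.

```python
import re

def simple_tokenize(text):
    # A word (optionally with an internal apostrophe part), or any single non-space symbol
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\sA-Za-z0-9]", text)

tokens = simple_tokenize("So, it wasn't quite what I expected!")
print(tokens)
```

Note how punctuation becomes its own token, so "So," turns into two tokens rather than one.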

In [7]:
spacy_tok = spacy.load('en')
In [8]:
' '.join([sent.string.strip() for sent in spacy_tok(review[0])])
Out[8]:
"I have to say when a name like Zombiegeddon and an atom bomb on the front cover I was expecting a flat out chop - socky fung - ku , but what I got instead was a comedy . So , it was n't quite was I was expecting , but I really liked it anyway ! The best scene ever was the main cop dude pulling those kids over and pulling a Bad Lieutenant on them ! ! I was laughing my ass off . I mean , the cops were just so bad ! And when I say bad , I mean The Shield Vic Macky bad . But unlike that show I was laughing when they shot people and smoked dope.<br /><br />Felissa Rose ... man , oh man . What can you say about that hottie . She was great and put those other actresses to shame . She should work more often ! ! ! ! ! I also really liked the fight scene outside of the building . That was done really well . Lots of fighting and people getting their heads banged up . FUN ! Last , but not least Joe Estevez and William Smith were great as the ... well , I was n't sure what they were , but they seemed to be having fun and throwing out lines . I mean , some of it did n't make sense with the rest of the flick , but who cares when you 're laughing so hard ! All in all the film was n't the greatest thing since sliced bread , but I was n't expecting that . It was a Troma flick so I figured it would totally suck . It 's nice when something surprises you but not totally sucking.<br /><br />Rent it if you want to get stoned on a Friday night and laugh with your buddies . Do n't rent it if you are an uptight weenie or want a zombie movie with lots of flesh eating.<br /><br />P.S. Uwe Boil was a nice touch ."

We use PyTorch's torchtext library to preprocess our data, telling it to use the wonderful spaCy library to handle tokenization.

First, we create a torchtext field, which describes how to preprocess a piece of text - in this case, we tell torchtext to make everything lowercase, and tokenize it with spacy.

In [9]:
TEXT = data.Field(lower=True, tokenize="spacy")

fastai works closely with torchtext. We create a ModelData object for language modeling by taking advantage of LanguageModelData, passing it our torchtext field object, and the paths to our training, test, and validation sets. In this case, we don't have a separate test set, so we'll just use VAL_PATH for that too.

As well as the usual bs (batch size) parameter, we also now have bptt; this defines how many words are processed at a time in each row of the mini-batch. More importantly, it defines how many 'layers' we will backprop through. Making this number higher will increase time and memory requirements, but will improve the model's ability to handle long sentences.

In [10]:
bs=64; bptt=70
In [10]:
FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)

Once our ModelData object is built, the TEXT object is automatically filled with a very important attribute: TEXT.vocab. This is a vocabulary, which stores which words (or tokens) have been seen in the text, and how each word will be mapped to a unique integer id. We'll need to use this information again later, so we save it.

(Technical note: Python's standard pickle library can't handle this correctly, so at the top of this notebook we imported the dill library under the name pickle and use that instead.)

In [ ]:
pickle.dump(TEXT, open(f'{PATH}models/TEXT.pkl','wb'))

Here are the: # batches; # unique tokens in the vocab; # items in the training dataset (just one, as explained below); # tokens in that item (i.e. in the whole training set)

In [11]:
len(md.trn_dl), md.nt, len(md.trn_ds), len(md.trn_ds[0].text)
Out[11]:
(4602, 34945, 1, 20621966)

This is the start of the mapping from integer IDs to unique tokens.

In [12]:
# 'itos': 'int-to-string'
TEXT.vocab.itos[:12]
Out[12]:
['<unk>', '<pad>', 'the', ',', '.', 'and', 'a', 'of', 'to', 'is', 'it', 'in']
In [13]:
# 'stoi': 'string to int'
TEXT.vocab.stoi['the']
Out[13]:
2
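If you're curious what a vocabulary like this looks like under the hood, here is a minimal pure-Python sketch. It is not torchtext's implementation; build_vocab and its tie-breaking rule are assumptions for illustration. It shows the same ingredients: special tokens first, then tokens ordered by frequency, with rare tokens (below min_freq) dropped and later mapped to '<unk>'.

```python
from collections import Counter

def build_vocab(tokens, min_freq=1, specials=('<unk>', '<pad>')):
    counts = Counter(tokens)
    # Most frequent first; ties broken alphabetically so the result is deterministic
    kept = sorted((t for t, c in counts.items() if c >= min_freq),
                  key=lambda t: (-counts[t], t))
    itos = list(specials) + kept                 # int-to-string
    stoi = {t: i for i, t in enumerate(itos)}    # string-to-int
    return itos, stoi

toy = "the movie was good , the plot was thin , the end".split()
itos, stoi = build_vocab(toy, min_freq=2)
# Tokens that were dropped (or never seen) fall back to the '<unk>' id
ids = [stoi.get(t, stoi['<unk>']) for t in "the movie was bad".split()]
print(itos)
print(ids)
```

With min_freq=2, 'movie' is too rare to keep, so it ends up sharing id 0 with the unseen word 'bad'.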

Note that in a LanguageModelData object there is only one item in each dataset: all the words of the text joined together.

In [14]:
md.trn_ds[0].text[:12]
Out[14]:
['i',
 'have',
 'always',
 'loved',
 'this',
 'story',
 '-',
 'the',
 'hopeful',
 'theme',
 ',',
 'the']

torchtext will handle turning these words into integer IDs for us automatically.

In [15]:
TEXT.numericalize([md.trn_ds[0].text[:12]])
Out[15]:
Variable containing:
   12
   35
  227
  480
   13
   76
   17
    2
 7319
  769
    3
    2
[torch.cuda.LongTensor of size 12x1 (GPU 0)]

Our LanguageModelData object will create batches with 64 columns (that's our batch size), and varying sequence lengths of around 70 tokens (that's our bptt parameter - backprop through time).

Each batch also contains the exact same data as labels, but one word later in the text - since we're trying to always predict the next word. The labels are flattened into a 1d array.

In [16]:
next(iter(md.trn_dl))
Out[16]:
(Variable containing:
     12    567      3  ...    2118      4   2399
     35      7     33  ...       6    148     55
    227    103    533  ...    4892     31     10
         ...            ⋱           ...         
     19   8879     33  ...      41     24    733
    552   8250     57  ...     219     57   1777
      5     19      2  ...    3099      8     48
 [torch.cuda.LongTensor of size 77x64 (GPU 0)], Variable containing:
     35
      7
     33
   ⋮   
     22
   3885
  21587
 [torch.cuda.LongTensor of size 4928 (GPU 0)])
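The layout above can be reproduced on a toy scale. The sketch below is a simplified assumption of how such batches could be built (batchify and get_batch are hypothetical helpers, not fastai functions): the token stream is cut into bs columns, a batch is a run of consecutive rows, and the labels are the same rows shifted one step later and flattened. The real loader also varies the sequence length a little around bptt, which this sketch leaves out.

```python
def batchify(ids, bs):
    """Split one long token stream into bs parallel columns (rows are time steps)."""
    n = len(ids) // bs                                    # time steps per column; leftovers are dropped
    cols = [ids[i * n:(i + 1) * n] for i in range(bs)]
    return [[col[t] for col in cols] for t in range(n)]   # n rows x bs columns

def get_batch(rows, start, bptt):
    """Return (inputs, flattened labels); labels are the inputs shifted by one time step."""
    seq_len = min(bptt, len(rows) - 1 - start)
    x = rows[start:start + seq_len]
    y = [tok for row in rows[start + 1:start + 1 + seq_len] for tok in row]
    return x, y

rows = batchify(list(range(20)), bs=2)   # column 0 holds 0..9, column 1 holds 10..19
x, y = get_batch(rows, start=0, bptt=3)
print(x)
print(y)
```

Just as in the real output, the first few labels are the second row of the inputs read left to right.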

Train

We have a number of parameters to set - we'll learn more about these later, but you should find these values suitable for many problems.

In [11]:
em_sz = 200  # size of each embedding vector
nh = 500     # number of hidden activations per layer
nl = 3       # number of layers

Researchers have found that large amounts of momentum (which we'll learn about later) don't work well with these kinds of RNN models, so we create a version of the Adam optimizer with less momentum than its default of 0.9.

In [12]:
opt_fn = partial(optim.Adam, betas=(0.7, 0.99))
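To get a feel for what "less momentum" means, here is a small sketch of the exponential moving average of gradients that the first beta controls in Adam. The toy gradient sequence is made up, and bias correction is left out for simplicity; the point is only to compare how quickly the average reacts when the gradient changes direction.

```python
def ema(values, beta):
    """Exponential moving average (no bias correction), like Adam's first-moment estimate."""
    avg, out = 0.0, []
    for v in values:
        avg = beta * avg + (1 - beta) * v
        out.append(avg)
    return out

grads = [1.0] * 10 + [-1.0] * 10      # toy gradients: the direction flips half-way through
flip_steps = {}
for beta in (0.9, 0.7):
    trace = ema(grads, beta)
    # Number of steps after the flip before the averaged gradient turns negative
    flip_steps[beta] = next(i for i, a in enumerate(trace[10:], start=1) if a < 0)
print(flip_steps)
```

On this toy sequence the average takes 5 steps to change sign with beta=0.9 but only 2 with beta=0.7, so the lower value follows recent gradients more closely.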

fastai uses a variant of the state of the art AWD LSTM Language Model developed by Stephen Merity. A key feature of this model is that it provides excellent regularization through Dropout. There is no simple way known (yet!) to find the best values of the dropout parameters below - you just have to experiment...

However, the other parameters (alpha, beta, and clip) shouldn't generally need tuning.

In [19]:
learner = md.get_model(opt_fn, em_sz, nh, nl,
               dropouti=0.05, dropout=0.05, wdrop=0.1, dropoute=0.02, dropouth=0.05)
learner.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
learner.clip=0.3

As you can see below, I gradually tuned the language model in a few stages. I possibly could have trained it further (it wasn't yet overfitting), but I didn't have time to experiment more. Maybe you can see if you can train it to a better accuracy! (I used lr_find to find a good learning rate, but didn't save the output in this notebook. Feel free to try running it yourself now.)

In [30]:
learner.fit(3e-3, 4, wds=1e-6, cycle_len=1, cycle_mult=2)
[ 0.       4.85167  4.72509]                                    
[ 1.       4.65204  4.51418]                                  
[ 2.       4.52936  4.43176]                                  
[ 3.       4.57711  4.45321]                                  
[ 4.       4.49827  4.37943]                                  
[ 5.       4.41825  4.32227]                                  
[ 6.       4.40372  4.30466]                                  
[ 7.       4.52163  4.39423]                                  
[ 8.       4.48485  4.36614]                                  
[ 9.       4.43876  4.33174]                                  
[ 10.        4.40153   4.30196]                               
[ 11.        4.38985   4.27407]                               
[ 12.        4.31973   4.24876]                               
[ 13.        4.29297   4.2362 ]                               
[ 14.        4.31048   4.23348]                               
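If you're wondering why 4 cycles produced 15 epochs, the sketch below shows the arithmetic, assuming the cycle length is multiplied by cycle_mult after every cycle (cycle_lengths is a hypothetical helper, not part of fastai).

```python
def cycle_lengths(n_cycles, cycle_len, cycle_mult=1):
    """Epochs in each restart cycle; the length is multiplied by cycle_mult after every cycle."""
    return [cycle_len * cycle_mult ** i for i in range(n_cycles)]

lengths = cycle_lengths(4, cycle_len=1, cycle_mult=2)
total_epochs = sum(lengths)
print(lengths, total_epochs)
```

That gives cycles of 1, 2, 4 and 8 epochs. In the log above, the validation loss ticks up at epochs 3 and 7, which is where the second and third restarts would fall under this reading.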

In [ ]:
learner.save_encoder('adam1_enc')
In [20]:
learner.load_encoder('adam1_enc')
In [22]:
learner.load_cycle('adam3_10',2)
In [23]:
learner.fit(3e-3, 1, wds=1e-6, cycle_len=10)
[ 0.      4.3926  4.2917]                                       
[ 1.       4.37693  4.28255]                                  
[ 2.       4.37998  4.27243]                                  
[ 3.       4.34284  4.24789]                                  
[ 4.      4.3287  4.2317]                                     
[ 5.       4.28881  4.20722]                                  
[ 6.       4.24637  4.18926]                                  
[ 7.       4.23797  4.17644]                                  
[ 8.       4.20074  4.16989]                                  
[ 9.       4.18873  4.16866]                                  

In [24]:
learner.save_encoder('adam3_10_enc')

In the sentiment analysis section, we'll just need half of the language model - the encoder, so we save that part.

In [25]:
learner.save_encoder('adam3_20_enc')
In [26]:
learner.load_encoder('adam3_20_enc')

Language modeling accuracy is generally measured using the metric perplexity, which is simply exp() of the loss function we used.

In [27]:
math.exp(4.165)
Out[27]:
64.3926824434624
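Here is a minimal sketch of where that number comes from, assuming the loss is the average negative log-likelihood of the correct next token (as with cross-entropy). The probabilities below are made up for illustration.

```python
import math

def perplexity(token_probs):
    """exp() of the mean negative log-likelihood the model gave to the correct next tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that always spreads its bets evenly over 4 candidate words
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
print(uniform)
```

One way to read the result: a perplexity of about 4 means the model is, on average, as uncertain as if it were choosing uniformly among 4 words. Lower is better.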
In [ ]:
pickle.dump(TEXT, open(f'{PATH}models/TEXT.pkl','wb'))

Test

We can play around with our language model a bit to check it seems to be working OK. First, let's create a short bit of text to 'prime' a set of predictions. We'll use our torchtext field to numericalize it so we can feed it to our language model.

In [28]:
m=learner.model
ss=""". So, it wasn't quite was I was expecting, but I really liked it anyway! The best"""
s = [[str(tok) for tok in spacy_tok(ss)]]  # plain token strings, so join() and numericalize() both work
t=TEXT.numericalize(s)
' '.join(s[0])
Out[28]:
". So , it was n't quite was I was expecting , but I really liked it anyway ! The best"

We haven't yet added methods to make it easy to test a language model, so we'll need to manually go through the steps.

In [29]:
# Set batch size to 1
m[0].bs=1
# Turn off dropout
m.eval()
# Reset hidden state
m.reset()
# Get predictions from model
res,*_ = m(t)
# Put the batch size back to what it was
m[0].bs=bs

Let's see what the top 10 predictions were for the next word after our short text:

In [30]:
nexts = torch.topk(res[-1], 10)[1]
[TEXT.vocab.itos[o] for o in to_np(nexts)]
Out[30]:
['film',
 'movie',
 'of',
 'thing',
 'part',
 '<unk>',
 'performance',
 'scene',
 ',',
 'actor']
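In case the topk step looks mysterious, here is a pure-Python sketch of the same idea: rank the vocabulary by score and look up the best few ids in itos. The scores and the tiny vocabulary are invented for illustration.

```python
import heapq

def topk_indices(scores, k):
    """Indices of the k largest scores, best first."""
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)

toy_itos = ['<unk>', 'film', 'of', 'movie', 'thing']   # toy vocabulary
toy_scores = [0.1, 2.5, -1.0, 1.7, 0.3]                # made-up score for each vocab entry
best = [toy_itos[i] for i in topk_indices(toy_scores, 3)]
print(best)
```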

...and let's see if our model can generate a bit more text all by itself!

In [31]:
print(ss,"\n")
for i in range(50):
    n=res[-1].topk(2)[1]
    n = n[1] if n.data[0]==0 else n[0]
    print(TEXT.vocab.itos[n.data[0]], end=' ')
    res,*_ = m(n[0].unsqueeze(0))
print('...')
. So, it wasn't quite was I was expecting, but I really liked it anyway! The best 

film ever ! <eos> i saw this movie at the toronto international film festival . i was very impressed . i was very impressed with the acting . i was very impressed with the acting . i was surprised to see that the actors were not in the movie . ...
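The generation loop above boils down to: pick the most likely next token (skipping '<unk>'), feed it back in, repeat. Here is a toy sketch of that loop. The score table is a made-up stand-in for the model, and unlike the real RNN it only looks at the current token rather than carrying a hidden state.

```python
def pick_next(scores, unk_id=0):
    """Greedy choice: the best-scoring token id, or the runner-up if the best is <unk>."""
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    return ranked[1] if ranked[0] == unk_id else ranked[0]

itos = ['<unk>', 'the', 'best', 'film', 'ever', '!']
# Hypothetical next-token scores, one row per current token id
toy_scores = [
    [0, 1, 0, 0, 0, 0],   # after '<unk>'
    [3, 0, 2, 1, 0, 0],   # after 'the' ('<unk>' scores highest, so 'best' is used instead)
    [0, 0, 0, 5, 1, 0],   # after 'best'
    [0, 0, 0, 0, 4, 1],   # after 'film'
    [0, 0, 0, 0, 0, 2],   # after 'ever'
    [0, 1, 0, 0, 0, 0],   # after '!'
]

tok, out = 1, []          # prime with 'the'
for _ in range(4):
    tok = pick_next(toy_scores[tok])
    out.append(itos[tok])
generated = ' '.join(out)
print(generated)
```

Always taking the single best token is also why the real output above falls into repeating itself; sampling from the predictions would give more varied text.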

Sentiment

We'll need the saved vocab from the language model, since we need to ensure the same words map to the same IDs.

In [5]:
TEXT = pickle.load(open(f'{PATH}models/TEXT.pkl','rb'))

sequential=False tells torchtext that a text field should not be tokenized (in this case, we just want to store the 'positive' or 'negative' single label).

splits is a torchtext method that creates train, test, and validation sets. The IMDB dataset is built into torchtext, so we can take advantage of that. Take a look at lang_model-arxiv.ipynb to see how to define your own fastai/torchtext datasets.

In [6]:
IMDB_LABEL = data.Field(sequential=False)
splits = torchtext.datasets.IMDB.splits(TEXT, IMDB_LABEL, 'data/')
In [7]:
t = splits[0].examples[0]
In [8]:
t.label, ' '.join(t.text[:16])
Out[8]:
('pos',
 'this was another great tom berenger movie .. but some people are right it was like')

fastai can create a ModelData object directly from torchtext splits.

In [13]:
md2 = TextData.from_splits(PATH, splits, bs)
In [15]:
m3 = md2.get_model(opt_fn, 1500, bptt, emb_sz=em_sz, n_hid=nh, n_layers=nl, 
           dropout=0.1, dropouti=0.4, wdrop=0.5, dropoute=0.05, dropouth=0.3)
m3.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
m3.load_encoder('adam3_20_enc')

Because we're fine-tuning a pretrained model, we'll use differential learning rates, and also increase the max gradient for clipping, to allow the SGDR to work better.

In [20]:
m3.clip=25.
lrs=np.array([1e-4,1e-4,1e-4,1e-3,1e-2])
In [31]:
m3.freeze_to(-1)
m3.fit(lrs/2, 1, metrics=[accuracy])
m3.unfreeze()
m3.fit(lrs, 1, metrics=[accuracy], cycle_len=1)
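Here is a rough sketch of how the pieces above fit together, assuming each entry in lrs belongs to one layer group (earliest layers first) and that freezing to index n leaves only the groups from n onwards trainable. The group names and the helper are hypothetical, purely for illustration.

```python
def trainable_groups(groups, lrs, freeze_to=0):
    """Pair each layer group with its learning rate, keeping only groups from index freeze_to on."""
    return dict(list(zip(groups, lrs))[freeze_to:])

lrs = [1e-4, 1e-4, 1e-4, 1e-3, 1e-2]
groups = [f'group_{i}' for i in range(len(lrs))]   # hypothetical names, earliest layers first

last_only = trainable_groups(groups, lrs, freeze_to=-1)   # like freezing to -1: only the last group trains
everything = trainable_groups(groups, lrs)                # like unfreezing: every group trains at its own rate
print(last_only)
print(everything)
```

The earliest groups get the smallest rates, since the pretrained encoder weights should need only gentle adjustment compared with the freshly added final layers.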
In [40]:
m3.fit(lrs, 7, metrics=[accuracy], cycle_len=2, cycle_save_name='imdb2')
[ 0.       0.29053  0.18292  0.93241]                        
[ 1.       0.24058  0.18233  0.93313]                        
[ 2.       0.24244  0.17261  0.93714]                        
[ 3.       0.21166  0.17143  0.93866]                        
[ 4.       0.2062   0.17143  0.94042]                        
[ 5.       0.18951  0.16591  0.94083]                        
[ 6.       0.20527  0.16631  0.9393 ]                        
[ 7.       0.17372  0.16162  0.94159]                        
[ 8.       0.17434  0.17213  0.94063]                        
[ 9.       0.16285  0.16073  0.94311]                        
[ 10.        0.16327   0.17851   0.93998]                    
[ 11.        0.15795   0.16042   0.94267]                    
[ 12.        0.1602    0.16015   0.94199]                    
[ 13.        0.15503   0.1624    0.94171]                    

In [41]:
m3.load_cycle('imdb2', 4)
In [42]:
accuracy_np(*m3.predict_with_targs())
Out[42]:
0.94310897435897434

A recent paper from McCann et al, Learned in Translation: Contextualized Word Vectors, has a handy summary of the latest academic research in solving this IMDB sentiment analysis problem. Many of the latest algorithms shown are tuned for this specific problem.

[Image: summary of published IMDB sentiment analysis results from the paper]

As you see, we just got a new state of the art result in sentiment analysis, decreasing the error from 5.9% to 5.5%! You should be able to get similarly world-class results on other NLP classification problems using the same basic steps.

There are many opportunities to further improve this, although we won't be able to get to them until part 2 of this course...

End

In [ ]: