At fast.ai we have introduced a new module called fastai.text, which replaces the torchtext library used in our 2018 dl1 course. The fastai.text module also supersedes the fastai.nlp library but retains many of its key functions.
from fastai.text import *
import html
The fastai.text module introduces several custom tokens.
We need to download the IMDB large movie review dataset from http://ai.stanford.edu/~amaas/data/sentiment/ and untar it into the PATH location. We use pathlib, which makes directory traversal a breeze.
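As a minimal sketch of why pathlib is convenient here (this builds a tiny throw-away tree rather than the real download; the filename follows the `[id]_[rating].txt` convention described in the IMDb README):

```python
from pathlib import Path
import tempfile

# Illustrative only: build a miniature aclImdb-style tree and walk it with pathlib
root = Path(tempfile.mkdtemp()) / 'aclImdb'
(root / 'train' / 'pos').mkdir(parents=True)
(root / 'train' / 'pos' / '200_8.txt').write_text('great movie')

# The / operator joins paths; glob finds review files without os.path boilerplate
files = list((root / 'train' / 'pos').glob('*.txt'))
print(files[0].name, files[0].read_text())
```

The same `path/label` joining and `.glob()` pattern appears in the `get_texts` helper below.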
===================================== (START) Download IMDb data =====================================
%mkdir data/aclImdb
%cd data/aclImdb
/home/ubuntu/data/aclImdb
!aria2c --file-allocation=none -c -x 5 -s 5 http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
[#6acf06 79MiB/80MiB(99%) CN:1 DL:14MiB] 06/26 15:59:49 [NOTICE] Download complete: /home/ubuntu/data/aclImdb/aclImdb_v1.tar.gz Download Results: gid |stat|avg speed |path/URI ======+====+===========+======================================================= 6acf06|OK | 14MiB/s|/home/ubuntu/data/aclImdb/aclImdb_v1.tar.gz Status Legend: (OK):download completed.
!tar -zxf aclImdb_v1.tar.gz -C .
%cd ../..
%rm data/aclImdb/aclImdb_v1.tar.gz
/home/ubuntu
%mv data/aclImdb/ data/aclImdb2
%mv data/aclImdb2/aclImdb data/
%rm -rf data/aclImdb2
PATH = Path('data/aclImdb/')
!ls -lah {PATH}
total 1.7M drwxr-xr-x 4 ubuntu ubuntu 4.0K Jun 26 2011 . drwxrwxr-x 8 ubuntu ubuntu 4.0K Jun 26 16:17 .. -rw-r--r-- 1 ubuntu ubuntu 882K Jun 11 2011 imdbEr.txt -rw-r--r-- 1 ubuntu ubuntu 827K Apr 12 2011 imdb.vocab -rw-r--r-- 1 ubuntu ubuntu 4.0K Jun 26 2011 README drwxr-xr-x 4 ubuntu ubuntu 4.0K Jun 26 16:02 test drwxr-xr-x 5 ubuntu ubuntu 4.0K Jun 26 16:02 train
===================================== (END) Download IMDb data =====================================
BOS = 'xbos' # beginning-of-sentence tag
FLD = 'xfld' # data field tag
CLAS_PATH = Path('data/imdb_clas/')
CLAS_PATH.mkdir(exist_ok=True)
!ls data
aclImdb dogscats dogscats.zip imdb_clas pascal spellbee
LM_PATH = Path('data/imdb_lm/')
LM_PATH.mkdir(exist_ok=True)
!ls data
aclImdb dogscats dogscats.zip imdb_clas imdb_lm pascal spellbee
The IMDb dataset has 3 classes: positive, negative and unsupervised (sentiment is unknown). There are 75k training reviews (12.5k pos, 12.5k neg, 50k unsup) and 25k validation reviews (12.5k pos, 12.5k neg, no unsup).
Refer to the README file in the IMDb corpus for further information about the dataset.
!cat data/aclImdb/README
Large Movie Review Dataset v1.0 Overview This dataset contains movie reviews along with their associated binary sentiment polarity labels. It is intended to serve as a benchmark for sentiment classification. This document outlines how the dataset was gathered, and how to use the files provided. Dataset The core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets. The overall distribution of labels is balanced (25k pos and 25k neg). We also include an additional 50,000 unlabeled documents for unsupervised learning. In the entire collection, no more than 30 reviews are allowed for any given movie because reviews for the same movie tend to have correlated ratings. Further, the train and test sets contain a disjoint set of movies, so no significant performance is obtained by memorizing movie-unique terms and their associated with observed labels. In the labeled train/test sets, a negative review has a score <= 4 out of 10, and a positive review has a score >= 7 out of 10. Thus reviews with more neutral ratings are not included in the train/test sets. In the unsupervised set, reviews of any rating are included and there are an even number of reviews > 5 and <= 5. Files There are two top-level directories [train/, test/] corresponding to the training and test sets. Each contains [pos/, neg/] directories for the reviews with binary labels positive and negative. Within these directories, reviews are stored in text files named following the convention [[id]_[rating].txt] where [id] is a unique id and [rating] is the star rating for that review on a 1-10 scale. For example, the file [test/pos/200_8.txt] is the text for a positive-labeled test set example with unique id 200 and star rating 8/10 from IMDb. The [train/unsup/] directory has 0 for all ratings because the ratings are omitted for this portion of the dataset. We also include the IMDb URLs for each review in a separate [urls_[pos, neg, unsup].txt] file. 
A review with unique id 200 will have its URL on line 200 of this file. Due the ever-changing IMDb, we are unable to link directly to the review, but only to the movie's review page. In addition to the review text files, we include already-tokenized bag of words (BoW) features that were used in our experiments. These are stored in .feat files in the train/test directories. Each .feat file is in LIBSVM format, an ascii sparse-vector format for labeled data. The feature indices in these files start from 0, and the text tokens corresponding to a feature index is found in [imdb.vocab]. So a line with 0:7 in a .feat file means the first word in [imdb.vocab] (the) appears 7 times in that review. LIBSVM page for details on .feat file format: http://www.csie.ntu.edu.tw/~cjlin/libsvm/ We also include [imdbEr.txt] which contains the expected rating for each token in [imdb.vocab] as computed by (Potts, 2011). The expected rating is a good way to get a sense for the average polarity of a word in the dataset. Citing the dataset When using this dataset please cite our ACL 2011 paper which introduces it. This paper also contains classification results which you may want to compare against. @InProceedings{maas-EtAl:2011:ACL-HLT2011, author = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher}, title = {Learning Word Vectors for Sentiment Analysis}, booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies}, month = {June}, year = {2011}, address = {Portland, Oregon, USA}, publisher = {Association for Computational Linguistics}, pages = {142--150}, url = {http://www.aclweb.org/anthology/P11-1015} } References Potts, Christopher. 2011. On the negativity of negation. In Nan Li and David Lutz, eds., Proceedings of Semantics and Linguistic Theory 20, 636-659. Contact For questions/comments/corrections please contact Andrew Maas amaas@cs.stanford.edu
CLASSES = ['neg', 'pos', 'unsup']
def get_texts(path):
texts, labels = [], []
for idx, label in enumerate(CLASSES):
for fname in (path/label).glob('*.*'):
texts.append(fname.open('r').read())
labels.append(idx)
return np.array(texts), np.array(labels)
trn_texts, trn_labels = get_texts(PATH / 'train')
val_texts, val_labels = get_texts(PATH / 'test')
len(trn_texts), len(val_texts)
(75000, 25000)
col_names = ['labels', 'text']
We use a random permutation numpy array to shuffle the text reviews.
np.random.seed(42)
trn_idx = np.random.permutation(len(trn_texts))
val_idx = np.random.permutation(len(val_texts))
trn_texts = trn_texts[trn_idx]
val_texts = val_texts[val_idx]
trn_labels = trn_labels[trn_idx]
val_labels = val_labels[val_idx]
df_trn = pd.DataFrame({ 'text': trn_texts, 'labels': trn_labels }, columns=col_names)
df_val = pd.DataFrame({ 'text': val_texts, 'labels': val_labels }, columns=col_names)
# DEBUG
# View train df
df_trn.head()
 | labels | text |
---|---|---|
0 | 2 | A group of filmmakers (College Students?) deci... |
1 | 0 | Sequels have a nasty habit of being disappoint... |
2 | 1 | In a future society, the military component do... |
3 | 2 | Imagine Albert Finney, one of the great ham bo... |
4 | 2 | I bought this DVD for $2.00 at the local varie... |
# DEBUG
# View validation df
df_val.head()
 | labels | text |
---|---|---|
0 | 1 | Every year there's one can't-miss much-anticip... |
1 | 1 | I don't usually like this sort of movie but wa... |
2 | 1 | Great movie in a Trainspotting style... Being ... |
3 | 0 | New rule. Nobody is allowed to make any more Z... |
4 | 0 | I saw this movie (unfortunately) because it wa... |
The pandas dataframe is used to store the text data in an evolving standard format: the label column(s) first, followed by the text column(s). This format was influenced by a paper by Yann LeCun (LINK REQUIRED), and fastai adopts it for NLP datasets. In the case of IMDB, there is only one text column.
# we remove everything that has a label of 2 (`df_trn['labels'] != 2`) because a label of 2 means "unsupervised" and the classifier can't use it.
df_trn[df_trn['labels'] != 2].to_csv(CLAS_PATH / 'train.csv', header=False, index=False)
df_val.to_csv(CLAS_PATH / 'test.csv', header=False, index=False)
(CLAS_PATH / 'classes.txt').open('w').writelines(f'{o}\n' for o in CLASSES)
We start by creating the data for the Language Model (LM). The LM's goal is to learn the structure of the English language. It learns by trying to predict the next word given a set of previous words (n-grams). Since the LM does not classify reviews, the labels can be ignored.
The LM can benefit from all the textual data and there is no need to exclude the unsup/unclassified movie reviews.
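The "predict the next word" setup can be sketched as shifted input/target pairs over a numericalized token stream (the ids below are made up for illustration, loosely echoing the numericalized output later in the notebook):

```python
import numpy as np

# Illustrative only: a language model is trained to predict token t[i+1]
# from the tokens up to and including t[i].
stream = np.array([40, 41, 42, 39, 106, 7])  # a few token ids

# Inputs and targets are simply the stream shifted by one position
x, y = stream[:-1], stream[1:]
print(x.tolist())  # inputs
print(y.tolist())  # targets: each is the "next word" for the matching input
```

This is why unlabeled (unsup) reviews are still useful here: every token in every review provides a training target.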
We first concatenate all the train (pos/neg/unsup = 75k) and test (pos/neg = 25k) reviews into one big chunk of 100k reviews. We then use sklearn's train_test_split to divide the 100k texts into a 90% training set and a 10% validation set.
trn_texts, val_texts = sklearn.model_selection.train_test_split(
np.concatenate([trn_texts, val_texts]), test_size=0.1)
len(trn_texts), len(val_texts)
(90000, 10000)
df_trn = pd.DataFrame({ 'text': trn_texts, 'labels': [0] * len(trn_texts) }, columns=col_names)
df_val = pd.DataFrame({ 'text': val_texts, 'labels': [0] * len(val_texts) }, columns=col_names)
df_trn.to_csv(LM_PATH / 'train.csv', header=False, index=False)
df_val.to_csv(LM_PATH / 'test.csv', header=False, index=False)
In this section, we start cleaning up the messy text. There are 2 main activities we need to perform: tokenization and numericalization.
Tokenization is the process of splitting the text into separate tokens so that each token can be assigned a unique index. This lets us convert the text into the integer indexes our models can use.
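The token-to-index idea can be sketched with a toy whitespace tokenizer (purely illustrative; the notebook uses spaCy via fastai's Tokenizer, which handles punctuation, contractions and special cases properly):

```python
# Illustrative only: split on whitespace and assign each unique token an index
text = "the movie was great . the plot was not ."
tokens = text.split()
vocab = sorted(set(tokens))                      # unique tokens
index = {tok: i for i, tok in enumerate(vocab)}  # token -> integer index
ids = [index[t] for t in tokens]
print(vocab)
print(ids)
```

The real pipeline below does the same two steps at scale: tokenize with spaCy, then map tokens to indexes via the `itos`/`stoi` vocab.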
We use an appropriate chunksize, as the tokenization process is memory intensive.
chunksize = 24000
Before we pass the text to spaCy, we write a simple fixup function. Every dataset we have looked at (about a dozen in building this) had different weird things that needed to be replaced, so here are all the ones we have come up with so far; hopefully they will help you as well. All HTML entities are unescaped and a bunch of other oddities are replaced. Have a look at the result of running this on your own text and make sure there are no more weird tokens in there.
re1 = re.compile(r' +')
def fixup(x):
x = x.replace('#39;', "'").replace('amp;', '&').replace('#146;', "'").replace(
'nbsp;', ' ').replace('#36;', '$').replace('\\n', "\n").replace('quot;', "'").replace(
'<br />', "\n").replace('\\"', '"').replace('<unk>', 'u_n').replace(' @.@ ', '.').replace(
' @-@ ', '-').replace('\\', ' \\ ')
return re1.sub(' ', html.unescape(x))
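A quick sanity check on a made-up snippet (the fixup definition is repeated here so the example runs standalone; the input string is invented for illustration):

```python
import html
import re

re1 = re.compile(r' +')

def fixup(x):
    # Same replacements as in the notebook's fixup
    x = (x.replace('#39;', "'").replace('amp;', '&').replace('#146;', "'")
          .replace('nbsp;', ' ').replace('#36;', '$').replace('\\n', "\n")
          .replace('quot;', "'").replace('<br />', "\n").replace('\\"', '"')
          .replace('<unk>', 'u_n').replace(' @.@ ', '.').replace(' @-@ ', '-')
          .replace('\\', ' \\ '))
    return re1.sub(' ', html.unescape(x))

raw = "Don#39;t miss it!<br /><br />It   was quot;greatquot; fun."
print(fixup(raw))
```

The broken `#39;` entity becomes an apostrophe, `<br />` pairs become paragraph breaks, and runs of spaces collapse to one.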
def get_texts(df, n_lbls=1):
labels = df.iloc[:, range(n_lbls)].values.astype(np.int64)
texts = f'\n{BOS} {FLD} 1 ' + df[n_lbls].astype(str)
for i in range(n_lbls + 1, len(df.columns)):
texts += f' {FLD} {i - n_lbls} ' + df[i].astype(str)
texts = texts.apply(fixup).values.astype(str)
tok = Tokenizer().proc_all_mp(partition_by_cores(texts))
return tok, list(labels)
def get_all(df, n_lbls):
tok, labels = [], []
for i, r in enumerate(df):
print(i)
tok_, labels_ = get_texts(r, n_lbls)
tok += tok_
labels += labels_
return tok, labels
df_trn = pd.read_csv(LM_PATH / 'train.csv', header=None, chunksize=chunksize)
df_val = pd.read_csv(LM_PATH / 'test.csv', header=None, chunksize=chunksize)
# tok_trn, trn_labels = get_all(df_trn, 1)
# tok_val, val_labels = get_all(df_val, 1)
0
--------------------------------------------------------------------------- OSError Traceback (most recent call last) <ipython-input-51-51e26f4b98a3> in <module>() ----> 1 tok_trn, trn_labels = get_all(df_trn, 1) 2 tok_val, val_labels = get_all(df_val, 1) <ipython-input-48-bfe25ce1655c> in get_all(df, n_lbls) 4 for i, r in enumerate(df): 5 print(i) ----> 6 tok_, labels_ = get_texts(r, n_lbls) 7 tok += tok_ 8 labels += labels_ <ipython-input-47-d4a58d702615> in get_texts(df, n_lbls) 6 texts = texts.apply(fixup).values.astype(str) 7 ----> 8 tok = Tokenizer().proc_all_mp(partition_by_cores(texts)) 9 return tok, list(labels) ~/fastai/courses/dl2/fastai/text.py in __init__(self, lang) 44 def __init__(self, lang='en'): 45 self.re_br = re.compile(r'<\s*br\s*/?>', re.IGNORECASE) ---> 46 self.tok = spacy.load(lang) 47 for w in ('<eos>','<bos>','<unk>'): 48 self.tok.tokenizer.add_special_case(w, [{ORTH: w}]) ~/anaconda3/envs/fastai/lib/python3.6/site-packages/spacy/__init__.py in load(name, **overrides) 13 if depr_path not in (True, False, None): 14 deprecation_warning(Warnings.W001.format(path=depr_path)) ---> 15 return util.load_model(name, **overrides) 16 17 ~/anaconda3/envs/fastai/lib/python3.6/site-packages/spacy/util.py in load_model(name, **overrides) 117 elif hasattr(name, 'exists'): # Path or Path-like to model data 118 return load_model_from_path(name, **overrides) --> 119 raise IOError(Errors.E050.format(name=name)) 120 121 OSError: [E050] Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
Fix spaCy issue above:
!python -m spacy download en
Collecting https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB) 100% |████████████████████████████████| 37.4MB 81.7MB/s ta 0:00:01 7% |██▎ | 2.6MB 611kB/s eta 0:00:57 Installing collected packages: en-core-web-sm Running setup.py install for en-core-web-sm ... done Successfully installed en-core-web-sm-2.0.0 You are using pip version 9.0.3, however version 10.0.1 is available. You should consider upgrading via the 'pip install --upgrade pip' command. Linking successful /home/ubuntu/anaconda3/envs/fastai/lib/python3.6/site-packages/en_core_web_sm --> /home/ubuntu/anaconda3/envs/fastai/lib/python3.6/site-packages/spacy/data/en You can now load the model via spacy.load('en')
# Re-run these 2 lines of code.
tok_trn, trn_labels = get_all(df_trn, 1)
tok_val, val_labels = get_all(df_val, 1)
0 1 2 3 0
(LM_PATH / 'tmp').mkdir(exist_ok=True)
Testing
type(tok_trn), len(tok_trn)
(list, 90000)
tok_trn[0]
['\n', 'xbos', 'xfld', '1', 'first', 'of', 'all', 'jan', 'guillou', 'is', 'a', 'fantastic', 'writer', '.', 'but', 'even', 'so', ',', 'i', 'have', 'not', 'read', 'his', '"', 'arn', '"-', 'series', 'books', '.', 'as', 'i', 'have', 'great', 'love', 'and', 'respect', 'for', 'guillou', ',', 'i', 'had', 'high', 'expectations', 'for', 'this', 'movie', '.', 'also', ',', 'a', 'good', 'friend', 'of', 'mine', '(', 'student', 'in', 'university', 'reading', 'history', ')', 'had', 'read', 'and', 'recommended', 'this', 'book', 'strongly', '.', 'perhaps', 'the', 'director', 'could', "n't", 'catch', 'the', 'atmosphere', 'in', 'the', 'book', ',', 'because', 'the', 'movie', 'was', 'a', 'huge', 'disappointment', '.', 'so', 'i', 'will', 'go', 'very', 'hard', 'on', 'this', 't_up', 'movie', 'and', 'not', 'on', 'the', 'book', 'so', 'please', 'make', 'the', 'difference', '.', '\n\n', 'arn', ',', 'movie', ',', 'tells', 'us', 'the', 'tale', 'of', 'arn', ',', 'born', 'in', '1150', ',', 'in', 'the', 'north', 'of', 'europe', ',', 'in', 'what', 'would', 'later', 'become', 'today', "'s", 'sweden', '.', 'the', 'movie', 'is', 'basically', 'separated', 'in', 'three', 'parts', ';', '(', '1', ')', 'rise', 'of', 'sweden', ',', 'meaning', 'the', 'rivals', 'and', 'fights', 'for', 'land', 'and', 'kingdom', ',', '(', '2', ')', 'arns', 'own', 'tale', ',', '(', '3', ')', 'a', 'romantic', '.', 'i', 'do', "n't", 'want', 'to', 'tell', 'more', 'than', 'that', 'about', 'the', 'movie', ',', 'but', 'now', 'on', 'the', 'trailers', 'you', 'see', 'a', 'lot', 'of', 'wars', 'in', 'jerusalem', ',', 'but', 'that', 'is', 'only', 'very', 'short', 'time', 'of', 'this', '2.31', 'hour', 'long', 'movie', '.', '\n\n ', 't_up', 'acting', '/', 't_up', 'lines', 'the', 'actors', 'were', 'way', 'too', 'aware', 'of', 'that', 'this', 'was', 'a', 'swedish', 'blockbuster', ',', 'and', 'when', 'they', 'played', 'their', 'rolls', ',', 'one', 'could', 'see', 'an', 'all', 'to', 'relaxed', '(', 'not', 'living', 'into', 'their', 'characters', 
')', 'actors', '.', 'acting', 'was', 'so', 'poor', ',', 'i', 'sometimes', 'wondered', ';', 'if', 'this', 't_up', 'was', 'a', 'middle', 'age', 'movie', '.', 'they', 'were', 'saying', 'their', 'lines', 'as', 'a', 'person', 'living', 'in', 'stockholm', 'would', 'do', 'in', '2007', ',', 'which', 'was', 'just', 'really', 'lame', ',', 'there', 'was', 'no', 'attempt', 'to', 'change', 'the', 'accent', 'i', 'believe', ',', 'made', 'the', 'quality', 'poor', '.', 'in', 'some', 'parts', ',', 'they', 'did', "n't", 'even', 'use', 'old', 'swedish', 'words', '!', 'i', 'mean', 'i', 'ca', "n't", 'believe', 'that', 'the', 'swedish', 'language', '/', 'accent', 'has', "n't", 'change', 'since', '1150', '-', '2007', '\x85 ', 'the', 'lines', 'was', 'empty', 'and', 'when', 'the', 'accents', 'was', 'so', 'poor', ',', 'this', 'equaled', 'in', 'very', 'low', 'performance', ',', 'and', 'i', 'sometimes', 'felt', 'that', 'i', 'was', 'watching', 'swedish', 'big', 'brother', '\x85 ', 'i', 'think', 'this', 'is', 'because', 'swedish', 'actors', 'has', 'not', 'yet', 'understood', 'that', 'acting', 'is', 'with', 'whole', 'body', ',', 'eye', 'moves', ',', 'body', '-', 'language', ',', 'etc', ',', 'not', 'just', 'standing', 'there', 'like', 'a', 'jukebox', 'and', 'saying', 'your', 'lines', 'one', 'after', 'other', '.', 'the', 'kids', 'acting', 'was', 'horrible', ',', 'i', 'was', 'dying', 'every', 'time', 'they', 'said', 'their', 'empty', 'lines', ',', 'with', 'no', 'feeling', ',', 'just', 'saying', 'it', 'as', 'instructed', '!', 'i', 'mean', ',', 'compare', 'little', 'girl', 'briony', 'tallis', 'in', 'atonement', ',', 'her', 'majestic', 'way', 'of', 'acting', ',', 'taking', 'the', 'crowd', ',', 'filling', 'up', 'the', 'scenes', 'whether', 'to', 'make', 'us', 'like', 'her', ',', 'hate', 'her', ',', 'moving', 'us', 'from', 'different', 'moods', 'by', 'her', 'acting', ',', 'those', 'well', '-', 'formed', 'lines', 'she', 'so', 'realistically', 'with', 'so', 'much', 'emotions', 'said', '.', 'if', 'you', 
'find', 'her', 'too', 'old', 'for', 'comparative', 'purpose', ',', 'then', 'one', 'could', 'compare', 'to', 'jake', 'lloyd', ',', 'little', 'anakin', 'skywalker', '(', 'sw', 'e1', ')', '.', 'in', 'arn', ',', 'those', 'kids', 'did', "n't", 'even', 'have', 'many', 'scenes', 'to', 'play', 'and', 'most', 'of', 'their', 'scenes', 'where', 'just', 'the', 'same', ',', 'and', 'lines', 'so', 'easy', 'compared', 'to', 'jakes', ',', 'whom', 'was', 'the', 'one', 'character', 'we', "'ve", 'waited', 'for', 'since', '1977', '!', 'the', 'only', 'good', 'actors', 'in', 'arn', 'was', 'the', 'old', 'nun', ',', 'mother', 'rikissa', ',', 'she', 'was', '"', 'good', '"', 'but', 'not', 'perfect', 'as', 'the', 'actors', 'should', 'have', 'been', 'in', 'this', 'movie', '!', 'also', 'saladin', ',', 'father', 'henry', ',', 'brother', 'guilbert', ',', 'and', 'the', 'bishop', 'passed', ',', 'all', 'other', 'sucked', 'so', 'bad', 'i', 'was', 'about', 'to', 'leave', 'the', 'cinema', 'and', 'take', 'a', 'rape', '-', 'shower', '.', 'worst', 'acting', 'i', 'found', ',', 'cecilia', 'algottsdotter', '(', 'and', 'her', 'sister', 'in', 'the', 'movie', ')', ',', 'i', 'have', 'no', 'words', '.', 'arn', 'himself', 'is', 'right', 'behind', '\x85 ', 'they', 'could', 'simply', 'not', 'let', 'us', 'feel', 'those', 'really', 'important', 'scenes', 'of', 'creating', 'a', 'character', ',', 'make', 'us', 'love', ',', 'hate', ',', 'or', 'mystic', 'feelings', '.', '\n\n ', 't_up', 'cutting', '/', 't_up', 'camera', '(', 'story', 'telling', ')', 'what', 'were', 'those', 'stupid', 'layering', 'other', 'scenes', 'about', '(', 'picture', 'over', 'picture', ')', '!', '?', 'i', 'mean', ',', 'when', 'you', "'re", 'supposed', 'to', 'create', 'a', 'flashback', ',', 'you', 'want', 'to', 'do', 'it', 'good', ',', 'like', '"', 'i', 'am', 'legend', '"', '.', 'even', 'the', 'tv', '-', 'series', 't_up', 'lost', 'makes', 'better', 'and', 'more', 'intense', 'flashbacks', '.', 'and', 'most', 'times', 'this', 'layering', 'of', 'flags', 
'etc', 'just', 'explains', 'the', 'movie', 'watcher', 'as', 'stupid', '!', 'i', 'mean', 'in', 'one', 'scene', ',', 'mother', 'rikissa', 'says', '"', 'we', 'are', 'sverkar', '"', 'and', 't_up', 'bam', 'a', 'layer', 'over', 'her', 'face', 'with', 'their', 'flag', ',', 'what', 'is', 'the', 'meaning', '?', '!', 'who', 'made', 'that', 'on', 'photoshop', '?', '!', 'i', 'will', 'come', 'to', 'that', 'later', '!', 'the', 'cutting', 'was', 'poor', ',', 'a', 'lot', 'of', 'scenes', 'did', "n't", 'make', 'sense', 'at', 'all', ',', 'we', 'were', 'pushed', 'all', 'too', 'suddenly', 'from', 'one', 'place', 'to', 'another', 'and', 'we', 'did', "n't", 'even', 'get', 'an', 'explanation', 'why', '\x85 ', 'arns', 'riding', '(', 'i', 'even', 'heard', 'this', 'guy', 'is', 'actually', 'good', 'at', 'riding', ')', 'was', 'really', 'bad', ',', 'the', 'horse', 'was', 'so', 'beautiful', 'and', 'powerful', ',', 'but', 'arns', 'carriage', '/', 'attitude', 'made', 'it', 'look', 'so', 't_up', 'stupid', ',', 'i', 'mean', 'a', 'cool', 'horse', 'riding', 'scenes', 'have', 'we', 'all', 'seen', ',', 'is', 'it', 'so', 'hard', '?', '!', 'in', 'elizabeth', 'the', 'golden', 'age', ',', 'when', 'they', 'ride', 'their', 'horses', ',', 'one', 'catches', 'that', 'freedom', ',', 'speed', ',', 'carriage', '/', 'attitude', 'i', 'am', 'talking', 'about', '.', '"', 'let', 'them', 'come', 'with', 'armies', 'of', 'hell', '!', 'they', 'shall', 'not', 'pass', '!', '"', 'elizabeth', 'shouts', 'with', 'furious', 'anger', '!', 'cool', 'quotes', 'as', 'these', 'can', 'also', 'be', 'found', '(', 'in', 'elizabeth', '\x96', 'not', 'arn', '!', 'not', 'one', 'single', 'line', 'is', 'memorable', 'quote', '!', ')', 'i', 'did', "n't", 'even', 'like', 'elizabeth', 'the', 'golden', 'age', '!', ...]
type(trn_labels), len(trn_labels)
(list, 90000)
Tokenization Result
Beginning-of-stream token (xbos), beginning of field number 1 token (xfld 1), and the tokenized text. You'll see that each punctuation mark is now a separate token.
' '.join(tok_trn[0])
'\n xbos xfld 1 first of all jan guillou is a fantastic writer . but even so , i have not read his " arn "- series books . as i have great love and respect for guillou , i had high expectations for this movie . also , a good friend of mine ( student in university reading history ) had read and recommended this book strongly . perhaps the director could n\'t catch the atmosphere in the book , because the movie was a huge disappointment . so i will go very hard on this t_up movie and not on the book so please make the difference . \n\n arn , movie , tells us the tale of arn , born in 1150 , in the north of europe , in what would later become today \'s sweden . the movie is basically separated in three parts ; ( 1 ) rise of sweden , meaning the rivals and fights for land and kingdom , ( 2 ) arns own tale , ( 3 ) a romantic . i do n\'t want to tell more than that about the movie , but now on the trailers you see a lot of wars in jerusalem , but that is only very short time of this 2.31 hour long movie . \n\n t_up acting / t_up lines the actors were way too aware of that this was a swedish blockbuster , and when they played their rolls , one could see an all to relaxed ( not living into their characters ) actors . acting was so poor , i sometimes wondered ; if this t_up was a middle age movie . they were saying their lines as a person living in stockholm would do in 2007 , which was just really lame , there was no attempt to change the accent i believe , made the quality poor . in some parts , they did n\'t even use old swedish words ! 
i mean i ca n\'t believe that the swedish language / accent has n\'t change since 1150 - 2007 \x85 the lines was empty and when the accents was so poor , this equaled in very low performance , and i sometimes felt that i was watching swedish big brother \x85 i think this is because swedish actors has not yet understood that acting is with whole body , eye moves , body - language , etc , not just standing there like a jukebox and saying your lines one after other . the kids acting was horrible , i was dying every time they said their empty lines , with no feeling , just saying it as instructed ! i mean , compare little girl briony tallis in atonement , her majestic way of acting , taking the crowd , filling up the scenes whether to make us like her , hate her , moving us from different moods by her acting , those well - formed lines she so realistically with so much emotions said . if you find her too old for comparative purpose , then one could compare to jake lloyd , little anakin skywalker ( sw e1 ) . in arn , those kids did n\'t even have many scenes to play and most of their scenes where just the same , and lines so easy compared to jakes , whom was the one character we \'ve waited for since 1977 ! the only good actors in arn was the old nun , mother rikissa , she was " good " but not perfect as the actors should have been in this movie ! also saladin , father henry , brother guilbert , and the bishop passed , all other sucked so bad i was about to leave the cinema and take a rape - shower . worst acting i found , cecilia algottsdotter ( and her sister in the movie ) , i have no words . arn himself is right behind \x85 they could simply not let us feel those really important scenes of creating a character , make us love , hate , or mystic feelings . \n\n t_up cutting / t_up camera ( story telling ) what were those stupid layering other scenes about ( picture over picture ) ! ? 
i mean , when you \'re supposed to create a flashback , you want to do it good , like " i am legend " . even the tv - series t_up lost makes better and more intense flashbacks . and most times this layering of flags etc just explains the movie watcher as stupid ! i mean in one scene , mother rikissa says " we are sverkar " and t_up bam a layer over her face with their flag , what is the meaning ? ! who made that on photoshop ? ! i will come to that later ! the cutting was poor , a lot of scenes did n\'t make sense at all , we were pushed all too suddenly from one place to another and we did n\'t even get an explanation why \x85 arns riding ( i even heard this guy is actually good at riding ) was really bad , the horse was so beautiful and powerful , but arns carriage / attitude made it look so t_up stupid , i mean a cool horse riding scenes have we all seen , is it so hard ? ! in elizabeth the golden age , when they ride their horses , one catches that freedom , speed , carriage / attitude i am talking about . " let them come with armies of hell ! they shall not pass ! " elizabeth shouts with furious anger ! cool quotes as these can also be found ( in elizabeth \x96 not arn ! not one single line is memorable quote ! ) i did n\'t even like elizabeth the golden age ! \n\n t_up atmosphere / t_up music now , some movies have characteristic music , godfather , star wars , lord of the rings , every movie you read now you \'ve heard the music and felt that atmosphere within you , did n\'t you ? forget that on arn . the music is really poor , no atmosphere at all , no tension or bravery \x96 there is nothing . and i mean , this is the type of movie where you could create amazing music to create different moods ! \n\n t_up environment well , this was the only positive part of the movie , beautiful places , great houses , roads , weapons and clothes , all that was really good . \n\n so my conclusion is . 
they have tried to push three books in one movie and therefore the cutting , building up characters , story , conflicts and lines has failed brutally . after watching this movie you will see : bad actors , no good lines , extremely bad cutting ( read bad story telling ) and no good music . simply no atmosphere , no creation of different moods etc and is n\'t that what a movie is suppose to do ? i rate arn 1 awful . " kingdom of heaven " still is winner in this genre .'
' '.join(tok_trn[1])
"\n xbos xfld 1 for lack of better things to watch , we stumbled on this movie the other night on cable . wow ! if action is your thing , this film will be for you . there must be killings every five minutes . in fact , we are worried when there are no shootings in the background ! \n\n charles t. kanganis wrote and directed this movie that has a woman detective at the center of the story . vickie , is a tough cookie ( no pun intended ) . she might look blonde and vulnerable , but just do n't mess around with her . the fact that vickie is basically standing up as the film ends is a testament to tracy lords ' masochism . \n\n the bad guys come and go , yet , vickie is able to avoid being shot , or have her hair messed during the worst of the action . the action is too intense at times as the latin gangsters show to be ruthless in the way they settle disputes . \n\n watch this film for the pure fun of watching the action . otherwise , do n't bother ."
Save work
np.save(LM_PATH / 'tmp' / 'tok_trn.npy', tok_trn)
np.save(LM_PATH / 'tmp' / 'tok_val.npy', tok_val)
# Try to load it back up later
tok_trn = np.load(LM_PATH / 'tmp' / 'tok_trn.npy')
tok_val = np.load(LM_PATH / 'tmp' / 'tok_val.npy')
# DEBUG - check serialized numpy files are created
!ls -lh {LM_PATH}/tmp
total 381M -rw-rw-r-- 1 ubuntu ubuntu 1010K Jun 27 15:42 itos.pkl -rw-rw-r-- 1 ubuntu ubuntu 299M Jun 28 07:30 tok_trn.npy -rw-rw-r-- 1 ubuntu ubuntu 34M Jun 28 07:30 tok_val.npy -rw-rw-r-- 1 ubuntu ubuntu 42M Jun 27 15:42 trn_ids.npy -rw-rw-r-- 1 ubuntu ubuntu 6.3M Jun 27 15:42 val_ids.npy
Now that the text is tokenized, the next step is to turn the tokens into numbers, which we call numericalizing.
freq = Counter(p for o in tok_trn for p in o)
freq.most_common(25)
[('the', 1209252), ('.', 993276), (',', 984626), ('and', 587546), ('a', 584199), ('of', 524706), ('to', 485996), ('is', 394082), ('it', 341931), ('in', 337846), ('i', 308713), ('this', 270873), ('that', 261496), ('"', 237503), ("'s", 221557), ('-', 187727), ('was', 180500), ('\n\n', 179162), ('as', 166379), ('with', 159369), ('for', 159204), ('movie', 157938), ('but', 150477), ('film', 144038), ('you', 124418)]
The vocab is the unique set of all tokens in our dataset. It provides a way to simply replace each word in the dataset with a unique integer called an index.
In a large corpus one might find rare words that are used only a few times in the whole dataset. We discard such rare words rather than trying to learn meaningful patterns from them.
Here we keep only tokens that occur more than min_freq times, i.e. at least twice. NLP practitioners have observed that a maximum vocab of 60k usually yields good results for classification tasks, so we set max_vocab to 60000.
max_vocab = 60000
min_freq = 1
itos = [o for o, c in freq.most_common(max_vocab) if c > min_freq]
itos.insert(0, '_pad_')
itos.insert(0, '_unk_')
len(itos)
60002
We create a reverse mapping called stoi, which lets us look up the index of a given token. stoi initially has the same number of elements as itos. We store it in a collections.defaultdict with a default of 0, so any token missing from the vocab maps to index 0, i.e. _unk_.
stoi = collections.defaultdict(lambda: 0, { v: k for k, v in enumerate(itos) })
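To illustrate the defaultdict fallback (toy vocabulary here, not the real itos built from the corpus):

```python
import collections

# Illustrative only: a tiny itos; the real one holds up to 60k tokens
itos = ['_unk_', '_pad_', 'the', 'movie', 'was', 'great']
stoi = collections.defaultdict(lambda: 0, {v: k for k, v in enumerate(itos)})

print(stoi['movie'])  # a known token returns its index
print(stoi['zzyzx'])  # an out-of-vocab token falls back to 0, i.e. '_unk_'
```

This is what makes the list comprehensions below safe: the validation set can contain words never seen in training, and they all quietly become `_unk_`.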
len(itos)
60002
trn_lm = np.array([ [stoi[o] for o in p] for p in tok_trn ])
val_lm = np.array([ [stoi[o] for o in p] for p in tok_val ])
Testing
' '.join(str(o) for o in trn_lm[0])
'40 41 42 39 106 7 43 5033 36163 9 6 846 569 3 24 75 51 4 12 36 32 369 35 15 18231 5213 228 1218 3 20 12 36 101 131 5 1166 22 36163 4 12 84 323 1413 22 13 23 3 102 4 6 66 443 7 1805 30 1374 11 3458 901 499 27 84 369 5 1169 13 309 2337 3 400 2 169 95 29 1323 2 879 11 2 309 4 105 2 23 18 6 672 1459 3 51 12 104 158 69 266 28 13 31 23 5 32 28 2 309 51 617 112 2 1540 3 19 18231 4 23 4 716 200 2 777 7 18231 4 1362 11 0 4 11 2 2135 7 2258 4 11 63 72 329 445 511 16 7852 3 2 23 9 681 5744 11 300 522 133 30 39 27 2240 7 7852 4 1201 2 6910 5 1758 22 1300 5 4683 4 30 261 27 58729 221 777 4 30 381 27 6 745 3 12 57 29 201 8 401 67 92 14 58 2 23 4 24 166 28 2 4351 26 83 6 186 7 1610 11 16640 4 24 14 9 81 69 365 74 7 13 0 562 216 23 3 640 31 136 123 31 428 2 170 85 115 116 1864 7 14 13 18 6 3630 2673 4 5 68 45 275 78 6284 4 38 95 83 47 43 8 7992 30 32 593 103 78 120 27 170 3 136 18 51 357 4 12 557 3661 133 62 13 31 18 6 658 603 23 3 45 85 660 78 428 20 6 417 593 11 19313 72 57 11 4216 4 79 18 56 82 871 4 53 18 73 590 8 678 2 1231 12 283 4 113 2 503 357 3 11 64 522 4 45 86 29 75 376 175 3630 690 49 12 403 12 195 29 283 14 2 3630 1090 123 1231 60 29 678 252 0 17 4216 1623 2 428 18 1887 5 68 2 2605 18 51 357 4 13 20935 11 69 380 257 4 5 12 557 462 14 12 18 168 3630 219 607 1623 12 121 13 9 105 3630 170 60 32 262 2769 14 136 9 21 238 662 4 766 1124 4 662 17 1090 4 528 4 32 56 2022 53 52 6 30104 5 660 146 428 38 118 100 3 2 378 136 18 510 4 12 18 1642 189 74 45 325 78 1887 428 4 21 73 574 4 56 660 10 20 16641 49 12 403 4 1740 138 256 0 0 11 22523 4 55 11340 115 7 136 4 655 2 2031 4 6111 71 2 155 768 8 112 200 52 55 4 738 55 4 763 200 50 285 10177 46 55 136 4 164 88 17 6304 428 70 51 6340 21 51 93 1361 325 3 62 26 185 55 116 175 22 23472 1264 4 114 38 95 1740 8 3534 2585 4 138 13179 13552 30 23473 0 27 3 11 18231 4 164 378 86 29 75 36 128 155 8 315 5 109 7 78 155 134 56 2 187 4 5 428 51 749 1111 8 24947 4 938 18 2 38 122 89 159 4798 22 252 5991 49 2 81 66 170 11 18231 18 2 175 5761 4 
399 0 4 70 18 15 66 15 24 32 424 20 2 170 154 36 98 11 13 23 49 102 28259 4 355 1370 4 607 44777 4 5 2 6489 2220 4 43 100 2125 51 96 12 18 58 8 577 2 458 5 213 6 1796 17 3450 3 269 136 12 276 4 10124 0 30 5 55 802 11 2 23 27 4 12 36 73 690 3 18231 333 9 227 538 1623 45 95 346 32 302 200 250 164 82 648 155 7 1853 6 122 4 112 200 131 4 738 4 54 11341 1427 3 640 31 2227 123 31 375 30 80 983 27 63 85 164 394 29173 100 155 58 30 450 141 450 27 49 65 12 403 4 68 26 198 457 8 1016 6 2454 4 26 201 8 57 10 66 4 52 15 12 260 1857 15 3 75 2 264 17 228 31 436 177 145 5 67 1597 2078 3 5 109 235 13 29173 7 11420 528 56 2676 2 23 8260 20 394 49 12 403 11 38 149 4 399 0 566 15 89 33 0 15 5 31 8261 6 8388 141 55 410 21 78 5265 4 63 9 2 1201 65 49 48 113 14 28 34658 65 49 12 104 232 8 14 329 49 2 2227 18 357 4 6 186 7 155 86 29 112 291 44 43 4 89 85 3981 43 116 1089 50 38 286 8 174 5 89 86 29 75 97 47 1822 153 1623 58729 2894 30 12 75 572 13 230 9 179 66 44 2894 27 18 82 96 4 2 1703 18 51 342 5 940 4 24 58729 9884 123 2054 113 10 184 51 31 394 4 12 403 6 601 1703 2894 155 36 89 43 129 4 9 10 51 266 65 49 11 2450 2 2102 603 4 68 45 1200 78 3553 4 38 4447 14 2105 4 2205 4 9884 123 2054 12 260 687 58 3 15 302 110 232 21 12509 7 582 49 45 3542 32 1383 49 15 2450 9527 21 5805 2674 49 601 4770 20 150 77 102 37 276 30 11 2450 526 32 18231 49 32 38 684 367 9 919 3167 49 27 12 86 29 75 52 2450 2 2102 603 49 640 31 879 123 31 239 166 4 64 117 36 8465 239 4 3717 4 341 1610 4 1737 7 2 2959 4 189 23 26 369 166 26 159 572 2 239 5 462 14 879 748 26 4 86 29 26 65 838 14 28 18231 3 2 239 9 82 357 4 73 879 44 43 4 73 1122 54 10178 526 53 9 176 3 5 12 403 4 13 9 2 545 7 23 134 26 95 1016 525 239 8 1016 285 10177 49 640 31 2788 88 4 13 18 2 81 1173 191 7 2 23 4 342 1293 4 101 4789 4 8814 4 2499 5 1749 4 43 14 18 82 66 3 19 51 76 1207 9 3 45 36 792 8 3315 300 1218 11 38 23 5 1575 2 2227 4 1280 71 120 4 80 4 4820 5 428 60 1213 4301 3 118 168 13 23 26 104 83 94 96 170 4 73 66 428 4 578 96 2227 30 369 96 
80 983 27 5 73 66 239 3 346 73 879 4 73 3159 7 285 10177 528 5 9 29 14 63 6 23 9 1363 8 57 65 12 1001 18231 39 398 3 15 4683 7 2120 15 151 9 2289 11 13 502 3'
Save work
np.save(LM_PATH / 'tmp' / 'trn_ids.npy', trn_lm)
np.save(LM_PATH / 'tmp' / 'val_ids.npy', val_lm)
pickle.dump(itos, open(LM_PATH / 'tmp' / 'itos.pkl', 'wb'))
trn_lm = np.load(LM_PATH / 'tmp' / 'trn_ids.npy')
val_lm = np.load(LM_PATH / 'tmp' / 'val_ids.npy')
itos = pickle.load(open(LM_PATH / 'tmp' / 'itos.pkl', 'rb'))
vs = len(itos)
vs, len(trn_lm)
(60002, 90000)
We are now going to build an English language model (LM) for the IMDb corpus. We could start from scratch and try to learn the structure of the English language, but instead we use a technique called transfer learning to make this process easier. In transfer learning (a fairly recent idea for NLP), a pre-trained LM that has been trained on a large generic corpus (like Wikipedia articles) transfers its knowledge to a target LM, whose weights are then fine-tuned.
Our source LM is the WikiText-103 LM created by Stephen Merity at Salesforce Research. Link to dataset The language model for WikiText-103 (an AWD-LSTM) has been pre-trained and the weights can be downloaded here: Link. Our target LM is the IMDb LM.
# wget options:
# -nH don't create host directories
# -r  recursive download
# -np don't ascend to the parent directory
# -P  directory prefix: save everything under {PATH}
!wget -nH -r -np -P {PATH} http://files.fast.ai/models/wt103/
--2018-06-28 04:49:15-- http://files.fast.ai/models/wt103/ Resolving files.fast.ai (files.fast.ai)... 67.205.15.147 Connecting to files.fast.ai (files.fast.ai)|67.205.15.147|:80... connected. HTTP request sent, awaiting response... 200 OK Length: 857 [text/html] Saving to: ‘data/aclImdb/models/wt103/index.html’ models/wt103/index. 100%[===================>] 857 --.-KB/s in 0s 2018-06-28 04:49:16 (140 MB/s) - ‘data/aclImdb/models/wt103/index.html’ saved [857/857] Loading robots.txt; please ignore errors. --2018-06-28 04:49:16-- http://files.fast.ai/robots.txt Reusing existing connection to files.fast.ai:80. HTTP request sent, awaiting response... 404 Not Found 2018-06-28 04:49:16 ERROR 404: Not Found. --2018-06-28 04:49:16-- http://files.fast.ai/models/wt103/?C=N;O=D Reusing existing connection to files.fast.ai:80. HTTP request sent, awaiting response... 200 OK Length: 857 [text/html] Saving to: ‘data/aclImdb/models/wt103/index.html?C=N;O=D’ models/wt103/index. 100%[===================>] 857 --.-KB/s in 0s 2018-06-28 04:49:16 (160 MB/s) - ‘data/aclImdb/models/wt103/index.html?C=N;O=D’ saved [857/857] --2018-06-28 04:49:16-- http://files.fast.ai/models/wt103/?C=M;O=A Reusing existing connection to files.fast.ai:80. HTTP request sent, awaiting response... 200 OK Length: 857 [text/html] Saving to: ‘data/aclImdb/models/wt103/index.html?C=M;O=A’ models/wt103/index. 100%[===================>] 857 --.-KB/s in 0s 2018-06-28 04:49:16 (138 MB/s) - ‘data/aclImdb/models/wt103/index.html?C=M;O=A’ saved [857/857] --2018-06-28 04:49:16-- http://files.fast.ai/models/wt103/?C=S;O=A Reusing existing connection to files.fast.ai:80. HTTP request sent, awaiting response... 200 OK Length: 857 [text/html] Saving to: ‘data/aclImdb/models/wt103/index.html?C=S;O=A’ models/wt103/index. 
100%[===================>] 857 --.-KB/s in 0s 2018-06-28 04:49:16 (134 MB/s) - ‘data/aclImdb/models/wt103/index.html?C=S;O=A’ saved [857/857] --2018-06-28 04:49:16-- http://files.fast.ai/models/wt103/?C=D;O=A Reusing existing connection to files.fast.ai:80. HTTP request sent, awaiting response... 200 OK Length: 857 [text/html] Saving to: ‘data/aclImdb/models/wt103/index.html?C=D;O=A’ models/wt103/index. 100%[===================>] 857 --.-KB/s in 0s 2018-06-28 04:49:17 (123 MB/s) - ‘data/aclImdb/models/wt103/index.html?C=D;O=A’ saved [857/857] --2018-06-28 04:49:17-- http://files.fast.ai/models/wt103/bwd_wt103.h5 Reusing existing connection to files.fast.ai:80. HTTP request sent, awaiting response... 200 OK Length: 462387687 (441M) [text/plain] Saving to: ‘data/aclImdb/models/wt103/bwd_wt103.h5’ models/wt103/bwd_wt 100%[===================>] 440.97M 7.69MB/s in 59s 2018-06-28 04:50:16 (7.45 MB/s) - ‘data/aclImdb/models/wt103/bwd_wt103.h5’ saved [462387687/462387687] --2018-06-28 04:50:16-- http://files.fast.ai/models/wt103/bwd_wt103_enc.h5 Reusing existing connection to files.fast.ai:80. HTTP request sent, awaiting response... 200 OK Length: 462387634 (441M) [text/plain] Saving to: ‘data/aclImdb/models/wt103/bwd_wt103_enc.h5’ models/wt103/bwd_wt 100%[===================>] 440.97M 6.76MB/s in 59s 2018-06-28 04:51:16 (7.43 MB/s) - ‘data/aclImdb/models/wt103/bwd_wt103_enc.h5’ saved [462387634/462387634] --2018-06-28 04:51:16-- http://files.fast.ai/models/wt103/fwd_wt103.h5 Reusing existing connection to files.fast.ai:80. HTTP request sent, awaiting response... 200 OK Length: 462387687 (441M) [text/plain] Saving to: ‘data/aclImdb/models/wt103/fwd_wt103.h5’ models/wt103/fwd_wt 100%[===================>] 440.97M 7.66MB/s in 59s 2018-06-28 04:52:15 (7.48 MB/s) - ‘data/aclImdb/models/wt103/fwd_wt103.h5’ saved [462387687/462387687] --2018-06-28 04:52:15-- http://files.fast.ai/models/wt103/fwd_wt103_enc.h5 Reusing existing connection to files.fast.ai:80. 
HTTP request sent, awaiting response... 200 OK Length: 462387634 (441M) [text/plain] Saving to: ‘data/aclImdb/models/wt103/fwd_wt103_enc.h5’ models/wt103/fwd_wt 100%[===================>] 440.97M 7.76MB/s in 58s 2018-06-28 04:53:13 (7.61 MB/s) - ‘data/aclImdb/models/wt103/fwd_wt103_enc.h5’ saved [462387634/462387634] --2018-06-28 04:53:13-- http://files.fast.ai/models/wt103/itos_wt103.pkl Reusing existing connection to files.fast.ai:80. HTTP request sent, awaiting response... 200 OK Length: 4161252 (4.0M) [text/plain] Saving to: ‘data/aclImdb/models/wt103/itos_wt103.pkl’ models/wt103/itos_w 100%[===================>] 3.97M 9.86MB/s in 0.4s 2018-06-28 04:53:13 (9.86 MB/s) - ‘data/aclImdb/models/wt103/itos_wt103.pkl’ saved [4161252/4161252] --2018-06-28 04:53:13-- http://files.fast.ai/models/wt103/?C=N;O=A Reusing existing connection to files.fast.ai:80. HTTP request sent, awaiting response... 200 OK Length: 857 [text/html] Saving to: ‘data/aclImdb/models/wt103/index.html?C=N;O=A’ models/wt103/index. 100%[===================>] 857 --.-KB/s in 0s 2018-06-28 04:53:14 (146 MB/s) - ‘data/aclImdb/models/wt103/index.html?C=N;O=A’ saved [857/857] --2018-06-28 04:53:14-- http://files.fast.ai/models/wt103/?C=M;O=D Reusing existing connection to files.fast.ai:80. HTTP request sent, awaiting response... 200 OK Length: 857 [text/html] Saving to: ‘data/aclImdb/models/wt103/index.html?C=M;O=D’ models/wt103/index. 100%[===================>] 857 --.-KB/s in 0s 2018-06-28 04:53:14 (186 MB/s) - ‘data/aclImdb/models/wt103/index.html?C=M;O=D’ saved [857/857] --2018-06-28 04:53:14-- http://files.fast.ai/models/wt103/?C=S;O=D Reusing existing connection to files.fast.ai:80. HTTP request sent, awaiting response... 200 OK Length: 857 [text/html] Saving to: ‘data/aclImdb/models/wt103/index.html?C=S;O=D’ models/wt103/index. 
100%[===================>] 857 --.-KB/s in 0s 2018-06-28 04:53:14 (169 MB/s) - ‘data/aclImdb/models/wt103/index.html?C=S;O=D’ saved [857/857] --2018-06-28 04:53:14-- http://files.fast.ai/models/wt103/?C=D;O=D Reusing existing connection to files.fast.ai:80. HTTP request sent, awaiting response... 200 OK Length: 857 [text/html] Saving to: ‘data/aclImdb/models/wt103/index.html?C=D;O=D’ models/wt103/index. 100%[===================>] 857 --.-KB/s in 0s 2018-06-28 04:53:14 (153 MB/s) - ‘data/aclImdb/models/wt103/index.html?C=D;O=D’ saved [857/857] FINISHED --2018-06-28 04:53:14-- Total wall clock time: 4m 0s Downloaded: 14 files, 1.7G in 3m 56s (7.50 MB/s)
!ls -lh {PATH}/models/wt103
total 1.8G -rw-rw-r-- 1 ubuntu ubuntu 441M Mar 29 00:31 bwd_wt103_enc.h5 -rw-rw-r-- 1 ubuntu ubuntu 441M Mar 29 00:34 bwd_wt103.h5 -rw-rw-r-- 1 ubuntu ubuntu 441M Mar 29 00:36 fwd_wt103_enc.h5 -rw-rw-r-- 1 ubuntu ubuntu 441M Mar 29 00:39 fwd_wt103.h5 -rw-rw-r-- 1 ubuntu ubuntu 857 Jun 28 04:49 index.html -rw-rw-r-- 1 ubuntu ubuntu 857 Jun 28 04:49 index.html?C=D;O=A -rw-rw-r-- 1 ubuntu ubuntu 857 Jun 28 04:53 index.html?C=D;O=D -rw-rw-r-- 1 ubuntu ubuntu 857 Jun 28 04:49 index.html?C=M;O=A -rw-rw-r-- 1 ubuntu ubuntu 857 Jun 28 04:53 index.html?C=M;O=D -rw-rw-r-- 1 ubuntu ubuntu 857 Jun 28 04:53 index.html?C=N;O=A -rw-rw-r-- 1 ubuntu ubuntu 857 Jun 28 04:49 index.html?C=N;O=D -rw-rw-r-- 1 ubuntu ubuntu 857 Jun 28 04:49 index.html?C=S;O=A -rw-rw-r-- 1 ubuntu ubuntu 857 Jun 28 04:53 index.html?C=S;O=D -rw-rw-r-- 1 ubuntu ubuntu 4.0M Mar 29 00:30 itos_wt103.pkl
The pre-trained LM weights have an embedding size of 400, 1150 hidden units and just 3 layers. We need to match these values with the target IMDB LM so that the weights can be loaded up.
em_sz, nh, nl = 400, 1150, 3
Here are our pre-trained path and our pre-trained language model path.
PRE_PATH = PATH / 'models' / 'wt103'
PRE_LM_PATH = PRE_PATH / 'fwd_wt103.h5'
wgts = torch.load(PRE_LM_PATH, map_location=lambda storage, loc: storage)
Map IMDb vocab to WikiText vocab
We calculate the mean of the layer-0 encoder weights. This mean vector is used to initialize the weights of IMDb tokens that are unknown to WikiText-103 when we transfer to the target IMDb LM.
# ==================================== START DEBUG ====================================
# collections provides specialized container datatypes as alternatives to Python's
# general-purpose built-in dict.
# OrderedDict is a dict subclass that remembers the order in which entries were added:
# a regular dict (before Python 3.7) iterates in arbitrary order, whereas an
# OrderedDict iterates in insertion order.
print(type(wgts))
print(len(wgts))
for k in wgts.keys():
    print(k)
tmp_wgts = wgts['0.encoder.weight']
print(f'\n{type(tmp_wgts)}' )
tmp_enc_wgts = to_np(tmp_wgts)
print( type(to_np(tmp_wgts)) )
# pre-trained LM weights have an embedding size of 400
print( tmp_enc_wgts.shape )
tmp_row_m = tmp_enc_wgts.mean(0) # returns the average of the array elements along axis 0
print( type(tmp_row_m) ) # numpy.ndarray
print( tmp_row_m.shape ) # shape: (400,)
# ==================================== END DEBUG ====================================
<class 'collections.OrderedDict'> 15 0.encoder.weight 0.encoder_with_dropout.embed.weight 0.rnns.0.module.weight_ih_l0 0.rnns.0.module.bias_ih_l0 0.rnns.0.module.bias_hh_l0 0.rnns.0.module.weight_hh_l0_raw 0.rnns.1.module.weight_ih_l0 0.rnns.1.module.bias_ih_l0 0.rnns.1.module.bias_hh_l0 0.rnns.1.module.weight_hh_l0_raw 0.rnns.2.module.weight_ih_l0 0.rnns.2.module.bias_ih_l0 0.rnns.2.module.bias_hh_l0 0.rnns.2.module.weight_hh_l0_raw 1.decoder.weight <class 'torch.FloatTensor'> <class 'numpy.ndarray'> (238462, 400) <class 'numpy.ndarray'> (400,)
enc_wgts = to_np(wgts['0.encoder.weight'])  # convert torch.FloatTensor to np.ndarray; shape: (238462, 400)
row_m = enc_wgts.mean(0)                    # average of the array elements along axis 0; shape: (400,)
itos2 = pickle.load( (PRE_PATH / 'itos_wt103.pkl').open('rb') )
stoi2 = collections.defaultdict(lambda: -1, { v: k for k, v in enumerate(itos2) })
# ==================================== START DEBUG ====================================
print( type(itos2) )
print(len(itos2))
i = 0
for k, v in enumerate(itos2):
    i = i + 1
    if i <= 10:
        print(f'{k}: {v}')
    else:
        break
print( type(stoi2) )
print( len(stoi2) )
print( stoi2['On'] ) # returns -1 (because the token is not in the vocab)
print( stoi2['the'] ) # returns 2
# ==================================== END DEBUG ====================================
<class 'list'> 238462 0: _unk_ 1: _pad_ 2: the 3: , 4: . 5: of 6: and 7: in 8: to 9: a <class 'collections.defaultdict'> 238462 -1 2
Before we transfer the knowledge from WikiText-103 to the IMDb LM, we match up the vocab words and their indexes. We use the defaultdict container once again to assign the mean weight vector to IMDb tokens that do not exist in WikiText-103.
new_w = np.zeros((vs, em_sz), dtype=np.float32) # shape: (60002, 400)
for i, w in enumerate(itos):
    r = stoi2[w]
    new_w[i] = enc_wgts[r] if r >= 0 else row_m
We now write the new weights back into the wgts OrderedDict. The decoder module, which we will explore in detail later, is loaded with the same weights due to an idea called weight tying.
wgts['0.encoder.weight'] = T(new_w)
wgts['0.encoder_with_dropout.embed.weight'] = T(np.copy(new_w)) # weird thing with how we do embedding dropout
wgts['1.decoder.weight'] = T(np.copy(new_w))
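Weight tying means the encoder's embedding matrix and the decoder's output matrix are one and the same tensor inside the model, which is why all three entries above receive copies of new_w. A rough parameter count (a back-of-envelope sketch using this notebook's sizes, not fastai code) shows why this matters at our vocab size:

```python
# Back-of-envelope parameter accounting for weight tying, using this notebook's sizes.
vocab_size, emb_size = 60002, 400

untied = 2 * vocab_size * emb_size  # separate encoder and decoder matrices
tied = vocab_size * emb_size        # one matrix shared by encoder and decoder

print(f'parameters saved by tying: {untied - tied:,}')  # 24,000,800
```

Besides saving ~24M parameters, tying acts as a regularizer: the input and output representations of each word are forced to agree.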
Now that we have the weights prepared, we are ready to create and start training our new IMDb language PyTorch model!
It is fairly straightforward to create a new language model using the fastai library. Like every other lesson, our model will have a backbone and a custom head. The backbone in our case is the IMDb LM pre-trained with WikiText and the custom head is a linear classifier. In this section we will focus on the backbone LM and the next section will talk about the classifier custom head.
bptt (back-propagation through time) sets how many tokens each mini-batch sequence spans. fastai LMs perturb the sequence length on a per-batch basis, drawing it from a distribution centered near 70. This is akin to shuffling our data in computer vision, except that in NLP we cannot shuffle the inputs because we have to maintain statefulness.
Since the model is stateful, we want each new mini-batch to line up exactly where the previous mini-batch's items ended. The batch size stays constant, but the fastai library expands and contracts bptt from mini-batch to mini-batch using a clever stochastic scheme (original credit attributed to Smerity).
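The sequence-length jitter can be sketched as follows. This mirrors the trick from Merity's AWD-LSTM training code, which fastai's LanguageModelLoader adopts, but it is an illustrative sketch rather than the library's exact implementation:

```python
import numpy as np

def sample_seq_len(bptt=70, rng=np.random):
    """Draw a perturbed sequence length for one mini-batch."""
    seq_len = bptt if rng.random() < 0.95 else bptt / 2.0  # occasionally halve bptt
    return max(5, int(rng.normal(seq_len, 5)))             # gaussian jitter, floor of 5

lens = [sample_seq_len() for _ in range(1000)]
print(min(lens), sum(lens) // len(lens), max(lens))  # spread around bptt=70
```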
wd = 1e-7
bptt = 70
bs = 52
opt_fn = partial(optim.Adam, betas=(0.8, 0.99))
The goal of the LM is to learn to predict a word/token given a preceding set of words (tokens). We take all the movie reviews in both the 90k training set and the 10k validation set and concatenate them to form long streams of tokens. In fastai we use the LanguageModelLoader, which takes a concatenated stream of tokens and returns a data loader that serves bptt-sized mini-batches.
We have a special model-data class for LMs called LanguageModelData, to which we pass the training and validation loaders and which gives us back the model.
trn_dl = LanguageModelLoader(np.concatenate(trn_lm), bs, bptt)
val_dl = LanguageModelLoader(np.concatenate(val_lm), bs, bptt)
md = LanguageModelData(PATH, 1, vs, trn_dl, val_dl, bs=bs, bptt=bptt)
Choosing dropout
We set up the dropouts for the model; these values were chosen after experimentation. If you need to adapt them for a custom LM, change the weighting factor (0.7 here) based on the amount of data you have: with more data you can reduce the dropout factor, while for small datasets you can reduce overfitting by choosing a higher one. No other dropout value requires tuning.
drops = np.array([0.25, 0.1, 0.2, 0.02, 0.15]) * 0.7
We first tune the embedding layer so that the missing tokens initialized with mean weights get trained properly. So we freeze everything except the last layer.
We also keep track of the accuracy metric.
learner = md.get_model(opt_fn, em_sz, nh, nl,
dropouti=drops[0], dropout=drops[1], wdrop=drops[2], dropoute=drops[3], dropouth=drops[4])
learner.metrics = [accuracy]
learner.freeze_to(-1)
Measuring accuracy
learner.model.load_state_dict(wgts)
We set the learning rates and fit our IMDb LM. We first run one epoch to tune only the last layer, which contains the embedding weights. This should help the tokens missing from WikiText-103 learn better weights.
lr = 1e-3
lrs = lr
learner.fit(lrs / 2, 1, wds=wd, use_clr=(32, 2), cycle_len=1)
HBox(children=(IntProgress(value=0, description='Epoch', max=1), HTML(value='')))
epoch trn_loss val_loss accuracy 0 4.663849 4.442456 0.258212
[array([4.44246]), 0.2582116474118943]
Note that we print out accuracy and keep track of how often we end up predicting the target word correctly. While this is a good metric to check, it is not part of our loss function as it can get quite bumpy. We only minimize cross-entropy loss in the LM.
The exponential of the cross-entropy loss is called the perplexity of the LM (lower perplexity is better).
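For instance, the validation loss of 4.44 above corresponds to a perplexity of about 85, i.e. the model is roughly as uncertain as if it were choosing uniformly among 85 candidate words at each step:

```python
import math

val_loss = 4.442456           # cross-entropy from the epoch above
perplexity = math.exp(val_loss)
print(round(perplexity))      # 85
```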
learner.save('lm_last_ft')
learner.load('lm_last_ft')
learner.unfreeze()
learner.lr_find(start_lr=lrs / 10, end_lr=lrs * 10, linear=True)
HBox(children=(IntProgress(value=0, description='Epoch', max=1), HTML(value='')))
epoch trn_loss val_loss accuracy 0 4.743739 4.586191 0.247705
learner.sched.plot()
learner.fit(lrs, 1, wds=wd, use_clr=(20, 10), cycle_len=15)
HBox(children=(IntProgress(value=0, description='Epoch', max=15), HTML(value='')))
epoch trn_loss val_loss accuracy 0 4.133916 4.017627 0.300258 1 4.127663 4.023184 0.299315 70%|███████ | 4818/6872 [57:53<24:40, 1.39it/s, loss=4.14]
(Training was interrupted here with a manual KeyboardInterrupt, 70% of the way through the third of 15 epochs.)
learner.sched.plot_loss()
We save the trained model weights and separately save the encoder part of the LM model as well. This will serve as our backbone in the classification task model.
learner.save('lm1')
learner.save_encoder('lm1_enc')
# Extra code: resume training from the saved checkpoints
learner.load('lm1')
learner.load_encoder('lm1_enc')
learner.fit(lrs, 1, wds=wd, use_clr=(20, 10), cycle_len=13)
HBox(children=(IntProgress(value=0, description='Epoch', max=13), HTML(value='')))
epoch trn_loss val_loss accuracy 0 4.089766 3.987413 0.303426 1 4.110863 3.993293 0.302394 2 4.05779 3.982436 0.304094 3 4.026278 3.972332 0.30501 4 4.020928 3.95975 0.306272 5 4.052557 3.951147 0.307224 6 3.980955 3.941334 0.308409 7 3.962256 3.937269 0.309225 8 3.918868 3.932689 0.309884 9 3.922733 3.924108 0.310882 10 3.948124 3.914877 0.311638 11 3.885483 3.914468 0.312277 12 3.868742 3.910146 0.312858
[array([3.91015]), 0.31285761840572784]
learner.sched.plot_loss()
learner.save('lm2')
learner.save_encoder('lm2_enc')
The classifier model is basically a linear layer custom head on top of the LM backbone. Setting up the classifier data is similar to the LM data setup except that we cannot use the unsup movie reviews this time.
Note: we are using the CSV files created earlier which have already filtered out the unsup movie reviews.
df_trn = pd.read_csv(CLAS_PATH / 'train.csv', header=None, chunksize=chunksize)
df_val = pd.read_csv(CLAS_PATH /'test.csv', header=None, chunksize=chunksize)
tok_trn, trn_labels = get_all(df_trn, 1)
tok_val, val_labels = get_all(df_val, 1)
0 1 0 1
(CLAS_PATH / 'tmp').mkdir(exist_ok=True)
np.save(CLAS_PATH / 'tmp' / 'tok_trn.npy', tok_trn)
np.save(CLAS_PATH /'tmp' / 'tok_val.npy', tok_val)
np.save(CLAS_PATH /'tmp' / 'trn_labels.npy', trn_labels)
np.save(CLAS_PATH /'tmp' / 'val_labels.npy', val_labels)
tok_trn = np.load(CLAS_PATH / 'tmp' / 'tok_trn.npy')
tok_val = np.load(CLAS_PATH / 'tmp' / 'tok_val.npy')
itos = pickle.load((LM_PATH / 'tmp' / 'itos.pkl').open('rb'))
stoi = collections.defaultdict(lambda: 0, { v: k for k, v in enumerate(itos) })
len(itos)
60002
trn_clas = np.array([[stoi[o] for o in p] for p in tok_trn])
val_clas = np.array([[stoi[o] for o in p] for p in tok_val])
np.save(CLAS_PATH / 'tmp' / 'trn_ids.npy', trn_clas)
np.save(CLAS_PATH / 'tmp' / 'val_ids.npy', val_clas)
Now we can create our final model, a classifier which is really a custom linear head over our trained IMDb backbone. The steps to create the classifier model are similar to the ones for the LM.
trn_clas = np.load(CLAS_PATH / 'tmp' / 'trn_ids.npy')
val_clas = np.load(CLAS_PATH / 'tmp' / 'val_ids.npy')
trn_labels = np.squeeze(np.load(CLAS_PATH / 'tmp' / 'trn_labels.npy'))
val_labels = np.squeeze(np.load(CLAS_PATH / 'tmp' / 'val_labels.npy'))
bptt, em_sz, nh, nl = 70, 400, 1150, 3
vs = len(itos) # num of tokens (vocab size)
opt_fn = partial(optim.Adam, betas=(0.8, 0.99))
bs = 48
min_lbl = trn_labels.min()
trn_labels -= min_lbl
val_labels -= min_lbl
c = int(trn_labels.max()) + 1 # num of class
Shuffle documents; Sort-ish to save computation
In the classifier, unlike the LM, we read one movie review at a time and learn to predict its sentiment as pos/neg. We no longer deal with equal bptt-sized batches, so we have to pad the sequences to the same length within each batch. To create batches of similar-sized movie reviews, we use the sortish sampler method devised by @Smerity and @jekbradbury. The SortishSampler cuts down the overall number of padding tokens the classifier ends up seeing.
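A quick back-of-envelope simulation (hypothetical review lengths, not fastai's SortishSampler) illustrates why batching roughly length-sorted documents wastes far fewer padding tokens than batching them in random order:

```python
import random

random.seed(0)
lengths = [random.randint(50, 1500) for _ in range(1000)]  # fake review lengths (tokens)
bs = 24

def pad_cost(ls, bs):
    # every sequence in a batch is padded up to the batch's longest member
    batches = (ls[i:i + bs] for i in range(0, len(ls), bs))
    return sum(max(b) * len(b) - sum(b) for b in batches)

random_cost = pad_cost(lengths, bs)
sorted_cost = pad_cost(sorted(lengths), bs)
print(random_cost, sorted_cost)  # sorting shrinks the padding dramatically
```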
trn_ds = TextDataset(trn_clas, trn_labels)
val_ds = TextDataset(val_clas, val_labels)
Turning it to DataLoader
trn_samp = SortishSampler(trn_clas, key=lambda x: len(trn_clas[x]), bs=bs // 2)
val_samp = SortSampler(val_clas, key=lambda x: len(val_clas[x]))
trn_dl = DataLoader(trn_ds, bs // 2, transpose=True, num_workers=1, pad_idx=1, sampler=trn_samp)
val_dl = DataLoader(val_ds, bs, transpose=True, num_workers=1, pad_idx=1, sampler=val_samp)
md = ModelData(PATH, trn_dl, val_dl)
Create RNN Encoder
# part 1
dps = np.array([0.4, 0.5, 0.05, 0.3, 0.1])
dps = np.array([0.4, 0.5, 0.05, 0.3, 0.4]) * 0.5
get_rnn_classifer creates more or less exactly the same encoder, and we pass in the same architectural details as before. But this time there are a few more things we can do with the head we add on. One is that you can add more than one hidden layer. In layers=[em_sz * 3, 50, c]:
em_sz * 3: the size of the input to the head (i.e. the classifier section); the factor of 3 comes from concat pooling, which concatenates the last hidden state with its max-pool and mean-pool over the sequence
50: the size of the output of the first layer
c: the size of the output of the second layer
You can add as many layers as you like, so you can basically create a little multi-layer neural net classifier on top. Similarly, drops=[dps[4], 0.1] gives the dropouts to apply after each of these layers.
m = get_rnn_classifer(bptt, 20 * 70, c, vs, emb_sz=em_sz, n_hid=nh, n_layers=nl, pad_token=1,
layers=[em_sz * 3, 50, c], drops=[dps[4], 0.1],
dropouti=dps[0], wdrop=dps[1], dropoute=dps[2], dropouth=dps[3])
opt_fn = partial(optim.Adam, betas=(0.7, 0.99))
learn = RNN_Learner(md, TextModel(to_gpu(m)), opt_fn=opt_fn)
learn.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
learn.clip = 25.
learn.metrics = [accuracy]
lr = 3e-3
lrm = 2.6
lrs = np.array([lr / (lrm**4), lr / (lrm**3), lr / (lrm**2), lr / lrm, lr])
lrs = np.array([1e-4, 1e-4, 1e-4, 1e-3, 1e-2])
wd = 1e-7
wd = 0
learn.load_encoder('lm2_enc')
We start out just training the last layer and we get 93.4% accuracy:
learn.freeze_to(-1)
learn.lr_find(lrs / 1000)
learn.sched.plot()
HBox(children=(IntProgress(value=0, description='Epoch', max=1), HTML(value='')))
76%|███████▋ | 795/1042 [07:58<02:28, 1.66it/s, loss=1.14]
learn.fit(lrs, 1, wds=wd, use_clr=(8, 3), cycle_len=1)
HBox(children=(IntProgress(value=0, description='Epoch', max=1), HTML(value='')))
epoch trn_loss val_loss accuracy 0 0.254959 0.176155 0.93412
[array([0.17615]), 0.9341200002670288]
learn.save('clas_0')
learn.load('clas_0')
Then we unfreeze one more layer and get 93.9% accuracy:
learn.freeze_to(-2)
learn.fit(lrs, 1, wds=wd, use_clr=(8, 3), cycle_len=1)
HBox(children=(IntProgress(value=0, description='Epoch', max=1), HTML(value='')))
epoch trn_loss val_loss accuracy 0 0.227991 0.163791 0.93864
[array([0.16379]), 0.9386400006484985]
learn.save('clas_1')
learn.load('clas_1')
Then we fine-tune the whole thing. This was the main way of using a pre-trained model before our paper came along:
learn.unfreeze()
learn.fit(lrs, 1, wds=wd, use_clr=(32, 10), cycle_len=14)
HBox(children=(IntProgress(value=0, description='Epoch', max=14), HTML(value='')))
epoch trn_loss val_loss accuracy 0 0.24981 0.172563 0.93528 1 0.263968 0.160485 0.93976 2 0.202914 0.148376 0.9454 3 0.156502 0.182274 0.94648 4 0.134656 0.168832 0.94548 5 0.107242 0.156522 0.9484 6 0.102729 0.180831 0.94348 7 0.075103 0.172596 0.94548 8 0.07143 0.1826 0.94396 9 0.066486 0.194617 0.94256 10 0.047482 0.211435 0.9434 11 0.049275 0.221188 0.94312 12 0.0459 0.219328 0.94628 13 0.040396 0.22585 0.94604
[array([0.22585]), 0.9460399997520447]
learn.sched.plot_loss()
learn.save('clas_2')
The previous state-of-the-art result was 94.1% accuracy (5.9% error). With a bidirectional model added, we get 95.4% accuracy (4.6% error).