Link to the notebook: https://drive.google.com/file/d/1pKE2nMcLo4kpp_maZly6XDP4aUFnjDlI/view. Copy the notebook to your own Google Drive to edit it.
First, we will install some extra packages which aren't included in the default Colab environment. In this case, we are installing gensim for GloVe embeddings, bpemb for BPE word-piece embeddings and tokenization, and the spaCy English model (en_core_web_sm) for word-level tokenization.
!pip install bpemb
!pip install gensim
!python -m spacy download en_core_web_sm
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: bpemb in /usr/local/lib/python3.7/dist-packages (0.3.3)
Requirement already satisfied: sentencepiece in /usr/local/lib/python3.7/dist-packages (from bpemb) (0.1.97)
Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from bpemb) (1.21.6)
Requirement already satisfied: tqdm in /usr/local/lib/python3.7/dist-packages (from bpemb) (4.64.1)
Requirement already satisfied: gensim in /usr/local/lib/python3.7/dist-packages (from bpemb) (3.6.0)
Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (from bpemb) (2.23.0)
Requirement already satisfied: smart-open>=1.2.1 in /usr/local/lib/python3.7/dist-packages (from gensim->bpemb) (5.2.1)
Requirement already satisfied: six>=1.5.0 in /usr/local/lib/python3.7/dist-packages (from gensim->bpemb) (1.15.0)
Requirement already satisfied: scipy>=0.18.1 in /usr/local/lib/python3.7/dist-packages (from gensim->bpemb) (1.7.3)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests->bpemb) (2.10)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests->bpemb) (1.24.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests->bpemb) (2022.6.15)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests->bpemb) (3.0.4)
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: gensim in /usr/local/lib/python3.7/dist-packages (3.6.0)
Requirement already satisfied: six>=1.5.0 in /usr/local/lib/python3.7/dist-packages (from gensim) (1.15.0)
Requirement already satisfied: scipy>=0.18.1 in /usr/local/lib/python3.7/dist-packages (from gensim) (1.7.3)
Requirement already satisfied: numpy>=1.11.3 in /usr/local/lib/python3.7/dist-packages (from gensim) (1.21.6)
Requirement already satisfied: smart-open>=1.2.1 in /usr/local/lib/python3.7/dist-packages (from gensim) (5.2.1)
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-sm==3.4.0
Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.0/en_core_web_sm-3.4.0-py3-none-any.whl (12.8 MB)
|████████████████████████████████| 12.8 MB 7.2 MB/s
Requirement already satisfied: spacy<3.5.0,>=3.4.0 in /usr/local/lib/python3.7/dist-packages (from en-core-web-sm==3.4.0) (3.4.1)
Requirement already satisfied: thinc<8.2.0,>=8.1.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (8.1.0)
Requirement already satisfied: numpy>=1.15.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (1.21.6)
Requirement already satisfied: wasabi<1.1.0,>=0.9.1 in /usr/local/lib/python3.7/dist-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (0.10.1)
Requirement already satisfied: catalogue<2.1.0,>=2.0.6 in /usr/local/lib/python3.7/dist-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (2.0.8)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (21.3)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.7/dist-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (2.11.3)
Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (1.0.8)
Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (2.0.6)
Requirement already satisfied: typing-extensions<4.2.0,>=3.7.4 in /usr/local/lib/python3.7/dist-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (4.1.1)
Requirement already satisfied: requests<3.0.0,>=2.13.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (2.23.0)
Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (3.0.7)
Requirement already satisfied: langcodes<4.0.0,>=3.2.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (3.3.0)
Requirement already satisfied: setuptools in /usr/local/lib/python3.7/dist-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (57.4.0)
Requirement already satisfied: spacy-legacy<3.1.0,>=3.0.9 in /usr/local/lib/python3.7/dist-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (3.0.10)
Requirement already satisfied: srsly<3.0.0,>=2.4.3 in /usr/local/lib/python3.7/dist-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (2.4.4)
Requirement already satisfied: pydantic!=1.8,!=1.8.1,<1.10.0,>=1.7.4 in /usr/local/lib/python3.7/dist-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (1.9.2)
Requirement already satisfied: typer<0.5.0,>=0.3.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (0.4.2)
Requirement already satisfied: spacy-loggers<2.0.0,>=1.0.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (1.0.3)
Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (4.64.1)
Requirement already satisfied: pathy>=0.3.5 in /usr/local/lib/python3.7/dist-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (0.6.2)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from catalogue<2.1.0,>=2.0.6->spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (3.8.1)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from packaging>=20.0->spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (3.0.9)
Requirement already satisfied: smart-open<6.0.0,>=5.2.1 in /usr/local/lib/python3.7/dist-packages (from pathy>=0.3.5->spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (5.2.1)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (2022.6.15)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (2.10)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (3.0.4)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (1.24.3)
Requirement already satisfied: blis<0.8.0,>=0.7.8 in /usr/local/lib/python3.7/dist-packages (from thinc<8.2.0,>=8.1.0->spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (0.7.8)
Requirement already satisfied: click<9.0.0,>=7.1.1 in /usr/local/lib/python3.7/dist-packages (from typer<0.5.0,>=0.3.0->spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (7.1.2)
Requirement already satisfied: MarkupSafe>=0.23 in /usr/local/lib/python3.7/dist-packages (from jinja2->spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.0) (2.0.1)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
import nltk
nltk.download('stopwords')
[nltk_data] Downloading package stopwords to /root/nltk_data... [nltk_data] Package stopwords is already up-to-date!
True
Here we are just using some magic commands to make sure changes to external packages are automatically loaded and plots are displayed in the notebook.
%reload_ext autoreload
%autoreload 2
%matplotlib inline
Before we do anything else, we need to make sure that we seed all of the random number generators and set some flags to ensure that our results are reproducible across runs, i.e. that we get the same result every time we run our code.
import torch
import random
import numpy as np
def enforce_reproducibility(seed=42):
# Sets seed manually for both CPU and CUDA
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
# For atomic operations there is currently
# no simple way to enforce determinism, as
# the order of parallel operations is not known.
# CUDNN
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
# System based
random.seed(seed)
np.random.seed(seed)
enforce_reproducibility()
Next, let's look at pre-trained embeddings. Later we will use word-piece embeddings from the bpemb package; first, we load GloVe embeddings via gensim, which has a large variety of pre-trained word embeddings.
import gensim.downloader
print(list(gensim.downloader.info()['models'].keys()))
['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']
# Load the vectors
glove_vectors = gensim.downloader.load('glove-wiki-gigaword-100')
# Look at some of the most similar words
glove_vectors.most_similar('great')
[('greatest', 0.7882647514343262), ('good', 0.759279727935791), ('little', 0.7585746049880981), ('much', 0.7477047443389893), ('well', 0.7401014566421509), ('big', 0.7318841218948364), ('kind', 0.7308766841888428), ('important', 0.7286974191665649), ('there', 0.7225556373596191), ('way', 0.7116187214851379)]
# Get vector representation of word
glove_vectors['great']
array([-0.013786 , 0.38216 , 0.53236 , 0.15261 , -0.29694 , -0.20558 , -0.41846 , -0.58437 , -0.77355 , -0.87866 , -0.37858 , -0.18516 , -0.128 , -0.20584 , -0.22925 , -0.42599 , 0.3725 , 0.26077 , -1.0702 , 0.62916 , -0.091469 , 0.70348 , -0.4973 , -0.77691 , 0.66045 , 0.09465 , -0.44893 , 0.018917 , 0.33146 , -0.35022 , -0.35789 , 0.030313 , 0.22253 , -0.23236 , -0.19719 , -0.0053125, -0.25848 , 0.58081 , -0.10705 , -0.17845 , -0.16206 , 0.087086 , 0.63029 , -0.76649 , 0.51619 , 0.14073 , 1.019 , -0.43136 , 0.46138 , -0.43585 , -0.47568 , 0.19226 , 0.36065 , 0.78987 , 0.088945 , -2.7814 , -0.15366 , 0.01015 , 1.1798 , 0.15168 , -0.050112 , 1.2626 , -0.77527 , 0.36031 , 0.95761 , -0.11385 , 0.28035 , -0.02591 , 0.31246 , -0.15424 , 0.3778 , -0.13599 , 0.2946 , -0.31579 , 0.42943 , 0.086969 , 0.019169 , -0.27242 , -0.31696 , 0.37327 , 0.61997 , 0.13889 , 0.17188 , 0.30363 , -1.2776 , 0.044423 , -0.52736 , -0.88536 , -0.19428 , -0.61947 , -0.10146 , -0.26301 , -0.061707 , 0.36627 , -0.95223 , -0.39346 , -0.69183 , -1.0426 , 0.28855 , 0.63056 ], dtype=float32)
glove_vectors.most_similar('bad')
[('worse', 0.7929712533950806), ('good', 0.7702798247337341), ('things', 0.7653602361679077), ('too', 0.7630148530006409), ('thing', 0.7609667778015137), ('lot', 0.7443647384643555), ('kind', 0.740868091583252), ('because', 0.739879846572876), ('really', 0.7376541495323181), ("n't", 0.7336541414260864)]
Next we'll use the bpemb package, which has pretrained BPE tokenizers/embeddings for 275 languages. It can be used in much the same way as gensim, for example observing the top similar words to a given word.
from bpemb import BPEmb
# Load english model with 25k word-pieces
bpemb_en = BPEmb(lang='en', dim=100, vs=25000)
bpemb_en.most_similar('great')
[('little', 0.5838008522987366), ('▁great', 0.5713019371032715), ('grand', 0.4832311272621155), ('the', 0.47381392121315), ('▁glorious', 0.4674184322357178), ('bear', 0.4492991268634796), ('wh', 0.4481717348098755), ('mother', 0.44679754972457886), ('▁splend', 0.44266074895858765), ('southern', 0.427338182926178)]
bpemb_en['great']
array([ 0.188451, -0.271453, -0.28907 , 0.061539, -0.00119 , 0.023874, -0.067886, -0.331542, 0.216387, 0.357444, -0.37975 , 0.568398, 0.094106, 0.131808, -0.225529, -0.0577 , -0.471579, 0.270219, 0.500174, -0.351869, 0.249958, -0.083147, -0.196093, 0.536826, -0.472552, -0.290486, 0.044731, 0.291949, -0.02748 , 0.833167, -0.702959, 0.102262, -0.192826, 0.323853, -0.222146, -0.471987, -0.789 , -0.384075, 0.231821, 0.479966, -0.23674 , -0.233447, -0.095288, 0.994725, 0.38566 , 0.539499, -0.117075, -0.424706, -0.35617 , -0.073661, 0.185955, -0.329411, -0.336546, -0.883611, 0.450815, -0.087026, 0.547418, -0.177824, -0.50549 , 0.423664, 0.15182 , 0.289324, 0.546515, -0.963773, 0.041765, 0.526582, -0.336767, 0.128516, 0.307212, 0.049208, -0.314852, 0.343516, 0.09675 , 0.033849, -0.260953, 0.340189, -0.114852, 0.128735, 0.403689, -0.154596, 0.303449, 0.855851, 0.089396, 0.229577, 0.022571, -0.194975, 0.080064, 0.273517, 0.394372, -0.196653, 0.220832, -0.052137, -0.274584, 0.376419, 0.151644, -0.01065 , 0.195909, -0.456158, 0.042357, 0.058726], dtype=float32)
bpemb_en.most_similar('bad')
[('▁bad', 0.7139850854873657), ('bir', 0.5771083831787109), ('her', 0.5590745806694031), ('hol', 0.5386293530464172), ('mad', 0.5319529175758362), ('wal', 0.5289992690086365), ('wolf', 0.5198653340339661), ('hal', 0.5146304368972778), ('▁luck', 0.5110129117965698), ('ün', 0.506809651851654)]
We'll see how to train a basic logistic regression classifier for sentiment analysis. For this, we're using the IMDB movie reviews dataset.
import spacy
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import classification_report
# Upload the data from local computer
from google.colab import files
uploaded = files.upload()
for fn in uploaded.keys():
print('User uploaded file "{name}" with length {length} bytes'.format(
name=fn, length=len(uploaded[fn])))
# or you can load from google drive if you upload the files there (faster in my experience)
from google.colab import drive
drive.mount('/content/drive') # this will trigger permission prompts
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
train_data = pd.read_csv('/content/drive/My Drive/nlp_labs/lab_2/imdb/train.csv', usecols=[1,2]).fillna('')
valid_data = pd.read_csv('/content/drive/My Drive/nlp_labs/lab_2/imdb/val.csv', usecols=[1,2]).fillna('')
test_data = pd.read_csv('/content/drive/My Drive/nlp_labs/lab_2/imdb/test.csv', usecols=[1,2]).fillna('')
valid_data.head() # in the sentiment column, 1 is positive, 0 is negative
 | review | sentiment |
---|---|---|
0 | A genuinely odd, surreal jumble of visual idea... | 0 |
1 | "The Snow Queen" is based on the famous and ve... | 0 |
2 | The quintessential Georgian film of Georgi Dan... | 1 |
3 | I'm a huge comedy show fan. Racial humor is al... | 0 |
4 | Pretty good film from Preminger; labyrinthine ... | 1 |
train_data.values[0]
array(["Having avoided seeing the movie in the cinema, but buying the DVD for my wife for Xmas, I had to watch it. I did not expect much, which usually means I get more than I bargained for. But 'Mamma Mia' - utter, utter cr**. I like ABBA, I like the songs, I have the old LPs. But this film is just terrible. The stage show looks like a bit of a musical, but this races along with songs hurriedly following one another, no characterisation, the dance numbers (which were heavily choreographed according to the extras on the DVD) are just thrown away with only half the bodies ever on screen, the dance chorus of north Europeans appear on a small Greek island at will, while the set and set up of numbers would have disgraced Cliff Richard's musicals in the sixties!Meryl (see me I'm acting)Streep can't even make her usual mugging effective in an over-the-top musical! Her grand piece - 'The Winner Takes It All' - is Meryl at the Met! Note to director - it should have been shot in stillness with the camera gradually showing distance growing between Streep and Brosnan! Some of the singing is awful karaoke on amateur night. The camera cannot stop moving like bad MTV. One can never settle down and just enjoy the music, enthusiasm and characters. But what is even worse is how this botched piece of excre**** has become the highest grossing film in the UK and the best selling DVD to boot? Blair, Campbell and New Labour really have reduced the UK to zombies - critical faculties anyone???", 0], dtype=object)
This defines some functions for vectorizing our text. For GloVe and BPE embeddings, this involves tokenizing the text, looking up an embedding for each token, and pooling the token embeddings into a single fixed-size document vector. There are many options for pooling, e.g. max-pooling, average-pooling, summation, etc.; here we use average-pooling, as sketched below.
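As a quick illustration (a made-up example, not part of the lab code), here is how the different pooling options each turn a toy (n_tokens x dim) matrix of token embeddings into a single vector:
import numpy as np

# A made-up matrix of token embeddings: 3 tokens, 3 dimensions
toy_embeddings = np.array([[0.1, 0.5, -0.2],
                           [0.4, -0.3, 0.0],
                           [0.2, 0.1, 0.3]])

avg_pooled = toy_embeddings.mean(axis=0)  # average-pooling (what the functions below use)
max_pooled = toy_embeddings.max(axis=0)   # max-pooling: element-wise maximum over tokens
sum_pooled = toy_embeddings.sum(axis=0)   # summation
print(avg_pooled, max_pooled, sum_pooled)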
from tqdm import tqdm
import nltk
from nltk.corpus import stopwords
stopwords = set(stopwords.words('english'))
def get_unigram_features(dataset, vectorizer):
X = vectorizer.transform(dataset[:,0])
y = list(dataset[:,1])
return X,y
def get_bpemb_features(dataset, bpemb):
# With bpemb we can tokenize and embed an entire document using .embed(x)
X = [bpemb.embed(x).mean(0) for x in tqdm(dataset[:,0])]
y = list(dataset[:,1])
return X,y
def get_glove_features(dataset, glove, nlp):
X = []
for x in tqdm(dataset[:,0]):
# For glove embeddings, we first tokenize the sentence using spacy, lower-case and remove stopwords, and get the available word vectors
vecs = np.vstack([glove[t.text.lower()] if t.text.lower() in glove else np.zeros(100) for t in nlp(x) if t.text.lower() not in stopwords])
X.append(vecs.mean(0))
y = list(dataset[:,1])
return np.stack(X),y
This defines how to run a basic logistic regression classifier. Commented out is some code you can use if you want to run a hyperparameter search.
def run_classifier(X_train, y_train, X_test, y_test):
# Define hyperparams for search
# C = np.logspace(-6, 2, 50)
# warm_start = [False, True]
# class_weight = ['balanced', None]
# hp = {"C": C, "warm_start": warm_start, 'class_weight': class_weight}
classifier = LogisticRegression(penalty='l2', max_iter=1000)
# classifier_random = RandomizedSearchCV(
# estimator=classifier,
# param_distributions=hp,
# n_iter=100,
# cv=5,
# verbose=2,
# random_state=1000,
# n_jobs=-1,
# scoring='f1'
# )
# classifier_random.fit(X_train, y_train)
# print(classifier_random.best_params_)
# print(classifier_random.best_score_)
# model = classifier_random.best_estimator_
classifier.fit(X_train,y_train)
preds = classifier.predict(X_test)
print(classification_report(y_test, preds))
First we'll run GloVe
#only train and test on part of the dataset as an example
nlp = spacy.load('en_core_web_sm')
X_train,y_train = get_glove_features(train_data.values[:5000], glove_vectors, nlp)
X_test,y_test = get_glove_features(test_data.values[:1000], glove_vectors, nlp)
run_classifier(X_train, y_train, X_test, y_test)
100%|██████████| 5000/5000 [03:01<00:00, 27.55it/s] 100%|██████████| 1000/1000 [00:35<00:00, 28.08it/s]
              precision    recall  f1-score   support

           0       0.82      0.79      0.80       524
           1       0.78      0.80      0.79       476

    accuracy                           0.80      1000
   macro avg       0.80      0.80      0.80      1000
weighted avg       0.80      0.80      0.80      1000
Next BPE
X_train,y_train = get_bpemb_features(train_data.values[:5000], bpemb_en)
X_test,y_test = get_bpemb_features(test_data.values[:1000], bpemb_en)
run_classifier(X_train, y_train, X_test, y_test)
100%|██████████| 5000/5000 [00:04<00:00, 1157.32it/s] 100%|██████████| 1000/1000 [00:00<00:00, 1143.96it/s]
              precision    recall  f1-score   support

           0       0.78      0.79      0.78       524
           1       0.76      0.75      0.76       476

    accuracy                           0.77      1000
   macro avg       0.77      0.77      0.77      1000
weighted avg       0.77      0.77      0.77      1000
Just as a note, you can actually get much better performance using simple word counts -- why do you think this is?
# Run on unigram features
vectorizer = CountVectorizer()
vectorizer.fit(train_data.values[:,0])
X_train,y_train = get_unigram_features(train_data.values[:5000], vectorizer)
X_test,y_test = get_unigram_features(test_data.values[:1000], vectorizer)
run_classifier(X_train, y_train, X_test, y_test)
              precision    recall  f1-score   support

           0       0.89      0.84      0.87       524
           1       0.83      0.89      0.86       476

    accuracy                           0.86      1000
   macro avg       0.86      0.86      0.86      1000
weighted avg       0.87      0.86      0.86      1000
A simple and common way that data is read in PyTorch is to use the two following classes: torch.utils.data.Dataset and torch.utils.data.DataLoader.

The Dataset class can be extended to read in and store the data you are using for your experiment. The only requirements are to implement the __len__ and __getitem__ methods. __len__ simply returns the size of your dataset, and __getitem__ takes an index and returns that sample from your dataset, processed in whatever way is necessary to be input to your model.

The DataLoader class determines how to iterate through your Dataset, including how to shuffle and batch your data.
from torch.utils.data import Dataset, DataLoader
from typing import List, Tuple
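To make the pattern concrete before the real reader below, here is a minimal toy example (the class and numbers are made up, not part of the lab) using the Dataset and DataLoader classes just imported: a Dataset implementing __len__ and __getitem__, and a DataLoader that shuffles and batches it.
class ToyDataset(Dataset):
    """A made-up dataset of (feature, label) pairs, for illustration only."""
    def __init__(self):
        self.samples = [(torch.tensor([float(i)]), i % 2) for i in range(10)]

    def __len__(self):
        # Number of samples in the dataset
        return len(self.samples)

    def __getitem__(self, idx):
        # Return a single (already processed) sample
        return self.samples[idx]

toy_dl = DataLoader(ToyDataset(), batch_size=4, shuffle=True)
for features, labels in toy_dl:
    print(features.shape, labels)  # batches of up to 4 samples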
This is a utility function which, given a list of text samples and a tokenizer, will tokenize the text and return the IDs of the tokens in the vocabulary. The tokenizer will split the text into individual word-pieces and convert these to IDs in the tokenizer vocabulary. In addition, it will return the length of each sequence, which is needed when inputting padded data into an RNN in PyTorch.
def text_to_batch_bilstm(text: List, tokenizer, max_len=512) -> Tuple[List, List]:
"""
Creates a tokenized batch for input to a bilstm model
:param text: A list of sentences to tokenize
:param tokenizer: A tokenization function to use (i.e. fasttext)
:return: Tokenized text as well as the length of the input sequence
"""
# Some light preprocessing
input_ids = [tokenizer.encode_ids_with_eos(t)[:max_len] for t in text]
return input_ids, [len(ids) for ids in input_ids]
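For example (illustrative only, with a made-up sentence), we can check what this returns for a single text using the bpemb tokenizer loaded earlier:
ids, lens = text_to_batch_bilstm(["this movie was great"], bpemb_en)
print(ids, lens)  # one list of word-piece IDs (ending with the EOS ID) and its length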
This is another utility function which defines how to combine data into a batch. Importantly, it determines the max length of a sequence in the batch and pads all of the samples with a [PAD] ID to this length.
def collate_batch_bilstm(input_data: Tuple) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
"""
Combines multiple data samples into a single batch
:param input_data: The combined input_ids, seq_lens, and labels for the batch
:return: A tuple of tensors (input_ids, seq_lens, labels)
"""
input_ids = [i[0][0] for i in input_data]
seq_lens = [i[1][0] for i in input_data]
labels = [i[2] for i in input_data]
max_length = max([len(i) for i in input_ids])
# Pad all of the input samples to the max length (25000 is the ID of the [PAD] token)
input_ids = [(i + [25000] * (max_length - len(i))) for i in input_ids]
# Make sure each sample is max_length long
assert (all(len(i) == max_length for i in input_ids))
return torch.tensor(input_ids), torch.tensor(seq_lens), torch.tensor(labels)
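As a quick check with made-up IDs (not real data), the shorter of two samples gets padded with the [PAD] ID (25000) up to the batch max length:
toy_batch = [([[1, 2, 3]], [3], 1), ([[4, 5]], [2], 0)]
ids, lens, labels = collate_batch_bilstm(toy_batch)
print(ids)           # second row is padded to [4, 5, 25000]
print(lens, labels)  # tensor([3, 2]) tensor([1, 0])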
This is the class which reads the dataset and processes individual samples. It is initialized with a pandas DataFrame and a tokenizer.
# This will load the dataset and process it lazily in the __getitem__ function
class ClassificationDatasetReader(Dataset):
def __init__(self, df, tokenizer):
self.df = df
self.tokenizer = tokenizer
def __len__(self):
return len(self.df)
def __getitem__(self, idx):
row = self.df.values[idx]
# Calls the text_to_batch function
input_ids,seq_lens = text_to_batch_bilstm([row[0]], self.tokenizer)
label = row[1]
return input_ids, seq_lens, label
Here we can test whether or not our implementation is correct. We first load English bpemb embeddings, which will be used to initialize the word embedding layer of our model, and the tokenizer which will be used in the dataset reader.
# Extract the embeddings and add a randomly initialized embedding for our extra [PAD] token
pretrained_embeddings = np.concatenate([bpemb_en.emb.vectors, np.zeros(shape=(1,100))], axis=0)
# Extract the vocab and add an extra [PAD] token
vocabulary = bpemb_en.emb.index2word + ['[PAD]']
reader = ClassificationDatasetReader(valid_data, bpemb_en)
reader[0]
([[4, 406, 9631, 354, 10278, 24934, 17672, 7826, 955, 27, 5245, 4885, 192, 3870, 9002, 5590, 12165, 989, 71, 7, 6768, 4016, 24951, 1025, 1294, 6222, 1972, 800, 1221, 42, 7, 1361, 24934, 272, 437, 1656, 2170, 24935, 14189, 99, 24935, 1412, 189, 788, 24920, 215, 7219, 9042, 6381, 27, 32, 4307, 85, 6354, 13280, 19462, 2895, 4, 779, 143, 9009, 191, 15850, 400, 146, 3701, 24951, 7, 4794, 10176, 365, 42, 215, 2414, 24940, 8280, 24934, 34, 2054, 4719, 42, 16572, 146, 24180, 1228, 24935, 332, 1949, 1701, 34, 394, 24940, 2767, 862, 10726, 962, 72, 424, 24934, 7, 7184, 27, 800, 1221, 332, 8099, 37, 24934, 592, 42, 332, 24068, 24934, 127, 1579, 360, 332, 38, 3691, 2919, 86, 3243, 42, 13269, 4175, 24935, 12144, 22944, 7595, 14680, 2586, 24934, 683, 260, 7, 2480, 24933, 401, 9984, 154, 1025, 315, 35, 88, 72, 7, 334, 67, 771, 24940, 18932, 24951, 351, 3811, 80, 467, 1613, 24935, 5032, 1412, 189, 73, 1935, 32, 1024, 22531, 3473, 592, 345, 493, 1623, 4820, 88, 34, 278, 22754, 13032, 220, 875, 3532, 24934, 637, 215, 361, 24940, 1828, 773, 4338, 3906, 810, 24788, 24934, 79, 7, 670, 73, 216, 4, 1026, 24935, 7669, 4939, 130, 7949, 2]], [205], 0)
Next we will create a BiLSTM model with BPE word-piece embeddings. In this case we will extend the PyTorch class torch.nn.Module. To create your own module, you need only define your model architecture in the __init__ method, and define how tensors are processed by your model in the forward method.
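To see the bare pattern before the full BiLSTM, here is a minimal made-up module (not the lab's model): the layers are defined in __init__ and the computation in forward.
import torch
from torch import nn

class TinyClassifier(nn.Module):
    """A minimal nn.Module: one linear layer mapping 4 features to 2 classes."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 2)

    def forward(self, x):
        # x: (batch x 4) -> logits: (batch x 2)
        return self.linear(x)

print(TinyClassifier()(torch.randn(3, 4)).shape)  # torch.Size([3, 2])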
from torch import nn
# Define a default lstm_dim
lstm_dim = 100
# Define the model
class BiLSTMNetwork(nn.Module):
"""
Basic BiLSTM network
"""
def __init__(
self,
pretrained_embeddings: torch.tensor,
lstm_dim: int,
dropout_prob: float = 0.1,
n_classes: int = 2
):
"""
Initializer for basic BiLSTM network
:param pretrained_embeddings: A tensor containing the pretrained BPE embeddings
:param lstm_dim: The dimensionality of the BiLSTM network
:param dropout_prob: Dropout probability
:param n_classes: The number of output classes
"""
# First thing is to call the superclass initializer
super(BiLSTMNetwork, self).__init__()
# We'll define the network in a ModuleDict, which makes organizing the model a bit nicer
# The components are an embedding layer, a 2 layer BiLSTM, and a feed-forward output layer
self.model = nn.ModuleDict({
'embeddings': nn.Embedding.from_pretrained(pretrained_embeddings, padding_idx=pretrained_embeddings.shape[0] - 1),
'bilstm': nn.LSTM(
pretrained_embeddings.shape[1],
lstm_dim,
1,
batch_first=True,
dropout=dropout_prob,
bidirectional=True),
'cls': nn.Linear(2*lstm_dim, n_classes)
})
self.n_classes = n_classes
self.dropout = nn.Dropout(p=dropout_prob)
# Initialize the weights of the model
self._init_weights()
def _init_weights(self):
all_params = list(self.model['bilstm'].named_parameters()) + \
list(self.model['cls'].named_parameters())
for n,p in all_params:
if 'weight' in n:
nn.init.xavier_normal_(p)
elif 'bias' in n:
nn.init.zeros_(p)
def forward(self, inputs, input_lens, labels = None):
"""
Defines how tensors flow through the model
:param inputs: (b x sl) The IDs into the vocabulary of the input samples
:param input_lens: (b) The length of each input sequence
:param labels: (b) The label of each sample
:return: (loss, logits) if `labels` is not None, otherwise just (logits,)
"""
# Get embeddings (b x sl x edim)
embeds = self.model['embeddings'](inputs)
# Pack padded: This is necessary for padded batches input to an RNN
lstm_in = nn.utils.rnn.pack_padded_sequence(
embeds,
input_lens.cpu(),
batch_first=True,
enforce_sorted=False
)
# Pass the packed sequence through the BiLSTM
lstm_out, hidden = self.model['bilstm'](lstm_in)
# Unpack the packed sequence --> (b x sl x 2*lstm_dim)
lstm_out,_ = nn.utils.rnn.pad_packed_sequence(lstm_out, batch_first=True)
# Max pool along the last dimension
ff_in = self.dropout(torch.max(lstm_out, 1)[0])
# Some magic to get the last output of the BiLSTM for classification (b x 2*lstm_dim)
#ff_in = lstm_out.gather(1, input_lens.view(-1,1,1).expand(lstm_out.size(0), 1, lstm_out.size(2)) - 1).squeeze()
# Get logits (b x n_classes)
logits = self.model['cls'](ff_in).view(-1, self.n_classes)
outputs = (logits,)
if labels is not None:
# Xentropy loss
loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits, labels)
outputs = (loss,) + outputs
return outputs
PyTorch tensors exist on a particular device: in order to use a GPU for example, the tensors in your network need to be moved to a GPU device. The following allows us to check if a GPU device is available and get a reference to that device which we can use to place tensors on it.
device = torch.device("cpu")
if torch.cuda.is_available():
device = torch.device("cuda")
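For example (just an illustration), any tensor or module can then be placed on that device with .to(device):
x = torch.zeros(3).to(device)  # lives on the GPU if one is available, otherwise the CPU
print(x.device)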
Here we actually instantiate the model and move it to the GPU.
# Create the model
model = BiLSTMNetwork(
pretrained_embeddings=torch.FloatTensor(pretrained_embeddings),
lstm_dim=lstm_dim,
dropout_prob=0.1,
n_classes=2
).to(device)
/usr/local/lib/python3.7/dist-packages/torch/nn/modules/rnn.py:65: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.1 and num_layers=1 "num_layers={}".format(dropout, num_layers))
Here we define a metric which we are trying to improve -- in this case the accuracy of the model with respect to the label -- and functions to evaluate and train the model.
def accuracy(logits, labels):
logits = np.asarray(logits).reshape(-1, len(logits[0]))
labels = np.asarray(labels).reshape(-1)
return np.sum(np.argmax(logits, axis=-1) == labels).astype(np.float32) / float(labels.shape[0])
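As a quick sanity check on made-up logits (illustrative only), two of the three predictions below match their labels:
toy_logits = [[0.2, 0.8], [0.9, 0.1], [0.3, 0.7]]  # argmax predictions: 1, 0, 1
toy_labels = [1, 0, 0]
print(accuracy(toy_logits, toy_labels))  # 2 of 3 correct -> ~0.667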
A useful utility which gives us a progress bar
from tqdm import tqdm_notebook as tqdm
This is a utility function which will take a model and a validation dataloader and return the current accuracy of the model against that dataset. We can use this to know when to save the model and to perform early stopping if desired.
def evaluate(model: nn.Module, valid_dl: DataLoader):
"""
Evaluates the model on the given dataset
:param model: The model under evaluation
:param valid_dl: A `DataLoader` reading validation data
:return: The accuracy of the model on the dataset
"""
# VERY IMPORTANT: Put your model in "eval" mode -- this disables things like
# layer normalization and dropout
model.eval()
labels_all = []
logits_all = []
# ALSO IMPORTANT: Don't accumulate gradients during this process
with torch.no_grad():
for batch in tqdm(valid_dl, desc='Evaluation'):
batch = tuple(t.to(device) for t in batch)
input_ids = batch[0]
seq_lens = batch[1]
labels = batch[2]
_, logits = model(input_ids, seq_lens, labels=labels)
labels_all.extend(list(labels.detach().cpu().numpy()))
logits_all.extend(list(logits.detach().cpu().numpy()))
acc = accuracy(logits_all, labels_all)
return acc,labels_all,logits_all
Here we define the main training loop.
def train(
model: nn.Module,
train_dl: DataLoader,
valid_dl: DataLoader,
optimizer: torch.optim.Optimizer,
n_epochs: int,
device: torch.device,
patience: int = 10
):
"""
The main training loop which will optimize a given model on a given dataset
:param model: The model being optimized
:param train_dl: The training dataset
:param valid_dl: A validation dataset
:param optimizer: The optimizer used to update the model parameters
:param n_epochs: Number of epochs to train for
:param device: The device to train on
:return: (model, losses) The best model and the losses per iteration
"""
# Keep track of the loss and best accuracy
losses = []
best_acc = 0.0
pcounter = 0
# Iterate through epochs
for ep in range(n_epochs):
loss_epoch = []
#Iterate through each batch in the dataloader
for batch in tqdm(train_dl):
# VERY IMPORTANT: Make sure the model is in training mode, which turns on
# things like dropout and layer normalization
model.train()
# VERY IMPORTANT: zero out all of the gradients on each iteration -- PyTorch
# keeps track of these dynamically in its computation graph so you need to explicitly
# zero them out
optimizer.zero_grad()
# Place each tensor on the GPU
batch = tuple(t.to(device) for t in batch)
input_ids = batch[0]
seq_lens = batch[1]
labels = batch[2]
# Pass the inputs through the model, get the current loss and logits
loss, logits = model(input_ids, seq_lens, labels=labels)
losses.append(loss.item())
loss_epoch.append(loss.item())
# Calculate all of the gradients and weight updates for the model
loss.backward()
# Optional: clip gradients
#torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
# Finally, update the weights of the model
optimizer.step()
#gc.collect()
# Perform inline evaluation at the end of the epoch
acc,_,_ = evaluate(model, valid_dl)
print(f'Validation accuracy: {acc}, train loss: {sum(loss_epoch) / len(loss_epoch)}')
# Keep track of the best model based on the accuracy
if acc > best_acc:
torch.save(model.state_dict(), 'best_model')
best_acc = acc
pcounter = 0
else:
pcounter += 1
if pcounter == patience:
break
#gc.collect()
model.load_state_dict(torch.load('best_model'))
return model, losses
Now that we have the basic training and evaluation loops defined, we can create the datasets and optimizer and run it!
from torch.optim import Adam
# Define some hyperparameters
batch_size = 32
lr = 3e-4
n_epochs = 100
# Create the dataset readers
train_dataset = ClassificationDatasetReader(train_data[:5000], bpemb_en)
# dataset loaded lazily with N workers in parallel
train_dl = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_batch_bilstm, num_workers=8)
valid_dataset = ClassificationDatasetReader(valid_data[:1000], bpemb_en)
valid_dl = DataLoader(valid_dataset, batch_size=len(valid_data[:1000]), collate_fn=collate_batch_bilstm, num_workers=8)
# Create the optimizer
optimizer = Adam(model.parameters(), lr=lr)
# Train
model, losses = train(model, train_dl, valid_dl, optimizer, n_epochs, device)
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:566: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary. cpuset_checked)) /usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:32: TqdmDeprecationWarning: This function will be removed in tqdm==5.0.0 Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
Validation accuracy: 0.712, train loss: 0.6779187424167706
Validation accuracy: 0.768, train loss: 0.5833906335815503
Validation accuracy: 0.781, train loss: 0.5077499014556788
Validation accuracy: 0.792, train loss: 0.4751406448661901
Validation accuracy: 0.803, train loss: 0.45408434273710674
Validation accuracy: 0.814, train loss: 0.4271178468587292
Validation accuracy: 0.818, train loss: 0.4111836481436043
Validation accuracy: 0.806, train loss: 0.39968144570945935
Validation accuracy: 0.822, train loss: 0.37929828569387936
Validation accuracy: 0.817, train loss: 0.3708639157236002
Validation accuracy: 0.834, train loss: 0.348740708960849
Validation accuracy: 0.801, train loss: 0.3409302178651664
Validation accuracy: 0.836, train loss: 0.331271399737923
Validation accuracy: 0.838, train loss: 0.31309484021299205
Validation accuracy: 0.84, train loss: 0.3230010888948562
Validation accuracy: 0.84, train loss: 0.294746164540956
Validation accuracy: 0.854, train loss: 0.2886540435112206
Validation accuracy: 0.842, train loss: 0.27443656467708055
Validation accuracy: 0.845, train loss: 0.26643091720190776
Validation accuracy: 0.835, train loss: 0.2540476996048241
Validation accuracy: 0.85, train loss: 0.23896284092953252
Validation accuracy: 0.864, train loss: 0.236352364870773
Validation accuracy: 0.851, train loss: 0.22536911013399719
Validation accuracy: 0.847, train loss: 0.21778053904225111
Validation accuracy: 0.861, train loss: 0.2023459713739954
Validation accuracy: 0.845, train loss: 0.2063905032244837
Validation accuracy: 0.837, train loss: 0.19206032301684853
Validation accuracy: 0.858, train loss: 0.17706783614151037
Validation accuracy: 0.85, train loss: 0.16507186833175885
Validation accuracy: 0.859, train loss: 0.16044201360196825
Validation accuracy: 0.847, train loss: 0.1473938333950225
Validation accuracy: 0.85, train loss: 0.13892966600218018
Next we can plot the loss curve
import matplotlib.pyplot as plt
plt.plot(losses)
[Plot of the training loss per iteration]
test_dataset = ClassificationDatasetReader(test_data[:1000], bpemb_en)
test_dl = DataLoader(test_dataset, batch_size=len(test_data), collate_fn=collate_batch_bilstm, num_workers=8)
val_acc,_,_ = evaluate(model, valid_dl)
test_acc,labs,logs = evaluate(model, test_dl)
print(f"Valiation accuracy: {val_acc}")
print(f"Test accuracy: {test_acc}")
Validation accuracy: 0.864
Test accuracy: 0.839