DistilBERT Classifier as Feature Extractor Using Embetter¶

In this feature-based approach, we are using the embeddings from a pretrained transormer to train a random forest and logistic regression model in scikit-learn:

In [1]:

# pip install transformers datasets

In [2]:

# conda install sklearn --yes

In addition, we will be using the embetter scikit-learn library:

In [3]:

%load_ext watermark
%watermark --conda -p torch,transformers,datasets,sklearn

torch       : 1.12.1
transformers: 4.23.1
datasets    : 2.6.1
sklearn     : 0.0

conda environment: dl-fundamentals

In [4]:

import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

cuda:0

1 Loading the Dataset¶

The IMDB movie review dataset consists of 50k movie reviews with sentiment label (0: negative, 1: positive).

1a) Load from `datasets` Hub¶

In [5]:

from datasets import list_datasets, load_dataset

In [6]:

# list_datasets()

In [7]:

imdb_data = load_dataset("imdb")
print(imdb_data)

Found cached dataset imdb (/home/raschka/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1)

  0%|          | 0/3 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [8]:

imdb_data["train"][99]

Out[8]:

{'text': "This film is terrible. You don't really need to read this review further. If you are planning on watching it, suffice to say - don't (unless you are studying how not to make a good movie).<br /><br />The acting is horrendous... serious amateur hour. Throughout the movie I thought that it was interesting that they found someone who speaks and looks like Michael Madsen, only to find out that it is actually him! A new low even for him!!<br /><br />The plot is terrible. People who claim that it is original or good have probably never seen a decent movie before. Even by the standard of Hollywood action flicks, this is a terrible movie.<br /><br />Don't watch it!!! Go for a jog instead - at least you won't feel like killing yourself.",
 'label': 0}

1b) Load from local directory¶

The IMDB movie review set can be downloaded from http://ai.stanford.edu/~amaas/data/sentiment/. After downloading the dataset, decompress the files.

A) If you are working with Linux or MacOS X, open a new terminal windowm cd into the download directory and execute

tar -zxf aclImdb_v1.tar.gz

B) If you are working with Windows, download an archiver such as 7Zip to extract the files from the download archive.

C) Use the following code to download and unzip the dataset via Python

Download the movie reviews

In [9]:

import os
import sys
import tarfile
import time
import urllib.request

source = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
target = "aclImdb_v1.tar.gz"

if os.path.exists(target):
    os.remove(target)


def reporthook(count, block_size, total_size):
    global start_time
    if count == 0:
        start_time = time.time()
        return
    duration = time.time() - start_time
    progress_size = int(count * block_size)
    speed = progress_size / (1024.0**2 * duration)
    percent = count * block_size * 100.0 / total_size

    sys.stdout.write(
        f"\r{int(percent)}% | {progress_size / (1024.**2):.2f} MB "
        f"| {speed:.2f} MB/s | {duration:.2f} sec elapsed"
    )
    sys.stdout.flush()


if not os.path.isdir("aclImdb") and not os.path.isfile("aclImdb_v1.tar.gz"):
    urllib.request.urlretrieve(source, target, reporthook)

In [10]:

if not os.path.isdir("aclImdb"):

    with tarfile.open(target, "r:gz") as tar:
        tar.extractall()

Convert them to a pandas DataFrame and save them as CSV

In [11]:

import os
import sys

import numpy as np
import pandas as pd
from packaging import version
from tqdm import tqdm

# change the `basepath` to the directory of the
# unzipped movie dataset

basepath = "aclImdb"

labels = {"pos": 1, "neg": 0}

df = pd.DataFrame()

with tqdm(total=50000) as pbar:
    for s in ("test", "train"):
        for l in ("pos", "neg"):
            path = os.path.join(basepath, s, l)
            for file in sorted(os.listdir(path)):
                with open(os.path.join(path, file), "r", encoding="utf-8") as infile:
                    txt = infile.read()

                if version.parse(pd.__version__) >= version.parse("1.3.2"):
                    x = pd.DataFrame(
                        [[txt, labels[l]]], columns=["review", "sentiment"]
                    )
                    df = pd.concat([df, x], ignore_index=False)

                else:
                    df = df.append([[txt, labels[l]]], ignore_index=True)
                pbar.update()
df.columns = ["text", "label"]

100%|███████████████████████████████████████████████████████| 50000/50000 [00:55<00:00, 893.83it/s]

In [12]:

import numpy as np

np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))

Basic datasets analysis and sanity checks

In [13]:

print("Class distribution:")
np.bincount(df["label"].values)

Class distribution:

Out[13]:

array([25000, 25000])

In [14]:

text_len = df["text"].apply(lambda x: len(x.split()))
text_len.min(), text_len.median(), text_len.max() 

Out[14]:

(4, 173.0, 2470)

Split data into training, validation, and test sets

In [15]:

df_shuffled = df.sample(frac=1, random_state=1).reset_index()

df_train = df_shuffled.iloc[:35_000]
df_val = df_shuffled.iloc[35_000:40_000]
df_test = df_shuffled.iloc[40_000:]

df_train.to_csv("train.csv", index=False, encoding="utf-8")
df_val.to_csv("validation.csv", index=False, encoding="utf-8")
df_test.to_csv("test.csv", index=False, encoding="utf-8")

In [16]:

df_train.head()

Out[16]:

	text	label
0	When we started watching this series on cable,...	1
1	Steve Biko was a black activist who tried to r...	1
2	My short comment for this flick is go pick it ...	1
3	As a serious horror fan, I get that certain ma...	0
4	Robert Cummings, Laraine Day and Jean Muir sta...	1

2 Train Model on Embeddings (Extracted Features)¶

In [17]:

import pandas as pd
from sklearn.pipeline import make_pipeline 
from sklearn.linear_model import LogisticRegression

from embetter.text import SentenceEncoder

classifier = make_pipeline(
  SentenceEncoder("distiluse-base-multilingual-cased-v2"),
  LogisticRegression()
)

classifier.fit(df_train["text"].values, df_train["label"].values);

In [18]:

classifier.score(df_val["text"].values, df_val["label"].values)

Out[18]:

0.8

In [19]:

classifier.score(df_test["text"].values, df_test["label"].values)

Out[19]:

0.8032

In [ ]:

DistilBERT Classifier as Feature Extractor Using Embetter¶

1 Loading the Dataset¶

1a) Load from datasets Hub¶

1b) Load from local directory¶

2 Train Model on Embeddings (Extracted Features)¶

1a) Load from `datasets` Hub¶