Link to the notebook: https://tinyurl.com/y2zz2qu8 Copy the notebook to your GDrive to edit.
# magic commands to make sure changes to external packages are automatically loaded and plots are displayed in the notebook
%reload_ext autoreload
%autoreload 2
%matplotlib inline
!pip3 install datasets transformers bpemb
import torch
import random
import numpy as np
from typing import List, Tuple
from torch.utils.data import Dataset, DataLoader
from torch import nn
import matplotlib.pyplot as plt
import re
from tqdm.notebook import tqdm
from torch.optim import Adam, RMSprop
import nltk
from datasets import load_dataset
from bpemb import BPEmb
def enforce_reproducibility(seed=42):
# Sets seed manually for both CPU and CUDA
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
# For atomic operations there is currently
# no simple way to enforce determinism, as
# the order of parallel operations is not known.
# CUDNN
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
# System based
random.seed(seed)
np.random.seed(seed)
enforce_reproducibility()
device = torch.device("cpu")
if torch.cuda.is_available():
device = torch.device("cuda")
device
device(type='cuda')
Source: https://karpathy.github.io/2015/05/21/rnn-effectiveness/
*The packing workflow shown below is PyTorch-specific.
# We want to run LSTM on a batch with 3 sentences
sents = ['The word of the Lord came to Zechariah son of Iddo the prophet.', # len = 13
'fruit flies like a banana', # len = 5
'Fruit flies live on a banana'] # len = 6
# Step 1: Construct Vocabulary
vocab = ['<pad>'] + sorted(set([token for sent in sents for token in sent.split()]))
# Step 2: Load indexed data (a list of instances, where each instance is a list of token indices)
vectorized_seqs = [[vocab.index(tok) for tok in sent.split()] for sent in sents]
# Step 3: Make the model
embed = torch.nn.Embedding(len(vocab), 4) # embedding_dim = 4
lstm = torch.nn.LSTM(input_size=4, hidden_size=5, batch_first=True) # input_dim = 4, hidden_dim = 5
# Step 4: Pad instances with 0s up to the max sequence length
# get the length of each seq in your batch
seq_lengths = torch.LongTensor(list(map(len, vectorized_seqs)))
seq_tensor = torch.zeros((len(vectorized_seqs), seq_lengths.max()), dtype=torch.long)
for idx, (seq, seqlen) in enumerate(zip(vectorized_seqs, seq_lengths)):
seq_tensor[idx, :seqlen] = torch.LongTensor(seq)
seq_tensor
tensor([[ 4, 19, 13, 17, 3, 8, 18, 5, 16, 13, 2, 17, 15], [10, 9, 11, 6, 7, 0, 0, 0, 0, 0, 0, 0, 0], [ 1, 9, 12, 14, 6, 7, 0, 0, 0, 0, 0, 0, 0]])
# Step 5: Sort instances by sequence length in descending order
# this step is not compulsory: the packing functions have an enforce_sorted parameter, which can instead be set to False
seq_lengths, perm_idx = seq_lengths.sort(0, descending=True)
seq_tensor = seq_tensor[perm_idx]
seq_tensor
tensor([[ 4, 19, 13, 17, 3, 8, 18, 5, 16, 13, 2, 17, 15], [ 1, 9, 12, 14, 6, 7, 0, 0, 0, 0, 0, 0, 0], [10, 9, 11, 6, 7, 0, 0, 0, 0, 0, 0, 0, 0]])
# Calling pack_padded_sequence with instances and sequence lengths
# Packing is important for training RNN on text with variable length to save compute
packed_input = torch.nn.utils.rnn.pack_padded_sequence(seq_tensor, seq_lengths.cpu().numpy(), batch_first=True)
# packed_input is a PackedSequence, a NamedTuple whose main attributes are data and batch_sizes
packed_input
PackedSequence(data=tensor([ 4, 1, 10, 19, 9, 9, 13, 12, 11, 17, 14, 6, 3, 6, 7, 8, 7, 18, 5, 16, 13, 2, 17, 15]), batch_sizes=tensor([3, 3, 3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1]), sorted_indices=None, unsorted_indices=None)
# Step 6: Let's now proceed with the network transformations and embed the instances
embedded_seq_tensor = embed(seq_tensor)
embedded_seq_tensor
tensor([[[ 1.6423, -0.1596, -0.4974, 0.4396], [ 0.5750, -0.6417, -2.2064, -0.7508], [-0.4880, 1.1914, -0.8140, -0.7360], [ 0.3466, -0.1973, -1.0546, 1.2780], [-0.7279, -0.5594, -0.7688, 0.7624], [-1.3847, -0.8712, -0.2234, 1.7174], [-0.1722, 0.5238, 0.0566, 0.4263], [-0.7581, 1.0783, 0.8008, 1.6806], [ 1.4451, 0.8564, 2.2181, 0.5232], [-0.4880, 1.1914, -0.8140, -0.7360], [-0.7521, 1.6487, -0.3925, -1.4036], [ 0.3466, -0.1973, -1.0546, 1.2780], [-0.0978, 1.8446, -1.1845, 1.3835]], [[ 0.6784, -1.2345, -0.0431, -1.6047], [ 0.3189, -0.4245, 0.3057, -0.7746], [-0.9138, -0.6581, 0.0780, 0.5258], [-1.4032, 0.0360, -0.0635, 0.6756], [ 1.2791, 1.2964, 0.6105, 1.3347], [-0.2316, 0.0418, -0.2516, 0.8599], [ 1.9269, 1.4873, 0.9007, -2.1055], [ 1.9269, 1.4873, 0.9007, -2.1055], [ 1.9269, 1.4873, 0.9007, -2.1055], [ 1.9269, 1.4873, 0.9007, -2.1055], [ 1.9269, 1.4873, 0.9007, -2.1055], [ 1.9269, 1.4873, 0.9007, -2.1055], [ 1.9269, 1.4873, 0.9007, -2.1055]], [[-1.5576, 0.9956, -0.8798, -0.6011], [ 0.3189, -0.4245, 0.3057, -0.7746], [-1.2742, 2.1228, -1.2347, -0.4879], [ 1.2791, 1.2964, 0.6105, 1.3347], [-0.2316, 0.0418, -0.2516, 0.8599], [ 1.9269, 1.4873, 0.9007, -2.1055], [ 1.9269, 1.4873, 0.9007, -2.1055], [ 1.9269, 1.4873, 0.9007, -2.1055], [ 1.9269, 1.4873, 0.9007, -2.1055], [ 1.9269, 1.4873, 0.9007, -2.1055], [ 1.9269, 1.4873, 0.9007, -2.1055], [ 1.9269, 1.4873, 0.9007, -2.1055], [ 1.9269, 1.4873, 0.9007, -2.1055]]], grad_fn=<EmbeddingBackward0>)
# Step 7: Call pack_padded_sequence with the embedded instances and sequence lengths
packed_input = torch.nn.utils.rnn.pack_padded_sequence(embedded_seq_tensor, seq_lengths.cpu().numpy(), batch_first=True)
# packed_input is a PackedSequence, a NamedTuple whose main attributes are data and batch_sizes
packed_input
PackedSequence(data=tensor([[ 1.6423, -0.1596, -0.4974, 0.4396], [ 0.6784, -1.2345, -0.0431, -1.6047], [-1.5576, 0.9956, -0.8798, -0.6011], [ 0.5750, -0.6417, -2.2064, -0.7508], [ 0.3189, -0.4245, 0.3057, -0.7746], [ 0.3189, -0.4245, 0.3057, -0.7746], [-0.4880, 1.1914, -0.8140, -0.7360], [-0.9138, -0.6581, 0.0780, 0.5258], [-1.2742, 2.1228, -1.2347, -0.4879], [ 0.3466, -0.1973, -1.0546, 1.2780], [-1.4032, 0.0360, -0.0635, 0.6756], [ 1.2791, 1.2964, 0.6105, 1.3347], [-0.7279, -0.5594, -0.7688, 0.7624], [ 1.2791, 1.2964, 0.6105, 1.3347], [-0.2316, 0.0418, -0.2516, 0.8599], [-1.3847, -0.8712, -0.2234, 1.7174], [-0.2316, 0.0418, -0.2516, 0.8599], [-0.1722, 0.5238, 0.0566, 0.4263], [-0.7581, 1.0783, 0.8008, 1.6806], [ 1.4451, 0.8564, 2.2181, 0.5232], [-0.4880, 1.1914, -0.8140, -0.7360], [-0.7521, 1.6487, -0.3925, -1.4036], [ 0.3466, -0.1973, -1.0546, 1.2780], [-0.0978, 1.8446, -1.1845, 1.3835]], grad_fn=<PackPaddedSequenceBackward0>), batch_sizes=tensor([3, 3, 3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1]), sorted_indices=None, unsorted_indices=None)
# Step 8: Forward with LSTM, get packed_output, hidden state and cell state
packed_output, (ht, ct) = lstm(packed_input)
# Step 9: Call pad_packed_sequence to unpack if required / or just pick the last hidden vector
output, input_sizes = torch.nn.utils.rnn.pad_packed_sequence(packed_output, batch_first=True)
Summary of Shape Transformations
(batch_size X max_seq_len X embedding_dim) --> Sort by seqlen ---> (batch_size X max_seq_len X embedding_dim)
(batch_size X max_seq_len X embedding_dim) ---> Pack ---> (batch_sum_seq_len X embedding_dim)
(batch_sum_seq_len X embedding_dim) ---> LSTM ---> (batch_sum_seq_len X hidden_dim)
(batch_sum_seq_len X hidden_dim) ---> UnPack ---> (batch_size X max_seq_len X hidden_dim)
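As a quick sanity check of this summary, we can print the shapes of the tensors produced in Steps 4-9 above (for this 3-sentence batch, batch_sum_seq_len = 13 + 6 + 5 = 24):
# Sanity-check the shape transformations using the objects from Steps 4-9
print(embedded_seq_tensor.shape)  # (batch_size x max_seq_len x embedding_dim) = (3, 13, 4)
print(packed_input.data.shape)    # (batch_sum_seq_len x embedding_dim) = (24, 4)
print(packed_output.data.shape)   # (batch_sum_seq_len x hidden_dim) = (24, 5)
print(output.shape)               # (batch_size x max_seq_len x hidden_dim) = (3, 13, 5)
print(ht.shape, ct.shape)         # final hidden/cell states: (num_layers x batch_size x hidden_dim) = (1, 3, 5)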
For training the language model, we'll use the WikiText-2 dataset.
HuggingFace hosts a repository of many datasets, where you can easily find datasets of interest (e.g. the TyDi QA dataset).
datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')
datasets.keys()
dict_keys(['test', 'train', 'validation'])
print(datasets['train'][1])
print(datasets['train'][2])
print(datasets['train'][3])
{'text': ' = Valkyria Chronicles III = \n'} {'text': ''} {'text': ' Senjō no Valkyria 3 : Unrecorded Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . Employing the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " Calamaty Raven " . \n'}
# We'll again use the pretrained BPE embeddings and the corresponding tokenizer.
bpemb_en = BPEmb(lang='en', dim=100, vs=25000)
# Extract the embeddings and add a zero-initialized embedding for our extra [PAD] token
pretrained_embeddings = np.concatenate([bpemb_en.emb.vectors, np.zeros(shape=(1,100))], axis=0)
# Extract the vocab and add an extra [PAD] token
vocabulary = bpemb_en.emb.index2word + ['[PAD]']
print("embeddings' shape:", pretrained_embeddings.shape, "vocabulary size:", len(vocabulary))
embeddings' shape: (25001, 100) vocabulary size: 25001
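To get a feel for the BPE tokenizer, here is what it produces for a short sample sentence (the exact subwords and ids depend on the English vs=25000 model loaded above):
# Quick look at the BPE tokenizer on an arbitrary sample sentence
sample = 'Fruit flies like a banana.'
print(bpemb_en.encode(sample))               # subword strings (with the ▁ word-boundary marker)
print(bpemb_en.encode_ids_with_eos(sample))  # corresponding ids, with the EOS id appended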
def tokenizer(text):
return {'input_ids': bpemb_en.encode_ids_with_eos(text)}
def tokenize_function(examples):
return tokenizer(examples['text'])
def group_texts(examples):
# Concatenate all texts.
concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
total_length = len(concatenated_examples[list(examples.keys())[0]])
# We drop the small remainder; we could instead pad the last chunk if the model supported it.
# You can customize this part to your needs.
total_length = (total_length // block_size) * block_size
# Split into chunks of size block_size.
result = {
k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
for k, t in concatenated_examples.items()
}
return result
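A toy illustration of what group_texts does (hypothetical mini-batch; block_size is temporarily set to 4 for brevity and is set to its real value in the next cell):
block_size = 4
toy_batch = {'input_ids': [[1, 2, 3], [4, 5, 6, 7, 8]]}
print(group_texts(toy_batch))  # {'input_ids': [[1, 2, 3, 4], [5, 6, 7, 8]]}: concatenated, chunked, remainder dropped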
block_size = 128
tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])
print(tokenized_datasets['train'][3]['input_ids'][:10])
lm_datasets = tokenized_datasets.map(group_texts, batched=True, batch_size=1000, num_proc=4,)
[1282, 20611, 467, 852, 2358, 2538, 121, 944, 118, 14849]
def collate_batch_bilstm(dataset) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
"""
Combines multiple data samples into a single batch
:param dataset: A list of examples, each a dict containing the token 'input_ids'
:return: A tuple of tensors (input_ids, input_lengths, targets)
"""
input_ids = [i['input_ids'] for i in dataset]
input_lengths, padded_input = [], []
for sentence in input_ids:
sentence = sentence[:seq_len]
input_lengths.append(len(sentence) - 1)
sentence = sentence + [0] * (seq_len - len(sentence))
padded_input.append(sentence)
input_data = torch.tensor(padded_input)
# we don't use the last position as there isn't anything left for generation
input_ids = input_data[:, :-1]
# the target at each step is to generate the next word from the sequence
# so we shift the token ids with 1 position
targets = input_data[:, 1:].reshape(-1)
return input_ids, torch.tensor(input_lengths), targets
test_dl = torch.utils.data.DataLoader(lm_datasets['test'], batch_size=32, collate_fn=collate_batch_bilstm)
train_dl = torch.utils.data.DataLoader(lm_datasets['train'], batch_size=32, collate_fn=collate_batch_bilstm)
valid_dl = torch.utils.data.DataLoader(lm_datasets['validation'], batch_size=32, collate_fn=collate_batch_bilstm)
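A small check of the collate function on a toy batch (the ids are arbitrary; seq_len is defined here because it is otherwise only set with the other hyperparameters further below):
seq_len = 128  # same value as in the hyperparameter cell below
toy_batch = [{'input_ids': [5, 6, 7, 8]}, {'input_ids': [9, 10]}]
toy_ids, toy_lens, toy_targets = collate_batch_bilstm(toy_batch)
print(toy_ids.shape, toy_lens, toy_targets.shape)  # torch.Size([2, 127]) tensor([3, 1]) torch.Size([254])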
Next we will create an LSTM model. We again extend the PyTorch class torch.nn.Module, implement the __init__ function, and define how tensors are processed in the forward function.
class LSTM_LM(nn.Module):
"""
LSTM Language Model
"""
def __init__(
self,
pretrained_embeddings: torch.tensor,
lstm_dim: int,
dropout_prob: float = 0.0,
lstm_layers: int = 1,
):
"""
Initializer for LSTM Language Model
:param pretrained_embeddings: A tensor containing the pretrained BPE embeddings
:param lstm_dim: The dimensionality of the LSTM network
:param dropout_prob: Dropout probability
:param lstm_layers: The number of stacked LSTM layers
"""
# First thing is to call the superclass initializer
super(LSTM_LM, self).__init__()
# We'll define the network in a ModuleDict, which makes organizing the model a bit nicer
# The components are an embedding layer, an LSTM layer, a dropout layer, and a feed-forward output layer
self.vocab_size = pretrained_embeddings.shape[0]
self.model = nn.ModuleDict({
'embeddings': nn.Embedding.from_pretrained(pretrained_embeddings, padding_idx=pretrained_embeddings.shape[0] - 1),
'lstm': nn.LSTM(
pretrained_embeddings.shape[1],
lstm_dim,
num_layers=lstm_layers,
batch_first=True,
dropout=dropout_prob),
'ff': nn.Linear(lstm_dim, pretrained_embeddings.shape[0]),
'drop': nn.Dropout(dropout_prob)
})
# Initialize the weights of the model
self._init_weights()
def _init_weights(self):
all_params = list(self.model['lstm'].named_parameters()) + \
list(self.model['ff'].named_parameters())
for n, p in all_params:
if 'weight' in n:
nn.init.xavier_normal_(p)
elif 'bias' in n:
nn.init.zeros_(p)
def forward(self, input_ids, input_lens, hidden_states):
"""
Defines how tensors flow through the model
:param input_ids: (b x sl) The IDs into the vocabulary of the input samples
:param input_lens: (b x 1) The length of each instance's text
:param hidden_states: A tuple (h, c) of initial hidden and cell states for the LSTM
:return: (lstm output, updated hidden states)
"""
# Get embeddings (b x sl x edim)
embeds = self.model['drop'](self.model['embeddings'](input_ids))
lstm_in = nn.utils.rnn.pack_padded_sequence(
embeds,
input_lens.to('cpu'),
batch_first=True,
enforce_sorted=False
)
# Pass the packed sequence through the LSTM
# (the LSTM starts from a zero state here; its final states are returned below)
lstm_out, hidden = self.model['lstm'](lstm_in)
# Unpack the packed sequence --> (b x sl x lstm_dim)
lstm_out, _ = nn.utils.rnn.pad_packed_sequence(lstm_out, batch_first=True)
lstm_out = self.model['drop'](lstm_out)
# generate the prediction of each word in the vocabulary being the next
lstm_out = self.model['ff'](lstm_out)
lstm_out = lstm_out.reshape(-1, self.vocab_size)
return lstm_out, hidden
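Before training on real data, a quick smoke test of the class with tiny, hypothetical dimensions (an 11-entry vocabulary with index 10 as padding, 8-dimensional embeddings, a 16-dimensional LSTM):
toy_lm = LSTM_LM(torch.randn(11, 8), lstm_dim=16)
toy_out, toy_hidden = toy_lm(torch.randint(0, 10, (2, 7)), torch.tensor([7, 4]), None)  # None: initial states are not consumed here
print(toy_out.shape)  # torch.Size([14, 11]): one distribution over the vocabulary per (batch x position)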
This is a utility function which will take a model and a validation dataloader and return the current perplexity of the model against that dataset. We can use this to know when to save the model and to perform early stopping if desired.
def evaluate(model: nn.Module, valid_dl: DataLoader):
"""
Evaluates the model on the given dataset
:param model: The model under evaluation
:param valid_dl: A `DataLoader` reading validation data
:return: The perplexity of the model on the dataset
"""
model.eval()
loss_all = []
states = (torch.zeros(lstm_layers, batch_size, lstm_dim).to(device),
torch.zeros(lstm_layers, batch_size, lstm_dim).to(device))
loss_fn = nn.CrossEntropyLoss()
with torch.no_grad():
for batch in tqdm(valid_dl, desc='Evaluation'):
batch = tuple(t.to(device) for t in batch)
input_ids = batch[0]
input_lens = batch[1]
targets = batch[2]
states = detach(states)
logits, states = model(input_ids, input_lens, states)
loss = loss_fn(logits, targets.reshape(-1))
loss_all.append(loss.detach().cpu().numpy())
perplexity = np.exp(sum(loss_all) / (len(loss_all)))
return perplexity
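The perplexity returned above is the exponential of the average cross-entropy loss over batches, PPL = exp((1/B) * Σ_b loss_b); since the batches contain roughly equally many tokens, this closely matches the usual per-token perplexity.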
Here we define the main training loop.
# Truncated backpropagation: detach hidden states so gradients do not flow between batches
def detach(states):
return [state.detach() for state in states]
def train(
model: nn.Module,
train_dl: DataLoader,
valid_dl: DataLoader,
optimizer: torch.optim.Optimizer,
n_epochs: int,
device: torch.device
):
"""
The main training loop which will optimize a given model on a given dataset
:param model: The model being optimized
:param train_dl: The training dataset
:param valid_dl: A validation dataset
:param optimizer: The optimizer used to update the model parameters
:param n_epochs: Number of epochs to train for
:param device: The device to train on
:return: (model, losses) The best model and the losses per iteration
"""
# Keep track of the loss, the best validation perplexity and the corresponding model weights
losses = []
best_perplexity = float('inf')
best_model = model.state_dict()
# Cross-entropy loss over next-token predictions
loss_fn = nn.CrossEntropyLoss()
# Iterate through epochs
for ep in range(n_epochs):
states = (torch.zeros(lstm_layers, batch_size, lstm_dim).to(device),
torch.zeros(lstm_layers, batch_size, lstm_dim).to(device))
loss_epoch = []
#Iterate through each batch in the dataloader
for batch in tqdm(train_dl):
# VERY IMPORTANT: Make sure the model is in training mode, which turns on
# things like dropout (and, for models that use it, batch normalization updates)
model.train()
# VERY IMPORTANT: zero out all of the gradients on each iteration -- PyTorch
# keeps track of these dynamically in its computation graph so you need to explicitly
# zero them out
optimizer.zero_grad()
# Place each tensor on the GPU
batch = tuple(t.to(device) for t in batch)
input_ids = batch[0]
input_lens = batch[1]
targets = batch[2]
# Pass the inputs through the model, get the current loss and logits
states = detach(states)
logits, states = model(input_ids, input_lens, states)
loss = loss_fn(logits, targets.reshape(-1))
losses.append(loss.item())
loss_epoch.append(loss.item())
# Calculate all of the gradients and weight updates for the model
loss.backward()
# Optional: clip gradients, helps with exploding gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
# Finally, update the weights of the model
optimizer.step()
#gc.collect()
# Perform inline evaluation at the end of the epoch
perplexity = evaluate(model, valid_dl)
print(f'Validation perplexity: {perplexity}, train loss: {sum(loss_epoch) / len(loss_epoch)}')
# Keep track of the best model based on validation perplexity
if perplexity < best_perplexity:
best_model = model.state_dict()
best_perplexity = perplexity
model.load_state_dict(best_model)
return model, losses
Now that we have the basic training and evaluation loops defined, we can create the datasets and optimizer and run it!
# Define some hyperparameters
lr = 0.001
n_epochs = 6
lstm_dim = 1024
lstm_layers = 4
batch_size = 128
seq_len = 128
model = LSTM_LM(
torch.FloatTensor(pretrained_embeddings),
lstm_dim=lstm_dim,
dropout_prob=0.1,
lstm_layers=lstm_layers
).to(device)
# Create the optimizer
optimizer = Adam(model.parameters(), lr=lr)
# Train
model, losses = train(model, train_dl, valid_dl, optimizer, n_epochs, device)
Validation perplexity: 1818.0729549106857, train loss: 7.606159637686805
Validation perplexity: 1179.8671494429902, train loss: 7.158430816733181
Validation perplexity: 1226.9833916912687, train loss: 7.133625640295981
Validation perplexity: 1239.3600432924457, train loss: 7.105498870346502
Validation perplexity: 1334.6796099809183, train loss: 7.01695058580631
Validation perplexity: 483.7851902849598, train loss: 6.430242812295191
torch.save(model.state_dict(), 'best_model_wiki')
Next we can plot the loss curve
plt.plot(losses)
evaluate(model, test_dl)
499.80787826019485
sentence_start = 'This was a'
print(sentence_start, end=' ')
new_token = None
states = (torch.zeros(lstm_layers, 1, lstm_dim).to(device),
torch.zeros(lstm_layers, 1, lstm_dim).to(device))
loss_fn = torch.nn.CrossEntropyLoss()
while new_token != '<eos>' and len(sentence_start) < 200:
token_ids = bpemb_en.encode_ids(sentence_start)
batch = collate_batch_bilstm([{'input_ids': token_ids}])
logits, states = model(batch[0].to(device), batch[1].to(device), states)
logits = logits.detach().cpu().numpy()[-1]
new_token_ids = np.argsort(logits, axis=-1)[::-1]
token_id = new_token_ids[0]
word = bpemb_en.decode_ids([int(token_id)])
new_token = word[0]
print(word[0], end=' ')
sentence_start = sentence_start + " " + word[0]
This was a the star star of . the the star star of of the the star star of , the the star star of , the the star star of , the the star star of , the the star star of , the the star star of , the the star
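The greedy loop above quickly falls into a repetition cycle. A minimal sketch of one fix, under the same setup, is to sample the next id from the (temperature-scaled) softmax instead of taking the argmax; temperature is a hypothetical knob here, and the HuggingFace generate() equivalents are shown later in the notebook:
def sample_next_id(logits, temperature=1.0):
    # softmax over temperature-scaled logits, computed in a numerically stable way
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))
    probs = probs / probs.sum()
    return int(np.random.choice(len(probs), p=probs))
# inside the loop, replace the argsort/argmax step with: token_id = sample_next_id(logits)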
def get_sentence_perplexity(sentence, model, vocabulary, seq_len):
states = (torch.zeros(lstm_layers, 1, lstm_dim).to(device),
torch.zeros(lstm_layers, 1, lstm_dim).to(device))
token_ids = [{'input_ids': bpemb_en.encode_ids(sentence)}]
batch = collate_batch_bilstm(token_ids)
loss_fn = torch.nn.CrossEntropyLoss()
logits, states = model(batch[0].to(device), batch[1].to(device), states)
target = batch[2].to(device)[:len(token_ids[0]['input_ids'])-1]
loss = loss_fn(logits, target.reshape(-1))
loss = loss.detach().cpu().numpy()
return np.exp(loss)
get_sentence_perplexity('I want to buy some potatoes from the airport.', model, vocabulary, seq_len)
1096.5176
get_sentence_perplexity('jibberish ? . something something is', model, vocabulary, seq_len)
1489.3047
References:
Problems with using Bi-LSTMs to learn contextual word embeddings:
Apart from improvements to the core Transformer architecture, Transformer models differ in the objectives they are trained to optimise:
One architecture can be used for training with different objectives. The chosen objective makes a model more suitable for particular tasks: autoregressive models are most suitable for generation tasks, autoencoding models for classification tasks, and sequence-to-sequence models for translation, summarisation and other text-to-text tasks.
We can directly use pretrained Transformer Models (search for suitable models on HuggingFace's Models page) and fine-tune them to perform our task at hand.
We can also fine-tune the selected Transformer Model with intermediate in-domain data and then use it for the end task.
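As an illustration of how the training objective maps onto HuggingFace model classes (the checkpoints below are just example members of each family):
from transformers import AutoModelForCausalLM, AutoModelForMaskedLM, AutoModelForSeq2SeqLM
causal_lm = AutoModelForCausalLM.from_pretrained('distilgpt2')                # autoregressive, GPT-style
masked_lm = AutoModelForMaskedLM.from_pretrained('distilbert-base-uncased')   # autoencoding, BERT-style
seq2seq_lm = AutoModelForSeq2SeqLM.from_pretrained('t5-small')                # sequence-to-sequence, T5-style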
import transformers
from transformers import AutoTokenizer
from torch.utils.data import Dataset
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification
from transformers import Trainer, TrainingArguments
import math
model_checkpoint = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)
unsupervised_imdb = load_dataset('imdb', split='unsupervised')
unsupervised_imdb_splits = unsupervised_imdb.train_test_split(test_size=0.01)
print(unsupervised_imdb_splits.keys())
print(unsupervised_imdb_splits['train'][0])
Dataset imdb downloaded and prepared to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1. Subsequent calls will reuse this data. dict_keys(['train', 'test']) {'text': 'In preparation for writing this comment, which I am compelled to write for reasons which will become clear, I read a fair few of the major critics to see, if on this occasion, they had seen and felt what I had.<br /><br />James Bernadelli suggests this film should be known as the "Pursuit of Richness"; another column I have read suggests that the money-will-solve-your-problems resolution is depressing and not the message Hollywood (or art in general) should be trying to put across.<br /><br />I ask myself when the last occasion might have been that any of these critics genuinely had to :spoiler: run, under painful, embarrassing duress, from a cab - because they didn\'t have the money; or when the forces of life seemed to conspire against them so unfairly that they broke down in tears. Spoiler: when your wealthy boss, who you CANNOT disappoint, asks you to borrow the last five dollars in your wallet, and you know that that money is all you have in the world - to feed your family, to pay your gas.<br /><br />You go through times in life when you feel that things couldn\'t possibly get worse - and then they do - and then they do again. Sometimes, (like Chris and son at the beach towards the end of the film) you just want to get away from it all, other times you cry. Sometimes, and this is rare, you laugh - because if you don\'t you\'ll cry. You know that if you let it all become too much, you will sink to the bottom of the sea.<br /><br />Chris Gardner (not the real one perhaps, but the one in this movie) is my personal hero; my shining example; my inspiration - the guy that never allows himself to sink - even when he can feel his shoelaces trailing on the seabed.<br /><br />I realised as I was watching this film that I was watching another me, so I never once stopped rooting for Chris. When he :spoiler: fixes the scanner in the shelter, I felt like I had fixed it.<br /><br />I\'m still working on getting the job that earns me enough to have a less painful life - when Chris finally achieves it, I want it for him so badly that I feel it in the very fibre of my being.<br /><br />I suppose that it doesn\'t matter that much that Chris wants to succeed as much for his son as for himself. In a way, the fact that Chris has a son makes this movie emotionally frightening and if (and only if) our basest fears are being toyed with by the director, it is only in exactly the same way that we sometimes look up in to the sky and say - like Jim Carrey in "Bruce Almighty" - is there anything else you could possibly do to me today? In my life, God doesn\'t appear and explain why he keeps toying with me. The truest belief is in oneself and one must never lose it.<br /><br />I adore this movie and I couldn\'t be more grateful for its existence. It comes at a time when I need it most. Its message: Keep going - never, ever give up.<br /><br />And so I come back to those reviewers. Money certainly doesn\'t mean happiness in this existence, but no money, in this cruel capitalist world, can cost you your life. Do not be judgmental of those who want more, it might just be enough to pay for their son to eat, to keep a roof over his head, to keep their dignity. This film is not about rags-to-riches as some reviewers have said. It\'s not about the American dream either. 
It is about dignity and integrity and how money (or lack thereof) could easily strip you of both.<br /><br />Chris Gardner never loses either and for this, he is an inspiration to us all.<br /><br />See this film TODAY.', 'label': -1}
block_size = 128
def tokenize_function(examples):
return tokenizer(examples["text"])
def group_texts(examples):
# Concatenate all texts.
keys = ['attention_mask', 'input_ids']
concatenated_examples = {k: sum(examples[k], []) for k in keys}
total_length = len(concatenated_examples[list(keys)[0]])
# We drop the small remainder; we could instead pad the last chunk if the model supported it.
# You can customize this part to your needs.
total_length = (total_length // block_size) * block_size
# Split into chunks of size block_size.
result = {
k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
for k, t in concatenated_examples.items()
}
# 'label' is needed because the used dataset is a subclass of ClassificationDataset, which requires it as a field;
# 'labels' is a copy of the input ids, since for causal LM the targets are the inputs (the model shifts them internally)
result["label"] = result["input_ids"].copy()
result["labels"] = result["input_ids"].copy()
return result
unsupervised_imdb_tok = unsupervised_imdb_splits.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])
unsupervised_imdb_splits = unsupervised_imdb_splits.remove_columns(['label'])
unsupervised_imdb_tok_lm = unsupervised_imdb_tok.map(group_texts, batched=True, batch_size=1000, num_proc=4,)
unsupervised_imdb_tok_lm = unsupervised_imdb_tok_lm.remove_columns(['label'])
print(unsupervised_imdb_tok_lm['test'][0])
{'input_ids': [1532, 345, 1053, 1775, 262, 6833, 13637, 327, 26183, 2196, 20495, 18653, 7886, 340, 338, 1327, 284, 1234, 340, 503, 286, 534, 1182, 11, 475, 345, 2192, 815, 466, 780, 428, 530, 318, 6635, 1180, 13, 47419, 774, 468, 587, 9958, 287, 7075, 286, 10319, 12, 448, 9961, 532, 42156, 11, 46835, 290, 477, 12, 744, 22029, 1108, 13, 7477, 340, 338, 11441, 11, 13913, 88, 11, 31826, 1417, 290, 15074, 25292, 357, 20839, 597, 1866, 286, 262, 41162, 1107, 5806, 12718, 12, 3036, 1150, 15232, 33924, 475, 3805, 477, 428, 340, 318, 29381, 13206, 13, 314, 7360, 3521, 470, 11626, 3589, 1497, 422, 262, 3159, 1566, 262, 886, 286, 262, 3807, 13, 1002, 612, 338, 257, 5749, 19370, 345, 460, 1414, 284, 257, 2646, 314, 836, 470], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [1532, 345, 1053, 1775, 262, 6833, 13637, 327, 26183, 2196, 20495, 18653, 7886, 340, 338, 1327, 284, 1234, 340, 503, 286, 534, 1182, 11, 475, 345, 2192, 815, 466, 780, 428, 530, 318, 6635, 1180, 13, 47419, 774, 468, 587, 9958, 287, 7075, 286, 10319, 12, 448, 9961, 532, 42156, 11, 46835, 290, 477, 12, 744, 22029, 1108, 13, 7477, 340, 338, 11441, 11, 13913, 88, 11, 31826, 1417, 290, 15074, 25292, 357, 20839, 597, 1866, 286, 262, 41162, 1107, 5806, 12718, 12, 3036, 1150, 15232, 33924, 475, 3805, 477, 428, 340, 318, 29381, 13206, 13, 314, 7360, 3521, 470, 11626, 3589, 1497, 422, 262, 3159, 1566, 262, 886, 286, 262, 3807, 13, 1002, 612, 338, 257, 5749, 19370, 345, 460, 1414, 284, 257, 2646, 314, 836, 470]}
model_name = model_checkpoint.split("/")[-1]
training_args = TrainingArguments(
"test-clm",
evaluation_strategy = "epoch",
learning_rate=2e-5,
weight_decay=0.01,
num_train_epochs=1,
max_steps=300
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=unsupervised_imdb_tok_lm["train"],
eval_dataset=unsupervised_imdb_tok_lm["test"],
)
max_steps is given, it will override any value given in num_train_epochs
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
***** Running Evaluation ***** Num examples = 1073 Batch size = 8
Perplexity: 70.50
trainer.train()
***** Running training ***** Num examples = 116288 Num Epochs = 1 Instantaneous batch size per device = 8 Total train batch size (w. parallel, distributed & accumulation) = 8 Gradient Accumulation steps = 1 Total optimization steps = 300
Epoch | Training Loss | Validation Loss |
---|---|---|
0 | No log | 4.047292 |
***** Running Evaluation ***** Num examples = 1073 Batch size = 8
Training completed. Do not forget to share your model on huggingface.co/models =)
TrainOutput(global_step=300, training_loss=4.1625390625, metrics={'train_runtime': 68.0522, 'train_samples_per_second': 35.267, 'train_steps_per_second': 4.408, 'total_flos': 78389025177600.0, 'train_loss': 4.1625390625, 'epoch': 0.02})
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
***** Running Evaluation ***** Num examples = 1073 Batch size = 8
Perplexity: 57.24
model.save_pretrained("imdb-gpt2")
Configuration saved in imdb-gpt2/config.json Model weights saved in imdb-gpt2/pytorch_model.bin
# Export contextual representations:
def mean_pooling(model_output, attention_mask):
# Mean Pooling - Take attention mask into account for correct averaging
input_mask_expanded = attention_mask.unsqueeze(-1).expand(model_output.size()).float()
sum_embeddings = torch.sum(model_output * input_mask_expanded, 1)
sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
return sum_embeddings / sum_mask
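A tiny check with toy tensors that the masked mean ignores padded positions:
toy_output = torch.tensor([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])  # (batch=1, tokens=3, dim=2)
toy_mask = torch.tensor([[1, 1, 0]])                                    # last position is padding
print(mean_pooling(toy_output, toy_mask))                               # tensor([[2., 2.]]) -- the 100s are ignored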
encoded = tokenizer(['Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers".'],
return_tensors='pt')
encoded = {k:v.to(device) for k, v in encoded.items()}
model_output = model(**encoded, output_hidden_states=True, return_dict=True)
print(model_output.keys())
print(len(model_output['hidden_states'])) # hidden states from the embedding layer plus each of the 6 transformer layers (7 in total)
print(model_output['hidden_states'][-1].shape) # last layer with contextual representations (batch_size x num words x representation dim)
# Aggregate all the representations into one
mean_model_output = mean_pooling(model_output['hidden_states'][-1], encoded['attention_mask'])
mean_model_output.shape
odict_keys(['logits', 'past_key_values', 'hidden_states']) 7 torch.Size([1, 29, 768])
torch.Size([1, 768])
model_checkpoint = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)
# encode context the generation is conditioned on
input_ids = tokenizer.encode('I was meaning to', return_tensors='pt')
# generate text until the output length (which includes the context length) reaches 50
greedy_output = model.generate(input_ids, max_length=50)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results. Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Output: ---------------------------------------------------------------------------------------------------- I was meaning to be a part of the world, and I was a part of it. I was a part of it. I was a part of it. I was a part of it. I was a part of it. I was a
Standard (greedy) generation was shown above. Other generation approaches, demonstrated below:
- Beam Search
- N-gram penalty
- Sampling (with temperature)
- Top-K Sampling
- Top-p (Nucleus) Sampling
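The approaches listed above (each demonstrated in the following cells) can also be combined in a single generate() call; an illustrative, untuned combination:
input_ids = tokenizer.encode('I was going', return_tensors='pt')
combined_output = model.generate(input_ids, max_length=50, do_sample=True,
                                 top_k=50, top_p=0.95, temperature=0.8,
                                 no_repeat_ngram_size=2)
print(tokenizer.decode(combined_output[0], skip_special_tokens=True))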
input_ids = tokenizer.encode('I was going', return_tensors='pt')
beam_outputs = model.generate(input_ids, max_length=50, num_beams=5, num_return_sequences=5)
print("Beam Search Output:")
for i, beam_output in enumerate(beam_outputs):
print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))
Beam Search Output: 0: I was going to have to do a lot of work to make sure I had enough time to make sure I had enough time to make sure I had enough time to make sure I had enough time to make sure I had enough time to make sure I 1: I was going to have to do a lot of work to make sure I had enough time to make sure I had enough time to do a lot of work to make sure I had enough time to make sure I had enough time to make sure I had 2: I was going to have to do a lot of work to make sure I had enough time to make sure I had enough time to make sure that I had enough time to make sure that I had enough time to make sure that I had enough time to 3: I was going to have to do a lot of work to make sure I had enough time to make sure I had enough time to make sure that I had enough time to make sure I had enough time to make sure I had enough time to make sure 4: I was going to have to do a lot of work to make sure I had enough time to make sure I had enough time to make sure I had enough time to make sure I had enough time to make sure I was enough time to make sure I
input_ids = tokenizer.encode('I was going', return_tensors='pt')
output = model.generate(input_ids, max_length=50, num_beams=5)
print("Beam Search Output:")
print(tokenizer.decode(output[0], skip_special_tokens=True))
output = model.generate(input_ids, max_length=50, num_beams=5, no_repeat_ngram_size=2)
print("N-Gram Penalty on Beam Search Output:")
print(tokenizer.decode(output[0], skip_special_tokens=True))
Beam Search Output: I was going to have to do a lot of work to make sure I had enough time to make sure I had enough time to make sure I had enough time to make sure I had enough time to make sure I had enough time to make sure I N-Gram Penalty on Beam Search Output: I was going to have to do a lot of work to make sure I had the right amount of time to work on this project.” “I’m not sure if I can do that, but I think it�
# Higher temperature -- more random samples, lower -- more deterministic
input_ids = tokenizer.encode('I was going', return_tensors='pt')
sample_output = model.generate(input_ids, max_length=50, do_sample=True)
print("Sampling Output:")
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
sample_output = model.generate(input_ids, max_length=50, do_sample=True, temperature=10.)
print("Sampling Output:")
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
sample_output = model.generate(input_ids, max_length=50, do_sample=True, temperature=0.5)
print("Sampling Output:")
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
Sampling Output: I was going to lose something on that night. After I got hurt, I kept being hurt."
Sampling Output: I was going about every hour from morning with three very experienced athletes from Florida right where they would meet face for face – in-line football coaches everywhere and everything is on line like some team." So with your support now coming from Jacksonville the guys have Sampling Output: I was going to have to go to the police station. He was going to have to go to the police station. He was going to have to go to the police station. He was going to have to go to the police station. He was
input_ids = tokenizer.encode('I was going', return_tensors='pt')
sample_output = model.generate(input_ids, max_length=50, do_sample=True, top_k=50)
print("Top-K Sampling:")
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
sample_output = model.generate(input_ids, max_length=50, do_sample=True, top_k=10)
print("Top-K Sampling:")
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
sample_output = model.generate(input_ids, max_length=50, do_sample=True, top_k=100)
print("Top-K Sampling:")
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
Tok-K Sampling: I was going to go to school at 12 and go to work at 8. It was also good for me personally. "We are really hoping we will have something to do with our new location to raise awareness to where this is happening,"
Tok-K Sampling: I was going to be doing a show about how Ire a good guy, I done have the right to do thate be the showe be the talk showe be the talk show� Tok-K Sampling: I was going to live here," she said. "So she wasn't ready for New York City?" her husband asked. "I was sleeping with nothing and her husband wanted me and I could not even move out of fear
input_ids = tokenizer.encode('I was going', return_tensors='pt')
sample_output = model.generate(input_ids, max_length=50, do_sample=True, top_k=0, top_p=0.2)
print("Top-p (Nucleus) Sampling:")
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
sample_output = model.generate(input_ids, max_length=50, do_sample=True, top_k=0, top_p=0.92)
print("Top-p (Nucleus) Sampling:")
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
sample_output = model.generate(input_ids, max_length=50, do_sample=True, top_k=0, top_p=0.99)
print("Top-p (Nucleus) Sampling:")
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
Tok-p Nucleus Sampling: I was going to have to go to the hospital, and I'm going to have to go to the hospital.
Tok-p Nucleus Sampling: I was going to need to beat me and because that fight would probably end without me. In his bout with Steve Luger, Sergey Kovalev gave everything that made him the No. 1 heavyweight in the country and thought he would Tok-p Nucleus Sampling: I was going to keep running, survive straight for half a year.› As evil as I was, they seemed to just sort to just return to life, escape from their prison captors, flee the country, and then back