Link to the notebook: https://tinyurl.com/y2zz2qu8 Copy the notebook to your GDrive to edit.
# magic commands to make sure changes to external packages are automatically loaded and plots are displayed in the notebook
%reload_ext autoreload
%autoreload 2
%matplotlib inline
!pip3 install datasets transformers bpemb
import torch
import random
import numpy as np
from typing import List, Tuple
from torch.utils.data import Dataset, DataLoader
from torch import nn
import matplotlib.pyplot as plt
import re
from tqdm.notebook import tqdm
from torch.optim import Adam, RMSprop
import nltk
from datasets import load_dataset
from bpemb import BPEmb
def enforce_reproducibility(seed=42):
# Sets seed manually for both CPU and CUDA
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
# For atomic operations there is currently
# no simple way to enforce determinism, as
# the order of parallel operations is not known.
# CUDNN
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
# System based
random.seed(seed)
np.random.seed(seed)
enforce_reproducibility()
device = torch.device("cpu")
if torch.cuda.is_available():
device = torch.device("cuda")
device
device(type='cuda')
Source: https://karpathy.github.io/2015/05/21/rnn-effectiveness/
*The packing workflow shown below is PyTorch-specific.
# We want to run LSTM on a batch with 3 sentences
sents = ['The word of the Lord came to Zechariah son of Iddo the prophet.', # len = 13
'fruit flies like a banana', # len = 5
'Fruit flies live on a banana'] # len = 6
# Step 1: Construct Vocabulary
vocab = ['<pad>'] + sorted(set([token for sent in sents for token in sent.split()]))
# Step 2: Load indexed data (a list of instances, where each instance is a list of token indices)
vectorized_seqs = [[vocab.index(tok) for tok in sent.split()] for sent in sents]
# Step 3: Make the model
embed = torch.nn.Embedding(len(vocab), 4) # embedding_dim = 4
lstm = torch.nn.LSTM(input_size=4, hidden_size=5, batch_first=True) # input_dim = 4, hidden_dim = 5
# Step 4: Pad instances with 0s up to the max sequence length
# get the length of each seq in your batch
seq_lengths = torch.LongTensor(list(map(len, vectorized_seqs)))
seq_tensor = torch.zeros((len(vectorized_seqs), seq_lengths.max()), dtype=torch.long)
for idx, (seq, seqlen) in enumerate(zip(vectorized_seqs, seq_lengths)):
seq_tensor[idx, :seqlen] = torch.LongTensor(seq)
seq_tensor
tensor([[ 4, 19, 13, 17, 3, 8, 18, 5, 16, 13, 2, 17, 15], [10, 9, 11, 6, 7, 0, 0, 0, 0, 0, 0, 0, 0], [ 1, 9, 12, 14, 6, 7, 0, 0, 0, 0, 0, 0, 0]])
# Step 5: Sort instances by sequence length in descending order
# this step is not compulsory: the packing functions have an enforce_sorted parameter, which can instead be set to False
seq_lengths, perm_idx = seq_lengths.sort(0, descending=True)
seq_tensor = seq_tensor[perm_idx]
seq_tensor
tensor([[ 4, 19, 13, 17, 3, 8, 18, 5, 16, 13, 2, 17, 15], [ 1, 9, 12, 14, 6, 7, 0, 0, 0, 0, 0, 0, 0], [10, 9, 11, 6, 7, 0, 0, 0, 0, 0, 0, 0, 0]])
# Calling pack_padded_sequence with instances and sequence lengths
# Packing is important for training RNN on text with variable length to save compute
packed_input = torch.nn.utils.rnn.pack_padded_sequence(seq_tensor, seq_lengths.cpu().numpy(), batch_first=True)
# packed_input is a PackedSequence, a NamedTuple whose main attributes are data and batch_sizes
packed_input
PackedSequence(data=tensor([ 4, 1, 10, 19, 9, 9, 13, 12, 11, 17, 14, 6, 3, 6, 7, 8, 7, 18, 5, 16, 13, 2, 17, 15]), batch_sizes=tensor([3, 3, 3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1]), sorted_indices=None, unsorted_indices=None)
# Step 6: Let's now proceed with the network transformations and embed the instances
embedded_seq_tensor = embed(seq_tensor)
embedded_seq_tensor
tensor([[[ 1.6423, -0.1596, -0.4974, 0.4396], [ 0.5750, -0.6417, -2.2064, -0.7508], [-0.4880, 1.1914, -0.8140, -0.7360], [ 0.3466, -0.1973, -1.0546, 1.2780], [-0.7279, -0.5594, -0.7688, 0.7624], [-1.3847, -0.8712, -0.2234, 1.7174], [-0.1722, 0.5238, 0.0566, 0.4263], [-0.7581, 1.0783, 0.8008, 1.6806], [ 1.4451, 0.8564, 2.2181, 0.5232], [-0.4880, 1.1914, -0.8140, -0.7360], [-0.7521, 1.6487, -0.3925, -1.4036], [ 0.3466, -0.1973, -1.0546, 1.2780], [-0.0978, 1.8446, -1.1845, 1.3835]], [[ 0.6784, -1.2345, -0.0431, -1.6047], [ 0.3189, -0.4245, 0.3057, -0.7746], [-0.9138, -0.6581, 0.0780, 0.5258], [-1.4032, 0.0360, -0.0635, 0.6756], [ 1.2791, 1.2964, 0.6105, 1.3347], [-0.2316, 0.0418, -0.2516, 0.8599], [ 1.9269, 1.4873, 0.9007, -2.1055], [ 1.9269, 1.4873, 0.9007, -2.1055], [ 1.9269, 1.4873, 0.9007, -2.1055], [ 1.9269, 1.4873, 0.9007, -2.1055], [ 1.9269, 1.4873, 0.9007, -2.1055], [ 1.9269, 1.4873, 0.9007, -2.1055], [ 1.9269, 1.4873, 0.9007, -2.1055]], [[-1.5576, 0.9956, -0.8798, -0.6011], [ 0.3189, -0.4245, 0.3057, -0.7746], [-1.2742, 2.1228, -1.2347, -0.4879], [ 1.2791, 1.2964, 0.6105, 1.3347], [-0.2316, 0.0418, -0.2516, 0.8599], [ 1.9269, 1.4873, 0.9007, -2.1055], [ 1.9269, 1.4873, 0.9007, -2.1055], [ 1.9269, 1.4873, 0.9007, -2.1055], [ 1.9269, 1.4873, 0.9007, -2.1055], [ 1.9269, 1.4873, 0.9007, -2.1055], [ 1.9269, 1.4873, 0.9007, -2.1055], [ 1.9269, 1.4873, 0.9007, -2.1055], [ 1.9269, 1.4873, 0.9007, -2.1055]]], grad_fn=<EmbeddingBackward0>)
# Step 7: Call pack_padded_sequence with the embedded instances and sequence lengths
packed_input = torch.nn.utils.rnn.pack_padded_sequence(embedded_seq_tensor, seq_lengths.cpu().numpy(), batch_first=True)
# packed_input is a PackedSequence, a NamedTuple whose main attributes are data and batch_sizes
packed_input
PackedSequence(data=tensor([[ 1.6423, -0.1596, -0.4974, 0.4396], [ 0.6784, -1.2345, -0.0431, -1.6047], [-1.5576, 0.9956, -0.8798, -0.6011], [ 0.5750, -0.6417, -2.2064, -0.7508], [ 0.3189, -0.4245, 0.3057, -0.7746], [ 0.3189, -0.4245, 0.3057, -0.7746], [-0.4880, 1.1914, -0.8140, -0.7360], [-0.9138, -0.6581, 0.0780, 0.5258], [-1.2742, 2.1228, -1.2347, -0.4879], [ 0.3466, -0.1973, -1.0546, 1.2780], [-1.4032, 0.0360, -0.0635, 0.6756], [ 1.2791, 1.2964, 0.6105, 1.3347], [-0.7279, -0.5594, -0.7688, 0.7624], [ 1.2791, 1.2964, 0.6105, 1.3347], [-0.2316, 0.0418, -0.2516, 0.8599], [-1.3847, -0.8712, -0.2234, 1.7174], [-0.2316, 0.0418, -0.2516, 0.8599], [-0.1722, 0.5238, 0.0566, 0.4263], [-0.7581, 1.0783, 0.8008, 1.6806], [ 1.4451, 0.8564, 2.2181, 0.5232], [-0.4880, 1.1914, -0.8140, -0.7360], [-0.7521, 1.6487, -0.3925, -1.4036], [ 0.3466, -0.1973, -1.0546, 1.2780], [-0.0978, 1.8446, -1.1845, 1.3835]], grad_fn=<PackPaddedSequenceBackward0>), batch_sizes=tensor([3, 3, 3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1]), sorted_indices=None, unsorted_indices=None)
# Step 8: Forward with LSTM, get packed_output, hidden state and cell state
packed_output, (ht, ct) = lstm(packed_input)
# Step 9: Call pad_packed_sequence to unpack if required / or just pick the last hidden vector
output, input_sizes = torch.nn.utils.rnn.pad_packed_sequence(packed_output, batch_first=True)
Summary of Shape Transformations
(batch_size X max_seq_len X embedding_dim) --> Sort by seqlen ---> (batch_size X max_seq_len X embedding_dim)
(batch_size X max_seq_len X embedding_dim) ---> Pack ---> (batch_sum_seq_len X embedding_dim)
(batch_sum_seq_len X embedding_dim) ---> LSTM ---> (batch_sum_seq_len X hidden_dim)
(batch_sum_seq_len X hidden_dim) ---> UnPack ---> (batch_size X max_seq_len X hidden_dim)
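As a quick sanity check of this summary, we can print the shapes of the tensors produced in Steps 4-9 above (for this 3-sentence batch, batch_sum_seq_len = 13 + 6 + 5 = 24):
# Sanity-check the shape transformations using the objects from Steps 4-9
print(embedded_seq_tensor.shape)  # (batch_size x max_seq_len x embedding_dim) = (3, 13, 4)
print(packed_input.data.shape)    # (batch_sum_seq_len x embedding_dim) = (24, 4)
print(packed_output.data.shape)   # (batch_sum_seq_len x hidden_dim) = (24, 5)
print(output.shape)               # (batch_size x max_seq_len x hidden_dim) = (3, 13, 5)
print(ht.shape, ct.shape)         # final hidden/cell states: (num_layers x batch_size x hidden_dim) = (1, 3, 5)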
For training the language model, we'll use the WikiText-2 dataset.
HuggingFace hosts a repository of many datasets, where you can easily find datasets of interest (e.g. the TyDi QA dataset).
datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')
datasets.keys()
dict_keys(['test', 'train', 'validation'])
print(datasets['train'][1])
print(datasets['train'][2])
print(datasets['train'][3])
{'text': ' = Valkyria Chronicles III = \n'} {'text': ''} {'text': ' Senjō no Valkyria 3 : Unrecorded Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . Employing the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " Calamaty Raven " . \n'}
# We'll again use the pretrained BPE embeddings and the corresponding tokenizer.
bpemb_en = BPEmb(lang='en', dim=100, vs=25000)
# Extract the embeddings and add a zero-initialized embedding for our extra [PAD] token
pretrained_embeddings = np.concatenate([bpemb_en.emb.vectors, np.zeros(shape=(1,100))], axis=0)
# Extract the vocab and add an extra [PAD] token
vocabulary = bpemb_en.emb.index2word + ['[PAD]']
print("embeddings' shape:", pretrained_embeddings.shape, "vocabulary size:", len(vocabulary))
embeddings' shape: (25001, 100) vocabulary size: 25001
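To get a feel for the BPE tokenizer, here is what it produces for a short sample sentence (the exact subwords and ids depend on the English vs=25000 model loaded above):
# Quick look at the BPE tokenizer on an arbitrary sample sentence
sample = 'Fruit flies like a banana.'
print(bpemb_en.encode(sample))               # subword strings (with the ▁ word-boundary marker)
print(bpemb_en.encode_ids_with_eos(sample))  # corresponding ids, with the EOS id appended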
def tokenizer(text):
return {'input_ids': bpemb_en.encode_ids_with_eos(text)}
def tokenize_function(examples):
return tokenizer(examples['text'])
def group_texts(examples):
# Concatenate all texts.
concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
total_length = len(concatenated_examples[list(examples.keys())[0]])
# We drop the small remainder; we could instead pad the last chunk if the model supported it.
# You can customize this part to your needs.
total_length = (total_length // block_size) * block_size
# Split into chunks of size block_size.
result = {
k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
for k, t in concatenated_examples.items()
}
return result
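A toy illustration of what group_texts does (hypothetical mini-batch; block_size is temporarily set to 4 for brevity and is set to its real value in the next cell):
block_size = 4
toy_batch = {'input_ids': [[1, 2, 3], [4, 5, 6, 7, 8]]}
print(group_texts(toy_batch))  # {'input_ids': [[1, 2, 3, 4], [5, 6, 7, 8]]}: concatenated, chunked, remainder dropped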
block_size = 128
tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])
print(tokenized_datasets['train'][3]['input_ids'][:10])
lm_datasets = tokenized_datasets.map(group_texts, batched=True, batch_size=1000, num_proc=4,)
[1282, 20611, 467, 852, 2358, 2538, 121, 944, 118, 14849]
def collate_batch_bilstm(dataset) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
"""
Combines multiple data samples into a single batch
:param dataset: A list of examples, each a dict containing the token 'input_ids'
:return: A tuple of tensors (input_ids, input_lengths, targets)
"""
input_ids = [i['input_ids'] for i in dataset]
input_lengths, padded_input = [], []
for sentence in input_ids:
sentence = sentence[:seq_len]
input_lengths.append(len(sentence) - 1)
sentence = sentence + [0] * (seq_len - len(sentence))
padded_input.append(sentence)
input_data = torch.tensor(padded_input)
# we don't use the last position as there isn't anything left for generation
input_ids = input_data[:, :-1]
# the target at each step is to generate the next word from the sequence
# so we shift the token ids with 1 position
targets = input_data[:, 1:].reshape(-1)
return input_ids, torch.tensor(input_lengths), targets
test_dl = torch.utils.data.DataLoader(lm_datasets['test'], batch_size=32, collate_fn=collate_batch_bilstm)
train_dl = torch.utils.data.DataLoader(lm_datasets['train'], batch_size=32, collate_fn=collate_batch_bilstm)
valid_dl = torch.utils.data.DataLoader(lm_datasets['validation'], batch_size=32, collate_fn=collate_batch_bilstm)
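A small check of the collate function on a toy batch (the ids are arbitrary; seq_len is defined here because it is otherwise only set with the other hyperparameters further below):
seq_len = 128  # same value as in the hyperparameter cell below
toy_batch = [{'input_ids': [5, 6, 7, 8]}, {'input_ids': [9, 10]}]
toy_ids, toy_lens, toy_targets = collate_batch_bilstm(toy_batch)
print(toy_ids.shape, toy_lens, toy_targets.shape)  # torch.Size([2, 127]) tensor([3, 1]) torch.Size([254])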
Next we will create an LSTM model. We again extend the PyTorch class torch.nn.Module, implement the __init__ function, and define how tensors are processed in the forward function.
class LSTM_LM(nn.Module):
"""
LSTM Language Model
"""
def __init__(
self,
pretrained_embeddings: torch.tensor,
lstm_dim: int,
dropout_prob: float = 0.0,
lstm_layers: int = 1,
):
"""
Initializer for LSTM Language Model
:param pretrained_embeddings: A tensor containing the pretrained BPE embeddings
:param lstm_dim: The dimensionality of the LSTM network
:param dropout_prob: Dropout probability
:param lstm_layers: The number of stacked LSTM layers
"""
# First thing is to call the superclass initializer
super(LSTM_LM, self).__init__()
# We'll define the network in a ModuleDict, which makes organizing the model a bit nicer
# The components are an embedding layer, an LSTM layer, a dropout layer, and a feed-forward output layer
self.vocab_size = pretrained_embeddings.shape[0]
self.model = nn.ModuleDict({
'embeddings': nn.Embedding.from_pretrained(pretrained_embeddings, padding_idx=pretrained_embeddings.shape[0] - 1),
'lstm': nn.LSTM(
pretrained_embeddings.shape[1],
lstm_dim,
num_layers=lstm_layers,
batch_first=True,
dropout=dropout_prob),
'ff': nn.Linear(lstm_dim, pretrained_embeddings.shape[0]),
'drop': nn.Dropout(dropout_prob)
})
# Initialize the weights of the model
self._init_weights()
def _init_weights(self):
all_params = list(self.model['lstm'].named_parameters()) + \
list(self.model['ff'].named_parameters())
for n, p in all_params:
if 'weight' in n:
nn.init.xavier_normal_(p)
elif 'bias' in n:
nn.init.zeros_(p)
def forward(self, input_ids, input_lens, hidden_states):
"""
Defines how tensors flow through the model
:param input_ids: (b x sl) The IDs into the vocabulary of the input samples
:param input_lens: (b x 1) The length of each instance's text
:param hidden_states: A tuple (h, c) of initial hidden and cell states for the LSTM
:return: (lstm output, updated hidden states)
"""
# Get embeddings (b x sl x edim)
embeds = self.model['drop'](self.model['embeddings'](input_ids))
lstm_in = nn.utils.rnn.pack_padded_sequence(
embeds,
input_lens.to('cpu'),
batch_first=True,
enforce_sorted=False
)
# Pass the packed sequence through the LSTM
# (the LSTM starts from a zero state here; its final states are returned below)
lstm_out, hidden = self.model['lstm'](lstm_in)
# Unpack the packed sequence --> (b x sl x lstm_dim)
lstm_out, _ = nn.utils.rnn.pad_packed_sequence(lstm_out, batch_first=True)
lstm_out = self.model['drop'](lstm_out)
# generate the prediction of each word in the vocabulary being the next
lstm_out = self.model['ff'](lstm_out)
lstm_out = lstm_out.reshape(-1, self.vocab_size)
return lstm_out, hidden
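Before training on real data, a quick smoke test of the class with tiny, hypothetical dimensions (an 11-entry vocabulary with index 10 as padding, 8-dimensional embeddings, a 16-dimensional LSTM):
toy_lm = LSTM_LM(torch.randn(11, 8), lstm_dim=16)
toy_out, toy_hidden = toy_lm(torch.randint(0, 10, (2, 7)), torch.tensor([7, 4]), None)  # None: initial states are not consumed here
print(toy_out.shape)  # torch.Size([14, 11]): one distribution over the vocabulary per (batch x position)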
This is a utility function which will take a model and a validation dataloader and return the current perplexity of the model against that dataset. We can use this to know when to save the model and to perform early stopping if desired.
def evaluate(model: nn.Module, valid_dl: DataLoader):
"""
Evaluates the model on the given dataset
:param model: The model under evaluation
:param valid_dl: A `DataLoader` reading validation data
:return: The perplexity of the model on the dataset
"""
model.eval()
loss_all = []
states = (torch.zeros(lstm_layers, batch_size, lstm_dim).to(device),
torch.zeros(lstm_layers, batch_size, lstm_dim).to(device))
loss_fn = nn.CrossEntropyLoss()
with torch.no_grad():
for batch in tqdm(valid_dl, desc='Evaluation'):
batch = tuple(t.to(device) for t in batch)
input_ids = batch[0]
input_lens = batch[1]
targets = batch[2]
states = detach(states)
logits, states = model(input_ids, input_lens, states)
loss = loss_fn(logits, targets.reshape(-1))
loss_all.append(loss.detach().cpu().numpy())
perplexity = np.exp(sum(loss_all) / (len(loss_all)))
return perplexity
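The perplexity returned above is the exponential of the average cross-entropy loss over batches, PPL = exp((1/B) * Σ_b loss_b); since the batches contain roughly equally many tokens, this closely matches the usual per-token perplexity.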
Here we define the main training loop.
# Truncated backpropagation: detach hidden states so gradients do not flow between batches
def detach(states):
return [state.detach() for state in states]
def train(
model: nn.Module,
train_dl: DataLoader,
valid_dl: DataLoader,
optimizer: torch.optim.Optimizer,
n_epochs: int,
device: torch.device
):
"""
The main training loop which will optimize a given model on a given dataset
:param model: The model being optimized
:param train_dl: The training dataset
:param valid_dl: A validation dataset
:param optimizer: The optimizer used to update the model parameters
:param n_epochs: Number of epochs to train for
:param device: The device to train on
:return: (model, losses) The best model and the losses per iteration
"""
# Keep track of the loss, the best validation perplexity and the corresponding model weights
losses = []
best_perplexity = float('inf')
best_model = model.state_dict()
# Cross-entropy loss over next-token predictions
loss_fn = nn.CrossEntropyLoss()
# Iterate through epochs
for ep in range(n_epochs):
states = (torch.zeros(lstm_layers, batch_size, lstm_dim).to(device),
torch.zeros(lstm_layers, batch_size, lstm_dim).to(device))
loss_epoch = []
#Iterate through each batch in the dataloader
for batch in tqdm(train_dl):
# VERY IMPORTANT: Make sure the model is in training mode, which turns on
# things like dropout (and, for models that use it, batch normalization updates)
model.train()
# VERY IMPORTANT: zero out all of the gradients on each iteration -- PyTorch
# keeps track of these dynamically in its computation graph so you need to explicitly
# zero them out
optimizer.zero_grad()
# Place each tensor on the GPU
batch = tuple(t.to(device) for t in batch)
input_ids = batch[0]
input_lens = batch[1]
targets = batch[2]
# Pass the inputs through the model, get the current loss and logits
states = detach(states)
logits, states = model(input_ids, input_lens, states)
loss = loss_fn(logits, targets.reshape(-1))
losses.append(loss.item())
loss_epoch.append(loss.item())
# Calculate all of the gradients and weight updates for the model
loss.backward()
# Optional: clip gradients, helps with exploding gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
# Finally, update the weights of the model
optimizer.step()
#gc.collect()
# Perform inline evaluation at the end of the epoch
perplexity = evaluate(model, valid_dl)
print(f'Validation perplexity: {perplexity}, train loss: {sum(loss_epoch) / len(loss_epoch)}')
# Keep track of the best model based on validation perplexity
if perplexity < best_perplexity:
best_model = model.state_dict()
best_perplexity = perplexity
model.load_state_dict(best_model)
return model, losses
Now that we have the basic training and evaluation loops defined, we can create the datasets and optimizer and run it!
# Define some hyperparameters
lr = 0.001
n_epochs = 6
lstm_dim = 1024
lstm_layers = 4
batch_size = 128
seq_len = 128
model = LSTM_LM(
torch.FloatTensor(pretrained_embeddings),
lstm_dim=lstm_dim,
dropout_prob=0.1,
lstm_layers=lstm_layers
).to(device)
# Create the optimizer
optimizer = Adam(model.parameters(), lr=lr)
# Train
model, losses = train(model, train_dl, valid_dl, optimizer, n_epochs, device)
Validation perplexity: 1818.0729549106857, train loss: 7.606159637686805
Validation perplexity: 1179.8671494429902, train loss: 7.158430816733181
Validation perplexity: 1226.9833916912687, train loss: 7.133625640295981
Validation perplexity: 1239.3600432924457, train loss: 7.105498870346502
Validation perplexity: 1334.6796099809183, train loss: 7.01695058580631
Validation perplexity: 483.7851902849598, train loss: 6.430242812295191
torch.save(model.state_dict(), 'best_model_wiki')
Next we can plot the loss curve
plt.plot(losses)
evaluate(model, test_dl)
499.80787826019485
sentence_start = 'This was a'
print(sentence_start, end=' ')
new_token = None
states = (torch.zeros(lstm_layers, 1, lstm_dim).to(device),
torch.zeros(lstm_layers, 1, lstm_dim).to(device))
loss_fn = torch.nn.CrossEntropyLoss()
while new_token != '<eos>' and len(sentence_start) < 200:
token_ids = bpemb_en.encode_ids(sentence_start)
batch = collate_batch_bilstm([{'input_ids': token_ids}])
logits, states = model(batch[0].to(device), batch[1].to(device), states)
logits = logits.detach().cpu().numpy()[-1]
new_token_ids = np.argsort(logits, axis=-1)[::-1]
token_id = new_token_ids[0]
word = bpemb_en.decode_ids([int(token_id)])
new_token = word[0]
print(word[0], end=' ')
sentence_start = sentence_start + " " + word[0]
This was a the star star of . the the star star of of the the star star of , the the star star of , the the star star of , the the star star of , the the star star of , the the star star of , the the star
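The greedy loop above quickly falls into a repetition cycle. A minimal sketch of one fix, under the same setup, is to sample the next id from the (temperature-scaled) softmax instead of taking the argmax; temperature is a hypothetical knob here, and the HuggingFace generate() equivalents are shown later in the notebook:
def sample_next_id(logits, temperature=1.0):
    # softmax over temperature-scaled logits, computed in a numerically stable way
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))
    probs = probs / probs.sum()
    return int(np.random.choice(len(probs), p=probs))
# inside the loop, replace the argsort/argmax step with: token_id = sample_next_id(logits)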
def get_sentence_perplexity(sentence, model, vocabulary, seq_len):
states = (torch.zeros(lstm_layers, 1, lstm_dim).to(device),
torch.zeros(lstm_layers, 1, lstm_dim).to(device))
token_ids = [{'input_ids': bpemb_en.encode_ids(sentence)}]
batch = collate_batch_bilstm(token_ids)
loss_fn = torch.nn.CrossEntropyLoss()
logits, states = model(batch[0].to(device), batch[1].to(device), states)
target = batch[2].to(device)[:len(token_ids[0]['input_ids'])-1]
loss = loss_fn(logits, target.reshape(-1))
loss = loss.detach().cpu().numpy()
return np.exp(loss)
get_sentence_perplexity('I want to buy some potatoes from the airport.', model, vocabulary, seq_len)
1096.5176
get_sentence_perplexity('jibberish ? . something something is', model, vocabulary, seq_len)
1489.3047
References:
Problems with using Bi-LSTMs to learn contextual word embeddings:
Apart from improvements to the core Transformer architecture, Transformer models differ in the objectives they are trained to optimise:
One architecture can be used for training with different objectives. The chosen objective makes a model more suitable for particular tasks: autoregressive models are most suitable for generation tasks, autoencoding models for classification tasks, and sequence-to-sequence models for translation, summarisation and other text-to-text tasks.
We can directly use pretrained Transformer Models (search for suitable models on HuggingFace's Models page) and fine-tune them to perform our task at hand.
We can also fine-tune the selected Transformer Model with intermediate in-domain data and then use it for the end task.
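As an illustration of how the training objective maps onto HuggingFace model classes (the checkpoints below are just example members of each family):
from transformers import AutoModelForCausalLM, AutoModelForMaskedLM, AutoModelForSeq2SeqLM
causal_lm = AutoModelForCausalLM.from_pretrained('distilgpt2')                # autoregressive, GPT-style
masked_lm = AutoModelForMaskedLM.from_pretrained('distilbert-base-uncased')   # autoencoding, BERT-style
seq2seq_lm = AutoModelForSeq2SeqLM.from_pretrained('t5-small')                # sequence-to-sequence, T5-style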
import transformers
from transformers import AutoTokenizer
from torch.utils.data import Dataset
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification
from transformers import Trainer, TrainingArguments
import math
model_checkpoint = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)
unsupervised_imdb = load_dataset('imdb', split='unsupervised')
unsupervised_imdb_splits = unsupervised_imdb.train_test_split(test_size=0.01)
print(unsupervised_imdb_splits.keys())
print(unsupervised_imdb_splits['train'][0])
Dataset imdb downloaded and prepared to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1. Subsequent calls will reuse this data. dict_keys(['train', 'test']) {'text': 'In preparation for writing this comment, which I am compelled to write for reasons which will become clear, I read a fair few of the major critics to see, if on this occasion, they had seen and felt what I had.<br /><br />James Bernadelli suggests this film should be known as the "Pursuit of Richness"; another column I have read suggests that the money-will-solve-your-problems resolution is depressing and not the message Hollywood (or art in general) should be trying to put across.<br /><br />I ask myself when the last occasion might have been that any of these critics genuinely had to :spoiler: run, under painful, embarrassing duress, from a cab - because they didn\'t have the money; or when the forces of life seemed to conspire against them so unfairly that they broke down in tears. Spoiler: when your wealthy boss, who you CANNOT disappoint, asks you to borrow the last five dollars in your wallet, and you know that that money is all you have in the world - to feed your family, to pay your gas.<br /><br />You go through times in life when you feel that things couldn\'t possibly get worse - and then they do - and then they do again. Sometimes, (like Chris and son at the beach towards the end of the film) you just want to get away from it all, other times you cry. Sometimes, and this is rare, you laugh - because if you don\'t you\'ll cry. You know that if you let it all become too much, you will sink to the bottom of the sea.<br /><br />Chris Gardner (not the real one perhaps, but the one in this movie) is my personal hero; my shining example; my inspiration - the guy that never allows himself to sink - even when he can feel his shoelaces trailing on the seabed.<br /><br />I realised as I was watching this film that I was watching another me, so I never once stopped rooting for Chris. When he :spoiler: fixes the scanner in the shelter, I felt like I had fixed it.<br /><br />I\'m still working on getting the job that earns me enough to have a less painful life - when Chris finally achieves it, I want it for him so badly that I feel it in the very fibre of my being.<br /><br />I suppose that it doesn\'t matter that much that Chris wants to succeed as much for his son as for himself. In a way, the fact that Chris has a son makes this movie emotionally frightening and if (and only if) our basest fears are being toyed with by the director, it is only in exactly the same way that we sometimes look up in to the sky and say - like Jim Carrey in "Bruce Almighty" - is there anything else you could possibly do to me today? In my life, God doesn\'t appear and explain why he keeps toying with me. The truest belief is in oneself and one must never lose it.<br /><br />I adore this movie and I couldn\'t be more grateful for its existence. It comes at a time when I need it most. Its message: Keep going - never, ever give up.<br /><br />And so I come back to those reviewers. Money certainly doesn\'t mean happiness in this existence, but no money, in this cruel capitalist world, can cost you your life. Do not be judgmental of those who want more, it might just be enough to pay for their son to eat, to keep a roof over his head, to keep their dignity. This film is not about rags-to-riches as some reviewers have said. It\'s not about the American dream either. 
It is about dignity and integrity and how money (or lack thereof) could easily strip you of both.<br /><br />Chris Gardner never loses either and for this, he is an inspiration to us all.<br /><br />See this film TODAY.', 'label': -1}
block_size = 128
def tokenize_function(examples):
return tokenizer(examples["text"])
def group_texts(examples):
# Concatenate all texts.
keys = ['attention_mask', 'input_ids']
concatenated_examples = {k: sum(examples[k], []) for k in keys}
total_length = len(concatenated_examples[list(keys)[0]])
# We drop the small remainder; we could instead pad the last chunk if the model supported it.
# You can customize this part to your needs.
total_length = (total_length // block_size) * block_size
# Split into chunks of size block_size.
result = {
k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
for k, t in concatenated_examples.items()
}
# 'label' is needed because the used dataset is a subclass of ClassificationDataset, which requires it as a field;
# 'labels' is a copy of the input ids, since for causal LM the targets are the inputs (the model shifts them internally)
result["label"] = result["input_ids"].copy()
result["labels"] = result["input_ids"].copy()
return result
unsupervised_imdb_tok = unsupervised_imdb_splits.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])
unsupervised_imdb_splits = unsupervised_imdb_splits.remove_columns(['label'])
unsupervised_imdb_tok_lm = unsupervised_imdb_tok.map(group_texts, batched=True, batch_size=1000, num_proc=4,)
unsupervised_imdb_tok_lm = unsupervised_imdb_tok_lm.remove_columns(['label'])
print(unsupervised_imdb_tok_lm['test'][0])
{'input_ids': [1532, 345, 1053, 1775, 262, 6833, 13637, 327, 26183, 2196, 20495, 18653, 7886, 340, 338, 1327, 284, 1234, 340, 503, 286, 534, 1182, 11, 475, 345, 2192, 815, 466, 780, 428, 530, 318, 6635, 1180, 13, 47419, 774, 468, 587, 9958, 287, 7075, 286, 10319, 12, 448, 9961, 532, 42156, 11, 46835, 290, 477, 12, 744, 22029, 1108, 13, 7477, 340, 338, 11441, 11, 13913, 88, 11, 31826, 1417, 290, 15074, 25292, 357, 20839, 597, 1866, 286, 262, 41162, 1107, 5806, 12718, 12, 3036, 1150, 15232, 33924, 475, 3805, 477, 428, 340, 318, 29381, 13206, 13, 314, 7360, 3521, 470, 11626, 3589, 1497, 422, 262, 3159, 1566, 262, 886, 286, 262, 3807, 13, 1002, 612, 338, 257, 5749, 19370, 345, 460, 1414, 284, 257, 2646, 314, 836, 470], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [1532, 345, 1053, 1775, 262, 6833, 13637, 327, 26183, 2196, 20495, 18653, 7886, 340, 338, 1327, 284, 1234, 340, 503, 286, 534, 1182, 11, 475, 345, 2192, 815, 466, 780, 428, 530, 318, 6635, 1180, 13, 47419, 774, 468, 587, 9958, 287, 7075, 286, 10319, 12, 448, 9961, 532, 42156, 11, 46835, 290, 477, 12, 744, 22029, 1108, 13, 7477, 340, 338, 11441, 11, 13913, 88, 11, 31826, 1417, 290, 15074, 25292, 357, 20839, 597, 1866, 286, 262, 41162, 1107, 5806, 12718, 12, 3036, 1150, 15232, 33924, 475, 3805, 477, 428, 340, 318, 29381, 13206, 13, 314, 7360, 3521, 470, 11626, 3589, 1497, 422, 262, 3159, 1566, 262, 886, 286, 262, 3807, 13, 1002, 612, 338, 257, 5749, 19370, 345, 460, 1414, 284, 257, 2646, 314, 836, 470]}
model_name = model_checkpoint.split("/")[-1]
training_args = TrainingArguments(
"test-clm",
evaluation_strategy = "epoch",
learning_rate=2e-5,
weight_decay=0.01,
num_train_epochs=1,
max_steps=300
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=unsupervised_imdb_tok_lm["train"],
eval_dataset=unsupervised_imdb_tok_lm["test"],
)
max_steps is given, it will override any value given in num_train_epochs
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
***** Running Evaluation ***** Num examples = 1073 Batch size = 8
Perplexity: 70.50
trainer.train()
***** Running training ***** Num examples = 116288 Num Epochs = 1 Instantaneous batch size per device = 8 Total train batch size (w. parallel, distributed & accumulation) = 8 Gradient Accumulation steps = 1 Total optimization steps = 300
Epoch | Training Loss | Validation Loss |
---|---|---|
0 | No log | 4.047292 |
***** Running Evaluation ***** Num examples = 1073 Batch size = 8
Training completed. Do not forget to share your model on huggingface.co/models =)
TrainOutput(global_step=300, training_loss=4.1625390625, metrics={'train_runtime': 68.0522, 'train_samples_per_second': 35.267, 'train_steps_per_second': 4.408, 'total_flos': 78389025177600.0, 'train_loss': 4.1625390625, 'epoch': 0.02})
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
***** Running Evaluation ***** Num examples = 1073 Batch size = 8
Perplexity: 57.24
model.save_pretrained("imdb-gpt2")
Configuration saved in imdb-gpt2/config.json Model weights saved in imdb-gpt2/pytorch_model.bin
# Export contextual representations:
def mean_pooling(model_output, attention_mask):
# Mean Pooling - Take attention mask into account for correct averaging
input_mask_expanded = attention_mask.unsqueeze(-1).expand(model_output.size()).float()
sum_embeddings = torch.sum(model_output * input_mask_expanded, 1)
sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
return sum_embeddings / sum_mask
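A tiny check with toy tensors that the masked mean ignores padded positions:
toy_output = torch.tensor([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])  # (batch=1, tokens=3, dim=2)
toy_mask = torch.tensor([[1, 1, 0]])                                    # last position is padding
print(mean_pooling(toy_output, toy_mask))                               # tensor([[2., 2.]]) -- the 100s are ignored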
encoded = tokenizer(['Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers".'],
return_tensors='pt')
encoded = {k:v.to(device) for k, v in encoded.items()}
model_output = model(**encoded, output_hidden_states=True, return_dict=True)
print(model_output.keys())
print(len(model_output['hidden_states'])) # hidden states from the embedding layer plus each of the 6 transformer layers (7 in total)
print(model_output['hidden_states'][-1].shape) # last layer with contextual representations (batch_size x num words x representation dim)
# Aggregate all the representations into one
mean_model_output = mean_pooling(model_output['hidden_states'][-1], encoded['attention_mask'])
mean_model_output.shape
odict_keys(['logits', 'past_key_values', 'hidden_states']) 7 torch.Size([1, 29, 768])
torch.Size([1, 768])
model_checkpoint = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)
# encode context the generation is conditioned on
input_ids = tokenizer.encode('I was meaning to', return_tensors='pt')
# generate text until the output length (which includes the context length) reaches 50
greedy_output = model.generate(input_ids, max_length=50)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results. Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Output: ---------------------------------------------------------------------------------------------------- I was meaning to be a part of the world, and I was a part of it. I was a part of it. I was a part of it. I was a part of it. I was a part of it. I was a
Standard (greedy) generation was shown above. Other generation approaches, demonstrated below:
- Beam Search
- N-gram penalty
- Sampling (with temperature)
- Top-K Sampling
- Top-p (Nucleus) Sampling
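The approaches listed above (each demonstrated in the following cells) can also be combined in a single generate() call; an illustrative, untuned combination:
input_ids = tokenizer.encode('I was going', return_tensors='pt')
combined_output = model.generate(input_ids, max_length=50, do_sample=True,
                                 top_k=50, top_p=0.95, temperature=0.8,
                                 no_repeat_ngram_size=2)
print(tokenizer.decode(combined_output[0], skip_special_tokens=True))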
input_ids = tokenizer.encode('I was going', return_tensors='pt')
beam_outputs = model.generate(input_ids, max_length=50, num_beams=5, num_return_sequences=5)
print("Beam Search Output:")
for i, beam_output in enumerate(beam_outputs):
print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))
Beam Search Output: 0: I was going to have to do a lot of work to make sure I had enough time to make sure I had enough time to make sure I had enough time to make sure I had enough time to make sure I had enough time to make sure I 1: I was going to have to do a lot of work to make sure I had enough time to make sure I had enough time to do a lot of work to make sure I had enough time to make sure I had enough time to make sure I had 2: I was going to have to do a lot of work to make sure I had enough time to make sure I had enough time to make sure that I had enough time to make sure that I had enough time to make sure that I had enough time to 3: I was going to have to do a lot of work to make sure I had enough time to make sure I had enough time to make sure that I had enough time to make sure I had enough time to make sure I had enough time to make sure 4: I was going to have to do a lot of work to make sure I had enough time to make sure I had enough time to make sure I had enough time to make sure I had enough time to make sure I was enough time to make sure I
input_ids = tokenizer.encode('I was going', return_tensors='pt')
output = model.generate(input_ids, max_length=50, num_beams=5)
print("Beam Search Output:")
print(tokenizer.decode(output[0], skip_special_tokens=True))
output = model.generate(input_ids, max_length=50, num_beams=5, no_repeat_ngram_size=2)
print("N-Gram Penalty on Beam Search Output:")
print(tokenizer.decode(output[0], skip_special_tokens=True))
Beam Search Output: I was going to have to do a lot of work to make sure I had enough time to make sure I had enough time to make sure I had enough time to make sure I had enough time to make sure I had enough time to make sure I N-Gram Penalty on Beam Search Output: I was going to have to do a lot of work to make sure I had the right amount of time to work on this project.” “I’m not sure if I can do that, but I think it�
# Higher temperature -- more random samples, lower -- more deterministic
input_ids = tokenizer.encode('I was going', return_tensors='pt')
sample_output = model.generate(input_ids, max_length=50, do_sample=True)
print("Sampling Output:")
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
sample_output = model.generate(input_ids, max_length=50, do_sample=True, temperature=10.)
print("Sampling Output:")
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
sample_output = model.generate(input_ids, max_length=50, do_sample=True, temperature=0.5)
print("Sampling Output:")
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
Sampling Output: I was going to lose something on that night. After I got hurt, I kept being hurt."
Sampling Output: I was going about every hour from morning with three very experienced athletes from Florida right where they would meet face for face – in-line football coaches everywhere and everything is on line like some team." So with your support now coming from Jacksonville the guys have Sampling Output: I was going to have to go to the police station. He was going to have to go to the police station. He was going to have to go to the police station. He was going to have to go to the police station. He was
input_ids = tokenizer.encode('I was going', return_tensors='pt')
sample_output = model.generate(input_ids, max_length=50, do_sample=True, top_k=50)
print("Top-K Sampling:")
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
sample_output = model.generate(input_ids, max_length=50, do_sample=True, top_k=10)
print("Top-K Sampling:")
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
sample_output = model.generate(input_ids, max_length=50, do_sample=True, top_k=100)
print("Top-K Sampling:")
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
Tok-K Sampling: I was going to go to school at 12 and go to work at 8. It was also good for me personally. "We are really hoping we will have something to do with our new location to raise awareness to where this is happening,"
Tok-K Sampling: I was going to be doing a show about how Ire a good guy, I done have the right to do thate be the showe be the talk showe be the talk show� Tok-K Sampling: I was going to live here," she said. "So she wasn't ready for New York City?" her husband asked. "I was sleeping with nothing and her husband wanted me and I could not even move out of fear
input_ids = tokenizer.encode('I was going', return_tensors='pt')
sample_output = model.generate(input_ids, max_length=50, do_sample=True, top_k=0, top_p=0.2)
print("Top-p (Nucleus) Sampling:")
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
sample_output = model.generate(input_ids, max_length=50, do_sample=True, top_k=0, top_p=0.92)
print("Top-p (Nucleus) Sampling:")
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
sample_output = model.generate(input_ids, max_length=50, do_sample=True, top_k=0, top_p=0.99)
print("Top-p (Nucleus) Sampling:")
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
Tok-p Nucleus Sampling: I was going to have to go to the hospital, and I'm going to have to go to the hospital.
Tok-p Nucleus Sampling: I was going to need to beat me and because that fight would probably end without me. In his bout with Steve Luger, Sergey Kovalev gave everything that made him the No. 1 heavyweight in the country and thought he would Tok-p Nucleus Sampling: I was going to keep running, survive straight for half a year.› As evil as I was, they seemed to just sort to just return to life, escape from their prison captors, flee the country, and then back