The following code downloads the Tiny Shakespeare dataset with wget and saves it as a file called input.txt, then reads the file into a Python string. The wget status output below lets you know when the dataset has been saved.
# Download the tiny shakespeare dataset
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
# Read the file
with open('input.txt', 'r', encoding='utf-8') as f:
text = f.read()
--2023-02-18 07:16:23-- https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ... Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 1115394 (1.1M) [text/plain] Saving to: ‘input.txt.2’ input.txt.2 100%[===================>] 1.06M --.-KB/s in 0.05s 2023-02-18 07:16:23 (22.5 MB/s) - ‘input.txt.2’ saved [1115394/1115394]
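As an aside (a sketch of my own, not part of the original notebook): if you would rather avoid the shell escape, for example when running outside a notebook, the same file can be fetched in pure Python with the requests library, assuming it is installed.
import requests
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
response = requests.get(url)
response.raise_for_status()  # fail loudly on HTTP errors
# Save the dataset under the same name the rest of the notebook expects
with open("input.txt", "w", encoding="utf-8") as f:
    f.write(response.text)
print(f"Saved {len(response.text)} characters to input.txt")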
This code prints the length of the text string, i.e. the contents of the input.txt file. The length is computed in characters with the built-in len function, and the resulting value is printed to the console.
print("Length of the dataset in chracters: ", len(text))
Length of the dataset in chracters: 1115394
Let's take a first look by printing the first 500 characters of the text variable, which was previously loaded from the file input.txt using the with open statement. The [:500] syntax slices the first 500 characters of the text string.
print(text[:500])
First Citizen: Before we proceed any further, hear me speak. All: Speak, speak. First Citizen: You are all resolved rather to die than to famish? All: Resolved. resolved. First Citizen: First, you know Caius Marcius is chief enemy to the people. All: We know't, we know't. First Citizen: Let us kill him, and we'll have corn at our own price. Is't a verdict? All: No more talking on't; let it be done: away, away! Second Citizen: One word, good citizens. First Citizen: We are accounted poor
Next, we build the set of unique characters in the text and sort it. The sorted characters are joined into a single string and printed, and the number of unique characters is reported with the message "Unique characters:".
The set comprehension {char for char in text} extracts the unique characters from the text variable that was read from the file; because a set keeps only one occurrence of each element, it contains exactly the distinct characters of the text.
The len function then gives the size of this set, i.e. the count of unique characters. This value is stored in the vocab_size variable and printed with a formatted message.
# Check out the unique characters that occur in this text dataset
chars = sorted({char for char in text})
vocab_size = len(chars)
print(''.join(chars))
print(f"Unique characters: {vocab_size}")
!$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz Unique characters: 65
As a next step, we create two dictionaries, char_to_int and int_to_char, that map characters in the text dataset to unique integers and back. Then two functions, encode and decode, are defined to convert a string into a list of integers and a list of integers back into a string. The code tests the two functions by encoding the string "hello there" and decoding the result to make sure the round trip works as expected: the output should be the encoded list of integers followed by the decoded string "hello there".
# Create mappings from characters to integers and vice versa
char_to_int = {char: index for index, char in enumerate(chars)}
int_to_char = {index: char for index, char in enumerate(chars)}
# Define encoding and decoding functions
def encode(text):
return [char_to_int[char] for char in text]
def decode(encoded):
return ''.join([int_to_char[index] for index in encoded])
# Test the encoding and decoding functions
encoded = encode("hello there")
print(encoded)
print(decode(encoded))
[46, 43, 50, 50, 53, 1, 58, 46, 43, 56, 43] hello there
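Beyond the single test string, a quick round-trip check over a slice of the corpus (my addition, not in the original notebook) gives extra confidence that every character survives encoding and decoding:
# Round-trip check: decoding the encoding should reproduce the text exactly
sample = text[:10000]
assert decode(encode(sample)) == sample
print("Round-trip OK on the first 10,000 characters")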
Now that encoding and decoding are known to work, we import the PyTorch library and create a 1-dimensional tensor (a torch.LongTensor) called data that holds the encoding of the entire text dataset.
The encoding is done with the encode function defined earlier, which takes a string and returns a list of integers. The resulting list is passed to the torch.tensor function, which creates a tensor from the input data; the dtype argument is set to torch.long, which stores the values as 64-bit integers.
Finally, the shape and data type of the tensor are printed, as well as the first 500 characters in their encoded form. The output of the print statements gives us some basic information about the tensor.
# Encode the entire text dataset and save it into a torch.Tensor
import torch # PyTorch: https://pytorch.org
data = torch.tensor(encode(text), dtype=torch.long)
print(f"Shape of data tensor: {data.shape}")
print(f"Data type of data tensor: {data.dtype}")
print(f"First 500 characters in encoded form: {data[:500]}")
Shape of data tensor: torch.Size([1115394]) Data type of data tensor: torch.int64 First 500 characters in encoded form: tensor([18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 14, 43, 44, 53, 56, 43, 1, 61, 43, 1, 54, 56, 53, 41, 43, 43, 42, 1, 39, 52, 63, 1, 44, 59, 56, 58, 46, 43, 56, 6, 1, 46, 43, 39, 56, 1, 51, 43, 1, 57, 54, 43, 39, 49, 8, 0, 0, 13, 50, 50, 10, 0, 31, 54, 43, 39, 49, 6, 1, 57, 54, 43, 39, 49, 8, 0, 0, 18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 37, 53, 59, 1, 39, 56, 43, 1, 39, 50, 50, 1, 56, 43, 57, 53, 50, 60, 43, 42, 1, 56, 39, 58, 46, 43, 56, 1, 58, 53, 1, 42, 47, 43, 1, 58, 46, 39, 52, 1, 58, 53, 1, 44, 39, 51, 47, 57, 46, 12, 0, 0, 13, 50, 50, 10, 0, 30, 43, 57, 53, 50, 60, 43, 42, 8, 1, 56, 43, 57, 53, 50, 60, 43, 42, 8, 0, 0, 18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 18, 47, 56, 57, 58, 6, 1, 63, 53, 59, 1, 49, 52, 53, 61, 1, 15, 39, 47, 59, 57, 1, 25, 39, 56, 41, 47, 59, 57, 1, 47, 57, 1, 41, 46, 47, 43, 44, 1, 43, 52, 43, 51, 63, 1, 58, 53, 1, 58, 46, 43, 1, 54, 43, 53, 54, 50, 43, 8, 0, 0, 13, 50, 50, 10, 0, 35, 43, 1, 49, 52, 53, 61, 5, 58, 6, 1, 61, 43, 1, 49, 52, 53, 61, 5, 58, 8, 0, 0, 18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 24, 43, 58, 1, 59, 57, 1, 49, 47, 50, 50, 1, 46, 47, 51, 6, 1, 39, 52, 42, 1, 61, 43, 5, 50, 50, 1, 46, 39, 60, 43, 1, 41, 53, 56, 52, 1, 39, 58, 1, 53, 59, 56, 1, 53, 61, 52, 1, 54, 56, 47, 41, 43, 8, 0, 21, 57, 5, 58, 1, 39, 1, 60, 43, 56, 42, 47, 41, 58, 12, 0, 0, 13, 50, 50, 10, 0, 26, 53, 1, 51, 53, 56, 43, 1, 58, 39, 50, 49, 47, 52, 45, 1, 53, 52, 5, 58, 11, 1, 50, 43, 58, 1, 47, 58, 1, 40, 43, 1, 42, 53, 52, 43, 10, 1, 39, 61, 39, 63, 6, 1, 39, 61, 39, 63, 2, 0, 0, 31, 43, 41, 53, 52, 42, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 27, 52, 43, 1, 61, 53, 56, 42, 6, 1, 45, 53, 53, 42, 1, 41, 47, 58, 47, 64, 43, 52, 57, 8, 0, 0, 18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 35, 43, 1, 39, 56, 43, 1, 39, 41, 41, 53, 59, 52, 58, 43, 42, 1, 54, 53, 53, 56])
We will split the data tensor into training and validation sets using a specified train ratio (0.8 in this case). The training length is computed as the length of the data tensor times the train ratio; the first part of the data tensor up to that length becomes the training data, and the remainder becomes the validation data. The lengths of the two splits are then printed.
# Split the data into training and validation sets
train_ratio = 0.8
train_data_length = int(len(data) * train_ratio)
train_data = data[:train_data_length]
val_data = data[train_data_length:]
print(f"Length of training data: {len(train_data)}")
print(f"Length of validation data: {len(val_data)}")
Length of training data: 892315 Length of validation data: 223079
Let's define a block size of 8 characters and print the first 8 + 1 = 9 characters of the training data. The code converts the training-data slice to a Python list with the tolist() method and passes it to the decode function, which maps each integer index back to a character via the int_to_char mapping. The print statement outputs the decoded characters, allowing us to see a portion of the original text.
block_size = 8
print("First", block_size + 1, "characters in training data:")
print(decode(train_data[:block_size+1].tolist()))
First 9 characters in training data: First Cit
Also, let's build two tensors x and y from the training data. x contains the first block_size characters, and y contains the same span shifted one position to the right (characters 2 through block_size + 1), so that y[t] is the character that follows the context x[:t+1].
The code then loops over t from 0 to block_size - 1 and uses t as the index to extract the context and target. The context is x[:t+1], a slice of x containing all elements up to and including the t-th element; the target is y[t], the t-th element of y.
The code prints the context and target (still as encoded integers) for each value of t, with a message that describes what each one represents.
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
context = x[:t+1]
target = y[t]
print(f"when input is {context} the target: {target}")
when input is tensor([18]) the target: 47 when input is tensor([18, 47]) the target: 56 when input is tensor([18, 47, 56]) the target: 57 when input is tensor([18, 47, 56, 57]) the target: 58 when input is tensor([18, 47, 56, 57, 58]) the target: 1 when input is tensor([18, 47, 56, 57, 58, 1]) the target: 15 when input is tensor([18, 47, 56, 57, 58, 1, 15]) the target: 47 when input is tensor([18, 47, 56, 57, 58, 1, 15, 47]) the target: 58
The code generates a random batch of inputs (x) and targets (y) for either the training data or validation data. The data is split based on the input argument passed to the function get_batch()
. The block_size
specifies the maximum context length for predictions. The batch_size
specifies the number of independent sequences that will be processed in parallel.
The torch.randint
function is used to generate a tensor of shape (batch_size, ) containing random integers between 0 and len(data) - block_size. These integers represent the starting indices of each sequence in the batch. The torch.stack
function is used to create tensors x
and y
from the corresponding slices of the data tensor. x
contains the blocks of size block_size
from data
, and y
contains the blocks of size block_size
shifted by one from data
.
The generated inputs and targets are printed out, along with their shapes. The code then loops over the generated batch and for each sequence in the batch, it prints out the context and the corresponding target. The context
is taken as the slice of the input tensor xb
from the beginning of the sequence to the current time step, and the target is taken as the corresponding element in the target tensor yb
. The tolist
method is used to convert the tensor to a Python list. The target is then printed out.
torch.manual_seed(1337)
batch_size = 4 # The number of independent sequences that will be processed in parallel
block_size = 8 # Max context length for predictions
def get_batch(split):
data = train_data if split == 'train' else val_data
ix = torch.randint(0, len(data) - block_size, (batch_size,))
x = torch.stack([data[i:i+block_size] for i in ix])
y = torch.stack([data[i+1:i+block_size+1] for i in ix])
return x, y
xb, yb = get_batch('train')
print('Inputs:')
print(xb.shape)
print(xb)
print('Targets:')
print(yb.shape)
print(yb)
print('----')
for b in range(batch_size):
for t in range(block_size):
context = xb[b, :t+1]
target = yb[b, t]
print(f"When input is {context.tolist()}, the target is: {target}")
Inputs: torch.Size([4, 8]) tensor([[58, 63, 8, 0, 0, 19, 24, 27], [39, 59, 45, 46, 58, 1, 46, 43], [49, 43, 57, 1, 53, 50, 42, 1], [52, 41, 47, 43, 52, 58, 1, 56]]) Targets: torch.Size([4, 8]) tensor([[63, 8, 0, 0, 19, 24, 27, 33], [59, 45, 46, 58, 1, 46, 43, 1], [43, 57, 1, 53, 50, 42, 1, 46], [41, 47, 43, 52, 58, 1, 56, 47]]) ---- When input is [58], the target is: 63 When input is [58, 63], the target is: 8 When input is [58, 63, 8], the target is: 0 When input is [58, 63, 8, 0], the target is: 0 When input is [58, 63, 8, 0, 0], the target is: 19 When input is [58, 63, 8, 0, 0, 19], the target is: 24 When input is [58, 63, 8, 0, 0, 19, 24], the target is: 27 When input is [58, 63, 8, 0, 0, 19, 24, 27], the target is: 33 When input is [39], the target is: 59 When input is [39, 59], the target is: 45 When input is [39, 59, 45], the target is: 46 When input is [39, 59, 45, 46], the target is: 58 When input is [39, 59, 45, 46, 58], the target is: 1 When input is [39, 59, 45, 46, 58, 1], the target is: 46 When input is [39, 59, 45, 46, 58, 1, 46], the target is: 43 When input is [39, 59, 45, 46, 58, 1, 46, 43], the target is: 1 When input is [49], the target is: 43 When input is [49, 43], the target is: 57 When input is [49, 43, 57], the target is: 1 When input is [49, 43, 57, 1], the target is: 53 When input is [49, 43, 57, 1, 53], the target is: 50 When input is [49, 43, 57, 1, 53, 50], the target is: 42 When input is [49, 43, 57, 1, 53, 50, 42], the target is: 1 When input is [49, 43, 57, 1, 53, 50, 42, 1], the target is: 46 When input is [52], the target is: 41 When input is [52, 41], the target is: 47 When input is [52, 41, 47], the target is: 43 When input is [52, 41, 47, 43], the target is: 52 When input is [52, 41, 47, 43, 52], the target is: 58 When input is [52, 41, 47, 43, 52, 58], the target is: 1 When input is [52, 41, 47, 43, 52, 58, 1], the target is: 56 When input is [52, 41, 47, 43, 52, 58, 1, 56], the target is: 47
print(xb) # My input to the transformer
tensor([[58, 63, 8, 0, 0, 19, 24, 27], [39, 59, 45, 46, 58, 1, 46, 43], [49, 43, 57, 1, 53, 50, 42, 1], [52, 41, 47, 43, 52, 58, 1, 56]])
Now, we will introduce a Bigram Language Model using PyTorch. A bigram language model predicts the next token based only on the current token; here the tokens are single characters. The model is implemented as a custom PyTorch nn.Module named BigramLanguageModel.
The model has two main components:
- A token embedding table, an nn.Embedding with vocab_size output features, which maps each token in the input sequence to a dense representation.
- A forward method, which takes an input sequence idx of token indices (integers) and optional targets (also an integer sequence), and computes the logits for the next-token predictions, plus the cross-entropy loss if targets are provided. The logits are obtained by passing the input sequence through the token embedding table.
The model also has a generate method, which produces a sequence of new tokens given an initial context idx and a number of tokens max_new_tokens to generate. At each step it looks up the logits for the last token in the current context, applies softmax to obtain probabilities, and samples from that distribution to obtain the next token index. The sampled index is concatenated to the input sequence to form the context for the next prediction.
In the code, an instance of the BigramLanguageModel
is created with a specified vocab_size
and is used to compute the logits and loss for an input sequence xb
and target sequence yb
, and to generate a new sequence of tokens with an initial context of a single token with an index of 0 and 100 new tokens to generate.
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)
class BigramLanguageModel(nn.Module):
def __init__(self, vocab_size):
super().__init__()
self.embedding = nn.Embedding(vocab_size, vocab_size)
def forward(self, idx, targets=None):
logits = self.embedding(idx)
B, T, C = logits.shape
logits = logits.reshape(B * T, C)
if targets is None:
return logits, None
targets = targets.reshape(-1)
loss = F.cross_entropy(logits, targets)
return logits, loss
def generate(self, idx, max_new_tokens):
for i in range(max_new_tokens):
logits = self.embedding(idx)
logits = logits[:, -1, :]
probs = F.softmax(logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
idx = torch.cat((idx, next_token), dim=1)
return idx
m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss)
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))
torch.Size([32, 65]) tensor(5.0493, grad_fn=<NllLossBackward0>) SKIcLT;AcELMoTbvZv C?nq-QE33:CJqkOKH-q;:la!oiywkHjgChzbQ?u!3bLIgwevmyFJGUGp wnYWmnxKWWev-tDqXErVKLgJ
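As a quick sanity check (my calculation, not part of the original): with 65 characters in the vocabulary, a model that predicts uniformly at random should have a cross-entropy of -ln(1/65) ≈ 4.17, so the printed loss of about 5.05 for the untrained model is of the expected order and should fall quickly once training starts.
import math
# Expected loss for a uniform prediction over the 65-character vocabulary
expected_loss = -math.log(1.0 / 65)
print(f"Expected loss at initialization: {expected_loss:.4f}")  # ~4.1744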
Here, we create an instance of the AdamW
optimizer in PyTorch. The optimizer will adjust the parameters of the model m
to minimize the loss during training.
The AdamW
optimizer is a variant of the popular Adam
optimizer that incorporates weight decay, which helps to regularize the model to prevent overfitting. The optimizer takes as input the parameters of the model m
and sets their learning rate to 0.001.
# Create a PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=0.001)
Next, we will train the model using early stopping and TensorBoard logging.
from torch.utils.tensorboard import SummaryWriter
batch_size = 32
max_steps = 100 # Increase number of steps for better results...
early_stop_steps = 10
# Create a TensorBoard writer
writer = SummaryWriter()
# Keep track of the best validation loss
best_val_loss = float("inf")
# Early stopping counter
early_stop_counter = 0
for steps in range(max_steps):
# Obtain a batch of data as sample
xb, yb = get_batch('train')
# Loss evaluation
logits, loss = m(xb, yb)
optimizer.zero_grad(set_to_none=True)
loss.backward()
optimizer.step()
if steps % 10 == 0:
print("Step {}: Train Loss: {}".format(steps, loss.item()))
writer.add_scalar('train_loss', loss.item(), steps)
# Evaluate the model on the validation set
with torch.no_grad():
xb, yb = get_batch('val')
logits, val_loss = m(xb, yb)
writer.add_scalar('val_loss', val_loss.item(), steps)
if val_loss.item() < best_val_loss:
best_val_loss = val_loss.item()
early_stop_counter = 0
else:
early_stop_counter += 1
if early_stop_counter >= early_stop_steps:
print("Early stopping at step {}".format(steps))
break
writer.close()
print("Final Train Loss: {}".format(loss.item()))
Step 0: Train Loss: 4.679175853729248 Step 10: Train Loss: 4.761965751647949 Early stopping at step 10 Final Train Loss: 4.761965751647949
Finally, we generate text with the language model m and print the decoded result. The call m.generate takes the input idx, an array of shape (1, 1) containing a single starting token index (0, which corresponds to the newline character in this vocabulary). The argument max_new_tokens=500 specifies how many new tokens to generate. The generated sequence is converted to a list of integers with [0].tolist(), and the decode function turns the encoded integers back into a human-readable string.
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=500)[0].tolist()))
dzX:COtRIDiOskzytNCxrfSjum;auw$CA'Oc!PO;DT:CSGKzzYC33,i!!:'ruHAUQJ;ZJi?NpHFP,h?jBjagJ.xYDHM,3-gnQQbjmJGGJEHxr'BirVMGXrvU,ZL$HA-!uD.vcWRlgH-s&LC,e?OJMo?yLTQb?qx;xzW f$-FZv$;.igBjU'AXgF-bGN&ZmZb&yFCaPZSJA'rA'KHx?w$YHAGjHRURSPwHo-W:MlapJ. jxLvUAQmZBL&$zdbA!BCPjxTfiKmJMQTFafjxI!udZV,SPGGSPlyYWNT;a;Q-BGrIu$Ca'PTR C&,SywwcPyFWgC3ryxfNd?EX&jF.WCq;3fq-ofcla!--UG&SBoiw'rt,rcIcmYcLC?OLfpOpX-ZK;vm,lDW?nZTbmJJrYdYZTH!abIJ&sXcoUEXrUZVm;K:vi-vTJaMPiH-UnZ??yFk$cOKBjThuq.ywEb$zLTQgUZayZ!pzd,RL&evVjZUAElx;pgOYPh
The following creates a lower triangular matrix of ones (a) and normalizes it along the rows, so each row sums to 1. It then generates a matrix of random values from a normal distribution (b) and computes the weighted aggregation of the rows of b by multiplying it with a. The results are then printed.
# Set random seed for reproducibility
torch.manual_seed(42)
# Create a lower triangular matrix of ones
a = torch.tril(torch.ones(3, 3))
# Normalize the lower triangular matrix along the rows
a = a / torch.sum(a, 1, keepdim=True)
# Generate random weights from a normal distribution
b = torch.randn(3, 2).float()
# Compute the weighted aggregation using matrix multiplication
c = a @ b
# Print results
print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)
a= tensor([[1.0000, 0.0000, 0.0000], [0.5000, 0.5000, 0.0000], [0.3333, 0.3333, 0.3333]]) -- b= tensor([[ 0.3367, 0.1288], [ 0.2345, 0.2303], [-1.1229, -0.1863]]) -- c= tensor([[ 0.3367, 0.1288], [ 0.2856, 0.1796], [-0.1839, 0.0576]])
Here we set the random seed for reproducibility, generate a tensor of random values (x) with shape (B, T, C), and print the shape of the tensor. In the next cell, the cumulative sum of x along the time dimension will be divided by the number of time steps seen so far to obtain a running average.
# Set random seed for reproducibility
torch.manual_seed(1337)
# Define the batch size (B), time steps (T), and number of channels (C)
B, T, C = 4, 8, 2
# Generate a tensor with random values, with shape (B, T, C)
x = torch.randn(B, T, C)
# Print the shape of the tensor
print("Shape of the tensor:", x.shape)
Shape of the tensor: torch.Size([4, 8, 2])
The code calculates a moving average of the 3D tensor x with dimensions [B, T, C] along the time axis (T). xbow is initialized as a 3D tensor of zeros with the same dimensions as x.
The calculation divides the cumulative sum of the elements of x along the T axis by a sequence of integers from 1 to T. The cumsum() method computes the cumulative sum along the T axis and returns a tensor with the same dimensions as x; the sequence of integers is created with torch.arange(), which returns a 1D tensor of consecutive integers.
The result is a 3D tensor of the same shape as x, where xbow[b,t] contains the average of x[b,:t+1].
# We want x[b,t] = mean_{i<=t} x[b,i]
xbow = torch.zeros((B,T,C))
# for b in range(B):
# for t in range(T):
# xprev = x[b,:t+1] # (t,C)
# xbow[b,t] = torch.mean(xprev, 0)
xbow = x.cumsum(1) / torch.arange(1, T+1, dtype=torch.float32).unsqueeze(0).unsqueeze(2)
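The commented-out double loop above is the reference implementation; as a sanity check (mine, not from the original notebook), we can run both and confirm that the vectorized cumulative-sum version produces the same running averages:
# Reference implementation with explicit loops, for comparison
xbow_loop = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        xbow_loop[b, t] = x[b, :t+1].mean(0)
print(torch.allclose(xbow, xbow_loop, atol=1e-6))  # expected: True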
This code builds a tensor wei as a lower triangular matrix of ones and normalizes it along the rows. It then multiplies wei with the existing tensor x of shape (B, T, C) using the @ operator. The resulting tensor xbow2 has the same shape as x, and each element xbow2[b, t] is the weighted average of the elements of x[b] up to and including x[b, t], where the weights come from the t-th row of wei. Finally, torch.allclose() checks that xbow and xbow2 are element-wise close.
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x # (B, T, T) @ (B, T, C) ----> (B, T, C)
torch.allclose(xbow, xbow2)
True
The following code computes the same weights for each time step by masking a lower triangular matrix and applying softmax, so that the weights in each row sum to 1. These weights are then used to take a weighted average of the values in x for each time step.
# Create a mask for the lower triangular matrix of shape (T, T)
mask = torch.tril(torch.ones(T, T)).bool()
# Set the values outside the lower triangle to -inf to ensure zero weights
mask = mask.float().masked_fill(~mask, float('-inf'))
# Apply softmax to the masked values along the last dimension to obtain the weights
weights = F.softmax(mask, dim=-1)
# Obtain the weighted average of x using the computed weights
xbow3 = torch.matmul(weights, x)
# Check if xbow and xbow3 are equal
torch.allclose(xbow, xbow3)
True
The code defines a single head of self-attention using linear transformations for the key, query, and value inputs. The input tensor x has shape (B, T, C), representing a batch of sequences with T timesteps and C features. The key, query, and value are each produced by a linear layer with C input channels and head_size output channels.
The dot product of the query and key gives an attention weight matrix of shape (B, T, T) (the 1/sqrt(head_size) scaling is discussed in the notes below). A lower triangular mask is applied to the attention weights so that information flows only from the past to the present, and softmax is applied along the last dimension to obtain a probability distribution. Finally, the output is obtained by multiplying this distribution with the value tensor.
The output has shape (B, T, head_size).
# Set the random seed to a fixed value for reproducibility
torch.manual_seed(1337)
# Define the batch size (B), number of time steps (T), and number of channels (C)
B,T,C = 4,8,32
# Generate a tensor with random values of shape (B, T, C)
x = torch.randn(B,T,C)
# Define the size of each head in the attention mechanism
head_size = 16
# Define the linear layers for the keys, queries, and values
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
# Apply the linear layer for the keys to the input tensor to get k of shape (B, T, head_size)
k = key(x)
# Apply the linear layer for the queries to the input tensor to get q of shape (B, T, head_size)
q = query(x)
# Calculate the attention weights by multiplying q and k transposed together. This produces a tensor wei of shape (B, T, T).
wei = q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)
# Create a lower triangular matrix with ones and zero out the upper triangular part
tril = torch.tril(torch.ones(T, T))
# Replace the 0 elements in the triangular part of the attention weights with negative infinity
wei = wei.masked_fill(tril == 0, float('-inf'))
# Apply the softmax function along the last dimension of wei to get the final attention weights
wei = F.softmax(wei, dim=-1)
# Apply the linear layer for the values to the input tensor to get v of shape (B, T, head_size)
v = value(x)
# Compute the weighted sum of the values with the attention weights to get the output tensor of shape (B, T, head_size)
out = wei @ v
# Print the shape of the output tensor
print(out.shape)
torch.Size([4, 8, 16])
print(wei[0])
tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000], [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000], [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000], [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000], [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000], [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000], [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000], [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]], grad_fn=<SelectBackward0>)
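As a cross-check (my addition, assuming PyTorch 2.0 or newer), the fused primitive F.scaled_dot_product_attention computes the same causal attention; F is torch.nn.functional, already imported above, and q, k, v, tril, and head_size come from the cell above. Note that the fused call applies the 1/sqrt(head_size) scaling that the manual cell omits, so we rebuild a scaled manual version for the comparison.
# Scaled manual attention, for comparison with the fused kernel
wei_scaled = (q @ k.transpose(-2, -1)) * head_size**-0.5      # (B, T, T)
wei_scaled = wei_scaled.masked_fill(tril == 0, float('-inf'))
wei_scaled = F.softmax(wei_scaled, dim=-1)
out_manual = wei_scaled @ v                                    # (B, T, head_size)
# Fused causal attention (PyTorch >= 2.0)
out_fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(torch.allclose(out_manual, out_fused, atol=1e-5))        # expected: True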
Notes:
The code below computes the scaled dot-product attention between two tensors k and q, and applies the softmax function to the result to get the attention weights.
- We take the dot product of q and k and then multiply the product by the scale factor 1/sqrt(head_size).
- We convert head_size to float32 using torch.float32 so that torch.rsqrt can be applied to it.
- We print the variance of k, q, and wei to check that they have the expected values (close to 1 after scaling).
- We use the softmax() function from PyTorch instead of performing the softmax operation manually.
- We also apply softmax() to the logits multiplied by a larger factor (8 in this case), which produces a much more peaked attention distribution; this is exactly the saturation that the 1/sqrt(head_size) scaling is meant to avoid.
k = torch.randn(B, T, head_size)
q = torch.randn(B, T, head_size)
scale = torch.rsqrt(torch.tensor(head_size, dtype=torch.float32))
wei = q @ k.transpose(-2, -1) * scale
print("variance of k:", k.var())
print("variance of q:", q.var())
print("variance of wei:", wei.var())
# softmax over last dimension
softmaxed_wei = torch.softmax(wei, dim=-1)
# A sharper softmax: multiplying the logits by a large factor before softmax makes the distribution much more peaked
sm = torch.nn.Softmax(dim=-1)
scaled_sm_wei = sm(wei * 8)
variance of k: tensor(0.9006) variance of q: tensor(1.0037) variance of wei: tensor(0.9957)
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)
tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])*8, dim=-1)
tensor([0.0326, 0.0030, 0.1615, 0.0030, 0.8000])
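To make the point about the scale factor concrete (my own check, not in the original notebook): the dot product of two head_size-dimensional vectors with unit-variance entries has variance of roughly head_size, which is exactly what pushes the softmax toward one-hot outputs unless we scale it back down.
# Without the 1/sqrt(head_size) factor the attention logits have variance
# roughly equal to head_size (16 here); with it, the variance is back near 1
wei_unscaled = q @ k.transpose(-2, -1)
print("variance without scaling:", wei_unscaled.var())            # ~16
print("variance with scaling:   ", (wei_unscaled * scale).var())  # ~1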
The code defines a LayerNorm1d
class, which is used to implement Layer Normalization for 1D input tensors. The __init__
method sets the values of the eps
parameter and the learnable parameters gamma
and beta
. The __call__
method takes an input tensor x
, computes the layer normalization operation, and returns the normalized output. The parameters
method returns the learnable parameters of the layer.
The code initializes an instance of the LayerNorm1d
class with 100 dimensions and applies it to a 32 x 100 input tensor. The output tensor has the same shape as the input tensor.
class LayerNorm1d:
def __init__(self, dim, eps=1e-5, momentum=0.1):
self.eps = eps
self.gamma = torch.ones(dim)
self.beta = torch.zeros(dim)
def __call__(self, x):
# Calculate the forward pass
xmean = x.mean(1, keepdim=True) # Batch mean
xvar = x.var(1, keepdim=True) # Batch variance
xhat = (x - xmean) / torch.sqrt(xvar + self.eps) # Normalize to unit variance
self.out = self.gamma * xhat + self.beta
return self.out
def parameters(self):
return [self.gamma, self.beta]
# Instantiate the module
torch.manual_seed(1337)
module = LayerNorm1d(100)
# Generate a batch of size 32 with 100-dimensional vectors
x = torch.randn(32, 100)
# Apply the layer normalization to the batch
x = module(x)
# Print the shape of the output tensor
print(f"Output shape: {x.shape}")
Output shape: torch.Size([32, 100])
# Compute the mean and standard deviation of the first feature across all inputs in the batch
batch_feature_mean, batch_feature_std = x[:, 0].mean(), x[:, 0].std()
print(f"Batch feature mean: {batch_feature_mean}, Batch feature std: {batch_feature_std}")
# Compute the mean and standard deviation of all features for a single input in the batch
single_input_mean, single_input_std = x[0, :].mean(), x[0, :].std()
print(f"Single input mean: {single_input_mean}, Single input std: {single_input_std}")
Batch feature mean: 0.14685693383216858, Batch feature std: 0.8803138732910156 Single input mean: -9.53674295089968e-09, Single input std: 0.9999954700469971
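For reference (my comparison, not part of the original), PyTorch's built-in nn.LayerNorm does essentially the same thing. The only difference is that LayerNorm1d above uses the unbiased variance estimator while nn.LayerNorm uses the biased one, so the two outputs agree only up to a small relative tolerance.
# Compare the hand-rolled LayerNorm1d with torch.nn.LayerNorm
torch.manual_seed(1337)
x = torch.randn(32, 100)
custom = LayerNorm1d(100)
builtin = nn.LayerNorm(100)  # weight=1, bias=0 at init, like gamma/beta above
out_custom = custom(x)
out_builtin = builtin(x)
# Not bit-identical: LayerNorm1d divides by n-1 when estimating the variance,
# nn.LayerNorm divides by n, hence the loose tolerance
print(torch.allclose(out_custom, out_builtin, rtol=1e-2, atol=1e-3))  # expected: True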
nanoGPT: https://github.com/karpathy/nanoGPT
SentencePiece: https://github.com/google/sentencepiece
Attention Is All You Need: https://arxiv.org/abs/1706.03762
Training language models to follow instructions with human feedback: https://cdn.openai.com/papers/Training_language_models_to_follow_instructions_with_human_feedback.pdf
The New Version of GPT-3 Is Much, Much Better: https://towardsdatascience.com/the-new-version-of-gpt-3-is-much-much-better-53ac95f21cfb
This is a PyTorch implementation of a language model using a transformer architecture. Here's a brief explanation of the code:
Overall, this code defines a language model that uses a transformer architecture and is trained with the AdamW optimizer on randomly sampled mini-batches. The transformer architecture consists of multiple transformer blocks, which are composed of self-attention heads and feed-forward modules. The self-attention heads enable the model to consider the relationships between all tokens in a sequence, while the feed-forward modules provide additional nonlinear transformations. The model is trained to predict the next token in a sequence given a context of previous tokens. The code prints the train and validation loss at specified intervals during training, and generates new text using the trained model.
import torch
import torch.nn as nn
from torch.nn import functional as F
# Hyperparameters
batch_size = 16
block_size = 32
max_iters = 5000
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0
torch.manual_seed(1337)
# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('input.txt', 'r', encoding='utf-8') as f:
text = f.read()
# Create a sorted list of unique characters in the text
chars = sorted(list(set(text)))
# Get the number of unique characters
vocab_size = len(chars)
# Create a dictionary mapping each character to an integer index
stoi = {ch: i for i, ch in enumerate(chars)}
# Create a dictionary mapping each integer index to a character
itos = {i: ch for i, ch in enumerate(chars)}
# Define a lambda function to encode a string as a list of integers
encode = lambda s: [stoi[c] for c in s]
# Define a lambda function to decode a list of integers as a string
decode = lambda l: ''.join([itos[i] for i in l])
# Convert the text to a tensor of integers
data = torch.tensor(encode(text), dtype=torch.long)
# Split the data into train and validation sets
n = int(0.9 * len(data)) # use first 90% for training, last 10% for validation
train_data = data[:n]
val_data = data[n:]
# Define a function to generate a small batch of input-target pairs
def get_batch(split):
# Select either the training or validation set
data = train_data if split == 'train' else val_data
# Generate random indices to start each block of input
ix = torch.randint(len(data) - block_size, (batch_size,))
# Select block_size characters starting at each index for input
x = torch.stack([data[i:i + block_size] for i in ix])
# Select block_size characters starting at each index + 1 for target
y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
# Send tensors to GPU if available
x, y = x.to(device), y.to(device)
return x, y
# Define a function to estimate the model's loss on the train and validation sets
@torch.no_grad()
def estimate_loss():
out = {}
model.eval()
for split in ['train', 'val']:
losses = torch.zeros(eval_iters)
for k in range(eval_iters):
X, Y = get_batch(split)
logits, loss = model(X, Y)
losses[k] = loss.item()
out[split] = losses.mean()
model.train()
return out
class Head(nn.Module):
""" one head of self-attention """
def __init__(self, head_size):
super().__init__()
# Linear transformations for key, query, and value
self.key = nn.Linear(n_embd, head_size, bias=False)
self.query = nn.Linear(n_embd, head_size, bias=False)
self.value = nn.Linear(n_embd, head_size, bias=False)
# Lower triangular matrix to mask future values
self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
# Dropout layer
self.dropout = nn.Dropout(dropout)
def forward(self, x):
# Get batch size, sequence length, and number of features
B, T, C = x.shape
# Linear transformations of key and query
k = self.key(x) # (B,T,C)
q = self.query(x) # (B,T,C)
# Compute attention scores ("affinities")
wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
# Mask future values
wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
# Apply softmax to get attention weights
wei = F.softmax(wei, dim=-1) # (B, T, T)
# Apply dropout to attention weights
wei = self.dropout(wei)
# Linear transformation of value
v = self.value(x) # (B,T,C)
# Weighted sum of values using attention weights
out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
return out
class MultiHeadAttention(nn.Module):
""" multiple heads of self-attention in parallel """
def __init__(self, num_heads, head_size):
super().__init__() # Initialize the superclass (nn.Module)
# Instantiate a list of head modules, and assign it to self.heads
self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
# A linear transformation to project the concatenated attention heads back to the input dimension
self.proj = nn.Linear(n_embd, n_embd)
# Dropout layer to avoid overfitting
self.dropout = nn.Dropout(dropout)
def forward(self, x):
# Apply the self-attention mechanism by passing the input through each attention head and concatenate the results along the feature dimension
out = torch.cat([h(x) for h in self.heads], dim=-1)
# Apply the projection layer and the dropout layer to the concatenated output
out = self.dropout(self.proj(out))
return out
class FeedFoward(nn.Module):
""" a simple linear layer followed by a non-linearity """
def __init__(self, n_embd):
super().__init__()
# Define a sequential neural network composed of two linear layers and a ReLU activation function
self.net = nn.Sequential(
nn.Linear(n_embd, 4 * n_embd),
nn.ReLU(),
nn.Linear(4 * n_embd, n_embd),
nn.Dropout(dropout),
)
def forward(self, x):
# Forward pass through the neural network
return self.net(x)
class Block(nn.Module):
""" Transformer block: communication followed by computation """
def __init__(self, n_embd, n_head):
# n_embd: embedding dimension, n_head: the number of heads we'd like
super().__init__()
head_size = n_embd // n_head
# Multi-head self-attention layer
self.sa = MultiHeadAttention(n_head, head_size)
# Feedforward layer
self.ffwd = FeedFoward(n_embd)
# Layer normalization layer 1
self.ln1 = nn.LayerNorm(n_embd)
# Layer normalization layer 2
self.ln2 = nn.LayerNorm(n_embd)
def forward(self, x):
# Communication followed by computation
x = x + self.sa(self.ln1(x))
x = x + self.ffwd(self.ln2(x))
return x
# Transformer language model (the class keeps the BigramLanguageModel name from the earlier bigram baseline)
class BigramLanguageModel(nn.Module):
def __init__(self):
super().__init__()
# Define the model architecture using embedding, multi-head attention, and linear layers
self.token_embedding_table = nn.Embedding(vocab_size, n_embd) # Lookup table for token embedding
self.position_embedding_table = nn.Embedding(block_size, n_embd) # Lookup table for position embedding
self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)]) # A sequence of transformer blocks
self.ln_f = nn.LayerNorm(n_embd) # Layer normalization
self.lm_head = nn.Linear(n_embd, vocab_size) # Linear layer to get logits
def forward(self, idx, targets=None):
B, T = idx.shape
# Embed tokens and positions
tok_emb = self.token_embedding_table(idx) # (B,T,C)
pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
x = tok_emb + pos_emb # (B,T,C)
# Pass through the transformer blocks
x = self.blocks(x) # (B,T,C)
# Apply layer normalization and linear layer
x = self.ln_f(x) # (B,T,C)
logits = self.lm_head(x) # (B,T,vocab_size)
# Compute the cross-entropy loss if targets are provided
if targets is None:
loss = None
else:
B, T, C = logits.shape
logits = logits.view(B*T, C)
targets = targets.view(B*T)
loss = F.cross_entropy(logits, targets)
return logits, loss
def generate(self, idx, max_new_tokens):
# Generate new text by sampling from the learned distribution
for _ in range(max_new_tokens):
# Crop idx to the last block_size tokens
idx_cond = idx[:, -block_size:]
# Get the predictions
logits, loss = self(idx_cond)
# Focus only on the last time step
logits = logits[:, -1, :] # becomes (B, C)
# Apply softmax to get probabilities
probs = F.softmax(logits, dim=-1) # (B, C)
# Sample from the distribution
idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
# Append the sampled index to the running sequence
idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
return idx
# Instantiate the bigram language model
model = BigramLanguageModel()
# Move the model to the specified device (CPU or GPU)
m = model.to(device)
# Print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')
# Create a PyTorch optimizer with the AdamW algorithm
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
# Loop over training iterations
for iter in range(max_iters):
# Every once in a while evaluate the loss on train and val sets
if iter % eval_interval == 0 or iter == max_iters - 1:
losses = estimate_loss()
print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
# Sample a batch of data (inputs and targets)
xb, yb = get_batch('train')
# Evaluate the loss and update the model parameters
logits, loss = model(xb, yb)
optimizer.zero_grad(set_to_none=True)
loss.backward()
optimizer.step()
# Generate text from the model
# Initialize the context with a zero tensor
context = torch.zeros((1, 1), dtype=torch.long, device=device)
# Generate a sequence of tokens using the model
generated_sequence = m.generate(context, max_new_tokens=2000)
# Decode the generated sequence into a string
decoded_sequence = decode(generated_sequence[0].tolist())
# Print the generated text
print(decoded_sequence)
0.209729 M parameters step 0: train loss 4.4116, val loss 4.4022 step 100: train loss 2.6568, val loss 2.6670 step 200: train loss 2.5090, val loss 2.5058 step 300: train loss 2.4198, val loss 2.4340 step 400: train loss 2.3503, val loss 2.3567 step 500: train loss 2.2970, val loss 2.3136 step 600: train loss 2.2410, val loss 2.2506 step 700: train loss 2.2062, val loss 2.2198 step 800: train loss 2.1638, val loss 2.1871 step 900: train loss 2.1232, val loss 2.1494 step 1000: train loss 2.1020, val loss 2.1293 step 1100: train loss 2.0704, val loss 2.1196 step 1200: train loss 2.0382, val loss 2.0798 step 1300: train loss 2.0249, val loss 2.0640 step 1400: train loss 1.9922, val loss 2.0354 step 1500: train loss 1.9707, val loss 2.0308 step 1600: train loss 1.9614, val loss 2.0474 step 1700: train loss 1.9393, val loss 2.0130 step 1800: train loss 1.9070, val loss 1.9943 step 1900: train loss 1.9057, val loss 1.9871 step 2000: train loss 1.8834, val loss 1.9954 step 2100: train loss 1.8719, val loss 1.9758 step 2200: train loss 1.8582, val loss 1.9623 step 2300: train loss 1.8546, val loss 1.9517 step 2400: train loss 1.8410, val loss 1.9476 step 2500: train loss 1.8167, val loss 1.9455 step 2600: train loss 1.8263, val loss 1.9401 step 2700: train loss 1.8108, val loss 1.9340 step 2800: train loss 1.8040, val loss 1.9247 step 2900: train loss 1.8044, val loss 1.9304 step 3000: train loss 1.7963, val loss 1.9242 step 3100: train loss 1.7687, val loss 1.9147 step 3200: train loss 1.7547, val loss 1.9102 step 3300: train loss 1.7557, val loss 1.9037 step 3400: train loss 1.7547, val loss 1.8946 step 3500: train loss 1.7385, val loss 1.8968 step 3600: train loss 1.7260, val loss 1.8914 step 3700: train loss 1.7257, val loss 1.8808 step 3800: train loss 1.7204, val loss 1.8919 step 3900: train loss 1.7215, val loss 1.8788 step 4000: train loss 1.7146, val loss 1.8639 step 4100: train loss 1.7095, val loss 1.8724 step 4200: train loss 1.7079, val loss 1.8707 step 4300: train loss 1.7035, val loss 1.8502 step 4400: train loss 1.7043, val loss 1.8693 step 4500: train loss 1.6914, val loss 1.8522 step 4600: train loss 1.6853, val loss 1.8357 step 4700: train loss 1.6862, val loss 1.8483 step 4800: train loss 1.6671, val loss 1.8434 step 4900: train loss 1.6736, val loss 1.8415 step 4999: train loss 1.6635, val loss 1.8226 FlY BOLINGLO: Them thrumply towiter arts the muscue rike begatt the sea it What satell in rowers that some than othis Marrity. LUCENTVO: But userman these that, where can is not diesty rege; What and see to not. But's eyes. What? JOHN MARGARET: Than up I wark, what out, I ever of and love, one these do sponce, vois I me; But my pray sape to ries all to the not erralied in may. BENVOLIO: To spits as stold's bewear I would and say mesby all on sworn make he anough As cousins the solle, whose be my conforeful may lie them yet nobe allimely untraled to be thre I say be, Notham a brotes theme an make come, And that his reach to the duke ento the grmeants bell! and now there king-liff-or grief? GLOUCESTER: All the bettle dreene, for To his like thou thron! MENENIUS: Then, if I knom her all. My lord, but terruly friend Rish of the ploceiness and wilt tends sure? Is you knows a fasir wead That with him my spaut, I shall not tas where's not, becomity; my coulds sting, then the wit be dong to tyget our hereefore, Who strop me, mend here, if agains, bitten, thy lack. The but these it were is tus. For the her skeep the fasting. 
joy tweet Bumner:- How the enclady: It you and how, I am in him, And ladderle: Their hand whose wife, it my hithre, Roman and where sposs gives'd you. TROMIOLANUS: But livants you great, I shom mistrot come, for to she to lot for smy to men ventry mehus. Gazise; Full't were some the cause, and stouch set, Or promises, which a kingsasted to your gove them; and sterrer, And that wae love him. BRUTUS: You shape with these sweet. CORTENGONO: Lo, where 'twon elmes, 'morth young agres; Sir, azavoust to striel accurded we missery sets crave. ANGOLUM: For is Henry to have gleise the dreason That I ant shorfold wefth their servy in enscy. ISABELLA: O, I better you eyse such formfetrews. BUCKINGHARENT: Qead my lightle this righanneds flase them Wam which an take was our some pleasurs, Lovisoname to me, then fult me?--have it? HENRY BOLINGBROY: That wha