The following code uses the `requests` library, which is a flexible and user-friendly way to handle HTTP requests in Python. It downloads the dataset from the URL and saves it as a file called `input.txt`.

The status message at the end of the code lets you know when the dataset has been saved.

In [ ]:

```
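import requests

# Download the tiny shakespeare dataset
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
response = requests.get(url, timeout=30)
response.raise_for_status()  # Fail loudly if the download did not succeed
with open('input.txt', 'wb') as f:
    f.write(response.content)
print("Dataset saved to input.txt")

# Read the file
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()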
```

This code prints the length of the `text` string, which holds the contents of `input.txt`. The length is computed using the built-in `len` function and is expressed in characters. The resulting value is then printed to the console using the `print` function.

In [ ]:

```
print("Length of the dataset in characters: ", len(text))
```

Length of the dataset in characters: 1115394

Let's take a look at the data by printing the first 500 characters of the `text` variable, which was previously loaded from the file `input.txt` using the `with open` statement. The `[:500]` syntax slices the first 500 characters of the `text` string.

In [ ]:

```
print(text[:500])
```

After that, we generate the set of unique characters in the text and sort it. The sorted characters are then joined together and printed as a single string. The number of unique characters is also calculated and printed with the formatted message "Unique characters:".

The set comprehension `{char for char in text}` extracts the unique characters in the `text` variable, which was read from the file. A set keeps only one occurrence of each character, so the result contains exactly the distinct characters in the text.

After that, the `len` function is applied to the sorted list of unique characters to get their count. This value is stored in the `vocab_size` variable and is printed with a formatted message.

In [ ]:

```
# Check out the unique characters that occur in this text dataset
chars = sorted({char for char in text})
vocab_size = len(chars)
print(''.join(chars))
print(f"Unique characters: {vocab_size}")
```

!$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz Unique characters: 65

As a next step, we will create two dictionaries, `char_to_int` and `int_to_char`, that map characters in the text dataset to unique integers and vice versa. Then, two functions, `encode` and `decode`, are defined to convert text to a list of integers and back. The code tests the two functions by encoding the string "hello there" and decoding the result to make sure the process works as expected. The output should be the encoded list of integers for "hello there" followed by the decoded string "hello there".

In [ ]:

```
# Create mappings from characters to integers and vice versa
char_to_int = {char: index for index, char in enumerate(chars)}
int_to_char = {index: char for index, char in enumerate(chars)}
# Define encoding and decoding functions
def encode(text):
    return [char_to_int[char] for char in text]

def decode(encoded):
    return ''.join([int_to_char[index] for index in encoded])
# Test the encoding and decoding functions
encoded = encode("hello there")
print(encoded)
print(decode(encoded))
```

[46, 43, 50, 50, 53, 1, 58, 46, 43, 56, 43]

hello there
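The round trip can also be checked on a toy string without loading the dataset. The following is a minimal sketch; the `sample` string and the `toy_*` names are illustrative and not part of the notebook.

```python
# Toy corpus (hypothetical); the notebook builds its vocabulary from input.txt instead
sample = "hello there, world"

# Sorted unique characters define the vocabulary
toy_chars = sorted(set(sample))
toy_char_to_int = {ch: i for i, ch in enumerate(toy_chars)}
toy_int_to_char = {i: ch for i, ch in enumerate(toy_chars)}

def toy_encode(s):
    """Map each character to its integer index."""
    return [toy_char_to_int[ch] for ch in s]

def toy_decode(ids):
    """Map integer indices back to characters and join them."""
    return ''.join(toy_int_to_char[i] for i in ids)

ids = toy_encode("hello")
print(ids)
print(toy_decode(ids))
```

Because the mapping is a plain dictionary lookup, encoding a character that never appeared in the source text raises a `KeyError`.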

Now that we have checked that encoding and decoding work, we will import the PyTorch library and create a 1-dimensional tensor (i.e. a `torch.LongTensor`) called `data` that holds the encoding of the entire text dataset.

The text is encoded using the `encode` function defined earlier, which takes a string as input and returns a list of integers. The resulting list is then passed to the `torch.tensor` function, which creates a tensor from the input data. The `dtype` argument is set to `torch.long`, which specifies the data type of the tensor as 64-bit integers.

Finally, the shape and data type of the tensor are printed, as well as the first 500 characters in their encoded form. The output of the `print` statements gives us some basic information about the tensor.

In [ ]:

```
# Encode the entire text dataset and save it into a torch.Tensor
import torch # PyTorch: https://pytorch.org
data = torch.tensor(encode(text), dtype=torch.long)
print(f"Shape of data tensor: {data.shape}")
print(f"Data type of data tensor: {data.dtype}")
print(f"First 500 characters in encoded form: {data[:500]}")
```

We will split the data tensor into training and validation sets using a specified train ratio (0.8 in this case).

The length of the training data is calculated as the product of the length of the data tensor and the train ratio (0.8), truncated to an integer. The first part of the data tensor, up to that length, becomes the training data. The rest of the data tensor becomes the validation data.

The lengths of the training data and validation data are then printed out.

In [ ]:

```
# Split the data into training and validation sets
train_ratio = 0.8
train_data_length = int(len(data) * train_ratio)
train_data = data[:train_data_length]
val_data = data[train_data_length:]
print(f"Length of training data: {len(train_data)}")
print(f"Length of validation data: {len(val_data)}")
```

Length of training data: 892315

Length of validation data: 223079
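The two lengths follow directly from the dataset size printed earlier. A small worked check of the arithmetic (the variable names here are illustrative):

```python
n_chars = 1115394   # dataset length reported earlier in the notebook
train_ratio = 0.8

train_len = int(n_chars * train_ratio)  # int() truncates 892315.2 down to 892315
val_len = n_chars - train_len           # everything after the training slice

print(train_len, val_len)
```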

Let's define a block size of 8 characters, and print the first 8 + 1 = 9 characters in the training data.

The code first converts the training data tensor slice to a list using the `tolist()` method, then passes this list to the `decode` function to get the characters represented by the integers. The `decode` function takes a list of integers, which are indices into the `int_to_char` mapping, and converts them back to characters.

The `print` statement outputs the decoded characters, allowing us to see a portion of the original text.

In [ ]:

```
block_size = 8
print("First", block_size + 1, "characters in training data:")
print(decode(train_data[:block_size+1].tolist()))
```

First 9 characters in training data:

First Cit

Also, let's take two overlapping slices of the training data, `x` and `y`, with `block_size` characters each. `x` contains the first `block_size` characters of the training data, and `y` contains the same window shifted one position to the right.

Then, the code loops over `t` from 0 to `block_size - 1` and uses `t` as the index to extract the context and target. The context is `x[:t+1]`, a slice of `x` that contains all elements up to and including the `t`-th element. The target is `y[t]`, the `t`-th element of `y`, which is the character that follows the context in the original text.

The code prints the context and target for each value of `t` as tensors of encoded integers. Passing them through the `decode` function (after `tolist()`) would show the corresponding characters instead.

In [ ]:

```
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")
```
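To see the same pairs as characters rather than integers, here is a minimal sketch that works directly on the nine characters printed earlier ("First Cit") instead of on the encoded tensor. The `chunk` and `pairs` names are illustrative.

```python
block_size = 8
chunk = "First Cit"  # the first block_size + 1 characters of the training data

x = chunk[:block_size]        # inputs
y = chunk[1:block_size + 1]   # targets: the inputs shifted by one position

pairs = []
for t in range(block_size):
    context = x[:t + 1]   # everything up to and including position t
    target = y[t]         # the character that follows the context
    pairs.append((context, target))
    print(f"when input is {context!r} the target is {target!r}")
```

A single window of `block_size + 1` characters therefore yields `block_size` training examples, with context lengths from 1 up to `block_size`.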

The code generates a random batch of inputs (`x`) and targets (`y`) from either the training data or the validation data, selected by the `split` argument passed to the function `get_batch()`. The `block_size` specifies the maximum context length for predictions. The `batch_size` specifies the number of independent sequences that will be processed in parallel.

The `torch.randint` function is used to generate a tensor of shape `(batch_size,)` containing random integers from 0 up to, but not including, `len(data) - block_size`. These integers are the starting indices of each sequence in the batch. The `torch.stack` function is used to create tensors `x` and `y` from the corresponding slices of the data tensor. `x` contains the blocks of size `block_size` from `data`, and `y` contains the blocks of size `block_size` shifted by one position.

The generated inputs and targets are printed out, along with their shapes. The code then loops over the generated batch and, for each sequence in the batch, prints the context and the corresponding target. The `context` is the slice of the input tensor `xb` from the beginning of the sequence to the current time step, and the target is the corresponding element of the target tensor `yb`. The `tolist` method is used to convert the context tensor to a Python list before printing.

In [ ]:

```
torch.manual_seed(1337)
batch_size = 4 # The number of independent sequences that will be processed in parallel
block_size = 8 # Max context length for predictions
def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(0, len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')
print('Inputs:')
print(xb.shape)
print(xb)
print('Targets:')
print(yb.shape)
print(yb)
print('----')
for b in range(batch_size):
    for t in range(block_size):
        context = xb[b, :t+1]
        target = yb[b, t]
        print(f"When input is {context.tolist()}, the target is: {target}")
```
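The sampling logic can be sketched without tensors. In this minimal version, `data` is a stand-in list of integers and `get_batch_py` is a hypothetical helper, not the notebook's function; `random.randrange` excludes its upper bound in the same way `torch.randint` does.

```python
import random

random.seed(0)
data = list(range(100))   # stand-in for the encoded text (hypothetical)
block_size = 8
batch_size = 4

def get_batch_py(data, block_size, batch_size):
    # Each start index must leave room for block_size inputs plus one extra target
    starts = [random.randrange(0, len(data) - block_size) for _ in range(batch_size)]
    x = [data[i:i + block_size] for i in starts]
    y = [data[i + 1:i + block_size + 1] for i in starts]
    return x, y

xb_py, yb_py = get_batch_py(data, block_size, batch_size)
for xs, ys in zip(xb_py, yb_py):
    print(xs, "->", ys)
```

Every target row is its input row shifted by one position, which is why the largest allowed start index is `len(data) - block_size - 1`.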

In [ ]:

```
print(xb) # My input to the transformer
```

Now, we will introduce a bigram language model using PyTorch. A bigram language model predicts the next token based only on the current token; in this notebook, each token is a single character.

The model is implemented as a custom PyTorch `nn.Module` named `BigramLanguageModel`.

The model has two main components:

- The token embedding table, implemented as an instance of `nn.Embedding` with `vocab_size` rows and `vocab_size` output features. Looking up a token index returns a row of `vocab_size` values, which are used directly as the logits for the next token.
- The `forward` method, which takes an input sequence `idx` of token indices (integers) and optional `targets` (also an integer sequence), and computes the logits for the next-token predictions and the cross-entropy loss if targets are provided. The logits are computed by passing the input sequence through the token embedding table.

The model also has a `generate` method, which generates a sequence of new tokens based on an initial context `idx` and a specified number of tokens `max_new_tokens` to generate. The method looks up the logits for the last token in the context, applies softmax to obtain probabilities, and samples from the distribution to obtain the next token index. The sampled token index is concatenated to the input sequence to obtain the new context for the next prediction.

In the code, an instance of the `BigramLanguageModel` is created with the given `vocab_size` and is used to compute the logits and loss for an input sequence `xb` and target sequence `yb`, and to generate 100 new tokens from an initial context consisting of a single token with index 0.

In [ ]:

```
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)
class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.embedding(idx)
        B, T, C = logits.shape
        logits = logits.reshape(B * T, C)
        if targets is None:
            return logits, None
        targets = targets.reshape(-1)
        loss = F.cross_entropy(logits, targets)
        return logits, loss

    def generate(self, idx, max_new_tokens):
        for i in range(max_new_tokens):
            logits = self.embedding(idx)
            logits = logits[:, -1, :]
            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, next_token), dim=1)
        return idx

m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss)
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))
```
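A useful reference point when reading the printed loss: if a model assigned equal probability to every character, the cross-entropy would be the natural log of the vocabulary size. The sketch below computes that value for the 65 characters found earlier. A freshly initialized embedding table is random rather than uniform, so the untrained loss would typically be expected to sit somewhat above this number.

```python
import math

vocab_size = 65  # number of unique characters found earlier

# Cross-entropy of a uniform prediction over vocab_size classes: -ln(1 / vocab_size)
uniform_loss = -math.log(1.0 / vocab_size)
print(f"{uniform_loss:.4f}")
```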

Here, we create an instance of the `AdamW` optimizer in PyTorch. The optimizer will adjust the parameters of the model `m` to minimize the loss during training.

The `AdamW` optimizer is a variant of the popular `Adam` optimizer that applies decoupled weight decay, which helps to regularize the model and reduce overfitting. The optimizer takes the parameters of the model `m` as input and uses a learning rate of 0.001.

In [ ]:

```
# Create a PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=0.001)
```

Next, we will train the model using early stopping and TensorBoard logging.

- The AdamW optimizer created above (learning rate 0.001) updates the model parameters.
- A TensorBoard writer is created to log the training and validation losses.
- The code trains the model in a loop of up to 100 steps (`max_steps`) and evaluates the model on one random validation batch at each step.
- The best validation loss is recorded and compared with the current validation loss.
- If the validation loss fails to improve on the best validation loss for `early_stop_steps` consecutive steps, the training stops and the loop exits.
- The training loss is printed and logged in TensorBoard every 10 steps.
- After the loop, the TensorBoard writer is closed. The final train loss is printed.

In [ ]:

```
from torch.utils.tensorboard import SummaryWriter
batch_size = 32
max_steps = 100 # Increase number of steps for better results...
early_stop_steps = 10
# Create a TensorBoard writer
writer = SummaryWriter()
# Keep track of the best validation loss
best_val_loss = float("inf")
# Early stopping counter
early_stop_counter = 0
for steps in range(max_steps):
    # Obtain a batch of data as sample
    xb, yb = get_batch('train')
    # Loss evaluation
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    if steps % 10 == 0:
        print("Step {}: Train Loss: {}".format(steps, loss.item()))
        writer.add_scalar('train_loss', loss.item(), steps)
    # Evaluate the model on the validation set
    with torch.no_grad():
        xb, yb = get_batch('val')
        logits, val_loss = m(xb, yb)
        writer.add_scalar('val_loss', val_loss.item(), steps)
    if val_loss.item() < best_val_loss:
        best_val_loss = val_loss.item()
        early_stop_counter = 0
    else:
        early_stop_counter += 1
    if early_stop_counter >= early_stop_steps:
        print("Early stopping at step {}".format(steps))
        break
writer.close()
print("Final Train Loss: {}".format(loss.item()))
```
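The early-stopping rule in the loop above can be isolated from the training code. The following is a minimal sketch on a made-up list of validation losses; `early_stop_step` and `patience` are hypothetical names, with `patience` playing the role of `early_stop_steps`.

```python
def early_stop_step(val_losses, patience):
    """Return the step at which early stopping triggers, or None if it never does."""
    best = float("inf")
    bad_steps = 0
    for step, val_loss in enumerate(val_losses):
        if val_loss < best:
            # New best: remember it and reset the counter
            best = val_loss
            bad_steps = 0
        else:
            bad_steps += 1
        if bad_steps >= patience:
            return step
    return None

losses = [3.0, 2.5, 2.6, 2.7, 2.4, 2.5, 2.6, 2.7]  # made-up values
print(early_stop_step(losses, patience=3))
```

Because the notebook measures the validation loss on a single random batch per step, the value is noisy, so a small patience may end training for reasons unrelated to overfitting.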

Finally, we will generate text using the language model `m` and print the decoded result. The function `m.generate` generates text from the given input `idx`, which is a tensor of shape (1, 1) containing the index of the starting token (here 0, the first character in the sorted vocabulary), encoded as an integer. The argument `max_new_tokens=500` specifies the number of new tokens to generate. The first (and only) sequence in the generated batch is then converted to a list of integers with `[0].tolist()`. Finally, the function `decode` is applied to the list of integers to convert the encoded text back to a human-readable string.

In [ ]:

```
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=500)[0].tolist()))
```

The following creates a lower triangular matrix of ones (`a`) and normalizes the matrix along the rows. Then, it generates random values from a normal distribution (`b`) and uses matrix multiplication to compute a weighted aggregation of the rows of `b` with the weights in `a`. The results are then printed.

In [ ]:

```
# Set random seed for reproducibility
torch.manual_seed(42)
# Create a lower triangular matrix of ones
a = torch.tril(torch.ones(3, 3))
# Normalize the lower triangular matrix along the rows
a = a / torch.sum(a, 1, keepdim=True)
# Generate random weights from a normal distribution
b = torch.randn(3, 2).float()
# Compute the weighted aggregation using matrix multiplication
c = a @ b
# Print results
print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)
```
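To make the effect of the normalized triangular matrix concrete, here is a minimal plain-Python sketch with fixed values in place of `torch.randn` (the values in `b` and the `matmul` helper are illustrative). Row `i` of the result is the average of the first `i + 1` rows of `b`.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

n = 3
# Row i has 1/(i+1) in its first i+1 entries and 0 elsewhere
a = [[1.0 / (i + 1) if j <= i else 0.0 for j in range(n)] for i in range(n)]
b = [[2.0, 7.0], [6.0, 4.0], [6.0, 5.0]]  # made-up values

c = matmul(a, b)
for row in c:
    print(row)
```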

The following code generates a tensor of random values (`x`) of shape (`B`, `T`, `C`) and prints the shape of the tensor. The cumulative sum of `x` along the second dimension (time), divided by the number of time steps seen so far, is computed in a later cell.

In [ ]:

```
# Set random seed for reproducibility
torch.manual_seed(1337)
# Define the batch size (B), time steps (T), and number of channels (C)
B, T, C = 4, 8, 2
# Generate a tensor with random values, with shape (B, T, C)
x = torch.randn(B, T, C)
# Print the shape of the tensor
print("Shape of the tensor:", x.shape)
```

Shape of the tensor: torch.Size([4, 8, 2])

The code calculates a running average of the 3D tensor `x` with dimensions `[B, T, C]` along the time axis (`T`).

`xbow` is first initialized as a 3D tensor of zeros with the same dimensions as `x`. The commented-out loop shows the explicit computation, and the final line overwrites `xbow` with the vectorized result.

The calculation divides the cumulative sum of the elements of `x` along the `T` axis by a sequence of integers from `1` to `T`. The `cumsum()` method computes the cumulative sum along the `T` axis of `x` and returns a tensor with the same dimensions as `x`. The sequence of integers is created using `torch.arange()`, which returns a 1D tensor of consecutive integers; the two `unsqueeze` calls reshape it to `(1, T, 1)` so that it broadcasts against `x`.

The resulting tensor is a 3D tensor of the same shape as `x`, where `xbow[b,t]` contains the average of `x[b,:t+1]`.

In [ ]:

```
# We want xbow[b,t] = mean_{i<=t} x[b,i]
xbow = torch.zeros((B,T,C))
# Explicit (slow) reference version:
# for b in range(B):
#     for t in range(T):
#         xprev = x[b,:t+1] # (t+1,C)
#         xbow[b,t] = torch.mean(xprev, 0)
# Vectorized version: cumulative sum divided by the number of steps so far
xbow = x.cumsum(1) / torch.arange(1, T+1, dtype=torch.float32).unsqueeze(0).unsqueeze(2)
```
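The equivalence between the explicit loop and the cumulative-sum version can be checked on a single short sequence. This is a minimal sketch with made-up values, using `itertools.accumulate` in place of `cumsum()`.

```python
from itertools import accumulate

seq = [1.0, 3.0, 5.0, 7.0]  # one channel of one sequence (made-up values)

# Explicit version: mean of everything up to and including position t
loop_means = [sum(seq[:t + 1]) / (t + 1) for t in range(len(seq))]

# Cumulative-sum version: running total divided by the count so far
cumsum_means = [total / (t + 1) for t, total in enumerate(accumulate(seq))]

print(loop_means)
print(cumsum_means)
```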

This code generates a tensor `wei` that is a lower triangular matrix of ones and then normalizes it along the rows. It then performs a matrix multiplication between `wei` and the existing tensor `x` of shape `(B, T, C)` using `@`; `wei` has shape `(T, T)` and is broadcast across the batch dimension. The resulting tensor `xbow2` has the same shape as `x`, and each element `xbow2[b, t]` is the weighted average of all the elements in `x[b]` up to and including `x[b, t]`, where the weights are given by the `t`-th row of `wei`. Finally, it checks whether `xbow` and `xbow2` are element-wise close using the `torch.allclose()` function.

In [ ]:

```
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x # (T, T) broadcast to (B, T, T) @ (B, T, C) ----> (B, T, C)
torch.allclose(xbow, xbow2)
```

Out[ ]:

True

The following code calculates the weights for each time step in the tensor `x` using a lower triangular mask followed by a softmax operation, which yields weights that sum to 1 across each row. Positions outside the lower triangle are set to negative infinity, so they receive zero weight after the softmax. These weights are then used to obtain a weighted average of the values in `x` for each time step.

In [ ]:

```
# Create a mask for the lower triangular matrix of shape (T, T)
mask = torch.tril(torch.ones(T, T)).bool()
# Set the values outside the lower triangle to -inf to ensure zero weights
mask = mask.float().masked_fill(~mask, float('-inf'))
# Apply softmax to the masked values along the last dimension to obtain the weights
weights = F.softmax(mask, dim=-1)
# Obtain the weighted average of x using the computed weights
xbow3 = torch.matmul(weights, x)
# Check if xbow and xbow3 are equal
torch.allclose(xbow, xbow3)
```

Out[ ]:

True
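Why the masked softmax reproduces the uniform running average can be seen on a small example. The sketch below uses plain Python lists and a hand-written `softmax` helper (both illustrative); since `exp(-inf)` is 0, each row ends up with equal weight on the allowed positions and zero weight elsewhere.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

T = 4
NEG_INF = float("-inf")
# Row t keeps positions 0..t (score 0) and masks the rest with -inf
masked = [[0.0 if j <= t else NEG_INF for j in range(T)] for t in range(T)]
weights = [softmax(row) for row in masked]
for row in weights:
    print(row)
```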

The code defines a single head of self-attention using linear transformations for the key, query, and value inputs. The input tensor `x` has shape `(B, T, C)`, representing a batch of sequences with `T` timesteps and `C` features. The keys, queries, and values are each produced by a bias-free linear layer with `C` input channels and `head_size` output channels.

The dot product of the transformed query and key gives an attention score matrix of shape `(B, T, T)`. In this cell the scores are not yet scaled; the division by the square root of `head_size` is covered in the notes further down. A lower triangular mask is applied to the score matrix to ensure that information flows only from the past to the present. The softmax operation is then applied along the last dimension to obtain a probability distribution over the visible positions. Finally, the output is obtained by multiplying the probability distribution with the value tensor.

The output has shape `(B, T, head_size)`.

In [ ]:

```
# Set the random seed to a fixed value for reproducibility
torch.manual_seed(1337)
# Define the batch size (B), number of time steps (T), and number of channels (C)
B,T,C = 4,8,32
# Generate a tensor with random values of shape (B, T, C)
x = torch.randn(B,T,C)
# Define the size of each head in the attention mechanism
head_size = 16
# Define the linear layers for the keys, queries, and values
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
# Apply the linear layer for the keys to the input tensor to get k of shape (B, T, head_size)
k = key(x)
# Apply the linear layer for the queries to the input tensor to get q of shape (B, T, head_size)
q = query(x)
# Calculate the attention weights by multiplying q and k transposed together. This produces a tensor wei of shape (B, T, T).
wei = q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)
# Create a lower triangular matrix of ones (zeros above the diagonal)
tril = torch.tril(torch.ones(T, T))
# Set the scores at future positions (where tril is 0) to negative infinity
wei = wei.masked_fill(tril == 0, float('-inf'))
# Apply the softmax function along the last dimension of wei to get the final attention weights
wei = F.softmax(wei, dim=-1)
# Apply the linear layer for the values to the input tensor to get v of shape (B, T, head_size)
v = value(x)
# Compute the weighted sum of the values with the attention weights to get the output tensor of shape (B, T, head_size)
out = wei @ v
# Print the shape of the output tensor
print(out.shape)
```

torch.Size([4, 8, 16])

In [ ]:

```
print(wei[0])
```
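The same computation can be followed step by step on a tiny example. Below is a minimal plain-Python sketch of single-head causal attention without scaling, matching the cell above; the `q`, `k`, `v` values and the `causal_attention` helper are made up for illustration, and the linear projections are skipped.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(q, k, v):
    """Single-head causal attention over lists of vectors (no scaling)."""
    T = len(q)
    out, all_weights = [], []
    for t in range(T):
        # Scores against every position; future positions are masked with -inf
        scores = [sum(qi * ki for qi, ki in zip(q[t], k[s])) if s <= t else float("-inf")
                  for s in range(T)]
        w = softmax(scores)
        all_weights.append(w)
        # Weighted sum of the value vectors
        out.append([sum(w[s] * v[s][d] for s in range(T)) for d in range(len(v[0]))])
    return out, all_weights

# Made-up 3-step example with 2-dimensional queries, keys, and values
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, att = causal_attention(q, k, v)
print(att)
print(out)
```

As in the printed `wei[0]`, row `t` of the weight matrix is nonzero only in its first `t + 1` entries, and the first output equals the first value vector because position 0 can only attend to itself.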

Notes:

- Attention is a **communication mechanism**. It can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.
- There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.
- Each example across the batch dimension is processed completely independently; examples never "talk" to each other.
- In an "encoder" attention block, just delete the single line that does masking with `tril`, allowing all tokens to communicate. This block here is called a "decoder" attention block because it has triangular masking, and is usually used in autoregressive settings, like language modeling.
- "Self-attention" just means that the keys and values are produced from the same source as the queries. In "cross-attention", the queries still get produced from `x`, but the keys and values come from some other, external source (e.g. an encoder module).
- "Scaled" attention additionally multiplies `wei` by 1/sqrt(head_size). This makes it so that when the inputs Q and K have unit variance, `wei` will have unit variance too, and softmax will stay diffuse and not saturate too much. Illustration below.

The following code computes the scaled dot-product attention scores between two tensors `k` and `q`, and applies the softmax function to the result to get the attention weights.

- We calculate the scale factor separately from the product of `q` and `k`, and then multiply the product by the scale factor.
- We explicitly convert the integer `head_size` to a `float32` tensor using `torch.float32` before taking the reciprocal square root.
- We print the variance of `k`, `q`, and `wei` to ensure they have the expected values.
- We use the `softmax()` function from PyTorch instead of performing the softmax operation manually.
- We also apply softmax to the scores multiplied by a larger factor (`8` in this case), which results in a sharper, more peaked attention distribution.

In [ ]:

```
k = torch.randn(B, T, head_size)
q = torch.randn(B, T, head_size)
scale = torch.rsqrt(torch.tensor(head_size, dtype=torch.float32))
wei = q @ k.transpose(-2, -1) * scale
print("variance of k:", k.var())
print("variance of q:", q.var())
print("variance of wei:", wei.var())
# softmax over last dimension
softmaxed_wei = torch.softmax(wei, dim=-1)
# softmax over scores multiplied by 8, which increases the "peakiness" of the attention distribution
sm = torch.nn.Softmax(dim=-1)
scaled_sm_wei = sm(wei * 8)
```

variance of k: tensor(0.9006)

variance of q: tensor(1.0037)

variance of wei: tensor(0.9957)
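The role of the scale factor can be checked numerically without PyTorch. The sketch below draws unit-variance query and key vectors with `random.gauss` and compares the variance of the raw dot products with that of the scaled ones; the sample count and names are illustrative. The raw variance should come out close to `head_size`, and the scaled variance close to 1.

```python
import random
import statistics

random.seed(0)
head_size = 16
n_samples = 5000

def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

raw, scaled = [], []
for _ in range(n_samples):
    q_vec = [random.gauss(0.0, 1.0) for _ in range(head_size)]
    k_vec = [random.gauss(0.0, 1.0) for _ in range(head_size)]
    score = dot(q_vec, k_vec)
    raw.append(score)
    scaled.append(score / head_size ** 0.5)  # multiply by 1/sqrt(head_size)

print("variance of raw scores:   ", statistics.variance(raw))
print("variance of scaled scores:", statistics.variance(scaled))
```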

In [ ]:

```
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)
```

Out[ ]:

tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])

In [ ]:

```
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])*8, dim=-1)
```

Out[ ]:

tensor([0.0326, 0.0030, 0.1615, 0.0030, 0.8000])

The code defines a `LayerNorm1d` class, which implements layer normalization for 2D input tensors of shape (batch, features). The `__init__` method sets the value of the `eps` parameter and the learnable parameters `gamma` and `beta`. The `__call__` method takes an input tensor `x`, normalizes each row across its features, and returns the normalized output. The `parameters` method returns the learnable parameters of the layer.

The code initializes an instance of the `LayerNorm1d` class with 100 dimensions and applies it to a 32 x 100 input tensor. The output tensor has the same shape as the input tensor.

In [ ]:

```
class LayerNorm1d:
    def __init__(self, dim, eps=1e-5, momentum=0.1):
        # momentum is accepted but not used by this layer
        self.eps = eps
        self.gamma = torch.ones(dim)
        self.beta = torch.zeros(dim)

    def __call__(self, x):
        # Calculate the forward pass
        xmean = x.mean(1, keepdim=True) # Mean over the features of each example
        xvar = x.var(1, keepdim=True) # Variance over the features of each example
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps) # Normalize to unit variance
        self.out = self.gamma * xhat + self.beta
        return self.out

    def parameters(self):
        return [self.gamma, self.beta]

# Instantiate the module
torch.manual_seed(1337)
module = LayerNorm1d(100)
# Generate a batch of size 32 with 100-dimensional vectors
x = torch.randn(32, 100)
# Apply the layer normalization to the batch
x = module(x)
# Print the shape of the output tensor
print(f"Output shape: {x.shape}")
```

Output shape: torch.Size([32, 100])

In [ ]:

```
# Compute the mean and standard deviation of the first feature across all inputs in the batch
batch_feature_mean, batch_feature_std = x[:, 0].mean(), x[:, 0].std()
print(f"Batch feature mean: {batch_feature_mean}, Batch feature std: {batch_feature_std}")
# Compute the mean and standard deviation of all features for a single input in the batch
single_input_mean, single_input_std = x[0, :].mean(), x[0, :].std()
print(f"Single input mean: {single_input_mean}, Single input std: {single_input_std}")
```
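What the second pair of statistics measures can be reproduced on a single row. This minimal sketch normalizes one made-up feature vector with its own mean and unbiased variance, mirroring the `__call__` method above with `gamma` and `beta` left at their initial values of 1 and 0; `layer_norm_row` is a hypothetical helper.

```python
import statistics

def layer_norm_row(row, eps=1e-5):
    """Normalize one row using its own mean and (unbiased) variance."""
    mean = statistics.fmean(row)
    var = statistics.variance(row)  # unbiased, like torch.Tensor.var's default
    return [(v - mean) / (var + eps) ** 0.5 for v in row]

row = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up feature vector
normed = layer_norm_row(row)
print(statistics.fmean(normed), statistics.stdev(normed))
```

The mean of the normalized row is zero up to rounding and its standard deviation is very close to one, whereas a single feature taken across the batch carries no such guarantee.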

- nanoGPT: https://github.com/karpathy/nanoGPT
- SentencePiece: https://github.com/google/sentencepiece
- Attention Is All You Need: https://arxiv.org/abs/1706.03762
- Training language models to follow instructions with human feedback: https://cdn.openai.com/papers/Training_language_models_to_follow_instructions_with_human_feedback.pdf
- The New Version of GPT-3 Is Much, Much Better: https://towardsdatascience.com/the-new-version-of-gpt-3-is-much-much-better-53ac95f21cfb

This is a PyTorch implementation of a language model using a transformer architecture. Here's a brief explanation of the code:

- Sets hyperparameters for the model, such as batch size, block size, and learning rate.
- Loads input text data from a file and creates character mappings.
- Splits data into training and validation sets.
- Defines a function to load batches of data for training or validation.
- Defines a function to estimate loss during training.
- Defines the self-attention head, which is a component of the transformer architecture.
- Defines the multi-head attention module, which consists of multiple self-attention heads in parallel.
- Defines the feed-forward module, which is another component of the transformer architecture.
- Defines the transformer block, which is a combination of the multi-head attention module and the feed-forward module.

Overall, this code defines a language model that uses a transformer architecture and is trained with mini-batch gradient-based optimization using the AdamW optimizer. The transformer architecture consists of multiple transformer blocks, which are composed of self-attention heads and feed-forward modules. The self-attention heads enable the model to consider the relationships between each token and the tokens that precede it in the sequence, while the feed-forward modules provide additional nonlinear transformations. The model is trained to predict the next token in a sequence given a context of previous tokens. The code prints the train and validation loss at specified intervals during training, and generates new text using the trained model.

In [ ]:

```
import torch
import torch.nn as nn
from torch.nn import functional as F
# Hyperparameters
batch_size = 16
block_size = 32
max_iters = 5000
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0
torch.manual_seed(1337)
# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
# Create a sorted list of unique characters in the text
chars = sorted(list(set(text)))
# Get the number of unique characters
vocab_size = len(chars)
# Create a dictionary mapping each character to an integer index
stoi = {ch: i for i, ch in enumerate(chars)}
# Create a dictionary mapping each integer index to a character
itos = {i: ch for i, ch in enumerate(chars)}
# Define a lambda function to encode a string as a list of integers
encode = lambda s: [stoi[c] for c in s]
# Define a lambda function to decode a list of integers as a string
decode = lambda l: ''.join([itos[i] for i in l])
# Convert the text to a tensor of integers
data = torch.tensor(encode(text), dtype=torch.long)
# Split the data into train and validation sets
n = int(0.9 * len(data)) # use first 90% for training, last 10% for validation
train_data = data[:n]
val_data = data[n:]
# Define a function to generate a small batch of input-target pairs
def get_batch(split):
    # Select either the training or validation set
    data = train_data if split == 'train' else val_data
    # Generate random indices to start each block of input
    ix = torch.randint(len(data) - block_size, (batch_size,))
    # Select block_size characters starting at each index for input
    x = torch.stack([data[i:i + block_size] for i in ix])
    # Select block_size characters starting at each index + 1 for target
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    # Send tensors to GPU if available
    x, y = x.to(device), y.to(device)
    return x, y

# Define a function to estimate the model's loss on the train and validation sets
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        # Linear transformations for key, query, and value
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Lower triangular matrix to mask future values
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        # Dropout layer
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Get batch size, sequence length, and number of features
        B, T, C = x.shape
        # Linear transformations of key and query
        k = self.key(x) # (B, T, head_size)
        q = self.query(x) # (B, T, head_size)
        # Compute attention scores ("affinities"), scaled by 1/sqrt(head_size)
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        # Mask future values
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        # Apply softmax to get attention weights
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        # Apply dropout to attention weights
        wei = self.dropout(wei)
        # Linear transformation of value
        v = self.value(x) # (B, T, head_size)
        # Weighted sum of values using attention weights
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """
    def __init__(self, num_heads, head_size):
        super().__init__() # Initialize the superclass (nn.Module)
        # Instantiate a list of head modules, and assign it to self.heads
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        # A linear transformation to project the concatenated attention heads back to the input dimension
        # (the concatenated width is num_heads * head_size, which equals n_embd when n_embd is divisible by num_heads)
        self.proj = nn.Linear(num_heads * head_size, n_embd)
        # Dropout layer to avoid overfitting
        self.dropout = nn.Dropout(dropout)
    def forward(self, x):
        # Pass the input through each attention head and concatenate the results along the feature dimension
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        # Apply the projection layer and the dropout layer to the concatenated output
        out = self.dropout(self.proj(out))
        return out
class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """
    def __init__(self, n_embd):
        super().__init__()
        # Define a sequential neural network composed of two linear layers and a ReLU activation function
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )
    def forward(self, x):
        # Forward pass through the neural network
        return self.net(x)
class Block(nn.Module):
    """ Transformer block: communication followed by computation """
    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        # Multi-head self-attention layer
        self.sa = MultiHeadAttention(n_head, head_size)
        # Feedforward layer
        self.ffwd = FeedFoward(n_embd)
        # Layer normalization layer 1
        self.ln1 = nn.LayerNorm(n_embd)
        # Layer normalization layer 2
        self.ln2 = nn.LayerNorm(n_embd)
    def forward(self, x):
        # Communication (attention) followed by computation (feedforward);
        # each sub-layer normalizes its input and adds its output back through a residual connection
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x
# Transformer language model (the class keeps the name BigramLanguageModel)
class BigramLanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Define the model architecture using embedding, multi-head attention, and linear layers
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd) # Lookup table for token embedding
        self.position_embedding_table = nn.Embedding(block_size, n_embd) # Lookup table for position embedding
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)]) # A sequence of transformer blocks
        self.ln_f = nn.LayerNorm(n_embd) # Final layer normalization
        self.lm_head = nn.Linear(n_embd, vocab_size) # Linear layer to get logits
    def forward(self, idx, targets=None):
        B, T = idx.shape
        # Embed tokens and positions
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C), pos_emb is broadcast over the batch dimension
        # Pass through the transformer blocks
        x = self.blocks(x) # (B,T,C)
        # Apply layer normalization and linear layer
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)
        # Compute the cross-entropy loss only if targets are provided
        if targets is None:
            loss = None
        else:
            # Flatten batch and time dimensions: F.cross_entropy expects (N, C) logits and (N,) targets
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        return logits, loss
    def generate(self, idx, max_new_tokens):
        # Generate new text by sampling from the learned distribution
        # idx is a (B, T) tensor of token indices in the current context
        for _ in range(max_new_tokens):
            # Crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # Get the predictions
            logits, loss = self(idx_cond)
            # Focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # Apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # Sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # Append the sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
# Instantiate the bigram language model
model = BigramLanguageModel()
# Move the model to the specified device (CPU or GPU)
m = model.to(device)
# Print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')
# Create a PyTorch optimizer with the AdamW algorithm
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
# Loop over training iterations
for iter in range(max_iters):
    # Every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
    # Sample a batch of data (inputs and targets)
    xb, yb = get_batch('train')
    # Evaluate the loss and update the model parameters
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
# Generate text from the model
# Initialize the context with a zero tensor
context = torch.zeros((1, 1), dtype=torch.long, device=device)
# Generate a sequence of tokens using the model
generated_sequence = m.generate(context, max_new_tokens=2000)
# Decode the generated sequence into a string
decoded_sequence = decode(generated_sequence[0].tolist())
# Print the generated text
print(decoded_sequence)
```
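The input/target offset used by `get_batch` in the code above can be easier to see on a toy example. The following is a minimal sketch in plain Python rather than PyTorch. The helper `sample_pairs`, the sample sentence, and the `toy_stoi`/`toy_itos` mappings are hypothetical names introduced only for illustration and are not part of the notebook's code.

```python
import random

def sample_pairs(tokens, block_size, batch_size, seed=0):
    """Return (inputs, targets); each target row is its input row shifted by one token.

    Hypothetical helper that mirrors the slicing in get_batch with Python lists.
    """
    rng = random.Random(seed)
    # Valid start indices leave room for block_size inputs plus one extra target token
    starts = [rng.randrange(len(tokens) - block_size) for _ in range(batch_size)]
    xs = [tokens[i:i + block_size] for i in starts]
    ys = [tokens[i + 1:i + block_size + 1] for i in starts]
    return xs, ys

# Toy character-level vocabulary (an assumption for this sketch)
sample_text = "to be or not to be"
chars = sorted(set(sample_text))
toy_stoi = {ch: i for i, ch in enumerate(chars)}
toy_itos = {i: ch for ch, i in toy_stoi.items()}
tokens = [toy_stoi[ch] for ch in sample_text]

xs, ys = sample_pairs(tokens, block_size=4, batch_size=3)
for x, y in zip(xs, ys):
    # Show each input block next to the target block it is trained to predict
    print(repr("".join(toy_itos[i] for i in x)), "->", repr("".join(toy_itos[i] for i in y)))
```

Each printed target is the input moved one character to the right, so position `t` of the target is the "next character" for the context ending at position `t` of the input. That is why a single block of length `block_size` provides `block_size` training examples at once.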