Notebook

GPT-like large language model development using tiny Shakespear dataset¶

Step 1.¶

The following code uses the requests library, which is a more flexible and user-friendly way to handle HTTP requests in Python. It downloads the dataset from the URL and saves it as a file called input.txt. The status message at the end of the code lets you know when the dataset has been saved.

In [ ]:

import requests

# Download the tiny shakespeare dataset
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

# Read the file
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

--2023-02-18 07:16:23--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt.2’

input.txt.2         100%[===================>]   1.06M  --.-KB/s    in 0.05s   

2023-02-18 07:16:23 (22.5 MB/s) - ‘input.txt.2’ saved [1115394/1115394]

Step 2.¶

This code prints the length of the text string, which is the contents of the input.txt file. The length of the string is computed using the built-in len function and is expressed in characters. The resulting value is then printed to the console using the print function.

In [ ]:

print("Length of the dataset in chracters: ", len(text))

Length of the dataset in chracters:  1115394

Step 3.¶

Let's check out by printing the first 500 characters of the text variable, which was previously loaded from a file named input.txt using the with open statement. The [:500] syntax is used to slice the first 500 characters of the text string.

In [ ]:

print(text[:500])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor

Step 4.¶

After that, we are generating a set of unique characters in the text, then sorts it. The sorted set of unique characters are then joined together and printed out as a string. The number of unique characters is also calculated and printed out with a formatted message "Unique characters:".

The set comprehension {char for char in text} is used to extract unique characters in the text variable, which was read from the file. The set comprehension will only include one occurrence of each character in the text, hence getting the unique characters in the text.

After getting the unique characters, the len function is used to get the length of the set of unique characters which gives us the count of unique characters. This value is stored in the vocab_size variable and is printed out with a formatted message.

In [ ]:

# Check out the unique characters that occur in this text dataset
chars = sorted({char for char in text})
vocab_size = len(chars)
print(''.join(chars))
print(f"Unique characters: {vocab_size}")

 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Unique characters: 65

Step 5.¶

As a next step, we will create two dictionaries, char_to_int and int_to_char, that map characters in the text dataset to unique integers, and vice versa, respectively. Then, two functions encode and decode are defined to convert the text dataset to a list of integers, and vice versa. The code tests the two functions by encoding the string "hello there" and decoding the result to make sure the process works as expected. The output should be the encoded list of integers for "hello there" and the decoded string "hello there".

In [ ]:

# Create mappings from characters to integers and vice versa
char_to_int = {char: index for index, char in enumerate(chars)}
int_to_char = {index: char for index, char in enumerate(chars)}

# Define encoding and decoding functions
def encode(text):
  return [char_to_int[char] for char in text]

def decode(encoded):
  return ''.join([int_to_char[index] for index in encoded])

# Test the encoding and decoding functions
encoded = encode("hello there")
print(encoded)
print(decode(encoded))

[46, 43, 50, 50, 53, 1, 58, 46, 43, 56, 43]
hello there

Step 6.¶

As we checked encoding and decoding works well, we will import the PyTorch library and create a 1-dimensional tensor (i.e. a torch.LongTensor) called data that holds the encoding of the entire text dataset.

The encoding of the text is done using the encode function that was defined earlier and takes a string as input and returns a list of integers. The resulting list is then passed to the torch.tensor function which creates a tensor from the input data. The dtype argument is set to torch.long which specifies the data type of the tensor as long integers.

Finally, the shape and data type of the tensor are printed, as well as the first 500 characters in their encoded form. The output of the print statements will give us some basic information about the tensor.

In [ ]:

# Encode the entire text dataset and save it into a torch.Tensor

import torch # PyTorch: https://pytorch.org

data = torch.tensor(encode(text), dtype=torch.long)
print(f"Shape of data tensor: {data.shape}")
print(f"Data type of data tensor: {data.dtype}")
print(f"First 500 characters in encoded form: {data[:500]}")

Shape of data tensor: torch.Size([1115394])
Data type of data tensor: torch.int64
First 500 characters in encoded form: tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
        47, 59, 57,  1, 47, 57,  1, 41, 46, 47, 43, 44,  1, 43, 52, 43, 51, 63,
         1, 58, 53,  1, 58, 46, 43,  1, 54, 43, 53, 54, 50, 43,  8,  0,  0, 13,
        50, 50, 10,  0, 35, 43,  1, 49, 52, 53, 61,  5, 58,  6,  1, 61, 43,  1,
        49, 52, 53, 61,  5, 58,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47, 58,
        47, 64, 43, 52, 10,  0, 24, 43, 58,  1, 59, 57,  1, 49, 47, 50, 50,  1,
        46, 47, 51,  6,  1, 39, 52, 42,  1, 61, 43,  5, 50, 50,  1, 46, 39, 60,
        43,  1, 41, 53, 56, 52,  1, 39, 58,  1, 53, 59, 56,  1, 53, 61, 52,  1,
        54, 56, 47, 41, 43,  8,  0, 21, 57,  5, 58,  1, 39,  1, 60, 43, 56, 42,
        47, 41, 58, 12,  0,  0, 13, 50, 50, 10,  0, 26, 53,  1, 51, 53, 56, 43,
         1, 58, 39, 50, 49, 47, 52, 45,  1, 53, 52,  5, 58, 11,  1, 50, 43, 58,
         1, 47, 58,  1, 40, 43,  1, 42, 53, 52, 43, 10,  1, 39, 61, 39, 63,  6,
         1, 39, 61, 39, 63,  2,  0,  0, 31, 43, 41, 53, 52, 42,  1, 15, 47, 58,
        47, 64, 43, 52, 10,  0, 27, 52, 43,  1, 61, 53, 56, 42,  6,  1, 45, 53,
        53, 42,  1, 41, 47, 58, 47, 64, 43, 52, 57,  8,  0,  0, 18, 47, 56, 57,
        58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 35, 43,  1, 39, 56, 43,  1,
        39, 41, 41, 53, 59, 52, 58, 43, 42,  1, 54, 53, 53, 56])

Step 7.¶

We will split the data tensor into training and validation set using a specified train ratio (0.8 in this case).

The length of the training data is calculated as the product of the length of the data tensor and the train ratio (0.8). The first part of the data tensor with length equal to the calculated training data length becomes the training data. The rest of the data tensor becomes the validation data.

The lengths of the training data and validation data are then printed out.

In [ ]:

# Split the data into training and validation sets

train_ratio = 0.8
train_data_length = int(len(data) * train_ratio)
train_data = data[:train_data_length]
val_data = data[train_data_length:]

print(f"Length of training data: {len(train_data)}")
print(f"Length of validation data: {len(val_data)}")

Length of training data: 892315
Length of validation data: 223079

Step 8.¶

Let's define a block size of 8 characters, and print the first 8 + 1 = 9 characters in the training data.

It first converts the training data tensor slice to a list using tolist() method, then passes this list to the decode function to get the characters represented by the integers. The decode function takes a list of integers, which represent characters as indices in the int_to_char mapping, and converts the indices back to characters using the int_to_char mapping.

The print statement outputs the decoded characters, allowing us to see a portion of the original text.

In [ ]:

block_size = 8
print("First", block_size + 1, "characters in training data:")
print(decode(train_data[:block_size+1].tolist()))

First 9 characters in training data:
First Cit

Step 9.¶

Also, let's split the training data into two parts x and y with block_size characters each. x contains the first block_size characters of the training data and y contains the next block_size characters.

Then, the code loops through the range t from 0 to block_size and uses t as the index to extract the context and target. The context is x[:t+1], which is a slice of the x array that contains all elements up to and including the t-th element. The target is y[t], which is the t-th element of the y array.

Finally, the code uses the decode function to convert the context and target from encoded integers back to characters. The decode function takes a list of integers as input and returns the string that corresponds to the concatenation of the characters corresponding to the integers.

The code prints the context and target for each value of t, with a message that describes what each one represents.

In [ ]:

x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")

when input is tensor([18]) the target: 47
when input is tensor([18, 47]) the target: 56
when input is tensor([18, 47, 56]) the target: 57
when input is tensor([18, 47, 56, 57]) the target: 58
when input is tensor([18, 47, 56, 57, 58]) the target: 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58

Step 10.¶

The code generates a random batch of inputs (x) and targets (y) for either the training data or validation data. The data is split based on the input argument passed to the function get_batch(). The block_size specifies the maximum context length for predictions. The batch_size specifies the number of independent sequences that will be processed in parallel.

The torch.randint function is used to generate a tensor of shape (batch_size, ) containing random integers between 0 and len(data) - block_size. These integers represent the starting indices of each sequence in the batch. The torch.stack function is used to create tensors x and y from the corresponding slices of the data tensor. x contains the blocks of size block_size from data, and y contains the blocks of size block_size shifted by one from data.

The generated inputs and targets are printed out, along with their shapes. The code then loops over the generated batch and for each sequence in the batch, it prints out the context and the corresponding target. The context is taken as the slice of the input tensor xb from the beginning of the sequence to the current time step, and the target is taken as the corresponding element in the target tensor yb. The tolist method is used to convert the tensor to a Python list. The target is then printed out.

In [ ]:

torch.manual_seed(1337)
batch_size = 4 # The number of independent sequences that will be processed in parallel
block_size = 8 # Max context length for predictions


def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(0, len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')
print('Inputs:')
print(xb.shape)
print(xb)
print('Targets:')
print(yb.shape)
print(yb)

print('----')

for b in range(batch_size):
    for t in range(block_size):
        context = xb[b, :t+1]
        target = yb[b, t]
        print(f"When input is {context.tolist()}, the target is: {target}")

Inputs:
torch.Size([4, 8])
tensor([[58, 63,  8,  0,  0, 19, 24, 27],
        [39, 59, 45, 46, 58,  1, 46, 43],
        [49, 43, 57,  1, 53, 50, 42,  1],
        [52, 41, 47, 43, 52, 58,  1, 56]])
Targets:
torch.Size([4, 8])
tensor([[63,  8,  0,  0, 19, 24, 27, 33],
        [59, 45, 46, 58,  1, 46, 43,  1],
        [43, 57,  1, 53, 50, 42,  1, 46],
        [41, 47, 43, 52, 58,  1, 56, 47]])
----
When input is [58], the target is: 63
When input is [58, 63], the target is: 8
When input is [58, 63, 8], the target is: 0
When input is [58, 63, 8, 0], the target is: 0
When input is [58, 63, 8, 0, 0], the target is: 19
When input is [58, 63, 8, 0, 0, 19], the target is: 24
When input is [58, 63, 8, 0, 0, 19, 24], the target is: 27
When input is [58, 63, 8, 0, 0, 19, 24, 27], the target is: 33
When input is [39], the target is: 59
When input is [39, 59], the target is: 45
When input is [39, 59, 45], the target is: 46
When input is [39, 59, 45, 46], the target is: 58
When input is [39, 59, 45, 46, 58], the target is: 1
When input is [39, 59, 45, 46, 58, 1], the target is: 46
When input is [39, 59, 45, 46, 58, 1, 46], the target is: 43
When input is [39, 59, 45, 46, 58, 1, 46, 43], the target is: 1
When input is [49], the target is: 43
When input is [49, 43], the target is: 57
When input is [49, 43, 57], the target is: 1
When input is [49, 43, 57, 1], the target is: 53
When input is [49, 43, 57, 1, 53], the target is: 50
When input is [49, 43, 57, 1, 53, 50], the target is: 42
When input is [49, 43, 57, 1, 53, 50, 42], the target is: 1
When input is [49, 43, 57, 1, 53, 50, 42, 1], the target is: 46
When input is [52], the target is: 41
When input is [52, 41], the target is: 47
When input is [52, 41, 47], the target is: 43
When input is [52, 41, 47, 43], the target is: 52
When input is [52, 41, 47, 43, 52], the target is: 58
When input is [52, 41, 47, 43, 52, 58], the target is: 1
When input is [52, 41, 47, 43, 52, 58, 1], the target is: 56
When input is [52, 41, 47, 43, 52, 58, 1, 56], the target is: 47

In [ ]:

print(xb) # My input to the transformer

tensor([[58, 63,  8,  0,  0, 19, 24, 27],
        [39, 59, 45, 46, 58,  1, 46, 43],
        [49, 43, 57,  1, 53, 50, 42,  1],
        [52, 41, 47, 43, 52, 58,  1, 56]])

Step 11.¶

Now, we will introduce a Bigram Language Model using PyTorch. The Bigram Language Model is a type of language model that predicts the next word in a sentence based on the current word.

The model is implemented as a custom PyTorch nn.Module named BigramLanguageModel.

The model has two main components:

The token embedding table, implemented as an instance of nn.Embedding with vocab_size output features, which is used to embed each token (word) in the input sequence into a dense representation.
The forward method, which takes an input sequence idx of token indices (integers) and optional targets (also an integer sequence), and computes the logits for the next token predictions and the cross-entropy loss if targets are provided. The logits are computed by passing the input sequence through the token embedding table.

The model also has a generate method, which generates a sequence of new tokens based on an initial context idx and a specified number of tokens max_new_tokens to generate. The method uses the forward method to obtain the logits for each new token, applies softmax to obtain probabilities, and samples from the distribution to obtain the next token index. The generated token indices are concatenated to the input sequence to obtain the new context for the next prediction.

In the code, an instance of the BigramLanguageModel is created with a specified vocab_size and is used to compute the logits and loss for an input sequence xb and target sequence yb, and to generate a new sequence of tokens with an initial context of a single token with an index of 0 and 100 new tokens to generate.

In [ ]:

import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.embedding(idx)
        B, T, C = logits.shape
        logits = logits.reshape(B * T, C)
        if targets is None:
            return logits, None
        targets = targets.reshape(-1)
        loss = F.cross_entropy(logits, targets)
        return logits, loss
    
    def generate(self, idx, max_new_tokens):
        for i in range(max_new_tokens):
            logits = self.embedding(idx)
            logits = logits[:, -1, :]
            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, next_token), dim=1)
        return idx

m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss)

print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))

torch.Size([32, 65])
tensor(5.0493, grad_fn=<NllLossBackward0>)

SKIcLT;AcELMoTbvZv C?nq-QE33:CJqkOKH-q;:la!oiywkHjgChzbQ?u!3bLIgwevmyFJGUGp
wnYWmnxKWWev-tDqXErVKLgJ

Step 12.¶

Here, we create an instance of the AdamW optimizer in PyTorch. The optimizer will adjust the parameters of the model m to minimize the loss during training.

The AdamW optimizer is a variant of the popular Adam optimizer that incorporates weight decay, which helps to regularize the model to prevent overfitting. The optimizer takes as input the parameters of the model m and sets their learning rate to 0.001.

In [ ]:

# Create a PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=0.001)

Step 13.¶

After then, we will train a PyTorch model using early stopping and TensorBoard logging.

A PyTorch AdamW optimizer is created with a learning rate of 0.001 for the model parameters.
A TensorBoard writer is created to log the training and validation losses.
The code trains the model in a loop of 100 steps (max_steps) and evaluates the model on the validation set at each step.
The best validation loss is recorded and compared with the current validation loss.
If the current validation loss is worse than the best validation loss for a number of early_stop_steps, the training stops and the loop exits.
The training loss is printed and logged in TensorBoard every 10 steps.
After the loop, the TensorBoard writer is closed. The final train loss is printed.

In [ ]:

from torch.utils.tensorboard import SummaryWriter

batch_size = 32
max_steps = 100 # Increase number of steps for better results... 
early_stop_steps = 10

# Create a TensorBoard writer
writer = SummaryWriter()

# Keep track of the best validation loss
best_val_loss = float("inf")

# Early stopping counter
early_stop_counter = 0

for steps in range(max_steps):
    # Obtain a batch of data as sample
    xb, yb = get_batch('train')

    # Loss evaluation
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    if steps % 10 == 0:
        print("Step {}: Train Loss: {}".format(steps, loss.item()))
        writer.add_scalar('train_loss', loss.item(), steps)
    
    # Evaluate the model on the validation set
    with torch.no_grad():
        xb, yb = get_batch('val')
        logits, val_loss = m(xb, yb)
        writer.add_scalar('val_loss', val_loss.item(), steps)
        
        if val_loss.item() < best_val_loss:
            best_val_loss = val_loss.item()
            early_stop_counter = 0
        else:
            early_stop_counter += 1
        
        if early_stop_counter >= early_stop_steps:
            print("Early stopping at step {}".format(steps))
            break

writer.close()
print("Final Train Loss: {}".format(loss.item()))

Step 0: Train Loss: 4.679175853729248
Step 10: Train Loss: 4.761965751647949
Early stopping at step 10
Final Train Loss: 4.761965751647949

Step 14.¶

Finally, we will generate text using a language model m, and prints the decoded result. The function m.generate generates text with the given input idx, which is an array of shape (1, 1) containing the starting index (e.g., the first word of the text), encoded as an integer. The argument max_new_tokens=500 specifies the maximum number of tokens to generate in the output text. The generated text is then converted to a list of integers, [0].tolist(). Finally, the function decode is applied to the list of integers to convert the encoded text back to a string of human-readable words.

In [ ]:

print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=500)[0].tolist()))

dzX:COtRIDiOskzytNCxrfSjum;auw$CA'Oc!PO;DT:CSGKzzYC33,i!!:'ruHAUQJ;ZJi?NpHFP,h?jBjagJ.xYDHM,3-gnQQbjmJGGJEHxr'BirVMGXrvU,ZL$HA-!uD.vcWRlgH-s&LC,e?OJMo?yLTQb?qx;xzW f$-FZv$;.igBjU'AXgF-bGN&ZmZb&yFCaPZSJA'rA'KHx?w$YHAGjHRURSPwHo-W:MlapJ.
jxLvUAQmZBL&$zdbA!BCPjxTfiKmJMQTFafjxI!udZV,SPGGSPlyYWNT;a;Q-BGrIu$Ca'PTR C&,SywwcPyFWgC3ryxfNd?EX&jF.WCq;3fq-ofcla!--UG&SBoiw'rt,rcIcmYcLC?OLfpOpX-ZK;vm,lDW?nZTbmJJrYdYZTH!abIJ&sXcoUEXrUZVm;K:vi-vTJaMPiH-UnZ??yFk$cOKBjThuq.ywEb$zLTQgUZayZ!pzd,RL&evVjZUAElx;pgOYPh

Some short examples¶

(1) Example 1: Brief illustration of how matrix multiplication can perform a weighted combination¶

The following creates a lower triangular matrix of ones (a) and normalizes the matrix along the rows. Then, it generates random weights from a normal distribution (b) and computes the weighted aggregation of a and b using matrix multiplication. The results are then printed.

In [ ]:

# Set random seed for reproducibility
torch.manual_seed(42)

# Create a lower triangular matrix of ones
a = torch.tril(torch.ones(3, 3))

# Normalize the lower triangular matrix along the rows
a = a / torch.sum(a, 1, keepdim=True)

# Generate random weights from a normal distribution
b = torch.randn(3, 2).float()

# Compute the weighted aggregation using matrix multiplication
c = a @ b

# Print results
print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)

a=
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
--
b=
tensor([[ 0.3367,  0.1288],
        [ 0.2345,  0.2303],
        [-1.1229, -0.1863]])
--
c=
tensor([[ 0.3367,  0.1288],
        [ 0.2856,  0.1796],
        [-0.1839,  0.0576]])

Here, we set the random seed for reproducibility, generate a tensor with random values (x) of shape (B, T, C), and prints the shape of the tensor. The code then attempts to compute the cumulative sum of x along the second dimension (time) and divide it by the range of the time steps.

In [ ]:

# Set random seed for reproducibility
torch.manual_seed(1337)

# Define the batch size (B), time steps (T), and number of channels (C)
B, T, C = 4, 8, 2

# Generate a tensor with random values, with shape (B, T, C)
x = torch.randn(B, T, C)

# Print the shape of the tensor
print("Shape of the tensor:", x.shape)

Shape of the tensor: torch.Size([4, 8, 2])

The code calculates a moving average of a 3D tensor x with dimensions [B, T, C] along the time axis (T).

xbow is initialized as a 3D tensor of zeros with the same dimensions as x.

The calculation is performed by dividing the cumulative sum of the elements of x along the T axis by a sequence of integers from 1 to T. The cumsum() method is used to compute the cumulative sum along the T axis of x. The method returns a tensor with the same dimensions as x. The sequence of integers is created using torch.arange(), which returns a 1D tensor of consecutive integers.

The resulting tensor is a 3D tensor of the same shape as x, where xbow[b,t] contains the average of x[b,:t+1].

In [ ]:

# We want x[b,t] = mean_{i<=t} x[b,i]
xbow = torch.zeros((B,T,C))

# for b in range(B):
#    for t in range(T):
#        xprev = x[b,:t+1] # (t,C)
#        xbow[b,t] = torch.mean(xprev, 0)

xbow = x.cumsum(1) / torch.arange(1, T+1, dtype=torch.float32).unsqueeze(0).unsqueeze(2)

(2) Example 2: Using matrix multiply for a weighted aggregation¶

This code generates a tensor wei that is a lower triangular matrix with ones and then normalizes it along the rows. It then generates a tensor x with shape (B, T, C) and performs a matrix multiplication between wei and x using @. The resulting tensor xbow2 has the same shape as x and each element xbow2[b, t] is the weighted average of all the previous elements in x[b] up to and including x[b, t], where the weights are given by the corresponding element in the t-th row of wei. Finally, it checks if xbow and xbow2 are element-wise close using the torch.allclose() function.

In [ ]:

wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x # (B, T, T) @ (B, T, C) ----> (B, T, C)
torch.allclose(xbow, xbow2)

Out[ ]:

True

(3) Example 3: Using Softmax¶

The provided code appears to be calculating the weights for each time step in the tensor x using a mask of a lower triangular matrix and a subsequent softmax operation to obtain weights that sum to 1 across each row. These weights are then used to obtain a weighted average of the values in x for each time step.

In [ ]:

# Create a mask for the lower triangular matrix of shape (T, T)
mask = torch.tril(torch.ones(T, T)).bool()

# Set the values outside the lower triangle to -inf to ensure zero weights
mask = mask.float().masked_fill(~mask, float('-inf'))

# Apply softmax to the masked values along the last dimension to obtain the weights
weights = F.softmax(mask, dim=-1)

# Obtain the weighted average of x using the computed weights
xbow3 = torch.matmul(weights, x)

# Check if xbow and xbow3 are equal
torch.allclose(xbow, xbow3)

Out[ ]:

True

(4) Example 4: Self-attention¶

The code defines a single head of self-attention using linear transformations for the key, query, and value inputs. The input tensor x has shape (B, T, C) representing a batch of sequences with T timesteps and C features. The key and query are transformed using a linear layer with C input channels and head_size output channels, while the value is transformed using a linear layer with C input channels and head_size output channels.

The dot product of the transformed query and key, divided by the square root of head_size, gives an attention weight matrix of shape (B, T, T). A lower triangular mask is applied to the attention weight matrix to ensure that information flows only from the past to the present. The softmax operation is then applied along the last dimension of the resulting attention weight matrix to obtain a probability distribution. Finally, the output is obtained by multiplying the probability distribution with the value tensor.

The output has shape (B, T, head_size).

In [ ]:

# Set the random seed to a fixed value for reproducibility
torch.manual_seed(1337)

# Define the batch size (B), number of time steps (T), and number of channels (C)
B,T,C = 4,8,32

# Generate a tensor with random values of shape (B, T, C)
x = torch.randn(B,T,C)

# Define the size of each head in the attention mechanism
head_size = 16

# Define the linear layers for the keys, queries, and values
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

# Apply the linear layer for the keys to the input tensor to get k of shape (B, T, head_size)
k = key(x)

# Apply the linear layer for the queries to the input tensor to get q of shape (B, T, head_size)
q = query(x)

# Calculate the attention weights by multiplying q and k transposed together. This produces a tensor wei of shape (B, T, T).
wei = q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)

# Create a lower triangular matrix with ones and zero out the upper triangular part
tril = torch.tril(torch.ones(T, T))

# Replace the 0 elements in the triangular part of the attention weights with negative infinity
wei = wei.masked_fill(tril == 0, float('-inf'))

# Apply the softmax function along the last dimension of wei to get the final attention weights
wei = F.softmax(wei, dim=-1)

# Apply the linear layer for the values to the input tensor to get v of shape (B, T, head_size)
v = value(x)

# Compute the weighted sum of the values with the attention weights to get the output tensor of shape (B, T, head_size)
out = wei @ v

# Print the shape of the output tensor
print(out.shape)

torch.Size([4, 8, 16])

In [ ]:

print(wei[0])

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
        [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
        [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
        [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],
       grad_fn=<SelectBackward0>)

Notes:

Attention is a communication mechanism. Can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.
There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.
Each example across batch dimension is of course processed completely independently and never "talk" to each other
In an "encoder" attention block just delete the single line that does masking with tril, allowing all tokens to communicate. This block here is called a "decoder" attention block because it has triangular masking, and is usually used in autoregressive settings, like language modeling.
"self-attention" just means that the keys and values are produced from the same source as queries. In "cross-attention", the queries still get produced from x, but the keys and values come from some other, external source (e.g. an encoder module)
"Scaled" attention additional divides wei by 1/sqrt(head_size). This makes it so when input Q,K are unit variance, wei will be unit variance too and Softmax will stay diffuse and not saturate too much. Illustration below

The provided code computes the scaled dot-product attention between two tensors k and q, and applies the softmax function on the output to get the attention weights.

We calculate the scale factor outside the product of q and k and then multiply the product by the scale factor.
We explicitly cast the integer head_size to float32 using torch.float32.

We print the variance of k, q, and wei to ensure they have the expected values. We use the softmax() function from PyTorch instead of performing the softmax operation manually. We also use the softmax() function with a larger scaling factor (8 in this case), which results in a smoother attention distribution.

In [ ]:

k = torch.randn(B, T, head_size)
q = torch.randn(B, T, head_size)
scale = torch.rsqrt(torch.tensor(head_size, dtype=torch.float32))
wei = q @ k.transpose(-2, -1) * scale

print("variance of k:", k.var())
print("variance of q:", q.var())
print("variance of wei:", wei.var())

# softmax over last dimension
softmaxed_wei = torch.softmax(wei, dim=-1)

# a different softmax that reduces the "peakiness" of the attention distribution
sm = torch.nn.Softmax(dim=-1)
scaled_sm_wei = sm(wei * 8)

variance of k: tensor(0.9006)
variance of q: tensor(1.0037)
variance of wei: tensor(0.9957)

In [ ]:

torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)

Out[ ]:

tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])

In [ ]:

torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])*8, dim=-1)

Out[ ]:

tensor([0.0326, 0.0030, 0.1615, 0.0030, 0.8000])

The code defines a LayerNorm1d class, which is used to implement Layer Normalization for 1D input tensors. The __init__ method sets the values of the eps parameter and the learnable parameters gamma and beta. The __call__ method takes an input tensor x, computes the layer normalization operation, and returns the normalized output. The parameters method returns the learnable parameters of the layer.

The code initializes an instance of the LayerNorm1d class with 100 dimensions and applies it to a 32 x 100 input tensor. The output tensor has the same shape as the input tensor.

In [ ]:

class LayerNorm1d:
    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.gamma = torch.ones(dim)
        self.beta = torch.zeros(dim)
        
    def __call__(self, x):
        # Calculate the forward pass
        xmean = x.mean(1, keepdim=True)  # Batch mean
        xvar = x.var(1, keepdim=True)  # Batch variance
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps)  # Normalize to unit variance
        self.out = self.gamma * xhat + self.beta
        return self.out
    
    def parameters(self):
        return [self.gamma, self.beta]

# Instantiate the module
torch.manual_seed(1337)
module = LayerNorm1d(100)

# Generate a batch of size 32 with 100-dimensional vectors
x = torch.randn(32, 100)

# Apply the layer normalization to the batch
x = module(x)

# Print the shape of the output tensor
print(f"Output shape: {x.shape}")

Output shape: torch.Size([32, 100])

In [ ]:

# Compute the mean and standard deviation of the first feature across all inputs in the batch
batch_feature_mean, batch_feature_std = x[:, 0].mean(), x[:, 0].std()
print(f"Batch feature mean: {batch_feature_mean}, Batch feature std: {batch_feature_std}")

# Compute the mean and standard deviation of all features for a single input in the batch
single_input_mean, single_input_std = x[0, :].mean(), x[0, :].std()
print(f"Single input mean: {single_input_mean}, Single input std: {single_input_std}")

Batch feature mean: 0.14685693383216858, Batch feature std: 0.8803138732910156
Single input mean: -9.53674295089968e-09, Single input std: 0.9999954700469971

References¶

nanoGPT: https://github.com/karpathy/nanoGPT
SentencePiece: https://github.com/google/sentencepiece
Attention Is All You Need: https://arxiv.org/abs/1706.03762
Training language models to follow instructions with human feedback: https://cdn.openai.com/papers/Training_language_models_to_follow_instructions_with_human_feedback.pdf
The New Version of GPT-3 Is Much, Much Better: https://towardsdatascience.com/the-new-version-of-gpt-3-is-much-much-better-53ac95f21cfb

As a Whole¶

This is a PyTorch implementation of a language model using a transformer architecture. Here's a brief explanation of the code:

Sets hyperparameters for the model, such as batch size, block size, and learning rate.
Loads input text data from a file and creates character mappings.
Splits data into training and validation sets.
Defines a function to load batches of data for training or validation.
Defines a function to estimate loss during training.
Defines the self-attention head, which is a component of the transformer architecture.
Defines the multi-head attention module, which consists of multiple self-attention heads in parallel.
Defines the feed-forward module, which is another component of the transformer architecture.
Defines the transformer block, which is a combination of the self-attention head and feed-forward module.

Overall, this code defines a language model that uses a transformer architecture and is trained using stochastic gradient descent with the AdamW optimizer. The transformer architecture consists of multiple transformer blocks, which are composed of self-attention heads and feed-forward modules. The self-attention heads enable the model to consider the relationships between all tokens in a sequence, while the feed-forward modules provide additional nonlinear transformations. The model is trained to predict the next token in a sequence given a context of previous tokens. The code prints the train and validation loss at specified intervals during training, and generates new text using the trained model.

In [ ]:

import torch
import torch.nn as nn
from torch.nn import functional as F

# Hyperparameters
batch_size = 16 
block_size = 32 
max_iters = 5000
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0

torch.manual_seed(1337)

# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Create a sorted list of unique characters in the text
chars = sorted(list(set(text)))

# Get the number of unique characters
vocab_size = len(chars)

# Create a dictionary mapping each character to an integer index
stoi = {ch: i for i, ch in enumerate(chars)}

# Create a dictionary mapping each integer index to a character
itos = {i: ch for i, ch in enumerate(chars)}

# Define a lambda function to encode a string as a list of integers
encode = lambda s: [stoi[c] for c in s]

# Define a lambda function to decode a list of integers as a string
decode = lambda l: ''.join([itos[i] for i in l])

# Convert the text to a tensor of integers
data = torch.tensor(encode(text), dtype=torch.long)

# Split the data into train and validation sets
n = int(0.9 * len(data))  # use first 90% for training, last 10% for validation
train_data = data[:n]
val_data = data[n:]

# Define a function to generate a small batch of input-target pairs
def get_batch(split):
    # Select either the training or validation set
    data = train_data if split == 'train' else val_data
    
    # Generate random indices to start each block of input
    ix = torch.randint(len(data) - block_size, (batch_size,))
    
    # Select block_size characters starting at each index for input
    x = torch.stack([data[i:i + block_size] for i in ix])
    
    # Select block_size characters starting at each index + 1 for target
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    
    # Send tensors to GPU if available
    x, y = x.to(device), y.to(device)
    return x, y

# Define a function to estimate the model's loss on the train and validation sets
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out


class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()

        # Linear transformations for key, query, and value
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)

        # Lower triangular matrix to mask future values
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        # Dropout layer
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Get batch size, sequence length, and number of features
        B, T, C = x.shape

        # Linear transformations of key and query
        k = self.key(x)   # (B,T,C)
        q = self.query(x) # (B,T,C)

        # Compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)

        # Mask future values
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)

        # Apply softmax to get attention weights
        wei = F.softmax(wei, dim=-1) # (B, T, T)

        # Apply dropout to attention weights
        wei = self.dropout(wei)

        # Linear transformation of value
        v = self.value(x) # (B,T,C)

        # Weighted sum of values using attention weights
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)

        return out


class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()  # Initialize the superclass (nn.Module)
        # Instantiate a list of head modules, and assign it to self.heads
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        # A linear transformation to project the concatenated attention heads back to the input dimension
        self.proj = nn.Linear(n_embd, n_embd)
        # Dropout layer to avoid overfitting
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Apply the self-attention mechanism by passing the input through each attention head and concatenate the results along the feature dimension
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        # Apply the projection layer and the dropout layer to the concatenated output
        out = self.dropout(self.proj(out))
        return out


class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()

        # Define a sequential neural network composed of two linear layers and a ReLU activation function
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        # Forward pass through the neural network
        return self.net(x)


class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        
        # Multi-head self-attention layer
        self.sa = MultiHeadAttention(n_head, head_size)
        
        # Feedforward layer
        self.ffwd = FeedFoward(n_embd)
        
        # Layer normalization layer 1
        self.ln1 = nn.LayerNorm(n_embd)
        
        # Layer normalization layer 2
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        # Communication followed by computation
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x


# super simple bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # Define the model architecture using embedding, multi-head attention, and linear layers
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)  # Lookup table for token embedding
        self.position_embedding_table = nn.Embedding(block_size, n_embd)  # Lookup table for position embedding
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])  # A sequence of transformer blocks
        self.ln_f = nn.LayerNorm(n_embd)  # Layer normalization
        self.lm_head = nn.Linear(n_embd, vocab_size)  # Linear layer to get logits

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # Embed tokens and positions
        tok_emb = self.token_embedding_table(idx)  # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))  # (T,C)
        x = tok_emb + pos_emb  # (B,T,C)

        # Pass through the transformer blocks
        x = self.blocks(x)  # (B,T,C)

        # Apply layer normalization and linear layer
        x = self.ln_f(x)  # (B,T,C)
        logits = self.lm_head(x)  # (B,T,vocab_size)

        # Compute the cross-entropy loss if targets are provided
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # Generate new text by sampling from the learned distribution
        for _ in range(max_new_tokens):
            # Crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # Get the predictions
            logits, loss = self(idx_cond)
            # Focus only on the last time step
            logits = logits[:, -1, :]  # becomes (B, C)
            # Apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1)  # (B, C)
            # Sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            # Append the sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1)
        return idx

# Instantiate the bigram language model
model = BigramLanguageModel()  

# Move the model to the specified device (CPU or GPU)
m = model.to(device)  

# Print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')  

# Create a PyTorch optimizer with the AdamW algorithm
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# Loop over training iterations
for iter in range(max_iters):  

    # Every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # Sample a batch of data (inputs and targets)
    xb, yb = get_batch('train')

    # Evaluate the loss and update the model parameters
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# Generate text from the model

# Initialize the context with a zero tensor
context = torch.zeros((1, 1), dtype=torch.long, device=device)  

# Generate a sequence of tokens using the model
generated_sequence = m.generate(context, max_new_tokens=2000)

# Decode the generated sequence into a string
decoded_sequence = decode(generated_sequence[0].tolist())  

# Print the generated text
print(decoded_sequence)

0.209729 M parameters
step 0: train loss 4.4116, val loss 4.4022
step 100: train loss 2.6568, val loss 2.6670
step 200: train loss 2.5090, val loss 2.5058
step 300: train loss 2.4198, val loss 2.4340
step 400: train loss 2.3503, val loss 2.3567
step 500: train loss 2.2970, val loss 2.3136
step 600: train loss 2.2410, val loss 2.2506
step 700: train loss 2.2062, val loss 2.2198
step 800: train loss 2.1638, val loss 2.1871
step 900: train loss 2.1232, val loss 2.1494
step 1000: train loss 2.1020, val loss 2.1293
step 1100: train loss 2.0704, val loss 2.1196
step 1200: train loss 2.0382, val loss 2.0798
step 1300: train loss 2.0249, val loss 2.0640
step 1400: train loss 1.9922, val loss 2.0354
step 1500: train loss 1.9707, val loss 2.0308
step 1600: train loss 1.9614, val loss 2.0474
step 1700: train loss 1.9393, val loss 2.0130
step 1800: train loss 1.9070, val loss 1.9943
step 1900: train loss 1.9057, val loss 1.9871
step 2000: train loss 1.8834, val loss 1.9954
step 2100: train loss 1.8719, val loss 1.9758
step 2200: train loss 1.8582, val loss 1.9623
step 2300: train loss 1.8546, val loss 1.9517
step 2400: train loss 1.8410, val loss 1.9476
step 2500: train loss 1.8167, val loss 1.9455
step 2600: train loss 1.8263, val loss 1.9401
step 2700: train loss 1.8108, val loss 1.9340
step 2800: train loss 1.8040, val loss 1.9247
step 2900: train loss 1.8044, val loss 1.9304
step 3000: train loss 1.7963, val loss 1.9242
step 3100: train loss 1.7687, val loss 1.9147
step 3200: train loss 1.7547, val loss 1.9102
step 3300: train loss 1.7557, val loss 1.9037
step 3400: train loss 1.7547, val loss 1.8946
step 3500: train loss 1.7385, val loss 1.8968
step 3600: train loss 1.7260, val loss 1.8914
step 3700: train loss 1.7257, val loss 1.8808
step 3800: train loss 1.7204, val loss 1.8919
step 3900: train loss 1.7215, val loss 1.8788
step 4000: train loss 1.7146, val loss 1.8639
step 4100: train loss 1.7095, val loss 1.8724
step 4200: train loss 1.7079, val loss 1.8707
step 4300: train loss 1.7035, val loss 1.8502
step 4400: train loss 1.7043, val loss 1.8693
step 4500: train loss 1.6914, val loss 1.8522
step 4600: train loss 1.6853, val loss 1.8357
step 4700: train loss 1.6862, val loss 1.8483
step 4800: train loss 1.6671, val loss 1.8434
step 4900: train loss 1.6736, val loss 1.8415
step 4999: train loss 1.6635, val loss 1.8226

FlY BOLINGLO:
Them thrumply towiter arts the
muscue rike begatt the sea it
What satell in rowers that some than othis Marrity.

LUCENTVO:
But userman these that, where can is not diesty rege;
What and see to not. But's eyes. What?

JOHN MARGARET:
Than up I wark, what out, I ever of and love,
one these do sponce, vois I me;
But my pray sape to ries all to the not erralied in may.

BENVOLIO:
To spits as stold's bewear I would and say mesby all
on sworn make he anough
As cousins the solle, whose be my conforeful may lie them yet
nobe allimely untraled to be thre I say be,
Notham a brotes theme an make come,
And that his reach to the duke ento
the grmeants bell! and now there king-liff-or grief?

GLOUCESTER:
All the bettle dreene, for To his like thou thron!

MENENIUS:
Then, if I knom her all.
My lord, but terruly friend
Rish of the ploceiness and wilt tends sure?
Is you knows a fasir wead
That with him my spaut,
I shall not tas where's not, becomity; my coulds sting,
then the wit be dong to tyget our hereefore,
Who strop me, mend here, if agains, bitten, thy lack.
The but these it were is tus. For the her skeep the fasting. joy tweet Bumner:-
How the enclady: It you and how,
I am in him, And ladderle:
Their hand whose wife, it my hithre,
Roman and where sposs gives'd you.

TROMIOLANUS:
But livants you great, I shom mistrot come, for to she to lot
for smy to men ventry mehus. Gazise;
Full't were some the cause, and stouch set,
Or promises, which a kingsasted to your gove them; and sterrer,
And that wae love him.

BRUTUS:
You shape with these sweet.

CORTENGONO:
Lo, where 'twon elmes, 'morth young agres;
Sir, azavoust to striel accurded we missery sets crave.

ANGOLUM:
For is Henry to have gleise the dreason
That I ant shorfold wefth their servy in enscy.

ISABELLA:
O, I better you eyse such formfetrews.

BUCKINGHARENT:
Qead my lightle this righanneds flase them
Wam which an take was our some pleasurs,
Lovisoname to me, then fult me?--have it?

HENRY BOLINGBROY:
That wha