This notebook focuses on the preprocessing techniques applied to raw text before it is input into the encoder and decoder blocks of the Transformer architecture. The objective is to explore the key preprocessing steps, particularly the role of positional encodings, and to understand their impact on the overall model performance.
Upon completing this study, we will be able to:

- Implement and visualize the sinusoidal positional encoding matrix.
- Examine key properties of positional encodings, such as their constant norm and the stable distance between encodings separated by a fixed offset.
- Combine positional encodings with pretrained word embeddings and observe how different relative weights affect the resulting representations.
%load_ext autoreload
%autoreload 2
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import os
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.decomposition import PCA
The positional encodings are derived using the following formulas:
$$ PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{d}}}\right) $$

$$ PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{d}}}\right) $$

In natural language processing (NLP), it is standard practice to convert sentences into tokens before inputting them into a language model. Each token is then transformed into a numerical vector of fixed length, referred to as an embedding, which encapsulates the semantic meaning of the word. In the Transformer architecture, a positional encoding vector is added to the embedding to incorporate positional information, allowing the model to account for the order of words in the sequence.
While these positional encoding vectors are challenging to interpret directly through their numerical representations, visualizations can provide valuable insights into their behavior. For instance, when embeddings are reduced to two dimensions and plotted, semantically similar words are typically located closer together, while dissimilar words are plotted farther apart. Similarly, when visualizing positional encoding vectors, words that are closer together in a sentence should appear closer on a Cartesian plane, whereas words that are farther apart should be more distanced.
In this notebook, we will generate a series of visualizations of both word embeddings and positional encoding vectors. These visualizations will help us develop a deeper intuition regarding the effect of positional encodings on word embeddings and their role in maintaining sequential information throughout the Transformer model.
The positional_encoding function is implemented below. Building upon this, we will create visualizations to further explore the behavior of positional encodings.
def positional_encoding(positions, d):
    """
    Precomputes a matrix with all the positional encodings.

    Arguments:
        positions (int) -- Maximum number of positions to be encoded.
        d (int) -- Encoding size (dimensionality of the positional encoding).

    Returns:
        pos_encoding -- (1, positions, d) A matrix containing the positional encodings.
    """
    # Initialize a matrix of angles: one row per position, one column per encoding dimension
    angle_rads = np.arange(positions)[:, np.newaxis] / np.power(
        10000, (2 * (np.arange(d)[np.newaxis, :] // 2)) / np.float32(d))
    # Apply sin to the even indices and cos to the odd indices of the encoding dimension
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
    # Add a batch dimension so the encoding can be broadcast over a batch of sequences
    pos_encoding = angle_rads[np.newaxis, ...]
    return tf.cast(pos_encoding, dtype=tf.float32)
Next, we define the embedding dimension as 100, which matches the dimensionality of the GloVe word embeddings used later in this notebook. According to the "Attention is All You Need" paper, embedding sizes typically range from 100 to 1024, depending on the specific task. The paper also uses maximum sequence lengths ranging from 40 to 512; for this exercise, we set the maximum sequence length to 100. Finally, the maximum vocabulary size for the tokenizer (MAX_NB_WORDS) is set to 64.
EMBEDDING_DIM = 100
MAX_SEQUENCE_LENGTH = 100
MAX_NB_WORDS = 64
pos_encoding = positional_encoding(MAX_SEQUENCE_LENGTH, EMBEDDING_DIM)
# Visualizing the positional encoding matrix
plt.pcolormesh(pos_encoding[0], cmap='RdBu')
plt.xlabel('d')
plt.xlim((0, EMBEDDING_DIM))
plt.ylabel('Position')
plt.colorbar()
plt.show()
Having implemented the visualization, we now explore some interesting properties of the positional encoding matrix:

- The norm of each positional encoding vector is constant, independent of the position (pos). For example, when evaluating the norm of the positional encoding at pos = 34, we observe a value of approximately 7.071068, and the same value is obtained at any other position.

pos = 34
tf.norm(pos_encoding[0,pos,:])
<tf.Tensor: shape=(), dtype=float32, numpy=7.0710673332214355>
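As a quick sanity check, the norm can be evaluated at a few other positions. Because each of the d/2 sine/cosine pairs contributes sin² + cos² = 1, the squared norm is always d/2 = 50, so the norm is √50 ≈ 7.0710678. The following is a minimal sketch that reuses pos_encoding and EMBEDDING_DIM from the cells above:

# Sketch: the norm of the positional encoding is sqrt(d/2) at every position
for p in [0, 10, 34, 99]:
    print(p, tf.norm(pos_encoding[0, p, :]).numpy())
print('sqrt(d/2) =', np.sqrt(EMBEDDING_DIM / 2))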
- The norm of the difference between two positional encoding vectors separated by k positions also remains constant. If we fix k and vary pos, the difference between the two encoding vectors stays nearly the same. This property underscores the importance of relative position information: the difference between encodings does not depend on the absolute positions of the words, only on their relative separation. Being able to express positional encodings as linear functions of one another helps the model learn by focusing on the relative positions of words. Capturing the difference in word positions through vector encodings is difficult to achieve, especially because the values of the encodings must remain small enough that they do not distort the word embeddings.
pos = 70
k = 2
print(tf.norm(pos_encoding[0,pos,:] - pos_encoding[0,pos + k,:]))
tf.Tensor(3.2668781, shape=(), dtype=float32)
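To confirm that this difference depends only on the offset k and not on the absolute position, we can repeat the measurement at several positions. Below is a small sketch, again reusing pos_encoding from above:

# Sketch: for a fixed offset k, the norm of the difference is the same at every position
k = 2
for pos in [0, 20, 50, 70, 90]:
    diff_norm = tf.norm(pos_encoding[0, pos, :] - pos_encoding[0, pos + k, :])
    print(f"pos={pos:2d}, k={k}: {diff_norm.numpy():.6f}")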
Through these observations, we gain a deeper understanding of the properties of positional encoding vectors and their implications for the Transformer architecture. The constancy of the vector norms and the stability of the difference norms are essential for the model to maintain consistent and meaningful representations of word order, facilitating the model's ability to learn from sequential data without introducing distortion into the word embeddings.
In the following sections, we will continue to explore how these properties affect the relationships between word embeddings and positional encodings, enhancing our intuition of how the Transformer model processes text sequences.
The positional encoding matrix allows us to visualize the unique vector representation for each position in a sequence. However, it is essential to understand how these vectors capture the relative positions of words within a sentence. To explore this, we calculate the correlation between pairs of positional encoding vectors at every position.
A well-constructed positional encoding should produce a perfectly symmetric correlation matrix where the values are maximized along the main diagonal. This indicates that vectors at similar positions in the sequence have the highest correlation. As we move away from the diagonal, the correlation values should decrease, reflecting the decreasing similarity between vectors corresponding to positions farther apart in the sentence.
The following code computes the dot product between the positional encoding vectors at every pair of positions; since all of the vectors share the same norm, this dot product is proportional to the cosine similarity and serves as our correlation measure:
# Compute the correlation matrix for positional encodings
corr = tf.matmul(pos_encoding, pos_encoding, transpose_b=True).numpy()[0]
# Visualizing the correlation matrix
plt.pcolormesh(corr, cmap='RdBu')
plt.xlabel('Position')
plt.xlim((0, MAX_SEQUENCE_LENGTH))
plt.ylabel('Position')
plt.colorbar()
plt.show()
The resulting plot illustrates the correlation between positional encodings. The main diagonal should exhibit maximum values, indicating that vectors corresponding to positions that are closer in the sequence are highly correlated. As we move away from the diagonal, the correlation should decrease, reflecting the relative positions of the words in the sentence.
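A quick programmatic check (a small sketch reusing the corr matrix computed above) can confirm that every row attains its maximum on the main diagonal:

# Sketch: verify that each row of the dot-product matrix peaks on the diagonal
diag_is_max = np.all(np.argmax(corr, axis=1) == np.arange(MAX_SEQUENCE_LENGTH))
print('Diagonal holds the maximum of every row:', diag_is_max)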
In addition to correlation, we can also compare positional encoding vectors using the Euclidean distance. Unlike the correlation matrix, the Euclidean distance matrix will have zero values along the main diagonal, and the values off the diagonal will increase as the positional encodings represent words further apart in the sentence.
The following code computes the Euclidean distance between positional encoding vectors at every position:
# Compute the Euclidean distance matrix for positional encodings
eu = np.zeros((MAX_SEQUENCE_LENGTH, MAX_SEQUENCE_LENGTH))
# Calculate distances between all pairs of positional encodings
for a in range(MAX_SEQUENCE_LENGTH):
for b in range(a + 1, MAX_SEQUENCE_LENGTH):
eu[a, b] = tf.norm(tf.math.subtract(pos_encoding[0, a], pos_encoding[0, b]))
eu[b, a] = eu[a, b]
# Visualizing the Euclidean distance matrix
plt.pcolormesh(eu, cmap='RdBu')
plt.xlabel('Position')
plt.xlim((0, MAX_SEQUENCE_LENGTH))
plt.ylabel('Position')
plt.colorbar()
plt.show()
In the resulting visualization, the diagonal values are all zero, reflecting the fact that the distance between a positional encoding and itself is zero. As the distance between positions increases, so does the Euclidean distance, with off-diagonal values growing as the separation between words in the sequence becomes larger.
These visualizations—both the correlation matrix and the Euclidean distance matrix—serve as diagnostic tools for assessing the behavior of positional encodings. By ensuring that the main diagonal exhibits the highest correlation (or the lowest distance) and that the correlation (or distance) decreases as the positions in the sequence move farther apart, we can confirm that the positional encodings are effectively capturing the relative position of words in the sequence. These properties are crucial for the Transformer model to accurately process sequential data.
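As a side note, the nested Python loop above is easy to read but scales quadratically with the sequence length. The same distance matrix can be obtained with NumPy broadcasting; the following is a minimal sketch that reuses pos_encoding and the eu matrix computed above:

# Sketch: vectorized Euclidean distance matrix via broadcasting
pe = pos_encoding[0].numpy()                         # (positions, d)
diffs = pe[:, np.newaxis, :] - pe[np.newaxis, :, :]  # (positions, positions, d)
eu_vec = np.linalg.norm(diffs, axis=-1)              # (positions, positions)
print('Matches the loop-based matrix:', np.allclose(eu_vec, eu, atol=1e-4))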
Understanding the relationship between positional encodings and word embeddings is critical for analyzing how sequential information is captured in a Transformer model. In this section, we explore how positional encodings affect word embeddings by visualizing the sum of these vectors.
To combine pretrained word embeddings with the positional encodings, we utilize a 100-dimensional embedding from the GloVe project. This embedding contains representations for 400,000 words, each represented as a 100-dimensional vector.
The following code loads the pretrained GloVe embeddings:
# Load GloVe embeddings
embeddings_index = {}
GLOVE_DIR = "glove"
# mkdir glove
# cd glove
# pip install gdown
# gdown https://drive.google.com/uc?id=1RdFBU9Zvm6onI3laothPLVrEJeoRd3zU
# https://drive.google.com/file/d/1RdFBU9Zvm6onI3laothPLVrEJeoRd3zU/view?usp=sharing
with open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt')) as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
# Display statistics
print('Found %s word vectors.' % len(embeddings_index))
print('d_model:', embeddings_index['hi'].shape)
Found 400000 word vectors.
d_model: (100,)
Note: This embedding is composed of 400,000 words and each word embedding has 100 features.
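Before visualizing anything, we can sanity-check the claim that GloVe places semantically related words close together by comparing a few cosine similarities directly. This is a minimal sketch using the embeddings_index dictionary loaded above; the word pairs are chosen purely for illustration:

# Sketch: cosine similarity between a few GloVe word vectors
def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

for w1, w2 in [('king', 'queen'), ('dog', 'wolf'), ('king', 'basketball')]:
    print(f"{w1} vs {w2}: {cosine_similarity(embeddings_index[w1], embeddings_index[w2]):.3f}")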
We define two sentences for analysis. These sentences are designed to illustrate semantic similarities among groups of words. The first sentence contains consecutive groups of semantically similar words, while the second sentence randomizes the order of these words:
texts = ['king queen man woman dog wolf football basketball red green yellow',
'man queen yellow basketball green dog woman football king red wolf']
To prepare the data for embedding, we apply tokenization, which converts each sentence into a sequence of numerical indices. These indices correspond to the positions of the words in a dictionary created from the text. The sequences are padded to a uniform length defined by MAX_SEQUENCE_LENGTH.

A quick summary of tokenization is as follows:

- Each sentence is converted into a sequence of word indices with a maximum length of MAX_SEQUENCE_LENGTH.
- Each word is assigned an index from the dictionary built over the texts (word_index).
- Sequences shorter than MAX_SEQUENCE_LENGTH are padded with zeros to create uniform length.

# Tokenize and sequence the text
tokenizer = Tokenizer(num_words=MAX_NB_WORDS) # MAX_NB_WORDS = 64
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
# Generate the word index and padded sequences
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
data = pad_sequences(sequences, padding='post', maxlen=MAX_SEQUENCE_LENGTH) # MAX_SEQUENCE_LENGTH = 100
print(data.shape)
print(data)
Found 11 unique tokens.
(2, 100)
[[ 1 2 3 4 5 6 7 8 9 10 11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [ 3 2 11 8 10 5 4 7 1 9 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
To simplify the analysis, we focus only on the 11 unique words present in the sentences. Each word is represented by its corresponding GloVe embedding. Words not found in the GloVe index are represented by a zero vector.
# Create an embedding matrix for the words in the text
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Use the GloVe embedding if available
        embedding_matrix[i] = embedding_vector
print(embedding_matrix.shape)
(12, 100)
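Since any word missing from the GloVe index is silently represented by a zero vector, a quick coverage check can confirm that all 11 tokens received a pretrained vector. Below is a small sketch using word_index and embeddings_index from above:

# Sketch: report any tokens that did not receive a pretrained GloVe vector
missing = [word for word in word_index if word not in embeddings_index]
print('Tokens without a GloVe vector:', missing if missing else 'none')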
Using the extracted embedding matrix, we construct an embedding layer initialized with the GloVe embeddings. This layer transforms the tokenized input data into word embeddings.
# Create the embedding layer
embedding_layer = Embedding(len(word_index) + 1,
EMBEDDING_DIM,
embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
trainable=False)
# Transform tokenized data into embeddings
embedding = embedding_layer(data)
print(embedding.shape)
(2, 100, 100)
To gain insights into how embeddings capture semantic information, we will visualize word embeddings on a Cartesian plane. Using Principal Component Analysis (PCA), we reduce the 100-dimensional GloVe embeddings to two dimensions for visualization purposes.
The following function plots the embeddings of words in a given sentence. PCA is applied to reduce the high-dimensional embeddings to 2D coordinates, enabling clear visualization. Each word is annotated on the plot to reveal its position in the reduced space.
def plot_words(embedding, sequences, sentence):
    """
    Visualizes word embeddings on a Cartesian plane using PCA.

    Arguments:
        embedding -- Tensor containing word embeddings for the input sentences.
        sequences -- List of tokenized and padded sentences.
        sentence -- Index of the sentence to visualize.

    Returns:
        A scatter plot of words in the reduced 2D space.
    """
    # Apply PCA to reduce embeddings to 2 dimensions
    pca = PCA(n_components=2)
    X_pca_train = pca.fit_transform(embedding[sentence, 0:len(sequences[sentence]), :])

    # Generate scatter plot with word annotations
    fig, ax = plt.subplots(figsize=(12, 6))
    plt.rcParams['font.size'] = '12'
    ax.scatter(X_pca_train[:, 0], X_pca_train[:, 1])
    words = list(word_index.keys())
    for i, index in enumerate(sequences[sentence]):
        ax.annotate(words[index - 1], (X_pca_train[i, 0], X_pca_train[i, 1]))
sequences
[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], [3, 2, 11, 8, 10, 5, 4, 7, 1, 9, 6]]
plot_words(embedding, sequences, 0)
The second sentence, 'man queen yellow basketball green dog woman football king red wolf', contains the same words as the first but in a different order. By plotting its embeddings, we verify that word order does not influence the underlying vector representations:
plot_words(embedding, sequences, 1)
This visualization illustrates how pretrained word embeddings effectively encode semantic information. The PCA-reduced embeddings provide a clear and interpretable representation of word relationships in two dimensions. These plots offer valuable insights into the structure and properties of the embedding space, further demonstrating the robustness of GloVe embeddings in capturing semantic similarity.
In this section, we'll combine semantic embeddings (from GloVe) with positional encodings to see how positional information affects the representation of words in a sentence. The results demonstrate how varying the relative importance (weights) of semantic and positional embeddings can dramatically alter word representations.
To combine the embeddings, use the following equation with adjustable weights:
$$ \text{embedding2} = W_1 \cdot \text{semantic\_embedding} + W_2 \cdot \text{positional\_encoding} $$

With W1 = W2 = 1, the combined embeddings give equal importance to both semantic and positional features. Let's visualize the results:
embedding2 = embedding * 1.0 + pos_encoding[:, :, :] * 1.0
# Visualizing the embeddings
plot_words(embedding2, sequences, 0) # First sentence
plot_words(embedding2, sequences, 1) # Second sentence
Notice how some words, such as red and wolf in the second sentence, now appear closer together due to the influence of positional encoding, reflecting their proximity in the sentence.

By experimenting with the weights, we can observe how they influence the balance between semantic and positional representations:
W1 = 1
W2 = 10
embedding2 = embedding * W1 + pos_encoding[:, :, :] * W2
# Visualizing the embeddings
plot_words(embedding2, sequences, 0)
plot_words(embedding2, sequences, 1)
# For reference
#['king queen man woman dog wolf football basketball red green yellow',
# 'man queen yellow basketball green dog woman football king red wolf']
Effect: With W2 much larger than W1, the positional encoding dominates the combined vectors. In both plots, the words now line up according to their positions in the sentence rather than their meanings, and the semantic clusters seen earlier largely disappear.
W1 = 10
W2 = 1
embedding2 = embedding * W1 + pos_encoding[:, :, :] * W2
# Visualizing the embeddings
plot_words(embedding2, sequences, 0)
plot_words(embedding2, sequences, 1)
Effect: With W1 much larger than W2, the semantic embedding dominates, and the plots closely resemble the pure GloVe visualizations: semantically similar words cluster together again, while the positional information is barely noticeable.
W1 = np.sqrt(EMBEDDING_DIM)
W2 = 1
embedding2 = embedding * W1 + pos_encoding[:, :, :] * W2
# Visualizing the embeddings
plot_words(embedding2, sequences, 0)
plot_words(embedding2, sequences, 1)
Effect: This weighting mirrors the scaling used in the original Transformer, where the embedding values are multiplied by the square root of the model dimension before the positional encoding is added. Semantically similar words remain clustered together, while the positional encoding introduces small but visible shifts that preserve information about word order.
These experiments illustrate how relative weighting of semantic and positional embeddings affects word representation, showcasing the power of positional encoding in capturing sequential information within the Transformer architecture.
🎉 Congratulations! 🎉
We've successfully completed this notebook, delving into the inputs of the Transformer network and exploring how positional encodings interact with word embeddings.
This notebook has provided a comprehensive exploration of positional encodings and their role in Transformer architectures. By visualizing and analyzing their properties, we've observed how they encode relational information and influence word embeddings. This understanding is critical for appreciating how Transformers achieve state-of-the-art performance in natural language processing tasks.
Positional encodings serve as a cornerstone for enabling sequential information flow through models, ensuring that context is preserved alongside semantic meaning. With the ability to control their weight, they offer flexibility in designing Transformer-based solutions for diverse applications.
Positional Encodings: Sinusoidal positional encodings assign each position a unique vector whose norm is constant and whose distance to other encodings depends only on the relative separation between positions.
Visualizations: Heatmaps of the encoding matrix, the correlation (dot-product) and Euclidean distance matrices, and PCA plots of the embeddings provide practical diagnostics for how positional and semantic information is represented.
Semantic and Positional Embeddings: The relative weights applied to the word embeddings and the positional encodings control the balance between meaning and word order; the square-root scaling used in the Transformer keeps both sources of information useful.