This notebook focuses on the preprocessing techniques applied to raw text before it is input into the encoder and decoder blocks of the Transformer architecture. The objective is to explore the key preprocessing steps, particularly the role of positional encodings, and to understand their impact on the overall model performance.
Upon completing this study, we will be able to:

- Implement and visualize the sinusoidal positional encoding matrix.
- Examine key properties of positional encodings, such as their constant norm and the stable distance between encodings separated by a fixed offset.
- Combine positional encodings with pretrained word embeddings and observe how different relative weights affect the resulting representations.
%load_ext autoreload
%autoreload 2
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import os
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.decomposition import PCA
The positional encodings are derived using the following formulas:
$$ PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{d}}}\right) $$

$$ PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{d}}}\right) $$

In natural language processing (NLP), it is standard practice to convert sentences into tokens before inputting them into a language model. Each token is then transformed into a numerical vector of fixed length, referred to as an embedding, which encapsulates the semantic meaning of the word. In the Transformer architecture, a positional encoding vector is added to the embedding to incorporate positional information, allowing the model to account for the order of words in the sequence.
While these positional encoding vectors are challenging to interpret directly through their numerical representations, visualizations can provide valuable insights into their behavior. For instance, when embeddings are reduced to two dimensions and plotted, semantically similar words are typically located closer together, while dissimilar words are plotted farther apart. Similarly, when visualizing positional encoding vectors, words that are closer together in a sentence should appear closer on a Cartesian plane, whereas words that are farther apart should be more distanced.
In this notebook, we will generate a series of visualizations of both word embeddings and positional encoding vectors. These visualizations will help us develop a deeper intuition regarding the effect of positional encodings on word embeddings and their role in maintaining sequential information throughout the Transformer model.
The positional_encoding function is implemented below. Building upon this, we will create visualizations to further explore the behavior of positional encodings.
def positional_encoding(positions, d):
    """
    Precomputes a matrix with all the positional encodings.

    Arguments:
        positions (int) -- Maximum number of positions to be encoded.
        d (int) -- Encoding size (dimensionality of the positional encoding).

    Returns:
        pos_encoding -- (1, positions, d) A matrix containing the positional encodings.
    """
    # Initialize a matrix of angles: one row per position, one column per encoding dimension
    angle_rads = np.arange(positions)[:, np.newaxis] / np.power(
        10000, (2 * (np.arange(d)[np.newaxis, :] // 2)) / np.float32(d))
    # Apply sin to the even indices and cos to the odd indices of the encoding dimension
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
    # Add a batch dimension so the encoding can be broadcast over a batch of sequences
    pos_encoding = angle_rads[np.newaxis, ...]
    return tf.cast(pos_encoding, dtype=tf.float32)
Next, we define the embedding dimension as 100, which matches the dimensionality of the GloVe word embeddings used later in this notebook. According to the "Attention is All You Need" paper, embedding sizes typically range from 100 to 1024, depending on the specific task. The paper also uses maximum sequence lengths ranging from 40 to 512; for this exercise, we set the maximum sequence length to 100. Finally, the maximum vocabulary size for the tokenizer (MAX_NB_WORDS) is set to 64.
EMBEDDING_DIM = 100
MAX_SEQUENCE_LENGTH = 100
MAX_NB_WORDS = 64
pos_encoding = positional_encoding(MAX_SEQUENCE_LENGTH, EMBEDDING_DIM)
# Visualizing the positional encoding matrix
plt.pcolormesh(pos_encoding[0], cmap='RdBu')
plt.xlabel('d')
plt.xlim((0, EMBEDDING_DIM))
plt.ylabel('Position')
plt.colorbar()
plt.show()
Having implemented the visualization, we now explore some interesting properties of the positional encoding matrix:

- The norm of each positional encoding vector is constant, independent of the position (pos). For example, when evaluating the norm of the positional encoding at pos = 34, we observe a value of approximately 7.071068, and the same value is obtained at any other position.

pos = 34
tf.norm(pos_encoding[0,pos,:])
<tf.Tensor: shape=(), dtype=float32, numpy=7.0710673332214355>
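As a quick sanity check, the norm can be evaluated at a few other positions. Because each of the d/2 sine/cosine pairs contributes sin² + cos² = 1, the squared norm is always d/2 = 50, so the norm is √50 ≈ 7.0710678. The following is a minimal sketch that reuses pos_encoding and EMBEDDING_DIM from the cells above:

# Sketch: the norm of the positional encoding is sqrt(d/2) at every position
for p in [0, 10, 34, 99]:
    print(p, tf.norm(pos_encoding[0, p, :]).numpy())
print('sqrt(d/2) =', np.sqrt(EMBEDDING_DIM / 2))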
- The norm of the difference between two positional encoding vectors separated by k positions also remains constant. If we fix k and vary pos, the difference between the two encoding vectors stays nearly the same. This property underscores the importance of relative position information: the difference between encodings does not depend on the absolute positions of the words, only on their relative separation. Being able to express positional encodings as linear functions of one another helps the model learn by focusing on the relative positions of words. Capturing the difference in word positions through vector encodings is difficult to achieve, especially because the values of the encodings must remain small enough that they do not distort the word embeddings.
pos = 70
k = 2
print(tf.norm(pos_encoding[0,pos,:] - pos_encoding[0,pos + k,:]))
tf.Tensor(3.2668781, shape=(), dtype=float32)
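To confirm that this difference depends only on the offset k and not on the absolute position, we can repeat the measurement at several positions. Below is a small sketch, again reusing pos_encoding from above:

# Sketch: for a fixed offset k, the norm of the difference is the same at every position
k = 2
for pos in [0, 20, 50, 70, 90]:
    diff_norm = tf.norm(pos_encoding[0, pos, :] - pos_encoding[0, pos + k, :])
    print(f"pos={pos:2d}, k={k}: {diff_norm.numpy():.6f}")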
Through these observations, we gain a deeper understanding of the properties of positional encoding vectors and their implications for the Transformer architecture. The constancy of the vector norms and the stability of the difference norms are essential for the model to maintain consistent and meaningful representations of word order, facilitating the model's ability to learn from sequential data without introducing distortion into the word embeddings.
In the following sections, we will continue to explore how these properties affect the relationships between word embeddings and positional encodings, enhancing our intuition of how the Transformer model processes text sequences.
The positional encoding matrix allows us to visualize the unique vector representation for each position in a sequence. However, it is essential to understand how these vectors capture the relative positions of words within a sentence. To explore this, we calculate the correlation between pairs of positional encoding vectors at every position.
A well-constructed positional encoding should produce a perfectly symmetric correlation matrix where the values are maximized along the main diagonal. This indicates that vectors at similar positions in the sequence have the highest correlation. As we move away from the diagonal, the correlation values should decrease, reflecting the decreasing similarity between vectors corresponding to positions farther apart in the sentence.
The following code computes the dot product between the positional encoding vectors at every pair of positions; since all of the vectors share the same norm, this dot product is proportional to the cosine similarity and serves as our correlation measure:
# Compute the correlation matrix for positional encodings
corr = tf.matmul(pos_encoding, pos_encoding, transpose_b=True).numpy()[0]
# Visualizing the correlation matrix
plt.pcolormesh(corr, cmap='RdBu')
plt.xlabel('Position')
plt.xlim((0, MAX_SEQUENCE_LENGTH))
plt.ylabel('Position')
plt.colorbar()
plt.show()
The resulting plot illustrates the correlation between positional encodings. The main diagonal should exhibit maximum values, indicating that vectors corresponding to positions that are closer in the sequence are highly correlated. As we move away from the diagonal, the correlation should decrease, reflecting the relative positions of the words in the sentence.
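A quick programmatic check (a small sketch reusing the corr matrix computed above) can confirm that every row attains its maximum on the main diagonal:

# Sketch: verify that each row of the dot-product matrix peaks on the diagonal
diag_is_max = np.all(np.argmax(corr, axis=1) == np.arange(MAX_SEQUENCE_LENGTH))
print('Diagonal holds the maximum of every row:', diag_is_max)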
In addition to correlation, we can also compare positional encoding vectors using the Euclidean distance. Unlike the correlation matrix, the Euclidean distance matrix will have zero values along the main diagonal, and the values off the diagonal will increase as the positional encodings represent words further apart in the sentence.
The following code computes the Euclidean distance between positional encoding vectors at every position:
# Compute the Euclidean distance matrix for positional encodings
eu = np.zeros((MAX_SEQUENCE_LENGTH, MAX_SEQUENCE_LENGTH))
# Calculate distances between all pairs of positional encodings
for a in range(MAX_SEQUENCE_LENGTH):
for b in range(a + 1, MAX_SEQUENCE_LENGTH):
eu[a, b] = tf.norm(tf.math.subtract(pos_encoding[0, a], pos_encoding[0, b]))
eu[b, a] = eu[a, b]
# Visualizing the Euclidean distance matrix
plt.pcolormesh(eu, cmap='RdBu')
plt.xlabel('Position')
plt.xlim((0, MAX_SEQUENCE_LENGTH))
plt.ylabel('Position')
plt.colorbar()
plt.show()
In the resulting visualization, the diagonal values are all zero, reflecting the fact that the distance between a positional encoding and itself is zero. As the distance between positions increases, so does the Euclidean distance, with off-diagonal values growing as the separation between words in the sequence becomes larger.
These visualizations—both the correlation matrix and the Euclidean distance matrix—serve as diagnostic tools for assessing the behavior of positional encodings. By ensuring that the main diagonal exhibits the highest correlation (or the lowest distance) and that the correlation (or distance) decreases as the positions in the sequence move farther apart, we can confirm that the positional encodings are effectively capturing the relative position of words in the sequence. These properties are crucial for the Transformer model to accurately process sequential data.
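As a side note, the nested Python loop above is easy to read but scales quadratically with the sequence length. The same distance matrix can be obtained with NumPy broadcasting; the following is a minimal sketch that reuses pos_encoding and the eu matrix computed above:

# Sketch: vectorized Euclidean distance matrix via broadcasting
pe = pos_encoding[0].numpy()                         # (positions, d)
diffs = pe[:, np.newaxis, :] - pe[np.newaxis, :, :]  # (positions, positions, d)
eu_vec = np.linalg.norm(diffs, axis=-1)              # (positions, positions)
print('Matches the loop-based matrix:', np.allclose(eu_vec, eu, atol=1e-4))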
Understanding the relationship between positional encodings and word embeddings is critical for analyzing how sequential information is captured in a Transformer model. In this section, we explore how positional encodings affect word embeddings by visualizing the sum of these vectors.
To combine pretrained word embeddings with the positional encodings, we utilize a 100-dimensional embedding from the GloVe project. This embedding contains representations for 400,000 words, each represented as a 100-dimensional vector.
The following code loads the pretrained GloVe embeddings:
# Load GloVe embeddings
embeddings_index = {}
GLOVE_DIR = "glove"
# mkdir glove
# cd glove
# pip install gdown
# gdown https://drive.google.com/uc?id=1RdFBU9Zvm6onI3laothPLVrEJeoRd3zU
# https://drive.google.com/file/d/1RdFBU9Zvm6onI3laothPLVrEJeoRd3zU/view?usp=sharing
with open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt')) as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
# Display statistics
print('Found %s word vectors.' % len(embeddings_index))
print('d_model:', embeddings_index['hi'].shape)
Found 400000 word vectors.
d_model: (100,)
Note: This embedding is composed of 400,000 words and each word embedding has 100 features.
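Before visualizing anything, we can sanity-check the claim that GloVe places semantically related words close together by comparing a few cosine similarities directly. This is a minimal sketch using the embeddings_index dictionary loaded above; the word pairs are chosen purely for illustration:

# Sketch: cosine similarity between a few GloVe word vectors
def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

for w1, w2 in [('king', 'queen'), ('dog', 'wolf'), ('king', 'basketball')]:
    print(f"{w1} vs {w2}: {cosine_similarity(embeddings_index[w1], embeddings_index[w2]):.3f}")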
We define two sentences for analysis. These sentences are designed to illustrate semantic similarities among groups of words. The first sentence contains consecutive groups of semantically similar words, while the second sentence randomizes the order of these words:
texts = ['king queen man woman dog wolf football basketball red green yellow',
'man queen yellow basketball green dog woman football king red wolf']
To prepare the data for embedding, we apply tokenization, which converts each sentence into a sequence of numerical indices. These indices correspond to the positions of the words in a dictionary created from the text. The sequences are padded to a uniform length defined by MAX_SEQUENCE_LENGTH.

A quick summary of tokenization is as follows:

- Each sentence is converted into a sequence of word indices with a maximum length of MAX_SEQUENCE_LENGTH.
- Each word is assigned an index from the dictionary built over the texts (word_index).
- Sequences shorter than MAX_SEQUENCE_LENGTH are padded with zeros to create uniform length.

# Tokenize and sequence the text
tokenizer = Tokenizer(num_words=MAX_NB_WORDS) # MAX_NB_WORDS = 64
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
# Generate the word index and padded sequences
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
data = pad_sequences(sequences, padding='post', maxlen=MAX_SEQUENCE_LENGTH) # MAX_SEQUENCE_LENGTH = 100
print(data.shape)
print(data)
Found 11 unique tokens.
(2, 100)
[[ 1 2 3 4 5 6 7 8 9 10 11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [ 3 2 11 8 10 5 4 7 1 9 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
To simplify the analysis, we focus only on the 11 unique words present in the sentences. Each word is represented by its corresponding GloVe embedding. Words not found in the GloVe index are represented by a zero vector.
# Create an embedding matrix for the words in the text
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Use the GloVe embedding if available
        embedding_matrix[i] = embedding_vector
print(embedding_matrix.shape)
(12, 100)
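Since any word missing from the GloVe index is silently represented by a zero vector, a quick coverage check can confirm that all 11 tokens received a pretrained vector. Below is a small sketch using word_index and embeddings_index from above:

# Sketch: report any tokens that did not receive a pretrained GloVe vector
missing = [word for word in word_index if word not in embeddings_index]
print('Tokens without a GloVe vector:', missing if missing else 'none')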
Using the extracted embedding matrix, we construct an embedding layer initialized with the GloVe embeddings. This layer transforms the tokenized input data into word embeddings.
# Create the embedding layer
embedding_layer = Embedding(len(word_index) + 1,
EMBEDDING_DIM,
embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
trainable=False)
# Transform tokenized data into embeddings
embedding = embedding_layer(data)
print(embedding.shape)
(2, 100, 100)
To gain insights into how embeddings capture semantic information, we will visualize word embeddings on a Cartesian plane. Using Principal Component Analysis (PCA), we reduce the 100-dimensional GloVe embeddings to two dimensions for visualization purposes.
The following function plots the embeddings of words in a given sentence. PCA is applied to reduce the high-dimensional embeddings to 2D coordinates, enabling clear visualization. Each word is annotated on the plot to reveal its position in the reduced space.
def plot_words(embedding, sequences, sentence):
    """
    Visualizes word embeddings on a Cartesian plane using PCA.

    Arguments:
        embedding -- Tensor containing word embeddings for the input sentences.
        sequences -- List of tokenized and padded sentences.
        sentence -- Index of the sentence to visualize.

    Returns:
        A scatter plot of words in the reduced 2D space.
    """
    # Apply PCA to reduce embeddings to 2 dimensions
    pca = PCA(n_components=2)
    X_pca_train = pca.fit_transform(embedding[sentence, 0:len(sequences[sentence]), :])

    # Generate scatter plot with word annotations
    fig, ax = plt.subplots(figsize=(12, 6))
    plt.rcParams['font.size'] = '12'
    ax.scatter(X_pca_train[:, 0], X_pca_train[:, 1])
    words = list(word_index.keys())
    for i, index in enumerate(sequences[sentence]):
        ax.annotate(words[index - 1], (X_pca_train[i, 0], X_pca_train[i, 1]))
sequences
[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], [3, 2, 11, 8, 10, 5, 4, 7, 1, 9, 6]]
plot_words(embedding, sequences, 0)
The second sentence, 'man queen yellow basketball green dog woman football king red wolf', contains the same words as the first but in a different order. By plotting its embeddings, we verify that word order does not influence the underlying vector representations:
plot_words(embedding, sequences, 1)
This visualization illustrates how pretrained word embeddings effectively encode semantic information. The PCA-reduced embeddings provide a clear and interpretable representation of word relationships in two dimensions. These plots offer valuable insights into the structure and properties of the embedding space, further demonstrating the robustness of GloVe embeddings in capturing semantic similarity.
In this section, we'll combine semantic embeddings (from GloVe) with positional encodings to see how positional information affects the representation of words in a sentence. The results demonstrate how varying the relative importance (weights) of semantic and positional embeddings can dramatically alter word representations.
To combine the embeddings, use the following equation with adjustable weights:
$$ \text{embedding2} = W_1 \cdot \text{semantic\_embedding} + W_2 \cdot \text{positional\_encoding} $$

With W1 = W2 = 1, the combined embeddings give equal importance to both semantic and positional features. Let's visualize the results:
embedding2 = embedding * 1.0 + pos_encoding[:, :, :] * 1.0
# Visualizing the embeddings
plot_words(embedding2, sequences, 0) # First sentence
plot_words(embedding2, sequences, 1) # Second sentence
Notice how some words, such as red and wolf in the second sentence, now appear closer together due to the influence of positional encoding, reflecting their proximity in the sentence.

By experimenting with the weights, we can observe how they influence the balance between semantic and positional representations:
W1 = 1
W2 = 10
embedding2 = embedding * W1 + pos_encoding[:, :, :] * W2
# Visualizing the embeddings
plot_words(embedding2, sequences, 0)
plot_words(embedding2, sequences, 1)
# For reference
#['king queen man woman dog wolf football basketball red green yellow',
# 'man queen yellow basketball green dog woman football king red wolf']
Effect: With W2 much larger than W1, the positional encoding dominates the combined vectors. In both plots, the words now line up according to their positions in the sentence rather than their meanings, and the semantic clusters seen earlier largely disappear.
W1 = 10
W2 = 1
embedding2 = embedding * W1 + pos_encoding[:, :, :] * W2
# Visualizing the embeddings
plot_words(embedding2, sequences, 0)
plot_words(embedding2, sequences, 1)
Effect: With W1 much larger than W2, the semantic embedding dominates, and the plots closely resemble the pure GloVe visualizations: semantically similar words cluster together again, while the positional information is barely noticeable.
W1 = np.sqrt(EMBEDDING_DIM)
W2 = 1
embedding2 = embedding * W1 + pos_encoding[:, :, :] * W2
# Visualizing the embeddings
plot_words(embedding2, sequences, 0)
plot_words(embedding2, sequences, 1)
Effect: This weighting mirrors the scaling used in the original Transformer, where the embedding values are multiplied by the square root of the model dimension before the positional encoding is added. Semantically similar words remain clustered together, while the positional encoding introduces small but visible shifts that preserve information about word order.
These experiments illustrate how relative weighting of semantic and positional embeddings affects word representation, showcasing the power of positional encoding in capturing sequential information within the Transformer architecture.
🎉 Congratulations! 🎉
We've successfully completed this notebook, delving into the inputs of the Transformer network and exploring how positional encodings interact with word embeddings.
This notebook has provided a comprehensive exploration of positional encodings and their role in Transformer architectures. By visualizing and analyzing their properties, we've observed how they encode relational information and influence word embeddings. This understanding is critical for appreciating how Transformers achieve state-of-the-art performance in natural language processing tasks.
Positional encodings serve as a cornerstone for enabling sequential information flow through models, ensuring that context is preserved alongside semantic meaning. With the ability to control their weight, they offer flexibility in designing Transformer-based solutions for diverse applications.
Positional Encodings: Sinusoidal positional encodings assign each position a unique vector whose norm is constant and whose distance to other encodings depends only on the relative separation between positions.
Visualizations: Heatmaps of the encoding matrix, the correlation (dot-product) and Euclidean distance matrices, and PCA plots of the embeddings provide practical diagnostics for how positional and semantic information is represented.
Semantic and Positional Embeddings: The relative weights applied to the word embeddings and the positional encodings control the balance between meaning and word order; the square-root scaling used in the Transformer keeps both sources of information useful.