Traditionally, recurrent neural networks (RNNs) have been the default choice for processing text in Natural Language Processing (NLP). However, we can also treat text as a one-dimensional image, which lets us use one-dimensional convolutional neural networks (CNNs) to capture associations between adjacent words.
This notebook describes a groundbreaking approach to applying convolutional neural networks to text analysis: textCNN by Kim et al.
First, import the environment packages and modules.
import d2l
from mxnet import gluon, init, np, npx
from mxnet.contrib import text
from mxnet.gluon import nn
npx.set_np()
Here we use Stanford’s Large Movie Review Dataset as the dataset for sentiment analysis.
The training and test sets each contain 25,000 movie reviews downloaded from IMDb, with an equal number of reviews labeled "positive" and "negative" in each set.
For simplicity, we use the built-in function load_data_imdb in the d2l package to load the dataset. If you are interested in how the full dataset is preprocessed, please see D2L.ai for more detail.
batch_size = 64
train_iter, test_iter, vocab = d2l.load_data_imdb(batch_size)
for X_batch, y_batch in train_iter:
    print("\n X_batch has shape {}, and y_batch has shape {}".format(X_batch.shape, y_batch.shape))
    break
X_batch has shape (64, 500), and y_batch has shape (64,)
TextCNN involves three building blocks, each covered below: one-dimensional convolutional layers, a max-over-time pooling layer, and a fully connected layer.
Like a two-dimensional convolutional layer, a one-dimensional convolutional layer uses a one-dimensional cross-correlation operation.
In the one-dimensional cross-correlation operation, the convolution window slides on the input array from left to right successively.
At each position, the input subarray in the convolution window and the kernel array are multiplied elementwise and summed to obtain the corresponding element of the output array.
Now let's implement one-dimensional cross-correlation in the corr1d function. It accepts the input array X and kernel array K, and outputs the array Y.
def corr1d(X, K):
    w = K.shape[0]
    Y = np.zeros((X.shape[0] - w + 1))
    for i in range(Y.shape[0]):
        Y[i] = (X[i: i + w] * K).sum()
    return Y
As shown in the figure below, the input is a one-dimensional array with a width of 7, and the width of the kernel array is 2. The output width is $7-2+1=6$, and the first element is obtained by elementwise multiplication of the leftmost input subarray of width 2 with the kernel array, followed by summing the results.
X, K = np.array([0, 1, 2, 3, 4, 5, 6]), np.array([1, 2])
corr1d(X, K)
array([ 2., 5., 8., 11., 14., 17.])
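The same sliding-window logic can be sketched in plain Python without any ndarray machinery, which makes each step explicit; the helper name corr1d_list is ours, introduced only for this illustration:

```python
def corr1d_list(X, K):
    """One-dimensional cross-correlation on plain Python lists.

    Slides the kernel K over X from left to right; at each position,
    multiplies elementwise and sums, exactly as corr1d does on ndarrays.
    """
    w = len(K)
    return [sum(x * k for x, k in zip(X[i:i + w], K))
            for i in range(len(X) - w + 1)]

print(corr1d_list([0, 1, 2, 3, 4, 5, 6], [1, 2]))  # [2, 5, 8, 11, 14, 17]
```

The output matches corr1d above: six elements, each a width-2 window of the input dotted with the kernel.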
We can use the Gluon built-in class Conv1D() to perform the 1D convolution. To use Conv1D(), we first define a Sequential() architecture, convs, which stacks neural network layers sequentially.
num_channels, kernel_sizes = 2, 4
convs = nn.Sequential()
convs.add(nn.Conv1D(num_channels, kernel_sizes, activation='relu'))
Then, we randomly initialize its weights with a normal distribution (zero mean, 0.01 standard deviation) through the initialize() function.
convs.initialize(init.Normal(sigma=0.01))
convs
Sequential( (0): Conv1D(-1 -> 2, kernel_size=(4,), stride=(1,), Activation(relu)) )
Note that the required input of Conv1D() is a 3D tensor with shape (batch_size, in_channels, width). In the context of NLP, this shape can be interpreted as (batch_size, word_vector_dimension, number_of_words).
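With no padding and a stride of 1, the output width of a 1D convolution is width - kernel_size + 1. A quick sketch of the shape arithmetic (the helper name conv1d_out_shape is ours, not part of Gluon):

```python
def conv1d_out_shape(batch_size, num_channels, width, kernel_size):
    # (batch_size, in_channels, width) -> (batch_size, num_channels, out_width)
    # With no padding and stride 1, out_width = width - kernel_size + 1.
    return (batch_size, num_channels, width - kernel_size + 1)

# A batch of 64 reviews of 500 words each (100-dimensional word vectors act
# as the input channels), passed through the Conv1D(2, 4) layer defined above:
print(conv1d_out_shape(64, 2, 500, 4))  # (64, 2, 497)
```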
In textCNN, the max-over-time pooling layer is equivalent to a one-dimensional global max pooling layer. We can use the Gluon built-in class GlobalMaxPool1D() as below:
max_over_time_pooling = nn.GlobalMaxPool1D()
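Max-over-time pooling simply keeps, for each channel, the maximum activation across all time steps (words). A minimal sketch on nested Python lists (the helper name max_over_time is hypothetical, for illustration only):

```python
def max_over_time(batch):
    """Global max pooling over the last (time) axis.

    batch: a list of examples; each example is a list of channels;
    each channel is a list of activations over time.
    Returns shape (batch_size, num_channels, 1), mirroring GlobalMaxPool1D.
    """
    return [[[max(channel)] for channel in example] for example in batch]

example = [[[1.0, 3.0, 2.0],    # channel 0 -> 3.0
            [0.5, 0.1, 0.9]]]   # channel 1 -> 0.9
print(max_over_time(example))   # [[[3.0], [0.9]]]
```

Because the maximum is taken over the whole time axis, the output no longer depends on the input width, which is what lets textCNN mix convolutions of different kernel sizes.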
The fully connected layer is referred to as the Dense() layer in Gluon (more detail in D2L). Besides, a dropout layer Dropout() can be used after the fully connected layer to mitigate overfitting.
decoder = nn.Dense(2)  # 2 outputs
print("Dense layer : ", decoder)
dropout = nn.Dropout(0.4)  # drop 40% of the units' activations
print("Dropout layer : ", dropout)
Dense layer : Dense(-1 -> 2, linear) Dropout layer : Dropout(p = 0.4, axes=())
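Conceptually, dropout zeroes each unit with probability p during training and, in the "inverted dropout" scheme used by most frameworks, rescales the survivors by 1/(1-p) so the expected activation is unchanged. A toy sketch, assuming inverted dropout (dropout_list is our hypothetical helper, not a Gluon API):

```python
import random

def dropout_list(xs, p, seed=0):
    # Inverted dropout: zero each unit with probability p and scale the
    # survivors by 1 / (1 - p) so the expected value stays the same.
    rng = random.Random(seed)
    return [0.0 if rng.random() < p else x / (1 - p) for x in xs]

out = dropout_list([1.0] * 10, 0.4)
print(out)  # each entry is either 0.0 or 1.0 / 0.6
```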
Now let's put everything together!
Suppose that the input text sequence consists of $n$ words, and each word is represented by a $d$-dimensional word vector. Then the input example has a width of $n$, a height of 1, and $d$ input channels.
We first initialize the layers of our textCNN
class.
class TextCNN(nn.Block):
    def __init__(self, vocab_size, embed_size, kernel_sizes, num_channels,
                 **kwargs):
        super(TextCNN, self).__init__(**kwargs)
        self.embedding = nn.Embedding(vocab_size, embed_size)
        # The constant embedding layer does not participate in training
        self.constant_embedding = nn.Embedding(vocab_size, embed_size)
        self.dropout = nn.Dropout(0.5)
        self.decoder = nn.Dense(2)
        # The max-over-time pooling layer has no weight, so it can share an
        # instance
        self.pool = nn.GlobalMaxPool1D()
        # Create multiple one-dimensional convolutional layers
        self.convs = nn.Sequential()
        for c, k in zip(num_channels, kernel_sizes):
            self.convs.add(nn.Conv1D(c, k, activation='relu'))
Now let's write the forward function of our textCNN class.
def forward(self, inputs):
    embeddings = np.concatenate((
        self.embedding(inputs), self.constant_embedding(inputs)),
        axis=2)
    embeddings = embeddings.transpose(0, 2, 1)
    encoding = np.concatenate([
        np.squeeze(self.pool(conv(embeddings)), axis=-1)
        for conv in self.convs], axis=1)
    outputs = self.decoder(self.dropout(encoding))
    return outputs
It looks a bit complicated, but we can decompose it into 4 steps.
First, we concatenate the outputs of the two embedding layers, each of shape (batch_size, number_of_words, word_vector_dimension), along the last dimension as below:
embeddings = np.concatenate((
    self.embedding(inputs), self.constant_embedding(inputs)), axis=2)
Second, recall that Conv1D() requires a 3D input tensor with shape (batch_size, word_vector_dimension, number_of_words), while our current embeddings are of shape (batch_size, number_of_words, word_vector_dimension). Hence, we need to transpose the last two dimensions as below:
embeddings = embeddings.transpose(0, 2, 1)
Third, we compute the encoding as below:
encoding = np.concatenate([np.squeeze(self.pool(conv(embeddings)), axis=-1)
                           for conv in self.convs], axis=1)
For each convolutional layer conv, we apply it to the transposed embeddings and then apply max-over-time pooling with self.pool(), which leaves a trailing width dimension of size 1. We call squeeze() to remove that dimension, and finally concatenate() the outputs of all convolutional layers along the channel dimension.
Last, we apply dropout to randomly drop some units of the encoding to avoid overfitting (i.e., so the model does not rely too heavily on any single unit of the encoding). Then we apply a fully connected layer as a decoder to obtain the outputs.
outputs = self.decoder(self.dropout(encoding))
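The four steps above can be summarized as pure shape arithmetic. A sketch for 500-word inputs, assuming no padding and stride 1 (the helper forward_shapes is ours, for illustration only):

```python
def forward_shapes(batch_size, num_words, embed_dim, kernel_sizes, num_channels):
    # Two embedding layers are concatenated along the vector dimension,
    # then transposed into Conv1D's (batch, channels, width) layout.
    embeddings = (batch_size, 2 * embed_dim, num_words)
    branch_channels = []
    for k, c in zip(kernel_sizes, num_channels):
        conv_width = num_words - k + 1  # Conv1D output width (no padding, stride 1)
        assert conv_width >= 1
        # Max-over-time pooling reduces the width axis to 1; squeeze removes
        # it, leaving (batch_size, c) per branch.
        branch_channels.append(c)
    encoding = (batch_size, sum(branch_channels))  # concatenate over channels
    return embeddings, encoding

emb, enc = forward_shapes(64, 500, 100, [3, 4, 5], [100, 100, 100])
print(emb)  # (64, 200, 500)
print(enc)  # (64, 300)
```

With three branches of 100 channels each, every review is encoded as a 300-dimensional vector before the dropout and the Dense(2) decoder.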
To sum up, here is the full TextCNN class:
class TextCNN(nn.Block):
    def __init__(self, vocab_size, embed_size, kernel_sizes, num_channels,
                 **kwargs):
        super(TextCNN, self).__init__(**kwargs)
        self.embedding = nn.Embedding(vocab_size, embed_size)
        # The constant embedding layer does not participate in training
        self.constant_embedding = nn.Embedding(vocab_size, embed_size)
        self.dropout = nn.Dropout(0.5)
        self.decoder = nn.Dense(2)
        # The max-over-time pooling layer has no weight, so it can share
        # an instance
        self.pool = nn.GlobalMaxPool1D()
        # Create multiple one-dimensional convolutional layers with
        # different kernel sizes and numbers of channels
        self.convs = nn.Sequential()
        for c, k in zip(num_channels, kernel_sizes):
            self.convs.add(nn.Conv1D(c, k, activation='relu'))

    def forward(self, inputs):
        embeddings = np.concatenate((
            self.embedding(inputs), self.constant_embedding(inputs)), axis=2)
        embeddings = embeddings.transpose(0, 2, 1)
        encoding = np.concatenate([np.squeeze(self.pool(conv(embeddings)),
                                              axis=-1)
                                   for conv in self.convs], axis=1)
        outputs = self.decoder(self.dropout(encoding))
        return outputs
Rather than training the embeddings from scratch, we load pre-trained 100-dimensional GloVe word vectors. This step will take several minutes.
# Load word vectors and query the word vectors that in our vocabulary
glove_embedding = text.embedding.create(
'glove', pretrained_file_name='glove.6B.100d.txt')
embeds = glove_embedding.get_vecs_by_tokens(vocab.idx_to_token)
print("embeds.shape : ", embeds.shape)
embeds.shape : (49339, 100)
Now let's create a TextCNN model. Since kernel filters with varying window sizes can capture different features, the original TextCNN model applies 3 convolutional layers with kernel widths of 3, 4, and 5, respectively, each with 100 output channels.
embed_size, kernel_sizes, nums_channels = 100, [3, 4, 5], [100, 100, 100]
ctx = d2l.try_all_gpus()  # Get the GPUs
net = TextCNN(vocab_size=len(vocab), embed_size=embed_size,
kernel_sizes=kernel_sizes, num_channels=nums_channels)
net.initialize(init.Xavier(), ctx=ctx)
net
TextCNN( (embedding): Embedding(49339 -> 100, float32) (constant_embedding): Embedding(49339 -> 100, float32) (dropout): Dropout(p = 0.5, axes=()) (decoder): Dense(-1 -> 2, linear) (pool): GlobalMaxPool1D(size=(1,), stride=(1,), padding=(0,), ceil_mode=True, global_pool=True, pool_type=max, layout=NCW) (convs): Sequential( (0): Conv1D(-1 -> 100, kernel_size=(3,), stride=(1,), Activation(relu)) (1): Conv1D(-1 -> 100, kernel_size=(4,), stride=(1,), Activation(relu)) (2): Conv1D(-1 -> 100, kernel_size=(5,), stride=(1,), Activation(relu)) ) )
Then we initialize the embedding layers embedding and constant_embedding using the GloVe embeddings. Here, the former participates in training while the latter has a fixed weight.
net.embedding.weight.set_data(embeds)
net.constant_embedding.weight.set_data(embeds)
net.constant_embedding.collect_params().setattr('grad_req', 'null')
To train our TextCNN model, we also need to define:
- the learning rate lr,
- the number of epochs num_epochs,
- the optimizer adam,
- the loss function SoftmaxCrossEntropyLoss().
For simplicity, we call the built-in function train_ch13 (more detail in D2L) to train.
lr, num_epochs = 0.001, 5
trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': lr})
loss = gluon.loss.SoftmaxCELoss()
d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, ctx)
loss 0.093, train acc 0.968, test acc 0.868 2887.2 examples/sec on [gpu(0), gpu(1), gpu(2), gpu(3)]
Now it is time to use our trained model to classify the sentiment of a simple sentence.
d2l.predict_sentiment(net, vocab, 'this movie is so amazing!')
'positive'