#!/usr/bin/env python # coding: utf-8 # # TextCNN on NLP # # Traditionally, for Natural Language Processing (NLP), we used recurrent neural networks (RNN) to process the text data. In fact, we can also treat text as a one-dimensional image, so that we can use one-dimensional convolutional neural networks (CNN) to capture associations between adjacent words. # # ![A one-dimensional word vectors.](https://nbviewer.jupyter.org/format/slides/github/goldmermaid/gtc2020/blob/master/Notebooks/1d-vector.svg) # # This notebook describes a groundbreaking approach to applying convolutional neural networks to text analysis: textCNN [by Kim et al.](https://arxiv.org/abs/1408.5882). # First, import the environment packages and modules. # In[1]: import d2l from mxnet import gluon, init, np, npx from mxnet.contrib import text from mxnet.gluon import nn npx.set_np() # ## The Dataset # # Here we use Stanford’s [Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/) as the dataset for sentiment analysis. # # The training and testing dataset each contains 25,000 movie reviews downloaded from IMDb, respectively. In addition, the number of comments labeled as “positive” and “negative” is equal in each dataset. # # # For the purpose of simplicity, we are using a built-in function `load_data_imdb` in the d2l package to load the dataset. If you are interested in the preprocessing of the full dataset, please check [more detail](https://d2l.ai/chapter_natural-language-processing/sentiment-analysis.html) at D2L.ai. # In[2]: batch_size = 64 train_iter, test_iter, vocab = d2l.load_data_imdb(batch_size) for X_batch, y_batch in train_iter: print("\n X_batch has shape {}, and y_batch has shape {}".format(X_batch.shape, y_batch.shape)) break # ## The TextCNN Model's Skeleton # # ![An example to illustrate the textCNN.](https://d2l.ai/_images/textcnn.svg) # TextCNN involves the following steps: # # 1. Performing multiple **one-dimensional convolution** kernels on the input text sequences; # 2. Applying **max-over-time pooling** on the previous output channels, and then concatenate to one vector; # 3. Using the **fully connected layer** (aka. dense layer) and **dropout** on the previous outputs. # # ### 1. One-Dimensional Convolutional Layer # # Like a two-dimensional convolutional layer, a one-dimensional convolutional layer uses a one-dimensional cross-correlation operation. # # In the one-dimensional cross-correlation operation, the convolution window slides on the input array from left to right successively. # # At a certain position, the input array in the convolution window and kernel array are elementwise multiplied and summed to obtain the output array. # # ![One-dimensional cross-correlation operation.](https://raw.githubusercontent.com/d2l-ai/d2l-en/master/img/conv1d.svg?sanitize=true) # Now let's implement one-dimensional cross-correlation in the `corr1d` function. It accepts the input array X and kernel array K, then it outputs the array Y. # In[3]: def corr1d(X, K): w = K.shape[0] Y = np.zeros((X.shape[0] - w + 1)) for i in range(Y.shape[0]): Y[i] = (X[i: i + w] * K).sum() return Y # As shown in figure below, the input is a one-dimensional array with a width of 7 and the width of the kernel array is 2. As we can see, the output width is $7−2+1=6$ and the first element is obtained by performing multiplication by element on the leftmost input subarray with a width of 2 and kernel array and then summing the results. # # ![One-dimensional cross-correlation operation.](https://raw.githubusercontent.com/d2l-ai/d2l-en/master/img/conv1d.svg?sanitize=true) # In[4]: X, K = np.array([0, 1, 2, 3, 4, 5, 6]), np.array([1, 2]) corr1d(X, K) # We can use the [gluon built-in class `Conv1D()`](https://beta.mxnet.io/api/gluon/_autogen/mxnet.gluon.nn.Conv1D.html) to perform the 1D convolution. # # To use `Conv1D()`, we need to firstly define a new `Sequential()` architecture, "convs", which can stack neural network layers sequentially. # In[5]: num_channels, kernel_sizes = 2, 4 convs = nn.Sequential() convs.add(nn.Conv1D(num_channels, kernel_sizes, activation='relu')) # Then, we randomly initialize its weights with a normal distribution (zero mean and 0.01 standard deviation) through the `initialize()` function. # In[6]: convs.initialize(init.Normal(sigma=0.01)) convs # Note that the required inputs of `Conv1D()` is an 3D input tensor with shape `(batch_size, in_channels, width)`. In the context of NLP, this shape can be interpreted as `(batch_size, word_vector_dimension, number_of_words)`. # ### 2. Max-Over-Time Pooling Layer # # In textCNN, the max-over-time pooling layer equals to a one-dimensional global maximum pooling layer. # # We can use the [gluon built-in class `GlobalMaxPool1D()`](https://beta.mxnet.io/api/gluon/_autogen/mxnet.gluon.nn.GlobalAvgPool1D.html) as below: # # ```python # max_over_time_pooling = nn.GlobalMaxPool1D() # ``` # # ![The max-over-time pooling layer.](https://nbviewer.jupyter.org/format/slides/github/goldmermaid/gtc2020/blob/master/Notebooks/maxpooling.svg) # ### 3. Fully Connected Layer and Dropout # # The fully connected layer is referred as the `Dense()` layer in Gluon ([more detail](https://d2l.ai/chapter_multilayer-perceptrons/mlp-gluon.html) in D2L). # # Besides, a dropout layer `Dropout()` can be used after the fully connected layer to deal with the overfitting problem. # # ![The fully connected layer.](https://nbviewer.jupyter.org/format/slides/github/goldmermaid/gtc2020/blob/master/Notebooks/ff.svg) # In[7]: decoder = nn.Dense(2) # 2 outputs print("Shape of Dense : ", decoder) dropout = nn.Dropout(0.4) # dropout 40% of neurons' weights print("Shape of Dropout : ", decoder) # ## The TextCNN Model # # Now let's put everything together! # Suppose that: # # - the input text sequence consists of $n$ words; # - each word is represented by a $d$-dimension word vector; # # Then the input example has a width of $n$, a height of 1, and $d$ input channels. # ### Model Initialization # # We first initialize the layers of our `textCNN` class. # # ```python # class TextCNN(nn.Block): # def __init__(self, vocab_size, embed_size, kernel_sizes, num_channels, # **kwargs): # super(TextCNN, self).__init__(**kwargs) # self.embedding = nn.Embedding(vocab_size, embed_size) # # The constant embedding layer does not participate in training # self.constant_embedding = nn.Embedding(vocab_size, embed_size) # self.dropout = nn.Dropout(0.5) # self.decoder = nn.Dense(2) # # The max-over-time pooling layer has no weight, so it can share an # # instance # self.pool = nn.GlobalMaxPool1D() # # Create multiple one-dimensional convolutional layers # self.convs = nn.Sequential() # for c, k in zip(num_channels, kernel_sizes): # self.convs.add(nn.Conv1D(c, k, activation='relu')) # ``` # ### The `forward` Function # Now let's write the `forward` function of our `textCNN` class. # # ```python # def forward(self, inputs): # embeddings = np.concatenate(( # self.embedding(inputs), self.constant_embedding(inputs)), # axis=2) # embeddings = embeddings.transpose(0, 2, 1) # encoding = np.concatenate([ # np.squeeze(self.pool(conv(embeddings)), axis=-1) # for conv in self.convs], axis=1) # outputs = self.decoder(self.dropout(encoding)) # return outputs # ``` # # It looks a bit complicated, but we can decompose it to 4 steps. # #### Concatenation # # First, we concatenate the output of two embedding layers with shape of `(batch_size, number_of_words, word_vector_dimension)` by the last dimension as below: # # ```python # embeddings = np.concatenate(( # self.embedding(inputs), self.constant_embedding(inputs)), axis=2) # ``` # #### Transposing # # Second, recall that the required inputs of `Conv1D()` is an 3D input tensor with shape `(batch_size, word_vector_dimension, number_of_words)`, while our current embeddings is of shape `(batch_size, number_of_words, word_vector_dimension)`. Hence, we need to transpose the last two dimensions as below: # # ```python # embeddings = embeddings.transpose(0, 2, 1) # ``` # #### Encoding # # Third, we compute `encoding` as below: # # ```python # encoding = np.concatenate([np.squeeze(self.pool(conv(embeddings)), axis=-1) # for conv in self.convs], axis=1) # ``` # # 1. For each one-dimensional convolutional layer, we apply a max-over-time pooling, i.e., `self.pool()`. # 2. Since the max-over-time pooling is applied at the last dimension of convolution's outputs, the last dimension (axis = -1) will be 1. We use the flatten function `squeeze()` to remove it. # 3. We concatenate the results from varied convolution kernels (axis = 1) by `concatenate()`. # #### Decoding # # Last, we apply the dropout function to randomly dropout some units of the encoding to avoid overfitting (i.e. not rely on one specific unit of encoding too much). And then we apply a fully connected layer as a decoder to obtain the outputs. # # ```python # outputs = self.decoder(self.dropout(encoding)) # ``` # To sum up, here is the full `TextCNN` class: # In[8]: class TextCNN(nn.Block): def __init__(self, vocab_size, embed_size, kernel_sizes, num_channels, **kwargs): super(TextCNN, self).__init__(**kwargs) self.embedding = nn.Embedding(vocab_size, embed_size) # The constant embedding layer does not participate in training self.constant_embedding = nn.Embedding(vocab_size, embed_size) self.dropout = nn.Dropout(0.5) self.decoder = nn.Dense(2) # The max-over-time pooling layer has no weight, so it can share # an instance self.pool = nn.GlobalMaxPool1D() # Create multiple one-dimensional convolutional layers with # different kernel sizes and number of channels self.convs = nn.Sequential() for c, k in zip(num_channels, kernel_sizes): self.convs.add(nn.Conv1D(c, k, activation='relu')) def forward(self, inputs): embeddings = np.concatenate(( self.embedding(inputs), self.constant_embedding(inputs)), axis=2) embeddings = embeddings.transpose(0, 2, 1) encoding = np.concatenate([np.squeeze(self.pool(conv(embeddings)), axis=-1) for conv in self.convs], axis=1) outputs = self.decoder(self.dropout(encoding)) return outputs # ### Load Pre-trained Embedding # # Rather than training from scratch, we load a pre-trained 100-dimensional [GloVe word vectors](https://d2l.ai/chapter_natural-language-processing/glove.html). This step will take several minutes to load. # In[9]: # Load word vectors and query the word vectors that in our vocabulary glove_embedding = text.embedding.create( 'glove', pretrained_file_name='glove.6B.100d.txt') embeds = glove_embedding.get_vecs_by_tokens(vocab.idx_to_token) print("embeds.shape : ", embeds.shape) # ### Training # # Now let's create a `TextCNN` model. Since multiple kernel filters (with varying window sizes) can obtain different features, the original `TextCNN` model applies 3 convolutional layers with kernel widths of 3, 4, and 5, respectively. # # Also, each of the filter window has an output channel with 100 units. # In[10]: embed_size, kernel_sizes, nums_channels = 100, [3, 4, 5], [100, 100, 100] ctx = d2l.try_all_gpus() ## Get the GPUs net = TextCNN(vocab_size=len(vocab), embed_size=embed_size, kernel_sizes=kernel_sizes, num_channels=nums_channels) net.initialize(init.Xavier(), ctx=ctx) net # Then we initialize the embedding layers embedding and constant_embedding using the GloVe embeddings. Here, the former participates in training while the latter has a fixed weight. # In[11]: net.embedding.weight.set_data(embeds) net.constant_embedding.weight.set_data(embeds) net.constant_embedding.collect_params().setattr('grad_req', 'null') # To train our `TextCNN` model, we also need to define: # 1. the learning rate `lr`, # 2. the number of epochs `num_epochs`, # 3. the optimizer `adam`, # 4. the loss function `SoftmaxCrossEntropyLoss()`. # # # For simplicity, we call the built-in function `train_ch13` ([more detail in D2L](https://d2l.ai/chapter_computer-vision/image-augmentation.html?highlight=train_ch13#using-an-image-augmentation-training-model)) to train. # In[12]: lr, num_epochs = 0.001, 5 trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': lr}) loss = gluon.loss.SoftmaxCELoss() d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, ctx) # Now is the time to use our trained model to classify sentiments of two simple sentences. # In[13]: d2l.predict_sentiment(net, vocab, 'this movie is so amazing!')