In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID";
os.environ["CUDA_VISIBLE_DEVICES"]="0"; 
In [2]:
import ktrain
from ktrain import text
Using TensorFlow backend.
using Keras version: 2.2.4

Building a Chinese-Language Sentiment Analyzer

In this notebook, we will build a Chinese-language text classification model in 3 simple steps. More specifically, we will build a model that classifies Chinese hotel reviews as either positive or negative.

The dataset can be downloaded from Chengwei Zhang's GitHub repository.

(Disclaimer: I don't speak Chinese. Please forgive mistakes.)

STEP 1: Load and Preprocess the Data

First, we use the texts_from_folder function to load and preprocess the data. We assume that the data is in the following form:

    ├── datadir
    │   ├── train
    │   │   ├── class0       # folder containing documents of class 0
    │   │   ├── class1       # folder containing documents of class 1
    │   │   ├── class2       # folder containing documents of class 2
    │   │   └── classN       # folder containing documents of class N

We set val_pct to 0.1, which will automatically sample 10% of the data for validation. Since we will be using a pretrained BERT model for classification, we specify preprocess_mode='bert'. If you are using any other model (e.g., fasttext), you should either omit this parameter or use preprocess_mode='standard'.

Notice that there is nothing special or extra we need to do here for non-English text. ktrain automatically detects the language and character encoding, prepares the data, and configures the model appropriately.

In [3]:
trn, val, preproc = text.texts_from_folder('/home/amaiya/data/ChnSentiCorp_htl_ba_6000', 
                                            maxlen=75, 
                                            max_features=30000,
                                            preprocess_mode='bert',
                                            train_test_names=['train'],
                                            val_pct=0.1,
                                            classes=['pos', 'neg'])
detected encoding: GB18030
Decoding with GB18030 failed 1st attempt - using GB18030 with skips
skipped 109 lines (0.3%) due to character decoding errors
skipped 9 lines (0.2%) due to character decoding errors
preprocessing train...
language: zh-cn
done.
preprocessing test...
language: zh-cn
done.

STEP 2: Create a Model and Wrap in Learner Object

In [4]:
model = text.text_classifier('bert', trn, preproc=preproc)
learner = ktrain.get_learner(model, 
                             train_data=trn, 
                             val_data=val, 
                             batch_size=32)
Is Multi-Label? False
maxlen is 75
done.
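
Before training, we could optionally run ktrain's learning-rate finder to help choose a maximum learning rate (2e-5, used below, is a common choice for BERT). A minimal sketch using the lr_find and lr_plot methods of the Learner object:

    # simulate training while gradually increasing the learning rate,
    # then plot loss vs. learning rate to pick a good maximum
    learner.lr_find()
    learner.lr_plot()   # choose a rate where the loss is still falling steeply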

STEP 3: Train the Model

We will use the fit_onecycle method, which employs a 1cycle learning rate policy, and train for four epochs. We will save the weights from each epoch using the checkpoint_folder argument so that we can reload the weights from the best epoch in case we overfit.

In [5]:
learner.fit_onecycle(2e-5, 4, checkpoint_folder='/tmp/saved_weights')

begin training using onecycle policy with max lr of 2e-05...
Train on 5324 samples, validate on 592 samples
Epoch 1/4
5324/5324 [==============================] - 54s 10ms/step - loss: 0.3635 - acc: 0.8422 - val_loss: 0.2793 - val_acc: 0.8801
Epoch 2/4
5324/5324 [==============================] - 42s 8ms/step - loss: 0.2151 - acc: 0.9170 - val_loss: 0.2501 - val_acc: 0.9223
Epoch 3/4
5324/5324 [==============================] - 42s 8ms/step - loss: 0.1153 - acc: 0.9591 - val_loss: 0.2267 - val_acc: 0.9257
Epoch 4/4
5324/5324 [==============================] - 42s 8ms/step - loss: 0.0438 - acc: 0.9859 - val_loss: 0.2596 - val_acc: 0.9324
Out[5]:
<keras.callbacks.History at 0x7f4409157438>

Although epoch 3 had the lowest validation loss, the final validation accuracy at the end of the last epoch is still the highest (i.e., 93.24%), so we will leave the model weights as they are this time.
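
If we had instead wanted to roll back to the epoch with the lowest validation loss, we could reload that epoch's weights from the checkpoint folder. A minimal sketch, assuming ktrain's default per-epoch checkpoint naming of weights-NN.hdf5:

    # reload weights saved after epoch 3, which had the lowest validation loss
    # (the weights-03.hdf5 filename assumes ktrain's default checkpoint naming)
    learner.model.load_weights('/tmp/saved_weights/weights-03.hdf5')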

Inspecting Misclassifications

In [7]:
learner.view_top_losses(n=1, preproc=preproc)
----------
id:299 | loss:7.37 | true:neg | pred:pos)

[CLS]酒店位置佳,出入西街比较方便;观景房是观西街的景,晚上虽然较吵但基本不会影响到睡眠;酒店对面就是在携程上预订自行车的九九车行,取车方便。通过携程订[SEP]

Using Google Translate, the above roughly translates to:


Hotel location is good, access to West Street is more convenient; viewing room is the view of Guanxi Street, although the night is noisy, but it will not affect sleep; the opposite side of the hotel is the Jiujiu car line booking bicycles on Ctrip, easy to pick up. By Ctrip

Although this review contains a minor negative comment about the noise, it appears to be overall positive and was predicted as positive by our classifier. The ground-truth label, however, is negative, which may be a labeling mistake and may explain the high loss.
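
To see where else the model errs on the validation set, we could also print a classification report and confusion matrix. A minimal sketch using the Learner's validate method, with class_names supplied so the report is labeled:

    # per-class precision/recall and a confusion matrix on the validation set
    learner.validate(class_names=preproc.get_classes())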

Making Predictions on New Data

In [8]:
p = ktrain.get_predictor(learner.model, preproc)

Predicting the label for the text:

"The view and service of this hotel were terrible and our room was dirty."

In [9]:
p.predict("这家酒店的看法和服务都很糟糕,我们的房间很脏。")
Out[9]:
'neg'

Predicting the label for the text:

"I like the service of this hotel."

In [10]:
p.predict('我喜欢这家酒店的服务')
Out[10]:
'pos'
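
If we want class probabilities rather than just a label, predict accepts a return_proba flag; the column order corresponds to p.get_classes():

    # return the predicted probability of each class instead of a label
    p.predict('我喜欢这家酒店的服务', return_proba=True)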

Save Predictor for Later Deployment

In [11]:
p.save('/tmp/mypred')
In [12]:
p = ktrain.load_predictor('/tmp/mypred')
In [13]:
# the reloaded predictor still works
p.predict('我喜欢这家酒店的服务')
Out[13]:
'pos'
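
The reloaded predictor also accepts a list of documents and returns one label per document, which is convenient at deployment time. A minimal sketch using the two reviews from above:

    # predict labels for a batch of reviews at once
    reviews = ['我喜欢这家酒店的服务',
               '这家酒店的看法和服务都很糟糕,我们的房间很脏。']
    p.predict(reviews)   # returns a list of labels, e.g., ['pos', 'neg']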