%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import ktrain
from ktrain import text
Using TensorFlow backend.
using Keras version: 2.2.4
In this notebook, we will build a Chinese-language text classification model in 3 simple steps. More specifically, we will build a model that classifies Chinese hotel reviews as either positive or negative.
The dataset can be downloaded from Chengwei Zhang's GitHub repository here.
(Disclaimer: I don't speak Chinese. Please forgive mistakes.)
First, we use the texts_from_folder function to load and preprocess the data. We assume that the data is in the following form:
├── datadir
│   ├── train
│   │   ├── class0       # folder containing documents of class 0
│   │   ├── class1       # folder containing documents of class 1
│   │   ├── class2       # folder containing documents of class 2
│   │   └── classN       # folder containing documents of class N
We set val_pct to 0.1, which will automatically sample 10% of the data for validation. Since we will be using a pretrained BERT model for classification, we specify preprocess_mode='bert'. If you are using any other model (e.g., fasttext), you should either omit this parameter or use preprocess_mode='standard'.
Notice that there is nothing special or extra we need to do here for non-English text. ktrain automatically detects the language and character encoding, prepares the data, and configures the model appropriately.
trn, val, preproc = text.texts_from_folder('/home/amaiya/data/ChnSentiCorp_htl_ba_6000',
maxlen=75,
max_features=30000,
preprocess_mode='bert',
train_test_names=['train'],
val_pct=0.1,
classes=['pos', 'neg'])
detected encoding: GB18030
Decoding with GB18030 failed 1st attempt - using GB18030 with skips
skipped 109 lines (0.3%) due to character decoding errors
skipped 9 lines (0.2%) due to character decoding errors
preprocessing train...
language: zh-cn
preprocessing test... language: zh-cn
model = text.text_classifier('bert', trn, preproc=preproc)
learner = ktrain.get_learner(model,
train_data=trn,
val_data=val,
batch_size=32)
Is Multi-Label? False
maxlen is 75
done.
We will use the fit_onecycle method, which employs a 1cycle learning rate policy, for four epochs. We will save the weights from each epoch using the checkpoint_folder argument, so that we can reload the weights from the best epoch in case we overfit.
learner.fit_onecycle(2e-5, 4, checkpoint_folder='/tmp/saved_weights')
begin training using onecycle policy with max lr of 2e-05...
Train on 5324 samples, validate on 592 samples
Epoch 1/4
5324/5324 [==============================] - 54s 10ms/step - loss: 0.3635 - acc: 0.8422 - val_loss: 0.2793 - val_acc: 0.8801
Epoch 2/4
5324/5324 [==============================] - 42s 8ms/step - loss: 0.2151 - acc: 0.9170 - val_loss: 0.2501 - val_acc: 0.9223
Epoch 3/4
5324/5324 [==============================] - 42s 8ms/step - loss: 0.1153 - acc: 0.9591 - val_loss: 0.2267 - val_acc: 0.9257
Epoch 4/4
5324/5324 [==============================] - 42s 8ms/step - loss: 0.0438 - acc: 0.9859 - val_loss: 0.2596 - val_acc: 0.9324
<keras.callbacks.History at 0x7f4409157438>
Although epoch 3 had the lowest validation loss, the validation accuracy at the end of the final epoch is still the highest (93.24%), so we will leave the model weights as they are this time.
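If the model had overfit, we could restore an earlier epoch's weights from the checkpoint folder instead. A minimal sketch, assuming a per-epoch filename pattern like weights-NN.hdf5 (the actual names may differ across ktrain versions, so check the folder's contents first):

```python
# Hypothetical sketch: restore the weights saved after the epoch with the
# lowest validation loss. The filename pattern below is an assumption.
best_epoch = 3  # epoch 3 had the lowest validation loss above
weights_path = '/tmp/saved_weights/weights-%02d.hdf5' % best_epoch
# learner.model.load_weights(weights_path)  # uncomment to restore
```

The load call itself is commented out because it requires the trained learner and the saved weight files from the run above.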
learner.view_top_losses(n=1, preproc=preproc)
----------
id:299 | loss:7.37 | true:neg | pred:pos)
[CLS]酒店位置佳,出入西街比较方便;观景房是观西街的景,晚上虽然较吵但基本不会影响到睡眠;酒店对面就是在携程上预订自行车的九九车行,取车方便。通过携程订[SEP]
Using Google Translate, the above roughly translates to:
Hotel location is good, access to West Street is more convenient; viewing room is the view of Guanxi Street, although the night is noisy, but it will not affect sleep; the opposite side of the hotel is the Jiujiu car line booking bicycles on Ctrip, easy to pick up. By Ctrip
Although there is a minor negative comment embedded in this review about noise, the review appears to be overall positive and was predicted as positive by our classifier. The ground-truth label, however, is negative, which may be a mistake and may explain the high loss.
p = ktrain.get_predictor(learner.model, preproc)
Predicting label for the text
"The view and service of this hotel were terrible and our room was dirty."
p.predict("这家酒店的看法和服务都很糟糕,我们的房间很脏。")
'neg'
Predicting label for:
"I like the service of this hotel."
p.predict('我喜欢这家酒店的服务')
'pos'
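Beyond the hard label, ktrain predictors can also return class probabilities via return_proba=True. The sketch below shows how the label follows from the probabilities; the actual call is commented out because it needs the trained predictor p, and the probability values shown are made up purely for illustration (the class order should match preproc.get_classes()):

```python
import numpy as np

# probs = p.predict('我喜欢这家酒店的服务', return_proba=True)
# Hypothetical probabilities for illustration only, in the order of the
# classes we supplied to texts_from_folder:
probs = np.array([0.97, 0.03])
classes = ['pos', 'neg']
label = classes[int(np.argmax(probs))]  # pick the most probable class
```

This is handy when you want a confidence threshold rather than an unconditional label.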
p.save('/tmp/mypred')
p = ktrain.load_predictor('/tmp/mypred')
# still works
p.predict('我喜欢这家酒店的服务')
'pos'