%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import ktrain
from ktrain import text
Using TensorFlow backend.
using Keras version: 2.2.4
In this notebook, we will build a Chinese-language text classification model in 3 simple steps. More specifically, we will build a model that classifies Chinese hotel reviews as either positive or negative.
The dataset can be downloaded from Chengwei Zhang's GitHub repository here.
(Disclaimer: I don't speak Chinese. Please forgive mistakes.)
First, we use the texts_from_folder function to load and preprocess the data. We assume that the data is in the following form:
├── datadir
│   ├── train
│   │   ├── class0       # folder containing documents of class 0
│   │   ├── class1       # folder containing documents of class 1
│   │   ├── class2       # folder containing documents of class 2
│   │   └── classN       # folder containing documents of class N
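To make the layout concrete, here is a minimal standard-library sketch of reading one-folder-per-class data into parallel lists of texts and labels. It is only an illustration of the idea: the helper name load_texts_from_folder is hypothetical, and ktrain's own texts_from_folder does considerably more (tokenization, language detection, encoding detection).

```python
import os
import tempfile

def load_texts_from_folder(train_dir, classes, encoding="utf-8"):
    """Hypothetical helper: return (texts, labels), where each label
    is the index of the document's class folder in `classes`."""
    texts, labels = [], []
    for label, class_name in enumerate(classes):
        class_dir = os.path.join(train_dir, class_name)
        for fname in sorted(os.listdir(class_dir)):
            with open(os.path.join(class_dir, fname), encoding=encoding) as f:
                texts.append(f.read())
            labels.append(label)
    return texts, labels

# Build a tiny throwaway dataset in the same layout and load it back.
with tempfile.TemporaryDirectory() as datadir:
    for class_name, review in [("pos", "很好"), ("neg", "很差")]:
        class_dir = os.path.join(datadir, "train", class_name)
        os.makedirs(class_dir)
        with open(os.path.join(class_dir, "0.txt"), "w", encoding="utf-8") as f:
            f.write(review)
    texts, labels = load_texts_from_folder(
        os.path.join(datadir, "train"), classes=["pos", "neg"]
    )

print(texts, labels)
```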
We set val_pct as 0.1, which will automatically sample 10% of the data for validation. Since we will be using a pretrained BERT model for classification, we specify preprocess_mode='bert'. If you are using any other model (e.g., fasttext), you should either omit this parameter or use preprocess_mode='standard'.
Notice that there is nothing special or extra we need to do here for non-English text. ktrain automatically detects the language and character encoding, prepares the data, and configures the model appropriately.
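As a rough illustration of what holding out 10% for validation means, the sketch below shuffles a list of items and splits off a validation share. The helper holdout_split is hypothetical and is not ktrain's implementation; rounding the validation size up is an assumption chosen because it agrees with the sample counts reported in the training log further down (5324 training and 592 validation samples, 5916 in total).

```python
import math
import random

def holdout_split(items, val_pct=0.1, seed=0):
    """Hypothetical helper: shuffle `items` and hold out a val_pct share.
    The validation size is rounded up (an assumption, see the text above)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_val = math.ceil(len(items) * val_pct)
    return items[n_val:], items[:n_val]

# 5916 stands in for the number of usable documents in this dataset.
train_items, val_items = holdout_split(range(5916), val_pct=0.1)
print(len(train_items), len(val_items))
```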
trn, val, preproc = text.texts_from_folder('/home/amaiya/data/ChnSentiCorp_htl_ba_6000',
                                           maxlen=75,
                                           max_features=30000,
                                           preprocess_mode='bert',
                                           train_test_names=['train'],
                                           val_pct=0.1,
                                           classes=['pos', 'neg'])
detected encoding: GB18030
Decoding with GB18030 failed 1st attempt - using GB18030 with skips
skipped 109 lines (0.3%) due to character decoding errors
skipped 9 lines (0.2%) due to character decoding errors
preprocessing train...
language: zh-cn
preprocessing test...
language: zh-cn
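The "with skips" messages above suggest that lines which cannot be decoded are dropped rather than aborting the load. The following is a minimal sketch of that idea, assuming line-level skipping; the function name decode_skipping_bad_lines is hypothetical and this is not ktrain's actual code.

```python
def decode_skipping_bad_lines(raw, encoding):
    """Hypothetical helper: decode `raw` bytes line by line,
    dropping any line that is not valid in `encoding`."""
    decoded, skipped = [], 0
    for line in raw.split(b"\n"):
        try:
            decoded.append(line.decode(encoding))
        except UnicodeDecodeError:
            skipped += 1
    return decoded, skipped

# Two valid GB18030 lines with one undecodable line between them.
raw = (
    "酒店很好".encode("gb18030")
    + b"\n" + b"\xff\xff" + b"\n"
    + "服务不错".encode("gb18030")
)
lines, skipped = decode_skipping_bad_lines(raw, "gb18030")
print(lines, skipped)
```

Skipping at the line level keeps the rest of a document usable when only a few bytes are corrupt, which is consistent with the small percentages of skipped lines reported above.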
model = text.text_classifier('bert', trn, preproc=preproc)
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=32)
Is Multi-Label? False
maxlen is 75
done.
learner.fit_onecycle(2e-5, 4, checkpoint_folder='/tmp/saved_weights')
begin training using onecycle policy with max lr of 2e-05...
Train on 5324 samples, validate on 592 samples
Epoch 1/4
5324/5324 [==============================] - 54s 10ms/step - loss: 0.3635 - acc: 0.8422 - val_loss: 0.2793 - val_acc: 0.8801
Epoch 2/4
5324/5324 [==============================] - 42s 8ms/step - loss: 0.2151 - acc: 0.9170 - val_loss: 0.2501 - val_acc: 0.9223
Epoch 3/4
5324/5324 [==============================] - 42s 8ms/step - loss: 0.1153 - acc: 0.9591 - val_loss: 0.2267 - val_acc: 0.9257
Epoch 4/4
5324/5324 [==============================] - 42s 8ms/step - loss: 0.0438 - acc: 0.9859 - val_loss: 0.2596 - val_acc: 0.9324
<keras.callbacks.History at 0x7f4409157438>
Although epoch 3 had the lowest validation loss (0.2267), the validation accuracy at the end of the last epoch is still the highest (93.24%), so we will just leave the model weights as they are this time.
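The comparison can be checked directly from the numbers in the training log. This small sketch copies the per-epoch validation metrics by hand and reports which epoch is best under each criterion; the variable names are illustrative only.

```python
# Validation metrics copied from the training log, one entry per epoch.
val_loss = [0.2793, 0.2501, 0.2267, 0.2596]
val_acc = [0.8801, 0.9223, 0.9257, 0.9324]

# Epochs are numbered from 1 in the log, hence the +1.
best_loss_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
best_acc_epoch = max(range(len(val_acc)), key=val_acc.__getitem__) + 1

print("lowest val_loss at epoch", best_loss_epoch)
print("highest val_acc at epoch", best_acc_epoch)
```

The two criteria disagree here, so which weights to keep depends on whether loss or accuracy matters more for the application.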
----------
id:299 | loss:7.37 | true:neg | pred:pos)

[CLS]酒店位置佳，出入西街比较方便；观景房是观西街的景，晚上虽然较吵但基本不会影响到睡眠；酒店对面就是在携程上预订自行车的九九车行，取车方便。通过携程订[SEP]
Using Google Translate, the above roughly translates to:
Hotel location is good, access to West Street is more convenient; viewing room is the view of Guanxi Street, although the night is noisy, but it will not affect sleep; the opposite side of the hotel is the Jiujiu car line booking bicycles on Ctrip, easy to pick up. By Ctrip
Although there is a minor negative comment embedded in this review about noise, the review appears to be overall positive and was predicted as positive by our classifier. The ground-truth label, however, is negative, which may be a mistake and may explain the high loss.
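To get a feel for how large a loss of 7.37 is, the sketch below converts it back into a probability. This assumes the reported value is the per-example categorical cross-entropy with a natural logarithm, in which case the loss is the negative log of the probability the model assigned to the true class. That assumption is not confirmed by the output above.

```python
import math

# Assumption: loss = -ln(p_true), i.e. per-example categorical cross-entropy.
loss = 7.37
p_true = math.exp(-loss)   # probability assigned to the true label "neg"
p_pred = 1.0 - p_true      # remaining mass goes to "pos" in a two-class model

print(f"p(neg) = {p_true:.5f}, p(pos) = {p_pred:.5f}")
```

Under that assumption the model gave the labeled class well under 0.1% probability, so it was very confident in "pos", which fits the suspicion that the ground-truth label is wrong.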
p = ktrain.get_predictor(learner.model, preproc)
Predicting label for the text
"The view and service of this hotel were terrible and our room was dirty."
Predicting label for:
"I like the service of this hotel."
p = ktrain.load_predictor('/tmp/mypred')
# still works
p.predict('我喜欢这家酒店的服务')