%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
This notebook demonstrates sentence pair classification with ktrain. We will use the Microsoft Research Paraphrase Corpus (MRPC) to build a model that detects whether two sentences are paraphrases of one another. The MRPC train and test sets (msr_paraphrase_train.txt and msr_paraphrase_test.txt) can be downloaded from Microsoft Research.
Once downloaded, we will prepare the datasets as arrays of sentence pairs.
import pandas as pd
import csv
TRAIN = 'data/mrpc/msr_paraphrase_train.txt'
TEST = 'data/mrpc/msr_paraphrase_test.txt'
train_df = pd.read_csv(TRAIN, delimiter='\t', quoting=csv.QUOTE_NONE)
test_df = pd.read_csv(TEST, delimiter='\t', quoting=csv.QUOTE_NONE)
x_train = train_df[['#1 String', '#2 String']].values
y_train = train_df['Quality'].values
x_test = test_df[['#1 String', '#2 String']].values
y_test = test_df['Quality'].values
# IMPORTANT: data format for sentence pair classification is list of tuples of form (str, str)
x_train = list(map(tuple, x_train))
x_test = list(map(tuple, x_test))
print(x_train[0])
print(y_train[0])
('Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .') 1
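As a quick sanity check on the required (str, str) tuple format, here is a minimal, self-contained sketch using a synthetic stand-in for the MRPC dataframe (the sentences and labels below are made up for illustration; the column names mirror the real file):

```python
import pandas as pd

# synthetic stand-in for the MRPC dataframe
df = pd.DataFrame({
    '#1 String': ['The cat sat.', 'It is raining.'],
    '#2 String': ['A cat was sitting.', 'The sun is out.'],
    'Quality': [1, 0],
})

# same conversion as above: rows -> list of (str, str) tuples
pairs = list(map(tuple, df[['#1 String', '#2 String']].values))
labels = df['Quality'].values

print(pairs[0])        # ('The cat sat.', 'A cat was sitting.')
print(type(pairs[0]))  # <class 'tuple'>
```

Each element must be a plain tuple of two strings, not a list or a NumPy row; this is what signals to ktrain that the task is sentence *pair* classification.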
BERT Model

For demonstration purposes, we only train for 3 epochs.
import ktrain
from ktrain import text
MODEL_NAME = 'bert-base-uncased'
t = text.Transformer(MODEL_NAME, maxlen=128, class_names=['not paraphrase', 'paraphrase'])
trn = t.preprocess_train(x_train, y_train)
val = t.preprocess_test(x_test, y_test)
model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=32) # lower bs if OOM occurs
learner.fit_onecycle(5e-5, 3)
preprocessing train... language: en
preprocessing test... language: en
begin training using onecycle policy with max lr of 5e-05...
Train for 128 steps, validate for 54 steps
Epoch 1/3
128/128 [==============================] - 66s 518ms/step - loss: 0.5913 - accuracy: 0.6796 - val_loss: 0.5731 - val_accuracy: 0.7328
Epoch 2/3
128/128 [==============================] - 50s 390ms/step - loss: 0.3982 - accuracy: 0.8182 - val_loss: 0.4072 - val_accuracy: 0.8354
Epoch 3/3
128/128 [==============================] - 50s 390ms/step - loss: 0.1550 - accuracy: 0.9495 - val_loss: 0.4492 - val_accuracy: 0.8504
<tensorflow.python.keras.callbacks.History at 0x7f56501a5320>
predictor = ktrain.get_predictor(learner.model, t)
Let's select a positive and a negative example from x_test.
y_test[:5]
array([1, 1, 1, 0, 0])
positive = x_test[0]
negative = x_test[4]
print('Valid Paraphrase:\n%s' %(positive,))
Valid Paraphrase: ("PCCW 's chief operating officer , Mike Butcher , and Alex Arena , the chief financial officer , will report directly to Mr So .", 'Current Chief Operating Officer Mike Butcher and Group Chief Financial Officer Alex Arena will report to So .')
print('Invalid Paraphrase:\n%s' %(negative,))
Invalid Paraphrase: ("The company didn 't detail the costs of the replacement and repairs .", 'But company officials expect the costs of the replacement work to run into the millions of dollars .')
predictor.predict(positive)
'paraphrase'
predictor.predict(negative)
'not paraphrase'
predictor.predict([positive, negative])
['paraphrase', 'not paraphrase']
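If you need class probabilities rather than labels, ktrain predictors accept a `return_proba=True` argument to `predict`. Mapping the resulting probability array back to class names is just an argmax over the class axis, sketched here with made-up probabilities standing in for the model output:

```python
import numpy as np

# class order matches the class_names passed to text.Transformer above
class_names = ['not paraphrase', 'paraphrase']

# made-up probabilities standing in for
# predictor.predict([positive, negative], return_proba=True)
probs = np.array([[0.11, 0.89],
                  [0.97, 0.03]])

# pick the highest-probability class for each pair
labels = [class_names[i] for i in probs.argmax(axis=1)]
print(labels)  # ['paraphrase', 'not paraphrase']
```

The probability column order follows the class_names list given to the preprocessor, so the same indexing works for real model outputs.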
predictor.save('/tmp/mrpc_model')
p = ktrain.load_predictor('/tmp/mrpc_model')
p.predict(positive)
'paraphrase'