In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

Sentence Pair Classification with ktrain

This notebook demonstrates sentence pair classification with ktrain.

Download a Sentence Pair Classification Dataset

In this notebook, we will use the Microsoft Research Paraphrase Corpus (MRPC) to build a model that can detect pairs of sentences that are paraphrases of one another. The MRPC train and test datasets can be downloaded from Microsoft Research and are assumed to reside in the data/mrpc directory used below.

Once downloaded, we will prepare the datasets as arrays of sentence pairs.

In [2]:
import pandas as pd
import csv
TRAIN = 'data/mrpc/msr_paraphrase_train.txt'
TEST = 'data/mrpc/msr_paraphrase_test.txt'
train_df = pd.read_csv(TRAIN, delimiter='\t', quoting=csv.QUOTE_NONE)
test_df = pd.read_csv(TEST, delimiter='\t', quoting=csv.QUOTE_NONE)
x_train = train_df[['#1 String', '#2 String']].values
y_train = train_df['Quality'].values
x_test = test_df[['#1 String', '#2 String']].values
y_test = test_df['Quality'].values


# IMPORTANT: data format for sentence pair classification is list of tuples of form (str, str)
x_train = list(map(tuple, x_train))
x_test = list(map(tuple, x_test))
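
As a quick sanity check before preprocessing, we can verify that every element really is a tuple of two strings, the format noted in the comment above. This is a minimal sketch independent of ktrain; the sample pairs are illustrative, not from MRPC:

```python
def check_pairs(pairs):
    """Return True if every element is a tuple of exactly two strings,
    the format expected for sentence pair classification."""
    return all(
        isinstance(p, tuple)
        and len(p) == 2
        and all(isinstance(s, str) for s in p)
        for p in pairs
    )

# Illustrative mini-batch showing the expected shape
sample = [("He bought a car.", "He purchased a car."),
          ("It rained.", "The sun was out.")]
print(check_pairs(sample))             # True
print(check_pairs([["a list", "not a tuple"]]))  # False
```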
In [3]:
print(x_train[0])
print(y_train[0])
('Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .')
1

Build and Train a BERT Model

For demonstration purposes, we train for only three epochs.

In [4]:
import ktrain
from ktrain import text
MODEL_NAME = 'bert-base-uncased'
t = text.Transformer(MODEL_NAME, maxlen=128, class_names=['not paraphrase', 'paraphrase'])
trn = t.preprocess_train(x_train, y_train)
val = t.preprocess_test(x_test, y_test)
model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=32) # reduce batch_size if OOM occurs
learner.fit_onecycle(5e-5, 3)
preprocessing train...
language: en
preprocessing test...
language: en
begin training using onecycle policy with max lr of 5e-05...
Train for 128 steps, validate for 54 steps
Epoch 1/3
128/128 [==============================] - 66s 518ms/step - loss: 0.5913 - accuracy: 0.6796 - val_loss: 0.5731 - val_accuracy: 0.7328
Epoch 2/3
128/128 [==============================] - 50s 390ms/step - loss: 0.3982 - accuracy: 0.8182 - val_loss: 0.4072 - val_accuracy: 0.8354
Epoch 3/3
128/128 [==============================] - 50s 390ms/step - loss: 0.1550 - accuracy: 0.9495 - val_loss: 0.4492 - val_accuracy: 0.8504
Out[4]:
<tensorflow.python.keras.callbacks.History at 0x7f56501a5320>

Make Predictions

In [5]:
predictor = ktrain.get_predictor(learner.model, t)

Let's select a positive and negative example from x_test.

In [6]:
y_test[:5]
Out[6]:
array([1, 1, 1, 0, 0])
In [12]:
positive = x_test[0]
negative = x_test[4]
In [13]:
print('Valid Paraphrase:\n%s' %(positive,))
Valid Paraphrase:
("PCCW 's chief operating officer , Mike Butcher , and Alex Arena , the chief financial officer , will report directly to Mr So .", 'Current Chief Operating Officer Mike Butcher and Group Chief Financial Officer Alex Arena will report to So .')
In [14]:
print('Invalid Paraphrase:\n%s' %(negative,))
Invalid Paraphrase:
("The company didn 't detail the costs of the replacement and repairs .", 'But company officials expect the costs of the replacement work to run into the millions of dollars .')
In [15]:
predictor.predict(positive)
Out[15]:
'paraphrase'
In [16]:
predictor.predict(negative)
Out[16]:
'not paraphrase'
In [17]:
predictor.predict([positive, negative])
Out[17]:
['paraphrase', 'not paraphrase']
In [18]:
predictor.save('/tmp/mrpc_model')
In [19]:
p = ktrain.load_predictor('/tmp/mrpc_model')
In [20]:
p.predict(positive)
Out[20]:
'paraphrase'
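
Conceptually, predict() takes the model's class probabilities and returns the name of the most likely class, using the class_names passed to text.Transformer earlier. A minimal pure-Python sketch of that final decoding step (the probabilities here are illustrative, not ktrain's internals):

```python
class_names = ['not paraphrase', 'paraphrase']

def decode(probs, class_names):
    """Map a probability vector to the class name with the highest score."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return class_names[best]

print(decode([0.12, 0.88], class_names))  # paraphrase
print(decode([0.91, 0.09], class_names))  # not paraphrase
```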
In [ ]: