%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID";
os.environ["CUDA_VISIBLE_DEVICES"]="0";
import ktrain
from ktrain import text
using Keras version: 2.2.4-tf
This notebook shows an example of text regression in ktrain. Given a textual description of a wine, we will attempt to predict its price. The data is available from FloydHub here.
We will perform the same data preparation as the original FloydHub example notebook that inspired this example.
import pandas as pd
import numpy as np
path = 'data/wine/wine_data.csv' # ADD path/to/dataset
data = pd.read_csv(path)
data = data.sample(frac=1., random_state=0)
data.head()
| | Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 8486 | 8486 | Italy | Made entirely from Nero d'Avola, this opens wi... | Violino | 89 | 20.0 | Sicily & Sardinia | Vittoria | NaN | Nero d'Avola | Paolo Calì |
| 148584 | 148585 | Portugal | Warre's seems to have found just the right for... | Otima 20-year old tawny | 90 | 42.0 | Port | NaN | NaN | Port | Warre's |
| 18353 | 18353 | Italy | A more evolved and sophisticated expression of... | Campogrande | 87 | 23.0 | Veneto | Soave Superiore | NaN | Garganega | Sandro de Bruno |
| 5281 | 5281 | Spain | Red-fruit and citrus aromas create an astringe... | NaN | 84 | 12.0 | Northern Spain | Ribera del Duero | NaN | Tempranillo | Condado de Oriza |
| 87768 | 87768 | US | Lightly funky and showing definite signs of ea... | Lia's Vineyard | 89 | 35.0 | Oregon | Chehalem Mountains | Willamette Valley | Pinot Noir | Seven of Hearts |
# this code was taken directly from FloydHub's regression template for
# wine price prediction: https://github.com/floydhub/regression-template
# Remove rows with missing country or price values
data = data[pd.notnull(data['country'])]
data = data[pd.notnull(data['price'])]
data = data.drop(data.columns[0], axis=1)  # drop the redundant 'Unnamed: 0' index column
variety_threshold = 500 # Anything that occurs less than this will be removed.
value_counts = data['variety'].value_counts()
to_remove = value_counts[value_counts <= variety_threshold].index
data.replace(to_remove, np.nan, inplace=True)  # mark rare varieties as NaN
data = data[pd.notnull(data['variety'])]       # drop rows with rare (now NaN) varieties
# Split data into train and test
train_size = int(len(data) * .8)
print ("Train size: %d" % train_size)
print ("Test size: %d" % (len(data) - train_size))
# Train features
description_train = data['description'][:train_size]
variety_train = data['variety'][:train_size]
# Train labels
labels_train = data['price'][:train_size]
# Test features
description_test = data['description'][train_size:]
variety_test = data['variety'][train_size:]
# Test labels
labels_test = data['price'][train_size:]
x_train = description_train.values
y_train = labels_train.values
x_test = description_test.values
y_test = labels_test.values
Train size: 95646
Test size: 23912
trn, val, preproc = text.texts_from_array(x_train=x_train, y_train=y_train,
x_test=x_test, y_test=y_test,
ngram_range=3,
maxlen=200,
max_features=35000)
task: text regression (supply class_names argument if this is supposed to be classification task)
language: en
Word Counts: 30953
Nrows: 95646
95646 train sequences
train sequence lengths:
    mean : 41
    95percentile : 62
    99percentile : 74
Adding 3-gram features
max_features changed to 1769319 with addition of ngrams
Average train sequence length with ngrams: 120
train (w/ngrams) sequence lengths:
    mean : 121
    95percentile : 183
    99percentile : 219
x_train shape: (95646,200)
y_train shape: 95646
23912 test sequences
test sequence lengths:
    mean : 41
    95percentile : 62
    99percentile : 73
Average test sequence length with ngrams: 111
test (w/ngrams) sequence lengths:
    mean : 112
    95percentile : 172
    99percentile : 207
x_test shape: (23912,200)
y_test shape: 23912
text.print_text_regression_models()
fasttext: a fastText-like model [http://arxiv.org/pdf/1607.01759.pdf]
linreg: linear text regression using a trainable Embedding layer
bigru: Bidirectional GRU with pretrained word vectors [https://arxiv.org/abs/1712.09405]
standard_gru: simple 2-layer GRU with randomly initialized embeddings
bert: Bidirectional Encoder Representations from Transformers (BERT) [https://arxiv.org/abs/1810.04805]
distilbert: distilled, smaller, and faster BERT from Hugging Face [https://arxiv.org/abs/1910.01108]
model = text.text_regression_model('linreg', train_data=trn, preproc=preproc)
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=256)
maxlen is 200
done.
Lower the batch size above if you run out of GPU memory.
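For example, if memory does become an issue, the learner can simply be recreated with a smaller batch size. A minimal sketch (the value of 64 below is only an illustrative choice):
# recreate the learner with a smaller batch size to reduce GPU memory usage
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=64)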
learner.lr_find()
simulating training for different learning rates... this may take a few moments...
Train on 95646 samples
Epoch 1/1024
95646/95646 [==============================] - 8s 81us/sample - loss: 2627.6407 - mae: 34.2769
Epoch 2/1024
95646/95646 [==============================] - 7s 70us/sample - loss: 2610.0313 - mae: 34.0299
Epoch 3/1024
95646/95646 [==============================] - 7s 70us/sample - loss: 2148.5174 - mae: 26.8848
Epoch 4/1024
95646/95646 [==============================] - 7s 71us/sample - loss: 1158.6146 - mae: 15.1160
Epoch 5/1024
15360/95646 [===>..........................] - ETA: 5s - loss: 4022.5116 - mae: 36.6476
done. Please invoke the Learner.lr_plot() method to visually inspect the loss plot to help identify the maximal learning rate associated with falling loss.
learner.lr_plot()
learner.fit_onecycle(0.03, 10)
begin training using onecycle policy with max lr of 0.03...
Train on 95646 samples, validate on 23912 samples
Epoch 1/10
95646/95646 [==============================] - 8s 79us/sample - loss: 1556.0435 - mae: 19.2369 - val_loss: 984.7442 - val_mae: 15.1122
Epoch 2/10
95646/95646 [==============================] - 8s 79us/sample - loss: 1052.5454 - mae: 13.0505 - val_loss: 808.2142 - val_mae: 12.5382
Epoch 3/10
95646/95646 [==============================] - 7s 76us/sample - loss: 809.7949 - mae: 9.4578 - val_loss: 695.8532 - val_mae: 10.8098
Epoch 4/10
95646/95646 [==============================] - 8s 80us/sample - loss: 616.9707 - mae: 6.6427 - val_loss: 621.5498 - val_mae: 9.9253
Epoch 5/10
95646/95646 [==============================] - 7s 78us/sample - loss: 471.5737 - mae: 4.8021 - val_loss: 582.4865 - val_mae: 9.9948
Epoch 6/10
95646/95646 [==============================] - 8s 79us/sample - loss: 369.3043 - mae: 4.1017 - val_loss: 572.5836 - val_mae: 10.4219
Epoch 7/10
95646/95646 [==============================] - 8s 79us/sample - loss: 304.9035 - mae: 3.7351 - val_loss: 563.6406 - val_mae: 10.3136
Epoch 8/10
95646/95646 [==============================] - 8s 79us/sample - loss: 257.2997 - mae: 2.8500 - val_loss: 562.7244 - val_mae: 10.0789
Epoch 9/10
95646/95646 [==============================] - 8s 79us/sample - loss: 226.9375 - mae: 1.9855 - val_loss: 559.9848 - val_mae: 9.7024
Epoch 10/10
95646/95646 [==============================] - 8s 79us/sample - loss: 211.9495 - mae: 1.3842 - val_loss: 561.2627 - val_mae: 9.6977
<tensorflow.python.keras.callbacks.History at 0x7f653d0d1cf8>
Our validation MAE is roughly 10, which means the model's price predictions are off by about $10 on average. That isn't bad considering the wide range of wine prices and the fact that predictions are made purely from text descriptions.
Let's examine the wines we got the most wrong.
learner.view_top_losses(n=3, preproc=preproc)
----------
id:6695 | loss:675000.75 | true:980.0 | pred:158.42)
this was a great vintage port year and this white port which was bottled in 2015 has hints of the firm tannins and structure that marked out the year it also has preserved an amazing amount of freshness still suggesting orange marmalade flavors these are backed up by the fine concentrated old wood tastes the wine is of course ready to drink
----------
id:19469 | loss:524528.9 | true:775.0 | pred:50.76)
perfumed florals mingle curiously with deep dusty mineral notes on this bracing tba sunny nectarine and tangerine flavors are mouthwatering and juicy struck with acidity then plunged into of sweet honey and nectar it's a delightful sensory roller coaster that feels endless on the finish
----------
id:3310 | loss:400394.03 | true:848.0 | pred:215.23)
full of ripe fruit opulent and concentrated this is a fabulous and impressive wine it has a beautiful line of acidity balanced with ripe fruits the wood aging is subtle just a hint of smokiness and toast this is one of those wines from a great white wine vintage that will age many years drink from 2024
It looks like our model has trouble with expensive wines, which is understandable: their descriptions may not differ much from those of less expensive wines.
predictor = ktrain.get_predictor(learner.model, preproc)
Let's make a prediction for a random wine in the validation set.
idx = np.random.randint(len(x_test))
print('Description: %s' % (x_test[idx]))
print('Actual Price: %s' % (y_test[idx]))
Description: This Millesimato sparkling blend of Pinot Nero and oak-aged Chardonnay delivers a generous and creamy mouthfeel followed by refined aromas of dried fruit and baked bread. This is a beautiful wine to serve with tempura appetizers.
Actual Price: 52.0
Our prediction for this wine:
predictor.predict(x_test[idx])
array([52.698753], dtype=float32)
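If we want to double-check the reported validation MAE ourselves, or reuse the trained model later without retraining, something like the following sketch should work. The save path is only an illustrative choice, and scoring the full validation set with the predictor may take a little while.
# rough sanity check of the validation MAE using the predictor
preds = predictor.predict(x_test.tolist())
print('MAE: %.2f' % np.mean(np.abs(np.array(preds).squeeze() - y_test)))
# save the predictor to disk and reload it later (e.g., for deployment)
predictor.save('/tmp/wine_predictor')  # path is illustrative
reloaded_predictor = ktrain.load_predictor('/tmp/wine_predictor')
reloaded_predictor.predict(x_test[idx])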
ktrain includes a simplified interface to the Hugging Face transformers library. This interface can also be used for text regression. Here is a short example of training a DistilBERT model for a single epoch to predict wine prices.
MODEL_NAME = 'distilbert-base-uncased'
t = text.Transformer(MODEL_NAME, maxlen=75)
trn = t.preprocess_train(x_train, y_train)
val = t.preprocess_test(x_test, y_test)
model = t.get_regression_model()
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=128)
learner.fit_onecycle(1e-4, 1)
preprocessing train...
language: en
train sequence lengths:
    mean : 41
    95percentile : 61
    99percentile : 73
preprocessing test...
language: en
test sequence lengths:
    mean : 41
    95percentile : 62
    99percentile : 73
begin training using onecycle policy with max lr of 0.0001...
Train for 748 steps, validate for 187 steps
748/748 [==============================] - 310s 415ms/step - loss: 1443.0076 - mae: 18.3470 - val_loss: 879.1416 - val_mae: 13.9600
<tensorflow.python.keras.callbacks.History at 0x7f743e36e748>
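As with the linear model above, we can wrap the trained DistilBERT model in a predictor to make price predictions from raw text. A minimal sketch, assuming the Transformer preprocessor t is supplied as preproc (as in ktrain's other Transformer examples):
# wrap the DistilBERT model in a predictor for raw-text prediction
distilbert_predictor = ktrain.get_predictor(learner.model, preproc=t)
distilbert_predictor.predict(x_test[idx])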