In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os

Text Regression with Extra Regressors: An Example of Using Custom Data Formats and Models in ktrain

This notebook illustrates how one can construct custom data formats and models for use in ktrain. In this example, we will build a model that can predict the price of a wine by both its textual description and the winery from which it was produced. This example is inspired by FloydHub's regression template for wine price prediction. However, instead of using the wine variety as the extra regressor, we will use the winery.

Text classification (or text regression) with extra predictors arises in many scenarios. For instance, when making a prediction about the trustworthiness of a news story, one may want to consider both the text of the news article and extra metadata such as the news publication and the authors. This notebook shows how such models can be built.

The dataset in CSV format can be obtained from FloydHub at this URL. We will begin by importing some necessary modules and reading in the dataset.

In [2]:
# import some modules and read in the dataset
import pandas as pd
from tensorflow import keras
import numpy as np
import math
path = 'data/wine/wine_data.csv'  # ADD path/to/dataset
data = pd.read_csv(path)
data = data.sample(frac=1., random_state=42)
data.head()
Unnamed: 0 country description designation points price province region_1 region_2 variety winery
82956 82956 Spain Spiced apple and dried cheese aromas are simul... Mercat Brut 84 12.0 Catalonia Cava NaN Sparkling Blend El Xamfrà
60767 60767 US A little too sharp and acidic, with jammy cher... NaN 82 9.0 California California California Other Shiraz Woodbridge by Robert Mondavi
123576 123576 Spain Starts out rustic and leathery, with hints of ... Selección 12 Crianza 89 15.0 Levante Jumilla NaN Red Blend Bodegas Luzón
71003 71003 Chile Ripe to the point that it's soft and flat. Big... NaN 82 8.0 Maule Valley NaN NaN Chardonnay Melania
78168 78168 Italy From one of the best producers in the little-t... Contado Riserva 88 17.0 Southern Italy Molise NaN Aglianico Di Majo Norante

Cleaning the Data

We use the exact same data-cleaning steps employed in FloydHub's regression example for this dataset.

In [3]:
# Clean it from null values
data = data[pd.notnull(data['country'])]
data = data[pd.notnull(data['price'])]
data = data.drop(data.columns[0], axis=1) 
variety_threshold = 500 # Varieties occurring this many times or fewer will be removed.
value_counts = data['variety'].value_counts()
to_remove = value_counts[value_counts <= variety_threshold].index
data.replace(to_remove, np.nan, inplace=True)
data = data[pd.notnull(data['variety'])]
data = data[pd.notnull(data['winery'])]

# Split data into train and test
train_size = int(len(data) * .8)
print ("Train size: %d" % train_size)
print ("Test size: %d" % (len(data) - train_size))

# Train features
description_train = data['description'][:train_size]
variety_train = data['variety'][:train_size]

# Train labels
labels_train = data['price'][:train_size]

# Test features
description_test = data['description'][train_size:]
variety_test = data['variety'][train_size:]

# Test labels
labels_test = data['price'][train_size:]

x_train = description_train.values
y_train = labels_train.values
x_test = description_test.values
y_test = labels_test.values

# winery  metadata to be used later
winery_train = data['winery'][:train_size]
winery_test = data['winery'][train_size:]
Train size: 95612
Test size: 23904

Building a Vanilla Text Regression Model in ktrain

We will preprocess the data and select a linreg model for our initial "vanilla" text regression model.

In [4]:
import ktrain
from ktrain import text
using Keras version: 2.2.4-tf
In [5]:
trn, val, preproc = text.texts_from_array(x_train=x_train, y_train=y_train,
                                          x_test=x_test, y_test=y_test,
                                          ngram_range=3, maxlen=200)
task: text regression (supply class_names argument if this is supposed to be classification task)
language: en
Word Counts: 30807
Nrows: 95612
95612 train sequences
train sequence lengths:
	mean : 41
	95percentile : 62
	99percentile : 74
Adding 3-gram features
max_features changed to 1765149 with addition of ngrams
Average train sequence length with ngrams: 120
train (w/ngrams) sequence lengths:
	mean : 121
	95percentile : 183
	99percentile : 219
x_train shape: (95612,200)
y_train shape: 95612
23904 test sequences
test sequence lengths:
	mean : 41
	95percentile : 62
	99percentile : 74
Average test sequence length with ngrams: 111
test (w/ngrams) sequence lengths:
	mean : 112
	95percentile : 171
	99percentile : 207
x_test shape: (23904,200)
y_test shape: 23904
In [6]:
text.print_text_regression_models()
fasttext: a fastText-like model
linreg: linear text regression using a trainable Embedding layer
bigru: Bidirectional GRU with pretrained word vectors
standard_gru: simple 2-layer GRU with randomly initialized embeddings
bert: Bidirectional Encoder Representations from Transformers (BERT)
distilbert: distilled, smaller, and faster BERT from Hugging Face
In [7]:
model = text.text_regression_model('linreg', train_data=trn, preproc=preproc)
maxlen is 200

Adding an Extra Regressor to Our Model

Next, we will add an extra regressor to our model, thereby creating a new, augmented model. We choose the winery, a categorical variable, as the extra regressor. Instead of representing the winery as a typical one-hot-encoded vector, we will learn an embedding for it during training. The embedding module will then be concatenated with our linreg text regression model to form a new model. The new model expects two distinct inputs: the first is an integer representing the winery, and the second is a sequence of word IDs, the standard input to neural text classifiers/regressors.

In [8]:
extra_train_data = winery_train
extra_test_data = winery_test

# encode winery as integers
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
encoder.fit(data['winery'])
extra_train = encoder.transform(extra_train_data)
extra_test = encoder.transform(extra_test_data)
no_of_unique_cat = np.max(extra_train) + 1
embedding_size = min(np.ceil((no_of_unique_cat)/2), 50 )
embedding_size = int(embedding_size)
vocab =  no_of_unique_cat+1
extra_train = np.expand_dims(extra_train, -1)
extra_test = np.expand_dims(extra_test, -1)

# winery module
extra_input = keras.layers.Input(shape=(1,))
extra_output = keras.layers.Embedding(vocab, embedding_size, input_length=1)(extra_input)
extra_output = keras.layers.Flatten()(extra_output)
extra_model = keras.Model(inputs=extra_input, outputs=extra_output)
extra_model.compile(loss='mse', optimizer='adam', metrics=['mae'])

# Combine winery module with linreg model
merged_out = keras.layers.concatenate([extra_model.output, model.output])
merged_out = keras.layers.Dropout(0.25)(merged_out)
merged_out = keras.layers.Dense(1000, activation='relu')(merged_out)
merged_out = keras.layers.Dropout(0.25)(merged_out)
merged_out = keras.layers.Dense(500, activation='relu')(merged_out)
merged_out = keras.layers.Dropout(0.5)(merged_out)
merged_out = keras.layers.Dense(1)(merged_out)
combined_model = keras.Model([extra_model.input] + [model.input], merged_out)

Wrapping our Data in an Instance of ktrain.Dataset

To use this custom data format of two inputs in ktrain, we will wrap it in a ktrain.dataset.Dataset instance. There are two ways to do this.

The first is to represent our datasets as tf.data.Dataset instances and then wrap each in a ktrain.dataset.TFDataset instance, which is a wrapper around a tf.data.Dataset. Use of tf.data.Dataset instances can potentially yield certain performance improvements. See this example notebook for a demonstration of using the ktrain.dataset.TFDataset class. For this example, one can make use of ktrain.dataset.TFDataset instances as follows:

import tensorflow as tf
from ktrain.dataset import TFDataset

trn_combined = [extra_train] +  [trn[0]] + [trn[1]]
val_combined = [extra_test] + [val[0]] + [val[1]]

def features_to_tfdataset(examples):

    def gen():
        for idx, ex0 in enumerate(examples[0]):
            ex1 = examples[1][idx]
            label = examples[2][idx]
            x = (ex0, ex1)
            y = label
            yield ( (x, y) )

    tfdataset = tf.data.Dataset.from_generator(gen,
            ((tf.int32, tf.int32), tf.float32),
            ((tf.TensorShape([None]), tf.TensorShape([None])), tf.TensorShape([])) )
    return tfdataset
BATCH_SIZE = 256
train_tfdataset = features_to_tfdataset(trn_combined)
val_tfdataset = features_to_tfdataset(val_combined)
train_tfdataset = train_tfdataset.shuffle(trn_combined[0].shape[0]).batch(BATCH_SIZE).repeat(-1)
val_tfdataset = val_tfdataset.batch(BATCH_SIZE)

train_data = TFDataset(train_tfdataset, n=trn_combined[0].shape[0], y=trn_combined[2])
val_data = TFDataset(val_tfdataset, n=val_combined[0].shape[0], y=val_combined[2])
learner = ktrain.get_learner(combined_model, train_data=train_data, val_data=val_data)

The second approach is to wrap our datasets in a subclass of ktrain.dataset.SequenceDataset. We must be sure to override and implement the required methods (e.g., nsamples and get_y). The ktrain.dataset.SequenceDataset class is simply a subclass of tf.keras.utils.Sequence. See the TensorFlow documentation on the Sequence class for more information on how Sequence wrappers work.

We employ the second approach in this tutorial. Note that, in the implementation below, we have made MyCustomDataset more general such that it can wrap lists containing an arbitrary number of inputs instead of just the two needed in our example.

In [9]:
class MyCustomDataset(ktrain.dataset.SequenceDataset):
    def __init__(self, x, y, batch_size=32, shuffle=True):
        # error checks
        err = False
        if type(x) == np.ndarray and len(x.shape) != 2: err = True
        elif type(x) == list:
            for d in x:
                if type(d) != np.ndarray or len(d.shape) != 2:
                    err = True
        else: err = True
        if err:
            raise ValueError('x must be a 2d numpy array or a list of 2d numpy arrays')
        if type(y) != np.ndarray:
            raise ValueError('y must be a numpy array')
        if type(x) == np.ndarray:
            x = [x]

        # set variables
        self.x, self.y = x, y
        self.batch_size = batch_size
        self.indices = np.arange(self.x[0].shape[0])
        self.n_inputs = len(x)
        self.shuffle = shuffle

    # required for instances of tf.keras.utils.Sequence
    def __len__(self):
        return math.ceil(self.x[0].shape[0] / self.batch_size)

    # required for instances of tf.keras.utils.Sequence
    def __getitem__(self, idx):
        inds = self.indices[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_x = []
        for i in range(self.n_inputs):
            batch_x.append(self.x[i][inds])
        batch_y = self.y[inds]
        return tuple(batch_x), batch_y

    # required for instances of ktrain.Dataset
    def nsamples(self):
        return self.x[0].shape[0]

    #required for instances of ktrain.Dataset
    def get_y(self):
        return self.y

    def on_epoch_end(self):
        if self.shuffle:  np.random.shuffle(self.indices)

Note that you can also add a to_tfdataset method to your ktrain.dataset.SequenceDataset subclass. The to_tfdataset method is responsible for converting your dataset to a tf.data.Dataset and, if it exists, will be called by ktrain just prior to training. We have not done this here.
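For illustration only, here is a minimal, hypothetical sketch of what such a to_tfdataset method might look like. TinySeq and all of its attributes are inventions for this sketch (a toy stand-in mirroring MyCustomDataset's self.x/self.y layout, not part of ktrain), and the output_signature argument requires TensorFlow >= 2.4:

```python
import numpy as np
import tensorflow as tf

class TinySeq:
    """Toy stand-in for a SequenceDataset-style object (illustration only)."""
    def __init__(self, x, y, batch_size=2):
        self.x, self.y, self.batch_size = x, y, batch_size
        self.n_inputs = len(x)

    def __len__(self):
        return int(np.ceil(self.x[0].shape[0] / self.batch_size))

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        return tuple(d[sl] for d in self.x), self.y[sl]

    def to_tfdataset(self):
        # Replay the Sequence's batches through a generator and wrap
        # them in a tf.data.Dataset.
        def gen():
            for i in range(len(self)):
                yield self[i]
        sig = (tuple(tf.TensorSpec((None, None), tf.int32)
                     for _ in range(self.n_inputs)),
               tf.TensorSpec((None,), tf.float64))
        return tf.data.Dataset.from_generator(gen, output_signature=sig)

seq = TinySeq([np.ones((4, 1), dtype='int32'),
               np.ones((4, 3), dtype='int32')], np.arange(4.0))
batch_x, batch_y = next(iter(seq.to_tfdataset()))
```

Here the first batch contains two input tensors of shapes (2, 1) and (2, 3) plus a label tensor of shape (2,), matching the two-input layout used by our combined model.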

Using the Custom Model and Data Format

Once we wrap our data in a ktrain.dataset.SequenceDataset instance, we can wrap the model and datasets in a Learner object and use ktrain normally.

In [10]:
train_data = MyCustomDataset([extra_train] +  [trn[0]], trn[1], shuffle=True)
val_data = MyCustomDataset([extra_test] + [val[0]], val[1], shuffle=False)
learner = ktrain.get_learner(combined_model, train_data=train_data, val_data=val_data, batch_size=256)

Estimate Learning Rate

We'll choose a learning rate where the loss is falling. As shown in the plot, 1e-3 seems to be a good choice in this case.

In [11]:
learner.lr_find(show_plot=True, restore_weights_only=True)
simulating training for different learning rates... this may take a few moments...
Train for 373 steps
Epoch 1/1024
373/373 [==============================] - 9s 24ms/step - loss: 34.1117 - mae: 34.1153
Epoch 2/1024
373/373 [==============================] - 8s 20ms/step - loss: 28.8677 - mae: 28.8826
Epoch 3/1024
373/373 [==============================] - 8s 20ms/step - loss: 13.2890 - mae: 13.2908
Epoch 4/1024
373/373 [==============================] - 8s 21ms/step - loss: 20.4389 - mae: 20.4431
Epoch 5/1024
359/373 [===========================>..] - ETA: 0s - loss: 17.9780 - mae: 17.9780


Train the Model

We will now train the model using the estimated learning rate from above for 12 epochs using the 1cycle learning rate policy.

In [12]:
learner.fit_onecycle(1e-3, 12)
begin training using onecycle policy with max lr of 0.001...
Train for 374 steps, validate for 94 steps
Epoch 1/12
374/374 [==============================] - 9s 23ms/step - loss: 22.8788 - mae: 22.8866 - val_loss: 13.7107 - val_mae: 13.7028
Epoch 2/12
374/374 [==============================] - 9s 23ms/step - loss: 12.2521 - mae: 12.2531 - val_loss: 10.8341 - val_mae: 10.8276
Epoch 3/12
374/374 [==============================] - 9s 23ms/step - loss: 9.9158 - mae: 9.9183 - val_loss: 9.9131 - val_mae: 9.9106
Epoch 4/12
374/374 [==============================] - 8s 23ms/step - loss: 8.9252 - mae: 8.9264 - val_loss: 9.4691 - val_mae: 9.4692
Epoch 5/12
374/374 [==============================] - 8s 23ms/step - loss: 8.3064 - mae: 8.3072 - val_loss: 9.1714 - val_mae: 9.1709
Epoch 6/12
374/374 [==============================] - 8s 22ms/step - loss: 7.9027 - mae: 7.9037 - val_loss: 9.0367 - val_mae: 9.0353
Epoch 7/12
374/374 [==============================] - 9s 23ms/step - loss: 7.4723 - mae: 7.4741 - val_loss: 8.6807 - val_mae: 8.6820
Epoch 8/12
374/374 [==============================] - 9s 23ms/step - loss: 6.9741 - mae: 6.9762 - val_loss: 8.3878 - val_mae: 8.3916
Epoch 9/12
374/374 [==============================] - 8s 22ms/step - loss: 6.4518 - mae: 6.4508 - val_loss: 8.2264 - val_mae: 8.2321
Epoch 10/12
374/374 [==============================] - 8s 23ms/step - loss: 5.9795 - mae: 5.9803 - val_loss: 7.8524 - val_mae: 7.8609
Epoch 11/12
374/374 [==============================] - 8s 23ms/step - loss: 5.7376 - mae: 5.7394 - val_loss: 7.8682 - val_mae: 7.8760
Epoch 12/12
374/374 [==============================] - 8s 23ms/step - loss: 5.5266 - mae: 5.5273 - val_loss: 7.8161 - val_mae: 7.8243
<tensorflow.python.keras.callbacks.History at 0x7f84a50470f0>

Our final validation MAE is 7.82, which means our predictions are, on average, about $8 off the mark. That is not bad considering our model only looks at the textual description of the wine and the winery.
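As a reminder of what MAE measures, it is simply the mean absolute difference between predicted and true prices. A small illustrative computation, using the three hardest validation examples from this notebook's top-losses output (values copied from that output; recomputing them here is purely for illustration):

```python
import numpy as np

# The three top-loss validation examples (pred vs. true price)
preds = np.array([57.54, 385.66, 95.42])
actual = np.array([1100.0, 1400.0, 980.0])

# MAE = mean of |pred - true|
mae = np.mean(np.abs(preds - actual))
print(round(mae, 2))  # 980.46 on these hard examples vs. 7.82 overall
```

The contrast between 980.46 on these examples and 7.82 overall shows how heavily a few very expensive wines dominate the error.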

Plot Some Training History

The validation loss is still decreasing, which suggests we could train further if desired. The second and third plot show the learning rate and momentum schedules employed by fit_onecycle.

In [13]:
learner.plot('loss')
In [14]:
learner.plot('lr')
In [15]:
learner.plot('momentum')

View Top Losses

Let's examine the validation examples that we got the most wrong. Looks like our model has trouble with expensive wines.

In [16]:
learner.view_top_losses(n=3, preproc=preproc)
id:21790 | loss:1042.46 | true:1100.0 | pred:57.54)

id:13745 | loss:1014.34 | true:1400.0 | pred:385.66)

id:11710 | loss:884.58 | true:980.0 | pred:95.42)

In [17]:
Wet earth, rain-wet stones, damp moss, wild sage and very ripe pear make for a complex opening. Further sniffs reveal more citrus: both juice and zest of lemon. The palate still holds a lot of leesy yeast flavors but its phenolic richness is tempered by total citrus freshness. This is still tightly wound; leave it so it can come into its own. The warming resonance on the palate suggests it has a long future. Drink from 2019.
In [18]:
A wine that has created its own universe. It has a unique, special softness that allies with the total purity that comes from a small, enclosed single vineyard. The fruit is almost irrelevant here, because it comes as part of a much deeper complexity. This is a great wine, at the summit of Champagne, a sublime, unforgettable experience.
In [19]:
preds = learner.predict(val_data)
In [20]:
array([385.65793], dtype=float32)

Making Predictions

Lastly, we will use our model to make predictions on 5 randomly selected wines in the validation set.

In [22]:
# 5 random predictions
val_data.batch_size = 1
for i in range(5):
    idx = np.random.choice(len(x_test))
    print("TEXT:\n%s" % (x_test[idx]))
    print("\tpredicted: %s" % (np.squeeze(learner.predict(val_data[idx]))))
    print("\tactual: %s" % (y_test[idx]))           
Relatively full-bodied and muscular as well as dry, this new effort from winemaker Steve Bird features plenty of brawny citrus and spice flavors that finish long. There's no real track record, so it's probably best to drink now.

	predicted: 18.009167
	actual: 17.0
Very tart and spicy, with distinct notes of clove and orange peel. Citrus and apple flavors crop up unexpectedly, and the tannins have a hint of green tea about them.

	predicted: 20.4764
	actual: 20.0
Dusty apple aromas are given lift courtesy of citrus notes. This feels good on the palate, with zesty acidity. Flavors of stone fruits, tropical fruits, apple and citrus meld together well, while the finish is pure and long.

	predicted: 15.768029
	actual: 17.0
Smoky and savory on the nose, with saucy fruit sitting below a veil of firm oak. Runs a bit tart and racy in the mouth, where cherry and plum flavors are boosted by blazing natural acidity. Not a sour wine, but definitely crisp and racy.

	predicted: 15.798236
	actual: 24.0
Textbook Gewurztraminer, done well, starting with scents of rose petals and lychees, and moving through pear and melon flavors into a finish that shows a hint of bitterness. Medium-weight and just slightly off-dry.

	predicted: 23.65241
	actual: 16.0

Let's look at our most expensive prediction. Our most expensive prediction ($404) is associated with an expensive wine priced at $800, which is good. However, we are ~$400 off. Again, our model has trouble with expensive wines. This is somewhat understandable, since the model only looks at short textual descriptions and the winery, neither of which contains clear indicators of such exorbitant prices.

In [43]:
max_pred_id = np.argmax(preds)
print("highest-priced prediction: %s" % (np.squeeze(preds[max_pred_id])))
print("actual price for this wine:%s" % (y_test[max_pred_id]))
print('TEXT:\n%s' % (x_test[max_pred_id]))
highest-priced prediction: 404.31885
actual price for this wine:800.0
The palate opens slowly, offering an initial citrus character, followed by wood and then, finally, wonderfully rich, but taut fruit. There is still a toast character here, with apricots and pear on top of the citrus, but it is still only just developing. In 10–15 years, it will be a magnificent wine.

Making Predictions on Unseen Examples

In the example above, we made predictions for examples in the validation set. To make predictions for an arbitrary set of wine data, the steps are as follows:

  1. Encode the winery using the same label encoder used above for validation data
  2. Preprocess the wine description using the preprocess_test method. In this example, you will use preproc.preprocess_test.
  3. Combine both into a ktrain.dataset.Dataset instance, as we did above.
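The first and third steps can be sketched as follows. This is a runnable toy, not the original pipeline: the encoder is refit on a three-winery vocabulary purely for illustration (in practice, reuse the encoder fitted on data['winery'] above), and the preprocessed text is a dummy 200-length word-ID array standing in for the output of preproc.preprocess_test:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Step 1: encode the winery as an integer (toy vocabulary for illustration)
encoder = LabelEncoder()
encoder.fit(['Bodegas Luzón', 'El Xamfrà', 'Melania'])
extra_new = np.expand_dims(encoder.transform(['Melania']), -1)  # shape (1, 1)

# Step 2 (stand-in): in the real pipeline this would come from
# preproc.preprocess_test applied to the wine descriptions; here we
# use a dummy array with the same maxlen=200 shape.
x_new = np.zeros((1, 200), dtype='int32')
dummy_y = np.zeros(1)  # placeholder labels; not used for prediction

# Step 3: combine both inputs the same way as the validation data:
#   new_data = MyCustomDataset([extra_new, x_new], dummy_y, shuffle=False)
#   preds = learner.predict(new_data)
print(extra_new.shape, x_new.shape)  # (1, 1) (1, 200)
```

The key point is that the two inputs must arrive in the same order and with the same shapes as during training: the winery integer first, the word-ID sequence second.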
In [ ]: