Applying NCF to Goodreads using the Analytics Zoo library
NCF Recommender with Explicit Feedback
In this notebook we demonstrate how to build a neural network recommendation system, Neural Collaborative Filtering (NCF), with explicit feedback. We use the Recommender API in Analytics Zoo to build the model and the BigDL optimizer to train it.
A system of this kind (see Recommendation systems: Principles, methods and evaluation) normally prompts the user through its interface to provide ratings for items in order to construct and improve the model. The accuracy of the recommendations depends on the quantity of ratings provided by the user.
NCF (He et al., 2017) leverages a multi-layer perceptron to learn the user–item interaction function; at the same time, NCF can express and generalize matrix factorization under its framework. The include_mf (Boolean) parameter lets users build an NCF model with or without matrix factorization.
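To make the wiring concrete, here is a toy NumPy sketch of the forward pass with random, untrained weights. It only illustrates the idea described above (an MLP over concatenated embeddings, plus an optional GMF branch); it is not the Analytics Zoo implementation, and all names and sizes below are made up.
import numpy as np

rng = np.random.default_rng(0)

def toy_ncf_score(user_id, item_id, include_mf=True,
                  n_users=100, n_items=100, embed=8, hidden=(16, 8)):
    """Toy NCF forward pass: an MLP over concatenated user/item embeddings,
    optionally concatenated with a GMF branch before a final linear layer."""
    # MLP branch: look up the two embeddings and pass them through ReLU layers.
    mlp_user = rng.normal(size=(n_users, embed))[user_id]
    mlp_item = rng.normal(size=(n_items, embed))[item_id]
    x = np.concatenate([mlp_user, mlp_item])
    for units in hidden:
        w = rng.normal(size=(x.size, units))
        x = np.maximum(w.T @ x, 0.0)
    # Optional GMF branch: the element-wise product of a second pair of embeddings.
    # With identity output weights this reduces to the classic matrix-factorization
    # dot product, which is how NCF generalizes MF.
    if include_mf:
        gmf_user = rng.normal(size=(n_users, embed))[user_id]
        gmf_item = rng.normal(size=(n_items, embed))[item_id]
        x = np.concatenate([x, gmf_user * gmf_item])
    # Single unnormalized score; the model built later in this notebook instead ends
    # in a class_num-way softmax over rating classes.
    return float(rng.normal(size=x.size) @ x)

print(toy_ncf_score(user_id=3, item_id=7))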
Data: Goodreads ratings (ratings.csv, downloaded below from https://github.com/sparsh-ai/reco-data).
References:
He, X., Liao, L., Zhang, H., Nie, L., Hu, X., & Chua, T.-S. (2017). Neural Collaborative Filtering. WWW 2017.
Isinkaye, F. O., Folajimi, Y. O., & Ojokoh, B. A. (2015). Recommendation systems: Principles, methods and evaluation. Egyptian Informatics Journal, 16(3), 261-273.
Python interface:
ncf = NeuralCF(user_count, item_count, class_num, user_embed=20, item_embed=20, hidden_layers=(40, 20, 10), include_mf=True, mf_embed=20)
user_count: The number of users. Positive int.
item_count: The number of items. Positive int.
class_num: The number of classes. Positive int.
user_embed: Units of user embedding. Positive int. Default is 20.
item_embed: Units of item embedding. Positive int. Default is 20.
hidden_layers: Units of hidden layers for MLP. Tuple of positive int. Default is (40, 20, 10).
include_mf: Whether to include Matrix Factorization. Boolean. Default is True.
mf_embed: Units of matrix factorization embedding. Positive int. Default is 20.
Run the following command in the Colab notebook to install JDK 1.8.
# Install jdk8
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# Set jdk environment path which enables you to run Pyspark in your Colab environment.
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
!update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java to provide /usr/bin/java (java) in manual mode
You can run the following command in your Colab notebook to install analytics-zoo easily via pip:
# Install latest release version of analytics-zoo
# Installing analytics-zoo from pip will automatically install pyspark, bigdl, and their dependencies.
!pip install analytics-zoo
Collecting analytics-zoo
  Downloading https://files.pythonhosted.org/packages/49/32/2fe969c682b8683fb3792f06107fe8ac56f165e307d327756c506aa76d31/analytics_zoo-0.10.0-py2.py3-none-manylinux1_x86_64.whl (158.9MB)
     |████████████████████████████████| 158.9MB 81kB/s
Collecting conda-pack==0.3.1
  Downloading https://files.pythonhosted.org/packages/e9/e7/d942780c4281a665f34dbfffc1cd1517c5843fb478c133a1e1fa0df30cd6/conda_pack-0.3.1-py2.py3-none-any.whl
Collecting bigdl==0.12.2
  Downloading https://files.pythonhosted.org/packages/1a/40/81fed203a633536dbb83454f1a102ebde86214f576572769290afb9427c9/BigDL-0.12.2-py2.py3-none-manylinux1_x86_64.whl (114.1MB)
     |████████████████████████████████| 114.1MB 2.5MB/s
Collecting pyspark==2.4.3
  Downloading https://files.pythonhosted.org/packages/37/98/244399c0daa7894cdf387e7007d5e8b3710a79b67f3fd991c0b0b644822d/pyspark-2.4.3.tar.gz (215.6MB)
     |████████████████████████████████| 215.6MB 69kB/s
Requirement already satisfied: setuptools in /usr/local/lib/python3.7/dist-packages (from conda-pack==0.3.1->analytics-zoo) (57.0.0)
Requirement already satisfied: six>=1.10.0 in /usr/local/lib/python3.7/dist-packages (from bigdl==0.12.2->analytics-zoo) (1.15.0)
Requirement already satisfied: numpy>=1.7 in /usr/local/lib/python3.7/dist-packages (from bigdl==0.12.2->analytics-zoo) (1.19.5)
Collecting py4j==0.10.7
  Downloading https://files.pythonhosted.org/packages/e3/53/c737818eb9a7dc32a7cd4f1396e787bd94200c3997c72c1dbe028587bd76/py4j-0.10.7-py2.py3-none-any.whl (197kB)
     |████████████████████████████████| 204kB 53.7MB/s
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-2.4.3-py2.py3-none-any.whl size=215964968 sha256=6d89c562d8dc3446296b8872b20b52d86bddd6342fd3dd59448c51a9d25a0513
  Stored in directory: /root/.cache/pip/wheels/8d/20/f0/b30e2024226dc112e256930dd2cd4f06d00ab053c86278dcf3
Successfully built pyspark
Installing collected packages: conda-pack, py4j, pyspark, bigdl, analytics-zoo
Successfully installed analytics-zoo-0.10.0 bigdl-0.12.2 conda-pack-0.3.1 py4j-0.10.7 pyspark-2.4.3
Call init_nncontext(), which will create a SparkContext with optimized performance configurations.
from zoo.common.nncontext import *
sc = init_nncontext()
Prepending /usr/local/lib/python3.7/dist-packages/bigdl/share/conf/spark-bigdl.conf to sys.path
Adding /usr/local/lib/python3.7/dist-packages/zoo/share/lib/analytics-zoo-bigdl_0.12.2-spark_2.4.3-0.10.0-jar-with-dependencies.jar to BIGDL_JARS
Prepending /usr/local/lib/python3.7/dist-packages/zoo/share/conf/spark-analytics-zoo.conf to sys.path
pyspark_submit_args is: --driver-class-path /usr/local/lib/python3.7/dist-packages/zoo/share/lib/analytics-zoo-bigdl_0.12.2-spark_2.4.3-0.10.0-jar-with-dependencies.jar:/usr/local/lib/python3.7/dist-packages/bigdl/share/lib/bigdl-0.12.2-jar-with-dependencies.jar pyspark-shell
Analytics Zoo provides three recommenders: the Wide and Deep (WND) model, the Neural network-based Collaborative Filtering (NCF) model and the Session Recommender model. They are easy-to-use, Keras-style models that provide compile and fit methods for training. Alternatively, they can be fed into NNFrames or a BigDL Optimizer.
The WND and NCF recommenders can handle either explicit or implicit feedback, given the corresponding features.
from zoo.pipeline.api.keras.layers import *
from zoo.models.recommendation import UserItemFeature
from zoo.models.recommendation import NeuralCF
from zoo.common.nncontext import init_nncontext
import matplotlib
from sklearn import metrics
from operator import itemgetter
from bigdl.util.common import *
import os
import numpy as np
from sklearn import preprocessing
matplotlib.use('agg')
import matplotlib.pyplot as plt
%pylab inline
Populating the interactive namespace from numpy and matplotlib
!wget https://github.com/sparsh-ai/reco-data/raw/master/goodreads/ratings.csv
--2021-06-27 12:57:02--  https://github.com/sparsh-ai/reco-data/raw/master/goodreads/ratings.csv
Resolving github.com (github.com)... 192.30.255.113
Connecting to github.com (github.com)|192.30.255.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/sparsh-ai/reco-data/master/goodreads/ratings.csv [following]
--2021-06-27 12:57:04--  https://raw.githubusercontent.com/sparsh-ai/reco-data/master/goodreads/ratings.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4976229 (4.7M) [text/plain]
Saving to: ‘ratings.csv’

ratings.csv         100%[===================>]   4.75M  --.-KB/s    in 0.1s

2021-06-27 12:57:04 (41.9 MB/s) - ‘ratings.csv’ saved [4976229/4976229]
def read_data_sets(data_dir):
    # Read ratings.csv (skipping the header row) into an int ndarray of (user, book, rating) rows.
    rating_files = os.path.join(data_dir, "ratings.csv")
    with open(rating_files, "r") as f:
        rating_list = [i.strip().split(",") for i in f.readlines()]
    goodreads_data = np.array(rating_list[1:]).astype(int)
    return goodreads_data

def get_id_pairs(data_dir):
    # Return only the (user id, book id) columns.
    goodreads_data = read_data_sets(data_dir)
    return goodreads_data[:, 0:2]

def get_id_ratings(data_dir):
    # Return (user id, book id, rating) with user and book ids re-encoded to 0..n-1.
    goodreads_data = read_data_sets(data_dir)
    le_user = preprocessing.LabelEncoder()
    goodreads_data[:, 0] = le_user.fit_transform(goodreads_data[:, 0])
    le_item = preprocessing.LabelEncoder()
    goodreads_data[:, 1] = le_item.fit_transform(goodreads_data[:, 1])
    return goodreads_data[:, 0:3]
goodreads_data = get_id_ratings("/content")
min_user_id = np.min(goodreads_data[:,0])
max_user_id = np.max(goodreads_data[:,0])
min_book_id = np.min(goodreads_data[:,1])
max_book_id = np.max(goodreads_data[:,1])
rating_labels= np.unique(goodreads_data[:,2])
print(goodreads_data.shape)
print(min_user_id, max_user_id, min_book_id, max_book_id, rating_labels)
(454517, 3) 0 4999 0 4999 [1 2 3 4 5]
Each record is in the format (userid, bookid, rating_score). Both user IDs and book IDs range between 0 and 4999. Ratings are made on a 5-star scale (whole-star ratings only). The minimum and maximum user and book IDs are recorded for later use.
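As an optional sanity check (this cell is not part of the original notebook), the distribution of the star ratings can be inspected directly from the array loaded above:
# Count how many ratings fall on each star value (1-5).
stars, counts = np.unique(goodreads_data[:, 2], return_counts=True)
for s, c in zip(stars, counts):
    print(f"{s} stars: {c} ratings")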
Transform the original data into an RDD of Samples. We use the BigDL optimizer directly to train the model, and it requires the data to be provided as an RDD of Sample. A Sample is a BigDL data structure that can be constructed from two NumPy arrays, the feature and the label respectively. The API is Sample.from_ndarray(feature, label).
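As a minimal illustration of this layout (the ids and rating below are made up, not taken from the dataset):
# One hypothetical record (user_id=12, book_id=34, rating label=4) packed into a Sample:
# the feature is the [user_id, item_id] pair and the label is the rating class.
example_sample = Sample.from_ndarray(np.array([12, 34]), np.array([4]))
print(example_sample)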
def build_sample(user_id, item_id, rating):
    # Pack the (user id, item id) pair as the Sample feature and the rating as its label,
    # keeping the original ids alongside in a UserItemFeature for the Recommender APIs.
    sample = Sample.from_ndarray(np.array([user_id, item_id]), np.array([rating]))
    return UserItemFeature(user_id, item_id, sample)
# Shift the 1-5 star ratings to 0-4 class labels before building the Samples.
pairFeatureRdds = sc.parallelize(goodreads_data).map(lambda x: build_sample(x[0], x[1], x[2]-1))
pairFeatureRdds.take(3)
[<zoo.models.recommendation.recommender.UserItemFeature at 0x7f91617553d0>, <zoo.models.recommendation.recommender.UserItemFeature at 0x7f9161755210>, <zoo.models.recommendation.recommender.UserItemFeature at 0x7f9161755090>]
Randomly split the data into train (80%) and validation (20%)
trainPairFeatureRdds, valPairFeatureRdds = pairFeatureRdds.randomSplit([0.8, 0.2], seed= 1)
valPairFeatureRdds.cache()
train_rdd= trainPairFeatureRdds.map(lambda pair_feature: pair_feature.sample)
val_rdd= valPairFeatureRdds.map(lambda pair_feature: pair_feature.sample)
val_rdd.persist()
PythonRDD[37] at RDD at PythonRDD.scala:53
train_rdd.count()
363662
train_rdd.take(3)
[Sample: features: [JTensor: storage: [510. 276.], shape: [2], float], labels: [JTensor: storage: [3.], shape: [1], float], Sample: features: [JTensor: storage: [2289. 2492.], shape: [2], float], labels: [JTensor: storage: [1.], shape: [1], float], Sample: features: [JTensor: storage: [3919. 1597.], shape: [2], float], labels: [JTensor: storage: [4.], shape: [1], float]]
In Analytics Zoo, it is simple to build an NCF model by calling the NeuralCF API. You need to specify the user count, item count and class number according to your data, then add hidden layers as needed; you can also choose to include matrix factorization in the network. The model can be fed into a BigDL Optimizer or into an analytics-zoo NNClassifier. Please refer to the documentation for more details. In this example, we train through the Keras-style compile and fit APIs, which use the BigDL optimizer under the hood.
ncf = NeuralCF(user_count=max_user_id,
item_count=max_book_id,
class_num=5,
hidden_layers=[20, 10],
include_mf = False)
creating: createZooKerasInput
creating: createZooKerasFlatten
creating: createZooKerasSelect
creating: createZooKerasFlatten
creating: createZooKerasSelect
creating: createZooKerasEmbedding
creating: createZooKerasEmbedding
creating: createZooKerasFlatten
creating: createZooKerasFlatten
creating: createZooKerasMerge
creating: createZooKerasDense
creating: createZooKerasDense
creating: createZooKerasDense
creating: createZooKerasModel
creating: createZooNeuralCF
Compile the model with a specific optimizer and loss, as well as metrics for evaluation. The optimizer tries to minimize the loss of the neural network with respect to its weights/biases over the training set. To create an Optimizer in BigDL directly, you need to specify at least the following arguments: model (a neural network model), criterion (the loss function), training_rdd (the training dataset) and batch size. Please refer to the ProgrammingGuide and the Optimizer documentation for more details on creating efficient optimizers.
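The cells that follow train through the Keras-style compile and fit workflow. As a rough sketch of the direct BigDL Optimizer alternative described above (this cell is not executed in this notebook, and the criterion choice and label convention are assumptions that would need to match the model's softmax output and the zero-based labels built earlier):
from bigdl.optim.optimizer import Optimizer, Adam, MaxEpoch
from bigdl.nn.criterion import ClassNLLCriterion

# Hand the model, loss, training data and batch size to a BigDL Optimizer.
# ClassNLLCriterion is only a placeholder here; its default log-probability input and
# 1-based label convention would need to be reconciled with this model and data.
optimizer = Optimizer(model=ncf,
                      training_rdd=train_rdd,
                      criterion=ClassNLLCriterion(),
                      optim_method=Adam(),
                      end_trigger=MaxEpoch(10),
                      batch_size=5000)
# trained_ncf = optimizer.optimize()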
ncf.compile(optimizer= "adam",
loss= "sparse_categorical_crossentropy",
metrics=['accuracy'])
creating: createAdam
creating: createZooKerasSparseCategoricalCrossEntropy
creating: createZooKerasSparseCategoricalAccuracy
You can leverage TensorBoard to visualize the training summaries.
tmp_log_dir = create_tmp_path()
ncf.set_tensorboard(tmp_log_dir, "training_ncf")
ncf.fit(train_rdd,
nb_epoch= 10,
batch_size= 5000,
validation_data=val_rdd)
Zoo models make inferences on the given data through the model.predict(val_rdd) API, which returns an RDD of results. predict_class returns the predicted labels.
results = ncf.predict(val_rdd)
results.take(5)
[array([0.00071773, 0.00534406, 0.07192371, 0.46765593, 0.45435852], dtype=float32), array([0.14319365, 0.23656234, 0.28515524, 0.24342458, 0.09166414], dtype=float32), array([0.0084668 , 0.02982639, 0.17435056, 0.46424043, 0.32311577], dtype=float32), array([5.7515636e-04, 1.4479134e-02, 3.0612889e-01, 6.0694826e-01, 7.1868621e-02], dtype=float32), array([0.00603686, 0.01832466, 0.12356475, 0.32203293, 0.5300408 ], dtype=float32)]
results_class = ncf.predict_class(val_rdd)
results_class.take(5)
[4, 3, 4, 4, 5]
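Since the sklearn metrics module imported at the top has not been used yet, here is a small optional check (not part of the original notebook) that scores the class predictions against the held-out ratings. It assumes predict_class preserves the ordering of val_rdd and that each Sample exposes its label tensors via .labels, as suggested by the Sample printout earlier; recall that the labels are zero-based (rating - 1) while predict_class returns 1-based classes.
# Collect the true ratings (add 1 to the zero-based labels) and the predicted classes,
# then compute plain accuracy with sklearn.
true_ratings = val_rdd.map(lambda s: int(s.labels[0].to_ndarray()[0]) + 1).collect()
predicted_ratings = results_class.collect()
print("Validation accuracy:", metrics.accuracy_score(true_ratings, predicted_ratings))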
In Analytics Zoo, Recommender provides three dedicated APIs to predict user-item pairs and to make recommendations for users or for items given candidates.
userItemPairPrediction = ncf.predict_user_item_pair(valPairFeatureRdds)
for result in userItemPairPrediction.take(5): print(result)
UserItemPrediction [user_id: 2765, item_id: 53, prediction: 4, probability: 0.4676559269428253]
UserItemPrediction [user_id: 4573, item_id: 95, prediction: 3, probability: 0.2851552367210388]
UserItemPrediction [user_id: 3907, item_id: 15, prediction: 4, probability: 0.4642404317855835]
UserItemPrediction [user_id: 790, item_id: 156, prediction: 4, probability: 0.6069482564926147]
UserItemPrediction [user_id: 3315, item_id: 1721, prediction: 5, probability: 0.5300408005714417]
userRecs = ncf.recommend_for_user(valPairFeatureRdds, 3)
for result in userRecs.take(5): print(result)
UserItemPrediction [user_id: 3586, item_id: 850, prediction: 5, probability: 0.49247753620147705]
UserItemPrediction [user_id: 3586, item_id: 2354, prediction: 4, probability: 0.6351815462112427]
UserItemPrediction [user_id: 3586, item_id: 2787, prediction: 4, probability: 0.6106586456298828]
UserItemPrediction [user_id: 1084, item_id: 4588, prediction: 5, probability: 0.7689940333366394]
UserItemPrediction [user_id: 1084, item_id: 554, prediction: 5, probability: 0.7594876289367676]
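The operator.itemgetter imported at the top has not been used yet; as an optional sketch (not in the original notebook, and assuming each UserItemPrediction exposes user_id, item_id and probability attributes matching the printout above), the per-user recommendations can be collected into a plain Python dict:
# Group the recommendations by user and keep the item ids sorted by predicted probability.
user_top_items = (userRecs
                  .map(lambda p: (p.user_id, (p.item_id, p.probability)))
                  .groupByKey()
                  .mapValues(lambda pairs: [item for item, _ in
                                            sorted(pairs, key=itemgetter(1), reverse=True)])
                  .collectAsMap())
print(list(user_top_items.items())[:3])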
itemRecs = ncf.recommend_for_item(valPairFeatureRdds, 3)
for result in itemRecs.take(5): print(result)
UserItemPrediction [user_id: 3111, item_id: 3558, prediction: 5, probability: 0.48693835735321045]
UserItemPrediction [user_id: 2024, item_id: 3558, prediction: 5, probability: 0.42628324031829834]
UserItemPrediction [user_id: 4909, item_id: 3558, prediction: 4, probability: 0.562437891960144]
UserItemPrediction [user_id: 3023, item_id: 1084, prediction: 5, probability: 0.8389995694160461]
UserItemPrediction [user_id: 4790, item_id: 1084, prediction: 5, probability: 0.6715853810310364]
Plot the train and validation loss curves
# retrieve the train and validation summary objects and read the loss data into ndarrays
train_loss = np.array(ncf.get_train_summary("Loss"))
val_loss = np.array(ncf.get_validation_summary("Loss"))
#plot the train and validation curves
# each event data is a tuple in form of (iteration_count, value, timestamp)
plt.figure(figsize = (12,6))
plt.plot(train_loss[:,0],train_loss[:,1],label='train loss')
plt.plot(val_loss[:,0],val_loss[:,1],label='val loss',color='green')
plt.scatter(val_loss[:,0],val_loss[:,1],color='green')
plt.legend();
plt.xlim(0,train_loss.shape[0]+10)
plt.grid(True)
plt.title("loss")
Text(0.5, 1.0, 'loss')
Plot accuracy
plt.figure(figsize = (12,6))
top1 = np.array(ncf.get_validation_summary("Top1Accuracy"))
plt.plot(top1[:,0],top1[:,1],label='top1')
plt.title("top1 accuracy")
plt.grid(True)
plt.legend();
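As an optional last step (not part of the original notebook), the trained recommender can be persisted so it does not have to be retrained; this sketch assumes the save_model method available on Analytics Zoo models, and the path is just an example.
# Save the trained NCF model to an example path; it can later be reloaded with the
# corresponding Analytics Zoo load API instead of retraining from scratch.
ncf.save_model("/tmp/ncf_goodreads.model", over_write=True)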