Applying NCF to Goodreads using the Analytics Zoo library
NCF Recommender with Explicit Feedback
In this notebook we demonstrate how to build a neural network recommendation system, Neural Collaborative Filtering (NCF), with explicit feedback. We use the Recommender API in Analytics Zoo to build the model and the BigDL optimizer to train it.
A system of this kind (see Recommendation systems: Principles, methods and evaluation) normally prompts the user through its interface to provide ratings for items in order to construct and improve the model. The accuracy of the recommendations depends on the quantity of ratings provided by the user.
NCF (He et al., 2017) leverages a multi-layer perceptron to learn the user–item interaction function; at the same time, NCF can express and generalize matrix factorization under its framework. The include_mf (Boolean) parameter lets users build an NCF model with or without matrix factorization.
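To make the wiring concrete, here is a toy NumPy sketch of the forward pass with random, untrained weights. It only illustrates the idea described above (an MLP over concatenated embeddings, plus an optional GMF branch); it is not the Analytics Zoo implementation, and all names and sizes below are made up.
import numpy as np

rng = np.random.default_rng(0)

def toy_ncf_score(user_id, item_id, include_mf=True,
                  n_users=100, n_items=100, embed=8, hidden=(16, 8)):
    """Toy NCF forward pass: an MLP over concatenated user/item embeddings,
    optionally concatenated with a GMF branch before a final linear layer."""
    # MLP branch: look up the two embeddings and pass them through ReLU layers.
    mlp_user = rng.normal(size=(n_users, embed))[user_id]
    mlp_item = rng.normal(size=(n_items, embed))[item_id]
    x = np.concatenate([mlp_user, mlp_item])
    for units in hidden:
        w = rng.normal(size=(x.size, units))
        x = np.maximum(w.T @ x, 0.0)
    # Optional GMF branch: the element-wise product of a second pair of embeddings.
    # With identity output weights this reduces to the classic matrix-factorization
    # dot product, which is how NCF generalizes MF.
    if include_mf:
        gmf_user = rng.normal(size=(n_users, embed))[user_id]
        gmf_item = rng.normal(size=(n_items, embed))[item_id]
        x = np.concatenate([x, gmf_user * gmf_item])
    # Single unnormalized score; the model built later in this notebook instead ends
    # in a class_num-way softmax over rating classes.
    return float(rng.normal(size=x.size) @ x)

print(toy_ncf_score(user_id=3, item_id=7))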
Data: Goodreads ratings (ratings.csv, downloaded below from https://github.com/sparsh-ai/reco-data).
References:
He, X., Liao, L., Zhang, H., Nie, L., Hu, X., & Chua, T.-S. (2017). Neural Collaborative Filtering. WWW 2017.
Isinkaye, F. O., Folajimi, Y. O., & Ojokoh, B. A. (2015). Recommendation systems: Principles, methods and evaluation. Egyptian Informatics Journal, 16(3), 261-273.
Python interface:
ncf = NeuralCF(user_count, item_count, class_num, user_embed=20, item_embed=20, hidden_layers=(40, 20, 10), include_mf=True, mf_embed=20)
user_count: The number of users. Positive int.
item_count: The number of items. Positive int.
class_num: The number of classes. Positive int.
user_embed: Units of user embedding. Positive int. Default is 20.
item_embed: Units of item embedding. Positive int. Default is 20.
hidden_layers: Units of hidden layers for MLP. Tuple of positive int. Default is (40, 20, 10).
include_mf: Whether to include Matrix Factorization. Boolean. Default is True.
mf_embed: Units of matrix factorization embedding. Positive int. Default is 20.
Run the following command in the Colab notebook to install JDK 1.8.
# Install jdk8
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# Set jdk environment path which enables you to run Pyspark in your Colab environment.
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
!update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java to provide /usr/bin/java (java) in manual mode
You can run the following command in your Colab notebook to install analytics-zoo easily via pip:
# Install latest release version of analytics-zoo
# Installing analytics-zoo from pip will automatically install pyspark, bigdl, and their dependencies.
!pip install analytics-zoo
Collecting analytics-zoo
  Downloading https://files.pythonhosted.org/packages/49/32/2fe969c682b8683fb3792f06107fe8ac56f165e307d327756c506aa76d31/analytics_zoo-0.10.0-py2.py3-none-manylinux1_x86_64.whl (158.9MB)
     |████████████████████████████████| 158.9MB 81kB/s
Collecting conda-pack==0.3.1
  Downloading https://files.pythonhosted.org/packages/e9/e7/d942780c4281a665f34dbfffc1cd1517c5843fb478c133a1e1fa0df30cd6/conda_pack-0.3.1-py2.py3-none-any.whl
Collecting bigdl==0.12.2
  Downloading https://files.pythonhosted.org/packages/1a/40/81fed203a633536dbb83454f1a102ebde86214f576572769290afb9427c9/BigDL-0.12.2-py2.py3-none-manylinux1_x86_64.whl (114.1MB)
     |████████████████████████████████| 114.1MB 2.5MB/s
Collecting pyspark==2.4.3
  Downloading https://files.pythonhosted.org/packages/37/98/244399c0daa7894cdf387e7007d5e8b3710a79b67f3fd991c0b0b644822d/pyspark-2.4.3.tar.gz (215.6MB)
     |████████████████████████████████| 215.6MB 69kB/s
Requirement already satisfied: setuptools in /usr/local/lib/python3.7/dist-packages (from conda-pack==0.3.1->analytics-zoo) (57.0.0)
Requirement already satisfied: six>=1.10.0 in /usr/local/lib/python3.7/dist-packages (from bigdl==0.12.2->analytics-zoo) (1.15.0)
Requirement already satisfied: numpy>=1.7 in /usr/local/lib/python3.7/dist-packages (from bigdl==0.12.2->analytics-zoo) (1.19.5)
Collecting py4j==0.10.7
  Downloading https://files.pythonhosted.org/packages/e3/53/c737818eb9a7dc32a7cd4f1396e787bd94200c3997c72c1dbe028587bd76/py4j-0.10.7-py2.py3-none-any.whl (197kB)
     |████████████████████████████████| 204kB 53.7MB/s
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-2.4.3-py2.py3-none-any.whl size=215964968 sha256=6d89c562d8dc3446296b8872b20b52d86bddd6342fd3dd59448c51a9d25a0513
  Stored in directory: /root/.cache/pip/wheels/8d/20/f0/b30e2024226dc112e256930dd2cd4f06d00ab053c86278dcf3
Successfully built pyspark
Installing collected packages: conda-pack, py4j, pyspark, bigdl, analytics-zoo
Successfully installed analytics-zoo-0.10.0 bigdl-0.12.2 conda-pack-0.3.1 py4j-0.10.7 pyspark-2.4.3
Call init_nncontext(), which will create a SparkContext with optimized performance configurations.
from zoo.common.nncontext import *
sc = init_nncontext()
Prepending /usr/local/lib/python3.7/dist-packages/bigdl/share/conf/spark-bigdl.conf to sys.path
Adding /usr/local/lib/python3.7/dist-packages/zoo/share/lib/analytics-zoo-bigdl_0.12.2-spark_2.4.3-0.10.0-jar-with-dependencies.jar to BIGDL_JARS
Prepending /usr/local/lib/python3.7/dist-packages/zoo/share/conf/spark-analytics-zoo.conf to sys.path
pyspark_submit_args is: --driver-class-path /usr/local/lib/python3.7/dist-packages/zoo/share/lib/analytics-zoo-bigdl_0.12.2-spark_2.4.3-0.10.0-jar-with-dependencies.jar:/usr/local/lib/python3.7/dist-packages/bigdl/share/lib/bigdl-0.12.2-jar-with-dependencies.jar pyspark-shell
Analytics Zoo provides three recommenders: the Wide and Deep (WND) model, the Neural network-based Collaborative Filtering (NCF) model and the Session Recommender model. They are easy-to-use, Keras-style models that provide compile and fit methods for training. Alternatively, they can be fed into NNFrames or a BigDL Optimizer.
The WND and NCF recommenders can handle either explicit or implicit feedback, given the corresponding features.
from zoo.pipeline.api.keras.layers import *
from zoo.models.recommendation import UserItemFeature
from zoo.models.recommendation import NeuralCF
from zoo.common.nncontext import init_nncontext
import matplotlib
from sklearn import metrics
from operator import itemgetter
from bigdl.util.common import *
import os
import numpy as np
from sklearn import preprocessing
matplotlib.use('agg')
import matplotlib.pyplot as plt
%pylab inline
Populating the interactive namespace from numpy and matplotlib
!wget https://github.com/sparsh-ai/reco-data/raw/master/goodreads/ratings.csv
--2021-06-27 12:57:02--  https://github.com/sparsh-ai/reco-data/raw/master/goodreads/ratings.csv
Resolving github.com (github.com)... 192.30.255.113
Connecting to github.com (github.com)|192.30.255.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/sparsh-ai/reco-data/master/goodreads/ratings.csv [following]
--2021-06-27 12:57:04--  https://raw.githubusercontent.com/sparsh-ai/reco-data/master/goodreads/ratings.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4976229 (4.7M) [text/plain]
Saving to: ‘ratings.csv’

ratings.csv         100%[===================>]   4.75M  --.-KB/s    in 0.1s

2021-06-27 12:57:04 (41.9 MB/s) - ‘ratings.csv’ saved [4976229/4976229]
def read_data_sets(data_dir):
    # Read ratings.csv (skipping the header row) into an int ndarray of (user, book, rating) rows.
    rating_files = os.path.join(data_dir, "ratings.csv")
    with open(rating_files, "r") as f:
        rating_list = [i.strip().split(",") for i in f.readlines()]
    goodreads_data = np.array(rating_list[1:]).astype(int)
    return goodreads_data

def get_id_pairs(data_dir):
    # Return only the (user id, book id) columns.
    goodreads_data = read_data_sets(data_dir)
    return goodreads_data[:, 0:2]

def get_id_ratings(data_dir):
    # Return (user id, book id, rating) with user and book ids re-encoded to 0..n-1.
    goodreads_data = read_data_sets(data_dir)
    le_user = preprocessing.LabelEncoder()
    goodreads_data[:, 0] = le_user.fit_transform(goodreads_data[:, 0])
    le_item = preprocessing.LabelEncoder()
    goodreads_data[:, 1] = le_item.fit_transform(goodreads_data[:, 1])
    return goodreads_data[:, 0:3]
goodreads_data = get_id_ratings("/content")
min_user_id = np.min(goodreads_data[:,0])
max_user_id = np.max(goodreads_data[:,0])
min_book_id = np.min(goodreads_data[:,1])
max_book_id = np.max(goodreads_data[:,1])
rating_labels= np.unique(goodreads_data[:,2])
print(goodreads_data.shape)
print(min_user_id, max_user_id, min_book_id, max_book_id, rating_labels)
(454517, 3) 0 4999 0 4999 [1 2 3 4 5]
Each record is in the format (userid, bookid, rating_score). Both user IDs and book IDs range between 0 and 4999. Ratings are made on a 5-star scale (whole-star ratings only). The minimum and maximum user and book IDs are recorded for later use.
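As an optional sanity check (this cell is not part of the original notebook), the distribution of the star ratings can be inspected directly from the array loaded above:
# Count how many ratings fall on each star value (1-5).
stars, counts = np.unique(goodreads_data[:, 2], return_counts=True)
for s, c in zip(stars, counts):
    print(f"{s} stars: {c} ratings")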
Transform the original data into an RDD of Samples. We use the BigDL optimizer directly to train the model, and it requires the data to be provided as an RDD of Sample. A Sample is a BigDL data structure that can be constructed from two NumPy arrays, the feature and the label respectively. The API is Sample.from_ndarray(feature, label).
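As a minimal illustration of this layout (the ids and rating below are made up, not taken from the dataset):
# One hypothetical record (user_id=12, book_id=34, rating label=4) packed into a Sample:
# the feature is the [user_id, item_id] pair and the label is the rating class.
example_sample = Sample.from_ndarray(np.array([12, 34]), np.array([4]))
print(example_sample)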
def build_sample(user_id, item_id, rating):
    # Pack the (user id, item id) pair as the Sample feature and the rating as its label,
    # keeping the original ids alongside in a UserItemFeature for the Recommender APIs.
    sample = Sample.from_ndarray(np.array([user_id, item_id]), np.array([rating]))
    return UserItemFeature(user_id, item_id, sample)
# Shift the 1-5 star ratings to 0-4 class labels before building the Samples.
pairFeatureRdds = sc.parallelize(goodreads_data).map(lambda x: build_sample(x[0], x[1], x[2]-1))
pairFeatureRdds.take(3)
[<zoo.models.recommendation.recommender.UserItemFeature at 0x7f91617553d0>, <zoo.models.recommendation.recommender.UserItemFeature at 0x7f9161755210>, <zoo.models.recommendation.recommender.UserItemFeature at 0x7f9161755090>]
Randomly split the data into train (80%) and validation (20%)
trainPairFeatureRdds, valPairFeatureRdds = pairFeatureRdds.randomSplit([0.8, 0.2], seed= 1)
valPairFeatureRdds.cache()
train_rdd= trainPairFeatureRdds.map(lambda pair_feature: pair_feature.sample)
val_rdd= valPairFeatureRdds.map(lambda pair_feature: pair_feature.sample)
val_rdd.persist()
PythonRDD[37] at RDD at PythonRDD.scala:53
train_rdd.count()
363662
train_rdd.take(3)
[Sample: features: [JTensor: storage: [510. 276.], shape: [2], float], labels: [JTensor: storage: [3.], shape: [1], float], Sample: features: [JTensor: storage: [2289. 2492.], shape: [2], float], labels: [JTensor: storage: [1.], shape: [1], float], Sample: features: [JTensor: storage: [3919. 1597.], shape: [2], float], labels: [JTensor: storage: [4.], shape: [1], float]]
In Analytics Zoo, it is simple to build an NCF model by calling the NeuralCF API. You need to specify the user count, item count and class number according to your data, then add hidden layers as needed; you can also choose to include matrix factorization in the network. The model can be fed into a BigDL Optimizer or into an analytics-zoo NNClassifier. Please refer to the documentation for more details. In this example, we train through the Keras-style compile and fit APIs, which use the BigDL optimizer under the hood.
ncf = NeuralCF(user_count=max_user_id,
item_count=max_book_id,
class_num=5,
hidden_layers=[20, 10],
include_mf = False)
creating: createZooKerasInput
creating: createZooKerasFlatten
creating: createZooKerasSelect
creating: createZooKerasFlatten
creating: createZooKerasSelect
creating: createZooKerasEmbedding
creating: createZooKerasEmbedding
creating: createZooKerasFlatten
creating: createZooKerasFlatten
creating: createZooKerasMerge
creating: createZooKerasDense
creating: createZooKerasDense
creating: createZooKerasDense
creating: createZooKerasModel
creating: createZooNeuralCF
Compile the model with a specific optimizer and loss, as well as metrics for evaluation. The optimizer tries to minimize the loss of the neural network with respect to its weights/biases over the training set. To create an Optimizer in BigDL directly, you need to specify at least the following arguments: model (a neural network model), criterion (the loss function), training_rdd (the training dataset) and batch size. Please refer to the ProgrammingGuide and the Optimizer documentation for more details on creating efficient optimizers.
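The cells that follow train through the Keras-style compile and fit workflow. As a rough sketch of the direct BigDL Optimizer alternative described above (this cell is not executed in this notebook, and the criterion choice and label convention are assumptions that would need to match the model's softmax output and the zero-based labels built earlier):
from bigdl.optim.optimizer import Optimizer, Adam, MaxEpoch
from bigdl.nn.criterion import ClassNLLCriterion

# Hand the model, loss, training data and batch size to a BigDL Optimizer.
# ClassNLLCriterion is only a placeholder here; its default log-probability input and
# 1-based label convention would need to be reconciled with this model and data.
optimizer = Optimizer(model=ncf,
                      training_rdd=train_rdd,
                      criterion=ClassNLLCriterion(),
                      optim_method=Adam(),
                      end_trigger=MaxEpoch(10),
                      batch_size=5000)
# trained_ncf = optimizer.optimize()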
ncf.compile(optimizer= "adam",
loss= "sparse_categorical_crossentropy",
metrics=['accuracy'])
creating: createAdam
creating: createZooKerasSparseCategoricalCrossEntropy
creating: createZooKerasSparseCategoricalAccuracy
You can leverage TensorBoard to visualize the training summaries.
tmp_log_dir = create_tmp_path()
ncf.set_tensorboard(tmp_log_dir, "training_ncf")
ncf.fit(train_rdd,
nb_epoch= 10,
batch_size= 5000,
validation_data=val_rdd)
Zoo models make inferences on the given data through the model.predict(val_rdd) API, which returns an RDD of results. predict_class returns the predicted labels.
results = ncf.predict(val_rdd)
results.take(5)
[array([0.00071773, 0.00534406, 0.07192371, 0.46765593, 0.45435852], dtype=float32), array([0.14319365, 0.23656234, 0.28515524, 0.24342458, 0.09166414], dtype=float32), array([0.0084668 , 0.02982639, 0.17435056, 0.46424043, 0.32311577], dtype=float32), array([5.7515636e-04, 1.4479134e-02, 3.0612889e-01, 6.0694826e-01, 7.1868621e-02], dtype=float32), array([0.00603686, 0.01832466, 0.12356475, 0.32203293, 0.5300408 ], dtype=float32)]
results_class = ncf.predict_class(val_rdd)
results_class.take(5)
[4, 3, 4, 4, 5]
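Since the sklearn metrics module imported at the top has not been used yet, here is a small optional check (not part of the original notebook) that scores the class predictions against the held-out ratings. It assumes predict_class preserves the ordering of val_rdd and that each Sample exposes its label tensors via .labels, as suggested by the Sample printout earlier; recall that the labels are zero-based (rating - 1) while predict_class returns 1-based classes.
# Collect the true ratings (add 1 to the zero-based labels) and the predicted classes,
# then compute plain accuracy with sklearn.
true_ratings = val_rdd.map(lambda s: int(s.labels[0].to_ndarray()[0]) + 1).collect()
predicted_ratings = results_class.collect()
print("Validation accuracy:", metrics.accuracy_score(true_ratings, predicted_ratings))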
In Analytics Zoo, Recommender provides three dedicated APIs to predict user-item pairs and to make recommendations for users or for items given candidates.
userItemPairPrediction = ncf.predict_user_item_pair(valPairFeatureRdds)
for result in userItemPairPrediction.take(5): print(result)
UserItemPrediction [user_id: 2765, item_id: 53, prediction: 4, probability: 0.4676559269428253]
UserItemPrediction [user_id: 4573, item_id: 95, prediction: 3, probability: 0.2851552367210388]
UserItemPrediction [user_id: 3907, item_id: 15, prediction: 4, probability: 0.4642404317855835]
UserItemPrediction [user_id: 790, item_id: 156, prediction: 4, probability: 0.6069482564926147]
UserItemPrediction [user_id: 3315, item_id: 1721, prediction: 5, probability: 0.5300408005714417]
userRecs = ncf.recommend_for_user(valPairFeatureRdds, 3)
for result in userRecs.take(5): print(result)
UserItemPrediction [user_id: 3586, item_id: 850, prediction: 5, probability: 0.49247753620147705]
UserItemPrediction [user_id: 3586, item_id: 2354, prediction: 4, probability: 0.6351815462112427]
UserItemPrediction [user_id: 3586, item_id: 2787, prediction: 4, probability: 0.6106586456298828]
UserItemPrediction [user_id: 1084, item_id: 4588, prediction: 5, probability: 0.7689940333366394]
UserItemPrediction [user_id: 1084, item_id: 554, prediction: 5, probability: 0.7594876289367676]
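The operator.itemgetter imported at the top has not been used yet; as an optional sketch (not in the original notebook, and assuming each UserItemPrediction exposes user_id, item_id and probability attributes matching the printout above), the per-user recommendations can be collected into a plain Python dict:
# Group the recommendations by user and keep the item ids sorted by predicted probability.
user_top_items = (userRecs
                  .map(lambda p: (p.user_id, (p.item_id, p.probability)))
                  .groupByKey()
                  .mapValues(lambda pairs: [item for item, _ in
                                            sorted(pairs, key=itemgetter(1), reverse=True)])
                  .collectAsMap())
print(list(user_top_items.items())[:3])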
itemRecs = ncf.recommend_for_item(valPairFeatureRdds, 3)
for result in itemRecs.take(5): print(result)
UserItemPrediction [user_id: 3111, item_id: 3558, prediction: 5, probability: 0.48693835735321045]
UserItemPrediction [user_id: 2024, item_id: 3558, prediction: 5, probability: 0.42628324031829834]
UserItemPrediction [user_id: 4909, item_id: 3558, prediction: 4, probability: 0.562437891960144]
UserItemPrediction [user_id: 3023, item_id: 1084, prediction: 5, probability: 0.8389995694160461]
UserItemPrediction [user_id: 4790, item_id: 1084, prediction: 5, probability: 0.6715853810310364]
Plot the train and validation loss curves
# retrieve the train and validation summary objects and read the loss data into ndarrays
train_loss = np.array(ncf.get_train_summary("Loss"))
val_loss = np.array(ncf.get_validation_summary("Loss"))
#plot the train and validation curves
# each event data is a tuple in form of (iteration_count, value, timestamp)
plt.figure(figsize = (12,6))
plt.plot(train_loss[:,0],train_loss[:,1],label='train loss')
plt.plot(val_loss[:,0],val_loss[:,1],label='val loss',color='green')
plt.scatter(val_loss[:,0],val_loss[:,1],color='green')
plt.legend();
plt.xlim(0,train_loss.shape[0]+10)
plt.grid(True)
plt.title("loss")
Text(0.5, 1.0, 'loss')
Plot accuracy
plt.figure(figsize = (12,6))
top1 = np.array(ncf.get_validation_summary("Top1Accuracy"))
plt.plot(top1[:,0],top1[:,1],label='top1')
plt.title("top1 accuracy")
plt.grid(True)
plt.legend();
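As an optional last step (not part of the original notebook), the trained recommender can be persisted so it does not have to be retrained; this sketch assumes the save_model method available on Analytics Zoo models, and the path is just an example.
# Save the trained NCF model to an example path; it can later be reloaded with the
# corresponding Analytics Zoo load API instead of retraining from scratch.
ncf.save_model("/tmp/ncf_goodreads.model", over_write=True)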