In this post, we will take a look at how broadcasting rules apply to the prob and log_prob methods of a Distribution object. This is a summary of the lecture "Probabilistic Deep Learning with TensorFlow 2" from Imperial College London.
import tensorflow as tf
import tensorflow_probability as tfp
import numpy as np
import matplotlib.pyplot as plt
tfd = tfp.distributions
plt.rcParams['figure.figsize'] = (10, 6)
print("Tensorflow Version: ", tf.__version__)
print("Tensorflow Probability Version: ", tfp.__version__)
Tensorflow Version: 2.5.0 Tensorflow Probability Version: 0.13.0
We will build a simple exponential distribution with a 2-by-3 batch shape. It is a univariate distribution.
exp = tfd.Exponential(rate=[[1., 1.5, 0.8], [0.3, 0.4, 1.8]])
exp
<tfp.distributions.Exponential 'Exponential' batch_shape=[2, 3] event_shape=[] dtype=float32>
We can convert it to a multivariate distribution with the Independent distribution.
ind_exp = tfd.Independent(exp)
ind_exp
<tfp.distributions.Independent 'IndependentExponential' batch_shape=[2] event_shape=[3] dtype=float32>
Note that we didn't pass the reinterpreted_batch_ndims keyword, so the default behaviour applies: all but the first batch dimension are absorbed into the event_shape.
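For comparison, here is a minimal sketch (ind_exp_explicit is an illustrative name) of the explicit keyword value that reproduces the default in this case:
# With batch_shape=[2, 3], the default absorbs all but the first batch
# dimension, i.e. the single rightmost one: reinterpreted_batch_ndims=1
ind_exp_explicit = tfd.Independent(exp, reinterpreted_batch_ndims=1)
print(ind_exp_explicit)  # batch_shape=[2] event_shape=[3]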
ind_exp.sample(4)
<tf.Tensor: shape=(4, 2, 3), dtype=float32, numpy= array([[[1.1482651 , 0.5365022 , 0.27021408], [5.6145535 , 5.1724033 , 0.75714827]], [[1.6911694 , 0.66507334, 1.3505032 ], [3.0767057 , 1.6154473 , 0.00633118]], [[0.9210652 , 1.2446263 , 2.043678 ], [1.4890236 , 1.2732824 , 0.45793444]], [[1.8119018 , 0.49963415, 0.09335292], [2.0591164 , 3.0031517 , 0.0490962 ]]], dtype=float32)>
Recall that when we convert a univariate distribution to an Independent distribution and sample from it, the resulting shape is (sample_size, batch_size, event_shape).
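We can verify this shape rule programmatically; a small sanity check (expected is an illustrative name):
# sample shape = sample_shape + batch_shape + event_shape
expected = tf.TensorShape([4]).concatenate(ind_exp.batch_shape).concatenate(ind_exp.event_shape)
print(ind_exp.sample(4).shape == expected)  # True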
Let's take a look at another distribution, this time using the reinterpreted_batch_ndims keyword.
rates = [
    [[[1., 1.5, 0.8], [0.3, 0.4, 1.8]]],
    [[[0.2, 0.4, 1.4], [0.4, 1.1, 0.9]]]
]
np.array(rates).shape
(2, 1, 2, 3)
exp = tfd.Exponential(rate=rates)
exp
<tfp.distributions.Exponential 'Exponential' batch_shape=[2, 1, 2, 3] event_shape=[] dtype=float32>
ind_exp = tfd.Independent(exp, reinterpreted_batch_ndims=2)
ind_exp
<tfp.distributions.Independent 'IndependentExponential' batch_shape=[2, 1] event_shape=[2, 3] dtype=float32>
Now we have a rank-2 batch_shape and a rank-2 event_shape.
ind_exp.sample([4, 2])
<tf.Tensor: shape=(4, 2, 2, 1, 2, 3), dtype=float32, numpy= array([[[[[[8.52314979e-02, 2.20484066e+00, 1.24831609e-01], [5.72363853e+00, 3.56653333e+00, 1.73619881e-01]]], [[[4.51495945e-01, 5.31205177e+00, 9.25723433e-01], [6.85539341e+00, 1.23257709e+00, 8.64088893e-01]]]], [[[[1.35191846e+00, 2.72666621e+00, 9.93554652e-01], [2.71719074e+00, 4.16195679e+00, 7.69947097e-02]]], [[[2.47778010e+00, 1.24962020e+00, 1.23250462e-01], [2.38104010e+00, 1.49862719e+00, 5.34713209e-01]]]]], [[[[[1.12703860e+00, 1.69036031e-01, 1.75236121e-01], [4.30009747e+00, 7.73049667e-02, 9.20982718e-01]]], [[[2.07261410e+01, 4.24093342e+00, 9.07527208e-02], [4.11562204e+00, 2.51152706e+00, 3.07851344e-01]]]], [[[[1.05817151e+00, 5.20415246e-01, 1.24326766e+00], [6.05947375e-01, 9.40155125e+00, 3.41252506e-01]]], [[[6.36713266e+00, 3.30024123e+00, 3.68020654e-01], [5.06028198e-02, 6.59128129e-02, 1.67161095e+00]]]]], [[[[[9.82080624e-02, 2.37932968e+00, 1.01604521e+00], [5.45345068e+00, 3.30391824e-01, 2.25491732e-01]]], [[[5.43883753e+00, 8.75549436e-01, 6.05356740e-03], [2.01855755e+00, 9.21276867e-01, 2.28063658e-01]]]], [[[[1.91371128e-01, 1.38863182e+00, 2.60386395e+00], [1.99791002e+00, 3.85638475e+00, 6.53963163e-02]]], [[[2.21251488e+01, 1.21040082e+00, 1.30126461e-01], [2.82561398e+00, 9.84178245e-01, 1.09785414e+00]]]]], [[[[[1.06780671e-01, 8.60901594e-01, 7.70675957e-01], [6.49312556e-01, 7.30921552e-02, 3.12643439e-01]]], [[[8.84963870e-01, 1.66497517e+00, 4.34715062e-01], [1.08664826e-01, 9.62030329e-03, 1.07902026e+00]]]], [[[[1.40111685e-01, 3.47986722e+00, 3.60813737e-01], [1.61084032e+00, 9.10717964e-01, 1.13252953e-01]]], [[[5.73364735e-01, 2.30145860e+00, 2.11053286e-02], [3.76549268e+00, 2.19317222e+00, 2.94480515e+00]]]]]], dtype=float32)>
When we draw a [4, 2]-shaped sample, the output is a rank-6 tensor, built in the same way as a single sample. In detail, it has a [4, 2] sample shape, a [2, 1] batch shape, and a [2, 3] event shape.
ind_exp.log_prob(0.5)
<tf.Tensor: shape=(2, 1), dtype=float32, numpy= array([[-4.2501583], [-5.3156004]], dtype=float32)>
When we compute the log probability of a single scalar, the input is broadcast and the output has the batch shape. As a general rule, the log_prob (and prob) method broadcasts its input against the batch_shape and event_shape.
ind_exp.log_prob([[0.3, 0.5, 0.8]])
<tf.Tensor: shape=(2, 1), dtype=float32, numpy= array([[-4.770158], [-5.8856 ]], dtype=float32)>
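Here the input has shape [1, 3]: it is broadcast against batch_shape + event_shape = [2, 1, 2, 3], and after summing over the event dimensions [2, 3] the output again has the batch shape [2, 1].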
So what if we want to compute log probabilities for higher-rank input tensors?
tf.random.uniform((5, 1, 1, 2, 1)).shape
TensorShape([5, 1, 1, 2, 1])
ind_exp.log_prob(tf.random.uniform((5, 1, 1, 2, 1)))
<tf.Tensor: shape=(5, 2, 1), dtype=float32, numpy= array([[[-2.1549933], [-3.6172578]], [[-4.4256163], [-5.612159 ]], [[-6.6976185], [-7.158457 ]], [[-4.3900895], [-5.1580105]], [[-2.5316038], [-4.051901 ]]], dtype=float32)>
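To see why the output has shape [5, 2, 1], we can redo the shape arithmetic explicitly (full_shape is an illustrative name):
# The input broadcasts against batch_shape + event_shape = [2, 1] + [2, 3]
full_shape = tf.broadcast_static_shape(
    tf.TensorShape([5, 1, 1, 2, 1]),
    ind_exp.batch_shape.concatenate(ind_exp.event_shape))
print(full_shape)  # (5, 2, 1, 2, 3)
# log_prob then sums over the event dims [2, 3], leaving (5, 2, 1)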
# Make Multivariate Distribution
loc = [[0.5, 1], [0.1, 0], [0, 0.2]]
scale_diag = [[2., 3], [1., 3], [4., 4]]
normal_distributions = tfd.MultivariateNormalDiag(loc=loc, scale_diag=scale_diag)
normal_distributions
<tfp.distributions.MultivariateNormalDiag 'MultivariateNormalDiag' batch_shape=[3] event_shape=[2] dtype=float32>
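Note how the parameter shapes determine the distribution shapes here: loc and scale_diag both have shape [3, 2]; the last dimension (2) becomes the event_shape and the leading dimension (3) becomes the batch_shape.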
# Sample
normal_distributions.sample(5)
<tf.Tensor: shape=(5, 3, 2), dtype=float32, numpy= array([[[ 0.30364102, 2.3371062 ], [ 1.118043 , -0.4802311 ], [-5.099995 , -4.291627 ]], [[-0.8409153 , -2.1168094 ], [ 1.4500505 , -2.828692 ], [ 1.3774487 , -0.36408925]], [[-1.7672858 , -3.1279678 ], [ 2.3426754 , -1.0637026 ], [-0.3274685 , 2.4647486 ]], [[-0.42931175, -0.46005905], [ 1.943181 , -4.173031 ], [ 4.5748634 , -1.8085063 ]], [[ 2.2362173 , 3.2581265 ], [-0.18285483, -3.6155434 ], [-1.196823 , 8.819077 ]]], dtype=float32)>
# Multivariate Normal batched distribution
# We are broadcasting batch shapes of `loc` and `scale_diag`
# against each other
loc = [[[0.3, 1.5, 1.], [0.2, 0.4, 2.8]],
       [[2., 2.3, 8], [1.4, 1., 1.3]]]
scale_diag = [0.4, 1., 0.7]
normal_distributions = tfd.MultivariateNormalDiag(loc=loc, scale_diag=scale_diag)
normal_distributions
<tfp.distributions.MultivariateNormalDiag 'MultivariateNormalDiag' batch_shape=[2, 2] event_shape=[3] dtype=float32>
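Here loc has shape [2, 2, 3] and scale_diag has shape [3]: the two are broadcast against each other, the last dimension becomes the event_shape [3], and the remaining dimensions [2, 2] become the batch_shape.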
# Use independent to move part of the batch shape
ind_normal_distribution = tfd.Independent(normal_distributions,
                                          reinterpreted_batch_ndims=1)
ind_normal_distribution
<tfp.distributions.Independent 'IndependentMultivariateNormalDiag' batch_shape=[2] event_shape=[2, 3] dtype=float32>
# Draw some samples
samples = ind_normal_distribution.sample(5)
samples.shape
TensorShape([5, 2, 2, 3])
# [B, E] shaped input
inp = tf.random.uniform((2, 2, 3))
ind_normal_distribution.log_prob(inp)
<tf.Tensor: shape=(2,), dtype=float32, numpy=array([-11.195248, -76.89377 ], dtype=float32)>
# [E] shaped input (broadcasting over batch size)
inp = tf.random.uniform((2, 3))
ind_normal_distribution.log_prob(inp)
<tf.Tensor: shape=(2,), dtype=float32, numpy=array([-10.743417, -75.82509 ], dtype=float32)>
# [S, B, E] shaped input
inp = tf.random.uniform((9, 2, 2, 3))
ind_normal_distribution.log_prob(inp)
<tf.Tensor: shape=(9, 2), dtype=float32, numpy= array([[ -9.456959, -74.69543 ], [-10.222152, -70.511894], [ -9.343542, -79.319824], [ -9.667246, -75.2012 ], [-10.971153, -62.26688 ], [ -9.592494, -68.68516 ], [ -9.568048, -74.8657 ], [-12.416011, -68.85194 ], [ -8.812517, -66.78156 ]], dtype=float32)>
# [S, b, e] shaped input, where [b, e] is broadcastable over [B, E]
inp = tf.random.uniform((5, 1, 2, 1))
ind_normal_distribution.log_prob(inp)
WARNING: Nested component "MultivariateNormalDiag_scale_matvec_linear_operator" in composition "MultivariateNormalDiag_chain_of_MultivariateNormalDiag_shift_of_MultivariateNormalDiag_scale_matvec_linear_operator" operates on inputs with increased degrees of freedom. This may result in an incorrect log_det_jacobian.
<tf.Tensor: shape=(5, 2), dtype=float32, numpy= array([[-11.335012, -81.39036 ], [ -9.282346, -63.680298], [ -9.340945, -68.03868 ], [-12.372825, -66.20646 ], [-10.076466, -80.59617 ]], dtype=float32)>
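The warning is raised because the input's trailing dimensions [2, 1] are broadcast inside the event_shape [2, 3]. The shapes still work out to [5, 2] as before, but broadcasting within event dimensions can be unsafe for some distributions, so it is usually safer to pass fully-shaped events.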
Let's now use what we have learned and continue building the Naive Bayes classifier from the last tutorial.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import f1_score
# Making a function get_data which:
# 1) Fetches the 20 newsgroup dataset
# 2) Performs a word count on the articles and binarizes the result
# 3) Returns the data as a numpy matrix with the labels
def get_data(categories):
    newsgroups_train_data = fetch_20newsgroups(data_home='./dataset/20_Newsgroup_Data/',
                                               subset='train', categories=categories)
    newsgroups_test_data = fetch_20newsgroups(data_home='./dataset/20_Newsgroup_Data/',
                                              subset='test', categories=categories)
    n_documents = len(newsgroups_train_data['data'])
    count_vectorizer = CountVectorizer(input='content', binary=True,
                                       max_df=0.25, min_df=1.01/n_documents)
    train_binary_bag_of_words = count_vectorizer.fit_transform(newsgroups_train_data['data'])
    test_binary_bag_of_words = count_vectorizer.transform(newsgroups_test_data['data'])
    return (train_binary_bag_of_words.todense(), newsgroups_train_data['target']), \
           (test_binary_bag_of_words.todense(), newsgroups_test_data['target'])
# Defining a function to conduct Laplace smoothing. This adds a base level of probability for a given feature
# to occur in every class.
def laplace_smoothing(labels, binary_data, n_classes):
    # Compute the parameter estimates (adjusted fraction of documents in class that contain word)
    n_words = binary_data.shape[1]
    alpha = 1  # parameter for Laplace smoothing
    theta = np.zeros([n_classes, n_words])  # stores parameter values - prob. word given class
    for c_k in range(n_classes):  # 0, 1, ..., n_classes - 1
        class_mask = (labels == c_k)
        N = class_mask.sum()  # number of articles in class
        theta[c_k, :] = (binary_data[class_mask, :].sum(axis=0) + alpha) / (N + alpha*2)
    return theta
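A quick toy check of the smoothing formula, using made-up numbers rather than the dataset: a word seen in 0 of 8 documents of a class gets probability (0 + 1) / (8 + 2) = 0.1 instead of 0.
toy_labels = np.zeros(8, dtype=int)
toy_data = np.zeros((8, 1))
print(laplace_smoothing(toy_labels, toy_data, n_classes=1))  # [[0.1]]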
# Getting a subset of the 20 newsgroup dataset
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
(train_data, train_labels), (test_data, test_labels) = get_data(categories=categories)
smoothed_counts = laplace_smoothing(labels=train_labels, binary_data=train_data, n_classes=len(categories))
To now make our Naive Bayes classifier, we need to build three functions:
# Function which computes the prior probability of every class based on frequency of occurrence in
# the dataset
def class_priors(n_classes, labels):
    counts = np.zeros(n_classes)
    for c_k in range(n_classes):
        counts[c_k] = np.sum(np.where(labels == c_k, 1, 0))
    priors = counts / np.sum(counts)
    print('The class priors are {}'.format(priors))
    return priors
# Run the function
priors = class_priors(n_classes=len(categories), labels=train_labels)
The class priors are [0.2359882 0.28711898 0.29154376 0.18534907]
# Now we will define a function that, given the feature occurrence counts, returns a Bernoulli
# distribution with batch_shape = number of classes and event_shape = number of features.
def make_distribution(probs):
    batch_of_bernoulli = tfd.Bernoulli(probs=probs)
    dist = tfd.Independent(batch_of_bernoulli, reinterpreted_batch_ndims=1)
    return dist
tf_dist = make_distribution(smoothed_counts)
tf_dist
<tfp.distributions.Independent 'IndependentBernoulli' batch_shape=[4] event_shape=[17495] dtype=int32>
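probs has shape [4, 17495] (one Bernoulli parameter per class-word pair), so the raw Bernoulli has batch_shape=[4, 17495]; Independent with reinterpreted_batch_ndims=1 then moves the word dimension into the event_shape, leaving one 17495-dimensional distribution per class.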
# The final function predict_sample which given the distribution, a test sample, and the class priors:
# 1) Computes the class conditional probabilities given the sample
# 2) Forms the joint likelihood
# 3) Normalises the joint likelihood and returns the log prob
def predict_sample(dist, sample, priors):
    cond_probs = dist.log_prob(sample)
    joint_likelihood = tf.add(np.log(priors), cond_probs)
    norm_factor = tf.math.reduce_logsumexp(joint_likelihood, axis=-1, keepdims=True)
    log_prob = joint_likelihood - norm_factor
    return log_prob
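This is Bayes' rule in log space: the log joint log p(sample | class) + log p(class) is normalised by subtracting its logsumexp over classes, giving log p(class | sample).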
# Predicting one example from our test data
log_probs = predict_sample(tf_dist, test_data[0], priors)
log_probs
<tf.Tensor: shape=(4,), dtype=float32, numpy= array([-6.1736343e+01, -1.5258789e-05, -1.1620026e+01, -6.3327866e+01], dtype=float32)>
# Loop over our test data and classify
probabilities = []
for sample, label in zip(test_data, test_labels):
    probabilities.append(tf.exp(predict_sample(tf_dist, sample, priors)))
probabilities = np.asarray(probabilities)
predict_class = np.argmax(probabilities, axis=-1)
print('F1 score: ', f1_score(test_labels, predict_class, average='macro'))
F1 score: 0.7848499112849504
# Make a Bernoulli Naive Bayes classifier using sklearn with the same level of alpha smoothing
clf = BernoulliNB(alpha=1)
clf.fit(train_data, train_labels)
pred = clf.predict(test_data)
print('F1 from sklearn: ', f1_score(test_labels, pred, average='macro'))
F1 from sklearn: 0.7848499112849504