# Sampling and log probs¶

In this post, we will take a look at how broadcasting rules can be applied to the prob and log_prob methods of a distribution method. This is the summary of lecture "Probabilistic Deep Learning with Tensorflow 2" from Imperial College London.

• toc: true
• author: Chanseok Kang
• categories: [Python, Coursera, Tensorflow_probability, ICL]
• image:

## Packages¶

In [1]:
import tensorflow as tf
import tensorflow_probability as tfp

import numpy as np
import matplotlib.pyplot as plt

tfd = tfp.distributions

plt.rcParams['figure.figsize'] = (10, 6)

In [2]:
print("Tensorflow Version: ", tf.__version__)
print("Tensorflow Probability Version: ", tfp.__version__)

Tensorflow Version:  2.5.0
Tensorflow Probability Version:  0.13.0


## Overview¶

We will build simple exponential distribution that have 2 by 3 batch shaped. And it is univariate distribution.

In [3]:
exp = tfd.Exponential(rate=[[1., 1.5, 0.8], [0.3, 0.4, 1.8]])
exp

Out[3]:
<tfp.distributions.Exponential 'Exponential' batch_shape=[2, 3] event_shape=[] dtype=float32>

We can convert it Multivariate distribution with Independent Distribution.

In [4]:
ind_exp = tfd.Independent(exp)
ind_exp

Out[4]:
<tfp.distributions.Independent 'IndependentExponential' batch_shape=[2] event_shape=[3] dtype=float32>

Note that we don't use reinterpreted_batch_ndims keyword, so it will convert all but the first batch dimension into the event_shape, which is the default operation.

In [5]:
ind_exp.sample(4)

Out[5]:
<tf.Tensor: shape=(4, 2, 3), dtype=float32, numpy=
array([[[1.1482651 , 0.5365022 , 0.27021408],
[5.6145535 , 5.1724033 , 0.75714827]],

[[1.6911694 , 0.66507334, 1.3505032 ],
[3.0767057 , 1.6154473 , 0.00633118]],

[[0.9210652 , 1.2446263 , 2.043678  ],
[1.4890236 , 1.2732824 , 0.45793444]],

[[1.8119018 , 0.49963415, 0.09335292],
[2.0591164 , 3.0031517 , 0.0490962 ]]], dtype=float32)>

Refresh that when we convert the univariate distribution to independent distribution and sample it, its shape will be (sample_size, batch_size, event_shape).

Let's take a look another distribution, which use reinterpreted_batch_ndims keyword.

In [6]:
rates = [
[[[1., 1.5, 0.8], [0.3, 0.4, 1.8]]],
[[[0.2, 0.4, 1.4], [0.4, 1.1, 0.9]]]
]

In [9]:
np.array(rates).shape

Out[9]:
(2, 1, 2, 3)
In [7]:
exp = tfd.Exponential(rate=rates)
exp

Out[7]:
<tfp.distributions.Exponential 'Exponential' batch_shape=[2, 1, 2, 3] event_shape=[] dtype=float32>
In [8]:
ind_exp = tfd.Independent(exp, reinterpreted_batch_ndims=2)
ind_exp

Out[8]:
<tfp.distributions.Independent 'IndependentExponential' batch_shape=[2, 1] event_shape=[2, 3] dtype=float32>

Now we have a rank 2 batch_shape and rank 2 event_shape.

In [10]:
ind_exp.sample([4, 2])

Out[10]:
<tf.Tensor: shape=(4, 2, 2, 1, 2, 3), dtype=float32, numpy=
array([[[[[[8.52314979e-02, 2.20484066e+00, 1.24831609e-01],
[5.72363853e+00, 3.56653333e+00, 1.73619881e-01]]],

[[[4.51495945e-01, 5.31205177e+00, 9.25723433e-01],
[6.85539341e+00, 1.23257709e+00, 8.64088893e-01]]]],

[[[[1.35191846e+00, 2.72666621e+00, 9.93554652e-01],
[2.71719074e+00, 4.16195679e+00, 7.69947097e-02]]],

[[[2.47778010e+00, 1.24962020e+00, 1.23250462e-01],
[2.38104010e+00, 1.49862719e+00, 5.34713209e-01]]]]],

[[[[[1.12703860e+00, 1.69036031e-01, 1.75236121e-01],
[4.30009747e+00, 7.73049667e-02, 9.20982718e-01]]],

[[[2.07261410e+01, 4.24093342e+00, 9.07527208e-02],
[4.11562204e+00, 2.51152706e+00, 3.07851344e-01]]]],

[[[[1.05817151e+00, 5.20415246e-01, 1.24326766e+00],
[6.05947375e-01, 9.40155125e+00, 3.41252506e-01]]],

[[[6.36713266e+00, 3.30024123e+00, 3.68020654e-01],
[5.06028198e-02, 6.59128129e-02, 1.67161095e+00]]]]],

[[[[[9.82080624e-02, 2.37932968e+00, 1.01604521e+00],
[5.45345068e+00, 3.30391824e-01, 2.25491732e-01]]],

[[[5.43883753e+00, 8.75549436e-01, 6.05356740e-03],
[2.01855755e+00, 9.21276867e-01, 2.28063658e-01]]]],

[[[[1.91371128e-01, 1.38863182e+00, 2.60386395e+00],
[1.99791002e+00, 3.85638475e+00, 6.53963163e-02]]],

[[[2.21251488e+01, 1.21040082e+00, 1.30126461e-01],
[2.82561398e+00, 9.84178245e-01, 1.09785414e+00]]]]],

[[[[[1.06780671e-01, 8.60901594e-01, 7.70675957e-01],
[6.49312556e-01, 7.30921552e-02, 3.12643439e-01]]],

[[[8.84963870e-01, 1.66497517e+00, 4.34715062e-01],
[1.08664826e-01, 9.62030329e-03, 1.07902026e+00]]]],

[[[[1.40111685e-01, 3.47986722e+00, 3.60813737e-01],
[1.61084032e+00, 9.10717964e-01, 1.13252953e-01]]],

[[[5.73364735e-01, 2.30145860e+00, 2.11053286e-02],
[3.76549268e+00, 2.19317222e+00, 2.94480515e+00]]]]]],
dtype=float32)>

When we sample rank 2 data, its output tensor has rank 6 tensor, which has same manner of generating single sample. In details, it has [4, 2] sample size, [2, 1] batchs, and [2, 3] events.

In [11]:
ind_exp.log_prob(0.5)

Out[11]:
<tf.Tensor: shape=(2, 1), dtype=float32, numpy=
array([[-4.2501583],
[-5.3156004]], dtype=float32)>

When we want some log probability of single data, its output is broadcasting and has batch_size shape. As a general ruls, the log prob (and prob) method broadcast its input against the batch and event_shape.

In [12]:
ind_exp.log_prob([[0.3, 0.5, 0.8]])

Out[12]:
<tf.Tensor: shape=(2, 1), dtype=float32, numpy=
array([[-4.770158],
[-5.8856  ]], dtype=float32)>

So what if we want to gather log probability of complex tensors?

In [14]:
tf.random.uniform((5, 1, 1, 2, 1)).shape

Out[14]:
TensorShape([5, 1, 1, 2, 1])
In [13]:
ind_exp.log_prob(tf.random.uniform((5, 1, 1, 2, 1)))

Out[13]:
<tf.Tensor: shape=(5, 2, 1), dtype=float32, numpy=
array([[[-2.1549933],
[-3.6172578]],

[[-4.4256163],
[-5.612159 ]],

[[-6.6976185],
[-7.158457 ]],

[[-4.3900895],
[-5.1580105]],

[[-2.5316038],
[-4.051901 ]]], dtype=float32)>

## Tutorial¶

In [16]:
# Make Multivariate Distribution
loc = [[0.5, 1], [0.1, 0], [0, 0.2]]
scale_diag = [[2., 3], [1., 3], [4., 4]]

In [17]:
normal_distributions = tfd.MultivariateNormalDiag(loc=loc, scale_diag=scale_diag)
normal_distributions

Out[17]:
<tfp.distributions.MultivariateNormalDiag 'MultivariateNormalDiag' batch_shape=[3] event_shape=[2] dtype=float32>
In [18]:
# Sample
normal_distributions.sample(5)

Out[18]:
<tf.Tensor: shape=(5, 3, 2), dtype=float32, numpy=
array([[[ 0.30364102,  2.3371062 ],
[ 1.118043  , -0.4802311 ],
[-5.099995  , -4.291627  ]],

[[-0.8409153 , -2.1168094 ],
[ 1.4500505 , -2.828692  ],
[ 1.3774487 , -0.36408925]],

[[-1.7672858 , -3.1279678 ],
[ 2.3426754 , -1.0637026 ],
[-0.3274685 ,  2.4647486 ]],

[[-0.42931175, -0.46005905],
[ 1.943181  , -4.173031  ],
[ 4.5748634 , -1.8085063 ]],

[[ 2.2362173 ,  3.2581265 ],
[-0.18285483, -3.6155434 ],
[-1.196823  ,  8.819077  ]]], dtype=float32)>
In [19]:
# Multivariate Normal batched distribution
# We are broadcasting batch shapes of loc and scale_diag
# against each other

loc = [[[0.3, 1.5, 1.], [0.2, 0.4, 2.8]],
[[2., 2.3, 8], [1.4, 1., 1.3]]]
scale_diag = [0.4, 1., 0.7]

In [20]:
normal_distributions = tfd.MultivariateNormalDiag(loc=loc, scale_diag=scale_diag)
normal_distributions

Out[20]:
<tfp.distributions.MultivariateNormalDiag 'MultivariateNormalDiag' batch_shape=[2, 2] event_shape=[3] dtype=float32>
In [21]:
# Use independent to move part of the batch shape

ind_normal_distribution = tfd.Independent(normal_distributions,
reinterpreted_batch_ndims=1)
ind_normal_distribution

Out[21]:
<tfp.distributions.Independent 'IndependentMultivariateNormalDiag' batch_shape=[2] event_shape=[2, 3] dtype=float32>
In [22]:
# Draw some samples
samples = ind_normal_distribution.sample(5)
samples.shape

Out[22]:
TensorShape([5, 2, 2, 3])
In [23]:
# [B, E] shaped input
inp = tf.random.uniform((2, 2, 3))
ind_normal_distribution.log_prob(inp)

Out[23]:
<tf.Tensor: shape=(2,), dtype=float32, numpy=array([-11.195248, -76.89377 ], dtype=float32)>
In [24]:
# [E] shaped input (broadcasting over batch size)
inp = tf.random.uniform((2, 3))
ind_normal_distribution.log_prob(inp)

Out[24]:
<tf.Tensor: shape=(2,), dtype=float32, numpy=array([-10.743417, -75.82509 ], dtype=float32)>
In [25]:
# [S, B, E] shaped input
inp = tf.random.uniform((9, 2, 2, 3))
ind_normal_distribution.log_prob(inp)

Out[25]:
<tf.Tensor: shape=(9, 2), dtype=float32, numpy=
array([[ -9.456959, -74.69543 ],
[-10.222152, -70.511894],
[ -9.343542, -79.319824],
[ -9.667246, -75.2012  ],
[-10.971153, -62.26688 ],
[ -9.592494, -68.68516 ],
[ -9.568048, -74.8657  ],
[-12.416011, -68.85194 ],
[ -8.812517, -66.78156 ]], dtype=float32)>
In [27]:
# [S, b, e] shaped input, where [b, e] is broadcastable over [B, E]
inp = tf.random.uniform((5, 1, 2, 1))
ind_normal_distribution.log_prob(inp)

WARNING: Nested component "MultivariateNormalDiag_scale_matvec_linear_operator" in composition "MultivariateNormalDiag_chain_of_MultivariateNormalDiag_shift_of_MultivariateNormalDiag_scale_matvec_linear_operator" operates on inputs with increased degrees of freedom. This may result in an incorrect log_det_jacobian.

Out[27]:
<tf.Tensor: shape=(5, 2), dtype=float32, numpy=
array([[-11.335012, -81.39036 ],
[ -9.282346, -63.680298],
[ -9.340945, -68.03868 ],
[-12.372825, -66.20646 ],
[-10.076466, -80.59617 ]], dtype=float32)>

### Naive Bayes example¶

Let's now use what we have learned and continue the naive bayes classifier we were building last tutorial.

In [28]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import f1_score

In [29]:
# Making a function get_data which:
#   1) Fetches the 20 newsgroup dataset
#   2) Performs a word count on the articles and binarizes the result
#   3) Returns the data as a numpy matrix with the labels

def get_data(categories):

newsgroups_train_data = fetch_20newsgroups(data_home='./dataset/20_Newsgroup_Data/',
subset='train', categories=categories)
newsgroups_test_data = fetch_20newsgroups(data_home='./dataset/20_Newsgroup_Data/',
subset='test', categories=categories)

n_documents = len(newsgroups_train_data['data'])
count_vectorizer = CountVectorizer(input='content', binary=True,max_df=0.25, min_df=1.01/n_documents)

train_binary_bag_of_words = count_vectorizer.fit_transform(newsgroups_train_data['data'])
test_binary_bag_of_words = count_vectorizer.transform(newsgroups_test_data['data'])

return (train_binary_bag_of_words.todense(), newsgroups_train_data['target']),  (test_binary_bag_of_words.todense(), newsgroups_test_data['target'])

In [30]:
# Defining a function to conduct Laplace smoothing. This adds a base level of probability for a given feature
# to occur in every class.

def laplace_smoothing(labels, binary_data, n_classes):
# Compute the parameter estimates (adjusted fraction of documents in class that contain word)
n_words = binary_data.shape[1]
alpha = 1 # parameters for Laplace smoothing
theta = np.zeros([n_classes, n_words]) # stores parameter values - prob. word given class
for c_k in range(n_classes): # 0, 1, ..., 19
class_mask = (labels == c_k)
N = class_mask.sum() # number of articles in class
theta[c_k, :] = (binary_data[class_mask, :].sum(axis=0) + alpha)/(N + alpha*2)

return theta

In [31]:
# Getting a subset of the 20 newsgroup dataset

categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']

(train_data, train_labels), (test_data, test_labels) = get_data(categories=categories)
smoothed_counts = laplace_smoothing(labels=train_labels, binary_data=train_data, n_classes=len(categories))


To now make our NB classifier we need to build three functions:

• Compute the class priors
• Build our class conditional distributions
• Put it all together and classify our data
In [33]:
# Function which computes the prior probability of every class based on frequency of occurence in
# the dataset

def class_priors(n_classes, labels):
counts = np.zeros(n_classes)
for c_k in range(n_classes):
counts[c_k] = np.sum(np.where(labels==c_k, 1, 0))
priors = counts / np.sum(counts)
print('The class priors are {}'.format(priors))
return priors

In [34]:
# Run the function

priors = class_priors(n_classes=len(categories), labels=train_labels)

The class priors are [0.2359882  0.28711898 0.29154376 0.18534907]

In [35]:
# Now we will do a function that given the feature occurence counts returns a Bernoulli distribution of
# batch_shape=number of classes and event_shape=number of features.

def make_distribution(probs):
batch_of_bernoulli = tfd.Bernoulli(probs=probs)
dist = tfd.Independent(batch_of_bernoulli, reinterpreted_batch_ndims=1)
return dist

tf_dist = make_distribution(smoothed_counts)
tf_dist

Out[35]:
<tfp.distributions.Independent 'IndependentBernoulli' batch_shape=[4] event_shape=[17495] dtype=int32>
In [36]:
# The final function predict_sample which given the distribution, a test sample, and the class priors:
#   1) Computes the class conditional probabilities given the sample
#   2) Forms the joint likelihood
#   3) Normalises the joint likelihood and returns the log prob

def predict_sample(dist, sample, priors):
cond_probs = dist.log_prob(sample)
joint_likelihood = tf.add(np.log(priors), cond_probs)
norm_factor = tf.math.reduce_logsumexp(joint_likelihood, axis=-1, keepdims=True)
log_prob = joint_likelihood - norm_factor

return log_prob


#### Computing log_probs¶

In [37]:
# Predicting one example from our test data

log_probs = predict_sample(tf_dist, test_data[0], priors)
log_probs

Out[37]:
<tf.Tensor: shape=(4,), dtype=float32, numpy=
array([-6.1736343e+01, -1.5258789e-05, -1.1620026e+01, -6.3327866e+01],
dtype=float32)>
In [38]:
# Loop over our test data and classify

probabilities = []

for sample, label in zip(test_data, test_labels):
probabilities.append(tf.exp(predict_sample(tf_dist, sample, priors)))

probabilities = np.asarray(probabilities)
predict_class = np.argmax(probabilities, axis=-1)

print('F1 score: ', f1_score(test_labels, predict_class, average='macro'))

F1 score:  0.7848499112849504

In [39]:
# Make a Bernoulli Naive Bayes classifier using sklearn with the same level of alpha smoothing

clf = BernoulliNB(alpha=1)
clf.fit(train_data, train_labels)
pred = clf.predict(test_data)
print('F1 from sklearn: ', f1_score(test_labels, pred, average='macro'))

F1 from sklearn:  0.7848499112849504