Sampling and log probs

In this post, we will take a look at how broadcasting rules apply to the prob and log_prob methods of a distribution object. This is a summary of the lecture "Probabilistic Deep Learning with Tensorflow 2" from Imperial College London.

  • toc: true
  • badges: true
  • comments: true
  • author: Chanseok Kang
  • categories: [Python, Coursera, Tensorflow_probability, ICL]
  • image:

Packages

In [1]:
import tensorflow as tf
import tensorflow_probability as tfp

import numpy as np
import matplotlib.pyplot as plt

tfd = tfp.distributions

plt.rcParams['figure.figsize'] = (10, 6)
In [2]:
print("Tensorflow Version: ", tf.__version__)
print("Tensorflow Probability Version: ", tfp.__version__)
Tensorflow Version:  2.5.0
Tensorflow Probability Version:  0.13.0

Overview

We will build a simple univariate exponential distribution with a 2-by-3 batch shape.

In [3]:
exp = tfd.Exponential(rate=[[1., 1.5, 0.8], [0.3, 0.4, 1.8]])
exp
Out[3]:
<tfp.distributions.Exponential 'Exponential' batch_shape=[2, 3] event_shape=[] dtype=float32>

We can convert it into a multivariate distribution with the Independent distribution.

In [4]:
ind_exp = tfd.Independent(exp)
ind_exp
Out[4]:
<tfp.distributions.Independent 'IndependentExponential' batch_shape=[2] event_shape=[3] dtype=float32>

Note that we didn't pass the reinterpreted_batch_ndims keyword argument, so the default behaviour applies: all but the first batch dimension are converted into the event_shape.
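The default can be sketched with a small shape helper (hypothetical, purely to illustrate how Independent splits the dimensions; it is not part of the TFP API):

```python
# Hypothetical helper illustrating how Independent splits batch_shape.
# By default, reinterpreted_batch_ndims is len(batch_shape) - 1, i.e.
# all but the first batch dimension become event dimensions.
def independent_shapes(batch_shape, reinterpreted_batch_ndims=None):
    if reinterpreted_batch_ndims is None:
        reinterpreted_batch_ndims = len(batch_shape) - 1  # the default
    split = len(batch_shape) - reinterpreted_batch_ndims
    return batch_shape[:split], batch_shape[split:]

print(independent_shapes([2, 3]))  # ([2], [3]): batch_shape=[2], event_shape=[3]
```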

In [5]:
ind_exp.sample(4)
Out[5]:
<tf.Tensor: shape=(4, 2, 3), dtype=float32, numpy=
array([[[1.1482651 , 0.5365022 , 0.27021408],
        [5.6145535 , 5.1724033 , 0.75714827]],

       [[1.6911694 , 0.66507334, 1.3505032 ],
        [3.0767057 , 1.6154473 , 0.00633118]],

       [[0.9210652 , 1.2446263 , 2.043678  ],
        [1.4890236 , 1.2732824 , 0.45793444]],

       [[1.8119018 , 0.49963415, 0.09335292],
        [2.0591164 , 3.0031517 , 0.0490962 ]]], dtype=float32)>

Recall that when we convert a univariate distribution to an Independent distribution and sample from it, the output shape is (sample_shape, batch_shape, event_shape).
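This shape rule can be written out in plain Python (a sketch with a hypothetical helper name, just to make the concatenation explicit):

```python
# Sketch of the sample-shape rule: the output of sample() has shape
# sample_shape + batch_shape + event_shape, concatenated in that order.
def expected_sample_shape(sample_shape, batch_shape, event_shape):
    return tuple(sample_shape) + tuple(batch_shape) + tuple(event_shape)

print(expected_sample_shape([4], [2], [3]))  # (4, 2, 3), as in the output above
```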

Let's take a look at another distribution, this time using the reinterpreted_batch_ndims keyword argument.

In [6]:
rates = [
    [[[1., 1.5, 0.8], [0.3, 0.4, 1.8]]],
    [[[0.2, 0.4, 1.4], [0.4, 1.1, 0.9]]]
]
In [9]:
np.array(rates).shape
Out[9]:
(2, 1, 2, 3)
In [7]:
exp = tfd.Exponential(rate=rates)
exp
Out[7]:
<tfp.distributions.Exponential 'Exponential' batch_shape=[2, 1, 2, 3] event_shape=[] dtype=float32>
In [8]:
ind_exp = tfd.Independent(exp, reinterpreted_batch_ndims=2)
ind_exp
Out[8]:
<tfp.distributions.Independent 'IndependentExponential' batch_shape=[2, 1] event_shape=[2, 3] dtype=float32>

Now we have a rank 2 batch_shape and rank 2 event_shape.

In [10]:
ind_exp.sample([4, 2])
Out[10]:
<tf.Tensor: shape=(4, 2, 2, 1, 2, 3), dtype=float32, numpy=
array([[[[[[8.52314979e-02, 2.20484066e+00, 1.24831609e-01],
           [5.72363853e+00, 3.56653333e+00, 1.73619881e-01]]],


         [[[4.51495945e-01, 5.31205177e+00, 9.25723433e-01],
           [6.85539341e+00, 1.23257709e+00, 8.64088893e-01]]]],



        [[[[1.35191846e+00, 2.72666621e+00, 9.93554652e-01],
           [2.71719074e+00, 4.16195679e+00, 7.69947097e-02]]],


         [[[2.47778010e+00, 1.24962020e+00, 1.23250462e-01],
           [2.38104010e+00, 1.49862719e+00, 5.34713209e-01]]]]],




       [[[[[1.12703860e+00, 1.69036031e-01, 1.75236121e-01],
           [4.30009747e+00, 7.73049667e-02, 9.20982718e-01]]],


         [[[2.07261410e+01, 4.24093342e+00, 9.07527208e-02],
           [4.11562204e+00, 2.51152706e+00, 3.07851344e-01]]]],



        [[[[1.05817151e+00, 5.20415246e-01, 1.24326766e+00],
           [6.05947375e-01, 9.40155125e+00, 3.41252506e-01]]],


         [[[6.36713266e+00, 3.30024123e+00, 3.68020654e-01],
           [5.06028198e-02, 6.59128129e-02, 1.67161095e+00]]]]],




       [[[[[9.82080624e-02, 2.37932968e+00, 1.01604521e+00],
           [5.45345068e+00, 3.30391824e-01, 2.25491732e-01]]],


         [[[5.43883753e+00, 8.75549436e-01, 6.05356740e-03],
           [2.01855755e+00, 9.21276867e-01, 2.28063658e-01]]]],



        [[[[1.91371128e-01, 1.38863182e+00, 2.60386395e+00],
           [1.99791002e+00, 3.85638475e+00, 6.53963163e-02]]],


         [[[2.21251488e+01, 1.21040082e+00, 1.30126461e-01],
           [2.82561398e+00, 9.84178245e-01, 1.09785414e+00]]]]],




       [[[[[1.06780671e-01, 8.60901594e-01, 7.70675957e-01],
           [6.49312556e-01, 7.30921552e-02, 3.12643439e-01]]],


         [[[8.84963870e-01, 1.66497517e+00, 4.34715062e-01],
           [1.08664826e-01, 9.62030329e-03, 1.07902026e+00]]]],



        [[[[1.40111685e-01, 3.47986722e+00, 3.60813737e-01],
           [1.61084032e+00, 9.10717964e-01, 1.13252953e-01]]],


         [[[5.73364735e-01, 2.30145860e+00, 2.11053286e-02],
           [3.76549268e+00, 2.19317222e+00, 2.94480515e+00]]]]]],
      dtype=float32)>

When we draw samples with a rank-2 sample shape, the output is a rank-6 tensor, following the same pattern as a single sample. In detail, it has [4, 2] sample dimensions, [2, 1] batch dimensions, and [2, 3] event dimensions.

In [11]:
ind_exp.log_prob(0.5)
Out[11]:
<tf.Tensor: shape=(2, 1), dtype=float32, numpy=
array([[-4.2501583],
       [-5.3156004]], dtype=float32)>

When we compute the log probability of a single scalar value, the input is broadcast across the distribution and the output has the batch_shape. As a general rule, the log_prob (and prob) method broadcasts its input against the batch and event shape.

In [12]:
ind_exp.log_prob([[0.3, 0.5, 0.8]])
Out[12]:
<tf.Tensor: shape=(2, 1), dtype=float32, numpy=
array([[-4.770158],
       [-5.8856  ]], dtype=float32)>

So what happens if we pass a higher-rank tensor to log_prob?

In [14]:
tf.random.uniform((5, 1, 1, 2, 1)).shape
Out[14]:
TensorShape([5, 1, 1, 2, 1])
In [13]:
ind_exp.log_prob(tf.random.uniform((5, 1, 1, 2, 1)))
Out[13]:
<tf.Tensor: shape=(5, 2, 1), dtype=float32, numpy=
array([[[-2.1549933],
        [-3.6172578]],

       [[-4.4256163],
        [-5.612159 ]],

       [[-6.6976185],
        [-7.158457 ]],

       [[-4.3900895],
        [-5.1580105]],

       [[-2.5316038],
        [-4.051901 ]]], dtype=float32)>
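The output shapes above can be checked with NumPy's broadcasting rules alone. This sketch (a hypothetical helper built on the standard `np.broadcast_shapes`) mirrors what log_prob does: broadcast the input against batch_shape + event_shape, then sum out the event dimensions:

```python
import numpy as np

# Sketch of the log_prob shape rule: the input is broadcast against
# batch_shape + event_shape, then the event dimensions are reduced away.
def log_prob_output_shape(input_shape, batch_shape, event_shape):
    full = np.broadcast_shapes(tuple(input_shape),
                               tuple(batch_shape) + tuple(event_shape))
    return full[:len(full) - len(event_shape)]

print(log_prob_output_shape((), (2, 1), (2, 3)))               # (2, 1), scalar input
print(log_prob_output_shape((5, 1, 1, 2, 1), (2, 1), (2, 3)))  # (5, 2, 1)
```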

Tutorial

In [16]:
# Make Multivariate Distribution
loc = [[0.5, 1], [0.1, 0], [0, 0.2]]
scale_diag = [[2., 3], [1., 3], [4., 4]]
In [17]:
normal_distributions = tfd.MultivariateNormalDiag(loc=loc, scale_diag=scale_diag)
normal_distributions
Out[17]:
<tfp.distributions.MultivariateNormalDiag 'MultivariateNormalDiag' batch_shape=[3] event_shape=[2] dtype=float32>
In [18]:
# Sample
normal_distributions.sample(5)
Out[18]:
<tf.Tensor: shape=(5, 3, 2), dtype=float32, numpy=
array([[[ 0.30364102,  2.3371062 ],
        [ 1.118043  , -0.4802311 ],
        [-5.099995  , -4.291627  ]],

       [[-0.8409153 , -2.1168094 ],
        [ 1.4500505 , -2.828692  ],
        [ 1.3774487 , -0.36408925]],

       [[-1.7672858 , -3.1279678 ],
        [ 2.3426754 , -1.0637026 ],
        [-0.3274685 ,  2.4647486 ]],

       [[-0.42931175, -0.46005905],
        [ 1.943181  , -4.173031  ],
        [ 4.5748634 , -1.8085063 ]],

       [[ 2.2362173 ,  3.2581265 ],
        [-0.18285483, -3.6155434 ],
        [-1.196823  ,  8.819077  ]]], dtype=float32)>
In [19]:
# Multivariate Normal batched distribution
# We are broadcasting batch shapes of `loc` and `scale_diag`
# against each other

loc = [[[0.3, 1.5, 1.], [0.2, 0.4, 2.8]],
       [[2., 2.3, 8], [1.4, 1., 1.3]]]
scale_diag = [0.4, 1., 0.7]
In [20]:
normal_distributions = tfd.MultivariateNormalDiag(loc=loc, scale_diag=scale_diag)
normal_distributions
Out[20]:
<tfp.distributions.MultivariateNormalDiag 'MultivariateNormalDiag' batch_shape=[2, 2] event_shape=[3] dtype=float32>
In [21]:
# Use independent to move part of the batch shape

ind_normal_distribution = tfd.Independent(normal_distributions, 
                                          reinterpreted_batch_ndims=1)
ind_normal_distribution
Out[21]:
<tfp.distributions.Independent 'IndependentMultivariateNormalDiag' batch_shape=[2] event_shape=[2, 3] dtype=float32>
In [22]:
# Draw some samples
samples = ind_normal_distribution.sample(5)
samples.shape
Out[22]:
TensorShape([5, 2, 2, 3])
In [23]:
# [B, E] shaped input
inp = tf.random.uniform((2, 2, 3))
ind_normal_distribution.log_prob(inp)
Out[23]:
<tf.Tensor: shape=(2,), dtype=float32, numpy=array([-11.195248, -76.89377 ], dtype=float32)>
In [24]:
# [E] shaped input (broadcasting over batch size)
inp = tf.random.uniform((2, 3))
ind_normal_distribution.log_prob(inp)
Out[24]:
<tf.Tensor: shape=(2,), dtype=float32, numpy=array([-10.743417, -75.82509 ], dtype=float32)>
In [25]:
# [S, B, E] shaped input
inp = tf.random.uniform((9, 2, 2, 3))
ind_normal_distribution.log_prob(inp)
Out[25]:
<tf.Tensor: shape=(9, 2), dtype=float32, numpy=
array([[ -9.456959, -74.69543 ],
       [-10.222152, -70.511894],
       [ -9.343542, -79.319824],
       [ -9.667246, -75.2012  ],
       [-10.971153, -62.26688 ],
       [ -9.592494, -68.68516 ],
       [ -9.568048, -74.8657  ],
       [-12.416011, -68.85194 ],
       [ -8.812517, -66.78156 ]], dtype=float32)>
In [27]:
# [S, b, e] shaped input, where [b, e] is broadcastable over [B, E]
inp = tf.random.uniform((5, 1, 2, 1))
ind_normal_distribution.log_prob(inp)
WARNING: Nested component "MultivariateNormalDiag_scale_matvec_linear_operator" in composition "MultivariateNormalDiag_chain_of_MultivariateNormalDiag_shift_of_MultivariateNormalDiag_scale_matvec_linear_operator" operates on inputs with increased degrees of freedom. This may result in an incorrect log_det_jacobian.
Out[27]:
<tf.Tensor: shape=(5, 2), dtype=float32, numpy=
array([[-11.335012, -81.39036 ],
       [ -9.282346, -63.680298],
       [ -9.340945, -68.03868 ],
       [-12.372825, -66.20646 ],
       [-10.076466, -80.59617 ]], dtype=float32)>

Naive Bayes example

Let's now use what we have learned and continue building the naive Bayes classifier from the last tutorial.

In [28]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import f1_score
In [29]:
# Making a function get_data which:
#   1) Fetches the 20 newsgroup dataset
#   2) Performs a word count on the articles and binarizes the result
#   3) Returns the data as a numpy matrix with the labels

def get_data(categories):
    
    newsgroups_train_data = fetch_20newsgroups(data_home='./dataset/20_Newsgroup_Data/',
                                               subset='train', categories=categories)
    newsgroups_test_data = fetch_20newsgroups(data_home='./dataset/20_Newsgroup_Data/',
                                              subset='test', categories=categories)

    n_documents = len(newsgroups_train_data['data'])
    count_vectorizer = CountVectorizer(input='content', binary=True, max_df=0.25, min_df=1.01/n_documents)
    
    train_binary_bag_of_words = count_vectorizer.fit_transform(newsgroups_train_data['data'])
    test_binary_bag_of_words = count_vectorizer.transform(newsgroups_test_data['data']) 

    return (train_binary_bag_of_words.todense(), newsgroups_train_data['target']),  (test_binary_bag_of_words.todense(), newsgroups_test_data['target'])
In [30]:
# Defining a function to conduct Laplace smoothing. This adds a base level of probability for a given feature
# to occur in every class.

def laplace_smoothing(labels, binary_data, n_classes):
    # Compute the parameter estimates (adjusted fraction of documents in class that contain word)
    n_words = binary_data.shape[1]
    alpha = 1 # parameters for Laplace smoothing
    theta = np.zeros([n_classes, n_words]) # stores parameter values - prob. word given class
    for c_k in range(n_classes): # 0, 1, ..., 19
        class_mask = (labels == c_k)
        N = class_mask.sum() # number of articles in class
        theta[c_k, :] = (binary_data[class_mask, :].sum(axis=0) + alpha)/(N + alpha*2)

    return theta
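As a toy check of the smoothing formula above (made-up binary data, not from the newsgroups set):

```python
import numpy as np

# Toy check of the Laplace-smoothed estimate:
#   theta = (count + alpha) / (N + 2 * alpha)
# With alpha=1 and N=3 documents, a word that never appears in the
# class still gets probability 1/5 instead of 0.
binary_data = np.array([[1, 0], [1, 0], [0, 0]])  # 3 documents, 2 words
alpha = 1
N = binary_data.shape[0]
theta = (binary_data.sum(axis=0) + alpha) / (N + 2 * alpha)
print(theta)  # [0.6 0.2]
```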
In [31]:
# Getting a subset of the 20 newsgroup dataset

categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']

(train_data, train_labels), (test_data, test_labels) = get_data(categories=categories)
smoothed_counts = laplace_smoothing(labels=train_labels, binary_data=train_data, n_classes=len(categories))

To now make our NB classifier we need to build three functions:

  • Compute the class priors
  • Build our class conditional distributions
  • Put it all together and classify our data
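The three steps can be sketched end-to-end in NumPy with made-up numbers (2 classes, 3 binary features; none of these values come from the dataset):

```python
import numpy as np

theta = np.array([[0.6, 0.2, 0.5],   # P(word present | class), one row per class
                  [0.1, 0.7, 0.4]])
priors = np.array([0.5, 0.5])
x = np.array([1, 0, 1])              # one binarised document

# 1) class-conditional log-likelihoods under independent Bernoullis
cond = np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta), axis=-1)
# 2) joint log-likelihood with the class priors
joint = np.log(priors) + cond
# 3) normalise to get log posterior probabilities
log_post = joint - np.log(np.sum(np.exp(joint)))

print(np.argmax(log_post))  # 0: the document is assigned to the first class
```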
In [33]:
# Function which computes the prior probability of every class based on frequency of occurrence in 
# the dataset

def class_priors(n_classes, labels):
    counts = np.zeros(n_classes)
    for c_k in range(n_classes):
        counts[c_k] = np.sum(np.where(labels==c_k, 1, 0))
    priors = counts / np.sum(counts)
    print('The class priors are {}'.format(priors))
    return priors
In [34]:
# Run the function

priors = class_priors(n_classes=len(categories), labels=train_labels)
The class priors are [0.2359882  0.28711898 0.29154376 0.18534907]
In [35]:
# Now we will write a function that, given the feature occurrence probabilities, returns a Bernoulli distribution with 
# batch_shape=number of classes and event_shape=number of features.

def make_distribution(probs):
    batch_of_bernoulli = tfd.Bernoulli(probs=probs)
    dist = tfd.Independent(batch_of_bernoulli, reinterpreted_batch_ndims=1)
    return dist

tf_dist = make_distribution(smoothed_counts)
tf_dist
Out[35]:
<tfp.distributions.Independent 'IndependentBernoulli' batch_shape=[4] event_shape=[17495] dtype=int32>
In [36]:
# The final function is predict_sample, which, given the distribution, a test sample, and the class priors:
#   1) Computes the class conditional probabilities given the sample
#   2) Forms the joint likelihood
#   3) Normalises the joint likelihood and returns the log prob

def predict_sample(dist, sample, priors):
    cond_probs = dist.log_prob(sample)
    joint_likelihood = tf.add(np.log(priors), cond_probs)
    norm_factor = tf.math.reduce_logsumexp(joint_likelihood, axis=-1, keepdims=True)
    log_prob = joint_likelihood - norm_factor
    
    return log_prob
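tf.math.reduce_logsumexp computes log(sum(exp(x))) stably by shifting by the maximum first; a plain NumPy equivalent (a sketch assuming a single reduction axis) looks like this:

```python
import numpy as np

# Stable log-sum-exp: shift by the max so np.exp never overflows.
def logsumexp(x, axis=-1):
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True)),
                      axis=axis)

# Naively, exp(1000) overflows; the shifted version is fine.
print(logsumexp(np.array([1000.0, 1000.0])))  # 1000 + log(2) ≈ 1000.6931
```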

Computing log_probs

In [37]:
# Predicting one example from our test data

log_probs = predict_sample(tf_dist, test_data[0], priors)
log_probs
Out[37]:
<tf.Tensor: shape=(4,), dtype=float32, numpy=
array([-6.1736343e+01, -1.5258789e-05, -1.1620026e+01, -6.3327866e+01],
      dtype=float32)>
In [38]:
# Loop over our test data and classify

probabilities = []

for sample, label in zip(test_data, test_labels):
    probabilities.append(tf.exp(predict_sample(tf_dist, sample, priors)))

probabilities = np.asarray(probabilities)
predict_class = np.argmax(probabilities, axis=-1)

print('F1 score: ', f1_score(test_labels, predict_class, average='macro'))
F1 score:  0.7848499112849504
In [39]:
# Make a Bernoulli Naive Bayes classifier using sklearn with the same level of alpha smoothing

clf = BernoulliNB(alpha=1)
clf.fit(train_data, train_labels)
pred = clf.predict(test_data)
print('F1 from sklearn: ', f1_score(test_labels, pred, average='macro'))
F1 from sklearn:  0.7848499112849504