In this post, it will introduce you to numpy's broadcasting rules and show how you can use broadcasting when specifying batches of distributions in TensorFlow, as well as with the
prob
andlog_prob
methods. This is the summary of lecture "Probabilistic Deep Learning with Tensorflow 2" from Imperial College London.
import tensorflow as tf
import tensorflow_probability as tfp
import numpy as np
tfd = tfp.distributions
print("Tensorflow Version: ", tf.__version__)
print("Tensorflow Probability Version: ", tfp.__version__)
Tensorflow Version: 2.5.0 Tensorflow Probability Version: 0.13.0
Numpy operations can be applied to arrays that are not of the same shape, but only if the shapes satisfy certain conditions.
As a demonstration of this, let us add together two arrays of different shapes:
# Add two arrays with different shapes
a = np.array([[1.],
[2.],
[3.],
[4.]]) # shape (4, 1)
b = np.array([0., 1., 2.]) # shape (3,)
a.shape, b.shape
((4, 1), (3,))
a + b
array([[1., 2., 3.], [2., 3., 4.], [3., 4., 5.], [4., 5., 6.]])
(a + b).shape
(4, 3)
This is the addition
[ [1.], + [0., 1., 2.]
[2.],
[3.],
[4.] ]
To execute it, numpy:
Aligned the shapes of a
and b
on the last axis and prepended 1s to the shape with fewer axes:
a: 4 x 1 ---> a: 4 x 1
b: 3 ---> b: 1 x 3
Checked that the sizes of the axes matched or were equal to 1:
a: 4 x 1
b: 1 x 3
a
and b
satisfied this criterion.
a
was replicated 3 times in the second axis, while b
was replicated 4 times in the first axis.
This meant that the addition in the final step was
[ [1., 1., 1.], + [ [0., 1., 2.],
[2., 2., 2.], [0., 1., 2.],
[3., 3., 3.], [0., 1., 2.],
[4., 4., 4.] ] [0., 1., 2.] ]
Addition was then carried out element-by-element, as you can verify by referring back to the output of the code cell above.
This resulted in an output with shape 4 x 3.
Broadcasting rules describe how values should be transmitted when the inputs to an operation do not match.
In numpy, the broadcasting rule is very simple:
Prepend 1s to the smaller shape,
check that the axes of both arrays have sizes that are equal or 1,
then stretch the arrays in their size-1 axes.
A crucial aspect of this rule is that it does not require the input arrays have the same number of axes.
Another consequence of it is that a broadcasting output will have the largest size of its inputs in each axis.
Take the following multiplication as an example:
a: 3 x 7 x 1
b: 1 x 5
a * b: 3 x 7 x 5
You can see that the output shape is the maximum of the sizes in each axis.
Numpy's broadcasting rule also does not require that one of the arrays has to be bigger in all axes.
This is seen in the following example, where a
is smaller than b
in its third axis but is bigger in its second axis.
# Multiply two arrays with different shapes
a = np.array([[[0.01], [0.1]],
[[1.00], [10.]]]) # shape (2, 2, 1)
b = np.array([[[2., 2.]],
[[3., 3.]]]) # shape (2, 1, 2)
a.shape, b.shape
((2, 2, 1), (2, 1, 2))
a * b # shape (2, 2, 2)
array([[[2.e-02, 2.e-02], [2.e-01, 2.e-01]], [[3.e+00, 3.e+00], [3.e+01, 3.e+01]]])
print((a * b).shape)
(2, 2, 2)
Broadcasting behaviour also points to an efficient way to compute an outer product in numpy:
# Use broadcasting to compute an outer product
a = np.array([-1., 0., 1.])
b = np.array([0., 1., 2., 3.])
a.shape, b.shape
((3,), (4,))
a[:, np.newaxis] * b # outer product ab^T, where a and b are column vectors
array([[-0., -1., -2., -3.], [ 0., 0., 0., 0.], [ 0., 1., 2., 3.]])
(a[:, np.newaxis] * b).shape
(3, 4)
a[:, np.newaxis].shape
(3, 1)
The idea of numpy stretching the arrays in their size-1 axes is useful and is functionally correct. But this is not what numpy literally does behind the scenes, since that would be an inefficient use of memory. Instead, numpy carries out the operation by looping over singleton (size-1) dimensions.
To give you some practice with broadcasting, try predicting the output shapes for the following operations:
# Define three arrays with different shapes
a = [[1.], [2.], [3.]]
b = np.zeros(shape=[10, 1, 1])
c = np.ones(shape=[4])
b.shape, c.shape
((10, 1, 1), (4,))
Actually, a
is 2D list, not numpy array. But numpy addition can automatically convert list to numpy array.
# Predict the shape before executing this cell
(a + b).shape
(10, 3, 1)
# Predict the shape before executing this cell
(a * c).shape
(3, 4)
# Predict the shape before executing this cell
(a * b + c).shape
(10, 3, 4)
The broadcasting rule for TensorFlow is the same as that for numpy. For example, TensorFlow also allows you to specify the parameters of Distribution objects using broadcasting.
What is meant by this can be understood through an example with the univariate normal distribution. Say that we wish to specify a parameter grid for six Gaussians. The parameter combinations to be used, (loc, scale)
, are:
(0, 1)
(0, 10)
(0, 100)
(1, 1)
(1, 10)
(1, 100)
A laborious way of doing this is to explicitly pass each parameter to tfd.Normal
:
# Define a batch of Normal distributions without broadcasting
batch_of_normals = tfd.Normal(loc=[0., 0., 0., 1., 1., 1.,], scale=[1., 10., 100., 1., 10., 100.])
batch_of_normals
<tfp.distributions.Normal 'Normal' batch_shape=[6] event_shape=[] dtype=float32>
# Check the parameter values for loc (mean)
batch_of_normals.loc
<tf.Tensor: shape=(6,), dtype=float32, numpy=array([0., 0., 0., 1., 1., 1.], dtype=float32)>
# Check the parameter values for scale (std)
batch_of_normals.scale
<tf.Tensor: shape=(6,), dtype=float32, numpy=array([ 1., 10., 100., 1., 10., 100.], dtype=float32)>
A more succinct way to create a batch of distributions for this parameter grid is to use broadcasting.
Consider what would happen if we were to broadcast these arrays according the rule discussed earlier:
loc = [ [0.],
[1.] ]
scale = [1., 10., 100.]
The shapes would be stretched according to
loc: 2 x 1 ---> 2 x 3
scale: 1 x 3 ---> 2 x 3
resulting in
loc = [ [0., 0., 0.],
[1., 1., 1.] ]
scale = [ [1., 10., 100.],
[1., 10., 100.] ]
which are compatible with the loc
and scale
arguments of tfd.Normal
.
Sure enough, this is precisely what TensorFlow does:
# Define a batch of Normal distribution with broadcasting
loc = [[0.], [1.]] # (2, 1)
scale = [1., 10., 100.] # (3, )
another_batch_of_normals = tfd.Normal(loc=loc, scale=scale)
another_batch_of_normals
<tfp.distributions.Normal 'Normal' batch_shape=[2, 3] event_shape=[] dtype=float32>
# The stored loc parameter values are what you pass in, not what is used after broadcasting
another_batch_of_normals.loc
<tf.Tensor: shape=(2, 1), dtype=float32, numpy= array([[0.], [1.]], dtype=float32)>
# The stored scale parameter values are what you pass in, not what is used after broadcasting
another_batch_of_normals.scale
<tf.Tensor: shape=(3,), dtype=float32, numpy=array([ 1., 10., 100.], dtype=float32)>
In summary, TensorFlow broadcasts parameter arrays: it stretches them according to the broadcasting rule, then creates a distribution on an element-by-element basis.
prob
and log_prob
methods¶When using prob
and log_prob
with broadcasting, we follow the same principles as before. Let's make a new batch of normals as before but with means which are centered at different locations to help distinguish the results we get.
# Define a batch of Normal distributions with broadcasting
loc = [[0.], [10.]]
scale = [1., 1., 1.]
another_batch_of_normals = tfd.Normal(loc=loc, scale=scale)
another_batch_of_normals
<tfp.distributions.Normal 'Normal' batch_shape=[2, 3] event_shape=[] dtype=float32>
We can feed in samples of any shape as long as it can be broadcast agasint our batch shape for this example.
# Use broadcasting along the second axis with the prob method
sample = tf.random.uniform((2, 1))
sample
<tf.Tensor: shape=(2, 1), dtype=float32, numpy= array([[0.30983818], [0.7367121 ]], dtype=float32)>
another_batch_of_normals.prob(sample)
<tf.Tensor: shape=(2, 3), dtype=float32, numpy= array([[3.8024542e-01, 3.8024542e-01, 3.8024542e-01], [9.2860834e-20, 9.2860834e-20, 9.2860834e-20]], dtype=float32)>
Or broadcasting along the first axis instead:
# Use broadcasting along the first axis with the prob method
sample = tf.random.uniform((1, 3))
sample
<tf.Tensor: shape=(1, 3), dtype=float32, numpy=array([[0.48120987, 0.05952108, 0.13428748]], dtype=float32)>
another_batch_of_normals.prob(sample)
<tf.Tensor: shape=(2, 3), dtype=float32, numpy= array([[3.5532588e-01, 3.9823624e-01, 3.9536136e-01], [8.4288889e-21, 1.3928772e-22, 2.9206223e-22]], dtype=float32)>
Or even both axes:
# Use broadcasting along both axes with the prob method
sample = tf.random.uniform((1, 1))
sample
<tf.Tensor: shape=(1, 1), dtype=float32, numpy=array([[0.61676073]], dtype=float32)>
another_batch_of_normals.prob(sample)
<tf.Tensor: shape=(2, 3), dtype=float32, numpy= array([[3.2984403e-01, 3.2984403e-01, 3.2984403e-01], [3.0348733e-20, 3.0348733e-20, 3.0348733e-20]], dtype=float32)>
log_prob
works in the exact same way with broadcasting. We can replace prob
with log_prob
in any of the previous examples:
sample = tf.random.uniform((1, 3))
another_batch_of_normals.log_prob(sample)
<tf.Tensor: shape=(2, 3), dtype=float32, numpy= array([[ -1.2130027 , -0.96663547, -1.0695213 ], [-43.54405 , -47.878044 , -45.58167 ]], dtype=float32)>
Broadcasting behaviour for multivariate distributions is only a little more sophisticated than it is for univariate distributions.
Recall that MultivariateNormalDiag
has two parameter arguments: loc
and scale_diag
. When specifying a single distribution, these arguments are vectors of the same length:
# Define a multivariate normal distribution without broadcasting
single_mvt_normal = tfd.MultivariateNormalDiag(loc=[0., 0.], scale_diag=[1., 0.5])
single_mvt_normal
<tfp.distributions.MultivariateNormalDiag 'MultivariateNormalDiag' batch_shape=[] event_shape=[2] dtype=float32>
# Print loc parameter
single_mvt_normal.loc
<tf.Tensor: shape=(2,), dtype=float32, numpy=array([0., 0.], dtype=float32)>
# Print Covariance matrix
single_mvt_normal.covariance()
<tf.Tensor: shape=(2, 2), dtype=float32, numpy= array([[1. , 0. ], [0. , 0.25]], dtype=float32)>
Covariance Matrix is the diagonal matrix with scale_diag^2
The size of the final axis of the inputs determines the event shape for each distribution in the batch. This means that if we pass
loc = [ [0., 0.],
[1., 1.] ]
scale_diag = [1., 0.5]
such that
loc: 2 x 2
scale_diag: 1 x 2
^ final dimension is interpreted as event dimension
^ other dimensions are interpreted as batch dimensions
then a batch of two bivariate normal distributions will be created.
# Define a multivariate normal distribution with broadcasting
loc = [[0., 0.],
[1., 1.]]
scale_diag = [1., 0.5]
batch_of_mvt_normals = tfd.MultivariateNormalDiag(loc=loc, scale_diag=scale_diag)
batch_of_mvt_normals
<tfp.distributions.MultivariateNormalDiag 'MultivariateNormalDiag' batch_shape=[2] event_shape=[2] dtype=float32>
# Print the distribution paramters
# There is a batch of two distributions with different means and same covariance
batch_of_mvt_normals.parameters
{'loc': ListWrapper([ListWrapper([0.0, 0.0]), ListWrapper([1.0, 1.0])]), 'scale_diag': ListWrapper([1.0, 0.5]), 'scale_identity_multiplier': None, 'validate_args': False, 'allow_nan_stats': True, 'experimental_use_kahan_sum': False, 'name': 'MultivariateNormalDiag'}
Knowing that, for multivariate distributions, TensorFlow
interprets the final axis of an array of parameters as the event shape,
and broadcasts over the remaining axes,
can you predict what the batch and event shapes will if we pass the arguments
loc = [ [ 1., 1., 1.],
[-1., -1., -1.] ] # shape (2, 3)
scale_diag = [ [[0.1, 0.1, 0.1]],
[[10., 10., 10.]] ] # shape (2, 1, 3)
to MultivariateNormalDiag
?
Solution:
Align the parameter array shapes on their last axis, prepending 1s where necessary:
loc: 1 x 2 x 3
scale_diag: 2 x 1 x 3
The final axis has size 3, so event_shape = (3)
. The remaining axes are broadcast over to yield
loc: 2 x 2 x 3
scale_diag: 2 x 2 x 3
so batch_shape = (2, 2)
.
Let's see if this is correct!
# Define a multivariate Gaussian distribution with broadcasting
loc = [ [ 1., 1., 1.],
[-1., -1., -1.] ] # shape (2, 3)
scale_diag = [ [[0.1, 0.1, 0.1]],
[[10., 10., 10.]] ] # shape (2, 1, 3)
another_batch_of_mvt_normals = tfd.MultivariateNormalDiag(loc=loc, scale_diag=scale_diag)
another_batch_of_mvt_normals
<tfp.distributions.MultivariateNormalDiag 'MultivariateNormalDiag' batch_shape=[2, 2] event_shape=[3] dtype=float32>
# Print the distribution parameters
another_batch_of_mvt_normals.parameters
{'loc': ListWrapper([ListWrapper([1.0, 1.0, 1.0]), ListWrapper([-1.0, -1.0, -1.0])]), 'scale_diag': ListWrapper([ListWrapper([ListWrapper([0.1, 0.1, 0.1])]), ListWrapper([ListWrapper([10.0, 10.0, 10.0])])]), 'scale_identity_multiplier': None, 'validate_args': False, 'allow_nan_stats': True, 'experimental_use_kahan_sum': False, 'name': 'MultivariateNormalDiag'}
As we did before lets also look at broadcasting when we have batches of multivariate distributions.
# Define a batch of Normal distributions with broadcasting
loc = [[0.],
[1.],
[0.]]
scale = [1., 10., 100., 1., 10, 100.]
another_batch_of_normals = tfd.Normal(loc=loc, scale=scale)
another_batch_of_normals
<tfp.distributions.Normal 'Normal' batch_shape=[3, 6] event_shape=[] dtype=float32>
And to refresh our memory of Independent
we'll use it below to roll the rightmost batch shape into the event shape.
# Create a multivariate Independent distribution
another_batch_of_mvt_normals = tfd.Independent(another_batch_of_normals)
another_batch_of_mvt_normals
<tfp.distributions.Independent 'IndependentNormal' batch_shape=[3] event_shape=[6] dtype=float32>
Now, onto the broadcasting:
# Use broadcasting with the prob method
# Batch_size shaped input (broadcast over event)
sample = tf.random.uniform((3, 1))
another_batch_of_mvt_normals.prob(sample)
<tf.Tensor: shape=(3,), dtype=float32, numpy=array([3.6341732e-09, 3.8412056e-09, 2.6659681e-09], dtype=float32)>
# Use broadcasting with the prob method
# Event_shape shaped input (broadcast over batch)
sample = tf.random.uniform((1, 6))
another_batch_of_mvt_normals.prob(sample)
<tf.Tensor: shape=(3,), dtype=float32, numpy=array([2.4391147e-09, 2.7616289e-09, 2.4391147e-09], dtype=float32)>
# Use broadcasting with the prob method
# [Samples,Batch_size,Events] shaped input (broadcast over samples)
sample = tf.random.uniform((2, 3, 6))
another_batch_of_mvt_normals.prob(sample)
<tf.Tensor: shape=(2, 3), dtype=float32, numpy= array([[2.5010654e-09, 3.3668932e-09, 3.8643830e-09], [3.0422742e-09, 2.6408655e-09, 3.9074304e-09]], dtype=float32)>
# [S,b,e] shaped input where [b,e] can be broadcast agaisnt [B,E]
sample = tf.random.uniform((2, 1, 6))
another_batch_of_mvt_normals.prob(sample)
<tf.Tensor: shape=(2, 3), dtype=float32, numpy= array([[3.2072505e-09, 2.8448774e-09, 3.2072505e-09], [3.9072363e-09, 1.9765971e-09, 3.9072363e-09]], dtype=float32)>
As a final example with log_prob
instead of prob
# Use broadcasting with the log_prob method
# [S,b,e] shaped input where [b,e] can be broadcast agaisnt [B,E]
sample = tf.random.uniform((2, 3, 1))
another_batch_of_mvt_normals.prob(sample)
<tf.Tensor: shape=(2, 3), dtype=float32, numpy= array([[3.5190979e-09, 2.3076914e-09, 3.0966967e-09], [1.6677665e-09, 2.7270144e-09, 3.7685051e-09]], dtype=float32)>
You should now feel confident specifying batches of distributions using broadcasting. As you may have already guessed, broadcasting is especially useful when specifying grids of hyperparameters.
If you don't feel entirely comfortable with broadcasting quite yet, don't worry: re-read this notebook, go through the further reading provided below, and experiment with broadcasting in both numpy and TensorFlow, and you'll be broadcasting in no time.