In this post, we will develop a Naive Bayes classifier for the Iris dataset using TensorFlow Probability. This is the programming assignment for the course "Probabilistic Deep Learning with TensorFlow 2" from Imperial College London.
import tensorflow as tf
import tensorflow_probability as tfp
from sklearn.metrics import accuracy_score
from sklearn import datasets, model_selection
import numpy as np
import matplotlib.pyplot as plt
tfd = tfp.distributions
plt.rcParams['figure.figsize'] = (10, 6)
print("Tensorflow Version: ", tf.__version__)
print("Tensorflow Probability Version: ", tfp.__version__)
Tensorflow Version:  2.5.0
Tensorflow Probability Version:  0.13.0
You will use the Iris dataset. It consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. For a reference, see the following papers:
Your goal is to construct a Naive Bayes classifier model that predicts the correct class from the sepal length and sepal width features. Under certain assumptions about this classifier model, you will explore the relation to logistic regression.
We will first read in the Iris dataset, and split the dataset into training and test sets.
# Load the dataset
iris = datasets.load_iris()
# Use only the first two features: sepal length and width
data = iris.data[:, :2]
targets = iris.target
# Randomly shuffle the data and make train and test splits
x_train, x_test, y_train, y_test = model_selection.train_test_split(data, targets, test_size=0.2)
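The split above is unseeded, so the exact train and test sets (and any accuracy reported later) change from run to run. As a minimal sketch of one way to make the split repeatable and class-balanced, the NumPy-only helper below shuffles each class separately with a fixed seed. The function name `stratified_split`, its parameters and the toy labels are hypothetical and are not part of the assignment.

```python
import numpy as np

def stratified_split(targets, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), splitting every class in the same proportion."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for c in np.unique(targets):
        # Shuffle the indices of class c, then cut off the test portion
        idx = rng.permutation(np.flatnonzero(targets == c))
        n_test = int(round(test_fraction * len(idx)))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)

# Toy labels standing in for iris.target: 50 examples of each of 3 classes
toy_targets = np.repeat([0, 1, 2], 50)
train_idx, test_idx = stratified_split(toy_targets)
print(len(train_idx), len(test_idx))
```

The returned index arrays can be used to slice both the data and the targets, for example `data[train_idx]` and `targets[train_idx]`.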
# Plot the training data
labels = {0: 'Iris-Setosa', 1: 'Iris-Versicolour', 2: 'Iris-Virginica'}
label_colours = ['blue', 'orange', 'green']
def plot_data(x, y, labels, colours):
    for c in np.unique(y):
        inx = np.where(y == c)
        plt.scatter(x[inx, 0], x[inx, 1], label=labels[c], c=colours[c])
    plt.title("Training set")
    plt.xlabel("Sepal length (cm)")
    plt.ylabel("Sepal width (cm)")
    plt.legend()
plt.figure(figsize=(8, 5))
plot_data(x_train, y_train, labels, label_colours)
plt.show()
We will briefly review the Naive Bayes classifier model. The fundamental equation for this classifier is Bayes' rule:
$$ P(Y=y_k | X_1,\ldots,X_d) = \frac{P(X_1,\ldots,X_d | Y=y_k)P(Y=y_k)}{\sum_{k=1}^K P(X_1,\ldots,X_d | Y=y_k)P(Y=y_k)} $$
In the above, $d$ is the number of features or dimensions in the inputs $X$ (in our case $d=2$), and $K$ is the number of classes (in our case $K=3$). The distribution $P(Y)$ is the class prior distribution, which is a discrete distribution over $K$ classes. The distribution $P(X | Y)$ is the class-conditional distribution over inputs.
The Naive Bayes classifier makes the assumption that the data features $X_i$ are conditionally independent given the class $Y$ (the 'naive' assumption). In this case, the class-conditional distribution decomposes as
$$ \begin{aligned} P(X | Y=y_k) &= P(X_1,\ldots,X_d | Y=y_k)\\ &= \prod_{i=1}^d P(X_i | Y=y_k) \end{aligned} $$
This simplifying assumption means that we typically need to estimate far fewer parameters: one small set for each of the univariate distributions $P(X_i | Y=y_k)$, instead of the parameters of the full joint distribution $P(X | Y=y_k)$.
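To make the saving concrete, here is a small counting sketch. It assumes Gaussian class-conditionals (the choice this post makes further down) and compares the naive model with a full-covariance Gaussian per class; the helper names are hypothetical.

```python
def n_params_naive(d, K):
    # One mean and one standard deviation per feature and per class
    return K * 2 * d

def n_params_full(d, K):
    # One mean vector (d values) and one symmetric covariance matrix
    # (d * (d + 1) / 2 free values) per class
    return K * (d + d * (d + 1) // 2)

for d in (2, 4, 100):
    print(d, n_params_naive(d, K=3), n_params_full(d, K=3))
```

For the two features used here the difference is small (12 versus 15 parameters), but the naive count grows linearly in $d$ while the full-covariance count grows quadratically.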
Once the class prior distribution and class-conditional densities are estimated, the Naive Bayes classifier model can then make a class prediction $\hat{Y}$ for a new data input $\tilde{X} := (\tilde{X}_1,\ldots,\tilde{X}_d)$ according to
$$ \begin{aligned} \hat{Y} &= \text{argmax}_{y_k} P(Y=y_k | \tilde{X}_1,\ldots,\tilde{X}_d) \\ &= \text{argmax}_{y_k}\frac{P(\tilde{X}_1,\ldots,\tilde{X}_d | Y=y_k)P(Y=y_k)}{\sum_{k=1}^K P(\tilde{X}_1,\ldots,\tilde{X}_d | Y=y_k)P(Y=y_k)}\\ &= \text{argmax}_{y_k} P(\tilde{X}_1,\ldots,\tilde{X}_d | Y=y_k)P(Y=y_k) \end{aligned} $$
The last step holds because the denominator is the same for every class, so it does not affect the argmax.
We will begin by defining the class prior distribution. To do this we will simply take the maximum likelihood estimate, given by
$$ P(Y=y_k) = \frac{\sum_{n=1}^N \delta(Y^{(n)}=y_k)}{N}, $$
where the superscript $(n)$ indicates the $n$-th dataset example, $\delta(Y^{(n)}=y_k) = 1$ if $Y^{(n)}=y_k$ and 0 otherwise, and $N$ is the total number of examples in the dataset. The above is simply the proportion of data examples belonging to class $k$.
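As a quick illustration of this estimate, the sketch below counts labels with NumPy on a made-up label vector. The helper name `class_proportions` and the toy data are assumptions for the example only.

```python
import numpy as np

def class_proportions(y, num_classes):
    # Count how often each label 0..num_classes-1 occurs, then normalise
    counts = np.bincount(y, minlength=num_classes)
    return counts / counts.sum()

y_toy = np.array([0, 0, 1, 2, 2, 2, 1, 0])
probs = class_proportions(y_toy, num_classes=3)
print(probs)
```

Using `minlength` keeps a zero entry for any class that happens to be absent from the sample, so the result always has one probability per class.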
You should now write a function that builds the prior distribution from the training data, and returns it as a Categorical Distribution object. The input y will be a numpy array of shape (num_samples,), whose entries will be integer labels $k=0, 1,\ldots, K-1$. The function should return a Categorical distribution object.
def get_prior(y):
    """
    This function takes training labels as a numpy array y of shape (num_samples,) as an input.
    This function should build a Categorical Distribution object with empty batch shape
    and event shape, with the probability of each class given as above.
    Your function should return the Distribution object.
    """
    # Proportion of training examples in each class (assumes every class occurs in y)
    probs = np.unique(y, return_counts=True)[1] / len(y)
    distribution = tfd.Categorical(probs=probs)
    return distribution
# Run your function to get the prior
prior = get_prior(y_train)
# Plot the prior distribution
labels = ['Iris-Setosa', 'Iris-Versicolour', 'Iris-Virginica']
plt.bar([0, 1, 2], prior.probs.numpy(), color=label_colours)
plt.xlabel("Class")
plt.ylabel("Prior probability")
plt.title("Class prior distribution")
plt.xticks([0, 1, 2], labels)
plt.show()
We now turn to the definition of the class-conditional distributions $P(X_i | Y=y_k)$ for $i=0, 1$ and $k=0, 1, 2$. In our model, we will assume these distributions to be univariate Gaussian:
$$ \begin{aligned} P(X_i | Y=y_k) &= N(X_i | \mu_{ik}, \sigma_{ik})\\ &= \frac{1}{\sqrt{2\pi\sigma_{ik}^2}} \exp\left\{-\frac{1}{2} \left(\frac{X_i - \mu_{ik}}{\sigma_{ik}}\right)^2\right\} \end{aligned} $$
with mean parameters $\mu_{ik}$ and standard deviation parameters $\sigma_{ik}$, twelve parameters in all. We will again estimate these parameters using maximum likelihood. In this case, the estimates are given by
$$ \begin{aligned} \hat{\mu}_{ik} &= \frac{\sum_n X_i^{(n)} \delta(Y^{(n)}=y_k)}{\sum_n \delta(Y^{(n)}=y_k)} \\ \hat{\sigma}^2_{ik} &= \frac{\sum_n (X_i^{(n)} - \hat{\mu}_{ik})^2 \delta(Y^{(n)}=y_k)}{\sum_n \delta(Y^{(n)}=y_k)} \end{aligned} $$
Note that the above are just the means and variances of the sample data points for each class.
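One detail worth checking is the divisor: the maximum likelihood variance divides by the class count, not by the count minus one. The sketch below evaluates the two formulas directly on made-up data and compares them with `np.mean` and `np.std` (whose default `ddof=0` is the maximum likelihood form). The toy arrays are assumptions for illustration.

```python
import numpy as np

x_toy = np.array([[1.0, 2.0], [3.0, 4.0], [10.0, 0.0], [14.0, 2.0]])
y_toy = np.array([0, 0, 1, 1])

k, i = 1, 0                          # class index and feature index to inspect
mask = (y_toy == k).astype(float)    # delta(Y^(n) = y_k) for every example

# The two estimates written exactly as in the formulas above
mu_hat = np.sum(x_toy[:, i] * mask) / np.sum(mask)
var_hat = np.sum((x_toy[:, i] - mu_hat) ** 2 * mask) / np.sum(mask)

# The same quantities via NumPy's built-ins (ddof=0 divides by the class count)
mu_np = x_toy[y_toy == k, i].mean()
std_np = x_toy[y_toy == k, i].std(ddof=0)

print(mu_hat, np.sqrt(var_hat), mu_np, std_np)
```

Passing `ddof=1` instead would give the unbiased sample estimate, which is slightly larger and is not the maximum likelihood solution.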
You should now write a function that computes the class-conditional Gaussian densities, using the maximum likelihood parameter estimates given above, and returns them in a single, batched MultivariateNormalDiag Distribution object. The inputs to the function are a numpy array x of shape (num_samples, num_features) for the data inputs, and a numpy array y of shape (num_samples,) for the target labels.
def get_class_conditionals(x, y):
    """
    This function takes training data samples x and labels y as inputs.
    This function should build the class-conditional Gaussian distributions above.
    It should construct a batch of distributions, one per class, using the
    parameter estimates above for the means and standard deviations.
    The resulting distribution should have batch shape (num_classes,) and
    event shape (num_features,).
    Your function should then return the Distribution object.
    """
    classes = np.unique(y)
    n_features = x.shape[-1]
    loc = np.zeros((len(classes), n_features))
    scale_diag = np.zeros((len(classes), n_features))
    for k, c_k in enumerate(classes):
        # Maximum likelihood estimates: per-class mean and standard deviation (ddof=0)
        loc[k] = np.mean(x[y == c_k], axis=0)
        scale_diag[k] = np.std(x[y == c_k], axis=0)
    distribution = tfd.MultivariateNormalDiag(loc=loc, scale_diag=scale_diag)
    return distribution
# Run your function to get the class-conditional distributions
class_conditionals = get_class_conditionals(x_train, y_train)
class_conditionals
<tfp.distributions.MultivariateNormalDiag 'MultivariateNormalDiag' batch_shape=[3] event_shape=[2] dtype=float64>
We can visualise the class-conditional densities with contour plots by running the cell below. Notice how the contours of each distribution correspond to a Gaussian distribution with diagonal covariance matrix, since the model assumes that each feature is independent given the class.
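The link between feature independence and a diagonal covariance can be checked numerically. The following sketch compares the sum of per-feature univariate Gaussian log-densities with the log-density of a multivariate Gaussian whose covariance is diagonal; the parameter values are made up for illustration and the helper names are hypothetical.

```python
import numpy as np

def univariate_log_pdf(x, mu, sigma):
    # Element-wise log N(x | mu, sigma) for each feature
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((x - mu) / sigma) ** 2

def diag_mvn_log_pdf(x, mu, sigma):
    # General multivariate normal log-density with covariance diag(sigma**2)
    cov = np.diag(sigma**2)
    diff = x - mu
    d = len(mu)
    quad = diff @ np.linalg.inv(cov) @ diff
    return -0.5 * (d * np.log(2 * np.pi) + np.log(np.linalg.det(cov)) + quad)

mu = np.array([5.0, 3.4])       # illustrative class mean
sigma = np.array([0.35, 0.38])  # illustrative per-feature standard deviations
x = np.array([5.1, 3.5])

sum_of_features = univariate_log_pdf(x, mu, sigma).sum()
joint = diag_mvn_log_pdf(x, mu, sigma)
print(sum_of_features, joint)
```

The two numbers agree, which is why a single MultivariateNormalDiag per class can stand in for the product of univariate class-conditionals, and why the contours are axis-aligned ellipses.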
# Plot the training data with the class-conditional density contours
def get_meshgrid(x0_range, x1_range, num_points=100):
    x0 = np.linspace(x0_range[0], x0_range[1], num_points)
    x1 = np.linspace(x1_range[0], x1_range[1], num_points)
    return np.meshgrid(x0, x1)
def contour_plot(x0_range, x1_range, prob_fn, batch_shape, colours, levels=None, num_points=100):
    X0, X1 = get_meshgrid(x0_range, x1_range, num_points=num_points)
    Z = prob_fn(np.expand_dims(np.array([X0.ravel(), X1.ravel()]).T, 1))
    Z = np.array(Z).T.reshape(batch_shape, *X0.shape)
    for batch in np.arange(batch_shape):
        if levels:
            plt.contourf(X0, X1, Z[batch], alpha=0.2, colors=colours, levels=levels)
        else:
            plt.contour(X0, X1, Z[batch], colors=colours[batch], alpha=0.3)
plt.figure(figsize=(10, 6))
plot_data(x_train, y_train, labels, label_colours)
x0_min, x0_max = x_train[:, 0].min(), x_train[:, 0].max()
x1_min, x1_max = x_train[:, 1].min(), x_train[:, 1].max()
contour_plot((x0_min, x0_max), (x1_min, x1_max), class_conditionals.prob, 3, label_colours)
plt.title("Training set with class-conditional density contours")
plt.show()
Now the prior and class-conditional distributions are defined, you can use them to compute the model's class probability predictions for an unknown test input $\tilde{X} = (\tilde{X}_1,\ldots,\tilde{X}_d)$, according to
$$ P(Y=y_k | \tilde{X}_1,\ldots,\tilde{X}_d) = \frac{P(\tilde{X}_1,\ldots,\tilde{X}_d | Y=y_k)P(Y=y_k)}{\sum_{k=1}^K P(\tilde{X}_1,\ldots,\tilde{X}_d | Y=y_k)P(Y=y_k)} $$
The class prediction can then be taken as the class with the maximum probability:
$$ \hat{Y} = \text{argmax}_{y_k} P(Y=y_k | \tilde{X}_1,\ldots,\tilde{X}_d) $$
You should now write a function that computes the model's class probabilities for a given batch of test inputs of shape (batch_shape, 2), where the batch_shape has rank at least one, and returns the class predictions. The inputs to the function are the prior and class_conditionals distributions, and the inputs x. The function should return the class predictions in a numpy array of shape (batch_shape).
def predict_class(prior, class_conditionals, x):
    """
    This function takes the prior distribution, class-conditional distribution, and
    a batch of inputs in a numpy array of shape (batch_shape, 2).
    This function should compute the class probabilities for each input in the batch, using
    the prior and class-conditional distributions, according to the above equation.
    Note that the batch_shape of x could have rank higher than one!
    Your function should then return the class predictions by taking the class with the
    maximum probability in a numpy array of shape (batch_shape,).
    """
    # Insert a class axis before the feature axis so that log_prob broadcasts
    # against the batch of classes: the result has shape (batch_shape, num_classes)
    log_likelihood = class_conditionals.log_prob(x[..., np.newaxis, :])
    # Log of the numerator in Bayes' rule: log P(x | y_k) + log P(y_k)
    log_joint = tf.cast(log_likelihood, tf.float64) + tf.math.log(tf.cast(prior.probs, tf.float64))
    # Log of the denominator, computed stably with log-sum-exp over the class axis
    log_norm = tf.math.reduce_logsumexp(log_joint, axis=-1, keepdims=True)
    log_posterior = log_joint - log_norm
    # The logarithm is monotonic, so the argmax can be taken in log space
    return tf.argmax(log_posterior, axis=-1).numpy()
# Get the class predictions
predictions = predict_class(prior, class_conditionals, x_test)
# Evaluate the model accuracy on the test set
accuracy = accuracy_score(y_test, predictions)
print("Test accuracy: {:.4f}".format(accuracy))
Test accuracy: 0.7667
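The shape handling inside predict_class is easier to follow in plain NumPy. The sketch below re-implements the same computation (log-likelihood, log-prior, log-sum-exp normalisation) for inputs whose batch shape has rank two. All parameter values are made up, and `log_posterior` is a hypothetical helper, not part of the assignment.

```python
import numpy as np

def log_posterior(x, means, stds, prior_probs):
    # x: (..., d); means and stds: (K, d); prior_probs: (K,)
    x = np.asarray(x)[..., np.newaxis, :]                      # (..., 1, d)
    log_lik = (-0.5 * np.log(2 * np.pi * stds**2)
               - 0.5 * ((x - means) / stds) ** 2).sum(axis=-1)  # (..., K)
    log_joint = log_lik + np.log(prior_probs)
    # Stable log-sum-exp over the class axis
    m = log_joint.max(axis=-1, keepdims=True)
    log_norm = m + np.log(np.exp(log_joint - m).sum(axis=-1, keepdims=True))
    return log_joint - log_norm

# Illustrative parameters for K=3 classes and d=2 features
means = np.array([[5.0, 3.4], [5.9, 2.8], [6.6, 3.0]])
stds = np.array([[0.35, 0.38], [0.5, 0.3], [0.6, 0.3]])
prior_probs = np.array([1 / 3, 1 / 3, 1 / 3])

# A batch of inputs with batch shape (2, 2)
x_batch = np.array([[[5.0, 3.5], [6.0, 2.7]],
                    [[6.9, 3.1], [5.1, 3.3]]])
post = np.exp(log_posterior(x_batch, means, stds, prior_probs))
pred = post.argmax(axis=-1)
print(post.shape, pred.shape)
```

Adding the class axis just before the feature axis is what lets the same code handle any batch rank; the posterior probabilities along the last axis sum to one for every input.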
# Plot the model's decision regions
plt.figure(figsize=(10, 6))
plot_data(x_train, y_train, labels, label_colours)
x0_min, x0_max = x_train[:, 0].min(), x_train[:, 0].max()
x1_min, x1_max = x_train[:, 1].min(), x_train[:, 1].max()
contour_plot((x0_min, x0_max), (x1_min, x1_max),
lambda x: predict_class(prior, class_conditionals, x),
1, label_colours, levels=[-0.5, 0.5, 1.5, 2.5],
num_points=500)
plt.title("Training set with decision regions")
plt.show()
We will now draw a connection between the Naive Bayes classifier and logistic regression.
First, we will update our model to be a binary classifier. In particular, the model will output the probability that a given input data sample belongs to the 'Iris-Setosa' class: $P(Y=y_0 | \tilde{X}_1,\ldots,\tilde{X}_d)$. The remaining two classes will be pooled together with the label $y_1$.
# Redefine the dataset to have binary labels
y_train_binary = np.array(y_train)
y_train_binary[np.where(y_train_binary == 2)] = 1
y_test_binary = np.array(y_test)
y_test_binary[np.where(y_test_binary == 2)] = 1
# Plot the training data
labels_binary = {0: 'Iris-Setosa', 1: 'Iris-Versicolour / Iris-Virginica'}
label_colours_binary = ['blue', 'red']
plt.figure(figsize=(8, 5))
plot_data(x_train, y_train_binary, labels_binary, label_colours_binary)
plt.show()
We will also make an extra modelling assumption that for each class $k$, the class-conditional distribution $P(X_i | Y=y_k)$ for each feature $i=0, 1$, has standard deviation $\sigma_i$, which is the same for each class $k$.
This means there are now six parameters in total: four for the means $\mu_{ik}$ and two for the standard deviations $\sigma_i$ ($i, k=0, 1$).
We will again use maximum likelihood to estimate these parameters. The prior distribution will be as before, with the class prior probabilities given by
$$ P(Y=y_k) = \frac{\sum_{n=1}^N \delta(Y^{(n)}=y_k)}{N}. $$
We will use your previous function get_prior to redefine the prior distribution.
# Redefine the prior
prior_binary = get_prior(y_train_binary)
# Plot the prior distribution
plt.bar([0, 1], prior_binary.probs.numpy(), color=label_colours_binary)
plt.xlabel("Class")
plt.ylabel("Prior probability")
plt.title("Class prior distribution")
plt.xticks([0, 1], list(labels_binary.values()))
plt.show()
For the class-conditional densities, the maximum likelihood estimates for the means are again given by
$$ \hat{\mu}_{ik} = \frac{\sum_n X_i^{(n)} \delta(Y^{(n)}=y_k)}{\sum_n \delta(Y^{(n)}=y_k)} $$
However, the estimate for the standard deviations $\sigma_i$ is updated. There is also a closed-form solution for the shared standard deviations, but we will instead learn these from the data.
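For reference, the closed-form solution mentioned above pools the squared residuals of every example around its own class mean and divides by the total number of examples. The NumPy sketch below shows this on made-up data; `pooled_std` is a hypothetical helper name and is not part of the assignment.

```python
import numpy as np

def pooled_std(x, y):
    classes = np.unique(y)
    # Class means, shape (K, d)
    means = np.stack([x[y == c].mean(axis=0) for c in classes])
    # Subtract from each example the mean of its own class
    residuals = x - means[np.searchsorted(classes, y)]
    # Shared maximum likelihood standard deviation per feature, shape (d,)
    return np.sqrt((residuals ** 2).mean(axis=0))

x_toy = np.array([[1.0, 2.0], [3.0, 4.0], [10.0, 0.0], [14.0, 2.0]])
y_toy = np.array([0, 0, 1, 1])
sigma = pooled_std(x_toy, y_toy)
print(sigma)
```

A value computed this way on the training data gives a useful target to compare the learned standard deviations against once gradient-based training has converged.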
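You should now write a function that takes the training inputs and target labels as input, as well as an optimiser object, number of epochs and a TensorFlow Variable. This function should be written according to the following spec.
The inputs to the function are: a numpy array x of shape (num_samples, num_features) for the data inputs; a numpy array y of shape (num_samples,) for the target labels; a tf.Variable object scales of length 2 for the standard deviations $\sigma_i$; optimiser: an optimiser object; and epochs: the number of epochs to run the training for.
The function should first compute the means $\mu_{ik}$ as above and create a batched MultivariateNormalDiag distribution with the means set to $\mu_{ik}$ and the scales set to scales. It should then run a custom training loop for epochs number of epochs, in which the negative log-likelihood of the training data is computed as the loss, the gradient of the loss with respect to the scales variables is computed, and the scales variables are updated by the optimiser object. At each iteration, the values of the scales variable and the loss should be saved.
The function should return a tuple of three objects: a numpy array of shape (epochs,) of loss values, a numpy array of shape (epochs, 2) of values for the scales variable at each iteration, and the final learned MultivariateNormalDiag distribution object.
NB: ideally, we would like to constrain the scales variable to have positive values. We are not doing that here, but in later weeks of the course you will learn how this can be implemented.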
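One common way to enforce positivity (not necessarily the approach the course uses later) is to optimise an unconstrained raw value and pass it through a softplus function to obtain the scale. The sketch below shows the idea in NumPy; the function names are hypothetical.

```python
import numpy as np

def softplus(z):
    # log(1 + exp(z)), computed stably; the output is always positive
    return np.logaddexp(0.0, z)

def inverse_softplus(s):
    # Raw value z with softplus(z) == s, valid for s > 0
    return np.log(np.expm1(s))

# Any real-valued raw parameter maps to a valid (positive) standard deviation
raw = np.array([-5.0, 0.0, 3.0])
scales_positive = softplus(raw)

# To start training from scale 1.0, initialise the raw parameter like this
raw_init = inverse_softplus(1.0)
print(scales_positive, raw_init, softplus(raw_init))
```

With this parameterisation the optimiser can move the raw value freely without the scale ever reaching zero or becoming negative. TensorFlow Probability provides a Softplus bijector that can play the same role.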
def learn_stdevs(x, y, scales, optimiser, epochs):
    """
    This function takes the data inputs, targets, scales variable, optimiser and number of
    epochs as inputs.
    This function should set up and run a custom training loop according to the above
    specifications, by setting up the class conditional distributions as a MultivariateNormalDiag
    object, and updating the trainable variables (the scales) in a custom training loop.
    Your function should then return a tuple of three elements: a numpy array of loss values
    during training, a numpy array of scales variables during training, and the final learned
    MultivariateNormalDiag distribution object.
    """
    n_classes = len(np.unique(y))
    n_features = x.shape[-1]
    # Maximum likelihood estimates of the class-conditional means
    loc = np.zeros((n_classes, n_features), dtype=np.float32)
    for f in range(n_features):
        for c in range(n_classes):
            samples = x[y == c][:, f]
            loc[c, f] = np.mean(samples)
    distribution = tfd.MultivariateNormalDiag(loc=loc, scale_diag=scales)
    # Add a class axis so that log_prob has shape (num_samples, n_classes)
    x_reshape = np.expand_dims(x.astype(np.float32), 1)
    def nll(x, y, distribution):
        # Negative log-likelihood of every example under every class
        neg_log_probs = - distribution.log_prob(x)
        per_class_nll = []
        for c_k in range(n_classes):
            # Keep only the examples of class c_k, evaluated under class c_k
            per_class_nll.append(tf.reduce_sum(neg_log_probs[y == c_k][:, c_k]))
        # Summed (not averaged) over all training examples
        return tf.reduce_sum(per_class_nll)
    @tf.function
    def get_loss_and_grads(x, distribution):
        with tf.GradientTape() as tape:
            tape.watch(distribution.trainable_variables)
            loss = nll(x, y, distribution)
        grads = tape.gradient(loss, distribution.trainable_variables)
        return loss, grads
    train_loss_results = []
    train_scale_results = []
    for epoch in range(epochs):
        loss, grads = get_loss_and_grads(x_reshape, distribution)
        optimiser.apply_gradients(zip(grads, distribution.trainable_variables))
        train_loss_results.append(loss.numpy())
        train_scale_results.append(distribution.parameters['scale_diag'].numpy())
        if epoch % 100 == 0:
            print(f'epoch: {epoch}, Loss: {loss}')
    return np.array(train_loss_results), np.array(train_scale_results), distribution
# Define the inputs to your function
scales = tf.Variable([1., 1.])
opt = tf.keras.optimizers.Adam(learning_rate=0.001)
epochs = 1000
# Run your function to learn the class-conditional standard deviations
nlls, scales_arr, class_conditionals_binary = learn_stdevs(x_train, y_train_binary, scales, opt, epochs)
epoch: 0, Loss: 248.52313232421875
epoch: 100, Loss: 229.68798828125
epoch: 200, Loss: 210.25640869140625
epoch: 300, Loss: 191.16299438476562
epoch: 400, Loss: 173.79574584960938
epoch: 500, Loss: 159.8612518310547
epoch: 600, Loss: 152.19459533691406
epoch: 700, Loss: 150.69192504882812
epoch: 800, Loss: 150.63607788085938
epoch: 900, Loss: 150.63558959960938
# View the distribution parameters
print("Class conditional means:")
print(class_conditionals_binary.loc.numpy())
print("\nClass conditional standard deviations:")
print(class_conditionals_binary.stddev().numpy())
Class conditional means:
[[5.0214286 3.4095237]
 [6.25      2.853846 ]]

Class conditional standard deviations:
[[0.5859885  0.35059664]
 [0.5859885  0.35059664]]
# Plot the loss and convergence of the standard deviation parameters
fig, ax = plt.subplots(1, 2, figsize=(14, 5))
ax[0].plot(nlls)
ax[0].set_title("Loss vs epoch")
ax[0].set_xlabel("Epoch")
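ax[0].set_ylabel("Negative log-likelihood")
feature_names = ["Sepal length", "Sepal width"]
for i in [0, 1]:
    # Column i of scales_arr is the shared standard deviation of feature i
    ax[1].plot(scales_arr[:, i], color=label_colours_binary[i], label=feature_names[i])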
ax[1].set_title("Standard deviation ML estimates vs epoch")
ax[1].set_xlabel("Epoch")
ax[1].set_ylabel("Standard deviation")
plt.legend()
plt.show()
We can also plot the contours of the class-conditional Gaussian distributions as before, this time with just binary labelled data. Notice the contours are the same for each class, just with a different centre location.
# Plot the training data with the class-conditional density contours
plt.figure(figsize=(10, 6))
plot_data(x_train, y_train_binary, labels_binary, label_colours_binary)
x0_min, x0_max = x_train[:, 0].min(), x_train[:, 0].max()
x1_min, x1_max = x_train[:, 1].min(), x_train[:, 1].max()
contour_plot((x0_min, x0_max), (x1_min, x1_max), class_conditionals_binary.prob, 2, label_colours_binary)
plt.title("Training set with class-conditional density contours")
plt.show()
We can also plot the decision regions for this binary classifier model. Notice that the decision boundary is now linear.
# Plot the model's decision regions
plt.figure(figsize=(10, 6))
plot_data(x_train, y_train_binary, labels_binary, label_colours_binary)
x0_min, x0_max = x_train[:, 0].min(), x_train[:, 0].max()
x1_min, x1_max = x_train[:, 1].min(), x_train[:, 1].max()
contour_plot((x0_min, x0_max), (x1_min, x1_max),
lambda x: predict_class(prior_binary, class_conditionals_binary, x),
1, label_colours_binary, levels=[-0.5, 0.5, 1.5],
num_points=500)
plt.title("Training set with decision regions")
plt.show()
In fact, we can see that our predictive distribution $P(Y=y_0 | X)$ can be written as follows:
$$ \begin{aligned} P(Y=y_0 | X) =& ~\frac{P(X | Y=y_0)P(Y=y_0)}{P(X | Y=y_0)P(Y=y_0) + P(X | Y=y_1)P(Y=y_1)}\\ =& ~\frac{1}{1 + \frac{P(X | Y=y_1)P(Y=y_1)}{P(X | Y=y_0)P(Y=y_0)}}\\ =& ~\sigma(a) \end{aligned} $$
where $\sigma(a) = \frac{1}{1 + e^{-a}}$ is the sigmoid function, and $a = \log\frac{P(X | Y=y_0)P(Y=y_0)}{P(X | Y=y_1)P(Y=y_1)}$ is the log-odds.
With our additional modelling assumption of a shared covariance matrix $\Sigma$, it can be shown (using the Gaussian pdf) that $a$ is in fact a linear function of $X$:
$$ a = w^T X + w_0 $$
where
$$ \begin{aligned} w =& ~\Sigma^{-1} (\mu_0 - \mu_1)\\ w_0 =& -\frac{1}{2}\mu_0^T \Sigma^{-1}\mu_0 + \frac{1}{2}\mu_1^T\Sigma^{-1}\mu_1 + \log\frac{P(Y=y_0)}{P(Y=y_1)} \end{aligned} $$
The model therefore takes the form $P(Y=y_0 | X) = \sigma(w^T X + w_0)$, with weights $w\in\mathbb{R}^2$ and bias $w_0\in\mathbb{R}$. This is the form used by logistic regression, and explains why the decision boundary above is linear.
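The linearity of the log-odds can be checked by expanding $a$ with the Gaussian density. The following is a sketch of that step, assuming both classes share the covariance $\Sigma$ (diagonal in this model), so that the normalising constants of the two densities cancel:

```latex
\begin{aligned}
a &= \log\frac{N(X \mid \mu_0, \Sigma)\,P(Y=y_0)}{N(X \mid \mu_1, \Sigma)\,P(Y=y_1)} \\
  &= -\tfrac{1}{2}(X-\mu_0)^T\Sigma^{-1}(X-\mu_0)
     + \tfrac{1}{2}(X-\mu_1)^T\Sigma^{-1}(X-\mu_1)
     + \log\frac{P(Y=y_0)}{P(Y=y_1)} \\
  &= (\mu_0-\mu_1)^T\Sigma^{-1}X
     - \tfrac{1}{2}\mu_0^T\Sigma^{-1}\mu_0
     + \tfrac{1}{2}\mu_1^T\Sigma^{-1}\mu_1
     + \log\frac{P(Y=y_0)}{P(Y=y_1)}
\end{aligned}
```

The quadratic term $X^T\Sigma^{-1}X$ appears with opposite signs and cancels, which is what leaves a linear function of $X$. Since $\Sigma^{-1}$ is symmetric, the coefficient of $X$ is $w^T$ with $w = \Sigma^{-1}(\mu_0 - \mu_1)$, and the remaining terms make up $w_0$.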
In the above we have outlined the derivation of the generative logistic regression model. The parameters are typically estimated with maximum likelihood, as we have done.
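As a numerical sanity check of this equivalence, the sketch below computes $P(Y=y_0 | X)$ twice for a single input: once directly from Bayes' rule and once as $\sigma(w^T X + w_0)$. The means, shared standard deviations and prior probabilities are illustrative values chosen for the example, not the fitted ones.

```python
import numpy as np

def log_gauss(x, mu, sigma):
    # Log-density of a Gaussian with diagonal covariance diag(sigma**2)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((x - mu) / sigma) ** 2)

# Illustrative parameters: two class means, shared standard deviations, class priors
mu0 = np.array([5.0, 3.4])
mu1 = np.array([6.25, 2.85])
sigma = np.array([0.59, 0.35])
p0, p1 = 0.35, 0.65

# Weights and bias from the formulas above, with Sigma = diag(sigma**2)
sigma_inv = np.diag(1.0 / sigma**2)
w = sigma_inv @ (mu0 - mu1)
w0 = -0.5 * mu0 @ sigma_inv @ mu0 + 0.5 * mu1 @ sigma_inv @ mu1 + np.log(p0 / p1)

x = np.array([5.5, 3.0])

# Route 1: Bayes' rule with the class-conditional densities
joint0 = np.exp(log_gauss(x, mu0, sigma)) * p0
joint1 = np.exp(log_gauss(x, mu1, sigma)) * p1
posterior_bayes = joint0 / (joint0 + joint1)

# Route 2: logistic regression form
posterior_sigmoid = 1.0 / (1.0 + np.exp(-(w @ x + w0)))

print(posterior_bayes, posterior_sigmoid)
```

Both routes give the same probability, which is the content of the derivation: the generative model with a shared covariance and the logistic form are two parameterisations of the same predictive distribution.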
Finally, we will use the above equations to directly parameterise the output Bernoulli distribution of the generative logistic regression model.
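You should now write the following function, according to the following specification. The inputs to the function are the prior distribution over the two classes and the (batched) class-conditional distribution class_conditionals. The function should compute the weights and bias terms $w$ and $w_0$ using the above equations, and return them as a tuple of two numpy arrays.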
def get_logistic_regression_params(prior, class_conditionals):
    """
    This function takes the prior distribution and class-conditional distribution as inputs.
    This function should compute the weights and bias terms of the generative logistic
    regression model as above, and return them in a 2-tuple of numpy arrays of shapes
    (2,) and () respectively.
    """
    mu0 = class_conditionals.loc[0].numpy()
    mu1 = class_conditionals.loc[1].numpy()
    # covariance() carries the batch dimension of the class-conditionals, so with
    # tfp.__version__ == 0.13.0 it has shape (2, 2, 2): one (2, 2) matrix per class.
    # (With tfp.__version__ == 0.9.0 it had shape (2, 2).)
    covariance = class_conditionals.covariance().numpy()
    print(covariance.shape)
    # The covariance is shared between the classes, so either batch element can be
    # used; the [0] indexing assumes the batched (2, 2, 2) shape
    precision = np.linalg.inv(covariance[0])
    probs = prior.probs.numpy()
    w = precision @ (mu0 - mu1)
    w0 = (- 0.5 * mu0 @ precision @ mu0
          + 0.5 * mu1 @ precision @ mu1
          + np.log(probs[0] / probs[1]))
    return w, w0
# Run your function to get the logistic regression parameters
w, w0 = get_logistic_regression_params(prior_binary, class_conditionals_binary)
(2, 2, 2)
We can now use these parameters to make a contour plot to display the predictive distribution of our logistic regression model.
# Plot the training data with the logistic regression prediction contours
fig, ax = plt.subplots(1, 1, figsize=(10, 6))
plot_data(x_train, y_train_binary, labels_binary, label_colours_binary)
x0_min, x0_max = x_train[:, 0].min(), x_train[:, 0].max()
x1_min, x1_max = x_train[:, 1].min(), x_train[:, 1].max()
X0, X1 = get_meshgrid((x0_min, x0_max), (x1_min, x1_max))
# w has shape (2,), so logits has one value per grid point
logits = np.dot(np.array([X0.ravel(), X1.ravel()]).T, w) + w0
Z = tf.math.sigmoid(logits)
lr_contour = ax.contour(X0, X1, np.array(Z).T.reshape(*X0.shape), levels=10)
ax.clabel(lr_contour, inline=True, fontsize=10)
contour_plot((x0_min, x0_max), (x1_min, x1_max),
             lambda x: predict_class(prior_binary, class_conditionals_binary, x),
             1, label_colours_binary, levels=[-0.5, 0.5, 1.5],
             num_points=300)
plt.title("Training set with prediction contours")
plt.show()