#!/usr/bin/env python # coding: utf-8 # # PMR tutorial on VAEs # # A variational autoencoder (VAE) is a latent-variable statistical model with observable variables $\mathbf{x}$ and latents $\mathbf{z}$, where the marginal likelihood is typically written as # # $$ # p_{\theta}(\mathbf{x}) = \int p_{\theta}(\mathbf{x} \mid \mathbf{z}) p(\mathbf{z}) \text{d} \mathbf{z}, # $$ # # and where the integral is generally intractable, since the generator $p_\theta(\mathbf{x} \mid \mathbf{z})$ is parametrised by a neural network, called a decoder, with parameters $\theta$. Moreover, the classical EM algorithm is also intractable, since neither the posterior distribution $p_\theta(\mathbf{z} \mid \mathbf{x})$ nor the computation of the lower-bound in the E-step of the algorithm $\mathbb{E}_{p_{\hat \theta}(\mathbf{z} \mid \mathbf{x})}\left[\log \frac{p_{\theta}(\mathbf{x} \mid \mathbf{z}) p(\mathbf{z})}{p_{\hat \theta}(\mathbf{z} \mid \mathbf{x})}\right]$ is computationally tractable. # # Instead, VAEs use variational inference to optimise the statistical model $p_\theta$. We define $q_{\phi}(\mathbf{z} \mid \mathbf{x})$ to be the variational approximation of $p_\theta(\mathbf{z} \mid \mathbf{x})$ and write down the variational lower-bound for a single data-point $\mathbf{x}^i$ # # $$ # \mathcal{L}_i(\theta, \phi) = \mathbb{E}_{q_{\phi}(\mathbf{z} \mid \mathbf{x}^i)}\left[\log \frac{p_{\theta}(\mathbf{x}^i \mid \mathbf{z}) p(\mathbf{z})}{q_{\phi}(\mathbf{z} \mid \mathbf{x}^i)}\right]. # $$ # # The lower-bound is then maximised with respect to the parameters $\theta$ and $\phi$ iteratively until convergence using stochastic gradient ascent. Moreover, to facilitate parameter sharing across data-points, VAEs parametrise the variational distribution $q_{\phi}(\mathbf{z} \mid \mathbf{x})$ using a neural network, also called an inference network or an encoder, with globally-shared parameters $\phi$. # # The goals of this tutorial are to: # * Learn to implement mathematical equations in code # * Read and understand scientific code # * Learn about a common error in the VAE literature to avoid repeating it # * Be able to understand two ways of fitting variational autoencoders, explain their advantages and disadvantages, and implement them # * Learn about a useful gradient variance reduction method # * Fit and experiment with VAEs on [MNIST data](https://en.wikipedia.org/wiki/MNIST_database) # ## Structure # # We advise you to first read and complete all exercises (1-7) up to and including the section ["VAE implementation"](#VAE-implementation). The exercises are a combination of pen-and-paper exercises, short text questions, and code implementation exercises. The answers to these exercises are provided below the question under a clickable button like this:
Answer (click) Here's the hidden answer.
# # After the ["VAE implementation"](#VAE-implementation) section follows the experiments for fitting the VAEs on MNIST data using two gradient estimation methods. Moreover, through these experiments we will also introduce an important variance reduction method called _control variates_ and demonstrate its use to reduce gradient variance. # This part of the tutorial will not ask you to write any code but will use the code you have implemented in exercises 1-7. The training in the required experiments should not take more than 20 minutes each. In addition some additional (optional) experiment suggestions are provided which may take longer (up to 2hrs) to complete. To save yourself some time you might want to run these experiments over night. Alternatively, if you do not wish to train the models yourself you can load pre-trained models (see instructions below). # # The rest of the tutorial will use a fitted VAE model from the experiment sections and hence will not require any time-consuming training. # Using the fitted VAE model, we will generate synthetic images of handwritten digits, then investigate and compare them with the original data, and ask you a couple of short text questions (Exercise 8). # Finally, we will investigate the latent space of the VAE by projecting test data to the low-dimensional latent space and sampling from a grid of latent variables. # # Optional (non-examinable) material is highlighted as such. # ## Loading pre-trained models # # Running the VAE training is optional. If you only wish to use a pre-trained model you can comment the training code and uncomment the loading code in the relevant VAE fitting sections. # ## Jupyter tips # # You may want to install some of the Jupyter plugins from (make sure you activate the pmr environment (`conda activate pmr`), or other if you're using a different name, before installing `jupyter_contrib_nbextensions`), for example the toc2 (table of contents) provides a useful sidebar that allows you to quickly navigate between sections. # # Here is also some useful Jupyter tips to help you navigating the tutorial more smoothly. # # * ctrl+enter (or cmd+enter on a Mac) runs the currently selected cell # * shift+enter runs the currently selected cell and jumps to the next # * shift+tab on a line of code displays useful description of the line, e.g. if the cursor is placed in the parentheses of a function call it shows the function's arguments and docstring, pressing tab multiple times scross through the entire docstring # * enter takes you to editing mode, which allows you to edit the code or markdown cells # * esc leaves the editing mode and allows using up and down arrow keys to navigate the cells # * When _not_ in editing mode (e.g. 
after pressing esc) you can use a or b to add a new cell above or below the current one # * When editing code or markdown text ctrl+/ (or cmd+/ on a Mac) comments the current line; if multiple lines are selected, all of them will be commented, and pressing the same shortcut again will uncomment them # * More useful shortcuts can be found by pressing 'Help' in the toolbar and then 'Keyboard Shortcuts' # # # In[1]: # Import main libraries import argparse import itertools import random import os from tqdm import tqdm from copy import deepcopy import numpy as np import matplotlib as mpl import matplotlib.pyplot as plt get_ipython().run_line_magic('matplotlib', 'inline') from IPython import display import torch import torchvision import torch.nn as nn import torch.optim as optim def seed_everything(seed=20211201): """ Helper to seed random number generators. Note, however, that even with the seeds fixed, some non-determinism is possible. For more details read . """ random.seed(seed) os.environ['PYTHONHASHSEED'] = str(seed) np.random.seed(seed) torch.manual_seed(seed) torch.cuda.manual_seed(seed) torch.cuda.manual_seed_all(seed) # if you are using multi-GPU. # The below might help to make it more deterministic # but will hurt the performance significantly # torch.backends.cudnn.benchmark = False # torch.backends.cudnn.deterministic = True # ## MNIST # # In this tutorial we will fit a VAE model on a toy dataset of handwritten digits, called MNIST, commonly used in VAE examples. # # Note the variable `img_size`: we use it to downsample the original 28x28 data to 14x14 to increase the training speed of the experiments. Feel free to increase it to 28 if you want to use the original data without downsampling, or reduce it even more to 10 if the experiments take too long to complete. # # The following code will download the dataset and load it into memory. Then, we display a random sample from the dataset. # In[2]: img_size=14 train_data = torchvision.datasets.MNIST( './data', train=True, download=True, transform=torchvision.transforms.Compose([torchvision.transforms.Resize((img_size,img_size)), torchvision.transforms.ToTensor()])) test_data = torchvision.datasets.MNIST( './data', train=False, transform=torchvision.transforms.Compose([torchvision.transforms.Resize((img_size,img_size)), torchvision.transforms.ToTensor()])) data_dims=train_data[0][0].cpu().numpy().shape[1:] # In[3]: def plot_grid_of_mnist_samples(samples, grid=(5, 5), figsize=(8,8)): """A helper function for plotting MNIST samples""" fig, axes = plt.subplots(grid[0], grid[1], figsize=figsize) axes = axes.flatten() for idx, ax in enumerate(axes): ax.imshow(samples[idx], cmap='gray') ax.tick_params(left=False, labelleft=False, bottom=False, labelbottom=False) fig.tight_layout() return fig # In[4]: indices = np.random.choice(len(train_data), size=64) samples = torch.vstack([train_data[idx][0] for idx in indices]) fig = plot_grid_of_mnist_samples(samples, grid=(8,8)) # Note that the data lie within the range $[0, 1]$.
# In[5]: all_data = torch.vstack([x for x, _ in train_data]) print('Min value:', all_data.min().item()) print('Max value:', all_data.max().item()) # Plot histogram for a random subset subsample_size = 2000 indices = np.random.choice(all_data.shape[0], size=subsample_size, replace=False) plt.hist(all_data[indices].flatten().numpy(), bins=np.linspace(0, 1, num=10)) # ## VAE for MNIST # # We saw that a large portion of the values are either 0 or 1; however, there are also values strictly between 0 and 1, hence the support of the generative model should be $[0, 1]$. It is a [common error](https://proceedings.neurips.cc/paper/2019/hash/f82798ec8909d23e55679ee26bb26437-Abstract.html) in the VAE literature to use the Bernoulli distribution, whose support is binary, for data with support $[0, 1]$. # NB: Remember to always check that the domain of the statistical model includes the domain of the data. # # Here we parametrise the generator with a recently proposed [continuous Bernoulli distribution](https://en.wikipedia.org/wiki/Continuous_Bernoulli_distribution), whose support is $[0, 1]$. # # $$ # p_{\theta}(\mathbf{x} \mid \mathbf{z}) = \mathcal{CB}\left(\mathbf{x}; \eta \right) = \prod_{d=1}^D \mathcal{CB}\left(x_d; \eta_d \right), \text{ and } \eta_d = \text{sigmoid}(f_{\theta}(\mathbf{z})_d) # $$ # # where $\eta_d \in (0, 1)$ are the parameters of the continuous Bernoulli distribution, $\mathbf{f}_{\theta}(\mathbf{\cdot})$ is a neural network that takes $\mathbf{z}$ as input and outputs unconstrained parameters (we call them logits), and the [sigmoid function](https://en.wikipedia.org/wiki/Sigmoid_function) transforms the logit parameters to the range $(0, 1)$. # # As is common in the VAE literature, we choose the prior distribution over the latents $p(\mathbf{z})$ as a standard Gaussian # # $$ # p(\mathbf{z}) = \mathcal{N}\left(\mathbf{z}; \mathbf{0}, \mathbf{1}\right) # $$ # # and specify the variational distribution family as a diagonal Gaussian # $$ # q_{\phi}(\mathbf{z} \mid \mathbf{x}) = \mathcal{N}\left(\mathbf{z}; \mu, \text{diag}(\sigma) \right) = \prod_{k=1}^K \mathcal{N}\left(z_k; \mu_k, \sigma_k\right), \text{ and } \sigma_k = \exp(0.5 \gamma_{\phi}(\mathbf{x})_k), \quad \mu_k = \mu_{\phi}(\mathbf{x})_k # $$ # # where $\mu$ and $\sigma$ are the mean and standard deviation of the independent Gaussians (note that, differently from the lecture slides, we write the Gaussian distribution in terms of the standard deviation, not the variance, since the implementation in PyTorch takes the mean and standard deviation as arguments), $\mu_{\phi}(\mathbf{\cdot})$ and $\gamma_{\phi}(\mathbf{\cdot})$ are neural networks that take $\mathbf{x}$ as input and output the parameters---mean and log-variance---of the variational distribution, and $\mu_k$ and $\sigma_k$ (and $\gamma_{\phi}(\mathbf{x})_k$) are the corresponding $k$-th constrained (and unconstrained) parameters.
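# As a small optional illustration of the common error mentioned above (an assumed toy check, not one of the tutorial's exercises), the snippet below numerically integrates the unnormalised Bernoulli expression $\lambda^x(1-\lambda)^{1-x}$ over $[0, 1]$ and compares it with the continuous Bernoulli density, which includes the missing normalising constant:
#
# ```python
# # Assumed toy check: the Bernoulli form lambda^x (1-lambda)^(1-x) does not
# # integrate to 1 over [0, 1], whereas the continuous Bernoulli pdf does.
# lam = torch.tensor(0.3)  # an arbitrary parameter value in (0, 1)
# x = torch.linspace(1e-4, 1 - 1e-4, 10001)
#
# unnormalised = lam**x * (1 - lam)**(1 - x)
# print(torch.trapz(unnormalised, x))  # ≈ 0.47, so not a valid pdf on [0, 1]
#
# cb = torch.distributions.ContinuousBernoulli(probs=lam)
# print(torch.trapz(cb.log_prob(x).exp(), x))  # ≈ 1
# ```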
# # In this setting, where the prior and the variational distribution are both Gaussian, the per-data-point variational ELBO can be conveniently rewritten as # # $$ # \mathcal{L}_i(\theta, \phi) = \mathbb{E}_{q_{\phi}(\mathbf{z} \mid \mathbf{x}^i)}\left[\log \frac{p_{\theta}(\mathbf{x}^i \mid \mathbf{z}) p(\mathbf{z})}{q_{\phi}(\mathbf{z} \mid \mathbf{x}^i)}\right] = \mathbb{E}_{q_{\phi}(\mathbf{z} \mid \mathbf{x}^i)}\left[\log p_{\theta}(\mathbf{x}^i \mid \mathbf{z})\right] - \text{KL}\left(q_{\phi}(\mathbf{z} \mid \mathbf{x}^i) \middle|\middle| p(\mathbf{z})\right) # $$ # # The KL divergence between two Gaussians can be computed analytically and in the current setting becomes # # $$ # \text{KL}\left(q_{\phi}(\mathbf{z} \mid \mathbf{x}^i) \middle|\middle| p(\mathbf{z})\right) = -\frac{1}{2} \sum_{k=1}^K(1 + \log ((\sigma_k)^2) - (\mu_k)^2 - (\sigma_k)^2), \text{ where } \sigma_k = \exp(0.5 \gamma_{\phi}(\mathbf{x}^i)_k), \quad \mu_k = \mu_{\phi}(\mathbf{x}^i)_k. # $$ # Below we provide the code for evaluating the log probabilities of the generator $p_{\theta}(\mathbf{x}\mid \mathbf{z})$ and the variational distribution $q_{\phi}(\mathbf{z}\mid \mathbf{x})$. # In[6]: def evaluate_logprob_continuous_bernoulli(X, *, logits): """ Evaluates log-probability of the continuous Bernoulli distribution Args: X (Tensor): data, a batch of shape (B, D) logits (Tensor): parameters of the continuous Bernoulli, a batch of shape (B, D) Returns: logpx (Tensor): log-probabilities of the inputs X, a batch of shape (B,) """ cb = torch.distributions.ContinuousBernoulli(logits=logits) return cb.log_prob(X).sum(dim=-1) def evaluate_logprob_diagonal_gaussian(Z, *, mean, logvar): """ Evaluates log-probability of the diagonal Gaussian distribution Args: Z (Tensor): latent vectors, a batch of shape (*, B, H) mean (Tensor): mean of diagonal Gaussian, a batch of shape (*, B, H) logvar (Tensor): log-variance of diagonal Gaussian, a batch of shape (*, B, H) Returns: logqz (Tensor): log-probabilities of the inputs Z, a batch of shape (*, B) where `*` corresponds to any additional dimensions of the input arguments, for example a dimension representing the samples used to approximate the expectation in the ELBO """ gauss = torch.distributions.Normal(loc=mean, scale=torch.exp(0.5*logvar)) return gauss.log_prob(Z).sum(dim=-1) # #### Exercise 1 # # Using the formula above for $\text{KL}(q_{\phi}(\mathbf{z}\mid \mathbf{x}^i)||p(\mathbf{z}))$ with a diagonal Gaussian $q$ and a standard Gaussian $p$, implement the following code for computing the KL divergence for a (mini-)batch of variational distributions $q(\mathbf{z} \mid \mathbf{x}^1) \ldots q(\mathbf{z} \mid \mathbf{x}^B)$. # #
#
Solution # # The code solution is provided below. If you're struggling with the task, you can simply copy it into the indicated location in the code cell below. If you use the provided solution, make sure you understand what the code does. # # ```python # q_var = q_logvar.exp() # # kld = -0.5 * (1 + q_logvar - q_mean**2 - q_var).sum(dim=-1) # ``` # #
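# Once you have completed the cell below, you can optionally sanity-check your implementation against PyTorch's built-in `torch.distributions.kl_divergence` (this check is our suggestion and is not part of the exercise), for example:
#
# ```python
# # Assumed sanity check: compare the analytic KL against PyTorch's kl_divergence.
# q_mean = torch.randn(4, 3)
# q_logvar = torch.randn(4, 3)
#
# ours = compute_kld_with_standard_gaussian(q_mean, q_logvar)
#
# q = torch.distributions.Normal(q_mean, torch.exp(0.5 * q_logvar))
# p = torch.distributions.Normal(torch.zeros_like(q_mean), torch.ones_like(q_mean))
# reference = torch.distributions.kl_divergence(q, p).sum(dim=-1)
#
# print(torch.allclose(ours, reference, atol=1e-5))  # should print True
# ```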
# In[7]: def compute_kld_with_standard_gaussian(q_mean, q_logvar): """ Computes KL(q(z|x)||p(z)) between the variational diagonal Gaussian distribution q and standard Gaussian prior p Args: q_mean (Tensor): mean of diagonal Gaussian q, a batch of shape (B, H) q_logvar (Tensor): log-variance of diagonal Gaussian q, a batch of shape (B, H) Returns: kld (Tensor): KL divergences between q and p, a batch of shape (B,) """ ### IMPLEMENT YOUR CODE HERE # q_var = ... # kld = ... q_var = q_logvar.exp() kld = -0.5 * (1 + q_logvar - q_mean**2 - q_var).sum(dim=-1) ### return kld # ## Gradients of the variational ELBO # # To optimise the model we will iteratively maximise the averaged variational lower-bound $\sum_i \mathcal{L}_i(\theta, \phi)$ with stochastic gradient ascent. This requires us to obtain gradients of $\mathcal{L}_i(\theta, \phi)$ with respect to both $\theta$ and $\phi$. # # $$ # \begin{align} # \nabla_{\theta, \phi} \mathcal{L}_i &= \nabla_{\theta, \phi} \left\{ \mathbb{E}_{q_{\phi}(\mathbf{z} \mid \mathbf{x}^i)}\left[\log p_{\theta}(\mathbf{x}^i \mid \mathbf{z})\right] - \text{KL}\left(q_{\phi}(\mathbf{z} \mid \mathbf{x}^i) \mid\mid p(\mathbf{z})\right) \right\}\\ # &= \nabla_{\theta, \phi} \mathbb{E}_{q_{\phi}(\mathbf{z} \mid \mathbf{x}^i)}\left[\log p_{\theta}(\mathbf{x}^i \mid \mathbf{z})\right] - \nabla_{\theta, \phi} \text{KL}\left(q_{\phi}(\mathbf{z} \mid \mathbf{x}^i) \mid\mid p(\mathbf{z})\right) # \end{align} # $$ # # Since the KL divergence term can be computed analytically, we can obtain its gradients by using off-the-shelf autograd tools available in PyTorch. # # # However, the first expectation $\mathbb{E}_{q_{\phi}}[\cdot]$ depends on the parameters $\phi$, which complicates the computation of its gradients. Below we discuss two of the most common ways for obtaining gradients of the variational model: the pathwise gradient estimator and the score function gradient estimator. # # ### Pathwise gradient estimator (reparametrisation trick) # # The pathwise gradient estimator, also commonly known as the [reparametrisation trick](https://arxiv.org/abs/1312.6114), relies on the sampling path of the variational samples. In general, the estimator requires that sampling from the variational distribution be decomposable into a parametric, deterministic, and differentiable transformation $\mathbf{t}(\cdot; \phi)$ of samples $\epsilon$ from an auxiliary noise distribution $n(\epsilon)$ without learnable parameters # # $$ # \mathbf{z} \sim q_{\phi}(\mathbf{z}) \Leftrightarrow \mathbf{z} = \mathbf{t}(\epsilon; \phi) \text{ with } \epsilon \sim n(\epsilon). # $$ # # The gradients obtained using the pathwise gradient estimator are often low-variance, and hence the estimator is usually preferred over other approaches when it is available. # # #### Exercise 2 # # Show that the gradients of the per-data-point lower-bound $\mathcal{L}_i(\theta, \phi)$ are # # $$ # \nabla_{\theta, \phi} \mathcal{L}_i(\theta, \phi) = \mathbb{E}_{n(\epsilon)}\left[ \nabla_{\theta, \mathbf{z}} \log p_{\theta}(\mathbf{x} \mid \mathbf{z})^\top \nabla_{\theta, \phi} \mathbf{t}(\epsilon; \phi)\right] - \nabla_{\theta, \phi} \text{KL}\left(q_{\phi}(\mathbf{z} \mid \mathbf{x}) \mid\mid p(\mathbf{z})\right) # $$ # # using the reparametrisation $\mathbf{z} = \mathbf{t}(\epsilon; \phi)$. Since the gradients of the analytically computable KL term can be obtained via off-the-shelf autograd tools, you can just denote them $\nabla_{\theta, \phi} \text{KL}\left(q_{\phi}(\mathbf{z} \mid \mathbf{x}) \mid\mid p(\mathbf{z})\right)$. #
Answer (click)The pathwise gradients of the lower-bound can be derived as follows # # $$ # \begin{align} # \nabla_{\theta, \phi} \mathcal{L}(\theta, \phi) &= \nabla_{\theta, \phi} \mathbb{E}_{q_{\phi}(\mathbf{z} \mid \mathbf{x})}\left[\log p_{\theta}(\mathbf{x} \mid \mathbf{z})\right] - \nabla_{\theta, \phi} \text{KL}\left(q_{\phi}(\mathbf{z} \mid \mathbf{x}) \mid\mid p(\mathbf{z})\right)\\ # &= \nabla_{\theta, \phi} \mathbb{E}_{n(\epsilon)}\left[\log p_{\theta}(\mathbf{x} \mid \mathbf{t}(\epsilon; \phi))\right] - \nabla_{\theta, \phi} \text{KL}\left(q_{\phi}(\mathbf{z} \mid \mathbf{x}) \mid\mid p(\mathbf{z})\right)\\ # &= \mathbb{E}_{n(\epsilon)}\left[ \nabla_{\theta, \phi} \log p_{\theta}(\mathbf{x} \mid \mathbf{t}(\epsilon; \phi))\right] - \nabla_{\theta, \phi} \text{KL}\left(q_{\phi}(\mathbf{z} \mid \mathbf{x}) \mid\mid p(\mathbf{z})\right)\\ # &= \mathbb{E}_{n(\epsilon)}\left[ \nabla_{\theta, \mathbf{z}} \log p_{\theta}(\mathbf{x} \mid \mathbf{z})^\top \nabla_{\theta, \phi} \mathbf{t}(\epsilon; \phi)\right] - \nabla_{\theta, \phi} \text{KL}\left(q_{\phi}(\mathbf{z} \mid \mathbf{x}) \mid\mid p(\mathbf{z})\right), \text{ with } \mathbf{z} = \mathbf{t}(\epsilon; \phi). # \end{align} # $$ # # Thus, the gradients of $\phi$ can be obtained as long as the variational distribution can be reparametrised via a differentiable transformation $\mathbf{t}(\cdot; \phi)$. # In practice the expectation is approximated by Monte Carlo averaging and the gradients are computed as an average over samples $\epsilon_1, \ldots, \epsilon_M$ from the noise distribution $n(\epsilon)$. # # It is important to note that the derivatives in the final line can then be obtained by off-the-shelf autograd tools available in most ML frameworks. #
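# Before moving on, here is a small self-contained numerical illustration of the pathwise estimator (an assumed toy example, not one of the exercises): we estimate $\nabla_{\mu} \mathbb{E}_{z \sim \mathcal{N}(\mu, \sigma^2)}[z^2]$ by differentiating through reparametrised samples and compare it with the analytic value $2\mu$ (since $\mathbb{E}[z^2] = \mu^2 + \sigma^2$).
#
# ```python
# # Assumed toy example: pathwise gradients of E[z^2] for z ~ N(mu, sigma^2).
# mu = torch.tensor(1.5, requires_grad=True)
# log_sigma = torch.tensor(-0.5, requires_grad=True)
#
# eps = torch.randn(100_000)                  # noise from n(eps) = N(0, 1)
# z = mu + torch.exp(log_sigma) * eps         # z = t(eps; phi)
# objective = (z ** 2).mean()                 # Monte Carlo estimate of E[z^2]
# objective.backward()
#
# print(mu.grad)         # close to the analytic value 2 * mu = 3
# print(log_sigma.grad)  # close to 2 * sigma^2 = 2 * exp(-1) ≈ 0.74
# ```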
# # # #### Exercise 3 # # Implement a function $\mathbf{t}(\epsilon; \mu, \sigma)$ that samples from a Gaussian distribution $\mathcal{N}(z; \mu, \sigma)$ without using the default PyTorch implementation. # #
#
Hint 1 (click)Use the standard Gaussian distribution for the "noise".
#
Hint 2 (click)A univariate Gaussian $\mathcal{N}(z; \mu, \sigma)$ can be reparametrised as $z = \mu + \sigma\epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$.
#
Solution # # The code solution is provided below. If you're struggling with the task, you can simply copy it into the indicated location in the code cell below. If you use the provided solution, make sure you understand what the code does. # # ```python # std = torch.exp(0.5*logvar) # eps = torch.randn(torch.Size([num_samples]) + std.shape, dtype=std.dtype, device=std.device) # # Z = mean + eps*std # ``` # #
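# After completing the cell below, you can optionally check the sampler (an assumed sanity check, not part of the exercise): the empirical mean and standard deviation of many reparametrised samples should match the supplied parameters.
#
# ```python
# # Assumed sanity check for the reparametrised sampler.
# mean = torch.tensor([[0.5, -1.0]])
# logvar = torch.tensor([[0.0, 2.0]])
#
# Z = sample_gaussian_with_reparametrisation(mean, logvar, num_samples=50_000)  # (M, 1, 2)
# print(Z.mean(dim=0))  # ≈ [0.5, -1.0]
# print(Z.std(dim=0))   # ≈ exp(0.5 * logvar) = [1.0, 2.72]
# ```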
# In[8]: def sample_gaussian_with_reparametrisation(mean, logvar, *, num_samples=1): """ Samples the Gaussian distribution using the reparametrisation trick Args: mean (Tensor): mean of diagonal Gaussian q, a batch of shape (B, K) logvar (Tensor): log-variance of diagonal Gaussian q, a batch of shape (B, K) num_samples (int): The number of samples (M) to approximate the expectation in the ELBO Returns: Z (Tensor): Samples Z from the diagonal Gaussian q, a batch of shape (num_samples, B, K) """ ### IMPLEMENT YOUR CODE HERE # std = ... # eps = ... # Z = ... std = torch.exp(0.5*logvar) eps = torch.randn(torch.Size([num_samples]) + std.shape, dtype=std.dtype, device=std.device) Z = mean + eps*std ### # Making sure that the variational samples are on the same device as the input argument (i.e. CPU or GPU) return Z.to(device=std.device) # #### Exercise 4 # # Now that we can obtain samples $z$ via the reparametrised sampling, implement the variational ELBO using the reparametrised samples # # $$ # \mathcal{L}_i(\theta, \phi) = \mathbb{E}_{q_{\phi}(\mathbf{z} \mid \mathbf{x}^i)}\left[\log p_{\theta}(\mathbf{x}^i \mid \mathbf{z})\right] - \text{KL}\left(q_{\phi}(\mathbf{z} \mid \mathbf{x}^i) \middle|\middle| p(\mathbf{z})\right) # $$ # #
#
Hint (click)Use the code implemented earlier.
#
Solution # # The code solution is provided below. If you're struggling with the task, you can simply copy it into the indicated location in the code cell below. If you use the provided solution, make sure you understand what the code does. # # ```python # # KLD( q(z|x) || p(z) ) # kld = compute_kld_with_standard_gaussian(mean, logvar) # (B,) # # # ~E_{q(z|x)}[ p(x | z) ] # neg_cross_entropy = evaluate_logprob_continuous_bernoulli(X, logits=logits).mean(dim=0) # (B,) # # # ELBO for each data-point # elbos = neg_cross_entropy - kld # (B,) # ``` # #
# In[9]: def elbo_with_pathwise_gradients(X, *, encoder, decoder, num_samples): """ Evaluate the ELBO for each data-point in X Args: X (Tensor): data, a batch of shape (B, D) decoder (torch.nn.Module): is a neural network that provides the parameters of the generative model p(x|z; θ) encoder (torch.nn.Module): is a neural network that provides the parameters of the variational distribution q(z|x; φ). num_samples (int): the number of samples used in the Monte Carlo averaging Returns: elbos (Tensor): the ELBOs for each input X, a tensor of shape (B,) """ # Evaluate the encoder network to obtain the parameters of the # variational distribution mean, logvar = encoder(X) # (B, H), (B, H) # Sample the latents using the reparametrisation trick Z = sample_gaussian_with_reparametrisation( mean, logvar, num_samples=num_samples) # (M, B, H) # Evaluate the decoder network to obtain the parameters of the # generative model p(x|z) logits = decoder(Z) # (M, B, D) ### IMPLEMENT YOUR CODE BELOW # KLD( q(z|x) || p(z) ) # kld = ... # (B,) # ~E_{q(z|x)}[ p(x | z) ] # neg_cross_entropy = ... # (B,) # ELBO for each data-point # elbos = ... # (B,) # KLD( q(z|x) || p(z) ) kld = compute_kld_with_standard_gaussian(mean, logvar) # (B,) # ~E_{q(z|x)}[ p(x | z) ] neg_cross_entropy = evaluate_logprob_continuous_bernoulli(X, logits=logits).mean(dim=0) # (B,) # ELBO for each data-point elbos = neg_cross_entropy - kld # (B,) ### return elbos # ### Score function gradient estimator (optional, not examinable) # # While the pathwise gradient, where available, is often preferred in practice due to its simplicity and low variance, many variational distributions cannot be reparametrised, such as discrete distributions. # We can then use the [score function gradient estimator](https://arxiv.org/abs/1401.0118), also often known as the REINFORCE gradients in reinforcement learning literature. 
# # $$ # \begin{align} # \nabla_{\theta, \phi} \mathcal{L}_i(\theta, \phi) &= \nabla_{\theta, \phi} \mathbb{E}_{q_{\phi}(\mathbf{z} \mid \mathbf{x}^i)}\left[\log p_{\theta}(\mathbf{x}^i \mid \mathbf{z})\right] - \nabla_{\theta, \phi} \text{KL}\left(q_{\phi}(\mathbf{z} \mid \mathbf{x}^i) \mid\mid p(\mathbf{z})\right)\\ # &= \nabla_{\theta, \phi} \int q_{\phi}(\mathbf{z} \mid \mathbf{x}^i) \log p_{\theta}(\mathbf{x}^i \mid \mathbf{z}) \text{d}{\mathbf{z}} - \nabla_{\theta, \phi} \text{KL}\left(q_{\phi}(\mathbf{z} \mid \mathbf{x}^i) \mid\mid p(\mathbf{z})\right)\\ # &= \int \nabla_{\theta, \phi} \left( q_{\phi}(\mathbf{z} \mid \mathbf{x}^i) \log p_{\theta}(\mathbf{x}^i \mid \mathbf{z}) \right) \text{d}{\mathbf{z}} - \nabla_{\theta, \phi} \text{KL}\left(q_{\phi}(\mathbf{z} \mid \mathbf{x}^i) \mid\mid p(\mathbf{z})\right)\\ # &= \int \left(\nabla_{\theta, \phi} \left\{ q_{\phi}(\mathbf{z} \mid \mathbf{x}^i)\right\} \log p_{\theta}(\mathbf{x}^i \mid \mathbf{z}) + q_{\phi}(\mathbf{z} \mid \mathbf{x}^i) \nabla_{\theta, \phi} \log p_{\theta}(\mathbf{x}^i \mid \mathbf{z})\right) \text{d}{\mathbf{z}} - \nabla_{\theta, \phi} \text{KL}\left(q_{\phi}(\mathbf{z} \mid \mathbf{x}^i) \mid\mid p(\mathbf{z})\right)\\ # &= \int \left(q_{\phi}(\mathbf{z} \mid \mathbf{x}^i) \nabla_{\theta, \phi} \left\{ \log q_{\phi}(\mathbf{z} \mid \mathbf{x}^i)\right\} \log p_{\theta}(\mathbf{x}^i \mid \mathbf{z}) + q_{\phi}(\mathbf{z} \mid \mathbf{x}^i) \nabla_{\theta, \phi} \log p_{\theta}(\mathbf{x}^i \mid \mathbf{z})\right) \text{d}{\mathbf{z}} - \nabla_{\theta, \phi} \text{KL}\left(q_{\phi}(\mathbf{z} \mid \mathbf{x}^i) \mid\mid p(\mathbf{z})\right)\\ # &= \int q_{\phi}(\mathbf{z} \mid \mathbf{x}^i) \left( \nabla_{\theta, \phi} \left\{\log q_{\phi}(\mathbf{z} \mid \mathbf{x}^i)\right\} \log p_{\theta}(\mathbf{x}^i \mid \mathbf{z}) + \nabla_{\theta, \phi} \log p_{\theta}(\mathbf{x}^i \mid \mathbf{z})\right) \text{d}{\mathbf{z}} - \nabla_{\theta, \phi} \text{KL}\left(q_{\phi}(\mathbf{z} \mid \mathbf{x}^i) \mid\mid p(\mathbf{z})\right)\\ # &= \mathbb{E}_{q_{\phi}(\mathbf{z} \mid \mathbf{x}^i)} \left[ \nabla_{\theta, \phi} \left\{\log q_{\phi}(\mathbf{z} \mid \mathbf{x})\right\} \log p_{\theta}(\mathbf{x} \mid \mathbf{z}) + \nabla_{\theta, \phi} \log p_{\theta}(\mathbf{x}^i \mid \mathbf{z})\right] - \nabla_{\theta, \phi} \text{KL}\left(q_{\phi}(\mathbf{z} \mid \mathbf{x}^i) \mid\mid p(\mathbf{z})\right), # \end{align} # $$ # # where in the third equation we used the [dominated convergence theorem](https://en.wikipedia.org/wiki/Leibniz_integral_rule#Measure_theory_statement) to swap the integral and the gradient, and in the fifth equation we used the log-derivative trick $\nabla f(x) = f(x) \nabla \log f(x)$. # The gradients can be obtained as long as $\log q_{\phi}(\mathbf{z} \mid \mathbf{x})$ is differentiable with respect to the parameters $\phi$. # In practice the expectation in the last line is approximated by Monte Carlo averaging and the gradients are then the average gradients over samples $\mathbf{z}_1, \ldots, \mathbf{z}_M \sim q_{\phi}(\mathbf{z} \mid \mathbf{x})$. # # Importantly, as with the pathwise gradients, we can use off-the-shelf autograd tools, available in most ML frameworks to obtain the necessary gradients. # In this tutorial we will compare VAE fitting when using the pathwise and score function gradient estimators. # We will therefore use a variational distribution for which the reparametrisation is available (i.e. 
Gaussian), but to evaluate the method with the score function gradient estimator we will simulate a non-reparametrisable distribution by [detaching](https://pytorch.org/docs/stable/generated/torch.Tensor.detach.html) the computation graph to stop the _pathwise_ gradients. The below code implements the non-reparametrised Gaussian sampling. We will then ask you to implement a surrogate ELBO, which enables model estimation using the score function gradient estimator. # In[10]: def sample_gaussian_without_reparametrisation(mean, logvar, *, num_samples=1): """ Samples the Gaussian distribution without attaching the samples to the computation graph Args: mean (Tensor): mean of diagonal Gaussian q, a batch of shape (B, K) logvar (Tensor): log-variance of diagonal Gaussian q, a batch of shape (B, K) num_samples (int): The number of samples (M) to approximate the expectation in the ELBO Returns: Z (Tensor): Samples Z from the diagonal Gaussian q, a batch of shape (num_samples, B, K) """ return sample_gaussian_with_reparametrisation(mean, logvar, num_samples=num_samples).detach() # #### Exercise 5 (optional, not examinable) # # Implement a surrogate variational ELBO $\mathcal{\tilde L}_i(\theta,\phi)$, which enables the optimisation of a VAE and uses the score function gradient estimator. # #
#
Hint 1 (click)Use the code implemented earlier.
#
Hint 2 (click) # # You might want to use the [`Tensor.detach()`](https://pytorch.org/docs/stable/generated/torch.Tensor.detach.html) function to create the surrogate ELBO with score function gradients. When a detached term appears in the loss, it will be treated as constant with respect to the variables of differentiation (i.e. the parameters). #
#
Hint 3 (click) # # For an example of a score function gradient estimator implementation in PyTorch, read [here](https://pytorch.org/docs/stable/distributions.html#score-function). #
#
Spoiler 1! Click this to see the surrogate ELBO. # # We can write down the surrogate ELBO as follows # # $$ # \mathcal{\tilde L}_i(\theta, \phi) = \mathbb{E}_{\text{sg}\{q_{\phi}(\mathbf{z}\mid \mathbf{x}^i)\}} \left[ \log p_{\theta}(\mathbf{x}^i \mid \mathbf{z}) + \log q_{\phi}(\mathbf{z}\mid \mathbf{x}^i) \text{sg}\{\log p_{\theta}(\mathbf{x}^i \mid \mathbf{z})\} - \text{sg}\{\log q_{\phi}(\mathbf{z}\mid \mathbf{x}^i) \log p_{\theta}(\mathbf{x}^i \mid \mathbf{z})\} \right] - \text{KL}\left(q_{\phi}(\mathbf{z} \mid \mathbf{x}^i) \mid\mid p(\mathbf{z})\right), # $$ # # where we use a stop-gradient operator $\text{sg}(\cdot)$ to stop gradient flow, equivalent to the `Tensor.detach()`. # # Taking the gradient of the above, treating the $\text{sg}(\cdot)$ as constants, yields the same score function gradients we have derived above # # $$ # \begin{align} # \nabla_{\theta, \phi} \mathcal{\tilde L}_i(\theta, \phi) &= \nabla_{\theta, \phi} \mathbb{E}_{\text{sg}\{q_{\phi}(\mathbf{z}\mid \mathbf{x}^i)\}} \left[ \log p_{\theta}(\mathbf{x}^i \mid \mathbf{z}) + \log q_{\phi}(\mathbf{z}\mid \mathbf{x}^i) \text{sg}\{\log p_{\theta}(\mathbf{x}^i \mid \mathbf{z})\} - \text{sg}\{\log q_{\phi}(\mathbf{z}\mid \mathbf{x}^i) \log p_{\theta}(\mathbf{x}^i \mid \mathbf{z})\} \right] - \nabla_{\theta, \phi} \text{KL}\left(q_{\phi}(\mathbf{z} \mid \mathbf{x}^i) \mid\mid p(\mathbf{z})\right)\\ # &= \mathbb{E}_{q_{\phi}(\mathbf{z}\mid \mathbf{x}^i)} \left[ \nabla_{\theta, \phi} \left\{ \log p_{\theta}(\mathbf{x}^i \mid \mathbf{z})\right\} + \nabla_{\theta, \phi} \left\{\log q_{\phi}(\mathbf{z}\mid \mathbf{x}^i) \text{sg}\{\log p_{\theta}(\mathbf{x}^i \mid \mathbf{z})\}\right\} - \nabla_{\theta, \phi}\left\{\text{sg}\{\log q_{\phi}(\mathbf{z}\mid \mathbf{x}^i) \log p_{\theta}(\mathbf{x}^i \mid \mathbf{z})\}\right\} \right] \\&\phantom{====}- \nabla_{\theta, \phi} \text{KL}\left(q_{\phi}(\mathbf{z} \mid \mathbf{x}^i) \middle|\middle| p(\mathbf{z})\right)\\ # &= \mathbb{E}_{q_{\phi}(\mathbf{z}\mid \mathbf{x}^i)} \left[ \nabla_{\theta, \phi} \log p_{\theta}(\mathbf{x}^i \mid \mathbf{z}) + \nabla_{\theta, \phi} \left\{\log q_{\phi}(\mathbf{z}\mid \mathbf{x}^i) \right\} \log p_{\theta}(\mathbf{x}^i \mid \mathbf{z}) \right] - \nabla_{\theta, \phi} \text{KL}\left(q_{\phi}(\mathbf{z} \mid \mathbf{x}^i) \middle|\middle| p(\mathbf{z})\right)\\ # &= \nabla_{\theta, \phi} \mathcal{L}_i(\theta, \phi) # \end{align} # $$ #
#
Solution # # For this optional task we provide you with the solution below; simply copy it into the appropriate location in the code cell below. # # ```python # # KLD( q(z|x) || p(z) ) # kld = compute_kld_with_standard_gaussian(mean, logvar) # # # log p(x | z) # log_px_given_z = evaluate_logprob_continuous_bernoulli(X, logits=logits) # # # ~E_{q(z|x)}[ p(x | z) ] # neg_cross_entropy = log_px_given_z.mean(dim=0) # # # Per data-point ELBO, note that this ELBO will not estimate the gradients of the expectation parameters! # elbos = neg_cross_entropy - kld # # # log q(z | x) # log_qz_given_x = evaluate_logprob_diagonal_gaussian(Z, mean=mean, logvar=logvar) # # # Surrogate loss for score function gradient estimator # surrogate_loss = (log_qz_given_x*log_px_given_z.detach()).mean(dim=0) # # # ELBO with score function gradients # surrogate_elbos = elbos + surrogate_loss - surrogate_loss.detach() # ``` # #
# In[11]: def elbo_with_score_function_gradients(X, *, encoder, decoder, num_samples): """ Evaluate the surrogate ELBO for each data-point in X. The surrogate elbo uses score function gradient estimator to estimate the gradients of the variational distribution. Args: X (Tensor): data, a batch of shape (B, D) decoder (torch.nn.Module): is a neural network that provides the parameters of the generative model p(x|z; θ) encoder (torch.nn.Module): is a neural network that provides the parameters of the variational distribution q(z|x; φ). num_samples (int): the number of samples used in the Monte Carlo averaging Returns: elbos (Tensor): the ELBOs for each input X, a tensor of shape (B,) """ # Evaluate the encoder network to obtain the parameters of the # variational distribution mean, logvar = encoder(X) # (B, H), (B, H) # Sample the latents _without_ the reparametrisation trick Z = sample_gaussian_without_reparametrisation( mean, logvar, num_samples=num_samples) # (M, B, H) # Evaluate the decoder network to obtain the parameters of the # generative model p(x|z) logits = decoder(Z) # (M, B, D) ### IMPLEMENT YOUR CODE BELOW # # ELBO # # KLD( q(z|x) || p(z) ) # kld = ... # log p(x | z) # log_px_given_z = ... # ~E_{q(z|x)}[ p(x | z) ] # neg_cross_entropy = ... # Per-data-point ELBO that does not estimate the gradients of the expectation parameters # elbos = ... # # Score function surrogate # # log q(z | x) # log_qz_given_x = ... # Surrogate loss for score function gradient estimator of the expectation parameters # surrogate_loss = ... # ELBO with score function gradients # surrogate_elbos = ... # # ELBO # # KLD( q(z|x) || p(z) ) kld = compute_kld_with_standard_gaussian(mean, logvar) # log p(x | z) log_px_given_z = evaluate_logprob_continuous_bernoulli(X, logits=logits) # ~E_{q(z|x)}[ p(x | z) ] neg_cross_entropy = log_px_given_z.mean(dim=0) # Per data-point ELBO, note that this ELBO will not estimate the gradients of the expectation parameters! elbos = neg_cross_entropy - kld # # Score function surrogate # # log q(z | x) log_qz_given_x = evaluate_logprob_diagonal_gaussian(Z, mean=mean, logvar=logvar) # Surrogate loss for score function gradient estimator surrogate_loss = (log_qz_given_x*log_px_given_z.detach()).mean(dim=0) # ELBO with score function gradients surrogate_elbos = elbos + surrogate_loss - surrogate_loss.detach() ### return surrogate_elbos # ## VAE implementation # # Below we implement the PyTorch models for the encoder and decoder networks as well as a wrapper model to represent the VAE using the functions you have implemented above. You should read the code and understand it. # # We will use simple fully-connected neural networks but feel free to experiment with other neural network architectures, such as residual networks or convolutional layers, which can often improve the performance of the method for specific tasks. # In[12]: class Encoder(nn.Module): """ Encoder or inference network that predicts the parameters of the variational distribution q(z|x). 
""" def __init__(self, hparams): super().__init__() self.hparams = hparams enc_layer_dims = self.hparams.enc_layer_dims # Create all layers except last layers = [] for i in range(len(enc_layer_dims)-2): layers.append(nn.Linear(enc_layer_dims[i], enc_layer_dims[i+1])) # Use a non-linearity layers.append(nn.ReLU()) self.model = nn.Sequential(*layers) # Create separate final layers for each parameter (mean and log-variance) # We use log-variance to unconstrain the optimisation of the positive-only variance parameters self.mean = nn.Linear(enc_layer_dims[-2], enc_layer_dims[-1]) self.logvar = nn.Linear(enc_layer_dims[-2], enc_layer_dims[-1]) @staticmethod def add_model_args(parser): """Here we define the arguments for our encoder model.""" parser.add_argument('--enc_layer_dims', type=int, nargs='+', help='Encoder layer dimensions.') return parser def forward(self, X): """ Predicts the parameters of the variational distribution Args: X (Tensor): data, a batch of shape (B, D) Returns: mean (Tensor): means of the variational distributions, shape (B, K) logvar (Tensor): log-variances of the diagonal Gaussian variational distribution, shape (B, K) """ features = self.model(X) mean = self.mean(features) logvar = self.logvar(features) return mean, logvar def sample_with_reparametrisation(self, mean, logvar, *, num_samples=1): # Reuse the implemented code return sample_gaussian_with_reparametrisation(mean, logvar, num_samples=num_samples) def sample_without_reparametrisation(self, mean, logvar, *, num_samples=1): # Reuse the implemented code return sample_gaussian_without_reparametrisation(mean, logvar, num_samples=num_samples) def log_prob(self, mean, logvar, Z): """ Evaluates the log_probability of Z given the parameters of the diagonal Gaussian Args: mean (Tensor): means of the variational distributions, shape (*, B, K) logvar (Tensor): log-variances of the diagonal Gaussian variational distribution, shape (*, B, K) Z (Tensor): latent vectors, shape (*, B, K) Returns: logqz (Tensor): log-probability of Z, a batch of shape (*, B) """ # Reuse the implemented code return evaluate_logprob_diagonal_gaussian(Z, mean=mean, logvar=logvar) # In[13]: class Decoder(nn.Module): """ Decoder or generative network that computes the parameters of the likelihood. 
""" def __init__(self, hparams): super().__init__() self.hparams = hparams dec_layer_dims = self.hparams.dec_layer_dims # Create all layers except last layers = [] for i in range(len(dec_layer_dims)-2): layers.append(nn.Linear(dec_layer_dims[i], dec_layer_dims[i+1])) # Add non-linearity layers.append(nn.ReLU()) self.model = nn.Sequential(*layers) # Create final layers that predicts the parameters of the continuous Bernoulli self.logits = nn.Linear(dec_layer_dims[-2], dec_layer_dims[-1]) @staticmethod def add_model_args(parser): """Here we define the arguments for our decoder model.""" parser.add_argument('--dec_layer_dims', type=int, nargs='+', help='Decoder layer dimensions.') return parser def forward(self, Z): """ Computes the parameters of the generative distribution p(x | z) Args: Z (Tensor): latent vectors, a batch of shape (M, B, K) Returns: logits (Tensor): parameters of the continuous Bernoulli, shape (M, B, D) """ features = self.model(Z) logits = self.logits(features) return logits def log_prob(self, logits, X): """ Evaluates the log_probability of X given the parameters of the continuous Bernoulli Args: logits (Tensor): parameters of the continuous Bernoulli, shape (*, B, D) X (Tensor): data, shape (*, B, D) Returns: logpx (Tensor): log-probability of X, a batch of shape (*, B) """ # Reuse the implemented code return evaluate_logprob_continuous_bernoulli(X, logits=logits) # Some extra methods for analysis def sample(self, logits, *, num_samples=1): """ Samples the continuous Bernoulli Args: logits (Tensor): parameters of the continuous Bernoulli, shape (*, B, D) num_samples (int): number of samples Returns: X (Tensor): samples from the distribution, shape (num_samples, *, B, D) """ cb = torch.distributions.ContinuousBernoulli(logits=logits) return cb.sample((num_samples,)) def mean(self, logits): """ Returns the mean of the continuous Bernoulli Args: logits (Tensor): parameters of the continuous Bernoulli, shape (*, B, D) Returns: mean (Tensor): means of the continuous Bernoulli, shape (*, B, D) """ cb = torch.distributions.ContinuousBernoulli(logits=logits) return cb.mean def param_p(self, logits): """ Returns the success probability p of the continuous Bernoulli Args: logits (Tensor): parameters of the continuous Bernoulli, shape (*, B, D) Returns: p (Tensor): success probability p of the continuous Bernoulli, shape (*, B, D) """ cb = torch.distributions.ContinuousBernoulli(logits=logits) return cb.probs # In[14]: class VAE(nn.Module): """ A wrapper for the VAE model """ def __init__(self, hparams): super().__init__() self.hparams = hparams # Use the encoder and decoder implemented above self.encoder = Encoder(hparams) self.decoder = Decoder(hparams) @staticmethod def add_model_args(parser): """Here we define the arguments for our decoder model.""" parser = Encoder.add_model_args(parser) parser = Decoder.add_model_args(parser) parser.add_argument('--gradient_type', type=str, choices=['pathwise', 'score'], help='Variational model gradient estimation method.') parser.add_argument('--num_variational_samples', type=int, default=1, help=('The number of samples from the variational ' 'distribution to approximate the expectation.')) return parser def forward(self, X): """ Computes the variational ELBO Args: X (Tensor): data, a batch of shape (B, K) Returns: elbos (Tensor): per data-point elbos, shape (B, D) """ if self.hparams.gradient_type == 'pathwise': return self.elbo_with_pathwise_gradients(X) elif self.hparams.gradient_type == 'score': return 
self.elbo_with_score_function_gradients(X) else: raise ValueError(f'gradient_type={self.hparams.gradient_type} is invalid') def elbo_with_pathwise_gradients(self, X): # Reuse the implemented code return elbo_with_pathwise_gradients(X, encoder=self.encoder, decoder=self.decoder, num_samples=self.hparams.num_variational_samples) def elbo_with_score_function_gradients(self, X): # Reuse the implemented code return elbo_with_score_function_gradients(X, encoder=self.encoder, decoder=self.decoder, num_samples=self.hparams.num_variational_samples) # #### Exercise 6 # # We want you to be able to read, understand, and reason about scientific code. The following questions are here to facilitate your thoughts. # # 1. What do the `Decoder.forward()` and `Encoder.forward()` methods do? # 2. What would you have to change about the model and code if the data had a different domain, for example the whole real domain $\mathbb{R}^D$ instead of only $[0, 1]^D$? #
Answer (click) # # 1. The `Decoder.forward()` predicts the parameters of the pdf (or pmf) of the conditional distribution $p_{\theta}(\mathbf{x}\mid\mathbf{z})$, and the `Encoder.forward()` predicts the parameters of the pdf (or pmf) of the variational distribution $q_{\phi}(\mathbf{z}\mid \mathbf{x})$. # # 2. If the data were distributed over the whole real domain, we could specify a different generative distribution $p_{\theta}(\mathbf{x} \mid \mathbf{z})$, for example a low-rank Gaussian or a Student's t-distribution. You would then have to implement a decoder that predicts the mean and covariance matrix of the Gaussian (or, equivalently, the parameters of the Student's t), and use the corresponding probability density function when computing the ELBO, i.e. change the computation of the `neg_cross_entropy` variable in the provided solutions. # #
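# To make the second answer more concrete, here is a minimal sketch (our illustration, not code used later in the tutorial) of how the reconstruction term could be evaluated for real-valued data with a diagonal Gaussian likelihood, assuming the decoder were modified to output a mean and a log-variance per dimension; it mirrors `evaluate_logprob_diagonal_gaussian` from earlier.
#
# ```python
# # Hypothetical likelihood evaluation for real-valued data x in R^D.
# def evaluate_logprob_gaussian_likelihood(X, *, mean, logvar):
#     """Log-density of a diagonal Gaussian p(x|z), returns shape (*, B)."""
#     gauss = torch.distributions.Normal(loc=mean, scale=torch.exp(0.5 * logvar))
#     return gauss.log_prob(X).sum(dim=-1)
#
# # The reconstruction term would then become, e.g.,
# # neg_cross_entropy = evaluate_logprob_gaussian_likelihood(X, mean=dec_mean, logvar=dec_logvar).mean(dim=0)
# # where dec_mean and dec_logvar are hypothetical decoder outputs.
# ```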
# # # #### Exercise 7 (optional, not examinable) # # 1. If you wanted to use a different lower-bound, for example the [importance-weighted ELBO](https://arxiv.org/abs/1509.00519) # $$ # \mathcal{L}^{IW}_i(\theta, \phi) = \mathbb{E}_{\mathbf{z}_1 \ldots \mathbf{z}_T \sim q_{\phi}(\mathbf{z}\mid\mathbf{x}^i)} \left[ \log \frac{1}{T} \sum_{t=1}^T \frac{p_{\theta}(\mathbf{x}^i, \mathbf{z}_t)}{q_{\phi}(\mathbf{z}_t \mid \mathbf{x}^i)}\right] # $$ # which parts of the code would you change? # 2. How would you compute the term inside the expectation $\log \frac{1}{T} \sum_{t=1}^T \frac{p_{\theta}(\mathbf{x}, \mathbf{z}_t)}{q_{\phi}(\mathbf{z}_t \mid \mathbf{x})}$? #
Hint (click)Think about the numerical stability of the computation.
#
Answer (click) # # 1. You could simply replace the code in the `VAE.forward()` method with the computation of the $\mathcal{L}^{IW}_i$ above. Note that you would not be able to compute the KL divergence analytically as before; instead, the whole ELBO would have to be approximated using Monte Carlo averaging. # # The importance-weighted ELBO (IWELBO) has some nice properties, for example, the IWELBO becomes monotonically tighter as you increase the number of importance samples $T$, and with $T=1$ it equals the standard ELBO, whereas with $T\rightarrow \infty$ it approaches the marginal log-likelihood $\log p_{\theta}(\mathbf{x}^i)$. The IWELBO has been used for training more [accurate VAE models](https://arxiv.org/abs/1509.00519), estimating $\log p_{\theta}(\mathbf{x}^i)$ (see [here](http://bayesiandeeplearning.org/2018/papers/101.pdf)), and fitting VAEs in the presence of [missing variables](https://arxiv.org/abs/1812.02633). # # 2. Computations in the probability domain, such as $\frac{p_{\theta}(\mathbf{x}, \mathbf{z}_t)}{q_{\phi}(\mathbf{z}_t \mid \mathbf{x})}$, are often numerically unstable, especially when the random variables are high-dimensional. Instead it is best to perform any computations in the log-probability domain. You should note that the $\log \frac{1}{T} \sum_{t=1}^T \frac{p_{\theta}(\mathbf{x}, \mathbf{z}_t)}{q_{\phi}(\mathbf{z}_t \mid \mathbf{x})}$ term is a composition of a _log_, a _summation_, as well as $p_{\theta}(\mathbf{x}, \mathbf{z}_t)$ and $q_{\phi}(\mathbf{z}_t \mid \mathbf{x})$ terms, which we have previously computed in the log-domain. Hence, to preserve numerical stability you should use the [log-sum-exp trick](https://en.wikipedia.org/wiki/LogSumExp) # $$ # \begin{align} # \log \frac{1}{T} \sum_{t=1}^T \frac{p_{\theta}(\mathbf{x}, \mathbf{z}_t)}{q_{\phi}(\mathbf{z}_t \mid \mathbf{x})} # &= # - \log T + \log \sum_{t=1}^T \exp\left(\log p_{\theta}(\mathbf{x}, \mathbf{z}_t) - \log q_{\phi}(\mathbf{z}_t \mid \mathbf{x})\right) \\ # &= # -\log T + r_{\text{amax}} + \log \sum_{t=1}^T \exp\left(r_{t} - r_{\text{amax}}\right), # \end{align} # $$ # where $r_t = \log p_{\theta}(\mathbf{x}, \mathbf{z}_t) - \log q_{\phi}(\mathbf{z}_t \mid \mathbf{x})$ and $\text{amax} = \mathrm{argmax}_t (\{r_1, \ldots, r_T\})$. The log-sum-exp operation is readily implemented in PyTorch as [`torch.logsumexp()`](https://pytorch.org/docs/stable/generated/torch.logsumexp.html). #
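# As a minimal sketch of the second answer (our illustration, assuming `log_pxz` and `log_qz` are tensors of shape `(T, B)` holding $\log p_{\theta}(\mathbf{x}, \mathbf{z}_t)$ and $\log q_{\phi}(\mathbf{z}_t \mid \mathbf{x})$ for $T$ importance samples), the per-data-point IWELBO could be computed as:
#
# ```python
# import math
#
# # Hypothetical helper: per-data-point importance-weighted ELBO from log-probabilities.
# def iwelbo_from_logprobs(log_pxz, log_qz):
#     log_w = log_pxz - log_qz                             # (T, B) log-weights r_t
#     T = log_w.shape[0]
#     return torch.logsumexp(log_w, dim=0) - math.log(T)   # (B,)
# ```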
# # # # ### Training code # # Below we provide an implementation of the training loop code that we will use to fit the VAE models. # First we define all hyperparameters used in the below code. You will be able to override the default values later. # In[15]: def create_base_argparser(): parser = argparse.ArgumentParser(description='VAE MNIST Example') parser.add_argument('--seed', type=int, default=20211201, help='Random seed for reproducible runs.') parser.add_argument('--cuda', action='store_true', default=False, help='Uses CUDA training if available (default: False)') parser.add_argument('--batch_size', type=int, default=50, help='Batch size for training (default: 50)') parser.add_argument('--epochs', type=int, default=3000, help='Number of epochs to train (default: 3000)') parser.add_argument('--learning_rate', type=float, default=1e-4, help='Learning rate for the Adam optimiser (default: 0.0001)') return parser parser = create_base_argparser() parser = VAE.add_model_args(parser) parser.print_help() # We now implement the training loop that iterates over mini-batches of data and fits the model using stochastic gradient ascent. The below code monitors the ELBO on a held out test set and at the end of the training loads the model checkpoint corresponding to the best test ELBO. # In[16]: def per_datapoint_elbo_to_avgelbo_and_loss(elbos): # Compute the average ELBO over the mini-batch elbo = elbos.mean(0) # We want to _maximise_ the ELBO, but the SGD implementations # do minimisation by default, hence we multiply the ELBO by -1. loss = -elbo return elbo, loss def create_dataloader(data, args): kwargs = {'num_workers': 1, 'pin_memory': True} if args.cuda else {} return torch.utils.data.DataLoader( data, batch_size=args.batch_size, shuffle=True, **kwargs) def fit_model(model, optimizer, train_data, args, *, test_data=None): # We will plot the learning curves during training fig, ax = plt.subplots(1, 1, figsize=(8, 4)) # Create data loaders train_loader = create_dataloader(train_data, args) if test_data is not None: test_loader = create_dataloader(test_data, args) train_epochs = [] train_elbos = [] train_avg_epochs = [] train_avg_elbos = [] test_avg_epochs = [] test_avg_elbos = [] # We will use these to track the best performing model on test data best_avg_test_elbo = float('-inf') best_epoch = None best_model_state = None best_optim_state = None pbar = tqdm(range(1, args.epochs + 1)) for epoch in pbar: # Train model.train() epoch_train_elbos = [] # We don't use labels hence discard them with a _ for batch_idx, (mbatch, _) in enumerate(train_loader): mbatch = mbatch.to(args.device) # Flatten the images mbatch = mbatch.view([-1] + [mbatch.shape[-2]*mbatch.shape[-1]]) # Reset gradient computations in the computation graph optimizer.zero_grad() # Compute the loss for the mini-batch elbo, loss = per_datapoint_elbo_to_avgelbo_and_loss(model(mbatch)) # Compute the gradients using backpropagation loss.backward() # Perform an SGD update optimizer.step() epoch_train_elbos += [elbo.detach().item()] pbar.set_description((f'Train Epoch: {epoch} [{batch_idx * len(mbatch)}/{len(train_loader.dataset)}' f'({100. 
* batch_idx / len(train_loader):.0f}%)] ELBO: {elbo:.6f}')) # Test if test_data is not None: with torch.inference_mode(): model.eval() epoch_test_elbos = [] for batch_idx, (mbatch, _) in enumerate(test_loader): mbatch = mbatch.to(args.device) # Flatten the images mbatch = mbatch.view([-1] + [mbatch.shape[-2]*mbatch.shape[-1]]) # Compute the loss for the test mini-batch elbo, loss = per_datapoint_elbo_to_avgelbo_and_loss(model(mbatch)) epoch_test_elbos += [elbo.detach().item()] pbar.set_description((f'Test Epoch: {epoch} [{batch_idx * len(mbatch)}/{len(test_loader.dataset)} ' f'({100. * batch_idx / len(test_loader):.0f}%)] ELBO: {elbo:.6f}')) # Store epoch summary in list train_avg_epochs += [epoch] train_avg_elbos += [np.mean(epoch_train_elbos)] train_epochs += np.linspace(epoch-1, epoch, len(epoch_train_elbos)).tolist() train_elbos += epoch_train_elbos if test_data is not None: test_avg_epochs += [epoch] epoch_avg_test_elbo = np.mean(epoch_test_elbos) test_avg_elbos += [epoch_avg_test_elbo] # Snapshot best model if epoch_avg_test_elbo > best_avg_test_elbo: best_avg_test_elbo = epoch_avg_test_elbo best_epoch = epoch best_model_state = deepcopy(model.state_dict()) best_optim_state = deepcopy(optimizer.state_dict()) # Update learning curve figure ax.clear() ax.plot(train_epochs, train_elbos, color='b', alpha=0.5, label='train') ax.plot(np.array(train_avg_epochs)-0.5, train_avg_elbos, color='b', label='train (avg)') if len(test_avg_elbos) > 0: ax.plot(np.array(test_avg_epochs)-0.5, test_avg_elbos, color='r', label='test (avg)') ax.grid(True) ax.xaxis.set_major_locator(mpl.ticker.MaxNLocator(integer=True)) ax.legend(loc='lower right') ax.set_ylabel('ELBO') ax.set_xlabel('Epoch') fig_title = f'Epoch: {epoch}, Avg. train ELBO: {np.mean(epoch_train_elbos):.2f}, Avg. test ELBO: {np.mean(epoch_test_elbos):.2f}' # If we are tracking best model, then also highlight it on the plot and figure title if best_avg_test_elbo != float('-inf'): fig_title += f', Best avg. test ELBO: {best_avg_test_elbo:.2f}' ax.scatter(best_epoch-0.5, best_avg_test_elbo, marker='*', color='r') fig.suptitle(fig_title, size=13) fig.tight_layout() display.clear_output(wait=True) if epoch != args.epochs: # Force display of the figure (except last epoch, where # jupyter automatically shows the contained figure) display.display(fig) # Reset gradient computations in the computation graph optimizer.zero_grad() if best_model_state is not None and best_epoch != args.epochs: print(f'Loading best model state from epoch {best_epoch}.') model.load_state_dict(best_model_state) if best_optim_state is not None and best_epoch != args.epochs: print(f'Loading best optimizer state from epoch {best_epoch}.') optimizer.load_state_dict(best_optim_state) out = { 'train_avg_epochs': train_avg_epochs, 'train_avg_elbos': train_avg_elbos, 'train_epochs': train_epochs, 'train_elbos': train_elbos, 'test_avg_epochs': test_avg_epochs, 'test_avg_elbos': test_avg_elbos } return model, optimizer, out, fig def save_state(model, optimizer, file): return torch.save({ 'model_state_dict': model.state_dict(), 'optimizer_state_dict': optimizer.state_dict(), }, file) def load_state(model, optimizer, file): checkpoint = torch.load(file) model.load_state_dict(checkpoint['model_state_dict']) optimizer.load_state_dict(checkpoint['optimizer_state_dict']) return # ## Fit VAE with the pathwise gradient estimator # # We now use the code we have implemented earlier to fit a VAE using the _pathwise gradient estimator_ for fitting the variational distribution. 
# # We use a simple one-hidden-layer neural network for the encoder and the decoder, approximate the expectation using one sample from the variational distribution, and use a learning rate of 0.001 with an Adam SGD optimiser. We train the model for 50 epochs over the training dataset. # In[17]: # Specify the hyperparameter choices data_dim = train_data[0][0].numel() args_path = ['--dec_layer_dims', '2', '100', f'{data_dim}', '--enc_layer_dims', f'{data_dim}', '100', '2', '--gradient_type', 'pathwise', '--num_variational_samples', '1', '--epochs', '50', '--learning_rate', '1e-3', '--cuda'] args_path = parser.parse_args(args_path) args_path.cuda = args_path.cuda and torch.cuda.is_available() args_path.device = torch.device("cuda" if args_path.cuda else "cpu") # Seed all random number generators for reproducibility of the runs seed_everything(args_path.seed) # Initialise the model and the Adam (SGD) optimiser model_path = VAE(args_path).to(args_path.device) optimizer_path = optim.Adam(model_path.parameters(), lr=args_path.learning_rate) # The below experiment should not take longer than 20 minutes. If it takes too long, try to reduce the data dimensionality in the [MNIST section](#MNIST). After the first epoch is finished, the progress bar below will show the estimated time left. # # If you do not wish to train the model yourself, you can comment out the following cell and uncomment the cell with `load_state(...)` code below. # In[19]: # Fit the VAE model model_path, optimizer_path, out_path, fig = fit_model(model_path, optimizer_path, train_data, args_path, test_data=test_data) # #### Snapshot # # If you would like to save the trained model, uncomment the code below. We have also provided code to load the saved model from a file (see the Jupyter cell before training). # See PyTorch documentation on [saving and loading](https://pytorch.org/tutorials/beginner/saving_loading_models.html#saving-loading-a-general-checkpoint-for-inference-and-or-resuming-training) models for more details. # In[20]: # save_state(model_path, optimizer_path, 'VAE_path.pt') # You can load a pre-trained model by uncommenting the below line. # In[18]: # As an alternative to training a model from scratch, you can load a pre-trained model # load_state(model_path, optimizer_path, 'VAE_path.pt') # ## Fit VAE with score function gradient estimator (optional, not examinable) # # We now use the code we have implemented earlier to fit a VAE using the _score function gradient estimator_ for fitting the variational distribution. # # We use the same neural network architecture as before: a simple one-hidden-layer neural network for the encoder and decoder, and we approximate the expectation using one sample from the variational distribution. # Due to the higher variance of the gradients we cannot use the same learning rate; hence we use a lower learning rate of 0.0001 with the Adam SGD optimiser. As before, we train the model for 50 epochs over the training dataset. # # If you have not implemented the code in [Exercise 5](#Exercise-5-(optional%29), see the solution by clicking the highlighted text below the question and copy-paste the solution into the code block below [Exercise 5](#Exercise-5-(optional%29); you can then proceed to run the following experiment.
# In[21]: # Specify the hyperparameter choices data_dim = train_data[0][0].numel() args_score = ['--dec_layer_dims', '2', '100', f'{data_dim}', '--enc_layer_dims', f'{data_dim}', '100', '2', '--gradient_type', 'score', '--num_variational_samples', '1', '--epochs', '50', '--learning_rate', '1e-4', '--cuda'] args_score = parser.parse_args(args_score) args_score.cuda = args_score.cuda and torch.cuda.is_available() args_score.device = torch.device("cuda" if args_score.cuda else "cpu") # Seed all random number generators for reproducibility of the runs seed_everything(args_score.seed) # Initialise the model and the Adam (SGD) optimiser model_score = VAE(args_score).to(args_score.device) optimizer_score = optim.Adam(model_score.parameters(), lr=args_score.learning_rate) # The below experiment should not take longer than 20 minutes. If it takes too long, try to reduce the data dimensionality in the [MNIST section](#MNIST). After the first epoch is finished, the progress bar below will show the estimated time left. # # If you do not wish to train the model yourself, you can comment out the following cell and uncomment the cell with `load_state(...)` code below. # In[23]: # Fit the VAE model model_score, optimizer_score, out_score, fig = fit_model(model_score, optimizer_score, train_data, args_score, test_data=test_data) # #### Snapshot # # If you would like to save the trained model, uncomment the code below. We have also provided code to load the saved model from a file (see the Jupyter cell before training). # See PyTorch documentation on [saving and loading](https://pytorch.org/tutorials/beginner/saving_loading_models.html#saving-loading-a-general-checkpoint-for-inference-and-or-resuming-training) models for more details. # In[24]: # save_state(model_score, optimizer_score, 'VAE_score.pt') # You can load a pre-trained model by uncommenting the below line. # In[22]: # As an alternative to training a model from scratch, you can load a pre-trained model # load_state(model_score, optimizer_score, 'VAE_score.pt') # ## Compare estimates using pathwise vs score function gradient estimators (optional, not examinable) # # Below we plot the test ELBO curves against the training epochs for the VAEs fitted above with pathwise and score function gradient estimators. You should see that the pathwise estimator performs significantly better than the score function estimator. Since we kept most of the hyperparameters the same between the two methods, the performance gap can be attributed to the higher gradient variance of the score function estimator. # # In the following section we discuss one way to improve the performance of the score function estimator by reducing gradient variance.
# In[25]: fig, ax = plt.subplots(1, 1, figsize=(8, 4)) ax.plot(out_path['test_avg_epochs'], out_path['test_avg_elbos'], color='r', label='pathwise') amax = np.argmax(out_path['test_avg_elbos']) ax.hlines(out_path['test_avg_elbos'][amax], np.min(out_path['test_avg_epochs']), np.max(out_path['test_avg_epochs']), color='r', linestyles='dotted') ax.scatter(out_path['test_avg_epochs'][amax], out_path['test_avg_elbos'][amax], color='r', marker='*') ax.plot(out_score['test_avg_epochs'], out_score['test_avg_elbos'], color='b', label='score function') amax = np.argmax(out_score['test_avg_elbos']) ax.hlines(out_score['test_avg_elbos'][amax], np.min(out_score['test_avg_epochs']), np.max(out_score['test_avg_epochs']), color='b', linestyles='dotted') ax.scatter(out_score['test_avg_epochs'][amax], out_score['test_avg_elbos'][amax], color='b', marker='*') ax.xaxis.set_major_locator(mpl.ticker.MaxNLocator(integer=True)) ax.grid(True) ax.legend(loc='lower right') ax.set_ylabel('Test ELBO') ax.set_xlabel('Epoch') # ## Variance reduction for score function estimator (optional, not examinable) # # It is possible to reduce the gradient variance of the score function estimator by using [control variates](https://en.wikipedia.org/wiki/Control_variates). It works by replacing the function $f$ whose expectation is being approximated by Monte Carlo averaging with another function that has the same expectation but smaller variance. # In the case of the score function gradient estimation of the variational distribution parameters $\phi$, the function $f$ is # # $$ # f(\mathbf{z}; \mathbf{x}) = \nabla_{\phi} \left\{\log q_{\phi}(\mathbf{z} \mid \mathbf{x})\right\}\log p_{\theta}(\mathbf{x} \mid \mathbf{z}). # $$ # # The control variates method works by defining a surrogate function # # $$ # \tilde f(\mathbf{z}; \mathbf{x}) = f(\mathbf{z}; \mathbf{x}) - c h(\mathbf{z}; \mathbf{x}), # $$ # where $c$ is a scalar hyperparameter and the function $h$ is the control variate that has an expectation of zero. 
In particular, we want a function $h$ that has low variance and a high covariance with $f$, such that the surrogate function $\tilde f$ has lower variance than $f$: # # $$ # \begin{align} # \mathbb{V}_{\mathbf{z}} \left[\tilde f(\mathbf{z}; \mathbf{x}) \right] &= \mathbb{V}_{\mathbf{z}} \left[ f(\mathbf{z}; \mathbf{x}) - c h(\mathbf{z}; \mathbf{x}) \right] \\ # &= \mathbb{E}_{\mathbf{z}} \left[ (f(\mathbf{z}; \mathbf{x}) - c h(\mathbf{z}; \mathbf{x}))^2 \right] - \left( \mathbb{E}_{\mathbf{z}} \left[ f(\mathbf{z}; \mathbf{x}) - c h(\mathbf{z}; \mathbf{x}) \right]\right)^2\\ # &= \mathbb{E}_{\mathbf{z}} \left[ (f(\mathbf{z}; \mathbf{x}) - c h(\mathbf{z}; \mathbf{x}))^2 \right] -\left( \mathbb{E}_{\mathbf{z}} \left[ f(\mathbf{z}; \mathbf{x}) \right] - 0 \right)^2\\ # &= \mathbb{E}_{\mathbf{z}} \left[ (f(\mathbf{z}; \mathbf{x}) - c h(\mathbf{z}; \mathbf{x}))^2 \right] -\mathbb{E}_{\mathbf{z}} \left[ f(\mathbf{z}; \mathbf{x}) \right]^2\\ # &= \mathbb{E}_{\mathbf{z}} \left[ f(\mathbf{z}; \mathbf{x})^2 - 2c f(\mathbf{z}; \mathbf{x})h(\mathbf{z}; \mathbf{x}) + c^2 h(\mathbf{z}; \mathbf{x})^2 \right] -\mathbb{E}_{\mathbf{z}} \left[ f(\mathbf{z}; \mathbf{x}) \right]^2\\ # &= \mathbb{E}_{\mathbf{z}} \left[ f(\mathbf{z}; \mathbf{x})^2\right] - \mathbb{E}_{\mathbf{z}} \left[ f(\mathbf{z}; \mathbf{x}) \right]^2 + \mathbb{E}_{\mathbf{z}} \left[- 2c f(\mathbf{z}; \mathbf{x})h(\mathbf{z}; \mathbf{x}) + c^2 h(\mathbf{z}; \mathbf{x})^2 \right] \\ # &= \mathbb{V}_{\mathbf{z}} \left[ f(\mathbf{z}; \mathbf{x}) \right] - 2c \mathbb{Cov}\left(f(\mathbf{z}; \mathbf{x}), h(\mathbf{z}; \mathbf{x})\right) + c^2 \mathbb{V}_{\mathbf{z}}\left[ h(\mathbf{z}; \mathbf{x}) \right] # \end{align} # $$ # # A commonly used control variate in variational inference is # # $$ # h(\mathbf{z}; \mathbf{x}) = \nabla_{\phi} \log q_{\phi}(\mathbf{z} \mid \mathbf{x}), # $$ # # and it should not be difficult to see that this choice of $h$ covaries with $f$. However, the best control variate is often problem-specific and may require domain knowledge to design. # # The optimal hyperparameter $c^* = \text{tr}\left(\mathbb{Cov}\left(f(\mathbf{z}; \mathbf{x}), h(\mathbf{z}; \mathbf{x})\right)\right) / \text{tr}\left(\mathbb{V}_{\mathbf{z}}\left[ h(\mathbf{z}; \mathbf{x}) \right]\right)$ can either be estimated from samples or chosen using a held-out set. # # We now implement the VAE with the score function gradient estimator and variance reduction. If you haven't already implemented the ELBO with the score function estimator in [Exercise 5](#Exercise-5-(optional%29), you should go back and attempt it first before reading the code below, as it contains the solution to that exercise.
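# Before the VAE implementation, here is a small optional, self-contained sketch of the control-variate idea on a toy problem (our own illustration, not part of the tutorial codebase). We estimate the score-function gradient of $\mathbb{E}_{z \sim \mathcal{N}(\phi, 1)}[z^2]$, which equals $2\phi$, use the score $h(z) = \nabla_{\phi} \log q_{\phi}(z) = z - \phi$ as the control variate, and estimate $c^*$ from the same samples; the variance should drop noticeably while the mean stays unchanged.
# In[ ]:
import numpy as np

rng = np.random.default_rng(0)
phi = 1.0
z = rng.normal(loc=phi, scale=1.0, size=100_000)

# Naive score-function estimator samples of d/dphi E[z^2]:
# f(z) = (d/dphi log N(z; phi, 1)) * z^2 = (z - phi) * z^2
f = (z - phi) * z**2
# Control variate: the score itself, which has expectation zero
h = z - phi

# Estimate the optimal coefficient c* = Cov(f, h) / Var(h) from the samples
c_star = np.cov(f, h)[0, 1] / np.var(h)
f_cv = f - c_star * h

print(f'true gradient        : {2 * phi:.3f}')
print(f'naive estimate       : {f.mean():.3f} (sample variance {f.var():.1f})')
print(f'with control variate : {f_cv.mean():.3f} (sample variance {f_cv.var():.1f})')
print(f'estimated c*         : {c_star:.3f}')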
# In[26]: class VAEwScoreVR(VAE): """VAE using score function gradient estimator and variance reduction""" @staticmethod def add_model_args(parser): """Here we define the arguments for our encoder model.""" parser = super(VAEwScoreVR, VAEwScoreVR).add_model_args(parser) parser.add_argument('--vr_coeff', type=float, help='Variance reduction coefficient.') return parser def elbo_with_score_function_gradients(self, X): # Evaluate the encoder network to obtain the parameters of the # variational distribution mean, logvar = self.encoder(X) # (B, H), (B, H) # Sample the latents _without_ the reparametrisation trick Z = sample_gaussian_without_reparametrisation( mean, logvar, num_samples=self.hparams.num_variational_samples) # (M, B, H) # Evaluate the decoder network to obtain the parameters of the # generative model p(x|z) logits = self.decoder(Z) # (M, B, D) # # ELBO # # KLD( q(z|x) || p(z) ) kld = compute_kld_with_standard_gaussian(mean, logvar) # log p(x | z) log_px_given_z = evaluate_logprob_continuous_bernoulli(X, logits=logits) # ~E_{q(z|x)}[ log p(x | z) ] neg_cross_entropy = log_px_given_z.mean(dim=0) # Per data-point ELBO; note that this ELBO alone does not estimate the gradients of the expectation with respect to the variational distribution's parameters! elbos = neg_cross_entropy - kld # # Score function surrogate # # log q(z | x) log_qz_given_x = evaluate_logprob_diagonal_gaussian(Z, mean=mean, logvar=logvar) # Surrogate loss for score function gradient estimator surrogate_loss = (log_qz_given_x*log_px_given_z.detach()).mean(dim=0) # ELBO with score function gradients surrogate_elbos = elbos + surrogate_loss - surrogate_loss.detach() # # Variance reduction # avg_log_qz_given_x = log_qz_given_x.mean(dim=0) surrogate_elbos -= self.hparams.vr_coeff*(avg_log_qz_given_x - avg_log_qz_given_x.detach()) ### return surrogate_elbos # Create the argument parser for the new model. # In[27]: parser_vr = create_base_argparser() parser_vr = VAEwScoreVR.add_model_args(parser_vr) parser_vr.print_help() # ### Fit VAE with score function + VR (optional, not examinable) # # We now attempt to fit the VAE using the score function gradient estimator with the variance reduction described above. # # We use the same neural network architecture as before: a simple one-hidden-layer neural network for the encoder and decoder, and we approximate the expectation using one sample from the variational distribution. # # We use the same hyperparameters as before and set the control variate coefficient $c$ to 320 for you; feel free to try different values and see the effect this has on the performance of the method. Note that if you have changed the dimensionality of the data from the default 14x14 in the [MNIST section](#MNIST), then the optimal values will change (on 28x28 data try using $c=1400$, and on 10x10 data try using $c=100$). # # For now, to keep the computational cost similar to before, we still use 1 sample from the variational distribution to approximate the expectation.
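# Before specifying the experiment, here is a quick optional check (a toy example of our own, not part of the tutorial codebase) of the `detach`-based surrogate used in `elbo_with_score_function_gradients` above: `a + b - b.detach()` evaluates to the same value as `a`, but its gradient also includes the gradient of `b`. The same trick underlies both the score-function surrogate and the variance-reduction term.
# In[ ]:
import torch

theta = torch.tensor(2.0, requires_grad=True)
a = theta ** 2      # value we want to report;     da/dtheta = 2*theta = 4
b = 3.0 * theta     # term whose gradient we need; db/dtheta = 3

surrogate = a + b - b.detach()
print(surrogate.item())   # 4.0 -- numerically identical to a
surrogate.backward()
print(theta.grad.item())  # 7.0 -- gradient of a + b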
# In[28]: # Specify the hyperparameter choices data_dim = train_data[0][0].numel() args_score_vr = ['--dec_layer_dims', '2', '100', f'{data_dim}', '--enc_layer_dims', f'{data_dim}', '100', '2', '--gradient_type', 'score', '--num_variational_samples', '1', '--epochs', '50', '--learning_rate', '1e-4', '--vr_coeff', '320', '--cuda'] args_score_vr = parser_vr.parse_args(args_score_vr) args_score_vr.cuda = args_score_vr.cuda and torch.cuda.is_available() args_score_vr.device = torch.device("cuda" if args_score_vr.cuda else "cpu") # Seed all random number generators for reproducibility of the runs seed_everything(args_score_vr.seed) # Initialise the model and the Adam (SGD) optimiser model_score_vr = VAEwScoreVR(args_score_vr).to(args_score_vr.device) optimizer_score_vr = optim.Adam(model_score_vr.parameters(), lr=args_score_vr.learning_rate) # The below experiment should not take longer than 20 minutes. If it takes too long, try to reduce the data dimensionality in the [MNIST section](#MNIST). After the first epoch is finished, the progress bar below will show the estimated time left. # # If you do not wish to train the model yourself you can comment the following cell and uncomment the cell with `load_state(...)` code below. # In[30]: # Fit the VAE model model_score_vr, optimizer_score_vr, out_score_vr, fig = fit_model(model_score_vr, optimizer_score_vr, train_data, args_score_vr, test_data=test_data) # #### Snapshot # # If you would like to save the trained model, uncomment the below. We have also provided code to load the saved model from a file (see the Jupyter cell before training). # See PyTorch documentation on [saving and loading](https://pytorch.org/tutorials/beginner/saving_loading_models.html#saving-loading-a-general-checkpoint-for-inference-and-or-resuming-training) models for more details. # In[31]: # save_state(model_score_vr, optimizer_score_vr, 'VAE_score_vr.pt') # You can load a pre-trained model by uncommenting the below line. # In[29]: # Alternatively to training a model from scratch you can load a pre-trained model # load_state(model_score_vr, optimizer_score_vr, 'VAE_score_vr.pt') # #### Compare the estimators # # We now plot the learning curve using the variance-reduced score function estimator on top of the previous comparison plot. Notice that the performance of the method has improved at negligible additional computational expense. However, the gap to the pathwise estimator remains significant; in the following optional section we suggest increasing the number of samples in the Monte Carlo averaging to further improve the performance of the method at an additional computational cost.
# In[32]: fig, ax = plt.subplots(1, 1, figsize=(8, 4)) ax.plot(out_path['test_avg_epochs'], out_path['test_avg_elbos'], color='r', label='pathwise') amax = np.argmax(out_path['test_avg_elbos']) ax.hlines(out_path['test_avg_elbos'][amax], np.min(out_path['test_avg_epochs']), np.max(out_path['test_avg_epochs']), color='r', linestyles='dotted') ax.scatter(out_path['test_avg_epochs'][amax], out_path['test_avg_elbos'][amax], color='r', marker='*') ax.plot(out_score['test_avg_epochs'], out_score['test_avg_elbos'], color='b', label='score function') amax = np.argmax(out_score['test_avg_elbos']) ax.hlines(out_score['test_avg_elbos'][amax], np.min(out_score['test_avg_epochs']), np.max(out_score['test_avg_epochs']), color='b', linestyles='dotted') ax.scatter(out_score['test_avg_epochs'][amax], out_score['test_avg_elbos'][amax], color='b', marker='*') ax.plot(np.array(out_score_vr['test_avg_epochs']), out_score_vr['test_avg_elbos'], color='c', label='score function + vr') amax = np.argmax(out_score_vr['test_avg_elbos']) ax.hlines(out_score_vr['test_avg_elbos'][amax], np.min(out_score_vr['test_avg_epochs']), np.max(out_score_vr['test_avg_epochs']), color='c', linestyles='dotted') ax.scatter(out_score_vr['test_avg_epochs'][amax], out_score_vr['test_avg_elbos'][amax], color='c', marker='*') ax.xaxis.set_major_locator(mpl.ticker.MaxNLocator(integer=True)) ax.grid(True) ax.legend(loc='lower right') ax.set_ylabel('Test ELBO') ax.set_xlabel('Epoch') # ### Fit VAE with score function + VR + more Monte Carlo averaging (optional, not examinable) # # We can further improve the method's performance by increasing the number of Monte Carlo samples used to approximate the expectation. However, note that this can significantly increase the computational cost of the method. # # We still use the same architecture and hyperparameters as before, except that we use 100 Monte Carlo samples to approximate the expectation; with the larger Monte Carlo averaging and variance reduction we can also use a higher learning rate of 0.0007 in Adam. # # Note that if you have changed the dimensionality of the data from the default 14x14 in the [MNIST section](#MNIST), then the optimal values will change (on 28x28 data try using $c=1400$, and on 10x10 data try using $c=100$). # # Also note that running the below will take significantly longer than the previous experiments, hence it is optional. We expect that running the below should not take more than a couple of hours. # In[33]: # Specify the hyperparameter choices data_dim = train_data[0][0].numel() args_score_vr_mc100 = ['--dec_layer_dims', '2', '100', f'{data_dim}', '--enc_layer_dims', f'{data_dim}', '100', '2', '--gradient_type', 'score', '--num_variational_samples', '100', '--epochs', '50', '--learning_rate', '7e-4', '--vr_coeff', '320', '--cuda'] args_score_vr_mc100 = parser_vr.parse_args(args_score_vr_mc100) args_score_vr_mc100.cuda = args_score_vr_mc100.cuda and torch.cuda.is_available() args_score_vr_mc100.device = torch.device("cuda" if args_score_vr_mc100.cuda else "cpu") # Seed all random number generators for reproducibility of the runs seed_everything(args_score_vr_mc100.seed) # Initialise the model and the Adam (SGD) optimiser model_score_vr_mc100 = VAEwScoreVR(args_score_vr_mc100).to(args_score_vr_mc100.device) optimizer_score_vr_mc100 = optim.Adam(model_score_vr_mc100.parameters(), lr=args_score_vr_mc100.learning_rate) # The below experiment should not take longer than 2 hours. Remember, this experiment is optional! After the first epoch is finished, the progress bar below will show the estimated time left.
# # If you do not wish to train the model yourself you can comment the following cell and uncomment the cell with `load_state(...)` code below. # In[35]: # Fit the VAE model, should take no longer than 2-3 hours model_score_vr_mc100, optimizer_score_vr_mc100, out_score_vr_mc100, fig = \ fit_model(model_score_vr_mc100, optimizer_score_vr_mc100, train_data, args_score_vr_mc100, test_data=test_data) # #### Snapshot # # If you would like to save the trained model, uncomment the below. We have also provided code to load the saved model from a file (see the Jupyter cell before training). # See PyTorch documentation on [saving and loading](https://pytorch.org/tutorials/beginner/saving_loading_models.html#saving-loading-a-general-checkpoint-for-inference-and-or-resuming-training) models for more details. # In[36]: # save_state(model_score_vr_mc100, optimizer_score_vr_mc100, 'VAE_score_vr_mc100.pt') # You can load a pre-trained model by uncommenting the below line. # In[34]: # Alternatively to training a model from scratch you can load a pre-trained model # load_state(model_score_vr_mc100, optimizer_score_vr_mc100, 'VAE_score_vr_mc100.pt') # #### Compare the estimators # # At an additional computational cost the score function estimator is able to get closer in performance to the pathwise estimator. However, you might wonder which of the two changes had a higher effect: the introduction of a control variate, or the larger Monte Carlo averaging. In the next section we suggest running the score function experiment with larger Monte Carlo but without the control variate variance reduction. # In[37]: fig, ax = plt.subplots(1, 1, figsize=(8, 4)) ax.plot(out_path['test_avg_epochs'], out_path['test_avg_elbos'], color='r', label='pathwise') amax = np.argmax(out_path['test_avg_elbos']) ax.hlines(out_path['test_avg_elbos'][amax], np.min(out_path['test_avg_epochs']), np.max(out_path['test_avg_epochs']), color='r', linestyles='dotted') ax.scatter(out_path['test_avg_epochs'][amax], out_path['test_avg_elbos'][amax], color='r', marker='*') ax.plot(out_score['test_avg_epochs'], out_score['test_avg_elbos'], color='b', label='score function') amax = np.argmax(out_score['test_avg_elbos']) ax.hlines(out_score['test_avg_elbos'][amax], np.min(out_score['test_avg_epochs']), np.max(out_score['test_avg_epochs']), color='b', linestyles='dotted') ax.scatter(out_score['test_avg_epochs'][amax], out_score['test_avg_elbos'][amax], color='b', marker='*') ax.plot(np.array(out_score_vr['test_avg_epochs']), out_score_vr['test_avg_elbos'], color='c', label='score function + vr') amax = np.argmax(out_score_vr['test_avg_elbos']) ax.hlines(out_score_vr['test_avg_elbos'][amax], np.min(out_score_vr['test_avg_epochs']), np.max(out_score_vr['test_avg_epochs']), color='c', linestyles='dotted') ax.scatter(out_score_vr['test_avg_epochs'][amax], out_score_vr['test_avg_elbos'][amax], color='c', marker='*') ax.plot(np.array(out_score_vr_mc100['test_avg_epochs']), out_score_vr_mc100['test_avg_elbos'], color='y', label='score function + vr + mc=100') amax = np.argmax(out_score_vr_mc100['test_avg_elbos']) ax.hlines(out_score_vr_mc100['test_avg_elbos'][amax], np.min(out_score_vr_mc100['test_avg_epochs']), np.max(out_score_vr_mc100['test_avg_epochs']), color='y', linestyles='dotted') ax.scatter(out_score_vr_mc100['test_avg_epochs'][amax], out_score_vr_mc100['test_avg_elbos'][amax], color='y', marker='*') ax.xaxis.set_major_locator(mpl.ticker.MaxNLocator(integer=True)) ax.grid(True) ax.legend(loc='lower right') ax.set_ylabel('Test ELBO') 
ax.set_xlabel('Epoch') # ### Fit VAE with score function + w/o VR + more Monte Carlo averaging (optional, not examinable) # # To make sure that the variance reduction due to the control variates is the main source of the improvement of the score function estimator method, we suggest running the following code, which repeats the experiment above but without the variance reduction via control variates. # # Note that running the below will take significantly longer than the previous experiments, hence it is optional. We expect that running the below should not take more than a couple of hours. # # Again, we had to reduce the learning rate to 0.0001, due to the higher gradient variance observed when not using the control variates. # In[38]: # Specify the hyperparameter choices data_dim = train_data[0][0].numel() args_score_mc100 = ['--dec_layer_dims', '2', '100', f'{data_dim}', '--enc_layer_dims', f'{data_dim}', '100', '2', '--gradient_type', 'score', '--num_variational_samples', '100', '--epochs', '50', '--learning_rate', '1e-4', '--cuda'] args_score_mc100 = parser_vr.parse_args(args_score_mc100) args_score_mc100.cuda = args_score_mc100.cuda and torch.cuda.is_available() args_score_mc100.device = torch.device("cuda" if args_score_mc100.cuda else "cpu") # Seed all random number generators for reproducibility of the runs seed_everything(args_score_mc100.seed) # Initialise the model and the Adam (SGD) optimiser model_score_mc100 = VAE(args_score_mc100).to(args_score_mc100.device) optimizer_score_mc100 = optim.Adam(model_score_mc100.parameters(), lr=args_score_mc100.learning_rate) # The below experiment should not take longer than 2 hours. Remember, this experiment is optional! After the first epoch is finished, the progress bar below will show the estimated time left. # # If you do not wish to train the model yourself you can comment the following cell and uncomment the cell with `load_state(...)` code below. # In[40]: # Fit the VAE model, should take no longer than 2-3 hours model_score_mc100, optimizer_score_mc100, out_score_mc100, fig = \ fit_model(model_score_mc100, optimizer_score_mc100, train_data, args_score_mc100, test_data=test_data) # #### Snapshot # # If you would like to save the trained model, uncomment the below. We have also provided code to load the saved model from a file (see the Jupyter cell before training). # See PyTorch documentation on [saving and loading](https://pytorch.org/tutorials/beginner/saving_loading_models.html#saving-loading-a-general-checkpoint-for-inference-and-or-resuming-training) models for more details. # In[41]: # save_state(model_score_mc100, optimizer_score_mc100, 'VAE_score_mc100.pt') # You can load a pre-trained model by uncommenting the below line. # In[39]: # Alternatively to training a model from scratch you can load a pre-trained model # load_state(model_score_mc100, optimizer_score_mc100, 'VAE_score_mc100.pt') # #### Compare estimators # # We now see that the introduction of the control variates was in fact crucial to the performance improvement of the score function estimator method.
# In[42]: fig, ax = plt.subplots(1, 1, figsize=(8, 4)) ax.plot(out_path['test_avg_epochs'], out_path['test_avg_elbos'], color='r', label='pathwise') amax = np.argmax(out_path['test_avg_elbos']) ax.hlines(out_path['test_avg_elbos'][amax], np.min(out_path['test_avg_epochs']), np.max(out_path['test_avg_epochs']), color='r', linestyles='dotted') ax.scatter(out_path['test_avg_epochs'][amax], out_path['test_avg_elbos'][amax], color='r', marker='*') ax.plot(out_score['test_avg_epochs'], out_score['test_avg_elbos'], color='b', label='score function') amax = np.argmax(out_score['test_avg_elbos']) ax.hlines(out_score['test_avg_elbos'][amax], np.min(out_score['test_avg_epochs']), np.max(out_score['test_avg_epochs']), color='b', linestyles='dotted') ax.scatter(out_score['test_avg_epochs'][amax], out_score['test_avg_elbos'][amax], color='b', marker='*') ax.plot(np.array(out_score_vr['test_avg_epochs']), out_score_vr['test_avg_elbos'], color='c', label='score function + vr') amax = np.argmax(out_score_vr['test_avg_elbos']) ax.hlines(out_score_vr['test_avg_elbos'][amax], np.min(out_score_vr['test_avg_epochs']), np.max(out_score_vr['test_avg_epochs']), color='c', linestyles='dotted') ax.scatter(out_score_vr['test_avg_epochs'][amax], out_score_vr['test_avg_elbos'][amax], color='c', marker='*') ax.plot(np.array(out_score_vr_mc100['test_avg_epochs']), out_score_vr_mc100['test_avg_elbos'], color='y', label='score function + vr + mc=100') amax = np.argmax(out_score_vr_mc100['test_avg_elbos']) ax.hlines(out_score_vr_mc100['test_avg_elbos'][amax], np.min(out_score_vr_mc100['test_avg_epochs']), np.max(out_score_vr_mc100['test_avg_epochs']), color='y', linestyles='dotted') ax.scatter(out_score_vr_mc100['test_avg_epochs'][amax], out_score_vr_mc100['test_avg_elbos'][amax], color='y', marker='*') ax.plot(np.array(out_score_mc100['test_avg_epochs']), out_score_mc100['test_avg_elbos'], color='m', label='score function + mc=100') amax = np.argmax(out_score_mc100['test_avg_elbos']) ax.hlines(out_score_mc100['test_avg_elbos'][amax], np.min(out_score_mc100['test_avg_epochs']), np.max(out_score_mc100['test_avg_epochs']), color='m', linestyles='dotted') ax.scatter(out_score_mc100['test_avg_epochs'][amax], out_score_mc100['test_avg_elbos'][amax], color='m', marker='*') ax.xaxis.set_major_locator(mpl.ticker.MaxNLocator(integer=True)) ax.grid(True) ax.legend(loc='lower right') ax.set_ylabel('Test ELBO') ax.set_xlabel('Epoch') # ## Discussion on fitting with pathwise and score function gradient estimators (optional, not examinable) # # From the above experiments you should clearly see that the pathwise gradient estimator has a computational advantage for fitting the specified VAE model over the score function estimator. This aligns with many empirical observations in the VAE literature and has motivated a significant amount of research on generalising the pathwise gradient estimator to more distribution families, such as the [implicit reparametrisation gradients](https://arxiv.org/abs/1805.08498). # # Nevertheless, the score function gradient estimator remains a more general method, which enables the fitting of discrete variational distributions or other distributions where pathwise gradients are not available. 
For example, VAEs with discrete latent variables have been shown to have a surprising capacity to fit [image and natural language data](https://arxiv.org/abs/1711.00937) very well, and have been demonstrated to recover discrete natural phenomena, such as phonemes in a language (distinct sounds in a specific language), when fitted on speech data. # Moreover, discrete variational distributions could be useful for fitting discrete statistical models from incomplete data, where a variational distribution could be specified over the discrete missing variables. # # We have also shown above that control variates and increased Monte Carlo averaging can be used to improve the performance of the score function estimator. # ## Sampling the fitted model # # Since the VAE is modelling images, we can also assess the quality of the fit by inspecting the samples from the model. # # We can sample the VAE using ancestral sampling, where we first generate $\mathbf{z} \sim p(\mathbf{z})$ and then sample $\mathbf{x} \sim p_{\theta}(\mathbf{x} \mid \mathbf{z})$. We also visualise the expected value $\mathbb{E}_{p_{\theta}(\mathbf{x} \mid \mathbf{z})}(\mathbf{x})$ and the parameters $\eta = \text{sigmoid}(\mathbf{f}_{\theta}(\mathbf{z}))$ of the continuous Bernoulli distribution, which is common in the VAE literature with Bernoulli and Gaussian generative distributions. # # We use the model trained with pathwise gradients `model_path`. # In[43]: N = 42 Z = torch.randn(N, 2, device=args_path.device) with torch.inference_mode(): # Sample the latents logits = model_path.decoder(Z) # Use ancestral sampling to produce samples samples = model_path.decoder.sample(logits, num_samples=1).detach().cpu()[0] means = model_path.decoder.mean(logits).detach().cpu() params = model_path.decoder.param_p(logits).detach().cpu() # Create a matrix of samples X = torch.vstack([samples, means, params]) X = X.view(-1, *data_dims).detach().cpu() # Create a figure fig = plot_grid_of_mnist_samples(X, grid=(9, 14), figsize=(14, 8)) fig.text(0, 0.833, '$\mathbf{x} \sim \mathcal{CB}(\mathbf{x} | \mathbf{z}; \eta)$', rotation='vertical', verticalalignment='center', fontsize=16) fig.text(0, 0.5, '$\mathbf{x} = \mathbb{E}_{\mathcal{CB}}(\mathbf{x} | \mathbf{z}; \eta)$', rotation='vertical', verticalalignment='center', fontsize=16) fig.text(0, 0.166, '$\mathbf{x} = \eta$', rotation='vertical', verticalalignment='center', fontsize=16) fig.tight_layout() # #### Exercise 8 # # 1. Looking at the visualisations above explain the differences between the samples in the top three rows and the middle three rows. The figure 1 in the [continuous Bernoulli paper](https://arxiv.org/abs/1907.06845) might be useful to support your reasoning: (note that the parameters $\eta$ are denoted as $\lambda$ in the referenced paper) # ![Continuous Bernoulli mean](./figures/CB-figure-1-mean.png) # # # 2. It is known that using standard Bernoulli distribution on non-binary data in VAEs results in blurry image samples. However, as we see in the last three rows the images look sharper than the other visualisations. Explain why. 
# To support your reasoning consider the difference between the variational lower-bounds using the continuous Bernoulli $\mathcal{L}_{\mathcal{CB}}$ and the standard Bernoulli $\mathcal{L}_{\mathcal{B}}$ distributions below # $$ # \begin{align} # \mathcal{L}_{\mathcal{CB}}(\theta, \phi) &= \mathbb{E}_{q_{\phi}(\mathbf{z} \mid \mathbf{x})}\left[\sum_{d=1}^D x_d \log \eta(\mathbf{z})_d + (1 - x_d) \log (1 - \eta(\mathbf{z})_d) + \log C\left(\eta(\mathbf{z})_d\right)\right] - \text{KL}\left(q_{\phi}(\mathbf{z} \mid \mathbf{x}) \middle|\middle| p(\mathbf{z})\right)\\ # \mathcal{L}_{\mathcal{B}}(\theta, \phi) &= \mathbb{E}_{q_{\phi}(\mathbf{z} \mid \mathbf{x})}\left[\sum_{d=1}^D x_d \log \eta(\mathbf{z})_d + (1 - x_d) \log (1 - \eta(\mathbf{z})_d)\right] - \text{KL}\left(q_{\phi}(\mathbf{z} \mid \mathbf{x}) \middle|\middle| p(\mathbf{z})\right) # \end{align} # $$ # Then, also use the below figure of the normalising constant $\log C$ from figure 1 of the paper to make your claim (note that the paper uses the following notation to define the normalised distribution $p = C \tilde{p}$ from the unnormalised density $\tilde{p}$) # ![Continuous Bernoulli, log normalising constant](./figures/CB-figure-1-log-norm-const.png) #
Answer (click) # # 1. We can see that the samples in the top three rows are noisy, and in particular, there are grey pixels all around the image---where the background should be black we see some grey noise and where the "ink" should be white we also see some grey noise. These grey pixels are due to the predicted parameters $\eta_d=\text{sigmoid}(f_{\theta}(\mathbf{z})_d)$ not being close enough to 0 (black) or 1 (white), hence sampling from this distribution produces grey values. # This behaviour can be attributed to the unconstrained parametrisation of $\eta_d$. # The parameters $\eta_d$ are in the $(0, 1)$ range, where values closer to 0 result in sampling darker pixels and values closer to 1 result in lighter pixels. However, to make the use of off-the-shelf gradient ascent optimisers easier, we actually optimise an unconstrained parameter $l_d = \text{logit}(\eta_d) = f_{\theta}(\mathbf{z})_d$ (see the logit function [here](https://en.wikipedia.org/wiki/Logit)). Now, to make $\eta_d$ close to 0 or 1, $l_d$ must approach $-\infty$ or $\infty$, respectively. However, neural networks are unlikely to predict such extreme values unless some numerical instability occurs, hence resulting in the image noise we observed. # # To understand the effect of the mean on the images in the middle three rows, it is useful to see the parameter $\eta$ versus mean $\mu$ figure from the continuous Bernoulli paper (the 1st figure above). # You should note that for parameter values between 0.1 and 0.9 the mean function pushes the values towards 0.5, thus resulting in smoother, even blurry, samples as seen above. # The grey noise also vanishes as it is averaged out in the expectation. # # 2. In the final three rows we plot the parameter values $\eta=\text{sigmoid}(\mathbf{f}_{\theta}(\mathbf{z}))$ of the continuous Bernoulli distribution. It is a common observation in VAEs that, when the standard Bernoulli distribution is used on non-binary data, plotting the parameters of the distribution results in blurry images. The authors of the [continuous Bernoulli paper](https://arxiv.org/abs/1907.06845) have demonstrated that this effect is due to the standard Bernoulli distribution ignoring the log-normalising constant $\log C(\eta)$ in the likelihood function (see the 2nd figure above), which arises because the distribution is defined over the continuous domain $[0, 1]$. The log-normalising constant $\log C(\eta)$ acts as a regulariser that encourages the parameters to be more extreme (close to 0 or 1) and discourages values close to 0.5, hence sharpening the samples compared to the standard Bernoulli distribution, which ignores this constant. # #
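# To see the regularising effect of the log-normalising constant numerically, here is a small optional sketch. It assumes the continuous Bernoulli density from the referenced paper, $p(x \mid \eta) = C(\eta)\, \eta^x (1-\eta)^{1-x}$ with $C(\eta) = 2\tanh^{-1}(1-2\eta) / (1-2\eta)$ for $\eta \neq 1/2$ and $C(1/2) = 2$; the helper function name is our own.
# In[ ]:
import numpy as np

def log_norm_const(eta):
    """log C(eta) of the continuous Bernoulli; eta = 0.5 is the limit log(2)."""
    if abs(eta - 0.5) < 1e-6:
        return np.log(2.0)
    return np.log(2.0 * np.arctanh(1.0 - 2.0 * eta) / (1.0 - 2.0 * eta))

# log C(eta) is smallest at eta = 0.5 and grows towards the extremes, so
# maximising the ELBO rewards parameters close to 0 or 1 (sharper images)
for eta in [0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 0.99]:
    print(f'eta = {eta:4.2f}   log C(eta) = {log_norm_const(eta):.3f}')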
# ## Inspecting the latent space (optional, not examinable) # # In the above exercises you have fitted a VAE with a 2-dimensional latent space to the MNIST data in an [unsupervised](https://en.wikipedia.org/wiki/Unsupervised_learning) manner---the labels, which denote the digit in the image, were not used during training. # As we saw in the previous section, the generative model $p_{\theta}(\mathbf{x}\mid \mathbf{z})$ is capable of generating all digits in the training dataset, whilst using a single amortised neural network. # Hence, intuitively, there should be some structure in the latent space amongst the different categories to facilitate easier generation. # This intuition has motivated research on [feature disentanglement using VAEs](https://openreview.net/forum?id=Sy2fzU9gl) and [identifiable VAEs](https://arxiv.org/abs/1907.04809). # # Below we provide the code to visualise the latent space of the VAE. You should observe that images representing the same digit are co-located in the latent space, and you might also notice that visually "similar" digits lie close to each other. # # With a Gaussian variational distribution, it is common to use the mean of the variational distribution $\mathbb{E}_{q_{\phi}(\mathbf{z}\mid \mathbf{x})}\left[ \mathbf{z} \right]$ to project the high-dimensional observed data-points to lower-dimensional representations. The following code uses the model trained with pathwise gradients to project the test MNIST data to the low-dimensional space. # In[44]: with torch.inference_mode(): model_path.eval() model_path = model_path.to(args_path.device) test_loader = create_dataloader(test_data, args_path) Z = [] Y = [] for batch_idx, (X, labels) in tqdm(enumerate(test_loader)): # Prepare X = X.to(args_path.device) X = X.view([-1] + [X.shape[-2]*X.shape[-1]]) # Project mean, _ = model_path.encoder(X) # Collect Z.append(mean) Y.append(labels) Z = torch.vstack(Z) Y = torch.hstack(Y) # We now visualise the projected images `Z` in two dimensions and colour each point according to its label. # In[45]: fig, ax = plt.subplots(1, 1, figsize=(8,8)) # Transfer tensors back to cpu Z = Z.cpu() Y = Y.cpu() # Plot all samples in a specific color dependent on the label colors = plt.rcParams['axes.prop_cycle'].by_key()['color'] c = np.array(colors)[Y.numpy()] ax.scatter(Z[:, 0], Z[:, 1], color=c, alpha=0.3) # Create legend legend_elements = [mpl.patches.Patch(facecolor=colors[i], label=i) for i in range(10)] ax.legend(handles=legend_elements, ncol=5, bbox_to_anchor=(0.5, 0.93), bbox_transform=fig.transFigure, loc='center', prop={'size': 14}, frameon=False) fig.tight_layout() # As expected, you should observe that images corresponding to the same digit are close in the latent space. Moreover, you may notice that visually similar handwritten digits, such as 4 and 9, are also located close to each other. # # It is important to note that VAEs are in general not identifiable, and hence the results may differ between runs (for example, if you ran the fitting on a different machine or used an alternative random seed). # As an alternative view, we also investigate the generator $p_{\theta}(\mathbf{x}\mid \mathbf{z})$ by showing the generated expected values of the continuous Bernoulli distribution $\mathbb{E}_{p_{\theta}(\mathbf{x}\mid \mathbf{z})}\left[\mathbf{x}\right]$ for a grid of latent variable values.
# In[46]: # Establish limits in the latent space mx = Z.max(0)[0].cpu().numpy() mn = Z.min(0)[0].cpu().numpy() # Define the number of steps along each dimension steps = 20 step_x = (mx[0]-mn[0])/steps step_y = (mx[1]-mn[1])/steps # Locations of the grid samples x = np.arange(mn[0], mx[0], step_x) y = np.arange(mn[1], mx[1], step_y) # Create the tuples of latent locations x0, x1 = np.mgrid[mn[0]:mx[0]:step_x, mn[1]:mx[1]:step_y] Z_grid = np.empty(x0.shape + (2,)) Z_grid[:, :, 0] = x0 Z_grid[:, :, 1] = x1 with torch.inference_mode(): # Predict the continuous Bernoulli means logits = model_path.decoder(torch.tensor(Z_grid, dtype=torch.float, device=args_path.device)).cpu() means = model_path.decoder.mean(logits).cpu().reshape(steps, steps, *data_dims) # Plot fig, axes = plt.subplots(steps, steps, sharex=True, sharey=True, figsize=(14, 14)) fig.subplots_adjust(wspace=0., hspace=0.) for i, j in itertools.product(range(steps), range(steps)): axes[j, i].imshow(means[i, steps-j-1], cmap='gray') if j == 0: axes[j, i].set_title(f'{x[i]:.2f}', fontsize=13) if i == 0: axes[j, i].set_ylabel(f'{y[steps-j-1]:.2f}', fontsize=13) axes[j,i].tick_params(left=False, labelleft=False, bottom=False, labelbottom=False) # You should see that the two visualisations should correspond with each other. In particular, similar areas in the latent space should correspond to similar numbers. # # However, since the variational distribution $q_{\phi}(\mathbf{z}\mid \mathbf{x})$ is only an approximation of the true posterior $p_{\theta}(\mathbf{z} \mid \mathbf{x})$, some disagreement between the two visualisations can be expected. Below we plot the two visualisations on top of each other, which may help to identify the disagreement. # In[47]: # Plot fig, axes = plt.subplots(steps, steps, sharex=True, sharey=True, figsize=(14, 14)) fig.subplots_adjust(wspace=0., hspace=0.) # Overlay another axes rect = [axes[0][0].get_position().get_points()[0, 0], axes[-1][-1].get_position().get_points()[0, 1], axes[-1][-1].get_position().get_points()[1, 0]-axes[0][0].get_position().get_points()[0, 0], axes[0][0].get_position().get_points()[1, 1]-axes[-1][-1].get_position().get_points()[0, 1] ] ax = fig.add_axes(rect) # Plot samples for i, j in itertools.product(range(steps), range(steps)): axes[j, i].imshow(means[i, steps-j-1], cmap='gray') axes[j,i].tick_params(left=False, labelleft=False, bottom=False, labelbottom=False) # Plot projections ax.scatter(Z[:, 0], Z[:, 1], color=c, alpha=0.3) ax.patch.set_alpha(0.) ax.set_xlim(mn[0], mn[0]+step_x*steps) ax.set_ylim(mn[1], mn[1]+step_y*steps) # Create legend legend_elements = [mpl.patches.Patch(facecolor=colors[i], label=i) for i in range(10)] ax.legend(handles=legend_elements, ncol=10, bbox_to_anchor=(0.5, 0.92), bbox_transform=fig.transFigure, loc='center', prop={'size': 14}, frameon=False) # In[ ]: