Notebook

Computing Bayesian Coresets in Parallel using Edward¶

(Better displayed in nbviewer as red warnings in font tag may not be displayed on github)

In this notebook we try computing Bayesian coresets in parallel using the approach illustated in Section 4.3 of [1]. (Notice that by parallel we mean logically in parallel, even if the actual computation is sequential). Bayesian coresets are presented in [1][2]. We exploit Tensorflow/Edward [4] for the definition of models, for their derivation, for posterior approximation, and for sampling. We rely on the code made availabe by Trevor Campbell for computing coresets [3].

This notebook builds on the previous notebook, to which we refer for a more detailed description of the code.

The notebook is divided in three parts:

Setup: defining the data and the model;
Baselines: computing and visualizing baselines evaluated on the whole dataset at once;
Parallel Computation of Coresets: Random Subsets: computing the coresets in parallel when randomly subdividing the data in four subsets.
Parallel Computation of Coresets: Non i.d. Subsets: computing the coresets in parallel when subdividing the data in two different subsets.

ISSUES: still to correct/review: (i) Rescaling of weights for the model not to crash into NaN; (ii) computing time in Jupyter seems unrelated to the dataset size.

1. Setup¶

In this first section we define parameters and auxiliary functions, we generate and show the data, and we define the statistical model.

In [ ]:

Importing libraries¶

In [1]:

import time
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

import sklearn.linear_model as lm
import sklearn.svm as svm

import tensorflow as tf
import edward as ed
import bcoresets as bc

Setting a random seed¶

In [2]:

np.random.seed(742)

Defining the projection code¶

See previous notebook for an explanation of this code.

In [3]:

class _Projection(object):
    def __init__(self, N, projection_dim, approx_posterior):
        self.dim = approx_posterior.shape[0].value
        self.x = np.zeros((N, 0))
        self.approx_posterior = approx_posterior
        self.update_dimension(projection_dim)
        return

    def update_dimension(self, projection_dim):
        if projection_dim < self.x.shape[1]:
            self.x = self.x[:, :projection_dim]

        if projection_dim > self.x.shape[1]:
            old_dim = self.x.shape[1]
            w = np.zeros((self.x.shape[0], projection_dim))
            w[:, :old_dim] = self.x
            w *= np.sqrt(old_dim)
            for j in range(projection_dim-old_dim):
                w[:, j+old_dim] = self._sample_component()
            w /= np.sqrt(projection_dim)
            self.x = w
        return

    def get(self):
        return self.x.copy()

In [4]:

class ProjectionF(_Projection):
    def __init__(self, data, grad_log_likelihood, projection_dim, approx_posterior):
        self.data = data
        self.grad_log_likelihood = grad_log_likelihood
        
        _Projection.__init__(self, data.shape[0], projection_dim, approx_posterior)
  
    def _sample_component(self):
        sample_post = self.approx_posterior.sample().eval()
        
        sgll = [self.grad_log_likelihood.eval(feed_dict={X:self.data[i].reshape([1,self.data.shape[1]]),
                                                         W:np.ones(self.data.shape[0], dtype=np.float64),
                                                         theta:sample_post})
                for i in range(self.data.shape[0])]
        
        sgll = np.array(sgll)
                
        return np.sqrt(self.dim)*sgll[:,np.random.randint(self.dim)]

Defining the params of the simulation¶

We define the parameters of the simulation, including the number and dimensionality of the samples we will generate (nsamples, ndims); the number of random projection dimension and core samples for coreset computation (nrandomdims, ncoresamples); the number of samples, the step length, the burn-in duration and the amount of thinning for the MC simulation (n_mcsamples, n_mcstepsize, n_mcburnin, n_mcthinning,); the number of samples for posterior prediction (n_ppcsamples); and a grid step to plot contour plots (gridstep).

In [5]:

nsamples = 200
nfeatures = 2

nrandomdims = 100
ncoresamples = 40

n_mcsamples = 10000
n_mcstepsize = 0.005
n_mcburnin = int(n_mcsamples/2)
n_mcthinning = 5

ppc_samples = 100

gridstep = .1

Data generation¶

We generate data from three Gaussian distributions. We use two-dimensional data for ease of visualization.

In [6]:

mu1 = np.array([-2,-2])
sigma1 = np.array([[3,.5],[.5,1]])
mu2 = np.array([2,2])
sigma2 = np.array([[1,0],[0,1]])
mu3 = np.array([-2,2.5])
sigma3 = np.array([[1,0],[0,1]])

pdf1 = stats.multivariate_normal(mu1,sigma1)
pdf2 = stats.multivariate_normal(mu2,sigma2)
pdf3 = stats.multivariate_normal(mu3,sigma3)

X_pdf1 = pdf1.rvs(size=nsamples)
y_pdf1 = np.zeros(nsamples).reshape((nsamples,1))
X_pdf2 = pdf2.rvs(size=nsamples)
y_pdf2 = np.ones(nsamples).reshape((nsamples,1))
X_pdf3 = pdf3.rvs(size=nsamples)
y_pdf3 = np.ones(nsamples).reshape((nsamples,1))

Xtr = np.vstack((X_pdf1,X_pdf2,X_pdf3))
ytr = np.vstack((y_pdf1,y_pdf2,y_pdf3))

permutation = np.random.permutation(Xtr.shape[0])
Xtr = Xtr[permutation]
ytr = ytr[permutation]

Data visualization¶

We plot the data we generated.

In [7]:

xmeshgrid, ymeshgrid = np.mgrid[-7:7:gridstep, -6:6:gridstep]
xymeshgrid = np.c_[xmeshgrid.ravel(), ymeshgrid.ravel()]

ycolors = np.array(['r']*ytr.shape[0])
ycolors[ytr[:,0]==1] = 'b'

plt.figure()
plt.title('Data')
plt.scatter(Xtr[:,0],Xtr[:,1], c=ycolors)
plt.axhline(y=0, color='gray')
plt.axvline(x=0, color='gray')

xcontourgrid, ycontourgrid = np.mgrid[-6:2:gridstep, -6:2:gridstep]
xycontourgrid = np.dstack((xcontourgrid, ycontourgrid))
plt.contour(xcontourgrid, ycontourgrid, pdf1.pdf(xycontourgrid))

xcontourgrid, ycontourgrid = np.mgrid[0:5:gridstep, 0:5:gridstep]
xycontourgrid = np.dstack((xcontourgrid, ycontourgrid))
plt.contour(xcontourgrid, ycontourgrid, pdf2.pdf(xycontourgrid))

xcontourgrid, ycontourgrid = np.mgrid[-5:0:gridstep, 0:5:gridstep]
xycontourgrid = np.dstack((xcontourgrid, ycontourgrid))
plt.contour(xcontourgrid, ycontourgrid, pdf3.pdf(xycontourgrid))

Out[7]:

<matplotlib.contour.QuadContourSet at 0x7f67a96a7ba8>

Defining the model¶

We define a standard Bayesian weighted regression model as: $$ P(Y \vert \theta) \sim Bern \left\{ p = \sigma(X \theta)^w \right\} $$ where $\theta$ is the two-dimensional vector of parameters and $\sigma(x) = \frac{1}{1+e^{-x}}$ is the logistic function.

We also define the log-likelihood of the models and we use Tensorflow for the definition of the gradient of the log-likelihood.

In [8]:

X = tf.placeholder(tf.float64,[None,nfeatures])
W = tf.placeholder(tf.float64,[None])
theta = ed.models.Normal(loc=tf.zeros(nfeatures,dtype=tf.float64),scale=tf.ones(nfeatures,dtype=tf.float64))
likelihood = tf.sigmoid(ed.dot(X,theta))
log_likelihood = W*tf.log(likelihood)
y = ed.models.Bernoulli(probs=tf.exp(log_likelihood))

grad_log_likelihood = tf.gradients(log_likelihood,[theta])[0]

2. Baselines¶

In this second part we compute and plot baselines obtained by running our computations on the dataset at once (that is, not in parallel).

Performing standard logistic regression¶

We run standard logistic regression to learn a model for our data. We run the LR algorithm on the whole data and we visualize the result.

In [9]:

lr = lm.LogisticRegression()
lr.fit(Xtr,ytr)
probs_LR = lr.predict_proba(xymeshgrid)[:, 1].reshape(xmeshgrid.shape)

plt.title('Logistic Regression')
plt.contourf(xmeshgrid, ymeshgrid, probs_LR, 25, cmap="RdBu")
plt.scatter(Xtr[:,0],Xtr[:,1], c=ycolors)
plt.axhline(y=0, color='gray')
plt.axvline(x=0, color='gray')

/home/fmzennaro/miniconda2_1/envs/bayes3/lib/python3.6/site-packages/sklearn/utils/validation.py:578: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)

Out[9]:

<matplotlib.lines.Line2D at 0x7f6799c40cc0>

Performing Bayesian logistic regression¶

Similarly, we run Bayesian logistic regression to learn a model for our data. First, we compute a MAP point estimate for $\theta$ in order to plot a line discriminator; second, we compute an estimate of the distibution of $\theta$ via Hamiltonian Monte Carlo in order to computer posterior probabilities. Finally we visualize the results.

In [10]:

map_theta = ed.models.PointMass(tf.Variable(tf.zeros(2,dtype=tf.float64)))
inference = ed.MAP({theta:map_theta}, {X:Xtr,W:np.ones(Xtr.shape[0],dtype=np.float64),y:ytr.reshape((ytr.shape[0]))})
inference.run()
mapx = np.linspace(-6,6,500)
mapy = -(mapx*map_theta.eval()[0]) / map_theta.eval()[1]

thetachain = ed.models.Empirical(params=tf.Variable(tf.zeros([n_mcsamples,nfeatures],dtype=tf.float64)))
inference = ed.HMC({theta:thetachain},
                    {X:Xtr,W:np.ones(Xtr.shape[0],dtype=np.float64),y:ytr.reshape((ytr.shape[0]))})
inference.run(step_size=n_mcstepsize)
thetahat = ed.models.Empirical(params = thetachain.params.eval()[n_mcburnin:n_mcsamples:n_mcthinning])
y_post = ed.copy(y,{theta:thetahat})
probs_BLR = [y_post.eval(feed_dict={X:xymeshgrid, W:np.ones(xymeshgrid.shape[0],dtype=np.float64)}) for _ in range(ppc_samples)]
probs_BLR = np.array(probs_BLR).mean(axis=0).reshape(xmeshgrid.shape)

plt.title('Bayes LR')
plt.contourf(xmeshgrid, ymeshgrid, probs_BLR, 25, cmap="RdBu")
plt.scatter(Xtr[:,0],Xtr[:,1], c=ycolors)
plt.plot(mapx,mapy)
plt.axhline(y=0, color='gray')
plt.axvline(x=0, color='gray')

/home/fmzennaro/miniconda2_1/envs/bayes3/lib/python3.6/site-packages/edward/util/random_variables.py:52: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  not np.issubdtype(value.dtype, np.float) and \

1000/1000 [100%] ██████████████████████████████ Elapsed: 1s | Loss: 40.624
10000/10000 [100%] ██████████████████████████████ Elapsed: 10s | Acceptance Rate: 1.000

Out[10]:

<matplotlib.lines.Line2D at 0x7f679004b7f0>

Performing Bayesian Coreset¶

We compute the GIGA coreset of the full dataset. This step includes evaluating an approximate posterior via a Laplace approximation, a discretization and random projection of the log-likelihood, the computation of the coreset and the selection of samples with non-zero weight (see previous notebook for an explanation of this code). At the end, we run MAP estimation and HMC on the coreset data and we visualize the result.

In [11]:

qtheta = ed.models.MultivariateNormalTriL(
                    loc = tf.Variable(tf.zeros(nfeatures,dtype=tf.float64)),
                    scale_tril = tf.Variable(tf.eye(nfeatures,nfeatures,dtype=tf.float64)))
inference = ed.Laplace({theta:qtheta}, {X:Xtr,W:np.ones(Xtr.shape[0],dtype=np.float64),y:ytr.reshape((ytr.shape[0]))})
inference.run()

t0 = time.time()
randomprojection = ProjectionF(Xtr, grad_log_likelihood, nrandomdims, qtheta)
vecs = randomprojection.get()
bc_giga = bc.GIGA(vecs)
bc_giga.run(ncoresamples)
t_giga = time.time() - t0

Wt = bc_giga.weights()
Xwt = Xtr[Wt>0]
ywt = ytr[Wt>0]
wts = Wt[Wt>0]

map_theta = ed.models.PointMass(tf.Variable(tf.zeros(2,dtype=tf.float64)))
inference = ed.MAP({theta:map_theta}, {X:Xwt,W:wts/10.0,y:ywt.reshape((ywt.shape[0]))})
inference.run()
mapx = np.linspace(-6,6,500)
mapy = -(mapx*map_theta.eval()[0]) / map_theta.eval()[1]

thetachain = ed.models.Empirical(params=tf.Variable(tf.zeros([n_mcsamples,nfeatures],dtype=tf.float64)))
inference = ed.HMC({theta:thetachain},
                   {X:Xwt,W:wts/10.0,y:ywt.reshape((ywt.shape[0]))})
inference.run(step_size=n_mcstepsize)
thetahat = ed.models.Empirical(params = thetachain.params.eval()[n_mcburnin:n_mcsamples:n_mcthinning])
y_post = ed.copy(y,{theta:thetahat}) 
probs_BLR = [y_post.eval(feed_dict={X:xymeshgrid, W:np.ones(xymeshgrid.shape[0],dtype=np.float64)}) for _ in range(ppc_samples)]
probs_BLR = np.array(probs_BLR).mean(axis=0).reshape(xmeshgrid.shape)

plt.title('GIGA coreset')
ycolors_giga = np.array(['r']*ywt.shape[0])
ycolors_giga[ywt[:,0]==1] = 'b'
plt.contourf(xmeshgrid, ymeshgrid, probs_BLR, 25, cmap="RdBu")
plt.scatter(Xwt[:,0],Xwt[:,1], c=ycolors_giga, s=wts)
plt.plot(mapx,mapy)
plt.axhline(y=0, color='gray')
plt.axvline(x=0, color='gray')

/home/fmzennaro/miniconda2_1/envs/bayes3/lib/python3.6/site-packages/edward/util/random_variables.py:52: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  not np.issubdtype(value.dtype, np.float) and \

1000/1000 [100%] ██████████████████████████████ Elapsed: 1s | Loss: 40.624
1000/1000 [100%] ██████████████████████████████ Elapsed: 1s | Loss: 8.640
10000/10000 [100%] ██████████████████████████████ Elapsed: 10s | Acceptance Rate: 1.000

Out[11]:

<matplotlib.lines.Line2D at 0x7f6775a7abe0>

Discussion¶

All the models (standard logistic regression, Bayesian logistic regression, and Bayesian logistic regression on the coreset) running on the whole dataset are able to identify a good discriminator.

3. Parallel Computation of Coresets: Random Subsets¶

In this third part, we randomly partition the data in four subsets. We can safely assume that each one of these datasets come from the same distribution. We compute the four coresets in parallel, then we aggregate them and examine the result.

Partitioning the data¶

We divide the training data in four training sub-datasets.

In [12]:

partitioning = np.random.permutation(Xtr.shape[0])

Xtr1 = Xtr[partitioning[0:150]]
ytr1 = ytr[partitioning[0:150]]

Xtr2 = Xtr[partitioning[150:300]]
ytr2 = ytr[partitioning[150:300]]

Xtr3 = Xtr[partitioning[300:450]]
ytr3 = ytr[partitioning[300:450]]

Xtr4 = Xtr[partitioning[450:600]]
ytr4 = ytr[partitioning[450:600]]

Computing the coresets¶

We compute the four coresets in parallel. We implement a helper function (compute_parallel_coreset()) that computes the coreset for a given sub-dataset. Notice that each of the four coresets selects 10 points, so that, overall, we have the same number of points as when we computed the coreset on the whole dataset (40). At the end, we collect the results and plot them.

In [13]:

def compute_parallel_coreset(Xtr,ytr):
    qtheta = ed.models.MultivariateNormalTriL(
                        loc = tf.Variable(tf.zeros(nfeatures,dtype=tf.float64)),
                        scale_tril = tf.Variable(tf.eye(nfeatures,nfeatures,dtype=tf.float64)))
    inference = ed.Laplace({theta:qtheta}, {X:Xtr,W:np.ones(Xtr.shape[0],dtype=np.float64),y:ytr.reshape((ytr.shape[0]))})
    inference.run()
    
    t0 = time.time()
    randomprojection = ProjectionF(Xtr, grad_log_likelihood, nrandomdims, qtheta)
    vecs = randomprojection.get()
    bc_giga = bc.GIGA(vecs)
    bc_giga.run(ncoresamples)
    t_giga = time.time() - t0

    Wt = bc_giga.weights()
    Xwt = Xtr[Wt>0]
    ywt = ytr[Wt>0]
    wts = Wt[Wt>0]
    ycolors = np.array(['r']*ywt.shape[0])
    ycolors[ywt[:,0]==1] = 'b'
    
    map_theta = ed.models.PointMass(tf.Variable(tf.zeros(2,dtype=tf.float64)))
    inference = ed.MAP({theta:map_theta}, {X:Xwt,W:wts/10.0,y:ywt.reshape((ywt.shape[0]))})
    inference.run()
    mapx = np.linspace(-6,6,500)
    mapy = -(mapx*map_theta.eval()[0]) / map_theta.eval()[1]
    
    return Xwt,ywt,wts,mapx,mapy,ycolors,t_giga

ncoresamples = 10
Xwt1,ywt1,wts1,mapx1,mapy1,ycolors1,t1 = compute_parallel_coreset(Xtr1,ytr1)
Xwt2,ywt2,wts2,mapx2,mapy2,ycolors2,t2 = compute_parallel_coreset(Xtr2,ytr2)
Xwt3,ywt3,wts3,mapx3,mapy3,ycolors3,t3 = compute_parallel_coreset(Xtr3,ytr3)
Xwt4,ywt4,wts4,mapx4,mapy4,ycolors4,t4 = compute_parallel_coreset(Xtr4,ytr4)

fig, axes = plt.subplots(2,2)
fig.suptitle('Parallel coresets')
axes[0,0].scatter(Xwt1[:,0],Xwt1[:,1], c=ycolors1, s=wts1)
axes[0,0].plot(mapx1,mapy1)
axes[0,0].axhline(y=0, color='gray')
axes[0,0].axvline(x=0, color='gray')

axes[0,1].scatter(Xwt2[:,0],Xwt2[:,1], c=ycolors2, s=wts2)
axes[0,1].plot(mapx2,mapy2)
axes[0,1].axhline(y=0, color='gray')
axes[0,1].axvline(x=0, color='gray')

axes[1,0].scatter(Xwt3[:,0],Xwt3[:,1], c=ycolors3, s=wts3)
axes[1,0].plot(mapx3,mapy3)
axes[1,0].axhline(y=0, color='gray')
axes[1,0].axvline(x=0, color='gray')

axes[1,1].scatter(Xwt4[:,0],Xwt4[:,1], c=ycolors4, s=wts4)
axes[1,1].plot(mapx4,mapy4)
axes[1,1].axhline(y=0, color='gray')
axes[1,1].axvline(x=0, color='gray')

/home/fmzennaro/miniconda2_1/envs/bayes3/lib/python3.6/site-packages/edward/util/random_variables.py:52: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  not np.issubdtype(value.dtype, np.float) and \

1000/1000 [100%] ██████████████████████████████ Elapsed: 1s | Loss: 13.498
1000/1000 [100%] ██████████████████████████████ Elapsed: 2s | Loss: 3.452
1000/1000 [100%] ██████████████████████████████ Elapsed: 2s | Loss: 9.804
1000/1000 [100%] ██████████████████████████████ Elapsed: 1s | Loss: 6.252
1000/1000 [100%] ██████████████████████████████ Elapsed: 2s | Loss: 16.456
1000/1000 [100%] ██████████████████████████████ Elapsed: 2s | Loss: 3.158
1000/1000 [100%] ██████████████████████████████ Elapsed: 2s | Loss: 17.184
1000/1000 [100%] ██████████████████████████████ Elapsed: 3s | Loss: 7.039

Out[13]:

<matplotlib.lines.Line2D at 0x7f6755fc2d30>

Assembling the data¶

We now collect all the coresets in a single dataset, and we perform Bayesian analysis as we did before.

In [14]:

Xpar = np.vstack((Xwt1,Xwt2,Xwt3,Xwt4))
ypar = np.vstack((ywt1,ywt2,ywt3,ywt4))
wpar = np.vstack((wts1.reshape((wts1.shape[0],1)),wts2.reshape((wts2.shape[0],1)),wts3.reshape((wts3.shape[0],1)),wts4.reshape((wts4.shape[0],1))))
wpar = wpar.reshape((wpar.shape[0]))

map_theta = ed.models.PointMass(tf.Variable(tf.zeros(2,dtype=tf.float64)))
inference = ed.MAP({theta:map_theta}, {X:Xpar,W:wpar/10.0,y:ypar.reshape((ypar.shape[0]))})
inference.run()
mapx = np.linspace(-6,6,500)
mapy = -(mapx*map_theta.eval()[0]) / map_theta.eval()[1]

thetachain = ed.models.Empirical(params=tf.Variable(tf.zeros([n_mcsamples,nfeatures],dtype=tf.float64)))
inference = ed.HMC({theta:thetachain},
                   {X:Xpar,W:wpar,y:ypar.reshape((ypar.shape[0]))})
inference.run(step_size=n_mcstepsize)
thetahat = ed.models.Empirical(params = thetachain.params.eval()[n_mcburnin:n_mcsamples:n_mcthinning])
y_post = ed.copy(y,{theta:thetahat})
 
probs_BLR = [y_post.eval(feed_dict={X:xymeshgrid, W:np.ones(xymeshgrid.shape[0],dtype=np.float64)}) for _ in range(ppc_samples)]
probs_BLR = np.array(probs_BLR).mean(axis=0).reshape(xmeshgrid.shape)

plt.figure()
plt.title('GIGA coreset (aggregated dataset)')
ycolors_giga = np.array(['r']*ypar.shape[0])
ycolors_giga[ypar[:,0]==1] = 'b'
plt.contourf(xmeshgrid, ymeshgrid, probs_BLR, 25, cmap="RdBu")
plt.scatter(Xpar[:,0],Xpar[:,1], c=ycolors_giga, s=wpar)
plt.plot(mapx,mapy)
plt.axhline(y=0, color='gray')
plt.axvline(x=0, color='gray')

/home/fmzennaro/miniconda2_1/envs/bayes3/lib/python3.6/site-packages/edward/util/random_variables.py:52: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  not np.issubdtype(value.dtype, np.float) and \

1000/1000 [100%] ██████████████████████████████ Elapsed: 2s | Loss: 11.286
10000/10000 [100%] ██████████████████████████████ Elapsed: 11s | Acceptance Rate: 1.000

Out[14]:

<matplotlib.lines.Line2D at 0x7f6754e52e10>

Printing out coreset training times¶

We compare the time to train a single coreset on the whole dataset and training four coresets in parallel.

In [15]:

print('Time to compute a single coreset on {0} samples: {1} secs'.format(Xtr.shape[0],t_giga))
print('Total time to compute four coresets on {0} samples: {1} secs'.format(150,t1+t2+t3+t4))
print('Individual coreset times: [{0}, {1}, {2}, {3}]'.format(t1,t2,t3,t4))
print('Average time for the four coresets: {0} secs'.format((t1+t2+t3+t4)/4.0))

Time to compute a single coreset on 600 samples: 22.029346466064453 secs
Total time to compute four coresets on 150 samples: 226.08742094039917 secs
Individual coreset times: [29.29443860054016, 44.80942702293396, 65.24506831169128, 86.73848700523376]
Average time for the four coresets: 56.52185523509979 secs

Discussion¶

On a simple dataset, random subsampling produces training sub-datasets that still represent the original distribution quite closely. The MAP discriminator associated with the sub-datasets shows a certain degree of variance, with some instances closer to the baseline (third sub-dataset) and other further from the baseline (second sub-dataset). However, sssembling the four coresets returns a posterior discriminator very close to the one computed from the whole dataset. Computing time seems not to be sensibly affected by the partitioning of the data. (?)

4. Parallel Computation of Coresets: Non i.d. Subsets¶

In this fourth part, we consider a more challenging partition of the data, in which the data are divided in two sub-datasets according to their generating clusters. This induces two different distributions. We compute the two coresets in parallel, then we aggregate them and examine the result.

Partitioning the data¶

We divide the training data in two training sub-datasets. Additionally, we plot the data to show that the two datasets are generated by different distributions.

In [16]:

Xtr12 = np.vstack((X_pdf1,X_pdf2))
ytr12 = np.vstack((y_pdf1,y_pdf2))
permutation = np.random.permutation(nsamples*2)
Xtr12 = Xtr12[permutation]
ytr12 = ytr12[permutation]
 
Xtr13 = np.vstack((X_pdf1,X_pdf3))
ytr13 = np.vstack((y_pdf1,y_pdf3))
permutation = np.random.permutation(nsamples*2)
Xtr13 = Xtr13[permutation]
ytr13 = ytr13[permutation]

fig, axes = plt.subplots(1,2)
ycolors_12 = np.array(['r']*ytr12.shape[0])
ycolors_12[ytr12[:,0]==1] = 'b'
axes[0].set_title('First sub-dataset')
axes[0].scatter(Xtr12[:,0],Xtr12[:,1], c=ycolors_12)
axes[0].axhline(y=0, color='gray')
axes[0].axvline(x=0, color='gray')

ycolors_13 = np.array(['r']*ytr13.shape[0])
ycolors_13[ytr13[:,0]==1] = 'b'
axes[1].set_title('First sub-dataset')
axes[1].scatter(Xtr13[:,0],Xtr13[:,1], c=ycolors_13)
axes[1].axhline(y=0, color='gray')
axes[1].axvline(x=0, color='gray')

Out[16]:

<matplotlib.lines.Line2D at 0x7f6754de84a8>

Computing the coresets¶

We compute the two coresets in parallel. We rely on the previous helper function (compute_parallel_coreset()). Notice that each of the two coreset selects 20 points, so that, overall, we have the same number of points as when we computed the coreset on the whole dataset (40). At the end, we collect the results and plot them.

In [17]:

ncoresamples = 20
Xwt12,ywt12,wts12,mapx12,mapy12,ycolors12,t12 = compute_parallel_coreset(Xtr12,ytr12)
Xwt13,ywt13,wts13,mapx13,mapy13,ycolors13,t13 = compute_parallel_coreset(Xtr13,ytr13)

fig, axes = plt.subplots(1,2)
fig.suptitle('Parallel coresets')
axes[0].scatter(Xwt12[:,0],Xwt12[:,1], c=ycolors12, s=wts12)
axes[0].plot(mapx12,mapy12)
axes[0].axhline(y=0, color='gray')
axes[0].axvline(x=0, color='gray')

axes[1].scatter(Xwt13[:,0],Xwt13[:,1], c=ycolors13, s=wts13)
axes[1].plot(mapx13,mapy13)
axes[1].axhline(y=0, color='gray')
axes[1].axvline(x=0, color='gray')

/home/fmzennaro/miniconda2_1/envs/bayes3/lib/python3.6/site-packages/edward/util/random_variables.py:52: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  not np.issubdtype(value.dtype, np.float) and \

1000/1000 [100%] ██████████████████████████████ Elapsed: 3s | Loss: 26.856
1000/1000 [100%] ██████████████████████████████ Elapsed: 3s | Loss: 3.843
1000/1000 [100%] ██████████████████████████████ Elapsed: 4s | Loss: 26.399
1000/1000 [100%] ██████████████████████████████ Elapsed: 4s | Loss: 4.593

Out[17]:

<matplotlib.lines.Line2D at 0x7f674b092f98>

Assembling the data¶

We now collect all the coresets in a single dataset, and we perform Bayesian analysis as we did before.

In [19]:

Xparr = np.vstack((Xwt12,Xwt13))
yparr = np.vstack((ywt12,ywt13))
wparr = np.vstack((wts12.reshape((wts12.shape[0],1)),wts13.reshape((wts13.shape[0],1))))
wparr = wparr.reshape((wparr.shape[0]))

map_theta = ed.models.PointMass(tf.Variable(tf.zeros(2,dtype=tf.float64)))
inference = ed.MAP({theta:map_theta}, {X:Xparr,W:wparr/10.0,y:yparr.reshape((yparr.shape[0]))})
inference.run()
mapx = np.linspace(-6,6,500)
mapy = -(mapx*map_theta.eval()[0]) / map_theta.eval()[1]

thetachain = ed.models.Empirical(params=tf.Variable(tf.zeros([n_mcsamples,nfeatures],dtype=tf.float64)))
inference = ed.HMC({theta:thetachain},
                   {X:Xparr,W:wparr/10.0,y:yparr.reshape((yparr.shape[0]))})
inference.run(step_size=n_mcstepsize)
thetahat = ed.models.Empirical(params = thetachain.params.eval()[n_mcburnin:n_mcsamples:n_mcthinning])
y_post = ed.copy(y,{theta:thetahat})
 
probs_BLR = [y_post.eval(feed_dict={X:xymeshgrid, W:np.ones(xymeshgrid.shape[0],dtype=np.float64)}) for _ in range(ppc_samples)]
probs_BLR = np.array(probs_BLR).mean(axis=0).reshape(xmeshgrid.shape)

plt.figure()
plt.title('GIGA coreset (aggregated dataset)')
ycolors_giga = np.array(['r']*yparr.shape[0])
ycolors_giga[yparr[:,0]==1] = 'b'
plt.contourf(xmeshgrid, ymeshgrid, probs_BLR, 25, cmap="RdBu")
plt.scatter(Xparr[:,0],Xparr[:,1], c=ycolors_giga, s=wparr)
plt.plot(mapx,mapy)
plt.axhline(y=0, color='gray')
plt.axvline(x=0, color='gray')

/home/fmzennaro/miniconda2_1/envs/bayes3/lib/python3.6/site-packages/edward/util/random_variables.py:52: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  not np.issubdtype(value.dtype, np.float) and \

1000/1000 [100%] ██████████████████████████████ Elapsed: 4s | Loss: 8.286
10000/10000 [100%] ██████████████████████████████ Elapsed: 17s | Acceptance Rate: 1.000

Out[19]:

<matplotlib.lines.Line2D at 0x7f6748c55d68>

Printing out coreset training times¶

We compare the time to train a single coreset on the whole dataset and training four coresets in parallel.

In [20]:

print('Time to compute a single coreset on {0} samples: {1} secs'.format(Xtr.shape[0],t_giga))
print('Total time to compute two coresets on {0} samples: {1} secs'.format(150,t12+t13))
print('Individual coreset times: [{0}, {1}]'.format(t12,t13))
print('Average time for the four coresets: {0} secs'.format((t12+t13)/2.0))

Time to compute a single coreset on 600 samples: 22.029346466064453 secs
Total time to compute two coresets on 150 samples: 267.29918479919434 secs
Individual coreset times: [120.1095643043518, 147.18962049484253]
Average time for the four coresets: 133.64959239959717 secs

Discussion¶

When partitioning the data according to their generating clusters, we define two sub-datasets with different distribution. Not surprisingly, the MAP discriminators associated with the two sub-datasets are sensibly different. However, once again, assembling the coresets returns a posterior discriminator very close to the one computed from the whole dataset. Computing time seems not to be sensibly affected by the partitioning of the data. (?)

References¶

[1] Campbell, T. & Broderick, T. Automated Scalable Bayesian Inference via Hilbert Coresets arXiv preprint arXiv:1710.05053, 2017.

[2] Campbell, T. & Broderick, T. Bayesian coreset construction via greedy iterative geodesic ascent arXiv preprint arXiv:1802.01737, 2018

[3] Bayesian Coresets: Automated, Scalable Inference

[4] Tran, D.; Kucukelbir, A.; Dieng, A. B.; Rudolph, M.; Liang, D. & Blei, D. M. Edward: A library for probabilistic modeling, inference, and criticism arXiv preprint arXiv:1610.09787, 2016