(Better displayed in nbviewer as red warnings in font tag may not be displayed on github)
In this notebook we try computing Bayesian coresets in parallel using the approach illustated in Section 4.3 of [1]. (Notice that by parallel we mean logically in parallel, even if the actual computation is sequential). Bayesian coresets are presented in [1][2]. We exploit Tensorflow/Edward [4] for the definition of models, for their derivation, for posterior approximation, and for sampling. We rely on the code made availabe by Trevor Campbell for computing coresets [3].
This notebook builds on the previous notebook, to which we refer for a more detailed description of the code.
The notebook is divided in three parts:
ISSUES: still to correct/review: (i) Rescaling of weights for the model not to crash into NaN; (ii) computing time in Jupyter seems unrelated to the dataset size.
In this first section we define parameters and auxiliary functions, we generate and show the data, and we define the statistical model.
import time
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import sklearn.linear_model as lm
import sklearn.svm as svm
import tensorflow as tf
import edward as ed
import bcoresets as bc
np.random.seed(742)
See previous notebook for an explanation of this code.
class _Projection(object):
def __init__(self, N, projection_dim, approx_posterior):
self.dim = approx_posterior.shape[0].value
self.x = np.zeros((N, 0))
self.approx_posterior = approx_posterior
self.update_dimension(projection_dim)
return
def update_dimension(self, projection_dim):
if projection_dim < self.x.shape[1]:
self.x = self.x[:, :projection_dim]
if projection_dim > self.x.shape[1]:
old_dim = self.x.shape[1]
w = np.zeros((self.x.shape[0], projection_dim))
w[:, :old_dim] = self.x
w *= np.sqrt(old_dim)
for j in range(projection_dim-old_dim):
w[:, j+old_dim] = self._sample_component()
w /= np.sqrt(projection_dim)
self.x = w
return
def get(self):
return self.x.copy()
class ProjectionF(_Projection):
def __init__(self, data, grad_log_likelihood, projection_dim, approx_posterior):
self.data = data
self.grad_log_likelihood = grad_log_likelihood
_Projection.__init__(self, data.shape[0], projection_dim, approx_posterior)
def _sample_component(self):
sample_post = self.approx_posterior.sample().eval()
sgll = [self.grad_log_likelihood.eval(feed_dict={X:self.data[i].reshape([1,self.data.shape[1]]),
W:np.ones(self.data.shape[0], dtype=np.float64),
theta:sample_post})
for i in range(self.data.shape[0])]
sgll = np.array(sgll)
return np.sqrt(self.dim)*sgll[:,np.random.randint(self.dim)]
We define the parameters of the simulation, including the number and dimensionality of the samples we will generate (nsamples, ndims); the number of random projection dimension and core samples for coreset computation (nrandomdims, ncoresamples); the number of samples, the step length, the burn-in duration and the amount of thinning for the MC simulation (n_mcsamples, n_mcstepsize, n_mcburnin, n_mcthinning,); the number of samples for posterior prediction (n_ppcsamples); and a grid step to plot contour plots (gridstep).
nsamples = 200
nfeatures = 2
nrandomdims = 100
ncoresamples = 40
n_mcsamples = 10000
n_mcstepsize = 0.005
n_mcburnin = int(n_mcsamples/2)
n_mcthinning = 5
ppc_samples = 100
gridstep = .1
We generate data from three Gaussian distributions. We use two-dimensional data for ease of visualization.
mu1 = np.array([-2,-2])
sigma1 = np.array([[3,.5],[.5,1]])
mu2 = np.array([2,2])
sigma2 = np.array([[1,0],[0,1]])
mu3 = np.array([-2,2.5])
sigma3 = np.array([[1,0],[0,1]])
pdf1 = stats.multivariate_normal(mu1,sigma1)
pdf2 = stats.multivariate_normal(mu2,sigma2)
pdf3 = stats.multivariate_normal(mu3,sigma3)
X_pdf1 = pdf1.rvs(size=nsamples)
y_pdf1 = np.zeros(nsamples).reshape((nsamples,1))
X_pdf2 = pdf2.rvs(size=nsamples)
y_pdf2 = np.ones(nsamples).reshape((nsamples,1))
X_pdf3 = pdf3.rvs(size=nsamples)
y_pdf3 = np.ones(nsamples).reshape((nsamples,1))
Xtr = np.vstack((X_pdf1,X_pdf2,X_pdf3))
ytr = np.vstack((y_pdf1,y_pdf2,y_pdf3))
permutation = np.random.permutation(Xtr.shape[0])
Xtr = Xtr[permutation]
ytr = ytr[permutation]
We plot the data we generated.
xmeshgrid, ymeshgrid = np.mgrid[-7:7:gridstep, -6:6:gridstep]
xymeshgrid = np.c_[xmeshgrid.ravel(), ymeshgrid.ravel()]
ycolors = np.array(['r']*ytr.shape[0])
ycolors[ytr[:,0]==1] = 'b'
plt.figure()
plt.title('Data')
plt.scatter(Xtr[:,0],Xtr[:,1], c=ycolors)
plt.axhline(y=0, color='gray')
plt.axvline(x=0, color='gray')
xcontourgrid, ycontourgrid = np.mgrid[-6:2:gridstep, -6:2:gridstep]
xycontourgrid = np.dstack((xcontourgrid, ycontourgrid))
plt.contour(xcontourgrid, ycontourgrid, pdf1.pdf(xycontourgrid))
xcontourgrid, ycontourgrid = np.mgrid[0:5:gridstep, 0:5:gridstep]
xycontourgrid = np.dstack((xcontourgrid, ycontourgrid))
plt.contour(xcontourgrid, ycontourgrid, pdf2.pdf(xycontourgrid))
xcontourgrid, ycontourgrid = np.mgrid[-5:0:gridstep, 0:5:gridstep]
xycontourgrid = np.dstack((xcontourgrid, ycontourgrid))
plt.contour(xcontourgrid, ycontourgrid, pdf3.pdf(xycontourgrid))
<matplotlib.contour.QuadContourSet at 0x7f67a96a7ba8>
We define a standard Bayesian weighted regression model as: $$ P(Y \vert \theta) \sim Bern \left\{ p = \sigma(X \theta)^w \right\} $$ where $\theta$ is the two-dimensional vector of parameters and $\sigma(x) = \frac{1}{1+e^{-x}}$ is the logistic function.
We also define the log-likelihood of the models and we use Tensorflow for the definition of the gradient of the log-likelihood.
X = tf.placeholder(tf.float64,[None,nfeatures])
W = tf.placeholder(tf.float64,[None])
theta = ed.models.Normal(loc=tf.zeros(nfeatures,dtype=tf.float64),scale=tf.ones(nfeatures,dtype=tf.float64))
likelihood = tf.sigmoid(ed.dot(X,theta))
log_likelihood = W*tf.log(likelihood)
y = ed.models.Bernoulli(probs=tf.exp(log_likelihood))
grad_log_likelihood = tf.gradients(log_likelihood,[theta])[0]
In this second part we compute and plot baselines obtained by running our computations on the dataset at once (that is, not in parallel).
We run standard logistic regression to learn a model for our data. We run the LR algorithm on the whole data and we visualize the result.
lr = lm.LogisticRegression()
lr.fit(Xtr,ytr)
probs_LR = lr.predict_proba(xymeshgrid)[:, 1].reshape(xmeshgrid.shape)
plt.title('Logistic Regression')
plt.contourf(xmeshgrid, ymeshgrid, probs_LR, 25, cmap="RdBu")
plt.scatter(Xtr[:,0],Xtr[:,1], c=ycolors)
plt.axhline(y=0, color='gray')
plt.axvline(x=0, color='gray')
/home/fmzennaro/miniconda2_1/envs/bayes3/lib/python3.6/site-packages/sklearn/utils/validation.py:578: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel(). y = column_or_1d(y, warn=True)
<matplotlib.lines.Line2D at 0x7f6799c40cc0>
Similarly, we run Bayesian logistic regression to learn a model for our data. First, we compute a MAP point estimate for $\theta$ in order to plot a line discriminator; second, we compute an estimate of the distibution of $\theta$ via Hamiltonian Monte Carlo in order to computer posterior probabilities. Finally we visualize the results.
map_theta = ed.models.PointMass(tf.Variable(tf.zeros(2,dtype=tf.float64)))
inference = ed.MAP({theta:map_theta}, {X:Xtr,W:np.ones(Xtr.shape[0],dtype=np.float64),y:ytr.reshape((ytr.shape[0]))})
inference.run()
mapx = np.linspace(-6,6,500)
mapy = -(mapx*map_theta.eval()[0]) / map_theta.eval()[1]
thetachain = ed.models.Empirical(params=tf.Variable(tf.zeros([n_mcsamples,nfeatures],dtype=tf.float64)))
inference = ed.HMC({theta:thetachain},
{X:Xtr,W:np.ones(Xtr.shape[0],dtype=np.float64),y:ytr.reshape((ytr.shape[0]))})
inference.run(step_size=n_mcstepsize)
thetahat = ed.models.Empirical(params = thetachain.params.eval()[n_mcburnin:n_mcsamples:n_mcthinning])
y_post = ed.copy(y,{theta:thetahat})
probs_BLR = [y_post.eval(feed_dict={X:xymeshgrid, W:np.ones(xymeshgrid.shape[0],dtype=np.float64)}) for _ in range(ppc_samples)]
probs_BLR = np.array(probs_BLR).mean(axis=0).reshape(xmeshgrid.shape)
plt.title('Bayes LR')
plt.contourf(xmeshgrid, ymeshgrid, probs_BLR, 25, cmap="RdBu")
plt.scatter(Xtr[:,0],Xtr[:,1], c=ycolors)
plt.plot(mapx,mapy)
plt.axhline(y=0, color='gray')
plt.axvline(x=0, color='gray')
/home/fmzennaro/miniconda2_1/envs/bayes3/lib/python3.6/site-packages/edward/util/random_variables.py:52: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`. not np.issubdtype(value.dtype, np.float) and \
1000/1000 [100%] ██████████████████████████████ Elapsed: 1s | Loss: 40.624 10000/10000 [100%] ██████████████████████████████ Elapsed: 10s | Acceptance Rate: 1.000
<matplotlib.lines.Line2D at 0x7f679004b7f0>
We compute the GIGA coreset of the full dataset. This step includes evaluating an approximate posterior via a Laplace approximation, a discretization and random projection of the log-likelihood, the computation of the coreset and the selection of samples with non-zero weight (see previous notebook for an explanation of this code). At the end, we run MAP estimation and HMC on the coreset data and we visualize the result.
qtheta = ed.models.MultivariateNormalTriL(
loc = tf.Variable(tf.zeros(nfeatures,dtype=tf.float64)),
scale_tril = tf.Variable(tf.eye(nfeatures,nfeatures,dtype=tf.float64)))
inference = ed.Laplace({theta:qtheta}, {X:Xtr,W:np.ones(Xtr.shape[0],dtype=np.float64),y:ytr.reshape((ytr.shape[0]))})
inference.run()
t0 = time.time()
randomprojection = ProjectionF(Xtr, grad_log_likelihood, nrandomdims, qtheta)
vecs = randomprojection.get()
bc_giga = bc.GIGA(vecs)
bc_giga.run(ncoresamples)
t_giga = time.time() - t0
Wt = bc_giga.weights()
Xwt = Xtr[Wt>0]
ywt = ytr[Wt>0]
wts = Wt[Wt>0]
map_theta = ed.models.PointMass(tf.Variable(tf.zeros(2,dtype=tf.float64)))
inference = ed.MAP({theta:map_theta}, {X:Xwt,W:wts/10.0,y:ywt.reshape((ywt.shape[0]))})
inference.run()
mapx = np.linspace(-6,6,500)
mapy = -(mapx*map_theta.eval()[0]) / map_theta.eval()[1]
thetachain = ed.models.Empirical(params=tf.Variable(tf.zeros([n_mcsamples,nfeatures],dtype=tf.float64)))
inference = ed.HMC({theta:thetachain},
{X:Xwt,W:wts/10.0,y:ywt.reshape((ywt.shape[0]))})
inference.run(step_size=n_mcstepsize)
thetahat = ed.models.Empirical(params = thetachain.params.eval()[n_mcburnin:n_mcsamples:n_mcthinning])
y_post = ed.copy(y,{theta:thetahat})
probs_BLR = [y_post.eval(feed_dict={X:xymeshgrid, W:np.ones(xymeshgrid.shape[0],dtype=np.float64)}) for _ in range(ppc_samples)]
probs_BLR = np.array(probs_BLR).mean(axis=0).reshape(xmeshgrid.shape)
plt.title('GIGA coreset')
ycolors_giga = np.array(['r']*ywt.shape[0])
ycolors_giga[ywt[:,0]==1] = 'b'
plt.contourf(xmeshgrid, ymeshgrid, probs_BLR, 25, cmap="RdBu")
plt.scatter(Xwt[:,0],Xwt[:,1], c=ycolors_giga, s=wts)
plt.plot(mapx,mapy)
plt.axhline(y=0, color='gray')
plt.axvline(x=0, color='gray')
/home/fmzennaro/miniconda2_1/envs/bayes3/lib/python3.6/site-packages/edward/util/random_variables.py:52: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`. not np.issubdtype(value.dtype, np.float) and \
1000/1000 [100%] ██████████████████████████████ Elapsed: 1s | Loss: 40.624 1000/1000 [100%] ██████████████████████████████ Elapsed: 1s | Loss: 8.640 10000/10000 [100%] ██████████████████████████████ Elapsed: 10s | Acceptance Rate: 1.000
<matplotlib.lines.Line2D at 0x7f6775a7abe0>
All the models (standard logistic regression, Bayesian logistic regression, and Bayesian logistic regression on the coreset) running on the whole dataset are able to identify a good discriminator.
In this third part, we randomly partition the data in four subsets. We can safely assume that each one of these datasets come from the same distribution. We compute the four coresets in parallel, then we aggregate them and examine the result.
We divide the training data in four training sub-datasets.
partitioning = np.random.permutation(Xtr.shape[0])
Xtr1 = Xtr[partitioning[0:150]]
ytr1 = ytr[partitioning[0:150]]
Xtr2 = Xtr[partitioning[150:300]]
ytr2 = ytr[partitioning[150:300]]
Xtr3 = Xtr[partitioning[300:450]]
ytr3 = ytr[partitioning[300:450]]
Xtr4 = Xtr[partitioning[450:600]]
ytr4 = ytr[partitioning[450:600]]
We compute the four coresets in parallel. We implement a helper function (compute_parallel_coreset()) that computes the coreset for a given sub-dataset. Notice that each of the four coresets selects 10 points, so that, overall, we have the same number of points as when we computed the coreset on the whole dataset (40). At the end, we collect the results and plot them.
def compute_parallel_coreset(Xtr,ytr):
qtheta = ed.models.MultivariateNormalTriL(
loc = tf.Variable(tf.zeros(nfeatures,dtype=tf.float64)),
scale_tril = tf.Variable(tf.eye(nfeatures,nfeatures,dtype=tf.float64)))
inference = ed.Laplace({theta:qtheta}, {X:Xtr,W:np.ones(Xtr.shape[0],dtype=np.float64),y:ytr.reshape((ytr.shape[0]))})
inference.run()
t0 = time.time()
randomprojection = ProjectionF(Xtr, grad_log_likelihood, nrandomdims, qtheta)
vecs = randomprojection.get()
bc_giga = bc.GIGA(vecs)
bc_giga.run(ncoresamples)
t_giga = time.time() - t0
Wt = bc_giga.weights()
Xwt = Xtr[Wt>0]
ywt = ytr[Wt>0]
wts = Wt[Wt>0]
ycolors = np.array(['r']*ywt.shape[0])
ycolors[ywt[:,0]==1] = 'b'
map_theta = ed.models.PointMass(tf.Variable(tf.zeros(2,dtype=tf.float64)))
inference = ed.MAP({theta:map_theta}, {X:Xwt,W:wts/10.0,y:ywt.reshape((ywt.shape[0]))})
inference.run()
mapx = np.linspace(-6,6,500)
mapy = -(mapx*map_theta.eval()[0]) / map_theta.eval()[1]
return Xwt,ywt,wts,mapx,mapy,ycolors,t_giga
ncoresamples = 10
Xwt1,ywt1,wts1,mapx1,mapy1,ycolors1,t1 = compute_parallel_coreset(Xtr1,ytr1)
Xwt2,ywt2,wts2,mapx2,mapy2,ycolors2,t2 = compute_parallel_coreset(Xtr2,ytr2)
Xwt3,ywt3,wts3,mapx3,mapy3,ycolors3,t3 = compute_parallel_coreset(Xtr3,ytr3)
Xwt4,ywt4,wts4,mapx4,mapy4,ycolors4,t4 = compute_parallel_coreset(Xtr4,ytr4)
fig, axes = plt.subplots(2,2)
fig.suptitle('Parallel coresets')
axes[0,0].scatter(Xwt1[:,0],Xwt1[:,1], c=ycolors1, s=wts1)
axes[0,0].plot(mapx1,mapy1)
axes[0,0].axhline(y=0, color='gray')
axes[0,0].axvline(x=0, color='gray')
axes[0,1].scatter(Xwt2[:,0],Xwt2[:,1], c=ycolors2, s=wts2)
axes[0,1].plot(mapx2,mapy2)
axes[0,1].axhline(y=0, color='gray')
axes[0,1].axvline(x=0, color='gray')
axes[1,0].scatter(Xwt3[:,0],Xwt3[:,1], c=ycolors3, s=wts3)
axes[1,0].plot(mapx3,mapy3)
axes[1,0].axhline(y=0, color='gray')
axes[1,0].axvline(x=0, color='gray')
axes[1,1].scatter(Xwt4[:,0],Xwt4[:,1], c=ycolors4, s=wts4)
axes[1,1].plot(mapx4,mapy4)
axes[1,1].axhline(y=0, color='gray')
axes[1,1].axvline(x=0, color='gray')
/home/fmzennaro/miniconda2_1/envs/bayes3/lib/python3.6/site-packages/edward/util/random_variables.py:52: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`. not np.issubdtype(value.dtype, np.float) and \
1000/1000 [100%] ██████████████████████████████ Elapsed: 1s | Loss: 13.498 1000/1000 [100%] ██████████████████████████████ Elapsed: 2s | Loss: 3.452 1000/1000 [100%] ██████████████████████████████ Elapsed: 2s | Loss: 9.804 1000/1000 [100%] ██████████████████████████████ Elapsed: 1s | Loss: 6.252 1000/1000 [100%] ██████████████████████████████ Elapsed: 2s | Loss: 16.456 1000/1000 [100%] ██████████████████████████████ Elapsed: 2s | Loss: 3.158 1000/1000 [100%] ██████████████████████████████ Elapsed: 2s | Loss: 17.184 1000/1000 [100%] ██████████████████████████████ Elapsed: 3s | Loss: 7.039
<matplotlib.lines.Line2D at 0x7f6755fc2d30>
We now collect all the coresets in a single dataset, and we perform Bayesian analysis as we did before.
Xpar = np.vstack((Xwt1,Xwt2,Xwt3,Xwt4))
ypar = np.vstack((ywt1,ywt2,ywt3,ywt4))
wpar = np.vstack((wts1.reshape((wts1.shape[0],1)),wts2.reshape((wts2.shape[0],1)),wts3.reshape((wts3.shape[0],1)),wts4.reshape((wts4.shape[0],1))))
wpar = wpar.reshape((wpar.shape[0]))
map_theta = ed.models.PointMass(tf.Variable(tf.zeros(2,dtype=tf.float64)))
inference = ed.MAP({theta:map_theta}, {X:Xpar,W:wpar/10.0,y:ypar.reshape((ypar.shape[0]))})
inference.run()
mapx = np.linspace(-6,6,500)
mapy = -(mapx*map_theta.eval()[0]) / map_theta.eval()[1]
thetachain = ed.models.Empirical(params=tf.Variable(tf.zeros([n_mcsamples,nfeatures],dtype=tf.float64)))
inference = ed.HMC({theta:thetachain},
{X:Xpar,W:wpar,y:ypar.reshape((ypar.shape[0]))})
inference.run(step_size=n_mcstepsize)
thetahat = ed.models.Empirical(params = thetachain.params.eval()[n_mcburnin:n_mcsamples:n_mcthinning])
y_post = ed.copy(y,{theta:thetahat})
probs_BLR = [y_post.eval(feed_dict={X:xymeshgrid, W:np.ones(xymeshgrid.shape[0],dtype=np.float64)}) for _ in range(ppc_samples)]
probs_BLR = np.array(probs_BLR).mean(axis=0).reshape(xmeshgrid.shape)
plt.figure()
plt.title('GIGA coreset (aggregated dataset)')
ycolors_giga = np.array(['r']*ypar.shape[0])
ycolors_giga[ypar[:,0]==1] = 'b'
plt.contourf(xmeshgrid, ymeshgrid, probs_BLR, 25, cmap="RdBu")
plt.scatter(Xpar[:,0],Xpar[:,1], c=ycolors_giga, s=wpar)
plt.plot(mapx,mapy)
plt.axhline(y=0, color='gray')
plt.axvline(x=0, color='gray')
/home/fmzennaro/miniconda2_1/envs/bayes3/lib/python3.6/site-packages/edward/util/random_variables.py:52: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`. not np.issubdtype(value.dtype, np.float) and \
1000/1000 [100%] ██████████████████████████████ Elapsed: 2s | Loss: 11.286 10000/10000 [100%] ██████████████████████████████ Elapsed: 11s | Acceptance Rate: 1.000
<matplotlib.lines.Line2D at 0x7f6754e52e10>
We compare the time to train a single coreset on the whole dataset and training four coresets in parallel.
print('Time to compute a single coreset on {0} samples: {1} secs'.format(Xtr.shape[0],t_giga))
print('Total time to compute four coresets on {0} samples: {1} secs'.format(150,t1+t2+t3+t4))
print('Individual coreset times: [{0}, {1}, {2}, {3}]'.format(t1,t2,t3,t4))
print('Average time for the four coresets: {0} secs'.format((t1+t2+t3+t4)/4.0))
Time to compute a single coreset on 600 samples: 22.029346466064453 secs Total time to compute four coresets on 150 samples: 226.08742094039917 secs Individual coreset times: [29.29443860054016, 44.80942702293396, 65.24506831169128, 86.73848700523376] Average time for the four coresets: 56.52185523509979 secs
On a simple dataset, random subsampling produces training sub-datasets that still represent the original distribution quite closely. The MAP discriminator associated with the sub-datasets shows a certain degree of variance, with some instances closer to the baseline (third sub-dataset) and other further from the baseline (second sub-dataset). However, sssembling the four coresets returns a posterior discriminator very close to the one computed from the whole dataset. Computing time seems not to be sensibly affected by the partitioning of the data. (?)
In this fourth part, we consider a more challenging partition of the data, in which the data are divided in two sub-datasets according to their generating clusters. This induces two different distributions. We compute the two coresets in parallel, then we aggregate them and examine the result.
We divide the training data in two training sub-datasets. Additionally, we plot the data to show that the two datasets are generated by different distributions.
Xtr12 = np.vstack((X_pdf1,X_pdf2))
ytr12 = np.vstack((y_pdf1,y_pdf2))
permutation = np.random.permutation(nsamples*2)
Xtr12 = Xtr12[permutation]
ytr12 = ytr12[permutation]
Xtr13 = np.vstack((X_pdf1,X_pdf3))
ytr13 = np.vstack((y_pdf1,y_pdf3))
permutation = np.random.permutation(nsamples*2)
Xtr13 = Xtr13[permutation]
ytr13 = ytr13[permutation]
fig, axes = plt.subplots(1,2)
ycolors_12 = np.array(['r']*ytr12.shape[0])
ycolors_12[ytr12[:,0]==1] = 'b'
axes[0].set_title('First sub-dataset')
axes[0].scatter(Xtr12[:,0],Xtr12[:,1], c=ycolors_12)
axes[0].axhline(y=0, color='gray')
axes[0].axvline(x=0, color='gray')
ycolors_13 = np.array(['r']*ytr13.shape[0])
ycolors_13[ytr13[:,0]==1] = 'b'
axes[1].set_title('First sub-dataset')
axes[1].scatter(Xtr13[:,0],Xtr13[:,1], c=ycolors_13)
axes[1].axhline(y=0, color='gray')
axes[1].axvline(x=0, color='gray')
<matplotlib.lines.Line2D at 0x7f6754de84a8>
We compute the two coresets in parallel. We rely on the previous helper function (compute_parallel_coreset()). Notice that each of the two coreset selects 20 points, so that, overall, we have the same number of points as when we computed the coreset on the whole dataset (40). At the end, we collect the results and plot them.
ncoresamples = 20
Xwt12,ywt12,wts12,mapx12,mapy12,ycolors12,t12 = compute_parallel_coreset(Xtr12,ytr12)
Xwt13,ywt13,wts13,mapx13,mapy13,ycolors13,t13 = compute_parallel_coreset(Xtr13,ytr13)
fig, axes = plt.subplots(1,2)
fig.suptitle('Parallel coresets')
axes[0].scatter(Xwt12[:,0],Xwt12[:,1], c=ycolors12, s=wts12)
axes[0].plot(mapx12,mapy12)
axes[0].axhline(y=0, color='gray')
axes[0].axvline(x=0, color='gray')
axes[1].scatter(Xwt13[:,0],Xwt13[:,1], c=ycolors13, s=wts13)
axes[1].plot(mapx13,mapy13)
axes[1].axhline(y=0, color='gray')
axes[1].axvline(x=0, color='gray')
/home/fmzennaro/miniconda2_1/envs/bayes3/lib/python3.6/site-packages/edward/util/random_variables.py:52: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`. not np.issubdtype(value.dtype, np.float) and \
1000/1000 [100%] ██████████████████████████████ Elapsed: 3s | Loss: 26.856 1000/1000 [100%] ██████████████████████████████ Elapsed: 3s | Loss: 3.843 1000/1000 [100%] ██████████████████████████████ Elapsed: 4s | Loss: 26.399 1000/1000 [100%] ██████████████████████████████ Elapsed: 4s | Loss: 4.593
<matplotlib.lines.Line2D at 0x7f674b092f98>
We now collect all the coresets in a single dataset, and we perform Bayesian analysis as we did before.
Xparr = np.vstack((Xwt12,Xwt13))
yparr = np.vstack((ywt12,ywt13))
wparr = np.vstack((wts12.reshape((wts12.shape[0],1)),wts13.reshape((wts13.shape[0],1))))
wparr = wparr.reshape((wparr.shape[0]))
map_theta = ed.models.PointMass(tf.Variable(tf.zeros(2,dtype=tf.float64)))
inference = ed.MAP({theta:map_theta}, {X:Xparr,W:wparr/10.0,y:yparr.reshape((yparr.shape[0]))})
inference.run()
mapx = np.linspace(-6,6,500)
mapy = -(mapx*map_theta.eval()[0]) / map_theta.eval()[1]
thetachain = ed.models.Empirical(params=tf.Variable(tf.zeros([n_mcsamples,nfeatures],dtype=tf.float64)))
inference = ed.HMC({theta:thetachain},
{X:Xparr,W:wparr/10.0,y:yparr.reshape((yparr.shape[0]))})
inference.run(step_size=n_mcstepsize)
thetahat = ed.models.Empirical(params = thetachain.params.eval()[n_mcburnin:n_mcsamples:n_mcthinning])
y_post = ed.copy(y,{theta:thetahat})
probs_BLR = [y_post.eval(feed_dict={X:xymeshgrid, W:np.ones(xymeshgrid.shape[0],dtype=np.float64)}) for _ in range(ppc_samples)]
probs_BLR = np.array(probs_BLR).mean(axis=0).reshape(xmeshgrid.shape)
plt.figure()
plt.title('GIGA coreset (aggregated dataset)')
ycolors_giga = np.array(['r']*yparr.shape[0])
ycolors_giga[yparr[:,0]==1] = 'b'
plt.contourf(xmeshgrid, ymeshgrid, probs_BLR, 25, cmap="RdBu")
plt.scatter(Xparr[:,0],Xparr[:,1], c=ycolors_giga, s=wparr)
plt.plot(mapx,mapy)
plt.axhline(y=0, color='gray')
plt.axvline(x=0, color='gray')
/home/fmzennaro/miniconda2_1/envs/bayes3/lib/python3.6/site-packages/edward/util/random_variables.py:52: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`. not np.issubdtype(value.dtype, np.float) and \
1000/1000 [100%] ██████████████████████████████ Elapsed: 4s | Loss: 8.286 10000/10000 [100%] ██████████████████████████████ Elapsed: 17s | Acceptance Rate: 1.000
<matplotlib.lines.Line2D at 0x7f6748c55d68>
We compare the time to train a single coreset on the whole dataset and training four coresets in parallel.
print('Time to compute a single coreset on {0} samples: {1} secs'.format(Xtr.shape[0],t_giga))
print('Total time to compute two coresets on {0} samples: {1} secs'.format(150,t12+t13))
print('Individual coreset times: [{0}, {1}]'.format(t12,t13))
print('Average time for the four coresets: {0} secs'.format((t12+t13)/2.0))
Time to compute a single coreset on 600 samples: 22.029346466064453 secs Total time to compute two coresets on 150 samples: 267.29918479919434 secs Individual coreset times: [120.1095643043518, 147.18962049484253] Average time for the four coresets: 133.64959239959717 secs
When partitioning the data according to their generating clusters, we define two sub-datasets with different distribution. Not surprisingly, the MAP discriminators associated with the two sub-datasets are sensibly different. However, once again, assembling the coresets returns a posterior discriminator very close to the one computed from the whole dataset. Computing time seems not to be sensibly affected by the partitioning of the data. (?)
[1] Campbell, T. & Broderick, T. Automated Scalable Bayesian Inference via Hilbert Coresets arXiv preprint arXiv:1710.05053, 2017.
[2] Campbell, T. & Broderick, T. Bayesian coreset construction via greedy iterative geodesic ascent arXiv preprint arXiv:1802.01737, 2018
[3] Bayesian Coresets: Automated, Scalable Inference
[4] Tran, D.; Kucukelbir, A.; Dieng, A. B.; Rudolph, M.; Liang, D. & Blei, D. M. Edward: A library for probabilistic modeling, inference, and criticism arXiv preprint arXiv:1610.09787, 2016