Random Effects Neural Networks with Edward and Keras

Bayesian probabilistic models provide a nimble and expressive framework for modeling "small-world" data. In contrast, deep learning offers a more rigid yet much more powerful framework for modeling data of massive size. Edward is a probabilistic programming library that bridges this gap: "black-box" variational inference enables us to fit extremely flexible Bayesian models to large-scale data. Furthermore, these models themselves may take advantage of classic deep-learning architectures of arbitrary complexity.

Edward uses TensorFlow for symbolic gradients and data flow graphs. As such, it interfaces cleanly with other libraries that do the same, namely TF-Slim, PrettyTensor and Keras. Personally, I've been working often with the latter, and am consistently delighted by the ease with which it allows me to specify complex neural architectures.

The aim of this work is to lay a practical foundation for Bayesian modeling in Edward, then explore how, and how easily, we can extend these models in the direction of classical deep learning via Keras. It will give both a conceptual overview of the models below, as well as notes on the practical considerations of their implementation — what worked and what didn't. Finally, this work will conclude with concrete ways in which to extend these models further, of which there are many.

If you're just getting started with Edward or Keras, I recommend first perusing the Edward tutorials and Keras documentation respectively.

To "pull us down the path," we build three models in additive fashion: a Bayesian linear regression model, a Bayesian linear regression model with random effects, and a neural network with random effects. We fit them on the Zillow Prize dataset, which asks us to predict logerror (in house-price estimate, i.e. the "Zestimate") given metadata for a list of homes. These models are intended to be demonstrative, not performant: they will not win you the prize in their current form.

In [1]:
import edward as ed
from edward.models import Normal
from keras.layers import Input, Dense
from keras.regularizers import l2
from keras import backend as K
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import scale

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline
In [2]:
# ensure you are using TensorFlow as your Keras backend
sess = ed.get_session()

INIT_OP = tf.global_variables_initializer()

Data preparation

After importing the data we rename its columns as per Philipp Spachtholz's Exploratory Analysis - Zillow kernel on kaggle.com.

In [3]:
properties_df = pd.read_csv('data/properties.csv', low_memory=False)
transactions_df = pd.read_csv('data/transactions.csv')
In [4]:
properties_df = properties_df.rename(columns={
    'parcelid': 'id_parcel',
    'yearbuilt': 'build_year',
    'basementsqft': 'area_basement',
    'yardbuildingsqft17': 'area_patio',
    'yardbuildingsqft26': 'area_shed',
    'poolsizesum': 'area_pool',
    'lotsizesquarefeet': 'area_lot',
    'garagetotalsqft': 'area_garage',
    'finishedfloor1squarefeet': 'area_firstfloor_finished',
    'calculatedfinishedsquarefeet': 'area_total_calc',
    'finishedsquarefeet6': 'area_base',
    'finishedsquarefeet12': 'area_live_finished',
    'finishedsquarefeet13': 'area_liveperi_finished',
    'finishedsquarefeet15': 'area_total_finished',
    'finishedsquarefeet50': 'area_unknown',
    'unitcnt': 'num_unit',
    'numberofstories': 'num_story',
    'roomcnt': 'num_room',
    'bathroomcnt': 'num_bathroom',
    'bedroomcnt': 'num_bedroom',
    'calculatedbathnbr': 'num_bathroom_calc',
    'fullbathcnt': 'num_bath',
    'threequarterbathnbr': 'num_75_bath',
    'fireplacecnt': 'num_fireplace',
    'poolcnt': 'num_pool',
    'garagecarcnt': 'num_garage',
    'regionidcounty': 'region_county',
    'regionidcity': 'region_city',
    'regionidzip': 'region_zip',
    'regionidneighborhood': 'region_neighbor',
    'taxvaluedollarcnt': 'tax_total',
    'structuretaxvaluedollarcnt': 'tax_building',
    'landtaxvaluedollarcnt': 'tax_land',
    'taxamount': 'tax_property',
    'assessmentyear': 'tax_year',
    'taxdelinquencyflag': 'tax_delinquency',
    'taxdelinquencyyear': 'tax_delinquency_year',
    'propertyzoningdesc': 'zoning_property',
    'propertylandusetypeid': 'zoning_landuse',
    'propertycountylandusecode': 'zoning_landuse_county',
    'fireplaceflag': 'flag_fireplace',
    'hashottuborspa': 'flag_tub',
    'buildingqualitytypeid': 'quality',
    'buildingclasstypeid': 'framing',
    'typeconstructiontypeid': 'material',
    'decktypeid': 'deck',
    'storytypeid': 'story',
    'heatingorsystemtypeid': 'heating',
    'airconditioningtypeid': 'aircon',
    'architecturalstyletypeid': 'architectural_style'

transactions_df = transactions_df.rename(columns={
  'parcelid': 'id_parcel',
  'transactiondate': 'date'

Build training DataFrame

In [5]:
data = transactions_df.merge(properties_df, how='left', left_on='id_parcel', right_on='id_parcel')

Drop columns containing too many nulls

Bayesian probabilistic models allow us to flexibly model missing data itself. To this end, we conceive of a given predictor as a vector of both:

  1. Observed values.
  2. Parameters in place of missing values, which will form a posterior distribution for what this value might have been.

In a (partially-specified, for brevity) linear model, this might look as follows:

$$ y_i \sim \mathcal{N}(\mu_i, \sigma)\\ \mu_i = \alpha + \beta_N N_i\\ N_i \sim \mathcal{N}(\nu, \sigma_N)\\ $$

where $N_i$ is our sometimes-missing predictor. When $N_i$ is observed, $\mathcal{N}(\nu, \sigma_N)$ serves as a likelihood: given this data-point, we tweak retrodictive distributions on the parameters $(\nu, \sigma_N)$ by which it was produced. Conversely, when $N_i$ is missing it serves as a prior: after learning distributions of $(\nu, \sigma_N)$ we can generate a likely value of $N_i$ itself. Finally, inference will give us (presumably-wide) distributions on the model's belief in what was the true value of each missing $N_i$ conditional on the data observed.

I tried this in Edward, albeit briefly, to no avail. Dustin Tran gives an example of how one might accomplish this task in the case of Gaussian Matrix Factorization. In my case, I wasn't able to apply a 2-D missing-data-mask placeholder to a 2-D data placeholder via tf.gather nor tf.gather_nd. With more effort, I'm sure I could make this work. Help appreciated.

For now, we'll first drop columns containing too many null values, then, after choosing a few of the predictors most correlated with the target, drop the remaining rows containing nulls.

In [6]:
keep_cols = data.columns[ data.isnull().mean() < .25 ]
data = data[keep_cols]

Which columns are most correlated with the target?

In [7]:
float_cols = [col for col in data.columns if data[col].dtype == np.float64]

logerror               1.000000
area_live_finished     0.041922
area_total_calc        0.038784
num_bathroom_calc      0.029448
num_bath               0.028845
num_bathroom           0.027889
num_bedroom            0.025467
tax_building           0.022085
build_year             0.017312
censustractandblock    0.008892
Name: logerror, dtype: float64

Drop rows with null values

In [8]:

Select three fixed-effect predictors

In [9]:
fixed_effect_predictors = [

Select one random-effect predictor

In [10]:
zip_codes = data['region_zip'].astype('category').cat.codes

Split data into train, validation sets

In [11]:
train_index = data.sample(frac=0.5).index
val_index = data.drop(train_index).index

X = data.drop('logerror', axis=1)[fixed_effect_predictors]
X = scale(X)
y = data['logerror'].values

X_train = X[train_index]
y_train = y[train_index]
X_val = X[val_index]
y_val = y[val_index]

print('Dataset sizes:')
print(f'    X_train:  {X_train.shape}')
print(f'    X_val:    {X_val.shape}')
print(f'    y_train:  {y_train.shape}')
print(f'    y_val:    {y_val.shape}')
Dataset sizes:
    X_train:  (36986, 3)
    X_val:    (36986, 3)
    y_train:  (36986,)
    y_val:    (36986,)

Bayesian linear regression

Using three fixed-effect predictors we'll fit a model of the following form:

$$ y_i \sim \mathcal{N}(\mu_i, 1)\\ \mu_i = \alpha + \beta x_i\\ \alpha \sim \mathcal{N}(0, 1)\\ \beta \sim \mathcal{N}(0, 1)\\ $$

Having normalized our data to have mean 0 and unit-variance, we place our priors on a similar scale.

To infer posterior distributions of the model's parameters conditional on the data observed we employ variational inference — one of three inference classes supported in Edward. This approach posits posterior inference as posterior approximation via optimization, where optimization is done via stochastic, gradient-based methods. This is what enables us to scale complex probabilistic functional forms to large-scale data.

For an introduction to variational inference and Edward's API thereof, please reference:

Additionally, I provide an introduction to the basic math behind variational inference and the ELBO in a blog post of mine: Further Exploring Common Probabilistic Models.

Fit model

For the approximate q-distributions, we apply the softplus functionlog(exp(z) + 1) — to the scale parameter values at the suggestion of the Edward docs.

In [12]:
N, D = X_train.shape

# fixed-effects placeholders
fixed_effects = tf.placeholder(tf.float32, [N, D])

# fixed-effects parameters
β_fixed_effects = Normal(loc=tf.zeros(D), scale=tf.ones(D))
α = Normal(loc=tf.zeros(1), scale=tf.ones(1))

# model
μ_y = α + ed.dot(fixed_effects, β_fixed_effects)
y = Normal(loc=μ_y, scale=tf.ones(N))

# approximate fixed-effects distributions 
qβ_fixed_effects = Normal(
 = Normal(

Infer parameters

In [13]:
latent_vars = {
    β_fixed_effects: qβ_fixed_effects,

inference = ed.KLqp(latent_vars, data={fixed_effects: X_train, y: y_train})
inference.run(n_samples=5, n_iter=250)
250/250 [100%] ██████████████████████████████ Elapsed: 4s | Loss: 35405.105

Criticize model

In [14]:
def visualize_data_fit(X, y, β, α, title_prefix, n_samples=10):
    '''Plot lines generated via samples from parameter distributions of the first 
    two fixed effects, vs. observed data points.
        X (np.array) : A design matrix of observed fixed effects.
        y (np.array) : A vector of observed responses.
        β (ed.RandomVariable) : A multivariate distribution of fixed-effect parameters.
        α (ed.RandomVariable) : A univariate distribution of the model's intercept term.
        title_prefix (str) : A string to append to the beginning of the title.
        n_samples (int) : The number of lines to plot as drawn from the parameter distributions.
    # draw samples from parameter distributions
    β_samples = β.sample(n_samples).eval()
    α_samples = α.sample(n_samples).eval()
    # plot the first two dimensions of `X`, vs. `y`
    fig = plt.figure(figsize=(12, 9))
    ax = plt.axes(projection='3d')
    ax.scatter(X[:, 0], X[:, 1], y)
    plt.title(f'{title_prefix} Parameter Samples vs. Observed Data')
    # plot lines defined by parameter samples
    inputs = np.linspace(-10, 10, num=500)
    for i in range(n_samples):
        output = inputs * β_samples[i][0] + inputs * β_samples[i][1] + α_samples[i][0]
        ax.plot(inputs, inputs, output)

Visualize data fit given parameter priors

In [15]:
visualize_data_fit(X_train, y_train, β_fixed_effects, α, 'Prior', n_samples=10)

Visualize data fit given parameter posteriors

In [16]:
visualize_data_fit(X_train, y_train, qβ_fixed_effects, , 'Posterior', n_samples=10)

It appears as if our model fits the data along the first two dimensions. This said, we could improve this fit considerably. This will become apparent when we compute the MAE on our validation set.

Inspect residuals

In [17]:
def compute_mean_absolute_error(y_posterior, X_val_feed_dict, y_val=y_val):
    data = {y_posterior: y_val}
    mae = ed.evaluate('mean_absolute_error', data=data)
    print(f'Mean absolute error on validation data: {mae:1.5}')
def plot_residuals(y_posterior, X_val_feed_dict, title, y_val=y_val):
    y_posterior_preds = y_posterior.eval(feed_dict=X_val_feed_dict)
    plt.figure(figsize=(9, 6))
    plt.hist(y_posterior_preds - y_val, edgecolor='white', linewidth=1, bins=30, alpha=.7)
    plt.axvline(0, color='#A60628', linestyle='--')
    plt.xlabel('`y_posterior_preds - y_val`', fontsize=14)
    plt.ylabel('Count', fontsize=14)
    plt.title(title, fontsize=16)
In [18]:
param_posteriors = {
    β_fixed_effects: qβ_fixed_effects.mean(),
    α: .mean()
X_val_feed_dict = {
    fixed_effects: X_val
y_posterior = ed.copy(y, param_posteriors)
In [19]:
print(f'Mean validation `logerror`: {y_val.mean()}')
compute_mean_absolute_error(y_posterior, X_val_feed_dict)
Mean validation `logerror`: 0.012986094738549725
Mean absolute error on validation data: 0.089943
In [20]:
plot_residuals(y_posterior, X_val_feed_dict, title='Linear Regression Residuals')

"The residuals appear normally distributed with mean 0: this is a good sanity check for the model."1 However, with respect to the magnitude of the mean of the validation logerror, our validation score is terrible. This is likely due to the fact that three predictors are not nearly sufficient for capturing the variation in the response. (Additionally, because the response itself is an error, it should be fundamentally harder to capture than the thing actually being predicted — the house price. This is because Zillow's team has already built models to capture this signal, then effectively threw the remaining "uncaptured" signal into this competition, i.e. "figure out how to get right the little that we got wrong.")

Inspect parameter posteriors

In [21]:
# draw samples from posteriors
qβ_fixed_effects_samples = qβ_fixed_effects.sample(1000).eval()
qα_samples = .sample(1000).eval()

# plot samples
plt.figure(figsize=(16, 10))

for dimension in range(D):
    subplot = plt.subplot(221 + dimension)
    plt.hist(qβ_fixed_effects_samples[:, dimension], edgecolor='white', linewidth=1, bins=30, alpha=.7)
    plt.axvline(0, color='#A60628', linestyle='--')
    title = f'Posterior Distribution of `{fixed_effect_predictors[dimension]}` Effect'
    plt.ylabel('Count', fontsize=14)
    plt.title(title, fontsize=16)
subplot = plt.subplot(221 + dimension + 1)
plt.hist(qα_samples, edgecolor='white', linewidth=1, bins=30, alpha=.7)
plt.axvline(0, color='#A60628', linestyle='--')
title = f'Posterior Distribution of Fixed Intercept α'
plt.ylabel('Count', fontsize=14)
plt.title(title, fontsize=15)
<matplotlib.text.Text at 0x1206147f0>

In keeping with the definition of multivariate linear regression itself, the above parameter posteriors tell us: "conditional on the assumption that the log-error and fixed effects can be related by a straight line, what is the predictive value of one variable once I already know the values of all other variables?"2

Bayesian linear regression with random effects

Random effects models — also known as hierarchical models — allow us to ascribe distinct behaviors to different "clusters" of observations, i.e. groups that may each act in a materially unique way. Furthermore, these models allow us to infer these tendencies in a collaborative fashion: while each cluster is assumed to behave differently, it can learn its parameters by heeding to the behavior of the population at large. In this example, we assume that houses in different zipcodes — holding all other predictors constant — should be priced in different ways.

For clarity, let's consider the two surrounding extremes:

  1. Estimate a single set of parameters for the population, i.e. the vanilla, scikit-learn linear regression, Bayesian or not. This confers no distinct behaviors to houses in different zipcodes.
  2. Estimate a set of parameters for each individual zipcode, i.e. split the data into its cluster groups and estimate a single model for each. This confers maximally distinct behaviors to houses in different zip codes: the behavior of one cluster knows nothing about that of the others.

Random-effects models "walk the line" between these two approaches — between maximally underfitting and maximally overfitting the behavior of each cluster. To this effect, its parameter estimates exhibit the canonical "shrinkage" phenomenon: the estimate for a given parameter is balanced between the within-cluster expectation and the global expectation. Smaller clusters exhibit larger shrinkage; larger clusters, i.e. those for which we've observed more data, are more bullheaded (in typical Bayesian fashion). A later plot illustrates this point.

We specify our random-effects functional form as follows:

μ_y = α + α_random_effects + ed.dot(fixed_effects, β_fixed_effects)
y = Normal(loc=μ_y, scale=tf.ones(N))

With respect to the previous model, we've simply added α_random_effects to the mean of our response. As such, this is a varying-intercepts model: the intercept term will be different for each cluster. To this end, we learn the global intercept α as well as the offsets from this intercept α_random_effects — a random variable with as many dimensions as there are zipcodes. In keeping with the notion of "offset," we ascribe it a prior of (0, σ_zc). This approach allows us to flexibly extend the model to include more random effects, e.g. city, architecture style, etc. With only one, however, we could have equivalently included the global intercept inside of our prior, i.e. α_random_effects ~ Normal(α, σ_zc), with priors on both α and σ_zc as per usual. This way, our random effect would no longer be a zip-code-specific offset from the global intercept, but a vector of zip-code-specific intercepts outright.

Finally, as Richard McElreath notes, "we can think of the σ_zc parameter for each cluster as a crude measure of that cluster's "relevance" in explaining variation in the response variable."3

Fit model

In [22]:
n_zip_codes = len(set(zip_codes))

# random-effect placeholder
zip_codes_ph = tf.placeholder(tf.int32, [N])

# random-effect parameter
σ_zip_code = tf.sqrt(tf.exp(tf.Variable(tf.random_normal([]))))
α_zip_code = Normal(loc=tf.zeros(n_zip_codes), scale=σ_zip_code * tf.ones(n_zip_codes))
# model
α_random_effects = tf.gather(α_zip_code, zip_codes_ph)
μ_y = α + α_random_effects + ed.dot(fixed_effects, β_fixed_effects)
y = Normal(loc=μ_y, scale=tf.ones(N))

# approximate random-effect distribution
qα_zip_code = Normal(

Infer parameters

In [23]:
latent_vars = {
    β_fixed_effects: qβ_fixed_effects,
    α: ,
    α_zip_code: qα_zip_code

inference = ed.KLqp(latent_vars, data={fixed_effects: X_train, zip_codes_ph: zip_codes[train_index], y: y_train})
inference.run(n_samples=5, n_iter=250)
250/250 [100%] ██████████████████████████████ Elapsed: 6s | Loss: 34898.613

Criticize model

In [24]:
param_posteriors = {
    β_fixed_effects: qβ_fixed_effects.mean(),
    α: .mean(),
    α_zip_code: qα_zip_code.mean()
X_val_feed_dict = {
    fixed_effects: X_val,
    zip_codes_ph: zip_codes[val_index]
y_posterior = ed.copy(y, param_posteriors)
In [25]:
compute_mean_absolute_error(y_posterior, X_val_feed_dict)
Mean absolute error on validation data: 0.084635

Inspect residuals

In [26]:
plot_residuals(y_posterior, X_val_feed_dict, title='Linear Regression with Random Effects Residuals')

Plot shrinkage

To illustrate shrinkage we'll pare our model down to intercepts only (removing the fixed effects entirely). We'll first fit a random-effects model on the full dataset then compute the cluster-specific-intercept posterior means. Next, we'll fit a separate model to each individual cluster and compute the intercept posterior mean of each. The plot below shows how estimates from the former can be viewed as "estimates from the latter — shrunk towards the global-intercept posterior mean."

Finally, blue, green and orange points represent small, medium and large clusters respectively. As mentioned before, the larger the cluster size, i.e. the more data points we've observed belonging to a given cluster, the less prone it is to shrinkage towards the mean.

Estimate random-effects intercepts

In [27]:
# model
μ_y = α + α_random_effects
y = Normal(loc=μ_y, scale=tf.ones(N))

latent_vars = {
    α: ,
    α_zip_code: qα_zip_code

# infer parameters
inference = ed.KLqp(latent_vars, data={zip_codes_ph: zip_codes[train_index], y: y_train})
inference.run(n_samples=5, n_iter=250)

# compute global intercept posterior mean
global_intercept_posterior_mean = .mean().eval()

# compute random-effects posterior means
random_effects_posterior_means = global_intercept_posterior_mean + qα_zip_code.mean().eval()
random_effects_posterior_means = pd.Series(random_effects_posterior_means, index=range(0, n_zip_codes),
250/250 [100%] ██████████████████████████████ Elapsed: 5s | Loss: 34515.773 0s | Loss: 34524

Estimate within-cluster intercepts

In [28]:
# select a few clusters of varying size
zip_codes_df = pd.DataFrame({'cluster_size': zip_codes[train_index].value_counts()})
zip_codes_df['cluster_size_group'] = pd.cut(zip_codes_df['cluster_size'], 3, labels=['small', 'medium', 'large'])
zip_codes_df = zip_codes_df.groupby('cluster_size_group').head(20)

# build individual models for each cluster
within_cluster_posterior_means = {}

for zip_code in zip_codes_df.index.unique():
    # compute mask, number of observations
    mask = zip_codes[train_index] == zip_code
    N_ = mask.sum()
    # instantiate model for current cluster
    fixed_effects = tf.placeholder(tf.float32, [N_, D])
    μ_y = α
    y = Normal(loc=μ_y, scale=tf.ones(N_))
    # fit model
    inference = ed.KLqp(latent_vars, data={y: y_train[mask]})
    inference.run(n_samples=5, n_iter=250)
    # compute mean of qα for this zip code
    within_cluster_posterior_means[zip_code] = .mean().eval()[0]
within_cluster_posterior_means = pd.Series(within_cluster_posterior_means, name='within_cluster_posterior_means')
250/250 [100%] ██████████████████████████████ Elapsed: 4s | Loss: 407.149
250/250 [100%] ██████████████████████████████ Elapsed: 3s | Loss: 339.477
250/250 [100%] ██████████████████████████████ Elapsed: 4s | Loss: 314.130
250/250 [100%] ██████████████████████████████ Elapsed: 4s | Loss: 306.847
250/250 [100%] ██████████████████████████████ Elapsed: 4s | Loss: 276.160
250/250 [100%] ██████████████████████████████ Elapsed: 4s | Loss: 267.080
250/250 [100%] ██████████████████████████████ Elapsed: 4s | Loss: 253.680 1s |
250/250 [100%] ██████████████████████████████ Elapsed: 4s | Loss: 236.092
250/250 [100%] ██████████████████████████████ Elapsed: 4s | Loss: 240.532
250/250 [100%] ██████████████████████████████ Elapsed: 4s | Loss: 233.505
250/250 [100%] ██████████████████████████████ Elapsed: 4s | Loss: 232.273
250/250 [100%] ██████████████████████████████ Elapsed: 4s | Loss: 229.171 0s | Loss
250/250 [100%] ██████████████████████████████ Elapsed: 5s | Loss: 217.063
250/250 [100%] ██████████████████████████████ Elapsed: 4s | Loss: 211.964
250/250 [100%] ██████████████████████████████ Elapsed: 4s | Loss: 211.962
250/250 [100%] ██████████████████████████████ Elapsed: 4s | Loss: 207.407
250/250 [100%] ██████████████████████████████ Elapsed: 5s | Loss: 199.322
250/250 [100%] ██████████████████████████████ Elapsed: 4s | Loss: 199.557
250/250 [100%] ██████████████████████████████ Elapsed: 4s | Loss: 197.278
250/250 [100%] ██████████████████████████████ Elapsed: 4s | Loss: 195.386
250/250 [100%] ██████████████████████████████ Elapsed: 5s | Loss: 191.565
250/250 [100%] ██████████████████████████████ Elapsed: 4s | Loss: 187.577
250/250 [100%] ██████████████████████████████ Elapsed: 4s | Loss: 190.232
250/250 [100%] ██████████████████████████████ Elapsed: 5s | Loss: 185.698
250/250 [100%] ██████████████████████████████ Elapsed: 4s | Loss: 188.302
250/250 [100%] ██████████████████████████████ Elapsed: 5s | Loss: 138.870
250/250 [100%] ██████████████████████████████ Elapsed: 6s | Loss: 135.920
250/250 [100%] ██████████████████████████████ Elapsed: 5s | Loss: 135.151
250/250 [100%] ██████████████████████████████ Elapsed: 5s | Loss: 136.043
250/250 [100%] ██████████████████████████████ Elapsed: 5s | Loss: 136.001
250/250 [100%] ██████████████████████████████ Elapsed: 5s | Loss: 136.464 1s | Loss:
250/250 [100%] ██████████████████████████████ Elapsed: 5s | Loss: 134.585
250/250 [100%] ██████████████████████████████ Elapsed: 6s | Loss: 133.699
250/250 [100%] ██████████████████████████████ Elapsed: 6s | Loss: 132.002
250/250 [100%] ██████████████████████████████ Elapsed: 5s | Loss: 131.619
250/250 [100%] ██████████████████████████████ Elapsed: 5s | Loss: 130.199 0s | Loss: 130.0
250/250 [100%] ██████████████████████████████ Elapsed: 6s | Loss: 129.326
250/250 [100%] ██████████████████████████████ Elapsed: 5s | Loss: 129.478
250/250 [100%] ██████████████████████████████ Elapsed: 6s | Loss: 128.817
250/250 [100%] ██████████████████████████████ Elapsed: 6s | Loss: 132.139
250/250 [100%] ██████████████████████████████ Elapsed: 7s | Loss: 128.637 2s | Los
250/250 [100%] ██████████████████████████████ Elapsed: 6s | Loss: 128.923
250/250 [100%] ██████████████████████████████ Elapsed: 7s | Loss: 126.815 3s | Loss: E
250/250 [100%] ██████████████████████████████ Elapsed: 6s | Loss: 126.686
250/250 [100%] ██████████████████████████████ Elapsed: 6s | Loss: 132.182
In [29]:
# prepare for plotting: join, give colors, sort, reset index
zip_codes_df = zip_codes_df\
zip_codes_df['cluster_size_color'] = zip_codes_df['cluster_size_group'].map({'small': '#377eb8', 'medium': '#4daf4a', 'large': '#ff7f00'})
zip_codes_df = zip_codes_df\
In [30]:
# plot shrinkage
plt.figure(figsize=(14, 10))
plt.xlim(-1, len(zip_codes_df) + 1)

plt.scatter(zip_codes_df.index, zip_codes_df['random_effects_posterior_means'], facecolors='none',
            edgecolors=zip_codes_df['cluster_size_color'], s=50, linewidth=1, 
            alpha=1, label='Random-Effects-α Posterior Mean')
plt.scatter(zip_codes_df.index, zip_codes_df['within_cluster_posterior_means'], c=zip_codes_df['cluster_size_color'], 
            s=100, alpha=.7, label='Within-Cluster-α Posterior Mean')
plt.axhline(global_intercept_posterior_mean, color='#A60628', linestyle='--', label='Global-α Mean')
plt.xlabel('Cluster \n(In Order of Increasing Cluster Size)', fontsize=14)
plt.ylabel('α', fontsize=14)
plt.title('Shrinkage in Mean-α Estimates from Within-Cluster vs. Random Effects Linear Regression', fontsize=16)
<matplotlib.legend.Legend at 0x15d02bd30>

Neural network with random effects

Neural networks are powerful function approximators. Keras is a library that lets us flexibly define complex neural architectures. Thus far, we've been approximating the relationship between our fixed effects and response variable with a simple dot product; can we leverage Keras to make this relationship more expressive? Is it painless? Finally, how does it integrate with Edward's existing APIs and constructs? Can we couple nimble generative models with deep neural networks?

While my experimentation was brief, all answers point delightfully towards "yes" for two simple reasons:

  1. Edward and Keras both run on TensorFlow.
  2. "Black-box" variational inference makes everything scale.

This said, we must be nonetheless explicit about what's "Bayesian" and what's not, i.e. for which parameters do we infer full (approximate) posterior distributions, and for which do we infer point estimates of the posterior distribution.

Below, we drop a neural_network in place of our dot product. Our latent variables remain β_fixed_effects, α and α_zip_code: while we will infer their full (approximate) posterior distributions as before, we'll only compute point estimates for the parameters of the neural network as in the typical case. Conversely, to the best of my knowledge, to infer full distributions for the latter, we'll need to specify our network manually in raw TensorFlow, i.e. ditch Keras entirely. We then treat our weights and biases as standard latent variables and infer their approximate posteriors via variational inference. Edward's documentation contains a straightforward tutorial to this end.

Fit model

In [31]:
def neural_network(fixed_effects, λ=.001, input_dim=D):
    dense = Dense(5, activation='tanh', kernel_regularizer=l2(λ))(fixed_effects)
    output = Dense(1, activation='linear', name='output', kernel_regularizer=l2(λ))(dense)
    return K.squeeze(output, axis=1)

# model
fixed_effects = tf.placeholder(tf.float32, [N, D])
μ_y = α + α_random_effects + neural_network(fixed_effects)
y = Normal(loc=μ_y, scale=tf.ones(N))
In [32]:
latent_vars = {
    β_fixed_effects: qβ_fixed_effects,
    α: ,
    α_zip_code: qα_zip_code

inference = ed.KLqp(latent_vars, data={fixed_effects: X_train, zip_codes_ph: zip_codes[train_index], y: y_train})
optimizer = tf.train.RMSPropOptimizer(0.01, epsilon=1.0)
inference.run(n_samples=5, n_iter=1000)
1000/1000 [100%] ██████████████████████████████ Elapsed: 18s | Loss: 34446.191

Criticize model

In [33]:
param_posteriors = {
    β_fixed_effects: qβ_fixed_effects.mean(),
    α: .mean(),
    α_zip_code: qα_zip_code.mean()
X_val_feed_dict = {
    fixed_effects: X_val,
    zip_codes_ph: zip_codes[val_index]
y_posterior = ed.copy(y, param_posteriors)
In [34]:
compute_mean_absolute_error(y_posterior, X_val_feed_dict)
Mean absolute error on validation data: 0.081484

Inspect residuals

In [35]:
plot_residuals(y_posterior, X_val_feed_dict, title='Neural Network with Random Effects Residuals')

Future work

We've now laid a stable, if trivially simple foundation for building models with Edward and Keras. From here, I see two distinct paths to building more expressive probabilistic models using these tools:

  1. Build probabilistic models in Edward, and abstract deep-network-like subgraphs into Keras layers. This allows us to flexibly define complex neural architectures, e.g. a video question answering model, with a nominal amount of code, yet restricts us from, or at least makes it awkward to, infer full posterior distributions for the subgraph parameters.
  2. Build probabilistic models in Edward, and specify deep-network-like subgraphs with raw TensorFlow — ditching Keras entirely. Defining deep-network-like subgraphs becomes more cumbersome, while inferring full posterior distributions for the subgraph parameters becomes more natural and consistent with the flow of Edward code.

This work has shown a few basic variants of (generalized) Bayesian linear regression models. From here, there's tons more to explore — varying-slopes models, Gaussian process regression, mixture models and probabilistic matrix factorizations to name a random few.

Edward and Keras have proven a flexible, expressive and powerful duo for performing inference in deep probabilistic models. The models we built were simple; the only direction to go, and to go rather painlessly, is more.

Many thanks for reading.


  1. Edward - Linear Mixed Effects Models
  2. McElreath, Richard. "Chapter 5." Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Boca Raton, FL: CRC, Taylor & Francis Group, 2016. N. pag. Print.
  3. McElreath, Richard. "Chapter 12." Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Boca Raton, FL: CRC, Taylor & Francis Group, 2016. N. pag. Print.
  4. The Best Of Both Worlds: Hierarchical Linear Regression in PyMC3
  5. Keras as a simplified interface to TensorFlow
  6. Mixture Density Networks with Edward, Keras and TensorFlow
  7. Use a Hierarchical Model
  8. McElreath, Richard. "Chapter 14." Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Boca Raton, FL: CRC, Taylor & Francis Group, 2016. N. pag. Print.