Robust Vector Pooling

We can extend robust pooling to work over vector arguments. Denote by $\phi$ the penalty function. Then robust vector averaging with input $\{x_i \in \mathbb{R}^m \mid i = 1, \ldots, n\}$ and output $y \in \mathbb{R}^m$ finds the solution to the optimization problem

$$ y \in \text{arg min}_u \sum_{i=1}^{n} \phi(\|u - x_i\|; \alpha) $$

where $\alpha$ is a parameter of the penalty function. We implement this operation using a ddn.basic node for demonstration and can be used in your own code by importing RobustVectorAverage from ddn.basic.robust_nodes. There is also an efficient version called RobustVectorPool2d in the ddn.pytorch.robust_vec_pool module for use in Pytorch applications. That version follows the mathematical derivation shown at the end of this notebook.

In [1]:
%matplotlib notebook

import sys
sys.path.append("../")

import scipy.optimize as opt
from ddn.basic.node import *

class RobustVectorAverage(NonUniqueDeclarativeNode):
    """
    Solves for the multi-dimensional robust average,
        minimize f(x, y) = \sum_{i=1}^{n} phi(\|y - x_i\|; alpha)
    where phi(z; alpha) is one of the following robust penalties,
        'quadratic':    1/2 z^2
        'pseudo-huber': alpha^2 (\sqrt(1 + (z/alpha)^2 - 1)
        'huber':        1/2 z^2 for |z| <= alpha and alpha |z| - 1/2 alpha^2 otherwise
        'welsch':       1 - exp(-z^2 / 2 alpha^2)
        'trunc-quad':   1/2 z^2 for |z| <= alpha and 1/2 alpha^2 otherwise

    The input is assumed to be flattened from an (n \times m)-matrix to an nm-vector.
    """

    restarts = 10 # number of random restarts when solving non-convex penalties

    def __init__(self, n, m, penalty='huber', alpha=1.0):
        assert (alpha > 0.0)
        self.alpha = alpha
        self.alpha_sq = alpha ** 2
        self.penalty = penalty.lower()
        if (self.penalty == 'quadratic'):
            self.phi = lambda z: 0.5 * np.power(z, 2.0)
        elif (self.penalty == 'pseudo-huber'):
            self.phi = lambda z: self.alpha_sq * (np.sqrt(1.0 + np.power(z, 2.0) / self.alpha_sq) - 1.0)
        elif (self.penalty == 'huber'):
            self.phi = lambda z: np.where(np.abs(z) <= alpha, 0.5 * np.power(z, 2.0), alpha * np.abs(z) - 0.5 * self.alpha_sq)
        elif (self.penalty == 'welsch'):
            self.phi = lambda z: 1.0 - np.exp(-0.5 * np.power(z, 2.0) / self.alpha_sq)
        elif (self.penalty == 'trunc-quad'):
            self.phi = lambda z: np.minimum(0.5 * np.power(z, 2.0), 0.5 * self.alpha_sq)
        else:
            assert False, "unrecognized penalty function {}".format(penalty)

        super().__init__(n*m, m) # make sure node is properly constructed
        self.eps = 1.0e-4 # relax tolerance on optimality test

    def objective(self, x, y):
        assert (len(x) == self.dim_x) and (len(y) == self.dim_y)
        # the inclusion of 1.0e-9 prevents division by zero during automatic differentiation when a y lands exactly on a data point xi
        return np.sum([self.phi(np.sqrt(np.dot(y - xi, y - xi) + 1.0e-9)) for xi in x.reshape((int(self.dim_x / self.dim_y), self.dim_y))])

    def solve(self, x):
        assert(len(x) == self.dim_x)

        J = lambda y : self.objective(x, y)
        dJ = lambda y : self.fY(x, y)

        y_star = np.mean(x.reshape((int(self.dim_x / self.dim_y), self.dim_y)), 0)
        if (self.penalty != 'quadratic'):
            result = opt.minimize(J, y_star, args=(), method='L-BFGS-B', jac=dJ, options={'maxiter': 100, 'disp': False})
            if not result.success: print(result.message)
            y_star, J_star = result.x, result.fun

        # run with different intial guesses for non-convex penalties
        if (self.penalty == 'welsch') or (self.penalty == 'trunc-quad'):
            guesses = np.random.permutation(x.reshape((int(self.dim_x / self.dim_y), self.dim_y)))
            if self.dim_x > self.restarts: guesses = guesses[0:self.restarts, :]
            for y_init in guesses:
                result = opt.minimize(J, y_init, args=(), method='L-BFGS-B', jac=dJ, options={'maxiter': 100, 'disp': False})
                if not result.success: print(result.message)
                if (result.fun < J_star):
                    y_star, J_star = result.x, result.fun

        return y_star, None

Example

To show an example of robust vector averaging we demonstrate updating a set of 2D points with outliers such that the (robust) average of the points moves to (0, 0). That is, we aim to solve the problem

$$ \begin{array}{ll} \text{minimize (over $x_i$)} & \|y\|_2^2 \\ \text{subject to} & y = \text{arg min}_u \sum_{i=1}^{n} \phi(\|u - x_i\|; \alpha) \end{array} $$

We use three different penalty functions: quadratic, pseudo-huber and welsch. The animation below shows the initial position of the points and robust averages. In the three other panels are shown updates of the points at each iteration as we move the average to (0, 0). Notice that the average computed with the quadratic penalty function, which is not robust, moves all points including the outliers. The other two penalty functions only move inlier points; the outliers are largely unaffected. This could be useful, for example, in a neural network where we only want network parameters to be influenced by inliers during back-propagation.

In [2]:
%%capture

from ddn.basic.sample_nodes import SquaredErrorNode
from ddn.basic.composition import ComposedNode

import matplotlib.pyplot as plt
import matplotlib.animation as animation
import matplotlib.patches as patches
from IPython.display import HTML

# set for only two random restarts
RobustVectorAverage.restarts = 2

# setup data with outliers
np.random.seed(0)
x = 1.5 * (np.random.rand(10, 2) - 0.5)
x[-1, 0] += 6.0; x[-1, 1] -= 2.0
x[0, 0] += 3.0; x[0, 1] += 2.0

data = [('quadratic', 'b', x.copy(), []), ('pseudo-huber', 'r', x.copy(), []), ('welsch', 'g', x.copy(), [])]

t = np.linspace(0.0, 2.0 * np.pi)

def plot(ax, x_init, x, data, colour=None, history=None):
    y = {}
    for (name, c, _, _) in data:
        node = RobustVectorAverage(x.shape[0], x.shape[1], name)
        y[name], _ = node.solve(x.flatten())
        ax.plot(y[name][0], y[name][1], 'D', color=c, markersize=10)

    # draw original points
    ax.plot(x_init[:, 0], x_init[:, 1], 'o', markeredgecolor='k', markerfacecolor='w', markeredgewidth=1.0)
    if colour is not None:
        ax.plot(x[:, 0], x[:, 1], 'o', markeredgecolor='k', markerfacecolor=colour, markeredgewidth=1.0)

    # draw circle
    for (name, c, _, _) in data:
        ax.plot(np.cos(t) + y[name][0], np.sin(t) + y[name][1], '--', color=c, linewidth=1)

    # draw learning curve
    if history is not None:
        ax.add_patch(patches.Rectangle((3.5, 0.5), 3.0, 2.0, fc='w', ec='k'))
        if len(history) > 0:
            h = 2.0 * np.array(history) / np.max(history) + 0.5
            ax.plot(np.linspace(3.5, 6.5, len(history)), h, color=colour)
            pass

    ax.set_xlim(-2.0, 7.0); ax.set_ylim(-3.0, 3.0)


def init():
    plot(ax[0], x, x, data)
    ax[0].legend([name for (name, c, _, h) in data] + ['original points'])
    return ax


def animate(fnum, x_init, data):

    for i, (name, c, x, h) in enumerate(data):
        ax[i+1].clear()
        plot(ax[i+1], x_init, x, data, c, h)

        # gradient descent update
        node = ComposedNode(RobustVectorAverage(x.shape[0], x.shape[1], name), SquaredErrorNode(x.shape[1]))
        h.append(node.solve(x.flatten())[0])
        dJ = node.gradient(x.flatten())
        x -= 0.5 * dJ.reshape(x.shape)

    return ax[2:]


# create animation
fig = plt.figure()
ax = [plt.subplot(2, 2, i+1) for i in range(4)]
plt.tight_layout()
ani = animation.FuncAnimation(fig, animate, init_func=init, fargs=(x, data), interval=100, frames=50, repeat=False)
plt.close(fig)
In [3]:
# display using video or javascript
HTML(ani.to_html5_video())
#HTML(ani.to_jshtml())
Out[3]:

Mathematics

Consider the objective function for the robust vector pooling optimization problem,

$$ \begin{align*} f(\{x_1, \ldots, x_n\}, u) &= \sum_{i=1}^{n} \phi(\|u - x_i\|_2; \alpha) \\ &= \sum_{i=1}^{n} \phi(z_i; \alpha) \end{align*} $$

where we have written $z_i = \|u - x_i\|_2$.

The gradient of the minimizer $y$ with respect to each of the $x_j$ is given by

$$ \text{D}_{X_j} y = -H^{-1} B $$

where $H = \text{D}^2_{YY} f$ and $B = \text{D}^2_{X_jY} f$ (see Proposition 4.4 of the DDN paper). Since $f$ decomposes as a sum of penalty functions $\phi$, it suffices to consider $\text{D}^2_{YY} \phi$ and $\text{D}^2_{XY} \phi$.

$$ \begin{align*} \text{D}_{Y} \phi(z_i; \alpha) &= \phi'(z_i; \alpha) \text{D}_{Y} z_i \\ &= \phi'(z_i; \alpha) \frac{(y - x_i)^T}{z_i} \end{align*} $$

where $\phi'$ is the first derivative of $\phi$. So

$$ \begin{align*} \text{D}^2_{YY} \phi(z_i; \alpha) &= \frac{\phi'(z_i; \alpha)}{z_i} I_{m \times m} + \left(\phi''(z_i; \alpha) - \frac{\phi'(z_i; \alpha)}{z_i}\right) \frac{(y - x_i)(y - x_i)^T}{z_i^2} \\ &= \kappa_1(z_i) I_{m \times m} + \kappa_2(z_i) (y - x_i)(y - x_i)^T \end{align*} $$

where $\kappa_1$ and $\kappa_2$ are quantities that depend on the penalty function and $z_i$.

By symmetry $\text{D}^2_{X_jY} \phi(z_j; \alpha) = - \text{D}^2_{YY} \phi(z_j; \alpha)$.

We therefore have

$$ \begin{align*} \text{D}_{X_j} y &= \left( \sum_{i=1}^{n} \kappa_1(z_i) I_{m \times m} + \kappa_2(z_i) (y - x_i)(y - x_i)^T \right)^{-1} \Bigg( \kappa_1(z_j) I_{m \times m} + \kappa_2(z_j) (y - x_j)(y - x_j)^T\Bigg) \\ &= H^{-1} \Bigg( \kappa_1(z_j) I_{m \times m} + \kappa_2(z_j) (y - x_j)(y - x_j)^T\Bigg) \end{align*} $$

Implementation

Let $v^T = \text{D} J(y)$ be the derivative of the loss function with respect to the output, i.e., the backward going gradient. Our goal is to compute $\text{D} J(x_i)$ for $i = 1, \ldots, n$. We have,

$$ \text{D} J(x_i) = v^T H^{-1} \Bigg( \kappa_1(z_i) I_{m \times m} + \kappa_2(z_i) (y - x_i)(y - x_i)^T\Bigg) $$

Let $w^T = v^T H^{-1}$ obtained by solving $v = H^{-1} w$ using Cholesky factorization and back substitution. Note that this can be computed independent of $i$. We then have

$$ \begin{align*} \text{D} J(x_i) &= w^T \Bigg( \kappa_1(z_i) I_{m \times m} + \kappa_2(z_i) (y - x_i)(y - x_i)^T \Bigg) \\ &= \kappa_1(z_i) w^T + \kappa_2(z_i) w^T (y - x_i)(y - x_i)^T \end{align*} $$

Performing the inner product $w^T (y - x_i)$ before the outer product $(y - x_i)(y - x_i)^T$ results in significant memory and computational savings.

In [ ]: