We can extend robust pooling to work over vector arguments. Denote by $\phi$ the penalty function. Then robust vector averaging with input $\{x_i \in \mathbb{R}^m \mid i = 1, \ldots, n\}$ and output $y \in \mathbb{R}^m$ finds the solution to the optimization problem
$$ y \in \text{arg min}_u \sum_{i=1}^{n} \phi(\|u - x_i\|; \alpha) $$where $\alpha$ is a parameter of the penalty function. We implement this operation using a ddn.basic
node for demonstration and can be used in your own code by importing RobustVectorAverage
from ddn.basic.robust_nodes
. There is also an efficient version called RobustVectorPool2d
in the ddn.pytorch.robust_vec_pool
module for use in Pytorch applications. That version follows the mathematical derivation shown at the end of this notebook.
%matplotlib notebook
import sys
sys.path.append("../")
import scipy.optimize as opt
from ddn.basic.node import *
class RobustVectorAverage(NonUniqueDeclarativeNode):
"""
Solves for the multi-dimensional robust average,
minimize f(x, y) = \sum_{i=1}^{n} phi(\|y - x_i\|; alpha)
where phi(z; alpha) is one of the following robust penalties,
'quadratic': 1/2 z^2
'pseudo-huber': alpha^2 (\sqrt(1 + (z/alpha)^2 - 1)
'huber': 1/2 z^2 for |z| <= alpha and alpha |z| - 1/2 alpha^2 otherwise
'welsch': 1 - exp(-z^2 / 2 alpha^2)
'trunc-quad': 1/2 z^2 for |z| <= alpha and 1/2 alpha^2 otherwise
The input is assumed to be flattened from an (n \times m)-matrix to an nm-vector.
"""
restarts = 10 # number of random restarts when solving non-convex penalties
def __init__(self, n, m, penalty='huber', alpha=1.0):
assert (alpha > 0.0)
self.alpha = alpha
self.alpha_sq = alpha ** 2
self.penalty = penalty.lower()
if (self.penalty == 'quadratic'):
self.phi = lambda z: 0.5 * np.power(z, 2.0)
elif (self.penalty == 'pseudo-huber'):
self.phi = lambda z: self.alpha_sq * (np.sqrt(1.0 + np.power(z, 2.0) / self.alpha_sq) - 1.0)
elif (self.penalty == 'huber'):
self.phi = lambda z: np.where(np.abs(z) <= alpha, 0.5 * np.power(z, 2.0), alpha * np.abs(z) - 0.5 * self.alpha_sq)
elif (self.penalty == 'welsch'):
self.phi = lambda z: 1.0 - np.exp(-0.5 * np.power(z, 2.0) / self.alpha_sq)
elif (self.penalty == 'trunc-quad'):
self.phi = lambda z: np.minimum(0.5 * np.power(z, 2.0), 0.5 * self.alpha_sq)
else:
assert False, "unrecognized penalty function {}".format(penalty)
super().__init__(n*m, m) # make sure node is properly constructed
self.eps = 1.0e-4 # relax tolerance on optimality test
def objective(self, x, y):
assert (len(x) == self.dim_x) and (len(y) == self.dim_y)
# the inclusion of 1.0e-9 prevents division by zero during automatic differentiation when a y lands exactly on a data point xi
return np.sum([self.phi(np.sqrt(np.dot(y - xi, y - xi) + 1.0e-9)) for xi in x.reshape((int(self.dim_x / self.dim_y), self.dim_y))])
def solve(self, x):
assert(len(x) == self.dim_x)
J = lambda y : self.objective(x, y)
dJ = lambda y : self.fY(x, y)
y_star = np.mean(x.reshape((int(self.dim_x / self.dim_y), self.dim_y)), 0)
if (self.penalty != 'quadratic'):
result = opt.minimize(J, y_star, args=(), method='L-BFGS-B', jac=dJ, options={'maxiter': 100, 'disp': False})
if not result.success: print(result.message)
y_star, J_star = result.x, result.fun
# run with different intial guesses for non-convex penalties
if (self.penalty == 'welsch') or (self.penalty == 'trunc-quad'):
guesses = np.random.permutation(x.reshape((int(self.dim_x / self.dim_y), self.dim_y)))
if self.dim_x > self.restarts: guesses = guesses[0:self.restarts, :]
for y_init in guesses:
result = opt.minimize(J, y_init, args=(), method='L-BFGS-B', jac=dJ, options={'maxiter': 100, 'disp': False})
if not result.success: print(result.message)
if (result.fun < J_star):
y_star, J_star = result.x, result.fun
return y_star, None
To show an example of robust vector averaging we demonstrate updating a set of 2D points with outliers such that the (robust) average of the points moves to (0, 0). That is, we aim to solve the problem
$$ \begin{array}{ll} \text{minimize (over $x_i$)} & \|y\|_2^2 \\ \text{subject to} & y = \text{arg min}_u \sum_{i=1}^{n} \phi(\|u - x_i\|; \alpha) \end{array} $$We use three different penalty functions: quadratic
, pseudo-huber
and welsch
. The animation below shows the initial position of the points and robust averages. In the three other panels are shown updates of the points at each iteration as we move the average to (0, 0). Notice that the average computed with the quadratic
penalty function, which is not robust, moves all points including the outliers. The other two penalty functions only move inlier points; the outliers are largely unaffected. This could be useful, for example, in a neural network where we only want network parameters to be influenced by inliers during back-propagation.
%%capture
from ddn.basic.sample_nodes import SquaredErrorNode
from ddn.basic.composition import ComposedNode
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import matplotlib.patches as patches
from IPython.display import HTML
# set for only two random restarts
RobustVectorAverage.restarts = 2
# setup data with outliers
np.random.seed(0)
x = 1.5 * (np.random.rand(10, 2) - 0.5)
x[-1, 0] += 6.0; x[-1, 1] -= 2.0
x[0, 0] += 3.0; x[0, 1] += 2.0
data = [('quadratic', 'b', x.copy(), []), ('pseudo-huber', 'r', x.copy(), []), ('welsch', 'g', x.copy(), [])]
t = np.linspace(0.0, 2.0 * np.pi)
def plot(ax, x_init, x, data, colour=None, history=None):
y = {}
for (name, c, _, _) in data:
node = RobustVectorAverage(x.shape[0], x.shape[1], name)
y[name], _ = node.solve(x.flatten())
ax.plot(y[name][0], y[name][1], 'D', color=c, markersize=10)
# draw original points
ax.plot(x_init[:, 0], x_init[:, 1], 'o', markeredgecolor='k', markerfacecolor='w', markeredgewidth=1.0)
if colour is not None:
ax.plot(x[:, 0], x[:, 1], 'o', markeredgecolor='k', markerfacecolor=colour, markeredgewidth=1.0)
# draw circle
for (name, c, _, _) in data:
ax.plot(np.cos(t) + y[name][0], np.sin(t) + y[name][1], '--', color=c, linewidth=1)
# draw learning curve
if history is not None:
ax.add_patch(patches.Rectangle((3.5, 0.5), 3.0, 2.0, fc='w', ec='k'))
if len(history) > 0:
h = 2.0 * np.array(history) / np.max(history) + 0.5
ax.plot(np.linspace(3.5, 6.5, len(history)), h, color=colour)
pass
ax.set_xlim(-2.0, 7.0); ax.set_ylim(-3.0, 3.0)
def init():
plot(ax[0], x, x, data)
ax[0].legend([name for (name, c, _, h) in data] + ['original points'])
return ax
def animate(fnum, x_init, data):
for i, (name, c, x, h) in enumerate(data):
ax[i+1].clear()
plot(ax[i+1], x_init, x, data, c, h)
# gradient descent update
node = ComposedNode(RobustVectorAverage(x.shape[0], x.shape[1], name), SquaredErrorNode(x.shape[1]))
h.append(node.solve(x.flatten())[0])
dJ = node.gradient(x.flatten())
x -= 0.5 * dJ.reshape(x.shape)
return ax[2:]
# create animation
fig = plt.figure()
ax = [plt.subplot(2, 2, i+1) for i in range(4)]
plt.tight_layout()
ani = animation.FuncAnimation(fig, animate, init_func=init, fargs=(x, data), interval=100, frames=50, repeat=False)
#plt.close(fig)
# display using video or javascript
HTML(ani.to_html5_video())
#HTML(ani.to_jshtml())
Consider the objective function for the robust vector pooling optimization problem,
$$ \begin{align*} f(\{x_1, \ldots, x_n\}, u) &= \sum_{i=1}^{n} \phi(\|u - x_i\|_2; \alpha) \\ &= \sum_{i=1}^{n} \phi(z_i; \alpha) \end{align*} $$where we have written $z_i = \|u - x_i\|_2$.
The gradient of the minimizer $y$ with respect to each of the $x_j$ is given by
$$ \text{D}_{X_j} y = -H^{-1} B $$where $H = \text{D}^2_{YY} f$ and $B = \text{D}^2_{X_jY} f$ (see Proposition 4.4 of the DDN paper). Since $f$ decomposes as a sum of penalty functions $\phi$, it suffices to consider $\text{D}^2_{YY} \phi$ and $\text{D}^2_{XY} \phi$.
$$ \begin{align*} \text{D}_{Y} \phi(z_i; \alpha) &= \phi'(z_i; \alpha) \text{D}_{Y} z_i \\ &= \phi'(z_i; \alpha) \frac{(y - x_i)^T}{z_i} \end{align*} $$where $\phi'$ is the first derivative of $\phi$. So
$$ \begin{align*} \text{D}^2_{YY} \phi(z_i; \alpha) &= \frac{\phi'(z_i; \alpha)}{z_i} I_{m \times m} + \left(\phi''(z_i; \alpha) - \frac{\phi'(z_i; \alpha)}{z_i}\right) \frac{(y - x_i)(y - x_i)^T}{z_i^2} \\ &= \kappa_1(z_i) I_{m \times m} + \kappa_2(z_i) (y - x_i)(y - x_i)^T \end{align*} $$where $\kappa_1$ and $\kappa_2$ are quantities that depend on the penalty function and $z_i$.
By symmetry $\text{D}^2_{X_jY} \phi(z_j; \alpha) = - \text{D}^2_{YY} \phi(z_j; \alpha)$.
We therefore have
$$ \begin{align*} \text{D}_{X_j} y &= \left( \sum_{i=1}^{n} \kappa_1(z_i) I_{m \times m} + \kappa_2(z_i) (y - x_i)(y - x_i)^T \right)^{-1} \Bigg( \kappa_1(z_j) I_{m \times m} + \kappa_2(z_j) (y - x_j)(y - x_j)^T\Bigg) \\ &= H^{-1} \Bigg( \kappa_1(z_j) I_{m \times m} + \kappa_2(z_j) (y - x_j)(y - x_j)^T\Bigg) \end{align*} $$Let $v^T = \text{D} J(y)$ be the derivative of the loss function with respect to the output, i.e., the backward going gradient. Our goal is to compute $\text{D} J(x_i)$ for $i = 1, \ldots, n$. We have,
$$ \text{D} J(x_i) = v^T H^{-1} \Bigg( \kappa_1(z_i) I_{m \times m} + \kappa_2(z_i) (y - x_i)(y - x_i)^T\Bigg) $$Let $w^T = v^T H^{-1}$ obtained by solving $v = H^{-1} w$ using Cholesky factorization and back substitution. Note that this can be computed independent of $i$. We then have
$$ \begin{align*} \text{D} J(x_i) &= w^T \Bigg( \kappa_1(z_i) I_{m \times m} + \kappa_2(z_i) (y - x_i)(y - x_i)^T \Bigg) \\ &= \kappa_1(z_i) w^T + \kappa_2(z_i) w^T (y - x_i)(y - x_i)^T \end{align*} $$Performing the inner product $w^T (y - x_i)$ before the outer product $(y - x_i)(y - x_i)^T$ results in significant memory and computational savings.