This notebook illustrates reverse automatic differentiation, of first and second order. Recall that this approach is recommended for functions from a large dimensional space, to a small dimensional space, and whose jacobian does not have a sparse structure. It is typically appropriate for large scale optimization problems.
Disclaimer. First order reverse automatic differentiation is a classical feature, found in many packages better optimized and better maintained than this one. If you only need this specific feature, then it could be wise to use another implementation.
Summary of volume Algorithmic tools, this series of notebooks.
Main summary of the Adaptive Grid Discretizations book of notebooks, including the other volumes.
Acknowledgement. Some of the experiments presented in these notebooks are part of ongoing research with Ludovic Métivier and Da Chen.
Copyright Jean-Marie Mirebeau, Centre Borelli, ENS Paris-Saclay, CNRS, University Paris-Saclay
Known bugs and incompatibilities. Our implementation of automatic differentiation is based on subclassing the numpy array class. While simple and powerful, this approach suffers from a few pitfalls, described in notebook ADBugs.
import sys; sys.path.insert(0,"..") # Allow importing agd from parent directory
#from Miscellaneous import TocTools; print(TocTools.displayTOC('Reverse','Algo'))
import numpy as np
import scipy.sparse; import scipy.sparse.linalg
import agd.AutomaticDifferentiation as ad
import agd.FiniteDifferences as fd
import agd.LinearParallel as lp
def LInfNorm(a): return np.max(np.abs(a))
Reverse automatic differentiation works by keeping a history of the computations. The sensitivity of the result w.r.t some components is obtained by propagating the sensitivities backward in the computation queue and using the operator adjoints.
Our implementation of the reverseAD is differs from the denseAD or sparseAD classes: here we do not define an data type overloading the arithmetic operators and the basic special functions. Instead, the reverseAD class is only meant to keep a history of user specified computations.
Note to reader. This section only shows how to generate variables, and request an expression's derivatives. See the next section for an actual use of automatic differentiation.
We first create some basic numpy arrays.
x0=np.linspace(0,np.pi,4)
gridScale=x0[1]-x0[0]
u0=np.sin(x0)
v0=np.cos(x0)
Then we generate a reverseAD history, and register the variables of interest.
# Create an empty history
rev = ad.Reverse.empty()
# Create AD variables w.r.t which the gradient will be required
u1 = rev.identity(constant=u0)
v1 = rev.identity(constant=v0)
# Equivalently :
#rev,(u1,v1) = ad.Reverse.empty(inputs=(u0,v0))
In our implementation, the generated AD variables are of sparse AD type.
print("u1:",u1)
print("v1:",v1)
u1: spAD(array([0.00000000e+00, 8.66025404e-01, 8.66025404e-01, 1.22464680e-16]), array([[1.], [1.], [1.], [1.]]), array([[0], [1], [2], [3]])) v1: spAD(array([ 1. , 0.5, -0.5, -1. ]), array([[1.], [1.], [1.], [1.]]), array([[4], [5], [6], [7]]))
If we make some computation using the variables registered in the reverseAD class, then we can request the gradient of the final expression.
result = (u1**2+v1**2).sum()
rev.gradient(result)
array([ 0.00000000e+00, 1.73205081e+00, 1.73205081e+00, 2.44929360e-16, 2.00000000e+00, 1.00000000e+00, -1.00000000e+00, -2.00000000e+00])
If needed, this expression can be reshaped similar to the input variables.
u_grad,v_grad = rev.to_inputshapes(rev.gradient(result))
assert LInfNorm(v_grad-2*v0) < 1e-6
The hessian operator may be accessed as well, using the Reverse2 submodule. Note that the hessian matrix itself is never assembled, since in applications it would typically be dense and of large size.
rev,(u1,v1) = ad.Reverse2.empty(inputs=(u0,v0))
result = (u1**2+v1**2).sum()
grad = rev.gradient(result) # Gradient
hess = rev.hessian(result) # Hessian operator
grad
array([ 0.00000000e+00, 1.73205081e+00, 1.73205081e+00, 2.44929360e-16, 2.00000000e+00, 1.00000000e+00, -1.00000000e+00, -2.00000000e+00])
In this specific example, the hessian operator is twice the identity.
hess(grad)
array([ 0.00000000e+00, 3.46410162e+00, 3.46410162e+00, 4.89858720e-16, 4.00000000e+00, 2.00000000e+00, -2.00000000e+00, -4.00000000e+00])
Our implementation of reverse AD does rely on sparse AD, for the elementary operations, such as :
One may recall that sparseness is easily lost, as in the example below, which may result in complexity issues.
n=4
rev = ad.Reverse2.empty()
u1 = rev.identity(shape=(n,n))
u2 = (1.+u1+u1**2).sum(axis=0)
result = (u2**2).sum(axis=0)
grad,hess = rev.gradient(result),rev.hessian(result)
print("Result AD size : ",result.size_ad2)
Result AD size : 272
Sparse AD simplification may limit sparsity loss, but it is a costly process (involving sorting), which is not always very efficient.
u2.simplify_ad()
result = (u2**2).sum(axis=0)
print("Result AD size with sparse AD simplification : ",result.size_ad2)
Result AD size with sparse AD simplification : 80
We can take advantage of reverse AD to reintroduce some sparsity.
Important. Simplification based on reverse AD should not be confused with simplification based on sparse AD. Indeed, it does not involve sorting, and it is always optimally efficient.
rev = ad.Reverse2.empty()
u1 = rev.identity(shape=(n,n))
u2 = (1.+u1+u1**2).sum(axis=0)
u2_simpl = rev.simplify(u2) # Intermediate simplification
result = (u2_simpl**2).sum(axis=0)
grad_simpl,hess_simpl = rev.gradient(result),rev.hessian(result)
print("Result AD size with reverse AD simplification : ",result.size_ad2)
Result AD size with reverse AD simplification : 4
Reverse AD simplification proceeds by replacing the the complicated symbolic perturbations in $u_2$ with placeholders, in $u_2\_$simpl, with negative indices.
u2_simpl
spAD2(array([4., 4., 4., 4.]), array([[1.], [1.], [1.], [1.]]), array([[-1], [-2], [-3], [-4]]), array([], shape=(4, 0), dtype=float64), array([], shape=(4, 0), dtype=int64), array([], shape=(4, 0), dtype=int64))
The relative dependency of the old and new symbolic perturbations is registered, which allows computing derivatives.
assert LInfNorm(grad-grad_simpl) < 1e-13
dir_hessian = grad**2+1.
assert LInfNorm(hess(dir_hessian)-hess_simpl(dir_hessian)) < 1e-13
We illustrate Reverse automatic differentiation, in its intended use case involving operators operators whose jacobians are expected to be both high-dimensional and non-sparse. Note that linear operators often fit this desciption. For instance:
We first construct some sparse matrix, for the exposition, based on a transport scheme implemented using upwind finite differences.
def TransportScheme(u,h,speed,dt):
return u+dt*speed*fd.DiffUpwind(u,(1,),h,padding=0.)
In order to construct the matrix, we evaluate the scheme on a sparse AD variable.
speed = 1.; T=1.5; nsteps=5; dt = T/nsteps;
transport_ad = TransportScheme(ad.Sparse.identity(u0.shape),gridScale,speed,dt)
transport_matrix = scipy.sparse.coo_matrix(transport_ad.triplets()).tocsr()
We can now use this matrix in the course of reverse AD computations, mixing both linear and non-linear steps.
rev,(u1,v1) = ad.Reverse.empty(inputs=(u0,v0)) # Create the reverse AD environnement
u2 = rev.apply_linear_mapping(transport_matrix,u1**2,niter=nsteps) #Implement the desired method
result_Reverse = (u2*v1).sum()
Due to the multiple iterations, the variable $u_2$ does not depend in a sparse manner on $u_1$. In our implementation, the variable $u_2$ is of sparseAD type but features negative placeholder indices. Likewise for the final computation result.
u2
spAD(array([5.02050076e-01, 4.17158585e-01, 1.38706040e-01, 2.77367654e-33]), array([[1.], [1.], [1.], [1.]]), array([[-1], [-2], [-3], [-4]]))
result_Reverse
spAD(array(0.64127635), array([ 1.00000000e+00, 5.00000000e-01, -5.00000000e-01, -1.00000000e+00, 5.02050076e-01, 4.17158585e-01, 1.38706040e-01, 2.77367654e-33]), array([-1, -2, -3, -4, 4, 5, 6, 7]))
The negative indices are newly introduced symbolic perturbations, whose depedence on the original ones is known thanks to the registered adjoint. This is exploited for computing the gradient of the result.
rev.gradient(result_Reverse)
array([ 0.00000000e+00, 8.03222547e-01, 6.77741743e-01, -2.49367755e-17, 5.02050076e-01, 4.17158585e-01, 1.38706040e-01, 2.77367654e-33])
We can check the validity of this implementation using dense automatic differentiation, since this specific instance is small.
n=x0.size
u1 = ad.Dense2.identity(constant=u0,shift=(0,n))
v1 = ad.Dense2.identity(constant=v0,shift=(n,0))
u2 = ad.apply_linear_mapping(transport_matrix,u1**2,niter=nsteps)
result_Dense2 = (u2*v1).sum()
assert LInfNorm(rev.gradient(result_Reverse) - result_Dense2.coef1) < 1e-13
Second order reverse automatic differentiation is also implemented.
rev,(u1,v1) = ad.Reverse2.empty(inputs=(u0,v0))
u2 = rev.apply_linear_mapping(transport_matrix,u1**2,niter=nsteps)
result_Reverse2 = (u2*v1).sum()
assert LInfNorm(rev.gradient(result_Reverse2) - result_Dense2.coef1) < 1e-13
grad = rev.gradient(result_Reverse2)
hess = rev.hessian(result_Reverse2)
Note that the hessian is provided as an operator, that can be applied to arbitrary vectors, and not in a matrix form. This is enough to run e.g. the conjugate gradient method.
hess
<function agd.AutomaticDifferentiation.Reverse2.reverseAD2.hessian.<locals>.hess_operator(dir_hessian, coef2_init=None, with_grad=False)>
dir_hessian = grad**2+1.
assert LInfNorm(hess(dir_hessian) - np.dot(result_Dense2.coef2,dir_hessian)) < 1e-13
Linear inversion is almost always a non-local procedure, but with a simple and explicit adjoint. It is again a good fit for reverse AD. The syntax is almost identical, except for the choice of a linear solver.
rev,(u1,v1) = ad.Reverse2.empty(inputs=(u0,v0))
# Choose a solver, and request the inversion.
solver = scipy.sparse.linalg.spsolve
u2 = rev.apply_linear_inverse(solver,transport_matrix,u1**2)
result_Reverse2 = (u2*v1).sum()
grad = rev.gradient(result_Reverse2)
hess = rev.hessian(result_Reverse2)
Again, we control the results in this simple instance using dense AD.
n=x0.size
u1 = ad.Dense2.identity(constant=u0,shift=(0,n))
v1 = ad.Dense2.identity(constant=v0,shift=(n,0))
u2 = ad.apply_linear_inverse(solver,transport_matrix,u1**2)
result_Dense2 = (u2*v1).sum()
assert LInfNorm(grad - result_Dense2.coef1) < 1e-13
dir_hessian = grad**2+1.
assert LInfNorm(hess(dir_hessian) - np.dot(result_Dense2.coef2,dir_hessian)) < 1e-13
In the previous section, we have shown how to apply reverse AD to operators which are either:
We discuss here the general case of arbitrary non-linear operators. These techniques are typically needed in the context of non-linear evolution equations.
For an operator to be used in a reverse AD context, two execution paths must be implemented:
Note on the differentiation order. This specific subsection only applies to first order reverse AD. A slightly more complex implementation is required for second order reverse AD.
def my_operator(u,co_output=None):
if co_output is None:
return transport_matrix*u # Standard program flow
else:
co_output_value,co_arg_request = co_output
assert len(co_arg_request)==1 and co_arg_request[0] is u
return [(u, transport_matrix.T*co_output_value)] # Adjoint jacobian
This operator is functionally equivalent to the method rev.apply_linear_mapping(transport_matrix,...)
rev,(u1,v1) = ad.Reverse.empty(inputs=(u0,v0)) # Create the reverse AD environnement
u2 = rev.apply(my_operator,u1**2) #Implement the desired method
result_Reverse = (u2*v1).sum()
grad = rev.gradient(result_Reverse)
grad
array([ 0.00000000e+00, 1.11412341e+00, -3.69829398e-01, -2.09845813e-16, 2.14859173e-01, 7.50000000e-01, 5.35140827e-01, 1.07011025e-32])
rev,(u1,v1) = ad.Reverse.empty(inputs=(u0,v0))
u2 = rev.apply_linear_mapping(transport_matrix,u1**2) #Implement the desired method
result_Reverse = (u2*v1).sum()
assert LInfNorm(grad-rev.gradient(result_Reverse))<1e-13
For the purposes of illustration, we define an operator which:
This operator is defined using a block of elementary instructions.
def my_operator(u,v,co_output=None):
# Generate an operator-like AD environnement
rev,(u1,v1) = ad.Reverse.operator_like(inputs=(u,v),co_output=co_output)
# Do some computations
u2 = rev.apply_linear_mapping(transport_matrix,u1**2,niter=nsteps)
result = u2*v1
# Return the suitably converted result
return rev.output(result)
The non-linear operator defined above can be applied to ordinary variables without AD information, or dense AD variables. (The multiplication with transport_matrix
forbids the use of sparse AD variables.)
result_float = (my_operator(u0,u0*v0)*v0).sum()
n=x0.size
u1 = ad.Dense2.identity(constant=u0,shift=(0,n))
v1 = ad.Dense2.identity(constant=v0,shift=(n,0))
result_Dense2 = (my_operator(u1,u1*v1)*v1).sum()
It can also be used in a reverse AD context. (Note that this is recursive reverse AD.)
rev,(u1,v1) = ad.Reverse.empty(inputs=(u0,v0))
result_Reverse = (rev.apply(my_operator,u1,u1*v1)*v1).sum()
grad = rev.gradient(result_Reverse)
assert LInfNorm(grad - result_Dense2.coef1) < 1e-13
rev,(u1,v1) = ad.Reverse2.empty(inputs=(u0,v0))
result_Reverse2 = (rev.apply(my_operator,u1,u1*v1)*v1).sum()
grad = rev.gradient(result_Reverse2)
hess = rev.hessian(result_Reverse2)
assert LInfNorm(grad - result_Dense2.coef1) < 1e-13
dir_hessian = grad**2+1.
assert LInfNorm(hess(dir_hessian)-np.dot(result_Dense2.coef2,dir_hessian)) < 1e-13
We illustrate recursive reverse AD in nested loops. This approach allows to limit memory usage, at the cost of some recomputation.
Note on complexity Reverse automatic differentiation requires saving program state at each intermediate computation steps. If a program contains a loop, this may result in severe memory usage. Therefore, it is common to only store a small proportion of the intermediate states, referred to as keypoints, and to use recomputations for reconstructing the other steps. For best efficiency, this procedure must be made recursive.
Our first method makes $n$=niter iterations of the arithmetico-geometric algorithm.
Side note : the chosen example is not perfect, because this algorithm converges excessively fast, hence we only use a very small number of iterations, which may seem pointless.
counter=0 # Iteration counter
def my_loop(a,b,niter,co_output=None):
global counter
rev,(a,b) = ad.Reverse.operator_like(inputs=(a,b),co_output=co_output)
for i in range(niter):
a,b = (a+b)/2., np.sqrt(a*b)
a,b = rev.simplify(a),rev.simplify(b)
counter+=1
result = (a,b)
return rev.output(result)
a0=np.array([1.,2.,4.,8.])
b0=1/a0
niter=2; counter=0
my_loop(a0,b0,niter),counter
((array([1. , 1.125 , 1.5625 , 2.53125]), array([1. , 1.11803399, 1.45773797, 2.01556444])), 2)
After the selected number of iterations, we are quite far from convergence. We do note want to increase $n$, which would augment the number of intermediate steps saved. Instead, we achieve $n^2$ iterations while saving only $2n$ states, using an outer loop. This comes at the cost of recomputing $2$ times each step in the gradient evaluation.
With a similar basic recursion, one can achieve $n^k$ iterations, while saving only $k n$ steps, using $k$ nested loops. This comes at the cost of recomputing $k$ times each step in the gradient evaluation.
With the convention below, $k=$nrec
$+1$
def my_outer_loop(a,b,niter,nrec,operator,co_output=None):
rev,(a,b) = ad.Reverse.operator_like(inputs=(a,b),co_output=co_output)
for i in range(niter):
if nrec==1: a,b = rev.apply(operator,a,b,niter)
else: a,b = rev.apply(my_outer_loop,a,b,niter,nrec-1,operator)
result = (a,b)
return rev.output(result)
niter=2; nrec=1; counter=0
my_outer_loop(a0,b0,niter,nrec,my_loop), counter
((array([1. , 1.12151429, 1.50966462, 2.26607262]), array([1. , 1.12151429, 1.50966455, 2.26606075])), 4)
niter=2; nrec=2; counter=0
my_outer_loop(a0,b0,niter,nrec,my_loop), counter
((array([1. , 1.12151429, 1.50966459, 2.26606669]), array([1. , 1.12151429, 1.50966459, 2.26606669])), 8)
niter=2; nrec=2; counter=0;
rev,(a1,b1) = ad.Reverse.empty(inputs=(a0,b0))
a2,b2 = rev.apply(my_outer_loop,a1,b1,niter,nrec,my_loop)
result = (a2*b2).sum()
print("Iteration counter for forward evaluation", counter)
counter=0;
grad_rec = rev.gradient(result)
print("Iteration counter for backward evaluation", counter)
Iteration counter for forward evaluation 8 Iteration counter for backward evaluation 24
For verification purposes, we compare our results non-nested iteration.
niter2 = niter**(1+nrec); counter=0
rev,(a1,b1) = ad.Reverse.empty(inputs=(a0,b0))
a2,b2 = rev.apply(my_loop,a1,b1,niter2)
result = (a2*b2).sum()
print("Iteration counter for forward evaluation", counter)
counter=0;
grad_iter = rev.gradient(result)
print("Iteration counter for backward evaluation", counter)
assert LInfNorm(grad_rec-grad_iter)<1e-13
Iteration counter for forward evaluation 8 Iteration counter for backward evaluation 8
The hessian evaluation in second order reverse AD requires even more iterations, because it combines a forward and a backward pass.
niter=2; nrec=2; counter=0;
rev,(a1,b1) = ad.Reverse2.empty(inputs=(a0,b0))
a2,b2 = rev.apply(my_outer_loop,a1,b1,niter,nrec,my_loop)
result = (a2*b2).sum()
print("Iteration counter for forward evaluation", counter)
counter=0;
grad_rec2 = rev.gradient(result)
print("Iteration counter for gradient evaluation", counter)
assert LInfNorm(grad_rec-grad_rec2)<1e-13
counter=0
hess = rev.hessian(result)
dir_hessian = grad_rec**2+1.
hess_rec2 = hess(dir_hessian)
print("Iteration counter for hessian evaluation", counter)
Iteration counter for forward evaluation 8 Iteration counter for gradient evaluation 24 Iteration counter for hessian evaluation 48
Eventually, we check our results using dense AD, and an unrolled loop.
niter2 = niter**(1+nrec); counter=0
a1 = ad.Dense2.identity(constant=a0,shift=(0,4))
b1 = ad.Dense2.identity(constant=b0,shift=(4,0))
a2,b2 = my_loop(a1,b1,niter2)
result = (a2*b2).sum()
print("Iteration counter for forward evaluation", counter)
grad_dense2 = result.coef1
hess_dense2 = np.dot(result.coef2,dir_hessian)
assert np.allclose(grad_rec2,grad_dense2)
assert np.allclose(hess_rec2,hess_dense2)
Iteration counter for forward evaluation 8
Variables used in reverseAD mode may be scalars, arrays, multidimensional arrays, as well as tuples, lists and dicts. The latter two cases, lists and dicts, must be explicitly requested as they are not considered by default.
x0=1.;x1=np.array([1.,2.]);x2=np.array([[3.,4.],[5.,6.]]);x3=np.zeros((2,2,2))
rev,(y0,y1,y2,y3)=ad.Reverse.empty(inputs=(x0,x1,x2,x3))
z0 = (y1[1]+y2*y3).sum()
t0 = rev.simplify(z0)
grad_rev = rev.gradient(y0+t0**2)
rev.to_inputshapes(grad_rev)
(array(1.), array([ 0., 256.]), array([[0., 0.], [0., 0.]]), array([[[ 96., 128.], [160., 192.]], [[ 96., 128.], [160., 192.]]]))
rev,(y0,y1,y2,y3)=ad.Reverse2.empty(inputs=(x0,x1,x2,x3))
z0 = (y1[1]+y2*y3).sum()
t0 = rev.simplify(z0)
obj = y0+t0**2
grad_rev2 = rev.gradient(obj)
hess_rev2 = rev.hessian(obj)(grad_rev2)
We next check the result using dense automatic differentiation.
y0,y1,y2,y3 = ad.Dense2.register(inputs=(x0,x1,x2,x3))
z0=(y1[1]+y2*y3).sum()
obj=y0+z0**2
grad_dens = obj.gradient()
hess_dens = lp.dot_AV(obj.hessian(),grad_dens)
y0,y1,y2,y3 = ad.Sparse2.register(inputs=(x0,x1,x2,x3))
z0=(y1[1]+y2*y3).sum()
obj=(y0+z0**2).to_dense()
grad_sp = obj.gradient()
hess_sp = lp.dot_AV(obj.hessian(),grad_dens)
assert(all(grad_dens==grad_rev))
assert(all(grad_dens==grad_rev2))
assert(all(grad_dens==grad_sp))
assert(all(hess_dens==hess_rev2))
assert(all(hess_dens==hess_sp))
By default, the provided reverseAD library does not look into input lists and dictionaries for AD information : only tuples are explored. Lists and dictionnaries are regarded as opaque objects, whose contents are not differentiable. This behavior can be changed, as described below.
def myfun(mylist,mydict,revType=ad.Reverse,co_output=None):
rev,(l,dx,dy) = revType.operator_like(
inputs=(mylist,mydict['x'],mydict['y']),co_output=co_output,
input_iterables=(tuple,list),
output_iterables=(list,))
return rev.output( [l[0]*dx+dy,l[1]*dy] )
a0=[1.,np.array([2.,3.])]
b0={'x':np.array([4.,5.]),'y':6.}
rev,[a1,b1] = ad.Reverse.empty(inputs=[a0,b0], # list of list and dict
input_iterables=(list,dict),
output_iterables=(list,))
u,v= rev.apply(myfun,a1,b1)
result = lp.dot_VV(u,v)
grad_rev1 = rev.gradient(result)
revType=ad.Reverse2
rev,[a1,b1] = ad.Reverse2.empty(inputs=[a0,b0], # list of list and dict
input_iterables=(list,dict),
output_iterables=(list,))
u,v= rev.apply(myfun,a1,b1,ad.Reverse2)
result = lp.dot_VV(u,v)
grad_rev2 = rev.gradient(result)
hess_rev2 = rev.hessian(result)(grad_rev2)
Checking with denseAD
a1,b1 = ad.Dense2.register([a0,b0],iterables=(list,dict))
u,v= myfun(a1,b1)
result = lp.dot_VV(u,v)
grad_dense2 = result.gradient()
hess_dense2 = lp.dot_AV(result.hessian(),grad_dense2)
assert(all(grad_dense2==grad_rev1))
assert(all(grad_dense2==grad_rev2))
assert(all(hess_dense2==hess_rev2))