This notebook illustrates automatic differentiation, with an application to a simple one-dimensional denoising problem. More precisely, we use first and second order sparse automatic differentiation, as well as dense automatic differentiation and the composition of the two.
Summary of volume Algorithmic tools, this series of notebooks.
Main summary of the Adaptive Grid Discretizations book of notebooks, including the other volumes.
Acknowledgement. Some of the experiments presented in these notebooks are part of ongoing research with Ludovic Métivier and Da Chen.
Copyright Jean-Marie Mirebeau, Centre Borelli, ENS Paris-Saclay, CNRS, University Paris-Saclay
Known bugs and incompatibilities. Our implementation of automatic differentiation is based on subclassing the numpy array class. While simple and powerful, this approach suffers from a few pitfalls, described in notebook ADBugs.
import sys; sys.path.insert(0,"..") # Allow import of agd from parent directory (useless if conda package installed)
#from Miscellaneous import TocTools; print(TocTools.displayTOC('Sparse','Algo'))
import numpy as np
import agd.AutomaticDifferentiation as ad
import agd.FiniteDifferences as fd
def LInfNorm(a): return np.max(np.abs(a))
def LInfNorm_AD(a): return np.max((LInfNorm(a.value),LInfNorm(a.coef)))
def LInfNorm_AD2(a): return np.max((LInfNorm(a.value),LInfNorm(a.coef1),LInfNorm(a.coef2)))
def reload_packages():
    from Miscellaneous.rreload import rreload
    global ad,fd
    [ad,fd] = rreload([ad,fd],rootdir='..')
We illustrate sparse automatic differentiation with an optimization problem of the form $$ \min_u \int_\Omega F(x,\nabla u(x)) + G(x,u(x)) \, dx, $$ where the unknown $u$ is a function defined over a given domain $\Omega$, with arbitrary boundary conditions.
The cost functions $F$ and $G$ are given as well. Classically, $F$ is a regularization term, whereas $G$ is a data fidelity term. The simplest case would be $$ F_0(x,v) := \frac 1 2 \|v\|^2, \qquad G_0(x,y) := \frac 1 2 \|y-g_0(x)\|^2, $$ where $g_0(x)$ is some given data. However, one may also think of numerous variants, involving for instance:
Costs involving a kernel. Applications such as deblurring often involve an additional cost of the form $$ H(x, (\rho\star u)(x)), $$ where $\rho$ is a convolution kernel. Because convolution is a non-local operator with a large support, sparse automatic differentiation is not a good fit. Backward automatic differentiation would be appropriate here, but it is not supported at this point.
Problem parameters. In order to emphasize automatic differentiation tools, rather than a particular application, we consider as a start one of the most basic problem instantiations. See the other notebooks for more advanced examples.
The domain $\Omega$ is defined as the (one dimensional) segment $[0,1]$ equipped with periodic boundary conditions. We let $F_0$ and $G_0$ be the simple instances defined above, with a discontinuous datum $g_0$.
def F0(x,v):
    return 0.5*v**2

def g0(x):
    return (np.abs(x-0.5)<0.3) + np.sin(2.*np.pi*x)

def G0(x,y):
    return 0.5*(y-g0(x))**2
def Objective(u,x,dx,F,G):
    du = fd.DiffUpwind(u,(1,),dx,padding=None) # padding=None is for periodic bc
    return (F(x+0.5*dx,du)+G(x,u)).sum(axis=0) * dx # Integrate over the domain
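For reference, in this one-dimensional periodic setting, the upwind finite difference used above can be sketched in plain numpy as a rolled difference. This mimics, but is not, the fd.DiffUpwind implementation:

```python
import numpy as np

def diff_upwind_periodic(u, dx):
    # (u_{i+1} - u_i)/dx, with wrap-around mimicking periodic boundary conditions
    return (np.roll(u, -1) - u) / dx

u = np.array([0., 1., 3., 0.])
du = diff_upwind_periodic(u, 0.25)  # [4., 8., -12., 0.]
```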
x,dx=np.linspace(0,1,10,endpoint=False,retstep=True)
u=np.zeros(x.shape)
Objective(u,x,dx,F0,G0)
0.5
The optimality condition of the variational problem, for the above specific choices of $F_0$ and $G_0$, is $$ -\Delta u + u-g_0 = 0. $$ We can therefore, alternatively, solve this PDE directly instead of going through the minimization problem.
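For completeness, this optimality condition follows from the classical first-variation computation, using integration by parts with no boundary term thanks to the periodic boundary conditions:

```latex
\frac{d}{dt}\Big|_{t=0} \int_\Omega \frac 1 2 |\nabla (u+tv)|^2 + \frac 1 2 (u+tv-g_0)^2 \, dx
= \int_\Omega \langle \nabla u, \nabla v\rangle + (u-g_0)\, v \, dx
= \int_\Omega (-\Delta u + u - g_0)\, v \, dx,
```

which must vanish for every perturbation $v$, hence $-\Delta u + u - g_0 = 0$.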
def Scheme0(u,x):
    h=float(x[1]-x[0])
    d2u = fd.Diff2(u,(1,),h,padding=None) # padding=None is for periodic bc
    return -d2u + u - g0(x)
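Likewise, the centered second difference computed here by fd.Diff2 can, in this periodic 1D setting, be sketched in plain numpy as follows (an illustration, not the library code):

```python
import numpy as np

def diff2_periodic(u, h):
    # (u_{i+1} - 2 u_i + u_{i-1}) / h^2, with periodic wrap-around
    return (np.roll(u, -1) - 2.*u + np.roll(u, 1)) / h**2

d2u = diff2_periodic(np.array([0., 1., 0., 0.]), 1.)  # [1., -2., 1., 0.]
```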
Scheme0(u,x)
array([ 0. , -0.58778525, -0.95105652, -1.95105652, -1.58778525, -1. , -0.41221475, -0.04894348, 0.95105652, 0.58778525])
The sparse automatic differentiation classes, first and second order, allow one to compute gradients, Jacobian matrices, and Hessians, represented in sparse form.
Assume that $u^0 = (u^0_i)_{0 \leq i < n}$ is some array.
Using the ad.Sparse.identity constructor, one can create an augmented variable $u^1$ with components
$$
u^1_i = u^0_i + \delta_i + o(\delta_i)
$$
where $(\delta_i)_{0 \leq i < n}$ are independent first order symbolic perturbations.
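The principle can be emulated with a toy class storing (value, coefficients, indices) triplets. This is only a sketch of the sparse first-order representation, not the agd implementation:

```python
class ToySparse:
    """Toy sparse first-order AD: value + sum_j coefs[j] * delta_{indices[j]}."""
    def __init__(self, value, coefs, indices):
        self.value, self.coefs, self.indices = value, coefs, indices

    def __add__(self, other):
        # (a + da) + (b + db) = (a+b) + (da + db): concatenate the perturbations
        return ToySparse(self.value + other.value,
                         self.coefs + other.coefs,
                         self.indices + other.indices)

    def __mul__(self, other):
        # (a + da) * (b + db) = a*b + b*da + a*db, to first order
        return ToySparse(self.value * other.value,
                         [other.value * c for c in self.coefs]
                         + [self.value * c for c in other.coefs],
                         self.indices + other.indices)

def toy_identity(values):
    # Component i carries the elementary first-order perturbation delta_i
    return [ToySparse(v, [1.], [i]) for i, v in enumerate(values)]

u = toy_identity([2., 3.])
p = u[0] * u[1]   # value 6, with d/du_0 = 3 and d/du_1 = 2
```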
x = np.linspace(0,1,5,endpoint=False)
u0 = np.cos(2.*np.pi*x)
u1 = ad.Sparse.identity(constant=u0)
Elementary properties of the np.ndarray class are inherited
print("ndim :",u1.ndim,", shape :",u1.shape,", size :",u1.size,", len :",len(u1))
ndim : 1 , shape : (5,) , size : 5 , len : 5
However, the internal data includes coefficient and index arrays, with one additional dimension. For now, this additional dimension is a singleton, because the perturbation of each component $u^1_i$ is the single elementary perturbation $\delta_i$.
u1.size_ad
1
print("Constant term :", u1.value)
print("Coefficients of the symbolic perturbation :", u1.coef)
print("Indices of the symbolic perturbation :", u1.index)
Constant term : [ 1. 0.30901699 -0.80901699 -0.80901699 0.30901699]
Coefficients of the symbolic perturbation : [[1.] [1.] [1.] [1.] [1.]]
Indices of the symbolic perturbation : [[0] [1] [2] [3] [4]]
u1 # Scalar values, gradient coefficients, gradient indices.
spAD(array([ 1. , 0.30901699, -0.80901699, -0.80901699, 0.30901699]), array([[1.], [1.], [1.], [1.], [1.]]), array([[0], [1], [2], [3], [4]]))
Second order sparse automatic differentiation is based on the same principles.
Using the ad.Sparse2.identity constructor, one generates an augmented variable $u^2$ with components
$$
u^2_i = u^0_i + \delta_i + o(\delta_i^2)
$$
where $(\delta_i)_{0 \leq i < n}$ are independent symbolic perturbations. The only difference with the first order case lies in the remainder $o(\delta_i^2)$: the second order coefficients are known, and explicitly set to zero here.
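As a quick sanity check of the expansion encoded by second order AD, one can verify numerically that the remainder $f(u^0+\delta) - f(u^0) - f'(u^0)\delta - \frac 1 2 f''(u^0)\delta^2$ is indeed $o(\delta^2)$, here for $f=\sin$:

```python
import math

u0, d = 0.3, 1e-3
exact  = math.sin(u0 + d)
order2 = math.sin(u0) + math.cos(u0)*d - 0.5*math.sin(u0)*d**2
error  = abs(exact - order2)  # O(d^3), hence roughly 1e-10 here
```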
u2 = ad.Sparse2.identity(constant=u0)
print("Constant term :", u2.value)
print("First order coefficients :", u2.coef1)
print("First order indices :", u2.index)
print()
print("Second order coefficients :",u2.coef2)
print("Second order row indices :",u2.index_row)
print("Second order column indices :",u2.index_col)
Constant term : [ 1. 0.30901699 -0.80901699 -0.80901699 0.30901699]
First order coefficients : [[1.] [1.] [1.] [1.] [1.]]
First order indices : [[0] [1] [2] [3] [4]]

Second order coefficients : []
Second order row indices : []
Second order column indices : []
u2
spAD2(array([ 1. , 0.30901699, -0.80901699, -0.80901699, 0.30901699]), array([[1.], [1.], [1.], [1.], [1.]]), array([[0], [1], [2], [3], [4]]), array([], shape=(5, 0), dtype=float64), array([], shape=(5, 0), dtype=int64), array([], shape=(5, 0), dtype=int64))
Arithmetic operations, such as left and right multiplication, addition, subtraction, and division, work as usual. In particular, we can evaluate our objective function.
obj1 = Objective(u1,x,dx,F0,F0)
The result is an array scalar, i.e. an array whose shape is the empty tuple, but it carries substantial AD information.
obj1
spAD(array(17.39957514), array([ -6.90983006, -11.18033989, 0. , 11.18033989, 6.90983006, 6.90983006, 11.18033989, -0. , -11.18033989, -6.90983006, 0.1 , 0.0309017 , -0.0809017 , -0.0809017 , 0.0309017 ]), array([1, 2, 3, 4, 0, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4]))
Dense representation of the AD information would be more adequate for this particular variable. A conversion is obtained as follows.
obj1.to_dense()
denseAD(array(17.39957514), array([ 13.91966011, 4.30141153, -11.26124159, -11.26124159, 4.30141153]))
Note also that the variable $b^1=$ obj1 has the form
$$
b^1 = b^0 + \sum_{0 \leq i <I} \alpha_i \delta_{k_i},
$$
where the indices $k_i$ are neither sorted nor pairwise distinct. A canonical form is obtained with the simplify_ad method.
obj1.simplify_ad() # In place simplification
obj1
spAD(array(17.39957514), array([ 13.91966011, 4.30141153, -11.26124159, -11.26124159, 4.30141153]), array([0, 1, 2, 3, 4]))
obj1.to_dense()
denseAD(array(17.39957514), array([ 13.91966011, 4.30141153, -11.26124159, -11.26124159, 4.30141153]))
First order automatic differentiation provides us with the gradient of the objective function, which enables gradient descent methods for minimization. However, Newton methods are more appropriate in certain contexts. For that purpose, second order automatic differentiation can extract the Hessian of the objective function.
obj2 = Objective(u2,x,dx,F0,G0)
Looking at the dense form, one recognizes the matrix of a Laplacian.
obj2.to_dense()
denseAD2(array(17.78637854), array([ 13.91966011, 4.20630588, -11.42002011, -11.30246306, 4.39651718]), array([[ 20.1, -10. , 0. , 0. , -10. ], [-10. , 20.1, -10. , 0. , 0. ], [ 0. , -10. , 20.1, -10. , 0. ], [ 0. , 0. , -10. , 20.1, -10. ], [-10. , 0. , 0. , -10. , 20.1]]))
Our particular objective function, with these specific choices of $F_0$ and $G_0$, is quadratic. Therefore obj2 is in fact an exact representation, and a Newton method based on the reconstructed gradient and Hessian would converge in a single step. See the notebook Elliptic for more on this approach.
The numerical scheme, returning the PDE residue, can also be evaluated with AD variables.
scheme0 = Scheme0(u0,x)
scheme1 = Scheme0(u1,x)
scheme2 = Scheme0(u2,x)
The constant part is simply the PDE residue
scheme0
array([ 35.54915028, 10.03423506, -30.34765197, -29.17208146, 11.93634809])
The first order part contains the Jacobian matrix of the numerical scheme.
scheme1
spAD(array([ 35.54915028, 10.03423506, -30.34765197, -29.17208146, 11.93634809]), array([[-25., 50., -25., 1.], [-25., 50., -25., 1.], [-25., 50., -25., 1.], [-25., 50., -25., 1.], [-25., 50., -25., 1.]]), array([[1, 0, 4, 0], [2, 1, 0, 1], [3, 2, 1, 2], [4, 3, 2, 3], [0, 4, 3, 4]]))
Looking at the dense form, we recognize the matrix of a Laplacian.
scheme1.to_dense(5)
denseAD(array([ 35.54915028, 10.03423506, -30.34765197, -29.17208146, 11.93634809]), array([[ 51., -25., 0., 0., -25.], [-25., 51., -25., 0., 0.], [ 0., -25., 51., -25., 0.], [ 0., 0., -25., 51., -25.], [-25., 0., 0., -25., 51.]]))
Since this specific numerical scheme is linear, the variable scheme1 is a faithful representation of it, and a Newton method based on the extracted Jacobian will converge in a single step. See the notebook LinearMonotoneSchemes2D for similar but more advanced examples.
Again, since this numerical scheme is linear, second order terms are zero.
scheme2
spAD2(array([ 35.54915028, 10.03423506, -30.34765197, -29.17208146, 11.93634809]), array([[-25., 50., -25., 1.], [-25., 50., -25., 1.], [-25., 50., -25., 1.], [-25., 50., -25., 1.], [-25., 50., -25., 1.]]), array([[1, 0, 4, 0], [2, 1, 0, 1], [3, 2, 1, 2], [4, 3, 2, 3], [0, 4, 3, 4]]), array([], shape=(5, 0), dtype=float64), array([], shape=(5, 0), dtype=int64), array([], shape=(5, 0), dtype=int64))
A number of elementary mathematical functions are implemented.
np.sqrt(1+u1)
spAD(array([1.41421356, 1.14412281, 0.43701602, 0.43701602, 1.14412281]), array([[0.35355339], [0.43701602], [1.14412281], [1.14412281], [0.43701602]]), array([[0], [1], [2], [3], [4]]))
np.abs(u1)
spAD(array([1. , 0.30901699, 0.80901699, 0.80901699, 0.30901699]), array([[ 1.], [ 1.], [-1.], [-1.], [ 1.]]), array([[0], [1], [2], [3], [4]]))
Comparison operators, as well as integer-valued functions, return a basic np.ndarray. Indeed, since these functions take their values in a discrete set, AD information would be pointless.
u1 <= (-1+x)
array([False, False, True, True, False])
np.floor(u1+0.5)
array([ 1., 0., -1., -1., 0.])
Maximum and minimum work as well.
np.maximum(x,u1)
spAD(array([1. , 0.30901699, 0.4 , 0.6 , 0.8 ]), array([[1.], [1.], [0.], [0.], [0.]]), array([[0], [1], [2], [3], [4]]))
Let us recall the recommended use cases of Sparse and Dense automatic differentiation. In particular, using Sparse AD with a complex function, or Dense AD with many independent variables, will likely be inefficient in terms of memory usage and computation time. However, when numerically solving a PDE, one is often confronted with expressions of the form $$ f\Big(\frac{u(x+h e_0)-u(x)} h,\cdots,\frac{u(x+h e_n)-u(x)} h\Big), $$ involving a possibly complicated function $f$ of a few local finite differences of an unknown $u$ with many degrees of freedom. We show below how to combine sparse and dense automatic differentiation to evaluate such expressions efficiently.
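In dense linear algebra terms, this composition boils down to the chain rule: if $y=f(v)$ has a small dense Jacobian $Df$, and $v$ carries derivatives with respect to many variables $u$, then $dy/du = Df \cdot dv/du$. A minimal numpy sketch, with made-up Jacobians:

```python
import numpy as np

def compose_jacobians(Df, dv_du):
    # Chain rule: d(f(v))/du = Df(v) @ dv/du
    return Df @ dv_du

Df = np.array([[2., 1.]])             # hypothetical Jacobian of f : R^2 -> R
dv_du = np.array([[1., 0., -1.],      # derivatives of v_0 w.r.t. u_0, u_1, u_2
                  [0., 3., 1.]])      # derivatives of v_1 w.r.t. u_0, u_1, u_2
dy_du = compose_jacobians(Df, dv_du)  # [[2., 3., -1.]]
```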
Sparse automatic differentiation, as implemented, is mostly intended for very simple functions. Indeed, the size of the AD information grows quickly and can become overwhelming. For instance, consider the following function involving a loop.
def fun_with_loop(y):
    r=y.copy()
    for i in range(3):
        r=r+r**2
    return r
f1 = fun_with_loop(u1)
f1
spAD(array([42. , 0.89091371, -0.11356996, -0.11356996, 0.89091371]), array([[ 1.00000000e+00, 2.00000000e+00, 4.00000000e+00, 8.00000000e+00, 1.20000000e+01, 2.40000000e+01, 4.80000000e+01, 9.60000000e+01], [ 1.00000000e+00, 6.18033989e-01, 8.09016994e-01, 5.00000000e-01, 1.13627124e+00, 7.02254249e-01, 9.19262746e-01, 5.68135621e-01], [ 1.00000000e+00, -1.61803399e+00, -3.09016994e-01, 5.00000000e-01, -2.61271243e-01, 4.22745751e-01, 8.07372542e-02, -1.30635621e-01], [ 1.00000000e+00, -1.61803399e+00, -3.09016994e-01, 5.00000000e-01, -2.61271243e-01, 4.22745751e-01, 8.07372542e-02, -1.30635621e-01], [ 1.00000000e+00, 6.18033989e-01, 8.09016994e-01, 5.00000000e-01, 1.13627124e+00, 7.02254249e-01, 9.19262746e-01, 5.68135621e-01]]), array([[0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1], [2, 2, 2, 2, 2, 2, 2, 2], [3, 3, 3, 3, 3, 3, 3, 3], [4, 4, 4, 4, 4, 4, 4, 4]]))
After each binary operation, the AD information is as big as the concatenation of the inputs' AD information. In a loop, it therefore grows exponentially and becomes extremely redundant. Here, the result can be simplified considerably.
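The growth can be tracked by counting stored coefficients. Assuming, as above, that each binary operation concatenates its operands' coefficient arrays, and that r**2 contributes one rescaled copy of the coefficients of r (via the chain rule), each iteration of r = r + r**2 doubles the count:

```python
size = 1                 # r starts with one elementary perturbation
sizes = [size]
for _ in range(3):
    size = size + size   # coefficients of r, plus those of 2*r*dr
    sizes.append(size)
# sizes == [1, 2, 4, 8], consistent with the 8 columns in the output above
```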
AD information is, however, simplified only at the request of the user, because simplification is a costly operation, involving some sorting and array resizing. Moreover, simplifying only the final result only partially solves the problem, due to the cost of computing the intermediate results.
f1.simplify_ad()
f1
spAD(array([42. , 0.89091371, -0.11356996, -0.11356996, 0.89091371]), array([[195. ], [ 6.25297484], [ -0.31547484], [ -0.31547484], [ 6.25297484]]), array([[0], [1], [2], [3], [4]]))
When a loop is present, the AD information grows exponentially and its size quickly gets out of control, as illustrated above. A partial solution is to simplify the AD information at each iteration.
def fun_with_loop_simpl(y):
    r=y.copy()
    for i in range(3):
        r=r+r**2
        ad.simplify_ad(r) # Simplify the AD information at each iteration
    return r
Complexity is now linear with respect to the number of iterations, instead of exponential.
fun_with_loop_simpl(u1)
spAD(array([42. , 0.89091371, -0.11356996, -0.11356996, 0.89091371]), array([[195. ], [ 6.25297484], [ -0.31547484], [ -0.31547484], [ 6.25297484]]), array([[0], [1], [2], [3], [4]]))
The function had to be slightly modified, but it still applies transparently to non AD variables.
fun_with_loop_simpl(u0)
array([42. , 0.89091371, -0.11356996, -0.11356996, 0.89091371])
Dense automatic differentiation is not subject to the same limitations: it is efficient for evaluating complex functions depending only on a few independent variables.
When such a function needs to be evaluated on a sparse AD variable, the recommended approach is to compose dense and sparse automatic differentiation. This is achieved using the ad.apply command, indicating that the last dimensions are bound together.
u1 = ad.Sparse.identity(constant=u0)
u2 = ad.Sparse2.identity(constant=u0)
ad.apply(fun_with_loop,u1,shape_bound=u1.shape)
spAD(array([42. , 0.89091371, -0.11356996, -0.11356996, 0.89091371]), array([[195. ], [ 6.25297484], [ -0.31547484], [ -0.31547484], [ 6.25297484]]), array([[0], [1], [2], [3], [4]]))
The following command achieves the same result "by hand", in a less automated way.
# These dimensions can be "bound together", a.k.a. not regarded as independent.
shape_bound = u1.shape
# Generate an adequately dimensioned denseAD variable
u1d = ad.Dense.identity(constant=u1.value,shape_bound=shape_bound)
# Evaluate the complex function
fu1d = fun_with_loop(u1d)
# Compose the dense and sparse AD
ad.compose(fu1d,np.expand_dims(u1,axis=0),shape_bound=shape_bound)
spAD(array([42. , 0.89091371, -0.11356996, -0.11356996, 0.89091371]), array([[195. ], [ 6.25297484], [ -0.31547484], [ -0.31547484], [ 6.25297484]]), array([[0], [1], [2], [3], [4]]))
Composition also works between dense AD types. It may speed up the evaluation of a complex function on a dense AD type with many components.
ad.apply(fun_with_loop,u1.to_dense(),shape_bound=u1.shape)
denseAD(array([42. , 0.89091371, -0.11356996, -0.11356996, 0.89091371]), array([[195. , 0. , 0. , 0. , 0. ], [ 0. , 6.25297484, 0. , 0. , 0. ], [ 0. , 0. , -0.31547484, 0. , 0. ], [ 0. , 0. , 0. , -0.31547484, 0. ], [ 0. , 0. , 0. , 0. , 6.25297484]]))
Composition also works for second order automatic differentiation.
result = fun_with_loop(u2.to_dense())
tuple(LInfNorm_AD2(result - a.to_dense()) for a in (
fun_with_loop(u2),
fun_with_loop_simpl(u2),
ad.apply(fun_with_loop,u2,shape_bound=u2.shape),
))
(1.4210854715202004e-14, 7.105427357601002e-15, 0.0)
tuple(LInfNorm_AD2(result - a) for a in (
ad.apply(fun_with_loop,u2.to_dense(),shape_bound=u2.shape),
))
(0.0,)
The envelope theorem. Consider a function $f:X \to R$ defined as the minimum of a bivariate function $F:X \times A \to R$ over the second variable: $$ f(x):=\min_{\alpha\in A} F(x,\alpha), $$ for each element $x$ of a domain $X \subset R^d$, and where $A$ is an arbitrary set. The envelope theorem states that, under minor technical assumptions, one has $$ \nabla f(x) = \frac{\partial F}{\partial x}(x,\alpha(x)), $$ where $\alpha(x) \in A$ is the minimizer to the above optimization problem, that is $f(x) = F(x,\alpha(x))$.
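A quick numerical illustration of the envelope theorem, on a made-up instance $F(x,\alpha) = \alpha x + (1-\alpha)^2$ minimized over a finite grid of $\alpha$ values: the centered finite difference of $f$ matches $\frac{\partial F}{\partial x}(x,\alpha(x)) = \alpha(x)$.

```python
import numpy as np

alphas = np.linspace(0., 1., 101)   # discrete set A (illustrative choice)

def F(x, a): return a*x + (1.-a)**2

def f(x): return np.min(F(x, alphas))            # f(x) = min over a of F(x,a)
def argmin_alpha(x): return alphas[np.argmin(F(x, alphas))]

x0, h = 0.7, 1e-6
numeric  = (f(x0+h) - f(x0-h)) / (2.*h)  # centered finite difference of f
envelope = argmin_alpha(x0)              # dF/dx(x,a) = a, at the minimizer
```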
Generalizations. The function $f$ may be defined by a maximum instead of a minimum, or a combination e.g. $f(x) =\min_{\alpha\in A}\max_{\beta \in B} F(x,\alpha,\beta)$. It may also be defined in a piecewise manner, etc.
Relevance to automatic differentiation. In a typical practical situation, the minimizer $\alpha(x)$ is obtained by a numerical optimization procedure, which may be expensive and is often non-differentiable. The envelope theorem then allows one to recover the gradient of $f$ by differentiating $F$ with respect to its first variable only, at the computed minimizer.
First order derivatives only! From a mathematical standpoint, the envelope theorem does not, in general, apply to second order derivatives. Keep this in mind: our AD library will not stop you or issue a warning if you attempt to use it anyway.
Practical implementation. Implement the function $f$ with the following enhancements:
- accept an optional oracle parameter (corresponding to $\alpha(x)$); if the oracle is provided, then the optimization step is bypassed;
- return the minimizer in addition to the value, so that it can be passed as an oracle in subsequent calls.
Then call the function $f$ twice: first with raw np.ndarray data types, and then with the desired AD type. We provide a helper function in the AD library to automate this process.
u1 = ad.Sparse.identity(constant=u0)
u2 = ad.Sparse2.identity(constant=u0)
We define a function whose evaluation involves a minimization by exhaustive search in a large table, here with $100$ entries.
def f(x,A,B):
    return np.min(A*x+B,axis=0) # Exhaustive search over the first axis
A=np.linspace(0,1,100)**0.5; B=np.linspace(1,0,100)**2
A,B = (np.expand_dims(e,axis=-1) for e in (A,B))
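Note the broadcasting at work: A*x+B has one row per candidate, and np.min over axis 0 computes the lower envelope of the affine functions $x \mapsto A_\alpha x + B_\alpha$. A tiny standalone check, with toy coefficients:

```python
import numpy as np

A = np.array([[0.], [1.], [2.]])    # slopes, shaped (3,1) to broadcast against x
B = np.array([[1.], [0.5], [-1.]])  # intercepts
x = np.array([0., 1., 2.])
values = np.min(A*x + B, axis=0)    # [-1., 1., 1.]
```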
A little bit of rewriting is needed to make this function compatible with the oracle interface.
For convenience, we provide a min_argmin function, and likewise a max_argmax function. (Note that the minimizer index in the output of f_oracle will be erroneous if an oracle is given, but this is not an issue here.)
def f_oracle(x,A,B,oracle=None):
    if oracle is not None:
        A,B = (np.take_along_axis(e,np.expand_dims(oracle,axis=0),axis=0) for e in (A,B))
    return ad.min_argmin(A*x+B,axis=0)
There is no need to use the AD types to compute the minimizer.
value,minimizer = f_oracle(u0,A,B)
print("Value (without AD) : ", value, "\nand minimizer ", minimizer)
Value (without AD) : [ 0.92667447 0.30279844 -0.80901699 -0.80901699 0.30279844] and minimizer [69 91 99 99 91]
Given the minimizer, one can efficiently evaluate $f$ on the AD type, bypassing most computations.
f_oracle(u1,A,B,oracle=minimizer)[0]
spAD(array([ 0.92667447, 0.30279844, -0.80901699, -0.80901699, 0.30279844]), array([[0.83484711], [0.95874497], [1. ], [1. ], [0.95874497]]), array([[0], [1], [2], [3], [4]]))
Use ad.apply with the keyword argument envelope=True to automatically apply the above procedure.
ad.apply(f_oracle,u1,A,B,envelope=True)
(spAD(array([ 0.92667447, 0.30279844, -0.80901699, -0.80901699, 0.30279844]), array([[0.83484711], [0.95874497], [1. ], [1. ], [0.95874497]]), array([[0], [1], [2], [3], [4]])), array([69, 91, 99, 99, 91]))