Hand-computation of derivatives of complicated functions involving vectors and matrices is **error-prone**, and is a common source of bugs when performing numerical optimization, root-finding, sensitivity-analysis, etcetera.

The ideal alternative is to use **automatic differentiation (AD)**:

- AD lets software/compilers perform the derivatives for you. This is extremely reliable and, with modern AD software, can be very efficient.
- Unfortunately, there is still lots of code, e.g. code calling external libraries in other languages, that AD tools can't comprehend.
- And there are other cases where AD "needs help" — for example, if you are computing an answer
*approximately*(e.g. solving a nonlinear equation by Newton's method), AD can waste a lot of effort trying to*exactly differentiate the error*in your approximation. Often, you can compute the approximate answer (to the same accuracy) much more efficiently. - Even in cases where AD falls down, often you only need to give it a
**little help**: define a differentiation rule for a*small piece*of your program and let AD handle the rest. In Julia, this is done with by defining a "ChainRule", and in Python autograd/JAX it is done by defining a custom "vJp" (row-vector—Jacobian product) and/or "Jvp" (Jacobian–vector product).

If you do have to work out your own derivative rule, you *really* need to **check** it against something "brute force" and hard to get wrong. The usual candidate is…

The simplest way to check a derivative is to recall that the definition of a differential:
$$
df = f(x+dx) - f(x) = f'(x) dx
$$
came from dropping higher-order terms from a small but finite difference:
$$
\delta f = f(x+\delta x) - f(x) = f'(x) \delta x + o(\Vert \delta x \Vert)
$$
So, we can just compare the **finite difference** $\boxed{f(x+\delta x) - f(x)}$ to our **(directional) derivative operator** $f'(x) \delta x$ (i.e. the derivative in the direction $\delta x$).

$f(x+\delta x) - f(x)$ is also called a **forward difference** approximation. We'll look at other possibilities below.

The antonym of a forward diffence is a

**backward difference**approximation $f(x) - f(x - \delta x) \approx f'(x) \delta x$ … if you just want to compute a derivative, there is not much practical distinction between forward and backward differences. The distinction becomes more important when discretizing (approximating) differential equations.Finite-difference approximations come in many forms, but they are generally a

**last resort**in cases where it's too much effort to work out an analytical derivative and AD fails. But they are also useful to**check**your analytical derivatives and to quickly**explore**.

Let's try it for $f(A) = A^2$ (a matrix-valued function of a square matrix $A \in \mathbb{R}^{m\times m}$).

First, we'll do it by hand via the product rule:

$$ df = A dA + dA A $$i.e. $f'(A)$ is the **linear operator** $\boxed{f'(A)[\delta A] = A \delta A + \delta A A}$. This is **≠ 2AδA** because because **A and δA do not commute** in general.

Let's check it against a finite difference. We'll try it for a *random* input A and a *random small* perturbation $\delta A$:

In [1]:

```
f(A) = A^2
A = randn(4,4) # 4×4 matrix with Gaussian random (bell curve) entries of magnitude ~ 1 (std. dev = 1)
```

Out[1]:

4×4 Matrix{Float64}: -2.1088 0.734105 0.239804 0.769526 -0.50642 1.3228 0.820155 -0.155089 -0.57114 -0.600107 -0.00180516 -1.41079 0.160461 0.874805 -0.642604 -0.5685

In [2]:

```
dA = randn(4,4) * 1e-8 # a small random perturbation ~ 10⁻⁸
```

Out[2]:

4×4 Matrix{Float64}: 1.07507e-8 7.14782e-9 -1.93596e-9 1.54683e-9 1.38533e-8 1.54251e-8 1.97596e-8 -1.5794e-8 9.02758e-9 -2.33086e-8 1.87726e-9 2.47228e-9 2.16736e-9 7.34033e-9 6.54572e-9 8.20985e-9

In [3]:

```
approx = f(A + dA) - f(A) # finite difference
```

Out[3]:

4×4 Matrix{Float64}: -3.36055e-8 1.61716e-8 3.15253e-8 1.07036e-9 -3.08965e-8 1.42894e-9 5.37296e-8 -3.15509e-8 -2.54365e-8 -4.68219e-8 -3.85341e-8 3.51576e-9 -3.89827e-9 4.00008e-8 1.33e-8 -3.3197e-8

In [4]:

```
exact = A*dA + dA*A # exact directional derivative (linearization)
```

Out[4]:

In [5]:

```
2A*dA # wrong! matrices don't commute
```

Out[5]:

4×4 Matrix{Float64}: -1.73372e-8 -7.38125e-9 4.81509e-8 -1.58918e-8 3.98974e-8 -6.9411e-9 5.52859e-8 -4.18427e-8 -3.50552e-8 -4.73053e-8 -3.99804e-8 -5.98437e-9 1.36214e-8 5.08921e-8 2.40952e-8 -3.9649e-8

In [6]:

```
2A*dA - approx
```

Out[6]:

4×4 Matrix{Float64}: 1.62684e-8 -2.35529e-8 1.66256e-8 -1.69621e-8 7.07939e-8 -8.37003e-9 1.5563e-9 -1.02918e-8 -9.61869e-9 -4.83447e-10 -1.44632e-9 -9.50013e-9 1.75197e-8 1.08913e-8 1.07951e-8 -6.452e-9

**compared to what?** The natural answer is **small compared to the correct answer**. This is called the relative error (or "fractional" error):
$$
\text{relative error} = \frac{\Vert \text{approx} - \text{exact} \Vert}{\Vert \text{exact}\Vert}
$$
Here, $\Vert \cdot \Vert$ is a norm, like the length of a vector. In Julia, the `LinearAlgebra`

package provides a `norm`

function, and we can use that to define a relative error:

In [7]:

```
using LinearAlgebra
relative_error(approx, exact) = norm(approx - exact) / norm(exact)
```

Out[7]:

relative_error (generic function with 1 method)

Let's try it on our case above:

In [8]:

```
relative_error(approx, exact)
```

Out[8]:

1.201810486737118e-8

Pretty good — we are matching to **about eight significant digits**!

Getting a good match for a random input and a random small displacement isn't a proof of correctness, but it is always a good thing to check. This kind of randomized comparison will almost always **catch major bugs** where you have calculated the symbolic derivative incorrectly, like our $2AdA$ example:

In [9]:

```
relative_error(2A*dA, exact)
```

Out[9]:

0.7126976615760792

Here, for our buggy derivative, the relative error was **close to 50%**.

Wait, what do we mean by the norm $\Vert A \Vert$ of a matrix $A$?

The ordinary Euclidean ("L2") norm of a column-vector $x \in \mathbb{R}^n$ is $\Vert x \Vert = \sqrt{x^T x} = \sqrt{\sum_i x_i^2}$.

For a matrix, the natural analogue is the **sqrt of the sum of the squares of the entries**. This is called the Frobenius norm, and is what `norm(A)`

does in Julia by default.

In [10]:

```
using Symbolics
@variables m11 m12 m13 m21 m22 m23
M = [m11 m12 m13
m21 m22 m23]
```

Out[10]:

\begin{equation}
\left[
\begin{array}{ccc}
m11 & m12 & m13 \\
m21 & m22 & m23 \\
\end{array}
\right]
\end{equation}

In [11]:

```
norm(M)
```

Out[11]:

\begin{equation}
\sqrt{\mathrm{abs2}\left( m11 \right) + \mathrm{abs2}\left( m12 \right) + \mathrm{abs2}\left( m13 \right) + \mathrm{abs2}\left( m21 \right) + \mathrm{abs2}\left( m22 \right) + \mathrm{abs2}\left( m23 \right)}
\end{equation}

(`abs2(x)`

in Julia is the squared absolute value $|x|^2$.)

But we can write this in linear algebra notation!

Recall that the (Euclidean) "inner product" of two matrices is $\text{tr}(B^T A)$. Then we can define the Frobenius norm of a matrix as the sqrt of the dot product with itself!

$$ \Vert A \Vert_F = \sqrt{\text{trace}(A^T A)} $$In [12]:

```
norm(A)
```

Out[12]:

3.541526593233751

In [13]:

```
sqrt(tr(A'A))
```

Out[13]:

3.5415265932337516

In [14]:

```
sqrt(sum(abs2, A))
```

Out[14]:

3.541526593233751

In [15]:

```
sqrt(sum(A[i,j]^2 for i = 1:4, j = 1:4))
```

Out[15]:

3.541526593233751

How accurate is our finite-difference approximation above? How should we choose the size of δx?

In [16]:

```
using Plots # load a plotting package; there are several to choose from in Julia; this is a popular one
using LaTeXStrings # makes it easy to put LaTeX equations in L"..." strings
```

In [17]:

```
s = 10 .^ range(-8,8,50)
errors = [relative_error(f(A+s*dA)-f(A), exact*s) for s in s]
plot(s * norm(dA), errors, xaxis=:log, yaxis=:log, markershape=:circle,
title = L"forward-difference accuracy for $f(A)=A^2$", xlabel=L"\Vert \delta A\Vert", ylabel="relative error",
label = "error")
plot!(s, s, label=L"linear $\Vert \delta A\Vert$ reference")
```

Out[17]:

We observe two main features:

- The relative error decreases
*linearly*with $\Vert \delta A \Vert$. This is called**first-order accuracy**. Why? - When δA gets too small, the error
*increases*. Why?

If the relative error scales with $\Vert \delta A \Vert^n$, we call the approximation **n-th order accurate**. For forward differences, here, the order is **n=1**. Why?

For any $f(x)$ with a nonzero second derivative (think of the Taylor series), we have $$ f(x + \delta x) = f(x) + f'(x) \delta x + (\text{terms proportional to }\Vert \delta x \Vert^2) + \underbrace{o(\Vert \delta x \Vert^2)}_\text{i.e. higher-order terms} $$

*dropped* in our forward-difference approximations are proportional to $\Vert \delta x\Vert^2$. But that means that the **relative error is linear**:
$$
\begin{align}
\text{relative error} &= \frac{\Vert f(x+\delta x) - f(x) - f'(x) \delta x \Vert}{\Vert f'(x) \delta x \Vert} \\
&= \frac{(\text{terms proportional to }\Vert \delta x \Vert^2) + o(\Vert \delta x \Vert^2)}{\text{proportional to }\Vert \delta x \Vert} = (\text{terms proportional to }\Vert \delta x \Vert) + o(\Vert \delta x \Vert)
\end{align}
$$

The reason why the error *increased* for very small $\delta A$ was due to **roundoff errors**. The computer only stores a **finite number of significant digits** (about 15 decimal digits) for each real number and rounds off the rest on each operation — this is called floating-point arithmetic.

If δx is too small, then the difference $f(x+\delta x) - f(x)$ gets rounded off to zero (some or all of the *significant digits cancel*). This is called catastrophic cancellation.

Let's compute the derivative of $\sin(x)$ at $x=1$ with finite differences. For a scalar $f(x)$, we have $f(x+\delta x) - f(x) \approx f'(x) \delta x$, or equivalently:

$$ f'(x) = \frac{f(x+\delta x) - f(x)}{\delta x} + \underbrace{O(\delta x)}_{\text{terms proportional to }|\delta x|\text{ or smaller}} $$(This is the usual way you see a forward difference approximation written, but it only works when $\delta x$ is a scalar.)

In [18]:

```
dx = 1e-5
(sin(1 + dx) - sin(1))/dx
```

Out[18]:

0.5402980985058647

In [19]:

```
cos(1) # correct answer
```

Out[19]:

0.5403023058681398

In [20]:

```
relative_error((sin(1 + dx) - sin(1))/dx, cos(1))
```

Out[20]:

7.787052228587771e-6

In [21]:

```
dx = 1e-100
(sin(1 + dx) - sin(1))/dx
```

Out[21]:

0.0

In [22]:

```
1e-100
```

Out[22]:

1.0e-100

But look at this:

In [23]:

```
1 + 1e-100
```

Out[23]:

1.0

The correct answer would be $$ 1 + 10^{-100} = 1.\underbrace{000000\cdots000000}_{\text{99 zeros}}1 $$ but the computer only stores about 15 decimal digits (by default … you can ask for more, at the expense of speed), so it rounds this to $1.0$.

Hence, our finite difference gave zero.

Even if we do something less extreme, we have problems. This looks okay:

In [24]:

```
dx = 1e-13
1 + dx
```

Out[24]:

1.0000000000001

because the computer has enough digits, right? But compare:

In [25]:

```
sin(1 + dx)
```

Out[25]:

0.8414709848079505

In [26]:

```
sin(1)
```

Out[26]:

0.8414709848078965

**last 4 significant digits stored**. So when you subtract them, you only have **4 digits** left:

In [27]:

```
relative_error((sin(1 + dx) - sin(1))/dx, cos(1))
```

Out[27]:

0.001358343083756518

and the error for $\delta x = 10^{-13}$ is now $10^{-3}$!!