sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
 [1] fansi_1.0.4     crayon_1.5.2    digest_0.6.31   utf8_1.2.2
 [5] IRdisplay_1.1   repr_1.1.5      lifecycle_1.0.3 jsonlite_1.8.4
 [9] evaluate_0.20   pillar_1.8.1    rlang_1.0.6     cli_3.6.0
[13] uuid_1.1-0      vctrs_0.5.2     IRkernel_1.3.2  tools_4.2.2
[17] glue_1.6.2      fastmap_1.1.0   compiler_4.2.2  base64enc_0.1-3
[21] pbdZMQ_0.3-9    htmltools_0.5.4
- If $f(\mathbf{x}) = \mathbf{A}\mathbf{x} - \mathbf{b}$, then we now know quite well how to solve this.
- However, if $f$ is nonlinear in $\mathbf{x}$, closed-form solutions are out of reach -- there is no "direct" method.
- Iterative, numerical methods come to the rescue.
Examples:
$$ f(x) = g'(x) = \frac{1 + x^{-1} - \log x}{(1 + x)^2}. $$
Nonlinear equation solving and optimization are closely related (at least when $g$ is differentiable).
Assume $f$ is continuous on $[l, r]$ and $f(l)f(r) \le 0$.
Then by the Intermediate Value Theorem, there exists $x^{\star}$ such that $f(x^{\star})=0$.
Consider the midpoint $m = (l + r)/2$.
If $f(m) = 0$, we are done.
Otherwise, start over with $r \gets m$ if $f(l)f(m) < 0$, or with $l \gets m$ if $f(m)f(r) < 0$.
Algorithm
bisect <- function(f, l, r, tol=1e-6) {
    iter <- 0
    while (r - l > tol) {
        m <- (l + r) / 2
        if (f(l) * f(m) < 0) {
            r <- m
        } else {
            l <- m
        }
        iter <- iter + 1
    }
    list(val = (l + r) / 2, iter = iter)  # return midpoint and iteration count
}
F <- function(x) { pt(x, 5) - 0.95 } # Student's t distribution with 5 df, 95%ile
F(1.291)
F(2.582)
ret <- bisect(F, 1.291, 2.582)
ret$val
ret$iter
qt(0.95, 5)
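As a sanity check (not part of the original run): the bracket halves every iteration, so the iteration count of `bisect` is predictable from the initial width and the tolerance.

```r
# Bisection halves the bracket [l, r] each iteration, so it stops after
# ceiling(log2((r - l) / tol)) iterations -- predictable before running it.
l <- 1.291; r <- 2.582; tol <- 1e-6
predicted <- ceiling(log2((r - l) / tol))
predicted  # 21; this should agree with ret$iter above
```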
$x \in \mathbb{R}$ is a fixed point of function $F: \mathbb{R} \to \mathbb{R}$ if $F(x) = x$.
Fixed-point iteration
- If $F: \mathbb{R}^n \to \mathbb{R}^n$ maps $\mathbf{x} \mapsto \mathbf{R}\mathbf{x} + \mathbf{c}$ for matrix $\mathbf{R}=\mathbf{M}^{-1}\mathbf{K}$, where $\mathbf{M} - \mathbf{K} = \mathbf{A}$ and $\mathbf{c}=\mathbf{M}^{-1}\mathbf{b}$, then FPI reduces to the iterative method for solving the linear system $\mathbf{A}\mathbf{x}=\mathbf{b}$.
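To illustrate this reduction, here is a sketch on an assumed small system (not from the original notes): the Jacobi splitting $\mathbf{M} = \text{diag}(\mathbf{A})$, $\mathbf{K} = \mathbf{M} - \mathbf{A}$ turns solving $\mathbf{A}\mathbf{x} = \mathbf{b}$ into an FPI.

```r
# FPI for a linear system via the Jacobi splitting M = diag(A), K = M - A:
# iterate x <- R x + c with R = M^{-1} K and c = M^{-1} b.
A <- matrix(c(4, 1, 1,
              1, 3, 0,
              1, 0, 2), 3, 3, byrow = TRUE)  # diagonally dominant, so FPI converges
b <- c(6, 4, 3)
M <- diag(diag(A))
R <- solve(M, M - A)
cvec <- solve(M, b)
x <- rep(0, 3)
for (i in 1:100) x <- drop(R %*% x + cvec)
max(abs(A %*% x - b))  # residual is essentially zero
```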
Find a root of $f(x) = x^4 - \sin x$ on $(0, 1]$.
x <- 1.0
for (i in 1:10) {
x <- (sin(x))^0.25
print(x)
}
[1] 0.9577668
[1] 0.9509906
[1] 0.9498498
[1] 0.9496563
[1] 0.9496234
[1] 0.9496178
[1] 0.9496169
[1] 0.9496167
[1] 0.9496167
[1] 0.9496167
x <- 1.0
for (i in 1:10) {
x <- sin(x) / x^3
print(x)
}
[1] 0.841471
[1] 1.251418
[1] 0.4844576
[1] 4.096051
[1] -0.01187393
[1] 7092.524
[1] -2.604367e-12
[1] 1.474333e+23
[1] 1.308064e-70
[1] 5.844432e+139
Theorem. Suppose the function $F$ defined on a closed interval $I \subset \mathbb{R}$ satisfies
1. $F(x) \in I$ for all $x \in I$ ($F$ maps $I$ into itself);
2. $|F(x) - F(y)| \le L|x - y|$ for all $x, y \in I$ ($F$ is Lipschitz with modulus $L$).
Then, if $L \in [0, 1)$, $F$ has a unique fixed point $x^{\star} \in I$. Furthermore, the fixed-point iteration $x^{(k+1)} = F(x^{(k)})$ converges to $x^{\star}$ regardless of the initial point $x^{(0)} \in I$.
Proof. From condition 2, \begin{align*} |x^{(k+1)} - x^{(k)}| &= |F(x^{(k)}) - F(x^{(k-1)})| \\ &\le L|x^{(k)} - x^{(k-1)}| \\ &\vdots \\ &\le L^k|x^{(1)} - x^{(0)}|. \end{align*} Using the triangle inequality, for $m > n$ we have $$ |x^{(m)} - x^{(n)}| \le \sum_{k=n}^{m-1}|x^{(k+1)} - x^{(k)}| \le (L^n + \dotsb + L^{m-1})|x^{(1)} - x^{(0)}| \le \frac{L^n}{1 - L}|x^{(1)} - x^{(0)}|. \tag{*} $$
Thus the sequence $\{x^{(k)}\}$ is Cauchy and convergent. Since the interval $I$ is closed, $x^{\star} = \lim_k x^{(k)} \in I$.
Now by the (Lipschitz) continuity of $F$ and the definition $x^{(k+1)} = F(x^{(k)})$, we see that $x^{\star}$ is a fixed point of $F$.
If there exists another fixed point $y^{\star} \neq x^{\star}$, then $$ |x^{\star} - y^{\star}| = |F(x^{\star}) - F(y^{\star})| \le L|x^{\star} - y^{\star}|, $$ which is a contradiction since $L \in [0, 1)$.
Remark. By sending $m \to \infty$ in inequality (*), we have the bound $$ |x^{(n)} - x^{\star}| \le \frac{L^n}{1 - L}|x^{(1)} - x^{(0)}| $$ which provides the rate of convergence to $x^{\star}$. This geometric rate of convergence is called linear convergence.
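This linear (geometric) rate can be observed numerically. A small diagnostic (using the fixed point found earlier as an assumed reference value): the error ratios for $F_1(x) = (\sin x)^{1/4}$ settle near the constant $|F_1'(x^{\star})| \approx 0.17$.

```r
# Linear convergence check: successive errors |x^(k) - x*| shrink by a nearly
# constant factor, approximately |F1'(x*)| (about 0.17 here).
F1 <- function(x) sin(x)^0.25
xstar <- 0.9496167          # fixed point found in the iteration above
x <- 1.0
e <- abs(x - xstar)
for (i in 1:5) {
  x <- F1(x)
  e_new <- abs(x - xstar)
  cat("error ratio:", e_new / e, "\n")
  e <- e_new
}
```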
Recall our example $F_1(x) = (\sin x)^{1/4}$ and $F_2(x) = \frac{\sin x}{x^3}$.
The Lipschitz modulus of a differentiable function is $L = \sup_{x\in I} |F'(x)|$ (why? Hint: the mean value theorem).
$F_1$ maps $I=[0.2, 1]$ to a smaller interval. $F_1'(x) = \frac{\cos x}{4\sin^{3/4}(x)}$ is decreasing on $I$ and $1 > F_1'(0.2) > F_1'(1) > 0$.
$F_2$ expands $I=[0.9, 1]$ to $[0.841, 1.075]$ and also $F_2'(x) = \frac{x\cos x - 3\sin x}{x^4}$ is increasing on $I$ with $F_2'(1) < -1$.
(Homework)
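The claims about the two Lipschitz moduli can be spot-checked numerically (a rough grid-based sketch, not a proof):

```r
# Estimate L = sup |F'(x)| on a grid: F1 is a contraction on [0.2, 1] (L < 1),
# while F2 is not one on [0.9, 1] (L > 1).
F1prime <- function(x) cos(x) / (4 * sin(x)^0.75)
F2prime <- function(x) (x * cos(x) - 3 * sin(x)) / x^4
grid1 <- seq(0.2, 1, by = 0.001)
grid2 <- seq(0.9, 1, by = 0.001)
max(abs(F1prime(grid1)))  # < 1
max(abs(F2prime(grid2)))  # > 1
```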
Newton's method solves $f(x) = 0$ by iterating $$ x^{(k+1)} = x^{(k)} - \frac{f(x^{(k)})}{f'(x^{(k)})}. $$
Motivation: first-order Taylor expansion
$$ 0 = f(x^{\star}) = f(x^{(k)}) + f'(\tilde{x})(x^{\star} - x^{(k)}) $$ for some $\tilde{x}$ on the line segment connecting $x^{(k)}$ and $x^{\star}$.
Approximating $\tilde{x}$ by $x^{(k)}$ and $x^{\star}$ by the next iterate $x^{(k+1)}$ gives $$ 0 = f(x^{(k)}) + f'(x^{(k)})(x^{(k+1)} - x^{(k)}), $$ yielding the above iteration.
Fixed-point iteration: Newton's method is an FPI with $$ F(x) = x - \frac{f(x)}{f'(x)}, \quad \text{so that} \quad F'(x) = \frac{f(x)f''(x)}{[f'(x)]^2} $$
(assuming $f$ is twice differentiable).
If $f'(x^{\star})\neq 0$ and $|f''(x^\star)| < \infty$, then $F'(x^{\star})=0$. So there is a neighborhood of $x^{\star}$ in which $|F'(x)| < 1$.
With this assumption, apply a 2nd-order Taylor expansion of $F$ around $x^{\star}$: $$ x^{(k)} - x^{\star} = F(x^{(k-1)}) - F(x^{\star}) = F'(x^{\star})(x^{(k-1)} - x^{\star}) + \frac{1}{2}F''(\tilde{x})(x^{(k-1)} - x^{\star})^2 = \frac{1}{2}F''(\tilde{x})(x^{(k-1)} - x^{\star})^2 $$ for $\tilde{x}$ on the line segment connecting $x^{(k-1)}$ and $x^{\star}$, using $F'(x^{\star}) = 0$.
Thus Newton's method has a locally quadratic convergence property (under suitable conditions).
F <- function(x) { pt(x, 5) - 0.95 } # Student's t distribution with 5 df, 95%ile
Fprime <- function(x) { dt(x, 5) } # derivative of F (density function)
#x <- 1.291
x <- 2.582
options(digits=20)
for (i in 1:10) {
x <- x - (F(x) / Fprime(x))
print(x)
}
[1] 1.7338465618414033997
[1] 1.968962641494075072
[1] 2.0136622631340750367
[1] 2.0150470922847292243
[1] 2.0150483733319273227
[1] 2.015048373333023779
[1] 2.015048373333023779
[1] 2.015048373333023779
[1] 2.015048373333023779
[1] 2.015048373333023779
qt(0.95, 5)
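The quadratic rate in the run above can be checked directly: if convergence is quadratic, the ratio $|x^{(k+1)} - x^{\star}|/|x^{(k)} - x^{\star}|^2$ stays bounded (a small diagnostic added here, using `qt(0.95, 5)` as the reference root):

```r
# Quadratic convergence diagnostic for Newton's method: the number of correct
# digits roughly doubles each step, i.e., e_new / e_old^2 stays bounded.
F <- function(x) pt(x, 5) - 0.95
Fprime <- function(x) dt(x, 5)
xstar <- qt(0.95, 5)   # reference root
x <- 2.582
for (i in 1:4) {
  e_old <- abs(x - xstar)
  x <- x - F(x) / Fprime(x)
  cat("e_new / e_old^2 =", abs(x - xstar) / e_old^2, "\n")
}
```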
Goal: to solve \begin{align*} f_1(x_1, \dotsc, x_n) &= 0 \\ \vdots & ~ \\ f_n(x_1, \dotsc, x_n) &=0. \end{align*}
Same idea as univariate version: linearize $f$ around $\mathbf{x}=(x_1, \dotsc, x_n)$.
Iteration: solve the linear system $$ Jf(\mathbf{x}^{(k)})\,\Delta\mathbf{x} = -f(\mathbf{x}^{(k)}), $$
where $$ Jf(\mathbf{x}) = \begin{bmatrix} \frac{\partial f_1(\mathbf{x})}{\partial x_1} & \dotsc & \frac{\partial f_1(\mathbf{x})}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_n(\mathbf{x})}{\partial x_1} & \dotsc & \frac{\partial f_n(\mathbf{x})}{\partial x_n} \end{bmatrix} $$
to get $\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} + \Delta\mathbf{x}$, using LU decomposition, etc., at each iteration.
Modified Newton: replace $Jf(\mathbf{x}^{(k)})$ with $J = Jf(\mathbf{x}^{(0)})$.
Jacobi's method: replace $Jf(\mathbf{x}^{(k)})$ with $\text{diag}(Jf(\mathbf{x}^{(k)}))$.
Gauss-Seidel: replace $Jf(\mathbf{x}^{(k)})$ with $\text{lowertri}(Jf(\mathbf{x}^{(k)}))$.
SOR: replace $Jf(\mathbf{x}^{(k)})$ with $\frac{1}{\omega}\text{diag}(Jf(\mathbf{x}^{(k)})) + \text{strictlowertri}(Jf(\mathbf{x}^{(k)}))$.
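A minimal sketch of the multivariate iteration on an assumed toy system, solving $x_1^2 + x_2^2 = 1$, $x_1 = x_2$, whose positive root is $x_1 = x_2 = 1/\sqrt{2}$:

```r
# Multivariate Newton: at each step solve Jf(x) dx = -f(x), then update x <- x + dx.
f <- function(v) c(v[1]^2 + v[2]^2 - 1,   # f1(x1, x2) = x1^2 + x2^2 - 1
                   v[1] - v[2])           # f2(x1, x2) = x1 - x2
Jf <- function(v) matrix(c(2 * v[1], 2 * v[2],
                           1,        -1), 2, 2, byrow = TRUE)
x <- c(1, 0.5)
for (k in 1:10) {
  dx <- solve(Jf(x), -f(x))  # linear solve (LU) per iteration
  x <- x + dx
}
x  # converges to c(1, 1) / sqrt(2)
```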