Consider maximizing the log-likelihood function $L(\theta)$, $\theta \in \Theta \subset \mathbb{R}^p$.
Gradient (or score) of $L$: $$ \nabla L(\theta) = \begin{pmatrix} \frac{\partial L(\theta)}{\partial \theta_1} \\ \vdots \\ \frac{\partial L(\theta)}{\partial \theta_p} \end{pmatrix} $$
Hessian of $L$:
$$
\nabla^2 L(\theta) = \begin{pmatrix}
\frac{\partial^2 L(\theta)}{\partial \theta_1 \partial \theta_1} & \dots & \frac{\partial^2 L(\theta)}{\partial \theta_1 \partial \theta_p} \\
\vdots & \ddots & \vdots \\
\frac{\partial^2 L(\theta)}{\partial \theta_p \partial \theta_1} & \dots & \frac{\partial^2 L(\theta)}{\partial \theta_p \partial \theta_p}
\end{pmatrix}
$$
Observed information matrix (negative Hessian): $-\nabla^2 L(\theta)$.

Newton's method was originally developed for finding roots of nonlinear equations $f(\mathbf{x}) = \mathbf{0}$ (KL 5.4).
Idea: iterative quadratic approximation.
Second-order Taylor expansion of $L$ around the current iterate $\theta^{(t)}$:
$$
L(\theta) \approx L(\theta^{(t)}) + \nabla L(\theta^{(t)})^T (\theta - \theta^{(t)}) + \frac{1}{2} (\theta - \theta^{(t)})^T [\nabla^2 L(\theta^{(t)})] (\theta - \theta^{(t)}).
$$
Maximizing this quadratic approximation (setting its gradient to zero) suggests the next iterate
$$
\begin{eqnarray*}
\theta^{(t+1)} &=& \theta^{(t)} - [\nabla^2L(\theta^{(t)})]^{-1} \nabla L(\theta^{(t)}) \\
&=& \theta^{(t)} + [-\nabla^2L(\theta^{(t)})]^{-1} \nabla L(\theta^{(t)}).
\end{eqnarray*}
$$
We call this naive Newton's method.
```julia
using Plots; gr()
using LaTeXStrings, ForwardDiff

f(x) = sin(x)                               # objective to maximize
df  = x -> ForwardDiff.derivative(f, x)     # first derivative (score)
d2f = x -> ForwardDiff.derivative(df, x)    # second derivative (Hessian)

x = 2.0  # start point: 2.0 (local maximum), 2.75 (diverge), 4.0 (local minimum)
titletext = "Starting point: $x"
anim = @animate for iter in 0:10
    # naive Newton update: x ← x - f''(x)⁻¹ f'(x)
    iter > 0 && (global x = x - d2f(x) \ df(x))
    p = Plots.plot(f, 0, 2π, xlim=(0, 2π), ylim=(-1.1, 1.1), legend=nothing, title=titletext)
    Plots.plot!(p, [x], [f(x)], shape=:circle)
    Plots.annotate!(p, x, f(x), text(latexstring("x^{($iter)}"), :right))
end
gif(anim, "./tmp.gif", fps = 1);
```
As the animation shows, naive Newton's method may head to a local minimum or diverge, depending on the starting point. Remedies for the instability issue:
* approximate $-\nabla^2L(\theta^{(t)})$ by a positive definite $\mathbf{A}^{(t)}$ (if it is not), and
* line search (backtracking) over the step length.

This leads to the Newton-type iteration
$$
\theta^{(t+1)} = \theta^{(t)} + s [\mathbf{A}^{(t)}]^{-1} \nabla L(\theta^{(t)}),
$$
where $\mathbf{A}^{(t)}$ is a pd approximation of $-\nabla^2L(\theta^{(t)})$ and $s$ is a step length.

Why insist on a positive definite approximation of the Hessian? By first-order Taylor expansion,
$$
L(\theta^{(t)} + s \Delta \theta^{(t)}) - L(\theta^{(t)}) = s \nabla L(\theta^{(t)})^T \Delta \theta^{(t)} + o(s) = s \nabla L(\theta^{(t)})^T [\mathbf{A}^{(t)}]^{-1} \nabla L(\theta^{(t)}) + o(s),
$$
where $\Delta \theta^{(t)} = [\mathbf{A}^{(t)}]^{-1} \nabla L(\theta^{(t)})$ is the search direction. For $s$ sufficiently small, the right hand side is strictly positive when $\mathbf{A}^{(t)}$ is positive definite, so each step increases the objective. The quantity $\{\nabla L(\theta^{(t)})^T [\mathbf{A}^{(t)}]^{-1} \nabla L(\theta^{(t)})\}^{1/2}$ is termed the Newton decrement.
For strictly concave $L$, $-\nabla^2L(\theta^{(t)})$ is always positive definite. Line search is still needed to guarantee convergence.
Line search strategies: step-halving ($s=1,1/2,\ldots$), golden section search, cubic interpolation, Armijo rule, ... Note the Newton direction only needs to be calculated once; the cost of line search lies mainly in objective function evaluations.
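As an illustration, here is a minimal sketch of a Newton-type ascent with step-halving, using ForwardDiff for the score and Hessian. The function name `newton_ascent` and the toy objective are illustrative, and $-\nabla^2 L$ is assumed to stay positive definite along the path.

```julia
using ForwardDiff, LinearAlgebra

# Minimal sketch: Newton ascent with step-halving line search.
# `logl` is the log-likelihood; θ0 the starting value.
function newton_ascent(logl, θ0; maxiter = 100, tol = 1e-8)
    θ = copy(θ0)
    for iter in 1:maxiter
        g = ForwardDiff.gradient(logl, θ)      # score ∇L(θ)
        H = ForwardDiff.hessian(logl, θ)       # Hessian ∇²L(θ)
        A = Symmetric(-H)                      # pd approximation of -∇²L (assumed pd here)
        Δ = A \ g                              # Newton direction, computed once per iteration
        s = 1.0
        while logl(θ + s * Δ) < logl(θ) && s > 1e-10
            s /= 2                             # step-halving until the objective increases
        end
        θnew = θ + s * Δ
        norm(θnew - θ) < tol && return θnew
        θ = θnew
    end
    return θ
end

# usage: maximize a toy concave log-likelihood L(θ) = -‖θ - 1‖² / 2
newton_ascent(θ -> -sum(abs2, θ .- 1) / 2, zeros(3))
```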
How to approximate $-\nabla^2L(\theta)$? It is more of an art than a science, and often requires problem-specific analysis.
Taking $\mathbf{A} = \mathbf{I}$ leads to the method of steepest ascent, aka gradient ascent.
Fisher's scoring method replaces $-\nabla^2L(\theta)$ by the expected (Fisher) information matrix
$$
\mathbf{FIM}(\theta) = \mathbf{E}[-\nabla^2L(\theta)] = \mathbf{E}[\nabla L(\theta) \nabla L(\theta)^T],
$$
which is psd under exchangeability of expectation and differentiation.

Therefore the Fisher scoring algorithm iterates according to
$$
\boxed{ \theta^{(t+1)} = \theta^{(t)} + s [\mathbf{FIM}(\theta^{(t)})]^{-1} \nabla L(\theta^{(t)})}.
$$
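The boxed iteration translates directly into a generic loop. Below is a minimal sketch assuming the user supplies the score and expected information as functions; the names `fisher_scoring`, `score`, and `fim` are illustrative.

```julia
using LinearAlgebra

# Minimal sketch of Fisher scoring with a fixed step length s.
# `score(θ)` returns ∇L(θ); `fim(θ)` returns the expected information FIM(θ).
function fisher_scoring(score, fim, θ0; s = 1.0, maxiter = 100, tol = 1e-8)
    θ = copy(θ0)
    for iter in 1:maxiter
        g = score(θ)
        Δ = Symmetric(fim(θ)) \ g      # solve FIM(θ) Δ = ∇L(θ)
        θ += s * Δ
        norm(s * Δ) < tol && break     # stop when the update is negligible
    end
    return θ
end
```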
Let's consider a concrete example: logistic regression.
The goal is to predict whether a credit card transaction is fraudulent ($y_i=1$) or not ($y_i=0$). Predictors ($\mathbf{x}_i$) include: time of transaction, last location, merchant, ...

Data: $y_i \in \{0,1\}$, $\mathbf{x}_i \in \mathbb{R}^{p}$. Model: $y_i \sim \text{Bernoulli}(p_i)$.
Logistic regression. The density of $y_i$ is
$$
f(y_i \mid p_i) = p_i^{y_i} (1-p_i)^{1-y_i},
$$
where
$$
\begin{eqnarray*}
\mathbf{E} (y_i) = p_i &=& \frac{e^{\eta_i}}{1+ e^{\eta_i}} \quad \text{(mean function, inverse link function)} \\
\eta_i = \mathbf{x}_i^T \beta &=& \ln \left( \frac{p_i}{1-p_i} \right) \quad \text{(logit link function)}.
\end{eqnarray*}
$$
The score and observed information are
$$
\begin{eqnarray*}
\nabla L(\beta) &=& \mathbf{X}^T (\mathbf{y} - \mathbf{p}) \\
-\nabla^2 L(\beta) &=& \mathbf{X}^T \mathbf{W} \mathbf{X}, \quad \mathbf{W} = \text{diag}(w_1, \ldots, w_n), \quad w_i = p_i (1 - p_i),
\end{eqnarray*}
$$
and the observed information coincides with the expected information, so Newton's method and Fisher scoring are identical here. The iteration
$$
\beta^{(t+1)} = \beta^{(t)} + s [\mathbf{X}^T \mathbf{W}^{(t)} \mathbf{X}]^{-1} \mathbf{X}^T (\mathbf{y} - \mathbf{p}^{(t)}) = [\mathbf{X}^T \mathbf{W}^{(t)} \mathbf{X}]^{-1} \mathbf{X}^T \mathbf{W}^{(t)} \mathbf{z}^{(t)},
$$
where
$$
\mathbf{z}^{(t)} = \mathbf{X} \beta^{(t)} + s(\mathbf{W}^{(t)})^{-1} (\mathbf{y} - \mathbf{p}^{(t)})
$$
are the working responses, shows that a Newton iteration is equivalent to solving a weighted least squares problem $\min_\beta \sum_{i=1}^n w_i (z_i - \mathbf{x}_i^T \beta)^2$. Thus the name IRWLS (iteratively re-weighted least squares).
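A minimal IRWLS sketch for logistic regression on simulated data follows; the function name `logistic_irwls`, the simulated design, and the fixed step length $s = 1$ are illustrative choices.

```julia
using LinearAlgebra, Random

# Minimal sketch: IRWLS (Newton = Fisher scoring) for logistic regression, step length s = 1.
function logistic_irwls(X, y; maxiter = 25, tol = 1e-8)
    n, p = size(X)
    β = zeros(p)
    for iter in 1:maxiter
        η = X * β
        μ = 1 ./ (1 .+ exp.(-η))                  # p_i = expit(η_i)
        w = μ .* (1 .- μ)                          # weights w_i = p_i (1 - p_i)
        z = η + (y - μ) ./ w                       # working responses z^(t)
        βnew = (X' * (w .* X)) \ (X' * (w .* z))   # weighted least squares solve
        norm(βnew - β) < tol && return βnew
        β = βnew
    end
    return β
end

# usage on simulated data
Random.seed!(123)
n = 500
X = [ones(n) randn(n, 2)]
βtrue = [0.5, 1.0, -1.0]
y = Float64.(rand(n) .< 1 ./ (1 .+ exp.(-X * βtrue)))
logistic_irwls(X, y)
```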
Let's consider the more general class of generalized linear models (GLM).
Family | Canonical Link | Variance Function |
---|---|---|
Normal | $\eta=\mu$ | 1 |
Poisson | $\eta=\log \mu$ | $\mu$ |
Binomial | $\eta=\log \left( \frac{\mu}{1 - \mu} \right)$ | $\mu (1 - \mu)$ |
Gamma | $\eta = \mu^{-1}$ | $\mu^2$ |
Inverse Gaussian | $\eta = \mu^{-2}$ | $\mu^3$ |
In a GLM, the response $y$ has a density in the exponential family form
$$
f(y \mid \theta, \phi) = \exp \left\{ \frac{y \theta - b(\theta)}{a(\phi)} + c(y, \phi) \right\},
$$
where
* $\theta$: natural parameter.
* $\phi>0$: dispersion parameter.
GLM relates the mean $\mu = \mathbf{E}(Y|\mathbf{x})$ via a strictly increasing link function $$ g(\mu) = \eta = \mathbf{x}^T \beta, \quad \mu = g^{-1}(\eta) $$
Fisher scoring for a GLM again takes the form of IRWLS, with weights $w_i = [\mu_i'(\eta_i)]^2/\sigma_i^2$, where $\sigma_i^2 = \operatorname{Var}(y_i)$, and corresponding working responses $z_i$.
Example: Probit regression (binary response with probit link). $$ \begin{eqnarray*} y_i &\sim& \text{Bernoulli}(p_i) \\ p_i &=& \Phi(\mathbf{x}_i^T \beta) \\ \eta_i &=& \mathbf{x}_i^T \beta = \Phi^{-1}(p_i). \end{eqnarray*} $$ where $\Phi(\cdot)$ is the cdf of a standard normal.
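As a sketch, Fisher scoring for probit regression can be coded directly from the standard score $\nabla L(\beta) = \mathbf{X}^T \text{diag}\big(\phi(\eta_i)/[p_i(1-p_i)]\big)(\mathbf{y} - \mathbf{p})$ and expected information $\mathbf{X}^T \text{diag}\big(\phi(\eta_i)^2/[p_i(1-p_i)]\big) \mathbf{X}$; the function name `probit_scoring` and the fixed step length $s = 1$ are illustrative choices.

```julia
using Distributions, LinearAlgebra

# Minimal sketch: Fisher scoring for probit regression (step length s = 1).
function probit_scoring(X, y; maxiter = 25, tol = 1e-8)
    n, p = size(X)
    β = zeros(p)
    for iter in 1:maxiter
        η = X * β
        prob = cdf.(Normal(), η)             # p_i = Φ(η_i)
        dens = pdf.(Normal(), η)             # ϕ(η_i), standard normal density
        v = prob .* (1 .- prob)              # Var(y_i) = p_i (1 - p_i)
        g = X' * (dens .* (y - prob) ./ v)   # score ∇L(β)
        FIM = X' * ((dens .^ 2 ./ v) .* X)   # expected information
        Δ = Symmetric(FIM) \ g               # Fisher scoring direction
        β += Δ
        norm(Δ) < tol && break
    end
    return β
end
```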
A classic application of Newton-type methods is nonlinear least squares (nonlinear regression). Historical example: Gauss relocated Ceres by fitting 41 observations to a 6-parameter (nonlinear) orbit.

Another example: $y_i =$ dry weight of an onion and $x_i =$ growth time, and we want to fit a 3-parameter growth curve $$ \mu(x, \beta_1,\beta_2,\beta_3) = \frac{\beta_3}{1 + e^{-\beta_1 - \beta_2 x}}. $$
We minimize the least squares criterion, equivalently maximize $L(\beta) = -\frac{1}{2} \sum_{i=1}^n [y_i - \mu_i(\beta)]^2$, where $\mu_i(\beta) = \mu(x_i, \beta)$. The Gauss-Newton method approximates $-\nabla^2L(\beta)$ by $\mathbf{J}(\beta)^T \mathbf{J}(\beta)$ (dropping the terms involving $\nabla^2 \mu_i$) and iterates according to
$$
\beta^{(t+1)} = \beta^{(t)} + s [\mathbf{J}(\beta^{(t)})^T \mathbf{J}(\beta^{(t)})]^{-1} \mathbf{J}(\beta^{(t)})^T [\mathbf{y} - \mu(\beta^{(t)})],
$$
where $\mathbf{J}(\beta)^T = [\nabla \mu_1(\beta), \ldots, \nabla \mu_n(\beta)] \in \mathbb{R}^{p \times n}$.
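Here is a minimal Gauss-Newton sketch for the 3-parameter growth curve above, using ForwardDiff to form the Jacobian; the function names, the simulated data, and the fixed step length $s = 1$ (no line search or safeguards) are illustrative choices.

```julia
using ForwardDiff, LinearAlgebra

# growth curve μ(x, β) = β₃ / (1 + exp(-β₁ - β₂ x))
growth(x, β) = β[3] / (1 + exp(-β[1] - β[2] * x))

# Minimal Gauss-Newton sketch: minimize ½ Σᵢ [yᵢ - μᵢ(β)]², step length s = 1.
function gauss_newton(x, y, β0; maxiter = 50, tol = 1e-8)
    β = copy(β0)
    μ = b -> growth.(x, Ref(b))            # vector-valued mean function μ(β)
    for iter in 1:maxiter
        r = y - μ(β)                       # residuals
        J = ForwardDiff.jacobian(μ, β)     # n × p Jacobian with rows ∇μᵢ(β)ᵀ
        Δ = (J' * J) \ (J' * r)            # Gauss-Newton direction
        β += Δ
        norm(Δ) < tol && break
    end
    return β
end

# usage on simulated growth data
x = collect(1.0:20.0)
βtrue = [-4.0, 0.5, 10.0]
y = growth.(x, Ref(βtrue)) + 0.1 * randn(length(x))
gauss_newton(x, y, [-3.0, 0.4, 9.0])
```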
The Levenberg-Marquardt method (also known as damped least squares) adds a ridge term to the approximate Hessian, $\mathbf{A}^{(t)} = \mathbf{J}(\beta^{(t)})^T \mathbf{J}(\beta^{(t)}) + \tau \mathbf{I}_p$ with $\tau > 0$, bridging between Gauss-Newton ($\tau \to 0$) and steepest descent (large $\tau$).
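A sketch of the corresponding damped direction (the helper name `lm_direction` and the choice of $\tau$ are illustrative):

```julia
using LinearAlgebra

# Levenberg-Marquardt style damped direction: τ → 0 recovers the Gauss-Newton
# direction, while a large τ yields a (scaled) steepest-ascent step Jᵀr / τ.
lm_direction(J, r, τ) = (J' * J + τ * I) \ (J' * r)
```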
See KL 14.4 for examples.