Consider maximizing the log-likelihood function $L(\theta)$, $\theta \in \Theta \subset \mathbb{R}^p$.
Gradient (or score) of $L$: $$ \nabla L(\theta) = \begin{pmatrix} \frac{\partial L(\theta)}{\partial \theta_1} \\ \vdots \\ \frac{\partial L(\theta)}{\partial \theta_p} \end{pmatrix} $$
Differential of $L$: $$ dL(\theta) = [\nabla L(\theta)]^T $$
Hessian of $L$:
$$
\nabla^2 L(\theta) = d^2 L(\theta) = \begin{pmatrix}
\frac{\partial^2 L(\theta)}{\partial \theta_1 \partial \theta_1} & \dots & \frac{\partial^2 L(\theta)}{\partial \theta_1 \partial \theta_p} \\
\vdots & \ddots & \vdots \\
\frac{\partial^2 L(\theta)}{\partial \theta_p \partial \theta_1} & \dots & \frac{\partial^2 L(\theta)}{\partial \theta_p \partial \theta_p}
\end{pmatrix}
$$
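These derivatives are the ingredients of all the algorithms below. As a practical aside, a finite-difference check is a quick way to validate hand-coded gradients and Hessians; a minimal Python/NumPy sketch (the callback `L` is a hypothetical log-likelihood function):

```python
import numpy as np

def numerical_gradient(L, theta, h=1e-6):
    """Central-difference approximation of the gradient of L at theta."""
    theta = np.asarray(theta, dtype=float)
    g = np.zeros_like(theta)
    for j in range(theta.size):
        e = np.zeros_like(theta)
        e[j] = h
        g[j] = (L(theta + e) - L(theta - e)) / (2 * h)
    return g

def numerical_hessian(L, theta, h=1e-4):
    """Central-difference approximation of the Hessian of L at theta."""
    theta = np.asarray(theta, dtype=float)
    p = theta.size
    H = np.zeros((p, p))
    for j in range(p):
        e = np.zeros(p)
        e[j] = h
        H[:, j] = (numerical_gradient(L, theta + e) -
                   numerical_gradient(L, theta - e)) / (2 * h)
    return (H + H.T) / 2   # symmetrize to remove round-off asymmetry
```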
Observed information matrix (negative Hessian): $-d^2 L(\theta) = -\nabla^2 L(\theta)$.
Newton's method was originally developed for finding roots of nonlinear equations $f(\mathbf{x}) = \mathbf{0}$ (KL 5.4).
Idea: iterative quadratic approximation. A second-order Taylor expansion around the current iterate $\theta^{(t)}$ gives $$ L(\theta) \approx L(\theta^{(t)}) + dL(\theta^{(t)}) (\theta - \theta^{(t)}) + \frac{1}{2} (\theta - \theta^{(t)})^T d^2 L(\theta^{(t)}) (\theta - \theta^{(t)}), $$ and we then maximize this quadratic approximation,
which suggests the next iterate $$ \begin{eqnarray*} \theta^{(t+1)} &=& \theta^{(t)} - [d^2L(\theta^{(t)})]^{-1} \nabla L(\theta^{(t)}) \\ &=& \theta^{(t)} + [-d^2L(\theta^{(t)})]^{-1} \nabla L(\theta^{(t)}). \end{eqnarray*} $$ We call this naive Newton's method.
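A minimal sketch of this naive Newton iteration in Python/NumPy; `grad_L` and `hess_L` are hypothetical callbacks returning $\nabla L(\theta)$ and $d^2 L(\theta)$.

```python
import numpy as np

def naive_newton(grad_L, hess_L, theta0, maxiter=100, tol=1e-8):
    """Naive Newton: theta <- theta - [d^2 L(theta)]^{-1} grad L(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(maxiter):
        g = grad_L(theta)                       # score vector grad L(theta)
        if np.linalg.norm(g) < tol:             # stop at a (near-)stationary point
            break
        H = hess_L(theta)                       # Hessian d^2 L(theta)
        theta = theta - np.linalg.solve(H, g)   # Newton update
    return theta
```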
In practice we iterate according to $$ \theta^{(t+1)} = \theta^{(t)} + s [\mathbf{A}^{(t)}]^{-1} \nabla L(\theta^{(t)}), $$ where $\mathbf{A}^{(t)}$ is a positive definite (pd) approximation of $-d^2L(\theta^{(t)})$ and $s$ is a step length.
Why insist on a positive definite approximation of the Hessian? By a first-order Taylor expansion, $$ \begin{eqnarray*} & & L(\theta^{(t)} + s \Delta \theta^{(t)}) - L(\theta^{(t)}) \\ &=& dL(\theta^{(t)}) s \Delta \theta^{(t)} + o(s) \\ &=& s \, dL(\theta^{(t)}) [\mathbf{A}^{(t)}]^{-1} \nabla L(\theta^{(t)}) + o(s), \end{eqnarray*} $$ where $\Delta \theta^{(t)} = [\mathbf{A}^{(t)}]^{-1} \nabla L(\theta^{(t)})$ is the search direction. For $s$ sufficiently small, the right-hand side is strictly positive when $\mathbf{A}^{(t)}$ is positive definite, so the objective is guaranteed to increase.
For strictly concave $L$, $-d^2L(\theta^{(t)})$ is always positive definite. Line search is still needed to guarantee convergence.
Line search strategies: step-halving ($s=1,1/2,\ldots$), golden section search, cubic interpolation, Armijo rule, ... Note that the Newton direction
only needs to be calculated once; the cost of line search mainly lies in objective function evaluations.
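A minimal sketch of the step-halving strategy, assuming an ascent direction $[\mathbf{A}^{(t)}]^{-1} \nabla L(\theta^{(t)})$ with $\mathbf{A}^{(t)}$ positive definite; `L`, `grad`, and `A` are hypothetical inputs.

```python
import numpy as np

def step_halving(L, theta, grad, A, max_halvings=30):
    """Backtrack s = 1, 1/2, 1/4, ... until the objective L increases.

    The direction solve(A, grad) is computed once; each halving only
    costs one additional evaluation of L.
    """
    direction = np.linalg.solve(A, grad)   # [A^{(t)}]^{-1} grad L(theta^{(t)})
    L0 = L(theta)
    s = 1.0
    for _ in range(max_halvings):
        candidate = theta + s * direction
        if L(candidate) > L0:              # accept the first step that increases L
            return candidate
        s /= 2.0
    return theta                           # no increase found; keep current iterate
```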
How to approximate $-d^2L(\theta)$? It is more of an art than a science, and often requires problem-specific analysis.
Taking $\mathbf{A} = \mathbf{I}$ leads to the method of steepest ascent, aka gradient ascent.
Fisher's scoring method replaces $-d^2L(\theta)$ by the expected (Fisher) information matrix $$ \mathbf{I}(\theta) = \mathbf{E}[-d^2L(\theta)] = \mathbf{E}[\nabla L(\theta) \, dL(\theta)] \succeq \mathbf{0}, $$ which is psd under exchangeability of expectation and differentiation.
Therefore the Fisher scoring algorithm iterates according to
$$
\boxed{ \theta^{(t+1)} = \theta^{(t)} + s [\mathbf{I}(\theta^{(t)})]^{-1} \nabla L(\theta^{(t)})}.
$$
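As a concrete toy illustration (with hypothetical data `y`): for $n$ i.i.d. Poisson($\theta$) observations, $\nabla L(\theta) = \sum_i y_i/\theta - n$ and $\mathbf{I}(\theta) = n/\theta$, so the boxed iteration becomes a one-line update. A minimal Python sketch:

```python
import numpy as np

def fisher_scoring_poisson(y, theta0=1.0, s=1.0, maxiter=50, tol=1e-10):
    """Fisher scoring for i.i.d. Poisson(theta) data.

    Score:       grad L(theta) = sum(y)/theta - n
    Fisher info: I(theta)      = n/theta
    Update:      theta <- theta + s * I(theta)^{-1} * grad L(theta)
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    theta = float(theta0)
    for _ in range(maxiter):
        score = y.sum() / theta - n
        info = n / theta
        theta_new = theta + s * score / info   # = theta + s * (ybar - theta)
        if abs(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta

# With s = 1 the iteration lands on the MLE (the sample mean) in one step.
```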
Let's consider a concrete example: logistic regression.
The goal is to predict whether a credit card transaction is fraudulent ($y_i=1$) or not ($y_i=0$). Predictors ($\mathbf{x}_i$) include: time of transaction, last location, merchant, ...
$y_i \in \{0,1\}$, $\mathbf{x}_i \in \mathbb{R}^{p}$. Model $y_i \sim $Bernoulli$(p_i)$.
Logistic regression. Density of each observation: $$ f(y_i \mid p_i) = p_i^{y_i} (1 - p_i)^{1 - y_i}, $$
where $$ \begin{eqnarray*} \mathbf{E} (y_i) = p_i &=& \frac{e^{\mathbf{x}_i^T \beta}}{1+ e^{\mathbf{x}_i^T \beta}} \quad \text{(mean function, inverse link function)} \\ \mathbf{x}_i^T \beta &=& \ln \left( \frac{p_i}{1-p_i} \right) \quad \text{(logit link function)}. \end{eqnarray*} $$
The score and negative Hessian are $$ \begin{eqnarray*} \nabla L(\beta) &=& \sum_{i=1}^n (y_i - p_i) \mathbf{x}_i = \mathbf{X}^T (\mathbf{y} - \mathbf{p}) \\ - d^2 L(\beta) &=& \sum_{i=1}^n p_i (1 - p_i) \mathbf{x}_i \mathbf{x}_i^T = \mathbf{X}^T \mathbf{W} \mathbf{X}, \quad \mathbf{W} = \text{diag}(p_1(1-p_1), \ldots, p_n(1-p_n)), \end{eqnarray*} $$ so a Newton iteration (here identical to Fisher scoring, since the Hessian does not depend on $\mathbf{y}$) takes the form $$ \begin{eqnarray*} \beta^{(t+1)} &=& \beta^{(t)} + s [\mathbf{X}^T \mathbf{W}^{(t)} \mathbf{X}]^{-1} \mathbf{X}^T (\mathbf{y} - \mathbf{p}^{(t)}) \\ &=& [\mathbf{X}^T \mathbf{W}^{(t)} \mathbf{X}]^{-1} \mathbf{X}^T \mathbf{W}^{(t)} \mathbf{z}^{(t)}, \end{eqnarray*} $$ where $$ \mathbf{z}^{(t)} = \mathbf{X} \beta^{(t)} + s(\mathbf{W}^{(t)})^{-1} (\mathbf{y} - \mathbf{p}^{(t)}) $$ are the working responses. Each Newton iteration is thus equivalent to solving a weighted least squares problem $\min_\beta \sum_{i=1}^n w_i^{(t)} (z_i^{(t)} - \mathbf{x}_i^T \beta)^2$, hence the name IRWLS (iteratively re-weighted least squares).
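A minimal IRWLS sketch in Python/NumPy based on these formulas, with the step length fixed at $s = 1$; the design matrix `X` and response `y` are assumed given.

```python
import numpy as np

def logistic_irwls(X, y, maxiter=25, tol=1e-8):
    """Fit logistic regression by IRWLS (Newton's method with s = 1)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(maxiter):
        eta = X @ beta                     # linear predictor X beta
        prob = 1.0 / (1.0 + np.exp(-eta))  # p_i = logistic(x_i' beta)
        w = prob * (1.0 - prob)            # IRWLS weights p_i (1 - p_i)
        z = eta + (y - prob) / w           # working responses
        # Weighted least squares: minimize sum_i w_i (z_i - x_i' beta)^2
        WX = X * w[:, None]
        beta_new = np.linalg.solve(X.T @ WX, WX.T @ z)
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta = beta_new
    return beta
```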
Let's consider the more general class of generalized linear models (GLM).
Family | Canonical Link | Variance Function |
---|---|---|
Normal | $\eta=\mu$ | 1 |
Poisson | $\eta=\log \mu$ | $\mu$ |
Binomial | $\eta=\log \left( \frac{\mu}{1 - \mu} \right)$ | $\mu (1 - \mu)$ |
Gamma | $\eta = \mu^{-1}$ | $\mu^2$ |
Inverse Gaussian | $\eta = \mu^{-2}$ | $\mu^3$ |
In a GLM, the response $y$ is assumed to come from an exponential family with density $$ f(y \mid \theta, \phi) = \exp \left[ \frac{y \theta - b(\theta)}{a(\phi)} + c(y, \phi) \right], $$ where
* $\theta$: natural parameter.
* $\phi>0$: dispersion parameter.
A GLM relates the mean $\mu = \mathbf{E}(Y \mid \mathbf{x})$ to the linear predictor through a strictly increasing link function $$ g(\mu) = \eta = \mathbf{x}^T \beta, \quad \mu = g^{-1}(\eta). $$
Fisher scoring for a GLM again takes the form of IRWLS, with weights $w_i = [\mu_i'(\eta_i)]^2/\sigma_i^2$ (where $\sigma_i^2 = \text{Var}(y_i)$) and working responses $z_i = \eta_i + (y_i - \mu_i)/\mu_i'(\eta_i)$.
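A generic sketch of this update in Python/NumPy; `mu`, `dmu_deta`, and `var` are hypothetical callbacks supplying the inverse link $g^{-1}$, its derivative $\mu'(\eta)$, and the variance function.

```python
import numpy as np

def glm_irwls(X, y, mu, dmu_deta, var, maxiter=25, tol=1e-8):
    """Fisher scoring for a GLM, written as IRWLS with step length s = 1."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(maxiter):
        eta = X @ beta
        m = mu(eta)                    # fitted means mu_i = g^{-1}(eta_i)
        d = dmu_deta(eta)              # derivatives mu_i'(eta_i)
        w = d ** 2 / var(m)            # weights [mu_i'(eta_i)]^2 / sigma_i^2
        z = eta + (y - m) / d          # working responses
        WX = X * w[:, None]
        beta_new = np.linalg.solve(X.T @ WX, WX.T @ z)
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta = beta_new
    return beta

# Poisson regression with canonical log link: mu = exp(eta), mu' = exp(eta), var = mu.
# beta_hat = glm_irwls(X, y, mu=np.exp, dmu_deta=np.exp, var=lambda m: m)
```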
Example: Probit regression (binary response with probit link). $$ \begin{eqnarray*} y_i &\sim& \text{Bernoulli}(p_i) \\ p_i &=& \Phi(\mathbf{x}_i^T \beta) \\ \eta_i &=& \mathbf{x}_i^T \beta = \Phi^{-1}(p_i). \end{eqnarray*} $$ where $\Phi(\cdot)$ is the cdf of a standard normal.
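Probit regression fits the same IRWLS scheme: $\mu_i'(\eta_i) = \phi(\eta_i)$ (the standard normal pdf) and $\sigma_i^2 = p_i(1 - p_i)$, so the weights are $w_i = \phi(\eta_i)^2 / [\Phi(\eta_i)(1 - \Phi(\eta_i))]$. Plugging these into the hypothetical `glm_irwls` sketch above:

```python
from scipy.stats import norm

# Probit plug-ins for the glm_irwls sketch above.
def probit_var(m):
    return m * (1.0 - m)   # sigma_i^2 = mu_i (1 - mu_i)

# beta_hat = glm_irwls(X, y, mu=norm.cdf, dmu_deta=norm.pdf, var=probit_var)
```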
Nonlinear regression: the problem that motivated Gauss was to relocate the asteroid Ceres by fitting 41 observations to a 6-parameter (nonlinear) orbit.
For example, $y_i =$ dry weight of onion and $x_i=$ growth time, and we want to fit a 3-parameter growth curve $$ \mu(x, \beta_1,\beta_2,\beta_3) = \frac{\beta_3}{1 + e^{-\beta_1 - \beta_2 x}}. $$
In general, nonlinear least squares minimizes $$ f(\beta) = \frac{1}{2} \sum_{i=1}^n [y_i - \mu_i(\beta)]^2, $$ where $\mu_i(\beta) = \mu(x_i, \beta)$ is the (nonlinear) mean of the $i$-th observation. The Gauss-Newton method drops the second-derivative terms in the Hessian of $f$ and iterates according to $$ \beta^{(t+1)} = \beta^{(t)} + s [\mathbf{J}(\beta^{(t)})^T \mathbf{J}(\beta^{(t)})]^{-1} \mathbf{J}(\beta^{(t)})^T [\mathbf{y} - \mu(\beta^{(t)})], $$ where $\mathbf{J}(\beta)^T = [\nabla \mu_1(\beta), \ldots, \nabla \mu_n(\beta)] \in \mathbb{R}^{p \times n}$ is the transposed Jacobian of the mean vector $\mu(\beta) = (\mu_1(\beta), \ldots, \mu_n(\beta))^T$.
The Levenberg-Marquardt method (damped least squares) adds a ridge term to the Gauss-Newton approximation, $$ \beta^{(t+1)} = \beta^{(t)} + s [\mathbf{J}(\beta^{(t)})^T \mathbf{J}(\beta^{(t)}) + \tau \mathbf{I}_p]^{-1} \mathbf{J}(\beta^{(t)})^T [\mathbf{y} - \mu(\beta^{(t)})], \quad \tau > 0, $$ with $\mathbf{I}_p$ the identity matrix, bridging between Gauss-Newton (small $\tau$) and steepest descent (large $\tau$).
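A minimal sketch of the (damped) Gauss-Newton iteration in Python/NumPy, using the 3-parameter growth curve above as the mean function; `tau = 0` gives plain Gauss-Newton, `tau > 0` the Levenberg-Marquardt ridge term. The data `x`, `y` and starting value `beta0` are assumed given.

```python
import numpy as np

def growth_mu(x, beta):
    """mu(x, b1, b2, b3) = b3 / (1 + exp(-b1 - b2 x))."""
    b1, b2, b3 = beta
    return b3 / (1.0 + np.exp(-b1 - b2 * x))

def growth_jacobian(x, beta):
    """n-by-3 Jacobian of the mean vector mu(beta)."""
    b1, b2, b3 = beta
    e = np.exp(-b1 - b2 * x)
    denom = 1.0 + e
    return np.column_stack([
        b3 * e / denom ** 2,        # d mu / d b1
        b3 * x * e / denom ** 2,    # d mu / d b2
        1.0 / denom,                # d mu / d b3
    ])

def gauss_newton(x, y, beta0, tau=0.0, s=1.0, maxiter=50, tol=1e-8):
    """(Damped) Gauss-Newton: beta <- beta + s (J'J + tau I)^{-1} J' (y - mu)."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(maxiter):
        r = y - growth_mu(x, beta)                 # residuals y - mu(beta)
        J = growth_jacobian(x, beta)
        A = J.T @ J + tau * np.eye(beta.size)      # tau = 0: Gauss-Newton; tau > 0: LM
        delta = np.linalg.solve(A, J.T @ r)
        beta = beta + s * delta
        if np.linalg.norm(delta) < tol:
            break
    return beta
```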
See KL 14.4 for examples.