A function $f: \mathbb{R}^n \to \mathbb{R}$ is convex if $\text{dom} f$ is a convex set and
\begin{eqnarray*}
f(\lambda \mathbf{x} + (1-\lambda) \mathbf{y}) \le \lambda f(\mathbf{x}) + (1-\lambda) f(\mathbf{y})
\end{eqnarray*}
for all $\mathbf{x}, \mathbf{y} \in \text{dom} f$ and $\lambda \in [0, 1]$.
$f$ is strictly convex if the inequality is strict for all $\mathbf{x} \ne \mathbf{y} \in \text{dom} f$ and $\lambda \in (0, 1)$.
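As a quick worked example of the definition (added for illustration), any norm $f(\mathbf{x}) = \|\mathbf{x}\|$ is convex: for $\lambda \in [0, 1]$,
\begin{eqnarray*}
\|\lambda \mathbf{x} + (1-\lambda) \mathbf{y}\| \le \|\lambda \mathbf{x}\| + \|(1-\lambda) \mathbf{y}\| = \lambda \|\mathbf{x}\| + (1-\lambda) \|\mathbf{y}\|,
\end{eqnarray*}
by the triangle inequality and absolute homogeneity.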
Supporting hyperplane inequality. A differentiable function $f$ is convex if and only if
\begin{eqnarray*}
f(\mathbf{x}) \ge f(\mathbf{y}) + \nabla f(\mathbf{y})^T (\mathbf{x} - \mathbf{y})
\end{eqnarray*}
for all $\mathbf{x}, \mathbf{y} \in \text{dom} f$.
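As a quick sanity check in one dimension (added for illustration), for $f(x) = x^2$ the inequality reads
\begin{eqnarray*}
x^2 \ge y^2 + 2y(x - y) \quad \Longleftrightarrow \quad (x - y)^2 \ge 0,
\end{eqnarray*}
so the tangent line at any point lies below the graph.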
Second-order condition for convexity. A twice differentiable function $f$ is convex if and only if $\nabla^2f(\mathbf{x})$ is psd for all $\mathbf{x} \in \text{dom} f$. If $\nabla^2f(\mathbf{x})$ is pd for all $\mathbf{x} \in \text{dom} f$, then $f$ is strictly convex; the converse does not hold in general, e.g., $f(x) = x^4$ is strictly convex but $f''(0) = 0$.
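As a worked instance of the second-order condition (added for illustration), a quadratic function with symmetric $\mathbf{P}$,
\begin{eqnarray*}
f(\mathbf{x}) = \frac 12 \mathbf{x}^T \mathbf{P} \mathbf{x} + \mathbf{q}^T \mathbf{x} + r, \qquad \nabla f(\mathbf{x}) = \mathbf{P} \mathbf{x} + \mathbf{q}, \qquad \nabla^2 f(\mathbf{x}) = \mathbf{P},
\end{eqnarray*}
is convex if and only if $\mathbf{P}$ is psd, and strictly convex if $\mathbf{P}$ is pd.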
Convexity and global optima. Suppose $f$ is a convex function. Then any local minimum of $f$ is a global minimum; if $f$ is differentiable, any stationary point (i.e., a point with $\nabla f(\mathbf{x}) = \mathbf{0}$) is a global minimum; and the set of global minima is convex. If $f$ is strictly convex, the global minimum, when it exists, is unique.
Example: Least squares estimate. $f(\beta) = \frac 12 \| \mathbf{y} - \mathbf{X} \beta \|_2^2$ has Hessian $\nabla^2f = \mathbf{X}^T \mathbf{X}$ which is psd. So $f$ is convex and any stationary point (solution to the normal equation) is a global minimum. When $\mathbf{X}$ is rank deficient, the set of solutions is convex.
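Spelling out the gradient and Hessian (a short expansion, added for completeness):
\begin{eqnarray*}
\nabla f(\beta) = - \mathbf{X}^T (\mathbf{y} - \mathbf{X} \beta), \qquad \nabla^2 f(\beta) = \mathbf{X}^T \mathbf{X}, \qquad \mathbf{v}^T \mathbf{X}^T \mathbf{X} \mathbf{v} = \| \mathbf{X} \mathbf{v} \|_2^2 \ge 0 \text{ for all } \mathbf{v},
\end{eqnarray*}
so the Hessian is psd, and setting the gradient to zero gives the normal equations $\mathbf{X}^T \mathbf{X} \beta = \mathbf{X}^T \mathbf{y}$.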
Jensen's inequality. If $h$ is convex and $\mathbf{W}$ is a random vector taking values in $\text{dom} h$, then
\begin{eqnarray*}
\mathbf{E} h(\mathbf{W}) \ge h(\mathbf{E} \mathbf{W}),
\end{eqnarray*}
provided both expectations exist. For a strictly convex $h$, equality holds if and only if $\mathbf{W} = \mathbf{E}(\mathbf{W})$ almost surely.
Proof: Take $\mathbf{x} = \mathbf{W}$ and $\mathbf{y} = \mathbf{E}(\mathbf{W})$ in the supporting hyperplane inequality and take expectations on both sides.
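In detail (a short expansion, assuming $h$ is differentiable so the supporting hyperplane inequality applies):
\begin{eqnarray*}
h(\mathbf{W}) \ge h(\mathbf{E} \mathbf{W}) + \nabla h(\mathbf{E} \mathbf{W})^T (\mathbf{W} - \mathbf{E} \mathbf{W}),
\end{eqnarray*}
and taking expectations annihilates the last term because $\mathbf{E}(\mathbf{W} - \mathbf{E} \mathbf{W}) = \mathbf{0}$, leaving $\mathbf{E} h(\mathbf{W}) \ge h(\mathbf{E} \mathbf{W})$.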
Information inequality. Let $f$ and $g$ be two densities with respect to a common measure $\mu$. Then
\begin{eqnarray*}
\mathbf{E}_f \log f(X) \ge \mathbf{E}_f \log g(X),
\end{eqnarray*}
with equality if and only if $f = g$ almost everywhere with respect to $\mu$.
Proof: Apply Jensen's inequality to the convex function $- \ln(t)$ and random variable $W=g(X)/f(X)$ where $X \sim f$.
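Written out (a short expansion of the proof above):
\begin{eqnarray*}
\mathbf{E}_f \log f(X) - \mathbf{E}_f \log g(X) = \mathbf{E}_f \left[ - \ln \frac{g(X)}{f(X)} \right] \ge - \ln \mathbf{E}_f \left[ \frac{g(X)}{f(X)} \right] \ge - \ln \int g \, d\mu = 0,
\end{eqnarray*}
since $\mathbf{E}_f[g(X)/f(X)] = \int_{\{f > 0\}} g \, d\mu \le \int g \, d\mu = 1$ and $-\ln$ is decreasing. By strict convexity of $-\ln$, equality forces $g(X)/f(X)$ to be almost surely constant, i.e., $f = g$ almost everywhere with respect to $\mu$.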
Important applications of information inequality: M-estimation, EM algorithm.
Lagrangian and dual function. Consider an optimization problem in standard form:
\begin{eqnarray*}
&\text{minimize}& f_0(\mathbf{x}) \\
&\text{subject to}& f_i(\mathbf{x}) \le 0, \quad i = 1,\ldots,m \\
&& h_i(\mathbf{x}) = 0, \quad i = 1,\ldots,p,
\end{eqnarray*}
with optimal value $p^\star$. The Lagrangian is
\begin{eqnarray*}
L(\mathbf{x}, \lambda, \nu) = f_0(\mathbf{x}) + \sum_{i=1}^m \lambda_i f_i(\mathbf{x}) + \sum_{i=1}^p \nu_i h_i(\mathbf{x}).
\end{eqnarray*}
The vectors $\lambda = (\lambda_1,\ldots, \lambda_m)^T$ and $\nu = (\nu_1,\ldots,\nu_p)^T$ are called the Lagrange multiplier vectors or dual variables. The Lagrange dual function is $g(\lambda, \nu) = \inf_{\mathbf{x}} L(\mathbf{x}, \lambda, \nu)$. Lower bound property: for any $\lambda \succeq \mathbf{0}$ and any $\nu$, $g(\lambda, \nu) \le p^\star$.
Proof: For any feasible point $\tilde{\mathbf{x}}$ and any $\lambda \succeq \mathbf{0}$, \begin{eqnarray*} L(\tilde{\mathbf{x}}, \lambda, \nu) = f_0(\tilde{\mathbf{x}}) + \sum_{i=1}^m \lambda_i f_i(\tilde{\mathbf{x}}) + \sum_{i=1}^p \nu_i h_i(\tilde{\mathbf{x}}) \le f_0(\tilde{\mathbf{x}}) \end{eqnarray*} because the second term is non-positive ($\lambda_i \ge 0$ and $f_i(\tilde{\mathbf{x}}) \le 0$) and the third term is zero ($h_i(\tilde{\mathbf{x}}) = 0$). Then \begin{eqnarray*} g(\lambda, \nu) = \inf_{\mathbf{x}} L(\mathbf{x}, \lambda, \nu) \le L(\tilde{\mathbf{x}}, \lambda, \nu) \le f_0(\tilde{\mathbf{x}}). \end{eqnarray*} Taking the infimum over feasible $\tilde{\mathbf{x}}$ gives $g(\lambda, \nu) \le p^\star$.
The Lagrange dual problem is
\begin{eqnarray*}
&\text{maximize}& g(\lambda, \nu) \\
&\text{subject to}& \lambda \succeq \mathbf{0},
\end{eqnarray*}
which is a convex problem (maximization of a concave function over a convex set) regardless of whether the primal problem is convex or not.
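For concreteness, here is a standard worked example (added for illustration): the linear program
\begin{eqnarray*}
&\text{minimize}& \mathbf{c}^T \mathbf{x} \\
&\text{subject to}& \mathbf{A} \mathbf{x} = \mathbf{b}, \quad \mathbf{x} \succeq \mathbf{0}.
\end{eqnarray*}
Writing the inequality constraints as $-x_i \le 0$, the Lagrangian is
\begin{eqnarray*}
L(\mathbf{x}, \lambda, \nu) = \mathbf{c}^T \mathbf{x} - \lambda^T \mathbf{x} + \nu^T (\mathbf{A} \mathbf{x} - \mathbf{b}) = - \mathbf{b}^T \nu + (\mathbf{c} - \lambda + \mathbf{A}^T \nu)^T \mathbf{x},
\end{eqnarray*}
so $g(\lambda, \nu) = -\mathbf{b}^T \nu$ if $\mathbf{c} - \lambda + \mathbf{A}^T \nu = \mathbf{0}$ and $g(\lambda, \nu) = -\infty$ otherwise. Eliminating $\lambda \succeq \mathbf{0}$, the dual problem is to maximize $-\mathbf{b}^T \nu$ subject to $\mathbf{A}^T \nu + \mathbf{c} \succeq \mathbf{0}$, again a linear (hence convex) problem.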
Denote the optimal value of the dual problem by $d^\star$. Weak duality $d^\star \le p^\star$ always holds, by the lower bound property. The difference $p^\star - d^\star$ is called the optimal duality gap.
For a convex primal problem, i.e., with $f_0,\ldots,f_m$ convex and $h_1,\ldots,h_p$ affine, we usually (but not always) have strong duality, i.e., $d^\star = p^\star$. A simple sufficient condition is Slater's condition: there exists a feasible point $\mathbf{x}$ with $f_i(\mathbf{x}) < 0$ for $i = 1,\ldots,m$.
Such a point is also called strictly feasible.
KKT is "one of the great triumphs of 20th century applied mathematics" (KL Chapter 11).
Assume $f_0,\ldots,f_m,h_1,\ldots,h_p$ are differentiable. Let $\mathbf{x}^\star$ and $(\lambda^\star, \nu^\star)$ be any primal and dual optimal points with zero duality gap, i.e., strong duality holds.
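For completeness, here is the standard argument behind the next two observations. Zero duality gap forces
\begin{eqnarray*}
f_0(\mathbf{x}^\star) = g(\lambda^\star, \nu^\star) = \inf_{\mathbf{x}} L(\mathbf{x}, \lambda^\star, \nu^\star) \le L(\mathbf{x}^\star, \lambda^\star, \nu^\star) = f_0(\mathbf{x}^\star) + \sum_{i=1}^m \lambda_i^\star f_i(\mathbf{x}^\star) + \sum_{i=1}^p \nu_i^\star h_i(\mathbf{x}^\star) \le f_0(\mathbf{x}^\star),
\end{eqnarray*}
where the last inequality uses $\lambda_i^\star \ge 0$, $f_i(\mathbf{x}^\star) \le 0$, and $h_i(\mathbf{x}^\star) = 0$. Hence both inequalities hold with equality: $\mathbf{x}^\star$ minimizes $L(\mathbf{x}, \lambda^\star, \nu^\star)$ over $\mathbf{x}$, and $\sum_{i=1}^m \lambda_i^\star f_i(\mathbf{x}^\star) = 0$.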
Since $\mathbf{x}^\star$ minimizes $L(\mathbf{x}, \lambda^\star, \nu^\star)$ over $\mathbf{x}$, its gradient vanishes at $\mathbf{x}^\star$, and we obtain the Karush-Kuhn-Tucker (KKT) conditions
\begin{eqnarray*}
f_i(\mathbf{x}^\star) &\le& 0, \quad i = 1,\ldots,m \\
h_i(\mathbf{x}^\star) &=& 0, \quad i = 1,\ldots,p \\
\lambda_i^\star &\ge& 0, \quad i = 1,\ldots,m \\
\lambda_i^\star f_i(\mathbf{x}^\star) &=& 0, \quad i = 1,\ldots,m \\
\nabla f_0(\mathbf{x}^\star) + \sum_{i=1}^m \lambda_i^\star \nabla f_i(\mathbf{x}^\star) + \sum_{i=1}^p \nu_i^\star \nabla h_i(\mathbf{x}^\star) &=& \mathbf{0}.
\end{eqnarray*}
Since $\sum_{i=1}^m \lambda_i^\star f_i(\mathbf{x}^\star)=0$ and each term is non-positive, we have $\lambda_i^\star f_i(\mathbf{x}^\star)=0$, $i=1,\ldots,m$, which is the complementary slackness condition.
When the primal problem is convex, the KKT conditions are also sufficient for the points to be primal and dual optimal.
If $f_i$ are convex and $h_i$ are affine, and $(\tilde{\mathbf{x}}, \tilde \lambda, \tilde \nu)$ satisfy the KKT conditions, then $\tilde{\mathbf{x}}$ and $(\tilde \lambda, \tilde \nu)$ are primal and dual optimal, with zero duality gap.
The KKT conditions play an important role in optimization. Many algorithms for convex optimization are conceived as, or can be interpreted as, methods for solving the KKT conditions.
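As an illustration of solving the KKT conditions directly (a standard worked example, added here), consider equality-constrained quadratic minimization with $\mathbf{P}$ symmetric psd:
\begin{eqnarray*}
&\text{minimize}& \frac 12 \mathbf{x}^T \mathbf{P} \mathbf{x} + \mathbf{q}^T \mathbf{x} + r \\
&\text{subject to}& \mathbf{A} \mathbf{x} = \mathbf{b}.
\end{eqnarray*}
There are no inequality constraints, so the KKT conditions reduce to $\mathbf{A} \mathbf{x}^\star = \mathbf{b}$ and $\mathbf{P} \mathbf{x}^\star + \mathbf{q} + \mathbf{A}^T \nu^\star = \mathbf{0}$, i.e., the linear system
\begin{eqnarray*}
\begin{pmatrix} \mathbf{P} & \mathbf{A}^T \\ \mathbf{A} & \mathbf{0} \end{pmatrix} \begin{pmatrix} \mathbf{x}^\star \\ \nu^\star \end{pmatrix} = \begin{pmatrix} - \mathbf{q} \\ \mathbf{b} \end{pmatrix}.
\end{eqnarray*}
Solving this single linear system yields both the primal and dual optimal points.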