Review the main ideas in Lec. 17, p. 2--4.
Analytically compute the projection of a point $x_{0}\in\mathbb{R}^{n}_{\geq 0}\setminus\{0\}$ onto the standard simplex $\mathcal{S} := \{x\in\mathbb{R}^{n}_{\geq 0}\mid \mathbf{1}^{\top}x = 1\}$ with respect to the generalized Kullback-Leibler divergence $$D_{\text{KL}}(x\parallel x_{0}) := \sum_{i=1}^{n}\left(x_{i}\log\frac{x_{i}}{x_{0i}} - x_{i} + x_{0i}\right).$$ In other words, compute $\operatorname{proj}^{D_{\text{KL}}}_{\mathcal{S}}\left(x_{0}\right)$ as a function of $x_{0}$.
(Hint: Set up the relevant constrained optimization problem, argue why you can invoke statement 2 in Lec. 15, p. 2, and then solve the KKT conditions from Lec. 14, p. 18-19.)
The problem $\operatorname{proj}^{D_{\text{KL}}}_{\mathcal{S}}\left(x_{0}\right) := \underset{x\in\mathcal{S}}{\arg\min}\:D_{\text{KL}}(x\parallel x_{0})$ is convex because the constraint set being a standard simplex, is a convex set, and the objective $D_{\text{KL}}$ is a convex function in $x$. Also, the objective is $C^{1}$ in $x$. Hence by statement 2 in Lec. 15, p. 2, if we can find the solution of the KKT conditions, then that solution would be the optimal primal-dual pair.
Following Lec. 14, p. 18-19, let us now start solving the KKT conditions for this problem with Lagrangian $$L(x,\lambda,\nu) := \sum_{i=1}^{n}\left(x_{i}\log\frac{x_{i}}{x_{0i}} - x_{i} + x_{0i}\right) + \langle\lambda,-x\rangle + \nu(\mathbf{1}^{\top}x-1), \quad \lambda\in\mathbb{R}^{n}_{\geq 0}, \quad \nu\in\mathbb{R}.$$ Stationarity of the Lagrangian, i.e., $\nabla_{x}L = 0$ gives \begin{align*} \frac{\partial}{\partial x_{i}}\bigg\{x_{i}\log\frac{x_{i}}{x_{0i}} - x_{i} + x_{0i} - \lambda_{i}x_{i} + \nu(x_{i}-1)\bigg\}\bigg\vert_{(x_{i},\lambda_{i},\nu)=(x_{i}^{\text{opt}},\lambda_{i}^{\text{opt}},\nu^{\text{opt}})} = 0,\quad\Rightarrow\quad x_{i}^{\text{opt}} = \exp(-\nu^{\text{opt}})x_{0i}\exp\left(\lambda_{i}^{\text{opt}}\right), \end{align*} which upon using $\sum_{i=1}^{n}x_{i}^{\text{opt}}=1$, results in $$\exp(-\nu^{\text{opt}}) = \frac{1}{\sum_{i=1}^{n}x_{0i}\exp\left(\lambda_{i}^{\text{opt}}\right)},\quad\Rightarrow\quad x_{i}^{\text{opt}} = \frac{x_{0i}\exp\left(\lambda_{i}^{\text{opt}}\right)}{\sum_{i=1}^{n}x_{0i}\exp\left(\lambda_{i}^{\text{opt}}\right)},\quad i=1,...,n.$$
From above, if $x_{0i}=0$ then $x_{i}^{\text{opt}}=0$. Likewise, if $x_{0i}\neq 0$ then $x_{i}^{\text{opt}}\neq 0$.
From primal feasibility, we know $x_{i}^{\text{opt}} \geq 0$, and hence if $x_{0i}\neq 0$, we must have $x_{i}^{\text{opt}}>0$. From complimentary slackness, $x_{i}^{\text{opt}}>0\;\Leftrightarrow\;\lambda_{i}^{\text{opt}}=0$. Therefore, if $x_{0i}\neq 0$ then $$x_{i}^{\text{opt}} = \frac{x_{0i}}{\sum_{i=1}^{n}x_{0i}}.$$ Combining the $x_{0i}=0$ and $x_{0i}\neq 0$ cases, we conclude that the optimal primal is $$x^{\text{opt}} = \frac{1}{\mathbf{1}^{\top}x_{0}}x_{0},$$ that is, a normalization of the vector $x_{0}$.
Analytically compute the projection of $X_{0}\in\mathbb{S}^{n}$ onto the cone $\mathbb{S}_{+}^{n}$ with respect to the Frobenius norm (Lec. 17, p. 7, 1st row and second column). In other words, compute $\operatorname{proj}^{\|\cdot\|_{\text{F}}}_{\mathbb{S}^{n}_{+}}\left(X_{0}\right)$ as a function of $X_{0}$.
(Hint: Same hints as in part (a). Review Lec. 4 for taking derivative of scalar with respect to matrix. Also, the spectral theorem for symmertic matrices should help.)
The problem $\operatorname{proj}^{\|\cdot\|_{\text{F}}}_{\mathbb{S}^{n}_{+}}\left(X_{0}\right) := \underset{X\in\mathbb{S}^{n}_{+}}{\arg\min}\|X-X_{0}\|_{\rm{F}} = \underset{X\in\mathbb{S}^{n}_{+}}{\arg\min}\frac{1}{2}\|X-X_{0}\|_{\rm{F}}^{2}$ (the prefactor $1/2$ and the square will not change the argmin) is convex since the constraint set is a convex cone, and the objective is a convex function. Also, the objective is $C^{1}$ in $X$. Hence by statement 2 in Lec. 15, p. 2, if we can find the solution of the KKT conditions, then that solution would be the optimal primal-dual pair.
Following Lec. 14, p. 18-19, let us now start solving the KKT conditions for this problem with Lagrangian $$L(X,\Lambda) = \frac{1}{2}\operatorname{trace}\left(X-X_{0}\right)^{\top}\left(X-X_{0}\right) + \underbrace{\langle\Lambda,-X\rangle}_{-\operatorname{trace}(\Lambda^{\top}X)}, \quad \Lambda\in\mathbb{S}_{+}^{n}\;(\text{since $\mathbb{S}_{+}^{n}$ is self-dual cone}).$$ Stationarity of the Lagrangian, i.e., $\frac{\partial L}{\partial X} = 0$ gives
\begin{align*} 0 = \frac{\partial L}{\partial X} &= \frac{1}{2}\frac{\partial}{\partial X}\operatorname{trace}\left(X^{\top}X - 2X_{0}^{\top}X + X_{0}^{\top}X_{0} - \Lambda^{\top}X\right)\bigg\vert_{(X,\Lambda)=(X^{\text{opt}},\Lambda^{\text{opt}})}\\ &= \frac{1}{2}\frac{\partial}{\partial X}\operatorname{trace}\left(X^{\top}X\right)\bigg\vert_{(X,\Lambda)=(X^{\text{opt}},\Lambda^{\text{opt}})} - \frac{\partial}{\partial X}\operatorname{trace}\left((\Lambda + X_{0})^{\top}X\right)\bigg\vert_{(X,\Lambda)=(X^{\text{opt}},\Lambda^{\text{opt}})}\\ &= \frac{1}{2}\cdot 2X^{\text{opt}} - \Lambda^{\text{opt}} - X_{0},\\ \Rightarrow X_{0} &= X^{\text{opt}} - \Lambda^{\text{opt}}. \end{align*}By spectral theorem, $X_{0}\in\mathbb{S}^{n}$ admits spectral decomposition $X_{0} = U_{0}D_{0}U_{0}^{\top}$ with diagonal matrix $D_{0}=\operatorname{diag}(\lambda_1,...,\lambda_{n})$ containing the eigenvalues $\lambda_1,...,\lambda_{n}$ of $X_{0}$ along its diagonal, and $U_{0}$ orthogonal, i.e, $U_{0}^{-1}=U_{0}^{\top}$. Therefore,
$$U_{0}D_{0}U_{0}^{\top} = X^{\text{opt}} - \Lambda^{\text{opt}} \quad \Rightarrow\quad D_{0} = \underbrace{U_{0}^{\top}X^{\text{opt}}U_{0}}_{=: \widetilde{X}^{\text{opt}}} - \underbrace{U_{0}^{\top}\Lambda^{\text{opt}}U_{0}}_{=: \widetilde{\Lambda}^{\text{opt}}}.$$Thus, Lagrangian stationarity ultimately yields $D_{0} = \widetilde{X}^{\text{opt}} - \widetilde{\Lambda}^{\text{opt}}$, which in turn implies $\widetilde{X}^{\text{opt}}_{ij} = \widetilde{\Lambda}^{\text{opt}}_{ij}$ for all $i\neq j$, that is, all off-diagonal terms are equal.
Primal feasibility gives $X^{\text{opt}}\succeq 0$. Dual feasibility gives $\Lambda^{\text{opt}}\succeq 0$ (since $\mathbb{S}^{n}_{+}$ is self-dual). These are equivalent to $\widetilde{X}^{\text{opt}}\succeq 0$ and $\widetilde{\Lambda}^{\text{opt}}\succeq 0$, respectively.
Complimentary slackness gives $$\langle\Lambda^{\text{opt}},X^{\text{opt}}\rangle = \operatorname{trace}\left(\Lambda^{\text{opt}}X^{\text{opt}}\right) = 0, \quad\Rightarrow\quad\underbrace{\operatorname{trace}\left(U_{0}^{-1}\Lambda^{\text{opt}}X^{\text{opt}}U_{0}\right) = 0}_{\text{since similarity transform preserves spectrum}}.$$ But $U_{0}^{-1}\Lambda^{\text{opt}}X^{\text{opt}}U_{0} = U_{0}^{\top}\Lambda^{\text{opt}}U_{0}U_{0}^{\top}X^{\text{opt}}U_{0} = \widetilde{\Lambda}^{\text{opt}}\widetilde{X}^{\text{opt}}$, and hence the complimentary slackness condition is equivalent to $\operatorname{trace}\left(\widetilde{\Lambda}^{\text{opt}}\widetilde{X}^{\text{opt}}\right) = 0$.
On the other hand, $$\operatorname{trace}\left(\widetilde{\Lambda}^{\text{opt}}\widetilde{X}^{\text{opt}}\right) = \underbrace{\sum_{i=1}^{n}\widetilde{X}^{\text{opt}}_{ii}\widetilde{\Lambda}^{\text{opt}}_{ii}}_{\stackrel{\large{\text{trace of Hadamard product of positive semidefinite}}}{\text{matrices (from primal and dual feasibility), thus}\:\geq 0}} + \underbrace{\sum_{i=1}^{n}\sum_{j\neq i} \widetilde{X}^{\text{opt}}_{ij}\widetilde{\Lambda}^{\text{opt}}_{ij}}_{=\sum_{i=1}^{n}\sum_{j\neq i} \left(\widetilde{X}^{\text{opt}}_{ij}\right)^{2}\:\text{from off-diagonal equality}} = 0$$ is possible if and only if $$\widetilde{X}^{\text{opt}}_{ij} = \widetilde{\Lambda}^{\text{opt}}_{ij} = 0\;\forall\:i\neq j, \quad \widetilde{X}^{\text{opt}}_{ii}\widetilde{\Lambda}^{\text{opt}}_{ii}=0,\quad\text{for all}\quad i,j=1,...,n.$$ Therefore, $\widetilde{X}^{\text{opt}}$ and $\widetilde{\Lambda}^{\text{opt}}$ are diagonal matrices with complimentary zero-nonzero patterns along the their diagonals, i.e., $$\widetilde{X}^{\text{opt}}_{ij} = \max\{\lambda_{i},0\}, \quad \widetilde{\Lambda}^{\text{opt}}_{ij} = \max\{-\lambda_{i},0\}.$$ Consequently, \begin{align*} X^{\text{opt}} = X_{0} + \Lambda^{\text{opt}} &= U_{0}\left(D_{0} + \widetilde{\Lambda}^{\text{opt}}\right)U_{0}^{\top}\\ &= U_{0}\left(\operatorname{diag}\left(\lambda_1,...,\lambda_n\right) + \operatorname{diag}\left(\max\{-\lambda_1,0\}...,\max\{-\lambda_n,0\}\right)\right)U_{0}^{\top}\\ &= U_{0}\:\operatorname{diag}\left(\max\{\lambda_1,0\}...,\max\{\lambda_n,0\}\right)U_{0}^{\top}, \end{align*}
that is, the spectral decomposition associated with the nonnegative part of the spectrum of $X_{0}\in\mathbb{S}^{n}$.
During Lec. 17, p. 9-10, Example 2, we mentioned that the $k$ rank ball
$$\mathcal{C}_{mn}^{k} := \{X\in\mathbb{R}^{m\times n} \mid \text{rank}(X)\leq k\}, \quad\text{for fixed}\;k\leq \min\{m,n\},$$is a nonconvex set.
To understand this set nonconvexity, fix $m=n=3$ and $k=2$. Construct an example to demonstrate that the set $\mathcal{C}^{2}_{33}=\{X\in\mathbb{R}^{3\times 3}\mid \text{rank}(X)\leq 2\}$ is nonconvex.
Remark: Notice that even though $\text{rank}(X)$ is a nonconvex function, we need to be careful in arguing set nonconvexity becuase it is possible for a nonconvex function to have convex sublevel sets.
Consider $X = \begin{pmatrix}1 & 0 & 0\\ 0 & 0 & 0\\ 0 & 0 & 0 \end{pmatrix}$, $Y = \begin{pmatrix}0 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 0 \end{pmatrix}$, and $Z = \begin{pmatrix}0 & 0 & 0\\ 0 & 0 & 0\\ 0 & 0 & 1 \end{pmatrix}$. Then $\operatorname{rank}(X)=\operatorname{rank}(Y)=\operatorname{rank}(Z)=1$, and hence $X,Y,Z\in\mathcal{C}^{2}_{33}$.
Denote the standard 2-simplex as $$\mathcal{S}_{2} := \{\theta\in\mathbb{R}^{3} \mid \theta\succeq\mathbf{0}, \mathbf{1}^{\top}\theta = 1\}.$$
Notice that $\theta_1 X + \theta_2 Y + \theta_3 Z = \begin{pmatrix}\theta_1 & 0 & 0\\ 0 & \theta_2 & 0\\ 0 & 0 & \theta_3 \end{pmatrix} \not\in\mathcal{C}^{2}_{33}\;\forall\:\begin{pmatrix}\theta_1\\ \theta_2\\ \theta_3 \end{pmatrix}\in\mathcal{S}_2$, because $\operatorname{rank}\left(\theta_1 X + \theta_2 Y + \theta_3 Z\right) = 3$ whenever $\{\theta\in\mathcal{S}_{2}\mid \theta_1\neq 0, \theta_2\neq 0, \theta_3\neq 0\}$. Therefore, $\mathcal{C}^{2}_{33}$ is a nonconvex set.
In practice, we often encounter optimization problems where the objective is a convex but nondifferentiable function. To solve such problems, we need the concept of subgradient.
We say a vector $g\in\mathbb{R}^{n}$ is a subgradient of $f:\mathbb{R}^{n}\mapsto \mathbb{R}$ at $x\in\textbf{dom}(f)$ if $$f(y) \geq f(x) + g^{\top}(y-x),\quad\text{for all}\quad y\in\textbf{dom}(f).$$
The above definition is a generalization of the concept of gradient in the sense if $f$ were differentiable, the above would precisely be the first order characterization of convexity. In general, subgradient at a point may not be unique. The set of all subgradients at $x$ is called the subdifferential $\partial f(x) := \{g \mid f(y) \geq f(x) + g^{\top}(y-x),\:\text{for all}\:y\in\textbf{dom}(f)\}$.
As a simple example, consider $f(x) = |x|$ with $\textbf{dom}(f)=\mathbb{R}$. Then for $x>0$, the subgradient $g=1$ is unique, i.e., $\partial f(x) = \{1\}$. Likewise for $x<0$, we have $\partial f(x) = \{-1\}$. At $x=0$, the inequality $|y|\geq g^{\top}y$ is saisfied iff $g\in[-1,1]$. Thus, $\partial f(0) = [-1,1]$.
It can be shown that if $f$ is differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$, as you intuitively expect. Thus, subgradient $g$ generalizes the concept of gradient.
Given convex functions $f_{1},...,f_{m}(x)$, let $f(x):=\max\{f_{1}(x), ..., f_{m}(x)\}$. Define $I(x)=\{i\mid f_{i}(x)=f(x)\}$, that is, the index of the "active" functions at $x$. A nice result is this: to compute subgradient $g$ of $f$ at $x$, choose any $i\in I(x)$. Then $g$ can be taken as any subgradient of $f_{i}$ at $x$. That is, if $i$ is an active index, then $g\in\partial f_{i}(x)$ implies $g\in\partial f(x)$. Furthermore, $$\partial f(x) = \text{conv}\left(\underset{i\in I(x)}{\bigcup}\partial f_{i}(x)\right),$$ the convex hull of the union of subdifferentials of "active" functions at $x$.
Use the above result to prove that if $f(x) = \|x\|_{1}$ with $x\in\mathbb{R}^{n}$, then $\partial f(x) = S_{1} \times S_{2} \times ... \times S_{n}$ (Cartesian product of sets), where $S_{j} := [-1,1]$ for $x_{j}=0$, $S_{j} := \{1\}$ for $x_{j}>0$, and $S_{j} := \{-1\}$ for $x_{j}<0$, $j=1,...,n$.
Hint: Start by writing $f(x)=\|x\|_{1}$ as the pointwise max of a finite number of functions.
We start by writing $f(x) = \|x\|_{1} = \underset{y\in\{-1,1\}^{n}}{\max}y^{\top}x$, that is, a poitwise max of $2^{n}$ functions. Then finding the active function reduces to finding $y\in\{-1,1\}^{n}$ such that $y^{\top}x=\|x\|_1$.
When $x_i > 0$, we get $y_i = +1$. When $x_i < 0$, we get $y_i = -1$. When $x_i = 0$, more than one function is active, and $y_{i}$ equals either $+1$ or $-1$. Hence the subgradient is $g_{i} = + 1$ for $x_{i}>0$, $g_{i} = - 1$ for $x_{i}<0$, $g_{i}=[-1,1]$ for $x_{i}=0$.
Therefore, $$\partial f(x) = \text{conv}\left(\underset{i\in I(x)}{\bigcup}\partial f_{i}(x)\right) = S_{1} \times S_{2} \times ... \times S_{n}$$ where $S_{j} := [-1,1]$ for $x_{j}=0$, $S_{j} := \{1\}$ for $x_{j}>0$, and $S_{j} := \{-1\}$ for $x_{j}<0$, $j=1,...,n$.
To solve a problem of the form $\underset{x\in\mathbb{R}^{n}}{\text{minimize}}\:f(x)$, where $f$ is convex but not everywhere differentiable, a standard iterative algorithm is the subgradient method: $$x_{k+1} = x_{k} - \eta_{k}g_{k}, \quad k=1,2,...,$$
where $g_{k}\in\partial f(x_{k})$ is any subgradient, and $\eta_{k}>0$ denotes stepsize at the $k$th iteration. Assume that the norm of subgradient is bounded, i.e., there exists $G$ such that $\|g_{k}\|_{2} \leq G$ for all iteration index $k=1,2,...$.
Although it looks very similar to gradient descent, subgradient method is not a descent method (see Lec. 18: intro to gradient descent, p. 2). Consider a running estimate of the best value so far: $f^{\text{best}}_{k} = \min\{f(x_{1}), ..., f(x_{k})\}$. Since $f^{\text{best}}_{k}$ is decreasing, it will have a limit, which may be finite or $-\infty$.
If $x_{\text{opt}}$ is a minimizer, then after $k$ iterations, the subgradient method satisfies:
$$f^{\text{best}}_{k} - f(x_{\text{opt}}) \leq \dfrac{\|x_{0} - x_{\text{opt}}\|_{2}^{2} + G^{2}\sum_{i=1}^{k}\eta_{i}^{2}}{2\sum_{i=1}^{k}\eta_{i}}.$$(i) From above, what can you tell about the convergence of subgradient method if the stepsize sequence $\{\eta_{k}\}$ is not summable but square summable, i.e., if $\sum_{k=1}^{\infty}\eta_{k} = \infty$, and $\sum_{k=1}^{\infty}\eta_{k}^{2} < \infty$.
(ii) What can you tell about the convergence of subgradient method if we choose the stepsize $\eta_{k} \equiv 1/k$ for all $k$?
(i) Since $f$ is a convex function, $0 \leq f^{\text{best}}_{k} - f(x_{\text{opt}})$ for all $k=1,2,...$.
If $\sum_{k=1}^{\infty}\eta_{k} = \infty$, and $\sum_{k=1}^{\infty}\eta_{k}^{2} < \infty$, then $\displaystyle\lim_{k\rightarrow\infty}\dfrac{\|x_{0} - x_{\text{opt}}\|_{2}^{2} + G^{2}\sum_{i=1}^{k}\eta_{i}^{2}}{2\sum_{i=1}^{k}\eta_{i}} = 0$. Therefore, by the squeeze theorem, we obtain $$\displaystyle\lim_{k\rightarrow\infty}\left(f^{\text{best}}_{k} - f(x_{\text{opt}})\right) = 0,$$ guaranteeing the convergence of the subgradient method for such stepsize sequences.
(ii) The sequence $\{\eta_{k}\}\equiv\{1/k\}$ is not summable since $\sum_{k=1}^{\infty}1/k = \infty$. This can be checked in several ways: https://en.wikipedia.org/wiki/Harmonic_series_(mathematics)#Divergence.
On the other hand, $\sum_{k=1}^{\infty}1/k^{2} < \infty$ by the integral test since $\int_{1}^{\infty}\frac{1}{x^{2}} = 1<\infty$. Therefore, by part 2(b)(i), we conclude that the subgradient method will converge with stepsize $\eta_{k}=1/k$ for all $k$.
A popular problem in statistics and machine learning is the so called "lasso problem", given by
$$\underset{x\in\mathbb{R}^{n}}{\text{minimize}}\:\frac{1}{2}\|Ax-b\|_{2}^{2} + \gamma \|x\|_{1},$$where $A\in\mathbb{R}^{m\times n}$, $b\in\mathbb{R}^{m}$, and $\gamma>0$ is a regularization parameter. For $\gamma = 0$, it reduces to ordinary least squares. For $\gamma>0$, it solves a linear regression problem while promoting sparsity in the minimizer. The objective function is a sum of two terms, and is convex but nondifferentiable. The first summand in the objective promotes data fidelity while the second promotes sparsity.
(i) Use your answer in part (a) to explicitly write down the subgradient update rule for the lasso problem.
(ii) Use the MATLAB starter code $\texttt{AM229F20HW8problem.m}$ inside CANVAS Files section, folder: HW problems and solutions, to report the optimal value and cpu time for cvx solution of the lasso problem for given problem data. Also report the 10,000th iterate value of the objective and cpu time for subgradient method for the same data with $\eta_k \equiv 1/k$ using the same code. Please submit your completed code.
You can also submit the Python equivalent of the completed starter code with cvxpy but please make sure to use the same numerical values for the problem data.
(i) Using the solution for part 2(a), the subgradient update rule for lasso problem becomes
$$x_{k+1}=x_{k}-\eta_{k}\left(A^{\top}(Ax_{k}-b)+ \gamma g_{k}\right), \quad k=1,2,...,$$where $\left(g_{k}\right)_{i}:= \operatorname{sgn}((x_{k})_{i})$ if $(x_{k})_{i} \neq 0$, and $\left(g_{k}\right)_{i}\in [-1,1]$ if $(x_{k})_{i} = 0$.
(ii) The posted code $\texttt{AM229F20HW8solution.m}$ inside CANVAS Files section, folder: HW problems and solutions, yields the optimal value using cvx as $1.5045$. The cpu time for cvx is $1.7100$ s. The same code gives the 10,000th iterate of the subgradient method as $1.5048$. The cpu time for the subgradient method is $0.1175$ s.