versioninfo()
Julia Version 1.6.2
Commit 1b93d53fc4 (2021-07-14 15:36 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i5-8279U CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
using Pkg
Pkg.activate("../..")
Pkg.status()
Activating environment at `~/Dropbox/class/M1399.000200/2021/M1399_000200-2021fall/Project.toml`
Status `~/Dropbox/class/M1399.000200/2021/M1399_000200-2021fall/Project.toml`
  [7d9fca2a] Arpack v0.4.0
  [6e4b80f9] BenchmarkTools v1.2.0
  [1e616198] COSMO v0.8.1
  [f65535da] Convex v0.14.16
  [a93c6f00] DataFrames v1.2.2
  [31a5f54b] Debugger v0.6.8
  [31c24e10] Distributions v0.24.18
  [e2685f51] ECOS v0.12.3
  [f6369f11] ForwardDiff v0.10.21
  [28b8d3ca] GR v0.61.0
  [c91e804a] Gadfly v1.3.3
  [bd48cda9] GraphRecipes v0.5.7
  [2e9cd046] Gurobi v0.9.14
  [f67ccb44] HDF5 v0.14.3
  [82e4d734] ImageIO v0.5.8
  [6218d12a] ImageMagick v1.2.1
  [916415d5] Images v0.24.1
  [b6b21f68] Ipopt v0.7.0
  [42fd0dbc] IterativeSolvers v0.9.1
  [4076af6c] JuMP v0.21.10
  [b51810bb] MatrixDepot v1.0.4
  [6405355b] Mosek v1.2.1
  [1ec41992] MosekTools v0.9.4
  [76087f3c] NLopt v0.6.3
  [47be7bcc] ORCA v0.5.0
  [a03496cd] PlotlyBase v0.4.3
  [f0f68f2c] PlotlyJS v0.15.0
  [91a5bcdd] Plots v1.22.6
  [438e738f] PyCall v1.92.3
  [d330b81b] PyPlot v2.10.0
  [dca85d43] QuartzImageIO v0.7.3
  [6f49c342] RCall v0.13.12
  [ce6b1742] RDatasets v0.7.5
  [c946c3f1] SCS v0.7.1
  [276daf66] SpecialFunctions v1.7.0
  [2913bbd2] StatsBase v0.33.12
  [f3b207a7] StatsPlots v0.14.28
  [b8865327] UnicodePlots v2.4.6
  [0f1e0344] WebIO v0.8.16
  [8f399da3] Libdl
  [2f01184e] SparseArrays
  [10745b16] Statistics
over $\mathbf{x} \in \mathbb{R}^d$. This says the points $(\mathbf{A}_i \mathbf{x} + \mathbf{b}_i, \mathbf{c}_i^T \mathbf{x} + d_i)$ live in the second order cone (ice cream cone, Lorentz cone, quadratic cone) \begin{eqnarray*} \mathbb{L}^{d+1} = \{(\mathbf{x}, t): \|\mathbf{x}\|_2 \le t\} \end{eqnarray*} in $\mathbb{R}^{d+1}$.
QP is a special case of SOCP. Why?
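One way to see this (a sketch of the standard epigraph argument; the factorization $\mathbf{P} = \mathbf{F}^T \mathbf{F}$, e.g. by Cholesky, exists because $\mathbf{P} \in \mathbf{S}_+^n$): the QP objective
$$
\begin{eqnarray*}
&\text{minimize}& \frac 12 \mathbf{x}^T \mathbf{P} \mathbf{x} + \mathbf{q}^T \mathbf{x} + r
\end{eqnarray*}
$$
is equivalent to the epigraph form
$$
\begin{eqnarray*}
&\text{minimize}& t \\
&\text{subject to}& \frac 12 \|\mathbf{F} \mathbf{x}\|_2^2 + \mathbf{q}^T \mathbf{x} + r \le t,
\end{eqnarray*}
$$
and the constraint reads $\|\mathbf{F}\mathbf{x}\|_2^2 \le 2 (t - \mathbf{q}^T \mathbf{x} - r) \cdot 1$, i.e., $(\mathbf{F}\mathbf{x}, t - \mathbf{q}^T\mathbf{x} - r, 1)$ lies in a rotated second order cone (defined below); the QP's affine constraints carry over unchanged.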
When $\mathbf{c}_i = \mathbf{0}$ for $i=1,\ldots,m$, SOCP is equivalent to a quadratically constrained quadratic program (QCQP)
where $\mathbf{P}_i \in \mathbf{S}_+^n$, $i=0,1,\ldots,m$. Why?
- $\mathbb{L}_r^{d+2}$ is SOCr: $\|\mathbf{x}\|_2^2 \le 2 t_1 t_2,~t_1, t_2 \ge 0 \iff \left\| \begin{bmatrix} \mathbf{x} \\ \frac{t_1 - t_2}{\sqrt 2} \end{bmatrix} \right\|_2 \le \frac{t_1 + t_2}{\sqrt 2}$.
- $\mathbb{L}^{d+1}$ is $\mathbb{L}_r^{d+2}$-representable:
$$
\|\mathbf{x}\|_2 \le t \iff \left( \mathbf{x}, \frac{t}{\sqrt 2}, \frac{t}{\sqrt 2} \right) \in \mathbb{L}_r^{d+2}, \quad \text{since } 2 \cdot \frac{t}{\sqrt 2} \cdot \frac{t}{\sqrt 2} = t^2.
$$
- In fact $\mathbb{L}_r^{d+2}$ is $\mathbb{L}^{d+2}$ rotated by 45$^\circ$ in the $t_1$-$t_2$ plane, hence the rotation
$$
\begin{bmatrix} \mathbf{x} \\ t_1 \\ t_2 \end{bmatrix}
\mapsto
\begin{bmatrix} \mathbf{x} \\ (t_1-t_2)/\sqrt 2 \\ (t_1+t_2)/\sqrt 2 \end{bmatrix}
$$
is 1-to-1 and onto between the two cones.

Gurobi allows users to input second-order cone constraints and quadratic constraints directly.

Mosek allows users to input second-order cone constraints, quadratic constraints, and rotated quadratic cone constraints directly.
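A quick numeric sanity check (not part of the original notes) that the 45$^\circ$ rotation maps the rotated cone into the standard second order cone, using the $1/\sqrt 2$ scaling consistent with the factor 2 in the definition $\|\mathbf{x}\|_2^2 \le 2 t_1 t_2$:

```julia
using LinearAlgebra, Random

Random.seed!(123)
for _ in 1:100
    x  = randn(3)
    t1 = rand() + 0.1
    t2 = norm(x)^2 / (2t1) + rand()     # constructed so that ‖x‖² ≤ 2 t1 t2
    @assert norm(x)^2 ≤ 2t1 * t2 + 1e-12
    # rotate by 45° in the t1-t2 plane
    u, v = (t1 - t2) / sqrt(2), (t1 + t2) / sqrt(2)
    # image satisfies the standard cone inequality ‖(x, u)‖₂ ≤ v
    @assert norm([x; u]) ≤ v + 1e-9
end
```

The check works because $\|\mathbf{x}\|_2^2 + (t_1-t_2)^2/2 \le 2t_1t_2 + (t_1-t_2)^2/2 = (t_1+t_2)^2/2$.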
(Epigraphs of affine functions) $\mathbf{a}^T\mathbf{x} + b \le t \iff (0, t - \mathbf{a}^T\mathbf{x} - b) \in \mathbb{L}^{2}$.
(Euclidean norms) $\|\mathbf{x}\|_2 \le t \Leftrightarrow (\mathbf{x}, t) \in \mathbb{L}^{d+1}$.
(Sum of squares) $\|\mathbf{x}\|_2^2 \le t \Leftrightarrow (\mathbf{x}, t, 1/2) \in \mathbb{L}_r^{d+2}$.
(Epigraphs of quadratic-fractional functions) Let
Then, $\text{epi}\, g = \{(\mathbf{x}, s, t): \mathbf{x}^T\mathbf{x} \le s t, s \ge 0 \}$; since $\mathbf{x}^T\mathbf{x} \le st \iff (\mathbf{x}, s/2, t) \in \mathbb{L}_r^{d+2}$, the epigraph is SOCr.
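The memberships above can be verified numerically. The snippet below (a check added here, not from the notes; `inLr` is a helper defined for illustration) tests the sum-of-squares and quadratic-fractional facts under the convention $\mathbb{L}_r = \{(\mathbf{x}, t_1, t_2): \|\mathbf{x}\|_2^2 \le 2 t_1 t_2,\ t_1, t_2 \ge 0\}$:

```julia
using LinearAlgebra, Random

# membership test for the rotated cone ‖x‖² ≤ 2 t1 t2, t1, t2 ≥ 0
inLr(x, t1, t2) = t1 ≥ 0 && t2 ≥ 0 && norm(x)^2 ≤ 2t1 * t2 + 1e-12

Random.seed!(1)
x = randn(4)
# (Sum of squares)  ‖x‖² ≤ t  ⟺  (x, t, 1/2) ∈ L_r: 2·t·(1/2) = t
t = norm(x)^2
@assert inLr(x, t, 1/2)
# (Quadratic-fractional)  xᵀx/s ≤ t, s ≥ 0  ⟺  (x, s/2, t) ∈ L_r: 2·(s/2)·t = st
s = 2.5
@assert inLr(x, s/2, dot(x, x)/s)
```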
Intersection.
Cartesian product.
Minkowski sum.
Image $\mathcal{A}C$ of a SOCr set $C$ under a linear (affine) map $\mathcal{A}$.
Inverse image $\mathcal{A}^{-1}C$ of a SOCr set $C$ under a linear (affine) map $\mathcal{A}$.
Nonnegative weighted sum.
Composition with an affine mapping.
Pointwise maximum and supremum.
Separable sum. $g(\mathbf{x}_1,\dotsc,\mathbf{x}_m) = g_1(\mathbf{x}_1) + \dotsb + g_m(\mathbf{x}_m)$ is SOCr if each summand is so.
Partial minimization.
Perspective. If $f(\mathbf{x})$ is SOCr, then its perspective $g(\mathbf{x}, t)=t f(t^{-1}\mathbf{x})$ is SOCr.
which is the sum of a squared Euclidean norm (of an affine map of the variable) and an affine function. Both are SOCr.
This fact shows that QP and QCQP are instances of SOCP.
can be represented by rotated quadratic cones. For example,
$$
\begin{eqnarray*}
\mathbb{K}_{\text{gm}}^2 &=& \{(x_1, x_2, t): \sqrt{x_1 x_2} \ge t, x_1, x_2 \ge 0\} \\
&=& \{(x_1, x_2, t): (\sqrt 2 t, x_1, x_2) \in \mathbb{L}_r^3\}.
\end{eqnarray*}
$$
Then
$$
\begin{eqnarray*}
\mathbb{K}_{\text{gm}}^4 &=& \{(x_1, x_2, x_3, x_4, t): (x_1 x_2 x_3 x_4)^{1/4} \ge t, x_1, x_2, x_3, x_4 \ge 0\} \\
&=& \{(x_1, x_2, x_3, x_4, t): (\exists u_1, u_2 \ge 0)~ \sqrt{x_1 x_2} \ge u_1, \sqrt{x_3 x_4} \ge u_2, \sqrt{u_1 u_2} \ge t, ~ x_1, x_2, x_3, x_4 \ge 0\} \\
&=& \{(x_1, x_2, x_3, x_4, t): (\exists u_1, u_2 \ge 0)~ (\sqrt 2 u_1, x_1, x_2) \in \mathbb{L}_r^3, (\sqrt 2 u_2, x_3, x_4) \in \mathbb{L}_r^3, (\sqrt 2 t, u_1, u_2) \in \mathbb{L}_r^3\}
\end{eqnarray*}
$$
and so on.
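As an illustration (a numeric check added here, not from the notes; `inLr` is a helper defined for this purpose), the tower of rotated cones above indeed certifies $\mathbb{K}_{\text{gm}}^4$ membership:

```julia
using LinearAlgebra, Random

# membership test for the rotated cone ‖x‖² ≤ 2 t1 t2, t1, t2 ≥ 0
inLr(x, t1, t2) = t1 ≥ 0 && t2 ≥ 0 && norm(x)^2 ≤ 2t1 * t2 + 1e-12

Random.seed!(2)
x = rand(4) .+ 0.1
t = sqrt(sqrt(prod(x))) - 0.05          # t slightly below (x1 x2 x3 x4)^{1/4}
# auxiliary variables from the decomposition
u1, u2 = sqrt(x[1] * x[2]), sqrt(x[3] * x[4])
@assert inLr([sqrt(2) * u1], x[1], x[2])    # 2u1² ≤ 2 x1 x2
@assert inLr([sqrt(2) * u2], x[3], x[4])    # 2u2² ≤ 2 x3 x4
@assert inLr([sqrt(2) * t], u1, u2)         # 2t²  ≤ 2 u1 u2
```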
For $n \neq 2^l$, consider, e.g.,
$$
(x_1 x_2 x_3)^{1/3} \ge t \iff (x_1 x_2 x_3 t)^{1/4} \ge t.
$$

(Harmonic mean) The hypograph of the harmonic mean function $\left( n^{-1} \sum_{i=1}^n x_i^{-1} \right)^{-1}$ is SOCr.
(Rational $\ell_p$-norm) Function $\|\mathbf{x}\|_p = (\sum_{i=1}^d |x_i|^p)^{1/p}$ for $p \ge 1$ rational is SOCr.
References:
Now our catalogue of SOCP-representable functions includes all of the terms above.
Most of these functions are implemented as built-in functions in the convex optimization modeling languages cvx (for Matlab) and Convex.jl (for Julia).
where $\beta_g$ is the subvector of regression coefficients for group $g$, and $w_g$ are fixed group weights. This is equivalent to the SOCP \begin{eqnarray*} &\text{minimize}& \frac 12 \beta^T \mathbf{X}^T \left(\mathbf{I} - \frac{\mathbf{1} \mathbf{1}^T}{n} \right) \mathbf{X} \beta + \mathbf{y}^T \left(\mathbf{I} - \frac{\mathbf{1} \mathbf{1}^T}{n} \right) \mathbf{X} \beta + \lambda \sum_{g=1}^G w_g t_g \\ &\text{subject to}& \|\beta_g\|_2 \le t_g, \quad g = 1,\ldots, G, \end{eqnarray*} in variables $\beta$ and $t_1,\ldots,t_G$.
achieves sparsity at both group and individual coefficient level and can be solved by SOCP as well.
We have seen zero-sum group lasso before.
Evidently we can solve any of the previous loss functions (quantile, $\ell_1$, composite quantile, Huber, multi-response model) plus a group or sparse group penalty by SOCP.
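As a sanity check on the group lasso reformulation (a sketch added here with a hypothetical grouping and weights, not from the notes): at the optimum each auxiliary variable $t_g$ sits at its lower bound $\|\beta_g\|_2$, so the SOCP term $\sum_g w_g t_g$ coincides with the group lasso penalty:

```julia
using LinearAlgebra, Random

Random.seed!(3)
groups = [1:3, 4:6, 7:10]           # hypothetical grouping of 10 coefficients
w = sqrt.(length.(groups))          # a common choice: w_g = √|g|
β = randn(10)
# group lasso penalty computed directly
penalty_direct = sum(w[g] * norm(β[groups[g]]) for g in eachindex(groups))
# SOCP form: minimal feasible t_g subject to ‖β_g‖₂ ≤ t_g is t_g = ‖β_g‖₂
t = [norm(β[gr]) for gr in groups]
@assert all(norm(β[groups[g]]) ≤ t[g] for g in eachindex(groups))
@assert penalty_direct ≈ sum(w .* t)
```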
by SOCP. This variant generates the same solution path as lasso (why?) but simplifies the choice of $\lambda$.
using Statistics
using Random, LinearAlgebra, SparseArrays
Random.seed!(123) # seed
n, p = 100, 10
# Design matrix
X0 = randn(n, p)
X = [ones(n, 1) (X0 .- mean(X0 ,dims=1)) ./ std(X0, dims=1)] # design matrix is standardized and includes intercept
# True regression coefficients (first 5 are non-zero)
β = [1.0; randn(5); zeros(p - 5)]
# Responses
y = X * β + randn(n)
# plot the true β
using Plots; gr()
plt = plot(1:p, β[2:end], line=:stem, marker=2, legend=:none)
xlabel!(plt, "coordinate")
ylabel!(plt, "beta_j")
title!(plt, "True beta")
using Convex
## Use Mosek solver
#using Mosek, MosekTools
#opt = () -> Mosek.Optimizer(LOG=0)
## Use SCS solver
#using SCS
#opt = () -> SCS.Optimizer(verbose=0)
# Use COSMO solver
using COSMO
opt = () -> COSMO.Optimizer(max_iter=10000, verbose=false)
# solve at a grid of λ
λgrid = 0:10:200
β̂path = zeros(length(λgrid), size(X, 2)) # each row is β̂ at a λ
β̂lasso = Variable(size(X, 2))
@time for i in 1:length(λgrid)
λ = λgrid[i]
# objective
problem = minimize(0.5sumsquares(y - X * β̂lasso) + λ * norm(β̂lasso[2:end], 1))
solve!(problem, opt)
β̂path[i, :] = β̂lasso.value
end
plt = plot(collect(λgrid), β̂path, legend=:none)
xlabel!(plt, "lambda")
ylabel!(plt, "beta_hat")
title!(plt, "Lasso")
36.154562 seconds (92.12 M allocations: 5.221 GiB, 7.07% gc time, 1.75% compilation time)
can be formulated as a SOCP. Why? For instance, $\ell_{3/2}$ regression combines the advantages of both robust $\ell_1$ regression and least squares.
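One way to see the SOCr structure for, say, $p = 3/2$ (a sketch consistent with the rational $\ell_p$-norm fact above): for $r, t \ge 0$,
$$
r^{3/2} \le t \iff \exists s \ge 0: ~ r^2 \le s t \text{ and } s^2 \le r,
$$
since $s^2 \le r$ gives $s \le \sqrt r$, hence $r^2 \le st \le \sqrt r \, t$, i.e., $r^{3/2} \le t$; conversely take $s = \sqrt r$. Both constraints are rotated cone memberships: $r^2 \le st \iff (\sqrt 2 r, s, t) \in \mathbb{L}_r^3$ and $s^2 \le r \iff (\sqrt 2 s, r, 1) \in \mathbb{L}_r^3$. Applying this to each absolute residual $|y_i - \mathbf{x}_i^T \beta|$ (itself SOCP-representable) turns $\ell_{3/2}$ regression into an SOCP.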
λlist = 0:0.5:10
βpath = zeros(length(λlist), size(X, 2)) # each row is β̂ at a λ
βsqrtlasso = Variable(size(X, 2))
t = Variable()
@time for i = 1:length(λlist)
λ = λlist[i]
problem = minimize(norm(y - X * βsqrtlasso, 2) + λ * norm(βsqrtlasso[2:end], 1))
solve!(problem, opt)
βpath[i, :] = βsqrtlasso.value
end
plt = plot(collect(λlist), βpath, legend=:none)
xlabel!(plt, "lambda")
ylabel!(plt, "beta_hat")
title!(plt, "Square-root Lasso")
0.281433 seconds (570.34 k allocations: 50.319 MiB, 19.59% gc time, 66.93% compilation time)
Rudin, L.I., Osher, S. and Fatemi, E., 1992. Nonlinear total variation based noise removal algorithms. Physica D: nonlinear phenomena, 60(1-4), pp.259-268. https://doi.org/10.1016/0167-2789(92)90242-F
Goldfarb, D. and Yin, W., 2005. Second-order cone programming methods for total variation-based image restoration. SIAM Journal on Scientific Computing, 27(2), pp.622-645. https://doi.org/10.1137/040608982
Many parts of this lecture note are based on Dr. Hua Zhou's 2019 Spring Statistical Computing course notes available at http://hua-zhou.github.io/teaching/biostatm280-2019spring/index.html.