Authors: Jozef Hanč, Martina Hančová, Andrej Gajdoš
Faculty of Science, P. J. Šafárik University in Košice, Slovakia
email: martina.hancova@upjs.sk
EBLUP-NE for tourism
Python-based computational tools - SciPy, CVXPY
- `solver` - one of the convex optimization solvers ECOS, OSQP, or SCS, chosen according to the given problem
* **OSQP** for convex quadratic problems
* `max_iter` - maximum number of iterations (default: 10000).
* `eps_abs` - absolute accuracy (default: 1e-4).
* `eps_rel` - relative accuracy (default: 1e-4).
* **ECOS** for convex second-order cone problems
* `max_iters` - maximum number of iterations (default: 100).
* `abstol` - absolute accuracy (default: 1e-7).
* `reltol` - relative accuracy (default: 1e-6).
* `feastol` - tolerance for feasibility conditions (default: 1e-7).
* `abstol_inacc` - absolute accuracy for inaccurate solution (default: 5e-5).
* `reltol_inacc` - relative accuracy for inaccurate solution (default: 5e-5).
* `feastol_inacc` - tolerance for feasibility condition for inaccurate solution (default: 1e-4).
* **SCS** for large-scale convex cone problems
* `max_iters` - maximum number of iterations (default: 2500).
* `eps` - convergence tolerance (default: 1e-4).
* `alpha` - relaxation parameter (default: 1.8).
* `scale` - balance between minimizing primal and dual residual (default: 5.0).
* `normalize` - whether to precondition data matrices (default: True).
    * `use_indirect` - whether to use an indirect solver for the KKT system (instead of direct) (default: True).
# __future__ imports must come first in the cell
from __future__ import print_function, division
import cvxpy
import numpy as np
import pandas as pd
import platform as pt
from cvxpy import *
from math import cos, sin
from numpy.linalg import inv, norm
from itertools import product
np.set_printoptions(precision=10)
# software versions
print('cvxpy:'+cvxpy.__version__)
print('numpy:'+np.__version__)
print('pandas:'+pd.__version__)
print('python:'+pt.python_version())
cvxpy:1.0.1
numpy:1.14.5
pandas:0.23.4
python:2.7.15
In this econometric FDSLRM application, we consider the time series data set, called visnights, representing total quarterly visitor nights (in millions) from 1998-2016 in one of the regions of Australia, the inner zone of the state of Victoria. The number of time series observations is $n=76$. The data were adapted from Hyndman, 2018.
The Gaussian orthogonal FDSLRM fitting the tourism data has the following form:
$\begin{array}{rcl} X(t) & \!=\! & \beta_1+\beta_2\cos\left(\tfrac{2\pi t}{76}\right)+\beta_3\sin\left(\tfrac{2\pi t\cdot 2}{76}\right)+{} \\ & & +\,Y_1\cos\left(\tfrac{2\pi t\cdot 19}{76}\right)+Y_2\sin\left(\tfrac{2\pi t\cdot 19}{76}\right)+Y_3\cos\left(\tfrac{2\pi t\cdot 38}{76}\right)+w(t), \quad t\in\mathbb{N}, \end{array}$
where $\boldsymbol{\beta}=(\beta_1,\,\beta_2,\,\beta_3)' \in \mathbb{R}^3, \mathbf{Y} = (Y_1, Y_2, Y_3)' \sim \mathcal{N}_3(\boldsymbol{0}, \mathrm{D}), w(t) \sim \mathrm{iid}\, \mathcal{N} (0, \sigma_0^2), \boldsymbol{\nu}=(\sigma_0^2, \sigma_1^2, \sigma_2^2, \sigma_3^2) \in \mathbb{R}_{+}^4$.
We identified the given, most parsimonious structure of the FDSLRM using an iterative process of model building and selection based on exploratory tools of spectral analysis and residual diagnostics (for details see our Jupyter notebook tourism.ipynb).
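As an illustration of the model's structure, one trajectory can be simulated with plain NumPy. The parameter values below are hypothetical, chosen only for this sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 76
t = np.arange(1, n + 1)

# fixed effects and variances nu = (sigma_0^2, ..., sigma_3^2) - hypothetical values
beta = np.array([1.0, 0.5, 0.3])
nu = np.array([0.10, 0.004, 0.23, 0.02])

# mean part: beta_1 + beta_2 cos(2*pi*t/76) + beta_3 sin(2*pi*2t/76)
mean = beta[0] + beta[1]*np.cos(2*np.pi*t/n) + beta[2]*np.sin(2*np.pi*2*t/n)

# random part: Y_j ~ N(0, sigma_j^2) at the significant frequencies
Y = rng.normal(0.0, np.sqrt(nu[1:]))
rand = (Y[0]*np.cos(2*np.pi*19*t/n) + Y[1]*np.sin(2*np.pi*19*t/n)
        + Y[2]*np.cos(2*np.pi*38*t/n))

# white noise w(t) ~ iid N(0, sigma_0^2)
X = mean + rand + rng.normal(0.0, np.sqrt(nu[0]), n)
print(X.shape)  # (76,)
```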
# data - time series observation
path = 'tourism.csv'
data = pd.read_csv(path, sep=';', usecols=[1])
# observation x as matrix
x = np.asmatrix(data.values)
x.shape
(76L, 1L)
# model parameters
n, k, l = 76, 3, 3
# significant frequencies
om1, om2, om3, om4 = 2*np.pi/76, 2*np.pi*2/76, 2*np.pi*19/76, 2*np.pi*38/76
# model - design matrices F', F, V',V
Fc = np.mat([[1 for t in range(1,n+1)],
             [cos(om1*t) for t in range(1,n+1)],
             [sin(om2*t) for t in range(1,n+1)]])
Vc = np.mat([[cos(om3*t) for t in range(1,n+1)],
             [sin(om3*t) for t in range(1,n+1)],
             [cos(om4*t) for t in range(1,n+1)]])
F, V = Fc.T, Vc.T
# columns vj of V and their squared norm ||vj||^2
vv = lambda j: V[:,j-1]
nv2 = lambda j: np.trace(vv(j).T*vv(j))
# auxiliary matrices and vectors
# Gram matrices GF, GV
GF, GV = Fc*F, Vc*V
InvGF, InvGV = inv(GF), inv(GV)
# projectors PF, MF, PV, MV
In = np.identity(n)
PF = F*InvGF*Fc
PV = V*InvGV*Vc
MF, MV = In-PF, In-PV
# residuals e, e'
e = MF*x
ec = e.T
# orthogonality condition
Fc*V, GV
(matrix([[ 3.5527136788e-15, -1.0658141036e-14,  0.0000000000e+00],
         [ 7.6265746656e-15, -2.2367944157e-15,  1.9984014443e-15],
         [ 7.6719645096e-15, -5.5627717914e-15, -4.6210314404e-16]]),
 matrix([[ 3.8000000000e+01, -5.5047207898e-16, -3.2196467714e-15],
         [-5.5047207898e-16,  3.8000000000e+01, -9.3258734069e-15],
         [-3.2196467714e-15, -9.3258734069e-15,  7.6000000000e+01]]))
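The near-zero entries of $\mathrm{F'V}$ confirm the orthogonality of the model. With plain NumPy arrays the same check can be written as a standalone sketch rebuilding the design matrices:

```python
import numpy as np

n = 76
t = np.arange(1, n + 1)
F = np.column_stack([np.ones(n),
                     np.cos(2*np.pi*t/n),
                     np.sin(2*np.pi*2*t/n)])
V = np.column_stack([np.cos(2*np.pi*19*t/n),
                     np.sin(2*np.pi*19*t/n),
                     np.cos(2*np.pi*38*t/n)])

# F'V vanishes (up to floating point) and V'V is diagonal: the model is orthogonal
print(np.allclose(F.T @ V, 0))                      # True
print(np.allclose(V.T @ V, np.diag([38, 38, 76])))  # True
```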
using formula (4.1) from Hancova et al 2019
$\renewcommand{\arraystretch}{1.4} \breve{\boldsymbol{\nu}}(\mathbf{e}) = \begin{pmatrix} \tfrac{1}{n-k-l}\,\mathbf{e}'\,\mathrm{M_V}\,\mathbf{e} \\ (\mathbf{e}'\mathbf{v}_1)^2/||\mathbf{v}_1||^4 \\ \vdots \\ (\mathbf{e}'\mathbf{v}_l)^2/||\mathbf{v}_l||^4 \end{pmatrix}$
# NE according to formula (4.1)
NE0 = [1/(n-k-l)*np.trace(ec*MV*e)]
NEj = [(np.trace(ec*vv(j))/nv2(j))**2 for j in range(1,l+1)]
NE = NE0+NEj
print(NE, norm(NE))
[0.10766780139512397, 0.0039056202882091925, 0.23030624879955963, 0.02227313104780322] 0.25523453906141463
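For reference, formula (4.1) can also be sketched with plain NumPy arrays instead of `np.matrix`. The sketch below uses synthetic data, since the actual tourism series is read from tourism.csv above:

```python
import numpy as np

# standalone sketch of formula (4.1) on synthetic data
rng = np.random.default_rng(0)
n, k, l = 76, 3, 3
t = np.arange(1, n + 1)
F = np.column_stack([np.ones(n), np.cos(2*np.pi*t/n), np.sin(2*np.pi*2*t/n)])
V = np.column_stack([np.cos(2*np.pi*19*t/n), np.sin(2*np.pi*19*t/n),
                     np.cos(2*np.pi*38*t/n)])
x = rng.normal(size=n)

# projectors M_F, M_V and residuals e = M_F x
MF = np.eye(n) - F @ np.linalg.solve(F.T @ F, F.T)
MV = np.eye(n) - V @ np.linalg.solve(V.T @ V, V.T)
e = MF @ x

# NE = (e'M_V e/(n-k-l), (e'v_j)^2/||v_j||^4, j = 1..l)
NE = np.concatenate(([e @ MV @ e / (n - k - l)],
                     (e @ V)**2 / np.sum(V*V, axis=0)**2))
print(NE.shape)  # (4,)
```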
NE as a convex optimization problem
$\begin{array}{ll} \textit{minimize} & \quad f_0(\boldsymbol{\nu})=||\mathbf{e}\mathbf{e}' - \mathrm{VDV'}||^2+||\mathrm{M_V}\mathbf{e}\mathbf{e}'\mathrm{M_V}-\nu_0\mathrm{M_F}\mathrm{M_V}||^2 \\[6pt] \textit{subject to} & \quad \boldsymbol{\nu} = \left(\nu_0, \ldots, \nu_l\right)'\in [0, \infty)^{l+1} \end{array}$
# the optimization variable, objective function
v = Variable(l+1)
fv = sum_squares(e*ec-V*diag(v[1:])*Vc)+sum_squares(MV*e*ec*MV-v[0]*MF*MV)
# the optimization problem for NE
objective = Minimize(fv)
constraints = [v >= 0]
prob = Problem(objective,constraints)
# solve the NE problem
prob.solve()
NEcvx = v.value
print('NEcvx =', NEcvx, ' norm= ', norm(NEcvx))
NEcvx = [0.1076678015 0.0039056763 0.2303062488 0.0222731307] norm= 0.2552345399269671
using formula (3.10) from Hancova et al 2019.
$\mathring{\nu}_j = \rho_j^2\, \breve{\nu}_j,\ j = 0,1,\ldots, l; \qquad \rho_0 = 1,\ \rho_j = \dfrac{\hat{\nu}_j||\mathbf{v}_j||^2}{\hat{\nu}_0+\hat{\nu}_j||\mathbf{v}_j||^2}$
where $\boldsymbol{\breve{\nu}}$ are NE, $\boldsymbol{\hat{\nu}}$ are initial estimates for EBLUP-NE
# EBLUP-NE based on formula (3.9)
rho2 = lambda est: [1] + [ (est[j]*nv2(j)/(est[0]+est[j]*nv2(j)))**2 for j in range(1,l+1) ]
EBLUPNE = lambda est: [rho2(est)[j]*NE[j] for j in range(l+1)]
# numerical results
print(rho2(NE))
[1, 0.33588549715479415, 0.9758415471932069, 0.8839735951347862]
print(EBLUPNE(NE), norm(EBLUPNE(NE)))
[0.10766780139512397, 0.001311841212202995, 0.22474240615682592, 0.01968885972723484] 0.2499817527483643
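The factors $\rho_j^2$ act as shrinkage weights in $[0,1)$: a component whose $\hat{\nu}_j||\mathbf{v}_j||^2$ is small relative to $\hat{\nu}_0$ is shrunk strongly. A quick numeric check using the NE values printed above and $||\mathbf{v}_3||^2 = 76$:

```python
# shrinkage factor rho_j^2 = (nu_j ||v_j||^2 / (nu_0 + nu_j ||v_j||^2))^2
rho2_j = lambda nu0, nuj, nvj2: (nuj*nvj2 / (nu0 + nuj*nvj2))**2

# nu_0 and nu_3 taken from the NE output above, ||v_3||^2 = 76
r = rho2_j(0.10766780139512397, 0.02227313104780322, 76.0)
print(round(r, 7))  # 0.8839736, matching the third factor printed above
```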
using formula (3.6) for general FDSLRM from Hancova et al 2019.
$\mathring{\nu}_0 = \breve{\nu}_0, \quad \mathring{\nu}_j = \left(\mathbf{Y}^*\right)_j^2,\ j = 1, 2, \ldots, l; \qquad \mathbf{Y}^* = \mathbb{T}\mathbf{X} \mbox{ with } \mathbb{T} = \mathrm{D}\mathbb{U}^{-1}\mathrm{V}'\mathrm{M_F},\ \mathbb{U} = \mathrm{V}'\mathrm{M_F}\mathrm{V}\mathrm{D} + \nu_0 \mathrm{I}_l$
def EBLUPNEgen(est):
    D = np.diag(est[1:])
    U = Vc*MF*V*D + est[0]*np.identity(l)
    T = D*inv(U)*Vc*MF
    eest = np.vstack((np.matrix(NE[0]), np.multiply(T*x, T*x)))
    return np.array(eest).flatten().tolist()
print(EBLUPNEgen(NE), norm(EBLUPNEgen(NE)))
[0.10766780139512397, 0.001311841212202979, 0.22474240615682617, 0.019688859727234925] 0.2499817527483645
using the KKT algorithm (tab. 3, Hancova et al 2019)
$\qquad \mathbf{q} = \left(\begin{array}{c} \mathbf{e}'\mathbf{e} \\ (\mathbf{e}'\mathbf{v}_{1})^2 \\ \vdots \\ (\mathbf{e}'\mathbf{v}_{l})^2 \end{array}\right), \qquad \mathrm{G} = \left(\begin{array}{ccccc} n^* & ||\mathbf{v}_{1}||^2 & ||\mathbf{v}_{2}||^2 & \ldots & ||\mathbf{v}_{l}||^2 \\ ||\mathbf{v}_{1}||^2 & ||\mathbf{v}_{1}||^4 & 0 & \ldots & 0 \\ ||\mathbf{v}_{2}||^2 & 0 & ||\mathbf{v}_{2}||^4 & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ ||\mathbf{v}_{l}||^2 & 0 & 0 & \ldots & ||\mathbf{v}_{l}||^4 \end{array}\right)$
## SciPy(Numpy)
# Input: form G
ns, nvj = n, norm(V, axis=0)
u, v, Q = np.mat(ns), np.mat(nvj**2), np.diag(nvj**4)
G = np.bmat([[u,v],[v.T,Q]])
# form q
e2, Ve2 = ec*e, np.multiply(Vc*e, Vc*e)
q = np.vstack((e2, Ve2))
# body of the algorithm
for b in product([0,1], repeat=l):
    # set the KKT-conditions matrix K
    K = G*1
    for j in range(1,l+1):
        if b[j-1] == 0: K[0,j], K[j,j] = 0, -1
    # calculate the auxiliary vector g
    g = inv(K)*q
    # test non-negativity of g
    if (g >= 0).all(): break
# Output: form estimates nu
nu = g*1
for j in range(1,l+1):
    if b[j-1] == 0: nu[j] = 0
NN_DOOLSE = np.array(nu).flatten()
print(NN_DOOLSE, norm(NN_DOOLSE),b)
[0.1032430972 0.0011886967 0.2275893252 0.0209146692] 0.2507885054268491 (1, 1, 1)
nonnegative DOOLSE as a convex optimization problem
$\begin{array}{ll} \textit{minimize} & f_0(\boldsymbol{\nu})=||\mathbf{e}\mathbf{e}'-\Sigma_\boldsymbol{\nu}||^2 \\[6pt] \textit{subject to} & \boldsymbol{\nu} = \left(\nu_0, \ldots, \nu_l\right)'\in [0, \infty)^{l+1} \end{array}$
# set the optimization variable, objective function
v = Variable(l+1)
fv = sum_squares(e*e.T - v[0]*In - V*diag(v[1:])*V.T)
# construct the problem for DOOLSE
objective = Minimize(fv)
constraints = [v >= 0]
prob = Problem(objective,constraints)
# solve the DOOLSE problem
prob.solve()
NN_DOOLSEcvx = v.value
print('NN-DOOLSEcvx =', NN_DOOLSEcvx, 'norm=', norm(NN_DOOLSEcvx))
NN-DOOLSEcvx = [0.103243196 0.001188791 0.2275893166 0.0209146922] norm= 0.2507885406842395
using equivalent (RE)MLE convex problem (proposition 5, Hancova et al 2019)
$\begin{array}{ll} \textit{minimize} & \quad f_0(\mathbf{d})=-(n^*\!-l)\ln d_0 - \displaystyle\sum\limits_{j=1}^{l} \ln(d_0-d_j||\mathbf{v}_j||^2)+d_0\mathbf{e}'\mathbf{e}-\mathbf{e}'\mathrm{V}\,\mathrm{diag}\{d_j\}\mathrm{V}'\mathbf{e} \\[6pt] \textit{subject to} & \quad d_0 > \max\{d_j||\mathbf{v}_j||^2, j = 1, \ldots, l\} \\ & \quad d_j \geq 0, j=1,\ldots, l \\[6pt] & \quad\text{for MLE: } n^* = n, \text{ for REMLE: } n^* = n-k \\ \textit{back transformation:} & \quad \nu_0 = \dfrac{1}{d_0}, \quad \nu_j = \dfrac{d_j}{d_0\left(d_0 -d_j||\mathbf{v}_j||^2\right)} \end{array}$
# set variables for the objective
ns = n
d = Variable(l+1)
logdetS = (ns-l)*log(d[0])+sum(log(d[0]-GV*d[1:]))
# construct the problem
objective = Maximize(logdetS-(d[0]*ec*e-ec*V*diag(d[1:])*Vc*e))
constraints = [0 <= d[1:], max(GV*d[1:]) <= d[0]]
prob = Problem(objective,constraints)
# solve the problem
solution = prob.solve()
dv = d.value.tolist()
# back transformation
s0 = [1/dv[0]]
sj = [dv[i]/(dv[0]*(dv[0]-dv[i]*GV[i-1,i-1])) for i in range(1,l+1)]
sv = s0+sj
print('REMLEcvx = ', sv, ' norm = ', norm(sv))
REMLEcvx = [0.1032431005586019, 0.001188696520735739, 0.22758937066419727, 0.020914668029817677] norm = 0.25078854796520283
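The back transformation is invertible, so $\mathbf{d}$ and $\boldsymbol{\nu}$ carry the same information. A small roundtrip check, where the forward map $d_j = \nu_j/\left(\nu_0(\nu_0+\nu_j||\mathbf{v}_j||^2)\right)$ is derived here by inverting the formulas above (it is not quoted from the paper):

```python
import numpy as np

# hypothetical values close to the REMLE output above
nu0, nuj, nvj2 = 0.1032431, 0.0011887, 38.0

# forward map nu -> d (derived by inverting the back transformation)
d0 = 1.0/nu0
dj = nuj/(nu0*(nu0 + nuj*nvj2))

# back transformation d -> nu as implemented in the cell above
nu0_back = 1.0/d0
nuj_back = dj/(d0*(d0 - dj*nvj2))
print(np.isclose(nu0_back, nu0), np.isclose(nuj_back, nuj))  # True True
```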
## SciPy(Numpy)
# numerical results
print(rho2(NN_DOOLSE))
[1, 0.09263221759409881, 0.9765451623936303, 0.8817377950062207]
print(EBLUPNE(NN_DOOLSE),norm(EBLUPNE(NN_DOOLSE)))
[0.10766780139512397, 0.00036178626837732086, 0.2249044531342338, 0.01963906145797461] 0.2501203552714627
#cross-checking
print(EBLUPNEgen(NN_DOOLSE), norm(EBLUPNEgen(NN_DOOLSE)))
[0.10766780139512397, 0.0003617862683773213, 0.2249044531342336, 0.01963906145797493] 0.25012035527146254
# Input: form G
ns, nvj = n-k, norm(V, axis=0)
u, v, Q = np.mat(ns), np.mat(nvj**2), np.diag(nvj**4)
G = np.bmat([[u,v],[v.T,Q]])
# form q
e2, Ve2 = ec*e, np.multiply(Vc*e, Vc*e)
q = np.vstack((e2, Ve2))
# body of the algorithm
for b in product([0,1], repeat=l):
    # set the KKT-conditions matrix K
    K = G*1
    for j in range(1,l+1):
        if b[j-1] == 0: K[0,j], K[j,j] = 0, -1
    # calculate the auxiliary vector g
    g = inv(K)*q
    # test non-negativity of g
    if (g >= 0).all(): break
# Output: form estimates nu
nu = g*1
for j in range(1,l+1):
    if b[j-1] == 0: nu[j] = 0
NN_MDOOLSE = np.array(nu).flatten()
print(NN_MDOOLSE, norm(NN_MDOOLSE),b)
[0.1076678014 0.0010722571 0.2274728856 0.0208564495] 0.2525319986886 (1, 1, 1)
nonnegative MDOOLSE as a convex optimization problem
$\begin{array}{ll} \textit{minimize} & f_0(\boldsymbol{\nu})=||\mathbf{e}\mathbf{e}'-\mathrm{M_F}\Sigma_\boldsymbol{\nu}\mathrm{M_F}||^2 \\[6pt] \textit{subject to} & \boldsymbol{\nu} = \left(\nu_0, \ldots, \nu_l\right)'\in [0, \infty)^{l+1} \end{array}$
# set the optimization variable, objective function
v = Variable(l+1)
fv = sum_squares(e*ec - v[0]*MF - V*diag(v[1:])*Vc)
# the optimization problem for MDOOLSE
objective = Minimize(fv)
constraints = [v >= 0]
prob = Problem(objective,constraints)
# solve the MDOOLSE problem
prob.solve()
NN_MDOOLSEcvx = v.value
print('NN-MDOOLSEcvx =', NN_MDOOLSEcvx, 'norm =', norm(NN_MDOOLSEcvx) )
NN-MDOOLSEcvx = [0.107668313 0.0010725165 0.2274728524 0.020856525 ] norm = 0.2525321942497591
using equivalent (RE)MLE convex problem (proposition 5, Hancova et al 2019)
$\begin{array}{ll} \textit{minimize} & \quad f_0(\mathbf{d})=-(n^*\!-l)\ln d_0 - \displaystyle\sum\limits_{j=1}^{l} \ln(d_0-d_j||\mathbf{v}_j||^2)+d_0\mathbf{e}'\mathbf{e}-\mathbf{e}'\mathrm{V}\,\mathrm{diag}\{d_j\}\mathrm{V}'\mathbf{e} \\[6pt] \textit{subject to} & \quad d_0 > \max\{d_j||\mathbf{v}_j||^2, j = 1, \ldots, l\} \\ & \quad d_j \geq 0, j=1,\ldots, l \\[6pt] & \quad\text{for MLE: } n^* = n, \text{ for REMLE: } n^* = n-k \\ \textit{back transformation:} & \quad \nu_0 = \dfrac{1}{d_0}, \quad \nu_j = \dfrac{d_j}{d_0\left(d_0 -d_j||\mathbf{v}_j||^2\right)} \end{array}$
# set variables for the objective
ns = n - k
d = Variable(l+1)
logdetS = (ns-l)*log(d[0])+sum(log(d[0]-GV*d[1:]))
# construct the problem
objective = Maximize(logdetS-(d[0]*ec*e-ec*V*diag(d[1:])*Vc*e))
constraints = [0 <= d[1:], max(GV*d[1:]) <= d[0]]
prob = Problem(objective,constraints)
# solve the problem
solution = prob.solve()
dv = d.value.tolist()
# back transformation
s0 = [1/dv[0]]
sj = [dv[i]/(dv[0]*(dv[0]-dv[i]*GV[i-1,i-1])) for i in range(1,l+1)]
sv = s0+sj
print('REMLEcvx = ', sv, ' norm = ', norm(sv))
REMLEcvx = [0.10766780248874558, 0.001072255859342178, 0.22747309845443972, 0.020856451321299943] norm = 0.25253219103228086
## SciPy(Numpy)
# numerical results
print(rho2(NN_MDOOLSE))
[1, 0.07537335031457347, 0.9755461750828655, 0.8768356719511479]
print(EBLUPNE(NN_MDOOLSE),norm(EBLUPNE(NN_MDOOLSE)))
[0.10766780139512397, 0.00029437968617889685, 0.22467438011409316, 0.01952987582875651] 0.2499048523862595
#cross-checking
print(EBLUPNEgen(NN_MDOOLSE), norm(EBLUPNEgen(NN_MDOOLSE)))
[0.10766780139512397, 0.00029437968617889316, 0.22467438011409285, 0.019529875828756697] 0.24990485238625926
This notebook belongs to the supplementary materials of the paper submitted to Statistical Papers and available at https://arxiv.org/abs/1905.07771.
We propose a two-stage estimation method for variance components in time series models known as FDSLRMs, whose observations can be described by a linear mixed model (LMM). We base the estimation of variances, fundamental quantities in the time series forecasting approach called kriging, on the empirical (plug-in) best linear unbiased predictions of the unobservable random components in the FDSLRM.
The method, providing invariant non-negative quadratic estimators, can be used for any absolutely continuous probability distribution of time series data. As a result of applying convex optimization and the LMM methodology, we resolved two problems $-$ the theoretical existence of, and equivalence between, least squares estimators, non-negative (M)DOOLSE, and maximum likelihood estimators, (RE)MLE, as possible starting points of our method, and the practical lack of a computational implementation for FDSLRM. As for computing (RE)MLE in the case of $n$ observed time series values, we also discovered a new algorithm of order $\mathcal{O}(n)$, which at the default precision is $10^7$ times more accurate and $n^2$ times faster than the best current Python (or R)-based computational packages, namely CVXPY, CVXR, nlme, sommer and mixed.
We illustrate our results on three real data sets $-$ electricity consumption, tourism and cyber security $-$ which are easily available, reproducible, sharable and modifiable in the form of interactive Jupyter notebooks.