Notebook

Authors: Andrej Gajdoš, Martina Hančová, Jozef Hanč
Faculty of Science, P. J. Šafárik University in Košice, Slovakia
emails: andrej.gajdos@student.upjs.sk, martina.hancova@upjs.sk

FDSLRM applications - Tourism

Quarterly visitor nights for a region of Australia

Table of Contents

Data and model - data and model description, estimating parameters, software
Modeling - loading R functions and packages, data plot, periodogram
Residual diagnostics - description of graphical tools, numerical tests
Fitting summary - estimated model parameters, fit summary
Session info - list of applied R packages in computations
References - list of detailed references for data and applied methods
Appendix - Tools, R functions - brief help on applied diagnostic tools, R functions

To get back to the contents, use the Home key.

Data and model

Data description

In this FDSLRM application we model the econometric time series data set visnights, which represents total quarterly visitor nights (in millions) from 1998 to 2016 in one region of Australia, the inner zone of Victoria state. The number of time series observations is $n=76$ (19 years of quarterly data); the corresponding plot with more details is shown in the following section Modeling. The data was adapted from Hyndman, 2018.

Model description

The tourism data can be successfully fitted by the FDSLRM of the form:

$$ X(t)=\beta_1+\beta_2\cos\left(\tfrac{2\pi t}{76}\right)+\beta_3\sin\left(\tfrac{2\pi t\cdot 2}{76}\right) +Y_1\cos\left(\tfrac{2\pi t\cdot 19 }{76}\right)+Y_2\sin\left(\tfrac{2\pi t\cdot 19}{76}\right)+Y_3\cos\left(\tfrac{2\pi t\cdot 38}{76}\right) +w(t), \, t\in \mathbb{N},$$

where

$\boldsymbol{\beta}=(\beta_1,\,\beta_2,\,\beta_3)' \in \mathbb{R}^3\,$ is a vector of real regression coefficients,
$\mathbf{Y} = (Y_1, Y_2, Y_3)' \sim \mathcal{N}_3(\boldsymbol{0}, \mathrm{D})\,$ is an unobservable Gaussian random vector with zero mean vector and covariance matrix

$\mathrm{D} = \,\scriptstyle \begin{pmatrix} \sigma_1^2 & 0 & 0 \\ 0 & \sigma_2^2 & 0 \\ 0 & 0 &\sigma_3^2 \end{pmatrix}$,

$w(t) \sim \mathcal{iid}\, \mathcal{N} (0, \sigma_0^2)\,$ is Gaussian iid noise with variance $\sigma_0^2$,
$\boldsymbol{\nu}= (\sigma_0^2, \sigma_1^2, \sigma_2^2, \sigma_3^2) \in \mathbb{R}_{+}^4 \,$ is a vector of real nonnegative variance-covariance parameters.
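To make the model definition concrete, the following base R sketch simulates one realization of an FDSLRM with this structure. It uses the harmonics listed in the Remark of the Modeling section (orders 1 and 2 in the trend, 19 and 38 in the random component). The parameter values and the seed are illustrative assumptions, not results of this notebook.

```r
# Simulate one realization of an FDSLRM with the structure described above.
# All numeric values below are illustrative assumptions.
set.seed(1)                            # arbitrary seed, for reproducibility
n     <- 76
times <- 1:n

beta   <- c(4.25, 0.26, -0.25)         # assumed regression coefficients
sigma2 <- c(0.11, 0.001, 0.23, 0.02)   # assumed (sigma_0^2, ..., sigma_3^2)

# trend m(t): constant plus two low-frequency harmonics
m <- beta[1] +
  beta[2] * cos(2 * pi * times / n) +
  beta[3] * sin(2 * pi * 2 * times / n)

# random amplitudes Y ~ N_3(0, D) with diagonal D, drawn once per realization
Y <- rnorm(3, mean = 0, sd = sqrt(sigma2[2:4]))

# finite discrete spectrum errors plus white noise w(t)
fds <- Y[1] * cos(2 * pi * 19 * times / n) +
  Y[2] * sin(2 * pi * 19 * times / n) +
  Y[3] * cos(2 * pi * 38 * times / n)
w <- rnorm(n, mean = 0, sd = sqrt(sigma2[1]))

x_sim <- m + fds + w
print(head(round(x_sim, 3)))
```

Note that the amplitudes $Y_j$ are drawn once for the whole series, while $w(t)$ is drawn independently at every time point; this is what distinguishes the two error components.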

We identified the given, most parsimonious structure of the FDSLRM using an iterative process of model building and selection based on exploratory tools of spectral analysis (Gajdoš et al., 2017; Brockwell & Davis, 2006) and residual diagnostics (see sections Modeling and Residual diagnostics).

Estimating the model parameters

During the modelling, we obtained:

estimates $\boldsymbol{\nu^*}=(0.108, 0.001, 0.227, 0.021)'$ of variance-covariance components $\boldsymbol{\nu} $ by residual maximum likelihood (REML),
estimates $\boldsymbol{\beta^*}=(4.254, 0.256, -0.247)'$ of regression parameters in trend $\boldsymbol{\beta} $ by weighted least squares,
predictions $\mathbf{Y^*}=(-0.017, 0.474, -0.140)'$ of the random vector $\mathbf{Y}$ using the best linear unbiased predictor (BLUP) procedure.

The methods and procedures for FDSLRM are described in more detail in Štulajter 2002, 2003; Gajdoš et al., 2017. Estimated values of $\boldsymbol{\beta}, \boldsymbol{\nu}$ and predictions of $\mathbf{Y}$ are also presented in the form of tables in Fitting summary.

Computational software

We conducted our numerical computations in the R statistical computing language (https://www.r-project.org; R Core Team, 2018) with the key libraries nlme (Pinheiro et al., 2018; Galecki & Burzykowski, 2013), fpp2 (Hyndman, 2018), R functions for LMM programmed by Singer (Singer et al., 2017) and R functions for FDSLRM programmed by the authors of this Jupyter notebook and included in the fdslrm package. The complete list of used R libraries is included in Session info.

Modeling

Remark.

Mean value (or FDSLRM trend) component $m(t)$ of our FDSLRM is the real function in the form of:

the linear regression (LR) $-$ the linear combination of deterministic real functions $ f_1(t)=1,\,f_2(t)=\cos\left(\tfrac{2\pi t}{76}\right),\,f_3(t)=\sin\left(\tfrac{2\pi t\cdot 2 }{76}\right)$ with real amplitudes $\beta_1, \beta_2, \beta_3$

$$ m(t) = \sum\limits_{i=1}^{3}f_i(t)\beta_i, \, t \in \mathbb{N}. $$

Random errors (or FDSLRM errors) component $\varepsilon(t)$ of our FDSLRM is the zero mean value time series consisting of:

finite discrete spectrum errors (FDS errors) $-$ the linear combination of deterministic real functions $v_1(t)=\cos\left(\tfrac{2\pi t\cdot 19}{76}\right),\,v_2(t)=\sin\left(\tfrac{2\pi t\cdot 19 }{76}\right),\,v_3(t)=\cos\left(\tfrac{2\pi t\cdot 38}{76}\right)$ with mutually uncorrelated random amplitudes $Y_1,Y_2,Y_3$ and
white noise errors (WN errors) $-$ $\mathcal{wn}$ (or $\mathcal{iid}$) noise $w(t)$

$$ \varepsilon(t) = \sum\limits_{j=1}^{3}v_j(t)Y_j + w(t), \, t \in \mathbb{N}. $$

Loading R functions and packages

A brief help on all applied R functions and packages designed to work with FDSLRM is in the section Appendix.

Important note: In our testing, the most reliable way to use our fdslrm package in a Binder repository is to load it directly from GitHub. The standard installation of fdslrm, as for any R package on GitHub, works without problems in a local installation using the Anaconda R distribution or the CRAN distribution.

In [1]:

# loading all fdslrm functions as an R script from GitHub
devtools::source_url("https://github.com/fdslrm/fdslrmAllinOne/blob/master/fdslrmAllinOne.R?raw=TRUE")
initialFDSLRM()

SHA-1 hash of file is f56b9d53e72a8575947a467930a2bdddb5b500ad

Data plot

The tourism data was loaded from Hyndman's R package fpp2 (Hyndman, 2018). The more detailed data description can be found in Hyndman, 2018, sec. 10.1, p. 300.

In [2]:

# IPython setting for output
options(repr.plot.res=120, repr.plot.height=4.5, repr.plot.width=6.5)

# Loading and plotting data from Hyndman's package fpp2
autoplot(window(visnights)[,15], ylab = "Visnights") # VICInner

Spectral analysis - Periodogram

Our econometric data shows some periodic patterns as they are influenced by seasons or regularly repeating events. To identify significant (Fourier) frequencies, we apply a spectral time series exploratory tool called periodogram (more details in Gajdoš et al., 2017).

In [3]:

# time series observations, times values, periodogram
dt <- as.numeric((window(visnights)[,15]))
t <- 1:length(dt)
periodo <- spec.pgram(dt, log="no")

Six most significant frequencies according to values of spectrum in periodogram.

In [4]:

drawTable(type = "periodogram", periodogram = periodo)

Frequencies by spectrum
spectrum	3.994666	1.663512	1.221876	1.048958	0.5176789	0.4726287
frequency (raw)	0.250000	0.500000	0.025000	0.012500	0.0625000	0.2625000

The raw frequencies from periodogram can be easily rewritten to the corresponding Fourier frequencies in the standard form $2\pi k/n$, where $k$ is the frequency order and $n$ is the number of time series observations (or a very close integer number); e.g. the first raw frequency $0.25$ can be expressed as $0.25=19/76$, which corresponds to the Fourier frequency $2\pi \cdot 19/76$.

In [5]:

# orders k for Fourier frequencies
print(round(76*c(0.250000,0.500000,0.025000,0.012500,0.0625000,0.2625000)))

[1] 19 38  2  1  5 20
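The raw frequencies in the table are multiples of 1/80 rather than 1/76, which is why rounding is needed above. A likely explanation is that spec.pgram() pads the series by default (argument fast = TRUE) to the length given by nextn(), and nextn(76) is 80. The short sketch below, using only base R, shows the orders on the padded grid next to the nearest Fourier orders for $n=76$; the variable names are ours.

```r
n        <- 76
n_padded <- nextn(n)     # smallest integer >= 76 with prime factors 2, 3, 5 only
raw      <- c(0.25, 0.5, 0.025, 0.0125, 0.0625, 0.2625)

k_padded  <- raw * n_padded   # orders on the padded frequency grid
k_fourier <- round(raw * n)   # nearest Fourier orders for n = 76

print(n_padded)
print(rbind(raw, k_padded, k_fourier))
```

Only the raw frequencies 0.25 and 0.5 coincide exactly with Fourier frequencies of the unpadded series (19/76 and 38/76); the remaining ones are approximations.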

In [6]:

fnames= c("19/76", "$38/76$", "$2/76$", "$1/76$", "$5/76$", "$20/76$")
drawTable(type = "periodogram", periodogram = periodo, frequencies = fnames)

Frequencies by spectrum
spectrum	3.9946658	1.6635122	1.2218756	1.0489582	0.5176789	0.4726287
frequency (raw)	0.2500	0.5000	0.0250	0.0125	0.0625	0.2625
frequency	$19/76$	$$38/76$$	$$2/76$$	$$1/76$$	$$5/76$$	$$20/76$$

Residual diagnostics

Our FDSLRM for the tourism data can be rewritten in the matrix form as a linear mixed model (LMM):

$$\mathbf{X}=\mathrm{F}\boldsymbol{\beta}+\mathrm{V}\mathbf{Y}+\boldsymbol{w},$$

where

$\mathbf{X} = (X(1), X(2), \ldots, X(76))'$ is a vector of the time series observations,
$\boldsymbol{w} = (w(1),w(2), \ldots, w(76))'$ is a random vector of corresponding iid (or white) noise values
model design matrices $\mathrm{F}, \mathrm{V}$ for our final model have the following structure:

$$\mathrm{F} \,{\scriptsize = \, \begin{pmatrix} 1 & \cos\left(\tfrac{2\pi}{76}\right) & \sin\left(\tfrac{2\pi\cdot 2}{76}\right) \\ 1 & \cos\left(\tfrac{2\pi\cdot 2}{76}\right) & \sin\left(\tfrac{2\pi\cdot 4}{76}\right) \\ \vdots & \vdots & \vdots \\ 1 & \cos\left(\tfrac{2\pi\cdot 76}{76}\right) & \sin\left(\tfrac{2\pi\cdot 152}{76}\right) \end{pmatrix}\qquad} \mathrm{V} \,{\scriptsize = \, \begin{pmatrix} \cos\left(\tfrac{2\pi\cdot 19}{76}\right) & \sin\left(\tfrac{2\pi\cdot 19}{76}\right) & \cos\left(\tfrac{2\pi\cdot 38}{76}\right)\\ \cos\left(\tfrac{2\pi\cdot 38}{76}\right) & \sin\left(\tfrac{2\pi\cdot 38}{76}\right) & \cos\left(\tfrac{2\pi\cdot 76}{76}\right)\\ \vdots & \vdots & \vdots \\ \cos\left(\tfrac{2\pi\cdot 1444}{76}\right) & \sin\left(\tfrac{2\pi\cdot 1444}{76}\right) & \cos\left(\tfrac{2\pi\cdot 2888}{76}\right) \end{pmatrix}}.$$
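The two design matrices can be built directly in base R. The sketch below is not taken from the fdslrm package; the names Fm and Vm are ours (F is avoided because it is an alias of FALSE in R). It also checks a property that follows from using Fourier frequencies of $n=76$: the columns are mutually orthogonal.

```r
n     <- 76
times <- 1:n

# trend design matrix: columns f_1(t) = 1, f_2(t), f_3(t)
Fm <- cbind(1,
            cos(2 * pi * 1 * times / n),
            sin(2 * pi * 2 * times / n))

# random-component design matrix: columns v_1(t), v_2(t), v_3(t)
Vm <- cbind(cos(2 * pi * 19 * times / n),
            sin(2 * pi * 19 * times / n),
            cos(2 * pi * 38 * times / n))

# Gram matrices, rounded to suppress floating-point noise
print(round(crossprod(Fm), 10))      # diagonal with entries 76, 38, 38
print(round(crossprod(Vm), 10))      # diagonal with entries 38, 38, 76
print(round(crossprod(Fm, Vm), 10))  # zero matrix: F'V = 0
```

This orthogonality is consistent with the zero correlations between fixed effects reported in the fit summaries later in the notebook.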

This fundamental FDSLRM property allows us to apply many results and mathematical techniques of LMM methodology. In the language of LMM terminology $\boldsymbol{\beta}$ represents the vector of fixed effects, the random component depends on vector $\mathbf{Y}$ of random effects and $\boldsymbol{w}$ of random errors. From the viewpoint of LMM residual analysis, we have to consider three types of residuals:

marginal residuals (FDSLRM residuals): $\mathbf{X}-\mathrm{F}{\boldsymbol{\beta^*}}$,
random effects residuals (FDS residuals): $\mathrm{V}{\mathbf{Y^*}}$,
conditional residuals (WN residuals): $\mathbf{X}-\mathrm{F}{\boldsymbol{\beta^*}}-\mathrm{V}{\mathbf{Y^*}}$.
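The three residual types can be illustrated with standard LMM formulas. The sketch below is a simplified illustration under stated assumptions: the data are simulated, and the variance parameters are treated as known, whereas the notebook estimates them by REML through fitDiagFDSLRM(). With $\Sigma = \sigma_0^2 \mathrm{I} + \mathrm{V}\mathrm{D}\mathrm{V}'$, it computes the generalized least squares estimate of $\boldsymbol{\beta}$, the BLUP of $\mathbf{Y}$, and the residuals. All numeric values and names are ours.

```r
set.seed(42)                       # arbitrary seed, only for reproducibility
n     <- 76
times <- 1:n
Fm <- cbind(1, cos(2 * pi * times / n), sin(2 * pi * 2 * times / n))
Vm <- cbind(cos(2 * pi * 19 * times / n),
            sin(2 * pi * 19 * times / n),
            cos(2 * pi * 38 * times / n))

# assumed parameters, used both to simulate and as "known" variances
beta     <- c(4.25, 0.26, -0.25)
sigma2_0 <- 0.11                          # white-noise variance
D        <- diag(c(0.001, 0.23, 0.02))    # covariance matrix of Y

# simulated observations X = F beta + V Y + w
Y <- rnorm(3, sd = sqrt(diag(D)))
X <- drop(Fm %*% beta + Vm %*% Y) + rnorm(n, sd = sqrt(sigma2_0))

# covariance matrix of X and its inverse
Sigma     <- sigma2_0 * diag(n) + Vm %*% D %*% t(Vm)
Sigma_inv <- solve(Sigma)

# generalized least squares estimate of beta
beta_hat <- drop(solve(t(Fm) %*% Sigma_inv %*% Fm,
                       t(Fm) %*% Sigma_inv %*% X))

# marginal residuals, BLUP of Y, random-effects part, conditional residuals
marginal    <- X - drop(Fm %*% beta_hat)
Y_hat       <- drop(D %*% t(Vm) %*% Sigma_inv %*% marginal)
rand_eff    <- drop(Vm %*% Y_hat)
conditional <- marginal - rand_eff

print(round(beta_hat, 3))
print(round(Y_hat, 3))
```

Because the columns of $\mathrm{F}$ and $\mathrm{V}$ are orthogonal for these Fourier frequencies, the generalized least squares estimate coincides here with ordinary least squares.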

How was the final form of $\mathrm{F}$ and $\mathrm{V}$ found?

Due to the LMM structure, we can apply graphical (exploratory) tools and quantitative tests of LMM residual diagnostics (Singer et al., 2017) for FDSLRM observations (see the next subsection Graphical tools). The most suitable form of $\mathrm{F}$ and $\mathrm{V}$ is found by an iterative process (see rules and steps explained in the next remark) of applying the mentioned tools whose results are summarized in the following table. Two most adequate and parsimonious structures of the FDSLRM (2b, 3b) consist of two low frequencies $(2\pi/76, 2\pi\cdot 2/76)$ and two higher ones $(2\pi\cdot 19/76, 2\pi\cdot 38/76)$. Since the difference in AIC, BIC for both models is relatively small, our final choice is model 3b thanks to the generally smaller mean squared error in predictions (Hančová, 2007).

Columns L to N3 are graphical diagnostic tools, the next two columns are numerical diagnostic tests, and the last four columns give the frequencies (raw) in the model*.

| Iteration number | L | O1 | H | O2 | ACF | PACF | N1 | N2 | N3 | Normality test | Independence test | $\dfrac{1}{76}$ | $\dfrac{2}{76}$ | $\dfrac{19}{76}$ | $\dfrac{38}{76}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1. | $\checkmark$ | $\checkmark$ | $\checkmark$ | ? | $\checkmark$ | $\checkmark$ | ? | ? | $\checkmark$ | $\checkmark$ | $\times$ | - | T,1,1 | T,1,1 | R,1,1 |
| 2a. | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ | ? | $\times$ | $\checkmark$ | $\checkmark$ | T,1,1 | T,1,1 | T,1,1 | R,1,1 |
| 2b. | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ | T,1,0 | T,0,1 | T,0,1 | R,1,0 |
| 3a. | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\times$ | ? | ? | $\checkmark$ | $\checkmark$ | T,1,1 | T,1,1 | R,1,1 | R,1,1 |
| 3b. | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ | ? | $\checkmark$ | $\checkmark$ | $\checkmark$ | T,1,0 | T,0,1 | R,1,1 | R,1,0 |

*Coding: letter (T or R) $-$ presence of the frequency in the trend or random component; the first number (1 or 0) $-$ presence of the frequency in the cos term; the second number (1 or 0) $-$ in the sin term

Key rules and steps of the iterative econometric FDSLRM-building
General rules of the FDSLRM-building for econometric data (Štulajter, 2002; Box et al., 2015; Gajdoš et al., 2017):

the model should be parsimonious (in the form of a simple dependence with a small number of the parameters),
the model should create small unexplained (random) deviations (small variances, covariances or mean squared errors),
the model should include lower frequencies in the trend component, and higher ones in the random component.

Key steps of the FDSLRM-building for econometric data (our experience):

follow the scheme of the Box-Jenkins iterative, three-stage time series model-building approach: formulation (identification), estimation (fit), diagnostic (checking),
the more apparent the periodic or seasonal patterns in the data, the fewer iterations are needed in the model building,
start at least with the three most significant frequencies in the model,
use the graphical-tools diagnostic matrix and numerical tests (see below) to check the adequacy of the model,
remove or add other significant frequencies and check the model adequacy,
remove cos or sin term with a particular frequency and check the model adequacy,
very small or very big variance parameter estimates indicate a reason to remove the corresponding cos or sin term with a particular frequency in the random component,
very small predictions of random effects also indicate an exclusion of the corresponding term in the random component,
use information criteria AIC, BIC to choose between competing adequate models.

Graphical (exploratory) tools

In the LMM residual analysis for FDSLRM we can use the following matrix of graphical exploratory tools (plots) as diagnostic tools for all FDSLRM assumptions. A brief description of the tools is in the Appendix.

| | Graphical-tools diagnostic matrix | |
|---|---|---|
| linearity of fixed effects (L) | outlying observations (O1) | independence of cond. errors (ACF) |
| stand. marg. residuals vs marg. fitted values | stand. marg. residuals vs times | ACF of cond. residuals |
| | | |
| homoscedasticity of cond. errors (H) | outlying observations (O2) | independence of cond. errors (PACF) |
| stand. cond. residuals vs cond. predictions | stand. cond. residuals vs times | PACF of cond. residuals |
| | | |
| normality of cond. errors (N1) | normality of cond. errors (N2) | normality of cond. errors (N3) |
| histogram of cond. residuals | histogram of stand. least conf. residuals | stand. least conf. residuals vs $\mathcal{N}(0,1)$ quantiles |

We present the residual diagnostics results for the final model (3b).

In [7]:

# Fitting the final FDSLRM
output <- fitDiagFDSLRM(dt, t, c(1/76, 2/76), include_fixed_eff = c(1,0,0,1), 
                          freq_random = c(19/76, 38/76), include_random_eff = c(1,1,1,0),
                          poly_trend_degree = 0, season_period = 4)

options(repr.plot.res=600, repr.plot.height=9, repr.plot.width=10)
drawDiagPlots("all", output)

Single panels for diagnostic

Our function drawDiagPlots() can also show any single diagnostic plot in its own panel. Here we show two examples: the cumulative periodogram, which is not part of the Graphical-tools diagnostic matrix, and the linearity plot (L) from the matrix.

plot: cumulative periodogram of conditional residuals - detection of periodic nonrandomness in conditional residuals
plot: standardized marginal residuals vs marginal fitted values - test of linearity of fixed effects

In [8]:

options(repr.plot.res=80, repr.plot.height=6, repr.plot.width=6)
drawDiagPlots(output$diagnostic_plots_names$CumulatPeriodogCondResid, output)

In [9]:

options(repr.plot.res=100, repr.plot.height=5, repr.plot.width=7)
drawDiagPlots(output$diagnostic_plots_names$StdMarginalResidVsFittedValues, output)

Numerical tests

Tests of residual independence

In [10]:

print(output$Box_test_season_resid)
print(output$BoxLjung_test_season_resid)

	Box-Pierce test

data:  resid(fit)
X-squared = 3.3683, df = 8, p-value = 0.9092


	Box-Ljung test

data:  resid(fit)
X-squared = 3.6515, df = 8, p-value = 0.8871
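The printed p-values can be checked by hand, since both statistics are referred to a chi-square distribution with the printed degrees of freedom. The sketch below recomputes them with pchisq() and then runs both tests on a simulated white-noise series. The choice lag = 8 in Box.test() is our assumption, inferred from the printed df = 8; how fitDiagFDSLRM() sets the lag internally is not shown in this notebook.

```r
# p-values recomputed from the printed statistics (chi-square, df = 8)
p_bp <- pchisq(3.3683, df = 8, lower.tail = FALSE)   # Box-Pierce
p_lb <- pchisq(3.6515, df = 8, lower.tail = FALSE)   # Box-Ljung
print(round(c(BoxPierce = p_bp, BoxLjung = p_lb), 4))

# the same two tests on a simulated white-noise series (stand-in for residuals)
set.seed(123)                      # arbitrary seed
e  <- rnorm(76)
bp <- Box.test(e, lag = 8, type = "Box-Pierce")
lb <- Box.test(e, lag = 8, type = "Ljung-Box")
print(c(df = unname(bp$parameter),
        BoxPierce = unname(bp$statistic),
        BoxLjung  = unname(lb$statistic)))
```

Large p-values, as in the output above, mean that the hypothesis of uncorrelated residuals is not rejected.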

Test of residual normality

In [11]:

print(output$ShapiroWilk_test_norm_cond_resid)
print(output$ShapiroWilk_test_stand_least_conf_resid)

	Shapiro-Wilk normality test

data:  resid(fit, type = "normalized")
W = 0.98015, p-value = 0.2796


	Shapiro-Wilk normality test

data:  SingerEtAl_resid_diag$least.confounded.residuals
W = 0.98402, p-value = 0.4848

Fitting summary

Parameter estimates

Estimates of regression coefficients

In [12]:

drawTable(type = "fixed", fixed_eff = output$fixed_effects)

	$\beta_{1}$	$\beta_{2}$	$\beta_{3}$
	4.253501	0.255671	-0.2473575

Predictions of random effects

In [13]:

drawTable(type = "random", random_eff = output$random_effects)

	$Y_{1}$	$Y_{2}$	$Y_{3}$
	-0.0171564	0.4739985	-0.1397494

Estimates of variance parameters

In [14]:

drawTable(type = "variance", variances = c(output$error_variance, diag(output$rand_eff_variance)))

$\sigma_{0}^2$	$\sigma_{1}^2$	$\sigma_{2}^2$	$\sigma_{3}^2$
0.1076678	0.0010722	0.2274806	0.0208567

Fit summary

Graphical summary for the final model (3b)

plot: time series observations (black), fitted values (blue), estimated trend (red) vs times

In [15]:

options(repr.plot.res=120, repr.plot.height=5, repr.plot.width=6.5)
drawDiagPlots(output$diagnostic_plots_names$FittedTimeSeries, output)

Numerical summary for the final model (3b)

In [16]:

print(output$fit_summary)

Linear mixed-effects model fit by REML
 Data: d 
       AIC      BIC    logLik
  77.54945 93.58267 -31.77472

Random effects:
 Formula: ~-1 + v1 + v2 + v3 | g
 Structure: Diagonal
                v1        v2        v3  Residual
StdDev: 0.03274389 0.4769492 0.1444184 0.3281277

Fixed effects: as.formula(paste("x~", paste(names(d)[2:kk], collapse = "+"))) 
                Value  Std.Error DF   t-value p-value
(Intercept)  4.253501 0.03763883 73 113.00833       0
f2           0.255671 0.05322934 73   4.80320       0
f3          -0.247357 0.05322934 73  -4.64701       0
 Correlation: 
   (Intr) f2
f2 0        
f3 0      0 

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max 
-2.77665527 -0.68958989 -0.04928173  0.57724936  2.19774297 

Number of Observations: 76
Number of Groups: 1
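The summary above reports standard deviations, while the table of variance parameters reports variances. The following check, using only numbers copied from the printed output, shows how the two are related. It also shows that the standard errors of the fixed effects appear to follow from the residual standard deviation alone; this relies on the orthogonality of the design columns ($\mathrm{F}'\mathrm{F}=\mathrm{diag}(76, 38, 38)$, $\mathrm{F}'\mathrm{V}=0$), which is our reading of the model, not a statement from the fdslrm package.

```r
# values copied from the printed summary of model 3b
sd_random <- c(v1 = 0.03274389, v2 = 0.4769492, v3 = 0.1444184)
sd_resid  <- 0.3281277
n         <- 76

# squared standard deviations: (sigma_0^2, sigma_1^2, sigma_2^2, sigma_3^2)
variances <- c(resid = sd_resid, sd_random)^2
print(round(variances, 7))

# standard errors under the orthogonal design:
# Var(beta*) = sigma_0^2 (F'F)^{-1} with F'F = diag(n, n/2, n/2)
se <- sd_resid / sqrt(c(n, n / 2, n / 2))
print(round(se, 8))

# t-values are estimate divided by standard error
t_values <- c(4.253501, 0.255671, -0.247357) / se
print(round(t_values, 3))
```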

Numerical summary for model 2b

In [17]:

# AIC, BIC, loglike for model 2b
output2b <- fitDiagFDSLRM(dt, t, c(1/76, 2/76, 19/76), include_fixed_eff = c(1,0,0,1,0,1), 
                          freq_random = c(38/76), include_random_eff = c(1,0),
                          poly_trend_degree = 0, season_period = 4)

print(output2b$fit_summary)

Linear mixed-effects model fit by REML
 Data: d 
      AIC     BIC   logLik
  74.2364 87.8964 -31.1182

Random effects:
 Formula: ~-1 + v1 | g
               v1 Residual
StdDev: 0.1443915 0.329001

Fixed effects: as.formula(paste("x~", paste(names(d)[2:kk], collapse = "+"))) 
                Value  Std.Error DF   t-value p-value
(Intercept)  4.253501 0.03773901 72 112.70834       0
f2           0.255671 0.05337102 72   4.79045       0
f3          -0.247357 0.05337102 72  -4.63468       0
f4           0.479902 0.05337102 72   8.99182       0
 Correlation: 
   (Intr) f2 f3
f2 0           
f3 0      0    
f4 0      0  0 

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max 
-2.75118551 -0.70235938 -0.03079806  0.59381584  2.17636294 

Number of Observations: 76
Number of Groups: 1
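The AIC and BIC values of both models can be reproduced from the printed log-likelihoods. The sketch below assumes the usual definitions with $k$ equal to the number of fixed effects plus the number of variance parameters, and, for REML fits, $n-p$ in place of $n$ in the BIC penalty. This convention is our assumption about how nlme reports these values; it agrees with the printed numbers up to rounding.

```r
# log-likelihoods copied from the two printed summaries
loglik <- c(m3b = -31.77472, m2b = -31.1182)
p <- c(m3b = 3, m2b = 4)   # number of fixed effects
q <- c(m3b = 4, m2b = 2)   # number of variance parameters (incl. residual)
n <- 76

k   <- p + q
aic <- -2 * loglik + 2 * k
bic <- -2 * loglik + k * log(n - p)   # REML: n - p observations (assumed convention)

print(round(rbind(AIC = aic, BIC = bic), 4))
```

Model 2b has slightly lower AIC and BIC because it uses one parameter fewer; the choice of model 3b in this notebook rests on the prediction-error argument given in the section Residual diagnostics.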

Session info

In [18]:

print(sessionInfo())

R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] pracma_2.2.5     gnm_1.1-0        sommer_3.9.3     crayon_1.3.4    
 [5] lattice_0.20-38  matrixcalc_1.0-3 fpp2_2.3         expsmooth_2.3   
 [9] fma_2.3          ggplot2_3.1.1    forecast_8.7     nlme_3.1-139    
[13] car_3.0-2        carData_3.0-2    Matrix_1.2-17    MASS_7.3-51.4   
[17] IRdisplay_0.7.0  kableExtra_1.1.0

loaded via a namespace (and not attached):
 [1] fs_1.3.1          xts_0.11-2        usethis_1.5.0     devtools_2.0.2   
 [5] webshot_0.5.1     httr_1.4.0        rprojroot_1.3-2   repr_1.0.1       
 [9] tools_3.5.1       backports_1.1.4   R6_2.4.0          mgcv_1.8-28      
[13] lazyeval_0.2.2    colorspace_1.4-1  nnet_7.3-12       withr_2.1.2      
[17] tidyselect_0.2.5  prettyunits_1.0.2 processx_3.3.1    curl_3.3         
[21] compiler_3.5.1    cli_1.1.0         rvest_0.3.4       xml2_1.2.0       
[25] desc_1.2.0        labeling_0.3      tseries_0.10-46   scales_1.0.0     
[29] lmtest_0.9-37     fracdiff_1.4-2    readr_1.3.1       quadprog_1.5-7   
[33] callr_3.2.0       pbdZMQ_0.3-3      stringr_1.4.0     digest_0.6.18    
[37] relimp_1.0-5      foreign_0.8-71    rmarkdown_1.12    rio_0.5.16       
[41] base64enc_0.1-3   pkgconfig_2.0.2   htmltools_0.3.6   sessioninfo_1.1.1
[45] highr_0.8         rlang_0.3.4       readxl_1.3.1      TTR_0.23-4       
[49] rstudioapi_0.10   quantmod_0.4-14   zoo_1.8-5         jsonlite_1.6     
[53] dplyr_0.8.1       zip_2.0.2         magrittr_1.5      qvcalc_1.0.0     
[57] Rcpp_1.0.1        IRkernel_1.0.1    munsell_0.5.0     abind_1.4-5      
[61] stringi_1.4.3     pkgbuild_1.0.3    plyr_1.8.4        grid_3.5.1       
[65] parallel_3.5.1    forcats_0.4.0     splines_3.5.1     haven_2.1.0      
[69] hms_0.4.2         knitr_1.23        ps_1.3.0          pillar_1.4.0     
[73] uuid_0.1-2        pkgload_1.0.2     urca_1.3-0        glue_1.3.1       
[77] evaluate_0.13     data.table_1.12.2 remotes_2.0.4     cellranger_1.1.0 
[81] gtable_0.3.0      purrr_0.3.2       assertthat_0.2.1  xfun_0.7         
[85] openxlsx_4.1.0    viridisLite_0.3.0 timeDate_3043.102 tibble_2.1.1     
[89] memoise_1.1.0

References

This notebook belongs to the supplementary materials of the paper submitted to Statistical Papers and available at https://arxiv.org/abs/1905.07771.

Hančová, M., Vozáriková, G., Gajdoš, A., Hanč, J. (2019). Estimating variance components in time series linear regression models using empirical BLUPs and convex optimization, https://arxiv.org/abs/1905.07771.


Abstract of the paper

We propose a two-stage estimation method of variance components in time series models known as FDSLRMs, whose observations can be described by a linear mixed model (LMM). We based estimating variances, fundamental quantities in a time series forecasting approach called kriging, on the empirical (plug-in) best linear unbiased predictions of unobservable random components in FDSLRM.

The method, providing invariant non-negative quadratic estimators, can be used for any absolutely continuous probability distribution of time series data. As a result of applying the convex optimization and the LMM methodology, we resolved two problems $-$ theoretical existence and equivalence between least squares estimators, non-negative (M)DOOLSE, and maximum likelihood estimators, (RE)MLE, as possible starting points of our method and a practical lack of computational implementation for FDSLRM. As for computing (RE)MLE in the case of $ n $ observed time series values, we also discovered a new algorithm of order $\mathcal{O}(n)$, which at the default precision is $10^7$ times more accurate and $n^2$ times faster than the best current Python(or R)-based computational packages, namely CVXPY, CVXR, nlme, sommer and mixed.

We illustrate our results on three real data sets $-$ electricity consumption, tourism and cyber security $-$ which are easily available, reproducible, sharable and modifiable in the form of interactive Jupyter notebooks.

Brockwell, P. J., Davis, R. A. (2016). Introduction to Time Series and Forecasting (3rd ed.). New York, NY: Springer
Brockwell, P. J., & Davis, R. A. (2006). Time Series: Theory and Methods (2nd ed.). New York: Springer-Verlag
Box, G. E. P., Jenkins, G. M., Reinsel, G. C., Ljung, G. M. (2015). Time Series Analysis: Forecasting and Control (5th ed.). Hoboken, New Jersey: Wiley
Gajdoš, A., Hančová, M., Hanč, J. (2017). Kriging Methodology and Its Development in Forecasting Econometric Time Series. Statistica: Statistics and Economy Journal, 2017, Vol. 97, No. 1, pp. 59–73
Galecki, A., Burzykowski, T. (2013). Linear Mixed-Effects Models Using R: A Step-by-Step Approach. New York: Springer
Hančová, M. (2007). Comparison of prediction quality of the best linear unbiased predictors in time series linear regression models. Proceedings of 15th European Young Statisticians Meeting. Castro Urdiales (Spain): University of Extremadura, http://matematicas.unex.es/~idelpuerto/15thEYSM.html.
Hilden-Minton, J. A. (1995). Multilevel diagnostics for mixed and hierarchical linear models. Unpublished PhD Thesis, University of California, Los Angeles.

Nobre, J.S., Singer, J.M. (2007). Residual analysis for linear mixed models. Biom. J., Vol. 49, pp. 863–875
Pinheiro, J., Bates D., DebRoy S., Sarkar, D., R Core Team (2018). nlme: Linear and Nonlinear Mixed Effects Models. R package version 3.1-131, URL: https://CRAN.R-project.org/package=nlme
R Core Team (2018). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL: https://www.R-project.org/
Singer, J. M., Rocha, F. M. M., Nobre, J. S. (2017). Graphical Tools for Detecting Departures from Linear Mixed Model Assumptions and Some Remedial Measures. International Statistical Review, Vol. 85, pp. 290–324; R functions for LMM residual diagnostics https://www.ime.usp.br/~jmsinger/lmmdiagnostics.zip
Sokol, P., Gajdoš, A. (2017). Prediction of Attacks Against Honeynet Based on Time Series Modeling. In Silhavy, R., Silhavy, P., Prokopova, Z. (Eds.), Applied Computational Intelligence and Mathematical Methods (Vol. 662). Cham: Springer International Publishing. pp. 360–371
Štulajter, F. (2003). The MSE of the BLUP in a Finite Discrete Spectrum LRM. Tatra Mountains Mathematical Publications, 2003, Vol. 26, No. 1, pp. 125–131
Štulajter, F. (2002). Predictions in Time Series Using Regression Models. New York: Springer

Appendix - Tools, R functions

A brief help on applied diagnostic tools and R functions.

Graphical and numerical tools

standardized marginal residuals vs marginal fitted values: residuals not randomly distributed, an obvious pattern of dependency (systematic trend) $\rightarrow$ assumption of linearity of fixed effects violated $\rightarrow$ model rejected.
standardized conditional residuals vs predicted values: residuals not randomly distributed, increase in variance of residuals $\rightarrow$ assumption of homoscedasticity violated $\rightarrow$ model rejected.
standardized least confounded residuals vs N(0,1) quantiles: points do not lie close to the straight line, many points (more than 5%) lie out of confidence bounds $\rightarrow$ assumption of normality of conditional errors violated $\rightarrow$ model rejected.
standardized marginal residuals vs observation indices: some points are extremely far away from the majority of points $\rightarrow$ outliers detected.
standardized conditional residuals vs observation indices: some points are extremely far away from the majority of points $\rightarrow$ outliers detected.
autocorrelation function of conditional residuals: more than 5% of values cross the empirical bounds $\rightarrow$ independence of conditional errors violated $\rightarrow$ reject model.
partial autocorrelation function of conditional residuals: more than 5% of values cross the empirical bounds $\rightarrow$ independence of conditional errors violated $\rightarrow$ reject model.
histogram of conditional residuals: histogram does not approximately look like a Gaussian distribution $\rightarrow$ normality of conditional errors violated $\rightarrow$ reject model.
histogram of standardized least confounded residuals: histogram does not approximately look like a Gaussian distribution $\rightarrow$ normality of conditional errors violated $\rightarrow$ reject model.

As pointed out in Nobre and Singer, 2007, according to Hilden-Minton, 1995, a residual is said to be confounded for a specific type of error if it also depends on errors different from those that it is supposed to predict. In linear mixed models, conditional residuals and the BLUP are confounded (Nobre and Singer, 2007). This implies, for example, that estimated conditional residuals may not be adequate to check for normality of conditional errors: when random effects are grossly non-normal, estimated conditional residuals may not present a normal behavior even when the conditional error is normal (Nobre and Singer, 2007, Section 4). Following the suggestion of Hilden-Minton, 1995, we consider standardized conditional least confounded residuals, obtained as linear combinations of the standardized conditional residuals that minimize the proportion of their variance due to the random effects.

cumulative periodogram of conditional residuals: cumulative periodogram shows strong systematic deviations from the straight line connecting point $[0,0]$ with point $[0.5,1]$ $\rightarrow$ reject model.
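A simplified version of the cumulative periodogram can be computed by hand, which makes the reference line from $[0,0]$ to $[0.5,1]$ explicit. The sketch below uses a simulated white-noise series as a stand-in for conditional residuals and omits tapering and confidence bands; base R also provides stats::cpgram(), which draws the plot with bands. The variable names are ours.

```r
set.seed(7)                        # arbitrary seed
e <- rnorm(76)                     # stand-in for conditional residuals
n <- length(e)

# raw periodogram ordinates at Fourier orders k = 1, ..., n/2
I    <- Mod(fft(e - mean(e)))^2 / n
k    <- 1:floor(n / 2)
freq <- k / n

# cumulative periodogram, normalized to end at 1
cum_pgram <- cumsum(I[k + 1]) / sum(I[k + 1])

# largest vertical distance from the line through (0, 0) and (0.5, 1)
max_dev <- max(abs(cum_pgram - 2 * freq))
print(round(max_dev, 3))
```

For white noise the cumulative periodogram stays close to the reference line, so max_dev is small; strong periodic structure left in the residuals produces a visible jump at the corresponding frequency.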

R functions

Packages - fdslrm, fpp2, stats, repr

The R package fdslrm has been developed by authors of this notebook and serves the purpose of modelling time series.

fdslrm: Time series analysis and forecasting using LMM

Purpose: R package for modeling and prediction of time series using linear mixed models.

Version: 0.1.0, 2019

Depends: kableExtra, IRdisplay, MASS, Matrix, car, nlme, stats, forecast, fpp2, matrixcalc, sommer, gnm, pracma, CVXR

Maintainer: Andrej Gajdoš

Authors: Andrej Gajdoš, Jozef Hanč, Martina Hančová

URL: https://github.com/fdslrm/R-package

Installation: Run jupyter notebook 00 installation fdslrm.ipynb once before the first run of any R-based Jupyter notebook.

The window() function from the base R stats package extracts a portion of a time series. Here it is called without start or end, so the whole visnights series is kept, and column 15 (the inner zone of Victoria state, VICInner) is selected by indexing.

The autoplot() function, available after loading fpp2, automatically produces an appropriate time plot of the given time series.

The spec.pgram() function from the base R stats package produces a periodogram, an estimate of the spectral density of a time series.

The options(repr.plot.res=.., repr.plot.height=.., repr.plot.width=..) call sets the repr package parameters for resizing R plots in Jupyter notebooks.

Authors' source code - R functions

The initialFDSLRM() function loads essential R packages (nlme, kableExtra, IRdisplay, MASS, Matrix, car, stats,...) for time series analysis and visualization. It also installs any missing package, which can take several minutes on a standard computer. Moreover, the function loads the authors' R functions designed to work with FDSLRM and some of Singer's R functions for LMM (Singer et al., 2017) modified by the authors of this Jupyter notebook.

The fitDiagFDSLRM() function fits the given FDSLRM and then conducts the residual diagnostics. In particular, it computes all estimates of the unknown parameters and all the statistical tests of residuals. The basic input parameters of this function are

x - time series (vector),

times - vector,

freq_mean - frequencies in trend (vector),

poly_trend_degree - trend polynomial degree, default is zero,

include_fixed_eff - vector of ones and zeros specifying if the $\cos$ and $\sin$ component of corresponding frequency is included in trend or not,

freq_random - frequencies in random component (vector),

include_random_eff - vector of ones and zeros specifying if the $\cos$ and $\sin$ component of corresponding frequency is included in random component or not,

season_period - number specifying seasonality (periodicity), default value is zero.
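The way the 0/1 vectors select cos and sin terms can be illustrated with a small helper. The function harmonic_design() below is hypothetical and is not part of the fdslrm package; it only mirrors the convention that can be read off the calls in this notebook and the coding of the iteration table, namely one (cos, sin) pair of flags per frequency, in the order of the frequencies.

```r
# hypothetical helper, NOT part of fdslrm: build harmonic columns from 0/1 flags
harmonic_design <- function(times, freqs, include) {
  stopifnot(length(include) == 2 * length(freqs))
  cols <- list()
  for (j in seq_along(freqs)) {
    if (include[2 * j - 1] == 1) {        # flag for the cos term of frequency j
      cols[[paste0("cos_", j)]] <- cos(2 * pi * freqs[j] * times)
    }
    if (include[2 * j] == 1) {            # flag for the sin term of frequency j
      cols[[paste0("sin_", j)]] <- sin(2 * pi * freqs[j] * times)
    }
  }
  do.call(cbind, cols)
}

times <- 1:76
# trend harmonics of model 3b: cos at 1/76, sin at 2/76
F_harm <- harmonic_design(times, c(1/76, 2/76), c(1, 0, 0, 1))
# random harmonics of model 3b: cos and sin at 19/76, cos only at 38/76
V_harm <- harmonic_design(times, c(19/76, 38/76), c(1, 1, 1, 0))

print(colnames(F_harm))
print(colnames(V_harm))
```

The constant column of the trend is not produced by this helper; in the notebook's calls it corresponds to poly_trend_degree = 0.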

The drawDiagPlots() function draws the diagnostic plots of residuals. This function has two inputs

plots_names - name of a particular plot; alternatively, the user can specify plots_names = "all" and the matrix of all diagnostic plots will be displayed,

fit_diag_fdslrm_output - output of the function fitDiagFDSLRM().

The drawTable() function draws the tables used in this notebook: the table of the most significant frequencies from the periodogram (type = "periodogram") and the tables of model parameters from fitting (type = "fixed", "random", "variance").