
Regularized Estimation of Sparse Spectral Precision Matrices

Navonil Deb (Department of Statistics and Data Science, Cornell University; email: [email protected]), Amy Kuceyeski (Department of Radiology, Weill Cornell Medicine; email: [email protected]), Sumanta Basu (corresponding author; Department of Statistics and Data Science, Cornell University; email: [email protected])
Abstract

The spectral precision matrix, the inverse of a spectral density matrix, is an object of central interest in frequency-domain analysis of multivariate time series. Estimation of the spectral precision matrix is a key step in calculating partial coherency and in graphical model selection of stationary time series. When the dimension of a multivariate time series is moderate to large, traditional estimators of spectral density matrices such as averaged periodograms tend to be severely ill-conditioned, and one needs to resort to suitable regularization strategies involving optimization over complex variables. Existing $\ell_{1}$-regularization approaches in this direction either separately penalize real and imaginary parts of a complex number, thereby artificially increasing dimensionality, or use off-the-shelf optimization routines for complex variables that do not explicitly leverage the underlying sparsity structure of the problem, leading to scalability issues.

In this work, we propose complex graphical Lasso (CGLASSO), an $\ell_{1}$-penalized estimator of the spectral precision matrix based on local Whittle likelihood maximization. We develop fast pathwise coordinate descent algorithms for implementing CGLASSO on large dimensional time series data sets. At its core, our algorithmic development relies on a ring isomorphism between complex and real matrices that helps map a number of optimization problems over complex variables to similar optimization problems over real variables. This finding may be of independent interest and more broadly applicable for high-dimensional statistical analysis with complex-valued data. We also present a complete non-asymptotic theory for our proposed estimator which shows that consistent estimation is possible in the high-dimensional regime as long as the underlying spectral precision matrix is suitably sparse. We compare the performance of CGLASSO with competing alternatives on simulated data sets, and use it to construct a partial coherence network among brain regions from a real fMRI data set.

1 Introduction

Estimation and structure learning of a spectral density matrix and its inverse, commonly known as the spectral precision matrix, are canonical problems in the multivariate time series and signal processing literature, with applications to several important statistical tasks. For example, these objects are used for measuring frequency-domain analogues of correlation and partial correlation, viz. coherency and partial coherency. They are also used in graphical modeling of stationary time series, i.e. discovering conditional independence relationships among the components of a multivariate time series. This task amounts to learning the sparsity patterns in the off-diagonal elements of spectral precision matrices at different frequencies [Dahlhaus, 2000; Davis et al., 2016]. Frequency-domain measures such as spectral density, spectral precision, and more generally graphical modeling of stationary time series (StGM), have proven to be useful in several scientific areas including neuroscience, economics and environmental sciences [Granger, 1969; Bowyer, 2016; Fiecas et al., 2019; Baek et al., 2021b].

When the dimension of the multivariate time series is moderate or large, or the number of measured time points is small, the classical approach of estimating a spectral density matrix with smoothed periodograms and subsequently inverting it is not ideal, as shown in Böhm and von Sachs [2009]. This observation has motivated several approaches for shrinkage or sparsity-regularized estimation of spectral precision matrix for moderate to large-scale systems.

One approach is to first shrink a spectral density matrix and invert it, or shrink the inverted spectral density matrix [Fiecas and von Sachs, 2014; Fiecas et al., 2010; Schneider-Luftman and Walden, 2016]. These methods, however, do not ensure a sparse estimator of the spectral precision matrix. The other approach is to maximize local Whittle likelihood subject to sparsity constraints or penalties [Fiecas et al., 2019; Dallakyan et al., 2022; Baek et al., 2021a; Jung et al., 2015; Tugnait, 2022; Krampe and Paparoditis, 2022; Wodeyar and Srinivasan, 2022]. These methods rely on the fact that for stationary processes, discrete Fourier transforms (DFTs) at neighboring Fourier frequencies behave asymptotically as independent and nearly identically distributed “complex-valued data” in the frequency-domain. So algorithms can be designed following standard approaches in Gaussian graphical models (GGM) for real-valued i.i.d. data.

Some important algorithmic and statistical questions remain open in this emerging field of sparse spectral precision matrix estimation. On the algorithmic front, a key challenge is the lack of fast pathwise algorithms for penalized Whittle likelihood maximization that can take advantage of the underlying sparsity structure and efficiently compute penalized estimators on a grid of tuning parameters such as the penalty or the smoothing span. In GGM with real-valued data, pathwise coordinate descent [Friedman et al., 2007] provides a state-of-the-art solution for this task. In contrast, existing algorithms for complex-valued data either use off-the-shelf convex solvers such as the alternating direction method of multipliers (ADMM) [Jung et al., 2015; Dallakyan et al., 2022; Tugnait, 2022] or artificially increase the dimensionality of the problem by optimizing separately over real and imaginary parts of a complex number [Fiecas et al., 2019; Krampe and Paparoditis, 2022]. On the theoretical front, to the best of our knowledge there is no systematic asymptotic analysis for the penalized local Whittle likelihood maximizer that provides insight into the interplay of sparsity, high-dimensionality and the temporal and cross-sectional dependence of the underlying time series at any Fourier frequency. The closest works on this front are Baek et al. [2021a], which focuses on long memory processes but only at frequency 0, and Weylandt Jr [2020], which focuses on i.i.d. complex-valued data.

In this paper, we make two contributions to this literature. First, we provide fast pathwise coordinate descent algorithms for complex-variable Lasso (CLASSO) and graphical Lasso (CGLASSO). These algorithms are capable of taking advantage of the underlying sparsity structure of the problem in the same manner as their real counterparts. The core insight behind our algorithms is to extend a ring isomorphism between complex numbers and $2\times 2$ orthogonal matrices to the spaces of complex vectors and matrices. This connection helps recast a number of statistical optimization problems over complex variables (e.g. ordinary least squares and Gaussian likelihood maximization) as well-understood optimization problems over real variables. In particular, we can express a generic complex Lasso regression as a group Lasso over real variables with orthogonal group structures. This allows us to build a block coordinate descent algorithm with closed form updates, which is significantly faster than a generic group Lasso implementation such as grplasso [Meier, 2020]. An upshot of this connection is that we can rely on the convergence results of block coordinate descent over real variables [Tseng, 2001]. Finally, we use this connection to build fast, pathwise algorithms for both CLASSO and CGLASSO.

Second, we provide a systematic non-asymptotic analysis to study the estimation error of our method in both low- and high-dimensional regimes. This requires a careful combination of tools from Sun et al. [2018], Ravikumar et al. [2011], and complex analysis to deal with complex random variables. Our results show that under suitable sparsity and irrepresentability assumptions akin to what is commonly used in high-dimensional statistics, our method can consistently estimate the true spectral precision matrix.

We conduct extensive numerical experiments on simulated data to assess a number of finite sample properties of CGLASSO. First, our analysis confirms that for spectral precision estimation, jointly penalizing real and imaginary parts of a complex variable is beneficial over penalizing them separately as in Fiecas et al. [2019], especially in high dimensions. This matches a similar observation for complex Lasso regression made in Maleki et al. [2013]. Second, we confirm that, as expected, CGLASSO outperforms unregularized estimators of the spectral precision matrix in terms of estimation accuracy. Third, we find that CGLASSO, which works with the entire likelihood information, offers better model selection than node-wise regression strategies along the lines of Meinshausen and Bühlmann [2006]. Finally, we demonstrate the use of CGLASSO for a partial coherence estimation task on a real fMRI data set obtained from the Human Connectome Project (HCP).

The rest of the paper is organized as follows. In Section 2, we provide some background on the problem, review relevant works in the literature and propose our estimator. In Section 3, we discuss our new optimization algorithms for complex Lasso and graphical Lasso. Section 4 contains a complete non-asymptotic analysis of our method for Gaussian time series and linear processes. In Sections 5 and 6, we report the performance of our method on simulated and real fMRI data.

Notation. Throughout the paper, $\mathbb{Z},\mathbb{R}$ and $\mathbb{C}$ denote the sets of integers, real numbers and complex numbers respectively. $|c|$ denotes the absolute value of a real number $c$, or the modulus of a complex number $c$. $\mathrm{i}$ denotes the imaginary number $\sqrt{-1}$. For any complex number $z$, $\textbf{Re}(z)$ and $\textbf{Im}(z)$ denote the real and imaginary parts of $z$ respectively, and $z^{\dagger}$ denotes its complex conjugate. For any positive integer $p$, $\mathbf{e}^{(p)}$ denotes the $p$-dimensional basis vector with $1$ in the $p$th position and $0$ elsewhere. For a complex vector $\mathbf{z}=(z_{1},\ldots,z_{p})\in\mathbb{C}^{p}$, $\textbf{Re}(\mathbf{z})$, $\textbf{Im}(\mathbf{z})$ and $\mathbf{z}^{\dagger}$ denote its real part, imaginary part and conjugate transpose respectively. $\|\mathbf{z}\|$ denotes the $\ell_{2}$-norm of a vector $\mathbf{z}\in\mathbb{C}^{p}$, i.e. $\|\mathbf{z}\|=\sqrt{\sum_{i}|z_{i}|^{2}}$. For any $q>0$, $\|\mathbf{z}\|_{q}$ denotes the $\ell_{q}$ norm of $\mathbf{z}$, i.e. $\left(\sum_{i}|z_{i}|^{q}\right)^{1/q}$. For a matrix $A$, $A^{\dagger}$ denotes its conjugate transpose, $\overline{A}$ denotes its vector representation and $A_{\cdot j}$ denotes its $j$th column. For two matrices $A$ and $B$, the Frobenius inner product between $A$ and $B$, denoted by $\langle A,B\rangle$, is defined as $\text{tr}(A^{\dagger}B)$. If $A$ is a square matrix, $\text{tr}(A)$ denotes its trace, i.e. the sum of its eigenvalues. In addition, $\|A\|_{\infty}$, $\|A\|$, $\|A\|_{F}$ and $\|A\|_{\max}$ denote the maximum modulus row sum norm, i.e. $\max_{r}\sum_{s}|A_{rs}|$, the spectral norm, i.e. the square root of the largest eigenvalue of $A^{\dagger}A$, the Frobenius norm $\sqrt{\langle A,A\rangle}$, and the elementwise maximum modulus $\max_{r,s}|A_{rs}|$, respectively. The class $\mathcal{S}_{+}^{p}$ denotes the set of all $p\times p$ symmetric non-negative definite real matrices, i.e. $\mathcal{S}_{+}^{p}=\{M\in\mathbb{R}^{p\times p}:v^{\dagger}Mv\geq 0\ \forall v\in\mathbb{R}^{p}\}$, and $\mathcal{S}_{++}^{p}$ denotes the set of all $p\times p$ symmetric positive definite matrices, i.e. $\mathcal{S}_{++}^{p}=\{M\in\mathbb{R}^{p\times p}:v^{\dagger}Mv>0\ \forall v\in\mathbb{R}^{p},v\neq 0\}$. Similarly, $\mathcal{H}_{+}^{p}$ and $\mathcal{H}_{++}^{p}$ denote the sets of $p\times p$ Hermitian non-negative definite and positive definite complex matrices respectively. For asymptotics, we use the following standard notation: for two sequences $(f_{n})_{n\in\mathbb{N}}$ and $(g_{n})_{n\in\mathbb{N}}$, we write $f_{n}=\mathcal{O}(g_{n})$ if $|f_{n}|\leq C|g_{n}|$ for some constant $C>0$ independent of $n$, for all $n>C$. Similarly, $f_{n}=\Omega(g_{n})$ if $|f_{n}|\geq C^{\prime}|g_{n}|$ for some constant $C^{\prime}>0$ independent of $n$, for all $n>C^{\prime}$. We also write $f_{n}\asymp g_{n}$ if both $f_{n}=\mathcal{O}(g_{n})$ and $f_{n}=\Omega(g_{n})$ hold. We write $A\succsim B$ if there is a universal constant $c$, independent of the model dimension and model parameters, such that $A\geq cB$, and $A\precsim B$ is defined analogously. We also use $c,c^{\prime}>0$ to denote universal constants whose values are allowed to change from line to line within a proof.

2 Background and methods

We consider a $p$-dimensional weakly stationary real-valued time series $X_{t}=(X_{t1},\cdots,X_{tp})^{\top}$, $t\in\mathbb{Z}$. We also assume $\mathbb{E}(X_{t})=0$ for all $t\in\mathbb{Z}$ for ease of exposition. Weak stationarity implies that the autocovariance function $\Gamma(\ell)=\text{Cov}(X_{t},X_{t-\ell})=\mathbb{E}(X_{t}X_{t-\ell}^{\top})$ depends only on the lag $\ell$. The spectral density at a frequency $\omega\in[-\pi,\pi]$ is defined as the Fourier transform of the autocovariance functions

f(\omega)=\frac{1}{2\pi}\sum_{\ell=-\infty}^{\infty}\Gamma(\ell)e^{-\mathrm{i}\ell\omega}.

We use $\mathcal{X}=[X_{1}:\cdots:X_{n}]^{\top}$ to denote the data matrix that contains $n$ consecutive observations from the time series $\{X_{t}\}_{t\in\mathbb{Z}}$ in its rows. Our goal is to estimate the spectral precision matrix $\Theta(\omega):=f^{-1}(\omega)$ at a given frequency $\omega$. It is customary to focus only on the Fourier frequencies $\omega=2\pi k/n$, $k\in F_{n}=\{-[-\frac{n-1}{2}],\ldots,[\frac{n}{2}]\}$, where $[x]$ denotes the integer part of $x$.

The autocovariance can be expressed in terms of $f$ as

\Gamma(\ell)=\int_{-\pi}^{\pi}e^{\mathrm{i}\ell\omega}f(\omega)\ \text{d}\omega.

Some of the properties of the stationary process $\{X_{t}\}$ can therefore be described equivalently in terms of $f(\cdot)$ rather than $\Gamma(\cdot)$ [Brockwell and Davis, 1991; Shumway and Stoffer, 2000; Ombao and Pinto, 2022]. Similarly, $\{X_{t}\}$, as a process, has a spectral representation, known as Cramér's representation, given by

X_{t}=\int_{-\pi}^{\pi}e^{\mathrm{i}t\omega}\ \text{d}Z(\omega),   (2.1)

where $\{Z(\omega):-\pi\leq\omega\leq\pi\}$ is an orthogonal increment process satisfying the following properties:

\mathbb{E}[Z(\omega)]=0\mbox{ for all }\omega\in[-\pi,\pi]   (2.2)
\operatorname*{Cov}(Z_{j}(\omega),Z_{k}(\omega^{\prime}))=\mathbb{E}[Z_{j}(\omega)\overline{Z_{k}(\omega^{\prime})}]=\begin{cases}0&\text{if }\omega\neq\omega^{\prime},\\ f_{jk}(\omega)&\text{otherwise.}\end{cases}   (2.3)

This gives a way to decompose $X_{t}$ into a sum of processes at different frequencies. The integral in Cramér's representation of $X_{t}$ is a sum across all frequencies, and $\text{d}Z(\omega)$ is the random weight for each frequency $\omega$. Such a decomposition is valuable because it means that we can sensibly talk about movements in $X_{t}$ at a given frequency without worrying about its movements at other frequencies. In this sense, the object $f^{-1}(\omega)$ can be naturally viewed as capturing the frequency-specific conditional independence structure among the component time series. In the next subsection, we elaborate on this and give an illustrative example.

2.1 Frequency-specific conditional independence and $f^{-1}(\omega)$

As we can see from (2.2) and (2.3), at each frequency $\omega$, the correlation structure among the movements in different components is captured by $f(\omega)$. Similarly, the partial correlation structure among those component-wise random movements can be read off from $f^{-1}(\omega)$. The connection between precision matrices and partial correlations is reminiscent of Gaussian graphical models, and this context will be stated more formally in Section 2.2.

This decomposition into different frequencies is useful in several fields including economics, neuroscience and signal processing. For example, in human brain connectivity studies scientists often measure neurophysiological signals in the form of electroencephalogram (EEG) or functional magnetic resonance imaging (fMRI) for different brain regions. The resulting ‘brain-rhythm’ is viewed as a superposition of signals representing different physiological processes such as cardiovascular and respiratory activities. These different processes are captured by signals in different frequency intervals, e.g. the delta band (0.5-4.0 Hz), theta band (4.0-8.0 Hz), alpha band (8.0-12.0 Hz), beta band (12.0-30.0 Hz), and gamma band (30.0-50.0 Hz), obtained by frequency band pass filters [Ombao and Pinto, 2022]. The spectral density matrix $f(\omega)$ in this context captures the correlation structure among different brain regions in terms of the physiological processes represented by a specific frequency $\omega$. Similarly, the partial correlation structure (the conditional dependence graph) among those regions can be extracted using the spectral precision matrix $f^{-1}(\omega)$ if we are interested only in the movements at frequency $\omega$. Below we illustrate this with a small example from Shumway and Stoffer [2000].

Example 1. We consider two time series $X_{t}$ and $Y_{t}$, each having three frequency components, and generate $N=200$ observations as follows

X_{t}^{(j)}=Z_{1}(\omega_{j})\cos\omega_{j}t, \qquad Y_{t}^{(j)}=Z_{2}(\omega_{j})\cos\omega_{j}t, \qquad j=1,2,3,
X_{t}=X_{t}^{(1)}+X_{t}^{(2)}+X_{t}^{(3)}, \qquad Y_{t}=Y_{t}^{(1)}+Y_{t}^{(2)}+Y_{t}^{(3)},

where $\omega_{1}=2\pi\times 5/N,\ \omega_{2}=2\pi\times 10/N,\ \omega_{3}=2\pi\times 40/N$. We generate the amplitude processes from the following distributions

\begin{bmatrix}Z_{1}(\omega_{1})\\ Z_{2}(\omega_{1})\end{bmatrix}\sim\mathcal{N}_{2}\left(\begin{bmatrix}0\\ 0\end{bmatrix},\begin{bmatrix}1&0.5\\ 0.5&1\end{bmatrix}\right),\quad\begin{bmatrix}Z_{1}(\omega_{2})\\ Z_{2}(\omega_{2})\end{bmatrix}\sim\mathcal{N}_{2}\left(\begin{bmatrix}0\\ 0\end{bmatrix},\begin{bmatrix}1&-0.5\\ -0.5&1\end{bmatrix}\right),
\begin{bmatrix}Z_{1}(\omega_{3})\\ Z_{2}(\omega_{3})\end{bmatrix}\sim\mathcal{N}_{2}\left(\begin{bmatrix}0\\ 0\end{bmatrix},\begin{bmatrix}2&0.1\\ 0.1&2\end{bmatrix}\right).

The resulting $X_{t}$ and $Y_{t}$ behave as shown in Figure 1. In this example, the $\cos\omega_{j}t$'s are the orthogonal basis functions for the signals $X_{t}$ and $Y_{t}$, and $Z_{1}(\omega_{j})$ and $Z_{2}(\omega_{j})$ are the corresponding random basis coefficients; the pair of basis coefficients is drawn independently for each $\omega_{j}$. We may think of this as an example with two brain regions that generate signals as a superposition of three distinct physiological processes, and hence three distinct frequencies.

In this illustration, if the spectral density of the two-dimensional process $W_{t}:=(X_{t},Y_{t})^{\prime}$ is $f_{W}$, Cramér's representation as in (2.1) implies $[f_{W}(\omega_{1})]_{11}=\operatorname*{Var}(Z_{1}(\omega_{1}))=1$. In general, for $k=1,2$, $[f_{W}(\omega_{j})]_{kk}=\operatorname*{Var}(Z_{k}(\omega_{j}))$ for $j=1,2$ and $3$. For the cross-spectral term, $[f_{W}(\omega_{1})]_{12}=\operatorname*{Cov}(Z_{1}(\omega_{1}),Z_{2}(\omega_{1}))=0.5$, and similarly $[f_{W}(\omega_{2})]_{12}=-0.5$. Therefore, we see that $f_{W}$ fully captures the correlation structure of the same-frequency orthogonal components of the joint process $W_{t}$. Similarly, we can compute the spectral precision $f^{-1}_{W}$, which captures the partial correlations. Hence, estimation and inference on the spectral density, and similarly the spectral precision, pave the way for understanding the correlation and the partial correlation graph, respectively, among the same-frequency components extracted from the time series observations.
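To make the construction concrete, the following sketch reproduces the simulation in Example 1 (the random seed and variable names are illustrative and not from the paper):

```python
# Sketch of the Example 1 simulation: two series built from three random-amplitude
# cosine components at frequencies 2*pi*5/N, 2*pi*10/N and 2*pi*40/N.
import numpy as np

rng = np.random.default_rng(2024)
N = 200
t = np.arange(1, N + 1)
freqs = 2 * np.pi * np.array([5, 10, 40]) / N
covs = [np.array([[1.0, 0.5], [0.5, 1.0]]),
        np.array([[1.0, -0.5], [-0.5, 1.0]]),
        np.array([[2.0, 0.1], [0.1, 2.0]])]

X = np.zeros(N)
Y = np.zeros(N)
for w, C in zip(freqs, covs):
    Z1, Z2 = rng.multivariate_normal(np.zeros(2), C)   # random amplitudes at frequency w
    X += Z1 * np.cos(w * t)
    Y += Z2 * np.cos(w * t)
```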

Figure 1: Time series as sum of periodic components as described in Example 1

Such a two-dimensional representation can be generalized to the $p$-dimensional case using discrete Fourier transforms (DFT) (Section 10.1 of Brockwell and Davis [1991]). We use the DFT evaluated at the Fourier frequencies to estimate $f(\omega)$ and $f^{-1}(\omega)$. The $p$-dimensional complex vector-valued DFT at the Fourier frequency $\omega_{j}=2\pi j/n$, $j\in F_{n}$, is defined as

d_{j}:=d(\omega_{j})=\frac{1}{\sqrt{n}}\sum_{t=1}^{n}X_{t}\exp\left(-\mathrm{i}t\omega_{j}\right).

We represent it as $d_{j}=(d_{j1},\cdots,d_{jp})^{\top}$. The classical estimator of the spectral density $f(\omega_{j})$ is based on the periodogram [Brockwell and Davis, 1991], defined as $I(\omega_{j})=d(\omega_{j})d^{\dagger}(\omega_{j})=d_{j}d_{j}^{\dagger}$. According to Proposition 10.3.1 of Brockwell and Davis [1991], for any $1\leq k\leq p$, $\mathbb{E}[I_{kk}(\omega_{j})]\rightarrow[2\pi f(\omega_{j})]_{kk}$, which implies $\operatorname*{Cov}(d_{jk},d_{jk})=\mathbb{E}[d_{jk}d_{jk}^{\dagger}]\approx[2\pi f(\omega_{j})]_{kk}$. For any pair $k\neq k^{\prime}$, the cross-spectral terms can be similarly expressed as $\operatorname*{Cov}(d_{jk},d_{jk^{\prime}})=\mathbb{E}[d_{jk}d_{jk^{\prime}}^{\dagger}]\approx[2\pi f(\omega_{j})]_{kk^{\prime}}$. The bottom line is that the covariance structure among the DFT values, i.e. the $d_{j}$'s, is approximately proportional to the spectral density, which in turn captures the correlation pattern of the underlying orthogonal increment process as in (2.3). In general, $I(\omega_{j})$ is asymptotically unbiased for $f(\omega_{j})$ but not consistent even under classical, fixed-$p$ asymptotics. For example, for i.i.d. Gaussian white noise $X_{t}\sim\mathcal{N}(\mathbf{0},\sigma^{2}I)$, the variance of $I(\omega_{j})$ is of the order $\sigma^{4}$ (Proposition 10.3.2 of Brockwell and Davis [1991]), and does not vanish asymptotically. To ensure consistency, it is common to use the averaged or smoothed periodogram estimator

\hat{f}(\omega_{j})=\frac{1}{2\pi(2m+1)}\sum_{|k|\leq m}I(\omega_{j+k}),\quad j\in F_{n}.   (2.4)

Note that the indices of the Fourier frequencies $j$ are evaluated “modulo $n$”, as $j+k$ may fall outside $F_{n}$. It is well-known (cf. Theorem 10.4.1 of Brockwell and Davis [1991]) that for $m=o(\sqrt{n})$, (2.4) is a consistent estimator of $f(\omega_{j})$ in the classical asymptotic framework where $p$ is fixed but the sample size $n\rightarrow\infty$.
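A minimal numpy sketch of the DFT and the averaged periodogram estimator in (2.4) is given below; the function names are illustrative and the data matrix is assumed to be of shape $n\times p$:

```python
# Sketch: DFT at Fourier frequencies and the smoothed periodogram f_hat(w_j) in (2.4).
import numpy as np

def dft(X):
    """Row j is d_j = n^{-1/2} sum_t X_t exp(-i t w_j), with w_j = 2*pi*j/n."""
    n = X.shape[0]
    t = np.arange(1, n + 1)
    omegas = 2 * np.pi * np.arange(n) / n
    return np.exp(-1j * np.outer(omegas, t)) @ X / np.sqrt(n)

def smoothed_periodogram(X, j, m):
    """Average of I(w_{j+k}) = d_{j+k} d_{j+k}^dagger over |k| <= m, indices taken mod n."""
    d = dft(X)
    n = d.shape[0]
    idx = (j + np.arange(-m, m + 1)) % n
    I_sum = np.einsum('ka,kb->ab', d[idx], d[idx].conj())
    return I_sum / (2 * np.pi * (2 * m + 1))
```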

2.2 Method: Penalized Whittle likelihood estimator

We start by introducing the complex normal random variable and its role in the estimation of the spectral precision matrix. The following definition is adapted from Definition 11.7.1 of Brockwell and Davis [1991].

Definition 2.1 (Complex Multivariate Normal Distribution).

$Y=Y_{1}+\mathrm{i}Y_{2}$ follows a $p$-dimensional complex-valued multivariate normal distribution with mean vector $\mu=\mu_{1}+\mathrm{i}\mu_{2}\in\mathbb{C}^{p}$ and covariance matrix $\Sigma=\Sigma_{1}+\mathrm{i}\Sigma_{2}\in\mathcal{H}^{p}_{+}$ if

\begin{bmatrix}Y_{1}\\ Y_{2}\end{bmatrix}\thicksim\mathcal{N}\left(\begin{bmatrix}\mu_{1}\\ \mu_{2}\end{bmatrix},\frac{1}{2}\begin{bmatrix}\Sigma_{1}&-\Sigma_{2}\\ \Sigma_{2}&\Sigma_{1}\end{bmatrix}\right).

In this case we use the notation $Y\thicksim\mathcal{N}_{\mathbb{C}}(\mu,\Sigma)$.
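For intuition, a sample from $\mathcal{N}_{\mathbb{C}}(\mu,\Sigma)$ can be drawn by sampling the stacked real and imaginary parts from the real Gaussian distribution in Definition 2.1; a hedged sketch (function name illustrative) is:

```python
# Sketch: draw from the complex multivariate normal of Definition 2.1.
import numpy as np

def sample_complex_normal(mu, Sigma, size=1, seed=None):
    rng = np.random.default_rng(seed)
    p = len(mu)
    S1, S2 = Sigma.real, Sigma.imag
    mean = np.concatenate([mu.real, mu.imag])
    cov = 0.5 * np.block([[S1, -S2], [S2, S1]])      # covariance of [Y1; Y2]
    Y = rng.multivariate_normal(mean, cov, size=size)
    return Y[:, :p] + 1j * Y[:, p:]
```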

It is well-known that under some regularity conditions, $d_{j}\sim\mathcal{N}_{\mathbb{C}}(0,f(\omega_{j}))$ asymptotically (Propositions 11.7.3 and 11.7.4 of Brockwell and Davis [1991]), and, with the exception of the pair $\pm\omega_{j}$, the DFTs at different Fourier frequencies are approximately independent. So, for any given $j\in F_{n}$ and small enough $m$, it makes sense to treat the DFTs evaluated at $\omega_{j-m},\ldots,\omega_{j},\ldots,\omega_{j+m}$ as nearly independent and identically distributed observations with covariance matrix $f(\omega_{j})$ and precision matrix $f^{-1}(\omega_{j})$.

This formulation is analogous to the framework of Gaussian graphical models (GGM), where one is interested in learning the precision matrix $\Theta=\Sigma^{-1}$ from i.i.d. data $X_{k}\sim\mathcal{N}(0,\Sigma)$, $k=1,\ldots,n$. It is therefore quite natural to devise algorithms that are similar to GGM estimators. Indeed, the sparse inverse periodogram estimator (SIPE) of Fiecas et al. [2019] is built on a popular GGM estimator called CLIME [Cai et al., 2011], and the regression approach adopted in Krampe and Paparoditis [2022] is based on Meinshausen and Bühlmann [2006].

We propose to develop a sparsity regularized maximum likelihood estimator based on Whittle's approximate likelihood [Whittle, 1951]. Note that for Gaussian time series, $d_{j}\sim\mathcal{N}_{\mathbb{C}}(0,f(\omega_{j}))$ asymptotically for every Fourier frequency $\omega_{j}$, and the Whittle likelihood is the pseudo-likelihood obtained by treating the DFTs as approximately independent across frequencies. The log-likelihood (up to constants) takes the form

\sum_{j\in F_{n}}\log\det f^{-1}(\omega_{j})-\sum_{j\in F_{n}}d^{\dagger}_{j}\ f^{-1}(\omega_{j})\ d_{j}.

To estimate the spectral precision at a Fourier frequency $\omega_{j}$, we work with the local approximation

\sum_{\ell=j-m}^{j+m}\log\det f^{-1}(\omega_{j})-\sum_{\ell=j-m}^{j+m}d^{\dagger}_{\ell}\ f^{-1}(\omega_{j})\ d_{\ell}.

We solve the following penalized negative log-likelihood minimization problem to estimate $f^{-1}(\omega_{j})$:

\hat{\Theta}(\omega_{j}):=\,\operatorname*{arg\,min}_{\Theta\in\mathcal{H}_{++}^{p}}\,\,-(2m+1)\log\det\Theta+\text{trace}\left(\Theta\sum_{\ell=j-m}^{j+m}d_{\ell}d^{\dagger}_{\ell}\right)+\lambda\left\|\Theta\right\|_{1,\text{off}},   (2.5)

where $\lambda>0$ is the penalty parameter, and $\left\|\Theta\right\|_{1,\text{off}}=\sum_{k\neq\ell}|\Theta_{k,\ell}|$ is the sum of the moduli of all off-diagonal elements of $\Theta$. Equivalently,

\hat{\Theta}_{j}=\operatorname*{arg\,min}_{\Theta\in\mathcal{H}_{++}^{p}}\,\langle\Theta,\hat{f}(\omega_{j})\rangle-\log\det\Theta+\lambda\|\Theta\|_{1,\text{off}}.   (2.6)

Here $\hat{\Theta}(\omega_{j})$ is abbreviated as $\hat{\Theta}_{j}$, and $\lambda$ in the optimization problem (2.5) is replaced by $(2m+1)\lambda$ to simplify the notation. This optimization problem is similar in spirit to the estimation of inverse covariance matrices based on the graphical Lasso [Friedman et al., 2008] for i.i.d. data. The main difference is that $\hat{f}(\omega_{j})=\sum_{\ell=j-m}^{j+m}d_{\ell}d^{\dagger}_{\ell}/(2m+1)$ takes the role of the sample covariance matrix.
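For reference, the CGLASSO objective in (2.6) can be evaluated at a candidate $\Theta$ as in the following sketch (assuming $\Theta$ and $\hat{f}(\omega_{j})$ are Hermitian; names are illustrative):

```python
# Sketch: value of the objective <Theta, f_hat> - log det Theta + lambda * ||Theta||_{1,off}.
import numpy as np

def cglasso_objective(Theta, f_hat, lam):
    trace_term = np.trace(f_hat @ Theta).real            # <Theta, f_hat> for Hermitian inputs
    logdet = np.log(np.linalg.det(Theta).real)
    off_l1 = np.sum(np.abs(Theta)) - np.sum(np.abs(np.diag(Theta)))
    return trace_term - logdet + lam * off_l1
```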

Similar penalized local Whittle likelihood approaches have been investigated in the recent literature by several authors [Jung et al., 2015; Dallakyan et al., 2022; Baek et al., 2021a; Tugnait, 2022]. Among these works, Baek et al. [2021a] focuses only on $\omega=0$ but allows long-range dependent processes, and the others provide a single graph by penalizing across all the frequencies at once. In what follows, we restrict our discussion to the SIPE algorithm of Fiecas et al. [2019] and Krampe and Paparoditis [2022], because only they provide algorithms to estimate $f^{-1}(\omega_{j})$ for any given $\omega_{j},j\in F_{n}$.

2.3 Nodewise regression (NWR) of DFTs

Building upon the parallel to GGM, if one is only interested in recovering the sparsity structure of $f^{-1}(\omega)$ and not in estimating the full matrix, a natural alternative to graphical Lasso is to perform nodewise regression along the lines of Meinshausen and Bühlmann [2006], where we regress one coordinate of the DFTs on all the other coordinates. This can be done by running $p$ different complex Lasso regressions. It is closely related to the approach in Krampe and Paparoditis [2022], where instead of the complex Lasso penalty the authors used separate Lasso penalties on the real and imaginary parts of a complex number. While we do not pursue the asymptotic theory of this method in our paper, we compare the model selection performance of our method against it in the simulation section.

Next we provide a brief description of nodewise DFT regression. Consider the data matrix $\mathcal{Z}$ for estimation at the Fourier frequency $\omega_{j}$. Going forward, we will drop the subscript $j$ when the index is clear from the context.

\mathcal{Z}:=\left[\begin{array}{cccc}d_{j-m,1}&d_{j-m,2}&\ldots&d_{j-m,p}\\ d_{j-m+1,1}&d_{j-m+1,2}&\ldots&d_{j-m+1,p}\\ \vdots&\vdots&\vdots&\vdots\\ d_{j,1}&d_{j,2}&\ldots&d_{j,p}\\ d_{j+1,1}&d_{j+1,2}&\ldots&d_{j+1,p}\\ \vdots&\vdots&\vdots&\vdots\\ d_{j+m,1}&d_{j+m,2}&\ldots&d_{j+m,p}\end{array}\right].

We will denote the $k$th column of $\mathcal{Z}$ as $\mathcal{Z}_{k}$, and the matrix containing all the other columns as $\mathcal{Z}_{-k}$. Our plan is to regress $\mathcal{Z}_{k}$ on $\mathcal{Z}_{-k}$, and use the sparsity pattern of the regression coefficients to learn the sparsity pattern in the off-diagonal part of the $k$th row (equivalently, column) of $f^{-1}(\omega_{j})$. In the low-dimensional regime, we can run an ordinary least squares (OLS) regression with complex response and predictors. In high dimensions, we use the complex Lasso to estimate the nodewise regression coefficients

\hat{\beta}_{k}=\operatorname*{arg\,min}_{\beta\in\mathbb{C}^{p-1}}\frac{1}{2N}\|\mathcal{Z}_{k}-\mathcal{Z}_{-k}\beta\|^{2}+\lambda_{k}\|\beta\|_{1}.   (2.8)

Here $\|\beta\|_{1}=\sum_{\ell=1}^{p-1}|\beta_{\ell}|$, $N=2m+1$ is the effective sample size, and $\lambda_{k}>0$ is the penalty parameter. The $\ell$th entry in the off-diagonal part of the $k$th row of $f^{-1}(\omega_{j})$, i.e. of $\left[f^{-1}(\omega_{j})\right]_{k,-k}$, is then estimated as zero if and only if $\hat{\beta}_{k\ell}=0$. This may lead to a potentially asymmetric sparsity pattern in $f^{-1}(\omega_{j})$, but we can symmetrize the sparsity pattern using standard post-processing techniques proposed in Meinshausen and Bühlmann [2006], as sketched below.
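A sketch of the symmetrization step (the standard "AND"/"OR" rules of Meinshausen and Bühlmann [2006]; names illustrative):

```python
# Sketch: symmetrize the nodewise support estimates into an undirected graph.
import numpy as np

def symmetrize_support(B, rule="and"):
    """B is a p x p boolean matrix with B[k, l] = (beta_hat_{k,l} != 0); diagonal ignored."""
    S = np.asarray(B, dtype=bool)
    return (S & S.T) if rule == "and" else (S | S.T)
```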

3 Optimization algorithms

While standard convex optimization algorithms such as ADMM can be used to solve (2.6), they do not exploit the underlying sparsity structure of the problem and can be computationally expensive when solving the problem for different penalty parameters $\lambda$ and frequencies $\omega_{j}$. In the literature on high-dimensional regression and GGM, coordinate descent algorithms are popular because they provide considerable speed-up by using problem-specific strategies such as warm starts and active set selection [Hastie et al., 2015]. A direct adoption of coordinate descent for complex graphical Lasso is difficult since the penalty $|\beta|$, for $\beta\in\mathbb{C}$, is effectively a group Lasso penalty $\sqrt{\textbf{Re}(\beta)^{2}+\textbf{Im}(\beta)^{2}}$ over its real and imaginary parts. In general, a block coordinate descent update for group Lasso does not have a closed form, and one needs to solve a separate optimization problem at every iteration of the block coordinate descent.

Figure 2: Anti-clockwise and orthogonal rotation property of $\mathrm{i}$ (multiplication by $\mathrm{i}$ maps $1\mapsto\mathrm{i}$, $\mathrm{i}\mapsto-1$, $-1\mapsto-\mathrm{i}$ and $-\mathrm{i}\mapsto 1$)

Our key observation is that the specific group Lasso regression underlying the complex Lasso has an attractive feature: the predictors within a single group are orthogonal, which yields closed form updates in the block coordinate descent iterations. Using this observation, we first propose CLASSO, a coordinate descent algorithm for Lasso with complex variables. Then we use CLASSO to build CGLASSO, a graphical Lasso algorithm for complex variables that solves (2.6).

While this observation of orthogonality alone is sufficient for developing fast algorithms, we argue that this is not a coincidence. It reflects a deep connection between complex and real matrices that can be leveraged to systematically realify a broader family of complex optimization problems. We formalize this connection first, and then present the CLASSO and CGLASSO algorithms.

3.1 Realifying complex optimization with ring isomorphism

At the core of our algorithm lies a well-known field isomorphism between the set of complex scalars $\mathbb{C}$ and a set of $2\times 2$ orthogonal real matrices. We show that this can be extended from $\mathbb{C}$ to a ring isomorphism on complex-valued matrices $\mathbb{C}^{m\times n}$. At a high level, this means that sums, products and inverses of $p\times p$ complex matrices have direct parallels to the same operations in the space of $2p\times 2p$ real-valued matrices. The upshot is that some generic complex-variable optimization problems arising in frequency-domain time series can be reduced to well-understood optimization problems over real variables arising in high-dimensional statistics.

First, we illustrate this isomorphism in Figure 2 for a single complex number, $\mathrm{i}$. The relationships $\mathrm{i}\times 1=\mathrm{i}$ and $\mathrm{i}\times\mathrm{i}=-1$ imply that the operation “multiplication by $\mathrm{i}$” can be viewed as a $90^{\circ}$ rotation that maps the unit vector $(1,0)$ on the complex plane to $(0,1)$, and maps the unit vector $(0,1)$ to $(-1,0)$. This linear map corresponds to the $2\times 2$ orthogonal matrix

\mathbf{J}=\left[\begin{array}{cc}0&-1\\ 1&0\end{array}\right].

If we identify $1$ with $\mathbf{I}$ and $\mathrm{i}$ with $\mathbf{J}$, we can identify any complex number $z=1\times a+\mathrm{i}\times b$, $a,b\in\mathbb{R}$, with a $2\times 2$ matrix $a\mathbf{I}+b\mathbf{J}$ of the form given in (3.1).
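A quick numerical check of this identification, with a throwaway helper `phi` mirroring (3.1):

```python
# Sketch: verify the isomorphism properties formalized in Proposition 3.1 below.
import numpy as np

def phi(z):
    return np.array([[z.real, -z.imag],
                     [z.imag,  z.real]])

z, w = 2 + 3j, -1 + 0.5j
assert np.allclose(phi(z + w), phi(z) + phi(w))         # additivity
assert np.allclose(phi(z * w), phi(z) @ phi(w))         # multiplicativity
assert np.allclose(phi(1 / z), np.linalg.inv(phi(z)))   # inverses map to inverses
assert np.allclose(phi(z.conjugate()), phi(z).T)        # conjugation maps to transpose
```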

Next we formalize and expand on the above heuristic of isomorphism through a series of propositions. For a brief introduction to common structures in abstract algebra, e.g. groups, rings, fields and isomorphisms, we refer the readers to Appendix A.

Proposition 3.1.

Define $\varphi:\mathbb{C}\rightarrow\mathbb{R}^{2\times 2}$ as

\varphi(z)=\begin{bmatrix}\textbf{Re}(z)&-\textbf{Im}(z)\\ \textbf{Im}(z)&\textbf{Re}(z)\end{bmatrix}.

Then $\varphi$ is a field isomorphism between $\mathbb{C}$ and $\mathcal{M}^{2\times 2}$, where

\mathcal{M}^{2\times 2}:=\left\{\begin{bmatrix}a&-b\\ b&a\end{bmatrix}:a,b\in\mathbb{R}\right\}.   (3.1)

This implies

(i) $\varphi(0)=\mathbf{0}_{2\times 2}$ and $\varphi(1)=\mathbf{I}_{2}$.

(ii) $\varphi(z+z^{\prime})=\varphi(z)+\varphi(z^{\prime})$ for all $z,z^{\prime}\in\mathbb{C}$. In particular, $\varphi(-z)=-\varphi(z)$ for all $z\in\mathbb{C}$.

(iii) $\varphi(zz^{\prime})=\varphi(z)\varphi(z^{\prime})$ for all $z,z^{\prime}\in\mathbb{C}$. In particular, $\varphi(z^{-1})=[\varphi(z)]^{-1}$ for all $z\in\mathbb{C},z\neq 0$.

Also, for any $z\in\mathbb{C}$, denote $\varphi(z)=(\varphi_{1}(z),\varphi_{2}(z))$, where $\varphi_{1}(z),\varphi_{2}(z)\in\mathbb{R}^{2}$. Then the following hold:

(i) $\langle\varphi_{1}(z),\varphi_{2}(z)\rangle=0$,

(ii) $\|\varphi_{1}(z)\|^{2}=\|\varphi_{2}(z)\|^{2}=\frac{1}{2}\|\varphi(z)\|_{F}^{2}=|z|^{2}$,

(iii) $\varphi(z^{\dagger})=[\varphi(z)]^{\top}$.

The proof is given in Appendix B. The isomorphism $\varphi$ can be naturally extended to higher dimensional vector spaces and to the matrix spaces $\mathbb{C}^{m\times n}$. However, the extended map $\varphi$ will no longer be a field isomorphism since for $m,n\geq 1$, $\mathbb{C}^{n}$ and $\mathbb{C}^{m\times n}$ are not necessarily fields, but rings. So it reduces to a ring isomorphism.

For any $\mathbf{z}=(z_{1},\ldots,z_{n})^{\top}\in\mathbb{C}^{n}$, we extend $\varphi:\mathbb{C}^{n}\rightarrow\mathbb{R}^{2n\times 2}$ as

\varphi(\mathbf{z})=\begin{bmatrix}\varphi(z_{1})\\ \vdots\\ \varphi(z_{n})\end{bmatrix}=[\varphi_{1}(\mathbf{z})\ \ \varphi_{2}(\mathbf{z})],\qquad\varphi_{1}(\mathbf{z}),\varphi_{2}(\mathbf{z})\in\mathbb{R}^{2n}.   (3.2)

Similarly, for any $Z=((z_{ij}))_{i,j=1,1}^{m,n}\in\mathbb{C}^{m\times n}$, the extension $\varphi:\mathbb{C}^{m\times n}\rightarrow\mathbb{R}^{2m\times 2n}$ is

\varphi(Z)=\begin{bmatrix}\varphi(z_{11})&\cdots&\varphi(z_{1n})\\ \vdots&\ddots&\vdots\\ \varphi(z_{m1})&\cdots&\varphi(z_{mn})\end{bmatrix}.   (3.3)

Below we state the properties of the extended map $\varphi$ in (3.2) and (3.3) without proof, since they are similar to Proposition 3.1.

Proposition 3.2.

For any $\mathbf{z}\in\mathbb{C}^{n},\ n\geq 1$, the extended map $\varphi:\mathbb{C}^{n}\rightarrow\mathbb{R}^{2n\times 2}$ in (3.2) satisfies

(i) $\langle\varphi_{1}(\mathbf{z}),\varphi_{2}(\mathbf{z})\rangle=0$,

(ii) $\|\varphi_{1}(\mathbf{z})\|^{2}=\|\varphi_{2}(\mathbf{z})\|^{2}=\frac{1}{2}\|\varphi(\mathbf{z})\|_{F}^{2}=\|\mathbf{z}\|^{2}$,

and for any $Z\in\mathbb{C}^{m\times n},\ m,n\geq 1$, the extended map $\varphi:\mathbb{C}^{m\times n}\rightarrow\mathbb{R}^{2m\times 2n}$ in (3.3) satisfies

(i) $\langle\varphi_{1}(Z_{\cdot j}),\varphi_{2}(Z_{\cdot j})\rangle=0$ for $j=1,\ldots,n$,

(ii) $\|\varphi(Z)\|_{F}^{2}=2\|Z\|_{F}^{2}$,

(iii) $\varphi(Z^{\dagger})=[\varphi(Z)]^{\top}$.

Note that the extended map $\varphi$ is also a group isomorphism, and the additive identities are preserved. Hence the extension $\varphi$ is a ring isomorphism as well. We formally state this result next.

Proposition 3.3.

Let $\text{Image}(\varphi)$ be the image of the map $\varphi$. Then the following statements hold:

(i) The extended map $\varphi:\mathbb{C}^{n}\rightarrow\mathbb{R}^{2n\times 2}$ in (3.2) is a ring isomorphism between $\mathbb{C}^{n}$ and $\text{Image}(\varphi)$,

(ii) The extended map $\varphi:\mathbb{C}^{m\times n}\rightarrow\mathbb{R}^{2m\times 2n}$ in (3.3) is a ring isomorphism between $\mathbb{C}^{m\times n}$ and $\text{Image}(\varphi)$,

(iii) Let $GL_{n}(\mathbb{C})$ be the set of all $n\times n$ invertible matrices with entries in $\mathbb{C}$. Then, for any $A\in GL_{n}(\mathbb{C})$, $\varphi(A^{-1})=[\varphi(A)]^{-1}$.

The proof of Proposition 3.3 is provided in Appendix B.

Remark 3.1.

Note that even though the map $\varphi$ doubles the dimension of the input object, the computational complexity of dealing with objects from $\mathcal{M}^{2\times 2}$ remains the same as that of operations involving complex numbers.

So far the function $\varphi(Z)$ replaces every entry of a complex matrix $Z$ by a $2\times 2$ real matrix. While this is useful for proving the isomorphism properties, rearranging the rows and columns of $\varphi(Z)$ helps us present our algorithms in a more compact form. Let $\Pi_{k},k\in\mathbb{N}$ be the $2k\times 2k$ permutation matrix that maps $\{1,\ldots,2k\}$ to $\{1,3,\ldots,2k-1,2,4,\ldots,2k\}$. In matrix notation,

\Pi_{k}=[\mathbf{e}_{1}:\mathbf{e}_{3}:\cdots:\mathbf{e}_{2k-1}:\mathbf{e}_{2}:\mathbf{e}_{4}:\cdots:\mathbf{e}_{2k}].

Then for $\mathbf{z}\in\mathbb{C}^{n}$ and any $Z\in\mathbb{C}^{m\times n}$, we introduce the notation

\widetilde{\mathbf{z}}=\Pi_{n}^{\top}\varphi(\mathbf{z})\mathbf{e}_{1}=\begin{bmatrix}\textbf{Re}(\mathbf{z})\\ \textbf{Im}(\mathbf{z})\end{bmatrix},\qquad\widetilde{\widetilde{Z}}=\Pi_{m}^{\top}\varphi(Z)\Pi_{n}=\begin{bmatrix}\textbf{Re}(Z)&-\textbf{Im}(Z)\\ \textbf{Im}(Z)&\textbf{Re}(Z)\end{bmatrix}.
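For concreteness, the rearranged objects $\widetilde{\mathbf{z}}$ and $\widetilde{\widetilde{Z}}$ can be built directly from the real and imaginary parts, as in the following sketch (function names are ours, not the paper's):

```python
# Sketch: z_tilde = [Re(z); Im(z)] and Z_tildetilde = [[Re Z, -Im Z], [Im Z, Re Z]].
import numpy as np

def vec_tilde(z):
    return np.concatenate([z.real, z.imag])

def mat_tildetilde(Z):
    return np.block([[Z.real, -Z.imag],
                     [Z.imag,  Z.real]])
```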

3.1.1 Ordinary least squares (OLS) with complex variables

The aforementioned properties of $\varphi$ can be used to rewrite standard complex-variable optimization problems as transformed problems over real variables. For instance, the objective function of an ordinary least squares (OLS) problem can be rewritten as

\|Y-X\beta\|_{2}^{2} =\frac{1}{2}\|\varphi(Y-X\beta)\|_{F}^{2}
=\frac{1}{2}\|\Pi_{n}^{\top}\varphi(Y-X\beta)\Pi_{1}\|_{F}^{2} \quad\text{[the Frobenius norm is invariant under orthogonal transforms]}
=\frac{1}{2}\|\Pi_{n}^{\top}\varphi(Y)\Pi_{1}-\Pi_{n}^{\top}\varphi(X\beta)\Pi_{1}\|_{F}^{2}
=\frac{1}{2}\|\Pi_{n}^{\top}\varphi(Y)\Pi_{1}-(\Pi_{n}^{\top}\varphi(X)\Pi_{p})(\Pi_{p}^{\top}\varphi(\beta)\Pi_{1})\|_{F}^{2}
=\frac{1}{2}\|\widetilde{\widetilde{Y}}-\widetilde{\widetilde{X}}\widetilde{\widetilde{\beta}}\|_{F}^{2}
=\frac{1}{2}\cdot 2\|(\widetilde{\widetilde{Y}}-\widetilde{\widetilde{X}}\widetilde{\widetilde{\beta}})\mathbf{e}_{1}\|_{2}^{2}
=\|\widetilde{Y}-\widetilde{\widetilde{X}}\widetilde{\beta}\|_{2}^{2}.   (3.4)

Therefore the isomorphism $\varphi$ provides a bridge between the complex-variable and real-variable least squares problems. A similar correspondence can be derived for the normal equations of the least squares problem:

X^{\dagger}X\beta=X^{\dagger}Y\ \iff\ \varphi(X^{\dagger}X\beta)=\varphi(X^{\dagger}Y)
\iff\ [\varphi(X)]^{\top}\varphi(X)\varphi(\beta)\mathbf{e}_{1}=[\varphi(X)]^{\top}\varphi(Y)\mathbf{e}_{1}
\iff\ (\Pi_{n}^{\top}\varphi(X)\Pi_{p})^{\top}(\Pi_{n}^{\top}\varphi(X)\Pi_{p})\Pi_{p}^{\top}\varphi_{1}(\beta)=(\Pi_{n}^{\top}\varphi(X)\Pi_{p})^{\top}\Pi_{n}^{\top}\varphi_{1}(Y)
\iff\ \widetilde{\widetilde{X}}^{\top}\widetilde{\widetilde{X}}\widetilde{\beta}=\widetilde{\widetilde{X}}^{\top}\widetilde{Y}.   (3.5)

This shows that the transformed normal equation (3.5) is equivalent to the complex normal equation $X^{\dagger}X\beta=X^{\dagger}Y$.
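A small numerical check of this equivalence, reusing `vec_tilde` and `mat_tildetilde` from the sketch above (random data, purely illustrative):

```python
# Sketch: the realified normal equations (3.5) recover the complex OLS solution.
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 4
X = rng.standard_normal((n, p)) + 1j * rng.standard_normal((n, p))
Y = rng.standard_normal(n) + 1j * rng.standard_normal(n)

beta_c = np.linalg.solve(X.conj().T @ X, X.conj().T @ Y)       # complex normal equations
XX = mat_tildetilde(X)
beta_r = np.linalg.solve(XX.T @ XX, XX.T @ vec_tilde(Y))       # realified normal equations
assert np.allclose(beta_r, vec_tilde(beta_c))
```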

3.1.2 Complex Gaussian log-likelihood for precision matrix estimation

The Gaussian log-likelihood can alternatively be written in terms of real variables too. Consider the setup with i.i.d. random variables $Z_{1},\ldots,Z_{n}\sim\mathcal{N}_{\mathbb{C}}(0,\Theta^{-1})$ for $\Theta\in\mathcal{H}_{++}^{p}$, and denote the complex-valued sample Gram matrix as $P=\sum_{i=1}^{n}Z_{i}Z_{i}^{\dagger}/n$.

The log-likelihood can be expressed (ignoring constants) as

L_{\mathbb{C}}(\Theta)=\log\det\Theta-\operatorname*{trace}(P\Theta).

Let us consider the analogous function of real matrices

L_{\mathbb{R}}(\Theta)=\log\det\varphi(\Theta)-\operatorname*{trace}(\varphi(P)\varphi(\Theta)).

If $\Theta=U\Lambda U^{\dagger}$ is the spectral decomposition of $\Theta$, with $U$ a unitary matrix and $\Lambda$ a diagonal matrix with all positive real entries, then

\varphi(\Theta) =\varphi(U\Lambda U^{\dagger})
=\varphi(U)\varphi(\Lambda)\varphi(U)^{\top}
=\varphi(U)\Pi_{p}\Pi_{p}^{\top}\varphi(\Lambda)\Pi_{p}\Pi_{p}^{\top}\varphi(U)^{\top}
=\varphi(U)\Pi_{p}\begin{bmatrix}\Lambda&0\\ 0&\Lambda\end{bmatrix}\Pi_{p}^{\top}\varphi(U)^{\top}.

Therefore,

\log\det\varphi(\Theta) =\log\left(\det(\varphi(U))\det(\Pi_{p})(\det\Lambda)^{2}\det(\Pi_{p}^{\top})\det(\varphi(U)^{\top})\right)
=\log\left(\det(\varphi(U)\varphi(U)^{\top})(\det\Lambda)^{2}\right)
=\log\left(\det(\varphi(UU^{\dagger}))(\det\Lambda)^{2}\right)
=\log\left((\det\Lambda)^{2}\right)
=2\log\det\Lambda=2\log\det\Theta.   (3.6)

Consider the Schur decomposition $P\Theta=V\Delta V^{\dagger}$, with $V$ a unitary matrix and $\Delta$ an upper triangular matrix with the eigenvalues of $P\Theta$ on its diagonal. Let $\Theta^{1/2}:=U\Lambda^{1/2}U^{\dagger}$ be the square root of the positive definite matrix $\Theta$, where $\Lambda^{1/2}$ is the diagonal matrix whose entries are the square roots of the diagonal entries of $\Lambda$. Since $P\Theta=\Theta^{-1/2}(\Theta^{1/2}P\Theta^{1/2})\Theta^{1/2}$ is similar to the positive definite matrix $\Theta^{1/2}P\Theta^{1/2}$, all the eigenvalues of $P\Theta$ are positive, i.e. $\Delta$ has real positive diagonal entries. The trace term can then be expressed as

\operatorname*{trace}(\varphi(P)\varphi(\Theta))=\operatorname*{trace}(\varphi(P\Theta)) =\operatorname*{trace}(\varphi(V\Delta V^{\dagger}))
=\operatorname*{trace}(\varphi(V)\varphi(\Delta)\varphi(V)^{\top})
=\operatorname*{trace}(\varphi(\Delta)\varphi(V)^{\top}\varphi(V))
=\operatorname*{trace}(\varphi(\Delta)\varphi(V^{\dagger}V))
=\operatorname*{trace}(\varphi(\Delta))
=\operatorname*{trace}(\varphi(\Delta)\Pi_{p}\Pi_{p}^{\top})
=\operatorname*{trace}(\Pi_{p}^{\top}\varphi(\Delta)\Pi_{p})
=\operatorname*{trace}(\text{diag}(\Delta,\Delta))
=2\operatorname*{trace}(\Delta)=2\operatorname*{trace}(P\Theta),   (3.7)

where the fact that trace of a product of matrices is invariant under cyclic permutations is used twice in the calculation. Combining (3.6) and (3.7), we obtain

L_{\mathbb{R}}(\Theta)=2L_{\mathbb{C}}(\Theta),

which implies that maximizing $L_{\mathbb{C}}(\Theta)$ with respect to $\Theta$ is equivalent to maximizing $L_{\mathbb{R}}(\Theta)$ with respect to $\varphi(\Theta)$, or equivalently $\Theta$ (there is a one-to-one correspondence between $\Theta$ and $\varphi(\Theta)$).
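A numerical sanity check of this identity, again using `mat_tildetilde` from the sketch above (which is $\Pi_{p}^{\top}\varphi(\cdot)\Pi_{p}$ and hence has the same determinant and trace as $\varphi(\cdot)$):

```python
# Sketch: verify L_R(Theta) = 2 * L_C(Theta) for a random Hermitian PD Theta and Gram matrix P.
import numpy as np

rng = np.random.default_rng(1)
p = 3
A = rng.standard_normal((p, p)) + 1j * rng.standard_normal((p, p))
Theta = A @ A.conj().T + p * np.eye(p)          # Hermitian positive definite
B = rng.standard_normal((p, p)) + 1j * rng.standard_normal((p, p))
P = B @ B.conj().T / p                          # Hermitian non-negative definite

L_C = np.log(np.linalg.det(Theta).real) - np.trace(P @ Theta).real
phi_T, phi_P = mat_tildetilde(Theta), mat_tildetilde(P)
L_R = np.log(np.linalg.det(phi_T)) - np.trace(phi_P @ phi_T)
assert np.isclose(L_R, 2 * L_C)
```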

3.2 Lasso with complex variables

In order to solve the complex Lasso optimization problem (2.8), we can use the group Lasso [Yuan and Lin, 2006] for real-valued predictors and response. We show this using generic notation for the complex Lasso. Let $X\in\mathbb{C}^{n\times p}$ and $Y\in\mathbb{C}^{n}$ be the predictors and the response in a complex-variable regression problem. The Lasso problem is

\hat{\beta}=\operatorname*{arg\,min}_{\beta\in\mathbb{C}^{p}}\frac{1}{2n}\|Y-X\beta\|_{2}^{2}+\lambda\|\beta\|_{1},   (3.8)

where $\lambda>0$ is the penalty parameter. In the optimization problem (2.8), $\mathcal{Z}_{-k}$ and $\mathcal{Z}_{k}$ play the roles of $X$ and $Y$ respectively.

Algorithm 1 Complex Lasso (CLASSO) with complex update
Objective: Calculate $\hat{\beta}=\operatorname*{arg\,min}_{\beta\in\mathbb{C}^{p}}\frac{1}{2n}\|Y-X\beta\|^{2}+\lambda\|\beta\|_{1}$
Input: $X=[X_{1}:X_{2}:\ldots:X_{p}]\in\mathbb{C}^{n\times p}$ with $\|X_{j}\|_{2}=\sqrt{n}$ for all $j=1,\ldots,p$, $Y\in\mathbb{C}^{n}$, $\hat{\beta}^{(0)}\in\mathbb{C}^{p}$, $\lambda>0$
1. Initialize: $\hat{\beta}=\hat{\beta}^{(0)}$
2. Calculate the residual $r=Y-\sum_{k=1}^{p}X_{k}\hat{\beta}_{k}$
3. Repeat until convergence for $j=1,\ldots,p$:
  (a) Calculate the $j$th partial residual $r^{(j)}=r+X_{j}\hat{\beta}_{j}$
  (b) $\hat{\beta}_{j}\leftarrow{\cal S}_{\lambda}\left(\frac{1}{n}X_{j}^{\dagger}r^{(j)}\right)$
  (c) $r\leftarrow r^{(j)}-X_{j}\hat{\beta}_{j}$
4. Return $\hat{\beta}$

The Lasso problem (3.8) can be alternatively expressed as an optimization problem with real variables. We use (3.4) and the following fact for the $\ell_{1}$ regularization part of (3.8):

\|\beta\|_{1}=\sum_{j=1}^{p}|\beta_{j}|=\sum_{j=1}^{p}\|\varphi_{1}(\beta_{j})\|_{2}=\sum_{j=1}^{p}\|\widetilde{\beta_{j}}\|_{2}.

Altogether, the optimization problem (3.8) is equivalent to

\operatorname*{arg\,min}_{\widetilde{\beta_{1}},\ldots,\widetilde{\beta_{p}}\in\mathbb{R}^{2}}\frac{1}{2n}\|\widetilde{Y}-\sum_{j=1}^{p}\widetilde{\widetilde{X_{j}}}\widetilde{\beta_{j}}\|_{2}^{2}+\lambda\sum_{j=1}^{p}\|\widetilde{\beta_{j}}\|_{2}.   (3.9)

This is a group Lasso problem with $p$ groups, each of size 2. Let the optimizer of (3.9) be $(\hat{\widetilde{\beta_{1}}},\ldots,\hat{\widetilde{\beta_{p}}})^{\top}$ and let $\widetilde{r}^{(j)}=\widetilde{Y}-\sum_{k\neq j}\widetilde{\widetilde{X_{k}}}\hat{\widetilde{\beta_{k}}}$ be the vector of $j$th partial residuals. Following equation (4.14) for the group Lasso in Hastie et al. [2015], the optimizer satisfies $\hat{\widetilde{\beta_{j}}}=0$ if $\left\|\frac{1}{n}\widetilde{\widetilde{X_{j}}}^{\top}\widetilde{r}^{(j)}\right\|<\lambda$, and otherwise the following equation:

\hat{\widetilde{\beta_{j}}}=\left(\frac{1}{n}\widetilde{\widetilde{X_{j}}}^{\top}\widetilde{\widetilde{X_{j}}}+\frac{\lambda}{\left\|\hat{\widetilde{\beta_{j}}}\right\|_{2}}I_{2}\right)^{-1}\frac{1}{n}\widetilde{\widetilde{X_{j}}}^{\top}\widetilde{r}^{(j)}.   (3.10)

Equation (3.10) does not have a closed form solution in general, but here we can exploit the structure of $\widetilde{\widetilde{X_{j}}}$. For each $j$, the columns of $\widetilde{\widetilde{X_{j}}}$ are orthogonal, which implies

\widetilde{\widetilde{X_{j}}}^{\top}\widetilde{\widetilde{X_{j}}}=\varphi(X_{j})^{\top}\underbrace{\Pi_{n}\Pi_{n}^{\top}}_{I_{2n}}\varphi(X_{j})=\varphi(X_{j}^{\dagger})\varphi(X_{j})=\varphi(X_{j}^{\dagger}X_{j})=\varphi(\|X_{j}\|_{2}^{2})=\|X_{j}\|_{2}^{2}I_{2}.

Therefore, the update (3.10) has a closed form expression for $\hat{\widetilde{\beta_{j}}}$ given by

\hat{\widetilde{\beta_{j}}}=\left(1-\frac{\lambda}{\left\|\frac{1}{n}\widetilde{\widetilde{X_{j}}}^{\top}\widetilde{r}^{(j)}\right\|_{2}}\right)_{+}\frac{\frac{1}{n}\widetilde{\widetilde{X_{j}}}^{\top}\widetilde{r}^{(j)}}{\frac{1}{n}\|X_{j}\|_{2}^{2}}.   (3.11)

Soft threshold operator. We define two closely related variants of the soft threshold operator widely used in the Lasso literature. For any $t\in\mathbb{R}$, let $t_{+}=\max\{0,t\}$. For every $z\in\mathbb{C}$, we define the soft threshold operator $\mathcal{S}_{\lambda}:\mathbb{C}\rightarrow\mathbb{C}$ as

\mathcal{S}_{\lambda}(z)=(|z|-\lambda)_{+}\frac{z}{|z|},   (3.12)

and for any $x\in\mathbb{R}^{2}$, we similarly define $\widetilde{\cal S}_{\lambda}:\mathbb{R}^{2}\rightarrow\mathbb{R}^{2}$ as

\widetilde{\cal S}_{\lambda}(x)=(\|x\|_{2}-\lambda)_{+}\frac{x}{\|x\|_{2}},

the soft threshold operator in $\mathbb{R}^{2}$. The two operators are connected by the map $\varphi$ and the following identity. For any $z\in\mathbb{C}$, we have

\widetilde{\mathcal{S}}_{\lambda}(\varphi(z))=(\|\varphi(z)\|-\lambda)_{+}\frac{\varphi(z)}{\|\varphi(z)\|_{2}}=(|z|-\lambda)_{+}\frac{\varphi(z)}{|z|}=\varphi(\mathcal{S}_{\lambda}(z)),

where we use property (iii) from Proposition 3.1. A similar univariate threshold operator has been defined in Sun et al. [2018].

We now use these soft threshold operators to obtain a more convenient form of (3.11). The update (3.11) immediately becomes $\hat{\widetilde{\beta}}_{j}=\widetilde{S}_{\lambda}\left(\widetilde{\widetilde{X_{j}}}^{\top}\widetilde{r}^{(j)}/n\right)\Big{/}\left(\|X_{j}\|_{2}^{2}/n\right)$. If the columns of $X$ are scaled so that $\|X_{j}\|_{2}^{2}=n$ for every $j=1,\ldots,p$, then

\hat{\widetilde{\beta_{j}}}=\widetilde{\cal S}_{\lambda}\left(\frac{1}{n}\widetilde{\widetilde{X_{j}}}^{\top}\widetilde{r}^{(j)}\right).   (3.13)

Using the group Lasso update provided in Hastie et al. [2015] and the soft threshold operator notation in (3.13), the group Lasso update in this case is

\widetilde{r}^{(j)} \leftarrow\widetilde{r}+\widetilde{\widetilde{X_{j}}}\hat{\widetilde{\beta_{j}}},
\hat{\widetilde{\beta_{j}}} \leftarrow\widetilde{\mathcal{S}}_{\lambda}\left(\frac{1}{n}\widetilde{\widetilde{X_{j}}}^{\top}\widetilde{r}^{(j)}\right),
\widetilde{r} \leftarrow\widetilde{r}^{(j)}-\widetilde{\widetilde{X_{j}}}\hat{\widetilde{\beta_{j}}}.

We denote $\hat{\beta}_{j}=\hat{\widetilde{\beta_{j}}}_{1}+\mathrm{i}\hat{\widetilde{\beta_{j}}}_{2}$, i.e. $\hat{\widetilde{\beta_{j}}}=\varphi(\hat{\beta}_{j})$, and $r^{(j)}=\widetilde{r}^{(j)}_{1}+\mathrm{i}\ \widetilde{r}^{(j)}_{2}$ for $j=1,\ldots,p$. Therefore, we can write

\varphi(r^{(j)})=\widetilde{r}^{(j)} =\widetilde{Y}-\sum_{k\neq j}\widetilde{\widetilde{X_{k}}}\hat{\widetilde{\beta_{k}}}
=\varphi(Y)-\sum_{k\neq j}\varphi(X_{k})\varphi(\hat{\beta}_{k})
=\varphi(Y-\sum_{k\neq j}X_{k}\hat{\beta}_{k})
=\varphi(r+X_{j}\hat{\beta}_{j}),

where $r=Y-\sum_{k=1}^{p}X_{k}\hat{\beta}_{k}$. Using this representation, the Lasso update becomes

\varphi(r^{(j)}) \leftarrow\varphi(r+X_{j}\hat{\beta}_{j}),
\varphi(\hat{\beta}_{j}) \leftarrow\varphi\left({\cal S}_{\lambda}\left(X_{j}^{\dagger}r^{(j)}/n\right)\right),
\varphi(r) \leftarrow\varphi(r^{(j)}-X_{j}\hat{\beta}_{j}).

Therefore, $\varphi$ being an isomorphism, we can simply omit it from the Lasso updates above and still obtain an equivalent complex-valued update. These are steps 3(a)-(c) of Algorithm 1.
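A minimal numpy sketch of Algorithm 1, assuming the columns of $X$ are scaled to have norm $\sqrt{n}$ (convergence checks and names are ours):

```python
# Sketch of CLASSO (Algorithm 1): coordinate descent with the complex soft threshold (3.12).
import numpy as np

def soft_threshold(z, lam):
    mod = abs(z)
    return 0.0 if mod <= lam else (mod - lam) * z / mod

def classo(X, Y, lam, beta0=None, max_iter=500, tol=1e-8):
    n, p = X.shape
    beta = np.zeros(p, dtype=complex) if beta0 is None else np.asarray(beta0, dtype=complex).copy()
    r = Y - X @ beta                                   # full residual
    for _ in range(max_iter):
        max_change = 0.0
        for j in range(p):
            r_j = r + X[:, j] * beta[j]                # j-th partial residual
            new = soft_threshold(X[:, j].conj() @ r_j / n, lam)
            max_change = max(max_change, abs(new - beta[j]))
            r = r_j - X[:, j] * new
            beta[j] = new
        if max_change < tol:
            break
    return beta
```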

Complex Lasso with covariance update. We can modify Algorithm 1 to take the covariance matrices of $X$ and $Y$ as inputs, which is needed for the graphical Lasso. In step 3(a), we can track $X^{\dagger}r^{(j)}$ instead of $r^{(j)}$, i.e. the update for each $j=1,\ldots,p$ will be

X^{\dagger}r^{(j)} =X^{\dagger}r+X^{\dagger}X_{j}\hat{\beta}_{j},
\hat{\beta}_{j} \leftarrow{\cal S}_{\lambda}\left(\frac{1}{n}X_{j}^{\dagger}r^{(j)}\right),
X^{\dagger}r \leftarrow X^{\dagger}r^{(j)}-X^{\dagger}X_{j}\hat{\beta}_{j},

which is aligned with the fact that the information in $X^{\dagger}Y$ and $X^{\dagger}X$ is sufficient for estimation. The modified algorithm is presented as Algorithm 2. It should also be noted that the objective function of Algorithm 2 differs slightly from that of Algorithm 1 in terms of scaling. The specific routine followed in Algorithm 2 is used in the CGLASSO algorithm we propose in Section 3.3.

Algorithm 2 Complex Lasso with covariance update (CLASSO.COV)
Objective: Calculate $\hat{\beta}=\operatorname*{arg\,min}_{\beta\in\mathbb{C}^{p}}\frac{1}{2}\beta^{\dagger}S_{XX}\beta-\mathrm{Re}(\mathbf{s}_{XY}^{\dagger}\beta)+\lambda\|\beta\|_{1}$
Input: SXXp×pS_{XX}\in\mathbb{C}^{p\times p}, 𝐬XYp\mathbf{s}_{XY}\in\mathbb{C}^{p}, β^(0)\hat{\beta}^{(0)}, and λ>0\lambda>0
1. Initialize: β^=β^(0)\hat{\beta}=\hat{\beta}^{(0)}
2. Calculate 𝐬Xr=𝐬XYSXXβ^\mathbf{s}_{Xr}=\mathbf{s}_{XY}-S_{XX}\hat{\beta}
3. Repeat until convergence for j=1,,pj=1,\cdots,p:
  • (a)

v\leftarrow\mathbf{s}_{Xr}+(S_{XX})_{\cdot j}\hat{\beta}_{j}

  • (b)

    β^j𝒮λ(vj)\hat{\beta}_{j}\leftarrow{\cal S}_{\lambda}(v_{j}) where 𝒮λ{\cal S}_{\lambda} is defined in (3.12)

  • (c)

\mathbf{s}_{Xr}\leftarrow v-(S_{XX})_{\cdot j}\hat{\beta}_{j}

4. Return β^\hat{\beta}
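For reference, here is a numpy sketch of Algorithm 2, reusing the soft_threshold helper from the sketch above. It assumes the inputs are normalized cross products with unit diagonal (e.g. $S_{XX}=X^{\dagger}X/n$ after column scaling and $\mathbf{s}_{XY}=X^{\dagger}Y/n$); the convergence check is ours.

import numpy as np

def classo_cov(S_xx, s_xy, beta0, lam, max_iter=500, tol=1e-8):
    # complex Lasso with covariance update (sketch of CLASSO.COV); assumes diag(S_xx) = 1
    beta = np.asarray(beta0, dtype=complex).copy()
    s_xr = s_xy - S_xx @ beta                  # step 2: current value of the scaled X^H r
    for _ in range(max_iter):
        max_change = 0.0
        for j in range(S_xx.shape[0]):
            v = s_xr + S_xx[:, j] * beta[j]    # step 3.(a)
            new_bj = soft_threshold(v[j], lam) # step 3.(b)
            s_xr = v - S_xx[:, j] * new_bj     # step 3.(c)
            max_change = max(max_change, abs(new_bj - beta[j]))
            beta[j] = new_bj
        if max_change < tol:
            break
    return beta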
Figure 3: Runtime comparison between CLASSO and grplasso.

Computational benefit of orthogonal group structure. We would like to highlight the importance of the orthogonal predictor structure in the above group Lasso formulation. While it is well-known that a Lasso with complex variables can be solved using standard group Lasso optimization routines [Maleki et al., 2013], to the best of our knowledge the orthogonality of predictors within groups has not been observed or utilized in algorithm development in the existing literature. We perform a small numerical experiment to compare the computational efficiency of CLASSO with that of a popular group Lasso R package grplasso that does not use the within-group orthogonality of predictors.

The experiment consists of a $p=50$-dimensional sample of size $n=50$. We randomly generate the real and imaginary parts of the entries of $X$ from a standard normal distribution, and consider $\beta=(\beta_{1},\cdots,\beta_{p})^{\top}$ with $\beta_{k}=1-1\mathrm{i}$ if $k$ is odd and $0$ otherwise. The errors $\{\epsilon_{i}\}_{i=1}^{n}$ are i.i.d. $\mathcal{N}(0,1)$ and the response is $y=X\beta+\epsilon$. The experiment is replicated 20 times and the run times of both CLASSO and grplasso are recorded on a MacBook Pro M1 with an 8-core CPU. The boxplots of the runtimes are shown in Figure 3: the average run times of CLASSO and grplasso are 0.01 and 1.34 seconds respectively, a roughly 134-fold speed-up for CLASSO.
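For reference, the data generation of this experiment can be sketched as follows, using the classo_pass helper above; the penalty value and the number of passes are illustrative choices of ours.

import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 50
X = rng.standard_normal((n, p)) + 1j * rng.standard_normal((n, p))
X = X * np.sqrt(n) / np.linalg.norm(X, axis=0)   # scale columns so that ||X_j||_2^2 = n
beta = np.array([1 - 1j if (k + 1) % 2 == 1 else 0 for k in range(p)], dtype=complex)  # beta_k = 1 - i for odd k
y = X @ beta + rng.standard_normal(n)

beta_hat = np.zeros(p, dtype=complex)
for _ in range(100):                             # a few cyclic CLASSO passes at a fixed penalty
    beta_hat, _ = classo_pass(X, y, beta_hat, lam=0.1)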

Warm start and active set selection. The complex Lasso can be sped up by incorporating warm starts and active set selection [Hastie et al., 2015]. In practice, one needs to find the Lasso solution not for a single fixed $\lambda$ but for a sequence of values of $\lambda$, i.e. a regularization path. Such a path helps identify the best choice of $\lambda$ in terms of a given criterion (e.g. relative mean squared error (RMSE), Bayesian Information Criterion (BIC) or cross-validation error; see Section 5 for details). One way to do so is to start with a sufficiently large value of the tuning parameter ($\lambda\geq\lambda_{0}=\max_{j}|X_{j}^{\dagger}Y|/n$), for which the solution $\hat{\beta}(\lambda_{0})$ is all-zero. For each smaller value $\lambda_{\ell}$ ($\ell\geq 1$), we use the solution $\hat{\beta}(\lambda_{\ell-1})$ at the previous $\lambda_{\ell-1}$ as the initial value, i.e. a warm start, and compute $\hat{\beta}(\lambda_{\ell})$. This speeds up the computation of the Lasso solution over the whole grid of tuning parameters, since a small change in the tuning parameter typically has a limited effect on the solution. This strategy is referred to as pathwise coordinate descent.

An active set $\mathcal{A}_{t}$ is the set of indices of non-zero entries at iteration $t$ of the Lasso algorithm. Given the vector of current residuals $r^{(t)}$, if $|X_{j}^{\dagger}r^{(t)}|/n<\lambda_{\ell}$, then $j$ is excluded from $\mathcal{A}_{t}$, i.e. $\mathcal{A}_{t}:=\mathcal{A}_{t-1}\setminus\{j\}$. The soft-thresholding step 4.(b) of Algorithm 1 is performed only for the indices in $\mathcal{A}_{t}$. The advantage of active set selection is that once an index is declared inactive, it no longer needs to be updated, so as $\hat{\beta}^{(t)}$ becomes sparse after a few iterations the number of soft-thresholding operations per pass goes down significantly. A short sketch combining warm starts and active set screening is given below.
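The sketch below reuses soft_threshold from the earlier sketch and again assumes $\|X_{j}\|_{2}^{2}=n$; the screening rule and convergence check are simplified versions of ours, not the exact rules used in our implementation.

import numpy as np

def classo_path(X, y, lambdas, n_passes=50, tol=1e-6):
    # pathwise complex Lasso: warm starts over a decreasing lambda grid with active-set screening
    n, p = X.shape
    beta = np.zeros(p, dtype=complex)
    path = {}
    for lam in sorted(lambdas, reverse=True):        # start from large lambda (all-zero solution)
        for _ in range(n_passes):
            beta_old = beta.copy()
            r = y - X @ beta
            grad = X.conj().T @ r / n + beta         # equals X_j^H r^{(j)} / n under the column scaling
            inactive = np.abs(grad) < lam            # these coordinates would be soft-thresholded to zero
            beta[inactive] = 0.0
            r = y - X @ beta
            for j in np.where(~inactive)[0]:         # update only the active set
                r_j = r + X[:, j] * beta[j]
                beta[j] = soft_threshold(X[:, j].conj() @ r_j / n, lam)
                r = r_j - X[:, j] * beta[j]
            if np.max(np.abs(beta - beta_old)) < tol:
                break
        path[lam] = beta.copy()                      # warm start carried over to the next lambda
    return path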

3.3 Graphical Lasso with complex variables

We return to the original problem (2.6) with the given matrix $\mathcal{Z}$ of DFT values and the averaged periodogram $\hat{f}(\omega_{j})$ for a fixed Fourier frequency $\omega_{j}\in F_{n}$. For ease of notation, we will write $\hat{\Theta}_{j}$ simply as $\hat{\Theta}$ and drop the frequency index $j$ throughout the algorithm description; this is possible because the algorithm does not use information from estimates at other frequencies. We also denote the averaged periodogram $\hat{f}(\omega_{j})$ by $P$.

Algorithm 3 Complex graphical Lasso (CGLASSO)
Objective: Calculate Θ^=argminΘ++pΘ,f^(ωj)logdetΘ+λΘ1,off\hat{\Theta}=\operatorname*{arg\,min}_{\Theta\in\mathcal{H}_{++}^{p}}\langle\langle\Theta,\hat{f}(\omega_{j})\rangle\rangle-\log\det\Theta+\lambda\|\Theta\|_{\text{1,off}}
Input: P=f^(ωj)P=\hat{f}(\omega_{j}), (p1)×p(0){\cal B}^{(0)}_{(p-1)\times p} and λ>0\lambda>0
1. Initialize: 
  • W=PW=P; diag(W)\text{diag}(W) remains unchanged throughout the algorithm

  • =(0){\cal B}={\cal B}^{(0)}

2. Repeat until convergence for k=1,,p,1,,p,k=1,\cdots,p,1,\cdots,p,\cdots:
  • (a)

    Define

    • W11:=Wk,kW_{11}:=W_{-k,-k}

• $\mathbf{w}_{12}:=W_{-k,k}$

    • $w_{22}:=W_{kk}$

    • $\mathbf{p}_{12}:=P_{-k,k}$

  • (b)

    β^=CLASSO.COV(SXX=W11,𝐬XY=𝐩12,β^(0)=,k,λ)\hat{\beta}=\text{CLASSO.COV}(S_{XX}=W_{11},\ \mathbf{s}_{XY}=\mathbf{p}_{12},\hat{\beta}^{(0)}={\cal B}_{\cdot,k},\lambda)

  • (c)

    Update:

    • ,kβ^{\cal B}_{\cdot,k}\leftarrow\hat{\beta}

    • w12W11β^\textbf{w}_{12}\leftarrow W_{11}\hat{\beta}

3. For k=1,,pk=1,\cdots,p:
• $\hat{\theta}_{22}\leftarrow\dfrac{1}{w_{22}-\mathbf{w}_{12}^{\dagger}\hat{\beta}}$

  • 𝜽^12β^θ^22\hat{\boldsymbol{\theta}}_{12}\leftarrow-\hat{\beta}\cdot\hat{\theta}_{22}

4. Return Θ^\hat{\Theta}
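The following numpy sketch assembles Algorithm 3 from the classo_cov sketch above. The rescaling inside the loop, which gives the inner Lasso unit-diagonal inputs, mirrors the CGLASSO-II scaling discussed in Section 5; the convergence checks and default parameters are ours.

import numpy as np

def cglasso(P, lam, max_iter=100, tol=1e-6):
    # block coordinate descent for the complex graphical Lasso (sketch of Algorithm 3)
    p = P.shape[0]
    W = np.array(P, dtype=complex)                # working estimate of Theta^{-1}; diag(W) stays fixed
    B = np.zeros((p - 1, p), dtype=complex)
    for _ in range(max_iter):
        W_old = W.copy()
        for k in range(p):
            idx = np.delete(np.arange(p), k)
            W11 = W[np.ix_(idx, idx)]
            d = np.sqrt(np.real(np.diag(W11)))    # rescale so the inner Lasso sees unit diagonals
            S_xx = W11 / np.outer(d, d)
            s_xy = P[idx, k] / d
            B[:, k] = classo_cov(S_xx, s_xy, B[:, k] * d, lam) / d
            W[idx, k] = W11 @ B[:, k]             # update w12
            W[k, idx] = W[idx, k].conj()
        if np.max(np.abs(W - W_old)) < tol:
            break
    Theta = np.zeros((p, p), dtype=complex)       # step 3: recover Theta column by column
    for k in range(p):
        idx = np.delete(np.arange(p), k)
        theta_kk = 1.0 / (W[k, k] - W[idx, k].conj() @ B[:, k])
        Theta[k, k] = theta_kk
        Theta[idx, k] = -B[:, k] * theta_kk
        Theta[k, idx] = Theta[idx, k].conj()
    return Theta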

Using Claim B.1, the solution of the optimization problem (2.6) satisfies the equation

PΘ1+λΨ=0,P-\Theta^{-1}+\lambda\Psi=0, (3.14)

where Ψ=((ψkl))k,l=1p\Psi=((\psi_{kl}))_{k,l=1}^{p} has the following entries

ψkl={0if k=l,Θkl|Θkl|if kland Θkl0,{z:|z|1}if kland Θkl=0.\psi_{kl}=\begin{cases}~{}\quad 0&\text{if }k=l,\\ \quad\dfrac{\Theta_{kl}}{|\Theta_{kl}|}&\text{if }k\neq l\ \text{and }\Theta_{kl}\neq 0,\\ \in\{z\in\mathbb{C}:|z|\leq 1\}&\text{if }k\neq l\ \text{and }\Theta_{kl}=0.\end{cases} (3.15)

The system (3.14) can be solved using block coordinate descent [Friedman et al., 2008]. The current working version of $\Theta$ and the matrix $P$ are partitioned as follows

Θ=[Θ11𝜽12𝜽12θ22],P=[P11𝐩12𝐩12p22].\Theta=\begin{bmatrix}\Theta_{11}&\boldsymbol{\theta}_{12}\\ \boldsymbol{\theta}_{12}^{\dagger}&\theta_{22}\end{bmatrix},\quad P=\begin{bmatrix}P_{11}&\mathbf{p}_{12}\\ \mathbf{p}_{12}^{\dagger}&p_{22}\end{bmatrix}.

Let $W$ be the current working version of $\Theta^{-1}$. Following Friedman et al. [2008] and using some tools from the calculus of complex variables, it can be shown that (3.14) reduces to a Lasso problem in which $\mathbf{p}_{12}$ plays the role of $\frac{1}{n}X^{\dagger}Y$ and $W_{11}$, the estimated cross-product matrix in the current model, plays the role of $\frac{1}{n}X^{\dagger}X$. Based on Algorithm 2, we can therefore implement a block coordinate descent approach and obtain a graphical Lasso method in the complex regime; a brief sketch of the reduction is given below.
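To make the reduction explicit, here is a heuristic outline in our own notation, mirroring the real-valued derivation of Friedman et al. [2008]; it is not the formal argument. Reading off the $(1,2)$ block of (3.14) with $W=\Theta^{-1}$ gives $\mathbf{p}_{12}-\mathbf{w}_{12}+\lambda\boldsymbol{\psi}_{12}=0$. From $W\Theta=I$, the $(1,2)$ block yields $W_{11}\boldsymbol{\theta}_{12}+\mathbf{w}_{12}\theta_{22}=0$, i.e. $\mathbf{w}_{12}=W_{11}\beta$ with $\beta=-\boldsymbol{\theta}_{12}/\theta_{22}$, and since $\theta_{22}>0$ the subgradient in (3.15) satisfies $\boldsymbol{\psi}_{12}=-\boldsymbol{\nu}$, where $\nu_{k}=\beta_{k}/|\beta_{k}|$ whenever $\beta_{k}\neq 0$. The block equation therefore becomes

W_{11}\beta-\mathbf{p}_{12}+\lambda\boldsymbol{\nu}=0,

which is the stationarity condition of the complex Lasso objective $\frac{1}{2}\beta^{\dagger}W_{11}\beta-\mathrm{Re}(\mathbf{p}_{12}^{\dagger}\beta)+\lambda\|\beta\|_{1}$ solved by CLASSO.COV in step 2.(b) of Algorithm 3. Once $\beta$ is available, $\theta_{22}=1/(w_{22}-\mathbf{w}_{12}^{\dagger}\beta)$ and $\boldsymbol{\theta}_{12}=-\beta\,\theta_{22}$ recover the corresponding column of $\Theta$, which is step 3 of the algorithm.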

3.3.1 Implementation details

We provide some details on the implementation and convergence of the CGLASSO algorithm. While coordinate descent is fast and simple to implement, it is known to face convergence issues for very small values of the penalty parameter, even in classical Lasso and graphical Lasso problems [Mazumder and Hastie, 2012; Friedman et al., 2007]. We encounter similar issues, but find that the algorithm performs robustly and accurately on the portion of the regularization path where $\lambda$ is neither too small nor too large. We also make some modifications to the base algorithm that help with convergence and speed up performance.

  1. 1.

Scaling: It is a well-known practice to standardize the individual variables before invoking graphical Lasso. We experimented with two different variants, both of which improve upon the basic CGLASSO algorithm. We discuss this in more detail in Section 5 (see equations (5.1) and (5.2)).

  2. 2.

Warm and warmer start: Warm start is a standard technique to speed up convergence of pathwise coordinate descent (see Section 3.2). In the implementation of CGLASSO, we find that a further modification, which we call warmer start, helps more. In simple terms, warmer start is a two-fold warm start for CGLASSO and the inner CLASSO.COV. For a fixed $\omega_{j}$ we find the CGLASSO solution of $\Theta$ for a sequence of $\lambda$ values and construct a regularization path. Starting from a reasonably large value $\lambda=\lambda_{0}$, we obtain the solution $\hat{\Theta}(\lambda_{0})$ with all off-diagonal entries equal to zero. For each smaller value $\lambda_{\ell}\ (\ell\geq 1)$, the solution $\hat{\Theta}(\lambda_{\ell-1})$ at the previous $\lambda_{\ell-1}$ is used as the initial value. In addition, at each iteration $k=1,\ldots,p,1,\ldots,p,\ldots$ of CGLASSO we go through the following steps (b and c of Algorithm 3)

\hat{\beta}=\text{CLASSO.COV}(W_{11},\mathbf{p}_{12},\mathcal{B}^{(0)}_{\cdot,k},\lambda);
    ,k(0)β^;\displaystyle\mathcal{B}^{(0)}_{\cdot,k}\leftarrow\hat{\beta};
    𝐰12W11β^;\displaystyle\mathbf{w}_{12}\leftarrow W_{11}\hat{\beta};

    the above routine is a two-step warm up for the initial solution in both the Lasso and graphical Lasso algorithm.

  3. 3.

    Convergence issues for small λ\lambda: For very small λ\lambda, we may encounter convergence issues. In our simulation studies, we handle this issue by declaring a stopping rule for the regularization path depending on the estimation error since the true Θ\Theta is known. We calculate relative mean square error (RMSE) which, at frequency ωj\omega_{j} and penalty λ\lambda, is defined in (5.3).

Let us assume for now that we are estimating $\Theta$ at a fixed frequency $\omega_{j}$, so that we can simply denote $\text{RMSE}_{\lambda}(\omega_{j})$ by $\text{RMSE}_{\lambda}$. If the initial value of the penalty parameter is $\lambda_{0}$, and $\lambda_{\min}$ is the minimizer of the $\text{RMSE}_{\lambda}$ values along the regularization path, then at the current $\lambda$ we stop the regularization path if the gap $\lambda_{0}-\lambda$ is large enough and

    RMSEλRMSEλmin>0.5×(RMSEλ0RMSEλmin).\text{RMSE}_{\lambda}-\text{RMSE}_{\lambda_{\min}}>0.5\times(\text{RMSE}_{\lambda_{0}}-\text{RMSE}_{\lambda_{\min}}).

Such a stopping criterion performs very well when the regularization path is roughly U-shaped, since the rule detects when the path starts to move away from the minimum.

While our reported results are not affected by these choices, i.e. the min-BIC solutions are achieved at values of $\lambda$ for which the algorithm converges and no hard cut-off is required, we plan to resolve these issues in future work.

4 Theoretical properties

In this section we establish non-asymptotic high-probability upper bounds on the estimation error of $\widehat{f^{-1}}(\omega_{j})$ for a single frequency $\omega_{j}\in F_{n}$. First we control the distance between $\widehat{f^{-1}}(\omega_{j})$ and $f^{-1}(\omega_{j})$ in the $\ell_{\infty}$-norm. Our proof techniques build upon the arguments in Ravikumar et al. [2011] and Sun et al. [2018]. In particular, Ravikumar et al. [2011] used a primal-dual witness technique which constructs a pair $(\widetilde{\Theta}_{j},\widetilde{Z}_{j})$ satisfying the optimality conditions with high probability. Given that the construction is successful, the witness $\widetilde{\Theta}_{j}$ equals the solution $\hat{\Theta}_{j}$ of the optimization problem (2.6). The advantage of this construction is that we can ensure both estimation and variable selection properties of the estimator $\hat{\Theta}_{j}$ with high probability.

In this section, we first state the assumptions and explain the notations of terms that will govern our error bounds. Then we present the estimation error bounds for Gaussian and non-Gaussian linear processes.

Sparsity: The sparsity parameters capture the ambient dimension of the optimization problem. We use Θ=[f(ωj)]1\Theta^{*}=[f(\omega_{j})]^{-1} to denote the true spectral precision matrix at ωj\omega_{j}. The set of off-diagonal entries of Θ\Theta^{*}, used to define the edge set of the graph, is denoted by

E(Θ)={(k,)[p]×[p]:k,Θk0},E(\Theta^{*})=\{(k,\ell)\in[p]\times[p]:k\neq\ell,\ \Theta^{*}_{k\ell}\neq 0\},

and the augmented edge set is denoted as S(Θ)=E(Θ){(1,1),,(p,p)}S(\Theta^{*})=E(\Theta^{*})\cup\{(1,1),\cdots,(p,p)\}. For simplicity, we write S(Θ)S(\Theta^{*}) as SS. We need the following two sparsity parameters in our analysis,

s=|E(Θ)|,d=maxk=1,,p|{{1,,p}:Θk0}|.\displaystyle s=|E(\Theta^{*})|,~{}~{}~{}~{}~{}d=\max_{k=1,\cdots,p}\left|\{\ell\in\{1,\cdots,p\}:\ \Theta_{k\ell}^{*}\neq 0\}\right|.

The parameter ss captures the total number of edges in the graph. The parameter dd is the maximum degree of a node, and equals the maximum number of non-zero entries in any row of Θ\Theta^{*}.

Our analysis will show that consistent estimation is possible when the effective sample size is larger than the ambient dimension captured by the above parameters.

Identifiability: In high-dimensional statistics, some type of identifiability assumption needs to hold in order to carry out meaningful estimation. Typical assumptions used in the literature are restricted eigenvalue, restricted strong convexity or irrepresentability conditions. These conditions inherently impose restrictions ensuring that the support and non-support variables are not too similar. They are also closely related to some form of strong convexity, or steep curvature, of the second derivative of the loss function at the true parameter in certain directions of the high-dimensional space.

We consider the function

g(Θ)={logdetΘif Θ0+otherwiseg(\Theta)=\begin{cases}-\log\det\Theta&\text{if }\Theta\succ 0\\ +\infty&\text{otherwise}\end{cases}

Then the Hessian is $\nabla^{2}g(\Theta^{*})={\Theta^{*}}^{-1}\otimes{\Theta^{*}}^{-1}=\Upsilon^{*}$, where $\otimes$ denotes the matrix Kronecker product, i.e. $\Upsilon^{*}$ is a $p^{2}\times p^{2}$ matrix indexed by vertex pairs/edges whose entry $\Upsilon^{*}_{(k_{1},l_{1}),(k_{2},l_{2})}$ equals $\frac{\partial^{2}g}{\partial\Theta_{k_{1},l_{1}}\partial\Theta_{k_{2},l_{2}}}(\Theta)$ evaluated at $\Theta=\Theta^{*}$. This can be verified from the results on matrix derivatives in Boyd and Vandenberghe [2004]. The following quantity captures the magnitude of this Hessian restricted to the true support set, and will appear in our error bounds for Whittle estimation.

κΘ=ΥSS=[(Θ)1(Θ)1]SS.\kappa_{\Theta^{*}}=\|\Upsilon^{*}_{SS}\|_{\infty}=\left\|[(\Theta^{*})^{-1}\otimes(\Theta^{*})^{-1}]_{SS}\right\|_{\infty}.

We will also need the following irrepresentability assumption, which is common in the literature of GGM [Ravikumar et al., 2011].

Assumption 4.1.

There exists some α(0,1]\alpha\in(0,1] such that

maxeScΥeS(ΥSS)111α.\max_{e\in S^{c}}\|\Upsilon_{eS}^{*}\left(\Upsilon_{SS}^{*}\right)^{-1}\|_{1}\leq 1-\alpha.

Such an assumption limits the influence of the non-edge terms in $S^{c}$ on the edge terms in $S$ [Zhao and Yu, 2006].

Temporal memory: We will need a few parameters to capture the effect of temporal and cross-sectional memory of the process on our error bounds.

Assumption 4.2.
=Γ()max<.\sum_{\ell=-\infty}^{\infty}\|\Gamma(\ell)\|_{\max}<\infty.

We also define the following quantity

|f|=esssupω[π,π]f(ω),{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|f\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}=\operatorname*{ess\,sup}_{\omega\in[-\pi,\pi]}\|f(\omega)\|,

which is a measure of stability of the time series XtX_{t}. Larger values of |f|{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|f\right|\kern-1.07639pt\right|\kern-1.07639pt\right|} are associated with processes having stronger temporal and cross-sectional dependence and less stability. The above assumption ensures the finiteness of |f|{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|f\right|\kern-1.07639pt\right|\kern-1.07639pt\right|} since for any ω[π,π]\omega\in[-\pi,\pi]

f(ω)=12π=Γ()eiω12π=Γ()p2π=Γ()max.\|f(\omega)\|=\frac{1}{2\pi}\left\|\sum_{\ell=-\infty}^{\infty}\Gamma(\ell)e^{-\mathrm{i}\omega\ell}\right\|\leq\frac{1}{2\pi}\sum_{\ell=-\infty}^{\infty}\|\Gamma(\ell)\|\leq\frac{p}{2\pi}\sum_{\ell=-\infty}^{\infty}\|\Gamma(\ell)\|_{\max}.

Assumption 4.2 is used to establish the error bound of Proposition 4.1 below.

We define two model dependent quantities

\Omega_{n}=\max_{1\leq r,s\leq p}\sum_{|\ell|\leq n}|\ell|\,|\Gamma_{r,s}(\ell)|,\quad L_{n}=\max_{1\leq r,s\leq p}\sum_{|\ell|>n}|\Gamma_{r,s}(\ell)|. (4.1)

These two quantities capture the strength of temporal and contemporaneous dependence in the multivariate time series. According to Proposition 3.3 of Sun et al. [2018], they play an important role in controlling the bias of the averaged periodogram, and we use them for the same purpose in our estimation setup.

Finally, we will need the quantity

κf=f(ωj)\kappa_{f}=\|f(\omega_{j})\|_{\infty}

in our error bounds. It captures the size of the spectral density matrix in the matrix $\ell_{\infty}$ norm, and is used to control the curvature of the Hessian of the loss function in CGLASSO.

4.1 Case I: Gaussian time series

Our first result, rephrased from Propositions 3.5 and 3.6 of Sun et al. [2018] in our notation, provides a high probability upper bound on the element-wise difference between the averaged periodogram and the true spectral density. The bound depends on the bias and tail decay of $\hat{f}(\omega_{j})$, and scales with the dimension and the bandwidth as $\sqrt{\log p/m}$, where $m$ is the bandwidth parameter in spectral density estimation.

Proposition 4.1.

Let {Xt}t=1n\{X_{t}\}_{t=1}^{n} be nn consecutive observations from a stable Gaussian centered time series satisfying Assumption 4.1 and 4.2. Consider a single Fourier frequency ωj[π,π]\omega_{j}\in[-\pi,\pi], and assume nΩn|f|2logpn\succsim\Omega_{n}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|f\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{2}\log p. Then for any choice of mm satisfying m|f|2logpm\succsim{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|f\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{2}\log p, mn/Ωnm\precsim n/\Omega_{n}, any R>0R>0, and a choice of threshold

Δf(m,n,p)=|f|Rlogpm+m+12πnΩn+12πLn,\Delta_{f}(m,n,p)={\left|\kern-1.07639pt\left|\kern-1.07639pt\left|f\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\sqrt{\frac{R\log p}{m}}+\frac{m+\frac{1}{2\pi}}{n}\Omega_{n}+\frac{1}{2\pi}L_{n}, (4.2)

there exist universal constants c,c>0c,c^{\prime}>0 such that

(f^(ωj)f(ωj)maxΔf(m,n,p))cp(cR2).\mathbb{P}\left({\left|\kern-1.07639pt\left|\hat{f}(\omega_{j})-f(\omega_{j})\right|\kern-1.07639pt\right|}_{\max}\geq\Delta_{f}(m,n,p)\right)\leq c^{\prime}p^{-(cR-2)}. (4.3)

The proof is given in Appendix B for completeness.

Remark 4.1.

The statement of Proposition 4.1 is non-asymptotic in nature, and the required sample size $n$ and bandwidth parameter $m$ can depend on the stability parameter $|||f|||$. For appropriate choices of $m,n$ and $p$, the bias term can vanish faster than the tail decay for certain data generating processes. For example, according to Proposition 3.4 of Sun et al. [2018], if the autocovariance function of the underlying centered stationary process $X_{t}$ satisfies $\|\Gamma(\ell)\|_{\max}\leq\sigma_{X}\rho_{X}^{|\ell|}$ for every $\ell\in\mathbb{Z}$ and some $\sigma_{X}>0,\rho_{X}\in(0,1)$, then $\Omega_{n}=\mathcal{O}(n\rho_{X}^{n})$ and $L_{n}=\mathcal{O}(\rho_{X}^{n})$. The first component of the bias term is $\frac{m}{n}\Omega_{n}\precsim m\rho_{X}^{n}$. For the choice of bandwidth parameter $m\asymp n^{\alpha}$ for some $\alpha>0$, $n^{3\alpha}\rho_{X}^{2n}\rightarrow 0$ and $L_{n}\rightarrow 0$, which implies $m\rho_{X}^{n}\precsim|||f|||\sqrt{\frac{\log p}{m}}$. Therefore, the bias vanishes faster than the tail decay, and the threshold behaves as $\Delta_{f}(m,n,p)\approx\mathcal{O}\left(|||f|||\sqrt{\frac{\log p}{m}}\right)$. This is similar to the element-wise deviation control $\sqrt{\log p/n}$ for covariance matrices in Ravikumar et al. [2011] and Bickel and Levina [2008].

Our next result uses the element-wise deviation bound as a starting point to control the departure of graphical lasso estimator from the true spectral precision in different norms.

Theorem 4.1.

Let {Xt}t=1n\{X_{t}\}_{t=1}^{n} be nn consecutive observations from a stable Gaussian centered time series satisfying Assumptions 4.1 and 4.2. Consider a single Fourier frequency ωj[π,π]\omega_{j}\in[-\pi,\pi], and the true spectral precision matrix Θ=[f(ωj)]1\Theta^{*}=[f(\omega_{j})]^{-1}. Assume κf,κΘ1\kappa_{f},\ \kappa_{\Theta^{*}}\geq 1 and nΩn|f|2logpn\succsim\Omega_{n}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|f\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{2}\log p, and denote Cα=1+8αC_{\alpha}=1+\frac{8}{\alpha}. Then for any choice of tuning parameters λ=8αΔf(m,n,p)\lambda=\frac{8}{\alpha}\Delta_{f}(m,n,p), m|f|2logpm\succsim{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|f\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{2}\log p, mn/Ωnm\precsim n/\Omega_{n}, Δf(m,n,p)d[6κΘ2κf3Cα]1\Delta_{f}(m,n,p)d\leq[6\kappa_{\Theta^{*}}^{2}\kappa_{f}^{3}C_{\alpha}]^{-1} and any R>0R>0, there exist universal constants c,c>0c,c^{\prime}>0 such that the following statements hold with probability at least 1cp(cR2)1-c^{\prime}p^{-(cR-2)},

  • (a)

    The CGLASSO estimator Θ^\hat{\Theta} in (2.6) satisfies

    Θ^Θmax2κΘCαΔf(m,n,p),{\left|\kern-1.07639pt\left|\hat{\Theta}-\Theta^{*}\right|\kern-1.07639pt\right|}_{\max}\leq 2\kappa_{\Theta^{*}}C_{\alpha}\Delta_{f}(m,n,p),
  • (b)

    The edge set specified by Θ^\hat{\Theta}, i.e. E(Θ^)E(\hat{\Theta}), is a subset of E(Θ)E(\Theta^{*}) and includes all edges (k,)(k,\ell) with |Θk|>2κΘCαΔf(m,n,p)|\Theta^{*}_{k\ell}|>2\kappa_{\Theta^{*}}C_{\alpha}\Delta_{f}(m,n,p),

  • (c)

    The estimator Θ^\hat{\Theta} satisfies

    Θ^ΘF\displaystyle\|\hat{\Theta}-\Theta^{*}\|_{F} 2κΘCαs+pΔf(m,n,p),\displaystyle\leq 2\kappa_{\Theta^{*}}C_{\alpha}\sqrt{s+p}\ \Delta_{f}(m,n,p), (4.4)
    Θ^Θ2\displaystyle\|\hat{\Theta}-\Theta^{*}\|_{2} 2κΘCαmin{s+p,d}Δf(m,n,p).\displaystyle\leq 2\kappa_{\Theta^{*}}C_{\alpha}\min\{\sqrt{s+p},d\}\ \Delta_{f}(m,n,p). (4.5)
Remark 4.2.

The proof is deferred to Appendix B. In the statement of Theorem 4.1, the condition $\Delta_{f}(m,n,p)d\leq[6\kappa_{\Theta^{*}}^{2}\kappa_{f}^{3}C_{\alpha}]^{-1}$ ensures that the tail decay $|||f|||\sqrt{\frac{R\log p}{m}}$ and the bias $\frac{m+\frac{1}{2\pi}}{n}\Omega_{n}+\frac{1}{2\pi}L_{n}$ are separately controlled by $[6\kappa_{\Theta^{*}}^{2}\kappa_{f}^{3}C_{\alpha}d]^{-1}$. Under common mixing conditions, however, the tail decay may asymptotically dominate the bias term. For instance, as shown in Remark 4.1, when the autocovariance function of the underlying centered stationary process $X_{t}$ satisfies $\|\Gamma(\ell)\|_{\max}\leq\sigma_{X}\rho_{X}^{|\ell|}$ for every $\ell\in\mathbb{Z}$ and some $\sigma_{X}>0,\rho_{X}\in(0,1)$, the tail decay dominates the bias term for the choice of bandwidth $m\asymp n^{\alpha}$ with $\alpha\in(0,1)$. If $\kappa_{f},\kappa_{\Theta^{*}},\frac{1}{\alpha}$ and $|||f|||$ are asymptotically constant as functions of $(n,p,d)$, the threshold behaves as $\Delta_{f}(m,n,p)\asymp\sqrt{\frac{\log p}{m}}$, so the condition requires $\sqrt{\frac{\log p}{m}}\precsim\frac{1}{d}$, i.e. $m\succsim d^{2}\log p$. This is similar to the sample size requirement $\Omega(d^{2}\log p)$ of Ravikumar et al. [2011] for precision matrix estimation using the real graphical Lasso. In each iteration of the CGLASSO algorithm, CLASSO, i.e. the complex variant of the Lasso, is implemented, and the Lasso succeeds with high probability with a sample of size $\Omega(d\log p)$ [Wainwright, 2009]. As discussed in Ravikumar et al. [2011], the bandwidth requirement of $\Omega(d^{2}\log p)$ might therefore be an artifact of the theoretical analysis. Consistency of $\hat{\Theta}$ around $\Theta^{*}$ is ensured when $\frac{d^{2}\log p}{m}\rightarrow 0$. Part (b) of Theorem 4.1 also states that entry-wise model selection consistency is achieved when the magnitude of a true entry exceeds a threshold proportional to $\Delta_{f}(m,n,p)$.

4.2 Case II: Linear processes

In this section, we extend the estimation consistency results from Gaussian time series to linear processes with non-Gaussian errors. We focus on the three classes of error distributions given in Section 4 of Sun et al. [2018]. We consider a general linear process with absolutely summable MA($\infty$) coefficients

Xt==0Bεt,X_{t}=\sum_{\ell=0}^{\infty}B_{\ell}\varepsilon_{t-\ell}, (4.6)

where $B_{\ell}\in\mathbb{R}^{p\times p}$, and $\varepsilon_{t}\in\mathbb{R}^{p}$ are i.i.d. centered random vectors whose distribution may have heavier tails than Gaussian. The absolute summability of the MA($\infty$) coefficients ensures stationarity of the process [Rosenblatt, 2012]:

=0|(B)k,k|<,for all k,k=1,,p.\sum_{\ell=0}^{\infty}|(B_{\ell})_{k,k^{\prime}}|<\infty,~{}~{}~{}\mbox{for all }k,k^{\prime}=1,\cdots,p. (4.7)

Under this condition, the autocovariance function $\Gamma(\ell)=\sum_{t=0}^{\infty}B_{t}B_{t+\ell}^{\top}$ is well-defined for every $\ell\in\mathbb{Z}$, and Assumption 4.2 holds (see Lemma C.7 of Sun et al. [2018]). We assume that all the components of $\varepsilon_{t}$, viz. $\varepsilon_{tk},\ 1\leq k\leq p$, come from one of the following three distribution families:

  1. (C1)

    Sub-Gaussian:  There exists σ>0\sigma>0 such that for all η>0\eta>0

    (|εtk|>η)2exp(η22σ2),\mathbb{P}(|\varepsilon_{tk}|>\eta)\leq 2\exp\left(-\frac{\eta^{2}}{2\sigma^{2}}\right),
  2. (C2)

    Generalized sub-exponential family with parameter ν>0\nu>0 :  There exist positive constants aa and bb such that for all η>0\eta>0,

    (|εtk|>ην)aexp(bη),\mathbb{P}(|\varepsilon_{tk}|>\eta^{\nu})\leq a\exp(-b\eta),
  3. (C3)

    Distribution with finite 4th moment:  There exists K>0K>0 such that

    𝔼(εtk4)K<.\mathbb{E}(\varepsilon_{tk}^{4})\leq K<\infty.

With this specification of error distribution, we have the following results on the estimation error of averaged periodogram.

Proposition 4.2.

Suppose $\{X_{t}\}_{t\in\mathbb{Z}}$ is a linear process (4.6) satisfying (4.7), and $\varepsilon_{t}$ comes from one of the distribution families (C1), (C2) or (C3). Consider a given Fourier frequency $\omega_{j}\in F_{n}$. Assume $n\succsim\Omega_{n}\mathscr{N}_{k}$, $k=1,2,3$, where $\mathscr{N}_{1}=|||f|||^{2}\log p$, $\mathscr{N}_{2}=|||f|||^{2}(\log p)^{4+4\nu}$, and $\mathscr{N}_{3}=p^{2}$ for the three families (C1), (C2) and (C3) respectively. Then for $m$ satisfying $m\succsim\mathscr{N}_{k}$ and $m\precsim n/\Omega_{n}$, and the following choice of threshold with an $R>0$,

  1. (C1)

    Δf(m,n,p)=|f|(Rlogp)1/2m+m+12πnΩn+12πLn\Delta_{f}(m,n,p)={\left|\kern-1.07639pt\left|\kern-1.07639pt\left|f\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\frac{(R\log p)^{1/2}}{\sqrt{m}}+\frac{m+\frac{1}{2\pi}}{n}\Omega_{n}+\frac{1}{2\pi}L_{n},

  2. (C2)

    Δf(m,n,p)=|f|(Rlogp)2+2νm+m+12πnΩn+12πLn\Delta_{f}(m,n,p)={\left|\kern-1.07639pt\left|\kern-1.07639pt\left|f\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\frac{(R\log p)^{2+2\nu}}{\sqrt{m}}+\frac{m+\frac{1}{2\pi}}{n}\Omega_{n}+\frac{1}{2\pi}L_{n},

  3. (C3)

    Δf(m,n,p)=|f|p1+Rm+m+12πnΩn+12πLn\Delta_{f}(m,n,p)={\left|\kern-1.07639pt\left|\kern-1.07639pt\left|f\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\frac{p^{1+R}}{\sqrt{m}}+\frac{m+\frac{1}{2\pi}}{n}\Omega_{n}+\frac{1}{2\pi}L_{n},

the error of averaged periodogram satisfies

(f^(ωj)f(ωj)maxΔf(m,n,p))k,\mathbb{P}\left({\left|\kern-1.07639pt\left|\hat{f}(\omega_{j})-f(\omega_{j})\right|\kern-1.07639pt\right|}_{\max}\geq\Delta_{f}(m,n,p)\right)\leq\mathcal{B}_{k}, (4.8)

where the tail probabilities k\mathcal{B}_{k}, k=1,2,3k=1,2,3, for (C1), (C2) and (C3) are given by

1\displaystyle\mathcal{B}_{1} =cexp{(cR2)logp},\displaystyle=c\exp\left\{-(c^{\prime}R-2)\log p\right\},
2\displaystyle\mathcal{B}_{2} =cexp{(cR2)logp},\displaystyle=c\exp\left\{-(c^{\prime}R-2)\log p\right\},
3\displaystyle\mathcal{B}_{3} =cexp{2Rlogp},\displaystyle=c\exp\{-2R\log p\},

and c,c>0c,c^{\prime}>0 are constants depending only on the distributions of errors εt\varepsilon_{t} but not on the temporal dependence parameters BB_{\ell} of the linear process.

This proposition shows how the sample size requirement changes as the errors in the linear process have heavier tails than Gaussian. For example, under only the existence of a finite $4^{th}$ moment, we need $O(p^{2})$ samples, whereas in the sub-Gaussian setting $O(\log p)$ samples are sufficient for consistency.

Theorem 4.2.

Let Xt,t=1,,nX_{t},\ t=1,\cdots,n be as in Proposition 4.2. Consider a single Fourier frequency ωj[π,π]\omega_{j}\in[-\pi,\pi]. Assume κf,κΘ1\kappa_{f},\kappa_{\Theta^{*}}\geq 1, and nΩn𝒩kn\succsim\Omega_{n}\mathscr{N}_{k} where 𝒩k\mathscr{N}_{k} for k=1,2,3k=1,2,3 are as described in proposition 4.2, and denote Cα=1+8α,α>0C_{\alpha}=1+\frac{8}{\alpha},\ \alpha>0. Then for any choice of tuning parameters λ=8αΔf(m,n,p)\lambda=\frac{8}{\alpha}\Delta_{f}(m,n,p), m𝒩km\succsim\mathscr{N}_{k}, mn/Ωnm\precsim n/\Omega_{n}, Δf(m,n,p)d[6κΘ2κf3Cα]1\Delta_{f}(m,n,p)d\leq[6\kappa_{\Theta^{*}}^{2}\kappa_{f}^{3}C_{\alpha}]^{-1} and any R>0R>0, there exist universal constants c,c>0c,c^{\prime}>0 such that with probability greater than 1cp(cR2)1-c^{\prime}p^{-(cR-2)}, (a),(b) and (c) in Theorem 4.1 hold.

Proofs of Proposition 4.2 and Theorem 4.2 follow from arguments similar to Propositions 4.3 and 4.4 of Sun et al. [2018], where estimation error bounds for spectral density matrices were considered. For general linear processes, the first term of the threshold $\Delta_{f}(m,n,p)$ in Proposition 4.2 is tailored to the underlying process: the threshold for the estimation error decays more slowly from (C1) to (C3) as the tails of the error distribution become heavier, and the required sample size and bandwidth increase accordingly. The proof of Theorem 4.2 proceeds in the same way as that of Theorem 4.1 (given in Appendix B), with the upper bound on the right-hand side of (4.3) replaced by the more general bounds $\mathcal{B}_{k}$ for linear processes given in (4.8) of Proposition 4.2.

5 Numerical experiments

We conduct three sets of experiments to demonstrate the finite sample properties of CGLASSO, and compare its performance with existing alternative methods.

In the graphical model and Lasso literature [Janková and van de Geer, 2018; Carter et al., 2023], using a single tuning parameter $\lambda$ is not ideal because the variables may be on different scales and have different standard deviations. Hence the scaling issue needs to be carefully addressed. We use the following two variants of the CGLASSO algorithm.

CGLASSO-I:   The first variant, called CGLASSO-I, takes the common approach of scaling all the $p$ DFTs to have $\|Z_{k}\|=\sqrt{n}$ before using them as input to CGLASSO. The final output is then rescaled as $D^{-1/2}\widehat{\Theta}D^{-1/2}$, where $D$ is a diagonal matrix containing the $\|Z_{k}\|$, for $k=1,\ldots,p$. Similar to Janková and van de Geer [2018], this is equivalent to solving a graphical Lasso based on the sample coherence matrix $D^{-1/2}\hat{f}(\omega_{j})D^{-1/2}$ and rescaling back. The corresponding optimization problem is

Θ^j=D1/2(argminΘ++pΘ,D1/2f^(ωj)D1/2logdetΘ+λΘ1,off)D1/2.\widehat{\Theta}_{j}=D^{-1/2}\left(\operatorname*{arg\,min}_{\Theta\in\mathcal{H}_{++}^{p}}\langle\langle\Theta,D^{-1/2}\hat{f}(\omega_{j})D^{-1/2}\rangle\rangle-\log\det\Theta+\lambda\|\Theta\|_{\text{1,off}}\right)D^{-1/2}. (5.1)

The above optimization problem is equivalent to a complex graphical Lasso with a weighted penalty, with weights determined by the diagonal matrix $D$ (an analogous fact for the graphical Lasso on a real covariance matrix is stated in Janková and van de Geer [2018]); that is,

Θ^j=argminΘ++pΘ,f^(ωj)logdetΘ+λD1/2ΘD1/21,off.\widehat{\Theta}_{j}=\operatorname*{arg\,min}_{\Theta\in\mathcal{H}_{++}^{p}}\langle\langle\Theta,\hat{f}(\omega_{j})\rangle\rangle-\log\det\Theta+\lambda\|D^{1/2}\Theta D^{1/2}\|_{\text{1,off}}.

CGLASSO-II:   The second variant, called CGLASSO-II, takes a more involved approach: in each invocation of CLASSO.COV within CGLASSO, $D$ is used to scale the inputs of CLASSO.COV. One way to view this operation is that every time a Lasso is run within the inner loop of CGLASSO, the predictors in the Lasso input are rescaled. This is equivalent to the following scheme for each iteration $k=1,\ldots,p$ of the inner Lasso update:

W11sc\displaystyle W_{11}^{sc} D111/2W11D111/2,\displaystyle\leftarrow D_{11}^{-1/2}W_{11}D_{11}^{-1/2},
𝐩12sc\displaystyle\mathbf{p}_{12}^{sc} D111/2𝐩12,\displaystyle\leftarrow D_{11}^{-1/2}\mathbf{p}_{12},
β^sc\displaystyle\hat{\beta}^{sc} =CLASSO.COV(W11sc,𝐩12sc,,k(0),λ),\displaystyle=\text{CLASSO.COV}(W_{11}^{sc},\mathbf{p}_{12}^{sc},\mathcal{B}^{(0)}_{\cdot,k},\lambda),
,k(0)\displaystyle\mathcal{B}^{(0)}_{\cdot,k} β^sc,\displaystyle\leftarrow\hat{\beta}^{sc},
β^\displaystyle\hat{\beta} D111/2β^sc,𝐰12W11β^.\displaystyle\leftarrow D_{11}^{-1/2}\hat{\beta}^{sc},~{}~{}~{}\mathbf{w}_{12}\leftarrow W_{11}\hat{\beta}. (5.2)

Comparison between CGLASSO variants: We compare CGLASSO-I and CGLASSO-II at $\omega=0$. Our comparison is based on the regularization paths (Figures 4a and 4b) of the relative mean squared error (RMSE), defined in (5.3), for two different DGPs.

Experiment 1: We compare CGLASSO with Sparse Inverse Periodogram Estimator (SIPE) of Fiecas et al. [2019] on white noise and VAR processes. Our results show that penalizing real and imaginary parts separately as in SIPE can lead to loss of estimation and model selection accuracy. It is indeed preferable to penalize the real and imaginary parts together using a group penalty, as specified in the original penalized Whittle likelihood formulation (2.6). Our findings qualitatively match with the findings in Maleki et al. [2013] for generic regression with complex variables.

Experiment 2: We report relative mean squared error (RMSE) of CGLASSO at frequency ω=0\omega=0 for three different DGPs (white noise, VAR and VARMA), and different values of nn and pp. We show that CGLASSO substantially outperforms the traditional estimator of inverse averaged periodogram.

Experiment 3: We report model selection performance at $\omega=0$ for the same settings as in Experiment 2 using precision, recall, accuracy and area under the ROC curve (AUROC). We show that CGLASSO outperforms the node-wise regression (NWR) algorithm of Meinshausen and Bühlmann [2006] (see Section 2.3) in terms of AUROC.

Before presenting the results of our experiments, we describe our simulation setup, performance metrics and choice of tuning parameters.

Generative models:  Three DGPs are considered, and for each setup we generate data with $n=200,400$ and $p=10,20,50$. The DGPs are the following:

  1. (DGP1)

    White noise:  We consider Xti.i.d.𝒩(0,Σ)X_{t}\overset{\text{i.i.d.}}{\thicksim}\mathcal{N}(0,\Sigma) where Σ\Sigma is a tri-diagonal symmetric positive definite matrix with diagonal entries 0.7, upper and lower diagonal entries 0.3.

  2. (DGP2)

    VAR(1):  We consider Xt=AXt1+εtX_{t}=AX_{t-1}+\varepsilon_{t} where A=((aij))A=((a_{ij})) is a p×pp\times p coefficient matrix with banded structure i.e. aii=0.5a_{ii}=0.5 for i=1,,pi=1,\cdots,p, ai,i+1=0.3a_{i,i+1}=-0.3 for i=1,,p1i=1,\cdots,p-1 and ai,i+2=0.2a_{i,i+2}=0.2 for i=1,,p2i=1,\cdots,p-2.

  3. (DGP3)

    VARMA(2, 2):  The setting in this case is similar to the simulation setup mentioned in Wilms et al. [2021]. For any kk\in\mathbb{N}, let the matrices 𝐈k\mathbf{I}_{k} and 𝐉k\mathbf{J}_{k} denote the kk-dimensional identity matrix and kk-dimensional matrix with all ones respectively. The model is Xt=A1Xt1+A2Xt2+εt+B1εt1+B2εt2X_{t}=A_{1}X_{t-1}+A_{2}X_{t-2}+\varepsilon_{t}+B_{1}\varepsilon_{t-1}+B_{2}\varepsilon_{t-2}. For coefficients of the autoregressive part, we use the diagonal matrices A1=0.4𝐈p,A2=0.2𝐈pA_{1}=0.4\ \mathbf{I}_{p},\ A_{2}=0.2\ \mathbf{I}_{p} and for coefficients of the moving average part of the model we use block diagonal matrices where the 5×55\times 5 block matrices for B1B_{1} and B2B_{2} respectively are 1.5(𝐈5+𝐉5)1.5(\mathbf{I}_{5}+\mathbf{J}_{5}) and 0.75(𝐈5+𝐉5)0.75(\mathbf{I}_{5}+\mathbf{J}_{5}).

Figure 4: Regularization paths of the relative mean squared error (RMSE) for CGLASSO-I and CGLASSO-II at $\omega_{j}=0$. (a) White noise, $p=10$ and $n=400$; (b) VARMA(2, 2), $p=50$ and $n=200$.

We estimate the spectral precision matrix for all of the above three setups and compare with the true matrix. As mentioned in Section 11.5 of Brockwell and Davis [1991], the true spectral density of the VARMA(p, q) process $X_{t}=A_{1}X_{t-1}+\cdots+A_{p}X_{t-p}+\varepsilon_{t}+B_{1}\varepsilon_{t-1}+\cdots+B_{q}\varepsilon_{t-q}$ with $\varepsilon_{t}\thicksim\mathcal{N}(0,\Sigma_{\varepsilon})$ for all $t$ is given by

f(\omega)=\frac{1}{2\pi}\,\mathcal{A}^{-1}(e^{-\mathrm{i}\omega})\,\mathcal{B}(e^{-\mathrm{i}\omega})\,\Sigma_{\varepsilon}\,\mathcal{B}^{\dagger}(e^{-\mathrm{i}\omega})\,(\mathcal{A}^{-1}(e^{-\mathrm{i}\omega}))^{\dagger},

where $\mathcal{A}(z)=\mathbf{I}_{p}-\sum_{t=1}^{p}A_{t}z^{t}$ and $\mathcal{B}(z)=\mathbf{I}_{p}+\sum_{t=1}^{q}B_{t}z^{t}$. In our case the specific forms of the true spectral densities are obtained by plugging in the appropriate $A_{i}$'s, $B_{i}$'s and $\Sigma_{\varepsilon}$, and the spectral precision is $\Theta(\omega)=[f(\omega)]^{-1}$. One observation is that for the white noise model, $f(\omega)=\Sigma_{\varepsilon}/2\pi$, hence $\Theta(\omega)=2\pi\Sigma_{\varepsilon}^{-1}$ for all $\omega\in[-\pi,\pi]$; therefore the sparsity patterns of the inverse covariance matrix and the spectral precision matrix are identical.
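As a numerical check, the true spectral density and precision can be computed directly from this formula; a short sketch for the VAR(1) design of (DGP2) is given below (the function name and the choice $\Sigma_{\varepsilon}=\mathbf{I}_{p}$ are ours).

import numpy as np

def varma_spectral_density(A_list, B_list, Sigma_eps, omega):
    # true spectral density f(omega) of X_t = sum_i A_i X_{t-i} + eps_t + sum_j B_j eps_{t-j}
    p = Sigma_eps.shape[0]
    z = np.exp(-1j * omega)
    A_poly = np.eye(p, dtype=complex) - sum(Ai * z ** (i + 1) for i, Ai in enumerate(A_list))
    B_poly = np.eye(p, dtype=complex) + sum(Bj * z ** (j + 1) for j, Bj in enumerate(B_list))
    H = np.linalg.inv(A_poly) @ B_poly            # transfer function
    return (H @ Sigma_eps @ H.conj().T) / (2 * np.pi)

p = 10
A = 0.5 * np.eye(p) + np.diag(-0.3 * np.ones(p - 1), 1) + np.diag(0.2 * np.ones(p - 2), 2)
f0 = varma_spectral_density([A], [], np.eye(p), omega=0.0)
Theta0 = np.linalg.inv(f0)                        # true spectral precision at omega = 0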

Figure 5: Relative mean squared error (RMSE) and area under ROC curves of CGLASSO and the sparse inverse periodogram estimator (SIPE) for $p=50$ and $\omega_{j}=0$. Panels: (a) white noise, $n=100$; (b) VAR(1), $n=200$; (c) white noise, $n=100$; (d) VAR(1), $n=200$.

At a fixed frequency, we seek the full regularization path over a range of penalty parameter ($\lambda$) values. Given the estimate $\hat{\Theta}_{\lambda^{*}}$ at a certain $\lambda=\lambda^{*}$, we incorporate a warm start, i.e. we use $\hat{\Theta}_{\lambda^{*}}$ as an initial value when estimating $\hat{\Theta}_{\lambda}$ for $\lambda$ near $\lambda^{*}$. Warm starting lets us obtain the full regularization path by repeatedly using the current estimate as the initial value for the next step.

Performance metric:  In order to compare the estimated output with the true spectral precision matrix, we use a relative mean squared error (RMSE) at frequency ωj\omega_{j} defined as

RMSEλ(ωj)=Θ^λ(ωj)Θ(ωj)F2Θ(ωj)F2.\text{RMSE}_{\lambda}(\omega_{j})=\frac{\|\hat{\Theta}_{\lambda}(\omega_{j})-\Theta(\omega_{j})\|^{2}_{F}}{\|\Theta(\omega_{j})\|^{2}_{F}}. (5.3)

In order to capture how well the graphical lasso is able to recover the sparsity pattern and the non-zero coordinates of true Θ\Theta, we use precision, recall and accuracy measures defined as follows:

precision =\frac{\#\{(k,l):\hat{\Theta}_{kl}\neq 0,\ \Theta_{kl}\neq 0\}}{\#\{(k,l):\hat{\Theta}_{kl}\neq 0\}},
recall =\frac{\#\{(k,l):\hat{\Theta}_{kl}\neq 0,\ \Theta_{kl}\neq 0\}}{\#\{(k,l):\Theta_{kl}\neq 0\}},
accuracy =\frac{\#\{(k,l):\hat{\Theta}_{kl}=0,\ \Theta_{kl}=0\}+\#\{(k,l):\hat{\Theta}_{kl}\neq 0,\ \Theta_{kl}\neq 0\}}{p^{2}}.

In addition, we calculate the area under receiver operating characteristic (ROC) curve. All the experiments are replicated 20 times in order to calculate the mean and standard deviation of each performance metric.
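A small numpy sketch of these performance metrics is given below; thresholding the modulus at a small tolerance to declare an entry non-zero is our choice.

import numpy as np

def rmse(theta_hat, theta_true):
    # relative mean squared error (5.3), squared Frobenius norms
    return (np.linalg.norm(theta_hat - theta_true, 'fro') ** 2
            / np.linalg.norm(theta_true, 'fro') ** 2)

def support_metrics(theta_hat, theta_true, tol=1e-8):
    # precision, recall and accuracy of the recovered sparsity pattern (denominator p^2, as above)
    est = np.abs(theta_hat) > tol
    true = np.abs(theta_true) > tol
    tp = np.sum(est & true)
    precision = tp / max(np.sum(est), 1)
    recall = tp / max(np.sum(true), 1)
    accuracy = (np.sum(~est & ~true) + tp) / theta_true.size
    return precision, recall, accuracy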

Bandwidth and tuning parameter selection:  We set the smoothing span $m=\lfloor\sqrt{n}\rfloor$ based on the theoretical results in Section 4. Such a choice of smoothing span is also supported by Sun et al. [2018] and Fiecas et al. [2019].

We use the extended Bayesian information criterion (EBIC) [Chen and Chen, 2008] for tuning parameter selection. As discussed in Foygel and Drton [2010], the expression of EBIC for a Gaussian graphical model in our case is

EBICγ(Θ^)=2ln(Θ^)+|E|logn+4γ|E|logp,\text{EBIC}_{\gamma}(\hat{\Theta})=-2l_{n}(\hat{\Theta})+|E|\log n+4\gamma|E|\log p,

where $l_{n}(\hat{\Theta})$ is the log-likelihood corresponding to the Gaussian graphical model, $E(\hat{\Theta})$ is the edge set of the candidate graph corresponding to $\hat{\Theta}$ (as defined in Section 4), and $\gamma\in[0,1]$ is a parameter. The choice of $\gamma$ has been studied theoretically in Chen and Chen [2008]. In our simulations, we only report results for $\gamma=0$, in which case EBIC reduces to BIC. In high-dimensional models, the extended BIC with a suitable positive value of $\gamma$ can be used as a tuning parameter selection criterion; additional experiments (not reported here) with $\gamma>0$ show that our results are qualitatively robust to this choice. We choose the tuning parameter $\lambda$ with the minimum value of BIC, that is

λ=argminλ>0BICγ(Θ^λ).\lambda^{*}=\operatorname*{arg\,min}_{\lambda>0}\ \text{BIC}_{\gamma}(\hat{\Theta}_{\lambda}).

In order to find the minimum, we search on a log-linear scale grid of values for λ\lambda.
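A minimal sketch of this selection step is given below. The solver handle fit and the Whittle-type log-likelihood proxy (with effective sample size n_eff, e.g. the $2m+1$ smoothed periodogram ordinates) are assumptions of ours; the EBIC expression itself follows the display above.

import numpy as np

def ebic(theta, f_hat, n, p, n_eff, gamma=0.0):
    # EBIC_gamma(theta_hat) = -2 loglik + |E| log n + 4 gamma |E| log p
    _, logdet = np.linalg.slogdet(theta)
    loglik = n_eff * (logdet - np.real(np.trace(f_hat @ theta)))   # assumed Whittle-type proxy
    n_edges = int(np.sum(np.abs(np.triu(theta, k=1)) > 1e-8))
    return -2 * loglik + n_edges * np.log(n) + 4 * gamma * n_edges * np.log(p)

def select_lambda(f_hat, lambdas, fit, n, n_eff, gamma=0.0):
    # search a log-linear lambda grid (large to small, matching the pathwise strategy) for minimum EBIC
    p = f_hat.shape[0]
    best_lam, best_crit = None, np.inf
    for lam in sorted(lambdas, reverse=True):
        theta = fit(f_hat, lam)            # e.g. the cglasso sketch from Section 3.3
        crit = ebic(theta, f_hat, n, p, n_eff, gamma)
        if crit < best_crit:
            best_lam, best_crit = lam, crit
    return best_lam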

RMSE
Family Inverse periodogram CGLASSO (oracle) CGLASSO (BIC)
White Noise
p=10p=10
    n=200n=200 196.56 (189.68) 13.01 (3.69) 13.74 (3.52)
    n=400n=400 65.88 (28.74) 9.09 (2.32) 10.80 (2.77)
p=20p=20
    n=200n=200 3015.44 (1807.44) 13.11 (2.28) 19.69 (1.45)
    n=400n=400 482.38 (332.41) 9.82 (1.72) 14.53 (1.61)
p=50p=50
    n=200n=200 - 15.87 (1.65) 32.13 (0.92)
    n=400n=400 - 11.34 (0.80) 23.31 (0.78)
VAR(1)
p=10p=10
    n=200n=200 155.00 (84.18) 12.07 (3.04) 13.25 (2.86)
    n=400n=400 70.89 (49.07) 9.27 (2.19) 10.59 (1.88)
p=20p=20
    n=200n=200 3233.31 (3136.76) 11.84 (2.12) 14.14 (1.74)
    n=400n=400 394.83 (220.99) 8.86 (1.51) 10.95 (1.44)
p=50p=50
    n=200n=200 - 14.49 (1.69) 28.70 (2.27)
    n=400n=400 - 12.04 (2.02) 25.17 (2.36)
VARMA(2, 2)
p=10p=10
    n=200n=200 414.04 (277.31) 8.49 (1.86) 9.79 (3.39)
    n=400n=400 109.53 (53.68) 5.57 (1.77) 7.06 (2.04)
p=20p=20
    n=200n=200 13704.13 (16463.75) 10.36 (2.06) 11.46 (4.38)
    n=400n=400 736.20 (322.94) 7.82 (1.46) 8.60 (1.53)
p=50p=50
    n=200n=200 - 11.96 (1.17) 13.21 (1.21)
    n=400n=400 - 9.02 (0.89) 12.20 (0.84)
Table 1: RMSE of the inverted periodogram and the complex graphical Lasso estimator at $\omega_{j}=0$. The CGLASSO (oracle) column reports the minimum RMSE (in %) along the regularization path; the RMSE (in %) of the estimator with minimum BIC along the regularization path is reported in the CGLASSO (BIC) column. Results are averaged over 20 independent replicates, with standard deviations (in %) in parentheses.
NWR CGLASSO CGLASSO
Family AUROC AUROC Precision Recall Accuracy
White Noise
p=10p=10
    n=200n=200 89.41 (6.77) 94.49 (4.24) 44.45 (7.99) 96.67 (6.35) 74.11 (6.74)
    n=400n=400 96.67 (2.98) 97.76 (2.66) 42.98 (7.00) 98.33 (4.07) 72.68 (6.61)
p=20p=20
    n=200n=200 87.64 (5.97) 95.53 (3.22) 39.30 (5.58) 94.74 (5.40) 84.42 (3.45)
    n=400n=400 94.60 (4.00) 98.90 (1.29) 37.78 (4.93) 99.21 (1.93) 83.21 (3.14)
p=50p=50
    n=200n=200 88.58 (2.42) 87.07 (3.01) 55.98 (5.26) 75.41 (5.90) 96.61 (0.48)
    n=400n=400 93.01 (2.07) 97.31 (1.47) 43.28 (4.75) 95.41 (2.80) 94.71 (0.98)
VAR(1)
p=10p=10
    n=200n=200 83.78 (6.09) 90.13 (5.12) 66.56 (6.04) 87.65 (7.61) 78.44 (4.84)
    n=400n=400 90.2 (3.98) 96.07 (2.94) 69.86 (5.54) 95.75 (4.42) 82.47 (4.03)
p=20p=20
    n=200n=200 90.84 (3.12) 95.83 (1.96) 49.44 (3.77) 95.45 (2.86) 79.95 (2.85)
    n=400n=400 95.25 (2.24) 97.89 (1.63) 48.62 (2.85) 97.87 (2.48) 79.34 (2.33)
p=50p=50
    n=200n=200 93.72 (1.86) 98.69 (0.67) 45.43 (3.49) 98.06 (1.09) 90.43 (1.39)
    n=400n=400 97.25 (0.85) 99.72 (0.39) 42.47 (2.52) 99.57 (0.71) 89.23 (1.04)
VARMA(2, 2)
p=10p=10
    n=200n=200 70.56 (7.85) 88.25 (3.83) 79.34 (8.58) 83.00 (6.16) 82.33 (5.37)
    n=400n=400 74.16 (6.37) 89.54 (4.52) 73.67 (7.01) 86.25 (7.23) 79.89 (5.92)
p=20p=20
    n=200n=200 66.11 (5.23) 87.22 (4.34) 64.37 (5.47) 79.50 (8.10) 86.32 (2.51)
    n=400n=400 72.32 (5.00) 90.94 (1.44) 64.557 (4.68) 85.875 (2.84) 86.95 (2.102)
p=50p=50
    n=200n=200 72.65 (2.30) 89.47 (1.98) 48.32 (3.51) 81.85 (3.90) 91.30 (1.00)
    n=400n=400 70.73 (2.56) 92.73 (2.35) 49.94 (4.29) 87.60 (4.44) 91.75 (1.162)
Table 2: AUROC of nodewise regression (NWR) and CGLASSO, together with precision, recall and accuracy (each in %) of the complex graphical Lasso at $\omega_{j}=0$. Results are averaged over 20 replicates, with standard deviations (in %) in parentheses.

Results of comparison between CGLASSO-I and CGLASSO-II: The comparison of relative errors indicates that the minimum RMSE along the regularization path of CGLASSO-II is smaller than that of CGLASSO-I. Empirical observations also suggest that the more temporal and cross-sectional dependence the data generating process involves, the better the relative error of CGLASSO-II. We demonstrate this with two cases: data from a white noise model with $p=10$, $n=400$, and from a VARMA(2, 2) model with $p=50$, $n=200$ (Figures 4a and 4b). Even when the data generating process is fixed, the difference between the minimum RMSEs tends to increase in high dimension-low sample size setups. In a nutshell, the scale-modified algorithm performs better than the vanilla version when the estimation problem becomes harder in terms of the data generating process, the dependence structure and the sample size-to-dimension ratio. Also, the minimum RMSE along the regularization path of CGLASSO-II is attained at larger values of $\lambda$, implying that the model with the smallest error is sparser when scaling is incorporated.

Results:  Next we describe the key findings of our three experiments.

Experiment 1: We compare the results for our proposed estimator with the sparse inverse periodogram estimator (SIPE) of Fiecas et al. [2019]. The latter uses the constrained 1\ell_{1} minimization for inverse matrix estimation (CLIME) of Cai et al. [2011]. For certain combinations of pp and nn for white noise and VAR(1) models, we calculate the relative mean squared error for frequency ωj=0\omega_{j}=0 along the regularization path.

In the white noise model with $p=50$ and $n=100$, we see in Figures 5a and 5b that the minimum relative error of SIPE is almost $38\%$ larger than that of complex graphical Lasso. In addition, the area under the ROC curve for SIPE is approximately $11\%$ less than that of complex graphical Lasso, indicating better model selection performance by the latter method. Additionally, if we run the experiment with a sample of size 200 from a 50-dimensional VAR(1) model, with $A$ being a block diagonal matrix whose $5\times 5$ blocks have diagonal entries 0.5 and upper diagonal entries 0.2, and error distribution $\epsilon_{t}\thicksim\mathcal{N}(0,0.5)$, the minimum RMSE of SIPE is approximately 28% larger than that of complex graphical Lasso. Similar to the white noise model, the area under the ROC curve for SIPE is almost 12% less than that of complex graphical Lasso, as shown in Figures 5c and 5d. One reason for such a significant advantage of graphical Lasso might be that Fiecas et al. [2019] use CLIME twice, estimating the real and imaginary parts of $\Theta$ separately. In particular, their method involves estimating $2p\times 2p=4p^{2}$ entries instead of the $p^{2}$ estimands in graphical Lasso.

For the sparse inverse periodogram estimator, the AUROC values are higher for larger values of the penalty parameter (Figures 5c and 5d); however, the RMSE also increases with the penalty. For complex graphical Lasso, the minimum-RMSE estimator also attains a higher AUROC than that of SIPE, indicating that for our proposed method the estimator with minimum relative error is also able to recover the true sparsity structure well.

Experiment 2: The relative error comparison with the classical estimator, i.e. the inverted periodogram, is reported in Table 1. For $p=50$, the $p\times p$ averaged periodogram is not invertible since in our setup its rank is at most $2m+1=2\lfloor\sqrt{n}\rfloor+1<50$; hence the entries for those cases are left blank. In most simulation settings, our proposed method not only has significantly smaller relative error, but the BIC-selected model also has RMSE values only slightly higher than the minimum. In addition, RMSE decreases if we fix the dimension $p$ and increase $n$, which supports the fact that the relative estimation error should go down as a larger sample becomes available.

Experiment 3: Area under the ROC curve, precision, recall and accuracy are reported in Table 2. In most of the simulation settings, AUROC values are above 80%, with most values lying in the range 90-100%, indicating good recovery of the sparsity and signal structure by our method. In addition, in each setting with the data generative family and the dimension $p$ fixed, increasing the sample size from $n=200$ to $n=400$ results in an increase in AUROC values. The method has lower precision but higher recall, indicating a tendency of CGLASSO to select false positives. This pattern of $\ell_{1}$-penalized estimators is well-known in the literature, and two-stage methods such as the adaptive Lasso [Zou, 2006] have gained popularity to alleviate this problem. Such an extension of CGLASSO is beyond the scope of this paper, and we leave it for future research.

6 Functional connectivity analysis with fMRI data

Figure 6: Yeo functional connectivity heatmaps of the partial coherence and CGLASSO methods for one individual.

We implement CGLASSO for interpretation and visualization of functional connectivity among different brain regions of a human subject. The data set used here is publicly available, high-resolution, preprocessed magnetic resonance imaging data from the Human Connectome Project (HCP)—Young Adult S1200 release [Van Essen et al., 2013; Dhamala et al., 2020]. We consider the time series data of 1003 healthy adults observed over the same time horizon across four complete resting-state fMRI runs, each scan with $n=1200$ time points.

Preprocessing: MRIs were acquired on a Siemens Skyra 3 T scanner at Washington University in St. Louis. Each subject underwent four gradient-echo EPI resting-state functional MRIs (rsfMRI), with TR = 720 ms, TE = 33.1 ms, 2.0 mm isotropic voxels, FoV = 208 × 180 mm2, flip angle = 52, matrix = 104 × 90, 72 slices. Two 15 min rsfMRI runs were acquired at each of the two sessions. In the end, the data consisted of 1200 time points for each run, for a total of 4800 time points for each subject over four scans. Each run of each subject’s rfMRI was preprocessed by the HCP consortium [Smith et al., 2013]. The data were minimally preprocessed [Glasser et al., 2013] and had artefacts removed using ICA + FIX [Griffanti et al., 2014; Salimi-Khorshidi et al., 2014]. Gray matter was parcellated into 86 cortical, subcortical and cerebellar regions using FreeSurfer [Fischl et al., 1999] and regional time series were calculated as the mean of the time series in each voxel of a given region. Each region was further assigned to a Yeo functional network delineating visual, somatomotor, dorsal attention, ventral attention, limbic, default mode, and fronto-parietal networks [Yeo et al., 2011]; we added a subcortical and cerebellar network for whole brain coverage as in previous work [Dhamala et al., 2020].

We estimate the spectral precision at frequencies $\omega=0,\pi/2,\text{ and }\pi$ using CGLASSO for the individual with subject ID 100206. The smoothing span is set to $m=\lceil 4\sqrt{n}\rceil$ and we select the penalty parameter that produces the lowest value of BIC.

Results: In Figure 6 we show an example of the FC coherence network of one individual in our sample. Our proposed CGLASSO algorithm appears successful in capturing the underlying sparsity pattern. The estimated inverse coherence matrix retains known physiological connections, e.g. connections between bilateral homologues (the same regions in the right and left hemispheres), which are known to be strongly functionally connected [Zuo et al., 2010], while suppressing many of the weaker functional connections that are likely due to noise. Particularly strong are the bilateral connections between motor and visual areas, which are known to be strongly connected. Similar findings are reported in Sun et al. [2018], where the authors used an adaptive Lasso thresholding method.

7 Conclusion

In this paper we proposed and analyzed a frequency-domain variant of graphical Lasso, an 1\ell_{1}-penalized local Whittle estimator for estimating spectral precision matrices of multivariate stationary time series. We developed fast pathwise coordinate descent algorithms for Lasso and graphical Lasso involving complex variables. We conducted a non-asymptotic analysis of our proposed method and showed that consistent estimation is possible in both low- and high-dimensional regimes as long as the true spectral precision matrix is sufficiently sparse.

We did not delve into a number of natural questions related to our method. First, extending the algorithm from a single frequency to all the frequencies is required for recovering the true graphical model for stationary time series. Such an extension will need additional work on both algorithmic and analytical fronts. A faster pathwise algorithm for jointly optimizing over different frequencies and different tuning parameters will be helpful. Extending the theoretical guarantees from a single frequency to all frequencies will also be required.

Another important question is the selection of the bandwidth/smoothing parameter when calculating the spectral density matrix. For theoretical simplicity, we use m=\lceil\sqrt{n}\rceil following the convention in the literature on large-scale spectral density matrix estimation [Böhm and von Sachs, 2009]. However, a data-driven choice of m as proposed in Ombao et al. [2001] would be more appropriate in practice. One can also consider allowing different bandwidth choices for different off-diagonal entries. We leave these directions for future research.

Finally, a number of recent works have investigated the inferential properties of frequency-domain estimators [Chang et al., 2022; Krampe and Paparoditis, 2022] and have studied nonstationary time series in high dimensions [Zhang and Wu, 2021; Basu and Subba Rao, 2023]. Extending CGLASSO and CLASSO to these broader frameworks is also left as an avenue for future research.

Acknowledgements

ND, AK and SB acknowledge partial support from NIH award R21NS120227. In addition, SB acknowledges partial support from NSF awards DMS-1812128, DMS-2210675, DMS-2239102 and NIH award R01GM135926.

References

  • Baek et al. [2021a] C. Baek, M.-C. Düker, and V. Pipiras. Thresholding and graphical local Whittle estimation. arXiv preprint arXiv:2105.13342, 2021a.
  • Baek et al. [2021b] C. Baek, M. Gampe, B. Leinwand, K. Lindquist, J. Hopfinger, K. M. Gates, and V. Pipiras. Detecting functional connectivity changes in fMRI data. Brain Connectivity, 2021b.
  • Basu and Subba Rao [2023] S. Basu and S. Subba Rao. Graphical models for nonstationary time series. The Annals of Statistics, 51(4):1453–1483, 2023.
  • Bickel and Levina [2008] P. J. Bickel and E. Levina. Covariance regularization by thresholding. The Annals of Statistics, 36(6):2577 – 2604, 2008. doi: 10.1214/08-AOS600. URL https://doi.org/10.1214/08-AOS600.
  • Böhm and von Sachs [2009] H. Böhm and R. von Sachs. Shrinkage estimation in the frequency domain of multivariate time series. Journal of Multivariate Analysis, 100(5):913–935, 2009.
  • Bowyer [2016] S. M. Bowyer. Coherence a measure of the brain networks: past and present. Neuropsychiatric Electrophysiology, 2(1):1–12, 2016.
  • Boyd and Vandenberghe [2004] S. P. Boyd and L. Vandenberghe. Convex optimization. Cambridge university press, 2004.
  • Brockwell and Davis [1991] P. J. Brockwell and R. A. Davis. Time series: theory and methods. Springer science & business media, 1991.
  • Cai et al. [2011] T. Cai, W. Liu, and X. Luo. A constrained 1\ell_{1} minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494):594–607, 2011.
  • Carter et al. [2023] J. S. Carter, D. Rossell, and J. Q. Smith. Partial correlation graphical lasso. Scandinavian Journal of Statistics, 2023.
  • Chang et al. [2022] J. Chang, Q. Jiang, T. S. McElroy, and X. Shao. Statistical inference for high-dimensional spectral density matrix. arXiv preprint arXiv:2212.13686, 2022.
  • Chen and Chen [2008] J. Chen and Z. Chen. Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3):759–771, 2008.
  • Dahlhaus [2000] R. Dahlhaus. Graphical interaction models for multivariate time series. Metrika, 51(2):157–172, 2000.
  • Dallakyan et al. [2022] A. Dallakyan, R. Kim, and M. Pourahmadi. Time series graphical lasso and sparse VAR estimation. Computational Statistics & Data Analysis, 176:107557, 2022.
  • Davis et al. [2016] R. A. Davis, P. Zang, and T. Zheng. Sparse vector autoregressive modeling. Journal of Computational and Graphical Statistics, 25(4):1077–1096, 2016.
  • Dhamala et al. [2020] E. Dhamala, K. W. Jamison, M. R. Sabuncu, and A. Kuceyeski. Sex classification using long-range temporal dependence of resting-state functional MRI time series. Human Brain Mapping, 41(13):3567–3579, 2020.
  • Dudley et al. [2011] R. M. Dudley, R. Norvaiša, and R. Norvaiša. Concrete functional calculus. Springer, 2011.
  • Dummit and Foote [2004] D. S. Dummit and R. M. Foote. Abstract algebra, volume 3. Wiley Hoboken, 2004.
  • Fiecas and von Sachs [2014] M. Fiecas and R. von Sachs. Data-driven shrinkage of the spectral density matrix of a high-dimensional time series. Electronic Journal of Statistics, 8(2):2975–3003, 2014.
  • Fiecas et al. [2010] M. Fiecas, H. Ombao, C. Linkletter, W. Thompson, and J. Sanes. Functional connectivity: Shrinkage estimation and randomization test. NeuroImage, 49(4):3005–3014, 2010.
  • Fiecas et al. [2019] M. Fiecas, C. Leng, W. Liu, and Y. Yu. Spectral analysis of high-dimensional time series. Electronic Journal of Statistics, 13(2):4079–4101, 2019.
  • Fischl et al. [1999] B. Fischl, M. I. Sereno, and A. M. Dale. Cortical surface-based analysis: II. Inflation, flattening, and a surface-based coordinate system. NeuroImage, 9(2):195–207, 1999. ISSN 10538119. doi: 10.1006/nimg.1998.0396.
  • Foygel and Drton [2010] R. Foygel and M. Drton. Extended Bayesian information criteria for Gaussian graphical models. Advances in neural information processing systems, 23, 2010.
  • Friedman et al. [2007] J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani. Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302 – 332, 2007. doi: 10.1214/07-AOAS131. URL https://doi.org/10.1214/07-AOAS131.
  • Friedman et al. [2008] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
  • Glasser et al. [2013] M. F. Glasser, S. N. Sotiropoulos, J. A. Wilson, T. S. Coalson, B. Fischl, J. L. Andersson, J. Xu, S. Jbabdi, M. Webster, J. R. Polimeni, et al. The minimal preprocessing pipelines for the Human Connectome Project. Neuroimage, 80:105–124, 2013.
  • Granger [1969] C. W. Granger. Investigating causal relations by econometric models and cross-spectral methods. Econometrica: journal of the Econometric Society, pages 424–438, 1969.
  • Griffanti et al. [2014] L. Griffanti, G. Salimi-Khorshidi, C. F. Beckmann, E. J. Auerbach, G. Douaud, C. E. Sexton, E. Zsoldos, K. P. Ebmeier, N. Filippini, C. E. Mackay, et al. ICA-based artefact removal and accelerated fMRI acquisition for improved resting state network imaging. Neuroimage, 95:232–247, 2014.
  • Haase [2018] M. Haase. Lectures on functional calculus. In 21st International Internet Seminar, Kiel Univ, 2018.
  • Hastie et al. [2015] T. Hastie, R. Tibshirani, and M. Wainwright. Statistical learning with sparsity. Monographs on statistics and applied probability, 143:143, 2015.
  • Herstein [1991] I. N. Herstein. Topics in algebra. John Wiley & Sons, 1991.
  • Horn and Johnson [2012] R. A. Horn and C. R. Johnson. Matrix analysis. Cambridge university press, 2012.
  • Janková and van de Geer [2018] J. Janková and S. van de Geer. Inference in high-dimensional graphical models. In Handbook of graphical models, pages 325–350. CRC Press, 2018.
  • Jung et al. [2015] A. Jung, G. Hannak, and N. Goertz. Graphical lasso based model selection for time series. IEEE Signal Processing Letters, 22(10):1781–1785, 2015.
  • Kellogg et al. [1976] R. B. Kellogg, T.-Y. Li, and J. Yorke. A constructive proof of the Brouwer fixed-point theorem and computational results. SIAM Journal on Numerical Analysis, 13(4):473–483, 1976.
  • Krampe and Paparoditis [2022] J. Krampe and E. Paparoditis. Frequency domain statistical inference for high-dimensional time series. arXiv preprint arXiv:2206.02250, 2022.
  • Loh and Wainwright [2017] P.-L. Loh and M. J. Wainwright. Support recovery without incoherence: A case for nonconvex regularization. The Annals of Statistics, 45(6):2455 – 2482, 2017. doi: 10.1214/16-AOS1530. URL https://doi.org/10.1214/16-AOS1530.
  • Maleki et al. [2013] A. Maleki, L. Anitori, Z. Yang, and R. G. Baraniuk. Asymptotic analysis of complex lasso via complex approximate message passing (CAMP). IEEE Transactions on Information Theory, 59(7):4290–4308, 2013.
  • Mazumder and Hastie [2012] R. Mazumder and T. Hastie. The graphical lasso: New insights and alternatives. Electronic journal of statistics, 6:2125, 2012.
  • Meier [2020] L. Meier. grplasso: Fitting User-Specified Models with Group Lasso Penalty, 2020. URL https://CRAN.R-project.org/package=grplasso. R package version 0.4-7.
  • Meinshausen and Bühlmann [2006] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. The annals of statistics, 34(3):1436–1462, 2006.
  • Ombao and Pinto [2022] H. Ombao and M. Pinto. Spectral dependence. Econometrics and Statistics, 2022.
  • Ombao et al. [2001] H. C. Ombao, J. A. Raz, R. L. Strawderman, and R. Von Sachs. A simple generalised crossvalidation method of span selection for periodogram smoothing. Biometrika, 88(4):1186–1192, 2001.
  • Ravikumar et al. [2011] P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation by minimizing 1\ell_{1}-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980, 2011.
  • Rosenblatt [2012] M. Rosenblatt. Stationary sequences and random fields. Springer Science & Business Media, 2012.
  • Salimi-Khorshidi et al. [2014] G. Salimi-Khorshidi, G. Douaud, C. F. Beckmann, M. F. Glasser, L. Griffanti, and S. M. Smith. Automatic denoising of functional MRI data: combining independent component analysis and hierarchical fusion of classifiers. Neuroimage, 90:449–468, 2014.
  • Schneider-Luftman and Walden [2016] D. Schneider-Luftman and A. T. Walden. Partial coherence estimation via spectral matrix shrinkage under quadratic loss. IEEE Transactions on Signal Processing, 64(22):5767–5777, 2016.
  • Shumway and Stoffer [2000] R. H. Shumway and D. S. Stoffer. Time series analysis and its applications, volume 3. Springer, 2000.
  • Smith et al. [2013] S. M. Smith, C. F. Beckmann, J. Andersson, E. J. Auerbach, J. Bijsterbosch, G. Douaud, E. Duff, D. A. Feinberg, L. Griffanti, M. P. Harms, et al. Resting-state fMRI in the Human Connectome Project. Neuroimage, 80:144–168, 2013.
  • Sun et al. [2018] Y. Sun, Y. Li, A. Kuceyeski, and S. Basu. Large spectral density matrix estimation by thresholding. arXiv preprint arXiv:1812.00532, 2018.
  • Tseng [2001] P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of optimization theory and applications, 109(3):475, 2001.
  • Tugnait [2022] J. K. Tugnait. On sparse high-dimensional graphical model learning for dependent time series. Signal Processing, 197:108539, 2022.
  • Van Essen et al. [2013] D. C. Van Essen, S. M. Smith, D. M. Barch, T. E. Behrens, E. Yacoub, K. Ugurbil, W.-M. H. Consortium, et al. The WU-Minn Human Connectome Project: an overview. Neuroimage, 80:62–79, 2013.
  • Wainwright [2009] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using 1\ell_{1}-constrained quadratic programming (lasso). IEEE transactions on information theory, 55(5):2183–2202, 2009.
  • Weylandt Jr [2020] R. M. Weylandt Jr. Computational and Statistical Methodology for Highly Structured Data. PhD thesis, Rice University, 2020.
  • Whittle [1951] P. Whittle. Hypothesis testing in time series analysis. PhD thesis, Uppsala University, 1951.
  • Wilms et al. [2021] I. Wilms, S. Basu, J. Bien, and D. S. Matteson. Sparse identification and estimation of large-scale vector autoregressive moving averages. Journal of the American Statistical Association, pages 1–12, 2021.
  • Wodeyar and Srinivasan [2022] A. Wodeyar and R. Srinivasan. Structural connectome constrained graphical lasso for MEG partial coherence. Network Neuroscience, 6(4):1219–1242, 2022.
  • Yeo et al. [2011] B. T. T. Yeo, F. M. Krienen, J. Sepulcre, M. R. Sabuncu, D. Lashkari, M. Hollinshead, J. L. Roffman, J. W. Smoller, L. Zöllei, J. R. Polimeni, B. Fischl, H. Liu, and R. L. Buckner. The organization of the human cerebral cortex estimated by intrinsic functional connectivity. Journal of neurophysiology, 106(3):1125–65, 9 2011. ISSN 1522-1598. doi: 10.1152/jn.00338.2011.
  • Yuan and Lin [2006] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.
  • Zhang and Wu [2021] D. Zhang and W. B. Wu. Convergence of covariance and spectral density estimates for high-dimensional locally stationary processes. The Annals of Statistics, 49(1):233 – 254, 2021. doi: 10.1214/20-AOS1954. URL https://doi.org/10.1214/20-AOS1954.
  • Zhao and Yu [2006] P. Zhao and B. Yu. On model selection consistency of lasso. The Journal of Machine Learning Research, 7:2541–2563, 2006.
  • Zou [2006] H. Zou. The adaptive lasso and its oracle properties. Journal of the American statistical association, 101(476):1418–1429, 2006.
  • Zuo et al. [2010] X.-N. Zuo, C. Kelly, A. Di Martino, M. Mennes, D. S. Margulies, S. Bangaru, R. Grzadzinski, A. C. Evans, Y.-F. Zang, F. X. Castellanos, et al. Growing together and growing apart: regional and sex differences in the lifespan developmental trajectories of functional homotopy. Journal of Neuroscience, 30(45):15034–15043, 2010.

Appendix A A short introduction to groups, rings and fields

This supplement is intended for readers new to the abstract algebraic objects used in this paper. In this section, we provide some basic axiomatic definitions and relevant examples of the algebraic structures required to understand the methodological tools in this paper. Readers will find a deeper treatment of this topic in Herstein [1991] and Dummit and Foote [2004].

Definition A.1 (Group).

A nonempty set of elements GG is said to form a group if in GG there is defined a binary operation, called the product and denoted by \cdot, such that

  1. a,b\in G\ \implies\ a\cdot b\in G (closed),

  2. a,b,c\in G\ \implies\ a\cdot(b\cdot c)=(a\cdot b)\cdot c (associative law),

  3. \exists\ e\in G such that a\cdot e=e\cdot a=a for all a\in G (existence of identity),

  4. for every a\in G, \exists\ a^{-1}\in G such that a\cdot a^{-1}=a^{-1}\cdot a=e (existence of inverse).

Example. 1. The set of all integers \mathbb{Z} under the additive operation +; 2. the set of all complex numbers \mathbb{C} under complex addition +; 3. the set of 2\times 2 matrices defined in (3.1) under matrix addition +; etc.

In this paper we deal with Abelian groups of infinite order, that is, groups with infinitely many elements and where the group operation \cdot is commutative, i.e. ab=baGa\cdot b=b\cdot a\in G for all a,bGa,b\in G . Details can be found in Herstein [1991].

Definition A.2 (Ring).

A nonempty set R is said to be an associative ring if in R there are defined two operations + and \cdot, such that for all a,b,c\in R,

  1. a+b\in R,

  2. a+b=b+a,

  3. a+(b+c)=(a+b)+c,

  4. there is an element 0\in R such that a+0=0+a=a for all a\in R (additive identity),

  5. there is an element -a\in R such that a+(-a)=0 (additive inverse),

  6. a\cdot b\in R,

  7. a\cdot(b\cdot c)=(a\cdot b)\cdot c,

  8. a\cdot(b+c)=a\cdot b+a\cdot c and (b+c)\cdot a=b\cdot a+c\cdot a (distributive laws).

It follows from axioms 1–5 that R is an Abelian group under the operation + (addition). Similarly, it follows from axioms 6–7 that R is closed under the associative operation \cdot (multiplication). In the usual notation, a\cdot b is simply written as ab. The two operations are bridged by axiom 8.

Example.

  1. The set of real numbers \mathbb{R} is a commutative ring under the operations + (addition) and \cdot (multiplication). The additive unit is 0, and \mathbb{R} also has a multiplicative unit denoted by 1.

  2. Similarly, \mathbb{C} is a ring under + and \cdot.

  3. The set of all matrices characterized by (3.1) is a ring under matrix addition and multiplication.

Definition A.3 (Ring homomorphism).

A map ϕ\phi from the ring RR into the ring RR^{\prime} is said to be a homomorphism if

  1. \phi(a+b)=\phi(a)+\phi(b),

  2. \phi(ab)=\phi(a)\phi(b).

Note. The operations + and \cdot on the left-hand sides are those of the ring R, while the operations + and \cdot on the right-hand sides are those of R^{\prime}.

Example.

  1. Let R=R^{\prime} be the set of all real numbers \mathbb{R}, both equipped with the ring operations of addition + and multiplication \cdot. Then \phi:R\rightarrow R^{\prime}, defined by \phi(x)=x for every x\in R, is a ring isomorphism.

  2. The complex conjugation map \phi:\mathbb{C}\rightarrow\mathbb{C}, defined as \phi(z)=\overline{z} for all z\in\mathbb{C}, is a ring homomorphism.

Definition A.4 (Ring isomorphism).

A homomorphism of RR into RR^{\prime} is said to be an isomorphism if it is a one-to-one mapping. Two rings are said to be isomorphic if there exists an isomorphism of one onto the other.

Example. Both examples of ring homomorphisms above are one-to-one maps, and hence they also serve as examples of ring isomorphisms.

Definition A.5 (Field).

A ring is said to be a division ring if its nonzero elements form a group under multiplication. A field is a commutative division ring.

Example.

  1. Some standard, widely used fields in mathematics are the rational numbers \mathbb{Q}, the real numbers \mathbb{R}, the complex numbers \mathbb{C}, and the finite fields \mathbb{Z}_{p} for a prime integer p.

  2. The set of matrices characterized by (3.1) is a field isomorphic to the field of complex numbers \mathbb{C} (Proposition 3.1).

Note. A field is always a ring. Two fields F and F^{\prime} are said to be isomorphic if they are isomorphic as rings. We use the terms ring isomorphism and field isomorphism separately in order to keep the underlying isomorphic objects specified as rings or fields, respectively. A field isomorphism is always a ring isomorphism, but the converse may not be true.

Appendix B Proofs of technical results

Proof of Proposition 3.1

We first show that the map φ:2×2\varphi:\mathbb{C}\rightarrow\mathcal{M}^{2\times 2} is a field isomorphism. Clearly \mathbb{C} is a field. Also, \mathcal{M} is a field with the following additive identity (zero) and multiplicative identity (unit)

𝟎=[0000],𝟏=[1001]\mathbf{0}_{\mathcal{M}}=\begin{bmatrix}0&0\\ 0&0\end{bmatrix},~{}~{}~{}\mathbf{1}_{\mathcal{M}}=\begin{bmatrix}1&0\\ 0&1\end{bmatrix}

and for any A=\begin{bmatrix}a&-b\\ b&a\end{bmatrix}, where a and b are not both zero, the inverse element of the field is

A1=1a2+b2[abba]A^{-1}=\frac{1}{a^{2}+b^{2}}\begin{bmatrix}a&b\\ -b&a\end{bmatrix}\in\mathcal{M}

Next we consider the map φ\varphi. For any z,zz,z^{\prime}\in\mathbb{C}, Re(z+z)=Re(z)+Re(z)\textbf{Re}(z+z^{\prime})=\textbf{Re}(z)+\textbf{Re}(z^{\prime}), Im(z+z)=Im(z)+Im(z)\textbf{Im}(z+z^{\prime})=\textbf{Im}(z)+\textbf{Im}(z^{\prime}). Hence,

φ(z+z)=[Re(z+z)Im(z+z)Im(z+z)Re(z+z)]=φ(z)+φ(z)\varphi(z+z^{\prime})=\begin{bmatrix}\textbf{Re}(z+z^{\prime})&-\textbf{Im}(z+z^{\prime})\\ \textbf{Im}(z+z^{\prime})&\textbf{Re}(z+z^{\prime})\end{bmatrix}=\varphi(z)+\varphi(z^{\prime})

Let z=x+iyz=x+\mathrm{i}y and z=x+iyz^{\prime}=x^{\prime}+\mathrm{i}y^{\prime}. Therefore,

φ(zz)\displaystyle\varphi(zz^{\prime}) =φ((x+iy)(x+iy))\displaystyle=\varphi((x+\mathrm{i}y)(x^{\prime}+\mathrm{i}y^{\prime}))
=φ((xxyy)+i(xy+xy))\displaystyle=\varphi((xx^{\prime}-yy^{\prime})+\mathrm{i}(xy^{\prime}+x^{\prime}y))
=[xxyyxyxyxy+xyxxyy]\displaystyle=\begin{bmatrix}xx^{\prime}-yy^{\prime}&-xy^{\prime}-x^{\prime}y\\ xy^{\prime}+x^{\prime}y&xx^{\prime}-yy^{\prime}\end{bmatrix}
=[xyyx][xyyx]\displaystyle=\begin{bmatrix}x&-y\\ y&x\end{bmatrix}\ \begin{bmatrix}x^{\prime}&-y^{\prime}\\ y^{\prime}&x^{\prime}\end{bmatrix}
=φ(z)φ(z)\displaystyle=\varphi(z)\varphi(z^{\prime})

Hence, the map \varphi is a ring homomorphism, and since it is clearly a bijection between \mathbb{C} and \mathcal{M}, it is an isomorphism. Also note that \varphi(0)=\mathbf{0}_{\mathcal{M}} and \varphi(1)=\mathbf{1}_{\mathcal{M}}. In addition, for any z\neq 0, \varphi(z^{-1})=[\varphi(z)]^{-1}. Hence, \varphi is a field isomorphism between \mathbb{C} and \mathcal{M}.

For any z=x+iyz=x+\mathrm{i}y\in\mathbb{C}, φ1(z)=(x,y)\varphi_{1}(z)=(x,y)^{\top} and φ2(z)=(y,x)\varphi_{2}(z)=(-y,x)^{\top}. Hence,

φ1(z),φ2(z)=x(y)+yx=0\langle\varphi_{1}(z),\varphi_{2}(z)\rangle=x(-y)+yx=0

Also it is clear that

φ1(z)2=φ2(z)2=x2+y2=|z|2\|\varphi_{1}(z)\|^{2}=\|\varphi_{2}(z)\|^{2}=x^{2}+y^{2}=|z|^{2}
φ(z)=φ(xiy)=[xyyx]=φ(z)\varphi(z^{\dagger})=\varphi(x-\mathrm{i}y)=\begin{bmatrix}x&y\\ -y&x\end{bmatrix}=\varphi(z)^{\top}

Hence properties (i)–(iii) are shown. \hfill\square
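The homomorphism identities and properties (i)–(iii) established above can be illustrated numerically. The following is a small sketch assuming NumPy (not part of the paper's code); `phi` implements the map z \mapsto [[Re z, -Im z], [Im z, Re z]] from the proof.

```python
import numpy as np

def phi(z):
    """Real 2x2 representation of a complex number z."""
    return np.array([[z.real, -z.imag],
                     [z.imag,  z.real]])

rng = np.random.default_rng(0)
z = complex(*rng.normal(size=2))
w = complex(*rng.normal(size=2))

assert np.allclose(phi(z + w), phi(z) + phi(w))         # additivity
assert np.allclose(phi(z * w), phi(z) @ phi(w))         # multiplicativity
assert np.allclose(phi(1 / z), np.linalg.inv(phi(z)))   # inverses map to inverses
assert np.allclose(phi(z.conjugate()), phi(z).T)        # conjugation corresponds to transpose
# the two columns phi_1(z), phi_2(z) are orthogonal and have squared norm |z|^2
assert np.isclose(phi(z)[:, 0] @ phi(z)[:, 1], 0.0)
assert np.isclose(np.sum(phi(z)[:, 0] ** 2), abs(z) ** 2)
```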

Proof of Proposition 3.3

  • (a)

    It can be shown that both \mathbb{C}^{n} and \mathbb{R}^{2n\times 2} are rings under addition + and entry-wise multiplication \odot of n-dimensional complex-valued vectors and 2n\times 2-dimensional real matrices, respectively, for any positive integer n [Herstein, 1991]. We consider the extended map \varphi:\mathbb{C}^{n}\rightarrow\mathbb{R}^{2n\times 2} as described in (3.2). Then, for any \mathbf{z}_{1},\mathbf{z}_{2}\in\mathbb{C}^{n} such that \mathbf{z}_{1}=(z_{1i})_{i=1}^{n},\mathbf{z}_{2}=(z_{2i})_{i=1}^{n},

    φ(𝐳1+𝐳2)=φ((z1i+z2i)i=1n)=(φ(z1i+z2i))i=1n=(φ(z1i)+φ(z2i))i=1n=φ(𝐳1)+φ(𝐳2),\varphi(\mathbf{z}_{1}+\mathbf{z}_{2})=\varphi((z_{1i}+z_{2i})_{i=1}^{n})=(\varphi(z_{1i}+z_{2i}))_{i=1}^{n}=(\varphi(z_{1i})+\varphi(z_{2i}))_{i=1}^{n}=\varphi(\mathbf{z}_{1})+\varphi(\mathbf{z}_{2}),

    and

    φ(𝐳1𝐳2)=φ((z1iz2i)i=1n)=(φ(z1iz2i))i=1n=(φ(z1i)φ(z2i))i=1n=φ(𝐳1)φ(𝐳2).\varphi(\mathbf{z}_{1}\odot\mathbf{z}_{2})=\varphi((z_{1i}z_{2i})_{i=1}^{n})=(\varphi(z_{1i}z_{2i}))_{i=1}^{n}=(\varphi(z_{1i})\varphi(z_{2i}))_{i=1}^{n}=\varphi(\mathbf{z}_{1})\odot\varphi(\mathbf{z}_{2}).

    Therefore, the extended \varphi:\mathbb{C}^{n}\rightarrow\mathbb{R}^{2n\times 2} of (3.2) is a ring homomorphism. Using Proposition 5 of Section 7.3 of Dummit and Foote [2004], \text{Image}(\varphi) is a subring of \mathbb{R}^{2n\times 2}, and the kernel of \varphi, denoted by \text{ker}(\varphi)=\{\mathbf{z}:\varphi(\mathbf{z})=\mathbf{0}_{2n\times 2}\}, is a subring of \mathbb{C}^{n}. In this case, \text{ker}(\varphi) consists of all those vectors \mathbf{z}\in\mathbb{C}^{n} such that \textbf{Re}(\mathbf{z})=0 and \textbf{Im}(\mathbf{z})=0, i.e. \mathbf{z}=0. Therefore, \varphi has a trivial kernel and hence is a one-to-one map. As a result, \varphi described in (3.2) is a ring isomorphism.

  • (b)

    The proof is similar to part (a). The only differences in this case are that (i) \mathbb{C}^{m\times n} and \mathbb{R}^{2m\times 2n} are rings under addition + and entry-wise multiplication \odot of m\times n-dimensional complex matrices and 2m\times 2n-dimensional real matrices, respectively, and (ii) \varphi is as described in (3.3).

  • (c)

    Similar to the proof of part (a), GLn()\text{GL}_{n}(\mathbb{C}) is a field, and hence a ring. Image(φ)\text{Image}(\varphi) is a subring of 2n×2n\mathbb{R}^{2n\times 2n} and the kernel is trivial, which implies that φ\varphi described in (3.3) is a ring isomorphism in this case. In addition, using the homomorphic properties of φ\varphi and the field properties of GLn()\text{GL}_{n}(\mathbb{C}), it can be shown that Image(φ)\text{Image}(\varphi) is also a field. Therefore, Image(φ)\text{Image}(\varphi) is a field in this case, and hence φ\varphi is a field isomorphism.

    For any AGLn()A\in GL_{n}(\mathbb{C}), φ(A)φ(A1)=φ(AA1)=φ(In)=I2n\varphi(A)\varphi(A^{-1})=\varphi(AA^{-1})=\varphi(I_{n})=I_{2n}. Therefore, φ(A1)=[φ(A)]1\varphi(A^{-1})=[\varphi(A)]^{-1}.
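Part (c) can likewise be illustrated numerically. The sketch below assumes NumPy and uses the block arrangement [[Re A, -Im A], [Im A, Re A]], which is one standard realification of a complex matrix; it is assumed here to agree (possibly up to a permutation of rows and columns) with the map in (3.3), and it is not the paper's code.

```python
import numpy as np

def realify(A):
    """Map a complex (n x n) matrix to a (2n x 2n) real block matrix."""
    return np.block([[A.real, -A.imag],
                     [A.imag,  A.real]])

rng = np.random.default_rng(1)
n = 4
A = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
B = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))

assert np.allclose(realify(A + B), realify(A) + realify(B))               # additive
assert np.allclose(realify(A @ B), realify(A) @ realify(B))               # multiplicative
assert np.allclose(realify(np.linalg.inv(A)), np.linalg.inv(realify(A)))  # phi(A^{-1}) = [phi(A)]^{-1}
```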

Proof of Proposition 4.1

By Proposition 3.5 of Sun et al. [2018], for some constants c,cc,c^{\prime} and any η>0\eta>0, we have

\mathbb{P}\left(\left|\hat{f}_{kl}(\omega_{j})-\mathbb{E}\hat{f}_{kl}(\omega_{j})\right|\geq|||f|||\,\eta\right)\leq c^{\prime}\exp\{-c(2m+1)\min\{\eta,\eta^{2}\}\}.

We choose \eta=\sqrt{\frac{R\log p}{m}}. In the asymptotic regime m\succsim|||f|||^{2}\log p, we have \eta=\sqrt{\frac{R\log p}{m}}\geq\frac{R\log p}{m}=\eta^{2}. Therefore, by using a union bound over all k,l, we obtain

\mathbb{P}\left(\max_{k,l}\left|\hat{f}_{kl}(\omega_{j})-\mathbb{E}\hat{f}_{kl}(\omega_{j})\right|\geq|||f|||\sqrt{\frac{R\log p}{m}}\right)\leq p^{2}\cdot c^{\prime}\exp\left\{-c(2m+1)\frac{R\log p}{m}\right\}\leq p^{2}c^{\prime}\exp\{-2cR\log p\}=c^{\prime}p^{-(2cR-2)}. (B.1)

Also, from Proposition 3.3 of Sun et al. [2018], the upper bound on the bias term is

|𝔼f^kl(ωj)fkl(ωj)|m+12πnΩn+12πLn,for all k,l,\left|\mathbb{E}\hat{f}_{kl}(\omega_{j})-f_{kl}(\omega_{j})\right|\leq\frac{m+\frac{1}{2\pi}}{n}\Omega_{n}+\frac{1}{2\pi}L_{n},~{}~{}~{}\mbox{for all }k,l,

where Ωn\Omega_{n} and LnL_{n} are defined in (4.1). Therefore,

maxk,l|𝔼f^kl(ωj)fkl(ωj)|m+12πnΩn+12πLn.\max_{k,l}\left|\mathbb{E}\hat{f}_{kl}(\omega_{j})-f_{kl}(\omega_{j})\right|\leq\frac{m+\frac{1}{2\pi}}{n}\Omega_{n}+\frac{1}{2\pi}L_{n}. (B.2)

We have chosen \Delta_{f}(m,n,p)=|||f|||\sqrt{\frac{R\log p}{m}}+\frac{m+\frac{1}{2\pi}}{n}\Omega_{n}+\frac{1}{2\pi}L_{n}. Combining (B.1) and (B.2), we obtain

\mathbb{P}\left(\left\|\hat{f}(\omega_{j})-f(\omega_{j})\right\|_{\max}\geq\Delta_{f}(m,n,p)\right)\leq c^{\prime}p^{-(cR-2)}.

\hfill\square

Proof of Theorem 4.1

Our objective is to show that the primal-dual witness matrix \widetilde{\Theta} in (B.4) is equal to the solution of the optimization problem (2.6) with high probability. For a fixed j\in[n] and Fourier frequency \omega_{j}\in[-\pi,\pi], we define the event \mathcal{E}:=\left\{\left\|\hat{f}(\omega_{j})-f(\omega_{j})\right\|_{\max}\leq\Delta_{f}(m,n,p)\right\}. As we are in the regime m\succeq|||f|||^{2}\zeta^{2}\log p, m\preceq n/[\Omega_{n}\zeta] and \zeta\geq 1, we have m\succeq|||f|||^{2}\log p and m\preceq n/\Omega_{n}. Also, in the definition of \Delta_{f}(m,n,p) in (4.2), we consider R>1/c, i.e. \tau=cR>2. Hence, by Proposition 4.1, \mathbb{P}(\mathcal{E})\geq 1-\frac{c^{\prime}}{p^{\tau-2}} for constants c^{\prime}>0 and \tau>2. In the following analysis, we condition on the event \mathcal{E}.

Our proof proceeds along the following route: we first describe the primal-dual witness construction required for the proof, and then state a series of claims that are restatements of a number of lemmas in Ravikumar et al. [2011], highlighting the changes needed to accommodate the complex-valued setting.

Claim B.1.

The penalized log-determinant program (2.6) has a unique solution \hat{\Theta} characterized by the equation

f^(ωj)Θ^1+λΨ^=0,\hat{f}(\omega_{j})-{\hat{\Theta}}^{-1}+\lambda\hat{\Psi}=0, (B.3)

where \hat{\Psi} is an element of the subdifferential \partial\|(\cdot)\|_{\text{1,off}} evaluated at \hat{\Theta}, as characterized in (3.15).

Proof sketch. The proof of Claim B.1 follows from Lemma 3 of Ravikumar et al. [2011]. Claim B.1 is similar to their result, with the real covariance matrix \hat{\Sigma} replaced by the averaged periodogram matrix \hat{f}(\omega_{j}), and \hat{\Psi} replaced by the matrix of entry-wise signs of \hat{\Theta} (details available in Equation (41), Section 4.1 of Ravikumar et al. [2011]). The basic idea of the proof is the following: the penalized log-determinant program (2.6) is a strictly convex program, i.e. the objective function is a strictly convex function of \Theta, since the \ell_{1} penalty on the off-diagonals is a convex function of \Theta and the negative log-determinant barrier is strictly convex on positive definite matrices (Section 3.1.5 of Boyd and Vandenberghe [2004]). Also, for complex positive definite Hermitian matrices, i.e. \Theta\in\mathcal{H}^{p}_{++}, the eigenvalues and diagonal elements of \Theta are strictly positive. Following the proof of Lemma 3 in Ravikumar et al. [2011] and Hadamard's inequality for Hermitian positive definite matrices (Theorem 7.8.1 of Horn and Johnson [2012]), we have \log\det\Theta\leq\sum_{k=1}^{p}\log\Theta_{kk}. Hence,

k=1pΘkk[f^(ωj)]kklogdetΘk=1p(Θkk[f^(ωj)]kklogΘkk).\sum_{k=1}^{p}\Theta_{kk}[\hat{f}(\omega_{j})]_{kk}-\log\det\Theta\geq\sum_{k=1}^{p}\left(\Theta_{kk}[\hat{f}(\omega_{j})]_{kk}-\log\Theta_{kk}\right).

The right-hand side of the above inequality is coercive, i.e. it diverges to infinity as (Θ11,,Θpp)\|(\Theta_{11},\cdots,\Theta_{pp})\|\rightarrow\infty. Therefore, the objective function is a continuous and coercive function of Θ\Theta. Hence the level sets are compact, and continuity implies that a minimum is attained. In addition, strict convexity guarantees the uniqueness of the minimum.

A sufficient condition for \hat{\Theta} to be a solution of (2.6) is that zero belongs to the subdifferential of the objective function of (2.6) evaluated at \hat{\Theta}, i.e. 0\in\partial\left(\langle\hat{\Theta},\hat{f}(\omega_{j})\rangle-\log\det\hat{\Theta}+\lambda\|\hat{\Theta}\|_{1,\text{off}}\right). Using Theorems 2.9 and 2.10 of Weylandt Jr [2020], the KKT (Karush-Kuhn-Tucker) condition, i.e. the subdifferential equation corresponding to the optimization problem (2.6), is the same as (B.3). \hfill\square
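For concreteness, the stationarity condition (B.3) can be checked numerically for any candidate estimate. The sketch below assumes NumPy and uses the standard characterization of the complex subgradient of the off-diagonal \ell_{1} norm (the phase \hat{\Theta}_{kl}/|\hat{\Theta}_{kl}| on the support, modulus at most one off the support); it is an illustration under these assumptions, not the paper's implementation.

```python
import numpy as np

def check_kkt(f_hat, Theta_hat, lam, tol=1e-8):
    """Verify that f_hat - Theta_hat^{-1} + lam * Psi_hat = 0 entry-wise."""
    p = f_hat.shape[0]
    R = f_hat - np.linalg.inv(Theta_hat)              # should equal -lam * Psi_hat
    ok = True
    for k in range(p):
        for l in range(p):
            if k == l:
                ok &= abs(R[k, l]) <= tol             # no penalty on the diagonal
            elif abs(Theta_hat[k, l]) > tol:          # active off-diagonal entry
                phase = Theta_hat[k, l] / abs(Theta_hat[k, l])
                ok &= abs(R[k, l] + lam * phase) <= tol
            else:                                     # inactive entry: subgradient bound
                ok &= abs(R[k, l]) <= lam + tol
    return bool(ok)

# sanity check at lam = 0, where the unpenalized solution is the inverted periodogram
rng = np.random.default_rng(2)
Z = rng.normal(size=(5, 50)) + 1j * rng.normal(size=(5, 50))
f_hat = Z @ Z.conj().T / 50                           # Hermitian positive definite
assert check_kkt(f_hat, np.linalg.inv(f_hat), lam=0.0)
```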

Now that we have guaranteed the existence, uniqueness, and characterization of the optimal solution of (2.6), we can take a primal-dual witness approach to calculate the solution as well as to recover the correct support set. The primal-dual witness construction is motivated by previous works of Zhao and Yu [2006]; Wainwright [2009]; Ravikumar et al. [2011]; Loh and Wainwright [2017]. The key idea is to construct a pair of primal and dual feasible solutions satisfying a strict dual feasibility condition, in a setting where the objective function may not be strictly convex. Next we restate the primal-dual witness construction of Ravikumar et al. [2011] for a restricted CGLASSO problem.

Primal-dual witness construction

We recall from section 4 that SS is the augmented edge set, and consider the restricted log-determinant problem

Θ~=argminΘ++p:ΘSc=0Θ,f^(ωj)logdetΘ+λΘ1,off.\widetilde{\Theta}=\operatorname*{arg\,min}_{\Theta\in\mathcal{H}_{++}^{p}:\ \Theta_{S^{\texttt{c}}}=0}\langle\Theta,\hat{f}(\omega_{j})\rangle-\log\det\Theta+\lambda\|\Theta\|_{1,\text{off}}. (B.4)

By this construction, Θ~\widetilde{\Theta} is positive-definite Hermitian as well as Θ~Sc=0\widetilde{\Theta}_{S^{\texttt{c}}}=0. Let Ψ~()1,off(Θ~)\widetilde{\Psi}\in\partial\|(\cdot)\|_{\text{1,off}}(\widetilde{\Theta}). For (k,l)Sc(k,l)\in S^{\texttt{c}}, we replace (Ψ~)kl(\widetilde{\Psi})_{kl} with

Ψ~kl=1λ[(f^(ωj))kl+(Θ~1)kl].\displaystyle\widetilde{\Psi}_{kl}=\frac{1}{\lambda}\left[-\left(\hat{f}(\omega_{j})\right)_{kl}+\left(\widetilde{\Theta}^{-1}\right)_{kl}\right].

The pair (\widetilde{\Theta},\widetilde{\Psi}) satisfies equation (B.3) with (\hat{\Theta},\hat{\Psi}) replaced by (\widetilde{\Theta},\widetilde{\Psi}). Finally we check the strict dual feasibility condition, i.e.

|Ψ~kl|<1 for all (k,l)Sc.|\widetilde{\Psi}_{kl}|<1\text{ for all }(k,l)\in S^{\texttt{c}}.

This step ensures that Ψ~\widetilde{\Psi} satisfies the necessary conditions to belong to the set of subdifferentials.

We define G=\widetilde{\Theta}-\Theta^{*}. Also, let the remainder term of the first-order Taylor expansion of (\Theta^{*}+G)^{-1} around \Theta^{*} be R(G)=(\Theta^{*}+G)^{-1}-{\Theta^{*}}^{-1}+{\Theta^{*}}^{-1}G{\Theta^{*}}^{-1}. We state the following result:

Claim B.2.

Suppose that

\max\left\{\left\|\hat{f}(\omega_{j})-f(\omega_{j})\right\|_{\max},\left\|R(G)\right\|_{\max}\right\}\leq\frac{\alpha}{8}\lambda. (B.5)

Then the matrix \widetilde{\Psi} constructed in the primal-dual witness construction satisfies \|\widetilde{\Psi}_{S^{c}}\|_{\max}<1. Hence strict dual feasibility holds and \widetilde{\Theta}=\hat{\Theta}.

Proof sketch. The proof of Claim B.2 follows from the proof of Lemma 4 in Ravikumar et al. [2011] and involves basic linear algebra with complex-valued matrices. The only changes in notation are that W=\hat{\Sigma}-\Sigma is replaced by \hat{f}(\omega_{j})-f(\omega_{j}) in our case, and G is the difference between the primal-dual witness and the true spectral precision matrix, instead of the difference \Delta between the true inverse covariance and its primal-dual witness (\Delta in the notation of Ravikumar et al. [2011]; in our paper \Delta is used to denote the threshold parameter). Lemma 4 in Ravikumar et al. [2011] shows that under the assumption \max\{\|W\|_{\max},\|R(\Delta)\|_{\max}\}\leq\alpha\lambda/8, the matrix of entry-wise signs for the primal-dual witness problem, i.e. \widetilde{\Psi}, satisfies the strict dual feasibility condition \|\widetilde{\Psi}_{S^{c}}\|_{\max}<1. The same strict dual feasibility condition can be concluded under condition (B.5) if we follow steps (51)-(55) in the proof of Lemma 4 (Section 4.2.1) in Ravikumar et al. [2011]. Under strict dual feasibility, by construction, \widetilde{\Theta}=\hat{\Theta} is ensured.\hfill\square

Next, we verify that the assumption (B.5) holds. As our choice of regularization parameter is \lambda=\frac{8}{\alpha}\Delta_{f}(m,n,p), under the event \mathcal{E} we have \|\hat{f}(\omega_{j})-f(\omega_{j})\|_{\max}\leq\frac{\alpha}{8}\lambda=\Delta_{f}(m,n,p). In order to show (B.5), it remains to show \|R(G)\|_{\max}\leq\frac{\alpha}{8}\lambda. For that we will require the following two claims.

Claim B.3.

Suppose that

r:=2\kappa_{\Theta^{*}}\left(\left\|\hat{f}(\omega_{j})-f(\omega_{j})\right\|_{\max}+\lambda\right)\leq\min\left\{\frac{1}{3\kappa_{f}d},\frac{1}{3\kappa_{f}^{3}\kappa_{\Theta^{*}}d}\right\}.

Then \|G\|_{\max}=\left\|\widetilde{\Theta}-\Theta^{*}\right\|_{\max}\leq r.

Proof sketch. The proof of Claim B.3 is similar to that of Lemma 6 in Ravikumar et al. [2011] and involves linear algebra of complex matrices. By Claim B.1, the restricted problem (B.4) has a unique solution, so the partial derivative with respect to the restricted elements must vanish at the unique optimum; that is, \widetilde{\Theta}_{S} satisfies

H1(ΘS):=[f^(ωj)]S[Θ1]S+λΨ~S=0.H_{1}(\Theta_{S}):=[\hat{f}(\omega_{j})]_{S}-[\Theta^{-1}]_{S}+\lambda\widetilde{\Psi}_{S}=0.

This restricted zero gradient condition is necessary and sufficient for the optimal solution, hence the solution is unique. Now we define the map H2:|S||S|H_{2}:\mathbb{R}^{|S|}\rightarrow\mathbb{R}^{|S|} as follows

H2(G¯S):=(ΥSS)1H¯1(ΘS+GS)+G¯S.H_{2}(\overline{G}_{S}):=-(\Upsilon_{SS}^{*})^{-1}\ \overline{H}_{1}(\Theta_{S}^{*}+G_{S})+\overline{G}_{S}.

By construction, H_{2}(\overline{G}_{S})=\overline{G}_{S} if and only if H_{1}(\Theta_{S}^{*}+G_{S})=H_{1}(\widetilde{\Theta}_{S})=0. Let \mathbb{B}(r) be the \ell_{\infty} ball of radius r in \mathbb{R}^{|S|}, i.e. \mathbb{B}(r)=\{\Theta_{S}:\|\Theta_{S}\|_{\infty}\leq r\}. Now, following the matrix-algebraic calculations in equations (72) to (74) of Ravikumar et al. [2011], it can be checked that H_{2}(\mathbb{B}(r))\subset\mathbb{B}(r). Since H_{2} is a continuous function on \mathbb{R}^{|S|} and \mathbb{B}(r) is convex and compact in \mathbb{R}^{|S|}, by Brouwer's fixed point theorem [Kellogg et al., 1976], the existence of a fixed point \overline{G}_{S}\in\mathbb{B}(r) of the function H_{2} is guaranteed. Also, uniqueness of the zero gradient condition (B) implies the uniqueness of the fixed points of H_{2}. Therefore, we can conclude that \|\widetilde{\Theta}_{S}-\Theta^{*}_{S}\|_{\infty}\leq r, and hence the claim is proved. \hfill\square

By our choice of the regularization constant

2\kappa_{\Theta^{*}}\left(\left\|\hat{f}(\omega_{j})-f(\omega_{j})\right\|_{\max}+\lambda\right)\leq 2\kappa_{\Theta^{*}}\left(1+\frac{8}{\alpha}\right)\Delta_{f}(m,n,p)=2\kappa_{\Theta^{*}}C_{\alpha}\Delta_{f}(m,n,p).

From the conditions of Theorem 4.1, Δf(m,n,p)d[6κf3κΘ2Cα2]1\Delta_{f}(m,n,p)d\leq[6\kappa_{f}^{3}\kappa_{\Theta^{*}}^{2}C_{\alpha}^{2}]^{-1}, and Cα>1C_{\alpha}>1. Therefore, the following two inequalities are satisfied

r2κΘCαΔf(m,n,p)[3κf3κΘd]1,r\leq 2\kappa_{\Theta^{*}}C_{\alpha}\Delta_{f}(m,n,p)\leq[3\kappa_{f}^{3}\kappa_{\Theta^{*}}d]^{-1}, (B.6)

and

2κΘCα2Δf(m,n,p)[3κf3κΘd]12\kappa_{\Theta^{*}}C_{\alpha}^{2}\Delta_{f}(m,n,p)\leq[3\kappa_{f}^{3}\kappa_{\Theta^{*}}d]^{-1} (B.7)

Now we turn to a result in a similar spirit to Lemma 5 in Ravikumar et al. [2011]. The result states the following:

Claim B.4.

\|G\|_{\max}\leq\frac{1}{3\kappa_{f}d} implies \|R(G)\|_{\max}\leq\frac{3}{2}d\|G\|_{\max}^{2}\kappa_{f}^{3}.

The proof follows from Lemma 5 in Ravikumar et al. [2011]. It involves a step where we need to expand the complex matrix (\Theta^{*}+G)^{-1} in a power series as

(\Theta^{*}+G)^{-1}=\sum_{k=0}^{\infty}(-1)^{k}({\Theta^{*}}^{-1}G)^{k}{\Theta^{*}}^{-1}.

The power series expansion is valid because \|G\|_{\max}\leq 1/(3\kappa_{f}d) implies \|{\Theta^{*}}^{-1}G\|_{\max}<1 (which ensures the radius of convergence condition; see equation (68) in Appendix B of Ravikumar et al. [2011]), and also follows from results in holomorphic functional calculus on the general power series expansion of an operator on a Banach space [Dudley et al., 2011; Haase, 2018]. The rest of the proof mirrors that of Lemma 5 in Appendix B of Ravikumar et al. [2011].
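As an aside, the second-order behaviour of the remainder R(G) asserted in Claim B.4 is easy to verify numerically. The sketch below assumes NumPy and uses randomly generated Hermitian matrices purely for illustration; it is not tied to the paper's simulation settings.

```python
import numpy as np

rng = np.random.default_rng(3)
p = 6
A = rng.normal(size=(p, p)) + 1j * rng.normal(size=(p, p))
Theta_star = A @ A.conj().T + p * np.eye(p)           # Hermitian positive definite Theta*
G0 = rng.normal(size=(p, p)) + 1j * rng.normal(size=(p, p))
G0 = (G0 + G0.conj().T) / 20                          # small Hermitian perturbation

Theta_inv = np.linalg.inv(Theta_star)
for t in [1.0, 0.1, 0.01]:
    G = t * G0
    R = np.linalg.inv(Theta_star + G) - Theta_inv + Theta_inv @ G @ Theta_inv
    print(t, np.max(np.abs(R)) / t ** 2)              # ratio stays roughly constant as t -> 0
```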

Combining this result with (B.6) we get

\|R(G)\|_{\max}\leq\frac{3}{2}d\|G\|_{\max}^{2}\kappa_{f}^{3} \leq\frac{3}{2}d\left(2\kappa_{\Theta^{*}}C_{\alpha}\Delta_{f}(m,n,p)\right)^{2}\kappa_{f}^{3}
=6\kappa_{f}^{3}\kappa_{\Theta^{*}}^{2}dC_{\alpha}^{2}\left[\Delta_{f}(m,n,p)\right]^{2}
=\left(6\kappa_{f}^{3}\kappa_{\Theta^{*}}^{2}dC_{\alpha}^{2}\Delta_{f}(m,n,p)\right)\frac{\alpha\lambda}{8}
\leq\frac{\alpha\lambda}{8},

where the last inequality follows from (B.7). Therefore, we have shown that condition (B.5) holds under \mathcal{E}, and strict dual feasibility allows us to conclude that \hat{\Theta}=\widetilde{\Theta}. Hence, \hat{\Theta} satisfies the bound (B.6) and \hat{\Theta}_{S^{c}}=\widetilde{\Theta}_{S^{c}}=0, so E(\hat{\Theta})=E(\widetilde{\Theta}). Also, for (k,l)\in S such that \left|\Theta^{*}_{kl}\right|>2C_{\alpha}\kappa_{\Theta^{*}}\Delta_{f}(m,n,p),

|Θ^kl|\displaystyle\left|\hat{\Theta}_{kl}\right| =|Θ^klΘkl+Θkl||Θkl||Θ^klΘkl|>0\displaystyle=\left|\hat{\Theta}_{kl}-\Theta^{*}_{kl}+\Theta^{*}_{kl}\right|\geq\left|\Theta^{*}_{kl}\right|-\left|\hat{\Theta}_{kl}-\Theta^{*}_{kl}\right|>0

Therefore, E(\hat{\Theta}) contains all edges (k,l) with \left|\Theta^{*}_{kl}\right|>2C_{\alpha}\kappa_{\Theta^{*}}\Delta_{f}(m,n,p), and hence (b) is proved.

Now we use the idea of Corollary 3 of Ravikumar et al. [2011] and write

Θ^ΘF\displaystyle\|\hat{\Theta}-\Theta^{*}\|_{F} [k=1p|Θ^kkΘkk|2+(k,l)E(Θ)|Θ^klΘkl|2]1/2\displaystyle\leq\left[\sum_{k=1}^{p}\left|\hat{\Theta}_{kk}-\Theta^{*}_{kk}\right|^{2}+\sum_{(k,l)\in E(\Theta^{*})}\left|\hat{\Theta}_{kl}-\Theta^{*}_{kl}\right|^{2}\right]^{1/2}
2κΘCαΔf(m,n,p)×s+p\displaystyle\leq 2\kappa_{\Theta^{*}}C_{\alpha}\Delta_{f}(m,n,p)\times\sqrt{s+p}

And

Θ^Θ2Θ^Θd×2κΘCαΔf(m,n,p)\|\hat{\Theta}-\Theta^{*}\|_{2}\leq\|\hat{\Theta}-\Theta^{*}\|_{\infty}\leq d\times 2\kappa_{\Theta^{*}}C_{\alpha}\Delta_{f}(m,n,p)

where we use the definition of \ell_{\infty} norm and the fact that Θ^\hat{\Theta} and Θ\Theta^{*} both have dd non-zeros in each row. Now we use the fact that Frobenius norm dominates \ell_{\infty} norm and (c) follows. \hfill\square

Appendix C Figures and tables

Figure 7: Plot of the diagonals of the spectral density across frequencies for the VAR(1) model of Section 5 with p=20. The k-th line represents [f(\omega)]_{kk} across \omega\in[-\pi,\pi].
Figure 8: Plot of the diagonals of the spectral density across frequencies for the VARMA(2, 2) model of Section 5 with p=20. The k-th line represents [f(\omega)]_{kk} across \omega\in[-\pi,\pi]. Here 1\leq k\leq 5, since for p=10, f(\omega) has two 5\times 5 blocks.
RMSE
Family Inverse periodogram CGLASSO (oracle) CGLASSO (BIC)
White Noise
p=10p=10
    n=200n=200 123.69 (66.17) 11.59 (2.08) 17.34 (2.72)
    n=400n=400 50.68 (17.23) 8.11 (1.09) 15.01 (1.49)
p=20p=20
    n=200n=200 200.23 (783.49) 14.06 (1.52) 21.76 (1.71)
    n=400n=400 394.05 (123.59) 10.88 (1.43) 18.69 (1.53)
p=50p=50
    n=200n=200 - 18.16 (1.34) 31.07 (0.77)
    n=400n=400 - 14.21 (1.07) 25.44 (0.88)
VAR(1)
p=10p=10
    n=200n=200 205.82 (137.47) 17.50 (-) 17.50 (-)
    n=400n=400 121.15 (34.07) 17.99 (1.45) 18.06 (1.25)
p=20p=20
    n=200n=200 1831.27 (666.7) 20.67 (-) 20.67 (-)
    n=400n=400 489.32 (140.18) 19.47 (0.81) 19.47 (0.81)
p=50p=50
    n=200n=200 - 20.59 (-) 20.59 (-)
    n=400n=400 - 19.52 (-) 19.52 (-)
VARMA(2, 2)
p=10p=10
    n=200n=200 117.13 (42.06) 13.72 (2.34) 15.00 (2.32)
    n=400n=400 62.96 (14.67) 11.20 (2.03) 11.48 (2.03)
p=20p=20
    n=200n=200 2359.21 (1101.53) 15.32 (1.73) 20.35 (1.69)
    n=400n=400 446.45 (127.16) 12.83 (2.01) 14.80 (1.67)
p=50p=50
    n=200n=200 - 16.77 (0.90) 28.98 (1.28)
    n=400n=400 - 13.60 (0.86) 18.67 (0.73)
Table 3: RMSE of the inverted periodogram and the complex graphical Lasso estimator at \omega_{j}=\pi/2. The CGLASSO (oracle) column reports the minimum RMSE (in %) along the regularization path, and the CGLASSO (BIC) column reports the RMSE (in %) of the estimator with the minimum BIC along the regularization path. Results are averaged over 20 independent replicates. Standard deviations (in %) are reported in parentheses.
RMSE
Family Inverse periodogram CGLASSO (oracle) CGLASSO (BIC)
White Noise
p=10p=10
    n=200n=200 148.83 (79.25) 10.31 (2.15) 12.83 (2.21)
    n=400n=400 80.58 (52.42) 8.75 (2.38) 11.21 (2.73)
p=20p=20
    n=200n=200 3620.6 (4038.01) 12.49 (2.07) 18.69 (2.31)
    n=400n=400 441.79 (222.01) 9.74 (1.55) 14.96 (1.63)
p=50p=50
    n=200n=200 - 14.96 (1.44) 28.44 (1.16)
    n=400n=400 - 11.94 (0.93) 23.47 (1.18)
VAR(1)
p=10p=10
    n=200n=200 158.18 (119.3) 9.49 (-) 15.68 (1.65)
    n=400n=400 73.09 (37.48) 8.69 (-) 14.17 (2.40)
p=20p=20
    n=200n=200 4439.06 (3856.06) 13.55 (-) 16.61 (2.15)
    n=400n=400 607.81 (324.17) 12.65 (1.14) 14.77 (1.84)
p=50p=50
    n=200n=200 - 16.44 (0.48) 17.30 (1.47)
    n=400n=400 - 14.74 (-) 15.15 (0.52)
VARMA(2, 2)
p=10p=10
    n=200n=200 162.61 (131.66) 9.64 (0.84) 15.11 (2.36)
    n=400n=400 67.55 (23.58) 7.62 (0.87) 9.21 (2.16)
p=20p=20
    n=200n=200 3174.67 (2912.83) 12.04 (1.39) 21.01 (2.44)
    n=400n=400 470.83 (270.81) 9.11 (-) 17.98 (1.56)
p=50p=50
    n=200n=200 - 14.10 (0.83) 19.85 (1.28)
    n=400n=400 - 11.16 (0.95) 17.52 (0.93)
Table 4: RMSE of the inverted periodogram and the complex graphical Lasso estimator at \omega_{j}=\pi. The CGLASSO (oracle) column reports the minimum RMSE (in %) along the regularization path, and the CGLASSO (BIC) column reports the RMSE (in %) of the estimator with the minimum BIC along the regularization path. Results are averaged over 20 independent replicates. Standard deviations (in %) are reported in parentheses.
Family   AUROC (NWR)   AUROC (CGLASSO)   Precision (CGLASSO)   Recall (CGLASSO)   Accuracy (CGLASSO)
White Noise
p=10p=10
    n=200n=200 97.27 (2.79) 99.61 (0.69) 53.45 (7.93) 100.00 (0.00) 81.78 (5.79)
    n=400n=400 99.68 (0.57) 99.91 (0.28) 59.37 (9.45) 100.00 (0.00) 85.56 (5.07)
p=20p=20
    n=200n=200 96.50 (2.11) 98.82 (1.64) 53.04 (5.89) 98.42 (3.01) 90.92 (2.08)
    n=400n=400 97.99 (2.18) 99.78 (0.63) 51.97 (5.39) 99.74 (1.18) 90.55 (2.02)
p=50p=50
    n=200n=200 96.52 (1.55) 94.81 (2.40) 68.60 (6.89) 90.00 (4.72) 97.90 (0.54)
    n=400n=400 98.18 (1.09) 99.48 (0.53) 62.79 (4.86) 99.08 (1.04) 97.58 (0.49)
VAR(1)
p=10p=10
    n=200n=200 74.42 (5.93) 50.00 (-) - 0.00 (0.00) 62.22 (0.00)
    n=400n=400 80.81 (7.04) 50.25 (0.85) 100.00 (-) 0.49 (1.70) 62.41 (0.64)
p=20p=20
    n=200n=200 72.76 (4.97) 52.70 (-) 100.00 (-) 5.41 (-) 81.58 (-)
    n=400n=400 79.42 (3.51) 50.75 (0.71) 100.00 (0.00) 1.50 (1.42) 80.82 (0.28)
p=50p=50
    n=200n=200 68.48 (3.33) 49.96 (-) 0.00 (-) 0.00 (-) 92.00 (-)
    n=400n=400 75.54 (2.70) 50.00 (-) - 0.00 (-) 92.08 (-)
VARMA(2, 2)
p=10p=10
    n=200n=200 76.27 (7.25) 90.20 (6.02) 75.97 (10.69) 88.25 (6.34) 81.56 (8.65)
    n=400n=400 79.62 (4.27) 93.83 (3.69) 76.00 (6.82) 92.25 (6.17) 83.22 (5.17)
p=20p=20
    n=200n=200 73.37 (4.33) 85.27 (4.40) 88.98 (5.73) 71.62 (8.67) 92.11 (2.07)
    n=400n=400 77.96 (2.28) 94.36 (1.65) 77.53 (6.15) 90.62 (3.23) 92.34 (1.87)
p=50p=50
    n=200n=200 77.58 (1.75) 50.00 (0.00) - 0.00 (0.00) 91.84 (0.00)
    n=400n=400 79.88 (2.10) 88.29 (2.72) 88.74 (3.17) 76.90 (5.42) 97.31 (0.43)
Table 5: AUROC for nodewise regression (NWR) and CGLASSO, and precision, recall, and accuracy (each in %) of the complex graphical Lasso for \omega_{j}=\pi/2. Results are averaged over 20 replicates. Standard deviations (in %) are reported in parentheses.
Family   AUROC (NWR)   AUROC (CGLASSO)   Precision (CGLASSO)   Recall (CGLASSO)   Accuracy (CGLASSO)
White Noise
p=10p=10
    n=200n=200 93.46 (4.26) 96.71 (3.77) 46.73 (6.85) 97.22 (6.11) 76.56 (5.61)
    n=400n=400 95.22 (3.86) 97.93 (2.55) 48.18 (6.34) 98.33 (4.07) 77.78 (5.90)
p=20p=20
    n=200n=200 88.63 (5.41) 96.01 (3.20) 39.19 (5.21) 95.00 (5.53) 84.37 (3.26)
    n=400n=400 95.36 (3.13) 98.48 (1.76) 42.91 (5.01) 98.42 (3.01) 86.50 (2.66)
p=50p=50
    n=200n=200 88.68 (3.63) 91.48 (2.23) 42.58 (5.63) 84.69 (4.32) 94.68 (1.13)
    n=400n=400 92.98 (2.95) 96.43 (1.63) 42.77 (4.51) 93.78 (3.14) 94.63 (1.02)
VAR(1)
p=10p=10
    n=200n=200 60.63 (8.53) 53.00 (3.05) 84.38 (22.90) 6.95 (6.35) 64.24 (2.32)
    n=400n=400 64.01 (7.95) 50.93 (1.71) 100.00 (0.00) 1.86 (3.43) 62.92 (1.29)
p=20p=20
    n=200n=200 58.52 (5.95) 50.32 (0.78) 58.33 (49.16) 0.75 (1.55) 80.58 (0.36)
    n=400n=400 60.13 (4.20) 50.92 (1.25) 81.25 (37.20) 1.93 (2.47) 80.83 (0.53)
p=50p=50
    n=200n=200 55.86 (2.28) 50.00 (0.00) - 0.00 (0.00) 92.08 (0.00)
    n=400n=400 58.41 (2.37) 50.63 (0.61) 42.47 (2.52) 1.44 (1.21) 92.03 (0.17)
VARMA(2, 2)
p=10p=10
    n=200n=200 65.19 (6.36) 71.59 (8.46) 90.88 (7.09) 46.05 (16.63) 74.04 (7.56)
    n=400n=400 68.87 (6.78) 81.12 (6.83) 74.44 (8.42) 75.25 (11.75) 77.22 (7.42)
p=20p=20
    n=200n=200 63.09 (5.02) 50.00 (0.00) - 0.00 (0.00) 78.95 (0.00)
    n=400n=400 67.26 (4.28) 50.00 (0.00) - 0.00 (0.00) 78.95 (0.00)
p=50p=50
    n=200n=200 65.10 (2.84) 50.00 (0.00) - 0.00 (0.00) 91.84 (0.00)
    n=400n=400 66.88 (2.56) 50.00 (0.00) - 0.00 (0.00) 91.84 (0.00)
Table 6: AUROC for nodewise regression(NWR) and CGLASSO, precision, recall, accuracy and F1-score (each in %) of complex graphical lasso for ωj=π\omega_{j}=\pi. Results are averaged over 20 replicates. Standard deviations (in %) are also reported in parentheses.