
Regularized Estimation of Sparse Spectral Precision Matrices

Navonil Deb (Department of Statistics and Data Science, Cornell University; email: [email protected]), Amy Kuceyeski (Department of Radiology, Weill Cornell Medicine; email: [email protected]), Sumanta Basu (corresponding author; Department of Statistics and Data Science, Cornell University; email: [email protected])
Abstract

The spectral precision matrix, the inverse of a spectral density matrix, is an object of central interest in frequency-domain analysis of multivariate time series. Estimation of the spectral precision matrix is a key step in calculating partial coherency and in graphical model selection of stationary time series. When the dimension of a multivariate time series is moderate to large, traditional estimators of spectral density matrices such as averaged periodograms tend to be severely ill-conditioned, and one needs to resort to suitable regularization strategies involving optimization over complex variables. Existing $\ell_{1}$-regularization approaches in this direction either separately penalize real and imaginary parts of a complex number, thereby artificially increasing dimensionality, or use off-the-shelf optimization routines for complex variables that do not explicitly leverage the underlying sparsity structure of the problem, leading to scalability issues.

In this work, we propose complex graphical Lasso (CGLASSO), an $\ell_{1}$-penalized estimator of the spectral precision matrix based on local Whittle likelihood maximization. We develop fast pathwise coordinate descent algorithms for implementing CGLASSO on large dimensional time series data sets. At its core, our algorithmic development relies on a ring isomorphism between complex and real matrices that helps map a number of optimization problems over complex variables to similar optimization problems over real variables. This finding may be of independent interest and more broadly applicable for high-dimensional statistical analysis with complex-valued data. We also present a complete non-asymptotic theory for our proposed estimator which shows that consistent estimation is possible in the high-dimensional regime as long as the underlying spectral precision matrix is suitably sparse. We compare the performance of CGLASSO with competing alternatives on simulated data sets, and use it to construct a partial coherence network among brain regions from a real fMRI data set.

1 Introduction

Estimation and structure learning of a spectral density matrix and its inverse, commonly known as the spectral precision matrix, are canonical problems in the multivariate time series and signal processing literature, with applications to several important statistical tasks. For example, these objects are used for measuring frequency-domain analogues of correlation and partial correlation, viz. coherency and partial coherency. They are also used in graphical modeling of stationary time series, i.e. discovering conditional independence relationships among the components of a multivariate time series. This task amounts to learning the sparsity patterns in the off-diagonal elements of spectral precision matrices at different frequencies [Dahlhaus, 2000; Davis et al., 2016]. Frequency-domain measures such as spectral density, spectral precision, and more generally graphical modeling of stationary time series (StGM), have proven to be useful in several scientific areas including neuroscience, economics and environmental sciences [Granger, 1969; Bowyer, 2016; Fiecas et al., 2019; Baek et al., 2021b].

When the dimension of the multivariate time series is moderate or large, or the number of measured time points is small, the classical approach of estimating a spectral density matrix with smoothed periodograms and subsequently inverting it is not ideal, as shown in Böhm and von Sachs [2009]. This observation has motivated several approaches for shrinkage or sparsity-regularized estimation of spectral precision matrix for moderate to large-scale systems.

One approach is to first shrink a spectral density matrix and invert it, or shrink the inverted spectral density matrix [Fiecas and von Sachs, 2014; Fiecas et al., 2010; Schneider-Luftman and Walden, 2016]. These methods, however, do not ensure a sparse estimator of the spectral precision matrix. The other approach is to maximize local Whittle likelihood subject to sparsity constraints or penalties [Fiecas et al., 2019; Dallakyan et al., 2022; Baek et al., 2021a; Jung et al., 2015; Tugnait, 2022; Krampe and Paparoditis, 2022; Wodeyar and Srinivasan, 2022]. These methods rely on the fact that for stationary processes, discrete Fourier transforms (DFTs) at neighboring Fourier frequencies behave asymptotically as independent and nearly identically distributed “complex-valued data” in the frequency-domain. So algorithms can be designed following standard approaches in Gaussian graphical models (GGM) for real-valued i.i.d. data.

Some important algorithmic and statistical questions remain open in this emerging field of sparse spectral precision matrix estimation. On the algorithmic front, a key challenge is the lack of fast pathwise algorithms for penalized Whittle likelihood maximization that can take advantage of the underlying sparsity structure and efficiently compute penalized estimators on a grid of tuning parameters such as the penalty or the smoothing span. In GGM with real-valued data, pathwise coordinate descent [Friedman et al., 2007] provides a state-of-the-art solution for this task. In contrast, existing algorithms for complex-valued data either use off-the-shelf convex solvers such as the alternating direction method of multipliers (ADMM) [Jung et al., 2015; Dallakyan et al., 2022; Tugnait, 2022] or artificially increase the dimensionality of the problem by optimizing separately over real and imaginary parts of a complex number [Fiecas et al., 2019; Krampe and Paparoditis, 2022]. On the theoretical front, to the best of our knowledge there is no systematic asymptotic analysis for the penalized local Whittle likelihood maximizer that provides insight into the interplay of sparsity, high-dimensionality and the temporal and cross-sectional dependence of the underlying time series at any Fourier frequency. The closest works on this front are Baek et al. [2021a], which focuses on long memory processes but only at frequency 0, and Weylandt Jr [2020], which focuses on i.i.d. complex-valued data.

In this paper, we make two contributions to this literature. First, we provide fast pathwise coordinate descent algorithms for complex-variable Lasso (CLASSO) and graphical Lasso (CGLASSO). These algorithms are capable of taking advantage of the underlying sparsity structure of the problem in the same manner as their real counterparts. The core insight behind our algorithms is to extend a ring isomorphism between complex numbers and $2\times 2$ orthogonal matrices to the spaces of complex vectors and matrices. This connection helps recast a number of statistical optimization problems over complex variables (e.g. ordinary least squares and Gaussian likelihood maximization) as well-understood optimization problems over real variables. In particular, we can express a generic complex Lasso regression as a group Lasso over real variables with orthogonal group structures. This allows us to build a block coordinate descent algorithm with closed form updates, which is significantly faster than a generic group Lasso implementation such as grplasso [Meier, 2020]. An upshot of this connection is that we can rely on the convergence results of block coordinate descent over real variables [Tseng, 2001]. Finally, we use this connection to build fast, pathwise algorithms for both CLASSO and CGLASSO.

Second, we provide a systematic non-asymptotic analysis to study the estimation error of our method in both low- and high-dimensional regimes. This requires a careful combination of tools from Sun et al. [2018], Ravikumar et al. [2011], and complex analysis to deal with complex random variables. Our results show that under suitable sparsity and irrepresentability assumptions akin to what is commonly used in high-dimensional statistics, our method can consistently estimate the true spectral precision matrix.

We conduct extensive numerical experiments on simulated data to assess a number of finite sample properties of CGLASSO. First, our analysis confirms that for spectral precision estimation, jointly penalizing real and imaginary parts of a complex variable is beneficial over penalizing them separately as in Fiecas et al. [2019], especially in high dimensions. This matches a similar observation for complex Lasso regression made in Maleki et al. [2013]. Second, we confirm that, as expected, CGLASSO outperforms unregularized estimators of the spectral precision matrix in terms of estimation accuracy. Third, we find that CGLASSO, which works with the entire likelihood information, offers better model selection than node-wise regression strategies along the lines of Meinshausen and Bühlmann [2006]. Finally, we demonstrate the use of CGLASSO for a partial coherence estimation task on a real fMRI data set obtained from the Human Connectome Project (HCP).

The rest of the paper is organized as follows. In Section 2, we provide some background on the problem, review relevant works in the literature and propose our estimator. In Section 3, we discuss our new optimization algorithms for complex Lasso and graphical Lasso. Section 4 contains a complete non-asymptotic analysis of our method for Gaussian time series and linear processes. In Sections 5 and 6, we report the performance of our method on simulated and real fMRI data.

Notation. Throughout the paper, $\mathbb{Z},\mathbb{R}$ and $\mathbb{C}$ denote the sets of integers, real numbers and complex numbers respectively. $|c|$ denotes the absolute value of a real number $c$, or the modulus of a complex number $c$. $\mathrm{i}$ denotes the imaginary number $\sqrt{-1}$. For any complex number $z$, $\textbf{Re}(z)$ and $\textbf{Im}(z)$ denote the real and imaginary parts of $z$ respectively, and $z^{\dagger}$ denotes its complex conjugate. For any positive integer $p$, $\mathbf{e}^{(p)}$ denotes the $p$-dimensional basis vector with $1$ in the $p$th position and $0$ elsewhere. For a complex vector $\mathbf{z}=(z_{1},\ldots,z_{p})\in\mathbb{C}^{p}$, $\textbf{Re}(\mathbf{z})$, $\textbf{Im}(\mathbf{z})$ and $\mathbf{z}^{\dagger}$ denote its real part, imaginary part and conjugate transpose respectively. $\|\mathbf{z}\|$ denotes the $\ell_{2}$-norm of a vector $\mathbf{z}\in\mathbb{C}^{p}$, i.e. $\|\mathbf{z}\|=\sqrt{\sum_{i}|z_{i}|^{2}}$. For any $q>0$, $\|\mathbf{z}\|_{q}$ denotes the $\ell_{q}$ norm of $\mathbf{z}$, i.e. $\left(\sum_{i}|z_{i}|^{q}\right)^{1/q}$. For a matrix $A$, $A^{\dagger}$ denotes its conjugate transpose, $\overline{A}$ denotes its vector representation and $A_{\cdot j}$ denotes its $j$th column. For two matrices $A$ and $B$, the Frobenius inner product between $A$ and $B$, denoted by $\langle A,B\rangle$, is defined as $\text{tr}(A^{\dagger}B)$. If $A$ is a square matrix, $\text{tr}(A)$ denotes its trace, i.e. the sum of its eigenvalues. In addition, $\|A\|_{\infty}$, $\|A\|$, $\|A\|_{F}$ and $\|A\|_{\max}$ denote the maximum modulus row sum norm, i.e. $\max_{r}\sum_{s}|A_{rs}|$, the spectral norm, i.e. the square root of the largest eigenvalue of $A^{\dagger}A$, the Frobenius norm $\sqrt{\langle A,A\rangle}$, and the elementwise maximum modulus $\max_{r,s}|A_{rs}|$, respectively. The class $\mathcal{S}_{+}^{p}$ denotes the set of all $p\times p$ symmetric non-negative definite real matrices, i.e. $\mathcal{S}_{+}^{p}=\{M\in\mathbb{R}^{p\times p}:v^{\dagger}Mv\geq 0\ \forall v\in\mathbb{R}^{p}\}$, and $\mathcal{S}_{++}^{p}$ denotes the set of all $p\times p$ symmetric positive definite matrices, i.e. $\mathcal{S}_{++}^{p}=\{M\in\mathbb{R}^{p\times p}:v^{\dagger}Mv>0\ \forall v\in\mathbb{R}^{p},v\neq 0\}$. Similarly, $\mathcal{H}_{+}^{p}$ and $\mathcal{H}_{++}^{p}$ denote the sets of $p\times p$ Hermitian non-negative definite and positive definite complex matrices respectively. For asymptotics, we use the following standard notation: for two sequences $(f_{n})_{n\in\mathbb{N}}$ and $(g_{n})_{n\in\mathbb{N}}$, we write $f_{n}=\mathcal{O}(g_{n})$ if $|f_{n}|\leq C|g_{n}|$ for some constant $C>0$ independent of $n$, for all $n>C$. Similarly, $f_{n}=\Omega(g_{n})$ if $|f_{n}|\geq C^{\prime}|g_{n}|$ for some constant $C^{\prime}>0$ independent of $n$, for all $n>C^{\prime}$. We also write $f_{n}\asymp g_{n}$ if both $f_{n}=\mathcal{O}(g_{n})$ and $f_{n}=\Omega(g_{n})$ hold. We write $A\succsim B$ if there is a universal constant $c$, independent of the model dimension and model parameters, such that $A\geq cB$, and $A\precsim B$ is defined analogously. We also use $c,c^{\prime}>0$ to denote universal constants whose values are allowed to change from line to line within a proof.

2 Background and methods

We consider a $p$-dimensional weakly stationary real-valued time series $X_{t}=(X_{t1},\cdots,X_{tp})^{\top}$, $t\in\mathbb{Z}$. We also assume $\mathbb{E}(X_{t})=0$ for all $t\in\mathbb{Z}$ for ease of exposition. Weak stationarity implies that the autocovariance function $\Gamma(\ell)=\text{Cov}(X_{t},X_{t-\ell})=\mathbb{E}(X_{t}X_{t-\ell}^{\top})$ depends only on the lag $\ell$. The spectral density at a frequency $\omega\in[-\pi,\pi]$ is defined as the Fourier transform of the autocovariance functions

f(\omega)=\frac{1}{2\pi}\sum_{\ell=-\infty}^{\infty}\Gamma(\ell)e^{-\mathrm{i}\ell\omega}.

We use $\mathcal{X}=[X_{1}:\cdots:X_{n}]^{\top}$ to denote the data matrix that contains $n$ consecutive observations from the time series $\{X_{t}\}_{t\in\mathbb{Z}}$ in its rows. Our goal is to estimate the spectral precision matrix $\Theta(\omega):=f^{-1}(\omega)$ at a given frequency $\omega$. It is customary to focus only on the Fourier frequencies $\omega=2\pi k/n$, $k\in F_{n}=\{-[-\frac{n-1}{2}],\ldots,[\frac{n}{2}]\}$, where $[x]$ denotes the integer part of $x$.

The autocovariance can be expressed in terms of $f$ as

\Gamma(\ell)=\int_{-\pi}^{\pi}e^{\mathrm{i}\ell\omega}f(\omega)\ \text{d}\omega.

Some of the properties of the stationary process $\{X_{t}\}$ can therefore be described equivalently in terms of $f(\cdot)$ rather than $\Gamma(\cdot)$ [Brockwell and Davis, 1991; Shumway and Stoffer, 2000; Ombao and Pinto, 2022]. Similarly, $\{X_{t}\}$, as a process, has a spectral representation, known as Cramér's representation, given by

X_{t}=\int_{-\pi}^{\pi}e^{\mathrm{i}t\omega}\ \text{d}Z(\omega),   (2.1)

where $\{Z(\omega):-\pi\leq\omega\leq\pi\}$ is an orthogonal increment process satisfying the following properties:

\mathbb{E}[Z(\omega)]=0\mbox{ for all }\omega\in[-\pi,\pi]   (2.2)
\operatorname*{Cov}(Z_{j}(\omega),Z_{k}(\omega^{\prime}))=\mathbb{E}[Z_{j}(\omega)\overline{Z_{k}(\omega^{\prime})}]=\begin{cases}0&\text{if }\omega\neq\omega^{\prime},\\ f_{jk}(\omega)&\text{otherwise.}\end{cases}   (2.3)

This gives a way to decompose $X_{t}$ into a sum of processes at different frequencies. The integral in Cramér's representation of $X_{t}$ is a sum across all frequencies, and $\text{d}Z(\omega)$ is the random weight for each frequency $\omega$. Such a decomposition is valuable because it means that we can sensibly talk about movements in $X_{t}$ at a given frequency without worrying about its movements at other frequencies. In this sense, the object $f^{-1}(\omega)$ can be naturally viewed as capturing the frequency-specific conditional independence structure among the component time series. In the next subsection, we elaborate on this and give an illustrative example.

2.1 Frequency-specific conditional independence and $f^{-1}(\omega)$

As we can see from (2.2) and (2.3), at each frequency $\omega$, the correlation structure among the movements in different components is captured by $f(\omega)$. Similarly, the partial correlation structure among those component-wise random movements can be read off from $f^{-1}(\omega)$. The connection between precision matrices and partial correlations is reminiscent of Gaussian graphical models, and this context will be stated more formally in Section 2.2.

This decomposition into different frequencies is useful in several fields including economics, neuroscience and signal processing. For example, in human brain connectivity studies scientists often measure neurophysiological signals in the form of electroencephalogram (EEG) or functional magnetic resonance imaging (fMRI) for different brain regions. The resulting ‘brain-rhythm’ is viewed as a superposition of signals representing different physiological processes such as cardiovascular and respiratory activities. These different processes are captured by signals in different frequency intervals, e.g. the delta band (0.5-4.0 Hz), theta band (4.0-8.0 Hz), alpha band (8.0-12.0 Hz), beta band (12.0-30.0 Hz), and gamma band (30.0-50.0 Hz), obtained by frequency band pass filters [Ombao and Pinto, 2022]. The spectral density matrix $f(\omega)$ in this context captures the correlation structure among different brain regions in terms of the physiological processes represented by a specific frequency $\omega$. Similarly, the partial correlation structure (the conditional dependence graph) among those regions can be extracted using the spectral precision matrix $f^{-1}(\omega)$ if we are interested only in the movements at frequency $\omega$. Below we illustrate this with a small example from Shumway and Stoffer [2000].

Example 1. We consider two time series $X_{t}$ and $Y_{t}$, each having three frequency components, and generate $N=200$ observations as follows

X_{t}^{(j)}=Z_{1}(\omega_{j})\cos\omega_{j}t, \qquad Y_{t}^{(j)}=Z_{2}(\omega_{j})\cos\omega_{j}t, \qquad j=1,2,3,
X_{t}=X_{t}^{(1)}+X_{t}^{(2)}+X_{t}^{(3)}, \qquad Y_{t}=Y_{t}^{(1)}+Y_{t}^{(2)}+Y_{t}^{(3)},

where $\omega_{1}=2\pi\times 5/N,\ \omega_{2}=2\pi\times 10/N,\ \omega_{3}=2\pi\times 40/N$. We generate the amplitude processes from the following distributions

\begin{bmatrix}Z_{1}(\omega_{1})\\ Z_{2}(\omega_{1})\end{bmatrix}\sim\mathcal{N}_{2}\left(\begin{bmatrix}0\\ 0\end{bmatrix},\begin{bmatrix}1&0.5\\ 0.5&1\end{bmatrix}\right),\quad\begin{bmatrix}Z_{1}(\omega_{2})\\ Z_{2}(\omega_{2})\end{bmatrix}\sim\mathcal{N}_{2}\left(\begin{bmatrix}0\\ 0\end{bmatrix},\begin{bmatrix}1&-0.5\\ -0.5&1\end{bmatrix}\right),
\begin{bmatrix}Z_{1}(\omega_{3})\\ Z_{2}(\omega_{3})\end{bmatrix}\sim\mathcal{N}_{2}\left(\begin{bmatrix}0\\ 0\end{bmatrix},\begin{bmatrix}2&0.1\\ 0.1&2\end{bmatrix}\right).

The resulting $X_{t}$ and $Y_{t}$ behave as shown in Figure 1. In this example, the $\cos\omega_{j}t$'s are the orthogonal basis functions for the signals $X_{t}$ and $Y_{t}$, and $Z_{1}(\omega_{j})$ and $Z_{2}(\omega_{j})$ are the corresponding random basis coefficients; the pair of basis coefficients is drawn independently for each $\omega_{j}$. We may think of this as an example with two brain regions that generate signals as a superposition of three distinct physiological processes, and hence three distinct frequencies.

In this illustration, if the spectral density of the two-dimensional process $W_{t}:=(X_{t},Y_{t})^{\prime}$ is $f_{W}$, Cramér's representation as in (2.1) implies $[f_{W}(\omega_{1})]_{11}=\operatorname*{Var}(Z_{1}(\omega_{1}))=1$. In general, for $k=1,2$, $[f_{W}(\omega_{j})]_{kk}=\operatorname*{Var}(Z_{k}(\omega_{j}))$ for $j=1,2$ and $3$. For the cross-spectral term, $[f_{W}(\omega_{1})]_{12}=\operatorname*{Cov}(Z_{1}(\omega_{1}),Z_{2}(\omega_{1}))=0.5$, and similarly $[f_{W}(\omega_{2})]_{12}=-0.5$. Therefore, we see that $f_{W}$ fully captures the correlation structure of the same-frequency orthogonal components of the joint process $W_{t}$. Similarly, we can compute the spectral precision $f^{-1}_{W}$, which captures the partial correlations. Hence, estimation and inference on the spectral density, and similarly the spectral precision, pave the way for understanding the correlation and the partial correlation graph, respectively, among the same-frequency components extracted from the time series observations.
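To make the construction concrete, the following sketch reproduces the simulation in Example 1 (the random seed and variable names are illustrative and not from the paper):

```python
# Sketch of the Example 1 simulation: two series built from three random-amplitude
# cosine components at frequencies 2*pi*5/N, 2*pi*10/N and 2*pi*40/N.
import numpy as np

rng = np.random.default_rng(2024)
N = 200
t = np.arange(1, N + 1)
freqs = 2 * np.pi * np.array([5, 10, 40]) / N
covs = [np.array([[1.0, 0.5], [0.5, 1.0]]),
        np.array([[1.0, -0.5], [-0.5, 1.0]]),
        np.array([[2.0, 0.1], [0.1, 2.0]])]

X = np.zeros(N)
Y = np.zeros(N)
for w, C in zip(freqs, covs):
    Z1, Z2 = rng.multivariate_normal(np.zeros(2), C)   # random amplitudes at frequency w
    X += Z1 * np.cos(w * t)
    Y += Z2 * np.cos(w * t)
```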

Figure 1: Time series as sum of periodic components as described in Example 1

Such a two-dimensional representation can be generalized to the $p$-dimensional case using discrete Fourier transforms (DFT) (Section 10.1 of Brockwell and Davis [1991]). We use the DFT evaluated at the Fourier frequencies to estimate $f(\omega)$ and $f^{-1}(\omega)$. The $p$-dimensional complex vector-valued DFT at the Fourier frequency $\omega_{j}=2\pi j/n$, $j\in F_{n}$, is defined as

d_{j}:=d(\omega_{j})=\frac{1}{\sqrt{n}}\sum_{t=1}^{n}X_{t}\exp\left(-\mathrm{i}t\omega_{j}\right).

We represent it as $d_{j}=(d_{j1},\cdots,d_{jp})^{\top}$. The classical estimator of the spectral density $f(\omega_{j})$ is based on the periodogram [Brockwell and Davis, 1991], defined as $I(\omega_{j})=d(\omega_{j})d^{\dagger}(\omega_{j})=d_{j}d_{j}^{\dagger}$. According to Proposition 10.3.1 of Brockwell and Davis [1991], for any $1\leq k\leq p$, $\mathbb{E}[I_{kk}(\omega_{j})]\rightarrow[2\pi f(\omega_{j})]_{kk}$, which implies $\operatorname*{Cov}(d_{jk},d_{jk})=\mathbb{E}[d_{jk}d_{jk}^{\dagger}]\approx[2\pi f(\omega_{j})]_{kk}$. For any pair $k\neq k^{\prime}$, the cross-spectral terms can be similarly expressed as $\operatorname*{Cov}(d_{jk},d_{jk^{\prime}})=\mathbb{E}[d_{jk}d_{jk^{\prime}}^{\dagger}]\approx[2\pi f(\omega_{j})]_{kk^{\prime}}$. The bottom line is that the covariance structure among the DFT values, i.e. the $d_{j}$'s, is approximately proportional to the spectral density, which in turn captures the correlation pattern of the underlying orthogonal increment process as in (2.3). In general, $I(\omega_{j})$ is asymptotically unbiased for $f(\omega_{j})$ but not consistent even under classical, fixed-$p$ asymptotics. For example, for i.i.d. Gaussian white noise $X_{t}\sim\mathcal{N}(\mathbf{0},\sigma^{2}I)$, the variance of $I(\omega_{j})$ is of the order $\sigma^{4}$ (Proposition 10.3.2 of Brockwell and Davis [1991]), and does not vanish asymptotically. To ensure consistency, it is common to use the averaged or smoothed periodogram estimator

\hat{f}(\omega_{j})=\frac{1}{2\pi(2m+1)}\sum_{|k|\leq m}I(\omega_{j+k}),\quad j\in F_{n}.   (2.4)

Note that the indices of the Fourier frequencies $j$ are evaluated “modulo $n$”, as $j+k$ may fall outside $F_{n}$. It is well-known (cf. Theorem 10.4.1 of Brockwell and Davis [1991]) that for $m=o(\sqrt{n})$, (2.4) is a consistent estimator of $f(\omega_{j})$ in the classical asymptotic framework where $p$ is fixed but the sample size $n\rightarrow\infty$.
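A minimal numpy sketch of the DFT and the averaged periodogram estimator in (2.4) is given below; the function names are illustrative and the data matrix is assumed to be of shape $n\times p$:

```python
# Sketch: DFT at Fourier frequencies and the smoothed periodogram f_hat(w_j) in (2.4).
import numpy as np

def dft(X):
    """Row j is d_j = n^{-1/2} sum_t X_t exp(-i t w_j), with w_j = 2*pi*j/n."""
    n = X.shape[0]
    t = np.arange(1, n + 1)
    omegas = 2 * np.pi * np.arange(n) / n
    return np.exp(-1j * np.outer(omegas, t)) @ X / np.sqrt(n)

def smoothed_periodogram(X, j, m):
    """Average of I(w_{j+k}) = d_{j+k} d_{j+k}^dagger over |k| <= m, indices taken mod n."""
    d = dft(X)
    n = d.shape[0]
    idx = (j + np.arange(-m, m + 1)) % n
    I_sum = np.einsum('ka,kb->ab', d[idx], d[idx].conj())
    return I_sum / (2 * np.pi * (2 * m + 1))
```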

2.2 Method: Penalized Whittle likelihood estimator

We start by introducing the complex normal random variable and its role in the estimation of the spectral precision matrix. The following definition is adapted from Definition 11.7.1 of Brockwell and Davis [1991].

Definition 2.1 (Complex Multivariate Normal Distribution).

$Y=Y_{1}+\mathrm{i}Y_{2}$ follows a $p$-dimensional complex-valued multivariate normal distribution with mean vector $\mu=\mu_{1}+\mathrm{i}\mu_{2}\in\mathbb{C}^{p}$ and covariance matrix $\Sigma=\Sigma_{1}+\mathrm{i}\Sigma_{2}\in\mathcal{H}^{p}_{+}$ if

\begin{bmatrix}Y_{1}\\ Y_{2}\end{bmatrix}\thicksim\mathcal{N}\left(\begin{bmatrix}\mu_{1}\\ \mu_{2}\end{bmatrix},\frac{1}{2}\begin{bmatrix}\Sigma_{1}&-\Sigma_{2}\\ \Sigma_{2}&\Sigma_{1}\end{bmatrix}\right).

In this case we use the notation $Y\thicksim\mathcal{N}_{\mathbb{C}}(\mu,\Sigma)$.
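For intuition, a sample from $\mathcal{N}_{\mathbb{C}}(\mu,\Sigma)$ can be drawn by sampling the stacked real and imaginary parts from the real Gaussian distribution in Definition 2.1; a hedged sketch (function name illustrative) is:

```python
# Sketch: draw from the complex multivariate normal of Definition 2.1.
import numpy as np

def sample_complex_normal(mu, Sigma, size=1, seed=None):
    rng = np.random.default_rng(seed)
    p = len(mu)
    S1, S2 = Sigma.real, Sigma.imag
    mean = np.concatenate([mu.real, mu.imag])
    cov = 0.5 * np.block([[S1, -S2], [S2, S1]])      # covariance of [Y1; Y2]
    Y = rng.multivariate_normal(mean, cov, size=size)
    return Y[:, :p] + 1j * Y[:, p:]
```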

It is well-known that under some regularity conditions, $d_{j}\sim\mathcal{N}_{\mathbb{C}}(0,f(\omega_{j}))$ asymptotically (Propositions 11.7.3 and 11.7.4 of Brockwell and Davis [1991]), and, with the exception of the pair $\pm\omega_{j}$, the DFTs at different Fourier frequencies are approximately independent. So, for any given $j\in F_{n}$ and small enough $m$, it makes sense to treat the DFTs evaluated at $\omega_{j-m},\ldots,\omega_{j},\ldots,\omega_{j+m}$ as nearly independent and identically distributed observations with covariance matrix $f(\omega_{j})$ and precision matrix $f^{-1}(\omega_{j})$.

This formulation is analogous to the framework of Gaussian graphical models (GGM), where one is interested in learning the precision matrix $\Theta=\Sigma^{-1}$ from i.i.d. data $X_{k}\sim\mathcal{N}(0,\Sigma)$, $k=1,\ldots,n$. It is therefore quite natural to devise algorithms that are similar to GGM estimators. Indeed, the sparse inverse periodogram estimator (SIPE) of Fiecas et al. [2019] is built on a popular GGM estimator called CLIME [Cai et al., 2011], and the regression approach adopted in Krampe and Paparoditis [2022] is based on Meinshausen and Bühlmann [2006].

We propose to develop a sparsity regularized maximum likelihood estimator based on Whittle's approximate likelihood [Whittle, 1951]. Note that for Gaussian time series, $d_{j}\sim\mathcal{N}_{\mathbb{C}}(0,f(\omega_{j}))$ asymptotically for every Fourier frequency $\omega_{j}$, and the Whittle likelihood is the pseudo-likelihood obtained by treating the DFTs as approximately independent across frequencies. The log-likelihood (up to constants) takes the form

\sum_{j\in F_{n}}\log\det f^{-1}(\omega_{j})-\sum_{j\in F_{n}}d^{\dagger}_{j}\ f^{-1}(\omega_{j})\ d_{j}.

To estimate the spectral precision at a Fourier frequency $\omega_{j}$, we work with the local approximation

\sum_{\ell=j-m}^{j+m}\log\det f^{-1}(\omega_{j})-\sum_{\ell=j-m}^{j+m}d^{\dagger}_{\ell}\ f^{-1}(\omega_{j})\ d_{\ell}.

We solve the following penalized negative log-likelihood minimization problem to estimate $f^{-1}(\omega_{j})$:

\hat{\Theta}(\omega_{j}):=\,\operatorname*{arg\,min}_{\Theta\in\mathcal{H}_{++}^{p}}\,\,-(2m+1)\log\det\Theta+\text{trace}\left(\Theta\sum_{\ell=j-m}^{j+m}d_{\ell}d^{\dagger}_{\ell}\right)+\lambda\left\|\Theta\right\|_{1,\text{off}},   (2.5)

where $\lambda>0$ is the penalty parameter, and $\left\|\Theta\right\|_{1,\text{off}}=\sum_{k\neq\ell}|\Theta_{k,\ell}|$ is the sum of the moduli of all off-diagonal elements of $\Theta$. Equivalently,

\hat{\Theta}_{j}=\operatorname*{arg\,min}_{\Theta\in\mathcal{H}_{++}^{p}}\,\langle\Theta,\hat{f}(\omega_{j})\rangle-\log\det\Theta+\lambda\|\Theta\|_{1,\text{off}}.   (2.6)

Here $\hat{\Theta}(\omega_{j})$ is abbreviated as $\hat{\Theta}_{j}$, and $\lambda$ in the optimization problem (2.5) is replaced by $(2m+1)\lambda$ to simplify the notation. This optimization problem is similar in spirit to the estimation of inverse covariance matrices based on the graphical Lasso [Friedman et al., 2008] for i.i.d. data. The main difference is that $\hat{f}(\omega_{j})=\sum_{\ell=j-m}^{j+m}d_{\ell}d^{\dagger}_{\ell}/(2m+1)$ takes the role of the sample covariance matrix.
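For reference, the CGLASSO objective in (2.6) can be evaluated at a candidate $\Theta$ as in the following sketch (assuming $\Theta$ and $\hat{f}(\omega_{j})$ are Hermitian; names are illustrative):

```python
# Sketch: value of the objective <Theta, f_hat> - log det Theta + lambda * ||Theta||_{1,off}.
import numpy as np

def cglasso_objective(Theta, f_hat, lam):
    trace_term = np.trace(f_hat @ Theta).real            # <Theta, f_hat> for Hermitian inputs
    logdet = np.log(np.linalg.det(Theta).real)
    off_l1 = np.sum(np.abs(Theta)) - np.sum(np.abs(np.diag(Theta)))
    return trace_term - logdet + lam * off_l1
```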

Similar penalized local Whittle likelihood approaches have been investigated in the recent literature by several authors [Jung et al., 2015; Dallakyan et al., 2022; Baek et al., 2021a; Tugnait, 2022]. Among these works, Baek et al. [2021a] focuses only on $\omega=0$ but allows long-range dependent processes, and the others provide a single graph by penalizing across all the frequencies at once. In what follows, we restrict our discussion to the SIPE algorithm of Fiecas et al. [2019] and Krampe and Paparoditis [2022], because only they provide algorithms to estimate $f^{-1}(\omega_{j})$ for any given $\omega_{j},j\in F_{n}$.

2.3 Nodewise regression (NWR) of DFTs

Building upon the parallel to GGM, if one is only interested in recovering the sparsity structure of $f^{-1}(\omega)$ and not in estimating the full matrix, a natural alternative to graphical Lasso is to perform nodewise regression along the lines of Meinshausen and Bühlmann [2006], where we regress one coordinate of the DFTs on all the other coordinates. This can be done by running $p$ different complex Lasso regressions. It is closely related to the approach in Krampe and Paparoditis [2022], where instead of the complex Lasso penalty the authors used separate Lasso penalties on the real and imaginary parts of a complex number. While we do not pursue the asymptotic theory of this method in our paper, we compare the model selection performance of our method against it in the simulation section.

Next we provide a brief description of nodewise DFT regression. Consider the data matrix $\mathcal{Z}$ for estimation at the Fourier frequency $\omega_{j}$. Going forward, we will drop the subscript $j$ when the index is clear from the context.

\mathcal{Z}:=\left[\begin{array}{cccc}d_{j-m,1}&d_{j-m,2}&\ldots&d_{j-m,p}\\ d_{j-m+1,1}&d_{j-m+1,2}&\ldots&d_{j-m+1,p}\\ \vdots&\vdots&\vdots&\vdots\\ d_{j,1}&d_{j,2}&\ldots&d_{j,p}\\ d_{j+1,1}&d_{j+1,2}&\ldots&d_{j+1,p}\\ \vdots&\vdots&\vdots&\vdots\\ d_{j+m,1}&d_{j+m,2}&\ldots&d_{j+m,p}\end{array}\right].

We will denote the $k$th column of $\mathcal{Z}$ as $\mathcal{Z}_{k}$, and the matrix containing all the other columns as $\mathcal{Z}_{-k}$. Our plan is to regress $\mathcal{Z}_{k}$ on $\mathcal{Z}_{-k}$, and use the sparsity pattern of the regression coefficients to learn the sparsity pattern in the off-diagonal part of the $k$th row (equivalently, column) of $f^{-1}(\omega_{j})$. In the low-dimensional regime, we can run an ordinary least squares (OLS) regression with complex response and predictors. In high dimensions, we use the complex Lasso to estimate the nodewise regression coefficients

\hat{\beta}_{k}=\operatorname*{arg\,min}_{\beta\in\mathbb{C}^{p-1}}\frac{1}{2N}\|\mathcal{Z}_{k}-\mathcal{Z}_{-k}\beta\|^{2}+\lambda_{k}\|\beta\|_{1}.   (2.8)

Here $\|\beta\|_{1}=\sum_{\ell=1}^{p-1}|\beta_{\ell}|$, $N=2m+1$ is the effective sample size, and $\lambda_{k}>0$ is the penalty parameter. The $\ell$th entry in the off-diagonal part of the $k$th row of $f^{-1}(\omega_{j})$, i.e. of $\left[f^{-1}(\omega_{j})\right]_{k,-k}$, is then estimated as zero if and only if $\hat{\beta}_{k\ell}=0$. This may lead to a potentially asymmetric sparsity pattern in $f^{-1}(\omega_{j})$, but we can symmetrize the sparsity pattern using standard post-processing techniques proposed in Meinshausen and Bühlmann [2006], as sketched below.
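A sketch of the symmetrization step (the standard "AND"/"OR" rules of Meinshausen and Bühlmann [2006]; names illustrative):

```python
# Sketch: symmetrize the nodewise support estimates into an undirected graph.
import numpy as np

def symmetrize_support(B, rule="and"):
    """B is a p x p boolean matrix with B[k, l] = (beta_hat_{k,l} != 0); diagonal ignored."""
    S = np.asarray(B, dtype=bool)
    return (S & S.T) if rule == "and" else (S | S.T)
```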

3 Optimization algorithms

While standard convex optimization algorithms such as ADMM can be used to solve (2.6), they do not exploit the underlying sparsity structure of the problem and can be computationally expensive when solving the problem for different penalty parameters $\lambda$ and frequencies $\omega_{j}$. In the literature on high-dimensional regression and GGM, coordinate descent algorithms are popular because they provide considerable speed-up by using problem-specific strategies such as warm starts and active set selection [Hastie et al., 2015]. A direct adoption of coordinate descent for complex graphical Lasso is difficult since the penalty $|\beta|$, for $\beta\in\mathbb{C}$, is effectively a group Lasso penalty $\sqrt{\textbf{Re}(\beta)^{2}+\textbf{Im}(\beta)^{2}}$ over its real and imaginary parts. In general, a block coordinate descent update for group Lasso does not have a closed form, and one needs to solve a separate optimization problem at every iteration of the block coordinate descent.

Figure 2: Anti-clockwise and orthogonal rotation property of $\mathrm{i}$ (multiplication by $\mathrm{i}$ maps $1\mapsto\mathrm{i}$, $\mathrm{i}\mapsto-1$, $-1\mapsto-\mathrm{i}$ and $-\mathrm{i}\mapsto 1$)

Our key observation is that the specific group Lasso regression underlying the complex Lasso has an attractive feature: the predictors within a single group are orthogonal, which yields closed form updates in the block coordinate descent iterations. Using this observation, we first propose CLASSO, a coordinate descent algorithm for Lasso with complex variables. Then we use CLASSO to build CGLASSO, a graphical Lasso algorithm for complex variables that solves (2.6).

While this observation of orthogonality alone is sufficient for developing fast algorithms, we argue that this is not a coincidence. It reflects a deep connection between complex and real matrices that can be leveraged to systematically realify a broader family of complex optimization problems. We formalize this connection first, and then present the CLASSO and CGLASSO algorithms.

3.1 Realifying complex optimization with ring isomorphism

At the core of our algorithm lies a well-known field isomorphism between the set of complex scalars $\mathbb{C}$ and a set of $2\times 2$ orthogonal real matrices. We show that this can be extended from $\mathbb{C}$ to a ring isomorphism on complex-valued matrices $\mathbb{C}^{m\times n}$. At a high level, this means that sums, products and inverses of $p\times p$ complex matrices have direct parallels to the same operations in the space of $2p\times 2p$ real-valued matrices. The upshot is that some generic complex-variable optimization problems arising in frequency-domain time series can be reduced to well-understood optimization problems over real variables arising in high-dimensional statistics.

First, we illustrate this isomorphism in Figure 2 for a single complex number, $\mathrm{i}$. The relationships $\mathrm{i}\times 1=\mathrm{i}$ and $\mathrm{i}\times\mathrm{i}=-1$ imply that the operation “multiplication by $\mathrm{i}$” can be viewed as a $90^{\circ}$ rotation that maps the unit vector $(1,0)$ on the complex plane to $(0,1)$, and maps the unit vector $(0,1)$ to $(-1,0)$. This linear map corresponds to the $2\times 2$ orthogonal matrix

\mathbf{J}=\left[\begin{array}{cc}0&-1\\ 1&0\end{array}\right].

If we identify $1$ with $\mathbf{I}$ and $\mathrm{i}$ with $\mathbf{J}$, we can identify any complex number $z=1\times a+\mathrm{i}\times b$, $a,b\in\mathbb{R}$, with a $2\times 2$ matrix $a\mathbf{I}+b\mathbf{J}$ of the form given in (3.1).
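A quick numerical check of this identification, with a throwaway helper `phi` mirroring (3.1):

```python
# Sketch: verify the isomorphism properties formalized in Proposition 3.1 below.
import numpy as np

def phi(z):
    return np.array([[z.real, -z.imag],
                     [z.imag,  z.real]])

z, w = 2 + 3j, -1 + 0.5j
assert np.allclose(phi(z + w), phi(z) + phi(w))         # additivity
assert np.allclose(phi(z * w), phi(z) @ phi(w))         # multiplicativity
assert np.allclose(phi(1 / z), np.linalg.inv(phi(z)))   # inverses map to inverses
assert np.allclose(phi(z.conjugate()), phi(z).T)        # conjugation maps to transpose
```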

Next we formalize and expand on the above heuristic of isomorphism through a series of propositions. For a brief introduction to common structures in abstract algebra, e.g. groups, rings, fields and isomorphisms, we refer the readers to Appendix A.

Proposition 3.1.

Define $\varphi:\mathbb{C}\rightarrow\mathbb{R}^{2\times 2}$ as

\varphi(z)=\begin{bmatrix}\textbf{Re}(z)&-\textbf{Im}(z)\\ \textbf{Im}(z)&\textbf{Re}(z)\end{bmatrix}.

Then $\varphi$ is a field isomorphism between $\mathbb{C}$ and $\mathcal{M}^{2\times 2}$, where

\mathcal{M}^{2\times 2}:=\left\{\begin{bmatrix}a&-b\\ b&a\end{bmatrix}:a,b\in\mathbb{R}\right\}.   (3.1)

This implies

(i) $\varphi(0)=\mathbf{0}_{2\times 2}$ and $\varphi(1)=\mathbf{I}_{2}$.

(ii) $\varphi(z+z^{\prime})=\varphi(z)+\varphi(z^{\prime})$ for all $z,z^{\prime}\in\mathbb{C}$. In particular, $\varphi(-z)=-\varphi(z)$ for all $z\in\mathbb{C}$.

(iii) $\varphi(zz^{\prime})=\varphi(z)\varphi(z^{\prime})$ for all $z,z^{\prime}\in\mathbb{C}$. In particular, $\varphi(z^{-1})=[\varphi(z)]^{-1}$ for all $z\in\mathbb{C},z\neq 0$.

Also, for any $z\in\mathbb{C}$, denote $\varphi(z)=(\varphi_{1}(z),\varphi_{2}(z))$, where $\varphi_{1}(z),\varphi_{2}(z)\in\mathbb{R}^{2}$. Then the following hold:

(i) $\langle\varphi_{1}(z),\varphi_{2}(z)\rangle=0$,

(ii) $\|\varphi_{1}(z)\|^{2}=\|\varphi_{2}(z)\|^{2}=\frac{1}{2}\|\varphi(z)\|_{F}^{2}=|z|^{2}$,

(iii) $\varphi(z^{\dagger})=[\varphi(z)]^{\top}$.

The proof is given in Appendix B. The isomorphism $\varphi$ can be naturally extended to higher dimensional vector spaces and to the matrix spaces $\mathbb{C}^{m\times n}$. However, the extended map $\varphi$ will no longer be a field isomorphism since for $m,n\geq 1$, $\mathbb{C}^{n}$ and $\mathbb{C}^{m\times n}$ are not necessarily fields, but rings. So it reduces to a ring isomorphism.

For any $\mathbf{z}=(z_{1},\ldots,z_{n})^{\top}\in\mathbb{C}^{n}$, we extend $\varphi:\mathbb{C}^{n}\rightarrow\mathbb{R}^{2n\times 2}$ as

\varphi(\mathbf{z})=\begin{bmatrix}\varphi(z_{1})\\ \vdots\\ \varphi(z_{n})\end{bmatrix}=[\varphi_{1}(\mathbf{z})\ \ \varphi_{2}(\mathbf{z})],\qquad\varphi_{1}(\mathbf{z}),\varphi_{2}(\mathbf{z})\in\mathbb{R}^{2n}.   (3.2)

Similarly, for any $Z=((z_{ij}))_{i,j=1,1}^{m,n}\in\mathbb{C}^{m\times n}$, the extension $\varphi:\mathbb{C}^{m\times n}\rightarrow\mathbb{R}^{2m\times 2n}$ is

\varphi(Z)=\begin{bmatrix}\varphi(z_{11})&\cdots&\varphi(z_{1n})\\ \vdots&\ddots&\vdots\\ \varphi(z_{m1})&\cdots&\varphi(z_{mn})\end{bmatrix}.   (3.3)

Below we state the properties of the extended map $\varphi$ in (3.2) and (3.3) without proof, since they are similar to Proposition 3.1.

Proposition 3.2.

For any $\mathbf{z}\in\mathbb{C}^{n},\ n\geq 1$, the extended map $\varphi:\mathbb{C}^{n}\rightarrow\mathbb{R}^{2n\times 2}$ in (3.2) satisfies

(i) $\langle\varphi_{1}(\mathbf{z}),\varphi_{2}(\mathbf{z})\rangle=0$,

(ii) $\|\varphi_{1}(\mathbf{z})\|^{2}=\|\varphi_{2}(\mathbf{z})\|^{2}=\frac{1}{2}\|\varphi(\mathbf{z})\|_{F}^{2}=\|\mathbf{z}\|^{2}$,

and for any $Z\in\mathbb{C}^{m\times n},\ m,n\geq 1$, the extended map $\varphi:\mathbb{C}^{m\times n}\rightarrow\mathbb{R}^{2m\times 2n}$ in (3.3) satisfies

(i) $\langle\varphi_{1}(Z_{\cdot j}),\varphi_{2}(Z_{\cdot j})\rangle=0$ for $j=1,\ldots,n$,

(ii) $\|\varphi(Z)\|_{F}^{2}=2\|Z\|_{F}^{2}$,

(iii) $\varphi(Z^{\dagger})=[\varphi(Z)]^{\top}$.

Note that the extended map $\varphi$ is also a group isomorphism, and the additive identities are preserved. Hence the extension $\varphi$ is a ring isomorphism as well. We formally state this result next.

Proposition 3.3.

Let $\text{Image}(\varphi)$ be the image of the map $\varphi$. Then the following statements hold:

(i) The extended map $\varphi:\mathbb{C}^{n}\rightarrow\mathbb{R}^{2n\times 2}$ in (3.2) is a ring isomorphism between $\mathbb{C}^{n}$ and $\text{Image}(\varphi)$,

(ii) The extended map $\varphi:\mathbb{C}^{m\times n}\rightarrow\mathbb{R}^{2m\times 2n}$ in (3.3) is a ring isomorphism between $\mathbb{C}^{m\times n}$ and $\text{Image}(\varphi)$,

(iii) Let $GL_{n}(\mathbb{C})$ be the set of all $n\times n$ invertible matrices with entries in $\mathbb{C}$. Then, for any $A\in GL_{n}(\mathbb{C})$, $\varphi(A^{-1})=[\varphi(A)]^{-1}$.

The proof of Proposition 3.3 is provided in Appendix B.

Remark 3.1.

Note that even though the map $\varphi$ doubles the dimension of the input object, the computational complexity of dealing with objects from $\mathcal{M}^{2\times 2}$ remains the same as that of operations involving complex numbers.

So far the function $\varphi(Z)$ replaces every entry of a complex matrix $Z$ by a $2\times 2$ real matrix. While this is useful for proving the isomorphism properties, rearranging the rows and columns of $\varphi(Z)$ helps us present our algorithms in a more compact form. Let $\Pi_{k},k\in\mathbb{N}$ be the $2k\times 2k$ permutation matrix that maps $\{1,\ldots,2k\}$ to $\{1,3,\ldots,2k-1,2,4,\ldots,2k\}$. In matrix notation,

\Pi_{k}=[\mathbf{e}_{1}:\mathbf{e}_{3}:\cdots:\mathbf{e}_{2k-1}:\mathbf{e}_{2}:\mathbf{e}_{4}:\cdots:\mathbf{e}_{2k}].

Then for $\mathbf{z}\in\mathbb{C}^{n}$ and any $Z\in\mathbb{C}^{m\times n}$, we introduce the notation

\widetilde{\mathbf{z}}=\Pi_{n}^{\top}\varphi(\mathbf{z})\mathbf{e}_{1}=\begin{bmatrix}\textbf{Re}(\mathbf{z})\\ \textbf{Im}(\mathbf{z})\end{bmatrix},\qquad\widetilde{\widetilde{Z}}=\Pi_{m}^{\top}\varphi(Z)\Pi_{n}=\begin{bmatrix}\textbf{Re}(Z)&-\textbf{Im}(Z)\\ \textbf{Im}(Z)&\textbf{Re}(Z)\end{bmatrix}.
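For concreteness, the rearranged objects $\widetilde{\mathbf{z}}$ and $\widetilde{\widetilde{Z}}$ can be built directly from the real and imaginary parts, as in the following sketch (function names are ours, not the paper's):

```python
# Sketch: z_tilde = [Re(z); Im(z)] and Z_tildetilde = [[Re Z, -Im Z], [Im Z, Re Z]].
import numpy as np

def vec_tilde(z):
    return np.concatenate([z.real, z.imag])

def mat_tildetilde(Z):
    return np.block([[Z.real, -Z.imag],
                     [Z.imag,  Z.real]])
```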

3.1.1 Ordinary least squares (OLS) with complex variables

The aforementioned properties of $\varphi$ can be used to rewrite standard complex-variable optimization problems as transformed problems over real variables. For instance, the objective function of an ordinary least squares (OLS) problem can be rewritten as

\|Y-X\beta\|_{2}^{2} =\frac{1}{2}\|\varphi(Y-X\beta)\|_{F}^{2}
=\frac{1}{2}\|\Pi_{n}^{\top}\varphi(Y-X\beta)\Pi_{1}\|_{F}^{2} \quad\text{[the Frobenius norm is invariant under orthogonal transforms]}
=\frac{1}{2}\|\Pi_{n}^{\top}\varphi(Y)\Pi_{1}-\Pi_{n}^{\top}\varphi(X\beta)\Pi_{1}\|_{F}^{2}
=\frac{1}{2}\|\Pi_{n}^{\top}\varphi(Y)\Pi_{1}-(\Pi_{n}^{\top}\varphi(X)\Pi_{p})(\Pi_{p}^{\top}\varphi(\beta)\Pi_{1})\|_{F}^{2}
=\frac{1}{2}\|\widetilde{\widetilde{Y}}-\widetilde{\widetilde{X}}\widetilde{\widetilde{\beta}}\|_{F}^{2}
=\frac{1}{2}\cdot 2\|(\widetilde{\widetilde{Y}}-\widetilde{\widetilde{X}}\widetilde{\widetilde{\beta}})\mathbf{e}_{1}\|_{2}^{2}
=\|\widetilde{Y}-\widetilde{\widetilde{X}}\widetilde{\beta}\|_{2}^{2}.   (3.4)

Therefore the isomorphism $\varphi$ provides a bridge between the complex-variable and real-variable least squares problems. A similar correspondence can be derived for the normal equations of the least squares problem:

X^{\dagger}X\beta=X^{\dagger}Y\ \iff\ \varphi(X^{\dagger}X\beta)=\varphi(X^{\dagger}Y)
\iff\ [\varphi(X)]^{\top}\varphi(X)\varphi(\beta)\mathbf{e}_{1}=[\varphi(X)]^{\top}\varphi(Y)\mathbf{e}_{1}
\iff\ (\Pi_{n}^{\top}\varphi(X)\Pi_{p})^{\top}(\Pi_{n}^{\top}\varphi(X)\Pi_{p})\Pi_{p}^{\top}\varphi_{1}(\beta)=(\Pi_{n}^{\top}\varphi(X)\Pi_{p})^{\top}\Pi_{n}^{\top}\varphi_{1}(Y)
\iff\ \widetilde{\widetilde{X}}^{\top}\widetilde{\widetilde{X}}\widetilde{\beta}=\widetilde{\widetilde{X}}^{\top}\widetilde{Y}.   (3.5)

This shows that the transformed normal equation (3.5) is equivalent to the complex normal equation $X^{\dagger}X\beta=X^{\dagger}Y$.
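A small numerical check of this equivalence, reusing `vec_tilde` and `mat_tildetilde` from the sketch above (random data, purely illustrative):

```python
# Sketch: the realified normal equations (3.5) recover the complex OLS solution.
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 4
X = rng.standard_normal((n, p)) + 1j * rng.standard_normal((n, p))
Y = rng.standard_normal(n) + 1j * rng.standard_normal(n)

beta_c = np.linalg.solve(X.conj().T @ X, X.conj().T @ Y)       # complex normal equations
XX = mat_tildetilde(X)
beta_r = np.linalg.solve(XX.T @ XX, XX.T @ vec_tilde(Y))       # realified normal equations
assert np.allclose(beta_r, vec_tilde(beta_c))
```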

3.1.2 Complex Gaussian log-likelihood for precision matrix estimation

The Gaussian log-likelihood can alternatively be written in terms of real variables too. Consider the setup with i.i.d. random variables $Z_{1},\ldots,Z_{n}\sim\mathcal{N}_{\mathbb{C}}(0,\Theta^{-1})$ for $\Theta\in\mathcal{H}_{++}^{p}$, and denote the complex-valued sample Gram matrix as $P=\sum_{i=1}^{n}Z_{i}Z_{i}^{\dagger}/n$.

The log-likelihood can be expressed (ignoring constants) as

L_{\mathbb{C}}(\Theta)=\log\det\Theta-\operatorname*{trace}(P\Theta).

Let us consider the analogous function of real matrices

L_{\mathbb{R}}(\Theta)=\log\det\varphi(\Theta)-\operatorname*{trace}(\varphi(P)\varphi(\Theta)).

If $\Theta=U\Lambda U^{\dagger}$ is the spectral decomposition of $\Theta$, with $U$ a unitary matrix and $\Lambda$ a diagonal matrix with all positive real entries, then

\varphi(\Theta) =\varphi(U\Lambda U^{\dagger})
=\varphi(U)\varphi(\Lambda)\varphi(U)^{\top}
=\varphi(U)\Pi_{p}\Pi_{p}^{\top}\varphi(\Lambda)\Pi_{p}\Pi_{p}^{\top}\varphi(U)^{\top}
=\varphi(U)\Pi_{p}\begin{bmatrix}\Lambda&0\\ 0&\Lambda\end{bmatrix}\Pi_{p}^{\top}\varphi(U)^{\top}.

Therefore,

\log\det\varphi(\Theta) =\log\left(\det(\varphi(U))\det(\Pi_{p})(\det\Lambda)^{2}\det(\Pi_{p}^{\top})\det(\varphi(U)^{\top})\right)
=\log\left(\det(\varphi(U)\varphi(U)^{\top})(\det\Lambda)^{2}\right)
=\log\left(\det(\varphi(UU^{\dagger}))(\det\Lambda)^{2}\right)
=\log\left((\det\Lambda)^{2}\right)
=2\log\det\Lambda=2\log\det\Theta.   (3.6)

Consider the Schur decomposition $P\Theta=V\Delta V^{\dagger}$, with $V$ a unitary matrix and $\Delta$ an upper triangular matrix with the eigenvalues of $P\Theta$ on its diagonal. Let $\Theta^{1/2}:=U\Lambda^{1/2}U^{\dagger}$ be the square root of the positive definite matrix $\Theta$, where $\Lambda^{1/2}$ is the diagonal matrix whose entries are the square roots of the diagonal entries of $\Lambda$. Since $P\Theta=\Theta^{-1/2}(\Theta^{1/2}P\Theta^{1/2})\Theta^{1/2}$ is similar to the positive definite matrix $\Theta^{1/2}P\Theta^{1/2}$, all the eigenvalues of $P\Theta$ are positive, i.e. $\Delta$ has real positive diagonal entries. The trace term can then be expressed as

\operatorname*{trace}(\varphi(P)\varphi(\Theta))=\operatorname*{trace}(\varphi(P\Theta)) =\operatorname*{trace}(\varphi(V\Delta V^{\dagger}))
=\operatorname*{trace}(\varphi(V)\varphi(\Delta)\varphi(V)^{\top})
=\operatorname*{trace}(\varphi(\Delta)\varphi(V)^{\top}\varphi(V))
=\operatorname*{trace}(\varphi(\Delta)\varphi(V^{\dagger}V))
=\operatorname*{trace}(\varphi(\Delta))
=\operatorname*{trace}(\varphi(\Delta)\Pi_{p}\Pi_{p}^{\top})
=\operatorname*{trace}(\Pi_{p}^{\top}\varphi(\Delta)\Pi_{p})
=\operatorname*{trace}(\text{diag}(\Delta,\Delta))
=2\operatorname*{trace}(\Delta)=2\operatorname*{trace}(P\Theta),   (3.7)

where the fact that trace of a product of matrices is invariant under cyclic permutations is used twice in the calculation. Combining (3.6) and (3.7), we obtain

L_{\mathbb{R}}(\Theta)=2L_{\mathbb{C}}(\Theta),

which implies that maximizing $L_{\mathbb{C}}(\Theta)$ with respect to $\Theta$ is equivalent to maximizing $L_{\mathbb{R}}(\Theta)$ with respect to $\varphi(\Theta)$, or equivalently $\Theta$ (there is a one-to-one correspondence between $\Theta$ and $\varphi(\Theta)$).
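A numerical sanity check of this identity, again using `mat_tildetilde` from the sketch above (which is $\Pi_{p}^{\top}\varphi(\cdot)\Pi_{p}$ and hence has the same determinant and trace as $\varphi(\cdot)$):

```python
# Sketch: verify L_R(Theta) = 2 * L_C(Theta) for a random Hermitian PD Theta and Gram matrix P.
import numpy as np

rng = np.random.default_rng(1)
p = 3
A = rng.standard_normal((p, p)) + 1j * rng.standard_normal((p, p))
Theta = A @ A.conj().T + p * np.eye(p)          # Hermitian positive definite
B = rng.standard_normal((p, p)) + 1j * rng.standard_normal((p, p))
P = B @ B.conj().T / p                          # Hermitian non-negative definite

L_C = np.log(np.linalg.det(Theta).real) - np.trace(P @ Theta).real
phi_T, phi_P = mat_tildetilde(Theta), mat_tildetilde(P)
L_R = np.log(np.linalg.det(phi_T)) - np.trace(phi_P @ phi_T)
assert np.isclose(L_R, 2 * L_C)
```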

3.2 Lasso with complex variables

In order to solve the complex Lasso optimization problem (2.8), we can use the group Lasso [Yuan and Lin, 2006] for real-valued predictors and response. We show this using generic notation for the complex Lasso. Let $X\in\mathbb{C}^{n\times p}$ and $Y\in\mathbb{C}^{n}$ be the predictors and the response in a complex-variable regression problem. The Lasso problem is

\hat{\beta}=\operatorname*{arg\,min}_{\beta\in\mathbb{C}^{p}}\frac{1}{2n}\|Y-X\beta\|_{2}^{2}+\lambda\|\beta\|_{1},   (3.8)

where $\lambda>0$ is the penalty parameter. In the optimization problem (2.8), $\mathcal{Z}_{-k}$ and $\mathcal{Z}_{k}$ play the roles of $X$ and $Y$ respectively.

Algorithm 1 Complex Lasso (CLASSO) with complex update
Objective: Calculate $\hat{\beta}=\operatorname*{arg\,min}_{\beta\in\mathbb{C}^{p}}\frac{1}{2n}\|Y-X\beta\|^{2}+\lambda\|\beta\|_{1}$
Input: $X=[X_{1}:X_{2}:\ldots:X_{p}]\in\mathbb{C}^{n\times p}$ with $\|X_{j}\|_{2}=\sqrt{n}$ for all $j=1,\ldots,p$, $Y\in\mathbb{C}^{n}$, $\hat{\beta}^{(0)}\in\mathbb{C}^{p}$, $\lambda>0$
1. Initialize: $\hat{\beta}=\hat{\beta}^{(0)}$
2. Calculate the residual $r=Y-\sum_{k=1}^{p}X_{k}\hat{\beta}_{k}$
3. Repeat until convergence for $j=1,\ldots,p$:
  (a) Calculate the $j$th partial residual $r^{(j)}=r+X_{j}\hat{\beta}_{j}$
  (b) $\hat{\beta}_{j}\leftarrow{\cal S}_{\lambda}\left(\frac{1}{n}X_{j}^{\dagger}r^{(j)}\right)$
  (c) $r\leftarrow r^{(j)}-X_{j}\hat{\beta}_{j}$
4. Return $\hat{\beta}$

The Lasso problem (3.8) can be alternatively expressed as an optimization problem with real variables. We use (3.4) and the following fact for the $\ell_{1}$ regularization part of (3.8):

\|\beta\|_{1}=\sum_{j=1}^{p}|\beta_{j}|=\sum_{j=1}^{p}\|\varphi_{1}(\beta_{j})\|_{2}=\sum_{j=1}^{p}\|\widetilde{\beta_{j}}\|_{2}.

Altogether, the optimization problem (3.8) is equivalent to

\operatorname*{arg\,min}_{\widetilde{\beta_{1}},\ldots,\widetilde{\beta_{p}}\in\mathbb{R}^{2}}\frac{1}{2n}\|\widetilde{Y}-\sum_{j=1}^{p}\widetilde{\widetilde{X_{j}}}\widetilde{\beta_{j}}\|_{2}^{2}+\lambda\sum_{j=1}^{p}\|\widetilde{\beta_{j}}\|_{2}.   (3.9)

This is a group Lasso problem with $p$ groups, each of size 2. Let the optimizer of (3.9) be $(\hat{\widetilde{\beta_{1}}},\ldots,\hat{\widetilde{\beta_{p}}})^{\top}$ and let $\widetilde{r}^{(j)}=\widetilde{Y}-\sum_{k\neq j}\widetilde{\widetilde{X_{k}}}\hat{\widetilde{\beta_{k}}}$ be the vector of $j$th partial residuals. Following equation (4.14) for the group Lasso in Hastie et al. [2015], the optimizer satisfies $\hat{\widetilde{\beta_{j}}}=0$ if $\left\|\frac{1}{n}\widetilde{\widetilde{X_{j}}}^{\top}\widetilde{r}^{(j)}\right\|<\lambda$, and otherwise the following equation:

\hat{\widetilde{\beta_{j}}}=\left(\frac{1}{n}\widetilde{\widetilde{X_{j}}}^{\top}\widetilde{\widetilde{X_{j}}}+\frac{\lambda}{\left\|\hat{\widetilde{\beta_{j}}}\right\|_{2}}I_{2}\right)^{-1}\frac{1}{n}\widetilde{\widetilde{X_{j}}}^{\top}\widetilde{r}^{(j)}.   (3.10)

Equation (3.10) does not have a closed form solution in general, but here we can exploit the structure of $\widetilde{\widetilde{X_{j}}}$. For each $j$, the columns of $\widetilde{\widetilde{X_{j}}}$ are orthogonal, which implies

\widetilde{\widetilde{X_{j}}}^{\top}\widetilde{\widetilde{X_{j}}}=\varphi(X_{j})^{\top}\underbrace{\Pi_{n}\Pi_{n}^{\top}}_{I_{2n}}\varphi(X_{j})=\varphi(X_{j}^{\dagger})\varphi(X_{j})=\varphi(X_{j}^{\dagger}X_{j})=\varphi(\|X_{j}\|_{2}^{2})=\|X_{j}\|_{2}^{2}I_{2}.

Therefore, the update (3.10) has a closed form expression for $\hat{\widetilde{\beta_{j}}}$ given by

\hat{\widetilde{\beta_{j}}}=\left(1-\frac{\lambda}{\left\|\frac{1}{n}\widetilde{\widetilde{X_{j}}}^{\top}\widetilde{r}^{(j)}\right\|_{2}}\right)_{+}\frac{\frac{1}{n}\widetilde{\widetilde{X_{j}}}^{\top}\widetilde{r}^{(j)}}{\frac{1}{n}\|X_{j}\|_{2}^{2}}.   (3.11)

Soft threshold operator. We define two closely related variants of the soft threshold operator widely used in the Lasso literature. For any $t\in\mathbb{R}$, let $t_{+}=\max\{0,t\}$. For every $z\in\mathbb{C}$, we define the soft threshold operator $\mathcal{S}_{\lambda}:\mathbb{C}\rightarrow\mathbb{C}$ as

\mathcal{S}_{\lambda}(z)=(|z|-\lambda)_{+}\frac{z}{|z|},   (3.12)

and for any $x\in\mathbb{R}^{2}$, we similarly define $\widetilde{\cal S}_{\lambda}:\mathbb{R}^{2}\rightarrow\mathbb{R}^{2}$ as

\widetilde{\cal S}_{\lambda}(x)=(\|x\|_{2}-\lambda)_{+}\frac{x}{\|x\|_{2}},

the soft threshold operator in $\mathbb{R}^{2}$. The two operators are connected by the map $\varphi$ and the following identity. For any $z\in\mathbb{C}$, we have

\widetilde{\mathcal{S}}_{\lambda}(\varphi(z))=(\|\varphi(z)\|-\lambda)_{+}\frac{\varphi(z)}{\|\varphi(z)\|_{2}}=(|z|-\lambda)_{+}\frac{\varphi(z)}{|z|}=\varphi(\mathcal{S}_{\lambda}(z)),

where we use property (iii) from Proposition 3.1. A similar univariate threshold operator has been defined in Sun et al. [2018].

We now use these soft threshold operators to obtain a more convenient form of (3.11). The update (3.11) immediately becomes $\hat{\widetilde{\beta}}_{j}=\widetilde{S}_{\lambda}\left(\widetilde{\widetilde{X_{j}}}^{\top}\widetilde{r}^{(j)}/n\right)\Big{/}\left(\|X_{j}\|_{2}^{2}/n\right)$. If the columns of $X$ are scaled so that $\|X_{j}\|_{2}^{2}=n$ for every $j=1,\ldots,p$, then

\hat{\widetilde{\beta_{j}}}=\widetilde{\cal S}_{\lambda}\left(\frac{1}{n}\widetilde{\widetilde{X_{j}}}^{\top}\widetilde{r}^{(j)}\right).   (3.13)

Using the group Lasso update provided in Hastie et al. [2015] and the soft threshold operator notation in (3.13), the group Lasso update in this case is

\widetilde{r}^{(j)} \leftarrow\widetilde{r}+\widetilde{\widetilde{X_{j}}}\hat{\widetilde{\beta_{j}}},
\hat{\widetilde{\beta_{j}}} \leftarrow\widetilde{\mathcal{S}}_{\lambda}\left(\frac{1}{n}\widetilde{\widetilde{X_{j}}}^{\top}\widetilde{r}^{(j)}\right),
\widetilde{r} \leftarrow\widetilde{r}^{(j)}-\widetilde{\widetilde{X_{j}}}\hat{\widetilde{\beta_{j}}}.

We denote $\hat{\beta}_{j}=\hat{\widetilde{\beta_{j}}}_{1}+\mathrm{i}\hat{\widetilde{\beta_{j}}}_{2}$, i.e. $\hat{\widetilde{\beta_{j}}}=\varphi(\hat{\beta}_{j})$, and $r^{(j)}=\widetilde{r}^{(j)}_{1}+\mathrm{i}\ \widetilde{r}^{(j)}_{2}$ for $j=1,\ldots,p$. Therefore, we can write

\varphi(r^{(j)})=\widetilde{r}^{(j)} =\widetilde{Y}-\sum_{k\neq j}\widetilde{\widetilde{X_{k}}}\hat{\widetilde{\beta_{k}}}
=\varphi(Y)-\sum_{k\neq j}\varphi(X_{k})\varphi(\hat{\beta}_{k})
=\varphi(Y-\sum_{k\neq j}X_{k}\hat{\beta}_{k})
=\varphi(r+X_{j}\hat{\beta}_{j}),

where $r=Y-\sum_{k=1}^{p}X_{k}\hat{\beta}_{k}$. Using this representation, the Lasso update becomes

\varphi(r^{(j)}) \leftarrow\varphi(r+X_{j}\hat{\beta}_{j}),
\varphi(\hat{\beta}_{j}) \leftarrow\varphi\left({\cal S}_{\lambda}\left(X_{j}^{\dagger}r^{(j)}/n\right)\right),
\varphi(r) \leftarrow\varphi(r^{(j)}-X_{j}\hat{\beta}_{j}).

Therefore, $\varphi$ being an isomorphism, we can simply omit it from the Lasso updates above and still obtain an equivalent complex-valued update. These are steps 3(a)-(c) of Algorithm 1.
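A minimal numpy sketch of Algorithm 1, assuming the columns of $X$ are scaled to have norm $\sqrt{n}$ (convergence checks and names are ours):

```python
# Sketch of CLASSO (Algorithm 1): coordinate descent with the complex soft threshold (3.12).
import numpy as np

def soft_threshold(z, lam):
    mod = abs(z)
    return 0.0 if mod <= lam else (mod - lam) * z / mod

def classo(X, Y, lam, beta0=None, max_iter=500, tol=1e-8):
    n, p = X.shape
    beta = np.zeros(p, dtype=complex) if beta0 is None else np.asarray(beta0, dtype=complex).copy()
    r = Y - X @ beta                                   # full residual
    for _ in range(max_iter):
        max_change = 0.0
        for j in range(p):
            r_j = r + X[:, j] * beta[j]                # j-th partial residual
            new = soft_threshold(X[:, j].conj() @ r_j / n, lam)
            max_change = max(max_change, abs(new - beta[j]))
            r = r_j - X[:, j] * new
            beta[j] = new
        if max_change < tol:
            break
    return beta
```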

Complex Lasso with covariance update. We can modify Algorithm 1 to take the covariance matrices of $X$ and $Y$ as inputs, which is needed for the graphical Lasso. In step 3(a), we can track $X^{\dagger}r^{(j)}$ instead of $r^{(j)}$, i.e. the update for each $j=1,\ldots,p$ will be

X^{\dagger}r^{(j)} =X^{\dagger}r+X^{\dagger}X_{j}\hat{\beta}_{j},
\hat{\beta}_{j} \leftarrow{\cal S}_{\lambda}\left(\frac{1}{n}X_{j}^{\dagger}r^{(j)}\right),
X^{\dagger}r \leftarrow X^{\dagger}r^{(j)}-X^{\dagger}X_{j}\hat{\beta}_{j},

which is aligned with the fact that the information in $X^{\dagger}Y$ and $X^{\dagger}X$ is sufficient for estimation. The modified algorithm is presented as Algorithm 2. It should also be noted that the objective function of Algorithm 2 differs slightly from that of Algorithm 1 in terms of scaling. The specific routine followed in Algorithm 2 is used in the CGLASSO algorithm we propose in Section 3.3.

Algorithm 2 Complex Lasso with covariance update (CLASSO.COV)
Objective: Calculate $\hat{\beta}=\operatorname*{arg\,min}_{\beta\in\mathbb{C}^{p}}\frac{1}{2}\beta^{\dagger}S_{XX}\beta-\mathrm{Re}(\mathbf{s}_{XY}^{\dagger}\beta)+\lambda\|\beta\|_{1}$
Input: SXXp×pS_{XX}\in\mathbb{C}^{p\times p}, 𝐬XYp\mathbf{s}_{XY}\in\mathbb{C}^{p}, β^(0)\hat{\beta}^{(0)}, and λ>0\lambda>0
1. Initialize: β^=β^(0)\hat{\beta}=\hat{\beta}^{(0)}
2. Calculate 𝐬Xr=𝐬XYSXXβ^\mathbf{s}_{Xr}=\mathbf{s}_{XY}-S_{XX}\hat{\beta}
3. Repeat until convergence for j=1,,pj=1,\cdots,p:
  • (a)

v\leftarrow\mathbf{s}_{Xr}+(S_{XX})_{\cdot j}\hat{\beta}_{j}

  • (b)

    β^j𝒮λ(vj)\hat{\beta}_{j}\leftarrow{\cal S}_{\lambda}(v_{j}) where 𝒮λ{\cal S}_{\lambda} is defined in (3.12)

  • (c)

\mathbf{s}_{Xr}\leftarrow v-(S_{XX})_{\cdot j}\hat{\beta}_{j}

4. Return β^\hat{\beta}
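For reference, here is a numpy sketch of Algorithm 2, reusing the soft_threshold helper from the sketch above. It assumes the inputs are normalized cross products with unit diagonal (e.g. $S_{XX}=X^{\dagger}X/n$ after column scaling and $\mathbf{s}_{XY}=X^{\dagger}Y/n$); the convergence check is ours.

import numpy as np

def classo_cov(S_xx, s_xy, beta0, lam, max_iter=500, tol=1e-8):
    # complex Lasso with covariance update (sketch of CLASSO.COV); assumes diag(S_xx) = 1
    beta = np.asarray(beta0, dtype=complex).copy()
    s_xr = s_xy - S_xx @ beta                  # step 2: current value of the scaled X^H r
    for _ in range(max_iter):
        max_change = 0.0
        for j in range(S_xx.shape[0]):
            v = s_xr + S_xx[:, j] * beta[j]    # step 3.(a)
            new_bj = soft_threshold(v[j], lam) # step 3.(b)
            s_xr = v - S_xx[:, j] * new_bj     # step 3.(c)
            max_change = max(max_change, abs(new_bj - beta[j]))
            beta[j] = new_bj
        if max_change < tol:
            break
    return beta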
Figure 3: Runtime comparison between CLASSO and grplasso.

Computational benefit of orthogonal group structure. We would like to highlight the importance of the orthogonal predictor structure in the above group Lasso formulation. While it is well-known that a Lasso with complex variables can be solved using standard group Lasso optimization routines [Maleki et al., 2013], to the best of our knowledge the orthogonality of predictors within groups has not been observed or utilized in algorithm development in the existing literature. We perform a small numerical experiment to compare the computational efficiency of CLASSO with that of a popular group Lasso R package grplasso that does not use the within-group orthogonality of predictors.

The experiment consists of a $p=50$-dimensional sample of size $n=50$. We randomly generate the real and imaginary parts of the entries of $X$ from a standard normal distribution, and consider $\beta=(\beta_{1},\cdots,\beta_{p})^{\top}$ with $\beta_{k}=1-1\mathrm{i}$ if $k$ is odd and $0$ otherwise. The errors $\{\epsilon_{i}\}_{i=1}^{n}$ are i.i.d. $\mathcal{N}(0,1)$ and the response is $y=X\beta+\epsilon$. The experiment is replicated 20 times and the run times of both CLASSO and grplasso are recorded on a MacBook Pro M1 with an 8-core CPU. The boxplots of the runtimes are shown in Figure 3: the average run times of CLASSO and grplasso are 0.01 and 1.34 seconds respectively, a roughly 134-fold speed-up for CLASSO.
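For reference, the data generation of this experiment can be sketched as follows, using the classo_pass helper above; the penalty value and the number of passes are illustrative choices of ours.

import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 50
X = rng.standard_normal((n, p)) + 1j * rng.standard_normal((n, p))
X = X * np.sqrt(n) / np.linalg.norm(X, axis=0)   # scale columns so that ||X_j||_2^2 = n
beta = np.array([1 - 1j if (k + 1) % 2 == 1 else 0 for k in range(p)], dtype=complex)  # beta_k = 1 - i for odd k
y = X @ beta + rng.standard_normal(n)

beta_hat = np.zeros(p, dtype=complex)
for _ in range(100):                             # a few cyclic CLASSO passes at a fixed penalty
    beta_hat, _ = classo_pass(X, y, beta_hat, lam=0.1)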

Warm start and active set selection. The complex Lasso can be sped up by incorporating warm starts and active set selection [Hastie et al., 2015]. In practice, one needs to find the Lasso solution not for a single fixed $\lambda$ but for a sequence of values of $\lambda$, i.e. a regularization path. Such a path helps identify the best choice of $\lambda$ in terms of a given criterion (e.g. relative mean squared error (RMSE), Bayesian Information Criterion (BIC) or cross-validation error; see Section 5 for details). One way to do so is to start with a sufficiently large value of the tuning parameter ($\lambda\geq\lambda_{0}=\max_{j}|X_{j}^{\dagger}Y|/n$), for which the solution $\hat{\beta}(\lambda_{0})$ is all-zero. For each smaller value $\lambda_{\ell}$ ($\ell\geq 1$), we use the solution $\hat{\beta}(\lambda_{\ell-1})$ at the previous $\lambda_{\ell-1}$ as the initial value, i.e. a warm start, and compute $\hat{\beta}(\lambda_{\ell})$. This speeds up the computation of the Lasso solution over the whole grid of tuning parameters, since a small change in the tuning parameter typically has a limited effect on the solution. This strategy is referred to as pathwise coordinate descent.

An active set $\mathcal{A}_{t}$ is the set of indices of non-zero entries at iteration $t$ of the Lasso algorithm. Given the vector of current residuals $r^{(t)}$, if $|X_{j}^{\dagger}r^{(t)}|/n<\lambda_{\ell}$, then $j$ is excluded from $\mathcal{A}_{t}$, i.e. $\mathcal{A}_{t}:=\mathcal{A}_{t-1}\setminus\{j\}$. The soft-thresholding step 4.(b) of Algorithm 1 is performed only for the indices in $\mathcal{A}_{t}$. The advantage of active set selection is that once an index is declared inactive, it no longer needs to be updated, so as $\hat{\beta}^{(t)}$ becomes sparse after a few iterations the number of soft-thresholding operations per pass goes down significantly. A short sketch combining warm starts and active set screening is given below.
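The sketch below reuses soft_threshold from the earlier sketch and again assumes $\|X_{j}\|_{2}^{2}=n$; the screening rule and convergence check are simplified versions of ours, not the exact rules used in our implementation.

import numpy as np

def classo_path(X, y, lambdas, n_passes=50, tol=1e-6):
    # pathwise complex Lasso: warm starts over a decreasing lambda grid with active-set screening
    n, p = X.shape
    beta = np.zeros(p, dtype=complex)
    path = {}
    for lam in sorted(lambdas, reverse=True):        # start from large lambda (all-zero solution)
        for _ in range(n_passes):
            beta_old = beta.copy()
            r = y - X @ beta
            grad = X.conj().T @ r / n + beta         # equals X_j^H r^{(j)} / n under the column scaling
            inactive = np.abs(grad) < lam            # these coordinates would be soft-thresholded to zero
            beta[inactive] = 0.0
            r = y - X @ beta
            for j in np.where(~inactive)[0]:         # update only the active set
                r_j = r + X[:, j] * beta[j]
                beta[j] = soft_threshold(X[:, j].conj() @ r_j / n, lam)
                r = r_j - X[:, j] * beta[j]
            if np.max(np.abs(beta - beta_old)) < tol:
                break
        path[lam] = beta.copy()                      # warm start carried over to the next lambda
    return path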

3.3 Graphical Lasso with complex variables

We return to the original problem (2.6) with the given matrix $\mathcal{Z}$ of DFT values and the averaged periodogram $\hat{f}(\omega_{j})$ for a fixed Fourier frequency $\omega_{j}\in F_{n}$. For ease of notation, we will write $\hat{\Theta}_{j}$ simply as $\hat{\Theta}$ and drop the frequency index $j$ throughout the algorithm description; this is possible because the algorithm does not use information from estimates at other frequencies. We also denote the averaged periodogram $\hat{f}(\omega_{j})$ by $P$.

Algorithm 3 Complex graphical Lasso (CGLASSO)
Objective: Calculate Θ^=argminΘ++pΘ,f^(ωj)logdetΘ+λΘ1,off\hat{\Theta}=\operatorname*{arg\,min}_{\Theta\in\mathcal{H}_{++}^{p}}\langle\langle\Theta,\hat{f}(\omega_{j})\rangle\rangle-\log\det\Theta+\lambda\|\Theta\|_{\text{1,off}}
Input: P=f^(ωj)P=\hat{f}(\omega_{j}), (p1)×p(0){\cal B}^{(0)}_{(p-1)\times p} and λ>0\lambda>0
1. Initialize: 
  • W=PW=P; diag(W)\text{diag}(W) remains unchanged throughout the algorithm

  • =(0){\cal B}={\cal B}^{(0)}

2. Repeat until convergence for k=1,,p,1,,p,k=1,\cdots,p,1,\cdots,p,\cdots:
  • (a)

    Define

    • W11:=Wk,kW_{11}:=W_{-k,-k}

• $\mathbf{w}_{12}:=W_{-k,k}$

    • $w_{22}:=W_{kk}$

    • $\mathbf{p}_{12}:=P_{-k,k}$

  • (b)

    β^=CLASSO.COV(SXX=W11,𝐬XY=𝐩12,β^(0)=,k,λ)\hat{\beta}=\text{CLASSO.COV}(S_{XX}=W_{11},\ \mathbf{s}_{XY}=\mathbf{p}_{12},\hat{\beta}^{(0)}={\cal B}_{\cdot,k},\lambda)

  • (c)

    Update:

    • ,kβ^{\cal B}_{\cdot,k}\leftarrow\hat{\beta}

    • w12W11β^\textbf{w}_{12}\leftarrow W_{11}\hat{\beta}

3. For k=1,,pk=1,\cdots,p:
• $\hat{\theta}_{22}\leftarrow\dfrac{1}{w_{22}-\mathbf{w}_{12}^{\dagger}\hat{\beta}}$

  • 𝜽^12β^θ^22\hat{\boldsymbol{\theta}}_{12}\leftarrow-\hat{\beta}\cdot\hat{\theta}_{22}

4. Return Θ^\hat{\Theta}
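The following numpy sketch assembles Algorithm 3 from the classo_cov sketch above. The rescaling inside the loop, which gives the inner Lasso unit-diagonal inputs, mirrors the CGLASSO-II scaling discussed in Section 5; the convergence checks and default parameters are ours.

import numpy as np

def cglasso(P, lam, max_iter=100, tol=1e-6):
    # block coordinate descent for the complex graphical Lasso (sketch of Algorithm 3)
    p = P.shape[0]
    W = np.array(P, dtype=complex)                # working estimate of Theta^{-1}; diag(W) stays fixed
    B = np.zeros((p - 1, p), dtype=complex)
    for _ in range(max_iter):
        W_old = W.copy()
        for k in range(p):
            idx = np.delete(np.arange(p), k)
            W11 = W[np.ix_(idx, idx)]
            d = np.sqrt(np.real(np.diag(W11)))    # rescale so the inner Lasso sees unit diagonals
            S_xx = W11 / np.outer(d, d)
            s_xy = P[idx, k] / d
            B[:, k] = classo_cov(S_xx, s_xy, B[:, k] * d, lam) / d
            W[idx, k] = W11 @ B[:, k]             # update w12
            W[k, idx] = W[idx, k].conj()
        if np.max(np.abs(W - W_old)) < tol:
            break
    Theta = np.zeros((p, p), dtype=complex)       # step 3: recover Theta column by column
    for k in range(p):
        idx = np.delete(np.arange(p), k)
        theta_kk = 1.0 / (W[k, k] - W[idx, k].conj() @ B[:, k])
        Theta[k, k] = theta_kk
        Theta[idx, k] = -B[:, k] * theta_kk
        Theta[k, idx] = Theta[idx, k].conj()
    return Theta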

Using Claim B.1, the solution of the optimization problem (2.6) satisfies the equation

PΘ1+λΨ=0,P-\Theta^{-1}+\lambda\Psi=0, (3.14)

where Ψ=((ψkl))k,l=1p\Psi=((\psi_{kl}))_{k,l=1}^{p} has the following entries

ψkl={0if k=l,Θkl|Θkl|if kland Θkl0,{z:|z|1}if kland Θkl=0.\psi_{kl}=\begin{cases}~{}\quad 0&\text{if }k=l,\\ \quad\dfrac{\Theta_{kl}}{|\Theta_{kl}|}&\text{if }k\neq l\ \text{and }\Theta_{kl}\neq 0,\\ \in\{z\in\mathbb{C}:|z|\leq 1\}&\text{if }k\neq l\ \text{and }\Theta_{kl}=0.\end{cases} (3.15)

The system (3.14) can be solved using block coordinate descent [Friedman et al., 2008]. The current working version of $\Theta$ and the matrix $P$ are partitioned as follows

Θ=[Θ11𝜽12𝜽12θ22],P=[P11𝐩12𝐩12p22].\Theta=\begin{bmatrix}\Theta_{11}&\boldsymbol{\theta}_{12}\\ \boldsymbol{\theta}_{12}^{\dagger}&\theta_{22}\end{bmatrix},\quad P=\begin{bmatrix}P_{11}&\mathbf{p}_{12}\\ \mathbf{p}_{12}^{\dagger}&p_{22}\end{bmatrix}.

Let $W$ be the current working version of $\Theta^{-1}$. Following Friedman et al. [2008] and using some tools from the calculus of complex variables, it can be shown that (3.14) reduces to a Lasso problem in which $\mathbf{p}_{12}$ plays the role of $\frac{1}{n}X^{\dagger}Y$ and $W_{11}$, the estimated cross-product matrix in the current model, plays the role of $\frac{1}{n}X^{\dagger}X$. Based on Algorithm 2, we can therefore implement a block coordinate descent approach and obtain a graphical Lasso method in the complex regime; a brief sketch of the reduction is given below.
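To make the reduction explicit, here is a heuristic outline in our own notation, mirroring the real-valued derivation of Friedman et al. [2008]; it is not the formal argument. Reading off the $(1,2)$ block of (3.14) with $W=\Theta^{-1}$ gives $\mathbf{p}_{12}-\mathbf{w}_{12}+\lambda\boldsymbol{\psi}_{12}=0$. From $W\Theta=I$, the $(1,2)$ block yields $W_{11}\boldsymbol{\theta}_{12}+\mathbf{w}_{12}\theta_{22}=0$, i.e. $\mathbf{w}_{12}=W_{11}\beta$ with $\beta=-\boldsymbol{\theta}_{12}/\theta_{22}$, and since $\theta_{22}>0$ the subgradient in (3.15) satisfies $\boldsymbol{\psi}_{12}=-\boldsymbol{\nu}$, where $\nu_{k}=\beta_{k}/|\beta_{k}|$ whenever $\beta_{k}\neq 0$. The block equation therefore becomes

W_{11}\beta-\mathbf{p}_{12}+\lambda\boldsymbol{\nu}=0,

which is the stationarity condition of the complex Lasso objective $\frac{1}{2}\beta^{\dagger}W_{11}\beta-\mathrm{Re}(\mathbf{p}_{12}^{\dagger}\beta)+\lambda\|\beta\|_{1}$ solved by CLASSO.COV in step 2.(b) of Algorithm 3. Once $\beta$ is available, $\theta_{22}=1/(w_{22}-\mathbf{w}_{12}^{\dagger}\beta)$ and $\boldsymbol{\theta}_{12}=-\beta\,\theta_{22}$ recover the corresponding column of $\Theta$, which is step 3 of the algorithm.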

3.3.1 Implementation details

We provide some details on the implementation and convergence of the CGLASSO algorithm. While coordinate descent is fast and simple to implement, it is known to face convergence issues for very small values of the penalty parameter, even in classical Lasso and graphical Lasso problems [Mazumder and Hastie, 2012; Friedman et al., 2007]. We encounter similar issues, but find that the algorithm performs robustly and accurately on the portion of the regularization path where $\lambda$ is neither too small nor too large. We also make some modifications to the base algorithm that help with convergence and speed up performance.

  1. 1.

Scaling: It is a well-known practice to standardize the individual variables before invoking graphical Lasso. We experimented with two different variants, both of which improve upon the basic CGLASSO algorithm. We discuss this in more detail in Section 5 (see equations (5.1) and (5.2)).

  2. 2.

Warm and warmer start: Warm start is a standard technique to speed up convergence of pathwise coordinate descent (see Section 3.2). In the implementation of CGLASSO, we find that a further modification, which we call warmer start, helps more. In simple terms, warmer start is a two-fold warm start for CGLASSO and the inner CLASSO.COV. For a fixed $\omega_{j}$ we find the CGLASSO solution of $\Theta$ for a sequence of $\lambda$ values and construct a regularization path. Starting from a reasonably large value $\lambda=\lambda_{0}$, we obtain the solution $\hat{\Theta}(\lambda_{0})$ with all off-diagonal entries equal to zero. For each smaller value $\lambda_{\ell}\ (\ell\geq 1)$, the solution $\hat{\Theta}(\lambda_{\ell-1})$ at the previous $\lambda_{\ell-1}$ is used as the initial value. In addition, at each iteration $k=1,\ldots,p,1,\ldots,p,\ldots$ of CGLASSO we go through the following steps (b and c of Algorithm 3)

\hat{\beta}=\text{CLASSO.COV}(W_{11},\mathbf{p}_{12},\mathcal{B}^{(0)}_{\cdot,k},\lambda);
    ,k(0)β^;\displaystyle\mathcal{B}^{(0)}_{\cdot,k}\leftarrow\hat{\beta};
    𝐰12W11β^;\displaystyle\mathbf{w}_{12}\leftarrow W_{11}\hat{\beta};

    the above routine is a two-step warm up for the initial solution in both the Lasso and graphical Lasso algorithm.

  3. 3.

    Convergence issues for small λ\lambda: For very small λ\lambda, we may encounter convergence issues. In our simulation studies, we handle this issue by declaring a stopping rule for the regularization path depending on the estimation error since the true Θ\Theta is known. We calculate relative mean square error (RMSE) which, at frequency ωj\omega_{j} and penalty λ\lambda, is defined in (5.3).

Let us assume for now that we are estimating $\Theta$ at a fixed frequency $\omega_{j}$, so that we can simply denote $\text{RMSE}_{\lambda}(\omega_{j})$ by $\text{RMSE}_{\lambda}$. If the initial value of the penalty parameter is $\lambda_{0}$, and $\lambda_{\min}$ is the minimizer of the $\text{RMSE}_{\lambda}$ values along the regularization path, then at the current $\lambda$ we stop the regularization path if the gap $\lambda_{0}-\lambda$ is large enough and

    RMSEλRMSEλmin>0.5×(RMSEλ0RMSEλmin).\text{RMSE}_{\lambda}-\text{RMSE}_{\lambda_{\min}}>0.5\times(\text{RMSE}_{\lambda_{0}}-\text{RMSE}_{\lambda_{\min}}).

Such a stopping criterion performs very well when the regularization path is roughly U-shaped, since the rule detects when the path starts to move away from the minimum.

While our reported results are not affected by these choices, i.e. the min-BIC solutions are achieved at values of $\lambda$ for which the algorithm converges and no hard cut-off is required, we plan to resolve these issues in future work.

4 Theoretical properties

In this section we establish non-asymptotic high-probability upper bounds on the estimation error of $\widehat{f^{-1}}(\omega_{j})$ for a single frequency $\omega_{j}\in F_{n}$. First we control the distance between $\widehat{f^{-1}}(\omega_{j})$ and $f^{-1}(\omega_{j})$ in the $\ell_{\infty}$-norm. Our proof techniques build upon the arguments in Ravikumar et al. [2011] and Sun et al. [2018]. In particular, Ravikumar et al. [2011] used a primal-dual witness technique which constructs a pair $(\widetilde{\Theta}_{j},\widetilde{Z}_{j})$ satisfying the optimality conditions with high probability. Given that the construction is successful, the witness $\widetilde{\Theta}_{j}$ equals the solution $\hat{\Theta}_{j}$ of the optimization problem (2.6). The advantage of this construction is that we can ensure both estimation and variable selection properties of the estimator $\hat{\Theta}_{j}$ with high probability.

In this section, we first state the assumptions and explain the notations of terms that will govern our error bounds. Then we present the estimation error bounds for Gaussian and non-Gaussian linear processes.

Sparsity: The sparsity parameters capture the ambient dimension of the optimization problem. We use Θ=[f(ωj)]1\Theta^{*}=[f(\omega_{j})]^{-1} to denote the true spectral precision matrix at ωj\omega_{j}. The set of off-diagonal entries of Θ\Theta^{*}, used to define the edge set of the graph, is denoted by

E(Θ)={(k,)[p]×[p]:k,Θk0},E(\Theta^{*})=\{(k,\ell)\in[p]\times[p]:k\neq\ell,\ \Theta^{*}_{k\ell}\neq 0\},

and the augmented edge set is denoted as S(Θ)=E(Θ){(1,1),,(p,p)}S(\Theta^{*})=E(\Theta^{*})\cup\{(1,1),\cdots,(p,p)\}. For simplicity, we write S(Θ)S(\Theta^{*}) as SS. We need the following two sparsity parameters in our analysis,

s=|E(Θ)|,d=maxk=1,,p|{{1,,p}:Θk0}|.\displaystyle s=|E(\Theta^{*})|,~{}~{}~{}~{}~{}d=\max_{k=1,\cdots,p}\left|\{\ell\in\{1,\cdots,p\}:\ \Theta_{k\ell}^{*}\neq 0\}\right|.

The parameter ss captures the total number of edges in the graph. The parameter dd is the maximum degree of a node, and equals the maximum number of non-zero entries in any row of Θ\Theta^{*}.

Our analysis will show that consistent estimation is possible when the effective sample size is larger than the ambient dimension captured by the above parameters.

Identifiability: In high-dimensional statistics, some type of identifiability assumption needs to hold in order to carry out meaningful estimation. Typical assumptions used in the literature are restricted eigenvalue, restricted strong convexity or irrepresentability conditions. These conditions inherently impose restrictions ensuring that the support and non-support variables are not too similar. They are also closely related to some form of strong convexity, or steep curvature, of the second derivative of the loss function at the true parameter in certain directions of the high-dimensional space.

We consider the function

g(Θ)={logdetΘif Θ0+otherwiseg(\Theta)=\begin{cases}-\log\det\Theta&\text{if }\Theta\succ 0\\ +\infty&\text{otherwise}\end{cases}

Then the Hessian is $\nabla^{2}g(\Theta^{*})={\Theta^{*}}^{-1}\otimes{\Theta^{*}}^{-1}=\Upsilon^{*}$, where $\otimes$ denotes the matrix Kronecker product, i.e. $\Upsilon^{*}$ is a $p^{2}\times p^{2}$ matrix indexed by vertex pairs/edges whose entry $\Upsilon^{*}_{(k_{1},l_{1}),(k_{2},l_{2})}$ equals $\frac{\partial^{2}g}{\partial\Theta_{k_{1},l_{1}}\partial\Theta_{k_{2},l_{2}}}(\Theta)$ evaluated at $\Theta=\Theta^{*}$. This can be verified from the results on matrix derivatives in Boyd and Vandenberghe [2004]. The following quantity captures the magnitude of this Hessian restricted to the true support set, and will appear in our error bounds for Whittle estimation.

κΘ=ΥSS=[(Θ)1(Θ)1]SS.\kappa_{\Theta^{*}}=\|\Upsilon^{*}_{SS}\|_{\infty}=\left\|[(\Theta^{*})^{-1}\otimes(\Theta^{*})^{-1}]_{SS}\right\|_{\infty}.

We will also need the following irrepresentability assumption, which is common in the literature of GGM [Ravikumar et al., 2011].

Assumption 4.1.

There exists some α(0,1]\alpha\in(0,1] such that

maxeScΥeS(ΥSS)111α.\max_{e\in S^{c}}\|\Upsilon_{eS}^{*}\left(\Upsilon_{SS}^{*}\right)^{-1}\|_{1}\leq 1-\alpha.

Such an assumption limits the influence of the non-edge terms in $S^{c}$ on the edge terms in $S$ [Zhao and Yu, 2006].

Temporal memory: We will need a few parameters to capture the effect of temporal and cross-sectional memory of the process on our error bounds.

Assumption 4.2.
=Γ()max<.\sum_{\ell=-\infty}^{\infty}\|\Gamma(\ell)\|_{\max}<\infty.

We also define the following quantity

|f|=esssupω[π,π]f(ω),{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|f\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}=\operatorname*{ess\,sup}_{\omega\in[-\pi,\pi]}\|f(\omega)\|,

which is a measure of stability of the time series XtX_{t}. Larger values of |f|{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|f\right|\kern-1.07639pt\right|\kern-1.07639pt\right|} are associated with processes having stronger temporal and cross-sectional dependence and less stability. The above assumption ensures the finiteness of |f|{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|f\right|\kern-1.07639pt\right|\kern-1.07639pt\right|} since for any ω[π,π]\omega\in[-\pi,\pi]

f(ω)=12π=Γ()eiω12π=Γ()p2π=Γ()max.\|f(\omega)\|=\frac{1}{2\pi}\left\|\sum_{\ell=-\infty}^{\infty}\Gamma(\ell)e^{-\mathrm{i}\omega\ell}\right\|\leq\frac{1}{2\pi}\sum_{\ell=-\infty}^{\infty}\|\Gamma(\ell)\|\leq\frac{p}{2\pi}\sum_{\ell=-\infty}^{\infty}\|\Gamma(\ell)\|_{\max}.

Assumption 4.2 is used to establish the error bound of Proposition 4.1 below.

We define two model dependent quantities

\Omega_{n}=\max_{1\leq r,s\leq p}\sum_{|\ell|\leq n}|\ell|\,|\Gamma_{r,s}(\ell)|,\quad L_{n}=\max_{1\leq r,s\leq p}\sum_{|\ell|>n}|\Gamma_{r,s}(\ell)|. (4.1)

These two quantities capture the strength of temporal and contemporaneous dependence in the multivariate time series. According to Proposition 3.3 of Sun et al. [2018], they play an important role in controlling the bias of the averaged periodogram, and we use them for the same purpose in our estimation setup.

Finally, we will need the quantity

κf=f(ωj)\kappa_{f}=\|f(\omega_{j})\|_{\infty}

in our error bounds. It captures the size of the spectral density matrix in the matrix $\ell_{\infty}$ norm, and is used to control the curvature of the Hessian of the loss function in CGLASSO.

4.1 Case I: Gaussian time series

Our first result, rephrased from Propositions 3.5 and 3.6 of Sun et al. [2018] in our notation, provides a high probability upper bound on the element-wise difference between the averaged periodogram and the true spectral density. The bound depends on the bias and tail decay of $\hat{f}(\omega_{j})$, and scales with the dimension and the bandwidth as $\sqrt{\log p/m}$, where $m$ is the bandwidth parameter in spectral density estimation.

Proposition 4.1.

Let {Xt}t=1n\{X_{t}\}_{t=1}^{n} be nn consecutive observations from a stable Gaussian centered time series satisfying Assumption 4.1 and 4.2. Consider a single Fourier frequency ωj[π,π]\omega_{j}\in[-\pi,\pi], and assume nΩn|f|2logpn\succsim\Omega_{n}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|f\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{2}\log p. Then for any choice of mm satisfying m|f|2logpm\succsim{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|f\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{2}\log p, mn/Ωnm\precsim n/\Omega_{n}, any R>0R>0, and a choice of threshold

Δf(m,n,p)=|f|Rlogpm+m+12πnΩn+12πLn,\Delta_{f}(m,n,p)={\left|\kern-1.07639pt\left|\kern-1.07639pt\left|f\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\sqrt{\frac{R\log p}{m}}+\frac{m+\frac{1}{2\pi}}{n}\Omega_{n}+\frac{1}{2\pi}L_{n}, (4.2)

there exist universal constants c,c>0c,c^{\prime}>0 such that

(f^(ωj)f(ωj)maxΔf(m,n,p))cp(cR2).\mathbb{P}\left({\left|\kern-1.07639pt\left|\hat{f}(\omega_{j})-f(\omega_{j})\right|\kern-1.07639pt\right|}_{\max}\geq\Delta_{f}(m,n,p)\right)\leq c^{\prime}p^{-(cR-2)}. (4.3)

The proof is given in Appendix B for completeness.

Remark 4.1.

The statement of Proposition 4.1 is non-asymptotic in nature, and the required sample size $n$ and bandwidth parameter $m$ can depend on the stability parameter $|||f|||$. For appropriate choices of $m,n$ and $p$, the bias term can vanish faster than the tail decay for certain data generating processes. For example, according to Proposition 3.4 of Sun et al. [2018], if the autocovariance function of the underlying centered stationary process $X_{t}$ satisfies $\|\Gamma(\ell)\|_{\max}\leq\sigma_{X}\rho_{X}^{|\ell|}$ for every $\ell\in\mathbb{Z}$ and some $\sigma_{X}>0,\rho_{X}\in(0,1)$, then $\Omega_{n}=\mathcal{O}(n\rho_{X}^{n})$ and $L_{n}=\mathcal{O}(\rho_{X}^{n})$. The first component of the bias term is $\frac{m}{n}\Omega_{n}\precsim m\rho_{X}^{n}$. For the choice of bandwidth parameter $m\asymp n^{\alpha}$ for some $\alpha>0$, $n^{3\alpha}\rho_{X}^{2n}\rightarrow 0$ and $L_{n}\rightarrow 0$, which implies $m\rho_{X}^{n}\precsim|||f|||\sqrt{\frac{\log p}{m}}$. Therefore, the bias vanishes faster than the tail decay, and the threshold behaves as $\Delta_{f}(m,n,p)\approx\mathcal{O}\left(|||f|||\sqrt{\frac{\log p}{m}}\right)$. This is similar to the element-wise deviation control $\sqrt{\log p/n}$ for covariance matrices in Ravikumar et al. [2011] and Bickel and Levina [2008].

Our next result uses the element-wise deviation bound as a starting point to control the departure of graphical lasso estimator from the true spectral precision in different norms.

Theorem 4.1.

Let {Xt}t=1n\{X_{t}\}_{t=1}^{n} be nn consecutive observations from a stable Gaussian centered time series satisfying Assumptions 4.1 and 4.2. Consider a single Fourier frequency ωj[π,π]\omega_{j}\in[-\pi,\pi], and the true spectral precision matrix Θ=[f(ωj)]1\Theta^{*}=[f(\omega_{j})]^{-1}. Assume κf,κΘ1\kappa_{f},\ \kappa_{\Theta^{*}}\geq 1 and nΩn|f|2logpn\succsim\Omega_{n}{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|f\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{2}\log p, and denote Cα=1+8αC_{\alpha}=1+\frac{8}{\alpha}. Then for any choice of tuning parameters λ=8αΔf(m,n,p)\lambda=\frac{8}{\alpha}\Delta_{f}(m,n,p), m|f|2logpm\succsim{\left|\kern-1.07639pt\left|\kern-1.07639pt\left|f\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}^{2}\log p, mn/Ωnm\precsim n/\Omega_{n}, Δf(m,n,p)d[6κΘ2κf3Cα]1\Delta_{f}(m,n,p)d\leq[6\kappa_{\Theta^{*}}^{2}\kappa_{f}^{3}C_{\alpha}]^{-1} and any R>0R>0, there exist universal constants c,c>0c,c^{\prime}>0 such that the following statements hold with probability at least 1cp(cR2)1-c^{\prime}p^{-(cR-2)},

  • (a)

    The CGLASSO estimator Θ^\hat{\Theta} in (2.6) satisfies

    Θ^Θmax2κΘCαΔf(m,n,p),{\left|\kern-1.07639pt\left|\hat{\Theta}-\Theta^{*}\right|\kern-1.07639pt\right|}_{\max}\leq 2\kappa_{\Theta^{*}}C_{\alpha}\Delta_{f}(m,n,p),
  • (b)

    The edge set specified by Θ^\hat{\Theta}, i.e. E(Θ^)E(\hat{\Theta}), is a subset of E(Θ)E(\Theta^{*}) and includes all edges (k,)(k,\ell) with |Θk|>2κΘCαΔf(m,n,p)|\Theta^{*}_{k\ell}|>2\kappa_{\Theta^{*}}C_{\alpha}\Delta_{f}(m,n,p),

  • (c)

    The estimator Θ^\hat{\Theta} satisfies

    Θ^ΘF\displaystyle\|\hat{\Theta}-\Theta^{*}\|_{F} 2κΘCαs+pΔf(m,n,p),\displaystyle\leq 2\kappa_{\Theta^{*}}C_{\alpha}\sqrt{s+p}\ \Delta_{f}(m,n,p), (4.4)
    Θ^Θ2\displaystyle\|\hat{\Theta}-\Theta^{*}\|_{2} 2κΘCαmin{s+p,d}Δf(m,n,p).\displaystyle\leq 2\kappa_{\Theta^{*}}C_{\alpha}\min\{\sqrt{s+p},d\}\ \Delta_{f}(m,n,p). (4.5)
Remark 4.2.

The proof is deferred to Appendix B. In the statement of Theorem 4.1, the condition $\Delta_{f}(m,n,p)d\leq[6\kappa_{\Theta^{*}}^{2}\kappa_{f}^{3}C_{\alpha}]^{-1}$ ensures that the tail decay $|||f|||\sqrt{\frac{R\log p}{m}}$ and the bias $\frac{m+\frac{1}{2\pi}}{n}\Omega_{n}+\frac{1}{2\pi}L_{n}$ are separately controlled by $[6\kappa_{\Theta^{*}}^{2}\kappa_{f}^{3}C_{\alpha}d]^{-1}$. Under common mixing conditions, however, the tail decay may asymptotically dominate the bias term. For instance, as shown in Remark 4.1, when the autocovariance function of the underlying centered stationary process $X_{t}$ satisfies $\|\Gamma(\ell)\|_{\max}\leq\sigma_{X}\rho_{X}^{|\ell|}$ for every $\ell\in\mathbb{Z}$ and some $\sigma_{X}>0,\rho_{X}\in(0,1)$, the tail decay dominates the bias term for the choice of bandwidth $m\asymp n^{\alpha}$ with $\alpha\in(0,1)$. If $\kappa_{f},\kappa_{\Theta^{*}},\frac{1}{\alpha}$ and $|||f|||$ are asymptotically constant as functions of $(n,p,d)$, the threshold behaves as $\Delta_{f}(m,n,p)\asymp\sqrt{\frac{\log p}{m}}$, so the condition requires $\sqrt{\frac{\log p}{m}}\precsim\frac{1}{d}$, i.e. $m\succsim d^{2}\log p$. This is similar to the sample size requirement $\Omega(d^{2}\log p)$ of Ravikumar et al. [2011] for precision matrix estimation using the real graphical Lasso. In each iteration of the CGLASSO algorithm, CLASSO, i.e. the complex variant of the Lasso, is implemented, and the Lasso succeeds with high probability with a sample of size $\Omega(d\log p)$ [Wainwright, 2009]. As discussed in Ravikumar et al. [2011], the bandwidth requirement of $\Omega(d^{2}\log p)$ might therefore be an artifact of the theoretical analysis. Consistency of $\hat{\Theta}$ around $\Theta^{*}$ is ensured when $\frac{d^{2}\log p}{m}\rightarrow 0$. Part (b) of Theorem 4.1 also states that entry-wise model selection consistency is achieved when the magnitude of a true entry exceeds a threshold proportional to $\Delta_{f}(m,n,p)$.

4.2 Case II: Linear processes

In this section, we extend the estimation consistency results from Gaussian time series to linear processes with non-Gaussian errors. We focus on the three classes of error distributions given in Section 4 of Sun et al. [2018]. We consider a general linear process with absolutely summable MA($\infty$) coefficients

Xt==0Bεt,X_{t}=\sum_{\ell=0}^{\infty}B_{\ell}\varepsilon_{t-\ell}, (4.6)

where $B_{\ell}\in\mathbb{R}^{p\times p}$, and $\varepsilon_{t}\in\mathbb{R}^{p}$ are i.i.d. centered random vectors whose distribution may have heavier tails than Gaussian. The absolute summability of the MA($\infty$) coefficients ensures stationarity of the process [Rosenblatt, 2012]:

=0|(B)k,k|<,for all k,k=1,,p.\sum_{\ell=0}^{\infty}|(B_{\ell})_{k,k^{\prime}}|<\infty,~{}~{}~{}\mbox{for all }k,k^{\prime}=1,\cdots,p. (4.7)

Under this condition, the autocovariance function $\Gamma(\ell)=\sum_{t=0}^{\infty}B_{t}B_{t+\ell}^{\top}$ is well-defined for every $\ell\in\mathbb{Z}$, and Assumption 4.2 holds (see Lemma C.7 of Sun et al. [2018]). We assume that all the components of $\varepsilon_{t}$, viz. $\varepsilon_{tk},\ 1\leq k\leq p$, come from one of the following three distribution families:

  1. (C1)

    Sub-Gaussian:  There exists σ>0\sigma>0 such that for all η>0\eta>0

    (|εtk|>η)2exp(η22σ2),\mathbb{P}(|\varepsilon_{tk}|>\eta)\leq 2\exp\left(-\frac{\eta^{2}}{2\sigma^{2}}\right),
  2. (C2)

    Generalized sub-exponential family with parameter ν>0\nu>0 :  There exist positive constants aa and bb such that for all η>0\eta>0,

    (|εtk|>ην)aexp(bη),\mathbb{P}(|\varepsilon_{tk}|>\eta^{\nu})\leq a\exp(-b\eta),
  3. (C3)

    Distribution with finite 4th moment:  There exists K>0K>0 such that

    𝔼(εtk4)K<.\mathbb{E}(\varepsilon_{tk}^{4})\leq K<\infty.

With this specification of error distribution, we have the following results on the estimation error of averaged periodogram.

Proposition 4.2.

Suppose $\{X_{t}\}_{t\in\mathbb{Z}}$ is a linear process (4.6) satisfying (4.7), and $\varepsilon_{t}$ comes from one of the distribution families (C1), (C2) or (C3). Consider a given Fourier frequency $\omega_{j}\in F_{n}$. Assume $n\succsim\Omega_{n}\mathscr{N}_{k}$, $k=1,2,3$, where $\mathscr{N}_{1}=|||f|||^{2}\log p$, $\mathscr{N}_{2}=|||f|||^{2}(\log p)^{4+4\nu}$, and $\mathscr{N}_{3}=p^{2}$ for the three families (C1), (C2) and (C3) respectively. Then for $m$ satisfying $m\succsim\mathscr{N}_{k}$ and $m\precsim n/\Omega_{n}$, and the following choice of threshold with an $R>0$,

  1. (C1)

    Δf(m,n,p)=|f|(Rlogp)1/2m+m+12πnΩn+12πLn\Delta_{f}(m,n,p)={\left|\kern-1.07639pt\left|\kern-1.07639pt\left|f\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\frac{(R\log p)^{1/2}}{\sqrt{m}}+\frac{m+\frac{1}{2\pi}}{n}\Omega_{n}+\frac{1}{2\pi}L_{n},

  2. (C2)

    Δf(m,n,p)=|f|(Rlogp)2+2νm+m+12πnΩn+12πLn\Delta_{f}(m,n,p)={\left|\kern-1.07639pt\left|\kern-1.07639pt\left|f\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\frac{(R\log p)^{2+2\nu}}{\sqrt{m}}+\frac{m+\frac{1}{2\pi}}{n}\Omega_{n}+\frac{1}{2\pi}L_{n},

  3. (C3)

    Δf(m,n,p)=|f|p1+Rm+m+12πnΩn+12πLn\Delta_{f}(m,n,p)={\left|\kern-1.07639pt\left|\kern-1.07639pt\left|f\right|\kern-1.07639pt\right|\kern-1.07639pt\right|}\frac{p^{1+R}}{\sqrt{m}}+\frac{m+\frac{1}{2\pi}}{n}\Omega_{n}+\frac{1}{2\pi}L_{n},

the error of averaged periodogram satisfies

(f^(ωj)f(ωj)maxΔf(m,n,p))k,\mathbb{P}\left({\left|\kern-1.07639pt\left|\hat{f}(\omega_{j})-f(\omega_{j})\right|\kern-1.07639pt\right|}_{\max}\geq\Delta_{f}(m,n,p)\right)\leq\mathcal{B}_{k}, (4.8)

where the tail probabilities k\mathcal{B}_{k}, k=1,2,3k=1,2,3, for (C1), (C2) and (C3) are given by

1\displaystyle\mathcal{B}_{1} =cexp{(cR2)logp},\displaystyle=c\exp\left\{-(c^{\prime}R-2)\log p\right\},
2\displaystyle\mathcal{B}_{2} =cexp{(cR2)logp},\displaystyle=c\exp\left\{-(c^{\prime}R-2)\log p\right\},
3\displaystyle\mathcal{B}_{3} =cexp{2Rlogp},\displaystyle=c\exp\{-2R\log p\},

and c,c>0c,c^{\prime}>0 are constants depending only on the distributions of errors εt\varepsilon_{t} but not on the temporal dependence parameters BB_{\ell} of the linear process.

This proposition shows how the sample size requirement changes as the errors in the linear process have heavier tails than Gaussian. For example, under only the existence of a finite $4^{th}$ moment, we need $O(p^{2})$ samples, whereas in the sub-Gaussian setting $O(\log p)$ samples are sufficient for consistency.

Theorem 4.2.

Let Xt,t=1,,nX_{t},\ t=1,\cdots,n be as in Proposition 4.2. Consider a single Fourier frequency ωj[π,π]\omega_{j}\in[-\pi,\pi]. Assume κf,κΘ1\kappa_{f},\kappa_{\Theta^{*}}\geq 1, and nΩn𝒩kn\succsim\Omega_{n}\mathscr{N}_{k} where 𝒩k\mathscr{N}_{k} for k=1,2,3k=1,2,3 are as described in proposition 4.2, and denote Cα=1+8α,α>0C_{\alpha}=1+\frac{8}{\alpha},\ \alpha>0. Then for any choice of tuning parameters λ=8αΔf(m,n,p)\lambda=\frac{8}{\alpha}\Delta_{f}(m,n,p), m𝒩km\succsim\mathscr{N}_{k}, mn/Ωnm\precsim n/\Omega_{n}, Δf(m,n,p)d[6κΘ2κf3Cα]1\Delta_{f}(m,n,p)d\leq[6\kappa_{\Theta^{*}}^{2}\kappa_{f}^{3}C_{\alpha}]^{-1} and any R>0R>0, there exist universal constants c,c>0c,c^{\prime}>0 such that with probability greater than 1cp(cR2)1-c^{\prime}p^{-(cR-2)}, (a),(b) and (c) in Theorem 4.1 hold.

Proofs of Proposition 4.2 and Theorem 4.2 follow from arguments similar to Propositions 4.3 and 4.4 of Sun et al. [2018], where estimation error bounds for spectral density matrices were considered. For general linear processes, the first term of the threshold $\Delta_{f}(m,n,p)$ in Proposition 4.2 is tailored to the underlying process: the threshold for the estimation error decays more slowly from (C1) to (C3) as the tails of the error distribution become heavier, and the required sample size and bandwidth increase accordingly. The proof of Theorem 4.2 proceeds in the same way as that of Theorem 4.1 (given in Appendix B), with the upper bound on the right-hand side of (4.3) replaced by the more general bounds $\mathcal{B}_{k}$ for linear processes given in (4.8) of Proposition 4.2.

5 Numerical experiments

We conduct three sets of experiments to demonstrate the finite sample properties of CGLASSO, and compare its performance with existing alternative methods.

In the graphical model and Lasso literature [Janková and van de Geer, 2018; Carter et al., 2023], using a single tuning parameter $\lambda$ is not ideal because the variables may be on different scales and have different standard deviations. Hence the scaling issue needs to be carefully addressed. We use the following two variants of the CGLASSO algorithm.

CGLASSO-I:   The first variant, called CGLASSO-I, takes the common approach of scaling all the $p$ DFTs to have $\|Z_{k}\|=\sqrt{n}$ before using them as input to CGLASSO. The final output is then rescaled as $D^{-1/2}\widehat{\Theta}D^{-1/2}$, where $D$ is a diagonal matrix containing the $\|Z_{k}\|$, for $k=1,\ldots,p$. Similar to Janková and van de Geer [2018], this is equivalent to solving a graphical Lasso based on the sample coherence matrix $D^{-1/2}\hat{f}(\omega_{j})D^{-1/2}$ and rescaling back. The corresponding optimization problem is

Θ^j=D1/2(argminΘ++pΘ,D1/2f^(ωj)D1/2logdetΘ+λΘ1,off)D1/2.\widehat{\Theta}_{j}=D^{-1/2}\left(\operatorname*{arg\,min}_{\Theta\in\mathcal{H}_{++}^{p}}\langle\langle\Theta,D^{-1/2}\hat{f}(\omega_{j})D^{-1/2}\rangle\rangle-\log\det\Theta+\lambda\|\Theta\|_{\text{1,off}}\right)D^{-1/2}. (5.1)

The above optimization problem is equivalent to a complex graphical Lasso with a weighted penalty, with weights determined by the diagonal matrix $D$ (an analogous fact for the graphical Lasso on a real covariance matrix is stated in Janková and van de Geer [2018]); that is,

Θ^j=argminΘ++pΘ,f^(ωj)logdetΘ+λD1/2ΘD1/21,off.\widehat{\Theta}_{j}=\operatorname*{arg\,min}_{\Theta\in\mathcal{H}_{++}^{p}}\langle\langle\Theta,\hat{f}(\omega_{j})\rangle\rangle-\log\det\Theta+\lambda\|D^{1/2}\Theta D^{1/2}\|_{\text{1,off}}.

CGLASSO-II:   The second variant, called CGLASSO-II, takes a more involved approach: in each invocation of CLASSO.COV within CGLASSO, $D$ is used to scale the inputs of CLASSO.COV. One way to view this operation is that every time a Lasso is run within the inner loop of CGLASSO, the predictors in the Lasso input are rescaled. This is equivalent to the following scheme for each iteration $k=1,\ldots,p$ of the inner Lasso update:

W11sc\displaystyle W_{11}^{sc} D111/2W11D111/2,\displaystyle\leftarrow D_{11}^{-1/2}W_{11}D_{11}^{-1/2},
𝐩12sc\displaystyle\mathbf{p}_{12}^{sc} D111/2𝐩12,\displaystyle\leftarrow D_{11}^{-1/2}\mathbf{p}_{12},
β^sc\displaystyle\hat{\beta}^{sc} =CLASSO.COV(W11sc,𝐩12sc,,k(0),λ),\displaystyle=\text{CLASSO.COV}(W_{11}^{sc},\mathbf{p}_{12}^{sc},\mathcal{B}^{(0)}_{\cdot,k},\lambda),
,k(0)\displaystyle\mathcal{B}^{(0)}_{\cdot,k} β^sc,\displaystyle\leftarrow\hat{\beta}^{sc},
β^\displaystyle\hat{\beta} D111/2β^sc,𝐰12W11β^.\displaystyle\leftarrow D_{11}^{-1/2}\hat{\beta}^{sc},~{}~{}~{}\mathbf{w}_{12}\leftarrow W_{11}\hat{\beta}. (5.2)

Comparison between CGLASSO variants: We compare CGLASSO-I and CGLASSO-II at $\omega=0$. Our comparison is based on the regularization paths (Figures 4a and 4b) of the relative mean squared error (RMSE), defined in (5.3), for two different DGPs.

Experiment 1: We compare CGLASSO with Sparse Inverse Periodogram Estimator (SIPE) of Fiecas et al. [2019] on white noise and VAR processes. Our results show that penalizing real and imaginary parts separately as in SIPE can lead to loss of estimation and model selection accuracy. It is indeed preferable to penalize the real and imaginary parts together using a group penalty, as specified in the original penalized Whittle likelihood formulation (2.6). Our findings qualitatively match with the findings in Maleki et al. [2013] for generic regression with complex variables.

Experiment 2: We report relative mean squared error (RMSE) of CGLASSO at frequency ω=0\omega=0 for three different DGPs (white noise, VAR and VARMA), and different values of nn and pp. We show that CGLASSO substantially outperforms the traditional estimator of inverse averaged periodogram.

Experiment 3: We report model selection performance at $\omega=0$ for the same settings as in Experiment 2 using precision, recall, accuracy and area under the ROC curve (AUROC). We show that CGLASSO outperforms the node-wise regression (NWR) algorithm of Meinshausen and Bühlmann [2006] (see Section 2.3) in terms of AUROC.

Before presenting the results of our experiments, we describe our simulation setup, performance metrics and choice of tuning parameters.

Generative models:  Three DGPs are considered, and for each setup we generate data with $n=200,400$ and $p=10,20,50$. The DGPs are the following:

  1. (DGP1)

    White noise:  We consider Xti.i.d.𝒩(0,Σ)X_{t}\overset{\text{i.i.d.}}{\thicksim}\mathcal{N}(0,\Sigma) where Σ\Sigma is a tri-diagonal symmetric positive definite matrix with diagonal entries 0.7, upper and lower diagonal entries 0.3.

  2. (DGP2)

    VAR(1):  We consider Xt=AXt1+εtX_{t}=AX_{t-1}+\varepsilon_{t} where A=((aij))A=((a_{ij})) is a p×pp\times p coefficient matrix with banded structure i.e. aii=0.5a_{ii}=0.5 for i=1,,pi=1,\cdots,p, ai,i+1=0.3a_{i,i+1}=-0.3 for i=1,,p1i=1,\cdots,p-1 and ai,i+2=0.2a_{i,i+2}=0.2 for i=1,,p2i=1,\cdots,p-2.

  3. (DGP3)

    VARMA(2, 2):  The setting in this case is similar to the simulation setup mentioned in Wilms et al. [2021]. For any kk\in\mathbb{N}, let the matrices 𝐈k\mathbf{I}_{k} and 𝐉k\mathbf{J}_{k} denote the kk-dimensional identity matrix and kk-dimensional matrix with all ones respectively. The model is Xt=A1Xt1+A2Xt2+εt+B1εt1+B2εt2X_{t}=A_{1}X_{t-1}+A_{2}X_{t-2}+\varepsilon_{t}+B_{1}\varepsilon_{t-1}+B_{2}\varepsilon_{t-2}. For coefficients of the autoregressive part, we use the diagonal matrices A1=0.4𝐈p,A2=0.2𝐈pA_{1}=0.4\ \mathbf{I}_{p},\ A_{2}=0.2\ \mathbf{I}_{p} and for coefficients of the moving average part of the model we use block diagonal matrices where the 5×55\times 5 block matrices for B1B_{1} and B2B_{2} respectively are 1.5(𝐈5+𝐉5)1.5(\mathbf{I}_{5}+\mathbf{J}_{5}) and 0.75(𝐈5+𝐉5)0.75(\mathbf{I}_{5}+\mathbf{J}_{5}).

Figure 4: Regularization paths of the relative mean squared error (RMSE) for CGLASSO-I and CGLASSO-II at $\omega_{j}=0$. (a) White noise, $p=10$ and $n=400$; (b) VARMA(2, 2), $p=50$ and $n=200$.

We estimate the spectral precision matrix for all of the above three setups and compare with the true matrix. As mentioned in Section 11.5 of Brockwell and Davis [1991], the true spectral density of the VARMA(p, q) process $X_{t}=A_{1}X_{t-1}+\cdots+A_{p}X_{t-p}+\varepsilon_{t}+B_{1}\varepsilon_{t-1}+\cdots+B_{q}\varepsilon_{t-q}$ with $\varepsilon_{t}\thicksim\mathcal{N}(0,\Sigma_{\varepsilon})$ for all $t$ is given by

f(\omega)=\frac{1}{2\pi}\,\mathcal{A}^{-1}(e^{-\mathrm{i}\omega})\,\mathcal{B}(e^{-\mathrm{i}\omega})\,\Sigma_{\varepsilon}\,\mathcal{B}^{\dagger}(e^{-\mathrm{i}\omega})\,(\mathcal{A}^{-1}(e^{-\mathrm{i}\omega}))^{\dagger},

where $\mathcal{A}(z)=\mathbf{I}_{p}-\sum_{t=1}^{p}A_{t}z^{t}$ and $\mathcal{B}(z)=\mathbf{I}_{p}+\sum_{t=1}^{q}B_{t}z^{t}$. In our case the specific forms of the true spectral densities are obtained by plugging in the appropriate $A_{i}$'s, $B_{i}$'s and $\Sigma_{\varepsilon}$, and the spectral precision is $\Theta(\omega)=[f(\omega)]^{-1}$. One observation is that for the white noise model, $f(\omega)=\Sigma_{\varepsilon}/2\pi$, hence $\Theta(\omega)=2\pi\Sigma_{\varepsilon}^{-1}$ for all $\omega\in[-\pi,\pi]$; therefore the sparsity patterns of the inverse covariance matrix and the spectral precision matrix are identical.
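As a numerical check, the true spectral density and precision can be computed directly from this formula; a short sketch for the VAR(1) design of (DGP2) is given below (the function name and the choice $\Sigma_{\varepsilon}=\mathbf{I}_{p}$ are ours).

import numpy as np

def varma_spectral_density(A_list, B_list, Sigma_eps, omega):
    # true spectral density f(omega) of X_t = sum_i A_i X_{t-i} + eps_t + sum_j B_j eps_{t-j}
    p = Sigma_eps.shape[0]
    z = np.exp(-1j * omega)
    A_poly = np.eye(p, dtype=complex) - sum(Ai * z ** (i + 1) for i, Ai in enumerate(A_list))
    B_poly = np.eye(p, dtype=complex) + sum(Bj * z ** (j + 1) for j, Bj in enumerate(B_list))
    H = np.linalg.inv(A_poly) @ B_poly            # transfer function
    return (H @ Sigma_eps @ H.conj().T) / (2 * np.pi)

p = 10
A = 0.5 * np.eye(p) + np.diag(-0.3 * np.ones(p - 1), 1) + np.diag(0.2 * np.ones(p - 2), 2)
f0 = varma_spectral_density([A], [], np.eye(p), omega=0.0)
Theta0 = np.linalg.inv(f0)                        # true spectral precision at omega = 0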

Figure 5: Relative mean squared error (RMSE) and area under ROC curves of CGLASSO and the sparse inverse periodogram estimator (SIPE) for $p=50$ and $\omega_{j}=0$. Panels: (a) white noise, $n=100$; (b) VAR(1), $n=200$; (c) white noise, $n=100$; (d) VAR(1), $n=200$.

At a fixed frequency, we seek the full regularization path over a range of penalty parameter ($\lambda$) values. Given the estimate $\hat{\Theta}_{\lambda^{*}}$ at a certain $\lambda=\lambda^{*}$, we incorporate a warm start, i.e. we use $\hat{\Theta}_{\lambda^{*}}$ as an initial value when estimating $\hat{\Theta}_{\lambda}$ for $\lambda$ near $\lambda^{*}$. Warm starting lets us obtain the full regularization path by repeatedly using the current estimate as the initial value for the next step.

Performance metric:  In order to compare the estimated output with the true spectral precision matrix, we use a relative mean squared error (RMSE) at frequency ωj\omega_{j} defined as

RMSEλ(ωj)=Θ^λ(ωj)Θ(ωj)F2Θ(ωj)F2.\text{RMSE}_{\lambda}(\omega_{j})=\frac{\|\hat{\Theta}_{\lambda}(\omega_{j})-\Theta(\omega_{j})\|^{2}_{F}}{\|\Theta(\omega_{j})\|^{2}_{F}}. (5.3)

In order to capture how well the graphical lasso is able to recover the sparsity pattern and the non-zero coordinates of true Θ\Theta, we use precision, recall and accuracy measures defined as follows:

precision =\frac{\#\{(k,l):\hat{\Theta}_{kl}\neq 0,\ \Theta_{kl}\neq 0\}}{\#\{(k,l):\hat{\Theta}_{kl}\neq 0\}},
recall =\frac{\#\{(k,l):\hat{\Theta}_{kl}\neq 0,\ \Theta_{kl}\neq 0\}}{\#\{(k,l):\Theta_{kl}\neq 0\}},
accuracy =\frac{\#\{(k,l):\hat{\Theta}_{kl}=0,\ \Theta_{kl}=0\}+\#\{(k,l):\hat{\Theta}_{kl}\neq 0,\ \Theta_{kl}\neq 0\}}{p^{2}}.

In addition, we calculate the area under receiver operating characteristic (ROC) curve. All the experiments are replicated 20 times in order to calculate the mean and standard deviation of each performance metric.
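A small numpy sketch of these performance metrics is given below; thresholding the modulus at a small tolerance to declare an entry non-zero is our choice.

import numpy as np

def rmse(theta_hat, theta_true):
    # relative mean squared error (5.3), squared Frobenius norms
    return (np.linalg.norm(theta_hat - theta_true, 'fro') ** 2
            / np.linalg.norm(theta_true, 'fro') ** 2)

def support_metrics(theta_hat, theta_true, tol=1e-8):
    # precision, recall and accuracy of the recovered sparsity pattern (denominator p^2, as above)
    est = np.abs(theta_hat) > tol
    true = np.abs(theta_true) > tol
    tp = np.sum(est & true)
    precision = tp / max(np.sum(est), 1)
    recall = tp / max(np.sum(true), 1)
    accuracy = (np.sum(~est & ~true) + tp) / theta_true.size
    return precision, recall, accuracy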

Bandwidth and tuning parameter selection:  We set the smoothing span $m=\lfloor\sqrt{n}\rfloor$ based on the theoretical results in Section 4. Such a choice of smoothing span is also supported by Sun et al. [2018] and Fiecas et al. [2019].

We use the extended Bayesian information criterion (EBIC) [Chen and Chen, 2008] for tuning parameter selection. As discussed in Foygel and Drton [2010], the expression of EBIC for a Gaussian graphical model in our case is

EBICγ(Θ^)=2ln(Θ^)+|E|logn+4γ|E|logp,\text{EBIC}_{\gamma}(\hat{\Theta})=-2l_{n}(\hat{\Theta})+|E|\log n+4\gamma|E|\log p,

where $l_{n}(\hat{\Theta})$ is the log-likelihood corresponding to the Gaussian graphical model, $E(\hat{\Theta})$ is the edge set of the candidate graph corresponding to $\hat{\Theta}$ (as defined in Section 4), and $\gamma\in[0,1]$ is a parameter. The choice of $\gamma$ has been studied theoretically in Chen and Chen [2008]. In our simulations, we only report results for $\gamma=0$, in which case EBIC reduces to BIC. In high-dimensional models, the extended BIC with a suitable positive value of $\gamma$ can be used as a tuning parameter selection criterion; additional experiments (not reported here) with $\gamma>0$ show that our results are qualitatively robust to this choice. We choose the tuning parameter $\lambda$ with the minimum value of BIC, that is

λ=argminλ>0BICγ(Θ^λ).\lambda^{*}=\operatorname*{arg\,min}_{\lambda>0}\ \text{BIC}_{\gamma}(\hat{\Theta}_{\lambda}).

In order to find the minimum, we search on a log-linear scale grid of values for λ\lambda.
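A minimal sketch of this selection step is given below. The solver handle fit and the Whittle-type log-likelihood proxy (with effective sample size n_eff, e.g. the $2m+1$ smoothed periodogram ordinates) are assumptions of ours; the EBIC expression itself follows the display above.

import numpy as np

def ebic(theta, f_hat, n, p, n_eff, gamma=0.0):
    # EBIC_gamma(theta_hat) = -2 loglik + |E| log n + 4 gamma |E| log p
    _, logdet = np.linalg.slogdet(theta)
    loglik = n_eff * (logdet - np.real(np.trace(f_hat @ theta)))   # assumed Whittle-type proxy
    n_edges = int(np.sum(np.abs(np.triu(theta, k=1)) > 1e-8))
    return -2 * loglik + n_edges * np.log(n) + 4 * gamma * n_edges * np.log(p)

def select_lambda(f_hat, lambdas, fit, n, n_eff, gamma=0.0):
    # search a log-linear lambda grid (large to small, matching the pathwise strategy) for minimum EBIC
    p = f_hat.shape[0]
    best_lam, best_crit = None, np.inf
    for lam in sorted(lambdas, reverse=True):
        theta = fit(f_hat, lam)            # e.g. the cglasso sketch from Section 3.3
        crit = ebic(theta, f_hat, n, p, n_eff, gamma)
        if crit < best_crit:
            best_lam, best_crit = lam, crit
    return best_lam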

RMSE
Family Inverse periodogram CGLASSO (oracle) CGLASSO (BIC)
White Noise
p=10p=10
    n=200n=200 196.56 (189.68) 13.01 (3.69) 13.74 (3.52)
    n=400n=400 65.88 (28.74) 9.09 (2.32) 10.80 (2.77)
p=20p=20
    n=200n=200 3015.44 (1807.44) 13.11 (2.28) 19.69 (1.45)
    n=400n=400 482.38 (332.41) 9.82 (1.72) 14.53 (1.61)
p=50p=50
    n=200n=200 - 15.87 (1.65) 32.13 (0.92)
    n=400n=400 - 11.34 (0.80) 23.31 (0.78)
VAR(1)
p=10p=10
    n=200n=200 155.00 (84.18) 12.07 (3.04) 13.25 (2.86)
    n=400n=400 70.89 (49.07) 9.27 (2.19) 10.59 (1.88)
p=20p=20
    n=200n=200 3233.31 (3136.76) 11.84 (2.12) 14.14 (1.74)
    n=400n=400 394.83 (220.99) 8.86 (1.51) 10.95 (1.44)
p=50p=50
    n=200n=200 - 14.49 (1.69) 28.70 (2.27)
    n=400n=400 - 12.04 (2.02) 25.17 (2.36)
VARMA(2, 2)
p=10p=10
    n=200n=200 414.04 (277.31) 8.49 (1.86) 9.79 (3.39)
    n=400n=400 109.53 (53.68) 5.57 (1.77) 7.06 (2.04)
p=20p=20
    n=200n=200 13704.13 (16463.75) 10.36 (2.06) 11.46 (4.38)
    n=400n=400 736.20 (322.94) 7.82 (1.46) 8.60 (1.53)
p=50p=50
    n=200n=200 - 11.96 (1.17) 13.21 (1.21)
    n=400n=400 - 9.02 (0.89) 12.20 (0.84)
Table 1: RMSE of the inverted periodogram and the complex graphical Lasso estimator at $\omega_{j}=0$. The CGLASSO (oracle) column reports the minimum RMSE (in %) along the regularization path; the RMSE (in %) of the estimator with minimum BIC along the regularization path is reported in the CGLASSO (BIC) column. Results are averaged over 20 independent replicates, with standard deviations (in %) in parentheses.
NWR CGLASSO CGLASSO
Family AUROC AUROC Precision Recall Accuracy
White Noise
p=10p=10
    n=200n=200 89.41 (6.77) 94.49 (4.24) 44.45 (7.99) 96.67 (6.35) 74.11 (6.74)
    n=400n=400 96.67 (2.98) 97.76 (2.66) 42.98 (7.00) 98.33 (4.07) 72.68 (6.61)
p=20p=20
    n=200n=200 87.64 (5.97) 95.53 (3.22) 39.30 (5.58) 94.74 (5.40) 84.42 (3.45)
    n=400n=400 94.60 (4.00) 98.90 (1.29) 37.78 (4.93) 99.21 (1.93) 83.21 (3.14)
p=50p=50
    n=200n=200 88.58 (2.42) 87.07 (3.01) 55.98 (5.26) 75.41 (5.90) 96.61 (0.48)
    n=400n=400 93.01 (2.07) 97.31 (1.47) 43.28 (4.75) 95.41 (2.80) 94.71 (0.98)
VAR(1)
p=10p=10
    n=200n=200 83.78 (6.09) 90.13 (5.12) 66.56 (6.04) 87.65 (7.61) 78.44 (4.84)
    n=400n=400 90.2 (3.98) 96.07 (2.94) 69.86 (5.54) 95.75 (4.42) 82.47 (4.03)
p=20p=20
    n=200n=200 90.84 (3.12) 95.83 (1.96) 49.44 (3.77) 95.45 (2.86) 79.95 (2.85)
    n=400n=400 95.25 (2.24) 97.89 (1.63) 48.62 (2.85) 97.87 (2.48) 79.34 (2.33)
p=50p=50
    n=200n=200 93.72 (1.86) 98.69 (0.67) 45.43 (3.49) 98.06 (1.09) 90.43 (1.39)
    n=400n=400 97.25 (0.85) 99.72 (0.39) 42.47 (2.52) 99.57 (0.71) 89.23 (1.04)
VARMA(2, 2)
p=10p=10
    n=200n=200 70.56 (7.85) 88.25 (3.83) 79.34 (8.58) 83.00 (6.16) 82.33 (5.37)
    n=400n=400 74.16 (6.37) 89.54 (4.52) 73.67 (7.01) 86.25 (7.23) 79.89 (5.92)
p=20p=20
    n=200n=200 66.11 (5.23) 87.22 (4.34) 64.37 (5.47) 79.50 (8.10) 86.32 (2.51)
    n=400n=400 72.32 (5.00) 90.94 (1.44) 64.557 (4.68) 85.875 (2.84) 86.95 (2.102)
p=50p=50
    n=200n=200 72.65 (2.30) 89.47 (1.98) 48.32 (3.51) 81.85 (3.90) 91.30 (1.00)
    n=400n=400 70.73 (2.56) 92.73 (2.35) 49.94 (4.29) 87.60 (4.44) 91.75 (1.162)
Table 2: AUROC of nodewise regression (NWR) and CGLASSO, together with precision, recall and accuracy (each in %) of the complex graphical Lasso at $\omega_{j}=0$. Results are averaged over 20 replicates, with standard deviations (in %) in parentheses.

Results of comparison between CGLASSO-I and CGLASSO-II: The comparison of relative errors indicates that the minimum RMSE along the regularization path of CGLASSO-II is smaller than that of CGLASSO-I. Empirical observations also suggest that the more temporal and cross-sectional dependence the data generating process involves, the better the relative error of CGLASSO-II. We demonstrate this with two cases: data from a white noise model with $p=10$, $n=400$, and from a VARMA(2, 2) model with $p=50$, $n=200$ (Figures 4a and 4b). Even when the data generating process is fixed, the difference between the minimum RMSEs tends to increase in high dimension-low sample size setups. In a nutshell, the scale-modified algorithm performs better than the vanilla version when the estimation problem becomes harder in terms of the data generating process, the dependence structure and the sample size-to-dimension ratio. Also, the minimum RMSE along the regularization path of CGLASSO-II is attained at larger values of $\lambda$, implying that the model with the smallest error is sparser when scaling is incorporated.

Results:  Next we describe the key findings of our three experiments.

Experiment 1: We compare the results for our proposed estimator with the sparse inverse periodogram estimator (SIPE) of Fiecas et al. [2019]. The latter uses the constrained 1\ell_{1} minimization for inverse matrix estimation (CLIME) of Cai et al. [2011]. For certain combinations of pp and nn for white noise and VAR(1) models, we calculate the relative mean squared error for frequency ωj=0\omega_{j}=0 along the regularization path.

In the white noise model with $p=50$ and $n=100$, we see in Figures 5a and 5b that the minimum relative error of SIPE is almost $38\%$ larger than that of complex graphical Lasso. In addition, the area under the ROC curve for SIPE is approximately $11\%$ less than that of complex graphical Lasso, indicating better model selection performance by the latter method. Additionally, if we run the experiment with a sample of size 200 from a 50-dimensional VAR(1) model, with $A$ being a block diagonal matrix whose $5\times 5$ blocks have diagonal entries 0.5 and upper diagonal entries 0.2, and error distribution $\epsilon_{t}\thicksim\mathcal{N}(0,0.5)$, the minimum RMSE of SIPE is approximately 28% larger than that of complex graphical Lasso. Similar to the white noise model, the area under the ROC curve for SIPE is almost 12% less than that of complex graphical Lasso, as shown in Figures 5c and 5d. One reason for such a significant advantage of graphical Lasso might be that Fiecas et al. [2019] use CLIME twice, estimating the real and imaginary parts of $\Theta$ separately. In particular, their method involves estimating $2p\times 2p=4p^{2}$ entries instead of the $p^{2}$ estimands in graphical Lasso.

For the sparse inverse periodogram estimator, the AUROC values are higher for larger values of the penalty parameter (Figures 5c and 5d); however, the RMSE also increases with the penalty. For complex graphical Lasso, the minimum-RMSE estimator also attains a higher AUROC than that of SIPE, indicating that for our proposed method the estimator with minimum relative error is also able to recover the true sparsity structure well.

Experiment 2: The relative error comparison with the classical estimator, i.e. the inverted periodogram, is reported in Table 1. For $p=50$, the $p\times p$ averaged periodogram is not invertible since in our setup its rank is at most $2m+1=2\lfloor\sqrt{n}\rfloor+1<50$; hence the entries for those cases are left blank. In most simulation settings, our proposed method not only has significantly smaller relative error, but the BIC-selected model also has RMSE values only slightly higher than the minimum. In addition, RMSE decreases if we fix the dimension $p$ and increase $n$, which supports the fact that the relative estimation error should go down as a larger sample becomes available.

Experiment 3: Area under the ROC curve, precision, recall and accuracy are reported in Table 2. In most of the simulation settings, AUROC values are above 80%, with most values lying in the range 90-100%, indicating good recovery of the sparsity and signal structure by our method. In addition, in each setting with the data generative family and the dimension $p$ fixed, increasing the sample size from $n=200$ to $n=400$ results in an increase in AUROC values. The method has lower precision but higher recall, indicating a tendency of CGLASSO to select false positives. This pattern of $\ell_{1}$-penalized estimators is well-known in the literature, and two-stage methods such as the adaptive Lasso [Zou, 2006] have gained popularity to alleviate this problem. Such an extension of CGLASSO is beyond the scope of this paper, and we leave it for future research.

6 Functional connectivity analysis with fMRI data

Figure 6: Yeo functional connectivity heatmaps of the partial coherence and CGLASSO methods for one individual.

We implement CGLASSO for interpretation and visualization of functional connectivity among different brain regions of a human subject. The data set used here is publicly available, high-resolution, preprocessed magnetic resonance imaging data from the Human Connectome Project (HCP)—Young Adult S1200 release [Van Essen et al., 2013; Dhamala et al., 2020]. We consider the time series data of 1003 healthy adults observed over the same time horizon across four complete resting-state fMRI runs, each scan with $n=1200$ time points.

Preprocessing: MRIs were acquired on a Siemens Skyra 3 T scanner at Washington University in St. Louis. Each subject underwent four gradient-echo EPI resting-state functional MRIs (rsfMRI), with TR = 720 ms, TE = 33.1 ms, 2.0 mm isotropic voxels, FoV = 208 × 180 mm2, flip angle = 52, matrix = 104 × 90, 72 slices. Two 15 min rsfMRI runs were acquired at each of the two sessions. In the end, the data consisted of 1200 time points for each run, for a total of 4800 time points for each subject over four scans. Each run of each subject’s rfMRI was preprocessed by the HCP consortium [Smith et al., 2013]. The data were minimally preprocessed [Glasser et al., 2013] and had artefacts removed using ICA + FIX [Griffanti et al., 2014; Salimi-Khorshidi et al., 2014]. Gray matter was parcellated into 86 cortical, subcortical and cerebellar regions using FreeSurfer [Fischl et al., 1999] and regional time series were calculated as the mean of the time series in each voxel of a given region. Each region was further assigned to a Yeo functional network delineating visual, somatomotor, dorsal attention, ventral attention, limbic, default mode, and fronto-parietal networks [Yeo et al., 2011]; we added a subcortical and cerebellar network for whole brain coverage as in previous work [Dhamala et al., 2020].

We estimate the spectral precision at frequencies $\omega=0,\pi/2,\text{ and }\pi$ using CGLASSO for the individual with subject ID 100206. The smoothing span is set to $m=\lceil 4\sqrt{n}\rceil$ and we select the penalty parameter that produces the lowest value of BIC.

Results: In Figure 6 we show an example of the FC coherence network of one individual in our sample. Our proposed CGLASSO algorithm appears successful in capturing the underlying sparsity pattern. The estimated inverse coherence matrix retains known physiological connections, e.g. connections between bilateral homologues (the same regions in the right and left hemispheres), which are known to be strongly functionally connected [Zuo et al., 2010], while suppressing many of the weaker functional connections that are likely due to noise. Particularly strong are the bilateral connections between motor and visual areas, which are known to be strongly connected. Similar findings are reported in Sun et al. [2018], where the authors used an adaptive Lasso thresholding method.

7 Conclusion

In this paper we proposed and analyzed a frequency-domain variant of graphical Lasso, an 1\ell_{1}-penalized local Whittle estimator for estimating spectral precision matrices of multivariate stationary time series. We developed fast pathwise coordinate descent algorithms for Lasso and graphical Lasso involving complex variables. We conducted a non-asymptotic analysis of our proposed method and showed that consistent estimation is possible in both low- and high-dimensional regimes as long as the true spectral precision matrix is sufficiently sparse.

We did not delve into a number of natural questions related to our method. First, extending the algorithm from a single frequency to all the frequencies is required for recovering the true graphical model for stationary time series. Such an extension will need additional work on both algorithmic and analytical fronts. A faster pathwise algorithm for jointly optimizing over different frequencies and different tuning parameters will be helpful. Extending the theoretical guarantees from a single frequency to all frequencies will also be required.

Another important question is the selection of the bandwidth/smoothing parameter when calculating the spectral density matrix. For theoretical simplicity, we use m=\lceil\sqrt{n}\rceil following the convention in the literature on large-scale spectral density matrix estimation [Böhm and von Sachs, 2009]. However, a data-driven choice of m as proposed in Ombao et al. [2001] would be more appropriate in practice. One can also consider allowing different bandwidth choices for different off-diagonal entries. We leave these directions for future research.

Finally, a number of recent works have investigated the inferential properties of frequency-domain estimators [Chang et al., 2022; Krampe and Paparoditis, 2022] and have studied nonstationary time series in high dimensions [Zhang and Wu, 2021; Basu and Subba Rao, 2023]. Extending CGLASSO and CLASSO to these broader frameworks is also left as an avenue for future research.

Acknowledgements

ND, AK and SB acknowledge partial support from NIH award R21NS120227. In addition, SB acknowledges partial support from NSF awards DMS-1812128, DMS-2210675, DMS-2239102 and NIH award R01GM135926.

References

  • Baek et al. [2021a] C. Baek, M.-C. Düker, and V. Pipiras. Thresholding and graphical local Whittle estimation. arXiv preprint arXiv:2105.13342, 2021a.
  • Baek et al. [2021b] C. Baek, M. Gampe, B. Leinwand, K. Lindquist, J. Hopfinger, K. M. Gates, and V. Pipiras. Detecting functional connectivity changes in fMRI data. Brain Connectivity, 2021b.
  • Basu and Subba Rao [2023] S. Basu and S. Subba Rao. Graphical models for nonstationary time series. The Annals of Statistics, 51(4):1453–1483, 2023.
  • Bickel and Levina [2008] P. J. Bickel and E. Levina. Covariance regularization by thresholding. The Annals of Statistics, 36(6):2577 – 2604, 2008. doi: 10.1214/08-AOS600. URL https://doi.org/10.1214/08-AOS600.
  • Böhm and von Sachs [2009] H. Böhm and R. von Sachs. Shrinkage estimation in the frequency domain of multivariate time series. Journal of Multivariate Analysis, 100(5):913–935, 2009.
  • Bowyer [2016] S. M. Bowyer. Coherence a measure of the brain networks: past and present. Neuropsychiatric Electrophysiology, 2(1):1–12, 2016.
  • Boyd and Vandenberghe [2004] S. P. Boyd and L. Vandenberghe. Convex optimization. Cambridge university press, 2004.
  • Brockwell and Davis [1991] P. J. Brockwell and R. A. Davis. Time series: theory and methods. Springer science & business media, 1991.
  • Cai et al. [2011] T. Cai, W. Liu, and X. Luo. A constrained 1\ell_{1} minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494):594–607, 2011.
  • Carter et al. [2023] J. S. Carter, D. Rossell, and J. Q. Smith. Partial correlation graphical lasso. Scandinavian Journal of Statistics, 2023.
  • Chang et al. [2022] J. Chang, Q. Jiang, T. S. McElroy, and X. Shao. Statistical inference for high-dimensional spectral density matrix. arXiv preprint arXiv:2212.13686, 2022.
  • Chen and Chen [2008] J. Chen and Z. Chen. Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3):759–771, 2008.
  • Dahlhaus [2000] R. Dahlhaus. Graphical interaction models for multivariate time series. Metrika, 51(2):157–172, 2000.
  • Dallakyan et al. [2022] A. Dallakyan, R. Kim, and M. Pourahmadi. Time series graphical lasso and sparse VAR estimation. Computational Statistics & Data Analysis, 176:107557, 2022.
  • Davis et al. [2016] R. A. Davis, P. Zang, and T. Zheng. Sparse vector autoregressive modeling. Journal of Computational and Graphical Statistics, 25(4):1077–1096, 2016.
  • Dhamala et al. [2020] E. Dhamala, K. W. Jamison, M. R. Sabuncu, and A. Kuceyeski. Sex classification using long-range temporal dependence of resting-state functional MRI time series. Human Brain Mapping, 41(13):3567–3579, 2020.
  • Dudley et al. [2011] R. M. Dudley, R. Norvaiša, and R. Norvaiša. Concrete functional calculus. Springer, 2011.
  • Dummit and Foote [2004] D. S. Dummit and R. M. Foote. Abstract algebra, volume 3. Wiley Hoboken, 2004.
  • Fiecas and von Sachs [2014] M. Fiecas and R. von Sachs. Data-driven shrinkage of the spectral density matrix of a high-dimensional time series. Electronic Journal of Statistics, 8(2):2975–3003, 2014.
  • Fiecas et al. [2010] M. Fiecas, H. Ombao, C. Linkletter, W. Thompson, and J. Sanes. Functional connectivity: Shrinkage estimation and randomization test. NeuroImage, 49(4):3005–3014, 2010.
  • Fiecas et al. [2019] M. Fiecas, C. Leng, W. Liu, and Y. Yu. Spectral analysis of high-dimensional time series. Electronic Journal of Statistics, 13(2):4079–4101, 2019.
  • Fischl et al. [1999] B. Fischl, M. I. Sereno, and A. M. Dale. Cortical surface-based analysis: II. Inflation, flattening, and a surface-based coordinate system. NeuroImage, 9(2):195–207, 1999. ISSN 10538119. doi: 10.1006/nimg.1998.0396.
  • Foygel and Drton [2010] R. Foygel and M. Drton. Extended Bayesian information criteria for Gaussian graphical models. Advances in neural information processing systems, 23, 2010.
  • Friedman et al. [2007] J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani. Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302 – 332, 2007. doi: 10.1214/07-AOAS131. URL https://doi.org/10.1214/07-AOAS131.
  • Friedman et al. [2008] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
  • Glasser et al. [2013] M. F. Glasser, S. N. Sotiropoulos, J. A. Wilson, T. S. Coalson, B. Fischl, J. L. Andersson, J. Xu, S. Jbabdi, M. Webster, J. R. Polimeni, et al. The minimal preprocessing pipelines for the Human Connectome Project. Neuroimage, 80:105–124, 2013.
  • Granger [1969] C. W. Granger. Investigating causal relations by econometric models and cross-spectral methods. Econometrica: journal of the Econometric Society, pages 424–438, 1969.
  • Griffanti et al. [2014] L. Griffanti, G. Salimi-Khorshidi, C. F. Beckmann, E. J. Auerbach, G. Douaud, C. E. Sexton, E. Zsoldos, K. P. Ebmeier, N. Filippini, C. E. Mackay, et al. ICA-based artefact removal and accelerated fMRI acquisition for improved resting state network imaging. Neuroimage, 95:232–247, 2014.
  • Haase [2018] M. Haase. Lectures on functional calculus. In 21st International Internet Seminar, Kiel Univ, 2018.
  • Hastie et al. [2015] T. Hastie, R. Tibshirani, and M. Wainwright. Statistical learning with sparsity. Monographs on statistics and applied probability, 143:143, 2015.
  • Herstein [1991] I. N. Herstein. Topics in algebra. John Wiley & Sons, 1991.
  • Horn and Johnson [2012] R. A. Horn and C. R. Johnson. Matrix analysis. Cambridge university press, 2012.
  • Janková and van de Geer [2018] J. Janková and S. van de Geer. Inference in high-dimensional graphical models. In Handbook of graphical models, pages 325–350. CRC Press, 2018.
  • Jung et al. [2015] A. Jung, G. Hannak, and N. Goertz. Graphical lasso based model selection for time series. IEEE Signal Processing Letters, 22(10):1781–1785, 2015.
  • Kellogg et al. [1976] R. B. Kellogg, T.-Y. Li, and J. Yorke. A constructive proof of the Brouwer fixed-point theorem and computational results. SIAM Journal on Numerical Analysis, 13(4):473–483, 1976.
  • Krampe and Paparoditis [2022] J. Krampe and E. Paparoditis. Frequency domain statistical inference for high-dimensional time series. arXiv preprint arXiv:2206.02250, 2022.
  • Loh and Wainwright [2017] P.-L. Loh and M. J. Wainwright. Support recovery without incoherence: A case for nonconvex regularization. The Annals of Statistics, 45(6):2455 – 2482, 2017. doi: 10.1214/16-AOS1530. URL https://doi.org/10.1214/16-AOS1530.
  • Maleki et al. [2013] A. Maleki, L. Anitori, Z. Yang, and R. G. Baraniuk. Asymptotic analysis of complex lasso via complex approximate message passing (CAMP). IEEE Transactions on Information Theory, 59(7):4290–4308, 2013.
  • Mazumder and Hastie [2012] R. Mazumder and T. Hastie. The graphical lasso: New insights and alternatives. Electronic journal of statistics, 6:2125, 2012.
  • Meier [2020] L. Meier. grplasso: Fitting User-Specified Models with Group Lasso Penalty, 2020. URL https://CRAN.R-project.org/package=grplasso. R package version 0.4-7.
  • Meinshausen and Bühlmann [2006] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. The annals of statistics, 34(3):1436–1462, 2006.
  • Ombao and Pinto [2022] H. Ombao and M. Pinto. Spectral dependence. Econometrics and Statistics, 2022.
  • Ombao et al. [2001] H. C. Ombao, J. A. Raz, R. L. Strawderman, and R. Von Sachs. A simple generalised crossvalidation method of span selection for periodogram smoothing. Biometrika, 88(4):1186–1192, 2001.
  • Ravikumar et al. [2011] P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation by minimizing 1\ell_{1}-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980, 2011.
  • Rosenblatt [2012] M. Rosenblatt. Stationary sequences and random fields. Springer Science & Business Media, 2012.
  • Salimi-Khorshidi et al. [2014] G. Salimi-Khorshidi, G. Douaud, C. F. Beckmann, M. F. Glasser, L. Griffanti, and S. M. Smith. Automatic denoising of functional MRI data: combining independent component analysis and hierarchical fusion of classifiers. Neuroimage, 90:449–468, 2014.
  • Schneider-Luftman and Walden [2016] D. Schneider-Luftman and A. T. Walden. Partial coherence estimation via spectral matrix shrinkage under quadratic loss. IEEE Transactions on Signal Processing, 64(22):5767–5777, 2016.
  • Shumway and Stoffer [2000] R. H. Shumway and D. S. Stoffer. Time series analysis and its applications, volume 3. Springer, 2000.
  • Smith et al. [2013] S. M. Smith, C. F. Beckmann, J. Andersson, E. J. Auerbach, J. Bijsterbosch, G. Douaud, E. Duff, D. A. Feinberg, L. Griffanti, M. P. Harms, et al. Resting-state fMRI in the Human Connectome Project. Neuroimage, 80:144–168, 2013.
  • Sun et al. [2018] Y. Sun, Y. Li, A. Kuceyeski, and S. Basu. Large spectral density matrix estimation by thresholding. arXiv preprint arXiv:1812.00532, 2018.
  • Tseng [2001] P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of optimization theory and applications, 109(3):475, 2001.
  • Tugnait [2022] J. K. Tugnait. On sparse high-dimensional graphical model learning for dependent time series. Signal Processing, 197:108539, 2022.
  • Van Essen et al. [2013] D. C. Van Essen, S. M. Smith, D. M. Barch, T. E. Behrens, E. Yacoub, K. Ugurbil, W.-M. H. Consortium, et al. The WU-Minn Human Connectome Project: an overview. Neuroimage, 80:62–79, 2013.
  • Wainwright [2009] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using 1\ell_{1}-constrained quadratic programming (lasso). IEEE transactions on information theory, 55(5):2183–2202, 2009.
  • Weylandt Jr [2020] R. M. Weylandt Jr. Computational and Statistical Methodology for Highly Structured Data. PhD thesis, Rice University, 2020.
  • Whittle [1951] P. Whittle. Hypothesis testing in time series analysis. PhD thesis, Uppsala University, 1951.
  • Wilms et al. [2021] I. Wilms, S. Basu, J. Bien, and D. S. Matteson. Sparse identification and estimation of large-scale vector autoregressive moving averages. Journal of the American Statistical Association, pages 1–12, 2021.
  • Wodeyar and Srinivasan [2022] A. Wodeyar and R. Srinivasan. Structural connectome constrained graphical lasso for MEG partial coherence. Network Neuroscience, 6(4):1219–1242, 2022.
  • Yeo et al. [2011] B. T. T. Yeo, F. M. Krienen, J. Sepulcre, M. R. Sabuncu, D. Lashkari, M. Hollinshead, J. L. Roffman, J. W. Smoller, L. Zöllei, J. R. Polimeni, B. Fischl, H. Liu, and R. L. Buckner. The organization of the human cerebral cortex estimated by intrinsic functional connectivity. Journal of neurophysiology, 106(3):1125–65, 9 2011. ISSN 1522-1598. doi: 10.1152/jn.00338.2011.
  • Yuan and Lin [2006] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.
  • Zhang and Wu [2021] D. Zhang and W. B. Wu. Convergence of covariance and spectral density estimates for high-dimensional locally stationary processes. The Annals of Statistics, 49(1):233 – 254, 2021. doi: 10.1214/20-AOS1954. URL https://doi.org/10.1214/20-AOS1954.
  • Zhao and Yu [2006] P. Zhao and B. Yu. On model selection consistency of lasso. The Journal of Machine Learning Research, 7:2541–2563, 2006.
  • Zou [2006] H. Zou. The adaptive lasso and its oracle properties. Journal of the American statistical association, 101(476):1418–1429, 2006.
  • Zuo et al. [2010] X.-N. Zuo, C. Kelly, A. Di Martino, M. Mennes, D. S. Margulies, S. Bangaru, R. Grzadzinski, A. C. Evans, Y.-F. Zang, F. X. Castellanos, et al. Growing together and growing apart: regional and sex differences in the lifespan developmental trajectories of functional homotopy. Journal of Neuroscience, 30(45):15034–15043, 2010.

Appendix A A short introduction to groups, rings and fields

This supplement is intended for readers new to the abstract algebraic objects used in this paper. In this section, we provide some basic axiomatic definitions and relevant examples of the algebraic structures required to understand the methodological tools in this paper. Readers will find a deeper treatment of this topic in Herstein [1991] and Dummit and Foote [2004].

Definition A.1 (Group).

A nonempty set of elements GG is said to form a group if in GG there is defined a binary operation, called the product and denoted by \cdot, such that

  1. a,b\in G\ \implies\ a\cdot b\in G (closed),

  2. a,b,c\in G\ \implies\ a\cdot(b\cdot c)=(a\cdot b)\cdot c (associative law),

  3. \exists\ e\in G such that a\cdot e=e\cdot a=a for all a\in G (existence of identity),

  4. for every a\in G, \exists\ a^{-1}\in G such that a\cdot a^{-1}=a^{-1}\cdot a=e (existence of inverse).

Example. 1. The set of all integers \mathbb{Z} under the additive operation +; 2. the set of all complex numbers \mathbb{C} under complex addition +; 3. the set of 2\times 2 matrices defined in (3.1) under matrix addition +; etc.

In this paper we deal with Abelian groups of infinite order, that is, groups with infinitely many elements and where the group operation \cdot is commutative, i.e. ab=baGa\cdot b=b\cdot a\in G for all a,bGa,b\in G . Details can be found in Herstein [1991].

Definition A.2 (Ring).

A nonempty set R is said to be an associative ring if in R there are defined two operations + and \cdot, such that for all a,b,c\in R,

  1. a+b\in R,

  2. a+b=b+a,

  3. a+(b+c)=(a+b)+c,

  4. there is an element 0\in R such that a+0=0+a=a for all a\in R (additive identity),

  5. there is an element -a\in R such that a+(-a)=0 (additive inverse),

  6. a\cdot b\in R,

  7. a\cdot(b\cdot c)=(a\cdot b)\cdot c,

  8. a\cdot(b+c)=a\cdot b+a\cdot c and (b+c)\cdot a=b\cdot a+c\cdot a (distributive laws).

It follows from axioms 1–5 that R is an Abelian group under the operation + (addition). Similarly, it follows from axioms 6–7 that R is closed under the associative operation \cdot (multiplication). In the usual notation, a\cdot b is simply written as ab. The two operations are bridged by axiom 8.

Example.

  1. The set of real numbers \mathbb{R} is a commutative ring under the operations + (addition) and \cdot (multiplication). The additive unit is 0, and \mathbb{R} also has a multiplicative unit denoted by 1.

  2. Similarly, \mathbb{C} is a ring under + and \cdot.

  3. The set of all matrices characterized by (3.1) is a ring under matrix addition and multiplication.

Definition A.3 (Ring homomorphism).

A map ϕ\phi from the ring RR into the ring RR^{\prime} is said to be a homomorphism if

  1. \phi(a+b)=\phi(a)+\phi(b),

  2. \phi(ab)=\phi(a)\phi(b).

Note. The operations + and \cdot on the left-hand sides are those of the ring R, while the operations + and \cdot on the right-hand sides are those of R^{\prime}.

Example.

  1. Let R=R^{\prime} be the set of all real numbers \mathbb{R}, both equipped with the ring operations of addition + and multiplication \cdot. Then \phi:R\rightarrow R^{\prime}, defined by \phi(x)=x for every x\in R, is a ring isomorphism.

  2. The complex conjugation map \phi:\mathbb{C}\rightarrow\mathbb{C}, defined as \phi(z)=\overline{z} for all z\in\mathbb{C}, is a ring homomorphism.

Definition A.4 (Ring isomorphism).

A homomorphism of RR into RR^{\prime} is said to be an isomorphism if it is a one-to-one mapping. Two rings are said to be isomorphic if there exists an isomorphism of one onto the other.

Example. Both examples of ring homomorphisms above are one-to-one maps, and hence they also serve as examples of ring isomorphisms.

Definition A.5 (Field).

A ring is said to be a division ring if its nonzero elements form a group under multiplication. A field is a commutative division ring.

Example.

  1. Some standard, widely used fields in mathematics are the rational numbers \mathbb{Q}, the real numbers \mathbb{R}, the complex numbers \mathbb{C}, and the finite fields \mathbb{Z}_{p} for a prime integer p.

  2. The set of matrices characterized by (3.1) is a field isomorphic to the field of complex numbers \mathbb{C} (Proposition 3.1).

Note. A field is always a ring. Two fields F and F^{\prime} are said to be isomorphic if they are isomorphic as rings. We use the terms ring isomorphism and field isomorphism separately in order to keep the underlying isomorphic objects specified as rings or fields, respectively. A field isomorphism is always a ring isomorphism, but the converse may not be true.

Appendix B Proofs of technical results

Proof of Proposition 3.1

We first show that the map φ:2×2\varphi:\mathbb{C}\rightarrow\mathcal{M}^{2\times 2} is a field isomorphism. Clearly \mathbb{C} is a field. Also, \mathcal{M} is a field with the following additive identity (zero) and multiplicative identity (unit)

𝟎=[0000],𝟏=[1001]\mathbf{0}_{\mathcal{M}}=\begin{bmatrix}0&0\\ 0&0\end{bmatrix},~{}~{}~{}\mathbf{1}_{\mathcal{M}}=\begin{bmatrix}1&0\\ 0&1\end{bmatrix}

and for any A=\begin{bmatrix}a&-b\\ b&a\end{bmatrix}, where a and b are not both zero, the inverse element of the field is

A1=1a2+b2[abba]A^{-1}=\frac{1}{a^{2}+b^{2}}\begin{bmatrix}a&b\\ -b&a\end{bmatrix}\in\mathcal{M}

Next we consider the map φ\varphi. For any z,zz,z^{\prime}\in\mathbb{C}, Re(z+z)=Re(z)+Re(z)\textbf{Re}(z+z^{\prime})=\textbf{Re}(z)+\textbf{Re}(z^{\prime}), Im(z+z)=Im(z)+Im(z)\textbf{Im}(z+z^{\prime})=\textbf{Im}(z)+\textbf{Im}(z^{\prime}). Hence,

φ(z+z)=[Re(z+z)Im(z+z)Im(z+z)Re(z+z)]=φ(z)+φ(z)\varphi(z+z^{\prime})=\begin{bmatrix}\textbf{Re}(z+z^{\prime})&-\textbf{Im}(z+z^{\prime})\\ \textbf{Im}(z+z^{\prime})&\textbf{Re}(z+z^{\prime})\end{bmatrix}=\varphi(z)+\varphi(z^{\prime})

Let z=x+iyz=x+\mathrm{i}y and z=x+iyz^{\prime}=x^{\prime}+\mathrm{i}y^{\prime}. Therefore,

φ(zz)\displaystyle\varphi(zz^{\prime}) =φ((x+iy)(x+iy))\displaystyle=\varphi((x+\mathrm{i}y)(x^{\prime}+\mathrm{i}y^{\prime}))
=φ((xxyy)+i(xy+xy))\displaystyle=\varphi((xx^{\prime}-yy^{\prime})+\mathrm{i}(xy^{\prime}+x^{\prime}y))
=[xxyyxyxyxy+xyxxyy]\displaystyle=\begin{bmatrix}xx^{\prime}-yy^{\prime}&-xy^{\prime}-x^{\prime}y\\ xy^{\prime}+x^{\prime}y&xx^{\prime}-yy^{\prime}\end{bmatrix}
=[xyyx][xyyx]\displaystyle=\begin{bmatrix}x&-y\\ y&x\end{bmatrix}\ \begin{bmatrix}x^{\prime}&-y^{\prime}\\ y^{\prime}&x^{\prime}\end{bmatrix}
=φ(z)φ(z)\displaystyle=\varphi(z)\varphi(z^{\prime})

Hence, the map \varphi is a ring homomorphism, and since it is clearly a bijection between \mathbb{C} and \mathcal{M}, it is an isomorphism. Also note that \varphi(0)=\mathbf{0}_{\mathcal{M}} and \varphi(1)=\mathbf{1}_{\mathcal{M}}. In addition, for any z\neq 0, \varphi(z^{-1})=[\varphi(z)]^{-1}. Hence, \varphi is a field isomorphism between \mathbb{C} and \mathcal{M}.

For any z=x+iyz=x+\mathrm{i}y\in\mathbb{C}, φ1(z)=(x,y)\varphi_{1}(z)=(x,y)^{\top} and φ2(z)=(y,x)\varphi_{2}(z)=(-y,x)^{\top}. Hence,

φ1(z),φ2(z)=x(y)+yx=0\langle\varphi_{1}(z),\varphi_{2}(z)\rangle=x(-y)+yx=0

Also it is clear that

φ1(z)2=φ2(z)2=x2+y2=|z|2\|\varphi_{1}(z)\|^{2}=\|\varphi_{2}(z)\|^{2}=x^{2}+y^{2}=|z|^{2}
φ(z)=φ(xiy)=[xyyx]=φ(z)\varphi(z^{\dagger})=\varphi(x-\mathrm{i}y)=\begin{bmatrix}x&y\\ -y&x\end{bmatrix}=\varphi(z)^{\top}

Hence properties (i)–(iii) are shown. \hfill\square
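The homomorphism identities and properties (i)–(iii) established above can be illustrated numerically. The following is a small sketch assuming NumPy (not part of the paper's code); `phi` implements the map z \mapsto [[Re z, -Im z], [Im z, Re z]] from the proof.

```python
import numpy as np

def phi(z):
    """Real 2x2 representation of a complex number z."""
    return np.array([[z.real, -z.imag],
                     [z.imag,  z.real]])

rng = np.random.default_rng(0)
z = complex(*rng.normal(size=2))
w = complex(*rng.normal(size=2))

assert np.allclose(phi(z + w), phi(z) + phi(w))         # additivity
assert np.allclose(phi(z * w), phi(z) @ phi(w))         # multiplicativity
assert np.allclose(phi(1 / z), np.linalg.inv(phi(z)))   # inverses map to inverses
assert np.allclose(phi(z.conjugate()), phi(z).T)        # conjugation corresponds to transpose
# the two columns phi_1(z), phi_2(z) are orthogonal and have squared norm |z|^2
assert np.isclose(phi(z)[:, 0] @ phi(z)[:, 1], 0.0)
assert np.isclose(np.sum(phi(z)[:, 0] ** 2), abs(z) ** 2)
```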

Proof of Proposition 3.3

  • (a)

    It can be shown that both \mathbb{C}^{n} and \mathbb{R}^{2n\times 2} are rings under addition + and entry-wise multiplication \odot of n-dimensional complex-valued vectors and 2n\times 2-dimensional real matrices, respectively, for any positive integer n [Herstein, 1991]. We consider the extended map \varphi:\mathbb{C}^{n}\rightarrow\mathbb{R}^{2n\times 2} as described in (3.2). Then, for any \mathbf{z}_{1},\mathbf{z}_{2}\in\mathbb{C}^{n} such that \mathbf{z}_{1}=(z_{1i})_{i=1}^{n},\mathbf{z}_{2}=(z_{2i})_{i=1}^{n},

    φ(𝐳1+𝐳2)=φ((z1i+z2i)i=1n)=(φ(z1i+z2i))i=1n=(φ(z1i)+φ(z2i))i=1n=φ(𝐳1)+φ(𝐳2),\varphi(\mathbf{z}_{1}+\mathbf{z}_{2})=\varphi((z_{1i}+z_{2i})_{i=1}^{n})=(\varphi(z_{1i}+z_{2i}))_{i=1}^{n}=(\varphi(z_{1i})+\varphi(z_{2i}))_{i=1}^{n}=\varphi(\mathbf{z}_{1})+\varphi(\mathbf{z}_{2}),

    and

    φ(𝐳1𝐳2)=φ((z1iz2i)i=1n)=(φ(z1iz2i))i=1n=(φ(z1i)φ(z2i))i=1n=φ(𝐳1)φ(𝐳2).\varphi(\mathbf{z}_{1}\odot\mathbf{z}_{2})=\varphi((z_{1i}z_{2i})_{i=1}^{n})=(\varphi(z_{1i}z_{2i}))_{i=1}^{n}=(\varphi(z_{1i})\varphi(z_{2i}))_{i=1}^{n}=\varphi(\mathbf{z}_{1})\odot\varphi(\mathbf{z}_{2}).

    Therefore, the extended \varphi:\mathbb{C}^{n}\rightarrow\mathbb{R}^{2n\times 2} of (3.2) is a ring homomorphism. Using Proposition 5 of Section 7.3 of Dummit and Foote [2004], \text{Image}(\varphi) is a subring of \mathbb{R}^{2n\times 2}, and the kernel of \varphi, denoted by \text{ker}(\varphi)=\{\mathbf{z}:\varphi(\mathbf{z})=\mathbf{0}_{2n\times 2}\}, is a subring of \mathbb{C}^{n}. In this case, \text{ker}(\varphi) consists of all those vectors \mathbf{z}\in\mathbb{C}^{n} such that \textbf{Re}(\mathbf{z})=0 and \textbf{Im}(\mathbf{z})=0, i.e. \mathbf{z}=0. Therefore, \varphi has a trivial kernel and hence is a one-to-one map. As a result, \varphi described in (3.2) is a ring isomorphism.

  • (b)

    The proof is similar to part (a). The only differences in this case are that (i) \mathbb{C}^{m\times n} and \mathbb{R}^{2m\times 2n} are rings under addition + and entry-wise multiplication \odot of m\times n-dimensional complex matrices and 2m\times 2n-dimensional real matrices, respectively, and (ii) \varphi is as described in (3.3).

  • (c)

    Similar to the proof of part (a), GLn()\text{GL}_{n}(\mathbb{C}) is a field, and hence a ring. Image(φ)\text{Image}(\varphi) is a subring of 2n×2n\mathbb{R}^{2n\times 2n} and the kernel is trivial, which implies that φ\varphi described in (3.3) is a ring isomorphism in this case. In addition, using the homomorphic properties of φ\varphi and the field properties of GLn()\text{GL}_{n}(\mathbb{C}), it can be shown that Image(φ)\text{Image}(\varphi) is also a field. Therefore, Image(φ)\text{Image}(\varphi) is a field in this case, and hence φ\varphi is a field isomorphism.

    For any AGLn()A\in GL_{n}(\mathbb{C}), φ(A)φ(A1)=φ(AA1)=φ(In)=I2n\varphi(A)\varphi(A^{-1})=\varphi(AA^{-1})=\varphi(I_{n})=I_{2n}. Therefore, φ(A1)=[φ(A)]1\varphi(A^{-1})=[\varphi(A)]^{-1}.
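Part (c) can likewise be illustrated numerically. The sketch below assumes NumPy and uses the block arrangement [[Re A, -Im A], [Im A, Re A]], which is one standard realification of a complex matrix; it is assumed here to agree (possibly up to a permutation of rows and columns) with the map in (3.3), and it is not the paper's code.

```python
import numpy as np

def realify(A):
    """Map a complex (n x n) matrix to a (2n x 2n) real block matrix."""
    return np.block([[A.real, -A.imag],
                     [A.imag,  A.real]])

rng = np.random.default_rng(1)
n = 4
A = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
B = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))

assert np.allclose(realify(A + B), realify(A) + realify(B))               # additive
assert np.allclose(realify(A @ B), realify(A) @ realify(B))               # multiplicative
assert np.allclose(realify(np.linalg.inv(A)), np.linalg.inv(realify(A)))  # phi(A^{-1}) = [phi(A)]^{-1}
```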

Proof of Proposition 4.1

By Proposition 3.5 of Sun et al. [2018], for some constants c,cc,c^{\prime} and any η>0\eta>0, we have

\mathbb{P}\left(\left|\hat{f}_{kl}(\omega_{j})-\mathbb{E}\hat{f}_{kl}(\omega_{j})\right|\geq|||f|||\,\eta\right)\leq c^{\prime}\exp\{-c(2m+1)\min\{\eta,\eta^{2}\}\}.

We choose \eta=\sqrt{\frac{R\log p}{m}}. In the asymptotic regime m\succsim|||f|||^{2}\log p, we have \eta=\sqrt{\frac{R\log p}{m}}\geq\frac{R\log p}{m}=\eta^{2}. Therefore, by using a union bound over all k,l, we obtain

\mathbb{P}\left(\max_{k,l}\left|\hat{f}_{kl}(\omega_{j})-\mathbb{E}\hat{f}_{kl}(\omega_{j})\right|\geq|||f|||\sqrt{\frac{R\log p}{m}}\right)\leq p^{2}\cdot c^{\prime}\exp\left\{-c(2m+1)\frac{R\log p}{m}\right\}\leq p^{2}c^{\prime}\exp\{-2cR\log p\}=c^{\prime}p^{-(2cR-2)}. (B.1)

Also, from Proposition 3.3 of Sun et al. [2018], the upper bound on the bias term is

|𝔼f^kl(ωj)fkl(ωj)|m+12πnΩn+12πLn,for all k,l,\left|\mathbb{E}\hat{f}_{kl}(\omega_{j})-f_{kl}(\omega_{j})\right|\leq\frac{m+\frac{1}{2\pi}}{n}\Omega_{n}+\frac{1}{2\pi}L_{n},~{}~{}~{}\mbox{for all }k,l,

where Ωn\Omega_{n} and LnL_{n} are defined in (4.1). Therefore,

maxk,l|𝔼f^kl(ωj)fkl(ωj)|m+12πnΩn+12πLn.\max_{k,l}\left|\mathbb{E}\hat{f}_{kl}(\omega_{j})-f_{kl}(\omega_{j})\right|\leq\frac{m+\frac{1}{2\pi}}{n}\Omega_{n}+\frac{1}{2\pi}L_{n}. (B.2)

We have chosen \Delta_{f}(m,n,p)=|||f|||\sqrt{\frac{R\log p}{m}}+\frac{m+\frac{1}{2\pi}}{n}\Omega_{n}+\frac{1}{2\pi}L_{n}. Combining (B.1) and (B.2), we obtain

\mathbb{P}\left(\left\|\hat{f}(\omega_{j})-f(\omega_{j})\right\|_{\max}\geq\Delta_{f}(m,n,p)\right)\leq c^{\prime}p^{-(cR-2)}.

\hfill\square

Proof of Theorem 4.1

Our objective is to show that the primal-dual witness matrix \widetilde{\Theta} in (B.4) is equal to the solution of the optimization problem (2.6) with high probability. For a fixed j\in[n] and Fourier frequency \omega_{j}\in[-\pi,\pi], we define the event \mathcal{E}:=\left\{\left\|\hat{f}(\omega_{j})-f(\omega_{j})\right\|_{\max}\leq\Delta_{f}(m,n,p)\right\}. As we are in the regime m\succeq|||f|||^{2}\zeta^{2}\log p, m\preceq n/[\Omega_{n}\zeta] and \zeta\geq 1, we have m\succeq|||f|||^{2}\log p and m\preceq n/\Omega_{n}. Also, in the definition of \Delta_{f}(m,n,p) in (4.2), we consider R>1/c, i.e. \tau=cR>2. Hence, by Proposition 4.1, \mathbb{P}(\mathcal{E})\geq 1-\frac{c^{\prime}}{p^{\tau-2}} for constants c^{\prime}>0 and \tau>2. In the following analysis, we condition on the event \mathcal{E}.

Our proof proceeds along the following route: we first describe the primal-dual witness construction required for the proof, and then state a series of claims that are restatements of a number of lemmas in Ravikumar et al. [2011], highlighting the changes needed to accommodate the complex-valued setting.

Claim B.1.

The penalized log-determinant program (2.6) has a unique solution \hat{\Theta} characterized by the equation

f^(ωj)Θ^1+λΨ^=0,\hat{f}(\omega_{j})-{\hat{\Theta}}^{-1}+\lambda\hat{\Psi}=0, (B.3)

where \hat{\Psi} is an element of the subdifferential \partial\|(\cdot)\|_{\text{1,off}} evaluated at \hat{\Theta}, as characterized in (3.15).

Proof sketch. The proof of Claim B.1 follows from Lemma 3 of Ravikumar et al. [2011]. Claim B.1 is similar to their result, with the real covariance matrix \hat{\Sigma} replaced by the averaged periodogram matrix \hat{f}(\omega_{j}), and \hat{\Psi} replaced by the matrix of entry-wise signs of \hat{\Theta} (details available in Equation (41), Section 4.1 of Ravikumar et al. [2011]). The basic idea of the proof is the following: the penalized log-determinant program (2.6) is a strictly convex program, i.e. the objective function is a strictly convex function of \Theta, since the \ell_{1} penalty on the off-diagonals is a convex function of \Theta and the negative log-determinant barrier is strictly convex on positive definite matrices (Section 3.1.5 of Boyd and Vandenberghe [2004]). Also, for complex positive definite Hermitian matrices, i.e. \Theta\in\mathcal{H}^{p}_{++}, the eigenvalues and diagonal elements of \Theta are strictly positive. Following the proof of Lemma 3 in Ravikumar et al. [2011] and Hadamard's inequality for Hermitian positive definite matrices (Theorem 7.8.1 of Horn and Johnson [2012]), we have \log\det\Theta\leq\sum_{k=1}^{p}\log\Theta_{kk}. Hence,

k=1pΘkk[f^(ωj)]kklogdetΘk=1p(Θkk[f^(ωj)]kklogΘkk).\sum_{k=1}^{p}\Theta_{kk}[\hat{f}(\omega_{j})]_{kk}-\log\det\Theta\geq\sum_{k=1}^{p}\left(\Theta_{kk}[\hat{f}(\omega_{j})]_{kk}-\log\Theta_{kk}\right).

The right-hand side of the above inequality is coercive, i.e. it diverges to infinity as (Θ11,,Θpp)\|(\Theta_{11},\cdots,\Theta_{pp})\|\rightarrow\infty. Therefore, the objective function is a continuous and coercive function of Θ\Theta. Hence the level sets are compact, and continuity implies that a minimum is attained. In addition, strict convexity guarantees the uniqueness of the minimum.

A sufficient condition for \hat{\Theta} to be a solution of (2.6) is that zero belongs to the subdifferential of the objective function of (2.6) evaluated at \hat{\Theta}, i.e. 0\in\partial\left(\langle\hat{\Theta},\hat{f}(\omega_{j})\rangle-\log\det\hat{\Theta}+\lambda\|\hat{\Theta}\|_{1,\text{off}}\right). Using Theorems 2.9 and 2.10 of Weylandt Jr [2020], the KKT (Karush-Kuhn-Tucker) condition, i.e. the subdifferential equation corresponding to the optimization problem (2.6), is the same as (B.3). \hfill\square
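For concreteness, the stationarity condition (B.3) can be checked numerically for any candidate estimate. The sketch below assumes NumPy and uses the standard characterization of the complex subgradient of the off-diagonal \ell_{1} norm (the phase \hat{\Theta}_{kl}/|\hat{\Theta}_{kl}| on the support, modulus at most one off the support); it is an illustration under these assumptions, not the paper's implementation.

```python
import numpy as np

def check_kkt(f_hat, Theta_hat, lam, tol=1e-8):
    """Verify that f_hat - Theta_hat^{-1} + lam * Psi_hat = 0 entry-wise."""
    p = f_hat.shape[0]
    R = f_hat - np.linalg.inv(Theta_hat)              # should equal -lam * Psi_hat
    ok = True
    for k in range(p):
        for l in range(p):
            if k == l:
                ok &= abs(R[k, l]) <= tol             # no penalty on the diagonal
            elif abs(Theta_hat[k, l]) > tol:          # active off-diagonal entry
                phase = Theta_hat[k, l] / abs(Theta_hat[k, l])
                ok &= abs(R[k, l] + lam * phase) <= tol
            else:                                     # inactive entry: subgradient bound
                ok &= abs(R[k, l]) <= lam + tol
    return bool(ok)

# sanity check at lam = 0, where the unpenalized solution is the inverted periodogram
rng = np.random.default_rng(2)
Z = rng.normal(size=(5, 50)) + 1j * rng.normal(size=(5, 50))
f_hat = Z @ Z.conj().T / 50                           # Hermitian positive definite
assert check_kkt(f_hat, np.linalg.inv(f_hat), lam=0.0)
```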

Now that we have guaranteed the existence, uniqueness, and characterization of the optimal solution of (2.6), we can take a primal-dual witness approach to calculate the solution as well as to recover the correct support set. The primal-dual witness construction is motivated by previous works of Zhao and Yu [2006]; Wainwright [2009]; Ravikumar et al. [2011]; Loh and Wainwright [2017]. The key idea is to construct a pair of primal and dual feasible solutions satisfying a strict dual feasibility condition, in a setting where the objective function may not be strictly convex. Next we restate the primal-dual witness construction of Ravikumar et al. [2011] for a restricted CGLASSO problem.

Primal-dual witness construction

We recall from section 4 that SS is the augmented edge set, and consider the restricted log-determinant problem

Θ~=argminΘ++p:ΘSc=0Θ,f^(ωj)logdetΘ+λΘ1,off.\widetilde{\Theta}=\operatorname*{arg\,min}_{\Theta\in\mathcal{H}_{++}^{p}:\ \Theta_{S^{\texttt{c}}}=0}\langle\Theta,\hat{f}(\omega_{j})\rangle-\log\det\Theta+\lambda\|\Theta\|_{1,\text{off}}. (B.4)

By this construction, Θ~\widetilde{\Theta} is positive-definite Hermitian as well as Θ~Sc=0\widetilde{\Theta}_{S^{\texttt{c}}}=0. Let Ψ~()1,off(Θ~)\widetilde{\Psi}\in\partial\|(\cdot)\|_{\text{1,off}}(\widetilde{\Theta}). For (k,l)Sc(k,l)\in S^{\texttt{c}}, we replace (Ψ~)kl(\widetilde{\Psi})_{kl} with

Ψ~kl=1λ[(f^(ωj))kl+(Θ~1)kl].\displaystyle\widetilde{\Psi}_{kl}=\frac{1}{\lambda}\left[-\left(\hat{f}(\omega_{j})\right)_{kl}+\left(\widetilde{\Theta}^{-1}\right)_{kl}\right].

The pair (\widetilde{\Theta},\widetilde{\Psi}) satisfies equation (B.3) with (\hat{\Theta},\hat{\Psi}) replaced by (\widetilde{\Theta},\widetilde{\Psi}). Finally we check the strict dual feasibility condition, i.e.

|Ψ~kl|<1 for all (k,l)Sc.|\widetilde{\Psi}_{kl}|<1\text{ for all }(k,l)\in S^{\texttt{c}}.

This step ensures that Ψ~\widetilde{\Psi} satisfies the necessary conditions to belong to the set of subdifferentials.

We define G=\widetilde{\Theta}-\Theta^{*}. Also, let the remainder term of the first-order Taylor expansion of (\Theta^{*}+G)^{-1} around \Theta^{*} be R(G)=(\Theta^{*}+G)^{-1}-{\Theta^{*}}^{-1}+{\Theta^{*}}^{-1}G{\Theta^{*}}^{-1}. We state the following result:

Claim B.2.

Suppose that

\max\left\{\left\|\hat{f}(\omega_{j})-f(\omega_{j})\right\|_{\max},\left\|R(G)\right\|_{\max}\right\}\leq\frac{\alpha}{8}\lambda. (B.5)

Then the matrix \widetilde{\Psi} constructed in the primal-dual witness construction satisfies \|\widetilde{\Psi}_{S^{c}}\|_{\max}<1. Hence strict dual feasibility holds and \widetilde{\Theta}=\hat{\Theta}.

Proof sketch. The proof of Claim B.2 follows from the proof of Lemma 4 in Ravikumar et al. [2011] and involves basic linear algebra with complex-valued matrices. The only changes in notation are that W=\hat{\Sigma}-\Sigma is replaced by \hat{f}(\omega_{j})-f(\omega_{j}) in our case, and G is the difference between the primal-dual witness and the true spectral precision matrix, instead of the difference \Delta between the true inverse covariance and its primal-dual witness (\Delta in the notation of Ravikumar et al. [2011]; in our paper \Delta is used to denote the threshold parameter). Lemma 4 in Ravikumar et al. [2011] shows that under the assumption \max\{\|W\|_{\max},\|R(\Delta)\|_{\max}\}\leq\alpha\lambda/8, the matrix of entry-wise signs for the primal-dual witness problem, i.e. \widetilde{\Psi}, satisfies the strict dual feasibility condition \|\widetilde{\Psi}_{S^{c}}\|_{\max}<1. The same strict dual feasibility condition can be concluded under condition (B.5) if we follow steps (51)-(55) in the proof of Lemma 4 (Section 4.2.1) in Ravikumar et al. [2011]. Under strict dual feasibility, by construction, \widetilde{\Theta}=\hat{\Theta} is ensured.\hfill\square

Next, we verify that the assumption (B.5) holds. As our choice of regularization parameter is \lambda=\frac{8}{\alpha}\Delta_{f}(m,n,p), under the event \mathcal{E} we have \|\hat{f}(\omega_{j})-f(\omega_{j})\|_{\max}\leq\frac{\alpha}{8}\lambda=\Delta_{f}(m,n,p). In order to show (B.5), it remains to show \|R(G)\|_{\max}\leq\frac{\alpha}{8}\lambda. For that we will require the following two claims.

Claim B.3.

Suppose that

r:=2\kappa_{\Theta^{*}}\left(\left\|\hat{f}(\omega_{j})-f(\omega_{j})\right\|_{\max}+\lambda\right)\leq\min\left\{\frac{1}{3\kappa_{f}d},\frac{1}{3\kappa_{f}^{3}\kappa_{\Theta^{*}}d}\right\}.

Then \|G\|_{\max}=\left\|\widetilde{\Theta}-\Theta^{*}\right\|_{\max}\leq r.

Proof sketch. The proof of Claim B.3 is similar to that of Lemma 6 in Ravikumar et al. [2011] and involves linear algebra of complex matrices. By Claim B.1, the restricted problem (B.4) has a unique solution, so the partial derivative with respect to the restricted elements must vanish at the unique optimum; that is, \widetilde{\Theta}_{S} satisfies

H1(ΘS):=[f^(ωj)]S[Θ1]S+λΨ~S=0.H_{1}(\Theta_{S}):=[\hat{f}(\omega_{j})]_{S}-[\Theta^{-1}]_{S}+\lambda\widetilde{\Psi}_{S}=0.

This restricted zero gradient condition is necessary and sufficient for the optimal solution, hence the solution is unique. Now we define the map H2:|S||S|H_{2}:\mathbb{R}^{|S|}\rightarrow\mathbb{R}^{|S|} as follows

H2(G¯S):=(ΥSS)1H¯1(ΘS+GS)+G¯S.H_{2}(\overline{G}_{S}):=-(\Upsilon_{SS}^{*})^{-1}\ \overline{H}_{1}(\Theta_{S}^{*}+G_{S})+\overline{G}_{S}.

By construction, H_{2}(\overline{G}_{S})=\overline{G}_{S} if and only if H_{1}(\Theta_{S}^{*}+G_{S})=H_{1}(\widetilde{\Theta}_{S})=0. Let \mathbb{B}(r) be the \ell_{\infty} ball of radius r in \mathbb{R}^{|S|}, i.e. \mathbb{B}(r)=\{\Theta_{S}:\|\Theta_{S}\|_{\infty}\leq r\}. Now, following the matrix-algebraic calculations in equations (72) to (74) of Ravikumar et al. [2011], it can be checked that H_{2}(\mathbb{B}(r))\subset\mathbb{B}(r). Since H_{2} is a continuous function on \mathbb{R}^{|S|} and \mathbb{B}(r) is convex and compact in \mathbb{R}^{|S|}, by Brouwer's fixed point theorem [Kellogg et al., 1976], the existence of a fixed point \overline{G}_{S}\in\mathbb{B}(r) of the function H_{2} is guaranteed. Also, uniqueness of the zero gradient condition (B) implies the uniqueness of the fixed points of H_{2}. Therefore, we can conclude that \|\widetilde{\Theta}_{S}-\Theta^{*}_{S}\|_{\infty}\leq r, and hence the claim is proved. \hfill\square

By our choice of the regularization constant

2\kappa_{\Theta^{*}}\left(\left\|\hat{f}(\omega_{j})-f(\omega_{j})\right\|_{\max}+\lambda\right)\leq 2\kappa_{\Theta^{*}}\left(1+\frac{8}{\alpha}\right)\Delta_{f}(m,n,p)=2\kappa_{\Theta^{*}}C_{\alpha}\Delta_{f}(m,n,p).

From the conditions of Theorem 4.1, Δf(m,n,p)d[6κf3κΘ2Cα2]1\Delta_{f}(m,n,p)d\leq[6\kappa_{f}^{3}\kappa_{\Theta^{*}}^{2}C_{\alpha}^{2}]^{-1}, and Cα>1C_{\alpha}>1. Therefore, the following two inequalities are satisfied

r2κΘCαΔf(m,n,p)[3κf3κΘd]1,r\leq 2\kappa_{\Theta^{*}}C_{\alpha}\Delta_{f}(m,n,p)\leq[3\kappa_{f}^{3}\kappa_{\Theta^{*}}d]^{-1}, (B.6)

and

2κΘCα2Δf(m,n,p)[3κf3κΘd]12\kappa_{\Theta^{*}}C_{\alpha}^{2}\Delta_{f}(m,n,p)\leq[3\kappa_{f}^{3}\kappa_{\Theta^{*}}d]^{-1} (B.7)

Now we turn to a result in a similar spirit to Lemma 5 in Ravikumar et al. [2011]. The result states the following:

Claim B.4.

\|G\|_{\max}\leq\frac{1}{3\kappa_{f}d} implies \|R(G)\|_{\max}\leq\frac{3}{2}d\|G\|_{\max}^{2}\kappa_{f}^{3}.

The proof follows from Lemma 5 in Ravikumar et al. [2011]. It involves a step where we need to expand the complex matrix (\Theta^{*}+G)^{-1} in a power series as

(\Theta^{*}+G)^{-1}=\sum_{k=0}^{\infty}(-1)^{k}({\Theta^{*}}^{-1}G)^{k}{\Theta^{*}}^{-1}.

The power series expansion is valid because \|G\|_{\max}\leq 1/(3\kappa_{f}d) implies \|{\Theta^{*}}^{-1}G\|_{\max}<1 (which ensures the radius of convergence condition; see equation (68) in Appendix B of Ravikumar et al. [2011]), and also follows from results in holomorphic functional calculus on the general power series expansion of an operator on a Banach space [Dudley et al., 2011; Haase, 2018]. The rest of the proof mirrors that of Lemma 5 in Appendix B of Ravikumar et al. [2011].
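As an aside, the second-order behaviour of the remainder R(G) asserted in Claim B.4 is easy to verify numerically. The sketch below assumes NumPy and uses randomly generated Hermitian matrices purely for illustration; it is not tied to the paper's simulation settings.

```python
import numpy as np

rng = np.random.default_rng(3)
p = 6
A = rng.normal(size=(p, p)) + 1j * rng.normal(size=(p, p))
Theta_star = A @ A.conj().T + p * np.eye(p)           # Hermitian positive definite Theta*
G0 = rng.normal(size=(p, p)) + 1j * rng.normal(size=(p, p))
G0 = (G0 + G0.conj().T) / 20                          # small Hermitian perturbation

Theta_inv = np.linalg.inv(Theta_star)
for t in [1.0, 0.1, 0.01]:
    G = t * G0
    R = np.linalg.inv(Theta_star + G) - Theta_inv + Theta_inv @ G @ Theta_inv
    print(t, np.max(np.abs(R)) / t ** 2)              # ratio stays roughly constant as t -> 0
```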

Combining this result with (B.6) we get

\|R(G)\|_{\max}\leq\frac{3}{2}d\|G\|_{\max}^{2}\kappa_{f}^{3} \leq\frac{3}{2}d\left(2\kappa_{\Theta^{*}}C_{\alpha}\Delta_{f}(m,n,p)\right)^{2}\kappa_{f}^{3}
=6\kappa_{f}^{3}\kappa_{\Theta^{*}}^{2}dC_{\alpha}^{2}\left[\Delta_{f}(m,n,p)\right]^{2}
=\left(6\kappa_{f}^{3}\kappa_{\Theta^{*}}^{2}dC_{\alpha}^{2}\Delta_{f}(m,n,p)\right)\frac{\alpha\lambda}{8}
\leq\frac{\alpha\lambda}{8},

where the last inequality follows from (B.7). Therefore, we have shown that condition (B.5) holds under \mathcal{E}, and strict dual feasibility allows us to conclude that \hat{\Theta}=\widetilde{\Theta}. Hence, \hat{\Theta} satisfies the bound (B.6) and \hat{\Theta}_{S^{c}}=\widetilde{\Theta}_{S^{c}}=0, so E(\hat{\Theta})=E(\widetilde{\Theta}). Also, for (k,l)\in S such that \left|\Theta^{*}_{kl}\right|>2C_{\alpha}\kappa_{\Theta^{*}}\Delta_{f}(m,n,p),

|Θ^kl|\displaystyle\left|\hat{\Theta}_{kl}\right| =|Θ^klΘkl+Θkl||Θkl||Θ^klΘkl|>0\displaystyle=\left|\hat{\Theta}_{kl}-\Theta^{*}_{kl}+\Theta^{*}_{kl}\right|\geq\left|\Theta^{*}_{kl}\right|-\left|\hat{\Theta}_{kl}-\Theta^{*}_{kl}\right|>0

Therefore, E(\hat{\Theta}) contains all edges (k,l) with \left|\Theta^{*}_{kl}\right|>2C_{\alpha}\kappa_{\Theta^{*}}\Delta_{f}(m,n,p), and hence (b) is proved.

Now we use the idea of Corollary 3 of Ravikumar et al. [2011] and write

Θ^ΘF\displaystyle\|\hat{\Theta}-\Theta^{*}\|_{F} [k=1p|Θ^kkΘkk|2+(k,l)E(Θ)|Θ^klΘkl|2]1/2\displaystyle\leq\left[\sum_{k=1}^{p}\left|\hat{\Theta}_{kk}-\Theta^{*}_{kk}\right|^{2}+\sum_{(k,l)\in E(\Theta^{*})}\left|\hat{\Theta}_{kl}-\Theta^{*}_{kl}\right|^{2}\right]^{1/2}
2κΘCαΔf(m,n,p)×s+p\displaystyle\leq 2\kappa_{\Theta^{*}}C_{\alpha}\Delta_{f}(m,n,p)\times\sqrt{s+p}

And

Θ^Θ2Θ^Θd×2κΘCαΔf(m,n,p)\|\hat{\Theta}-\Theta^{*}\|_{2}\leq\|\hat{\Theta}-\Theta^{*}\|_{\infty}\leq d\times 2\kappa_{\Theta^{*}}C_{\alpha}\Delta_{f}(m,n,p)

where we use the definition of \ell_{\infty} norm and the fact that Θ^\hat{\Theta} and Θ\Theta^{*} both have dd non-zeros in each row. Now we use the fact that Frobenius norm dominates \ell_{\infty} norm and (c) follows. \hfill\square

Appendix C Figures and tables

Figure 7: Plot of the diagonals of the spectral density across frequencies for the VAR(1) model of Section 5 with p=20. The k-th line represents [f(\omega)]_{kk} across \omega\in[-\pi,\pi].
Figure 8: Plot of the diagonals of the spectral density across frequencies for the VARMA(2, 2) model of Section 5 with p=20. The k-th line represents [f(\omega)]_{kk} across \omega\in[-\pi,\pi]. Here 1\leq k\leq 5, since for p=10, f(\omega) has two 5\times 5 blocks.
RMSE
Family Inverse periodogram CGLASSO (oracle) CGLASSO (BIC)
White Noise
p=10p=10
    n=200n=200 123.69 (66.17) 11.59 (2.08) 17.34 (2.72)
    n=400n=400 50.68 (17.23) 8.11 (1.09) 15.01 (1.49)
p=20p=20
    n=200n=200 200.23 (783.49) 14.06 (1.52) 21.76 (1.71)
    n=400n=400 394.05 (123.59) 10.88 (1.43) 18.69 (1.53)
p=50p=50
    n=200n=200 - 18.16 (1.34) 31.07 (0.77)
    n=400n=400 - 14.21 (1.07) 25.44 (0.88)
VAR(1)
p=10p=10
    n=200n=200 205.82 (137.47) 17.50 (-) 17.50 (-)
    n=400n=400 121.15 (34.07) 17.99 (1.45) 18.06 (1.25)
p=20p=20
    n=200n=200 1831.27 (666.7) 20.67 (-) 20.67 (-)
    n=400n=400 489.32 (140.18) 19.47 (0.81) 19.47 (0.81)
p=50p=50
    n=200n=200 - 20.59 (-) 20.59 (-)
    n=400n=400 - 19.52 (-) 19.52 (-)
VARMA(2, 2)
p=10p=10
    n=200n=200 117.13 (42.06) 13.72 (2.34) 15.00 (2.32)
    n=400n=400 62.96 (14.67) 11.20 (2.03) 11.48 (2.03)
p=20p=20
    n=200n=200 2359.21 (1101.53) 15.32 (1.73) 20.35 (1.69)
    n=400n=400 446.45 (127.16) 12.83 (2.01) 14.80 (1.67)
p=50p=50
    n=200n=200 - 16.77 (0.90) 28.98 (1.28)
    n=400n=400 - 13.60 (0.86) 18.67 (0.73)
Table 3: RMSE of the inverted periodogram and the complex graphical Lasso estimator at \omega_{j}=\pi/2. The CGLASSO (oracle) column reports the minimum RMSE (in %) along the regularization path, and the CGLASSO (BIC) column reports the RMSE (in %) of the estimator with the minimum BIC along the regularization path. Results are averaged over 20 independent replicates. Standard deviations (in %) are reported in parentheses.
RMSE
Family Inverse periodogram CGLASSO (oracle) CGLASSO (BIC)
White Noise
p=10p=10
    n=200n=200 148.83 (79.25) 10.31 (2.15) 12.83 (2.21)
    n=400n=400 80.58 (52.42) 8.75 (2.38) 11.21 (2.73)
p=20p=20
    n=200n=200 3620.6 (4038.01) 12.49 (2.07) 18.69 (2.31)
    n=400n=400 441.79 (222.01) 9.74 (1.55) 14.96 (1.63)
p=50p=50
    n=200n=200 - 14.96 (1.44) 28.44 (1.16)
    n=400n=400 - 11.94 (0.93) 23.47 (1.18)
VAR(1)
p=10p=10
    n=200n=200 158.18 (119.3) 9.49 (-) 15.68 (1.65)
    n=400n=400 73.09 (37.48) 8.69 (-) 14.17 (2.40)
p=20p=20
    n=200n=200 4439.06 (3856.06) 13.55 (-) 16.61 (2.15)
    n=400n=400 607.81 (324.17) 12.65 (1.14) 14.77 (1.84)
p=50p=50
    n=200n=200 - 16.44 (0.48) 17.30 (1.47)
    n=400n=400 - 14.74 (-) 15.15 (0.52)
VARMA(2, 2)
p=10p=10
    n=200n=200 162.61 (131.66) 9.64 (0.84) 15.11 (2.36)
    n=400n=400 67.55 (23.58) 7.62 (0.87) 9.21 (2.16)
p=20p=20
    n=200n=200 3174.67 (2912.83) 12.04 (1.39) 21.01 (2.44)
    n=400n=400 470.83 (270.81) 9.11 (-) 17.98 (1.56)
p=50p=50
    n=200n=200 - 14.10 (0.83) 19.85 (1.28)
    n=400n=400 - 11.16 (0.95) 17.52 (0.93)
Table 4: RMSE of the inverted periodogram and the complex graphical Lasso estimator at \omega_{j}=\pi. The CGLASSO (oracle) column reports the minimum RMSE (in %) along the regularization path, and the CGLASSO (BIC) column reports the RMSE (in %) of the estimator with the minimum BIC along the regularization path. Results are averaged over 20 independent replicates. Standard deviations (in %) are reported in parentheses.
Family   AUROC (NWR)   AUROC (CGLASSO)   Precision (CGLASSO)   Recall (CGLASSO)   Accuracy (CGLASSO)
White Noise
p=10p=10
    n=200n=200 97.27 (2.79) 99.61 (0.69) 53.45 (7.93) 100.00 (0.00) 81.78 (5.79)
    n=400n=400 99.68 (0.57) 99.91 (0.28) 59.37 (9.45) 100.00 (0.00) 85.56 (5.07)
p=20p=20
    n=200n=200 96.50 (2.11) 98.82 (1.64) 53.04 (5.89) 98.42 (3.01) 90.92 (2.08)
    n=400n=400 97.99 (2.18) 99.78 (0.63) 51.97 (5.39) 99.74 (1.18) 90.55 (2.02)
p=50p=50
    n=200n=200 96.52 (1.55) 94.81 (2.40) 68.60 (6.89) 90.00 (4.72) 97.90 (0.54)
    n=400n=400 98.18 (1.09) 99.48 (0.53) 62.79 (4.86) 99.08 (1.04) 97.58 (0.49)
VAR(1)
p=10p=10
    n=200n=200 74.42 (5.93) 50.00 (-) - 0.00 (0.00) 62.22 (0.00)
    n=400n=400 80.81 (7.04) 50.25 (0.85) 100.00 (-) 0.49 (1.70) 62.41 (0.64)
p=20p=20
    n=200n=200 72.76 (4.97) 52.70 (-) 100.00 (-) 5.41 (-) 81.58 (-)
    n=400n=400 79.42 (3.51) 50.75 (0.71) 100.00 (0.00) 1.50 (1.42) 80.82 (0.28)
p=50p=50
    n=200n=200 68.48 (3.33) 49.96 (-) 0.00 (-) 0.00 (-) 92.00 (-)
    n=400n=400 75.54 (2.70) 50.00 (-) - 0.00 (-) 92.08 (-)
VARMA(2, 2)
p=10p=10
    n=200n=200 76.27 (7.25) 90.20 (6.02) 75.97 (10.69) 88.25 (6.34) 81.56 (8.65)
    n=400n=400 79.62 (4.27) 93.83 (3.69) 76.00 (6.82) 92.25 (6.17) 83.22 (5.17)
p=20p=20
    n=200n=200 73.37 (4.33) 85.27 (4.40) 88.98 (5.73) 71.62 (8.67) 92.11 (2.07)
    n=400n=400 77.96 (2.28) 94.36 (1.65) 77.53 (6.15) 90.62 (3.23) 92.34 (1.87)
p=50p=50
    n=200n=200 77.58 (1.75) 50.00 (0.00) - 0.00 (0.00) 91.84 (0.00)
    n=400n=400 79.88 (2.10) 88.29 (2.72) 88.74 (3.17) 76.90 (5.42) 97.31 (0.43)
Table 5: AUROC for nodewise regression (NWR) and CGLASSO, and precision, recall, and accuracy (each in %) of the complex graphical Lasso for \omega_{j}=\pi/2. Results are averaged over 20 replicates. Standard deviations (in %) are reported in parentheses.
Family   AUROC (NWR)   AUROC (CGLASSO)   Precision (CGLASSO)   Recall (CGLASSO)   Accuracy (CGLASSO)
White Noise
p=10p=10
    n=200n=200 93.46 (4.26) 96.71 (3.77) 46.73 (6.85) 97.22 (6.11) 76.56 (5.61)
    n=400n=400 95.22 (3.86) 97.93 (2.55) 48.18 (6.34) 98.33 (4.07) 77.78 (5.90)
p=20p=20
    n=200n=200 88.63 (5.41) 96.01 (3.20) 39.19 (5.21) 95.00 (5.53) 84.37 (3.26)
    n=400n=400 95.36 (3.13) 98.48 (1.76) 42.91 (5.01) 98.42 (3.01) 86.50 (2.66)
p=50p=50
    n=200n=200 88.68 (3.63) 91.48 (2.23) 42.58 (5.63) 84.69 (4.32) 94.68 (1.13)
    n=400n=400 92.98 (2.95) 96.43 (1.63) 42.77 (4.51) 93.78 (3.14) 94.63 (1.02)
VAR(1)
p=10p=10
    n=200n=200 60.63 (8.53) 53.00 (3.05) 84.38 (22.90) 6.95 (6.35) 64.24 (2.32)
    n=400n=400 64.01 (7.95) 50.93 (1.71) 100.00 (0.00) 1.86 (3.43) 62.92 (1.29)
p=20p=20
    n=200n=200 58.52 (5.95) 50.32 (0.78) 58.33 (49.16) 0.75 (1.55) 80.58 (0.36)
    n=400n=400 60.13 (4.20) 50.92 (1.25) 81.25 (37.20) 1.93 (2.47) 80.83 (0.53)
p=50p=50
    n=200n=200 55.86 (2.28) 50.00 (0.00) - 0.00 (0.00) 92.08 (0.00)
    n=400n=400 58.41 (2.37) 50.63 (0.61) 42.47 (2.52) 1.44 (1.21) 92.03 (0.17)
VARMA(2, 2)
p=10p=10
    n=200n=200 65.19 (6.36) 71.59 (8.46) 90.88 (7.09) 46.05 (16.63) 74.04 (7.56)
    n=400n=400 68.87 (6.78) 81.12 (6.83) 74.44 (8.42) 75.25 (11.75) 77.22 (7.42)
p=20p=20
    n=200n=200 63.09 (5.02) 50.00 (0.00) - 0.00 (0.00) 78.95 (0.00)
    n=400n=400 67.26 (4.28) 50.00 (0.00) - 0.00 (0.00) 78.95 (0.00)
p=50p=50
    n=200n=200 65.10 (2.84) 50.00 (0.00) - 0.00 (0.00) 91.84 (0.00)
    n=400n=400 66.88 (2.56) 50.00 (0.00) - 0.00 (0.00) 91.84 (0.00)
Table 6: AUROC for nodewise regression(NWR) and CGLASSO, precision, recall, accuracy and F1-score (each in %) of complex graphical lasso for ωj=π\omega_{j}=\pi. Results are averaged over 20 replicates. Standard deviations (in %) are also reported in parentheses.