
PLSO: A generative framework for decomposing nonstationary time-series into piecewise stationary oscillatory components

Andrew H. Song Electrical Engineering and Computer Science
Massachusetts Institute of Technology
Cambridge, Massachusetts, USA
Demba Ba School of Engineering and Applied Sciences
Harvard University
Cambridge, Massachusetts, USA
Emery N. Brown Picower Institute for Learning and Memory
Massachusetts Institute of Technology
Cambridge, Massachusetts, USA
Institute for Medical Engineering and Science
Massachusetts Institute of Technology
Cambridge, Massachusetts, USA
Department of Anesthesia, Critical Care and Pain Medicine
Massachusetts General Hospital
Boston, Massachusetts, USA
Abstract

To capture the slowly time-varying spectral content of real-world time-series, a common paradigm is to partition the data into approximately stationary intervals and perform inference in the time-frequency domain. However, this approach lacks a corresponding nonstationary time-domain generative model for the entire data and thus, time-domain inference occurs in each interval separately. This results in distortion/discontinuity around interval boundaries and can consequently lead to erroneous inferences based on any quantities derived from the posterior, such as the phase. To address these shortcomings, we propose the Piecewise Locally Stationary Oscillation (PLSO) model for decomposing time-series data with slowly time-varying spectra into several oscillatory, piecewise-stationary processes. PLSO, as a nonstationary time-domain generative model, enables inference on the entire time-series without boundary effects and simultaneously provides a characterization of its time-varying spectral properties. We also propose a novel two-stage inference algorithm that combines Kalman theory and an accelerated proximal gradient algorithm. We demonstrate these points through experiments on simulated data and real neural data from the rat and the human brain.

1 Introduction

With the collection of long time-series now common in areas such as neuroscience and geophysics, it is important to develop an inference framework for data where the stationarity assumption is too restrictive. We restrict our attention to data 1) with spectral properties that change slowly over time and 2) for which decomposition into several oscillatory components is warranted for interpretation, often the case in electroencephalogram (EEG) or electrophysiology recordings. One can use bandpass filtering [Oppenheim et al., 2009] or methods such as the empirical mode decomposition [Huang et al., 1998, Daubechies et al., 2011] for these purposes. However, due to the absence of a generative model, these methods lack a framework for performing inference. Another popular approach is to perform inference in the time-frequency (TF) domain on the short-time Fourier transform (STFT) of the data, assuming stationarity within small intervals. This has led to a rich literature on inference in the TF domain, such as Wilson et al. [2008]. A drawback is that most of these methods focus on estimates of the power spectral density (PSD) and lose important phase information. To recover time-domain estimates, additional algorithms are required [Griffin and J. Lim, 1984].

This motivates us to explore time-domain generative models that allow time-domain inference and decomposition into oscillatory components. We can find examples based on the stationarity assumption in the signal processing/Gaussian process (GP) communities. A superposition of stochastic harmonic oscillators, where each oscillator corresponds to a frequency band, is used in the processing of speech  [Cemgil and Godsill, 2005] and neuroscience data [Matsuda and Komaki, 2017, Beck et al., 2018]. In GP literature [Rasmussen and Williams, 2005], the spectral mixture (SM) kernel [Wilson and Adams, 2013, Wilkinson et al., 2019] models the data as samples from a GP, whose kernel consists of the superposition of localized and frequency-modulated kernels.

These time-domain models can be applied to nonstationary data by partitioning them into stationary intervals and performing time-domain inference within each interval. However, we are faced with a different kind of challenge. As the inference is localized within each interval, the time-domain estimates in different intervals are independent conditioned on the data and do not reflect the dependence across intervals. This also causes discontinuity/distortion of the time-domain estimates near the interval boundaries, and consequently of any quantities derived from these estimates.

To address these shortcomings, we propose a generative framework for data with slowly time-varying spectra, termed the Piecewise Locally Stationary Oscillation (PLSO) framework (code is available at https://github.com/andrewsong90/plso.git). The main contributions are:

Generative model for piecewise stationary, oscillatory components PLSO models time-series as the superposition of piecewise stationary, oscillatory components. This allows time-domain inference on each component and estimation of the time-varying spectra.

Continuity across stationary intervals The state-space model that underlies PLSO strikes a balance between ensuring time-domain continuity across piecewise stationary intervals and stationarity within each interval. Moreover, by imposing stochastic continuity at the interval level, PLSO accurately learns the underlying smooth time-varying spectra.

Inference procedure We propose a two-stage inference procedure for the time-varying spectra and the time-series. By leveraging the Markovian dynamics, the algorithm combines Kalman filter theory [Kalman, 1960] and an inexact accelerated proximal gradient approach [Li and Lin, 2015].

In Section 2 we introduce necessary background, followed by the PLSO framework in Section 3. In Section 4, we discuss inference for PLSO. In Section 5, we discuss how PLSO relates to other frameworks. In Section 6, we present experimental results and conclude in Section 7.

2 Background

2.1 Notation

We use j\in\{1,\ldots,J\} and k\in\{1,\ldots,K\} to denote frequency and discrete-time sample index, respectively. We use \omega\in[-\pi,\pi] for normalized frequency. The j^{\text{th}} latent process centered at \omega=\omega_{j} is denoted as \mathbf{z}_{j}\in\mathbb{C}^{K}, with \mathbf{z}_{j,k}\in\mathbb{C} denoting the k^{\text{th}} sample of \mathbf{z}_{j} and \mathbf{z}^{\Re}_{j,k}, \mathbf{z}^{\Im}_{j,k} its real and imaginary parts. We also represent \mathbf{z}_{j,k} as an \mathbb{R}^{2} vector, \widetilde{\mathbf{z}}_{j,k}=[\mathbf{z}^{\Re}_{j,k},\mathbf{z}^{\Im}_{j,k}]^{\text{T}}. The elements of \mathbf{z}_{j} are denoted as \mathbf{z}_{j,k:k^{\prime}}=[\mathbf{z}_{j,k},\ldots,\mathbf{z}_{j,k^{\prime}}]^{\text{T}}. The state covariance matrix for \mathbf{z}_{j,k} is defined as \mathbf{P}_{k}^{j}=\mathbb{E}[\widetilde{\mathbf{z}}_{j,k}(\widetilde{\mathbf{z}}_{j,k})^{\text{T}}]. To express an enumeration of variables, we use \{\cdot\} and drop the first/last index for simplicity, e.g. \{\mathbf{z}_{j}\}_{j} instead of \{\mathbf{z}_{j}\}_{j=1}^{J}.

We use \mathbf{y}_{k} and \mathbf{z}_{j,k} for the discrete-time counterparts of the continuous observation and latent process, y(t) and z_{j}(t). With the sampling frequency f_{s}=1/\Delta, we have \mathbf{y}_{k}=y(k\Delta), \mathbf{z}_{j,k}=z_{j}(k\Delta), and T=K\Delta.

2.2 Piecewise local stationarity

The concept of piecewise local stationarity (PLS) for nonstationary time-series with slowly time-varying spectra [Adak, 1998] plays an important role in PLSO. A stationary process has a constant mean and a covariance function which depends only on the difference between two time points.

For our purposes, it suffices to understand the following about PLS: 1) It includes locally stationary [Priestley, 1965, Dahlhaus, 1997] and amplitude-modulated stationary processes. 2) A PLS process can be approximated as a piecewise stationary (PS) process (Theorem 1 of [Adak, 1998])

z(t)=\sum_{m=1}^{M}\mathbf{1}(u_{m}\leq t<u_{m+1})\cdot z^{m}(t), (1)

where z^{m}(t) is a continuous stationary process and the boundaries are 0=u_{1}<\cdots<u_{M+1}=T. Note that Eq. 1 does not guarantee continuity across different PS intervals,

\lim_{t\rightarrow u^{-}_{m}}z(t)=\lim_{t\rightarrow u^{-}_{m}}z^{m-1}(t)\neq\lim_{t\rightarrow u^{+}_{m}}z^{m}(t)=\lim_{t\rightarrow u^{+}_{m}}z(t).

3 The PLSO model and its mathematical properties

Building on Theorem 1 of Adak [1998], PLSO models nonstationary data as PS processes. It is a superposition of J different PS processes \{\mathbf{z}_{j}\}_{j}, with \mathbf{z}_{j} corresponding to an oscillatory process centered at frequency \omega_{j}. PLSO also guarantees stochastic continuity across PS intervals. We show that piecewise stationarity and continuity across PS intervals are two competing objectives and that PLSO strikes a balance between them, as discussed in Section 3.2.

Figure 1: A simulated example. (a) Time domain. Data (black) around boundaries (gray) and inferred oscillatory components using PLSO (red) and regularized STFT (blue). (b) Frequency domain. Spectrum of the data (gray), PLSO components for J=2 (purple) and their sum (red).

Fig. 1 shows an example of the PLSO framework applied to simulated data. In the time domain, the oscillation inferred using the regularized STFT (blue) [Kim et al., 2018], which imposes stochastic continuity on the STFT coefficients, suffers from discontinuity/distortion near window boundaries, whereas that inferred by PLSO (red) does not. In the frequency domain, each PLSO component corresponds to a localized spectrum S_{j}(\omega), the sum of which is the PSD \gamma(\omega), and is fit to the data STFT (or periodogram) I(\omega). We start by introducing the PLSO model for a single window.

3.1 PLSO for stationary data

As a building block for PLSO, we use the discrete stochastic harmonic oscillator for a stationary time series [Qi et al., 2002, Cemgil and Godsill, 2005, Matsuda and Komaki, 2017]. The data \mathbf{y} are assumed to be a superposition of J independent zero-mean components

\begin{split}\widetilde{\mathbf{z}}_{j,k}&=\rho_{j}\mathbf{R}(\omega_{j})\widetilde{\mathbf{z}}_{j,k-1}+\varepsilon_{j,k}\\ \mathbf{y}_{k}&=\sum_{j=1}^{J}\mathbf{z}^{\Re}_{j,k}+\nu_{k},\\ \end{split} (2)

where \mathbf{R}(\omega_{j})=\begin{pmatrix}\cos(\omega_{j})&-\sin(\omega_{j})\\ \sin(\omega_{j})&\cos(\omega_{j})\\ \end{pmatrix}, \varepsilon_{j,k}\sim\mathcal{N}\left(0,\alpha_{j}\mathbf{I}_{2\times 2}\right), and \nu_{k}\sim\mathcal{N}(0,\sigma_{\nu}^{2}) correspond to the rotation matrix, the state noise, and the observation noise, respectively. The imaginary component \mathbf{z}^{\Im}_{j,k}, which does not directly contribute to \mathbf{y}_{k}, can be seen as the auxiliary variable to write \mathbf{z}_{j} in recursive form using \mathbf{R}(\omega_{j}) [Koopman and Lee, 2009]. We assume \mathbf{P}_{1}^{j}=\sigma_{j}^{2}\cdot\mathbf{I}_{2\times 2}\,\,\forall j.

We reparameterize \rho_{j} and \alpha_{j}, using lengthscale l_{j} and power \sigma^{2}_{j}, such that \rho_{j}=\exp(-\Delta/l_{j}) and \alpha_{j}=\sigma_{j}^{2}(1-\rho_{j}^{2}). Theoretically, this establishes a connection to 1) the stochastic differential equation [Brown et al., 2004, Solin and Särkkä, 2014], detailed in Appendix A, and 2) a superposition of frequency-modulated and localized GP kernels, similar to the SM kernel [Wilson and Adams, 2013]. Practically, this ensures that \rho_{j}<1, and thus stability of the process.
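To make the generative model concrete, the following is a minimal sketch that simulates a single stationary component of Eq. 2 under the lengthscale parameterization above. The function name and the way the initial state is drawn are illustrative assumptions, not taken from the released PLSO code.

```python
import numpy as np

def simulate_plso_component(omega_j, l_j, sigma2_j, Delta, K, sigma2_nu=0.0, seed=0):
    """Sketch of one stationary PLSO component (Eq. 2).

    omega_j : center frequency in radians/sample
    l_j     : lengthscale (same units as Delta)
    sigma2_j: stationary power of the component
    """
    rng = np.random.default_rng(seed)
    rho = np.exp(-Delta / l_j)                       # damping; guarantees rho < 1
    alpha = sigma2_j * (1.0 - rho ** 2)              # state-noise variance
    R = np.array([[np.cos(omega_j), -np.sin(omega_j)],
                  [np.sin(omega_j),  np.cos(omega_j)]])
    z = np.zeros((K, 2))
    z[0] = rng.normal(scale=np.sqrt(sigma2_j), size=2)   # P_1 = sigma2_j * I
    for k in range(1, K):
        z[k] = rho * R @ z[k - 1] + rng.normal(scale=np.sqrt(alpha), size=2)
    y = z[:, 0] + rng.normal(scale=np.sqrt(sigma2_nu), size=K)  # observe the real part
    return z, y
```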

Given Eq. 2, we can readily express the frequency spectra of PLSO in each interval, through the autocovariance function. The autocovariance of \mathbf{z}_{j} is given as Q_{j}(n^{\prime})=\mathbb{E}[\mathbf{z}^{\Re}_{j,k}\mathbf{z}^{\Re}_{j,k+n^{\prime}}]=\sigma_{j}^{2}\cos(\omega_{j}n^{\prime})\exp\left(-n^{\prime}\Delta/l_{j}\right). It can also be thought of as an exponential kernel, frequency-modulated by \omega_{j}. The spectrum for \mathbf{z}_{j}, denoted as S_{j}(\omega), is obtained by taking the Fourier transform (FT) of Q_{j}(n^{\prime})

\begin{split}S_{j}(\omega)&=\sum_{n^{\prime}=-\infty}^{\infty}Q_{j}(n^{\prime})\exp\left(-i\omega n^{\prime}\right)=\varphi_{j}(\omega)+\varphi_{j}(-\omega)\\ \varphi_{j}(\omega)&=\frac{\sigma_{j}^{2}\left(1-\exp\left(-2\Delta/l_{j}\right)\right)}{1+\exp\left(-2\Delta/l_{j}\right)-2\exp\left(-\Delta/l_{j}\right)\cos(\omega-\omega_{j})},\\ \end{split}

with the detailed derivation in Appendix B.

Given S_{j}(\omega), we can show that the PSD \gamma(\omega) of the entire process \sum_{j=1}^{J}\mathbf{z}_{j} is simply \gamma(\omega)=\sum_{j=1}^{J}S_{j}(\omega). First, since \mathbf{z}_{j} is independent across j, the autocovariance can be simplified, i.e., \mathbb{E}[\sum_{j}\mathbf{z}^{\Re}_{j,k}\sum_{j}\mathbf{z}^{\Re}_{j,k+n^{\prime}}]=\sum_{j}\mathbb{E}[\mathbf{z}^{\Re}_{j,k}\mathbf{z}^{\Re}_{j,k+n^{\prime}}]. Next, using the linearity of the FT, we can conclude that \gamma(\omega) is a superposition of the individual spectra.
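The closed-form spectrum above can be evaluated directly. The short sketch below computes S_j(\omega) and sums component spectra to obtain \gamma(\omega); the helper name and the parameter values are hypothetical.

```python
import numpy as np

def component_spectrum(omega, omega_j, l_j, sigma2_j, Delta):
    """S_j(omega) = phi_j(omega) + phi_j(-omega), using the closed form above."""
    def phi(w):
        num = sigma2_j * (1.0 - np.exp(-2.0 * Delta / l_j))
        den = (1.0 + np.exp(-2.0 * Delta / l_j)
               - 2.0 * np.exp(-Delta / l_j) * np.cos(w - omega_j))
        return num / den
    return phi(omega) + phi(-omega)

# PSD of the full process is the sum of the component spectra
omegas = np.linspace(-np.pi, np.pi, 512)
params = [(0.2, 1.0, 2.0), (1.0, 0.5, 1.0)]   # hypothetical (omega_j, l_j, sigma2_j)
gamma = sum(component_spectrum(omegas, w, l, s2, Delta=1.0) for (w, l, s2) in params)
```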

3.2 PLSO for nonstationary data

If \mathbf{y} is nonstationary, we can still apply stationary PLSO of Eq. 2 for the time-domain inference. However, this implies constant spectra for the entire data (S_{j}(\omega) and \gamma(\omega) do not depend on k), which is not suitable for nonstationary time-series for which we want to track spectral dynamics. This point is further illustrated in Section 6.

We therefore segment \mathbf{y} into M non-overlapping PS intervals, indexed by m\in\{1,\ldots,M\}, of length N, indexed by n\in\{1,\ldots,N\}, such that K=MN. We then apply the stationary PLSO to each interval, with additional Markovian dynamics imposed on \sigma_{j,m}^{2},

\begin{split}\log(\sigma_{j,m}^{2})&=\log(\sigma_{j,m-1}^{2})+\eta_{j,m}\\ \widetilde{\mathbf{z}}_{j,mN+n}&=\rho_{j}\mathbf{R}(\omega_{j})\widetilde{\mathbf{z}}_{j,mN+(n-1)}+\varepsilon_{j,mN+n}\\ \mathbf{y}_{mN+n}&=\sum_{j=1}^{J}\mathbf{z}^{\Re}_{j,mN+n}+\nu_{mN+n},\\ \end{split} (3)

where \varepsilon_{j,mN+n}\sim\mathcal{N}(0,\sigma_{j,m}^{2}(1-\rho_{j}^{2})\mathbf{I}_{2\times 2}), \eta_{j,m}\sim\mathcal{N}(0,1/\sqrt{\lambda}), and \nu_{mN+n}\sim\mathcal{N}(0,\sigma_{\nu}^{2}). We define \mathbf{P}_{m,n}^{j} as the covariance of \widetilde{\mathbf{z}}_{j,mN+n}, with \mathbf{P}_{1,1}^{j}=\sigma_{j,1}^{2}\mathbf{I}_{2\times 2},\forall j. The graphical model is shown in Fig. 2.
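A minimal simulation of the nonstationary model in Eq. 3 is sketched below. We assume the random walk on \log\sigma_{j,m}^{2} uses variance 1/\lambda (the interpretation suggested by the prior in Eq. 7); all names and values are illustrative rather than taken from the PLSO implementation.

```python
import numpy as np

def simulate_plso(omegas, lengths, log_s2_init, lam, Delta, M, N, sigma2_nu, seed=0):
    """Sketch of the nonstationary PLSO generative model (Eq. 3)."""
    rng = np.random.default_rng(seed)
    J, K = len(omegas), M * N
    z = np.zeros((J, K, 2))
    y = rng.normal(scale=np.sqrt(sigma2_nu), size=K)          # observation noise
    for j, (w, l, log_s2) in enumerate(zip(omegas, lengths, log_s2_init)):
        rho = np.exp(-Delta / l)
        R = np.array([[np.cos(w), -np.sin(w)], [np.sin(w), np.cos(w)]])
        state = rng.normal(scale=np.sqrt(np.exp(log_s2)), size=2)   # P_{1,1} = sigma^2_{j,1} I
        for m in range(M):
            if m > 0:
                # stochastic continuity on log sigma^2 (variance 1/lam assumed)
                log_s2 += rng.normal(scale=np.sqrt(1.0 / lam))
            alpha = np.exp(log_s2) * (1.0 - rho ** 2)
            for n in range(N):
                state = rho * R @ state + rng.normal(scale=np.sqrt(alpha), size=2)
                k = m * N + n
                z[j, k] = state
                y[k] += state[0]                               # only the real part is observed
    return z, y
```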

Figure 2: The graphical model for PLSO.

We can understand PLSO as providing a parameterized spectrogram defined by \theta=\{\lambda,\sigma_{\nu}^{2},\{l_{j}\}_{j},\{\omega_{j}\}_{j}\} and \{\sigma_{j,m}^{2}\}_{j,m} of the time-domain generative model. The lengthscale l_{j} controls the bandwidth of the j^{\text{th}} process, with larger l_{j} corresponding to narrower bandwidth. The variance \sigma_{j,m}^{2} controls the power of \mathbf{z}_{j} and changes across different intervals, resulting in time-varying spectra S_{j}^{(m)}(\omega) and PSD \gamma^{(m)}(\omega). The center frequency \omega_{j} (and -\omega_{j}), at which S_{j}^{(m)}(\omega) is maximized, controls the modulation frequency.

As discussed previously, the segmentation approach for nonstationary time-series produces distortion/discontinuity artifacts around interval boundaries. PLSO, as described by Eq. 3, resolves these issues. We now analyze two mathematical properties, stochastic continuity and piecewise stationarity, to gain more insight into how PLSO accomplishes this.

3.2.1 Stochastic continuity

We discuss two types of stochastic continuity, 1) across the interval boundaries and 2) on \{\sigma_{j,m}^{2}\}.

Continuity across the interval boundaries In PLSO, the state-space model (Eq. 3) provides stochastic continuity across different PS intervals. The following proposition rigorously explains stochastic continuity for PLSO.

Proposition 1.

For a given m, as \Delta\rightarrow 0, the samples on either side of the interval boundary, which are \widetilde{\mathbf{z}}_{j,(m+1)N} and \widetilde{\mathbf{z}}_{j,(m+1)N+1}, converge to each other in mean square,

\lim_{\Delta\rightarrow 0}\mathbb{E}[\Delta\widetilde{\mathbf{z}}_{j,(m+1)N}\Delta\widetilde{\mathbf{z}}_{j,(m+1)N}^{\text{T}}]=0,

where we use \Delta\widetilde{\mathbf{z}}_{j,(m+1)N}=\widetilde{\mathbf{z}}_{j,(m+1)N+1}-\widetilde{\mathbf{z}}_{j,(m+1)N}.

Proof.

We use the connection between PLSO, which is a discrete-time model, and its continuous-time counterpart. Details are in Appendix C. ∎

This matches our intuition that as \Delta\rightarrow 0, adjacent samples from the same process should converge to each other. For PS approaches without sample-level continuity, even with an interval-level constraint [Rosen et al., 2009, Kim et al., 2018, Song et al., 2018, Das and Babadi, 2018, Soulat et al., 2019], convergence is not guaranteed.

We can interpret the continuity in the context of the posterior for \mathbf{z}_{j}. For PS approaches without continuity, we have

\begin{split}&p(\left\{\mathbf{z}_{j}\right\}_{j}|\mathbf{y})\\ &\propto\prod_{m=1}^{M}p(\left\{\mathbf{z}_{j,(m-1)N+1:mN}\right\}_{j}|\mathbf{y}_{(m-1)N+1:mN}),\\ \end{split} (4)

where \{\sigma_{j,m}^{2}\}_{j,m} and \theta are omitted for notational ease. This is due to p(\{\mathbf{z}_{j}\}_{j})=\prod_{m=1}^{M}p(\{\mathbf{z}_{j,(m-1)N+1:mN}\}_{j}), a result of the absence of continuity across the intervals. Consequently, the inferred time-domain estimates are conditionally independent across the intervals. In PLSO, on the contrary, the time-domain estimates depend on the entire \mathbf{y}, not just on a subset.

Continuity on \sigma_{j,m}^{2} For a given j, we impose stochastic continuity on \log\sigma_{j,m}^{2}. Effectively, this pools together estimates of \{\sigma^{2}_{j,m}\}_{m} to 1) prevent overfitting to the noisy data spectra and 2) estimate smooth dynamics of \{\sigma^{2}_{j,m}\}_{m}. The use of \log\sigma_{j,m}^{2} ensures that \sigma_{j,m}^{2} is non-negative.

The choice of \lambda dictates the smoothness of \{\sigma^{2}_{j,m}\}_{m}, with the two extremes corresponding to familiar dynamics. If \lambda\rightarrow 0, we treat each window independently. If \lambda\rightarrow\infty, we treat the data as stationary, as the constraint forces \sigma^{2}_{j,m}=\sigma_{j}^{2}, \forall m. Practically, the smooth regularization prevents artifacts in the spectral analysis, arising from sudden motion or missing data, as demonstrated in Section 6.

3.2.2 Piecewise stationarity

For the m^{\text{th}} window to be piecewise stationary, the initial state covariance matrix \mathbf{P}_{m,1}^{j} should be the steady-state covariance matrix for the window, denoted as \mathbf{P}_{m,\infty}^{j}.

The challenge is transitioning from \mathbf{P}_{m,\infty}^{j} to \mathbf{P}_{m+1,\infty}^{j}. Specifically, to ensure \mathbf{P}_{m+1,1}^{j}=\mathbf{P}_{m+1,\infty}^{j}, given that \mathbf{P}_{m,N}^{j}=\mathbf{P}_{m,\infty}^{j}\neq\mathbf{P}_{m+1,\infty}^{j}, the variance of the process noise between the two samples, \varepsilon_{j,(m+1)N+1}, has to equal \mathbf{P}_{m+1,\infty}^{j}-\exp\left(-2\Delta/l_{j}\right)\mathbf{P}_{m,\infty}^{j}. However, this is infeasible. If \mathbf{P}_{m+1,\infty}^{j}<\mathbf{P}_{m,\infty}^{j}, the variance is negative. Even if it were positive, its limit as \Delta\rightarrow 0, namely \mathbf{P}_{m+1,\infty}^{j}-\mathbf{P}_{m,\infty}^{j}, does not equal zero. As a result, Proposition 1 no longer holds and the trajectory is discontinuous.

In summary, there exists a trade-off between maintaining piecewise stationarity and continuity across intervals. PLSO maintains continuity across the intervals while ensuring that the state covariance quickly transitions to the steady-state covariance. We quantify the speed of transition in the following proposition.

Proposition 2.

Assume l_{j}\ll N\Delta, such that \mathbf{P}_{m,N}^{j}=\mathbf{P}_{m,\infty}^{j}. In Eq. 3, the difference between \mathbf{P}_{m,\infty}^{j} and \mathbf{P}_{m+1,\infty}^{j} decays exponentially fast as a function of n,

\mathbf{P}_{m+1,n}^{j}=\mathbf{P}_{m+1,\infty}^{j}+\exp\left(-\frac{2n\Delta}{l_{j}}\right)\left(\mathbf{P}_{m,\infty}^{j}-\mathbf{P}_{m+1,\infty}^{j}\right).
Proof.

In Appendix D, we prove this result by induction.

This implies that, except for the transition portion at the beginning of each window, we can assume stationarity. In practice, we additionally impose an upper bound on l_{j} during estimation and also use a reasonably large N. Empirically, we observe that the transition period has little impact.
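Since \mathbf{P}_{m,n}^{j} is isotropic, Proposition 2 can be checked with a scalar recursion: propagate the predicted state variance across a boundary with the new state-noise level and compare against the closed form. The numbers below are hypothetical; this is a verification sketch, not part of the inference algorithm.

```python
import numpy as np

Delta, l_j = 1e-3, 0.05
rho = np.exp(-Delta / l_j)
P_prev, P_next = 2.0, 0.5      # hypothetical steady-state variances of intervals m and m+1
P = P_prev
for n in range(1, 51):
    # Kalman prediction step with the new interval's state-noise variance
    P = rho ** 2 * P + P_next * (1.0 - rho ** 2)
    closed_form = P_next + np.exp(-2 * n * Delta / l_j) * (P_prev - P_next)
    assert np.isclose(P, closed_form)   # matches Proposition 2 at every n
```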

4 Inference

Given the generative model in Eq. 3, our goal is to perform inference on the posterior distribution

\begin{split}&p(\left\{\mathbf{z}_{j}\right\}_{j},\{\sigma_{j,m}^{2}\}_{j,m}\mid\mathbf{y},\theta)\\ &=\underbrace{p(\{\sigma_{j,m}^{2}\}_{j,m}\mid\mathbf{y},\theta)}_{\text{window-level posterior}}\cdot\underbrace{p(\left\{\mathbf{z}_{j}\right\}_{j}\mid\{\sigma_{j,m}^{2}\}_{j,m},\mathbf{y},\theta)}_{\text{sample-level posterior}}.\\ \end{split} (5)

We can learn \theta or fix the parameters to specific values informed by domain knowledge, such as the center frequency or bandwidth of the processes. The posterior distribution factorizes into two terms as in Eq. 5, the window-level posterior p(\{\sigma_{j,m}^{2}\}_{j,m}|\mathbf{y},\theta) and the sample-level posterior p(\{\mathbf{z}_{j}\}_{j}|\{\sigma_{j,m}^{2}\}_{j,m},\mathbf{y},\theta). Accordingly, we break the inference into two stages.

Stage 1 We minimize the window-level negative log-posterior, with respect to \theta and \{\sigma_{j,m}^{2}\}_{j,m}. Specifically, we obtain the maximum likelihood (ML) estimate \hat{\theta}_{\text{ML}} and the maximum a posteriori (MAP) estimate \{\widehat{\sigma}^{2}_{j,m,\text{MAP}}\}_{j,m}. We drop the subscripts ML and MAP for notational simplicity.

Stage 2 Given \{\widehat{\sigma}^{2}_{j,m}\}_{j,m} and \hat{\theta}, we perform inference on the sample-level posterior. This includes computing the mean \widehat{\mathbf{z}}_{j}=\mathbb{E}[\mathbf{z}_{j}|\{\widehat{\sigma}^{2}_{j,m}\}_{j,m},\mathbf{y},\hat{\theta}] and credible intervals, which quantify the uncertainty of the estimates, or any statistical quantity derived from the posterior distribution.

4.1 Optimization of \{\sigma_{j,m}^{2}\}_{j,m} and \theta

We factorize the posterior, p(\{\sigma_{j,m}^{2}\}_{j,m}\mid\mathbf{y},\theta)\propto p(\mathbf{y}\mid\{\sigma_{j,m}^{2}\}_{j,m},\theta)\cdot p(\{\sigma_{j,m}^{2}\}_{j,m}\mid\theta). As the exact inference is intractable, we instead minimize the negative log-posterior, -\log p(\{\sigma_{j,m}^{2}\}_{j,m}\mid\mathbf{y},\theta). This is an empirical Bayes approach [Casella, 1985], since we estimate \{\sigma_{j,m}^{2}\}_{j,m} using the marginal likelihood p(\mathbf{y}\mid\{\sigma_{j,m}^{2}\}_{j,m},\theta). The smooth hyperprior provides the MAP estimate for \{\sigma_{j,m}^{2}\}_{j,m}.

We use the Whittle likelihood [Whittle, 1953], defined for stationary time-series in the frequency domain, for the log-likelihood f(\{\sigma_{j,m}^{2}\}_{j,m};\theta)=\log p(\mathbf{y}\mid\{\sigma_{j,m}^{2}\}_{j,m},\theta),

\begin{split}&f(\{\sigma_{j,m}^{2}\}_{j,m};\theta)\\ &=-\frac{1}{2}\sum_{m,n=1}^{M,N}\log(\gamma^{(m)}(\omega_{n})+\sigma_{\nu}^{2})+\frac{I^{(m)}(\omega_{n})}{\gamma^{(m)}(\omega_{n})+\sigma_{\nu}^{2}},\\ \end{split} (6)

where the log-likelihood is the sum of the Whittle likelihood computed for each interval, with discrete frequencies \omega_{n}=2\pi n/N and data STFT I^{(m)}(\omega_{n})=\left|\sum_{n^{\prime}=1}^{N}\exp\left(-2\pi i(n^{\prime}-1)(n-1)/N\right)\mathbf{y}_{mN+n^{\prime}}\right|^{2}. The Whittle likelihood, which is nonconvex, enables frequency-domain parameter estimation as a computationally more efficient alternative to time-domain estimation [Turner and Sahani, 2014]. The concave log-prior g(\{\sigma_{j,m}^{2}\}_{j,m};\theta)=\log p(\{\sigma_{j,m}^{2}\}_{j,m}\mid\theta), which arises from the continuity on \{\sigma_{j,m}^{2}\}_{j,m}, is given as

g(\{\sigma_{j,m}^{2}\}_{j,m};\theta)=-\frac{\lambda}{2}\sum_{j=1}^{J}\sum_{m=1}^{M}\left(\log\sigma^{2}_{j,m}-\log\sigma^{2}_{j,m-1}\right)^{2}. (7)

This yields the following nonconvex problem

\begin{split}&\min_{\{\sigma_{j,m}^{2}\}_{j,m},\theta}\,-\log p\left(\{\sigma_{j,m}^{2}\}_{j,m}\mid\mathbf{y},\theta\right)\\ &=\min_{\{\sigma_{j,m}^{2}\}_{j,m},\theta}\,-f(\{\sigma_{j,m}^{2}\}_{j,m};\theta)-g(\{\sigma_{j,m}^{2}\}_{j,m};\theta).\\ \end{split} (8)

We optimize Eq. 8 by block coordinate descent [Wright, 2015] on \{\sigma_{j,m}^{2}\}_{j,m} and \{\sigma_{\nu}^{2},\{l_{j}\}_{j},\{\omega_{j}\}_{j}\}. For \sigma_{\nu}^{2}, \{l_{j}\}_{j}, and \{\omega_{j}\}_{j}, we minimize -f(\{\sigma_{j,m}^{2}\}_{j,m};\theta), since g(\{\sigma_{j,m}^{2}\}_{j,m};\theta) does not affect them. We perform conjugate gradient descent on \{l_{j}\}_{j} and \{\omega_{j}\}_{j}. We discuss initialization and how to estimate \sigma_{\nu}^{2} in Appendix E.
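As a reference point for the objective being minimized, the sketch below evaluates -f-g of Eq. 8 from per-interval periodograms. Here S_funcs is a hypothetical helper that returns the closed-form component spectrum S_j^{(m)}(\omega_n) for a given power; it is not part of the released code.

```python
import numpy as np

def neg_log_posterior(log_s2, S_funcs, I, sigma2_nu, lam):
    """Sketch of -f - g in Eq. 8.

    log_s2 : (J, M) array of log sigma_{j,m}^2
    S_funcs: callable S_funcs(j, m, sigma2) -> length-N array S_j^{(m)}(omega_n)
    I      : (M, N) array of per-interval periodograms I^{(m)}(omega_n)
    """
    J, M = log_s2.shape
    nll = 0.0
    for m in range(M):
        gamma_m = sum(S_funcs(j, m, np.exp(log_s2[j, m])) for j in range(J)) + sigma2_nu
        nll += 0.5 * np.sum(np.log(gamma_m) + I[m] / gamma_m)   # -f, Whittle term (Eq. 6)
    diff = np.diff(log_s2, axis=1)
    nll += 0.5 * lam * np.sum(diff ** 2)                        # -g, smoothness prior (Eq. 7)
    return nll
```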

4.1.1 Optimization of \{\sigma_{j,m}^{2}\}_{j,m}

We introduce an algorithm to compute a locally optimal solution of \{\sigma_{j,m}^{2}\}_{j,m} for the nonconvex optimization problem in Eq. 8, by leveraging the regularized temporal structure of \{\sigma_{j,m}^{2}\}_{j,m}. It extends the inexact accelerated proximal gradient (APG) method [Li and Lin, 2015] by solving the proximal step with a Kalman filter/smoother [Kalman, 1960]. This follows since computing the proximal operator for g(\{\sigma_{j,m}^{2}\}_{j,m};\theta) is equivalent to MAP estimation for J independent 1-dimensional linear Gaussian state-space models

\begin{split}&\{\log\sigma^{(l+1),2}_{j,m}\}_{j,m}=\operatorname{prox}_{-\alpha^{(l)}g}(\mathbf{v}^{(l)})\\ &=\operatorname*{arg\,min}_{\{\sigma_{j,m}^{2}\}_{j,m}}\underbrace{\frac{\sum_{j,m}^{J,M}(\mathbf{v}_{j,m}^{(l)}-\log\sigma^{2}_{j,m})^{2}}{2\alpha^{(l)}}-g(\{\sigma_{j,m}^{2}\}_{j,m})}_{\sum_{j=1}^{J}q_{j}}\\ \end{split} (9)

where \alpha^{(l)}>0 is a step-size for the l^{\text{th}} iteration, q_{j}=\sum_{m=1}^{M}\frac{(\mathbf{v}^{(l)}_{j,m}-\log\sigma^{2}_{j,m})^{2}}{2\alpha^{(l)}}+\frac{\lambda}{2}(\log\sigma^{2}_{j,m}-\log\sigma^{2}_{j,m-1})^{2}, and \mathbf{v}^{(l)}_{j,m}=\log\sigma_{j,m}^{(l),2}+\alpha^{(l)}\frac{\partial f(\{\sigma_{j,m}^{2}\}_{j,m})}{\partial\log\sigma^{2}_{j,m}}\Big|_{\{\sigma^{(l),2}_{j,m}\}_{j,m}}.

The j^{\text{th}} optimization problem, \min_{\{\sigma_{j,m}^{2}\}_{m}}q_{j}, is equivalent to estimating the mean of the posterior for \{\log\sigma^{2}_{j,m}\}_{m} in a linear Gaussian state-space model with observations \{\mathbf{v}^{(l)}_{j,m}\}_{m}, observation noise variance \alpha^{(l)}, and state variance 1/\lambda. Therefore, the solution can be computed efficiently with J 1-dimensional Kalman filters/smoothers, with a computational complexity of O(JM).

Note that Eq. 9 holds for all non-negative \lambda. If \lambda=0, the proximal operator is an identity operator, as \log\sigma^{(l+1),2}_{j,m}=\mathbf{v}^{(l)}_{j,m}. This is gradient descent with a step-size rule. If \lambda\rightarrow\infty, we have \log\sigma^{2}_{j,m}=\log\sigma^{2}_{j,m-1}, \forall m, which leads to \log\sigma_{j,m}^{(l+1),2}=(1/M)\sum_{m=1}^{M}\mathbf{v}^{(l)}_{j,m}. The algorithm is guaranteed to converge when \alpha^{(l)}<1/C, where C is the Lipschitz constant of \nabla f(\{\sigma_{j,m}^{2}\}_{j,m};\theta). In practice, we select \alpha^{(l)} according to the step-size rule of Barzilai and Borwein [1988]. In Appendix F & G, we present the full algorithm for optimizing \{\sigma_{j,m}^{2}\}_{j,m} and a derivation for C.
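The proximal step of Eq. 9 reduces, for each j, to smoothing in a 1-dimensional random-walk state-space model. A minimal sketch is given below, assuming a diffuse prior on the first window; it illustrates the Kalman filter/RTS smoother structure rather than reproducing the released implementation.

```python
import numpy as np

def prox_1d_kalman(v, alpha, lam):
    """Sketch of prox_{-alpha g} for one component j (Eq. 9).

    Solves argmin_x sum_m (v_m - x_m)^2 / (2 alpha) + (lam/2) sum_m (x_m - x_{m-1})^2
    via a 1-D random-walk Kalman filter/RTS smoother: observation variance alpha,
    state variance 1/lam; x_m plays the role of log sigma_{j,m}^2.
    """
    M = len(v)
    q, r = 1.0 / lam, alpha
    x_f, P_f = np.zeros(M), np.zeros(M)
    x_pred, P_pred = v[0], 1e6                      # diffuse prior on the first window (assumption)
    for m in range(M):
        if m > 0:
            x_pred, P_pred = x_f[m - 1], P_f[m - 1] + q
        K = P_pred / (P_pred + r)                   # Kalman gain
        x_f[m] = x_pred + K * (v[m] - x_pred)
        P_f[m] = (1.0 - K) * P_pred
    x_s = x_f.copy()
    for m in range(M - 2, -1, -1):                  # RTS backward pass
        G = P_f[m] / (P_f[m] + q)
        x_s[m] = x_f[m] + G * (x_s[m + 1] - x_f[m])
    return x_s
```

Running this smoother once per j gives the O(JM) cost of the proximal step noted above.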

4.2 Inference for p(\left\{\mathbf{z}_{j}\right\}_{j}\mid\{\sigma_{j,m}^{2}\}_{j,m},\mathbf{y},\theta)

We perform inference on the posterior distribution p(\left\{\mathbf{z}_{j}\right\}_{j}\mid\{\widehat{\sigma}_{j,m}^{2}\}_{j,m},\mathbf{y},\hat{\theta}). Since this is a Gaussian distribution, the mean trajectories \{\widehat{\mathbf{z}}_{j}\}_{j} and the credible intervals can be computed analytically. Moreover, since Eq. 3 is a linear Gaussian state-space model, we can use the Kalman filter/smoother for efficient computation with computational complexity O(J^{2}K), further discussed in Appendix H. Since we use the point estimate \{\widehat{\sigma}^{2}_{j,m}\}_{j,m}, the credible interval for \{\widehat{\mathbf{z}}_{j}\}_{j} will be narrower compared to the full Bayesian setting, which accounts for all values of \{\sigma_{j,m}^{2}\}_{j,m}.

4.2.1 Monte Carlo Inference

We can also obtain posterior samples and perform Monte Carlo (MC) inference on any posterior-derived quantity. To generate the MC trajectory samples, we use the forward-filter backward-sampling (FFBS) algorithm [Carter and Kohn, 1994]. Assuming S MC samples, the computational complexity of FFBS is O(SJ^{2}K), since for each sample the algorithm uses a Kalman filter/smoother. This is different from generating samples with the interval-specific posterior in Eq. 4. In the latter case, the FFBS algorithm is run M times, and the resulting samples have to be concatenated to form an entire trajectory. With PLSO, the trajectory sample is conditioned on the entire observation and is continuous across the intervals.

One quantity of interest is the phase. We obtain the phase as \phi_{j,k}=\tan^{-1}(\mathbf{z}^{\Im}_{j,k}/\mathbf{z}^{\Re}_{j,k}). Since \tan^{-1}(\cdot) is a non-linear operation, we compute the mean and credible interval with MC samples through the FFBS algorithm. Given the posterior-sampled trajectories \{\mathbf{z}_{j}^{(s)}\}_{j,s}, where s\in\{1,\ldots,S\} denotes the MC sample index, we estimate \widehat{\phi}_{j,k}=(1/S)\sum_{s=1}^{S}\tan^{-1}(\mathbf{z}^{\Im,(s)}_{j,k}/\mathbf{z}^{\Re,(s)}_{j,k}), and use empirical quantiles for the associated credible interval.
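A sketch of the MC phase estimate from FFBS sample trajectories is shown below; it follows the sample-averaging formula above and uses empirical quantiles for the credible interval. Note that naive averaging ignores wrap-around near \pm\pi, so circular statistics may be preferable in practice.

```python
import numpy as np

def phase_from_samples(z_samples):
    """Sketch of MC phase estimation from FFBS sample trajectories.

    z_samples : (S, K, 2) array of posterior samples [Re z_{j,k}, Im z_{j,k}]
    Returns the sample-averaged phase and a pointwise 95% credible interval.
    """
    phases = np.arctan2(z_samples[..., 1], z_samples[..., 0])   # (S, K) phase per sample
    phase_mean = phases.mean(axis=0)
    ci_low, ci_high = np.quantile(phases, [0.025, 0.975], axis=0)
    return phase_mean, (ci_low, ci_high)
```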

4.3 Choice of J and \lambda

We choose J that minimizes the Akaike Information Criterion (AIC) [Akaike, 1981], defined as

\operatorname{AIC}(J)=-(2/M)\cdot\log p(\mathbf{y}\mid\{\widehat{\sigma}_{j,m}^{2}\}_{j,m},\hat{\theta})+2\cdot 3\cdot J, (10)

where 3\cdot J corresponds to the number of parameters (\{l_{j}\}_{j},\{\omega_{j}\}_{j},\{\sigma^{2}_{j,m}\}_{j}). The regularization parameter \lambda is determined through a two-fold cross-validation, where each fold is generated by aggregating even/odd sample indices [Ba et al., 2014].
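The AIC-based choice of J can be organized as in the sketch below, where fit_plso and log_marginal_lik are hypothetical stand-ins for Stage 1 of the inference procedure and the Whittle marginal likelihood.

```python
def select_J(candidate_Js, fit_plso, log_marginal_lik, M):
    """Sketch of AIC-based selection of J (Eq. 10)."""
    aic = {}
    for J in candidate_Js:
        sigma2_hat, theta_hat = fit_plso(J)              # Stage 1 estimates (hypothetical routine)
        aic[J] = -(2.0 / M) * log_marginal_lik(sigma2_hat, theta_hat) + 2 * 3 * J
    return min(aic, key=aic.get), aic
```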

4.4 Choice of window length N

The choice of window length N presents a tradeoff between 1) spectral resolution and 2) the temporal resolution of the spectral dynamics [Oppenheim et al., 2009]. For a shorter window, the estimated spectral dynamics have a finer temporal resolution, coarser spectral resolution, and higher variance. For a longer window, these trends are reversed. This suggests that the choice is application-dependent. For electrophysiology data, a window on the order of seconds is used, as scientific interpretations are made on the basis of fine spectral resolution (< 1 Hz). For audio signal processing [Gold et al., 2011], short windows (10\sim 100 ms) are used, since audio data are highly nonstationary and thus require fine temporal resolution for processing. A survey of window lengths used in different applications can be found in the supporting information of Kim et al. [2018].

5 Related works

We examine how PLSO relates to other nonstationary frameworks.

STFT/Regularized STFT The STFT uses a harmonic basis, whereas PLSO uses quasi-periodic components, which allows broader spectral content to be captured. Recent works regularize STFT coefficients with stochastic continuity across the windows to infer smooth spectral dynamics [Ba et al., 2014, Kim et al., 2018]. However, this regularization leads to discontinuities at window boundaries.

Piecewise stationary GP GP regression and parameter estimation are performed within each interval [Gramacy and Lee, 2008, Solin et al., 2018]. Consequently, the recovered trajectories are discontinuous. Also, the inversion of the covariance matrix leads to expensive inference. For example, the time-complexity of mean trajectory estimation is O(MN^{3})=O(N^{2}K), whereas the time-complexity for PLSO is O(J^{2}K). Considering that the typical sampling frequency for electrophysiology data is \sim 10^{3} Hz and windows are several seconds long, which leads to N\sim 10^{3}, PLSO is computationally more efficient. In Appendix I, we confirm this through an experiment.

Time-varying Autoregressive model (TVAR) The TVAR model [Kitagawa and Gersch, 1985] is given as

\mathbf{y}_{k}=\sum_{p=1}^{P}a_{p,k}\mathbf{y}_{k-p}+\varepsilon_{k},

with the time-varying coefficients \{a_{p,k}\}_{p}. Consequently, it does not suffer from discontinuity issues. TVAR can also be approximately decomposed into oscillatory components via eigen-decomposition of the transition matrix [West et al., 1999]. However, since the eigen-decomposition changes at every sample, this leads to an ambiguous interpretation of the oscillations in the data, as we discuss in Section 6.

RNN frameworks Despite the popularity of recurrent neural networks (RNN) for time-series applications [Goodfellow et al., 2016], we believe PLSO is more appropriate for time-frequency analysis for two reasons.

  1. RNNs operate in the time-domain with the goal of prediction/denoising and consequently less emphasis is placed on local stationarity or estimation of second-order statistics. Performing time-frequency analysis requires segmenting the RNN outputs and applying the STFT, which yields noisy spectral estimates.

  2. RNNs are not generative frameworks. Although a variational framework can be combined with an RNN [Chung et al., 2015, Krishnan et al., 2017], the use of a variational lower bound objective could lead to suboptimal results. On the other hand, PLSO is a generative framework that maximizes the true log-posterior.

6 Experiments

We apply PLSO to three settings: 1) a simulated dataset, 2) local field potential (LFP) data from the rat hippocampus, and 3) EEG data from a human subject under anesthesia.

We use PLSO with \lambda=0, \lambda\rightarrow\infty, and \lambda determined by cross-validation, \lambda_{\operatorname{CV}}. We use interval lengths chosen by domain experts. As baselines, we use 1) regularized STFT (STFT-reg.) and 2) piecewise stationary GP (GP-PS). For GP-PS, we use the same \{\widehat{\sigma}_{j,m}^{2}\}_{j,m} and \widehat{\theta} as PLSO with \lambda=0, so that the estimated PSDs of GP-PS and PLSO are identical. This lets us explain differences in time-domain estimates by the fact that PLSO operates in the time-domain.

6.1 Simulated dataset

We simulate from the following model for 1\leq k\leq K

\mathbf{y}_{k}=10\left(\frac{K-k}{K}\right)\mathbf{z}^{\Re}_{1,k}+10\cos^{4}(2\pi\omega_{0}k)\mathbf{z}^{\Re}_{2,k}+\nu_{k},

where \mathbf{z}_{1,k} and \mathbf{z}_{2,k} are as in Eq. 2, with (\omega_{0},\omega_{1},\omega_{2})=(0.04,1,10) Hz, f_{s}=200 Hz, T=100 seconds, l_{1}=l_{2}=1, and \nu_{k}\sim\mathcal{N}(0,25). This nonstationary process comprises two amplitude-modulated oscillations, namely one modulated by a slow-frequency sinusoid and the other by a linearly-decreasing signal [Ba et al., 2014]. We simulate 20 realizations and train on each realization, assuming 2-second PS intervals. For PLSO, we use J=2. Additional details are provided in Appendix J.

Results We use two metrics: 1) mean squared error (MSE) between the mean estimate \hat{\mathbf{z}}_{j} and the ground truth \mathbf{z}_{j}^{\text{True}} and 2) \operatorname{jump}(\mathbf{z}_{j}), defined as \operatorname{jump}(\mathbf{z}_{j})=\frac{1}{M-1}\sum_{m=1}^{M-1}\left|\hat{\mathbf{z}}_{j,mN+1}-\hat{\mathbf{z}}_{j,mN}\right|, which measures the level of discontinuity at the interval boundaries. If \operatorname{jump}(\mathbf{z}_{j}) greatly exceeds \operatorname{jump}(\mathbf{z}_{j}^{\text{True}}), this implies the existence of large discontinuities at the boundaries. The averaged results are shown in Table 1.
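For reference, the jump metric can be computed as in the sketch below (0-based indexing; the function name is ours, not from the PLSO code).

```python
import numpy as np

def jump_metric(z_hat, N):
    """Sketch of the boundary-discontinuity metric jump(z_j).

    z_hat : length-K estimated trajectory (real part), with K = M * N
    N     : PS interval length in samples
    """
    K = len(z_hat)
    M = K // N
    # |difference across each interior interval boundary|, averaged over m = 1..M-1
    idx = np.arange(1, M) * N
    return np.mean(np.abs(z_hat[idx] - z_hat[idx - 1]))
```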

Figure 3: Spectrograms for simulation (in dB). (a) True data (b) True spectrogram (c) regularized STFT (d) PLSO with \lambda\rightarrow\infty (e) PLSO with \lambda=0 (f) PLSO with \lambda=\lambda_{\operatorname{CV}}.
Figure 4: Result of analyses of hippocampal data. (a) Theta phase distribution of population neuron spikes, computed with bandpass-filtered LFP (black) and the PLSO estimate of \hat{\mathbf{z}}_{2} with credible intervals estimated from 200 posterior samples (red). Horizontal gray line indicates the uniform distribution. (b-c) Spectrogram (in dB) for 500 seconds: (b) STFT (c) PLSO with \lambda_{\operatorname{CV}}. Learned frequencies are (\widehat{\omega}_{1},\widehat{\omega}_{2},\widehat{\omega}_{3})=(2.99,7.62,15.92) Hz, with \widehat{\omega}_{4}\sim\widehat{\omega}_{5}>25 Hz. (d-e) Time-domain results: (d) reconstructed signal (e) phase for \widehat{\mathbf{z}}_{2} and interval boundary (vertical gray), with bandpass-filtered data (dotted black), STFT-reg. (blue), and PLSO (red). Shaded area represents the 95% credible interval from S=200 sample trajectories.
Table 1: Simulation results. For \operatorname{jump}(\mathbf{z}_{j}) and MSE, left/right metrics correspond to \mathbf{z}_{1}/\mathbf{z}_{2}, respectively.

                            jump(z_j)      MSE          IS div.
  Truth                     0.95/12.11     0/0          0
  \lambda=0                 0.26/10.15     2.90/3.92    4.08
  \lambda\rightarrow\infty  0.22/10.32     3.26/4.53    13.78
  \lambda=\lambda_{CV}      0.25/10.21     2.88/3.91    3.93
  STFT-reg.                 49.59/81.00    6.89/10.68   N/A
  GP-PS                     16.99/23.28    3.00/4.04    4.08

Fig. 3 shows the true data in the time domain and spectrogram results. Fig. 3(c) shows that although the regularized STFT detects activities around 1 and 10 Hz, it fails to delineate the time-varying spectral pattern. Fig. 3(d) shows that PLSO with the stationarity (\lambda\rightarrow\infty) assumption is too restrictive. Fig. 3(e), (f) show that both PLSO with the independent window assumption (\lambda=0) and PLSO with cross-validated \lambda (\lambda=\lambda_{\operatorname{CV}}) are able to capture the dynamic pattern, with the latter being more effective in recovering the smooth dynamics across different PS intervals.

For GP-PS and STFT-reg., \operatorname{jump}(\mathbf{z}_{j}) exceeds \operatorname{jump}(\mathbf{z}_{j}^{\text{True}}), indicating discontinuities at the boundaries. An example is in Fig. 1. PLSO produces a similar jump metric as the ground truth metric, indicating the absence of discontinuities. We attribute the lower value to Kalman smoothing. For the TF domain, we use the Itakura-Saito (IS) divergence [Itakura and Saito, 1970] as a distance measure between the ground truth spectra and the PLSO estimates. That the highest divergence is given by \lambda\rightarrow\infty indicates the inaccuracy of the stationarity assumption.

6.2 LFP data from the rat hippocampus

We use LFP data collected from the rat hippocampus during open field tasks [Mizuseki et al., 2009], with T=1{,}600 seconds and f_{s}=1{,}250 Hz (we use channel 1 of mouse ec013.528 for the LFP; the population spikes were simultaneously recorded). The theta neural oscillation band (5\sim 10 Hz) is believed to play a role in coordinating the firing of neurons in the entorhinal-hippocampal system and is important for understanding the local circuit computation.

We fit PLSO with J=5, which minimizes the AIC as shown in Table 2, with 2-second PS intervals. The estimated \widehat{\omega}_{2} is 7.62 Hz, in the theta band. To obtain the phase for non-PLSO methods, we perform the Hilbert transform on the theta-band reconstructed signal. With no ground truth, we bandpass-filter (BPF) the data in the theta band for reference.

Table 2: AIC as a function of J for the hippocampus data

  J     1      2      3      4      5      6
  AIC   2882   2593   2566   2522   2503   2505

Spike-phase coupling Fig. 4(a) shows the theta phase distribution of population neuron spikes in the hippocampus. The PLSO-estimated distribution (red) confirms the original results analyzed with the bandpass-filtered signal (black) [Mizuseki et al., 2009]: the hippocampal spikes show a strong preference for a specific phase of the theta band, \pi for this dataset. Since PLSO provides posterior sample-trajectories for the entire time-series, we can compute as many realizations of the phase distribution as the number of MC samples. The resulting credible interval excludes the uniform distribution (horizontal gray), suggesting that the strong phase preference is statistically significant.

Denoised spectrogram Fig. 4(b-c) shows the estimated spectrogram. We observe that PLSO denoises the spectrogram, while retaining sustained power at \widehat{\omega}_{2}=7.62 Hz and weaker bursts at (\widehat{\omega}_{1},\widehat{\omega}_{3})=(2.99,15.92) Hz.

Time domain discontinuity Fig. 4(d-e) shows a segment of the estimated signal and phase near a boundary for \widehat{\omega}_{2}. While the estimates from STFT-reg. (blue) and PLSO (red) follow the BPF result closely, the STFT-reg. estimates exhibit discontinuity/distortion near the boundary. In Fig. 4(e), the phase jump at the boundary is 38.4 degrees. We also computed \operatorname{jump}(\phi_{2}) in degrees/sample. Considering that the theta band progresses roughly 2.16\,(=7.5\,\text{Hz}\times 360/1250\,\text{Hz}) degrees/sample, we observe that BPF (2.23), as expected, and PLSO (\lambda_{\operatorname{CV}}: 2.40, \lambda\rightarrow\infty: 2.66) are not affected by the boundary effect. This is not the case for STFT-reg. (26.83) and GP-PS (25.91).

Figure 5: Hippocampus data. (a) Time-varying \widehat{\omega}_{1} for TVAR. (b-d) Inferred mean trajectory (red) for (b) TVAR j=1, (c) PLSO j=1, and (d) PLSO j=2, with raw data (black).

Comparison with TVAR Fig. 5(a-b) shows a segment of the TVAR inference results (details of the TVAR model are in Appendix K). Specifically, Fig. 5(a) and (b) show the time-varying center frequency \widehat{\omega}_{1} and the corresponding reconstruction for the lowest-frequency component. Note that the eigenvalues, which correspond to \{\omega_{j}\}_{j}, and the eigenvectors, which are used for the oscillatory decomposition, are derived from the estimated TVAR transition matrix. Consequently, we cannot explicitly control \{\omega_{j}\}_{j}, as shown in Fig. 5(a), the bandwidth of each component, or the number of components J. This is further complicated by the fact that the transition matrix changes at every sample. Fig. 5(b) shows that this ambiguity results in the lowest-frequency component of TVAR explaining both the slow (0.1\sim 2 Hz) and theta components. With PLSO, on the contrary, we can explicitly specify or learn the parameters. Fig. 5(c-d) demonstrates that PLSO is able to delineate the slow/theta components without any discontinuity.

6.3 EEG data from the human brain under propofol anesthesia

We apply PLSO to EEG data from a subject anesthetized with the anesthetic drug propofol, to assess whether PLSO can leverage regularization to recover the smooth spectral dynamics widely observed during propofol-induced unconsciousness [Purdon et al., 2013]. (The EEG recording is part of de-identified data collected from patients at Massachusetts General Hospital (MGH) as a part of an MGH Human Research Committee-approved protocol.) The data last T=2{,}300 seconds, sampled at f_{s}=250 Hz. The drug infusion starts at t=0 and the subject loses consciousness around t=260 seconds. We use J=6 and assume a 4-second PS interval.

Figure 6: Spectrogram (in dB) under propofol anesthesia. PLSO with (a) \lambda=0 (b) \lambda=\lambda_{\operatorname{CV}} (c) \lambda\rightarrow\infty.

Smooth spectrogram Fig. 6(a-b) shows a segment of the PLSO-estimated spectrogram with \lambda=0 and \lambda=\lambda_{\operatorname{CV}}. They identify strong slow (0.1\sim 2 Hz) and alpha oscillations (8\sim 15 Hz), both well-known signatures of propofol-induced unconsciousness. We also observe that the alpha band power diminishes between 1{,}200 and 1{,}350 seconds, suggesting that the subject regained consciousness before becoming unconscious again. PLSO with \lambda=0 exhibits PSD fluctuation across windows, since \{\sigma_{j,m}^{2}\}_{j,m} are estimated independently. The stationary PLSO (\lambda\rightarrow\infty) is restrictive and fails to capture the spectral dynamics (Fig. 6(c)). In contrast, PLSO with \lambda_{\operatorname{CV}} exhibits smooth dynamics by pooling together estimates from the neighboring windows. The regularization also helps remove movement-related artifacts, shown as vertical lines in Fig. 6(a) around 700\sim 800 and 1{,}200 seconds, and spurious power in the 20\sim 25 Hz band. In summary, PLSO with regularization enables smooth spectral dynamics estimation and spurious noise removal.

7 Conclusion

We presented the Piecewise Locally Stationary Oscillation (PLSO) framework to model nonstationary time-series data with slowly time-varying spectra as the superposition of piecewise stationary (PS) oscillatory components. PLSO strikes a balance between stochastic continuity of the data across PS intervals and stationarity within each interval. For inference, we introduced an algorithm that combines Kalman theory and nonconvex optimization. Applications to simulated and real data show that PLSO preserves time-domain continuity and captures time-varying spectra. Future directions include 1) the automatic identification of PS intervals and 2) the extension to higher-order autoregressive models and to diverse priors on the parameters that enforce continuity across intervals.

Acknowledgements.
We thank Abiy Tasissa for helpful discussions on the algorithm and the derivations. This research is supported by NIH grant P01 GM118629 and support from the JPB Foundation. A.H.S. acknowledges support from the Samsung Scholarship.

References

  • Adak [1998] S. Adak. Time-dependent spectral analysis of nonstationary time series. Journal of the American Statistical Association, 93(444):1488–1501, 1998.
  • Akaike [1981] H. Akaike. Likelihood of a model and information criteria. Journal of Econometrics, 16(1):3–14, 1981.
  • Ba et al. [2014] D. Ba, B. Babadi, P. L. Purdon, and E. N. Brown. Robust spectrotemporal decomposition by iteratively reweighted least squares. Proceedings of the National Academy of Sciences, 111(50):E5336–E5345, 2014.
  • Barzilai and Borwein [1988] J. Barzilai and J. M. Borwein. Two-Point Step Size Gradient Methods. IMA Journal of Numerical Analysis, 8(1):141–148, 01 1988.
  • Beck et al. [2018] A. M. Beck, E. P. Stephen, and P. L. Purdon. State space oscillator models for neural data analysis. In 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pages 4740–4743, 2018.
  • Brown et al. [2004] E. N. Brown, V. Solo, Y. Choe, and Z. Zhang. Measuring period of human biological clock: infill asymptotic analysis of harmonic regression parameter estimates. Methods in enzymology, 383:382–405, 2004.
  • Carter and Kohn [1994] C. K. Carter and R. Kohn. On gibbs sampling for state space models. Biometrika, 81(3):541–553, 1994.
  • Casella [1985] G. Casella. An introduction to empirical bayes data analysis. The American Statistician, 39(2):83–87, 1985.
  • Cemgil and Godsill [2005] A. T. Cemgil and S. J. Godsill. Probabilistic phase vocoder and its application to interpolation of missing values in audio signals. In 2005 13th European Signal Processing Conference, pages 1–4, 2005.
  • Chung et al. [2015] J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio. A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.
  • Dahlhaus [1997] R. Dahlhaus. Fitting time series models to nonstationary processes. Ann. Statist., 25(1):1–37, 02 1997.
  • Das and Babadi [2018] P. Das and B. Babadi. Dynamic bayesian multitaper spectral analysis. IEEE Transactions on Signal Processing, 66(6):1394–1409, 2018.
  • Daubechies et al. [2011] I. Daubechies, J. Lu, and H. Wu. Synchrosqueezed wavelet transforms: An empirical mode decomposition-like tool. Applied and Computational Harmonic Analysis, 30(2):243 – 261, 2011.
  • Gold et al. [2011] B. Gold, N. Morgan, and D. Ellis. Speech and Audio Signal Processing: Processing and Perception of Speech and Music. Wiley-Interscience, USA, 2nd edition, 2011. ISBN 0470195363.
  • Goodfellow et al. [2016] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
  • Gramacy and Lee [2008] R. B. Gramacy and H. K. H. Lee. Bayesian treed gaussian process models with an application to computer modeling. Journal of the American Statistical Association, 103(483):1119–1130, 2008.
  • Griffin and J. Lim [1984] D. Griffin and J. Lim. Signal estimation from modified short-time fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243, 1984.
  • Huang et al. [1998] N. E. Huang, Z. Shen, S. R. Long, M. C. Wu, H. H. Shih, Q. Zheng, N. Yen, C. C. Tung, and H. H. Liu. The empirical mode decomposition and the hilbert spectrum for nonlinear and non-stationary time series analysis. Proceedings of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences, 454(1971):903–995, 1998.
  • Itakura and Saito [1970] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. 1970.
  • Kalman [1960] R.E. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 1960.
  • Kim et al. [2018] S. Kim, M. K. Behr, D. Ba, and E. N. Brown. State-space multitaper time-frequency analysis. Proceedings of the National Academy of Sciences, 115(1):E5–E14, 2018.
  • Kitagawa and Gersch [1985] G. Kitagawa and W. Gersch. A smoothness priors time-varying ar coefficient modeling of nonstationary covariance time series. IEEE Transactions on Automatic Control, 30(1):48–56, 1985.
  • Koopman and Lee [2009] S. J. Koopman and K. M. Lee. Seasonality with trend and cycle interactions in unobserved components models. Journal of the Royal Statistical Society: Series C (Applied Statistics), 58(4):427–448, 2009.
  • Krishnan et al. [2017] R. G. Krishnan, U. Shalit, and D. Sontag. Structured inference networks for nonlinear state space models. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI’17, page 2101–2109. AAAI Press, 2017.
  • Li and Lin [2015] H. Li and Z. Lin. Accelerated proximal gradient methods for nonconvex programming. In Advances in Neural Information Processing Systems 28, pages 379–387. Curran Associates, Inc., 2015.
  • Lindsten and Schön [2013] F. Lindsten and T. B. Schön. Backward Simulation Methods for Monte Carlo Statistical Inference. 2013.
  • Matsuda and Komaki [2017] T. Matsuda and F. Komaki. Time series decomposition into oscillation components and phase estimation. Neural Computation, 29(2):332–367, 2017.
  • Mizuseki et al. [2009] K. Mizuseki, A. Sirota, E. Pastalkova, and G. Buzsaki. Theta oscillations provide temporal windows for local circuit computation in the entorhinal-hippocampal loop. Neuron, 64:267–280, 2009.
  • Oppenheim et al. [2009] A. V. Oppenheim, R. W. Schafer, and J. R. Buck. Discrete-time Signal Processing (3rd Ed.). Prentice-Hall, Inc., 2009.
  • Priestley [1965] M. B. Priestley. Evolutionary spectra and non-stationary processes. Journal of the Royal Statistical Society. Series B (Methodological), 27(2):204–237, 1965.
  • Purdon et al. [2013] P. L. Purdon, E. T. Pierce, E. A. Mukamel, M. J. Prerau, J. L. Walsh, K. F. K. Wong, A. F. Salazar-Gomez, P. G. Harrell, A. L. Sampson, A. Cimenser, S. Ching, N. J. Kopell, C. Tavares-Stoeckel, K. Habeeb, R. Merhar, and E. N. Brown. Electroencephalogram signatures of loss and recovery of consciousness from propofol. Proceedings of the National Academy of Sciences, 110(12):E1142–E1151, 2013.
  • Qi et al. [2002] Y. Qi, T. P. Minka, and R. W. Picard. Bayesian spectrum estimation of unevenly sampled nonstationary data. In 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages II–1473–II–1476, 2002.
  • Rasmussen and Williams [2005] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.
  • Rosen et al. [2009] O. Rosen, D. S. Stoffer, and S. Wood. Local spectral analysis via a bayesian mixture of smoothing splines. Journal of the American Statistical Association, 104(485):249–262, 2009.
  • Solin and Särkkä [2014] A. Solin and S. Särkkä. Explicit Link Between Periodic Covariance Functions and State Space Models. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, volume 33 of Proceedings of Machine Learning Research, pages 904–912. PMLR, 2014.
  • Solin et al. [2018] A. Solin, J. Hensman, and R. E Turner. Infinite-horizon gaussian processes. In Advances in Neural Information Processing Systems 31, pages 3486–3495. 2018.
  • Song et al. [2018] A. H. Song, S. Chakravarty, and E. N. Brown. A smoother state space multitaper spectrogram. In 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pages 33–36, July 2018.
  • Soulat et al. [2019] H. Soulat, E. P. Stephen, A. M. Beck, and P. L. Purdon. State space methods for phase amplitude coupling analysis. bioRxiv, 2019.
  • Turner and Sahani [2014] R. E. Turner and M. Sahani. Time-frequency analysis as probabilistic inference. IEEE Transactions on Signal Processing, 62(23):6171–6183, 2014.
  • West et al. [1999] M. West, R. Prado, and A. D. Krystal. Evaluation and Comparison of EEG Traces: Latent Structure in Nonstationary Time Series. Journal of the American Statistical Association, 94(448):1083–1095, 1999.
  • Whittle [1953] P. Whittle. Estimation and information in stationary time series. Ark. Mat., 2(5):423–434, 1953.
  • Wilkinson et al. [2019] W. J. Wilkinson, M. Riis Andersen, J. D. Reiss, D. Stowell, and A. Solin. Unifying probabilistic models for time-frequency analysis. In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3352–3356, May 2019.
  • Wilson and Adams [2013] A. G. Wilson and R. P. Adams. Gaussian process kernels for pattern discovery and extrapolation. In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, ICML’13, page III–1067–III–1075, 2013.
  • Wilson et al. [2008] K. W. Wilson, B. Raj, P. Smaragdis, and A. Divakaran. Speech denoising using nonnegative matrix factorization with priors. In 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 4029–4032, 2008.
  • Wright [2015] S. J. Wright. Coordinate descent algorithms. Mathematical Programming, 151:3–34, 2015.

Appendix

Section numbers referenced in the headings below refer to sections of the main text. The table of contents for the Appendix is as follows.

  • A. Continuous model interpretation of PLSO (Section 3)
  • B. PSD for complex AR(1) process
  • C. Proof for Proposition 1 (Section 3.2.1)
  • D. Proof for Proposition 2 (Section 3.2.2)
  • E. Initialization for PLSO (Section 4)
  • F. Optimization for \{\sigma_{j,m}^{2}\}_{j,m} via proximal gradient update (Section 4.1.1)
  • G. Lipschitz constant for \nabla f(\{\sigma_{j,m}^{2}\}_{j,m};\theta) (Section 4.1.1)
  • H. Inference with p(\left\{\mathbf{z}_{j}\right\}_{j}\mid\{\sigma_{j,m}^{2}\}_{j,m},\mathbf{y},\theta) (Section 4.2)
  • I. Computational efficiency of PLSO vs. GP-PS
  • J. Simulation experiment (Section 5.1)
  • K. Details of TVAR model (Section 5.2)
  • L. Anesthesia EEG dataset (Section 5.3)

A. Continuous model interpretation of PLSO

We can establish the equivalent continuous-time model of PLSO in Eq. 2, using the stochastic differential equation

dz~j(t)dt=((1lj)(0ωjωj0))𝐅z~j(t)+ε(t),\frac{d\tilde{z}_{j}(t)}{dt}=\underbrace{\left(\left(-\frac{1}{l_{j}}\right)\oplus\begin{pmatrix}0&-\omega_{j}\\ \omega_{j}&0\\ \end{pmatrix}\right)}_{\mathbf{F}}\tilde{z}_{j}(t)+\varepsilon(t), (11)

where z~j(t):2\tilde{z}_{j}(t):\mathbb{R}\rightarrow\mathbb{R}^{2}, \oplus denotes the Kronecker sum and ε(t)𝒩(0,σj2𝐈2×2)\varepsilon(t)\sim\mathcal{N}(0,\sigma_{j}^{2}\mathbf{I}_{2\times 2}). Discretizing the solution of Eq. 11 at Δ\Delta, such that 𝐳~j,k=z~j(kΔ)\widetilde{\mathbf{z}}_{j,k}=\tilde{z}_{j}(k\Delta), yields Eq. 2. Consequently, we obtain the following for Δ>0\Delta>0

exp(𝐅Δ)=exp(Δ/lj)𝐑(ωj),σj20Δexp(𝐅(Δτ))exp(𝐅(Δτ))Tdτ=σj2(1exp(2Δ/lj))𝐈2×2.\begin{split}&\exp\left(\mathbf{F}\Delta\right)=\exp(-\Delta/l_{j})\mathbf{R}(\omega_{j}),\\ &\sigma_{j}^{2}\int_{0}^{\Delta}\exp\left(\mathbf{F}(\Delta-\tau)\right)\exp\left(\mathbf{F}(\Delta-\tau)\right)^{\text{T}}d\tau\\ &\quad\quad=\sigma_{j}^{2}\left(1-\exp\left(-2\Delta/l_{j}\right)\right)\mathbf{I}_{2\times 2}.\\ \end{split} (12)

This interpretation extends to the nonstationary PLSO. The corresponding continuous model for 𝐳~j,mN+n\widetilde{\mathbf{z}}_{j,mN+n} in Eq. 3 is the same as Eq. 11, with different variance 𝔼[εj(t)εjT(t)]=m=1Mσj,m2𝟏((m1M)Tt<(mM)T)𝐈2×2.\mathbb{E}[\varepsilon_{j}(t)\varepsilon_{j}^{\text{T}}(t)]=\sum_{m=1}^{M}\sigma_{j,m}^{2}\cdot\mathbf{1}\left(\left(\frac{m-1}{M}\right)T\leq t<\left(\frac{m}{M}\right)T\right)\mathbf{I}_{2\times 2}.
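
As a sanity check, the continuous-to-discrete correspondence can be verified numerically. The sketch below (NumPy/SciPy) builds the drift matrix \mathbf{F} of Eq. 11, computes the one-step transition and process-noise covariance by matrix exponential and numerical integration, and confirms their structure (a decaying rotation and an isotropic covariance). All parameter values are illustrative placeholders, and the exact noise scaling depends on the convention adopted for \varepsilon(t).

import numpy as np
from scipy.linalg import expm
from scipy.integrate import quad

# Illustrative (not from the paper) length-scale, center frequency (rad/sample), and step size.
l_j, omega_j, delta = 2.0, 0.3, 0.05

# Drift matrix F of Eq. 11: a decay term (-1/l_j) plus a 2x2 rotation generator.
F = -np.eye(2) / l_j + omega_j * np.array([[0.0, -1.0], [1.0, 0.0]])

# One-step transition exp(F*delta): a scaled rotation, with decay exp(-delta/l_j).
A = expm(F * delta)
decay = np.sqrt(np.linalg.det(A))              # equals exp(-delta / l_j)
angle = np.arctan2(A[1, 0], A[0, 0])           # rotation angle accumulated over one step

# Process-noise covariance over one step: isotropic (proportional to the identity).
integrand = lambda s, i, j: (expm(F * s) @ expm(F * s).T)[i, j]
Q = np.array([[quad(integrand, 0.0, delta, args=(i, j))[0] for j in range(2)] for i in range(2)])

print("decay:", decay, "exp(-delta/l_j):", np.exp(-delta / l_j))
print("rotation angle per step:", angle)
print("Q:\n", Q)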

B. PSD for complex AR(1) process

We compute the steady-state covariance denoted as 𝐏j\mathbf{P}_{\infty}^{j}. Since we assume 𝐏1j=σj2𝐈2×2\mathbf{P}_{1}^{j}=\sigma_{j}^{2}\mathbf{I}_{2\times 2}, it is easy to show that 𝐏kj\mathbf{P}_{k}^{j} is a diagonal matrix from 𝐑(ωj)𝐑T(ωj)=𝐈2×2\mathbf{R}(\omega_{j})\mathbf{R}^{\text{T}}(\omega_{j})=\mathbf{I}_{2\times 2}. Denoting 𝐏j=α𝐈2×2\mathbf{P}_{\infty}^{j}=\alpha\mathbf{I}_{2\times 2}, we use the discrete Lyapunov equation

𝐏j=exp(2Δ/lj)𝐑(ωj)𝐏j𝐑T(ωj)+σj2(1exp(2Δ/lj))𝐈2×2α=exp(2Δ/lj)α+σj2(1exp(2Δ/lj))𝐏j=σj2𝐈2×2,\begin{split}&\mathbf{P}_{\infty}^{j}\\ &=\exp(-2\Delta/l_{j})\mathbf{R}(\omega_{j})\mathbf{P}_{\infty}^{j}\mathbf{R}^{\text{T}}(\omega_{j})\\ &\quad+\sigma_{j}^{2}\left(1-\exp\left(-2\Delta/l_{j}\right)\right)\mathbf{I}_{2\times 2}\\ &\Rightarrow\alpha=\exp(-2\Delta/l_{j})\alpha+\sigma_{j}^{2}\left(1-\exp\left(-2\Delta/l_{j}\right)\right)\\ &\Rightarrow\mathbf{P}_{\infty}^{j}=\sigma_{j}^{2}\mathbf{I}_{2\times 2},\\ \end{split} (13)

which implies that under the assumption 𝐏1j=σj2𝐈2×2\mathbf{P}_{1}^{j}=\sigma_{j}^{2}\mathbf{I}_{2\times 2}, we are guaranteed 𝐏kj=σj2𝐈2×2\mathbf{P}_{k}^{j}=\sigma_{j}^{2}\mathbf{I}_{2\times 2}, k\forall k. To compute the PSD of the stationary process 𝐳j\mathbf{z}_{j}, we now need to compute the autocovariance. Since only 𝐳j,k\mathbf{z}^{\Re}_{j,k} contributes to 𝐲k\mathbf{y}_{k}, we compute the autocovariance of 𝔼[𝐳j,k𝐳j,k+n]\mathbb{E}[\mathbf{z}^{\Re}_{j,k}\mathbf{z}^{\Re}_{j,k+n}] as

\begin{split}\mathbb{E}[\mathbf{z}^{\Re}_{j,k}\mathbf{z}^{\Re}_{j,k+n}]&=\mathbb{E}[\mathbf{z}^{\Re}_{j,k}\cdot\Re(\rho_{j}^{n}\exp(i\omega_{j}n)\mathbf{z}_{j,k})]\\ &=\rho_{j}^{n}\mathbb{E}[\mathbf{z}^{\Re}_{j,k}\mathbf{z}^{\Re}_{j,k}\cos\omega_{j}n]\\ &=\rho_{j}^{n}\cos\omega_{j}n\cdot\mathbb{E}[\{\mathbf{z}^{\Re}_{j,k}\}^{2}]\\ &=\rho_{j}^{n}\sigma_{j}^{2}\cos\omega_{j}n,\\ \end{split} (14)

where ()\Re(\cdot) denotes the operator that extracts the real part of the complex argument and we used the fact that 𝔼[𝐳j,k𝐳j,k]=0\mathbb{E}[\mathbf{z}^{\Re}_{j,k}\mathbf{z}^{\Im}_{j,k}]=0. The spectra for the jthj^{\text{th}} component, Sj(ω)S_{j}(\omega) can be written as

Sj(ω)=n=𝔼[𝐳j,k𝐳j,k+n]exp(iωn)=n=ρjnσj2coswjnexp(iωn)=σj2n=ρjn{exp(iωjn)+exp(iωjn)}exp(iωn)=σj2n=ρjnexp(i(ω±ωj)n).\begin{split}&S_{j}(\omega)=\sum_{n=-\infty}^{\infty}\mathbb{E}\left[\mathbf{z}^{\Re}_{j,k}\mathbf{z}^{\Re}_{j,k+n}\right]\exp\left(-i\omega n\right)\\ &=\sum_{n=-\infty}^{\infty}\rho_{j}^{n}\sigma_{j}^{2}\cos w_{j}n\exp\left(-i\omega n\right)\\ &=\sigma_{j}^{2}\sum_{n=-\infty}^{\infty}\rho_{j}^{n}\left\{\exp(i\omega_{j}n)+\exp(-i\omega_{j}n)\right\}\exp\left(-i\omega n\right)\\ &=\sigma_{j}^{2}\sum_{n=-\infty}^{\infty}\rho_{j}^{n}\exp(-i(\omega\pm\omega_{j})n).\\ \end{split} (15)

Unpacking the infinite sum for one of the terms,

\begin{split}&\sum_{n=-\infty}^{\infty}\rho_{j}^{|n|}\exp(-i(\omega-\omega_{j})n)\\ &=1+\sum_{n=1}^{\infty}\rho_{j}^{n}\exp(-i(\omega-\omega_{j})n)+\rho_{j}^{n}\exp(i(\omega-\omega_{j})n)\\ &=1+\frac{\rho_{j}\exp(-i(\omega-\omega_{j}))}{1-\rho_{j}\exp(-i(\omega-\omega_{j}))}+\frac{\rho_{j}\exp(i(\omega-\omega_{j}))}{1-\rho_{j}\exp(i(\omega-\omega_{j}))}\\ &=1+\frac{2\rho_{j}\cos(\omega-\omega_{j})-2\rho_{j}^{2}}{\left(1-\rho_{j}\exp(-i(\omega-\omega_{j}))\right)\left(1-\rho_{j}\exp(i(\omega-\omega_{j}))\right)}\\ &=\frac{1-\rho_{j}^{2}}{1+\rho_{j}^{2}-2\rho_{j}\cos(\omega-\omega_{j})}.\\ \end{split} (16)

Using the relation ρj=exp(Δ/lj)\rho_{j}=\exp(-\Delta/l_{j}) and unpacking the infinite sum for the other term, we have

\begin{split}S_{j}(\omega)&=\frac{\sigma_{j}^{2}(1-\exp(-2\Delta/l_{j}))}{1+\exp(-2\Delta/l_{j})-2\exp(-\Delta/l_{j})\cos(\omega-\omega_{j})}\\ &\quad+\frac{\sigma_{j}^{2}(1-\exp(-2\Delta/l_{j}))}{1+\exp(-2\Delta/l_{j})-2\exp(-\Delta/l_{j})\cos(\omega+\omega_{j})}.\\ \end{split}

Since the Fourier transform is a linear operator, we conclude that \gamma(\omega)=\sigma_{\nu}^{2}+\sum_{j=1}^{J}S_{j}(\omega).
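
For reference, the closed-form PSD above is straightforward to evaluate; the NumPy sketch below computes S_j(\omega) and the overall \gamma(\omega) on a frequency grid, with purely illustrative component parameters rather than values from the paper.

import numpy as np

def component_psd(omega, omega_j, l_j, sigma2_j, delta):
    """S_j(omega) as derived above: two peaks centered at +omega_j and -omega_j."""
    rho = np.exp(-delta / l_j)
    num = sigma2_j * (1.0 - rho ** 2)
    def peak(center):
        return num / (1.0 + rho ** 2 - 2.0 * rho * np.cos(omega - center))
    return peak(omega_j) + peak(-omega_j)

# Hypothetical two-component example plus observation noise sigma_nu^2.
omega = np.linspace(-np.pi, np.pi, 1024)
params = [(0.1, 5.0, 2.0), (0.8, 2.0, 1.0)]      # (omega_j, l_j, sigma2_j), illustrative
sigma2_nu, delta = 0.5, 1.0
gamma = sigma2_nu + sum(component_psd(omega, w, l, s, delta) for (w, l, s) in params)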

C. Proof for Proposition 1 (Section 3.2.1)

Proposition 1. For a given mm, as Δ0\Delta\rightarrow 0, the samples on either side of the interval boundary, which are 𝐳~j,(m+1)N\widetilde{\mathbf{z}}_{j,(m+1)N} and 𝐳~j,(m+1)N+1\widetilde{\mathbf{z}}_{j,(m+1)N+1}, converge to each other in mean square,

limΔ0𝔼[Δ𝐳~j,(m+1)NΔ𝐳~j,(m+1)NT]=0,\lim_{\Delta\rightarrow 0}\mathbb{E}[\Delta\widetilde{\mathbf{z}}_{j,(m+1)N}\Delta\widetilde{\mathbf{z}}_{j,(m+1)N}^{\text{T}}]=0,

where we use Δ𝐳~j,(m+1)N=𝐳~j,(m+1)N+1𝐳~j,(m+1)N\Delta\widetilde{\mathbf{z}}_{j,(m+1)N}=\widetilde{\mathbf{z}}_{j,(m+1)N+1}-\widetilde{\mathbf{z}}_{j,(m+1)N}.

Proof.

To analyze Eq. 3 in the limit of Δ0\Delta\rightarrow 0, we use the equivalent continuous model (Eq. 12). It suffices to show that limΔ0exp(𝐅Δ)=𝐈2×2\lim_{\Delta\rightarrow 0}\exp(\mathbf{F}\Delta)=\mathbf{I}_{2\times 2} and limΔ0𝔼[εj,(m+1)N+1εj,(m+1)N+1T]=𝟎\lim_{\Delta\rightarrow 0}\mathbb{E}[\varepsilon_{j,(m+1)N+1}\varepsilon_{j,(m+1)N+1}^{\text{T}}]=\mathbf{0}. We have,

limΔ0exp(𝐅Δ)=𝐈2×2+limΔ0k=1Δkk!𝐅k=𝐈2×2limΔ0𝔼[εj,(m+1)N+1εj,(m+1)N+1T]/σj,m+12=limΔ00Δexp(𝐅(Δτ))exp(𝐅(Δτ))Tdτ=𝟎.\begin{split}&\lim_{\Delta\rightarrow 0}\exp(\mathbf{F}\Delta)=\mathbf{I}_{2\times 2}+\lim_{\Delta\rightarrow 0}\sum_{k=1}^{\infty}\frac{\Delta^{k}}{k!}\mathbf{F}^{k}=\mathbf{I}_{2\times 2}\\ &\lim_{\Delta\rightarrow 0}\mathbb{E}[\varepsilon_{j,(m+1)N+1}\varepsilon_{j,(m+1)N+1}^{\text{T}}]/\sigma_{j,m+1}^{2}\\ &=\lim_{\Delta\rightarrow 0}\int_{0}^{\Delta}\exp\left(\mathbf{F}(\Delta-\tau)\right)\exp\left(\mathbf{F}(\Delta-\tau)\right)^{\text{T}}d\tau=\mathbf{0}.\\ \end{split}

Since this implies limΔ0𝔼[Δ𝐳~j,(m+1)NΔ𝐳~j,(m+1)NT]=0\lim_{\Delta\rightarrow 0}\mathbb{E}[\Delta\widetilde{\mathbf{z}}_{j,(m+1)N}\Delta\widetilde{\mathbf{z}}_{j,(m+1)N}^{\text{T}}]=0, we have convergence in mean square. ∎

D. Proof for Proposition 2 (Section 3.2.2)

Proposition 2. Assume ljNΔl_{j}\ll N\Delta, such that 𝐏m,Nj=𝐏m,j\mathbf{P}_{m,N}^{j}=\mathbf{P}_{m,\infty}^{j}. In Eq. 3, the difference between 𝐏m,j=σj,m2𝐈2×2\mathbf{P}_{m,\infty}^{j}=\sigma_{j,m}^{2}\mathbf{I}_{2\times 2} and 𝐏m+1,j=σj,m+12𝐈2×2\mathbf{P}_{m+1,\infty}^{j}=\sigma_{j,m+1}^{2}\mathbf{I}_{2\times 2} decays exponentially fast as a function of nn, for 1nN1\leq n\leq N,

𝐏m+1,nj=𝐏m+1,j+exp(2nΔ/lj)(𝐏m,j𝐏m+1,j).\begin{split}\mathbf{P}_{m+1,n}^{j}&=\mathbf{P}_{m+1,\infty}^{j}+\exp\left(-2n\Delta/l_{j}\right)(\mathbf{P}_{m,\infty}^{j}-\mathbf{P}_{m+1,\infty}^{j}).\\ \end{split}
Proof.

We first obtain the steady-state covariance 𝐏m,j\mathbf{P}_{m,\infty}^{j}, similar to Appendix B. Since we assume 𝐏1,1j=σj,12𝐈2×2\mathbf{P}_{1,1}^{j}=\sigma_{j,1}^{2}\mathbf{I}_{2\times 2}, we can show that m,n\forall m,n, 𝐏m,nj\mathbf{P}_{m,n}^{j} is a diagonal matrix, noting that 𝐑(ωj)𝐑T(ωj)=𝐈2×2\mathbf{R}(\omega_{j})\mathbf{R}^{\text{T}}(\omega_{j})=\mathbf{I}_{2\times 2}. Denoting 𝐏m,j=α𝐈2×2\mathbf{P}_{m,\infty}^{j}=\alpha\mathbf{I}_{2\times 2}, we now use the discrete Lyapunov equation

𝐏m,j=exp(2Δ/lj)𝐑(ωj)𝐏m,j𝐑T(ωj)+σj,m2(1exp(2Δ/lj))𝐈2×2α=exp(2Δ/lj)α+σj,m2(1exp(2Δ/lj))𝐏m,j=σj,m2𝐈2×2.\begin{split}&\mathbf{P}_{m,\infty}^{j}=\exp(-2\Delta/l_{j})\mathbf{R}(\omega_{j})\mathbf{P}_{m,\infty}^{j}\mathbf{R}^{\text{T}}(\omega_{j})\\ &\quad+\sigma_{j,m}^{2}\left(1-\exp\left(-2\Delta/l_{j}\right)\right)\mathbf{I}_{2\times 2}\\ &\Rightarrow\alpha=\exp(-2\Delta/l_{j})\alpha+\sigma_{j,m}^{2}\left(1-\exp\left(-2\Delta/l_{j}\right)\right)\\ &\Rightarrow\mathbf{P}_{m,\infty}^{j}=\sigma_{j,m}^{2}\mathbf{I}_{2\times 2}.\\ \end{split} (17)

We now prove the proposition by induction. For fixed jj and mm, and for n=1n=1,

𝐏m+1,1j=exp(2Δ/lj)𝐑(ωj)𝐏m,Nj𝐑T(ωj)+σj,m+12(1exp(2Δ/lj))𝐈2×2={σj,m+12+exp(2Δ/lj)(σj,m2σj,m+12)}𝐈2×2.\begin{split}&\mathbf{P}_{m+1,1}^{j}\\ &=\exp(-2\Delta/l_{j})\mathbf{R}(\omega_{j})\mathbf{P}_{m,N}^{j}\mathbf{R}^{\text{T}}(\omega_{j})\\ &\quad+\sigma_{j,m+1}^{2}\left(1-\exp\left(-2\Delta/l_{j}\right)\right)\mathbf{I}_{2\times 2}\\ &=\left\{\sigma_{j,m+1}^{2}+\exp\left(-2\Delta/l_{j}\right)\left(\sigma_{j,m}^{2}-\sigma_{j,m+1}^{2}\right)\right\}\mathbf{I}_{2\times 2}.\\ \end{split}

Assuming the same holds for n=n1n=n^{\prime}-1, we have for n=nn=n^{\prime},

𝐏m+1,nj=exp(2Δ/lj)𝐑(ωj)𝐏m,n1j𝐑T(ωj)+σj,m+12(1exp(2Δ/lj))𝐈2×2=exp(2Δ/lj)σj,m+12𝐈2×2+exp(2nΔ/lj)(σj,m2σj,m+12)𝐈2×2+σj,m+12(1exp(2Δ/lj))𝐈2×2={σj,m+12+exp(2nΔ/lj)(σj,m2σj,m+12)}𝐈2×2.\begin{split}&\mathbf{P}_{m+1,n^{\prime}}^{j}\\ &=\exp(-2\Delta/l_{j})\mathbf{R}(\omega_{j})\mathbf{P}_{m,n^{\prime}-1}^{j}\mathbf{R}^{\text{T}}(\omega_{j})\\ &\quad+\sigma_{j,m+1}^{2}\left(1-\exp\left(-2\Delta/l_{j}\right)\right)\mathbf{I}_{2\times 2}\\ &=\exp(-2\Delta/l_{j})\sigma_{j,m+1}^{2}\mathbf{I}_{2\times 2}\\ &\quad+\exp\left(-2n^{\prime}\Delta/l_{j}\right)\left(\sigma_{j,m}^{2}-\sigma_{j,m+1}^{2}\right)\mathbf{I}_{2\times 2}\\ &\quad+\sigma_{j,m+1}^{2}\left(1-\exp\left(-2\Delta/l_{j}\right)\right)\mathbf{I}_{2\times 2}\\ &=\left\{\sigma_{j,m+1}^{2}+\exp\left(-2n^{\prime}\Delta/l_{j}\right)\left(\sigma_{j,m}^{2}-\sigma_{j,m+1}^{2}\right)\right\}\mathbf{I}_{2\times 2}.\\ \end{split}

By the principle of induction, the expression in Proposition 2 holds for 1\leq n\leq N. ∎
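
The closed-form relaxation above can also be verified by iterating the diagonal Lyapunov recursion across a window boundary, as in the short NumPy sketch below (all parameter values are illustrative).

import numpy as np

l_j, delta, N = 3.0, 0.1, 50
sigma2_m, sigma2_m1 = 4.0, 1.0          # variances on either side of the boundary (illustrative)
rho2 = np.exp(-2.0 * delta / l_j)

p = sigma2_m                            # steady-state diagonal entry at the end of interval m
for n in range(1, N + 1):
    p = rho2 * p + sigma2_m1 * (1.0 - rho2)      # diagonal entry of the Lyapunov recursion
    closed_form = sigma2_m1 + np.exp(-2.0 * n * delta / l_j) * (sigma2_m - sigma2_m1)
    assert np.isclose(p, closed_form)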

E. Initialization & Estimation for PLSO (Section 4)

E.1 Initialization

As noted in the main text, we use the AIC to determine the optimal number of components J. For a given J, we first construct the spectrogram of the data using the STFT and identify the frequency bands with prominent power, i.e., frequency bands whose average power exceeds a pre-determined threshold. The center frequencies of these bands serve as the initial center frequencies \{\omega_{j}^{\text{init}}\}_{j}, which are either fixed throughout the algorithm or further refined through the estimation algorithm in the main text. If J exceeds the number of frequency bands identified from the spectrogram, 1) we first place \{\omega_{j}^{\text{init}}\}_{j} in the prominent frequency bands and 2) we then spread the remaining components uniformly over [0,\omega_{c}], where \omega_{c} is a cutoff frequency determined in the next section. As for \{l_{j}^{\text{init}}\}_{j}, we set each to a fixed fraction of the corresponding \omega_{j}^{\text{init}}. We then fit \{\sigma_{j,m}^{2}\}_{j,m} and \theta with \lambda=0, through the procedure explained in Stage 1, and use these estimates as initial values for other values of \lambda.
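
The sketch below (NumPy/SciPy) illustrates one possible implementation of this initialization heuristic; the window length, power threshold, the fraction defining l_j^init, and the cutoff \omega_c are placeholders rather than values from the paper.

import numpy as np
from scipy.signal import stft, find_peaks

def init_plso(y, fs, J, power_thresh_db, omega_c, frac=0.1):
    """Initial center frequencies from prominent spectrogram bands; l_j^init as a fraction of omega_j^init."""
    f, _, Z = stft(y, fs=fs, nperseg=int(2 * fs))            # 2-second windows (assumption)
    avg_power_db = 10.0 * np.log10(np.mean(np.abs(Z) ** 2, axis=1) + 1e-12)
    peaks, _ = find_peaks(avg_power_db, height=power_thresh_db)
    omegas = list(f[peaks][:J])                              # prominent bands, at most J of them
    n_missing = J - len(omegas)
    if n_missing > 0:                                        # spread remaining components over [0, omega_c]
        omegas += list(np.linspace(0.0, omega_c, n_missing + 2)[1:-1])
    omegas = np.sort(np.array(omegas))
    ls = frac * omegas                                       # l_j^init as a fixed fraction (placeholder)
    return omegas, ls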

E.2 Estimation for σν2\sigma_{\nu}^{2}

There are two ways to estimate the observation noise variance \sigma_{\nu}^{2}. The first is to perform maximum likelihood estimation of f(\{\sigma_{j,m}^{2}\}_{j,m};\theta) with respect to \sigma_{\nu}^{2}. The second, which we found to work better in practice and use throughout the manuscript, is to estimate it directly from the Fourier transform of the data. Given a cutoff frequency \omega_{c}, informed by domain knowledge, we take the average power of the Fourier transform of \mathbf{y} in [\omega_{c},f_{s}/2]. For instance, it is well known that the spectral content below 40 Hz in the anesthesia EEG dataset is physiologically relevant, so we use \omega_{c}\simeq 40 Hz.
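
A minimal sketch of the second approach, assuming a cutoff of \omega_c = 40 Hz as in the anesthesia example; the test signal is synthetic and the periodogram normalization is one of several common conventions.

import numpy as np

def estimate_obs_noise_var(y, fs, omega_c):
    """Average periodogram power of y in [omega_c, fs/2] as an estimate of sigma_nu^2."""
    K = len(y)
    freqs = np.fft.rfftfreq(K, d=1.0 / fs)
    pxx = np.abs(np.fft.rfft(y)) ** 2 / K            # one-sided periodogram (up to scaling convention)
    band = freqs >= omega_c
    return pxx[band].mean()

# Synthetic example: a 10 Hz oscillation in unit-variance white noise, sampled at 250 Hz.
fs, T = 250, 60
t = np.arange(fs * T) / fs
rng = np.random.default_rng(0)
y = np.sin(2 * np.pi * 10 * t) + rng.normal(scale=1.0, size=t.size)
print(estimate_obs_noise_var(y, fs, omega_c=40.0))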

F. Optimization for {σj,m2}j,m\{\sigma_{j,m}^{2}\}_{j,m} via proximal gradient update (Section 4.1.1)

We discuss the algorithm for obtaining a locally optimal solution \{\sigma_{j,m}^{2}\}_{j,m} of the MAP optimization problem in Eq. 8. Throughout this section, we define \psi_{j,m}=\log\sigma_{j,m}^{2} and \Psi=[\psi_{1,1},\ldots,\psi_{1,M},\ldots,\psi_{J,M}]\in\mathbb{R}^{JM} to declutter notation.

We rewrite Eq. 8 as

minΨlogp(Ψ|𝐲,θ)h(Ψ;θ)=minΨ12m=1Mn=1N{logγ(m)(ωn)+I(m)(ωn)γ(m)(ωn)}f(Ψ;θ)+λ2j=1Jm=1M(ψj,mψj,m1)2g(Ψ;θ)=minΨf(Ψ;θ)g(Ψ;θ).\begin{split}&\min_{\Psi}\underbrace{-\log p(\Psi|\mathbf{y},\theta)}_{h(\Psi;\theta)}\\ &=\min_{\Psi}\underbrace{\frac{1}{2}\sum_{m=1}^{M}\sum_{n=1}^{N}\left\{\log\gamma^{(m)}(\omega_{n})+\frac{I^{(m)}(\omega_{n})}{\gamma^{(m)}(\omega_{n})}\right\}}_{-f(\Psi;\theta)}\\ &\quad\quad\quad+\underbrace{\frac{\lambda}{2}\sum_{j=1}^{J}\sum_{m=1}^{M}\left(\psi_{j,m}-\psi_{j,m-1}\right)^{2}}_{-g(\Psi;\theta)}\\ &=\min_{\Psi}-f(\Psi;\theta)-g(\Psi;\theta).\\ \end{split} (18)

The algorithm is described in Algorithm 1. It follows the steps of the inexact accelerated proximal gradient algorithm [Li and Lin, 2015]. For faster convergence, we use larger step sizes with the Barzilai-Borwein (BB) step-size initialization rule [Barzilai and Borwein, 1988]. For the rest of this section, we drop the dependence on \theta. The main novelty of our algorithm is the proximal gradient update

𝐮(l+1)=proxα𝐰(l)g(𝐰(l)+α𝐰(l)f(𝐰(l)))=argminΨ12α𝐰(l)Ψ(𝐰(l)+α𝐰(l)f(𝐰(l)))2g(Ψ)=argminΨj=1Jm=1M((𝐰j,m(l)+α𝐰(l)f(𝐰(l))𝐰j,m)ψj,m)22α𝐰(l)+λ2(ψj,mψj,m1)2,\begin{split}&\mathbf{u}^{(l+1)}\\ &=\operatorname{prox}_{-\alpha^{(l)}_{\mathbf{w}}g}(\mathbf{w}^{(l)}+\alpha_{\mathbf{w}}^{(l)}\nabla f(\mathbf{w}^{(l)}))\\ &=\operatorname*{arg\,min}_{\Psi}\frac{1}{2\alpha_{\mathbf{w}}^{(l)}}\lVert\Psi-(\mathbf{w}^{(l)}+\alpha_{\mathbf{w}}^{(l)}\nabla f(\mathbf{w}^{(l)}))\rVert^{2}-g(\Psi)\\ &=\operatorname*{arg\,min}_{\Psi}\sum_{j=1}^{J}\sum_{m=1}^{M}\frac{\left(\left(\mathbf{w}_{j,m}^{(l)}+\alpha_{\mathbf{w}}^{(l)}\cdot\frac{\partial f(\mathbf{w}^{(l)})}{\partial\mathbf{w}_{j,m}}\right)-\psi_{j,m}\right)^{2}}{2\alpha_{\mathbf{w}}^{(l)}}\\ &\quad\quad\quad\quad\quad\quad\quad+\frac{\lambda}{2}(\psi_{j,m}-\psi_{j,m-1})^{2},\\ \end{split} (19)

where the same holds for \mathbf{x}^{(l+1)}=\operatorname{prox}_{-\alpha_{\Psi}^{(l)}g}(\Psi^{(l)}+\alpha_{\Psi}^{(l)}\nabla f(\Psi^{(l)})). The auxiliary variables \mathbf{w},\mathbf{u},\mathbf{x}\in\mathbb{R}^{JM} ensure convergence of \Psi. We use \mathbf{w}^{(l)}_{j,m} to denote the \left((m-1)J+j\right)^{\text{th}} entry of \mathbf{w}^{(l)}. As mentioned in the main text, this update can be computed efficiently using a Kalman filter/smoother.

Result: Ψ^\widehat{\Psi}
Initialize Ψ(0)=Ψ(1)=𝐮(1)\Psi^{(0)}=\Psi^{(1)}=\mathbf{u}^{(1)}, β(0)=0,β(1)=1,δ>0,ρ<1\beta^{(0)}=0,\,\beta^{(1)}=1,\,\delta>0,\,\rho<1
for l1l\leftarrow 1 to LL do
       𝐰(l)=Ψ(l)+β(l1)β(l)(𝐮(l)Ψ(l))+β(l1)1β(l)(Ψ(l)Ψ(l1))\mathbf{w}^{(l)}=\Psi^{(l)}+\frac{\beta^{(l-1)}}{\beta^{(l)}}(\mathbf{u}^{(l)}-\Psi^{(l)})+\frac{\beta^{(l-1)}-1}{\beta^{(l)}}(\Psi^{(l)}-\Psi^{(l-1)})
       (BB step size initialization rule)
       𝐬(l)=𝐮(l)𝐰(l1)\mathbf{s}^{(l)}=\mathbf{u}^{(l)}-\mathbf{w}^{(l-1)}, 𝐫(l)=f(𝐮(l))+f(𝐰(l1))\mathbf{r}^{(l)}=-\nabla f(\mathbf{u}^{(l)})+\nabla f(\mathbf{w}^{(l-1)})
       α𝐰(l)=((𝐬(l))T𝐬(l))/((𝐬(l))T𝐫(l))\alpha_{\mathbf{w}}^{(l)}=((\mathbf{s}^{(l)})^{\text{T}}\mathbf{s}^{(l)})/((\mathbf{s}^{(l)})^{\text{T}}\mathbf{r}^{(l)})
       𝐬(l)=𝐱(l)Ψ(l1)\mathbf{s}^{(l)}=\mathbf{x}^{(l)}-\Psi^{(l-1)}, 𝐫(l)=f(𝐱(l))+f(Ψ(l1))\mathbf{r}^{(l)}=-\nabla f(\mathbf{x}^{(l)})+\nabla f(\Psi^{(l-1)})
       αΨ(l)=((𝐬(l))T𝐬(l))/((𝐬(l))T𝐫(l))\alpha_{\Psi}^{(l)}=((\mathbf{s}^{(l)})^{\text{T}}\mathbf{s}^{(l)})/((\mathbf{s}^{(l)})^{\text{T}}\mathbf{r}^{(l)})
       (Proximal update step)
       repeat
              𝐮(l+1)=proxα𝐰(l)g(𝐰(l)+α𝐰(l)f(𝐰(l)))\mathbf{u}^{(l+1)}=\operatorname{prox}_{-\alpha_{\mathbf{w}}^{(l)}g}(\mathbf{w}^{(l)}+\alpha_{\mathbf{w}}^{(l)}\nabla f(\mathbf{w}^{(l)}))
              α𝐰(l)=ρα𝐰(l)\alpha_{\mathbf{w}}^{(l)}=\rho\cdot\alpha_{\mathbf{w}}^{(l)}
            
      until h(𝐮(l+1))h(𝐰(l))δ𝐮(l+1)𝐰(l)2h(\mathbf{u}^{(l+1)})\leq h(\mathbf{w}^{(l)})-\delta\lVert\mathbf{u}^{(l+1)}-\mathbf{w}^{(l)}\rVert^{2};
      repeat
              𝐱(l+1)=proxαΨ(l)g(Ψ(l)+αΨ(l)f(Ψ(l)))\mathbf{x}^{(l+1)}=\operatorname{prox}_{-\alpha_{\Psi}^{(l)}g}(\Psi^{(l)}+\alpha_{\Psi}^{(l)}\nabla f(\Psi^{(l)}))
              αΨ(l)=ραΨ(l)\alpha_{\Psi}^{(l)}=\rho\cdot\alpha_{\Psi}^{(l)}
            
      until h(𝐱(l+1))h(Ψ(l))δ𝐱(l+1)Ψ(l)2h(\mathbf{x}^{(l+1)})\leq h(\Psi^{(l)})-\delta\lVert\mathbf{x}^{(l+1)}-\Psi^{(l)}\rVert^{2};
      β(l+1)=1+4(β(l))2+12\beta^{(l+1)}=\frac{1+\sqrt{4\left(\beta^{(l)}\right)^{2}+1}}{2}
       Ψ(l+1)={𝐮(l+1)if h(u(l+1))h(x(l+1))𝐱(l+1)otherwise \Psi^{(l+1)}=\begin{cases}\mathbf{u}^{(l+1)}&\text{if }h(\textbf{u}^{(l+1)})\leq h(\textbf{x}^{(l+1)})\\ \mathbf{x}^{(l+1)}&\text{otherwise }\\ \end{cases}
end for
Ψ^=Ψ(L)\widehat{\Psi}=\Psi^{(L)}
Algorithm 1 Inference for Ψ\Psi via inexact APG
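
Because g couples the log-variances only through first-order differences across windows, the proximal update in Eq. 19 decouples over j and reduces, for each j, to a symmetric tridiagonal linear system in \psi_{j,1},\ldots,\psi_{j,M}. The paper computes it with a Kalman filter/smoother; the sketch below instead uses an equivalent direct banded solve in SciPy, under the assumption that the difference penalty runs over m\geq 2 (the handling of \psi_{j,0} is not restated here).

import numpy as np
from scipy.linalg import solveh_banded

def prox_step(target_j, alpha, lam):
    """Solve argmin_psi ||psi - target_j||^2 / (2*alpha) + (lam/2) * sum_m (psi_m - psi_{m-1})^2,
    i.e. (I + alpha*lam*D^T D) psi = target_j, with D the first-difference operator."""
    M = target_j.size
    diag = np.full(M, 2.0)
    diag[0] = diag[-1] = 1.0                   # D^T D has diagonal [1, 2, ..., 2, 1], off-diagonal -1
    ab = np.zeros((2, M))                      # upper banded storage for solveh_banded
    ab[1] = 1.0 + alpha * lam * diag           # main diagonal
    ab[0, 1:] = -alpha * lam                   # super-diagonal
    return solveh_banded(ab, target_j)

# Illustrative usage: a noisy log-variance track is pulled toward a smoother one.
rng = np.random.default_rng(1)
target = np.log(2.0) + 0.5 * rng.standard_normal(100)    # stands in for w + alpha * grad f(w)
psi_hat = prox_step(target, alpha=0.5, lam=10.0)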

G. Lipschitz constant for f({σj,m2}j,m;θ)\nabla f(\{\sigma_{j,m}^{2}\}_{j,m};\theta) (Section 4.1.1)

In this section, we show that, under mild assumptions, the log-likelihood f(\{\sigma_{j,m}^{2}\}_{j,m};\theta) has a Lipschitz continuous gradient with Lipschitz constant C. As in the previous section, we use \psi_{j,m}=\log\sigma_{j,m}^{2} and \Psi=[\psi_{1,1},\ldots,\psi_{1,M},\ldots,\psi_{J,M}]\in\mathbb{R}^{JM}.

Let us start by restating the definition of a Lipschitz continuous gradient.

Definition A continuously differentiable function f:\mathcal{S}\rightarrow\mathbb{R} has a Lipschitz continuous gradient if

f(𝐱)f(𝐲)2Cxy2for every 𝐱,𝐲𝒮,\lVert\nabla f(\mathbf{x})-\nabla f(\mathbf{y})\rVert_{2}\leq C\lVert\textbf{x}-\textbf{y}\rVert_{2}\quad\text{for every }\mathbf{x},\mathbf{y}\in\mathcal{S}, (20)

where 𝒮\mathcal{S} is a compact subset of JM\mathbb{R}^{JM} and C>0C>0 is the Lipschitz constant.

Our goal is to find the constant CC for the Whittle likelihood f(Ψ;θ)f(\Psi;\theta)

f(Ψ;θ)=12m=1Mn=1N{logγm,n+Im,nγm,n}=12m=1Mn=1Nlog(σν2+j=1Jexp(ψj,m)αj,n)f1(Ψ;θ)12m=1Mn=1NIm,n(σν2+j=1Jexp(ψj,m)αj,n)f2(Ψ;θ),\begin{split}f(\Psi;\theta)&=-\frac{1}{2}\sum_{m=1}^{M}\sum_{n=1}^{N}\left\{\log\gamma_{m,n}+\frac{I_{m,n}}{\gamma_{m,n}}\right\}\\ &=-\frac{1}{2}\underbrace{\sum_{m=1}^{M}\sum_{n=1}^{N}\log\left(\sigma_{\nu}^{2}+\sum_{j=1}^{J}\exp\left(\psi_{j,m}\right)\alpha_{j,n}\right)}_{f_{1}(\Psi;\theta)}\\ &\quad-\frac{1}{2}\underbrace{\sum_{m=1}^{M}\sum_{n=1}^{N}\frac{I_{m,n}}{\left(\sigma_{\nu}^{2}+\sum_{j=1}^{J}\exp\left(\psi_{j,m}\right)\alpha_{j,n}\right)}}_{f_{2}(\Psi;\theta)},\\ \end{split} (21)

where we use γm,n=γ(m)(ωn)\gamma_{m,n}=\gamma^{(m)}(\omega_{n}) and Im,n=I(m)(ωn)I_{m,n}=I^{(m)}(\omega_{n}) for notational simplicity and

αj,n=(1exp(2Δ/lj))1+exp(2Δ/lj)2exp(Δ/lj)cos(ωnωj).\alpha_{j,n}=\frac{\left(1-\exp\left(-2\Delta/l_{j}\right)\right)}{1+\exp\left(-2\Delta/l_{j}\right)-2\exp\left(-\Delta/l_{j}\right)\cos(\omega_{n}-\omega_{j})}.

We make the following assumptions:

  1. I_{m,n} is bounded, i.e., I_{m,n}\leq C_{I}. For real-world signals, it is reasonable to assume that I_{m,n}, i.e., the energy of the signal, is bounded.

  2. \forall j,m, \psi_{j,m} is bounded, i.e., |\psi_{j,m}|\leq\log C_{\psi} for some C_{\psi}>1. This implies 1/C_{\psi}\leq\exp(\psi_{j,m})\leq C_{\psi}.

In addition, we have the following facts:

  1. I_{m,n}, \alpha_{j,n}, and \gamma_{m,n} are nonnegative.

  2. I_{m,n} and \gamma_{m,n} are bounded. This follows from the assumptions above.

  3. For given \{l_{j}\}_{j}, \alpha_{j,n} is bounded. To see this, note that the maximum of \alpha_{j,n} is achieved at \omega_{n}=\omega_{j},

    maxαj,n=(1exp(2Δ/lj))1+exp(2Δ/lj)2exp(Δ/lj)=(1+exp(Δ/lj))(1exp(Δ/lj)).\begin{split}\max\alpha_{j,n}&=\frac{\left(1-\exp\left(-2\Delta/l_{j}\right)\right)}{1+\exp\left(-2\Delta/l_{j}\right)-2\exp\left(-\Delta/l_{j}\right)}\\ &=\frac{\left(1+\exp(-\Delta/l_{j})\right)}{\left(1-\exp(-\Delta/l_{j})\right)}.\\ \end{split} (22)

    Therefore, denoting lmax=maxj{lj}jl_{\operatorname{max}}=\max_{j}\{l_{j}\}_{j},

    maxαj,n(1+exp(Δ/lmax))(1exp(Δ/lmax))=Cα.\max\alpha_{j,n}\leq\frac{\left(1+\exp(-\Delta/l_{\operatorname{max}})\right)}{\left(1-\exp(-\Delta/l_{\operatorname{max}})\right)}=C_{\alpha}. (23)

Finally, we define \mathcal{S}=[-\log C_{\psi},\log C_{\psi}]^{JM}\subset\mathbb{R}^{JM}.

We want to compute the Lipschitz constant for f1(Ψ)\nabla f_{1}(\Psi) and f2(Ψ)\nabla f_{2}(\Psi) for Ψ,Ψ¯𝒮\Psi,\overline{\Psi}\in\mathcal{S}, i.e.,

{f1(Ψ)f1(Ψ¯)2C1ΨΨ¯2f2(Ψ)f2(Ψ¯)2C2ΨΨ¯2,\begin{cases}\lVert\nabla f_{1}(\Psi)-\nabla f_{1}(\overline{\Psi})\rVert_{2}\leq C_{1}\lVert\Psi-\overline{\Psi}\rVert_{2}\\ \lVert\nabla f_{2}(\Psi)-\nabla f_{2}(\overline{\Psi})\rVert_{2}\leq C_{2}\lVert\Psi-\overline{\Psi}\rVert_{2},\\ \end{cases} (24)

where we dropped dependence on θ\theta for notational simplicity. Consequently, the triangle inequality yields

f(Ψ)f(Ψ¯)2f1(Ψ)f1(Ψ¯)2+f2(Ψ)f2(Ψ¯)2(C1+C2)ΨΨ¯2.\begin{split}&\lVert\nabla f(\Psi)-\nabla f(\overline{\Psi})\rVert_{2}\\ &\leq\lVert\nabla f_{1}(\Psi)-\nabla f_{1}(\overline{\Psi})\rVert_{2}+\lVert\nabla f_{2}(\Psi)-\nabla f_{2}(\overline{\Psi})\rVert_{2}\\ &\leq(C_{1}+C_{2})\lVert\Psi-\overline{\Psi}\rVert_{2}.\end{split} (25)

Lipschitz constant C1C_{1} for f1(Ψ)\nabla f_{1}(\Psi)

Let us examine f1(Ψ)f_{1}(\Psi) first. The derivative with respect to ψj,m\psi_{j,m} is given as

f1(Ψ)ψj,m=n=1Nαj,nexp(ψj,m)σν2+j=1Jexp(ψj,m)αj,nf~(ψj,m)=n=1Nαj,nf~(ψj,m).\begin{split}\frac{\partial f_{1}(\Psi)}{\partial\psi_{j,m}}&=\sum_{n=1}^{N}\alpha_{j,n}\cdot\underbrace{\frac{\exp\left(\psi_{j,m}\right)}{\sigma_{\nu}^{2}+\sum_{j^{\prime}=1}^{J}\exp\left(\psi_{j^{\prime},m}\right)\alpha_{j^{\prime},n}}}_{\widetilde{f}(\psi_{j,m})}\\ &=\sum_{n=1}^{N}\alpha_{j,n}\widetilde{f}(\psi_{j,m}).\end{split} (26)

We now have

|f1(Ψ)ψj,mf1(Ψ¯)ψj,m|=n=1N|αj,n||f~(ψj,m)f~(ψ¯j,m)|.\left|\frac{\partial f_{1}(\Psi)}{\partial\psi_{j,m}}-\frac{\partial f_{1}(\overline{\Psi})}{\partial\psi_{j,m}}\right|=\sum_{n=1}^{N}|\alpha_{j,n}|\cdot\left|\widetilde{f}\left(\psi_{j,m}\right)-\widetilde{f}(\overline{\psi}_{j,m})\right|. (27)

Without loss of generality, we assume ψj,mψ¯j,m\psi_{j,m}\geq\overline{\psi}_{j,m}. We now apply the mean value theorem (MVT) to f~(ψj,m)\widetilde{f}(\psi_{j,m})

f~(ψj,m)f~(ψ¯j,m)=f~(ψj,m)(ψj,mψ¯j,m),\widetilde{f}(\psi_{j,m})-\widetilde{f}(\overline{\psi}_{j,m})=\widetilde{f}^{\prime}(\psi^{\prime}_{j,m})(\psi_{j,m}-\overline{\psi}_{j,m}), (28)

where \psi^{\prime}_{j,m}\in[\overline{\psi}_{j,m},\psi_{j,m}]. We can compute and bound \widetilde{f}^{\prime}(\psi_{j,m}^{\prime})=d\widetilde{f}(\psi_{j,m}^{\prime})/d\psi_{j,m}^{\prime} as follows

f~(ψj,m)=exp(ψj,m)σν2+j=1Jexp(ψj,m)αj,nαj,nexp2(ψj,m)(σν2+j=1Jexp(ψj,m)αj,n)2=exp(ψj,m)σν2+j=1Jexp(ψj,m)αj,n×(1αj,nexp(ψj,m)σν2+j=1Jexp(ψj,m)αj,n)1exp(ψj,m)σν2+j=1Jexp(ψj,m)αj,nCψσν2.\begin{split}\widetilde{f}^{\prime}(\psi_{j,m}^{\prime})&=\frac{\exp(\psi_{j,m}^{\prime})}{\sigma_{\nu}^{2}+\sum_{j^{\prime}=1}^{J}\exp(\psi_{j^{\prime},m}^{\prime})\alpha_{j^{\prime},n}}\\ &\quad-\frac{\alpha_{j,n}\exp^{2}(\psi_{j,m}^{\prime})}{(\sigma_{\nu}^{2}+\sum_{j^{\prime}=1}^{J}\exp(\psi_{j^{\prime},m}^{\prime})\alpha_{j^{\prime},n})^{2}}\\ &=\frac{\exp(\psi_{j,m}^{\prime})}{\sigma_{\nu}^{2}+\sum_{j^{\prime}=1}^{J}\exp(\psi_{j^{\prime},m}^{\prime})\alpha_{j^{\prime},n}}\\ &\quad\times\underbrace{\left(1-\frac{\alpha_{j,n}\exp(\psi_{j,m}^{\prime})}{\sigma_{\nu}^{2}+\sum_{j^{\prime}=1}^{J}\exp(\psi_{j^{\prime},m}^{\prime})\alpha_{j^{\prime},n}}\right)}_{\leq 1}\\ &\leq\frac{\exp(\psi^{\prime}_{j,m})}{\sigma_{\nu}^{2}+\sum_{j^{\prime}=1}^{J}\exp(\psi^{\prime}_{j^{\prime},m})\alpha_{j^{\prime},n}}\\ &\leq\frac{C_{\psi}}{\sigma_{\nu}^{2}}.\\ \end{split} (29)

Combining Eq. 28 and 29, we have

n=1N|αj,n||f~(ψj,m)f~(ψ¯j,m)|=n=1N|αj,n||f~(ψj,m)(ψj,mψ¯j,m)|NCαCψσν2|ψj,mψ¯j,m|.\begin{split}&\sum_{n=1}^{N}|\alpha_{j,n}|\cdot\left|\widetilde{f}\left(\psi_{j,m}\right)-\widetilde{f}(\overline{\psi}_{j,m})\right|\\ &=\sum_{n=1}^{N}|\alpha_{j,n}|\cdot\left|\widetilde{f}^{\prime}(\psi^{\prime}_{j,m})(\psi_{j,m}-\overline{\psi}_{j,m})\right|\\ &\leq\frac{NC_{\alpha}C_{\psi}}{\sigma_{\nu}^{2}}\left|\psi_{j,m}-\overline{\psi}_{j,m}\right|.\\ \end{split} (30)

We thus have,

f1(Ψ)f1(Ψ¯)22=j=1Jm=1M(f1(Ψ)ψj,mf1(Ψ¯)ψj,m)2(JMNCαCψσν2)2ΨΨ¯22.\begin{split}\lVert\nabla f_{1}(\Psi)-\nabla f_{1}(\overline{\Psi})\rVert_{2}^{2}&=\sum_{j=1}^{J}\sum_{m=1}^{M}\left(\frac{\partial f_{1}(\Psi)}{\partial\psi_{j,m}}-\frac{\partial f_{1}(\overline{\Psi})}{\partial\psi_{j,m}}\right)^{2}\\ &\leq\left(\frac{JMNC_{\alpha}C_{\psi}}{\sigma_{\nu}^{2}}\right)^{2}\lVert\Psi-\overline{\Psi}\rVert^{2}_{2}.\end{split} (31)

Lipschitz constant C2C_{2} for f2(Ψ)\nabla f_{2}(\Psi)

Computing C2C_{2} proceeds in a similar manner to computing C1C_{1}. The derivative with respect to ψj,m\psi_{j,m} is given as

f2(Ψ)ψj,m=n=1NIm,nαj,nexp(ψj,m)(σν2+j=1Jexp(ψj,m)αj,n)2f~(ψj,m).\begin{split}&\frac{\partial f_{2}(\Psi)}{\partial\psi_{j,m}}\\ &=-\sum_{n=1}^{N}I_{m,n}\alpha_{j,n}\cdot\underbrace{\frac{\exp(\psi_{j,m})}{(\sigma_{\nu}^{2}+\sum_{j^{\prime}=1}^{J}\exp(\psi_{j^{\prime},m})\alpha_{j^{\prime},n})^{2}}}_{\widetilde{f}(\psi_{j,m})}.\\ \end{split} (32)

We now have

|f2(Ψ)ψj,mf2(Ψ¯)ψj,m|=n=1N|Im,nαj,n||f~(ψj,m)+f~(ψ¯j,m)|.\begin{split}&\left|\frac{\partial f_{2}(\Psi)}{\partial\psi_{j,m}}-\frac{\partial f_{2}(\overline{\Psi})}{\partial\psi_{j,m}}\right|\\ &=\sum_{n=1}^{N}\left|I_{m,n}\alpha_{j,n}\right|\cdot\left|-\widetilde{f}(\psi_{j,m})+\widetilde{f}(\overline{\psi}_{j,m})\right|.\\ \end{split} (33)

Without loss of generality, assume ψj,mψ¯j,m\psi_{j,m}\geq\overline{\psi}_{j,m}. To apply MVT, we need to compute and bound f~(ψj,m)\widetilde{f}^{\prime}(\psi^{\prime}_{j,m})

f~(ψj,m)=exp(ψj,m)(σν2+j=1Jexp(ψj,m)αj,n)22αj,nexp2(ψj,m)(σν2+j=1Jexp(ψj,m)αj,n)3=exp(ψj,m)(σν2+j=1Jexp(ψj,m)αj,n)2×(12αj,nexp(ψj,m)σν2+j=1Jexp(ψj,m)αj,n)1exp(ψj,m)(σν2+j=1Jexp(ψj,m)αj,n)2Cψσν4.\begin{split}\widetilde{f}^{\prime}(\psi_{j,m}^{\prime})&=\frac{\exp(\psi_{j,m}^{\prime})}{(\sigma_{\nu}^{2}+\sum_{j^{\prime}=1}^{J}\exp(\psi_{j^{\prime},m}^{\prime})\alpha_{j^{\prime},n})^{2}}\\ &\quad-\frac{2\alpha_{j,n}\exp^{2}(\psi_{j,m}^{\prime})}{(\sigma_{\nu}^{2}+\sum_{j^{\prime}=1}^{J}\exp(\psi_{j^{\prime},m}^{\prime})\alpha_{j^{\prime},n})^{3}}\\ &=\frac{\exp(\psi_{j,m}^{\prime})}{(\sigma_{\nu}^{2}+\sum_{j^{\prime}=1}^{J}\exp(\psi_{j^{\prime},m}^{\prime})\alpha_{j^{\prime},n})^{2}}\\ &\quad\times\underbrace{\left(1-\frac{2\alpha_{j,n}\exp(\psi_{j,m}^{\prime})}{\sigma_{\nu}^{2}+\sum_{j^{\prime}=1}^{J}\exp(\psi_{j^{\prime},m}^{\prime})\alpha_{j^{\prime},n}}\right)}_{\leq 1}\\ &\leq\frac{\exp(\psi^{\prime}_{j,m})}{(\sigma_{\nu}^{2}+\sum_{j^{\prime}=1}^{J}\exp(\psi^{\prime}_{j^{\prime},m})\alpha_{j^{\prime},n})^{2}}\\ &\leq\frac{C_{\psi}}{\sigma_{\nu}^{4}}.\\ \end{split} (34)

Applying MVT,

n=1N|Im,nαj,n||f~(ψj,m)+f~(ψ¯j,m)|=n=1N|Im,nαj,n||f~(ψj,m)(ψj,mψ¯j,m)|NCICαCψσν4|ψj,mψ¯j,m|.\begin{split}&\sum_{n=1}^{N}|I_{m,n}\alpha_{j,n}|\cdot\left|-\widetilde{f}\left(\psi_{j,m}\right)+\widetilde{f}\left(\overline{\psi}_{j,m}\right)\right|\\ &=\sum_{n=1}^{N}|I_{m,n}\alpha_{j,n}|\cdot\left|\widetilde{f}^{\prime}(\psi^{\prime}_{j,m})(\psi_{j,m}-\overline{\psi}_{j,m})\right|\\ &\leq\frac{NC_{I}C_{\alpha}C_{\psi}}{\sigma_{\nu}^{4}}\left|\psi_{j,m}-\overline{\psi}_{j,m}\right|.\\ \end{split} (35)

We thus have,

f2(Ψ)f2(Ψ¯)22=j=1Jm=1M(f2(Ψ)ψj,mf2(Ψ¯)ψj,m)2(JMNCαCψCIσν4)2ΨΨ¯22.\begin{split}&\lVert\nabla f_{2}(\Psi)-\nabla f_{2}(\overline{\Psi})\rVert_{2}^{2}\\ &=\sum_{j=1}^{J}\sum_{m=1}^{M}\left(\frac{\partial f_{2}(\Psi)}{\partial\psi_{j,m}}-\frac{\partial f_{2}(\overline{\Psi})}{\partial\psi_{j,m}}\right)^{2}\\ &\leq\left(\frac{JMNC_{\alpha}C_{\psi}C_{I}}{\sigma_{\nu}^{4}}\right)^{2}\lVert\Psi-\overline{\Psi}\rVert^{2}_{2}.\end{split} (36)

Collecting the Lipschitz constants C1C_{1} and C2C_{2}, we finally have

f(Ψ)f(Ψ¯)2JMNCαCψσν2(1+CIσν2)CΨΨ¯2.\lVert\nabla f(\Psi)-\nabla f(\overline{\Psi})\rVert_{2}\leq\underbrace{\frac{JMNC_{\alpha}C_{\psi}}{\sigma_{\nu}^{2}}\left(1+\frac{C_{I}}{\sigma_{\nu}^{2}}\right)}_{C}\lVert\Psi-\overline{\Psi}\rVert_{2}. (37)
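
For concreteness, C can be assembled from the constants above and used, for instance, as a conservative fixed step size 1/C when the BB initialization is disabled; the numerical values below are illustrative placeholders.

import numpy as np

def lipschitz_constant(J, M, N, delta, l_max, C_I, C_psi, sigma2_nu):
    """C = (J*M*N*C_alpha*C_psi / sigma_nu^2) * (1 + C_I / sigma_nu^2), with C_alpha from Eq. 23."""
    C_alpha = (1.0 + np.exp(-delta / l_max)) / (1.0 - np.exp(-delta / l_max))
    return J * M * N * C_alpha * C_psi / sigma2_nu * (1.0 + C_I / sigma2_nu)

# Illustrative values only; 1/C is a conservative step size compared with the BB rule.
C = lipschitz_constant(J=5, M=100, N=500, delta=1e-3, l_max=2.0, C_I=1e3, C_psi=1e2, sigma2_nu=1.0)
step_size = 1.0 / C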

H. Inference with p({𝐳j}j{σj,m2}j,m,𝐲,θ)p\left(\left\{\mathbf{z}_{j}\right\}_{j}\mid\{\sigma_{j,m}^{2}\}_{j,m},\mathbf{y},\theta\right) (Section 4.2)

We present the details for performing inference with p(\left\{\mathbf{z}_{j}\right\}_{j}\mid\{\widehat{\sigma}_{j,m}^{2}\}_{j,m},\mathbf{y},\hat{\theta}), given the estimates \{\widehat{\sigma}_{j,m}^{2}\}_{j,m} and \hat{\theta} from window-level inference. First, we present the Kalman filter/smoother algorithm to compute the posterior mean trajectory and the credible interval. Next, we present the forward filtering backward sampling (FFBS) algorithm [Carter and Kohn, 1994, Lindsten and Schön, 2013] to generate Monte Carlo (MC) sample trajectories.

We begin by defining additional notation.

1) 𝐳~j,k|k=𝔼[𝐳~j,k{σ^j,m2}j,m,𝐲1:k,θ^]2\widetilde{\mathbf{z}}_{j,k|k^{\prime}}=\mathbb{E}[\widetilde{\mathbf{z}}_{j,k}\mid\{\widehat{\sigma}_{j,m}^{2}\}_{j,m},\mathbf{y}_{1:k^{\prime}},\hat{\theta}]\in\mathbb{R}^{2}

Posterior mean of 𝐳~j,k\widetilde{\mathbf{z}}_{j,k}. We are primarily concerned with the following three types: 1) 𝐳~j,k|k1\widetilde{\mathbf{z}}_{j,k|k-1}, the one-step prediction estimate, 2) 𝐳~j,k|k\widetilde{\mathbf{z}}_{j,k|k}, the Kalman filter estimate, and 3) 𝐳~j,k|MN\widetilde{\mathbf{z}}_{j,k|MN}, the Kalman smoother estimate.

2) 𝐳~k|k=[(𝐳~1,k|k)T,,(𝐳~J,k|k)T]T2J\widetilde{\mathbf{z}}_{k|k^{\prime}}=[(\widetilde{\mathbf{z}}_{1,k|k^{\prime}})^{\text{T}},\ldots,(\widetilde{\mathbf{z}}_{J,k|k^{\prime}})^{\text{T}}]^{\text{T}}\in\mathbb{R}^{2J}

A collection of {𝐳~j,k|k}j\{\widetilde{\mathbf{z}}_{j,k|k^{\prime}}\}_{j} in a single vector.

3) 𝐏j,k|k=𝔼[(𝐳~j,k𝐳~j,k|k)(𝐳~j,k𝐳~j,k|k)T{σ^j,m2}j,m,𝐲1:k,θ^]2×2\mathbf{P}_{j,k|k^{\prime}}=\mathbb{E}[\left(\widetilde{\mathbf{z}}_{j,k}-\widetilde{\mathbf{z}}_{j,k|k^{\prime}}\right)\left(\widetilde{\mathbf{z}}_{j,k}-\widetilde{\mathbf{z}}_{j,k|k^{\prime}}\right)^{\text{T}}\mid\{\widehat{\sigma}_{j,m}^{2}\}_{j,m},\mathbf{y}_{1:k^{\prime}},\hat{\theta}]\in\mathbb{R}^{2\times 2}

Posterior covariance of 𝐳~j,k\widetilde{\mathbf{z}}_{j,k}. Just as in 𝐳~j,k|k\widetilde{\mathbf{z}}_{j,k|k^{\prime}}, we are interested in three types, i.e., 𝐏j,k|k1\mathbf{P}_{j,k|k-1}, 𝐏j,k|k\mathbf{P}_{j,k|k}, and 𝐏j,k|MN\mathbf{P}_{j,k|MN}.

4) 𝐏k|k=blkdiag(𝐏j,k|k)2J×2J\mathbf{P}_{k|k^{\prime}}=\operatorname{blkdiag}\left(\mathbf{P}_{j,k|k^{\prime}}\right)\in\mathbb{R}^{2J\times 2J}

A block diagonal matrix of JJ posterior covariance matrices.

5) 𝐀=blkdiag(exp(Δ/lj)𝐑(ωj))2J×2J\mathbf{A}=\operatorname{blkdiag}\left(\exp\left(-\Delta/l_{j}\right)\mathbf{R}(\omega_{j})\right)\in\mathbb{R}^{2J\times 2J}

A block diagonal transition matrix.

6) \mathbf{H}=(1,0,\ldots,1,0), the observation gain.

Kalman filter/smoother

The Kalman filter equations are given as

𝐳~j,mN+n|mN+(n1)=exp(Δ/lj)𝐑(ωj)𝐳~j,mN+(n1)|mN+(n1)𝐏j,mN+n|mN+(n1)=exp(2Δ/lj)𝐑(ωj)𝐏j,mN+(n1)|mN+(n1)𝐑T(ωj)+σj,m2(1exp(2Δ/lj))𝐊mN+n=𝐏mN+n|mN+(n1)𝐇T(𝐇𝐏mN+n|mN+(n1)𝐇T+σν2)1𝐳~mN+n|mN+n=𝐳~mN+n|mN+(n1)+𝐊mN+n(𝐲mN+n𝐇𝐳~mN+n|mN+(n1))𝐏mN+n|mN+n=(𝐈2J×2J𝐊mN+n𝐇)𝐏mN+n|mN+(n1).\begin{split}&\widetilde{\mathbf{z}}_{j,mN+n|mN+(n-1)}\\ &=\exp\left(-\Delta/l_{j}\right)\mathbf{R}(\omega_{j})\widetilde{\mathbf{z}}_{j,mN+(n-1)|mN+(n-1)}\\ &\mathbf{P}_{j,mN+n|mN+(n-1)}\\ &=\exp\left(-2\Delta/l_{j}\right)\mathbf{R}(\omega_{j})\mathbf{P}_{j,mN+(n-1)|mN+(n-1)}\mathbf{R}^{\text{T}}(\omega_{j})\\ &\quad+\sigma_{j,m}^{2}\left(1-\exp\left(-2\Delta/l_{j}\right)\right)\\ &\mathbf{K}_{mN+n}\\ &=\mathbf{P}_{mN+n|mN+(n-1)}\mathbf{H}^{\text{T}}\left(\mathbf{H}\mathbf{P}_{mN+n|mN+(n-1)}\mathbf{H}^{\text{T}}+\sigma_{\nu}^{2}\right)^{-1}\\ &\widetilde{\mathbf{z}}_{mN+n|mN+n}\\ &=\widetilde{\mathbf{z}}_{mN+n|mN+(n-1)}\\ &\quad+\mathbf{K}_{mN+n}\left(\mathbf{y}_{mN+n}-\mathbf{H}\widetilde{\mathbf{z}}_{mN+n|mN+(n-1)}\right)\\ &\mathbf{P}_{mN+n|mN+n}=\left(\mathbf{I}_{2J\times 2J}-\mathbf{K}_{mN+n}\mathbf{H}\right)\mathbf{P}_{mN+n|mN+(n-1)}.\\ \end{split} (38)

Subsequently, the Kalman smoother equations are given as

\begin{split}\mathbf{C}_{mN+n}&=\mathbf{P}_{mN+n|mN+n}\mathbf{A}^{\text{T}}\mathbf{P}^{-1}_{mN+(n+1)|mN+n}\in\mathbb{R}^{2J\times 2J}\\ \widetilde{\mathbf{z}}_{mN+n|MN}&=\widetilde{\mathbf{z}}_{mN+n|mN+n}+\mathbf{C}_{mN+n}\left(\widetilde{\mathbf{z}}_{mN+(n+1)|MN}-\widetilde{\mathbf{z}}_{mN+(n+1)|mN+n}\right)\\ \mathbf{P}_{mN+n|MN}&=\mathbf{P}_{mN+n|mN+n}+\mathbf{C}_{mN+n}\left(\mathbf{P}_{mN+(n+1)|MN}-\mathbf{P}_{mN+(n+1)|mN+n}\right)\mathbf{C}_{mN+n}^{\text{T}}.\\ \end{split} (39)

To obtain the mean reconstructed trajectory for the jthj^{\text{th}} oscillatory component, {𝐲^j,k}k\{\widehat{\mathbf{y}}_{j,k}\}_{k}, we take the real part of the jthj^{\text{th}} component of the smoothed mean, 𝐲^j,k=𝐞2j1T𝐳~k|MN\widehat{\mathbf{y}}_{j,k}=\mathbf{e}_{2j-1}^{\text{T}}\cdot\widetilde{\mathbf{z}}_{k|MN}, where 𝐞2j12J\mathbf{e}_{2j-1}\in\mathbb{R}^{2J} is a unit vector with the only non-zero value, equal to 1, at the entry 2j12j-1.

The 95% credible interval for \widehat{\mathbf{y}}_{j,k}, denoted \operatorname{CI}^{\text{lower}}_{j,k} and \operatorname{CI}^{\text{upper}}_{j,k} for the lower and upper ends, respectively, is given as

CIj,kupper=𝐞2j1T𝐳~k|MN+1.96𝐞2j1T𝐏k|MN𝐞2j1CIj,klower=𝐞2j1T𝐳~k|MN1.96𝐞2j1T𝐏k|MN𝐞2j1.\begin{split}\operatorname{CI}^{\text{upper}}_{j,k}&=\mathbf{e}_{2j-1}^{\text{T}}\cdot\widetilde{\mathbf{z}}_{k|MN}+1.96\cdot\sqrt{\mathbf{e}_{2j-1}^{\text{T}}\mathbf{P}_{k|MN}\mathbf{e}_{2j-1}}\\ \operatorname{CI}^{\text{lower}}_{j,k}&=\mathbf{e}_{2j-1}^{\text{T}}\cdot\widetilde{\mathbf{z}}_{k|MN}-1.96\cdot\sqrt{\mathbf{e}_{2j-1}^{\text{T}}\mathbf{P}_{k|MN}\mathbf{e}_{2j-1}}.\\ \end{split} (40)
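
A compact NumPy sketch of the filter/smoother recursions above for the stacked 2J-dimensional state. For brevity it assumes a single stationary window (one \sigma_j^2 per component), whereas the full PLSO recursion switches \sigma_{j,m}^2 at window boundaries; the zero prior mean and steady-state prior covariance are assumptions.

import numpy as np
from scipy.linalg import block_diag

def plso_kalman(y, omegas, ls, sigma2s, sigma2_nu, delta):
    """Kalman filter + RTS smoother for the stacked 2J-dim PLSO state (single window, for brevity)."""
    J, K = len(omegas), len(y)
    rot = lambda w: np.array([[np.cos(w), -np.sin(w)], [np.sin(w), np.cos(w)]])
    A = block_diag(*[np.exp(-delta / l) * rot(w) for w, l in zip(omegas, ls)])
    Q = block_diag(*[s * (1.0 - np.exp(-2.0 * delta / l)) * np.eye(2) for s, l in zip(sigma2s, ls)])
    H = np.tile([1.0, 0.0], J)                        # only the real parts contribute to y_k
    x = np.zeros(2 * J)                               # prior mean (assumption)
    P = block_diag(*[s * np.eye(2) for s in sigma2s]) # prior covariance at steady state (assumption)
    x_pred = np.zeros((K, 2 * J)); P_pred = np.zeros((K, 2 * J, 2 * J))
    x_filt = np.zeros((K, 2 * J)); P_filt = np.zeros((K, 2 * J, 2 * J))
    for k in range(K):                                # forward (filter) pass
        xp, Pp = A @ x, A @ P @ A.T + Q
        S = H @ Pp @ H + sigma2_nu
        gain = Pp @ H / S
        x = xp + gain * (y[k] - H @ xp)
        P = Pp - np.outer(gain, H @ Pp)
        x_pred[k], P_pred[k], x_filt[k], P_filt[k] = xp, Pp, x, P
    x_sm, P_sm = x_filt.copy(), P_filt.copy()
    for k in range(K - 2, -1, -1):                    # backward (smoother) pass
        C = P_filt[k] @ A.T @ np.linalg.inv(P_pred[k + 1])
        x_sm[k] = x_filt[k] + C @ (x_sm[k + 1] - x_pred[k + 1])
        P_sm[k] = P_filt[k] + C @ (P_sm[k + 1] - P_pred[k + 1]) @ C.T
    return x_filt, P_filt, x_pred, P_pred, x_sm, P_sm

# Mean and 95% CI of the (0-indexed) j-th reconstructed component, whose real part sits at state index 2*j:
#   y_hat_j = x_sm[:, 2 * j];  ci_halfwidth = 1.96 * np.sqrt(P_sm[:, 2 * j, 2 * j])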

FFBS algorithm for p({𝐳j}j{σj,m2}j,m,𝐲,θ)p(\left\{\mathbf{z}_{j}\right\}_{j}\mid\{\sigma_{j,m}^{2}\}_{j,m},\mathbf{y},\theta)

To generate the MC trajectory samples from the posterior distribution p({𝐳j}j{σ^j,m2}j,m,𝐲,θ^)p(\left\{\mathbf{z}_{j}\right\}_{j}\mid\{\widehat{\sigma}_{j,m}^{2}\}_{j,m},\mathbf{y},\hat{\theta}), we use the FFBS algorithm. The steps are summarized in Algorithm 2, which uses the Kalman estimates derived in the previous section. We denote s=1,,Ss=1,\ldots,S as the MC sample index.

Result: {𝐳~k(s)}k,sMN,S\left\{\widetilde{\mathbf{z}}_{k}^{(s)}\right\}_{k,s}^{MN,S}
for s1s\leftarrow 1 to SS do
       Sample 𝐳~MN(s)\widetilde{\mathbf{z}}^{(s)}_{MN} from 𝒩(𝐳~MN|MN,𝐏MN|MN)\mathcal{N}\left(\widetilde{\mathbf{z}}_{MN|MN},\mathbf{P}_{MN|MN}\right).
       for kMN1k\leftarrow MN-1 to 11 do
             Sample 𝐳~k(s)\widetilde{\mathbf{z}}^{(s)}_{k} from 𝒩(μ~k,𝐏~k)\mathcal{N}(\widetilde{\mu}_{k},\widetilde{\mathbf{P}}_{k}), where
μ~k=𝐳~k|k+𝐏k|k𝐀T𝐏k+1|k1(𝐳~k+1(s)𝐀𝐳~k|k)𝐏~k=𝐏k|k𝐏k|k𝐀T𝐏k+1|k1𝐀𝐏k|k.\begin{split}\widetilde{\mu}_{k}&=\widetilde{\mathbf{z}}_{k|k}+\mathbf{P}_{k|k}\mathbf{A}^{\text{T}}\mathbf{P}_{k+1|k}^{-1}(\widetilde{\mathbf{z}}^{(s)}_{k+1}-\mathbf{A}\widetilde{\mathbf{z}}_{k|k})\\ \widetilde{\mathbf{P}}_{k}&=\mathbf{P}_{k|k}-\mathbf{P}_{k|k}\mathbf{A}^{\text{T}}\mathbf{P}_{k+1|k}^{-1}\mathbf{A}\mathbf{P}_{k|k}.\\ \end{split}
       end for
      
end for
Algorithm 2 FFBS algorithm
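
Given the filtered and one-step-prediction quantities (e.g., as returned by the filtering sketch in the previous section), the backward pass of Algorithm 2 can be written as below; the function name and argument layout are illustrative choices, not the paper's.

import numpy as np

def ffbs_sample(x_filt, P_filt, x_pred, P_pred, A, n_samples=10, seed=0):
    """Backward-sample state trajectories from p(z | y), given Kalman filter outputs (Algorithm 2)."""
    rng = np.random.default_rng(seed)
    K, d = x_filt.shape
    samples = np.zeros((n_samples, K, d))
    for s in range(n_samples):
        z = rng.multivariate_normal(x_filt[-1], P_filt[-1])   # sample the final state
        samples[s, -1] = z
        for k in range(K - 2, -1, -1):                        # backward pass
            G = P_filt[k] @ A.T @ np.linalg.inv(P_pred[k + 1])
            mu = x_filt[k] + G @ (z - x_pred[k + 1])          # x_pred[k+1] = A x_filt[k]
            cov = P_filt[k] - G @ A @ P_filt[k]
            z = rng.multivariate_normal(mu, cov)
            samples[s, k] = z
    return samples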

I. Computational efficiency of PLSO vs. GP-PS

We show the runtime of PLSO and piecewise stationary GP (GP-PS) for inference of the mean trajectory of the hippocampus data (fs=1,250fs=1,250 Hz, J=5J=5, 2-second window) for varying data lengths (5050, 100100, 200200 seconds corresponding to K=6.25×104, 1.25×105, 2.5×105K=6.25\times 10^{4},\,1.25\times 10^{5},\,2.5\times 10^{5} sample points, respectively).

As noted in Section 5, the computational complexity of PLSO is O(J^{2}K), whereas that of GP-PS is O(N^{2}K). Since N, the number of samples per window, is fixed (2,500 samples), we expect both PLSO and GP-PS to scale linearly in the number of samples K. Table 3 confirms that this is the case; however, PLSO is much more efficient than GP-PS.

Table 3: Runtime (s) of PLSO and GP-PS for varying data length

T (s)    PLSO   GP-PS
50       1.7    346.8
100      3.1    700.6
200      6.5    1334.0

J. Simulation experiment (Section 5.1)

We simulate from the following model for 1kK1\leq k\leq K

𝐲k=10(KkK)𝐳1,k+10cos4(2πω0k)𝐳2,k+νk,\mathbf{y}_{k}=10\left(\frac{K-k}{K}\right)\mathbf{z}^{\Re}_{1,k}+10\cos^{4}(2\pi\omega_{0}k)\mathbf{z}^{\Re}_{2,k}+\nu_{k},

where \mathbf{z}_{1,k} and \mathbf{z}_{2,k} are drawn from the PLSO stationary generative model, i.e., \sigma_{j,m}^{2}=\sigma_{j}^{2}. The parameters are \omega_{0}/\omega_{1}/\omega_{2}=0.04/1/10 Hz, f_{s}=200 Hz, T=100 seconds, l_{1}=l_{2}=1, and \nu_{k}\sim\mathcal{N}(0,25). The observation thus comprises two amplitude-modulated oscillations, one modulated by a slow (\omega_{0}=0.04 Hz) sinusoid and the other by a linearly varying envelope [Ba et al., 2014]. We assume a 2-second PS interval. For PLSO, we use J=2 components and 5 block coordinate descent iterations for optimizing \theta and \{\sigma_{j,m}^{2}\}_{j,m}.
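
A NumPy sketch of this simulation. The latent components are drawn from the stationary complex AR(1) recursion; the component noise scales \sigma_1, \sigma_2 are set to illustrative values since they are not restated here, and the modulators are evaluated at k\Delta seconds.

import numpy as np

fs, T = 200, 100
K, delta = fs * T, 1.0 / fs
f0, f1, f2 = 0.04, 1.0, 10.0                      # modulation and component frequencies (Hz)
l1 = l2 = 1.0
sigma1 = sigma2 = 1.0                             # component noise scales (illustrative)
rng = np.random.default_rng(0)

def sample_component(f_hz, l, sigma):
    """Stationary complex AR(1): z_k = exp(-delta/l) exp(i*2*pi*f*delta) z_{k-1} + eps_k."""
    rho = np.exp(-delta / l)
    phase = 2.0 * np.pi * f_hz * delta
    scale = sigma * np.sqrt(1.0 - rho ** 2)       # keeps Var(Re z_k) = sigma^2 for all k
    z = np.zeros(K, dtype=complex)
    z[0] = sigma * (rng.standard_normal() + 1j * rng.standard_normal())
    for k in range(1, K):
        eps = scale * (rng.standard_normal() + 1j * rng.standard_normal())
        z[k] = rho * np.exp(1j * phase) * z[k - 1] + eps
    return z

k = np.arange(K)
z1, z2 = sample_component(f1, l1, sigma1), sample_component(f2, l2, sigma2)
y = (10.0 * (K - k) / K * z1.real
     + 10.0 * np.cos(2.0 * np.pi * f0 * k * delta) ** 4 * z2.real
     + rng.normal(scale=5.0, size=K))             # nu_k ~ N(0, 25)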

K. Details of the TVAR model

As explained in Section 5, the TVAR model is defined as

𝐲k=p=1Pap,k𝐲kp+εk,\mathbf{y}_{k}=\sum_{p=1}^{P}a_{p,k}\mathbf{y}_{k-p}+\varepsilon_{k},

which can alternatively be written as

\begin{pmatrix}\mathbf{y}_{k}\\ \vdots\\ \mathbf{y}_{k-P+1}\end{pmatrix}=\underbrace{\begin{pmatrix}a_{1,k}&a_{2,k}&\cdots&a_{P-1,k}&a_{P,k}\\ 1&0&\cdots&0&0\\ 0&1&\cdots&0&0\\ \vdots&\vdots&\ddots&\vdots&\vdots\\ 0&0&\cdots&1&0\end{pmatrix}}_{A_{k}}\begin{pmatrix}\mathbf{y}_{k-1}\\ \vdots\\ \mathbf{y}_{k-P}\end{pmatrix}+\varepsilon_{k}.

It is the transition matrix AkA_{k} that determines the oscillatory component profile at time kk, such as the number of components and the center frequencies. Specifically, {Ak}k\{A_{k}\}_{k} are first fit to the data 𝐲\mathbf{y} and then eigen-decomposition is performed on each of the estimated {Ak}k\{A_{k}\}_{k}. More in-depth technical details can be found in West et al. [1999].
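
For illustration, the sketch below builds the companion matrix A_k from a set of TVAR coefficients and reads off component center frequencies from its complex eigenvalues; the coefficients are a toy AR(2) example, and the sequential fitting and discounting of West et al. [1999] are omitted.

import numpy as np

def companion(a_k):
    """Companion (transition) matrix A_k for TVAR coefficients a_k = [a_{1,k}, ..., a_{P,k}]."""
    P = len(a_k)
    A = np.zeros((P, P))
    A[0, :] = a_k
    A[1:, :-1] = np.eye(P - 1)
    return A

def component_frequencies(a_k, fs):
    """Center frequencies (Hz) implied by the complex eigenvalues of A_k."""
    eigvals = np.linalg.eigvals(companion(a_k))
    freqs = np.abs(np.angle(eigvals)) * fs / (2.0 * np.pi)
    return np.sort(np.unique(np.round(freqs[freqs > 0], 3)))

# Toy AR(2) coefficient pair encoding a damped ~8 Hz oscillation at fs = 1250 Hz.
fs, f, r = 1250.0, 8.0, 0.99
a_k = [2.0 * r * np.cos(2.0 * np.pi * f / fs), -r ** 2]
print(component_frequencies(a_k, fs))             # recovers ~8 Hz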

We use publicly available code for the TVAR implementation (https://www2.stat.duke.edu/~mw/mwsoftware/TVAR/index.html). We use a TVAR order of p=70, as models of lower order did not capture the theta-band signal; in those cases, the lowest identified frequency band was the gamma band (>30 Hz). Even with higher orders of p and various hyperparameter combinations, the slow and theta frequency bands were still explained by a single oscillatory component. For the discount factor, we used \beta=0.999 to ensure that the TVAR coefficients and, consequently, the decomposed oscillatory components do not fluctuate much.

L. Anesthesia EEG dataset (Section 5.3)

We show spectral analysis results for the EEG data of a subject anesthetized with propofol (This is a different subject from the main text.) The data last T=2,500T=2{,}500 seconds, sampled at fs=250f_{s}=250 Hz. We assume a 4-second PS interval, use J=9J=9 components and 5 block coordinate descent iterations for optimizing θ\theta and {σj,m2}j,m\{\sigma_{j,m}^{2}\}_{j,m}.

Fig. 7 shows the STFT and the PLSO-estimated spectrograms. As noted in the main text, PLSO under the stationarity assumption (\lambda\rightarrow\infty) is too restrictive and fails to capture the time-varying spectral pattern. Both PLSO with \lambda=0 and \lambda=\lambda_{\operatorname{CV}} are more effective at capturing such patterns, with the latter removing artifacts and recovering smoother dynamics.

Figure 7: Spectrogram (in dB) under propofol anesthesia. (a) STFT of the data (b) PLSO with λ\lambda\rightarrow\infty (c) PLSO with λ=0\lambda=0 (d) PLSO with λ=λCV\lambda=\lambda_{\operatorname{CV}}.