
PLSO: A generative framework for decomposing nonstationary time-series into piecewise stationary oscillatory components

Andrew H. Song Electrical Engineering and Computer Science
Massachusetts Institute of Technology
Cambridge, Massachusetts, USA
Demba Ba School of Engineering and Applied Sciences
Harvard University
Cambridge, Massachusetts, USA
Emery N. Brown Picower Institute for Learning and Memory
Massachusetts Institute of Technology
Cambridge, Massachusetts, USA
Institute for Medical Engineering and Science
Massachusetts Institute of Technology
Cambridge, Massachusetts, USA
Department of Anesthesia, Critical Care and Pain Medicine
Massachusetts General Hospital
Boston, Massachusetts, USA
Abstract

To capture the slowly time-varying spectral content of real-world time-series, a common paradigm is to partition the data into approximately stationary intervals and perform inference in the time-frequency domain. However, this approach lacks a corresponding nonstationary time-domain generative model for the entire data and thus, time-domain inference occurs in each interval separately. This results in distortion/discontinuity around interval boundaries and can consequently lead to erroneous inferences based on any quantities derived from the posterior, such as the phase. To address these shortcomings, we propose the Piecewise Locally Stationary Oscillation (PLSO) model for decomposing time-series data with slowly time-varying spectra into several oscillatory, piecewise-stationary processes. PLSO, as a nonstationary time-domain generative model, enables inference on the entire time-series without boundary effects and simultaneously provides a characterization of its time-varying spectral properties. We also propose a novel two-stage inference algorithm that combines Kalman theory and an accelerated proximal gradient algorithm. We demonstrate these points through experiments on simulated data and real neural data from the rat and the human brain.

1 Introduction

With the collection of long time-series now common in areas such as neuroscience and geophysics, it is important to develop an inference framework for data where the stationarity assumption is too restrictive. We restrict our attention to data 1) with spectral properties that change slowly over time and 2) for which decomposition into several oscillatory components is warranted for interpretation, often the case in electroencephalogram (EEG) or electrophysiology recordings. One can use bandpass filtering [Oppenheim et al., 2009] or methods such as the empirical mode decomposition [Huang et al., 1998, Daubechies et al., 2011] for these purposes. However, due to the absence of a generative model, these methods lack a framework for performing inference. Another popular approach is to perform inference in the time-frequency (TF) domain on the short-time Fourier transform (STFT) of the data, assuming stationarity within small intervals. This has led to a rich literature on inference in the TF domain, such as Wilson et al. [2008]. A drawback is that most of these methods focus on estimates of the power spectral density (PSD) and lose important phase information. To recover time-domain estimates, additional algorithms are required [Griffin and J. Lim, 1984].

This motivates us to explore time-domain generative models that allow time-domain inference and decomposition into oscillatory components. We can find examples based on the stationarity assumption in the signal processing/Gaussian process (GP) communities. A superposition of stochastic harmonic oscillators, where each oscillator corresponds to a frequency band, is used in the processing of speech  [Cemgil and Godsill, 2005] and neuroscience data [Matsuda and Komaki, 2017, Beck et al., 2018]. In GP literature [Rasmussen and Williams, 2005], the spectral mixture (SM) kernel [Wilson and Adams, 2013, Wilkinson et al., 2019] models the data as samples from a GP, whose kernel consists of the superposition of localized and frequency-modulated kernels.

These time-domain models can be applied to nonstationary data by partitioning them into stationary intervals and performing time-domain inference within each interval. However, we are faced with a different kind of challenge. As the inference is localized within each interval, the time-domain estimates in different intervals are independent conditioned on the data and do not reflect the dependence across intervals. This also causes discontinuity/distortion of the time-domain estimates near the interval boundaries, and consequently of any quantities derived from these estimates.

To address these shortcomings, we propose a generative framework for data with slowly time-varying spectra, termed the Piecewise Locally Stationary Oscillation (PLSO) framework (code is available at https://github.com/andrewsong90/plso.git). The main contributions are:

Generative model for piecewise stationary, oscillatory components PLSO models time-series as the superposition of piecewise stationary, oscillatory components. This allows time-domain inference on each component and estimation of the time-varying spectra.

Continuity across stationary intervals The state-space model that underlies PLSO strikes a balance between ensuring time-domain continuity across piecewise stationary intervals and stationarity within each interval. Moreover, by imposing stochastic continuity at the interval level, PLSO accurately learns the underlying smooth time-varying spectra.

Inference procedure We propose a two-stage inference procedure for the time-varying spectra and the time-series. By leveraging the Markovian dynamics, the algorithm combines Kalman filter theory [Kalman, 1960] and an inexact accelerated proximal gradient approach [Li and Lin, 2015].

In Section 2 we introduce necessary background, followed by the PLSO framework in Section 3. In Section 4, we discuss inference for PLSO. In Section 5, we discuss how PLSO relates to other frameworks. In Section 6, we present experimental results and conclude in Section 7.

2 Background

2.1 Notation

We use j\in\{1,\ldots,J\} and k\in\{1,\ldots,K\} to denote frequency and discrete-time sample index, respectively. We use \omega\in[-\pi,\pi] for normalized frequency. The j^{\text{th}} latent process centered at \omega=\omega_{j} is denoted as \mathbf{z}_{j}\in\mathbb{C}^{K}, with \mathbf{z}_{j,k}\in\mathbb{C} denoting the k^{\text{th}} sample of \mathbf{z}_{j} and \mathbf{z}^{\Re}_{j,k}, \mathbf{z}^{\Im}_{j,k} its real and imaginary parts. We also represent \mathbf{z}_{j,k} as an \mathbb{R}^{2} vector, \widetilde{\mathbf{z}}_{j,k}=[\mathbf{z}^{\Re}_{j,k},\mathbf{z}^{\Im}_{j,k}]^{\text{T}}. The elements of \mathbf{z}_{j} are denoted as \mathbf{z}_{j,k:k^{\prime}}=[\mathbf{z}_{j,k},\ldots,\mathbf{z}_{j,k^{\prime}}]^{\text{T}}. The state covariance matrix for \mathbf{z}_{j,k} is defined as \mathbf{P}_{k}^{j}=\mathbb{E}[\widetilde{\mathbf{z}}_{j,k}(\widetilde{\mathbf{z}}_{j,k})^{\text{T}}]. To express an enumeration of variables, we use \{\cdot\} and drop the first/last index for simplicity, e.g. \{\mathbf{z}_{j}\}_{j} instead of \{\mathbf{z}_{j}\}_{j=1}^{J}.

We use \mathbf{y}_{k} and \mathbf{z}_{j,k} for the discrete-time counterparts of the continuous observation and latent process, y(t) and z_{j}(t). With the sampling frequency f_{s}=1/\Delta, we have \mathbf{y}_{k}=y(k\Delta), \mathbf{z}_{j,k}=z_{j}(k\Delta), and T=K\Delta.

2.2 Piecewise local stationarity

The concept of piecewise local stationarity (PLS) for nonstationary time-series with slowly time-varying spectra [Adak, 1998] plays an important role in PLSO. A stationary process has a constant mean and a covariance function which depends only on the difference between two time points.

For our purposes, it suffices to understand the following about PLS: 1) It includes locally stationary [Priestley, 1965, Dahlhaus, 1997] and amplitude-modulated stationary processes. 2) A PLS process can be approximated as a piecewise stationary (PS) process (Theorem 1 of [Adak, 1998])

z(t)=\sum_{m=1}^{M}\mathbf{1}(u_{m}\leq t<u_{m+1})\cdot z^{m}(t), (1)

where z^{m}(t) is a continuous stationary process and the boundaries are 0=u_{1}<\cdots<u_{M+1}=T. Note that Eq. 1 does not guarantee continuity across different PS intervals,

\lim_{t\rightarrow u^{-}_{m}}z(t)=\lim_{t\rightarrow u^{-}_{m}}z^{m-1}(t)\neq\lim_{t\rightarrow u^{+}_{m}}z^{m}(t)=\lim_{t\rightarrow u^{+}_{m}}z(t).

3 The PLSO model and its mathematical properties

Building on Theorem 1 of Adak [1998], PLSO models nonstationary data as PS processes. It is a superposition of J different PS processes \{\mathbf{z}_{j}\}_{j}, with \mathbf{z}_{j} corresponding to an oscillatory process centered at frequency \omega_{j}. PLSO also guarantees stochastic continuity across PS intervals. We show that piecewise stationarity and continuity across PS intervals are two competing objectives and that PLSO strikes a balance between them, as discussed in Section 3.2.

Figure 1: A simulated example. (a) Time domain. Data (black) around boundaries (gray) and inferred oscillatory components using PLSO (red) and regularized STFT (blue). (b) Frequency domain. Spectrum of the data (gray), PLSO components for J=2 (purple) and their sum (red).

Fig. 1 shows an example of the PLSO framework applied to simulated data. In the time domain, the oscillation inferred using the regularized STFT (blue) [Kim et al., 2018], which imposes stochastic continuity on the STFT coefficients, suffers from discontinuity/distortion near window boundaries, whereas that inferred by PLSO (red) does not. In the frequency domain, each PLSO component corresponds to a localized spectrum S_{j}(\omega), the sum of which is the PSD \gamma(\omega), and is fit to the data STFT (or periodogram) I(\omega). We start by introducing the PLSO model for a single window.

3.1 PLSO for stationary data

As a building block for PLSO, we use the discrete stochastic harmonic oscillator for a stationary time series [Qi et al., 2002, Cemgil and Godsill, 2005, Matsuda and Komaki, 2017]. The data \mathbf{y} are assumed to be a superposition of J independent zero-mean components

\begin{split}\widetilde{\mathbf{z}}_{j,k}&=\rho_{j}\mathbf{R}(\omega_{j})\widetilde{\mathbf{z}}_{j,k-1}+\varepsilon_{j,k}\\ \mathbf{y}_{k}&=\sum_{j=1}^{J}\mathbf{z}^{\Re}_{j,k}+\nu_{k},\\ \end{split} (2)

where \mathbf{R}(\omega_{j})=\begin{pmatrix}\cos(\omega_{j})&-\sin(\omega_{j})\\ \sin(\omega_{j})&\cos(\omega_{j})\\ \end{pmatrix}, \varepsilon_{j,k}\sim\mathcal{N}\left(0,\alpha_{j}\mathbf{I}_{2\times 2}\right), and \nu_{k}\sim\mathcal{N}(0,\sigma_{\nu}^{2}) correspond to the rotation matrix, the state noise, and the observation noise, respectively. The imaginary component \mathbf{z}^{\Im}_{j,k}, which does not directly contribute to \mathbf{y}_{k}, can be seen as the auxiliary variable to write \mathbf{z}_{j} in recursive form using \mathbf{R}(\omega_{j}) [Koopman and Lee, 2009]. We assume \mathbf{P}_{1}^{j}=\sigma_{j}^{2}\cdot\mathbf{I}_{2\times 2}\,\,\forall j.

We reparameterize \rho_{j} and \alpha_{j}, using lengthscale l_{j} and power \sigma^{2}_{j}, such that \rho_{j}=\exp(-\Delta/l_{j}) and \alpha_{j}=\sigma_{j}^{2}(1-\rho_{j}^{2}). Theoretically, this establishes a connection to 1) the stochastic differential equation [Brown et al., 2004, Solin and Särkkä, 2014], detailed in Appendix A, and 2) a superposition of frequency-modulated and localized GP kernels, similar to the SM kernel [Wilson and Adams, 2013]. Practically, this ensures that \rho_{j}<1, and thus stability of the process.
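To make the generative model concrete, the following is a minimal sketch that simulates a single stationary component of Eq. 2 under the lengthscale parameterization above. The function name and the way the initial state is drawn are illustrative assumptions, not taken from the released PLSO code.

```python
import numpy as np

def simulate_plso_component(omega_j, l_j, sigma2_j, Delta, K, sigma2_nu=0.0, seed=0):
    """Sketch of one stationary PLSO component (Eq. 2).

    omega_j : center frequency in radians/sample
    l_j     : lengthscale (same units as Delta)
    sigma2_j: stationary power of the component
    """
    rng = np.random.default_rng(seed)
    rho = np.exp(-Delta / l_j)                       # damping; guarantees rho < 1
    alpha = sigma2_j * (1.0 - rho ** 2)              # state-noise variance
    R = np.array([[np.cos(omega_j), -np.sin(omega_j)],
                  [np.sin(omega_j),  np.cos(omega_j)]])
    z = np.zeros((K, 2))
    z[0] = rng.normal(scale=np.sqrt(sigma2_j), size=2)   # P_1 = sigma2_j * I
    for k in range(1, K):
        z[k] = rho * R @ z[k - 1] + rng.normal(scale=np.sqrt(alpha), size=2)
    y = z[:, 0] + rng.normal(scale=np.sqrt(sigma2_nu), size=K)  # observe the real part
    return z, y
```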

Given Eq. 2, we can readily express the frequency spectra of PLSO in each interval, through the autocovariance function. The autocovariance of \mathbf{z}_{j} is given as Q_{j}(n^{\prime})=\mathbb{E}[\mathbf{z}^{\Re}_{j,k}\mathbf{z}^{\Re}_{j,k+n^{\prime}}]=\sigma_{j}^{2}\cos(\omega_{j}n^{\prime})\exp\left(-n^{\prime}\Delta/l_{j}\right). It can also be thought of as an exponential kernel, frequency-modulated by \omega_{j}. The spectrum for \mathbf{z}_{j}, denoted as S_{j}(\omega), is obtained by taking the Fourier transform (FT) of Q_{j}(n^{\prime})

\begin{split}S_{j}(\omega)&=\sum_{n^{\prime}=-\infty}^{\infty}Q_{j}(n^{\prime})\exp\left(-i\omega n^{\prime}\right)=\varphi_{j}(\omega)+\varphi_{j}(-\omega)\\ \varphi_{j}(\omega)&=\frac{\sigma_{j}^{2}\left(1-\exp\left(-2\Delta/l_{j}\right)\right)}{1+\exp\left(-2\Delta/l_{j}\right)-2\exp\left(-\Delta/l_{j}\right)\cos(\omega-\omega_{j})},\\ \end{split}

with the detailed derivation in Appendix B.

Given S_{j}(\omega), we can show that the PSD \gamma(\omega) of the entire process \sum_{j=1}^{J}\mathbf{z}_{j} is simply \gamma(\omega)=\sum_{j=1}^{J}S_{j}(\omega). First, since \mathbf{z}_{j} is independent across j, the autocovariance can be simplified, i.e., \mathbb{E}[\sum_{j}\mathbf{z}^{\Re}_{j,k}\sum_{j}\mathbf{z}^{\Re}_{j,k+n^{\prime}}]=\sum_{j}\mathbb{E}[\mathbf{z}^{\Re}_{j,k}\mathbf{z}^{\Re}_{j,k+n^{\prime}}]. Next, using the linearity of the FT, we can conclude that \gamma(\omega) is a superposition of the individual spectra.
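The closed-form spectrum above can be evaluated directly. The short sketch below computes S_j(\omega) and sums component spectra to obtain \gamma(\omega); the helper name and the parameter values are hypothetical.

```python
import numpy as np

def component_spectrum(omega, omega_j, l_j, sigma2_j, Delta):
    """S_j(omega) = phi_j(omega) + phi_j(-omega), using the closed form above."""
    def phi(w):
        num = sigma2_j * (1.0 - np.exp(-2.0 * Delta / l_j))
        den = (1.0 + np.exp(-2.0 * Delta / l_j)
               - 2.0 * np.exp(-Delta / l_j) * np.cos(w - omega_j))
        return num / den
    return phi(omega) + phi(-omega)

# PSD of the full process is the sum of the component spectra
omegas = np.linspace(-np.pi, np.pi, 512)
params = [(0.2, 1.0, 2.0), (1.0, 0.5, 1.0)]   # hypothetical (omega_j, l_j, sigma2_j)
gamma = sum(component_spectrum(omegas, w, l, s2, Delta=1.0) for (w, l, s2) in params)
```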

3.2 PLSO for nonstationary data

If \mathbf{y} is nonstationary, we can still apply stationary PLSO of Eq. 2 for the time-domain inference. However, this implies constant spectra for the entire data (S_{j}(\omega) and \gamma(\omega) do not depend on k), which is not suitable for nonstationary time-series for which we want to track spectral dynamics. This point is further illustrated in Section 6.

We therefore segment \mathbf{y} into M non-overlapping PS intervals, indexed by m\in\{1,\ldots,M\}, of length N, indexed by n\in\{1,\ldots,N\}, such that K=MN. We then apply the stationary PLSO to each interval, with additional Markovian dynamics imposed on \sigma_{j,m}^{2},

\begin{split}\log(\sigma_{j,m}^{2})&=\log(\sigma_{j,m-1}^{2})+\eta_{j,m}\\ \widetilde{\mathbf{z}}_{j,mN+n}&=\rho_{j}\mathbf{R}(\omega_{j})\widetilde{\mathbf{z}}_{j,mN+(n-1)}+\varepsilon_{j,mN+n}\\ \mathbf{y}_{mN+n}&=\sum_{j=1}^{J}\mathbf{z}^{\Re}_{j,mN+n}+\nu_{mN+n},\\ \end{split} (3)

where \varepsilon_{j,mN+n}\sim\mathcal{N}(0,\sigma_{j,m}^{2}(1-\rho_{j}^{2})\mathbf{I}_{2\times 2}), \eta_{j,m}\sim\mathcal{N}(0,1/\sqrt{\lambda}), and \nu_{mN+n}\sim\mathcal{N}(0,\sigma_{\nu}^{2}). We define \mathbf{P}_{m,n}^{j} as the covariance of \widetilde{\mathbf{z}}_{j,mN+n}, with \mathbf{P}_{1,1}^{j}=\sigma_{j,1}^{2}\mathbf{I}_{2\times 2},\forall j. The graphical model is shown in Fig. 2.
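A minimal simulation of the nonstationary model in Eq. 3 is sketched below. We assume the random walk on \log\sigma_{j,m}^{2} uses variance 1/\lambda (the interpretation suggested by the prior in Eq. 7); all names and values are illustrative rather than taken from the PLSO implementation.

```python
import numpy as np

def simulate_plso(omegas, lengths, log_s2_init, lam, Delta, M, N, sigma2_nu, seed=0):
    """Sketch of the nonstationary PLSO generative model (Eq. 3)."""
    rng = np.random.default_rng(seed)
    J, K = len(omegas), M * N
    z = np.zeros((J, K, 2))
    y = rng.normal(scale=np.sqrt(sigma2_nu), size=K)          # observation noise
    for j, (w, l, log_s2) in enumerate(zip(omegas, lengths, log_s2_init)):
        rho = np.exp(-Delta / l)
        R = np.array([[np.cos(w), -np.sin(w)], [np.sin(w), np.cos(w)]])
        state = rng.normal(scale=np.sqrt(np.exp(log_s2)), size=2)   # P_{1,1} = sigma^2_{j,1} I
        for m in range(M):
            if m > 0:
                # stochastic continuity on log sigma^2 (variance 1/lam assumed)
                log_s2 += rng.normal(scale=np.sqrt(1.0 / lam))
            alpha = np.exp(log_s2) * (1.0 - rho ** 2)
            for n in range(N):
                state = rho * R @ state + rng.normal(scale=np.sqrt(alpha), size=2)
                k = m * N + n
                z[j, k] = state
                y[k] += state[0]                               # only the real part is observed
    return z, y
```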

Figure 2: The graphical model for PLSO.

We can understand PLSO as providing a parameterized spectrogram defined by \theta=\{\lambda,\sigma_{\nu}^{2},\{l_{j}\}_{j},\{\omega_{j}\}_{j}\} and \{\sigma_{j,m}^{2}\}_{j,m} of the time-domain generative model. The lengthscale l_{j} controls the bandwidth of the j^{\text{th}} process, with larger l_{j} corresponding to narrower bandwidth. The variance \sigma_{j,m}^{2} controls the power of \mathbf{z}_{j} and changes across different intervals, resulting in time-varying spectra S_{j}^{(m)}(\omega) and PSD \gamma^{(m)}(\omega). The center frequency \omega_{j} (and -\omega_{j}), at which S_{j}^{(m)}(\omega) is maximized, controls the modulation frequency.

As discussed previously, the segmentation approach for nonstationary time-series produces distortion/discontinuity artifacts around interval boundaries. PLSO, as described by Eq. 3, resolves these issues. We now analyze two mathematical properties, stochastic continuity and piecewise stationarity, to gain more insight into how PLSO accomplishes this.

3.2.1 Stochastic continuity

We discuss two types of stochastic continuity, 1) across the interval boundaries and 2) on \{\sigma_{j,m}^{2}\}.

Continuity across the interval boundaries In PLSO, the state-space model (Eq. 3) provides stochastic continuity across different PS intervals. The following proposition rigorously explains stochastic continuity for PLSO.

Proposition 1.

For a given m, as \Delta\rightarrow 0, the samples on either side of the interval boundary, which are \widetilde{\mathbf{z}}_{j,(m+1)N} and \widetilde{\mathbf{z}}_{j,(m+1)N+1}, converge to each other in mean square,

\lim_{\Delta\rightarrow 0}\mathbb{E}[\Delta\widetilde{\mathbf{z}}_{j,(m+1)N}\Delta\widetilde{\mathbf{z}}_{j,(m+1)N}^{\text{T}}]=0,

where we use \Delta\widetilde{\mathbf{z}}_{j,(m+1)N}=\widetilde{\mathbf{z}}_{j,(m+1)N+1}-\widetilde{\mathbf{z}}_{j,(m+1)N}.

Proof.

We use the connection between PLSO, which is a discrete-time model, and its continuous-time counterpart. Details are in Appendix C. ∎

This matches our intuition that as \Delta\rightarrow 0, adjacent samples from the same process should converge to each other. For PS approaches without sample-level continuity, even with an interval-level constraint [Rosen et al., 2009, Kim et al., 2018, Song et al., 2018, Das and Babadi, 2018, Soulat et al., 2019], convergence is not guaranteed.

We can interpret the continuity in the context of the posterior for \mathbf{z}_{j}. For PS approaches without continuity, we have

\begin{split}&p(\left\{\mathbf{z}_{j}\right\}_{j}|\mathbf{y})\\ &\propto\prod_{m=1}^{M}p(\left\{\mathbf{z}_{j,(m-1)N+1:mN}\right\}_{j}|\mathbf{y}_{(m-1)N+1:mN}),\\ \end{split} (4)

where \{\sigma_{j,m}^{2}\}_{j,m} and \theta are omitted for notational ease. This is due to p(\{\mathbf{z}_{j}\}_{j})=\prod_{m=1}^{M}p(\{\mathbf{z}_{j,(m-1)N+1:mN}\}_{j}), a result of the absence of continuity across the intervals. Consequently, the inferred time-domain estimates are conditionally independent across the intervals. In PLSO, on the contrary, the time-domain estimates depend on the entire \mathbf{y}, not just on a subset.

Continuity on \sigma_{j,m}^{2} For a given j, we impose stochastic continuity on \log\sigma_{j,m}^{2}. Effectively, this pools together estimates of \{\sigma^{2}_{j,m}\}_{m} to 1) prevent overfitting to the noisy data spectra and 2) estimate smooth dynamics of \{\sigma^{2}_{j,m}\}_{m}. The use of \log\sigma_{j,m}^{2} ensures that \sigma_{j,m}^{2} is non-negative.

The choice of \lambda dictates the smoothness of \{\sigma^{2}_{j,m}\}_{m}, with the two extremes corresponding to familiar dynamics. If \lambda\rightarrow 0, we treat each window independently. If \lambda\rightarrow\infty, we treat the data as stationary, as the constraint forces \sigma^{2}_{j,m}=\sigma_{j}^{2}, \forall m. Practically, the smooth regularization prevents artifacts in the spectral analysis, arising from sudden motion or missing data, as demonstrated in Section 6.

3.2.2 Piecewise stationarity

For the m^{\text{th}} window to be piecewise stationary, the initial state covariance matrix \mathbf{P}_{m,1}^{j} should be the steady-state covariance matrix for the window, denoted as \mathbf{P}_{m,\infty}^{j}.

The challenge is transitioning from \mathbf{P}_{m,\infty}^{j} to \mathbf{P}_{m+1,\infty}^{j}. Specifically, to ensure \mathbf{P}_{m+1,1}^{j}=\mathbf{P}_{m+1,\infty}^{j}, given that \mathbf{P}_{m,N}^{j}=\mathbf{P}_{m,\infty}^{j}\neq\mathbf{P}_{m+1,\infty}^{j}, the variance of the process noise between the two samples, \varepsilon_{j,(m+1)N+1}, has to equal \mathbf{P}_{m+1,\infty}^{j}-\exp\left(-2\Delta/l_{j}\right)\mathbf{P}_{m,\infty}^{j}. However, this is infeasible. If \mathbf{P}_{m+1,\infty}^{j}<\mathbf{P}_{m,\infty}^{j}, the variance is negative. Even if it were positive, its limit as \Delta\rightarrow 0, namely \mathbf{P}_{m+1,\infty}^{j}-\mathbf{P}_{m,\infty}^{j}, does not equal zero. As a result, Proposition 1 no longer holds and the trajectory is discontinuous.

In summary, there exists a trade-off between maintaining piecewise stationarity and continuity across intervals. PLSO maintains continuity across the intervals while ensuring that the state covariance quickly transitions to the steady-state covariance. We quantify the speed of transition in the following proposition.

Proposition 2.

Assume l_{j}\ll N\Delta, such that \mathbf{P}_{m,N}^{j}=\mathbf{P}_{m,\infty}^{j}. In Eq. 3, the difference between \mathbf{P}_{m,\infty}^{j} and \mathbf{P}_{m+1,\infty}^{j} decays exponentially fast as a function of n,

\mathbf{P}_{m+1,n}^{j}=\mathbf{P}_{m+1,\infty}^{j}+\exp\left(-\frac{2n\Delta}{l_{j}}\right)\left(\mathbf{P}_{m,\infty}^{j}-\mathbf{P}_{m+1,\infty}^{j}\right).
Proof.

In Appendix D, we prove this result by induction.

This implies that, except for the transition portion at the beginning of each window, we can assume stationarity. In practice, we additionally impose an upper bound on l_{j} during estimation and also use a reasonably large N. Empirically, we observe that the transition period has little impact.
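Since \mathbf{P}_{m,n}^{j} is isotropic, Proposition 2 can be checked with a scalar recursion: propagate the predicted state variance across a boundary with the new state-noise level and compare against the closed form. The numbers below are hypothetical; this is a verification sketch, not part of the inference algorithm.

```python
import numpy as np

Delta, l_j = 1e-3, 0.05
rho = np.exp(-Delta / l_j)
P_prev, P_next = 2.0, 0.5      # hypothetical steady-state variances of intervals m and m+1
P = P_prev
for n in range(1, 51):
    # Kalman prediction step with the new interval's state-noise variance
    P = rho ** 2 * P + P_next * (1.0 - rho ** 2)
    closed_form = P_next + np.exp(-2 * n * Delta / l_j) * (P_prev - P_next)
    assert np.isclose(P, closed_form)   # matches Proposition 2 at every n
```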

4 Inference

Given the generative model in Eq. 3, our goal is to perform inference on the posterior distribution

\begin{split}&p(\left\{\mathbf{z}_{j}\right\}_{j},\{\sigma_{j,m}^{2}\}_{j,m}\mid\mathbf{y},\theta)\\ &=\underbrace{p(\{\sigma_{j,m}^{2}\}_{j,m}\mid\mathbf{y},\theta)}_{\text{window-level posterior}}\cdot\underbrace{p(\left\{\mathbf{z}_{j}\right\}_{j}\mid\{\sigma_{j,m}^{2}\}_{j,m},\mathbf{y},\theta)}_{\text{sample-level posterior}}.\\ \end{split} (5)

We can learn \theta or fix the parameters to specific values informed by domain knowledge, such as the center frequency or bandwidth of the processes. The posterior distribution factorizes into two terms as in Eq. 5, the window-level posterior p(\{\sigma_{j,m}^{2}\}_{j,m}|\mathbf{y},\theta) and the sample-level posterior p(\{\mathbf{z}_{j}\}_{j}|\{\sigma_{j,m}^{2}\}_{j,m},\mathbf{y},\theta). Accordingly, we break the inference into two stages.

Stage 1 We minimize the window-level negative log-posterior, with respect to \theta and \{\sigma_{j,m}^{2}\}_{j,m}. Specifically, we obtain the maximum likelihood (ML) estimate \hat{\theta}_{\text{ML}} and the maximum a posteriori (MAP) estimate \{\widehat{\sigma}^{2}_{j,m,\text{MAP}}\}_{j,m}. We drop the subscripts ML and MAP for notational simplicity.

Stage 2 Given \{\widehat{\sigma}^{2}_{j,m}\}_{j,m} and \hat{\theta}, we perform inference on the sample-level posterior. This includes computing the mean \widehat{\mathbf{z}}_{j}=\mathbb{E}[\mathbf{z}_{j}|\{\widehat{\sigma}^{2}_{j,m}\}_{j,m},\mathbf{y},\hat{\theta}] and credible intervals, which quantify the uncertainty of the estimates, or any statistical quantity derived from the posterior distribution.

4.1 Optimization of \{\sigma_{j,m}^{2}\}_{j,m} and \theta

We factorize the posterior, p(\{\sigma_{j,m}^{2}\}_{j,m}\mid\mathbf{y},\theta)\propto p(\mathbf{y}\mid\{\sigma_{j,m}^{2}\}_{j,m},\theta)\cdot p(\{\sigma_{j,m}^{2}\}_{j,m}\mid\theta). As the exact inference is intractable, we instead minimize the negative log-posterior, -\log p(\{\sigma_{j,m}^{2}\}_{j,m}\mid\mathbf{y},\theta). This is an empirical Bayes approach [Casella, 1985], since we estimate \{\sigma_{j,m}^{2}\}_{j,m} using the marginal likelihood p(\mathbf{y}\mid\{\sigma_{j,m}^{2}\}_{j,m},\theta). The smooth hyperprior provides the MAP estimate for \{\sigma_{j,m}^{2}\}_{j,m}.

We use the Whittle likelihood [Whittle, 1953], defined for stationary time-series in the frequency domain, for the log-likelihood f(\{\sigma_{j,m}^{2}\}_{j,m};\theta)=\log p(\mathbf{y}\mid\{\sigma_{j,m}^{2}\}_{j,m},\theta),

\begin{split}&f(\{\sigma_{j,m}^{2}\}_{j,m};\theta)\\ &=-\frac{1}{2}\sum_{m,n=1}^{M,N}\log(\gamma^{(m)}(\omega_{n})+\sigma_{\nu}^{2})+\frac{I^{(m)}(\omega_{n})}{\gamma^{(m)}(\omega_{n})+\sigma_{\nu}^{2}},\\ \end{split} (6)

where the log-likelihood is the sum of the Whittle likelihood computed for each interval, with discrete frequencies \omega_{n}=2\pi n/N and data STFT I^{(m)}(\omega_{n})=\left|\sum_{n^{\prime}=1}^{N}\exp\left(-2\pi i(n^{\prime}-1)(n-1)/N\right)\mathbf{y}_{mN+n^{\prime}}\right|^{2}. The Whittle likelihood, which is nonconvex, enables frequency-domain parameter estimation as a computationally more efficient alternative to time-domain estimation [Turner and Sahani, 2014]. The concave log-prior g(\{\sigma_{j,m}^{2}\}_{j,m};\theta)=\log p(\{\sigma_{j,m}^{2}\}_{j,m}\mid\theta), which arises from the continuity on \{\sigma_{j,m}^{2}\}_{j,m}, is given as

g(\{\sigma_{j,m}^{2}\}_{j,m};\theta)=-\frac{\lambda}{2}\sum_{j=1}^{J}\sum_{m=1}^{M}\left(\log\sigma^{2}_{j,m}-\log\sigma^{2}_{j,m-1}\right)^{2}. (7)

This yields the following nonconvex problem

\begin{split}&\min_{\{\sigma_{j,m}^{2}\}_{j,m},\theta}\,-\log p\left(\{\sigma_{j,m}^{2}\}_{j,m}\mid\mathbf{y},\theta\right)\\ &=\min_{\{\sigma_{j,m}^{2}\}_{j,m},\theta}\,-f(\{\sigma_{j,m}^{2}\}_{j,m};\theta)-g(\{\sigma_{j,m}^{2}\}_{j,m};\theta).\\ \end{split} (8)

We optimize Eq. 8 by block coordinate descent [Wright, 2015] on \{\sigma_{j,m}^{2}\}_{j,m} and \{\sigma_{\nu}^{2},\{l_{j}\}_{j},\{\omega_{j}\}_{j}\}. For \sigma_{\nu}^{2}, \{l_{j}\}_{j}, and \{\omega_{j}\}_{j}, we minimize -f(\{\sigma_{j,m}^{2}\}_{j,m};\theta), since g(\{\sigma_{j,m}^{2}\}_{j,m};\theta) does not affect them. We perform conjugate gradient descent on \{l_{j}\}_{j} and \{\omega_{j}\}_{j}. We discuss initialization and how to estimate \sigma_{\nu}^{2} in Appendix E.
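As a reference point for the objective being minimized, the sketch below evaluates -f-g of Eq. 8 from per-interval periodograms. Here S_funcs is a hypothetical helper that returns the closed-form component spectrum S_j^{(m)}(\omega_n) for a given power; it is not part of the released code.

```python
import numpy as np

def neg_log_posterior(log_s2, S_funcs, I, sigma2_nu, lam):
    """Sketch of -f - g in Eq. 8.

    log_s2 : (J, M) array of log sigma_{j,m}^2
    S_funcs: callable S_funcs(j, m, sigma2) -> length-N array S_j^{(m)}(omega_n)
    I      : (M, N) array of per-interval periodograms I^{(m)}(omega_n)
    """
    J, M = log_s2.shape
    nll = 0.0
    for m in range(M):
        gamma_m = sum(S_funcs(j, m, np.exp(log_s2[j, m])) for j in range(J)) + sigma2_nu
        nll += 0.5 * np.sum(np.log(gamma_m) + I[m] / gamma_m)   # -f, Whittle term (Eq. 6)
    diff = np.diff(log_s2, axis=1)
    nll += 0.5 * lam * np.sum(diff ** 2)                        # -g, smoothness prior (Eq. 7)
    return nll
```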

4.1.1 Optimization of \{\sigma_{j,m}^{2}\}_{j,m}

We introduce an algorithm to compute a locally optimal solution of \{\sigma_{j,m}^{2}\}_{j,m} for the nonconvex optimization problem in Eq. 8, by leveraging the regularized temporal structure of \{\sigma_{j,m}^{2}\}_{j,m}. It extends the inexact accelerated proximal gradient (APG) method [Li and Lin, 2015] by solving the proximal step with a Kalman filter/smoother [Kalman, 1960]. This follows since computing the proximal operator for g(\{\sigma_{j,m}^{2}\}_{j,m};\theta) is equivalent to MAP estimation for J independent 1-dimensional linear Gaussian state-space models

\begin{split}&\{\log\sigma^{(l+1),2}_{j,m}\}_{j,m}=\operatorname{prox}_{-\alpha^{(l)}g}(\mathbf{v}^{(l)})\\ &=\operatorname*{arg\,min}_{\{\sigma_{j,m}^{2}\}_{j,m}}\underbrace{\frac{\sum_{j,m}^{J,M}(\mathbf{v}_{j,m}^{(l)}-\log\sigma^{2}_{j,m})^{2}}{2\alpha^{(l)}}-g(\{\sigma_{j,m}^{2}\}_{j,m})}_{\sum_{j=1}^{J}q_{j}}\\ \end{split} (9)

where \alpha^{(l)}>0 is a step-size for the l^{\text{th}} iteration, q_{j}=\sum_{m=1}^{M}\frac{(\mathbf{v}^{(l)}_{j,m}-\log\sigma^{2}_{j,m})^{2}}{2\alpha^{(l)}}+\frac{\lambda}{2}(\log\sigma^{2}_{j,m}-\log\sigma^{2}_{j,m-1})^{2}, and \mathbf{v}^{(l)}_{j,m}=\log\sigma_{j,m}^{(l),2}+\alpha^{(l)}\frac{\partial f(\{\sigma_{j,m}^{2}\}_{j,m})}{\partial\log\sigma^{2}_{j,m}}\Big|_{\{\sigma^{(l),2}_{j,m}\}_{j,m}}.

The j^{\text{th}} optimization problem, \min_{\{\sigma_{j,m}^{2}\}_{m}}q_{j}, is equivalent to estimating the mean of the posterior for \{\log\sigma^{2}_{j,m}\}_{m} in a linear Gaussian state-space model with observations \{\mathbf{v}^{(l)}_{j,m}\}_{m}, observation noise variance \alpha^{(l)}, and state variance 1/\lambda. Therefore, the solution can be computed efficiently with J 1-dimensional Kalman filters/smoothers, with a computational complexity of O(JM).

Note that Eq. 9 holds for all non-negative \lambda. If \lambda=0, the proximal operator is an identity operator, as \log\sigma^{(l+1),2}_{j,m}=\mathbf{v}^{(l)}_{j,m}. This is gradient descent with a step-size rule. If \lambda\rightarrow\infty, we have \log\sigma^{2}_{j,m}=\log\sigma^{2}_{j,m-1}, \forall m, which leads to \log\sigma_{j,m}^{(l+1),2}=(1/M)\sum_{m=1}^{M}\mathbf{v}^{(l)}_{j,m}. The algorithm is guaranteed to converge when \alpha^{(l)}<1/C, where C is the Lipschitz constant of \nabla f(\{\sigma_{j,m}^{2}\}_{j,m};\theta). In practice, we select \alpha^{(l)} according to the step-size rule of Barzilai and Borwein [1988]. In Appendix F & G, we present the full algorithm for optimizing \{\sigma_{j,m}^{2}\}_{j,m} and a derivation for C.
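The proximal step of Eq. 9 reduces, for each j, to smoothing in a 1-dimensional random-walk state-space model. A minimal sketch is given below, assuming a diffuse prior on the first window; it illustrates the Kalman filter/RTS smoother structure rather than reproducing the released implementation.

```python
import numpy as np

def prox_1d_kalman(v, alpha, lam):
    """Sketch of prox_{-alpha g} for one component j (Eq. 9).

    Solves argmin_x sum_m (v_m - x_m)^2 / (2 alpha) + (lam/2) sum_m (x_m - x_{m-1})^2
    via a 1-D random-walk Kalman filter/RTS smoother: observation variance alpha,
    state variance 1/lam; x_m plays the role of log sigma_{j,m}^2.
    """
    M = len(v)
    q, r = 1.0 / lam, alpha
    x_f, P_f = np.zeros(M), np.zeros(M)
    x_pred, P_pred = v[0], 1e6                      # diffuse prior on the first window (assumption)
    for m in range(M):
        if m > 0:
            x_pred, P_pred = x_f[m - 1], P_f[m - 1] + q
        K = P_pred / (P_pred + r)                   # Kalman gain
        x_f[m] = x_pred + K * (v[m] - x_pred)
        P_f[m] = (1.0 - K) * P_pred
    x_s = x_f.copy()
    for m in range(M - 2, -1, -1):                  # RTS backward pass
        G = P_f[m] / (P_f[m] + q)
        x_s[m] = x_f[m] + G * (x_s[m + 1] - x_f[m])
    return x_s
```

Running this smoother once per j gives the O(JM) cost of the proximal step noted above.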

4.2 Inference for p(\left\{\mathbf{z}_{j}\right\}_{j}\mid\{\sigma_{j,m}^{2}\}_{j,m},\mathbf{y},\theta)

We perform inference on the posterior distribution p(\left\{\mathbf{z}_{j}\right\}_{j}\mid\{\widehat{\sigma}_{j,m}^{2}\}_{j,m},\mathbf{y},\hat{\theta}). Since this is a Gaussian distribution, the mean trajectories \{\widehat{\mathbf{z}}_{j}\}_{j} and the credible intervals can be computed analytically. Moreover, since Eq. 3 is a linear Gaussian state-space model, we can use the Kalman filter/smoother for efficient computation with computational complexity O(J^{2}K), further discussed in Appendix H. Since we use the point estimate \{\widehat{\sigma}^{2}_{j,m}\}_{j,m}, the credible interval for \{\widehat{\mathbf{z}}_{j}\}_{j} will be narrower compared to the full Bayesian setting, which accounts for all values of \{\sigma_{j,m}^{2}\}_{j,m}.

4.2.1 Monte Carlo Inference

We can also obtain posterior samples and perform Monte Carlo (MC) inference on any posterior-derived quantity. To generate the MC trajectory samples, we use the forward-filter backward-sampling (FFBS) algorithm [Carter and Kohn, 1994]. Assuming S MC samples, the computational complexity of FFBS is O(SJ^{2}K), since for each sample the algorithm uses a Kalman filter/smoother. This is different from generating samples with the interval-specific posterior in Eq. 4. In the latter case, the FFBS algorithm is run M times, and the resulting samples have to be concatenated to form an entire trajectory. With PLSO, the trajectory sample is conditioned on the entire observation and is continuous across the intervals.

One quantity of interest is the phase. We obtain the phase as \phi_{j,k}=\tan^{-1}(\mathbf{z}^{\Im}_{j,k}/\mathbf{z}^{\Re}_{j,k}). Since \tan^{-1}(\cdot) is a non-linear operation, we compute the mean and credible interval with MC samples through the FFBS algorithm. Given the posterior-sampled trajectories \{\mathbf{z}_{j}^{(s)}\}_{j,s}, where s\in\{1,\ldots,S\} denotes the MC sample index, we estimate \widehat{\phi}_{j,k}=(1/S)\sum_{s=1}^{S}\tan^{-1}(\mathbf{z}^{\Im,(s)}_{j,k}/\mathbf{z}^{\Re,(s)}_{j,k}), and use empirical quantiles for the associated credible interval.
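A sketch of the MC phase estimate from FFBS sample trajectories is shown below; it follows the sample-averaging formula above and uses empirical quantiles for the credible interval. Note that naive averaging ignores wrap-around near \pm\pi, so circular statistics may be preferable in practice.

```python
import numpy as np

def phase_from_samples(z_samples):
    """Sketch of MC phase estimation from FFBS sample trajectories.

    z_samples : (S, K, 2) array of posterior samples [Re z_{j,k}, Im z_{j,k}]
    Returns the sample-averaged phase and a pointwise 95% credible interval.
    """
    phases = np.arctan2(z_samples[..., 1], z_samples[..., 0])   # (S, K) phase per sample
    phase_mean = phases.mean(axis=0)
    ci_low, ci_high = np.quantile(phases, [0.025, 0.975], axis=0)
    return phase_mean, (ci_low, ci_high)
```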

4.3 Choice of J and \lambda

We choose J that minimizes the Akaike Information Criterion (AIC) [Akaike, 1981], defined as

\operatorname{AIC}(J)=-(2/M)\cdot\log p(\mathbf{y}\mid\{\widehat{\sigma}_{j,m}^{2}\}_{j,m},\hat{\theta})+2\cdot 3\cdot J, (10)

where 3\cdot J corresponds to the number of parameters (\{l_{j}\}_{j},\{\omega_{j}\}_{j},\{\sigma^{2}_{j,m}\}_{j}). The regularization parameter \lambda is determined through a two-fold cross-validation, where each fold is generated by aggregating even/odd sample indices [Ba et al., 2014].
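The AIC-based choice of J can be organized as in the sketch below, where fit_plso and log_marginal_lik are hypothetical stand-ins for Stage 1 of the inference procedure and the Whittle marginal likelihood.

```python
def select_J(candidate_Js, fit_plso, log_marginal_lik, M):
    """Sketch of AIC-based selection of J (Eq. 10)."""
    aic = {}
    for J in candidate_Js:
        sigma2_hat, theta_hat = fit_plso(J)              # Stage 1 estimates (hypothetical routine)
        aic[J] = -(2.0 / M) * log_marginal_lik(sigma2_hat, theta_hat) + 2 * 3 * J
    return min(aic, key=aic.get), aic
```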

4.4 Choice of window length N

The choice of window length N presents a tradeoff between 1) spectral resolution and 2) the temporal resolution of the spectral dynamics [Oppenheim et al., 2009]. For a shorter window, the estimated spectral dynamics have a finer temporal resolution, coarser spectral resolution, and higher variance. For a longer window, these trends are reversed. This suggests that the choice is application-dependent. For electrophysiology data, a window on the order of seconds is used, as scientific interpretations are made on the basis of fine spectral resolution (< 1 Hz). For audio signal processing [Gold et al., 2011], short windows (10\sim 100 ms) are used, since audio data are highly nonstationary and thus require fine temporal resolution for processing. A survey of window lengths used in different applications can be found in the supporting information of Kim et al. [2018].

5 Related works

We examine how PLSO relates to other nonstationary frameworks.

STFT/Regularized STFT The STFT uses a harmonic basis, whereas PLSO uses quasi-periodic components, which allows broader spectral content to be captured. Recent works regularize STFT coefficients with stochastic continuity across the windows to infer smooth spectral dynamics [Ba et al., 2014, Kim et al., 2018]. However, this regularization leads to discontinuities at window boundaries.

Piecewise stationary GP GP regression and parameter estimation are performed within each interval [Gramacy and Lee, 2008, Solin et al., 2018]. Consequently, the recovered trajectories are discontinuous. Also, the inversion of the covariance matrix leads to expensive inference. For example, the time-complexity of mean trajectory estimation is O(MN^{3})=O(N^{2}K), whereas the time-complexity for PLSO is O(J^{2}K). Considering that the typical sampling frequency for electrophysiology data is \sim 10^{3} Hz and windows are several seconds long, which leads to N\sim 10^{3}, PLSO is computationally more efficient. In Appendix I, we confirm this through an experiment.

Time-varying Autoregressive model (TVAR) The TVAR model [Kitagawa and Gersch, 1985] is given as

\mathbf{y}_{k}=\sum_{p=1}^{P}a_{p,k}\mathbf{y}_{k-p}+\varepsilon_{k},

with the time-varying coefficients \{a_{p,k}\}_{p}. Consequently, it does not suffer from discontinuity issues. TVAR can also be approximately decomposed into oscillatory components via eigen-decomposition of the transition matrix [West et al., 1999]. However, since the eigen-decomposition changes at every sample, this leads to an ambiguous interpretation of the oscillations in the data, as we discuss in Section 6.

RNN frameworks Despite the popularity of recurrent neural networks (RNN) for time-series applications [Goodfellow et al., 2016], we believe PLSO is more appropriate for time-frequency analysis for two reasons.

  1. RNNs operate in the time-domain with the goal of prediction/denoising and consequently less emphasis is placed on local stationarity or estimation of second-order statistics. Performing time-frequency analysis requires segmenting the RNN outputs and applying the STFT, which yields noisy spectral estimates.

  2. RNNs are not generative frameworks. Although a variational framework can be combined with an RNN [Chung et al., 2015, Krishnan et al., 2017], the use of a variational lower bound objective could lead to suboptimal results. On the other hand, PLSO is a generative framework that maximizes the true log-posterior.

6 Experiments

We apply PLSO to three settings: 1) a simulated dataset, 2) local field potential (LFP) data from the rat hippocampus, and 3) EEG data from a human subject under anesthesia.

We use PLSO with \lambda=0, \lambda\rightarrow\infty, and \lambda determined by cross-validation, \lambda_{\operatorname{CV}}. We use interval lengths chosen by domain experts. As baselines, we use 1) regularized STFT (STFT-reg.) and 2) piecewise stationary GP (GP-PS). For GP-PS, we use the same \{\widehat{\sigma}_{j,m}^{2}\}_{j,m} and \widehat{\theta} as PLSO with \lambda=0, so that the estimated PSDs of GP-PS and PLSO are identical. This lets us explain differences in time-domain estimates by the fact that PLSO operates in the time-domain.

6.1 Simulated dataset

We simulate from the following model for 1\leq k\leq K

\mathbf{y}_{k}=10\left(\frac{K-k}{K}\right)\mathbf{z}^{\Re}_{1,k}+10\cos^{4}(2\pi\omega_{0}k)\mathbf{z}^{\Re}_{2,k}+\nu_{k},

where \mathbf{z}_{1,k} and \mathbf{z}_{2,k} are as in Eq. 2, with (\omega_{0},\omega_{1},\omega_{2})=(0.04,1,10) Hz, f_{s}=200 Hz, T=100 seconds, l_{1}=l_{2}=1, and \nu_{k}\sim\mathcal{N}(0,25). This nonstationary process comprises two amplitude-modulated oscillations, namely one modulated by a slow-frequency sinusoid and the other by a linearly-decreasing signal [Ba et al., 2014]. We simulate 20 realizations and train on each realization, assuming 2-second PS intervals. For PLSO, we use J=2. Additional details are provided in Appendix J.

Results We use two metrics: 1) mean squared error (MSE) between the mean estimate \hat{\mathbf{z}}_{j} and the ground truth \mathbf{z}_{j}^{\text{True}} and 2) \operatorname{jump}(\mathbf{z}_{j}), defined as \operatorname{jump}(\mathbf{z}_{j})=\frac{1}{M-1}\sum_{m=1}^{M-1}\left|\hat{\mathbf{z}}_{j,mN+1}-\hat{\mathbf{z}}_{j,mN}\right|, which measures the level of discontinuity at the interval boundaries. If \operatorname{jump}(\mathbf{z}_{j}) greatly exceeds \operatorname{jump}(\mathbf{z}_{j}^{\text{True}}), this implies the existence of large discontinuities at the boundaries. The averaged results are shown in Table 1.
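For reference, the jump metric can be computed as in the sketch below (0-based indexing; the function name is ours, not from the PLSO code).

```python
import numpy as np

def jump_metric(z_hat, N):
    """Sketch of the boundary-discontinuity metric jump(z_j).

    z_hat : length-K estimated trajectory (real part), with K = M * N
    N     : PS interval length in samples
    """
    K = len(z_hat)
    M = K // N
    # |difference across each interior interval boundary|, averaged over m = 1..M-1
    idx = np.arange(1, M) * N
    return np.mean(np.abs(z_hat[idx] - z_hat[idx - 1]))
```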

Figure 3: Spectrograms for simulation (in dB). (a) True data (b) True spectrogram (c) regularized STFT (d) PLSO with \lambda\rightarrow\infty (e) PLSO with \lambda=0 (f) PLSO with \lambda=\lambda_{\operatorname{CV}}.
Figure 4: Result of analyses of hippocampal data. (a) Theta phase distribution of population neuron spikes, computed with bandpass-filtered LFP (black) and the PLSO estimate of \hat{\mathbf{z}}_{2} with credible intervals estimated from 200 posterior samples (red). Horizontal gray line indicates the uniform distribution. (b-c) Spectrogram (in dB) for 500 seconds: (b) STFT (c) PLSO with \lambda_{\operatorname{CV}}. Learned frequencies are (\widehat{\omega}_{1},\widehat{\omega}_{2},\widehat{\omega}_{3})=(2.99,7.62,15.92) Hz, with \widehat{\omega}_{4}\sim\widehat{\omega}_{5}>25 Hz. (d-e) Time-domain results: (d) reconstructed signal (e) phase for \widehat{\mathbf{z}}_{2} and interval boundary (vertical gray), with bandpass-filtered data (dotted black), STFT-reg. (blue), and PLSO (red). Shaded area represents the 95% credible interval from S=200 sample trajectories.
Table 1: Simulation results. For \operatorname{jump}(\mathbf{z}_{j}) and MSE, left/right metrics correspond to \mathbf{z}_{1}/\mathbf{z}_{2}, respectively.

                            jump(z_j)      MSE          IS div.
  Truth                     0.95/12.11     0/0          0
  \lambda=0                 0.26/10.15     2.90/3.92    4.08
  \lambda\rightarrow\infty  0.22/10.32     3.26/4.53    13.78
  \lambda=\lambda_{CV}      0.25/10.21     2.88/3.91    3.93
  STFT-reg.                 49.59/81.00    6.89/10.68   N/A
  GP-PS                     16.99/23.28    3.00/4.04    4.08

Fig. 3 shows the true data in the time domain and spectrogram results. Fig. 3(c) shows that although the regularized STFT detects activities around 1 and 10 Hz, it fails to delineate the time-varying spectral pattern. Fig. 3(d) shows that PLSO with the stationarity (\lambda\rightarrow\infty) assumption is too restrictive. Fig. 3(e), (f) show that both PLSO with the independent window assumption (\lambda=0) and PLSO with cross-validated \lambda (\lambda=\lambda_{\operatorname{CV}}) are able to capture the dynamic pattern, with the latter being more effective in recovering the smooth dynamics across different PS intervals.

For GP-PS and STFT-reg., \operatorname{jump}(\mathbf{z}_{j}) exceeds \operatorname{jump}(\mathbf{z}_{j}^{\text{True}}), indicating discontinuities at the boundaries. An example is in Fig. 1. PLSO produces a similar jump metric as the ground truth metric, indicating the absence of discontinuities. We attribute the lower value to Kalman smoothing. For the TF domain, we use the Itakura-Saito (IS) divergence [Itakura and Saito, 1970] as a distance measure between the ground truth spectra and the PLSO estimates. That the highest divergence is given by \lambda\rightarrow\infty indicates the inaccuracy of the stationarity assumption.

6.2 LFP data from the rat hippocampus

We use LFP data collected from the rat hippocampus during open field tasks [Mizuseki et al., 2009], with T=1{,}600 seconds and f_{s}=1{,}250 Hz (we use channel 1 of mouse ec013.528 for the LFP; the population spikes were simultaneously recorded). The theta neural oscillation band (5\sim 10 Hz) is believed to play a role in coordinating the firing of neurons in the entorhinal-hippocampal system and is important for understanding the local circuit computation.

We fit PLSO with J=5, which minimizes the AIC as shown in Table 2, with 2-second PS intervals. The estimated \widehat{\omega}_{2} is 7.62 Hz, in the theta band. To obtain the phase for non-PLSO methods, we perform the Hilbert transform on the theta-band reconstructed signal. With no ground truth, we bandpass-filter (BPF) the data in the theta band for reference.

Table 2: AIC as a function of J for the hippocampus data

  J     1      2      3      4      5      6
  AIC   2882   2593   2566   2522   2503   2505

Spike-phase coupling Fig. 4(a) shows the theta phase distribution of population neuron spikes in the hippocampus. The PLSO-estimated distribution (red) confirms the original results analyzed with the bandpass-filtered signal (black) [Mizuseki et al., 2009]: the hippocampal spikes show a strong preference for a specific phase of the theta band, \pi for this dataset. Since PLSO provides posterior sample-trajectories for the entire time-series, we can compute as many realizations of the phase distribution as the number of MC samples. The resulting credible interval excludes the uniform distribution (horizontal gray), suggesting that the strong phase preference is statistically significant.

Denoised spectrogram Fig. 4(b-c) shows the estimated spectrogram. We observe that PLSO denoises the spectrogram, while retaining sustained power at \widehat{\omega}_{2}=7.62 Hz and weaker bursts at (\widehat{\omega}_{1},\widehat{\omega}_{3})=(2.99,15.92) Hz.

Time domain discontinuity Fig. 4(d-e) shows a segment of the estimated signal and phase near a boundary for \widehat{\omega}_{2}. While the estimates from STFT-reg. (blue) and PLSO (red) follow the BPF result closely, the STFT-reg. estimates exhibit discontinuity/distortion near the boundary. In Fig. 4(e), the phase jump at the boundary is 38.4 degrees. We also computed \operatorname{jump}(\phi_{2}) in degrees/sample. Considering that the theta band progresses roughly 2.16\,(=7.5\,\text{Hz}\times 360/1250\,\text{Hz}) degrees/sample, we observe that BPF (2.23), as expected, and PLSO (\lambda_{\operatorname{CV}}: 2.40, \lambda\rightarrow\infty: 2.66) are not affected by the boundary effect. This is not the case for STFT-reg. (26.83) and GP-PS (25.91).

Figure 5: Hippocampus data. (a) Time-varying \widehat{\omega}_{1} for TVAR. (b-d) Inferred mean trajectory (red) for (b) TVAR j=1, (c) PLSO j=1, and (d) PLSO j=2, with raw data (black).

Comparison with TVAR Fig. 5(a-b) shows a segment of the TVAR inference results (details of the TVAR model are in Appendix K). Specifically, Fig. 5(a) and (b) show the time-varying center frequency \widehat{\omega}_{1} and the corresponding reconstruction for the lowest-frequency component. Note that the eigenvalues, which correspond to \{\omega_{j}\}_{j}, and the eigenvectors, which are used for the oscillatory decomposition, are derived from the estimated TVAR transition matrix. Consequently, we cannot explicitly control \{\omega_{j}\}_{j}, as shown in Fig. 5(a), the bandwidth of each component, or the number of components J. This is further complicated by the fact that the transition matrix changes at every sample. Fig. 5(b) shows that this ambiguity results in the lowest-frequency component of TVAR explaining both the slow (0.1\sim 2 Hz) and theta components. With PLSO, on the contrary, we can explicitly specify or learn the parameters. Fig. 5(c-d) demonstrates that PLSO is able to delineate the slow/theta components without any discontinuity.

6.3 EEG data from the human brain under propofol anesthesia

We apply PLSO to EEG data from a subject anesthetized with the anesthetic drug propofol, to assess whether PLSO can leverage regularization to recover the smooth spectral dynamics widely observed during propofol-induced unconsciousness [Purdon et al., 2013]. (The EEG recording is part of de-identified data collected from patients at Massachusetts General Hospital (MGH) as a part of an MGH Human Research Committee-approved protocol.) The data last T=2{,}300 seconds, sampled at f_{s}=250 Hz. The drug infusion starts at t=0 and the subject loses consciousness around t=260 seconds. We use J=6 and assume a 4-second PS interval.

Figure 6: Spectrogram (in dB) under propofol anesthesia. PLSO with (a) \lambda=0 (b) \lambda=\lambda_{\operatorname{CV}} (c) \lambda\rightarrow\infty.

Smooth spectrogram Fig. 6(a-b) shows a segment of the PLSO-estimated spectrogram with \lambda=0 and \lambda=\lambda_{\operatorname{CV}}. They identify strong slow (0.1\sim 2 Hz) and alpha oscillations (8\sim 15 Hz), both well-known signatures of propofol-induced unconsciousness. We also observe that the alpha band power diminishes between 1{,}200 and 1{,}350 seconds, suggesting that the subject regained consciousness before becoming unconscious again. PLSO with \lambda=0 exhibits PSD fluctuation across windows, since \{\sigma_{j,m}^{2}\}_{j,m} are estimated independently. The stationary PLSO (\lambda\rightarrow\infty) is restrictive and fails to capture the spectral dynamics (Fig. 6(c)). In contrast, PLSO with \lambda_{\operatorname{CV}} exhibits smooth dynamics by pooling together estimates from the neighboring windows. The regularization also helps remove movement-related artifacts, shown as vertical lines in Fig. 6(a) around 700\sim 800 and 1{,}200 seconds, and spurious power in the 20\sim 25 Hz band. In summary, PLSO with regularization enables smooth spectral dynamics estimation and spurious noise removal.

7 Conclusion

We presented the Piecewise Locally Stationary Oscillation (PLSO) framework to model nonstationary time-series data with slowly time-varying spectra as the superposition of piecewise stationary (PS) oscillatory components. PLSO strikes a balance between stochastic continuity of the data across PS intervals and stationarity within each interval. For inference, we introduced an algorithm that combines Kalman theory and nonconvex optimization. Applications to simulated and real data show that PLSO preserves time-domain continuity and captures time-varying spectra. Future directions include 1) the automatic identification of PS intervals and 2) the extension to higher-order autoregressive models and to diverse priors on the parameters that enforce continuity across intervals.

Acknowledgements.
We thank Abiy Tasissa for helpful discussions on the algorithm and the derivations. This research is supported by NIH grant P01 GM118629 and support from the JPB Foundation. A.H.S. acknowledges support from the Samsung Scholarship.

References

  • Adak [1998] S. Adak. Time-dependent spectral analysis of nonstationary time series. Journal of the American Statistical Association, 93(444):1488–1501, 1998.
  • Akaike [1981] H. Akaike. Likelihood of a model and information criteria. Journal of Econometrics, 16(1):3–14, 1981.
  • Ba et al. [2014] D. Ba, B. Babadi, P. L. Purdon, and E. N. Brown. Robust spectrotemporal decomposition by iteratively reweighted least squares. Proceedings of the National Academy of Sciences, 111(50):E5336–E5345, 2014.
  • Barzilai and Borwein [1988] J. Barzilai and J. M. Borwein. Two-Point Step Size Gradient Methods. IMA Journal of Numerical Analysis, 8(1):141–148, 01 1988.
  • Beck et al. [2018] A. M. Beck, E. P. Stephen, and P. L. Purdon. State space oscillator models for neural data analysis. In 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pages 4740–4743, 2018.
  • Brown et al. [2004] E. N. Brown, V. Solo, Y. Choe, and Z. Zhang. Measuring period of human biological clock: infill asymptotic analysis of harmonic regression parameter estimates. Methods in enzymology, 383:382–405, 2004.
  • Carter and Kohn [1994] C. K. Carter and R. Kohn. On gibbs sampling for state space models. Biometrika, 81(3):541–553, 1994.
  • Casella [1985] G. Casella. An introduction to empirical bayes data analysis. The American Statistician, 39(2):83–87, 1985.
  • Cemgil and Godsill [2005] A. T. Cemgil and S. J. Godsill. Probabilistic phase vocoder and its application to interpolation of missing values in audio signals. In 2005 13th European Signal Processing Conference, pages 1–4, 2005.
  • Chung et al. [2015] J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio. A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.
  • Dahlhaus [1997] R. Dahlhaus. Fitting time series models to nonstationary processes. Ann. Statist., 25(1):1–37, 02 1997.
  • Das and Babadi [2018] P. Das and B. Babadi. Dynamic bayesian multitaper spectral analysis. IEEE Transactions on Signal Processing, 66(6):1394–1409, 2018.
  • Daubechies et al. [2011] I. Daubechies, J. Lu, and H. Wu. Synchrosqueezed wavelet transforms: An empirical mode decomposition-like tool. Applied and Computational Harmonic Analysis, 30(2):243 – 261, 2011.
  • Gold et al. [2011] B. Gold, N. Morgan, and D. Ellis. Speech and Audio Signal Processing: Processing and Perception of Speech and Music. Wiley-Interscience, USA, 2nd edition, 2011. ISBN 0470195363.
  • Goodfellow et al. [2016] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
  • Gramacy and Lee [2008] R. B. Gramacy and H. K. H. Lee. Bayesian treed gaussian process models with an application to computer modeling. Journal of the American Statistical Association, 103(483):1119–1130, 2008.
  • Griffin and J. Lim [1984] D. Griffin and J. Lim. Signal estimation from modified short-time fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243, 1984.
  • Huang et al. [1998] N. E. Huang, Z. Shen, S. R. Long, M. C. Wu, H. H. Shih, Q. Zheng, N. Yen, C. C. Tung, and H. H. Liu. The empirical mode decomposition and the hilbert spectrum for nonlinear and non-stationary time series analysis. Proceedings of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences, 454(1971):903–995, 1998.
  • Itakura and Saito [1970] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. 1970.
  • Kalman [1960] R.E. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 1960.
  • Kim et al. [2018] S. Kim, M. K. Behr, D. Ba, and E. N. Brown. State-space multitaper time-frequency analysis. Proceedings of the National Academy of Sciences, 115(1):E5–E14, 2018.
  • Kitagawa and Gersch [1985] G. Kitagawa and W. Gersch. A smoothness priors time-varying ar coefficient modeling of nonstationary covariance time series. IEEE Transactions on Automatic Control, 30(1):48–56, 1985.
  • Koopman and Lee [2009] S. J. Koopman and K. M. Lee. Seasonality with trend and cycle interactions in unobserved components models. Journal of the Royal Statistical Society: Series C (Applied Statistics), 58(4):427–448, 2009.
  • Krishnan et al. [2017] R. G. Krishnan, U. Shalit, and D. Sontag. Structured inference networks for nonlinear state space models. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI’17, page 2101–2109. AAAI Press, 2017.
  • Li and Lin [2015] H. Li and Z. Lin. Accelerated proximal gradient methods for nonconvex programming. In Advances in Neural Information Processing Systems 28, pages 379–387. Curran Associates, Inc., 2015.
  • Lindsten and Schön [2013] F. Lindsten and T. B. Schön. Backward Simulation Methods for Monte Carlo Statistical Inference. 2013.
  • Matsuda and Komaki [2017] T. Matsuda and F. Komaki. Time series decomposition into oscillation components and phase estimation. Neural Computation, 29(2):332–367, 2017.
  • Mizuseki et al. [2009] K. Mizuseki, A. Sirota, E. Pastalkova, and G. Buzsaki. Theta oscillations provide temporal windows for local circuit computation in the entorhinal-hippocampal loop. Neuron, 64:267–280, 2009.
  • Oppenheim et al. [2009] A. V. Oppenheim, R. W. Schafer, and J. R. Buck. Discrete-time Signal Processing (3rd Ed.). Prentice-Hall, Inc., 2009.
  • Priestley [1965] M. B. Priestley. Evolutionary spectra and non-stationary processes. Journal of the Royal Statistical Society. Series B (Methodological), 27(2):204–237, 1965.
  • Purdon et al. [2013] P. L. Purdon, E. T. Pierce, E. A. Mukamel, M. J. Prerau, J. L. Walsh, K. F. K. Wong, A. F. Salazar-Gomez, P. G. Harrell, A. L. Sampson, A. Cimenser, S. Ching, N. J. Kopell, C. Tavares-Stoeckel, K. Habeeb, R. Merhar, and E. N. Brown. Electroencephalogram signatures of loss and recovery of consciousness from propofol. Proceedings of the National Academy of Sciences, 110(12):E1142–E1151, 2013.
  • Qi et al. [2002] Y. Qi, T. P. Minka, and R. W. Picard. Bayesian spectrum estimation of unevenly sampled nonstationary data. In 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages II–1473–II–1476, 2002.
  • Rasmussen and Williams [2005] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.
  • Rosen et al. [2009] O. Rosen, D. S. Stoffer, and S. Wood. Local spectral analysis via a bayesian mixture of smoothing splines. Journal of the American Statistical Association, 104(485):249–262, 2009.
  • Solin and Särkkä [2014] A. Solin and S. Särkkä. Explicit Link Between Periodic Covariance Functions and State Space Models. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, volume 33 of Proceedings of Machine Learning Research, pages 904–912. PMLR, 2014.
  • Solin et al. [2018] A. Solin, J. Hensman, and R. E Turner. Infinite-horizon gaussian processes. In Advances in Neural Information Processing Systems 31, pages 3486–3495. 2018.
  • Song et al. [2018] A. H. Song, S. Chakravarty, and E. N. Brown. A smoother state space multitaper spectrogram. In 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pages 33–36, July 2018.
  • Soulat et al. [2019] H. Soulat, E. P. Stephen, A. M. Beck, and P. L. Purdon. State space methods for phase amplitude coupling analysis. bioRxiv, 2019.
  • Turner and Sahani [2014] R. E. Turner and M. Sahani. Time-frequency analysis as probabilistic inference. IEEE Transactions on Signal Processing, 62(23):6171–6183, 2014.
  • West et al. [1999] M. West, R. Prado, and A. D. Krystal. Evaluation and Comparison of EEG Traces: Latent Structure in Nonstationary Time Series. Journal of the American Statistical Association, 94(448):1083–1095, 1999.
  • Whittle [1953] P. Whittle. Estimation and information in stationary time series. Ark. Mat., 2(5):423–434, 1953.
  • Wilkinson et al. [2019] W. J. Wilkinson, M. Riis Andersen, J. D. Reiss, D. Stowell, and A. Solin. Unifying probabilistic models for time-frequency analysis. In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3352–3356, May 2019.
  • Wilson and Adams [2013] A. G. Wilson and R. P. Adams. Gaussian process kernels for pattern discovery and extrapolation. In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, ICML’13, page III–1067–III–1075, 2013.
  • Wilson et al. [2008] K. W. Wilson, B. Raj, P. Smaragdis, and A. Divakaran. Speech denoising using nonnegative matrix factorization with priors. In 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 4029–4032, 2008.
  • Wright [2015] S. J. Wright. Coordinate descent algorithms. Mathematical Programming, 151:3–34, 2015.

Appendix

Section numbers referenced in the headings below refer to sections of the main text. The table of contents for the Appendix is as follows.

  • A. Continuous model interpretation of PLSO (Section 3)
  • B. PSD for complex AR(1) process
  • C. Proof for Proposition 1 (Section 3.2.1)
  • D. Proof for Proposition 2 (Section 3.2.2)
  • E. Initialization for PLSO (Section 4)
  • F. Optimization for \{\sigma_{j,m}^{2}\}_{j,m} via proximal gradient update (Section 4.1.1)
  • G. Lipschitz constant for \nabla f(\{\sigma_{j,m}^{2}\}_{j,m};\theta) (Section 4.1.1)
  • H. Inference with p(\left\{\mathbf{z}_{j}\right\}_{j}\mid\{\sigma_{j,m}^{2}\}_{j,m},\mathbf{y},\theta) (Section 4.2)
  • I. Computational efficiency of PLSO vs. GP-PS
  • J. Simulation experiment (Section 5.1)
  • K. Details of TVAR model (Section 5.2)
  • L. Anesthesia EEG dataset (Section 5.3)

A. Continuous model interpretation of PLSO

We can establish the equivalent continuous-time model of PLSO in Eq. 2, using the stochastic differential equation

dz~j(t)dt=((1lj)(0ωjωj0))𝐅z~j(t)+ε(t),\frac{d\tilde{z}_{j}(t)}{dt}=\underbrace{\left(\left(-\frac{1}{l_{j}}\right)\oplus\begin{pmatrix}0&-\omega_{j}\\ \omega_{j}&0\\ \end{pmatrix}\right)}_{\mathbf{F}}\tilde{z}_{j}(t)+\varepsilon(t), (11)

where z~j(t):2\tilde{z}_{j}(t):\mathbb{R}\rightarrow\mathbb{R}^{2}, \oplus denotes the Kronecker sum and ε(t)𝒩(0,σj2𝐈2×2)\varepsilon(t)\sim\mathcal{N}(0,\sigma_{j}^{2}\mathbf{I}_{2\times 2}). Discretizing the solution of Eq. 11 at Δ\Delta, such that 𝐳~j,k=z~j(kΔ)\widetilde{\mathbf{z}}_{j,k}=\tilde{z}_{j}(k\Delta), yields Eq. 2. Consequently, we obtain the following for Δ>0\Delta>0

exp(𝐅Δ)=exp(Δ/lj)𝐑(ωj),σj20Δexp(𝐅(Δτ))exp(𝐅(Δτ))Tdτ=σj2(1exp(2Δ/lj))𝐈2×2.\begin{split}&\exp\left(\mathbf{F}\Delta\right)=\exp(-\Delta/l_{j})\mathbf{R}(\omega_{j}),\\ &\sigma_{j}^{2}\int_{0}^{\Delta}\exp\left(\mathbf{F}(\Delta-\tau)\right)\exp\left(\mathbf{F}(\Delta-\tau)\right)^{\text{T}}d\tau\\ &\quad\quad=\sigma_{j}^{2}\left(1-\exp\left(-2\Delta/l_{j}\right)\right)\mathbf{I}_{2\times 2}.\\ \end{split} (12)

This interpretation extends to the nonstationary PLSO. The corresponding continuous model for 𝐳~j,mN+n\widetilde{\mathbf{z}}_{j,mN+n} in Eq. 3 is the same as Eq. 11, with different variance 𝔼[εj(t)εjT(t)]=m=1Mσj,m2𝟏((m1M)Tt<(mM)T)𝐈2×2.\mathbb{E}[\varepsilon_{j}(t)\varepsilon_{j}^{\text{T}}(t)]=\sum_{m=1}^{M}\sigma_{j,m}^{2}\cdot\mathbf{1}\left(\left(\frac{m-1}{M}\right)T\leq t<\left(\frac{m}{M}\right)T\right)\mathbf{I}_{2\times 2}.
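
As a sanity check, the continuous-to-discrete correspondence can be verified numerically. The sketch below (NumPy/SciPy) builds the drift matrix \mathbf{F} of Eq. 11, computes the one-step transition and process-noise covariance by matrix exponential and numerical integration, and confirms their structure (a decaying rotation and an isotropic covariance). All parameter values are illustrative placeholders, and the exact noise scaling depends on the convention adopted for \varepsilon(t).

import numpy as np
from scipy.linalg import expm
from scipy.integrate import quad

# Illustrative (not from the paper) length-scale, center frequency (rad/sample), and step size.
l_j, omega_j, delta = 2.0, 0.3, 0.05

# Drift matrix F of Eq. 11: a decay term (-1/l_j) plus a 2x2 rotation generator.
F = -np.eye(2) / l_j + omega_j * np.array([[0.0, -1.0], [1.0, 0.0]])

# One-step transition exp(F*delta): a scaled rotation, with decay exp(-delta/l_j).
A = expm(F * delta)
decay = np.sqrt(np.linalg.det(A))              # equals exp(-delta / l_j)
angle = np.arctan2(A[1, 0], A[0, 0])           # rotation angle accumulated over one step

# Process-noise covariance over one step: isotropic (proportional to the identity).
integrand = lambda s, i, j: (expm(F * s) @ expm(F * s).T)[i, j]
Q = np.array([[quad(integrand, 0.0, delta, args=(i, j))[0] for j in range(2)] for i in range(2)])

print("decay:", decay, "exp(-delta/l_j):", np.exp(-delta / l_j))
print("rotation angle per step:", angle)
print("Q:\n", Q)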

B. PSD for complex AR(1) process

We compute the steady-state covariance denoted as 𝐏j\mathbf{P}_{\infty}^{j}. Since we assume 𝐏1j=σj2𝐈2×2\mathbf{P}_{1}^{j}=\sigma_{j}^{2}\mathbf{I}_{2\times 2}, it is easy to show that 𝐏kj\mathbf{P}_{k}^{j} is a diagonal matrix from 𝐑(ωj)𝐑T(ωj)=𝐈2×2\mathbf{R}(\omega_{j})\mathbf{R}^{\text{T}}(\omega_{j})=\mathbf{I}_{2\times 2}. Denoting 𝐏j=α𝐈2×2\mathbf{P}_{\infty}^{j}=\alpha\mathbf{I}_{2\times 2}, we use the discrete Lyapunov equation

𝐏j=exp(2Δ/lj)𝐑(ωj)𝐏j𝐑T(ωj)+σj2(1exp(2Δ/lj))𝐈2×2α=exp(2Δ/lj)α+σj2(1exp(2Δ/lj))𝐏j=σj2𝐈2×2,\begin{split}&\mathbf{P}_{\infty}^{j}\\ &=\exp(-2\Delta/l_{j})\mathbf{R}(\omega_{j})\mathbf{P}_{\infty}^{j}\mathbf{R}^{\text{T}}(\omega_{j})\\ &\quad+\sigma_{j}^{2}\left(1-\exp\left(-2\Delta/l_{j}\right)\right)\mathbf{I}_{2\times 2}\\ &\Rightarrow\alpha=\exp(-2\Delta/l_{j})\alpha+\sigma_{j}^{2}\left(1-\exp\left(-2\Delta/l_{j}\right)\right)\\ &\Rightarrow\mathbf{P}_{\infty}^{j}=\sigma_{j}^{2}\mathbf{I}_{2\times 2},\\ \end{split} (13)

which implies that under the assumption 𝐏1j=σj2𝐈2×2\mathbf{P}_{1}^{j}=\sigma_{j}^{2}\mathbf{I}_{2\times 2}, we are guaranteed 𝐏kj=σj2𝐈2×2\mathbf{P}_{k}^{j}=\sigma_{j}^{2}\mathbf{I}_{2\times 2}, k\forall k. To compute the PSD of the stationary process 𝐳j\mathbf{z}_{j}, we now need to compute the autocovariance. Since only 𝐳j,k\mathbf{z}^{\Re}_{j,k} contributes to 𝐲k\mathbf{y}_{k}, we compute the autocovariance of 𝔼[𝐳j,k𝐳j,k+n]\mathbb{E}[\mathbf{z}^{\Re}_{j,k}\mathbf{z}^{\Re}_{j,k+n}] as

\begin{split}\mathbb{E}[\mathbf{z}^{\Re}_{j,k}\mathbf{z}^{\Re}_{j,k+n}]&=\mathbb{E}[\mathbf{z}^{\Re}_{j,k}\cdot\Re(\rho_{j}^{n}\exp(i\omega_{j}n)\mathbf{z}_{j,k})]\\ &=\rho_{j}^{n}\mathbb{E}[\mathbf{z}^{\Re}_{j,k}\mathbf{z}^{\Re}_{j,k}\cos\omega_{j}n]\\ &=\rho_{j}^{n}\cos\omega_{j}n\cdot\mathbb{E}[\{\mathbf{z}^{\Re}_{j,k}\}^{2}]\\ &=\rho_{j}^{n}\sigma_{j}^{2}\cos\omega_{j}n,\\ \end{split} (14)

where ()\Re(\cdot) denotes the operator that extracts the real part of the complex argument and we used the fact that 𝔼[𝐳j,k𝐳j,k]=0\mathbb{E}[\mathbf{z}^{\Re}_{j,k}\mathbf{z}^{\Im}_{j,k}]=0. The spectra for the jthj^{\text{th}} component, Sj(ω)S_{j}(\omega) can be written as

Sj(ω)=n=𝔼[𝐳j,k𝐳j,k+n]exp(iωn)=n=ρjnσj2coswjnexp(iωn)=σj2n=ρjn{exp(iωjn)+exp(iωjn)}exp(iωn)=σj2n=ρjnexp(i(ω±ωj)n).\begin{split}&S_{j}(\omega)=\sum_{n=-\infty}^{\infty}\mathbb{E}\left[\mathbf{z}^{\Re}_{j,k}\mathbf{z}^{\Re}_{j,k+n}\right]\exp\left(-i\omega n\right)\\ &=\sum_{n=-\infty}^{\infty}\rho_{j}^{n}\sigma_{j}^{2}\cos w_{j}n\exp\left(-i\omega n\right)\\ &=\sigma_{j}^{2}\sum_{n=-\infty}^{\infty}\rho_{j}^{n}\left\{\exp(i\omega_{j}n)+\exp(-i\omega_{j}n)\right\}\exp\left(-i\omega n\right)\\ &=\sigma_{j}^{2}\sum_{n=-\infty}^{\infty}\rho_{j}^{n}\exp(-i(\omega\pm\omega_{j})n).\\ \end{split} (15)

Unpacking the infinite sum for one of the terms,

\begin{split}&\sum_{n=-\infty}^{\infty}\rho_{j}^{|n|}\exp(-i(\omega-\omega_{j})n)\\ &=1+\sum_{n=1}^{\infty}\rho_{j}^{n}\exp(-i(\omega-\omega_{j})n)+\rho_{j}^{n}\exp(i(\omega-\omega_{j})n)\\ &=1+\frac{\rho_{j}\exp(-i(\omega-\omega_{j}))}{1-\rho_{j}\exp(-i(\omega-\omega_{j}))}+\frac{\rho_{j}\exp(i(\omega-\omega_{j}))}{1-\rho_{j}\exp(i(\omega-\omega_{j}))}\\ &=1+\frac{2\rho_{j}\cos(\omega-\omega_{j})-2\rho_{j}^{2}}{\left(1-\rho_{j}\exp(-i(\omega-\omega_{j}))\right)\left(1-\rho_{j}\exp(i(\omega-\omega_{j}))\right)}\\ &=\frac{1-\rho_{j}^{2}}{1+\rho_{j}^{2}-2\rho_{j}\cos(\omega-\omega_{j})}.\\ \end{split} (16)

Using the relation ρj=exp(Δ/lj)\rho_{j}=\exp(-\Delta/l_{j}) and unpacking the infinite sum for the other term, we have

\begin{split}S_{j}(\omega)&=\frac{\sigma_{j}^{2}(1-\exp(-2\Delta/l_{j}))}{1+\exp(-2\Delta/l_{j})-2\exp(-\Delta/l_{j})\cos(\omega-\omega_{j})}\\ &\quad+\frac{\sigma_{j}^{2}(1-\exp(-2\Delta/l_{j}))}{1+\exp(-2\Delta/l_{j})-2\exp(-\Delta/l_{j})\cos(\omega+\omega_{j})}.\\ \end{split}

Since the Fourier transform is a linear operator, we conclude that \gamma(\omega)=\sigma_{\nu}^{2}+\sum_{j=1}^{J}S_{j}(\omega).
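
For reference, the closed-form PSD above is straightforward to evaluate; the NumPy sketch below computes S_j(\omega) and the overall \gamma(\omega) on a frequency grid, with purely illustrative component parameters rather than values from the paper.

import numpy as np

def component_psd(omega, omega_j, l_j, sigma2_j, delta):
    """S_j(omega) as derived above: two peaks centered at +omega_j and -omega_j."""
    rho = np.exp(-delta / l_j)
    num = sigma2_j * (1.0 - rho ** 2)
    def peak(center):
        return num / (1.0 + rho ** 2 - 2.0 * rho * np.cos(omega - center))
    return peak(omega_j) + peak(-omega_j)

# Hypothetical two-component example plus observation noise sigma_nu^2.
omega = np.linspace(-np.pi, np.pi, 1024)
params = [(0.1, 5.0, 2.0), (0.8, 2.0, 1.0)]      # (omega_j, l_j, sigma2_j), illustrative
sigma2_nu, delta = 0.5, 1.0
gamma = sigma2_nu + sum(component_psd(omega, w, l, s, delta) for (w, l, s) in params)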

C. Proof for Proposition 1 (Section 3.2.1)

Proposition 1. For a given mm, as Δ0\Delta\rightarrow 0, the samples on either side of the interval boundary, which are 𝐳~j,(m+1)N\widetilde{\mathbf{z}}_{j,(m+1)N} and 𝐳~j,(m+1)N+1\widetilde{\mathbf{z}}_{j,(m+1)N+1}, converge to each other in mean square,

limΔ0𝔼[Δ𝐳~j,(m+1)NΔ𝐳~j,(m+1)NT]=0,\lim_{\Delta\rightarrow 0}\mathbb{E}[\Delta\widetilde{\mathbf{z}}_{j,(m+1)N}\Delta\widetilde{\mathbf{z}}_{j,(m+1)N}^{\text{T}}]=0,

where we use Δ𝐳~j,(m+1)N=𝐳~j,(m+1)N+1𝐳~j,(m+1)N\Delta\widetilde{\mathbf{z}}_{j,(m+1)N}=\widetilde{\mathbf{z}}_{j,(m+1)N+1}-\widetilde{\mathbf{z}}_{j,(m+1)N}.

Proof.

To analyze Eq. 3 in the limit of Δ0\Delta\rightarrow 0, we use the equivalent continuous model (Eq. 12). It suffices to show that limΔ0exp(𝐅Δ)=𝐈2×2\lim_{\Delta\rightarrow 0}\exp(\mathbf{F}\Delta)=\mathbf{I}_{2\times 2} and limΔ0𝔼[εj,(m+1)N+1εj,(m+1)N+1T]=𝟎\lim_{\Delta\rightarrow 0}\mathbb{E}[\varepsilon_{j,(m+1)N+1}\varepsilon_{j,(m+1)N+1}^{\text{T}}]=\mathbf{0}. We have,

limΔ0exp(𝐅Δ)=𝐈2×2+limΔ0k=1Δkk!𝐅k=𝐈2×2limΔ0𝔼[εj,(m+1)N+1εj,(m+1)N+1T]/σj,m+12=limΔ00Δexp(𝐅(Δτ))exp(𝐅(Δτ))Tdτ=𝟎.\begin{split}&\lim_{\Delta\rightarrow 0}\exp(\mathbf{F}\Delta)=\mathbf{I}_{2\times 2}+\lim_{\Delta\rightarrow 0}\sum_{k=1}^{\infty}\frac{\Delta^{k}}{k!}\mathbf{F}^{k}=\mathbf{I}_{2\times 2}\\ &\lim_{\Delta\rightarrow 0}\mathbb{E}[\varepsilon_{j,(m+1)N+1}\varepsilon_{j,(m+1)N+1}^{\text{T}}]/\sigma_{j,m+1}^{2}\\ &=\lim_{\Delta\rightarrow 0}\int_{0}^{\Delta}\exp\left(\mathbf{F}(\Delta-\tau)\right)\exp\left(\mathbf{F}(\Delta-\tau)\right)^{\text{T}}d\tau=\mathbf{0}.\\ \end{split}

Since this implies limΔ0𝔼[Δ𝐳~j,(m+1)NΔ𝐳~j,(m+1)NT]=0\lim_{\Delta\rightarrow 0}\mathbb{E}[\Delta\widetilde{\mathbf{z}}_{j,(m+1)N}\Delta\widetilde{\mathbf{z}}_{j,(m+1)N}^{\text{T}}]=0, we have convergence in mean square. ∎

D. Proof for Proposition 2 (Section 3.2.2)

Proposition 2. Assume ljNΔl_{j}\ll N\Delta, such that 𝐏m,Nj=𝐏m,j\mathbf{P}_{m,N}^{j}=\mathbf{P}_{m,\infty}^{j}. In Eq. 3, the difference between 𝐏m,j=σj,m2𝐈2×2\mathbf{P}_{m,\infty}^{j}=\sigma_{j,m}^{2}\mathbf{I}_{2\times 2} and 𝐏m+1,j=σj,m+12𝐈2×2\mathbf{P}_{m+1,\infty}^{j}=\sigma_{j,m+1}^{2}\mathbf{I}_{2\times 2} decays exponentially fast as a function of nn, for 1nN1\leq n\leq N,

𝐏m+1,nj=𝐏m+1,j+exp(2nΔ/lj)(𝐏m,j𝐏m+1,j).\begin{split}\mathbf{P}_{m+1,n}^{j}&=\mathbf{P}_{m+1,\infty}^{j}+\exp\left(-2n\Delta/l_{j}\right)(\mathbf{P}_{m,\infty}^{j}-\mathbf{P}_{m+1,\infty}^{j}).\\ \end{split}
Proof.

We first obtain the steady-state covariance 𝐏m,j\mathbf{P}_{m,\infty}^{j}, similar to Appendix B. Since we assume 𝐏1,1j=σj,12𝐈2×2\mathbf{P}_{1,1}^{j}=\sigma_{j,1}^{2}\mathbf{I}_{2\times 2}, we can show that m,n\forall m,n, 𝐏m,nj\mathbf{P}_{m,n}^{j} is a diagonal matrix, noting that 𝐑(ωj)𝐑T(ωj)=𝐈2×2\mathbf{R}(\omega_{j})\mathbf{R}^{\text{T}}(\omega_{j})=\mathbf{I}_{2\times 2}. Denoting 𝐏m,j=α𝐈2×2\mathbf{P}_{m,\infty}^{j}=\alpha\mathbf{I}_{2\times 2}, we now use the discrete Lyapunov equation

𝐏m,j=exp(2Δ/lj)𝐑(ωj)𝐏m,j𝐑T(ωj)+σj,m2(1exp(2Δ/lj))𝐈2×2α=exp(2Δ/lj)α+σj,m2(1exp(2Δ/lj))𝐏m,j=σj,m2𝐈2×2.\begin{split}&\mathbf{P}_{m,\infty}^{j}=\exp(-2\Delta/l_{j})\mathbf{R}(\omega_{j})\mathbf{P}_{m,\infty}^{j}\mathbf{R}^{\text{T}}(\omega_{j})\\ &\quad+\sigma_{j,m}^{2}\left(1-\exp\left(-2\Delta/l_{j}\right)\right)\mathbf{I}_{2\times 2}\\ &\Rightarrow\alpha=\exp(-2\Delta/l_{j})\alpha+\sigma_{j,m}^{2}\left(1-\exp\left(-2\Delta/l_{j}\right)\right)\\ &\Rightarrow\mathbf{P}_{m,\infty}^{j}=\sigma_{j,m}^{2}\mathbf{I}_{2\times 2}.\\ \end{split} (17)

We now prove the proposition by induction. For fixed jj and mm, and for n=1n=1,

𝐏m+1,1j=exp(2Δ/lj)𝐑(ωj)𝐏m,Nj𝐑T(ωj)+σj,m+12(1exp(2Δ/lj))𝐈2×2={σj,m+12+exp(2Δ/lj)(σj,m2σj,m+12)}𝐈2×2.\begin{split}&\mathbf{P}_{m+1,1}^{j}\\ &=\exp(-2\Delta/l_{j})\mathbf{R}(\omega_{j})\mathbf{P}_{m,N}^{j}\mathbf{R}^{\text{T}}(\omega_{j})\\ &\quad+\sigma_{j,m+1}^{2}\left(1-\exp\left(-2\Delta/l_{j}\right)\right)\mathbf{I}_{2\times 2}\\ &=\left\{\sigma_{j,m+1}^{2}+\exp\left(-2\Delta/l_{j}\right)\left(\sigma_{j,m}^{2}-\sigma_{j,m+1}^{2}\right)\right\}\mathbf{I}_{2\times 2}.\\ \end{split}

Assuming the same holds for n=n1n=n^{\prime}-1, we have for n=nn=n^{\prime},

𝐏m+1,nj=exp(2Δ/lj)𝐑(ωj)𝐏m,n1j𝐑T(ωj)+σj,m+12(1exp(2Δ/lj))𝐈2×2=exp(2Δ/lj)σj,m+12𝐈2×2+exp(2nΔ/lj)(σj,m2σj,m+12)𝐈2×2+σj,m+12(1exp(2Δ/lj))𝐈2×2={σj,m+12+exp(2nΔ/lj)(σj,m2σj,m+12)}𝐈2×2.\begin{split}&\mathbf{P}_{m+1,n^{\prime}}^{j}\\ &=\exp(-2\Delta/l_{j})\mathbf{R}(\omega_{j})\mathbf{P}_{m,n^{\prime}-1}^{j}\mathbf{R}^{\text{T}}(\omega_{j})\\ &\quad+\sigma_{j,m+1}^{2}\left(1-\exp\left(-2\Delta/l_{j}\right)\right)\mathbf{I}_{2\times 2}\\ &=\exp(-2\Delta/l_{j})\sigma_{j,m+1}^{2}\mathbf{I}_{2\times 2}\\ &\quad+\exp\left(-2n^{\prime}\Delta/l_{j}\right)\left(\sigma_{j,m}^{2}-\sigma_{j,m+1}^{2}\right)\mathbf{I}_{2\times 2}\\ &\quad+\sigma_{j,m+1}^{2}\left(1-\exp\left(-2\Delta/l_{j}\right)\right)\mathbf{I}_{2\times 2}\\ &=\left\{\sigma_{j,m+1}^{2}+\exp\left(-2n^{\prime}\Delta/l_{j}\right)\left(\sigma_{j,m}^{2}-\sigma_{j,m+1}^{2}\right)\right\}\mathbf{I}_{2\times 2}.\\ \end{split}

By the principle of induction, the expression in Proposition 2 holds for 1\leq n\leq N. ∎
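
The closed-form relaxation above can also be verified by iterating the diagonal Lyapunov recursion across a window boundary, as in the short NumPy sketch below (all parameter values are illustrative).

import numpy as np

l_j, delta, N = 3.0, 0.1, 50
sigma2_m, sigma2_m1 = 4.0, 1.0          # variances on either side of the boundary (illustrative)
rho2 = np.exp(-2.0 * delta / l_j)

p = sigma2_m                            # steady-state diagonal entry at the end of interval m
for n in range(1, N + 1):
    p = rho2 * p + sigma2_m1 * (1.0 - rho2)      # diagonal entry of the Lyapunov recursion
    closed_form = sigma2_m1 + np.exp(-2.0 * n * delta / l_j) * (sigma2_m - sigma2_m1)
    assert np.isclose(p, closed_form)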

E. Initialization & Estimation for PLSO (Section 4)

E.1 Initialization

As noted in the main text, we use the AIC to determine the optimal number of components J. For a given J, we first construct the spectrogram of the data using the STFT and identify the frequency bands with prominent power, i.e., frequency bands whose average power exceeds a pre-determined threshold. The center frequencies of these bands serve as the initial center frequencies \{\omega_{j}^{\text{init}}\}_{j}, which are either fixed throughout the algorithm or further refined through the estimation algorithm in the main text. If J exceeds the number of frequency bands identified from the spectrogram, 1) we first place \{\omega_{j}^{\text{init}}\}_{j} in the prominent frequency bands and 2) we then spread the remaining components uniformly over [0,\omega_{c}], where \omega_{c} is a cutoff frequency determined in the next section. As for \{l_{j}^{\text{init}}\}_{j}, we set each to a fixed fraction of the corresponding \omega_{j}^{\text{init}}. We then fit \{\sigma_{j,m}^{2}\}_{j,m} and \theta with \lambda=0, through the procedure explained in Stage 1, and use these estimates as initial values for other values of \lambda.
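
The sketch below (NumPy/SciPy) illustrates one possible implementation of this initialization heuristic; the window length, power threshold, the fraction defining l_j^init, and the cutoff \omega_c are placeholders rather than values from the paper.

import numpy as np
from scipy.signal import stft, find_peaks

def init_plso(y, fs, J, power_thresh_db, omega_c, frac=0.1):
    """Initial center frequencies from prominent spectrogram bands; l_j^init as a fraction of omega_j^init."""
    f, _, Z = stft(y, fs=fs, nperseg=int(2 * fs))            # 2-second windows (assumption)
    avg_power_db = 10.0 * np.log10(np.mean(np.abs(Z) ** 2, axis=1) + 1e-12)
    peaks, _ = find_peaks(avg_power_db, height=power_thresh_db)
    omegas = list(f[peaks][:J])                              # prominent bands, at most J of them
    n_missing = J - len(omegas)
    if n_missing > 0:                                        # spread remaining components over [0, omega_c]
        omegas += list(np.linspace(0.0, omega_c, n_missing + 2)[1:-1])
    omegas = np.sort(np.array(omegas))
    ls = frac * omegas                                       # l_j^init as a fixed fraction (placeholder)
    return omegas, ls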

E.2 Estimation for σν2\sigma_{\nu}^{2}

There are two ways to estimate the observation noise variance \sigma_{\nu}^{2}. The first is to perform maximum likelihood estimation of f(\{\sigma_{j,m}^{2}\}_{j,m};\theta) with respect to \sigma_{\nu}^{2}. The second, which we found to work better in practice and use throughout the manuscript, is to estimate it directly from the Fourier transform of the data. Given a cutoff frequency \omega_{c}, informed by domain knowledge, we take the average power of the Fourier transform of \mathbf{y} in [\omega_{c},f_{s}/2]. For instance, it is well known that the spectral content below 40 Hz in the anesthesia EEG dataset is physiologically relevant, so we use \omega_{c}\simeq 40 Hz.
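
A minimal sketch of the second approach, assuming a cutoff of \omega_c = 40 Hz as in the anesthesia example; the test signal is synthetic and the periodogram normalization is one of several common conventions.

import numpy as np

def estimate_obs_noise_var(y, fs, omega_c):
    """Average periodogram power of y in [omega_c, fs/2] as an estimate of sigma_nu^2."""
    K = len(y)
    freqs = np.fft.rfftfreq(K, d=1.0 / fs)
    pxx = np.abs(np.fft.rfft(y)) ** 2 / K            # one-sided periodogram (up to scaling convention)
    band = freqs >= omega_c
    return pxx[band].mean()

# Synthetic example: a 10 Hz oscillation in unit-variance white noise, sampled at 250 Hz.
fs, T = 250, 60
t = np.arange(fs * T) / fs
rng = np.random.default_rng(0)
y = np.sin(2 * np.pi * 10 * t) + rng.normal(scale=1.0, size=t.size)
print(estimate_obs_noise_var(y, fs, omega_c=40.0))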

F. Optimization for {σj,m2}j,m\{\sigma_{j,m}^{2}\}_{j,m} via proximal gradient update (Section 4.1.1)

We discuss the algorithm for obtaining a locally optimal solution \{\sigma_{j,m}^{2}\}_{j,m} of the MAP optimization problem in Eq. 8. Throughout this section, we define \psi_{j,m}=\log\sigma_{j,m}^{2} and \Psi=[\psi_{1,1},\ldots,\psi_{1,M},\ldots,\psi_{J,M}]\in\mathbb{R}^{JM} to declutter notation.

We rewrite Eq. 8 as

minΨlogp(Ψ|𝐲,θ)h(Ψ;θ)=minΨ12m=1Mn=1N{logγ(m)(ωn)+I(m)(ωn)γ(m)(ωn)}f(Ψ;θ)+λ2j=1Jm=1M(ψj,mψj,m1)2g(Ψ;θ)=minΨf(Ψ;θ)g(Ψ;θ).\begin{split}&\min_{\Psi}\underbrace{-\log p(\Psi|\mathbf{y},\theta)}_{h(\Psi;\theta)}\\ &=\min_{\Psi}\underbrace{\frac{1}{2}\sum_{m=1}^{M}\sum_{n=1}^{N}\left\{\log\gamma^{(m)}(\omega_{n})+\frac{I^{(m)}(\omega_{n})}{\gamma^{(m)}(\omega_{n})}\right\}}_{-f(\Psi;\theta)}\\ &\quad\quad\quad+\underbrace{\frac{\lambda}{2}\sum_{j=1}^{J}\sum_{m=1}^{M}\left(\psi_{j,m}-\psi_{j,m-1}\right)^{2}}_{-g(\Psi;\theta)}\\ &=\min_{\Psi}-f(\Psi;\theta)-g(\Psi;\theta).\\ \end{split} (18)

The algorithm is described in Algorithm 1. It follows the steps of the inexact accelerated proximal gradient algorithm [Li and Lin, 2015]. For faster convergence, we use larger step sizes with the Barzilai-Borwein (BB) step-size initialization rule [Barzilai and Borwein, 1988]. For the rest of this section, we drop the dependence on \theta. The main novelty of our algorithm is the proximal gradient update

𝐮(l+1)=proxα𝐰(l)g(𝐰(l)+α𝐰(l)f(𝐰(l)))=argminΨ12α𝐰(l)Ψ(𝐰(l)+α𝐰(l)f(𝐰(l)))2g(Ψ)=argminΨj=1Jm=1M((𝐰j,m(l)+α𝐰(l)f(𝐰(l))𝐰j,m)ψj,m)22α𝐰(l)+λ2(ψj,mψj,m1)2,\begin{split}&\mathbf{u}^{(l+1)}\\ &=\operatorname{prox}_{-\alpha^{(l)}_{\mathbf{w}}g}(\mathbf{w}^{(l)}+\alpha_{\mathbf{w}}^{(l)}\nabla f(\mathbf{w}^{(l)}))\\ &=\operatorname*{arg\,min}_{\Psi}\frac{1}{2\alpha_{\mathbf{w}}^{(l)}}\lVert\Psi-(\mathbf{w}^{(l)}+\alpha_{\mathbf{w}}^{(l)}\nabla f(\mathbf{w}^{(l)}))\rVert^{2}-g(\Psi)\\ &=\operatorname*{arg\,min}_{\Psi}\sum_{j=1}^{J}\sum_{m=1}^{M}\frac{\left(\left(\mathbf{w}_{j,m}^{(l)}+\alpha_{\mathbf{w}}^{(l)}\cdot\frac{\partial f(\mathbf{w}^{(l)})}{\partial\mathbf{w}_{j,m}}\right)-\psi_{j,m}\right)^{2}}{2\alpha_{\mathbf{w}}^{(l)}}\\ &\quad\quad\quad\quad\quad\quad\quad+\frac{\lambda}{2}(\psi_{j,m}-\psi_{j,m-1})^{2},\\ \end{split} (19)

where the same holds for \mathbf{x}^{(l+1)}=\operatorname{prox}_{-\alpha_{\Psi}^{(l)}g}(\Psi^{(l)}+\alpha_{\Psi}^{(l)}\nabla f(\Psi^{(l)})). The auxiliary variables \mathbf{w},\mathbf{u},\mathbf{x}\in\mathbb{R}^{JM} ensure convergence of \Psi. We use \mathbf{w}^{(l)}_{j,m} to denote the \left((m-1)J+j\right)^{\text{th}} entry of \mathbf{w}^{(l)}. As mentioned in the main text, this update can be computed efficiently using a Kalman filter/smoother.

Result: Ψ^\widehat{\Psi}
Initialize Ψ(0)=Ψ(1)=𝐮(1)\Psi^{(0)}=\Psi^{(1)}=\mathbf{u}^{(1)}, β(0)=0,β(1)=1,δ>0,ρ<1\beta^{(0)}=0,\,\beta^{(1)}=1,\,\delta>0,\,\rho<1
for l1l\leftarrow 1 to LL do
       𝐰(l)=Ψ(l)+β(l1)β(l)(𝐮(l)Ψ(l))+β(l1)1β(l)(Ψ(l)Ψ(l1))\mathbf{w}^{(l)}=\Psi^{(l)}+\frac{\beta^{(l-1)}}{\beta^{(l)}}(\mathbf{u}^{(l)}-\Psi^{(l)})+\frac{\beta^{(l-1)}-1}{\beta^{(l)}}(\Psi^{(l)}-\Psi^{(l-1)})
       (BB step size initialization rule)
       𝐬(l)=𝐮(l)𝐰(l1)\mathbf{s}^{(l)}=\mathbf{u}^{(l)}-\mathbf{w}^{(l-1)}, 𝐫(l)=f(𝐮(l))+f(𝐰(l1))\mathbf{r}^{(l)}=-\nabla f(\mathbf{u}^{(l)})+\nabla f(\mathbf{w}^{(l-1)})
       α𝐰(l)=((𝐬(l))T𝐬(l))/((𝐬(l))T𝐫(l))\alpha_{\mathbf{w}}^{(l)}=((\mathbf{s}^{(l)})^{\text{T}}\mathbf{s}^{(l)})/((\mathbf{s}^{(l)})^{\text{T}}\mathbf{r}^{(l)})
       𝐬(l)=𝐱(l)Ψ(l1)\mathbf{s}^{(l)}=\mathbf{x}^{(l)}-\Psi^{(l-1)}, 𝐫(l)=f(𝐱(l))+f(Ψ(l1))\mathbf{r}^{(l)}=-\nabla f(\mathbf{x}^{(l)})+\nabla f(\Psi^{(l-1)})
       αΨ(l)=((𝐬(l))T𝐬(l))/((𝐬(l))T𝐫(l))\alpha_{\Psi}^{(l)}=((\mathbf{s}^{(l)})^{\text{T}}\mathbf{s}^{(l)})/((\mathbf{s}^{(l)})^{\text{T}}\mathbf{r}^{(l)})
       (Proximal update step)
       repeat
              𝐮(l+1)=proxα𝐰(l)g(𝐰(l)+α𝐰(l)f(𝐰(l)))\mathbf{u}^{(l+1)}=\operatorname{prox}_{-\alpha_{\mathbf{w}}^{(l)}g}(\mathbf{w}^{(l)}+\alpha_{\mathbf{w}}^{(l)}\nabla f(\mathbf{w}^{(l)}))
              α𝐰(l)=ρα𝐰(l)\alpha_{\mathbf{w}}^{(l)}=\rho\cdot\alpha_{\mathbf{w}}^{(l)}
            
      until h(𝐮(l+1))h(𝐰(l))δ𝐮(l+1)𝐰(l)2h(\mathbf{u}^{(l+1)})\leq h(\mathbf{w}^{(l)})-\delta\lVert\mathbf{u}^{(l+1)}-\mathbf{w}^{(l)}\rVert^{2};
      repeat
              𝐱(l+1)=proxαΨ(l)g(Ψ(l)+αΨ(l)f(Ψ(l)))\mathbf{x}^{(l+1)}=\operatorname{prox}_{-\alpha_{\Psi}^{(l)}g}(\Psi^{(l)}+\alpha_{\Psi}^{(l)}\nabla f(\Psi^{(l)}))
              αΨ(l)=ραΨ(l)\alpha_{\Psi}^{(l)}=\rho\cdot\alpha_{\Psi}^{(l)}
            
      until h(𝐱(l+1))h(Ψ(l))δ𝐱(l+1)Ψ(l)2h(\mathbf{x}^{(l+1)})\leq h(\Psi^{(l)})-\delta\lVert\mathbf{x}^{(l+1)}-\Psi^{(l)}\rVert^{2};
      β(l+1)=1+4(β(l))2+12\beta^{(l+1)}=\frac{1+\sqrt{4\left(\beta^{(l)}\right)^{2}+1}}{2}
       Ψ(l+1)={𝐮(l+1)if h(u(l+1))h(x(l+1))𝐱(l+1)otherwise \Psi^{(l+1)}=\begin{cases}\mathbf{u}^{(l+1)}&\text{if }h(\textbf{u}^{(l+1)})\leq h(\textbf{x}^{(l+1)})\\ \mathbf{x}^{(l+1)}&\text{otherwise }\\ \end{cases}
end for
Ψ^=Ψ(L)\widehat{\Psi}=\Psi^{(L)}
Algorithm 1 Inference for Ψ\Psi via inexact APG
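
Because g couples the log-variances only through first-order differences across windows, the proximal update in Eq. 19 decouples over j and reduces, for each j, to a symmetric tridiagonal linear system in \psi_{j,1},\ldots,\psi_{j,M}. The paper computes it with a Kalman filter/smoother; the sketch below instead uses an equivalent direct banded solve in SciPy, under the assumption that the difference penalty runs over m\geq 2 (the handling of \psi_{j,0} is not restated here).

import numpy as np
from scipy.linalg import solveh_banded

def prox_step(target_j, alpha, lam):
    """Solve argmin_psi ||psi - target_j||^2 / (2*alpha) + (lam/2) * sum_m (psi_m - psi_{m-1})^2,
    i.e. (I + alpha*lam*D^T D) psi = target_j, with D the first-difference operator."""
    M = target_j.size
    diag = np.full(M, 2.0)
    diag[0] = diag[-1] = 1.0                   # D^T D has diagonal [1, 2, ..., 2, 1], off-diagonal -1
    ab = np.zeros((2, M))                      # upper banded storage for solveh_banded
    ab[1] = 1.0 + alpha * lam * diag           # main diagonal
    ab[0, 1:] = -alpha * lam                   # super-diagonal
    return solveh_banded(ab, target_j)

# Illustrative usage: a noisy log-variance track is pulled toward a smoother one.
rng = np.random.default_rng(1)
target = np.log(2.0) + 0.5 * rng.standard_normal(100)    # stands in for w + alpha * grad f(w)
psi_hat = prox_step(target, alpha=0.5, lam=10.0)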

G. Lipschitz constant for f({σj,m2}j,m;θ)\nabla f(\{\sigma_{j,m}^{2}\}_{j,m};\theta) (Section 4.1.1)

In this section, we show that, under mild assumptions, the log-likelihood f(\{\sigma_{j,m}^{2}\}_{j,m};\theta) has a Lipschitz continuous gradient with Lipschitz constant C. As in the previous section, we use \psi_{j,m}=\log\sigma_{j,m}^{2} and \Psi=[\psi_{1,1},\ldots,\psi_{1,M},\ldots,\psi_{J,M}]\in\mathbb{R}^{JM}.

Let us start by restating the definition of a Lipschitz continuous gradient.

Definition A continuously differentiable function f:\mathcal{S}\rightarrow\mathbb{R} has a Lipschitz continuous gradient if

f(𝐱)f(𝐲)2Cxy2for every 𝐱,𝐲𝒮,\lVert\nabla f(\mathbf{x})-\nabla f(\mathbf{y})\rVert_{2}\leq C\lVert\textbf{x}-\textbf{y}\rVert_{2}\quad\text{for every }\mathbf{x},\mathbf{y}\in\mathcal{S}, (20)

where 𝒮\mathcal{S} is a compact subset of JM\mathbb{R}^{JM} and C>0C>0 is the Lipschitz constant.

Our goal is to find the constant CC for the Whittle likelihood f(Ψ;θ)f(\Psi;\theta)

f(Ψ;θ)=12m=1Mn=1N{logγm,n+Im,nγm,n}=12m=1Mn=1Nlog(σν2+j=1Jexp(ψj,m)αj,n)f1(Ψ;θ)12m=1Mn=1NIm,n(σν2+j=1Jexp(ψj,m)αj,n)f2(Ψ;θ),\begin{split}f(\Psi;\theta)&=-\frac{1}{2}\sum_{m=1}^{M}\sum_{n=1}^{N}\left\{\log\gamma_{m,n}+\frac{I_{m,n}}{\gamma_{m,n}}\right\}\\ &=-\frac{1}{2}\underbrace{\sum_{m=1}^{M}\sum_{n=1}^{N}\log\left(\sigma_{\nu}^{2}+\sum_{j=1}^{J}\exp\left(\psi_{j,m}\right)\alpha_{j,n}\right)}_{f_{1}(\Psi;\theta)}\\ &\quad-\frac{1}{2}\underbrace{\sum_{m=1}^{M}\sum_{n=1}^{N}\frac{I_{m,n}}{\left(\sigma_{\nu}^{2}+\sum_{j=1}^{J}\exp\left(\psi_{j,m}\right)\alpha_{j,n}\right)}}_{f_{2}(\Psi;\theta)},\\ \end{split} (21)

where we use γm,n=γ(m)(ωn)\gamma_{m,n}=\gamma^{(m)}(\omega_{n}) and Im,n=I(m)(ωn)I_{m,n}=I^{(m)}(\omega_{n}) for notational simplicity and

αj,n=(1exp(2Δ/lj))1+exp(2Δ/lj)2exp(Δ/lj)cos(ωnωj).\alpha_{j,n}=\frac{\left(1-\exp\left(-2\Delta/l_{j}\right)\right)}{1+\exp\left(-2\Delta/l_{j}\right)-2\exp\left(-\Delta/l_{j}\right)\cos(\omega_{n}-\omega_{j})}.

We make the following assumptions:

  1. I_{m,n} is bounded, i.e., I_{m,n}\leq C_{I}. For real-world signals, it is reasonable to assume that I_{m,n}, i.e., the energy of the signal, is bounded.

  2. \forall j,m, \psi_{j,m} is bounded, i.e., |\psi_{j,m}|\leq\log C_{\psi} for some C_{\psi}>1. This implies 1/C_{\psi}\leq\exp(\psi_{j,m})\leq C_{\psi}.

In addition, we have the following facts:

  1. I_{m,n}, \alpha_{j,n}, and \gamma_{m,n} are nonnegative.

  2. I_{m,n} and \gamma_{m,n} are bounded. This follows from the assumptions above.

  3. For given \{l_{j}\}_{j}, \alpha_{j,n} is bounded. To see this, note that the maximum of \alpha_{j,n} is achieved at \omega_{n}=\omega_{j},

    maxαj,n=(1exp(2Δ/lj))1+exp(2Δ/lj)2exp(Δ/lj)=(1+exp(Δ/lj))(1exp(Δ/lj)).\begin{split}\max\alpha_{j,n}&=\frac{\left(1-\exp\left(-2\Delta/l_{j}\right)\right)}{1+\exp\left(-2\Delta/l_{j}\right)-2\exp\left(-\Delta/l_{j}\right)}\\ &=\frac{\left(1+\exp(-\Delta/l_{j})\right)}{\left(1-\exp(-\Delta/l_{j})\right)}.\\ \end{split} (22)

    Therefore, denoting lmax=maxj{lj}jl_{\operatorname{max}}=\max_{j}\{l_{j}\}_{j},

    maxαj,n(1+exp(Δ/lmax))(1exp(Δ/lmax))=Cα.\max\alpha_{j,n}\leq\frac{\left(1+\exp(-\Delta/l_{\operatorname{max}})\right)}{\left(1-\exp(-\Delta/l_{\operatorname{max}})\right)}=C_{\alpha}. (23)

Finally, we define \mathcal{S}=[-\log C_{\psi},\log C_{\psi}]^{JM}\subset\mathbb{R}^{JM}.

We want to compute the Lipschitz constant for f1(Ψ)\nabla f_{1}(\Psi) and f2(Ψ)\nabla f_{2}(\Psi) for Ψ,Ψ¯𝒮\Psi,\overline{\Psi}\in\mathcal{S}, i.e.,

{f1(Ψ)f1(Ψ¯)2C1ΨΨ¯2f2(Ψ)f2(Ψ¯)2C2ΨΨ¯2,\begin{cases}\lVert\nabla f_{1}(\Psi)-\nabla f_{1}(\overline{\Psi})\rVert_{2}\leq C_{1}\lVert\Psi-\overline{\Psi}\rVert_{2}\\ \lVert\nabla f_{2}(\Psi)-\nabla f_{2}(\overline{\Psi})\rVert_{2}\leq C_{2}\lVert\Psi-\overline{\Psi}\rVert_{2},\\ \end{cases} (24)

where we dropped dependence on θ\theta for notational simplicity. Consequently, the triangle inequality yields

f(Ψ)f(Ψ¯)2f1(Ψ)f1(Ψ¯)2+f2(Ψ)f2(Ψ¯)2(C1+C2)ΨΨ¯2.\begin{split}&\lVert\nabla f(\Psi)-\nabla f(\overline{\Psi})\rVert_{2}\\ &\leq\lVert\nabla f_{1}(\Psi)-\nabla f_{1}(\overline{\Psi})\rVert_{2}+\lVert\nabla f_{2}(\Psi)-\nabla f_{2}(\overline{\Psi})\rVert_{2}\\ &\leq(C_{1}+C_{2})\lVert\Psi-\overline{\Psi}\rVert_{2}.\end{split} (25)

Lipschitz constant C1C_{1} for f1(Ψ)\nabla f_{1}(\Psi)

Let us examine f1(Ψ)f_{1}(\Psi) first. The derivative with respect to ψj,m\psi_{j,m} is given as

f1(Ψ)ψj,m=n=1Nαj,nexp(ψj,m)σν2+j=1Jexp(ψj,m)αj,nf~(ψj,m)=n=1Nαj,nf~(ψj,m).\begin{split}\frac{\partial f_{1}(\Psi)}{\partial\psi_{j,m}}&=\sum_{n=1}^{N}\alpha_{j,n}\cdot\underbrace{\frac{\exp\left(\psi_{j,m}\right)}{\sigma_{\nu}^{2}+\sum_{j^{\prime}=1}^{J}\exp\left(\psi_{j^{\prime},m}\right)\alpha_{j^{\prime},n}}}_{\widetilde{f}(\psi_{j,m})}\\ &=\sum_{n=1}^{N}\alpha_{j,n}\widetilde{f}(\psi_{j,m}).\end{split} (26)

We now have

|f1(Ψ)ψj,mf1(Ψ¯)ψj,m|=n=1N|αj,n||f~(ψj,m)f~(ψ¯j,m)|.\left|\frac{\partial f_{1}(\Psi)}{\partial\psi_{j,m}}-\frac{\partial f_{1}(\overline{\Psi})}{\partial\psi_{j,m}}\right|=\sum_{n=1}^{N}|\alpha_{j,n}|\cdot\left|\widetilde{f}\left(\psi_{j,m}\right)-\widetilde{f}(\overline{\psi}_{j,m})\right|. (27)

Without loss of generality, we assume ψj,mψ¯j,m\psi_{j,m}\geq\overline{\psi}_{j,m}. We now apply the mean value theorem (MVT) to f~(ψj,m)\widetilde{f}(\psi_{j,m})

f~(ψj,m)f~(ψ¯j,m)=f~(ψj,m)(ψj,mψ¯j,m),\widetilde{f}(\psi_{j,m})-\widetilde{f}(\overline{\psi}_{j,m})=\widetilde{f}^{\prime}(\psi^{\prime}_{j,m})(\psi_{j,m}-\overline{\psi}_{j,m}), (28)

where \psi^{\prime}_{j,m}\in[\overline{\psi}_{j,m},\psi_{j,m}]. We can compute and bound \widetilde{f}^{\prime}(\psi_{j,m}^{\prime})=d\widetilde{f}(\psi_{j,m}^{\prime})/d\psi_{j,m}^{\prime} as follows

f~(ψj,m)=exp(ψj,m)σν2+j=1Jexp(ψj,m)αj,nαj,nexp2(ψj,m)(σν2+j=1Jexp(ψj,m)αj,n)2=exp(ψj,m)σν2+j=1Jexp(ψj,m)αj,n×(1αj,nexp(ψj,m)σν2+j=1Jexp(ψj,m)αj,n)1exp(ψj,m)σν2+j=1Jexp(ψj,m)αj,nCψσν2.\begin{split}\widetilde{f}^{\prime}(\psi_{j,m}^{\prime})&=\frac{\exp(\psi_{j,m}^{\prime})}{\sigma_{\nu}^{2}+\sum_{j^{\prime}=1}^{J}\exp(\psi_{j^{\prime},m}^{\prime})\alpha_{j^{\prime},n}}\\ &\quad-\frac{\alpha_{j,n}\exp^{2}(\psi_{j,m}^{\prime})}{(\sigma_{\nu}^{2}+\sum_{j^{\prime}=1}^{J}\exp(\psi_{j^{\prime},m}^{\prime})\alpha_{j^{\prime},n})^{2}}\\ &=\frac{\exp(\psi_{j,m}^{\prime})}{\sigma_{\nu}^{2}+\sum_{j^{\prime}=1}^{J}\exp(\psi_{j^{\prime},m}^{\prime})\alpha_{j^{\prime},n}}\\ &\quad\times\underbrace{\left(1-\frac{\alpha_{j,n}\exp(\psi_{j,m}^{\prime})}{\sigma_{\nu}^{2}+\sum_{j^{\prime}=1}^{J}\exp(\psi_{j^{\prime},m}^{\prime})\alpha_{j^{\prime},n}}\right)}_{\leq 1}\\ &\leq\frac{\exp(\psi^{\prime}_{j,m})}{\sigma_{\nu}^{2}+\sum_{j^{\prime}=1}^{J}\exp(\psi^{\prime}_{j^{\prime},m})\alpha_{j^{\prime},n}}\\ &\leq\frac{C_{\psi}}{\sigma_{\nu}^{2}}.\\ \end{split} (29)

Combining Eq. 28 and 29, we have

n=1N|αj,n||f~(ψj,m)f~(ψ¯j,m)|=n=1N|αj,n||f~(ψj,m)(ψj,mψ¯j,m)|NCαCψσν2|ψj,mψ¯j,m|.\begin{split}&\sum_{n=1}^{N}|\alpha_{j,n}|\cdot\left|\widetilde{f}\left(\psi_{j,m}\right)-\widetilde{f}(\overline{\psi}_{j,m})\right|\\ &=\sum_{n=1}^{N}|\alpha_{j,n}|\cdot\left|\widetilde{f}^{\prime}(\psi^{\prime}_{j,m})(\psi_{j,m}-\overline{\psi}_{j,m})\right|\\ &\leq\frac{NC_{\alpha}C_{\psi}}{\sigma_{\nu}^{2}}\left|\psi_{j,m}-\overline{\psi}_{j,m}\right|.\\ \end{split} (30)

We thus have,

f1(Ψ)f1(Ψ¯)22=j=1Jm=1M(f1(Ψ)ψj,mf1(Ψ¯)ψj,m)2(JMNCαCψσν2)2ΨΨ¯22.\begin{split}\lVert\nabla f_{1}(\Psi)-\nabla f_{1}(\overline{\Psi})\rVert_{2}^{2}&=\sum_{j=1}^{J}\sum_{m=1}^{M}\left(\frac{\partial f_{1}(\Psi)}{\partial\psi_{j,m}}-\frac{\partial f_{1}(\overline{\Psi})}{\partial\psi_{j,m}}\right)^{2}\\ &\leq\left(\frac{JMNC_{\alpha}C_{\psi}}{\sigma_{\nu}^{2}}\right)^{2}\lVert\Psi-\overline{\Psi}\rVert^{2}_{2}.\end{split} (31)

Lipschitz constant C2C_{2} for f2(Ψ)\nabla f_{2}(\Psi)

Computing C2C_{2} proceeds in a similar manner to computing C1C_{1}. The derivative with respect to ψj,m\psi_{j,m} is given as

f2(Ψ)ψj,m=n=1NIm,nαj,nexp(ψj,m)(σν2+j=1Jexp(ψj,m)αj,n)2f~(ψj,m).\begin{split}&\frac{\partial f_{2}(\Psi)}{\partial\psi_{j,m}}\\ &=-\sum_{n=1}^{N}I_{m,n}\alpha_{j,n}\cdot\underbrace{\frac{\exp(\psi_{j,m})}{(\sigma_{\nu}^{2}+\sum_{j^{\prime}=1}^{J}\exp(\psi_{j^{\prime},m})\alpha_{j^{\prime},n})^{2}}}_{\widetilde{f}(\psi_{j,m})}.\\ \end{split} (32)

We now have

|f2(Ψ)ψj,mf2(Ψ¯)ψj,m|=n=1N|Im,nαj,n||f~(ψj,m)+f~(ψ¯j,m)|.\begin{split}&\left|\frac{\partial f_{2}(\Psi)}{\partial\psi_{j,m}}-\frac{\partial f_{2}(\overline{\Psi})}{\partial\psi_{j,m}}\right|\\ &=\sum_{n=1}^{N}\left|I_{m,n}\alpha_{j,n}\right|\cdot\left|-\widetilde{f}(\psi_{j,m})+\widetilde{f}(\overline{\psi}_{j,m})\right|.\\ \end{split} (33)

Without loss of generality, assume ψj,mψ¯j,m\psi_{j,m}\geq\overline{\psi}_{j,m}. To apply MVT, we need to compute and bound f~(ψj,m)\widetilde{f}^{\prime}(\psi^{\prime}_{j,m})

f~(ψj,m)=exp(ψj,m)(σν2+j=1Jexp(ψj,m)αj,n)22αj,nexp2(ψj,m)(σν2+j=1Jexp(ψj,m)αj,n)3=exp(ψj,m)(σν2+j=1Jexp(ψj,m)αj,n)2×(12αj,nexp(ψj,m)σν2+j=1Jexp(ψj,m)αj,n)1exp(ψj,m)(σν2+j=1Jexp(ψj,m)αj,n)2Cψσν4.\begin{split}\widetilde{f}^{\prime}(\psi_{j,m}^{\prime})&=\frac{\exp(\psi_{j,m}^{\prime})}{(\sigma_{\nu}^{2}+\sum_{j^{\prime}=1}^{J}\exp(\psi_{j^{\prime},m}^{\prime})\alpha_{j^{\prime},n})^{2}}\\ &\quad-\frac{2\alpha_{j,n}\exp^{2}(\psi_{j,m}^{\prime})}{(\sigma_{\nu}^{2}+\sum_{j^{\prime}=1}^{J}\exp(\psi_{j^{\prime},m}^{\prime})\alpha_{j^{\prime},n})^{3}}\\ &=\frac{\exp(\psi_{j,m}^{\prime})}{(\sigma_{\nu}^{2}+\sum_{j^{\prime}=1}^{J}\exp(\psi_{j^{\prime},m}^{\prime})\alpha_{j^{\prime},n})^{2}}\\ &\quad\times\underbrace{\left(1-\frac{2\alpha_{j,n}\exp(\psi_{j,m}^{\prime})}{\sigma_{\nu}^{2}+\sum_{j^{\prime}=1}^{J}\exp(\psi_{j^{\prime},m}^{\prime})\alpha_{j^{\prime},n}}\right)}_{\leq 1}\\ &\leq\frac{\exp(\psi^{\prime}_{j,m})}{(\sigma_{\nu}^{2}+\sum_{j^{\prime}=1}^{J}\exp(\psi^{\prime}_{j^{\prime},m})\alpha_{j^{\prime},n})^{2}}\\ &\leq\frac{C_{\psi}}{\sigma_{\nu}^{4}}.\\ \end{split} (34)

Applying MVT,

n=1N|Im,nαj,n||f~(ψj,m)+f~(ψ¯j,m)|=n=1N|Im,nαj,n||f~(ψj,m)(ψj,mψ¯j,m)|NCICαCψσν4|ψj,mψ¯j,m|.\begin{split}&\sum_{n=1}^{N}|I_{m,n}\alpha_{j,n}|\cdot\left|-\widetilde{f}\left(\psi_{j,m}\right)+\widetilde{f}\left(\overline{\psi}_{j,m}\right)\right|\\ &=\sum_{n=1}^{N}|I_{m,n}\alpha_{j,n}|\cdot\left|\widetilde{f}^{\prime}(\psi^{\prime}_{j,m})(\psi_{j,m}-\overline{\psi}_{j,m})\right|\\ &\leq\frac{NC_{I}C_{\alpha}C_{\psi}}{\sigma_{\nu}^{4}}\left|\psi_{j,m}-\overline{\psi}_{j,m}\right|.\\ \end{split} (35)

We thus have,

f2(Ψ)f2(Ψ¯)22=j=1Jm=1M(f2(Ψ)ψj,mf2(Ψ¯)ψj,m)2(JMNCαCψCIσν4)2ΨΨ¯22.\begin{split}&\lVert\nabla f_{2}(\Psi)-\nabla f_{2}(\overline{\Psi})\rVert_{2}^{2}\\ &=\sum_{j=1}^{J}\sum_{m=1}^{M}\left(\frac{\partial f_{2}(\Psi)}{\partial\psi_{j,m}}-\frac{\partial f_{2}(\overline{\Psi})}{\partial\psi_{j,m}}\right)^{2}\\ &\leq\left(\frac{JMNC_{\alpha}C_{\psi}C_{I}}{\sigma_{\nu}^{4}}\right)^{2}\lVert\Psi-\overline{\Psi}\rVert^{2}_{2}.\end{split} (36)

Collecting the Lipschitz constants C1C_{1} and C2C_{2}, we finally have

f(Ψ)f(Ψ¯)2JMNCαCψσν2(1+CIσν2)CΨΨ¯2.\lVert\nabla f(\Psi)-\nabla f(\overline{\Psi})\rVert_{2}\leq\underbrace{\frac{JMNC_{\alpha}C_{\psi}}{\sigma_{\nu}^{2}}\left(1+\frac{C_{I}}{\sigma_{\nu}^{2}}\right)}_{C}\lVert\Psi-\overline{\Psi}\rVert_{2}. (37)
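
For concreteness, C can be assembled from the constants above and used, for instance, as a conservative fixed step size 1/C when the BB initialization is disabled; the numerical values below are illustrative placeholders.

import numpy as np

def lipschitz_constant(J, M, N, delta, l_max, C_I, C_psi, sigma2_nu):
    """C = (J*M*N*C_alpha*C_psi / sigma_nu^2) * (1 + C_I / sigma_nu^2), with C_alpha from Eq. 23."""
    C_alpha = (1.0 + np.exp(-delta / l_max)) / (1.0 - np.exp(-delta / l_max))
    return J * M * N * C_alpha * C_psi / sigma2_nu * (1.0 + C_I / sigma2_nu)

# Illustrative values only; 1/C is a conservative step size compared with the BB rule.
C = lipschitz_constant(J=5, M=100, N=500, delta=1e-3, l_max=2.0, C_I=1e3, C_psi=1e2, sigma2_nu=1.0)
step_size = 1.0 / C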

H. Inference with p({𝐳j}j{σj,m2}j,m,𝐲,θ)p\left(\left\{\mathbf{z}_{j}\right\}_{j}\mid\{\sigma_{j,m}^{2}\}_{j,m},\mathbf{y},\theta\right) (Section 4.2)

We present the details for performing inference with p(\left\{\mathbf{z}_{j}\right\}_{j}\mid\{\widehat{\sigma}_{j,m}^{2}\}_{j,m},\mathbf{y},\hat{\theta}), given the estimates \{\widehat{\sigma}_{j,m}^{2}\}_{j,m} and \hat{\theta} from window-level inference. First, we present the Kalman filter/smoother algorithm to compute the posterior mean trajectory and the credible interval. Next, we present the forward filtering backward sampling (FFBS) algorithm [Carter and Kohn, 1994, Lindsten and Schön, 2013] to generate Monte Carlo (MC) sample trajectories.

We begin by defining additional notation.

1) 𝐳~j,k|k=𝔼[𝐳~j,k{σ^j,m2}j,m,𝐲1:k,θ^]2\widetilde{\mathbf{z}}_{j,k|k^{\prime}}=\mathbb{E}[\widetilde{\mathbf{z}}_{j,k}\mid\{\widehat{\sigma}_{j,m}^{2}\}_{j,m},\mathbf{y}_{1:k^{\prime}},\hat{\theta}]\in\mathbb{R}^{2}

Posterior mean of 𝐳~j,k\widetilde{\mathbf{z}}_{j,k}. We are primarily concerned with the following three types: 1) 𝐳~j,k|k1\widetilde{\mathbf{z}}_{j,k|k-1}, the one-step prediction estimate, 2) 𝐳~j,k|k\widetilde{\mathbf{z}}_{j,k|k}, the Kalman filter estimate, and 3) 𝐳~j,k|MN\widetilde{\mathbf{z}}_{j,k|MN}, the Kalman smoother estimate.

2) 𝐳~k|k=[(𝐳~1,k|k)T,,(𝐳~J,k|k)T]T2J\widetilde{\mathbf{z}}_{k|k^{\prime}}=[(\widetilde{\mathbf{z}}_{1,k|k^{\prime}})^{\text{T}},\ldots,(\widetilde{\mathbf{z}}_{J,k|k^{\prime}})^{\text{T}}]^{\text{T}}\in\mathbb{R}^{2J}

A collection of {𝐳~j,k|k}j\{\widetilde{\mathbf{z}}_{j,k|k^{\prime}}\}_{j} in a single vector.

3) 𝐏j,k|k=𝔼[(𝐳~j,k𝐳~j,k|k)(𝐳~j,k𝐳~j,k|k)T{σ^j,m2}j,m,𝐲1:k,θ^]2×2\mathbf{P}_{j,k|k^{\prime}}=\mathbb{E}[\left(\widetilde{\mathbf{z}}_{j,k}-\widetilde{\mathbf{z}}_{j,k|k^{\prime}}\right)\left(\widetilde{\mathbf{z}}_{j,k}-\widetilde{\mathbf{z}}_{j,k|k^{\prime}}\right)^{\text{T}}\mid\{\widehat{\sigma}_{j,m}^{2}\}_{j,m},\mathbf{y}_{1:k^{\prime}},\hat{\theta}]\in\mathbb{R}^{2\times 2}

Posterior covariance of 𝐳~j,k\widetilde{\mathbf{z}}_{j,k}. Just as in 𝐳~j,k|k\widetilde{\mathbf{z}}_{j,k|k^{\prime}}, we are interested in three types, i.e., 𝐏j,k|k1\mathbf{P}_{j,k|k-1}, 𝐏j,k|k\mathbf{P}_{j,k|k}, and 𝐏j,k|MN\mathbf{P}_{j,k|MN}.

4) 𝐏k|k=blkdiag(𝐏j,k|k)2J×2J\mathbf{P}_{k|k^{\prime}}=\operatorname{blkdiag}\left(\mathbf{P}_{j,k|k^{\prime}}\right)\in\mathbb{R}^{2J\times 2J}

A block diagonal matrix of JJ posterior covariance matrices.

5) 𝐀=blkdiag(exp(Δ/lj)𝐑(ωj))2J×2J\mathbf{A}=\operatorname{blkdiag}\left(\exp\left(-\Delta/l_{j}\right)\mathbf{R}(\omega_{j})\right)\in\mathbb{R}^{2J\times 2J}

A block diagonal transition matrix.

6) \mathbf{H}=(1,0,\ldots,1,0), the observation gain.

Kalman filter/smoother

The Kalman filter equations are given as

𝐳~j,mN+n|mN+(n1)=exp(Δ/lj)𝐑(ωj)𝐳~j,mN+(n1)|mN+(n1)𝐏j,mN+n|mN+(n1)=exp(2Δ/lj)𝐑(ωj)𝐏j,mN+(n1)|mN+(n1)𝐑T(ωj)+σj,m2(1exp(2Δ/lj))𝐊mN+n=𝐏mN+n|mN+(n1)𝐇T(𝐇𝐏mN+n|mN+(n1)𝐇T+σν2)1𝐳~mN+n|mN+n=𝐳~mN+n|mN+(n1)+𝐊mN+n(𝐲mN+n𝐇𝐳~mN+n|mN+(n1))𝐏mN+n|mN+n=(𝐈2J×2J𝐊mN+n𝐇)𝐏mN+n|mN+(n1).\begin{split}&\widetilde{\mathbf{z}}_{j,mN+n|mN+(n-1)}\\ &=\exp\left(-\Delta/l_{j}\right)\mathbf{R}(\omega_{j})\widetilde{\mathbf{z}}_{j,mN+(n-1)|mN+(n-1)}\\ &\mathbf{P}_{j,mN+n|mN+(n-1)}\\ &=\exp\left(-2\Delta/l_{j}\right)\mathbf{R}(\omega_{j})\mathbf{P}_{j,mN+(n-1)|mN+(n-1)}\mathbf{R}^{\text{T}}(\omega_{j})\\ &\quad+\sigma_{j,m}^{2}\left(1-\exp\left(-2\Delta/l_{j}\right)\right)\\ &\mathbf{K}_{mN+n}\\ &=\mathbf{P}_{mN+n|mN+(n-1)}\mathbf{H}^{\text{T}}\left(\mathbf{H}\mathbf{P}_{mN+n|mN+(n-1)}\mathbf{H}^{\text{T}}+\sigma_{\nu}^{2}\right)^{-1}\\ &\widetilde{\mathbf{z}}_{mN+n|mN+n}\\ &=\widetilde{\mathbf{z}}_{mN+n|mN+(n-1)}\\ &\quad+\mathbf{K}_{mN+n}\left(\mathbf{y}_{mN+n}-\mathbf{H}\widetilde{\mathbf{z}}_{mN+n|mN+(n-1)}\right)\\ &\mathbf{P}_{mN+n|mN+n}=\left(\mathbf{I}_{2J\times 2J}-\mathbf{K}_{mN+n}\mathbf{H}\right)\mathbf{P}_{mN+n|mN+(n-1)}.\\ \end{split} (38)

Subsequently, the Kalman smoother equations are given as

\begin{split}\mathbf{C}_{mN+n}&=\mathbf{P}_{mN+n|mN+n}\mathbf{A}^{\text{T}}\mathbf{P}^{-1}_{mN+(n+1)|mN+n}\in\mathbb{R}^{2J\times 2J}\\ \widetilde{\mathbf{z}}_{mN+n|MN}&=\widetilde{\mathbf{z}}_{mN+n|mN+n}+\mathbf{C}_{mN+n}\left(\widetilde{\mathbf{z}}_{mN+(n+1)|MN}-\widetilde{\mathbf{z}}_{mN+(n+1)|mN+n}\right)\\ \mathbf{P}_{mN+n|MN}&=\mathbf{P}_{mN+n|mN+n}+\mathbf{C}_{mN+n}\left(\mathbf{P}_{mN+(n+1)|MN}-\mathbf{P}_{mN+(n+1)|mN+n}\right)\mathbf{C}_{mN+n}^{\text{T}}.\\ \end{split} (39)

To obtain the mean reconstructed trajectory for the jthj^{\text{th}} oscillatory component, {𝐲^j,k}k\{\widehat{\mathbf{y}}_{j,k}\}_{k}, we take the real part of the jthj^{\text{th}} component of the smoothed mean, 𝐲^j,k=𝐞2j1T𝐳~k|MN\widehat{\mathbf{y}}_{j,k}=\mathbf{e}_{2j-1}^{\text{T}}\cdot\widetilde{\mathbf{z}}_{k|MN}, where 𝐞2j12J\mathbf{e}_{2j-1}\in\mathbb{R}^{2J} is a unit vector with the only non-zero value, equal to 1, at the entry 2j12j-1.

The 95% credible interval for \widehat{\mathbf{y}}_{j,k}, denoted \operatorname{CI}^{\text{lower}}_{j,k} and \operatorname{CI}^{\text{upper}}_{j,k} for the lower and upper ends, respectively, is given as

CIj,kupper=𝐞2j1T𝐳~k|MN+1.96𝐞2j1T𝐏k|MN𝐞2j1CIj,klower=𝐞2j1T𝐳~k|MN1.96𝐞2j1T𝐏k|MN𝐞2j1.\begin{split}\operatorname{CI}^{\text{upper}}_{j,k}&=\mathbf{e}_{2j-1}^{\text{T}}\cdot\widetilde{\mathbf{z}}_{k|MN}+1.96\cdot\sqrt{\mathbf{e}_{2j-1}^{\text{T}}\mathbf{P}_{k|MN}\mathbf{e}_{2j-1}}\\ \operatorname{CI}^{\text{lower}}_{j,k}&=\mathbf{e}_{2j-1}^{\text{T}}\cdot\widetilde{\mathbf{z}}_{k|MN}-1.96\cdot\sqrt{\mathbf{e}_{2j-1}^{\text{T}}\mathbf{P}_{k|MN}\mathbf{e}_{2j-1}}.\\ \end{split} (40)
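
A compact NumPy sketch of the filter/smoother recursions above for the stacked 2J-dimensional state. For brevity it assumes a single stationary window (one \sigma_j^2 per component), whereas the full PLSO recursion switches \sigma_{j,m}^2 at window boundaries; the zero prior mean and steady-state prior covariance are assumptions.

import numpy as np
from scipy.linalg import block_diag

def plso_kalman(y, omegas, ls, sigma2s, sigma2_nu, delta):
    """Kalman filter + RTS smoother for the stacked 2J-dim PLSO state (single window, for brevity)."""
    J, K = len(omegas), len(y)
    rot = lambda w: np.array([[np.cos(w), -np.sin(w)], [np.sin(w), np.cos(w)]])
    A = block_diag(*[np.exp(-delta / l) * rot(w) for w, l in zip(omegas, ls)])
    Q = block_diag(*[s * (1.0 - np.exp(-2.0 * delta / l)) * np.eye(2) for s, l in zip(sigma2s, ls)])
    H = np.tile([1.0, 0.0], J)                        # only the real parts contribute to y_k
    x = np.zeros(2 * J)                               # prior mean (assumption)
    P = block_diag(*[s * np.eye(2) for s in sigma2s]) # prior covariance at steady state (assumption)
    x_pred = np.zeros((K, 2 * J)); P_pred = np.zeros((K, 2 * J, 2 * J))
    x_filt = np.zeros((K, 2 * J)); P_filt = np.zeros((K, 2 * J, 2 * J))
    for k in range(K):                                # forward (filter) pass
        xp, Pp = A @ x, A @ P @ A.T + Q
        S = H @ Pp @ H + sigma2_nu
        gain = Pp @ H / S
        x = xp + gain * (y[k] - H @ xp)
        P = Pp - np.outer(gain, H @ Pp)
        x_pred[k], P_pred[k], x_filt[k], P_filt[k] = xp, Pp, x, P
    x_sm, P_sm = x_filt.copy(), P_filt.copy()
    for k in range(K - 2, -1, -1):                    # backward (smoother) pass
        C = P_filt[k] @ A.T @ np.linalg.inv(P_pred[k + 1])
        x_sm[k] = x_filt[k] + C @ (x_sm[k + 1] - x_pred[k + 1])
        P_sm[k] = P_filt[k] + C @ (P_sm[k + 1] - P_pred[k + 1]) @ C.T
    return x_filt, P_filt, x_pred, P_pred, x_sm, P_sm

# Mean and 95% CI of the (0-indexed) j-th reconstructed component, whose real part sits at state index 2*j:
#   y_hat_j = x_sm[:, 2 * j];  ci_halfwidth = 1.96 * np.sqrt(P_sm[:, 2 * j, 2 * j])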

FFBS algorithm for p({𝐳j}j{σj,m2}j,m,𝐲,θ)p(\left\{\mathbf{z}_{j}\right\}_{j}\mid\{\sigma_{j,m}^{2}\}_{j,m},\mathbf{y},\theta)

To generate the MC trajectory samples from the posterior distribution p({𝐳j}j{σ^j,m2}j,m,𝐲,θ^)p(\left\{\mathbf{z}_{j}\right\}_{j}\mid\{\widehat{\sigma}_{j,m}^{2}\}_{j,m},\mathbf{y},\hat{\theta}), we use the FFBS algorithm. The steps are summarized in Algorithm 2, which uses the Kalman estimates derived in the previous section. We denote s=1,,Ss=1,\ldots,S as the MC sample index.

Result: {𝐳~k(s)}k,sMN,S\left\{\widetilde{\mathbf{z}}_{k}^{(s)}\right\}_{k,s}^{MN,S}
for s1s\leftarrow 1 to SS do
       Sample 𝐳~MN(s)\widetilde{\mathbf{z}}^{(s)}_{MN} from 𝒩(𝐳~MN|MN,𝐏MN|MN)\mathcal{N}\left(\widetilde{\mathbf{z}}_{MN|MN},\mathbf{P}_{MN|MN}\right).
       for kMN1k\leftarrow MN-1 to 11 do
             Sample 𝐳~k(s)\widetilde{\mathbf{z}}^{(s)}_{k} from 𝒩(μ~k,𝐏~k)\mathcal{N}(\widetilde{\mu}_{k},\widetilde{\mathbf{P}}_{k}), where
μ~k=𝐳~k|k+𝐏k|k𝐀T𝐏k+1|k1(𝐳~k+1(s)𝐀𝐳~k|k)𝐏~k=𝐏k|k𝐏k|k𝐀T𝐏k+1|k1𝐀𝐏k|k.\begin{split}\widetilde{\mu}_{k}&=\widetilde{\mathbf{z}}_{k|k}+\mathbf{P}_{k|k}\mathbf{A}^{\text{T}}\mathbf{P}_{k+1|k}^{-1}(\widetilde{\mathbf{z}}^{(s)}_{k+1}-\mathbf{A}\widetilde{\mathbf{z}}_{k|k})\\ \widetilde{\mathbf{P}}_{k}&=\mathbf{P}_{k|k}-\mathbf{P}_{k|k}\mathbf{A}^{\text{T}}\mathbf{P}_{k+1|k}^{-1}\mathbf{A}\mathbf{P}_{k|k}.\\ \end{split}
       end for
      
end for
Algorithm 2 FFBS algorithm
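
Given the filtered and one-step-prediction quantities (e.g., as returned by the filtering sketch in the previous section), the backward pass of Algorithm 2 can be written as below; the function name and argument layout are illustrative choices, not the paper's.

import numpy as np

def ffbs_sample(x_filt, P_filt, x_pred, P_pred, A, n_samples=10, seed=0):
    """Backward-sample state trajectories from p(z | y), given Kalman filter outputs (Algorithm 2)."""
    rng = np.random.default_rng(seed)
    K, d = x_filt.shape
    samples = np.zeros((n_samples, K, d))
    for s in range(n_samples):
        z = rng.multivariate_normal(x_filt[-1], P_filt[-1])   # sample the final state
        samples[s, -1] = z
        for k in range(K - 2, -1, -1):                        # backward pass
            G = P_filt[k] @ A.T @ np.linalg.inv(P_pred[k + 1])
            mu = x_filt[k] + G @ (z - x_pred[k + 1])          # x_pred[k+1] = A x_filt[k]
            cov = P_filt[k] - G @ A @ P_filt[k]
            z = rng.multivariate_normal(mu, cov)
            samples[s, k] = z
    return samples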

I. Computational efficiency of PLSO vs. GP-PS

We show the runtime of PLSO and piecewise stationary GP (GP-PS) for inference of the mean trajectory of the hippocampus data (fs=1,250fs=1,250 Hz, J=5J=5, 2-second window) for varying data lengths (5050, 100100, 200200 seconds corresponding to K=6.25×104, 1.25×105, 2.5×105K=6.25\times 10^{4},\,1.25\times 10^{5},\,2.5\times 10^{5} sample points, respectively).

As noted in Section 5, the computational complexity of PLSO is O(J^{2}K), whereas that of GP-PS is O(N^{2}K). Since N, the number of samples per window, is fixed (2,500 samples), we expect both PLSO and GP-PS to scale linearly in the number of samples K. Table 3 confirms that this is the case; however, PLSO is much more efficient than GP-PS.

Table 3: Runtime (s) of PLSO and GP-PS for varying data length

T (s)    PLSO   GP-PS
50       1.7    346.8
100      3.1    700.6
200      6.5    1334.0

J. Simulation experiment (Section 5.1)

We simulate from the following model for 1kK1\leq k\leq K

𝐲k=10(KkK)𝐳1,k+10cos4(2πω0k)𝐳2,k+νk,\mathbf{y}_{k}=10\left(\frac{K-k}{K}\right)\mathbf{z}^{\Re}_{1,k}+10\cos^{4}(2\pi\omega_{0}k)\mathbf{z}^{\Re}_{2,k}+\nu_{k},

where \mathbf{z}_{1,k} and \mathbf{z}_{2,k} are drawn from the PLSO stationary generative model, i.e., \sigma_{j,m}^{2}=\sigma_{j}^{2}. The parameters are \omega_{0}/\omega_{1}/\omega_{2}=0.04/1/10 Hz, f_{s}=200 Hz, T=100 seconds, l_{1}=l_{2}=1, and \nu_{k}\sim\mathcal{N}(0,25). The observation thus comprises two amplitude-modulated oscillations, one modulated by a slow (\omega_{0}=0.04 Hz) sinusoid and the other by a linearly varying envelope [Ba et al., 2014]. We assume a 2-second PS interval. For PLSO, we use J=2 components and 5 block coordinate descent iterations for optimizing \theta and \{\sigma_{j,m}^{2}\}_{j,m}.
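
A NumPy sketch of this simulation. The latent components are drawn from the stationary complex AR(1) recursion; the component noise scales \sigma_1, \sigma_2 are set to illustrative values since they are not restated here, and the modulators are evaluated at k\Delta seconds.

import numpy as np

fs, T = 200, 100
K, delta = fs * T, 1.0 / fs
f0, f1, f2 = 0.04, 1.0, 10.0                      # modulation and component frequencies (Hz)
l1 = l2 = 1.0
sigma1 = sigma2 = 1.0                             # component noise scales (illustrative)
rng = np.random.default_rng(0)

def sample_component(f_hz, l, sigma):
    """Stationary complex AR(1): z_k = exp(-delta/l) exp(i*2*pi*f*delta) z_{k-1} + eps_k."""
    rho = np.exp(-delta / l)
    phase = 2.0 * np.pi * f_hz * delta
    scale = sigma * np.sqrt(1.0 - rho ** 2)       # keeps Var(Re z_k) = sigma^2 for all k
    z = np.zeros(K, dtype=complex)
    z[0] = sigma * (rng.standard_normal() + 1j * rng.standard_normal())
    for k in range(1, K):
        eps = scale * (rng.standard_normal() + 1j * rng.standard_normal())
        z[k] = rho * np.exp(1j * phase) * z[k - 1] + eps
    return z

k = np.arange(K)
z1, z2 = sample_component(f1, l1, sigma1), sample_component(f2, l2, sigma2)
y = (10.0 * (K - k) / K * z1.real
     + 10.0 * np.cos(2.0 * np.pi * f0 * k * delta) ** 4 * z2.real
     + rng.normal(scale=5.0, size=K))             # nu_k ~ N(0, 25)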

K. Details of the TVAR model

As explained in Section 5, the TVAR model is defined as

𝐲k=p=1Pap,k𝐲kp+εk,\mathbf{y}_{k}=\sum_{p=1}^{P}a_{p,k}\mathbf{y}_{k-p}+\varepsilon_{k},

which can alternatively be written as

\begin{pmatrix}\mathbf{y}_{k}\\ \vdots\\ \mathbf{y}_{k-P+1}\end{pmatrix}=\underbrace{\begin{pmatrix}a_{1,k}&a_{2,k}&\cdots&a_{P-1,k}&a_{P,k}\\ 1&0&\cdots&0&0\\ 0&1&\cdots&0&0\\ \vdots&\vdots&\ddots&\vdots&\vdots\\ 0&0&\cdots&1&0\end{pmatrix}}_{A_{k}}\begin{pmatrix}\mathbf{y}_{k-1}\\ \vdots\\ \mathbf{y}_{k-P}\end{pmatrix}+\varepsilon_{k}.

It is the transition matrix AkA_{k} that determines the oscillatory component profile at time kk, such as the number of components and the center frequencies. Specifically, {Ak}k\{A_{k}\}_{k} are first fit to the data 𝐲\mathbf{y} and then eigen-decomposition is performed on each of the estimated {Ak}k\{A_{k}\}_{k}. More in-depth technical details can be found in West et al. [1999].
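
For illustration, the sketch below builds the companion matrix A_k from a set of TVAR coefficients and reads off component center frequencies from its complex eigenvalues; the coefficients are a toy AR(2) example, and the sequential fitting and discounting of West et al. [1999] are omitted.

import numpy as np

def companion(a_k):
    """Companion (transition) matrix A_k for TVAR coefficients a_k = [a_{1,k}, ..., a_{P,k}]."""
    P = len(a_k)
    A = np.zeros((P, P))
    A[0, :] = a_k
    A[1:, :-1] = np.eye(P - 1)
    return A

def component_frequencies(a_k, fs):
    """Center frequencies (Hz) implied by the complex eigenvalues of A_k."""
    eigvals = np.linalg.eigvals(companion(a_k))
    freqs = np.abs(np.angle(eigvals)) * fs / (2.0 * np.pi)
    return np.sort(np.unique(np.round(freqs[freqs > 0], 3)))

# Toy AR(2) coefficient pair encoding a damped ~8 Hz oscillation at fs = 1250 Hz.
fs, f, r = 1250.0, 8.0, 0.99
a_k = [2.0 * r * np.cos(2.0 * np.pi * f / fs), -r ** 2]
print(component_frequencies(a_k, fs))             # recovers ~8 Hz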

We use publicly available code for the TVAR implementation (https://www2.stat.duke.edu/~mw/mwsoftware/TVAR/index.html). We use a TVAR order of p=70, as models of lower order did not capture the theta-band signal; in those cases, the lowest identified frequency band was the gamma band (>30 Hz). Even with higher orders of p and various hyperparameter combinations, the slow and theta frequency bands were still explained by a single oscillatory component. For the discount factor, we used \beta=0.999 to ensure that the TVAR coefficients and, consequently, the decomposed oscillatory components do not fluctuate much.

L. Anesthesia EEG dataset (Section 5.3)

We show spectral analysis results for the EEG data of a subject anesthetized with propofol (This is a different subject from the main text.) The data last T=2,500T=2{,}500 seconds, sampled at fs=250f_{s}=250 Hz. We assume a 4-second PS interval, use J=9J=9 components and 5 block coordinate descent iterations for optimizing θ\theta and {σj,m2}j,m\{\sigma_{j,m}^{2}\}_{j,m}.

Fig. 7 shows the STFT and the PLSO-estimated spectrograms. As noted in the main text, PLSO under the stationarity assumption (\lambda\rightarrow\infty) is too restrictive and fails to capture the time-varying spectral pattern. Both PLSO with \lambda=0 and \lambda=\lambda_{\operatorname{CV}} are more effective at capturing such patterns, with the latter removing artifacts and recovering smoother dynamics.

Figure 7: Spectrogram (in dB) under propofol anesthesia. (a) STFT of the data (b) PLSO with λ\lambda\rightarrow\infty (c) PLSO with λ=0\lambda=0 (d) PLSO with λ=λCV\lambda=\lambda_{\operatorname{CV}}.