
Bridging the Usability Gap: Theoretical and Methodological Advances for Spectral Learning of Hidden Markov Models

Xiaoyuan Ma
Department of Statistics, University of Virginia
and
Jordan Rodu
Department of Statistics, University of Virginia
Abstract

The Baum-Welch (B-W) algorithm is the most widely accepted method for inferring hidden Markov models (HMMs). However, it is prone to getting stuck in local optima and can be too slow for many real-time applications. Spectral learning of HMMs (SHMM), based on the method of moments (MOM), has been proposed in the literature to overcome these obstacles. Despite its promise, SHMM has lacked a comprehensive asymptotic theory, and its long-run performance can degrade due to unchecked error propagation. In this paper, we (1) provide an asymptotic distribution for the approximation error of the likelihood estimated by SHMM, (2) propose a novel algorithm called projected SHMM (PSHMM) that mitigates the problem of error propagation, and (3) describe online learning variants of both SHMM and PSHMM that accommodate potential nonstationarity. We compare the performance of SHMM and PSHMM with estimation through the B-W algorithm on both simulated data and data from real-world applications, and find that PSHMM not only retains the computational advantages of SHMM but also provides more robust estimation and forecasting.


Keywords: hidden Markov models (HMM), spectral estimation, projection-onto-simplex, online learning, time series forecasting

1 Introduction

The hidden Markov model (HMM), introduced by Baum and Petrie (1966), has found widespread applications across diverse fields, including finance (Hassan and Nath, 2005; Mamon and Elliott, 2014), natural language processing (Zhou and Su, 2001; Stratos et al., 2016), and medicine (Shirley et al., 2010; Scott et al., 2005). An HMM is a probabilistic model for sequential or time series data that assumes the underlying dynamics of the data are governed by a Markov chain (Knoll et al., 2016).

A widely adopted method for inferring HMM parameters is the Baum-Welch algorithm (Baum et al., 1970). This algorithm is a special case of the expectation maximization (EM) algorithm (Dempster et al., 1977), grounded in maximum likelihood estimation (MLE). However, the E-M algorithm can require a large number of iterations for the parameter estimates to converge, which incurs a large computational cost (especially for large-scale time series data), and it can easily get trapped in local optima. To address these issues, particularly for large, high-dimensional time series, Hsu et al. (2012) proposed a spectral learning algorithm for HMMs (SHMM). This algorithm, based on the method of moments (MOM), offers attractive theoretical properties; however, the asymptotic error distribution of the algorithm was not well characterized. Later, Rodu (2014) improved and extended the spectral estimation algorithm to HMMs with high-dimensional, continuously distributed output, but again did not address the asymptotic error distribution. In this manuscript, we provide a theoretical discussion of the asymptotic error behavior of SHMM algorithms.

In addition to investigating the asymptotic error distribution, we provide a novel improvement to the SHMM family of algorithms. Our improvement is motivated by an extensive simulation study of the methods proposed in Hsu et al. (2012) and Rodu (2014), in which we found that spectral estimation does not provide stable results in low signal-to-noise ratio settings. We propose a new spectral estimation method, the projected SHMM (PSHMM), that leverages a novel regularization technique we call 'projection-onto-simplex' regularization. The PSHMM largely retains the computational advantages of SHMM methods without sacrificing accuracy.

Finally, we show how to adapt spectral estimation (including all SHMM and PSHMM approaches) to allow for online learning. We examine two approaches: the first speeds up the computation required for learning a model in large data settings, and the second incorporates "forgetfulness," which allows the model to adapt to changing dynamics of the data. This speed and flexibility are crucial, for instance, in high-frequency trading, and we show the effectiveness of the PSHMM on real data in this setting.

The structure of this paper is as follows: in the rest of this section, we introduce existing models. In Section 2, we provide theorems for the asymptotic properties of SHMMs. Section 3 introduces our new method, the PSHMM. In Section 4, we extend spectral estimation to online learning for both SHMM and PSHMM. Section 5 then presents results from simulation studies, and Section 6 shows an application to high-frequency trading. We provide extensive discussion in Section 7.

1.1 The hidden Markov model

The standard HMM (Baum and Petrie, 1966) is defined by a set of $S$ hidden categorical states $1, 2, \cdots, S$ that evolve according to a Markov chain. Let $h_t$ denote the hidden state at time $t$. The Markov chain is defined by two key components:

1. an initial probability $\pi_0 = [\pi_0^{(1)}, \cdots, \pi_0^{(S)}]$ where $h_1 \sim \mathrm{Multinomial}(\pi_0^{(1)}, \cdots, \pi_0^{(S)})$, and

2. a transition matrix $\mathbf{T} = [\mathbf{T}_{ij}]_{i=1,\cdots,S}^{j=1,\cdots,S}$ where $\mathbf{T}_{ij} = \mathrm{P}(h_{t+1} = j \mid h_t = i)$ for all $t$.

At each time point $t$, an observation $X_t$ is emitted from a distribution conditional on the value of the current hidden state, $X_t \mid h_t = s \sim \mathcal{F}_s$, where $\mathcal{F}_s$ is the emission distribution when the hidden state $h_t = s$. Figure 1 is a graphical representation of the standard HMM. When the emission distribution $\mathcal{F}_s$ is Gaussian for all states, the model is often called a Gaussian HMM (GHMM).

Figure 1: Model structure of the standard HMM. $\{h_t\}$ is a latent Markov chain that evolves according to transition matrix $\mathbf{T}$. For each time stamp $t$, the observed $X_t$ is generated according to the emission distribution associated with $h_t$.
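To make the generative process concrete, here is a minimal simulation sketch (our own illustration, not from the paper); the state count, dimensions, and parameter values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
S, p, T = 3, 2, 500                     # states, observation dim, length (illustrative)
pi0 = np.full(S, 1.0 / S)               # initial distribution pi_0
Tmat = np.full((S, S), 0.2)             # transition matrix T ...
np.fill_diagonal(Tmat, 0.6)             # ... sticky: 0.6 on the diagonal, rows sum to 1
means = rng.normal(size=(S, p))         # emission mean for each hidden state
sigma = 0.1                             # isotropic Gaussian emission noise

h = np.empty(T, dtype=int)
x = np.empty((T, p))
h[0] = rng.choice(S, p=pi0)                              # h_1 ~ Multinomial(pi_0)
x[0] = means[h[0]] + sigma * rng.normal(size=p)
for t in range(1, T):
    h[t] = rng.choice(S, p=Tmat[h[t - 1]])               # Markov transition
    x[t] = means[h[t]] + sigma * rng.normal(size=p)      # Gaussian emission F_s
```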

1.2 Spectral learning of HMM

Figure 2: Spectral estimation model of Rodu (2014). In addition to the latent state series $\{h_t\}_t$ and observed series $\{X_t\}_t$, Rodu (2014) introduced a reduced-dimensional series $\{Y_t = U^\top X_t\}$, a projection of $X_t$ onto a lower-dimensional subspace whose dimensionality equals the number of hidden states. Spectral estimation proceeds based on $\{Y_t\}_t$.

Figure 2 illustrates the model proposed by Rodu (2014). As in the standard HMM, $h_t$ denotes the hidden state at time $t$, and $X_t$ the emitted observation. To estimate the model, observations are projected onto a lower-dimensional space of dimensionality $d$. We denote these lower-dimensional projected observations by $y_t$, where $y_t = U^\top x_t$. We address the selection of the dimension $d$ and the method for projecting observations in Section 3.3.

The spectral estimation framework builds upon the observable operator model (Jaeger, 2000). In this model, the likelihood of a sequence of observations is expressed as:

$$\Pr(x_{1:t}) = 1^\top A(x_t)A(x_{t-1})\cdots A(x_1)\,\pi_0$$

with $A(x_j) = T\,\mathrm{diag}(\lambda(x_j))$, where $\lambda(x_j)$ is a vector with elements $[\lambda(x_j)]_i = \Pr(x_j \mid H_j = h_i)$. Note that this formulation of $A(x)$ still requires inferring $T$ and $\lambda(x_j)$. To address this, the spectral estimation framework employs a similarity transformation of $A(x)$. This transformed version can be learned directly from the data using the MOM.

Under the spectral estimation framework of Hsu et al. (2012), the likelihood can be written as

$$\begin{aligned}
\Pr(x_{1:t}) &= 1^\top\, A(x_t)\,\cdots\, A(x_1)\,\pi_0\\
&= \underbrace{1^\top S^{-1}}_{b_\infty^\top}\,\underbrace{S A(x_t) S^{-1}}_{B(x_t)}\, S\,\cdots\, S^{-1}\, S A(x_1) S^{-1}\,\underbrace{S\pi_0}_{b_1}\\
&\equiv b_\infty^\top\, B(x_t)\,\cdots\, B(x_1)\, b_1
\end{aligned}$$

where $S = U^\top M$ with $M = [M_1, \cdots, M_S]$ and column vector $M_i = \mathrm{E}(X \mid H = i)$. Letting $y_t = U^\top x_t$, Rodu (2014) shows that the likelihood can be expressed as

$$\Pr(x_{1:t}) = c_\infty^\top C(y_t)C(y_{t-1})\cdots C(y_1)c_1, \tag{1}$$

where

$$\begin{aligned}
& c_1 = \mu, \quad c_\infty^\top = \mu^\top\Sigma^{-1}, \quad C(y) = K(y)\Sigma^{-1},\\
& \mu = \mathrm{E}(y_1) = U^\top M\pi,\\
& \Sigma = \mathrm{E}(y_2 y_1^\top) = U^\top MT\,\mathrm{diag}(\pi)M^\top U,\\
& K(a) = \mathrm{E}(y_3 y_1^\top y_2^\top)a = U^\top MT\,\mathrm{diag}(M^\top Ua)T\,\mathrm{diag}(\pi)(M^\top U),\\
& M = [M_1, \cdots, M_S] \text{ where } M_i = \mathrm{E}(X \mid H = i).
\end{aligned}$$

These quantities can be empirically estimated as:

$$\widehat{\Pr}(x_{1:t}) = \hat{c}_\infty^\top\hat{C}(y_t)\hat{C}(y_{t-1})\cdots\hat{C}(y_1)\hat{c}_1, \tag{2}$$

where

$$\begin{aligned}
& \hat{c}_1 = \hat{\mu}, \quad \hat{c}_\infty^\top = \hat{\mu}^\top\hat{\Sigma}^{-1}, \quad \hat{C}(y) = \hat{K}(y)\hat{\Sigma}^{-1},\\
& \hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} Y_i,\\
& \hat{\Sigma} = \frac{1}{N}\sum_{i=1}^{N-1} Y_{i+1}Y_i^\top,\\
& \hat{K}(y) = \frac{1}{N}\sum_{i=1}^{N-2} Y_{i+2}Y_i^\top\cdot Y_{i+1}^\top y.
\end{aligned}\tag{3}$$

Prediction of $y_t$ is computed recursively by:

$$\hat{y}_t = \frac{\hat{C}(y_{t-1})\hat{y}_{t-1}}{\hat{c}_\infty^\top\hat{C}(y_{t-1})\hat{y}_{t-1}}; \tag{4}$$

the observation $x_t$ can then be recovered as:

$$\hat{x}_t \mid x_1, x_2, \cdots, x_{t-1} = U\hat{y}_t. \tag{5}$$
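For concreteness, the following NumPy sketch (our own; the function names are not from the paper) implements the moment estimates of Eq 3 and the recursive forecast of Eq 4 from a matrix `Y` of projected observations, one row per time step.

```python
import numpy as np

def shmm_moments(Y):
    """Moment estimates of Eq 3 from projected observations Y (N x d)."""
    N, _ = Y.shape
    mu = Y.mean(axis=0)
    Sigma = Y[1:].T @ Y[:-1] / N                                # E[y_{t+1} y_t^T]
    K = np.einsum('ni,nj,nk->ijk', Y[2:], Y[:-2], Y[1:-1]) / N  # third-moment tensor
    return mu, Sigma, K

def shmm_forecast(Y, mu, Sigma, K):
    """One-step-ahead recursive forecasts of Eq 4."""
    Sinv = np.linalg.inv(Sigma)
    c_inf = Sinv.T @ mu                                # c_inf^T = mu^T Sigma^{-1}
    y_hat = mu.copy()                                  # initialize with c_1 = mu
    preds = [y_hat]
    for y in Y[:-1]:
        C_y = np.einsum('ijk,k->ij', K, y) @ Sinv      # C(y) = K(y) Sigma^{-1}
        v = C_y @ y_hat
        y_hat = v / (c_inf @ v)                        # normalization of Eq 4
        preds.append(y_hat)
    return np.array(preds)                             # row t is the forecast of y_t
```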

In the above exposition of spectral likelihood estimation, moment estimation, and recursive forecasting, we assume a discrete output HMM. For a continuous output HMM, the spectral estimation of the likelihood is slightly different: we need a kernel function $G(x)$ to calculate $K$, so that $K(a) = U^\top MT\,\mathrm{diag}(M^\top UG(a))T\,\mathrm{diag}(\pi)(M^\top U)$.

$G(a)$ provides a link between the observations and their conditional probabilities in the continuous output scenario. Let $\tilde{H}_t$ be the probability vector associated with being in a particular state at time $t$, commonly called a belief state. Then $\mathrm{E}[Y_2 \mid \tilde{H}_2] = U^\top M\tilde{H}_2$. Therefore,

$$\mathrm{E}[Y_2 \mid \tilde{H}_1] = U^\top MT\tilde{H}_1.$$

Bayes' formula gives

$$\Pr(H_1 \mid X_1) = \frac{\Pr(X_1 \mid H_1)\Pr(H_1)}{\Pr(X_1)}.$$

Letting $[\lambda(x)]_i = \Pr(x_t \mid h_t = i)$, this implies that

$$\mathrm{E}[H_1 \mid X_1 = x] = \frac{\mathrm{diag}(\pi)\lambda(x)}{\pi^\top\lambda(x)}.$$

Then we get

$$\mathrm{E}[Y_2 \mid X_1 = x] = \frac{U^\top MT\,\mathrm{diag}(\pi)\lambda(x)}{\pi^\top\lambda(x)}.$$

Finally

$$\Sigma^{-1}\mathrm{E}[Y_2 \mid X_1 = x] = \frac{(M^\top U)^{-1}\lambda(x)}{\pi^\top\lambda(x)} \equiv G(x).$$

Using this definition of G(x)G(x) we have

$$\begin{aligned}
K(x) &= U^\top MT\,\mathrm{diag}(M^\top UG(x))T\,\mathrm{diag}(\pi)(M^\top U)\\
&= U^\top MT\,\mathrm{diag}\left(M^\top U\frac{(M^\top U)^{-1}\lambda(x)}{\pi^\top\lambda(x)}\right)T\,\mathrm{diag}(\pi)(M^\top U)\\
&= \frac{1}{\pi^\top\lambda(x)}U^\top MT\,\mathrm{diag}(\lambda(x))T\,\mathrm{diag}(\pi)(M^\top U).
\end{aligned}$$

Note from above that this is exactly what we want, up to a scaling constant that depends on $x$ (for more on $G(\cdot)$ see Rodu, 2014). In this paper, we use a linear kernel $G(a) = a$ for simplicity. The moment estimation and recursive forecasting for the continuous case are identical to the discrete case. See Rodu et al. (2013) for detailed derivations and mathematical proofs of these results.

In the following, the theoretical properties discussed in Section 2 are based on the likelihood (Eqs. 1-3), and the improvements to prediction in Section 3 are based on Eq 4. Note that deriving the prediction function from the likelihood is straightforward; see Rodu (2014) and Hsu et al. (2012) for a detailed derivation.

2 Theoretical Properties of SHMM

2.1 Assumptions for SHMMs

The SHMM framework (illustrated in Figure 2) relies on three primary assumptions:

• Markovian hidden states: The sequence of underlying hidden states $\{h_t\}$ follows a Markov process.

• Conditional independence: The observations $\{x_{1:T}\}$ are mutually independent given the hidden states.

• Invertibility condition: The matrix $\Sigma = \mathrm{E}(y_2 y_1^\top)$ is invertible.

The first two assumptions are standard for the HMM. The third assumption places constraints on the separability of the conditional distributions given the hidden states. It also requires that the underlying transition matrix of the HMM be full rank. While the full-rank condition is crucial for the standard SHMM we address in this paper, it is worth noting that some scenarios may involve non-full-rank HMMs; relaxations of this assumption have been explored in the literature for such cases (e.g., Siddiqi et al., 2009).

2.2 Asymptotic framework for SHMMs

SHMM relies on the MOM to estimate the likelihood, offering a computationally efficient approach. However, the asymptotic behavior of the SHMM has not been thoroughly explored in previous literature. Prior work by Hsu et al. (2012) and Rodu et al. (2013) established conditions under which the spectral estimator converges to the true likelihood almost surely, for discrete and continuous HMMs respectively. Specifically, they showed:

$$\widehat{\Pr}(x_{1:T}) = \hat{c}_\infty^\top\hat{C}(y_T)\hat{C}(y_{T-1})\cdots\hat{C}(y_1)\hat{c}_1 \xrightarrow{a.s.} \Pr(x_{1:T}). \tag{6}$$

Here, $c_1$, $C(y_t)$, and $c_\infty$ are estimated using independent and identically distributed (i.i.d.) triples $\{Y_i, Y_{i+1}, Y_{i+2}\}$, as defined in Eq 3. It is important to note that the use of i.i.d. triples is primarily a theoretical tool, used by both Hsu et al. (2012) and Rodu et al. (2013) to facilitate study of the theoretical properties of the SHMM. In practice, quantities should be estimated using the full sequence of observations, which introduces dependencies between the triples (see Section 3). The relationship between the effective sample size for estimation and the actual sample size is governed by the mixing rate of the underlying Markov chain.

In this paper, we extend these results by investigating the asymptotic distribution of the estimation error, $\widehat{\Pr}(x_{1:T}) - \Pr(x_{1:T})$. Theorem 1 presents a CLT-type result for this approximation error. We first identify the sources of error for the SHMM in Lemma 1, which makes use of what we refer to as the '$\Delta$' terms, defined through the following equations: $\widehat{\mu} = \mu + \widehat{\Delta\mu}$, $\widehat{\Sigma} = \Sigma + \widehat{\Delta\Sigma}$, $\widehat{K} = K + \widehat{\Delta K}$.

Lemma 1.
$$\widehat{\Pr}(x_{1:T}) = \Pr(x_{1:T}) + (v+\tilde{v})^\top\widehat{\Delta\mu} + \sum_{t=1}^{T} a_t^\top\widehat{\Delta K}(y_t)\,\tilde{a}_t - \sum_{t=0}^{T} b_t^\top\widehat{\Delta\Sigma}\,\tilde{b}_t + \mathcal{O}_p(N^{-1}),$$

where

$$\begin{aligned}
v &= \left(\mu^\top\Sigma^{-1}K(y_T)\cdots K(y_1)\Sigma^{-1}\right)^\top; &\quad \tilde{v} &= \Sigma^{-1}K(y_T)\cdots K(y_1)\Sigma^{-1}\mu;\\
a_t &= \left(\mu^\top\Sigma^{-1}K(y_T)\Sigma^{-1}\cdots K(y_{t+1})\Sigma^{-1}\right)^\top; &\quad \tilde{a}_t &= \Sigma^{-1}K(y_{t-1})\cdots K(y_1)\Sigma^{-1}\mu;\\
b_t &= \left(\mu^\top\Sigma^{-1}K(y_T)\Sigma^{-1}\cdots\Sigma^{-1}K(y_{t+1})\Sigma^{-1}\right)^\top; &\quad \tilde{b}_t &= \Sigma^{-1}K(y_t)\Sigma^{-1}\cdots K(y_1)\Sigma^{-1}\mu.
\end{aligned}$$

We provide a detailed proof of this lemma in the supplementary material. The basic strategy is to fully expand $\widehat{\Pr}(x_{1:T}) - \Pr(x_{1:T})$ after rewriting each estimated quantity as the true quantity plus an error term. We then categorize each summand by how many '$\Delta$' terms it contains. There are three categories: terms with zero '$\Delta$' terms (i.e., the true likelihood $\Pr(x_{1:T})$), terms with exactly one '$\Delta$' term (i.e., $(v+\tilde{v})^\top\widehat{\Delta\mu} + \sum_{t=1}^{T} a_t^\top\widehat{\Delta K}(y_t)\tilde{a}_t - \sum_{t=0}^{T} b_t^\top\widehat{\Delta\Sigma}\tilde{b}_t$), and all remaining terms, which involve at least two '$\Delta$' quantities and can be relegated to $\mathcal{O}_p(N^{-1})$.

Lemma 1 shows how the estimation error propagates to the likelihood approximation. We can leverage the fact that our moment estimators have a central limit theorem (CLT) property to obtain the desired results in Theorem 1.

We denote the outer product by $\otimes$, and define a "flattening" operator $\mathcal{F}(\cdot)$ for both matrices and 3-way tensors. For a matrix $A_{d\times d}$,

$$\mathcal{F}(A) = [A^{(1,1)}, A^{(1,2)}, \cdots, A^{(d,d)}]^\top;$$

for a tensor $B_{d\times d\times d}$,

$$\mathcal{F}(B) = [B^{(1,1,1)}, B^{(1,1,2)}, \cdots, B^{(1,1,d)}, B^{(1,2,1)}, B^{(1,2,2)}, \cdots, B^{(1,2,d)}, \cdots, B^{(1,d,d)}, \cdots, B^{(d,d,d)}]^\top.$$
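For intuition, $\mathcal{F}(\cdot)$ is simply row-major ("C-order") flattening, so in code it reduces to a reshape; a quick NumPy check (ours, not from the paper):

```python
import numpy as np

d = 3
A = np.arange(d * d).reshape(d, d)       # A[i, j] plays the role of A^{(i+1, j+1)}
B = np.arange(d ** 3).reshape(d, d, d)

FA = A.reshape(-1)   # [A^{(1,1)}, A^{(1,2)}, ..., A^{(d,d)}]: last index varies fastest
FB = B.reshape(-1)   # matches F(B) above: indices ordered (1,1,1), (1,1,2), ..., (d,d,d)
```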

We now state and prove our main theorem.

Theorem 1.
$$\sqrt{N}\left(\widehat{\Pr}(x_{1:T}) - \Pr(x_{1:T})\right) \xrightarrow{d} N\left(0,\; \beta^\top \mathrm{Cov}\left(\begin{bmatrix} Y_1\\ \mathcal{F}(Y_2\otimes Y_1)\\ \mathcal{F}(Y_3\otimes Y_1\otimes Y_2) \end{bmatrix}\right)\beta\right),$$

where

$$\beta = \left[(v+\tilde{v})^\top;\; -\Big(\sum_{t=0}^{T}\mathcal{F}(b_t\otimes\tilde{b}_t)\Big)^\top;\; \Big(\sum_{t=1}^{T}\mathcal{F}(a_t\otimes\tilde{a}_t\otimes y_t)\Big)^\top\right]^\top$$

and $v, \tilde{v}, a_t, \tilde{a}_t, b_t, \tilde{b}_t$ are defined as in Lemma 1.

Proof (Theorem 1).

We flatten $\widehat{\Delta\Sigma}$ and $\widehat{\Delta K}$ as

$$\begin{aligned}
\mathcal{F}(\widehat{\Delta\Sigma}) &= [\widehat{\Delta\Sigma}^{(1,1)}, \widehat{\Delta\Sigma}^{(1,2)}, \cdots, \widehat{\Delta\Sigma}^{(d,d)}]^\top,\\
\mathcal{F}(\widehat{\Delta K}) &= [\widehat{\Delta K}^{(1,1,1)}, \widehat{\Delta K}^{(1,1,2)}, \cdots, \widehat{\Delta K}^{(d,d,d)}]^\top.
\end{aligned}$$

Rewriting $a_t^\top\widehat{\Delta K}(y_t)\tilde{a}_t$ and $b_t^\top\widehat{\Delta\Sigma}\tilde{b}_t$ in Lemma 1 as

$$\begin{aligned}
a_t^\top\widehat{\Delta K}(y_t)\tilde{a}_t &= \mathcal{F}(a_t\otimes\tilde{a}_t\otimes y_t)^\top\cdot\mathcal{F}(\widehat{\Delta K}),\\
b_t^\top\widehat{\Delta\Sigma}\tilde{b}_t &= \mathcal{F}(b_t\otimes\tilde{b}_t)^\top\cdot\mathcal{F}(\widehat{\Delta\Sigma}),
\end{aligned}$$

we have

$$\begin{aligned}
&\widehat{\Pr}(x_{1:T}) - \Pr(x_{1:T}) - O_p(N^{-1})\\
&= \left[(v+\tilde{v})^\top;\; -\Big(\sum_{t=0}^{T}\mathcal{F}(b_t\otimes\tilde{b}_t)\Big)^\top;\; \Big(\sum_{t=1}^{T}\mathcal{F}(a_t\otimes\tilde{a}_t\otimes y_t)\Big)^\top\right]\cdot\begin{bmatrix}\widehat{\Delta\mu}\\ \mathcal{F}(\widehat{\Delta\Sigma})\\ \mathcal{F}(\widehat{\Delta K})\end{bmatrix}\\
&= \beta^\top\cdot\widehat{\Delta\theta}.
\end{aligned}$$

Since the CLT applies separately to $\widehat{\Delta\mu}$, $\widehat{\Delta\Sigma}$, and $\widehat{\Delta K}$, we have

$$\sqrt{N}\,\widehat{\Delta\theta} = \sqrt{N}\begin{bmatrix}\frac{1}{N}\sum_{i=1}^{N} Y_{i,1} - \mu\\ \mathcal{F}\big(\frac{1}{N}\sum_{i=1}^{N} Y_{i,2}\otimes Y_{i,1} - \Sigma\big)\\ \mathcal{F}\big(\frac{1}{N}\sum_{i=1}^{N} Y_{i,3}\otimes Y_{i,1}\otimes Y_{i,2} - K\big)\end{bmatrix} \xrightarrow{d} MVN\left(\vec{0},\; \mathrm{Cov}\left(\begin{bmatrix} Y_1\\ \mathcal{F}(Y_2\otimes Y_1)\\ \mathcal{F}(Y_3\otimes Y_1\otimes Y_2)\end{bmatrix}\right)\right).$$

Therefore,

$$\sqrt{N}\left(\widehat{\Pr}(x_{1:T}) - \Pr(x_{1:T})\right) \xrightarrow{d} N\left(0,\; \beta^\top \mathrm{Cov}\left(\begin{bmatrix} Y_1\\ \mathcal{F}(Y_2\otimes Y_1)\\ \mathcal{F}(Y_3\otimes Y_1\otimes Y_2)\end{bmatrix}\right)\beta\right).$$

Figure 3: Convergence of the SHMM estimation error to the theoretical distribution. Empirical histograms of $\widehat{\Pr}(x_{1:T}) - \Pr(x_{1:T})$ are shown for different training sizes $N$ and sequence lengths $T$, alongside the theoretical density derived from Theorem 1. Each subfigure corresponds to a different $N$. (1) As $N$ increases, the empirical distribution converges to the theoretical normal distribution. (2) Convergence is faster for smaller values of $T$.

2.3 Experimental Validation of Theorem 1.

We conducted a series of experiments to validate Theorem 1. We simulated a target series $x_{1:T}$ with $x_i\in\mathbb{R}^3$ and $T\in\{30, 100\}$ using a 3-state GHMM, where the initial probabilities and sticky transition probabilities are as described in Section 5.1.1. We set the discrete emission probability matrix to $[[0.8, 0.1, 0.1]^\top, [0.1, 0.8, 0.1]^\top, [0.1, 0.1, 0.8]^\top]^\top$. We estimated parameters $\widehat{\mu}$, $\widehat{\Sigma}$, and $\widehat{K}$ using training samples generated from the same model as the target series. Specifically, $N$ i.i.d. samples $Y_1^{(\mu)}$ were used to estimate $\widehat{\mu}$, $N$ i.i.d. samples of $(Y_1^{(\Sigma)}, Y_2^{(\Sigma)})$ to estimate $\widehat{\Sigma}$, and $N$ i.i.d. samples of $(Y_1^{(K)}, Y_2^{(K)}, Y_3^{(K)})$ to estimate $\widehat{K}$. We used training sets of size $N = 5000, 10000, 50000, 100000$, replicating the experiment 1000 times for each $N$. For each replication $r$, we estimated $\widehat{\Pr}(x_{1:T})^{(N,r)}$ and constructed a histogram of the estimation errors $\{\widehat{\Pr}(x_{1:T})^{(N,r)} - \Pr(x_{1:T})^{(N,r)}\}_r$. We compared this histogram to the theoretical probability density function (pdf), which by Theorem 1 should converge to a normal distribution as $N$ increases.

In Figure 3 we see that, as $N$ increases, the distribution of the estimated likelihood converges to the normal distribution, and with a shorter length $T$, the error converges faster. We also analyzed the asymptotic behavior of the first-order estimation errors separately for the first moment, $(v+\tilde{v})^\top\widehat{\Delta\mu}$, second moment, $\sum_{t=0}^{T} b_t^\top\widehat{\Delta\Sigma}\tilde{b}_t$, and third moment, $\sum_{t=1}^{T} a_t^\top\widehat{\Delta K}(y_t)\tilde{a}_t$ (see Figure 4). We found that the third moment estimation error has the largest effect, as shown in Table 1. We further analyzed the asymptotic distribution of the Frobenius norm of the first, second, and third moment estimation errors. Figure 5 shows the empirical histograms and their corresponding theoretical pdfs.

(a) Likelihood estimation error from first-order first moment estimation.
(b) Likelihood estimation error from first-order second moment estimation.
(c) Likelihood estimation error from first-order third moment estimation.
Figure 4: Histograms of the first-order error from the first, second, and third moment estimation errors (i.e., $(v+\tilde{v})^\top\widehat{\Delta\mu}$, $\sum_{t=0}^{T} b_t^\top\widehat{\Delta\Sigma}\tilde{b}_t$, and $\sum_{t=1}^{T} a_t^\top\widehat{\Delta K}(y_t)\tilde{a}_t$) under different training sizes $N$ with fixed length $T=30$, vs. the theoretical pdf calculated based on Theorem 1 (red line). Each subfigure is associated with a different $N$. As $N$ increases, the distribution converges to the theoretical normal distribution.
(a) Frobenius norm of first moment estimation error.
(b) Frobenius norm of second moment estimation error.
(c) Frobenius norm of third moment estimation error.
Figure 5: Histograms of the Frobenius norm of the first, second, and third moment estimation errors (i.e., for $\mu$, $\Sigma$, and $K$) under different training sizes $N$ vs. the theoretical pdf (red line). Here the red line is the theoretical Chi-squared density. Each subfigure is associated with a different $N$. As $N$ increases, the distribution converges to the theoretical distribution.

We note two sources of error in estimating the likelihood. The first is the typical CLT-type error in parameter estimation, which decreases as $N$ increases. The second is the propagation of estimation errors during forward recursion. Because of this second issue, $N$ must be much greater than $T$ to achieve a stable normal distribution. In our simulation with $T=100$, we observe reasonable evidence of asymptotic normality at $N=10000$, but not at $N=5000$. When $N=T$ or $N$ is only slightly larger than $T$, the distribution is heavy-tailed, suggesting that regularization could be beneficial.

For simplicity of presentation, we derive Theorem 1 under the assumption that the output distribution is discrete. For continuous output, the CLT still holds with a proper kernel function $G(\cdot)$ as mentioned in Section 1.2.

| Error source | Mathematical expression | Theoretical std | Variance explained ratio |
| --- | --- | --- | --- |
| 1st moment estimation | $(v+\tilde{v})^\top\widehat{\Delta\mu}$ | $1.09\times 10^{-12}/\sqrt{N}$ | $0.43\%$ |
| 2nd moment estimation | $\sum_{t=0}^{T} b_t^\top\widehat{\Delta\Sigma}\tilde{b}_t$ | $8.98\times 10^{-12}/\sqrt{N}$ | $29.21\%$ |
| 3rd moment estimation | $\sum_{t=1}^{T} a_t^\top\widehat{\Delta K}(y_t)\tilde{a}_t$ | $1.39\times 10^{-11}/\sqrt{N}$ | $70.36\%$ |

(a) $T=30$.

| Error source | Mathematical expression | Theoretical std | Variance explained ratio |
| --- | --- | --- | --- |
| 1st moment estimation | $(v+\tilde{v})^\top\widehat{\Delta\mu}$ | $2.66\times 10^{-49}/\sqrt{N}$ | $0.30\%$ |
| 2nd moment estimation | $\sum_{t=0}^{T} b_t^\top\widehat{\Delta\Sigma}\tilde{b}_t$ | $7.75\times 10^{-48}/\sqrt{N}$ | $25.37\%$ |
| 3rd moment estimation | $\sum_{t=1}^{T} a_t^\top\widehat{\Delta K}(y_t)\tilde{a}_t$ | $1.33\times 10^{-47}/\sqrt{N}$ | $74.59\%$ |

(b) $T=100$.

Table 1: Theoretical variances for the first, second, and third moment estimation errors based on simulated data with different $T$.

3 Projected SHMM

3.1 Motivation for Adding Projection

The Baum-Welch algorithm (Baum et al., 1970) predicts belief probabilities, or weights, for each underlying hidden state. Let $\hat{w}_t$ denote the predicted weight vector at time $t$. The prediction can be expressed as a weighted combination of cluster means: $\hat{y}_t = M\hat{w}_t$, where $\|\hat{w}_t\|_1 = 1$. In the Baum-Welch algorithm, these weights are explicitly guaranteed to be non-negative and sum to 1 during forward propagation, which is consistent with their interpretation as probabilities.

However, SHMM does not directly estimate the belief probabilities, and these constraints are not explicitly enforced in spectral estimation. Consequently, SHMM can sometimes produce predictions that are far from the polyhedron spanned by the cluster means. To address this issue, we propose the projected SHMM, where projection serves to regularize the predictions, constraining them within a reasonable range.

This is particularly important when $N$ is not sufficiently large, as extreme deviations of the estimated likelihood from the true likelihood can occur when error is propagated over time. Regularization can help stabilize the likelihood estimate by limiting this propagation of error.

3.2 Projection Methods for SHMM

We propose two projection methods for SHMM: projection-onto-polyhedron and projection-onto-simplex. Projection-onto-polyhedron directly addresses the motivation for using projections in SHMM but is computationally expensive. As a computationally efficient alternative, we propose projection-onto-simplex.

3.2.1 Projection-onto-Polyhedron

Projection-onto-polyhedron SHMM first generates the prediction $\hat{y}_t^{(SHMM)}$ through the standard SHMM and then projects it onto the polyhedron with vertices $\widehat{M}$. Mathematically, we replace the recursive forecasting in Eq 4 with

$$\begin{aligned}
\hat{y}_t^{(SHMM)} &= \frac{\hat{C}(y_{t-1})\hat{y}_{t-1}}{\hat{c}_\infty^\top\hat{C}(y_{t-1})\hat{y}_{t-1}};\\
\hat{y}_t &= \operatorname*{arg\,min}_{y\in Poly(\widehat{M})} d(y, \hat{y}_t^{(SHMM)}),
\end{aligned}\tag{7}$$

where $d(\cdot,\cdot)$ is a distance function (e.g., Euclidean distance), and

$$Poly(\widehat{M}) = \{y = \widehat{M}w \mid w \text{ is on the simplex}\}$$

is the polyhedron with vertices $\widehat{M}$. This results in a convex optimization problem if the distance is convex. While solvable using standard methods such as the Newton-Raphson algorithm (Boyd et al., 2004) with log-barrier methods (Frisch, 1955), to the best of our knowledge finding an efficient solution for projection-onto-polyhedron is challenging. The optimization must be performed at every time step, creating a trade-off between accuracy and computational speed. Because speed over the Baum-Welch algorithm is one of the distinct advantages of SHMM, any modification should not result in a substantial increase in computational time.

3.2.2 Projection-onto-Simplex

To improve computational efficiency, we propose projection-onto-simplex. To avoid projecting onto a polyhedron, we leverage the fact that $\hat{y}_t = \widehat{M}\hat{w}_t$ and optimize over $\hat{w}_t$ on the simplex:

$$\hat{w}_t = \operatorname*{arg\,min}_{w\in Simplex} \|w - \widehat{M}^{-1}\hat{y}_t^{(SHMM)}\|_2^2. \tag{8}$$

This allows us to calculate the projection with time complexity $\mathcal{O}(d\log(d))$ (Wang and Carreira-Perpinán, 2013). This approach, while not equivalent to the solution from projection-onto-polyhedron (in general, $d(a,b)\neq d(Aa,Ab)$), still ensures that predictions are constrained to the same polyhedron. Importantly, it can be solved in closed form (Algorithm 1), avoiding iterative optimization and yielding faster estimation. Figure 6 illustrates the projection-onto-polyhedron and projection-onto-simplex methods.

Input: $u = [u_1, u_2, \cdots, u_d]^\top$
Sort $u$ into $z$: $z_1 \geq z_2 \geq \cdots \geq z_d$;
Find $\rho = \max\{1\leq i\leq d : z_i + \frac{1}{i}(1-\sum_{j=1}^{i} z_j) > 0\}$;
Define $\lambda = \frac{1}{\rho}(1-\sum_{j=1}^{\rho} z_j)$;
Solve $u^{(proj)}$, s.t. $u_i^{(proj)} = \max(u_i+\lambda, 0)$, $i = 1, \cdots, d$;
Output: $u^{(proj)}$
Algorithm 1: Projection-onto-simplex (Wang and Carreira-Perpinán, 2013).
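A direct NumPy translation of Algorithm 1 might look as follows (a sketch; the function name `project_to_simplex` is our own):

```python
import numpy as np

def project_to_simplex(u):
    """Euclidean projection of u onto the probability simplex (Algorithm 1)."""
    z = np.sort(u)[::-1]                            # z_1 >= z_2 >= ... >= z_d
    cumsum = np.cumsum(z)
    idx = np.arange(1, len(u) + 1)
    rho = idx[z + (1.0 - cumsum) / idx > 0][-1]     # largest feasible index rho
    lam = (1.0 - cumsum[rho - 1]) / rho
    return np.maximum(u + lam, 0.0)
```

For example, `project_to_simplex(np.array([0.7, 0.7, -0.2]))` returns `[0.5, 0.5, 0.0]`, which is non-negative and sums to one.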

The full projected SHMM algorithm is shown in Algorithm 2. Steps 1-3 are identical to the standard SHMM. Steps 4-5 estimate $\hat{M}$ by a Gaussian Mixture Model (GMM) (McLachlan and Basford, 1988), calculate the weight process $\{w_t\}$, and apply SHMM to the weight process. Step 6 applies projection-onto-simplex during the recursive forecasting. Step 7 projects the data back into the original space.

(a) Projection-onto-polyhedron.
(b) Projection-onto-simplex.
Figure 6: The left figure shows the projection-onto-polyhedron step, and the right shows projection-onto-simplex. In both methods, we project the predicted values (blue points) into the constrained regions (defined with a red boundary), a polyhedron (left) or simplex (right).
Input: $\{x_t\}$, where $t = 1, \cdots, T$
Output: $\hat{x}_{T+1}$
Step 1: Compute $\hat{E}[x_{t+1}\otimes x_t] = \frac{1}{T-2}\sum_{t=1}^{T-2} x_{t+1}x_t^\top$;
Step 2: Obtain $\hat{U}$ by extracting the first $d$ left singular vectors of $\hat{E}[x_{t+1}\otimes x_t]$;
Step 3: Reduce dimensionality: $y_t = \hat{U}^\top x_t$;
Step 4: Estimate cluster means by GMM, obtaining $\hat{M}$, whose columns are the mean vectors of the clusters. The weight vectors are then $w_t = \hat{M}^{-1}y_t$ for $t = 1, \cdots, T$;
Step 5: Calculate $\hat{\mu} = \frac{1}{T}\sum_{t=1}^{T} w_t$, $\hat{\Sigma} = \frac{1}{T-1}\sum_{t=1}^{T-1} w_{t+1}w_t^\top$, and $\hat{K} = \frac{1}{T-2}\sum_{t=1}^{T-2} w_{t+2}\otimes w_t\otimes w_{t+1}$. Set $\hat{c}_1 = \hat{\mu}$, $\hat{c}_\infty^\top = \hat{c}_1^\top\hat{\Sigma}^{-1}$, and $\hat{C}(w_t) = \hat{K}(w_t)\hat{\Sigma}^{-1}$;
Step 6: Recursive prediction with projection-onto-simplex: $\hat{w}_t = Proj\left(\frac{\hat{C}(w_{t-1})\hat{w}_{t-1}}{\hat{c}_\infty^\top\hat{C}(w_{t-1})\hat{w}_{t-1}}\right)$ for $t = 2, \cdots, T+1$, where $Proj(a) = \operatorname*{arg\,min}_{w\in Simplex}\|w-a\|_2^2$ can be solved by Algorithm 1, and set $\hat{w}_1 = \hat{c}_1$;
Step 7: $\hat{x}_{T+1} = \hat{U}\hat{y}_{T+1} = \hat{U}\hat{M}\hat{w}_{T+1}$;
Algorithm 2: Projection-onto-simplex SHMM.
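Pulling the steps together, a compact sketch of Algorithm 2 might look as follows; it reuses `shmm_moments` and `project_to_simplex` from the earlier sketches, uses scikit-learn's `GaussianMixture` for Step 4, and is a simplified illustration rather than the authors' implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def pshmm_fit_predict(X, d):
    """One-step-ahead forecast x_hat_{T+1} following Algorithm 2 (a sketch)."""
    T = X.shape[0]
    # Steps 1-2: U from the top-d left singular vectors of the bigram matrix.
    bigram = X[1:].T @ X[:-1] / (T - 2)
    U = np.linalg.svd(bigram)[0][:, :d]
    # Step 3: reduce dimensionality.
    Y = X @ U
    # Step 4: cluster means via GMM; weights w_t = M^{-1} y_t.
    M = GaussianMixture(n_components=d, random_state=0).fit(Y).means_.T
    W = Y @ np.linalg.inv(M).T
    # Step 5: moments of the weight process.
    mu, Sigma, K = shmm_moments(W)
    Sinv = np.linalg.inv(Sigma)
    c_inf = Sinv.T @ mu
    # Step 6: recursive prediction with projection-onto-simplex.
    w_hat = mu.copy()                                  # w_hat_1 = c_1
    for w in W:
        C_w = np.einsum('ijk,k->ij', K, w) @ Sinv
        v = C_w @ w_hat
        w_hat = project_to_simplex(v / (c_inf @ v))
    # Step 7: map the final weight vector back to the original space.
    return U @ (M @ w_hat)
```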

3.2.3 Bias-variance tradeoff

In PSHMM, we leverage a GMM to provide the projection boundaries, which introduces bias, since the hidden state means estimated by the GMM are biased when we ignore time-dependency information. In addition, either projection method ('projection-onto-polyhedron' or 'projection-onto-simplex') can introduce bias, since the projection is not necessarily orthogonal due to the optimization constraints. However, adding such a projection substantially reduces the variance. That is, there is a bias-variance tradeoff.

3.3 Considerations for PSHMM Implementation

This section addresses two key aspects of PSHMM implementation: choosing the dimensionality of the projection space and computing the projection matrix $U$.

3.3.1 Choice of projection space dimensionality

The hyperparameter $d$ determines the dimensionality of the projection space. Theoretically, $d$ should equal the number of states in the HMM. Our simulations confirm that estimation and prediction performance improves when $d$ matches the true number of underlying states. However, in practice, the number of hidden states is often unknown. We recommend either choosing $d$ based on prior knowledge or tuning it empirically when strong prior beliefs are absent.

3.3.2 Computation of the projection matrix $U$

The projection matrix $U$ is typically constructed from the first $d$ left singular vectors of the singular value decomposition (SVD) (Eckart and Young, 1936) of the bigram covariance matrix $\widehat{\Sigma} = \widehat{\mathrm{E}}[X_2\otimes X_1]$. This approach encodes transition information while eliminating in-cluster covariance structure. Alternatively, $U$ could be estimated through an SVD of $\widehat{\mathrm{E}}[X_1\otimes X_1]$, which would incorporate both covariance structure and cluster mean information. We generally recommend the bigram matrix approach.

For extremely high-dimensional data, we propose a fast approximation algorithm based on randomized SVD (Halko et al., 2011). This method avoids computing the full covariance matrix $\hat{E}[x_{t+1}\otimes x_t]$, reducing the time complexity from $\mathcal{O}(Tp^2 + p^3)$ to $\mathcal{O}(pT\log(d) + (p+T)d^2)$ for $p\gg d$, where $T$ and $p$ are the sample size and data dimensionality, respectively. The algorithm proceeds as follows:

1. Noting that $\hat{E}[x_{t+1}\otimes x_t] = \frac{1}{T-2}\sum_{t=1}^{T-2} x_{t+1}x_t^\top = \frac{1}{T-2}X_2^\top X_1$, where $X_2 = [x_2, \cdots, x_T]^\top$ and $X_1 = [x_1, \cdots, x_{T-1}]^\top$, perform a randomized SVD of $X_1$ and $X_2$ separately to obtain two rank-$\tilde{d}$ decompositions with $d\leq\tilde{d}\ll p$: $X_1\approx U_1\Sigma_1 V_1^\top$, $X_2\approx U_2\Sigma_2 V_2^\top$. Now $X_2^\top X_1\approx V_2(\Sigma_2 U_2^\top U_1\Sigma_1)V_1^\top$, where $(\Sigma_2 U_2^\top U_1\Sigma_1)$ has dimension $\tilde{d}\times\tilde{d}$.

2. Perform an SVD on $(\Sigma_2 U_2^\top U_1\Sigma_1)$ to get $\Sigma_2 U_2^\top U_1\Sigma_1 = \tilde{U}\tilde{\Sigma}\tilde{V}^\top$.

3. Reconstruct $\hat{E}[x_{t+1}\otimes x_t]\approx (V_2\tilde{U})(\frac{1}{T-2}\tilde{\Sigma})(V_1\tilde{V})^\top$.

$V_2\tilde{U}$ and $V_1\tilde{V}$ are orthonormal matrices and $\frac{1}{T-2}\tilde{\Sigma}$ is a diagonal matrix, so this is the rank-$\tilde{d}$ SVD of $\hat{E}[x_{t+1}\otimes x_t]$. The first $d$ columns of $V_2\tilde{U}$ are an estimate of the $\hat{U}$ matrix computed in Steps 1 and 2 of Algorithm 2.
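A sketch of this procedure using scikit-learn's `randomized_svd` (the function name `fast_bigram_svd` and the default $\tilde{d} = 2d$ are our own assumptions):

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

def fast_bigram_svd(X, d, d_tilde=None):
    """Approximate the top-d left singular vectors of E[x_{t+1} (x) x_t]
    without forming the p x p bigram matrix (a sketch of the steps above)."""
    d_tilde = d_tilde or 2 * d
    X1, X2 = X[:-1], X[1:]                     # rows x_1..x_{T-1} and x_2..x_T
    U1, s1, V1t = randomized_svd(X1, n_components=d_tilde, random_state=0)
    U2, s2, V2t = randomized_svd(X2, n_components=d_tilde, random_state=0)
    core = (s2[:, None] * (U2.T @ U1)) * s1[None, :]   # Sigma_2 U_2^T U_1 Sigma_1
    U_tilde, _, _ = np.linalg.svd(core)                # SVD of the small core
    left = V2t.T @ U_tilde                     # columns approximate V_2 U-tilde
    return left[:, :d]                         # estimate of U-hat in Algorithm 2
```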

4 Online Learning

Online learning is a natural extension for SHMMs, given their computational efficiency advantages. This section introduces online learning approaches for both SHMM and PSHMM, addressing scenarios where traditional batch learning may be unsuitable. For example, in quantitative trading, markets often exhibit regime-switching behavior, so the observed returns of some financial products are likely nonstationary. In high-frequency trading, especially at the second or minute level, the delay from frequent offline re-training of the statistical model could impact the strategy and trading speed (Lahmiri and Bekiros, 2021).

4.1 Online Learning of SHMM and PSHMM

We illustrate an online learning strategy that allows for adaptive parameter estimation. Let $\hat{\mu}$, $\hat{\Sigma}$, and $\hat{K}$ be the estimated moments based on $T$ data points. Upon receiving new data $Y_{T+1}$, we update the moments recursively:

$$\begin{aligned}
\hat{\mu} &\leftarrow \frac{T\hat{\mu} + Y_{T+1}}{T+1};\\
\hat{\Sigma} &\leftarrow \frac{(T-1)\hat{\Sigma} + Y_{T+1}\otimes Y_T}{T};\\
\hat{K} &\leftarrow \frac{(T-2)\hat{K} + Y_{T+1}\otimes Y_{T-1}\otimes Y_T}{T-1};\\
T &\leftarrow T+1.
\end{aligned}\tag{9}$$
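In code, the updates of Eq 9 are only a few lines; the following sketch (names ours) updates the moments on arrival of the projected observation $Y_{T+1}$, given the two previous ones.

```python
import numpy as np

def online_update(mu, Sigma, K, T, y_next, y_curr, y_prev):
    """Recursive moment updates of Eq 9; y_next, y_curr, y_prev are
    the projected observations Y_{T+1}, Y_T, and Y_{T-1}."""
    mu = (T * mu + y_next) / (T + 1)
    Sigma = ((T - 1) * Sigma + np.outer(y_next, y_curr)) / T
    K = ((T - 2) * K + np.einsum('i,j,k->ijk', y_next, y_prev, y_curr)) / (T - 1)
    return mu, Sigma, K, T + 1
```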

The update strategy applies to both SHMM and PSHMM. Algorithm 3 provides pseudo-code for online learning in PSHMM.

Input: $\{x_t\}_{t=1,\cdots,T}$, the warm-up length $T_{warmup}$
Output: $\{\hat{x}_t\}_{t=T_{warmup}+1}^{T+1}$, yielded sequentially.
Step 1: Compute $\hat{E}[x_{t+1}\otimes x_t]^{(warmup)} = \frac{1}{T_{warmup}-2}\sum_{t=1}^{T_{warmup}-2} x_{t+1}x_t^\top$;
Step 2: Obtain $\hat{U}$ by extracting the first $d$ left singular vectors of $\hat{E}[x_{t+1}\otimes x_t]^{(warmup)}$;
Step 3: Reduce dimensionality: $y_t = \hat{U}^\top x_t$ for $t = 1, \cdots, T_{warmup}$;
Step 4: Estimate cluster means by GMM on the data $\{y_t\}_{t=1}^{T_{warmup}}$, obtaining $\hat{M}$, whose columns are the mean vectors of the clusters. The weight vectors are then $w_t = \hat{M}^{-1}y_t$ for $t = 1, \cdots, T_{warmup}$;
Step 5: Calculate $\hat{\mu}$, $\hat{\Sigma}$, $\hat{K}$, $\hat{c}_1$, $\hat{c}_\infty^\top$, and $\hat{C}(\cdot)$ as described in Step 5 of Algorithm 2;
Step 6: Recursive prediction with projection-onto-simplex $\hat{w}_t$ for $t = 1, \cdots, T_{warmup}+1$ as described in Step 6 of Algorithm 2. Yield $\hat{x}_{T_{warmup}+1} = \hat{U}\hat{M}\hat{w}_{T_{warmup}+1}$;
Step 7 (online learning and prediction):
for $t \leftarrow T_{warmup}+1$ to $T$ do
      $w_t = \hat{M}^{-1}\hat{U}^\top x_t$;
      Update $\hat{\mu}$, $\hat{\Sigma}$, and $\hat{K}$ according to Eq 9;
      Predict $\hat{w}_{t+1} = Proj\left(\frac{\hat{C}(w_t)\hat{w}_t}{\hat{c}_\infty^\top\hat{C}(w_t)\hat{w}_t}\right)$, where $\hat{C}(\cdot)$ and $\hat{c}_\infty^\top$ are based on the updated $\hat{\mu}$, $\hat{\Sigma}$, and $\hat{K}$, and $Proj(\cdot)$ is solved by Algorithm 1;
      Yield $\hat{x}_{t+1} = \hat{U}\hat{M}\hat{w}_{t+1}$;
end for
Algorithm 3: Online learning projection-onto-simplex SHMM.
Updating the GMM used in PSHMM

For PSHMM, updating the Gaussian Mixture Model (GMM) requires careful consideration. We recommend updating the GMM without changing cluster membership: for example, classify a new input into a particular cluster and update that cluster's mean and covariance. We advise against re-estimating the GMM from scratch, which could add or remove clusters, as this would alter the relationship between the dimensionality of $y$ and the number of hidden states. In practice, we find PSHMM performs well even without GMM updates.

4.2 Online Learning of SHMM Class with Forgetfulness

To handle nonstationary data, we introduce a forgetting mechanism into the online parameter estimation. This approach requires specifying a decay factor $\gamma$, which determines the rate at which old information is discarded. The updating rule becomes:

$$\begin{aligned}
\hat{\mu} &\leftarrow \frac{(1-\gamma)\tilde{T}\hat{\mu} + Y_{T+1}}{(1-\gamma)\tilde{T}+1};\\
\hat{\Sigma} &\leftarrow \frac{(1-\gamma)\tilde{T}\hat{\Sigma} + Y_{T+1}\otimes Y_T}{(1-\gamma)\tilde{T}+1};\\
\hat{K} &\leftarrow \frac{(1-\gamma)\tilde{T}\hat{K} + Y_{T+1}\otimes Y_{T-1}\otimes Y_T}{(1-\gamma)\tilde{T}+1};\\
\tilde{T} &\leftarrow \tilde{T}(1-\gamma)+1.
\end{aligned}\tag{10}$$

Here $\tilde{T} = \sum_{i=1}^{T}(1-\gamma)^{i-1}$ serves as an effective sample size. This strategy is equivalent to calculating an exponentially weighted moving average for each parameter:

$$\hat{\mu} = \frac{\sum_{t=1}^{T}(1-\gamma)^{T-t}Y_t}{\sum_{t=1}^{T}(1-\gamma)^{T-t}};\quad \hat{\Sigma} = \frac{\sum_{t=2}^{T}(1-\gamma)^{T-t}Y_t\otimes Y_{t-1}}{\sum_{t=2}^{T}(1-\gamma)^{T-t}};\quad \hat{K} = \frac{\sum_{t=3}^{T}(1-\gamma)^{T-t}Y_t\otimes Y_{t-2}\otimes Y_{t-1}}{\sum_{t=3}^{T}(1-\gamma)^{T-t}}. \tag{11}$$
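The forgetful update of Eq 10 mirrors the online-update sketch above, replacing the raw count with the effective sample size $\tilde{T}$ (again, the names are our own):

```python
import numpy as np

def forgetful_update(mu, Sigma, K, T_eff, y_next, y_curr, y_prev, gamma=0.05):
    """Exponentially-forgetting moment updates of Eq 10 with decay factor gamma."""
    w = (1 - gamma) * T_eff                    # down-weighted effective sample size
    mu = (w * mu + y_next) / (w + 1)
    Sigma = (w * Sigma + np.outer(y_next, y_curr)) / (w + 1)
    K = (w * K + np.einsum('i,j,k->ijk', y_next, y_prev, y_curr)) / (w + 1)
    return mu, Sigma, K, w + 1                 # T_tilde <- (1 - gamma) T_tilde + 1
```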

5 Simulations

This section presents simulation studies to evaluate the performance of SHMM and PSHMM under various conditions. We assess robustness to different signal-to-noise ratios, misspecified models, and heavy-tailed data. We also test the effectiveness of the forgetfulness mechanism and compare computational efficiency.

5.1 Tests of Robustness under Different Signal-to-Noise Ratios, Mis-specified Models, and Heavy-Tailed Data

5.1.1 Experiment setting

We simulated $100$-dimensional data with $10000$ training points and $100$ test points using various GHMM configurations. Each simulation was repeated $100$ times. We compared SHMM, three variants of PSHMM (projection-onto-polyhedron, projection-onto-simplex, and online learning with projection-onto-simplex), and the E-M algorithm. Prediction performance was evaluated using the average $R^2$ over all repetitions. The online learning variant of PSHMM used $1000$ training samples for the initial estimate (warm-up) and incorporated the remaining $9000$ training samples using online updates. We also calculated an oracle $R^2$, assuming knowledge of all HMM parameters but not the specific hidden states.

In our simulations, online training for PSHMM differs from offline training for two reasons. First, the estimates of $\hat{U}$ and $\hat{M}$ are based only on the warm-up set for online learning (as is the case for the online version of SHMM), but on the entire training set for offline learning. Second, during the training period of online PSHMM, the updated moments are based on the recursive predictions of $\hat{w}_t$, which are themselves based on the weights $\{w_s\}_{s=1}^{t-1}$. In contrast, for offline PSHMM, the weights used to calculate the moments are based on the GMM estimates from the entire training dataset.

For each state, we made two assumptions about the emission distribution. First, it has a one-hot mean vector, that of the $i$-th state being $[\mathrm{1}\{i=j\}]_{j=1}^{p}$, where $\mathrm{1}\{\cdot\}$ is the indicator function. Second, it has a diagonal covariance matrix. We tested the SHMM, the PSHMM methods, and the E-M algorithm under two types of transition matrix, five signal-to-noise ratios, and several emission distributions, as follows:

• Transition matrix:

    • Sticky: diagonal elements are $0.6$, off-diagonal elements are $\frac{0.4}{S-1}$, where $S$ is the number of states that generated the data;

    • Non-sticky: diagonal elements are $0.4$, off-diagonal elements are $\frac{0.6}{S-1}$.

• Signal-to-noise ratio: covariance specified by $\sigma^2 I_p$, where $\sigma\in\{0.01, 0.05, 0.1, 0.5, 1.0\}$.

• Emission distributions:

    • Gaussian distribution, generated according to the mean vector and covariance matrix;

    • $t$ distribution: generated as a random vector of i.i.d. $t_5$, $t_{10}$, $t_{15}$, or $t_{20}$ entries, then multiplied by the covariance matrix and shifted by the mean vector.

5.1.2 Simulation Results

Figure 7: Comparative performance of SHMM, PSHMM, and E-M algorithms under various experimental conditions. Left column: results for sticky transitions. Right column: results for non-sticky transitions. Row 1: effect of noise level $\sigma$ on algorithm performance (5-state GHMM). Row 2: impact of model misspecification, varying dimensionality $d$ (5-state GHMM, $\sigma=0.05$). Row 3: performance under non-Gaussian emissions (5-state HMM with $t$-distributed emissions, $\sigma=0.05$, varying degrees of freedom). Y-axis: average $R^2$ (values $<0$ thresholded to 0 for visualization).

Figure 7 shows that the performance of PSHMM is competitive with the E-M algorithm. While both methods achieve similar $R^2$ values in most scenarios, PSHMM outperforms E-M under conditions of high noise or heavy-tailed distributions. Note that our E-M implementation employs multiple initializations to mitigate local optima issues, which comes at the cost of increased computational time.

Our simulations used $100$-dimensional observed data ($\mathrm{dim}=100$), which we found approached the practical limits of the E-M algorithm. In higher-dimensional scenarios ($\mathrm{dim}=1000$ or $10000$), E-M failed, whereas SHMM continued to perform effectively. To ensure a fair comparison, we present results for $\mathrm{dim}=100$ throughout this section.

The last point is important. In Figure 7, in order to include the E-M algorithm in our simulations, we restrict our investigation to settings where E-M is designed to succeed, and show comparable performance with our methodology. Many real-world scenarios exist in a space where E-M is a nonstarter. This highlights PSHMM’s robustness and scalability advantages over traditional E-M approaches, particularly for high-dimensional data.

In Figure 7, we can also see that adding projections to SHMM greatly improved $R^2$, achieving near oracle-level performance in some settings. The top row shows that PSHMM works well under both high and low signal-to-noise scenarios. The middle row shows that PSHMM is robust and outperforms SHMM when the model is mis-specified, for example, when the underlying data contains 5 states but we choose to reduce the dimension to $3$ or $4$. The last row shows that PSHMM is more robust and has a better $R^2$ than SHMM on heavy-tailed data, such as data generated by a $t$ distribution. In all figures, when $R^2 < 0$, we threshold at 0 for plotting purposes. See the supplementary tables for detailed results. Negative $R^2$ occurs only for the standard SHMM, implying that it is not stable. Since we simulated $100$ trials for each setting, in addition to computing the mean $R^2$ we also calculated the variance of the $R^2$ metric. We provide the variance of the $R^2$ in the appendix, but note here that the variance of $R^2$ under PSHMM was smaller than that of SHMM. Overall, while SHMM often performs well, it is not robust, and PSHMM provides a suitable solution.

Finally, in all simulation settings, we see that the standard SHMM tends to give poor predictions except in non-sticky and high signal-to-noise ratio settings. PSHMM is more robust against noise and mis-specified models. Among the PSHMM variants, projection-onto-simplex outperforms projection-onto-polyhedron. The reason is that the projection-onto-simplex has a dedicated optimization algorithm that guarantees the optimal solution in the projection step. In contrast, projection-onto-polyhedron uses the log-barrier method, which is a general purpose optimization algorithm and does not guarantee the optimality of the solution. Since projection-onto-polyhedron also has a higher computational time, we recommend using projection-onto-simplex. Additionally, for projection-onto-simplex, we see that online and offline estimation perform similarly in most settings, suggesting that online learning does not lose too much power compared with offline learning.

5.2 Testing the effectiveness of forgetfulness

Experimental setting.

Similarly to Section 5.1.1, we simulated 100-dimensional, 5-state GHMM data with $\sigma\in\{0.01, 0.05, 0.1, 0.5, 1.0\}$ and of length $2000$, where the first $1000$ steps are for training and the last $1000$ steps are for testing. The transition matrix is no longer time-constant but differs between the training and testing periods as follows:

• Training period (diagonal-0.8): $\bm{T}^{(train)} = [\bm{T}_{ij}^{(train)}]_{i=1,\cdots,5}^{j=1,\cdots,5} = [0.75\times\mathrm{I}\{i=j\} + 0.05]_{i=1,\cdots,5}^{j=1,\cdots,5}$;

• Testing period (antidiagonal-0.8): $\bm{T}^{(test)} = [\bm{T}_{ij}^{(test)}]_{i=1,\cdots,5}^{j=1,\cdots,5} = [0.75\times\mathrm{I}\{i+j=5\} + 0.05]_{i=1,\cdots,5}^{j=1,\cdots,5}$.

We tested different methods on the last 100 time steps: the standard SHMM, projection-onto-simplex SHMM, online learning projection-onto-simplex SHMM, online learning projection-onto-simplex SHMM with decay factor $\gamma=0.05$, E-M, and the oracle as defined above. For the online learning methods, we used the first $100$ steps of the training set for warm-up and incorporated the remaining $900$ samples using online updates.

Figure 8: Simulation results for online learning variants. The simulations are generated using a 5-state GHMM across a variety of $\sigma$ settings and time-varying transition distributions. The y-axis is the average $R^2$. As before, for $R^2 < 0$, we threshold at 0 for plotting purposes. See the supplementary material for detailed results.
Simulation results.

Figure 8 shows the simulation results. Since the underlying data generation process is no longer stationary, most methods failed, including the E-M algorithm, with the exception of the online learning projection-onto-simplex SHMM with decay factor $\gamma = 0.05$. Adding the decay factor was critical for accommodating non-stationarity. As $\sigma$ increases, even PSHMM performs poorly; the oracle shows that this degradation in performance is hard to overcome.

5.3 Testing Computational Time

Experimental setting

We used an experimental setting similar to that of Section 5.1.1. We simulated 100-dimensional, 3-state GHMM data with $\sigma=0.05$ and length 2000. We used the first half for warm-up and tested computational time on the last 1000 time steps. We tested both the E-M algorithm and SHMM (all variants). For the SHMM family of algorithms, we tested both online and offline learning regimes. We computed the total running time in seconds. The implementation is in Python with the packages NumPy (Harris et al., 2020), SciPy (Virtanen et al., 2020), and scikit-learn (Pedregosa et al., 2011), without multithreading. Note that in contrast to our previous simulations, the entire process is repeated only 30 times. The computational time is the average over the 30 runs.

Simulation results

Table 2 shows the computational times for each method. First, online learning substantially reduced the computational cost. Among the offline learning methods, projection-onto-simplex SHMM performed similarly to SHMM, while projection-onto-polyhedron was much slower; in fact, the offline version of projection-onto-polyhedron was slow even compared to the E-M algorithm. However, the online learning variant of projection-onto-polyhedron was much faster than the Baum-Welch algorithm. Taking both computational time and prediction accuracy into consideration, we conclude that online and offline projection-onto-simplex SHMM are the best choices among these methods.

| Method | Offline/online | Computational time (sec) |
| --- | --- | --- |
| E-M (Baum-Welch) | - | 2134 |
| SHMM | offline | 304 |
| SHMM | online | 0.5 |
| PSHMM (simplex) | offline | 521 |
| PSHMM (simplex) | online | 0.7 |
| PSHMM (polyhedron) | offline | 10178 |
| PSHMM (polyhedron) | online | 14 |

Table 2: Simulation results comparing computational time among different methods.

6 Application: Backtesting on High Frequency Crypto-Currency Data

6.1 Data Description & Experiment Setting

To evaluate our algorithm’s performance on real-world data, we utilized a cryptocurrency trading dataset from Binance (https://data.binance.vision), one of the world’s largest Bitcoin exchanges. We focused on minute-level data for five major cryptocurrencies: Bitcoin, Ethereum, XRP, Cardano, and Polygon. Our analysis used log returns for each minute as input for our models.

We used the period from July 1, 2022, to December 31, 2022, as our test set. For each day in this test set, we employed a 30-day rolling window of historical data for model training. After training, we generated consecutive-minute recursive predictions for the entire test day without updating the model parameters intra-day.

Our study compared three models: the HMM with EM (HMM-EM), the SHMM, and the PSHMM with simplex projection. For all models, we assumed 4 latent states, motivated by the typical patterns observed in log returns, which often fall into combinations of large/small gains/losses.

To assess the practical implications of our models’ predictions, we implemented a straightforward trading strategy. This strategy generated buy signals for positive forecasted returns and short-sell signals for negative forecasted returns. We simulated trading a fixed dollar amount for each cryptocurrency, holding positions for one minute, and executing trades at every minute of the day.
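A sketch of this daily backtesting loop in Python is below. The model interface (fit, predict_next, observe) and the constants are illustrative assumptions, not our exact code; `observe` stands for conditioning the belief state on a new observation without re-fitting parameters.

    import numpy as np

    MINS_PER_DAY = 1440          # minute bars per test day
    WINDOW_DAYS = 30             # rolling training window

    def backtest_day(model_cls, log_returns, day):
        # Train once on the 30-day rolling window preceding the test day.
        start = (day - WINDOW_DAYS) * MINS_PER_DAY
        model = model_cls(n_states=4)
        model.fit(log_returns[start:day * MINS_PER_DAY])
        preds = []
        for t in range(MINS_PER_DAY):
            preds.append(model.predict_next())      # recursive forecast
            y = log_returns[day * MINS_PER_DAY + t]
            model.observe(y)   # condition the belief state; no intra-day re-fit
        return np.array(preds)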

We evaluated the performance of our trading strategy using several standard financial metrics. The daily return was calculated as the average return across all five cryptocurrencies, weighted by our model’s predictions:

R_{m}=\frac{1}{5}\sum_{i=1}^{5}\sum_{t}\mathrm{sign}(\widehat{Y}_{i,t}^{(m)})\,Y_{i,t}^{(m)},

where $Y_{i,t}^{(m)}$ is the return for minute $t$ of day $m$ for currency $i$, $\widehat{Y}_{i,t}^{(m)}$ is its prediction, and $\mathrm{sign}(a)$ is $1$ if $a$ is positive, $-1$ if $a$ is negative, and $0$ if $a=0$. From these daily returns, we computed the annualized return

\mathrm{Annualized\ return}=365\times\overline{R},

to measure overall profitability. To assess risk-adjusted performance, we calculated the Sharpe ratio (Sharpe,, 1966)

\mathrm{Sharpe\ ratio}=\frac{\sqrt{365}\times\overline{R}}{\widehat{\mathrm{std}}(R)},

where $\overline{R}$ and $\widehat{\mathrm{std}}(R)$ are the sample mean and standard deviation of the daily returns. The Sharpe ratio provides insight into the return earned per unit of risk. Lastly, we determined the maximum drawdown (Grossman and Zhou, 1993)

\mathrm{Maximum\ drawdown}=\max_{m_{2}}\max_{m_{1}<m_{2}}\left[\frac{\sum_{m=m_{1}}^{m_{2}}(-R_{m})}{1+\sum_{m=1}^{m_{1}}R_{m}}\right],

which quantifies the largest peak-to-trough decline as a percentage. Because financial returns are leptokurtic, the maximum drawdown captures the effect of outliers better than the Sharpe ratio, which is based purely on the first two moments. A smaller maximum drawdown indicates that the method is less risky.
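For concreteness, the following Python sketch computes all three metrics from a vector of daily returns. The drawdown loop is a direct transcription of the display above (quadratic in the number of test days, which is negligible for a six-month test set); the function name is ours.

    import numpy as np

    def evaluate(daily_returns):
        r = np.asarray(daily_returns, dtype=float)
        annualized = 365 * r.mean()
        sharpe = np.sqrt(365) * r.mean() / r.std(ddof=1)
        # Maximum drawdown: direct transcription of the formula above, with
        # c[m] = R_1 + ... + R_m so that sum_{m=m1}^{m2}(-R_m) = c[m1-1] - c[m2].
        c = np.concatenate([[0.0], np.cumsum(r)])
        mdd = max((c[m1 - 1] - c[m2]) / (1 + c[m1])
                  for m2 in range(2, len(c))
                  for m1 in range(1, m2))
        return annualized, sharpe, mdd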

6.2 Results

Method Sharpe ratio Annualized return Maximum drawdown
PSHMM 2.88 1012% 49%
SHMM 1.07 345% 90%
HMM-EM 0.89 197% 53%
Table 3: Real-world application results for PSHMM, SHMM, and HMM-EM on crypto-currency trading.

From Table 3, we see that PSHMM outperforms both benchmarks, with the highest Sharpe ratio and annualized return and the lowest maximum drawdown: PSHMM outperforms SHMM, and SHMM outperforms HMM-EM. SHMM outperforms HMM-EM because spectral learning does not suffer from the local-optima problem of the E-M algorithm, and PSHMM outperforms SHMM because projection-onto-simplex provides regularization.

Figure 9: Comparative performance of PSHMM, SHMM, and HMM-EM in cryptocurrency trading. The graph shows the average accumulated return over time for each method applied to five cryptocurrencies (Bitcoin, Ethereum, XRP, Cardano, and Polygon) from July to December 2022. Y-axis: Average accumulated return. X-axis: Trading days.

The accumulated daily return is shown in Figure 9. PSHMM outperformed the other methods. The maximum drawdown of PSHMM is 49%. Considering the high volatility of the crypto-currency market during the second half of 2022, this maximum drawdown is acceptable. For computational purposes, the drawdown is allowed to exceed 100% because we always trade a fixed dollar amount, effectively assuming an infinite pool of cash. Between PSHMM and SHMM, the only difference is the projection-onto-simplex, and the maximum drawdown of PSHMM is only about half that of SHMM; PSHMM thus takes on relatively little risk, especially given its much higher return.

7 Discussion

Spectral estimation avoids being trapped in local optima.

E-M (i.e., the B-W algorithm) optimizes the likelihood function and is prone to local optima, since the likelihood function is highly non-convex, whereas spectral estimation applies the MOM directly to the observations. Although multiple initializations can mitigate the local-optima problem of E-M, there is no guarantee of convergence to the global optimum. Spectral estimation provides a framework that not only avoids non-convex optimization but also has nice theoretical properties: the approximation error bound tells us that as the number of observations goes to infinity, the approximation error goes to zero. In this manuscript, we also provide the asymptotic distribution of this error.

Projection-onto-simplex serves as regularization.

The standard SHMM can give poor predictions due to the accumulation and propagation of errors. Projection-onto-simplex pulls the prediction back to a reasonable range. This regularization is our primary methodological innovation, and it makes the SHMM well-suited for practical use. Although the simplex, estimated from the means of a GMM, can be biased, it provides the most natural and reasonable choice of projection space. The regularization introduces a bias-variance trade-off: when the data size is small, it accepts some bias in order to reduce variance.
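The computational core of this regularization is Euclidean projection onto a simplex. As a reference point, the Python sketch below implements the standard $O(d\log d)$ projection onto the probability simplex from Wang and Carreira-Perpinán (2013), cited in the references; how this primitive is composed with the GMM-mean simplex follows the construction in our methods section and is not repeated here.

    import numpy as np

    def project_to_probability_simplex(v):
        # Euclidean projection of v onto {w : w_i >= 0, sum_i w_i = 1}
        # (Wang and Carreira-Perpinan, 2013).
        u = np.sort(v)[::-1]                    # sort coordinates descending
        css = np.cumsum(u)
        idx = np.arange(1, len(v) + 1)
        rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]
        theta = (1.0 - css[rho]) / (rho + 1.0)  # shift enforcing the constraints
        return np.maximum(v + theta, 0.0)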

Online learning can adapt to dynamic patterns and provide faster learning.

Finally, we provide an online learning strategy that allows the estimated moments to adapt over time, which is critical in applications that exhibit nonstationarity. Our online learning framework applies to both the standard SHMM and PSHMM. Importantly, online learning substantially reduces computational costs compared to re-training the entire model prior to each new prediction.
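As a rough illustration only (the exact update rule is given in the methods section), an exponentially weighted moment update with decay factor $\gamma$ might look as follows in Python; the class, attribute names, and weighting scheme are assumptions for exposition, not our exact code.

    import numpy as np

    class OnlineMoments:
        # Illustrative exponentially weighted estimates of the moments that
        # spectral estimation consumes: a mean vector and a lagged
        # second-moment matrix. Names and weighting are assumptions.
        def __init__(self, dim, decay=0.05):
            self.decay = decay
            self.mu = np.zeros(dim)               # estimate of E[Y]
            self.sigma = np.zeros((dim, dim))     # estimate of E[Y_{t+1} Y_t^T]
            self._prev = None

        def update(self, y):
            g = self.decay
            self.mu = (1 - g) * self.mu + g * y
            if self._prev is not None:
                self.sigma = (1 - g) * self.sigma + g * np.outer(y, self._prev)
            self._prev = np.asarray(y)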


SUPPLEMENTAL MATERIALS

Appendix A Proof for Lemma 1

Proof (Lemma 1).

First, we expand the estimated likelihood by decomposing it into the underlying truth plus error terms. We have

\widehat{Pr}(x_{1:T})
  = \hat{c}_{\infty}^{\top}\hat{C}(y_{T})\hat{C}(y_{T-1})\cdots\hat{C}(y_{1})\hat{c}_{1}
  = (\hat{\mu}^{\top}\hat{\Sigma}^{-1})[\hat{K}(y_{T})\hat{\Sigma}^{-1}][\hat{K}(y_{T-1})\hat{\Sigma}^{-1}]\cdots[\hat{K}(y_{1})\hat{\Sigma}^{-1}]\hat{\mu}
  = [(\mu+\widehat{\Delta\mu})^{\top}(\Sigma+\widehat{\Delta\Sigma})^{-1}][(K+\widehat{\Delta K})(y_{T})(\Sigma+\widehat{\Delta\Sigma})^{-1}]\cdots
    [(K+\widehat{\Delta K})(y_{1})(\Sigma+\widehat{\Delta\Sigma})^{-1}](\mu+\widehat{\Delta\mu}).

Consider the matrix perturbation expansion $(\Sigma+\widehat{\Delta\Sigma})^{-1}=\Sigma^{-1}-\Sigma^{-1}\widehat{\Delta\Sigma}\,\Sigma^{-1}+O(\|\widehat{\Delta\Sigma}\|^{2})$. Here the matrix norm $\|\cdot\|$ can be any norm, since all matrix norms are of equivalent order. Also note that all quantities carrying a '$\widehat{\Delta}$' are $O_{p}(N^{-\frac{1}{2}})$, where $N$ is the number of i.i.d. triples $(Y_{1},Y_{2},Y_{3})$ used to estimate $\widehat{\mu},\widehat{\Sigma},\widehat{K}$. Note that $N$ and $T$ are not related, and that we work in the regime where $T$ is fixed and $N\to\infty$. For example, $\hat{\mu}=\mu+\widehat{\Delta\mu}=\frac{1}{N}\sum_{i=1}^{N}Y_{i}\sim MVN(\mu,\frac{1}{N}\mathrm{Cov}(Y))$; similar analyses apply to $\widehat{\Sigma}$ and $\widehat{K}$. So

\widehat{\Delta\mu} = [O_{p}(N^{-1/2})]^{(d)};
\widehat{\Delta\Sigma} = [O_{p}(N^{-1/2})]^{(d\times d)};
\widehat{\Delta K} = [O_{p}(N^{-1/2})]^{(d\times d\times d)},

where $d$ is the dimension of $Y_{t}$. One application is $(\Sigma+\widehat{\Delta\Sigma})^{-1}=\Sigma^{-1}-\Sigma^{-1}\widehat{\Delta\Sigma}\,\Sigma^{-1}+O_{p}(N^{-1})$.
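This first-order expansion is easy to sanity-check numerically. In the Python snippet below (matrix and seed are arbitrary), the residual of the linearization shrinks roughly quadratically as the perturbation scale decreases, consistent with the $O(\|\widehat{\Delta\Sigma}\|^{2})$ remainder.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 5
    A = rng.standard_normal((d, d))
    Sigma = A @ A.T + d * np.eye(d)        # well-conditioned base matrix
    Sigma_inv = np.linalg.inv(Sigma)

    for eps in (1e-1, 1e-2, 1e-3):
        Delta = eps * rng.standard_normal((d, d))
        exact = np.linalg.inv(Sigma + Delta)
        linear = Sigma_inv - Sigma_inv @ Delta @ Sigma_inv
        # residual should scale like ||Delta||^2: about 100x smaller per step
        print(eps, np.linalg.norm(exact - linear))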

According to the previous two expansions, we can rewrite the decomposition of $\widehat{Pr}(x_{1:T})$ above as

[(\mu+\widehat{\Delta\mu})^{\top}(\Sigma+\widehat{\Delta\Sigma})^{-1}][(K+\widehat{\Delta K})(y_{T})(\Sigma+\widehat{\Delta\Sigma})^{-1}]\cdots
    [(K+\widehat{\Delta K})(y_{1})(\Sigma+\widehat{\Delta\Sigma})^{-1}](\mu+\widehat{\Delta\mu})
  = [(\mu+\widehat{\Delta\mu})^{\top}(\Sigma^{-1}-\Sigma^{-1}\widehat{\Delta\Sigma}\Sigma^{-1}+O_{p}(N^{-1}))]
    [(K+\widehat{\Delta K})(y_{T})(\Sigma^{-1}-\Sigma^{-1}\widehat{\Delta\Sigma}\Sigma^{-1}+O_{p}(N^{-1}))]\cdots
    [(K+\widehat{\Delta K})(y_{1})(\Sigma^{-1}-\Sigma^{-1}\widehat{\Delta\Sigma}\Sigma^{-1}+O_{p}(N^{-1}))](\mu+\widehat{\Delta\mu})
  = \mu^{\top}\Sigma^{-1}K(y_{T})\Sigma^{-1}\cdots K(y_{1})\Sigma^{-1}\mu+(v+\tilde{v})^{\top}\widehat{\Delta\mu}+\sum_{t=1}^{T}a_{t}^{\top}\widehat{\Delta K}(y_{t})\tilde{a}_{t}
    -\sum_{t=0}^{T}b_{t}^{\top}\widehat{\Delta\Sigma}\tilde{b}_{t}+O_{p}(N^{-1})
  = Pr(x_{1:T})+(v+\tilde{v})^{\top}\widehat{\Delta\mu}+\sum_{t=1}^{T}a_{t}^{\top}\widehat{\Delta K}(y_{t})\tilde{a}_{t}-\sum_{t=0}^{T}b_{t}^{\top}\widehat{\Delta\Sigma}\tilde{b}_{t}+O_{p}(N^{-1}).

In the above, we first substitute $(\Sigma+\widehat{\Delta\Sigma})^{-1}$ with $\Sigma^{-1}-\Sigma^{-1}\widehat{\Delta\Sigma}\Sigma^{-1}+O_{p}(N^{-1})$, then distribute all multiplications. After distributing, we can categorize the resulting terms into three categories according to their orders of convergence. The first category is the deterministic term without any randomness or estimation error; there is only one such term, $Pr(x_{1:T})$, i.e. $\mu^{\top}\Sigma^{-1}K(y_{T})\Sigma^{-1}\cdots K(y_{1})\Sigma^{-1}\mu$. The second category consists of the terms that converge at order $O_{p}(N^{-1/2})$: all terms with exactly one '$\widehat{\Delta}$' factor, i.e. one $\widehat{\Delta\mu}$, $\widehat{\Delta\Sigma}$ or $\widehat{\Delta K}$. These terms comprise the sum $(v+\tilde{v})^{\top}\widehat{\Delta\mu}+\sum_{t=1}^{T}a_{t}^{\top}\widehat{\Delta K}(y_{t})\tilde{a}_{t}-\sum_{t=0}^{T}b_{t}^{\top}\widehat{\Delta\Sigma}\tilde{b}_{t}$. To be useful for the main theorem, we simplify each term by collecting the "true" quantities appearing in it into $v,\tilde{v},a_{t},\tilde{a}_{t},b_{t},\tilde{b}_{t}$ as defined in the lemma; note that each collection contains only factors that fall either to the left or to the right of the '$\widehat{\Delta}$' factor. The third category contains all remaining terms, which converge at order $O_{p}(N^{-1})$ or faster. These terms are finite in number, and thus their sum is also $O_{p}(N^{-1})$.

Appendix B Detailed Results for Simulations

Table 4 shows the detailed simulation results. The first column gives the emission distribution. The second column gives the type of transition matrix, as described in the simulation settings of the paper. We show two types of oracle: the "limited oracle" is the oracle described in the paper (and the most commonly used), which assumes that all parameters are known but the true hidden states are not. The "strong oracle" assumes knowledge of every parameter as well as the hidden state at each time point. From the table below, we see that the two oracles nearly coincide, and that PSHMM with projection-onto-simplex is very close to the oracle in most cases.

E T $\sigma$ #Clusters inferred Limited oracle Strong oracle SHMM PSHMM simplex PSHMM polyhedron PSHMM simplex online
$t_{5}$ sticky 0.05 5 0.28 0.28 -673629.23 0.27 0.24 0.26
$t_{10}$ sticky 0.05 5 0.3 0.3 -67.72 0.29 0.27 0.29
$t_{15}$ sticky 0.05 5 0.31 0.31 -2.82 0.3 0.28 0.29
$t_{20}$ sticky 0.05 5 0.3 0.3 -205.24 0.29 0.27 0.29
$t_{5}$ nonsticky 0.05 5 0.17 0.17 -2.48 0.17 0.11 0.17
$t_{10}$ nonsticky 0.05 5 0.19 0.19 0.11 0.18 0.11 0.18
$t_{15}$ nonsticky 0.05 5 0.19 0.19 -0.59 0.18 0.11 0.18
$t_{20}$ nonsticky 0.05 5 0.19 0.19 0.14 0.19 0.11 0.18
$N$ sticky 0.05 3 0.31 0.31 -125.03 0.21 0.12 0.21
$N$ sticky 0.05 4 0.31 0.31 -717.8 0.25 0.23 0.25
$N$ nonsticky 0.05 3 0.19 0.19 -3.25 0.16 0.1 0.17
$N$ nonsticky 0.05 4 0.19 0.19 -155.85 0.17 0.09 0.18
$N$ sticky 0.01 5 0.38 0.38 0.34 0.38 0.37 0.37
$N$ nonsticky 0.01 5 0.24 0.24 0.24 0.24 0.15 0.24
$N$ sticky 0.05 5 0.31 0.31 -9.36 0.3 0.28 0.3
$N$ nonsticky 0.05 5 0.19 0.19 0.14 0.19 0.12 0.19
$N$ sticky 0.1 5 0.19 0.19 -74.73 0.18 0.15 0.18
$N$ sticky 0.5 5 0.01 0.01 -22.02 0.01 0.0 0.0
$N$ sticky 1.0 5 0.0 0.0 -1163.8 0.0 -0.0 -0.01
$N$ nonsticky 0.1 5 0.12 0.12 -8.47 0.12 0.07 -0.0
$N$ nonsticky 0.5 5 0.01 0.01 -22.29 0.01 0.0 -0.0
$N$ nonsticky 1.0 5 0.0 0.0 -5.61 0.0 -0.0 -0.01
Table 4: Detailed simulation results.

References

  • Baum and Petrie, (1966) Baum, L. E. and Petrie, T. (1966). Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics, 37(6):1554–1563.
  • Baum et al., (1970) Baum, L. E., Petrie, T., Soules, G., and Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41(1):164–171.
  • Boyd et al., (2004) Boyd, S., Boyd, S. P., and Vandenberghe, L. (2004). Convex optimization. Cambridge university press.
  • Dempster et al., (1977) Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Series B Stat. Methodol., 39(1):1–22.
  • Eckart and Young, (1936) Eckart, C. and Young, G. (1936). The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218.
  • Frisch, (1955) Frisch, K. (1955). The logarithmic potential method of convex programming. Memorandum, University Institute of Economics, Oslo, 5(6).
  • Grossman and Zhou, (1993) Grossman, S. J. and Zhou, Z. (1993). Optimal investment strategies for controlling drawdowns. Mathematical finance, 3(3):241–276.
  • Halko et al., (2011) Halko, N., Martinsson, P.-G., and Tropp, J. A. (2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53(2):217–288.
  • Harris et al., (2020) Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M. H., Brett, M., Haldane, A., del Río, J. F., Wiebe, M., Peterson, P., Gérard-Marchant, P., Sheppard, K., Reddy, T., Weckesser, W., Abbasi, H., Gohlke, C., and Oliphant, T. E. (2020). Array programming with NumPy. Nature, 585(7825):357–362.
  • Hassan, M. R. and Nath, B. (2005). Stock market forecasting using hidden Markov model: a new approach. In 5th International Conference on Intelligent Systems Design and Applications (ISDA’05), pages 192–196. IEEE.
  • Hsu et al., (2012) Hsu, D., Kakade, S. M., and Zhang, T. (2012). A spectral algorithm for learning hidden Markov models. Journal of Computer and System Sciences, 78(5):1460–1480.
  • Jaeger, (2000) Jaeger, H. (2000). Observable Operator Models for Discrete Stochastic Time Series. Neural Comput., 12(6):1371–1398.
  • Knoll et al., (2016) Knoll, B. C., Melton, G. B., Liu, H., Xu, H., and Pakhomov, S. V. (2016). Using synthetic clinical data to train an HMM-based POS tagger. In 2016 IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI), pages 252–255. IEEE.
  • Lahmiri and Bekiros, (2021) Lahmiri, S. and Bekiros, S. (2021). Deep learning forecasting in cryptocurrency high-frequency trading. Cognitive Computation, 13:485–487.
  • Mamon and Elliott, (2014) Mamon, R. S. and Elliott, R. J. (2014). Hidden Markov models in finance: Further developments and applications, volume II. International series in operations research & management science. Springer, New York, NY, 2014 edition.
  • McLachlan and Basford, (1988) McLachlan, G. J. and Basford, K. E. (1988). Mixture models: Inference and applications to clustering, volume 38. M. Dekker New York.
  • Pedregosa et al., (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
  • Rodu, (2014) Rodu, J. (2014). Spectral estimation of hidden Markov models. University of Pennsylvania.
  • Rodu et al., (2013) Rodu, J., Foster, D. P., Wu, W., and Ungar, L. H. (2013). Using regression for spectral estimation of hmms. In International Conference on Statistical Language and Speech Processing, pages 212–223. Springer.
  • Scott et al., (2005) Scott, S. L., James, G. M., and Sugar, C. A. (2005). Hidden Markov models for longitudinal comparisons. J. Am. Stat. Assoc., 100(470):359–369.
  • Sharpe, (1966) Sharpe, W. F. (1966). Mutual fund performance. The Journal of business, 39(1):119–138.
  • Shirley et al., (2010) Shirley, K. E., Small, D. S., Lynch, K. G., Maisto, S. A., and Oslin, D. W. (2010). Hidden Markov models for alcoholism treatment trial data. The Annals of Applied Statistics, 4(1):366–395.
  • Siddiqi et al., (2009) Siddiqi, S., Boots, B., and Gordon, G. J. (2009). Reduced-rank hidden markov models. AISTATS, 9:741–748.
  • Stratos et al., (2016) Stratos, K., Collins, M., and Hsu, D. (2016). Unsupervised part-of-speech tagging with anchor hidden Markov models. Trans. Assoc. Comput. Linguist., 4:245–257.
  • Virtanen et al., (2020) Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S. J., Brett, M., Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., Carey, C. J., Polat, İ., Feng, Y., Moore, E. W., VanderPlas, J., Laxalde, D., Perktold, J., Cimrman, R., Henriksen, I., Quintero, E. A., Harris, C. R., Archibald, A. M., Ribeiro, A. H., Pedregosa, F., van Mulbregt, P., and SciPy 1.0 Contributors (2020). SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272.
  • Wang and Carreira-Perpinán, (2013) Wang, W. and Carreira-Perpinán, M. A. (2013). Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application. arXiv preprint arXiv:1309.1541.
  • Zhou and Su, (2001) Zhou, G. and Su, J. (2001). Named entity recognition using an HMM-based chunk tagger. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics - ACL ’02, Morristown, NJ, USA. Association for Computational Linguistics.