
Fitting Sparse Markov Models to Categorical Time Series Using Regularization

Tuhin Majumder, Soumendra Lahiri and Donald Martin
Department of Statistics, North Carolina State University; Department of Mathematics and Statistics, Washington University in St. Louis
Abstract

The major problem in fitting a higher-order Markov model is the exponentially growing number of parameters. The most popular approach is to use a Variable Length Markov Chain (VLMC), which determines relevant contexts (recent pasts) of variable order and organizes them into a context tree. A more general approach is the Sparse Markov Model (SMM), in which all possible histories of order m are partitioned into groups so that the transition probability vector is identical for all histories belonging to a given group. We develop an elegant method of fitting SMMs using convex clustering, which involves regularization. The regularization parameter is selected using a BIC-type criterion. Theoretical results establish the model selection consistency of our method for large sample sizes. Extensive simulation studies under different set-ups are presented to assess the performance of the method. We apply the method to classify genome sequences obtained from individuals infected by different viruses.

Keywords: Sparse Markov Model, Convex Clustering

1 Introduction

Let {Xt}\{X_{t}\} be a categorical time series with a finite state space Σ\Sigma. We suppose that the evolution of the time series follows an mm-th order Markov structure where

(Xt+1|Xs,st)=(Xt+1|Xs,tm<st){\cal L}\big{(}X_{t+1}\big{|}X_{s},s\leq t\big{)}={\cal L}\big{(}X_{t+1}\big{|}X_{s},t-m<s\leq t\big{)} (1.1)

for some m ≥ 1, where for any random vectors X, Y defined on a common probability space, we write ℒ(Y|X) to denote the probability distribution of Y given X. Even when the alphabet Σ is small, such as Σ = {0,1} in applications involving binary chains or Σ = {A,G,T,C} in genetics applications, the complexity of the model (1.1) increases fairly quickly, and the model may be difficult to estimate even for moderately large m. In the absence of a parametric model specification, the number of free parameters associated with (1.1) is |Σ|^m (|Σ|−1), which grows geometrically fast in the order m, where |Σ| denotes the size of the alphabet, that is, the number of elements in Σ.

Different dimension reduction strategies have been applied to reduce the model complexity in (1.1), such as Variable Length Markov Chains (VLMC) based on tree-structured conditioning sets. This idea was first introduced by Rissanen (1983)[10]; a VLMC determines relevant contexts (recent pasts) of variable order and organizes them into a context tree. In a VLMC, P(X_{t+1}=x_{t+1} | X_t=x_t, …, X_1=x_1) = P(X_{t+1}=x_{t+1} | X̃_t^(ℓ) = x̃_t^(ℓ)), where ℓ need not be a fixed number but may be a function of the past values (x_t, …, x_1). In general, context tree models have L(|Σ|−1) parameters, where L is the number of leaves in the context tree. That L can take arbitrary positive integer values for general context trees highlights the flexibility of a model with variable length contexts, and the fact that such models can lead to huge reductions in the number of parameters, especially when there is a long context in a single direction. A model of variable order allows for a better trade-off between the bias that arises from using contexts that are too short and the variance that increases with having many parameters, thus improving statistical inference. Bühlmann and Wyner (1999)[2] and Bühlmann (2000)[1] developed model selection strategies for Variable Length Markov Chains (VLMC) and studied their asymptotic behaviour.

Roos and Yu (2009a[11], b[12]) first pointed out that there can be relevant contexts that do not have the hierarchical structure of a context tree. Although those authors discussed the possibility of more general models, the analysis in those papers was limited to the case Σ = {0,1}. Recently, researchers have begun studying the sparse model, which is posed in terms of a general partition of the set of m-tuples Σ^m, where m is the maximal order of Markovian dependence. Sparse Markov models (SMMs) introduce a sparse parametrization based on an unknown grouping of the relevant m-th order histories in Σ^m. This generalization was first proposed by Gárcia and González-López (2010)[4], who named it Minimal Markov Models. Later, Jääskinen et al. (2014)[6] developed Bayesian predictive methods to analyze sequence data using SMMs. Xiong et al. (2016)[14] extended that work, introducing a recursive algorithm for optimizing the partition of an SMM. In this paper, we consider SMMs in their full generality, allowing an arbitrary and unknown number of groups. Specifically, let 𝒞_1, …, 𝒞_{k_0} be a partition of Σ^m. Then the Markov chain {X_t} in (1.1) is a member of the SMM with groups {𝒞_1, …, 𝒞_{k_0}} if it satisfies the sparsity condition:

P(Xt+1|Xt=a1,,Xtm+1=am)is the same for all(am,,a1)𝒞i,P\big{(}X_{t+1}\in\cdot\big{|}X_{t}=a_{-1},\ldots,X_{t-m+1}=a_{-m}\big{)}{\quad\mbox{is the same for all}\quad}(a_{-m},\ldots,a_{-1})\in{\cal C}_{i}, (1.2)

for each i = 1, …, k_0. Thus, for each i, the transition probability vector remains unchanged over all m-step histories lying in the set 𝒞_i. This reduces the number of unknown probability parameters to k_0(|Σ|−1). However, both the number k_0 of sets in the partition and the sets 𝒞_i themselves are unknown and must be estimated from the data in order to fit the SMM.
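To make the sparsity condition (1.2) concrete, the following minimal sketch (illustrative Python, not the authors' code; all names are ours) simulates a binary second-order SMM in which the four histories are grouped into k_0 = 2 classes sharing a transition probability vector.

import numpy as np

# Illustrative sketch: simulate a binary SMM of order m = 2 in which the four
# histories {00, 01, 10, 11} form k0 = 2 groups, each with a shared transition vector.
rng = np.random.default_rng(0)

sigma = [0, 1]                                         # alphabet Sigma
partition = {(0, 0): 0, (0, 1): 0, (1, 0): 1, (1, 1): 1}   # groups C_1, C_2
group_probs = {0: np.array([0.8, 0.2]),                # P(X_{t+1} = . | history in C_1)
               1: np.array([0.3, 0.7])}                # P(X_{t+1} = . | history in C_2)

def simulate_smm(n, m=2):
    """Generate a length-n chain obeying the sparsity condition (1.2)."""
    x = list(rng.integers(0, 2, size=m))               # arbitrary initial history
    for _ in range(n - m):
        hist = tuple(x[-m:])
        probs = group_probs[partition[hist]]           # shared vector within a group
        x.append(rng.choice(sigma, p=probs))
    return np.array(x)

x = simulate_smm(5000)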

The rest of the paper is organised as follows. In Section 2, we discuss the methodology of fitting an SMM using regularization in detail. Section 3 presents the theoretical properties that ensure the model selection consistency of our method. In Section 4, we numerically illustrate our methodology through extensive simulation studies. A real data analysis involving virus classification is presented in Section 5. We conclude the paper with some remarks in Section 6. The proofs of the theoretical results are provided in the appendix.

2 Methodology

2.1 Notation

Let ={1,2,}{\mathbb{N}}=\{1,2,\ldots\} be the set of all positive integers. Let 𝒳n=(X1,,Xn){\cal X}_{n}=(X_{1},\ldots,X_{n}) and X~t(m)=(Xt,Xt1,,Xtm+1)\tilde{X}_{t}^{(m)}=(X_{t},X_{t-1},\ldots,X_{t-m+1}), for m1m\geq 1, tt\in{\mathbb{N}}. Write ww for an ordered (finite) sequence of Σ\Sigma-elements of length |w||w|. Let wuwu denote the (ordered) concatenation of ww and uu. Write |Σ|=d|\Sigma|=d and Σm={σ1,,σp}\Sigma^{m}=\{\sigma_{1},\ldots,\sigma_{p}\} so that p=|Σ|mp=|\Sigma|^{m}. Let Nw=t=|w|n11(X~t(|w|)=w)N_{w}=\sum_{t=|w|}^{n}{1\!\!1}(\tilde{X}_{t}^{(|w|)}=w) where 11(){1\!\!1}(\cdot) denotes the indicator function. For any SΣmS\subset\Sigma^{m} and aΣa\in\Sigma, define NS=t=mn111(X~t(m)S)N_{S}=\sum_{t=m}^{n-1}{1\!\!1}(\tilde{X}_{t}^{(m)}\in S), NS,a=t=mn111(X~t(m)S,Xt+1=a)N_{S,a}=\sum_{t=m}^{n-1}{1\!\!1}(\tilde{X}_{t}^{(m)}\in S,X_{t+1}=a). In particular, NσjN_{\sigma_{j}} denotes the number of times the chain X~t(m)\tilde{X}_{t}^{(m)} hits the mm-tuple σj\sigma_{j}, and Nσj,aN_{\sigma_{j},a} is the number of transitions from σj\sigma_{j} to aa.

Next we define the probabilities associated with the SMM (1.2). For j=1,,pj=1,\ldots,p and aΣa\in\Sigma, let

𝝅j,a=P(Xt+1=a|X~t(m)=σj);\displaystyle\bm{\pi}_{j,a}=P\big{(}X_{t+1}=a\big{|}\tilde{X}_{t}^{(m)}=\sigma_{j}\big{)};

and let 𝝅_j be the corresponding transition probability vector. Note that, by the SMM property, for any a ∈ Σ the transition probability 𝝅_{j,a} is constant over all j such that σ_j ∈ 𝒞_i. However, we do not know the sets 𝒞_i, and one of the challenges of fitting an SMM to a dataset is to identify the sets 𝒞_1, …, 𝒞_{k_0}. To that end, define nonparametric estimators of 𝝅_{j,a} using their empirical versions:

𝝅^ja=Nσj,a/Nσj\hat{\bm{\pi}}_{ja}=N_{\sigma_{j},a}/N_{\sigma_{j}}

where N = n − m + 1 denotes the total number of m-th order histories in the observed variables 𝒳_n.
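As an illustration of the notation above, the following sketch (our own illustrative code; function and variable names are assumptions) accumulates the counts N_{σ_j} and N_{σ_j,a} over t = m, …, n−1 and forms the empirical estimators π̂_{j,a} = N_{σ_j,a}/N_{σ_j}.

import numpy as np
from itertools import product

# Sketch of the empirical estimators of Section 2.1 (illustrative names).
def empirical_transitions(x, alphabet, m):
    histories = list(product(alphabet, repeat=m))      # Sigma^m, so p = |Sigma|^m
    h_index = {h: j for j, h in enumerate(histories)}
    a_index = {a: k for k, a in enumerate(alphabet)}
    counts = np.zeros((len(histories), len(alphabet)))
    for t in range(m, len(x)):                         # transition history -> next symbol
        j = h_index[tuple(x[t - m:t])]
        counts[j, a_index[x[t]]] += 1
    n_sigma = counts.sum(axis=1, keepdims=True)        # N_{sigma_j}
    pi_hat = np.divide(counts, n_sigma, out=np.full_like(counts, np.nan),
                       where=n_sigma > 0)              # NaN if a history never occurs
    return histories, pi_hat, n_sigma.ravel()

x = np.random.default_rng(1).integers(0, 2, size=5000)     # any observed chain
histories, pi_hat, n_sigma = empirical_transitions(x, alphabet=[0, 1], m=2)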

Here we propose a new approach to fitting the SMM based on regularization.

2.2 Description of the Method/Algorithm

Consider the penalized criterion function

12j=1paΣ(π^jabja)2+λ1j1<j2pwj1j2ρ(𝐛j1,𝐛j2)\dfrac{1}{2}\sum_{j=1}^{p}\sum_{a\in\Sigma}\big{(}\hat{\pi}_{ja}-b_{ja})^{2}+\lambda\sum_{1\leq j_{1}<j_{2}\leq p}w_{j_{1}j_{2}}\rho(\mathbf{b}_{j_{1}},\mathbf{b}_{j_{2}}) (2.3)

over b_j = (b_{ja} : a ∈ Σ) ∈ Π_d for j = 1, …, p, where λ > 0 is a penalty parameter, Π_d is the d-dimensional simplex Π_d = {(u_1,…,u_d) ∈ [0,1]^d : u_1 + … + u_d = 1}, and ρ(·,·) is a distance measure between two d-dimensional probability vectors. Thus, (2.3) treats the estimators π̂_{ja} as (correlated) “observations” and penalizes the distance between all distinct pairs of the probability vectors in order to identify the identical ones. In particular, the number of pairwise penalty terms grows approximately quadratically in the number of “observations”, and with a suitable choice of the penalization term one can identify the identical probability vectors. When ρ(b_{j_1}, b_{j_2})^2 = Σ_{ℓ=1}^d (b_{j_1 ℓ} − b_{j_2 ℓ})^2, (2.3) gives a version of the Group LASSO of Yuan and Lin (2006)[15] that is designed for selecting pairs of full vectors that are close, and we have a convex optimization problem that can be solved for large p. On the other hand, if we use the ℓ_1 distance ρ(b_{j_1}, b_{j_2}) = Σ_{ℓ=1}^d |b_{j_1 ℓ} − b_{j_2 ℓ}|, then only componentwise zero differences can be identified.
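For concreteness, a direct (if naive) evaluation of the criterion (2.3) with the ℓ_2 fusion penalty might look as follows; this is an illustrative sketch with our own function names, not an optimization routine.

import numpy as np

# Sketch: evaluate the penalized criterion (2.3) with the l2 (group-lasso type)
# fusion penalty for a candidate matrix B whose rows are the vectors b_j.
# pi_hat (p x d), B (p x d) and the symmetric weight matrix W (p x p) are assumed given.
def smm_objective(B, pi_hat, W, lam):
    fit = 0.5 * np.sum((pi_hat - B) ** 2)              # (1/2) sum_j ||pi_hat_j - b_j||_2^2
    penalty = 0.0
    p = B.shape[0]
    for j1 in range(p):
        for j2 in range(j1 + 1, p):                    # all distinct pairs j1 < j2
            penalty += W[j1, j2] * np.linalg.norm(B[j1] - B[j2])
    return fit + lam * penalty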

Once we minimize the criterion function in (2.3), it is a relatively easy task to find estimates of k_0 and the sets 𝒞_i. Specifically, we start with the pair having the smallest j_1 and seek all j_2 > j_1 such that the distance between the solutions b*_{j_1} and b*_{j_2} is zero. Then we set 𝒞̂_1 to be the set consisting of j_1 and all such j_2. In the next step, we consider all pairs that are not in 𝒞̂_1 and repeat the procedure until all pairs with estimated zero distances have been grouped. In case there are indices j for which none of the estimated paired distances are zero, we keep them as singletons, that is, groups consisting of a single element. This gives the estimated groups 𝒞̂_i : i = 1, …, k̂, with k̂ giving an estimate of k_0.
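A minimal sketch of this grouping step is given below (illustrative code; the tolerance argument is our own practical addition, since iterative solvers return pairwise differences that are only numerically zero).

import numpy as np

# Sketch of the grouping step: rows of B_star whose pairwise differences are
# (numerically) zero are merged into one estimated group.
def extract_clusters(B_star, tol=1e-6):
    p = B_star.shape[0]
    labels = np.full(p, -1, dtype=int)
    k = 0
    for j1 in range(p):
        if labels[j1] >= 0:
            continue                                   # already assigned to a group
        labels[j1] = k
        for j2 in range(j1 + 1, p):
            if labels[j2] < 0 and np.linalg.norm(B_star[j1] - B_star[j2]) < tol:
                labels[j2] = k                         # same estimated group C_hat
        k += 1
    return labels, k                                   # k is the estimate of k0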

Traditional clustering methodologies like K-means have many limitations. In most cases, we have to pre-specify the number of clusters, and we may end up at a local minimum instead of the global one. The advantage of clustering by solving (2.3) for a range of λ is that we get a solution path from at most p singleton clusters to a single cluster consisting of all the elements. Subsequently, we can fix a criterion function that enables us to find the optimal cluster assignment among all models in the solution path. Hence, not only do we avoid fixing the number of clusters beforehand, we also avoid the problem of getting stuck at a local minimum. Several efficient algorithms have been developed in recent years to solve (2.3) when the penalty function ρ is convex, e.g., ρ(b_{j_1}, b_{j_2}) = ‖b_{j_1} − b_{j_2}‖_p for some p ≥ 1. Pelckmans et al. (2005)[9], Lindsten et al. (2011)[7], Hocking et al. (2011)[5] and others proposed this approach and established it to be more robust and scalable than traditional methods. Lindsten et al. (2011)[7] used the off-the-shelf convex solver CVX to solve the convex clustering problem, which suffers from scalability issues. Theoretical perfect cluster recovery conditions were derived by Zhu et al. (2014)[16] only for two clusters, while Panahi et al. (2017)[8] derived perfect recovery conditions for a general number of clusters k, but under the assumption of uniform weights. Sun et al. (2021)[13] provide sufficient conditions for perfect recovery under more general weight choices. They also developed a faster algorithm, a semismooth Newton based augmented Lagrangian method (SS-NAL), and derived convergence criteria for it.

Chi and Lange (2015)[3] developed an elegant method of solving the convex clustering problem via augmented Lagrangian methods. For ρ(x) = ‖x‖_2, we first view (2.3) as the following constrained optimization problem

min\displaystyle\min 12j=1p𝝅^j𝐛j22+λlwl𝐯l2\displaystyle\dfrac{1}{2}\sum_{j=1}^{p}\lVert\hat{\bm{\pi}}_{j}-\mathbf{b}_{j}\rVert_{2}^{2}+\lambda\sum_{l\in\mathcal{E}}w_{l}\lVert\mathbf{v}_{l}\rVert_{2} (2.4)
subject to 𝐛l1𝐛l2𝐯l=0;\displaystyle\mathbf{b}_{l_{1}}-\mathbf{b}_{l_{2}}-\mathbf{v}_{l}=0;

where ℰ is the set of all distinct edges {l : l = (l_1, l_2), l_1 < l_2, w_l > 0}. Here, a new splitting variable v_l has been introduced to capture the difference between the group centroids, which makes the optimization procedure much easier. Chi and Lange (2015)[3] used two algorithms to solve this constrained optimization problem, namely ADMM and AMA. For both algorithms, one first has to form the augmented Lagrangian as follows:

ν(𝐁,𝐕,𝚪)=\displaystyle\mathcal{L}_{\nu}(\mathbf{B},\mathbf{V},\bm{\Gamma})= 12j=1p𝝅^j𝐛j22+λlwl𝐯l2\displaystyle\dfrac{1}{2}\sum_{j=1}^{p}\lVert\hat{\bm{\pi}}_{j}-\mathbf{b}_{j}\rVert_{2}^{2}+\lambda\sum_{l\in\mathcal{E}}w_{l}\lVert\mathbf{v}_{l}\rVert_{2} (2.5)
+l𝜸l,𝐯l𝐛l1+𝐛l2+ν2l𝐯l𝐛l1+𝐛l222,\displaystyle+\sum_{l\in\mathcal{E}}\langle\bm{\gamma}_{l},\mathbf{v}_{l}-\mathbf{b}_{l_{1}}+\mathbf{b}_{l_{2}}\rangle+\dfrac{\nu}{2}\sum_{l\in\mathcal{E}}\lVert\mathbf{v}_{l}-\mathbf{b}_{l_{1}}+\mathbf{b}_{l_{2}}\rVert_{2}^{2},

where B, V and Γ are the matrices with b_j, v_l and γ_l, for j = 1, …, p and l ∈ ℰ, in their columns, respectively. Splitting the variables in this fashion allows us to update B, V and Γ sequentially, given the other variables. The convergence of ADMM does not depend on the choice of ν; it converges for any ν > 0. On the other hand, AMA converges for any 0 < ν < 2/p. The performance of the two algorithms has been compared, and AMA is found to be much faster than ADMM, especially when the weights are sparse. The major advantage of AMA lies in its inherent structure. After some basic algebra, Chi and Lange (2015)[3] showed that we only need to update B and Γ in every step, and we can bypass updating V by applying a linear relationship among the variables. Since AMA is much faster, we use this method in our analysis. Suppose B^(t) and Γ^(t) are the parameter values at the t-th step. The updates in the next step are computed using the following relations:

𝐛j(t+1)\displaystyle\mathbf{b}_{j}^{(t+1)} =𝝅^j+l1=j𝜸l(t)l2=j𝜸l(t)\displaystyle=\hat{\bm{\pi}}_{j}+\sum_{l_{1}=j}\bm{\gamma}_{l}^{(t)}-\sum_{l_{2}=j}\bm{\gamma}_{l}^{(t)}
𝜸l(t+1)\displaystyle\bm{\gamma}_{l}^{(t+1)} =𝒫Cl(𝜸l(t)ν𝐠l(t+1))\displaystyle=\mathcal{P}_{C_{l}}(\bm{\gamma}_{l}^{(t)}-\nu\mathbf{g}_{l}^{(t+1)})

where g_l^(t+1) = b_{l_1}^(t+1) − b_{l_2}^(t+1), C_l = {γ_l : ‖γ_l‖_2 ≤ λ w_l}, and 𝒫_A(x) is the projection of x onto the set A. We continue until convergence. The convergence criterion is discussed in detail in Chi and Lange (2015)[3], using the dual problem and the duality gap.

Algorithm 1 AMA

Initialize 𝚪(0)\bm{\Gamma}^{(0)}

1:for t=1,2,3,t=1,2,3,... do
2:     for j=1,2,3,,pj=1,2,3,...,p do
3:         Δ_j^(t) = Σ_{l_1=j} γ_l^(t−1) − Σ_{l_2=j} γ_l^(t−1)
4:     end for
5:     for all ll do
6:         𝐠l(t)=𝝅^l1𝝅^l2+𝚫l1(t)𝚫l2(t)\mathbf{g}_{l}^{(t)}=\hat{\bm{\pi}}_{l_{1}}-\hat{\bm{\pi}}_{l_{2}}+\bm{\Delta}_{l_{1}}^{(t)}-\bm{\Delta}_{l_{2}}^{(t)}
7:         𝜸l(t)=𝒫Cl(𝜸l(t1)ν𝐠l(t))\bm{\gamma}_{l}^{(t)}=\mathcal{P}_{C_{l}}(\bm{\gamma}_{l}^{(t-1)}-\nu\mathbf{g}_{l}^{(t)})
8:     end for
9:end for
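For reference, a compact Python sketch of the AMA updates above is given below. This is illustrative code under our own naming and stopping rule, not the authors' implementation; it forms the edge set from the positive weights, uses ν slightly below the bound 2/p, projects each dual variable onto C_l, and recovers b_j = π̂_j + Δ_j from the final duals.

import numpy as np

# Minimal sketch of the AMA iteration of Algorithm 1 (illustrative, not the authors' code).
# pi_hat is p x d; W is the symmetric weight matrix; edges are the pairs with w_l > 0.
def ama_convex_cluster(pi_hat, W, lam, nu=None, max_iter=2000, tol=1e-8):
    p, d = pi_hat.shape
    edges = [(i, j) for i in range(p) for j in range(i + 1, p) if W[i, j] > 0]
    if nu is None:
        nu = 1.9 / p                                   # AMA converges for 0 < nu < 2/p
    gamma = np.zeros((len(edges), d))                  # dual variables gamma_l
    for _ in range(max_iter):
        delta = np.zeros((p, d))                       # Delta_j = sum_{l1=j} gamma_l - sum_{l2=j} gamma_l
        for idx, (l1, l2) in enumerate(edges):
            delta[l1] += gamma[idx]
            delta[l2] -= gamma[idx]
        gamma_new = np.empty_like(gamma)
        for idx, (l1, l2) in enumerate(edges):
            g_l = pi_hat[l1] - pi_hat[l2] + delta[l1] - delta[l2]
            z = gamma[idx] - nu * g_l
            radius = lam * W[l1, l2]                   # project onto C_l = {||gamma_l||_2 <= lam w_l}
            norm_z = np.linalg.norm(z)
            gamma_new[idx] = z if norm_z <= radius else z * (radius / norm_z)
        converged = np.max(np.abs(gamma_new - gamma)) < tol    # crude stopping rule
        gamma = gamma_new
        if converged:
            break
    delta = np.zeros((p, d))                           # recover b_j = pi_hat_j + Delta_j
    for idx, (l1, l2) in enumerate(edges):
        delta[l1] += gamma[idx]
        delta[l2] -= gamma[idx]
    return pi_hat + delta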

2.3 Selection of the Tuning Parameter

So far, we have discussed numerical methods to solve (2.3) for a given λ. However, it is important to choose an optimal value of λ for the optimization problem. In this section, we propose a data-driven method to select this tuning parameter using the BIC criterion. For a given λ, denote the obtained clusters by 𝒞̂_1(λ), …, 𝒞̂_{k_λ}(λ), where k_λ is the number of clusters. Define the common transition probability for the m-tuples in the estimated group 𝒞̂_α(λ) as

R^α,a(λ)=σj𝒞^α(λ)Nσj,aσj𝒞^α(λ)Nσj=N𝒞^α(λ),aN𝒞^α(λ)α=1,,kλ;aΣ.\hat{R}^{(\lambda)}_{\alpha,a}=\dfrac{\sum_{\sigma_{j}\in\hat{\mathcal{C}}_{\alpha}(\lambda)}N_{\sigma_{j},a}}{\sum_{\sigma_{j}\in\hat{\mathcal{C}}_{\alpha}(\lambda)}N_{\sigma_{j}}}=\dfrac{N_{\hat{\mathcal{C}}_{\alpha}(\lambda),a}}{N_{\hat{\mathcal{C}}_{\alpha}(\lambda)}}\quad\quad\forall\alpha=1,...,k_{\lambda};a\in\Sigma.

The log-likelihood of the observations under the obtained cluster assignment for a particular λ\lambda is given by

n(λ)=α=1kλaΣN𝒞^α(λ),alogR^α,a(λ).\ell_{n}(\lambda)=\sum_{\alpha=1}^{k_{\lambda}}\sum_{a\in\Sigma}N_{\hat{\mathcal{C}}_{\alpha}(\lambda),a}\log\hat{R}^{(\lambda)}_{\alpha,a}.

Hence, the BIC score corresponding to the obtained model is

BICn(λ)=2n(λ)+kλ(|Σ|1)logn.BIC_{n}(\lambda)=-2\ell_{n}(\lambda)+k_{\lambda}(|\Sigma|-1)\log n.

By a grid search over a range of possible λ\lambda values, we select the λ\lambda for which BIC is minimized. The solution of the equation (2.3) corresponding to that λ\lambda is considered as the estimated cluster assignment. In the next section, we provide theoretical results which will demonstrate that for a range of λ\lambda values, we will be able to perfectly recover the true clusters for large nn.
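Putting the pieces together, the tuning parameter selection of this section can be sketched as follows (illustrative code reusing the ama_convex_cluster and extract_clusters sketches given earlier; counts denotes the p × d matrix of transition counts N_{σ_j,a} and n the sequence length).

import numpy as np

# Sketch of the BIC-based choice of lambda over a grid (Section 2.3).
def select_lambda_bic(counts, pi_hat, W, lambda_grid, n):
    d = counts.shape[1]
    best = None
    for lam in lambda_grid:
        B_star = ama_convex_cluster(pi_hat, W, lam)
        labels, k_lam = extract_clusters(B_star)
        loglik = 0.0
        for alpha in range(k_lam):
            group_counts = counts[labels == alpha].sum(axis=0)     # N_{C_hat_alpha, a}
            R_hat = group_counts / group_counts.sum()              # pooled transition estimate
            nz = group_counts > 0
            loglik += np.sum(group_counts[nz] * np.log(R_hat[nz]))
        bic = -2.0 * loglik + k_lam * (d - 1) * np.log(n)
        if best is None or bic < best[0]:
            best = (bic, lam, labels, k_lam)
    return best    # (BIC value, selected lambda, cluster labels, number of clusters)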

3 Conditions and Theoretical Results

3.1 Conditions

Consider solving (2.3) with ρ(b_{j_1}, b_{j_2}) = ‖b_{j_1} − b_{j_2}‖_2, and denote the optimal solution by b_i^*(λ), for i = 1, 2, …, p. Let the true partition of the state space Σ^m be {𝒞_1, …, 𝒞_{k_0}}, with corresponding transition probability vectors R_1, …, R_{k_0}, such that R_{ℓ,a} = P(X_{t+1} = a | X̃_t^(m) = σ_j) for σ_j ∈ 𝒞_ℓ. Denote by p_ℓ = |𝒞_ℓ| the size of the ℓ-th group. Following the notation of Sun et al. (2021)[13], define

wi(β)=j𝒞βwi,ji=1,2,,p;μi,j(α)=β=1,βαk0|wi(β)wj(β)|α=1,2,,k0;\displaystyle w_{i}^{(\beta)}=\sum_{j\in{\cal C}_{\beta}}w_{i,j}\quad\forall i=1,2,...,p;\quad\quad\mu_{i,j}^{(\alpha)}=\sum_{\beta=1,\beta\neq\alpha}^{k_{0}}\big{\lvert}w_{i}^{(\beta)}-w_{j}^{(\beta)}\big{\rvert}\quad\quad\forall\alpha=1,2,...,k_{0};
w(α,β)=i𝒞αj𝒞βwi,jαβ,α,β{1,2,,k0};𝝅¯^(α)=1pαi𝒞α𝝅^i;\displaystyle w^{(\alpha,\beta)}=\sum_{i\in{\cal C}_{\alpha}}\sum_{j\in{\cal C}_{\beta}}w_{i,j}\quad\forall\alpha\neq\beta,\alpha,\beta\in\{1,2,...,k_{0}\};\quad\quad\hat{\bar{\bm{\pi}}}^{(\alpha)}=\dfrac{1}{p_{\alpha}}\sum_{i\in{\cal C}_{\alpha}}\hat{\bm{\pi}}_{i};
λmin(n)=max1αk0maxi,j𝒞α{𝝅^i𝝅^j2pαwi,jμi,j(α)};λmax(n)=min1α<βk0{𝝅¯^(α)𝝅¯^(β)21pαlαw(α,l)+1pβlβw(β,l)}.\displaystyle\lambda_{\text{min}}^{(n)}=\max_{1\leq\alpha\leq k_{0}}\max_{i,j\in{\cal C}_{\alpha}}\Bigg{\{}\dfrac{\lVert\hat{\bm{\pi}}_{i}-\hat{\bm{\pi}}_{j}\rVert_{2}}{p_{\alpha}w_{i,j}-\mu_{i,j}^{(\alpha)}}\Bigg{\}};\quad\lambda_{\text{max}}^{(n)}=\min_{1\leq\alpha<\beta\leq k_{0}}\Bigg{\{}\dfrac{\lVert\hat{\bar{\bm{\pi}}}^{(\alpha)}-\hat{\bar{\bm{\pi}}}^{(\beta)}\rVert_{2}}{\frac{1}{p_{\alpha}}\sum_{l\neq\alpha}w^{(\alpha,l)}+\frac{1}{p_{\beta}}\sum_{l\neq\beta}w^{(\beta,l)}}\Bigg{\}}.

Suppose, the following conditions hold.
(A1)  wj1,j2=wj2,j1w_{j_{1},j_{2}}=w_{j_{2},j_{1}} and wj1,j2>0w_{j_{1},j_{2}}>0 for any j1,j2𝒞j_{1},j_{2}\in{\cal C}_{\ell}, =1,2,,k0\ell=1,2,...,k_{0}.
(A2)  pαwi,j>μi,j(α)p_{\alpha}w_{i,j}>\mu_{i,j}^{(\alpha)}, i,j𝒞α\forall i,j\in{\cal C}_{\alpha} and α=1,2,,k0\forall\alpha=1,2,...,k_{0}.

In other words, condition (A1) requires symmetry of the weights and that the weight between two m-tuples belonging to the same group be positive. On the other hand, (A2) determines a lower bound on the weight between two m-tuples in a particular group. We will use these conditions to prove our results. Before stating the main results, we recall a few auxiliary results.
Proposition 1 (Consistency of Estimated Transition Probabilities):  Suppose X_1, …, X_n is an SMM of order m, with partition {𝒞_1, …, 𝒞_{k_0}}. Then, as n → ∞,
(a) 𝝅^j𝑝𝑹α\hat{\bm{\pi}}_{j}\xrightarrow[]{p}\bm{R}_{\alpha} for j𝒞αj\in\mathcal{C}_{\alpha};
(b) NσjN𝑝qj\dfrac{N_{\sigma_{j}}}{N}\xrightarrow[]{p}q_{j}, where qjq_{j} is the stationary probability of the state σj\sigma_{j};
(c)

Nσj(𝝅^j𝑹α)𝑑𝒩(𝟎,Σα)\sqrt{N_{\sigma_{j}}}(\hat{\bm{\pi}}_{j}-\bm{R}_{\alpha})\xrightarrow[]{d}\mathcal{N}(\bm{0},\Sigma_{\alpha})

where Σα=diag(𝑹α)𝑹α𝑹α(T)\Sigma_{\alpha}=diag(\bm{R}_{\alpha})-\bm{R}_{\alpha}\bm{R}_{\alpha}^{(T)}. Since Σα\Sigma_{\alpha} is of rank |Σ|1|\Sigma|-1, the asymptotic Normal distribution is singular.
Proposition 2 (Sun et al. (2021)):  Suppose the above conditions hold, and λmin(n)<λmax(n)\lambda_{\text{min}}^{(n)}<\lambda_{\text{max}}^{(n)}. Then for any λ(λmin(n),λmax(n))\lambda\in(\lambda_{\text{min}}^{(n)},\lambda_{\text{max}}^{(n)}), 𝒃i(λ)=𝒃j(λ)\bm{b}_{i}^{*}(\lambda)=\bm{b}_{j}^{*}(\lambda) for i,j𝒞αi,j\in\mathcal{C}_{\alpha}; α=1,..,k0\alpha=1,..,k_{0} and 𝒃i(λ)𝒃j(λ)\bm{b}_{i}^{*}(\lambda)\neq\bm{b}_{j}^{*}(\lambda) for any i𝒞α,j𝒞β,αβi\in\mathcal{C}_{\alpha},j\in\mathcal{C}_{\beta},\alpha\neq\beta. In other words, for any λ(λmin(n),λmax(n))\lambda\in(\lambda_{\text{min}}^{(n)},\lambda_{\text{max}}^{(n)}), we recover the true partition of the state space.
Thus, Proposition 1 tells us about the weak consistency and the CLT for the transition probability vectors. Proposition 2 deals with perfect recovery conditions under general weight choices, derived in Sun et al. (2021). These propositions will be the key tools in proving our results. In the next section, we provide our major theoretical findings.

3.2 Main results

The first result ensures that the solution of the objective function in (2.4) produces a valid probability distribution over Σ. Note that we do not use this solution as the estimated transition probability of a group. However, if under some exceptional circumstances the solution exhibits oracle properties, one can use the solutions as the common transition probabilities of the groups.
Theorem 1:  For any λ > 0, the optimal solution b_i^*(λ) is a valid probability distribution for i = 1, 2, …, p; i.e.
(a) bi,a(λ)0b^{*}_{i,a}(\lambda)\geq 0 for a=1,,da=1,...,d,
(b) a=1dbi,a(λ)=1\sum_{a=1}^{d}b^{*}_{i,a}(\lambda)=1.
Next, we derive the probability of true cluster recovery. There are two steps involved in this process. First, we need the true model to be in the solution path over varying λ. This requires the conditions of Proposition 2 to be satisfied, i.e., λ_min^(n) < λ_max^(n). Theorem 2 gives a lower bound on the probability of the true model being present in the solution path. It shows that, as the sequence length increases, the probability of imperfect recovery converges to 0 at an exponential rate.
Theorem 2:  Define

δ\displaystyle\delta =min1α<βk0𝑹α𝑹β2;δ1=min1αk0mini,j𝒞α(pαwi,jμi,j(α))\displaystyle=\min_{1\leq\alpha<\beta\leq k_{0}}\lVert\bm{R}_{\alpha}-\bm{R}_{\beta}\rVert_{2};\quad\quad\delta_{1}=\min_{1\leq\alpha\leq k_{0}}\min_{i,j\in\mathcal{C}_{\alpha}}\big{(}p_{\alpha}w_{i,j}-\mu_{i,j}^{(\alpha)}\big{)}
δ2\displaystyle\delta_{2} =max1α<βk0(1pαlαw(α,l)+1pβlβw(β,l))\displaystyle=\max_{1\leq\alpha<\beta\leq k_{0}}\Big{(}\frac{1}{p_{\alpha}}\sum_{l\neq\alpha}w^{(\alpha,l)}+\frac{1}{p_{\beta}}\sum_{l\neq\beta}w^{(\beta,l)}\Big{)}

Then, under the above conditions (A1) and (A2), as nn\to\infty,

P(λmin(n)<λmax(n))\displaystyle P\Big{(}\lambda_{\text{min}}^{(n)}<\lambda_{\text{max}}^{(n)}\Big{)} P(𝝅^j𝑹α2<ϵ2;j𝒞α,α=1,,k0)\displaystyle\geq P\Big{(}\lVert\hat{\bm{\pi}}_{j}-\bm{R}_{\alpha}\rVert_{2}<\dfrac{\epsilon}{2};\forall j\in\mathcal{C}_{\alpha},\forall\alpha=1,...,k_{0}\Big{)}
1α=1k0C1(α)j𝒞αexp[Nϵ2C2,j]\displaystyle\geq 1-\sum_{\alpha=1}^{k_{0}}C_{1}^{(\alpha)}\sum_{j\in\mathcal{C}_{\alpha}}\exp\Big{[}-N\epsilon^{2}C_{2,j}\Big{]}

for 0<ϵ<δδ1δ1+δ20<\epsilon<\dfrac{\delta\delta_{1}}{\delta_{1}+\delta_{2}}, and for some constants C1(α),C2,j>0C_{1}^{(\alpha)},C_{2,j}>0.
Once we have the true model in the solution path, the next step is to establish that the probability of selecting that model through the BIC criterion converges to 1 as n → ∞. The next theorem establishes this.
Theorem 3:  Suppose the conditions of Theorem 2 hold, and λ_min^(n) < λ_max^(n). For any λ, denote the clustering assignment obtained by minimizing (2.3) as M_λ = {𝒞̂_1(λ), …, 𝒞̂_{k_λ}(λ)}, where k_λ is the number of clusters. Let ℓ_n(λ) be the log-likelihood of the observations corresponding to the cluster assignment M_λ, with corresponding BIC score BIC_n(λ) = −2ℓ_n(λ) + k_λ(d−1) log n. Choose some λ_0 ∈ (λ_min^(n), λ_max^(n)). Then, for any λ such that M_λ ≠ M_{λ_0},

P(BICn(λ0)<BICn(λ))1P\Big{(}BIC_{n}(\lambda_{0})<BIC_{n}(\lambda)\Big{)}\longrightarrow 1

as nn\to\infty.
Next, we provide some sufficient conditions for perfect cluster recovery under a particular weight choice using Gaussian kernels. These weights have been used in previous analyses of convex clustering and have been shown to produce good clustering results.
Theorem 4:  Define pmin=minαpα,pmax=maxαpαp_{min}=\min_{\alpha}p_{\alpha},p_{max}=\max_{\alpha}p_{\alpha}. Suppose the following assumptions hold.
(a) wi,j=eϕ𝝅^i𝝅^j22li,jkw_{i,j}=e^{-\phi\lVert\hat{\bm{\pi}}_{i}-\hat{\bm{\pi}}_{j}\rVert_{2}^{2}}l^{k}_{i,j}, where li,jkl^{k}_{i,j} is the indicator function that 𝝅^i\hat{\bm{\pi}}_{i} belongs to kk nearest neighbour of 𝝅^j\hat{\bm{\pi}}_{j} or vice versa, for some ϕ>0\phi>0.
(b) kpmax1k\geq p_{max}-1.
(c) For some ϵ<ϵmax=δ212ϕδlog(2(k+1pmin1))\epsilon<\epsilon_{max}=\dfrac{\delta}{2}-\dfrac{1}{2\phi\delta}\log\Big{(}2\Big{(}\dfrac{k+1}{p_{min}}-1\Big{)}\Big{)}, 𝝅^j𝑹α2<ϵ2;\lVert\hat{\bm{\pi}}_{j}-\bm{R}_{\alpha}\rVert_{2}<\dfrac{\epsilon}{2}; j𝒞α\forall j\in\mathcal{C}_{\alpha}, α=1,,k0\forall\alpha=1,...,k_{0}.
Under the above assumptions, the conditions (A1) and (A2) are satisfied. Moreover, δ1pmineϕϵmax22(k+1pmin)eϕ(δϵmax)2=δ1(min)\delta_{1}\geq p_{min}e^{-\phi\epsilon_{max}^{2}}-2(k+1-p_{min})e^{-\phi(\delta-\epsilon_{max})^{2}}=\delta^{(min)}_{1}, δ22(k+1pmin)eϕ(δϵmax)2=δ2(max).\delta_{2}\leq 2(k+1-p_{min})e^{-\phi(\delta-\epsilon_{max})^{2}}=\delta_{2}^{(max)}.

Corollary 1:  Under the assumptions of Theorem 3,
(a) λmin(n)ϵδ1(min)\lambda_{\text{min}}^{(n)}\leq\dfrac{\epsilon}{\delta^{(min)}_{1}}, λmax(n)δϵδ2(max)\lambda_{\text{max}}^{(n)}\geq\dfrac{\delta-\epsilon}{\delta_{2}^{(max)}}.
(b) ϵ<min{ϵmax,δδ1δ1+δ2}λmin(n)<λmax(n)\epsilon<\min\Big{\{}\epsilon_{max},\dfrac{\delta\delta_{1}}{\delta_{1}+\delta_{2}}\Big{\}}\implies\lambda_{\text{min}}^{(n)}<\lambda_{\text{max}}^{(n)}.
Corollary 2:  For balanced design, i.e. when pα=p/k0p_{\alpha}=p/k_{0} are the same for all groups 𝒞α\mathcal{C}_{\alpha}, δ2(max)=0\delta_{2}^{(max)}=0 if k=p/k01k=p/k_{0}-1. Hence, for any ϵ<ϵmax\epsilon<\epsilon_{max}, perfect recovery is possible for λ(λmin,)\lambda\in(\lambda_{\text{min}},\infty).
We omit the proof of this theorem, as it is an algebraic derivation of the terms defined earlier for a particular weight choice. Corollary 1 gives an idea of how close the empirical transition probability of each m-tuple should be to the true probability. Corollary 2 covers the special case of a balanced design. In that scenario, for the correct choice of the number of nearest neighbours, the true model can be retrieved for a wide range of the tuning parameter λ, yielding very good clustering results. We show in the simulation studies how the weight choices impact clustering accuracy.

4 Simulation

In this section, we numerically demonstrate the performance of the convex clustering methodology described in the previous section in terms of recovering the true cluster assignments. We emphasize the choice of the weights w_{i,j} and compare clustering performance for different weight choices. For illustration, we consider SMMs of various orders, sequence lengths and alphabet sizes |Σ|. Note that we do not pre-specify the number of clusters or the cluster labels in our method, so it is not feasible to compute a straightforward misclassification rate against the truth. Instead, we use the widely used Rand Index (RI) and Adjusted Rand Index (ARI) to measure the similarity between the true and estimated cluster assignments.

The Rand Index computes the proportion of pairs (i,j) that are correctly identified as belonging to the same cluster or to different clusters. Mathematically, for any two cluster assignments X = (X_1, …, X_r) and Y = (Y_1, …, Y_s) of the elements (σ_1, …, σ_p), the Rand Index is defined by

RI=a+ba+b+c+d=a+b(p2)RI=\dfrac{a+b}{a+b+c+d}=\dfrac{a+b}{\binom{p}{2}}

where a is the number of pairs that are in the same cluster in both X and Y, b is the number of pairs that are in different clusters in both X and Y, c is the number of pairs that are in the same cluster of X but in different clusters of Y, and d is the number of pairs that are in the same cluster of Y but in different clusters of X. Values of RI vary between 0 and 1. If the two clusterings are identical, RI equals 1. Higher RI values indicate greater similarity between the two clusterings.

However, the Rand Index has some limitations. For example, if the number of clusters increases and the cluster sizes are not large, RI will be close to 1 even for two completely different cluster assignments. To address this issue, the Adjusted Rand Index (ARI) is preferred. ARI is a corrected version of the usual Rand Index, which uses the expected similarity of all pairwise comparisons between clusterings specified by a random model. If a_i = |X_i|, b_j = |Y_j|, p_{ij} = |X_i ∩ Y_j|, then ARI is computed by the following formula:

ARI=i,j(pij2)[i(ai2)j(bj2)]/(p2)12[i(ai2)+j(bj2)][i(ai2)j(bj2)]/(p2).ARI=\dfrac{\sum\limits_{i,j}\binom{p_{ij}}{2}-\big{[}\sum\limits_{i}\binom{a_{i}}{2}\sum\limits_{j}\binom{b_{j}}{2}\big{]}/\binom{p}{2}}{\frac{1}{2}\big{[}\sum\limits_{i}\binom{a_{i}}{2}+\sum\limits_{j}\binom{b_{j}}{2}\big{]}-\big{[}\sum\limits_{i}\binom{a_{i}}{2}\sum\limits_{j}\binom{b_{j}}{2}\big{]}/\binom{p}{2}}.
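The following sketch (our own illustrative code) computes RI from the pair counts and ARI from the contingency table p_{ij} = |X_i ∩ Y_j|, following the formulas above.

import numpy as np
from math import comb

# Sketch computing RI and ARI from two label vectors (any comparable cluster ids).
def rand_indices(labels_x, labels_y):
    labels_x, labels_y = np.asarray(labels_x), np.asarray(labels_y)
    p = len(labels_x)
    same_x = labels_x[:, None] == labels_x[None, :]
    same_y = labels_y[:, None] == labels_y[None, :]
    iu = np.triu_indices(p, k=1)                       # all pairs i < j
    agree = (same_x[iu] == same_y[iu]).sum()           # a + b
    ri = agree / comb(p, 2)
    # contingency table n_ij = |X_i intersect Y_j| for the ARI formula
    xs = np.unique(labels_x, return_inverse=True)[1]
    ys = np.unique(labels_y, return_inverse=True)[1]
    cont = np.zeros((xs.max() + 1, ys.max() + 1))
    np.add.at(cont, (xs, ys), 1)
    sum_ij = sum(comb(int(n), 2) for n in cont.ravel())
    sum_a = sum(comb(int(n), 2) for n in cont.sum(axis=1))
    sum_b = sum(comb(int(n), 2) for n in cont.sum(axis=0))
    expected = sum_a * sum_b / comb(p, 2)
    ari = (sum_ij - expected) / (0.5 * (sum_a + sum_b) - expected)
    return ri, ari

ri, ari = rand_indices([0, 0, 1, 1, 2], [0, 0, 1, 2, 2])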

For each simulation study, we compare the cluster assignment estimated by solving (2.3), with an appropriately chosen regularization parameter λ, with the true clustering using both RI and ARI. In particular, we focus on weight choices that result in higher ARI values. Chi and Lange (2015) and Sun et al. (2021) have shown, both numerically and theoretically, that choosing sparse weights substantially improves the clustering quality and makes the algorithm much faster. In our study, we perform the clustering method under different weight choices, both sparse and dense, and compare how the ARI values depend on those choices. In each set-up, we replicate the experiment 1000 times to obtain the mean RI and ARI and their standard errors. We also compute the proportion of times the ARI or RI equals 1, i.e., we empirically estimate the probability of perfect cluster recovery.

4.1 Simulation Set-up 1

Here, we take |Σ| = 4, the usual scenario when analyzing DNA sequences. The order of the chain m is taken to be either 2 or 3. For m = 2, we divide all 16 tuples equally into 4 groups of 4 elements; for m = 3, we divide the 64 triplets into 8 groups of equal size 8. For a particular group g, we generate Z_{g,ℓ} independently from Unif(0,1), for ℓ = 1, 2, 3, 4. The transition probability vector of that group is generated from the Dirichlet distribution with parameter (e^{Z_{g,1}}, …, e^{Z_{g,4}}). As for the weights, we first take w_{i,j} = 1 for all i, j = 1, 2, …, p, i < j. Next, we choose sparse weights depending on the distance between the estimated transition probabilities π̂_i and π̂_j. We use the k-nearest-neighbour based weights proposed in Chi and Lange (2015), namely w_{i,j} = exp(−φ‖π̂_i − π̂_j‖_2^2) l^k_{i,j}, where l^k_{i,j} is the indicator that π̂_i is among the k nearest neighbours of π̂_j or vice versa, for some φ > 0. In this example we use φ = 100. We also incorporate a third choice of weight, namely w_{i,j} = exp(−φ‖π̂_i − π̂_j‖_∞^2) l^{k(∞)}_{i,j}, where l^{k(∞)}_{i,j} is the analogous indicator, but with the nearest neighbours computed with respect to the ℓ_∞ distance. We use two different values of the number of nearest neighbours, k = 5 and k = 3. A sketch of this weight construction is given below; the results are provided in the tables that follow.
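This sketch (illustrative code with our own names) builds the k-nearest-neighbour Gaussian kernel weights w_{i,j} described above; setting ord_ = np.inf gives the ℓ_∞-based variant.

import numpy as np

# w_{i,j} = exp(-phi * ||pi_hat_i - pi_hat_j||^2) if i is among the k nearest
# neighbours of j or vice versa, and 0 otherwise.
def knn_gaussian_weights(pi_hat, k=3, phi=100.0, ord_=2):
    p = pi_hat.shape[0]
    D = np.linalg.norm(pi_hat[:, None, :] - pi_hat[None, :, :], ord=ord_, axis=2)
    W = np.zeros((p, p))
    nn = np.argsort(D, axis=1)[:, 1:k + 1]             # k nearest neighbours of each row (excluding itself)
    for i in range(p):
        for j in nn[i]:
            w = np.exp(-phi * D[i, j] ** 2)
            W[i, j] = W[j, i] = w                      # symmetrize: "i in kNN(j) or vice versa"
    return W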

Sample Size (n) 5000 10000 15000 20000 25000
RI (s.e) 0.769 (0.03) 0.809 (0.03) 0.819 (0.023) 0.833 (0.02) 0.842 (0.02)
ARI (s.e) 0.015 (0.09) 0.141 (0.15) 0.169 (0.12) 0.239 (0.12) 0.295 (0.12)
Prob. of True Recovery 0 0 0 0 0
Table 1: Summary for m=2m=2, Uniform Weight
𝒌\bm{k} nearest neighbour=𝟓\bm{=5}
Sample Size (n) 5000 10000 15000 20000 25000
RI (s.e) 0.900 (0.09) 0.981 (0.04) 0.994 (0.02) 0.997 (0.01) 0.999 (0.01)
ARI (s.e) 0.745 (0.19) 0.946 (0.10) 0.982 (0.05) 0.992 (0.04) 0.997 (0.02)
Prob. of True Recovery 0.223 0.708 0.876 0.951 0.977
𝒌\bm{k} nearest neighbour=𝟑\bm{=3}
Sample Size (n) 5000 10000 15000 20000 25000
RI (s.e) 0.945 (0.07) 0.994 (0.02) 0.999 (0.01) 0.999 (0.005) 1 (0.003)
ARI (s.e) 0.851 (0.17) 0.983 (0.06) 0.995 (0.03) 0.998 (0.02) 0.999 (0.01)
Prob. of True Recovery 0.480 0.908 0.972 0.991 0.996
Table 2: Summary for m=2m=2, k-nearest neighbour clustering with l2l_{2} distance
𝒌\bm{k} nearest neighbour=𝟓\bm{=5}
Sample Size (n) 5000 10000 15000 20000 25000
RI (s.e) 0.889 (0.10) 0.975 (0.04) 0.992 (0.02) 0.996 (0.01) 0.998 (0.01)
ARI (s.e) 0.720 (0.20) 0.928 (0.12) 0.974 (0.06) 0.989 (0.04) 0.994 (0.03)
Prob. of True Recovery 0.187 0.644 0.821 0.922 0.960
𝒌\bm{k} nearest neighbour=𝟑\bm{=3}
Sample Size (n) 5000 10000 15000 20000 25000
RI (s.e) 0.949 (0.07) 0.995 (0.02) 0.999 (0.01) 0.999 (0.005) 1 (0.004)
ARI (s.e) 0.860 (0.17) 0.984 (0.05) 0.995 (0.03) 0.998 (0.02) 0.999 (0.01)
Prob. of True Recovery 0.501 0.908 0.973 0.990 0.996
Table 3: Summary for m=2m=2, k-nearest neighbour clustering with ll_{\infty} distance
Sample Size (n) 5000 10000 15000 20000 25000
RI (s.e) 0.866 (0.03) 0.901 (0.04) 0.933 (0.04) 0.954 (0.03) 0.969 (0.03)
ARI (s.e) 0.199 (0.18) 0.469 (0.26) 0.651 (0.22) 0.777 (0.17) 0.854 (0.15)
Prob. of True Recovery 0 0.003 0.017 0.042 0.128
Table 4: Summary for m=3m=3, Uniform Weight
𝒌\bm{k} nearest neighbour=𝟓\bm{=5}
Sample Size (n) 5000 10000 15000 20000 25000
RI (s.e) 0.982 (0.024) 0.995 (0.007) 0.997 (0.005) 0.999 (0.002) 1 (0.001)
ARI (s.e) 0.935 (0.081) 0.980 (0.027) 0.990 (0.019) 0.997 (0.009) 0.999 (0.004)
Prob. of True Recovery 0.264 0.466 0.714 0.907 0.981
𝒌\bm{k} nearest neighbour=𝟑\bm{=3}
Sample Size (n) 5000 10000 15000 20000 25000
RI (s.e) 0.984 (0.024) 0.994 (0.007) 0.997 (0.005) 0.999 (0.002) 1 (0.001)
ARI (s.e) 0.940 (0.074) 0.979 (0.026) 0.990 (0.019) 0.997 (0.009) 0.999 (0.004)
Prob. of True Recovery 0.223 0.408 0.713 0.908 0.981
Table 5: Summary for m=3m=3, k-nearest neighbour clustering with l2l_{2} distance
𝒌\bm{k} nearest neighbour=𝟓\bm{=5}
Sample Size (n) 5000 10000 15000 20000 25000
RI (s.e) 0.981 (0.025) 0.995 (0.009) 0.998 (0.005) 0.999 (0.002) 1 (0.001)
ARI (s.e) 0.931 (0.085) 0.980 (0.033) 0.990 (0.018) 0.997 (0.009) 0.999 (0.004)
Prob. of True Recovery 0.294 0.508 0.735 0.904 0.980
𝒌\bm{k} nearest neighbour=𝟑\bm{=3}
Sample Size (n) 5000 10000 15000 20000 25000
RI (s.e) 0.983 (0.021) 0.994 (0.007) 0.997 (0.005) 0.999 (0.002) 1 (0.001)
ARI (s.e) 0.937 (0.076) 0.977 (0.029) 0.990 (0.019) 0.997 (0.009) 0.999 (0.004)
Prob. of True Recovery 0.206 0.387 0.709 0.907 0.981
Table 6: Summary for m=3m=3, k-nearest neighbour clustering with ll_{\infty} distance

From this study, it is clear that for both m = 2 and m = 3, the choice of uniform weights results in rather poor performance in terms of recovering the true clusters. Although the ARI increases with increasing n, a very large sample size may be needed to get good results. On the other hand, choosing the weights with k = 3 nearest neighbours performs much better than k = 5 in terms of both ARI and perfect recovery, especially for smaller sample sizes. Note that the model is balanced in this example, and the optimal choice of k is 3 (by Corollary 2), a fact supported by our simulation study. For m = 3, the optimal choice of k is 7, but the choice k = 5 is reliable as well; k = 3 makes the weights too sparse in this case, which results in a slight degradation of the clustering accuracy. For both m = 2 and m = 3, the probability of true recovery increases with increasing n.

This experiment produces quite good results for large n, mostly for n ≥ 10000. It is thus worth investigating under what circumstances we can obtain very good recovery for smaller n, such as n = 1000 or n = 2000. The theoretical results suggest that if the cluster centroids are well separated, clustering performance improves even for smaller sample sizes. In this experiment, we have 4 groups for m = 2, with minimum centroid difference ?? in terms of the ℓ_2 distance and ?? in terms of the ℓ_∞ distance; the corresponding values are ?? and ?? for m = 3. In the next simulation study, we demonstrate how the clustering accuracy improves for well separated centroids.

4.2 Simulation Set-up 2

In this study as well, we take |Σ| = 4 and m = 3. We divide the 64 triplets into four groups of sizes 18, 18, 15 and 13. For the α-th group, R_{α,α} = 0.7 and R_{α,β} = 0.1 for α, β = 1, 2, 3, 4, α ≠ β. For the first weight choice, we use the k = 15 nearest-neighbour weights with respect to the ℓ_2 distance in the Gaussian kernel, with φ = 100. The second choice of weight is w_{i,j} = exp(−φ‖π̂_i − π̂_j‖_∞) l^{k(∞)}_{i,j}, with φ = 10. Note that here we use the ℓ_∞ distance to find the nearest neighbours, but instead of the Gaussian kernel we use natural exponential decay. The third weight is similar to the second one, with the ℓ_1 distance in place of ℓ_∞. Here we use relatively small samples, n = 1000 and n = 2000. The results are reported in the following table.

Weight Choice    Sample Size (n)    RI    ARI    Prob. of True Recovery
l2l_{2} Distance, Gaussian Kernel 1000 0.940 (0.022) 0.816 (0.073) 0
2000 0.984 (0.012) 0.954 (0.034) 0.14
ll_{\infty} Distance, Exponential Kernel 1000 0.969 (0.020) 0.908 (0.059) 0.104
2000 0.994 (0.009) 0.983 (0.025) 0.638
l1l_{1} Distance, Exponential Kernel 1000 0.965 (0.020) 0.893 (0.060) 0.03
2000 0.993 (0.009) 0.979 (0.025) 0.468
Table 7: Summary of Simulation 2

From this experiment, we can infer that the weights based on the ℓ_2 distance in the Gaussian kernel perform poorly compared to those based on the ℓ_∞ or ℓ_1 distance. In particular, the ℓ_∞ distance is very effective in such scenarios, as it measures the maximum element-wise distance between two estimated transition probability vectors. In this way, we are able to separate two vectors that are not likely to be in the same cluster. From this study, we can fairly conclude that when the centroids are well separated in such a way that their difference is large in a small number of coordinates, use of the ℓ_∞ distance can provide the best results.

5 Real Data Analysis

A popular application of higher-order Markov models is the analysis of DNA sequences. Scientists have used ordinary Markov chains, VLMCs, hidden Markov models and many other tools to model such sequences for prediction, classification, gene identification, etc. Here, we use the proposed SMM methodology to classify viruses from samples collected from humans. We consider four different viruses in our study: SARS-CoV-2 (COVID-19), MERS (Middle East Respiratory Syndrome), Dengue and Hepatitis B. We collected samples of 500 individuals from the NCBI database: 200 infected with SARS-CoV-2, 50 with MERS, 100 with Dengue and 150 with Hepatitis B, over different time periods and locations. We were particularly careful in collecting the SARS-CoV-2 samples so as to incorporate different strains of that virus. To ensure this, we used 50 samples from each of four different time-frames: April 2020, September 2020, January 2021 and April 2021. The time-frames were selected on the basis of the spread of certain strains or the peaks of COVID-19 cases.

The NCBI database contains a reference genome sequence for every virus. These reference sequences represent an idealized genome structure for a particular virus species. Note that very small changes in the nucleotide sequence can lead to a very different strain of the same disease. The collected samples, if fully available, are almost identical to the reference sequence. In that case there is no need to fit models; we can simply match the collected sample against the available genome sequences and identify the virus whose reference genome it most closely resembles. In practice, however, there are many occasions where the full sequence is not available: perhaps only a part can be retrieved, or some portions of the data are lost. The challenge lies in that scenario, where we should still be able to identify the correct virus as often as possible. In this experiment, we first build a reference model from the reference sequence of each virus. Next, we randomly select a continuous segment of the genome sequence of each sample, and then compute the likelihood of that segment under each of the 4 reference models. Suppose the i-th model is denoted by P̂_i, i = 1, 2, 3, 4. For any given sequence x = x_1 x_2 … x_n, the likelihood of x under each model is P̂_i(x). We then classify x to argmax_{i=1,…,k} P̂_i(x).
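A sketch of this classification rule is given below (illustrative code; the representation of a fitted reference model as a dictionary from m-tuples to pooled transition probability vectors, and all names, are our own assumptions).

import numpy as np

# Score a query segment x under each fitted reference model and assign it to the
# model with the largest log-likelihood.  A fitted model is represented here as a
# dict mapping each m-tuple to its (pooled, SMM-based) transition probability
# vector over Sigma = {A, C, G, T}.
ALPHABET = {"A": 0, "C": 1, "G": 2, "T": 3}

def log_likelihood(x, model, m, eps=1e-12):
    ll = 0.0
    for t in range(m, len(x)):
        probs = model[tuple(x[t - m:t])]               # shared vector of the history's group
        ll += np.log(probs[ALPHABET[x[t]]] + eps)      # eps guards against log(0)
    return ll

def classify(x, models, orders):
    """models: dict virus_name -> transition dict; orders: dict virus_name -> m."""
    scores = {name: log_likelihood(x, model, orders[name]) for name, model in models.items()}
    return max(scores, key=scores.get)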

The lengths of the reference genome sequences for SARS, MERS, Dengue and Hepatitis B are 29903, 30119, 10735 and 3542 respectively. For SARS and MERS, we fit an SMM of order m = 4, while for the other two viruses we use m = 3. The orders are based on the lengths of the reference sequences. There is also a biological motivation for using m ≥ 3: three consecutive DNA bases form a “codon”, which translates the genetic code into a sequence of amino acids, so it is fair to assume that an SMM of order 3 or more will be able to capture the structure of a virus genome. From the samples, we randomly choose segments comprising 100ε% of the sequence and compute the likelihoods under the 4 models to classify each segment to the most likely virus class. Three different values of ε are used: 0.05, 0.1 and 0.25. The 4×4 confusion matrices for the three scenarios are presented in the following tables.

Observed \ Fitted    SARS-CoV-2    MERS    Dengue    Hepatitis B    Total
SARS-CoV-2           185           0       4         11             200
MERS                 0             50      0         0              50
Dengue               0             0       100       0              100
Hepatitis B          17            46      36        51             150
Table 8: Confusion Matrix for ϵ=0.05\epsilon=0.05, Mis-classification Rate= 22.8%22.8\%
Observed \ Fitted    SARS-CoV-2    MERS    Dengue    Hepatitis B    Total
SARS-CoV-2           193           0       0         7              200
MERS                 0             50      0         0              50
Dengue               0             0       100       0              100
Hepatitis B          5             41      23        81             150
Table 9: Confusion Matrix for ϵ=0.1\epsilon=0.1, Mis-classification Rate= 15.2%15.2\%
Observed \ Fitted    SARS-CoV-2    MERS    Dengue    Hepatitis B    Total
SARS-CoV-2           194           0       0         6              200
MERS                 0             50      0         0              50
Dengue               0             0       100       0              100
Hepatitis B          0             10      0         140            150
Table 10: Confusion Matrix for ϵ=0.25\epsilon=0.25, Mis-classification Rate= 3.2%3.2\%

From the tables, it is clear that for ε = 0.05, i.e., when only 5% of the original sample sequence is retained, the misclassification rate among the Hepatitis B samples is high. This is to be expected, since for Hepatitis B the selected segments have length about 170, whereas the lengths are about 500 for Dengue and 1500 for MERS and SARS. As we increase the proportion, the performance naturally improves. The overall misclassification rates are 0.228, 0.152 and 0.032 for ε = 0.05, 0.10 and 0.25 respectively. So, with only 25% of the sequences, we can correctly identify the true virus in more than 95% of the cases. Even within Hepatitis B, the misclassification error reduces drastically once we have a fairly long sequence from which meaningful inference can be made. Note that we selected the snippets from arbitrary parts of the full sample sequences. Our SMM method utilizes the information from the reference genome sequence in a compact manner, so that it can capture the diversity of structure across different parts of the samples. Overall, this method is successful in such classification problems, which opens up scope for broader research in this area.

6 Conclusion

The proposed method of fitting sparse Markov models can be utilized in many different areas. In the future, we may apply the methodology to text mining, recommender systems or other areas of biostatistics. On the theoretical side, new clustering algorithms can be developed by changing the objective function and the penalty, preferably taking into account the number of occurrences of each m-tuple. This will be useful not only in the SMM set-up but also in general clustering problems. We are also interested in how accurate and computationally efficient our method is for prediction using the SMM structure. We conclude by noting that our method is broadly applicable, inviting a number of related research directions.

Appendix A Proof of Theorem 1

(a) For notational simplicity, we write b*_{i,a}(λ) as b*_{i,a}. Also, denote the objective function in (2.3) by R(B, W). Suppose b*_{i,a} < 0 for some of the (i,a) pairs, i = 1, 2, …, p and a = 1, 2, …, d. Let b**_{i,a} = b*_{i,a} ℐ(b*_{i,a} > 0). Since π̂_{i,a} ≥ 0, we get |π̂_{i,a} − b**_{i,a}| ≤ |π̂_{i,a} − b*_{i,a}|. Also, |b*_{i_1,a} − b*_{i_2,a}| ≥ |b**_{i_1,a} − b**_{i_2,a}|, since the negative elements are shrunk to 0. Hence for any i = 1, 2, …, p,

𝝅^i𝒃i22𝝅^i𝒃i22;𝒃i1𝒃i22𝒃i1𝒃i22.\lVert\hat{\bm{\pi}}_{i}-\bm{b}^{*}_{i}\rVert_{2}^{2}\geq\lVert\hat{\bm{\pi}}_{i}-\bm{b}^{**}_{i}\rVert_{2}^{2};\quad\big{\lVert}\bm{b}^{*}_{i_{1}}-\bm{b}^{*}_{i_{2}}\big{\rVert}_{2}\geq\big{\lVert}\bm{b}^{**}_{i_{1}}-\bm{b}^{**}_{i_{2}}\big{\rVert}_{2}.

Since b*_i ≠ b**_i for at least one i, R(B**, W) < R(B*, W), contradicting the fact that B* is the optimal solution. Hence b*_{i,a} ≥ 0 for all i = 1, …, p and a = 1, …, d.
(b) If we initialize Γ^(0) = 0, we get b_i^(1) = π̂_i, which satisfies Σ_{a=1}^d b^(1)_{i,a} = 1. Subsequently, γ_l^(1) = 𝒫_{C_l}(γ_l^(0) − ν g_l^(1)) = (γ_l^(0) − ν g_l^(1)) min{1, λ w_l / ‖γ_l^(0) − ν g_l^(1)‖_2}, and thus γ_l^(1)T 1 = 0. Using similar arguments, for any iteration t, Σ_{a=1}^d b^(t)_{i,a} = 1. Hence the limiting quantity retains the property that the elements of b_i sum to 1. This completes the proof that b_i^* is indeed a probability distribution.

Appendix B Proof of Theorem 2

Note that as nn\to\infty, NσjN_{\sigma_{j}}\to\infty. Let qjq_{j} be the stationary probability of the state σj\sigma_{j}. Then, Nσj/N𝑝qjN_{\sigma_{j}}/N\xrightarrow[]{p}q_{j} as nn\to\infty; and for j𝒞αj\in{\cal C}_{\alpha}, we have

Nσj(𝝅^j𝑹α)𝑑𝒩(𝟎,Σα)\displaystyle\sqrt{N_{\sigma_{j}}}\big{(}\hat{\bm{\pi}}_{j}-\bm{R}_{\alpha}\big{)}\xrightarrow[]{d}\mathcal{N}(\bm{0},\Sigma_{\alpha})
\displaystyle\implies N(𝝅^j𝑹α)𝑑𝒩(𝟎,qjΣα)\displaystyle\sqrt{N}\big{(}\hat{\bm{\pi}}_{j}-\bm{R}_{\alpha}\big{)}\xrightarrow[]{d}\mathcal{N}(\bm{0},q_{j}\Sigma_{\alpha})

where Σα=diag(𝑹α)𝑹α𝑹α(T)\Sigma_{\alpha}=diag(\bm{R}_{\alpha})-\bm{R}_{\alpha}\bm{R}_{\alpha}^{(T)}.
Looking at the expressions for λ_min^(n) and λ_max^(n), it is evident that λ_min^(n) shrinks towards 0 as n increases, since the estimated transition probability vectors π̂_i and π̂_j belonging to the same cluster 𝒞_α become closer to each other. On the other hand, the averages π̄̂^(α) and π̄̂^(β) of the estimated transition probabilities in different groups tend to separate from each other, leading λ_max^(n) to converge to a positive number, and eventually we get λ_min^(n) < λ_max^(n). These expressions also tell us that, in order to have perfect recovery of the clusters, a scaled version of the maximum within-group deviation of the transition probabilities should be less than a scaled version of the minimum between-group variation. These scales depend heavily on the choice of the weights w_{i,j}. Note that if we choose the weights so that w_{i,j} is higher when π̂_i and π̂_j are closer (and potentially belong to the same cluster), and lower when they are far apart (and potentially belong to different clusters), then the denominator of λ_min^(n) will be larger and the denominator of λ_max^(n) will be smaller in the ideal scenario. Hence, such a choice of weights helps to separate λ_min^(n) and λ_max^(n), increasing the chance of recovering the true cluster assignment.

The proof mainly relies on calculating the probability of 𝝅^j\hat{\bm{\pi}}_{j} and 𝑹α\bm{R}_{\alpha}, j𝒞αj\in{\cal C}_{\alpha} being close to each other. Suppose 𝝅^j𝑹α2<ϵ/2\lVert\hat{\bm{\pi}}_{j}-\bm{R}_{\alpha}\rVert_{2}<\epsilon/2 for some ϵ>0\epsilon>0, and j𝒞α\forall j\in{\cal C}_{\alpha}, α=1,2,k0\alpha=1,2...,k_{0}. In that case,

λmin(n)\displaystyle\lambda_{\text{min}}^{(n)} <ϵ/2min1αk0mini,j𝒞α(pαwi,jμi,j(α))\displaystyle<\dfrac{\epsilon/2}{\min\limits_{1\leq\alpha\leq k_{0}}\min\limits_{i,j\in{\cal C}_{\alpha}}\big{(}p_{\alpha}w_{i,j}-\mu_{i,j}^{(\alpha)}\big{)}}
λmax(n)\displaystyle\lambda_{\text{max}}^{(n)} >min1αk0{𝑹α𝑹β2ϵ1pαlαw(α,l)+1pβlβw(β,l)}.\displaystyle>\min_{1\leq\alpha\leq k_{0}}\Bigg{\{}\dfrac{\lVert\bm{R}_{\alpha}-\bm{R}_{\beta}\rVert_{2}-\epsilon}{\frac{1}{p_{\alpha}}\sum_{l\neq\alpha}w^{(\alpha,l)}+\frac{1}{p_{\beta}}\sum_{l\neq\beta}w^{(\beta,l)}}\Bigg{\}}.

So, for ϵ\epsilon sufficiently small, λmin(n)<λmax(n)\lambda_{\text{min}}^{(n)}<\lambda_{\text{max}}^{(n)}. We will later find a bound how small ϵ\epsilon we need to achieve this.

We will compute a lower bound of the following probability:

P(𝝅^j𝑹α2<ϵ2;j𝒞α,α=1,,k0).P\Big{(}\lVert\hat{\bm{\pi}}_{j}-\bm{R}_{\alpha}\rVert_{2}<\dfrac{\epsilon}{2};\forall j\in{\cal C}_{\alpha},\forall\alpha=1,...,k_{0}\Big{)}.

Note that the variance-covariance matrix Σα\Sigma_{\alpha} of the limiting distribution is not full rank, as we have a linear constraint in the elements of 𝝅j\bm{\pi}_{j}. Define Zj=(π^j,1Rj,1,,π^j,d1Rj,d1)TZ_{j}=(\hat{\pi}_{j,1}-R_{j,1},...,\hat{\pi}_{j,d-1}-R_{j,d-1})^{T}, Σα,d\Sigma_{\alpha,-d} be the upper (d1)×(d1)(d-1)\times(d-1) block of Σα\Sigma_{\alpha}. Now,

𝝅^j𝑹α22\displaystyle\lVert\hat{\bm{\pi}}_{j}-\bm{R}_{\alpha}\rVert_{2}^{2} =l=1d1(π^j,lRj,l)2+(l=1d1(π^j,lRj,l))2\displaystyle=\sum_{l=1}^{d-1}(\hat{\pi}_{j,l}-R_{j,l})^{2}+\Big{(}\sum_{l=1}^{d-1}(\hat{\pi}_{j,l}-R_{j,l})\Big{)}^{2}
=ZjTZj+(𝟏TZj)2=ZjT(𝑰+𝟏𝟏T)Zj.\displaystyle=Z_{j}^{T}Z_{j}+(\bm{1}^{T}Z_{j})^{2}=Z_{j}^{T}(\bm{I}+\bm{1}\bm{1}^{T})Z_{j}.

Define Uj=NqjΣα,d1/2ZjU_{j}=\sqrt{\dfrac{N}{q_{j}}}\Sigma_{\alpha,-d}^{-1/2}Z_{j}. By asymptotic normality of 𝝅^j\hat{\bm{\pi}}_{j}, NZj𝑑𝒩(𝟎,qjΣα,d)\sqrt{N}Z_{j}\xrightarrow[]{d}\mathcal{N}(\bm{0},q_{j}\Sigma_{\alpha,-d}), hence Uj𝑑𝒩(𝟎,𝑰)U_{j}\xrightarrow[]{d}\mathcal{N}(\bm{0},\bm{I}). So,

P(𝝅^j𝑹α2ϵ2)\displaystyle P\Big{(}\lVert\hat{\bm{\pi}}_{j}-\bm{R}_{\alpha}\rVert_{2}\geq\dfrac{\epsilon}{2}\Big{)} =P(ZjT(𝑰+𝟏𝟏T)Zjϵ24)=P(UjTΣα,d1/2(𝑰+𝟏𝟏T)Σα,d1/2UjNϵ24qj)\displaystyle=P\Big{(}Z_{j}^{T}(\bm{I}+\bm{1}\bm{1}^{T})Z_{j}\geq\dfrac{\epsilon^{2}}{4}\Big{)}=P\Big{(}U_{j}^{T}\Sigma_{\alpha,-d}^{1/2}(\bm{I}+\bm{1}\bm{1}^{T})\Sigma_{\alpha,-d}^{1/2}U_{j}\geq\dfrac{N\epsilon^{2}}{4q_{j}}\Big{)}
=P(UjT𝑴UjNϵ24qj);𝑴=Σα,d1/2(𝑰+𝟏𝟏T)Σα,d1/2.\displaystyle=P\Big{(}U_{j}^{T}\bm{M}U_{j}\geq\dfrac{N\epsilon^{2}}{4q_{j}}\Big{)};\quad\bm{M}=\Sigma_{\alpha,-d}^{1/2}(\bm{I}+\bm{1}\bm{1}^{T})\Sigma_{\alpha,-d}^{1/2}.

For a symmetric matrix M_1, Hanson and Wright (1971) derived an upper bound on the tail probability of a quadratic form U^T M_1 U of a sub-Gaussian random vector U with mean 0 and variance-covariance matrix σ² I, as follows:

P(UT𝑴1Ut+σ2tr(𝑴1))exp[min(a1t2σ4𝑴1F,a2tσ2𝑴1sp)]P\Big{(}U^{T}\bm{M}_{1}U\geq t+\sigma^{2}tr(\bm{M}_{1})\Big{)}\leq\exp\Big{[}-\min\Big{(}\dfrac{a_{1}t^{2}}{\sigma^{4}\lVert\bm{M}_{1}\rVert_{F}},\dfrac{a_{2}t}{\sigma^{2}\lVert\bm{M}_{1}\rVert_{sp}}\Big{)}\Big{]} (B.6)

for some constants a1,a2>0a_{1},a_{2}>0. Here .F\lVert.\rVert_{F} and .sp\lVert.\rVert_{sp} are Frobenius norm and spectral norm respectively. Applying this bound in (B.6) in our problem, we get, as nn\to\infty,

P(𝝅^j𝑹α2ϵ2)\displaystyle P\Big{(}\lVert\hat{\bm{\pi}}_{j}-\bm{R}_{\alpha}\rVert_{2}\geq\dfrac{\epsilon}{2}\Big{)} =P(UjT𝑴UjNϵ24qj)\displaystyle=P\Big{(}U_{j}^{T}\bm{M}U_{j}\geq\dfrac{N\epsilon^{2}}{4q_{j}}\Big{)}
\leq\exp\Big{[}-\min\Big{(}\dfrac{a_{1}(N\epsilon^{2}-4q_{j}tr(\bm{M}))^{2}}{16q_{j}^{2}\lVert\bm{M}\rVert_{F}^{2}},\dfrac{a_{2}(N\epsilon^{2}-4q_{j}tr(\bm{M}))}{4q_{j}\lVert\bm{M}\rVert_{sp}}\Big{)}\Big{]}.
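The two arguments of this minimum grow at different rates in N. The short Python sketch below, with purely hypothetical constants (a_{1}, a_{2}, q_{j}, \epsilon, tr(\bm{M}), and the two norms are arbitrary stand-ins, not quantities from this paper), illustrates that the quadratic-in-N argument eventually exceeds the linear one:

import numpy as np

# Hypothetical constants; none of these values are taken from the paper.
a1, a2 = 0.1, 0.1
q_j, eps, trM = 0.125, 0.1, 0.6
frob2, sp = 0.25, 0.55               # stand-ins for ||M||_F^2 and ||M||_sp

for N in [10**3, 10**4, 10**5, 10**6]:
    t = N * eps**2 - 4 * q_j * trM
    quad = a1 * t**2 / (16 * q_j**2 * frob2)   # grows like N^2
    lin = a2 * t / (4 * q_j * sp)              # grows like N
    print(N, quad, lin, "minimum is the linear term" if lin <= quad else "")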

Since the first argument of the minimum grows at the rate N^{2} while the second grows only at the rate N, for all sufficiently large n we have \min\Big{(}\dfrac{a_{1}(N\epsilon^{2}-4q_{j}tr(\bm{M}))^{2}}{16q_{j}^{2}\lVert\bm{M}\rVert_{F}^{2}},\dfrac{a_{2}(N\epsilon^{2}-4q_{j}tr(\bm{M}))}{4q_{j}\lVert\bm{M}\rVert_{sp}}\Big{)}=\dfrac{a_{2}(N\epsilon^{2}-4q_{j}tr(\bm{M}))}{4q_{j}\lVert\bm{M}\rVert_{sp}}. Now,

tr(𝑴)\displaystyle tr(\bm{M}) =tr(Σα,d1/2(𝑰+𝟏𝟏T)Σα,d1/2)=tr(Σα,d)+tr(𝟏TΣα,d𝟏)\displaystyle=tr(\Sigma_{\alpha,-d}^{1/2}(\bm{I}+\bm{1}\bm{1}^{T})\Sigma_{\alpha,-d}^{1/2})=tr(\Sigma_{\alpha,-d})+tr(\bm{1}^{T}\Sigma_{\alpha,-d}\bm{1})
=l=1d1Rα,l(1Rα,l)+l=1d1Rα,ll1=1d1l2=1d1Rα,l1Rα,l2\displaystyle=\sum_{l=1}^{d-1}R_{\alpha,l}(1-R_{\alpha,l})+\sum_{l=1}^{d-1}R_{\alpha,l}-\sum_{l_{1}=1}^{d-1}\sum_{l_{2}=1}^{d-1}R_{\alpha,l_{1}}R_{\alpha,l_{2}}
=l=1d1Rα,l(1Rα,l)+(l=1d1Rα,l)(1l=1d1Rα,l)=l=1dRα,l(1Rα,l)=sα(say);\displaystyle=\sum_{l=1}^{d-1}R_{\alpha,l}(1-R_{\alpha,l})+\Big{(}\sum_{l=1}^{d-1}R_{\alpha,l}\Big{)}\Big{(}1-\sum_{l=1}^{d-1}R_{\alpha,l}\Big{)}=\sum_{l=1}^{d}R_{\alpha,l}(1-R_{\alpha,l})=s_{\alpha}(say);
𝑴sp\displaystyle\lVert\bm{M}\rVert_{sp} =Σα,d+Σα,d1/2𝟏𝟏TΣα,d1/2spΣα,dsp+𝟏TΣα,d𝟏maxl=1,2,,d1Rα,l+Rα,d(1Rα,d)=vα\displaystyle=\lVert\Sigma_{\alpha,-d}+\Sigma_{\alpha,-d}^{1/2}\bm{1}\bm{1}^{T}\Sigma_{\alpha,-d}^{1/2}\rVert_{sp}\leq\lVert\Sigma_{\alpha,-d}\rVert_{sp}+\bm{1}^{T}\Sigma_{\alpha,-d}\bm{1}\leq\max\limits_{l=1,2,...,d-1}R_{\alpha,l}+R_{\alpha,d}(1-R_{\alpha,d})=v_{\alpha}

as Σα,dspmaxl=1,2,,d1Rα,l\lVert\Sigma_{\alpha,-d}\rVert_{sp}\leq\max\limits_{l=1,2,...,d-1}R_{\alpha,l} by the result of Watson (1995). Hence,

P(𝝅^j𝑹α2ϵ2)exp[a2(Nϵ24qjsα)4qjvα]=exp(sαvα)exp[a2Nϵ24qjvα]\displaystyle P\Big{(}\lVert\hat{\bm{\pi}}_{j}-\bm{R}_{\alpha}\rVert_{2}\geq\dfrac{\epsilon}{2}\Big{)}\leq\exp\Big{[}-\dfrac{a_{2}(N\epsilon^{2}-4q_{j}s_{\alpha})}{4q_{j}v_{\alpha}}\Big{]}=\exp\Big{(}\dfrac{s_{\alpha}}{v_{\alpha}}\Big{)}\exp\Big{[}-\dfrac{a_{2}N\epsilon^{2}}{4q_{j}v_{\alpha}}\Big{]}
\displaystyle\implies P(𝝅^j𝑹α2<ϵ2;j𝒞α,α=1,,k0)\displaystyle P\Big{(}\lVert\hat{\bm{\pi}}_{j}-\bm{R}_{\alpha}\rVert_{2}<\dfrac{\epsilon}{2};\forall j\in{\cal C}_{\alpha},\forall\alpha=1,...,k_{0}\Big{)}
\displaystyle\geq 1α=1k0j𝒞αP(𝝅^j𝑹α2ϵ2)1α=1k0exp(sαvα)j𝒞αexp[a2Nϵ24qjvα].\displaystyle 1-\sum_{\alpha=1}^{k_{0}}\sum_{j\in{\cal C}_{\alpha}}P\Big{(}\lVert\hat{\bm{\pi}}_{j}-\bm{R}_{\alpha}\rVert_{2}\geq\dfrac{\epsilon}{2}\Big{)}\geq 1-\sum_{\alpha=1}^{k_{0}}\exp\Big{(}\dfrac{s_{\alpha}}{v_{\alpha}}\Big{)}\sum_{j\in{\cal C}_{\alpha}}\exp\Big{[}-\dfrac{a_{2}N\epsilon^{2}}{4q_{j}v_{\alpha}}\Big{]}.

Writing C_{1}^{(\alpha)}=\exp\Big{(}\dfrac{s_{\alpha}}{v_{\alpha}}\Big{)} and C_{2,j}=\dfrac{a_{2}\epsilon^{2}}{4q_{j}v_{\alpha}}, we obtain the bound stated in the theorem, which completes the proof.
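As a numerical sanity check on the trace and spectral-norm calculations above, the following Python sketch takes \Sigma_{\alpha,-d} to be the multinomial-type covariance implied by the computation of tr(\bm{M}) (that is, diag(\bm{R}_{\alpha})-\bm{R}_{\alpha}\bm{R}_{\alpha}^{T} restricted to the first d-1 coordinates), forms \bm{M}, verifies tr(\bm{M})=s_{\alpha} and \lVert\bm{M}\rVert_{sp}\leq v_{\alpha}, and then evaluates C_{1}^{(\alpha)}\exp(-NC_{2,j}) for hypothetical a_{2}, q_{j} and \epsilon to illustrate its decay in N:

import numpy as np

# Illustrative transition vector of a cluster.
R = np.array([0.5, 0.3, 0.2])
d = len(R)

# Covariance consistent with the trace computation above, restricted to the
# first d-1 coordinates.
Sigma = np.diag(R) - np.outer(R, R)
Sigma_md = Sigma[: d - 1, : d - 1]

# Symmetric square root via the eigendecomposition.
w, V = np.linalg.eigh(Sigma_md)
Sigma_half = V @ np.diag(np.sqrt(w)) @ V.T
M = Sigma_half @ (np.eye(d - 1) + np.ones((d - 1, d - 1))) @ Sigma_half

s_alpha = np.sum(R * (1 - R))                        # claimed value of tr(M)
v_alpha = np.max(R[: d - 1]) + R[-1] * (1 - R[-1])   # claimed bound on ||M||_sp
print(np.trace(M), s_alpha)                          # equal up to rounding
print(np.linalg.norm(M, 2), "<=", v_alpha)           # spectral norm below the bound

# Decay of the tail bound with hypothetical a2, q_j and epsilon.
a2, q_j, eps = 0.1, 0.125, 0.05
C1 = np.exp(s_alpha / v_alpha)
C2 = a2 * eps**2 / (4 * q_j * v_alpha)
for N in [10**4, 10**5, 10**6]:
    print(N, C1 * np.exp(-N * C2))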

Appendix C Proof of Theorem 3

Recall that \hat{\pi}_{j,\ell}=N_{\sigma_{j},\ell}/N_{\sigma_{j}}. Denote the common transition probability estimate for the estimated group \hat{{\cal C}}_{\alpha}(\lambda) by

R^α,(λ)=σj𝒞^α(λ)Nσj,σj𝒞^α(λ)Nσj=N𝒞^α(λ),N𝒞^α(λ)α=1,,kλ;=1,,d.\hat{R}^{(\lambda)}_{\alpha,\ell}=\dfrac{\sum_{\sigma_{j}\in\hat{{\cal C}}_{\alpha}(\lambda)}N_{\sigma_{j},\ell}}{\sum_{\sigma_{j}\in\hat{{\cal C}}_{\alpha}(\lambda)}N_{\sigma_{j}}}=\dfrac{N_{\hat{{\cal C}}_{\alpha}(\lambda),\ell}}{N_{\hat{{\cal C}}_{\alpha}(\lambda)}}\quad\quad\forall\alpha=1,...,k_{\lambda};\ell=1,...,d.

Thus, the log-likelihood is given by

n(λ)=α=1kλ=1dN𝒞^α(λ),logR^α,(λ).\ell_{n}(\lambda)=\sum_{\alpha=1}^{k_{\lambda}}\sum_{\ell=1}^{d}N_{\hat{{\cal C}}_{\alpha}(\lambda),\ell}\log\hat{R}^{(\lambda)}_{\alpha,\ell}.
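For concreteness, the following Python sketch (with hypothetical transition counts and partitions; the function name is ours) computes the pooled estimates \hat{R}^{(\lambda)}_{\alpha,\ell}, the profile log-likelihood \ell_{n}(\lambda), and the BIC in the form -2\ell_{n}(\lambda)+k_{\lambda}(d-1)\log n that underlies the comparisons below:

import numpy as np

def profile_loglik_and_bic(counts, partition, n):
    """Pooled transition estimates, log-likelihood and BIC for a partition
    of the m-th order contexts.

    counts    : (#contexts) x d array with counts[j, l] = N_{sigma_j, l}
    partition : list of lists of context indices (the clusters C_alpha)
    n         : sample size entering the log n penalty
    """
    d = counts.shape[1]
    loglik = 0.0
    for cluster in partition:
        pooled = counts[cluster].sum(axis=0)       # N_{C_alpha, l}
        R_hat = pooled / pooled.sum()              # pooled estimate for the cluster
        nz = pooled > 0                            # avoid log 0 for empty cells
        loglik += np.sum(pooled[nz] * np.log(R_hat[nz]))
    k = len(partition)
    return loglik, -2.0 * loglik + k * (d - 1) * np.log(n)

# Hypothetical counts for four contexts over d = 3 symbols.
counts = np.array([[30, 10, 10],
                   [28, 12, 10],
                   [5, 25, 20],
                   [6, 24, 20]])
n = counts.sum()
coarse = [[0, 1], [2, 3]]        # contexts grouped into two clusters
fine = [[0], [1], [2], [3]]      # full m-th order Markov model
print(profile_loglik_and_bic(counts, coarse, n))
print(profile_loglik_and_bic(counts, fine, n))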

Note that, as \lambda increases, the number of clusters decreases. Moreover, by the continuity of the solution of (2.3) with respect to \lambda, M_{\lambda_{2}} is a submodel of M_{\lambda_{1}} for \lambda_{1}<\lambda_{2}, since the separate clusters obtained for smaller values of \lambda are merged into larger clusters as \lambda increases. Hence we can write M_{\lambda_{2}}\subseteq M_{\lambda_{1}}, and consequently \ell_{n}(\lambda_{1})\geq\ell_{n}(\lambda_{2}). Let q_{j} be the stationary probability of the state \sigma_{j}, and let Q^{(\alpha)}(\lambda) be the stationary probability of the partition element \hat{{\cal C}}_{\alpha}(\lambda), so that Q^{(\alpha)}(\lambda)=\sum_{\sigma_{j}\in\hat{{\cal C}}_{\alpha}(\lambda)}q_{j}. We have to show that the true model minimizes the BIC with probability tending to 1 as n\to\infty. We prove this in the following two cases.
Case 1: Suppose \lambda<\lambda_{0} and M_{\lambda_{0}}\subset M_{\lambda}. Clearly, k_{\lambda_{0}}<k_{\lambda}. Since, by Theorem 1, M_{\lambda_{0}} is the true underlying model, M_{\lambda_{0}}=\{{\cal C}_{1},\ldots,{\cal C}_{k_{0}}\}, and

Zn=2(n(λ0)n(λ))𝑑χ(kλk0)(d1)2.Z_{n}=-2\Big{(}\ell_{n}(\lambda_{0})-\ell_{n}(\lambda)\Big{)}\xrightarrow{d}\chi^{2}_{(k_{\lambda}-k_{0})(d-1)}.

Hence, as nn\to\infty,

P(BICn(λ0)BICn(λ))\displaystyle P\Big{(}BIC_{n}(\lambda_{0})\geq BIC_{n}(\lambda)\Big{)} =P(Zn>(kλk0)(d1)logn)\displaystyle=P\Big{(}Z_{n}>(k_{\lambda}-k_{0})(d-1)\log n\Big{)}
exp[(kλk0)(d1)4logn]\displaystyle\leq\exp\Big{[}-\dfrac{(k_{\lambda}-k_{0})(d-1)}{4}\log n\Big{]}
=n(kλk0)(d1)40.\displaystyle=n^{-\dfrac{(k_{\lambda}-k_{0})(d-1)}{4}}\to 0.
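The last display uses a standard chi-square tail bound, P(\chi^{2}_{k}\geq t)\leq\exp(-t/4) for t sufficiently large relative to k. The following sketch compares the exact tail at t=(k_{\lambda}-k_{0})(d-1)\log n with the resulting rate n^{-(k_{\lambda}-k_{0})(d-1)/4} for a few arbitrary choices of k_{\lambda}-k_{0}, d and n:

import numpy as np
from scipy.stats import chi2

# Compare the exact chi-square tail with the rate n^{-(k_lambda - k_0)(d-1)/4}.
d = 4
for k_extra in [1, 2, 4]:                 # k_lambda - k_0
    df = k_extra * (d - 1)
    for n in [10**2, 10**3, 10**4]:
        t = df * np.log(n)
        exact = chi2.sf(t, df)            # P(chi^2_df >= df log n)
        bound = n ** (-df / 4)
        print(df, n, exact, bound, exact <= bound)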

Case 2: Now let \lambda_{0}<\lambda and M_{\lambda}\subset M_{\lambda_{0}}. For \alpha^{\prime}=1,\ldots,k_{\lambda}, without loss of generality we can write

\hat{{\cal C}}_{\alpha^{\prime}}(\lambda)=\bigcup_{\alpha=t_{\alpha^{\prime}-1}+1}^{t_{\alpha^{\prime}}}{\cal C}_{\alpha}

for 0=t0<t1<t2<<tkλ=k00=t_{0}<t_{1}<t_{2}<...<t_{k_{\lambda}}=k_{0}. Now, as nn\to\infty,

1Nn(λ0)=1Nα=1k0=1dN𝒞α,logR^α,(λ0)\displaystyle\dfrac{1}{N}\ell_{n}(\lambda_{0})=\dfrac{1}{N}\sum_{\alpha=1}^{k_{0}}\sum_{\ell=1}^{d}N_{{\cal C}_{\alpha},\ell}\log\hat{R}^{(\lambda_{0})}_{\alpha,\ell} 𝑝α=1k0=1d(j𝒞αqσj)Rα,logRα,\displaystyle\xrightarrow[]{p}\sum_{\alpha=1}^{k_{0}}\sum_{\ell=1}^{d}\Big{(}\sum_{j\in{\cal C}_{\alpha}}q_{\sigma_{j}}\Big{)}R_{\alpha,\ell}\log R_{\alpha,\ell}
=α=1k0=1dQ(α)(λ0)Rα,logRα,=A0;\displaystyle=\sum_{\alpha=1}^{k_{0}}\sum_{\ell=1}^{d}Q^{(\alpha)}(\lambda_{0})R_{\alpha,\ell}\log R_{\alpha,\ell}=A_{0};

and

1Nn(λ)\displaystyle\dfrac{1}{N}\ell_{n}(\lambda) =1Nα=1kλ=1dN𝒞^α(λ),logR^α,(λ)=1Nα=1kλ=1d(j𝒞^α(λ)Nσj,)log(j𝒞^α(λ)Nσj,j𝒞^α(λ)Nσj)\displaystyle=\dfrac{1}{N}\sum_{\alpha^{\prime}=1}^{k_{\lambda}}\sum_{\ell=1}^{d}N_{\hat{{\cal C}}_{\alpha^{\prime}}(\lambda),\ell}\log\hat{R}^{(\lambda)}_{\alpha^{\prime},\ell}=\dfrac{1}{N}\sum_{\alpha^{\prime}=1}^{k_{\lambda}}\sum_{\ell=1}^{d}\Big{(}\sum_{j\in\hat{{\cal C}}_{\alpha^{\prime}}(\lambda)}N_{\sigma_{j},\ell}\Big{)}\log\Big{(}\dfrac{\sum_{j\in\hat{{\cal C}}_{\alpha^{\prime}}(\lambda)}N_{\sigma_{j},\ell}}{\sum_{j\in\hat{{\cal C}}_{\alpha^{\prime}}(\lambda)}N_{\sigma_{j}}}\Big{)}
=1Nα=1kλ=1d(α=tα1+1tαN𝒞α,)log(α=tα1+1tαN𝒞α,α=tα1+1tαN𝒞α)\displaystyle=\dfrac{1}{N}\sum_{\alpha^{\prime}=1}^{k_{\lambda}}\sum_{\ell=1}^{d}\Big{(}\sum_{\alpha=t_{{\alpha^{\prime}}-1}+1}^{t_{\alpha^{\prime}}}N_{{\cal C}_{\alpha},\ell}\Big{)}\log\Big{(}\dfrac{\sum_{\alpha=t_{{\alpha^{\prime}}-1}+1}^{t_{\alpha^{\prime}}}N_{{\cal C}_{\alpha},\ell}}{\sum_{\alpha=t_{\alpha^{\prime}-1}+1}^{t_{\alpha^{\prime}}}N_{{\cal C}_{\alpha}}}\Big{)}
𝑝α=1kλ=1d(α=tα1+1tαQ(α)(λ0)Rα,)log(α=tα1+1tαQ(α)(λ0)Rα,α=tα1+1tαQ(α)(λ0))=A(λ).\displaystyle\xrightarrow[]{p}\sum_{\alpha^{\prime}=1}^{k_{\lambda}}\sum_{\ell=1}^{d}\Big{(}\sum_{\alpha=t_{\alpha^{\prime}-1}+1}^{t_{\alpha^{\prime}}}Q^{(\alpha)}(\lambda_{0})R_{\alpha,\ell}\Big{)}\log\Big{(}\dfrac{\sum_{\alpha=t_{\alpha^{\prime}-1}+1}^{t_{\alpha^{\prime}}}Q^{(\alpha)}(\lambda_{0})R_{\alpha,\ell}}{\sum_{\alpha=t_{\alpha^{\prime}-1}+1}^{t_{\alpha^{\prime}}}Q^{(\alpha)}(\lambda_{0})}\Big{)}=A(\lambda).

Now, applying Jensen's inequality and using the strict convexity of -\log x, we obtain

A(λ)\displaystyle A(\lambda) ==1dα=1kλ(α=tα1+1tαQ(α)(λ0)Rα,)log(α=tα1+1tαQ(α)(λ0)α=tα1+1tαQ(α)(λ0)Rα,)\displaystyle=-\sum_{\ell=1}^{d}\sum_{\alpha^{\prime}=1}^{k_{\lambda}}\Big{(}\sum_{\alpha=t_{\alpha^{\prime}-1}+1}^{t_{\alpha^{\prime}}}Q^{(\alpha)}(\lambda_{0})R_{\alpha,\ell}\Big{)}\log\Big{(}\dfrac{\sum_{\alpha=t_{\alpha^{\prime}-1}+1}^{t_{\alpha^{\prime}}}Q^{(\alpha)}(\lambda_{0})}{\sum_{\alpha=t_{\alpha^{\prime}-1}+1}^{t_{\alpha^{\prime}}}Q^{(\alpha)}(\lambda_{0})R_{\alpha,\ell}}\Big{)}
==1dα=1kλ(α=tα1+1tαQ(α)(λ0)Rα,)log(α=tα1+1tαQ(α)(λ0)Rα,.(1/Rα,)α=tα1+1tαQ(α)(λ0)Rα,)\displaystyle=-\sum_{\ell=1}^{d}\sum_{\alpha^{\prime}=1}^{k_{\lambda}}\Big{(}\sum_{\alpha=t_{\alpha^{\prime}-1}+1}^{t_{\alpha^{\prime}}}Q^{(\alpha)}(\lambda_{0})R_{\alpha,\ell}\Big{)}\log\Big{(}\dfrac{\sum_{\alpha=t_{\alpha^{\prime}-1}+1}^{t_{\alpha^{\prime}}}Q^{(\alpha)}(\lambda_{0})R_{\alpha,\ell}.(1/R_{\alpha,\ell})}{\sum_{\alpha=t_{\alpha^{\prime}-1}+1}^{t_{\alpha^{\prime}}}Q^{(\alpha)}(\lambda_{0})R_{\alpha,\ell}}\Big{)}
<=1dα=1kλα=tα1+1tαQ(α)(λ0)Rα,log(1/Rα,)\displaystyle<-\sum_{\ell=1}^{d}\sum_{\alpha^{\prime}=1}^{k_{\lambda}}\sum_{\alpha=t_{\alpha^{\prime}-1}+1}^{t_{\alpha^{\prime}}}Q^{(\alpha)}(\lambda_{0})R_{\alpha,\ell}\log(1/R_{\alpha,\ell})
==1dα=1kλα=tα1+1tαQ(α)(λ0)Rα,logRα,=A0.\displaystyle=\sum_{\ell=1}^{d}\sum_{\alpha^{\prime}=1}^{k_{\lambda}}\sum_{\alpha=t_{\alpha^{\prime}-1}+1}^{t_{\alpha^{\prime}}}Q^{(\alpha)}(\lambda_{0})R_{\alpha,\ell}\log R_{\alpha,\ell}=A_{0}.

Hence, \dfrac{1}{N}(\ell_{n}(\lambda_{0})-\ell_{n}(\lambda))\xrightarrow{p}A_{0}-A(\lambda)>0, and P\Big{(}\dfrac{1}{N}(\ell_{n}(\lambda_{0})-\ell_{n}(\lambda))\geq\dfrac{1}{2}(A_{0}-A(\lambda))\Big{)}\to 1 as n\to\infty. Since \log n/N\to 0 as n\to\infty,

P(BICn(λ0)BICn(λ))\displaystyle P\Big{(}BIC_{n}(\lambda_{0})\geq BIC_{n}(\lambda)\Big{)} =P(2n(λ0)2n(λ)+(k0kλ)(d1)logn)\displaystyle=P\Big{(}2\ell_{n}(\lambda_{0})\leq 2\ell_{n}(\lambda)+(k_{0}-k_{\lambda})(d-1)\log n\Big{)}
=P\Big{(}\ell_{n}(\lambda_{0})-\ell_{n}(\lambda)\leq\dfrac{1}{2}(k_{0}-k_{\lambda})(d-1)\log n\Big{)}
=P\Big{(}\dfrac{1}{N}(\ell_{n}(\lambda_{0})-\ell_{n}(\lambda))\leq\dfrac{(k_{0}-k_{\lambda})(d-1)}{2}\,\dfrac{\log n}{N}\Big{)}
0.\displaystyle\to 0.
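To illustrate the strict inequality A_{0}>A(\lambda) driving Case 2, the following toy computation (illustrative transition vectors R_{\alpha} and stationary masses Q^{(\alpha)}(\lambda_{0}); the variable names are ours) merges two distinct clusters and shows that the limiting average log-likelihood strictly decreases:

import numpy as np

# Two true clusters with distinct transition vectors and stationary masses.
R = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.5, 0.3]])
Q = np.array([0.4, 0.6])

# A_0: limiting average log-likelihood of the true (unmerged) model.
A0 = np.sum(Q[:, None] * R * np.log(R))

# A(lambda): the same quantity after merging the two clusters into one.
num = (Q[:, None] * R).sum(axis=0)      # sum_alpha Q^(alpha) R_{alpha, l}
R_merged = num / Q.sum()
A_lam = np.sum(num * np.log(R_merged))

print(A0, A_lam, A0 - A_lam)            # A0 > A(lambda), strictly
assert A_lam < A0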

References

  • [1] Bühlmann, P. (2000). Model selection for variable length Markov chains and tuning the context algorithm. Annals of the Institute of Statistical Mathematics 52, 287–315.
  • [2] Bühlmann, P. and Wyner, A. J. (1999). Variable length Markov chains. The Annals of Statistics 27, 480–513.
  • [3] Chi, E. C. and Lange, K. (2015). Splitting methods for convex clustering. Journal of Computational and Graphical Statistics 24, 994–1013.
  • [4] García, J. E. and González-López, V. A. (2011). Minimal Markov models. In Fourth Workshop on Information Theoretic Methods in Science and Engineering.
  • [5] Hocking, T. D., Joulin, A., Bach, F. and Vert, J.-P. (2011). Clusterpath: an algorithm for clustering using convex fusion penalties. In Proceedings of the 28th International Conference on Machine Learning.
  • [6] Jääskinen, V., Xiong, J., Corander, J. and Koski, T. (2014). Sparse Markov chains for sequence data. Scandinavian Journal of Statistics 41, 639–655.
  • [7] Lindsten, F., Ohlsson, H. and Ljung, L. (2011). Just relax and come clustering! A convexification of k-means clustering. Technical Report, Department of Electrical Engineering, Automatic Control, Linköping University.
  • [8] Panahi, A., Dubhashi, D., Johansson, F. D. and Bhattacharyya, C. (2017). Clustering by sum of norms: Stochastic incremental algorithm, convergence and cluster recovery. In International Conference on Machine Learning, 2769–2777. PMLR.
  • [9] Pelckmans, K., De Brabanter, J., Suykens, J. A. K. and De Moor, B. (2005). Convex clustering shrinkage. In PASCAL Workshop on Statistics and Optimization of Clustering.
  • [10] Rissanen, J. (1983). A universal prior for integers and estimation by minimum description length. The Annals of Statistics 11, 416–431.
  • [11] Roos, T. and Yu, B. (2009a). Sparse Markov source estimation via transformed Lasso. In 2009 IEEE Information Theory Workshop on Networking and Information Theory, 241–245. IEEE.
  • [12] Roos, T. and Yu, B. (2009b). Estimating sparse models from multivariate discrete data via transformed Lasso. In 2009 Information Theory and Applications Workshop, 290–294. IEEE.
  • [13] Sun, D., Toh, K.-C. and Yuan, Y. (2021). Convex clustering: model, theoretical guarantee and efficient algorithm. Journal of Machine Learning Research 22(9).
  • [14] Xiong, J., Jääskinen, V. and Corander, J. (2016). Recursive learning for sparse Markov models. Bayesian Analysis 11, 247–263.
  • [15] Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68, 49–67.
  • [16] Zhu, C., Xu, H., Leng, C. and Yan, S. (2014). Convex optimization procedure for clustering: Theoretical revisit. Advances in Neural Information Processing Systems 27.