
Fitting Sparse Markov Models to Categorical Time Series Using Regularization

Tuhin Majumder, Soumendra Lahiri and Donald Martin
Department of Statistics, North Carolina State University; Department of Mathematics and Statistics, Washington University in St. Louis
Abstract

The major problem in fitting a higher-order Markov model is the exponentially growing number of parameters. The most popular approach is to use a Variable Length Markov Chain (VLMC), which determines relevant contexts (recent pasts) of variable order and organizes them into a context tree. A more general approach is the Sparse Markov Model (SMM), in which all possible histories of order m are partitioned into groups so that the transition probability vector is identical for all histories belonging to a given group. We develop an elegant method of fitting SMMs using convex clustering, which involves regularization. The regularization parameter is selected using a BIC-type criterion. Theoretical results establish the model selection consistency of our method for large sample sizes. Extensive simulation studies under different set-ups are presented to assess the performance of the method. We apply the method to classify genome sequences obtained from individuals infected by different viruses.

Keywords: Sparse Markov Model, Convex Clustering

1 Introduction

Let {Xt}\{X_{t}\} be a categorical time series with a finite state space Σ\Sigma. We suppose that the evolution of the time series follows an mm-th order Markov structure where

(Xt+1|Xs,st)=(Xt+1|Xs,tm<st){\cal L}\big{(}X_{t+1}\big{|}X_{s},s\leq t\big{)}={\cal L}\big{(}X_{t+1}\big{|}X_{s},t-m<s\leq t\big{)} (1.1)

for some m ≥ 1, where for any random vectors X, Y defined on a common probability space, we write ℒ(Y|X) to denote the probability distribution of Y given X. Even when the alphabet Σ is small, such as Σ = {0,1} in applications involving binary chains or Σ = {A,G,T,C} in genetics applications, the complexity of the model (1.1) increases fairly quickly, and the model may be difficult to estimate even for moderately large m. In the absence of a parametric model specification, the number of free parameters associated with (1.1) is |Σ|^m (|Σ|−1), which grows geometrically fast in the order m, where |Σ| denotes the size of the alphabet, that is, the number of elements in Σ.

Different dimension reduction strategies have been applied to reduce the model complexity in (1.1), such as Variable Length Markov Chains (VLMC) based on tree-structured conditioning sets. This idea was first introduced by Rissanen (1983)[10]; a VLMC determines relevant contexts (recent pasts) of variable order and organizes them into a context tree. In a VLMC, P(X_{t+1}=x_{t+1} | X_t=x_t, …, X_1=x_1) = P(X_{t+1}=x_{t+1} | X̃_t^(ℓ) = x̃_t^(ℓ)), where ℓ need not be a fixed number but may be a function of the past values (x_t, …, x_1). In general, context tree models have L(|Σ|−1) parameters, where L is the number of leaves in the context tree. That L can take arbitrary positive integer values for general context trees highlights the flexibility of a model with variable length contexts, and the fact that such models can lead to huge reductions in the number of parameters, especially when there is a long context in a single direction. A model of variable order allows for a better trade-off between the bias that arises from using contexts that are too short and the variance that increases with having many parameters, thus improving statistical inference. Bühlmann and Wyner (1999)[2] and Bühlmann (2000)[1] developed model selection strategies for Variable Length Markov Chains (VLMC) and studied their asymptotic behaviour.

Roos and Yu (2009a[11], b[12]) first pointed out that there can be relevant contexts that do not have the hierarchical structure of a context tree. Although those authors discussed the possibility of more general models, the analysis in those papers was limited to the case Σ = {0,1}. Recently, researchers have begun studying the sparse model, which is posed in terms of a general partition of the set of m-tuples Σ^m, where m is the maximal order of Markovian dependence. Sparse Markov models (SMMs) introduce a sparse parametrization based on an unknown grouping of the relevant m-th order histories in Σ^m. This generalization was first proposed by Gárcia and González-López (2010)[4], who named it Minimal Markov Models. Later, Jääskinen et al. (2014)[6] developed Bayesian predictive methods to analyze sequence data using SMMs. Xiong et al. (2016)[14] extended that work, introducing a recursive algorithm for optimizing the partition of an SMM. In this paper, we consider SMMs in their full generality, allowing an arbitrary and unknown number of groups. Specifically, let 𝒞_1, …, 𝒞_{k_0} be a partition of Σ^m. Then the Markov chain {X_t} in (1.1) is a member of the SMM with groups {𝒞_1, …, 𝒞_{k_0}} if it satisfies the sparsity condition:

P(Xt+1|Xt=a1,,Xtm+1=am)is the same for all(am,,a1)𝒞i,P\big{(}X_{t+1}\in\cdot\big{|}X_{t}=a_{-1},\ldots,X_{t-m+1}=a_{-m}\big{)}{\quad\mbox{is the same for all}\quad}(a_{-m},\ldots,a_{-1})\in{\cal C}_{i}, (1.2)

for each i = 1, …, k_0. Thus, for each i, the transition probability vector remains unchanged over all m-step histories lying in the set 𝒞_i. This reduces the number of unknown probability parameters to k_0(|Σ|−1). However, both the number k_0 of sets in the partition and the sets 𝒞_i themselves are unknown and must be estimated from the data in order to fit the SMM.
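To make the sparsity condition (1.2) concrete, the following minimal sketch (illustrative Python, not the authors' code; all names are ours) simulates a binary second-order SMM in which the four histories are grouped into k_0 = 2 classes sharing a transition probability vector.

import numpy as np

# Illustrative sketch: simulate a binary SMM of order m = 2 in which the four
# histories {00, 01, 10, 11} form k0 = 2 groups, each with a shared transition vector.
rng = np.random.default_rng(0)

sigma = [0, 1]                                         # alphabet Sigma
partition = {(0, 0): 0, (0, 1): 0, (1, 0): 1, (1, 1): 1}   # groups C_1, C_2
group_probs = {0: np.array([0.8, 0.2]),                # P(X_{t+1} = . | history in C_1)
               1: np.array([0.3, 0.7])}                # P(X_{t+1} = . | history in C_2)

def simulate_smm(n, m=2):
    """Generate a length-n chain obeying the sparsity condition (1.2)."""
    x = list(rng.integers(0, 2, size=m))               # arbitrary initial history
    for _ in range(n - m):
        hist = tuple(x[-m:])
        probs = group_probs[partition[hist]]           # shared vector within a group
        x.append(rng.choice(sigma, p=probs))
    return np.array(x)

x = simulate_smm(5000)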

The rest of the paper is organised as follows. In Section 2, we discuss the methodology of fitting an SMM using regularization in detail. Section 3 presents the theoretical properties that ensure the model selection consistency of our method. In Section 4, we numerically illustrate our methodology through extensive simulation studies. A real data analysis involving virus classification is presented in Section 5. We conclude the paper with some remarks in Section 6. The proofs of the theoretical results are provided in the appendix.

2 Methodology

2.1 Notation

Let ={1,2,}{\mathbb{N}}=\{1,2,\ldots\} be the set of all positive integers. Let 𝒳n=(X1,,Xn){\cal X}_{n}=(X_{1},\ldots,X_{n}) and X~t(m)=(Xt,Xt1,,Xtm+1)\tilde{X}_{t}^{(m)}=(X_{t},X_{t-1},\ldots,X_{t-m+1}), for m1m\geq 1, tt\in{\mathbb{N}}. Write ww for an ordered (finite) sequence of Σ\Sigma-elements of length |w||w|. Let wuwu denote the (ordered) concatenation of ww and uu. Write |Σ|=d|\Sigma|=d and Σm={σ1,,σp}\Sigma^{m}=\{\sigma_{1},\ldots,\sigma_{p}\} so that p=|Σ|mp=|\Sigma|^{m}. Let Nw=t=|w|n11(X~t(|w|)=w)N_{w}=\sum_{t=|w|}^{n}{1\!\!1}(\tilde{X}_{t}^{(|w|)}=w) where 11(){1\!\!1}(\cdot) denotes the indicator function. For any SΣmS\subset\Sigma^{m} and aΣa\in\Sigma, define NS=t=mn111(X~t(m)S)N_{S}=\sum_{t=m}^{n-1}{1\!\!1}(\tilde{X}_{t}^{(m)}\in S), NS,a=t=mn111(X~t(m)S,Xt+1=a)N_{S,a}=\sum_{t=m}^{n-1}{1\!\!1}(\tilde{X}_{t}^{(m)}\in S,X_{t+1}=a). In particular, NσjN_{\sigma_{j}} denotes the number of times the chain X~t(m)\tilde{X}_{t}^{(m)} hits the mm-tuple σj\sigma_{j}, and Nσj,aN_{\sigma_{j},a} is the number of transitions from σj\sigma_{j} to aa.

Next we define the probabilities associated with the SMM (1.2). For j=1,,pj=1,\ldots,p and aΣa\in\Sigma, let

𝝅j,a=P(Xt+1=a|X~t(m)=σj);\displaystyle\bm{\pi}_{j,a}=P\big{(}X_{t+1}=a\big{|}\tilde{X}_{t}^{(m)}=\sigma_{j}\big{)};

and let 𝝅_j be the corresponding transition probability vector. Note that, by the SMM property, for any a ∈ Σ the transition probability 𝝅_{j,a} is constant over all j such that σ_j ∈ 𝒞_i. However, we do not know the sets 𝒞_i, and one of the challenges of fitting an SMM to a dataset is to identify the sets 𝒞_1, …, 𝒞_{k_0}. To that end, define nonparametric estimators of 𝝅_{j,a} using their empirical versions:

𝝅^ja=Nσj,a/Nσj\hat{\bm{\pi}}_{ja}=N_{\sigma_{j},a}/N_{\sigma_{j}}

where N = n − m + 1 denotes the total number of m-th order histories in the observed variables 𝒳_n.
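As an illustration of the notation above, the following sketch (our own illustrative code; function and variable names are assumptions) accumulates the counts N_{σ_j} and N_{σ_j,a} over t = m, …, n−1 and forms the empirical estimators π̂_{j,a} = N_{σ_j,a}/N_{σ_j}.

import numpy as np
from itertools import product

# Sketch of the empirical estimators of Section 2.1 (illustrative names).
def empirical_transitions(x, alphabet, m):
    histories = list(product(alphabet, repeat=m))      # Sigma^m, so p = |Sigma|^m
    h_index = {h: j for j, h in enumerate(histories)}
    a_index = {a: k for k, a in enumerate(alphabet)}
    counts = np.zeros((len(histories), len(alphabet)))
    for t in range(m, len(x)):                         # transition history -> next symbol
        j = h_index[tuple(x[t - m:t])]
        counts[j, a_index[x[t]]] += 1
    n_sigma = counts.sum(axis=1, keepdims=True)        # N_{sigma_j}
    pi_hat = np.divide(counts, n_sigma, out=np.full_like(counts, np.nan),
                       where=n_sigma > 0)              # NaN if a history never occurs
    return histories, pi_hat, n_sigma.ravel()

x = np.random.default_rng(1).integers(0, 2, size=5000)     # any observed chain
histories, pi_hat, n_sigma = empirical_transitions(x, alphabet=[0, 1], m=2)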

Here we propose a new approach to fitting the SMM based on regularization.

2.2 Description of the Method/Algorithm

Consider the penalized criterion function

12j=1paΣ(π^jabja)2+λ1j1<j2pwj1j2ρ(𝐛j1,𝐛j2)\dfrac{1}{2}\sum_{j=1}^{p}\sum_{a\in\Sigma}\big{(}\hat{\pi}_{ja}-b_{ja})^{2}+\lambda\sum_{1\leq j_{1}<j_{2}\leq p}w_{j_{1}j_{2}}\rho(\mathbf{b}_{j_{1}},\mathbf{b}_{j_{2}}) (2.3)

over b_j = (b_{ja} : a ∈ Σ) ∈ Π_d for j = 1, …, p, where λ > 0 is a penalty parameter, Π_d is the d-dimensional simplex Π_d = {(u_1,…,u_d) ∈ [0,1]^d : u_1 + … + u_d = 1}, and ρ(·,·) is a distance measure between two d-dimensional probability vectors. Thus, (2.3) treats the estimators π̂_{ja} as (correlated) “observations” and penalizes the distance between all distinct pairs of the probability vectors in order to identify the identical ones. In particular, the number of pairwise penalty terms grows approximately quadratically in the number of “observations”, and with a suitable choice of the penalization term one can identify the identical probability vectors. When ρ(b_{j_1}, b_{j_2})^2 = Σ_{ℓ=1}^d (b_{j_1 ℓ} − b_{j_2 ℓ})^2, (2.3) gives a version of the Group LASSO of Yuan and Lin (2006)[15] that is designed for selecting pairs of full vectors that are close, and we have a convex optimization problem that can be solved for large p. On the other hand, if we use the ℓ_1 distance ρ(b_{j_1}, b_{j_2}) = Σ_{ℓ=1}^d |b_{j_1 ℓ} − b_{j_2 ℓ}|, then only componentwise zero differences can be identified.
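For concreteness, a direct (if naive) evaluation of the criterion (2.3) with the ℓ_2 fusion penalty might look as follows; this is an illustrative sketch with our own function names, not an optimization routine.

import numpy as np

# Sketch: evaluate the penalized criterion (2.3) with the l2 (group-lasso type)
# fusion penalty for a candidate matrix B whose rows are the vectors b_j.
# pi_hat (p x d), B (p x d) and the symmetric weight matrix W (p x p) are assumed given.
def smm_objective(B, pi_hat, W, lam):
    fit = 0.5 * np.sum((pi_hat - B) ** 2)              # (1/2) sum_j ||pi_hat_j - b_j||_2^2
    penalty = 0.0
    p = B.shape[0]
    for j1 in range(p):
        for j2 in range(j1 + 1, p):                    # all distinct pairs j1 < j2
            penalty += W[j1, j2] * np.linalg.norm(B[j1] - B[j2])
    return fit + lam * penalty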

Once we minimize the criterion function in (2.3), it is a relatively easy task to find estimates of k_0 and the sets 𝒞_i. Specifically, we start with the pair having the smallest j_1 and seek all j_2 > j_1 such that the distance between the solutions b*_{j_1} and b*_{j_2} is zero. Then we set 𝒞̂_1 to be the set consisting of j_1 and all such j_2. In the next step, we consider all pairs that are not in 𝒞̂_1 and repeat the procedure until all pairs with estimated zero distances have been grouped. In case there are indices j for which none of the estimated paired distances are zero, we keep them as singletons, that is, groups consisting of a single element. This gives the estimated groups 𝒞̂_i : i = 1, …, k̂, with k̂ giving an estimate of k_0.
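A minimal sketch of this grouping step is given below (illustrative code; the tolerance argument is our own practical addition, since iterative solvers return pairwise differences that are only numerically zero).

import numpy as np

# Sketch of the grouping step: rows of B_star whose pairwise differences are
# (numerically) zero are merged into one estimated group.
def extract_clusters(B_star, tol=1e-6):
    p = B_star.shape[0]
    labels = np.full(p, -1, dtype=int)
    k = 0
    for j1 in range(p):
        if labels[j1] >= 0:
            continue                                   # already assigned to a group
        labels[j1] = k
        for j2 in range(j1 + 1, p):
            if labels[j2] < 0 and np.linalg.norm(B_star[j1] - B_star[j2]) < tol:
                labels[j2] = k                         # same estimated group C_hat
        k += 1
    return labels, k                                   # k is the estimate of k0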

Traditional clustering methodologies like K-means have many limitations. In most cases, we have to pre-specify the number of clusters, and we may end up at a local minimum instead of the global one. The advantage of clustering by solving (2.3) for a range of λ is that we get a solution path from at most p singleton clusters to a single cluster consisting of all the elements. Subsequently, we can fix a criterion function that enables us to find the optimal cluster assignment among all models in the solution path. Hence, not only do we avoid fixing the number of clusters beforehand, we also avoid the problem of getting stuck at a local minimum. Several efficient algorithms have been developed in recent years to solve (2.3) when the penalty function ρ is convex, e.g., ρ(b_{j_1}, b_{j_2}) = ‖b_{j_1} − b_{j_2}‖_p for some p ≥ 1. Pelckmans et al. (2005)[9], Lindsten et al. (2011)[7], Hocking et al. (2011)[5] and others proposed this approach and established it to be more robust and scalable than traditional methods. Lindsten et al. (2011)[7] used the off-the-shelf convex solver CVX to solve the convex clustering problem, which suffers from scalability issues. Theoretical perfect cluster recovery conditions were derived by Zhu et al. (2014)[16] only for two clusters, while Panahi et al. (2017)[8] derived perfect recovery conditions for a general number of clusters k, but under the assumption of uniform weights. Sun et al. (2021)[13] provide sufficient conditions for perfect recovery under more general weight choices. They also developed a faster algorithm, a semismooth Newton based augmented Lagrangian method (SS-NAL), and derived convergence criteria for it.

Chi and Lange (2015)[3] developed an elegant method of solving the convex clustering problem via augmented Lagrangian methods. For ρ(x) = ‖x‖_2, we first view (2.3) as the following constrained optimization problem

min\displaystyle\min 12j=1p𝝅^j𝐛j22+λlwl𝐯l2\displaystyle\dfrac{1}{2}\sum_{j=1}^{p}\lVert\hat{\bm{\pi}}_{j}-\mathbf{b}_{j}\rVert_{2}^{2}+\lambda\sum_{l\in\mathcal{E}}w_{l}\lVert\mathbf{v}_{l}\rVert_{2} (2.4)
subject to 𝐛l1𝐛l2𝐯l=0;\displaystyle\mathbf{b}_{l_{1}}-\mathbf{b}_{l_{2}}-\mathbf{v}_{l}=0;

where ℰ is the set of all distinct edges {l : l = (l_1, l_2), l_1 < l_2, w_l > 0}. Here, a new splitting variable v_l has been introduced to capture the difference between the group centroids, which makes the optimization procedure much easier. Chi and Lange (2015)[3] used two algorithms to solve this constrained optimization problem, namely ADMM and AMA. For both algorithms, one first has to form the augmented Lagrangian as follows:

ν(𝐁,𝐕,𝚪)=\displaystyle\mathcal{L}_{\nu}(\mathbf{B},\mathbf{V},\bm{\Gamma})= 12j=1p𝝅^j𝐛j22+λlwl𝐯l2\displaystyle\dfrac{1}{2}\sum_{j=1}^{p}\lVert\hat{\bm{\pi}}_{j}-\mathbf{b}_{j}\rVert_{2}^{2}+\lambda\sum_{l\in\mathcal{E}}w_{l}\lVert\mathbf{v}_{l}\rVert_{2} (2.5)
+l𝜸l,𝐯l𝐛l1+𝐛l2+ν2l𝐯l𝐛l1+𝐛l222,\displaystyle+\sum_{l\in\mathcal{E}}\langle\bm{\gamma}_{l},\mathbf{v}_{l}-\mathbf{b}_{l_{1}}+\mathbf{b}_{l_{2}}\rangle+\dfrac{\nu}{2}\sum_{l\in\mathcal{E}}\lVert\mathbf{v}_{l}-\mathbf{b}_{l_{1}}+\mathbf{b}_{l_{2}}\rVert_{2}^{2},

where B, V and Γ are the matrices with b_j, v_l and γ_l, for j = 1, …, p and l ∈ ℰ, in their columns, respectively. Splitting the variables in this fashion allows us to update B, V and Γ sequentially, given the other variables. The convergence of ADMM does not depend on the choice of ν; it converges for any ν > 0. On the other hand, AMA converges for any 0 < ν < 2/p. The performance of the two algorithms has been compared, and AMA is found to be much faster than ADMM, especially when the weights are sparse. The major advantage of AMA lies in its inherent structure. After some basic algebra, Chi and Lange (2015)[3] showed that we only need to update B and Γ in every step, and we can bypass updating V by applying a linear relationship among the variables. Since AMA is much faster, we use this method in our analysis. Suppose B^(t) and Γ^(t) are the parameter values at the t-th step. The updates in the next step are computed using the following relations:

𝐛j(t+1)\displaystyle\mathbf{b}_{j}^{(t+1)} =𝝅^j+l1=j𝜸l(t)l2=j𝜸l(t)\displaystyle=\hat{\bm{\pi}}_{j}+\sum_{l_{1}=j}\bm{\gamma}_{l}^{(t)}-\sum_{l_{2}=j}\bm{\gamma}_{l}^{(t)}
𝜸l(t+1)\displaystyle\bm{\gamma}_{l}^{(t+1)} =𝒫Cl(𝜸l(t)ν𝐠l(t+1))\displaystyle=\mathcal{P}_{C_{l}}(\bm{\gamma}_{l}^{(t)}-\nu\mathbf{g}_{l}^{(t+1)})

where g_l^(t+1) = b_{l_1}^(t+1) − b_{l_2}^(t+1), C_l = {γ_l : ‖γ_l‖_2 ≤ λ w_l}, and 𝒫_A(x) is the projection of x onto the set A. We continue until convergence. The convergence criterion is discussed in detail in Chi and Lange (2015)[3], using the dual problem and the duality gap.

Algorithm 1 AMA

Initialize 𝚪(0)\bm{\Gamma}^{(0)}

1:for t=1,2,3,t=1,2,3,... do
2:     for j=1,2,3,,pj=1,2,3,...,p do
3:         Δ_j^(t) = Σ_{l_1=j} γ_l^(t−1) − Σ_{l_2=j} γ_l^(t−1)
4:     end for
5:     for all ll do
6:         𝐠l(t)=𝝅^l1𝝅^l2+𝚫l1(t)𝚫l2(t)\mathbf{g}_{l}^{(t)}=\hat{\bm{\pi}}_{l_{1}}-\hat{\bm{\pi}}_{l_{2}}+\bm{\Delta}_{l_{1}}^{(t)}-\bm{\Delta}_{l_{2}}^{(t)}
7:         𝜸l(t)=𝒫Cl(𝜸l(t1)ν𝐠l(t))\bm{\gamma}_{l}^{(t)}=\mathcal{P}_{C_{l}}(\bm{\gamma}_{l}^{(t-1)}-\nu\mathbf{g}_{l}^{(t)})
8:     end for
9:end for
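For reference, a compact Python sketch of the AMA updates above is given below. This is illustrative code under our own naming and stopping rule, not the authors' implementation; it forms the edge set from the positive weights, uses ν slightly below the bound 2/p, projects each dual variable onto C_l, and recovers b_j = π̂_j + Δ_j from the final duals.

import numpy as np

# Minimal sketch of the AMA iteration of Algorithm 1 (illustrative, not the authors' code).
# pi_hat is p x d; W is the symmetric weight matrix; edges are the pairs with w_l > 0.
def ama_convex_cluster(pi_hat, W, lam, nu=None, max_iter=2000, tol=1e-8):
    p, d = pi_hat.shape
    edges = [(i, j) for i in range(p) for j in range(i + 1, p) if W[i, j] > 0]
    if nu is None:
        nu = 1.9 / p                                   # AMA converges for 0 < nu < 2/p
    gamma = np.zeros((len(edges), d))                  # dual variables gamma_l
    for _ in range(max_iter):
        delta = np.zeros((p, d))                       # Delta_j = sum_{l1=j} gamma_l - sum_{l2=j} gamma_l
        for idx, (l1, l2) in enumerate(edges):
            delta[l1] += gamma[idx]
            delta[l2] -= gamma[idx]
        gamma_new = np.empty_like(gamma)
        for idx, (l1, l2) in enumerate(edges):
            g_l = pi_hat[l1] - pi_hat[l2] + delta[l1] - delta[l2]
            z = gamma[idx] - nu * g_l
            radius = lam * W[l1, l2]                   # project onto C_l = {||gamma_l||_2 <= lam w_l}
            norm_z = np.linalg.norm(z)
            gamma_new[idx] = z if norm_z <= radius else z * (radius / norm_z)
        converged = np.max(np.abs(gamma_new - gamma)) < tol    # crude stopping rule
        gamma = gamma_new
        if converged:
            break
    delta = np.zeros((p, d))                           # recover b_j = pi_hat_j + Delta_j
    for idx, (l1, l2) in enumerate(edges):
        delta[l1] += gamma[idx]
        delta[l2] -= gamma[idx]
    return pi_hat + delta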

2.3 Selection of the Tuning Parameter

So far, we have discussed numerical methods to solve (2.3) for a given λ. However, it is important to choose an optimal value of λ for the optimization problem. In this section, we propose a data-driven method to select this tuning parameter using the BIC criterion. For a given λ, denote the obtained clusters by 𝒞̂_1(λ), …, 𝒞̂_{k_λ}(λ), where k_λ is the number of clusters. Define the common transition probability for the m-tuples in the estimated group 𝒞̂_α(λ) as

R^α,a(λ)=σj𝒞^α(λ)Nσj,aσj𝒞^α(λ)Nσj=N𝒞^α(λ),aN𝒞^α(λ)α=1,,kλ;aΣ.\hat{R}^{(\lambda)}_{\alpha,a}=\dfrac{\sum_{\sigma_{j}\in\hat{\mathcal{C}}_{\alpha}(\lambda)}N_{\sigma_{j},a}}{\sum_{\sigma_{j}\in\hat{\mathcal{C}}_{\alpha}(\lambda)}N_{\sigma_{j}}}=\dfrac{N_{\hat{\mathcal{C}}_{\alpha}(\lambda),a}}{N_{\hat{\mathcal{C}}_{\alpha}(\lambda)}}\quad\quad\forall\alpha=1,...,k_{\lambda};a\in\Sigma.

The log-likelihood of the observations under the obtained cluster assignment for a particular λ\lambda is given by

n(λ)=α=1kλaΣN𝒞^α(λ),alogR^α,a(λ).\ell_{n}(\lambda)=\sum_{\alpha=1}^{k_{\lambda}}\sum_{a\in\Sigma}N_{\hat{\mathcal{C}}_{\alpha}(\lambda),a}\log\hat{R}^{(\lambda)}_{\alpha,a}.

Hence, the BIC score corresponding to the obtained model is

BICn(λ)=2n(λ)+kλ(|Σ|1)logn.BIC_{n}(\lambda)=-2\ell_{n}(\lambda)+k_{\lambda}(|\Sigma|-1)\log n.

By a grid search over a range of possible λ\lambda values, we select the λ\lambda for which BIC is minimized. The solution of the equation (2.3) corresponding to that λ\lambda is considered as the estimated cluster assignment. In the next section, we provide theoretical results which will demonstrate that for a range of λ\lambda values, we will be able to perfectly recover the true clusters for large nn.
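Putting the pieces together, the tuning parameter selection of this section can be sketched as follows (illustrative code reusing the ama_convex_cluster and extract_clusters sketches given earlier; counts denotes the p × d matrix of transition counts N_{σ_j,a} and n the sequence length).

import numpy as np

# Sketch of the BIC-based choice of lambda over a grid (Section 2.3).
def select_lambda_bic(counts, pi_hat, W, lambda_grid, n):
    d = counts.shape[1]
    best = None
    for lam in lambda_grid:
        B_star = ama_convex_cluster(pi_hat, W, lam)
        labels, k_lam = extract_clusters(B_star)
        loglik = 0.0
        for alpha in range(k_lam):
            group_counts = counts[labels == alpha].sum(axis=0)     # N_{C_hat_alpha, a}
            R_hat = group_counts / group_counts.sum()              # pooled transition estimate
            nz = group_counts > 0
            loglik += np.sum(group_counts[nz] * np.log(R_hat[nz]))
        bic = -2.0 * loglik + k_lam * (d - 1) * np.log(n)
        if best is None or bic < best[0]:
            best = (bic, lam, labels, k_lam)
    return best    # (BIC value, selected lambda, cluster labels, number of clusters)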

3 Conditions and Theoretical Results

3.1 Conditions

Consider solving (2.3) with ρ(b_{j_1}, b_{j_2}) = ‖b_{j_1} − b_{j_2}‖_2, and denote the optimal solution by b_i^*(λ), for i = 1, 2, …, p. Let the true partition of the state space Σ^m be {𝒞_1, …, 𝒞_{k_0}}, with corresponding transition probability vectors R_1, …, R_{k_0}, such that R_{ℓ,a} = P(X_{t+1} = a | X̃_t^(m) = σ_j) for σ_j ∈ 𝒞_ℓ. Denote by p_ℓ = |𝒞_ℓ| the size of the ℓ-th group. Following the notation of Sun et al. (2021)[13], define

wi(β)=j𝒞βwi,ji=1,2,,p;μi,j(α)=β=1,βαk0|wi(β)wj(β)|α=1,2,,k0;\displaystyle w_{i}^{(\beta)}=\sum_{j\in{\cal C}_{\beta}}w_{i,j}\quad\forall i=1,2,...,p;\quad\quad\mu_{i,j}^{(\alpha)}=\sum_{\beta=1,\beta\neq\alpha}^{k_{0}}\big{\lvert}w_{i}^{(\beta)}-w_{j}^{(\beta)}\big{\rvert}\quad\quad\forall\alpha=1,2,...,k_{0};
w(α,β)=i𝒞αj𝒞βwi,jαβ,α,β{1,2,,k0};𝝅¯^(α)=1pαi𝒞α𝝅^i;\displaystyle w^{(\alpha,\beta)}=\sum_{i\in{\cal C}_{\alpha}}\sum_{j\in{\cal C}_{\beta}}w_{i,j}\quad\forall\alpha\neq\beta,\alpha,\beta\in\{1,2,...,k_{0}\};\quad\quad\hat{\bar{\bm{\pi}}}^{(\alpha)}=\dfrac{1}{p_{\alpha}}\sum_{i\in{\cal C}_{\alpha}}\hat{\bm{\pi}}_{i};
λmin(n)=max1αk0maxi,j𝒞α{𝝅^i𝝅^j2pαwi,jμi,j(α)};λmax(n)=min1α<βk0{𝝅¯^(α)𝝅¯^(β)21pαlαw(α,l)+1pβlβw(β,l)}.\displaystyle\lambda_{\text{min}}^{(n)}=\max_{1\leq\alpha\leq k_{0}}\max_{i,j\in{\cal C}_{\alpha}}\Bigg{\{}\dfrac{\lVert\hat{\bm{\pi}}_{i}-\hat{\bm{\pi}}_{j}\rVert_{2}}{p_{\alpha}w_{i,j}-\mu_{i,j}^{(\alpha)}}\Bigg{\}};\quad\lambda_{\text{max}}^{(n)}=\min_{1\leq\alpha<\beta\leq k_{0}}\Bigg{\{}\dfrac{\lVert\hat{\bar{\bm{\pi}}}^{(\alpha)}-\hat{\bar{\bm{\pi}}}^{(\beta)}\rVert_{2}}{\frac{1}{p_{\alpha}}\sum_{l\neq\alpha}w^{(\alpha,l)}+\frac{1}{p_{\beta}}\sum_{l\neq\beta}w^{(\beta,l)}}\Bigg{\}}.

Suppose, the following conditions hold.
(A1)  wj1,j2=wj2,j1w_{j_{1},j_{2}}=w_{j_{2},j_{1}} and wj1,j2>0w_{j_{1},j_{2}}>0 for any j1,j2𝒞j_{1},j_{2}\in{\cal C}_{\ell}, =1,2,,k0\ell=1,2,...,k_{0}.
(A2)  pαwi,j>μi,j(α)p_{\alpha}w_{i,j}>\mu_{i,j}^{(\alpha)}, i,j𝒞α\forall i,j\in{\cal C}_{\alpha} and α=1,2,,k0\forall\alpha=1,2,...,k_{0}.

In other words, condition (A1) requires symmetry of the weights and that the weight between two m-tuples belonging to the same group be positive. On the other hand, (A2) determines a lower bound on the weight between two m-tuples in a particular group. We will use these conditions to prove our results. Before stating the main results, we recall a few auxiliary results.
Proposition 1 (Consistency of Estimated Transition Probabilities):  Suppose X_1, …, X_n is an SMM of order m, with partition {𝒞_1, …, 𝒞_{k_0}}. Then, as n → ∞,
(a) 𝝅^j𝑝𝑹α\hat{\bm{\pi}}_{j}\xrightarrow[]{p}\bm{R}_{\alpha} for j𝒞αj\in\mathcal{C}_{\alpha};
(b) NσjN𝑝qj\dfrac{N_{\sigma_{j}}}{N}\xrightarrow[]{p}q_{j}, where qjq_{j} is the stationary probability of the state σj\sigma_{j};
(c)

Nσj(𝝅^j𝑹α)𝑑𝒩(𝟎,Σα)\sqrt{N_{\sigma_{j}}}(\hat{\bm{\pi}}_{j}-\bm{R}_{\alpha})\xrightarrow[]{d}\mathcal{N}(\bm{0},\Sigma_{\alpha})

where Σα=diag(𝑹α)𝑹α𝑹α(T)\Sigma_{\alpha}=diag(\bm{R}_{\alpha})-\bm{R}_{\alpha}\bm{R}_{\alpha}^{(T)}. Since Σα\Sigma_{\alpha} is of rank |Σ|1|\Sigma|-1, the asymptotic Normal distribution is singular.
Proposition 2 (Sun et al. (2021)):  Suppose the above conditions hold, and λmin(n)<λmax(n)\lambda_{\text{min}}^{(n)}<\lambda_{\text{max}}^{(n)}. Then for any λ(λmin(n),λmax(n))\lambda\in(\lambda_{\text{min}}^{(n)},\lambda_{\text{max}}^{(n)}), 𝒃i(λ)=𝒃j(λ)\bm{b}_{i}^{*}(\lambda)=\bm{b}_{j}^{*}(\lambda) for i,j𝒞αi,j\in\mathcal{C}_{\alpha}; α=1,..,k0\alpha=1,..,k_{0} and 𝒃i(λ)𝒃j(λ)\bm{b}_{i}^{*}(\lambda)\neq\bm{b}_{j}^{*}(\lambda) for any i𝒞α,j𝒞β,αβi\in\mathcal{C}_{\alpha},j\in\mathcal{C}_{\beta},\alpha\neq\beta. In other words, for any λ(λmin(n),λmax(n))\lambda\in(\lambda_{\text{min}}^{(n)},\lambda_{\text{max}}^{(n)}), we recover the true partition of the state space.
Thus, Proposition 1 tells us about the weak consistency and the CLT for the transition probability vectors. Proposition 2 deals with perfect recovery conditions under general weight choices, derived in Sun et al. (2021). These propositions will be the key tools in proving our results. In the next section, we provide our major theoretical findings.

3.2 Main results

The first result ensures that the solution of the objective function in (2.4) produces a valid probability distribution over Σ. Note that we do not use this solution as the estimated transition probability of a group. However, if under some exceptional circumstances the solution exhibits oracle properties, one can use the solutions as the common transition probabilities of the groups.
Theorem 1:  For any λ > 0, the optimal solution b_i^*(λ) is a valid probability distribution for i = 1, 2, …, p; i.e.
(a) bi,a(λ)0b^{*}_{i,a}(\lambda)\geq 0 for a=1,,da=1,...,d,
(b) a=1dbi,a(λ)=1\sum_{a=1}^{d}b^{*}_{i,a}(\lambda)=1.
Next, we derive the probability of true cluster recovery. There are two steps involved in this process. First, we need the true model to be in the solution path over varying λ. This requires the conditions of Proposition 2 to be satisfied, i.e., λ_min^(n) < λ_max^(n). Theorem 2 gives a lower bound on the probability of the true model being present in the solution path. It shows that, as the sequence length increases, the probability of imperfect recovery converges to 0 at an exponential rate.
Theorem 2:  Define

δ\displaystyle\delta =min1α<βk0𝑹α𝑹β2;δ1=min1αk0mini,j𝒞α(pαwi,jμi,j(α))\displaystyle=\min_{1\leq\alpha<\beta\leq k_{0}}\lVert\bm{R}_{\alpha}-\bm{R}_{\beta}\rVert_{2};\quad\quad\delta_{1}=\min_{1\leq\alpha\leq k_{0}}\min_{i,j\in\mathcal{C}_{\alpha}}\big{(}p_{\alpha}w_{i,j}-\mu_{i,j}^{(\alpha)}\big{)}
δ2\displaystyle\delta_{2} =max1α<βk0(1pαlαw(α,l)+1pβlβw(β,l))\displaystyle=\max_{1\leq\alpha<\beta\leq k_{0}}\Big{(}\frac{1}{p_{\alpha}}\sum_{l\neq\alpha}w^{(\alpha,l)}+\frac{1}{p_{\beta}}\sum_{l\neq\beta}w^{(\beta,l)}\Big{)}

Then, under the above conditions (A1) and (A2), as nn\to\infty,

P(λmin(n)<λmax(n))\displaystyle P\Big{(}\lambda_{\text{min}}^{(n)}<\lambda_{\text{max}}^{(n)}\Big{)} P(𝝅^j𝑹α2<ϵ2;j𝒞α,α=1,,k0)\displaystyle\geq P\Big{(}\lVert\hat{\bm{\pi}}_{j}-\bm{R}_{\alpha}\rVert_{2}<\dfrac{\epsilon}{2};\forall j\in\mathcal{C}_{\alpha},\forall\alpha=1,...,k_{0}\Big{)}
1α=1k0C1(α)j𝒞αexp[Nϵ2C2,j]\displaystyle\geq 1-\sum_{\alpha=1}^{k_{0}}C_{1}^{(\alpha)}\sum_{j\in\mathcal{C}_{\alpha}}\exp\Big{[}-N\epsilon^{2}C_{2,j}\Big{]}

for 0<ϵ<δδ1δ1+δ20<\epsilon<\dfrac{\delta\delta_{1}}{\delta_{1}+\delta_{2}}, and for some constants C1(α),C2,j>0C_{1}^{(\alpha)},C_{2,j}>0.
Once we have the true model in the solution path, the next step is to establish that the probability of selecting that model through the BIC criterion converges to 1 as n → ∞. The next theorem establishes this.
Theorem 3:  Suppose the conditions of Theorem 2 hold, and λ_min^(n) < λ_max^(n). For any λ, denote the clustering assignment obtained by minimizing (2.3) as M_λ = {𝒞̂_1(λ), …, 𝒞̂_{k_λ}(λ)}, where k_λ is the number of clusters. Let ℓ_n(λ) be the log-likelihood of the observations corresponding to the cluster assignment M_λ, with corresponding BIC score BIC_n(λ) = −2ℓ_n(λ) + k_λ(d−1) log n. Choose some λ_0 ∈ (λ_min^(n), λ_max^(n)). Then, for any λ such that M_λ ≠ M_{λ_0},

P(BICn(λ0)<BICn(λ))1P\Big{(}BIC_{n}(\lambda_{0})<BIC_{n}(\lambda)\Big{)}\longrightarrow 1

as nn\to\infty.
Next, we provide some sufficient conditions for perfect cluster recovery under a particular weight choice using Gaussian kernels. These weights have been used in previous analyses of convex clustering and have been shown to produce good clustering results.
Theorem 4:  Define pmin=minαpα,pmax=maxαpαp_{min}=\min_{\alpha}p_{\alpha},p_{max}=\max_{\alpha}p_{\alpha}. Suppose the following assumptions hold.
(a) wi,j=eϕ𝝅^i𝝅^j22li,jkw_{i,j}=e^{-\phi\lVert\hat{\bm{\pi}}_{i}-\hat{\bm{\pi}}_{j}\rVert_{2}^{2}}l^{k}_{i,j}, where li,jkl^{k}_{i,j} is the indicator function that 𝝅^i\hat{\bm{\pi}}_{i} belongs to kk nearest neighbour of 𝝅^j\hat{\bm{\pi}}_{j} or vice versa, for some ϕ>0\phi>0.
(b) kpmax1k\geq p_{max}-1.
(c) For some ϵ<ϵmax=δ212ϕδlog(2(k+1pmin1))\epsilon<\epsilon_{max}=\dfrac{\delta}{2}-\dfrac{1}{2\phi\delta}\log\Big{(}2\Big{(}\dfrac{k+1}{p_{min}}-1\Big{)}\Big{)}, 𝝅^j𝑹α2<ϵ2;\lVert\hat{\bm{\pi}}_{j}-\bm{R}_{\alpha}\rVert_{2}<\dfrac{\epsilon}{2}; j𝒞α\forall j\in\mathcal{C}_{\alpha}, α=1,,k0\forall\alpha=1,...,k_{0}.
Under the above assumptions, the conditions (A1) and (A2) are satisfied. Moreover, δ1pmineϕϵmax22(k+1pmin)eϕ(δϵmax)2=δ1(min)\delta_{1}\geq p_{min}e^{-\phi\epsilon_{max}^{2}}-2(k+1-p_{min})e^{-\phi(\delta-\epsilon_{max})^{2}}=\delta^{(min)}_{1}, δ22(k+1pmin)eϕ(δϵmax)2=δ2(max).\delta_{2}\leq 2(k+1-p_{min})e^{-\phi(\delta-\epsilon_{max})^{2}}=\delta_{2}^{(max)}.

Corollary 1:  Under the assumptions of Theorem 3,
(a) λmin(n)ϵδ1(min)\lambda_{\text{min}}^{(n)}\leq\dfrac{\epsilon}{\delta^{(min)}_{1}}, λmax(n)δϵδ2(max)\lambda_{\text{max}}^{(n)}\geq\dfrac{\delta-\epsilon}{\delta_{2}^{(max)}}.
(b) ϵ<min{ϵmax,δδ1δ1+δ2}λmin(n)<λmax(n)\epsilon<\min\Big{\{}\epsilon_{max},\dfrac{\delta\delta_{1}}{\delta_{1}+\delta_{2}}\Big{\}}\implies\lambda_{\text{min}}^{(n)}<\lambda_{\text{max}}^{(n)}.
Corollary 2:  For balanced design, i.e. when pα=p/k0p_{\alpha}=p/k_{0} are the same for all groups 𝒞α\mathcal{C}_{\alpha}, δ2(max)=0\delta_{2}^{(max)}=0 if k=p/k01k=p/k_{0}-1. Hence, for any ϵ<ϵmax\epsilon<\epsilon_{max}, perfect recovery is possible for λ(λmin,)\lambda\in(\lambda_{\text{min}},\infty).
We omit the proof of this theorem, as it is an algebraic derivation of the terms defined earlier for a particular weight choice. Corollary 1 gives an idea of how close the empirical transition probability of each m-tuple should be to the true probability. Corollary 2 covers the special case of a balanced design. In that scenario, for the correct choice of the number of nearest neighbours, the true model can be retrieved for a wide range of the tuning parameter λ, yielding very good clustering results. We show in the simulation studies how the weight choices impact clustering accuracy.

4 Simulation

In this section, we numerically demonstrate the performance of the convex clustering methodology described in the previous section in terms of recovering the true cluster assignments. We emphasize the choice of the weights w_{i,j} and compare clustering performance for different weight choices. For illustration, we consider SMMs of various orders, sequence lengths and alphabet sizes |Σ|. Note that we do not pre-specify the number of clusters or the cluster labels in our method, so it is not feasible to compute a straightforward misclassification rate against the truth. Instead, we use the widely used Rand Index (RI) and Adjusted Rand Index (ARI) to measure the similarity between the true and estimated cluster assignments.

The Rand Index computes the proportion of pairs (i,j) that are correctly identified as belonging to the same cluster or to different clusters. Mathematically, for any two cluster assignments X = (X_1, …, X_r) and Y = (Y_1, …, Y_s) of the elements (σ_1, …, σ_p), the Rand Index is defined by

RI=a+ba+b+c+d=a+b(p2)RI=\dfrac{a+b}{a+b+c+d}=\dfrac{a+b}{\binom{p}{2}}

where a is the number of pairs that are in the same cluster in both X and Y, b is the number of pairs that are in different clusters in both X and Y, c is the number of pairs that are in the same cluster of X but in different clusters of Y, and d is the number of pairs that are in the same cluster of Y but in different clusters of X. Values of RI vary between 0 and 1. If the two clusterings are identical, RI equals 1. Higher RI values indicate greater similarity between the two clusterings.

However, the Rand Index has some limitations. For example, if the number of clusters increases and the cluster sizes are not large, RI will be close to 1 even for two completely different cluster assignments. To address this issue, the Adjusted Rand Index (ARI) is preferred. ARI is a corrected version of the usual Rand Index, which uses the expected similarity of all pairwise comparisons between clusterings specified by a random model. If a_i = |X_i|, b_j = |Y_j|, p_{ij} = |X_i ∩ Y_j|, then ARI is computed by the following formula:

ARI=i,j(pij2)[i(ai2)j(bj2)]/(p2)12[i(ai2)+j(bj2)][i(ai2)j(bj2)]/(p2).ARI=\dfrac{\sum\limits_{i,j}\binom{p_{ij}}{2}-\big{[}\sum\limits_{i}\binom{a_{i}}{2}\sum\limits_{j}\binom{b_{j}}{2}\big{]}/\binom{p}{2}}{\frac{1}{2}\big{[}\sum\limits_{i}\binom{a_{i}}{2}+\sum\limits_{j}\binom{b_{j}}{2}\big{]}-\big{[}\sum\limits_{i}\binom{a_{i}}{2}\sum\limits_{j}\binom{b_{j}}{2}\big{]}/\binom{p}{2}}.
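The following sketch (our own illustrative code) computes RI from the pair counts and ARI from the contingency table p_{ij} = |X_i ∩ Y_j|, following the formulas above.

import numpy as np
from math import comb

# Sketch computing RI and ARI from two label vectors (any comparable cluster ids).
def rand_indices(labels_x, labels_y):
    labels_x, labels_y = np.asarray(labels_x), np.asarray(labels_y)
    p = len(labels_x)
    same_x = labels_x[:, None] == labels_x[None, :]
    same_y = labels_y[:, None] == labels_y[None, :]
    iu = np.triu_indices(p, k=1)                       # all pairs i < j
    agree = (same_x[iu] == same_y[iu]).sum()           # a + b
    ri = agree / comb(p, 2)
    # contingency table n_ij = |X_i intersect Y_j| for the ARI formula
    xs = np.unique(labels_x, return_inverse=True)[1]
    ys = np.unique(labels_y, return_inverse=True)[1]
    cont = np.zeros((xs.max() + 1, ys.max() + 1))
    np.add.at(cont, (xs, ys), 1)
    sum_ij = sum(comb(int(n), 2) for n in cont.ravel())
    sum_a = sum(comb(int(n), 2) for n in cont.sum(axis=1))
    sum_b = sum(comb(int(n), 2) for n in cont.sum(axis=0))
    expected = sum_a * sum_b / comb(p, 2)
    ari = (sum_ij - expected) / (0.5 * (sum_a + sum_b) - expected)
    return ri, ari

ri, ari = rand_indices([0, 0, 1, 1, 2], [0, 0, 1, 2, 2])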

For each simulation study, we compare the cluster assignment estimated by solving (2.3), with an appropriately chosen regularization parameter λ, with the true clustering using both RI and ARI. In particular, we focus on weight choices that result in higher ARI values. Chi and Lange (2015) and Sun et al. (2021) have shown, both numerically and theoretically, that choosing sparse weights substantially improves the clustering quality and makes the algorithm much faster. In our study, we perform the clustering method under different weight choices, both sparse and dense, and compare how the ARI values depend on those choices. In each set-up, we replicate the experiment 1000 times to obtain the mean RI and ARI and their standard errors. We also compute the proportion of times the ARI or RI equals 1, i.e., we empirically estimate the probability of perfect cluster recovery.

4.1 Simulation Set-up 1

Here, we take |Σ| = 4, the usual scenario when analyzing DNA sequences. The order of the chain m is taken to be either 2 or 3. For m = 2, we divide all 16 tuples equally into 4 groups of 4 elements; for m = 3, we divide the 64 triplets into 8 groups of equal size 8. For a particular group g, we generate Z_{g,ℓ} independently from Unif(0,1), for ℓ = 1, 2, 3, 4. The transition probability vector of that group is generated from the Dirichlet distribution with parameter (e^{Z_{g,1}}, …, e^{Z_{g,4}}). As for the weights, we first take w_{i,j} = 1 for all i, j = 1, 2, …, p, i < j. Next, we choose sparse weights depending on the distance between the estimated transition probabilities π̂_i and π̂_j. We use the k-nearest-neighbour based weights proposed in Chi and Lange (2015), namely w_{i,j} = exp(−φ‖π̂_i − π̂_j‖_2^2) l^k_{i,j}, where l^k_{i,j} is the indicator that π̂_i is among the k nearest neighbours of π̂_j or vice versa, for some φ > 0. In this example we use φ = 100. We also incorporate a third choice of weight, namely w_{i,j} = exp(−φ‖π̂_i − π̂_j‖_∞^2) l^{k(∞)}_{i,j}, where l^{k(∞)}_{i,j} is the analogous indicator, but with the nearest neighbours computed with respect to the ℓ_∞ distance. We use two different values of the number of nearest neighbours, k = 5 and k = 3. A sketch of this weight construction is given below; the results are provided in the tables that follow.
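This sketch (illustrative code with our own names) builds the k-nearest-neighbour Gaussian kernel weights w_{i,j} described above; setting ord_ = np.inf gives the ℓ_∞-based variant.

import numpy as np

# w_{i,j} = exp(-phi * ||pi_hat_i - pi_hat_j||^2) if i is among the k nearest
# neighbours of j or vice versa, and 0 otherwise.
def knn_gaussian_weights(pi_hat, k=3, phi=100.0, ord_=2):
    p = pi_hat.shape[0]
    D = np.linalg.norm(pi_hat[:, None, :] - pi_hat[None, :, :], ord=ord_, axis=2)
    W = np.zeros((p, p))
    nn = np.argsort(D, axis=1)[:, 1:k + 1]             # k nearest neighbours of each row (excluding itself)
    for i in range(p):
        for j in nn[i]:
            w = np.exp(-phi * D[i, j] ** 2)
            W[i, j] = W[j, i] = w                      # symmetrize: "i in kNN(j) or vice versa"
    return W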

Sample Size (n) 5000 10000 15000 20000 25000
RI (s.e) 0.769 (0.03) 0.809 (0.03) 0.819 (0.023) 0.833 (0.02) 0.842 (0.02)
ARI (s.e) 0.015 (0.09) 0.141 (0.15) 0.169 (0.12) 0.239 (0.12) 0.295 (0.12)
Prob. of True Recovery 0 0 0 0 0
Table 1: Summary for m=2m=2, Uniform Weight
𝒌\bm{k} nearest neighbour=𝟓\bm{=5}
Sample Size (n) 5000 10000 15000 20000 25000
RI (s.e) 0.900 (0.09) 0.981 (0.04) 0.994 (0.02) 0.997 (0.01) 0.999 (0.01)
ARI (s.e) 0.745 (0.19) 0.946 (0.10) 0.982 (0.05) 0.992 (0.04) 0.997 (0.02)
Prob. of True Recovery 0.223 0.708 0.876 0.951 0.977
𝒌\bm{k} nearest neighbour=𝟑\bm{=3}
Sample Size (n) 5000 10000 15000 20000 25000
RI (s.e) 0.945 (0.07) 0.994 (0.02) 0.999 (0.01) 0.999 (0.005) 1 (0.003)
ARI (s.e) 0.851 (0.17) 0.983 (0.06) 0.995 (0.03) 0.998 (0.02) 0.999 (0.01)
Prob. of True Recovery 0.480 0.908 0.972 0.991 0.996
Table 2: Summary for m=2m=2, k-nearest neighbour clustering with l2l_{2} distance
𝒌\bm{k} nearest neighbour=𝟓\bm{=5}
Sample Size (n) 5000 10000 15000 20000 25000
RI (s.e) 0.889 (0.10) 0.975 (0.04) 0.992 (0.02) 0.996 (0.01) 0.998 (0.01)
ARI (s.e) 0.720 (0.20) 0.928 (0.12) 0.974 (0.06) 0.989 (0.04) 0.994 (0.03)
Prob. of True Recovery 0.187 0.644 0.821 0.922 0.960
𝒌\bm{k} nearest neighbour=𝟑\bm{=3}
Sample Size (n) 5000 10000 15000 20000 25000
RI (s.e) 0.949 (0.07) 0.995 (0.02) 0.999 (0.01) 0.999 (0.005) 1 (0.004)
ARI (s.e) 0.860 (0.17) 0.984 (0.05) 0.995 (0.03) 0.998 (0.02) 0.999 (0.01)
Prob. of True Recovery 0.501 0.908 0.973 0.990 0.996
Table 3: Summary for m=2m=2, k-nearest neighbour clustering with ll_{\infty} distance
Sample Size (n) 5000 10000 15000 20000 25000
RI (s.e) 0.866 (0.03) 0.901 (0.04) 0.933 (0.04) 0.954 (0.03) 0.969 (0.03)
ARI (s.e) 0.199 (0.18) 0.469 (0.26) 0.651 (0.22) 0.777 (0.17) 0.854 (0.15)
Prob. of True Recovery 0 0.003 0.017 0.042 0.128
Table 4: Summary for m=3m=3, Uniform Weight
𝒌\bm{k} nearest neighbour=𝟓\bm{=5}
Sample Size (n) 5000 10000 15000 20000 25000
RI (s.e) 0.982 (0.024) 0.995 (0.007) 0.997 (0.005) 0.999 (0.002) 1 (0.001)
ARI (s.e) 0.935 (0.081) 0.980 (0.027) 0.990 (0.019) 0.997 (0.009) 0.999 (0.004)
Prob. of True Recovery 0.264 0.466 0.714 0.907 0.981
𝒌\bm{k} nearest neighbour=𝟑\bm{=3}
Sample Size (n) 5000 10000 15000 20000 25000
RI (s.e) 0.984 (0.024) 0.994 (0.007) 0.997 (0.005) 0.999 (0.002) 1 (0.001)
ARI (s.e) 0.940 (0.074) 0.979 (0.026) 0.990 (0.019) 0.997 (0.009) 0.999 (0.004)
Prob. of True Recovery 0.223 0.408 0.713 0.908 0.981
Table 5: Summary for m=3m=3, k-nearest neighbour clustering with l2l_{2} distance
𝒌\bm{k} nearest neighbour=𝟓\bm{=5}
Sample Size (n) 5000 10000 15000 20000 25000
RI (s.e) 0.981 (0.025) 0.995 (0.009) 0.998 (0.005) 0.999 (0.002) 1 (0.001)
ARI (s.e) 0.931 (0.085) 0.980 (0.033) 0.990 (0.018) 0.997 (0.009) 0.999 (0.004)
Prob. of True Recovery 0.294 0.508 0.735 0.904 0.980
𝒌\bm{k} nearest neighbour=𝟑\bm{=3}
Sample Size (n) 5000 10000 15000 20000 25000
RI (s.e) 0.983 (0.021) 0.994 (0.007) 0.997 (0.005) 0.999 (0.002) 1 (0.001)
ARI (s.e) 0.937 (0.076) 0.977 (0.029) 0.990 (0.019) 0.997 (0.009) 0.999 (0.004)
Prob. of True Recovery 0.206 0.387 0.709 0.907 0.981
Table 6: Summary for m=3m=3, k-nearest neighbour clustering with ll_{\infty} distance

From this study, it is clear that for both m = 2 and m = 3, the choice of uniform weights results in rather poor performance in terms of recovering the true clusters. Although the ARI increases with increasing n, a very large sample size may be needed to get good results. On the other hand, choosing the weights with k = 3 nearest neighbours performs much better than k = 5 in terms of both ARI and perfect recovery, especially for smaller sample sizes. Note that the model is balanced in this example, and the optimal choice of k is 3 (by Corollary 2), a fact supported by our simulation study. For m = 3, the optimal choice of k is 7, but the choice k = 5 is reliable as well; k = 3 makes the weights too sparse in this case, which results in a slight degradation of the clustering accuracy. For both m = 2 and m = 3, the probability of true recovery increases with increasing n.

This experiment produces quite good results for large n, mostly for n ≥ 10000. It is thus worth investigating under what circumstances we can obtain very good recovery for smaller n, such as n = 1000 or n = 2000. The theoretical results suggest that if the cluster centroids are well separated, clustering performance improves even for smaller sample sizes. In this experiment, we have 4 groups for m = 2, with minimum centroid difference ?? in terms of the ℓ_2 distance and ?? in terms of the ℓ_∞ distance; the corresponding values are ?? and ?? for m = 3. In the next simulation study, we demonstrate how the clustering accuracy improves for well separated centroids.

4.2 Simulation Set-up 2

In this study as well, we take |Σ| = 4 and m = 3. We divide the 64 triplets into four groups of sizes 18, 18, 15 and 13. For the α-th group, R_{α,α} = 0.7 and R_{α,β} = 0.1 for α, β = 1, 2, 3, 4, α ≠ β. For the first weight choice, we use the k = 15 nearest-neighbour weights with respect to the ℓ_2 distance in the Gaussian kernel, with φ = 100. The second choice of weight is w_{i,j} = exp(−φ‖π̂_i − π̂_j‖_∞) l^{k(∞)}_{i,j}, with φ = 10. Note that here we use the ℓ_∞ distance to find the nearest neighbours, but instead of the Gaussian kernel we use natural exponential decay. The third weight is similar to the second one, with the ℓ_1 distance in place of ℓ_∞. Here we use relatively small samples, n = 1000 and n = 2000. The results are reported in the following table.

Weight Choice    Sample Size (n)    RI    ARI    Prob. of True Recovery
l2l_{2} Distance, Gaussian Kernel 1000 0.940 (0.022) 0.816 (0.073) 0
2000 0.984 (0.012) 0.954 (0.034) 0.14
ll_{\infty} Distance, Exponential Kernel 1000 0.969 (0.020) 0.908 (0.059) 0.104
2000 0.994 (0.009) 0.983 (0.025) 0.638
l1l_{1} Distance, Exponential Kernel 1000 0.965 (0.020) 0.893 (0.060) 0.03
2000 0.993 (0.009) 0.979 (0.025) 0.468
Table 7: Summary of Simulation 2

From this experiment, we can infer that the weights based on the ℓ_2 distance in the Gaussian kernel perform poorly compared to those based on the ℓ_∞ or ℓ_1 distance. In particular, the ℓ_∞ distance is very effective in such scenarios, as it measures the maximum element-wise distance between two estimated transition probability vectors. In this way, we are able to separate two vectors that are not likely to be in the same cluster. From this study, we can fairly conclude that when the centroids are well separated in such a way that their difference is large in a small number of coordinates, use of the ℓ_∞ distance can provide the best results.

5 Real Data Analysis

A popular application of higher-order Markov models is the analysis of DNA sequences. Scientists have used ordinary Markov chains, VLMCs, hidden Markov models and many other tools to model such sequences for prediction, classification, gene identification, etc. Here, we use the proposed SMM methodology to classify viruses from samples collected from humans. We consider four different viruses in our study: SARS-CoV-2 (COVID-19), MERS (Middle East Respiratory Syndrome), Dengue and Hepatitis B. We collected samples of 500 individuals from the NCBI database: 200 infected with SARS-CoV-2, 50 with MERS, 100 with Dengue and 150 with Hepatitis B, over different time periods and locations. We were particularly careful in collecting the SARS-CoV-2 samples so as to incorporate different strains of that virus. To ensure this, we used 50 samples from each of four different time-frames: April 2020, September 2020, January 2021 and April 2021. The time-frames were selected on the basis of the spread of certain strains or the peaks of COVID-19 cases.

The NCBI database contains a reference genome sequence for every virus. These reference sequences represent an idealized genome structure for a particular virus species. Note that very small changes in the nucleotide sequence can lead to a very different strain of the same disease. The collected samples, if fully available, are almost identical to the reference sequence. In that case there is no need to fit models; we can simply match the collected sample against the available genome sequences and identify the virus whose reference genome it most closely resembles. In practice, however, there are many occasions where the full sequence is not available: perhaps only a part can be retrieved, or some portions of the data are lost. The challenge lies in that scenario, where we should still be able to identify the correct virus as often as possible. In this experiment, we first build a reference model from the reference sequence of each virus. Next, we randomly select a continuous segment of the genome sequence of each sample, and then compute the likelihood of that segment under each of the 4 reference models. Suppose the i-th model is denoted by P̂_i, i = 1, 2, 3, 4. For any given sequence x = x_1 x_2 … x_n, the likelihood of x under each model is P̂_i(x). We then classify x to argmax_{i=1,…,k} P̂_i(x).
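A sketch of this classification rule is given below (illustrative code; the representation of a fitted reference model as a dictionary from m-tuples to pooled transition probability vectors, and all names, are our own assumptions).

import numpy as np

# Score a query segment x under each fitted reference model and assign it to the
# model with the largest log-likelihood.  A fitted model is represented here as a
# dict mapping each m-tuple to its (pooled, SMM-based) transition probability
# vector over Sigma = {A, C, G, T}.
ALPHABET = {"A": 0, "C": 1, "G": 2, "T": 3}

def log_likelihood(x, model, m, eps=1e-12):
    ll = 0.0
    for t in range(m, len(x)):
        probs = model[tuple(x[t - m:t])]               # shared vector of the history's group
        ll += np.log(probs[ALPHABET[x[t]]] + eps)      # eps guards against log(0)
    return ll

def classify(x, models, orders):
    """models: dict virus_name -> transition dict; orders: dict virus_name -> m."""
    scores = {name: log_likelihood(x, model, orders[name]) for name, model in models.items()}
    return max(scores, key=scores.get)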

The lengths of the reference genome sequences for SARS, MERS, Dengue and Hepatitis B are 29903, 30119, 10735 and 3542 respectively. For SARS and MERS, we fit an SMM of order m = 4, while for the other two viruses we use m = 3. The orders are based on the lengths of the reference sequences. There is also a biological motivation for using m ≥ 3: three consecutive DNA bases form a “codon”, which translates the genetic code into a sequence of amino acids, so it is fair to assume that an SMM of order 3 or more will be able to capture the structure of a virus genome. From the samples, we randomly choose segments comprising 100ε% of the sequence and compute the likelihoods under the 4 models to classify each segment to the most likely virus class. Three different values of ε are used: 0.05, 0.1 and 0.25. The 4×4 confusion matrices for the three scenarios are presented in the following tables.

Observed \ Fitted    SARS-CoV-2    MERS    Dengue    Hepatitis B    Total
SARS-CoV-2           185           0       4         11             200
MERS                 0             50      0         0              50
Dengue               0             0       100       0              100
Hepatitis B          17            46      36        51             150
Table 8: Confusion Matrix for ϵ=0.05\epsilon=0.05, Mis-classification Rate= 22.8%22.8\%
Observed \ Fitted    SARS-CoV-2    MERS    Dengue    Hepatitis B    Total
SARS-CoV-2           193           0       0         7              200
MERS                 0             50      0         0              50
Dengue               0             0       100       0              100
Hepatitis B          5             41      23        81             150
Table 9: Confusion Matrix for ϵ=0.1\epsilon=0.1, Mis-classification Rate= 15.2%15.2\%
Observed \ Fitted    SARS-CoV-2    MERS    Dengue    Hepatitis B    Total
SARS-CoV-2           194           0       0         6              200
MERS                 0             50      0         0              50
Dengue               0             0       100       0              100
Hepatitis B          0             10      0         140            150
Table 10: Confusion Matrix for ϵ=0.25\epsilon=0.25, Mis-classification Rate= 3.2%3.2\%

From the tables, it is clear that for ε = 0.05, i.e., when only 5% of the original sample sequence is retained, the misclassification rate among the Hepatitis B samples is high. This is to be expected, since for Hepatitis B the selected segments have length about 170, whereas the lengths are about 500 for Dengue and 1500 for MERS and SARS. As we increase the proportion, the performance naturally improves. The overall misclassification rates are 0.228, 0.152 and 0.032 for ε = 0.05, 0.10 and 0.25 respectively. So, with only 25% of the sequences, we can correctly identify the true virus in more than 95% of the cases. Even within Hepatitis B, the misclassification error reduces drastically once we have a fairly long sequence from which meaningful inference can be made. Note that we selected the snippets from arbitrary parts of the full sample sequences. Our SMM method utilizes the information from the reference genome sequence in a compact manner, so that it can capture the diversity of structure across different parts of the samples. Overall, this method is successful in such classification problems, which opens up scope for broader research in this area.

6 Conclusion

The proposed method of fitting sparse Markov models can be utilized in many different areas. In the future, we may apply the methodology to text mining, recommender systems or other areas of biostatistics. On the theoretical side, new clustering algorithms can be developed by changing the objective function and the penalty, preferably taking into account the number of occurrences of each m-tuple. This will be useful not only in the SMM set-up but also in general clustering problems. We are also interested in how accurate and computationally efficient our method is for prediction using the SMM structure. We conclude by noting that our method is broadly applicable, inviting a number of related research directions.

Appendix A Proof of Theorem 1

(a) For notational simplicity, we write b*_{i,a}(λ) as b*_{i,a}. Also, denote the objective function in (2.3) by R(B, W). Suppose b*_{i,a} < 0 for some of the (i,a) pairs, i = 1, 2, …, p and a = 1, 2, …, d. Let b**_{i,a} = b*_{i,a} ℐ(b*_{i,a} > 0). Since π̂_{i,a} ≥ 0, we get |π̂_{i,a} − b**_{i,a}| ≤ |π̂_{i,a} − b*_{i,a}|. Also, |b*_{i_1,a} − b*_{i_2,a}| ≥ |b**_{i_1,a} − b**_{i_2,a}|, since the negative elements are shrunk to 0. Hence for any i = 1, 2, …, p,

𝝅^i𝒃i22𝝅^i𝒃i22;𝒃i1𝒃i22𝒃i1𝒃i22.\lVert\hat{\bm{\pi}}_{i}-\bm{b}^{*}_{i}\rVert_{2}^{2}\geq\lVert\hat{\bm{\pi}}_{i}-\bm{b}^{**}_{i}\rVert_{2}^{2};\quad\big{\lVert}\bm{b}^{*}_{i_{1}}-\bm{b}^{*}_{i_{2}}\big{\rVert}_{2}\geq\big{\lVert}\bm{b}^{**}_{i_{1}}-\bm{b}^{**}_{i_{2}}\big{\rVert}_{2}.

Since b*_i ≠ b**_i for at least one i, R(B**, W) < R(B*, W), contradicting the fact that B* is the optimal solution. Hence b*_{i,a} ≥ 0 for all i = 1, …, p and a = 1, …, d.
(b) If we initialize Γ^(0) = 0, we get b_i^(1) = π̂_i, which satisfies Σ_{a=1}^d b^(1)_{i,a} = 1. Subsequently, γ_l^(1) = 𝒫_{C_l}(γ_l^(0) − ν g_l^(1)) = (γ_l^(0) − ν g_l^(1)) min{1, λ w_l / ‖γ_l^(0) − ν g_l^(1)‖_2}, and thus γ_l^(1)T 1 = 0. Using similar arguments, for any iteration t, Σ_{a=1}^d b^(t)_{i,a} = 1. Hence the limiting quantity retains the property that the elements of b_i sum to 1. This completes the proof that b_i^* is indeed a probability distribution.

Appendix B Proof of Theorem 2

Note that as nn\to\infty, NσjN_{\sigma_{j}}\to\infty. Let qjq_{j} be the stationary probability of the state σj\sigma_{j}. Then, Nσj/N𝑝qjN_{\sigma_{j}}/N\xrightarrow[]{p}q_{j} as nn\to\infty; and for j𝒞αj\in{\cal C}_{\alpha}, we have

Nσj(𝝅^j𝑹α)𝑑𝒩(𝟎,Σα)\displaystyle\sqrt{N_{\sigma_{j}}}\big{(}\hat{\bm{\pi}}_{j}-\bm{R}_{\alpha}\big{)}\xrightarrow[]{d}\mathcal{N}(\bm{0},\Sigma_{\alpha})
\displaystyle\implies N(𝝅^j𝑹α)𝑑𝒩(𝟎,qjΣα)\displaystyle\sqrt{N}\big{(}\hat{\bm{\pi}}_{j}-\bm{R}_{\alpha}\big{)}\xrightarrow[]{d}\mathcal{N}(\bm{0},q_{j}\Sigma_{\alpha})

where Σα=diag(𝑹α)𝑹α𝑹α(T)\Sigma_{\alpha}=diag(\bm{R}_{\alpha})-\bm{R}_{\alpha}\bm{R}_{\alpha}^{(T)}.
Looking at the expressions for λ_min^(n) and λ_max^(n), it is evident that λ_min^(n) shrinks towards 0 as n increases, since the estimated transition probability vectors π̂_i and π̂_j belonging to the same cluster 𝒞_α become closer to each other. On the other hand, the averages π̄̂^(α) and π̄̂^(β) of the estimated transition probabilities in different groups tend to separate from each other, leading λ_max^(n) to converge to a positive number, and eventually we get λ_min^(n) < λ_max^(n). These expressions also tell us that, in order to have perfect recovery of the clusters, a scaled version of the maximum within-group deviation of the transition probabilities should be less than a scaled version of the minimum between-group variation. These scales depend heavily on the choice of the weights w_{i,j}. Note that if we choose the weights so that w_{i,j} is higher when π̂_i and π̂_j are closer (and potentially belong to the same cluster), and lower when they are far apart (and potentially belong to different clusters), then the denominator of λ_min^(n) will be larger and the denominator of λ_max^(n) will be smaller in the ideal scenario. Hence, such a choice of weights helps to separate λ_min^(n) and λ_max^(n), increasing the chance of recovering the true cluster assignment.

The proof mainly relies on calculating the probability of 𝝅^j\hat{\bm{\pi}}_{j} and 𝑹α\bm{R}_{\alpha}, j𝒞αj\in{\cal C}_{\alpha} being close to each other. Suppose 𝝅^j𝑹α2<ϵ/2\lVert\hat{\bm{\pi}}_{j}-\bm{R}_{\alpha}\rVert_{2}<\epsilon/2 for some ϵ>0\epsilon>0, and j𝒞α\forall j\in{\cal C}_{\alpha}, α=1,2,k0\alpha=1,2...,k_{0}. In that case,

λmin(n)\displaystyle\lambda_{\text{min}}^{(n)} <ϵ/2min1αk0mini,j𝒞α(pαwi,jμi,j(α))\displaystyle<\dfrac{\epsilon/2}{\min\limits_{1\leq\alpha\leq k_{0}}\min\limits_{i,j\in{\cal C}_{\alpha}}\big{(}p_{\alpha}w_{i,j}-\mu_{i,j}^{(\alpha)}\big{)}}
λmax(n)\displaystyle\lambda_{\text{max}}^{(n)} >min1αk0{𝑹α𝑹β2ϵ1pαlαw(α,l)+1pβlβw(β,l)}.\displaystyle>\min_{1\leq\alpha\leq k_{0}}\Bigg{\{}\dfrac{\lVert\bm{R}_{\alpha}-\bm{R}_{\beta}\rVert_{2}-\epsilon}{\frac{1}{p_{\alpha}}\sum_{l\neq\alpha}w^{(\alpha,l)}+\frac{1}{p_{\beta}}\sum_{l\neq\beta}w^{(\beta,l)}}\Bigg{\}}.

So, for ϵ\epsilon sufficiently small, λmin(n)<λmax(n)\lambda_{\text{min}}^{(n)}<\lambda_{\text{max}}^{(n)}. We will later find a bound how small ϵ\epsilon we need to achieve this.

We will compute a lower bound of the following probability:

P(𝝅^j𝑹α2<ϵ2;j𝒞α,α=1,,k0).P\Big{(}\lVert\hat{\bm{\pi}}_{j}-\bm{R}_{\alpha}\rVert_{2}<\dfrac{\epsilon}{2};\forall j\in{\cal C}_{\alpha},\forall\alpha=1,...,k_{0}\Big{)}.

Note that the variance-covariance matrix Σα\Sigma_{\alpha} of the limiting distribution is not full rank, as we have a linear constraint in the elements of 𝝅j\bm{\pi}_{j}. Define Zj=(π^j,1Rj,1,,π^j,d1Rj,d1)TZ_{j}=(\hat{\pi}_{j,1}-R_{j,1},...,\hat{\pi}_{j,d-1}-R_{j,d-1})^{T}, Σα,d\Sigma_{\alpha,-d} be the upper (d1)×(d1)(d-1)\times(d-1) block of Σα\Sigma_{\alpha}. Now,

𝝅^j𝑹α22\displaystyle\lVert\hat{\bm{\pi}}_{j}-\bm{R}_{\alpha}\rVert_{2}^{2} =l=1d1(π^j,lRj,l)2+(l=1d1(π^j,lRj,l))2\displaystyle=\sum_{l=1}^{d-1}(\hat{\pi}_{j,l}-R_{j,l})^{2}+\Big{(}\sum_{l=1}^{d-1}(\hat{\pi}_{j,l}-R_{j,l})\Big{)}^{2}
=ZjTZj+(𝟏TZj)2=ZjT(𝑰+𝟏𝟏T)Zj.\displaystyle=Z_{j}^{T}Z_{j}+(\bm{1}^{T}Z_{j})^{2}=Z_{j}^{T}(\bm{I}+\bm{1}\bm{1}^{T})Z_{j}.

Define Uj=NqjΣα,d1/2ZjU_{j}=\sqrt{\dfrac{N}{q_{j}}}\Sigma_{\alpha,-d}^{-1/2}Z_{j}. By asymptotic normality of 𝝅^j\hat{\bm{\pi}}_{j}, NZj𝑑𝒩(𝟎,qjΣα,d)\sqrt{N}Z_{j}\xrightarrow[]{d}\mathcal{N}(\bm{0},q_{j}\Sigma_{\alpha,-d}), hence Uj𝑑𝒩(𝟎,𝑰)U_{j}\xrightarrow[]{d}\mathcal{N}(\bm{0},\bm{I}). So,

P(𝝅^j𝑹α2ϵ2)\displaystyle P\Big{(}\lVert\hat{\bm{\pi}}_{j}-\bm{R}_{\alpha}\rVert_{2}\geq\dfrac{\epsilon}{2}\Big{)} =P(ZjT(𝑰+𝟏𝟏T)Zjϵ24)=P(UjTΣα,d1/2(𝑰+𝟏𝟏T)Σα,d1/2UjNϵ24qj)\displaystyle=P\Big{(}Z_{j}^{T}(\bm{I}+\bm{1}\bm{1}^{T})Z_{j}\geq\dfrac{\epsilon^{2}}{4}\Big{)}=P\Big{(}U_{j}^{T}\Sigma_{\alpha,-d}^{1/2}(\bm{I}+\bm{1}\bm{1}^{T})\Sigma_{\alpha,-d}^{1/2}U_{j}\geq\dfrac{N\epsilon^{2}}{4q_{j}}\Big{)}
=P(UjT𝑴UjNϵ24qj);𝑴=Σα,d1/2(𝑰+𝟏𝟏T)Σα,d1/2.\displaystyle=P\Big{(}U_{j}^{T}\bm{M}U_{j}\geq\dfrac{N\epsilon^{2}}{4q_{j}}\Big{)};\quad\bm{M}=\Sigma_{\alpha,-d}^{1/2}(\bm{I}+\bm{1}\bm{1}^{T})\Sigma_{\alpha,-d}^{1/2}.

For a symmetric matrix M_1, Hanson and Wright (1971) derived an upper bound on the tail probability of a quadratic form U^T M_1 U of a sub-Gaussian random vector U with mean 0 and variance-covariance matrix σ² I, as follows:

P(UT𝑴1Ut+σ2tr(𝑴1))exp[min(a1t2σ4𝑴1F,a2tσ2𝑴1sp)]P\Big{(}U^{T}\bm{M}_{1}U\geq t+\sigma^{2}tr(\bm{M}_{1})\Big{)}\leq\exp\Big{[}-\min\Big{(}\dfrac{a_{1}t^{2}}{\sigma^{4}\lVert\bm{M}_{1}\rVert_{F}},\dfrac{a_{2}t}{\sigma^{2}\lVert\bm{M}_{1}\rVert_{sp}}\Big{)}\Big{]} (B.6)

for some constants a1,a2>0a_{1},a_{2}>0. Here .F\lVert.\rVert_{F} and .sp\lVert.\rVert_{sp} are Frobenius norm and spectral norm respectively. Applying this bound in (B.6) in our problem, we get, as nn\to\infty,

P(𝝅^j𝑹α2ϵ2)\displaystyle P\Big{(}\lVert\hat{\bm{\pi}}_{j}-\bm{R}_{\alpha}\rVert_{2}\geq\dfrac{\epsilon}{2}\Big{)} =P(UjT𝑴UjNϵ24qj)\displaystyle=P\Big{(}U_{j}^{T}\bm{M}U_{j}\geq\dfrac{N\epsilon^{2}}{4q_{j}}\Big{)}
\leq\exp\Big{[}-\min\Big{(}\dfrac{a_{1}(N\epsilon^{2}-4q_{j}tr(\bm{M}))^{2}}{16q_{j}^{2}\lVert\bm{M}\rVert_{F}^{2}},\dfrac{a_{2}(N\epsilon^{2}-4q_{j}tr(\bm{M}))}{4q_{j}\lVert\bm{M}\rVert_{sp}}\Big{)}\Big{]}.
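The two arguments of this minimum grow at different rates in N. The short Python sketch below, with purely hypothetical constants (a_{1}, a_{2}, q_{j}, \epsilon, tr(\bm{M}), and the two norms are arbitrary stand-ins, not quantities from this paper), illustrates that the quadratic-in-N argument eventually exceeds the linear one:

import numpy as np

# Hypothetical constants; none of these values are taken from the paper.
a1, a2 = 0.1, 0.1
q_j, eps, trM = 0.125, 0.1, 0.6
frob2, sp = 0.25, 0.55               # stand-ins for ||M||_F^2 and ||M||_sp

for N in [10**3, 10**4, 10**5, 10**6]:
    t = N * eps**2 - 4 * q_j * trM
    quad = a1 * t**2 / (16 * q_j**2 * frob2)   # grows like N^2
    lin = a2 * t / (4 * q_j * sp)              # grows like N
    print(N, quad, lin, "minimum is the linear term" if lin <= quad else "")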

Since the first argument of the minimum grows at the rate N^{2} while the second grows only at the rate N, for all sufficiently large n we have \min\Big{(}\dfrac{a_{1}(N\epsilon^{2}-4q_{j}tr(\bm{M}))^{2}}{16q_{j}^{2}\lVert\bm{M}\rVert_{F}^{2}},\dfrac{a_{2}(N\epsilon^{2}-4q_{j}tr(\bm{M}))}{4q_{j}\lVert\bm{M}\rVert_{sp}}\Big{)}=\dfrac{a_{2}(N\epsilon^{2}-4q_{j}tr(\bm{M}))}{4q_{j}\lVert\bm{M}\rVert_{sp}}. Now,

tr(𝑴)\displaystyle tr(\bm{M}) =tr(Σα,d1/2(𝑰+𝟏𝟏T)Σα,d1/2)=tr(Σα,d)+tr(𝟏TΣα,d𝟏)\displaystyle=tr(\Sigma_{\alpha,-d}^{1/2}(\bm{I}+\bm{1}\bm{1}^{T})\Sigma_{\alpha,-d}^{1/2})=tr(\Sigma_{\alpha,-d})+tr(\bm{1}^{T}\Sigma_{\alpha,-d}\bm{1})
=l=1d1Rα,l(1Rα,l)+l=1d1Rα,ll1=1d1l2=1d1Rα,l1Rα,l2\displaystyle=\sum_{l=1}^{d-1}R_{\alpha,l}(1-R_{\alpha,l})+\sum_{l=1}^{d-1}R_{\alpha,l}-\sum_{l_{1}=1}^{d-1}\sum_{l_{2}=1}^{d-1}R_{\alpha,l_{1}}R_{\alpha,l_{2}}
=l=1d1Rα,l(1Rα,l)+(l=1d1Rα,l)(1l=1d1Rα,l)=l=1dRα,l(1Rα,l)=sα(say);\displaystyle=\sum_{l=1}^{d-1}R_{\alpha,l}(1-R_{\alpha,l})+\Big{(}\sum_{l=1}^{d-1}R_{\alpha,l}\Big{)}\Big{(}1-\sum_{l=1}^{d-1}R_{\alpha,l}\Big{)}=\sum_{l=1}^{d}R_{\alpha,l}(1-R_{\alpha,l})=s_{\alpha}(say);
𝑴sp\displaystyle\lVert\bm{M}\rVert_{sp} =Σα,d+Σα,d1/2𝟏𝟏TΣα,d1/2spΣα,dsp+𝟏TΣα,d𝟏maxl=1,2,,d1Rα,l+Rα,d(1Rα,d)=vα\displaystyle=\lVert\Sigma_{\alpha,-d}+\Sigma_{\alpha,-d}^{1/2}\bm{1}\bm{1}^{T}\Sigma_{\alpha,-d}^{1/2}\rVert_{sp}\leq\lVert\Sigma_{\alpha,-d}\rVert_{sp}+\bm{1}^{T}\Sigma_{\alpha,-d}\bm{1}\leq\max\limits_{l=1,2,...,d-1}R_{\alpha,l}+R_{\alpha,d}(1-R_{\alpha,d})=v_{\alpha}

as Σα,dspmaxl=1,2,,d1Rα,l\lVert\Sigma_{\alpha,-d}\rVert_{sp}\leq\max\limits_{l=1,2,...,d-1}R_{\alpha,l} by the result of Watson (1995). Hence,

P(𝝅^j𝑹α2ϵ2)exp[a2(Nϵ24qjsα)4qjvα]=exp(sαvα)exp[a2Nϵ24qjvα]\displaystyle P\Big{(}\lVert\hat{\bm{\pi}}_{j}-\bm{R}_{\alpha}\rVert_{2}\geq\dfrac{\epsilon}{2}\Big{)}\leq\exp\Big{[}-\dfrac{a_{2}(N\epsilon^{2}-4q_{j}s_{\alpha})}{4q_{j}v_{\alpha}}\Big{]}=\exp\Big{(}\dfrac{s_{\alpha}}{v_{\alpha}}\Big{)}\exp\Big{[}-\dfrac{a_{2}N\epsilon^{2}}{4q_{j}v_{\alpha}}\Big{]}
\displaystyle\implies P(𝝅^j𝑹α2<ϵ2;j𝒞α,α=1,,k0)\displaystyle P\Big{(}\lVert\hat{\bm{\pi}}_{j}-\bm{R}_{\alpha}\rVert_{2}<\dfrac{\epsilon}{2};\forall j\in{\cal C}_{\alpha},\forall\alpha=1,...,k_{0}\Big{)}
\displaystyle\geq 1α=1k0j𝒞αP(𝝅^j𝑹α2ϵ2)1α=1k0exp(sαvα)j𝒞αexp[a2Nϵ24qjvα].\displaystyle 1-\sum_{\alpha=1}^{k_{0}}\sum_{j\in{\cal C}_{\alpha}}P\Big{(}\lVert\hat{\bm{\pi}}_{j}-\bm{R}_{\alpha}\rVert_{2}\geq\dfrac{\epsilon}{2}\Big{)}\geq 1-\sum_{\alpha=1}^{k_{0}}\exp\Big{(}\dfrac{s_{\alpha}}{v_{\alpha}}\Big{)}\sum_{j\in{\cal C}_{\alpha}}\exp\Big{[}-\dfrac{a_{2}N\epsilon^{2}}{4q_{j}v_{\alpha}}\Big{]}.

Writing C_{1}^{(\alpha)}=\exp\Big{(}\dfrac{s_{\alpha}}{v_{\alpha}}\Big{)} and C_{2,j}=\dfrac{a_{2}\epsilon^{2}}{4q_{j}v_{\alpha}}, we obtain the bound stated in the theorem, which completes the proof.
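As a numerical sanity check on the trace and spectral-norm calculations above, the following Python sketch takes \Sigma_{\alpha,-d} to be the multinomial-type covariance implied by the computation of tr(\bm{M}) (that is, diag(\bm{R}_{\alpha})-\bm{R}_{\alpha}\bm{R}_{\alpha}^{T} restricted to the first d-1 coordinates), forms \bm{M}, verifies tr(\bm{M})=s_{\alpha} and \lVert\bm{M}\rVert_{sp}\leq v_{\alpha}, and then evaluates C_{1}^{(\alpha)}\exp(-NC_{2,j}) for hypothetical a_{2}, q_{j} and \epsilon to illustrate its decay in N:

import numpy as np

# Illustrative transition vector of a cluster.
R = np.array([0.5, 0.3, 0.2])
d = len(R)

# Covariance consistent with the trace computation above, restricted to the
# first d-1 coordinates.
Sigma = np.diag(R) - np.outer(R, R)
Sigma_md = Sigma[: d - 1, : d - 1]

# Symmetric square root via the eigendecomposition.
w, V = np.linalg.eigh(Sigma_md)
Sigma_half = V @ np.diag(np.sqrt(w)) @ V.T
M = Sigma_half @ (np.eye(d - 1) + np.ones((d - 1, d - 1))) @ Sigma_half

s_alpha = np.sum(R * (1 - R))                        # claimed value of tr(M)
v_alpha = np.max(R[: d - 1]) + R[-1] * (1 - R[-1])   # claimed bound on ||M||_sp
print(np.trace(M), s_alpha)                          # equal up to rounding
print(np.linalg.norm(M, 2), "<=", v_alpha)           # spectral norm below the bound

# Decay of the tail bound with hypothetical a2, q_j and epsilon.
a2, q_j, eps = 0.1, 0.125, 0.05
C1 = np.exp(s_alpha / v_alpha)
C2 = a2 * eps**2 / (4 * q_j * v_alpha)
for N in [10**4, 10**5, 10**6]:
    print(N, C1 * np.exp(-N * C2))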

Appendix C Proof of Theorem 3

Recall that \hat{\pi}_{j,\ell}=N_{\sigma_{j},\ell}/N_{\sigma_{j}}. Denote the common transition probability estimate for the estimated group \hat{{\cal C}}_{\alpha}(\lambda) by

R^α,(λ)=σj𝒞^α(λ)Nσj,σj𝒞^α(λ)Nσj=N𝒞^α(λ),N𝒞^α(λ)α=1,,kλ;=1,,d.\hat{R}^{(\lambda)}_{\alpha,\ell}=\dfrac{\sum_{\sigma_{j}\in\hat{{\cal C}}_{\alpha}(\lambda)}N_{\sigma_{j},\ell}}{\sum_{\sigma_{j}\in\hat{{\cal C}}_{\alpha}(\lambda)}N_{\sigma_{j}}}=\dfrac{N_{\hat{{\cal C}}_{\alpha}(\lambda),\ell}}{N_{\hat{{\cal C}}_{\alpha}(\lambda)}}\quad\quad\forall\alpha=1,...,k_{\lambda};\ell=1,...,d.

Thus, the log-likelihood is given by

n(λ)=α=1kλ=1dN𝒞^α(λ),logR^α,(λ).\ell_{n}(\lambda)=\sum_{\alpha=1}^{k_{\lambda}}\sum_{\ell=1}^{d}N_{\hat{{\cal C}}_{\alpha}(\lambda),\ell}\log\hat{R}^{(\lambda)}_{\alpha,\ell}.
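For concreteness, the following Python sketch (with hypothetical transition counts and partitions; the function name is ours) computes the pooled estimates \hat{R}^{(\lambda)}_{\alpha,\ell}, the profile log-likelihood \ell_{n}(\lambda), and the BIC in the form -2\ell_{n}(\lambda)+k_{\lambda}(d-1)\log n that underlies the comparisons below:

import numpy as np

def profile_loglik_and_bic(counts, partition, n):
    """Pooled transition estimates, log-likelihood and BIC for a partition
    of the m-th order contexts.

    counts    : (#contexts) x d array with counts[j, l] = N_{sigma_j, l}
    partition : list of lists of context indices (the clusters C_alpha)
    n         : sample size entering the log n penalty
    """
    d = counts.shape[1]
    loglik = 0.0
    for cluster in partition:
        pooled = counts[cluster].sum(axis=0)       # N_{C_alpha, l}
        R_hat = pooled / pooled.sum()              # pooled estimate for the cluster
        nz = pooled > 0                            # avoid log 0 for empty cells
        loglik += np.sum(pooled[nz] * np.log(R_hat[nz]))
    k = len(partition)
    return loglik, -2.0 * loglik + k * (d - 1) * np.log(n)

# Hypothetical counts for four contexts over d = 3 symbols.
counts = np.array([[30, 10, 10],
                   [28, 12, 10],
                   [5, 25, 20],
                   [6, 24, 20]])
n = counts.sum()
coarse = [[0, 1], [2, 3]]        # contexts grouped into two clusters
fine = [[0], [1], [2], [3]]      # full m-th order Markov model
print(profile_loglik_and_bic(counts, coarse, n))
print(profile_loglik_and_bic(counts, fine, n))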

Note that, as \lambda increases, the number of clusters decreases. Moreover, by the continuity of the solution of (2.3) with respect to \lambda, M_{\lambda_{2}} is a submodel of M_{\lambda_{1}} for \lambda_{1}<\lambda_{2}, since the separate clusters obtained for smaller values of \lambda are merged into larger clusters as \lambda increases. Hence we can write M_{\lambda_{2}}\subseteq M_{\lambda_{1}}, and consequently \ell_{n}(\lambda_{1})\geq\ell_{n}(\lambda_{2}). Let q_{j} be the stationary probability of the state \sigma_{j}, and let Q^{(\alpha)}(\lambda) be the stationary probability of the partition element \hat{{\cal C}}_{\alpha}(\lambda), so that Q^{(\alpha)}(\lambda)=\sum_{\sigma_{j}\in\hat{{\cal C}}_{\alpha}(\lambda)}q_{j}. We have to show that the true model minimizes the BIC with probability tending to 1 as n\to\infty. We prove this in the following two cases.
Case 1: Suppose \lambda<\lambda_{0} and M_{\lambda_{0}}\subset M_{\lambda}. Clearly, k_{\lambda_{0}}<k_{\lambda}. Since, by Theorem 1, M_{\lambda_{0}} is the true underlying model, M_{\lambda_{0}}=\{{\cal C}_{1},\ldots,{\cal C}_{k_{0}}\}, and

Zn=2(n(λ0)n(λ))𝑑χ(kλk0)(d1)2.Z_{n}=-2\Big{(}\ell_{n}(\lambda_{0})-\ell_{n}(\lambda)\Big{)}\xrightarrow{d}\chi^{2}_{(k_{\lambda}-k_{0})(d-1)}.

Hence, as nn\to\infty,

P(BICn(λ0)BICn(λ))\displaystyle P\Big{(}BIC_{n}(\lambda_{0})\geq BIC_{n}(\lambda)\Big{)} =P(Zn>(kλk0)(d1)logn)\displaystyle=P\Big{(}Z_{n}>(k_{\lambda}-k_{0})(d-1)\log n\Big{)}
exp[(kλk0)(d1)4logn]\displaystyle\leq\exp\Big{[}-\dfrac{(k_{\lambda}-k_{0})(d-1)}{4}\log n\Big{]}
=n(kλk0)(d1)40.\displaystyle=n^{-\dfrac{(k_{\lambda}-k_{0})(d-1)}{4}}\to 0.
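The last display uses a standard chi-square tail bound, P(\chi^{2}_{k}\geq t)\leq\exp(-t/4) for t sufficiently large relative to k. The following sketch compares the exact tail at t=(k_{\lambda}-k_{0})(d-1)\log n with the resulting rate n^{-(k_{\lambda}-k_{0})(d-1)/4} for a few arbitrary choices of k_{\lambda}-k_{0}, d and n:

import numpy as np
from scipy.stats import chi2

# Compare the exact chi-square tail with the rate n^{-(k_lambda - k_0)(d-1)/4}.
d = 4
for k_extra in [1, 2, 4]:                 # k_lambda - k_0
    df = k_extra * (d - 1)
    for n in [10**2, 10**3, 10**4]:
        t = df * np.log(n)
        exact = chi2.sf(t, df)            # P(chi^2_df >= df log n)
        bound = n ** (-df / 4)
        print(df, n, exact, bound, exact <= bound)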

Case 2: Now let \lambda_{0}<\lambda and M_{\lambda}\subset M_{\lambda_{0}}. For \alpha^{\prime}=1,\ldots,k_{\lambda}, without loss of generality we can write

\hat{{\cal C}}_{\alpha^{\prime}}(\lambda)=\bigcup_{\alpha=t_{\alpha^{\prime}-1}+1}^{t_{\alpha^{\prime}}}{\cal C}_{\alpha}

for 0=t0<t1<t2<<tkλ=k00=t_{0}<t_{1}<t_{2}<...<t_{k_{\lambda}}=k_{0}. Now, as nn\to\infty,

1Nn(λ0)=1Nα=1k0=1dN𝒞α,logR^α,(λ0)\displaystyle\dfrac{1}{N}\ell_{n}(\lambda_{0})=\dfrac{1}{N}\sum_{\alpha=1}^{k_{0}}\sum_{\ell=1}^{d}N_{{\cal C}_{\alpha},\ell}\log\hat{R}^{(\lambda_{0})}_{\alpha,\ell} 𝑝α=1k0=1d(j𝒞αqσj)Rα,logRα,\displaystyle\xrightarrow[]{p}\sum_{\alpha=1}^{k_{0}}\sum_{\ell=1}^{d}\Big{(}\sum_{j\in{\cal C}_{\alpha}}q_{\sigma_{j}}\Big{)}R_{\alpha,\ell}\log R_{\alpha,\ell}
=α=1k0=1dQ(α)(λ0)Rα,logRα,=A0;\displaystyle=\sum_{\alpha=1}^{k_{0}}\sum_{\ell=1}^{d}Q^{(\alpha)}(\lambda_{0})R_{\alpha,\ell}\log R_{\alpha,\ell}=A_{0};

and

1Nn(λ)\displaystyle\dfrac{1}{N}\ell_{n}(\lambda) =1Nα=1kλ=1dN𝒞^α(λ),logR^α,(λ)=1Nα=1kλ=1d(j𝒞^α(λ)Nσj,)log(j𝒞^α(λ)Nσj,j𝒞^α(λ)Nσj)\displaystyle=\dfrac{1}{N}\sum_{\alpha^{\prime}=1}^{k_{\lambda}}\sum_{\ell=1}^{d}N_{\hat{{\cal C}}_{\alpha^{\prime}}(\lambda),\ell}\log\hat{R}^{(\lambda)}_{\alpha^{\prime},\ell}=\dfrac{1}{N}\sum_{\alpha^{\prime}=1}^{k_{\lambda}}\sum_{\ell=1}^{d}\Big{(}\sum_{j\in\hat{{\cal C}}_{\alpha^{\prime}}(\lambda)}N_{\sigma_{j},\ell}\Big{)}\log\Big{(}\dfrac{\sum_{j\in\hat{{\cal C}}_{\alpha^{\prime}}(\lambda)}N_{\sigma_{j},\ell}}{\sum_{j\in\hat{{\cal C}}_{\alpha^{\prime}}(\lambda)}N_{\sigma_{j}}}\Big{)}
=1Nα=1kλ=1d(α=tα1+1tαN𝒞α,)log(α=tα1+1tαN𝒞α,α=tα1+1tαN𝒞α)\displaystyle=\dfrac{1}{N}\sum_{\alpha^{\prime}=1}^{k_{\lambda}}\sum_{\ell=1}^{d}\Big{(}\sum_{\alpha=t_{{\alpha^{\prime}}-1}+1}^{t_{\alpha^{\prime}}}N_{{\cal C}_{\alpha},\ell}\Big{)}\log\Big{(}\dfrac{\sum_{\alpha=t_{{\alpha^{\prime}}-1}+1}^{t_{\alpha^{\prime}}}N_{{\cal C}_{\alpha},\ell}}{\sum_{\alpha=t_{\alpha^{\prime}-1}+1}^{t_{\alpha^{\prime}}}N_{{\cal C}_{\alpha}}}\Big{)}
𝑝α=1kλ=1d(α=tα1+1tαQ(α)(λ0)Rα,)log(α=tα1+1tαQ(α)(λ0)Rα,α=tα1+1tαQ(α)(λ0))=A(λ).\displaystyle\xrightarrow[]{p}\sum_{\alpha^{\prime}=1}^{k_{\lambda}}\sum_{\ell=1}^{d}\Big{(}\sum_{\alpha=t_{\alpha^{\prime}-1}+1}^{t_{\alpha^{\prime}}}Q^{(\alpha)}(\lambda_{0})R_{\alpha,\ell}\Big{)}\log\Big{(}\dfrac{\sum_{\alpha=t_{\alpha^{\prime}-1}+1}^{t_{\alpha^{\prime}}}Q^{(\alpha)}(\lambda_{0})R_{\alpha,\ell}}{\sum_{\alpha=t_{\alpha^{\prime}-1}+1}^{t_{\alpha^{\prime}}}Q^{(\alpha)}(\lambda_{0})}\Big{)}=A(\lambda).

Now, applying Jensen's inequality and using the strict convexity of -\log x, we obtain

A(λ)\displaystyle A(\lambda) ==1dα=1kλ(α=tα1+1tαQ(α)(λ0)Rα,)log(α=tα1+1tαQ(α)(λ0)α=tα1+1tαQ(α)(λ0)Rα,)\displaystyle=-\sum_{\ell=1}^{d}\sum_{\alpha^{\prime}=1}^{k_{\lambda}}\Big{(}\sum_{\alpha=t_{\alpha^{\prime}-1}+1}^{t_{\alpha^{\prime}}}Q^{(\alpha)}(\lambda_{0})R_{\alpha,\ell}\Big{)}\log\Big{(}\dfrac{\sum_{\alpha=t_{\alpha^{\prime}-1}+1}^{t_{\alpha^{\prime}}}Q^{(\alpha)}(\lambda_{0})}{\sum_{\alpha=t_{\alpha^{\prime}-1}+1}^{t_{\alpha^{\prime}}}Q^{(\alpha)}(\lambda_{0})R_{\alpha,\ell}}\Big{)}
==1dα=1kλ(α=tα1+1tαQ(α)(λ0)Rα,)log(α=tα1+1tαQ(α)(λ0)Rα,.(1/Rα,)α=tα1+1tαQ(α)(λ0)Rα,)\displaystyle=-\sum_{\ell=1}^{d}\sum_{\alpha^{\prime}=1}^{k_{\lambda}}\Big{(}\sum_{\alpha=t_{\alpha^{\prime}-1}+1}^{t_{\alpha^{\prime}}}Q^{(\alpha)}(\lambda_{0})R_{\alpha,\ell}\Big{)}\log\Big{(}\dfrac{\sum_{\alpha=t_{\alpha^{\prime}-1}+1}^{t_{\alpha^{\prime}}}Q^{(\alpha)}(\lambda_{0})R_{\alpha,\ell}.(1/R_{\alpha,\ell})}{\sum_{\alpha=t_{\alpha^{\prime}-1}+1}^{t_{\alpha^{\prime}}}Q^{(\alpha)}(\lambda_{0})R_{\alpha,\ell}}\Big{)}
<=1dα=1kλα=tα1+1tαQ(α)(λ0)Rα,log(1/Rα,)\displaystyle<-\sum_{\ell=1}^{d}\sum_{\alpha^{\prime}=1}^{k_{\lambda}}\sum_{\alpha=t_{\alpha^{\prime}-1}+1}^{t_{\alpha^{\prime}}}Q^{(\alpha)}(\lambda_{0})R_{\alpha,\ell}\log(1/R_{\alpha,\ell})
==1dα=1kλα=tα1+1tαQ(α)(λ0)Rα,logRα,=A0.\displaystyle=\sum_{\ell=1}^{d}\sum_{\alpha^{\prime}=1}^{k_{\lambda}}\sum_{\alpha=t_{\alpha^{\prime}-1}+1}^{t_{\alpha^{\prime}}}Q^{(\alpha)}(\lambda_{0})R_{\alpha,\ell}\log R_{\alpha,\ell}=A_{0}.

Hence, \dfrac{1}{N}(\ell_{n}(\lambda_{0})-\ell_{n}(\lambda))\xrightarrow{p}A_{0}-A(\lambda)>0, and P\Big{(}\dfrac{1}{N}(\ell_{n}(\lambda_{0})-\ell_{n}(\lambda))\geq\dfrac{1}{2}(A_{0}-A(\lambda))\Big{)}\to 1 as n\to\infty. Since \log n/N\to 0 as n\to\infty,

P(BICn(λ0)BICn(λ))\displaystyle P\Big{(}BIC_{n}(\lambda_{0})\geq BIC_{n}(\lambda)\Big{)} =P(2n(λ0)2n(λ)+(k0kλ)(d1)logn)\displaystyle=P\Big{(}2\ell_{n}(\lambda_{0})\leq 2\ell_{n}(\lambda)+(k_{0}-k_{\lambda})(d-1)\log n\Big{)}
=P\Big{(}\ell_{n}(\lambda_{0})-\ell_{n}(\lambda)\leq\dfrac{1}{2}(k_{0}-k_{\lambda})(d-1)\log n\Big{)}
=P\Big{(}\dfrac{1}{N}(\ell_{n}(\lambda_{0})-\ell_{n}(\lambda))\leq\dfrac{(k_{0}-k_{\lambda})(d-1)}{2}\,\dfrac{\log n}{N}\Big{)}
0.\displaystyle\to 0.
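To illustrate the strict inequality A_{0}>A(\lambda) driving Case 2, the following toy computation (illustrative transition vectors R_{\alpha} and stationary masses Q^{(\alpha)}(\lambda_{0}); the variable names are ours) merges two distinct clusters and shows that the limiting average log-likelihood strictly decreases:

import numpy as np

# Two true clusters with distinct transition vectors and stationary masses.
R = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.5, 0.3]])
Q = np.array([0.4, 0.6])

# A_0: limiting average log-likelihood of the true (unmerged) model.
A0 = np.sum(Q[:, None] * R * np.log(R))

# A(lambda): the same quantity after merging the two clusters into one.
num = (Q[:, None] * R).sum(axis=0)      # sum_alpha Q^(alpha) R_{alpha, l}
R_merged = num / Q.sum()
A_lam = np.sum(num * np.log(R_merged))

print(A0, A_lam, A0 - A_lam)            # A0 > A(lambda), strictly
assert A_lam < A0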

References

  • [1] Bühlmann, P. (2000). Model selection for variable length Markov chains and tuning the context algorithm. Annals of the Institute of Statistical Mathematics 52, 287–315.
  • [2] Bühlmann, P. and Wyner, A. J. (1999). Variable length Markov chains. The Annals of Statistics 27, 480–513.
  • [3] Chi, E. C. and Lange, K. (2015). Splitting methods for convex clustering. Journal of Computational and Graphical Statistics 24, 994–1013.
  • [4] García, J. E. and González-López, V. A. (2011). Minimal Markov models. In Fourth Workshop on Information Theoretic Methods in Science and Engineering.
  • [5] Hocking, T. D., Joulin, A., Bach, F. and Vert, J.-P. (2011). Clusterpath: an algorithm for clustering using convex fusion penalties. In Proceedings of the 28th International Conference on Machine Learning.
  • [6] Jääskinen, V., Xiong, J., Corander, J. and Koski, T. (2014). Sparse Markov chains for sequence data. Scandinavian Journal of Statistics 41, 639–655.
  • [7] Lindsten, F., Ohlsson, H. and Ljung, L. (2011). Just relax and come clustering! A convexification of k-means clustering. Technical Report, Department of Electrical Engineering, Automatic Control, Linköping University.
  • [8] Panahi, A., Dubhashi, D., Johansson, F. D. and Bhattacharyya, C. (2017). Clustering by sum of norms: Stochastic incremental algorithm, convergence and cluster recovery. In International Conference on Machine Learning, 2769–2777. PMLR.
  • [9] Pelckmans, K., De Brabanter, J., Suykens, J. A. K. and De Moor, B. (2005). Convex clustering shrinkage. In PASCAL Workshop on Statistics and Optimization of Clustering.
  • [10] Rissanen, J. (1983). A universal prior for integers and estimation by minimum description length. The Annals of Statistics 11, 416–431.
  • [11] Roos, T. and Yu, B. (2009a). Sparse Markov source estimation via transformed Lasso. In 2009 IEEE Information Theory Workshop on Networking and Information Theory, 241–245. IEEE.
  • [12] Roos, T. and Yu, B. (2009b). Estimating sparse models from multivariate discrete data via transformed Lasso. In 2009 Information Theory and Applications Workshop, 290–294. IEEE.
  • [13] Sun, D., Toh, K.-C. and Yuan, Y. (2021). Convex clustering: model, theoretical guarantee and efficient algorithm. Journal of Machine Learning Research 22(9).
  • [14] Xiong, J., Jääskinen, V. and Corander, J. (2016). Recursive learning for sparse Markov models. Bayesian Analysis 11, 247–263.
  • [15] Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68, 49–67.
  • [16] Zhu, C., Xu, H., Leng, C. and Yan, S. (2014). Convex optimization procedure for clustering: Theoretical revisit. Advances in Neural Information Processing Systems 27.