Fitting Sparse Markov Models to Categorical Time Series Using Regularization
Abstract
The major problem in fitting a higher order Markov model is the exponentially growing number of parameters. The most popular approach is the Variable Length Markov Chain (VLMC), which determines relevant contexts (recent pasts) of variable orders and forms a context tree. A more general approach is the Sparse Markov Model (SMM), in which all possible histories of a given order are partitioned into groups such that the transition probability vectors are identical for all histories belonging to a particular group. We develop an elegant method of fitting SMMs using convex clustering, which involves regularization. The regularization parameter is selected using the BIC criterion. Theoretical results establish the model selection consistency of our method for large sample sizes. Extensive simulation studies under different set-ups are presented to assess the performance of our method. We apply the method to classify genome sequences obtained from individuals infected by different viruses.
keywords:
1 Introduction
Let be a categorical time series with a finite state space . We suppose that the evolution of the time series follows an -th order Markov structure where
$$\mathcal{L}\left(X_t \mid X_{t-1}, X_{t-2}, \ldots\right) \;=\; \mathcal{L}\left(X_t \mid X_{t-1}, \ldots, X_{t-m}\right) \tag{1.1}$$
for some m, where for any random vectors X and Y defined on a common probability space, we write the conditional law of X given Y. Even when the alphabet is small, such as in applications involving binary chains or in genetics, the complexity of the model (1.1) increases fairly quickly, and the model may be difficult to estimate even for moderately large orders. In the absence of a parametric model specification, the number of free parameters associated with (1.1) grows geometrically fast in the order, with the base given by the size of the alphabet, that is, the number of elements in the state space.
Different dimension reduction strategies have been applied to reduce the model complexity in (1.1), such as Variable Length Markov Chains (VLMC) based on tree-structured conditioning sets. This idea was first introduced by Rissanen (1983)[10]; it determines relevant contexts (recent pasts) of variable orders and forms a context tree. In a VLMC, the order of dependence need not be a fixed number, but rather a function of the past values. In general, context tree models have a number of parameters proportional to the number of leaves in the context tree. That the number of leaves can take on arbitrary positive integer values for general context trees highlights the flexibility of a model with variable length contexts, and the fact that such models can lead to huge reductions in the number of parameters, especially when there is a long context in a single direction. A model of variable order allows for a better trade-off between the bias that arises from using contexts that are too short and the variance that increases with having many parameters, thus improving statistical inference. Bühlmann and Wyner (1999)[2] and Bühlmann (2000)[1] developed model selection strategies for, and studied the asymptotic behaviour of, Variable Length Markov Chains (VLMC).
Roos and Yu (2009a[11], b[12]) first pointed out that there can be relevant contexts that do not have the hierarchical structure of a context tree. Although those authors discussed the possibility of more general models, their analysis was limited to a special case. More recently, researchers have studied the sparse model, which is posed in terms of a general partition of the set of all tuples of the maximal order of Markovian dependence. Sparse Markov Models (SMMs) introduce a sparse parametrization based on an unknown grouping of the relevant higher order histories. This generalization was first proposed by García and González-López (2010)[4], who named it Minimal Markov Models. Later, Jääskinen et al. (2014)[6] developed Bayesian predictive methods to analyze sequence data using SMMs. Xiong et al. (2016)[14] extended that work and introduced a recursive algorithm for optimizing the partition of an SMM. In this paper, we consider SMMs in their full generality, allowing an arbitrary and unknown number of groups. Specifically, consider a partition of the set of all histories of the maximal order. Then, the Markov chain in (1.1) is a member of the SMM class with that many groups if it satisfies the sparsity condition:
$$\mathcal{L}\left(X_t \mid (X_{t-m}, \ldots, X_{t-1}) = u\right) \;=\; \mathcal{L}\left(X_t \mid (X_{t-m}, \ldots, X_{t-1}) = v\right) \quad \text{for all } u, v \in C_j \tag{1.2}$$
for each set in the partition. Thus, for each group, the transition probability remains unchanged over all m-step histories lying in that set. This reduces the number of unknown probability parameters to the number of groups times the number of free probabilities per group. However, both the number of sets in the partition and the sets themselves are unknown, and they must be estimated from the data in order to fit the SMM.
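As a concrete illustration of the reduction (using the standard parameter count for a full chain, so the numbers are indicative rather than taken from the paper): with the four-letter DNA alphabet, a full third-order chain has 4^3 × (4 − 1) = 192 free transition probabilities, whereas an SMM that pools the 64 possible triplets into, say, 4 groups has only 4 × 3 = 12.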
The rest of the paper is organised as follows. In section (2), we discuss the methodology of fitting SMMs using regularization in detail. Section (3) presents the theoretical properties which ensure the model selection consistency of our method. In section (4), we numerically illustrate our methodology through extensive simulation studies. A real data analysis involving virus classification is presented in section (5). We conclude the paper with some remarks in section (6). The proofs of the theoretical results are provided in the appendix.
2 Methodology
2.1 Notation
Let be the set of all positive integers. Let and , for , . Write for an ordered (finite) sequence of -elements of length . Let denote the (ordered) concatenation of and . Write and so that . Let where denotes the indicator function. For any and , define , . In particular, denotes the number of times the chain hits the -tuple , and is the number of transitions from to .
Next we define the probabilities associated with the SMM (1.2). For and , let
and be the corresponding transition probability vector. Note that by the SMM property, for any group, the transition probability is constant over all histories belonging to that group. However, we do not know the sets, and one of the challenges of fitting an SMM to a dataset is to be able to identify them. To that end, define nonparametric estimators of the transition probabilities using their empirical versions:
where the count in the denominator is the total number of occurrences of the corresponding m-th order history in the observed variables.
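A minimal sketch of this empirical estimation step is given below (in Python); the function and variable names are ours and not from the paper, and the uniform fallback for unseen histories is purely illustrative.

```python
import numpy as np
from itertools import product
from collections import Counter

def empirical_transitions(x, m, alphabet):
    """Count occurrences N(u) of each m-tuple u and transitions N(u, a),
    and form the empirical transition probabilities N(u, a) / N(u)."""
    transition_counts = Counter()   # N(u, a): history u followed by symbol a
    history_counts = Counter()      # N(u): occurrences of history u
    for t in range(m, len(x)):
        u = tuple(x[t - m:t])
        history_counts[u] += 1
        transition_counts[(u, x[t])] += 1
    p_hat = {}
    for u in product(alphabet, repeat=m):
        n_u = history_counts[u]
        if n_u > 0:
            p_hat[u] = np.array([transition_counts[(u, a)] / n_u for a in alphabet])
        else:
            # uniform fallback for histories never observed in the sample
            p_hat[u] = np.full(len(alphabet), 1.0 / len(alphabet))
    return p_hat, history_counts, transition_counts

# Example on a short binary chain with order m = 2
p_hat, N, counts = empirical_transitions([0, 1, 1, 0, 1, 0, 0, 1, 1, 1], m=2, alphabet=[0, 1])
```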
Here we propose a new approach to fitting the SMM based on regularization.
2.2 Description of the Method/Algorithm
Consider the penalized criterion function
$$\frac{1}{2}\sum_{u \in \mathcal{A}^m} \big\| \boldsymbol{\theta}_u - \hat{\mathbf{p}}_u \big\|_2^2 \;+\; \lambda \sum_{u < v} w_{uv}\, d\!\left(\boldsymbol{\theta}_u, \boldsymbol{\theta}_v\right) \tag{2.3}$$
over probability vectors (one for each m-tuple) lying in the probability simplex, where the penalty parameter is positive, the weights are nonnegative, and the distance is a measure between two probability vectors of the same dimension. Thus, (2.3) treats the estimators as (correlated) “observations” and penalizes the distance between all distinct pairs of the probability vectors in order to identify the identical ones. In particular, the number of pairwise penalty terms grows approximately quadratically in the number of “observations”, and with a suitable choice of the penalization term one can identify the identical probability vectors. When the distance is the (unsquared) Euclidean norm of the difference, (2.3) gives a version of the Group LASSO of Yuan and Lin (2006)[15] that is designed for selecting pairs of full vectors that are close, and we have a convex optimization problem that can be solved even for large problem sizes. On the other hand, if we use the ℓ1 distance, then only componentwise zero differences can be identified.
Once we minimize the criterion function in (2.3), it is a relatively easy task to find estimates of the number of groups and of the sets themselves. Specifically, we start with a pair having the smallest estimated distance and seek all indices for which the distance between the corresponding solutions is zero. Then, we set the first estimated group to be the set consisting of the starting pair and all such indices. In the next step, we consider all pairs that are not in this group and repeat the procedure until all pairs with estimated zero distances have been grouped. In case there are indices for which none of the estimated pairwise distances are zero, we keep them as singletons, that is, groups consisting of a single element. This gives the estimated groups, whose number provides an estimate of the number of groups in the SMM.
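The grouping step can be sketched as follows (our own illustrative code, not the authors' implementation); it merges all fitted centroids whose pairwise distances are numerically zero using a simple union–find.

```python
import numpy as np

def extract_groups(theta, tol=1e-6):
    """Group histories whose fitted centroids theta[u] coincide up to `tol`,
    by taking the transitive closure of all (near-)zero pairwise distances."""
    keys = list(theta)
    parent = {u: u for u in keys}

    def find(u):  # union-find with path compression
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for i, u in enumerate(keys):
        for v in keys[i + 1:]:
            if np.linalg.norm(theta[u] - theta[v]) < tol:
                parent[find(u)] = find(v)   # merge the two groups

    groups = {}
    for u in keys:
        groups.setdefault(find(u), []).append(u)
    return list(groups.values())
```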
Traditional clustering methodologies like k-means have many limitations. In most cases, we have to pre-specify the number of clusters, and there is the possibility of ending up at a local minimum instead of the global one. The advantage of clustering by solving equation (2.3) over a range of penalty values is that we obtain a solution path, running from all singleton clusters to a single cluster containing all the elements. Subsequently, we can fix a criterion function which enables us to find the optimal cluster assignment among all models in the solution path. Hence, not only do we avoid fixing the number of clusters beforehand, we also avoid the problem of getting stuck at a local minimum. Several efficient algorithms have been developed in recent years to solve equation (2.3) when the penalty function is convex, e.g., a norm-based fusion penalty. Pelckmans et al. (2005)[9], Lindsten et al. (2011)[7], Hocking et al. (2011)[5] and others proposed this formulation and established it to be more robust and scalable than traditional approaches. Lindsten et al. (2011)[7] used the off-the-shelf convex solver CVX to solve the convex clustering problem, which suffers from scalability issues. Theoretical perfect cluster recovery conditions were derived by Zhu et al. (2014)[16] only for two clusters, while Panahi et al. (2017)[8] derived perfect recovery conditions for general clusters, but under the assumption of uniform weights. Sun et al. (2021)[13] provide sufficient conditions for recovery under more general weight choices. They also developed a faster algorithm, a semismooth Newton based augmented Lagrangian method (SSNAL), and derived convergence criteria for it.
Chi and Lange (2015)[3] developed an elegant method of solving the convex clustering problem via augmented Lagrangian methods. For a norm-based choice of the distance, we first view solving equation (2.3) as the following constrained optimization problem:
$$\text{minimize} \quad \frac{1}{2}\sum_{u \in \mathcal{A}^m} \big\| \boldsymbol{\theta}_u - \hat{\mathbf{p}}_u \big\|_2^2 \;+\; \lambda \sum_{(u,v) \in \mathcal{E}} w_{uv}\, \big\| \mathbf{v}_{uv} \big\|_2 \tag{2.4}$$
$$\text{subject to} \quad \boldsymbol{\theta}_u - \boldsymbol{\theta}_v - \mathbf{v}_{uv} = \mathbf{0} \quad \text{for all } (u,v) \in \mathcal{E},$$
where the constraint runs over the set of all distinct edges (pairs of histories). Here, a new splitting variable has been introduced to capture the difference between the group centroids, which makes the optimization procedure much easier. Chi and Lange (2015)[3] used two algorithms for solving this constrained optimization problem, namely ADMM and AMA. For both algorithms, one first has to form the augmented Lagrangian as follows:
$$\mathcal{L}_{\nu}(\boldsymbol{\Theta}, \mathbf{V}, \boldsymbol{\Lambda}) = \frac{1}{2}\sum_{u \in \mathcal{A}^m} \big\| \boldsymbol{\theta}_u - \hat{\mathbf{p}}_u \big\|_2^2 + \lambda \sum_{(u,v) \in \mathcal{E}} w_{uv} \big\| \mathbf{v}_{uv} \big\|_2 + \sum_{(u,v) \in \mathcal{E}} \Big\langle \boldsymbol{\lambda}_{uv},\, \mathbf{v}_{uv} - \boldsymbol{\theta}_u + \boldsymbol{\theta}_v \Big\rangle + \frac{\nu}{2} \sum_{(u,v) \in \mathcal{E}} \big\| \mathbf{v}_{uv} - \boldsymbol{\theta}_u + \boldsymbol{\theta}_v \big\|_2^2 \tag{2.5}$$
where the three matrices collect the centroid variables, the splitting variables and the dual variables in their columns, respectively. Splitting the variables in this fashion allows us to update them sequentially, given the other variables. The convergence of ADMM does not depend on the choice of the augmented Lagrangian parameter; it converges for any positive value. On the other hand, AMA converges only for sufficiently small values. The performance of both algorithms has been compared, and it has been established that AMA is much faster than ADMM, especially when the weights are sparse. The major advantage of AMA lies in its inherent structure. After some basic algebra, Chi and Lange (2015)[3] proved that we only need to update the centroid and dual variables at every step, and we can bypass updating the splitting variables by applying a linear relationship among the variables. Since AMA is much faster, we use this method in our analysis. Given the parameter values at the current step, the updates in the next step are computed using explicit relations, described next.
In these relations, the dual update is a gradient step followed by a projection onto a ball whose radius is determined by the penalty parameter and the corresponding weight, and the centroids are then available in closed form from the duals. We continue until convergence. The convergence criterion is discussed in detail in Chi and Lange (2015)[3], using the dual problem and the duality gap.
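The following is a schematic sketch of an AMA-style iteration of this kind (our own code, assuming the squared Euclidean fidelity term and the fusion penalty as written in (2.4); the simplex constraint is not enforced explicitly, in the spirit of Theorem 1 below). Names and defaults are illustrative, not the authors' implementation.

```python
import numpy as np

def ama_convex_clustering(P_hat, weights, lam, nu=0.1, max_iter=1000, tol=1e-8):
    """Schematic AMA iteration for
        min 0.5 * sum_u ||theta_u - p_hat_u||^2 + lam * sum_{(i,j)} w_ij ||theta_i - theta_j||_2.
    P_hat: (K, d) array of empirical transition probability vectors.
    weights: dict {(i, j): w_ij} over edges i < j with positive weight.
    nu: dual step size; must be small enough for convergence."""
    edges = list(weights)
    dual = {e: np.zeros(P_hat.shape[1]) for e in edges}   # one dual vector per edge

    def primal_from_dual():
        # the primal centroids are available in closed form given the duals
        theta = P_hat.copy()
        for (i, j), lmbd in dual.items():
            theta[i] += lmbd
            theta[j] -= lmbd
        return theta

    for _ in range(max_iter):
        theta = primal_from_dual()
        max_change = 0.0
        for (i, j) in edges:
            # gradient step on the dual, then projection onto the ball of
            # radius lam * w_ij (the dual constraint of the l2 fusion penalty)
            g = dual[(i, j)] - nu * (theta[i] - theta[j])
            radius = lam * weights[(i, j)]
            norm_g = np.linalg.norm(g)
            new = g if norm_g <= radius else g * (radius / norm_g)
            max_change = max(max_change, np.linalg.norm(new - dual[(i, j)]))
            dual[(i, j)] = new
        if max_change < tol:
            break
    return primal_from_dual()
```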
2.3 Selection of the Tuning Parameter
So far, we have discussed numerical methods to solve (2.3) for a given value of the tuning parameter. But it is important to choose an optimal value of this parameter for the optimization problem. In this section, we propose a data-driven method to select the tuning parameter using the BIC criterion. For a given value of the tuning parameter, denote the obtained clusters and their number accordingly. Define the common transition probability for the m-tuples in each estimated group as
The log-likelihood of the observations under the obtained cluster assignment for a particular is given by
Hence, the BIC score corresponding to the obtained model is
By a grid search over a range of possible values, we select the tuning parameter for which the BIC is minimized. The solution of equation (2.3) corresponding to that value is taken as the estimated cluster assignment. In the next section, we provide theoretical results which demonstrate that, for a range of tuning parameter values, we are able to perfectly recover the true clusters for large sample sizes.
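A minimal sketch of the BIC computation and grid search is given below (our own illustrative code; it assumes the alphabet is encoded as 0, …, n_states − 1, uses the pooled maximum likelihood transition probabilities within each estimated group, and counts n_states − 1 free parameters per group — names are not from the paper).

```python
import numpy as np

def bic_for_partition(transition_counts, history_counts, groups, n, n_states):
    """BIC of the SMM whose histories are pooled according to `groups`:
    -2 * (pooled log-likelihood) + (#groups * (n_states - 1)) * log(n)."""
    log_lik = 0.0
    for group in groups:
        n_group = sum(history_counts[u] for u in group)
        for a in range(n_states):
            n_ua = sum(transition_counts.get((u, a), 0) for u in group)
            if n_ua > 0:
                log_lik += n_ua * np.log(n_ua / n_group)   # pooled MLE within the group
    n_params = len(groups) * (n_states - 1)
    return -2.0 * log_lik + n_params * np.log(n)

def select_tuning_parameter(lambda_grid, fit_fn, transition_counts, history_counts, n, n_states):
    """Grid search: fit the convex clustering solution for each lambda,
    extract its partition, and keep the lambda with the smallest BIC."""
    best = None
    for lam in lambda_grid:
        groups = fit_fn(lam)   # e.g. AMA fit followed by group extraction
        bic = bic_for_partition(transition_counts, history_counts, groups, n, n_states)
        if best is None or bic < best[1]:
            best = (lam, bic, groups)
    return best   # (lambda, BIC, estimated partition)
```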
3 Conditions and Theoretical Results
3.1 Conditions
Consider solving equation (2.3) with a fixed choice of distance and weights, and denote the optimal solution for each m-tuple accordingly. Let the true partition of the state space be as above, with the corresponding transition probability vectors distinct across groups, and denote the size of the true partition. Following the notation of Sun et al. (2021)[13], define
Suppose the following conditions hold.
(A1)
and for any , .
(A2) , and .
In other words, condition (A1) imposes symmetry on the weight choices, and ideally the weight should be positive between two m-tuples belonging to the same partition set. On the other hand, (A2) determines a lower bound for the weight between two m-tuples in a particular group. We will use these conditions to prove our results.
Before going into the main results, let us recall a few other results as follows.
Proposition 1 (Consistency of Estimated Transition Probabilities): Suppose is an SMM of order , with the partition . Then, as ,
(a) for ;
(b) , where is the stationary probability of the state ;
(c)
where . Since is of rank , the asymptotic Normal distribution is singular.
Proposition 2 (Sun et al. (2021)): Suppose the above conditions hold, and . Then for any , for ; and for any . In other words, for any , we recover the true partition of the state space.
Thus, Proposition 1 tells us about the weak consistency and the CLT for the transition probability vectors. Proposition 2 deals with perfect recovery conditions under general weight choices, derived in Sun et al. (2021). These propositions will be the key tools in proving our results. In the next section, we provide our major theoretical findings.
3.2 Main results
Our first result ensures that the solution of the objective function in (2.4) is a valid probability distribution over the state space. Note that we do not use this solution as our estimated transition probability for a group. However, if under exceptional circumstances the solution exhibits oracle properties, one can use these solutions as the common transition probabilities for the partition sets.
Theorem 1: For any , the optimal solution is a valid probability distribution for ; i.e.
(a) for ,
(b) .
Next, we would like to derive the probability of true cluster recovery. There are two steps involved in this process. First, we need the true model to appear in the solution path over varying values of the tuning parameter, which requires the conditions of Proposition 2 to be satisfied. From Theorem 2, we get a lower bound on the probability that the true model is present in the solution path. This tells us that, as the sequence length increases, the probability of failing to achieve perfect recovery converges to zero at an exponential rate.
Theorem 2:
Define
Then, under the above conditions (A1) and (A2), as ,
for , and for some constants .
Once we have the true model in the solution path, the next step is to establish that the probability of selecting that model through the BIC criterion converges to one as the sample size grows. The next theorem proves this statement.
Theorem 3: Suppose the conditions of Theorem (2) hold, and . For any , denote the clustering assignment obtained by minimizing equation (2.3) as , where is the number of clusters. Suppose is the log-likelihood of the observations corresponding to the cluster assignment , and the corresponding BIC score is . Choose some . Then, for any such that ,
as .
Next, we provide some sufficient conditions for perfect cluster recovery under a particular weight choice based on Gaussian kernels. These weights have been used in previous analyses of convex clustering and have been shown to produce good clustering results.
Theorem 4: Define . Suppose the following assumptions hold.
(a) , where is the indicator function that belongs to nearest neighbour of or vice versa, for some .
(b) .
(c) For some ,
, .
Under the above assumptions, the conditions (A1) and (A2) are satisfied. Moreover, ,
Corollary 1: Under the assumptions of Theorem 3,
(a) , .
(b) .
Corollary 2: For balanced design, i.e. when are the same for all groups , if . Hence, for any , perfect recovery is possible for .
We omit the proof of this theorem as it is just an algebraic derivation of the terms defined earlier for a particular weight choice. Corollary 1 gives us an idea of how close the empirical transition probabilities for each m-tuple should be to the true probabilities. Corollary 2 covers the special case of a balanced design. In that scenario, for the correct choice of the number of nearest neighbours, the true model can be retrieved for a wide range of the tuning parameter, yielding excellent clustering results. We will show in the simulation studies how the weight choices affect the clustering accuracy.
4 Simulation
In this section, we numerically demonstrate the performance of the convex clustering methodology described in the previous section in terms of recovering the true cluster assignments. We emphasize mostly the choice of the weights, and compare the clustering performance for different weight choices. For illustration, we consider SMMs of various orders, sequence lengths and parameter settings. Note that we do not pre-specify the number of clusters or the cluster labels in our method, so it is not feasible to compute a straightforward misclassification rate against the truth. Instead, we use the widely used Rand Index (RI) and Adjusted Rand Index (ARI) to measure the similarity between the true and the estimated cluster assignments.
The Rand Index computes the proportion of pairs which are correctly identified as belonging to the same cluster or to different clusters. Mathematically, for any two cluster assignments of the same set of elements, the Rand Index is defined by
where the four counts are, respectively, the number of pairs placed in the same cluster by both assignments, the number of pairs placed in different clusters by both assignments, the number of pairs placed in the same cluster by the first assignment but in different clusters by the second, and the number of pairs placed in the same cluster by the second assignment but in different clusters by the first. Values of the Rand Index lie between zero and one; if two clusterings are identical, the index equals one, and higher values indicate greater similarity between the two clusterings.
However, the Rand Index has some limitations. For example, if the number of clusters increases and the cluster sizes are not large, the index can be close to one even for two completely different cluster assignments. To address this issue, the Adjusted Rand Index (ARI) is preferred. The ARI is a corrected version of the usual Rand Index which uses the expected similarity of all pairwise comparisons between clusterings under a random model. With the pair counts defined as above, the ARI is computed by the following formula:
For each of our simulation studies, we compare the estimated cluster assignment, obtained by solving equation (2.3) with an appropriate regularization parameter, with the true clustering using both RI and ARI. In particular, we focus on the weight choices which result in higher values of these indices. Chi and Lange (2015) and Sun et al. (2021) have shown, both numerically and theoretically, that choosing sparse weights substantially improves the clustering quality as well as making the algorithm much faster. In our study, we perform the clustering under different weight choices, both sparse and dense, and compare how the index values depend on that choice. In each set-up, we replicate the experiment a fixed number of times to obtain the mean RI and ARI and their standard errors. We also compute the proportion of times the RI or the ARI equals one, i.e., we empirically estimate the probability of perfect cluster recovery.
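Both indices are available in standard libraries; a minimal sketch (assuming scikit-learn is installed, with toy labels of our own) is:

```python
from sklearn.metrics import rand_score, adjusted_rand_score

# true_labels[i] and est_labels[i] are the group labels of the i-th m-tuple (toy example)
true_labels = [0, 0, 1, 1, 2, 2]
est_labels  = [0, 0, 1, 2, 2, 2]

ri  = rand_score(true_labels, est_labels)            # proportion of concordant pairs
ari = adjusted_rand_score(true_labels, est_labels)   # chance-corrected version
print(ri, ari)
```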
4.1 Simulation Set-up 1
Here, we take the four-letter alphabet, the usual scenario when analyzing DNA sequences. The order of the chain is taken to be each of two values. For the smaller order, we equally divide all the tuples into groups of equal size; for order three, we divide the triplets into groups of equal size. For a particular group, we generate a set of Dirichlet parameters independently, and the common transition probability of that group is generated from a Dirichlet distribution with those parameters. As for the weights, we first take a constant (uniform) weight for all pairs. Next, we choose sparse weights depending on the distance between the estimated transition probabilities. We use the k-nearest-neighbour based weights proposed in Chi and Lange (2015), in which a Gaussian kernel of the distance is multiplied by the indicator that one history belongs to the k nearest neighbours of the other or vice versa. We also incorporate a third choice of weight with a similar indicator function, but with the nearest neighbours computed w.r.t. a different distance. We use two different values of the number of nearest neighbours. The results are provided in the tables that follow.
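Before the results, here is a sketch of one such sparse weight construction (our own code, following the general k-nearest-neighbour, Gaussian-kernel form; the kernel parameter `phi` and the alternative metric are illustrative assumptions):

```python
import numpy as np

def knn_kernel_weights(P_hat, k, phi, metric="l2"):
    """w_uv = exp(-phi * d(p_hat_u, p_hat_v)^2) if u is among the k nearest
    neighbours of v or vice versa, and 0 otherwise."""
    K = P_hat.shape[0]
    diff = P_hat[:, None, :] - P_hat[None, :, :]
    if metric == "l2":
        D = np.linalg.norm(diff, axis=2)
    else:                       # e.g. a max-norm alternative for the neighbour search
        D = np.max(np.abs(diff), axis=2)
    nn = np.argsort(D, axis=1)[:, 1:k + 1]     # k nearest neighbours, excluding self
    is_nn = np.zeros((K, K), dtype=bool)
    for i in range(K):
        is_nn[i, nn[i]] = True
    mask = is_nn | is_nn.T                     # "or vice versa"
    W = np.exp(-phi * D**2) * mask
    np.fill_diagonal(W, 0.0)
    return W
```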
| Sample Size (n) | 5000 | 10000 | 15000 | 20000 | 25000 |
|---|---|---|---|---|---|
| RI (s.e.) | 0.769 (0.03) | 0.809 (0.03) | 0.819 (0.023) | 0.833 (0.02) | 0.842 (0.02) |
| ARI (s.e.) | 0.015 (0.09) | 0.141 (0.15) | 0.169 (0.12) | 0.239 (0.12) | 0.295 (0.12) |
| Prob. of True Recovery | 0 | 0 | 0 | 0 | 0 |

nearest neighbour

| Sample Size (n) | 5000 | 10000 | 15000 | 20000 | 25000 |
|---|---|---|---|---|---|
| RI (s.e.) | 0.900 (0.09) | 0.981 (0.04) | 0.994 (0.02) | 0.997 (0.01) | 0.999 (0.01) |
| ARI (s.e.) | 0.745 (0.19) | 0.946 (0.10) | 0.982 (0.05) | 0.992 (0.04) | 0.997 (0.02) |
| Prob. of True Recovery | 0.223 | 0.708 | 0.876 | 0.951 | 0.977 |

nearest neighbour

| Sample Size (n) | 5000 | 10000 | 15000 | 20000 | 25000 |
|---|---|---|---|---|---|
| RI (s.e.) | 0.945 (0.07) | 0.994 (0.02) | 0.999 (0.01) | 0.999 (0.005) | 1 (0.003) |
| ARI (s.e.) | 0.851 (0.17) | 0.983 (0.06) | 0.995 (0.03) | 0.998 (0.02) | 0.999 (0.01) |
| Prob. of True Recovery | 0.480 | 0.908 | 0.972 | 0.991 | 0.996 |

nearest neighbour

| Sample Size (n) | 5000 | 10000 | 15000 | 20000 | 25000 |
|---|---|---|---|---|---|
| RI (s.e.) | 0.889 (0.10) | 0.975 (0.04) | 0.992 (0.02) | 0.996 (0.01) | 0.998 (0.01) |
| ARI (s.e.) | 0.720 (0.20) | 0.928 (0.12) | 0.974 (0.06) | 0.989 (0.04) | 0.994 (0.03) |
| Prob. of True Recovery | 0.187 | 0.644 | 0.821 | 0.922 | 0.960 |

nearest neighbour

| Sample Size (n) | 5000 | 10000 | 15000 | 20000 | 25000 |
|---|---|---|---|---|---|
| RI (s.e.) | 0.949 (0.07) | 0.995 (0.02) | 0.999 (0.01) | 0.999 (0.005) | 1 (0.004) |
| ARI (s.e.) | 0.860 (0.17) | 0.984 (0.05) | 0.995 (0.03) | 0.998 (0.02) | 0.999 (0.01) |
| Prob. of True Recovery | 0.501 | 0.908 | 0.973 | 0.990 | 0.996 |
| Sample Size (n) | 5000 | 10000 | 15000 | 20000 | 25000 |
|---|---|---|---|---|---|
| RI (s.e.) | 0.866 (0.03) | 0.901 (0.04) | 0.933 (0.04) | 0.954 (0.03) | 0.969 (0.03) |
| ARI (s.e.) | 0.199 (0.18) | 0.469 (0.26) | 0.651 (0.22) | 0.777 (0.17) | 0.854 (0.15) |
| Prob. of True Recovery | 0 | 0.003 | 0.017 | 0.042 | 0.128 |

nearest neighbour

| Sample Size (n) | 5000 | 10000 | 15000 | 20000 | 25000 |
|---|---|---|---|---|---|
| RI (s.e.) | 0.982 (0.024) | 0.995 (0.007) | 0.997 (0.005) | 0.999 (0.002) | 1 (0.001) |
| ARI (s.e.) | 0.935 (0.081) | 0.980 (0.027) | 0.990 (0.019) | 0.997 (0.009) | 0.999 (0.004) |
| Prob. of True Recovery | 0.264 | 0.466 | 0.714 | 0.907 | 0.981 |

nearest neighbour

| Sample Size (n) | 5000 | 10000 | 15000 | 20000 | 25000 |
|---|---|---|---|---|---|
| RI (s.e.) | 0.984 (0.024) | 0.994 (0.007) | 0.997 (0.005) | 0.999 (0.002) | 1 (0.001) |
| ARI (s.e.) | 0.940 (0.074) | 0.979 (0.026) | 0.990 (0.019) | 0.997 (0.009) | 0.999 (0.004) |
| Prob. of True Recovery | 0.223 | 0.408 | 0.713 | 0.908 | 0.981 |

nearest neighbour

| Sample Size (n) | 5000 | 10000 | 15000 | 20000 | 25000 |
|---|---|---|---|---|---|
| RI (s.e.) | 0.981 (0.025) | 0.995 (0.009) | 0.998 (0.005) | 0.999 (0.002) | 1 (0.001) |
| ARI (s.e.) | 0.931 (0.085) | 0.980 (0.033) | 0.990 (0.018) | 0.997 (0.009) | 0.999 (0.004) |
| Prob. of True Recovery | 0.294 | 0.508 | 0.735 | 0.904 | 0.980 |

nearest neighbour

| Sample Size (n) | 5000 | 10000 | 15000 | 20000 | 25000 |
|---|---|---|---|---|---|
| RI (s.e.) | 0.983 (0.021) | 0.994 (0.007) | 0.997 (0.005) | 0.999 (0.002) | 1 (0.001) |
| ARI (s.e.) | 0.937 (0.076) | 0.977 (0.029) | 0.990 (0.019) | 0.997 (0.009) | 0.999 (0.004) |
| Prob. of True Recovery | 0.206 | 0.387 | 0.709 | 0.907 | 0.981 |
From this study, it is clear that, for both orders, the choice of uniform weights results in very poor performance in terms of recovering the true clusters. Although the index values increase with increasing sample size, we might need a very large sample to get good results. On the other hand, choosing the weights using nearest neighbours performs much better in terms of both ARI and perfect recovery, especially for smaller sample sizes. Note that the model is balanced in this example, so the optimum choice of the number of nearest neighbours follows from Corollary 1, and our simulation study corroborates this fact. For the other order the optimum choice of the number of nearest neighbours differs, but the alternative choice is reliable as well; a choice that makes the weights too sparse slightly degrades the clustering accuracy. For both orders, the probability of true recovery increases with increasing sample size.
This experiment yields quite good results for large sample sizes. It is thus worth investigating under what circumstances we can obtain very good recovery for smaller sample sizes. The theoretical results suggest that if the cluster centroids are well separated, clustering performance improves even for smaller samples. In this experiment, the minimum between-group centroid differences (in the two distances considered) were relatively small for both orders. In the next simulation study, we demonstrate how the clustering accuracy improves for well-separated centroids.
4.2 Simulation Set-up 2
In this study as well, we take the four-letter alphabet and order three. We divide the triplets into four groups of specified sizes, with the four common transition probability vectors chosen to be well separated. As the choice of weights, we first use the nearest-neighbour weights in the Gaussian kernel, as before. The second choice of weight uses a different distance to find the nearest neighbours and, instead of the Gaussian kernel, a natural exponential decay. The third weight is similar to the second one, except that yet another distance is used. Here we use relatively small samples, of sizes 1000 and 2000. The results are reported in the following table.
| Weight Choice | Sample Size (n) | RI | ARI | Prob. of True Recovery |
|---|---|---|---|---|
| Distance, Gaussian Kernel | 1000 | 0.940 (0.022) | 0.816 (0.073) | 0 |
| | 2000 | 0.984 (0.012) | 0.954 (0.034) | 0.14 |
| Distance, Exponential Kernel | 1000 | 0.969 (0.020) | 0.908 (0.059) | 0.104 |
| | 2000 | 0.994 (0.009) | 0.983 (0.025) | 0.638 |
| Distance, Exponential Kernel | 1000 | 0.965 (0.020) | 0.893 (0.060) | 0.03 |
| | 2000 | 0.993 (0.009) | 0.979 (0.025) | 0.468 |
From the experiment, we can infer that the weight involving the Gaussian kernel performs poorly compared to the other two choices. In particular, using the distance that measures the maximum element-wise difference between two estimated transition probability vectors is very effective in such scenarios; in this way, we are able to separate two vectors which are not likely to be in the same cluster. From this study, we can fairly conclude that when the centroids are well separated in such a way that their difference is large in a small number of coordinates, use of this maximum element-wise distance can provide the best possible result.
5 Real Data Analysis
A popular application of higher order Markov models is the analysis of DNA sequences. Scientists have used ordinary Markov chains, VLMCs, hidden Markov models and many other tools to fit such sequences for prediction, classification, gene identification, etc. Here, we use the proposed SMM methodology to classify viruses from samples collected from human beings. We consider four different viruses in our study: SARS-CoV-2 (COVID-19), MERS (Middle East Respiratory Syndrome), Dengue and Hepatitis B. We collected samples of individuals affected by SARS-CoV-2, MERS, Dengue and Hepatitis B from the NCBI database, over different time periods and locations. We were particularly careful in collecting the COVID-19 samples so as to incorporate different strains of the disease. To ensure this, we used samples from each of four different time frames: April 2020, September 2020, January 2021 and April 2021. The time frames were selected on the basis of the spread of certain strains or the peaks of COVID-19 cases.
The NCBI database contains a reference genome sequence for every virus. These reference sequences represent an idealized genome structure of a particular virus species. Note that very minimal changes in the nucleotide sequence can lead to a very different strain of the same disease. The collected samples, if fully available, are almost identical to the reference sequence. In that case there is no need to fit models: we can simply match the collected sample against the available genome sequences and identify the virus whose reference genome is most similar. But in practice there are many occasions where the full sequence is not available; perhaps only a part can be retrieved, or some portion of the data is lost. The challenge lies in that scenario, and we should still be able to identify the correct virus as often as possible. In this experiment, we first build a reference model from the reference sequence of each virus. Next, we randomly select a contiguous segment of the genome sequence of each sample, and then compute the likelihood of that segment under each of the reference models. We then classify the segment to the reference model under which its likelihood is the highest.
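A minimal sketch of this likelihood-based classification step (our own illustrative code; the representation of a fitted model as a dictionary from m-tuples to transition probability vectors, the uniform fallback and the probability floor are assumptions, not the paper's implementation):

```python
import numpy as np

def segment_log_likelihood(seq, probs, m, alphabet_index):
    """Log-likelihood of a nucleotide segment under a fitted SMM of order m.
    probs maps each m-tuple to its (group) transition probability vector."""
    n_states = len(alphabet_index)
    ll = 0.0
    for t in range(m, len(seq)):
        u = tuple(seq[t - m:t])
        p_vec = probs.get(u, np.full(n_states, 1.0 / n_states))  # uniform fallback
        ll += np.log(max(p_vec[alphabet_index[seq[t]]], 1e-12))   # floor to avoid log(0)
    return ll

def classify_segment(seq, reference_models, alphabet_index):
    """Assign the segment to the reference model with the highest likelihood.
    reference_models: dict  virus name -> (order m, transition dictionary)."""
    scores = {name: segment_log_likelihood(seq, probs, m, alphabet_index)
              for name, (m, probs) in reference_models.items()}
    return max(scores, key=scores.get)
```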
The reference genome sequences for SARS-CoV-2, MERS, Dengue and Hepatitis B have different lengths. For SARS-CoV-2 and MERS we fit an SMM of one order, while for the other two viruses we use another; the orders are chosen based on the lengths of the reference sequences. There is also a biological justification for the choice: three consecutive DNA bases form a "codon", which translates the genetic code into a sequence of amino acids, so it is fair to assume that an SMM of order three or more will be able to capture the structure of a virus. From the samples, we randomly choose segments whose length is a given proportion of the full sample sequence, and compute the likelihoods under the four reference models to classify each segment to the most likely class of virus. Three different values of this proportion have been used. The confusion matrices for the three scenarios are presented in the following tables.
| | SARS-CoV-2 | MERS | Dengue | Hepatitis B | Total |
|---|---|---|---|---|---|
| SARS-CoV-2 | 185 | 0 | 4 | 11 | 200 |
| MERS | 0 | 50 | 0 | 0 | 50 |
| Dengue | 0 | 0 | 100 | 0 | 100 |
| Hepatitis B | 17 | 46 | 36 | 51 | 150 |
| | SARS-CoV-2 | MERS | Dengue | Hepatitis B | Total |
|---|---|---|---|---|---|
| SARS-CoV-2 | 193 | 0 | 0 | 7 | 200 |
| MERS | 0 | 50 | 0 | 0 | 50 |
| Dengue | 0 | 0 | 100 | 0 | 100 |
| Hepatitis B | 5 | 41 | 23 | 81 | 150 |
| | SARS-CoV-2 | MERS | Dengue | Hepatitis B | Total |
|---|---|---|---|---|---|
| SARS-CoV-2 | 194 | 0 | 0 | 6 | 200 |
| MERS | 0 | 50 | 0 | 0 | 50 |
| Dengue | 0 | 0 | 100 | 0 | 100 |
| Hepatitis B | 0 | 10 | 0 | 140 | 150 |
From the tables, it is clear that when only a small proportion of the original sample sequence is retained, the misclassification rate among the Hepatitis B samples is high. This is to be expected, since for the Hepatitis B samples the selected segments are much shorter than those for Dengue, MERS and SARS-CoV-2. As we increase the proportion, the performance naturally improves; the overall misclassification rates computed from the three confusion matrices are roughly 23%, 15% and 3% for the three retained proportions, respectively. So, even with only a fraction of the sequences, we can correctly identify the true virus in the large majority of cases. Even within Hepatitis B, the misclassification error reduces drastically once we have a fairly long sequence from which meaningful inference can be made. Note that we selected the snippets from arbitrary parts of the full sample sequences. Our SMM method utilizes the information in the reference genome sequence in a compact manner, so that it can capture the diversity of structure across different parts of the samples. Overall, the method is successful in such classification problems, which opens up scope for broader research in this area.
6 Conclusion
The proposed method of fitting sparse Markov models can be utilized in many different areas. In the future, we may be interested in applying the methodology to text mining, recommender systems or other areas of biostatistics. On the theoretical side, we can develop new clustering algorithms by changing the objective function and the penalty, preferably taking into account the number of occurrences of each m-tuple. This will be useful not only in the SMM set-up, but also for general clustering problems. We might also investigate how accurate and computationally efficient our method is for prediction using the SMM structure. We conclude by noting that our method is applicable across a broad spectrum of problems, thus inviting a variety of related research directions.
Appendix A Proof of Theorem 1
(a) For notational simplicity, we write as . Also, denote the objective function in (2.3) as . Suppose for some of the pairs, and . Let . Since , we get . Also, , since the negative elements are shrunk to . Hence for any ,
Since for at least one , , contradicting that is the optimum solution. Hence , .
(b) If we initialize the algorithm appropriately, the first iterate satisfies the sum-to-one constraint. Using similar arguments, the iterate at any subsequent step satisfies it as well. Hence the limiting quantity still has the property that the sum of its elements is always one. This completes the proof that the solution is indeed a probability distribution.
Appendix B Proof of Theorem 2
Note that as , . Let be the stationary probability of the state . Then, as ; and for , we have
where .
Looking at the expressions of these two quantities, it is evident that the first shrinks towards zero as the sample size increases, since the estimated transition probability vectors belonging to the same cluster become closer to each other. On the other hand, the group means of different clusters tend to separate from each other, so that the second quantity converges to a positive number, and eventually the required inequality holds. These expressions also tell us that, in order to have perfect recovery of the clusters, a scaled version of the maximum within-group deviation of the transition probabilities should be less than a scaled version of the minimum between-group separation. These scales depend heavily on the choice of the weights. Note that if we choose the weights so that the weight is larger when two estimated vectors are close (and potentially belong to the same cluster), and smaller when they are far apart (and potentially belong to different clusters), then the relevant denominators move in the favourable direction in the ideal scenario. Hence, this particular choice of weights enhances the separation between the two quantities, increasing the chance of recovering the true cluster assignment.
The proof mainly relies on calculating the probability of and being close to each other. Suppose for some , and , . In that case,
So, for sufficiently small, . We will later find a bound on how small this needs to be.
We will compute a lower bound of the following probability:
Note that the variance-covariance matrix of the limiting distribution is not full rank, as we have a linear constraint in the elements of . Define , be the upper block of . Now,
Define . By asymptotic normality of , , hence . So,
For a symmetric matrix , Hanson and Wright (1971) determined a bound on the tail probability of a quadratic form of a sub-Gaussian random vector with mean and variance-covariance matrix , as follows:
(B.6)
for some constants . Here and denote the Frobenius norm and the spectral norm, respectively. Applying the bound (B.6) to our problem, we get, as ,
As increases, , and eventually for larger , . Now,
as by the result of Watson (1995). Hence,
Write , , and we prove the theorem.
Appendix C Proof of Theorem 3
Recall that . Denote the common transition probability for the estimated group as
Thus, the log-likelihood is given by
Note that as increases, the number of clusters decreases. Also, by the continuity of the solution of (2.3) w.r.t. , is a submodel of for , since the separate clusters at smaller values are merged together to form larger clusters as increases. Hence, we can write . Subsequently, . Let be the stationary probability of the state , and be the stationary probability of the partition . So, . We have to show that the true model minimizes the BIC with probability tending to one as . We prove this in two cases as follows.
Case 1: Suppose, , and . Clearly, . Since is the true underlying model by Theorem (1), , and
Hence, as ,
Case 2: Now let and . For , w.l.o.g. we can write
for . Now, as ,
and
Now, applying the Jensen’s inequality by using the strict convexity of ,
Hence, , and as . Since as ,
References
- Bühlmann, P. (2000). Model selection for variable length Markov chains and tuning the context algorithm. Annals of the Institute of Statistical Mathematics 52, 287–315.
- Bühlmann, P. and Wyner, A. J. (1999). Variable length Markov chains. The Annals of Statistics 27, 480–513.
- Chi, E. C. and Lange, K. (2015). Splitting methods for convex clustering. Journal of Computational and Graphical Statistics 24, 994–1013.
- García, J. E. and González-López, V. A. (2011). Minimal Markov models. In Fourth Workshop on Information Theoretic Methods in Science and Engineering.
- Hocking, T. D., Joulin, A., Bach, F. and Vert, J.-P. (2011). Clusterpath: an algorithm for clustering using convex fusion penalties. In 28th International Conference on Machine Learning.
- Jääskinen, V., Xiong, J., Corander, J. and Koski, T. (2014). Sparse Markov chains for sequence data. Scandinavian Journal of Statistics 41, 639–655.
- Lindsten, F., Ohlsson, H. and Ljung, L. (2011). Just relax and come clustering! A convexification of k-means clustering. Technical Report, Linköping University, Department of Electrical Engineering, Automatic Control.
- Panahi, A., Dubhashi, D., Johansson, F. D. and Bhattacharyya, C. (2017). Clustering by sum of norms: stochastic incremental algorithm, convergence and cluster recovery. In International Conference on Machine Learning, 2769–2777. PMLR.
- Pelckmans, K., De Brabanter, J., Suykens, J. A. K. and De Moor, B. (2005). Convex clustering shrinkage. In PASCAL Workshop on Statistics and Optimization of Clustering.
- Rissanen, J. (1983). A universal prior for integers and estimation by minimum description length. The Annals of Statistics 11, 416–431.
- Roos, T. and Yu, B. (2009a). Sparse Markov source estimation via transformed Lasso. In 2009 IEEE Information Theory Workshop on Networking and Information Theory, 241–245. IEEE.
- Roos, T. and Yu, B. (2009b). Estimating sparse models from multivariate discrete data via transformed Lasso. In 2009 Information Theory and Applications Workshop, 290–294. IEEE.
- Sun, D., Toh, K.-C. and Yuan, Y. (2021). Convex clustering: model, theoretical guarantee and efficient algorithm. Journal of Machine Learning Research 22(9), 1–32.
- Xiong, J., Jääskinen, V. and Corander, J. (2016). Recursive learning for sparse Markov models. Bayesian Analysis 11, 247–263.
- Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68, 49–67.
- Zhu, C., Xu, H., Leng, C. and Yan, S. (2014). Convex optimization procedure for clustering: theoretical revisit. Advances in Neural Information Processing Systems 27.