Optimal prediction of Markov chains with and without spectral gap
Abstract
We study the following learning problem with dependent data: Observing a trajectory of length $n$ from a stationary Markov chain with $k$ states, the goal is to predict the next state. For $3 \le k \le O(\sqrt{n})$, using techniques from universal compression, the optimal prediction risk in Kullback-Leibler divergence is shown to be $\Theta(\frac{k^2}{n}\log\frac{n}{k^2})$, in contrast to the optimal rate of $\Theta(\frac{\log\log n}{n})$ for $k=2$ previously shown in [FOPS16]. These rates, slower than the parametric rate of $O(\frac{k^2}{n})$, can be attributed to the memory in the data, as the spectral gap of the Markov chain can be arbitrarily small. To quantify the memory effect, we study irreducible reversible chains with a prescribed spectral gap. In addition to characterizing the optimal prediction risk for two states, we show that, as long as the spectral gap is not excessively small, the prediction risk in the Markov model is $O(\frac{k^2}{n})$, which coincides with that of an iid model with the same number of parameters. Extensions to higher-order Markov chains are also obtained.
1 Introduction
Learning distributions from samples is a central question in statistics and machine learning. While significant progress has been achieved in property testing and estimation based on independent and identically distributed (iid) data, for many applications, most notably natural language processing, two new challenges arise: (a) Modeling data as independent observations fails to capture their temporal dependency; (b) Distributions are commonly supported on a large domain whose cardinality is comparable to or even exceeds the sample size. Continuing the progress made in [FOPS16, HOP18], in this paper we study the following prediction problem with dependent data modeled as Markov chains.
Suppose $X_1, X_2, \ldots$ is a stationary first-order Markov chain on state space $[k]$ with unknown statistics. Observing a trajectory $X^n = (X_1,\dots,X_n)$, the goal is to predict the next state $X_{n+1}$ by estimating its distribution conditioned on the present data. We use the Kullback-Leibler (KL) divergence as the loss function: For distributions $P$ and $Q$ on $[k]$, $D(P\|Q) \triangleq \sum_{i\in[k]} P(i)\log\frac{P(i)}{Q(i)}$ if $P(i)=0$ whenever $Q(i)=0$, and $D(P\|Q) \triangleq +\infty$ otherwise. The minimax prediction risk is given by
$$\mathsf{Risk}_{k,n} \triangleq \inf_{\hat M}\sup_{(\pi,M)} \mathbb{E}\,D\big(M(\cdot\mid X_n)\,\big\|\,\hat M(\cdot\mid X_1,\dots,X_n)\big) \qquad (1)$$
where the supremum is taken over all stationary distributions $\pi$ and transition matrices $M$ (row-stochastic) such that $\pi M = \pi$, the infimum is taken over all estimators $\hat M$ that are proper Markov kernels (i.e., rows sum to 1), and $M(\cdot\mid x)$ denotes the $x$th row of $M$. Our main objective is to characterize this minimax risk within universal constant factors as a function of $k$ and $n$.
The prediction problem (1) is distinct from parameter estimation problems such as estimating the transition matrix [Bar51, AG57, Bil61, WK19] or its properties [CS00, KV16, HJL+18, HKL+19] in that the quantity to be estimated (the conditional distribution of the next state) depends on the sample path itself. This is precisely what renders the prediction problem closely relevant to natural applications such as autocomplete and text generation. In addition, this formulation allows more flexibility with far fewer assumptions compared to the estimation framework. For example, if a certain state has very small probability under the stationary distribution, consistent estimation of the transition matrix with respect to usual loss functions, e.g., squared risk, may not be possible, whereas the prediction problem is unencumbered by such rare states.
In the special case of iid data, the prediction problem reduces to estimating the distribution in KL divergence. In this setting the optimal risk is well understood: it is known to be $(1+o(1))\frac{k-1}{2n}$ when $k$ is fixed and $n\to\infty$ [BFSS02] and $\Theta(\frac{k}{n})$ for $k = O(n)$ [Pan04, KOPS15]. (Here and below $\asymp$, $\lesssim$, $\gtrsim$ denote equality and inequalities up to universal multiplicative constants.) Typical of parametric models, this rate is commonly referred to as the "parametric rate", which leads to a sample complexity that scales proportionally to the number of parameters and inversely proportionally to the desired accuracy.
In the setting of Markov chains, however, the prediction problem is much less understood especially for large state space. Recently the seminal work [FOPS16] showed the surprising result that for stationary Markov chains on two states, the optimal prediction risk satisfies
$$\mathsf{Risk}_{2,n} = \Theta\left(\frac{\log\log n}{n}\right) \qquad (2)$$
which is a nonparametric rate even though the problem has only two parameters. The follow-up work [HOP18] studied general $k$-state chains and showed a lower bound of $\Omega(\frac{k^2\log\log n}{n})$ for uniform (not necessarily stationary) initial distributions; however, the upper bound in [HOP18] relies on implicit assumptions on the mixing time such as spectral gap conditions: the proofs of the upper bound for prediction (Lemma 7 in the supplement) and for estimation (Lemma 17 of the supplement) are based on Bernstein-type concentration results for the empirical transition counts, which depend on the spectral gap. The following theorem resolves the optimal risk for $k$-state Markov chains:
Theorem 1 (Optimal rates without spectral gap).
There exists a universal constant $C$ such that for all $3 \le k \le \sqrt{n/C}$,
$$\frac{k^2}{Cn}\log\frac{n}{k^2} \;\le\; \mathsf{Risk}_{k,n} \;\le\; \frac{Ck^2}{n}\log\frac{n}{k^2} \qquad (3)$$
Furthermore, the lower bound continues to hold even if the Markov chain is restricted to be irreducible and reversible.
Remark 1.
The optimal prediction risk of $\Theta(\frac{k^2}{n}\log\frac{n}{k^2})$ can be achieved by an averaged version of the add-one estimator (i.e., Laplace's rule of succession). Given a trajectory $x^n = (x_1,\dots,x_n)$ of length $n$, denote the transition counts (with the convention $N_{ij}=N_i=0$ if $n=1$)
$$N_{ij} \triangleq \sum_{t=1}^{n-1}\mathbb{1}\{x_t=i,\,x_{t+1}=j\},\qquad N_i \triangleq \sum_{t=1}^{n-1}\mathbb{1}\{x_t=i\} \qquad (4)$$
The add-one estimator for the transition probability $M(j\mid i)$ is given by
$$\hat M^{+1}(j\mid i) \triangleq \frac{N_{ij}+1}{N_i+k} \qquad (5)$$
which is an additively smoothed version of the empirical frequency. Finally, the optimal rate in (3) can be achieved by the following estimator, defined as an average of add-one estimators over different sample sizes:
$$\hat M(\cdot\mid x^n) \triangleq \frac{1}{n}\sum_{t=1}^{n}\hat M^{+1}\big(\cdot\mid x_{n-t+1}^n\big) \qquad (6)$$
In other words, we apply the add-one estimator to the most recent $t$ observations to predict the next state, then average over $t=1,\dots,n$. Such Cesàro-mean-type estimators have been introduced before in the density estimation literature (see, e.g., [YB99]). It remains open whether the usual add-one estimator (namely, the last term in (6), which uses all the data) or any add-$c$ estimator for constant $c$ achieves the optimal rate. In contrast, for two-state chains the optimal risk (2) is attained by a hybrid strategy [FOPS16], applying an add-$c$ estimator with an appropriately chosen $c$ for trajectories with at most one transition and an add-one estimator otherwise. Also note that the estimator in (6) can be computed in $O(nk)$ time. To derive this, first note that given the counts for the window $x_{n-t+1}^n$, updating them to the window $x_{n-t}^n$ takes $O(1)$ time, and forming each add-one row takes $O(k)$ time. Summing over all $t$ we get the algorithmic complexity upper bound.
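To make the averaging in (6) concrete, here is a minimal Python sketch (ours, not from the paper; the function names and the toy chain are illustrative). For clarity it recomputes the window counts from scratch, costing $O(n^2k)$ rather than the $O(nk)$ achievable with incremental count updates.

```python
import numpy as np

def add_one_row(traj, k, state):
    """Add-one (Laplace) estimate of the transition row M(. | state)
    from the transition counts of a trajectory; see (5)."""
    counts = np.zeros(k)
    n_visits = 0
    for s, t in zip(traj[:-1], traj[1:]):
        if s == state:
            counts[t] += 1
            n_visits += 1
    return (counts + 1) / (n_visits + k)

def cesaro_add_one(traj, k):
    """Averaged add-one estimator (6): apply the add-one rule to the
    last t observations, t = 1..n, and average the resulting rows for
    the current state traj[-1]."""
    n = len(traj)
    last = traj[-1]
    rows = [add_one_row(traj[n - t:], k, last) for t in range(1, n + 1)]
    return np.mean(rows, axis=0)

# toy usage: a lazy 3-state chain
rng = np.random.default_rng(0)
M = np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]])
x = [0]
for _ in range(200):
    x.append(rng.choice(3, p=M[x[-1]]))
print(cesaro_add_one(x, k=3))  # estimate of the row M(. | x_n)
```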
Theorem 1 shows that the departure from the parametric rate of $O(\frac{k^2}{n})$, first discovered in [FOPS16, HOP18] for binary chains, is even more pronounced for larger state spaces. As will become clear in the proof, there is some fundamental difference between two-state and three-state chains, resulting in $\Theta(\frac{\log\log n}{n})$ versus $\Theta(\frac{\log n}{n})$. It is instructive to compare the sample complexity for prediction in the iid and Markov models. Denote by $d$ the number of parameters, which is $k-1$ for the iid case and $k(k-1)$ for Markov chains. Define the sample complexity $n^*(k,\epsilon)$ as the smallest sample size $n$ needed to achieve a prescribed prediction risk $\epsilon$. For $k\ge3$, we have
$$n^*_{\mathsf{iid}}(k,\epsilon) \asymp \frac{d}{\epsilon},\qquad n^*_{\mathsf{Markov}}(k,\epsilon) \asymp \frac{d}{\epsilon}\log\frac{1}{\epsilon} \qquad (7)$$
At a high level, the nonparametric rates in the Markov model can be attributed to the memory in the data. On the one hand, Theorem 1 as well as (2) affirm that one can obtain meaningful prediction without imposing any mixing conditions (to see this, it is helpful to consider the extreme case where the chain does not move at all or is periodic, in which case predicting the next state is in fact easy); such decoupling between learning and mixing has also been observed in other problems such as learning linear dynamics [SMT+18, DMM+19]. On the other hand, the dependency in the data does lead to a strictly higher sample complexity than that of the iid case; in fact, the lower bound in Theorem 1 is proved by constructing chains with spectral gap as small as $O(\frac1n)$ (see Section 3). Thus, it is conceivable that with sufficiently favorable mixing conditions, the prediction risk improves over that of the worst case and, at some point, reaches the parametric rate. To make this precise, we focus on Markov chains with a prescribed spectral gap.
It is well known that for an irreducible and reversible chain, the transition matrix $M$ has real eigenvalues satisfying $1=\lambda_1>\lambda_2\ge\cdots\ge\lambda_k\ge-1$. The absolute spectral gap of $M$, defined as
$$\gamma^* \triangleq 1-\max\{|\lambda_2|,\dots,|\lambda_k|\} \qquad (8)$$
quantifies the memory of the Markov chain. For example, the mixing time is determined by the relaxation time $\frac{1}{\gamma^*}$ up to logarithmic factors. As extreme cases, the chain which does not move ($M$ is the identity) and the iid chain ($M$ is rank-one) have absolute spectral gap equal to 0 and 1, respectively. We refer the reader to [LP17] for more background. Note that the definition of the absolute spectral gap requires irreducibility and reversibility, thus we restrict ourselves to this class of Markov chains (it is possible to use more general notions such as the pseudo spectral gap to quantify the memory of the process, which is beyond the scope of the current paper). Given $0\le\gamma^*\le1$, define $\mathcal{M}_k(\gamma^*)$ as the set of transition matrices corresponding to irreducible and reversible chains whose absolute spectral gap exceeds $\gamma^*$. Restricting (1) to this subcollection and noticing that the stationary distribution here is uniquely determined by $M$, we define the corresponding minimax risk:
$$\mathsf{Risk}_{k,n}(\gamma^*) \triangleq \inf_{\hat M}\sup_{M\in\mathcal{M}_k(\gamma^*)} \mathbb{E}\,D\big(M(\cdot\mid X_n)\,\big\|\,\hat M(\cdot\mid X^n)\big) \qquad (9)$$
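As a quick numerical illustration of definition (8) (our sketch, not part of the paper), the absolute spectral gap of a reversible chain can be read off the eigenvalues of $M$, which are real in this case:

```python
import numpy as np

def absolute_spectral_gap(M):
    """Absolute spectral gap (8) of a reversible transition matrix M:
    1 - max |lambda_i| over all eigenvalues except the top one (= 1)."""
    eig = np.sort(np.linalg.eigvals(M).real)[::-1]  # real for reversible M
    assert np.isclose(eig[0], 1.0), "rows of M must sum to 1"
    return 1.0 - np.max(np.abs(eig[1:]))

I3 = np.eye(3)                      # chain that never moves
iid = np.full((3, 3), 1 / 3)        # iid (rank-one) chain
lazy = 0.5 * I3 + 0.5 * iid
print(absolute_spectral_gap(I3))    # 0.0
print(absolute_spectral_gap(iid))   # 1.0
print(absolute_spectral_gap(lazy))  # 0.5
```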
Extending the result (2) of [FOPS16], the following theorem characterizes the optimal prediction risk for two-state chains with prescribed spectral gaps (the case $\gamma^*=O(\frac1n)$ corresponds to the minimax rate $\Theta(\frac{\log\log n}{n})$ in [FOPS16] over all binary Markov chains):
Theorem 2 (Spectral gap dependent rates for binary chain).
For any $\gamma^*\in\big[\frac1n,\frac12\big]$,
$$\mathsf{Risk}_{2,n}(\gamma^*) \;\asymp\; \frac1n\Big(1+\log\log\frac{1}{\gamma^*}\Big).$$
Theorem 2 shows that for binary chains, the parametric rate $\Theta(\frac1n)$ is achievable if and only if the spectral gap is nonvanishing. While this holds for bounded state spaces (see Corollary 4 below), for large state spaces it turns out that much weaker conditions on the absolute spectral gap suffice to guarantee the parametric rate $O(\frac{k^2}{n})$, achieved by the add-one estimator applied to the entire trajectory. In other words, as long as the spectral gap is not excessively small, the prediction risk in the Markov model behaves in the same way as that of an iid model with an equal number of parameters. A similar conclusion has been established previously for the sample complexity of estimating the entropy rate of Markov chains in [HJL+18, Theorem 1].
Theorem 3.
The add-one estimator in (5) achieves the following risk bounds.
-
(i)
For any $0<\gamma^*\le1$, $\mathsf{Risk}_{k,n}(\gamma^*)\lesssim\frac{k^2}{n}$, provided that $n\gtrsim\frac{k^2\log^3 n}{\gamma^*}$.
-
(ii)
In addition, for $k\gtrsim\log^2 n$, $\mathsf{Risk}_{k,n}(\gamma^*)\lesssim\frac{k^2}{n}$, provided that $n\gtrsim\frac{k\log^3 n}{\gamma^*}$.
Corollary 4.
For any fixed $k$, $\mathsf{Risk}_{k,n}(\gamma^*)\asymp\frac1n$ if and only if $\gamma^*=\Omega(1)$.
Finally, we address the optimal prediction risk for higher-order Markov chains:
Theorem 5.
There is a constant $C_m$ depending only on $m$ such that for any $k\ge2$ and $n\ge C_mk^{m+1}$, excluding the case $m=1,k=2$ covered by (2), the minimax prediction rate for $m$-th order Markov chains with stationary initialization is $\Theta\big(\frac{k^{m+1}}{n}\log\frac{n}{k^{m+1}}\big)$.
Notably, for binary state spaces, it turns out that the optimal rate $\Theta(\frac{\log\log n}{n})$ for first-order Markov chains determined by [FOPS16] is very special, as we show that for second-order chains the optimal rate is $\Theta(\frac{\log n}{n})$.
1.1 Proof techniques
The proof of Theorem 1 deviates from existing approaches based on concentration inequalities for Markov chains. For instance, the standard program for analyzing the add-one estimator (5) involves proving concentration of the empirical counts near their population versions, namely $N_{ij}\approx n\pi_iM_{ij}$ and $N_i\approx n\pi_i$, and bounding the risk in the atypical case by concentration inequalities, such as the Chernoff-type bounds in [Lez98, Pau15], which have been widely used in recent work on statistical inference with Markov chains [KV16, HJL+18, HOP18, HKL+19, WK19]. However, these concentration inequalities inevitably depend on the spectral gap of the Markov chain, leading to results which deteriorate as the spectral gap becomes smaller. For two-state chains, results free of the spectral gap are obtained in [FOPS16] using the explicit joint distribution of the transition counts; this refined analysis, however, is difficult to extend to larger state spaces, as the probability mass function of the transition counts is given by Whittle's formula [Whi55], which takes an unwieldy determinantal form.
Eschewing concentration-based arguments, the crux of our proof of Theorem 1, for both the upper and lower bound, revolves around the following quantity known as redundancy:
$$\mathsf{Red}_{k,n} \triangleq \inf_{Q_{X^n}}\sup_{P_{X^n}} D\big(P_{X^n}\,\big\|\,Q_{X^n}\big) \qquad (10)$$
Here the supremum is taken over all joint distributions $P_{X^n}$ of stationary Markov chains on $k$ states, and the infimum is over all joint distributions $Q_{X^n}$. A central quantity measuring the minimax regret in universal compression, the redundancy (10) corresponds to the minimax cumulative risk (namely, the total prediction risk when the sample size ranges from 1 to $n$), while (1) is the individual minimax risk at sample size $n$; see Section 2 for a detailed discussion. We prove the following reduction between prediction risk and redundancy:
$$\frac{1}{n}\,\mathsf{Red}^{\mathsf{sym}}_{k-1,\,\Omega(n)} \;\lesssim\; \mathsf{Risk}_{k,n} \;\lesssim\; \frac{1}{n}\,\mathsf{Red}_{k,\,n+1} \qquad (11)$$
where $\mathsf{Red}^{\mathsf{sym}}$ denotes the redundancy for symmetric Markov chains. The upper bound is standard: thanks to the convexity of the loss function and the stationarity of the Markov chain, the risk of the Cesàro-mean estimator (6) can be upper bounded by the cumulative risk and, in turn, the redundancy. The proof of the lower bound is more involved. Given a $(k-1)$-state chain, we embed it into a larger state space by introducing a new state, such that with constant probability, the chain starts from and gets stuck at this new state for a period of time that is approximately uniform in $\{1,\dots,n\}$, then enters the original chain. Effectively, this scenario is equivalent to a prediction problem on $k-1$ states with a random (approximately uniform) sample size, whose prediction risk can then be related to the cumulative risk and redundancy. This intuition can be made precise by considering a Bayesian setting, in which the $(k-1)$-state chain is randomized according to the least favorable prior for (10), representing the Bayes risk as conditional mutual information, and applying the chain rule.
Given the above reduction (11), it suffices to show that both redundancies therein are on the order of $k^2\log\frac{n}{k^2}$. The redundancy is upper bounded by the pointwise redundancy, which replaces the average in (10) by the maximum over all trajectories. Following [DMPW81, CS04], we consider an explicit probability assignment defined by add-one smoothing and use combinatorial arguments to bound the pointwise redundancy, shown optimal by information-theoretic arguments.
The optimal spectral gap-dependent rate in Theorem 2 relies on the key observation in [FOPS16] that, for binary chains, the dominating contribution to the prediction risk comes from trajectories with a single transition, for which we may apply an add-$c$ estimator with $c$ depending appropriately on the spectral gap. The lower bound is shown using a Bayesian argument similar to that of [HOP18, Theorem 1]. The proof of Theorem 3 relies on more delicate concentration arguments, as the spectral gap is allowed to be vanishingly small. Notably, for small $\gamma^*$, a direct application of existing Bernstein inequalities for Markov chains in [Lez98, Pau15] falls short of establishing the parametric rate of $O(\frac{k^2}{n})$ (see Remark 4 in Section 4.2 for details); instead, we use a fourth moment bound which turns out to be well suited for analyzing the concentration of empirical counts conditioned on the terminal state.
For large $k$, we further improve the spectral gap condition using a simulation argument for Markov chains based on independent samples [Bil61, HJL+18]. A key step is a new concentration inequality for $D(P\,\|\,\hat P^{+1})$, where $\hat P^{+1}$ is the add-one estimator based on $m$ iid observations of a distribution $P$ supported on $[k]$:
$$\mathbb{P}\left[D\big(P\,\big\|\,\hat P^{+1}\big) \ge \frac{C\big(k+\log\frac1\delta\big)}{m}\right] \le \delta \qquad (12)$$
for some absolute constant $C$ and any $\delta\in(0,1)$. Note that an application of the classical concentration inequality of McDiarmid would result in the second term being of order $\sqrt{\frac{\log(1/\delta)}{m}}$ (up to logarithmic factors), and (12) crucially improves this to $\frac{\log(1/\delta)}{m}$. Such an improvement has been recently observed by [MJT+20, Agr20, GR20] in studying the similar quantity $D(\hat P\,\|\,P)$ for the (unsmoothed) empirical distribution $\hat P$; however, these results, based on either the method of types or an explicit upper bound of the moment generating function, are not directly applicable to (12), in which the true distribution appears as the first argument in the KL divergence.
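A simple Monte Carlo sanity check of the sub-exponential tail in (12) (our sketch; all parameters are arbitrary illustrative choices): empirical $(1-\delta)$-quantiles of $D(P\|\hat P^{+1})$ should scale like $(k+\log\frac1\delta)/m$.

```python
import numpy as np

rng = np.random.default_rng(1)
k, m, trials = 10, 500, 20000
P = rng.dirichlet(np.ones(k))          # a fixed "true" distribution

def kl(p, q):
    return np.sum(p * np.log(p / q))

# simulate D(P || add-one estimate) over many iid samples of size m
divs = np.empty(trials)
for t in range(trials):
    counts = rng.multinomial(m, P)
    P_hat = (counts + 1) / (m + k)     # add-one estimator
    divs[t] = kl(P, P_hat)

# (12) predicts the (1-delta)-quantile is O((k + log(1/delta)) / m)
for delta in [0.1, 0.01, 0.001]:
    q = np.quantile(divs, 1 - delta)
    print(f"delta={delta:6}: quantile={q:.4f}, "
          f"(k + log(1/delta))/m = {(k + np.log(1 / delta)) / m:.4f}")
```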
The nonasymptotic analysis of the prediction rate for higher-order chains with large alphabets is based on a redundancy-based reduction similar to that for first-order chains. However, establishing optimal nonasymptotic redundancy bounds for higher-order chains is more challenging. Notably, in lower bounding the redundancy, we need to bound the mutual information from below by upper bounding the squared error of certain estimators. As noted in [TJW18], existing analysis in [Dav83, Sec III] based on simple mixing conditions from [Par62] leads to suboptimal results on large alphabets. To bypass this issue, we show that the pseudo spectral gap [Pau15] of the transition matrix of the associated first-order chain is at least a constant. This is accomplished by a careful construction of a prior on $m$-th order transition matrices with $\Theta(k^{m+1})$ degrees of freedom.
1.2 Related work
While the exact prediction problem studied in this paper has been in focus only recently since [FOPS16, HOP18], there exists a large body of literature on related works. As mentioned before, some of our proof strategies draw inspiration and results from the study of redundancy in universal compression, its connection to mutual information, as well as the perspective of sequential probability assignment as prediction, dating back to [Dav73, DMPW81, Ris84, Sht87, Rya88]. Asymptotic characterizations of the minimax redundancy for Markov sources, both average and pointwise, were obtained in [Dav83, Att99, JS02], in the regime of fixed alphabet size $k$ and large sample size $n$. A non-asymptotic characterization was obtained in [Dav83] for $k=2$ and recently extended to large alphabets in [TJW18], which further showed that the behavior of the redundancy remains unchanged even if the Markov chain is very close to being iid in terms of the spectral gap.
The current paper adds to a growing body of literature devoted to statistical learning with dependent data, in particular those dealing with Markov chains. Estimation of the transition matrix [Bar51, AG57, Bil61, Sin64] and testing the order of Markov chains [CS00] have been well studied in the large-sample regime. More recently attention has shifted towards large state spaces and nonasymptotics. For example, [WK19] studied the estimation of the transition matrix in induced norm for Markov chains with prescribed pseudo spectral gap and minimum probability mass of the stationary distribution, and determined sample complexity bounds up to logarithmic factors. Similar results have been obtained for estimating properties of Markov chains, including mixing time and spectral gap [HKL+19], entropy rate [KV16, HJL+18, OS20], graph statistics based on random walks [BHOP18], as well as identity testing [DDG18, CB19, WK20, FW21]. Most of these results rely on assumptions on the Markov chains such as lower bounds on the spectral gap and the stationary distribution, which afford concentration of sample statistics of Markov chains. In contrast, one of the main contributions in this paper, in particular Theorem 1, is that optimal prediction can be achieved without these assumptions, thereby providing a novel way of tackling these seemingly unavoidable issues. This is ultimately accomplished by information-theoretic and combinatorial techniques from universal compression.
1.3 Notations and preliminaries
For $k\in\mathbb{N}$, let $[k]\triangleq\{1,\dots,k\}$. Denote $X_i^j\triangleq(X_i,\dots,X_j)$ and $X^n\triangleq X_1^n$. The distribution of a random variable $X$ is denoted by $P_X$. In a Bayesian setting, the distribution of a parameter $\theta$ is referred to as a prior, denoted by $\pi$. We recall the following definitions from information theory [CK82, CT06]. The conditional KL divergence is defined as an average of KL divergences between conditional distributions:
$$D\big(P_{Y|X}\,\big\|\,Q_{Y|X}\,\big|\,P_X\big) \triangleq \mathbb{E}_{X\sim P_X}\big[D\big(P_{Y|X}\,\big\|\,Q_{Y|X}\big)\big] \qquad (13)$$
The mutual information between random variables $X$ and $Y$ with joint distribution $P_{XY}$ is $I(X;Y)\triangleq D(P_{Y|X}\,\|\,P_Y\mid P_X)$; similarly, the conditional mutual information is defined as $I(X;Y\mid Z)\triangleq D(P_{Y|XZ}\,\|\,P_{Y|Z}\mid P_{XZ})$.
The following variational representation of the (conditional) mutual information is well known:
$$I(X;Y)=\min_{Q_Y}D\big(P_{Y|X}\,\big\|\,Q_Y\,\big|\,P_X\big),\qquad I(X;Y\mid Z)=\min_{Q_{Y|Z}}D\big(P_{Y|XZ}\,\big\|\,Q_{Y|Z}\,\big|\,P_{XZ}\big) \qquad (14)$$
The entropy of a discrete random variable $X$ is $H(X)\triangleq\sum_xP_X(x)\log\frac{1}{P_X(x)}$.
1.4 Organization
The rest of the paper is organized as follows. In Section 2 we describe the general paradigm of minimax redundancy and prediction risk and their dual representation in terms of mutual information. We give a general redundancy-based bound on the prediction risk, which, combined with redundancy bounds for Markov chains, leads to the upper bound in Theorem 1. Section 3 presents the lower bound construction, starting from three states and then extending to $k$ states. Spectral gap-dependent risk bounds in Theorems 2 and 3 are given in Section 4. Section 5 presents the results and proofs for $m$-th order Markov chains. Section 6 discusses the assumptions and implications of our results and related open problems.
2 Two general paradigms
2.1 Redundancy, prediction risk, and mutual information representation
For $n\in\mathbb{N}$, let $\{P_{X^n\mid\theta}:\theta\in\Theta\}$ be a collection of joint distributions parameterized by $\theta$.
“Compression”.
Consider a sample $X^n$ of size $n$ drawn from $P_{X^n\mid\theta}$ for some unknown $\theta$. The redundancy of a probability assignment (joint distribution) $Q_{X^n}$ is defined as the worst-case KL risk of fitting the joint distribution of $X^n$, namely
$$\mathsf{Red}(Q_{X^n}) \triangleq \sup_{\theta\in\Theta}D\big(P_{X^n\mid\theta}\,\big\|\,Q_{X^n}\big) \qquad (15)$$
Optimizing over $Q_{X^n}$, the minimax redundancy is defined as
$$\mathsf{Red}_n \triangleq \inf_{Q_{X^n}}\sup_{\theta\in\Theta}D\big(P_{X^n\mid\theta}\,\big\|\,Q_{X^n}\big) \qquad (16)$$
where the infimum is over all joint distributions $Q_{X^n}$. This quantity can be operationalized as the redundancy (i.e., regret) in the setting of universal data compression, that is, the excess number of bits compared to the optimal compressor that knows $\theta$ [CT06, Chapter 13].
The capacity-redundancy theorem (see [Kem74] for a very general result) provides the following mutual information characterization of (16):
$$\mathsf{Red}_n = \sup_{\pi}I(\theta;X^n) \qquad (17)$$
where the supremum is over all distributions (priors) $\pi$ on $\Theta$. In view of the variational representation (14), this result can be interpreted as a minimax theorem:
$$\inf_{Q_{X^n}}\sup_{\pi}\,\mathbb{E}_{\theta\sim\pi}\,D\big(P_{X^n\mid\theta}\,\big\|\,Q_{X^n}\big)=\sup_{\pi}\inf_{Q_{X^n}}\,\mathbb{E}_{\theta\sim\pi}\,D\big(P_{X^n\mid\theta}\,\big\|\,Q_{X^n}\big).$$
Typically, for fixed model size and $n\to\infty$, one expects that $\mathsf{Red}_n=\frac{d}{2}\log n\,(1+o(1))$, where $d$ is the number of parameters; see [Ris84] for a general theory of this type. Indeed, on a fixed alphabet of size $k$, we have $\mathsf{Red}_n=\frac{k-1}{2}\log n\,(1+o(1))$ for the iid model [Dav73] and $\mathsf{Red}_n=\frac{k^m(k-1)}{2}\log n\,(1+o(1))$ for $m$-th order Markov models [Tro74], with more refined asymptotics shown in [XB97, SW12]. For large alphabets, nonasymptotic results have also been obtained. For example, for the first-order Markov model, $\mathsf{Red}_n\asymp k^2\log\frac{n}{k^2}$ provided that $n\gtrsim k^2$ [TJW18].
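To see the $\frac d2\log n$ scaling numerically, the following self-contained snippet (ours; an illustration, not the paper's method) computes the pointwise redundancy of the iid Bernoulli model ($d=1$) via the Shtarkov sum, which upper bounds the average redundancy (16):

```python
import math

def shtarkov_bernoulli(n):
    """log of the Shtarkov sum for the Bernoulli iid model, i.e. the
    pointwise minimax redundancy log sum_{x^n} max_p P(x^n | p);
    grows like (1/2) log n plus a constant."""
    total = 0.0
    for t in range(n + 1):
        log_comb = (math.lgamma(n + 1) - math.lgamma(t + 1)
                    - math.lgamma(n - t + 1))
        # max over p of p^t (1-p)^(n-t) is attained at p = t/n
        log_ml = ((t * math.log(t / n) if t > 0 else 0.0)
                  + ((n - t) * math.log((n - t) / n) if t < n else 0.0))
        total += math.exp(log_comb + log_ml)
    return math.log(total)

for n in [10, 100, 1000, 10000]:
    print(n, round(shtarkov_bernoulli(n), 3), round(0.5 * math.log(n), 3))
```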
“Prediction”.
Consider the problem of predicting the next unseen data point $X_{n+1}$ based on the observations $X^n$, where $X^{n+1}$ is jointly distributed as $P_{X^{n+1}\mid\theta}$ for some unknown $\theta\in\Theta$. Here, an estimator is a distribution for $X_{n+1}$ as a function of $X^n$, which, in turn, can be written as a conditional distribution $Q_{X_{n+1}\mid X^n}$. As such, its worst-case average risk is
$$\mathsf{Risk}(Q_{X_{n+1}\mid X^n}) \triangleq \sup_{\theta\in\Theta}D\big(P_{X_{n+1}\mid X^n,\theta}\,\big\|\,Q_{X_{n+1}\mid X^n}\,\big|\,P_{X^n\mid\theta}\big) \qquad (18)$$
where the conditional KL divergence is defined in (13). The minimax prediction risk is then defined as
$$\mathsf{Risk}_n \triangleq \inf_{Q_{X_{n+1}\mid X^n}}\sup_{\theta\in\Theta}D\big(P_{X_{n+1}\mid X^n,\theta}\,\big\|\,Q_{X_{n+1}\mid X^n}\,\big|\,P_{X^n\mid\theta}\big) \qquad (19)$$
While (16) does not directly correspond to a statistical estimation problem, (19) is exactly the familiar setting of "density estimation", where $Q_{X_{n+1}\mid X^n}$ is understood as an estimator for the distribution of the unseen $X_{n+1}$ based on the available data $X^n$.
In the Bayesian setting where $\theta$ is drawn from a prior $\pi$, the Bayes prediction risk coincides with the conditional mutual information, as a consequence of the variational representation (14):
$$\inf_{Q_{X_{n+1}\mid X^n}}\mathbb{E}_{\theta\sim\pi}\,D\big(P_{X_{n+1}\mid X^n,\theta}\,\big\|\,Q_{X_{n+1}\mid X^n}\,\big|\,P_{X^n\mid\theta}\big)=I(\theta;X_{n+1}\mid X^n) \qquad (20)$$
Furthermore, the Bayes estimator that achieves this infimum takes the following form:
$$Q^*_{X_{n+1}\mid X^n}(\cdot\mid x^n)=P_{X_{n+1}\mid X^n}(\cdot\mid x^n)=\int P_{X_{n+1}\mid X^n,\theta}(\cdot\mid x^n)\,\pi(\mathrm{d}\theta\mid x^n) \qquad (21)$$
known as the Bayes predictive density [Dav73, LB04]. These representations play a crucial role in the lower bound proof of Theorem 1. Under appropriate conditions, which hold for Markov models (see Lemma 33 in Appendix A), the minimax prediction risk (19) also admits a dual representation analogous to (17):
$$\mathsf{Risk}_n=\sup_{\pi}I(\theta;X_{n+1}\mid X^n) \qquad (22)$$
which, in view of (20), shows that the principle of "minimax = worst-case Bayes" continues to hold for the prediction problem in Markov models.
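A standard concrete instance of the Bayes predictive density (21), included here for illustration: in the iid Bernoulli model with a uniform prior on the bias $\theta$, the posterior after observing $T=\sum_{t=1}^nx_t$ ones is $\mathrm{Beta}(T+1,\,n-T+1)$, so
$$Q^*(X_{n+1}=1\mid x^n)=\int_0^1\theta\,\pi(\mathrm{d}\theta\mid x^n)=\frac{T+1}{n+2},$$
which is exactly Laplace's rule of succession and matches the add-one smoothing (5) with $k=2$.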
The following result relates the redundancy and the prediction risk.
Lemma 6.
For any model,
$$\mathsf{Red}_n \le \sum_{t=0}^{n-1}\mathsf{Risk}_t \qquad (23)$$
In addition, suppose that each $P_{X^\infty\mid\theta}$ is stationary and $m$-th order Markov. Then for all $n\ge m$,
$$\mathsf{Risk}_n \le \frac{\mathsf{Red}_{n+1}}{n+1-m} \qquad (24)$$
Furthermore, for any joint distribution $Q_{X^{n+1}}$ factorizing as $Q_{X^{n+1}}=\prod_{t=0}^nQ_{X_{t+1}\mid X^t}$, the prediction risk of the estimator
$$Q^{\mathsf{avg}}_{X_{n+1}\mid X^n}(\cdot\mid x^n) \triangleq \frac{1}{n+1-m}\sum_{t=m}^{n}Q_{X_{t+1}\mid X^t}\big(\cdot\mid x_{n-t+1}^n\big) \qquad (25)$$
is bounded by the redundancy of $Q_{X^{n+1}}$ as
$$\mathsf{Risk}\big(Q^{\mathsf{avg}}_{X_{n+1}\mid X^n}\big) \le \frac{\mathsf{Red}(Q_{X^{n+1}})}{n+1-m} \qquad (26)$$
Remark 2.
Note that the upper bound (23) on redundancy, known as the "estimation-compression inequality" [KOPS15, FOPS16], holds without conditions, while the lower bound (24) relies on stationarity and Markovity. For iid data, the estimation-compression inequality is almost an equality; however, this is not the case for Markov chains, as both sides of (23) differ by an unbounded factor of $\log\log n$ for $k=2$ and $\log\frac{n}{k^2}$ for fixed $k\ge3$; see (2) and Theorem 1. On the other hand, Markov chains with at least three states offer a rare instance where (24) is tight, namely $\mathsf{Risk}_n\asymp\frac{\mathsf{Red}_{n+1}}{n+1}$ (cf. Lemma 7).
Proof.
The upper bound on the redundancy follows from the chain rule of KL divergence:
$$D\big(P_{X^n\mid\theta}\,\big\|\,Q_{X^n}\big)=\sum_{t=0}^{n-1}D\big(P_{X_{t+1}\mid X^t,\theta}\,\big\|\,Q_{X_{t+1}\mid X^t}\,\big|\,P_{X^t\mid\theta}\big) \qquad (27)$$
Thus $\sup_{\theta}D\big(P_{X^n\mid\theta}\,\|\,Q_{X^n}\big)\le\sum_{t=0}^{n-1}\mathsf{Risk}\big(Q_{X_{t+1}\mid X^t}\big)$. Minimizing both sides over $Q_{X^n}$ (or equivalently, over $Q_{X_{t+1}\mid X^t}$ for $t=0,\dots,n-1$) yields (23).
To upper bound the prediction risk using redundancy, fix any $Q_{X^{n+1}}$, which gives rise to the conditionals $Q_{X_{t+1}\mid X^t}$ for $t=0,\dots,n$. For clarity, let us denote the estimator $Q_{X_{t+1}\mid X^t}$ by $\hat P_t$. Consider the estimator defined in (25), namely,
$$\hat P(\cdot\mid x^n)=\frac{1}{n+1-m}\sum_{t=m}^{n}\hat P_t\big(\cdot\mid x_{n-t+1}^n\big) \qquad (28)$$
That is, we apply $\hat P_t$ to the most recent $t$ symbols prior to $X_{n+1}$ to predict its distribution, then average over $t=m,\dots,n$. We may bound the prediction risk of this estimator by the redundancy as follows: Fix $\theta$. To simplify notation, we suppress the dependency on $\theta$ and write $P$ for $P_{X^\infty\mid\theta}$. Then
where (a) uses the $m$-th order Markov assumption; (b) is due to the convexity of the KL divergence; (c) uses the crucial fact that the conditional distribution of the next symbol given the previous $m$ symbols is the same for all times, thanks to stationarity; (d) follows from substituting the estimator (28), the Markov assumption, and rewriting the expectation as a conditional KL divergence; (e) is by the chain rule (27) of KL divergence. Since the above holds for any $\theta$, the desired (26) follows, which implies that $\mathsf{Risk}_n\le\frac{\mathsf{Red}(Q_{X^{n+1}})}{n+1-m}$. Finally, (24) follows by taking the infimum over $Q_{X^{n+1}}$ in (26), since by stationarity the windows $X_{n-t+1}^n$ and $X_1^t$ are equal in law. ∎
Remark 3.
Alternatively, Lemma 6 also follows from the mutual information representations (17) and (22). Indeed, by the chain rule for mutual information,
$$I(\theta;X^n)=\sum_{t=0}^{n-1}I(\theta;X_{t+1}\mid X^t) \qquad (29)$$
taking the supremum over $\pi$ (the distribution of $\theta$) on both sides and applying (17) and (22) yields (23). For (24), it suffices to show that $I(\theta;X_{t+1}\mid X^t)$ is decreasing in $t$: for any $t\ge m$,
and the first term is
where the first and second equalities follow from the $m$-th order Markovity and stationarity, respectively. Taking the supremum over $\pi$ yields the claimed monotonicity. Finally, by the chain rule (29), we have $I(\theta;X^{n+1})\ge(n+1-m)\,I(\theta;X_{n+1}\mid X^n)$, yielding (24).
2.2 Proof of the upper bound part of Theorem 1
Specializing to first-order stationary Markov chains with $k$ states, we denote the redundancy and prediction risk in (16) and (19) by $\mathsf{Red}_{k,n}$ and $\mathsf{Risk}_{k,n}$, the latter of which is precisely the quantity previously defined in (1). Applying Lemma 6 yields $\mathsf{Risk}_{k,n}\le\frac{\mathsf{Red}_{k,n+1}}{n}$. To upper bound $\mathsf{Red}_{k,n+1}$, consider the following probability assignment:
$$Q_{X^{n+1}}(x^{n+1}) \triangleq \frac1k\prod_{t=1}^{n}\frac{N_{x_tx_{t+1}}(x^t)+1}{N_{x_t}(x^t)+k} \qquad (30)$$
where $N_{ij}(x^t)$ and $N_i(x^t)$ are the counts (4) evaluated on the prefix $x^t$, so that each factor is the add-one estimator defined in (5). This factorizes as $Q_{X^{n+1}}=Q_{X_1}\prod_{t=1}^nQ_{X_{t+1}\mid X^t}$ with $Q_{X_1}$ uniform on $[k]$. The following lemma bounds the redundancy of $Q_{X^{n+1}}$:
Lemma 7.
For a universal constant $C$ and all $k\le\sqrt n$, the probability assignment (30) satisfies $\mathsf{Red}(Q_{X^{n+1}})\le Ck^2\log\frac{n}{k^2}$.
Combined with Lemma 6, Lemma 7 shows that $\mathsf{Risk}_{k,n}\le\frac{Ck^2}{n}\log\frac{n}{k^2}$ for all $k\le\sqrt n$ and some universal constant $C$, achieved by the estimator (6), which is obtained by applying the rule (25) to (30).
It remains to show Lemma 7. To do so, we in fact bound the pointwise redundancy of the add-one probability assignment (30) over all (not necessarily stationary) Markov chains on $k$ states. The proof is similar to those of [CS04, Theorems 6.3 and 6.5], which, in turn, follow the arguments of [DMPW81, Sec. III-B].
Proof.
We show that for every Markov chain with transition matrix $M$ and initial distribution $\pi$, and every trajectory $x^{n+1}$, it holds that
$$\log\frac{\pi(x_1)\prod_{t=1}^nM(x_{t+1}\mid x_t)}{Q_{X^{n+1}}(x^{n+1})} \;\le\; Ck^2\log\frac{n}{k^2} \qquad (31)$$
where we abbreviate the add-one estimator defined in (5) as $\hat M^{+1}$.
To establish (31), note that $Q_{X^{n+1}}(x^{n+1})$ can be equivalently expressed using the empirical counts $N_{ij}$ and $N_i$ in (4) as
Note that
where the inequality follows from for each , by the nonnegativity of the KL divergence. Therefore, we have
(32) |
We claim that: for and , it holds that
(33) |
with the understanding that . Applying this claim to (32) gives
where (a) follows from the concavity of , , and .
It remains to justify (33), which has a simple information-theoretic proof: Let denote the collection of sequences in whose type is given by . Namely, for each , appears exactly times for each . Let be drawn uniformly at random from the set . Then
where (a) follows from the fact that the joint entropy is at most the sum of marginal entropies; (b) is because each is distributed as . ∎
3 Optimal rates without spectral gap
In this section, we prove the lower bound part of Theorem 1, which shows the optimality of the averaged version of the add-one estimator (25). We first describe the lower bound construction for three-state chains, which is subsequently extended to $k$ states.
3.1 Warmup: an $\Omega(\frac{\log n}{n})$ lower bound for three-state chains
Theorem 8.
There is a universal constant $c>0$ such that $\mathsf{Risk}_{3,n}\ge c\,\frac{\log n}{n}$.
To show Theorem 8, consider the following one-parameter family of transition matrices:
$$M_\theta=\begin{pmatrix}1-\frac2n&\frac1n&\frac1n\\ \frac1n&1-\frac1n-\theta&\theta\\ \frac1n&\theta&1-\frac1n-\theta\end{pmatrix},\qquad \theta\in\Big[\frac14,\frac34\Big] \qquad (34)$$
Note that each transition matrix in (34) is symmetric (hence doubly stochastic), and the corresponding chain is reversible with uniform stationary distribution and spectral gap $\Theta(\frac1n)$; see Fig. 1.
The main idea is as follows. Notice that by design, with constant probability, the trajectory is of the following form: The chain starts and stays at state 1 for $t$ steps, and then transitions into state 2 or 3 and never returns to state 1, where $t\in\{1,\dots,n\}$. Since $\theta$ is the single unknown parameter, the only useful observations are visits to states 2 and 3, and each visit entails one observation about $\theta$ by flipping a coin with bias roughly $\theta$. Thus the effective sample size for estimating $\theta$ is $n-t$ and we expect the best estimation error to be of the order $\frac{1}{n-t}$. However, $t$ is not fixed. In fact, conditioned on the trajectory being of this form, $t$ is roughly uniformly distributed in $\{1,\dots,n\}$. As such, we anticipate the estimation error of $\theta$ to be approximately $\frac1n\sum_{t=1}^{n-1}\frac{1}{n-t}\asymp\frac{\log n}{n}$.
Intuitively speaking, the construction in Fig. 1 "embeds" a symmetric two-state chain (with states 2 and 3) with unknown parameter $\theta$ into a space of three states, by adding a "nuisance" state 1, which effectively slows down the exploration of the useful part of the state space, so that in a trajectory of length $n$, the effective number of observations about $\theta$ is roughly uniformly distributed between 1 and $n$. This explains the extra log factor in Theorem 8, which stems from the harmonic sum $\sum_{t=1}^n\frac1t\asymp\log n$. We will fully explore this embedding idea in Section 3.2 to deal with larger state spaces.
Next we make the above intuition rigorous using a Bayesian argument. Let us start by recalling the following well-known lemma.
Lemma 9.
Let $\theta\sim\mathrm{Unif}(0,1)$ and, conditioned on $\theta$, let $N\sim\mathrm{Bin}(n,\theta)$. Then the Bayes estimator of $\theta$ given $N$ is the "add-one" estimator:
$$\hat\theta=\mathbb{E}[\theta\mid N]=\frac{N+1}{n+2},$$
and the Bayes risk is given by
$$\inf_{\hat\theta}\mathbb{E}\big[(\theta-\hat\theta)^2\big]=\frac{1}{6(n+2)}.$$
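For completeness, the standard derivation (our addition): under the uniform prior the posterior is $\mathrm{Beta}(N+1,\,n-N+1)$, so
$$\mathbb{E}[\theta\mid N]=\frac{N+1}{n+2},\qquad \mathrm{Var}(\theta\mid N)=\frac{(N+1)(n-N+1)}{(n+2)^2(n+3)};$$
since $N$ is marginally uniform on $\{0,\dots,n\}$, averaging the posterior variance gives
$$\frac{1}{n+1}\sum_{N=0}^{n}\frac{(N+1)(n-N+1)}{(n+2)^2(n+3)}=\frac{1}{6(n+2)}.$$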
Proof of Theorem 8.
Consider the following Bayesian setting: First, we draw $\theta$ uniformly at random from $[\frac14,\frac34]$. Then, we generate the sample path $X^n$ of a stationary (uniform) Markov chain with transition matrix $M_\theta$ as defined in (34). Define
$$A_t\triangleq\big\{x^n:\,x_1=\cdots=x_t=1,\ x_s\ne1\ \text{for all}\ s>t\big\} \qquad (37)$$
Let $A\triangleq\bigcup_tA_t$. Then
(38) |
where $N$ denotes the number of transitions from state 2 to 3 or from 3 to 2. Then
(39) |
and hence
(40) | ||||
Consider the Bayes estimator $\hat\theta\triangleq\mathbb{E}[\theta\mid X^n]$ (for estimating $\theta$ under the mean-squared error).
For , using (38) we have
where the last step follows from Lemma 9. From (38), we conclude that conditioned on and on , , where . Applying Lemma 9 (with and ), we get
Finally, note that conditioned on $A$, the distribution of the sojourn time at state 1 is close to uniform. Indeed, from (39) and (40) we get
Thus
(41) |
Finally, we relate (41) formally to the minimax prediction risk under the KL divergence. Consider any predictor $\hat M(\cdot\mid i)$ (as a function of the sample path $X^n$) for the $i$th row of $M_\theta$, $i\in\{2,3\}$. By Pinsker's inequality, we conclude that
(42) |
and similarly, . Abbreviate and , both functions of . Taking expectations over both and , the Bayes prediction risk can be bounded as follows
∎
3.2 $k$-state chains
The lower bound construction for three-state chains in Section 3.1 can be generalized to $k$ states. The high-level argument is again to augment a $(k-1)$-state chain into a $k$-state chain. Specifically, we partition the state space into two sets $\{1\}$ and $S=\{2,\dots,k\}$. Consider a $k$-state Markov chain such that the transition probabilities from $1$ to $S$, and from $S$ to $1$, are both very small (on the order of $\frac1n$). At state $1$, the chain either stays at $1$ with probability $1-\frac{k-1}{n}$ or moves to one of the states in $S$ with equal probability $\frac1n$; at each state in $S$, the chain moves to $1$ with probability $\frac1n$; otherwise, within the state subspace $S$, the chain evolves according to some symmetric transition matrix $Q$. (See Fig. 2 in Section 3.2.1 for the precise transition diagram.)
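The following sketch (ours; the choice $\delta=\frac1n$ and the uniform within-$S$ dynamics are illustrative stand-ins for the exact construction (44)) builds such an augmented transition matrix and confirms it is stochastic with an $O(\frac1n)$ spectral gap; the sticky state 1 of the text is indexed 0 in the code:

```python
import numpy as np

def augment(Q, n):
    """Embed a symmetric (k-1)-state chain Q into k states by adding a
    'sticky' state (index 0) with O(1/n) entry/exit probabilities."""
    km1 = Q.shape[0]
    delta = 1.0 / n
    M = np.zeros((km1 + 1, km1 + 1))
    M[0, 0] = 1 - km1 * delta          # stay stuck at the new state
    M[0, 1:] = delta                   # rare exit, uniform over S
    M[1:, 0] = delta                   # rare return to the new state
    M[1:, 1:] = (1 - delta) * Q        # slowed-down original chain
    return M

n, km1 = 1000, 4
Q = np.full((km1, km1), 1 / km1)       # toy symmetric chain on S
M = augment(Q, n)
assert np.allclose(M.sum(axis=1), 1)
eig = np.sort(np.abs(np.linalg.eigvals(M)))[::-1]
print("absolute spectral gap ~", 1 - eig[1])   # O(1/n), as in Section 3
```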
The key feature of such a chain is as follows. Let $E_t$ be the event that $X_1=\cdots=X_t=1$ and $X_s\ne1$ for all $s>t$. For each $t$, one can show that $\mathbb{P}[E_t]\ge\frac{c}{n}$ for some absolute constant $c>0$. Moreover, conditioned on the event $E_t$, $X_{t+1}^n$ is equal in law to a stationary Markov chain on state space $S$ with symmetric transition matrix $Q$. It is not hard to show that estimating $M$ and estimating $Q$ are nearly equivalent. Consider the Bayesian setting where $Q$ is drawn from some prior. We have
where the last equality follows from the representation (20) of Bayes prediction risk as conditional mutual information. Lower bounding the minimax risk by the Bayes risk, we have
(43) |
Note that $H(\mathbb{1}_{E_t})\le\log2$ since $\mathbb{1}_{E_t}$ takes values in $\{0,1\}$. Maximizing the right-hand side over the prior and recalling the dual representation for redundancy in (17), the above inequality (43) leads to a risk lower bound of $\Omega\big(\frac1n\mathsf{Red}^{\mathsf{sym}}_{k-1,n}\big)$, where $\mathsf{Red}^{\mathsf{sym}}_{k-1,n}$ is the redundancy for symmetric Markov chains with $k-1$ states and sample size $n$. Since symmetric transition matrices still have $\Theta(k^2)$ degrees of freedom, it is expected that $\mathsf{Red}^{\mathsf{sym}}_{k-1,n}\asymp k^2\log\frac{n}{k^2}$ for $k\lesssim\sqrt n$, so that (43) yields the desired lower bound in Theorem 1.
Next we rigorously carry out the lower bound proof sketched above: In Section 3.2.1, we explicitly construct the $k$-state chain which satisfies the desired properties in Section 3.2. In Section 3.2.2, we make the steps in (43) precise and bound the Bayes risk from below by an appropriate mutual information. In Section 3.2.3, we choose a prior distribution on the transition probabilities and prove a lower bound on the resulting mutual information, thereby completing the proof of Theorem 1, with the added bonus that the construction is restricted to irreducible and reversible chains.
3.2.1 Construction of the $k$-state chain
We construct a $k$-state chain with the following transition probability matrix:
(44) |
where $Q$ is a symmetric stochastic matrix on $S$ to be chosen later. The transition diagram of $M$ is shown in Figure 2. One can also verify that the spectral gap of $M$ is $O(\frac1n)$.
Let $X^n$ be the trajectory of a stationary Markov chain with transition matrix $M$. We observe the following properties:
-
(P1)
This Markov chain is irreducible and reversible, with stationary distribution ;
-
(P2)
For each $t$, let $E_t$ denote the collection of trajectories such that $x_1=\cdots=x_t=1$ and $x_s\ne1$ for all $s>t$. Then
(45) Moreover, this probability does not depend on the choice of $Q$;
-
(P3)
Conditioned on the event $E_t$, the trajectory $X_{t+1}^n$ has the same distribution as a length-$(n-t)$ trajectory of a stationary Markov chain with state space $S$, transition probability $Q$, and the uniform initial distribution. Indeed,
3.2.2 Reducing the Bayes prediction risk to redundancy
Let $\mathcal{Q}_{\mathsf{sym}}$ be the collection of all symmetric transition matrices on the state space $S$. Consider a Bayesian setting where the transition matrix $M$ is constructed as in (44) and the submatrix $Q$ is drawn from an arbitrary prior on $\mathcal{Q}_{\mathsf{sym}}$. The following lemma lower bounds the Bayes prediction risk.
Lemma 10.
Conditioned on $Q$, let $\tilde X^n$ denote a stationary Markov chain on state space $S$ with transition matrix $Q$ and uniform initial distribution. Then, for an absolute constant $c>0$,
$$\inf_{\hat M}\mathbb{E}\,D\big(M(\cdot\mid X_n)\,\big\|\,\hat M(\cdot\mid X^n)\big)\ \ge\ \frac{c}{n}\Big(I\big(Q;\tilde X^n\big)-\log2\Big).$$
Lemma 10 is the formal statement of the inequality (3.2) presented in the proof sketch. Maximizing the lower bound over the prior on and in view of the mutual information representation (17), we obtain the following corollary.
Corollary 11.
Let $\mathsf{Risk}^{\mathsf{rev}}_{k,n}$ denote the minimax prediction risk for stationary irreducible and reversible Markov chains on $k$ states and $\mathsf{Red}^{\mathsf{sym}}_{k-1,n}$ the redundancy for stationary symmetric Markov chains on $k-1$ states. Then
$$\mathsf{Risk}^{\mathsf{rev}}_{k,n}\ \ge\ \frac{c}{n}\Big(\mathsf{Red}^{\mathsf{sym}}_{k-1,n}-\log2\Big).$$
Proof of Lemma 10.
Recall that in the Bayesian setting, we first draw $Q$ from some prior on $\mathcal{Q}_{\mathsf{sym}}$, then generate the stationary Markov chain $X^n$ with state space $[k]$ and transition matrix $M$ in (44), and $\tilde X^n$ with state space $S$ and transition matrix $Q$.
We first relate the Bayes estimators of $M$ and $Q$ (given the $X$ and $\tilde X$ chains, respectively). For clarity, we spell out the explicit dependence of the estimators on the input trajectory. Denote by $\hat M^{\mathsf{Bayes}}(\cdot\,;x^n)$ the Bayes estimator of $M$ given $X^n=x^n$, and by $\hat Q^{\mathsf{Bayes}}(\cdot\,;\tilde x^m)$ the Bayes estimator of $Q$ given $\tilde X^m=\tilde x^m$. For each $t$ and each trajectory $x^n\in E_t$, recalling the form (21) of the Bayes estimator, we have, for each state $j\in S$,
where we used the stationary distribution of $M$ in (P1) and the uniformity of the stationary distribution of $Q$, neither of which depends on $Q$. Furthermore, by the construction (44), the first row of $M$ is deterministic. In all, we have
(46) |
with $\delta_1$ denoting the point mass at state 1, which parallels the fact that
(47) |
By (P2), each event $E_t$ occurs with probability at least $\frac{c}{n}$, and is independent of $Q$. Therefore,
(48) |
By (P3), the conditional joint law of $(Q,X_{t+1}^n)$ given the event $E_t$ is the same as the joint law of $(Q,\tilde X^{n-t})$. Thus, we may express the Bayes prediction risk in the $\tilde X$ chain as
(49) |
where (a) follows from (46), (47), and the fact that for distributions supported on $S$, the KL divergence is unchanged by viewing them as distributions on $[k]$ with zero mass at state 1; (b) is the mutual information representation (20) of the Bayes prediction risk. Finally, the lemma follows from (48), (49), and the chain rule
as . ∎
3.2.3 Prior construction and lower bounding the mutual information
In view of Lemma 10, it remains to find a prior on $\mathcal{Q}_{\mathsf{sym}}$ such that the mutual information $I(Q;\tilde X^n)$ is large. We make use of the connection identified in [DMPW81, Dav83, Ris84] between estimation error and mutual information (see also [CS04, Theorem 7.1] for a self-contained exposition). To lower bound the mutual information, a key step is to find a good estimator of $Q$. This is carried out in the following lemma.
Lemma 12.
In the setting of Lemma 10, suppose that with for all . Then there is an estimator based on such that
where $\|\cdot\|_{\mathrm{F}}$ denotes the Frobenius norm.
We show how Lemma 12 leads to the desired lower bound on the mutual information . Since , we may assume that is an even integer. Consider the following prior distribution on : let be iid and uniformly distributed in , and for . Let the transition matrix be given by
(50) |
It is easy to verify that each such $Q$ is symmetric and stochastic, and each entry of $Q$ is supported in an interval bounded away from zero. Hence the condition of Lemma 12 is fulfilled, so there exist estimators such that
(51) |
Here and below, we identify and as -dimensional vectors.
Let $h(\cdot)$ denote the differential entropy of a continuous random vector with density w.r.t. the Lebesgue measure, and $h(\cdot\mid\cdot)$ the conditional differential entropy (cf. e.g. [CT06]). Then
(52) |
Then
where (a) is because $Q$ and the $U_{ij}$'s are in one-to-one correspondence by (50); (b) follows from the data processing inequality; (c) is because differential entropy is translation invariant and concave; (d) follows from the maximum entropy principle [CT06], by which the (conditional) differential entropy is bounded in terms of the second moment, which in turn is bounded by (51). Plugging this lower bound into Lemma 10 completes the lower bound proof of Theorem 1.
Proof of Lemma 12.
Since $Q$ is symmetric, the stationary distribution is uniform, and there is a one-to-one correspondence between the joint distribution of two consecutive states and the transition probabilities. Motivated by this observation, consider the following estimator: for each pair of states, let
Clearly . The following variance bound is shown in [TJW18, Lemma 7, Lemma 8] using the concentration inequality of [Pau15]:
where $\gamma^*(Q)$ is the absolute spectral gap of $Q$ defined in (8). Note that $Q=\frac{1}{k-1}J+\Delta$, where $J$ is the all-one matrix and each entry of $\Delta$ lies in a small interval around zero. Thus the spectral radius of $\Delta$ is bounded away from 1, and hence $\gamma^*(Q)=\Omega(1)$. Consequently, we have
completing the proof. ∎
4 Spectral gap-dependent risk bounds
4.1 Two states
To show Theorem 2, let us prove a refined version. In addition to the absolute spectral gap $\gamma^*$ defined in (8), define the spectral gap
$$\gamma \triangleq 1-\lambda_2 \qquad (53)$$
and define $\mathcal{M}'_k(\gamma)$ as the collection of transition matrices whose spectral gap exceeds $\gamma$. Paralleling $\mathsf{Risk}_{k,n}(\gamma^*)$ defined in (9), define $\mathsf{Risk}'_{k,n}(\gamma)$ as the minimax prediction risk restricted to $\mathcal{M}'_k(\gamma)$. Since $\gamma\ge\gamma^*$, we have $\mathcal{M}_k(\gamma_0)\subseteq\mathcal{M}'_k(\gamma_0)$ and hence $\mathsf{Risk}_{k,n}(\gamma_0)\le\mathsf{Risk}'_{k,n}(\gamma_0)$. Nevertheless, the next result shows that for $k=2$ they have the same rate:
Theorem 13 (Spectral gap dependent rates for binary chain).
For any $\gamma\in\big[\frac1n,\frac12\big]$,
$$\mathsf{Risk}_{2,n}(\gamma)\;\asymp\;\mathsf{Risk}'_{2,n}(\gamma)\;\asymp\;\frac1n\Big(1+\log\log\frac1\gamma\Big).$$
We first prove the upper bound on $\mathsf{Risk}'_{2,n}(\gamma)$. Note that it is enough to show
$$\mathsf{Risk}'_{2,n}(\gamma)\;\lesssim\;\frac1n\Big(1+\log\log\frac1\gamma\Big)\qquad\text{for all }\frac1n\le\gamma\le\frac12 \qquad (54)$$
Indeed, for any $\gamma\le\frac1n$, the upper bound $O(\frac{\log\log n}{n})$ proven in [FOPS16], which does not depend on the spectral gap, suffices; for any $\gamma\ge\frac12$, by monotonicity we can use the upper bound at $\gamma=\frac12$.
We now define an estimator that achieves (54). Following [FOPS16], consider trajectories with a single transition, namely, sequences of the form $1^t2^{n-t}$ or $2^t1^{n-t}$, where $a^tb^{n-t}$ denotes the trajectory with $x_1=\cdots=x_t=a$ and $x_{t+1}=\cdots=x_n=b$. We refer to this type of trajectory as a step sequence. For all non-step sequences, we apply an add-$c$ estimator similar to (5), namely
where the empirical counts and are defined in (4); for step sequences of the form , we estimate by
(55) |
The other type of step sequences is dealt with by symmetry.
Due to symmetry it suffices to analyze the risk for sequences ending in 1. The risk of the add-$c$ estimator for a non-step sequence is bounded as
where the last step follows by using with and . From [FOPS16, Lemmas 7 and 8] we have that the total risk of the other non-step sequences is bounded from above by and hence it is enough to analyze the risk for step sequences, and further, by symmetry, those in . The desired upper bound (54) then follows from Lemma 14 below.
Lemma 14.
For any $\frac1n\le\gamma\le\frac12$, the estimator in (55) satisfies
Proof.
For each using with ,
(56) |
where we define . Recall Chebyshev's sum inequality: for $a_1\ge a_2\ge\cdots\ge a_m$ and $b_1\ge b_2\ge\cdots\ge b_m$, it holds that
$$\frac1m\sum_{i=1}^ma_ib_i\ \ge\ \Big(\frac1m\sum_{i=1}^ma_i\Big)\Big(\frac1m\sum_{i=1}^mb_i\Big).$$
The following inequalities are thus direct corollaries: for ,
(57) | ||||
(58) |
where in (58) we need to verify that is non-increasing. To verify it, w.l.o.g. we may assume that , and therefore
Therefore,
(59) |
where (a) is due to (56), and (b) follows from (57) and (58) applied to . To deal with the remaining sum, we distinguish two cases. Sticking to the above definitions of and , if , then
where the last step used that for . If , notice that for a two-state chain with transition probabilities $p=M(2\mid1)$ and $q=M(1\mid2)$, the spectral gap is given explicitly by $\gamma=p+q$, so that the assumption implies that . In this case,
thanks to the assumption . Therefore, in both cases, the first term in (59) is , as desired. ∎
Next we prove the lower bound on $\mathsf{Risk}_{2,n}(\gamma^*)$. It is enough to consider $\gamma^*$ below a small constant. Indeed, for constant $\gamma^*$, we can apply the result in the iid setting (see, e.g., [BFSS02]), in which the absolute spectral gap is 1, to obtain the usual parametric-rate lower bound $\Omega(\frac1n)$; for $\gamma^*\le\frac1n$, we simply bound $\mathsf{Risk}_{2,n}(\gamma^*)$ from below by $\mathsf{Risk}_{2,n}(\frac1n)$. Define
(60) |
and consider the prior distribution
(61) |
Then the lower bound part of Theorem 2 follows from the next lemma.
Lemma 15.
Assume that . Then
-
(i)
for each ;
-
(ii)
the Bayes risk with respect to the prior is at least .
Proof.
Part (i) follows by noting that the absolute spectral gap of any two-state transition matrix with off-diagonal entries $p$ and $q$ is $1-|1-p-q|$, and for any , which guarantees
To show part (ii) we lower bound the Bayes risk when the observed trajectory is a step sequence in . Our argument closely follows that of [HOP18, Theorem 1]. Since , for each , the corresponding stationary distribution satisfies
Denote by the Bayes risk with respect to the prior and by the Bayes estimator for prior given . Note that
(62) |
Then
(63) |
Recalling the general form of the Bayes estimator in (21) and in view of (62), we get
(64) |
Plugging (64) into (63), and using
we arrive at the following lower bound for the Bayes risk:
(65) |
Under the prior , with .
We further lower bound (65) by summing over an appropriate range of . For any , define
Since , our choice of ensures that the intervals are disjoint. We will establish the following claim: for all and , it holds that
(66) |
We first complete the proof of the Bayes risk bound assuming (66). Using (65) and (66), we have
with (a) following from if , and .
We bound each of the terms individually. Clearly, and . Thus it suffices to show that and , for and . Indeed,
-
•
For , we have
where in (a) we use the inequality for . Consequently, ;
-
•
For , we have
where (b) follows from and the definition of . Consequently,
where the last step uses the definition of in (60);
-
•
, since .
Combining the above bounds completes the proof of (66). ∎
4.2 $k$ states
4.2.1 Proof of Theorem 3 (i)
Notice that the prediction problem consists of $k$ sub-problems of estimating the individual rows of $M$, so it suffices to show that the contribution from each of them is $O(\frac kn)$. In particular, assuming the chain terminates in state 1, we bound the risk of estimating the first row by the add-one estimator $\hat M^{+1}(\cdot\mid1)$. Under the absolute spectral gap condition of Theorem 3(i), we show
$$\mathbb{E}\Big[D\big(M(\cdot\mid1)\,\big\|\,\hat M^{+1}(\cdot\mid1)\big)\,\mathbb{1}\{X_n=1\}\Big]\ \lesssim\ \frac{k}{n} \qquad (67)$$
By symmetry, we get the desired bound $O(\frac{k^2}{n})$. The basic steps of our analysis are as follows:
-
•
When $N_1$ is substantially smaller than its mean, we can bound the risk using the worst-case risk bound for add-one estimators and the probability of this rare event.
-
•
Otherwise, we decompose the prediction risk as
We then analyze each term depending on whether $N_{1j}$ is typical or not. Unless $N_{1j}$ is atypically small, the add-one estimator works well, and its risk can be bounded quadratically.
To analyze the concentration of the empirical counts we use the following moment bounds. The proofs are deferred to Appendix B.
Lemma 16.
Finite reversible and irreducible chains satisfy the following moment bounds:
-
(i)
-
(ii)
-
(iii)
When the spectral gap is large, this shows that the moments behave as if, for each $i,j$, $N_i$ were approximately $\mathrm{Bin}(n,\pi_i)$ and $N_{ij}$ approximately $\mathrm{Bin}(n,\pi_iM_{ij})$, which is the case for iid sampling. For iid models, [KOPS15] showed that the add-one estimator achieves the risk bound $O(\frac kn)$ per row, which is what we aim for here as well. In addition, the dependency of the above moments on the spectral gap gives rise to sufficient conditions that guarantee the parametric rate. The technical details are given below.
We decompose the left-hand side of (67) based on $N_1$ as
where the typical set and atypical set are defined as
For the atypical case, note the following deterministic property of the add-one estimator. Let $\hat P^{+1}$ be an add-one estimator with sample size $m$ and alphabet size $k$, of the form $\hat P^{+1}(j)=\frac{N_j+1}{m+k}$, where $N_j\ge0$ and $\sum_jN_j=m$. Since $\hat P^{+1}$ is bounded below by $\frac{1}{m+k}$ everywhere, for any distribution $P$, we have
$$D\big(P\,\big\|\,\hat P^{+1}\big)\ \le\ \log(m+k) \qquad (68)$$
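The one-line proof of (68) (ours): since $\hat P^{+1}(j)\ge\frac{1}{m+k}$ for every $j$ and $P(j)\le1$,
$$D\big(P\,\big\|\,\hat P^{+1}\big)=\sum_jP(j)\log\frac{P(j)}{\hat P^{+1}(j)}\ \le\ \sum_jP(j)\log(m+k)\ =\ \log(m+k).$$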
Applying this bound on the event , we have
(69) | |||
(70) |
where (a) follows from Markov's inequality, (b) from Lemma 16(iii), and (c) uses .
Next we bound . Define
As it suffices to bound for each . For a threshold to be optimized later, consider the following cases separately:
Case (a) or :
Case(b) and :
We decompose based on count of into atypical part and typical part
and bound each of and separately.
Bound on
Using and in we get
(74) |
where the last inequality follows since . Note that for any event $E$ and any function $f$,
Applying this identity with , we can bound the expectation term in (74) as
(75) |
where the last inequality uses for all . Using Markov's inequality for , Lemma 16(ii), and with
In view of above continuing (75) we get
where (a) follows using for , and (b) follows since and . In view of (74) this implies
(76) |
Bound on
Using the inequality (71)
where (a) follows using properties of the set along with . Using Lemma 16(i) we get
Summing up the last bound over and using we get for
Combining this with (73) we obtain
where we chose for the last inequality. In view of (70) this implies the required bound.
Remark 4.
We explain the subtlety of the concentration bound in Lemma 16 based on the fourth moment, and why existing Chernoff bounds or Chebyshev's inequality fall short. For example, the risk bound in (70) relies on bounding the probability that $N_1$ is atypically small. To this end, one may use the classical Chernoff-type inequality for reversible chains (see [Lez98, Theorem 1.1] or [Pau15, Proposition 3.10 and Theorem 3.3]):
(77) |
in contrast, the fourth moment bound in (69) yields a polynomial tail. Although the exponential tail in (77) is much better, the pre-factor $\frac{1}{\sqrt{\pi_{\min}}}$, due to conditioning on the initial state, can lead to a suboptimal result when $\pi_{\min}$ is small. (As a concrete example, for a two-state chain with one very light state, (77) leads to a bound that falls short of the desired rate.)
In the same context it is also insufficient to use a second-moment (Chebyshev) bound. That bound is too loose and, upon substitution into (69), results in an extra factor in the final risk bound when $k$ and $n$ are large.
4.2.2 Proof of Theorem 3 (ii)
Let and . We prove a stronger result using the spectral gap $\gamma$ as opposed to the absolute spectral gap $\gamma^*$. Fix $M$ such that $\gamma(M)\ge\gamma$. Denote its stationary distribution by $\pi$. For absolute constants to be chosen later and as in Lemma 17 below, define
(78) |
Let $N_i$ be the number of visits to state $i$ as in (4). We bound the risk by accounting for the contributions from different ranges of $N_i$ and $\pi_i$ separately:
(79) |
where the first inequality uses the worst-case bound (68) for add-one estimator. We analyze the terms separately as follows.
For the second term, given any such that , we have, by definition in (78), and , which implies
(80) |
For the third term, applying [HJL+18, Lemma 16] (which, in turn, is based on the Bernstein inequality in [Pau15]), we get .
To bound the first term in (79), we follow the method in [Bil61, HJL+18] of representing the sample path of the Markov chain using independent samples generated from the rows of $M$, which we describe below. Consider a random variable $X_1\sim\pi$ and an array $W=(W_{i,t})_{i\in[k],t\in[n]}$ of independent random variables, such that $X_1$ and $W$ are independent and $W_{i,t}\sim M(\cdot\mid i)$ for each $i$ and $t$. Starting with $X_1$ generated from $\pi$, at every step we set the next state to be the first element in the row of $W$ indexed by the current state that has not been sampled yet. Then one can verify that the resulting process is a Markov chain with initial distribution $\pi$ and transition matrix $M$. Furthermore, the transition counts satisfy $N_{ij}=\sum_{t=1}^{T_i}\mathbb{1}\{W_{i,t}=j\}$, where $T_i$ is the number of elements sampled from the $i$th row of $W$. Note that conditioned on $T_i$, the random variables $W_{i,1},\dots,W_{i,T_i}$ are no longer iid. Instead, we apply a union bound. Note that for each fixed $m$, the estimator
is an add-one estimator for $M(\cdot\mid i)$ based on an iid sample of size $m$. Lemma 17 below provides a high-probability bound for the add-one estimator in this iid setting. Using this result and the union bound, we have
where the second inequality applies Lemma 17 with and uses for .
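A sketch of the simulation device just described (ours; variable names are illustrative): pre-generate for each state a row of iid draws from that state's transition row, and realize the chain by consuming each row left to right. The path has exactly the law of the Markov chain, while each full row remains an iid sample.

```python
import numpy as np

def simulate_via_iid_rows(M, pi, n, rng):
    """Billingsley-type construction: realize a Markov chain from an
    array W of iid samples, W[i, :] ~ M(. | i) independently."""
    k = M.shape[0]
    W = np.stack([rng.choice(k, size=n, p=M[i]) for i in range(k)])
    used = np.zeros(k, dtype=int)       # entries of row i consumed so far
    x = [rng.choice(k, p=pi)]
    for _ in range(n - 1):
        i = x[-1]
        x.append(W[i, used[i]])         # first unused element of row i
        used[i] += 1
    return np.array(x), W, used

rng = np.random.default_rng(2)
M = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([2 / 3, 1 / 3])           # stationary for this M
x, W, used = simulate_via_iid_rows(M, pi, 200, rng)
# transition counts from the path match counts read off the iid rows
for i in range(2):
    path = [int(np.sum((x[:-1] == i) & (x[1:] == j))) for j in range(2)]
    rows = [int(np.sum(W[i, :used[i]] == j)) for j in range(2)]
    assert path == rows
```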
Lemma 17 (KL risk bound for add-one estimator).
Let $X_1,\dots,X_m\stackrel{\mathrm{iid}}{\sim}P$ for some distribution $P$ on $[k]$. Consider the add-one estimator $\hat P^{+1}(j)=\frac{N_j+1}{m+k}$ with $N_j=\sum_{t=1}^m\mathbb{1}\{X_t=j\}$. There exists an absolute constant $C$ such that for any $\delta\in(0,1)$,
$$\mathbb{P}\left[D\big(P\,\big\|\,\hat P^{+1}\big)\ \ge\ \frac{C\big(k+\log\frac1\delta\big)}{m}\right]\ \le\ \delta.$$
Proof.
Let $\hat P^{\mathsf{emp}}$ be the empirical estimator $\hat P^{\mathsf{emp}}(j)=\frac{N_j}{m}$. Then and hence
(81) |
with last equality following by .
To control the sum in the above display it suffices to consider its Poissonized version. Specifically, we aim to show
(82) |
where the $\tilde N_j$ are independently distributed as $\mathrm{Poi}(mP(j))$. (Here and below $\mathrm{Poi}(\lambda)$ denotes the Poisson distribution with mean $\lambda$.) To see why (82) implies the desired result, letting and , we have
(83) |
where (a) follows from the fact that, conditioned on their sum, independent Poisson random variables follow a multinomial distribution; (b) applies (82); (c) follows from Stirling's approximation.
To prove (82) we rely on concentration inequalities for sub-exponential distributions. A random variable $X$ is called sub-exponential with parameters $(\nu,b)$, denoted by $X\sim\mathsf{SE}(\nu,b)$, if
$$\mathbb{E}\big[e^{\lambda(X-\mathbb{E}X)}\big]\ \le\ e^{\nu^2\lambda^2/2},\qquad\forall\,|\lambda|\le\frac1b \qquad (84)$$
Sub-exponential random variables satisfy the following properties [Wai19, Sec. 2.1.3]:
-
•
If $X\sim\mathsf{SE}(\nu,b)$, then for any $t\ge0$,
$$\mathbb{P}[X-\mathbb{E}X\ge t]\ \le\ \exp\Big(-\tfrac12\min\Big\{\tfrac{t^2}{\nu^2},\,\tfrac tb\Big\}\Big) \qquad (85)$$
-
•
Bernstein condition: A random variable $X$ is $\mathsf{SE}(\sqrt2\sigma,2b)$ if it satisfies
$$\big|\mathbb{E}\big[(X-\mathbb{E}X)^\ell\big]\big|\ \le\ \tfrac12\,\ell!\,\sigma^2b^{\ell-2},\qquad \ell=2,3,\dots \qquad (86)$$
-
•
If $X_1,\dots,X_r$ are independent with $X_i\sim\mathsf{SE}(\nu_i,b_i)$, then $\sum_{i=1}^rX_i$ is $\mathsf{SE}\big(\sqrt{\sum_i\nu_i^2},\,\max_ib_i\big)$.
Define Then Lemma 18 below shows that the summands are independent and sub-exponential with absolute constant parameters, and hence their sum is sub-exponential as well. In view of (85), for an appropriate choice of the deviation this implies
(87) |
Using and
Lemma 18.
There exist absolute constants such that the following holds. For any and , is .
Proof.
Note that is a non-negative random variable. Since , by the Bernstein condition (86), it suffices to show for some absolute constant , which guarantees the desired sub-exponential behavior. The analysis is divided into the following two cases, for some absolute constant .
Case I :
Using the Chernoff bound for the Poisson distribution [Jan02, Theorem 3]
(88) |
we get
(89) |
which implies with probability at least . Since , we get
Case II :
-
•
On the event , we have , where the last inequality follows because takes non-negative integer values. Since , we have for any . Using the Chernoff bound (88), we get with probability at least , which implies
for an absolute constant . Here, the last inequality follows from Cauchy-Schwarz and the Poisson moment bound of [Ahl21, Theorem 2.1] (for a result with less precise constants, see also [Ahl21, Eq. (1)], based on [Lat97, Corollary 1]): for some absolute constant , with the second inequality applying the assumption .
-
•
As we get for some absolute constant .
∎
4.2.3 Proof of Corollary 4
We show the following monotonicity result for the prediction risk. In view of this result, Corollary 4 immediately follows from Theorem 2 and Theorem 3(i). Intuitively, the optimal prediction risk is monotonically increasing in the number of states; this, however, does not follow immediately due to the extra assumptions of irreducibility, reversibility, and prescribed spectral gap.
Lemma 19.
$\mathsf{Risk}_{k,n}(\gamma_0)\le\mathsf{Risk}_{k+1,n}(\gamma_0)$ for all $k\ge2$, $n\ge1$, and $\gamma_0\in(0,1)$.
Proof.
Fix an $M$ such that . Denote its stationary distribution by . Fix and define a transition matrix with $k+1$ states as follows:
One can verify the following:
-
•
is irreducible and reversible;
-
•
The stationary distribution for is
-
•
The absolute spectral gap of is , so that for all sufficiently small .
-
•
Let and be stationary Markov chains with transition matrices and , respectively. Then as , converges to in law, i.e., the joint probability mass function converges pointwise.
Next fix any estimator $\hat M$ for the $(k+1)$-state problem. Note that without loss of generality we can assume all entries of $\hat M$ are strictly positive, for otherwise the KL risk is infinite. Define as $\hat M$ without the $(k+1)$-th row and column, and denote by its normalized version, namely, for . Then
where in the first step we applied the convergence in law of to and the continuity of for fixed componentwise positive ; in the second step we used the fact that for any sub-probability measure and its normalized version with , we have . Taking the supremum over on the LHS and the supremum over on the RHS, and finally the infimum over on the LHS, we conclude . ∎
5 Higher-order Markov chains
5.1 Basic setups
In this section we prove Theorem 5. We start with some basic definitions for higher-order Markov chains. Let $m\in\mathbb{N}$. Let $\{X_t\}$ be an $m$-th order Markov chain with state space $[k]$ and transition matrix $M$, so that $\mathbb{P}[X_{t+1}=j\mid X_1^t]=\mathbb{P}[X_{t+1}=j\mid X_{t-m+1}^t]=M(j\mid X_{t-m+1}^t)$ for all $t\ge m$. Clearly, the joint distribution of the process is specified by the transition matrix and the initial distribution, which is a joint distribution for $X_1^m$.
A distribution $\pi$ on $[k]^m$ is a stationary distribution if the chain initialized with $X_1^m\sim\pi$ is a stationary process, that is,
(90) |
It is clear that (90) is equivalent to . In other words, is the solution to the linear system:
(91) |
Note that this implies, in particular, that $\pi$ as a joint distribution of $m$-tuples must itself satisfy the symmetry properties required by stationarity, such as all marginals being identical, etc.
Next we discuss reversibility. A random process is reversible if for any $n$,
(92) |
where denotes the time reversal of . Note that a reversible $m$-th order Markov chain must be stationary. Indeed,
(93) |
where the first equality follows from . The following lemma gives a characterization of reversibility:
Lemma 20.
An $m$-th order stationary Markov chain is reversible if and only if (92) holds for $n=m+1$, namely
(94) |
Proof.
First, we show that (92) for $n=m+1$ implies (92) for all $n$. Indeed,
(95) |
where the first equality follows from and the second applies stationarity.
In view of the proof of (93), we note that any distribution on $[k]^m$ and $m$-th order transition matrix satisfying and (94) also satisfy (91). This implies that such a distribution is a stationary distribution for the chain. In view of Lemma 20, the above conditions also guarantee reversibility. This observation can be summarized in the following lemma, which will be used to prove the reversibility of specific Markov chains later.
Lemma 21.
Let $M$ be a stochastic matrix describing transitions from $[k]^m$ to $[k]$. Suppose that $\pi$ is a distribution on $[k]^m$ such that and . Then $\pi$ is the stationary distribution of $M$ and the resulting chain is reversible.
For $m$-th order stationary Markov chains, the optimal prediction risk is defined as
$$\mathsf{Risk}^{(m)}_{k,n} \triangleq \inf_{\hat M}\sup_{M}\,\mathbb{E}\,D\big(M(\cdot\mid X_{n-m+1}^n)\,\big\|\,\hat M(\cdot\mid X^n)\big) \qquad (96)$$
where the supremum is taken over all stochastic matrices and the trajectory is initiated from the stationary distribution. In the remainder of this section we will show the following result, completing the proof of Theorem 5 previously announced in Section 1.
Theorem 22.
For all $m\ge1$ and $k\ge2$ with $(m,k)\ne(1,2)$, there exists a constant $C$ depending only on $m$ such that for all $n\ge Ck^{m+1}$,
$$\mathsf{Risk}^{(m)}_{k,n}\ \asymp\ \frac{k^{m+1}}{n}\log\frac{n}{k^{m+1}}.$$
Furthermore, the lower bound holds even when the Markov chains are required to be reversible.
5.2 Upper bound
We prove the upper bound part of the preceding theorem, using only stationarity (not reversibility). We rely on techniques from [CS04, Chapter 6, Page 486] for proving redundancy bounds for $m$-th order chains. Let $Q_{X^{n+1}}$ be the probability assignment given by
(97) |
where $N(a^mb)$ denotes the number of times the block $a^mb$ occurs in $x^{n+1}$, and $N(a^m)$ is the number of times the block $a^m$ occurs in $x^n$. This probability assignment corresponds to the add-one rule
$$\hat M^{+1}\big(b\mid a^m\big)=\frac{N(a^mb)+1}{N(a^m)+k} \qquad (98)$$
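A minimal sketch (ours; names are illustrative) of the $m$-th order add-one rule (98), keyed by length-$m$ contexts:

```python
import numpy as np
from collections import Counter

def add_one_order_m(traj, k, m):
    """m-th order add-one rule (98): estimate P(next | last m states)
    for the context given by the final m symbols of traj."""
    blocks = Counter(tuple(traj[t:t + m + 1]) for t in range(len(traj) - m))
    ctx = tuple(traj[-m:])
    n_ctx = sum(blocks[ctx + (b,)] for b in range(k))
    return np.array([(blocks[ctx + (b,)] + 1) / (n_ctx + k)
                     for b in range(k)])

# toy usage on a binary sequence with second-order structure
x = [0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0]
print(add_one_order_m(x, k=2, m=2))   # estimate of P(. | (1, 0))
```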
Then, in view of Lemma 6, the following lemma proves the desired upper bound in Theorem 22.
Lemma 23.
Let $\mathsf{Red}^{(m)}_{k,n}$ be the redundancy of the $m$-th order Markov model, as defined in Section 2.1, and $x^{n+1}$ the corresponding observed trajectory. Then
$$\mathsf{Red}^{(m)}_{k,n+1}\ \le\ \mathsf{Red}\big(Q_{X^{n+1}}\big)\ \lesssim\ k^{m+1}\log\frac{n}{k^{m+1}}.$$
Proof.
We show that for every Markov chain with transition matrix $M$ and initial distribution $\pi$ on $[k]^m$, and every trajectory $x^{n+1}$, it holds that
(99) |
where $M(b\mid a^m)$ is the transition probability of going from $a^m$ to $b$. Note that
where the last inequality follows from for each , by the non-negativity of the KL divergence. Therefore, we have
(100) |
Using (33) we continue (100) to get
where (a) follows from the concavity of , , and . ∎
5.3 Lower bound
5.3.1 Special case: $k=2$, $m=2$
We only analyze the case $k=2$, $m=2$, i.e., second-order Markov chains with binary states, as the lower bound still applies to the case of higher orders. The transition matrix for second-order chains is given by a stochastic matrix that gives the transition probability from an ordered pair of states to the next state:
(101) |
Our result is the following.
Theorem 24.
$\mathsf{Risk}^{(2)}_{2,n}\asymp\frac{\log n}{n}$.
Proof.
The upper bound part has been shown in Lemma 23. For the lower bound, consider the following one-parametric family of transition matrices (we replace by for simplicity of the notation)
(109) |
and place a uniform prior on . One can verify that each has the uniform stationary distribution over the set and the chains are reversible.
Next we introduce the set of trajectories based on which we will lower bound the prediction risk. Analogous to the set defined in (37) for analyzing the first-order chains, we define
(110) |
In other words, the sequences in start with a string of 1’s before transitioning into two consecutive 2’s, contain no two consecutive 1’s thereafter, and finally end with 2.
To compute the probability of sequences in , we need the following preparations. Denote by the operation that combines any two blocks from by merging the last symbol of the first block and the first symbol of the second block; for example, . Then any can be written as the initial all-1 string, followed by alternating runs of blocks from with the first run being of the block 22 (all the runs have positive lengths), combined with the merging operation :
(111) |
Let the vector denote the transition probabilities between blocks in (recall the convention that the two blocks overlap in the symbol 2). Namely, according to (109),
Given any we can calculate its probability under the law of using frequency counts , defined as
Denote . Then for each with we have
(112) |
where denotes the number of times the chain alternates between runs of 22 and runs of 212, and denotes the number of times the chain jumps between blocks in .
Note that the range of includes all the integers between 1 and . This follows from the definition of and the fact that merging either 22 or 212 via the operation at the end of any string with increases the length of the string by at most 2. Also, given any value of , the value of ranges from 0 to .
Lemma 25.
The number of sequences in corresponding to a fixed pair is .
Proof.
Fix and let denote the length of the -th run of 22 blocks and the length of the -th run of 212 blocks in , as depicted in (111). The ’s are all positive integers. There are such runs in total, and the ’s satisfy , as the total number of blocks is one more than the total number of transitions. Each positive integer solution to this equation corresponds to a sequence and vice versa. The total number of such sequences is . ∎
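The count in Lemma 25 is the classical composition count: the number of positive integer solutions of x_1 + ⋯ + x_r = s equals the binomial coefficient C(s−1, r−1). A brute-force sanity check, with illustrative values of r and s:

```python
from itertools import product
from math import comb

def positive_solutions(r, s):
    """Count positive integer solutions of x_1 + ... + x_r = s by enumeration."""
    return sum(1 for xs in product(range(1, s + 1), repeat=r) if sum(xs) == s)

r, s = 4, 9  # illustrative number of runs and total run-length budget
assert positive_solutions(r, s) == comb(s - 1, r - 1)
print(comb(s - 1, r - 1))  # 56
```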
We are now ready to compute the Bayes estimator and risk. For any with a given , the Bayes estimator of with prior is
Note that the probabilities in (5.3.1) can be bounded from below by . Using this, for each with given we get the following bound on the integrated squared error for a particular sequence
(113) |
where the last equality follows by noting that the integral is the variance of a random variable without its normalizing constant.
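For the reader's convenience, the identity behind this variance computation is the standard Beta-integral formula, assuming (as is typical for a uniform prior on a single transition parameter) that the unnormalized posterior is of Beta form with nonnegative integer exponents a and b:
\[
\int_0^1 \theta^{a}(1-\theta)^{b}\,d\theta = \frac{a!\,b!}{(a+b+1)!},
\qquad
\operatorname{Var}(\Theta)=\frac{(a+1)(b+1)}{(a+b+2)^2(a+b+3)}
\quad\text{for }\Theta\sim\mathrm{Beta}(a+1,b+1).
\]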
5.3.2 General case:
We will prove the following.
Theorem 26.
For an absolute constant , we have
For ease of notation let . Denote . Consider an -order transition matrix of the following form:
(130) |
Here is a transition matrix for an -order Markov chain with state space , satisfying the following property:
(P) .
Lemma 27.
Under the condition (P), the transition matrix has a stationary distribution that is uniform on . Furthermore, the resulting -order Markov chain is reversible (and hence stationary).
Proof.
Next we address the stationarity and reversibility of the chain with the bigger transition matrix in (130):
Lemma 28.
Proof.
Note that the choice of guarantees that . We again apply Lemma 21 to verify stationarity and reversibility. First of all, since , we have for all . Next we check the condition . For the sequence the claim is easily verified. For the remaining sequences we have the following.
• Case 1 (): Note that if and only if . This implies
• Case 2 ( or ): By symmetry it is sufficient to analyze the case . Note that in the sub-case , and . This implies
(132) In view of this we get
• Case 3 ():
Suppose that has many elements from . Then . We have the following sub-cases.
– If , then both have exactly elements from . This implies and .
– If , then both have exactly elements from . This implies and .
– If , then has elements from and has elements from . This implies and .
– If , then has elements from and has elements from . This implies and .
For all these sub-cases we have as required.
This finishes the proof. ∎
Let be the trajectory of a stationary Markov chain with transition matrix as in (130). We observe the following properties:
(R1) This Markov chain is irreducible and reversible. Furthermore, the stationary distribution assigns probability to the initial state .
(R2) For , let denote the collection of trajectories such that and . Then, using Lemma 28,
(133) Moreover, this probability does not depend on the choice of ;
(R3) Conditioned on the event that , the trajectory has the same distribution as a length- trajectory of a stationary -order Markov chain with state space and transition probability , and the uniform initial distribution. Indeed,
Reducing the Bayes prediction risk to mutual information
Consider the following Bayesian setting: we first draw from some prior satisfying property (P), then generate the stationary -order Markov chain with state space , transition matrix in (130), and stationary distribution in (131). The following lemma lower bounds the Bayes prediction risk.
Lemma 29.
Conditioned on , let denote an -order stationary Markov chain on state space with transition matrix and uniform initial distribution. Then
Proof.
We first relate the Bayes estimators of and (given the and chains, respectively). For each , denote by the Bayes estimator of given , and the Bayes estimator of given . For each and for each trajectory , recalling the form (21) of the Bayes estimator, we have, for each ,
Furthermore, since for all in the construction (130), the Bayes estimator also satisfies for and . In all, we have
(134) |
with denoting the point mass at state 1, which parallels the fact that
(135) |
By (R2), each event occurs with probability at least , and is independent of . Therefore,
(136) |
By (R3), the conditional joint law of on the event is the same as the joint law of . Thus, we may express the Bayes prediction risk in the chain as
(137) |
where (a) follows from (134), (135), and the fact that for distributions supported on , ; (b) is the mutual information representation (20) of the Bayes prediction risk. Finally, the lemma follows from (5.3.2), (5.3.2), and the chain rule
as . ∎
Prior construction and lower bounding the mutual information
We assume that for some integer . For simplicity of notation we replace by ; this does not affect the lower bound. Define an equivalence relation on by the following rule: and are related if and only if or . Let be a subset of consisting of exactly one representative from each equivalence class. As each equivalence class under this relation has at most two elements, the total number of equivalence classes is at least , i.e., . We consider the following prior: let be iid and uniformly distributed in , and for each define to be the same as . Let the transition matrix be given by
(138) |
One can check that the constructed is a stochastic matrix and satisfies property (P), which enforces the uniform stationary distribution. Moreover, each entry of belongs to the interval .
Next we use the following lemma to derive estimation guarantees on .
Lemma 30.
Suppose that is an transition matrix on state space with , satisfying and for all . Then there is an estimator based on a stationary trajectory simulated from such that
where denotes the Frobenius norm.
For our purposes we will apply the above lemma to with . Therefore it follows that there exist estimators and such that
(139) |
Here and below, we identify and as -dimensional vectors.
Let denote the differential entropy of a continuous random vector with density w.r.t. the Lebesgue measure, and the conditional differential entropy (cf., e.g., [CT06]). Then
(140) |
Then
for a constant , where (a) is because and are in one-to-one correspondence by (5.3.2); (b) follows from the data processing inequality; (c) is because is translation invariant and concave; (d) follows from the maximum entropy principle [CT06]: , which in turn is bounded by (139). Plugging this lower bound into Lemma 29 completes the lower bound proof of Theorem 22.
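To spell out step (d): writing the coordinates of the error vector as W_i, the bound combines subadditivity of differential entropy with the Gaussian maximum-entropy principle from [CT06],
\[
h(W) \;\le\; \sum_i h(W_i) \;\le\; \sum_i \frac{1}{2}\log\bigl(2\pi e\,\mathbb{E}[W_i^2]\bigr),
\]
so the mean-squared error guarantee (139) translates directly into an upper bound on the conditional differential entropy.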
5.3.3 Proof of Lemma 30 via pseudo spectral gap
In view of Lemma 27 we get that the stationary distribution of is uniform over , and there is a one-to-one correspondence between the joint distribution of and the transition probabilities
(141) |
Consider the following estimator : for , let
Clearly . Next we observe that the sequence of random variables is a first-order Markov chain on . Let us denote its transition matrix by and note that its stationary distribution is given by . For the transition matrix , which is in general non-reversible, the pseudo spectral gap is defined as
where is the adjoint of , defined as . With this notation, the concentration inequality of [Pau15, Theorem 3.2] gives the following variance bound:
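For intuition, the pseudo spectral gap of [Pau15] can be approximated numerically. The sketch below truncates the maximum over k at a finite k_max (an approximation) and uses the fact that (P*)^k P^k is self-adjoint with respect to the stationary law, so its eigenvalues are real and lie in [0, 1]; the toy chain is illustrative.

```python
import numpy as np

def pseudo_spectral_gap(P, pi, k_max=20):
    """Approximate gamma_ps = max_k gamma((P*)^k P^k) / k (Paulin),
    truncating the maximum at k_max. Here P*(x, y) = pi(y) P(y, x) / pi(x)."""
    P_adj = (P.T * pi[None, :]) / pi[:, None]  # adjoint of P w.r.t. pi
    best = 0.0
    for k in range(1, k_max + 1):
        M = np.linalg.matrix_power(P_adj, k) @ np.linalg.matrix_power(P, k)
        ev = np.sort(np.real(np.linalg.eigvals(M)))[::-1]  # real spectrum, top eigenvalue 1
        best = max(best, (1.0 - ev[1]) / k)  # spectral gap of M, divided by k
    return best

# toy 3-state chain (not reversible in general)
P = np.array([[0.2, 0.5, 0.3],
              [0.4, 0.2, 0.4],
              [0.3, 0.5, 0.2]])
w, V = np.linalg.eig(P.T)  # stationary law: left Perron eigenvector of P
pi = np.real(V[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()
print(pseudo_spectral_gap(P, pi))
```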
The following lemma bounds the pseudo spectral gap from below.
Lemma 31.
Let be the transition matrix of an -th order Markov chain over a discrete state space with , and assume that
• all the entries of lie in the interval for some absolute constants ;
• has the uniform stationary distribution on .
Let be the transition matrix of the first-order Markov chain . Then we have
Consequently, we have
completing the proof.
Proof of Lemma 31.
As is a first-order Markov chain, the stochastic matrix defines the probabilities of transition from to . By our assumption on
(142) |
Given any , using the above inequality we have
(143) |
(144) |
As is a stochastic matrix, we can use Lemma 32 to lower bound its spectral gap . Hence we get
(145) |
as required. A more general version of Lemma 32 can be found in [Hof67].
Lemma 32.
Suppose that is a stochastic matrix with . Then for any eigenvalue of other than 1 we have .
Proof.
Suppose that is an eigenvalue of other than 1 with non-zero left eigenvector , i.e. . As is a stochastic matrix we know that for all and hence . This implies
(146) |
with the last equality following from . Summing over in the above equation and dividing by we get as required. ∎
∎
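Lemma 32 is also easy to probe numerically. In the sketch below, the matrix is a random m-state stochastic matrix with all entries at least δ, and the asserted bound |λ| ≤ 1 − mδ for every eigenvalue λ ≠ 1 is our reading of the statement (it matches the final step of the proof); the parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, delta = 5, 0.05
# random m-state stochastic matrix with every entry at least delta
A = delta + (1 - m * delta) * rng.dirichlet(np.ones(m), size=m)
assert np.allclose(A.sum(axis=1), 1) and A.min() >= delta

ev = np.linalg.eigvals(A)
sub = ev[np.abs(ev - 1) > 1e-9]  # eigenvalues other than 1
assert np.all(np.abs(sub) <= 1 - m * delta + 1e-9)
print(np.max(np.abs(sub)), "<=", 1 - m * delta)
```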
6 Discussions and open problems
We discuss the assumptions and implications of our results as well as related open problems.
Very large state space.
Theorem 1 determines the optimal prediction risk under the assumption of . When , Theorem 1 shows that the KL risk is bounded away from zero. However, as the KL risk can be as large as , it is a meaningful question to determine the optimal rate in this case, which, thanks to the general reduction in (11), reduces to determining the redundancy for symmetric and general Markov chains. For iid data, the minimax pointwise redundancy is known to be [SW12, Theorem 1] when . Since the average and pointwise redundancy usually behave similarly, for Markov chains it is reasonable to conjecture that the redundancy is in the large alphabet regime of , which, in view of (11), would imply the optimal prediction risk is for . In comparison, we note that the prediction risk is at most , achieved by the uniform distribution.
Other loss functions.
As mentioned in Section 1.1, standard arguments based on concentration inequalities inevitably rely on mixing conditions such as the spectral gap. In contrast, the risk bound in Theorem 1, which is free of any mixing condition, is enabled by powerful techniques from universal compression which bound the redundancy by the pointwise maximum over all trajectories combined with information-theoretic or combinatorial arguments. This program relies only on the Markovity of the process rather than stationarity or spectral gap assumptions. The limitation of this approach, however, is that the reduction between prediction and redundancy crucially depends on the form of the KL loss function444In fact, this connection breaks down if one swaps and in the KL divergence in (1). in (1), which allows one to use the mutual information representation and the chain rule to relate individual risks to the cumulative risk. More general losses in terms of -divergences have been considered in [HOP18]. Obtaining spectral gap-independent risk bounds for these loss functions, this time without the aid of universal compression, is an open question.
Stationarity.
As mentioned above, the redundancy result in Lemma 7 (see also [Dav83, TJW18]) holds for nonstationary Markov chains as well. However, our redundancy-based risk upper bound in Lemma 6 crucially relies on stationarity. It is unclear whether the result of Theorem 1 carries over to nonstationary chains.
Appendix A Mutual information representation of prediction risk
The following lemma justifies the representation (22) for the prediction risk as maximal conditional mutual information. Unlike (17) for redundancy, which holds essentially without any condition [Kem74], here we impose certain compactness assumptions which hold for finite alphabets, such as the finite-state Markov chains studied in this paper.
Lemma 33.
Let be finite and let be a compact subset of . Given , define the prediction risk
(147) |
Then
(148) |
where denotes the collection of all (Borel) probability measures on .
Note that for stationary Markov chains, (22) follows from Lemma 33 since one can take to be the joint distribution of itself, which forms a compact subset of the probability simplex on .
Proof.
It is clear that (147) is equivalent to
By the variational representation (14) of conditional mutual information, we have
(149) |
Thus (148) amounts to justifying the interchange of infimum and supremum in (147). It suffices to prove the upper bound.
Let . For , define an auxiliary quantity:
(150) |
where the constraint in the infimum is pointwise, namely, for all . By definition, we have . Furthermore, can be equivalently written as
(151) |
where denotes the uniform distribution on .
We first show that the infimum and supremum in (151) can be interchanged. This follows from the standard minimax theorem. Indeed, note that is convex in , affine in , continuous in each argument, and takes values in . Since is convex and weakly compact (by Prokhorov’s theorem) and the collection of conditional distributions is convex, the minimax theorem (see, e.g., [Fan53, Theorem 2]) yields
(152) |
Finally, by the convexity of the KL divergence, for any on , we have
which, in view of (149) and (152), implies
By the arbitrariness of , (148) follows. ∎
Appendix B Proof of Lemma 16
Recall that for any irreducible and reversible finite-state transition matrix with stationary distribution , the following are satisfied:
1. for all .
2. for all .
The following is a direct consequence of the Markov property.
Lemma 34.
For any and any we have
(153) |
For , denote the -step transition probability by , which is the th entry of . The following result is standard (see, e.g., [LP17, Chap. 12]). We include the proof mainly for the purpose of introducing the spectral decomposition.
Lemma 35.
Define . For any ,
Proof.
Throughout the proof all vectors are column vectors except for . Let denote the diagonal matrix with entries . By reversibility, , which shares the same spectrum as , is a symmetric matrix and admits the spectral decomposition for some orthonormal basis ; in particular, and . Then for each ,
(154) |
where is the all-ones vector. As the ’s satisfy , we get for any . Using this along with the Cauchy-Schwarz inequality we get
as required. ∎
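The symmetrization in this proof is easy to reproduce numerically. The sketch below builds an arbitrary reversible chain from a symmetric weight matrix, extracts λ* (the largest modulus among eigenvalues other than 1) via the symmetric matrix D^{1/2} P D^{-1/2} of (154), and checks the bound in the form |P^t(x, y) − π(y)| ≤ √(π(y)/π(x)) λ*^t, the standard statement from [LP17]; the weights are illustrative.

```python
import numpy as np

# reversible chain from symmetric weights: pi is proportional to the row sums
W = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])
P = W / W.sum(axis=1, keepdims=True)
pi = W.sum(axis=1) / W.sum()

# S = D^{1/2} P D^{-1/2} is symmetric and shares the spectrum of P
S = np.diag(np.sqrt(pi)) @ P @ np.diag(1 / np.sqrt(pi))
lam = np.sort(np.abs(np.linalg.eigvalsh(S)))[::-1]
lam_star = lam[1]  # largest modulus among eigenvalues other than 1

for t in (1, 5, 10):
    Pt = np.linalg.matrix_power(P, t)
    lhs = np.abs(Pt - pi[None, :])
    rhs = np.sqrt(pi[None, :] / pi[:, None]) * lam_star ** t
    assert np.all(lhs <= rhs + 1e-12)
print("mixing bound verified; lambda* =", lam_star)
```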
Lemma 36.
Fix states . For any integers , define
Then
(i)
(ii)
(iii)
Proof.
We apply Lemma 35 and time reversibility:
(i)
(ii)
(iii) . ∎
Proof of Lemma 16(i).
For ease of notation we use to denote an absolute constant whose value may vary at each occurrence. Fix . Note that the empirical count defined in (4) can be written as and . Then
where (a) is due to time reversibility; in (b) we defined . We divide the summands into different cases and apply Lemma 36.
Case I: Two distinct indices.
For any , using Lemma 34 we get
(155) |
which implies
Here the last inequality (and similar sums in later deductions) can be explained as follows. Note that for (i.e. ), the sum is clearly bounded by an absolute constant; for (i.e. ), we compare the sum with the mean (or higher moments in other calculations) of a geometric random variable.
Case II: Single index.
(156) |
Combining the above we get
as required. ∎
Proof of Lemma 16(ii).
We first note that due to reversibility we can write (similarly to the proof of Lemma 16(i)) with
(157) |
We bound the sum over different combinations of to come up with a bound on the required fourth moment. We first divide the ’s into groups depending on how many distinct indices of there are. We use the following identities, which follow from Lemma 34: for indices
•
• For ,
• For ,
•
and then use Lemma 36 to bound the functions.
Case I: Four distinct indices.
Using Lemma 36 we have
Case II: Three distinct indices.
There are three cases, namely and .
1. Bounding :
where the last inequality follows by using .
2. Bounding :
3. Bounding :
Case III: Two distinct indices.
There are three different cases, namely and .
1. Bounding :
2. Bounding :
3. Bounding :
Case IV: Single index.
Bounding :
Combining all cases we get
as required. ∎
Proof of Lemma 16(iii).
Throughout our proof we repeatedly use the spectral decomposition (154) applied to the diagonal elements:
Write where . For ,
(158) |
Using the Markov property for any , we get
(159) |
We also get for
(160) |
This implies
(161) |
Using (159) along with Lemma 35 for any we get
(162) | |||
(163) |
This together with (161) and (158) implies
(164) |
To bound the sum over , we divide the analysis according to the number of distinct indices among the terms.
Case I: four distinct indices.
We sum (164) over all possible .
• For the first term,
• For the second term,
• For the third term,
• For the fourth term,
• For the fifth term,
• For the sixth term,
Combining the above bounds and using the fact that , we obtain
(165)
Case II: three distinct indices.
There are three cases, namely, , , and .
1. Bounding : We specialize (164) with to get
Summing over we have
(166) with the last inequality following from .
2.
3.
Case III: two distinct indices.
There are three cases, namely, and .
Case IV: single distinct index.
∎
Acknowledgment
The authors are grateful to Alon Orlitsky for helpful and encouraging comments and to Dheeraj Pichapati for providing the full version of [FOPS16]. The authors also thank David Pollard for insightful discussions on Markov chains at the initial stages of the project.
References
- [AG57] Theodore W Anderson and Leo A Goodman. Statistical inference about Markov chains. The Annals of Mathematical Statistics, pages 89–110, 1957.
- [Agr20] Rohit Agrawal. Finite-sample concentration of the multinomial in relative entropy. IEEE Transactions on Information Theory, 66(10):6297–6302, 2020.
- [Ahl21] T.D. Ahle. Sharp and simple bounds for the raw moments of the binomial and Poisson distributions. arXiv:2103.17027, 2021.
- [Att99] K. Atteson. The asymptotic redundancy of Bayes rules for Markov chains. IEEE Transactions on Information Theory, 45(6):2104–2109, 1999.
- [Bar51] Maurice S Bartlett. The frequency goodness of fit test for probability chains. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 47, pages 86–95. Cambridge University Press, 1951.
- [BFSS02] Dietrich Braess, Jürgen Forster, Tomas Sauer, and Hans U Simon. How to achieve minimax expected Kullback-Leibler distance from an unknown finite distribution. In Algorithmic Learning Theory, pages 380–394. Springer, 2002.
- [BHOP18] Anna Ben-Hamou, Roberto I Oliveira, and Yuval Peres. Estimating graph parameters via random walks with restarts. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1702–1714. SIAM, 2018.
- [Bil61] P. Billingsley. Statistical methods in Markov chains. The Annals of Mathematical Statistics, pages 12–40, 1961.
- [CB19] Yeshwanth Cherapanamjeri and Peter L Bartlett. Testing symmetric Markov chains without hitting. In Conference on Learning Theory, pages 758–785. PMLR, 2019.
- [CK82] Imre Csiszár and János Körner. Information Theory: Coding Theorems for Discrete Memoryless Systems. Academic Press, Inc., 1982.
- [CS00] Imre Csiszár and Paul C Shields. The consistency of the BIC Markov order estimator. The Annals of Statistics, 28(6):1601–1619, 2000.
- [CS04] Imre Csiszár and Paul C. Shields. Information theory and statistics: A tutorial. Foundations and Trends in Communications and Information Theory, 1(4):417–527, 2004.
- [CT06] Thomas M. Cover and Joy A. Thomas. Elements of information theory, 2nd Ed. Wiley-Interscience, New York, NY, USA, 2006.
- [Dav73] L. Davisson. Universal noiseless coding. IEEE Transactions on Information Theory, 19(6):783–795, 1973.
- [Dav83] L. Davisson. Minimax noiseless universal coding for Markov sources. IEEE Transactions on Information Theory, 29(2):211–215, 1983.
- [DDG18] Constantinos Daskalakis, Nishanth Dikkala, and Nick Gravin. Testing symmetric Markov chains from a single trajectory. In Conference On Learning Theory, pages 385–409. PMLR, 2018.
- [DMM+19] Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. On the sample complexity of the linear quadratic regulator. Foundations of Computational Mathematics, pages 1–47, 2019.
- [DMPW81] L Davisson, R McEliece, M Pursley, and Mark Wallace. Efficient universal noiseless source codes. IEEE Transactions on Information Theory, 27(3):269–279, 1981.
- [Fan53] Ky Fan. Minimax theorems. Proceedings of the National Academy of Sciences, 39(1):42–47, 1953.
- [FOPS16] M. Falahatgar, A. Orlitsky, V. Pichapati, and A.T. Suresh. Learning Markov distributions: Does estimation trump compression? In 2016 IEEE International Symposium on Information Theory (ISIT), pages 2689–2693. IEEE, July 2016.
- [FW21] Sela Fried and Geoffrey Wolfer. Identity testing of reversible Markov chains. arXiv preprint arXiv:2105.06347, 2021.
- [GR20] F Richard Guo and Thomas S Richardson. Chernoff-type concentration of empirical probabilities in relative entropy. IEEE Transactions on Information Theory, 67(1):549–558, 2020.
- [HJL+18] Y. Han, J. Jiao, C.Z. Lee, T. Weissman, Y. Wu, and T. Yu. Entropy rate estimation for Markov chains with large state space. In Advances in Neural Information Processing Systems, pages 9781–9792, 2018. arXiv:1802.07889.
- [HKL+19] Daniel Hsu, Aryeh Kontorovich, David A Levin, Yuval Peres, Csaba Szepesvári, and Geoffrey Wolfer. Mixing time estimation in reversible Markov chains from a single sample path. Annals of Applied Probability, 29(4):2439–2480, 2019.
- [Hof67] A.J. Hoffman. Three observations on nonnegative matrices. Journal of Research of the National Bureau of Standards B, pages 39–41, 1967.
- [HOP18] Yi Hao, A. Orlitsky, and V. Pichapati. On learning Markov chains. In Advances in Neural Information Processing Systems, pages 648–657, 2018.
- [Jan02] S. Janson. On concentration of probability. Contemporary combinatorics, 10(3):1–9, 2002.
- [JS02] Philippe Jacquet and Wojciech Szpankowski. A combinatorial problem arising in information theory: Precise minimax redundancy for Markov sources. In Mathematics and Computer Science II, pages 311–328. Springer, 2002.
- [Kem74] JHB Kemperman. On the Shannon capacity of an arbitrary channel. In Indagationes Mathematicae (Proceedings), volume 77, pages 101–115. North-Holland, 1974.
- [KOPS15] S. Kamath, A. Orlitsky, D. Pichapati, and A.T. Suresh. On learning distributions from their samples. In Conference on Learning Theory, pages 1066–1100, June 2015.
- [KV16] Sudeep Kamath and Sergio Verdú. Estimation of entropy rate and Rényi entropy rate for Markov chains. In Information Theory (ISIT), 2016 IEEE International Symposium on, pages 685–689. IEEE, 2016.
- [Lat97] Rafał Latała. Estimation of moments of sums of independent real random variables. The Annals of Probability, 25(3):1502–1513, 1997.
- [LB04] Feng Liang and Andrew Barron. Exact minimax strategies for predictive density estimation, data compression, and model selection. IEEE Transactions on Information Theory, 50(11):2708–2726, 2004.
- [Lez98] P. Lezaud. Chernoff-type bound for finite Markov chains. Annals of Applied Probability, 8(3):849–867, 1998.
- [LP17] D.A. Levin and Y. Peres. Markov chains and mixing times, volume 107. American Mathematical Soc., 2017.
- [MJT+20] Jay Mardia, Jiantao Jiao, Ervin Tánczos, Robert D Nowak, and Tsachy Weissman. Concentration inequalities for the empirical distribution of discrete distributions: beyond the method of types. Information and Inference: A Journal of the IMA, 9(4):813–850, 2020.
- [OS20] Maciej Obremski and Maciej Skorski. Complexity of estimating Rényi entropy of Markov chains. In 2020 IEEE International Symposium on Information Theory (ISIT), pages 2264–2269, 2020.
- [Pan04] Liam Paninski. Variational minimax estimation of discrete distributions under KL loss. Advances in Neural Information Processing Systems, 17:1033–1040, 2004.
- [Par62] Emanuel Parzen. Stochastic processes. Holden Day, 1962.
- [Pau15] D. Paulin. Concentration inequalities for Markov chains by Marton couplings and spectral methods. Electronic Journal of Probability, pages 1–20, 2015.
- [Ris84] Jorma Rissanen. Universal coding, information, prediction, and estimation. IEEE Transactions on Information Theory, 30(4):629–636, 1984.
- [Rya88] B.Y. Ryabko. Prediction of random sequences and universal coding. Prob. Pered. Inf., 24(2):87–96, 1988.
- [Sht87] Yuri M Shtarkov. Universal sequential coding of single messages. Prob. Pered. Inf., 23:175–186, 1987.
- [Sin64] Richard Sinkhorn. A relationship between arbitrary positive matrices and doubly stochastic matrices. The Annals of Mathematical Statistics, 35(2):876–879, 1964.
- [SMT+18] Max Simchowitz, Horia Mania, Stephen Tu, Michael I Jordan, and Benjamin Recht. Learning without mixing: Towards a sharp analysis of linear system identification. In Conference On Learning Theory, pages 439–473. PMLR, 2018.
- [SW12] Wojciech Szpankowski and Marcelo J Weinberger. Minimax pointwise redundancy for memoryless models over large alphabets. IEEE Transactions on Information Theory, 58(7):4094–4104, 2012.
- [TJW18] Kedar Tatwawadi, Jiantao Jiao, and Tsachy Weissman. Minimax redundancy for Markov chains with large state space. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 216–220. IEEE, 2018.
- [Tro74] Viktor Kupriyanovich Trofimov. Redundancy of universal coding of arbitrary Markov sources. Prob. Pered. Inf., 10(4):16–24, 1974.
- [Wai19] M.J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.
- [Whi55] P. Whittle. Some distribution and moment formulae for the Markov chain. Journal of the Royal Statistical Society: Series B (Methodological), 17(2):235–242, 1955.
- [WK19] Geoffrey Wolfer and Aryeh Kontorovich. Minimax learning of ergodic Markov chains. In Algorithmic Learning Theory, pages 904–930. PMLR, 2019.
- [WK20] Geoffrey Wolfer and Aryeh Kontorovich. Minimax testing of identity to a reference ergodic Markov chain. In International Conference on Artificial Intelligence and Statistics, pages 191–201. PMLR, 2020.
- [XB97] Qun Xie and Andrew R Barron. Minimax redundancy for the class of memoryless sources. IEEE Transactions on Information Theory, 43(2):646–657, 1997.
- [YB99] Y. Yang and A. R. Barron. Information-theoretic determination of minimax rates of convergence. The Annals of Statistics, 27(5):1564–1599, 1999.