Optimal prediction of Markov chains with and without spectral gap

Yanjun Han Soham Jana and Yihong Wu Y. Han is with the Simons Institute for the Theory of Computing, University of California, Berkeley, email: [email protected]. S. Jana and Y. Wu are with the Department of Statistics and Data Science, Yale University, New Haven, CT, email: [email protected] and [email protected]. Y. Wu is supported in part by NSF Grant CCF-1900507, an NSF CAREER award CCF-1651588, and an Alfred Sloan fellowship.

Abstract

We study the following learning problem with dependent data: Observing a trajectory of length $n$ from a stationary Markov chain with $k$ states, the goal is to predict the next state. For $3\leq k\leq O(\sqrt{n})$ , using techniques from universal compression, the optimal prediction risk in Kullback-Leibler divergence is shown to be $\Theta(\frac{k^{2}}{n}\log\frac{n}{k^{2}})$ , in contrast to the optimal rate of $\Theta(\frac{\log\log n}{n})$ for $k=2$ previously shown in [FOPS16]. These rates, slower than the parametric rate of $O(\frac{k^{2}}{n})$ , can be attributed to the memory in the data, as the spectral gap of the Markov chain can be arbitrarily small. To quantify the memory effect, we study irreducible reversible chains with a prescribed spectral gap. In addition to characterizing the optimal prediction risk for two states, we show that, as long as the spectral gap is not excessively small, the prediction risk in the Markov model is $O(\frac{k^{2}}{n})$ , which coincides with that of an iid model with the same number of parameters. Extensions to higher-order Markov chains are also obtained.

1 Introduction

Learning distributions from samples is a central question in statistics and machine learning. While significant progress has been achieved in property testing and estimation based on independent and identically distributed (iid) data, for many applications, most notably natural language processing, two new challenges arise: (a) Modeling data as independent observations fails to capture their temporal dependency; (b) Distributions are commonly supported on a large domain whose cardinality is comparable to or even exceeds the sample size. Continuing the progress made in [FOPS16, HOP18], in this paper we study the following prediction problem with dependent data modeled as Markov chains.

Suppose $X_{1},X_{2},\dots$ is a stationary first-order Markov chain on state space $[k]\triangleq\left\{1,\dots,k\right\}$ with unknown statistics. Observing a trajectory $X^{n}\triangleq(X_{1},\ldots,X_{n})$ , the goal is to predict the next state $X_{n+1}$ by estimating its distribution conditioned on the present data. We use the Kullback-Leibler (KL) divergence as the loss function: For distributions $P=\left[p_{1},\dots,p_{k}\right],Q=\left[q_{1},\dots,q_{k}\right]$ , $D(P\|Q)=\sum_{i=1}^{k}p_{i}\log\frac{p_{i}}{q_{i}}$ if $p_{i}=0$ whenever $q_{i}=0$ and $D(P\|Q)=\infty$ otherwise. The minimax prediction risk is given by

\displaystyle\mathsf{Risk}_{k,n}

\displaystyle\triangleq\inf_{\widehat{M}}\sup_{\pi,M}\mathbb{E}[D(M(\cdot|X_{n})\|\widehat{M}(\cdot|X_{n}))]=\inf_{\widehat{M}}\sup_{\pi,M}\sum_{i=1}^{k}\mathbb{E}[D(M(\cdot|i)\|\widehat{M}(\cdot|i)){\mathbf{1}_{\left\{{X_{n}=i}\right\}}}]

(1)

where the supremum is taken over all stationary distributions $\pi$ and transition matrices $M$ (row-stochastic) such that $\pi M=\pi$ , the infimum is taken over all estimators $\widehat{M}=\widehat{M}(X_{1},\dots,X_{n})$ that are proper Markov kernels (i.e. rows sum to 1), and $M(\cdot|i)$ denotes the $i$ th row of $M$ . Our main objective is to characterize this minimax risk within universal constant factors as a function of $n$ and $k$ .

The prediction problem (1) is distinct from the parameter estimation problem such as estimating the transition matrix [Bar51, AG57, Bil61, WK19] or its properties [CS00, KV16, HJL⁺18, HKL⁺19] in that the quantity to be estimated (conditional distribution of the next state) depends on the sample path itself. This is precisely what renders the prediction problem closely relevant to natural applications such as autocomplete and text generation. In addition, this formulation allows more flexibility with far less assumptions compared to the estimation framework. For example, if certain state has very small probability under the stationary distribution, consistent estimation of the transition matrix with respect to usual loss function, e.g. squared risk, may not be possible, whereas the prediction problem is unencumbered by such rare states.

In the special case of iid data, the prediction problem reduces to estimating the distribution in KL divergence. In this setting the optimal risk is well understood, which is known to be ${k-1\over 2n}(1+o(1))$ when $k$ is fixed and $n\to\infty$ [BFSS02] and $\Theta(\frac{k}{n})$ for $k=O(n)$ [Pan04, KOPS15].¹¹1Here and below $\asymp,\lesssim,\gtrsim$ or $\Theta(\cdot),O(\cdot),\Omega(\cdot)$ denote equality and inequalities up to universal multiplicative constants. Typical in parametric models, this rate $\frac{k}{n}$ is commonly referred to the “parametric rate”, which leads to a sample complexity that scales proportionally to the number of parameters and inverse proportionally to the desired accuracy.

In the setting of Markov chains, however, the prediction problem is much less understood especially for large state space. Recently the seminal work [FOPS16] showed the surprising result that for stationary Markov chains on two states, the optimal prediction risk satisfies

\displaystyle\mathsf{Risk}_{2,n}=\Theta\left(\log\log n\over n\right),

(2)

which has a nonparametric rate even when the problem has only two parameters. The follow-up work [HOP18] studied general $k$ -state chains and showed a lower bound of $\Omega(\frac{k\log\log n}{n})$ for uniform (not necessarily stationary) initial distribution; however, the upper bound $O(\frac{k^{2}\log\log n}{n})$ in [HOP18] relies on implicit assumptions on mixing time such as spectral gap conditions: the proof of the upper bound for prediction (Lemma 7 in the supplement) and for estimation (Lemma 17 of the supplement) is based on Berstein-type concentration results of the empirical transition counts, which depend on spectral gap. The following theorem resolves the optimal risk for $k$ -state Markov chains:

Theorem 1 (Optimal rates without spectral gap).

There exists a universal constant $C>0$ such that for all $3\leq k\leq\sqrt{n}/C$ ,

\frac{k^{2}}{Cn}\log\left(\frac{n}{k^{2}}\right)\leq\mathsf{Risk}_{k,n}\leq\frac{Ck^{2}}{n}\log\left(\frac{n}{k^{2}}\right).

(3)

Furthermore, the lower bound continues to hold even if the Markov chain is restricted to be irreducible and reversible.

Remark 1.

The optimal prediction risk of $O(\frac{k^{2}}{n}\log\frac{n}{k^{2}})$ can be achieved by an average version of the add-one estimator (i.e. Laplace’s rule of succession). Given a trajectory $x^{n}=(x_{1},\ldots,x_{n})$ of length $n$ , denote the transition counts (with the convention $N_{i}\equiv N_{ij}\equiv 0$ if $n=0,1$ )

\displaystyle N_{i}=\sum_{\ell=1}^{n-1}{\mathbf{1}_{\left\{{x_{\ell}=i}\right\}}},\quad N_{ij}=\sum_{\ell=1}^{n-1}{\mathbf{1}_{\left\{{x_{\ell}=i,x_{\ell+1}=j}\right\}}}.

(4)

The add-one estimator for the transition probability $M(j|i)$ is given by

\widehat{M}^{+1}_{x^{n}}(j|i)\triangleq{N_{ij}+1\over N_{i}+k},

(5)

which is an additively smoothed version of the empirical frequency. Finally, the optimal rate in (3) can be achieved by the following estimator $\widehat{M}$ defined as an average of add-one estimators over different sample sizes:

\widehat{M}_{x^{n}}(x_{n+1}|x_{n})\triangleq\frac{1}{n}\sum_{t=1}^{n}\widehat{M}^{+1}_{x_{n-t+1}^{n}}(x_{n+1}|x_{n}).

(6)

In other words, we apply the add-one estimator to the most recent $t$ observations $(X_{n-t+1},\ldots,X_{n})$ to predict the next $X_{n+1}$ , then average over $t=1,\ldots,n$ . Such Cesàro-mean-type estimators have been introduced before in the density estimation literature (see, e.g., [YB99]). It remains open whether the usual add-one estimator (namely, the last term in (6) which uses all the data) or any add- $c$ estimator for constant $c$ achieves the optimal rate. In contrast, for two-state chains the optimal risk (2) is attained by a hybrid strategy [FOPS16], applying add- $c$ estimator for $c=\frac{1}{\log n}$ for trajectories with at most one transition and $c=1$ otherwise. Also note that the estimator in (6) can be computed in $O(nk)$ time. To derive this first note that given any $j\in[k]$ calculating $\widehat{M}^{+1}_{x^{n-1}_{1}}(j|x_{n-1})$ takes $O(n)$ time and given any $M^{+1}_{x^{n-1}_{n-t+1}}(j|x_{n-1})$ we need $O(1)$ time to calculate $\widehat{M}^{+1}_{x^{n-1}_{n-t+2}}(j|x_{n-1})$ . Summing over all $j$ we get the algorithmic complexity upper bound.

Theorem 1 shows that the departure from the parametric rate of $\frac{k^{2}}{n}$ , first discovered in [FOPS16, HOP18] for binary chains, is even more pronounced for larger state space. As will become clear in the proof, there is some fundamental difference between two-state and three-state chains, resulting in $\mathsf{Risk}_{3,n}=\Theta(\frac{\log n}{n})\gg\mathsf{Risk}_{2,n}=\Theta(\frac{\log\log n}{n})$ . It is instructive to compare the sample complexity for prediction in the iid and Markov model. Denote by $d$ the number of parameters, which is $k-1$ for the iid case and $k(k-1)$ for Markov chains. Define the sample complexity $n^{*}(d,\epsilon)$ as the smallest sample size $n$ in order to achieve a prescribed prediction risk $\epsilon$ . For $\epsilon=O(1)$ , we have

n^{*}(d,\epsilon)\asymp\begin{cases}\frac{d}{\epsilon}&\text{iid}\\ \frac{d}{\epsilon}\log\log\frac{1}{\epsilon}&\text{Markov with $2$ states}\\ \frac{d}{\epsilon}\log\frac{1}{\epsilon}&\text{Markov with $k\geq 3$ states}.\end{cases}

(7)

At a high level, the nonparametric rates in the Markov model can be attributed to the memory in the data. On the one hand, Theorem 1 as well as (2) affirm that one can obtain meaningful prediction without imposing any mixing conditions;²²2To see this, it is helpful to consider the extreme case where the chain does not move at all or is periodic, in which case predicting the next state is in fact easy. such decoupling between learning and mixing has also been observed in other problems such as learning linear dynamics [SMT⁺18, DMM⁺19]. On the other hand, the dependency in the data does lead to a strictly higher sample complexity than that of the iid case; in fact, the lower bound in Theorem 1 is proved by constructing chains with spectral gap as small as $O(\frac{1}{n})$ (see Section 3). Thus, it is conceivable that with sufficiently favorable mixing conditions, the prediction risk improves over that of the worst case and, at some point, reaches the parametric rate. To make this precise, we focus on Markov chains with a prescribed spectral gap.

It is well-known that for an irreducible and reversible chain, the transition matrix $M$ has $k$ real eigenvalues satisfying $1=\lambda_{1}\geq\lambda_{2}\geq\dots\lambda_{k}\geq-1$ . The absolute spectral gap of $M$ , defined as

\gamma_{*}\triangleq 1-\max\left\{\left|\lambda_{i}\right|:i\neq 1\right\},

(8)

quantifies the memory of the Markov chain. For example, the mixing time is determined by $1/\gamma^{*}$ (relaxation time) up to logarithmic factors. As extreme cases, the chain which does not move ( $M$ is identity) and which is iid ( $M$ is rank-one) have spectral gap equal to 0 and 1, respectively. We refer the reader to [LP17] for more background. Note that the definition of absolute spectral gap requires irreducibility and reversibility, thus we restrict ourselves to this class of Markov chains (it is possible to use more general notions such as pseudo spectral gap to quantify the memory of the process, which is beyond the scope of the current paper). Given $\gamma_{0}\in(0,1)$ , define ${\mathcal{M}}_{k}(\gamma_{0})$ as the set of transition matrices corresponding to irreducible and reversible chains whose absolute spectral gap exceeds $\gamma_{0}$ . Restricting (1) to this subcollection and noticing the stationary distribution here is uniquely determined by $M$ , we define the corresponding minimax risk:

\displaystyle\mathsf{Risk}_{k,n}(\gamma_{0})

\displaystyle\triangleq\inf_{\widehat{M}}\sup_{M\in{\mathcal{M}}_{k}(\gamma_{0})}\mathbb{E}\left[D(M(\cdot|X_{n})\|\widehat{M}(\cdot|X_{n}))\right]

(9)

Extending the result (2) of [FOPS16], the following theorem characterizes the optimal prediction risk for two-state chains with prescribed spectral gaps (the case $\gamma_{0}=0$ correspond to the minimax rate in [FOPS16] over all binary Markov chains):

Theorem 2 (Spectral gap dependent rates for binary chain).

For any $\gamma_{0}\in(0,1)$

\mathsf{Risk}_{2,n}(\gamma_{0})\asymp\frac{1}{n}\max\left\{1,\log\log\left(\min\left\{n,\frac{1}{\gamma_{0}}\right\}\right)\right\}.

Theorem 2 shows that for binary chains, parametric rate $O(\frac{1}{n})$ is achievable if and only if the spectral gap is nonvanishing. While this holds for bounded state space (see Corollary 4 below), for large state space, it turns out that much weaker conditions on the absolute spectral gap suffice to guarantee the parametric rate $O({k^{2}\over n})$ , achieved by the add-one estimator applied to the entire trajectory. In other words, as long as the spectral gap is not excessively small, the prediction risk in the Markov model behaves in the same way as that of an iid model with equal number of parameters. Similar conclusion has been established previously for the sample complexity of estimating the entropy rate of Markov chains in [HJL⁺18, Theorem 1].

Theorem 3.

The add-one estimator in (5) achieves the following risk bound.

(i)

For any $k\geq 2$ , $\mathsf{Risk}_{k,n}(\gamma_{0})\lesssim{k^{2}\over n}$ provided that $\gamma_{0}\gtrsim(\frac{\log k}{k})^{1/4}$ .
(ii)

In addition, for $k\gtrsim(\log n)^{6}$ , $\mathsf{Risk}_{k,n}(\gamma_{0})\lesssim{k^{2}\over n}$ provided that $\gamma_{0}\gtrsim{(\log(n+k))^{2}\over k}$ .

Corollary 4.

For any fixed $k\geq 2$ , $\mathsf{Risk}_{k,n}(\gamma_{0})=O(\frac{1}{n})$ if and only if $\gamma_{0}=\Omega(1)$ .

Finally, we address the optimal prediction risk for higher-order Markov chains:

Theorem 5.

There is a constant $C_{m}$ depending on $m$ such that for any $2\leq k\leq n^{\frac{1}{m+1}}/C_{m}$ and constant $m\geq 2$ the minimax prediction rate for $m^{\text{th}}$ -order Markov chains with stationary initialization is $\Theta_{m}\left({k^{m+1}\over n}\log{n\over k^{m+1}}\right).$

Notably, for binary states, it turns out that the optimal rate $\Theta\left(\log\log n\over n\right)$ for first-order Markov chains determined by [FOPS16] is something very special, as we show that for second-order chains the optimal rate is $\Theta\left(\log n\over n\right)$ .

1.1 Proof techniques

The proof of Theorem 1 deviates from existing approaches based on concentration inequalities for Markov chains. For instance, the standard program for analyzing the add-one estimator (5) involves proving concentration of the empirical counts on their population version, namely, $N_{i}\approx n\pi_{i}$ and $N_{ij}\approx n\pi_{i}M(j|i)$ , and bounding the risk in the atypical case by concentration inequalities, such as the Chernoff-type bounds in [Lez98, Pau15], which have been widely used in recent work on statistical inference with Markov chains [KV16, HJL⁺18, HOP18, HKL⁺19, WK19]. However, these concentration inequalities inevitably depends on the spectral gap of the Markov chain, leading to results which deteriorate as the spectral gap becomes smaller. For two-state chains, results free of the spectral gap are obtained in [FOPS16] using explicit joint distribution of the transition counts; this refined analysis, however, is difficult to extend to larger state space as the probability mass function of $(N_{ij})$ is given by Whittle’s formula [Whi55] which takes an unwieldy determinantal form.

Eschewing concentration-based arguments, the crux of our proof of Theorem 1, for both the upper and lower bound, revolves around the following quantity known as redundancy:

\mathsf{Red}_{k,n}\triangleq\inf_{Q_{X^{n}}}\sup_{P_{X^{n}}}D(P_{X^{n}}\|Q_{X^{n}})=\inf_{Q_{X^{n}}}\sup_{P_{X^{n}}}\sum_{x^{n}}P_{X^{n}}(x^{n})\log\frac{P_{X^{n}}(x^{n})}{Q_{X^{n}}(x^{n})}.

(10)

Here the supremum is taken over all joint distributions of stationary Markov chains $X^{n}$ on $k$ states, and the infimum is over all joint distributions $Q_{X^{n}}$ . A central quantity which measures the minimax regret in universal compression, the redundancy (10) corresponds to minimax cumulative risk (namely, the total prediction risk when the sample size ranges from 1 to $n$ ), while (1) is the individual minimax risk at sample size $n$ – see Section 2 for a detailed discussion. We prove the following reduction between prediction risk and redundancy:

\frac{1}{n}\mathsf{Red}_{k-1,n}^{\mathsf{sym}}-\frac{\log k}{n}\lesssim\mathsf{Risk}_{k,n}\leq\frac{1}{n-1}\mathsf{Red}_{k,n}

(11)

where $\mathsf{Red}^{\mathsf{sym}}$ denotes the redundancy for symmetric Markov chains. The upper bound is standard: thanks to the convexity of the loss function and stationarity of the Markov chain, the risk of the Cesàro-mean estimator (6) can be upper bounded using the cumulative risk and, in turn, the redundancy. The proof of the lower bound is more involved. Given a $(k-1)$ -state chain, we embed it into a larger state space by introducing a new state, such that with constant probability, the chain starts from and gets stuck at this state for a period time that is approximately uniform in $[n]$ , then enters the original chain. Effectively, this scenario is equivalent to a prediction problem on $k-1$ states with a random (approximately uniform) sample size, whose prediction risk can then be related to the cumulative risk and redundancy. This intuition can be made precise by considering a Bayesian setting, in which the $(k-1)$ -state chain is randomized according to the least favorable prior for (10), and representing the Bayes risk as conditional mutual information and applying the chain rule.

Given the above reduction in (11), it suffices to show both redundancies therein are on the order of $\frac{k^{2}}{n}\log\frac{n}{k^{2}}$ . The redundancy is upper bounded by pointwise redundancy, which replaces the average in (10) by the maximum over all trajectories. Following [DMPW81, CS04], we consider an explicit probability assignment defined by add-one smoothing and using combinatorial arguments to bound the pointwise redundancy, shown optimal by information-theoretic arguments.

The optimal spectral gap-dependent rate in Theorem 2 relies on the key observation in [FOPS16] that, for binary chains, the dominating contribution to the prediction risk comes from trajectories with a single transition, for which we may apply an add- $c$ estimator with $c$ depending appropriately on the spectral gap. The lower bound is shown using a Bayesian argument similar to that of [HOP18, Theorem 1]. The proof of Theorem 3 relies on more delicate concentration arguments as the spectral gap is allowed to be vanishingly small. Notably, for small $k$ , direct application of existing Bernstein inequalities for Markov chains in [Lez98, Pau15] falls short of establishing the parametric rate of $O(\frac{k^{2}}{n})$ (see Remark 4 in Section 4.2 for details); instead, we use a fourth moment bound which turns out to be well suited for analyzing concentration of empirical counts conditional on the terminal state.

For large $k$ , we further improve the spectral gap condition using a simulation argument for Markov chains using independent samples [Bil61, HJL⁺18]. A key step is a new concentration inequality for $D(P\|\widehat{P}_{n,k}^{+1})$ , where $\widehat{P}_{n,k}^{+1}$ is the add-one estimator based on $n$ iid observations of $P$ supported on $[k]$ :

\displaystyle\mathbb{P}\left(D(P\|\widehat{P}_{n,k}^{+1})\geq c\cdot\frac{k}{n}+\frac{\mathsf{polylog}(n)\cdot\sqrt{k}}{n}\right)\leq\frac{1}{\mathsf{poly}(n)},

(12)

for some absolute constant $c>0$ . Note that an application of the classical concentration inequality of McDiarmid would result in the second term being $\mathsf{polylog}(n)/\sqrt{n}$ , and (12) crucially improves this to $\mathsf{polylog}(n)\cdot\sqrt{k}/n$ . Such an improvement has been recently observed by [MJT⁺20, Agr20, GR20] in studying the similar quantity $D(\widehat{P}_{n}\|P)$ for the (unsmoothed) empirical distribution $\widehat{P}_{n}$ ; however, these results, based on either the method of types or an explicit upper bound of the moment generating function, are not directly applicable to (12) in which the true distribution $P$ appears as the first argument in the KL divergence.

The nonasymptotic analysis of the prediction rate for higher-order chains with large alphabets is based on a similar redundancy-based reduction as the first-order chain. However, optimal nonasymptotic redundancy bounds for higher-order chains is more challenging. Notably, in lower bounding redundancy, we need to bound the mutual information from below by upper bounding the squared error of certain estimators. As noted in [TJW18], existing analysis in [Dav83, Sec III] based on simple mixing conditions from [Par62] leads to suboptimal results on large alphabets. To bypass this issue, we show the pseudo spectral gap [Pau15] of the transition matrix of the first-order chain $\{(X_{t+1},\dots,X_{t+m-1})\}_{t=0}^{n-m+1}$ is at least a constant. This is accomplished by a careful construction of a prior on $m^{\rm th}$ -order transition matrices with $\Theta\left(k^{m+1}\right)$ degrees of freedom.

1.2 Related work

While the exact prediction problem studied in this paper has recently been in focus since [FOPS16, HOP18], there exists a large body of literature on relate works. As mentioned before some of our proof strategies draws inspiration and results from the study of redundancy in universal compression, its connection to mutual information, as well as the perspective of sequential probability assignment as prediction, dating back to [Dav73, DMPW81, Ris84, Sht87, Rya88]. Asymptotic characterization of the minimax redundancy for Markov sources, both average and pointwise, were obtained in [Dav83, Att99, JS02], in the regime of fixed alphabet size $k$ and large sample size $n$ . Non-asymptotic characterization was obtained in [Dav83] for $n\gg k^{2}\log k$ and recently extended to $n\asymp k^{2}$ in [TJW18], which further showed that the behavior of the redundancy remains unchanged even if the Markov chain is very close to being iid in terms of spectral gap $\gamma^{*}=1-o(1)$ .

The current paper adds to a growing body of literature devoted to statistical learning with dependent data, in particular those dealing with Markov chains. Estimation of the transition matrix [Bar51, AG57, Bil61, Sin64] and testing the order of Markov chains [CS00] have been well studied in the large-sample regime. More recently attention has been shifted towards large state space and nonasymptotics. For example, [WK19] studied the estimation of transition matrix in $\ell_{\infty}\to\ell_{\infty}$ induced norm for Markov chains with prescribed pseudo spectral gap and minimum probability mass of the stationary distribution, and determined sample complexity bounds up to logarithmic factors. Similar results have been obtained for estimating properties of Markov chains, including mixing time and spectral gap [HKL⁺19], entropy rate [KV16, HJL⁺18, OS20], graph statistics based on random walk [BHOP18], as well as identity testing [DDG18, CB19, WK20, FW21]. Most of these results rely on assumptions on the Markov chains such as lower bounds on the spectral gap and the stationary distribution, which afford concentration for sample statistics of Markov chains. In contrast, one of the main contributions in this paper, in particular Theorem 1, is that optimal prediction can be achieved without these assumptions, thereby providing a novel way of tackling these seemingly unavoidable issues. This is ultimately accomplished by information-theoretic and combinatorial techniques from universal compression.

1.3 Notations and preliminaries

For $n\in\mathbb{N}$ , let $[n]\triangleq\{1,\ldots,n\}$ . Denote $x^{n}=(x_{1},\ldots,x_{n})$ and $x_{t}^{n}=(x_{t},\ldots,x_{n})$ . The distribution of a random variable $X$ is denoted by $P_{X}$ . In a Bayesian setting, the distribution of a parameter $\theta$ is referred to as a prior, denoted by $P_{\theta}$ . We recall the following definitions from information theory [CK82, CT06]. The conditional KL divergence is defined as as an average of KL divergence between conditional distributions:

D(P_{A|B}\|Q_{A|B}|P_{B})\triangleq\mathbb{E}_{B\sim P_{B}}[D(P_{A|B}\|Q_{A|B})]=\int P_{B}(db)D(P_{A|B=b}\|Q_{A|B=b}).

(13)

The mutual information between random variables $A$ and $B$ with joint distribution $P_{AB}$ is $I(A;B)\triangleq D(P_{B|A}\|P_{B}|P_{A})$ ; similarly, the conditional mutual information is defined as

I(A;B|C)\triangleq D(P_{B|A,C}\|P_{B|C}|P_{A,C}).

The following variational representation of (conditional) mutual information is well-known

I(A;B)=\min_{Q_{B}}D(P_{B|A}\|Q_{B}|P_{A}),\quad I(A;B|C)=\min_{Q_{B|C}}D(P_{B|A,C}\|Q_{B|C}|P_{AC}).

(14)

The entropy of a discrete random variables $X$ is $H(X)\triangleq\sum_{x}P_{X}(x)\log\frac{1}{P_{X}(x)}$ .

1.4 Organization

The rest of the paper is organized as follows. In Section 2 we describe the general paradigm of minimax redundancy and prediction risk and their dual representation in terms of mutual information. We give a general redundancy-based bound on the prediction risk, which, combined with redundancy bounds for Markov chains, leads to the upper bound in Theorem 1. Section 3 presents the lower bound construction, starting from three states and then extending to $k$ states. Spectral-gap dependent risk bounds in Theorems 2 and 3 are given in Section 4. Section 5 presents the results and proofs for $m^{\text{th}}$ -order Markov chains. Section 6 discusses the assumptions and implications of our results and related open problems.

2 Two general paradigms

2.1 Redundancy, prediction risk, and mutual information representation

For $n\in\mathbb{N}$ , let ${\mathcal{P}}=\{P_{X^{n+1}|\theta}:\theta\in\Theta\}$ be a collection of joint distributions parameterized by $\theta$ .

“Compression”.

Consider a sample $X^{n}\triangleq(X_{1},\ldots,X_{n})$ of size $n$ drawn from $P_{X^{n}|\theta}$ for some unknown $\theta\in\Theta$ . The redundancy of a probability assignment (joint distribution) $Q_{X^{n}}$ is defined as the worst-case KL risk of fitting the joint distribution of $X^{n}$ , namely

\mathsf{Red}(Q_{X^{n}})\triangleq\sup_{\theta\in\Theta}D(P_{X^{n}|\theta}\|Q_{X^{n}}).

(15)

Optimizing over $Q_{X^{n}}$ , the minimax redundancy is defined as

\mathsf{Red}_{n}\triangleq\inf_{Q_{X^{n}}}\mathsf{Red}_{n}(Q_{X^{n}}),

(16)

where the infimum is over all joint distribution $Q_{X^{n}}$ . This quantity can be operationalized as the redundancy (i.e. regret) in the setting of universal data compression, that is, the excess number of bits compared to the optimal compressor of $X^{n}$ that knows $\theta$ [CT06, Chapter 13].

The capacity-redundancy theorem (see [Kem74] for a very general result) provides the following mutual information characterization of (16):

\mathsf{Red}_{n}=\sup_{P_{\theta}}I(\theta;X^{n}),

(17)

where the supremum is over all distributions (priors) $P_{\theta}$ on $\Theta$ . In view of the variational representation (14), this result can be interpreted as a minimax theorem:

\mathsf{Red}_{n}=\inf_{Q_{X^{n}}}\sup_{P_{\theta}}D(P_{X^{n}|\theta}\|Q_{X^{n}}|P_{\theta})=\sup_{P_{\theta}}\inf_{Q_{X^{n}}}D(P_{X^{n}|\theta}\|Q_{X^{n}}|P_{\theta}).

Typically, for fixed model size and $n\to\infty$ , one expects that $\mathsf{Red}_{n}=\frac{d}{2}\log n(1+o(1)$ , where $d$ is the number of parameters; see [Ris84] for a general theory of this type. Indeed, on a fixed alphabet of size $k$ , we have $\mathsf{Red}_{n}=\frac{k-1}{2}\log n(1+o(1))$ for iid model [Dav73] and $\mathsf{Red}_{n}=\frac{k^{m}(k-1)}{2}\log n(1+o(1))$ for $m$ -order Markov models [Tro74], with more refined asymptotics shown in [XB97, SW12]. For large alphabets, nonasymptotic results have also been obtained. For example, for first-order Markov model, $\mathsf{Red}_{n}\asymp k^{2}\log\frac{n}{k^{2}}$ provided that $n\gtrsim k^{2}$ [TJW18].

“Prediction”.

Consider the problem of predicting the next unseen data point $X_{n+1}$ based on the observations $X_{1},\ldots,X_{n}$ , where $(X_{1},\ldots,X_{n+1})$ are jointly distributed as $P_{X^{n+1}|\theta}$ for some unknown $\theta\in\Theta$ . Here, an estimator is a distribution (for $X_{n+1}$ ) as a function of $X^{n}$ , which, in turn, can be written as a conditional distribution $Q_{X_{n+1}|X^{n}}$ . As such, its worst-case average risk is

\mathsf{Risk}(Q_{X_{n+1}|X^{n}})\triangleq\sup_{\theta\in\Theta}D(P_{X_{n+1}|X^{n},\theta}\|Q_{X_{n+1}|X^{n}}|P_{X^{n}|\theta}),

(18)

where the conditional KL divergence is defined in (13). The minimax prediction risk is then defined as

\mathsf{Risk}_{n}\triangleq\inf_{Q_{X_{n+1}|X^{n}}}\mathsf{Risk}_{n}(Q_{X_{n+1}|X^{n}}),

(19)

While (16) does not directly correspond to a statistical estimation problem, (19) is exactly the familiar setting of “density estimation”, where $Q_{X_{n+1}|X^{n}}$ is understood as an estimator for the distribution of the unseen $X_{n+1}$ based on the available data $X_{1},\ldots,X_{n}$ .

In the Bayesian setting where $\theta$ is drawn from a prior $P_{\theta}$ , the Bayes prediction risk coincides with the conditional mutual information as a consequence of the variational representation (14):

\inf_{Q_{X_{n+1}|X^{n}}}\mathbb{E}_{\theta}[D(P_{X_{n+1}|X^{n},\theta}\|Q_{X_{n+1}|X^{n}}|P_{X^{n}|\theta})]=I(\theta;X_{n+1}|X^{n}).

(20)

Furthermore, the Bayes estimator that achieves this infimum takes the following form:

Q_{X_{n+1}|X^{n}}^{\sf Bayes}=P_{X^{n+1}|X^{n}}=\frac{\int_{\Theta}P_{X^{n+1}|\theta}\,dP_{\theta}}{\int_{\Theta}P_{X^{n}|\theta}\,dP_{\theta}},

(21)

known as the Bayes predictive density [Dav73, LB04]. These representations play a crucial role in the lower bound proof of Theorem 1. Under appropriate conditions which hold for Markov models (see Lemma 33 in Appendix A), the minimax prediction risk (19) also admits a dual representation analogous to (17):

\mathsf{Risk}_{n}=\sup_{\theta\sim\pi}I(\theta;X_{n+1}|X^{n}),

(22)

which, in view of (20), show that the principle of “minimax=worst-case Bayes” continues to hold for prediction problem in Markov models.

The following result relates the redundancy and the prediction risk.

Lemma 6.

For any model ${\mathcal{P}}$ ,

\mathsf{Red}_{n}\leq\sum_{t=0}^{n-1}\mathsf{Risk}_{t}.

(23)

In addition, suppose that each $P_{X^{n}|\theta}\in{\mathcal{P}}$ is stationary and $m^{\rm th}$ -order Markov. Then for all $n\geq m+1$ ,

\mathsf{Risk}_{n}\leq\mathsf{Risk}_{n-1}\leq\frac{\mathsf{Red}_{n}}{n-m}.

(24)

Furthermore, for any joint distribution $Q_{X^{n}}$ factorizing as $Q_{X^{n}}=\prod_{t=1}^{n}Q_{X_{t}|X^{t-1}}$ , the prediction risk of the estimator

\widetilde{Q}_{X_{n}|X^{n-1}}(x_{n}|x^{n-1})\triangleq\frac{1}{n-m}\sum_{t=m+1}^{n}Q_{X_{t}|X^{t-1}}(x_{n}|x_{n-t+1}^{n-1})

(25)

is bounded by the redundancy of $Q_{X^{n}}$ as

\mathsf{Risk}(\widetilde{Q}_{X_{n}|X^{n-1}})\leq\frac{1}{n-m}\mathsf{Red}(Q_{X^{n}}).

(26)

Remark 2.

Note that the upper bound (23) on redundancy, known as the “estimation-compression inequality” [KOPS15, FOPS16], holds without conditions, while the lower bound (24) relies on stationarity and Markovity. For iid data, the estimation-compression inequality is almost an equality; however, this is not the case for Markov chains, as both sides of (23) differ by an unbounded factor of $\Theta(\log\log n)$ for $k=2$ and $\Theta(\log n)$ for fixed $k\geq 3$ – see (2) and Theorem 1. On the other hand, Markov chains with at least three states offers a rare instance where (24) is tight, namely, $\mathsf{Risk}_{n}\asymp\frac{\mathsf{Red}_{n}}{n}$ (cf. Lemma 7).

Proof.

The upper bound on the redundancy follows from the chain rule of KL divergence:

D(P_{X^{n}|\theta}\|Q_{X^{n}})=\sum_{t=1}^{n}D(P_{X_{t}|X^{t-1},\theta}\|Q_{X_{t}|X^{t-1}}|P_{X^{t-1}}).

(27)

Thus

\sup_{\theta\in\Theta}D(P_{X^{n}|\theta}\|Q_{X^{n}})\leq\sum_{t=1}^{n}\sup_{\theta\in\Theta}D(P_{X_{t}|X^{t-1},\theta}\|Q_{X_{t}|X^{t-1}}|P_{X^{t-1}}).

Minimizing both sides over $Q_{X^{n}}$ (or equivalently, $Q_{X_{t}|X^{t-1}}$ for $t=1,\ldots,n$ ) yields (23).

To upper bound the prediction risk using redundancy, fix any $Q_{X^{n}}$ , which gives rise to $Q_{X_{t}|X^{t-1}}$ for $t=1,\ldots,n$ . For clarity, let use denote the $t^{\rm th}$ estimator as $\widehat{P}_{t}(\cdot|x^{t-1})=Q_{X_{t}|X^{t-1}=x^{t-1}}$ . Consider the estimator $\widetilde{Q}_{X_{n}|X^{n-1}}$ defined in (25), namely,

\widetilde{Q}_{X_{n}|X^{n-1}=x^{n-1}}\triangleq\frac{1}{n-m}\sum_{t=m+1}^{n}\widehat{P}_{t}(\cdot|x_{n-t+1},\ldots,x_{n-1}).

(28)

That is, we apply $\widehat{P}_{t}$ to the most recent $t-1$ symbols prior to $X_{n}$ for predicting its distribution, then average over $t$ . We may bound the prediction risk of this estimator by redundancy as follows: Fix $\theta\in\Theta$ . To simplify notation, we suppress the dependency of $\theta$ and write $P_{X^{n}|\theta}\equiv P_{X^{n}}$ . Then

	$\displaystyle D(P_{X_{n}\|X^{n-1}}\\|\widetilde{Q}_{X_{n}\|X^{n-1}}\|P_{X^{n-1}})\overset{\rm(a)}{=}$	$\displaystyle~{}\mathbb{E}\left[D\left(P_{X_{n}\|X^{n-1}_{n-m}}\Big{\\|}\frac{1}{n}\sum_{t=1}^{n}\widehat{P}_{t}(\cdot\|X_{n-t+1}^{n-1})\right)\right]$
	$\displaystyle\overset{\rm(b)}{\leq}$	$\displaystyle~{}\frac{1}{n-m}\sum_{t=m+1}^{n}\mathbb{E}\left[D(P_{X_{n}\|X^{n-1}_{n-m}}\\|\widehat{P}_{t}(\cdot\|X_{n-t+1}^{n-1}))\right]$
	$\displaystyle\overset{\rm(c)}{=}$	$\displaystyle~{}\frac{1}{n-m}\sum_{t=m+1}^{n}\mathbb{E}\left[D(P_{X_{t}\|X^{t-1}_{t-m}}\\|\widehat{P}_{t}(\cdot\|X^{t-1}))\right]$
	$\displaystyle\overset{\rm(d)}{=}$	$\displaystyle~{}\frac{1}{n-m}\sum_{t=m+1}^{n}D(P_{X_{t}\|X^{t-1}}\\|Q_{X^{t}\|X^{t-1}}\|P_{X^{t-1}})$
	$\displaystyle\leq$	$\displaystyle~{}\frac{1}{n-m}\sum_{t=1}^{n}D(P_{X_{t}\|X^{t-1}}\\|Q_{X^{t}\|X^{t-1}}\|P_{X^{t-1}})$
	$\displaystyle\overset{\rm(e)}{=}$	$\displaystyle~{}\frac{1}{n-m}D(P_{X^{n}}\\|Q_{X^{n}}),$

where (a) uses the $m^{\rm th}$ -order Markovian assumption; (b) is due to the convexity of the KL divergence; (c) uses the crucial fact that for all $t=1,\ldots,n-1$ , $(X_{n-t},\ldots,X_{n-1}){\stackrel{{\scriptstyle\rm law}}{{=}}}(X_{1},\ldots,X_{t})$ , thanks to stationarity; (d) follows from substituting $\widehat{P}_{t}(\cdot|x^{t-1})=Q_{X_{t}|X^{t-1}=x^{t-1}}$ , the Markovian assumption $P_{X_{t}|X^{t-1}_{t-m}}=P_{X_{t}|X^{t-1}}$ , and rewriting the expectation as conditional KL divergence; (e) is by the chain rule (27) of KL divergence. Since the above holds for any $\theta\in\Theta$ , the desired (26) follows which implies that $\mathsf{Risk}_{n-1}\leq\frac{\mathsf{Red}_{n}}{n-m}$ . Finally, $\mathsf{Risk}_{n-1}\leq\mathsf{Risk}_{n}$ follows from $\mathbb{E}[D(P_{X_{n+1|X_{n}}}\|\widehat{P}_{n}(X_{2}^{n}))]=\mathbb{E}[D(P_{X_{n|X_{n-1}}}\|\widehat{P}_{n}(X_{1}^{n-1}))]$ , since $(X_{2},\ldots,X_{n})$ and $(X_{1},\ldots,X_{n-1})$ are equal in law. ∎

Remark 3.

Alternatively, Lemma 6 also follows from the mutual information representation (17) and (22). Indeed, by the chain rule for mutual information,

I(\theta;X^{n})=\sum_{t=1}^{n}I(\theta;X_{t}|X^{t-1}),

(29)

taking the supremum over $\pi$ (the distribution of $\theta$ ) on both sides yields (17). For (22), it suffices to show that $I(\theta;X_{t}|X^{t-1})$ is decreasing in $t$ : for any $\theta\sim\pi$ ,

\displaystyle I(\theta;X_{n+1}|X^{n})=

\displaystyle~{}\mathbb{E}\log\frac{P_{X_{n+1}|X^{n},\theta}}{P_{X_{n+1}|X^{n}}}=\mathbb{E}\log\frac{P_{X_{n+1}|X^{n},\theta}}{P_{X_{n+1}|X^{n}_{2}}}+\underbrace{\mathbb{E}\log\frac{P_{X_{n+1}|X^{n}_{2}}}{P_{X_{n+1}|X^{n}}}}_{-I(X_{1};X_{n+1}|X_{2}^{n})},

and the first term is

\mathbb{E}\log\frac{P_{X_{n+1}|X^{n},\theta}}{P_{X_{n+1}|X^{n}_{2}}}=\mathbb{E}\log\frac{P_{X_{n+1}|X^{n}_{n-m+1},\theta}}{P_{X_{n+1}|X^{n}_{2}}}=\mathbb{E}\log\frac{P_{X_{n}|X^{n-1}_{n-m},\theta}}{P_{X_{n}|X^{n-1}}}=I(\theta;X_{n}|X^{n-1})

where the first and second equalities follow from the $m^{\rm th}$ -order Markovity and stationarity, respectively. Taking supremum over $\pi$ yields $\mathsf{Risk}_{n}\leq\mathsf{Risk}_{n-1}$ . Finally, by the chain rule (29), we have $I(\theta;X^{n})\geq(n-m)I(\theta;X_{n}|X^{n-1})$ , yielding $\mathsf{Risk}_{n-1}\leq\frac{\mathsf{Red}_{n}}{n-m}$ .

2.2 Proof of the upper bound part of Theorem 1

Specializing to first-order stationary Markov chains with $k$ states, we denote the redundancy and prediction risk in (16) and (19) by $\mathsf{Red}_{k,n}$ and $\mathsf{Risk}_{k,n}$ , the latter of which is precisely the quantity previously defined in (1). Applying Lemma 6 yields $\mathsf{Risk}_{k,n}\leq\frac{1}{n-1}\mathsf{Red}_{k,n}$ . To upper bound $\mathsf{Red}_{k,n}$ , consider the following probability assignment:

\displaystyle Q(x_{1},\cdots,x_{n})=\frac{1}{k}\prod_{t=1}^{n-1}\widehat{M}_{x^{t}}^{+1}(x_{t+1}|x_{t})

(30)

where $\widehat{M}^{+1}$ is the add-one estimator defined in (5). This $Q$ factorizes as $Q(x_{1})=\frac{1}{k}$ and $Q(x_{t+1}|x^{t})=\widehat{M}_{x^{t}}^{+1}(x_{t+1}|x_{t})$ . The following lemma bounds the redundancy of $Q$ :

Lemma 7.

$\mathsf{Red}(Q)\leq k(k-1)\left[\log\left(1+\frac{n-1}{k(k-1)}\right)+1\right]+\log k.$

Combined with Lemma 6, Lemma 7 shows that $\mathsf{Risk}_{k,n}\leq C\frac{k^{2}}{n}\log\frac{n}{k^{2}}$ for all $k\leq\sqrt{n/C}$ and some universal constant $C$ , achieved by the estimator (6), which is obtained by applying the rule (25) to (30).

It remains to show Lemma 7. To do so, we in fact bound the pointwise redundancy of the add-one probability assignment (30) over all (not necessarily stationary) Markov chains on $k$ states. The proof is similar to those of [CS04, Theorems 6.3 and 6.5], which, in turn, follow the arguments of [DMPW81, Sec. III-B].

Proof.

We show that for every Markov chain with transition matrix $M$ and initial distribution $\pi$ , and every trajectory $(x_{1},\cdots,x_{n})$ , it holds that

\displaystyle\log\frac{\pi(x_{1})\prod_{t=1}^{n-1}M(x_{t+1}|x_{t})}{Q(x_{1},\cdots,x_{n})}\leq k(k-1)\left[\log\left(1+\frac{n}{k(k-1)}\right)+1\right]+\log k,

(31)

where we abbreviate the add-one estimator $M_{x^{t}}(x_{t+1}|x_{t})$ defined in (5) as $M(x_{t+1}|x_{t})$ .

To establish (31), note that $Q(x_{1},\cdots,x_{n})$ could be equivalently expressed using the empirical counts $N_{i}$ and $N_{ij}$ in (4) as

\displaystyle Q(x_{1},\cdots,x_{n})=\frac{1}{k}\prod_{i=1}^{k}\frac{\prod_{j=1}^{k}N_{ij}!}{k\cdot(k+1)\cdot\cdots\cdot(N_{i}+k-1)}.

Note that

\prod_{t=1}^{n-1}M(x_{t+1}|x_{t})=\prod_{i=1}^{k}\prod_{j=1}^{k}M(j|i)^{N_{ij}}\leq\prod_{i=1}^{k}\prod_{j=1}^{k}(N_{ij}/N_{i})^{N_{ij}},

where the inequality follows from $\sum_{j}\frac{N_{ij}}{N_{i}}\log\frac{N_{ij}/N_{i}}{M(j|i)}\geq 0$ for each $i$ , by the nonnegativity of the KL divergence. Therefore, we have

\displaystyle\frac{\pi(x_{1})\prod_{t=1}^{n-1}M(x_{t+1}|x_{t})}{Q(x_{1},\cdots,x_{n})}\leq k\cdot\prod_{i=1}^{k}\frac{k\cdot(k+1)\cdot\cdots\cdot(N_{i}+k-1)}{N_{i}^{N_{i}}}\prod_{j=1}^{k}\frac{N_{ij}^{N_{ij}}}{N_{ij}!}.

(32)

We claim that: for $n_{1},\cdots,n_{k}\in\mathbb{Z}_{+}$ and $n=\sum_{i=1}^{k}n_{i}\in\mathbb{N}$ , it holds that

\displaystyle\prod_{i=1}^{k}\left(\frac{n_{i}}{n}\right)^{n_{i}}\leq\frac{\prod_{i=1}^{k}n_{i}!}{n!},

(33)

with the understanding that $(\frac{0}{n})^{0}=0!=1$ . Applying this claim to (32) gives

	$\displaystyle\log\frac{\pi(x_{1})\prod_{t=1}^{n-1}M(x_{t+1}\|x_{t})}{Q(x_{1},\cdots,x_{n})}$	$\displaystyle\leq\log k+\sum_{i=1}^{k}\log\frac{k\cdot(k+1)\cdot\cdots\cdot(N_{i}+k-1)}{N_{i}!}$
		$\displaystyle=\log k+\sum_{i=1}^{k}\sum_{\ell=1}^{N_{i}}\log\left(1+\frac{k-1}{\ell}\right)$
		$\displaystyle\leq\log k+\sum_{i=1}^{k}\int_{0}^{N_{i}}\log\left(1+\frac{k-1}{x}\right)dx$
		$\displaystyle=\log k+\sum_{i=1}^{k}\left((k-1)\log\left(1+\frac{N_{i}}{k-1}\right)+N_{i}\log\left(1+\frac{k-1}{N_{i}}\right)\right)$
		$\displaystyle\overset{\rm(a)}{\leq}k(k-1)\log\left(1+\frac{n-1}{k(k-1)}\right)+k(k-1)+\log k,$

where (a) follows from the concavity of $x\mapsto\log x$ , $\sum_{i=1}^{k}N_{i}=n-1$ , and $\log(1+x)\leq x$ .

It remains to justify (33), which has a simple information-theoretic proof: Let $T$ denote the collection of sequences $x^{n}$ in $[k]^{n}$ whose type is given by $(n_{1},\ldots,n_{k})$ . Namely, for each $x^{n}\in T$ , $i$ appears exactly $n_{i}$ times for each $i\in[k]$ . Let $(X_{1},\ldots,X_{n})$ be drawn uniformly at random from the set $T$ . Then

\log\frac{n!}{\prod_{i=1}^{k}n_{i}!}=H(X_{1},\ldots,X_{n})\overset{\rm(a)}{\leq}\sum_{j=1}^{n}H(X_{j})\overset{\rm(b)}{=}n\sum_{i=1}^{k}\frac{n_{i}}{n}\log\frac{n}{n_{i}},

where (a) follows from the fact that the joint entropy is at most the sum of marginal entropies; (b) is because each $X_{j}$ is distributed as $(\frac{n_{1}}{n},\ldots,\frac{n_{k}}{n})$ . ∎

3 Optimal rates without spectral gap

In this section, we prove the lower bound part of Theorem 1, which shows the optimality of the average version of the add-one estimator (25). We first describe the lower bound construction for three-state chains, which is subsequently extended to $k$ states.

3.1 Warmup: an $\Omega(\frac{\log n}{n})$ lower bound for three-state chains

Theorem 8.

$\mathsf{Risk}_{3,n}=\Omega\left(\frac{\log n}{n}\right).$

To show Theorem 8, consider the following one-parameter family of transition matrices:

{\mathcal{M}}=\left\{M_{p}=\left[\begin{matrix}1-\frac{2}{n}&\frac{1}{n}&\frac{1}{n}\\ \frac{1}{n}&1-\frac{1}{n}-p&p\\ \frac{1}{n}&p&1-\frac{1}{n}-p\end{matrix}\right]\colon 0\leq p\leq 1-\frac{1}{n}\right\}.

(34)

Note that each transition matrix in ${\mathcal{M}}$ is symmetric (hence doubly stochastic), whose corresponding chain is reversible with a uniform stationary distribution and spectral gap $\Theta(\frac{1}{n})$ ; see Fig. 1.

Figure 1: Lower bound construction for three-state chains.

The main idea is as follows. Notice that by design, with constant probability, the trajectory is of the following form: The chain starts and stays at state 1 for $t$ steps, and then transitions into state 2 or 3 and never returns to state 1, where $t=1,\ldots,n-1$ . Since $p$ is the single unknown parameter, the only useful observations are visits to state $2$ and $3$ and each visit entails one observation about $p$ by flipping a coin with bias roughly $p$ . Thus the effective sample size for estimating $p$ is $n-t-1$ and we expect the best estimation error is of the order of $\frac{1}{n-t}$ . However, $t$ is not fixed. In fact, conditioned on the trajectory is of this form, $t$ is roughly uniformly distributed between $1$ and $n-1$ . As such, we anticipate the estimation error of $p$ is approximately

\frac{1}{n-1}\sum_{i=1}^{n-1}\frac{1}{n-t}=\Theta\left(\frac{\log n}{n}\right).

Intuitively speaking, the construction in Fig. 1 “embeds” a symmetric two-state chain (with states 2 and 3) with unknown parameter $p$ into a space of three states, by adding a “nuisance” state 1, which effectively slows down the exploration of the useful part of the state space, so that in a trajectory of length $n$ , the effective number of observations we get to make about $p$ is roughly uniformly distributed between $1$ and $n$ . This explains the extra log factor in Theorem 8, which actually stems from the harmonic sum in $\mathbb{E}[\frac{1}{\mathrm{Uniform}([n])}]$ . We will fully explore this embedding idea in Section 3.2 to deal with larger state space.

Next we make the above intuition rigorous using a Bayesian argument. Let us start by recalling the following well-known lemma.

Lemma 9.

Let $q\sim\mathrm{Uniform}(0,1)$ . Conditioned on $q$ , let $N\sim\text{Binom}(m,q)$ . Then the Bayes estimator of $q$ given $N$ is the “add-one” estimator:

\mathbb{E}[q|N]=\frac{N+1}{m+2}

and the Bayes risk is given by

\mathbb{E}[(q-\mathbb{E}[q|N])^{2}]=\frac{1}{6(m+2)}.

Proof of Theorem 8.

Consider the following Bayesian setting: First, we draw $p$ uniformly at random from $[0,1-\frac{1}{n}]$ . Then, we generate the sample path $X^{n}=(X_{1},\ldots,X_{n})$ of a stationary (uniform) Markov chain with transition matrix $M_{p}$ as defined in (34). Define

\displaystyle\begin{gathered}{\mathcal{X}}_{t}=\{x^{n}:x_{1}=\ldots=x_{t}=1,x_{i}\neq 1,i=t+1,\ldots,n\},\quad t=1,\dots,n-1,\\ \quad{\mathcal{X}}=\cup_{t=1}^{n-1}{\mathcal{X}}_{t}.\end{gathered}

(37)

Let $\mu(x^{n}|p)=\mathbb{P}\left[X=x^{n}\right]$ . Then

\mu(x^{n}|p)=\frac{1}{3}\left(1-\frac{2}{n}\right)^{t-1}\frac{2}{n}p^{N(x^{n})}\left(1-\frac{1}{n}-p\right)^{n-t-1-N(x^{n})},\quad x^{n}\in{\mathcal{X}}_{t},

(38)

where $N(x^{n})$ denotes the number of transitions from state 2 to 3 or from 3 to 2. Then

	$\displaystyle\mathbb{P}\left[X^{n}\in{\mathcal{X}}_{t}\right]=$	$\displaystyle~{}\frac{1}{3}\left(1-\frac{2}{n}\right)^{t-1}\frac{2}{n}\sum_{k=0}^{n-t-1}\binom{n-t-1}{k}p^{k}\left(1-\frac{1}{n}-p\right)$
	$\displaystyle=$	$\displaystyle~{}\frac{1}{3}\left(1-\frac{2}{n}\right)^{t-1}\frac{2}{n}\left(1-\frac{1}{n}\right)^{n-t-1}=\frac{2}{3n}\left(1-\frac{1}{n}\right)^{n-2}\left(1-\frac{1}{n-1}\right)^{t-1}$		(39)

and hence

	$\displaystyle\mathbb{P}\left[X^{n}\in{\mathcal{X}}\right]=$	$\displaystyle~{}\sum_{t=1}^{n-1}\mathbb{P}\left[X^{n}\in{\mathcal{X}}_{t}\right]=\frac{2(n-1)}{3n}\left(1-\frac{1}{n}\right)^{n-2}\left(1-\left(1-\frac{1}{n-1}\right)^{n-1}\right)$		(40)
	$\displaystyle=$	$\displaystyle~{}\frac{2(1-1/e)}{3e}+o_{n}(1).$

Consider the Bayes estimator (for estimating $p$ under the mean-squared error)

\widehat{p}(x^{n})=\mathbb{E}[p|x^{n}]=\frac{\mathbb{E}[p\cdot\mu(x^{n}|p)]}{\mathbb{E}[\mu(x^{n}|p)]}.

For $x^{n}\in{\mathcal{X}}_{t}$ , using (38) we have

	$\displaystyle\widehat{p}(x^{n})=$	$\displaystyle~{}\frac{\mathbb{E}\left[p^{N(x^{n})+1}\left(1-\frac{1}{n}-p\right)^{n-t-1-N(x^{n})}\right]}{\mathbb{E}\left[p^{N(x^{n})}\left(1-\frac{1}{n}-p\right)^{n-t-1-N(x^{n})}\right]},\quad p\sim\mathrm{Uniform}\left(0,\frac{n-1}{n}\right)$
	$\displaystyle=$	$\displaystyle~{}\frac{n-1}{n}\frac{\mathbb{E}\left[U^{N(x^{n})+1}\left(1-U\right)^{n-t-1-N(x^{n})}\right]}{\mathbb{E}\left[U^{N(x^{n})}\left(1-U\right)^{n-t-1-N(x^{n})}\right]},\quad U\sim\mathrm{Uniform}(0,1)$
	$\displaystyle=$	$\displaystyle~{}\frac{n-1}{n}\frac{N(x^{n})+1}{n-t+1},$

where the last step follows from Lemma 9. From (38), we conclude that conditioned on $X^{n}\in{\mathcal{X}}_{t}$ and on $p$ , $N(X^{n})\sim\text{Binom}(n-t-1,q)$ , where $q=\frac{p}{1-\frac{1}{n}}\sim\mathrm{Uniform}(0,1)$ . Applying Lemma 9 (with $m=n-t-1$ and $N=N(X^{n})$ ), we get

	$\displaystyle\mathbb{E}[(p-\widehat{p}(X^{n}))^{2}\|X^{n}\in{\mathcal{X}}_{t}]=$	$\displaystyle~{}\left(\frac{n-1}{n}\right)^{2}\mathbb{E}\left[\left(q-\frac{N(x^{n})+1}{n-t+1}\right)^{2}\right]$
	$\displaystyle=$	$\displaystyle~{}\left(\frac{n-1}{n}\right)^{2}\frac{1}{6(n-t+1)}.$

Finally, note that conditioned on $X^{n}\in{\mathcal{X}}$ , the probability of $X^{n}\in{\mathcal{X}}_{t}$ is close to uniform. Indeed, from (39) and (40) we get

\mathbb{P}\left[X^{n}\in{\mathcal{X}}_{t}|{\mathcal{X}}\right]=\frac{1}{n-1}\frac{\left(1-\frac{1}{n-1}\right)^{t-1}}{1-\left(1-\frac{1}{n-1}\right)^{n-1}}\geq\frac{1}{n-1}\left(\frac{1}{e-1}+o_{n}(1)\right),\quad t=1,\ldots,n-1.

Thus

	$\displaystyle\mathbb{E}[(p-\widehat{p}(X^{n}))^{2}{\mathbf{1}_{\left\{{X^{n}\in{\mathcal{X}}}\right\}}}]=$	$\displaystyle~{}\mathbb{P}\left[X^{n}\in{\mathcal{X}}\right]\sum_{t=1}^{n-1}\mathbb{E}[(p-\widehat{p}(X^{n}))^{2}\|X^{n}\in{\mathcal{X}}_{t}]\mathbb{P}\left[X^{n}\in{\mathcal{X}}_{t}\|{\mathcal{X}}\right]$
	$\displaystyle\gtrsim$	$\displaystyle~{}\frac{1}{n-1}\sum_{t=1}^{n-1}\frac{1}{n-t+1}=\Theta\left(\frac{\log n}{n}\right).$		(41)

Finally, we relate (41) formally to the minimax prediction risk under the KL divergence. Consider any predictor $\widehat{M}(\cdot|i)$ (as a function of the sample path $X$ ) for the $i$ th row of $M$ , $i=1,2,3$ . By Pinsker inequality, we conclude that

\displaystyle D(M(\cdot|2)\|\widehat{M}(\cdot|2))\geq\frac{1}{2}\|M(\cdot|2)-\widehat{M}(\cdot|2)\|_{\ell_{1}}^{2}\geq\frac{1}{2}(p-\widehat{M}(3|2))^{2}

(42)

and similarly, $D(M(\cdot|3)\|\widehat{M}(\cdot|3))\geq\frac{1}{2}(p-\widehat{M}(2|3))^{2}$ . Abbreviate $\widehat{M}(3|2)\equiv\widehat{p}_{2}$ and $\widehat{M}(2|3)\equiv\widehat{p}_{3}$ , both functions of $X$ . Taking expectations over both $p$ and $X$ , the Bayes prediction risk can be bounded as follows

		$\displaystyle\sum_{i=1}^{3}\mathbb{E}[D(M(\cdot\|i)\\|\widehat{M}(\cdot\|i)){\mathbf{1}_{\left\{{X_{n}=i}\right\}}}]$
	$\displaystyle\geq$	$\displaystyle~{}\frac{1}{2}\mathbb{E}[(p-\widehat{p}_{2})^{2}{\mathbf{1}_{\left\{{X_{n}=2}\right\}}}+(p-\widehat{p}_{3})^{2}{\mathbf{1}_{\left\{{X_{n}=3}\right\}}}]$
	$\displaystyle\geq$	$\displaystyle~{}\frac{1}{2}\sum_{x\in{\mathcal{X}}}\mu(x^{n})\left(\mathbb{E}[(p-\widehat{p}_{2})^{2}\|X=x^{n}]{\mathbf{1}_{\left\{{x_{n}=2}\right\}}}+\mathbb{E}[(p-\widehat{p}_{3})^{2}\|X=x^{n}]{\mathbf{1}_{\left\{{x_{n}=3}\right\}}}\right)$
	$\displaystyle\geq$	$\displaystyle~{}\frac{1}{2}\sum_{x^{n}\in{\mathcal{X}}}\mu(x^{n})\mathbb{E}[(p-\widehat{p}(x^{n}))^{2}\|X=x^{n}]({\mathbf{1}_{\left\{{x_{n}=2}\right\}}}+{\mathbf{1}_{\left\{{x_{n}=3}\right\}}})$
	$\displaystyle=$	$\displaystyle~{}\frac{1}{2}\sum_{x^{n}\in{\mathcal{X}}}\mu(x^{n})\mathbb{E}[(p-\widehat{p}(x^{n}))^{2}\|X=x^{n}]$
	$\displaystyle=$	$\displaystyle~{}\frac{1}{2}\mathbb{E}[(p-\widehat{p}(X))^{2}{\mathbf{1}_{\left\{{X\in{\mathcal{X}}}\right\}}}]\overset{(\ref{eq:pbayes})}{=}\Theta\left(\frac{\log n}{n}\right).$

∎

3.2 $k$ -state chains

The lower bound construction for $3$ -state chains in Section 3.1 can be generalized to $k$ -state chains. The high-level argument is again to augment a $(k-1)$ -state chain into a $k$ -state chain. Specifically, we partition the state space $[k]$ into two sets ${\mathcal{S}}_{1}=\{1\}$ and ${\mathcal{S}}_{2}=\{2,3,\cdots,k\}$ . Consider a $k$ -state Markov chain such that the transition probabilities from ${\mathcal{S}}_{1}$ to ${\mathcal{S}}_{2}$ , and from ${\mathcal{S}}_{2}$ to ${\mathcal{S}}_{1}$ , are both very small (on the order of $\Theta(1/n)$ ). At state $1$ , the chain either stays at $1$ with probability $1-1/n$ or moves to one of the states in ${\mathcal{S}}_{2}$ with equal probability $\frac{1}{n(k-1)}$ ; at each state in ${\mathcal{S}}_{2}$ , the chain moves to $1$ with probability $\frac{1}{n}$ ; otherwise, within the state subspace ${\mathcal{S}}_{2}$ , the chain evolves according to some symmetric transition matrix $T$ . (See Fig. 2 in Section 3.2.1 for the precise transition diagram.)

The key feature of such a chain is as follows. Let ${\mathcal{X}}_{t}$ be the event that $X_{1},X_{2},\cdots,X_{t}\in{\mathcal{S}}_{1}$ and $X_{t+1},\cdots,X_{n}\in{\mathcal{S}}_{2}$ . For each $t\in[n-1]$ , one can show that $\mathbb{P}({\mathcal{X}}_{t})\geq c/n$ for some absolute constant $c>0$ . Moreover, conditioned on the event ${\mathcal{X}}_{t}$ , $(X_{t+1},\ldots,X_{n})$ is equal in law to a stationary Markov chain $(Y_{1},\cdots,Y_{n-t})$ on state space ${\mathcal{S}}_{2}$ with symmetric transition matrix $T$ . It is not hard to show that estimating $M$ and $T$ are nearly equivalent. Consider the Bayesian setting where $T$ is drawn from some prior. We have

\displaystyle\inf_{\widehat{M}}\mathbb{E}_{T}\left[\mathbb{E}[D(M(\cdot|X_{n})\|\widehat{M}(\cdot|X_{n}))|{\mathcal{X}}_{t}]\right]\approx\inf_{\widehat{T}}\mathbb{E}_{T}\left[\mathbb{E}[D(T(\cdot|Y_{n-t})\|\widehat{T}(\cdot|Y_{n-t}))]\right]=I(T;Y_{n-t+1}|Y^{n-t}),

where the last equality follows from the representation (20) of Bayes prediction risk as conditional mutual information. Lower bounding the minimax risk by the Bayes risk, we have

$\displaystyle\mathsf{Risk}_{k,n}$	$\displaystyle\geq\inf_{\widehat{M}}\mathbb{E}_{T}\left[\mathbb{E}[D(M(\cdot\|X_{n})\\|\widehat{M}(\cdot\|X_{n}))]\right]$
	$\displaystyle\geq\inf_{\widehat{M}}\sum_{t=1}^{n-1}\mathbb{E}_{M}\left[\mathbb{E}[D(M(\cdot\|X_{n})\\|\widehat{M}(\cdot\|X_{n}))\|{\mathcal{X}}_{t}]\cdot\mathbb{P}({\mathcal{X}}_{t})\right]$
	$\displaystyle\geq\frac{c}{n}\cdot\sum_{t=1}^{n-1}\inf_{\widehat{M}}\mathbb{E}_{M}\left[\mathbb{E}[D(M(\cdot\|X_{n})\\|\widehat{M}(\cdot\|X_{n}))\|{\mathcal{X}}_{t}]\right]$
	$\displaystyle\approx\frac{c}{n}\cdot\sum_{t=1}^{n-1}I(T;Y_{n-t+1}\|Y^{n-t})=\frac{c}{n}\cdot(I(T;Y^{n})-I(T;Y_{1})).$	(43)

Note that $I(T;Y_{1})\leq H(Y_{1})\leq\log(k-1)$ since $Y_{1}$ takes values in ${\mathcal{S}}_{2}$ . Maximizing the right hand side over the prior $P_{T}$ and recalling the dual representation for redundancy in (17), the above inequality (3.2) leads to a risk lower bound of $\mathsf{Risk}_{k,n}\gtrsim\frac{1}{n}(\mathsf{Red}_{k-1,n}^{\sf sym}-\log k)$ , where $\mathsf{Red}_{k-1,n}^{\sf sym}=\sup I(T;Y_{1})$ is the redundancy for symmetric Markov chains with $k-1$ states and sample size $n$ . Since symmetric transition matrices still have $\Theta(k^{2})$ degrees of freedom, it is expected that $\mathsf{Red}_{k,n}^{\sf sym}\asymp k^{2}\log\frac{n}{k^{2}}$ for $n\gtrsim k^{2}$ , so that (3.2) yields the desired lower bound $\mathsf{Risk}_{k,n}=\Omega(\frac{k^{2}}{n}\log\frac{n}{k^{2}})$ in Theorem 1.

Next we rigorously carry out the lower bound proof sketched above: In Section 3.2.1, we explicitly construct the $k$ -state chain which satisfies the desired properties in Section 3.2. In Section 3.2.2, we make the steps in (3.2) precise and bound the Bayes risk from below by an appropriate mutual information. In Section 3.2.3, we choose a prior distribution on the transition probabilities and prove a lower bound on the resulting mutual information, thereby completing the proof of Theorem 1, with the added bonus that the construction is restricted to irreducible and reversible chains.

3.2.1 Construction of the $k$ -state chain

We construct a $k$ -state chain with the following transition probability matrix:

\displaystyle M=\left[\begin{matrix}1-\frac{1}{n}&\begin{matrix}\frac{1}{n(k-1)}&\frac{1}{n(k-1)}&\cdots&\frac{1}{n(k-1)}\end{matrix}\\ \begin{matrix}1/n\\ 1/n\\ \vdots\\ 1/n\end{matrix}&\mbox{\LARGE$\left(1-\frac{1}{n}\right)T$}\end{matrix}\right],

(44)

where $T\in\mathbb{R}^{{\mathcal{S}}_{2}\times{\mathcal{S}}_{2}}$ is a symmetric stochastic matrix to be chosen later. The transition diagram of $M$ is shown in Figure 2. One can also verify that the spectral gap of $M$ is $\Theta(\frac{1}{n})$ .

Figure 2: Lower bound construction for

k

-state chains. Solid arrows represent transitions within

{\mathcal{S}}_{1}

and

{\mathcal{S}}_{2}

, and dashed arrows represent transitions between

{\mathcal{S}}_{1}

and

{\mathcal{S}}_{2}

. The double-headed arrows denote transitions in both directions with equal probabilities.

Let $(X_{1},\ldots,X_{n})$ be the trajectory of a stationary Markov chain with transition matrix $M$ . We observe the following properties:

(P1)

This Markov chain is irreducible and reversible, with stationary distribution $(\frac{1}{2},\frac{1}{2(k-1)},\cdots,\frac{1}{2(k-1)})$ ;

(P2)

For $t\in[n-1]$ , let ${\mathcal{X}}_{t}$ denote the collections of trajectories $x^{n}$ such that $x_{1},x_{2},\cdots,x_{t}\in{\mathcal{S}}_{1}$ and $x_{t+1},\cdots,x_{n}\in{\mathcal{S}}_{2}$ . Then

	$\displaystyle\mathbb{P}(X^{n}\in{\mathcal{X}}_{t})$	$\displaystyle=\mathbb{P}(X_{1}=\cdots=X_{t}=1)\cdot\mathbb{P}(X_{t+1}\neq 1\|X_{t}=1)\cdot\prod_{s=t+1}^{n-1}\mathbb{P}(X_{s+1}\neq 1\|X_{s}\neq 1)$
		$\displaystyle=\frac{1}{2}\cdot\left(1-\frac{1}{n}\right)^{t-1}\cdot\frac{1}{n}\cdot\left(1-\frac{1}{n}\right)^{n-1-t}\geq\frac{1}{2en}.$		(45)

Moreover, this probability does not depend of the choice of $T$ ;

(P3)

Conditioned on the event that $X^{n}\in{\mathcal{X}}_{t}$ , the trajectory $(X_{t+1},\cdots,X_{n})$ has the same distribution as a length- $(n-t)$ trajectory of a stationary Markov chain with state space ${\mathcal{S}}_{2}=\{2,3,\cdots,k\}$ and transition probability $T$ , and the uniform initial distribution. Indeed,

	$\displaystyle\mathbb{P}\left[X_{t+1}=x_{t+1},\ldots,X_{n}=x_{n}\|X^{n}\in{\mathcal{X}}_{t}\right]=$	$\displaystyle~{}\frac{\frac{1}{2}\cdot\left(1-\frac{1}{n}\right)^{t-1}\cdot\frac{1}{n(k-1)}\prod_{s=t+1}^{n-1}M(x_{s+1}\|x_{s})}{\frac{1}{2}\cdot\left(1-\frac{1}{n}\right)^{t-1}\cdot\frac{1}{n}\cdot\left(1-\frac{1}{n}\right)^{n-1-t}}$
	$\displaystyle=$	$\displaystyle~{}\frac{1}{k-1}\prod_{s=t+1}^{n-1}T(x_{s+1}\|x_{s}).$

3.2.2 Reducing the Bayes prediction risk to redundancy

Let ${\mathcal{M}}_{k-1}^{\mathsf{sym}}$ be the collection of all symmetric transition matrices on state space ${\mathcal{S}}_{2}=\{2,\ldots,k\}$ . Consider a Bayesian setting where the transition matrix $M$ is constructed in (44) and the submatrix $T$ is drawn from an arbitrary prior on ${\mathcal{M}}_{k-1}^{\mathsf{sym}}$ . The following lemma lower bounds the Bayes prediction risk.

Lemma 10.

Conditioned on $T$ , let $Y^{n}=(Y_{1},\ldots,Y_{n})$ denote a stationary Markov chain on state space ${\mathcal{S}}_{2}$ with transition matrix $T$ and uniform initial distribution. Then

\displaystyle\inf_{\widehat{M}}\mathbb{E}_{T}\left[\mathbb{E}[D(M(\cdot|X_{n})\|\widehat{M}(\cdot|X_{n}))]\right]\geq\frac{n-1}{2en^{2}}\left(I(T;Y^{n})-\log(k-1)\right).

Lemma 10 is the formal statement of the inequality (3.2) presented in the proof sketch. Maximizing the lower bound over the prior on $T$ and in view of the mutual information representation (17), we obtain the following corollary.

Corollary 11.

Let $\mathsf{Risk}_{k,n}^{\mathsf{sym}}$ denote the minimax prediction risk for stationary irreducible and reversible Markov chains on $k$ states and $\mathsf{Red}_{k,n}^{\mathsf{sym}}$ the redundancy for stationary symmetric Markov chains on $k$ states. Then

\displaystyle\mathsf{Risk}_{k,n}^{\sf rev}\geq\frac{n-1}{2en^{2}}(\mathsf{Red}_{k-1,n}^{\mathsf{sym}}-\log(k-1)).

We make use of the properties (P1)–(P3) in Section 3.2.1 to prove Lemma 10.

Proof of Lemma 10.

Recall that in the Bayesian setting, we first draw $T$ from some prior on ${\mathcal{M}}_{k-1}^{\mathsf{sym}}$ , then generate the stationary Markov chain $X^{n}=(X_{1},\ldots,X_{n})$ with state space $[k]$ and transition matrix $M$ in (44), and $(Y_{1},\ldots,Y_{n})$ with state space ${\mathcal{S}}_{2}=\{2,\ldots,k\}$ and transition matrix $T$ .

We first relate the Bayes estimator of $M$ and $T$ (given the $X$ and $Y$ chain respectively). For clarity, we spell out the explicit dependence of the estimators on the input trajectory. For each $t\in[n]$ , denote by $\widehat{M}_{t}=\widehat{M}_{t}(\cdot|x^{t})$ the Bayes estimator of $M(\cdot|x_{t})$ give $X^{t}=x^{t}$ , and $\widehat{T}_{t}(\cdot|y^{t})$ the Bayes estimator of $T(\cdot|y_{t})$ give $Y^{t}=y^{t}$ . For each $t=1,\ldots,n-1$ and for each trajectory $x^{n}=(1,\ldots,1,x_{t+1},\ldots,x_{n})\in{\mathcal{X}}_{t}$ , recalling the form (21) of the Bayes estimator, we have, for each $j\in{\mathcal{S}}_{2}$ ,

	$\displaystyle\widehat{M}_{n}(j\|x^{n})=$	$\displaystyle~{}\frac{\mathbb{P}\left[X^{n+1}=(x^{n},j)\right]}{\mathbb{P}\left[X^{n}=x^{n}\right]}$
	$\displaystyle=$	$\displaystyle~{}\frac{\mathbb{E}[\frac{1}{2}M(1\|1)^{t-1}M(x_{t+1}\|1)M(x_{t+2}\|x_{t+1})\ldots M(x_{n}\|x_{n-1})M(j\|x_{n})]}{\mathbb{E}[\frac{1}{2}M(1\|1)^{t-1}M(x_{t+1}\|1)M(x_{t+2}\|x_{t+1})\ldots M(x_{n}\|x_{n-1})]}$
	$\displaystyle=$	$\displaystyle~{}\left(1-\frac{1}{n}\right)\frac{\mathbb{E}[T(x_{t+2}\|x_{t+1})\ldots T(x_{n}\|x_{n-1})T(j\|x_{n})]}{\mathbb{E}[T(x_{t+2}\|x_{t+1})\ldots T(x_{n}\|x_{n-1})]}$
	$\displaystyle=$	$\displaystyle~{}\left(1-\frac{1}{n}\right)\widehat{T}_{n-t}(j\|x_{t+1}^{n}),$

where we used the stationary distribution of $X$ in (P1) and the uniformity of the stationary distribution of $Y$ , neither of which depends on $T$ . Furthermore, by construction in (44), $\widehat{M}_{n}(1|x^{n})=\frac{1}{n}$ is deterministic. In all, we have

\widehat{M}_{n}(\cdot|x^{n})=\frac{1}{n}\delta_{1}+\left(1-\frac{1}{n}\right)\widehat{T}_{n-t}(\cdot|x_{t+1}^{n}),\quad x^{n}\in{\mathcal{X}}_{t}.

(46)

with $\delta_{1}$ denoting the point mass at state 1, which parallels the fact that

M(\cdot|x)=\frac{1}{n}\delta_{1}+\left(1-\frac{1}{n}\right)T(\cdot|x),\quad x\in{\mathcal{S}}_{2}.

(47)

By (P2), each event $\{X^{n}\in{\mathcal{X}}_{t}\}$ occurs with probability at least $1/(2en)$ , and is independent of $T$ . Therefore,

\displaystyle\mathbb{E}_{T}\left[\mathbb{E}[D(M(\cdot|X_{n})\|\widehat{M}(\cdot|X^{n}))]\right]\geq\frac{1}{2en}\sum_{t=1}^{n-1}\mathbb{E}_{T}\left[\mathbb{E}[D(M(\cdot|X_{n})\|\widehat{M}(\cdot|X^{n}))|X^{n}\in{\mathcal{X}}_{t}]\right].

(48)

By (P3), the conditional joint law of $(T,X_{t+1},\ldots,X_{n})$ on the event $\{X^{n}\in{\mathcal{X}}_{t}\}$ is the same as the joint law of $(T,Y_{1},\ldots,Y_{n-t})$ . Thus, we may express the Bayes prediction risk in the $X$ chain as

	$\displaystyle\mathbb{E}_{T}\left[\mathbb{E}[D(M(\cdot\|X_{n})\\|\widehat{M}(\cdot\|X^{n}))\|X^{n}\in{\mathcal{X}}_{t}]\right]$	$\displaystyle\overset{\rm(a)}{=}\left(1-\frac{1}{n}\right)\cdot\mathbb{E}_{T}\left[\mathbb{E}[D(T(\cdot\|Y_{n-t})\\|\widehat{T}(\cdot\|Y^{n-t}))]\right]$
		$\displaystyle\overset{\rm(b)}{=}\left(1-\frac{1}{n}\right)\cdot I(T;Y_{n-t+1}\|Y^{n-t}),$		(49)

where (a) follows from (46), (47), and the fact that for distributions $P,Q$ supported on ${\mathcal{S}}_{2}$ , $D(\epsilon\delta_{1}+(1-\epsilon)P\|\epsilon\delta_{1}+(1-\epsilon)Q)=(1-\epsilon)D(P\|Q)$ ; (b) is the mutual information representation (20) of the Bayes prediction risk. Finally, the lemma follows from (48), (3.2.2), and the chain rule

\displaystyle\sum_{t=1}^{n-1}I(T;Y_{n-t+1}|Y^{n-t})=I(T;Y^{n})-I(T;Y_{1})\geq I(T;Y^{n})-\log(k-1),

as $I(T;Y_{1})\leq H(Y_{1})\leq\log(k-1)$ . ∎

3.2.3 Prior construction and lower bounding the mutual information

In view of Lemma 10, it remains to find a prior on ${\mathcal{M}}_{k-1}^{\mathsf{sym}}$ for $T$ , such that the mutual information $I(T;Y^{n})$ is large. We make use of the connection identified in [DMPW81, Dav83, Ris84] between estimation error and mutual information (see also [CS04, Theorem 7.1] for a self-contained exposition). To lower the mutual information, a key step is to find a good estimator $\widehat{T}(Y^{n})$ of $T$ . This is carried out in the following lemma.

Lemma 12.

In the setting of Lemma 10, suppose that $T\in{\mathcal{M}}_{k}^{\mathsf{sym}}$ with $T_{ij}\in[\frac{1}{2k},\frac{3}{2k}]$ for all $i,j\in[k]$ . Then there is an estimator $\widehat{T}$ based on $Y^{n}$ such that

\displaystyle\mathbb{E}[\|\widehat{T}-T\|_{\mathsf{F}}^{2}]\leq\frac{16k^{2}}{n-1},

where $\|\widehat{T}-T\|_{\mathsf{F}}=\sqrt{\sum_{ij}(\widehat{T}_{ij}-T_{ij})^{2}}$ denotes the Frobenius norm.

We show how Lemma 12 leads to the desired lower bound on the mutual information $I(T;Y^{n})$ . Since $k\geq 3$ , we may assume that $k-1=2k_{0}$ is an even integer. Consider the following prior distribution $\pi$ on $T$ : let $u=(u_{i,j})_{i,j\in[k_{0}],i\leq j}$ be iid and uniformly distributed in $[1/(4k_{0}),3/(4k_{0})]$ , and $u_{i,j}=u_{j,i}$ for $i>j$ . Let the transition matrix $T$ be given by

\displaystyle T_{2i-1,2j-1}=T_{2i,2j}=u_{i,j},\quad T_{2i-1,2j}=T_{2i,2j-1}=\frac{1}{k_{0}}-u_{i,j},\quad\forall i,j\in[k].

(50)

It is easy to verify that $T$ is symmetric and a stochastic matrix, and each entry of $T$ is supported in the interval $[1/(4k_{0}),3/(4k_{0})]$ . Since $2k_{0}=k-1$ , the condition of Lemma 12 is fulfilled, so there exist estimators $\widehat{T}(Y^{n})$ and $\widehat{u}(Y^{n})$ such that

\displaystyle\mathbb{E}[\|\widehat{u}(Y^{n})-u\|_{2}^{2}]\leq\mathbb{E}[\|\widehat{T}(Y^{n})-T\|_{\mathsf{F}}^{2}]\leq\frac{64k_{0}^{2}}{n-1}.

(51)

Here and below, we identify $u$ and $\widehat{u}$ as $\frac{k_{0}(k_{0}+1)}{2}$ -dimensional vectors.

Let $h(X)=\int-f_{X}(x)\log f_{X}(x)dx$ denote the differential entropy of a continuous random vector $X$ with density $f_{X}$ w.r.t the Lebesgue measure and $h(X|Y)=\int-f_{XY}(xy)\log f_{X|Y}(x|y)dxdy$ the conditional differential entropy (cf. e.g. [CT06]). Then

\displaystyle h(u)=\sum_{i,j\in[k_{0}],i\leq j}h(u_{i,j})=-\frac{k_{0}(k_{0}+1)}{2}\log(2k_{0}).

(52)

Then

	$\displaystyle I(T;Y^{n})$	$\displaystyle\overset{\rm(a)}{=}I(u;Y^{n})$
		$\displaystyle\overset{\rm(b)}{\geq}I(u;\widehat{u}(Y^{n}))=h(u)-h(u\|\widehat{u}(Y^{n}))$
		$\displaystyle\overset{\rm(c)}{\geq}h(u)-h(u-\widehat{u}(Y^{n}))$
		$\displaystyle\overset{\rm(d)}{\geq}\frac{k_{0}(k_{0}+1)}{4}\log\left(\frac{n-1}{1024\pi ek_{0}^{2}}\right)\geq\frac{k^{2}}{16}\log\left(\frac{n-1}{256\pi ek^{2}}\right).$

where (a) is because $u$ and $T$ are in one-to-one correspondence by (50); (b) follows from the data processing inequality; (c) is because $h(\cdot)$ is translation invariant and concave; (d) follows from the maximum entropy principle [CT06]: $h(u-\widehat{u}(Y^{n}))\leq\frac{k_{0}(k_{0}+1)}{4}\log\left(\frac{2\pi e}{k_{0}(k_{0}+1)/2}\cdot\mathbb{E}[\|\widehat{u}(Y^{n})-u\|_{2}^{2}]\right)$ , which in turn is bounded by (51). Plugging this lower bound into Lemma 10 completes the lower bound proof of Theorem 1.

Proof of Lemma 12.

Since $T$ is symmetric, the stationary distribution is uniform, and there is a one-to-one correspondence between the joint distribution of $(Y_{1},Y_{2})$ and the transition probabilities. Motivated by this observation, consider the following estimator $\widehat{T}$ : for $i,j\in[k]$ , let

\displaystyle\widehat{T}_{ij}=k\cdot\frac{\sum_{t=1}^{n}{\mathbf{1}_{\left\{{Y_{t}=i,Y_{t+1}=j}\right\}}}}{n-1}.

Clearly $\mathbb{E}[\widehat{T}_{ij}]=k\cdot\mathbb{P}(Y_{1}=i,Y_{2}=j)=T_{ij}$ . The following variance bound is shown in [TJW18, Lemma 7, Lemma 8] using the concentration inequality of [Pau15]:

\displaystyle\mathrm{Var}(\widehat{T}_{ij})\leq k^{2}\cdot\frac{8T_{ij}k^{-1}}{\gamma_{*}(T)(n-1)},

where $\gamma_{*}(T)$ is the absolute spectral gap of $T$ defined in (8). Note that $T=k^{-1}\mathbf{J}+\Delta$ , where $\mathbf{J}$ is the all-one matrix and each entry of $\Delta$ lying in $[-1/(2k),1/(2k)]$ . Thus the spectral radius of $\Delta$ is at most $1/2$ and thus $\gamma_{*}(T)\geq 1/2$ . Consequently, we have

\displaystyle\mathbb{E}[\|\widehat{T}-T\|_{\mathsf{F}}^{2}]=\sum_{i,j\in[k]}\mathrm{Var}(\widehat{T}_{ij})\leq\sum_{i,j\in[k]}\frac{16kT_{ij}}{n-1}=\frac{16k^{2}}{n-1},

completing the proof. ∎

4 Spectral gap-dependent risk bounds

4.1 Two states

To show Theorem 2, let us prove a refined version. In addition to the absolute spectral gap defined in (8), define the spectral gap

\gamma\triangleq 1-\lambda_{2}

(53)

and ${\mathcal{M}}_{k}^{\prime}(\gamma_{0})$ the collection of transition matrices whose spectral gap exceeds $\gamma_{0}$ . Paralleling $\mathsf{Risk}_{k,n}(\gamma_{0})$ defined in (9), define $\mathsf{Risk}_{k,n}^{\prime}(\gamma_{0})$ as the minimax prediction risk restricted to $M\in{\mathcal{M}}_{k}^{\prime}(\gamma_{0})$ Since $\gamma\geq\gamma^{*}$ , we have ${\mathcal{M}}_{k}(\gamma_{0})\subseteq{\mathcal{M}}_{k}^{\prime}(\gamma_{0})$ and hence $\mathsf{Risk}_{k,n}^{\prime}(\gamma_{0})\geq\mathsf{Risk}_{k,n}(\gamma_{0})$ . Nevertheless, the next result shows that for $k=2$ they have the same rate:

Theorem 13 (Spectral gap dependent rates for binary chain).

For any $\gamma_{0}\in(0,1)$

\mathsf{Risk}_{2,n}(\gamma_{0})\asymp\mathsf{Risk}_{2,n}^{\prime}(\gamma_{0})\asymp\frac{1}{n}\max\left\{1,\log\log\left(\min\left\{n,\frac{1}{\gamma_{0}}\right\}\right)\right\}.

We first prove the upper bound on $\mathsf{Risk}_{2,n}^{\prime}$ . Note that it is enough to show

\displaystyle\mathsf{Risk}_{2,n}^{\prime}(\gamma_{0})\lesssim{\log\log\left(1/\gamma_{0}\right)\over n},\quad\text{if }n^{-0.9}\leq\gamma_{0}\leq e^{-e^{5}}.

(54)

Indeed, for any $\gamma_{0}\leq n^{-0.9}$ , the upper bound $O\left(\log\log n/n\right)$ proven in [FOPS16], which does not depend on the spectral gap, suffices; for any $\gamma_{0}>e^{-e^{5}}$ , by monotonicity we can use the upper bound $\mathsf{Risk}_{2,n}^{\prime}(e^{-e^{5}})$ .

We now define an estimator that achieves (54). Following [FOPS16], consider trajectories with a single transition, namely, $\left\{2^{n-\ell}1^{\ell},1^{n-\ell}2^{\ell}:1\leq\ell\leq n-1\right\}$ , where $2^{n-\ell}1^{\ell}$ denotes the trajectory $(x_{1},\cdots,x_{n})$ with $x_{1}=\cdots=x_{n-\ell}=2$ and $x_{n-\ell+1}=\cdots=x_{n}=1$ . We refer to this type of $x^{n}$ as step sequences. For all non-step sequences $x^{n}$ , we apply the add- $\frac{1}{2}$ estimator similar to (5), namely

\displaystyle\widehat{M}_{x^{n}}(j|i)=\frac{N_{ij}+\frac{1}{2}}{N_{i}+1},\qquad i,j\in\{1,2\},

where the empirical counts $N_{i}$ and $N_{ij}$ are defined in (4); for step sequences of the form $2^{n-\ell}1^{\ell}$ , we estimate by

\displaystyle{\widehat{M}_{\ell}(2|1)}={1/(\ell\log(1/\gamma_{0}))},\quad{\widehat{M}_{\ell}(1|1)}=1-{\widehat{M}_{\ell}(2|1)}.

(55)

The other type of step sequences $1^{n-\ell}2^{\ell}$ are dealt with by symmetry.

Due to symmetry it suffices to analyze the risk for sequences ending in 1. The risk of add- $\frac{1}{2}$ estimator for the non-step sequence $1^{n}$ is bounded as

	$\displaystyle\mathbb{E}\left[{\mathbf{1}_{\left\{{X^{n}=1^{n}}\right\}}}D({M(\cdot\|1)}\\|{\widehat{M}_{1^{n}}(\cdot\|1)})\right]$	$\displaystyle=P_{X^{n}}(1^{n})\left\{M(2\|1)\log\left(\frac{M(2\|1)}{1/(2n)}\right)+M(1\|1)\log\left(M(1\|1)\over(n-\frac{1}{2})/n\right)\right\}$
		$\displaystyle\leq(1-M(2\|1))^{n-1}\left\{2M(2\|1)^{2}n+\log\left({n\over n-\frac{1}{2}}\right)\right\}\lesssim\frac{1}{n}.$

where the last step followed by using $(1-x)^{n-1}x^{2}\leq n^{-2}$ with $x=M(2|1)$ and $\log x\leq x-1$ . From [FOPS16, Lemma 7,8] we have that the total risk of other non-step sequences is bounded from above by $O\left(\frac{1}{n}\right)$ and hence it is enough to analyze the risk for step sequences, and further by symmetry, those in $\left\{2^{n-\ell}1^{\ell}:1\leq\ell\leq n-1\right\}$ . The desired upper bound (54) then follows from Lemma 14 next.

Lemma 14.

For any $n^{-0.9}\leq\gamma_{0}\leq e^{-e^{5}}$ , $\widehat{M}_{\ell}(\cdot|1)$ in (55) satisfies

\sup_{M\in{\mathcal{M}}^{\prime}_{2}(\gamma_{0})}\sum_{\ell=1}^{n-1}\mathbb{E}\left[{\mathbf{1}_{\left\{{X^{n}=2^{n-\ell}1^{\ell}}\right\}}}D({M(\cdot|1)}\|{\widehat{M}_{\ell}(\cdot|1)})\right]\lesssim{\log\log(1/\gamma_{0})\over n}.

Proof.

For each $\ell$ using $\log\left(1\over{1-x}\right)\leq 2x,x\leq\frac{1}{2}$ with $x=\frac{1}{\ell\log(1/\gamma_{0})}$ ,

$\displaystyle D({M(\cdot\|1)}\\|{\widehat{M}_{\ell}(\cdot\|1)})$	$\displaystyle={M(1\|1)\log\left(M(1\|1)\over 1-\frac{1}{\ell{\log(1/\gamma_{0})}}\right)+{M(2\|1)}\log\left({M(2\|1)}\ell{\log(1/\gamma_{0})}\right)}$
	$\displaystyle\lesssim{1\over\ell{\log(1/\gamma_{0})}}+{M(2\|1)}\log(M(2\|1)\ell)+{M(2\|1)}\log{{\log(1/\gamma_{0})}}$
	$\displaystyle\leq{1\over\ell{\log(1/\gamma_{0})}}+M(2\|1)\log_{+}(M(2\|1)\ell)+M(2\|1){\log\log(1/\gamma_{0})},$	(56)

where we define $\log_{+}(x)=\max\{1,\log x\}$ . Recall the following Chebyshev’s sum inequality: for $a_{1}\leq a_{2}\leq\cdots\leq a_{n}$ and $b_{1}\geq b_{2}\geq\cdots\geq b_{n}$ , it holds that

\displaystyle\sum_{i=1}^{n}a_{i}b_{i}\leq\frac{1}{n}\left(\sum_{i=1}^{n}a_{i}\right)\left(\sum_{i=1}^{n}b_{i}\right).

The following inequalities are thus direct corollaries: for $x,y\in[0,1]$ ,

$\displaystyle\sum_{\ell=1}^{n-1}x(1-x)^{n-\ell-1}y(1-y)^{\ell-1}$	$\displaystyle\leq\frac{1}{n-1}\left(\sum_{\ell=1}^{n-1}x(1-x)^{n-\ell-1}\right)\left(\sum_{\ell=1}^{n-1}y(1-y)^{\ell-1}\right)$
	$\displaystyle\leq\frac{1}{n-1},$	(57)
$\displaystyle\sum_{\ell=1}^{n-1}x(1-x)^{n-\ell-1}y(1-y)^{\ell-1}\log_{+}(\ell y)$	$\displaystyle\leq\frac{1}{n-1}\left(\sum_{\ell=1}^{n-1}x(1-x)^{n-\ell-1}\right)\left(\sum_{\ell=1}^{n-1}y(1-y)^{\ell-1}\log_{+}(\ell y)\right)$
	$\displaystyle\leq\frac{1}{n-1}\sum_{\ell=1}^{n-1}y(1-y)^{\ell-1}(1+\ell y)\leq\frac{2}{n-1},$	(58)

where in (58) we need to verify that $\ell\mapsto y(1-y)^{\ell-1}\log_{+}(\ell y)$ is non-increasing. To verify it, w.l.o.g. we may assume that $(\ell+1)y\geq e$ , and therefore

	$\displaystyle\frac{y(1-y)^{\ell}\log_{+}((\ell+1)y)}{y(1-y)^{\ell-1}\log_{+}(\ell y)}$	$\displaystyle=\frac{(1-y)\log((\ell+1)y)}{\log_{+}(\ell y)}\leq\left(1-\frac{e}{\ell+1}\right)\left(1+\frac{\log(1+1/\ell)}{\log_{+}(\ell y)}\right)$
		$\displaystyle\leq\left(1-\frac{e}{\ell+1}\right)\left(1+\frac{1}{\ell}\right)<1+\frac{1}{\ell}-\frac{e}{\ell+1}<1.$

Therefore,

	$\displaystyle\sum_{\ell=1}^{n-1}\mathbb{E}\left[{\mathbf{1}_{\left\{{X^{n}=2^{n-\ell}1^{\ell}}\right\}}}D(M(\cdot\|1)\\|\widehat{M}_{\ell}(\cdot\|1))\right]$
	$\displaystyle\leq\sum_{\ell=1}^{n-1}M(2\|2)^{n-\ell-1}M(1\|2)M(1\|1)^{\ell-1}D(M(\cdot\|1)\\|\widehat{M}_{\ell}(\cdot\|1))$
	$\displaystyle\overset{\rm(a)}{\lesssim}\sum_{\ell=1}^{n-1}M(2\|2)^{n-\ell-1}M(1\|2)M(1\|1)^{\ell-1}\left(\frac{1}{\ell\log(1/\gamma_{0})}+M(2\|1)\log_{+}(M(2\|1)\ell)+M(2\|1)\log\log(1/\gamma_{0})\right)$
	$\displaystyle\overset{\rm(b)}{\leq}\sum_{\ell=1}^{n-1}\frac{M(2\|2)^{n-\ell-1}M(1\|2)M(1\|1)^{\ell-1}}{\ell\log(1/\gamma_{0})}+\frac{2+\log\log(1/\gamma_{0})}{n-1},$		(59)

where (a) is due to (56), (b) follows from (57) and (58) applied to $x=M(1|2),y=M(2|1)$ . To deal with the remaining sum, we distinguish into two cases. Sticking to the above definitions of $x$ and $y$ , if $y>\gamma_{0}/2$ , then

\displaystyle\sum_{\ell=1}^{n-1}\frac{x(1-x)^{n-\ell-1}(1-y)^{\ell-1}}{\ell}\leq\frac{1}{n-1}\left(\sum_{\ell=1}^{n-1}x(1-x)^{n-\ell-1}\right)\left(\sum_{\ell=1}^{n-1}\frac{(1-y)^{\ell-1}}{\ell}\right)\leq\frac{\log(2/\gamma_{0})}{n-1},

where the last step has used that $\sum_{\ell=1}^{\infty}t^{\ell-1}/\ell=\log(1/(1-t))$ for $|t|<1$ . If $y\leq\gamma_{0}/2$ , notice that for two-state chain the spectral gap is given explicitly by $\gamma=M(1|2)+M(2|1)=x+y$ , so that the assumption $\gamma\geq\gamma_{0}$ implies that $x\geq\gamma_{0}/2$ . In this case,

	$\displaystyle\sum_{\ell=1}^{n-1}\frac{x(1-x)^{n-\ell-1}(1-y)^{\ell-1}}{\ell}$	$\displaystyle\leq\sum_{\ell<n/2}(1-x)^{n/2-1}+\sum_{\ell\geq n/2}\frac{x(1-x)^{n-\ell-1}}{n/2}$
		$\displaystyle\leq\frac{n}{2}e^{-(n/2-1)\gamma_{0}}+\frac{2}{n}\lesssim\frac{1}{n},$

thanks to the assumption $\gamma_{0}\geq n^{-0.9}$ . Therefore, in both cases, the first term in (59) is $O(1/n)$ , as desired. ∎

Next we prove the lower bound on $\mathsf{Risk}_{2,n}$ . It is enough to show that $\mathsf{Risk}_{2,n}(\gamma_{0})\gtrsim\frac{1}{n}\log\log\left(1/\gamma_{0}\right)$ for $n^{-1}\leq\gamma_{0}\leq e^{-e^{5}}$ . Indeed, for $\gamma_{0}\geq e^{-e^{5}}$ , we can apply the result in the iid setting (see, e.g., [BFSS02]), in which the absolute spectral gap is 1, to obtain the usual parametric-rate lower bound $\Omega\left(\frac{1}{n}\right)$ ; for $\gamma_{0}<n^{-1}$ , we simply bound $\mathsf{Risk}_{2,n}(\gamma_{0})$ from below by $\mathsf{Risk}_{2,n}(n^{-1})$ . Define

\displaystyle\alpha=\log(1/\gamma_{0}),\quad\beta=\left\lceil{\alpha\over 5\log\alpha}\right\rceil,

(60)

and consider the prior distribution

\displaystyle\mathscr{M}=\mathrm{Uniform}({\mathcal{M}}),\quad{\mathcal{M}}

\displaystyle=\left\{M:{M(1|2)}=\frac{1}{n},{M(2|1)}={1\over\alpha^{m}}:m\in\mathbb{N}\cap\left(\beta,5\beta\right)\right\}.

(61)

Then the lower bound part of Theorem 2 follows from the next lemma.

Lemma 15.

Assume that $n^{-0.9}\leq\gamma_{0}\leq e^{-e^{5}}$ . Then

(i)

$\gamma_{*}>\gamma_{0}$ for each $M\in{\mathcal{M}}$ ;
(ii)

the Bayes risk with respect to the prior $\mathscr{M}$ is at least $\Omega\left(\log\log(1/\gamma_{0})\over n\right)$ .

Proof.

Part (i) follows by noting that absolute spectral gap for any two states matrix $M$ is $1-\left|1-{M(2|1)}-{M(1|2)}\right|$ and for any $M\in\cal M$ , $M(2|1)\in\left(\alpha^{-5\beta},\alpha^{-\beta}\right)\subseteq(\gamma_{0},\gamma_{0}^{1/5})\subseteq(\gamma_{0},1/2)$ which guarantees $\gamma_{*}=M(1|2)+M(2|1)>\gamma_{0}.$

To show part (ii) we lower bound the Bayes risk when the observed trajectory $X^{n}$ is a step sequence in $\left\{2^{n-\ell}1^{\ell}:1\leq\ell\leq n-1\right\}$ . Our argument closely follows that of [HOP18, Theorem 1]. Since $\gamma_{0}\geq n^{-1}$ , for each $M\in{\mathcal{M}}$ , the corresponding stationary distribution $\pi$ satisfies

\displaystyle\pi_{2}=\frac{M(2|1)}{M(2|1)+M(1|2)}\geq\frac{1}{2}.

Denote by $\mathsf{Risk}(\mathscr{M})$ the Bayes risk with respect to the prior $\mathscr{M}$ and by ${\widehat{M}^{\mathsf{B}}_{\ell}(\cdot|1)}$ the Bayes estimator for prior $\mathscr{M}$ given $X^{n}=2^{n-\ell}1^{\ell}$ . Note that

\mathbb{P}\left[X^{n}=2^{n-\ell}1^{\ell}\right]=\pi_{2}\left(1-\frac{1}{n}\right)^{n-\ell-1}\frac{1}{n}M(1|1)^{\ell-1}\geq\frac{1}{2en}M(1|1)^{\ell-1}.

(62)

Then

$\displaystyle\mathsf{Risk}(\mathscr{M})$	$\displaystyle\geq\mathbb{E}_{M\sim\mathscr{M}}\left[\sum_{\ell=1}^{n-1}\mathbb{E}\left[{\mathbf{1}_{\left\{{X^{n}=2^{n-\ell}1^{\ell}}\right\}}}D({M(\cdot\|1)}\\|{\widehat{M}^{\mathsf{B}}_{\ell}(\cdot\|1)})\right]\right]$
	$\displaystyle\geq\mathbb{E}_{M\sim\mathscr{M}}\left[\sum_{\ell=1}^{n-1}{M(1\|1)^{\ell-1}\over 2en}D({M(\cdot\|1)}\\|{\widehat{M}^{\mathsf{B}}_{\ell}(\cdot\|1)})\right]$
	$\displaystyle=\frac{1}{2en}\sum_{\ell=1}^{n-1}\mathbb{E}_{M\sim\mathscr{M}}\left[M(1\|1)^{\ell-1}D({M(\cdot\|1)}\\|{\widehat{M}^{\mathsf{B}}_{\ell}(\cdot\|1)})\right].$	(63)

Recalling the general form of the Bayes estimator in (21) and in view of (62), we get

\displaystyle\widehat{M}_{\ell}^{\mathsf{B}}(2|1)=\frac{\mathbb{E}_{M\sim\mathscr{M}}[M(1|1)^{\ell-1}M(2|1)]}{\mathbb{E}_{M\sim\mathscr{M}}[M(1|1)^{\ell-1}]},\quad\widehat{M}_{\ell}^{\mathsf{B}}(1|1)=1-\widehat{M}_{\ell}^{\mathsf{B}}(2|1).

(64)

Plugging (64) into (63), and using

\displaystyle D((x,1-x)\|(y,1-y))=x\log{x\over y}+(1-x)\log{1-x\over 1-y}\geq x\max\left\{0,\log{x\over y}-1\right\},

we arrive at the following lower bound for the Bayes risk:

		$\displaystyle\mathsf{Risk}(\mathscr{M})$
	$\displaystyle\geq$	$\displaystyle\frac{1}{2en}\sum_{\ell=1}^{n-1}\mathbb{E}_{M\sim\mathscr{M}}\left[M(1\|1)^{\ell-1}M(2\|1)\max\left\{0,\log\left(\frac{M(2\|1)\cdot\mathbb{E}_{M\sim\mathscr{M}}[M(1\|1)^{\ell-1}]}{\mathbb{E}_{M\sim\mathscr{M}}[M(1\|1)^{\ell-1}M(2\|1)]}\right)-1\right\}\right].$		(65)

Under the prior $\mathscr{M}$ , $M(2|1)=1-M(1|1)=\alpha^{-m}$ with $\beta\leq m\leq 5\beta$ .

We further lower bound (65) by summing over an appropriate range of $\ell$ . For any $m\in[\beta,3\beta]$ , define

\displaystyle\ell_{1}(m)=\left\lceil\frac{\alpha^{m}}{\log\alpha}\right\rceil,\qquad\ell_{2}(m)=\left\lfloor\alpha^{m}\log\alpha\right\rfloor.

Since $\gamma_{0}\leq e^{-e^{5}}$ , our choice of $\alpha$ ensures that the intervals $\{[\ell_{1}(m),\ell_{2}(m)]\}_{\beta\leq m\leq 3\beta}$ are disjoint. We will establish the following claim: for all $m\in[\beta,3\beta]$ and $\ell\in[\ell_{1}(m),\ell_{2}(m)]$ , it holds that

\displaystyle\frac{\alpha^{-m}\cdot\mathbb{E}_{M\sim\mathscr{M}}[M(1|1)^{\ell-1}]}{\mathbb{E}_{M\sim\mathscr{M}}[M(1|1)^{\ell-1}M(2|1)]}\gtrsim\frac{\log(1/\gamma_{0})}{\log\log(1/\gamma_{0})}.

(66)

We first complete the proof of the Bayes risk bound assuming (66). Using (65) and (66), we have

	$\displaystyle\mathsf{Risk}(\mathscr{M})$	$\displaystyle\gtrsim{1\over n}\cdot\frac{1}{4\beta}\sum_{m=\beta}^{3\beta}\sum_{\ell=\ell_{1}(m)}^{\ell_{2}(m)}\alpha^{-m}(1-\alpha^{-m})^{\ell-1}\cdot\log\log(1/\gamma_{0})$
		$\displaystyle={\log\log(1/\gamma_{0})\over 4n\beta}\sum_{m=\beta}^{3\beta}\left\{(1-\alpha^{-m})^{\ell_{1}(m)-1}-(1-\alpha^{-m})^{\ell_{2}(m)}\right\}$
		$\displaystyle\overset{\rm(a)}{\geq}{\log\log(1/\gamma_{0})\over 4n\beta}\sum_{m=\beta}^{3\beta}\left(\left(\frac{1}{4}\right)^{1\over\log\alpha}-\left(\frac{1}{e}\right)^{-1+\log\alpha}\right)\gtrsim{\log\log(1/\gamma_{0})\over n},$

with (a) following from $\frac{1}{4}\leq(1-x)^{1\over x}\leq\frac{1}{e}$ if $x\leq\frac{1}{2}$ , and $\alpha^{-m}\leq\alpha^{-\beta}\leq\gamma_{0}^{1/5}\leq\frac{1}{2}$ .

Next we prove the claim (66). Expanding the expectation in (61), we write the LHS of (66) as

\displaystyle\frac{\alpha^{-m}\cdot\mathbb{E}_{M\sim\mathscr{M}}[M(1|1)^{\ell-1}]}{\mathbb{E}_{M\sim\mathscr{M}}[M(1|1)^{\ell-1}M(2|1)]}={X_{\ell}+A_{\ell}+B_{\ell}\over X_{\ell}+C_{\ell}+D_{\ell}},

where

	$\displaystyle X_{\ell}$	$\displaystyle=\left(1-\alpha^{-m}\right)^{\ell},\quad A_{\ell}=\sum_{j={\beta}}^{m-1}\left(1-\alpha^{-j}\right)^{\ell},\quad B_{\ell}=\sum_{j=m+1}^{5{\beta}}\left(1-\alpha^{-j}\right)^{\ell},$
	$\displaystyle C_{\ell}$	$\displaystyle=\sum_{j={\beta}}^{m-1}\left(1-\alpha^{-j}\right)^{\ell}{\alpha}^{m-j},\quad D_{\ell}=\sum_{j=m+1}^{5{\beta}}\left(1-\alpha^{-j}\right)^{\ell}{\alpha}^{m-j}.$

We bound each of the terms individually. Clearly, $X_{\ell}\in(0,1)$ and $A_{\ell}\geq 0$ . Thus it suffices to show that $B_{\ell}\gtrsim\beta$ and $C_{\ell},D_{\ell}\lesssim 1$ , for $m\in[\beta,3\beta]$ and $\ell_{1}(m)\leq\ell\leq\ell_{2}(m)$ . Indeed,

•

For $j\geq m+1$ , we have

\displaystyle\left(1-{\alpha}^{-j}\right)^{\ell}\geq\left(1-{\alpha}^{-j}\right)^{\ell_{2}(m)}\overset{\rm(a)}{\geq}\left(1/4\right)^{\ell_{2}(m)\over{\alpha}^{j}}\geq\left(1/4\right)^{\log{\alpha}\over{\alpha}}\geq 1/4,

where in (a) we use the inequality $(1-x)^{1/x}\geq 1/4$ for $x\leq 1/2$ . Consequently, $B_{\ell}\geq\beta/2$ ;

•

For $j\leq m-1$ , we have

\displaystyle\left(1-{\alpha}^{-j}\right)^{\ell}\leq\left(1-{\alpha}^{-j}\right)^{\ell_{1}(m)}\overset{\rm(b)}{\leq}e^{-\frac{{\alpha}^{m-j}}{\log\alpha}}=\gamma_{0}^{{\alpha}^{m-j-1}\over\log{\alpha}},

where (b) follows from $(1-x)^{1/x}\leq 1/e$ and the definition of $\ell_{1}(m)$ . Consequently,

\displaystyle C_{\ell}\leq\gamma_{0}^{\alpha\over\log\alpha}\sum_{j={\beta}}^{m-2}{\alpha}^{m-j}+{\alpha}\gamma_{0}^{1\over\log{\alpha}}\leq e^{-{\alpha^{2}\over\log\alpha}+(2\beta+1)\log\alpha}+e^{\log\alpha-\frac{\alpha}{\log\alpha}}\leq 2,

where the last step uses the definition of $\beta$ in (60);

•

$D_{\ell}\leq\sum_{j=m+1}^{5\beta}\alpha^{m-j}\leq 1$ , since $\alpha=\log\frac{1}{\gamma_{0}}\geq e^{5}$ .

Combining the above bounds completes the proof of (66). ∎

4.2 $k$ states

4.2.1 Proof of Theorem 3 (i)

Notice that the prediction problem consists of $k$ sub-problems of estimating the individual rows of $M$ , so it suffices show the contribution from each of them is $O\left(\frac{k}{n}\right)$ . In particular, assuming the chain terminates in state 1 we bound the risk of estimating the first row by the add-one estimator $\widehat{M}^{+1}(j|1)={N_{1j}+1\over N_{1}+k}$ . Under the absolute spectral gap condition of $\gamma_{*}\geq\gamma_{0}$ , we show

\displaystyle\mathbb{E}\left[{\mathbf{1}_{\left\{{X_{n}=1}\right\}}}D\left({M(\cdot|1)}\|{\widehat{M}^{+1}(\cdot|1)}\right)\right]\lesssim{k\over n}\left(1+\sqrt{\log k\over k\gamma_{0}^{4}}\right).

(67)

By symmetry, we get the desired $\mathsf{Risk}_{k,n}(\gamma_{0})\lesssim{k^{2}\over n}\left(1+\sqrt{\log k\over k\gamma_{0}^{4}}\right)$ . The basic steps of our analysis are as follows:

•

When $N_{1}$ is substantially smaller than its mean, we can bound the risk using the worst-case risk bound for add-one estimators and the probability of this rare event.

•

Otherwise, we decompose the prediction risk as

\displaystyle D(M(\cdot|1)\|\widehat{M}^{+1}(\cdot|1))=\sum_{j=1}^{k}\left[M(j|1)\log\left(M(j|1)(N_{1}+k)\over N_{1j}+1\right)-M(j|1)+{N_{1j}+1\over N_{1}+k}\right].

We then analyze each term depending on whether $N_{1j}$ is typical or not. Unless $N_{1j}$ is atypically small, the add-one estimator works well whose risk can be bounded quadratically.

To analyze the concentration of the empirical counts we use the following moment bounds. The proofs are deferred to Appendix B.

Lemma 16.

Finite reversible and irreducible chains observe the following moment bounds:

(i)

${\mathbb{E}\left[\left(N_{ij}-N_{i}{M(j|i)}\right)^{2}|X_{n}=i\right]}\lesssim n\pi_{i}{M(j|i)}(1-{M(j|i)})+{\sqrt{M(j|i)}\over\gamma_{*}}+{{M(j|i)}\over\gamma_{*}^{2}}$
(ii)

$\mathbb{E}\left[\left(N_{ij}-N_{i}{M(j|i)}\right)^{4}|X_{n}=i\right]\lesssim(n\pi_{i}{M(j|i)}(1-{M(j|i)}))^{2}+{\sqrt{{M(j|i)}}\over\gamma_{*}}+{{M(j|i)}^{2}\over\gamma_{*}^{4}}$
(iii)

$\mathbb{E}\left[\left(N_{i}-(n-1)\pi_{i}\right)^{4}|X_{n}=i\right]\lesssim{n^{2}\pi_{i}^{2}\over\gamma_{*}^{2}}+{1\over\gamma_{*}^{4}}.$

When $\gamma_{*}$ is high this shows that the moments behave as if for each $i\in[k]$ , $N_{1}$ is approximately Binomial( $n-1,\pi_{i}$ ) and $N_{ij}$ is approximately Binomial $(N_{i},M(j|i))$ , which happens in case of iid sampling. For iid models [KOPS15] showed that the add-one estimator achieves $O\left(\frac{k}{n}\right)$ risk bound which we aim here too. In addition, dependency of the above moments on $\gamma_{*}$ gives rise to sufficient conditions that guarantees parametric rate. The technical details are given below.

We decompose the left hand side in (67) based on $N_{1}$ as

\displaystyle\mathbb{E}\left[{\mathbf{1}_{\left\{{X_{n}=1}\right\}}}D\left({M(\cdot|1)}\|{\widehat{M}^{+1}(\cdot|1)}\right)\right]=\mathbb{E}\left[{\mathbf{1}_{\left\{{A^{\leq}}\right\}}}D\left({M(\cdot|1)}\|{\widehat{M}^{+1}(\cdot|1)}\right)\right]+\mathbb{E}\left[{\mathbf{1}_{\left\{{A^{>}}\right\}}}D\left({M(\cdot|1)}\|{\widehat{M}^{+1}(\cdot|1)}\right)\right]

where the typical set $A^{>}$ and atypical set $A^{\leq}$ are defined as

\displaystyle A^{\leq}\triangleq\left\{X_{n}=1,N_{1}\leq{(n-1)\pi_{1}/2}\right\},\quad A^{>}\triangleq\left\{X_{n}=1,N_{1}>{(n-1)\pi_{1}/2}\right\}.

For the atypical case, note the following deterministic property of the add-one estimator. Let $\widehat{Q}$ be an add-one estimator with sample size $n$ and alphabet size $k$ of the form $\widehat{Q}_{i}=\frac{n_{i}+1}{n+k}$ , where $\sum n_{i}=n$ . Since $\widehat{Q}$ is bounded below by $\frac{1}{n+k}$ everywhere, for any distribution $P$ , we have

D(P\|\widehat{Q})\leq\log(n+k).

(68)

Applying this bound on the event $A^{\leq}$ , we have

	$\displaystyle\mathbb{E}\left[{\mathbf{1}_{\left\{{A^{\leq}}\right\}}}D\left({M(\cdot\|1)}\\|{\widehat{M}^{+1}(\cdot\|1)}\right)\right]$
	$\displaystyle\leq\log\left(n\pi_{1}+k\right)\mathbb{P}\left[X_{n}=1,N_{1}\leq(n-1)\pi_{1}/2\right]$
	$\displaystyle\overset{\rm(a)}{\lesssim}{\mathbf{1}_{\left\{{n\pi_{1}\gamma_{}\leq 10}\right\}}}\pi_{1}\log\left(n\pi_{1}+k\right)+{\mathbf{1}_{\left\{{n\pi_{1}\gamma_{}>10}\right\}}}\pi_{1}\log\left({n\pi_{1}+k}\right){\mathbb{E}\left[\left(N_{1}-(n-1)\pi_{1}\right)^{4}\|X_{n}=1\right]\over n^{4}\pi_{1}^{4}}$		(69)
	$\displaystyle\overset{\rm(b)}{\leq}{\mathbf{1}_{\left\{{n\pi_{1}\gamma_{}\leq 10}\right\}}}{10\over n\gamma_{}}\log\left({10\over\gamma_{}}+k\right)+{\mathbf{1}_{\left\{{n\pi_{1}\gamma_{}>10}\right\}}}\log\left(n\pi_{1}+k\right)\left({1\over n^{2}\pi_{1}\gamma_{}^{2}}+{1\over n^{4}\pi_{1}^{3}\gamma_{}^{4}}\right)$
	$\displaystyle\overset{\rm(c)}{\lesssim}\frac{1}{n}\left\{{\mathbf{1}_{\left\{{n\pi_{1}\gamma_{}\leq 10}\right\}}}{\log(1/\gamma_{})+\log k\over\gamma_{}}+{\mathbf{1}_{\left\{{n\pi_{1}\gamma_{}>10}\right\}}}\left(n\pi_{1}+\log k\right)\left({1\over n\pi_{1}\gamma_{}^{2}}+{1\over n^{3}\pi_{1}^{3}\gamma_{}^{4}}\right)\right\}$
	$\displaystyle{\lesssim}{1\over n}\left\{{\mathbf{1}_{\left\{{n\pi_{1}\gamma_{}\leq 10}\right\}}}\left(\frac{1}{\gamma_{}^{2}}+{\log k\over\gamma_{}}\right)+{\mathbf{1}_{\left\{{n\pi_{1}\gamma_{}>10}\right\}}}\left({1\over\gamma_{}^{2}}+{\log k\over\gamma_{}}\right)\right\}\lesssim{1\over n\gamma_{0}^{2}}+{\log k\over n\gamma_{0}}.$		(70)

where we got (a) from Markov inequality, (b) from Lemma 16(iii) and (c) using $x+y\leq xy,x,y\geq 2$ .

Next we bound $\mathbb{E}\left[{\mathbf{1}_{\left\{{A^{>}}\right\}}}D\left({M(\cdot|1)}\|{\widehat{M}^{+1}(\cdot|1)}\right)\right]$ . Define

\displaystyle\Delta_{i}=M(i|1)\log\left(M(i|1)\over\widehat{M}^{+1}(i|1)\right)-M(i|1)+\widehat{M}^{+1}(i|1).

As $D({M(\cdot|1)}\|{\widehat{M}^{+1}(\cdot|1)})=\sum_{i=1}^{k}\Delta_{i}$ it suffices to bound $\mathbb{E}\left[{\mathbf{1}_{\left\{{A^{>}}\right\}}}\Delta_{i}\right]$ for each $i$ . For some $r\geq 1$ to be optimized later consider the following cases separately

Case (a) $n\pi_{1}\leq r$ or $n\pi_{1}{M(i|1)}\leq 10$ :

Using the fact $y\log(y)-y+1\leq(y-1)^{2}$ with $y={M(i|1)\over\widehat{M}^{+1}(i|1)}={{M(i|1)}(N_{1}+k)\over N_{1i}+1}$ we get

\displaystyle\Delta_{i}\leq{\left({M(i|1)}N_{1}-N_{1i}+{M(i|1)}k-1\right)^{2}\over\left(N_{1}+k\right)\left(N_{1i}+1\right)}.

(71)

This implies

$\displaystyle\mathbb{E}\left[{\mathbf{1}_{\left\{{A^{>}}\right\}}}\Delta_{i}\right]$	$\displaystyle\leq\mathbb{E}\left[{\mathbf{1}_{\left\{{A^{>}}\right\}}}\left({M(i\|1)}N_{1}-N_{1i}+{M(i\|1)}k-1\right)^{2}\over\left(N_{1}+k\right)\left(N_{1i}+1\right)\right]$
	$\displaystyle\overset{\rm(a)}{\lesssim}{{\mathbb{E}\left[{\mathbf{1}_{\left\{{A^{>}}\right\}}}\left({M(i\|1)}N_{1}-N_{1i}\right)^{2}\right]+k^{2}\pi_{1}{M(i\|1)}^{2}+\pi_{1}}\over n\pi_{1}+k}$
	$\displaystyle\overset{\rm(b)}{\lesssim}{\pi_{1}\mathbb{E}\left[\left.{\left({M(i\|1)}N_{1}-N_{1i}\right)^{2}}\right\|X_{n}=1\right]\over n\pi_{1}+k}+{1+rk{M(i\|1)}\over n}$	(72)

where (a) follows from $N_{1}>{(n-1)\pi_{1}\over 2}$ in $A^{>}$ and the fact that $(x+y+z)^{2}\leq 3(x^{2}+y^{2}+z^{2})$ ; (b) uses the assumption that either $n\pi_{1}\leq r$ or $n\pi_{1}{M(i|1)}\leq 10$ . Applying Lemma 16 (i) and the fact that $x+x^{2}\leq 2(1+x^{2})$ , continuing the last display we get

\displaystyle\mathbb{E}\left[{\mathbf{1}_{\left\{{A^{>}}\right\}}}\Delta_{i}\right]\lesssim{n\pi_{1}{M(i|1)}+\left(1+{{M(i|1)}\over\gamma^{2}_{*}}\right)\over n}+{1+rk{M(i|1)}\over n}\lesssim{{1+rk{M(i|1)}\over n}+{M(i|1)\over n\gamma_{0}^{2}}}.

Hence

\displaystyle\mathbb{E}\left[{\mathbf{1}_{\left\{{A^{>}}\right\}}}D({M(\cdot|1)}\|{\widehat{M}^{+1}(\cdot|1)})\right]=\sum_{i=1}^{k}\mathbb{E}\left[{\mathbf{1}_{\left\{{A^{>}}\right\}}}\Delta_{i}\right]\lesssim{\frac{rk}{n}+{1\over\gamma_{0}^{2}}}.

(73)

Case(b) $n\pi_{1}>r$ and $n\pi_{1}{M(i|1)}>10$ :

We decompose $A^{>}$ based on count of $N_{1i}$ into atypical part $B^{\leq}$ and typical part $B^{>}$

	$\displaystyle B^{\leq}$	$\displaystyle\triangleq\left\{X_{n}=1,N_{1}>{(n-1)\pi_{1}/2},N_{1i}\leq{(n-1)\pi_{1}{M(i\|1)}/4}\right\}$
	$\displaystyle B^{>}$	$\displaystyle\triangleq\left\{X_{n}=1,N_{1}>{(n-1)\pi_{1}/2},N_{1i}>{(n-1)\pi_{1}{M(i\|1)}/4}\right\}$

and bound each of $\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{\leq}}\right\}}}\Delta_{i}\right]$ and $\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{>}}\right\}}}\Delta_{i}\right]$ separately.

Bound on $\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{\leq}}\right\}}}\Delta_{i}\right]$

Using $\widehat{M}^{+1}(i|1)\geq{1\over N_{1}+k}$ and $N_{1i}<N_{1}M(i|1)/2$ in $B^{\leq}$ we get

$\displaystyle\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{\leq}}\right\}}}\Delta_{i}\right]$	$\displaystyle=\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{\leq}}\right\}}}{M(i\|1)}\log\left({M(i\|1)}(N_{1}+k)\over N_{1i}+1\right)\right]+\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{\leq}}\right\}}}\left({N_{1i}+1\over N_{1}+k}-{M(i\|1)}\right)\right]$
	$\displaystyle\leq\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{\leq}}\right\}}}{M(i\|1)}\log\left({M(i\|1)}(N_{1}+k)\right)\right]+\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{\leq}}\right\}}}\left({N_{1i}\over N_{1}}-{M(i\|1)}\right)\right]+\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{\leq}}\right\}}}\over N_{1}\right]$
	$\displaystyle\lesssim\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{\leq}}\right\}}}{M(i\|1)}\log\left({M(i\|1)}(N_{1}+k)\right)\right]+{1\over n}$	(74)

where the last inequality followed as $\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{\leq}}\right\}}}/N_{1}\right]\lesssim{\mathbb{P}[X_{n}=1]/n\pi_{1}}=\frac{1}{n}$ . Note that for any event $B$ and any function $g$ ,

\displaystyle\mathbb{E}\left[g(N_{1}){\mathbf{1}_{\left\{{N_{1}\geq t_{0},B}\right\}}}\right]=g(t_{0})\mathbb{P}[N_{1}\geq t_{0},B]+\sum_{t=t_{0}+1}^{n}\left(g(t)-g(t-1)\right)\mathbb{P}[N_{1}\geq t,B].

Applying this identity with $t_{0}={\left\lceil{(n-1)\pi_{1}/2}\right\rceil}$ , we can bound the expectation term in (74) as

	$\displaystyle\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{\leq}}\right\}}}{M(i\|1)}\log\left({M(i\|1)}(N_{1}+k)\right)\right]$
	$\displaystyle={M(i\|1)}\log\left({M(i\|1)}(t_{0}+k)\right)\mathbb{P}\left[N_{1}\geq t_{0},N_{1i}\leq{n\pi_{1}{M(i\|1)}\over 4},X_{n}=1\right]$
	$\displaystyle\quad+{M(i\|1)}\sum_{t=t_{0}+1}^{n-1}\log\left(1+\frac{1}{t-1+k}\right)\mathbb{P}\left[N_{1}\geq t+1,N_{1i}\leq{n\pi_{1}{M(i\|1)}\over 4},X_{n}=1\right]$
	$\displaystyle\leq\pi_{1}{M(i\|1)}\log\left({M(i\|1)}(t_{0}+k)\right)\mathbb{P}\left[\left.{M(i\|1)}N_{1}-N_{1i}\geq{M(i\|1)t_{0}\over 4}\right\|X_{n}=1\right]$
	$\displaystyle\quad+{{M(i\|1)}\over n}\sum_{t=t_{0}+1}^{n-1}\mathbb{P}\left[\left.{M(i\|1)}N_{1}-N_{1i}\geq{M(i\|1)t\over 4}\right\|X_{n}=1\right]$		(75)

where last inequality uses $\log\left(1+\frac{1}{t-1+k}\right)\leq\frac{1}{t}\lesssim\frac{1}{n\pi_{1}}$ for all $t\geq t_{0}$ . Using Markov inequality $\mathbb{P}\left[Z>c\right]\leq c^{-4}{\mathbb{E}\left[Z^{4}\right]}$ for $c>0$ , Lemma 16 (ii) and $x+x^{4}\leq 2(1+x^{4})$ with $x=\sqrt{M(i|1)}/\gamma_{*}$

\displaystyle\mathbb{P}\left[\left.{M(i|1)}N_{1}-N_{1i}\geq{{M(i|1)}t\over 4}\right|X_{n}=1\right]\lesssim{(n\pi_{1}{M(i|1)})^{2}+{{M(i|1)}^{2}\over\gamma_{*}^{4}}\over\left(t{M(i|1)}\right)^{4}}.

In view of above continuing (75) we get

	$\displaystyle\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{\leq}}\right\}}}{M(i\|1)}\log\left({M(i\|1)}(N_{1}+k)\right)\right]$
	$\displaystyle\lesssim\left((n\pi_{1}{M(i\|1)})^{2}+{{M(i\|1)}^{2}\over\gamma_{*}^{4}}\right)\left({\pi_{1}{M(i\|1)}\log({M(i\|1)}(n\pi_{1}+k))\over(n\pi_{1}{M(i\|1)})^{4}}+\frac{1}{n({M(i\|1)})^{3}}\sum_{t=t_{0}+1}^{n}{1\over t^{4}}\right)$
	$\displaystyle{\lesssim}\left((n\pi_{1}{M(i\|1)})^{2}+{{M(i\|1)}^{2}\over\gamma_{*}^{4}}\over n\right)\left({\log(n\pi_{1}M(i\|1)+kM(i\|1))\over(n\pi_{1}{M(i\|1)})^{3}}+\frac{1}{(n\pi_{1}{M(i\|1)})^{3}}\right)$
	$\displaystyle{\lesssim}{1\over n}\left((n\pi_{1}{M(i\|1)})^{2}+{{M(i\|1)}^{2}\over\gamma_{*}^{4}}\right){\log(n\pi_{1}M(i\|1)+kM(i\|1))\over(n\pi_{1}{M(i\|1)})^{3}}$
	$\displaystyle\lesssim{1\over n}\left({\log(n\pi_{1}M(i\|1)+kM(i\|1))\over n\pi_{1}{M(i\|1)}}+{M(i\|1)\log(n\pi_{1}M(i\|1)+k)\over n\pi_{1}\gamma_{*}^{4}(n\pi_{1}M(i\|1))^{2}}\right)$
	$\displaystyle\overset{\rm(a)}{\lesssim}{1\over n}\left({n\pi_{1}M(i\|1)+kM(i\|1)\over n\pi_{1}{M(i\|1)}}+{M(i\|1)\log(n\pi_{1}M(i\|1))\over n\pi_{1}\gamma_{}^{4}(n\pi_{1}M(i\|1))^{2}}+{M(i\|1)\log k\over n\pi_{1}\gamma_{}^{4}(n\pi_{1}M(i\|1))^{2}}\right)$
	$\displaystyle\overset{\rm(b)}{\lesssim}{1\over n}\left(1+kM(i\|1)+{M(i\|1)\log k\over r\gamma_{0}^{4}}\right)$

where (a) followed using $x+y\leq xy$ for $x,y\geq 2$ and (b) followed as $n\pi_{1}\geq r,n\pi_{1}{M(i|1)}\geq 10$ and $\log(n\pi_{1}M(i|1))\leq n\pi_{1}M(i|1)$ . In view of (74) this implies

\displaystyle\sum_{i=1}^{k}\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{\leq}}\right\}}}\Delta_{i}\right]\lesssim\sum_{i=1}^{k}{1\over n}\left(1+kM(i|1)\left(1+{\log k\over rk\gamma_{0}^{4}}\right)\right)\lesssim{k\over n}\left(1+{\log k\over rk\gamma_{0}^{4}}\right).

(76)

Bound on $\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{>}}\right\}}}\Delta_{i}\right]$

Using the inequality (71)

	$\displaystyle\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{>}}\right\}}}\Delta_{i}\right]$	$\displaystyle\leq\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{>}}\right\}}}\left({M(i\|1)}N_{1}-N_{1i}+{M(i\|1)}k-1\right)^{2}\over\left(N_{1}+k\right)\left(N_{1i}+1\right)\right]$
		$\displaystyle\lesssim{\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{>}}\right\}}}\left\{\left({M(i\|1)}N_{1}-N_{1i}\right)^{2}\right\}\right]+k^{2}\pi_{1}{M(i\|1)}^{2}+\pi_{1}\over(n\pi_{1}+k)(n\pi_{1}{M(i\|1)}+1)}$
		$\displaystyle\lesssim{\pi_{1}\mathbb{E}\left[\left.{\left({M(i\|1)}N_{1}-N_{1i}\right)^{2}}\right\|X_{n}=1\right]\over(n\pi_{1}+k)(n\pi_{1}{M(i\|1)}+1)}+{k{M(i\|1)}\over n}$

where (a) follows using properties of the set $B^{>}$ along with $(x+y+z)^{2}\leq 3(x^{2}+y^{2}+z^{2})$ . Using Lemma 16 (i) we get

\displaystyle\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{>}}\right\}}}\Delta_{i}\right]\lesssim{n\pi_{1}{M(i|1)}+\left(1+{{M(i|1)}\over\gamma^{2}_{*}}\right)\over n(n\pi_{1}{M(i|1)}+1)}+{k{M(i|1)}\over n}\lesssim{{1+k{M(i|1)}\over n}+{{M(i|1)}\over n\gamma_{0}^{2}}}.

Summing up the last bound over $i\in[k]$ and using we get for $n\pi_{1}>r,n\pi_{1}M(i|1)>10$

\displaystyle\mathbb{E}\left[{\mathbf{1}_{\left\{{A^{>}}\right\}}}D({M(\cdot|1)}\|{\widehat{M}^{+1}(\cdot|1)})\right]

\displaystyle=\sum_{i=1}^{k}\left[\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{\leq}}\right\}}}\Delta_{i}\right]+\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{>}}\right\}}}\Delta_{i}\right]\right]\lesssim{k\over n}\left(1+{1\over k\gamma_{0}^{2}}+{\log k\over rk\gamma_{0}^{4}}\right).

Combining this with (73) we obtain

\displaystyle\mathbb{E}\left[{\mathbf{1}_{\left\{{A^{>}}\right\}}}D({M(\cdot|1)}\|{\widehat{M}^{+1}(\cdot|1)})\right]

\displaystyle\lesssim{k\over n}\left({1\over k\gamma_{0}^{2}}+r+{\log k\over rk\gamma_{0}^{4}}\right)\lesssim{k\over n}\left(1+{\sqrt{\log k\over k\gamma_{0}^{4}}}\right)

where we chose $r=10+\sqrt{\log k\over k\gamma_{0}^{4}}$ for the last inequality. In view of (70) this implies the required bound.

Remark 4.

We explain the subtlety of the concentration bound in Lemma 16 based on fourth moment and why existing Chernoff bound or Chebyshev inequality falls short. For example, the risk bound in (70) relies on bounding the probability that $N_{1}$ is atypically small. To this end, one may use the classical Chernoff-type inequality for reversible chains (see [Lez98, Theorem 1.1] or [Pau15, Proposition 3.10 and Theorem 3.3])

\displaystyle\mathbb{P}\left[N_{1}\leq(n-1)\pi_{1}/2|X_{1}=1\right]\lesssim\frac{1}{\sqrt{\pi_{1}}}e^{-\Theta(n\pi_{1}\gamma_{*})};

(77)

in contrast, the fourth moment bound in (69) yields $\mathbb{P}\left[N_{1}\leq(n-1)\pi_{1}/2|X_{1}=1\right]=O(\frac{1}{(n\pi_{1}\gamma_{*})^{2}})$ . Although the exponential tail in (77) is much better, the pre-factor $\frac{1}{\sqrt{\pi_{1}}}$ , due to conditioning on the initial state, can lead to a suboptimal result when $\pi_{1}$ is small. (As a concrete example, consider two states with $M(2|1)=\Theta(\frac{1}{n})$ and $M(1|2)=\Theta(1)$ . Then $\pi_{1}=\Theta(\frac{1}{n}),\gamma=\gamma_{*}\approx\Theta(1)$ , and (77) leads to $\mathbb{P}\left[N_{1}\leq(n-1)\pi_{1}/2,X_{n}=1\right]=O(\frac{1}{\sqrt{n}})$ as opposed to the desired $O(\frac{1}{n})$ .)

In the same context it is also insufficient to use 2nd moment based bound (Chebyshev), which leads to $\mathbb{P}\left[N_{1}\leq(n-1)\pi_{1}/2|X_{1}=1\right]=O(\frac{1}{n\pi_{1}\gamma_{*}})$ . This bound is too loose, which, upon substitution into (69), results in an extra $\log n$ factor in the final risk bound when $\pi_{1}$ and $\gamma_{*}$ are large.

4.2.2 Proof of Theorem 3 (ii)

Let $k\geq(\log n)^{6}$ and $\gamma_{0}\geq{(\log(n+k))^{2}\over k}$ . We prove a stronger result using spectral gap as opposed to the absolute spectral gap. Fix $M$ such that $\gamma\geq\gamma_{0}$ . Denote its stationary distribution by $\pi$ . For absolute constants $\tau>0$ to be chosen later and $c_{0}$ as in Lemma 17 below, define

	$\displaystyle\epsilon(m)={2k\over m}+{c_{0}(\log n)^{3}\sqrt{k}\over m},\quad c_{n}=100\tau^{2}{\log n\over n\gamma},$
	$\displaystyle n_{i}^{\pm}=n\pi_{i}\pm\tau\max\left\{{\log n\over n\gamma},{\sqrt{\pi_{i}\log n\over n\gamma}}\right\},\quad i=1,\ldots,k.$		(78)

Let $N_{i}$ be the number of visits to state $i$ as in (4). We bound the risk by accounting for the contributions from different ranges of $N_{i}$ and $\pi_{i}$ separately:

	$\displaystyle\mathbb{E}\left[\sum_{i=1}^{k}{\mathbf{1}_{\left\{{X_{n}=i}\right\}}}D\left(M(\cdot\|i)\\|\widehat{M}^{+1}(\cdot\|i)\right)\right]$
	$\displaystyle=\sum_{i:\pi_{i}\geq c_{n}}\mathbb{E}\left[{\mathbf{1}_{\left\{{X_{n}=i,n_{i}^{-}\leq N_{i}\leq n_{i}^{+}}\right\}}}D\left(M(\cdot\|i)\\|\widehat{M}^{+1}(\cdot\|i)\right)\right]$
	$\displaystyle+\sum_{i:\pi_{i}\geq c_{n}}\mathbb{E}\left[{\mathbf{1}_{\left\{{X_{n}=i,N_{i}>n_{i}^{+}\text{ or }N_{i}<n_{i}^{-}}\right\}}}D\left(M(\cdot\|i)\\|\widehat{M}^{+1}(\cdot\|i)\right)\right]+\sum_{i:\pi_{i}<c_{n}}\mathbb{E}\left[{\mathbf{1}_{\left\{{X_{n}=i}\right\}}}D\left(M(\cdot\|i)\\|\widehat{M}^{+1}(\cdot\|i)\right)\right]$
	$\displaystyle\leq\log(n+k)\sum_{i:\pi_{i}\geq c_{n}}\mathbb{P}\left[D(M(\cdot\|i)\\|\widehat{M}^{+1}(\cdot\|i))>\epsilon(N_{i}),n_{i}^{-}\leq N_{i}\leq n_{i}^{+}\right]+\sum_{i:\pi_{i}\geq c_{n}}\mathbb{E}\left[{\mathbf{1}_{\left\{{X_{n}=i,n_{i}^{-}\leq N_{i}\leq n_{i}^{+}}\right\}}}\epsilon(N_{i})\right]$
	$\displaystyle+\log(n+k)\sum_{i:\pi_{i}\geq c_{n}}\left[\mathbb{P}\left[N_{i}\geq n_{i}^{+}\right]+\mathbb{P}\left[N_{i}\leq n_{i}^{-}\right]\right]+\sum_{i:\pi_{i}\leq c_{n}}\pi_{i}\log(n+k)$
	$\displaystyle\lesssim\log(n+k)\sum_{i:\pi_{i}\geq c_{n}}\mathbb{P}\left[D(M(\cdot\|i)\\|\widehat{M}^{+1}(\cdot\|i))>\epsilon(N_{i}),n_{i}^{-}\leq N_{i}\leq n_{i}^{+}\right]+\sum_{i:\pi_{i}\geq c_{n}}\pi_{i}\max_{n_{i}^{-}\leq m\leq n_{i}^{+}}\epsilon(m)$
	$\displaystyle\quad+\log(n+k)\sum_{i:\pi_{i}\geq c_{n}}\left(\mathbb{P}\left[N_{i}>n_{i}^{+}\right]+\mathbb{P}\left[N_{i}<n_{i}^{-}\right]\right)+{k\left(\log(n+k)\right)^{2}\over n\gamma}.$		(79)

where the first inequality uses the worst-case bound (68) for add-one estimator. We analyze the terms separately as follows.

For the second term, given any $i$ such that $\pi_{i}\geq c_{n}$ , we have, by definition in (78), $n_{i}^{-}\geq 9n\pi_{i}/10$ and $n_{i}^{+}-n_{i}^{-}\leq n\pi_{i}/5$ , which implies

\displaystyle\sum_{i:\pi_{i}\geq c_{n}}\pi_{i}\max_{n_{i}^{-}\leq m\leq n_{i}^{+}}\epsilon(m)

\displaystyle\leq\sum_{i:\pi_{i}\geq c_{n}}\pi_{i}\left({2k\over 0.9n\pi_{i}}+{10\over 9}{c_{0}(\log n)^{3}\sqrt{k}\over n\pi_{i}}\right)\lesssim{k^{2}\over n}+{(\log n)^{3}k^{3/2}\over n}.

(80)

For the third term, applying [HJL⁺18, Lemma 16] (which, in turn, is based on the Bernstein inequality in [Pau15]), we get $\mathbb{P}\left[N_{i}>n_{i}^{+}\right]+\mathbb{P}\left[N_{i}<n_{i}^{-}\right]\leq 2n^{-\tau^{2}\over 4+10\tau}$ .

To bound the first term in (79), we follow the method in [Bil61, HJL⁺18] of representing the sample path of the Markov chain using independent samples generated from $M(\cdot|i)$ which we describe below. Consider a random variable $X_{1}\sim\pi$ and an array $W=\left\{W_{i\ell}:i=1,\dots,k\text{ and }\ell=1,2,\dots\right\}$ of independent random variables, such that $X$ and $W$ are independent and $W_{i\ell}{\stackrel{{\scriptstyle\text{iid}}}{{\sim}}}M(\cdot|i)$ for each $i$ . Starting with generating $X_{1}$ from $\pi$ , at every step $i\geq 2$ we set $X_{i}$ as the first element in the $X_{i-1}$ -th row of $W$ that has not been sampled yet. Then one can verify that $\left\{X_{1},\dots,X_{n}\right\}$ is a Markov chain with initial distribution $\pi$ and transition matrix $M$ . Furthermore, the transition counts satisfy $N_{ij}=\sum_{\ell=1}^{N_{i}}{\mathbf{1}_{\left\{{W_{i\ell}=j}\right\}}}$ , where $N_{i}$ be the number of elements sampled from the $i$ th row of $W$ . Note the conditioned on $N_{i}=m$ , the random variables $\{W_{i1},\ldots,W_{im}\}$ are no longer iid. Instead, we apply a union bound. Note that for each fixed $m$ , the estimator

\widehat{M}^{+1}(j|i)={\sum_{\ell=1}^{m}{\mathbf{1}_{\left\{{W_{i\ell}=j}\right\}}}+1\over m+k}\triangleq\widehat{M}^{+1}_{m}(j|i),\quad j\in[k]

is an add-one estimator for $M(j|i)$ based on an iid sample of size $m$ . Lemma 17 below provides a high-probability bound for the add-one estimator in this iid setting. Using this result and the union bound, we have

	$\displaystyle\sum_{i:\pi_{i}\geq c_{n}}\mathbb{P}\left[D(M(\cdot\|i)\\|\widehat{M}^{+1}(\cdot\|i))>\epsilon(N_{i}),n_{i}^{-}\leq N_{i}\leq n_{i}^{+}\right]$
	$\displaystyle\leq\sum_{i:\pi_{i}\geq c_{n}}\left(n_{i}^{+}-n_{i}^{-}\right)\max_{n_{i}^{-}\leq m\leq n_{i}^{+}}\mathbb{P}\left[D(M(\cdot\|i)\\|\widehat{M}_{m}^{+1}(\cdot\|i))>\epsilon(m)\right]\leq\sum_{i:\pi_{i}\geq c_{n}}{1\over n^{2}}\leq{k\over n^{2}}$

where the second inequality applies Lemma 17 with $t=n\geq n_{i}^{+}\geq m$ and uses $n_{i}^{+}-n_{i}^{-}\leq n\pi_{i}/5$ for $\pi_{i}\geq c_{n}$ .

Combining the above with (80), we continue (79) with $\tau=25$ to get

\displaystyle\mathbb{E}\left[\sum_{i=1}^{k}{\mathbf{1}_{\left\{{X_{n}=i}\right\}}}D\left(M(\cdot|i)\|\widehat{M}^{+1}(\cdot|i)\right)\right]\lesssim{k^{2}\over n}+{(\log n)^{3}k^{3/2}\over n}+{k(\log(n+k))^{2}\over n\gamma}

which is $O\left(k^{2}\over n\right)$ whenever $k\geq(\log n)^{6}$ and $\gamma\geq{(\log(n+k))^{2}\over k}$ .

Lemma 17 (KL risk bound for add-one estimator).

Let $V_{1},\dots,V_{m}\stackrel{{\scriptstyle iid}}{{\sim}}Q$ for some distribution $Q=\left\{Q_{i}\right\}_{i=1}^{k}$ on $[k]$ . Consider the add-one estimator $\widehat{Q}^{+1}$ with $\widehat{Q}^{+1}_{i}=\frac{1}{m+k}(\sum_{j=1}^{m}{\mathbf{1}_{\left\{{V_{j}=i}\right\}}}+1)$ . There exists an absolute constant $c_{0}$ such that for any $t\geq m$ ,

\displaystyle\mathbb{P}\left[D(Q\|\widehat{Q}^{+1})\geq{2k\over m}+{c_{0}(\log t)^{3}\sqrt{k}\over m}\right]\leq{1\over t^{3}}.

Proof.

Let $\widehat{Q}$ be the empirical estimator $\widehat{Q}_{i}=\frac{1}{m}\sum_{j=1}^{m}{\mathbf{1}_{\left\{{V_{j}=i}\right\}}}$ . Then $\widehat{Q}^{+1}_{i}={m\widehat{Q}_{i}+1\over m+k}$ and hence

$\displaystyle D(Q\\|\widehat{Q}^{+1})$	$\displaystyle=\sum_{i=1}^{k}\left(Q_{i}\log{Q_{i}\over\widehat{Q}_{i}^{+1}}-Q_{i}+\widehat{Q}^{+1}_{i}\right)$
	$\displaystyle=\sum_{i=1}^{k}\left(Q_{i}\log{Q_{i}(m+k)\over m\widehat{Q}_{i}+1}-Q_{i}+{m\widehat{Q}_{i}+1\over m+k}\right)$
	$\displaystyle=\sum_{i=1}^{k}\left(Q_{i}\log{Q_{i}\over\widehat{Q}_{i}+\frac{1}{m}}-Q_{i}+\widehat{Q}_{i}+\frac{1}{m}\right)+\sum_{i=1}^{k}\left(Q_{i}\log{m+k\over m}-{k\widehat{Q}_{i}\over m+k}-{k\over m(m+k)}\right)$
	$\displaystyle\leq\sum_{i=1}^{k}\left(Q_{i}\log{Q_{i}\over\widehat{Q}_{i}+\frac{1}{m}}-Q_{i}+\widehat{Q}_{i}+\frac{1}{m}\right)+\frac{k}{m}$	(81)

with last equality following by $0\leq\log\left(m+k\over m\right)\leq k/m$ .

To control the sum in the above display it suffices to consider its Poissonized version. Specifically, we aim to show

\displaystyle\mathbb{P}\left[\sum_{i=1}^{k}\left(Q_{i}\log{Q_{i}\over\widehat{Q}_{i}^{\mathsf{poi}}+\frac{1}{m}}-Q_{i}+\widehat{Q}_{i}^{\mathsf{poi}}+\frac{1}{m}\right)>{k\over m}+{c_{0}(\log t)^{3}\sqrt{k}\over m}\right]\leq{1\over t^{4}}

(82)

where $m\widehat{Q}^{\mathsf{poi}}_{i},i=1,\ldots,k$ are distributed independently as $\mathrm{Poi}(mQ_{i})$ . (Here and below $\mathrm{Poi}(\lambda)$ denotes the Poisson distribution with mean $\lambda$ .) To see why (82) implies the desired result, letting $w={k\over m}+{c_{0}(\log t)^{3}\sqrt{k}\over m}$ and $Y=\sum_{i=1}^{k}m\widehat{Q}_{i}^{\mathsf{poi}}\sim\mathrm{Poi}(m)$ , we have

	$\displaystyle\mathbb{P}\left[\sum_{i=1}^{k}\left(Q_{i}\log{Q_{i}\over\widehat{Q}_{i}+\frac{1}{m}}-Q_{i}+\widehat{Q}_{i}+\frac{1}{m}\right)>w\right]$
	$\displaystyle\overset{\rm(a)}{=}\mathbb{P}\left[\left.\sum_{i=1}^{k}\left(Q_{i}\log{Q_{i}\over\widehat{Q}_{i}^{\mathsf{poi}}+\frac{1}{m}}-Q_{i}+\widehat{Q}_{i}^{\mathsf{poi}}+\frac{1}{m}\right)>w\right\|\sum_{i=1}^{k}Q^{\mathsf{poi}}_{i}=1\right]$
	$\displaystyle\overset{\rm(b)}{\leq}{1\over t^{4}\mathbb{P}[Y=m]}={m!\over t^{4}e^{-m}m^{m}}\overset{\rm(c)}{\lesssim}{\sqrt{m}\over t^{4}}\leq\frac{1}{t^{3}}.$		(83)

where (a) followed from the fact that conditioned on their sum independent Poisson random variables follow a multinomial distribution; (b) applies (82); (c) follows from Stirling’s approximation.

To prove (82) we rely on concentration inequalities for sub-exponential distributions. A random variable $X$ is called sub-exponential with parameters $\sigma^{2},b>0$ , denoted as $\mathsf{SE}(\sigma^{2},b)$ if

\displaystyle\mathbb{E}\left[e^{\lambda(X-\mathbb{E}[X])}\right]\leq e^{\lambda^{2}\sigma^{2}\over 2},\quad\forall|\lambda|<\frac{1}{b}.

(84)

Sub-exponential random variables satisfy the following properties [Wai19, Sec. 2.1.3]:

•

If $X$ is $\mathsf{SE}(\sigma^{2},b)$ for any $t>0$

\displaystyle\mathbb{P}\left[\left|X-\mathbb{E}[X]\right|\geq v\right]\leq\begin{cases}2e^{-v^{2}/(2\sigma^{2})},&0<v\leq{\sigma^{2}\over b}\\ 2e^{-v/(2b)},&v>{\sigma^{2}\over b}.\end{cases}

(85)

•

Bernstein condition: A random variable $X$ is $\mathsf{SE}(\sigma^{2},b)$ if it satisfies

$\displaystyle\mathbb{E}\left[\left|X-\mathbb{E}[X]\right|^{\ell}\right]\leq\frac{1}{2}\ell!\sigma^{2}b^{\ell-2},\quad\ell=2,3,\dots.$ (86)
•

If $X_{1},\dots,X_{k}$ are independent $\mathsf{SE}(\sigma^{2},b)$ , then $\sum_{i=1}^{k}X_{i}$ is $\mathsf{SE}(k\sigma^{2},b)$ .

Define $X_{i}=Q_{i}\log{Q_{i}\over\widehat{Q}_{i}^{\mathsf{poi}}+\frac{1}{m}}-Q_{i}+\widehat{Q}_{i}^{\mathsf{poi}}+\frac{1}{m},i\in[k].$ Then Lemma 18 below shows that $X_{i}$ ’s are independent $\mathsf{SE}(\sigma^{2},b)$ with $\sigma^{2}={c_{1}(\log m)^{4}\over m^{2}},b={c_{2}(\log m)^{2}\over n}$ for absolute constants $c_{1},c_{2}$ , and hence $\sum_{i=1}^{k}\left(X_{i}-\mathbb{E}[X_{i}]\right)$ is $\mathsf{SE}(k\sigma^{2},b)$ . In view of (85) for the choice $c_{0}=8(c_{1}+c_{2})$ this implies

\displaystyle\mathbb{P}\left[\sum_{i=1}^{k}\left(X_{i}-\mathbb{E}[X_{i}]\right)\geq c_{0}{(\log t)^{3}\sqrt{k}\over m}\right]

\displaystyle\leq 2e^{-{c_{0}^{2}k(\log t)^{6}\over 2m^{2}\sigma^{2}}}+2e^{-{c_{0}\sqrt{k}(\log t)^{3}\over 2mb}}\leq\frac{1}{t^{3}}.

(87)

Using $0\leq y\log y-y+1\leq(y-1)^{2},y>0$ and $\mathbb{E}\left[\frac{\lambda}{\mathrm{Poi}(\lambda)+1}\right]=\sum_{v=0}^{\infty}{e^{-\lambda}\lambda^{v+1}\over(v+1)!}=1-e^{-\lambda}$

	$\displaystyle\mathbb{E}\left[\sum_{i=1}^{k}X_{i}\right]$	$\displaystyle\leq\mathbb{E}\left[\sum_{i=1}^{k}{\left(Q_{i}-\left(\widehat{Q}_{i}^{\mathsf{poi}}+\frac{1}{m}\right)\right)^{2}\over\widehat{Q}_{i}^{\mathsf{poi}}+\frac{1}{m}}\right]$
		$\displaystyle=\sum_{i=1}^{k}mQ_{i}^{2}\mathbb{E}\left[1\over m\widehat{Q}_{i}^{\mathsf{poi}}+1\right]-1+\frac{k}{m}=\sum_{i=1}^{k}Q_{i}\left(1-e^{-mQ_{i}}\right)-1+\frac{k}{m}\leq\frac{k}{m}.$

Combining the above with (87) we get (82) as required. ∎

Lemma 18.

There exist absolute constants $c_{1},c_{2}$ such that the following holds. For any $p\in(0,1)$ and $nY\sim\mathrm{Poi}(np)$ , $X=p\log{p\over Y+\frac{1}{n}}-p+Y+\frac{1}{n}$ is $\mathsf{SE}\left({c_{1}(\log n)^{4}\over n^{2}},{c_{2}(\log n)^{2}\over n}\right)$ .

Proof.

Note that $X$ is a non-negative random variable. Since $\mathbb{E}\left[\left(X-\mathbb{E}[X]\right)^{\ell}\right]\leq 2^{\ell}\mathbb{E}\left[X^{\ell}\right]$ , by the Bernstein condition (86), it suffices to show $\mathbb{E}[X^{\ell}]\leq\left(c_{3}\ell(\log n)^{2}\over n\right)^{\ell},\ell=2,3,\dots$ for some absolute constant $c_{3}$ . guarantees the desired sub-exponential behavior. The analysis is divided into following two cases for some absolute constant $c_{4}\geq 24$ .

Case I $p\geq{c_{4}\ell\log n\over n}$ :

Using Chernoff bound for Poisson [Jan02, Theorem 3]

\displaystyle\mathbb{P}\left[|\mathrm{Poi}(\lambda)-\lambda|>x\right]\leq 2e^{-{x^{2}\over 2(\lambda+x/3)}},\quad\lambda,x>0,

(88)

we get

	$\displaystyle\mathbb{P}\left[\|Y-p\|>\sqrt{c_{4}\ell p\log n\over 4n}\right]$	$\displaystyle\leq 2\mathop{\rm exp}\left(-{c_{4}n\ell p\log n\over 8np+2\sqrt{c_{4}n\ell p\log n}}\right)$
		$\displaystyle\leq 2\mathop{\rm exp}\left(-{c_{4}\ell\log n\over{8+2\sqrt{c_{4}\ell\log n/np}}}\right)\leq{1\over n^{2\ell}}$		(89)

which implies $p/2\leq Y\leq 2p$ with probability at least $1-{n^{-2\ell}}$ . Since $0\leq X\leq{(Y-p-\frac{1}{n})^{2}\over Y+\frac{1}{n}}$ , we get $\mathbb{E}[X^{\ell}]\lesssim{\left(\sqrt{{c_{4}}\ell p\log n/4n}\right)^{2\ell}\over(p/2)^{\ell}}+{n^{\ell}\over n^{2\ell}}\lesssim\left(c_{4}\ell\log n\over n\right)^{\ell}.$

Case II $p<{c_{4}\ell\log n\over n}$ :

•

On the event $\{Y>p\}$ , we have $X\leq Y+\frac{1}{n}\leq 2Y$ , where the last inequality follows because $nY$ takes non-negative integer values. Since $X\geq 0$ , we have $X^{\ell}{\mathbf{1}_{\left\{{Y>p}\right\}}}\leq(2Y)^{\ell}{\mathbf{1}_{\left\{{Y>p}\right\}}}$ for any $\ell\geq 2$ . Using the Chernoff bound (88), we get $Y\leq{2c_{4}\ell\log n\over n}$ with probability at least $1-n^{-2\ell}$ , which implies

	$\displaystyle\mathbb{E}\left[X^{\ell}{\mathbf{1}_{\left\{{Y\geq p}\right\}}}\right]$	$\displaystyle\leq\mathbb{E}\left[(2Y)^{\ell}{\mathbf{1}_{\left\{{Y>p,Y\leq{2c_{4}\ell\log n\over n}}\right\}}}\right]+\mathbb{E}\left[(2Y)^{\ell}{\mathbf{1}_{\left\{{Y>p,Y>{2c_{4}\ell\log n\over n}}\right\}}}\right]$
		$\displaystyle\leq\left(4c_{4}\ell\log n\over n\right)^{\ell}+2^{\ell}\left(\mathbb{E}[Y^{2\ell}]\mathbb{P}\left[Y>{2c_{4}\ell\log n\over n}\right]\right)^{\frac{1}{2}}\leq\left(c_{5}\ell\log n\over n\right)^{\ell}$

for absolute constant $c_{5}$ . Here, the last inequality follows from Cauchy-Schwarz and using the Poisson moment bound [Ahl21, Theorem 2.1]:³³3For a result with less precise constants, see also [Ahl21, Eq. (1)] based on [Lat97, Corollary 1]. $\mathbb{E}[(nY)^{2\ell}]\leq\left(2\ell\over\log\left(1+{2\ell\over np}\right)\right)^{2\ell}\leq\left(c_{6}\ell\log n\right)^{2\ell}$ for some absolute constant $c_{6}$ , with the second inequality applying the assumption $p<{c_{4}\ell\log n\over n}$ .

•

As $X{\mathbf{1}_{\left\{{Y\leq p}\right\}}}\leq p\log n+\frac{1}{n}\lesssim{\ell(\log n)^{2}\over n},$ we get $\mathbb{E}\left[X^{\ell}{\mathbf{1}_{\left\{{Y\leq p}\right\}}}\right]\leq\left(c_{7}\ell(\log n)^{2}\over n\right)^{\ell}$ for some absolute constant $c_{7}$ .

∎

4.2.3 Proof of Corollary 4

We show the following monotonicity result of the prediction risk. In view of this result, Corollary 4 immediately follows from Theorem 2 and Theorem 3 (i). Intuitively, the optimal prediction risk is monotonically increasing with the number of states; this, however, does not follow immediately due to the extra assumptions of irreducibility, reversibility, and prescribed spectral gap.

Lemma 19.

$\mathsf{Risk}_{k+1,n}(\gamma_{0})\geq\mathsf{Risk}_{k,n}(\gamma_{0})$ for all $\gamma_{0}\in(0,1),k\geq 2$ .

Proof.

Fix an $M\in{\mathcal{M}}_{k}(\gamma_{0})$ such that $\gamma_{*}(M)>\gamma_{0}$ . Denote the stationary distribution $\pi$ such that $\pi M=\pi$ . Fix $\delta\in(0,1)$ and define a transition matrix $\widetilde{M}$ with $k+1$ states as follows:

\widetilde{M}=\begin{pmatrix}(1-\delta)M&\delta\mathbf{1}\\ (1-\delta)\pi&\delta\end{pmatrix}

One can verify the following:

•

$\widetilde{M}$ is irreducible and reversible;
•

The stationary distribution for $\widetilde{M}$ is $\widetilde{\pi}=((1-\delta)\pi,\delta)$
•

The absolute spectral gap of $\widetilde{M}$ is $\gamma_{*}(\widetilde{M})=(1-\delta)\gamma_{*}(M)$ , so that $\widetilde{M}\in{\mathcal{M}}_{k+1}(\gamma_{0})$ for all sufficiently small $\delta$ .
•

Let $(X_{1},\ldots,X_{n})$ and $(\widetilde{X}_{1},\ldots,\widetilde{X}_{n})$ be stationary Markov chains with transition matrices $M$ and $\widetilde{M}$ , respectively. Then as $\delta\to 0$ , $(X_{1},\ldots,X_{n})$ converges to $(\widetilde{X}_{1},\ldots,\widetilde{X}_{n})$ in law, i.e., the joint probability mass function converges pointwise.

Next fix any estimator $\widehat{M}$ for state space $[k+1]$ . Note that without loss of generality we can assume $\widehat{M}(j|i)>0$ for all $i,j\in[k+1]$ for otherwise the KL risk is infinite. Define $\widehat{M}^{\text{trunc}}$ as $\widehat{M}$ without the $k+1$ -th row and column, and denote by $\widehat{M}^{\prime}$ its normalized version, namely, $\widehat{M}^{\prime}(\cdot|i)={\widehat{M}^{\text{trunc}}(\cdot|i)\over 1-\widehat{M}^{\text{trunc}}(k+1|i)}$ for $i=1,\ldots,k$ . Then

	$\displaystyle\mathbb{E}_{\widetilde{X}^{n}}\left[D(\widetilde{M}(\cdot\|\widetilde{X}_{n})\\|{\widehat{M}(\cdot\|\widetilde{X}_{n})})\right]\xrightarrow{\delta\to 0}$	$\displaystyle~{}\mathbb{E}_{X^{n}}\left[D(M(\cdot\|X_{n})\\|{\widehat{M}(\cdot\|X_{n})})\right]$
	$\displaystyle\geq$	$\displaystyle~{}\mathbb{E}_{X^{n}}\left[D(M(\cdot\|X_{n})\\|{\widehat{M}^{\prime}(\cdot\|X_{n})})\right]$
	$\displaystyle\geq$	$\displaystyle~{}\inf_{\widehat{M}}\mathbb{E}_{X^{n}}\left[D(M(\cdot\|X_{n})\\|{\widehat{M}(\cdot\|X_{n})})\right]$

where in the first step we applied the convergence in law of $\widetilde{X}^{n}$ to $X^{n}$ and the continuity of $P\mapsto D(P\|Q)$ for fixed componentwise positive $Q$ ; in the second step we used the fact that for any sub-probability measure $Q=(q_{i})$ and its normalized version $\bar{Q}=Q/\alpha$ with $\alpha=\sum q_{i}\leq 1$ , we have $D(P\|Q)=D(P\|\bar{Q})+\log\frac{1}{\alpha}\geq D(P\|\bar{Q})$ . Taking the supremum over $M\in{\mathcal{M}}_{k}(\gamma_{0})$ on the LHS and the supremum over $\widetilde{M}\in{\mathcal{M}}_{k+1}(\gamma_{0})$ on the RHS, and finally the infimum over $\widehat{M}$ on the LHS, we conclude $\mathsf{Risk}_{k+1,n}(\gamma_{0})\geq\mathsf{Risk}_{k,n}(\gamma_{0})$ . ∎

5 Higher-order Markov chains

5.1 Basic setups

In this section we prove Theorem 5. We start with some basic definitions for higher-order Markov chains. Let $m\geq 1$ . Let $X_{1},X_{2},\dots$ be an $m^{\text{th}}$ -th order Markov chain with state space ${\mathcal{S}}$ and transition matrix $M\in\mathbb{R}^{{\mathcal{S}}^{m}\times{\mathcal{S}}}$ so that $\mathbb{P}\left[X_{t+1}=x_{t+1}|X_{t-m+1}^{t}=x_{t-m+1}^{t}\right]=M(x_{t+1}|x_{t-m+1}^{t})$ for all $t\geq m$ . Clearly, the joint distribution of the process is specified by the transition matrix and the initial distribution, which is a joint distribution for $(X_{1},\ldots,X_{m})$ .

A distribution $\pi$ on ${\mathcal{S}}^{m}$ is a stationary distribution if $\{X_{t}:t\geq 1\}$ with $(X_{1},\ldots,X_{m})\sim\pi$ is a stationary process, that is,

(X_{i_{1}+t},\ldots,X_{i_{n}+t}){\stackrel{{\scriptstyle\rm law}}{{=}}}(X_{i_{1}},\ldots,X_{i_{n}}),\quad\forall n,i_{1},\ldots,i_{n},t\in\mathbb{N}.

(90)

It is clear that (90) is equivalent to $(X_{1},\ldots,X_{m}){\stackrel{{\scriptstyle\rm law}}{{=}}}(X_{2},\ldots,X_{m+1})$ . In other words, $\pi$ is the solution to the linear system:

\displaystyle\pi(x_{1},\ldots,x_{m})=\sum_{x_{0}\in{\mathcal{S}}}\pi(x_{0},x_{1},\ldots,x_{m-1})M(x_{m}|x_{1},\ldots,x_{m-1}),\quad\forall x_{1},\ldots,x_{m}\in{\mathcal{S}}.

(91)

Note that this implies, in particular, that $\pi$ as a joint distribution of $m$ -tuples itself must satisfy those symmetry properties required by stationarity, such as all marginals being identical, etc.

Next we discuss reversibility. A random process $\{X_{t}\}$ is reversible if for any $n$ ,

X^{n}~{}{\stackrel{{\scriptstyle\rm law}}{{=}}}~{}\overline{X^{n}},

(92)

where $\overline{X^{n}}\triangleq(X_{n},\ldots,X_{1})$ denotes the time reversal of $X^{n}=(X_{1},\ldots,X_{n})$ . Note that a reversible $m^{\text{th}}$ -order Markov chain must be stationary. Indeed,

(X_{2},\ldots,X_{m+1}){\stackrel{{\scriptstyle\rm law}}{{=}}}(X_{m},\ldots,X_{1}){\stackrel{{\scriptstyle\rm law}}{{=}}}(X_{1},\ldots,X_{m}),

(93)

where the first equality follows from $(X_{1},\ldots,X_{m+1}){\stackrel{{\scriptstyle\rm law}}{{=}}}(X_{m+1},\ldots,X_{1})$ . The following lemma gives a characterization for reversibility:

Lemma 20.

An $m^{\text{th}}$ -order stationary Markov chain is reversible if and only if (92) holds for $n=m+1$ , namely

\pi(x_{1},\ldots,x_{m})M(x_{m+1}|x_{1},\ldots,x_{m})=\pi(x_{m+1},\ldots,x_{2})M(x_{1}|x_{m+1},\ldots,x_{2}),\quad\forall x_{1},\ldots,x_{m+1}\in{\mathcal{S}}.

(94)

Proof.

First, we show that (92) for $n=m+1$ implies that for $n\leq m$ . Indeed,

\displaystyle(X_{1},\ldots,X_{n}){\stackrel{{\scriptstyle\rm law}}{{=}}}(X_{m+1},\ldots,X_{m-n+2}){\stackrel{{\scriptstyle\rm law}}{{=}}}(X_{n},\ldots,X_{1})

(95)

where the first equality follows from $(X_{1},\ldots,X_{m+1}){\stackrel{{\scriptstyle\rm law}}{{=}}}(X_{m+1},\ldots,X_{1})$ and the second applies stationarity.

Next, we show (92) for $n=m+2$ and the rest follows from induction on $n$ . Indeed,

		$\displaystyle~{}\mathbb{P}\left[(X_{1},\ldots,X_{m+2})=(x_{1},\ldots,x_{m+2})\right]$
	$\displaystyle=$	$\displaystyle~{}\pi(x_{1},\ldots,x_{m})M(x_{m+1}\|x_{1},\ldots,x_{m})M(x_{m+2}\|x_{2},\ldots,x_{m+1})$
	$\displaystyle\overset{\rm(a)}{=}$	$\displaystyle~{}\pi(x_{m+1},\ldots,x_{2})M(x_{1}\|x_{m+1},\ldots,x_{2})M(x_{m+2}\|x_{2},\ldots,x_{m+1})$
	$\displaystyle\overset{\rm(b)}{=}$	$\displaystyle~{}\pi(x_{2},\ldots,x_{m+1})M(x_{1}\|x_{m+1},\ldots,x_{2})M(x_{m+2}\|x_{2},\ldots,x_{m+1})$
	$\displaystyle\overset{\rm(c)}{=}$	$\displaystyle~{}\pi(x_{m+2},\ldots,x_{3})M(x_{2}\|x_{m+2},\ldots,x_{3})M(x_{1}\|x_{m+1},\ldots,x_{2})$
	$\displaystyle=$	$\displaystyle~{}\mathbb{P}\left[(X_{1},\ldots,X_{m+2})=(x_{m+2},\ldots,x_{1})\right]=\mathbb{P}\left[(X_{m+2},\ldots,X_{1})=(x_{1},\ldots,x_{m+2})\right].$

where (a) and (c) apply (92) for $n=m+1$ , namely, (94); (b) applies (92) for $n=m$ . ∎

In view of the proof of (93), we note that any distribution $\pi$ on ${\mathcal{S}}^{m}$ and $m^{\text{th}}$ -order transition matrix $M$ satisfying $\pi(x^{m})=\pi(\overline{x^{m}})$ and (94) also satisfy (91). This implies such a $\pi$ is a stationary distribution for $M$ . In view of Lemma 20 the above conditions also guarantee reversibility. This observation can be summarized in the following lemma, which will be used to prove the reversibility of specific Markov chains later.

Lemma 21.

Let $M$ be a $k^{m}\times k$ stochastic matrix describing transitions from ${\mathcal{S}}^{m}$ to ${\mathcal{S}}$ . Suppose that $\pi$ is a distribution on ${\mathcal{S}}^{m}$ such that $\pi(x^{m})=\pi(\overline{x^{m}})$ and $\pi(x^{m})M(x_{m+1}|x^{m})=\pi(\overline{x_{2}^{m+1}})M(x_{1}|\overline{x_{2}^{m+1}})$ . Then $\pi$ is the stationary distribution of $M$ and the resulting chain is reversible.

For $m^{\rm th}$ -order stationary Markov chains, the optimal prediction risk is defined as as

	$\displaystyle{\mathsf{Risk}}_{k,n,m}$	$\displaystyle\triangleq\inf_{\widehat{M}}\sup_{M}\mathbb{E}[D(M(\cdot\|X_{n-m+1}^{n})\\|\widehat{M}(\cdot\|X_{n-m+1}^{n}))]$
		$\displaystyle=\inf_{\widehat{M}}\sup_{M}\sum_{x^{m}\in{\mathcal{S}}^{m}}\mathbb{E}[D(M(\cdot\|x^{m})\\|\widehat{M}(\cdot\|x^{m})){\mathbf{1}_{\left\{{X_{n-m+1}^{n}=x^{m}}\right\}}}]$		(96)

where the supremum is taken over all $k^{m}\times k$ stochastic matrices $M$ and the trajectory is initiated from the stationary distribution. In the remainder of this section we will show the following result, completing the proof of Theorem 5 previously announced in Section 1.

Theorem 22.

For all $m\geq 2$ , there exist a constant $C_{m}>0$ such that for all $2\leq k\leq n^{\frac{1}{m+1}}/C_{m}$ ,

{k^{m+1}\over C_{m}n}\log\left(n\over k^{m+1}\right)\leq\mathsf{Risk}_{k,n,m}\leq{C_{m}k^{m+1}\over n}\log\left(\frac{n}{k^{m+1}}\right).

Furthermore, the lower bound holds even when the Markov chains are required to be reversible.

5.2 Upper bound

We prove the upper bound part of the preceding theorem, using only stationarity (not reversibility). We rely on techniques from [CS04, Chapter 6, Page 486] for proving redundancy bounds for the $m^{\text{th}}$ -order chains. Let $Q$ be the probability assignment given by

\displaystyle Q(x^{n})=\frac{1}{k^{m}}\prod_{a^{m}\in{\mathcal{S}}^{m}}{\prod_{j=1}^{k}N_{a^{m}j}!\over k\cdot(k+1)\cdots(N_{a^{m}}+k-1)},

(97)

where $N_{a^{m}j}$ denotes the number of times the block $a^{m}j$ occurs in $x^{n}$ , and $N_{a^{m}}=\sum_{j=1}^{k}N_{a^{m}j}$ is the number of times the block $a^{m}$ occurs in $x^{n-1}$ . This probability assignment corresponds to the add-one rule

\displaystyle Q(j|x^{n})=\widehat{M}_{x^{n}}^{+1}(j|x_{n-m+1}^{n})={N_{x_{n-m+1}^{n}j}+1\over N_{x_{n-m+1}^{n}}+k}.

(98)

Then in view of Lemma 6, the following lemma proves the desired upper bound in Theorem 22.

Lemma 23.

Let $\mathsf{Red}(Q_{X^{n}})$ be the redundancy of the $m^{\text{th}}$ -order Markov chain, as defined in Section 2.1, and $X^{m}$ be the corresponding observed trajectory. Then

\mathsf{Red}(Q_{X^{n}})\leq\frac{1}{n-m}\left\{{k^{m}(k-1)}\left[\log\left(1+\frac{n-m}{k^{m}(k-1)}\right)+1\right]+m\log k\right\}.

Proof.

We show that for every Markov chain with transition matrix $M$ and initial distribution $\pi$ on ${\mathcal{S}}^{m}$ , and every trajectory $(x_{1},\cdots,x_{n})$ , it holds that

\displaystyle\log\frac{\pi(x_{1}^{m})\prod_{t=m}^{n-1}M(x_{t+1}|x^{t}_{t-m+1})}{Q(x_{1},\cdots,x_{n})}\leq k^{m}(k-1)\left[\log\left(1+\frac{n-m}{k^{m}(k-1)}\right)+1\right]+m\log k,

(99)

where $M(x_{t+1}|x^{t}_{t-m+1})$ the transition probability of going from $x^{t}_{t-m+1}$ to $x_{t+1}$ . Note that

\prod_{t=m}^{n-1}M(x_{t+1}|x^{t}_{t-m+1})=\prod_{a^{m+1}\in{\mathcal{S}}^{m+1}}M(a_{m+1}|a^{m})^{N_{a^{m+1}}}\leq\prod_{a^{m+1}\in{\mathcal{S}}^{m+1}}(N_{a^{m+1}}/N_{a^{m}})^{N_{a^{m+1}}},

where the last inequality follows from $\sum_{a_{m+1\in{\mathcal{S}}}}\frac{N_{a^{m+1}}}{N_{a^{m}}}\log\frac{N_{a^{m+1}}}{N_{a^{m}}M(a_{m+1}|a^{m})}\geq 0$ for each $a^{m}$ , by the non-negativity of the KL divergence. Therefore, we have

\displaystyle\frac{\pi(x_{1}^{m})\prod_{t=m}^{n-1}M(x_{t+1}|x_{t-m+1}^{t})}{Q(x_{1},\cdots,x_{n})}\leq k^{m}\cdot\prod_{a^{m}\in{\mathcal{S}}^{m}}\frac{k\cdot(k+1)\cdot\cdots\cdot(N_{a^{m}}+k-1)}{N_{a^{m}}^{N_{a^{m}}}}\prod_{a_{m+1}\in{\mathcal{S}}}\frac{N_{a^{m+1}}^{N_{a^{m+1}}}}{N_{a^{m+1}}!}.

(100)

Using (33) we continue (100) to get

	$\displaystyle\log\frac{\pi(x_{1})\prod_{t=m}^{n-1}M(x_{t+1}\|x_{t})}{Q(x_{1},\cdots,x_{n})}$	$\displaystyle\leq m\log k+\sum_{a^{m}\in{\mathcal{S}}^{m}}\log\frac{k\cdot(k+1)\cdot\cdots\cdot(N_{a^{m}}+k-1)}{N_{a^{m}}!}$
		$\displaystyle=m\log k+\sum_{a^{m}\in{\mathcal{S}}^{m}}\sum_{\ell=1}^{N_{a^{m}}}\log\left(1+\frac{k-1}{\ell}\right)$
		$\displaystyle\leq m\log k+\sum_{a^{m}\in{\mathcal{S}}^{m}}\int_{0}^{N_{a^{m}}}\log\left(1+\frac{k-1}{x}\right)dx$
		$\displaystyle=m\log k+\sum_{a^{m}\in{\mathcal{S}}^{m}}\left((k-1)\log\left(1+\frac{N_{a^{m}}}{k-1}\right)+N_{a^{m}}\log\left(1+\frac{k-1}{N_{a^{m}}}\right)\right)$
		$\displaystyle\overset{\rm(a)}{\leq}k^{m}(k-1)\log\left(1+\frac{n-m}{k^{m}(k-1)}\right)+k^{m}(k-1)+m\log k,$

where (a) follows from the concavity of $x\mapsto\log x$ , $\sum_{a^{m}\in{\mathcal{S}}^{m}}N_{a^{m}}=n-m+1$ , and $\log(1+x)\leq x$ . ∎

5.3 Lower bound

5.3.1 Special case: $m\geq 2,k=2$

We only analyze the case $m=2$ , i.e. second-order Markov chains with binary states, as the lower bound still applies to the case of $m\geq 3$ case. The transition matrix for second-order chains is given by a $k^{2}\times k$ stochastic matrices $M$ that gives the transition probability from the ordered pairs $(i,j)\in{\mathcal{S}}\times{\mathcal{S}}$ to some state $\ell\in{\mathcal{S}}$ :

\displaystyle M(\ell|ij)=\mathbb{P}\left[X_{3}=\ell|X_{1}=i,X_{2}=j\right].

(101)

Our result is the following.

Theorem 24.

${\mathsf{Risk}}_{2,n,2}=\Theta\left(\log n\over n\right)$ .

Proof.

The upper bound part has been shown in Lemma 23. For the lower bound, consider the following one-parametric family of transition matrices (we replace ${\mathcal{S}}$ by $\left\{1,2\right\}$ for simplicity of the notation)

\displaystyle\widetilde{\mathcal{M}}=\left\{M_{p}=\hbox{}\;\vbox{\kern 40.12498pt\hbox{$\kern 60.61786pt\kern-4.75pt\left[\kern-60.61786pt\vbox{\kern-40.12498pt\vbox{\halign{$#$\hfil\kern 2\p@\kern\@tempdima&\thinspace\hfil$#$\hfil&&\quad\hfil$#$\hfil\cr\hfil\crcr\kern-12.0pt\cr~{}$\hfil\kern 2.0pt\kern 4.75pt&1&2\crcr\kern 2.0pt\cr 11$\hfil\kern 2.0pt\kern 4.75pt&1-\frac{1}{n}&\frac{1}{n}\cr 21$\hfil\kern 2.0pt\kern 4.75pt&\frac{1}{n}&1-\frac{1}{n}\cr 12$\hfil\kern 2.0pt\kern 4.75pt&1-p&p\cr 22$\hfil\kern 2.0pt\kern 4.75pt&p&1-p\crcr\cr}}\kern-12.0pt}\,\right]$}}:0\leq p\leq 1\right\}

(109)

and place a uniform prior on $p\in[0,1]$ . One can verify that each $M_{p}$ has the uniform stationary distribution over the set $\left\{1,2\right\}\times\left\{1,2\right\}$ and the chains are reversible.

Next we introduce the set of trajectories based on which we will lower bound the prediction risk. Analogous to the set ${\mathcal{X}}=\cup_{t=1}^{n}{\mathcal{X}}_{t}$ defined in (37) for analyzing the first-order chains, we define

\displaystyle{\mathcal{V}}=\left\{1^{n-t}z^{t}:z_{1}=z_{2}=z_{t}=2,z_{i}^{i+1}\neq 11,i\in[t-1],t=4,\dots,n-2\right\}\subset\{1,2\}^{n}.

(110)

In other words, the sequences in ${\mathcal{V}}$ start with a string of 1’s before transitioning into two consecutive 2’s, are forbidden to have no consecutive 1’s thereafter, and finally end with 2.

To compute the probability of sequences in ${\mathcal{V}}$ , we need the following preparations. Denote by $\oplus$ the the operation that combines any two blocks from $\left\{22,212\right\}$ via merging the last symbol of the first block and the first symbol of the second block, for example, $22\oplus 212=2212,22\oplus 22\oplus 22=2222$ . Then for any $x^{n}\in{\mathcal{V}}$ we can write it in terms of the initial all-1 string, followed by alternating run of blocks from $\{22,212\}$ with the first run being of the block 22 (all the runs have positive lengths), combined with the merging operation $\oplus$ :

\displaystyle x^{n}=\underbrace{1\dots 1}_{\text{all ones}}\underbrace{22\oplus 22\dots\oplus 22}_{p_{1}\text{ many }22}\oplus\underbrace{212\oplus 212\dots\oplus 212}_{p_{2}\text{ many }212}\oplus\underbrace{22\oplus 22\dots\oplus 22}_{p_{3}\text{ many }22}\oplus\underbrace{212\oplus 212\dots\oplus 212}_{p_{4}\text{ many }212}\oplus 22\oplus\dots.

(111)

Let the vector $(q_{22\to 22},q_{22\to 212},q_{212\to 22},q_{212\to 212})$ denotes the transition probabilities between blocks in $\left\{22,212\right\}$ (recall the convention that the two blocks overlap in the symbol 2). Namely, according to (109),

	$\displaystyle q_{22\to 22}$	$\displaystyle=\mathbb{P}\left[X_{3}=2,X_{2}=2\|X_{2}=2,X_{1}=2\right]=M(2\|22)=1-p$
	$\displaystyle q_{22\to 212}$	$\displaystyle=\mathbb{P}\left[X_{4}=2,X_{3}=1,X_{2}=2\|X_{2}=2,X_{1}=2\right]=M(2\|21)M(1\|22)=\left(1-\frac{1}{n}\right)p$
	$\displaystyle q_{212\to 22}$	$\displaystyle=\mathbb{P}\left[X_{4}=2,X_{3}=2\|X_{3}=2,X_{2}=1,X_{1}=2\right]=M(2\|12)=p$
	$\displaystyle q_{212\to 212}$	$\displaystyle=\mathbb{P}\left[X_{5}=2,X_{4}=1,X_{3}=2\|X_{3}=2,X_{2}=1,X_{1}=2\right]=M(2\|21)M(1\|12)=\left(1-\frac{1}{n}\right)(1-p).$

Given any $x^{n}\in{\mathcal{V}}$ we can calculate its probability under the law of $M_{p}$ using frequency counts $\bm{F}(x^{n})=\left(F_{111},F_{22\to 22},F_{22\to 212},F_{212\to 22},F_{212\to 212}\right)$ , defined as

	$\displaystyle F_{111}=\sum_{i}{\mathbf{1}_{\left\{{x_{i}=1,x_{i+1}=1,x_{i+2}=1}\right\}}},\quad F_{22\to 22}=\sum_{i}{\mathbf{1}_{\left\{{x_{i}=2,x_{i+1}=2,x_{i+2}=2}\right\}}},$
	$\displaystyle F_{22\to 212}=\sum_{i}{\mathbf{1}_{\left\{{x_{i}=2,x_{i+1}=2,x_{i+2}=1,x_{i+3}=2}\right\}}},\quad F_{212\to 22}=\sum_{i}{\mathbf{1}_{\left\{{x_{i}=2,x_{i+1}=1,x_{i+2}=2,x_{i+3}=2}\right\}}},$
	$\displaystyle F_{212\to 212}=\sum_{i}{\mathbf{1}_{\left\{{x_{i}=2,x_{i+1}=1,x_{i+2}=2,x_{i+3}=1,x_{i+4}=2}\right\}}}.$

Denote $\mu(x^{n}|p)=\mathbb{P}\left[X^{n}=x^{n}|p\right]$ . Then for each $x^{n}\in{\mathcal{V}}$ with $\bm{F}(x^{n})=\bm{F}$ we have

		$\displaystyle\mu(x^{n}\|p)$
		$\displaystyle=\mathbb{P}(X^{F_{111}+2}=1^{F_{111}+2})M(2\|11)M(2\|12)\prod_{a,b\in\left\{22,212\right\}}q_{a\to b}^{F_{a\to b}}$
		$\displaystyle=\frac{1}{4}\left(1-\frac{1}{n}\right)^{F_{111}}\frac{1}{n}\cdot p\cdot p^{F_{212\to 22}}\left\{p\left(1-\frac{1}{n}\right)\right\}^{F_{22\to 212}}(1-p)^{F_{22\to 22}}\left\{(1-p)\left(1-\frac{1}{n}\right)\right\}^{F_{212\to 212}}$
		$\displaystyle=\frac{1}{4}\left(1-\frac{1}{n}\right)^{F_{111}+F_{22\to 212}+F_{212\to 212}}\frac{1}{n}p^{y+1}(1-p)^{f-y}$		(112)

where $y=F_{212\to 22}+F_{22\to 212}$ denotes the number of times the chain alternates between runs of 22 and runs of 212, and $f=F_{212\to 22}+F_{22\to 212}+F_{212\to 212}+F_{22\to 22}$ denotes the number of times the chain jumps between blocks in $\{22,212\}$ .

Note that the range of $f$ includes all the integers in between 1 and $(n-6)/2$ . This follows from the definition of ${\mathcal{V}}$ and the fact that if we merge either 22 or 212 using the operation $\oplus$ at the end of any string $z^{t}$ with $z_{t}=2$ , it increases the length of the string by at most 2. Also, given any value of $f$ the value of $y$ ranges from 0 to $f$ .

Lemma 25.

The number of sequences in ${\mathcal{V}}$ corresponding to a fixed pair $(y,f)$ is $\binom{f}{y}$ .

Proof.

Fix $x^{n}\in{\mathcal{V}}$ and let that $p_{2i-1}$ is the length of the $i$ -th run of 22 blocks and $p_{2i}$ is the length of the $i$ -th run of 212 blocks in $x^{n}$ as depicted in (111). The $p_{i}$ ’s are all non-negative integers. There are total $y+1$ such runs and the $p_{i}$ ’s satisfy $\sum_{i=1}^{y+1}p_{i}=f+1$ , as the total number of blocks is one more than the total number of transitions. Each positive integer solution to this equation $\left\{p_{i}\right\}_{i=1}^{y+1}$ corresponds to a sequence $x^{n}\in\cal V$ and vice versa. The total number of such sequences is $\binom{f}{y}$ . ∎

We are now ready to compute the Bayes estimator and risk. For any $x^{n}\in{\mathcal{V}}$ with a given $(y,f)$ , the Bayes estimator of $p$ with prior $p\sim\mathrm{Uniform}[0,1]$ is

\widehat{p}(x^{n})=\mathbb{E}[p|x^{n}]=\frac{\mathbb{E}[p\cdot\mu(x^{n}|p)]}{\mathbb{E}[\mu(x^{n}|p)]}\overset{\eqref{eq:MC_prob-F}}{=}{y+2\over f+3}.

Note that the probabilities $\mu(x^{n}|p)$ in (5.3.1) can be bounded from below by $\frac{1}{4en}p^{y+1}(1-p)^{f-y}$ . Using this, for each $x^{n}\in{\mathcal{V}}$ with given $y,f$ we get the following bound on the integrated squared error for a particular sequence $x^{n}$

	$\displaystyle\int_{0}^{1}\mu(x^{n}\|p)(p-\widehat{p}(x^{n}))^{2}dp$
	$\displaystyle\geq\frac{1}{4en}\int_{0}^{1}p^{y+1}(1-p)^{f-y}\left(p-{y+2\over f+3}\right)^{2}dp=\frac{1}{4en}{(y+1)!(f-y)!\over(f+2)!}{(y+2)(f-y+1)\over(f+3)^{2}(f+4)}$		(113)

where the last equality followed by noting that the integral is the variance of a $\text{Beta}(y+2,f-y+1)$ random variable without its normalizing constant.

Next we bound the risk of any predictor by the Bayes error. Consider any predictor $\widehat{M}(\cdot|ij)$ (as a function of the sample path $X$ ) for transition from $ij$ , $i,j\in\left\{1,2\right\}$ . By the Pinsker’s inequality, we conclude that

\displaystyle D(M(\cdot|12)\|\widehat{M}(\cdot|12))\geq\frac{1}{2}\|M(\cdot|12)-\widehat{M}(\cdot|12)\|_{\ell_{1}}^{2}\geq\frac{1}{2}(p-\widehat{M}(2|12))^{2}

(114)

and similarly, $D(M(\cdot|22)\|\widehat{M}(\cdot|22))\geq\frac{1}{2}(p-\widehat{M}(1|22))^{2}$ . Abbreviate $\widehat{M}(2|12)\equiv\widehat{p}_{12}$ and $\widehat{M}(1|22)\equiv\widehat{p}_{22}$ , both functions of $X$ . Using (113) and Lemma 25, we have

	$\displaystyle\sum_{i,j=1}^{3}\mathbb{E}[D(M(\cdot\|ij)\\|\widehat{M}(\cdot\|ij))){\mathbf{1}_{\left\{{X_{n-1}^{n}=ij}\right\}}}]$
	$\displaystyle\geq~{}\frac{1}{2}\mathbb{E}\left[(p-\widehat{p}_{12})^{2}{\mathbf{1}_{\left\{{X_{n-1}^{n}=12,X^{n}\in{\mathcal{V}}}\right\}}}+(p-\widehat{p}_{22})^{2}{\mathbf{1}_{\left\{{X_{n-1}^{n}=22,X^{n}\in{\mathcal{V}}}\right\}}}\right]$
	$\displaystyle\geq~{}\frac{1}{2}\int_{0}^{1}\left[\sum_{\bm{F}}\sum_{x^{n}\in{\mathcal{V}}:\bm{F}(x^{n})=\bm{F}}\mu(x^{n}\|p)\left((p-\widehat{p}_{12})^{2}{\mathbf{1}_{\left\{{x_{n-1}^{n}=12}\right\}}}+(p-\widehat{p}_{22})^{2}{\mathbf{1}_{\left\{{x_{n-1}^{n}=22}\right\}}}\right)\right]dp$
	$\displaystyle\geq~{}\frac{1}{2}\int_{0}^{1}\left[\sum_{\bm{F}}\sum_{x^{n}\in{\mathcal{V}}:\bm{F}(x^{n})=\bm{F}}\mu(x^{n}\|p)(p-\widehat{p}(x^{n}))^{2}\right]dp$
	$\displaystyle\geq~{}\frac{1}{2}\sum_{f=1}^{\frac{n-6}{2}}\sum_{y=0}^{f}\binom{f}{y}\frac{1}{4en}{(y+1)!(f-y)!\over(f+2)!}{(y+2)(f-y+1)\over(f+3)^{2}(f+4)}$
	$\displaystyle\geq~{}\frac{1}{8en}\sum_{f=1}^{\frac{n-6}{2}}\sum_{y=0}^{f}{y+1\over(f+2)(f+1)}{(y+2)(f-y+1)\over(f+3)^{2}(f+4)}\geq~{}\Theta\left(\frac{1}{n}\right)\sum_{f=1}^{\frac{n-6}{2}}\sum_{y=\frac{f}{4}}^{\frac{f}{3}}\frac{1}{f^{2}}=~{}\Theta\left(\log n\over n\right).$		(115)

∎

5.3.2 General case: $m\geq 2,k\geq 3$

We will prove the following.

Theorem 26.

For absolute constant $C$ , we have

\mathsf{Risk}_{k,n,m}\geq\frac{1}{2^{m+4}}\left(\frac{1}{2}-{2^{m}-2\over n}\right)\left(1-\frac{1}{n}\right)^{n-2m+1}{(k-1)^{m+1}\over n}\log\left({1\over 2^{2m+8}\cdot 3\pi e(m+1)}\cdot{n-m\over(k-1)^{m+1}}\right).

For ease of notation let ${\mathcal{S}}=\left\{1,\dots,k\right\}$ . Denote ${\widetilde{\mathcal{S}}}=\left\{2,\dots,k\right\}$ . Consider an $m^{\text{th}}$ -order transition matrix $M$ of the following form:

\displaystyle M(s|x^{m})=\text{\begin{tabular}[]{|c|c|c|}\hline\cr\hbox{\multirowsetup Starting string $x^{m}$}&\lx@intercol\hfil Next state\hfil\lx@intercol\vrule\lx@intercol \\ \cline{2-3}\cr&$s=1$&$s\in\left\{2,\dots,k\right\}$\\ \hline\cr&&\\ $1^{m}$&$1-\frac{1}{n}$&$\frac{1}{n(k-1)}$\\ &&\\ \hline\cr&&\\ \hbox{\multirowsetup$1x^{m-1},x^{m-1}\in{\widetilde{\mathcal{S}}}^{m-1}$}&$1-b$&$\frac{b}{(k-1)}$\\ &&\\ \hline\cr&&\\ \hbox{\multirowsetup$x^{m}\in{\widetilde{\mathcal{S}}}^{m}$}&$\frac{1}{n}$&{\mbox{$\left(1-\frac{1}{n}\right)T(s|x^{m})$}}\\ &&\\ \hline\cr&&\\ \hbox{\multirowsetup$x^{m}\notin\left\{1^{m},1{\widetilde{\mathcal{S}}}^{m-1},{\widetilde{\mathcal{S}}}^{m}\right\}$}&$\frac{1}{2}$&$\frac{1}{2(k-1)}$\\ &&\\ \hline\cr\end{tabular} },\quad b=\frac{1}{2}-{2^{m}-2\over n}.

(130)

Here $T$ is a $(k-1)^{m}\times(k-1)$ transition matrix for an $m^{\rm th}$ -order Markov chain with state space $\widetilde{\mathcal{S}}$ , satisfying the following property:

(P)

$T(x_{m+1}|x^{m})=T(x_{1}|\overline{x_{2}^{m+1}}),\quad\forall x^{m+1}\in\widetilde{\mathcal{S}}^{m+1}$ .

Lemma 27.

Under the condition (P), the transition matrix $T$ has a stationary distribution that is uniform on ${\widetilde{\mathcal{S}}}^{m}$ . Furthermore, the resulting $m^{\rm th}$ -order Markov chain is reversible (and hence stationary).

Proof.

We prove this result using Lemma 21. Let $\pi$ denote the uniform distribution on $\widetilde{\mathcal{S}}^{m}$ , i.e., $\pi(x^{m})=\frac{1}{(k-1)^{m}}$ for all $x^{m}\in\widetilde{\mathcal{S}}^{m}$ . Then for any $x^{m}\in\widetilde{\mathcal{S}}^{m}$ the condition $\pi(x^{m})=\pi(\overline{x^{m}})$ follows directly and $\pi(x^{m})T(x_{m+1}|x^{m})=\pi(\overline{x_{2}^{m+1}})T(x_{1}|\overline{x_{2}^{m+1}})$ follows from the assumption (P). ∎

Next we address the stationarity and reversibility of the chain with the bigger transition matrix $M$ in (130):

Lemma 28.

Let $M$ be defined in (130), wherein the transition matrix $T$ satisfies the condition (P). Then $M$ has a stationary distribution given by

\displaystyle\pi(x^{m})=\begin{cases}\frac{1}{2}&x^{m}=1^{m}\\ \frac{b}{(k-1)^{m}}&x^{m}\in{\widetilde{\mathcal{S}}}^{m}\\ \frac{1}{n(k-1)^{d(x^{m})}}&\text{otherwise}\end{cases}

(131)

where $d(x^{m})\triangleq\sum_{i=1}^{m}{\mathbf{1}_{\left\{{x_{i}\in{\widetilde{\mathcal{S}}}}\right\}}}$ and $b=\frac{1}{2}-{2^{m}-2\over n}$ as in (130). Furthermore, the $m^{\rm th}$ -order Markov chain with initial distribution $\pi$ and transition matrix $M$ is reversible.

Proof.

Note that the choice of $b$ guarantees that $\sum_{x^{m}\in{\mathcal{S}}^{m}}\pi(x^{m})=1$ . Next we again apply Lemma 21 to verify stationarity and reversibility. First of all, since $d(x^{m})=d(\overline{x^{m}})$ , we have $\pi(x^{m})=\pi(\overline{x^{m}})$ for all $x^{m}\in{\mathcal{S}}^{m}$ . Next we check the condition $\pi(x^{m})M(x_{m+1}|x^{m})=\pi(\overline{x_{2}^{m+1}})M(x_{1}|\overline{x_{2}^{m+1}})$ . For the sequence $1^{m+1}$ the claim is easily verified. For the rest of the sequences we have the following.

•

Case 1 ( $x^{m+1}\in{\widetilde{\mathcal{S}}}^{m+1}$ ): Note that $x^{m+1}\in\widetilde{\mathcal{S}}^{m+1}$ if and only if $x^{m},{\overline{x_{2}^{m+1}}}\in\widetilde{\mathcal{S}}^{m}$ . This implies

	$\displaystyle\pi(x^{m})M(x_{m+1}\|x^{m})$	$\displaystyle=\frac{b}{(k-1)^{m}}\left(1-\frac{1}{n}\right)T(x_{m+1}\|x^{m})$
		$\displaystyle=\frac{b}{(k-1)^{m}}\left(1-\frac{1}{n}\right)T(x_{1}\|\overline{x_{2}^{m+1}})=\pi(\overline{x_{2}^{m+1}})M(x_{1}\|\overline{x_{2}^{m+1}}).$

•

Case 2 ( $x^{m+1}\in 1{\widetilde{\mathcal{S}}}^{m}$ or $x^{m+1}\in{\widetilde{\mathcal{S}}}^{m}1$ ): By symmetry it is sufficient to analyze the case $x^{m+1}\in 1{\widetilde{\mathcal{S}}}^{m}$ . Note that in the sub-case $x^{m+1}\in 1{\widetilde{\mathcal{S}}}^{m}$ , $x^{m}\in 1\widetilde{\mathcal{S}}^{m-1}$ and $\overline{x_{2}^{m+1}}\in\widetilde{\mathcal{S}}^{m}$ . This implies

	$\displaystyle\pi(x^{m})=\frac{1}{n(k-1)^{m-1}},$	$\displaystyle\quad M(x_{m+1}\|x^{m})=\frac{b}{k-1},$
	$\displaystyle\pi(\overline{x_{2}^{m+1}})=\frac{b}{(k-1)^{m}},$	$\displaystyle\quad M(x_{1}\|\overline{x_{2}^{m+1}})=\frac{1}{n}.$		(132)

In view of this we get $\pi(x^{m})M(x_{m+1}|x^{m})=\pi(\overline{x_{2}^{m+1}})M(x_{1}|\overline{x_{2}^{m+1}}).$

•

Case 3 ( $x^{m+1}\notin 1^{m+1}\cup{\widetilde{\mathcal{S}}}^{m+1}\cup 1{\widetilde{\mathcal{S}}}^{m}\cup{\widetilde{\mathcal{S}}}^{m}1$ ):
Suppose that $x^{m+1}$ has $d$ many elements from ${\widetilde{\mathcal{S}}}$ . Then $x^{m},x_{2}^{m+1}\notin\left\{1^{m},{\widetilde{\mathcal{S}}}^{m}\right\}$ . We have the following sub-cases.
- –
  
  If $x_{1}=x_{m+1}=1$ , then both $x^{m},x_{2}^{m+1}$ have exactly $d$ elements from ${\widetilde{\mathcal{S}}}$ . This implies $\pi(x^{m})=\pi(\overline{x_{2}^{m+1}})=\frac{1}{n(k-1)^{d}}$ and $M(x_{m+1}|x^{m})=M(x_{1}|\overline{x_{2}^{m+1}})=\frac{1}{2}$ .
- –
  
  If $x_{1},x_{m+1}\in{\widetilde{\mathcal{S}}}$ , then both $x^{m},x_{2}^{m+1}$ have exactly $d-1$ elements from ${\widetilde{\mathcal{S}}}$ . This implies $\pi(x^{m})=\pi(\overline{x_{2}^{m+1}})=\frac{1}{n(k-1)^{d-1}}$ and $M(x_{m+1}|x^{m})=M(x_{1}|\overline{x_{2}^{m+1}})=\frac{1}{2(k-1)}$ .
- –
  
  If $x_{1}=1,x_{m+1}\in{\widetilde{\mathcal{S}}}$ , then $x^{m}$ has $d-1$ elements from $\widetilde{S}$ and $x_{2}^{m+1}$ has $d$ elements from ${\mathcal{S}}$ . This implies $\pi(x^{m})=\frac{1}{n(k-1)^{d-1}},\pi(\overline{x_{2}^{m+1}})=\frac{1}{n(k-1)^{d}}$ and $M(x_{m+1}|x^{m})=\frac{1}{2(k-1)},M(x_{1}|\overline{x_{2}^{m+1}})=\frac{1}{2}$ .
- –
  
  If $x_{1}\in{\widetilde{\mathcal{S}}},x_{m+1}=1$ , then $x^{m}$ has $d$ elements from $\widetilde{S}$ and $x_{2}^{m+1}$ has $d-1$ elements from ${\mathcal{S}}$ then $\pi(x^{m})=\frac{1}{n(k-1)^{d}},\pi(\overline{x_{2}^{m+1}})=\frac{1}{n(k-1)^{d-1}}$ and $M(x_{m+1}|x^{m})=\frac{1}{2},M(x_{1}|\overline{x_{2}^{m+1}})=\frac{1}{2(k-1)}$ .
For all these sub-cases we have $\pi(x^{m})M(x_{m+1}|x^{m})=\pi(\overline{x_{2}^{m+1}})M(x_{1}|\overline{x_{2}^{m+1}})$ as required.

This finishes the proof. ∎

Let $(X_{1},\ldots,X_{n})$ be the trajectory of a stationary Markov chain with transition matrix $M$ as in (130). We observe the following properties:

(R1)

This Markov chain is irreducible and reversible. Furthermore, the stationary distribution $\pi$ assigns probability $\frac{1}{2}$ to the initial state $1^{m}$ .

(R2)

For $m\leq t\leq n-1$ , let ${\mathcal{X}}_{t}$ denote the collections of trajectories $x^{n}$ such that $x_{1},x_{2},\cdots,x_{t}=1$ and $x_{t+1},\cdots,x_{n}\in{\widetilde{\mathcal{S}}}$ . Then using Lemma 28

$\displaystyle\mathbb{P}(X^{n}\in{\mathcal{X}}_{t})$	$\displaystyle=\mathbb{P}(X_{1}=\cdots=X_{t}=1)\cdot\mathbb{P}(X_{t+1}\neq 1\|X_{t-m+1}^{t}=1^{m})$
	$\displaystyle\quad\cdot\prod_{i=2}^{m-1}\mathbb{P}(X_{t+i}\neq 1\|X_{t-m+i}^{t}=1^{m-i+1},X_{t+1}^{t+i-1}\in{\widetilde{\mathcal{S}}}^{i-1})$
	$\displaystyle\quad\cdot\mathbb{P}(X_{t+m}\neq 1\|X_{t}=1,X_{t+1}^{t+m-1}\in{\widetilde{\mathcal{S}}}^{m-1})\cdot\prod_{s=t+m}^{n-1}\mathbb{P}(X_{s+1}\neq 1\|X_{s-m+1}^{s}\in{\widetilde{\mathcal{S}}}^{m})$
	$\displaystyle=\frac{1}{2}\cdot\left(1-\frac{1}{n}\right)^{t-m}\cdot\frac{b}{n2^{m-2}}\cdot\left(1-\frac{1}{n}\right)^{n-m-t}=\frac{b}{n2^{m-1}}\left(1-\frac{1}{n}\right)^{n-2m}.$	(133)

Moreover, this probability does not depend of the choice of $T$ ;

(R3)

Conditioned on the event that $X^{n}\in{\mathcal{X}}_{t}$ , the trajectory $(X_{t+1},\cdots,X_{n})$ has the same distribution as a length- $(n-t)$ trajectory of a stationary $m^{\text{th}}$ -order Markov chain with state space ${\widetilde{\mathcal{S}}}$ and transition probability $T$ , and the uniform initial distribution. Indeed,

	$\displaystyle\mathbb{P}\left[X_{t+1}=x_{t+1},\ldots,X_{n}=x_{n}\|X^{n}\in{\mathcal{X}}_{t}\right]$
	$\displaystyle=\frac{\frac{1}{2}\cdot\left(1-\frac{1}{n}\right)^{t-m}\cdot\frac{b}{n2^{m-2}(k-1)^{m}}\prod_{s=t+m}^{n-1}\left(1-\frac{1}{n}\right)T(x_{s+1}\|x_{s-m+1}^{s})}{\frac{b}{n2^{m-1}}\left(1-\frac{1}{n}\right)^{n-2m}}$
	$\displaystyle=\frac{1}{(k-1)^{m}}\prod_{s=t+m}^{n-1}T(x_{s+1}\|x_{s-m+1}^{s}).$

Reducing the Bayes prediction risk to mutual information

Consider the following Bayesian setting, we first draw $T$ from some prior satisfying property (P), then generate the stationary $m^{\rm th}$ -order Markov chain $X^{n}=(X_{1},\ldots,X_{n})$ with state space $[k]$ and transition matrix $M$ in (130) and stationary distribution $\pi$ in (131). The following lemma lower bounds the Bayes prediction risk.

Lemma 29.

Conditioned on $T$ , let $Y^{n}=(Y_{1},\ldots,Y_{n})$ denote an $m^{\rm th}$ -order stationary Markov chain on state space ${\widetilde{\mathcal{S}}}=\{2,\ldots,k\}$ with transition matrix $T$ and uniform initial distribution. Then

	$\displaystyle\inf_{\widehat{M}}\mathbb{E}_{T}\left[\mathbb{E}[D(M(\cdot\|X_{n-m+1}^{n})\\|\widehat{M}(\cdot\|X_{n-m+1}^{n})))]\right]$
	$\displaystyle\geq\frac{b(n-1)}{n^{2}2^{m-1}}\left(1-\frac{1}{n}\right)^{n-2m}\left(I(T;Y^{n-m})-m\log(k-1)\right).$

Proof.

We first relate the Bayes estimator of $M$ and $T$ (given the $X$ and $Y$ chain respectively). For each $m\leq t\leq n$ , denote by $\widehat{M}_{t}=\widehat{M}_{t}(\cdot|x^{t})$ the Bayes estimator of $M(\cdot|x_{t-m+1}^{t})$ given $X^{t}=x^{t}$ , and $\widehat{T}_{t}(\cdot|y^{t})$ the Bayes estimator of $T(\cdot|y_{t-m+1}^{t})$ given $Y^{t}=y^{t}$ . For each $t=1,\ldots,n-1$ and for each trajectory $x^{n}=(1,\ldots,1,x_{t+1},\ldots,x_{n})\in{\mathcal{X}}_{t}$ , recalling the form (21) of the Bayes estimator, we have, for each $j\in{\widetilde{\mathcal{S}}}$ ,

	$\displaystyle\widehat{M}_{n}(j\|x^{n})$
	$\displaystyle=~{}\frac{\mathbb{P}\left[X^{n+1}=(x^{n},j)\right]}{\mathbb{P}\left[X^{n}=x^{n}\right]}$
	$\displaystyle=~{}\frac{\mathbb{E}[\frac{1}{2}\cdot\left(1-\frac{1}{n}\right)^{t-m}\cdot\frac{b}{n2^{m-2}(k-1)^{m}}\prod_{s=t+m}^{n-1}M(x_{s+1}\|x_{s-m+1}^{s})M(j\|x_{n-m+1}^{n})]}{\mathbb{E}[\frac{1}{2}\cdot\left(1-\frac{1}{n}\right)^{t-m}\cdot\frac{b}{n2^{m-2}(k-1)^{m}}\prod_{s=t+m}^{n-1}M(x_{s+1}\|x_{s-m+1}^{s})]}$
	$\displaystyle=~{}\left(1-\frac{1}{n}\right)\frac{\mathbb{E}[\frac{1}{(k-1)^{m}}\prod_{s=t+m}^{n-1}T(x_{s+1}\|x_{s-m+1}^{s})T(j\|x_{n-m+1}^{n})]}{\mathbb{E}[\frac{1}{(k-1)^{m}}\prod_{s=t+m}^{n-1}T(x_{s+1}\|x_{s-m+1}^{s})]}$
	$\displaystyle=~{}\left(1-\frac{1}{n}\right)\frac{\mathbb{P}\left[Y^{n-t+1}=(x_{t+1}^{n},j)\right]}{\mathbb{P}\left[Y^{n-t}=x_{t+1}^{n}\right]}$
	$\displaystyle=~{}\left(1-\frac{1}{n}\right)\widehat{T}_{n-t}(j\|x_{t+1}^{n}).$

Furthermore, since $M(1|x^{m})=1/n$ for all $x^{m}\in{\widetilde{\mathcal{S}}}$ in the construction (130), the Bayes estimator also satisfies $\widehat{M}_{n}(1|x^{n})=1/n$ for $x^{n}\in{\mathcal{X}}_{t}$ and $t\leq n-m$ . In all, we have

\widehat{M}_{n}(\cdot|x^{n})=\frac{1}{n}\delta_{1}+\left(1-\frac{1}{n}\right)\widehat{T}_{n-t}(\cdot|x_{t+1}^{n}),\quad x^{n}\in{\mathcal{X}}_{t},t\leq n-m.

(134)

with $\delta_{1}$ denoting the point mass at state 1, which parallels the fact that

M(\cdot|y^{m})=\frac{1}{n}\delta_{1}+\left(1-\frac{1}{n}\right)T(\cdot|y^{m}),\quad y^{m}\in{\widetilde{\mathcal{S}}}^{m}.

(135)

By (R2), each event $\{X^{n}\in{\mathcal{X}}_{t}\}$ occurs with probability at least ${b\over n2^{m-1}}\left(1-\frac{1}{n}\right)^{n-2m}$ , and is independent of $T$ . Therefore,

		$\displaystyle\mathbb{E}_{T}\left[\mathbb{E}[D(M(\cdot\|X_{n-1}X_{n})\\|\widehat{M}(\cdot\|X^{n}))]\right]$
		$\displaystyle\geq\frac{b}{n2^{m-1}}\left(1-\frac{1}{n}\right)^{n-2m}\sum_{t=m}^{n-m}\mathbb{E}_{T}\left[\mathbb{E}[D(M(\cdot\|X_{n-m+1}^{n})\\|\widehat{M}(\cdot\|X^{n}))\|X^{n}\in{\mathcal{X}}_{t}]\right].$		(136)

By (R3), the conditional joint law of $(T,X_{t+1},\ldots,X_{n})$ on the event $\{X^{n}\in{\mathcal{X}}_{t}\}$ is the same as the joint law of $(T,Y_{1},\ldots,Y_{n-t})$ . Thus, we may express the Bayes prediction risk in the $X$ chain as

	$\displaystyle\mathbb{E}_{T}\left[\mathbb{E}[D(M(\cdot\|X_{n-m+1}^{n})\\|\widehat{M}(\cdot\|X^{n}))\|X^{n}\in{\mathcal{X}}_{t}]\right]$	$\displaystyle\overset{\rm(a)}{=}\left(1-\frac{1}{n}\right)\cdot\mathbb{E}_{T}\left[\mathbb{E}[D(T(\cdot\|Y_{n-t-m+1}^{n-t})\\|\widehat{T}(\cdot\|Y^{n-t}))]\right]$
		$\displaystyle\overset{\rm(b)}{=}\left(1-\frac{1}{n}\right)\cdot I(T;Y_{n-t+1}\|Y^{n-t}),$		(137)

where (a) follows from (134), (135), and the fact that for distributions $P,Q$ supported on ${\widetilde{\mathcal{S}}}$ , $D(\epsilon\delta_{1}+(1-\epsilon)P\|\epsilon\delta_{1}+(1-\epsilon)Q)=(1-\epsilon)D(P\|Q)$ ; (b) is the mutual information representation (20) of the Bayes prediction risk. Finally, the lemma follows from (5.3.2), (5.3.2), and the chain rule

\displaystyle\sum_{t=m}^{n-m}I(T;Y_{n-t+1}|Y^{n-t})=I(T;Y^{n-m})-I(T;Y^{m})\geq I(T;Y^{n-m})-m\log(k-1),

as $I(T;Y^{m})\leq H(Y^{m})\leq m\log(k-1)$ . ∎

Prior construction and lower bounding the mutual information

We assume that $k=2k_{0}+1$ for some integer $k_{0}$ . For simplicity of notation we replace $\widetilde{S}$ by ${\mathcal{Y}}={1,\dots,k-1}$ . This does not affect the lower bound. Define an equivalent relation on $|{\mathcal{Y}}|^{m-1}$ given by the following rule: $x^{m-1}$ and $y^{m-1}$ are related if and only if $x^{m-1}=y^{m-1}$ or $x^{m-1}=\overline{y^{m-1}}$ . Let $R_{m-1}$ be a subset of ${\mathcal{Y}}^{m-1}$ that consists of exactly one representative from each of the equivalent classes. As each of the equivalent classes under this relation will have at most two elements the total number of equivalent classes is at least $|{\mathcal{Y}}|^{m-1}\over 2$ , i.e., $|R_{m-1}|\geq{(k-1)^{m-1}\over 2}$ . We consider the following prior: let $u=\left\{u_{ix^{m-1}j}\right\}_{i\leq j\in{[k_{0}]},x^{m-1}\in R_{m-1}}$ be iid and uniformly distributed in $[1/(4k_{0}),3/(4k_{0})]$ and for each $i\leq j,x^{m-1}\in R_{m-1}$ define $u_{jx^{m-1}i},u_{i\overline{x^{m-1}}j},u_{j\overline{x^{m-1}}i}$ to be same as $u_{i{x^{m-1}}j}$ . Let the transition matrix $T$ be given by

		$\displaystyle T(2j-1\|2i-1,x^{m-1})=T(2j\|2i,x^{m-1})=u_{ix^{m-1}j},$
		$\displaystyle T(2j\|2i-1,x^{m-1})=T(2j-1\|2i,x^{m-1})=\frac{1}{k_{0}}-u_{ix^{m-1}j},\quad i,j\in{\mathcal{Y}},x^{m-1}\in{\mathcal{Y}}^{m-1}.$		(138)

One can check that the constructed $T$ is a stochastic matrix and satisfies the property (P), which enforces uniform stationary distribution. Also each entry of $T$ belongs to the interval $[{1\over 2(k-1)},{3\over 2(k-1)}]$ .

Next we use the following lemma to derive estimation guarantees on $T$ .

Lemma 30.

Suppose that $T$ is an $\ell^{m}\times\ell$ transition matrix, on state space ${\mathcal{Y}}^{m}$ with $|{\mathcal{Y}}|=\ell$ , satisfying $T(x_{m+1}|x^{m})=T(x_{1}|\overline{x_{2}^{m+1}}),\quad\forall x^{m+1}\in[\ell]^{m+1}$ and $T(y_{m+1}|y^{m})\in[\frac{c_{1}}{\ell},\frac{c_{2}}{\ell}]$ with $0<c_{1}<c_{2}<1<c_{1}$ for all $y^{m+1}\in[\ell]^{m+1}$ . Then there is an estimator $\widehat{T}$ based on stationary trajectory $Y^{n}$ simulated from $T$ such that

\displaystyle\mathbb{E}[\|\widehat{T}-T\|_{\mathsf{F}}^{2}]\leq\frac{4c_{1}^{2m+3}(m+1)\ell^{2m}}{c_{2}(n-m)},

where $\|\widehat{T}-T\|_{\mathsf{F}}=\sqrt{\sum_{y^{m+1}}(\widehat{T}(y_{m+1}|y^{m})-T(y_{m+1}|y^{m}))^{2}}$ denotes the Frobenius norm.

For our purpose we will use the above lemma on $T$ with $\ell=k-1,c_{1}=\frac{1}{2},c_{2}=\frac{3}{2}$ . Therefore it follows that there exist estimators $\widehat{T}(Y^{n})$ and $\widehat{u}(Y^{n})$ such that

\displaystyle\mathbb{E}[\|\widehat{u}(Y^{n})-u\|_{2}^{2}]\leq\mathbb{E}[\|\widehat{T}(Y^{n})-T\|_{\mathsf{F}}^{2}]\leq\frac{4c_{2}(m+1)(k-1)^{2m}}{c_{1}^{2m+3}(n-m)}.

(139)

Here and below, we identify $u=\left\{u_{ix^{m-1}j}\right\}_{i\leq j,x^{m-1}\in R_{m-1}}$ and $\widehat{u}=\left\{\widehat{u}_{ix^{m-1}j}\right\}_{i\leq j,x^{m-1}\in R_{m-1}}$ as ${|R_{m-1}|k_{0}(k_{0}+1)\over 2}={|R_{m-1}|(k^{2}-1)\over 8}$ -dimensional vectors.

\displaystyle h(u)=\sum_{i\leq j\in{[k_{0}]},x^{m-1}\in R_{m-1}}h(u_{ix^{m-1}j})=-\frac{|R_{m-1}|(k^{2}-1)}{8}\log(k-1).

(140)

Then

	$\displaystyle I(T;Y^{n})$	$\displaystyle\overset{\rm(a)}{=}I(u;Y^{n})$
		$\displaystyle\overset{\rm(b)}{\geq}I(u;\widehat{u}(Y^{n}))=h(u)-h(u\|\widehat{u}(Y^{n}))$
		$\displaystyle\overset{\rm(c)}{\geq}h(u)-h(u-\widehat{u}(Y^{n}))$
		$\displaystyle\overset{\rm(d)}{\geq}\frac{\|R_{m-1}\|(k^{2}-1)}{16}\log\left(\frac{c_{1}^{2m+3}\|R_{m-1}\|(k^{2}-1)(n-m)}{64\pi ec_{2}(m+1)(k-1)^{2m+2}}\right)\geq\frac{(k-1)^{m+1}}{32}\log\left(\frac{n-m}{c_{m}(k-1)^{m+1}}\right).$

for constant $c_{m}={128\pi ec_{2}(m+1)\over c_{1}^{2m+3}}$ , where (a) is because $u$ and $T$ are in one-to-one correspondence by (5.3.2); (b) follows from the data processing inequality; (c) is because $h(\cdot)$ is translation invariant and concave; (d) follows from the maximum entropy principle [CT06]: $h(u-\widehat{u}(Y^{n}))\leq\frac{|R_{m-1}|(k^{2}-1)}{16}\log\left(\frac{2\pi e}{|R_{m-1}|(k^{2}-1)/8}\cdot\mathbb{E}[\|\widehat{u}(Y^{n})-u\|_{2}^{2}]\right)$ , which in turn is bounded by (139). Plugging this lower bound into Lemma 29 completes the lower bound proof of Theorem 22.

5.3.3 Proof of Lemma 30 via pseudo spectral gap

In view of Lemma 27 we get that the stationary distribution of $T$ is uniform over ${\mathcal{Y}}^{m}$ , and there is a one-to-one correspondence between the joint distribution of $Y^{m+1}$ and the transition probabilities

\displaystyle\mathbb{P}\left[Y^{m+1}=y^{m+1}\right]=\frac{1}{\ell^{m}}T(y_{m+1}|y^{m}).

(141)

Consider the following estimator $\widehat{T}$ : for $y_{m+1}\in[\ell]^{m+1}$ , let

\displaystyle\widehat{T}(y_{m+1}|y^{m})=\ell^{m}\cdot\frac{\sum_{t=1}^{n-m}{\mathbf{1}_{\left\{{Y_{t}^{t+m}=y^{m+1}}\right\}}}}{n-m}.

Clearly $\mathbb{E}[\widehat{T}(y_{m+1}|y^{m})]=\ell^{m}\mathbb{P}\left[y_{m+1}|y^{m}\right]=T(y_{m+1}|y^{m})$ . Next we observe that the sequence of random variables $\left\{Y_{t}^{t+m}\right\}_{t=1}^{n-m}$ is a first-order Markov chain on $[\ell]^{m+1}$ . Let us denote its transition matrix by $T_{m+1}$ and note that its stationary distribution is given by $\pi(a^{m+1})=\ell^{-m}T(a_{m+1}|a^{m}),a^{m+1}\in[\ell]^{m+1}$ . For the transition matrix $T_{m+1}$ , which must be non-reversible, the pseudo spectral gap $\gamma_{\text{ps}}(T_{m+1})$ is defined as

\displaystyle\gamma_{\text{ps}}(T_{m+1})=\max_{r\geq 1}\frac{\gamma((T_{m+1}^{*})^{r}T_{m+1}^{r})}{r},

where $T_{m+1}^{*}$ is the adjoint of $T_{m+1}$ defined as $T_{m+1}^{*}(b^{m+1}|a^{m+1})=\pi(b^{m+1})T(a^{m+1}|b^{m+1})/\pi(a^{m+1})$ . With these notations, the concentration inequality of [Pau15, Theorem 3.2] gives the following variance bound:

\displaystyle\mathrm{Var}(\widehat{T}(y_{m+1}|y^{m}))\leq\ell^{2m}\cdot\frac{4\mathbb{P}\left[Y^{m+1}=y^{m+1}\right]}{\gamma_{\text{ps}}(T_{m+1})(n-m)}\leq\ell^{2m}\cdot\frac{4T(y_{m+1}|y^{m})\ell^{-m}}{\gamma_{\text{ps}}(T_{m+1})(n-m)}.

The following lemma bounds the pseudo spectral gap from below.

Lemma 31.

Let $T\in\mathbb{R}^{\ell^{m}\times\ell}$ be the transition matrix of an $m$ -th order Markov chain $(Y_{t})_{t\geq 1}$ over a discrete state space $\mathcal{Y}$ with $|\mathcal{Y}|=\ell$ , and assume that

•

all the entries of $T$ lie in the interval $[\frac{c_{1}}{\ell},\frac{c_{2}}{\ell}]$ for some absolute constants $0<c_{1}<c_{2}$ ;
•

$T$ has the uniform stationary distribution on $[\ell]^{m}$ .

Let $T_{m+1}\in\mathbb{R}^{\ell^{m+1}\times\ell^{m+1}}$ be the transition matrix of the first-order Markov chain $((Y_{t},Y_{t+1},\cdots,Y_{t+m}))_{t\geq 1}$ . Then we have

\displaystyle\gamma_{\text{\rm ps}}(T_{m+1})\geq{c_{1}^{2m+3}\over c_{2}(m+1)}.

Consequently, we have

\displaystyle\mathbb{E}[\|\widehat{T}-T\|_{\mathsf{F}}^{2}]=\sum_{y^{m+1}\in[\ell]^{m+1}}\mathrm{Var}(\widehat{T}(y_{m+1}|y^{m}))\leq\sum_{y^{m+1}\in[\ell]^{m+1}}\frac{4c_{2}(m+1)\ell^{m}}{c_{1}^{2m+3}}\cdot\frac{T(y_{m+1}|y^{m})}{n-m}=\frac{4c_{2}(m+1)\ell^{2m}}{c_{1}^{2m+3}(n-m)},

completing the proof.

Proof of Lemma 31.

As $T_{m+1}$ is a first-order Markov chain, the stochastic matrix $T_{m+1}^{m+1}$ defines the probabilities of transition from $(Y_{t},Y_{t+1},\cdots,Y_{t+m})$ to $(Y_{t+m+1},Y_{t+m+2},\cdots,Y_{t+2m+1})$ . By our assumption on $T$

\displaystyle\min_{a^{2m+2}\in{\mathcal{Y}}^{2m+2}}T^{m+1}_{m+1}(a^{2m+2}_{m+2}|a^{m+1})\geq\prod_{t=0}^{m}T(a_{2m+2-t}|a^{2m+1-t}_{m+2-t})\geq{c_{1}^{m+1}\over\ell^{m+1}}.

(142)

Given any $a^{m+1},b^{m+1}\in{\mathcal{Y}}^{m+1}$ , using the above inequality we have

		$\displaystyle(T_{m+1}^{*})^{m+1}(b^{m+1}\|a^{m+1})$
		$\displaystyle=\sum_{\bm{y}_{1}\in{\mathcal{Y}}^{m+1},\dots,\bm{y}_{m}\in{\mathcal{Y}}^{m+1}}T_{m+1}^{}(b^{m+1}\|\bm{y}_{m})\left\{\prod_{t=1}^{m-1}T_{m+1}^{}(\bm{y}_{m-t+1}\|\bm{y}_{m-t})\right\}T_{m+1}^{*}(\bm{y}_{1}\|a^{m+1})$
		$\displaystyle=\sum_{\bm{y}_{1}\in{\mathcal{Y}}^{m+1},\dots,\bm{y}_{m}\in{\mathcal{Y}}^{m+1}}{\pi(b^{m+1})T_{m+1}(\bm{y}_{m}\|b^{m+1})\over\pi(\bm{y}_{m})}\left\{\prod_{t=1}^{m-1}{\pi(\bm{y}_{m-t+1})T_{m+1}(\bm{y}_{m-t}\|\bm{y}_{m-t+1})\over\pi(\bm{y}_{m-t})}\right\}{\pi(\bm{y}_{1})T_{m+1}(a^{m+1}\|\bm{y}_{1})\over\pi(a^{m+1})}$
		$\displaystyle={\pi(b^{m+1})\over\pi(a^{m+1})}\sum_{\bm{y}_{1}\in{\mathcal{Y}}^{m+1},\dots,\bm{y}_{m}\in{\mathcal{Y}}^{m+1}}T_{m+1}(\bm{y}_{m}\|b^{m+1})\left\{\prod_{t=1}^{m-1}T_{m+1}(\bm{y}_{m-t}\|\bm{y}_{m-t+1})\right\}T_{m+1}(a^{m+1}\|\bm{y}_{1})$
		$\displaystyle={\pi(b^{m+1})\over\pi(a^{m+1})}T^{m+1}_{m+1}(a^{m+1}\|b^{m+1})$
		$\displaystyle={\pi(b^{m})T(b_{m+1}\|b^{m})\over\pi(b^{m})T(a_{m+1}\|a^{m})}T^{m+1}_{m+1}(a^{m+1}\|b^{m+1})\geq{c_{1}\over c_{2}}\cdot{c_{1}^{m+1}\over\ell^{m+1}}.$		(143)

Using (142),(5.3.3) we get

	$\displaystyle\min_{a^{m+1},b^{m+1}\in{\mathcal{Y}}^{m+1}}\left\{(T_{m+1}^{*})^{m+1}T^{m+1}_{m+1}\right\}(b^{m+1}\|a^{m+1})$
	$\displaystyle\geq\sum_{d^{m+1}\in{\mathcal{Y}}^{m+1}}\left(\min_{a^{m+1},d^{m+1}\in{\mathcal{Y}}^{m+1}}(T_{m+1}^{*})^{m+1}(d^{m+1}\|a^{m+1})\right)\left(\min_{b^{m+1},d^{m+1}\in{\mathcal{Y}}^{m+1}}T^{m+1}_{m+1}(b^{m+1}\|d^{m+1})\right)$
	$\displaystyle\geq\sum_{d^{m+1}\in{\mathcal{Y}}^{m+1}}{c_{1}^{2m+3}\over c_{2}\ell^{2m+2}}\geq{c_{1}^{2m+3}\over c_{2}\ell^{m+1}}.$		(144)

As $(T_{m+1}^{*})^{m+1}T^{m+1}_{m+1}$ is an $\ell^{m+1}\times\ell^{m+1}$ stochastic matrix, we can use Lemma 32 to get the lower bound on its spectral gap $\gamma((T_{m+1}^{*})^{m+1}T^{m+1}_{m+1})\geq{c_{1}^{2m+3}\over c_{2}}$ . Hence we get

\displaystyle\gamma_{\text{\rm ps}}(T_{m+1})\geq{\gamma((T_{m+1}^{*})^{m+1}T^{m+1}_{m+1})\over m+1}\geq{c_{1}^{2m+3}\over c_{2}(m+1)}

(145)

as required. A more generalized version of Lemma 32 can be found in from [Hof67].

Lemma 32.

Suppose that $A$ is a $d\times d$ stochastic matrix with $\min_{i,j}A_{ij}\geq\epsilon$ . Then for any eigenvalue $\lambda$ of $A$ other than 1 we have $|\lambda|\leq 1-d\epsilon$ .

Proof.

Suppose that $\lambda$ is an eigenvalue of $A$ other than 1 with non-zero left eigenvector $\bm{v}$ , i.e. $\lambda v_{j}=\sum_{i=1}^{d}v_{i}A_{ij},j=1,\dots,d$ . As $A$ is a stochastic matrix we know that $\sum_{j}A_{ij}=1$ for all $i$ and hence $\sum_{i=1}^{d}v_{i}=0$ . This implies

\displaystyle|\lambda v_{j}|=\left|\sum_{i=1}^{d}v_{i}A_{ij}\right|=\left|\sum_{i=1}^{d}v_{i}(A_{ij}-\epsilon)\right|\leq\sum_{i=1}^{d}|v_{i}(A_{ij}-\epsilon)|=\sum_{i=1}^{d}|v_{i}|(A_{ij}-\epsilon)

(146)

with the last equality following from $A_{ij}\geq\epsilon$ . Summing over $j=1,\dots d$ in the above equation and dividing by $\sum_{i=1}^{d}|v_{i}|$ we get $|\lambda|\leq 1-d\epsilon$ as required. ∎

∎

6 Discussions and open problems

We discuss the assumptions and implications of our results as well as related open problems.

Very large state space.

Theorem 1 determines the optimal prediction risk under the assumption of $k\lesssim\sqrt{n}$ . When $k\gtrsim\sqrt{n}$ , Theorem 1 shows that the KL risk is bounded away from zero. However, as the KL risk can be as large as $\log k$ , it is a meaningful question to determine the optimal rate in this case, which, thanks to the general reduction in (11), reduces to determining the redundancy for symmetric and general Markov chains. For iid data, the minimax pointwise redundancy is known to be $n\log\frac{k}{n}+O(\frac{n^{2}}{k})$ [SW12, Theorem 1] when $k\gg n$ . Since the average and pointwise redundancy usually behave similarly, for Markov chains it is reasonable to conjecture that the redundancy is $\Theta(n\log\frac{k^{2}}{n})$ in the large alphabet regime of $k\gtrsim\sqrt{n}$ , which, in view of (11), would imply the optimal prediction risk is $\Theta(\log\frac{k^{2}}{n})$ for $k\gg\sqrt{n}$ . In comparison, we note that the prediction risk is at most $\log k$ , achieved by the uniform distribution.

Other loss functions

As mentioned in Section 1.1, standard arguments based on concentration inequalities inevitably rely on mixing conditions such as the spectral gap. In contrast, the risk bound in Theorem 1, which is free of any mixing condition, is enabled by powerful techniques from universal compression which bound the redundancy by the pointwise maximum over all trajectories combined with information-theoretic or combinatorial argument. This program only relies on the Markovity of the process rather than stationarity or spectral gap assumptions. The limitation of this approach, however, is that the reduction between prediction and redundancy crucially depends on the form of the KL loss function⁴⁴4In fact, this connection breaks down if one swap $M$ and $\widehat{M}$ in the KL divergence in (1). in (1), which allows one to use the mutual information representation and the chain rule to relate individual risks to the cumulative risk. More general loss in terms of $f$ -divergence have been considered in [HOP18]. Obtaining spectral gap-independent risk bound for these loss functions, this time without the aid of universal compression, is an open question.

Stationarity

As mentioned above, the redundancy result in Lemma 7 (see also [Dav83, TJW18]) holds for nonstationary Markov chains as well. However, our redundancy-based risk upper bound in Lemma 6 crucially relies on stationarity. It is unclear whether the result of Theorem 1 carries over to nonstationary chains.

Appendix A Mutual information representation of prediction risk

The following lemma justifies the representation (22) for the prediction risk as maximal conditional mutual information. Unlike (17) for redundancy which holds essentially without any condition [Kem74], here we impose certain compactness assumptions which hold finite alphabets such as finite-state Markov chains studied in this paper.

Lemma 33.

Let ${\mathcal{X}}$ be finite and let $\Theta$ be a compact subset of $\mathbb{R}^{d}$ . Given $\{P_{X^{n+1}|\theta}:\theta\in\Theta\}$ , define the prediction risk

\mathsf{Risk}_{n}\triangleq\inf_{Q_{X_{n+1}|X^{n}}}\sup_{\theta\in\Theta}D(P_{X_{n+1}|X^{n},\theta}\|Q_{X_{n+1}|X^{n}}|P_{X^{n}|\theta}),

(147)

Then

\mathsf{Risk}_{n}=\sup_{P_{\theta}\in{\mathcal{M}}(\Theta)}I(\theta;X_{n+1}|X^{n}).

(148)

where ${\mathcal{M}}(\Theta)$ denotes the collection of all (Borel) probability measures on $\Theta$ .

Note that for stationary Markov chains, (22) follows from Lemma 33 since one can take $\theta$ to be the joint distribution of $(X_{1},\ldots,X_{n+1})$ itself which forms a compact subset of the probability simplex on ${\mathcal{X}}^{n+1}$ .

Proof.

It is clear that (147) is equivalent to

\mathsf{Risk}_{n}=\inf_{Q_{X_{n+1}|X^{n}}}\sup_{P_{\theta}\in{\mathcal{M}}(\Theta)}D(P_{X_{n+1}|X^{n},\theta}\|Q_{X_{n+1}|X^{n}}|P_{X^{n},\theta}).

By the variational representation (14) of conditional mutual information, we have

I(\theta;X_{n+1}|X^{n})=\inf_{Q_{X_{n+1}|X^{n}}}D(P_{X_{n+1}|X^{n},\theta}\|Q_{X_{n+1}|X^{n}}|P_{X^{n},\theta}).

(149)

Thus (148) amounts to justifying the interchange of infimum and supremum in (147). It suffices to prove the upper bound.

Let $|{\mathcal{X}}|=K$ . For $\epsilon\in(0,1)$ , define an auxiliary quantity:

\mathsf{Risk}_{n,\epsilon}\triangleq\inf_{Q_{X_{n+1}|X^{n}}\geq\frac{\epsilon}{K}}\sup_{P_{\theta}\in{\mathcal{M}}(\Theta)}D(P_{X_{n+1}|X^{n},\theta}\|Q_{X_{n+1}|X^{n}}|P_{X^{n},\theta}),

(150)

where the constraint in the infimum is pointwise, namely, $Q_{X_{n+1}=x_{n+1}|X^{n}=x^{n}}\geq\frac{\epsilon}{K}$ for all $x_{1},\ldots,x_{n+1}\in{\mathcal{X}}$ . By definition, we have $\mathsf{Risk}_{n}\leq\mathsf{Risk}_{n,\epsilon}$ . Furthermore, $\mathsf{Risk}_{n,\epsilon}$ can be equivalently written as

\mathsf{Risk}_{n,\epsilon}=\inf_{Q_{X_{n+1}|X^{n}}}\sup_{P_{\theta}\in{\mathcal{M}}(\Theta)}D(P_{X_{n+1}|X^{n},\theta}\|(1-\epsilon)Q_{X_{n+1}|X^{n}}+\epsilon U|P_{X^{n},\theta}),

(151)

where $U$ denotes the uniform distribution on ${\mathcal{X}}$ .

We first show that the infimum and supremum in (151) can be interchanged. This follows from the standard minimax theorem. Indeed, note that $D(P_{X_{n+1}|X^{n},\theta}\|(1-\epsilon)Q_{X_{n+1}|X^{n}}+\epsilon U|P_{X^{n},\theta})$ is convex in $Q_{X_{n+1}|X^{n}}$ , affine in $P_{\theta}$ , continuous in each argument, and takes values in $[0,\log\frac{K}{\epsilon}]$ . Since ${\mathcal{M}}(\Theta)$ is convex and weakly compact (by Prokhorov’s theorem) and the collection of conditional distributions $Q_{X_{n+1}|X^{n}}$ is convex, the minimax theorem (see, e.g., [Fan53, Theorem 2]) yields

\mathsf{Risk}_{n,\epsilon}=\sup_{\pi\in{\mathcal{M}}(\Theta)}\inf_{Q_{X_{n+1}|X^{n}}}D(P_{X_{n+1}|X^{n},\theta}\|(1-\epsilon)Q_{X_{n+1}|X^{n}}+\epsilon U|P_{X^{n},\theta}).

(152)

Finally, by the convexity of the KL divergence, for any $P$ on ${\mathcal{X}}$ , we have

D(P\|(1-\epsilon)Q+\epsilon U)\leq(1-\epsilon)D(P\|Q)+\epsilon D(P\|U)\leq(1-\epsilon)D(P\|Q)+\epsilon\log K,

which, in view of (149) and (152), implies

\mathsf{Risk}_{n}\leq\mathsf{Risk}_{n,\epsilon}\leq\sup_{P_{\theta}\in{\mathcal{M}}(\Theta)}I(\theta;X_{n+1}|X^{n})+\epsilon\log K.

By the arbitrariness of $\epsilon$ , (148) follows. ∎

Appendix B Proof of Lemma 16

Recall that for any irreducible and reversible finite states transition matrix $M$ with stationary distribution $\pi$ the followings are satisfied:

1.

$\pi_{i}>0$ for all $i$ .
2.

$M(j|i)\pi_{i}=M(i|j)\pi_{j}$ for all $i,j$ .

The following is a direct consequence of the Markov property.

Lemma 34.

For any $1\leq t_{1}<\dots<t_{m}<\dots<t_{k}$ and any $Z_{2}=f\left(X_{t_{k}},\dots,X_{t_{m}}\right),Z_{1}=g\left(X_{t_{m-1}},\dots,X_{t_{1}}\right)$ we have

\displaystyle\mathbb{E}\left[Z_{2}{\mathbf{1}_{\left\{{X_{t_{m}}=j}\right\}}}Z_{1}|X_{1}=i\right]=\mathbb{E}\left[Z_{2}|X_{t_{m}}=j\right]\mathbb{E}\left[{\mathbf{1}_{\left\{{X_{t_{m}}=j}\right\}}}Z_{1}|X_{1}=i\right]

(153)

For $t\geq 0$ , denote the $t$ -step transition probability by $\mathbb{P}\left[X_{t+1}=j|X_{1}=i\right]=M^{t}(j|i)$ , which is the $ij$ th entry of $M^{t}$ . The following result is standard (see, e.g., [LP17, Chap. 12]). We include the proof mainly for the purpose of introducing the spectral decomposition.

Lemma 35.

Define $\lambda_{*}\triangleq 1-\gamma_{*}=\max\left\{\left|\lambda_{i}\right|:i\neq 1\right\}$ . For any $t\geq 0$ , $\left|M^{t}(j|i)-\pi_{j}\right|\leq\lambda_{*}^{t}{\sqrt{\pi_{j}\over\pi_{i}}}.$

Proof.

Throughout the proof all vectors are column vectors except for $\pi$ . Let $D_{\pi}$ denote the diagonal matrix with entries $D_{\pi}(i,i)=\pi_{i}$ . By reversibility, $D_{\pi}^{\frac{1}{2}}MD_{\pi}^{-\frac{1}{2}}$ , which shares the same spectrum with $M$ , is a symmetric matrix and admits the spectral decomposition $D_{\pi}^{\frac{1}{2}}MD_{\pi}^{-\frac{1}{2}}=\sum_{a=1}^{k}\lambda_{a}u_{a}u_{a}^{\top}$ for some orthonormal basis $\{u_{1},\ldots,u_{k}\}$ ; in particular, $\lambda_{1}=1$ and $u_{1i}=\sqrt{\pi_{i}}$ . Then for each $t\geq 1$ ,

\displaystyle M^{t}=\sum_{a=1}^{k}\lambda_{a}^{t}D_{\pi}^{-\frac{1}{2}}u_{a}u_{a}^{\top}D_{\pi}^{\frac{1}{2}}=\mathbf{1}\pi+\sum_{a=2}^{k}\lambda_{a}^{t}D_{\pi}^{-\frac{1}{2}}u_{a}u_{a}^{\top}D_{\pi}^{\frac{1}{2}}.\quad

(154)

where $\mathbf{1}$ is the all-ones vector. As $u_{a}$ ’s satisfy $\sum_{a=1}^{k}u_{a}u_{a}^{\top}=I$ we get $\sum_{a=2}^{k}u_{ab}^{2}=1-u_{a1}^{2}\leq 1$ for any $b=1,\dots,k$ . Using this along with Cauchy-Schwarz inequality we get

\displaystyle\left|M^{t}(j|i)-\pi_{j}\right|\leq\sqrt{\pi_{j}\over\pi_{i}}\sum_{a=2}^{k}\left|\lambda_{a}\right|^{t}|u_{ai}u_{aj}|\leq\lambda_{*}^{t}\sqrt{\pi_{j}\over\pi_{i}}\left(\sum_{a=2}^{k}u_{ai}^{2}\right)^{\frac{1}{2}}\left(\sum_{a=2}^{k}u_{aj}^{2}\right)^{\frac{1}{2}}\leq\lambda_{*}^{t}\sqrt{\pi_{j}\over\pi_{i}}

as required. ∎

Lemma 36.

Fix states $i,j$ . For any integers $a\geq b\geq 1$ , define

h_{s}(a,b)=\left|\mathbb{E}\left[{\mathbf{1}_{\left\{{X_{a+1}=i}\right\}}}\left({\mathbf{1}_{\left\{{X_{a}=j}\right\}}}-M(j|i)\right)^{s}|X_{b}=i\right]\right|,\quad s=1,2,3,4.

Then

(i)

${h_{1}(a,b)}\leq 2{\sqrt{{M(j|i)}}}{\lambda^{a-b}_{*}}$
(ii)

$\left|h_{2}(a,b)-\pi_{i}M(j|i)(1-M(j|i))\right|\leq 4{\sqrt{{M(j|i)}}}{\lambda^{a-b}_{*}}.$
(iii)

${h_{3}(a,b)},{h_{4}(a,b)}\leq\pi_{i}M(j|i)(1-M(j|i))+4{\sqrt{{M(j|i)}}}{\lambda^{a-b}_{*}}.$

Proof.

We apply Lemma 35 and time reversibility:

(i)

	$\displaystyle h_{1}(a,b)$	$\displaystyle=\left\|\mathbb{P}\left[X_{a+1}=i,X_{a}=j\|X_{b}=i\right]-M(j\|i)\mathbb{P}\left[X_{a+1}=i\|X_{b}=i\right]\right\|$
		$\displaystyle=\left\|M(i\|j){M^{a-b}}(j\|i)-M(j\|i)M^{a-b+1}(i\|i)\right\|$
		$\displaystyle\leq M(i\|j)\left\|M^{a-b}(j\|i)-\pi_{j}\right\|+M(j\|i)\left\|M^{a-b+1}(i\|i)-\pi_{i}\right\|$
		$\displaystyle\leq\lambda_{}^{a-b}M(i\|j)\sqrt{\pi_{j}\over\pi_{i}}+M(j\|i)\lambda^{a-b+1}_{}$
		$\displaystyle=\lambda^{a-b}_{}\sqrt{M(j\|i)M(i\|j)}+{M(j\|i)}\lambda_{}^{a-b+1}\leq 2\sqrt{M(j\|i)}{\lambda^{a-b}_{*}}.$

(ii)

		$\displaystyle\|h_{2}(a,b)-\pi_{i}M(j\|i)(1-M(j\|i))\|$
	$\displaystyle=$	$\displaystyle\Big{\|}\mathbb{E}\left[{\mathbf{1}_{\left\{{X_{a+1}=i,X_{a}=j}\right\}}}\|X_{b}=i\right]-\pi_{i}M(j\|i)+\left(M(j\|i)\right)^{2}(\mathbb{E}\left[{\mathbf{1}_{\left\{{X_{a+1}=i}\right\}}}\|X_{b}=i\right]-\pi_{i})$
		$\displaystyle-2M(j\|i)(\mathbb{E}\left[{\mathbf{1}_{\left\{{X_{a+1}=i,X_{a}=j}\right\}}}\|X_{b}=i\right]-\pi_{i}M(j\|i))\Big{\|}$
	$\displaystyle\leq$	$\displaystyle\left\|\mathbb{P}\left[X_{a+1}=i,X_{a}=j\|X_{b}=i\right]-\pi_{j}M(i\|j)\right\|+({M(j\|i)})^{2}\left\|\mathbb{P}\left[X_{a+1}=i\|X_{b}=i\right]-\pi_{i}\right\|$
		$\displaystyle\quad+2{M(j\|i)}\left\|\mathbb{P}\left[X_{a+1}=i,X_{a}=j\|X_{b}=i\right]-\pi_{j}M(i\|j)\right\|$
	$\displaystyle=$	$\displaystyle M(i\|j)\left\|M^{a-b}(j\|i)-\pi_{j}\right\|+({M(j\|i)})^{2}\left\|M^{a-b+1}(i\|i)-\pi_{i}\right\|+2M(j\|i)M(i\|j)\left\|M^{a-b}(j\|i)-\pi_{j}\right\|$
	$\displaystyle\leq$	$\displaystyle M(i\|j)\sqrt{\pi_{j}\over\pi_{i}}\lambda_{}^{a-b}+({M(j\|i)})^{2}\lambda^{a-b+1}_{}+2{M(j\|i)}M(i\|j)\sqrt{\pi_{j}\over\pi_{i}}\lambda^{a-b}_{*}$
	$\displaystyle\leq$	$\displaystyle\lambda_{*}^{a-b}\left(\sqrt{M(i\|j)}\sqrt{M(i\|j)\pi_{j}\over\pi_{i}}+({M(j\|i)})^{2}+2{M(j\|i)}\sqrt{M(i\|j)}\sqrt{M(i\|j)\pi_{j}\over\pi_{i}}\right)$
	$\displaystyle\leq$	$\displaystyle 4\sqrt{M(j\|i)}\lambda^{a-b}_{*}.$

(iii)

$h_{3}(a,b),h_{4}(a,b)\leq h_{2}(a,b)$ . ∎

Proof of Lemma 16(i).

For ease of notation we use $c_{0}$ to denote an absolute constant whose value may vary at each occurrence. Fix $i,j\in[k]$ . Note that the empirical count defined in (4) can be written as $N_{i}=\sum_{a=1}^{n-1}{\mathbf{1}_{\left\{{X_{n-a}=i}\right\}}}$ and $N_{ij}=\sum_{a=1}^{n-1}{\mathbf{1}_{\left\{{X_{n-a}=i,X_{n-a+1}=j}\right\}}}$ . Then

		$\displaystyle\mathbb{E}\left[\left(M(j\|i)N_{i}-N_{ij}\right)^{2}\|X_{n}=i\right]$
	$\displaystyle=$	$\displaystyle\mathbb{E}\left[\left.\left(\sum_{a=1}^{n-1}{\mathbf{1}_{\left\{{X_{n-a}=i}\right\}}}\left({\mathbf{1}_{\left\{{X_{n-a+1}=j}\right\}}}-{M(j\|i)}\right)\right)^{2}\right\|X_{n}=i\right]$
	$\displaystyle\overset{\rm(a)}{=}$	$\displaystyle\mathbb{E}\left[\left.\left(\sum_{a=1}^{n-1}{\mathbf{1}_{\left\{{X_{a+1}=i}\right\}}}\left({\mathbf{1}_{\left\{{X_{a}=j}\right\}}}-{M(j\|i)}\right)\right)^{2}\right\|X_{1}=i\right]$
	$\displaystyle\overset{\rm(b)}{=}$	$\displaystyle\left\|\sum_{a,b}\mathbb{E}\left[\eta_{a}\eta_{b}\|X_{1}=i\right]\right\|\leq 2\sum_{a\geq b}\left\|\mathbb{E}\left[\eta_{a}\eta_{b}\|X_{1}=i\right]\right\|,$

where (a) is due to time reversibility; in (b) we defined $\eta_{a}\triangleq{\mathbf{1}_{\left\{{X_{a+1}=i}\right\}}}\left({\mathbf{1}_{\left\{{X_{a}=j}\right\}}}-{M(j|i)}\right)$ . We divide the summands into different cases and apply Lemma 36.

Case I: Two distinct indices.

For any $a>b$ , using Lemma 34 we get

\displaystyle\left|\mathbb{E}\left[\eta_{a}\eta_{b}|X_{1}=i\right]\right|=\left|\mathbb{E}\left[\eta_{a}|X_{b+1}=i\right]\right|\left|\mathbb{E}\left[\eta_{b}|X_{1}=1\right]\right|=h_{1}(a,b+1)h_{1}(b,1)

(155)

which implies

\displaystyle\mathop{\sum\sum}_{n-1\geq a>b\geq 1}\left|\mathbb{E}\left[\eta_{a}\eta_{b}|X_{1}=i\right]\right|=\mathop{\sum\sum}_{n-1\geq a>b\geq 1}h_{1}(a,b+1)h_{1}(b,1)\lesssim{M(j|i)}\mathop{\sum\sum}_{n-1\geq a>b\geq 1}\lambda^{a-2}_{*}\lesssim{M(j|i)\over{\gamma^{2}_{*}}}.

Here the last inequality (and similar sums in later deductions) can be explained as follows. Note that for $\gamma_{*}\geq\frac{1}{2}$ (i.e. $\lambda_{*}\leq\frac{1}{2}$ ), the sum is clearly bounded by an absolute constant; for $\gamma_{*}<\frac{1}{2}$ (i.e. $\lambda_{*}>\frac{1}{2}$ ), we compare the sum with the mean (or higher moments in other calculations) of a geometric random variable.

Case II: Single index.

\displaystyle\sum_{a=1}^{n-1}\mathbb{E}\left[\eta_{a}^{2}|X_{1}=i\right]=\sum_{a=1}^{n-1}h_{2}(a,1)\lesssim n\pi_{i}{M(j|i)}(1-{M(j|i)})+{\sqrt{{M(j|i)}}\over\gamma_{*}}.

(156)

Combining the above we get

\displaystyle\mathbb{E}\left[\left(N_{ij}-{M(j|i)}N_{i}\right)^{2}|X_{n}=i\right]\lesssim n\pi_{i}{M(j|i)}(1-{M(j|i)})+{\sqrt{{M(j|i)}}\over\gamma_{*}}+{M(j|i)\over\gamma_{*}^{2}}

as required. ∎

Proof of Lemma 16(ii).

We first note that due to reversibility we can write (similar as in proof of Lemma 16(i)) with $\eta_{a}={\mathbf{1}_{\left\{{X_{a+1}=i}\right\}}}\left({\mathbf{1}_{\left\{{X_{a}=j}\right\}}}-{M(j|i)}\right)$

	$\displaystyle\mathbb{E}\left[\left(M(j\|i)N_{i}-N_{ij}\right)^{4}\|X_{n}=i\right]$
	$\displaystyle=\mathbb{E}\left[\left.\left(\sum_{a=1}^{n-1}{\mathbf{1}_{\left\{{X_{a+1}=i}\right\}}}\left({\mathbf{1}_{\left\{{X_{a}=j}\right\}}}-{M(j\|i)}\right)\right)^{4}\right\|X_{1}=i\right]$
	$\displaystyle=\left\|\sum_{a,b,d,e}\mathbb{E}\left[\eta_{a}\eta_{b}\eta_{d}\eta_{e}\|X_{1}=i\right]\right\|\leq\sum_{a,b,d,e}\left\|\mathbb{E}\left[\eta_{a}\eta_{b}\eta_{d}\eta_{e}\|X_{1}=i\right]\right\|\lesssim\sum_{a\geq b\geq d\geq e}\left\|\mathbb{E}\left[\eta_{a}\eta_{b}\eta_{d}\eta_{e}\|X_{1}=i\right]\right\|.$		(157)

We bound the sum over different combinations of $a\geq b\geq d\geq e$ to come up with a bound on the required fourth moment. We first divide the $\eta$ ’s into groups depending on how many distinct indices of $\eta$ there are. We use the following identities which follow from Lemma 34: for indices $a>b>d>e$

•

$\left|\mathbb{E}\left[\eta_{a}\eta_{b}\eta_{d}\eta_{e}|X_{1}=i\right]\right|=h_{1}(a,b+1)h_{1}(b,d+1)h_{1}(d,e+1)h_{1}(e,1)$
•

For $s_{1},s_{2},s_{3}\in\left\{1,2\right\}$ , $\left|\mathbb{E}\left[\eta_{a}^{s_{1}}\eta_{b}^{s_{2}}\eta_{d}^{s_{3}}|X_{1}=i\right]\right|=h_{s_{1}}(a,b+1)h_{s_{2}}(b,d+1)h_{s_{3}}(d,1)$
•

For $s_{1},s_{2}\in\left\{1,2,3\right\}$ , $\left|\mathbb{E}\left[\eta_{a}^{s_{1}}\eta_{b}^{s_{2}}|X_{1}=i\right]\right|=h_{s_{1}}(a,b+1)h_{s_{2}}(b,1)$
•

$\mathbb{E}\left[\eta_{a}^{4}|X_{1}=1\right]=h_{4}(a,1)$

and then use Lemma 36 to bound the $h$ functions.

Case I: Four distinct indices.

Using Lemma 36 we have

	$\displaystyle\mathop{\sum\sum\sum\sum}_{n-1\geq a>b>d>e\geq 1}\left\|\mathbb{E}\left[\eta_{a}\eta_{b}\eta_{d}\eta_{e}\|X_{1}=i\right]\right\|$	$\displaystyle=\mathop{\sum\sum\sum\sum}_{n-1\geq a>b>d>e\geq 1}h_{1}(a,b+1)h_{1}(b,d+1)h_{1}(d,e+1)h_{1}(e,1)$
		$\displaystyle{\leq}M(j\|i)^{2}\mathop{\sum\sum\sum\sum}_{n-1\geq a>b>d>e\geq 1}\lambda^{a-4}_{}\lesssim{M(j\|i)^{2}\over\gamma_{}^{4}}.$

Case II: Three distinct indices.

There are three cases, namely $\eta_{a}^{2}\eta_{b}\eta_{d},\eta_{a}\eta_{b}^{2}\eta_{d}$ and $\eta_{a}\eta_{b}\eta_{d}^{2}$ .

Bounding $\mathop{\sum\sum\sum}_{n-1\geq a>b>d\geq 1}\left|\mathbb{E}\left[\eta_{a}^{2}\eta_{b}\eta_{d}|X_{1}=i\right]\right|$ :

	$\displaystyle\mathop{\sum\sum\sum}_{n-1\geq a>b>d\geq 1}\left\|\mathbb{E}\left[\eta_{a}^{2}\eta_{b}\eta_{d}\|X_{1}=i\right]\right\|$	$\displaystyle=\mathop{\sum\sum\sum}_{n-1\geq a>b>d\geq 1}h_{2}(a,b+1)h_{1}(b,d+1)h_{1}(d,1)$
		$\displaystyle\lesssim\mathop{\sum\sum\sum}_{n-1\geq a>b>d\geq 1}\left(\pi_{i}{M(j\|i)}(1-{M(j\|i)})+{\sqrt{{M(j\|i)}}}{\lambda^{a-b-1}_{}}\right){M(j\|i)}{\lambda^{b-2}_{}}$
		$\displaystyle\lesssim{{{{M(j\|i)}\over\gamma_{}^{2}}n\pi_{i}{M(j\|i)}(1-{M(j\|i)})}+{{M(j\|i)}^{\frac{3}{2}}\over\gamma_{}^{3}}}$
		$\displaystyle\lesssim{\left(n\pi_{i}{M(j\|i)}(1-{M(j\|i)})\right)^{2}+{{M(j\|i)}^{\frac{3}{2}}\over\gamma_{}^{3}}+{{M(j\|i)}^{2}\over\gamma_{}^{4}}}$

where the last inequality followed by using $xy\leq x^{2}+y^{2}$ .

Bounding $\mathop{\sum\sum\sum}_{n-2\geq a>b>d\geq 1}\left|\mathbb{E}\left[\eta_{a}\eta_{b}^{2}\eta_{d}|X_{1}=i\right]\right|$ :

	$\displaystyle\mathop{\sum\sum\sum}_{n-2\geq a>b>d\geq 1}\left\|\mathbb{E}\left[\eta_{a}\eta_{b}^{2}\eta_{d}\|X_{1}=i\right]\right\|$
	$\displaystyle=\mathop{\sum\sum\sum}_{n-2\geq a>b>d\geq 1}h_{1}(a,b+1)h_{2}(b,d+1)h_{1}(d,1)$
	$\displaystyle\lesssim\mathop{\sum\sum\sum}_{n-2\geq a>b>d\geq 1}\left(\pi_{i}{M(j\|i)}(1-{M(j\|i)})+{\sqrt{{M(j\|i)}}}{\lambda^{b-d-1}_{}}\right)M(j\|i){\lambda^{a-b+d-2}_{}}$
	$\displaystyle\lesssim{M(j\|i)\over\gamma_{}^{2}}n\pi_{i}M(j\|i)(1-M(j\|i))+{M(j\|i)^{\frac{3}{2}}\over\gamma_{}^{3}}$
	$\displaystyle\lesssim{n\pi_{i}{M(j\|i)}(1-{M(j\|i)})}^{2}+{M(j\|i)^{\frac{3}{2}}\over\gamma_{}^{3}}+{M(j\|i)^{2}\over\gamma_{}^{4}}.$

Bounding $\mathop{\sum\sum\sum}_{n-2\geq a>b>d\geq 1}\left|\mathbb{E}\left[\eta_{a}\eta_{b}\eta_{d}^{2}|X_{1}=i\right]\right|$ :

	$\displaystyle\mathop{\sum\sum\sum}_{n-2\geq a>b>d\geq 1}\left\|\mathbb{E}\left[\eta_{a}\eta_{b}\eta_{d}^{2}\|X_{1}=i\right]\right\|$
	$\displaystyle=\mathop{\sum\sum\sum}_{n-2\geq a>b>d\geq 1}h_{1}(a,b+1)h_{1}(b,d+1)h_{2}(d,1)$
	$\displaystyle\lesssim\mathop{\sum\sum\sum}_{n-2\geq a>b>d\geq 1}\left(\pi_{i}{M(j\|i)}(1-{M(j\|i)})+\sqrt{M(j\|i)}{\lambda^{d-1}_{}}\right)M(j\|i)\lambda^{a-d-2}_{}$
	$\displaystyle\lesssim{{{{M(j\|i)}\over\gamma_{}^{2}}n\pi_{i}{M(j\|i)}(1-{M(j\|i)})}+{{M(j\|i)}^{\frac{3}{2}}\over\gamma_{}^{3}}}$
	$\displaystyle\lesssim\left(n\pi_{i}{M(j\|i)}(1-{M(j\|i)})\right)^{2}+{{M(j\|i)}^{\frac{3}{2}}\over\gamma_{}^{3}}+{{M(j\|i)}^{2}\over\gamma_{}^{4}}.$

Case III: Two distinct indices.

There are three different cases, namely $\eta_{a}^{2}\eta_{b}^{2},\eta_{a}^{3}\eta_{b}$ and $\eta_{a}\eta_{b}^{3}$ .

Bounding $\mathop{\sum\sum}_{n-2\geq a>b\geq 1}\left|\mathbb{E}\left[\eta_{a}^{2}\eta_{b}^{2}|X_{1}=i\right]\right|$ :

	$\displaystyle\mathop{\sum\sum}_{n-2\geq a>b\geq 1}\mathbb{E}\left[\eta_{a}^{2}\eta_{b}^{2}\|X_{1}=i\right]$
	$\displaystyle=\mathop{\sum\sum}_{n-2\geq a>b\geq 1}h_{2}(a,b+1)h_{2}(b,1)$
	$\displaystyle\lesssim\mathop{\sum\sum}_{n-2\geq a>b\geq 1}{\left(\pi_{i}{M(j\|i)}(1-{M(j\|i)})+{\sqrt{{M(j\|i)}}}\lambda^{a-b-1}_{}\right)\left(\pi_{i}{M(j\|i)}(1-{M(j\|i)})+{\sqrt{{M(j\|i)}}}\lambda^{b-1}_{}\right)}$
	$\displaystyle\lesssim\mathop{\sum\sum}_{n-2\geq a>b\geq 1}\Big{\{}\pi_{i}{M(j\|i)}(1-{M(j\|i)}){\sqrt{{M(j\|i)}}}({\lambda^{a-b-1}_{}}+{\lambda^{b-1}_{}})$
	$\displaystyle\quad+\left(\pi_{i}{M(j\|i)}(1-{M(j\|i)})\right)^{2}+{{M(j\|i)}}\lambda_{*}^{a-2}\Big{\}}$
	$\displaystyle\lesssim{\left(n\pi_{i}{M(j\|i)}(1-{M(j\|i)})\right)^{2}+{{{\sqrt{{M(j\|i)}}\over\gamma_{}}}n\pi_{i}{M(j\|i)}(1-{M(j\|i)})}+{{{M(j\|i)}\over\gamma_{}^{2}}}}$
	$\displaystyle\lesssim\left(n\pi_{i}{M(j\|i)}(1-{M(j\|i)})\right)^{2}+{{{M(j\|i)}\over\gamma_{*}^{2}}}.$

Bounding $\mathop{\sum\sum}_{n-2\geq a>b\geq 1}\left|\mathbb{E}\left[\eta_{a}^{3}\eta_{b}|X_{1}=i\right]\right|$ :

	$\displaystyle\mathop{\sum\sum}_{n-2\geq a>b\geq 1}\left\|\mathbb{E}\left[\eta_{a}^{3}\eta_{b}\|X_{1}=i\right]\right\|$
	$\displaystyle=\mathop{\sum\sum}_{n-2\geq a>b\geq 1}h_{3}(a,b+1)h_{1}(b,1)$
	$\displaystyle\lesssim\mathop{\sum\sum}_{n-2\geq a>b\geq 1}\left(\pi_{i}{M(j\|i)}(1-{M(j\|i)})+{\sqrt{{M(j\|i)}}}{\lambda^{a-b-1}_{}}\right){\sqrt{{M(j\|i)}}}{\lambda^{b-1}_{}}$
	$\displaystyle\lesssim{\sqrt{{M(j\|i)}}\over\gamma_{}}n\pi_{i}{M(j\|i)}(1-{M(j\|i)})+{{M(j\|i)}\over\gamma_{}^{2}}\lesssim\left(n\pi_{i}{M(j\|i)}(1-{M(j\|i)})\right)^{2}+{{{M(j\|i)}\over\gamma_{*}^{2}}}.$

Bounding $\mathop{\sum\sum}_{n-2\geq a>b\geq 1}\left|\mathbb{E}\left[\eta_{a}\eta_{b}^{3}|X_{1}=i\right]\right|$ :

	$\displaystyle\mathop{\sum\sum}_{n-2\geq a>b\geq 1}\left\|\mathbb{E}\left[\eta_{a}\eta_{b}^{3}\|X_{1}=i\right]\right\|$
	$\displaystyle=\mathop{\sum\sum}_{n-2\geq a>b\geq 1}h_{1}(a,b+1)h_{3}(b,1)$
	$\displaystyle\lesssim\mathop{\sum\sum}_{n-2\geq a>b\geq 1}\left(\pi_{i}{M(j\|i)}(1-{M(j\|i)})+{\sqrt{{M(j\|i)}}}\lambda^{b-1}_{}\right){\sqrt{{M(j\|i)}}}{\lambda^{a-b-1}_{}}$
	$\displaystyle\lesssim{{\sqrt{{M(j\|i)}}\over\gamma_{}}n\pi_{i}{M(j\|i)}(1-{M(j\|i)})+{{{M(j\|i)}\over\gamma_{}^{2}}}}\lesssim\left(n\pi_{i}{M(j\|i)}(1-{M(j\|i)})\right)^{2}+{{{M(j\|i)}\over\gamma_{*}^{2}}}.$

Case IV: Single index.

Bound on $\sum_{a=1}^{n-1}\mathbb{E}\left[\eta_{a}^{4}|X_{1}=i\right]$ :

\displaystyle\sum_{a=1}^{n-1}\mathbb{E}\left[\eta_{a}^{4}|X_{1}=i\right]=\sum_{a=1}^{n-1}h_{4}(a,1){\leq}{n\pi_{i}{M(j|i)}(1-{M(j|i)})}+{{\sqrt{{M(j|i)}}\over\gamma_{*}}}.

Combining all cases we get

	$\displaystyle\mathbb{E}\left[\left({M(j\|i)}N_{i}-N_{ij}\right)^{4}\|X_{n}=i\right]$	$\displaystyle\lesssim\left(n\pi_{i}{M(j\|i)}(1-{M(j\|i)})\right)^{2}+{\sqrt{{M(j\|i)}}\over\gamma_{}}+{M(j\|i)\over\gamma_{}^{2}}+{M(j\|i)^{\frac{3}{2}}\over\gamma_{}^{3}}+{M(j\|i)^{2}\over\gamma_{}^{4}}$
		$\displaystyle\lesssim\left(n\pi_{i}{M(j\|i)}(1-{M(j\|i)})\right)^{2}+{\sqrt{{M(j\|i)}}\over\gamma_{}}+{M(j\|i)^{2}\over\gamma_{}^{4}}$

as required. ∎

Proof of Lemma 16(iii).

Throughout our proof we repeatedly use the spectral decomposition (154) applied to the diagonal elements:

M^{t}(i|i)=\pi_{i}+\sum_{v\geq 2}\lambda_{v}^{t}u_{vi}^{2},\quad\sum_{v\geq 2}u_{vi}^{2}\leq 1.

Write $N_{i}-(n-1)\pi_{i}=\sum_{a=1}^{n-1}\xi_{a}$ where $\xi_{a}={\mathbf{1}_{\left\{{X_{a}=i}\right\}}}-\pi_{i}$ . For $a\geq b\geq d\geq e$ ,

	$\displaystyle\mathbb{E}\left[\xi_{a}\xi_{b}\xi_{d}\xi_{e}\|X_{1}=i\right]$
	$\displaystyle=\mathbb{E}\left[\xi_{a}\xi_{b}\left({\mathbf{1}_{\left\{{X_{d}=i,X_{e}=i}\right\}}}-\pi_{i}{\mathbf{1}_{\left\{{X_{d}=i}\right\}}}-\pi_{i}{\mathbf{1}_{\left\{{X_{e}=i}\right\}}}+\pi_{i}^{2}\right)\|X_{1}=i\right]$
	$\displaystyle=\mathbb{E}\left[\xi_{a}\xi_{b}{\mathbf{1}_{\left\{{X_{d}=i,X_{e}=i}\right\}}}\|X_{1}=i\right]-\pi_{i}\mathbb{E}\left[\xi_{a}\xi_{b}{\mathbf{1}_{\left\{{X_{d}=i}\right\}}}\|X_{1}=i\right]$
	$\displaystyle\quad-\pi_{i}\mathbb{E}\left[\xi_{a}\xi_{b}{\mathbf{1}_{\left\{{X_{e}=i}\right\}}}\|X_{1}=i\right]+\pi_{i}^{2}\mathbb{E}\left[\xi_{a}\xi_{b}\|X_{1}=i\right]$
	$\displaystyle=\mathbb{E}\left[\xi_{a}\xi_{b}\|X_{d}=i\right]\mathbb{P}\left[X_{d}=i\|X_{e}=i\right]\mathbb{P}[X_{e}=i\|X_{1}=i]-\pi_{i}\mathbb{E}\left[\xi_{a}\xi_{b}\|X_{d}=i\right]\mathbb{P}[X_{d}=i\|X_{1}=i]$
	$\displaystyle\quad-\pi_{i}\mathbb{E}\left[\xi_{a}\xi_{b}\|X_{e}=i\right]\mathbb{P}[X_{e}=i\|X_{1}=i]+\pi_{i}^{2}\mathbb{E}\left[\xi_{a}\xi_{b}\|X_{1}=i\right]$
	$\displaystyle=\mathbb{E}\left[\xi_{a}\xi_{b}\|X_{d}=i\right]\left\{M^{d-e}(i\|i)M^{e-1}(i\|i)-\pi_{i}M^{d-1}(i\|i)\right\}$
	$\displaystyle\quad-\left\{\pi_{i}\mathbb{E}\left[\xi_{a}\xi_{b}\|X_{e}=i\right]M^{e-1}(i\|i)-\pi_{i}^{2}\mathbb{E}\left[\xi_{a}\xi_{b}\|X_{1}=i\right]\right\}$		(158)

Using the Markov property for any $d\leq b\leq a$ , we get

	$\displaystyle\left\|\mathbb{E}[\xi_{a}\xi_{b}\|X_{d}=i]-\pi_{i}\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{a-b}\right\|$
	$\displaystyle=\left\|\mathbb{E}\left[{\mathbf{1}_{\left\{{X_{a}=i,X_{b}=i}\right\}}}-\pi_{i}{\mathbf{1}_{\left\{{X_{a}=i}\right\}}}-\pi_{i}{\mathbf{1}_{\left\{{X_{b}=i}\right\}}}+\pi_{i}^{2}\|X_{d}=i\right]-\pi_{i}\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{a-b}\right\|$
	$\displaystyle=\left\|M^{a-b}(i\|i)M^{b-d}(i\|i)-\pi_{i}M^{a-d}(i\|i)-\pi_{i}M^{b-d}(i\|i)+\pi_{i}^{2}-\pi_{i}\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{a-b}\right\|$
	$\displaystyle=\left\|\left(\pi_{i}+\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{a-b}\right)\left(\pi_{i}+\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{b-d}\right)\quad-\pi_{i}\left(\pi_{i}+\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{a-d}\right)\right.$
	$\displaystyle\left.\quad-\pi_{i}\left(\pi_{i}+\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{b-d}\right)+\pi_{i}^{2}-\pi_{i}\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{a-b}\right\|$
	$\displaystyle=\left\|\left(\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{a-b}\right)\left(\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{b-d}\right)-\pi_{i}{\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{a-d}}\right\|$
	$\displaystyle\leq\lambda_{}^{a-d}\left(\sum_{v\geq 2}u_{vi}^{2}\right)\left(\sum_{v\geq 2}u_{vi}^{2}\right)+\lambda_{}^{a-d}\pi_{i}{\sum_{v\geq 2}u_{vi}^{2}}\leq 2\lambda_{*}^{a-d}.$		(159)

We also get for $d\geq e$

	$\displaystyle\left\|M^{d-e}(i\|i)M^{e-1}(i\|i)-\pi_{i}M^{d-1}(i\|i)\right\|$
	$\displaystyle=\left\|\left(\pi_{i}+\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{d-e}\right)\left(\pi_{i}+\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{e-1}\right)-\pi_{i}\left(\pi_{i}+\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{d-1}\right)\right\|$
	$\displaystyle=\left\|\pi_{i}{\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{e-1}}+\pi_{i}{\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{d-e}}+\left(\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{e-1}\right)\left(\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{d-e}\right)-\pi_{i}{\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{d-1}}\right\|$
	$\displaystyle\leq 2\lambda_{}^{d-1}+\pi_{i}\lambda_{}^{e-1}+\pi_{i}\lambda_{*}^{d-e}.$		(160)

This implies

	$\displaystyle\left\|\mathbb{E}\left[\xi_{a}\xi_{b}\|X_{d}=i\right]\right\|\left\|M^{d-e}(i\|i)M^{e-1}(i\|i)-\pi_{i}M^{d-1}(i\|i)\right\|$
	$\displaystyle\leq\left(\pi_{i}\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{a-b}+2\lambda_{}^{a-d}\right)\left(2\lambda_{}^{d-1}+\pi_{i}\lambda_{}^{e-1}+\pi_{i}\lambda_{}^{d-e}\right)$
	$\displaystyle\leq\left(\pi_{i}\lambda_{}^{a-b}+2\lambda_{}^{a-d}\right)\left(2\lambda_{}^{d-1}+\pi_{i}\lambda_{}^{e-1}+\pi_{i}\lambda_{*}^{d-e}\right)$
	$\displaystyle\leq 4\left[\pi_{i}^{2}\lambda_{}^{a-b+d-e}+\pi_{i}^{2}\lambda_{}^{a-b+e-1}+\pi_{i}\left(\lambda_{}^{a-b+d-1}+\lambda_{}^{a-d+e-1}+\lambda_{}^{a-e}\right)+\lambda_{}^{a-1}\right]$		(161)

Using (159) along with Lemma 35 for any $e\leq b\leq a$ we get

	$\displaystyle\left\|\pi_{i}\mathbb{E}\left[\xi_{a}\xi_{b}\|X_{e}=i\right]M^{e-1}(i\|i)-\pi_{i}^{2}\mathbb{E}\left[\xi_{a}\xi_{b}\|X_{1}=i\right]\right\|$
	$\displaystyle\leq\pi_{i}\left\|\mathbb{E}\left[\xi_{a}\xi_{b}\|X_{e}=i\right]\right\|\left\|M^{e-1}(i\|i)-\pi_{i}\right\|+\pi_{i}^{2}\left\|\mathbb{E}\left[\xi_{a}\xi_{b}\|X_{e}=i\right]-\pi_{i}\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{a-b}\right\|$		(162)
	$\displaystyle+\pi_{i}^{2}\left\|\mathbb{E}\left[\xi_{a}\xi_{b}\|X_{1}=i\right]-\pi_{i}\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{a-b}\right\|$
	$\displaystyle\leq\pi_{i}\left[\pi_{i}\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{a-b}+2{\lambda_{}^{a-e}}\right]2\lambda_{}^{e-1}+2\pi_{i}^{2}\lambda_{}^{a-e}+2\pi_{i}^{2}{\lambda_{}^{a-1}}$
	$\displaystyle\leq 2\pi_{i}^{2}{\lambda_{}^{a-b+e-1}}+4\pi_{i}^{2}{\lambda_{}^{a-e}}+4\pi_{i}^{2}{\lambda_{*}^{a-1}}.$		(163)

This together with (161) and (158) implies

\displaystyle\begin{split}\left|\mathbb{E}\left[\xi_{a}\xi_{b}\xi_{d}\xi_{e}|X_{1}=i\right]\right|&\lesssim\pi_{i}^{2}\left(\lambda_{*}^{a-b+d-e}+\lambda_{*}^{a-b+e-1}\right)+\lambda_{*}^{a-1}\\ &\quad+\pi_{i}\left(\lambda_{*}^{a-b+d-1}+\lambda_{*}^{a-d+e-1}+\lambda_{*}^{a-e}\right)\end{split}

(164)

To bound the sum over $n-1\geq a\geq b\geq d\geq e\geq 1$ , we divide the analysis according to the number of distinct ordered indices related variations in terms.

Case I: four distinct indices.

We sum (164) over all possible $a>b>d>e$ .

•

For the first term,

\displaystyle\pi_{i}^{2}\mathop{\sum\sum\sum\sum}_{n-1\geq a>b>d>e\geq 1}\lambda_{*}^{a-b+d-e}\lesssim{n\pi_{i}^{2}\over\gamma_{*}}\mathop{\sum\sum}_{n-1\geq a>b\geq 3}\lambda_{*}^{a-b}\lesssim{n^{2}\pi_{i}^{2}\over\gamma_{*}^{2}}.

•

For the second term,

\displaystyle\pi_{i}^{2}\mathop{\sum\sum\sum\sum}_{n-1\geq a>b>d>e\geq 1}\lambda_{*}^{a-b+e-1}\lesssim{n\pi_{i}^{2}\over\gamma_{*}}\mathop{\sum\sum}_{n-1\geq a>b\geq 3}\lambda_{*}^{a-b}\lesssim{n^{2}\pi_{i}^{2}\over\gamma_{*}^{2}}

•

For the third term,

\displaystyle\mathop{\sum\sum\sum\sum}_{n-1\geq a>b>d>e\geq 1}\lambda_{*}^{a-1}\lesssim\sum_{n-1\geq a\geq 4}a^{3}\lambda_{*}^{a-1}\lesssim{1\over\gamma_{*}^{4}}.

•

For the fourth term,

\displaystyle\pi_{i}\mathop{\sum\sum\sum\sum}_{n-1\geq a>b>d>e\geq 1}\lambda_{*}^{a-b+d-1}\leq{\pi_{i}\over\gamma_{*}^{2}}\mathop{\sum\sum}_{n-1\geq a>b\geq 3}\lambda_{*}^{a-b}\lesssim\frac{n\pi_{i}}{\gamma_{*}^{3}}

•

For the fifth term,

\displaystyle\pi_{i}\mathop{\sum\sum\sum\sum}_{n-1\geq a>b>d>e\geq 1}\lambda_{*}^{a-d+e-1}\lesssim\frac{\pi_{i}}{\gamma_{*}}\left(\mathop{\sum\sum}_{n-1\geq a>b\geq 3}\lambda_{*}^{a-b}\right)\left(\sum_{d\geq 2}^{b-1}\lambda_{*}^{b-d}\right)\lesssim{n\pi_{i}\over\gamma_{*}^{3}}.

•

For the sixth term,

\displaystyle\pi_{i}\mathop{\sum\sum\sum\sum}_{n-1\geq a>b>d>e\geq 1}\lambda_{*}^{a-e}\lesssim\pi_{i}\left(\mathop{\sum\sum}_{n-1\geq a>b\geq 3}\lambda_{*}^{a-b}\right)\left(\sum_{d\geq 2}^{b-1}\lambda_{*}^{b-d}\right)\left(\sum_{e\geq 1}^{d-1}\lambda_{*}^{d-e}\right)\lesssim{n\pi_{i}\over\gamma_{*}^{3}}.

Combining the above bounds and using the fact that $ab\leq a^{2}+b^{2}$ , we obtain

\displaystyle\mathop{\sum\sum\sum\sum}_{n-1\geq a>b>d>e\geq 1}\left|\mathbb{E}\left[\xi_{a}\xi_{b}\xi_{d}\xi_{e}|X_{1}=i\right]\right|\lesssim{{n^{2}\pi_{i}^{2}\over\gamma_{*}^{2}}+{n\pi_{i}\over\gamma_{*}^{3}}+{1\over\gamma_{*}^{4}}}\lesssim{{n^{2}\pi_{i}^{2}\over\gamma_{*}^{2}}+{1\over\gamma_{*}^{4}}}.

(165)

Case II: three distinct indices.

There are three cases, namely, $\xi_{a}\xi_{b}^{2}\xi_{e}$ , $\xi_{a}\xi_{b}\xi_{e}^{2}$ , and $\xi_{a}^{2}\xi_{b}\xi_{e}$ .

Bounding $\mathop{\sum\sum\sum}_{n-1\geq a>b>e\geq 1}\left|\mathbb{E}\left[\xi_{a}\xi_{b}^{2}\xi_{e}|X_{1}=i\right]\right|$ : We specialize (164) with $b=d$ to get

\displaystyle\left|\mathbb{E}\left[\xi_{a}\xi_{b}^{2}\xi_{e}|X_{1}=i\right]\right|

\displaystyle\lesssim\pi_{i}\left(\lambda_{*}^{a-b+e-1}+\lambda_{*}^{a-e}\right)+\lambda_{*}^{a-1}.

Summing over $a,b,e$ we have

	$\displaystyle\mathop{\sum\sum\sum}_{n-1\geq a>b>e\geq 1}\left\|\mathbb{E}\left[\xi_{a}\xi_{b}^{2}\xi_{e}\|X_{1}=i\right]\right\|$
	$\displaystyle\lesssim\mathop{\sum\sum\sum}_{n-1\geq a>b>e\geq 1}\left\{\pi_{i}\left(\lambda_{}^{a-b+e-1}+\lambda_{}^{a-e}\right)+\lambda_{*}^{a-1}\right\}$
	$\displaystyle\lesssim{\pi_{i}\over\gamma_{}}\mathop{\sum\sum}_{n-1\geq a>b\geq 2}\lambda_{}^{a-b}+\pi_{i}\left(\mathop{\sum\sum}_{n-1\geq a>b\geq 2}\lambda_{}^{a-b}\right)\left(\sum_{e\geq 1}^{b-1}\lambda_{}^{b-e}\right)+\sum_{n-1\geq a\geq 3}a^{3}\lambda_{*}^{a-1}$
	$\displaystyle\lesssim{n\pi_{i}\over\gamma_{}^{2}}+{1\over\gamma_{}^{3}}\lesssim{n^{2}\pi_{i}^{2}\over\gamma_{}^{2}}+\frac{1}{\gamma_{}^{3}}$		(166)

with last inequality following from $xy\leq x^{2}+y^{2}$ .

Bounding $\mathop{\sum\sum\sum}_{n-1\geq a>b>e\geq 1}\left|\mathbb{E}\left[\xi_{a}\xi_{b}\xi_{e}^{2}|X_{1}=i\right]\right|$ : We specialize (164) with $e=d$ to get

\displaystyle\left|\mathbb{E}\left[\xi_{a}\xi_{b}\xi_{e}^{2}|X_{1}=i\right]\right|

\displaystyle\lesssim\pi_{i}^{2}\lambda_{*}^{a-b}+\pi_{i}\left(\lambda_{*}^{a-b+e-1}+\lambda_{*}^{a-e}\right)+\lambda_{*}^{a-1}.

Summing over $a,b,e$ and applying (166), we get

	$\displaystyle\mathop{\sum\sum\sum}_{n-1\geq a>b>e\geq 1}\left\|\mathbb{E}\left[\xi_{a}\xi_{b}\xi_{e}^{2}\|X_{1}=i\right]\right\|$
	$\displaystyle\lesssim\mathop{\sum\sum\sum}_{n-1\geq a>b>e\geq 1}\left\{\pi_{i}^{2}\lambda_{}^{a-b}+\pi_{i}\left(\lambda_{}^{a-b+e-1}+\lambda_{}^{a-e}\right)+\lambda_{}^{a-1}\right\}$
	$\displaystyle\lesssim{n\pi_{i}^{2}}\mathop{\sum\sum}_{n-1\geq a>b\geq 2}\lambda_{}^{a-b}+{n\pi_{i}\over\gamma_{}^{2}}+{1\over\gamma_{}^{3}}\lesssim{n^{2}\pi_{i}^{2}\over\gamma_{}}+{n\pi_{i}\over\gamma_{}^{2}}+{1\over\gamma_{}^{3}}\lesssim{n^{2}\pi_{i}^{2}\over\gamma_{}^{2}}+{1\over\gamma_{}^{3}}.$		(167)

Bounding $\mathop{\sum\sum\sum}_{n-1\geq a>b>e\geq 1}\left|\mathbb{E}\left[\xi_{a}^{2}\xi_{b}\xi_{e}|X_{1}=i\right]\right|$ : Specializing (164) with $a=b$ we get

\displaystyle\left|\mathbb{E}\left[\xi_{b}^{2}\xi_{d}\xi_{e}|X_{1}=i\right]\right|\lesssim\pi_{i}^{2}\left(\lambda_{*}^{d-e}+\lambda_{*}^{e-1}\right)+\lambda_{*}^{b-1}+\pi_{i}\left(\lambda_{*}^{d-1}+\lambda_{*}^{b-d+e-1}+\lambda_{*}^{b-e}\right),

which is equivalent to

\displaystyle\left|\mathbb{E}\left[\xi_{a}^{2}\xi_{b}\xi_{e}|X_{1}=i\right]\right|\lesssim\pi_{i}^{2}\left(\lambda_{*}^{b-e}+\lambda_{*}^{e-1}\right)+\lambda_{*}^{a-1}+\pi_{i}\left(\lambda_{*}^{b-1}+\lambda_{*}^{a-b+e-1}+\lambda_{*}^{a-e}\right).

For the first, second and fourth terms

\displaystyle\mathop{\sum\sum\sum}_{n-1\geq a>b>e\geq 1}\left\{\pi_{i}^{2}\left(\lambda_{*}^{b-e}+\lambda_{*}^{e-1}\right)+\pi_{i}\lambda_{*}^{b-1}\right\}\lesssim{\pi_{i}^{2}\over\gamma_{*}}\mathop{\sum\sum}_{n-1\geq a>b\geq 2}1+{n\pi_{i}\over\gamma_{*}^{2}}{\lesssim}{n^{2}\pi_{i}^{2}\over\gamma_{*}}+{n\pi_{i}\over\gamma_{*}^{2}},

and for summing the remaining terms we use (166), which implies

\displaystyle\mathop{\sum\sum\sum}_{n-1\geq a>b>e\geq 1}\left|\mathbb{E}\left[\xi_{a}^{2}\xi_{b}\xi_{e}|X_{1}=i\right]\right|\lesssim{{n^{2}\pi_{i}^{2}\over\gamma_{*}}+{n\pi_{i}\over\gamma_{*}^{2}}+{1\over\gamma_{*}^{3}}}\lesssim{{n^{2}\pi_{i}^{2}\over\gamma_{*}^{2}}+{1\over\gamma_{*}^{3}}}.

(168)

Case III: two distinct indices.

There are three cases, namely, $\eta_{a}^{2}\eta_{e}^{2},\eta_{a}\eta_{e}^{3}$ and $\eta_{a}^{3}\eta_{e}$ .

Bounding $\mathop{\sum\sum}_{n-1\geq a>e\geq 1}\mathbb{E}\left[\xi_{a}^{2}\xi_{e}^{2}|X_{1}=i\right]$ : Specializing (164) for $a=b$ and $e=d$ we get

\displaystyle{\mathbb{E}\left[\xi_{a}^{2}\xi_{e}^{2}|X_{1}=i\right]}\lesssim\pi_{i}^{2}+\pi_{i}\left(\lambda_{*}^{e-1}+\lambda_{*}^{a-e}\right)+\lambda_{*}^{a-1}.

Summing up over $a,e$ we have

\displaystyle\mathop{\sum\sum}_{n-1\geq a>e\geq 1}{\mathbb{E}\left[\xi_{a}^{2}\xi_{e}^{2}|X_{1}=i\right]}\lesssim\mathop{\sum\sum}_{n-1\geq a>e\geq 1}\left\{\pi_{i}^{2}+\pi_{i}\left(\lambda_{*}^{e-1}+\lambda_{*}^{a-e}\right)+\lambda_{*}^{a-1}\right\}\lesssim n^{2}\pi_{i}^{2}+{n\pi_{i}\over\gamma_{*}}+{1\over\gamma_{*}^{2}}.

(169)

Bounding $\mathop{\sum\sum}_{n-1\geq a>e\geq 1}\left|\mathbb{E}\left[\xi_{a}\xi_{e}^{3}|X_{1}=i\right]\right|$ : Specializing (164) for $e=b=d$ we get

\displaystyle\left|\mathbb{E}\left[\xi_{a}\xi_{e}^{3}|X_{1}=i\right]\right|\lesssim\pi_{i}\lambda_{*}^{a-e}+\lambda_{*}^{a-1}

which sums up to

\displaystyle\mathop{\sum\sum}_{n-1\geq a>e\geq 1}\left|\mathbb{E}\left[\xi_{a}\xi_{e}^{3}|X_{1}=i\right]\right|\lesssim\pi_{i}\mathop{\sum\sum}_{n-1\geq a>e\geq 1}\lambda_{*}^{a-e}+\mathop{\sum\sum}_{n-1\geq a>e\geq 1}\lambda_{*}^{a-1}\lesssim{n\pi_{i}\over\gamma_{*}}+{1\over\gamma_{*}^{2}}.

(170)

Bounding $\mathop{\sum\sum}_{n-1\geq a>e\geq 1}\left|\mathbb{E}\left[\xi_{a}^{3}\xi_{e}|X_{1}=i\right]\right|$ : Specializing (164) for $a=b=d$ we get

\displaystyle\left|\mathbb{E}\left[\xi_{a}^{3}\xi_{e}|X_{1}=i\right]\right|

\displaystyle\lesssim\pi_{i}\left(\lambda_{*}^{a-e}+\lambda_{*}^{e-1}\right)+\lambda_{*}^{a-1}

which sums up to

\displaystyle\mathop{\sum\sum}_{n-1\geq a>e\geq 1}\left|\mathbb{E}\left[\xi_{a}^{3}\xi_{e}|X_{1}=i\right]\right|\lesssim\mathop{\sum\sum}_{n-1\geq a>e\geq 1}\left\{\pi_{i}\left(\lambda_{*}^{a-e}+\lambda_{*}^{e-1}\right)+\lambda_{*}^{a-1}\right\}\lesssim{n\pi_{i}\over\gamma_{*}}+{1\over\gamma_{*}^{2}}.

(171)

Case IV: single distinct index.

We specialize (164) to $a=b=d=e$ to get

\displaystyle\mathbb{E}\left[\xi_{a}^{4}|X_{1}=i\right]

\displaystyle\lesssim\pi_{i}+\lambda_{*}^{a-1}.

Summing the above over $a$

\displaystyle\sum_{a=1}^{n-1}\mathbb{E}\left[\xi_{a}^{4}|X_{1}=i\right]\lesssim{n\pi_{i}}+{1\over\gamma_{*}}.

(172)

Combining (165)–(172) and using ${n\pi_{i}\over\gamma_{*}}\lesssim{n^{2}\pi_{i}^{2}\over\gamma_{*}^{2}}+{1\over\gamma_{*}^{4}}$ , we get

\displaystyle\mathbb{E}\left[\left(N_{i}-(n-1)\pi_{i}\right)^{4}|X_{1}=i\right]\lesssim{n^{2}\pi_{i}^{2}\over\gamma_{*}^{2}}+{1\over\gamma_{*}^{4}}.

∎

Acknowledgment

The authors are grateful to Alon Orlitsky for helpful and encouraging comments and to Dheeraj Pichapati for providing the full version of [FOPS16]. The authors also thank David Pollard for insightful discussions on Markov chains at the initial stages of the project.

References

[AG57] Theodore W Anderson and Leo A Goodman. Statistical inference about Markov chains. The Annals of Mathematical Statistics, pages 89–110, 1957.
[Agr20] Rohit Agrawal. Finite-sample concentration of the multinomial in relative entropy. IEEE Transactions on Information Theory, 66(10):6297–6302, 2020.
[Ahl21] T.D. Ahle. Sharp and simple bounds for the raw moments of the binomial and poisson distributions. arXiv:2103.17027, 2021.
[Att99] K. Atteson. The asymptotic redundancy of Bayes rules for Markov chains. IEEE Transactions on Information Theory, 45(6):2104–2109, 1999.
[Bar51] Maurice S Bartlett. The frequency goodness of fit test for probability chains. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 47, pages 86–95. Cambridge University Press, 1951.
[BFSS02] Dietrich Braess, Jürgen Forster, Tomas Sauer, and Hans U Simon. How to achieve minimax expected Kullback-Leibler distance from an unknown finite distribution. In Algorithmic Learning Theory, pages 380–394. Springer, 2002.
[BHOP18] Anna Ben-Hamou, Roberto I Oliveira, and Yuval Peres. Estimating graph parameters via random walks with restarts. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1702–1714. SIAM, 2018.
[Bil61] P. Billingsley. Statistical methods in Markov chains. The Annals of Mathematical Statistics, pages 12–40, 1961.
[CB19] Yeshwanth Cherapanamjeri and Peter L Bartlett. Testing symmetric Markov chains without hitting. In Conference on Learning Theory, pages 758–785. PMLR, 2019.
[CK82] Imre Csiszár and János Körner. Information Theory: Coding Theorems for Discrete Memoryless Systems. Academic Press, Inc., 1982.
[CS00] Imre Csiszár and Paul C Shields. The consistency of the BIC Markov order estimator. The Annals of Statistics, 28(6):1601–1619, 2000.
[CS04] I Csiszár and PC Shields. Information theory and statistics: a tutorial. Foundations and Trends in Communications and Information Theory, 1(4):417–527, 2004.
[CT06] Thomas M. Cover and Joy A. Thomas. Elements of information theory, 2nd Ed. Wiley-Interscience, New York, NY, USA, 2006.
[Dav73] L. Davisson. Universal noiseless coding. IEEE Transactions on Information Theory, 19(6):783–795, 1973.
[Dav83] L. Davisson. Minimax noiseless universal coding for Markov sources. IEEE Transactions on Information Theory, 29(2):211–215, 1983.
[DDG18] Constantinos Daskalakis, Nishanth Dikkala, and Nick Gravin. Testing symmetric Markov chains from a single trajectory. In Conference On Learning Theory, pages 385–409. PMLR, 2018.
[DMM⁺19] Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. On the sample complexity of the linear quadratic regulator. Foundations of Computational Mathematics, pages 1–47, 2019.
[DMPW81] L Davisson, R McEliece, M Pursley, and Mark Wallace. Efficient universal noiseless source codes. IEEE Transactions on Information Theory, 27(3):269–279, 1981.
[Fan53] Ky Fan. Minimax theorems. Proceedings of the National Academy of Sciences, 39(1):42–47, 1953.
[FOPS16] M. Falahatgar, A. Orlitsky, V. Pichapati, and A.T. Suresh. Learning Markov distributions: Does estimation trump compression? In 2016 IEEE International Symposium on Information Theory (ISIT), pages 2689–2693. IEEE, July 2016.
[FW21] Sela Fried and Geoffrey Wolfer. Identity testing of reversible Markov chains. arXiv preprint arXiv:2105.06347, 2021.
[GR20] F Richard Guo and Thomas S Richardson. Chernoff-type concentration of empirical probabilities in relative entropy. IEEE Transactions on Information Theory, 67(1):549–558, 2020.
[HJL⁺18] Y. Han, J. Jiao, C.Z. Lee, T. Weissman, Y. Wu, and T. Yu. Entropy rate estimation for Markov chains with large state space. In Advances in Neural Information Processing Systems, pages 9781–9792, 2018. arXiv:1802.07889.
[HKL⁺19] Daniel Hsu, Aryeh Kontorovich, David A Levin, Yuval Peres, Csaba Szepesvári, and Geoffrey Wolfer. Mixing time estimation in reversible Markov chains from a single sample path. Annals of Applied Probability, 29(4):2439–2480, 2019.
[Hof67] A.J. Hoffman. Three observations on nonnegative matrices. Journal of Research of the National Bureau of Standards B, pages 39–41, 1967.
[HOP18] Yi Hao, A. Orlitsky, and V. Pichapati. On learning Markov chains. In Advances in Neural Information Processing Systems, pages 648–657, 2018.
[Jan02] S. Janson. On concentration of probability. Contemporary combinatorics, 10(3):1–9, 2002.
[JS02] Philippe Jacquet and Wojciech Szpankowski. A combinatorial problem arising in information theory: Precise minimax redundancy for Markov sources. In Mathematics and Computer Science II, pages 311–328. Springer, 2002.
[Kem74] JHB Kemperman. On the Shannon capacity of an arbitrary channel. In Indagationes Mathematicae (Proceedings), volume 77, pages 101–115. North-Holland, 1974.
[KOPS15] S. Kamath, A. Orlitsky, D. Pichapati, and A.T. Suresh. On learning distributions from their samples. In Conference on Learning Theory, pages 1066–1100, June 2015.
[KV16] Sudeep Kamath and Sergio Verdú. Estimation of entropy rate and Rényi entropy rate for Markov chains. In Information Theory (ISIT), 2016 IEEE International Symposium on, pages 685–689. IEEE, 2016.
[Lat97] Rafał Latała. Estimation of moments of sums of independent real random variables. The Annals of Probability, 25(3):1502–1513, 1997.
[LB04] Feng Liang and Andrew Barron. Exact minimax strategies for predictive density estimation, data compression, and model selection. IEEE Transactions on Information Theory, 50(11):2708–2726, 2004.
[Lez98] P. Lezaud. Chernoff-type bound for finite Markov chains. Annals of Applied Probability, 8(3):849–867, 1998.
[LP17] D.A. Levin and Y. Peres. Markov chains and mixing times, volume 107. American Mathematical Soc., 2017.
[MJT⁺20] Jay Mardia, Jiantao Jiao, Ervin Tánczos, Robert D Nowak, and Tsachy Weissman. Concentration inequalities for the empirical distribution of discrete distributions: beyond the method of types. Information and Inference: A Journal of the IMA, 9(4):813–850, 2020.
[OS20] Maciej Obremski and Maciej Skorski. Complexity of estimating Rényi entropy of Markov chains. In 2020 IEEE International Symposium on Information Theory (ISIT), pages 2264–2269, 2020.
[Pan04] Liam Paninski. Variational minimax estimation of discrete distributions under KL loss. Advances in Neural Information Processing Systems, 17:1033–1040, 2004.
[Par62] Emanuel Parzen. Stochastic processes. Holden Day, 1962.
[Pau15] D. Paulin. Concentration inequalities for Markov chains by Marton couplings and spectral methods. Electronic Journal of Probability, pages 1–20, 2015.
[Ris84] Jorma Rissanen. Universal coding, information, prediction, and estimation. IEEE Transactions on Information theory, 30(4):629–636, 1984.
[Rya88] B.Y. Ryabko. Prediction of random sequences and universal coding. Prob. Pered. Inf., 24(2):87–96, 1988.
[Sht87] Yuri M Shtarkov. Universal sequential coding of single messages. Prob. Pered. Inf., 23:175–186, 1987.
[Sin64] Richard Sinkhorn. A relationship between arbitrary positive matrices and doubly stochastic matrices. The Annals of Mathematical Statistics, 35(2):876–879, 1964.
[SMT⁺18] Max Simchowitz, Horia Mania, Stephen Tu, Michael I Jordan, and Benjamin Recht. Learning without mixing: Towards a sharp analysis of linear system identification. In Conference On Learning Theory, pages 439–473. PMLR, 2018.
[SW12] Wojciech Szpankowski and Marcelo J Weinberger. Minimax pointwise redundancy for memoryless models over large alphabets. IEEE transactions on information theory, 58(7):4094–4104, 2012.
[TJW18] Kedar Tatwawadi, Jiantao Jiao, and Tsachy Weissman. Minimax redundancy for Markov chains with large state space. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 216–220. IEEE, 2018.
[Tro74] Viktor Kupriyanovich Trofimov. Redundancy of universal coding of arbitrary Markov sources. Prob. Pered. Inf., 10(4):16–24, 1974.
[Wai19] M.J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press., 2019.
[Whi55] P. Whittle. Some distribution and moment formulae for the Markov chain. Journal of the Royal Statistical Society: Series B (Methodological), 17(2):235–242, 1955.
[WK19] Geoffrey Wolfer and Aryeh Kontorovich. Minimax learning of ergodic Markov chains. In Algorithmic Learning Theory, pages 904–930. PMLR, 2019.
[WK20] Geoffrey Wolfer and Aryeh Kontorovich. Minimax testing of identity to a reference ergodic Markov chain. In International Conference on Artificial Intelligence and Statistics, pages 191–201. PMLR, 2020.
[XB97] Qun Xie and Andrew R Barron. Minimax redundancy for the class of memoryless sources. IEEE Transactions on Information Theory, 43(2):646–657, 1997.
[YB99] Y. Yang and A. R. Barron. Information-theoretic determination of minimax rates of convergence. The Annals of Statistics, 27(5):1564–1599, 1999.

	$\displaystyle D(P_{X_{n}\|X^{n-1}}\\|\widetilde{Q}_{X_{n}\|X^{n-1}}\|P_{X^{n-1}})\overset{\rm(a)}{=}$	$\displaystyle~{}\mathbb{E}\left[D\left(P_{X_{n}\|X^{n-1}_{n-m}}\Big{\\|}\frac{1}{n}\sum_{t=1}^{n}\widehat{P}_{t}(\cdot\|X_{n-t+1}^{n-1})\right)\right]$
	$\displaystyle\overset{\rm(b)}{\leq}$	$\displaystyle~{}\frac{1}{n-m}\sum_{t=m+1}^{n}\mathbb{E}\left[D(P_{X_{n}\|X^{n-1}_{n-m}}\\|\widehat{P}_{t}(\cdot\|X_{n-t+1}^{n-1}))\right]$
	$\displaystyle\overset{\rm(c)}{=}$	$\displaystyle~{}\frac{1}{n-m}\sum_{t=m+1}^{n}\mathbb{E}\left[D(P_{X_{t}\|X^{t-1}_{t-m}}\\|\widehat{P}_{t}(\cdot\|X^{t-1}))\right]$
	$\displaystyle\overset{\rm(d)}{=}$	$\displaystyle~{}\frac{1}{n-m}\sum_{t=m+1}^{n}D(P_{X_{t}\|X^{t-1}}\\|Q_{X^{t}\|X^{t-1}}\|P_{X^{t-1}})$
	$\displaystyle\leq$	$\displaystyle~{}\frac{1}{n-m}\sum_{t=1}^{n}D(P_{X_{t}\|X^{t-1}}\\|Q_{X^{t}\|X^{t-1}}\|P_{X^{t-1}})$
	$\displaystyle\overset{\rm(e)}{=}$	$\displaystyle~{}\frac{1}{n-m}D(P_{X^{n}}\\|Q_{X^{n}}),$

$\displaystyle\mathsf{Risk}_{k,n}$	$\displaystyle\geq\inf_{\widehat{M}}\mathbb{E}_{T}\left[\mathbb{E}[D(M(\cdot\|X_{n})\\|\widehat{M}(\cdot\|X_{n}))]\right]$
	$\displaystyle\geq\inf_{\widehat{M}}\sum_{t=1}^{n-1}\mathbb{E}_{M}\left[\mathbb{E}[D(M(\cdot\|X_{n})\\|\widehat{M}(\cdot\|X_{n}))\|{\mathcal{X}}_{t}]\cdot\mathbb{P}({\mathcal{X}}_{t})\right]$
	$\displaystyle\geq\frac{c}{n}\cdot\sum_{t=1}^{n-1}\inf_{\widehat{M}}\mathbb{E}_{M}\left[\mathbb{E}[D(M(\cdot\|X_{n})\\|\widehat{M}(\cdot\|X_{n}))\|{\mathcal{X}}_{t}]\right]$
	$\displaystyle\approx\frac{c}{n}\cdot\sum_{t=1}^{n-1}I(T;Y_{n-t+1}\|Y^{n-t})=\frac{c}{n}\cdot(I(T;Y^{n})-I(T;Y_{1})).$	(43)

	$\displaystyle\mathbb{P}\left[X_{t+1}=x_{t+1},\ldots,X_{n}=x_{n}\|X^{n}\in{\mathcal{X}}_{t}\right]=$	$\displaystyle~{}\frac{\frac{1}{2}\cdot\left(1-\frac{1}{n}\right)^{t-1}\cdot\frac{1}{n(k-1)}\prod_{s=t+1}^{n-1}M(x_{s+1}\|x_{s})}{\frac{1}{2}\cdot\left(1-\frac{1}{n}\right)^{t-1}\cdot\frac{1}{n}\cdot\left(1-\frac{1}{n}\right)^{n-1-t}}$
	$\displaystyle=$	$\displaystyle~{}\frac{1}{k-1}\prod_{s=t+1}^{n-1}T(x_{s+1}\|x_{s}).$

	$\displaystyle\mathbb{E}_{T}\left[\mathbb{E}[D(M(\cdot\|X_{n})\\|\widehat{M}(\cdot\|X^{n}))\|X^{n}\in{\mathcal{X}}_{t}]\right]$	$\displaystyle\overset{\rm(a)}{=}\left(1-\frac{1}{n}\right)\cdot\mathbb{E}_{T}\left[\mathbb{E}[D(T(\cdot\|Y_{n-t})\\|\widehat{T}(\cdot\|Y^{n-t}))]\right]$
		$\displaystyle\overset{\rm(b)}{=}\left(1-\frac{1}{n}\right)\cdot I(T;Y_{n-t+1}\|Y^{n-t}),$		(49)

$\displaystyle D({M(\cdot\|1)}\\|{\widehat{M}_{\ell}(\cdot\|1)})$	$\displaystyle={M(1\|1)\log\left(M(1\|1)\over 1-\frac{1}{\ell{\log(1/\gamma_{0})}}\right)+{M(2\|1)}\log\left({M(2\|1)}\ell{\log(1/\gamma_{0})}\right)}$
	$\displaystyle\lesssim{1\over\ell{\log(1/\gamma_{0})}}+{M(2\|1)}\log(M(2\|1)\ell)+{M(2\|1)}\log{{\log(1/\gamma_{0})}}$
	$\displaystyle\leq{1\over\ell{\log(1/\gamma_{0})}}+M(2\|1)\log_{+}(M(2\|1)\ell)+M(2\|1){\log\log(1/\gamma_{0})},$	(56)

Optimal prediction of Markov chains with and without spectral gap

Abstract

1 Introduction

Theorem 1 (Optimal rates without spectral gap).

Remark 1.

Theorem 2 (Spectral gap dependent rates for binary chain).

Theorem 3.

Corollary 4.

Theorem 5.

1.1 Proof techniques

1.2 Related work

1.3 Notations and preliminaries

1.4 Organization

2 Two general paradigms

2.1 Redundancy, prediction risk, and mutual information representation

“Compression”.

“Prediction”.

Lemma 6.

Remark 2.

Proof.

Remark 3.

2.2 Proof of the upper bound part of Theorem 1

Lemma 7.

Proof.

3 Optimal rates without spectral gap

3.1 Warmup: an Ω​(log⁡nn)\Omega(\frac{\log n}{n}) lower bound for three-state chains

Theorem 8.

Lemma 9.

Proof of Theorem 8.

3.2 kk-state chains

3.2.1 Construction of the kk-state chain

3.2.2 Reducing the Bayes prediction risk to redundancy

Lemma 10.

Corollary 11.

Proof of Lemma 10.

3.2.3 Prior construction and lower bounding the mutual information

Lemma 12.

Proof of Lemma 12.

4 Spectral gap-dependent risk bounds

4.1 Two states

Theorem 13 (Spectral gap dependent rates for binary chain).

Lemma 14.

Proof.

Lemma 15.

Proof.

4.2 kk states

4.2.1 Proof of Theorem 3 (i)

Lemma 16.

Case (a) nπ1≤rn\pi_{1}\leq r or nπ1M(i|1)≤10n\pi_{1}{M(i|1)}\leq 10:

Case(b) nπ1>rn\pi_{1}>r and nπ1M(i|1)>10n\pi_{1}{M(i|1)}>10:

Bound on 𝔼[𝟏{B≤}Δi]\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{\leq}}\right\}}}\Delta_{i}\right]

Bound on 𝔼[𝟏{B>}Δi]\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{>}}\right\}}}\Delta_{i}\right]

Remark 4.

4.2.2 Proof of Theorem 3 (ii)

Lemma 17 (KL risk bound for add-one estimator).

Proof.

Lemma 18.

Proof.

Case I p≥c4ℓlogn\overnp\geq{c_{4}\ell\log n\over n}:

Case II p<c4ℓlogn\overnp<{c_{4}\ell\log n\over n}:

4.2.3 Proof of Corollary 4

Lemma 19.

Proof.

5 Higher-order Markov chains

5.1 Basic setups

Lemma 20.

Proof.

Lemma 21.

Theorem 22.

5.2 Upper bound

Lemma 23.

Proof.

5.3 Lower bound

5.3.1 Special case: m≥2,k=2m\geq 2,k=2

Theorem 24.

Proof.

Lemma 25.

Proof.

5.3.2 General case: m≥2,k≥3m\geq 2,k\geq 3

Theorem 26.

3.1 Warmup: an $\Omega(\frac{\log n}{n})$ lower bound for three-state chains

3.2 $k$ -state chains

3.2.1 Construction of the $k$ -state chain

4.2 $k$ states

Case (a) $n\pi_{1}\leq r$ or $n\pi_{1}{M(i|1)}\leq 10$ :

Case(b) $n\pi_{1}>r$ and $n\pi_{1}{M(i|1)}>10$ :

Bound on $\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{\leq}}\right\}}}\Delta_{i}\right]$

Bound on $\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{>}}\right\}}}\Delta_{i}\right]$

Case I $p\geq{c_{4}\ell\log n\over n}$ :

Case II $p<{c_{4}\ell\log n\over n}$ :

5.3.1 Special case: $m\geq 2,k=2$

5.3.2 General case: $m\geq 2,k\geq 3$