
Optimal prediction of Markov chains with and without spectral gap

Yanjun Han Soham Jana and Yihong Wu Y. Han is with the Simons Institute for the Theory of Computing, University of California, Berkeley, email: [email protected]. S. Jana and Y. Wu are with the Department of Statistics and Data Science, Yale University, New Haven, CT, email: [email protected] and [email protected]. Y. Wu is supported in part by NSF Grant CCF-1900507, an NSF CAREER award CCF-1651588, and an Alfred Sloan fellowship.
Abstract

We study the following learning problem with dependent data: Observing a trajectory of length n from a stationary Markov chain with k states, the goal is to predict the next state. For 3\leq k\leq O(\sqrt{n}), using techniques from universal compression, the optimal prediction risk in Kullback-Leibler divergence is shown to be \Theta(\frac{k^{2}}{n}\log\frac{n}{k^{2}}), in contrast to the optimal rate of \Theta(\frac{\log\log n}{n}) for k=2 previously shown in [FOPS16]. These rates, slower than the parametric rate of O(\frac{k^{2}}{n}), can be attributed to the memory in the data, as the spectral gap of the Markov chain can be arbitrarily small. To quantify the memory effect, we study irreducible reversible chains with a prescribed spectral gap. In addition to characterizing the optimal prediction risk for two states, we show that, as long as the spectral gap is not excessively small, the prediction risk in the Markov model is O(\frac{k^{2}}{n}), which coincides with that of an iid model with the same number of parameters. Extensions to higher-order Markov chains are also obtained.

1 Introduction

Learning distributions from samples is a central question in statistics and machine learning. While significant progress has been achieved in property testing and estimation based on independent and identically distributed (iid) data, for many applications, most notably natural language processing, two new challenges arise: (a) Modeling data as independent observations fails to capture their temporal dependency; (b) Distributions are commonly supported on a large domain whose cardinality is comparable to or even exceeds the sample size. Continuing the progress made in [FOPS16, HOP18], in this paper we study the following prediction problem with dependent data modeled as Markov chains.

Suppose X_{1},X_{2},\dots is a stationary first-order Markov chain on state space [k]\triangleq\{1,\dots,k\} with unknown statistics. Observing a trajectory X^{n}\triangleq(X_{1},\ldots,X_{n}), the goal is to predict the next state X_{n+1} by estimating its distribution conditioned on the present data. We use the Kullback-Leibler (KL) divergence as the loss function: For distributions P=[p_{1},\dots,p_{k}] and Q=[q_{1},\dots,q_{k}], D(P\|Q)=\sum_{i=1}^{k}p_{i}\log\frac{p_{i}}{q_{i}} if p_{i}=0 whenever q_{i}=0, and D(P\|Q)=\infty otherwise. The minimax prediction risk is given by

\mathsf{Risk}_{k,n}\triangleq\inf_{\widehat{M}}\sup_{\pi,M}\mathbb{E}[D(M(\cdot|X_{n})\|\widehat{M}(\cdot|X_{n}))]=\inf_{\widehat{M}}\sup_{\pi,M}\sum_{i=1}^{k}\mathbb{E}[D(M(\cdot|i)\|\widehat{M}(\cdot|i)){\mathbf{1}_{\left\{{X_{n}=i}\right\}}}] (1)

where the supremum is taken over all stationary distributions \pi and transition matrices M (row-stochastic) such that \pi M=\pi, the infimum is taken over all estimators \widehat{M}=\widehat{M}(X_{1},\dots,X_{n}) that are proper Markov kernels (i.e. rows sum to 1), and M(\cdot|i) denotes the ith row of M. Our main objective is to characterize this minimax risk within universal constant factors as a function of n and k.

The prediction problem (1) is distinct from the parameter estimation problem such as estimating the transition matrix [Bar51, AG57, Bil61, WK19] or its properties [CS00, KV16, HJL+18, HKL+19] in that the quantity to be estimated (conditional distribution of the next state) depends on the sample path itself. This is precisely what renders the prediction problem closely relevant to natural applications such as autocomplete and text generation. In addition, this formulation allows more flexibility with far fewer assumptions compared to the estimation framework. For example, if a certain state has very small probability under the stationary distribution, consistent estimation of the transition matrix with respect to usual loss functions, e.g. squared risk, may not be possible, whereas the prediction problem is unencumbered by such rare states.

In the special case of iid data, the prediction problem reduces to estimating the distribution in KL divergence. In this setting the optimal risk is well understood: it is known to be {k-1\over 2n}(1+o(1)) when k is fixed and n\to\infty [BFSS02] and \Theta(\frac{k}{n}) for k=O(n) [Pan04, KOPS15]. (Here and below \asymp,\lesssim,\gtrsim or \Theta(\cdot),O(\cdot),\Omega(\cdot) denote equality and inequalities up to universal multiplicative constants.) Typical of parametric models, this rate \frac{k}{n} is commonly referred to as the "parametric rate", which leads to a sample complexity that scales proportionally to the number of parameters and inversely proportionally to the desired accuracy.

In the setting of Markov chains, however, the prediction problem is much less understood especially for large state space. Recently the seminal work [FOPS16] showed the surprising result that for stationary Markov chains on two states, the optimal prediction risk satisfies

\mathsf{Risk}_{2,n}=\Theta\left(\frac{\log\log n}{n}\right), (2)

which is a nonparametric rate even though the problem has only two parameters. The follow-up work [HOP18] studied general k-state chains and showed a lower bound of \Omega(\frac{k\log\log n}{n}) for uniform (not necessarily stationary) initial distributions; however, the upper bound O(\frac{k^{2}\log\log n}{n}) in [HOP18] relies on implicit assumptions on the mixing time such as spectral gap conditions: the proofs of the upper bounds for prediction (Lemma 7 in the supplement) and for estimation (Lemma 17 of the supplement) are based on Bernstein-type concentration results for the empirical transition counts, which depend on the spectral gap. The following theorem resolves the optimal risk for k-state Markov chains:

Theorem 1 (Optimal rates without spectral gap).

There exists a universal constant C>0 such that for all 3\leq k\leq\sqrt{n}/C,

\frac{k^{2}}{Cn}\log\left(\frac{n}{k^{2}}\right)\leq\mathsf{Risk}_{k,n}\leq\frac{Ck^{2}}{n}\log\left(\frac{n}{k^{2}}\right). (3)

Furthermore, the lower bound continues to hold even if the Markov chain is restricted to be irreducible and reversible.

Remark 1.

The optimal prediction risk of O(\frac{k^{2}}{n}\log\frac{n}{k^{2}}) can be achieved by an average version of the add-one estimator (i.e. Laplace's rule of succession). Given a trajectory x^{n}=(x_{1},\ldots,x_{n}) of length n, denote the transition counts (with the convention N_{i}\equiv N_{ij}\equiv 0 if n=0,1)

N_{i}=\sum_{\ell=1}^{n-1}{\mathbf{1}_{\left\{{x_{\ell}=i}\right\}}},\quad N_{ij}=\sum_{\ell=1}^{n-1}{\mathbf{1}_{\left\{{x_{\ell}=i,x_{\ell+1}=j}\right\}}}. (4)

The add-one estimator for the transition probability M(j|i)M(j|i) is given by

\widehat{M}^{+1}_{x^{n}}(j|i)\triangleq{N_{ij}+1\over N_{i}+k}, (5)

which is an additively smoothed version of the empirical frequency. Finally, the optimal rate in (3) can be achieved by the following estimator M^\widehat{M} defined as an average of add-one estimators over different sample sizes:

\widehat{M}_{x^{n}}(x_{n+1}|x_{n})\triangleq\frac{1}{n}\sum_{t=1}^{n}\widehat{M}^{+1}_{x_{n-t+1}^{n}}(x_{n+1}|x_{n}). (6)

In other words, we apply the add-one estimator to the most recent t observations (X_{n-t+1},\ldots,X_{n}) to predict the next state X_{n+1}, then average over t=1,\ldots,n. Such Cesàro-mean-type estimators have been introduced before in the density estimation literature (see, e.g., [YB99]). It remains open whether the usual add-one estimator (namely, the last term in (6), which uses all the data) or any add-c estimator for constant c achieves the optimal rate. In contrast, for two-state chains the optimal risk (2) is attained by a hybrid strategy [FOPS16], applying the add-c estimator with c=\frac{1}{\log n} for trajectories with at most one transition and c=1 otherwise. Also note that the estimator in (6) can be computed in O(nk) time. To see this, first note that for any given j\in[k], computing \widehat{M}^{+1}_{x^{n-1}_{1}}(j|x_{n-1}) takes O(n) time, and given \widehat{M}^{+1}_{x^{n-1}_{n-t+1}}(j|x_{n-1}) we need only O(1) time to compute \widehat{M}^{+1}_{x^{n-1}_{n-t+2}}(j|x_{n-1}). Summing over all j yields the claimed complexity bound.
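For concreteness, the following sketch (illustrative Python, not part of the paper) computes the averaged estimator (6): it maintains the transition counts (4) of the current suffix, forms the add-one estimate (5) for the row of the last state, and averages over all suffix lengths, using the incremental O(nk) update described above; states are labeled 0,\ldots,k-1.

import numpy as np

def averaged_add_one_predictor(x, k):
    # Averaged add-one estimator (6): estimate the distribution of X_{n+1}
    # given a trajectory x = (x_1, ..., x_n) over states {0, ..., k-1}.
    n = len(x)
    last = x[-1]
    N_last = 0                  # N_{x_n} for the current suffix (transitions out of x_n)
    N_last_to = np.zeros(k)     # N_{x_n, j} for the current suffix
    total = np.zeros(k)
    for t in range(1, n + 1):   # suffix (x_{n-t+1}, ..., x_n)
        if t >= 2:
            # extending the suffix adds the single transition x_{n-t+1} -> x_{n-t+2}
            if x[n - t] == last:
                N_last += 1
                N_last_to[x[n - t + 1]] += 1
        total += (N_last_to + 1) / (N_last + k)   # add-one estimate (5) for row x_n
    return total / n

rng = np.random.default_rng(0)
x = rng.integers(0, 3, size=20)              # toy trajectory, k = 3
print(averaged_add_one_predictor(x, 3))      # estimated distribution of X_{n+1}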

Theorem 1 shows that the departure from the parametric rate of \frac{k^{2}}{n}, first discovered in [FOPS16, HOP18] for binary chains, is even more pronounced for larger state spaces. As will become clear in the proof, there is a fundamental difference between two-state and three-state chains, resulting in \mathsf{Risk}_{3,n}=\Theta(\frac{\log n}{n})\gg\mathsf{Risk}_{2,n}=\Theta(\frac{\log\log n}{n}). It is instructive to compare the sample complexity for prediction in the iid and Markov models. Denote by d the number of parameters, which is k-1 in the iid case and k(k-1) for Markov chains. Define the sample complexity n^{*}(d,\epsilon) as the smallest sample size n needed to achieve a prescribed prediction risk \epsilon. For \epsilon=O(1), we have

n^{*}(d,\epsilon)\asymp\begin{cases}\frac{d}{\epsilon}&\text{iid}\\ \frac{d}{\epsilon}\log\log\frac{1}{\epsilon}&\text{Markov with $2$ states}\\ \frac{d}{\epsilon}\log\frac{1}{\epsilon}&\text{Markov with $k\geq 3$ states}.\end{cases} (7)

At a high level, the nonparametric rates in the Markov model can be attributed to the memory in the data. On the one hand, Theorem 1 as well as (2) affirm that one can obtain meaningful prediction without imposing any mixing conditions (to see this, it is helpful to consider the extreme case where the chain does not move at all or is periodic, in which case predicting the next state is in fact easy); such decoupling between learning and mixing has also been observed in other problems such as learning linear dynamics [SMT+18, DMM+19]. On the other hand, the dependency in the data does lead to a strictly higher sample complexity than that of the iid case; in fact, the lower bound in Theorem 1 is proved by constructing chains with spectral gap as small as O(\frac{1}{n}) (see Section 3). Thus, it is conceivable that with sufficiently favorable mixing conditions, the prediction risk improves over that of the worst case and, at some point, reaches the parametric rate. To make this precise, we focus on Markov chains with a prescribed spectral gap.

It is well-known that for an irreducible and reversible chain, the transition matrix M has k real eigenvalues satisfying 1=\lambda_{1}\geq\lambda_{2}\geq\dots\geq\lambda_{k}\geq-1. The absolute spectral gap of M, defined as

\gamma_{*}\triangleq 1-\max\left\{\left|\lambda_{i}\right|:i\neq 1\right\}, (8)

quantifies the memory of the Markov chain. For example, the mixing time is determined by 1/\gamma_{*} (the relaxation time) up to logarithmic factors. As extreme cases, the chain which does not move (M is the identity) and the iid chain (M is rank-one) have spectral gap equal to 0 and 1, respectively. We refer the reader to [LP17] for more background. Note that the definition of absolute spectral gap requires irreducibility and reversibility, so we restrict ourselves to this class of Markov chains (it is possible to use more general notions such as the pseudo spectral gap to quantify the memory of the process, which is beyond the scope of the current paper). Given \gamma_{0}\in(0,1), define {\mathcal{M}}_{k}(\gamma_{0}) as the set of transition matrices corresponding to irreducible and reversible chains whose absolute spectral gap exceeds \gamma_{0}. Restricting (1) to this subcollection and noticing that the stationary distribution is uniquely determined by M, we define the corresponding minimax risk:

\mathsf{Risk}_{k,n}(\gamma_{0})\triangleq\inf_{\widehat{M}}\sup_{M\in{\mathcal{M}}_{k}(\gamma_{0})}\mathbb{E}\left[D(M(\cdot|X_{n})\|\widehat{M}(\cdot|X_{n}))\right] (9)

Extending the result (2) of [FOPS16], the following theorem characterizes the optimal prediction risk for two-state chains with a prescribed spectral gap (the case \gamma_{0}=0 corresponds to the minimax rate in [FOPS16] over all binary Markov chains):

Theorem 2 (Spectral gap dependent rates for binary chain).

For any γ0(0,1)\gamma_{0}\in(0,1)

\mathsf{Risk}_{2,n}(\gamma_{0})\asymp\frac{1}{n}\max\left\{1,\log\log\left(\min\left\{n,\frac{1}{\gamma_{0}}\right\}\right)\right\}.

Theorem 2 shows that for binary chains, the parametric rate O(\frac{1}{n}) is achievable if and only if the spectral gap is nonvanishing. While this remains true for bounded state spaces (see Corollary 4 below), for large state spaces it turns out that much weaker conditions on the absolute spectral gap suffice to guarantee the parametric rate O({k^{2}\over n}), achieved by the add-one estimator applied to the entire trajectory. In other words, as long as the spectral gap is not excessively small, the prediction risk in the Markov model behaves in the same way as that of an iid model with an equal number of parameters. A similar conclusion has been established previously for the sample complexity of estimating the entropy rate of Markov chains in [HJL+18, Theorem 1].

Theorem 3.

The add-one estimator in (5) achieves the following risk bound.

  1. (i)

    For any k\geq 2, \mathsf{Risk}_{k,n}(\gamma_{0})\lesssim{k^{2}\over n} provided that \gamma_{0}\gtrsim(\frac{\log k}{k})^{1/4}.

  2. (ii)

    In addition, for k\gtrsim(\log n)^{6}, \mathsf{Risk}_{k,n}(\gamma_{0})\lesssim{k^{2}\over n} provided that \gamma_{0}\gtrsim{(\log(n+k))^{2}\over k}.

Corollary 4.

For any fixed k\geq 2, \mathsf{Risk}_{k,n}(\gamma_{0})=O(\frac{1}{n}) if and only if \gamma_{0}=\Omega(1).

Finally, we address the optimal prediction risk for higher-order Markov chains:

Theorem 5.

There is a constant C_{m} depending only on m such that for any constant m\geq 2 and any 2\leq k\leq n^{\frac{1}{m+1}}/C_{m}, the minimax prediction risk for m^{\text{th}}-order Markov chains with stationary initialization is \Theta_{m}\left({k^{m+1}\over n}\log{n\over k^{m+1}}\right).

Notably, for binary state spaces, the optimal rate \Theta\left(\log\log n\over n\right) for first-order Markov chains determined by [FOPS16] turns out to be rather special: we show that for second-order chains the optimal rate is \Theta\left(\log n\over n\right).

1.1 Proof techniques

The proof of Theorem 1 deviates from existing approaches based on concentration inequalities for Markov chains. For instance, the standard program for analyzing the add-one estimator (5) involves proving concentration of the empirical counts around their population versions, namely, N_{i}\approx n\pi_{i} and N_{ij}\approx n\pi_{i}M(j|i), and bounding the risk in the atypical case by concentration inequalities, such as the Chernoff-type bounds in [Lez98, Pau15], which have been widely used in recent work on statistical inference with Markov chains [KV16, HJL+18, HOP18, HKL+19, WK19]. However, these concentration inequalities inevitably depend on the spectral gap of the Markov chain, leading to results which deteriorate as the spectral gap becomes smaller. For two-state chains, results free of the spectral gap are obtained in [FOPS16] using the explicit joint distribution of the transition counts; this refined analysis, however, is difficult to extend to larger state spaces as the probability mass function of (N_{ij}) is given by Whittle's formula [Whi55], which takes an unwieldy determinantal form.

Eschewing concentration-based arguments, the crux of our proof of Theorem 1, for both the upper and lower bound, revolves around the following quantity known as redundancy:

\mathsf{Red}_{k,n}\triangleq\inf_{Q_{X^{n}}}\sup_{P_{X^{n}}}D(P_{X^{n}}\|Q_{X^{n}})=\inf_{Q_{X^{n}}}\sup_{P_{X^{n}}}\sum_{x^{n}}P_{X^{n}}(x^{n})\log\frac{P_{X^{n}}(x^{n})}{Q_{X^{n}}(x^{n})}. (10)

Here the supremum is taken over all joint distributions of stationary Markov chains XnX^{n} on kk states, and the infimum is over all joint distributions QXnQ_{X^{n}}. A central quantity which measures the minimax regret in universal compression, the redundancy (10) corresponds to minimax cumulative risk (namely, the total prediction risk when the sample size ranges from 1 to nn), while (1) is the individual minimax risk at sample size nn – see Section 2 for a detailed discussion. We prove the following reduction between prediction risk and redundancy:

\frac{1}{n}\mathsf{Red}_{k-1,n}^{\mathsf{sym}}-\frac{\log k}{n}\lesssim\mathsf{Risk}_{k,n}\leq\frac{1}{n-1}\mathsf{Red}_{k,n} (11)

where \mathsf{Red}^{\mathsf{sym}} denotes the redundancy for symmetric Markov chains. The upper bound is standard: thanks to the convexity of the loss function and the stationarity of the Markov chain, the risk of the Cesàro-mean estimator (6) can be upper bounded using the cumulative risk and, in turn, the redundancy. The proof of the lower bound is more involved. Given a (k-1)-state chain, we embed it into a larger state space by introducing a new state, such that with constant probability, the chain starts from and gets stuck at this state for a period of time that is approximately uniform in [n], then enters the original chain. Effectively, this scenario is equivalent to a prediction problem on k-1 states with a random (approximately uniform) sample size, whose prediction risk can then be related to the cumulative risk and redundancy. This intuition can be made precise by considering a Bayesian setting, in which the (k-1)-state chain is randomized according to the least favorable prior for (10), representing the Bayes risk as conditional mutual information, and applying the chain rule.

Given the above reduction in (11), it suffices to show that both redundancies therein are on the order of k^{2}\log\frac{n}{k^{2}}. The redundancy is upper bounded by the pointwise redundancy, which replaces the average in (10) by the maximum over all trajectories. Following [DMPW81, CS04], we consider an explicit probability assignment defined by add-one smoothing and use combinatorial arguments to bound the pointwise redundancy, which is shown to be optimal by information-theoretic arguments.

The optimal spectral-gap-dependent rate in Theorem 2 relies on the key observation in [FOPS16] that, for binary chains, the dominating contribution to the prediction risk comes from trajectories with a single transition, for which we may apply an add-c estimator with c depending appropriately on the spectral gap. The lower bound is shown using a Bayesian argument similar to that of [HOP18, Theorem 1]. The proof of Theorem 3 relies on more delicate concentration arguments, as the spectral gap is allowed to be vanishingly small. Notably, for small k, a direct application of existing Bernstein inequalities for Markov chains in [Lez98, Pau15] falls short of establishing the parametric rate of O(\frac{k^{2}}{n}) (see Remark 4 in Section 4.2 for details); instead, we use a fourth moment bound which turns out to be well suited for analyzing the concentration of empirical counts conditional on the terminal state.

For large kk, we further improve the spectral gap condition using a simulation argument for Markov chains using independent samples [Bil61, HJL+18]. A key step is a new concentration inequality for D(PP^n,k+1)D(P\|\widehat{P}_{n,k}^{+1}), where P^n,k+1\widehat{P}_{n,k}^{+1} is the add-one estimator based on nn iid observations of PP supported on [k][k]:

\mathbb{P}\left(D(P\|\widehat{P}_{n,k}^{+1})\geq c\cdot\frac{k}{n}+\frac{\mathsf{polylog}(n)\cdot\sqrt{k}}{n}\right)\leq\frac{1}{\mathsf{poly}(n)}, (12)

for some absolute constant c>0c>0. Note that an application of the classical concentration inequality of McDiarmid would result in the second term being 𝗉𝗈𝗅𝗒𝗅𝗈𝗀(n)/n\mathsf{polylog}(n)/\sqrt{n}, and (12) crucially improves this to 𝗉𝗈𝗅𝗒𝗅𝗈𝗀(n)k/n\mathsf{polylog}(n)\cdot\sqrt{k}/n. Such an improvement has been recently observed by [MJT+20, Agr20, GR20] in studying the similar quantity D(P^nP)D(\widehat{P}_{n}\|P) for the (unsmoothed) empirical distribution P^n\widehat{P}_{n}; however, these results, based on either the method of types or an explicit upper bound of the moment generating function, are not directly applicable to (12) in which the true distribution PP appears as the first argument in the KL divergence.
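As a rough illustration of the scale in (12) (a simulation sketch under an arbitrary choice of P, not a proof and not from the paper), one can draw n iid samples from a distribution P on [k], form the add-one estimate, and observe that D(P\|\widehat{P}_{n,k}^{+1}) is typically of order k/n:

import numpy as np

def kl(p, q):
    # KL divergence D(p||q) in nats; q has no zeros here due to add-one smoothing
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

rng = np.random.default_rng(0)
k, n, trials = 50, 2000, 500
P = rng.dirichlet(np.ones(k))                 # an arbitrary true distribution on [k]
vals = []
for _ in range(trials):
    counts = rng.multinomial(n, P)
    P_hat = (counts + 1) / (n + k)            # add-one (Laplace) estimator
    vals.append(kl(P, P_hat))
vals = np.array(vals)
print(f"k/n = {k/n:.4f}, mean = {vals.mean():.4f}, 99% quantile = {np.quantile(vals, 0.99):.4f}")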

The nonasymptotic analysis of the prediction risk for higher-order chains with large alphabets is based on a redundancy-based reduction similar to that for first-order chains. However, obtaining optimal nonasymptotic redundancy bounds for higher-order chains is more challenging. Notably, in lower bounding the redundancy, we need to bound the mutual information from below by upper bounding the squared error of certain estimators. As noted in [TJW18], the existing analysis in [Dav83, Sec III] based on simple mixing conditions from [Par62] leads to suboptimal results on large alphabets. To bypass this issue, we show that the pseudo spectral gap [Pau15] of the transition matrix of the first-order chain \{(X_{t+1},\dots,X_{t+m-1})\}_{t=0}^{n-m+1} is at least a constant. This is accomplished by a careful construction of a prior on m^{\rm th}-order transition matrices with \Theta\left(k^{m+1}\right) degrees of freedom.

1.2 Related work

While the exact prediction problem studied in this paper has come into focus only recently with [FOPS16, HOP18], there exists a large body of related work. As mentioned before, some of our proof strategies draw inspiration and results from the study of redundancy in universal compression, its connection to mutual information, as well as the perspective of sequential probability assignment as prediction, dating back to [Dav73, DMPW81, Ris84, Sht87, Rya88]. Asymptotic characterizations of the minimax redundancy for Markov sources, both average and pointwise, were obtained in [Dav83, Att99, JS02] in the regime of fixed alphabet size k and large sample size n. A nonasymptotic characterization was obtained in [Dav83] for n\gg k^{2}\log k and recently extended to n\asymp k^{2} in [TJW18], which further showed that the behavior of the redundancy remains unchanged even if the Markov chain is very close to being iid in terms of the spectral gap, i.e. \gamma_{*}=1-o(1).

The current paper adds to a growing body of literature devoted to statistical learning with dependent data, in particular work dealing with Markov chains. Estimation of the transition matrix [Bar51, AG57, Bil61, Sin64] and testing the order of Markov chains [CS00] have been well studied in the large-sample regime. More recently, attention has shifted towards large state spaces and nonasymptotic analysis. For example, [WK19] studied the estimation of the transition matrix in the \ell_{\infty}\to\ell_{\infty} induced norm for Markov chains with prescribed pseudo spectral gap and minimum stationary probability mass, and determined sample complexity bounds up to logarithmic factors. Similar results have been obtained for estimating properties of Markov chains, including the mixing time and spectral gap [HKL+19], the entropy rate [KV16, HJL+18, OS20], graph statistics based on random walks [BHOP18], as well as identity testing [DDG18, CB19, WK20, FW21]. Most of these results rely on assumptions on the Markov chain such as lower bounds on the spectral gap and the stationary distribution, which afford concentration for sample statistics of Markov chains. In contrast, one of the main contributions of this paper, in particular Theorem 1, is that optimal prediction can be achieved without these assumptions, thereby providing a novel way of tackling these seemingly unavoidable issues. This is ultimately accomplished by information-theoretic and combinatorial techniques from universal compression.

1.3 Notations and preliminaries

For n\in\mathbb{N}, let [n]\triangleq\{1,\ldots,n\}. Denote x^{n}=(x_{1},\ldots,x_{n}) and x_{t}^{n}=(x_{t},\ldots,x_{n}). The distribution of a random variable X is denoted by P_{X}. In a Bayesian setting, the distribution of a parameter \theta is referred to as a prior, denoted by P_{\theta}. We recall the following definitions from information theory [CK82, CT06]. The conditional KL divergence is defined as an average of KL divergences between conditional distributions:

D(PA|BQA|B|PB)𝔼BPB[D(PA|BQA|B)]=PB(db)D(PA|B=bQA|B=b).D(P_{A|B}\|Q_{A|B}|P_{B})\triangleq\mathbb{E}_{B\sim P_{B}}[D(P_{A|B}\|Q_{A|B})]=\int P_{B}(db)D(P_{A|B=b}\|Q_{A|B=b}). (13)

The mutual information between random variables AA and BB with joint distribution PABP_{AB} is I(A;B)D(PB|APB|PA)I(A;B)\triangleq D(P_{B|A}\|P_{B}|P_{A}); similarly, the conditional mutual information is defined as

I(A;B|C)D(PB|A,CPB|C|PA,C).I(A;B|C)\triangleq D(P_{B|A,C}\|P_{B|C}|P_{A,C}).

The following variational representation of (conditional) mutual information is well-known

I(A;B)=minQBD(PB|AQB|PA),I(A;B|C)=minQB|CD(PB|A,CQB|C|PAC).I(A;B)=\min_{Q_{B}}D(P_{B|A}\|Q_{B}|P_{A}),\quad I(A;B|C)=\min_{Q_{B|C}}D(P_{B|A,C}\|Q_{B|C}|P_{AC}). (14)

The entropy of a discrete random variable X is H(X)\triangleq\sum_{x}P_{X}(x)\log\frac{1}{P_{X}(x)}.

1.4 Organization

The rest of the paper is organized as follows. In Section 2 we describe the general paradigm of minimax redundancy and prediction risk and their dual representation in terms of mutual information. We give a general redundancy-based bound on the prediction risk, which, combined with redundancy bounds for Markov chains, leads to the upper bound in Theorem 1. Section 3 presents the lower bound construction, starting from three states and then extending to kk states. Spectral-gap dependent risk bounds in Theorems 2 and 3 are given in Section 4. Section 5 presents the results and proofs for mthm^{\text{th}}-order Markov chains. Section 6 discusses the assumptions and implications of our results and related open problems.

2 Two general paradigms

2.1 Redundancy, prediction risk, and mutual information representation

For nn\in\mathbb{N}, let 𝒫={PXn+1|θ:θΘ}{\mathcal{P}}=\{P_{X^{n+1}|\theta}:\theta\in\Theta\} be a collection of joint distributions parameterized by θ\theta.

“Compression”.

Consider a sample Xn(X1,,Xn)X^{n}\triangleq(X_{1},\ldots,X_{n}) of size nn drawn from PXn|θP_{X^{n}|\theta} for some unknown θΘ\theta\in\Theta. The redundancy of a probability assignment (joint distribution) QXnQ_{X^{n}} is defined as the worst-case KL risk of fitting the joint distribution of XnX^{n}, namely

𝖱𝖾𝖽(QXn)supθΘD(PXn|θQXn).\mathsf{Red}(Q_{X^{n}})\triangleq\sup_{\theta\in\Theta}D(P_{X^{n}|\theta}\|Q_{X^{n}}). (15)

Optimizing over QXnQ_{X^{n}}, the minimax redundancy is defined as

\mathsf{Red}_{n}\triangleq\inf_{Q_{X^{n}}}\mathsf{Red}(Q_{X^{n}}), (16)

where the infimum is over all joint distributions Q_{X^{n}}. This quantity can be operationalized as the redundancy (i.e. regret) in the setting of universal data compression, that is, the excess number of bits compared to the optimal compressor of X^{n} that knows \theta [CT06, Chapter 13].

The capacity-redundancy theorem (see [Kem74] for a very general result) provides the following mutual information characterization of (16):

𝖱𝖾𝖽n=supPθI(θ;Xn),\mathsf{Red}_{n}=\sup_{P_{\theta}}I(\theta;X^{n}), (17)

where the supremum is over all distributions (priors) PθP_{\theta} on Θ\Theta. In view of the variational representation (14), this result can be interpreted as a minimax theorem:

𝖱𝖾𝖽n=infQXnsupPθD(PXn|θQXn|Pθ)=supPθinfQXnD(PXn|θQXn|Pθ).\mathsf{Red}_{n}=\inf_{Q_{X^{n}}}\sup_{P_{\theta}}D(P_{X^{n}|\theta}\|Q_{X^{n}}|P_{\theta})=\sup_{P_{\theta}}\inf_{Q_{X^{n}}}D(P_{X^{n}|\theta}\|Q_{X^{n}}|P_{\theta}).

Typically, for fixed model size and n\to\infty, one expects that \mathsf{Red}_{n}=\frac{d}{2}\log n(1+o(1)), where d is the number of parameters; see [Ris84] for a general theory of this type. Indeed, on a fixed alphabet of size k, we have \mathsf{Red}_{n}=\frac{k-1}{2}\log n(1+o(1)) for the iid model [Dav73] and \mathsf{Red}_{n}=\frac{k^{m}(k-1)}{2}\log n(1+o(1)) for m^{\rm th}-order Markov models [Tro74], with more refined asymptotics shown in [XB97, SW12]. For large alphabets, nonasymptotic results have also been obtained. For example, for the first-order Markov model, \mathsf{Red}_{n}\asymp k^{2}\log\frac{n}{k^{2}} provided that n\gtrsim k^{2} [TJW18].

“Prediction”.

Consider the problem of predicting the next unseen data point Xn+1X_{n+1} based on the observations X1,,XnX_{1},\ldots,X_{n}, where (X1,,Xn+1)(X_{1},\ldots,X_{n+1}) are jointly distributed as PXn+1|θP_{X^{n+1}|\theta} for some unknown θΘ\theta\in\Theta. Here, an estimator is a distribution (for Xn+1X_{n+1}) as a function of XnX^{n}, which, in turn, can be written as a conditional distribution QXn+1|XnQ_{X_{n+1}|X^{n}}. As such, its worst-case average risk is

𝖱𝗂𝗌𝗄(QXn+1|Xn)supθΘD(PXn+1|Xn,θQXn+1|Xn|PXn|θ),\mathsf{Risk}(Q_{X_{n+1}|X^{n}})\triangleq\sup_{\theta\in\Theta}D(P_{X_{n+1}|X^{n},\theta}\|Q_{X_{n+1}|X^{n}}|P_{X^{n}|\theta}), (18)

where the conditional KL divergence is defined in (13). The minimax prediction risk is then defined as

\mathsf{Risk}_{n}\triangleq\inf_{Q_{X_{n+1}|X^{n}}}\mathsf{Risk}(Q_{X_{n+1}|X^{n}}). (19)

While (16) does not directly correspond to a statistical estimation problem, (19) is exactly the familiar setting of “density estimation”, where QXn+1|XnQ_{X_{n+1}|X^{n}} is understood as an estimator for the distribution of the unseen Xn+1X_{n+1} based on the available data X1,,XnX_{1},\ldots,X_{n}.

In the Bayesian setting where θ\theta is drawn from a prior PθP_{\theta}, the Bayes prediction risk coincides with the conditional mutual information as a consequence of the variational representation (14):

infQXn+1|Xn𝔼θ[D(PXn+1|Xn,θQXn+1|Xn|PXn|θ)]=I(θ;Xn+1|Xn).\inf_{Q_{X_{n+1}|X^{n}}}\mathbb{E}_{\theta}[D(P_{X_{n+1}|X^{n},\theta}\|Q_{X_{n+1}|X^{n}}|P_{X^{n}|\theta})]=I(\theta;X_{n+1}|X^{n}). (20)

Furthermore, the Bayes estimator that achieves this infimum takes the following form:

Q_{X_{n+1}|X^{n}}^{\sf Bayes}=P_{X_{n+1}|X^{n}}=\frac{\int_{\Theta}P_{X^{n+1}|\theta}\,dP_{\theta}}{\int_{\Theta}P_{X^{n}|\theta}\,dP_{\theta}}, (21)

known as the Bayes predictive density [Dav73, LB04]. These representations play a crucial role in the lower bound proof of Theorem 1. Under appropriate conditions which hold for Markov models (see Lemma 33 in Appendix A), the minimax prediction risk (19) also admits a dual representation analogous to (17):

𝖱𝗂𝗌𝗄n=supθπI(θ;Xn+1|Xn),\mathsf{Risk}_{n}=\sup_{\theta\sim\pi}I(\theta;X_{n+1}|X^{n}), (22)

which, in view of (20), shows that the principle of "minimax = worst-case Bayes" continues to hold for the prediction problem in Markov models.

The following result relates the redundancy and the prediction risk.

Lemma 6.

For any model 𝒫{\mathcal{P}},

𝖱𝖾𝖽nt=0n1𝖱𝗂𝗌𝗄t.\mathsf{Red}_{n}\leq\sum_{t=0}^{n-1}\mathsf{Risk}_{t}. (23)

In addition, suppose that each PXn|θ𝒫P_{X^{n}|\theta}\in{\mathcal{P}} is stationary and mthm^{\rm th}-order Markov. Then for all nm+1n\geq m+1,

𝖱𝗂𝗌𝗄n𝖱𝗂𝗌𝗄n1𝖱𝖾𝖽nnm.\mathsf{Risk}_{n}\leq\mathsf{Risk}_{n-1}\leq\frac{\mathsf{Red}_{n}}{n-m}. (24)

Furthermore, for any joint distribution QXnQ_{X^{n}} factorizing as QXn=t=1nQXt|Xt1Q_{X^{n}}=\prod_{t=1}^{n}Q_{X_{t}|X^{t-1}}, the prediction risk of the estimator

Q~Xn|Xn1(xn|xn1)1nmt=m+1nQXt|Xt1(xn|xnt+1n1)\widetilde{Q}_{X_{n}|X^{n-1}}(x_{n}|x^{n-1})\triangleq\frac{1}{n-m}\sum_{t=m+1}^{n}Q_{X_{t}|X^{t-1}}(x_{n}|x_{n-t+1}^{n-1}) (25)

is bounded by the redundancy of QXnQ_{X^{n}} as

𝖱𝗂𝗌𝗄(Q~Xn|Xn1)1nm𝖱𝖾𝖽(QXn).\mathsf{Risk}(\widetilde{Q}_{X_{n}|X^{n-1}})\leq\frac{1}{n-m}\mathsf{Red}(Q_{X^{n}}). (26)
Remark 2.

Note that the upper bound (23) on the redundancy, known as the "estimation-compression inequality" [KOPS15, FOPS16], holds without conditions, while the lower bound (24) relies on stationarity and Markovity. For iid data, the estimation-compression inequality is almost an equality; however, this is not the case for Markov chains, as the two sides of (23) differ by an unbounded factor of \Theta(\log\log n) for k=2 and \Theta(\log n) for fixed k\geq 3; see (2) and Theorem 1. On the other hand, Markov chains with at least three states offer a rare instance where (24) is tight, namely, \mathsf{Risk}_{n}\asymp\frac{\mathsf{Red}_{n}}{n} (cf. Lemma 7).

Proof.

The upper bound on the redundancy follows from the chain rule of KL divergence:

D(PXn|θQXn)=t=1nD(PXt|Xt1,θQXt|Xt1|PXt1).D(P_{X^{n}|\theta}\|Q_{X^{n}})=\sum_{t=1}^{n}D(P_{X_{t}|X^{t-1},\theta}\|Q_{X_{t}|X^{t-1}}|P_{X^{t-1}}). (27)

Thus

supθΘD(PXn|θQXn)t=1nsupθΘD(PXt|Xt1,θQXt|Xt1|PXt1).\sup_{\theta\in\Theta}D(P_{X^{n}|\theta}\|Q_{X^{n}})\leq\sum_{t=1}^{n}\sup_{\theta\in\Theta}D(P_{X_{t}|X^{t-1},\theta}\|Q_{X_{t}|X^{t-1}}|P_{X^{t-1}}).

Minimizing both sides over QXnQ_{X^{n}} (or equivalently, QXt|Xt1Q_{X_{t}|X^{t-1}} for t=1,,nt=1,\ldots,n) yields (23).

To upper bound the prediction risk using the redundancy, fix any Q_{X^{n}}, which gives rise to Q_{X_{t}|X^{t-1}} for t=1,\ldots,n. For clarity, let us denote the t^{\rm th} estimator by \widehat{P}_{t}(\cdot|x^{t-1})=Q_{X_{t}|X^{t-1}=x^{t-1}}. Consider the estimator \widetilde{Q}_{X_{n}|X^{n-1}} defined in (25), namely,

Q~Xn|Xn1=xn11nmt=m+1nP^t(|xnt+1,,xn1).\widetilde{Q}_{X_{n}|X^{n-1}=x^{n-1}}\triangleq\frac{1}{n-m}\sum_{t=m+1}^{n}\widehat{P}_{t}(\cdot|x_{n-t+1},\ldots,x_{n-1}). (28)

That is, we apply \widehat{P}_{t} to the most recent t-1 symbols prior to X_{n} to predict its distribution, then average over t. We may bound the prediction risk of this estimator by the redundancy as follows. Fix \theta\in\Theta. To simplify notation, we suppress the dependence on \theta and write P_{X^{n}|\theta}\equiv P_{X^{n}}. Then

D(P_{X_{n}|X^{n-1}}\|\widetilde{Q}_{X_{n}|X^{n-1}}|P_{X^{n-1}})\overset{\rm(a)}{=}\mathbb{E}\left[D\left(P_{X_{n}|X^{n-1}_{n-m}}\Big{\|}\frac{1}{n-m}\sum_{t=m+1}^{n}\widehat{P}_{t}(\cdot|X_{n-t+1}^{n-1})\right)\right]
\overset{\rm(b)}{\leq}\frac{1}{n-m}\sum_{t=m+1}^{n}\mathbb{E}\left[D(P_{X_{n}|X^{n-1}_{n-m}}\|\widehat{P}_{t}(\cdot|X_{n-t+1}^{n-1}))\right]
\overset{\rm(c)}{=}\frac{1}{n-m}\sum_{t=m+1}^{n}\mathbb{E}\left[D(P_{X_{t}|X^{t-1}_{t-m}}\|\widehat{P}_{t}(\cdot|X^{t-1}))\right]
\overset{\rm(d)}{=}\frac{1}{n-m}\sum_{t=m+1}^{n}D(P_{X_{t}|X^{t-1}}\|Q_{X_{t}|X^{t-1}}|P_{X^{t-1}})
\leq\frac{1}{n-m}\sum_{t=1}^{n}D(P_{X_{t}|X^{t-1}}\|Q_{X_{t}|X^{t-1}}|P_{X^{t-1}})
\overset{\rm(e)}{=}\frac{1}{n-m}D(P_{X^{n}}\|Q_{X^{n}}),

where (a) uses the m^{\rm th}-order Markov assumption; (b) is due to the convexity of the KL divergence; (c) uses the crucial fact that for all t=1,\ldots,n-1, (X_{n-t},\ldots,X_{n-1}){\stackrel{{\scriptstyle\rm law}}{{=}}}(X_{1},\ldots,X_{t}), thanks to stationarity; (d) follows from substituting \widehat{P}_{t}(\cdot|x^{t-1})=Q_{X_{t}|X^{t-1}=x^{t-1}}, the Markov assumption P_{X_{t}|X^{t-1}_{t-m}}=P_{X_{t}|X^{t-1}}, and rewriting the expectation as a conditional KL divergence; (e) is by the chain rule (27) of the KL divergence. Since the above holds for any \theta\in\Theta, the desired (26) follows, which implies that \mathsf{Risk}_{n-1}\leq\frac{\mathsf{Red}_{n}}{n-m}. Finally, \mathsf{Risk}_{n}\leq\mathsf{Risk}_{n-1} follows from \mathbb{E}[D(P_{X_{n+1}|X_{n}}\|\widehat{P}_{n}(X_{2}^{n}))]=\mathbb{E}[D(P_{X_{n}|X_{n-1}}\|\widehat{P}_{n}(X_{1}^{n-1}))], since (X_{2},\ldots,X_{n}) and (X_{1},\ldots,X_{n-1}) are equal in law. ∎

Remark 3.

Alternatively, Lemma 6 also follows from the mutual information representation (17) and (22). Indeed, by the chain rule for mutual information,

I(θ;Xn)=t=1nI(θ;Xt|Xt1),I(\theta;X^{n})=\sum_{t=1}^{n}I(\theta;X_{t}|X^{t-1}), (29)

taking the supremum over \pi (the distribution of \theta) on both sides and invoking (17) and (22) yields (23). For (24), it suffices to show that I(\theta;X_{t}|X^{t-1}) is decreasing in t: for any \theta\sim\pi,

I(θ;Xn+1|Xn)=\displaystyle I(\theta;X_{n+1}|X^{n})= 𝔼logPXn+1|Xn,θPXn+1|Xn=𝔼logPXn+1|Xn,θPXn+1|X2n+𝔼logPXn+1|X2nPXn+1|XnI(X1;Xn+1|X2n),\displaystyle~{}\mathbb{E}\log\frac{P_{X_{n+1}|X^{n},\theta}}{P_{X_{n+1}|X^{n}}}=\mathbb{E}\log\frac{P_{X_{n+1}|X^{n},\theta}}{P_{X_{n+1}|X^{n}_{2}}}+\underbrace{\mathbb{E}\log\frac{P_{X_{n+1}|X^{n}_{2}}}{P_{X_{n+1}|X^{n}}}}_{-I(X_{1};X_{n+1}|X_{2}^{n})},

and the first term is

𝔼logPXn+1|Xn,θPXn+1|X2n=𝔼logPXn+1|Xnm+1n,θPXn+1|X2n=𝔼logPXn|Xnmn1,θPXn|Xn1=I(θ;Xn|Xn1)\mathbb{E}\log\frac{P_{X_{n+1}|X^{n},\theta}}{P_{X_{n+1}|X^{n}_{2}}}=\mathbb{E}\log\frac{P_{X_{n+1}|X^{n}_{n-m+1},\theta}}{P_{X_{n+1}|X^{n}_{2}}}=\mathbb{E}\log\frac{P_{X_{n}|X^{n-1}_{n-m},\theta}}{P_{X_{n}|X^{n-1}}}=I(\theta;X_{n}|X^{n-1})

where the first and second equalities follow from the mthm^{\rm th}-order Markovity and stationarity, respectively. Taking supremum over π\pi yields 𝖱𝗂𝗌𝗄n𝖱𝗂𝗌𝗄n1\mathsf{Risk}_{n}\leq\mathsf{Risk}_{n-1}. Finally, by the chain rule (29), we have I(θ;Xn)(nm)I(θ;Xn|Xn1)I(\theta;X^{n})\geq(n-m)I(\theta;X_{n}|X^{n-1}), yielding 𝖱𝗂𝗌𝗄n1𝖱𝖾𝖽nnm\mathsf{Risk}_{n-1}\leq\frac{\mathsf{Red}_{n}}{n-m}.

2.2 Proof of the upper bound part of Theorem 1

Specializing to first-order stationary Markov chains with kk states, we denote the redundancy and prediction risk in (16) and (19) by 𝖱𝖾𝖽k,n\mathsf{Red}_{k,n} and 𝖱𝗂𝗌𝗄k,n\mathsf{Risk}_{k,n}, the latter of which is precisely the quantity previously defined in (1). Applying Lemma 6 yields 𝖱𝗂𝗌𝗄k,n1n1𝖱𝖾𝖽k,n\mathsf{Risk}_{k,n}\leq\frac{1}{n-1}\mathsf{Red}_{k,n}. To upper bound 𝖱𝖾𝖽k,n\mathsf{Red}_{k,n}, consider the following probability assignment:

Q(x1,,xn)=1kt=1n1M^xt+1(xt+1|xt)\displaystyle Q(x_{1},\cdots,x_{n})=\frac{1}{k}\prod_{t=1}^{n-1}\widehat{M}_{x^{t}}^{+1}(x_{t+1}|x_{t}) (30)

where M^+1\widehat{M}^{+1} is the add-one estimator defined in (5). This QQ factorizes as Q(x1)=1kQ(x_{1})=\frac{1}{k} and Q(xt+1|xt)=M^xt+1(xt+1|xt)Q(x_{t+1}|x^{t})=\widehat{M}_{x^{t}}^{+1}(x_{t+1}|x_{t}). The following lemma bounds the redundancy of QQ:

Lemma 7.

𝖱𝖾𝖽(Q)k(k1)[log(1+n1k(k1))+1]+logk.\mathsf{Red}(Q)\leq k(k-1)\left[\log\left(1+\frac{n-1}{k(k-1)}\right)+1\right]+\log k.

Combined with Lemma 6, Lemma 7 shows that 𝖱𝗂𝗌𝗄k,nCk2nlognk2\mathsf{Risk}_{k,n}\leq C\frac{k^{2}}{n}\log\frac{n}{k^{2}} for all kn/Ck\leq\sqrt{n/C} and some universal constant CC, achieved by the estimator (6), which is obtained by applying the rule (25) to (30).

It remains to show Lemma 7. To do so, we in fact bound the pointwise redundancy of the add-one probability assignment (30) over all (not necessarily stationary) Markov chains on kk states. The proof is similar to those of [CS04, Theorems 6.3 and 6.5], which, in turn, follow the arguments of [DMPW81, Sec. III-B].

Proof.

We show that for every Markov chain with transition matrix MM and initial distribution π\pi, and every trajectory (x1,,xn)(x_{1},\cdots,x_{n}), it holds that

logπ(x1)t=1n1M(xt+1|xt)Q(x1,,xn)k(k1)[log(1+nk(k1))+1]+logk,\displaystyle\log\frac{\pi(x_{1})\prod_{t=1}^{n-1}M(x_{t+1}|x_{t})}{Q(x_{1},\cdots,x_{n})}\leq k(k-1)\left[\log\left(1+\frac{n}{k(k-1)}\right)+1\right]+\log k, (31)

where we abbreviate the add-one estimator \widehat{M}^{+1}_{x^{t}}(x_{t+1}|x_{t}) defined in (5) as \widehat{M}(x_{t+1}|x_{t}).

To establish (31), note that Q(x1,,xn)Q(x_{1},\cdots,x_{n}) could be equivalently expressed using the empirical counts NiN_{i} and NijN_{ij} in (4) as

Q(x_{1},\cdots,x_{n})=\frac{1}{k}\prod_{i=1}^{k}\frac{\prod_{j=1}^{k}N_{ij}!}{k\cdot(k+1)\cdots(N_{i}+k-1)}.
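This identity can be checked numerically; the sketch below (illustrative code only) computes Q(x^{n}) both sequentially via (30) and from the final counts as above, and confirms that the two agree:

import numpy as np
from math import factorial, prod

def q_sequential(x, k):
    # probability assignment (30): Q(x^n) = (1/k) * product of add-one estimates (5)
    q = 1.0 / k
    Ni = np.zeros(k, dtype=int)
    Nij = np.zeros((k, k), dtype=int)
    for a, b in zip(x[:-1], x[1:]):
        q *= (Nij[a, b] + 1) / (Ni[a] + k)
        Ni[a] += 1
        Nij[a, b] += 1
    return q

def q_closed_form(x, k):
    # closed-form expression of Q(x^n) in terms of the final counts (4)
    Ni = np.zeros(k, dtype=int)
    Nij = np.zeros((k, k), dtype=int)
    for a, b in zip(x[:-1], x[1:]):
        Ni[a] += 1
        Nij[a, b] += 1
    q = 1.0 / k
    for i in range(k):
        num = prod(factorial(int(Nij[i, j])) for j in range(k))
        den = prod(k + r for r in range(int(Ni[i])))   # k(k+1)...(N_i+k-1)
        q *= num / den
    return q

x = [0, 1, 1, 2, 0, 1, 2, 2, 0]                        # toy trajectory on k = 3 states
print(q_sequential(x, 3), q_closed_form(x, 3))         # the two values agree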

Note that

t=1n1M(xt+1|xt)=i=1kj=1kM(j|i)Niji=1kj=1k(Nij/Ni)Nij,\prod_{t=1}^{n-1}M(x_{t+1}|x_{t})=\prod_{i=1}^{k}\prod_{j=1}^{k}M(j|i)^{N_{ij}}\leq\prod_{i=1}^{k}\prod_{j=1}^{k}(N_{ij}/N_{i})^{N_{ij}},

where the inequality follows from jNijNilogNij/NiM(j|i)0\sum_{j}\frac{N_{ij}}{N_{i}}\log\frac{N_{ij}/N_{i}}{M(j|i)}\geq 0 for each ii, by the nonnegativity of the KL divergence. Therefore, we have

π(x1)t=1n1M(xt+1|xt)Q(x1,,xn)ki=1kk(k+1)(Ni+k1)NiNij=1kNijNijNij!.\displaystyle\frac{\pi(x_{1})\prod_{t=1}^{n-1}M(x_{t+1}|x_{t})}{Q(x_{1},\cdots,x_{n})}\leq k\cdot\prod_{i=1}^{k}\frac{k\cdot(k+1)\cdot\cdots\cdot(N_{i}+k-1)}{N_{i}^{N_{i}}}\prod_{j=1}^{k}\frac{N_{ij}^{N_{ij}}}{N_{ij}!}. (32)

We claim that: for n1,,nk+n_{1},\cdots,n_{k}\in\mathbb{Z}_{+} and n=i=1knin=\sum_{i=1}^{k}n_{i}\in\mathbb{N}, it holds that

i=1k(nin)nii=1kni!n!,\displaystyle\prod_{i=1}^{k}\left(\frac{n_{i}}{n}\right)^{n_{i}}\leq\frac{\prod_{i=1}^{k}n_{i}!}{n!}, (33)

with the understanding that (0n)0=0!=1(\frac{0}{n})^{0}=0!=1. Applying this claim to (32) gives

logπ(x1)t=1n1M(xt+1|xt)Q(x1,,xn)\displaystyle\log\frac{\pi(x_{1})\prod_{t=1}^{n-1}M(x_{t+1}|x_{t})}{Q(x_{1},\cdots,x_{n})} logk+i=1klogk(k+1)(Ni+k1)Ni!\displaystyle\leq\log k+\sum_{i=1}^{k}\log\frac{k\cdot(k+1)\cdot\cdots\cdot(N_{i}+k-1)}{N_{i}!}
=logk+i=1k=1Nilog(1+k1)\displaystyle=\log k+\sum_{i=1}^{k}\sum_{\ell=1}^{N_{i}}\log\left(1+\frac{k-1}{\ell}\right)
logk+i=1k0Nilog(1+k1x)𝑑x\displaystyle\leq\log k+\sum_{i=1}^{k}\int_{0}^{N_{i}}\log\left(1+\frac{k-1}{x}\right)dx
=logk+i=1k((k1)log(1+Nik1)+Nilog(1+k1Ni))\displaystyle=\log k+\sum_{i=1}^{k}\left((k-1)\log\left(1+\frac{N_{i}}{k-1}\right)+N_{i}\log\left(1+\frac{k-1}{N_{i}}\right)\right)
(a)k(k1)log(1+n1k(k1))+k(k1)+logk,\displaystyle\overset{\rm(a)}{\leq}k(k-1)\log\left(1+\frac{n-1}{k(k-1)}\right)+k(k-1)+\log k,

where (a) follows from the concavity of xlogxx\mapsto\log x, i=1kNi=n1\sum_{i=1}^{k}N_{i}=n-1, and log(1+x)x\log(1+x)\leq x.

It remains to justify (33), which has a simple information-theoretic proof: Let TT denote the collection of sequences xnx^{n} in [k]n[k]^{n} whose type is given by (n1,,nk)(n_{1},\ldots,n_{k}). Namely, for each xnTx^{n}\in T, ii appears exactly nin_{i} times for each i[k]i\in[k]. Let (X1,,Xn)(X_{1},\ldots,X_{n}) be drawn uniformly at random from the set TT. Then

logn!i=1kni!=H(X1,,Xn)(a)j=1nH(Xj)=(b)ni=1kninlognni,\log\frac{n!}{\prod_{i=1}^{k}n_{i}!}=H(X_{1},\ldots,X_{n})\overset{\rm(a)}{\leq}\sum_{j=1}^{n}H(X_{j})\overset{\rm(b)}{=}n\sum_{i=1}^{k}\frac{n_{i}}{n}\log\frac{n}{n_{i}},

where (a) follows from the fact that the joint entropy is at most the sum of marginal entropies; (b) is because each XjX_{j} is distributed as (n1n,,nkn)(\frac{n_{1}}{n},\ldots,\frac{n_{k}}{n}). ∎
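As a sanity check (illustrative code only), the claim (33) can also be verified numerically on random count vectors:

import numpy as np
from math import factorial, prod

def check_type_inequality(counts):
    # checks (33): prod_i (n_i/n)^{n_i} <= prod_i n_i! / n!, with (0/n)^0 = 0! = 1
    n = sum(counts)
    lhs = prod((ni / n) ** ni for ni in counts if ni > 0)
    rhs = prod(factorial(ni) for ni in counts) / factorial(n)
    return lhs, rhs

rng = np.random.default_rng(1)
for _ in range(5):
    counts = rng.integers(0, 6, size=4).tolist()
    if sum(counts) == 0:
        continue
    lhs, rhs = check_type_inequality(counts)
    assert lhs <= rhs + 1e-12
    print(counts, round(lhs, 6), round(rhs, 6))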

3 Optimal rates without spectral gap

In this section, we prove the lower bound part of Theorem 1, which shows the optimality of the average version of the add-one estimator (25). We first describe the lower bound construction for three-state chains, which is subsequently extended to kk states.

3.1 Warmup: an \Omega(\frac{\log n}{n}) lower bound for three-state chains

Theorem 8.

𝖱𝗂𝗌𝗄3,n=Ω(lognn).\mathsf{Risk}_{3,n}=\Omega\left(\frac{\log n}{n}\right).

To show Theorem 8, consider the following one-parameter family of transition matrices:

{\mathcal{M}}=\left\{M_{p}=\left[\begin{matrix}1-\frac{2}{n}&\frac{1}{n}&\frac{1}{n}\\ \frac{1}{n}&1-\frac{1}{n}-p&p\\ \frac{1}{n}&p&1-\frac{1}{n}-p\end{matrix}\right]\colon 0\leq p\leq 1-\frac{1}{n}\right\}. (34)

Note that each transition matrix in {\mathcal{M}} is symmetric (hence doubly stochastic), whose corresponding chain is reversible with a uniform stationary distribution and spectral gap \Theta(\frac{1}{n}); see Fig. 1.

Figure 1: Lower bound construction for three-state chains.
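These properties are easy to verify numerically; the sketch below (illustrative code, not part of the paper) confirms for several values of p that M_{p} is doubly stochastic with the uniform stationary distribution, and that its absolute spectral gap (8) is of order 1/n:

import numpy as np

def M_p(n, p):
    # transition matrix (34) of the three-state construction
    return np.array([
        [1 - 2/n, 1/n,         1/n],
        [1/n,     1 - 1/n - p, p],
        [1/n,     p,           1 - 1/n - p],
    ])

n = 1000
for p in [0.0, 0.3, 1 - 1/n]:
    M = M_p(n, p)
    pi = np.ones(3) / 3
    assert np.allclose(M.sum(axis=1), 1) and np.allclose(pi @ M, pi)
    eigs = np.linalg.eigvalsh(M)                # symmetric, hence real eigenvalues (ascending)
    gap = 1 - np.max(np.abs(eigs[:-1]))         # absolute spectral gap (8)
    print(f"p = {p:.3f}: n * (spectral gap) = {n * gap:.2f}")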

The main idea is as follows. Notice that by design, with constant probability, the trajectory is of the following form: the chain starts and stays at state 1 for t steps, then transitions into state 2 or 3 and never returns to state 1, where t=1,\ldots,n-1. Since p is the single unknown parameter, the only useful observations are the visits to states 2 and 3, and each visit entails one observation about p obtained by flipping a coin with bias roughly p. Thus the effective sample size for estimating p is n-t-1, and we expect the best estimation error to be of the order of \frac{1}{n-t}. However, t is not fixed. In fact, conditioned on the trajectory being of this form, t is roughly uniformly distributed between 1 and n-1. As such, we anticipate the estimation error of p to be approximately

\frac{1}{n-1}\sum_{t=1}^{n-1}\frac{1}{n-t}=\Theta\left(\frac{\log n}{n}\right).

Intuitively speaking, the construction in Fig. 1 "embeds" a symmetric two-state chain (with states 2 and 3) with unknown parameter p into a space of three states by adding a "nuisance" state 1, which effectively slows down the exploration of the useful part of the state space, so that in a trajectory of length n, the effective number of observations we get to make about p is roughly uniformly distributed between 1 and n. This explains the extra log factor in Theorem 8, which stems from the harmonic sum in \mathbb{E}[\frac{1}{\mathrm{Uniform}([n])}]. We will fully explore this embedding idea in Section 3.2 to deal with larger state spaces.

Next we make the above intuition rigorous using a Bayesian argument. Let us start by recalling the following well-known lemma.

Lemma 9.

Let qUniform(0,1)q\sim\mathrm{Uniform}(0,1). Conditioned on qq, let NBinom(m,q)N\sim\text{Binom}(m,q). Then the Bayes estimator of qq given NN is the “add-one” estimator:

𝔼[q|N]=N+1m+2\mathbb{E}[q|N]=\frac{N+1}{m+2}

and the Bayes risk is given by

\mathbb{E}[(q-\mathbb{E}[q|N])^{2}]=\frac{1}{6(m+2)}.
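A quick Monte Carlo check of Lemma 9 (illustrative code only): simulating q\sim\mathrm{Uniform}(0,1) and N\sim\text{Binom}(m,q) confirms that the mean squared error of the add-one rule (N+1)/(m+2) matches 1/(6(m+2)):

import numpy as np

rng = np.random.default_rng(0)
for m in [5, 20, 100]:
    q = rng.uniform(0, 1, size=200_000)
    N = rng.binomial(m, q)                       # N | q ~ Binom(m, q)
    mse = np.mean((q - (N + 1) / (m + 2)) ** 2)  # risk of the add-one (Bayes) estimator
    print(f"m = {m}: empirical MSE = {mse:.6f}, 1/(6(m+2)) = {1 / (6 * (m + 2)):.6f}")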
Proof of Theorem 8.

Consider the following Bayesian setting: First, we draw pp uniformly at random from [0,11n][0,1-\frac{1}{n}]. Then, we generate the sample path Xn=(X1,,Xn)X^{n}=(X_{1},\ldots,X_{n}) of a stationary (uniform) Markov chain with transition matrix MpM_{p} as defined in (34). Define

𝒳t={xn:x1==xt=1,xi1,i=t+1,,n},t=1,,n1,𝒳=t=1n1𝒳t.\displaystyle\begin{gathered}{\mathcal{X}}_{t}=\{x^{n}:x_{1}=\ldots=x_{t}=1,x_{i}\neq 1,i=t+1,\ldots,n\},\quad t=1,\dots,n-1,\\ \quad{\mathcal{X}}=\cup_{t=1}^{n-1}{\mathcal{X}}_{t}.\end{gathered} (37)

Let \mu(x^{n}|p)=\mathbb{P}\left[X^{n}=x^{n}\right]. Then

μ(xn|p)=13(12n)t12npN(xn)(11np)nt1N(xn),xn𝒳t,\mu(x^{n}|p)=\frac{1}{3}\left(1-\frac{2}{n}\right)^{t-1}\frac{2}{n}p^{N(x^{n})}\left(1-\frac{1}{n}-p\right)^{n-t-1-N(x^{n})},\quad x^{n}\in{\mathcal{X}}_{t}, (38)

where N(xn)N(x^{n}) denotes the number of transitions from state 2 to 3 or from 3 to 2. Then

\mathbb{P}\left[X^{n}\in{\mathcal{X}}_{t}\right]=\frac{1}{3}\left(1-\frac{2}{n}\right)^{t-1}\frac{2}{n}\sum_{k=0}^{n-t-1}\binom{n-t-1}{k}p^{k}\left(1-\frac{1}{n}-p\right)^{n-t-1-k}
=\frac{1}{3}\left(1-\frac{2}{n}\right)^{t-1}\frac{2}{n}\left(1-\frac{1}{n}\right)^{n-t-1}=\frac{2}{3n}\left(1-\frac{1}{n}\right)^{n-2}\left(1-\frac{1}{n-1}\right)^{t-1} (39)

and hence

[Xn𝒳]=\displaystyle\mathbb{P}\left[X^{n}\in{\mathcal{X}}\right]= t=1n1[Xn𝒳t]=2(n1)3n(11n)n2(1(11n1)n1)\displaystyle~{}\sum_{t=1}^{n-1}\mathbb{P}\left[X^{n}\in{\mathcal{X}}_{t}\right]=\frac{2(n-1)}{3n}\left(1-\frac{1}{n}\right)^{n-2}\left(1-\left(1-\frac{1}{n-1}\right)^{n-1}\right) (40)
=\displaystyle= 2(11/e)3e+on(1).\displaystyle~{}\frac{2(1-1/e)}{3e}+o_{n}(1).

Consider the Bayes estimator (for estimating pp under the mean-squared error)

p^(xn)=𝔼[p|xn]=𝔼[pμ(xn|p)]𝔼[μ(xn|p)].\widehat{p}(x^{n})=\mathbb{E}[p|x^{n}]=\frac{\mathbb{E}[p\cdot\mu(x^{n}|p)]}{\mathbb{E}[\mu(x^{n}|p)]}.

For xn𝒳tx^{n}\in{\mathcal{X}}_{t}, using (38) we have

p^(xn)=\displaystyle\widehat{p}(x^{n})= 𝔼[pN(xn)+1(11np)nt1N(xn)]𝔼[pN(xn)(11np)nt1N(xn)],pUniform(0,n1n)\displaystyle~{}\frac{\mathbb{E}\left[p^{N(x^{n})+1}\left(1-\frac{1}{n}-p\right)^{n-t-1-N(x^{n})}\right]}{\mathbb{E}\left[p^{N(x^{n})}\left(1-\frac{1}{n}-p\right)^{n-t-1-N(x^{n})}\right]},\quad p\sim\mathrm{Uniform}\left(0,\frac{n-1}{n}\right)
=\displaystyle= n1n𝔼[UN(xn)+1(1U)nt1N(xn)]𝔼[UN(xn)(1U)nt1N(xn)],UUniform(0,1)\displaystyle~{}\frac{n-1}{n}\frac{\mathbb{E}\left[U^{N(x^{n})+1}\left(1-U\right)^{n-t-1-N(x^{n})}\right]}{\mathbb{E}\left[U^{N(x^{n})}\left(1-U\right)^{n-t-1-N(x^{n})}\right]},\quad U\sim\mathrm{Uniform}(0,1)
=\displaystyle= n1nN(xn)+1nt+1,\displaystyle~{}\frac{n-1}{n}\frac{N(x^{n})+1}{n-t+1},

where the last step follows from Lemma 9. From (38), we conclude that conditioned on Xn𝒳tX^{n}\in{\mathcal{X}}_{t} and on pp, N(Xn)Binom(nt1,q)N(X^{n})\sim\text{Binom}(n-t-1,q), where q=p11nUniform(0,1)q=\frac{p}{1-\frac{1}{n}}\sim\mathrm{Uniform}(0,1). Applying Lemma 9 (with m=nt1m=n-t-1 and N=N(Xn)N=N(X^{n})), we get

𝔼[(pp^(Xn))2|Xn𝒳t]=\displaystyle\mathbb{E}[(p-\widehat{p}(X^{n}))^{2}|X^{n}\in{\mathcal{X}}_{t}]= (n1n)2𝔼[(qN(xn)+1nt+1)2]\displaystyle~{}\left(\frac{n-1}{n}\right)^{2}\mathbb{E}\left[\left(q-\frac{N(x^{n})+1}{n-t+1}\right)^{2}\right]
=\displaystyle= (n1n)216(nt+1).\displaystyle~{}\left(\frac{n-1}{n}\right)^{2}\frac{1}{6(n-t+1)}.

Finally, note that conditioned on Xn𝒳X^{n}\in{\mathcal{X}}, the probability of Xn𝒳tX^{n}\in{\mathcal{X}}_{t} is close to uniform. Indeed, from (39) and (40) we get

[Xn𝒳t|𝒳]=1n1(11n1)t11(11n1)n11n1(1e1+on(1)),t=1,,n1.\mathbb{P}\left[X^{n}\in{\mathcal{X}}_{t}|{\mathcal{X}}\right]=\frac{1}{n-1}\frac{\left(1-\frac{1}{n-1}\right)^{t-1}}{1-\left(1-\frac{1}{n-1}\right)^{n-1}}\geq\frac{1}{n-1}\left(\frac{1}{e-1}+o_{n}(1)\right),\quad t=1,\ldots,n-1.

Thus

𝔼[(pp^(Xn))2𝟏{Xn𝒳}]=\displaystyle\mathbb{E}[(p-\widehat{p}(X^{n}))^{2}{\mathbf{1}_{\left\{{X^{n}\in{\mathcal{X}}}\right\}}}]= [Xn𝒳]t=1n1𝔼[(pp^(Xn))2|Xn𝒳t][Xn𝒳t|𝒳]\displaystyle~{}\mathbb{P}\left[X^{n}\in{\mathcal{X}}\right]\sum_{t=1}^{n-1}\mathbb{E}[(p-\widehat{p}(X^{n}))^{2}|X^{n}\in{\mathcal{X}}_{t}]\mathbb{P}\left[X^{n}\in{\mathcal{X}}_{t}|{\mathcal{X}}\right]
\displaystyle\gtrsim 1n1t=1n11nt+1=Θ(lognn).\displaystyle~{}\frac{1}{n-1}\sum_{t=1}^{n-1}\frac{1}{n-t+1}=\Theta\left(\frac{\log n}{n}\right). (41)

Finally, we relate (41) formally to the minimax prediction risk under the KL divergence. Consider any predictor \widehat{M}(\cdot|i) (as a function of the sample path X^{n}) for the ith row of M, i=1,2,3. By Pinsker's inequality, we conclude that

D(M(|2)M^(|2))12M(|2)M^(|2)1212(pM^(3|2))2\displaystyle D(M(\cdot|2)\|\widehat{M}(\cdot|2))\geq\frac{1}{2}\|M(\cdot|2)-\widehat{M}(\cdot|2)\|_{\ell_{1}}^{2}\geq\frac{1}{2}(p-\widehat{M}(3|2))^{2} (42)

and similarly, D(M(|3)M^(|3))12(pM^(2|3))2D(M(\cdot|3)\|\widehat{M}(\cdot|3))\geq\frac{1}{2}(p-\widehat{M}(2|3))^{2}. Abbreviate M^(3|2)p^2\widehat{M}(3|2)\equiv\widehat{p}_{2} and M^(2|3)p^3\widehat{M}(2|3)\equiv\widehat{p}_{3}, both functions of XX. Taking expectations over both pp and XX, the Bayes prediction risk can be bounded as follows

i=13𝔼[D(M(|i)M^(|i))𝟏{Xn=i}]\displaystyle\sum_{i=1}^{3}\mathbb{E}[D(M(\cdot|i)\|\widehat{M}(\cdot|i)){\mathbf{1}_{\left\{{X_{n}=i}\right\}}}]
\displaystyle\geq 12𝔼[(pp^2)2𝟏{Xn=2}+(pp^3)2𝟏{Xn=3}]\displaystyle~{}\frac{1}{2}\mathbb{E}[(p-\widehat{p}_{2})^{2}{\mathbf{1}_{\left\{{X_{n}=2}\right\}}}+(p-\widehat{p}_{3})^{2}{\mathbf{1}_{\left\{{X_{n}=3}\right\}}}]
\displaystyle\geq 12x𝒳μ(xn)(𝔼[(pp^2)2|X=xn]𝟏{xn=2}+𝔼[(pp^3)2|X=xn]𝟏{xn=3})\displaystyle~{}\frac{1}{2}\sum_{x\in{\mathcal{X}}}\mu(x^{n})\left(\mathbb{E}[(p-\widehat{p}_{2})^{2}|X=x^{n}]{\mathbf{1}_{\left\{{x_{n}=2}\right\}}}+\mathbb{E}[(p-\widehat{p}_{3})^{2}|X=x^{n}]{\mathbf{1}_{\left\{{x_{n}=3}\right\}}}\right)
\displaystyle\geq 12xn𝒳μ(xn)𝔼[(pp^(xn))2|X=xn](𝟏{xn=2}+𝟏{xn=3})\displaystyle~{}\frac{1}{2}\sum_{x^{n}\in{\mathcal{X}}}\mu(x^{n})\mathbb{E}[(p-\widehat{p}(x^{n}))^{2}|X=x^{n}]({\mathbf{1}_{\left\{{x_{n}=2}\right\}}}+{\mathbf{1}_{\left\{{x_{n}=3}\right\}}})
=\displaystyle= 12xn𝒳μ(xn)𝔼[(pp^(xn))2|X=xn]\displaystyle~{}\frac{1}{2}\sum_{x^{n}\in{\mathcal{X}}}\mu(x^{n})\mathbb{E}[(p-\widehat{p}(x^{n}))^{2}|X=x^{n}]
=\displaystyle= 12𝔼[(pp^(X))2𝟏{X𝒳}]=(41)Θ(lognn).\displaystyle~{}\frac{1}{2}\mathbb{E}[(p-\widehat{p}(X))^{2}{\mathbf{1}_{\left\{{X\in{\mathcal{X}}}\right\}}}]\overset{(\ref{eq:pbayes})}{=}\Theta\left(\frac{\log n}{n}\right).

3.2 kk-state chains

The lower bound construction for 33-state chains in Section 3.1 can be generalized to kk-state chains. The high-level argument is again to augment a (k1)(k-1)-state chain into a kk-state chain. Specifically, we partition the state space [k][k] into two sets 𝒮1={1}{\mathcal{S}}_{1}=\{1\} and 𝒮2={2,3,,k}{\mathcal{S}}_{2}=\{2,3,\cdots,k\}. Consider a kk-state Markov chain such that the transition probabilities from 𝒮1{\mathcal{S}}_{1} to 𝒮2{\mathcal{S}}_{2}, and from 𝒮2{\mathcal{S}}_{2} to 𝒮1{\mathcal{S}}_{1}, are both very small (on the order of Θ(1/n)\Theta(1/n)). At state 11, the chain either stays at 11 with probability 11/n1-1/n or moves to one of the states in 𝒮2{\mathcal{S}}_{2} with equal probability 1n(k1)\frac{1}{n(k-1)}; at each state in 𝒮2{\mathcal{S}}_{2}, the chain moves to 11 with probability 1n\frac{1}{n}; otherwise, within the state subspace 𝒮2{\mathcal{S}}_{2}, the chain evolves according to some symmetric transition matrix TT. (See Fig. 2 in Section 3.2.1 for the precise transition diagram.)

The key feature of such a chain is as follows. Let 𝒳t{\mathcal{X}}_{t} be the event that X1,X2,,Xt𝒮1X_{1},X_{2},\cdots,X_{t}\in{\mathcal{S}}_{1} and Xt+1,,Xn𝒮2X_{t+1},\cdots,X_{n}\in{\mathcal{S}}_{2}. For each t[n1]t\in[n-1], one can show that (𝒳t)c/n\mathbb{P}({\mathcal{X}}_{t})\geq c/n for some absolute constant c>0c>0. Moreover, conditioned on the event 𝒳t{\mathcal{X}}_{t}, (Xt+1,,Xn)(X_{t+1},\ldots,X_{n}) is equal in law to a stationary Markov chain (Y1,,Ynt)(Y_{1},\cdots,Y_{n-t}) on state space 𝒮2{\mathcal{S}}_{2} with symmetric transition matrix TT. It is not hard to show that estimating MM and TT are nearly equivalent. Consider the Bayesian setting where TT is drawn from some prior. We have

infM^𝔼T[𝔼[D(M(|Xn)M^(|Xn))|𝒳t]]infT^𝔼T[𝔼[D(T(|Ynt)T^(|Ynt))]]=I(T;Ynt+1|Ynt),\displaystyle\inf_{\widehat{M}}\mathbb{E}_{T}\left[\mathbb{E}[D(M(\cdot|X_{n})\|\widehat{M}(\cdot|X_{n}))|{\mathcal{X}}_{t}]\right]\approx\inf_{\widehat{T}}\mathbb{E}_{T}\left[\mathbb{E}[D(T(\cdot|Y_{n-t})\|\widehat{T}(\cdot|Y_{n-t}))]\right]=I(T;Y_{n-t+1}|Y^{n-t}),

where the last equality follows from the representation (20) of Bayes prediction risk as conditional mutual information. Lower bounding the minimax risk by the Bayes risk, we have

𝖱𝗂𝗌𝗄k,n\displaystyle\mathsf{Risk}_{k,n} infM^𝔼T[𝔼[D(M(|Xn)M^(|Xn))]]\displaystyle\geq\inf_{\widehat{M}}\mathbb{E}_{T}\left[\mathbb{E}[D(M(\cdot|X_{n})\|\widehat{M}(\cdot|X_{n}))]\right]
infM^t=1n1𝔼M[𝔼[D(M(|Xn)M^(|Xn))|𝒳t](𝒳t)]\displaystyle\geq\inf_{\widehat{M}}\sum_{t=1}^{n-1}\mathbb{E}_{M}\left[\mathbb{E}[D(M(\cdot|X_{n})\|\widehat{M}(\cdot|X_{n}))|{\mathcal{X}}_{t}]\cdot\mathbb{P}({\mathcal{X}}_{t})\right]
cnt=1n1infM^𝔼M[𝔼[D(M(|Xn)M^(|Xn))|𝒳t]]\displaystyle\geq\frac{c}{n}\cdot\sum_{t=1}^{n-1}\inf_{\widehat{M}}\mathbb{E}_{M}\left[\mathbb{E}[D(M(\cdot|X_{n})\|\widehat{M}(\cdot|X_{n}))|{\mathcal{X}}_{t}]\right]
cnt=1n1I(T;Ynt+1|Ynt)=cn(I(T;Yn)I(T;Y1)).\displaystyle\approx\frac{c}{n}\cdot\sum_{t=1}^{n-1}I(T;Y_{n-t+1}|Y^{n-t})=\frac{c}{n}\cdot(I(T;Y^{n})-I(T;Y_{1})). (43)

Note that I(T;Y_{1})\leq H(Y_{1})\leq\log(k-1) since Y_{1} takes values in {\mathcal{S}}_{2}. Maximizing the right-hand side over the prior P_{T} and recalling the dual representation for redundancy in (17), the above inequality (3.2) leads to a risk lower bound of \mathsf{Risk}_{k,n}\gtrsim\frac{1}{n}(\mathsf{Red}_{k-1,n}^{\sf sym}-\log k), where \mathsf{Red}_{k-1,n}^{\sf sym}=\sup_{P_{T}}I(T;Y^{n}) is the redundancy for symmetric Markov chains with k-1 states and sample size n. Since symmetric transition matrices still have \Theta(k^{2}) degrees of freedom, it is expected that \mathsf{Red}_{k,n}^{\sf sym}\asymp k^{2}\log\frac{n}{k^{2}} for n\gtrsim k^{2}, so that (3.2) yields the desired lower bound \mathsf{Risk}_{k,n}=\Omega(\frac{k^{2}}{n}\log\frac{n}{k^{2}}) in Theorem 1.

Next we rigorously carry out the lower bound proof sketched above: In Section 3.2.1, we explicitly construct the kk-state chain which satisfies the desired properties in Section 3.2. In Section 3.2.2, we make the steps in (3.2) precise and bound the Bayes risk from below by an appropriate mutual information. In Section 3.2.3, we choose a prior distribution on the transition probabilities and prove a lower bound on the resulting mutual information, thereby completing the proof of Theorem 1, with the added bonus that the construction is restricted to irreducible and reversible chains.

3.2.1 Construction of the kk-state chain

We construct a kk-state chain with the following transition probability matrix:

M=[11n1n(k1)1n(k1)1n(k1)1/n1/n1/n(11n)T],\displaystyle M=\left[\begin{matrix}1-\frac{1}{n}&\begin{matrix}\frac{1}{n(k-1)}&\frac{1}{n(k-1)}&\cdots&\frac{1}{n(k-1)}\end{matrix}\\ \begin{matrix}1/n\\ 1/n\\ \vdots\\ 1/n\end{matrix}&\mbox{\LARGE$\left(1-\frac{1}{n}\right)T$}\end{matrix}\right], (44)

where T𝒮2×𝒮2T\in\mathbb{R}^{{\mathcal{S}}_{2}\times{\mathcal{S}}_{2}} is a symmetric stochastic matrix to be chosen later. The transition diagram of MM is shown in Figure 2. One can also verify that the spectral gap of MM is Θ(1n)\Theta(\frac{1}{n}).
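For concreteness, the following minimal Python sketch (ours, not part of the paper; the uniform choice of T is only for illustration) assembles M from (44) and numerically checks the stationary distribution stated in (P1) below as well as the \Theta(\frac{1}{n}) spectral gap claim.

```python
import numpy as np

def build_M(n, T):
    """Assemble the k-state transition matrix of (44) from a symmetric
    stochastic matrix T on the (k-1) states of S2 (a sketch, not the paper's code)."""
    k = T.shape[0] + 1
    M = np.zeros((k, k))
    M[0, 0] = 1 - 1 / n
    M[0, 1:] = 1 / (n * (k - 1))        # state 1 -> S2, uniformly
    M[1:, 0] = 1 / n                    # S2 -> state 1
    M[1:, 1:] = (1 - 1 / n) * T         # within S2, a scaled copy of T
    return M

n, k = 1000, 6
T = np.full((k - 1, k - 1), 1.0 / (k - 1))   # simplest symmetric stochastic choice
M = build_M(n, T)
pi = np.concatenate(([0.5], np.full(k - 1, 0.5 / (k - 1))))
assert np.allclose(M.sum(axis=1), 1) and np.allclose(pi @ M, pi)  # stationarity, cf. (P1)
gap = 1 - np.sort(np.abs(np.linalg.eigvals(M)))[-2]
print(gap, 2 / n)   # absolute spectral gap is of order 1/n (here exactly 2/n)
```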

[Figure 2 diagram: state 1 ({\mathcal{S}}_{1}) stays put with probability 1-\frac{1}{n} and moves to each state of {\mathcal{S}}_{2} with probability \frac{1}{n(k-1)}; each state in {\mathcal{S}}_{2} moves to 1 with probability \frac{1}{n} and moves within {\mathcal{S}}_{2} according to (1-\frac{1}{n})T.]
Figure 2: Lower bound construction for kk-state chains. Solid arrows represent transitions within 𝒮1{\mathcal{S}}_{1} and 𝒮2{\mathcal{S}}_{2}, and dashed arrows represent transitions between 𝒮1{\mathcal{S}}_{1} and 𝒮2{\mathcal{S}}_{2}. The double-headed arrows denote transitions in both directions with equal probabilities.

Let (X1,,Xn)(X_{1},\ldots,X_{n}) be the trajectory of a stationary Markov chain with transition matrix MM. We observe the following properties:

  1. (P1)

    This Markov chain is irreducible and reversible, with stationary distribution (12,12(k1),,12(k1))(\frac{1}{2},\frac{1}{2(k-1)},\cdots,\frac{1}{2(k-1)});

  2. (P2)

    For t\in[n-1], let {\mathcal{X}}_{t} denote the collection of trajectories x^{n} such that x_{1},x_{2},\cdots,x_{t}\in{\mathcal{S}}_{1} and x_{t+1},\cdots,x_{n}\in{\mathcal{S}}_{2}. Then

    (Xn𝒳t)\displaystyle\mathbb{P}(X^{n}\in{\mathcal{X}}_{t}) =(X1==Xt=1)(Xt+11|Xt=1)s=t+1n1(Xs+11|Xs1)\displaystyle=\mathbb{P}(X_{1}=\cdots=X_{t}=1)\cdot\mathbb{P}(X_{t+1}\neq 1|X_{t}=1)\cdot\prod_{s=t+1}^{n-1}\mathbb{P}(X_{s+1}\neq 1|X_{s}\neq 1)
    =12(11n)t11n(11n)n1t12en.\displaystyle=\frac{1}{2}\cdot\left(1-\frac{1}{n}\right)^{t-1}\cdot\frac{1}{n}\cdot\left(1-\frac{1}{n}\right)^{n-1-t}\geq\frac{1}{2en}. (45)

    Moreover, this probability does not depend on the choice of T;

  3. (P3)

    Conditioned on the event that Xn𝒳tX^{n}\in{\mathcal{X}}_{t}, the trajectory (Xt+1,,Xn)(X_{t+1},\cdots,X_{n}) has the same distribution as a length-(nt)(n-t) trajectory of a stationary Markov chain with state space 𝒮2={2,3,,k}{\mathcal{S}}_{2}=\{2,3,\cdots,k\} and transition probability TT, and the uniform initial distribution. Indeed,

    [Xt+1=xt+1,,Xn=xn|Xn𝒳t]=\displaystyle\mathbb{P}\left[X_{t+1}=x_{t+1},\ldots,X_{n}=x_{n}|X^{n}\in{\mathcal{X}}_{t}\right]= 12(11n)t11n(k1)s=t+1n1M(xs+1|xs)12(11n)t11n(11n)n1t\displaystyle~{}\frac{\frac{1}{2}\cdot\left(1-\frac{1}{n}\right)^{t-1}\cdot\frac{1}{n(k-1)}\prod_{s=t+1}^{n-1}M(x_{s+1}|x_{s})}{\frac{1}{2}\cdot\left(1-\frac{1}{n}\right)^{t-1}\cdot\frac{1}{n}\cdot\left(1-\frac{1}{n}\right)^{n-1-t}}
    =\displaystyle= 1k1s=t+1n1T(xs+1|xs).\displaystyle~{}\frac{1}{k-1}\prod_{s=t+1}^{n-1}T(x_{s+1}|x_{s}).

3.2.2 Reducing the Bayes prediction risk to redundancy

Let k1𝗌𝗒𝗆{\mathcal{M}}_{k-1}^{\mathsf{sym}} be the collection of all symmetric transition matrices on state space 𝒮2={2,,k}{\mathcal{S}}_{2}=\{2,\ldots,k\}. Consider a Bayesian setting where the transition matrix MM is constructed in (44) and the submatrix TT is drawn from an arbitrary prior on k1𝗌𝗒𝗆{\mathcal{M}}_{k-1}^{\mathsf{sym}}. The following lemma lower bounds the Bayes prediction risk.

Lemma 10.

Conditioned on TT, let Yn=(Y1,,Yn)Y^{n}=(Y_{1},\ldots,Y_{n}) denote a stationary Markov chain on state space 𝒮2{\mathcal{S}}_{2} with transition matrix TT and uniform initial distribution. Then

infM^𝔼T[𝔼[D(M(|Xn)M^(|Xn))]]n12en2(I(T;Yn)log(k1)).\displaystyle\inf_{\widehat{M}}\mathbb{E}_{T}\left[\mathbb{E}[D(M(\cdot|X_{n})\|\widehat{M}(\cdot|X_{n}))]\right]\geq\frac{n-1}{2en^{2}}\left(I(T;Y^{n})-\log(k-1)\right).

Lemma 10 is the formal statement of the inequality (3.2) presented in the proof sketch. Maximizing the lower bound over the prior on TT and in view of the mutual information representation (17), we obtain the following corollary.

Corollary 11.

Let \mathsf{Risk}_{k,n}^{\mathsf{rev}} denote the minimax prediction risk for stationary irreducible and reversible Markov chains on k states and \mathsf{Red}_{k,n}^{\mathsf{sym}} the redundancy for stationary symmetric Markov chains on k states. Then

𝖱𝗂𝗌𝗄k,n𝗋𝖾𝗏n12en2(𝖱𝖾𝖽k1,n𝗌𝗒𝗆log(k1)).\displaystyle\mathsf{Risk}_{k,n}^{\sf rev}\geq\frac{n-1}{2en^{2}}(\mathsf{Red}_{k-1,n}^{\mathsf{sym}}-\log(k-1)).

We make use of the properties (P1)(P3) in Section 3.2.1 to prove Lemma 10.

Proof of Lemma 10.

Recall that in the Bayesian setting, we first draw TT from some prior on k1𝗌𝗒𝗆{\mathcal{M}}_{k-1}^{\mathsf{sym}}, then generate the stationary Markov chain Xn=(X1,,Xn)X^{n}=(X_{1},\ldots,X_{n}) with state space [k][k] and transition matrix MM in (44), and (Y1,,Yn)(Y_{1},\ldots,Y_{n}) with state space 𝒮2={2,,k}{\mathcal{S}}_{2}=\{2,\ldots,k\} and transition matrix TT.

We first relate the Bayes estimators of M and T (given the X and Y chains, respectively). For clarity, we spell out the explicit dependence of the estimators on the input trajectory. For each t\in[n], denote by \widehat{M}_{t}=\widehat{M}_{t}(\cdot|x^{t}) the Bayes estimator of M(\cdot|x_{t}) given X^{t}=x^{t}, and by \widehat{T}_{t}(\cdot|y^{t}) the Bayes estimator of T(\cdot|y_{t}) given Y^{t}=y^{t}. For each t=1,\ldots,n-1 and for each trajectory x^{n}=(1,\ldots,1,x_{t+1},\ldots,x_{n})\in{\mathcal{X}}_{t}, recalling the form (21) of the Bayes estimator, we have, for each j\in{\mathcal{S}}_{2},

M^n(j|xn)=\displaystyle\widehat{M}_{n}(j|x^{n})= [Xn+1=(xn,j)][Xn=xn]\displaystyle~{}\frac{\mathbb{P}\left[X^{n+1}=(x^{n},j)\right]}{\mathbb{P}\left[X^{n}=x^{n}\right]}
=\displaystyle= 𝔼[12M(1|1)t1M(xt+1|1)M(xt+2|xt+1)M(xn|xn1)M(j|xn)]𝔼[12M(1|1)t1M(xt+1|1)M(xt+2|xt+1)M(xn|xn1)]\displaystyle~{}\frac{\mathbb{E}[\frac{1}{2}M(1|1)^{t-1}M(x_{t+1}|1)M(x_{t+2}|x_{t+1})\ldots M(x_{n}|x_{n-1})M(j|x_{n})]}{\mathbb{E}[\frac{1}{2}M(1|1)^{t-1}M(x_{t+1}|1)M(x_{t+2}|x_{t+1})\ldots M(x_{n}|x_{n-1})]}
=\displaystyle= (11n)𝔼[T(xt+2|xt+1)T(xn|xn1)T(j|xn)]𝔼[T(xt+2|xt+1)T(xn|xn1)]\displaystyle~{}\left(1-\frac{1}{n}\right)\frac{\mathbb{E}[T(x_{t+2}|x_{t+1})\ldots T(x_{n}|x_{n-1})T(j|x_{n})]}{\mathbb{E}[T(x_{t+2}|x_{t+1})\ldots T(x_{n}|x_{n-1})]}
=\displaystyle= (11n)T^nt(j|xt+1n),\displaystyle~{}\left(1-\frac{1}{n}\right)\widehat{T}_{n-t}(j|x_{t+1}^{n}),

where we used the stationary distribution of XX in (P1) and the uniformity of the stationary distribution of YY, neither of which depends on TT. Furthermore, by construction in (44), M^n(1|xn)=1n\widehat{M}_{n}(1|x^{n})=\frac{1}{n} is deterministic. In all, we have

M^n(|xn)=1nδ1+(11n)T^nt(|xt+1n),xn𝒳t.\widehat{M}_{n}(\cdot|x^{n})=\frac{1}{n}\delta_{1}+\left(1-\frac{1}{n}\right)\widehat{T}_{n-t}(\cdot|x_{t+1}^{n}),\quad x^{n}\in{\mathcal{X}}_{t}. (46)

with δ1\delta_{1} denoting the point mass at state 1, which parallels the fact that

M(|x)=1nδ1+(11n)T(|x),x𝒮2.M(\cdot|x)=\frac{1}{n}\delta_{1}+\left(1-\frac{1}{n}\right)T(\cdot|x),\quad x\in{\mathcal{S}}_{2}. (47)

By (P2), each event {Xn𝒳t}\{X^{n}\in{\mathcal{X}}_{t}\} occurs with probability at least 1/(2en)1/(2en), and is independent of TT. Therefore,

𝔼T[𝔼[D(M(|Xn)M^(|Xn))]]12ent=1n1𝔼T[𝔼[D(M(|Xn)M^(|Xn))|Xn𝒳t]].\displaystyle\mathbb{E}_{T}\left[\mathbb{E}[D(M(\cdot|X_{n})\|\widehat{M}(\cdot|X^{n}))]\right]\geq\frac{1}{2en}\sum_{t=1}^{n-1}\mathbb{E}_{T}\left[\mathbb{E}[D(M(\cdot|X_{n})\|\widehat{M}(\cdot|X^{n}))|X^{n}\in{\mathcal{X}}_{t}]\right]. (48)

By (P3), the conditional joint law of (T,Xt+1,,Xn)(T,X_{t+1},\ldots,X_{n}) on the event {Xn𝒳t}\{X^{n}\in{\mathcal{X}}_{t}\} is the same as the joint law of (T,Y1,,Ynt)(T,Y_{1},\ldots,Y_{n-t}). Thus, we may express the Bayes prediction risk in the XX chain as

𝔼T[𝔼[D(M(|Xn)M^(|Xn))|Xn𝒳t]]\displaystyle\mathbb{E}_{T}\left[\mathbb{E}[D(M(\cdot|X_{n})\|\widehat{M}(\cdot|X^{n}))|X^{n}\in{\mathcal{X}}_{t}]\right] =(a)(11n)𝔼T[𝔼[D(T(|Ynt)T^(|Ynt))]]\displaystyle\overset{\rm(a)}{=}\left(1-\frac{1}{n}\right)\cdot\mathbb{E}_{T}\left[\mathbb{E}[D(T(\cdot|Y_{n-t})\|\widehat{T}(\cdot|Y^{n-t}))]\right]
=(b)(11n)I(T;Ynt+1|Ynt),\displaystyle\overset{\rm(b)}{=}\left(1-\frac{1}{n}\right)\cdot I(T;Y_{n-t+1}|Y^{n-t}), (49)

where (a) follows from (46), (47), and the fact that for distributions P,QP,Q supported on 𝒮2{\mathcal{S}}_{2}, D(ϵδ1+(1ϵ)Pϵδ1+(1ϵ)Q)=(1ϵ)D(PQ)D(\epsilon\delta_{1}+(1-\epsilon)P\|\epsilon\delta_{1}+(1-\epsilon)Q)=(1-\epsilon)D(P\|Q); (b) is the mutual information representation (20) of the Bayes prediction risk. Finally, the lemma follows from (48), (3.2.2), and the chain rule

t=1n1I(T;Ynt+1|Ynt)=I(T;Yn)I(T;Y1)I(T;Yn)log(k1),\displaystyle\sum_{t=1}^{n-1}I(T;Y_{n-t+1}|Y^{n-t})=I(T;Y^{n})-I(T;Y_{1})\geq I(T;Y^{n})-\log(k-1),

as I(T;Y1)H(Y1)log(k1)I(T;Y_{1})\leq H(Y_{1})\leq\log(k-1). ∎
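Step (a) in the preceding display used the identity D(\epsilon\delta_{1}+(1-\epsilon)P\|\epsilon\delta_{1}+(1-\epsilon)Q)=(1-\epsilon)D(P\|Q) for P,Q supported off state 1; a quick numerical sanity check (our own sketch, not part of the argument):

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

rng = np.random.default_rng(2)
eps = 1 / 50
P, Q = rng.dirichlet(np.ones(4)), rng.dirichlet(np.ones(4))   # supported on S2 only
mixP = np.concatenate(([eps], (1 - eps) * P))                 # eps*delta_1 + (1-eps)*P
mixQ = np.concatenate(([eps], (1 - eps) * Q))
print(kl(mixP, mixQ), (1 - eps) * kl(P, Q))                   # the two values agree
```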

3.2.3 Prior construction and lower bounding the mutual information

In view of Lemma 10, it remains to find a prior on {\mathcal{M}}_{k-1}^{\mathsf{sym}} for T such that the mutual information I(T;Y^{n}) is large. We make use of the connection identified in [DMPW81, Dav83, Ris84] between estimation error and mutual information (see also [CS04, Theorem 7.1] for a self-contained exposition). To lower bound the mutual information, a key step is to find a good estimator \widehat{T}(Y^{n}) of T. This is carried out in the following lemma.

Lemma 12.

In the setting of Lemma 10, suppose that Tk𝗌𝗒𝗆T\in{\mathcal{M}}_{k}^{\mathsf{sym}} with Tij[12k,32k]T_{ij}\in[\frac{1}{2k},\frac{3}{2k}] for all i,j[k]i,j\in[k]. Then there is an estimator T^\widehat{T} based on YnY^{n} such that

𝔼[T^T𝖥2]16k2n1,\displaystyle\mathbb{E}[\|\widehat{T}-T\|_{\mathsf{F}}^{2}]\leq\frac{16k^{2}}{n-1},

where T^T𝖥=ij(T^ijTij)2\|\widehat{T}-T\|_{\mathsf{F}}=\sqrt{\sum_{ij}(\widehat{T}_{ij}-T_{ij})^{2}} denotes the Frobenius norm.

We show how Lemma 12 leads to the desired lower bound on the mutual information I(T;Yn)I(T;Y^{n}). Since k3k\geq 3, we may assume that k1=2k0k-1=2k_{0} is an even integer. Consider the following prior distribution π\pi on TT: let u=(ui,j)i,j[k0],iju=(u_{i,j})_{i,j\in[k_{0}],i\leq j} be iid and uniformly distributed in [1/(4k0),3/(4k0)][1/(4k_{0}),3/(4k_{0})], and ui,j=uj,iu_{i,j}=u_{j,i} for i>ji>j. Let the transition matrix TT be given by

\displaystyle T_{2i-1,2j-1}=T_{2i,2j}=u_{i,j},\quad T_{2i-1,2j}=T_{2i,2j-1}=\frac{1}{k_{0}}-u_{i,j},\quad\forall i,j\in[k_{0}]. (50)

It is easy to verify that TT is symmetric and a stochastic matrix, and each entry of TT is supported in the interval [1/(4k0),3/(4k0)][1/(4k_{0}),3/(4k_{0})]. Since 2k0=k12k_{0}=k-1, the condition of Lemma 12 is fulfilled, so there exist estimators T^(Yn)\widehat{T}(Y^{n}) and u^(Yn)\widehat{u}(Y^{n}) such that

𝔼[u^(Yn)u22]𝔼[T^(Yn)T𝖥2]64k02n1.\displaystyle\mathbb{E}[\|\widehat{u}(Y^{n})-u\|_{2}^{2}]\leq\mathbb{E}[\|\widehat{T}(Y^{n})-T\|_{\mathsf{F}}^{2}]\leq\frac{64k_{0}^{2}}{n-1}. (51)

Here and below, we identify uu and u^\widehat{u} as k0(k0+1)2\frac{k_{0}(k_{0}+1)}{2}-dimensional vectors.
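A short Python sketch (ours; the value of k_0 and the seed are arbitrary) of the prior construction (50): it draws the independent uniforms u, assembles T, and checks that T is a symmetric stochastic matrix with entries in [1/(4k_{0}),3/(4k_{0})].

```python
import numpy as np

rng = np.random.default_rng(0)
k0 = 4                                    # k - 1 = 2*k0 states in S2
u = np.zeros((k0, k0))
iu = np.triu_indices(k0)                  # entries with i <= j
u[iu] = rng.uniform(1 / (4 * k0), 3 / (4 * k0), size=len(iu[0]))
u = u + np.triu(u, 1).T                   # symmetrize: u[j, i] = u[i, j]

T = np.zeros((2 * k0, 2 * k0))
for i in range(k0):
    for j in range(k0):
        T[2 * i, 2 * j] = T[2 * i + 1, 2 * j + 1] = u[i, j]            # cf. (50)
        T[2 * i, 2 * j + 1] = T[2 * i + 1, 2 * j] = 1 / k0 - u[i, j]

assert np.allclose(T, T.T) and np.allclose(T.sum(axis=1), 1)
assert (T >= 1 / (4 * k0) - 1e-12).all() and (T <= 3 / (4 * k0) + 1e-12).all()
```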

Let h(X)=fX(x)logfX(x)dxh(X)=\int-f_{X}(x)\log f_{X}(x)dx denote the differential entropy of a continuous random vector XX with density fXf_{X} w.r.t the Lebesgue measure and h(X|Y)=fXY(xy)logfX|Y(x|y)dxdyh(X|Y)=\int-f_{XY}(xy)\log f_{X|Y}(x|y)dxdy the conditional differential entropy (cf. e.g. [CT06]). Then

h(u)=i,j[k0],ijh(ui,j)=k0(k0+1)2log(2k0).\displaystyle h(u)=\sum_{i,j\in[k_{0}],i\leq j}h(u_{i,j})=-\frac{k_{0}(k_{0}+1)}{2}\log(2k_{0}). (52)

Then

I(T;Yn)\displaystyle I(T;Y^{n}) =(a)I(u;Yn)\displaystyle\overset{\rm(a)}{=}I(u;Y^{n})
(b)I(u;u^(Yn))=h(u)h(u|u^(Yn))\displaystyle\overset{\rm(b)}{\geq}I(u;\widehat{u}(Y^{n}))=h(u)-h(u|\widehat{u}(Y^{n}))
(c)h(u)h(uu^(Yn))\displaystyle\overset{\rm(c)}{\geq}h(u)-h(u-\widehat{u}(Y^{n}))
(d)k0(k0+1)4log(n11024πek02)k216log(n1256πek2).\displaystyle\overset{\rm(d)}{\geq}\frac{k_{0}(k_{0}+1)}{4}\log\left(\frac{n-1}{1024\pi ek_{0}^{2}}\right)\geq\frac{k^{2}}{16}\log\left(\frac{n-1}{256\pi ek^{2}}\right).

where (a) is because u and T are in one-to-one correspondence by (50); (b) follows from the data processing inequality; (c) holds since h(u|\widehat{u}(Y^{n}))=h(u-\widehat{u}(Y^{n})|\widehat{u}(Y^{n}))\leq h(u-\widehat{u}(Y^{n})), by translation invariance of differential entropy and the fact that conditioning reduces entropy; (d) follows from the maximum entropy principle [CT06]: h(u-\widehat{u}(Y^{n}))\leq\frac{k_{0}(k_{0}+1)}{4}\log\left(\frac{2\pi e}{k_{0}(k_{0}+1)/2}\cdot\mathbb{E}[\|\widehat{u}(Y^{n})-u\|_{2}^{2}]\right), which in turn is bounded by (51). Plugging this lower bound into Lemma 10 completes the lower bound proof of Theorem 1.

Proof of Lemma 12.

Since TT is symmetric, the stationary distribution is uniform, and there is a one-to-one correspondence between the joint distribution of (Y1,Y2)(Y_{1},Y_{2}) and the transition probabilities. Motivated by this observation, consider the following estimator T^\widehat{T}: for i,j[k]i,j\in[k], let

\displaystyle\widehat{T}_{ij}=k\cdot\frac{\sum_{t=1}^{n-1}{\mathbf{1}_{\left\{{Y_{t}=i,Y_{t+1}=j}\right\}}}}{n-1}.

Clearly 𝔼[T^ij]=k(Y1=i,Y2=j)=Tij\mathbb{E}[\widehat{T}_{ij}]=k\cdot\mathbb{P}(Y_{1}=i,Y_{2}=j)=T_{ij}. The following variance bound is shown in [TJW18, Lemma 7, Lemma 8] using the concentration inequality of [Pau15]:

Var(T^ij)k28Tijk1γ(T)(n1),\displaystyle\mathrm{Var}(\widehat{T}_{ij})\leq k^{2}\cdot\frac{8T_{ij}k^{-1}}{\gamma_{*}(T)(n-1)},

where \gamma_{*}(T) is the absolute spectral gap of T defined in (8). Note that T=k^{-1}\mathbf{J}+\Delta, where \mathbf{J} is the all-one matrix and each entry of \Delta lies in [-1/(2k),1/(2k)]. Hence the spectral radius of \Delta is at most 1/2, and therefore \gamma_{*}(T)\geq 1/2. Consequently, we have

𝔼[T^T𝖥2]=i,j[k]Var(T^ij)i,j[k]16kTijn1=16k2n1,\displaystyle\mathbb{E}[\|\widehat{T}-T\|_{\mathsf{F}}^{2}]=\sum_{i,j\in[k]}\mathrm{Var}(\widehat{T}_{ij})\leq\sum_{i,j\in[k]}\frac{16kT_{ij}}{n-1}=\frac{16k^{2}}{n-1},

completing the proof. ∎
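The estimator in the proof of Lemma 12 is simply a rescaled count of consecutive pairs. The following simulation sketch (ours; it uses the uniform matrix, the simplest member of the class in Lemma 12, and a single trajectory rather than an expectation) illustrates the bound \mathbb{E}[\|\widehat{T}-T\|_{\mathsf{F}}^{2}]\leq 16k^{2}/(n-1).

```python
import numpy as np

rng = np.random.default_rng(1)
k, n = 6, 5000
T = np.full((k, k), 1.0 / k)       # symmetric stochastic, entries in [1/(2k), 3/(2k)]

# stationary chain: uniform initial state, transitions according to T
y = np.empty(n, dtype=int)
y[0] = rng.integers(k)
for t in range(1, n):
    y[t] = rng.choice(k, p=T[y[t - 1]])

# estimator from the proof of Lemma 12: rescaled pair counts
T_hat = np.zeros((k, k))
np.add.at(T_hat, (y[:-1], y[1:]), 1.0)
T_hat *= k / (n - 1)

err = ((T_hat - T) ** 2).sum()
print(err, 16 * k ** 2 / (n - 1))  # squared Frobenius error vs. the bound of Lemma 12
```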

4 Spectral gap-dependent risk bounds

4.1 Two states

To show Theorem 2, let us prove a refined version. In addition to the absolute spectral gap defined in (8), define the spectral gap

γ1λ2\gamma\triangleq 1-\lambda_{2} (53)

and {\mathcal{M}}_{k}^{\prime}(\gamma_{0}) the collection of transition matrices whose spectral gap exceeds \gamma_{0}. Paralleling \mathsf{Risk}_{k,n}(\gamma_{0}) defined in (9), define \mathsf{Risk}_{k,n}^{\prime}(\gamma_{0}) as the minimax prediction risk restricted to M\in{\mathcal{M}}_{k}^{\prime}(\gamma_{0}). Since \gamma\geq\gamma_{*}, we have {\mathcal{M}}_{k}(\gamma_{0})\subseteq{\mathcal{M}}_{k}^{\prime}(\gamma_{0}) and hence \mathsf{Risk}_{k,n}^{\prime}(\gamma_{0})\geq\mathsf{Risk}_{k,n}(\gamma_{0}). Nevertheless, the next result shows that for k=2 they have the same rate:

Theorem 13 (Spectral gap dependent rates for binary chain).

For any γ0(0,1)\gamma_{0}\in(0,1)

𝖱𝗂𝗌𝗄2,n(γ0)𝖱𝗂𝗌𝗄2,n(γ0)1nmax{1,loglog(min{n,1γ0})}.\mathsf{Risk}_{2,n}(\gamma_{0})\asymp\mathsf{Risk}_{2,n}^{\prime}(\gamma_{0})\asymp\frac{1}{n}\max\left\{1,\log\log\left(\min\left\{n,\frac{1}{\gamma_{0}}\right\}\right)\right\}.

We first prove the upper bound on 𝖱𝗂𝗌𝗄2,n\mathsf{Risk}_{2,n}^{\prime}. Note that it is enough to show

𝖱𝗂𝗌𝗄2,n(γ0)loglog(1/γ0)\overn,if n0.9γ0ee5.\displaystyle\mathsf{Risk}_{2,n}^{\prime}(\gamma_{0})\lesssim{\log\log\left(1/\gamma_{0}\right)\over n},\quad\text{if }n^{-0.9}\leq\gamma_{0}\leq e^{-e^{5}}. (54)

Indeed, for any γ0n0.9\gamma_{0}\leq n^{-0.9}, the upper bound O(loglogn/n)O\left(\log\log n/n\right) proven in [FOPS16], which does not depend on the spectral gap, suffices; for any γ0>ee5\gamma_{0}>e^{-e^{5}}, by monotonicity we can use the upper bound 𝖱𝗂𝗌𝗄2,n(ee5)\mathsf{Risk}_{2,n}^{\prime}(e^{-e^{5}}).

We now define an estimator that achieves (54). Following [FOPS16], consider trajectories with a single transition, namely, {2n1,1n2:1n1}\left\{2^{n-\ell}1^{\ell},1^{n-\ell}2^{\ell}:1\leq\ell\leq n-1\right\}, where 2n12^{n-\ell}1^{\ell} denotes the trajectory (x1,,xn)(x_{1},\cdots,x_{n}) with x1==xn=2x_{1}=\cdots=x_{n-\ell}=2 and xn+1==xn=1x_{n-\ell+1}=\cdots=x_{n}=1. We refer to this type of xnx^{n} as step sequences. For all non-step sequences xnx^{n}, we apply the add-12\frac{1}{2} estimator similar to (5), namely

M^xn(j|i)=Nij+12Ni+1,i,j{1,2},\displaystyle\widehat{M}_{x^{n}}(j|i)=\frac{N_{ij}+\frac{1}{2}}{N_{i}+1},\qquad i,j\in\{1,2\},

where the empirical counts NiN_{i} and NijN_{ij} are defined in (4); for step sequences of the form 2n12^{n-\ell}1^{\ell}, we estimate by

M^(2|1)=1/(log(1/γ0)),M^(1|1)=1M^(2|1).\displaystyle{\widehat{M}_{\ell}(2|1)}={1/(\ell\log(1/\gamma_{0}))},\quad{\widehat{M}_{\ell}(1|1)}=1-{\widehat{M}_{\ell}(2|1)}. (55)

The other type of step sequences, 1^{n-\ell}2^{\ell}, is dealt with by symmetry.
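A minimal Python sketch (ours) of the resulting hybrid estimator for trajectories ending in state 1; the value of \gamma_{0} below is a toy choice purely for illustration and lies outside the regime of (54).

```python
import numpy as np

def predict_next_given_1(x, gamma0):
    """Return (Mhat(1|1), Mhat(2|1)) for a trajectory x of 1's and 2's ending in 1.
    Sketch of the estimator of Section 4.1 (names are ours)."""
    x = np.asarray(x)
    n = len(x)
    ell = n - np.max(np.where(x == 2)[0]) - 1 if (x == 2).any() else n
    is_step = (x == 2).any() and (x[: n - ell] == 2).all()    # x = 2^{n-ell} 1^{ell}
    if is_step:
        p2 = 1.0 / (ell * np.log(1.0 / gamma0))               # estimator (55)
    else:
        n1 = np.sum(x[:-1] == 1)                              # visits to 1 before time n
        n12 = np.sum((x[:-1] == 1) & (x[1:] == 2))            # transitions 1 -> 2
        p2 = (n12 + 0.5) / (n1 + 1)                           # add-1/2 estimator
    return 1.0 - p2, p2

print(predict_next_given_1([2, 2, 2, 1, 1], gamma0=1e-3))     # step sequence, ell = 2
print(predict_next_given_1([1, 2, 1, 2, 1], gamma0=1e-3))     # non-step sequence
```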

Due to symmetry it suffices to analyze the risk for sequences ending in 1. The risk of the add-\frac{1}{2} estimator for the non-step sequence 1^{n} is bounded as

𝔼[𝟏{Xn=1n}D(M(|1)M^1n(|1))]\displaystyle\mathbb{E}\left[{\mathbf{1}_{\left\{{X^{n}=1^{n}}\right\}}}D({M(\cdot|1)}\|{\widehat{M}_{1^{n}}(\cdot|1)})\right] =PXn(1n){M(2|1)log(M(2|1)1/(2n))+M(1|1)log(M(1|1)\over(n12)/n)}\displaystyle=P_{X^{n}}(1^{n})\left\{M(2|1)\log\left(\frac{M(2|1)}{1/(2n)}\right)+M(1|1)\log\left(M(1|1)\over(n-\frac{1}{2})/n\right)\right\}
(1M(2|1))n1{2M(2|1)2n+log(n\overn12)}1n.\displaystyle\leq(1-M(2|1))^{n-1}\left\{2M(2|1)^{2}n+\log\left({n\over n-\frac{1}{2}}\right)\right\}\lesssim\frac{1}{n}.

where the last step follows from (1-x)^{n-1}x^{2}\leq n^{-2} applied with x=M(2|1), together with \log x\leq x-1. From [FOPS16, Lemmas 7 and 8], the total risk of the other non-step sequences is bounded from above by O\left(\frac{1}{n}\right), and hence it is enough to analyze the risk for step sequences, and further by symmetry, those in \left\{2^{n-\ell}1^{\ell}:1\leq\ell\leq n-1\right\}. The desired upper bound (54) then follows from Lemma 14 below.

Lemma 14.

For any n0.9γ0ee5n^{-0.9}\leq\gamma_{0}\leq e^{-e^{5}}, M^(|1)\widehat{M}_{\ell}(\cdot|1) in (55) satisfies

supM2(γ0)=1n1𝔼[𝟏{Xn=2n1}D(M(|1)M^(|1))]loglog(1/γ0)\overn.\sup_{M\in{\mathcal{M}}^{\prime}_{2}(\gamma_{0})}\sum_{\ell=1}^{n-1}\mathbb{E}\left[{\mathbf{1}_{\left\{{X^{n}=2^{n-\ell}1^{\ell}}\right\}}}D({M(\cdot|1)}\|{\widehat{M}_{\ell}(\cdot|1)})\right]\lesssim{\log\log(1/\gamma_{0})\over n}.
Proof.

For each \ell, using \log\left(\frac{1}{1-x}\right)\leq 2x for x\leq\frac{1}{2} with x=\frac{1}{\ell\log(1/\gamma_{0})}, we have

D(M(|1)M^(|1))\displaystyle D({M(\cdot|1)}\|{\widehat{M}_{\ell}(\cdot|1)}) =M(1|1)log(M(1|1)\over11log(1/γ0))+M(2|1)log(M(2|1)log(1/γ0))\displaystyle={M(1|1)\log\left(M(1|1)\over 1-\frac{1}{\ell{\log(1/\gamma_{0})}}\right)+{M(2|1)}\log\left({M(2|1)}\ell{\log(1/\gamma_{0})}\right)}
1\overlog(1/γ0)+M(2|1)log(M(2|1))+M(2|1)loglog(1/γ0)\displaystyle\lesssim{1\over\ell{\log(1/\gamma_{0})}}+{M(2|1)}\log(M(2|1)\ell)+{M(2|1)}\log{{\log(1/\gamma_{0})}}
1\overlog(1/γ0)+M(2|1)log+(M(2|1))+M(2|1)loglog(1/γ0),\displaystyle\leq{1\over\ell{\log(1/\gamma_{0})}}+M(2|1)\log_{+}(M(2|1)\ell)+M(2|1){\log\log(1/\gamma_{0})}, (56)

where we define log+(x)=max{1,logx}\log_{+}(x)=\max\{1,\log x\}. Recall the following Chebyshev’s sum inequality: for a1a2ana_{1}\leq a_{2}\leq\cdots\leq a_{n} and b1b2bnb_{1}\geq b_{2}\geq\cdots\geq b_{n}, it holds that

i=1naibi1n(i=1nai)(i=1nbi).\displaystyle\sum_{i=1}^{n}a_{i}b_{i}\leq\frac{1}{n}\left(\sum_{i=1}^{n}a_{i}\right)\left(\sum_{i=1}^{n}b_{i}\right).

The following inequalities are thus direct corollaries: for x,y[0,1]x,y\in[0,1],

=1n1x(1x)n1y(1y)1\displaystyle\sum_{\ell=1}^{n-1}x(1-x)^{n-\ell-1}y(1-y)^{\ell-1} 1n1(=1n1x(1x)n1)(=1n1y(1y)1)\displaystyle\leq\frac{1}{n-1}\left(\sum_{\ell=1}^{n-1}x(1-x)^{n-\ell-1}\right)\left(\sum_{\ell=1}^{n-1}y(1-y)^{\ell-1}\right)
1n1,\displaystyle\leq\frac{1}{n-1}, (57)
=1n1x(1x)n1y(1y)1log+(y)\displaystyle\sum_{\ell=1}^{n-1}x(1-x)^{n-\ell-1}y(1-y)^{\ell-1}\log_{+}(\ell y) 1n1(=1n1x(1x)n1)(=1n1y(1y)1log+(y))\displaystyle\leq\frac{1}{n-1}\left(\sum_{\ell=1}^{n-1}x(1-x)^{n-\ell-1}\right)\left(\sum_{\ell=1}^{n-1}y(1-y)^{\ell-1}\log_{+}(\ell y)\right)
1n1=1n1y(1y)1(1+y)2n1,\displaystyle\leq\frac{1}{n-1}\sum_{\ell=1}^{n-1}y(1-y)^{\ell-1}(1+\ell y)\leq\frac{2}{n-1}, (58)

where in (58) we need to verify that \ell\mapsto y(1-y)^{\ell-1}\log_{+}(\ell y) is non-increasing. To verify it, we may assume that (\ell+1)y\geq e (otherwise \log_{+}((\ell+1)y)=\log_{+}(\ell y)=1 and the ratio below is at most 1-y\leq 1), and therefore

y(1y)log+((+1)y)y(1y)1log+(y)\displaystyle\frac{y(1-y)^{\ell}\log_{+}((\ell+1)y)}{y(1-y)^{\ell-1}\log_{+}(\ell y)} =(1y)log((+1)y)log+(y)(1e+1)(1+log(1+1/)log+(y))\displaystyle=\frac{(1-y)\log((\ell+1)y)}{\log_{+}(\ell y)}\leq\left(1-\frac{e}{\ell+1}\right)\left(1+\frac{\log(1+1/\ell)}{\log_{+}(\ell y)}\right)
(1e+1)(1+1)<1+1e+1<1.\displaystyle\leq\left(1-\frac{e}{\ell+1}\right)\left(1+\frac{1}{\ell}\right)<1+\frac{1}{\ell}-\frac{e}{\ell+1}<1.

Therefore,

=1n1𝔼[𝟏{Xn=2n1}D(M(|1)M^(|1))]\displaystyle\sum_{\ell=1}^{n-1}\mathbb{E}\left[{\mathbf{1}_{\left\{{X^{n}=2^{n-\ell}1^{\ell}}\right\}}}D(M(\cdot|1)\|\widehat{M}_{\ell}(\cdot|1))\right]
=1n1M(2|2)n1M(1|2)M(1|1)1D(M(|1)M^(|1))\displaystyle\leq\sum_{\ell=1}^{n-1}M(2|2)^{n-\ell-1}M(1|2)M(1|1)^{\ell-1}D(M(\cdot|1)\|\widehat{M}_{\ell}(\cdot|1))
(a)=1n1M(2|2)n1M(1|2)M(1|1)1(1log(1/γ0)+M(2|1)log+(M(2|1))+M(2|1)loglog(1/γ0))\displaystyle\overset{\rm(a)}{\lesssim}\sum_{\ell=1}^{n-1}M(2|2)^{n-\ell-1}M(1|2)M(1|1)^{\ell-1}\left(\frac{1}{\ell\log(1/\gamma_{0})}+M(2|1)\log_{+}(M(2|1)\ell)+M(2|1)\log\log(1/\gamma_{0})\right)
(b)=1n1M(2|2)n1M(1|2)M(1|1)1log(1/γ0)+2+loglog(1/γ0)n1,\displaystyle\overset{\rm(b)}{\leq}\sum_{\ell=1}^{n-1}\frac{M(2|2)^{n-\ell-1}M(1|2)M(1|1)^{\ell-1}}{\ell\log(1/\gamma_{0})}+\frac{2+\log\log(1/\gamma_{0})}{n-1}, (59)

where (a) is due to (56), and (b) follows from (57) and (58) applied with x=M(1|2) and y=M(2|1). To deal with the remaining sum, we distinguish two cases. Sticking to the above definitions of x and y, if y>\gamma_{0}/2, then

\displaystyle\sum_{\ell=1}^{n-1}\frac{x(1-x)^{n-\ell-1}(1-y)^{\ell-1}}{\ell}\leq\frac{1}{n-1}\left(\sum_{\ell=1}^{n-1}x(1-x)^{n-\ell-1}\right)\left(\sum_{\ell=1}^{n-1}\frac{(1-y)^{\ell-1}}{\ell}\right)\lesssim\frac{\log(2/\gamma_{0})}{n-1},

where the last step uses \sum_{\ell=1}^{\infty}t^{\ell-1}/\ell=\frac{1}{t}\log\frac{1}{1-t} for 0<t<1, applied with t=1-y: since y>\gamma_{0}/2, the inner sum is at most \frac{\log(1/y)}{1-y}\lesssim\log(2/\gamma_{0}). If y\leq\gamma_{0}/2, notice that for a two-state chain the spectral gap is given explicitly by \gamma=M(1|2)+M(2|1)=x+y, so that the assumption \gamma\geq\gamma_{0} implies that x\geq\gamma_{0}/2. In this case,

=1n1x(1x)n1(1y)1\displaystyle\sum_{\ell=1}^{n-1}\frac{x(1-x)^{n-\ell-1}(1-y)^{\ell-1}}{\ell} <n/2(1x)n/21+n/2x(1x)n1n/2\displaystyle\leq\sum_{\ell<n/2}(1-x)^{n/2-1}+\sum_{\ell\geq n/2}\frac{x(1-x)^{n-\ell-1}}{n/2}
\displaystyle\leq\frac{n}{2}e^{-(n/2-1)\gamma_{0}/2}+\frac{2}{n}\lesssim\frac{1}{n},

thanks to the assumption γ0n0.9\gamma_{0}\geq n^{-0.9}. Therefore, in both cases, the first term in (59) is O(1/n)O(1/n), as desired. ∎

Next we prove the lower bound on \mathsf{Risk}_{2,n}. It is enough to show that \mathsf{Risk}_{2,n}(\gamma_{0})\gtrsim\frac{1}{n}\log\log\left(1/\gamma_{0}\right) for n^{-0.9}\leq\gamma_{0}\leq e^{-e^{5}}. Indeed, for \gamma_{0}\geq e^{-e^{5}}, we can apply the result in the iid setting (see, e.g., [BFSS02]), in which the absolute spectral gap is 1, to obtain the usual parametric-rate lower bound \Omega\left(\frac{1}{n}\right); for \gamma_{0}<n^{-0.9}, we simply bound \mathsf{Risk}_{2,n}(\gamma_{0}) from below by \mathsf{Risk}_{2,n}(n^{-0.9}), noting that \log\log(\min\{n,1/\gamma_{0}\})\asymp\log\log n in this regime. Define

α=log(1/γ0),β=α\over5logα,\displaystyle\alpha=\log(1/\gamma_{0}),\quad\beta=\left\lceil{\alpha\over 5\log\alpha}\right\rceil, (60)

and consider the prior distribution

=Uniform(),\displaystyle\mathscr{M}=\mathrm{Uniform}({\mathcal{M}}),\quad{\mathcal{M}} ={M:M(1|2)=1n,M(2|1)=1\overαm:m(β,5β)}.\displaystyle=\left\{M:{M(1|2)}=\frac{1}{n},{M(2|1)}={1\over\alpha^{m}}:m\in\mathbb{N}\cap\left(\beta,5\beta\right)\right\}. (61)

Then the lower bound part of Theorem 2 follows from the next lemma.
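To make (60)–(61) concrete, the following sketch (ours; the values of \gamma_{0} and n are illustrative) enumerates the support of the prior and numerically checks the spectral gap property asserted in Lemma 15(i) below.

```python
import numpy as np

gamma0 = np.exp(-150)                      # satisfies gamma0 <= exp(-e^5)
n = 10 ** 6
alpha = np.log(1 / gamma0)                 # cf. (60)
beta = int(np.ceil(alpha / (5 * np.log(alpha))))

support = []
for m in range(beta + 1, 5 * beta):        # m in N ∩ (beta, 5*beta), cf. (61)
    M12, M21 = 1 / n, alpha ** (-float(m))
    support.append((M12, M21))
    gap = M12 + M21          # absolute spectral gap 1-|1-M(2|1)-M(1|2)|, here the sum is < 1
    assert gap > gamma0      # the property claimed in Lemma 15(i)

print(len(support), "matrices in the support of the prior")
```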

Lemma 15.

Assume that n0.9γ0ee5n^{-0.9}\leq\gamma_{0}\leq e^{-e^{5}}. Then

  1. (i)

    γ>γ0\gamma_{*}>\gamma_{0} for each MM\in{\mathcal{M}};

  2. (ii)

    the Bayes risk with respect to the prior \mathscr{M} is at least Ω(loglog(1/γ0)\overn)\Omega\left(\log\log(1/\gamma_{0})\over n\right).

Proof.

Part (i) follows by noting that the absolute spectral gap of any two-state transition matrix M is 1-\left|1-{M(2|1)}-{M(1|2)}\right|, and that for any M\in{\mathcal{M}}, M(2|1)\in\left(\alpha^{-5\beta},\alpha^{-\beta}\right)\subseteq(\gamma_{0},\gamma_{0}^{1/5})\subseteq(\gamma_{0},1/2), which guarantees \gamma_{*}=M(1|2)+M(2|1)>\gamma_{0}.

To show part (ii) we lower bound the Bayes risk when the observed trajectory XnX^{n} is a step sequence in {2n1:1n1}\left\{2^{n-\ell}1^{\ell}:1\leq\ell\leq n-1\right\}. Our argument closely follows that of [HOP18, Theorem 1]. Since γ0n1\gamma_{0}\geq n^{-1}, for each MM\in{\mathcal{M}}, the corresponding stationary distribution π\pi satisfies

π2=M(2|1)M(2|1)+M(1|2)12.\displaystyle\pi_{2}=\frac{M(2|1)}{M(2|1)+M(1|2)}\geq\frac{1}{2}.

Denote by 𝖱𝗂𝗌𝗄()\mathsf{Risk}(\mathscr{M}) the Bayes risk with respect to the prior \mathscr{M} and by M^𝖡(|1){\widehat{M}^{\mathsf{B}}_{\ell}(\cdot|1)} the Bayes estimator for prior \mathscr{M} given Xn=2n1X^{n}=2^{n-\ell}1^{\ell}. Note that

[Xn=2n1]=π2(11n)n11nM(1|1)112enM(1|1)1.\mathbb{P}\left[X^{n}=2^{n-\ell}1^{\ell}\right]=\pi_{2}\left(1-\frac{1}{n}\right)^{n-\ell-1}\frac{1}{n}M(1|1)^{\ell-1}\geq\frac{1}{2en}M(1|1)^{\ell-1}. (62)

Then

𝖱𝗂𝗌𝗄()\displaystyle\mathsf{Risk}(\mathscr{M}) 𝔼M[=1n1𝔼[𝟏{Xn=2n1}D(M(|1)M^𝖡(|1))]]\displaystyle\geq\mathbb{E}_{M\sim\mathscr{M}}\left[\sum_{\ell=1}^{n-1}\mathbb{E}\left[{\mathbf{1}_{\left\{{X^{n}=2^{n-\ell}1^{\ell}}\right\}}}D({M(\cdot|1)}\|{\widehat{M}^{\mathsf{B}}_{\ell}(\cdot|1)})\right]\right]
𝔼M[=1n1M(1|1)1\over2enD(M(|1)M^𝖡(|1))]\displaystyle\geq\mathbb{E}_{M\sim\mathscr{M}}\left[\sum_{\ell=1}^{n-1}{M(1|1)^{\ell-1}\over 2en}D({M(\cdot|1)}\|{\widehat{M}^{\mathsf{B}}_{\ell}(\cdot|1)})\right]
=12en=1n1𝔼M[M(1|1)1D(M(|1)M^𝖡(|1))].\displaystyle=\frac{1}{2en}\sum_{\ell=1}^{n-1}\mathbb{E}_{M\sim\mathscr{M}}\left[M(1|1)^{\ell-1}D({M(\cdot|1)}\|{\widehat{M}^{\mathsf{B}}_{\ell}(\cdot|1)})\right]. (63)

Recalling the general form of the Bayes estimator in (21) and in view of (62), we get

M^𝖡(2|1)=𝔼M[M(1|1)1M(2|1)]𝔼M[M(1|1)1],M^𝖡(1|1)=1M^𝖡(2|1).\displaystyle\widehat{M}_{\ell}^{\mathsf{B}}(2|1)=\frac{\mathbb{E}_{M\sim\mathscr{M}}[M(1|1)^{\ell-1}M(2|1)]}{\mathbb{E}_{M\sim\mathscr{M}}[M(1|1)^{\ell-1}]},\quad\widehat{M}_{\ell}^{\mathsf{B}}(1|1)=1-\widehat{M}_{\ell}^{\mathsf{B}}(2|1). (64)
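As an illustration of (64), the Bayes predictor for a step sequence 2^{n-\ell}1^{\ell} is a ratio of two averages over the finite support of \mathscr{M}. The sketch below is ours, with a toy \gamma_{0} that is far larger than the regime \gamma_{0}\leq e^{-e^{5}} of Lemma 15 and is used only to make the numbers visible.

```python
import numpy as np

# Toy parameters purely for illustration (this gamma0 is outside Lemma 15's regime).
gamma0 = 1e-4
alpha = np.log(1 / gamma0)                                           # cf. (60)
beta = int(np.ceil(alpha / (5 * np.log(alpha))))
m21 = np.array([alpha ** (-m) for m in range(beta + 1, 5 * beta)])   # M(2|1) over the prior
m11 = 1 - m21

def bayes_predictor(ell):
    """Bayes estimate of M(2|1) given X^n = 2^{n-ell} 1^{ell}, cf. (64)."""
    w = m11 ** (ell - 1)                  # likelihood weights; the uniform prior cancels
    return np.sum(w * m21) / np.sum(w)

for ell in (10, 300, 3000):
    print(ell, bayes_predictor(ell))      # shrinks toward smaller M(2|1) as ell grows
```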

Plugging (64) into (63), and using

D((x,1x)(y,1y))=xlogx\overy+(1x)log1x\over1yxmax{0,logx\overy1},\displaystyle D((x,1-x)\|(y,1-y))=x\log{x\over y}+(1-x)\log{1-x\over 1-y}\geq x\max\left\{0,\log{x\over y}-1\right\},

we arrive at the following lower bound for the Bayes risk:

𝖱𝗂𝗌𝗄()\displaystyle\mathsf{Risk}(\mathscr{M})
\displaystyle\geq 12en=1n1𝔼M[M(1|1)1M(2|1)max{0,log(M(2|1)𝔼M[M(1|1)1]𝔼M[M(1|1)1M(2|1)])1}].\displaystyle\frac{1}{2en}\sum_{\ell=1}^{n-1}\mathbb{E}_{M\sim\mathscr{M}}\left[M(1|1)^{\ell-1}M(2|1)\max\left\{0,\log\left(\frac{M(2|1)\cdot\mathbb{E}_{M\sim\mathscr{M}}[M(1|1)^{\ell-1}]}{\mathbb{E}_{M\sim\mathscr{M}}[M(1|1)^{\ell-1}M(2|1)]}\right)-1\right\}\right]. (65)

Under the prior \mathscr{M}, M(2|1)=1M(1|1)=αmM(2|1)=1-M(1|1)=\alpha^{-m} with βm5β\beta\leq m\leq 5\beta.

We further lower bound (65) by summing over an appropriate range of \ell. For any m[β,3β]m\in[\beta,3\beta], define

1(m)=αmlogα,2(m)=αmlogα.\displaystyle\ell_{1}(m)=\left\lceil\frac{\alpha^{m}}{\log\alpha}\right\rceil,\qquad\ell_{2}(m)=\left\lfloor\alpha^{m}\log\alpha\right\rfloor.

Since γ0ee5\gamma_{0}\leq e^{-e^{5}}, our choice of α\alpha ensures that the intervals {[1(m),2(m)]}βm3β\{[\ell_{1}(m),\ell_{2}(m)]\}_{\beta\leq m\leq 3\beta} are disjoint. We will establish the following claim: for all m[β,3β]m\in[\beta,3\beta] and [1(m),2(m)]\ell\in[\ell_{1}(m),\ell_{2}(m)], it holds that

αm𝔼M[M(1|1)1]𝔼M[M(1|1)1M(2|1)]log(1/γ0)loglog(1/γ0).\displaystyle\frac{\alpha^{-m}\cdot\mathbb{E}_{M\sim\mathscr{M}}[M(1|1)^{\ell-1}]}{\mathbb{E}_{M\sim\mathscr{M}}[M(1|1)^{\ell-1}M(2|1)]}\gtrsim\frac{\log(1/\gamma_{0})}{\log\log(1/\gamma_{0})}. (66)

We first complete the proof of the Bayes risk bound assuming (66). Using (65) and (66), we have

𝖱𝗂𝗌𝗄()\displaystyle\mathsf{Risk}(\mathscr{M}) 1\overn14βm=β3β=1(m)2(m)αm(1αm)1loglog(1/γ0)\displaystyle\gtrsim{1\over n}\cdot\frac{1}{4\beta}\sum_{m=\beta}^{3\beta}\sum_{\ell=\ell_{1}(m)}^{\ell_{2}(m)}\alpha^{-m}(1-\alpha^{-m})^{\ell-1}\cdot\log\log(1/\gamma_{0})
=loglog(1/γ0)\over4nβm=β3β{(1αm)1(m)1(1αm)2(m)}\displaystyle={\log\log(1/\gamma_{0})\over 4n\beta}\sum_{m=\beta}^{3\beta}\left\{(1-\alpha^{-m})^{\ell_{1}(m)-1}-(1-\alpha^{-m})^{\ell_{2}(m)}\right\}
(a)loglog(1/γ0)\over4nβm=β3β((14)1\overlogα(1e)1+logα)loglog(1/γ0)\overn,\displaystyle\overset{\rm(a)}{\geq}{\log\log(1/\gamma_{0})\over 4n\beta}\sum_{m=\beta}^{3\beta}\left(\left(\frac{1}{4}\right)^{1\over\log\alpha}-\left(\frac{1}{e}\right)^{-1+\log\alpha}\right)\gtrsim{\log\log(1/\gamma_{0})\over n},

with (a) following from 14(1x)1\overx1e\frac{1}{4}\leq(1-x)^{1\over x}\leq\frac{1}{e} if x12x\leq\frac{1}{2}, and αmαβγ01/512\alpha^{-m}\leq\alpha^{-\beta}\leq\gamma_{0}^{1/5}\leq\frac{1}{2}.

Next we prove the claim (66). Expanding the expectation in (61), we write the LHS of (66) as

αm𝔼M[M(1|1)1]𝔼M[M(1|1)1M(2|1)]=X+A+B\overX+C+D,\displaystyle\frac{\alpha^{-m}\cdot\mathbb{E}_{M\sim\mathscr{M}}[M(1|1)^{\ell-1}]}{\mathbb{E}_{M\sim\mathscr{M}}[M(1|1)^{\ell-1}M(2|1)]}={X_{\ell}+A_{\ell}+B_{\ell}\over X_{\ell}+C_{\ell}+D_{\ell}},

where

X\displaystyle X_{\ell} =(1αm),A=j=βm1(1αj),B=j=m+15β(1αj),\displaystyle=\left(1-\alpha^{-m}\right)^{\ell},\quad A_{\ell}=\sum_{j={\beta}}^{m-1}\left(1-\alpha^{-j}\right)^{\ell},\quad B_{\ell}=\sum_{j=m+1}^{5{\beta}}\left(1-\alpha^{-j}\right)^{\ell},
C\displaystyle C_{\ell} =j=βm1(1αj)αmj,D=j=m+15β(1αj)αmj.\displaystyle=\sum_{j={\beta}}^{m-1}\left(1-\alpha^{-j}\right)^{\ell}{\alpha}^{m-j},\quad D_{\ell}=\sum_{j=m+1}^{5{\beta}}\left(1-\alpha^{-j}\right)^{\ell}{\alpha}^{m-j}.

We bound each of the terms individually. Clearly, X(0,1)X_{\ell}\in(0,1) and A0A_{\ell}\geq 0. Thus it suffices to show that BβB_{\ell}\gtrsim\beta and C,D1C_{\ell},D_{\ell}\lesssim 1, for m[β,3β]m\in[\beta,3\beta] and 1(m)2(m)\ell_{1}(m)\leq\ell\leq\ell_{2}(m). Indeed,

  • For jm+1j\geq m+1, we have

    (1αj)(1αj)2(m)(a)(1/4)2(m)\overαj(1/4)logα\overα1/4,\displaystyle\left(1-{\alpha}^{-j}\right)^{\ell}\geq\left(1-{\alpha}^{-j}\right)^{\ell_{2}(m)}\overset{\rm(a)}{\geq}\left(1/4\right)^{\ell_{2}(m)\over{\alpha}^{j}}\geq\left(1/4\right)^{\log{\alpha}\over{\alpha}}\geq 1/4,

    where in (a) we use the inequality (1x)1/x1/4(1-x)^{1/x}\geq 1/4 for x1/2x\leq 1/2. Consequently, Bβ/2B_{\ell}\geq\beta/2;

  • For jm1j\leq m-1, we have

    (1αj)(1αj)1(m)(b)eαmjlogα=γ0αmj1\overlogα,\displaystyle\left(1-{\alpha}^{-j}\right)^{\ell}\leq\left(1-{\alpha}^{-j}\right)^{\ell_{1}(m)}\overset{\rm(b)}{\leq}e^{-\frac{{\alpha}^{m-j}}{\log\alpha}}=\gamma_{0}^{{\alpha}^{m-j-1}\over\log{\alpha}},

    where (b) follows from (1x)1/x1/e(1-x)^{1/x}\leq 1/e and the definition of 1(m)\ell_{1}(m). Consequently,

    Cγ0α\overlogαj=βm2αmj+αγ01\overlogαeα2\overlogα+(2β+1)logα+elogααlogα2,\displaystyle C_{\ell}\leq\gamma_{0}^{\alpha\over\log\alpha}\sum_{j={\beta}}^{m-2}{\alpha}^{m-j}+{\alpha}\gamma_{0}^{1\over\log{\alpha}}\leq e^{-{\alpha^{2}\over\log\alpha}+(2\beta+1)\log\alpha}+e^{\log\alpha-\frac{\alpha}{\log\alpha}}\leq 2,

    where the last step uses the definition of β\beta in (60);

  • Dj=m+15βαmj1D_{\ell}\leq\sum_{j=m+1}^{5\beta}\alpha^{m-j}\leq 1, since α=log1γ0e5\alpha=\log\frac{1}{\gamma_{0}}\geq e^{5}.

Combining the above bounds completes the proof of (66). ∎

4.2 kk states

4.2.1 Proof of Theorem 3 (i)

Notice that the prediction problem consists of k sub-problems of estimating the individual rows of M, so it suffices to show that the contribution from each of them is O\left(\frac{k}{n}\right). In particular, assuming the trajectory ends in state 1, we bound the risk of estimating the first row by the add-one estimator \widehat{M}^{+1}(j|1)={N_{1j}+1\over N_{1}+k}. Under the absolute spectral gap condition \gamma_{*}\geq\gamma_{0}, we show

𝔼[𝟏{Xn=1}D(M(|1)M^+1(|1))]k\overn(1+logk\overkγ04).\displaystyle\mathbb{E}\left[{\mathbf{1}_{\left\{{X_{n}=1}\right\}}}D\left({M(\cdot|1)}\|{\widehat{M}^{+1}(\cdot|1)}\right)\right]\lesssim{k\over n}\left(1+\sqrt{\log k\over k\gamma_{0}^{4}}\right). (67)

By symmetry, we get the desired 𝖱𝗂𝗌𝗄k,n(γ0)k2\overn(1+logk\overkγ04)\mathsf{Risk}_{k,n}(\gamma_{0})\lesssim{k^{2}\over n}\left(1+\sqrt{\log k\over k\gamma_{0}^{4}}\right). The basic steps of our analysis are as follows:

  • When N1N_{1} is substantially smaller than its mean, we can bound the risk using the worst-case risk bound for add-one estimators and the probability of this rare event.

  • Otherwise, we decompose the prediction risk as

    D(M(|1)M^+1(|1))=j=1k[M(j|1)log(M(j|1)(N1+k)\overN1j+1)M(j|1)+N1j+1\overN1+k].\displaystyle D(M(\cdot|1)\|\widehat{M}^{+1}(\cdot|1))=\sum_{j=1}^{k}\left[M(j|1)\log\left(M(j|1)(N_{1}+k)\over N_{1j}+1\right)-M(j|1)+{N_{1j}+1\over N_{1}+k}\right].

    We then analyze each term depending on whether N_{1j} is typical or not. Unless N_{1j} is atypically small, the add-one estimator works well and its risk can be bounded quadratically.
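The add-one estimator itself is elementary; the following Python sketch (ours, not the paper's code; the chain is started from a fixed state rather than from stationarity) computes the row estimate and its KL prediction loss at the last observed state.

```python
import numpy as np

def add_one_row_estimate(x, k, state):
    """Add-one estimate of M(.|state) from a trajectory x over states {0,...,k-1}."""
    x = np.asarray(x)
    counts = np.zeros(k)
    idx = np.where(x[:-1] == state)[0]
    np.add.at(counts, x[idx + 1], 1.0)          # transition counts N_{state, j}
    return (counts + 1.0) / (len(idx) + k)      # (N_{state,j} + 1) / (N_state + k)

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

rng = np.random.default_rng(3)
k, n = 5, 5000
M = rng.dirichlet(np.ones(k), size=k)           # a generic transition matrix
x = [0]                                         # started from state 0 for simplicity
for _ in range(n - 1):
    x.append(rng.choice(k, p=M[x[-1]]))
row_hat = add_one_row_estimate(x, k, state=x[-1])
print(kl(M[x[-1]], row_hat))                    # loss D(M(.|X_n) || Mhat^{+1}(.|X_n))
```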

To analyze the concentration of the empirical counts we use the following moment bounds. The proofs are deferred to Appendix B.

Lemma 16.

Finite reversible and irreducible chains satisfy the following moment bounds:

  1. (i)

    𝔼[(NijNiM(j|i))2|Xn=i]nπiM(j|i)(1M(j|i))+M(j|i)\overγ+M(j|i)\overγ2{\mathbb{E}\left[\left(N_{ij}-N_{i}{M(j|i)}\right)^{2}|X_{n}=i\right]}\lesssim n\pi_{i}{M(j|i)}(1-{M(j|i)})+{\sqrt{M(j|i)}\over\gamma_{*}}+{{M(j|i)}\over\gamma_{*}^{2}}

  2. (ii)

    𝔼[(NijNiM(j|i))4|Xn=i](nπiM(j|i)(1M(j|i)))2+M(j|i)\overγ+M(j|i)2\overγ4\mathbb{E}\left[\left(N_{ij}-N_{i}{M(j|i)}\right)^{4}|X_{n}=i\right]\lesssim(n\pi_{i}{M(j|i)}(1-{M(j|i)}))^{2}+{\sqrt{{M(j|i)}}\over\gamma_{*}}+{{M(j|i)}^{2}\over\gamma_{*}^{4}}

  3. (iii)

    𝔼[(Ni(n1)πi)4|Xn=i]n2πi2\overγ2+1\overγ4.\mathbb{E}\left[\left(N_{i}-(n-1)\pi_{i}\right)^{4}|X_{n}=i\right]\lesssim{n^{2}\pi_{i}^{2}\over\gamma_{*}^{2}}+{1\over\gamma_{*}^{4}}.

When \gamma_{*} is large, this shows that the moments behave as if, for each i\in[k], N_{i} were approximately Binomial(n-1,\pi_{i}) and N_{ij} approximately Binomial(N_{i},M(j|i)), as would be the case under iid sampling. For iid models, [KOPS15] showed that the add-one estimator achieves an O\left(\frac{k}{n}\right) risk bound, which is what we aim for here as well. In addition, the dependence of the above moments on \gamma_{*} gives rise to sufficient conditions that guarantee the parametric rate. The technical details are given below.

We decompose the left hand side in (67) based on N1N_{1} as

𝔼[𝟏{Xn=1}D(M(|1)M^+1(|1))]=𝔼[𝟏{A}D(M(|1)M^+1(|1))]+𝔼[𝟏{A>}D(M(|1)M^+1(|1))]\displaystyle\mathbb{E}\left[{\mathbf{1}_{\left\{{X_{n}=1}\right\}}}D\left({M(\cdot|1)}\|{\widehat{M}^{+1}(\cdot|1)}\right)\right]=\mathbb{E}\left[{\mathbf{1}_{\left\{{A^{\leq}}\right\}}}D\left({M(\cdot|1)}\|{\widehat{M}^{+1}(\cdot|1)}\right)\right]+\mathbb{E}\left[{\mathbf{1}_{\left\{{A^{>}}\right\}}}D\left({M(\cdot|1)}\|{\widehat{M}^{+1}(\cdot|1)}\right)\right]

where the typical set A>A^{>} and atypical set AA^{\leq} are defined as

A{Xn=1,N1(n1)π1/2},A>{Xn=1,N1>(n1)π1/2}.\displaystyle A^{\leq}\triangleq\left\{X_{n}=1,N_{1}\leq{(n-1)\pi_{1}/2}\right\},\quad A^{>}\triangleq\left\{X_{n}=1,N_{1}>{(n-1)\pi_{1}/2}\right\}.

For the atypical case, note the following deterministic property of the add-one estimator. Let Q^\widehat{Q} be an add-one estimator with sample size nn and alphabet size kk of the form Q^i=ni+1n+k\widehat{Q}_{i}=\frac{n_{i}+1}{n+k}, where ni=n\sum n_{i}=n. Since Q^\widehat{Q} is bounded below by 1n+k\frac{1}{n+k} everywhere, for any distribution PP, we have

D(PQ^)log(n+k).D(P\|\widehat{Q})\leq\log(n+k). (68)
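A quick numerical check of (68) (our own sketch, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 50, 8
counts = rng.multinomial(n, rng.dirichlet(np.ones(k)))
Q_hat = (counts + 1) / (n + k)               # add-one estimate, each entry >= 1/(n+k)
P = rng.dirichlet(np.ones(k))                # arbitrary distribution
kl = float(np.sum(P * np.log(P / Q_hat)))
print(kl, np.log(n + k))                     # kl <= log(n + k), cf. (68)
```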

Applying this bound on the event AA^{\leq}, we have

𝔼[𝟏{A}D(M(|1)M^+1(|1))]\displaystyle\mathbb{E}\left[{\mathbf{1}_{\left\{{A^{\leq}}\right\}}}D\left({M(\cdot|1)}\|{\widehat{M}^{+1}(\cdot|1)}\right)\right]
log(nπ1+k)[Xn=1,N1(n1)π1/2]\displaystyle\leq\log\left(n\pi_{1}+k\right)\mathbb{P}\left[X_{n}=1,N_{1}\leq(n-1)\pi_{1}/2\right]
(a)𝟏{nπ1γ10}π1log(nπ1+k)+𝟏{nπ1γ>10}π1log(nπ1+k)𝔼[(N1(n1)π1)4|Xn=1]\overn4π14\displaystyle\overset{\rm(a)}{\lesssim}{\mathbf{1}_{\left\{{n\pi_{1}\gamma_{*}\leq 10}\right\}}}\pi_{1}\log\left(n\pi_{1}+k\right)+{\mathbf{1}_{\left\{{n\pi_{1}\gamma_{*}>10}\right\}}}\pi_{1}\log\left({n\pi_{1}+k}\right){\mathbb{E}\left[\left(N_{1}-(n-1)\pi_{1}\right)^{4}|X_{n}=1\right]\over n^{4}\pi_{1}^{4}} (69)
(b)𝟏{nπ1γ10}10\overnγlog(10\overγ+k)+𝟏{nπ1γ>10}log(nπ1+k)(1\overn2π1γ2+1\overn4π13γ4)\displaystyle\overset{\rm(b)}{\leq}{\mathbf{1}_{\left\{{n\pi_{1}\gamma_{*}\leq 10}\right\}}}{10\over n\gamma_{*}}\log\left({10\over\gamma_{*}}+k\right)+{\mathbf{1}_{\left\{{n\pi_{1}\gamma_{*}>10}\right\}}}\log\left(n\pi_{1}+k\right)\left({1\over n^{2}\pi_{1}\gamma_{*}^{2}}+{1\over n^{4}\pi_{1}^{3}\gamma_{*}^{4}}\right)
(c)1n{𝟏{nπ1γ10}log(1/γ)+logk\overγ+𝟏{nπ1γ>10}(nπ1+logk)(1\overnπ1γ2+1\overn3π13γ4)}\displaystyle\overset{\rm(c)}{\lesssim}\frac{1}{n}\left\{{\mathbf{1}_{\left\{{n\pi_{1}\gamma_{*}\leq 10}\right\}}}{\log(1/\gamma_{*})+\log k\over\gamma_{*}}+{\mathbf{1}_{\left\{{n\pi_{1}\gamma_{*}>10}\right\}}}\left(n\pi_{1}+\log k\right)\left({1\over n\pi_{1}\gamma_{*}^{2}}+{1\over n^{3}\pi_{1}^{3}\gamma_{*}^{4}}\right)\right\}
1\overn{𝟏{nπ1γ10}(1γ2+logk\overγ)+𝟏{nπ1γ>10}(1\overγ2+logk\overγ)}1\overnγ02+logk\overnγ0.\displaystyle{\lesssim}{1\over n}\left\{{\mathbf{1}_{\left\{{n\pi_{1}\gamma_{*}\leq 10}\right\}}}\left(\frac{1}{\gamma_{*}^{2}}+{\log k\over\gamma_{*}}\right)+{\mathbf{1}_{\left\{{n\pi_{1}\gamma_{*}>10}\right\}}}\left({1\over\gamma_{*}^{2}}+{\log k\over\gamma_{*}}\right)\right\}\lesssim{1\over n\gamma_{0}^{2}}+{\log k\over n\gamma_{0}}. (70)

where (a) follows from the Markov inequality, (b) from Lemma 16(iii), and (c) from x+y\leq xy for x,y\geq 2.

Next we bound 𝔼[𝟏{A>}D(M(|1)M^+1(|1))]\mathbb{E}\left[{\mathbf{1}_{\left\{{A^{>}}\right\}}}D\left({M(\cdot|1)}\|{\widehat{M}^{+1}(\cdot|1)}\right)\right]. Define

Δi=M(i|1)log(M(i|1)\overM^+1(i|1))M(i|1)+M^+1(i|1).\displaystyle\Delta_{i}=M(i|1)\log\left(M(i|1)\over\widehat{M}^{+1}(i|1)\right)-M(i|1)+\widehat{M}^{+1}(i|1).

As D({M(\cdot|1)}\|{\widehat{M}^{+1}(\cdot|1)})=\sum_{i=1}^{k}\Delta_{i}, it suffices to bound \mathbb{E}\left[{\mathbf{1}_{\left\{{A^{>}}\right\}}}\Delta_{i}\right] for each i. For some r\geq 1 to be optimized later, consider the following cases separately.

Case (a) nπ1rn\pi_{1}\leq r or nπ1M(i|1)10n\pi_{1}{M(i|1)}\leq 10:

Using the fact ylog(y)y+1(y1)2y\log(y)-y+1\leq(y-1)^{2} with y=M(i|1)\overM^+1(i|1)=M(i|1)(N1+k)\overN1i+1y={M(i|1)\over\widehat{M}^{+1}(i|1)}={{M(i|1)}(N_{1}+k)\over N_{1i}+1} we get

Δi(M(i|1)N1N1i+M(i|1)k1)2\over(N1+k)(N1i+1).\displaystyle\Delta_{i}\leq{\left({M(i|1)}N_{1}-N_{1i}+{M(i|1)}k-1\right)^{2}\over\left(N_{1}+k\right)\left(N_{1i}+1\right)}. (71)

This implies

𝔼[𝟏{A>}Δi]\displaystyle\mathbb{E}\left[{\mathbf{1}_{\left\{{A^{>}}\right\}}}\Delta_{i}\right] 𝔼[𝟏{A>}(M(i|1)N1N1i+M(i|1)k1)2\over(N1+k)(N1i+1)]\displaystyle\leq\mathbb{E}\left[{\mathbf{1}_{\left\{{A^{>}}\right\}}}\left({M(i|1)}N_{1}-N_{1i}+{M(i|1)}k-1\right)^{2}\over\left(N_{1}+k\right)\left(N_{1i}+1\right)\right]
(a)𝔼[𝟏{A>}(M(i|1)N1N1i)2]+k2π1M(i|1)2+π1\overnπ1+k\displaystyle\overset{\rm(a)}{\lesssim}{{\mathbb{E}\left[{\mathbf{1}_{\left\{{A^{>}}\right\}}}\left({M(i|1)}N_{1}-N_{1i}\right)^{2}\right]+k^{2}\pi_{1}{M(i|1)}^{2}+\pi_{1}}\over n\pi_{1}+k}
(b)π1𝔼[(M(i|1)N1N1i)2|Xn=1]\overnπ1+k+1+rkM(i|1)\overn\displaystyle\overset{\rm(b)}{\lesssim}{\pi_{1}\mathbb{E}\left[\left.{\left({M(i|1)}N_{1}-N_{1i}\right)^{2}}\right|X_{n}=1\right]\over n\pi_{1}+k}+{1+rk{M(i|1)}\over n} (72)

where (a) follows from N1>(n1)π1\over2N_{1}>{(n-1)\pi_{1}\over 2} in A>A^{>} and the fact that (x+y+z)23(x2+y2+z2)(x+y+z)^{2}\leq 3(x^{2}+y^{2}+z^{2}); (b) uses the assumption that either nπ1rn\pi_{1}\leq r or nπ1M(i|1)10n\pi_{1}{M(i|1)}\leq 10. Applying Lemma 16(i) and the fact that x+x22(1+x2)x+x^{2}\leq 2(1+x^{2}), continuing the last display we get

𝔼[𝟏{A>}Δi]nπ1M(i|1)+(1+M(i|1)\overγ2)\overn+1+rkM(i|1)\overn1+rkM(i|1)\overn+M(i|1)\overnγ02.\displaystyle\mathbb{E}\left[{\mathbf{1}_{\left\{{A^{>}}\right\}}}\Delta_{i}\right]\lesssim{n\pi_{1}{M(i|1)}+\left(1+{{M(i|1)}\over\gamma^{2}_{*}}\right)\over n}+{1+rk{M(i|1)}\over n}\lesssim{{1+rk{M(i|1)}\over n}+{M(i|1)\over n\gamma_{0}^{2}}}.

Hence

\displaystyle\mathbb{E}\left[{\mathbf{1}_{\left\{{A^{>}}\right\}}}D({M(\cdot|1)}\|{\widehat{M}^{+1}(\cdot|1)})\right]=\sum_{i=1}^{k}\mathbb{E}\left[{\mathbf{1}_{\left\{{A^{>}}\right\}}}\Delta_{i}\right]\lesssim\frac{rk}{n}+\frac{1}{n\gamma_{0}^{2}}. (73)
Case (b) n\pi_{1}>r and n\pi_{1}{M(i|1)}>10:

We decompose A^{>} based on the count N_{1i} into an atypical part B^{\leq} and a typical part B^{>}

B\displaystyle B^{\leq} {Xn=1,N1>(n1)π1/2,N1i(n1)π1M(i|1)/4}\displaystyle\triangleq\left\{X_{n}=1,N_{1}>{(n-1)\pi_{1}/2},N_{1i}\leq{(n-1)\pi_{1}{M(i|1)}/4}\right\}
B>\displaystyle B^{>} {Xn=1,N1>(n1)π1/2,N1i>(n1)π1M(i|1)/4}\displaystyle\triangleq\left\{X_{n}=1,N_{1}>{(n-1)\pi_{1}/2},N_{1i}>{(n-1)\pi_{1}{M(i|1)}/4}\right\}

and bound each of 𝔼[𝟏{B}Δi]\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{\leq}}\right\}}}\Delta_{i}\right] and 𝔼[𝟏{B>}Δi]\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{>}}\right\}}}\Delta_{i}\right] separately.

Bound on 𝔼[𝟏{B}Δi]\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{\leq}}\right\}}}\Delta_{i}\right]

Using M^+1(i|1)1\overN1+k\widehat{M}^{+1}(i|1)\geq{1\over N_{1}+k} and N1i<N1M(i|1)/2N_{1i}<N_{1}M(i|1)/2 in BB^{\leq} we get

𝔼[𝟏{B}Δi]\displaystyle\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{\leq}}\right\}}}\Delta_{i}\right] =𝔼[𝟏{B}M(i|1)log(M(i|1)(N1+k)\overN1i+1)]+𝔼[𝟏{B}(N1i+1\overN1+kM(i|1))]\displaystyle=\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{\leq}}\right\}}}{M(i|1)}\log\left({M(i|1)}(N_{1}+k)\over N_{1i}+1\right)\right]+\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{\leq}}\right\}}}\left({N_{1i}+1\over N_{1}+k}-{M(i|1)}\right)\right]
𝔼[𝟏{B}M(i|1)log(M(i|1)(N1+k))]+𝔼[𝟏{B}(N1i\overN1M(i|1))]+𝔼[𝟏{B}\overN1]\displaystyle\leq\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{\leq}}\right\}}}{M(i|1)}\log\left({M(i|1)}(N_{1}+k)\right)\right]+\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{\leq}}\right\}}}\left({N_{1i}\over N_{1}}-{M(i|1)}\right)\right]+\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{\leq}}\right\}}}\over N_{1}\right]
𝔼[𝟏{B}M(i|1)log(M(i|1)(N1+k))]+1\overn\displaystyle\lesssim\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{\leq}}\right\}}}{M(i|1)}\log\left({M(i|1)}(N_{1}+k)\right)\right]+{1\over n} (74)

where the last inequality follows since N_{1i}/N_{1}<M(i|1) on B^{\leq} (so the middle term is nonpositive) and \mathbb{E}\left[{\mathbf{1}_{\left\{{B^{\leq}}\right\}}}/N_{1}\right]\lesssim{\mathbb{P}[X_{n}=1]/(n\pi_{1})}=\frac{1}{n}. Note that for any event B and any function g,

𝔼[g(N1)𝟏{N1t0,B}]=g(t0)[N1t0,B]+t=t0+1n(g(t)g(t1))[N1t,B].\displaystyle\mathbb{E}\left[g(N_{1}){\mathbf{1}_{\left\{{N_{1}\geq t_{0},B}\right\}}}\right]=g(t_{0})\mathbb{P}[N_{1}\geq t_{0},B]+\sum_{t=t_{0}+1}^{n}\left(g(t)-g(t-1)\right)\mathbb{P}[N_{1}\geq t,B].
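This is a summation-by-parts identity; a quick numerical check (ours, with an arbitrary g and with B taken to be the sure event for simplicity):

```python
import numpy as np

rng = np.random.default_rng(5)
n, t0 = 30, 7
g = rng.normal(size=n + 1)                        # arbitrary values g(0), ..., g(n)
p = rng.dirichlet(np.ones(n + 1))                 # p[s] = P[N1 = s, B]
lhs = sum(g[s] * p[s] for s in range(t0, n + 1))  # E[g(N1) 1{N1 >= t0, B}]
tail = lambda t: p[t:].sum()                      # P[N1 >= t, B]
rhs = g[t0] * tail(t0) + sum((g[t] - g[t - 1]) * tail(t) for t in range(t0 + 1, n + 1))
print(np.isclose(lhs, rhs))                       # True
```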

Applying this identity with t0=(n1)π1/2t_{0}={\left\lceil{(n-1)\pi_{1}/2}\right\rceil}, we can bound the expectation term in (74) as

𝔼[𝟏{B}M(i|1)log(M(i|1)(N1+k))]\displaystyle\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{\leq}}\right\}}}{M(i|1)}\log\left({M(i|1)}(N_{1}+k)\right)\right]
=M(i|1)log(M(i|1)(t0+k))[N1t0,N1inπ1M(i|1)\over4,Xn=1]\displaystyle={M(i|1)}\log\left({M(i|1)}(t_{0}+k)\right)\mathbb{P}\left[N_{1}\geq t_{0},N_{1i}\leq{n\pi_{1}{M(i|1)}\over 4},X_{n}=1\right]
+M(i|1)t=t0+1n1log(1+1t1+k)[N1t+1,N1inπ1M(i|1)\over4,Xn=1]\displaystyle\quad+{M(i|1)}\sum_{t=t_{0}+1}^{n-1}\log\left(1+\frac{1}{t-1+k}\right)\mathbb{P}\left[N_{1}\geq t+1,N_{1i}\leq{n\pi_{1}{M(i|1)}\over 4},X_{n}=1\right]
π1M(i|1)log(M(i|1)(t0+k))[M(i|1)N1N1iM(i|1)t0\over4|Xn=1]\displaystyle\leq\pi_{1}{M(i|1)}\log\left({M(i|1)}(t_{0}+k)\right)\mathbb{P}\left[\left.{M(i|1)}N_{1}-N_{1i}\geq{M(i|1)t_{0}\over 4}\right|X_{n}=1\right]
+M(i|1)\overnt=t0+1n1[M(i|1)N1N1iM(i|1)t\over4|Xn=1]\displaystyle\quad+{{M(i|1)}\over n}\sum_{t=t_{0}+1}^{n-1}\mathbb{P}\left[\left.{M(i|1)}N_{1}-N_{1i}\geq{M(i|1)t\over 4}\right|X_{n}=1\right] (75)

where the last inequality uses \log\left(1+\frac{1}{t-1+k}\right)\leq\frac{1}{t}\lesssim\frac{1}{n\pi_{1}} for all t\geq t_{0}. Using the Markov inequality \mathbb{P}\left[Z>c\right]\leq c^{-4}{\mathbb{E}\left[Z^{4}\right]} for c>0, Lemma 16(ii), and x+x^{4}\leq 2(1+x^{4}) with x=\sqrt{M(i|1)}/\gamma_{*}, we get

[M(i|1)N1N1iM(i|1)t\over4|Xn=1](nπ1M(i|1))2+M(i|1)2\overγ4\over(tM(i|1))4.\displaystyle\mathbb{P}\left[\left.{M(i|1)}N_{1}-N_{1i}\geq{{M(i|1)}t\over 4}\right|X_{n}=1\right]\lesssim{(n\pi_{1}{M(i|1)})^{2}+{{M(i|1)}^{2}\over\gamma_{*}^{4}}\over\left(t{M(i|1)}\right)^{4}}.

In view of the above, continuing from (75) we get

𝔼[𝟏{B}M(i|1)log(M(i|1)(N1+k))]\displaystyle\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{\leq}}\right\}}}{M(i|1)}\log\left({M(i|1)}(N_{1}+k)\right)\right]
((nπ1M(i|1))2+M(i|1)2\overγ4)(π1M(i|1)log(M(i|1)(nπ1+k))\over(nπ1M(i|1))4+1n(M(i|1))3t=t0+1n1\overt4)\displaystyle\lesssim\left((n\pi_{1}{M(i|1)})^{2}+{{M(i|1)}^{2}\over\gamma_{*}^{4}}\right)\left({\pi_{1}{M(i|1)}\log({M(i|1)}(n\pi_{1}+k))\over(n\pi_{1}{M(i|1)})^{4}}+\frac{1}{n({M(i|1)})^{3}}\sum_{t=t_{0}+1}^{n}{1\over t^{4}}\right)
((nπ1M(i|1))2+M(i|1)2\overγ4\overn)(log(nπ1M(i|1)+kM(i|1))\over(nπ1M(i|1))3+1(nπ1M(i|1))3)\displaystyle{\lesssim}\left((n\pi_{1}{M(i|1)})^{2}+{{M(i|1)}^{2}\over\gamma_{*}^{4}}\over n\right)\left({\log(n\pi_{1}M(i|1)+kM(i|1))\over(n\pi_{1}{M(i|1)})^{3}}+\frac{1}{(n\pi_{1}{M(i|1)})^{3}}\right)
1\overn((nπ1M(i|1))2+M(i|1)2\overγ4)log(nπ1M(i|1)+kM(i|1))\over(nπ1M(i|1))3\displaystyle{\lesssim}{1\over n}\left((n\pi_{1}{M(i|1)})^{2}+{{M(i|1)}^{2}\over\gamma_{*}^{4}}\right){\log(n\pi_{1}M(i|1)+kM(i|1))\over(n\pi_{1}{M(i|1)})^{3}}
1\overn(log(nπ1M(i|1)+kM(i|1))\overnπ1M(i|1)+M(i|1)log(nπ1M(i|1)+k)\overnπ1γ4(nπ1M(i|1))2)\displaystyle\lesssim{1\over n}\left({\log(n\pi_{1}M(i|1)+kM(i|1))\over n\pi_{1}{M(i|1)}}+{M(i|1)\log(n\pi_{1}M(i|1)+k)\over n\pi_{1}\gamma_{*}^{4}(n\pi_{1}M(i|1))^{2}}\right)
(a)1\overn(nπ1M(i|1)+kM(i|1)\overnπ1M(i|1)+M(i|1)log(nπ1M(i|1))\overnπ1γ4(nπ1M(i|1))2+M(i|1)logk\overnπ1γ4(nπ1M(i|1))2)\displaystyle\overset{\rm(a)}{\lesssim}{1\over n}\left({n\pi_{1}M(i|1)+kM(i|1)\over n\pi_{1}{M(i|1)}}+{M(i|1)\log(n\pi_{1}M(i|1))\over n\pi_{1}\gamma_{*}^{4}(n\pi_{1}M(i|1))^{2}}+{M(i|1)\log k\over n\pi_{1}\gamma_{*}^{4}(n\pi_{1}M(i|1))^{2}}\right)
(b)1\overn(1+kM(i|1)+M(i|1)logk\overrγ04)\displaystyle\overset{\rm(b)}{\lesssim}{1\over n}\left(1+kM(i|1)+{M(i|1)\log k\over r\gamma_{0}^{4}}\right)

where (a) follows from x+y\leq xy for x,y\geq 2, and (b) uses n\pi_{1}\geq r, n\pi_{1}{M(i|1)}\geq 10, and \log(n\pi_{1}M(i|1))\leq n\pi_{1}M(i|1). In view of (74) this implies

i=1k𝔼[𝟏{B}Δi]i=1k1\overn(1+kM(i|1)(1+logk\overrkγ04))k\overn(1+logk\overrkγ04).\displaystyle\sum_{i=1}^{k}\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{\leq}}\right\}}}\Delta_{i}\right]\lesssim\sum_{i=1}^{k}{1\over n}\left(1+kM(i|1)\left(1+{\log k\over rk\gamma_{0}^{4}}\right)\right)\lesssim{k\over n}\left(1+{\log k\over rk\gamma_{0}^{4}}\right). (76)
Bound on 𝔼[𝟏{B>}Δi]\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{>}}\right\}}}\Delta_{i}\right]

Using the inequality (71), we get

𝔼[𝟏{B>}Δi]\displaystyle\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{>}}\right\}}}\Delta_{i}\right] 𝔼[𝟏{B>}(M(i|1)N1N1i+M(i|1)k1)2\over(N1+k)(N1i+1)]\displaystyle\leq\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{>}}\right\}}}\left({M(i|1)}N_{1}-N_{1i}+{M(i|1)}k-1\right)^{2}\over\left(N_{1}+k\right)\left(N_{1i}+1\right)\right]
𝔼[𝟏{B>}{(M(i|1)N1N1i)2}]+k2π1M(i|1)2+π1\over(nπ1+k)(nπ1M(i|1)+1)\displaystyle\lesssim{\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{>}}\right\}}}\left\{\left({M(i|1)}N_{1}-N_{1i}\right)^{2}\right\}\right]+k^{2}\pi_{1}{M(i|1)}^{2}+\pi_{1}\over(n\pi_{1}+k)(n\pi_{1}{M(i|1)}+1)}
π1𝔼[(M(i|1)N1N1i)2|Xn=1]\over(nπ1+k)(nπ1M(i|1)+1)+kM(i|1)\overn\displaystyle\lesssim{\pi_{1}\mathbb{E}\left[\left.{\left({M(i|1)}N_{1}-N_{1i}\right)^{2}}\right|X_{n}=1\right]\over(n\pi_{1}+k)(n\pi_{1}{M(i|1)}+1)}+{k{M(i|1)}\over n}

where the second inequality uses N_{1}>{(n-1)\pi_{1}/2} and N_{1i}>{(n-1)\pi_{1}{M(i|1)}/4} on B^{>}, along with (x+y+z)^{2}\leq 3(x^{2}+y^{2}+z^{2}). Using Lemma 16(i) we get

𝔼[𝟏{B>}Δi]nπ1M(i|1)+(1+M(i|1)\overγ2)\overn(nπ1M(i|1)+1)+kM(i|1)\overn1+kM(i|1)\overn+M(i|1)\overnγ02.\displaystyle\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{>}}\right\}}}\Delta_{i}\right]\lesssim{n\pi_{1}{M(i|1)}+\left(1+{{M(i|1)}\over\gamma^{2}_{*}}\right)\over n(n\pi_{1}{M(i|1)}+1)}+{k{M(i|1)}\over n}\lesssim{{1+k{M(i|1)}\over n}+{{M(i|1)}\over n\gamma_{0}^{2}}}.

Summing the last bound over i\in[k] and combining with (76), we get, in the case n\pi_{1}>r and n\pi_{1}M(i|1)>10,

𝔼[𝟏{A>}D(M(|1)M^+1(|1))]\displaystyle\mathbb{E}\left[{\mathbf{1}_{\left\{{A^{>}}\right\}}}D({M(\cdot|1)}\|{\widehat{M}^{+1}(\cdot|1)})\right] =i=1k[𝔼[𝟏{B}Δi]+𝔼[𝟏{B>}Δi]]k\overn(1+1\overkγ02+logk\overrkγ04).\displaystyle=\sum_{i=1}^{k}\left[\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{\leq}}\right\}}}\Delta_{i}\right]+\mathbb{E}\left[{\mathbf{1}_{\left\{{B^{>}}\right\}}}\Delta_{i}\right]\right]\lesssim{k\over n}\left(1+{1\over k\gamma_{0}^{2}}+{\log k\over rk\gamma_{0}^{4}}\right).

Combining this with (73) we obtain

𝔼[𝟏{A>}D(M(|1)M^+1(|1))]\displaystyle\mathbb{E}\left[{\mathbf{1}_{\left\{{A^{>}}\right\}}}D({M(\cdot|1)}\|{\widehat{M}^{+1}(\cdot|1)})\right] k\overn(1\overkγ02+r+logk\overrkγ04)k\overn(1+logk\overkγ04)\displaystyle\lesssim{k\over n}\left({1\over k\gamma_{0}^{2}}+r+{\log k\over rk\gamma_{0}^{4}}\right)\lesssim{k\over n}\left(1+{\sqrt{\log k\over k\gamma_{0}^{4}}}\right)

where we chose r=10+logk\overkγ04r=10+\sqrt{\log k\over k\gamma_{0}^{4}} for the last inequality. In view of (70) this implies the required bound.

Remark 4.

We explain the subtlety of the fourth-moment concentration bound in Lemma 16 and why existing Chernoff bounds or the Chebyshev inequality fall short. For example, the risk bound in (70) relies on bounding the probability that N_{1} is atypically small. To this end, one may use the classical Chernoff-type inequality for reversible chains (see [Lez98, Theorem 1.1] or [Pau15, Proposition 3.10 and Theorem 3.3])

[N1(n1)π1/2|X1=1]1π1eΘ(nπ1γ);\displaystyle\mathbb{P}\left[N_{1}\leq(n-1)\pi_{1}/2|X_{1}=1\right]\lesssim\frac{1}{\sqrt{\pi_{1}}}e^{-\Theta(n\pi_{1}\gamma_{*})}; (77)

in contrast, the fourth moment bound in (69) yields \mathbb{P}\left[N_{1}\leq(n-1)\pi_{1}/2|X_{1}=1\right]=O(\frac{1}{(n\pi_{1}\gamma_{*})^{2}}). Although the exponential tail in (77) is much better, the pre-factor \frac{1}{\sqrt{\pi_{1}}}, due to conditioning on the initial state, can lead to a suboptimal result when \pi_{1} is small. (As a concrete example, consider two states with M(2|1)=\Theta(1) and M(1|2)=\Theta(\frac{1}{n}). Then \pi_{1}=\Theta(\frac{1}{n}), \gamma=\gamma_{*}=\Theta(1), and (77) leads to \mathbb{P}\left[N_{1}\leq(n-1)\pi_{1}/2,X_{n}=1\right]=O(\frac{1}{\sqrt{n}}) as opposed to the desired O(\frac{1}{n}).)

In the same context, it is also insufficient to use a second-moment (Chebyshev) bound, which leads to \mathbb{P}\left[N_{1}\leq(n-1)\pi_{1}/2|X_{1}=1\right]=O(\frac{1}{n\pi_{1}\gamma_{*}}). This bound is too loose: upon substitution into (69), it results in an extra \log n factor in the final risk bound when \pi_{1} and \gamma_{*} are large.

4.2.2 Proof of Theorem 3 (ii)

Let k\geq(\log n)^{6} and \gamma_{0}\geq{(\log(n+k))^{2}\over k}. We prove a stronger result using the spectral gap in place of the absolute spectral gap. Fix M such that \gamma\geq\gamma_{0}, and denote its stationary distribution by \pi. For an absolute constant \tau>0 to be chosen later and c_{0} as in Lemma 17 below, define

ϵ(m)=2k\overm+c0(logn)3k\overm,cn=100τ2logn\overnγ,\displaystyle\epsilon(m)={2k\over m}+{c_{0}(\log n)^{3}\sqrt{k}\over m},\quad c_{n}=100\tau^{2}{\log n\over n\gamma},
\displaystyle n_{i}^{\pm}=n\left(\pi_{i}\pm\tau\max\left\{{\log n\over n\gamma},\sqrt{\pi_{i}\log n\over n\gamma}\right\}\right),\quad i=1,\ldots,k. (78)

Let NiN_{i} be the number of visits to state ii as in (4). We bound the risk by accounting for the contributions from different ranges of NiN_{i} and πi\pi_{i} separately:

𝔼[i=1k𝟏{Xn=i}D(M(|i)M^+1(|i))]\displaystyle\mathbb{E}\left[\sum_{i=1}^{k}{\mathbf{1}_{\left\{{X_{n}=i}\right\}}}D\left(M(\cdot|i)\|\widehat{M}^{+1}(\cdot|i)\right)\right]
=i:πicn𝔼[𝟏{Xn=i,niNini+}D(M(|i)M^+1(|i))]\displaystyle=\sum_{i:\pi_{i}\geq c_{n}}\mathbb{E}\left[{\mathbf{1}_{\left\{{X_{n}=i,n_{i}^{-}\leq N_{i}\leq n_{i}^{+}}\right\}}}D\left(M(\cdot|i)\|\widehat{M}^{+1}(\cdot|i)\right)\right]
+i:πicn𝔼[𝟏{Xn=i,Ni>ni+ or Ni<ni}D(M(|i)M^+1(|i))]+i:πi<cn𝔼[𝟏{Xn=i}D(M(|i)M^+1(|i))]\displaystyle+\sum_{i:\pi_{i}\geq c_{n}}\mathbb{E}\left[{\mathbf{1}_{\left\{{X_{n}=i,N_{i}>n_{i}^{+}\text{ or }N_{i}<n_{i}^{-}}\right\}}}D\left(M(\cdot|i)\|\widehat{M}^{+1}(\cdot|i)\right)\right]+\sum_{i:\pi_{i}<c_{n}}\mathbb{E}\left[{\mathbf{1}_{\left\{{X_{n}=i}\right\}}}D\left(M(\cdot|i)\|\widehat{M}^{+1}(\cdot|i)\right)\right]
log(n+k)i:πicn[D(M(|i)M^+1(|i))>ϵ(Ni),niNini+]+i:πicn𝔼[𝟏{Xn=i,niNini+}ϵ(Ni)]\displaystyle\leq\log(n+k)\sum_{i:\pi_{i}\geq c_{n}}\mathbb{P}\left[D(M(\cdot|i)\|\widehat{M}^{+1}(\cdot|i))>\epsilon(N_{i}),n_{i}^{-}\leq N_{i}\leq n_{i}^{+}\right]+\sum_{i:\pi_{i}\geq c_{n}}\mathbb{E}\left[{\mathbf{1}_{\left\{{X_{n}=i,n_{i}^{-}\leq N_{i}\leq n_{i}^{+}}\right\}}}\epsilon(N_{i})\right]
+log(n+k)i:πicn[[Nini+]+[Nini]]+i:πicnπilog(n+k)\displaystyle+\log(n+k)\sum_{i:\pi_{i}\geq c_{n}}\left[\mathbb{P}\left[N_{i}\geq n_{i}^{+}\right]+\mathbb{P}\left[N_{i}\leq n_{i}^{-}\right]\right]+\sum_{i:\pi_{i}\leq c_{n}}\pi_{i}\log(n+k)
log(n+k)i:πicn[D(M(|i)M^+1(|i))>ϵ(Ni),niNini+]+i:πicnπimaxnimni+ϵ(m)\displaystyle\lesssim\log(n+k)\sum_{i:\pi_{i}\geq c_{n}}\mathbb{P}\left[D(M(\cdot|i)\|\widehat{M}^{+1}(\cdot|i))>\epsilon(N_{i}),n_{i}^{-}\leq N_{i}\leq n_{i}^{+}\right]+\sum_{i:\pi_{i}\geq c_{n}}\pi_{i}\max_{n_{i}^{-}\leq m\leq n_{i}^{+}}\epsilon(m)
+log(n+k)i:πicn([Ni>ni+]+[Ni<ni])+k(log(n+k))2\overnγ.\displaystyle\quad+\log(n+k)\sum_{i:\pi_{i}\geq c_{n}}\left(\mathbb{P}\left[N_{i}>n_{i}^{+}\right]+\mathbb{P}\left[N_{i}<n_{i}^{-}\right]\right)+{k\left(\log(n+k)\right)^{2}\over n\gamma}. (79)

where the first inequality uses the worst-case bound (68) for the add-one estimator. We analyze the terms separately as follows.

For the second term, given any ii such that πicn\pi_{i}\geq c_{n}, we have, by definition in (78), ni9nπi/10n_{i}^{-}\geq 9n\pi_{i}/10 and ni+ninπi/5n_{i}^{+}-n_{i}^{-}\leq n\pi_{i}/5, which implies

i:πicnπimaxnimni+ϵ(m)\displaystyle\sum_{i:\pi_{i}\geq c_{n}}\pi_{i}\max_{n_{i}^{-}\leq m\leq n_{i}^{+}}\epsilon(m) i:πicnπi(2k\over0.9nπi+10\over9c0(logn)3k\overnπi)k2\overn+(logn)3k3/2\overn.\displaystyle\leq\sum_{i:\pi_{i}\geq c_{n}}\pi_{i}\left({2k\over 0.9n\pi_{i}}+{10\over 9}{c_{0}(\log n)^{3}\sqrt{k}\over n\pi_{i}}\right)\lesssim{k^{2}\over n}+{(\log n)^{3}k^{3/2}\over n}. (80)

For the third term, applying [HJL+18, Lemma 16] (which, in turn, is based on the Bernstein inequality in [Pau15]), we get [Ni>ni+]+[Ni<ni]2nτ2\over4+10τ\mathbb{P}\left[N_{i}>n_{i}^{+}\right]+\mathbb{P}\left[N_{i}<n_{i}^{-}\right]\leq 2n^{-\tau^{2}\over 4+10\tau}.

To bound the first term in (79), we follow the method in [Bil61, HJL+18] of representing the sample path of the Markov chain using independent samples generated from $M(\cdot|i)$, which we describe below. Consider a random variable $X_{1}\sim\pi$ and an array $W=\left\{W_{i\ell}:i=1,\dots,k\text{ and }\ell=1,2,\dots\right\}$ of independent random variables, such that $X_{1}$ and $W$ are independent and $W_{i\ell}\stackrel{\text{iid}}{\sim}M(\cdot|i)$ for each $i$. Starting with generating $X_{1}$ from $\pi$, at every step $i\geq 2$ we set $X_{i}$ to be the first element in the $X_{i-1}$-th row of $W$ that has not been sampled yet. Then one can verify that $\left\{X_{1},\dots,X_{n}\right\}$ is a Markov chain with initial distribution $\pi$ and transition matrix $M$. Furthermore, the transition counts satisfy $N_{ij}=\sum_{\ell=1}^{N_{i}}{\mathbf{1}_{\left\{W_{i\ell}=j\right\}}}$, where $N_{i}$ is the number of elements sampled from the $i$th row of $W$. Note that, conditioned on $N_{i}=m$, the random variables $\{W_{i1},\ldots,W_{im}\}$ are no longer iid. Instead, we apply a union bound. Note that for each fixed $m$, the estimator

M^+1(j|i)==1m𝟏{Wi=j}+1\overm+kM^+1m(j|i),j[k]\widehat{M}^{+1}(j|i)={\sum_{\ell=1}^{m}{\mathbf{1}_{\left\{{W_{i\ell}=j}\right\}}}+1\over m+k}\triangleq\widehat{M}^{+1}_{m}(j|i),\quad j\in[k]

is an add-one estimator for M(j|i)M(j|i) based on an iid sample of size mm. Lemma 17 below provides a high-probability bound for the add-one estimator in this iid setting. Using this result and the union bound, we have

i:πicn[D(M(|i)M^+1(|i))>ϵ(Ni),niNini+]\displaystyle\sum_{i:\pi_{i}\geq c_{n}}\mathbb{P}\left[D(M(\cdot|i)\|\widehat{M}^{+1}(\cdot|i))>\epsilon(N_{i}),n_{i}^{-}\leq N_{i}\leq n_{i}^{+}\right]
i:πicn(ni+ni)maxnimni+[D(M(|i)M^m+1(|i))>ϵ(m)]i:πicn1\overn2k\overn2\displaystyle\leq\sum_{i:\pi_{i}\geq c_{n}}\left(n_{i}^{+}-n_{i}^{-}\right)\max_{n_{i}^{-}\leq m\leq n_{i}^{+}}\mathbb{P}\left[D(M(\cdot|i)\|\widehat{M}_{m}^{+1}(\cdot|i))>\epsilon(m)\right]\leq\sum_{i:\pi_{i}\geq c_{n}}{1\over n^{2}}\leq{k\over n^{2}}

where the second inequality applies Lemma 17 with t=nni+mt=n\geq n_{i}^{+}\geq m and uses ni+ninπi/5n_{i}^{+}-n_{i}^{-}\leq n\pi_{i}/5 for πicn\pi_{i}\geq c_{n}.
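
The representation of the chain through the array $W$ used above is easy to simulate. The following sketch is purely illustrative (the transition matrix $M$ and all numerical values are our own choices): the $i$-th row of $W$ is an iid sample from $M(\cdot|i)$, the trajectory is generated by reading off fresh entries of the row indexed by the current state, and the empirical transition frequencies are compared against $M$.

```python
import numpy as np

# Illustrative sketch of the array representation: the i-th row of W is an iid
# sample from M(.|i), and the chain reads off fresh entries of row X_{t-1} at
# each step. The matrix M and all sizes below are our own choices.
rng = np.random.default_rng(1)
k, n = 3, 200000
M = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])
evals, evecs = np.linalg.eig(M.T)               # stationary distribution of M
pi = np.real(evecs[:, np.argmax(np.real(evals))]); pi /= pi.sum()

W = np.stack([rng.choice(k, size=n, p=M[i]) for i in range(k)])   # W[i, :] iid ~ M(.|i)
ptr = np.zeros(k, dtype=int)                    # next unused entry in each row of W
x = rng.choice(k, p=pi)                         # X_1 ~ pi
counts = np.zeros((k, k))
for _ in range(n - 1):
    nxt = W[x, ptr[x]]                          # first not-yet-sampled element of row x
    ptr[x] += 1
    counts[x, nxt] += 1
    x = nxt
print(np.round(counts / counts.sum(axis=1, keepdims=True), 3))    # approximately M
```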

Combining the above with (80), we continue (79) with τ=25\tau=25 to get

𝔼[i=1k𝟏{Xn=i}D(M(|i)M^+1(|i))]k2\overn+(logn)3k3/2\overn+k(log(n+k))2\overnγ\displaystyle\mathbb{E}\left[\sum_{i=1}^{k}{\mathbf{1}_{\left\{{X_{n}=i}\right\}}}D\left(M(\cdot|i)\|\widehat{M}^{+1}(\cdot|i)\right)\right]\lesssim{k^{2}\over n}+{(\log n)^{3}k^{3/2}\over n}+{k(\log(n+k))^{2}\over n\gamma}

which is O(k2\overn)O\left(k^{2}\over n\right) whenever k(logn)6k\geq(\log n)^{6} and γ(log(n+k))2\overk\gamma\geq{(\log(n+k))^{2}\over k}.

Lemma 17 (KL risk bound for add-one estimator).

Let V1,,VmiidQV_{1},\dots,V_{m}\stackrel{{\scriptstyle iid}}{{\sim}}Q for some distribution Q={Qi}i=1kQ=\left\{Q_{i}\right\}_{i=1}^{k} on [k][k]. Consider the add-one estimator Q^+1\widehat{Q}^{+1} with Q^+1i=1m+k(j=1m𝟏{Vj=i}+1)\widehat{Q}^{+1}_{i}=\frac{1}{m+k}(\sum_{j=1}^{m}{\mathbf{1}_{\left\{{V_{j}=i}\right\}}}+1). There exists an absolute constant c0c_{0} such that for any tmt\geq m,

[D(QQ^+1)2k\overm+c0(logt)3k\overm]1\overt3.\displaystyle\mathbb{P}\left[D(Q\|\widehat{Q}^{+1})\geq{2k\over m}+{c_{0}(\log t)^{3}\sqrt{k}\over m}\right]\leq{1\over t^{3}}.
Proof.

Let Q^\widehat{Q} be the empirical estimator Q^i=1mj=1m𝟏{Vj=i}\widehat{Q}_{i}=\frac{1}{m}\sum_{j=1}^{m}{\mathbf{1}_{\left\{{V_{j}=i}\right\}}}. Then Q^+1i=mQ^i+1\overm+k\widehat{Q}^{+1}_{i}={m\widehat{Q}_{i}+1\over m+k} and hence

D(QQ^+1)\displaystyle D(Q\|\widehat{Q}^{+1}) =i=1k(QilogQi\overQ^i+1Qi+Q^+1i)\displaystyle=\sum_{i=1}^{k}\left(Q_{i}\log{Q_{i}\over\widehat{Q}_{i}^{+1}}-Q_{i}+\widehat{Q}^{+1}_{i}\right)
=i=1k(QilogQi(m+k)\overmQ^i+1Qi+mQ^i+1\overm+k)\displaystyle=\sum_{i=1}^{k}\left(Q_{i}\log{Q_{i}(m+k)\over m\widehat{Q}_{i}+1}-Q_{i}+{m\widehat{Q}_{i}+1\over m+k}\right)
=i=1k(QilogQi\overQ^i+1mQi+Q^i+1m)+i=1k(Qilogm+k\overmkQ^i\overm+kk\overm(m+k))\displaystyle=\sum_{i=1}^{k}\left(Q_{i}\log{Q_{i}\over\widehat{Q}_{i}+\frac{1}{m}}-Q_{i}+\widehat{Q}_{i}+\frac{1}{m}\right)+\sum_{i=1}^{k}\left(Q_{i}\log{m+k\over m}-{k\widehat{Q}_{i}\over m+k}-{k\over m(m+k)}\right)
i=1k(QilogQi\overQ^i+1mQi+Q^i+1m)+km\displaystyle\leq\sum_{i=1}^{k}\left(Q_{i}\log{Q_{i}\over\widehat{Q}_{i}+\frac{1}{m}}-Q_{i}+\widehat{Q}_{i}+\frac{1}{m}\right)+\frac{k}{m} (81)

where the last inequality follows from $0\leq\log\left({m+k\over m}\right)\leq k/m$.

To control the sum in the above display it suffices to consider its Poissonized version. Specifically, we aim to show

[i=1k(QilogQi\overQ^i𝗉𝗈𝗂+1mQi+Q^i𝗉𝗈𝗂+1m)>k\overm+c0(logt)3k\overm]1\overt4\displaystyle\mathbb{P}\left[\sum_{i=1}^{k}\left(Q_{i}\log{Q_{i}\over\widehat{Q}_{i}^{\mathsf{poi}}+\frac{1}{m}}-Q_{i}+\widehat{Q}_{i}^{\mathsf{poi}}+\frac{1}{m}\right)>{k\over m}+{c_{0}(\log t)^{3}\sqrt{k}\over m}\right]\leq{1\over t^{4}} (82)

where mQ^𝗉𝗈𝗂i,i=1,,km\widehat{Q}^{\mathsf{poi}}_{i},i=1,\ldots,k are distributed independently as Poi(mQi)\mathrm{Poi}(mQ_{i}). (Here and below Poi(λ)\mathrm{Poi}(\lambda) denotes the Poisson distribution with mean λ\lambda.) To see why (82) implies the desired result, letting w=k\overm+c0(logt)3k\overmw={k\over m}+{c_{0}(\log t)^{3}\sqrt{k}\over m} and Y=i=1kmQ^i𝗉𝗈𝗂Poi(m)Y=\sum_{i=1}^{k}m\widehat{Q}_{i}^{\mathsf{poi}}\sim\mathrm{Poi}(m), we have

[i=1k(QilogQi\overQ^i+1mQi+Q^i+1m)>w]\displaystyle\mathbb{P}\left[\sum_{i=1}^{k}\left(Q_{i}\log{Q_{i}\over\widehat{Q}_{i}+\frac{1}{m}}-Q_{i}+\widehat{Q}_{i}+\frac{1}{m}\right)>w\right]
\displaystyle\overset{\rm(a)}{=}\mathbb{P}\left[\left.\sum_{i=1}^{k}\left(Q_{i}\log{Q_{i}\over\widehat{Q}_{i}^{\mathsf{poi}}+\frac{1}{m}}-Q_{i}+\widehat{Q}_{i}^{\mathsf{poi}}+\frac{1}{m}\right)>w\right|\sum_{i=1}^{k}\widehat{Q}^{\mathsf{poi}}_{i}=1\right]
(b)1\overt4[Y=m]=m!\overt4emmm(c)m\overt41t3.\displaystyle\overset{\rm(b)}{\leq}{1\over t^{4}\mathbb{P}[Y=m]}={m!\over t^{4}e^{-m}m^{m}}\overset{\rm(c)}{\lesssim}{\sqrt{m}\over t^{4}}\leq\frac{1}{t^{3}}. (83)

where (a) follows from the fact that, conditioned on their sum, independent Poisson random variables follow a multinomial distribution; (b) applies (82); (c) follows from Stirling’s approximation.

To prove (82) we rely on concentration inequalities for sub-exponential distributions. A random variable XX is called sub-exponential with parameters σ2,b>0\sigma^{2},b>0, denoted as 𝖲𝖤(σ2,b)\mathsf{SE}(\sigma^{2},b) if

𝔼[eλ(X𝔼[X])]eλ2σ2\over2,|λ|<1b.\displaystyle\mathbb{E}\left[e^{\lambda(X-\mathbb{E}[X])}\right]\leq e^{\lambda^{2}\sigma^{2}\over 2},\quad\forall|\lambda|<\frac{1}{b}. (84)

Sub-exponential random variables satisfy the following properties [Wai19, Sec. 2.1.3]:

  • If $X$ is $\mathsf{SE}(\sigma^{2},b)$, then for any $v>0$

    [|X𝔼[X]|v]{2ev2/(2σ2),0<vσ2\overb2ev/(2b),v>σ2\overb.\displaystyle\mathbb{P}\left[\left|X-\mathbb{E}[X]\right|\geq v\right]\leq\begin{cases}2e^{-v^{2}/(2\sigma^{2})},&0<v\leq{\sigma^{2}\over b}\\ 2e^{-v/(2b)},&v>{\sigma^{2}\over b}.\end{cases} (85)
  • Bernstein condition: A random variable XX is 𝖲𝖤(σ2,b)\mathsf{SE}(\sigma^{2},b) if it satisfies

    𝔼[|X𝔼[X]|]12!σ2b2,=2,3,.\displaystyle\mathbb{E}\left[\left|X-\mathbb{E}[X]\right|^{\ell}\right]\leq\frac{1}{2}\ell!\sigma^{2}b^{\ell-2},\quad\ell=2,3,\dots. (86)
  • If X1,,XkX_{1},\dots,X_{k} are independent 𝖲𝖤(σ2,b)\mathsf{SE}(\sigma^{2},b), then i=1kXi\sum_{i=1}^{k}X_{i} is 𝖲𝖤(kσ2,b)\mathsf{SE}(k\sigma^{2},b).

Define $X_{i}=Q_{i}\log{Q_{i}\over\widehat{Q}_{i}^{\mathsf{poi}}+\frac{1}{m}}-Q_{i}+\widehat{Q}_{i}^{\mathsf{poi}}+\frac{1}{m}$, $i\in[k]$. Then Lemma 18 below shows that the $X_{i}$'s are independent $\mathsf{SE}(\sigma^{2},b)$ with $\sigma^{2}={c_{1}(\log m)^{4}\over m^{2}}$, $b={c_{2}(\log m)^{2}\over m}$ for absolute constants $c_{1},c_{2}$, and hence $\sum_{i=1}^{k}\left(X_{i}-\mathbb{E}[X_{i}]\right)$ is $\mathsf{SE}(k\sigma^{2},b)$. In view of (85), for the choice $c_{0}=8(c_{1}+c_{2})$ this implies

\displaystyle\mathbb{P}\left[\sum_{i=1}^{k}\left(X_{i}-\mathbb{E}[X_{i}]\right)\geq c_{0}{(\log t)^{3}\sqrt{k}\over m}\right]\leq 2e^{-{c_{0}^{2}k(\log t)^{6}\over 2m^{2}\sigma^{2}}}+2e^{-{c_{0}\sqrt{k}(\log t)^{3}\over 2mb}}\leq\frac{1}{t^{4}}. (87)

Using $0\leq y\log y-y+1\leq(y-1)^{2}$ for $y>0$, and $\mathbb{E}\left[\frac{\lambda}{\mathrm{Poi}(\lambda)+1}\right]=\sum_{v=0}^{\infty}{e^{-\lambda}\lambda^{v+1}\over(v+1)!}=1-e^{-\lambda}$, we get

𝔼[i=1kXi]\displaystyle\mathbb{E}\left[\sum_{i=1}^{k}X_{i}\right] 𝔼[i=1k(Qi(Q^i𝗉𝗈𝗂+1m))2\overQ^i𝗉𝗈𝗂+1m]\displaystyle\leq\mathbb{E}\left[\sum_{i=1}^{k}{\left(Q_{i}-\left(\widehat{Q}_{i}^{\mathsf{poi}}+\frac{1}{m}\right)\right)^{2}\over\widehat{Q}_{i}^{\mathsf{poi}}+\frac{1}{m}}\right]
=i=1kmQi2𝔼[1\overmQ^i𝗉𝗈𝗂+1]1+km=i=1kQi(1emQi)1+kmkm.\displaystyle=\sum_{i=1}^{k}mQ_{i}^{2}\mathbb{E}\left[1\over m\widehat{Q}_{i}^{\mathsf{poi}}+1\right]-1+\frac{k}{m}=\sum_{i=1}^{k}Q_{i}\left(1-e^{-mQ_{i}}\right)-1+\frac{k}{m}\leq\frac{k}{m}.

Combining the above with (87) we get (82) as required. ∎
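
As an illustrative aside (not part of the proof), the concentration described by Lemma 17 is easy to observe numerically: the KL loss of the add-one estimator is typically of the order of the leading term $2k/m$. In the sketch below, the distribution $Q$, the values of $k$ and $m$, and the number of repetitions are our own choices.

```python
import numpy as np

# Illustrative sketch: sample the KL loss of the add-one estimator and compare
# its typical size with the leading term 2k/m in Lemma 17. Q, k, m are ours.
rng = np.random.default_rng(2)
k, m, reps = 10, 500, 5000
Q = rng.dirichlet(np.ones(k))
losses = np.empty(reps)
for r in range(reps):
    counts = rng.multinomial(m, Q)
    Qhat = (counts + 1) / (m + k)                  # add-one estimator
    losses[r] = np.sum(Q * np.log(Q / Qhat))       # D(Q || Qhat), finite since Qhat > 0
print("2k/m =", 2 * k / m)
print("mean loss =", losses.mean(), " 99.9% quantile =", np.quantile(losses, 0.999))
```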

Lemma 18.

There exist absolute constants c1,c2c_{1},c_{2} such that the following holds. For any p(0,1)p\in(0,1) and nYPoi(np)nY\sim\mathrm{Poi}(np), X=plogp\overY+1np+Y+1nX=p\log{p\over Y+\frac{1}{n}}-p+Y+\frac{1}{n} is 𝖲𝖤(c1(logn)4\overn2,c2(logn)2\overn)\mathsf{SE}\left({c_{1}(\log n)^{4}\over n^{2}},{c_{2}(\log n)^{2}\over n}\right).

Proof.

Note that $X$ is a non-negative random variable. Since $\mathbb{E}\left[\left(X-\mathbb{E}[X]\right)^{\ell}\right]\leq 2^{\ell}\mathbb{E}\left[X^{\ell}\right]$, by the Bernstein condition (86) it suffices to show $\mathbb{E}[X^{\ell}]\leq\left({c_{3}\ell(\log n)^{2}\over n}\right)^{\ell}$ for $\ell=2,3,\dots$ and some absolute constant $c_{3}$. The analysis is divided into the following two cases, for some absolute constant $c_{4}\geq 24$.

Case I pc4logn\overnp\geq{c_{4}\ell\log n\over n}:

Using the Chernoff bound for the Poisson distribution [Jan02, Theorem 3]

[|Poi(λ)λ|>x]2ex2\over2(λ+x/3),λ,x>0,\displaystyle\mathbb{P}\left[|\mathrm{Poi}(\lambda)-\lambda|>x\right]\leq 2e^{-{x^{2}\over 2(\lambda+x/3)}},\quad\lambda,x>0, (88)

we get

[|Yp|>c4plogn\over4n]\displaystyle\mathbb{P}\left[|Y-p|>\sqrt{c_{4}\ell p\log n\over 4n}\right] 2exp(c4nplogn\over8np+2c4nplogn)\displaystyle\leq 2\mathop{\rm exp}\left(-{c_{4}n\ell p\log n\over 8np+2\sqrt{c_{4}n\ell p\log n}}\right)
2exp(c4logn\over8+2c4logn/np)1\overn2\displaystyle\leq 2\mathop{\rm exp}\left(-{c_{4}\ell\log n\over{8+2\sqrt{c_{4}\ell\log n/np}}}\right)\leq{1\over n^{2\ell}} (89)

which implies p/2Y2pp/2\leq Y\leq 2p with probability at least 1n21-{n^{-2\ell}}. Since 0X(Yp1n)2\overY+1n0\leq X\leq{(Y-p-\frac{1}{n})^{2}\over Y+\frac{1}{n}}, we get 𝔼[X](c4plogn/4n)2\over(p/2)+n\overn2(c4logn\overn).\mathbb{E}[X^{\ell}]\lesssim{\left(\sqrt{{c_{4}}\ell p\log n/4n}\right)^{2\ell}\over(p/2)^{\ell}}+{n^{\ell}\over n^{2\ell}}\lesssim\left(c_{4}\ell\log n\over n\right)^{\ell}.

Case II p<c4logn\overnp<{c_{4}\ell\log n\over n}:
  • On the event {Y>p}\{Y>p\}, we have XY+1n2YX\leq Y+\frac{1}{n}\leq 2Y, where the last inequality follows because nYnY takes non-negative integer values. Since X0X\geq 0, we have X𝟏{Y>p}(2Y)𝟏{Y>p}X^{\ell}{\mathbf{1}_{\left\{{Y>p}\right\}}}\leq(2Y)^{\ell}{\mathbf{1}_{\left\{{Y>p}\right\}}} for any 2\ell\geq 2. Using the Chernoff bound (88), we get Y2c4logn\overnY\leq{2c_{4}\ell\log n\over n} with probability at least 1n21-n^{-2\ell}, which implies

    𝔼[X𝟏{Yp}]\displaystyle\mathbb{E}\left[X^{\ell}{\mathbf{1}_{\left\{{Y\geq p}\right\}}}\right] 𝔼[(2Y)𝟏{Y>p,Y2c4logn\overn}]+𝔼[(2Y)𝟏{Y>p,Y>2c4logn\overn}]\displaystyle\leq\mathbb{E}\left[(2Y)^{\ell}{\mathbf{1}_{\left\{{Y>p,Y\leq{2c_{4}\ell\log n\over n}}\right\}}}\right]+\mathbb{E}\left[(2Y)^{\ell}{\mathbf{1}_{\left\{{Y>p,Y>{2c_{4}\ell\log n\over n}}\right\}}}\right]
    (4c4logn\overn)+2(𝔼[Y2][Y>2c4logn\overn])12(c5logn\overn)\displaystyle\leq\left(4c_{4}\ell\log n\over n\right)^{\ell}+2^{\ell}\left(\mathbb{E}[Y^{2\ell}]\mathbb{P}\left[Y>{2c_{4}\ell\log n\over n}\right]\right)^{\frac{1}{2}}\leq\left(c_{5}\ell\log n\over n\right)^{\ell}

    for an absolute constant $c_{5}$. Here, the last inequality follows from Cauchy–Schwarz and the Poisson moment bound [Ahl21, Theorem 2.1] (for a result with less precise constants, see also [Ahl21, Eq. (1)], which is based on [Lat97, Corollary 1]): $\mathbb{E}[(nY)^{2\ell}]\leq\left({2\ell\over\log\left(1+{2\ell\over np}\right)}\right)^{2\ell}\leq\left(c_{6}\ell\log n\right)^{2\ell}$ for some absolute constant $c_{6}$, where the second inequality uses the assumption $p<{c_{4}\ell\log n\over n}$.

  • As $X{\mathbf{1}_{\left\{Y\leq p\right\}}}\leq p\log n+\frac{1}{n}\lesssim{\ell(\log n)^{2}\over n}$, we get $\mathbb{E}\left[X^{\ell}{\mathbf{1}_{\left\{Y\leq p\right\}}}\right]\leq\left({c_{7}\ell(\log n)^{2}\over n}\right)^{\ell}$ for some absolute constant $c_{7}$. ∎

4.2.3 Proof of Corollary 4

We show the following monotonicity result of the prediction risk. In view of this result, Corollary 4 immediately follows from Theorem 2 and Theorem 3 (i). Intuitively, the optimal prediction risk is monotonically increasing with the number of states; this, however, does not follow immediately due to the extra assumptions of irreducibility, reversibility, and prescribed spectral gap.

Lemma 19.

𝖱𝗂𝗌𝗄k+1,n(γ0)𝖱𝗂𝗌𝗄k,n(γ0)\mathsf{Risk}_{k+1,n}(\gamma_{0})\geq\mathsf{Risk}_{k,n}(\gamma_{0}) for all γ0(0,1),k2\gamma_{0}\in(0,1),k\geq 2.

Proof.

Fix an $M\in{\mathcal{M}}_{k}(\gamma_{0})$ such that $\gamma_{*}(M)>\gamma_{0}$, and denote by $\pi$ its stationary distribution, so that $\pi M=\pi$. Fix $\delta\in(0,1)$ and define a transition matrix $\widetilde{M}$ with $k+1$ states as follows:

M~=((1δ)Mδ𝟏(1δ)πδ)\widetilde{M}=\begin{pmatrix}(1-\delta)M&\delta\mathbf{1}\\ (1-\delta)\pi&\delta\end{pmatrix}

One can verify the following:

  • M~\widetilde{M} is irreducible and reversible;

  • The stationary distribution for M~\widetilde{M} is π~=((1δ)π,δ)\widetilde{\pi}=((1-\delta)\pi,\delta)

  • The absolute spectral gap of $\widetilde{M}$ satisfies $\gamma_{*}(\widetilde{M})\geq(1-\delta)\gamma_{*}(M)$, so that $\widetilde{M}\in{\mathcal{M}}_{k+1}(\gamma_{0})$ for all sufficiently small $\delta$.

  • Let $(X_{1},\ldots,X_{n})$ and $(\widetilde{X}_{1},\ldots,\widetilde{X}_{n})$ be stationary Markov chains with transition matrices $M$ and $\widetilde{M}$, respectively. Then as $\delta\to 0$, $(\widetilde{X}_{1},\ldots,\widetilde{X}_{n})$ converges to $(X_{1},\ldots,X_{n})$ in law, i.e., the joint probability mass function converges pointwise.
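
The properties listed above are straightforward to check numerically on a small example. The sketch below is illustrative only (the particular reversible birth–death kernel $M$, its stationary distribution, and the value of $\delta$ are our own choices): it verifies stationarity of $\widetilde{\pi}=((1-\delta)\pi,\delta)$, detailed balance for $\widetilde{M}$, and the spectral-gap comparison.

```python
import numpy as np

# Illustrative check of the construction above on a small reversible chain
# (the birth-death kernel M, pi, and delta below are our own choices).
def abs_spectral_gap(K, mu):
    D = np.diag(np.sqrt(mu))
    S = D @ K @ np.linalg.inv(D)                   # symmetric since K is mu-reversible
    lam = np.sort(np.abs(np.linalg.eigvalsh(S)))[::-1]
    return 1 - lam[1]

k, delta = 4, 0.05
M = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.25, 0.5, 0.25, 0.0],
              [0.0, 0.25, 0.5, 0.25],
              [0.0, 0.0, 0.5, 0.5]])
pi = np.array([1, 2, 2, 1]) / 6.0                  # stationary and reversible for this M

Mt = np.block([[(1 - delta) * M, delta * np.ones((k, 1))],
               [(1 - delta) * pi.reshape(1, -1), np.array([[delta]])]])
pit = np.concatenate([(1 - delta) * pi, [delta]])
print(np.allclose(pit @ Mt, pit))                                   # stationarity of pit
print(np.allclose(pit[:, None] * Mt, (pit[:, None] * Mt).T))        # detailed balance
g_big, g_small = abs_spectral_gap(Mt, pit), abs_spectral_gap(M, pi)
print(g_big >= (1 - delta) * g_small, g_big, g_small)               # gap comparison
```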

Next fix any estimator $\widehat{M}$ for the state space $[k+1]$. Note that without loss of generality we can assume $\widehat{M}(j|i)>0$ for all $i,j\in[k+1]$, for otherwise the KL risk is infinite. Define $\widehat{M}^{\text{trunc}}$ as $\widehat{M}$ without the $(k+1)$-th row and column, and denote by $\widehat{M}^{\prime}$ its normalized version, namely, $\widehat{M}^{\prime}(\cdot|i)={\widehat{M}^{\text{trunc}}(\cdot|i)\over 1-\widehat{M}(k+1|i)}$ for $i=1,\ldots,k$. Then

𝔼X~n[D(M~(|X~n)M^(|X~n))]δ0\displaystyle\mathbb{E}_{\widetilde{X}^{n}}\left[D(\widetilde{M}(\cdot|\widetilde{X}_{n})\|{\widehat{M}(\cdot|\widetilde{X}_{n})})\right]\xrightarrow{\delta\to 0} 𝔼Xn[D(M(|Xn)M^(|Xn))]\displaystyle~{}\mathbb{E}_{X^{n}}\left[D(M(\cdot|X_{n})\|{\widehat{M}(\cdot|X_{n})})\right]
\displaystyle\geq 𝔼Xn[D(M(|Xn)M^(|Xn))]\displaystyle~{}\mathbb{E}_{X^{n}}\left[D(M(\cdot|X_{n})\|{\widehat{M}^{\prime}(\cdot|X_{n})})\right]
\displaystyle\geq infM^𝔼Xn[D(M(|Xn)M^(|Xn))]\displaystyle~{}\inf_{\widehat{M}}\mathbb{E}_{X^{n}}\left[D(M(\cdot|X_{n})\|{\widehat{M}(\cdot|X_{n})})\right]

where in the first step we applied the convergence in law of $\widetilde{X}^{n}$ to $X^{n}$ and the continuity of $P\mapsto D(P\|Q)$ for fixed componentwise positive $Q$; in the second step we used the fact that for any sub-probability measure $Q=(q_{i})$ and its normalized version $\bar{Q}=Q/\alpha$ with $\alpha=\sum q_{i}\leq 1$, we have $D(P\|Q)=D(P\|\bar{Q})+\log\frac{1}{\alpha}\geq D(P\|\bar{Q})$. Taking the supremum over $\widetilde{M}\in{\mathcal{M}}_{k+1}(\gamma_{0})$ on the LHS and the supremum over $M\in{\mathcal{M}}_{k}(\gamma_{0})$ on the RHS, and finally the infimum over $\widehat{M}$ on the LHS, we conclude $\mathsf{Risk}_{k+1,n}(\gamma_{0})\geq\mathsf{Risk}_{k,n}(\gamma_{0})$. ∎

5 Higher-order Markov chains

5.1 Basic setups

In this section we prove Theorem 5. We start with some basic definitions for higher-order Markov chains. Let $m\geq 1$ and let $X_{1},X_{2},\dots$ be an $m^{\text{th}}$-order Markov chain with state space ${\mathcal{S}}$ and transition matrix $M\in\mathbb{R}^{{\mathcal{S}}^{m}\times{\mathcal{S}}}$, so that $\mathbb{P}\left[X_{t+1}=x_{t+1}|X_{t-m+1}^{t}=x_{t-m+1}^{t}\right]=M(x_{t+1}|x_{t-m+1}^{t})$ for all $t\geq m$. Clearly, the joint distribution of the process is specified by the transition matrix and the initial distribution, which is a joint distribution for $(X_{1},\ldots,X_{m})$.

A distribution π\pi on 𝒮m{\mathcal{S}}^{m} is a stationary distribution if {Xt:t1}\{X_{t}:t\geq 1\} with (X1,,Xm)π(X_{1},\ldots,X_{m})\sim\pi is a stationary process, that is,

(Xi1+t,,Xin+t)=law(Xi1,,Xin),n,i1,,in,t.(X_{i_{1}+t},\ldots,X_{i_{n}+t}){\stackrel{{\scriptstyle\rm law}}{{=}}}(X_{i_{1}},\ldots,X_{i_{n}}),\quad\forall n,i_{1},\ldots,i_{n},t\in\mathbb{N}. (90)

It is clear that (90) is equivalent to (X1,,Xm)=law(X2,,Xm+1)(X_{1},\ldots,X_{m}){\stackrel{{\scriptstyle\rm law}}{{=}}}(X_{2},\ldots,X_{m+1}). In other words, π\pi is the solution to the linear system:

\displaystyle\pi(x_{1},\ldots,x_{m})=\sum_{x_{0}\in{\mathcal{S}}}\pi(x_{0},x_{1},\ldots,x_{m-1})M(x_{m}|x_{0},x_{1},\ldots,x_{m-1}),\quad\forall x_{1},\ldots,x_{m}\in{\mathcal{S}}. (91)

Note that this implies, in particular, that $\pi$, as a joint distribution of $m$-tuples, must itself satisfy the consistency properties required by stationarity, such as all of its one-dimensional marginals being identical.
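
A concrete way to compute and check such a stationary distribution is to pass to the induced first-order chain on ${\mathcal{S}}^{m}$, whose states are $m$-tuples. The sketch below is illustrative only (the values of $k$, $m$ and the randomly drawn $M$ are our own choices): it computes the stationary distribution of the induced chain and verifies that it solves the linear system (91).

```python
import numpy as np
from itertools import product

# Illustrative sketch: compute a stationary distribution of an m-th order chain
# via the induced first-order chain on S^m and check the system (91).
# The values of k, m and the random M are our own choices.
rng = np.random.default_rng(3)
k, m = 3, 2
states = list(product(range(k), repeat=m))               # S^m
idx = {s: i for i, s in enumerate(states)}
M = rng.dirichlet(np.ones(k), size=len(states))          # M[idx[x^m], s] = M(s | x^m)

P = np.zeros((len(states), len(states)))                 # induced chain on S^m
for s_from in states:
    for nxt in range(k):
        P[idx[s_from], idx[s_from[1:] + (nxt,)]] += M[idx[s_from], nxt]

evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))]); pi /= pi.sum()

# (91): pi(x_1..x_m) = sum_{x_0} pi(x_0, x_1, .., x_{m-1}) * M(x_m | x_0, .., x_{m-1})
lhs = np.array([pi[idx[s]] for s in states])
rhs = np.array([sum(pi[idx[(x0,) + s[:-1]]] * M[idx[(x0,) + s[:-1]], s[-1]]
                    for x0 in range(k)) for s in states])
print(np.allclose(lhs, rhs))
```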

Next we discuss reversibility. A random process {Xt}\{X_{t}\} is reversible if for any nn,

Xn=lawXn¯,X^{n}~{}{\stackrel{{\scriptstyle\rm law}}{{=}}}~{}\overline{X^{n}}, (92)

where Xn¯(Xn,,X1)\overline{X^{n}}\triangleq(X_{n},\ldots,X_{1}) denotes the time reversal of Xn=(X1,,Xn)X^{n}=(X_{1},\ldots,X_{n}). Note that a reversible mthm^{\text{th}}-order Markov chain must be stationary. Indeed,

(X2,,Xm+1)=law(Xm,,X1)=law(X1,,Xm),(X_{2},\ldots,X_{m+1}){\stackrel{{\scriptstyle\rm law}}{{=}}}(X_{m},\ldots,X_{1}){\stackrel{{\scriptstyle\rm law}}{{=}}}(X_{1},\ldots,X_{m}), (93)

where the first equality follows from (X1,,Xm+1)=law(Xm+1,,X1)(X_{1},\ldots,X_{m+1}){\stackrel{{\scriptstyle\rm law}}{{=}}}(X_{m+1},\ldots,X_{1}). The following lemma gives a characterization for reversibility:

Lemma 20.

An mthm^{\text{th}}-order stationary Markov chain is reversible if and only if (92) holds for n=m+1n=m+1, namely

π(x1,,xm)M(xm+1|x1,,xm)=π(xm+1,,x2)M(x1|xm+1,,x2),x1,,xm+1𝒮.\pi(x_{1},\ldots,x_{m})M(x_{m+1}|x_{1},\ldots,x_{m})=\pi(x_{m+1},\ldots,x_{2})M(x_{1}|x_{m+1},\ldots,x_{2}),\quad\forall x_{1},\ldots,x_{m+1}\in{\mathcal{S}}. (94)
Proof.

First, we show that (92) for n=m+1n=m+1 implies that for nmn\leq m. Indeed,

(X1,,Xn)=law(Xm+1,,Xmn+2)=law(Xn,,X1)\displaystyle(X_{1},\ldots,X_{n}){\stackrel{{\scriptstyle\rm law}}{{=}}}(X_{m+1},\ldots,X_{m-n+2}){\stackrel{{\scriptstyle\rm law}}{{=}}}(X_{n},\ldots,X_{1}) (95)

where the first equality follows from (X1,,Xm+1)=law(Xm+1,,X1)(X_{1},\ldots,X_{m+1}){\stackrel{{\scriptstyle\rm law}}{{=}}}(X_{m+1},\ldots,X_{1}) and the second applies stationarity.

Next, we show (92) for n=m+2n=m+2 and the rest follows from induction on nn. Indeed,

[(X1,,Xm+2)=(x1,,xm+2)]\displaystyle~{}\mathbb{P}\left[(X_{1},\ldots,X_{m+2})=(x_{1},\ldots,x_{m+2})\right]
=\displaystyle= π(x1,,xm)M(xm+1|x1,,xm)M(xm+2|x2,,xm+1)\displaystyle~{}\pi(x_{1},\ldots,x_{m})M(x_{m+1}|x_{1},\ldots,x_{m})M(x_{m+2}|x_{2},\ldots,x_{m+1})
=(a)\displaystyle\overset{\rm(a)}{=} π(xm+1,,x2)M(x1|xm+1,,x2)M(xm+2|x2,,xm+1)\displaystyle~{}\pi(x_{m+1},\ldots,x_{2})M(x_{1}|x_{m+1},\ldots,x_{2})M(x_{m+2}|x_{2},\ldots,x_{m+1})
=(b)\displaystyle\overset{\rm(b)}{=} π(x2,,xm+1)M(x1|xm+1,,x2)M(xm+2|x2,,xm+1)\displaystyle~{}\pi(x_{2},\ldots,x_{m+1})M(x_{1}|x_{m+1},\ldots,x_{2})M(x_{m+2}|x_{2},\ldots,x_{m+1})
=(c)\displaystyle\overset{\rm(c)}{=} π(xm+2,,x3)M(x2|xm+2,,x3)M(x1|xm+1,,x2)\displaystyle~{}\pi(x_{m+2},\ldots,x_{3})M(x_{2}|x_{m+2},\ldots,x_{3})M(x_{1}|x_{m+1},\ldots,x_{2})
=\displaystyle= [(X1,,Xm+2)=(xm+2,,x1)]=[(Xm+2,,X1)=(x1,,xm+2)].\displaystyle~{}\mathbb{P}\left[(X_{1},\ldots,X_{m+2})=(x_{m+2},\ldots,x_{1})\right]=\mathbb{P}\left[(X_{m+2},\ldots,X_{1})=(x_{1},\ldots,x_{m+2})\right].

where (a) and (c) apply (92) for n=m+1n=m+1, namely, (94); (b) applies (92) for n=mn=m. ∎

In view of the proof of (93), we note that any distribution π\pi on 𝒮m{\mathcal{S}}^{m} and mthm^{\text{th}}-order transition matrix MM satisfying π(xm)=π(xm¯)\pi(x^{m})=\pi(\overline{x^{m}}) and (94) also satisfy (91). This implies such a π\pi is a stationary distribution for MM. In view of Lemma 20 the above conditions also guarantee reversibility. This observation can be summarized in the following lemma, which will be used to prove the reversibility of specific Markov chains later.

Lemma 21.

Let MM be a km×kk^{m}\times k stochastic matrix describing transitions from 𝒮m{\mathcal{S}}^{m} to 𝒮{\mathcal{S}}. Suppose that π\pi is a distribution on 𝒮m{\mathcal{S}}^{m} such that π(xm)=π(xm¯)\pi(x^{m})=\pi(\overline{x^{m}}) and π(xm)M(xm+1|xm)=π(x2m+1¯)M(x1|x2m+1¯)\pi(x^{m})M(x_{m+1}|x^{m})=\pi(\overline{x_{2}^{m+1}})M(x_{1}|\overline{x_{2}^{m+1}}). Then π\pi is the stationary distribution of MM and the resulting chain is reversible.

For $m^{\rm th}$-order stationary Markov chains, the optimal prediction risk is defined as

𝖱𝗂𝗌𝗄k,n,m\displaystyle{\mathsf{Risk}}_{k,n,m} infM^supM𝔼[D(M(|Xnm+1n)M^(|Xnm+1n))]\displaystyle\triangleq\inf_{\widehat{M}}\sup_{M}\mathbb{E}[D(M(\cdot|X_{n-m+1}^{n})\|\widehat{M}(\cdot|X_{n-m+1}^{n}))]
=infM^supMxm𝒮m𝔼[D(M(|xm)M^(|xm))𝟏{Xnm+1n=xm}]\displaystyle=\inf_{\widehat{M}}\sup_{M}\sum_{x^{m}\in{\mathcal{S}}^{m}}\mathbb{E}[D(M(\cdot|x^{m})\|\widehat{M}(\cdot|x^{m})){\mathbf{1}_{\left\{{X_{n-m+1}^{n}=x^{m}}\right\}}}] (96)

where the supremum is taken over all km×kk^{m}\times k stochastic matrices MM and the trajectory is initiated from the stationary distribution. In the remainder of this section we will show the following result, completing the proof of Theorem 5 previously announced in Section 1.

Theorem 22.

For all $m\geq 2$, there exists a constant $C_{m}>0$ such that for all $2\leq k\leq n^{\frac{1}{m+1}}/C_{m}$,

km+1\overCmnlog(n\overkm+1)𝖱𝗂𝗌𝗄k,n,mCmkm+1\overnlog(nkm+1).{k^{m+1}\over C_{m}n}\log\left(n\over k^{m+1}\right)\leq\mathsf{Risk}_{k,n,m}\leq{C_{m}k^{m+1}\over n}\log\left(\frac{n}{k^{m+1}}\right).

Furthermore, the lower bound holds even when the Markov chains are required to be reversible.

5.2 Upper bound

We prove the upper bound part of the preceding theorem, using only stationarity (not reversibility). We rely on techniques from [CS04, Chapter 6, Page 486] for proving redundancy bounds for the mthm^{\text{th}}-order chains. Let QQ be the probability assignment given by

Q(xn)=1kmam𝒮mj=1kNamj!\overk(k+1)(Nam+k1),\displaystyle Q(x^{n})=\frac{1}{k^{m}}\prod_{a^{m}\in{\mathcal{S}}^{m}}{\prod_{j=1}^{k}N_{a^{m}j}!\over k\cdot(k+1)\cdots(N_{a^{m}}+k-1)}, (97)

where NamjN_{a^{m}j} denotes the number of times the block amja^{m}j occurs in xnx^{n}, and Nam=j=1kNamjN_{a^{m}}=\sum_{j=1}^{k}N_{a^{m}j} is the number of times the block ama^{m} occurs in xn1x^{n-1}. This probability assignment corresponds to the add-one rule

Q(j|xn)=M^xn+1(j|xnm+1n)=Nxnm+1nj+1\overNxnm+1n+k.\displaystyle Q(j|x^{n})=\widehat{M}_{x^{n}}^{+1}(j|x_{n-m+1}^{n})={N_{x_{n-m+1}^{n}j}+1\over N_{x_{n-m+1}^{n}}+k}. (98)
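
As an illustration of why (97) and (98) describe the same probability assignment, the sketch below (not part of the proof; the alphabet size, order, and test string are our own choices) computes $Q(x^{n})$ both by multiplying the sequential add-one predictions (98) and directly from the product formula (97), and checks that the two agree. The agreement reflects that, within each context $a^{m}$, the running add-one factors telescope into $\prod_{j}N_{a^{m}j}!$ over $k(k+1)\cdots(N_{a^{m}}+k-1)$.

```python
from math import factorial, prod

# Illustrative sketch: the sequential add-one rule (98) reproduces the
# probability assignment (97). Alphabet size k, order m and the string are ours.
def Q_sequential(x, k, m):
    prob = 1.0 / k**m                                   # uniform over the first m symbols
    counts = {}
    for t in range(m, len(x)):
        ctx = tuple(x[t - m:t])
        row = counts.setdefault(ctx, [0] * k)
        prob *= (row[x[t]] + 1) / (sum(row) + k)         # add-one predictive rule (98)
        row[x[t]] += 1
    return prob

def Q_product(x, k, m):
    counts = {}
    for t in range(m, len(x)):
        ctx = tuple(x[t - m:t])
        counts.setdefault(ctx, [0] * k)[x[t]] += 1
    prob = 1.0 / k**m
    for row in counts.values():                          # one factor per context a^m in (97)
        N = sum(row)
        prob *= prod(factorial(c) for c in row) / prod(k + i for i in range(N))
    return prob

k, m = 2, 2
x = [0, 1, 1, 0, 1, 1, 1, 0, 0, 1]
qs, qp = Q_sequential(x, k, m), Q_product(x, k, m)
print(qs, qp, abs(qs - qp) < 1e-12)
```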

Then in view of Lemma 6, the following lemma proves the desired upper bound in Theorem 22.

Lemma 23.

Let $\mathsf{Red}(Q_{X^{n}})$ be the redundancy of the $m^{\text{th}}$-order Markov chain, as defined in Section 2.1, where $X^{n}$ is the observed trajectory. Then

𝖱𝖾𝖽(QXn)1nm{km(k1)[log(1+nmkm(k1))+1]+mlogk}.\mathsf{Red}(Q_{X^{n}})\leq\frac{1}{n-m}\left\{{k^{m}(k-1)}\left[\log\left(1+\frac{n-m}{k^{m}(k-1)}\right)+1\right]+m\log k\right\}.
Proof.

We show that for every Markov chain with transition matrix MM and initial distribution π\pi on 𝒮m{\mathcal{S}}^{m}, and every trajectory (x1,,xn)(x_{1},\cdots,x_{n}), it holds that

logπ(x1m)t=mn1M(xt+1|xttm+1)Q(x1,,xn)km(k1)[log(1+nmkm(k1))+1]+mlogk,\displaystyle\log\frac{\pi(x_{1}^{m})\prod_{t=m}^{n-1}M(x_{t+1}|x^{t}_{t-m+1})}{Q(x_{1},\cdots,x_{n})}\leq k^{m}(k-1)\left[\log\left(1+\frac{n-m}{k^{m}(k-1)}\right)+1\right]+m\log k, (99)

where $M(x_{t+1}|x^{t}_{t-m+1})$ denotes the transition probability of going from $x^{t}_{t-m+1}$ to $x_{t+1}$. Note that

t=mn1M(xt+1|xttm+1)=am+1𝒮m+1M(am+1|am)Nam+1am+1𝒮m+1(Nam+1/Nam)Nam+1,\prod_{t=m}^{n-1}M(x_{t+1}|x^{t}_{t-m+1})=\prod_{a^{m+1}\in{\mathcal{S}}^{m+1}}M(a_{m+1}|a^{m})^{N_{a^{m+1}}}\leq\prod_{a^{m+1}\in{\mathcal{S}}^{m+1}}(N_{a^{m+1}}/N_{a^{m}})^{N_{a^{m+1}}},

where the last inequality follows from $\sum_{a_{m+1}\in{\mathcal{S}}}\frac{N_{a^{m+1}}}{N_{a^{m}}}\log\frac{N_{a^{m+1}}}{N_{a^{m}}M(a_{m+1}|a^{m})}\geq 0$ for each $a^{m}$, by the non-negativity of the KL divergence. Therefore, we have

π(x1m)t=mn1M(xt+1|xtm+1t)Q(x1,,xn)kmam𝒮mk(k+1)(Nam+k1)NamNamam+1𝒮Nam+1Nam+1Nam+1!.\displaystyle\frac{\pi(x_{1}^{m})\prod_{t=m}^{n-1}M(x_{t+1}|x_{t-m+1}^{t})}{Q(x_{1},\cdots,x_{n})}\leq k^{m}\cdot\prod_{a^{m}\in{\mathcal{S}}^{m}}\frac{k\cdot(k+1)\cdot\cdots\cdot(N_{a^{m}}+k-1)}{N_{a^{m}}^{N_{a^{m}}}}\prod_{a_{m+1}\in{\mathcal{S}}}\frac{N_{a^{m+1}}^{N_{a^{m+1}}}}{N_{a^{m+1}}!}. (100)

Using (33) we continue (100) to get

\displaystyle\log\frac{\pi(x_{1}^{m})\prod_{t=m}^{n-1}M(x_{t+1}|x_{t-m+1}^{t})}{Q(x_{1},\cdots,x_{n})} \leq m\log k+\sum_{a^{m}\in{\mathcal{S}}^{m}}\log\frac{k\cdot(k+1)\cdots(N_{a^{m}}+k-1)}{N_{a^{m}}!}
=mlogk+am𝒮m=1Namlog(1+k1)\displaystyle=m\log k+\sum_{a^{m}\in{\mathcal{S}}^{m}}\sum_{\ell=1}^{N_{a^{m}}}\log\left(1+\frac{k-1}{\ell}\right)
mlogk+am𝒮m0Namlog(1+k1x)dx\displaystyle\leq m\log k+\sum_{a^{m}\in{\mathcal{S}}^{m}}\int_{0}^{N_{a^{m}}}\log\left(1+\frac{k-1}{x}\right)dx
=mlogk+am𝒮m((k1)log(1+Namk1)+Namlog(1+k1Nam))\displaystyle=m\log k+\sum_{a^{m}\in{\mathcal{S}}^{m}}\left((k-1)\log\left(1+\frac{N_{a^{m}}}{k-1}\right)+N_{a^{m}}\log\left(1+\frac{k-1}{N_{a^{m}}}\right)\right)
(a)km(k1)log(1+nmkm(k1))+km(k1)+mlogk,\displaystyle\overset{\rm(a)}{\leq}k^{m}(k-1)\log\left(1+\frac{n-m}{k^{m}(k-1)}\right)+k^{m}(k-1)+m\log k,

where (a) follows from the concavity of $x\mapsto\log x$, $\sum_{a^{m}\in{\mathcal{S}}^{m}}N_{a^{m}}=n-m$, and $\log(1+x)\leq x$. ∎

5.3 Lower bound

5.3.1 Special case: m2,k=2m\geq 2,k=2

We only analyze the case $m=2$, i.e. second-order Markov chains with binary states, as the lower bound carries over to the case $m\geq 3$. The transition matrix of a second-order chain is a $k^{2}\times k$ stochastic matrix $M$ that gives the transition probability from an ordered pair $(i,j)\in{\mathcal{S}}\times{\mathcal{S}}$ to a state $\ell\in{\mathcal{S}}$:

M(|ij)=[X3=|X1=i,X2=j].\displaystyle M(\ell|ij)=\mathbb{P}\left[X_{3}=\ell|X_{1}=i,X_{2}=j\right]. (101)

Our result is the following.

Theorem 24.

𝖱𝗂𝗌𝗄2,n,2=Θ(logn\overn){\mathsf{Risk}}_{2,n,2}=\Theta\left(\log n\over n\right).

Proof.

The upper bound part has been shown in Lemma 23. For the lower bound, consider the following one-parameter family of transition matrices (we replace ${\mathcal{S}}$ by $\left\{1,2\right\}$ for notational simplicity)

\displaystyle\widetilde{\mathcal{M}}=\left\{M_{p}=\begin{array}{c|cc} & 1 & 2\\ \hline 11 & 1-\frac{1}{n} & \frac{1}{n}\\ 21 & \frac{1}{n} & 1-\frac{1}{n}\\ 12 & 1-p & p\\ 22 & p & 1-p\end{array}\;:\;0\leq p\leq 1\right\} (109)

and place a uniform prior on p[0,1]p\in[0,1]. One can verify that each MpM_{p} has the uniform stationary distribution over the set {1,2}×{1,2}\left\{1,2\right\}\times\left\{1,2\right\} and the chains are reversible.
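
Both claims can be verified directly from (109); the following sketch is illustrative only (the values of $n$ and $p$ are arbitrary choices of ours) and checks that the uniform distribution on pairs solves the stationarity equations (91) and that the reversibility condition (94) holds.

```python
from itertools import product

# Illustrative check for the family (109): the uniform law on pairs is stationary
# and the reversibility condition (94) holds. n and p below are our own choices.
n, p = 50, 0.3
M = {(1, 1): {1: 1 - 1/n, 2: 1/n},
     (2, 1): {1: 1/n, 2: 1 - 1/n},
     (1, 2): {1: 1 - p, 2: p},
     (2, 2): {1: p, 2: 1 - p}}
pi = {s: 0.25 for s in product((1, 2), repeat=2)}        # candidate stationary law on pairs

# stationarity: pi(x1,x2) = sum_{x0} pi(x0,x1) * M(x2 | x0,x1)
print(all(abs(pi[(x1, x2)] - sum(pi[(x0, x1)] * M[(x0, x1)][x2] for x0 in (1, 2))) < 1e-12
          for x1, x2 in product((1, 2), repeat=2)))
# reversibility (94): pi(x1,x2) * M(x3|x1,x2) = pi(x3,x2) * M(x1|x3,x2)
print(all(abs(pi[(x1, x2)] * M[(x1, x2)][x3] - pi[(x3, x2)] * M[(x3, x2)][x1]) < 1e-12
          for x1, x2, x3 in product((1, 2), repeat=3)))
```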

Next we introduce the set of trajectories based on which we will lower bound the prediction risk. Analogous to the set 𝒳=t=1n𝒳t{\mathcal{X}}=\cup_{t=1}^{n}{\mathcal{X}}_{t} defined in (37) for analyzing the first-order chains, we define

𝒱={1ntzt:z1=z2=zt=2,zii+111,i[t1],t=4,,n2}{1,2}n.\displaystyle{\mathcal{V}}=\left\{1^{n-t}z^{t}:z_{1}=z_{2}=z_{t}=2,z_{i}^{i+1}\neq 11,i\in[t-1],t=4,\dots,n-2\right\}\subset\{1,2\}^{n}. (110)

In other words, the sequences in ${\mathcal{V}}$ start with a string of 1's before transitioning into two consecutive 2's, contain no two consecutive 1's thereafter, and end with a 2.

To compute the probability of sequences in ${\mathcal{V}}$, we need the following preparations. Denote by $\oplus$ the operation that combines any two blocks from $\left\{22,212\right\}$ by merging the last symbol of the first block with the first symbol of the second block; for example, $22\oplus 212=2212$ and $22\oplus 22\oplus 22=2222$. Then any $x^{n}\in{\mathcal{V}}$ can be written in terms of the initial all-1 string, followed by alternating runs of blocks from $\{22,212\}$ with the first run consisting of the block 22 (all runs have positive lengths), combined with the merging operation $\oplus$:

xn=11all ones222222p1 many 22212212212p2 many 212222222p3 many 22212212212p4 many 21222.\displaystyle x^{n}=\underbrace{1\dots 1}_{\text{all ones}}\underbrace{22\oplus 22\dots\oplus 22}_{p_{1}\text{ many }22}\oplus\underbrace{212\oplus 212\dots\oplus 212}_{p_{2}\text{ many }212}\oplus\underbrace{22\oplus 22\dots\oplus 22}_{p_{3}\text{ many }22}\oplus\underbrace{212\oplus 212\dots\oplus 212}_{p_{4}\text{ many }212}\oplus 22\oplus\dots. (111)

Let the vector $(q_{22\to 22},q_{22\to 212},q_{212\to 22},q_{212\to 212})$ denote the transition probabilities between blocks in $\left\{22,212\right\}$ (recall the convention that consecutive blocks overlap in the symbol 2). Namely, according to (109),

q2222\displaystyle q_{22\to 22} =[X3=2,X2=2|X2=2,X1=2]=M(2|22)=1p\displaystyle=\mathbb{P}\left[X_{3}=2,X_{2}=2|X_{2}=2,X_{1}=2\right]=M(2|22)=1-p
q22212\displaystyle q_{22\to 212} =[X4=2,X3=1,X2=2|X2=2,X1=2]=M(2|21)M(1|22)=(11n)p\displaystyle=\mathbb{P}\left[X_{4}=2,X_{3}=1,X_{2}=2|X_{2}=2,X_{1}=2\right]=M(2|21)M(1|22)=\left(1-\frac{1}{n}\right)p
q21222\displaystyle q_{212\to 22} =[X4=2,X3=2|X3=2,X2=1,X1=2]=M(2|12)=p\displaystyle=\mathbb{P}\left[X_{4}=2,X_{3}=2|X_{3}=2,X_{2}=1,X_{1}=2\right]=M(2|12)=p
q212212\displaystyle q_{212\to 212} =[X5=2,X4=1,X3=2|X3=2,X2=1,X1=2]=M(2|21)M(1|12)=(11n)(1p).\displaystyle=\mathbb{P}\left[X_{5}=2,X_{4}=1,X_{3}=2|X_{3}=2,X_{2}=1,X_{1}=2\right]=M(2|21)M(1|12)=\left(1-\frac{1}{n}\right)(1-p).

Given any xn𝒱x^{n}\in{\mathcal{V}} we can calculate its probability under the law of MpM_{p} using frequency counts 𝑭(xn)=(F111,F2222,F22212,F21222,F212212)\bm{F}(x^{n})=\left(F_{111},F_{22\to 22},F_{22\to 212},F_{212\to 22},F_{212\to 212}\right), defined as

F111=i𝟏{xi=1,xi+1=1,xi+2=1},F2222=i𝟏{xi=2,xi+1=2,xi+2=2},\displaystyle F_{111}=\sum_{i}{\mathbf{1}_{\left\{{x_{i}=1,x_{i+1}=1,x_{i+2}=1}\right\}}},\quad F_{22\to 22}=\sum_{i}{\mathbf{1}_{\left\{{x_{i}=2,x_{i+1}=2,x_{i+2}=2}\right\}}},
F22212=i𝟏{xi=2,xi+1=2,xi+2=1,xi+3=2},F21222=i𝟏{xi=2,xi+1=1,xi+2=2,xi+3=2},\displaystyle F_{22\to 212}=\sum_{i}{\mathbf{1}_{\left\{{x_{i}=2,x_{i+1}=2,x_{i+2}=1,x_{i+3}=2}\right\}}},\quad F_{212\to 22}=\sum_{i}{\mathbf{1}_{\left\{{x_{i}=2,x_{i+1}=1,x_{i+2}=2,x_{i+3}=2}\right\}}},
F212212=i𝟏{xi=2,xi+1=1,xi+2=2,xi+3=1,xi+4=2}.\displaystyle F_{212\to 212}=\sum_{i}{\mathbf{1}_{\left\{{x_{i}=2,x_{i+1}=1,x_{i+2}=2,x_{i+3}=1,x_{i+4}=2}\right\}}}.

Denote μ(xn|p)=[Xn=xn|p]\mu(x^{n}|p)=\mathbb{P}\left[X^{n}=x^{n}|p\right]. Then for each xn𝒱x^{n}\in{\mathcal{V}} with 𝑭(xn)=𝑭\bm{F}(x^{n})=\bm{F} we have

μ(xn|p)\displaystyle\mu(x^{n}|p)
=(XF111+2=1F111+2)M(2|11)M(2|12)a,b{22,212}qabFab\displaystyle=\mathbb{P}(X^{F_{111}+2}=1^{F_{111}+2})M(2|11)M(2|12)\prod_{a,b\in\left\{22,212\right\}}q_{a\to b}^{F_{a\to b}}
=14(11n)F1111nppF21222{p(11n)}F22212(1p)F2222{(1p)(11n)}F212212\displaystyle=\frac{1}{4}\left(1-\frac{1}{n}\right)^{F_{111}}\frac{1}{n}\cdot p\cdot p^{F_{212\to 22}}\left\{p\left(1-\frac{1}{n}\right)\right\}^{F_{22\to 212}}(1-p)^{F_{22\to 22}}\left\{(1-p)\left(1-\frac{1}{n}\right)\right\}^{F_{212\to 212}}
=14(11n)F111+F22212+F2122121npy+1(1p)fy\displaystyle=\frac{1}{4}\left(1-\frac{1}{n}\right)^{F_{111}+F_{22\to 212}+F_{212\to 212}}\frac{1}{n}p^{y+1}(1-p)^{f-y} (112)

where y=F21222+F22212y=F_{212\to 22}+F_{22\to 212} denotes the number of times the chain alternates between runs of 22 and runs of 212, and f=F21222+F22212+F212212+F2222f=F_{212\to 22}+F_{22\to 212}+F_{212\to 212}+F_{22\to 22} denotes the number of times the chain jumps between blocks in {22,212}\{22,212\}.

Note that the range of $f$ includes all integers between 1 and $(n-6)/2$. This follows from the definition of ${\mathcal{V}}$ and the fact that merging either 22 or 212 via the operation $\oplus$ at the end of any string $z^{t}$ with $z_{t}=2$ increases the length of the string by at most 2. Also, for any given value of $f$, the value of $y$ ranges from 0 to $f$.

Lemma 25.

The number of sequences in ${\mathcal{V}}$ corresponding to a fixed pair $(y,f)$ is $\binom{f}{y}$.

Proof.

Fix $x^{n}\in{\mathcal{V}}$ and let $p_{2i-1}$ be the length of the $i$-th run of 22 blocks and $p_{2i}$ the length of the $i$-th run of 212 blocks in $x^{n}$, as depicted in (111). The $p_{i}$'s are positive integers. There are in total $y+1$ such runs, and the $p_{i}$'s satisfy $\sum_{i=1}^{y+1}p_{i}=f+1$, as the total number of blocks is one more than the total number of transitions. Each positive integer solution $\left\{p_{i}\right\}_{i=1}^{y+1}$ to this equation corresponds to a sequence $x^{n}\in{\mathcal{V}}$ and vice versa. The total number of such solutions is $\binom{f}{y}$. ∎
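
The counting step above (compositions of $f+1$ into $y+1$ positive parts) can be sanity-checked by brute force; the sketch below is illustrative only, and the ranges of $f$ and $y$ are our own choices.

```python
from itertools import product
from math import comb

# Brute-force check of the counting step: the number of positive-integer solutions
# of p_1 + ... + p_{y+1} = f + 1 equals binom(f, y). The ranges below are our choices.
for f in range(1, 6):
    for y in range(f + 1):
        count = sum(1 for ps in product(range(1, f + 2), repeat=y + 1) if sum(ps) == f + 1)
        assert count == comb(f, y), (f, y, count)
print("all checks passed")
```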

We are now ready to compute the Bayes estimator and risk. For any xn𝒱x^{n}\in{\mathcal{V}} with a given (y,f)(y,f), the Bayes estimator of pp with prior pUniform[0,1]p\sim\mathrm{Uniform}[0,1] is

$\widehat{p}(x^{n})=\mathbb{E}[p|x^{n}]=\frac{\mathbb{E}[p\cdot\mu(x^{n}|p)]}{\mathbb{E}[\mu(x^{n}|p)]}\overset{(112)}{=}{y+2\over f+3}.$
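
Explicitly, since $\mu(x^{n}|p)\propto p^{y+1}(1-p)^{f-y}$ as a function of $p$ by (112) (the $p$-free prefactor cancels in the ratio), the last equality is the standard Beta integral; we record the routine step for completeness:

\frac{\int_{0}^{1}p^{y+2}(1-p)^{f-y}\,dp}{\int_{0}^{1}p^{y+1}(1-p)^{f-y}\,dp}=\frac{B(y+3,f-y+1)}{B(y+2,f-y+1)}=\frac{\Gamma(y+3)\,\Gamma(f+3)}{\Gamma(y+2)\,\Gamma(f+4)}={y+2\over f+3},

using $B(a,b)=\Gamma(a)\Gamma(b)/\Gamma(a+b)$ and $\Gamma(a+1)=a\Gamma(a)$.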

Note that the probabilities $\mu(x^{n}|p)$ in (112) can be bounded from below by $\frac{1}{4en}p^{y+1}(1-p)^{f-y}$. Using this, for each $x^{n}\in{\mathcal{V}}$ with given $y,f$ we get the following bound on the integrated squared error for a particular sequence $x^{n}$

01μ(xn|p)(pp^(xn))2dp\displaystyle\int_{0}^{1}\mu(x^{n}|p)(p-\widehat{p}(x^{n}))^{2}dp
14en01py+1(1p)fy(py+2\overf+3)2dp=14en(y+1)!(fy)!\over(f+2)!(y+2)(fy+1)\over(f+3)2(f+4)\displaystyle\geq\frac{1}{4en}\int_{0}^{1}p^{y+1}(1-p)^{f-y}\left(p-{y+2\over f+3}\right)^{2}dp=\frac{1}{4en}{(y+1)!(f-y)!\over(f+2)!}{(y+2)(f-y+1)\over(f+3)^{2}(f+4)} (113)

where the last equality follows by noting that the integral is the variance of a $\mathrm{Beta}(y+2,f-y+1)$ random variable, up to the missing normalizing constant.

Next we bound the risk of any predictor by the Bayes error. Consider any predictor $\widehat{M}(\cdot|ij)$ (as a function of the sample path $X$) for the transition from $ij$, $i,j\in\left\{1,2\right\}$. By Pinsker's inequality, we conclude that

D(M(|12)M^(|12))12M(|12)M^(|12)1212(pM^(2|12))2\displaystyle D(M(\cdot|12)\|\widehat{M}(\cdot|12))\geq\frac{1}{2}\|M(\cdot|12)-\widehat{M}(\cdot|12)\|_{\ell_{1}}^{2}\geq\frac{1}{2}(p-\widehat{M}(2|12))^{2} (114)

and similarly, D(M(|22)M^(|22))12(pM^(1|22))2D(M(\cdot|22)\|\widehat{M}(\cdot|22))\geq\frac{1}{2}(p-\widehat{M}(1|22))^{2}. Abbreviate M^(2|12)p^12\widehat{M}(2|12)\equiv\widehat{p}_{12} and M^(1|22)p^22\widehat{M}(1|22)\equiv\widehat{p}_{22}, both functions of XX. Using (113) and Lemma 25, we have

\displaystyle\sum_{i,j=1}^{2}\mathbb{E}[D(M(\cdot|ij)\|\widehat{M}(\cdot|ij)){\mathbf{1}_{\left\{{X_{n-1}^{n}=ij}\right\}}}]
12𝔼[(pp^12)2𝟏{Xn1n=12,Xn𝒱}+(pp^22)2𝟏{Xn1n=22,Xn𝒱}]\displaystyle\geq~{}\frac{1}{2}\mathbb{E}\left[(p-\widehat{p}_{12})^{2}{\mathbf{1}_{\left\{{X_{n-1}^{n}=12,X^{n}\in{\mathcal{V}}}\right\}}}+(p-\widehat{p}_{22})^{2}{\mathbf{1}_{\left\{{X_{n-1}^{n}=22,X^{n}\in{\mathcal{V}}}\right\}}}\right]
1201[𝑭xn𝒱:𝑭(xn)=𝑭μ(xn|p)((pp^12)2𝟏{xn1n=12}+(pp^22)2𝟏{xn1n=22})]dp\displaystyle\geq~{}\frac{1}{2}\int_{0}^{1}\left[\sum_{\bm{F}}\sum_{x^{n}\in{\mathcal{V}}:\bm{F}(x^{n})=\bm{F}}\mu(x^{n}|p)\left((p-\widehat{p}_{12})^{2}{\mathbf{1}_{\left\{{x_{n-1}^{n}=12}\right\}}}+(p-\widehat{p}_{22})^{2}{\mathbf{1}_{\left\{{x_{n-1}^{n}=22}\right\}}}\right)\right]dp
1201[𝑭xn𝒱:𝑭(xn)=𝑭μ(xn|p)(pp^(xn))2]dp\displaystyle\geq~{}\frac{1}{2}\int_{0}^{1}\left[\sum_{\bm{F}}\sum_{x^{n}\in{\mathcal{V}}:\bm{F}(x^{n})=\bm{F}}\mu(x^{n}|p)(p-\widehat{p}(x^{n}))^{2}\right]dp
\displaystyle\geq~{}\frac{1}{2}\sum_{f=1}^{\frac{n-6}{2}}\sum_{y=0}^{f}\binom{f}{y}\frac{1}{4en}{(y+1)!(f-y)!\over(f+2)!}{(y+2)(f-y+1)\over(f+3)^{2}(f+4)}
18enf=1n62y=0fy+1\over(f+2)(f+1)(y+2)(fy+1)\over(f+3)2(f+4)Θ(1n)f=1n62y=f4f31f2=Θ(logn\overn).\displaystyle\geq~{}\frac{1}{8en}\sum_{f=1}^{\frac{n-6}{2}}\sum_{y=0}^{f}{y+1\over(f+2)(f+1)}{(y+2)(f-y+1)\over(f+3)^{2}(f+4)}\geq~{}\Theta\left(\frac{1}{n}\right)\sum_{f=1}^{\frac{n-6}{2}}\sum_{y=\frac{f}{4}}^{\frac{f}{3}}\frac{1}{f^{2}}=~{}\Theta\left(\log n\over n\right). (115)

5.3.2 General case: m2,k3m\geq 2,k\geq 3

We will prove the following.

Theorem 26.

We have

𝖱𝗂𝗌𝗄k,n,m12m+4(122m2\overn)(11n)n2m+1(k1)m+1\overnlog(1\over22m+83πe(m+1)nm\over(k1)m+1).\mathsf{Risk}_{k,n,m}\geq\frac{1}{2^{m+4}}\left(\frac{1}{2}-{2^{m}-2\over n}\right)\left(1-\frac{1}{n}\right)^{n-2m+1}{(k-1)^{m+1}\over n}\log\left({1\over 2^{2m+8}\cdot 3\pi e(m+1)}\cdot{n-m\over(k-1)^{m+1}}\right).

For ease of notation let 𝒮={1,,k}{\mathcal{S}}=\left\{1,\dots,k\right\}. Denote 𝒮~={2,,k}{\widetilde{\mathcal{S}}}=\left\{2,\dots,k\right\}. Consider an mthm^{\text{th}}-order transition matrix MM of the following form:

\displaystyle M(s|x^{m})=\begin{array}{c|c|c}
\text{starting string }x^{m} & s=1 & s\in\{2,\dots,k\}\\
\hline
1^{m} & 1-\frac{1}{n} & \frac{1}{n(k-1)}\\
1x^{m-1},\ x^{m-1}\in\widetilde{\mathcal{S}}^{m-1} & 1-b & \frac{b}{k-1}\\
x^{m}\in\widetilde{\mathcal{S}}^{m} & \frac{1}{n} & \left(1-\frac{1}{n}\right)T(s|x^{m})\\
x^{m}\notin\left\{1^{m},1\widetilde{\mathcal{S}}^{m-1},\widetilde{\mathcal{S}}^{m}\right\} & \frac{1}{2} & \frac{1}{2(k-1)}
\end{array},\qquad b=\frac{1}{2}-{2^{m}-2\over n}.
(130)

Here TT is a (k1)m×(k1)(k-1)^{m}\times(k-1) transition matrix for an mthm^{\rm th}-order Markov chain with state space 𝒮~\widetilde{\mathcal{S}}, satisfying the following property:

  1. (P)

    T(xm+1|xm)=T(x1|x2m+1¯),xm+1𝒮~m+1T(x_{m+1}|x^{m})=T(x_{1}|\overline{x_{2}^{m+1}}),\quad\forall x^{m+1}\in\widetilde{\mathcal{S}}^{m+1}.

Lemma 27.

Under the condition (P), the transition matrix TT has a stationary distribution that is uniform on 𝒮~m{\widetilde{\mathcal{S}}}^{m}. Furthermore, the resulting mthm^{\rm th}-order Markov chain is reversible (and hence stationary).

Proof.

We prove this result using Lemma 21. Let π\pi denote the uniform distribution on 𝒮~m\widetilde{\mathcal{S}}^{m}, i.e., π(xm)=1(k1)m\pi(x^{m})=\frac{1}{(k-1)^{m}} for all xm𝒮~mx^{m}\in\widetilde{\mathcal{S}}^{m}. Then for any xm𝒮~mx^{m}\in\widetilde{\mathcal{S}}^{m} the condition π(xm)=π(xm¯)\pi(x^{m})=\pi(\overline{x^{m}}) follows directly and π(xm)T(xm+1|xm)=π(x2m+1¯)T(x1|x2m+1¯)\pi(x^{m})T(x_{m+1}|x^{m})=\pi(\overline{x_{2}^{m+1}})T(x_{1}|\overline{x_{2}^{m+1}}) follows from the assumption (P). ∎

Next we address the stationarity and reversibility of the chain with the bigger transition matrix MM in (130):

Lemma 28.

Let MM be defined in (130), wherein the transition matrix TT satisfies the condition (P). Then MM has a stationary distribution given by

π(xm)={12xm=1mb(k1)mxm𝒮~m1n(k1)d(xm)otherwise\displaystyle\pi(x^{m})=\begin{cases}\frac{1}{2}&x^{m}=1^{m}\\ \frac{b}{(k-1)^{m}}&x^{m}\in{\widetilde{\mathcal{S}}}^{m}\\ \frac{1}{n(k-1)^{d(x^{m})}}&\text{otherwise}\end{cases} (131)

where d(xm)i=1m𝟏{xi𝒮~}d(x^{m})\triangleq\sum_{i=1}^{m}{\mathbf{1}_{\left\{{x_{i}\in{\widetilde{\mathcal{S}}}}\right\}}} and b=122m2\overnb=\frac{1}{2}-{2^{m}-2\over n} as in (130). Furthermore, the mthm^{\rm th}-order Markov chain with initial distribution π\pi and transition matrix MM is reversible.

Proof.

Note that the choice of bb guarantees that xm𝒮mπ(xm)=1\sum_{x^{m}\in{\mathcal{S}}^{m}}\pi(x^{m})=1. Next we again apply Lemma 21 to verify stationarity and reversibility. First of all, since d(xm)=d(xm¯)d(x^{m})=d(\overline{x^{m}}), we have π(xm)=π(xm¯)\pi(x^{m})=\pi(\overline{x^{m}}) for all xm𝒮mx^{m}\in{\mathcal{S}}^{m} . Next we check the condition π(xm)M(xm+1|xm)=π(x2m+1¯)M(x1|x2m+1¯)\pi(x^{m})M(x_{m+1}|x^{m})=\pi(\overline{x_{2}^{m+1}})M(x_{1}|\overline{x_{2}^{m+1}}). For the sequence 1m+11^{m+1} the claim is easily verified. For the rest of the sequences we have the following.

  • Case 1 (xm+1𝒮~m+1x^{m+1}\in{\widetilde{\mathcal{S}}}^{m+1}): Note that xm+1𝒮~m+1x^{m+1}\in\widetilde{\mathcal{S}}^{m+1} if and only if xm,x2m+1¯𝒮~mx^{m},{\overline{x_{2}^{m+1}}}\in\widetilde{\mathcal{S}}^{m}. This implies

    π(xm)M(xm+1|xm)\displaystyle\pi(x^{m})M(x_{m+1}|x^{m}) =b(k1)m(11n)T(xm+1|xm)\displaystyle=\frac{b}{(k-1)^{m}}\left(1-\frac{1}{n}\right)T(x_{m+1}|x^{m})
    =b(k1)m(11n)T(x1|x2m+1¯)=π(x2m+1¯)M(x1|x2m+1¯).\displaystyle=\frac{b}{(k-1)^{m}}\left(1-\frac{1}{n}\right)T(x_{1}|\overline{x_{2}^{m+1}})=\pi(\overline{x_{2}^{m+1}})M(x_{1}|\overline{x_{2}^{m+1}}).
  • Case 2 (xm+11𝒮~mx^{m+1}\in 1{\widetilde{\mathcal{S}}}^{m} or xm+1𝒮~m1x^{m+1}\in{\widetilde{\mathcal{S}}}^{m}1): By symmetry it is sufficient to analyze the case xm+11𝒮~mx^{m+1}\in 1{\widetilde{\mathcal{S}}}^{m}. Note that in the sub-case xm+11𝒮~mx^{m+1}\in 1{\widetilde{\mathcal{S}}}^{m}, xm1𝒮~m1x^{m}\in 1\widetilde{\mathcal{S}}^{m-1} and x2m+1¯𝒮~m\overline{x_{2}^{m+1}}\in\widetilde{\mathcal{S}}^{m}. This implies

    π(xm)=1n(k1)m1,\displaystyle\pi(x^{m})=\frac{1}{n(k-1)^{m-1}}, M(xm+1|xm)=bk1,\displaystyle\quad M(x_{m+1}|x^{m})=\frac{b}{k-1},
    π(x2m+1¯)=b(k1)m,\displaystyle\pi(\overline{x_{2}^{m+1}})=\frac{b}{(k-1)^{m}}, M(x1|x2m+1¯)=1n.\displaystyle\quad M(x_{1}|\overline{x_{2}^{m+1}})=\frac{1}{n}. (132)

    In view of this we get π(xm)M(xm+1|xm)=π(x2m+1¯)M(x1|x2m+1¯).\pi(x^{m})M(x_{m+1}|x^{m})=\pi(\overline{x_{2}^{m+1}})M(x_{1}|\overline{x_{2}^{m+1}}).

  • Case 3 (xm+11m+1𝒮~m+11𝒮~m𝒮~m1x^{m+1}\notin 1^{m+1}\cup{\widetilde{\mathcal{S}}}^{m+1}\cup 1{\widetilde{\mathcal{S}}}^{m}\cup{\widetilde{\mathcal{S}}}^{m}1):

    Suppose that xm+1x^{m+1} has dd many elements from 𝒮~{\widetilde{\mathcal{S}}}. Then xm,x2m+1{1m,𝒮~m}x^{m},x_{2}^{m+1}\notin\left\{1^{m},{\widetilde{\mathcal{S}}}^{m}\right\}. We have the following sub-cases.

    • If x1=xm+1=1x_{1}=x_{m+1}=1, then both xm,x2m+1x^{m},x_{2}^{m+1} have exactly dd elements from 𝒮~{\widetilde{\mathcal{S}}}. This implies π(xm)=π(x2m+1¯)=1n(k1)d\pi(x^{m})=\pi(\overline{x_{2}^{m+1}})=\frac{1}{n(k-1)^{d}} and M(xm+1|xm)=M(x1|x2m+1¯)=12M(x_{m+1}|x^{m})=M(x_{1}|\overline{x_{2}^{m+1}})=\frac{1}{2}.

    • If x1,xm+1𝒮~x_{1},x_{m+1}\in{\widetilde{\mathcal{S}}}, then both xm,x2m+1x^{m},x_{2}^{m+1} have exactly d1d-1 elements from 𝒮~{\widetilde{\mathcal{S}}}. This implies π(xm)=π(x2m+1¯)=1n(k1)d1\pi(x^{m})=\pi(\overline{x_{2}^{m+1}})=\frac{1}{n(k-1)^{d-1}} and M(xm+1|xm)=M(x1|x2m+1¯)=12(k1)M(x_{m+1}|x^{m})=M(x_{1}|\overline{x_{2}^{m+1}})=\frac{1}{2(k-1)}.

    • If $x_{1}=1,x_{m+1}\in{\widetilde{\mathcal{S}}}$, then $x^{m}$ has $d-1$ elements from $\widetilde{\mathcal{S}}$ and $x_{2}^{m+1}$ has $d$ elements from ${\widetilde{\mathcal{S}}}$. This implies $\pi(x^{m})=\frac{1}{n(k-1)^{d-1}},\pi(\overline{x_{2}^{m+1}})=\frac{1}{n(k-1)^{d}}$ and $M(x_{m+1}|x^{m})=\frac{1}{2(k-1)},M(x_{1}|\overline{x_{2}^{m+1}})=\frac{1}{2}$.

    • If $x_{1}\in{\widetilde{\mathcal{S}}},x_{m+1}=1$, then $x^{m}$ has $d$ elements from $\widetilde{\mathcal{S}}$ and $x_{2}^{m+1}$ has $d-1$ elements from ${\widetilde{\mathcal{S}}}$. This implies $\pi(x^{m})=\frac{1}{n(k-1)^{d}},\pi(\overline{x_{2}^{m+1}})=\frac{1}{n(k-1)^{d-1}}$ and $M(x_{m+1}|x^{m})=\frac{1}{2},M(x_{1}|\overline{x_{2}^{m+1}})=\frac{1}{2(k-1)}$.

    For all these sub-cases we have π(xm)M(xm+1|xm)=π(x2m+1¯)M(x1|x2m+1¯)\pi(x^{m})M(x_{m+1}|x^{m})=\pi(\overline{x_{2}^{m+1}})M(x_{1}|\overline{x_{2}^{m+1}}) as required.

This finishes the proof. ∎
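
Lemma 28 can also be checked numerically on a small instance of the construction (130)–(131). The sketch below is illustrative only: we take $m=2$, $k=3$, an arbitrary $n$, and a hand-picked $T$ satisfying property (P), and verify row-stochasticity of $M$, that $\pi$ in (131) sums to one, stationarity in the sense of (91), and reversibility in the sense of (94).

```python
from itertools import product

# Illustrative check of (130)-(131) for m = 2, k = 3; n and T below are our choices.
n, k = 100, 3
St = (2, 3)                                            # \tilde S
b = 0.5 - (2**2 - 2) / n
T = {(2, 2): {2: 0.7, 3: 0.3}, (3, 2): {2: 0.3, 3: 0.7},
     (2, 3): {2: 0.4, 3: 0.6}, (3, 3): {2: 0.6, 3: 0.4}}   # satisfies property (P)

def M(s, x):                                           # x = (x1, x2) is the context
    if x == (1, 1):
        return 1 - 1/n if s == 1 else 1/(n*(k-1))
    if x[0] == 1 and x[1] in St:
        return 1 - b if s == 1 else b/(k-1)
    if x[0] in St and x[1] in St:
        return 1/n if s == 1 else (1 - 1/n)*T[x][s]
    return 0.5 if s == 1 else 1/(2*(k-1))              # remaining contexts

def pi(x):
    if x == (1, 1):
        return 0.5
    if x[0] in St and x[1] in St:
        return b/(k-1)**2
    d = (x[0] in St) + (x[1] in St)
    return 1/(n*(k-1)**d)

S = range(1, k + 1)
S2 = list(product(S, repeat=2))
print(all(abs(sum(M(s, x) for s in S) - 1) < 1e-12 for x in S2))      # rows of M sum to 1
print(abs(sum(pi(x) for x in S2) - 1) < 1e-12)                        # pi sums to 1
print(all(abs(pi((x1, x2)) - sum(pi((x0, x1)) * M(x2, (x0, x1)) for x0 in S)) < 1e-12
          for x1, x2 in S2))                                          # stationarity (91)
print(all(abs(pi((x1, x2)) * M(x3, (x1, x2)) - pi((x3, x2)) * M(x1, (x3, x2))) < 1e-12
          for x1, x2, x3 in product(S, repeat=3)))                    # reversibility (94)
```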

Let (X1,,Xn)(X_{1},\ldots,X_{n}) be the trajectory of a stationary Markov chain with transition matrix MM as in (130). We observe the following properties:

  1. (R1)

    This Markov chain is irreducible and reversible. Furthermore, the stationary distribution π\pi assigns probability 12\frac{1}{2} to the initial state 1m1^{m}.

  2. (R2)

    For $m\leq t\leq n-1$, let ${\mathcal{X}}_{t}$ denote the collection of trajectories $x^{n}$ such that $x_{1}=x_{2}=\cdots=x_{t}=1$ and $x_{t+1},\cdots,x_{n}\in{\widetilde{\mathcal{S}}}$. Then using Lemma 28

    (Xn𝒳t)\displaystyle\mathbb{P}(X^{n}\in{\mathcal{X}}_{t}) =(X1==Xt=1)(Xt+11|Xtm+1t=1m)\displaystyle=\mathbb{P}(X_{1}=\cdots=X_{t}=1)\cdot\mathbb{P}(X_{t+1}\neq 1|X_{t-m+1}^{t}=1^{m})
    i=2m1(Xt+i1|Xtm+it=1mi+1,Xt+1t+i1𝒮~i1)\displaystyle\quad\cdot\prod_{i=2}^{m-1}\mathbb{P}(X_{t+i}\neq 1|X_{t-m+i}^{t}=1^{m-i+1},X_{t+1}^{t+i-1}\in{\widetilde{\mathcal{S}}}^{i-1})
    (Xt+m1|Xt=1,Xt+1t+m1𝒮~m1)s=t+mn1(Xs+11|Xsm+1s𝒮~m)\displaystyle\quad\cdot\mathbb{P}(X_{t+m}\neq 1|X_{t}=1,X_{t+1}^{t+m-1}\in{\widetilde{\mathcal{S}}}^{m-1})\cdot\prod_{s=t+m}^{n-1}\mathbb{P}(X_{s+1}\neq 1|X_{s-m+1}^{s}\in{\widetilde{\mathcal{S}}}^{m})
    =12(11n)tmbn2m2(11n)nmt=bn2m1(11n)n2m.\displaystyle=\frac{1}{2}\cdot\left(1-\frac{1}{n}\right)^{t-m}\cdot\frac{b}{n2^{m-2}}\cdot\left(1-\frac{1}{n}\right)^{n-m-t}=\frac{b}{n2^{m-1}}\left(1-\frac{1}{n}\right)^{n-2m}. (133)

    Moreover, this probability does not depend on the choice of $T$;

  3. (R3)

    Conditioned on the event that Xn𝒳tX^{n}\in{\mathcal{X}}_{t}, the trajectory (Xt+1,,Xn)(X_{t+1},\cdots,X_{n}) has the same distribution as a length-(nt)(n-t) trajectory of a stationary mthm^{\text{th}}-order Markov chain with state space 𝒮~{\widetilde{\mathcal{S}}} and transition probability TT, and the uniform initial distribution. Indeed,

    [Xt+1=xt+1,,Xn=xn|Xn𝒳t]\displaystyle\mathbb{P}\left[X_{t+1}=x_{t+1},\ldots,X_{n}=x_{n}|X^{n}\in{\mathcal{X}}_{t}\right]
    =12(11n)tmbn2m2(k1)ms=t+mn1(11n)T(xs+1|xsm+1s)bn2m1(11n)n2m\displaystyle=\frac{\frac{1}{2}\cdot\left(1-\frac{1}{n}\right)^{t-m}\cdot\frac{b}{n2^{m-2}(k-1)^{m}}\prod_{s=t+m}^{n-1}\left(1-\frac{1}{n}\right)T(x_{s+1}|x_{s-m+1}^{s})}{\frac{b}{n2^{m-1}}\left(1-\frac{1}{n}\right)^{n-2m}}
    =1(k1)ms=t+mn1T(xs+1|xsm+1s).\displaystyle=\frac{1}{(k-1)^{m}}\prod_{s=t+m}^{n-1}T(x_{s+1}|x_{s-m+1}^{s}).
Reducing the Bayes prediction risk to mutual information

Consider the following Bayesian setting: we first draw $T$ from some prior satisfying property (P), and then generate the stationary $m^{\rm th}$-order Markov chain $X^{n}=(X_{1},\ldots,X_{n})$ with state space $[k]$, transition matrix $M$ in (130), and stationary distribution $\pi$ in (131). The following lemma lower bounds the Bayes prediction risk.

Lemma 29.

Conditioned on TT, let Yn=(Y1,,Yn)Y^{n}=(Y_{1},\ldots,Y_{n}) denote an mthm^{\rm th}-order stationary Markov chain on state space 𝒮~={2,,k}{\widetilde{\mathcal{S}}}=\{2,\ldots,k\} with transition matrix TT and uniform initial distribution. Then

infM^𝔼T[𝔼[D(M(|Xnm+1n)M^(|Xnm+1n)))]]\displaystyle\inf_{\widehat{M}}\mathbb{E}_{T}\left[\mathbb{E}[D(M(\cdot|X_{n-m+1}^{n})\|\widehat{M}(\cdot|X_{n-m+1}^{n})))]\right]
b(n1)n22m1(11n)n2m(I(T;Ynm)mlog(k1)).\displaystyle\geq\frac{b(n-1)}{n^{2}2^{m-1}}\left(1-\frac{1}{n}\right)^{n-2m}\left(I(T;Y^{n-m})-m\log(k-1)\right).
Proof.

We first relate the Bayes estimators of $M$ and $T$ (given the $X$ and $Y$ chains, respectively). For each $m\leq t\leq n$, denote by $\widehat{M}_{t}=\widehat{M}_{t}(\cdot|x^{t})$ the Bayes estimator of $M(\cdot|x_{t-m+1}^{t})$ given $X^{t}=x^{t}$, and by $\widehat{T}_{t}(\cdot|y^{t})$ the Bayes estimator of $T(\cdot|y_{t-m+1}^{t})$ given $Y^{t}=y^{t}$. For each $t=1,\ldots,n-1$ and each trajectory $x^{n}=(1,\ldots,1,x_{t+1},\ldots,x_{n})\in{\mathcal{X}}_{t}$, recalling the form (21) of the Bayes estimator, we have, for each $j\in{\widetilde{\mathcal{S}}}$,

M^n(j|xn)\displaystyle\widehat{M}_{n}(j|x^{n})
=[Xn+1=(xn,j)][Xn=xn]\displaystyle=~{}\frac{\mathbb{P}\left[X^{n+1}=(x^{n},j)\right]}{\mathbb{P}\left[X^{n}=x^{n}\right]}
=𝔼[12(11n)tmbn2m2(k1)ms=t+mn1M(xs+1|xsm+1s)M(j|xnm+1n)]𝔼[12(11n)tmbn2m2(k1)ms=t+mn1M(xs+1|xsm+1s)]\displaystyle=~{}\frac{\mathbb{E}[\frac{1}{2}\cdot\left(1-\frac{1}{n}\right)^{t-m}\cdot\frac{b}{n2^{m-2}(k-1)^{m}}\prod_{s=t+m}^{n-1}M(x_{s+1}|x_{s-m+1}^{s})M(j|x_{n-m+1}^{n})]}{\mathbb{E}[\frac{1}{2}\cdot\left(1-\frac{1}{n}\right)^{t-m}\cdot\frac{b}{n2^{m-2}(k-1)^{m}}\prod_{s=t+m}^{n-1}M(x_{s+1}|x_{s-m+1}^{s})]}
=(11n)𝔼[1(k1)ms=t+mn1T(xs+1|xsm+1s)T(j|xnm+1n)]𝔼[1(k1)ms=t+mn1T(xs+1|xsm+1s)]\displaystyle=~{}\left(1-\frac{1}{n}\right)\frac{\mathbb{E}[\frac{1}{(k-1)^{m}}\prod_{s=t+m}^{n-1}T(x_{s+1}|x_{s-m+1}^{s})T(j|x_{n-m+1}^{n})]}{\mathbb{E}[\frac{1}{(k-1)^{m}}\prod_{s=t+m}^{n-1}T(x_{s+1}|x_{s-m+1}^{s})]}
=(11n)[Ynt+1=(xt+1n,j)][Ynt=xt+1n]\displaystyle=~{}\left(1-\frac{1}{n}\right)\frac{\mathbb{P}\left[Y^{n-t+1}=(x_{t+1}^{n},j)\right]}{\mathbb{P}\left[Y^{n-t}=x_{t+1}^{n}\right]}
=(11n)T^nt(j|xt+1n).\displaystyle=~{}\left(1-\frac{1}{n}\right)\widehat{T}_{n-t}(j|x_{t+1}^{n}).

Furthermore, since $M(1|x^{m})=1/n$ for all $x^{m}\in{\widetilde{\mathcal{S}}}^{m}$ in the construction (130), the Bayes estimator also satisfies $\widehat{M}_{n}(1|x^{n})=1/n$ for $x^{n}\in{\mathcal{X}}_{t}$ and $t\leq n-m$. In all, we have

M^n(|xn)=1nδ1+(11n)T^nt(|xt+1n),xn𝒳t,tnm.\widehat{M}_{n}(\cdot|x^{n})=\frac{1}{n}\delta_{1}+\left(1-\frac{1}{n}\right)\widehat{T}_{n-t}(\cdot|x_{t+1}^{n}),\quad x^{n}\in{\mathcal{X}}_{t},t\leq n-m. (134)

with δ1\delta_{1} denoting the point mass at state 1, which parallels the fact that

M(|ym)=1nδ1+(11n)T(|ym),ym𝒮~m.M(\cdot|y^{m})=\frac{1}{n}\delta_{1}+\left(1-\frac{1}{n}\right)T(\cdot|y^{m}),\quad y^{m}\in{\widetilde{\mathcal{S}}}^{m}. (135)

By (R2), each event {Xn𝒳t}\{X^{n}\in{\mathcal{X}}_{t}\} occurs with probability at least b\overn2m1(11n)n2m{b\over n2^{m-1}}\left(1-\frac{1}{n}\right)^{n-2m}, and is independent of TT. Therefore,

\displaystyle\mathbb{E}_{T}\left[\mathbb{E}[D(M(\cdot|X_{n-m+1}^{n})\|\widehat{M}(\cdot|X^{n}))]\right]
bn2m1(11n)n2mt=mnm𝔼T[𝔼[D(M(|Xnm+1n)M^(|Xn))|Xn𝒳t]].\displaystyle\geq\frac{b}{n2^{m-1}}\left(1-\frac{1}{n}\right)^{n-2m}\sum_{t=m}^{n-m}\mathbb{E}_{T}\left[\mathbb{E}[D(M(\cdot|X_{n-m+1}^{n})\|\widehat{M}(\cdot|X^{n}))|X^{n}\in{\mathcal{X}}_{t}]\right]. (136)

By (R3), the conditional joint law of (T,Xt+1,,Xn)(T,X_{t+1},\ldots,X_{n}) on the event {Xn𝒳t}\{X^{n}\in{\mathcal{X}}_{t}\} is the same as the joint law of (T,Y1,,Ynt)(T,Y_{1},\ldots,Y_{n-t}). Thus, we may express the Bayes prediction risk in the XX chain as

𝔼T[𝔼[D(M(|Xnm+1n)M^(|Xn))|Xn𝒳t]]\displaystyle\mathbb{E}_{T}\left[\mathbb{E}[D(M(\cdot|X_{n-m+1}^{n})\|\widehat{M}(\cdot|X^{n}))|X^{n}\in{\mathcal{X}}_{t}]\right] =(a)(11n)𝔼T[𝔼[D(T(|Yntm+1nt)T^(|Ynt))]]\displaystyle\overset{\rm(a)}{=}\left(1-\frac{1}{n}\right)\cdot\mathbb{E}_{T}\left[\mathbb{E}[D(T(\cdot|Y_{n-t-m+1}^{n-t})\|\widehat{T}(\cdot|Y^{n-t}))]\right]
=(b)(11n)I(T;Ynt+1|Ynt),\displaystyle\overset{\rm(b)}{=}\left(1-\frac{1}{n}\right)\cdot I(T;Y_{n-t+1}|Y^{n-t}), (137)

where (a) follows from (134), (135), and the fact that for distributions $P,Q$ supported on ${\widetilde{\mathcal{S}}}$, $D(\epsilon\delta_{1}+(1-\epsilon)P\|\epsilon\delta_{1}+(1-\epsilon)Q)=(1-\epsilon)D(P\|Q)$; (b) is the mutual information representation (20) of the Bayes prediction risk. Finally, the lemma follows from (136), (137), and the chain rule

t=mnmI(T;Ynt+1|Ynt)=I(T;Ynm)I(T;Ym)I(T;Ynm)mlog(k1),\displaystyle\sum_{t=m}^{n-m}I(T;Y_{n-t+1}|Y^{n-t})=I(T;Y^{n-m})-I(T;Y^{m})\geq I(T;Y^{n-m})-m\log(k-1),

as I(T;Ym)H(Ym)mlog(k1)I(T;Y^{m})\leq H(Y^{m})\leq m\log(k-1). ∎

Prior construction and lower bounding the mutual information

We assume that $k=2k_{0}+1$ for some integer $k_{0}$. For simplicity of notation we replace $\widetilde{S}$ by ${\mathcal{Y}}=\{1,\dots,k-1\}$; this does not affect the lower bound. Define an equivalence relation on ${\mathcal{Y}}^{m-1}$ by the following rule: $x^{m-1}$ and $y^{m-1}$ are related if and only if $x^{m-1}=y^{m-1}$ or $x^{m-1}=\overline{y^{m-1}}$. Let $R_{m-1}$ be a subset of ${\mathcal{Y}}^{m-1}$ that consists of exactly one representative from each equivalence class. As each equivalence class has at most two elements, the total number of classes is at least ${|{\mathcal{Y}}|^{m-1}\over 2}$, i.e., $|R_{m-1}|\geq{(k-1)^{m-1}\over 2}$. We consider the following prior: let $u=\left\{u_{ix^{m-1}j}\right\}_{i\leq j\in[k_{0}],x^{m-1}\in R_{m-1}}$ be iid and uniformly distributed in $[1/(4k_{0}),3/(4k_{0})]$, and for each $i\leq j$, $x^{m-1}\in R_{m-1}$ define $u_{jx^{m-1}i}$, $u_{i\overline{x^{m-1}}j}$, $u_{j\overline{x^{m-1}}i}$ to be the same as $u_{ix^{m-1}j}$. Let the transition matrix $T$ be given by

\displaystyle T(2j-1|2i-1,x^{m-1})=T(2j|2i,x^{m-1})=u_{ix^{m-1}j},
\displaystyle T(2j|2i-1,x^{m-1})=T(2j-1|2i,x^{m-1})=\frac{1}{k_{0}}-u_{ix^{m-1}j},\quad i,j\in[k_{0}],x^{m-1}\in{\mathcal{Y}}^{m-1}. (138)

One can check that the constructed TT is a stochastic matrix and satisfies the property (P), which enforces uniform stationary distribution. Also each entry of TT belongs to the interval [1\over2(k1),3\over2(k1)][{1\over 2(k-1)},{3\over 2(k-1)}].
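To make the construction concrete, here is a minimal numerical sketch of (138) specialized to the first-order case m=1 (so the context x^{m-1} is empty); the state labels, the value k_0=3, and the checks below are our own illustration, not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
k0 = 3
k = 2 * k0 + 1                      # k = 2k_0 + 1; the construction lives on the k-1 states {1, ..., k-1}
# u_{ij} iid Uniform[1/(4k_0), 3/(4k_0)], symmetrized so that u_{ij} = u_{ji}
u = rng.uniform(1 / (4 * k0), 3 / (4 * k0), size=(k0, k0))
u = np.triu(u) + np.triu(u, 1).T

T = np.zeros((k - 1, k - 1))
for i in range(k0):
    for j in range(k0):
        T[2 * i, 2 * j] = T[2 * i + 1, 2 * j + 1] = u[i, j]           # T(2j-1|2i-1) = T(2j|2i) = u_{ij}
        T[2 * i, 2 * j + 1] = T[2 * i + 1, 2 * j] = 1 / k0 - u[i, j]  # T(2j|2i-1) = T(2j-1|2i) = 1/k_0 - u_{ij}

assert np.allclose(T.sum(axis=1), 1)      # row-stochastic
assert np.allclose(T.sum(axis=0), 1)      # doubly stochastic, so the uniform law is stationary
lo, hi = 1 / (2 * (k - 1)), 3 / (2 * (k - 1))
assert T.min() >= lo - 1e-12 and T.max() <= hi + 1e-12   # entries in [1/(2(k-1)), 3/(2(k-1))]
```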

Next we use the following lemma to derive estimation guarantees on TT.

Lemma 30.

Suppose that T is an \ell^{m}\times\ell transition matrix on state space {\mathcal{Y}}^{m} with |{\mathcal{Y}}|=\ell, satisfying T(x_{m+1}|x^{m})=T(x_{1}|\overline{x_{2}^{m+1}}) for all x^{m+1}\in[\ell]^{m+1} and T(y_{m+1}|y^{m})\in[\frac{c_{1}}{\ell},\frac{c_{2}}{\ell}] with 0<c_{1}<1<c_{2} for all y^{m+1}\in[\ell]^{m+1}. Then there is an estimator \widehat{T} based on a stationary trajectory Y^{n} simulated from T such that

\displaystyle\mathbb{E}[\|\widehat{T}-T\|_{\mathsf{F}}^{2}]\leq\frac{4c_{2}(m+1)\ell^{2m}}{c_{1}^{2m+3}(n-m)},

where T^T𝖥=ym+1(T^(ym+1|ym)T(ym+1|ym))2\|\widehat{T}-T\|_{\mathsf{F}}=\sqrt{\sum_{y^{m+1}}(\widehat{T}(y_{m+1}|y^{m})-T(y_{m+1}|y^{m}))^{2}} denotes the Frobenius norm.

For our purpose we will use the above lemma on TT with =k1,c1=12,c2=32\ell=k-1,c_{1}=\frac{1}{2},c_{2}=\frac{3}{2}. Therefore it follows that there exist estimators T^(Yn)\widehat{T}(Y^{n}) and u^(Yn)\widehat{u}(Y^{n}) such that

𝔼[u^(Yn)u22]𝔼[T^(Yn)T𝖥2]4c2(m+1)(k1)2mc12m+3(nm).\displaystyle\mathbb{E}[\|\widehat{u}(Y^{n})-u\|_{2}^{2}]\leq\mathbb{E}[\|\widehat{T}(Y^{n})-T\|_{\mathsf{F}}^{2}]\leq\frac{4c_{2}(m+1)(k-1)^{2m}}{c_{1}^{2m+3}(n-m)}. (139)

Here and below, we identify u={uixm1j}ij,xm1Rm1u=\left\{u_{ix^{m-1}j}\right\}_{i\leq j,x^{m-1}\in R_{m-1}} and u^={u^ixm1j}ij,xm1Rm1\widehat{u}=\left\{\widehat{u}_{ix^{m-1}j}\right\}_{i\leq j,x^{m-1}\in R_{m-1}} as |Rm1|k0(k0+1)\over2=|Rm1|(k21)\over8{|R_{m-1}|k_{0}(k_{0}+1)\over 2}={|R_{m-1}|(k^{2}-1)\over 8}-dimensional vectors.

Let h(X)=fX(x)logfX(x)dxh(X)=\int-f_{X}(x)\log f_{X}(x)dx denote the differential entropy of a continuous random vector XX with density fXf_{X} w.r.t the Lebesgue measure and h(X|Y)=fXY(xy)logfX|Y(x|y)dxdyh(X|Y)=\int-f_{XY}(xy)\log f_{X|Y}(x|y)dxdy the conditional differential entropy (cf. e.g. [CT06]). Then

h(u)=ij[k0],xm1Rm1h(uixm1j)=|Rm1|(k21)8log(k1).\displaystyle h(u)=\sum_{i\leq j\in{[k_{0}]},x^{m-1}\in R_{m-1}}h(u_{ix^{m-1}j})=-\frac{|R_{m-1}|(k^{2}-1)}{8}\log(k-1). (140)
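To see where (140) comes from, note that each coordinate u_{ix^{m-1}j} is uniform on an interval of length \frac{1}{2k_{0}}=\frac{1}{k-1}, so h(u_{ix^{m-1}j})=\log\frac{1}{k-1}=-\log(k-1), and there are {|R_{m-1}|k_{0}(k_{0}+1)\over 2}={|R_{m-1}|(k^{2}-1)\over 8} independent coordinates.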

Then

I(T;Yn)\displaystyle I(T;Y^{n}) =(a)I(u;Yn)\displaystyle\overset{\rm(a)}{=}I(u;Y^{n})
(b)I(u;u^(Yn))=h(u)h(u|u^(Yn))\displaystyle\overset{\rm(b)}{\geq}I(u;\widehat{u}(Y^{n}))=h(u)-h(u|\widehat{u}(Y^{n}))
(c)h(u)h(uu^(Yn))\displaystyle\overset{\rm(c)}{\geq}h(u)-h(u-\widehat{u}(Y^{n}))
(d)|Rm1|(k21)16log(c12m+3|Rm1|(k21)(nm)64πec2(m+1)(k1)2m+2)(k1)m+132log(nmcm(k1)m+1).\displaystyle\overset{\rm(d)}{\geq}\frac{|R_{m-1}|(k^{2}-1)}{16}\log\left(\frac{c_{1}^{2m+3}|R_{m-1}|(k^{2}-1)(n-m)}{64\pi ec_{2}(m+1)(k-1)^{2m+2}}\right)\geq\frac{(k-1)^{m+1}}{32}\log\left(\frac{n-m}{c_{m}(k-1)^{m+1}}\right).

for the constant c_{m}={128\pi ec_{2}(m+1)\over c_{1}^{2m+3}}, where (a) is because u and T are in one-to-one correspondence by (138); (b) follows from the data processing inequality; (c) is because h(\cdot) is translation invariant and concave; (d) follows from the maximum entropy principle [CT06]: h(u-\widehat{u}(Y^{n}))\leq\frac{|R_{m-1}|(k^{2}-1)}{16}\log\left(\frac{2\pi e}{|R_{m-1}|(k^{2}-1)/8}\cdot\mathbb{E}[\|\widehat{u}(Y^{n})-u\|_{2}^{2}]\right), which in turn is bounded by (139). Plugging this lower bound into Lemma 29 completes the lower bound proof of Theorem 22.

5.3.3 Proof of Lemma 30 via pseudo spectral gap

In view of Lemma 27 we get that the stationary distribution of TT is uniform over 𝒴m{\mathcal{Y}}^{m}, and there is a one-to-one correspondence between the joint distribution of Ym+1Y^{m+1} and the transition probabilities

[Ym+1=ym+1]=1mT(ym+1|ym).\displaystyle\mathbb{P}\left[Y^{m+1}=y^{m+1}\right]=\frac{1}{\ell^{m}}T(y_{m+1}|y^{m}). (141)

Consider the following estimator \widehat{T}: for y^{m+1}\in[\ell]^{m+1}, let

T^(ym+1|ym)=mt=1nm𝟏{Ytt+m=ym+1}nm.\displaystyle\widehat{T}(y_{m+1}|y^{m})=\ell^{m}\cdot\frac{\sum_{t=1}^{n-m}{\mathbf{1}_{\left\{{Y_{t}^{t+m}=y^{m+1}}\right\}}}}{n-m}.

Clearly \mathbb{E}[\widehat{T}(y_{m+1}|y^{m})]=\ell^{m}\mathbb{P}\left[Y^{m+1}=y^{m+1}\right]=T(y_{m+1}|y^{m}) by (141). Next we observe that the sequence of random variables \left\{Y_{t}^{t+m}\right\}_{t=1}^{n-m} is a first-order Markov chain on [\ell]^{m+1}. Let us denote its transition matrix by T_{m+1} and note that its stationary distribution is given by \pi(a^{m+1})=\ell^{-m}T(a_{m+1}|a^{m}),a^{m+1}\in[\ell]^{m+1}. For the transition matrix T_{m+1}, which need not be reversible, the pseudo spectral gap \gamma_{\text{ps}}(T_{m+1}) is defined as

γps(Tm+1)=maxr1γ((Tm+1)rTm+1r)r,\displaystyle\gamma_{\text{ps}}(T_{m+1})=\max_{r\geq 1}\frac{\gamma((T_{m+1}^{*})^{r}T_{m+1}^{r})}{r},

where Tm+1T_{m+1}^{*} is the adjoint of Tm+1T_{m+1} defined as Tm+1(bm+1|am+1)=π(bm+1)T(am+1|bm+1)/π(am+1)T_{m+1}^{*}(b^{m+1}|a^{m+1})=\pi(b^{m+1})T(a^{m+1}|b^{m+1})/\pi(a^{m+1}). With these notations, the concentration inequality of [Pau15, Theorem 3.2] gives the following variance bound:

Var(T^(ym+1|ym))2m4[Ym+1=ym+1]γps(Tm+1)(nm)2m4T(ym+1|ym)mγps(Tm+1)(nm).\displaystyle\mathrm{Var}(\widehat{T}(y_{m+1}|y^{m}))\leq\ell^{2m}\cdot\frac{4\mathbb{P}\left[Y^{m+1}=y^{m+1}\right]}{\gamma_{\text{ps}}(T_{m+1})(n-m)}\leq\ell^{2m}\cdot\frac{4T(y_{m+1}|y^{m})\ell^{-m}}{\gamma_{\text{ps}}(T_{m+1})(n-m)}.
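For illustration, the following is a small simulation sketch of the counting estimator defined above in the first-order case m=1; the doubly stochastic kernel, the values of \ell and n, and the seed are arbitrary choices, and the final printout is only a sanity check of the rate in Lemma 30.

```python
import numpy as np

rng = np.random.default_rng(1)
ell, n = 4, 100_000
# a toy doubly stochastic kernel (uniform stationary law) with entries in [c1/ell, c2/ell]
P = np.eye(ell)[rng.permutation(ell)]               # a random permutation matrix
T = 0.7 * np.full((ell, ell), 1 / ell) + 0.3 * P    # here c1 = 0.7 and c2 = 0.7 + 0.3 * ell

# simulate a trajectory and count consecutive pairs (the (m+1)-grams for m = 1)
y = np.empty(n, dtype=int)
y[0] = rng.integers(ell)                            # uniform start = stationary start
for t in range(1, n):
    y[t] = rng.choice(ell, p=T[y[t - 1]])

counts = np.zeros((ell, ell))
np.add.at(counts, (y[:-1], y[1:]), 1)
T_hat = ell * counts / (n - 1)                      # \widehat{T}(y_2 | y_1) = ell * N_{y_1 y_2} / (n - 1)
print("Frobenius error:", np.linalg.norm(T_hat - T))  # roughly of order ell / sqrt(n)
```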

The following lemma bounds the pseudo spectral gap from below.

Lemma 31.

Let Tm×T\in\mathbb{R}^{\ell^{m}\times\ell} be the transition matrix of an mm-th order Markov chain (Yt)t1(Y_{t})_{t\geq 1} over a discrete state space 𝒴\mathcal{Y} with |𝒴|=|\mathcal{Y}|=\ell, and assume that

  • all the entries of TT lie in the interval [c1,c2][\frac{c_{1}}{\ell},\frac{c_{2}}{\ell}] for some absolute constants 0<c1<c20<c_{1}<c_{2};

  • TT has the uniform stationary distribution on []m[\ell]^{m}.

Let Tm+1m+1×m+1T_{m+1}\in\mathbb{R}^{\ell^{m+1}\times\ell^{m+1}} be the transition matrix of the first-order Markov chain ((Yt,Yt+1,,Yt+m))t1((Y_{t},Y_{t+1},\cdots,Y_{t+m}))_{t\geq 1}. Then we have

γps(Tm+1)c12m+3\overc2(m+1).\displaystyle\gamma_{\text{\rm ps}}(T_{m+1})\geq{c_{1}^{2m+3}\over c_{2}(m+1)}.

Consequently, we have

𝔼[T^T𝖥2]=ym+1[]m+1Var(T^(ym+1|ym))ym+1[]m+14c2(m+1)mc12m+3T(ym+1|ym)nm=4c2(m+1)2mc12m+3(nm),\displaystyle\mathbb{E}[\|\widehat{T}-T\|_{\mathsf{F}}^{2}]=\sum_{y^{m+1}\in[\ell]^{m+1}}\mathrm{Var}(\widehat{T}(y_{m+1}|y^{m}))\leq\sum_{y^{m+1}\in[\ell]^{m+1}}\frac{4c_{2}(m+1)\ell^{m}}{c_{1}^{2m+3}}\cdot\frac{T(y_{m+1}|y^{m})}{n-m}=\frac{4c_{2}(m+1)\ell^{2m}}{c_{1}^{2m+3}(n-m)},

completing the proof.

Proof of Lemma 31.

As T_{m+1} is the transition matrix of a first-order Markov chain, the stochastic matrix T_{m+1}^{m+1} gives the probabilities of transition from (Y_{t},Y_{t+1},\cdots,Y_{t+m}) to (Y_{t+m+1},Y_{t+m+2},\cdots,Y_{t+2m+1}). By our assumption on T,

mina2m+2𝒴2m+2Tm+1m+1(a2m+2m+2|am+1)t=0mT(a2m+2t|a2m+1tm+2t)c1m+1\overm+1.\displaystyle\min_{a^{2m+2}\in{\mathcal{Y}}^{2m+2}}T^{m+1}_{m+1}(a^{2m+2}_{m+2}|a^{m+1})\geq\prod_{t=0}^{m}T(a_{2m+2-t}|a^{2m+1-t}_{m+2-t})\geq{c_{1}^{m+1}\over\ell^{m+1}}. (142)

Given any am+1,bm+1𝒴m+1a^{m+1},b^{m+1}\in{\mathcal{Y}}^{m+1}, using the above inequality we have

(Tm+1)m+1(bm+1|am+1)\displaystyle(T_{m+1}^{*})^{m+1}(b^{m+1}|a^{m+1})
=𝒚1𝒴m+1,,𝒚m𝒴m+1Tm+1(bm+1|𝒚m){t=1m1Tm+1(𝒚mt+1|𝒚mt)}Tm+1(𝒚1|am+1)\displaystyle=\sum_{\bm{y}_{1}\in{\mathcal{Y}}^{m+1},\dots,\bm{y}_{m}\in{\mathcal{Y}}^{m+1}}T_{m+1}^{*}(b^{m+1}|\bm{y}_{m})\left\{\prod_{t=1}^{m-1}T_{m+1}^{*}(\bm{y}_{m-t+1}|\bm{y}_{m-t})\right\}T_{m+1}^{*}(\bm{y}_{1}|a^{m+1})
=𝒚1𝒴m+1,,𝒚m𝒴m+1π(bm+1)Tm+1(𝒚m|bm+1)\overπ(𝒚m){t=1m1π(𝒚mt+1)Tm+1(𝒚mt|𝒚mt+1)\overπ(𝒚mt)}π(𝒚1)Tm+1(am+1|𝒚1)\overπ(am+1)\displaystyle=\sum_{\bm{y}_{1}\in{\mathcal{Y}}^{m+1},\dots,\bm{y}_{m}\in{\mathcal{Y}}^{m+1}}{\pi(b^{m+1})T_{m+1}(\bm{y}_{m}|b^{m+1})\over\pi(\bm{y}_{m})}\left\{\prod_{t=1}^{m-1}{\pi(\bm{y}_{m-t+1})T_{m+1}(\bm{y}_{m-t}|\bm{y}_{m-t+1})\over\pi(\bm{y}_{m-t})}\right\}{\pi(\bm{y}_{1})T_{m+1}(a^{m+1}|\bm{y}_{1})\over\pi(a^{m+1})}
=π(bm+1)\overπ(am+1)𝒚1𝒴m+1,,𝒚m𝒴m+1Tm+1(𝒚m|bm+1){t=1m1Tm+1(𝒚mt|𝒚mt+1)}Tm+1(am+1|𝒚1)\displaystyle={\pi(b^{m+1})\over\pi(a^{m+1})}\sum_{\bm{y}_{1}\in{\mathcal{Y}}^{m+1},\dots,\bm{y}_{m}\in{\mathcal{Y}}^{m+1}}T_{m+1}(\bm{y}_{m}|b^{m+1})\left\{\prod_{t=1}^{m-1}T_{m+1}(\bm{y}_{m-t}|\bm{y}_{m-t+1})\right\}T_{m+1}(a^{m+1}|\bm{y}_{1})
=π(bm+1)\overπ(am+1)Tm+1m+1(am+1|bm+1)\displaystyle={\pi(b^{m+1})\over\pi(a^{m+1})}T^{m+1}_{m+1}(a^{m+1}|b^{m+1})
=π(bm)T(bm+1|bm)\overπ(am)T(am+1|am)Tm+1m+1(am+1|bm+1)c1\overc2c1m+1\overm+1.\displaystyle={\pi(b^{m})T(b_{m+1}|b^{m})\over\pi(a^{m})T(a_{m+1}|a^{m})}T^{m+1}_{m+1}(a^{m+1}|b^{m+1})\geq{c_{1}\over c_{2}}\cdot{c_{1}^{m+1}\over\ell^{m+1}}. (143)

Using (142) and (143) we get

minam+1,bm+1𝒴m+1{(Tm+1)m+1Tm+1m+1}(bm+1|am+1)\displaystyle\min_{a^{m+1},b^{m+1}\in{\mathcal{Y}}^{m+1}}\left\{(T_{m+1}^{*})^{m+1}T^{m+1}_{m+1}\right\}(b^{m+1}|a^{m+1})
dm+1𝒴m+1(minam+1,dm+1𝒴m+1(Tm+1)m+1(dm+1|am+1))(minbm+1,dm+1𝒴m+1Tm+1m+1(bm+1|dm+1))\displaystyle\geq\sum_{d^{m+1}\in{\mathcal{Y}}^{m+1}}\left(\min_{a^{m+1},d^{m+1}\in{\mathcal{Y}}^{m+1}}(T_{m+1}^{*})^{m+1}(d^{m+1}|a^{m+1})\right)\left(\min_{b^{m+1},d^{m+1}\in{\mathcal{Y}}^{m+1}}T^{m+1}_{m+1}(b^{m+1}|d^{m+1})\right)
dm+1𝒴m+1c12m+3\overc22m+2c12m+3\overc2m+1.\displaystyle\geq\sum_{d^{m+1}\in{\mathcal{Y}}^{m+1}}{c_{1}^{2m+3}\over c_{2}\ell^{2m+2}}\geq{c_{1}^{2m+3}\over c_{2}\ell^{m+1}}. (144)

As (Tm+1)m+1Tm+1m+1(T_{m+1}^{*})^{m+1}T^{m+1}_{m+1} is an m+1×m+1\ell^{m+1}\times\ell^{m+1} stochastic matrix, we can use Lemma 32 to get the lower bound on its spectral gap γ((Tm+1)m+1Tm+1m+1)c12m+3\overc2\gamma((T_{m+1}^{*})^{m+1}T^{m+1}_{m+1})\geq{c_{1}^{2m+3}\over c_{2}}. Hence we get

γps(Tm+1)γ((Tm+1)m+1Tm+1m+1)\overm+1c12m+3\overc2(m+1)\displaystyle\gamma_{\text{\rm ps}}(T_{m+1})\geq{\gamma((T_{m+1}^{*})^{m+1}T^{m+1}_{m+1})\over m+1}\geq{c_{1}^{2m+3}\over c_{2}(m+1)} (145)

as required. A more general version of Lemma 32 can be found in [Hof67]. ∎

Lemma 32.

Suppose that AA is a d×dd\times d stochastic matrix with mini,jAijϵ\min_{i,j}A_{ij}\geq\epsilon. Then for any eigenvalue λ\lambda of AA other than 1 we have |λ|1dϵ|\lambda|\leq 1-d\epsilon.

Proof.

Suppose that \lambda is an eigenvalue of A other than 1 with non-zero left eigenvector \bm{v}, i.e. \lambda v_{j}=\sum_{i=1}^{d}v_{i}A_{ij}, j=1,\dots,d. As A is a stochastic matrix we know that \sum_{j}A_{ij}=1 for all i; summing the eigenvalue equation over j thus gives \lambda\sum_{i=1}^{d}v_{i}=\sum_{i=1}^{d}v_{i}, and since \lambda\neq 1 we conclude \sum_{i=1}^{d}v_{i}=0. This implies

|λvj|=|i=1dviAij|=|i=1dvi(Aijϵ)|i=1d|vi(Aijϵ)|=i=1d|vi|(Aijϵ)\displaystyle|\lambda v_{j}|=\left|\sum_{i=1}^{d}v_{i}A_{ij}\right|=\left|\sum_{i=1}^{d}v_{i}(A_{ij}-\epsilon)\right|\leq\sum_{i=1}^{d}|v_{i}(A_{ij}-\epsilon)|=\sum_{i=1}^{d}|v_{i}|(A_{ij}-\epsilon) (146)

with the last equality following from A_{ij}\geq\epsilon. Summing over j=1,\dots,d in the above equation and dividing by \sum_{i=1}^{d}|v_{i}| we get |\lambda|\leq 1-d\epsilon as required. ∎
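A quick numerical sanity check of Lemma 32 (the matrix below is an arbitrary stochastic matrix with all entries at least \epsilon, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
d, eps = 6, 0.05
B = rng.random((d, d))
B /= B.sum(axis=1, keepdims=True)            # an arbitrary stochastic matrix
A = eps + (1 - d * eps) * B                  # still stochastic, and every entry is >= eps
eig = np.linalg.eigvals(A)
second = np.sort(np.abs(eig))[-2]            # largest modulus among the non-unit eigenvalues
assert second <= 1 - d * eps + 1e-10         # Lemma 32: |lambda| <= 1 - d * eps
print(f"second largest |eigenvalue| = {second:.4f} <= {1 - d * eps:.4f}")
```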

6 Discussions and open problems

We discuss the assumptions and implications of our results as well as related open problems.

Very large state space.

Theorem 1 determines the optimal prediction risk under the assumption of knk\lesssim\sqrt{n}. When knk\gtrsim\sqrt{n}, Theorem 1 shows that the KL risk is bounded away from zero. However, as the KL risk can be as large as logk\log k, it is a meaningful question to determine the optimal rate in this case, which, thanks to the general reduction in (11), reduces to determining the redundancy for symmetric and general Markov chains. For iid data, the minimax pointwise redundancy is known to be nlogkn+O(n2k)n\log\frac{k}{n}+O(\frac{n^{2}}{k}) [SW12, Theorem 1] when knk\gg n. Since the average and pointwise redundancy usually behave similarly, for Markov chains it is reasonable to conjecture that the redundancy is Θ(nlogk2n)\Theta(n\log\frac{k^{2}}{n}) in the large alphabet regime of knk\gtrsim\sqrt{n}, which, in view of (11), would imply the optimal prediction risk is Θ(logk2n)\Theta(\log\frac{k^{2}}{n}) for knk\gg\sqrt{n}. In comparison, we note that the prediction risk is at most logk\log k, achieved by the uniform distribution.

Other loss functions.

As mentioned in Section 1.1, standard arguments based on concentration inequalities inevitably rely on mixing conditions such as the spectral gap. In contrast, the risk bound in Theorem 1, which is free of any mixing condition, is enabled by powerful techniques from universal compression which bound the redundancy by the pointwise maximum over all trajectories combined with information-theoretic or combinatorial arguments. This program relies only on the Markovity of the process rather than stationarity or spectral gap assumptions. The limitation of this approach, however, is that the reduction between prediction and redundancy crucially depends on the form of the KL loss in (1) (in fact, this connection breaks down if one swaps M and \widehat{M} in the KL divergence in (1)), which allows one to use the mutual information representation and the chain rule to relate individual risks to the cumulative risk. More general losses in terms of f-divergences have been considered in [HOP18]. Obtaining spectral gap-independent risk bounds for these loss functions, this time without the aid of universal compression, is an open question.

Stationarity.

As mentioned above, the redundancy result in Lemma 7 (see also [Dav83, TJW18]) holds for nonstationary Markov chains as well. However, our redundancy-based risk upper bound in Lemma 6 crucially relies on stationarity. It is unclear whether the result of Theorem 1 carries over to nonstationary chains.

Appendix A Mutual information representation of prediction risk

The following lemma justifies the representation (22) for the prediction risk as maximal conditional mutual information. Unlike (17) for redundancy, which holds essentially without any condition [Kem74], here we impose certain compactness assumptions which hold for finite alphabets, such as the finite-state Markov chains studied in this paper.

Lemma 33.

Let 𝒳{\mathcal{X}} be finite and let Θ\Theta be a compact subset of d\mathbb{R}^{d}. Given {PXn+1|θ:θΘ}\{P_{X^{n+1}|\theta}:\theta\in\Theta\}, define the prediction risk

𝖱𝗂𝗌𝗄ninfQXn+1|XnsupθΘD(PXn+1|Xn,θQXn+1|Xn|PXn|θ),\mathsf{Risk}_{n}\triangleq\inf_{Q_{X_{n+1}|X^{n}}}\sup_{\theta\in\Theta}D(P_{X_{n+1}|X^{n},\theta}\|Q_{X_{n+1}|X^{n}}|P_{X^{n}|\theta}), (147)

Then

𝖱𝗂𝗌𝗄n=supPθ(Θ)I(θ;Xn+1|Xn).\mathsf{Risk}_{n}=\sup_{P_{\theta}\in{\mathcal{M}}(\Theta)}I(\theta;X_{n+1}|X^{n}). (148)

where (Θ){\mathcal{M}}(\Theta) denotes the collection of all (Borel) probability measures on Θ\Theta.

Note that for stationary Markov chains, (22) follows from Lemma 33 since one can take θ\theta to be the joint distribution of (X1,,Xn+1)(X_{1},\ldots,X_{n+1}) itself which forms a compact subset of the probability simplex on 𝒳n+1{\mathcal{X}}^{n+1}.

Proof.

It is clear that (147) is equivalent to

𝖱𝗂𝗌𝗄n=infQXn+1|XnsupPθ(Θ)D(PXn+1|Xn,θQXn+1|Xn|PXn,θ).\mathsf{Risk}_{n}=\inf_{Q_{X_{n+1}|X^{n}}}\sup_{P_{\theta}\in{\mathcal{M}}(\Theta)}D(P_{X_{n+1}|X^{n},\theta}\|Q_{X_{n+1}|X^{n}}|P_{X^{n},\theta}).

By the variational representation (14) of conditional mutual information, we have

I(θ;Xn+1|Xn)=infQXn+1|XnD(PXn+1|Xn,θQXn+1|Xn|PXn,θ).I(\theta;X_{n+1}|X^{n})=\inf_{Q_{X_{n+1}|X^{n}}}D(P_{X_{n+1}|X^{n},\theta}\|Q_{X_{n+1}|X^{n}}|P_{X^{n},\theta}). (149)

Thus (148) amounts to justifying the interchange of infimum and supremum in (147). Since the supremum of the infimum never exceeds the infimum of the supremum, the lower bound \mathsf{Risk}_{n}\geq\sup_{P_{\theta}}I(\theta;X_{n+1}|X^{n}) is automatic, and it suffices to prove the matching upper bound.

Let |𝒳|=K|{\mathcal{X}}|=K. For ϵ(0,1)\epsilon\in(0,1), define an auxiliary quantity:

𝖱𝗂𝗌𝗄n,ϵinfQXn+1|XnϵKsupPθ(Θ)D(PXn+1|Xn,θQXn+1|Xn|PXn,θ),\mathsf{Risk}_{n,\epsilon}\triangleq\inf_{Q_{X_{n+1}|X^{n}}\geq\frac{\epsilon}{K}}\sup_{P_{\theta}\in{\mathcal{M}}(\Theta)}D(P_{X_{n+1}|X^{n},\theta}\|Q_{X_{n+1}|X^{n}}|P_{X^{n},\theta}), (150)

where the constraint in the infimum is pointwise, namely, QXn+1=xn+1|Xn=xnϵKQ_{X_{n+1}=x_{n+1}|X^{n}=x^{n}}\geq\frac{\epsilon}{K} for all x1,,xn+1𝒳x_{1},\ldots,x_{n+1}\in{\mathcal{X}}. By definition, we have 𝖱𝗂𝗌𝗄n𝖱𝗂𝗌𝗄n,ϵ\mathsf{Risk}_{n}\leq\mathsf{Risk}_{n,\epsilon}. Furthermore, 𝖱𝗂𝗌𝗄n,ϵ\mathsf{Risk}_{n,\epsilon} can be equivalently written as

𝖱𝗂𝗌𝗄n,ϵ=infQXn+1|XnsupPθ(Θ)D(PXn+1|Xn,θ(1ϵ)QXn+1|Xn+ϵU|PXn,θ),\mathsf{Risk}_{n,\epsilon}=\inf_{Q_{X_{n+1}|X^{n}}}\sup_{P_{\theta}\in{\mathcal{M}}(\Theta)}D(P_{X_{n+1}|X^{n},\theta}\|(1-\epsilon)Q_{X_{n+1}|X^{n}}+\epsilon U|P_{X^{n},\theta}), (151)

where UU denotes the uniform distribution on 𝒳{\mathcal{X}}.

We first show that the infimum and supremum in (151) can be interchanged. This follows from the standard minimax theorem. Indeed, note that D(PXn+1|Xn,θ(1ϵ)QXn+1|Xn+ϵU|PXn,θ)D(P_{X_{n+1}|X^{n},\theta}\|(1-\epsilon)Q_{X_{n+1}|X^{n}}+\epsilon U|P_{X^{n},\theta}) is convex in QXn+1|XnQ_{X_{n+1}|X^{n}}, affine in PθP_{\theta}, continuous in each argument, and takes values in [0,logKϵ][0,\log\frac{K}{\epsilon}]. Since (Θ){\mathcal{M}}(\Theta) is convex and weakly compact (by Prokhorov’s theorem) and the collection of conditional distributions QXn+1|XnQ_{X_{n+1}|X^{n}} is convex, the minimax theorem (see, e.g., [Fan53, Theorem 2]) yields

𝖱𝗂𝗌𝗄n,ϵ=supπ(Θ)infQXn+1|XnD(PXn+1|Xn,θ(1ϵ)QXn+1|Xn+ϵU|PXn,θ).\mathsf{Risk}_{n,\epsilon}=\sup_{\pi\in{\mathcal{M}}(\Theta)}\inf_{Q_{X_{n+1}|X^{n}}}D(P_{X_{n+1}|X^{n},\theta}\|(1-\epsilon)Q_{X_{n+1}|X^{n}}+\epsilon U|P_{X^{n},\theta}). (152)

Finally, by the convexity of the KL divergence, for any PP on 𝒳{\mathcal{X}}, we have

D(P(1ϵ)Q+ϵU)(1ϵ)D(PQ)+ϵD(PU)(1ϵ)D(PQ)+ϵlogK,D(P\|(1-\epsilon)Q+\epsilon U)\leq(1-\epsilon)D(P\|Q)+\epsilon D(P\|U)\leq(1-\epsilon)D(P\|Q)+\epsilon\log K,

which, in view of (149) and (152), implies

𝖱𝗂𝗌𝗄n𝖱𝗂𝗌𝗄n,ϵsupPθ(Θ)I(θ;Xn+1|Xn)+ϵlogK.\mathsf{Risk}_{n}\leq\mathsf{Risk}_{n,\epsilon}\leq\sup_{P_{\theta}\in{\mathcal{M}}(\Theta)}I(\theta;X_{n+1}|X^{n})+\epsilon\log K.

By the arbitrariness of ϵ\epsilon, (148) follows. ∎
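The smoothing step above is elementary to check numerically; in the sketch below the distributions, the alphabet size K, and \epsilon are arbitrary illustrative choices.

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(3)
K, eps = 5, 0.1
P = rng.dirichlet(np.ones(K))
Q = rng.dirichlet(np.ones(K))
U = np.full(K, 1 / K)
lhs = kl(P, (1 - eps) * Q + eps * U)
rhs = (1 - eps) * kl(P, Q) + eps * np.log(K)   # convexity of KL in its second argument plus D(P||U) <= log K
assert lhs <= rhs + 1e-12
print(f"{lhs:.4f} <= {rhs:.4f}")
```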

Appendix B Proof of Lemma 16

Recall that for any irreducible and reversible transition matrix M on a finite state space with stationary distribution \pi the following hold:

  1. 1.

    πi>0\pi_{i}>0 for all ii.

  2. 2.

    M(j|i)πi=M(i|j)πjM(j|i)\pi_{i}=M(i|j)\pi_{j} for all i,ji,j.

The following is a direct consequence of the Markov property.

Lemma 34.

For any 1t1<<tm<<tk1\leq t_{1}<\dots<t_{m}<\dots<t_{k} and any Z2=f(Xtk,,Xtm),Z1=g(Xtm1,,Xt1)Z_{2}=f\left(X_{t_{k}},\dots,X_{t_{m}}\right),Z_{1}=g\left(X_{t_{m-1}},\dots,X_{t_{1}}\right) we have

𝔼[Z2𝟏{Xtm=j}Z1|X1=i]=𝔼[Z2|Xtm=j]𝔼[𝟏{Xtm=j}Z1|X1=i]\displaystyle\mathbb{E}\left[Z_{2}{\mathbf{1}_{\left\{{X_{t_{m}}=j}\right\}}}Z_{1}|X_{1}=i\right]=\mathbb{E}\left[Z_{2}|X_{t_{m}}=j\right]\mathbb{E}\left[{\mathbf{1}_{\left\{{X_{t_{m}}=j}\right\}}}Z_{1}|X_{1}=i\right] (153)

For t0t\geq 0, denote the tt-step transition probability by [Xt+1=j|X1=i]=Mt(j|i)\mathbb{P}\left[X_{t+1}=j|X_{1}=i\right]=M^{t}(j|i), which is the ijijth entry of MtM^{t}. The following result is standard (see, e.g., [LP17, Chap. 12]). We include the proof mainly for the purpose of introducing the spectral decomposition.

Lemma 35.

Define λ1γ=max{|λi|:i1}\lambda_{*}\triangleq 1-\gamma_{*}=\max\left\{\left|\lambda_{i}\right|:i\neq 1\right\}. For any t0t\geq 0, |Mt(j|i)πj|λtπj\overπi.\left|M^{t}(j|i)-\pi_{j}\right|\leq\lambda_{*}^{t}{\sqrt{\pi_{j}\over\pi_{i}}}.

Proof.

Throughout the proof all vectors are column vectors except for π\pi. Let DπD_{\pi} denote the diagonal matrix with entries Dπ(i,i)=πiD_{\pi}(i,i)=\pi_{i}. By reversibility, Dπ12MDπ12D_{\pi}^{\frac{1}{2}}MD_{\pi}^{-\frac{1}{2}}, which shares the same spectrum with MM, is a symmetric matrix and admits the spectral decomposition Dπ12MDπ12=a=1kλauauaD_{\pi}^{\frac{1}{2}}MD_{\pi}^{-\frac{1}{2}}=\sum_{a=1}^{k}\lambda_{a}u_{a}u_{a}^{\top} for some orthonormal basis {u1,,uk}\{u_{1},\ldots,u_{k}\}; in particular, λ1=1\lambda_{1}=1 and u1i=πiu_{1i}=\sqrt{\pi_{i}}. Then for each t1t\geq 1,

Mt=a=1kλatDπ12uauaDπ12=𝟏π+a=2kλatDπ12uauaDπ12.\displaystyle M^{t}=\sum_{a=1}^{k}\lambda_{a}^{t}D_{\pi}^{-\frac{1}{2}}u_{a}u_{a}^{\top}D_{\pi}^{\frac{1}{2}}=\mathbf{1}\pi+\sum_{a=2}^{k}\lambda_{a}^{t}D_{\pi}^{-\frac{1}{2}}u_{a}u_{a}^{\top}D_{\pi}^{\frac{1}{2}}.\quad (154)

where \mathbf{1} is the all-ones vector. As the u_{a}'s satisfy \sum_{a=1}^{k}u_{a}u_{a}^{\top}=I, we get \sum_{a=2}^{k}u_{ab}^{2}=1-u_{1b}^{2}\leq 1 for any b=1,\dots,k. Using this along with the Cauchy-Schwarz inequality we get

|Mt(j|i)πj|πj\overπia=2k|λa|t|uaiuaj|λtπj\overπi(a=2kuai2)12(a=2kuaj2)12λtπj\overπi\displaystyle\left|M^{t}(j|i)-\pi_{j}\right|\leq\sqrt{\pi_{j}\over\pi_{i}}\sum_{a=2}^{k}\left|\lambda_{a}\right|^{t}|u_{ai}u_{aj}|\leq\lambda_{*}^{t}\sqrt{\pi_{j}\over\pi_{i}}\left(\sum_{a=2}^{k}u_{ai}^{2}\right)^{\frac{1}{2}}\left(\sum_{a=2}^{k}u_{aj}^{2}\right)^{\frac{1}{2}}\leq\lambda_{*}^{t}\sqrt{\pi_{j}\over\pi_{i}}

as required. ∎
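Lemma 35 is also easy to verify numerically. The sketch below builds an arbitrary reversible chain from symmetric conductances (an illustrative construction of ours, not one used in the paper) and checks the bound for a few values of t.

```python
import numpy as np

rng = np.random.default_rng(4)
k = 5
W = rng.random((k, k))
W = (W + W.T) / 2                                        # symmetric conductances
M = W / W.sum(axis=1, keepdims=True)                     # reversible kernel with pi_i proportional to sum_j W_ij
pi = W.sum(axis=1) / W.sum()

S = np.diag(np.sqrt(pi)) @ M @ np.diag(1 / np.sqrt(pi))  # D_pi^{1/2} M D_pi^{-1/2}: symmetric, same spectrum as M
lam_star = np.sort(np.abs(np.linalg.eigvalsh(S)))[-2]    # lambda_* = max_{i != 1} |lambda_i|

for t in (1, 5, 20):
    Mt = np.linalg.matrix_power(M, t)
    bound = lam_star**t * np.sqrt(np.outer(1 / pi, pi))  # (i, j) entry: lambda_*^t * sqrt(pi_j / pi_i)
    assert np.all(np.abs(Mt - pi) <= bound + 1e-9)       # |M^t(j|i) - pi_j| <= lambda_*^t sqrt(pi_j/pi_i)
print("Lemma 35 bound verified for t = 1, 5, 20")
```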

Lemma 36.

Fix states i,ji,j. For any integers ab1a\geq b\geq 1, define

hs(a,b)=|𝔼[𝟏{Xa+1=i}(𝟏{Xa=j}M(j|i))s|Xb=i]|,s=1,2,3,4.h_{s}(a,b)=\left|\mathbb{E}\left[{\mathbf{1}_{\left\{{X_{a+1}=i}\right\}}}\left({\mathbf{1}_{\left\{{X_{a}=j}\right\}}}-M(j|i)\right)^{s}|X_{b}=i\right]\right|,\quad s=1,2,3,4.

Then

  1. (i)

    h1(a,b)2M(j|i)λab{h_{1}(a,b)}\leq 2{\sqrt{{M(j|i)}}}{\lambda^{a-b}_{*}}

  2. (ii)

    |h2(a,b)πiM(j|i)(1M(j|i))|4M(j|i)λab.\left|h_{2}(a,b)-\pi_{i}M(j|i)(1-M(j|i))\right|\leq 4{\sqrt{{M(j|i)}}}{\lambda^{a-b}_{*}}.

  3. (iii)

    h3(a,b),h4(a,b)πiM(j|i)(1M(j|i))+4M(j|i)λab.{h_{3}(a,b)},{h_{4}(a,b)}\leq\pi_{i}M(j|i)(1-M(j|i))+4{\sqrt{{M(j|i)}}}{\lambda^{a-b}_{*}}.

Proof.

We apply Lemma 35 and time reversibility:

  1. (i)
    h1(a,b)\displaystyle h_{1}(a,b) =|[Xa+1=i,Xa=j|Xb=i]M(j|i)[Xa+1=i|Xb=i]|\displaystyle=\left|\mathbb{P}\left[X_{a+1}=i,X_{a}=j|X_{b}=i\right]-M(j|i)\mathbb{P}\left[X_{a+1}=i|X_{b}=i\right]\right|
    =|M(i|j)Mab(j|i)M(j|i)Mab+1(i|i)|\displaystyle=\left|M(i|j){M^{a-b}}(j|i)-M(j|i)M^{a-b+1}(i|i)\right|
    M(i|j)|Mab(j|i)πj|+M(j|i)|Mab+1(i|i)πi|\displaystyle\leq M(i|j)\left|M^{a-b}(j|i)-\pi_{j}\right|+M(j|i)\left|M^{a-b+1}(i|i)-\pi_{i}\right|
    λabM(i|j)πj\overπi+M(j|i)λab+1\displaystyle\leq\lambda_{*}^{a-b}M(i|j)\sqrt{\pi_{j}\over\pi_{i}}+M(j|i)\lambda^{a-b+1}_{*}
    =λabM(j|i)M(i|j)+M(j|i)λab+12M(j|i)λab.\displaystyle=\lambda^{a-b}_{*}\sqrt{M(j|i)M(i|j)}+{M(j|i)}\lambda_{*}^{a-b+1}\leq 2\sqrt{M(j|i)}{\lambda^{a-b}_{*}}.
  2. (ii)
    |h2(a,b)πiM(j|i)(1M(j|i))|\displaystyle|h_{2}(a,b)-\pi_{i}M(j|i)(1-M(j|i))|
    =\displaystyle= |𝔼[𝟏{Xa+1=i,Xa=j}|Xb=i]πiM(j|i)+(M(j|i))2(𝔼[𝟏{Xa+1=i}|Xb=i]πi)\displaystyle\Big{|}\mathbb{E}\left[{\mathbf{1}_{\left\{{X_{a+1}=i,X_{a}=j}\right\}}}|X_{b}=i\right]-\pi_{i}M(j|i)+\left(M(j|i)\right)^{2}(\mathbb{E}\left[{\mathbf{1}_{\left\{{X_{a+1}=i}\right\}}}|X_{b}=i\right]-\pi_{i})
    2M(j|i)(𝔼[𝟏{Xa+1=i,Xa=j}|Xb=i]πiM(j|i))|\displaystyle-2M(j|i)(\mathbb{E}\left[{\mathbf{1}_{\left\{{X_{a+1}=i,X_{a}=j}\right\}}}|X_{b}=i\right]-\pi_{i}M(j|i))\Big{|}
    \displaystyle\leq |[Xa+1=i,Xa=j|Xb=i]πjM(i|j)|+(M(j|i))2|[Xa+1=i|Xb=i]πi|\displaystyle\left|\mathbb{P}\left[X_{a+1}=i,X_{a}=j|X_{b}=i\right]-\pi_{j}M(i|j)\right|+({M(j|i)})^{2}\left|\mathbb{P}\left[X_{a+1}=i|X_{b}=i\right]-\pi_{i}\right|
    +2M(j|i)|[Xa+1=i,Xa=j|Xb=i]πjM(i|j)|\displaystyle\quad+2{M(j|i)}\left|\mathbb{P}\left[X_{a+1}=i,X_{a}=j|X_{b}=i\right]-\pi_{j}M(i|j)\right|
    =\displaystyle= M(i|j)|Mab(j|i)πj|+(M(j|i))2|Mab+1(i|i)πi|+2M(j|i)M(i|j)|Mab(j|i)πj|\displaystyle M(i|j)\left|M^{a-b}(j|i)-\pi_{j}\right|+({M(j|i)})^{2}\left|M^{a-b+1}(i|i)-\pi_{i}\right|+2M(j|i)M(i|j)\left|M^{a-b}(j|i)-\pi_{j}\right|
    \displaystyle\leq M(i|j)πj\overπiλab+(M(j|i))2λab+1+2M(j|i)M(i|j)πj\overπiλab\displaystyle M(i|j)\sqrt{\pi_{j}\over\pi_{i}}\lambda_{*}^{a-b}+({M(j|i)})^{2}\lambda^{a-b+1}_{*}+2{M(j|i)}M(i|j)\sqrt{\pi_{j}\over\pi_{i}}\lambda^{a-b}_{*}
    \displaystyle\leq λab(M(i|j)M(i|j)πj\overπi+(M(j|i))2+2M(j|i)M(i|j)M(i|j)πj\overπi)\displaystyle\lambda_{*}^{a-b}\left(\sqrt{M(i|j)}\sqrt{M(i|j)\pi_{j}\over\pi_{i}}+({M(j|i)})^{2}+2{M(j|i)}\sqrt{M(i|j)}\sqrt{M(i|j)\pi_{j}\over\pi_{i}}\right)
    \displaystyle\leq 4M(j|i)λab.\displaystyle 4\sqrt{M(j|i)}\lambda^{a-b}_{*}.
  3. (iii)

    h3(a,b),h4(a,b)h2(a,b)h_{3}(a,b),h_{4}(a,b)\leq h_{2}(a,b). ∎

Proof of Lemma 16(i).

For ease of notation we use c0c_{0} to denote an absolute constant whose value may vary at each occurrence. Fix i,j[k]i,j\in[k]. Note that the empirical count defined in (4) can be written as Ni=a=1n1𝟏{Xna=i}N_{i}=\sum_{a=1}^{n-1}{\mathbf{1}_{\left\{{X_{n-a}=i}\right\}}} and Nij=a=1n1𝟏{Xna=i,Xna+1=j}N_{ij}=\sum_{a=1}^{n-1}{\mathbf{1}_{\left\{{X_{n-a}=i,X_{n-a+1}=j}\right\}}}. Then

𝔼[(M(j|i)NiNij)2|Xn=i]\displaystyle\mathbb{E}\left[\left(M(j|i)N_{i}-N_{ij}\right)^{2}|X_{n}=i\right]
=\displaystyle= 𝔼[(a=1n1𝟏{Xna=i}(𝟏{Xna+1=j}M(j|i)))2|Xn=i]\displaystyle\mathbb{E}\left[\left.\left(\sum_{a=1}^{n-1}{\mathbf{1}_{\left\{{X_{n-a}=i}\right\}}}\left({\mathbf{1}_{\left\{{X_{n-a+1}=j}\right\}}}-{M(j|i)}\right)\right)^{2}\right|X_{n}=i\right]
=(a)\displaystyle\overset{\rm(a)}{=} 𝔼[(a=1n1𝟏{Xa+1=i}(𝟏{Xa=j}M(j|i)))2|X1=i]\displaystyle\mathbb{E}\left[\left.\left(\sum_{a=1}^{n-1}{\mathbf{1}_{\left\{{X_{a+1}=i}\right\}}}\left({\mathbf{1}_{\left\{{X_{a}=j}\right\}}}-{M(j|i)}\right)\right)^{2}\right|X_{1}=i\right]
=(b)\displaystyle\overset{\rm(b)}{=} |a,b𝔼[ηaηb|X1=i]|2ab|𝔼[ηaηb|X1=i]|,\displaystyle\left|\sum_{a,b}\mathbb{E}\left[\eta_{a}\eta_{b}|X_{1}=i\right]\right|\leq 2\sum_{a\geq b}\left|\mathbb{E}\left[\eta_{a}\eta_{b}|X_{1}=i\right]\right|,

where (a) is due to time reversibility; in (b) we defined ηa𝟏{Xa+1=i}(𝟏{Xa=j}M(j|i))\eta_{a}\triangleq{\mathbf{1}_{\left\{{X_{a+1}=i}\right\}}}\left({\mathbf{1}_{\left\{{X_{a}=j}\right\}}}-{M(j|i)}\right). We divide the summands into different cases and apply Lemma 36.

Case I: Two distinct indices.

For any a>ba>b, using Lemma 34 we get

\left|\mathbb{E}\left[\eta_{a}\eta_{b}|X_{1}=i\right]\right|=\left|\mathbb{E}\left[\eta_{a}|X_{b+1}=i\right]\right|\left|\mathbb{E}\left[\eta_{b}|X_{1}=i\right]\right|=h_{1}(a,b+1)h_{1}(b,1) (155)

which implies

n1a>b1|𝔼[ηaηb|X1=i]|=n1a>b1h1(a,b+1)h1(b,1)M(j|i)n1a>b1λa2M(j|i)\overγ2.\displaystyle\mathop{\sum\sum}_{n-1\geq a>b\geq 1}\left|\mathbb{E}\left[\eta_{a}\eta_{b}|X_{1}=i\right]\right|=\mathop{\sum\sum}_{n-1\geq a>b\geq 1}h_{1}(a,b+1)h_{1}(b,1)\lesssim{M(j|i)}\mathop{\sum\sum}_{n-1\geq a>b\geq 1}\lambda^{a-2}_{*}\lesssim{M(j|i)\over{\gamma^{2}_{*}}}.

Here the last inequality (and similar sums in later deductions) can be explained as follows. Note that for γ12\gamma_{*}\geq\frac{1}{2} (i.e. λ12\lambda_{*}\leq\frac{1}{2}), the sum is clearly bounded by an absolute constant; for γ<12\gamma_{*}<\frac{1}{2} (i.e. λ>12\lambda_{*}>\frac{1}{2}), we compare the sum with the mean (or higher moments in other calculations) of a geometric random variable.
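For instance, for the double sum above, each value of a\geq 2 contributes at most a-1 choices of b, so

\sum_{n-1\geq a>b\geq 1}\lambda_{*}^{a-2}\leq\sum_{a\geq 2}(a-1)\lambda_{*}^{a-2}\leq\sum_{t\geq 1}t\lambda_{*}^{t-1}=\frac{1}{(1-\lambda_{*})^{2}}=\frac{1}{\gamma_{*}^{2}},

which is exactly the comparison with the mean of a geometric random variable mentioned above.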

Case II: Single index.
a=1n1𝔼[ηa2|X1=i]=a=1n1h2(a,1)nπiM(j|i)(1M(j|i))+M(j|i)\overγ.\displaystyle\sum_{a=1}^{n-1}\mathbb{E}\left[\eta_{a}^{2}|X_{1}=i\right]=\sum_{a=1}^{n-1}h_{2}(a,1)\lesssim n\pi_{i}{M(j|i)}(1-{M(j|i)})+{\sqrt{{M(j|i)}}\over\gamma_{*}}. (156)

Combining the above we get

𝔼[(NijM(j|i)Ni)2|Xn=i]nπiM(j|i)(1M(j|i))+M(j|i)\overγ+M(j|i)\overγ2\displaystyle\mathbb{E}\left[\left(N_{ij}-{M(j|i)}N_{i}\right)^{2}|X_{n}=i\right]\lesssim n\pi_{i}{M(j|i)}(1-{M(j|i)})+{\sqrt{{M(j|i)}}\over\gamma_{*}}+{M(j|i)\over\gamma_{*}^{2}}

as required. ∎

Proof of Lemma 16(ii).

We first note that, by reversibility and arguing as in the proof of Lemma 16(i), with \eta_{a}={\mathbf{1}_{\left\{{X_{a+1}=i}\right\}}}\left({\mathbf{1}_{\left\{{X_{a}=j}\right\}}}-{M(j|i)}\right) we can write

𝔼[(M(j|i)NiNij)4|Xn=i]\displaystyle\mathbb{E}\left[\left(M(j|i)N_{i}-N_{ij}\right)^{4}|X_{n}=i\right]
=𝔼[(a=1n1𝟏{Xa+1=i}(𝟏{Xa=j}M(j|i)))4|X1=i]\displaystyle=\mathbb{E}\left[\left.\left(\sum_{a=1}^{n-1}{\mathbf{1}_{\left\{{X_{a+1}=i}\right\}}}\left({\mathbf{1}_{\left\{{X_{a}=j}\right\}}}-{M(j|i)}\right)\right)^{4}\right|X_{1}=i\right]
=|a,b,d,e𝔼[ηaηbηdηe|X1=i]|a,b,d,e|𝔼[ηaηbηdηe|X1=i]|abde|𝔼[ηaηbηdηe|X1=i]|.\displaystyle=\left|\sum_{a,b,d,e}\mathbb{E}\left[\eta_{a}\eta_{b}\eta_{d}\eta_{e}|X_{1}=i\right]\right|\leq\sum_{a,b,d,e}\left|\mathbb{E}\left[\eta_{a}\eta_{b}\eta_{d}\eta_{e}|X_{1}=i\right]\right|\lesssim\sum_{a\geq b\geq d\geq e}\left|\mathbb{E}\left[\eta_{a}\eta_{b}\eta_{d}\eta_{e}|X_{1}=i\right]\right|. (157)

We bound the sum over the different combinations of a\geq b\geq d\geq e to obtain a bound on the required fourth moment. We first divide the terms into groups according to how many distinct indices appear among a,b,d,e. We use the following identities, which follow from Lemma 34: for indices a>b>d>e,

  • |𝔼[ηaηbηdηe|X1=i]|=h1(a,b+1)h1(b,d+1)h1(d,e+1)h1(e,1)\left|\mathbb{E}\left[\eta_{a}\eta_{b}\eta_{d}\eta_{e}|X_{1}=i\right]\right|=h_{1}(a,b+1)h_{1}(b,d+1)h_{1}(d,e+1)h_{1}(e,1)

  • For s1,s2,s3{1,2}s_{1},s_{2},s_{3}\in\left\{1,2\right\}, |𝔼[ηas1ηbs2ηds3|X1=i]|=hs1(a,b+1)hs2(b,d+1)hs3(d,1)\left|\mathbb{E}\left[\eta_{a}^{s_{1}}\eta_{b}^{s_{2}}\eta_{d}^{s_{3}}|X_{1}=i\right]\right|=h_{s_{1}}(a,b+1)h_{s_{2}}(b,d+1)h_{s_{3}}(d,1)

  • For s1,s2{1,2,3}s_{1},s_{2}\in\left\{1,2,3\right\}, |𝔼[ηas1ηbs2|X1=i]|=hs1(a,b+1)hs2(b,1)\left|\mathbb{E}\left[\eta_{a}^{s_{1}}\eta_{b}^{s_{2}}|X_{1}=i\right]\right|=h_{s_{1}}(a,b+1)h_{s_{2}}(b,1)

  • \mathbb{E}\left[\eta_{a}^{4}|X_{1}=i\right]=h_{4}(a,1)

and then use Lemma 36 to bound the hh functions.

Case I: Four distinct indices.

Using Lemma 36 we have

n1a>b>d>e1|𝔼[ηaηbηdηe|X1=i]|\displaystyle\mathop{\sum\sum\sum\sum}_{n-1\geq a>b>d>e\geq 1}\left|\mathbb{E}\left[\eta_{a}\eta_{b}\eta_{d}\eta_{e}|X_{1}=i\right]\right| =n1a>b>d>e1h1(a,b+1)h1(b,d+1)h1(d,e+1)h1(e,1)\displaystyle=\mathop{\sum\sum\sum\sum}_{n-1\geq a>b>d>e\geq 1}h_{1}(a,b+1)h_{1}(b,d+1)h_{1}(d,e+1)h_{1}(e,1)
M(j|i)2n1a>b>d>e1λa4M(j|i)2\overγ4.\displaystyle{\leq}M(j|i)^{2}\mathop{\sum\sum\sum\sum}_{n-1\geq a>b>d>e\geq 1}\lambda^{a-4}_{*}\lesssim{M(j|i)^{2}\over\gamma_{*}^{4}}.
Case II: Three distinct indices.

There are three cases, namely ηa2ηbηd,ηaηb2ηd\eta_{a}^{2}\eta_{b}\eta_{d},\eta_{a}\eta_{b}^{2}\eta_{d} and ηaηbηd2\eta_{a}\eta_{b}\eta_{d}^{2}.

  1. 1.

    Bounding n1a>b>d1|𝔼[ηa2ηbηd|X1=i]|\mathop{\sum\sum\sum}_{n-1\geq a>b>d\geq 1}\left|\mathbb{E}\left[\eta_{a}^{2}\eta_{b}\eta_{d}|X_{1}=i\right]\right|:

    n1a>b>d1|𝔼[ηa2ηbηd|X1=i]|\displaystyle\mathop{\sum\sum\sum}_{n-1\geq a>b>d\geq 1}\left|\mathbb{E}\left[\eta_{a}^{2}\eta_{b}\eta_{d}|X_{1}=i\right]\right| =n1a>b>d1h2(a,b+1)h1(b,d+1)h1(d,1)\displaystyle=\mathop{\sum\sum\sum}_{n-1\geq a>b>d\geq 1}h_{2}(a,b+1)h_{1}(b,d+1)h_{1}(d,1)
    n1a>b>d1(πiM(j|i)(1M(j|i))+M(j|i)λab1)M(j|i)λb2\displaystyle\lesssim\mathop{\sum\sum\sum}_{n-1\geq a>b>d\geq 1}\left(\pi_{i}{M(j|i)}(1-{M(j|i)})+{\sqrt{{M(j|i)}}}{\lambda^{a-b-1}_{*}}\right){M(j|i)}{\lambda^{b-2}_{*}}
    M(j|i)\overγ2nπiM(j|i)(1M(j|i))+M(j|i)32\overγ3\displaystyle\lesssim{{{{M(j|i)}\over\gamma_{*}^{2}}n\pi_{i}{M(j|i)}(1-{M(j|i)})}+{{M(j|i)}^{\frac{3}{2}}\over\gamma_{*}^{3}}}
    (nπiM(j|i)(1M(j|i)))2+M(j|i)32\overγ3+M(j|i)2\overγ4\displaystyle\lesssim{\left(n\pi_{i}{M(j|i)}(1-{M(j|i)})\right)^{2}+{{M(j|i)}^{\frac{3}{2}}\over\gamma_{*}^{3}}+{{M(j|i)}^{2}\over\gamma_{*}^{4}}}

    where the last inequality follows from xy\leq x^{2}+y^{2}.

  2. 2.

    Bounding n2a>b>d1|𝔼[ηaηb2ηd|X1=i]|\mathop{\sum\sum\sum}_{n-2\geq a>b>d\geq 1}\left|\mathbb{E}\left[\eta_{a}\eta_{b}^{2}\eta_{d}|X_{1}=i\right]\right|:

    n2a>b>d1|𝔼[ηaηb2ηd|X1=i]|\displaystyle\mathop{\sum\sum\sum}_{n-2\geq a>b>d\geq 1}\left|\mathbb{E}\left[\eta_{a}\eta_{b}^{2}\eta_{d}|X_{1}=i\right]\right|
    =n2a>b>d1h1(a,b+1)h2(b,d+1)h1(d,1)\displaystyle=\mathop{\sum\sum\sum}_{n-2\geq a>b>d\geq 1}h_{1}(a,b+1)h_{2}(b,d+1)h_{1}(d,1)
    n2a>b>d1(πiM(j|i)(1M(j|i))+M(j|i)λbd1)M(j|i)λab+d2\displaystyle\lesssim\mathop{\sum\sum\sum}_{n-2\geq a>b>d\geq 1}\left(\pi_{i}{M(j|i)}(1-{M(j|i)})+{\sqrt{{M(j|i)}}}{\lambda^{b-d-1}_{*}}\right)M(j|i){\lambda^{a-b+d-2}_{*}}
    M(j|i)\overγ2nπiM(j|i)(1M(j|i))+M(j|i)32\overγ3\displaystyle\lesssim{M(j|i)\over\gamma_{*}^{2}}n\pi_{i}M(j|i)(1-M(j|i))+{M(j|i)^{\frac{3}{2}}\over\gamma_{*}^{3}}
    nπiM(j|i)(1M(j|i))2+M(j|i)32\overγ3+M(j|i)2\overγ4.\displaystyle\lesssim{n\pi_{i}{M(j|i)}(1-{M(j|i)})}^{2}+{M(j|i)^{\frac{3}{2}}\over\gamma_{*}^{3}}+{M(j|i)^{2}\over\gamma_{*}^{4}}.
  3. 3.

    Bounding n2a>b>d1|𝔼[ηaηbηd2|X1=i]|\mathop{\sum\sum\sum}_{n-2\geq a>b>d\geq 1}\left|\mathbb{E}\left[\eta_{a}\eta_{b}\eta_{d}^{2}|X_{1}=i\right]\right|:

    n2a>b>d1|𝔼[ηaηbηd2|X1=i]|\displaystyle\mathop{\sum\sum\sum}_{n-2\geq a>b>d\geq 1}\left|\mathbb{E}\left[\eta_{a}\eta_{b}\eta_{d}^{2}|X_{1}=i\right]\right|
    =n2a>b>d1h1(a,b+1)h1(b,d+1)h2(d,1)\displaystyle=\mathop{\sum\sum\sum}_{n-2\geq a>b>d\geq 1}h_{1}(a,b+1)h_{1}(b,d+1)h_{2}(d,1)
    n2a>b>d1(πiM(j|i)(1M(j|i))+M(j|i)λd1)M(j|i)λad2\displaystyle\lesssim\mathop{\sum\sum\sum}_{n-2\geq a>b>d\geq 1}\left(\pi_{i}{M(j|i)}(1-{M(j|i)})+\sqrt{M(j|i)}{\lambda^{d-1}_{*}}\right)M(j|i)\lambda^{a-d-2}_{*}
    M(j|i)\overγ2nπiM(j|i)(1M(j|i))+M(j|i)32\overγ3\displaystyle\lesssim{{{{M(j|i)}\over\gamma_{*}^{2}}n\pi_{i}{M(j|i)}(1-{M(j|i)})}+{{M(j|i)}^{\frac{3}{2}}\over\gamma_{*}^{3}}}
    (nπiM(j|i)(1M(j|i)))2+M(j|i)32\overγ3+M(j|i)2\overγ4.\displaystyle\lesssim\left(n\pi_{i}{M(j|i)}(1-{M(j|i)})\right)^{2}+{{M(j|i)}^{\frac{3}{2}}\over\gamma_{*}^{3}}+{{M(j|i)}^{2}\over\gamma_{*}^{4}}.
Case III: Two distinct indices.

There are three different cases, namely ηa2ηb2,ηa3ηb\eta_{a}^{2}\eta_{b}^{2},\eta_{a}^{3}\eta_{b} and ηaηb3\eta_{a}\eta_{b}^{3}.

  1. 1.

    Bounding n2a>b1|𝔼[ηa2ηb2|X1=i]|\mathop{\sum\sum}_{n-2\geq a>b\geq 1}\left|\mathbb{E}\left[\eta_{a}^{2}\eta_{b}^{2}|X_{1}=i\right]\right|:

    n2a>b1𝔼[ηa2ηb2|X1=i]\displaystyle\mathop{\sum\sum}_{n-2\geq a>b\geq 1}\mathbb{E}\left[\eta_{a}^{2}\eta_{b}^{2}|X_{1}=i\right]
    =n2a>b1h2(a,b+1)h2(b,1)\displaystyle=\mathop{\sum\sum}_{n-2\geq a>b\geq 1}h_{2}(a,b+1)h_{2}(b,1)
    n2a>b1(πiM(j|i)(1M(j|i))+M(j|i)λab1)(πiM(j|i)(1M(j|i))+M(j|i)λb1)\displaystyle\lesssim\mathop{\sum\sum}_{n-2\geq a>b\geq 1}{\left(\pi_{i}{M(j|i)}(1-{M(j|i)})+{\sqrt{{M(j|i)}}}\lambda^{a-b-1}_{*}\right)\left(\pi_{i}{M(j|i)}(1-{M(j|i)})+{\sqrt{{M(j|i)}}}\lambda^{b-1}_{*}\right)}
    n2a>b1{πiM(j|i)(1M(j|i))M(j|i)(λab1+λb1)\displaystyle\lesssim\mathop{\sum\sum}_{n-2\geq a>b\geq 1}\Big{\{}\pi_{i}{M(j|i)}(1-{M(j|i)}){\sqrt{{M(j|i)}}}({\lambda^{a-b-1}_{*}}+{\lambda^{b-1}_{*}})
    +(πiM(j|i)(1M(j|i)))2+M(j|i)λa2}\displaystyle\quad+\left(\pi_{i}{M(j|i)}(1-{M(j|i)})\right)^{2}+{{M(j|i)}}\lambda_{*}^{a-2}\Big{\}}
    (nπiM(j|i)(1M(j|i)))2+M(j|i)\overγnπiM(j|i)(1M(j|i))+M(j|i)\overγ2\displaystyle\lesssim{\left(n\pi_{i}{M(j|i)}(1-{M(j|i)})\right)^{2}+{{{\sqrt{{M(j|i)}}\over\gamma_{*}}}n\pi_{i}{M(j|i)}(1-{M(j|i)})}+{{{M(j|i)}\over\gamma_{*}^{2}}}}
    (nπiM(j|i)(1M(j|i)))2+M(j|i)\overγ2.\displaystyle\lesssim\left(n\pi_{i}{M(j|i)}(1-{M(j|i)})\right)^{2}+{{{M(j|i)}\over\gamma_{*}^{2}}}.
  2. 2.

    Bounding n2a>b1|𝔼[ηa3ηb|X1=i]|\mathop{\sum\sum}_{n-2\geq a>b\geq 1}\left|\mathbb{E}\left[\eta_{a}^{3}\eta_{b}|X_{1}=i\right]\right|:

    n2a>b1|𝔼[ηa3ηb|X1=i]|\displaystyle\mathop{\sum\sum}_{n-2\geq a>b\geq 1}\left|\mathbb{E}\left[\eta_{a}^{3}\eta_{b}|X_{1}=i\right]\right|
    =n2a>b1h3(a,b+1)h1(b,1)\displaystyle=\mathop{\sum\sum}_{n-2\geq a>b\geq 1}h_{3}(a,b+1)h_{1}(b,1)
    n2a>b1(πiM(j|i)(1M(j|i))+M(j|i)λab1)M(j|i)λb1\displaystyle\lesssim\mathop{\sum\sum}_{n-2\geq a>b\geq 1}\left(\pi_{i}{M(j|i)}(1-{M(j|i)})+{\sqrt{{M(j|i)}}}{\lambda^{a-b-1}_{*}}\right){\sqrt{{M(j|i)}}}{\lambda^{b-1}_{*}}
    M(j|i)\overγnπiM(j|i)(1M(j|i))+M(j|i)\overγ2(nπiM(j|i)(1M(j|i)))2+M(j|i)\overγ2.\displaystyle\lesssim{\sqrt{{M(j|i)}}\over\gamma_{*}}n\pi_{i}{M(j|i)}(1-{M(j|i)})+{{M(j|i)}\over\gamma_{*}^{2}}\lesssim\left(n\pi_{i}{M(j|i)}(1-{M(j|i)})\right)^{2}+{{{M(j|i)}\over\gamma_{*}^{2}}}.
  3. 3.

    Bounding n2a>b1|𝔼[ηaηb3|X1=i]|\mathop{\sum\sum}_{n-2\geq a>b\geq 1}\left|\mathbb{E}\left[\eta_{a}\eta_{b}^{3}|X_{1}=i\right]\right|:

    n2a>b1|𝔼[ηaηb3|X1=i]|\displaystyle\mathop{\sum\sum}_{n-2\geq a>b\geq 1}\left|\mathbb{E}\left[\eta_{a}\eta_{b}^{3}|X_{1}=i\right]\right|
    =n2a>b1h1(a,b+1)h3(b,1)\displaystyle=\mathop{\sum\sum}_{n-2\geq a>b\geq 1}h_{1}(a,b+1)h_{3}(b,1)
    n2a>b1(πiM(j|i)(1M(j|i))+M(j|i)λb1)M(j|i)λab1\displaystyle\lesssim\mathop{\sum\sum}_{n-2\geq a>b\geq 1}\left(\pi_{i}{M(j|i)}(1-{M(j|i)})+{\sqrt{{M(j|i)}}}\lambda^{b-1}_{*}\right){\sqrt{{M(j|i)}}}{\lambda^{a-b-1}_{*}}
    M(j|i)\overγnπiM(j|i)(1M(j|i))+M(j|i)\overγ2(nπiM(j|i)(1M(j|i)))2+M(j|i)\overγ2.\displaystyle\lesssim{{\sqrt{{M(j|i)}}\over\gamma_{*}}n\pi_{i}{M(j|i)}(1-{M(j|i)})+{{{M(j|i)}\over\gamma_{*}^{2}}}}\lesssim\left(n\pi_{i}{M(j|i)}(1-{M(j|i)})\right)^{2}+{{{M(j|i)}\over\gamma_{*}^{2}}}.
Case IV: Single index.

Bound on a=1n1𝔼[ηa4|X1=i]\sum_{a=1}^{n-1}\mathbb{E}\left[\eta_{a}^{4}|X_{1}=i\right]:

a=1n1𝔼[ηa4|X1=i]=a=1n1h4(a,1)nπiM(j|i)(1M(j|i))+M(j|i)\overγ.\displaystyle\sum_{a=1}^{n-1}\mathbb{E}\left[\eta_{a}^{4}|X_{1}=i\right]=\sum_{a=1}^{n-1}h_{4}(a,1){\leq}{n\pi_{i}{M(j|i)}(1-{M(j|i)})}+{{\sqrt{{M(j|i)}}\over\gamma_{*}}}.

Combining all cases we get

𝔼[(M(j|i)NiNij)4|Xn=i]\displaystyle\mathbb{E}\left[\left({M(j|i)}N_{i}-N_{ij}\right)^{4}|X_{n}=i\right] (nπiM(j|i)(1M(j|i)))2+M(j|i)\overγ+M(j|i)\overγ2+M(j|i)32\overγ3+M(j|i)2\overγ4\displaystyle\lesssim\left(n\pi_{i}{M(j|i)}(1-{M(j|i)})\right)^{2}+{\sqrt{{M(j|i)}}\over\gamma_{*}}+{M(j|i)\over\gamma_{*}^{2}}+{M(j|i)^{\frac{3}{2}}\over\gamma_{*}^{3}}+{M(j|i)^{2}\over\gamma_{*}^{4}}
(nπiM(j|i)(1M(j|i)))2+M(j|i)\overγ+M(j|i)2\overγ4\displaystyle\lesssim\left(n\pi_{i}{M(j|i)}(1-{M(j|i)})\right)^{2}+{\sqrt{{M(j|i)}}\over\gamma_{*}}+{M(j|i)^{2}\over\gamma_{*}^{4}}

as required. ∎

Proof of Lemma 16(iii).

Throughout our proof we repeatedly use the spectral decomposition (154) applied to the diagonal elements:

Mt(i|i)=πi+v2λvtuvi2,v2uvi21.M^{t}(i|i)=\pi_{i}+\sum_{v\geq 2}\lambda_{v}^{t}u_{vi}^{2},\quad\sum_{v\geq 2}u_{vi}^{2}\leq 1.

Write Ni(n1)πi=a=1n1ξaN_{i}-(n-1)\pi_{i}=\sum_{a=1}^{n-1}\xi_{a} where ξa=𝟏{Xa=i}πi\xi_{a}={\mathbf{1}_{\left\{{X_{a}=i}\right\}}}-\pi_{i}. For abdea\geq b\geq d\geq e,

𝔼[ξaξbξdξe|X1=i]\displaystyle\mathbb{E}\left[\xi_{a}\xi_{b}\xi_{d}\xi_{e}|X_{1}=i\right]
=𝔼[ξaξb(𝟏{Xd=i,Xe=i}πi𝟏{Xd=i}πi𝟏{Xe=i}+πi2)|X1=i]\displaystyle=\mathbb{E}\left[\xi_{a}\xi_{b}\left({\mathbf{1}_{\left\{{X_{d}=i,X_{e}=i}\right\}}}-\pi_{i}{\mathbf{1}_{\left\{{X_{d}=i}\right\}}}-\pi_{i}{\mathbf{1}_{\left\{{X_{e}=i}\right\}}}+\pi_{i}^{2}\right)|X_{1}=i\right]
=𝔼[ξaξb𝟏{Xd=i,Xe=i}|X1=i]πi𝔼[ξaξb𝟏{Xd=i}|X1=i]\displaystyle=\mathbb{E}\left[\xi_{a}\xi_{b}{\mathbf{1}_{\left\{{X_{d}=i,X_{e}=i}\right\}}}|X_{1}=i\right]-\pi_{i}\mathbb{E}\left[\xi_{a}\xi_{b}{\mathbf{1}_{\left\{{X_{d}=i}\right\}}}|X_{1}=i\right]
πi𝔼[ξaξb𝟏{Xe=i}|X1=i]+πi2𝔼[ξaξb|X1=i]\displaystyle\quad-\pi_{i}\mathbb{E}\left[\xi_{a}\xi_{b}{\mathbf{1}_{\left\{{X_{e}=i}\right\}}}|X_{1}=i\right]+\pi_{i}^{2}\mathbb{E}\left[\xi_{a}\xi_{b}|X_{1}=i\right]
=𝔼[ξaξb|Xd=i][Xd=i|Xe=i][Xe=i|X1=i]πi𝔼[ξaξb|Xd=i][Xd=i|X1=i]\displaystyle=\mathbb{E}\left[\xi_{a}\xi_{b}|X_{d}=i\right]\mathbb{P}\left[X_{d}=i|X_{e}=i\right]\mathbb{P}[X_{e}=i|X_{1}=i]-\pi_{i}\mathbb{E}\left[\xi_{a}\xi_{b}|X_{d}=i\right]\mathbb{P}[X_{d}=i|X_{1}=i]
πi𝔼[ξaξb|Xe=i][Xe=i|X1=i]+πi2𝔼[ξaξb|X1=i]\displaystyle\quad-\pi_{i}\mathbb{E}\left[\xi_{a}\xi_{b}|X_{e}=i\right]\mathbb{P}[X_{e}=i|X_{1}=i]+\pi_{i}^{2}\mathbb{E}\left[\xi_{a}\xi_{b}|X_{1}=i\right]
=𝔼[ξaξb|Xd=i]{Mde(i|i)Me1(i|i)πiMd1(i|i)}\displaystyle=\mathbb{E}\left[\xi_{a}\xi_{b}|X_{d}=i\right]\left\{M^{d-e}(i|i)M^{e-1}(i|i)-\pi_{i}M^{d-1}(i|i)\right\}
{πi𝔼[ξaξb|Xe=i]Me1(i|i)πi2𝔼[ξaξb|X1=i]}\displaystyle\quad-\left\{\pi_{i}\mathbb{E}\left[\xi_{a}\xi_{b}|X_{e}=i\right]M^{e-1}(i|i)-\pi_{i}^{2}\mathbb{E}\left[\xi_{a}\xi_{b}|X_{1}=i\right]\right\} (158)

Using the Markov property for any dbad\leq b\leq a, we get

|𝔼[ξaξb|Xd=i]πiv2uvi2λvab|\displaystyle\left|\mathbb{E}[\xi_{a}\xi_{b}|X_{d}=i]-\pi_{i}\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{a-b}\right|
=|𝔼[𝟏{Xa=i,Xb=i}πi𝟏{Xa=i}πi𝟏{Xb=i}+πi2|Xd=i]πiv2uvi2λvab|\displaystyle=\left|\mathbb{E}\left[{\mathbf{1}_{\left\{{X_{a}=i,X_{b}=i}\right\}}}-\pi_{i}{\mathbf{1}_{\left\{{X_{a}=i}\right\}}}-\pi_{i}{\mathbf{1}_{\left\{{X_{b}=i}\right\}}}+\pi_{i}^{2}|X_{d}=i\right]-\pi_{i}\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{a-b}\right|
=|Mab(i|i)Mbd(i|i)πiMad(i|i)πiMbd(i|i)+πi2πiv2uvi2λvab|\displaystyle=\left|M^{a-b}(i|i)M^{b-d}(i|i)-\pi_{i}M^{a-d}(i|i)-\pi_{i}M^{b-d}(i|i)+\pi_{i}^{2}-\pi_{i}\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{a-b}\right|
=|(πi+v2uvi2λvab)(πi+v2uvi2λvbd)πi(πi+v2uvi2λvad)\displaystyle=\left|\left(\pi_{i}+\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{a-b}\right)\left(\pi_{i}+\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{b-d}\right)\quad-\pi_{i}\left(\pi_{i}+\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{a-d}\right)\right.
πi(πi+v2uvi2λvbd)+πi2πiv2uvi2λvab|\displaystyle\left.\quad-\pi_{i}\left(\pi_{i}+\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{b-d}\right)+\pi_{i}^{2}-\pi_{i}\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{a-b}\right|
=|(v2uvi2λvab)(v2uvi2λvbd)πiv2uvi2λvad|\displaystyle=\left|\left(\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{a-b}\right)\left(\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{b-d}\right)-\pi_{i}{\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{a-d}}\right|
λad(v2uvi2)(v2uvi2)+λadπiv2uvi22λad.\displaystyle\leq\lambda_{*}^{a-d}\left(\sum_{v\geq 2}u_{vi}^{2}\right)\left(\sum_{v\geq 2}u_{vi}^{2}\right)+\lambda_{*}^{a-d}\pi_{i}{\sum_{v\geq 2}u_{vi}^{2}}\leq 2\lambda_{*}^{a-d}. (159)

We also get for ded\geq e

|Mde(i|i)Me1(i|i)πiMd1(i|i)|\displaystyle\left|M^{d-e}(i|i)M^{e-1}(i|i)-\pi_{i}M^{d-1}(i|i)\right|
=|(πi+v2uvi2λvde)(πi+v2uvi2λve1)πi(πi+v2uvi2λvd1)|\displaystyle=\left|\left(\pi_{i}+\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{d-e}\right)\left(\pi_{i}+\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{e-1}\right)-\pi_{i}\left(\pi_{i}+\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{d-1}\right)\right|
=|πiv2uvi2λve1+πiv2uvi2λvde+(v2uvi2λve1)(v2uvi2λvde)πiv2uvi2λvd1|\displaystyle=\left|\pi_{i}{\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{e-1}}+\pi_{i}{\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{d-e}}+\left(\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{e-1}\right)\left(\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{d-e}\right)-\pi_{i}{\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{d-1}}\right|
2λd1+πiλe1+πiλde.\displaystyle\leq 2\lambda_{*}^{d-1}+\pi_{i}\lambda_{*}^{e-1}+\pi_{i}\lambda_{*}^{d-e}. (160)

This implies

|𝔼[ξaξb|Xd=i]||Mde(i|i)Me1(i|i)πiMd1(i|i)|\displaystyle\left|\mathbb{E}\left[\xi_{a}\xi_{b}|X_{d}=i\right]\right|\left|M^{d-e}(i|i)M^{e-1}(i|i)-\pi_{i}M^{d-1}(i|i)\right|
(πiv2uvi2λvab+2λad)(2λd1+πiλe1+πiλde)\displaystyle\leq\left(\pi_{i}\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{a-b}+2\lambda_{*}^{a-d}\right)\left(2\lambda_{*}^{d-1}+\pi_{i}\lambda_{*}^{e-1}+\pi_{i}\lambda_{*}^{d-e}\right)
(πiλab+2λad)(2λd1+πiλe1+πiλde)\displaystyle\leq\left(\pi_{i}\lambda_{*}^{a-b}+2\lambda_{*}^{a-d}\right)\left(2\lambda_{*}^{d-1}+\pi_{i}\lambda_{*}^{e-1}+\pi_{i}\lambda_{*}^{d-e}\right)
4[πi2λab+de+πi2λab+e1+πi(λab+d1+λad+e1+λae)+λa1]\displaystyle\leq 4\left[\pi_{i}^{2}\lambda_{*}^{a-b+d-e}+\pi_{i}^{2}\lambda_{*}^{a-b+e-1}+\pi_{i}\left(\lambda_{*}^{a-b+d-1}+\lambda_{*}^{a-d+e-1}+\lambda_{*}^{a-e}\right)+\lambda_{*}^{a-1}\right] (161)

Using (159) along with Lemma 35 for any ebae\leq b\leq a we get

|πi𝔼[ξaξb|Xe=i]Me1(i|i)πi2𝔼[ξaξb|X1=i]|\displaystyle\left|\pi_{i}\mathbb{E}\left[\xi_{a}\xi_{b}|X_{e}=i\right]M^{e-1}(i|i)-\pi_{i}^{2}\mathbb{E}\left[\xi_{a}\xi_{b}|X_{1}=i\right]\right|
πi|𝔼[ξaξb|Xe=i]||Me1(i|i)πi|+πi2|𝔼[ξaξb|Xe=i]πiv2uvi2λvab|\displaystyle\leq\pi_{i}\left|\mathbb{E}\left[\xi_{a}\xi_{b}|X_{e}=i\right]\right|\left|M^{e-1}(i|i)-\pi_{i}\right|+\pi_{i}^{2}\left|\mathbb{E}\left[\xi_{a}\xi_{b}|X_{e}=i\right]-\pi_{i}\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{a-b}\right| (162)
+πi2|𝔼[ξaξb|X1=i]πiv2uvi2λvab|\displaystyle+\pi_{i}^{2}\left|\mathbb{E}\left[\xi_{a}\xi_{b}|X_{1}=i\right]-\pi_{i}\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{a-b}\right|
πi[πiv2uvi2λvab+2λae]2λe1+2πi2λae+2πi2λa1\displaystyle\leq\pi_{i}\left[\pi_{i}\sum_{v\geq 2}u_{vi}^{2}\lambda_{v}^{a-b}+2{\lambda_{*}^{a-e}}\right]2\lambda_{*}^{e-1}+2\pi_{i}^{2}\lambda_{*}^{a-e}+2\pi_{i}^{2}{\lambda_{*}^{a-1}}
2πi2λab+e1+4πi2λae+4πi2λa1.\displaystyle\leq 2\pi_{i}^{2}{\lambda_{*}^{a-b+e-1}}+4\pi_{i}^{2}{\lambda_{*}^{a-e}}+4\pi_{i}^{2}{\lambda_{*}^{a-1}}. (163)

This together with (161) and (158) implies

|𝔼[ξaξbξdξe|X1=i]|πi2(λab+de+λab+e1)+λa1+πi(λab+d1+λad+e1+λae)\displaystyle\begin{split}\left|\mathbb{E}\left[\xi_{a}\xi_{b}\xi_{d}\xi_{e}|X_{1}=i\right]\right|&\lesssim\pi_{i}^{2}\left(\lambda_{*}^{a-b+d-e}+\lambda_{*}^{a-b+e-1}\right)+\lambda_{*}^{a-1}\\ &\quad+\pi_{i}\left(\lambda_{*}^{a-b+d-1}+\lambda_{*}^{a-d+e-1}+\lambda_{*}^{a-e}\right)\end{split} (164)

To bound the sum over n-1\geq a\geq b\geq d\geq e\geq 1, we divide the analysis according to the number of distinct indices among a,b,d,e.

Case I: four distinct indices.

We sum (164) over all possible a>b>d>ea>b>d>e.

  • For the first term,

    πi2n1a>b>d>e1λab+denπi2\overγn1a>b3λabn2πi2\overγ2.\displaystyle\pi_{i}^{2}\mathop{\sum\sum\sum\sum}_{n-1\geq a>b>d>e\geq 1}\lambda_{*}^{a-b+d-e}\lesssim{n\pi_{i}^{2}\over\gamma_{*}}\mathop{\sum\sum}_{n-1\geq a>b\geq 3}\lambda_{*}^{a-b}\lesssim{n^{2}\pi_{i}^{2}\over\gamma_{*}^{2}}.
  • For the second term,

    πi2n1a>b>d>e1λab+e1nπi2\overγn1a>b3λabn2πi2\overγ2\displaystyle\pi_{i}^{2}\mathop{\sum\sum\sum\sum}_{n-1\geq a>b>d>e\geq 1}\lambda_{*}^{a-b+e-1}\lesssim{n\pi_{i}^{2}\over\gamma_{*}}\mathop{\sum\sum}_{n-1\geq a>b\geq 3}\lambda_{*}^{a-b}\lesssim{n^{2}\pi_{i}^{2}\over\gamma_{*}^{2}}
  • For the third term,

    n1a>b>d>e1λa1n1a4a3λa11\overγ4.\displaystyle\mathop{\sum\sum\sum\sum}_{n-1\geq a>b>d>e\geq 1}\lambda_{*}^{a-1}\lesssim\sum_{n-1\geq a\geq 4}a^{3}\lambda_{*}^{a-1}\lesssim{1\over\gamma_{*}^{4}}.
  • For the fourth term,

    πin1a>b>d>e1λab+d1πi\overγ2n1a>b3λabnπiγ3\displaystyle\pi_{i}\mathop{\sum\sum\sum\sum}_{n-1\geq a>b>d>e\geq 1}\lambda_{*}^{a-b+d-1}\leq{\pi_{i}\over\gamma_{*}^{2}}\mathop{\sum\sum}_{n-1\geq a>b\geq 3}\lambda_{*}^{a-b}\lesssim\frac{n\pi_{i}}{\gamma_{*}^{3}}
  • For the fifth term,

    πin1a>b>d>e1λad+e1πiγ(n1a>b3λab)(d2b1λbd)nπi\overγ3.\displaystyle\pi_{i}\mathop{\sum\sum\sum\sum}_{n-1\geq a>b>d>e\geq 1}\lambda_{*}^{a-d+e-1}\lesssim\frac{\pi_{i}}{\gamma_{*}}\left(\mathop{\sum\sum}_{n-1\geq a>b\geq 3}\lambda_{*}^{a-b}\right)\left(\sum_{d\geq 2}^{b-1}\lambda_{*}^{b-d}\right)\lesssim{n\pi_{i}\over\gamma_{*}^{3}}.
  • For the sixth term,

    πin1a>b>d>e1λaeπi(n1a>b3λab)(d2b1λbd)(e1d1λde)nπi\overγ3.\displaystyle\pi_{i}\mathop{\sum\sum\sum\sum}_{n-1\geq a>b>d>e\geq 1}\lambda_{*}^{a-e}\lesssim\pi_{i}\left(\mathop{\sum\sum}_{n-1\geq a>b\geq 3}\lambda_{*}^{a-b}\right)\left(\sum_{d\geq 2}^{b-1}\lambda_{*}^{b-d}\right)\left(\sum_{e\geq 1}^{d-1}\lambda_{*}^{d-e}\right)\lesssim{n\pi_{i}\over\gamma_{*}^{3}}.

    Combining the above bounds and using the fact that aba2+b2ab\leq a^{2}+b^{2}, we obtain

    n1a>b>d>e1|𝔼[ξaξbξdξe|X1=i]|n2πi2\overγ2+nπi\overγ3+1\overγ4n2πi2\overγ2+1\overγ4.\displaystyle\mathop{\sum\sum\sum\sum}_{n-1\geq a>b>d>e\geq 1}\left|\mathbb{E}\left[\xi_{a}\xi_{b}\xi_{d}\xi_{e}|X_{1}=i\right]\right|\lesssim{{n^{2}\pi_{i}^{2}\over\gamma_{*}^{2}}+{n\pi_{i}\over\gamma_{*}^{3}}+{1\over\gamma_{*}^{4}}}\lesssim{{n^{2}\pi_{i}^{2}\over\gamma_{*}^{2}}+{1\over\gamma_{*}^{4}}}. (165)
Case II: three distinct indices.

There are three cases, namely, ξaξb2ξe\xi_{a}\xi_{b}^{2}\xi_{e}, ξaξbξe2\xi_{a}\xi_{b}\xi_{e}^{2}, and ξa2ξbξe\xi_{a}^{2}\xi_{b}\xi_{e}.

  1. 1.

    Bounding n1a>b>e1|𝔼[ξaξb2ξe|X1=i]|\mathop{\sum\sum\sum}_{n-1\geq a>b>e\geq 1}\left|\mathbb{E}\left[\xi_{a}\xi_{b}^{2}\xi_{e}|X_{1}=i\right]\right|: We specialize (164) with b=db=d to get

    |𝔼[ξaξb2ξe|X1=i]|\displaystyle\left|\mathbb{E}\left[\xi_{a}\xi_{b}^{2}\xi_{e}|X_{1}=i\right]\right| πi(λab+e1+λae)+λa1.\displaystyle\lesssim\pi_{i}\left(\lambda_{*}^{a-b+e-1}+\lambda_{*}^{a-e}\right)+\lambda_{*}^{a-1}.

    Summing over a,b,ea,b,e we have

    n1a>b>e1|𝔼[ξaξb2ξe|X1=i]|\displaystyle\mathop{\sum\sum\sum}_{n-1\geq a>b>e\geq 1}\left|\mathbb{E}\left[\xi_{a}\xi_{b}^{2}\xi_{e}|X_{1}=i\right]\right|
    n1a>b>e1{πi(λab+e1+λae)+λa1}\displaystyle\lesssim\mathop{\sum\sum\sum}_{n-1\geq a>b>e\geq 1}\left\{\pi_{i}\left(\lambda_{*}^{a-b+e-1}+\lambda_{*}^{a-e}\right)+\lambda_{*}^{a-1}\right\}
    πi\overγn1a>b2λab+πi(n1a>b2λab)(e1b1λbe)+n1a3a3λa1\displaystyle\lesssim{\pi_{i}\over\gamma_{*}}\mathop{\sum\sum}_{n-1\geq a>b\geq 2}\lambda_{*}^{a-b}+\pi_{i}\left(\mathop{\sum\sum}_{n-1\geq a>b\geq 2}\lambda_{*}^{a-b}\right)\left(\sum_{e\geq 1}^{b-1}\lambda_{*}^{b-e}\right)+\sum_{n-1\geq a\geq 3}a^{3}\lambda_{*}^{a-1}
    nπi\overγ2+1\overγ3n2πi2\overγ2+1γ3\displaystyle\lesssim{n\pi_{i}\over\gamma_{*}^{2}}+{1\over\gamma_{*}^{3}}\lesssim{n^{2}\pi_{i}^{2}\over\gamma_{*}^{2}}+\frac{1}{\gamma_{*}^{3}} (166)

    with the last inequality following from xy\leq x^{2}+y^{2}.

  2. 2.

    Bounding n1a>b>e1|𝔼[ξaξbξe2|X1=i]|\mathop{\sum\sum\sum}_{n-1\geq a>b>e\geq 1}\left|\mathbb{E}\left[\xi_{a}\xi_{b}\xi_{e}^{2}|X_{1}=i\right]\right|: We specialize (164) with e=de=d to get

    |𝔼[ξaξbξe2|X1=i]|\displaystyle\left|\mathbb{E}\left[\xi_{a}\xi_{b}\xi_{e}^{2}|X_{1}=i\right]\right| πi2λab+πi(λab+e1+λae)+λa1.\displaystyle\lesssim\pi_{i}^{2}\lambda_{*}^{a-b}+\pi_{i}\left(\lambda_{*}^{a-b+e-1}+\lambda_{*}^{a-e}\right)+\lambda_{*}^{a-1}.

    Summing over a,b,ea,b,e and applying (166), we get

    n1a>b>e1|𝔼[ξaξbξe2|X1=i]|\displaystyle\mathop{\sum\sum\sum}_{n-1\geq a>b>e\geq 1}\left|\mathbb{E}\left[\xi_{a}\xi_{b}\xi_{e}^{2}|X_{1}=i\right]\right|
    n1a>b>e1{πi2λab+πi(λab+e1+λae)+λa1}\displaystyle\lesssim\mathop{\sum\sum\sum}_{n-1\geq a>b>e\geq 1}\left\{\pi_{i}^{2}\lambda_{*}^{a-b}+\pi_{i}\left(\lambda_{*}^{a-b+e-1}+\lambda_{*}^{a-e}\right)+\lambda_{*}^{a-1}\right\}
    nπi2n1a>b2λab+nπi\overγ2+1\overγ3n2πi2\overγ+nπi\overγ2+1\overγ3n2πi2\overγ2+1\overγ3.\displaystyle\lesssim{n\pi_{i}^{2}}\mathop{\sum\sum}_{n-1\geq a>b\geq 2}\lambda_{*}^{a-b}+{n\pi_{i}\over\gamma_{*}^{2}}+{1\over\gamma_{*}^{3}}\lesssim{n^{2}\pi_{i}^{2}\over\gamma_{*}}+{n\pi_{i}\over\gamma_{*}^{2}}+{1\over\gamma_{*}^{3}}\lesssim{n^{2}\pi_{i}^{2}\over\gamma_{*}^{2}}+{1\over\gamma_{*}^{3}}. (167)
  3. 3.

    Bounding n1a>b>e1|𝔼[ξa2ξbξe|X1=i]|\mathop{\sum\sum\sum}_{n-1\geq a>b>e\geq 1}\left|\mathbb{E}\left[\xi_{a}^{2}\xi_{b}\xi_{e}|X_{1}=i\right]\right|: Specializing (164) with a=ba=b we get

    |𝔼[ξb2ξdξe|X1=i]|πi2(λde+λe1)+λb1+πi(λd1+λbd+e1+λbe),\displaystyle\left|\mathbb{E}\left[\xi_{b}^{2}\xi_{d}\xi_{e}|X_{1}=i\right]\right|\lesssim\pi_{i}^{2}\left(\lambda_{*}^{d-e}+\lambda_{*}^{e-1}\right)+\lambda_{*}^{b-1}+\pi_{i}\left(\lambda_{*}^{d-1}+\lambda_{*}^{b-d+e-1}+\lambda_{*}^{b-e}\right),

    which is equivalent to

    |𝔼[ξa2ξbξe|X1=i]|πi2(λbe+λe1)+λa1+πi(λb1+λab+e1+λae).\displaystyle\left|\mathbb{E}\left[\xi_{a}^{2}\xi_{b}\xi_{e}|X_{1}=i\right]\right|\lesssim\pi_{i}^{2}\left(\lambda_{*}^{b-e}+\lambda_{*}^{e-1}\right)+\lambda_{*}^{a-1}+\pi_{i}\left(\lambda_{*}^{b-1}+\lambda_{*}^{a-b+e-1}+\lambda_{*}^{a-e}\right).

    For the first, second and fourth terms

    n1a>b>e1{πi2(λbe+λe1)+πiλb1}πi2\overγn1a>b21+nπi\overγ2n2πi2\overγ+nπi\overγ2,\displaystyle\mathop{\sum\sum\sum}_{n-1\geq a>b>e\geq 1}\left\{\pi_{i}^{2}\left(\lambda_{*}^{b-e}+\lambda_{*}^{e-1}\right)+\pi_{i}\lambda_{*}^{b-1}\right\}\lesssim{\pi_{i}^{2}\over\gamma_{*}}\mathop{\sum\sum}_{n-1\geq a>b\geq 2}1+{n\pi_{i}\over\gamma_{*}^{2}}{\lesssim}{n^{2}\pi_{i}^{2}\over\gamma_{*}}+{n\pi_{i}\over\gamma_{*}^{2}},

    and for summing the remaining terms we use (166), which implies

    n1a>b>e1|𝔼[ξa2ξbξe|X1=i]|n2πi2\overγ+nπi\overγ2+1\overγ3n2πi2\overγ2+1\overγ3.\displaystyle\mathop{\sum\sum\sum}_{n-1\geq a>b>e\geq 1}\left|\mathbb{E}\left[\xi_{a}^{2}\xi_{b}\xi_{e}|X_{1}=i\right]\right|\lesssim{{n^{2}\pi_{i}^{2}\over\gamma_{*}}+{n\pi_{i}\over\gamma_{*}^{2}}+{1\over\gamma_{*}^{3}}}\lesssim{{n^{2}\pi_{i}^{2}\over\gamma_{*}^{2}}+{1\over\gamma_{*}^{3}}}. (168)
Case III: two distinct indices.

There are three cases, namely, \xi_{a}^{2}\xi_{e}^{2}, \xi_{a}\xi_{e}^{3} and \xi_{a}^{3}\xi_{e}.

  1. 1.

    Bounding n1a>e1𝔼[ξa2ξe2|X1=i]\mathop{\sum\sum}_{n-1\geq a>e\geq 1}\mathbb{E}\left[\xi_{a}^{2}\xi_{e}^{2}|X_{1}=i\right]: Specializing (164) for a=ba=b and e=de=d we get

    𝔼[ξa2ξe2|X1=i]πi2+πi(λe1+λae)+λa1.\displaystyle{\mathbb{E}\left[\xi_{a}^{2}\xi_{e}^{2}|X_{1}=i\right]}\lesssim\pi_{i}^{2}+\pi_{i}\left(\lambda_{*}^{e-1}+\lambda_{*}^{a-e}\right)+\lambda_{*}^{a-1}.

    Summing up over a,ea,e we have

    n1a>e1𝔼[ξa2ξe2|X1=i]n1a>e1{πi2+πi(λe1+λae)+λa1}n2πi2+nπi\overγ+1\overγ2.\displaystyle\mathop{\sum\sum}_{n-1\geq a>e\geq 1}{\mathbb{E}\left[\xi_{a}^{2}\xi_{e}^{2}|X_{1}=i\right]}\lesssim\mathop{\sum\sum}_{n-1\geq a>e\geq 1}\left\{\pi_{i}^{2}+\pi_{i}\left(\lambda_{*}^{e-1}+\lambda_{*}^{a-e}\right)+\lambda_{*}^{a-1}\right\}\lesssim n^{2}\pi_{i}^{2}+{n\pi_{i}\over\gamma_{*}}+{1\over\gamma_{*}^{2}}. (169)
  2. 2.

    Bounding n1a>e1|𝔼[ξaξe3|X1=i]|\mathop{\sum\sum}_{n-1\geq a>e\geq 1}\left|\mathbb{E}\left[\xi_{a}\xi_{e}^{3}|X_{1}=i\right]\right|: Specializing (164) for e=b=de=b=d we get

    |𝔼[ξaξe3|X1=i]|πiλae+λa1\displaystyle\left|\mathbb{E}\left[\xi_{a}\xi_{e}^{3}|X_{1}=i\right]\right|\lesssim\pi_{i}\lambda_{*}^{a-e}+\lambda_{*}^{a-1}

    which sums up to

    n1a>e1|𝔼[ξaξe3|X1=i]|πin1a>e1λae+n1a>e1λa1nπi\overγ+1\overγ2.\displaystyle\mathop{\sum\sum}_{n-1\geq a>e\geq 1}\left|\mathbb{E}\left[\xi_{a}\xi_{e}^{3}|X_{1}=i\right]\right|\lesssim\pi_{i}\mathop{\sum\sum}_{n-1\geq a>e\geq 1}\lambda_{*}^{a-e}+\mathop{\sum\sum}_{n-1\geq a>e\geq 1}\lambda_{*}^{a-1}\lesssim{n\pi_{i}\over\gamma_{*}}+{1\over\gamma_{*}^{2}}. (170)
  3. 3.

    Bounding n1a>e1|𝔼[ξa3ξe|X1=i]|\mathop{\sum\sum}_{n-1\geq a>e\geq 1}\left|\mathbb{E}\left[\xi_{a}^{3}\xi_{e}|X_{1}=i\right]\right|: Specializing (164) for a=b=da=b=d we get

    |𝔼[ξa3ξe|X1=i]|\displaystyle\left|\mathbb{E}\left[\xi_{a}^{3}\xi_{e}|X_{1}=i\right]\right| πi(λae+λe1)+λa1\displaystyle\lesssim\pi_{i}\left(\lambda_{*}^{a-e}+\lambda_{*}^{e-1}\right)+\lambda_{*}^{a-1}

    which sums up to

    n1a>e1|𝔼[ξa3ξe|X1=i]|n1a>e1{πi(λae+λe1)+λa1}nπi\overγ+1\overγ2.\displaystyle\mathop{\sum\sum}_{n-1\geq a>e\geq 1}\left|\mathbb{E}\left[\xi_{a}^{3}\xi_{e}|X_{1}=i\right]\right|\lesssim\mathop{\sum\sum}_{n-1\geq a>e\geq 1}\left\{\pi_{i}\left(\lambda_{*}^{a-e}+\lambda_{*}^{e-1}\right)+\lambda_{*}^{a-1}\right\}\lesssim{n\pi_{i}\over\gamma_{*}}+{1\over\gamma_{*}^{2}}. (171)
Case IV: single distinct index.

We specialize (164) to a=b=d=ea=b=d=e to get

𝔼[ξa4|X1=i]\displaystyle\mathbb{E}\left[\xi_{a}^{4}|X_{1}=i\right] πi+λa1.\displaystyle\lesssim\pi_{i}+\lambda_{*}^{a-1}.

Summing the above over aa

a=1n1𝔼[ξa4|X1=i]nπi+1\overγ.\displaystyle\sum_{a=1}^{n-1}\mathbb{E}\left[\xi_{a}^{4}|X_{1}=i\right]\lesssim{n\pi_{i}}+{1\over\gamma_{*}}. (172)

Combining (165)–(172) and using nπi\overγn2πi2\overγ2+1\overγ4{n\pi_{i}\over\gamma_{*}}\lesssim{n^{2}\pi_{i}^{2}\over\gamma_{*}^{2}}+{1\over\gamma_{*}^{4}}, we get

𝔼[(Ni(n1)πi)4|X1=i]n2πi2\overγ2+1\overγ4.\displaystyle\mathbb{E}\left[\left(N_{i}-(n-1)\pi_{i}\right)^{4}|X_{1}=i\right]\lesssim{n^{2}\pi_{i}^{2}\over\gamma_{*}^{2}}+{1\over\gamma_{*}^{4}}.

Acknowledgment

The authors are grateful to Alon Orlitsky for helpful and encouraging comments and to Dheeraj Pichapati for providing the full version of [FOPS16]. The authors also thank David Pollard for insightful discussions on Markov chains at the initial stages of the project.

References

  • [AG57] Theodore W Anderson and Leo A Goodman. Statistical inference about Markov chains. The Annals of Mathematical Statistics, pages 89–110, 1957.
  • [Agr20] Rohit Agrawal. Finite-sample concentration of the multinomial in relative entropy. IEEE Transactions on Information Theory, 66(10):6297–6302, 2020.
  • [Ahl21] T.D. Ahle. Sharp and simple bounds for the raw moments of the binomial and Poisson distributions. arXiv:2103.17027, 2021.
  • [Att99] K. Atteson. The asymptotic redundancy of Bayes rules for Markov chains. IEEE Transactions on Information Theory, 45(6):2104–2109, 1999.
  • [Bar51] Maurice S Bartlett. The frequency goodness of fit test for probability chains. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 47, pages 86–95. Cambridge University Press, 1951.
  • [BFSS02] Dietrich Braess, Jürgen Forster, Tomas Sauer, and Hans U Simon. How to achieve minimax expected Kullback-Leibler distance from an unknown finite distribution. In Algorithmic Learning Theory, pages 380–394. Springer, 2002.
  • [BHOP18] Anna Ben-Hamou, Roberto I Oliveira, and Yuval Peres. Estimating graph parameters via random walks with restarts. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1702–1714. SIAM, 2018.
  • [Bil61] P. Billingsley. Statistical methods in Markov chains. The Annals of Mathematical Statistics, pages 12–40, 1961.
  • [CB19] Yeshwanth Cherapanamjeri and Peter L Bartlett. Testing symmetric Markov chains without hitting. In Conference on Learning Theory, pages 758–785. PMLR, 2019.
  • [CK82] Imre Csiszár and János Körner. Information Theory: Coding Theorems for Discrete Memoryless Systems. Academic Press, Inc., 1982.
  • [CS00] Imre Csiszár and Paul C Shields. The consistency of the BIC Markov order estimator. The Annals of Statistics, 28(6):1601–1619, 2000.
  • [CS04] I Csiszár and PC Shields. Information theory and statistics: a tutorial. Foundations and Trends in Communications and Information Theory, 1(4):417–527, 2004.
  • [CT06] Thomas M. Cover and Joy A. Thomas. Elements of information theory, 2nd Ed. Wiley-Interscience, New York, NY, USA, 2006.
  • [Dav73] L. Davisson. Universal noiseless coding. IEEE Transactions on Information Theory, 19(6):783–795, 1973.
  • [Dav83] L. Davisson. Minimax noiseless universal coding for Markov sources. IEEE Transactions on Information Theory, 29(2):211–215, 1983.
  • [DDG18] Constantinos Daskalakis, Nishanth Dikkala, and Nick Gravin. Testing symmetric Markov chains from a single trajectory. In Conference On Learning Theory, pages 385–409. PMLR, 2018.
  • [DMM+19] Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. On the sample complexity of the linear quadratic regulator. Foundations of Computational Mathematics, pages 1–47, 2019.
  • [DMPW81] L Davisson, R McEliece, M Pursley, and Mark Wallace. Efficient universal noiseless source codes. IEEE Transactions on Information Theory, 27(3):269–279, 1981.
  • [Fan53] Ky Fan. Minimax theorems. Proceedings of the National Academy of Sciences, 39(1):42–47, 1953.
  • [FOPS16] M. Falahatgar, A. Orlitsky, V. Pichapati, and A.T. Suresh. Learning Markov distributions: Does estimation trump compression? In 2016 IEEE International Symposium on Information Theory (ISIT), pages 2689–2693. IEEE, July 2016.
  • [FW21] Sela Fried and Geoffrey Wolfer. Identity testing of reversible Markov chains. arXiv preprint arXiv:2105.06347, 2021.
  • [GR20] F Richard Guo and Thomas S Richardson. Chernoff-type concentration of empirical probabilities in relative entropy. IEEE Transactions on Information Theory, 67(1):549–558, 2020.
  • [HJL+18] Y. Han, J. Jiao, C.Z. Lee, T. Weissman, Y. Wu, and T. Yu. Entropy rate estimation for Markov chains with large state space. In Advances in Neural Information Processing Systems, pages 9781–9792, 2018. arXiv:1802.07889.
  • [HKL+19] Daniel Hsu, Aryeh Kontorovich, David A Levin, Yuval Peres, Csaba Szepesvári, and Geoffrey Wolfer. Mixing time estimation in reversible Markov chains from a single sample path. Annals of Applied Probability, 29(4):2439–2480, 2019.
  • [Hof67] A.J. Hoffman. Three observations on nonnegative matrices. Journal of Research of the National Bureau of Standards B, pages 39–41, 1967.
  • [HOP18] Yi Hao, A. Orlitsky, and V. Pichapati. On learning Markov chains. In Advances in Neural Information Processing Systems, pages 648–657, 2018.
  • [Jan02] S. Janson. On concentration of probability. Contemporary combinatorics, 10(3):1–9, 2002.
  • [JS02] Philippe Jacquet and Wojciech Szpankowski. A combinatorial problem arising in information theory: Precise minimax redundancy for Markov sources. In Mathematics and Computer Science II, pages 311–328. Springer, 2002.
  • [Kem74] JHB Kemperman. On the Shannon capacity of an arbitrary channel. In Indagationes Mathematicae (Proceedings), volume 77, pages 101–115. North-Holland, 1974.
  • [KOPS15] S. Kamath, A. Orlitsky, D. Pichapati, and A.T. Suresh. On learning distributions from their samples. In Conference on Learning Theory, pages 1066–1100, June 2015.
  • [KV16] Sudeep Kamath and Sergio Verdú. Estimation of entropy rate and Rényi entropy rate for Markov chains. In 2016 IEEE International Symposium on Information Theory (ISIT), pages 685–689. IEEE, 2016.
  • [Lat97] Rafał Latała. Estimation of moments of sums of independent real random variables. The Annals of Probability, 25(3):1502–1513, 1997.
  • [LB04] Feng Liang and Andrew Barron. Exact minimax strategies for predictive density estimation, data compression, and model selection. IEEE Transactions on Information Theory, 50(11):2708–2726, 2004.
  • [Lez98] P. Lezaud. Chernoff-type bound for finite Markov chains. Annals of Applied Probability, 8(3):849–867, 1998.
  • [LP17] D.A. Levin and Y. Peres. Markov chains and mixing times, volume 107. American Mathematical Soc., 2017.
  • [MJT+20] Jay Mardia, Jiantao Jiao, Ervin Tánczos, Robert D Nowak, and Tsachy Weissman. Concentration inequalities for the empirical distribution of discrete distributions: beyond the method of types. Information and Inference: A Journal of the IMA, 9(4):813–850, 2020.
  • [OS20] Maciej Obremski and Maciej Skorski. Complexity of estimating Rényi entropy of Markov chains. In 2020 IEEE International Symposium on Information Theory (ISIT), pages 2264–2269, 2020.
  • [Pan04] Liam Paninski. Variational minimax estimation of discrete distributions under KL loss. Advances in Neural Information Processing Systems, 17:1033–1040, 2004.
  • [Par62] Emanuel Parzen. Stochastic processes. Holden Day, 1962.
  • [Pau15] D. Paulin. Concentration inequalities for Markov chains by Marton couplings and spectral methods. Electronic Journal of Probability, pages 1–20, 2015.
  • [Ris84] Jorma Rissanen. Universal coding, information, prediction, and estimation. IEEE Transactions on Information Theory, 30(4):629–636, 1984.
  • [Rya88] B.Y. Ryabko. Prediction of random sequences and universal coding. Prob. Pered. Inf., 24(2):87–96, 1988.
  • [Sht87] Yuri M Shtarkov. Universal sequential coding of single messages. Prob. Pered. Inf., 23:175–186, 1987.
  • [Sin64] Richard Sinkhorn. A relationship between arbitrary positive matrices and doubly stochastic matrices. The Annals of Mathematical Statistics, 35(2):876–879, 1964.
  • [SMT+18] Max Simchowitz, Horia Mania, Stephen Tu, Michael I Jordan, and Benjamin Recht. Learning without mixing: Towards a sharp analysis of linear system identification. In Conference On Learning Theory, pages 439–473. PMLR, 2018.
  • [SW12] Wojciech Szpankowski and Marcelo J Weinberger. Minimax pointwise redundancy for memoryless models over large alphabets. IEEE Transactions on Information Theory, 58(7):4094–4104, 2012.
  • [TJW18] Kedar Tatwawadi, Jiantao Jiao, and Tsachy Weissman. Minimax redundancy for Markov chains with large state space. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 216–220. IEEE, 2018.
  • [Tro74] Viktor Kupriyanovich Trofimov. Redundancy of universal coding of arbitrary Markov sources. Prob. Pered. Inf., 10(4):16–24, 1974.
  • [Wai19] M.J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.
  • [Whi55] P. Whittle. Some distribution and moment formulae for the Markov chain. Journal of the Royal Statistical Society: Series B (Methodological), 17(2):235–242, 1955.
  • [WK19] Geoffrey Wolfer and Aryeh Kontorovich. Minimax learning of ergodic Markov chains. In Algorithmic Learning Theory, pages 904–930. PMLR, 2019.
  • [WK20] Geoffrey Wolfer and Aryeh Kontorovich. Minimax testing of identity to a reference ergodic Markov chain. In International Conference on Artificial Intelligence and Statistics, pages 191–201. PMLR, 2020.
  • [XB97] Qun Xie and Andrew R Barron. Minimax redundancy for the class of memoryless sources. IEEE Transactions on Information Theory, 43(2):646–657, 1997.
  • [YB99] Y. Yang and A. R. Barron. Information-theoretic determination of minimax rates of convergence. The Annals of Statistics, 27(5):1564–1599, 1999.