
On the $\alpha$-lazy version of Markov chains in estimation and testing problems

Sela Fried
[email protected]
A postdoctoral fellow in the Department of Computer Science at the Ben-Gurion University of the Negev. Research supported by ISF grant No. 1456/18.
   Geoffrey Wolfer
[email protected]
A JSPS International Research Fellow (Department of Computer and Information Sciences, Tokyo University of Agriculture and Technology).
Abstract

Given access to a single long trajectory generated by an unknown irreducible Markov chain $M$, we simulate an $\alpha$-lazy version of $M$ which is ergodic. This enables us to generalize recent results on estimation and identity testing that were stated for ergodic Markov chains in a way that allows fully empirical inference. In particular, our approach shows that the pseudo spectral gap introduced by Paulin (2015) and defined for ergodic Markov chains may be given a meaning already in the case of irreducible but possibly periodic Markov chains.

1 Introduction

Estimating the parameters of a discrete distribution and distinguishing whether an unknown distribution is identical to a reference one or is $\varepsilon$-far from it under some notion of distance, assuming access to iid samples, are two classical and well studied problems in statistics and computer science (e.g. Han et al. (2015), Kamath et al. (2015), Orlitsky and Suresh (2015), Batu et al. (2001), Valiant and Valiant (2017) and the references therein). In contrast, research on the analogous problems, in which the samples are obtained from a single long trajectory generated by an unknown Markov chain, is rather young. This work is concerned with the latter and adds to the still rather short list of works on the subject: Wolfer and Kontorovich (2019, 2020, 2021), Cherapanamjeri and Bartlett (2019), Daskalakis et al. (2018), Chan et al. (2021), Fried and Wolfer (2021) and Hao et al. (2018). We review the results of these works in the Related literature section.

Before we rigorously present our main results, let us somewhat informally sketch their motivation: In Wolfer and Kontorovich (2021, Theorem 3.1) an estimator for the transition matrix $M$ of an unknown ergodic Markov chain was constructed, based on a single long trajectory generated by the Markov chain. In addition, they upper bounded the sample complexity of obtaining an estimation whose distance from $M$ under the infinity norm is less than a prescribed $\varepsilon\in(0,1)$ with high probability. The assumption that the unknown Markov chain is ergodic is crucial in their sample complexity analysis since it relies on concentration bounds (cf. Paulin (2015)) which assume ergodicity. We have noticed that it is possible to generalize their results to irreducible but possibly periodic Markov chains in the following way: Let $M$ be an irreducible Markov chain on $d$ states from which we can sample a single long trajectory of arbitrary length. Recall that if $\alpha\in(0,1)$ and $I$ denotes the identity matrix of size $d$, then $\mathcal{L}_{\alpha}(M)=\alpha I+(1-\alpha)M$ is called the $\alpha$-lazy version of $M$. It is well known that the $\alpha$-lazy version of an irreducible Markov chain is ergodic. Crucially, it is possible to simulate $\mathcal{L}_{\alpha}(M)$ even when one only has access (as is the case here) to a black box that generates a single long trajectory from $M$ (cf. (6) of Section 2.2). This simulation results in a trajectory sampled from the ergodic Markov chain $\mathcal{L}_{\alpha}(M)$, on which we may apply the estimator mentioned above. Finally, we pull back the estimation of $\mathcal{L}_{\alpha}(M)$ to an estimation of $M$.

Several questions immediately arise from this procedure:

  (a)

    Can it be done with other estimation problems or even other statistical inference problems, e.g. identity testing?

  (b)

    The $\alpha$-lazy version of a Markov chain is a special case of a matrix transformation. Can the procedure be done with other transformations?

  (c)

    An upper bound for the sample complexity of the original (ergodic) estimation problem involves (the inverse of) the pseudo spectral gap $\gamma_{\text{ps}}(M)$ of an ergodic Markov chain $M$. It follows that the sample complexity in the irreducible case is upper bounded by an expression involving (the inverse of) $\gamma_{\text{ps}}(\mathcal{L}_{\alpha}(M))$ for $M$ an irreducible Markov chain. Since $\gamma_{\text{ps}}(M)=0$ for an irreducible but periodic Markov chain $M$, no meaningful comparison between $\gamma_{\text{ps}}(M)$ and $\gamma_{\text{ps}}(\mathcal{L}_{\alpha}(M))$ is possible. Instead, we ask the following question: Suppose $M$ is ergodic. How are $\gamma_{\text{ps}}(M)$ and $\gamma_{\text{ps}}(\mathcal{L}_{\alpha}(M))$ related? An answer to this question quantifies the “price” one has to pay for not being sure whether an unknown irreducible Markov chain is periodic or not.

Our first two main results address the questions in (a) and (b) that are related to Markov chains estimation problems, by which we mean a tuple $(\Theta,\rho,\mathcal{M}^{\circ})$ where $\Theta$ is a set and $\rho\colon\Theta\times\Theta\to\mathbb{R}_{+}$ is a distance function, i.e., for $x,y\in\Theta$ it holds that $\rho(x,y)=0$ if and only if $x=y$ (notice that we neither assume that $\rho$ is symmetric nor that it satisfies the triangle inequality; thus, in addition to metrics such as the total variation or the Hellinger distance, our machinery is applicable to, for example, the Kullback–Leibler divergence). The set $\mathcal{M}^{\circ}$ is a subset of the set of all transition matrices of irreducible Markov chains on $d$ states, which we denote by $\mathcal{M}_{d}^{\textnormal{irr}}$. The set $\Theta$ consists of the values that the parameter to be estimated can take and is accompanied by a map $\theta\colon\mathcal{M}_{d}^{\textnormal{irr}}\to\Theta$. The distance function $\rho$ is used to measure the distance between the true value of the parameter and its estimation. Following the works mentioned above, in the setting we consider we have access to a single long trajectory generated from an unknown Markov chain $M$, started from an arbitrary state, and the purpose is to construct an estimator for some parameter of interest $\theta(M)$ such that the distance between the estimation and the true value of the parameter is less than a prescribed $\varepsilon\in(0,1)$. Along with the estimator we would like to know in advance how long the trajectory must be for the estimation to succeed. The minimal necessary trajectory length is referred to as the sample complexity. Since the estimation relies on a random trajectory, there is generally some probability that the trajectory obtained is not informative. Thus, in addition to $\varepsilon$, the sample complexity depends also on $\delta\in(0,1)$ such that the estimator is guaranteed to succeed with probability at least $1-\delta$.

The following theorem provides a sufficient condition for the feasibility of the procedure we described above. It uses the notions of a solution to an estimation problem and of extendibility of solutions, which will be defined in Section 3.1:

Theorem 1.1.

Consider a Markov chains estimation problem $(\Theta,\rho,\mathcal{M}^{\circ})$. Let $\varphi\colon\mathcal{M}_{d}^{\textnormal{irr}}\to\mathcal{M}_{d}^{\textnormal{irr}}$, $g\colon\Theta\to\Theta$ and $\ell\colon(0,1)\to(0,1)$ be such that for every $x\in\Theta$, $M\in\varphi^{-1}(\mathcal{M}^{\circ})$ and $\varepsilon\in(0,1)$ it holds that

$$\rho(x,\theta(\varphi(M)))<\ell(\varepsilon)\Longrightarrow\rho(g(x),\theta(M))<\varepsilon. \qquad (1)$$

Then every solution to $(\Theta,\rho,\mathcal{M}^{\circ})$ extends to $\varphi^{-1}(\mathcal{M}^{\circ})$.

Let us demonstrate a typical usage of Theorem 1.1 by applying it to the problem of estimating the transition matrix mentioned above: We have

$$\Theta=\mathcal{M}_{d}^{\textnormal{irr}},\quad\theta(M)=M\ \ \forall M\in\mathcal{M}_{d}^{\textnormal{irr}},\quad\rho(M,\bar{M})=\|M-\bar{M}\|_{\infty}\ \ \forall M,\bar{M}\in\Theta.$$

For $\mathcal{M}^{\circ}$ we take $\mathcal{M}_{\pi_{\star},\gamma_{\text{ps}}}$, which is defined as the set of all ergodic Markov chains on $d$ states whose minimum stationary probability and pseudo spectral gap are lower bounded by $\pi_{\star},\gamma_{\text{ps}}\in(0,1)$, respectively. The set $\mathcal{M}_{\pi_{\star},\gamma_{\text{ps}}}$ is the default choice in this work since the sample complexity of the problems that we consider was shown to be upper bounded by expressions that involve $\frac{1}{\gamma_{\text{ps}}}$ and $\frac{1}{\pi_{\star}}$. Taking $\varphi=\mathcal{L}_{\alpha}$ and $\ell(\varepsilon)=(1-\alpha)\varepsilon$ for every $\varepsilon\in(0,1)$, a natural candidate for $g$ is $\mathcal{L}_{\alpha}^{-1}$, but it may well result in a matrix that is not irreducible. Even worse, we might end up with negative entries on the main diagonal, i.e., not a row-stochastic matrix. In order to guarantee an estimation which is an irreducible Markov chain, $\mathcal{L}_{\alpha}^{-1}$ is followed by a projection on the set of all row-stochastic matrices of size $d$, which we denote by $\mathcal{M}_{d}$, which, in turn, is followed by an additional adjustment to guarantee irreducibility. We discuss this example in detail in Example 3.2(b).

Our second main result shows that there are cases in which extending solutions is hard:

Theorem 1.2.

Let $(\Theta,\rho,\mathcal{M}^{\circ})$ be a Markov chains estimation problem with a finite sample complexity and let $\varphi\colon\mathcal{M}_{d}^{\textnormal{irr}}\to\mathcal{M}_{d}^{\textnormal{irr}}$ be such that there exist $M_{1},M_{2}\in\varphi^{-1}(\mathcal{M}^{\circ})$ with $\theta(\varphi(M_{1}))=\theta(\varphi(M_{2}))$ but $\theta(M_{1})\neq\theta(M_{2})$. Let $g\colon\Theta\to\Theta$ be such that for every $R>0$ there exists $\sigma>0$ with the following property: Whenever $x,y\in\Theta$ satisfy $\rho(x,y)<\sigma$, then $\rho(g(x),g(y))<R$ (this holds, for example, if $g$ is uniformly continuous). Then no solution to $(\Theta,\rho,\mathcal{M}^{\circ})$ extends to $\varphi^{-1}(\mathcal{M}^{\circ})$ via $g$.

In Example 3.4 we shall show that the assumptions of Theorem 1.2 are satisfied in the setting of Wolfer and Kontorovich (2019, Theorem 8.1), who constructed an estimator for the pseudo spectral gap (in absolute error).

Our third main result is a reformulation of Theorem 1.1 for Markov chains identity testing problems. In this setting one needs to decide, based on a single long trajectory generated from an unknown Markov chain, whether a certain parameter of the Markov chain equals a specific value or is $\varepsilon$-far from it under some notion of distance for $\varepsilon\in(0,1)$. We will apply the theorem to the Markov chains identity testing problem considered by Wolfer and Kontorovich (2020, Theorem 4.1), in which the parameter of interest is the transition matrix. The formal definition of Markov chains identity testing problems and of extendibility of their solutions is given in Section 3.2.

Theorem 1.3.

Let $(\Theta,\rho,\mathcal{M}^{\circ},\overline{\mathcal{M}^{\circ}})$ be a Markov chains identity testing problem. Let $\varphi\colon\mathcal{M}_{d}^{\textnormal{irr}}\to\mathcal{M}_{d}^{\textnormal{irr}}$ and $\ell\colon(0,1)\to(0,1)$ be such that for every $M\in\varphi^{-1}(\mathcal{M}^{\circ})$, $\overline{M}\in\varphi^{-1}(\overline{\mathcal{M}^{\circ}})$ and $\varepsilon\in(0,1)$ it holds that

$$\begin{cases}\theta(M)=\theta(\overline{M})&\Longrightarrow\ \theta(\varphi(M))=\theta(\varphi(\overline{M}))\\ \rho(\theta(M),\theta(\overline{M}))>\varepsilon&\Longrightarrow\ \rho(\theta(\varphi(M)),\theta(\varphi(\overline{M})))>\ell(\varepsilon).\end{cases} \qquad (2)$$

Then every solution to $(\Theta,\rho,\mathcal{M}^{\circ},\overline{\mathcal{M}^{\circ}})$ extends to $(\varphi^{-1}(\mathcal{M}^{\circ}),\varphi^{-1}(\overline{\mathcal{M}^{\circ}}))$.

Our last main result addresses the question in (c), namely, the connection between $\gamma_{\text{ps}}(M)$ and $\gamma_{\text{ps}}(\mathcal{L}_{\alpha}(M))$ for an ergodic Markov chain $M$. Recall that the pseudo spectral gap captures the mixing time (cf. (5)) and that there exists a universal constant $c$ such that

$$t_{\mathsf{mix}}(\mathcal{L}_{\alpha}(M))\leq\frac{c}{1-\alpha}\max\left\{\frac{1}{1-\alpha},\,t_{\mathsf{mix}}(M)\right\}$$

(Lemma 4.5). We conjecture (Conjecture 4.6) that a similar inequality holds for the pseudo spectral gap. As a consequence of the following theorem we show that for an ergodic Markov chain $M$ such that $\gamma_{\text{ps}}(M)$ is obtained at $k=1$ (cf. (4)) it holds that

$$\gamma_{\text{ps}}(\mathcal{L}_{\alpha}(M))\geq(1-\alpha)\gamma_{\text{ps}}(M). \qquad (3)$$

This confirms the conjecture for a class of ergodic Markov chains that contains all ergodic and reversible Markov chains.

In the following theorem $\gamma(M)$ denotes the spectral gap of a reversible Markov chain $M$ and $M^{\dagger}$ denotes the multiplicative reversibilization of an irreducible Markov chain $M$:

Theorem 1.4.

Let $M$ be an ergodic Markov chain. Then

$$\gamma(\mathcal{L}_{\alpha}(M)^{\dagger})\geq(1-\alpha)\gamma(M^{\dagger}).$$

The paper is structured as follows: After a short literature review, in Section 2 we recall definitions and well known facts regarding Markov chains and their $\alpha$-lazy versions that will be used throughout this work; our main reference here is Levin and Peres (2017). Section 3, which contains our main results, formulates the framework and demonstrates its usage. Section 4 is devoted to the sample complexity aspect of moving to an $\alpha$-lazy version.

1.1 Related literature

In recent years there have been attempts to go beyond the iid assumption in statistical inference problems. The Markovian setting, in which every sample depends only on the sample immediately preceding it, is a natural weakening of the iid assumption. Works on the subject differ mainly in their choice of the distance notion between Markov chains: Wolfer and Kontorovich (2020, 2021) consider each row of the transition matrix separately, under the total variation distance between distributions. This leads to a distance notion based on the infinity norm. Their analysis relies on the concentration bounds of Paulin (2015) involving the pseudo spectral gap, which is defined only for ergodic Markov chains. Chan et al. (2021) generalized Wolfer and Kontorovich's results to all irreducible Markov chains by recognizing that the sample complexity under the infinity norm is characterized by what they refer to as the $k$-cover time, denoted by $t_{\text{cov}}^{(k)}$ and defined to be the expected length of a trajectory such that every state is visited at least $k$ times. Taking a different path, Daskalakis et al. (2018) and Cherapanamjeri and Bartlett (2019) use a distance notion that relies on the largest eigenvalue of the geometric mean of the transition matrices. They make the stringent assumption that the Markov chains are symmetric (Fried and Wolfer (2021) subsequently showed that, in essence, reversibility suffices).

Our work is especially related to Chan et al. (2021), since it follows from both theirs and ours that the results of Wolfer and Kontorovich mentioned above apply to all irreducible Markov chains: Chan et al. (2021) achieve this by considering $t_{\text{cov}}^{(k)}$, which is defined for every irreducible Markov chain, and we do so by moving to the $\alpha$-lazy version.

We argue that while the $k$-cover time provides a sharper sample complexity characterization of the problem of inference in the single long trajectory setting, compared to the lower and upper bounds of Wolfer and Kontorovich that are based on the mixing properties and exhibit a logarithmic gap, at the moment it is not clear how to exploit this sharpness. Indeed, first, there is currently no method to estimate the $k$-cover time from a single long trajectory directly. Second, all the upper bounds on the $k$-cover times in terms of well known quantities that Chan et al. (2021) provide (e.g. $t_{\text{cov}}^{(k)}=\tilde{O}\left(t_{\text{cov}}+\frac{k}{\pi_{\star}}\right)$ for irreducible Markov chains and $t_{\text{cov}}^{(k)}=\tilde{O}\left(\frac{1}{\pi_{\star}\gamma_{\text{ps}}}+\frac{k}{\pi_{\star}}\right)$ for ergodic Markov chains) involve the minimum stationary probability. To the best of our knowledge, Example 3.2(a) (see also Lemma 4.2), which is a combination of our machinery and the estimator of Wolfer and Kontorovich (2019), provides the only fully empirical method for the estimation of the minimum stationary probability of all irreducible Markov chains from a single long trajectory. Furthermore, Wolfer and Kontorovich (2019) provide a similar estimator of the pseudo spectral gap. It seems that although considerable progress has been made regarding the computational problem of approximating the cover time (e.g. Feige and Rabinovich (2003), Bui and Sohier (2007), Feige and Zeitouni (2009) and Ding et al. (2011)), the statistical estimation of the cover time from a single trajectory remains a challenging open question.

2 Preliminaries

2.1 Markov chains

Let $\Omega$ be a finite state space which we assume, without loss of generality, to be equal to $[d]=\{1,2,\ldots,d\}$ where $2\leq d\in\mathbb{N}$. We denote by $\mathcal{M}_{d}$ the set of all $d\times d$ row-stochastic matrices and refer to elements of $\mathcal{M}_{d}$ as Markov chains. For a matrix $M$ of size $d\times d$ and $i,j\in[d]$ we write $M(i,j)$ for the entry at the $i$th row and the $j$th column of $M$. Recall that a Markov chain $M$ is called irreducible if for every $i,j\in[d]$ there exists $t\in\mathbb{N}$ such that $M^{t}(i,j)>0$. The period of a state $i\in[d]$ is defined as

$$\gcd\{t\in\mathbb{N}\mid M^{t}(i,i)>0\}.$$

If a Markov chain $M$ is irreducible, all its states share the same period (Levin and Peres (2017, Lemma 1.6)). In this case, the periodicity of $M$ is defined as the period of any of its states. If the periodicity is $1$, then $M$ is said to be aperiodic, and otherwise periodic. Irreducible and aperiodic Markov chains are called ergodic.

We denote by $\mathcal{M}_{d}^{\textnormal{irr}}$ (resp. $\mathcal{M}_{d}^{\textnormal{erg}}$) the subset of $\mathcal{M}_{d}$ consisting of all irreducible (resp. ergodic) Markov chains. Let $M\in\mathcal{M}_{d}$ and $\alpha\in(0,1)$ (to be used henceforth). Following Montenegro and Tetali (2006) and Hermon (2016), we call $\mathcal{L}_{\alpha}(M)=\alpha I+(1-\alpha)M$ the $\alpha$-lazy version of $M$, where $I$ is the identity matrix of size $d$. The simplex of all probability distributions over $[d]$ will be denoted by $\Delta_{d}$. If $\mu\in\mathbb{R}^{d}$ and $i\in[d]$ we write $\mu(i)$ for the $i$th entry of $\mu$. Vectors are assumed to be row-vectors.

Let $M\in\mathcal{M}_{d}$, $\mu\in\Delta_{d}$, $m\in\mathbb{N}$ and $i_{1},\ldots,i_{m}\in[d]$. By $(X_{1},\ldots,X_{m})\sim(M,\mu)$ we mean

$$\mathbb{P}((X_{1},\ldots,X_{m})=(i_{1},\ldots,i_{m}))=\mu(i_{1})\prod_{t=1}^{m-1}M(i_{t},i_{t+1}).$$

We say that $\pi\in\Delta_{d}$ is a stationary distribution of $M\in\mathcal{M}_{d}$ if $\pi M=\pi$. All Markov chains on a finite state space have at least one stationary distribution (Levin and Peres (2017, Section 1.7)) and irreducible Markov chains have unique stationary distributions (Levin and Peres (2017, Corollary 1.17)). Furthermore, if $\pi$ is the stationary distribution of $M\in\mathcal{M}_{d}^{\textnormal{irr}}$, then the minimum stationary probability of $M$, defined as

$$\pi_{\star}(M)=\min_{i\in[d]}\{\pi(i)\},$$

satisfies $\pi_{\star}(M)>0$ (Levin and Peres (2017, Proposition 1.14)). Ergodic Markov chains converge to their stationary distribution in total variation (Levin and Peres (2017, Theorem 4.9)), i.e., if $M\in\mathcal{M}_{d}^{\textnormal{erg}}$ and $\pi$ is its stationary distribution, then

$$d(t):=\max_{i\in[d]}\|M^{t}(i,\cdot)-\pi\|_{\text{TV}}\xrightarrow[t\to\infty]{}0$$

where

$$\|\mu-\nu\|_{\text{TV}}=\frac{1}{2}\sum_{i\in[d]}|\mu(i)-\nu(i)|,\quad\forall\mu,\nu\in\Delta_{d}$$

(Levin and Peres (2017, Proposition 4.2)). For $\varepsilon>0$ denote

$$t_{\mathsf{mix}}(M,\varepsilon)=\min\{t\in\mathbb{N}\mid d(t)\leq\varepsilon\}$$

and define the mixing time of $M$ by $t_{\mathsf{mix}}(M):=t_{\mathsf{mix}}(M,1/4)$.

Let $M\in\mathcal{M}_{d}^{\textnormal{irr}}$ with stationary distribution $\pi$. We write $Q(M)=\operatorname{diag}(\pi)M$ for the edge measure of $M$, where $\operatorname{diag}(\pi)$ is the diagonal matrix whose entries correspond to $\pi$, and call $M$ reversible if $Q(M)$ is symmetric. Suppose now that $M\in\mathcal{M}_{d}^{\textnormal{erg}}$ is reversible. Then $1$ is an eigenvalue of $M$ with a one dimensional eigenspace and all eigenvalues of $M$ are real and lie in $(-1,1]$ (Levin and Peres (2017, Lemmas 12.1 and 12.2)). They may therefore be ordered:

$$1=\lambda_{1}>\lambda_{2}\geq\cdots\geq\lambda_{d}>-1.$$

The spectral gap and the absolute spectral gap of $M$ are defined by

$$\gamma(M)=1-\lambda_{2}\quad\text{and}\quad\gamma_{\star}(M)=1-\max\{\lambda_{2},|\lambda_{d}|\},\ \text{respectively}.$$

The absolute spectral gap and the minimum stationary probability of $M$ capture $t_{\mathsf{mix}}(M)$ (Levin and Peres (2017, Theorems 12.3 and 12.4)):

$$\left(\frac{1}{\gamma_{\star}(M)}-1\right)\ln 2\leq t_{\mathsf{mix}}(M)\leq\frac{1}{\gamma_{\star}(M)}\ln\frac{4}{\pi_{\star}(M)}.$$

For ergodic Markov chains that are not necessarily reversible there exists an analogue of the absolute spectral gap, namely the pseudo spectral gap, introduced by Paulin (2015): Let $M\in\mathcal{M}_{d}^{\textnormal{erg}}$ with stationary distribution $\pi$. Define the time reversal of $M$ by $M^{*}=\operatorname{diag}(\pi)^{-1}M^{T}\operatorname{diag}(\pi)$ and the multiplicative reversibilization of $M$ by $M^{\dagger}=M^{*}M$. The pseudo spectral gap of $M$ is then defined to be

$$\gamma_{\text{ps}}(M)=\max_{k\in\mathbb{N}}\left\{\frac{\gamma\left((M^{k})^{\dagger}\right)}{k}\right\}. \qquad (4)$$

By Paulin (2015, Proposition 3.4),

$$\frac{1}{\gamma_{\text{ps}}(M)}\leq t_{\mathsf{mix}}(M)\leq\frac{1}{\gamma_{\text{ps}}(M)}\left(1+2\ln 2+\ln\frac{1}{\pi_{\star}(M)}\right). \qquad (5)$$

For $\pi_{\star},\gamma_{\text{ps}}\in(0,1)$ denote by $\mathcal{M}_{\pi_{\star},\gamma_{\text{ps}}}$ the set of all ergodic Markov chains whose minimum stationary probability and pseudo spectral gap are lower bounded by $\pi_{\star}$ and $\gamma_{\text{ps}}$, respectively.
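
Since $(M^{k})^{\dagger}$ is positive semi-definite (up to a similarity transformation), $\gamma((M^{k})^{\dagger})\leq 1$ and the terms in (4) with $k>K$ contribute at most $1/K$, so the maximum can be probed numerically by truncation. The following is a minimal sketch of such a computation (ours, not from the paper; the truncation level $K$ is an arbitrary choice):

```python
import numpy as np

def stationary_distribution(M):
    """Left eigenvector of M for the eigenvalue 1, normalized to a distribution."""
    vals, vecs = np.linalg.eig(M.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
    return pi / pi.sum()

def spectral_gap(M):
    """1 minus the second largest eigenvalue; M is assumed reversible,
    so its spectrum is real (up to numerical noise)."""
    lam = np.sort(np.real(np.linalg.eigvals(M)))[::-1]
    return 1.0 - lam[1]

def pseudo_spectral_gap(M, K=20):
    """gamma_ps(M) of definition (4), truncated at k <= K."""
    pi = stationary_distribution(M)
    D, D_inv = np.diag(pi), np.diag(1.0 / pi)
    best = 0.0
    for k in range(1, K + 1):
        Mk = np.linalg.matrix_power(M, k)
        Mk_dagger = D_inv @ Mk.T @ D @ Mk  # (M^k)^dagger = (M^k)^* (M^k)
        best = max(best, spectral_gap(Mk_dagger) / k)
    return best
```

Note that the stationary distribution of $M^{k}$ is again $\pi$, so using $\pi$ of $M$ in the time reversal of $M^{k}$ is correct.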

If $\pi$ is the stationary distribution of some irreducible Markov chain and $\mu\in\Delta_{d}$, define

$$\|\mu/\pi\|^{2}_{2,\pi}=\sum_{i\in[d]}\frac{\mu(i)^{2}}{\pi(i)}.$$

Finally, if $A$ is a square matrix of size $d$, then the infinity norm of $A$ is defined by $\|A\|_{\infty}=\max_{i\in[d]}\sum_{j\in[d]}|A(i,j)|$.

2.2 The $\alpha$-lazy version of a Markov chain

While Theorems 1.1, 1.2 and 1.3 are stated for arbitrary maps $\varphi\colon\mathcal{M}_{d}^{\textnormal{irr}}\to\mathcal{M}_{d}^{\textnormal{irr}}$, we are mainly interested in the case $\varphi=\mathcal{L}_{\alpha}$, where $\mathcal{L}_{\alpha}\colon\mathcal{M}_{d}^{\textnormal{irr}}\to\mathcal{M}_{d}^{\textnormal{irr}}$ is defined by

$$\mathcal{L}_{\alpha}(M)=\alpha I+(1-\alpha)M,\quad\forall M\in\mathcal{M}_{d}^{\textnormal{irr}}.$$

Let us recall some of the properties of $\mathcal{L}_{\alpha}$. To this end fix $M\in\mathcal{M}_{d}^{\textnormal{irr}}$ with stationary distribution $\pi$:

  (1)

    $\mathcal{L}_{\alpha}(\mathcal{M}_{d}^{\textnormal{irr}})\subseteq\mathcal{M}_{d}^{\textnormal{erg}}$.

  (2)

    $\mathcal{L}_{\alpha}$ is injective but not surjective.

  (3)

    The stationary distribution of $\mathcal{L}_{\alpha}(M)$ is also $\pi$. In particular, $\pi_{\star}(\mathcal{L}_{\alpha}(M))=\pi_{\star}(M)$.

  (4)

    There exist $\pi_{\star},\gamma_{\text{ps}}\in(0,1)$ such that $M\in\mathcal{L}_{\alpha}^{-1}(\mathcal{M}_{\pi_{\star},\gamma_{\text{ps}}})$.

  (5)

    If $M$ is reversible, then so is $\mathcal{L}_{\alpha}(M)$.

  (6)

    It is possible to simulate $\mathcal{L}_{\alpha}(M)$ even when one only has access to a black box that generates a single long trajectory from $M$. Indeed, independently in each step, we toss a biased coin that comes up heads with probability $1-\alpha$. If the coin shows heads, we move one step along the trajectory from $M$. Otherwise, we insert a duplicate of the current state at the position immediately after the current position and set it to be the new current position. A minimal code sketch of this simulation is given below.
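
The following sketch (ours; the function name and the iterator interface are assumptions, not the paper's) implements the simulation in (6), assuming the black box is exposed as a Python iterator over the states of a trajectory from $M$:

```python
import random

def lazy_trajectory(trajectory, alpha, m, seed=None):
    """Simulate m steps of L_alpha(M) from a single trajectory of M.

    `trajectory` is an iterator yielding X_1, X_2, ... ~ (M, mu).
    In each step, with probability 1 - alpha (heads) we advance along
    the trajectory of M; otherwise we duplicate the current state.
    """
    rng = random.Random(seed)
    current = next(trajectory)
    lazy = [current]
    while len(lazy) < m:
        if rng.random() < 1 - alpha:  # heads: move along the trajectory of M
            current = next(trajectory)
        lazy.append(current)          # tails: `current` was duplicated
    return lazy
```

For instance, if `xs` is a recorded trajectory of $M$, then `lazy_trajectory(iter(xs), alpha=0.5, m=100)` is a length-$100$ trajectory of $\mathcal{L}_{1/2}(M)$ started from the same initial distribution.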

3 Main results

Definition 3.1, together with its analogue, Definition 3.5, provides the basis for this work. The reader might find it useful to keep the following story in mind while reading them: Suppose we are interested in a certain parameter of irreducible Markov chains, e.g. their minimum stationary probability. Let $\Theta$ denote the set of all values that this parameter may take, e.g. $(0,1/2]$. Denote by $\theta(M)$ the value that the parameter takes for $M\in\mathcal{M}_{d}^{\textnormal{irr}}$. Let $\hat{\theta}$ be an estimator that, upon receiving a trajectory from an unknown $M\in\mathcal{M}_{d}^{\textnormal{irr}}$, outputs an estimation $\hat{\theta}(M)$ of $\theta(M)$ whose “quality” is measured by a distance function $\rho$. Suppose we know that for some class $\mathcal{M}^{\circ}\subseteq\mathcal{M}_{d}^{\textnormal{irr}}$ the estimator needs not more than $m_{0}\in\mathbb{N}$ samples in order to output a “good” estimate with “high” probability. Now, suppose we want to use this estimator for Markov chains not belonging to $\mathcal{M}^{\circ}$. If there is a map $\varphi$ into $\mathcal{M}^{\circ}$, the following recipe might work: Given $M\in\mathcal{M}_{d}^{\textnormal{irr}}$, apply $\hat{\theta}$ to $\varphi(M)\in\mathcal{M}^{\circ}$ and then pull back the estimation $\hat{\theta}(\varphi(M))$ to an estimation of $\theta(M)$.

3.1 Estimation problems

In this section we define Markov chains estimation problems, their solutions and extendibility of solutions. We then prove Theorems 1.1 and 1.2 and apply them to the results of Wolfer and Kontorovich (2019, 2021).

Definition 3.1.

Let $\Theta$ be a set and $\theta\colon\mathcal{M}_{d}^{\textnormal{irr}}\to\Theta$. Let $\rho\colon\Theta\times\Theta\to\mathbb{R}_{+}$ be a distance function and $\mathcal{M}^{\circ}\subseteq\mathcal{M}_{d}^{\textnormal{irr}}$. We refer to $(\Theta,\rho,\mathcal{M}^{\circ})$ as a Markov chains estimation problem. An estimator is a sequence $\left(\hat{\theta}_{m}\right)_{m\in\mathbb{N}}$ such that $\hat{\theta}_{m}\colon[d]^{m}\to\Theta$ for every $m\in\mathbb{N}$. Now, let $m\in\mathbb{N}$, $M\in\mathcal{M}_{d}^{\textnormal{irr}}$, $\mu\in\Delta_{d}$, $\varepsilon,\delta\in(0,1)$ and let $\hat{\theta}$ be an estimator. We write $\hat{\theta}_{m}(M)$ for the output of the estimator $\hat{\theta}_{m}$ upon receiving a sequence $(X_{1},\ldots,X_{m})\sim(M,\mu)$. Define the minimax risk

$$\mathcal{R}_{m}(\hat{\theta},\varepsilon,\mathcal{M}^{\circ},\mu)=\sup_{M\in\mathcal{M}^{\circ}}\mathbb{P}\left(\rho(\hat{\theta}_{m}(M),\theta(M))\geq\varepsilon\right)$$

and the sample complexity

$$m_{0}(\hat{\theta},\varepsilon,\delta,\mathcal{M}^{\circ},\mu)=\inf\left\{m\in\mathbb{N}\mid\mathcal{R}_{m}(\hat{\theta},\varepsilon,\mathcal{M}^{\circ},\mu)<\delta\right\}.$$

An estimator $\hat{\theta}$ is a solution to $(\Theta,\rho,\mathcal{M}^{\circ})$ if $m_{0}(\hat{\theta},\varepsilon,\delta,\mathcal{M}^{\circ},\mu)<\infty$ for every $\varepsilon,\delta\in(0,1)$. Finally, let $\varphi\colon\mathcal{M}_{d}^{\textnormal{irr}}\to\mathcal{M}_{d}^{\textnormal{irr}}$. We say that a solution $\hat{\theta}$ to $(\Theta,\rho,\mathcal{M}^{\circ})$ extends to $\varphi^{-1}(\mathcal{M}^{\circ})$ if there exists $g\colon\Theta\to\Theta$ such that $g\circ\hat{\theta}\circ\varphi$ is a solution to $(\Theta,\rho,\varphi^{-1}(\mathcal{M}^{\circ}))$.

Proof of Theorem 1.1

Let $\hat{\theta}$ be a solution to $(\Theta,\rho,\mathcal{M}^{\circ})$. We need to show that $g\circ\hat{\theta}\circ\varphi$ is a solution to $(\Theta,\rho,\varphi^{-1}(\mathcal{M}^{\circ}))$. To this end, let $\varepsilon,\delta\in(0,1)$. For every $m\in\mathbb{N}$ it holds that

$$\begin{aligned}\mathcal{R}_{m}(\hat{\theta},\ell(\varepsilon),\mathcal{M}^{\circ},\mu)&=\sup_{M\in\mathcal{M}^{\circ}}\mathbb{P}(\rho(\hat{\theta}_{m}(M),\theta(M))\geq\ell(\varepsilon))\\&\geq\sup_{M\in\varphi^{-1}(\mathcal{M}^{\circ})}\mathbb{P}(\rho(\hat{\theta}_{m}(\varphi(M)),\theta(\varphi(M)))\geq\ell(\varepsilon))\\&\geq\sup_{M\in\varphi^{-1}(\mathcal{M}^{\circ})}\mathbb{P}(\rho(g(\hat{\theta}_{m}(\varphi(M))),\theta(M))\geq\varepsilon)\\&=\mathcal{R}_{m}(g\circ\hat{\theta}\circ\varphi,\varepsilon,\varphi^{-1}(\mathcal{M}^{\circ}),\mu),\end{aligned}$$

where the second inequality is the contrapositive of condition (1). It follows that

$$\mathcal{R}_{m}(\hat{\theta},\ell(\varepsilon),\mathcal{M}^{\circ},\mu)<\delta\Longrightarrow\mathcal{R}_{m}(g\circ\hat{\theta}\circ\varphi,\varepsilon,\varphi^{-1}(\mathcal{M}^{\circ}),\mu)<\delta.$$

Therefore, since $m_{0}(\hat{\theta},\ell(\varepsilon),\delta,\mathcal{M}^{\circ},\mu)<\infty$, also

$$m_{0}(g\circ\hat{\theta}\circ\varphi,\varepsilon,\delta,\varphi^{-1}(\mathcal{M}^{\circ}),\mu)<\infty.$$

∎

Example 3.2.
  (a)

    In Wolfer and Kontorovich (2019, Theorem 5.1) a procedure for estimating the minimum stationary probability of an ergodic Markov chain (in relative error) was constructed, thus, in our terminology, a solution to the following Markov chains estimation problem: Let $\pi_{\star},\gamma_{\text{ps}}\in(0,1)$ and $\mathcal{M}^{\circ}=\mathcal{M}_{\pi_{\star},\gamma_{\text{ps}}}$. Define

    $$\Theta=(0,1/2],\quad\theta(M)=\pi_{\star}(M)\ \ \forall M\in\mathcal{M}_{d}^{\textnormal{irr}},\quad\rho(x,y)=\left|\frac{x}{y}-1\right|\ \ \forall x,y\in\Theta.$$

    We shall now show that any solution to $(\Theta,\rho,\mathcal{M}^{\circ})$ extends to $\mathcal{L}_{\alpha}^{-1}(\mathcal{M}^{\circ})$. To this end, take $g=\textnormal{id}_{\Theta}$ and $\ell=\textnormal{id}_{(0,1)}$. Then $g$ and $\ell$ satisfy condition (1). Indeed, let $x\in\Theta$, $M\in\mathcal{L}_{\alpha}^{-1}(\mathcal{M}^{\circ})$ and $\varepsilon\in(0,1)$ be such that $\left|\frac{x}{\pi_{\star}(\mathcal{L}_{\alpha}(M))}-1\right|<\varepsilon$. Since $\pi_{\star}(\mathcal{L}_{\alpha}(M))=\pi_{\star}(M)$, we also have $\left|\frac{x}{\pi_{\star}(M)}-1\right|<\varepsilon$.

  (b)

    In Wolfer and Kontorovich (2021, Theorem 3.1) a procedure for estimating the transition matrix of an ergodic Markov chain was constructed, thus, a solution to the following Markov chains estimation problem (actually, Wolfer and Kontorovich did not make sure that the output of their estimator is ergodic, or even irreducible, but only a row-stochastic matrix; nevertheless, this is easily achieved, as we shall see immediately): Let $\pi_{\star},\gamma_{\text{ps}}\in(0,1)$ and $\mathcal{M}^{\circ}=\mathcal{M}_{\pi_{\star},\gamma_{\text{ps}}}$. Define

    $$\Theta=\mathcal{M}_{d}^{\textnormal{irr}},\quad\theta(M)=M\ \ \forall M\in\mathcal{M}_{d}^{\textnormal{irr}},\quad\rho(M_{1},M_{2})=\|M_{1}-M_{2}\|_{\infty}\ \ \forall M_{1},M_{2}\in\Theta.$$

    We shall now show that any solution to $(\Theta,\rho,\mathcal{M}^{\circ})$ extends to $\mathcal{L}_{\alpha}^{-1}(\mathcal{M}^{\circ})$. Defining $\ell\colon(0,1)\to(0,1)$ is straightforward:

    $$\ell(\varepsilon)=\frac{(1-\alpha)\varepsilon}{2},\quad\forall\varepsilon\in(0,1).$$

    To justify the definition of $g\colon\Theta\to\Theta$ that we subsequently give, let us notice that for $M\in\mathcal{L}_{\alpha}^{-1}(\mathcal{M}^{\circ})$, a solution $\hat{\theta}$ to $(\Theta,\rho,\mathcal{M}^{\circ})$ gives an estimation $\hat{M}$ of $\mathcal{L}_{\alpha}(M)$, and $\hat{M}$ is a row-stochastic matrix. But it might happen that $\hat{M}\not\in\mathcal{L}_{\alpha}(\mathcal{M}_{d}^{\textnormal{irr}})$. Thus, applying $\mathcal{L}_{\alpha}^{-1}$ to $\hat{M}$ might lead to negative entries on the main diagonal and therefore to a matrix which is not row-stochastic. To mitigate this, one possibility is to project $\mathcal{L}_{\alpha}^{-1}(\hat{M})$ onto $\mathcal{M}_{d}$ with respect to $\|\cdot\|_{\infty}$, which is equivalent to projecting each of $\mathcal{L}_{\alpha}^{-1}(\hat{M})$'s rows on $\Delta_{d}$ with respect to the $\ell_{1}$ norm. This leads to a convex optimization problem whose details we give in Section 5.1 in the appendix. It remains to make sure that the row-stochastic matrix is irreducible. To this end denote by $\mathbf{1}$ the square matrix of size $d$ that has all entries equal to $1$ and for $N\in\mathcal{M}_{d}$ and $\beta\in(0,1)$ let $N_{\mathbf{1},\beta}=\frac{\beta}{d}\mathbf{1}+(1-\beta)N$. Clearly, $N_{\mathbf{1},\beta}$ is an irreducible (even ergodic) Markov chain. Furthermore,

    $$\|N_{\mathbf{1},\beta}-N\|_{\infty}=\beta\left\|\frac{1}{d}\mathbf{1}-N\right\|_{\infty}\leq 2\beta. \qquad (6)$$

    Thus, for $M\in\Theta$ we define

    $$g(M)=\left(P_{\mathcal{M}_{d}}\left(\mathcal{L}_{\alpha}^{-1}(M)\right)\right)_{\mathbf{1},\varepsilon/4}$$

    where by $P_{\mathcal{M}_{d}}\left(\mathcal{L}_{\alpha}^{-1}(M)\right)$ we mean that the procedure from Lemma 5.2 is applied independently to each row of $\mathcal{L}_{\alpha}^{-1}(M)$. Then $g$ and $\ell$ satisfy condition (1). Indeed, let $\hat{M}\in\Theta$, $M\in\mathcal{L}_{\alpha}^{-1}(\mathcal{M}^{\circ})$ and $\varepsilon\in(0,1)$ be such that $\|\hat{M}-\mathcal{L}_{\alpha}(M)\|_{\infty}<\frac{(1-\alpha)\varepsilon}{2}$. This inequality is equivalent to $\|\mathcal{L}_{\alpha}^{-1}(\hat{M})-M\|_{\infty}<\frac{\varepsilon}{2}$. By Lemma 5.2, inequality (6) and the triangle inequality, $\|g(\hat{M})-M\|_{\infty}<\varepsilon$. A code sketch of this $g$ is given below.
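
The following sketch (ours; the function names are assumptions) assembles this $g$ in NumPy. It relies on `project_row`, a row projection implementing the construction from the proof of Lemma 5.2, which we sketch in Section 5.1 of the appendix:

```python
import numpy as np

def g(M_hat, alpha, eps):
    """Pull an estimate M_hat of L_alpha(M) back to an estimate of M:
    invert the lazy map, l1-project each row onto the simplex, then
    smooth with beta = eps/4 to force irreducibility (the matrix
    N_{1,beta} from Example 3.2(b))."""
    d = M_hat.shape[0]
    N = (M_hat - alpha * np.eye(d)) / (1 - alpha)  # L_alpha^{-1}(M_hat)
    N = np.apply_along_axis(project_row, 1, N)     # P_{M_d}, row by row
    beta = eps / 4
    return beta / d * np.ones((d, d)) + (1 - beta) * N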

Remark 3.3.

Our definition of a solution to a Markov chains estimation problem only assumes that the sample complexity is finite and therefore sheds no light on the extent to which the sample complexity might change while extending a solution. We postpone the treatment of this aspect to Section 4.

We now come to the

Proof of Theorem 1.2

Let $\hat{\theta}$ be a solution to $(\Theta,\rho,\mathcal{M}^{\circ})$. By assumption,

$$R:=\rho(\theta(M_{1}),\theta(M_{2}))>0.$$

Then there exists $\sigma\in(0,1)$ such that if $x,y\in\Theta$ satisfy $\rho(x,y)<\sigma$, then $\rho(g(x),g(y))<R/2$. Assume, towards a contradiction, that $g\circ\hat{\theta}\circ\varphi$ is a solution to $(\Theta,\rho,\varphi^{-1}(\mathcal{M}^{\circ}))$. Thus, for $m\in\mathbb{N}$ large enough, we have

$$\mathcal{R}_{m}\left(\hat{\theta},\frac{\sigma}{2},\mathcal{M}^{\circ},\mu\right)<\frac{1}{4}\quad\textnormal{and}\quad\mathcal{R}_{m}\left(g\circ\hat{\theta}\circ\varphi,\frac{R}{4},\varphi^{-1}(\mathcal{M}^{\circ}),\mu\right)<\frac{1}{4}.$$

It follows that, with probability at least $1/2$,

$$\rho\left(\hat{\theta}_{m}(\varphi(M_{1})),\hat{\theta}_{m}(\varphi(M_{2}))\right)<\sigma.$$

By the assumption on $g$, with probability at least $1/2$,

$$\rho\left(g(\hat{\theta}_{m}(\varphi(M_{1}))),g(\hat{\theta}_{m}(\varphi(M_{2})))\right)<R/2.$$

On the other hand, with probability at least $1/2$, for $i=1,2$:

$$\rho\left(g(\hat{\theta}_{m}(\varphi(M_{i}))),\theta(M_{i})\right)<R/4.$$

This means that, with probability at least $1/2$,

$$\rho\left(g(\hat{\theta}_{m}(\varphi(M_{1}))),g(\hat{\theta}_{m}(\varphi(M_{2})))\right)>R/2.$$

A contradiction. ∎

Example 3.4.

In Wolfer and Kontorovich (2019, Theorem 8.1) a procedure for estimating the pseudo spectral gap (in absolute error) of an ergodic Markov chain was constructed, thus, a solution to the following Markov chains estimation problem: Let $\pi_{\star},\gamma_{\text{ps}}\in(0,1)$ and $\mathcal{M}^{\circ}=\mathcal{M}_{\pi_{\star},\gamma_{\text{ps}}}$. Define

$$\Theta=[0,1],\quad\theta(M)=\gamma_{\text{ps}}(M)\ \ \forall M\in\mathcal{M}_{d}^{\textnormal{irr}},\quad\rho(x,y)=|x-y|\ \ \forall x,y\in\Theta.$$

We claim that no solution $\hat{\theta}$ to $(\Theta,\rho,\mathcal{M}^{\circ})$ extends to $\mathcal{L}_{1/2}^{-1}(\mathcal{M}^{\circ})$ with $g$ continuous (together with the compactness of $\Theta$ this means that $g$ is uniformly continuous). Indeed, by Theorem 1.2, it suffices to find $M_{1},M_{2}\in\mathcal{L}_{1/2}^{-1}(\mathcal{M}^{\circ})$ such that $\gamma_{\text{ps}}(\mathcal{L}_{1/2}(M_{1}))=\gamma_{\text{ps}}(\mathcal{L}_{1/2}(M_{2}))$ but $\gamma_{\text{ps}}(M_{1})\neq\gamma_{\text{ps}}(M_{2})$. Consider

$$M_{1}=\begin{pmatrix}0&1&0\\0.5&0&0.5\\0&1&0\end{pmatrix},\qquad M_{2}=\begin{pmatrix}0.5&0.2&0.3\\0.5&0.2&0.3\\0.5&0.2&0.3\end{pmatrix}.$$

Then $\gamma_{\text{ps}}(\mathcal{L}_{1/2}(M_{1}))=0.75=\gamma_{\text{ps}}(\mathcal{L}_{1/2}(M_{2}))$ but $\gamma_{\text{ps}}(M_{1})=0\neq 1=\gamma_{\text{ps}}(M_{2})$. To see this, first notice that $M_{1}$ is periodic and therefore $\gamma_{\text{ps}}(M_{1})=0$ (cf. Paulin (2015, p. 11)). Let $\pi_{i}$ denote the stationary distribution of $M_{i}$, $i=1,2$. It holds that

$$\pi_{1}=(0.25,0.5,0.25)\quad\text{and}\quad\pi_{2}=(0.5,0.2,0.3).$$

Thus, with $Q_{i}=Q(M_{i})$, $i=1,2$, we have

$$Q_{1}=\begin{pmatrix}0&0.25&0\\0.25&0&0.25\\0&0.25&0\end{pmatrix},\qquad Q_{2}=\begin{pmatrix}0.25&0.1&0.15\\0.1&0.04&0.06\\0.15&0.06&0.09\end{pmatrix}.$$

Conclude that both $M_{1}$ and $M_{2}$ are reversible, and therefore so are $\mathcal{L}_{1/2}(M_{1})$ and $\mathcal{L}_{1/2}(M_{2})$. By Wolfer and Kontorovich (2021, p. 18), for ergodic and reversible Markov chains, the maximum in the definition of the pseudo spectral gap is obtained at $k=1$. Thus, for $M\in\{\mathcal{L}_{1/2}(M_{1}),M_{2},\mathcal{L}_{1/2}(M_{2})\}$:

$$\gamma_{\text{ps}}(M)=\gamma(M^{2}).$$

The claim now follows by calculating the second largest eigenvalue of $M^{2}$ in each of the three cases.
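
These values are also easy to check numerically. The following snippet (ours, reusing the `pseudo_spectral_gap` sketch from Section 2.1) prints $0.75$, $0.75$ and $1.0$:

```python
import numpy as np

M1 = np.array([[0.0, 1.0, 0.0],
               [0.5, 0.0, 0.5],
               [0.0, 1.0, 0.0]])
M2 = np.array([[0.5, 0.2, 0.3]] * 3)

def lazy(M, a=0.5):
    return a * np.eye(len(M)) + (1 - a) * M

# M1 is periodic, so gamma_ps(M1) = 0 and it is excluded here.
for M in (lazy(M1), lazy(M2), M2):
    print(round(pseudo_spectral_gap(M), 6))
```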

3.2 Identity testing problems

In this section we define Markov chains identity testing problems, their solutions and extendibility of solutions. We then apply Theorem 1.3 to the results of Wolfer and Kontorovich (2020, Theorem 4.1).

Definition 3.5.

Let $\Theta$ be a set and $\theta\colon\mathcal{M}_{d}^{\textnormal{irr}}\to\Theta$. Let $\rho\colon\Theta\times\Theta\to\mathbb{R}_{+}$ be a distance function and $\mathcal{M}^{\circ},\overline{\mathcal{M}^{\circ}}\subseteq\mathcal{M}_{d}^{\textnormal{irr}}$. We refer to $(\Theta,\rho,\mathcal{M}^{\circ},\overline{\mathcal{M}^{\circ}})$ as a Markov chains identity testing problem. An identity tester is a sequence $\left(\hat{\tau}_{m}\right)_{m\in\mathbb{N}}$ such that $\hat{\tau}_{m}\colon[d]^{m}\times\mathcal{M}_{d}\to\{0,1\}$ for every $m\in\mathbb{N}$. Now, let $m\in\mathbb{N}$, $M,\overline{M}\in\mathcal{M}_{d}^{\textnormal{irr}}$, $\mu\in\Delta_{d}$, $\varepsilon,\delta\in(0,1)$ and let $\hat{\tau}$ be an identity tester. We will write $\hat{\tau}_{m}(M,\overline{M})$ for the output of the identity tester $\hat{\tau}_{m}$ upon receiving a sequence $(X_{1},\ldots,X_{m})\sim(M,\mu)$ and $\overline{M}$. Define the minimax risk

$$\mathcal{R}_{m}(\hat{\tau},\varepsilon,\mathcal{M}^{\circ},\overline{\mathcal{M}^{\circ}},\mu)=\sup_{M\in\mathcal{M}^{\circ},\,\overline{M}\in\overline{\mathcal{M}^{\circ}}}\mathbb{P}\left(\begin{cases}\hat{\tau}_{m}(M,\overline{M})\neq 0,&\text{if }\theta(M)=\theta(\overline{M}),\ \text{or}\\\hat{\tau}_{m}(M,\overline{M})\neq 1,&\text{if }\rho(\theta(M),\theta(\overline{M}))>\varepsilon\end{cases}\right)$$

and the sample complexity

$$m_{0}(\hat{\tau},\delta,\varepsilon,\mathcal{M}^{\circ},\overline{\mathcal{M}^{\circ}},\mu)=\inf\left\{m\in\mathbb{N}\mid\mathcal{R}_{m}(\hat{\tau},\varepsilon,\mathcal{M}^{\circ},\overline{\mathcal{M}^{\circ}},\mu)<\delta\right\}.$$

An identity tester $\hat{\tau}$ is a solution to $(\Theta,\rho,\mathcal{M}^{\circ},\overline{\mathcal{M}^{\circ}})$ if

$$m_{0}(\hat{\tau},\delta,\varepsilon,\mathcal{M}^{\circ},\overline{\mathcal{M}^{\circ}},\mu)<\infty$$

for every $\varepsilon,\delta\in(0,1)$.

Let $\varphi\colon\mathcal{M}_{d}^{\textnormal{irr}}\to\mathcal{M}_{d}^{\textnormal{irr}}$. We say that a solution $\hat{\tau}$ to $(\Theta,\rho,\mathcal{M}^{\circ},\overline{\mathcal{M}^{\circ}})$ extends to $(\varphi^{-1}(\mathcal{M}^{\circ}),\varphi^{-1}(\overline{\mathcal{M}^{\circ}}))$ if $\hat{\tau}\circ\varphi$ is a solution to $(\Theta,\rho,\varphi^{-1}(\mathcal{M}^{\circ}),\varphi^{-1}(\overline{\mathcal{M}^{\circ}}))$.

The proof of Theorem 1.3 is similar to the proof of Theorem 1.1 and is omitted.

Example 3.6.

In Wolfer and Kontorovich (2020, Theorem 4.1) a procedure for identity testing of Markov chains transition matrices was constructed, thus, a solution to the following Markov chains identity testing problem: Let $\pi_{\star},\gamma_{\text{ps}}\in(0,1)$ and $\mathcal{M}^{\circ}=\overline{\mathcal{M}^{\circ}}=\mathcal{M}_{\pi_{\star},\gamma_{\text{ps}}}$. Define

$$\Theta=\mathcal{M}_{d}^{\textnormal{irr}},\quad\theta(M)=M\ \ \forall M\in\mathcal{M}_{d}^{\textnormal{irr}},\quad\rho(M_{1},M_{2})=\|M_{1}-M_{2}\|_{\infty}\ \ \forall M_{1},M_{2}\in\Theta.$$

Then any solution to $(\Theta,\rho,\mathcal{M}^{\circ},\overline{\mathcal{M}^{\circ}})$ extends to $(\mathcal{L}_{\alpha}^{-1}(\mathcal{M}^{\circ}),\mathcal{L}_{\alpha}^{-1}(\overline{\mathcal{M}^{\circ}}))$. Indeed, $\ell\colon(0,1)\to(0,1)$ given by

$$\ell(\varepsilon)=(1-\alpha)\varepsilon,\quad\forall\varepsilon\in(0,1)$$

obviously satisfies condition (2) of Theorem 1.3.

4 The cost of moving to an $\alpha$-lazy version of a Markov chain

In the previous section we were solely interested in the feasibility of extending solutions to Markov chains estimation and testing problems. In this section we wish to highlight that extending a solution might result in an increase in the sample complexity of the problem. In particular, we give the exact statements of the problems whose solutions we extended.

Remark 4.1.

Let $M\in\mathcal{M}_{d}$. It follows from the way the Markov chain $\mathcal{L}_{\alpha}(M)$ is simulated (cf. (6) of Section 2.2) that in order to obtain the actual number of samples coming from $M$, the length of the simulated Markov chain should be multiplied by approximately $1-\alpha$. In order to make this observation more quantitative, let us denote by $m_{\text{act}}$ the number of samples coming from $M$. Thus, if we simulate $\mathcal{L}_{\alpha}(M)$ for $m\in\mathbb{N}$ steps, then $m_{\text{act}}\sim\text{Binomial}(m,1-\alpha)$. In particular, $\mathbb{E}[m_{\text{act}}]=(1-\alpha)m$ and for every $\rho>0$, by Hoeffding's inequality,

$$\mathbb{P}\left(m_{\text{act}}\geq(1-\alpha+\rho)m\right)\leq e^{-2\rho^{2}m}.$$
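
For concreteness, here is a numeric instance of the bound (our example, with arbitrarily chosen values of $\alpha$, $m$ and $\rho$): taking $\alpha=1/2$, $m=10{,}000$ and $\rho=0.05$ gives

$$\mathbb{E}[m_{\text{act}}]=5{,}000\quad\text{and}\quad\mathbb{P}\left(m_{\text{act}}\geq 5{,}500\right)\leq e^{-2\cdot(0.05)^{2}\cdot 10{,}000}=e^{-50}.$$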

4.1 The sample complexity of some extended solutions

The bounds given in this section are for the simulated $\alpha$-lazy Markov chain. The observations made in Remark 4.1 may be used to calculate the actual sample size from the original Markov chain.

Lemma 4.2 (Example 3.2(a)).

Let $M\in\mathcal{M}_{d}^{\textnormal{irr}}$, $\varepsilon,\delta\in(0,1)$ and $\mu\in\Delta_{d}$. Then there exist a universal constant $c$ and an estimator $\hat{\theta}$ for $\pi_{\star}(M)$ such that if

$$m\geq\frac{c}{(1-\alpha)^{2}\varepsilon^{2}\pi_{\star}(M)}\cdot\frac{\ln\frac{1}{\pi_{\star}(M)\delta}}{\gamma_{\text{ps}}(\mathcal{L}_{\alpha}(M))}$$

then, upon receiving a sequence $(X_{1},\ldots,X_{m})\sim(M,\mu)$, it holds that

$$\left|\frac{\hat{\theta}(M)}{\pi_{\star}(M)}-1\right|<\varepsilon,$$

with probability at least $1-\delta$.

Lemma 4.3 (Example 3.2(b)).

Let $M\in\mathcal{M}_{d}^{\textnormal{irr}}$, $\varepsilon,\delta\in(0,1)$ and $\mu\in\Delta_{d}$. Let $\pi$ be the stationary distribution of $M$. Then there exist a universal constant $c$ and an estimator $\hat{\theta}$ for $M$ such that if

$$m\geq c\max\left\{\frac{1}{(1-\alpha)^{2}\varepsilon^{2}\pi_{\star}(M)}\max\left\{d,\ln\frac{1}{(1-\alpha)\varepsilon\delta}\right\},\ \frac{\ln\frac{d\|\mu/\pi\|_{2,\pi}}{\delta}}{\gamma_{\text{ps}}(\mathcal{L}_{\alpha}(M))\pi_{\star}(M)}\right\}$$

then, upon receiving a sequence $(X_{1},\ldots,X_{m})\sim(M,\mu)$, it holds that

$$\|\hat{\theta}(M)-M\|_{\infty}<\varepsilon,$$

with probability at least $1-\delta$.

Lemma 4.4 (Example 3.6).

Let $\varepsilon,\delta\in(0,1)$, $M,\overline{M}\in\mathcal{M}_{d}^{\textnormal{irr}}$ and $\mu\in\Delta_{d}$. There exist a universal constant $c$ and an identity tester $\hat{\tau}$ such that if

$$m\geq\frac{c}{\pi_{\star}(\overline{M})}\max\left\{\frac{\sqrt{d}}{(1-\alpha)^{2}\varepsilon^{2}}\ln\frac{d}{\delta(1-\alpha)\varepsilon},\ t_{\mathsf{mix}}(\mathcal{L}_{\alpha}(\overline{M}))\ln\frac{d}{\delta\pi_{\star}(\overline{M})}\right\}$$

then, upon receiving a sequence $(X_{1},\ldots,X_{m})\sim(M,\mu)$ and $\overline{M}$, it holds that

$$M=\overline{M}\Longrightarrow\hat{\tau}(M,\overline{M})=0\quad\text{and}\quad\|M-\overline{M}\|_{\infty}>\varepsilon\Longrightarrow\hat{\tau}(M,\overline{M})=1,$$

with probability at least $1-\delta$.

4.2 Bounds on the mixing time and the pseudo spectral gap of the $\alpha$-lazy version

A natural question that arises is by how much the sample complexity can increase if we unnecessarily move to the $\alpha$-lazy version, i.e., if the Markov chain is already ergodic. The following lemma, which bounds the mixing time of the $\alpha$-lazy version in terms of $\alpha$ and the mixing time of the original Markov chain, seems to be folklore. For completeness, we give its proof in Section 5.2 of the appendix.

Lemma 4.5.

Let $M\in\mathcal{M}_{d}^{\textnormal{erg}}$ and let $\varepsilon\in(0,1)$. Then

$$t_{\mathsf{mix}}(\mathcal{L}_{\alpha}(M),\varepsilon)\leq\max\left\{\frac{2\ln\frac{2}{\varepsilon}}{(1-\alpha)^{2}},\ \frac{2}{1-\alpha}t_{\mathsf{mix}}(M,\varepsilon/2)\right\}.$$

In particular,

$$t_{\mathsf{mix}}(\mathcal{L}_{\alpha}(M))\leq\max\left\{\frac{5}{(1-\alpha)^{2}},\ \frac{6}{1-\alpha}t_{\mathsf{mix}}(M)\right\}.$$
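
The bound is easy to probe numerically on a small example. The following sanity check is ours (the chain and $\alpha$ are arbitrary choices, and `stationary_distribution` is the sketch from Section 2.1); it computes mixing times by brute force from the definition:

```python
import numpy as np

def t_mix(M, eps=0.25, t_max=10_000):
    """Smallest t with max_i ||M^t(i, .) - pi||_TV <= eps."""
    pi = stationary_distribution(M)
    P = np.eye(len(M))
    for t in range(1, t_max + 1):
        P = P @ M
        if 0.5 * np.abs(P - pi).sum(axis=1).max() <= eps:
            return t
    raise RuntimeError("did not mix within t_max steps")

alpha = 0.5
M = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.9, 0.1],
              [0.1, 0.0, 0.9]])  # ergodic but not reversible
L = alpha * np.eye(3) + (1 - alpha) * M
bound = max(5 / (1 - alpha) ** 2, 6 / (1 - alpha) * t_mix(M))
print(t_mix(L), "<=", bound)
```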

A similar result to Lemma 4.5 seems to hold for the pseudo spectral gap:

Conjecture 4.6.

There exists a function $f\colon(0,1)^{2}\to\mathbb{R}_{+}$ such that for every $M\in\mathcal{M}_{d}^{\textnormal{erg}}$ it holds that $\gamma_{\text{ps}}(\mathcal{L}_{\alpha}(M))\geq f(\alpha,\gamma_{\text{ps}}(M))$.

As a consequence of Theorem 1.4, whose proof we now give, we establish Conjecture 4.6 for a certain class that contains all ergodic and reversible Markov chains.

Proof of Theorem 1.4

Let $\pi$ be the stationary distribution of $M$. Denote

$$\mathcal{F}=\left\{f\in\mathbb{R}^{d}\ \middle|\ \sum_{i=1}^{d}f(i)\pi(i)=1\text{ and }\sum_{i=1}^{d}f(i)^{2}=1\right\}.$$

By Levin and Peres (2017, Lemmas 13.11 and 13.12) and noticing that $\gamma(I)=0$, we have

$$\begin{aligned}\gamma(\mathcal{L}_{\alpha}(M)^{\dagger})&=\min_{f\in\mathcal{F}}\frac{1}{2}\sum_{i,j\in[d]}(f(i)-f(j))^{2}\pi(i)\,\mathcal{L}_{\alpha}(M)^{\dagger}(i,j)\\&=\min_{f\in\mathcal{F}}\frac{1}{2}\sum_{i,j\in[d]}(f(i)-f(j))^{2}\pi(i)\left(\alpha^{2}I+\alpha(1-\alpha)(M+M^{*})+(1-\alpha)^{2}M^{\dagger}\right)(i,j)\\&\geq 2\alpha(1-\alpha)\gamma\left(\tfrac{1}{2}(M+M^{*})\right)+(1-\alpha)^{2}\gamma(M^{\dagger}). \qquad (7)\end{aligned}$$

It is well known (e.g. Marshall et al. (1979, Theorem 9.F.4)) that

$$\left(\lambda_{2}\left(\tfrac{1}{2}(M+M^{*})\right)\right)^{2}\leq\lambda_{2}(M^{\dagger}).$$

Thus,

$$\gamma(M^{\dagger})\leq\gamma\left(\tfrac{1}{2}(M+M^{*})\right)\left(1+\lambda_{2}\left(\tfrac{1}{2}(M+M^{*})\right)\right)\leq 2\gamma\left(\tfrac{1}{2}(M+M^{*})\right).$$

It follows that

$$(7)\geq\alpha(1-\alpha)\gamma(M^{\dagger})+(1-\alpha)^{2}\gamma(M^{\dagger})=(1-\alpha)\gamma(M^{\dagger}).$$

∎

Corollary 4.7.

Let $M\in\mathcal{M}_{d}^{\textnormal{erg}}$. Suppose that $\gamma_{\text{ps}}(M)$ is obtained at $k=1$ (cf. (4)). Then

$$\gamma_{\text{ps}}(\mathcal{L}_{\alpha}(M))\geq(1-\alpha)\gamma_{\text{ps}}(M).$$

This holds, in particular, if $M$ is reversible.

Proof.

It holds that

$$\gamma_{\text{ps}}(\mathcal{L}_{\alpha}(M))\geq\gamma(\mathcal{L}_{\alpha}(M)^{\dagger})\geq(1-\alpha)\gamma(M^{\dagger})=(1-\alpha)\gamma_{\text{ps}}(M).$$

That $\gamma_{\text{ps}}(M)$ is obtained at $k=1$ if $M$ is reversible was shown in Wolfer and Kontorovich (2021, p. 18). ∎

Remark 4.8.

The class

$$\{M\in\mathcal{M}_{d}^{\textnormal{erg}}\mid\gamma_{\text{ps}}(M)\text{ is obtained at }k=1\},$$

of which ergodic and reversible Markov chains are a subclass, seems to be very large. Indeed, for $M\in\mathcal{M}_{d}^{\textnormal{erg}}$ it suffices that $\gamma(M^{\dagger})\geq\frac{1}{2}$. This follows from the more general observation that if $\gamma((M^{k_{0}})^{\dagger})\geq 1-\frac{1}{k_{0}+1}$ for some $k_{0}\in\mathbb{N}$, then $\gamma_{\text{ps}}(M)$ is obtained at some $k\leq k_{0}$.

Remark 4.9.

In the case of ergodic and reversible Markov chains a stronger result holds: By Paulin (2015, Theorem 3.3), one may replace (at the price of changing the universal constant) $\gamma_{\text{ps}}(\mathcal{L}_{\alpha}(M))$ with $\gamma(\mathcal{L}_{\alpha}(M))$ in Lemmas 4.2 and 4.3. Now,

$$\gamma(\mathcal{L}_{\alpha}(M))=1-\lambda_{2}(\mathcal{L}_{\alpha}(M))=1-(\alpha+(1-\alpha)\lambda_{2}(M))=(1-\alpha)\gamma(M).$$

5 Appendix

5.1 $\ell_{1}$-projection on $\Delta_{d}$

In Example 3.2(b) it was necessary to project matrices on $\mathcal{M}_{d}$ with respect to $\|\cdot\|_{\infty}$. Since in this norm each row is considered separately, this problem is equivalent to the problem of projecting each of the matrices' rows on $\Delta_{d}$ with respect to the $\ell_{1}$ norm. Thus, we need to solve (at most) $d$ optimization problems of the following form: For $x\in\mathbb{R}^{d}$ find an $\ell_{1}$-projection $P_{\Delta_{d}}(x)$ of $x$ on $\Delta_{d}$ (cf. Boyd et al. (2004, p. 397)):

$$P_{\Delta_{d}}(x)=\operatorname*{arg\,min}_{y\in\Delta_{d}}\{\|y-x\|_{1}\} \qquad (8)$$

where

$$\|z\|_{1}=\sum_{i=1}^{d}|z_{i}|,\quad\forall z=(z_{1},\ldots,z_{d})\in\mathbb{R}^{d}.$$

Notice that, in contrast to $\|\cdot\|_{p}$ for $p>1$, optimization problem (8) has, in general, infinitely many solutions, and we will understand $P_{\Delta_{d}}(x)$ as the set of all such solutions.

The following lemma seems to be well known, but we were not able to find a reference. In it, only parts (a) and (b) and the case $|S|\leq 1$ are relevant for our needs.

Lemma 5.1.

Let $x=(x_{1},\ldots,x_{d})\in\mathbb{R}^{d}$. Denote

$$S=\{i\in[d]\mid x_{i}<0\}\quad\text{and}\quad s=\sum_{i\in[d]\setminus S}x_{i}.$$

If $S=[d]$, choose any $y\in\Delta_{d}$. Otherwise, define $y=(y_{1},\ldots,y_{d})\in\Delta_{d}$ as follows: Set $y_{i}=0$ for every $i\in S$. Now, for every $i\in[d]\setminus S$:

  (a)

    If $s=1$, set $y_{i}=x_{i}$.

  (b)

    If $s<1$, choose any $y_{i}\geq x_{i}$ such that $\sum_{j\in[d]\setminus S}y_{j}=1$.

  (c)

    If $s>1$, choose any $y_{i}\leq x_{i}$ such that $\sum_{j\in[d]\setminus S}y_{j}=1$.

Then $y\in P_{\Delta_{d}}(x)$.

Proof.

First assume $S=[d]$ and let $y'=(y'_{1},\ldots,y'_{d})\in\Delta_{d}$. For each $i\in[d]$ denote $\varepsilon_{i}=y'_{i}-y_{i}$. Notice that $\sum_{i=1}^{d}\varepsilon_{i}=0$. It holds that

$$\sum_{i=1}^{d}|y'_{i}-x_{i}|=\sum_{i=1}^{d}(y'_{i}-x_{i})=\sum_{i=1}^{d}(y_{i}+\varepsilon_{i}-x_{i})=\sum_{i=1}^{d}(y_{i}-x_{i})+\sum_{i=1}^{d}\varepsilon_{i}=\sum_{i=1}^{d}|y_{i}-x_{i}|.$$

Now assume $S\neq[d]$. Let $z=(z_{1},\ldots,z_{d})\in\Delta_{d}$. We consider each of the three possibilities for $s$ separately:

  1. (a)

    Without loss, there exists k[d]k\in[d] such that xi<0x_{i}<0 for every 1ik1\leq i\leq k and xi0x_{i}\geq 0 for every k+1idk+1\leq i\leq d. It holds

    i=1d|zixi|=\displaystyle\sum_{i=1}^{d}|z_{i}-x_{i}|= i=1k|zixi|+i=k+1d|zixi|\displaystyle\sum_{i=1}^{k}|z_{i}-x_{i}|+\sum_{i=k+1}^{d}|z_{i}-x_{i}|
    \displaystyle\geq i=1k|0xi|+i=k+1d|xixi|\displaystyle\sum_{i=1}^{k}|0-x_{i}|+\sum_{i=k+1}^{d}|x_{i}-x_{i}|
    =\displaystyle= i=1k|yixi|+i=k+1d|yixi|\displaystyle\sum_{i=1}^{k}|y_{i}-x_{i}|+\sum_{i=k+1}^{d}|y_{i}-x_{i}|
    =\displaystyle= i=1d|yixi|.\displaystyle\sum_{i=1}^{d}|y_{i}-x_{i}|.
  2. (b)

    Without loss, there exist 1klmd1\leq k\leq l\leq m\leq d such that

    xi<0=yizi,\displaystyle x_{i}<0=y_{i}\leq z_{i}, 1ik\displaystyle\;\;\forall 1\leq i\leq k
    0zixiyi,\displaystyle 0\leq z_{i}\leq x_{i}\leq y_{i}, k+1il\displaystyle\;\;\forall k+1\leq i\leq l
    0xiziyi,\displaystyle 0\leq x_{i}\leq z_{i}\leq y_{i}, l+1im\displaystyle\;\;\forall l+1\leq i\leq m
    0xiyizi,\displaystyle 0\leq x_{i}\leq y_{i}\leq z_{i}, m+1id.\displaystyle\;\;\forall m+1\leq i\leq d.

    Then,

    i=1d|zixi|=\displaystyle\sum_{i=1}^{d}|z_{i}-x_{i}|= i=1d|yixi|+i=1k(ziyi)+i=k+1l((yizi)2(yixi))\displaystyle\sum_{i=1}^{d}|y_{i}-x_{i}|+\sum_{i=1}^{k}(z_{i}-y_{i})+\sum_{i=k+1}^{l}\left((y_{i}-z_{i})-2(y_{i}-x_{i})\right)-
    i=l+1m(yizi)+i=m+1d(ziyi).\displaystyle\hskip 56.9055pt\sum_{i=l+1}^{m}(y_{i}-z_{i})+\sum_{i=m+1}^{d}(z_{i}-y_{i}). (9)

    Thus, it suffices to show that

    \sum_{i=1}^{k}(z_{i}-y_{i})+\sum_{i=k+1}^{l}\left((y_{i}-z_{i})-2(y_{i}-x_{i})\right)-\sum_{i=l+1}^{m}(y_{i}-z_{i})+\sum_{i=m+1}^{d}(z_{i}-y_{i})\geq 0. (10)

    Indeed, recalling that $y_{i}=0$ for $1\leq i\leq k$, rearranging gives

    \begin{align}
    (10)&\iff\overbrace{\sum_{i=1}^{d}z_{i}}^{=1}-2\sum_{i=k+1}^{l}(z_{i}-x_{i})-\overbrace{\sum_{i=k+1}^{d}y_{i}}^{=1}\geq 0\nonumber\\
    &\iff\sum_{i=k+1}^{l}(x_{i}-z_{i})\geq 0\tag{11}
    \end{align}

    and, by assumption, $x_{i}\geq z_{i}$ for every $k+1\leq i\leq l$.

  (c) Similar to the previous case. ∎
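For concreteness, the construction of Lemma 5.1 is straightforward to implement. The following is a minimal sketch (assuming NumPy; the function name and the particular choices made in cases (b) and (c), among the infinitely many the lemma permits, are ours):

```python
import numpy as np

def l1_project_to_simplex(x):
    """One minimizer from P_Delta_d(x), following Lemma 5.1.

    The lemma allows infinitely many valid choices in cases (b) and (c);
    here we top up the first nonnegative coordinate (b) or rescale (c).
    """
    x = np.asarray(x, dtype=float)
    neg = x < 0
    if neg.all():                     # S = [d]: every point of the simplex is a minimizer
        return np.full_like(x, 1.0 / x.size)
    y = np.where(neg, 0.0, x)         # y_i = 0 for i in S
    s = y.sum()                       # s = sum of the nonnegative coordinates
    if s > 1:                         # case (c): shrink, keeping 0 <= y_i <= x_i
        y[~neg] /= s
    elif s < 1:                       # case (b): grow, keeping y_i >= x_i
        j = np.flatnonzero(~neg)[0]
        y[j] += 1.0 - s
    return y                          # case (a): s = 1 needs no change
```

For instance, on $x=(0.8,0.8)$ from the example above, this sketch returns $(0.5,0.5)$, one of the infinitely many minimizers at distance $0.6$.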

An ordinary triangle inequality argument would have introduced an additional factor of $2$ in the sample complexity in Example 3.2(b). The following lemma shows that this may be avoided:

Lemma 5.2.

Let $y=(y_{1},\ldots,y_{d})\in\Delta_{d}$ and let $x=(x_{1},\ldots,x_{d})\in\mathbb{R}^{d}$ be such that $\sum_{i=1}^{d}x_{i}=1$. Let $\varepsilon>0$ and assume that $\|x-y\|_{1}<\varepsilon$. Then there exists $x^{\prime}=(x^{\prime}_{1},\ldots,x^{\prime}_{d})\in P_{\Delta_{d}}(x)$ such that $\|x^{\prime}-y\|_{1}<\varepsilon$.

Proof.

If $x_{1},\ldots,x_{d}\geq 0$ then $x\in\Delta_{d}$ and we may take $x^{\prime}=x$. Otherwise, assume without loss of generality that there is $1\leq k\leq d$ such that $x_{i}<0$ for $1\leq i\leq k$ and $x_{i}\geq 0$ for $k+1\leq i\leq d$. Let $k+1\leq l\leq d$ be minimal such that $\sum_{i=1}^{l}x_{i}\geq 0$ (such an $l$ must exist since $\sum_{i=1}^{d}x_{i}=1$). Define $x^{\prime}=(x^{\prime}_{1},\ldots,x^{\prime}_{d})\in\Delta_{d}$ by

x^{\prime}_{i}=\begin{cases}0,&\text{if }1\leq i\leq l-1\\ \sum_{j=1}^{l}x_{j},&\text{if }i=l\\ x_{i},&\text{if }l+1\leq i\leq d.\end{cases}

It holds

\begin{align*}
\sum_{i=1}^{d}|x^{\prime}_{i}-y_{i}| &= \sum_{i=1}^{l-1}y_{i}+\Big|\sum_{i=1}^{l}x_{i}-y_{l}\Big|+\sum_{i=l+1}^{d}|x_{i}-y_{i}|\\
&\leq \sum_{i=1}^{l-1}y_{i}-\sum_{i=1}^{l-1}x_{i}+\sum_{i=l}^{d}|x_{i}-y_{i}|\\
&\leq \sum_{i=1}^{d}|x_{i}-y_{i}|<\varepsilon,
\end{align*}

where the first inequality uses $|\sum_{i=1}^{l}x_{i}-y_{l}|\leq|x_{l}-y_{l}|-\sum_{i=1}^{l-1}x_{i}$, which holds since $\sum_{i=1}^{l-1}x_{i}<0$ by the minimality of $l$. Moreover, $x^{\prime}\in P_{\Delta_{d}}(x)$, since $x^{\prime}$ is one of the choices permitted by part (c) of Lemma 5.1 (here $s=1-\sum_{i\in S}x_{i}>1$). ∎
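In code, the repair step of the proof reads as follows (a minimal sketch, assuming NumPy; the function name is ours, and the proof's relabeling is made explicit by a stable sort that moves the negative coordinates to the front):

```python
import numpy as np

def repair_to_projection(x):
    """Lemma 5.2 construction: x has coordinates summing to 1, possibly
    negative. Relabel so the negative coordinates come first (the proof's
    WLOG), zero everything before the minimal l whose prefix sum is
    nonnegative, put that prefix sum on coordinate l, keep the tail."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x >= 0, kind="stable")  # negatives first
    z = x[order]
    cums = np.cumsum(z)
    l = np.flatnonzero(cums >= 0)[0]           # exists since the total sum is 1
    zp = np.zeros_like(z)
    zp[l] = cums[l]                            # sum_{j <= l} of the relabeled x
    zp[l + 1:] = z[l + 1:]
    out = np.empty_like(x)
    out[order] = zp                            # undo the relabeling
    return out
```

For example, on $x=(-0.2,0.1,0.5,0.6)$ the prefix sums are $(-0.2,-0.1,0.4,1.0)$, so $l=3$ (zero-indexed $l=2$) and the sketch returns $(0,0,0.4,0.6)\in\Delta_{4}$.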

5.2 Proof of Lemma 4.5

Proof.

Let $\pi$ be the stationary distribution of $M$ and suppose $(X_{t})_{t\in\mathbb{N}}\sim(M,\mu)$. Let $(B_{t})_{t\in\mathbb{N}}$ be a sequence of i.i.d. random variables, independent of $(X_{t})_{t\in\mathbb{N}}$, such that $B_{t}\sim\text{Bernoulli}(1-\alpha)$ for each $t\in\mathbb{N}$. For $t\geq 0$ define $N_{t}=\sum_{i=1}^{t}B_{i}$ and $X^{\prime}_{t}=X_{N_{t}}$. Clearly, $(X^{\prime}_{t})_{t\in\mathbb{N}}\sim(\mathcal{L}_{\alpha}(M),\mu)$.

For $i\in[d]$ and $t_{0}\in\mathbb{N}$ to be determined later, we have for $t\geq t_{0}$:

\begin{align*}
\|(\mathcal{L}_{\alpha}(M))^{t}(i,\cdot)-\pi\|_{\textnormal{TV}} &= \Big\|\sum_{n=0}^{t}\mathbb{P}(N_{t}=n)M^{n}(i,\cdot)-\pi\Big\|_{\textnormal{TV}}\\
&\leq \sum_{n=0}^{t_{0}}\mathbb{P}(N_{t}=n)\|M^{n}(i,\cdot)-\pi\|_{\textnormal{TV}}+\sum_{n=t_{0}+1}^{t}\mathbb{P}(N_{t}=n)\|M^{n}(i,\cdot)-\pi\|_{\textnormal{TV}}\\
&\leq \mathbb{P}(N_{t}\leq t_{0})+\max_{n>t_{0}}\{\|M^{n}(i,\cdot)-\pi\|_{\textnormal{TV}}\}.
\end{align*}

Consider the second term in the last expression: Since the total variation distance to stationarity is monotone decreasing in $n$ (cf. Lalley (2009, Proposition 7)), we have

\max_{n>t_{0}}\{\|M^{n}(i,\cdot)-\pi\|_{\textnormal{TV}}\}=\|M^{t_{0}+1}(i,\cdot)-\pi\|_{\textnormal{TV}}.

Thus, for $t_{0}\geq t_{\mathsf{mix}}(M,\varepsilon/2)$ it holds

\max_{n>t_{0}}\{\|M^{n}(i,\cdot)-\pi\|_{\textnormal{TV}}\}\leq\varepsilon/2.

Now consider the first term: $N_{t}$ is a sum of $t$ i.i.d. $\text{Bernoulli}(1-\alpha)$ random variables, so by Hoeffding's inequality, if $t_{0}=(1-\alpha-\delta)t$ for some $\delta>0$ then

\mathbb{P}(N_{t}\leq(1-\alpha-\delta)t)\leq e^{-2\delta^{2}t}.

Taking $\delta=(1-\alpha)/2$, so that $t=2t_{0}/(1-\alpha)$, we obtain

\mathbb{P}(N_{2t_{0}/(1-\alpha)}\leq t_{0})\leq e^{-(1-\alpha)t_{0}}.

It follows that if $t_{0}\geq\frac{\ln\frac{2}{\varepsilon}}{1-\alpha}$ then $\mathbb{P}(N_{2t_{0}/(1-\alpha)}\leq t_{0})\leq\varepsilon/2$. Thus, we may take

t_{0}=\max\left\{\frac{\ln\frac{2}{\varepsilon}}{1-\alpha},\;t_{\mathsf{mix}}(M,\varepsilon/2)\right\}.

The last assertion follows from Levin and Peres (2017, (4.36) on p. 55). ∎
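The coupling used in this proof doubles as a recipe for simulating a trajectory of $\mathcal{L}_{\alpha}(M)$ from a black-box trajectory of $M$, and the bound just derived yields an explicit time at which the lazy chain is $\varepsilon$-mixed. The following is a minimal sketch (assuming NumPy; the function names, and the reading of the proof's conclusion as the time $t=2t_{0}/(1-\alpha)$, are ours):

```python
import math
import numpy as np

def lazy_trajectory(base_trajectory, alpha, length, rng=None):
    """Simulate X'_0, ..., X'_{length-1} from the alpha-lazy chain
    L_alpha(M) via the coupling in the proof: X'_t = X_{N_t}, where
    N_t counts the successes among t i.i.d. Bernoulli(1 - alpha) flips.
    base_trajectory must contain at least `length` states of M."""
    rng = np.random.default_rng(rng)
    moves = rng.random(length - 1) < (1.0 - alpha)  # B_1, ..., B_{length-1}
    n = np.concatenate(([0], np.cumsum(moves)))     # N_0 = 0, N_t = B_1 + ... + B_t
    return np.asarray(base_trajectory)[n]

def lazy_mixing_bound(t_mix_half_eps, alpha, eps):
    """Time t at which the proof guarantees
    ||(L_alpha(M))^t(i, .) - pi||_TV <= eps, namely t = 2 t_0 / (1 - alpha)
    with t_0 = max(ln(2/eps) / (1 - alpha), t_mix(M, eps/2))."""
    t0 = max(math.log(2.0 / eps) / (1.0 - alpha), t_mix_half_eps)
    return math.ceil(2.0 * t0 / (1.0 - alpha))
```

Here `t_mix_half_eps` stands for $t_{\mathsf{mix}}(M,\varepsilon/2)$, which is assumed to be known or bounded by other means.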

References

  • Batu et al. (2001) T. Batu, E. Fischer, L. Fortnow, R. Kumar, R. Rubinfeld, and P. White. Testing random variables for independence and identity. In Proceedings 42nd IEEE Symposium on Foundations of Computer Science, pages 442–451. IEEE, 2001.
  • Boyd and Vandenberghe (2004) S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, 2004.
  • Bui and Sohier (2007) A. Bui and D. Sohier. How to compute times of random walks based distributed algorithms. Fundamenta Informaticae, 80(4):363–378, 2007.
  • Chan et al. (2021) S. O. Chan, Q. Ding, and S. H. Li. Learning and testing irreducible Markov chains via the $k$-cover time. In Algorithmic Learning Theory, pages 458–480. PMLR, 2021.
  • Cherapanamjeri and Bartlett (2019) Y. Cherapanamjeri and P. L. Bartlett. Testing symmetric Markov chains without hitting. In Conference on Learning Theory, pages 758–785. PMLR, 2019.
  • Daskalakis et al. (2018) C. Daskalakis, N. Dikkala, and N. Gravin. Testing symmetric Markov chains from a single trajectory. In Conference On Learning Theory, pages 385–409. PMLR, 2018.
  • Ding et al. (2011) J. Ding, J. R. Lee, and Y. Peres. Cover times, blanket times, and majorizing measures. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pages 61–70, 2011.
  • Feige and Rabinovich (2003) U. Feige and Y. Rabinovich. Deterministic approximation of the cover time. Random Structures & Algorithms, 23(1):1–22, 2003.
  • Feige and Zeitouni (2009) U. Feige and O. Zeitouni. Deterministic approximation for the cover time of trees. arXiv preprint arXiv:0909.2005, 2009.
  • Fried and Wolfer (2021) S. Fried and G. Wolfer. Identity testing of reversible Markov chains. arXiv preprint arXiv:2105.06347, 2021.
  • Han et al. (2015) Y. Han, J. Jiao, and T. Weissman. Minimax estimation of discrete distributions. In 2015 IEEE International Symposium on Information Theory (ISIT), pages 2291–2295. IEEE, 2015.
  • Hao et al. (2018) Y. Hao, A. Orlitsky, and V. Pichapati. On learning Markov chains. arXiv preprint arXiv:1810.11754, 2018.
  • Hermon (2016) J. Hermon. Maximal inequalities and mixing times. PhD thesis, UC Berkeley, 2016.
  • Kamath et al. (2015) S. Kamath, A. Orlitsky, D. Pichapati, and A. T. Suresh. On learning distributions from their samples. In Conference on Learning Theory, pages 1066–1100. PMLR, 2015.
  • Lalley (2009) S. P. Lalley. Convergence rates of Markov chains, 2009.
  • Levin and Peres (2017) D. A. Levin and Y. Peres. Markov chains and mixing times, volume 107. American Mathematical Soc., 2017.
  • Marshall et al. (1979) A. W. Marshall, I. Olkin, and B. C. Arnold. Inequalities: theory of majorization and its applications, volume 143. Springer, 1979.
  • Montenegro and Tetali (2006) R. Montenegro and P. Tetali. Mathematical aspects of mixing times in Markov chains. Foundations and Trends in Theoretical Computer Science, 1(3):237–354, 2006.
  • Orlitsky and Suresh (2015) A. Orlitsky and A. T. Suresh. Competitive distribution estimation: Why is Good-Turing good. In NIPS, pages 2143–2151, 2015.
  • Paulin (2015) D. Paulin. Concentration inequalities for Markov chains by Marton couplings and spectral methods. Electron. J. Probab., 20:32 pp., 2015. doi: 10.1214/EJP.v20-4039. URL https://doi.org/10.1214/EJP.v20-4039.
  • Valiant and Valiant (2017) G. Valiant and P. Valiant. An automatic inequality prover and instance optimal identity testing. SIAM Journal on Computing, 46(1):429–455, 2017.
  • Wolfer and Kontorovich (2019) G. Wolfer and A. Kontorovich. Estimating the mixing time of ergodic Markov chains. In A. Beygelzimer and D. Hsu, editors, Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 3120–3159, Phoenix, USA, 25–28 Jun 2019. PMLR. URL http://proceedings.mlr.press/v99/wolfer19a.html.
  • Wolfer and Kontorovich (2020) G. Wolfer and A. Kontorovich. Minimax testing of identity to a reference ergodic Markov chain. In S. Chiappa and R. Calandra, editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 191–201, Online, 26–28 Aug 2020. PMLR. URL http://proceedings.mlr.press/v108/wolfer20a.html.
  • Wolfer and Kontorovich (2021) G. Wolfer and A. Kontorovich. Statistical estimation of ergodic Markov chain kernel over discrete state space. Bernoulli, 27(1):532–553, 02 2021. doi: 10.3150/20-BEJ1248. URL https://doi.org/10.3150/20-BEJ1248.