
Quantitative Convergence Rates for Stochastically Monotone Markov Chains

Takashi Kamihigashi  and  John Stachurski
Abstract.

For Markov chains and Markov processes exhibiting a form of stochastic monotonicity (larger current states shift transition probabilities upward in the sense of stochastic dominance), stability and ergodicity results can be obtained using order-theoretic mixing conditions. We complement these results by providing quantitative bounds on deviations between distributions. We also show that well-known total variation bounds can be recovered as a special case.

1. Introduction

Quantitative bounds on the distance between distributions generated by Markov chains have many applications in statistics and the natural and social sciences (see, e.g., [21, 17]). One approach uses total variation distance and exploits minorization conditions (see, e.g., [20, 11, 2]). Another branch of the literature bounds deviations using Wasserstein distance [6, 18, 19]. These bounds require some form of uniform continuity with respect to a metric on the state space.

In some applications, Markov chains lack both the minorization and continuity properties discussed above, making total variation and Wasserstein-type bounds difficult or impossible to apply. Fortunately, some of these models also possess valuable structure in the form of stochastic monotonicity. Such monotonicity can be exploited to obtain stability and ergodicity via order-theoretic versions of mixing conditions [5, 3, 8, 4, 12, 7, 13]. In this paper we complement these stability and ergodicity results by providing a theorem on quantitative bounds for stochastically monotone Markov chains.

There already exist several results that use stochastic monotonicity to bound the distributions generated by Markov chains [16, 9]. However, these bounds are typically stated in terms of total variation distance, which again requires traditional minorization conditions (as opposed to the order-theoretic mixing conditions discussed in the previous paragraph). In this paper, we aim to fully exploit monotonicity by instead bounding the Kolmogorov distance between distributions. This works well because the Kolmogorov distance respects the order structure on the state space.

Our main theorem is closely related to the total variation bound in Theorem 1 of [20], which is representative of existing work on total variation bounds and supplies a simple and elegant proof. The main difference between that theorem and the one presented below is that we use the Kolmogorov distance instead of the total variation distance and an order-theoretic mixing condition instead of a standard minorization condition. At the same time, it is possible to recover Theorem 1 of [20] from the result we present below by a particular choice of partial order (see Sections 4.1 and 4.4). Thus our work can be viewed as a generalization of existing total variation results.

2. Set Up

In this section we recall basic definitions and state some preliminary results.

2.1. Environment

Throughout this paper, $\mathbbm{X}$ is a Polish space, $\mathcal{B}$ is the collection of its Borel sets, and $\preceq$ is a closed partial order on $\mathbbm{X}$. The last statement means that the graph of $\preceq$, denoted by

$$\mathbbm{G} \coloneqq \{(x',x)\in\mathbbm{X}\times\mathbbm{X} : x'\preceq x\},$$

is closed under the product topology on $\mathbbm{X}\times\mathbbm{X}$. A map $h\colon\mathbbm{X}\to\mathbbm{R}$ is called increasing if $x\preceq x'$ implies $h(x)\leqslant h(x')$. We take $p\mathcal{B}$ to be the set of all probability measures on $\mathcal{B}$ and let $b\mathcal{B}$ be the set of bounded Borel measurable functions from $\mathbbm{X}$ to $\mathbbm{R}$. The symbol $ib\mathcal{B}$ denotes the set of all increasing $h\in b\mathcal{B}$.

Given $\mu,\nu$ in $p\mathcal{B}$, we say that $\mu$ is stochastically dominated by $\nu$, and write $\mu\preceq_s\nu$, if $\mu(h)\leqslant\nu(h)$ for all $h\in ib\mathcal{B}$. In addition, we set

$$\kappa(\mu,\nu) \coloneqq \sup\left\{\left|\int h\,\mathrm{d}\mu-\int h\,\mathrm{d}\nu\right| : h\in ib\mathcal{B} \text{ and } 0\leqslant h\leqslant 1\right\}, \qquad (1)$$

which corresponds to the Kolmogorov metric on $p\mathcal{B}$ [13, 10].
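When $\mathbbm{X}=\mathbbm{R}$ with the usual order, the supremum in (1) is attained over indicator functions of upper intervals, so $\kappa$ reduces to the classical Kolmogorov distance $\sup_t|F_\mu(t)-F_\nu(t)|$ between distribution functions. The following minimal sketch (ours, not from the paper; the sample inputs are hypothetical) estimates this quantity from samples:

```python
import numpy as np

def kolmogorov_distance(sample_mu, sample_nu):
    """Empirical Kolmogorov distance sup_t |F_mu(t) - F_nu(t)| computed from
    two samples of distributions on R (with the usual order)."""
    grid = np.sort(np.concatenate([sample_mu, sample_nu]))
    F_mu = np.searchsorted(np.sort(sample_mu), grid, side="right") / len(sample_mu)
    F_nu = np.searchsorted(np.sort(sample_nu), grid, side="right") / len(sample_nu)
    return float(np.abs(F_mu - F_nu).max())

# Example with hypothetical samples:
rng = np.random.default_rng(0)
print(kolmogorov_distance(rng.normal(size=10**5), rng.normal(0.5, 1.0, size=10**5)))
```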

A map $Q\colon\mathbbm{X}\times\mathcal{B}\to[0,1]$ is called a stochastic kernel on $(\mathbbm{X},\mathcal{B})$ if $x\mapsto Q(x,A)$ is measurable for each $A\in\mathcal{B}$ and $A\mapsto Q(x,A)$ is a probability measure on $\mathcal{B}$ for each $x\in\mathbbm{X}$. At times we use the symbol $Q_x$ to represent the distribution $Q(x,\cdot)$ at given $x$. A stochastic kernel $Q$ on $(\mathbbm{X},\mathcal{B})$ is called increasing if $Qh\in ib\mathcal{B}$ whenever $h\in ib\mathcal{B}$.

For a given stochastic kernel $Q$ on $(\mathbbm{X},\mathcal{B})$, we define the left and right Markov operators generated by $Q$ via

$$\mu Q(A) \coloneqq \int Q(x,A)\,\mu(\mathrm{d}x) \quad\text{and}\quad Qf(x) \coloneqq \int f(y)\,Q(x,\mathrm{d}y).$$

(The left Markov operator $\mu\mapsto\mu Q$ maps $p\mathcal{B}$ to itself, while the right Markov operator $f\mapsto Qf$ acts on bounded measurable functions.) A discrete-time $\mathbbm{X}$-valued stochastic process $(X_t)_{t\geqslant 0}$ on a filtered probability space $(\Omega,\mathscr{F},\mathbbm{P},(\mathscr{F}_t)_{t\geqslant 0})$ is called Markov-$(Q,\mu)$ if $X_0\sim\mu$ and

$$\mathbbm{E}[h(X_{t+1})\,|\,\mathscr{F}_t] = Qh(X_t) \quad\text{with probability one, for all } t\geqslant 0 \text{ and } h\in b\mathcal{B}.$$
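In computational terms, a Markov-$(Q,\mu)$ process is generated by repeatedly sampling from the kernel. Below is a minimal generic sketch (ours; the name `kernel_draw` and its signature are assumptions for illustration, not notation from the paper):

```python
import numpy as np

def simulate_path(kernel_draw, x0, T, rng):
    """Generate X_0, ..., X_T for a Markov-(Q, delta_{x0}) process, where
    kernel_draw(x, rng) returns a single draw from Q(x, .)."""
    path = [x0]
    for _ in range(T):
        path.append(kernel_draw(path[-1], rng))
    return np.array(path)

# Example: the AR(1)-type kernel Q(x, .) = N(0.5 x, 1)
rng = np.random.default_rng(0)
path = simulate_path(lambda x, r: 0.5 * x + r.normal(), x0=0.0, T=10, rng=rng)
```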

2.2. Couplings

A coupling of $(\mu,\nu)\in p\mathcal{B}\times p\mathcal{B}$ is a probability measure $\rho$ on $\mathcal{B}\otimes\mathcal{B}$ satisfying $\rho(A\times\mathbbm{X})=\mu(A)$ and $\rho(\mathbbm{X}\times A)=\nu(A)$ for all $A\in\mathcal{B}$. Let $\mathscr{C}(\mu,\nu)$ denote the set of all couplings of $(\mu,\nu)$ and let

$$\alpha(\mu,\nu) \coloneqq \sup_{\rho\in\mathscr{C}(\mu,\nu)} \rho(\mathbbm{G}) \qquad ((\mu,\nu)\in p\mathcal{B}\times p\mathcal{B}). \qquad (2)$$

The value $\alpha(\mu,\nu)$ lies in $[0,1]$ and can be understood as a measure of “partial stochastic dominance” of $\nu$ over $\mu$ [14]. By the Polish assumption and Strassen’s theorem [22, 15], we have

$$\alpha(\mu,\nu)=1 \quad\text{whenever}\quad \mu\preceq_s\nu. \qquad (3)$$

Let $Q$ be a stochastic kernel on $(\mathbbm{X},\mathcal{B})$ and let $\hat{Q}$ be a stochastic kernel on $(\mathbbm{X}\times\mathbbm{X},\mathcal{B}\otimes\mathcal{B})$. We call $\hat{Q}$ a Markov coupling of $Q$ if $\hat{Q}_{(x,x')}$ is a coupling of $Q_x$ and $Q_{x'}$ for all $x,x'\in\mathbbm{X}$. We call $\hat{Q}$ a $\preceq$-maximal Markov coupling of $Q$ if $\hat{Q}$ is a Markov coupling of $Q$ and, in addition,

$$\hat{Q}((x,x'),\mathbbm{G}) = \alpha(Q_x,Q_{x'}) \quad\text{for all } (x,x')\in\mathbbm{X}\times\mathbbm{X}. \qquad (4)$$
Lemma 2.1.

For any stochastic kernel $Q$ on $(\mathbbm{X},\mathcal{B})$, there exists a $\preceq$-maximal Markov coupling of $Q$.

Proof.

By Theorem 1.1 of [23], given lower semicontinuous $\varphi\colon\mathbbm{X}\times\mathbbm{X}\to\mathbbm{R}$, there exists a stochastic kernel $\hat{Q}$ on $(\mathbbm{X}\times\mathbbm{X},\mathcal{B}\otimes\mathcal{B})$ such that $\hat{Q}$ is a Markov coupling of $Q$ and, in addition,

$$(\hat{Q}\varphi)(x,x') = \inf\left\{\int\varphi\,\mathrm{d}\rho : \rho\in\mathscr{C}(Q_x,Q_{x'})\right\}.$$

Since $\mathbbm{G}$ is closed, the function $\varphi = 1-\mathbbm{1}_{\mathbbm{G}}$ is lower semicontinuous, so this result applies with that choice of $\varphi$. Since $\hat{Q}_{(x,x')}$ and $\rho$ are probability measures, we then have

$$\hat{Q}((x,x'),\mathbbm{G}) = \sup\left\{\rho(\mathbbm{G}) : \rho\in\mathscr{C}(Q_x,Q_{x'})\right\}.$$

Thus, $\hat{Q}$ is a $\preceq$-maximal Markov coupling of $Q$. ∎

2.3. Drift

Consider the geometric drift condition

$$QV(x) \leqslant \lambda V(x)+\beta \quad\text{for all } x\in\mathbbm{X}, \qquad (5)$$

where $Q$ is a stochastic kernel on $(\mathbbm{X},\mathcal{B})$, $V$ is a measurable function from $\mathbbm{X}$ to $[1,\infty)$, and $\lambda$ and $\beta$ are nonnegative constants. We fix $d\geqslant 1$ and set

$$\gamma \coloneqq \lambda+\frac{2\beta}{d} \quad\text{and}\quad C \coloneqq \{x\in\mathbbm{X} : V(x)\leqslant d\}. \qquad (6)$$

Fix $\mu,\mu'$ in $p\mathcal{B}$ and set

$$H(\mu,\mu') \coloneqq \frac{1}{2}\left[\int V\,\mathrm{d}\mu + \int V\,\mathrm{d}\mu'\right]. \qquad (7)$$

Let $\hat{Q}$ be a Markov coupling of $Q$ and let $((X_t,X'_t))_{t\geqslant 0}$ be Markov-$(\hat{Q},\mu\times\mu')$ on $(\Omega,\mathscr{F},\mathbbm{P},(\mathscr{F}_t)_{t\geqslant 0})$. We are interested in studying the number of visits to $C\times C$, as given by

$$N_t \coloneqq \sum_{j=0}^{t}\mathbbm{1}\{(X_j,X'_j)\in C\times C\}.$$
Lemma 2.2.

If $Q$ satisfies the geometric drift condition (5), then, for all $t\in\mathbbm{N}$ and all $j\in\mathbbm{N}$ with $j\leqslant t$, we have

$$\mathbbm{P}\{N_t<j\} \leqslant \gamma^t d^{\,j-1} H(\mu,\mu').$$

The result in Lemma 2.2 has already been used in other sources. To make the paper more self-contained, we provide a proof, based on arguments in [20], in the appendix.

3. Convergence Rates

Let $V$ be a measurable function from $\mathbbm{X}$ to $[1,\infty)$ and let $Q$ be a stochastic kernel on $(\mathbbm{X},\mathcal{B})$ satisfying the geometric drift condition (5). Fix $d\geqslant 1$ and let $C$ and $\gamma$ be as defined in (6). Let $H(\mu,\mu')$ be as given in (7). Let

$$\varepsilon \coloneqq \inf\left\{\alpha(Q_x,Q_{x'}) : (x,x')\in C\times C\right\}. \qquad (8)$$

We now state the main result.

Theorem 3.1.

If $Q$ is increasing, then, for any $j,t\in\mathbbm{N}$ with $j\leqslant t$, we have

$$\kappa(\mu Q^t,\mu' Q^t) \leqslant (1-\varepsilon)^j + \gamma^t d^{\,j-1} H(\mu,\mu').$$
Proof.

Given $Q$ in Theorem 3.1, we let $\hat{Q}$ be a $\preceq$-maximal Markov coupling of $Q$ (existence of which follows from Lemma 2.1). Let $((X_t,X'_t))_{t\geqslant 0}$ be Markov-$(\hat{Q},\mu\times\mu')$ on $(\Omega,\mathscr{F},\mathbbm{P},(\mathscr{F}_t)_{t\geqslant 0})$. We observe that the graph $\mathbbm{G}$ of $\preceq$ is absorbing for $\hat{Q}$. Indeed, if $(x,x')\in\mathbbm{G}$, then, since $Q$ is increasing, $Q(x,\cdot)\preceq_s Q(x',\cdot)$. Hence, by (3), we have $\alpha(Q(x,\cdot),Q(x',\cdot))=1$. Applying (4) yields $\hat{Q}((x,x'),\mathbbm{G})=1$.

Let $\tau$ be the stopping time $\tau \coloneqq \inf\{t\geqslant 0 : X'_t\preceq X_t\}$, with $\inf\varnothing=\infty$. Since $\mathbbm{G}$ is absorbing for $\hat{Q}$, we have $X'_t\preceq X_t$ almost surely on the event $\{\tau\leqslant t\}$. Let $h$ be any element of $ib\mathcal{B}$ with $0\leqslant h\leqslant 1$. Since $((X_t,X'_t))_{t\geqslant 0}$ is Markov-$(\hat{Q},\mu\times\mu')$ and $\hat{Q}((x,x'),\cdot)$ is a coupling of $Q(x,\cdot)$ and $Q(x',\cdot)$, we have

$$(\mu' Q^t)(h)-(\mu Q^t)(h) = \mathbbm{E}\,h(X'_t)-\mathbbm{E}\,h(X_t) = \mathbbm{E}\,[h(X'_t)-h(X_t)]\,\mathbbm{1}\{X'_t\preceq X_t\} + \mathbbm{E}\,[h(X'_t)-h(X_t)]\,\mathbbm{1}\{X'_t\preceq X_t\}^c.$$

Since $h$ is increasing, this leads to

$$(\mu' Q^t)(h)-(\mu Q^t)(h) \leqslant \mathbbm{E}\,[h(X'_t)-h(X_t)]\,\mathbbm{1}\{X'_t\preceq X_t\}^c \leqslant \mathbbm{P}\left(\{X'_t\preceq X_t\}^c\right).$$

Since $\tau\leqslant t$ implies $X'_t\preceq X_t$, we have $\{X'_t\preceq X_t\}^c\subset\{\tau>t\}$, and hence

$$(\mu' Q^t)(h)-(\mu Q^t)(h) \leqslant \mathbbm{P}\{\tau>t\}. \qquad (9)$$

Recall from Section 2.3 that $N_t \coloneqq \sum_{j=0}^{t}\mathbbm{1}\{(X_j,X'_j)\in C\times C\}$ counts visits to $C\times C$. Fixing $j\in\mathbbm{N}$ with $j\leqslant t$, we have

$$\mathbbm{P}\{\tau>t\} = \mathbbm{P}\{\tau>t,\,N_t<j\} + \mathbbm{P}\{\tau>t,\,N_t\geqslant j\}. \qquad (10)$$

To bound the first term in (10), we set $W(x,x') \coloneqq [V(x)+V(x')]/2$. Since $\hat{Q}((x,x'),\cdot)$ is a coupling of $Q(x,\cdot)$ and $Q(x',\cdot)$, we have

$$\hat{Q}W(x,x') = \frac{QV(x)+QV(x')}{2} \leqslant \lambda W(x,x')+\beta. \qquad (11)$$

Hence, applying Lemma 2.2 to $\hat{Q}$ yields

$$\mathbbm{P}\{\tau>t,\,N_t<j\} \leqslant \mathbbm{P}\{N_t<j\} \leqslant \gamma^t d^{\,j-1} H(\mu,\mu'). \qquad (12)$$

Regarding the second term in (10), we claim that

$$\mathbbm{P}\{\tau>t,\,N_t\geqslant j\} \leqslant (1-\varepsilon)^j. \qquad (13)$$

To see this, let $(J_i)_{i\geqslant 1}$ be the times of the successive visits of $(X_t,X'_t)$ to $C\times C$. That is, $J_1$ is the time of the first visit and

$$J_{i+1} \coloneqq \inf\{m\geqslant J_i+1 : (X_m,X'_m)\in C\times C\} \quad\text{for } i\geqslant 1.$$

It is not difficult to see that $\{N_t>j\}\subset\{J_j\leqslant t-1\}$. As a result,

$$\mathbbm{P}\{\tau>t,\,N_t>j\} \leqslant \mathbbm{P}\{\tau>t,\,J_j+1\leqslant t\}. \qquad (14)$$

Consider the set $\{\tau>t,\,J_j+1\leqslant t\}$. If a path is in this set then, since $\tau>t$, we have $X'_s\npreceq X_s$ for every $s\leqslant t$. In addition, $J_i+1\leqslant J_j+1\leqslant t$ for any $i\leqslant j$, so $X'_{J_i+1}\npreceq X_{J_i+1}$ for every $i\leqslant j$.

$$\therefore\quad \mathbbm{P}\{\tau>t,\,J_j+1\leqslant t\} \leqslant \mathbbm{P}\bigcap_{i=1}^{j}\{X'_{J_i+1}\npreceq X_{J_i+1}\}. \qquad (15)$$

Observe that

$$\mathbbm{P}\bigcap_{i=1}^{j}\{X'_{J_i+1}\npreceq X_{J_i+1}\} = \mathbbm{E}\left[\mathbbm{1}\bigcap_{i=1}^{j-1}\{X'_{J_i+1}\npreceq X_{J_i+1}\}\;\mathbbm{P}[X'_{J_j+1}\npreceq X_{J_j+1}\,|\,\mathscr{F}_{J_j}]\right].$$

By the definition of $J_j$ we have $(X_{J_j},X'_{J_j})\in C\times C$. Using this fact, the strong Markov property, and the definition of $\hat{Q}$ (see (4)) yields

$$\mathbbm{P}[X'_{J_j+1}\preceq X_{J_j+1}\,|\,\mathscr{F}_{J_j}] = \hat{Q}((X'_{J_j},X_{J_j}),\mathbbm{G}) = \alpha(Q(X'_{J_j},\cdot),Q(X_{J_j},\cdot)).$$

Applying the definition of $\varepsilon$ in (8), we obtain $\mathbbm{P}[X'_{J_j+1}\npreceq X_{J_j+1}\,|\,\mathscr{F}_{J_j}]\leqslant 1-\varepsilon$, so

$$\mathbbm{P}\bigcap_{i=1}^{j}\{X'_{J_i+1}\npreceq X_{J_i+1}\} \leqslant (1-\varepsilon)\,\mathbbm{P}\bigcap_{i=1}^{j-1}\{X'_{J_i+1}\npreceq X_{J_i+1}\}.$$

Continuing to iterate backwards in this way yields $\mathbbm{P}\bigcap_{i=1}^{j}\{X'_{J_i+1}\npreceq X_{J_i+1}\}\leqslant(1-\varepsilon)^j$. Combining this inequality with (14) and (15) verifies (13).

Combining (9), (10), (12), and (13) yields

$$(\mu' Q^t)(h)-(\mu Q^t)(h) \leqslant (1-\varepsilon)^j + \gamma^t d^{\,j-1}\,H(\mu,\mu').$$

Reversing the roles of $\mu$ and $\mu'$ does not change the value on the right-hand side of this bound, and hence

$$|(\mu' Q^t)(h)-(\mu Q^t)(h)| \leqslant (1-\varepsilon)^j + \gamma^t d^{\,j-1}\,H(\mu,\mu')$$

also holds. Taking the supremum over all $h\in ib\mathcal{B}$ with $0\leqslant h\leqslant 1$ completes the proof of Theorem 3.1. ∎
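Since every quantity on the right-hand side of Theorem 3.1 is computable from the primitives, the bound can be evaluated directly and minimized over $j$. A small sketch of ours (the parameter values in the usage line are purely illustrative):

```python
import numpy as np

def kappa_bound(eps, gamma, d, H, t):
    """Right-hand side of Theorem 3.1, minimized over j in {1, ..., t}.

    eps   -- the order-mixing constant from (8)
    gamma -- lambda + 2*beta/d, from (6)
    d     -- the drift threshold (d >= 1)
    H     -- H(mu, mu') from (7)
    """
    j = np.arange(1, t + 1)
    return float(((1 - eps) ** j + gamma ** t * d ** (j - 1.0) * H).min())

# Illustrative values only: eps = 0.25, gamma = 0.9, d = 2, H = 5, t = 100
print(kappa_bound(0.25, 0.9, 2.0, 5.0, 100))
```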

4. Examples and Applications

In this section we consider some special cases, with a focus on (a) connections to the existing literature and (b) how to obtain an estimate of the value $\varepsilon$ in (8).

4.1. Connection to Total Variation Results

One special case is when $\preceq$ is the identity order, so that $x\preceq y$ if and only if $x=y$. For this order we have $ib\mathcal{B}=b\mathcal{B}$, so every stochastic kernel is increasing, and the Kolmogorov metric (see (1)) becomes the total variation distance. In this total variation setting, Theorem 3.1 is similar to standard geometric bounds for total variation distance, such as Theorem 1 of [20].

It is worth noting that, in the total variation setting, $\varepsilon$ in (8) is at least as large as the analogous term in Theorem 1 of [20]. Indeed, in [20], the value $\varepsilon$, which we now write as $\hat{\varepsilon}$ to avoid confusion, comes from an assumed minorization condition: there exists a $\nu\in p\mathcal{B}$ such that

$$\hat{\varepsilon}\,\nu(B) \leqslant Q(x,B) \quad\text{for all } B\in\mathcal{B} \text{ and } x\in C. \qquad (16)$$

To compare $\hat{\varepsilon}$ with the $\varepsilon$ defined in (8), suppose that this minorization condition holds and define the residual kernel $R(x,B) \coloneqq (Q(x,B)-\hat{\varepsilon}\,\nu(B))/(1-\hat{\varepsilon})$. Fixing $(x,x')\in C\times C$, we draw $(X,X')$ as follows: let $I$ be uniformly distributed on $[0,1]$, independent of all other randomness. If $I\leqslant\hat{\varepsilon}$, draw $X\sim\nu$ and set $X'=X$. Otherwise, independently draw $X\sim R(x,\cdot)$ and $X'\sim R(x',\cdot)$. Simple arguments confirm that $X$ is a draw from $Q(x,\cdot)$ and $X'$ is a draw from $Q(x',\cdot)$. Recalling that $\preceq$ is the identity order, this leads to $\hat{\varepsilon} = \mathbbm{P}\{I\leqslant\hat{\varepsilon}\} \leqslant \mathbbm{P}\{X=X'\} = \mathbbm{P}\{X\preceq X'\} \leqslant \alpha(Q(x,\cdot),Q(x',\cdot))$. (The last bound is by the definition of $\alpha$ in (2) and the fact that the joint distribution of $(X,X')$ is a coupling of $Q(x,\cdot)$ and $Q(x',\cdot)$.) Since, in this discussion, the point $(x,x')$ was arbitrarily chosen from $C\times C$, we conclude that $\hat{\varepsilon}\leqslant\varepsilon$, where $\varepsilon$ is as defined in (8).
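The splitting construction above is straightforward to implement. Here is a minimal sketch of ours for kernels on a finite state space, with `Qx`, `Qxp`, `nu`, and `eps_hat` as hypothetical inputs assumed to satisfy (16) with $\hat{\varepsilon}<1$:

```python
import numpy as np

rng = np.random.default_rng(0)

def split_draw(Qx, Qxp, nu, eps_hat):
    """Draw a coupled pair (X, X') with marginals Qx = Q(x, .) and
    Qxp = Q(x', .), given as probability vectors over {0, ..., n-1},
    using the minorization measure nu and constant eps_hat from (16).
    The pair coincides with probability at least eps_hat."""
    Rx  = (Qx  - eps_hat * nu) / (1 - eps_hat)   # residual kernel R(x, .)
    Rxp = (Qxp - eps_hat * nu) / (1 - eps_hat)   # residual kernel R(x', .)
    if rng.random() <= eps_hat:                  # the event {I <= eps_hat}
        x = rng.choice(len(nu), p=nu)
        return x, x                              # coupled: X' = X ~ nu
    return rng.choice(len(Rx), p=Rx), rng.choice(len(Rxp), p=Rxp)

# Example with hypothetical two-state kernels:
Qx, Qxp = np.array([0.7, 0.3]), np.array([0.4, 0.6])
nu, eps_hat = np.array([0.5, 0.5]), 0.4   # 0.4 * nu <= both Qx and Qxp
print(split_draw(Qx, Qxp, nu, eps_hat))
```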

4.2. Stochastic Recursive Sequences

The preceding section showed that Theorem 3.1 reduces to existing results on total variation bounds when the partial order $\preceq$ is the identity order. Now we show how Theorem 3.1 leads to new results in other settings, such as when $\preceq$ is a pointwise partial order. To this end, consider the process

$$X_{t+1} = F(X_t,W_{t+1}), \qquad (17)$$

where $(W_t)_{t\geqslant 1}$ is an iid shock process taking values in some space $\mathbb{W}$, and $F$ is a measurable function from $\mathbbm{X}\times\mathbb{W}$ to $\mathbbm{X}$. The common distribution of each $W_t$ is denoted by $\varphi$. We suppose that $F$ is increasing, in the sense that $x\preceq x'$ implies $F(x,w)\preceq F(x',w)$ for any fixed $w\in\mathbb{W}$. We let $Q$ be the stochastic kernel corresponding to (17), so that $Q(x,B)=\varphi\{w\in\mathbb{W} : F(x,w)\in B\}$ for all $x\in\mathbbm{X}$ and $B\in\mathcal{B}$. Since $F$ is increasing, the kernel $Q$ is increasing, and hence Theorem 3.1 applies. We can obtain a lower bound on $\varepsilon$ in (8) by calculating

$$e \coloneqq \inf\left\{\int\!\!\int \mathbbm{1}\{F(x',w')\preceq F(x,w)\}\,\varphi(\mathrm{d}w)\,\varphi(\mathrm{d}w') : (x,x')\in C\times C\right\}. \qquad (18)$$

Indeed, if $W$ and $W'$ are drawn independently from $\varphi$, then, for fixed $(x,x')\in C\times C$, $X=F(x,W)$ is a draw from $Q(x,\cdot)$ and $X'=F(x',W')$ is a draw from $Q(x',\cdot)$. Hence

$$e \leqslant \mathbbm{P}\{X'\preceq X\} \leqslant \alpha(Q(x,\cdot),Q(x',\cdot)) \quad\text{for all }(x,x')\in C\times C, \text{ and hence } e\leqslant\varepsilon. \qquad (19)$$
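When $\mathbbm{X}=\mathbbm{R}_+$ with the usual order and $C=[0,c]$, monotonicity of $F$ in its first argument implies that the infimum in (18) is attained at the extreme pair $x=0$, $x'=c$, so $e$ can be estimated by simulation. A sketch of ours under those assumptions (`F` and `draw_shock` are hypothetical inputs):

```python
import numpy as np

rng = np.random.default_rng(1)

def estimate_e(F, draw_shock, c, n=10**6):
    """Monte Carlo estimate of e in (18) for X = R_+ with the usual order and
    C = [0, c]: since F is increasing in x, the infimum is attained at x = 0,
    x' = c. draw_shock(n) returns n independent draws from phi."""
    w, wp = draw_shock(n), draw_shock(n)          # independent shock draws
    return float(np.mean(F(c, wp) <= F(0.0, w)))
```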

4.3. Example: TCP Window Size Process

To illustrate the method in Section 4.2, we consider the TCP window size process (see, e.g., [2]), which has embedded jump chain $X_{t+1}=a(X_t^2+2E_{t+1})^{1/2}$. Here $a\in(0,1)$ and $(E_t)$ is iid exponential with unit rate. If $C=[0,c]$, then, drawing $E,E'$ as independent standard exponentials and using (19),

$$\varepsilon \geqslant \inf_{0\leqslant x,y\leqslant c} \mathbbm{P}\{a\sqrt{y^2+2E'} \leqslant a\sqrt{x^2+2E}\} = \mathbbm{P}\{\sqrt{c^2+2E'} \leqslant \sqrt{2E}\}.$$

Since $E-E'$ has the Laplace-$(0,1)$ distribution, we get

$$\varepsilon \geqslant \mathbbm{P}\{2E\geqslant c^2+2E'\} = \mathbbm{P}\{E-E'\geqslant c^2/2\} = \frac{1}{2}\exp(-c^2/2).$$
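A quick Monte Carlo check of this closed form (a sketch of ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)

# Check the TCP coupling probability: for C = [0, c], the bound above gives
# P{sqrt(c^2 + 2E') <= sqrt(2E)} = exp(-c^2/2)/2 for independent unit exponentials.
c, n = 1.0, 10**6
E, Ep = rng.exponential(size=n), rng.exponential(size=n)
mc = np.mean(np.sqrt(c**2 + 2 * Ep) <= np.sqrt(2 * E))
print(mc, 0.5 * np.exp(-c**2 / 2))   # the two numbers should nearly agree
```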

4.4. Example: When Minorization Fails

We provide an elementary scenario where Theorem 3.1 provides a usable bound while the minorization-based methods described in Section 4.1 do not. Let $\mathbbm{Q}$ be the set of rational numbers, let $\mathbbm{X}=\mathbbm{R}$, and assume that

$$X_{t+1} = \frac{X_t}{2} + W_{t+1}, \quad\text{where } (W_t) \text{ is iid on } \{0,1\} \text{ with } \mathbbm{P}\{W_t=0\}=1/2.$$

Let $C$ contain at least one rational and one irrational number. Let $\mu$ be a measure on the Borel sets of $\mathbbm{R}$ obeying $\mu(B) \leqslant Q(x,B) = \mathbbm{P}\{x/2+W\in B\}$ for all $x\in C$ and Borel sets $B$. If $x$ is rational, then $x/2+W\in\mathbbm{Q}$ with probability one, so $\mu(\mathbbm{Q}^c) \leqslant Q(x,\mathbbm{Q}^c) = 0$. Similarly, if $x$ is irrational, then $x/2+W\in\mathbbm{Q}^c$ with probability one, so $\mu(\mathbbm{Q}) \leqslant Q(x,\mathbbm{Q}) = 0$. Hence $\mu$ is the zero measure on $\mathbbm{R}$, so we cannot find an $\hat{\varepsilon}>0$ and probability measure $\nu$ obeying the minorization condition (16). On the other hand, letting $V(x)=x+1$ and $d=2$, so that $C=\{V\leqslant 2\}=[0,1]$, the value $e$ from (18) obeys $e = \mathbbm{P}\{1/2+W'\leqslant W\} = \mathbbm{P}\{W-W'\geqslant 1/2\} = \frac{1}{4}$. Hence, by (19), the constant $\varepsilon$ in Theorem 3.1 is positive.
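For this chain, the laws started from a rational and an irrational point have disjoint supports, so the total variation distance between them remains 1 for all $t$, while the Kolmogorov distance contracts. A simulation sketch of ours that illustrates the contraction:

```python
import numpy as np

rng = np.random.default_rng(3)

# Two copies of X_{t+1} = X_t/2 + W_{t+1}, started from a rational and an
# irrational point. Their laws stay at total variation distance 1 (disjoint
# supports), yet the empirical Kolmogorov distance between them contracts.
n, T = 10**5, 10
X, Xp = np.zeros(n), np.full(n, np.sqrt(2))
for t in range(T):
    X  = X / 2 + rng.integers(0, 2, size=n)
    Xp = Xp / 2 + rng.integers(0, 2, size=n)
    grid = np.sort(np.concatenate([X, Xp]))
    F  = np.searchsorted(np.sort(X),  grid, side="right") / n
    Fp = np.searchsorted(np.sort(Xp), grid, side="right") / n
    print(t + 1, np.abs(F - Fp).max())   # empirical Kolmogorov distance at t+1
```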

4.5. Example: Wealth Dynamics

Many economic models examine wealth dynamics in the presence of credit market imperfections (see, e.g., [1]). These often result in dynamics of the form

$$X_{t+1} = \eta_{t+1}\,G(X_t)+\xi_{t+1}, \qquad (\eta_t)\stackrel{\text{iid}}{\sim}\varphi, \quad (\xi_t)\stackrel{\text{iid}}{\sim}\psi. \qquad (20)$$

Here $(X_t)$ is some measure of household wealth, $G$ is a function from $\mathbbm{R}_+$ to itself, and $(\eta_t)$ and $(\xi_t)$ are independent $\mathbbm{R}_+$-valued sequences. The function $G$ is increasing, since greater current wealth relaxes borrowing constraints and increases financial income. We assume that there exists a $\lambda<1$ such that $\mathbbm{E}\,\eta_t G(x) \leqslant \lambda x$ for all $x\in\mathbbm{R}_+$ and, in addition, that $\bar{\xi} \coloneqq \mathbbm{E}\,\xi_t < \infty$.

Let $Q$ be the stochastic kernel corresponding to (20). With $V(x)=x+1$, we have

$$QV(x) = \mathbbm{E}[\eta_{t+1}\,G(x)+\xi_{t+1}+1] \leqslant \lambda x+\bar{\xi}+1 \leqslant \lambda V(x)+\bar{\xi}+1. \qquad (21)$$

Fixing $d\geqslant 1$ and setting $C = \{V\leqslant d\} = [0,d-1]$, we can obtain $e$ in (18) via

$$e = \mathbbm{P}\{\eta' G(d-1)+\xi' \leqslant \eta G(0)+\xi\} \quad\text{when}\quad (\eta',\xi',\eta,\xi) \sim \varphi\times\psi\times\varphi\times\psi.$$

This term will be strictly positive under suitable conditions, such as when $\psi$ has sufficiently large support. Combining (19) and (21) with the bound in Theorem 3.1, we have, for any $\mu$ and $\mu'$ in $p\mathcal{B}$ and $j,t\in\mathbbm{N}$ with $j\leqslant t$,

$$\kappa(\mu Q^t,\mu' Q^t) \leqslant (1-e)^j + \left(\lambda+\frac{2(\bar{\xi}+1)}{d}\right)^{t} d^{\,j-1} H(\mu,\mu'),$$

where $H(\mu,\mu') = \frac{1}{2}\left(\int (x+1)\,\mu(\mathrm{d}x) + \int (x+1)\,\mu'(\mathrm{d}x)\right)$.
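The whole pipeline can be assembled numerically. Below is an end-to-end sketch of ours under assumed primitives (the choices of $G$, $\varphi$, $\psi$, $d$, and the placeholder value for $H$ are purely illustrative; any primitives satisfying the assumptions of this section would do):

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed primitives (ours): G(x) = min(x, 5), eta ~ Uniform(0, 1), xi ~ Exp(1).
# Then E[eta] G(x) <= x/2, so lambda = 1/2, and xi_bar = E[xi] = 1.
G = lambda x: np.minimum(x, 5.0)
lam, xi_bar, d = 0.5, 1.0, 10.0               # C = {V <= d} = [0, d - 1]
gamma = lam + 2 * (xi_bar + 1) / d            # here gamma = 0.9 < 1

# Monte Carlo estimate of e via (18), using the extreme pair x = 0, x' = d - 1
n = 10**6
eta, xi = rng.uniform(size=n), rng.exponential(size=n)
etap, xip = rng.uniform(size=n), rng.exponential(size=n)
e = np.mean(etap * G(d - 1) + xip <= eta * G(0.0) + xi)

# Evaluate the bound for t = 50, minimizing over j; H = 5 is a placeholder
# for H(mu, mu'), which would be computed from the initial distributions.
t, H = 50, 5.0
j = np.arange(1, t + 1)
print(((1 - e) ** j + gamma ** t * d ** (j - 1.0) * H).min())
```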

Notice that, for this model, the lack of smooth mixing and continuity implies that neither total variation nor Wasserstein distance bounds can be computed without additional assumptions.

References

  • [1] Antonio Antunes and Tiago Cavalcanti. Start up costs, limited enforcement, and the hidden economy. European Economic Review, 51:203–224, 2007.
  • [2] Jean-Baptiste Bardet, Alejandra Christen, Arnaud Guillin, Florent Malrieu, and Pierre-André Zitt. Total variation estimates for the TCP process. Electron. J. Probab, 18(10):1–21, 2013.
  • [3] Rabi Bhattacharya and Mukul Majumdar. On a theorem of Dubins and Freedman. Journal of Theoretical Probability, 12:1067–1087, 1999.
  • [4] Rabi Bhattacharya, Mukul Majumdar, and Nigar Hashimzade. Limit theorems for monotone Markov processes. Sankhya A, 72:170–190, 2010.
  • [5] Rabi N Bhattacharya and Oesook Lee. Asymptotics of a class of Markov processes which are not in general irreducible. The Annals of Probability, pages 1333–1347, 1988.
  • [6] Djalil Chafaï, Florent Malrieu, and Katy Paroux. On the long time behavior of the TCP window size process. Stochastic Processes and their Applications, 120(8):1518–1534, 2010.
  • [7] Sergey Foss, Vsevolod Shneer, Jonathan P Thomas, and Tim Worrall. Stochastic stability of monotone economies in regenerative environments. Journal of Economic Theory, 173:334–360, 2018.
  • [8] Serguei Foss and Takis Konstantopoulos. An overview of some stochastic stability methods. Journal of the Operations Research Society of Japan, 47(4):275–303, 2004.
  • [9] Julia Gaudio, Saurabh Amin, and Patrick Jaillet. Exponential convergence rates for stochastically ordered Markov processes with random initial conditions. arXiv preprint arXiv:1810.07732v1, 2018.
  • [10] Robert E Gaunt and Siqi Li. Bounding Kolmogorov distances through Wasserstein and related integral probability metrics. Journal of Mathematical Analysis and Applications, 522(1):126985, 2023.
  • [11] Yu Hang Jiang, Tong Liu, Zhiya Lou, Jeffrey S Rosenthal, Shanshan Shangguan, Fei Wang, and Zixuan Wu. The coupling/minorization/drift approach to Markov chain convergence rates. Notices of the American Mathematical Society, 68(4), 2021.
  • [12] Takashi Kamihigashi and John Stachurski. Stochastic stability in monotone economies. Theoretical Economics, 9(2):383–407, 2014.
  • [13] Takashi Kamihigashi and John Stachurski. A unified stability theory for classical and monotone Markov chains. Journal of Applied Probability, 56(1):1–22, 2019.
  • [14] Takashi Kamihigashi and John Stachurski. Partial stochastic dominance via optimal transport. Operations Research Letters, 48(5):584–586, 2020.
  • [15] Torgny Lindvall. Lectures on the coupling method. Dover, 2002.
  • [16] Robert B Lund, Sean P Meyn, and Richard L Tweedie. Computable exponential convergence rates for stochastically ordered Markov processes. The Annals of Applied Probability, 6(1):218–237, 1996.
  • [17] Ravi Montenegro and Prasad Tetali. Mathematical aspects of mixing times in Markov chains. Foundations and Trends in Theoretical Computer Science, 1(3):237–354, 2006.
  • [18] Qian Qin and James P Hobert. Geometric convergence bounds for Markov chains in Wasserstein distance based on generalized drift and contraction conditions. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 58(2):872–889, 2022.
  • [19] Yanlin Qu, Jose Blanchet, and Peter Glynn. Computable bounds on convergence of Markov chains in Wasserstein distance. arXiv preprint arXiv:2308.10341, 2023.
  • [20] Jeffrey S Rosenthal. Quantitative convergence rates of Markov chains: A simple account. Electronic Communications in Probability, 7:123–128, 2002.
  • [21] Jeffrey S Rosenthal. How Markov’s little idea transformed statistics. Handbook of the History and Philosophy of Mathematical Practice, pages 1–11, 2023.
  • [22] Volker Strassen. The existence of probability measures with given marginals. The Annals of Mathematical Statistics, 36(2):423–439, 1965.
  • [23] Shaoyi Zhang. Existence and application of optimal Markovian coupling with respect to non-negative lower semi-continuous functions. Acta Mathematica Sinica, 16(2):261–270, 2000.

Appendix A Proof of Lemma 2.2

Let the conditions of Lemma 2.2 hold and let $Q$, $\hat{Q}$, and $((X_t,X'_t))_{t\geqslant 0}$ be as described in Section 2.3. We assume in what follows that $H(\mu,\mu')$ is finite, since otherwise Lemma 2.2 is trivial. Setting $W(x,x') \coloneqq (V(x)+V(x'))/2$, we have

$$\hat{Q}W(x,x') = \frac{QV(x)+QV(x')}{2} \leqslant \lambda W(x,x')+\beta. \qquad (22)$$

Now define

$$M_t \coloneqq \gamma^{-t} d^{-N_{t-1}}\,W(X_t,X'_t) \quad\text{for } t\geqslant 0, \text{ with } N_{-1}\coloneqq 0.$$

We claim that $(M_t)_{t\geqslant 0}$ is an $(\mathscr{F}_t)$-supermartingale. Clearly $(M_t)$ is adapted. To see that $\mathbbm{E}[M_{t+1}\,|\,\mathscr{F}_t]\leqslant M_t$ holds $\mathbbm{P}$-almost surely (this inequality implies integrability of $M_t$, because then $\mathbbm{E}\,M_t\leqslant\mathbbm{E}\,M_0$, and $\mathbbm{E}\,M_0 = H(\mu,\mu')$ is finite by assumption), let $F_t \coloneqq \mathbbm{1}\{(X_t,X'_t)\in C\times C\}$ and $F_t^c \coloneqq 1-F_t$, so that

$$\mathbbm{E}[M_{t+1}\,|\,\mathscr{F}_t] = \mathbbm{E}[M_{t+1}F_t\,|\,\mathscr{F}_t] + \mathbbm{E}[M_{t+1}F_t^c\,|\,\mathscr{F}_t]. \qquad (23)$$

On $C\times C$ we have $W\leqslant d$, so, by (22), $\mathbbm{E}[W(X_{t+1},X'_{t+1})\,|\,\mathscr{F}_t] \leqslant \lambda d+\beta \leqslant d\gamma$. Therefore,

$$\mathbbm{E}[M_{t+1}F_t\,|\,\mathscr{F}_t] = \gamma^{-(t+1)} d^{-N_t}\,\mathbbm{E}[W(X_{t+1},X'_{t+1})\,|\,\mathscr{F}_t]\,F_t \leqslant \gamma^{-t} d^{-N_t+1} F_t.$$

Also, on $F_t$ we have $N_t = N_{t-1}+1$. Using this fact and $1\leqslant W$ yields

$$\mathbbm{E}[M_{t+1}F_t\,|\,\mathscr{F}_t] \leqslant \gamma^{-t} d^{-N_{t-1}} F_t \leqslant M_t F_t. \qquad (24)$$

Turning to the term $\mathbbm{E}[M_{t+1}F_t^c\,|\,\mathscr{F}_t]$, observe that on $F_t^c$ we have $W\geqslant d/2$, so, using (22) again,

$$\frac{\hat{Q}W}{W} \leqslant \lambda+\frac{\beta}{W} \leqslant \lambda+\frac{2\beta}{d} = \gamma.$$

Therefore, $\mathbbm{E}[W(X_{t+1},X'_{t+1})\,|\,\mathscr{F}_t]\,F_t^c \leqslant \gamma\,W(X_t,X'_t)\,F_t^c$. Combining this bound with the fact that $N_t = N_{t-1}$ on $F_t^c$ yields

$$\mathbbm{E}[M_{t+1}F_t^c\,|\,\mathscr{F}_t] = \gamma^{-(t+1)} d^{-N_t}\,\mathbbm{E}[W(X_{t+1},X'_{t+1})\,|\,\mathscr{F}_t]\,F_t^c \leqslant \gamma^{-t} d^{-N_{t-1}}\,W(X_t,X'_t)\,F_t^c = M_t F_t^c.$$

Together with (24) and (23), this inequality gives $\mathbbm{E}[M_{t+1}\,|\,\mathscr{F}_t]\leqslant M_t$, so $(M_t)$ is a supermartingale, as claimed.

Now fix $t\in\mathbbm{N}$ and $j\leqslant t$. Since $d\geqslant 1$, we have

$$\mathbbm{P}\{N_t<j\} \leqslant \mathbbm{P}\{N_{t-1}\leqslant j-1\} = \mathbbm{P}\{d^{-N_{t-1}} \geqslant d^{-j+1}\}.$$

From Chebyshev’s inequality, $1\leqslant W$, and the supermartingale property, the last term is dominated by

$$d^{\,j-1}\,\mathbbm{E}\,d^{-N_{t-1}} \leqslant \gamma^t d^{\,j-1}\,\mathbbm{E}\left[\gamma^{-t} d^{-N_{t-1}}\,W(X_t,X'_t)\right] = \gamma^t d^{\,j-1}\,\mathbbm{E}[M_t] \leqslant \gamma^t d^{\,j-1}\,\mathbbm{E}[M_0].$$

The last term is just $\gamma^t d^{\,j-1} H(\mu,\mu')$, so the claim in Lemma 2.2 is now proved. ∎
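As a numerical sanity check (ours, not part of the proof), one can simulate $(M_t)$ for a simple kernel with the independent Markov coupling and confirm that $t\mapsto\mathbbm{E}\,M_t$ is nonincreasing. Assumed primitives: $X_{t+1}=X_t/2+W_{t+1}$ with $(W_t)$ iid Exp(1) and $V(x)=x+1$, so (5) holds with $\lambda=1/2$ and $\beta=3/2$.

```python
import numpy as np

rng = np.random.default_rng(5)

# Sanity check of the supermartingale property of M_t. Assumed primitives:
# X_{t+1} = X_t/2 + shock, shock ~ Exp(1), V(x) = x + 1, lambda = 1/2,
# beta = 3/2, d = 4, independent coupling, both chains started from x = 0.
lam, beta, d = 0.5, 1.5, 4.0
gamma = lam + 2 * beta / d
n, T = 10**6, 20
X, Xp = np.zeros(n), np.zeros(n)
N_prev = np.zeros(n)                       # N_{t-1}, with N_{-1} = 0
for t in range(T + 1):
    Wpair = (X + 1 + Xp + 1) / 2           # W(X_t, X'_t)
    M = gamma ** (-t) * d ** (-N_prev) * Wpair
    print(t, M.mean())                     # E[M_t] should be nonincreasing in t
    N_prev = N_prev + ((X + 1 <= d) & (Xp + 1 <= d))
    X  = X / 2 + rng.exponential(size=n)
    Xp = Xp / 2 + rng.exponential(size=n)
```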