
Quantitative Convergence Rates for Stochastically Monotone Markov Chains

Takashi Kamihigashi  and  John Stachurski
Abstract.

For Markov chains and Markov processes exhibiting a form of stochastic monotonicity (larger current states shift transition probabilities upward in the sense of stochastic dominance), stability and ergodicity results can be obtained using order-theoretic mixing conditions. We complement these results by providing quantitative bounds on deviations between distributions. We also show that well-known total variation bounds can be recovered as a special case.

1. Introduction

Quantitative bounds on the distance between distributions generated by Markov chains have many applications in statistics and the natural and social sciences (see, e.g., [21, 17]). One approach uses total variation distance and exploits minorization conditions (see, e.g., [20, 11, 2]). Another branch of the literature bounds deviations using Wasserstein distance [6, 18, 19]. These bounds require some form of uniform continuity with respect to a metric on the state space.

In some applications, Markov chains lack both the minorization and continuity properties discussed above, making total variation and Wasserstein-type bounds difficult or impossible to apply. Fortunately, some of these models also possess valuable structure in the form of stochastic monotonicity. Such monotonicity can be exploited to obtain stability and ergodicity via order-theoretic versions of mixing conditions [5, 3, 8, 4, 12, 7, 13]. In this paper we complement these stability and ergodicity results by providing a theorem on quantitative bounds for stochastically monotone Markov chains.

There already exist several results that use stochastic monotonicity to bound the distributions generated by Markov chains [16, 9]. However, these bounds are typically stated in terms of total variation distance, which again requires traditional minorization conditions (as opposed to the order-theoretic mixing conditions discussed in the previous paragraph). In this paper, we aim to fully exploit monotonicity by instead bounding the Kolmogorov distance between distributions. This works well because the Kolmogorov distance respects the order structure on the state space.

Our main theorem is closely related to the total variation bound in Theorem 1 of [20], which is representative of existing work on total variation bounds and supplies a simple and elegant proof. The main difference between that theorem and the one presented below is that we use the Kolmogorov distance instead of the total variation distance and an order-theoretic mixing condition instead of a standard minorization condition. At the same time, it is possible to recover Theorem 1 of [20] from the result we present below by a particular choice of partial order (see Sections 4.1 and 4.4). Thus our work can be viewed as a generalization of existing total variation results.

2. Set Up

In this section we recall basic definitions and state some preliminary results.

2.1. Environment

Throughout this paper, $\mathbbm{X}$ is a Polish space, $\mathcal{B}$ is the collection of its Borel sets, and $\preceq$ is a closed partial order on $\mathbbm{X}$. The last statement means that the graph of $\preceq$, denoted by

$$\mathbbm{G} \coloneqq \{(x',x)\in\mathbbm{X}\times\mathbbm{X} : x'\preceq x\},$$

is closed under the product topology on $\mathbbm{X}\times\mathbbm{X}$. A map $h\colon\mathbbm{X}\to\mathbbm{R}$ is called increasing if $x\preceq x'$ implies $h(x)\leqslant h(x')$. We take $p\mathcal{B}$ to be the set of all probability measures on $\mathcal{B}$ and let $b\mathcal{B}$ be the set of bounded Borel measurable functions from $\mathbbm{X}$ to $\mathbbm{R}$. The symbol $ib\mathcal{B}$ denotes the set of all increasing $h\in b\mathcal{B}$.

Given $\mu,\nu$ in $p\mathcal{B}$, we say that $\mu$ is stochastically dominated by $\nu$, and write $\mu\preceq_s\nu$, if $\mu(h)\leqslant\nu(h)$ for all $h\in ib\mathcal{B}$. In addition, we set

$$\kappa(\mu,\nu) \coloneqq \sup\left\{\left|\int h\,\mathrm{d}\mu-\int h\,\mathrm{d}\nu\right| : h\in ib\mathcal{B} \text{ and } 0\leqslant h\leqslant 1\right\}, \qquad (1)$$

which corresponds to the Kolmogorov metric on $p\mathcal{B}$ [13, 10].
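When $\mathbbm{X}=\mathbbm{R}$ with the usual order, the supremum in (1) is attained over indicator functions of upper intervals, so $\kappa$ reduces to the classical Kolmogorov distance $\sup_t|F_\mu(t)-F_\nu(t)|$ between distribution functions. The following minimal sketch (ours, not from the paper; the sample inputs are hypothetical) estimates this quantity from samples:

```python
import numpy as np

def kolmogorov_distance(sample_mu, sample_nu):
    """Empirical Kolmogorov distance sup_t |F_mu(t) - F_nu(t)| computed from
    two samples of distributions on R (with the usual order)."""
    grid = np.sort(np.concatenate([sample_mu, sample_nu]))
    F_mu = np.searchsorted(np.sort(sample_mu), grid, side="right") / len(sample_mu)
    F_nu = np.searchsorted(np.sort(sample_nu), grid, side="right") / len(sample_nu)
    return float(np.abs(F_mu - F_nu).max())

# Example with hypothetical samples:
rng = np.random.default_rng(0)
print(kolmogorov_distance(rng.normal(size=10**5), rng.normal(0.5, 1.0, size=10**5)))
```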

A map $Q\colon\mathbbm{X}\times\mathcal{B}\to[0,1]$ is called a stochastic kernel on $(\mathbbm{X},\mathcal{B})$ if $x\mapsto Q(x,A)$ is measurable for each $A\in\mathcal{B}$ and $A\mapsto Q(x,A)$ is a probability measure on $\mathcal{B}$ for each $x\in\mathbbm{X}$. At times we use the symbol $Q_x$ to represent the distribution $Q(x,\cdot)$ at given $x$. A stochastic kernel $Q$ on $(\mathbbm{X},\mathcal{B})$ is called increasing if $Qh\in ib\mathcal{B}$ whenever $h\in ib\mathcal{B}$.

For a given stochastic kernel $Q$ on $(\mathbbm{X},\mathcal{B})$, we define the left and right Markov operators generated by $Q$ via

$$\mu Q(A) \coloneqq \int Q(x,A)\,\mu(\mathrm{d}x) \quad\text{and}\quad Qf(x) \coloneqq \int f(y)\,Q(x,\mathrm{d}y).$$

(The left Markov operator $\mu\mapsto\mu Q$ maps $p\mathcal{B}$ to itself, while the right Markov operator $f\mapsto Qf$ acts on bounded measurable functions.) A discrete-time $\mathbbm{X}$-valued stochastic process $(X_t)_{t\geqslant 0}$ on a filtered probability space $(\Omega,\mathscr{F},\mathbbm{P},(\mathscr{F}_t)_{t\geqslant 0})$ is called Markov-$(Q,\mu)$ if $X_0\sim\mu$ and

$$\mathbbm{E}[h(X_{t+1})\,|\,\mathscr{F}_t] = Qh(X_t) \quad\text{with probability one, for all } t\geqslant 0 \text{ and } h\in b\mathcal{B}.$$
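In computational terms, a Markov-$(Q,\mu)$ process is generated by repeatedly sampling from the kernel. Below is a minimal generic sketch (ours; the name `kernel_draw` and its signature are assumptions for illustration, not notation from the paper):

```python
import numpy as np

def simulate_path(kernel_draw, x0, T, rng):
    """Generate X_0, ..., X_T for a Markov-(Q, delta_{x0}) process, where
    kernel_draw(x, rng) returns a single draw from Q(x, .)."""
    path = [x0]
    for _ in range(T):
        path.append(kernel_draw(path[-1], rng))
    return np.array(path)

# Example: the AR(1)-type kernel Q(x, .) = N(0.5 x, 1)
rng = np.random.default_rng(0)
path = simulate_path(lambda x, r: 0.5 * x + r.normal(), x0=0.0, T=10, rng=rng)
```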

2.2. Couplings

A coupling of $(\mu,\nu)\in p\mathcal{B}\times p\mathcal{B}$ is a probability measure $\rho$ on $\mathcal{B}\otimes\mathcal{B}$ satisfying $\rho(A\times\mathbbm{X})=\mu(A)$ and $\rho(\mathbbm{X}\times A)=\nu(A)$ for all $A\in\mathcal{B}$. Let $\mathscr{C}(\mu,\nu)$ denote the set of all couplings of $(\mu,\nu)$ and let

$$\alpha(\mu,\nu) \coloneqq \sup_{\rho\in\mathscr{C}(\mu,\nu)} \rho(\mathbbm{G}) \qquad ((\mu,\nu)\in p\mathcal{B}\times p\mathcal{B}). \qquad (2)$$

The value $\alpha(\mu,\nu)$ lies in $[0,1]$ and can be understood as a measure of “partial stochastic dominance” of $\nu$ over $\mu$ [14]. By the Polish assumption and Strassen’s theorem [22, 15], we have

$$\alpha(\mu,\nu)=1 \quad\text{whenever}\quad \mu\preceq_s\nu. \qquad (3)$$

Let $Q$ be a stochastic kernel on $(\mathbbm{X},\mathcal{B})$ and let $\hat{Q}$ be a stochastic kernel on $(\mathbbm{X}\times\mathbbm{X},\mathcal{B}\otimes\mathcal{B})$. We call $\hat{Q}$ a Markov coupling of $Q$ if $\hat{Q}_{(x,x')}$ is a coupling of $Q_x$ and $Q_{x'}$ for all $x,x'\in\mathbbm{X}$. We call $\hat{Q}$ a $\preceq$-maximal Markov coupling of $Q$ if $\hat{Q}$ is a Markov coupling of $Q$ and, in addition,

$$\hat{Q}((x,x'),\mathbbm{G}) = \alpha(Q_x,Q_{x'}) \quad\text{for all } (x,x')\in\mathbbm{X}\times\mathbbm{X}. \qquad (4)$$
Lemma 2.1.

For any stochastic kernel $Q$ on $(\mathbbm{X},\mathcal{B})$, there exists a $\preceq$-maximal Markov coupling of $Q$.

Proof.

By Theorem 1.1 of [23], given lower semicontinuous $\varphi\colon\mathbbm{X}\times\mathbbm{X}\to\mathbbm{R}$, there exists a stochastic kernel $\hat{Q}$ on $(\mathbbm{X}\times\mathbbm{X},\mathcal{B}\otimes\mathcal{B})$ such that $\hat{Q}$ is a Markov coupling of $Q$ and, in addition,

$$(\hat{Q}\varphi)(x,x') = \inf\left\{\int\varphi\,\mathrm{d}\rho : \rho\in\mathscr{C}(Q_x,Q_{x'})\right\}.$$

Since $\mathbbm{G}$ is closed, the function $\varphi = 1-\mathbbm{1}_{\mathbbm{G}}$ is lower semicontinuous, so this result applies with that choice of $\varphi$. Since $\hat{Q}_{(x,x')}$ and $\rho$ are probability measures, we then have

$$\hat{Q}((x,x'),\mathbbm{G}) = \sup\left\{\rho(\mathbbm{G}) : \rho\in\mathscr{C}(Q_x,Q_{x'})\right\}.$$

Thus, $\hat{Q}$ is a $\preceq$-maximal Markov coupling of $Q$. ∎

2.3. Drift

Consider the geometric drift condition

$$QV(x) \leqslant \lambda V(x)+\beta \quad\text{for all } x\in\mathbbm{X}, \qquad (5)$$

where $Q$ is a stochastic kernel on $(\mathbbm{X},\mathcal{B})$, $V$ is a measurable function from $\mathbbm{X}$ to $[1,\infty)$, and $\lambda$ and $\beta$ are nonnegative constants. We fix $d\geqslant 1$ and set

$$\gamma \coloneqq \lambda+\frac{2\beta}{d} \quad\text{and}\quad C \coloneqq \{x\in\mathbbm{X} : V(x)\leqslant d\}. \qquad (6)$$

Fix $\mu,\mu'$ in $p\mathcal{B}$ and set

$$H(\mu,\mu') \coloneqq \frac{1}{2}\left[\int V\,\mathrm{d}\mu + \int V\,\mathrm{d}\mu'\right]. \qquad (7)$$

Let $\hat{Q}$ be a Markov coupling of $Q$ and let $((X_t,X'_t))_{t\geqslant 0}$ be Markov-$(\hat{Q},\mu\times\mu')$ on $(\Omega,\mathscr{F},\mathbbm{P},(\mathscr{F}_t)_{t\geqslant 0})$. We are interested in studying the number of visits to $C\times C$, as given by

$$N_t \coloneqq \sum_{j=0}^{t}\mathbbm{1}\{(X_j,X'_j)\in C\times C\}.$$
Lemma 2.2.

If $Q$ satisfies the geometric drift condition (5), then, for all $t\in\mathbbm{N}$ and all $j\in\mathbbm{N}$ with $j\leqslant t$, we have

$$\mathbbm{P}\{N_t<j\} \leqslant \gamma^t d^{\,j-1} H(\mu,\mu').$$

The result in Lemma 2.2 has already been used in other sources. To make the paper more self-contained, we provide a proof, based on arguments in [20], in the appendix.

3. Convergence Rates

Let $V$ be a measurable function from $\mathbbm{X}$ to $[1,\infty)$ and let $Q$ be a stochastic kernel on $(\mathbbm{X},\mathcal{B})$ satisfying the geometric drift condition (5). Fix $d\geqslant 1$ and let $C$ and $\gamma$ be as defined in (6). Let $H(\mu,\mu')$ be as given in (7). Let

$$\varepsilon \coloneqq \inf\left\{\alpha(Q_x,Q_{x'}) : (x,x')\in C\times C\right\}. \qquad (8)$$

We now state the main result.

Theorem 3.1.

If $Q$ is increasing, then, for any $j,t\in\mathbbm{N}$ with $j\leqslant t$, we have

$$\kappa(\mu Q^t,\mu' Q^t) \leqslant (1-\varepsilon)^j + \gamma^t d^{\,j-1} H(\mu,\mu').$$
Proof.

Given $Q$ in Theorem 3.1, we let $\hat{Q}$ be a $\preceq$-maximal Markov coupling of $Q$ (existence of which follows from Lemma 2.1). Let $((X_t,X'_t))_{t\geqslant 0}$ be Markov-$(\hat{Q},\mu\times\mu')$ on $(\Omega,\mathscr{F},\mathbbm{P},(\mathscr{F}_t)_{t\geqslant 0})$. We observe that the graph $\mathbbm{G}$ of $\preceq$ is absorbing for $\hat{Q}$. Indeed, if $(x,x')\in\mathbbm{G}$, then, since $Q$ is increasing, $Q(x,\cdot)\preceq_s Q(x',\cdot)$. Hence, by (3), we have $\alpha(Q(x,\cdot),Q(x',\cdot))=1$. Applying (4) yields $\hat{Q}((x,x'),\mathbbm{G})=1$.

Let $\tau$ be the stopping time $\tau \coloneqq \inf\{t\geqslant 0 : X'_t\preceq X_t\}$, with $\inf\varnothing=\infty$. Since $\mathbbm{G}$ is absorbing for $\hat{Q}$, we have $X'_t\preceq X_t$ almost surely on the event $\{\tau\leqslant t\}$. Let $h$ be any element of $ib\mathcal{B}$ with $0\leqslant h\leqslant 1$. Since $((X_t,X'_t))_{t\geqslant 0}$ is Markov-$(\hat{Q},\mu\times\mu')$ and $\hat{Q}((x,x'),\cdot)$ is a coupling of $Q(x,\cdot)$ and $Q(x',\cdot)$, we have

$$(\mu' Q^t)(h)-(\mu Q^t)(h) = \mathbbm{E}\,h(X'_t)-\mathbbm{E}\,h(X_t) = \mathbbm{E}\,[h(X'_t)-h(X_t)]\,\mathbbm{1}\{X'_t\preceq X_t\} + \mathbbm{E}\,[h(X'_t)-h(X_t)]\,\mathbbm{1}\{X'_t\preceq X_t\}^c.$$

Since $h$ is increasing, this leads to

$$(\mu' Q^t)(h)-(\mu Q^t)(h) \leqslant \mathbbm{E}\,[h(X'_t)-h(X_t)]\,\mathbbm{1}\{X'_t\preceq X_t\}^c \leqslant \mathbbm{P}\left(\{X'_t\preceq X_t\}^c\right).$$

Since $\tau\leqslant t$ implies $X'_t\preceq X_t$, we have $\{X'_t\preceq X_t\}^c\subset\{\tau>t\}$, and hence

$$(\mu' Q^t)(h)-(\mu Q^t)(h) \leqslant \mathbbm{P}\{\tau>t\}. \qquad (9)$$

Recall from Section 2.3 that $N_t \coloneqq \sum_{j=0}^{t}\mathbbm{1}\{(X_j,X'_j)\in C\times C\}$ counts visits to $C\times C$. Fixing $j\in\mathbbm{N}$ with $j\leqslant t$, we have

$$\mathbbm{P}\{\tau>t\} = \mathbbm{P}\{\tau>t,\,N_t<j\} + \mathbbm{P}\{\tau>t,\,N_t\geqslant j\}. \qquad (10)$$

To bound the first term in (10), we set $W(x,x') \coloneqq [V(x)+V(x')]/2$. Since $\hat{Q}((x,x'),\cdot)$ is a coupling of $Q(x,\cdot)$ and $Q(x',\cdot)$, we have

$$\hat{Q}W(x,x') = \frac{QV(x)+QV(x')}{2} \leqslant \lambda W(x,x')+\beta. \qquad (11)$$

Hence, applying Lemma 2.2 to $\hat{Q}$ yields

$$\mathbbm{P}\{\tau>t,\,N_t<j\} \leqslant \mathbbm{P}\{N_t<j\} \leqslant \gamma^t d^{\,j-1} H(\mu,\mu'). \qquad (12)$$

Regarding the second term in (10), we claim that

$$\mathbbm{P}\{\tau>t,\,N_t\geqslant j\} \leqslant (1-\varepsilon)^j. \qquad (13)$$

To see this, let $(J_i)_{i\geqslant 1}$ be the times of the successive visits of $(X_t,X'_t)$ to $C\times C$. That is, $J_1$ is the time of the first visit and

$$J_{i+1} \coloneqq \inf\{m\geqslant J_i+1 : (X_m,X'_m)\in C\times C\} \quad\text{for } i\geqslant 1.$$

It is not difficult to see that $\{N_t>j\}\subset\{J_j\leqslant t-1\}$. As a result,

$$\mathbbm{P}\{\tau>t,\,N_t>j\} \leqslant \mathbbm{P}\{\tau>t,\,J_j+1\leqslant t\}. \qquad (14)$$

Consider the set $\{\tau>t,\,J_j+1\leqslant t\}$. If a path is in this set then, since $\tau>t$, we have $X'_s\npreceq X_s$ for every $s\leqslant t$. In addition, $J_i+1\leqslant J_j+1\leqslant t$ for any $i\leqslant j$, so $X'_{J_i+1}\npreceq X_{J_i+1}$ for every $i\leqslant j$.

$$\therefore\quad \mathbbm{P}\{\tau>t,\,J_j+1\leqslant t\} \leqslant \mathbbm{P}\bigcap_{i=1}^{j}\{X'_{J_i+1}\npreceq X_{J_i+1}\}. \qquad (15)$$

Observe that

$$\mathbbm{P}\bigcap_{i=1}^{j}\{X'_{J_i+1}\npreceq X_{J_i+1}\} = \mathbbm{E}\left[\mathbbm{1}\bigcap_{i=1}^{j-1}\{X'_{J_i+1}\npreceq X_{J_i+1}\}\;\mathbbm{P}[X'_{J_j+1}\npreceq X_{J_j+1}\,|\,\mathscr{F}_{J_j}]\right].$$

By the definition of $J_j$ we have $(X_{J_j},X'_{J_j})\in C\times C$. Using this fact, the strong Markov property, and the definition of $\hat{Q}$ (see (4)) yields

$$\mathbbm{P}[X'_{J_j+1}\preceq X_{J_j+1}\,|\,\mathscr{F}_{J_j}] = \hat{Q}((X'_{J_j},X_{J_j}),\mathbbm{G}) = \alpha(Q(X'_{J_j},\cdot),Q(X_{J_j},\cdot)).$$

Applying the definition of $\varepsilon$ in (8), we obtain $\mathbbm{P}[X'_{J_j+1}\npreceq X_{J_j+1}\,|\,\mathscr{F}_{J_j}]\leqslant 1-\varepsilon$, so

$$\mathbbm{P}\bigcap_{i=1}^{j}\{X'_{J_i+1}\npreceq X_{J_i+1}\} \leqslant (1-\varepsilon)\,\mathbbm{P}\bigcap_{i=1}^{j-1}\{X'_{J_i+1}\npreceq X_{J_i+1}\}.$$

Continuing to iterate backwards in this way yields $\mathbbm{P}\bigcap_{i=1}^{j}\{X'_{J_i+1}\npreceq X_{J_i+1}\}\leqslant(1-\varepsilon)^j$. Combining this inequality with (14) and (15) verifies (13).

Combining (9), (10), (12), and (13) yields

$$(\mu' Q^t)(h)-(\mu Q^t)(h) \leqslant (1-\varepsilon)^j + \gamma^t d^{\,j-1}\,H(\mu,\mu').$$

Reversing the roles of $\mu$ and $\mu'$ does not change the value on the right-hand side of this bound, and hence

$$|(\mu' Q^t)(h)-(\mu Q^t)(h)| \leqslant (1-\varepsilon)^j + \gamma^t d^{\,j-1}\,H(\mu,\mu')$$

also holds. Taking the supremum over all $h\in ib\mathcal{B}$ with $0\leqslant h\leqslant 1$ completes the proof of Theorem 3.1. ∎
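Since every quantity on the right-hand side of Theorem 3.1 is computable from the primitives, the bound can be evaluated directly and minimized over $j$. A small sketch of ours (the parameter values in the usage line are purely illustrative):

```python
import numpy as np

def kappa_bound(eps, gamma, d, H, t):
    """Right-hand side of Theorem 3.1, minimized over j in {1, ..., t}.

    eps   -- the order-mixing constant from (8)
    gamma -- lambda + 2*beta/d, from (6)
    d     -- the drift threshold (d >= 1)
    H     -- H(mu, mu') from (7)
    """
    j = np.arange(1, t + 1)
    return float(((1 - eps) ** j + gamma ** t * d ** (j - 1.0) * H).min())

# Illustrative values only: eps = 0.25, gamma = 0.9, d = 2, H = 5, t = 100
print(kappa_bound(0.25, 0.9, 2.0, 5.0, 100))
```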

4. Examples and Applications

In this section we consider some special cases, with a focus on (a) connections to the existing literature and (b) how to obtain an estimate of the value $\varepsilon$ in (8).

4.1. Connection to Total Variation Results

One special case is when $\preceq$ is the identity order, so that $x\preceq y$ if and only if $x=y$. For this order we have $ib\mathcal{B}=b\mathcal{B}$, so every stochastic kernel is increasing, and the Kolmogorov metric (see (1)) becomes the total variation distance. In this total variation setting, Theorem 3.1 is similar to standard geometric bounds for total variation distance, such as Theorem 1 of [20].

It is worth noting that, in the total variation setting, $\varepsilon$ in (8) is at least as large as the analogous term in Theorem 1 of [20]. Indeed, in [20], the value $\varepsilon$, which we now write as $\hat{\varepsilon}$ to avoid confusion, comes from an assumed minorization condition: there exists a $\nu\in p\mathcal{B}$ such that

$$\hat{\varepsilon}\,\nu(B) \leqslant Q(x,B) \quad\text{for all } B\in\mathcal{B} \text{ and } x\in C. \qquad (16)$$

To compare $\hat{\varepsilon}$ with the $\varepsilon$ defined in (8), suppose that this minorization condition holds and define the residual kernel $R(x,B) \coloneqq (Q(x,B)-\hat{\varepsilon}\,\nu(B))/(1-\hat{\varepsilon})$. Fixing $(x,x')\in C\times C$, we draw $(X,X')$ as follows: let $I$ be uniformly distributed on $[0,1]$, independent of all other randomness. If $I\leqslant\hat{\varepsilon}$, draw $X\sim\nu$ and set $X'=X$. Otherwise, independently draw $X\sim R(x,\cdot)$ and $X'\sim R(x',\cdot)$. Simple arguments confirm that $X$ is a draw from $Q(x,\cdot)$ and $X'$ is a draw from $Q(x',\cdot)$. Recalling that $\preceq$ is the identity order, this leads to $\hat{\varepsilon} = \mathbbm{P}\{I\leqslant\hat{\varepsilon}\} \leqslant \mathbbm{P}\{X=X'\} = \mathbbm{P}\{X\preceq X'\} \leqslant \alpha(Q(x,\cdot),Q(x',\cdot))$. (The last bound is by the definition of $\alpha$ in (2) and the fact that the joint distribution of $(X,X')$ is a coupling of $Q(x,\cdot)$ and $Q(x',\cdot)$.) Since, in this discussion, the point $(x,x')$ was arbitrarily chosen from $C\times C$, we conclude that $\hat{\varepsilon}\leqslant\varepsilon$, where $\varepsilon$ is as defined in (8).
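The splitting construction above is straightforward to implement. Here is a minimal sketch of ours for kernels on a finite state space, with `Qx`, `Qxp`, `nu`, and `eps_hat` as hypothetical inputs assumed to satisfy (16) with $\hat{\varepsilon}<1$:

```python
import numpy as np

rng = np.random.default_rng(0)

def split_draw(Qx, Qxp, nu, eps_hat):
    """Draw a coupled pair (X, X') with marginals Qx = Q(x, .) and
    Qxp = Q(x', .), given as probability vectors over {0, ..., n-1},
    using the minorization measure nu and constant eps_hat from (16).
    The pair coincides with probability at least eps_hat."""
    Rx  = (Qx  - eps_hat * nu) / (1 - eps_hat)   # residual kernel R(x, .)
    Rxp = (Qxp - eps_hat * nu) / (1 - eps_hat)   # residual kernel R(x', .)
    if rng.random() <= eps_hat:                  # the event {I <= eps_hat}
        x = rng.choice(len(nu), p=nu)
        return x, x                              # coupled: X' = X ~ nu
    return rng.choice(len(Rx), p=Rx), rng.choice(len(Rxp), p=Rxp)

# Example with hypothetical two-state kernels:
Qx, Qxp = np.array([0.7, 0.3]), np.array([0.4, 0.6])
nu, eps_hat = np.array([0.5, 0.5]), 0.4   # 0.4 * nu <= both Qx and Qxp
print(split_draw(Qx, Qxp, nu, eps_hat))
```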

4.2. Stochastic Recursive Sequences

The preceding section showed that Theorem 3.1 reduces to existing results on total variation bounds when the partial order $\preceq$ is the identity order. Now we show how Theorem 3.1 leads to new results in other settings, such as when $\preceq$ is a pointwise partial order. To this end, consider the process

$$X_{t+1} = F(X_t,W_{t+1}), \qquad (17)$$

where $(W_t)_{t\geqslant 1}$ is an iid shock process taking values in some space $\mathbb{W}$, and $F$ is a measurable function from $\mathbbm{X}\times\mathbb{W}$ to $\mathbbm{X}$. The common distribution of each $W_t$ is denoted by $\varphi$. We suppose that $F$ is increasing, in the sense that $x\preceq x'$ implies $F(x,w)\preceq F(x',w)$ for any fixed $w\in\mathbb{W}$. We let $Q$ be the stochastic kernel corresponding to (17), so that $Q(x,B)=\varphi\{w\in\mathbb{W} : F(x,w)\in B\}$ for all $x\in\mathbbm{X}$ and $B\in\mathcal{B}$. Since $F$ is increasing, the kernel $Q$ is increasing, and hence Theorem 3.1 applies. We can obtain a lower bound on $\varepsilon$ in (8) by calculating

$$e \coloneqq \inf\left\{\int\!\!\int \mathbbm{1}\{F(x',w')\preceq F(x,w)\}\,\varphi(\mathrm{d}w)\,\varphi(\mathrm{d}w') : (x,x')\in C\times C\right\}. \qquad (18)$$

Indeed, if $W$ and $W'$ are drawn independently from $\varphi$, then, for fixed $(x,x')\in C\times C$, $X=F(x,W)$ is a draw from $Q(x,\cdot)$ and $X'=F(x',W')$ is a draw from $Q(x',\cdot)$. Hence

$$e \leqslant \mathbbm{P}\{X'\preceq X\} \leqslant \alpha(Q(x,\cdot),Q(x',\cdot)) \quad\text{for all }(x,x')\in C\times C, \text{ and hence } e\leqslant\varepsilon. \qquad (19)$$
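When $\mathbbm{X}=\mathbbm{R}_+$ with the usual order and $C=[0,c]$, monotonicity of $F$ in its first argument implies that the infimum in (18) is attained at the extreme pair $x=0$, $x'=c$, so $e$ can be estimated by simulation. A sketch of ours under those assumptions (`F` and `draw_shock` are hypothetical inputs):

```python
import numpy as np

rng = np.random.default_rng(1)

def estimate_e(F, draw_shock, c, n=10**6):
    """Monte Carlo estimate of e in (18) for X = R_+ with the usual order and
    C = [0, c]: since F is increasing in x, the infimum is attained at x = 0,
    x' = c. draw_shock(n) returns n independent draws from phi."""
    w, wp = draw_shock(n), draw_shock(n)          # independent shock draws
    return float(np.mean(F(c, wp) <= F(0.0, w)))
```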

4.3. Example: TCP Window Size Process

To illustrate the method in Section 4.2, we consider the TCP window size process (see, e.g., [2]), which has embedded jump chain $X_{t+1}=a(X_t^2+2E_{t+1})^{1/2}$. Here $a\in(0,1)$ and $(E_t)$ is iid exponential with unit rate. If $C=[0,c]$, then, drawing $E,E'$ as independent standard exponentials and using (19),

$$\varepsilon \geqslant \inf_{0\leqslant x,y\leqslant c} \mathbbm{P}\{a\sqrt{y^2+2E'} \leqslant a\sqrt{x^2+2E}\} = \mathbbm{P}\{\sqrt{c^2+2E'} \leqslant \sqrt{2E}\}.$$

Since $E-E'$ has the Laplace-$(0,1)$ distribution, we get

$$\varepsilon \geqslant \mathbbm{P}\{2E\geqslant c^2+2E'\} = \mathbbm{P}\{E-E'\geqslant c^2/2\} = \frac{1}{2}\exp(-c^2/2).$$
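A quick Monte Carlo check of this closed form (a sketch of ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)

# Check the TCP coupling probability: for C = [0, c], the bound above gives
# P{sqrt(c^2 + 2E') <= sqrt(2E)} = exp(-c^2/2)/2 for independent unit exponentials.
c, n = 1.0, 10**6
E, Ep = rng.exponential(size=n), rng.exponential(size=n)
mc = np.mean(np.sqrt(c**2 + 2 * Ep) <= np.sqrt(2 * E))
print(mc, 0.5 * np.exp(-c**2 / 2))   # the two numbers should nearly agree
```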

4.4. Example: When Minorization Fails

We provide an elementary scenario where Theorem 3.1 provides a usable bound while the minorization-based methods described in Section 4.1 do not. Let $\mathbbm{Q}$ be the set of rational numbers, let $\mathbbm{X}=\mathbbm{R}$, and assume that

$$X_{t+1} = \frac{X_t}{2} + W_{t+1}, \quad\text{where } (W_t) \text{ is iid on } \{0,1\} \text{ with } \mathbbm{P}\{W_t=0\}=1/2.$$

Let $C$ contain at least one rational and one irrational number. Let $\mu$ be a measure on the Borel sets of $\mathbbm{R}$ obeying $\mu(B) \leqslant Q(x,B) = \mathbbm{P}\{x/2+W\in B\}$ for all $x\in C$ and Borel sets $B$. If $x$ is rational, then $x/2+W\in\mathbbm{Q}$ with probability one, so $\mu(\mathbbm{Q}^c) \leqslant Q(x,\mathbbm{Q}^c) = 0$. Similarly, if $x$ is irrational, then $x/2+W\in\mathbbm{Q}^c$ with probability one, so $\mu(\mathbbm{Q}) \leqslant Q(x,\mathbbm{Q}) = 0$. Hence $\mu$ is the zero measure on $\mathbbm{R}$, so we cannot find an $\hat{\varepsilon}>0$ and probability measure $\nu$ obeying the minorization condition (16). On the other hand, letting $V(x)=x+1$ and $d=2$, so that $C=\{V\leqslant 2\}=[0,1]$, the value $e$ from (18) obeys $e = \mathbbm{P}\{1/2+W'\leqslant W\} = \mathbbm{P}\{W-W'\geqslant 1/2\} = \frac{1}{4}$. Hence, by (19), the constant $\varepsilon$ in Theorem 3.1 is positive.
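For this chain, the laws started from a rational and an irrational point have disjoint supports, so the total variation distance between them remains 1 for all $t$, while the Kolmogorov distance contracts. A simulation sketch of ours that illustrates the contraction:

```python
import numpy as np

rng = np.random.default_rng(3)

# Two copies of X_{t+1} = X_t/2 + W_{t+1}, started from a rational and an
# irrational point. Their laws stay at total variation distance 1 (disjoint
# supports), yet the empirical Kolmogorov distance between them contracts.
n, T = 10**5, 10
X, Xp = np.zeros(n), np.full(n, np.sqrt(2))
for t in range(T):
    X  = X / 2 + rng.integers(0, 2, size=n)
    Xp = Xp / 2 + rng.integers(0, 2, size=n)
    grid = np.sort(np.concatenate([X, Xp]))
    F  = np.searchsorted(np.sort(X),  grid, side="right") / n
    Fp = np.searchsorted(np.sort(Xp), grid, side="right") / n
    print(t + 1, np.abs(F - Fp).max())   # empirical Kolmogorov distance at t+1
```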

4.5. Example: Wealth Dynamics

Many economic models examine wealth dynamics in the presence of credit market imperfections (see, e.g., [1]). These often result in dynamics of the form

$$X_{t+1} = \eta_{t+1}\,G(X_t)+\xi_{t+1}, \qquad (\eta_t)\stackrel{\text{iid}}{\sim}\varphi, \quad (\xi_t)\stackrel{\text{iid}}{\sim}\psi. \qquad (20)$$

Here $(X_t)$ is some measure of household wealth, $G$ is a function from $\mathbbm{R}_+$ to itself, and $(\eta_t)$ and $(\xi_t)$ are independent $\mathbbm{R}_+$-valued sequences. The function $G$ is increasing, since greater current wealth relaxes borrowing constraints and increases financial income. We assume that there exists a $\lambda<1$ such that $\mathbbm{E}\,\eta_t G(x) \leqslant \lambda x$ for all $x\in\mathbbm{R}_+$ and, in addition, that $\bar{\xi} \coloneqq \mathbbm{E}\,\xi_t < \infty$.

Let $Q$ be the stochastic kernel corresponding to (20). With $V(x)=x+1$, we have

$$QV(x) = \mathbbm{E}[\eta_{t+1}\,G(x)+\xi_{t+1}+1] \leqslant \lambda x+\bar{\xi}+1 \leqslant \lambda V(x)+\bar{\xi}+1. \qquad (21)$$

Fixing $d\geqslant 1$ and setting $C = \{V\leqslant d\} = [0,d-1]$, we can obtain $e$ in (18) via

$$e = \mathbbm{P}\{\eta' G(d-1)+\xi' \leqslant \eta G(0)+\xi\} \quad\text{when}\quad (\eta',\xi',\eta,\xi) \sim \varphi\times\psi\times\varphi\times\psi.$$

This term will be strictly positive under suitable conditions, such as when $\psi$ has sufficiently large support. Combining (19) and (21) with the bound in Theorem 3.1, we have, for any $\mu$ and $\mu'$ in $p\mathcal{B}$ and $j,t\in\mathbbm{N}$ with $j\leqslant t$,

$$\kappa(\mu Q^t,\mu' Q^t) \leqslant (1-e)^j + \left(\lambda+\frac{2(\bar{\xi}+1)}{d}\right)^{t} d^{\,j-1} H(\mu,\mu'),$$

where $H(\mu,\mu') = \frac{1}{2}\left(\int (x+1)\,\mu(\mathrm{d}x) + \int (x+1)\,\mu'(\mathrm{d}x)\right)$.
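The whole pipeline can be assembled numerically. Below is an end-to-end sketch of ours under assumed primitives (the choices of $G$, $\varphi$, $\psi$, $d$, and the placeholder value for $H$ are purely illustrative; any primitives satisfying the assumptions of this section would do):

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed primitives (ours): G(x) = min(x, 5), eta ~ Uniform(0, 1), xi ~ Exp(1).
# Then E[eta] G(x) <= x/2, so lambda = 1/2, and xi_bar = E[xi] = 1.
G = lambda x: np.minimum(x, 5.0)
lam, xi_bar, d = 0.5, 1.0, 10.0               # C = {V <= d} = [0, d - 1]
gamma = lam + 2 * (xi_bar + 1) / d            # here gamma = 0.9 < 1

# Monte Carlo estimate of e via (18), using the extreme pair x = 0, x' = d - 1
n = 10**6
eta, xi = rng.uniform(size=n), rng.exponential(size=n)
etap, xip = rng.uniform(size=n), rng.exponential(size=n)
e = np.mean(etap * G(d - 1) + xip <= eta * G(0.0) + xi)

# Evaluate the bound for t = 50, minimizing over j; H = 5 is a placeholder
# for H(mu, mu'), which would be computed from the initial distributions.
t, H = 50, 5.0
j = np.arange(1, t + 1)
print(((1 - e) ** j + gamma ** t * d ** (j - 1.0) * H).min())
```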

Notice that, for this model, the lack of smooth mixing and continuity implies that neither total variation nor Wasserstein distance bounds can be computed without additional assumptions.

References

  • [1] Antonio Antunes and Tiago Cavalcanti. Start up costs, limited enforcement, and the hidden economy. European Economic Review, 51:203–224, 2007.
  • [2] Jean-Baptiste Bardet, Alejandra Christen, Arnaud Guillin, Florent Malrieu, and Pierre-André Zitt. Total variation estimates for the TCP process. Electron. J. Probab, 18(10):1–21, 2013.
  • [3] Rabi Bhattacharya and Mukul Majumdar. On a theorem of Dubins and Freedman. Journal of Theoretical Probability, 12:1067–1087, 1999.
  • [4] Rabi Bhattacharya, Mukul Majumdar, and Nigar Hashimzade. Limit theorems for monotone Markov processes. Sankhya A, 72:170–190, 2010.
  • [5] Rabi N Bhattacharya and Oesook Lee. Asymptotics of a class of Markov processes which are not in general irreducible. The Annals of Probability, pages 1333–1347, 1988.
  • [6] Djalil Chafaï, Florent Malrieu, and Katy Paroux. On the long time behavior of the TCP window size process. Stochastic Processes and their Applications, 120(8):1518–1534, 2010.
  • [7] Sergey Foss, Vsevolod Shneer, Jonathan P Thomas, and Tim Worrall. Stochastic stability of monotone economies in regenerative environments. Journal of Economic Theory, 173:334–360, 2018.
  • [8] Serguei Foss and Takis Konstantopoulos. An overview of some stochastic stability methods. Journal of the Operations Research Society of Japan, 47(4):275–303, 2004.
  • [9] Julia Gaudio, Saurabh Amin, and Patrick Jaillet. Exponential convergence rates for stochastically ordered Markov processes with random initial conditions. arXiv preprint arXiv:1810.07732v1, 2018.
  • [10] Robert E Gaunt and Siqi Li. Bounding Kolmogorov distances through Wasserstein and related integral probability metrics. Journal of Mathematical Analysis and Applications, 522(1):126985, 2023.
  • [11] Yu Hang Jiang, Tong Liu, Zhiya Lou, Jeffrey S Rosenthal, Shanshan Shangguan, Fei Wang, and Zixuan Wu. The coupling/minorization/drift approach to Markov chain convergence rates. Notices of the American Mathematical Society, 68(4), 2021.
  • [12] Takashi Kamihigashi and John Stachurski. Stochastic stability in monotone economies. Theoretical Economics, 9(2):383–407, 2014.
  • [13] Takashi Kamihigashi and John Stachurski. A unified stability theory for classical and monotone Markov chains. Journal of Applied Probability, 56(1):1–22, 2019.
  • [14] Takashi Kamihigashi and John Stachurski. Partial stochastic dominance via optimal transport. Operations Research Letters, 48(5):584–586, 2020.
  • [15] Torgny Lindvall. Lectures on the coupling method. Dover, 2002.
  • [16] Robert B Lund, Sean P Meyn, and Richard L Tweedie. Computable exponential convergence rates for stochastically ordered Markov processes. The Annals of Applied Probability, 6(1):218–237, 1996.
  • [17] Ravi Montenegro and Prasad Tetali. Mathematical aspects of mixing times in Markov chains. Foundations and Trends in Theoretical Computer Science, 1(3):237–354, 2006.
  • [18] Qian Qin and James P Hobert. Geometric convergence bounds for Markov chains in Wasserstein distance based on generalized drift and contraction conditions. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 58(2):872–889, 2022.
  • [19] Yanlin Qu, Jose Blanchet, and Peter Glynn. Computable bounds on convergence of Markov chains in Wasserstein distance. arXiv preprint arXiv:2308.10341, 2023.
  • [20] Jeffrey S Rosenthal. Quantitative convergence rates of Markov chains: A simple account. Electronic Communications in Probability, 7:123–128, 2002.
  • [21] Jeffrey S Rosenthal. How Markov’s little idea transformed statistics. Handbook of the History and Philosophy of Mathematical Practice, pages 1–11, 2023.
  • [22] Volker Strassen. The existence of probability measures with given marginals. The Annals of Mathematical Statistics, 36(2):423–439, 1965.
  • [23] Shaoyi Zhang. Existence and application of optimal Markovian coupling with respect to non-negative lower semi-continuous functions. Acta Mathematica Sinica, 16(2):261–270, 2000.

Appendix A Proof of Lemma 2.2

Let the conditions of Lemma 2.2 hold and let $Q$, $\hat{Q}$, and $((X_t,X'_t))_{t\geqslant 0}$ be as described in Section 2.3. We assume in what follows that $H(\mu,\mu')$ is finite, since otherwise Lemma 2.2 is trivial. Setting $W(x,x') \coloneqq (V(x)+V(x'))/2$, we have

$$\hat{Q}W(x,x') = \frac{QV(x)+QV(x')}{2} \leqslant \lambda W(x,x')+\beta. \qquad (22)$$

Now define

$$M_t \coloneqq \gamma^{-t} d^{-N_{t-1}}\,W(X_t,X'_t) \quad\text{for } t\geqslant 0, \text{ with } N_{-1}\coloneqq 0.$$

We claim that $(M_t)_{t\geqslant 0}$ is an $(\mathscr{F}_t)$-supermartingale. Clearly $(M_t)$ is adapted. To see that $\mathbbm{E}[M_{t+1}\,|\,\mathscr{F}_t]\leqslant M_t$ holds $\mathbbm{P}$-almost surely (this inequality implies integrability of $M_t$, because then $\mathbbm{E}\,M_t\leqslant\mathbbm{E}\,M_0$, and $\mathbbm{E}\,M_0 = H(\mu,\mu')$ is finite by assumption), let $F_t \coloneqq \mathbbm{1}\{(X_t,X'_t)\in C\times C\}$ and $F_t^c \coloneqq 1-F_t$, so that

$$\mathbbm{E}[M_{t+1}\,|\,\mathscr{F}_t] = \mathbbm{E}[M_{t+1}F_t\,|\,\mathscr{F}_t] + \mathbbm{E}[M_{t+1}F_t^c\,|\,\mathscr{F}_t]. \qquad (23)$$

On $C\times C$ we have $W\leqslant d$, so, by (22), $\mathbbm{E}[W(X_{t+1},X'_{t+1})\,|\,\mathscr{F}_t] \leqslant \lambda d+\beta \leqslant d\gamma$. Therefore,

$$\mathbbm{E}[M_{t+1}F_t\,|\,\mathscr{F}_t] = \gamma^{-(t+1)} d^{-N_t}\,\mathbbm{E}[W(X_{t+1},X'_{t+1})\,|\,\mathscr{F}_t]\,F_t \leqslant \gamma^{-t} d^{-N_t+1} F_t.$$

Also, on $F_t$ we have $N_t = N_{t-1}+1$. Using this fact and $1\leqslant W$ yields

$$\mathbbm{E}[M_{t+1}F_t\,|\,\mathscr{F}_t] \leqslant \gamma^{-t} d^{-N_{t-1}} F_t \leqslant M_t F_t. \qquad (24)$$

Turning to the term $\mathbbm{E}[M_{t+1}F_t^c\,|\,\mathscr{F}_t]$, observe that on $F_t^c$ we have $W\geqslant d/2$, so, using (22) again,

$$\frac{\hat{Q}W}{W} \leqslant \lambda+\frac{\beta}{W} \leqslant \lambda+\frac{2\beta}{d} = \gamma.$$

Therefore, $\mathbbm{E}[W(X_{t+1},X'_{t+1})\,|\,\mathscr{F}_t]\,F_t^c \leqslant \gamma\,W(X_t,X'_t)\,F_t^c$. Combining this bound with the fact that $N_t = N_{t-1}$ on $F_t^c$ yields

$$\mathbbm{E}[M_{t+1}F_t^c\,|\,\mathscr{F}_t] = \gamma^{-(t+1)} d^{-N_t}\,\mathbbm{E}[W(X_{t+1},X'_{t+1})\,|\,\mathscr{F}_t]\,F_t^c \leqslant \gamma^{-t} d^{-N_{t-1}}\,W(X_t,X'_t)\,F_t^c = M_t F_t^c.$$

Together with (24) and (23), this inequality gives $\mathbbm{E}[M_{t+1}\,|\,\mathscr{F}_t]\leqslant M_t$, so $(M_t)$ is a supermartingale, as claimed.

Now fix $t\in\mathbbm{N}$ and $j\leqslant t$. Since $d\geqslant 1$, we have

$$\mathbbm{P}\{N_t<j\} \leqslant \mathbbm{P}\{N_{t-1}\leqslant j-1\} = \mathbbm{P}\{d^{-N_{t-1}} \geqslant d^{-j+1}\}.$$

From Chebyshev’s inequality, $1\leqslant W$, and the supermartingale property, the last term is dominated by

$$d^{\,j-1}\,\mathbbm{E}\,d^{-N_{t-1}} \leqslant \gamma^t d^{\,j-1}\,\mathbbm{E}\left[\gamma^{-t} d^{-N_{t-1}}\,W(X_t,X'_t)\right] = \gamma^t d^{\,j-1}\,\mathbbm{E}[M_t] \leqslant \gamma^t d^{\,j-1}\,\mathbbm{E}[M_0].$$

The last term is just $\gamma^t d^{\,j-1} H(\mu,\mu')$, so the claim in Lemma 2.2 is now proved. ∎
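As a numerical sanity check (ours, not part of the proof), one can simulate $(M_t)$ for a simple kernel with the independent Markov coupling and confirm that $t\mapsto\mathbbm{E}\,M_t$ is nonincreasing. Assumed primitives: $X_{t+1}=X_t/2+W_{t+1}$ with $(W_t)$ iid Exp(1) and $V(x)=x+1$, so (5) holds with $\lambda=1/2$ and $\beta=3/2$.

```python
import numpy as np

rng = np.random.default_rng(5)

# Sanity check of the supermartingale property of M_t. Assumed primitives:
# X_{t+1} = X_t/2 + shock, shock ~ Exp(1), V(x) = x + 1, lambda = 1/2,
# beta = 3/2, d = 4, independent coupling, both chains started from x = 0.
lam, beta, d = 0.5, 1.5, 4.0
gamma = lam + 2 * beta / d
n, T = 10**6, 20
X, Xp = np.zeros(n), np.zeros(n)
N_prev = np.zeros(n)                       # N_{t-1}, with N_{-1} = 0
for t in range(T + 1):
    Wpair = (X + 1 + Xp + 1) / 2           # W(X_t, X'_t)
    M = gamma ** (-t) * d ** (-N_prev) * Wpair
    print(t, M.mean())                     # E[M_t] should be nonincreasing in t
    N_prev = N_prev + ((X + 1 <= d) & (Xp + 1 <= d))
    X  = X / 2 + rng.exponential(size=n)
    Xp = Xp / 2 + rng.exponential(size=n)
```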