Quantitative Convergence Rates for Stochastically Monotone Markov Chains
Abstract.
For Markov chains and Markov processes exhibiting a form of stochastic monotonicity (larger states shift up transition probabilities in terms of stochastic dominance), stability and ergodicity results can be obtained using order-theoretic mixing conditions. We complement these results by providing quantitative bounds on deviations between distributions. We also show that well-known total variation bounds can be recovered as a special case.
1. Introduction
Quantitative bounds on the distance between distributions generated by Markov chains have many applications in statistics and the natural and social sciences (see, e.g., [21, 17]). One approach uses total variation distance and exploits minorization conditions (see, e.g., [20, 11, 2]). Another branch of the literature bounds deviations using Wasserstein distance [6, 18, 19]. These bounds require some form of uniform continuity with respect to a metric on the state space.
In some applications, Markov chains lack both the minorization and continuity properties discussed above, making total variation and Wasserstein-type bounds difficult or impossible to apply. Fortunately, some of these models also possess valuable structure in the form of stochastic monotonicity. Such monotonicity can be exploited to obtain stability and ergodicity via order-theoretic versions of mixing conditions [5, 3, 8, 4, 12, 7, 13]. In this paper we complement these stability and ergodicity results by providing a theorem on quantitative bounds for stochastically monotone Markov chains.
There already exist several results that use stochastic monotonicity to bound the distributions generated by Markov chains [16, 9]. However, these bounds are typically stated in terms of total variation distance, which again requires traditional minorization conditions (as opposed to the order-theoretic mixing conditions discussed in the last paragraph). In this paper, we aim to fully exploit the monotonicity by instead bounding Kolmogorov distance between distributions. This works well because Kolmogorov distance respects order structure on the state space.
Our main theorem is closely related to the total variation bound in Theorem 1 of [20], which is representative of existing work on total variation bounds and supplies a simple and elegant proof. The main differences between that theorem and the one presented below are that we use Kolmogorov distance instead of total variation distance and an order-theoretic mixing condition instead of a standard minorization condition. At the same time, it is possible to recover Theorem 1 of [20] from the result we present below by a particular choice of partial order (see Sections 4.1 and 4.4). Thus our work can be viewed as a generalization of existing total variation results.
2. Set Up
In this section we recall basic definitions and state some preliminary results.
2.1. Environment
Throughout this paper, $S$ is a Polish space, $\mathcal{B}$ is its Borel sets, and $\preceq$ is a closed partial order on $S$. The last statement means that the graph of $\preceq$, denoted by
$\mathbb{G} := \{(x, y) \in S \times S : x \preceq y\},$
is closed under the product topology on $S \times S$. A map $h \colon S \to \mathbb{R}$ is called increasing if $x \preceq y$ implies $h(x) \leq h(y)$. We take $\mathscr{P}(S)$ to be the set of all probability measures on $(S, \mathcal{B})$ and let $bS$ be the bounded Borel measurable functions sending $S$ into $\mathbb{R}$. The symbol $ibS$ represents all increasing $h \in bS$.
Given $\mu, \nu$ in $\mathscr{P}(S)$, we say that $\mu$ is stochastically dominated by $\nu$ and write $\mu \preceq \nu$ if $\mu(h) \leq \nu(h)$ for all $h \in ibS$, where $\mu(h) := \int h \, d\mu$. In addition, we set
(1)  $\kappa(\mu, \nu) := \sup \{ \mu(h) - \nu(h) : h \in ibS, \; 0 \leq h \leq 1 \}.$
We refer to $\kappa$ as the Kolmogorov distance, since, when $S = \mathbb{R}$ with the usual order, $\kappa(\mu, \nu) = \sup_{x} \{\mu([x, \infty)) - \nu([x, \infty))\}$.
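To illustrate the metric in (1) numerically (this sketch is ours, not part of the paper's formal development), suppose $S = \{0, \dots, n-1\}$ with the usual order. The supremum over increasing $[0,1]$-valued functions is attained at an indicator of an upper set $\{k, \dots, n-1\}$, so $\kappa$ reduces to a maximum over differences of tail probabilities. The function name below is hypothetical:

```python
import numpy as np

def kolmogorov_distance(mu, nu):
    """One-sided Kolmogorov distance sup {mu(h) - nu(h)} over increasing
    h with values in [0, 1], for distributions on {0, ..., n-1} with the
    usual order. The sup is attained at an indicator of an upper set
    {k, ..., n-1}, so it equals the largest difference of tail sums."""
    mu, nu = np.asarray(mu, float), np.asarray(nu, float)
    # Tail sums mu([k, n)) and nu([k, n)) for each k.
    mu_tails = np.cumsum(mu[::-1])[::-1]
    nu_tails = np.cumsum(nu[::-1])[::-1]
    return max(0.0, float(np.max(mu_tails - nu_tails)))

# Point mass at 2 versus point mass at 1: the first dominates, distance 1.
print(kolmogorov_distance([0, 0, 1], [0, 1, 0]))  # 1.0
# Reversed arguments: the second dominates the first, distance 0.
print(kolmogorov_distance([0, 1, 0], [0, 0, 1]))  # 0.0
```

Note the asymmetry: $\kappa(\mu, \nu) = 0$ exactly when $\mu \preceq \nu$.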
A function $P$ is called a stochastic kernel on $S$ if $P$ is a map from $S \times \mathcal{B}$ to $[0, 1]$ such that $x \mapsto P(x, B)$ is measurable for each $B \in \mathcal{B}$ and $B \mapsto P(x, B)$ is a probability measure on $(S, \mathcal{B})$ for each $x \in S$. At times we use the symbol $P(x, \cdot)$ to represent the distribution of the next state given current state $x$. A stochastic kernel $P$ on $S$ is called increasing if $P(x, \cdot) \preceq P(y, \cdot)$ whenever $x \preceq y$.
For a given stochastic kernel $P$ on $S$, we define the left and right Markov operators generated by $P$ via
$(\mu P)(B) := \int P(x, B) \, \mu(dx) \quad \text{and} \quad (P h)(x) := \int h(y) \, P(x, dy).$
(The left Markov operator maps $\mathscr{P}(S)$ to itself, while the right Markov operator acts on bounded measurable functions.) A discrete-time $S$-valued stochastic process $(X_t)_{t \geq 0}$ on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_t)_{t \geq 0}, \mathbb{P})$ is called Markov-$(P, \mu)$ if $X_0$ has distribution $\mu$ and
$\mathbb{P}\{X_{t+1} \in B \mid \mathcal{F}_t\} = P(X_t, B) \quad \text{for all } t \geq 0 \text{ and } B \in \mathcal{B}.$
2.2. Couplings
A coupling of $(\mu, \nu) \in \mathscr{P}(S) \times \mathscr{P}(S)$ is a probability measure $\gamma$ on $(S \times S, \mathcal{B} \otimes \mathcal{B})$ satisfying $\gamma(B \times S) = \mu(B)$ and $\gamma(S \times B) = \nu(B)$ for all $B \in \mathcal{B}$. Let $\Pi(\mu, \nu)$ denote the set of all couplings of $(\mu, \nu)$ and let
(2)  $\alpha(\mu, \nu) := \sup_{\gamma \in \Pi(\mu, \nu)} \gamma(\mathbb{G}).$
The value $\alpha(\mu, \nu)$ lies in $[0, 1]$ and can be understood as a measure of “partial stochastic dominance” of $\nu$ over $\mu$ [14]. By the Polish assumption and Strassen’s theorem [22, 15] we have
(3)  $\kappa(\mu, \nu) = 1 - \alpha(\mu, \nu).$
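On a finite ordered state space, the supremum in (2) is a small linear program over couplings, so the duality in (3) can be checked numerically. The sketch below (our illustration, assuming SciPy is available; the example distributions are hypothetical) maximizes the mass a coupling places on $\mathbb{G}$:

```python
import numpy as np
from scipy.optimize import linprog

def alpha(mu, nu):
    """Maximal mass a coupling of (mu, nu) can place on the graph
    G = {(i, j): i <= j}, for distributions on {0, ..., n-1} with the
    usual order, computed by linear programming."""
    n = len(mu)
    # Objective: maximize the mass on pairs (i, j) with i <= j.
    c = np.array([[-1.0 if i <= j else 0.0 for j in range(n)]
                  for i in range(n)]).ravel()
    A_eq, b_eq = [], []
    for i in range(n):              # row marginals must equal mu
        row = np.zeros((n, n)); row[i, :] = 1.0
        A_eq.append(row.ravel()); b_eq.append(mu[i])
    for j in range(n):              # column marginals must equal nu
        col = np.zeros((n, n)); col[:, j] = 1.0
        A_eq.append(col.ravel()); b_eq.append(nu[j])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return -res.fun

# Strassen duality: 1 - alpha(mu, nu) equals the one-sided Kolmogorov
# distance, here the largest difference of tail sums.
mu, nu = [0.2, 0.3, 0.5], [0.1, 0.6, 0.3]
kappa = max(0.0, max(sum(mu[k:]) - sum(nu[k:]) for k in range(3)))
print(abs((1 - alpha(mu, nu)) - kappa) < 1e-6)  # True
```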
Let $P$ be a stochastic kernel on $S$ and let $Q$ be a stochastic kernel on $S \times S$. We call $Q$ a Markov coupling of $P$ if $Q((x, y), \cdot)$ is a coupling of $(P(x, \cdot), P(y, \cdot))$ for all $(x, y) \in S \times S$. We call $Q$ an $\alpha$-maximal Markov coupling of $P$ if $Q$ is a Markov coupling of $P$ and, in addition,
(4)  $Q((x, y), \mathbb{G}) = \alpha(P(x, \cdot), P(y, \cdot)) \quad \text{for all } (x, y) \in S \times S.$
Lemma 2.1.
For any stochastic kernel $P$ on $S$, there exists an $\alpha$-maximal Markov coupling of $P$.
Proof.
By Theorem 1.1 of [23], given lower semicontinuous $f \colon S \times S \to \mathbb{R}_+$, there exists a stochastic kernel $Q$ on $S \times S$ such that $Q$ is a Markov coupling of $P$ and, in addition,
$\int f \, dQ((x, y), \cdot) = \inf_{\gamma \in \Pi(P(x, \cdot), P(y, \cdot))} \int f \, d\gamma \quad \text{for all } (x, y) \in S \times S.$
As $\mathbb{G}$ is closed, the indicator of its complement is lower semicontinuous, so this equality is attained when $f = \mathbb{1}\{\mathbb{G}^c\}$. Since $Q((x, y), \cdot)$ and the elements of $\Pi(P(x, \cdot), P(y, \cdot))$ are probability measures, we then have
$Q((x, y), \mathbb{G}) = \sup_{\gamma \in \Pi(P(x, \cdot), P(y, \cdot))} \gamma(\mathbb{G}) = \alpha(P(x, \cdot), P(y, \cdot)).$
Thus, $Q$ is an $\alpha$-maximal Markov coupling of $P$. ∎
2.3. Drift
Consider the geometric drift condition
(5)  $\int V(y) \, P(x, dy) \leq \lambda V(x) + b \quad \text{for all } x \in S,$
where $P$ is a stochastic kernel on $S$, $V$ is a measurable function from $S$ to $\mathbb{R}_+$, and $\lambda \in [0, 1)$ and $b \geq 0$ are constants. We fix $d > 2b/(1 - \lambda)$ and set
(6)  $\bar{\lambda} := \frac{1 + 2b + \lambda d}{1 + d} \quad \text{and} \quad \bar{\gamma} := 1 + d.$
Note that $\bar{\lambda} < 1$ precisely because $d > 2b/(1 - \lambda)$.
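As a concrete illustration of the drift condition (5) (our example, not the paper's): for the monotone AR(1) recursion $X_{t+1} = \rho X_t + \xi_{t+1}$ with $\rho \in [0, 1)$ and integrable shocks, $V(x) = |x|$ satisfies (5) with $\lambda = \rho$ and $b = \mathbb{E}|\xi_t|$, by the triangle inequality. A quick numerical check with uniform shocks:

```python
import numpy as np

rng = np.random.default_rng(0)
rho, shocks = 0.9, rng.uniform(-1.0, 1.0, size=100_000)
b = np.abs(shocks).mean()  # sample estimate of E|xi|

# Check E|rho*x + xi| <= rho*|x| + b at a grid of states.  The pointwise
# triangle inequality |rho*x + xi| <= rho*|x| + |xi| makes this hold for
# any shock sample, so the check always passes.
ok = all(
    np.abs(rho * x + shocks).mean() <= rho * abs(x) + b + 1e-12
    for x in np.linspace(-5, 5, 21)
)
print(ok)  # True
```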
Continuing to take $d$ as fixed, we set
(7)  $C := \{(x, y) \in S \times S : V(x) + V(y) \leq d\}.$
Let $Q$ be a Markov coupling of $P$ and let $(X_t, Y_t)_{t \geq 0}$ be Markov-$(Q, \gamma)$ on $S \times S$ for some initial distribution $\gamma \in \mathscr{P}(S \times S)$. We are interested in studying the number of visits to $C$, as given by
$J_t := \sum_{i=0}^{t} \mathbb{1}\{(X_i, Y_i) \in C\}.$
Lemma 2.2.
If $V$ satisfies the geometric drift condition (5), then, for all $j, t \in \mathbb{N}$ and all $\mu, \nu \in \mathscr{P}(S)$ with $\gamma \in \Pi(\mu, \nu)$ and $\mu(V) + \nu(V) < \infty$, we have
$\mathbb{P}\{J_{t-1} < j\} \leq \bar{\gamma}^{\, j} \, \bar{\lambda}^{\, t} \, (1 + \mu(V) + \nu(V)).$
3. Convergence Rates
Let $V$ be a measurable function from $S$ to $\mathbb{R}_+$ and let $P$ be a stochastic kernel on $S$ satisfying the geometric drift condition (5). Fix $d > 2b/(1 - \lambda)$ and let $\bar{\lambda}$ and $\bar{\gamma}$ be as defined in (6). Let $C$ be as given in (7). Let
(8)  $\varepsilon := \inf_{(x, y) \in C} \alpha(P(x, \cdot), P(y, \cdot)).$
We now state the main result.
Theorem 3.1.
If $P$ is increasing, then, for any $\mu, \nu \in \mathscr{P}(S)$ with $\mu(V) + \nu(V) < \infty$, we have
$\kappa(\mu P^t, \nu P^t) \leq (1 - \varepsilon)^j + \bar{\gamma}^{\, j} \, \bar{\lambda}^{\, t} \, (1 + \mu(V) + \nu(V)) \quad \text{for all } j, t \in \mathbb{N} \text{ with } j \leq t.$
Proof.
Given $\mu, \nu$ in Theorem 3.1, we let $Q$ be an $\alpha$-maximal Markov coupling of $P$ (existence of which follows from Lemma 2.1). Let $\gamma$ be any element of $\Pi(\mu, \nu)$ and let $(X_t, Y_t)_{t \geq 0}$ be Markov-$(Q, \gamma)$ on $S \times S$. We observe that the graph $\mathbb{G}$ of $\preceq$ is absorbing for $Q$. Indeed, if $x \preceq y$, then, since $P$ is increasing, $P(x, \cdot) \preceq P(y, \cdot)$. Hence, by (3), we have $\alpha(P(x, \cdot), P(y, \cdot)) = 1$. Applying (4) yields $Q((x, y), \mathbb{G}) = 1$.
Let $\tau$ be the stopping time with $\tau := \inf\{t \geq 0 : (X_t, Y_t) \in \mathbb{G}\}$. Since $\mathbb{G}$ is absorbing for $Q$, we have $(X_t, Y_t) \in \mathbb{G}$ whenever $t \geq \tau$. Let $h$ be any element of $ibS$ with $0 \leq h \leq 1$. Since $(X_t, Y_t)$ is Markov-$(Q, \gamma)$ and $\gamma$ is a coupling of $\mu$ and $\nu$, we have
$(\mu P^t)(h) - (\nu P^t)(h) = \mathbb{E}\,[h(X_t) - h(Y_t)].$
Since $h$ is increasing, $h(X_t) \leq h(Y_t)$ on $\mathbb{G}$, and this leads to
$\mathbb{E}\,[h(X_t) - h(Y_t)] \leq \mathbb{P}\{(X_t, Y_t) \notin \mathbb{G}\}.$
Since $t \geq \tau$ implies $(X_t, Y_t) \in \mathbb{G}$, we have $\{(X_t, Y_t) \notin \mathbb{G}\} \subset \{\tau > t\}$, and hence
(9)  $\kappa(\mu P^t, \nu P^t) \leq \mathbb{P}\{\tau > t\}.$
Now define $J_t := \sum_{i=0}^{t} \mathbb{1}\{(X_i, Y_i) \in C\}$, as in Section 2.3. Fixing $j, t \in \mathbb{N}$ with $j \leq t$, we have
(10)  $\mathbb{P}\{\tau > t\} \leq \mathbb{P}\{J_{t-1} < j\} + \mathbb{P}\{\tau > t, \; J_{t-1} \geq j\}.$
To bound the first term in (10), we set $h(x, y) := 1 + V(x) + V(y)$. Since $\gamma$ is a coupling of $\mu$ and $\nu$, we have
(11)  $\mathbb{E}\, h(X_0, Y_0) = 1 + \mu(V) + \nu(V).$
Hence, applying Lemma 2.2 to $(X_t, Y_t)_{t \geq 0}$ yields
(12)  $\mathbb{P}\{J_{t-1} < j\} \leq \bar{\gamma}^{\, j} \, \bar{\lambda}^{\, t} \, (1 + \mu(V) + \nu(V)).$
Regarding the second term in (10), we claim that
(13)  $\mathbb{P}\{\tau > t, \; J_{t-1} \geq j\} \leq (1 - \varepsilon)^j.$
To see this, suppose $T_1 < T_2 < \cdots$ are the times of the successive visits of $(X_t, Y_t)$ to $C$. That is, $T_1 := \inf\{t \geq 0 : (X_t, Y_t) \in C\}$ is the time of the first visit and
$T_{i+1} := \inf\{t > T_i : (X_t, Y_t) \in C\} \quad \text{for } i \geq 1.$
It is not difficult to see that $\{J_{t-1} \geq j\} \subset \{T_j \leq t - 1\}$. As a result,
(14)  $\mathbb{P}\{\tau > t, \; J_{t-1} \geq j\} \leq \mathbb{P}\{\tau > t, \; T_j \leq t - 1\}.$
Consider the set $\{\tau > t, \; T_j \leq t - 1\}$. If a path is in this set, then, as $\tau > t$, for any index $i$ with $i \leq j$ we have $T_i + 1 \leq T_j + 1 \leq t < \tau$, so $(X_{T_i + 1}, Y_{T_i + 1}) \notin \mathbb{G}$. In addition, $(X_{T_i}, Y_{T_i}) \in C$ for any such $i$, so
(15)  $\mathbb{P}\{\tau > t, \; T_j \leq t - 1\} \leq \mathbb{E} \prod_{i=1}^{j} \mathbb{1}\{(X_{T_i + 1}, Y_{T_i + 1}) \notin \mathbb{G}\}.$
Observe that
$\mathbb{E} \prod_{i=1}^{j} \mathbb{1}\{(X_{T_i + 1}, Y_{T_i + 1}) \notin \mathbb{G}\} = \mathbb{E}\left[ \prod_{i=1}^{j-1} \mathbb{1}\{(X_{T_i + 1}, Y_{T_i + 1}) \notin \mathbb{G}\} \; \mathbb{E}\big[ \mathbb{1}\{(X_{T_j + 1}, Y_{T_j + 1}) \notin \mathbb{G}\} \mid \mathcal{F}_{T_j} \big] \right].$
By the definition of $T_j$ we have $(X_{T_j}, Y_{T_j}) \in C$. Using this fact, the strong Markov property and the definition of $Q$ (see (4)) yields
$\mathbb{E}\big[ \mathbb{1}\{(X_{T_j + 1}, Y_{T_j + 1}) \notin \mathbb{G}\} \mid \mathcal{F}_{T_j} \big] = 1 - \alpha(P(X_{T_j}, \cdot), P(Y_{T_j}, \cdot)).$
Applying the definition of $\varepsilon$ in (8), we obtain $1 - \alpha(P(X_{T_j}, \cdot), P(Y_{T_j}, \cdot)) \leq 1 - \varepsilon$, so
$\mathbb{E} \prod_{i=1}^{j} \mathbb{1}\{(X_{T_i + 1}, Y_{T_i + 1}) \notin \mathbb{G}\} \leq (1 - \varepsilon) \; \mathbb{E} \prod_{i=1}^{j-1} \mathbb{1}\{(X_{T_i + 1}, Y_{T_i + 1}) \notin \mathbb{G}\}.$
Continuing to iterate backwards in this way yields $\mathbb{E} \prod_{i=1}^{j} \mathbb{1}\{(X_{T_i + 1}, Y_{T_i + 1}) \notin \mathbb{G}\} \leq (1 - \varepsilon)^j$. Combining this inequality with (14) and (15) verifies (13). The bound in Theorem 3.1 now follows from (9), (10), (12) and (13). ∎
4. Examples and Applications
In this section we consider some special cases, with a focus on (a) connections to the existing literature and (b) how to obtain an estimate of the value $\varepsilon$ in (8).
4.1. Connection to Total Variation Results
One special case is when $\preceq$ is the identity order, so that $x \preceq y$ if and only if $x = y$. For this order we have $ibS = bS$, so every stochastic kernel is increasing, and the Kolmogorov metric (see (1)) becomes the total variation distance. In this total variation setting, Theorem 3.1 is similar to standard geometric bounds for total variation distance, such as Theorem 1 in [20].
It is worth noting that, in the total variation setting, $\varepsilon$ in (8) is at least as large as the analogous term in Theorem 1 of [20]. Indeed, in [20], the value $\varepsilon$, which we now write as $\bar{\varepsilon}$ to avoid confusion, comes from an assumed minorization condition: there exist $\bar{\varepsilon} > 0$ and $\psi \in \mathscr{P}(S)$ such that
(16)  $P(x, B) \geq \bar{\varepsilon} \, \psi(B) \quad \text{for all } B \in \mathcal{B} \text{ and all } x \in C_0,$
where $C_0 \subset S$ satisfies $C \subset C_0 \times C_0$.
To compare $\bar{\varepsilon}$ with $\varepsilon$ defined in (8), suppose that this minorization condition holds and define the residual kernel $R(x, \cdot) := (P(x, \cdot) - \bar{\varepsilon} \psi)/(1 - \bar{\varepsilon})$. Fixing $(x, y) \in C$, we draw a pair $(X', Y')$ as follows: With probability $\bar{\varepsilon}$, draw $Z \sim \psi$ and set $X' = Y' = Z$. With probability $1 - \bar{\varepsilon}$, independently draw $X' \sim R(x, \cdot)$ and $Y' \sim R(y, \cdot)$. Simple arguments confirm that $X'$ is a draw from $P(x, \cdot)$ and $Y'$ is a draw from $P(y, \cdot)$. Recalling that $\preceq$ is the identity order, this leads to $\alpha(P(x, \cdot), P(y, \cdot)) \geq \mathbb{P}\{X' = Y'\} \geq \bar{\varepsilon}$. (The first bound is by the definition of $\alpha$ in (2) and the fact that the joint distribution of $(X', Y')$ is a coupling of $P(x, \cdot)$ and $P(y, \cdot)$ placing mass at least $\mathbb{P}\{X' = Y'\}$ on $\mathbb{G}$.) Since, in this discussion, the point $(x, y)$ was arbitrarily chosen from $C$, we conclude that $\bar{\varepsilon} \leq \varepsilon$, where $\varepsilon$ is as defined in (8).
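The splitting construction above can be checked numerically on a finite state space (our sketch, with hypothetical distributions): the largest possible $\bar{\varepsilon}$, the total overlap $\sum_i \min(p_i, q_i)$, is exactly the mass the construction places on the diagonal, and it equals one minus the total variation distance:

```python
import numpy as np

def diagonal_coupling_mass(p, q):
    """Mass on the diagonal achieved by the splitting construction:
    with probability eps-bar draw a common value from the normalized
    overlap, otherwise draw independently from the residual kernels."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.minimum(p, q).sum())  # eps-bar: total overlap mass

p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]
# Under the identity order, alpha(p, q) equals this overlap, which is
# also 1 minus the total variation distance.
eps_bar = diagonal_coupling_mass(p, q)
tv = 0.5 * sum(abs(a - b) for a, b in zip(p, q))
print(abs(eps_bar - (1 - tv)) < 1e-12)  # True
```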
4.2. Stochastic Recursive Sequences
The preceding section showed that Theorem 3.1 reduces to existing results for bounds on total variation distance when the partial order is the identity order. Now we show how Theorem 3.1 leads to new results in other settings, such as when $\preceq$ is a pointwise partial order. To this end, consider the process
(17)  $X_{t+1} = g(X_t, \xi_{t+1}),$
where $(\xi_t)_{t \geq 1}$ is an iid shock process taking values in some measurable space $Z$, and $g$ is a measurable function from $S \times Z$ to $S$. The common distribution of each $\xi_t$ is denoted by $\varphi$. We suppose that $g$ is increasing in its first argument, in the sense that $x \preceq y$ implies $g(x, z) \preceq g(y, z)$ for any fixed $z \in Z$. We let $P$ represent the stochastic kernel corresponding to (17), so that $P(x, B) = \varphi\{z \in Z : g(x, z) \in B\}$ for all $x \in S$ and $B \in \mathcal{B}$. Since $g$ is increasing, the kernel $P$ is increasing. Hence Theorem 3.1 applies. We can obtain a lower bound on $\varepsilon$ in (8) by calculating
(18)  $\tilde{\varepsilon} := \inf_{(x, y) \in C} (\varphi \otimes \varphi)\{(z, z') \in Z \times Z : g(x, z) \preceq g(y, z')\}.$
Indeed, if $\xi$ and $\xi'$ are drawn independently from $\varphi$, then $g(x, \xi)$ is a draw from $P(x, \cdot)$ and $g(y, \xi')$ is a draw from $P(y, \cdot)$. Hence
(19)  $\alpha(P(x, \cdot), P(y, \cdot)) \geq (\varphi \otimes \varphi)\{(z, z') : g(x, z) \preceq g(y, z')\}, \quad \text{so that } \varepsilon \geq \tilde{\varepsilon}.$
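The probability inside (18) can often be estimated by Monte Carlo. The following sketch (ours) uses a hypothetical increasing recursion $g(x, z) = 0.5x + z$ with uniform shocks; for $x = y$ the probability is exactly $\mathbb{P}\{Z \leq Z'\} = 1/2$ for continuous shocks:

```python
import numpy as np

def eps_tilde_mc(g, x, y, shock_sampler, n=200_000, seed=1):
    """Monte Carlo estimate of (phi x phi){(z, z'): g(x, z) <= g(y, z')},
    the inner probability in (18), using independent shock draws."""
    rng = np.random.default_rng(seed)
    z, z2 = shock_sampler(rng, n), shock_sampler(rng, n)
    return float(np.mean(g(x, z) <= g(y, z2)))

g = lambda x, z: 0.5 * x + z                      # hypothetical increasing map
draw = lambda rng, n: rng.uniform(0.0, 1.0, n)    # hypothetical shock law
est = eps_tilde_mc(g, 1.0, 1.0, draw)
print(abs(est - 0.5) < 0.01)  # close to P{Z <= Z'} = 1/2
```

Taking the minimum of such estimates over a grid of $(x, y)$ pairs in $C$ gives a (noisy) lower bound for $\tilde{\varepsilon}$.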
4.3. Example: TCP Window Size Process
4.4. Example: When Minorization Fails
We provide an elementary scenario where Theorem 3.1 provides a usable bound while the minorization-based methods described in Section 4.1 do not. Let $\mathbb{Q}$ be the rational numbers, let $S = \mathbb{R}$ with the usual order $\leq$, and assume that
$X_{t+1} = \rho X_t + \xi_{t+1}, \quad \text{where } \rho \in \mathbb{Q} \cap (0, 1).$
Let $C_0 \subset \mathbb{R}$ contain at least one rational and one irrational number. Let $\varphi$ be a measure on the Borel sets of $\mathbb{R}$ obeying $\varphi(\mathbb{Q}) = 1$, and let $P$ be the corresponding kernel, so that $P(x, B) = \varphi(B - \rho x)$ for all $x \in \mathbb{R}$ and Borel sets $B$. If $x$ is rational, then $X_{t+1}$ is rational with probability one, so $P(x, \mathbb{R} \setminus \mathbb{Q}) = 0$. Similarly, if $x$ is irrational, then $X_{t+1}$ is irrational with probability one, so $P(x, \mathbb{Q}) = 0$. Hence any $\psi$ satisfying (16) over $C_0$ obeys $\psi(\mathbb{Q}) = \psi(\mathbb{R} \setminus \mathbb{Q}) = 0$, so $\psi$ is the zero measure on $\mathcal{B}$. Thus, we cannot take an $\bar{\varepsilon} > 0$ and probability measure $\psi$ obeying the minorization condition (16). On the other hand, letting $V(x) := |x|$ and $\lambda := \rho$, $b := \mathbb{E}|\xi_t|$, so that the geometric drift condition (5) holds, the value $\tilde{\varepsilon}$ from (18) obeys $\tilde{\varepsilon} \geq (\varphi \otimes \varphi)\{(z, z') : z' - z \geq \rho d\}$, which is positive when the shock distribution has sufficiently broad support. Hence, by (19), the constant $\varepsilon$ in Theorem 3.1 is positive.
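The rational/irrational dichotomy above can be demonstrated in exact arithmetic (our sketch; the values of $\rho$ and the shocks are hypothetical): representing a state $a + b\sqrt{2}$ by the rational pair $(a, b)$, a rational start keeps $b = 0$ forever, while an irrational start keeps $b \neq 0$, so the two orbits live in disjoint sets and no common minorizing measure exists:

```python
from fractions import Fraction

# State a + b*sqrt(2) stored as the exact rational pair (a, b).
def step(state, rho, shock):
    a, b = state
    # rho*(a + b*sqrt(2)) + shock = (rho*a + shock) + (rho*b)*sqrt(2)
    return (rho * a + shock, rho * b)

rho = Fraction(1, 2)
shocks = [Fraction(1, 3), Fraction(-2, 5), Fraction(7, 4)]

rational = (Fraction(1), Fraction(0))    # start at 1
irrational = (Fraction(1), Fraction(1))  # start at 1 + sqrt(2)
for z in shocks:
    rational = step(rational, rho, z)
    irrational = step(irrational, rho, z)

# The rational orbit stays rational (b = 0); the irrational orbit keeps
# b != 0, so it never visits the rationals: the two transition laws are
# mutually singular and the minorization (16) fails.
print(rational[1] == 0, irrational[1] != 0)  # True True
```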
4.5. Example: Wealth Dynamics
Many economic models examine wealth dynamics in the presence of credit market imperfections (see, e.g., [1]). These often result in dynamics of the form
(20)  $W_{t+1} = R_{t+1} \, s(W_t) + Y_{t+1}.$
Here $W_t$ is some measure of household wealth, $s$ is a function from $\mathbb{R}_+$ to itself, and the return and income processes $(R_t)_{t \geq 1}$ and $(Y_t)_{t \geq 1}$ are independent iid $\mathbb{R}_+$-valued sequences. The function $s$ is increasing, since greater current wealth relaxes borrowing constraints and increases financial income. We assume that there exists a $\lambda \in [0, 1)$ such that $\mathbb{E}[R_t] \, s(w) \leq \lambda w$ for all $w \in \mathbb{R}_+$, and, in addition, that $b := \mathbb{E}[Y_t] < \infty$.
Let $P$ be the stochastic kernel corresponding to (20). With $V(w) := w$, we have
(21)  $\int V(w') \, P(w, dw') = \mathbb{E}[R_t] \, s(w) + \mathbb{E}[Y_t] \leq \lambda V(w) + b,$
so the geometric drift condition (5) holds.
Fixing $d > 2b/(1 - \lambda)$ and setting $C := \{(w, w') : w + w' \leq d\}$, we can obtain $\tilde{\varepsilon}$ in (18) via
$\tilde{\varepsilon} = \inf_{(w, w') \in C} \mathbb{P}\{R \, s(w) + Y \leq R' \, s(w') + Y'\},$
where $(R, Y)$ and $(R', Y')$ are independent draws from the common distribution of $(R_t, Y_t)$.
This term will be strictly positive under suitable conditions, such as when $Y_t$ has a sufficiently large support. Combining (19) and (21) with the bound in Theorem 3.1, we have, for any $w$ and $w'$ in $\mathbb{R}_+$ and any $j, t \in \mathbb{N}$ with $j \leq t$,
$\kappa\big(P^t(w, \cdot), \, P^t(w', \cdot)\big) \leq (1 - \tilde{\varepsilon})^j + \bar{\gamma}^{\, j} \, \bar{\lambda}^{\, t} \, (1 + w + w'),$
where $\bar{\lambda}$ and $\bar{\gamma}$ are as defined in (6).
Notice that, for this model, the lack of smooth mixing and continuity implies that neither total variation nor Wasserstein distance bounds can be computed without additional assumptions.
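A simulation sketch of (20) under hypothetical primitives (ours: $s(w) = 0.5w$, $R_t$ uniform on $[0.8, 1.2]$, $Y_t$ exponential with mean one, so that $\lambda = 0.5$ and $b = 1$): starting two chains from very different wealth levels, the empirical version of the Kolmogorov distance in (1) is small after a moderate number of steps, consistent with the geometric bound:

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate(w0, T, n):
    """Simulate n paths of W' = R * s(W) + Y with s(w) = 0.5 w (a
    hypothetical increasing savings function), returns R ~ U[0.8, 1.2],
    and income Y ~ Exp(1); return the time-T cross-section."""
    w = np.full(n, float(w0))
    for _ in range(T):
        R = rng.uniform(0.8, 1.2, n)
        Y = rng.exponential(1.0, n)
        w = R * (0.5 * w) + Y
    return w

def kappa_hat(a, b, grid):
    """Empirical one-sided Kolmogorov distance max_x {P(a >= x) - P(b >= x)}."""
    return max(0.0, max(np.mean(a >= x) - np.mean(b >= x) for x in grid))

hi, lo = simulate(50.0, 25, 4000), simulate(1.0, 25, 4000)
grid = np.linspace(0.0, 10.0, 101)
print(kappa_hat(hi, lo, grid) < 0.1)  # the two cross-sections have nearly merged
```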
References
- [1] Antonio Antunes and Tiago Cavalcanti. Start-up costs, limited enforcement, and the hidden economy. European Economic Review, 51:203–224, 2007.
- [2] Jean-Baptiste Bardet, Alejandra Christen, Arnaud Guillin, Florent Malrieu, and Pierre-André Zitt. Total variation estimates for the TCP process. Electronic Journal of Probability, 18(10):1–21, 2013.
- [3] Rabi Bhattacharya and Mukul Majumdar. On a theorem of Dubins and Freedman. Journal of Theoretical Probability, 12:1067–1087, 1999.
- [4] Rabi Bhattacharya, Mukul Majumdar, and Nigar Hashimzade. Limit theorems for monotone Markov processes. Sankhya A, 72:170–190, 2010.
- [5] Rabi N Bhattacharya and Oesook Lee. Asymptotics of a class of Markov processes which are not in general irreducible. The Annals of Probability, pages 1333–1347, 1988.
- [6] Djalil Chafaï, Florent Malrieu, and Katy Paroux. On the long time behavior of the TCP window size process. Stochastic Processes and their Applications, 120(8):1518–1534, 2010.
- [7] Sergey Foss, Vsevolod Shneer, Jonathan P Thomas, and Tim Worrall. Stochastic stability of monotone economies in regenerative environments. Journal of Economic Theory, 173:334–360, 2018.
- [8] Serguei Foss and Takis Konstantopoulos. An overview of some stochastic stability methods. Journal of the Operations Research Society of Japan, 47(4):275–303, 2004.
- [9] Julia Gaudio, Saurabh Amin, and Patrick Jaillet. Exponential convergence rates for stochastically ordered Markov processes with random initial conditions. arXiv preprint arXiv:1810.07732, 2018.
- [10] Robert E Gaunt and Siqi Li. Bounding Kolmogorov distances through Wasserstein and related integral probability metrics. Journal of Mathematical Analysis and Applications, 522(1):126985, 2023.
- [11] Yu Hang Jiang, Tong Liu, Zhiya Lou, Jeffrey S Rosenthal, Shanshan Shangguan, Fei Wang, and Zixuan Wu. The coupling/minorization/drift approach to Markov chain convergence rates. Notices of the American Mathematical Society, 68(4), 2021.
- [12] Takashi Kamihigashi and John Stachurski. Stochastic stability in monotone economies. Theoretical Economics, 9(2):383–407, 2014.
- [13] Takashi Kamihigashi and John Stachurski. A unified stability theory for classical and monotone Markov chains. Journal of Applied Probability, 56(1):1–22, 2019.
- [14] Takashi Kamihigashi and John Stachurski. Partial stochastic dominance via optimal transport. Operations Research Letters, 48(5):584–586, 2020.
- [15] Torgny Lindvall. Lectures on the coupling method. Dover, 2002.
- [16] Robert B Lund, Sean P Meyn, and Richard L Tweedie. Computable exponential convergence rates for stochastically ordered Markov processes. The Annals of Applied Probability, 6(1):218–237, 1996.
- [17] Ravi Montenegro and Prasad Tetali. Mathematical aspects of mixing times in Markov chains. Foundations and Trends in Theoretical Computer Science, 1(3):237–354, 2006.
- [18] Qian Qin and James P Hobert. Geometric convergence bounds for Markov chains in Wasserstein distance based on generalized drift and contraction conditions. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 58(2):872–889, 2022.
- [19] Yanlin Qu, Jose Blanchet, and Peter Glynn. Computable bounds on convergence of Markov chains in Wasserstein distance. arXiv preprint arXiv:2308.10341, 2023.
- [20] Jeffrey S Rosenthal. Quantitative convergence rates of Markov chains: A simple account. Electronic Communications in Probability, 7:123–128, 2002.
- [21] Jeffrey S Rosenthal. How Markov’s little idea transformed statistics. Handbook of the History and Philosophy of Mathematical Practice, pages 1–11, 2023.
- [22] Volker Strassen. The existence of probability measures with given marginals. The Annals of Mathematical Statistics, 36(2):423–439, 1965.
- [23] Shaoyi Zhang. Existence and application of optimal Markovian coupling with respect to non-negative lower semi-continuous functions. Acta Mathematica Sinica, 16(2):261–270, 2000.
Appendix A Proof of Lemma 2.2
Let the conditions of Lemma 2.2 hold and let $Q$, $(X_t, Y_t)_{t \geq 0}$ and $(J_t)_{t \geq 0}$ be as described above. We assume in what follows that $\mu(V) + \nu(V)$ is finite, since otherwise the bound in Lemma 2.2 is trivial. Setting $h(x, y) := 1 + V(x) + V(y)$, we have
(22)  $\mathbb{E}\big[ h(X_{t+1}, Y_{t+1}) \mid \mathcal{F}_t \big] \leq \lambda \, h(X_t, Y_t) + 1 - \lambda + 2b.$
Now define
$M_t := \bar{\gamma}^{\, -J_{t-1}} \, \bar{\lambda}^{\, -t} \, h(X_t, Y_t), \quad \text{with } J_{-1} := 0.$
We claim that $(M_t)_{t \geq 0}$ is an $(\mathcal{F}_t)$-supermartingale. Clearly $(M_t)$ is adapted. To see that $\mathbb{E}[M_{t+1} \mid \mathcal{F}_t] \leq M_t$ holds $\mathbb{P}$-almost surely (this inequality also implies integrability of each $M_t$, because then $\mathbb{E} M_t \leq \mathbb{E} M_0 = 1 + \mu(V) + \nu(V)$, which is finite by assumption), let $h_t := h(X_t, Y_t)$, so that
(23)  $\mathbb{E}[M_{t+1} \mid \mathcal{F}_t] = \bar{\gamma}^{\, -J_t} \, \bar{\lambda}^{\, -(t+1)} \, \mathbb{E}[h_{t+1} \mid \mathcal{F}_t].$
On $\{(X_t, Y_t) \notin C\}$ we have $h_t > 1 + d$, so, by (22), $\mathbb{E}[h_{t+1} \mid \mathcal{F}_t] \leq \bar{\lambda} h_t$. Therefore,
$\bar{\lambda}^{\, -(t+1)} \, \mathbb{E}[h_{t+1} \mid \mathcal{F}_t] \leq \bar{\lambda}^{\, -t} h_t.$
Also, on $\{(X_t, Y_t) \notin C\}$ we have $J_t = J_{t-1}$. Using this fact and (23) yields
(24)  $\mathbb{E}[M_{t+1} \mid \mathcal{F}_t] \leq M_t \quad \text{on } \{(X_t, Y_t) \notin C\}.$
Turning to the set $\{(X_t, Y_t) \in C\}$, observe that on this set we have $V(X_t) + V(Y_t) \leq d$, so, using (22) again,
$\mathbb{E}[h_{t+1} \mid \mathcal{F}_t] \leq 1 + 2b + \lambda d = \bar{\gamma} \bar{\lambda}.$
Therefore, $\bar{\gamma}^{\, -1} \bar{\lambda}^{\, -1} \, \mathbb{E}[h_{t+1} \mid \mathcal{F}_t] \leq 1$. Combining this bound with the fact that $J_t = J_{t-1} + 1$ on $\{(X_t, Y_t) \in C\}$ yields
$\mathbb{E}[M_{t+1} \mid \mathcal{F}_t] = \bar{\gamma}^{\, -J_{t-1} - 1} \, \bar{\lambda}^{\, -(t+1)} \, \mathbb{E}[h_{t+1} \mid \mathcal{F}_t] \leq \bar{\gamma}^{\, -J_{t-1}} \, \bar{\lambda}^{\, -t} \leq M_t,$
where the last inequality used $h_t \geq 1$. Together with (24) and (23), this inequality gives $\mathbb{E}[M_{t+1} \mid \mathcal{F}_t] \leq M_t$ on all of $\Omega$, so $(M_t)$ is a supermartingale as claimed.
Now fix $j, t \in \mathbb{N}$. Since $h_t \geq 1$ and $\bar{\gamma} \geq 1$, on $\{J_{t-1} < j\}$ we have $M_t \geq \bar{\gamma}^{\, -j} \, \bar{\lambda}^{\, -t}$, and hence
$\mathbb{P}\{J_{t-1} < j\} \leq \mathbb{P}\big\{ M_t \geq \bar{\gamma}^{\, -j} \, \bar{\lambda}^{\, -t} \big\}.$
From Chebyshev's inequality and the supermartingale property, the last term is dominated by
$\bar{\gamma}^{\, j} \, \bar{\lambda}^{\, t} \, \mathbb{E} M_t \leq \bar{\gamma}^{\, j} \, \bar{\lambda}^{\, t} \, \mathbb{E} M_0.$
The last term is just $\bar{\gamma}^{\, j} \, \bar{\lambda}^{\, t} \, (1 + \mu(V) + \nu(V))$, so the claim in Lemma 2.2 is now proved. ∎