Contraction of Markovian Operators in Orlicz Spaces
and Error Bounds for Markov Chain Monte Carlo
Abstract
We introduce a novel concept of convergence for Markovian processes within Orlicz spaces, extending beyond the conventional approach associated with $\mathsf{L}_p$ spaces. After showing that Markovian operators are contractive in Orlicz spaces, our key technical contribution is an upper bound on their contraction coefficient, which admits a closed-form expression. The bound is tight in some settings, and it recovers well-known results, such as the connection between contraction and ergodicity, ultra-mixing and Doeblin's minorisation. Specialising our approach to $\mathsf{L}_p$ spaces leads to a significant improvement upon classical Riesz-Thorin interpolation methods. Furthermore, by exploiting the flexibility offered by Orlicz spaces, we can tackle settings where the stationary distribution is heavy-tailed, a severely under-studied setup. As an application of the framework put forward in the paper, we introduce tighter bounds on the mixing time of Markovian processes, better exponential concentration bounds for MCMC methods, and better lower bounds on the burn-in period. To conclude, we show how our results can be used to prove the concentration of measure phenomenon for a sequence of Markovian random variables.
Index Terms:
Markovian operators, Orlicz spaces, Contraction, MCMC, McDiarmid, Burn-in
I Introduction
††footnotetext: Accepted for presentation at the Conference on Learning Theory (COLT) 2024
The topic of bounding the contraction coefficient of Markovian operators in $\mathsf{L}_p$ spaces and, specifically, $\mathsf{L}_2$ spaces has a rich history, see the survey by [1]. The motivation comes from the fact that, whenever the contraction coefficient is strictly smaller than $1$, the corresponding Markovian process converges exponentially fast to the stationary distribution [2]. Moreover, the characterisation of the contraction coefficient also yields upper bounds on the so-called “mixing times” of the process, i.e., the number of steps required to be close to the stationary distribution. This has practical implications in Markov Chain Monte Carlo (MCMC) methods: MCMC allows estimating integrals with respect to a certain measure when said measure is either inaccessible or cannot be sampled efficiently, by designing a Markov chain that approximates it [3, 4, 2]. A big effort has thus been devoted to understanding how the estimation error behaves and how long the Markov chain needs to run to get close enough to the limiting distribution (burn-in period) [3, 4, 2].
Most of the literature focuses on asymptotic analyses [3] or mean-squared error bounds [2, 5], and it is restricted to finite-state space Markov chains. A different line of work tackles the problem by providing concentration inequalities for dependent random variables [6, 7, 8, 9, 10, 11, 12], which lead to high-probability error bounds and characterisations of the burn-in period. We highlight that all these results rely on the contraction coefficient of the Markov kernel associated with the chain. Similarly, but through a different route, [13] proves that one can control the mixing times of a Markov chain by studying the strong data-processing inequality (SDPI) constant of various divergences.
Our work delivers new and improved bounds on these fundamental quantities by stepping away from the more classical framework and considering Orlicz spaces, a generalisation of $\mathsf{L}_p$ spaces. We show that, as in the $\mathsf{L}_p$ case, Markov kernels are contractive in Orlicz spaces, i.e., the contraction coefficients of Markov kernels are less than or equal to $1$. We can then define a notion of convergence of Markov processes, which depends on the corresponding contraction coefficient. In particular, our main result, Theorem 3, provides an upper bound for the contraction coefficient of Markovian operators, which admits a closed-form expression. The key technical novelties come from duality considerations: the convergence of a Markovian process determined by a kernel $K$ depends on the contraction coefficient of its dual $K^\star$, which can in turn be bounded by considering appropriate nested norms of densities of the kernel with respect to the stationary measure.
Our approach stands out as the first of its kind, as it does not rely on the existence of a spectral gap. In contrast, most existing bounds are based on the spectral gap [2, 11] and, thus, hold in $\mathsf{L}_2$. To handle $\mathsf{L}_p$ spaces, the typical strategy is then to leverage Riesz-Thorin interpolation, which often leads to vacuous bounds. In contrast, Theorem 3 evades said restrictions by considering directly the contraction coefficient in $\mathsf{L}_p$, thus vastly improving upon the state of the art. Leveraging arbitrary Orlicz spaces allows us to also consider distributions with heavy tails, which are hard to handle with classical techniques [14, 15], see Remark 4.
Theorem 3 recovers well-known theorems in the literature, i.e., the connection between contraction and ergodicity (Corollary 1), as well as ultra-mixing and Doeblin’s minorisation (Corollary 2) [16, 17, 13]. As an application of our analysis, we give better bounds on the mixing times of Markovian processes (Section IV-A) and better convergence guarantees for MCMC methods (Section IV-B), which lead to more refined bounds on the burn-in period [2, 11]. Finally, by exploiting the strategy developed by [12], we provide improved concentration results for a sequence of dependent random variables, where the dependence is captured by a Markovian process.
II Preliminaries
We adopt a measure-theoretic framework. Given a measurable space $(\mathcal{X}, \mathcal{F})$ and two measures $\mu, \nu$ making it a measure space, if $\mu$ is absolutely continuous w.r.t. $\nu$ ($\mu \ll \nu$), we represent with $\frac{d\mu}{d\nu}$ the Radon-Nikodym derivative of $\mu$ w.r.t. $\nu$. Given a (measurable) function $f$ and a measure $\mu$, we denote by $\mu(E)$ the measure of an event $E$ and with $\mu(f) = \int f \, d\mu$ the Lebesgue integral of $f$ with respect to the measure $\mu$.
II-A Markov kernels and Markov chains
Definition 1 (Markov kernel).
Let $(\mathcal{X}, \mathcal{F})$ be a measurable space. A Markov kernel is a mapping $K : \mathcal{X} \times \mathcal{F} \to [0, 1]$ such that (i) for every $x \in \mathcal{X}$, the mapping $A \mapsto K(x, A)$ is a probability measure on $(\mathcal{X}, \mathcal{F})$, and (ii) for every $A \in \mathcal{F}$, the mapping $x \mapsto K(x, A)$ is an $\mathcal{F}$-measurable real-valued function.
A Markov kernel acts on measures “from the right”, i.e., given a measure $\mu$ on $(\mathcal{X}, \mathcal{F})$, $(\mu K)(A) = \int K(x, A) \, d\mu(x)$, and on functions “from the left”, i.e., given a bounded measurable function $f$, $(Kf)(x) = \int f(y) \, K(x, dy)$. Hence, one can see $K$ as a mapping acting on functions. Moreover, an application of Jensen's inequality gives that $F(Kf) \le K F(f)$ for every convex $F$, thus the mapping $f \mapsto Kf$ can be extended beyond bounded functions.
Given a Markov kernel $K$ and a measure $\pi$, let $K^\star$ denote the dual operator. By definition, the dual kernel can be seen as acting on functions and returning functions. In particular, given a probability measure $\pi$ and two measurable functions $f, g$, one can define the inner product $\langle f, g \rangle_\pi = \int f g \, d\pi$, and $K^\star$ is the operator such that $\langle K f, g \rangle_\pi = \langle f, K^\star g \rangle_\pi$, for all $f$ and $g$. In discrete settings, the dual operator can be explicitly characterised via $\pi$ and $K$ as $K^\star(y, x) = \pi(x) K(x, y) / \pi(y)$, see [13, Eq. (1.2)].
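The explicit discrete characterisation of the dual can be checked numerically. A minimal sketch (the 3-state matrix is an illustrative choice, not taken from the paper): we build the dual w.r.t. the stationary distribution and verify the adjoint identity.

```python
import numpy as np

# Illustrative 3-state kernel (rows sum to 1).
K = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# Stationary distribution: left eigenvector of K for eigenvalue 1.
w, V = np.linalg.eig(K.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1))])
pi = pi / pi.sum()

# Dual kernel w.r.t. pi: K_star[y, x] = pi[x] * K[x, y] / pi[y].
K_star = (pi[None, :] * K.T) / pi[:, None]

# Adjoint identity: <K f, g>_pi = <f, K_star g>_pi for all f, g.
rng = np.random.default_rng(0)
f, g = rng.standard_normal(3), rng.standard_normal(3)
lhs = np.sum(pi * (K @ f) * g)
rhs = np.sum(pi * f * (K_star @ g))
assert abs(lhs - rhs) < 1e-10
```

Note that the rows of `K_star` sum to one, i.e., the dual is again Markovian, because `pi` is stationary for `K`.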
Given a sequence of random variables $(X_n)_{n \ge 1}$, one says that it represents a Markov chain if, given $n \ge 1$, there exists a Markov kernel $K_n$ such that for every measurable event $A$:
$\mathbb{P}\left( X_{n+1} \in A \mid X_1, \ldots, X_n \right) = K_n(X_n, A).$ (1)
If $K_n = K$ for every $n$, for some Markov kernel $K$, then the Markov chain is time-homogeneous, and we suppress the index from $K_n$. The kernel of a Markov chain is the probability of getting from $x$ to $A$ in one step, i.e., for every $x \in \mathcal{X}$ and $A \in \mathcal{F}$, $K(x, A) = \mathbb{P}(X_{n+1} \in A \mid X_n = x)$. One can then define (inductively) the $n$-step kernel as $K^n(x, A) = \int K^{n-1}(y, A) \, K(x, dy)$. Note that $K^n$ is also a Markov kernel, and it represents the probability of getting from $x$ to $A$ in $n$ steps: $K^n(x, A) = \mathbb{P}(X_{m+n} \in A \mid X_m = x)$. If $(X_n)_{n \ge 1}$ is the Markov chain associated to the kernel $K$ and $X_1 \sim \mu$, then $\mu K^{n-1}$ denotes the measure of $X_n$ at every $n$. Furthermore, a probability measure $\pi$ is a stationary measure for $K$ if $(\pi K)(A) = \pi(A)$ for every measurable event $A$. When the state space is discrete, $K$ can be represented via a stochastic matrix.
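In the discrete case, the objects above reduce to matrix operations; a minimal sketch (the 2-state matrix and its stationary distribution are illustrative):

```python
import numpy as np

# A discrete Markov kernel is a row-stochastic matrix: K[x, y] = K(x, {y}).
K = np.array([[0.9, 0.1],
              [0.4, 0.6]])
assert np.allclose(K.sum(axis=1), 1.0)

# n-step kernel via matrix powers; K^n is again a Markov kernel.
K5 = np.linalg.matrix_power(K, 5)
assert np.allclose(K5.sum(axis=1), 1.0)

# Stationary measure: pi K = pi. Here pi = (0.8, 0.2).
pi = np.array([0.8, 0.2])
assert np.allclose(pi @ K, pi)

# A measure mu evolves "from the right": mu K^n converges to pi.
mu = np.array([0.0, 1.0])
mu_n = mu @ np.linalg.matrix_power(K, 50)
assert np.allclose(mu_n, pi, atol=1e-6)
```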
Given a Markov kernel $K$ and a measure $\pi$, one can define the corresponding dual $K^\star$ [16] and show the following result, which is pivotal for characterising the convergence of a Markovian process to the stationary measure. The proof is deferred to Appendix A-B.
Lemma 1.
Let $\pi$ be a positive measure, $\mu$ a positive measure s.t. $\mu \ll \pi$, and $g$ the Radon-Nikodym derivative $\frac{d\mu}{d\pi}$. Let $K$ be a Markov kernel and $K^\star$ its dual. Then,
$\frac{d(\mu K)}{d(\pi K)} = K^\star g,$ (2)
where $K^\star$ is the operator such that, given two functions $f, g$, one has that $\langle f, K^\star g \rangle_{\pi K} = \langle K f, g \rangle_{\pi}.$
Furthermore, $K^\star$ can be proved to be Markovian [16, Section 2, Lemma 2.2].
II-B Orlicz spaces
Definition 2 (Young Function, [18, Chapter 3]).
Let $F : [0, \infty) \to [0, \infty]$ be a convex, non-decreasing function s.t. $F(0) = 0$ and $\lim_{x \to \infty} F(x) = \infty$. Then, $F$ is called a Young function.
Definition 3 ([18, Definition 5]).
Let $F$ be a Young function. Let $(\mathcal{X}, \mathcal{F}, \mu)$ be a measure space and denote by $\mathsf{L}^0(\mu)$ the space of all the $\mu$-measurable and real-valued functions on $\mathcal{X}$. An Orlicz space can be defined as the following set:
$\mathsf{L}_F(\mu) = \left\{ f \in \mathsf{L}^0(\mu) : \exists\, \lambda > 0 \text{ s.t. } \int F\!\left( \frac{|f|}{\lambda} \right) d\mu < \infty \right\}.$ (3)
Given a Young function $F$, the complementary function to $F$ in the sense of Young, denoted as $F^\star$, is defined as follows [18, Section 1.3]:
$F^\star(y) = \sup_{x \ge 0} \left( x y - F(x) \right).$ (4)
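The Young conjugate $F^\star(y) = \sup_{x \ge 0}(xy - F(x))$ can be approximated numerically; a grid-based sketch (the choice $F(x) = x^p/p$ is illustrative, with conjugate $y^q/q$ for $1/p + 1/q = 1$):

```python
import numpy as np

# Young conjugate F*(y) = sup_{x >= 0} (x*y - F(x)), computed on a grid.
def young_conjugate(F, y, xs=np.linspace(0.0, 50.0, 200001)):
    return np.max(xs * y - F(xs))

# For F(x) = x^p / p the conjugate is F*(y) = y^q / q with 1/p + 1/q = 1.
p, q = 3.0, 1.5
F = lambda x: x**p / p
errs = [abs(young_conjugate(F, y) - y**q / q) for y in (0.5, 1.0, 2.0)]
assert max(errs) < 1e-3
```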
An Orlicz space can be endowed with several norms that render it a Banach space [19]: the Luxemburg norm, the Orlicz norm, and the Amemiya norm. We will henceforth ignore the Orlicz norm, as it is equivalent to the Amemiya norm [19], and we define the Luxemburg norm $\|\cdot\|_{(F)}$ and the Amemiya norm $\|\cdot\|_{F}$ as
$\|f\|_{(F)} = \inf\left\{ \lambda > 0 : \int F\!\left( \frac{|f|}{\lambda} \right) d\mu \le 1 \right\}, \qquad \|f\|_{F} = \inf_{\lambda > 0} \frac{1 + \int F(\lambda |f|) \, d\mu}{\lambda}.$ (5)
Orlicz spaces and the corresponding norms recover well-known objects (e.g., $\mathsf{L}_p$-norms) and characterise random variables according to their “tail”. For more details, see Appendix A-A.
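The Luxemburg norm of Equation 5 can be computed for a discrete measure by bisection on the defining infimum; a minimal sketch (the measure and function are illustrative, and the recovery of the $\mathsf{L}_p$ norm for $F(x) = x^p$ is the sanity check):

```python
import numpy as np

# Luxemburg norm ||f||_F = inf{ lam > 0 : sum_x mu(x) F(|f(x)|/lam) <= 1 },
# computed by bisection (the level function is non-increasing in lam).
def luxemburg_norm(F, f, mu, lo=1e-8, hi=1e8, iters=200):
    level = lambda lam: np.sum(mu * F(np.abs(f) / lam))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if level(mid) > 1 else (lo, mid)
    return hi

# With F(x) = x^p the Luxemburg norm coincides with the L_p(mu) norm.
mu = np.array([0.2, 0.5, 0.3])
f = np.array([1.0, -2.0, 3.0])
p = 4.0
lp = np.sum(mu * np.abs(f) ** p) ** (1 / p)
val = luxemburg_norm(lambda x: x ** p, f, mu)
assert abs(val - lp) < 1e-6
```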
III Contraction of Markov kernels
The contractivity of Markov operators has been studied in particular for $\mathsf{L}_p$ spaces, with [16, 20] considering Orlicz spaces as well. As we prove results for both Amemiya and Luxemburg norms, we use $\|\cdot\|_F$ to denote both norms and we use $\|\cdot\|_{F^\star}$ for the corresponding dual norm (e.g., if $\|\cdot\|_F$ is an Amemiya norm, then the dual norm $\|\cdot\|_{F^\star}$ is the Luxemburg norm associated with the complementary function $F^\star$).
Definition 4.
Let $F$ be a Young function, $\pi$ a probability measure, and $K$ a Markov kernel. The contraction coefficients of $K$ in the spaces $\mathsf{L}_F(\pi)$ and $\mathsf{L}^0_F(\pi)$ are denoted as follows:
$\gamma\!\left( K|_{V} \right) = \sup_{f \in V, \, \|f\|_F \ne 0} \frac{\| K f \|_F}{\| f \|_F}, \qquad V \in \left\{ \mathsf{L}_F(\pi), \, \mathsf{L}^0_F(\pi) \right\},$ (6)
where $\mathsf{L}^0_F(\pi)$ denotes the closed subspace of functions with mean $0$, i.e., $\mathsf{L}^0_F(\pi) = \{ f \in \mathsf{L}_F(\pi) : \pi(f) = 0 \}$. Whenever the measure is clear from the context, for ease of notation, we denote the contraction coefficient as $\gamma(K|_{\mathsf{L}_F})$ or $\gamma(K|_{\mathsf{L}^0_F})$.
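For a discrete kernel, the contraction coefficient on the mean-zero subspace of $\mathsf{L}_2(\pi)$ can be computed exactly via linear algebra, since $f \mapsto \sqrt{\pi} \odot f$ is an isometry from $\mathsf{L}_2(\pi)$ into Euclidean space; a sketch (the 2-state kernel is an illustrative, reversible choice):

```python
import numpy as np

# L2(pi) contraction coefficient of K restricted to mean-zero functions:
# gamma = || D K D^{-1} (I - v v^T) ||_2, with D = diag(sqrt(pi)), v = sqrt(pi).
def l2_contraction(K, pi):
    d = np.sqrt(pi)
    A = (d[:, None] * K) / d[None, :]      # K conjugated by the isometry
    P = np.eye(len(pi)) - np.outer(d, d)   # projection orthogonal to constants
    return np.linalg.norm(A @ P, 2)        # spectral norm

K = np.array([[0.9, 0.1],
              [0.4, 0.6]])
pi = np.array([0.8, 0.2])
g = l2_contraction(K, pi)
assert g <= 1 + 1e-12            # Markov operators are contractive
# For this reversible 2-state chain, gamma equals the second eigenvalue 0.5.
assert abs(g - 0.5) < 1e-9
```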
A characterisation of how Markovian operators behave in Orlicz spaces is stated below and proved in Appendix C-A.
Theorem 1.
Let $F$ be a Young function, $\pi$ a probability measure, and $K$ a Markov kernel with stationary distribution $\pi$. Then, for every $f \in \mathsf{L}_F(\pi)$,
$\| K f \|_F \le \| f \|_F.$ (7)
The key takeaway of Theorem 1 is that, as in $\mathsf{L}_p$ spaces, Markov kernels are contractive, but on the general space of $F$-integrable functions said contraction coefficient is trivially equal to $1$ (consider constant functions, which are left unchanged by $K$). Restricting to the closed subspace of functions with mean $0$ (i.e., $\mathsf{L}^0_F(\pi)$) does yield an improvement. In fact, the contraction coefficient of a Markov kernel in $\mathsf{L}^0_F(\pi)$ is typically strictly less than $1$. This observation is fundamental to characterise convergence to the stationary distribution. Indeed, if $\gamma(K|_{\mathsf{L}^0_2(\pi)}) < 1$, then $K$ admits an $\mathsf{L}_2$-spectral gap $\beta = 1 - \gamma(K|_{\mathsf{L}^0_2(\pi)})$, which implies “$\mathsf{L}_2$-geometric ergodicity”, as stated below [2, Proposition 3.12].
Proposition 1.
Let $K$ be a Markov kernel with stationary distribution $\pi$ and $\mathsf{L}_2$-spectral gap $\beta$. For every probability measure $\mu$ on the same space s.t. $\frac{d\mu}{d\pi} \in \mathsf{L}_2(\pi)$, we have
$\left\| \frac{d(\mu K^n)}{d\pi} - 1 \right\|_{\mathsf{L}_2(\pi)} \le (1 - \beta)^n \left\| \frac{d\mu}{d\pi} - 1 \right\|_{\mathsf{L}_2(\pi)}.$ (8)
In fact, in the specific setting of $\mathsf{L}_2$, geometric ergodicity and the existence of a spectral gap are equivalent notions [2, Proposition 3.13].
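The geometric decay above can be checked numerically for a small reversible chain, where the spectral gap is one minus the second-largest eigenvalue modulus; a sketch (the birth-death kernel is an illustrative choice):

```python
import numpy as np

# Check of geometric ergodicity for a reversible (birth-death) chain:
# || d(mu K^n)/d pi - 1 ||_{L2(pi)} <= gamma^n * || d mu/d pi - 1 ||_{L2(pi)}.
K = np.array([[0.7, 0.3, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.6, 0.4]])
w, V = np.linalg.eig(K.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1))]); pi = pi / pi.sum()
assert np.allclose(pi[:, None] * K, (pi[:, None] * K).T)   # reversibility

evals = np.sort(np.abs(np.linalg.eigvals(K)))
gamma = evals[-2]        # second-largest eigenvalue modulus
beta = 1 - gamma         # spectral gap

mu = np.array([1.0, 0.0, 0.0])
l2 = lambda nu: np.sqrt(np.sum(pi * (nu / pi - 1.0) ** 2))
for n in range(1, 20):
    mun = mu @ np.linalg.matrix_power(K, n)
    assert l2(mun) <= gamma ** n * l2(mu) + 1e-10
```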
The purpose of this work is to relate the contractivity of Markovian operators on Orlicz spaces to the convergence of the corresponding Markov chain to its stationary distribution. To do so, we leverage Lemma 1 and show convergence in the $\mathsf{L}_F$-norm, as soon as the contraction coefficient of the dual kernel is strictly smaller than $1$. This is formalised in the result below, which is proved in Appendix C-B.
Theorem 2.
Let $K$ be a Markov kernel with stationary distribution $\pi$, $K^\star$ its dual, and $F$ a Young function. For every measure $\mu \ll \pi$ and every $n \in \mathbb{N}$, we have
$\left\| \frac{d(\mu K^n)}{d\pi} - 1 \right\|_F \le \left( \gamma\!\left( K^\star|_{\mathsf{L}^0_F(\pi)} \right) \right)^n \left\| \frac{d\mu}{d\pi} - 1 \right\|_F.$ (9)
Theorem 2 links the convergence of a Markovian process to the contraction coefficient of the dual kernel $K^\star$. Indeed, if $\gamma(K^\star|_{\mathsf{L}^0_F(\pi)}) < 1$, then the right-hand side of Equation 9 goes to $0$ exponentially fast in the number of steps and, hence, $\mu K^n \to \pi$. We also note that the proof can be adapted to replace $\gamma(K^\star|_{\mathsf{L}^0_F(\pi)})$ with $\gamma((K^\star - \pi)|_{\mathsf{L}_F(\pi)})$, where, with a slight abuse of notation, $\pi$ is the operator mapping a function $f$ to the constant $\pi(f)$. These two bounds are closely related: $\gamma(K^\star|_{\mathsf{L}^0_F(\pi)})$ focuses on $0$-mean functions, while in $\gamma((K^\star - \pi)|_{\mathsf{L}_F(\pi)})$ we directly subtract the mean $\pi(f)$. More generally, we can relate the contraction coefficient of $K^\star - \pi$ on $\mathsf{L}_F(\pi)$ to that of $K^\star$ on $\mathsf{L}^0_F(\pi)$, see Section A-C.
At this point, we are ready to state our key technical contribution, which bounds the contraction coefficient of a kernel and of its dual. Its proof is in Section C-C.
Theorem 3.
Let $F$ be a Young function and $\pi$ a probability measure. Let $K$ be a Markov kernel with stationary distribution $\pi$, and $K^\star$ its dual. Assume that, for every $x$, $K(x, \cdot)$ is absolutely continuous w.r.t. $\pi$ and, for every $y$, $K^\star(y, \cdot)$ is absolutely continuous w.r.t. $\pi$. Denote by $k_x$ and $k^\star_y$ the Radon-Nikodym derivatives $\frac{dK(x, \cdot)}{d\pi}$ and $\frac{dK^\star(y, \cdot)}{d\pi}$, with $X$ and $Y$ the corresponding random variables whose law is induced by $\pi$. Then,
(10)
Remark 1.
Let $F(x) = x^p$ for $p \in [1, \infty)$, so that $\mathsf{L}_F(\pi) = \mathsf{L}_p(\pi)$. Then, the following is a corollary of Theorem 3 that holds for every $p$:
(11)
Theorem 3 offers a closed-form bound on the contraction coefficient. We remark that, to the best of our knowledge, no such bound in Orlicz spaces exists. If the kernel is induced by the so-called binary symmetric channel of parameter $\delta$ (i.e., $K(x, \{y\}) = 1 - \delta$ for $y = x$; $K(x, \{y\}) = \delta$ otherwise), then Equation 11 is tight, see Example 2 in Appendix B for details. The case of $\mathsf{L}_1$ represents another setting in which we can prove the tightness of Equation 11, as formalised below and proved in Section C-D.
Corollary 1.
Let $K$ be a Markov kernel with stationary distribution $\pi$. Then, the following holds:
$\gamma\!\left( K^\star|_{\mathsf{L}^0_1(\pi)} \right) \le \operatorname{ess\,sup}_{x} \left\| K(x, \cdot) - \pi \right\|_{TV},$ (12)
where $\|\cdot\|_{TV}$ denotes the total variation norm.
In fact, the inequality in Equation 12 holds with equality, by Proposition 3.23 in [2]. In the Markov chain literature, it is quite common to show that the right-hand side of Equation 12 is bounded away from $1$. This usually goes under the name of uniform (or strong) ergodicity. Hence, Corollary 1 leads to another proof that asking for uniform ergodicity implies exponential convergence in $\mathsf{L}_1$.
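The binary symmetric channel example mentioned above can be worked out directly: on the two-point space with uniform stationary distribution, the mean-zero subspace is one-dimensional, so the kernel acts on it as multiplication by $1 - 2\delta$ and the contraction coefficient equals $|1 - 2\delta|$ in any norm. A minimal sketch:

```python
import numpy as np

# Binary symmetric channel: stay with prob. 1 - delta, flip with prob. delta.
delta = 0.15
K = np.array([[1 - delta, delta],
              [delta, 1 - delta]])
pi = np.array([0.5, 0.5])          # uniform stationary distribution
assert np.allclose(pi @ K, pi)

# Every pi-mean-zero function is a multiple of f = (1, -1), and
# K f = (1 - 2*delta) f, so the contraction coefficient on the mean-zero
# subspace is |1 - 2*delta| regardless of the chosen norm.
f = np.array([1.0, -1.0])
assert np.allclose(K @ f, (1 - 2 * delta) * f)
```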
Theorem 3 also recovers other well-known results. Indeed, [16] show that, if ultra-mixing holds (which, in turn, implies Doeblin's minorisation [17, Section 4.3.3]), then one can bound the contraction coefficient of Markovian kernels in arbitrary norms.
Corollary 2.
Given a Markov kernel $K$, assume there exists $\varepsilon \in (0, 1]$ s.t. the kernel satisfies the ultra-mixing condition, i.e., for all $x, y$,
$K(x, A) \ge \varepsilon \, K(y, A) \quad \text{for every measurable } A.$ (13)
Then, for every Young function $F$ and stationary distribution $\pi$ of $K$, we have
$\gamma\!\left( K^\star|_{\mathsf{L}^0_F(\pi)} \right) \le 1 - \varepsilon.$ (14)
Corollary 2 is indeed a consequence of Theorem 3, as proved in Section C-E.
IV Applications
IV-A Mixing times
Definition 5.
Given a Markov kernel $K$ with stationary distribution $\pi$, a measure $\mu \ll \pi$, and a Young function $F$, the $F$-mixing time of $K$ is the function $t_{\mathrm{mix}}$ such that, for every $\varepsilon > 0$,
$t_{\mathrm{mix}}(\varepsilon) = \min\left\{ n \in \mathbb{N} : \left\| \frac{d(\mu K^n)}{d\pi} - 1 \right\|_F \le \varepsilon \right\}.$ (15)
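As a concrete illustration of the mixing-time notion, the total-variation analogue (an assumption made here purely for simplicity, in place of the Orlicz-norm distance of Definition 5) can be computed by brute force for a discrete chain:

```python
import numpy as np

# Empirical TV mixing time: smallest n with max_x || K^n(x,.) - pi ||_TV <= eps.
def tv_mixing_time(K, pi, eps=0.25, n_max=10000):
    Kn = np.eye(len(pi))
    for n in range(1, n_max + 1):
        Kn = Kn @ K
        if 0.5 * np.max(np.abs(Kn - pi).sum(axis=1)) <= eps:
            return n
    return None

K = np.array([[0.9, 0.1],
              [0.4, 0.6]])
pi = np.array([0.8, 0.2])
t = tv_mixing_time(K, pi, eps=0.01)
assert tv_mixing_time(K, pi, eps=0.25) <= t   # larger tolerance mixes sooner
```

For this 2-state chain the worst-case TV distance decays exactly as $0.8 \cdot 0.5^n$, so `t` equals 7.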
Corollary 3.
Let $K$ be a Markov operator with stationary distribution $\pi$, $K^\star$ its dual, $\mu \ll \pi$, and $F$ a Young function. Then, the following holds for every $\varepsilon > 0$:
$t_{\mathrm{mix}}(\varepsilon) \le \left\lceil \frac{\log\left( \left\| \frac{d\mu}{d\pi} - 1 \right\|_F \big/ \varepsilon \right)}{\log\left( 1 \big/ \gamma\!\left( K^\star|_{\mathsf{L}^0_F(\pi)} \right) \right)} \right\rceil.$ (16)
Remark 2.
For discrete spaces and a discrete-valued Markov kernel, by leveraging the convexity of the norms and the fact that convex functions on a compact convex set attain their maximum at an extreme point, we can upper bound $\left\| \frac{d\mu}{d\pi} - 1 \right\|_F$ for every $\mu$ via the maximum over Dirac measures, which admits a closed-form expression. See Section D-A for details.
Corollary 3 relates the contraction coefficient of the dual operator to the mixing time of the corresponding Markov chain, thus motivating our study of said contraction coefficient. Moreover, as $F$ is arbitrary, a natural question concerns the choice of the norm. To address it, first we derive a general result providing intuition over what it means to be close in a specific norm. In particular, we relate the probability of any event under $\mu K^n$ with the probability of the same event under the stationary measure and an Orlicz norm. We then focus on the well-known family of $\mathsf{L}_q$-norms (particularly relevant for exponentially decaying probabilities under $\pi$), and show that asking for a bounded norm with larger $q$ guarantees a faster decay of the probability of the same event under $\mu K^n$. Finally, we design a Young function tailored to the challenging (and less understood) setting in which $\pi$ behaves like a power law [14].
Theorem 4.
Let $\mu \ll \pi$ and $E$ be a measurable event. Then,
$\mu(E) \le \inf_F \left\| \frac{d\mu}{d\pi} \right\|_F \left\| \mathbb{1}_E \right\|_{F^\star},$ (17)
where the infimum is over all Young functions $F$.
We now apply Theorem 4 to various choices of $F$, and start by demonstrating the advantage of considering $\mathsf{L}_q$-norms with $q > 1$. The result below is proved in Section D-C.
Corollary 4.
Let $q \in (1, \infty]$, $\mu \ll \pi$, and $E$ a measurable event. Then,
$\mu(E) \le \left\| \frac{d\mu}{d\pi} \right\|_{\mathsf{L}_q(\pi)} \pi(E)^{1 - \frac{1}{q}}.$ (18)
As an immediate consequence, if $\pi(E) \le e^{-\theta}$ for some $\theta > 0$, then for every $q \in (1, \infty]$,
$\mu(E) \le \left\| \frac{d\mu}{d\pi} \right\|_{\mathsf{L}_q(\pi)} e^{-\theta \left( 1 - \frac{1}{q} \right)}.$ (19)
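The change-of-measure step behind Corollary 4 is an instance of Hölder's inequality, $\mu(E) \le \|d\mu/d\pi\|_{\mathsf{L}_q(\pi)} \, \pi(E)^{1-1/q}$, and can be sanity-checked numerically (the random measures and events are illustrative):

```python
import numpy as np

# Numerical check of mu(E) <= ||dmu/dpi||_{L_q(pi)} * pi(E)^(1 - 1/q)
# on a small discrete space, for several events and exponents q.
rng = np.random.default_rng(3)
pi = rng.dirichlet(np.ones(6))
mu = rng.dirichlet(np.ones(6))
g = mu / pi                                   # density d(mu)/d(pi)
for q in (1.5, 2.0, 4.0):
    gq = np.sum(pi * g ** q) ** (1 / q)       # ||dmu/dpi||_{L_q(pi)}
    for _ in range(20):
        E = rng.integers(0, 2, size=6).astype(bool)
        assert mu[E].sum() <= gq * pi[E].sum() ** (1 - 1 / q) + 1e-12
```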
The following setting is especially interesting for the applicability of Corollary 4. Consider a function $f : \mathcal{X}^n \to \mathbb{R}$ and $E = \{ x : f(x) - \mathbb{E}_\pi[f] \ge t \}$ for some fixed $t > 0$. If $\pi = p^{\otimes n}$ for some measure $p$, and $f$ has bounded differences $c_1, \ldots, c_n$, by Hoeffding's inequality one has:
$\pi(E) \le \exp\left( - \frac{2 t^2}{\sum_{i=1}^n c_i^2} \right),$ (20)
where the constant at the exponent can be improved under additional assumptions, see e.g. [21]. Notice that Equation 20 does not require $\pi$ to be a product distribution. Indeed, for example, if the Markov chain has spectral gap $\beta$, then Equation 20 holds with an additional spectral-gap-dependent constant at the exponent, see [11, Theorem 1]. Then, given $\mu \ll \pi$ and $q \in (1, \infty]$, we have
$\mu(E) \le \left\| \frac{d\mu}{d\pi} \right\|_{\mathsf{L}_q(\pi)} \exp\left( - \frac{2 t^2}{\sum_{i=1}^n c_i^2} \left( 1 - \frac{1}{q} \right) \right).$ (21)
Starting from Equation 21, one can now select $q$ such that the probability is exponentially decaying in the dimension $n$ of the ambient space $\mathcal{X}^n$. In fact, a bound that decays exponentially in the dimension is guaranteed whenever, for a given $t$, $q$, and finite $\left\| \frac{d\mu}{d\pi} \right\|_{\mathsf{L}_q(\pi)}$,
(22)
For example, a suitable choice of $q$ guarantees exponential convergence in Equation 21, and plugging this choice in Equation 16 gives the bound below on the mixing time:
(23)
We note that both the norm and the contraction coefficient in this bound depend on the dimension $n$. To maximise the rate of exponential decay in Equation 21, one has to take $q \to \infty$, which implies $1 - 1/q \to 1$. The price to pay is an increase of the mixing time and, to quantify such an increase, it is crucial to provide tight bounds on the contraction coefficient of Markovian operators, beyond the usual $\mathsf{L}_2$ space. Theorem 3 can then be leveraged to compute said bounds, and we now compare it with the standard approach in the literature, i.e., the Riesz-Thorin interpolation theorem between $\mathsf{L}_p$ norms, which is stated below for convenience.
Proposition 2.
Let $q \in [2, \infty]$ and $\mu \ll \pi$. Let $K$ be a Markov kernel with stationary distribution $\pi$. If $K$ admits an $\mathsf{L}_2$-spectral gap of $\beta$, then the following holds:
(24)
Moreover, one has that:
(25)
As an example, consider the binary symmetric channel of parameter $\delta$. Then, the spectral gap is $\beta = 2\delta$ and Proposition 2 yields:
(26)
In contrast, Theorem 3 gives:
(27)
which improves over Equation 26 for a wide range of parameters. Notably, Equation 27 can remain non-vacuous in regimes where the result of Proposition 2 is vacuous (cf. Equation 25). Figure 2 considers Markov kernels induced by randomly generated stochastic matrices, and it shows that while the bound in Proposition 2 is always vacuous (red dots), the bound given by Equation 11 is always below $1$ and often very close to the true value (blue dots). Moreover, given that we are in a discrete setting, one can explicitly compute the relevant operator norm for a specific kernel and interpolate between said operator norm and the spectral gap. The comparison yields the following results:
• For , our upper bound is roughly 50% smaller than Riesz-Thorin interpolation (on average over the 100 random trials);
• For , , , our upper bound is roughly 70% smaller than Riesz-Thorin interpolation (on average over the 100 random trials);
• For , our upper bound is roughly 23% smaller than Riesz-Thorin interpolation (on average over the 100 random trials);
• For , our upper bound is roughly 20% smaller than Riesz-Thorin interpolation (on average over the 100 random trials).
Figure 2: Bounds on the contraction coefficient for Markov kernels induced by randomly generated stochastic matrices: Riesz-Thorin interpolation (red dots) vs. the bound of Equation 11 (blue dots).
Remark 3.
So far, we have compared our approach with Riesz-Thorin interpolation. However, part of the Markov chain literature leverages Stein interpolation as well, e.g., [22]. This, however, requires considering the semigroup associated with a Markov kernel $K$. Let us thus compare with the argument used in [22, Theorem 3.9] where, given , Stein interpolation is used to provide a bound on the operator norm of the semigroup from to , with the parameters opportunely chosen. In particular, select and in [22, Theorem 3.9]. Via Stein interpolation one can provide the following bound for every : , where . In contrast, an analogous version of Equation 11 gives , where denotes the density of the kernel with respect to the stationary distribution. In order to provide a concrete comparison between the two bounds, let the kernel be the binary symmetric channel with parameter $\delta$. Figure 3 shows how, in this setting, our bound always improves over [22].
Figure 3: Comparison between our bound and the Stein-interpolation bound of [22] for the binary symmetric channel.
Remark 4.
A suitable choice of the Young function allows us to treat stationary distributions with heavy tails. In this case, geometric or uniform ergodicity cannot be achieved, and the best one can hope for is polynomial convergence in total variation [14]. Thus, there is no spectral gap [23, Theorem 2.1], which affects the convergence rates of MCMC algorithms with polynomial target distributions, e.g., Metropolis-Hastings with an independence sampler [15, Section 4.3].
For concreteness, consider upper bounding $\mu(E)$ when the stationary distribution $\pi$ is a power-law with exponent $\alpha$ and $E$ is s.t. $\pi(E) \le e^{-\theta}$. Then, if we restrict to $\mathsf{L}_q$ spaces, we obtain the bound in Equation 19, with $1 < q < \alpha$. Here, the upper bound on $q$ comes from the fact that the largest $q$ s.t. a random variable with measure $\pi$ belongs to $\mathsf{L}_q$ has to satisfy $q < \alpha$, while $q = 1$ leads to a trivial bound in Equation 19.
Instead, by considering Orlicz spaces different from $\mathsf{L}_q$, much more refined bounds can be obtained. Let $F$ be a function which has a different rate of growth before and after a certain parameter. Then, one can readily verify that $F$ is a Young function, and a random variable with measure $\pi$ belongs to the corresponding Orlicz space for any finite value of the parameter. By applying Theorem 4 with this choice of $F$, we obtain
(28)
which vastly improves upon the bound in Equation 19, as shown in Figure 2. Note that, to maximise the decay of the bound in Equation 19, $1/q$ ($q$) needs to be as small (large) as possible. Hence, the nearly best decay is obtained by taking $q$ close to $\alpha$.
IV-B Markov Chain Monte Carlo
Proposition 2 was used by [11] to prove concentration for MCMC methods. Here, we show how an application of Theorem 3 improves over such results.
Let us briefly recall the setup. Given a function $f$, we want to estimate the mean $\pi(f)$ with respect to a measure $\pi$ which cannot be directly sampled. A common approach is to consider a Markov chain with stationary distribution $\pi$, and estimate $\pi(f)$ via empirical averages of samples $X_{n_0+1}, \ldots, X_{n_0+n}$, where the burn-in period $n_0$ ensures that the Markov chain is close to $\pi$.
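The setup above can be sketched numerically; a toy illustration (the kernel, the burn-in length, and the sample size are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
K = np.array([[0.9, 0.1],
              [0.4, 0.6]])
pi = np.array([0.8, 0.2])      # stationary: pi K = pi
f = np.array([0.0, 1.0])       # target: pi(f) = 0.2

def mcmc_estimate(K, f, x0=0, n0=100, n=20000):
    # Run the chain for n0 burn-in steps, then average f over n samples.
    x, total = x0, 0.0
    for step in range(n0 + n):
        x = rng.choice(len(f), p=K[x])
        if step >= n0:
            total += f[x]
    return total / n

est = mcmc_estimate(K, f)
assert abs(est - pi @ f) < 0.03   # within Monte Carlo error of pi(f)
```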
We start by considering again the binary symmetric channel with parameter $\delta$. The stationary distribution $\pi$ is uniform and the spectral gap is $\beta = 2\delta$. Assume $\mu \ll \pi$ for some measure $\mu$. Applying Theorem 12 by [11] yields
(29)
where $\otimes$ denotes the tensor product, and
(30)
In contrast, an application of Theorem 3 yields a bound on the contraction coefficient for every $q$. Hence, leveraging Equation 18 leads to
(31)
Equation 31 improves over Equation 29 thanks to the absence of an extra term in the exponent of the first addend of Equation 31 (which provides a faster exponential decay) and thanks to the improved bound on the contraction coefficient provided by Equation 11, in contrast with Equation 30. The difference in behaviour is most striking in the limit which maximises the rate of decay in both Equation 29 and Equation 31. Most importantly, considering such a limit in Equation 29 leads to a bound that does not depend on $\mu$ anymore. This is not the case for Equation 31, where not only do we retrieve the best exponential decay, but we also retain the dependence on $\mu$. Hence, for a given accuracy and sample budget $n$, if we start from a distribution $\mu$ and require a certain confidence, we have the following lower bound on the burn-in period:
(32)
Figure 4 compares the bounds when the kernel is a randomly generated stochastic matrix, and it clearly shows the improvement of our approach (in blue) over the interpolation method used by [11] (in red).
Figure 4: Lower bounds on the burn-in period for a randomly generated stochastic matrix: our approach (blue) vs. the interpolation method of [11] (red).
IV-C Concentration of measure without independence
Another application comes from leveraging Corollary 4 along with Equation 11 to prove concentration of measure bounds for non-independent random variables. In particular, the following result holds (see Section E-A for the proof).
Corollary 5.
Let $K$ be a discrete-valued Markov kernel. Let $X_1, \ldots, X_n$ be a sequence of random variables with joint measure $\mu$ and marginals $\mu_i$. Assume that $\mu_{i+1} = \mu_i K$ for every $i$, i.e., the random variables are Markovian. Then, for every $t > 0$ and every $f$ s.t. $|f(x) - f(\tilde{x})| \le c_i$ with $x$ and $\tilde{x}$ two $n$-dimensional vectors differing only in the $i$-th position,
(33)
where $\bar{\mu} = \bigotimes_{i=1}^n \mu_i$ is the product of the marginals.
To compare with existing results, we start from a binary setting, corresponding to the kernel induced by the matrix described in Example 2 in Appendix B. Let $\pi$ denote the stationary distribution. Theorem 2 by [12] employs the hypercontractivity of the Markov operator to give a concentration result. Note that, in this setting, the hypercontraction coefficient is not known, while the contraction coefficient can still be bounded via Equation 11 as
(34)
which combined with Corollary 5 gives
(35)
Assuming without loss of generality that and taking leads to
(36)
One can then compute a similar bound leveraging Theorem 2 in [12] and then take the appropriate limit, which gives
(37)
Thus, the approach by [12] guarantees an exponential decay if
(38)
In contrast, the approach provided here leads to exponential concentration if (see Equation 36)
(39)
Hence, Equation 36 improves over Equation 37 if , i.e., if the bound on the contraction coefficient in Equation 34 is less than . Notably, if , then Equation 11 leads to the same bound as [12, Equation (110)], with the main difference that the authors of [12] explicitly compute the required norms, while in this work we exploit our novel bound on the contraction coefficient and still achieve the same result.
Similarly, for arbitrary doubly-stochastic matrices, Corollary 5 with improves over [12] whenever the bound on the contraction coefficient is less than , see Section E-B for details. Furthermore, for stochastic matrices, we now compare Corollary 5 with the approach by [12] via numerical experiments. In particular, selecting as the stationary distribution, [12] prove exponential concentration if
In contrast, Corollary 5 leads to exponential concentration if
with as defined in Corollary 5. In Figure 5a and Figure 5b, we compare the two bounds for a randomly generated stochastic matrix when: (a) , and (b) . In the first case, [12] cannot provide exponential concentration, and the improvement brought by Corollary 5 is obvious. In the second case, even if both approaches lead to exponential concentration, Corollary 5 is still better for intermediate values of .
Figure 5a: Comparison of the exponential-concentration conditions for a randomly generated stochastic matrix in setting (a).
Figure 5b: Comparison of the exponential-concentration conditions for a randomly generated stochastic matrix in setting (b).
Notice that, up to this point, our comparisons have focused solely on [12], as it already compares with and improves upon other leading approaches in the literature, specifically [24, 11, 25]. However, for the sake of completeness, we also present a direct comparison with these approaches in the context of the general binary channel. For ease of exposition, let and . Then, one retrieves the following bounds from the literature:
(40)
Consequently, one can see that the bounds on the right-hand side of Equation 40 are worse than the right-hand side of Equation 36 when
(41)
For and , Figure 6 shows that our approach outperforms all previous methods for all .
Figure 6: Comparison of our bound with the bounds of [24, 11, 25] for the general binary channel.
V Conclusions
The notion of convergence in Orlicz spaces of a Markov process induced by a kernel $K$ is introduced. We show that one can control the speed of said convergence by studying the contraction coefficient of the dual kernel $K^\star$. Specifically, by considering specific norms of densities of $K$ and $K^\star$ with respect to the stationary measure, Theorem 3 provides a novel and closed-form bound on the contraction coefficient in any Orlicz space. As discussed in Section IV, this yields a vast improvement over the state of the art in error bounds for MCMC methods, as well as in the concentration of Markovian processes. Given the generality of the framework, as explained in Remark 4, our analysis is able to tackle under-explored settings, such as those involving heavy-tailed distributions.
We envision applications of the results brought forth to other settings of interest. For instance, multi-armed bandit problems with Markovian rewards represent a severely under-studied setting in the literature [26]. The reason is that bandit algorithms heavily rely on concentration results. Hence, the improvements provided for the concentration of Markovian random variables can impact the development of bandit algorithms. Another setting comes from information theory: data-processing inequalities (DPI) are fundamental results in the field, and they can be significantly improved by computing the strong DPI (SDPI) constant for various divergences [13]. Some connections between contraction coefficients of Markov kernels and strong DPI constants already exist [27, 16, 13]. Hence, leveraging Theorem 3 for $\mathsf{L}_p$ spaces, one can also provide improved bounds on SDPI constants for $\Phi$-divergences.
Acknowledgments
The authors are partially supported by the 2019 Lopez-Loreta Prize. They would also like to thank the anonymous reviewers for their careful reading of the manuscript and for providing valuable suggestions and comments, identifying a typo and suggesting meaningful references.
References
- [1] G. O. Roberts and J. S. Rosenthal, “General state space Markov chains and MCMC algorithms,” Probability Surveys, 2004.
- [2] D. Rudolf, “Explicit error bounds for Markov chain Monte Carlo,” Dissertationes Mathematicae, vol. 485, pp. 1–93, 2012.
- [3] C. J. Geyer, “Practical Markov Chain Monte Carlo,” Statistical Science, vol. 7, no. 4, pp. 473 – 483, 1992.
- [4] W. R. Gilks, S. Richardson, and D. Spiegelhalter, Markov Chain Monte Carlo in Practice, ser. Chapman & Hall/CRC Interdisciplinary Statistics. Taylor & Francis, 1995.
- [5] K. Łatuszyński, B. Miasojedow, and W. Niemiro, “Nonasymptotic bounds on the estimation error of MCMC algorithms,” Bernoulli, vol. 19, no. 5A, pp. 2033–2066, 2013.
- [6] D. W. Gillman, “Hidden Markov chains: convergence rates and the complexity of inference,” Ph.D. dissertation, Massachusetts Institute of Technology, Boston, US, 1993.
- [7] I. H. Dinwoodie, “A probability inequality for the occupation measure of a reversible Markov chain,” The Annals of Applied Probability, vol. 5, no. 1, pp. 37–43, 1995.
- [8] C. A. León and F. Perron, “Optimal Hoeffding bounds for discrete reversible Markov chains,” The Annals of Applied Probability, vol. 14, no. 2, pp. 958–970, 2004.
- [9] K.-M. Chung, H. Lam, Z. Liu, and M. Mitzenmacher, “Chernoff-Hoeffding bounds for Markov chains: Generalized and simplified,” in Symposium on Theoretical Aspects of Computer Science, 2012.
- [10] B. Miasojedow, “Hoeffding’s inequalities for geometrically ergodic Markov chains on general state space,” Statistics & Probability Letters, vol. 87, pp. 115–120, 2014.
- [11] J. Fan, B. Jiang, and Q. Sun, “Hoeffding's inequality for general Markov chains and its applications to statistical learning,” Journal of Machine Learning Research, vol. 22, no. 139, pp. 1–35, 2021.
- [12] A. R. Esposito and M. Mondelli, “Concentration without independence via information measures,” IEEE Transactions on Information Theory, vol. 70, no. 6, pp. 3823–3839, 2024.
- [13] M. Raginsky, “Strong data processing inequalities and Φ-Sobolev inequalities for discrete channels,” IEEE Transactions on Information Theory, vol. 62, no. 6, pp. 3355–3389, 2016.
- [14] S. Jarner and G. Roberts, “Polynomial convergence rates of Markov chains,” The Annals of Applied Probability, vol. 12, 2002.
- [15] ——, “Convergence of heavy-tailed Monte Carlo Markov chain algorithms,” Scandinavian Journal of Statistics, vol. 34, pp. 781–815, 2007.
- [16] P. Del Moral, M. Ledoux, and L. Miclo, “On contraction properties of Markov kernels,” Probability Theory and Related Fields, vol. 126, pp. 395–420, 2003.
- [17] O. Cappé, E. Moulines, and T. Ryden, Inference in Hidden Markov Models (Springer Series in Statistics). Berlin, Heidelberg: Springer-Verlag, 2005.
- [18] M. M. Rao and Z. D. Ren, Theory of Orlicz Spaces. New York: M. Dekker, 1991.
- [19] H. Hudzik and L. Maligranda, “Amemiya norm equals Orlicz norm in general,” Indagationes Mathematicae, vol. 11, no. 4, pp. 573–585, 2000.
- [20] C. Roberto and B. Zegarlinski, “Hypercontractivity for Markov semi-groups,” Journal of Functional Analysis, vol. 282, no. 12, p. 109439, 2022.
- [21] M. Raginsky and I. Sason, “Concentration of measure inequalities in information theory, communications, and coding,” Foundations and Trends in Communications and Information Theory, vol. 10, no. 1-2, pp. 1–246, 2013.
- [22] P. Diaconis and L. Saloff-Coste, “Logarithmic Sobolev inequalities for finite Markov chains,” The Annals of Applied Probability, vol. 6, no. 3, pp. 695–750, 1996.
- [23] G. Roberts and J. Rosenthal, “Geometric ergodicity and hybrid Markov chains,” Electronic Communications in Probability, vol. 2, pp. 13–25, 1997.
- [24] L. A. Kontorovich and K. Ramanan, “Concentration inequalities for dependent random variables via the martingale method,” The Annals of Probability, vol. 36, no. 6, pp. 2126–2158, 2008.
- [25] K. Marton, “A measure concentration inequality for contracting Markov chains,” Geometric and functional analysis, vol. 6, no. 3, pp. 556–571, 1996.
- [26] S. Bubeck and N. Cesa-Bianchi, “Regret analysis of stochastic and nonstochastic multi-armed bandit problems,” Foundations and Trends® in Machine Learning, vol. 5, 04 2012.
- [27] R. L. Dobrushin, “Central limit theorem for nonstationary Markov chains. I,” Theory of Probability & Its Applications, vol. 1, no. 1, pp. 65–80, 1956.
- [28] R. Vershynin, High-Dimensional Probability: An Introduction with Applications in Data Science, ser. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.
- [29] A. R. Esposito, M. Gastpar, and I. Issa, “Generalization error bounds via Rényi-, f-divergences and maximal leakage,” IEEE Transactions on Information Theory, vol. 67, no. 8, pp. 4986–5004, 2021.
- [30] I. Csiszár, “Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten,” Magyar Tud. Akad. Mat. Kutató Int. Közl., vol. 8, pp. 85–108, 1963.
Appendix A Additional preliminaries and results
A-A Orlicz spaces
Orlicz spaces are quite general, and they include the well-known -spaces. Indeed, if , then and is the -norm, see Equation 5. We note that different choices of the Orlicz function characterise different tail behaviours of random variables.
Example 1.
Let and, given a measure space , define on the Orlicz space . Given a random variable defined on this space, is the sub-Gaussian norm, i.e.,
(42)
Thus, represents the set of all sub-Gaussian random variables with respect to [28, Definition 2.5.6].
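As a purely numerical sketch (not part of the original derivation), the sub-Gaussian Luxemburg norm of Example 1 can be approximated on samples by bisecting over the scale parameter; the function name and tolerance below are our own choices.

```python
import numpy as np

def psi2_luxemburg_norm(x, tol=1e-8):
    """Luxemburg norm for psi_2(u) = exp(u^2) - 1, estimated on samples:
    ||X||_{psi_2} = inf{ t > 0 : E[exp((X/t)^2) - 1] <= 1 }."""
    x = np.asarray(x, dtype=float)

    def excess(t):
        return np.mean(np.exp((x / t) ** 2) - 1.0) - 1.0

    lo, hi = 0.0, 1.0
    while excess(hi) > 0:          # grow hi until the constraint is satisfied
        lo, hi = hi, 2.0 * hi
    while hi - lo > tol:           # bisect: the norm is the infimum feasible t
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0:
            lo = mid
        else:
            hi = mid
    return hi

# For a Rademacher variable X in {-1, +1}, E[exp(X^2 / t^2)] = exp(1 / t^2),
# so the norm is 1 / sqrt(ln 2) ≈ 1.2011.
print(psi2_luxemburg_norm(np.array([-1.0, 1.0])))
```

The bisection recovers the closed-form value 1/√(ln 2) for the two-point example, which can serve as a sanity check before applying the routine to empirical data.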
Given a function and its Young’s dual, the following generalised Hölder’s inequality holds, see [29, Appendix A] for a proof.
Lemma 2.
Let be an Orlicz function and its complementary function. For every pair of non-negative random variables respectively defined over , given a coupling , one has that
(43)
Choosing , which implies with , Lemma 2 yields the classical Hölder’s inequality:
(44)
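To make the classical case concrete, the sketch below (our own illustration) checks Hölder’s inequality for the empirical measure of paired samples, with conjugate exponents p = 3 and q = 3/2; the distributions and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
y = rng.exponential(size=10_000)

p, q = 3.0, 1.5  # conjugate exponents: 1/p + 1/q = 1

# Hölder's inequality holds exactly for the empirical measure of the pairs.
lhs = np.mean(np.abs(x * y))
rhs = np.mean(np.abs(x) ** p) ** (1 / p) * np.mean(np.abs(y) ** q) ** (1 / q)
print(lhs <= rhs)  # True
```

Note that the inequality holds deterministically here, not just on average: the empirical measure of the paired samples is itself a coupling, so finite-sample Hölder applies verbatim.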
A-B Proof of Lemma 1
Proof.
Given that , one has that . Indeed, let be an event such that . Then,
(45)
Hence, is -a.e., and by absolute continuity between and , is also -a.e. and . Thus, the function is well-defined. Moreover, given any function ,
(46)–(49)
which implies that , -almost everywhere. ∎
Moreover, one also has that for every reversible Markov kernel with stationary distribution , every with a Young functional and every the following holds true:
(50)
consequently, one has that:
(51)
A-C Relationship between contraction coefficients
Theorem 5.
Let be a Young functional, and let . Then, the following holds:
(52)
Remark 5.
Taking , which gives -norms, we recover Lemma 3.16 by [2].
Proof.
The first inequality is obtained via the steps below:
(53)–(56)
Moreover, one has that the following sequence of steps holds:
(57)–(61)
Indeed, one has that
(62)
where the last step follows from Jensen’s inequality and the assumption that . ∎
To conclude our characterisation of the contraction coefficients, we prove the equivalence between and .
Lemma 3.
Let be a Markov kernel with stationary distribution , and the dual kernel. Then, one has that
(63)
Proof.
We first show that , i.e., that for every function and , . Indeed, we have
(64)
From this, one can also conclude that, if is stationary with respect to , then . Indeed, we have
(65)
∎
Appendix B Examples: discrete-valued kernels
Example 2.
Consider a binary setting where the Markov kernel is induced by a matrix:
(66)
In this case, the stationary distribution is , and the dual kernel is . If , we use the shorthands and to denote and , respectively. We note that is known in information theory as the binary symmetric channel with parameter . Now, both and are doubly-stochastic and the stationary distribution is .
Let us focus on the setting where , in which we can compute everything explicitly and thus compare the upper bound to the real quantity. In this case, is ( minus) the spectral gap, i.e., the second largest singular value of . Thus, . Let us now compute the bound on that stems from Equation 11. In particular, one has that
Consequently,
(67)
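Since the binary-symmetric-channel matrix is symmetric and doubly stochastic, its singular values can be computed directly; the snippet below (our own illustration, with δ as the channel parameter) confirms that the second largest singular value equals |1 − 2δ|.

```python
import numpy as np

def bsc_second_singular_value(delta):
    """Second largest singular value of the binary symmetric channel."""
    K = np.array([[1.0 - delta, delta],
                  [delta, 1.0 - delta]])
    # np.linalg.svd returns singular values in descending order;
    # the largest is 1 because K is doubly stochastic.
    return np.linalg.svd(K, compute_uv=False)[1]

print(bsc_second_singular_value(0.3))  # ≈ 0.4 = |1 - 2 * 0.3|
```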
In the next example, the bound is still very close to the actual value. Specifically, we consider the following matrix:
(68)
Given that is doubly-stochastic, the stationary distribution is uniform, i.e., for . Similarly to the setting, . We can thus numerically compute the second largest singular value of which gives . Instead, Equation 11 leads to the bound below which is very close to the actual value of :
(69)
More generally, one can prove the following.
Lemma 4.
Consider a discrete-valued Markovian process described by an doubly-stochastic matrix . Then, the contraction coefficient of in the space can be bounded as follows:
(70)
Proof.
The stationary distribution is the uniform distribution: , for . It is also easy to see that the dual of the kernel induced by with respect to is the transpose matrix . By using Equation 11, we can write
(71)–(72)
which concludes the proof. ∎
In the case of general doubly-stochastic matrices, we can see that the bound maintains some of its properties, but not all. For instance, it is still true that, if the matrix enforces independence (i.e., for every ), then . However, it is no longer true that the bound on the contraction coefficient is always strictly less than one. Indeed, if is deterministic (i.e., for every there exists an such that and for every ), one can see that the bound becomes:
(73)
which is larger than for and .
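The deterministic case can also be checked directly: a deterministic doubly-stochastic matrix is a permutation matrix, hence orthogonal, so all of its singular values equal one and no contraction takes place. A small sketch (the specific permutation is arbitrary):

```python
import numpy as np

# A deterministic doubly-stochastic kernel is a permutation matrix.
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])

singular_values = np.linalg.svd(P, compute_uv=False)
print(singular_values)  # all equal to 1: no contraction in any direction
```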
Appendix C Proofs for Section III
C-A Proof of Theorem 1
Assume that, for every , there exists a constant such that
(74)
Moreover, the following steps hold:
(75)–(77)
where Equation 75 follows from Jensen’s inequality. Hence, since , one has that . The statement follows from the arbitrariness of . To conclude the argument for the Luxemburg norm, select the function identically equal to . One has that, in order for , the following needs to hold
(78)
and thus . Hence, since , one has that .
As for the Amemiya norm, assume that, for every , there exists such that
(79)
then one has that for every Markov kernel
(80)–(83)
where Equation 81 follows from Jensen’s inequality. The argument then follows from the arbitrariness of . This implies that . Moreover, if one selects to be the function identically equal to , then
(84)
Indeed, denoting , one has that and . Given that
(85)
the argument follows. Hence, similarly to before, since , one has that .
C-B Proof of Theorem 2
Let and let . Then,
(86)–(90)
where Equation 89 follows from the definition of (see Definition 4) and the fact that if then ; Equation 90 follows from Equation 51.
C-C Proof of Theorem 3
As , we have that . Thus,
(91)–(96)
where Equation 91 follows from the fact that Amemiya and Luxemburg norms are “ideal norms”, Equation 95 follows from the fact that is a probability measure along with Jensen’s inequality, and Equation 96 follows from generalised Hölder’s inequality, i.e., Lemma 2. This gives the desired bound on the contraction coefficient of . The bound on the contraction coefficient of follows from an analogous argument.
C-D Proof of Corollary 1
The inequality follows from Equation 11 with . As for the equality, notice that, for every , and
(97)–(98)
The last step follows from the definition of Total Variation distance. Hence,
(99)
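In the finite-state case, the total-variation contraction coefficient appearing here coincides with Dobrushin’s coefficient [27]: the maximal total-variation distance between rows of the transition matrix. A minimal sketch (function name and test matrix are ours):

```python
import numpy as np

def dobrushin_coefficient(K):
    """Maximal total-variation distance between pairs of rows of K."""
    n = K.shape[0]
    return max(0.5 * np.abs(K[i] - K[j]).sum()
               for i in range(n) for j in range(n))

# Binary symmetric channel with parameter 0.3: coefficient |1 - 2*0.3| = 0.4,
# matching the second singular value found in the example of Appendix B.
K = np.array([[0.7, 0.3],
              [0.3, 0.7]])
print(dobrushin_coefficient(K))  # ≈ 0.4
```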
C-E Proof of Corollary 2
In [16, Lemma 4.1], the authors prove that if Equation 13 holds, then for every measure one has that:
(100)
where is distributed according to . Given that and selecting , this implies that for every and -a.e.; thus, for every and , Note that norms are non-decreasing and
Then, the claim follows from an application of Theorem 2 with and Theorem 3.
Appendix D Proofs for Section IV-A
D-A Computations for Remark 2
The following closed-form expressions for hold:
(101)
(102)
where is the space over which and are defined and the Young conjugate, see Equation 4. We start by proving Equation 101. Let be fixed. Then, for a given and ,
(103)
We need to consider two cases. First assume that , which is equivalent to . Then, since is convex and , one has that
(104)
Thus, in order to have the right-hand side of Equation 104 less than one, one needs
If instead , then
(105)
Consequently, one needs
Analogous considerations allow us to derive a bound on the Amemiya norm. Indeed, if , then
(106)
One can then also similarly bound the Amemiya norm when .
D-B Proof of Theorem 4
The proof leverages the generalised Hölder’s inequality. Let be the measure obtained in the Markovian process when is the starting distribution and after applications of the Markov kernel . Assume that , with being the stationary distribution. The following sequence of steps holds for every Young function :
(107)–(112)
D-C Proof of Corollary 4
Since we know the exact closed-form expression of the norm when , it is possible to prove a stronger result directly. Consider the function . This is a convex function such that , and it is no longer a Young function. One can thus define a corresponding -divergence [30] with . That is, given two measures ,
(113)
As such, it will satisfy the data-processing inequality, which we can use to prove Corollary 4. Following the same steps undertaken to prove [29, Theorem 3], for every event one has that
(114)
If , we can then invert and obtain
(115)
Moreover, one has that , and thus . Hence, one can rewrite Equation 115 as follows:
(116)
If , then the claim holds trivially as .
Appendix E Proofs for Section IV-C
E-A Proof of Corollary 5
From [12, Theorem 1], we have that
(117)–(120)
E-B Concentration and doubly-stochastic matrices
Consider an doubly-stochastic matrix with elements , . The bound on the contraction coefficient is given in Lemma 4 and, consequently, the bound provided by Corollary 5 yields
(121)
which implies that exponential convergence is guaranteed whenever
(122)
In contrast, Theorem 1 by [12] gives
(123)
which implies that exponential convergence is guaranteed if
(124)
Hence, if
(125)
then the numerator in Equation 122 is strictly smaller than the numerator in Equation 124, and Equation 121 improves over Equation 123. In particular, in the limit when (the regime that maximises the rate of decay) and , in order to improve over [12], we need the bound on the contraction coefficient to be bounded away from .