
Contraction of Markovian Operators in Orlicz Spaces
and Error Bounds for Markov Chain Monte Carlo

Amedeo Roberto Esposito, Marco Mondelli
Institute of Science and Technology Austria
{amedeoroberto.esposito, marco.mondelli}@ist.ac.at
Abstract

We introduce a novel concept of convergence for Markovian processes within Orlicz spaces, extending beyond the conventional approach associated with $L_p$ spaces. After showing that Markovian operators are contractive in Orlicz spaces, our key technical contribution is an upper bound on their contraction coefficient, which admits a closed-form expression. The bound is tight in some settings, and it recovers well-known results, such as the connection between contraction and ergodicity, ultra-mixing and Doeblin's minorisation. Specialising our approach to $L_p$ spaces leads to a significant improvement upon classical Riesz-Thorin interpolation methods. Furthermore, by exploiting the flexibility offered by Orlicz spaces, we can tackle settings where the stationary distribution is heavy-tailed, a severely under-studied setup. As an application of the framework put forward in the paper, we introduce tighter bounds on the mixing time of Markovian processes, better exponential concentration bounds for MCMC methods, and better lower bounds on the burn-in period. To conclude, we show how our results can be used to prove the concentration of measure phenomenon for a sequence of Markovian random variables.

Index Terms:
Markovian operators, Orlicz spaces, Contraction, MCMC, McDiarmid, Burn-in

I Introduction

Accepted for presentation at the Conference on Learning Theory (COLT) 2024.

The topic of bounding the contraction coefficient of Markovian operators in $L_p$ spaces and, specifically, $L_2$ spaces has a rich history, see the survey by [1]. The motivation comes from the fact that, whenever the contraction coefficient is strictly smaller than $1$, the corresponding Markovian process converges exponentially fast to the stationary distribution [2]. Moreover, the characterisation of the contraction coefficient also yields upper bounds on the so-called “mixing times” of the process, i.e., the number of steps required to be close to the stationary distribution. This has practical implications in Markov Chain Monte Carlo (MCMC) methods: MCMC allows estimating integrals with respect to a certain measure $\pi$ when said measure is either inaccessible or cannot be sampled efficiently, by designing a Markov chain that approximates it [3, 4, 2]. A big effort has thus been devoted to understanding how the estimation error behaves and how long the Markov chain needs to get close enough to the limiting distribution (burn-in period) [3, 4, 2].

Most of the literature focuses on asymptotic analyses [3] or mean-squared error bounds [2, 5], and it is restricted to finite-state space Markov chains. A different line of work tackles the problem by providing concentration inequalities for dependent random variables [6, 7, 8, 9, 10, 11, 12], which lead to high-probability error bounds and characterisations of the burn-in period. We highlight that all these results rely on the contraction coefficient of the Markov kernel associated with the chain. Similarly, but through a different route, [13] proves that one can control the mixing times of a Markov chain by studying the strong data-processing inequality (SDPI) constant of various divergences.

Our work delivers new and improved bounds on these fundamental quantities, by stepping away from the more classical framework and considering Orlicz spaces, a generalisation of $L_p$ spaces. We show that, as for $L_p$, Markov kernels are contractive in Orlicz spaces, i.e., the contraction coefficients of Markov kernels are less than or equal to $1$. We can then define a notion of convergence of Markov processes, which depends on the corresponding contraction coefficient. In particular, our main result, Theorem 3, provides an upper bound for the contraction coefficient of Markovian operators, which admits a closed-form expression. The key technical novelties come from duality considerations: the convergence of a Markovian process determined by $K$ depends on the contraction coefficient of its dual $K^{\star}$, which can in turn be bounded by considering appropriate nested norms of densities of $K^{\star}$ with respect to the stationary measure.

Our approach stands out as the first of its kind, as it does not rely on the existence of a spectral gap. In contrast, most existing bounds are based on the spectral gap [2, 11] and, thus, hold in $L_2$. To handle $L_p$ spaces, the typical strategy is then to leverage Riesz-Thorin interpolation, which often leads to vacuous bounds. Theorem 3, instead, evades said restrictions by considering directly the contraction coefficient in $L_p$, thus vastly improving upon the state of the art. Leveraging arbitrary Orlicz spaces also allows us to consider distributions with heavy tails, which are hard to handle with classical techniques [14, 15], see Remark 4.

Theorem 3 recovers well-known theorems in the literature, i.e., the connection between contraction and ergodicity (Corollary 1), as well as ultra-mixing and Doeblin's minorisation (Corollary 2) [16, 17, 13]. As an application of our analysis, we give better bounds on the mixing times of Markovian processes (Section IV-A) and better convergence guarantees for MCMC methods (Section IV-B), which lead to more refined bounds on the burn-in period [2, 11]. Finally, by exploiting the strategy developed by [12], we provide improved concentration results for a sequence of dependent random variables, where the dependence is captured by a Markovian process.

II Preliminaries

We adopt a measure-theoretic framework. Given a measurable space $(\mathcal{X},\mathcal{F})$ and two measures $\mu,\nu$ making it a measure space, if $\nu$ is absolutely continuous w.r.t. $\mu$ ($\nu\ll\mu$), we represent with $\frac{d\nu}{d\mu}$ the Radon-Nikodym derivative of $\nu$ w.r.t. $\mu$. Given a (measurable) function $f:\mathcal{X}\to\mathbb{R}$ and a measure $\mu$, we denote by $\mu(E)$ the measure of an event $E$ and with $\mu(f)=\int f\,d\mu$ the Lebesgue integral of $f$ with respect to the measure $\mu$.

II-A Markov kernels and Markov chains

Definition 1 (Markov kernel).

Let $(\Omega,\mathcal{F})$ be a measurable space. A Markov kernel $K:\mathcal{F}\times\Omega\to[0,1]$ is such that (i) $\forall\,x\in\Omega$, the mapping $E\in\mathcal{F}\to K(E|x)$ is a probability measure on $(\Omega,\mathcal{F})$, and (ii) $\forall\,E\in\mathcal{F}$, the mapping $x\in\Omega\to K(E|x)$ is an $\mathcal{F}$-measurable real-valued function.

A Markov kernel acts on measures “from the right”, i.e., given a measure $\mu$ on $(\Omega,\mathcal{F})$, $\mu K(E)=\mu(K(E|\cdot))=\int d\mu(x)\,K(E|x)$, and on functions “from the left”, i.e., given a function $f:\Omega\to\mathbb{R}$, $Kf(x)=\int dK(y|x)\,f(y)$. Hence, given a function $f\in L^{p}(\mu)$, one can see $K$ as a mapping $K:L^{p}(\mu)\to L^{p}(\mu)$. Moreover, an application of Jensen's inequality gives that $\mu((Kf)^{p})\leq\mu(K(f^{p}))=\mu K(f^{p})$, thus the mapping can be extended to $K:L^{p}(\mu)\to L^{p}(\mu K)$.

Given a Markov kernel $K$ and a measure $\mu$, let $K^{\star}_{\mu}$ denote the dual operator. By definition, the dual kernel can be seen as acting on $(L^{p}(\mu K))^{\star}\cong L^{q}(\mu K)$ with $q=\frac{p}{p-1}$ and returning a function in $(L^{p}(\mu))^{\star}\cong L^{q}(\mu)$. In particular, given a probability measure $\mu$ and two measurable functions $f,g$, one can define the inner product $\langle f,g\rangle_{\mu}=\int(f\cdot g)\,d\mu$, and $K^{\star}_{\mu}$ is the operator such that $\langle Kf,g\rangle_{\mu}=\langle f,K^{\star}_{\mu}g\rangle_{\mu K}$, for all $f$ and $g$. In discrete settings, the dual operator can be explicitly characterised via $K$ and $\mu$ as $K_{\mu}^{\star}(x|y)=K(y|x)\mu(x)/\mu K(y)$, see [13, Eq. (1.2)].
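In discrete settings, this duality is straightforward to verify numerically. The following is a minimal sanity check (plain Python/NumPy; the $3$-state kernel and the measure $\mu$ are arbitrary illustrations, not objects from the paper): it builds the dual from the closed-form expression above, checks that it is again a Markov kernel, and verifies the adjoint identity.

import numpy as np

rng = np.random.default_rng(0)

# A 3-state kernel (row-stochastic: K[x, y] = K(y|x)) and an input law mu.
K = rng.random((3, 3)); K /= K.sum(axis=1, keepdims=True)
mu = rng.random(3); mu /= mu.sum()

muK = mu @ K                                 # action on measures: (mu K)(y)
Kstar = (mu[None, :] * K.T) / muK[:, None]   # Kstar[y, x] = mu(x) K(y|x) / muK(y)
assert np.allclose(Kstar.sum(axis=1), 1)     # each row is a probability measure

# Adjoint identity <Kf, g>_mu = <f, Kstar g>_{mu K} for arbitrary f, g.
f, g = rng.random(3), rng.random(3)
lhs = np.sum(mu * (K @ f) * g)
rhs = np.sum(muK * f * (Kstar @ g))
assert np.isclose(lhs, rhs)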

Given a sequence of random variables $(X_{t})_{t\in\mathbb{N}}$, one says that it represents a Markov chain if, given $i\geq 1$, there exists a Markov kernel $K_{i}$ such that for every measurable event $E$:

$\mathbb{P}(X_{i}\in E|X_{1},\ldots,X_{i-1})=\mathbb{P}(X_{i}\in E|X_{i-1})=K_{i}(E|X_{i-1})\qquad\text{almost surely}.$ (1)

If for every $i\geq 1$, $K_{i}=K$ for some Markov kernel $K$, then the Markov chain is time-homogeneous, and we suppress the index from $K$. The kernel $K$ of a Markov chain is the probability of getting from $x$ to $E$ in one step, i.e., for every $i\geq 1$, $K(E|x)=\mathbb{P}(X_{i}\in E|X_{i-1}=x)$. One can then define (inductively) the $t$-step kernel $K^{t}$ as $K^{t}(E|x)=\int K^{t-1}(E|y)\,dK(y|x)$. Note that $K^{t}$ is also a Markov kernel, and it represents the probability of getting from $x$ to $E$ in $t$ steps: $K^{t}(E|x)=\mathbb{P}(X_{t+1}\in E|X_{1}=x)$. If $(X_{t})_{t\in\mathbb{N}}$ is the Markov chain associated to the kernel $K$ and $X_{1}\sim\mu$, then $\mu K^{m}$ denotes the law of $X_{m+1}$ for every $m\in\mathbb{N}$. Furthermore, a probability measure $\pi$ is a stationary measure for $K$ if $\pi K(E)=\pi(E)$ for every measurable event $E$. When the state space is discrete, $K$ can be represented via a stochastic matrix.
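For a finite state space, these objects are plain matrix computations: $K^{t}$ is the $t$-th matrix power and $\pi$ is a normalised left eigenvector of $K$ for the eigenvalue $1$. A brief sketch (the $2$-state kernel below is an arbitrary example):

import numpy as np

K = np.array([[0.9, 0.1],
              [0.3, 0.7]])           # row-stochastic kernel, K[x, y] = K(y|x)

Kt = np.linalg.matrix_power(K, 50)   # t-step kernel K^t

# Stationary measure: normalised left eigenvector of K for the eigenvalue 1.
w, V = np.linalg.eig(K.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1))]); pi /= pi.sum()

assert np.allclose(pi @ K, pi)                                # pi K = pi
assert np.allclose(Kt, np.outer(np.ones(2), pi), atol=1e-6)   # rows of K^t -> pi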

Given a Markov kernel and a measure, one can define the corresponding dual [16] and show the following result, which is pivotal for characterising the convergence of a Markovian process to the stationary measure. The proof is deferred to Appendix A-B.

Lemma 1.

Let $\mu$ be a positive measure, $\nu$ a positive measure s.t. $\nu\ll\mu$, and $f$ the Radon-Nikodym derivative $\frac{d\nu}{d\mu}$. Let $K$ be a Markov kernel and $g=\frac{d\nu K}{d\mu K}$. Then,

$g=K_{\mu}^{\star}f\qquad\mu K\text{-a.e.},$ (2)

where $K_{\mu}^{\star}$ is the operator such that, given two functions $h,f$, one has that $\langle Kh,f\rangle_{\mu}=\langle h,K_{\mu}^{\star}f\rangle_{\mu K}$.

Furthermore, $K^{\star}_{\mu}$ can be proved to be Markovian [16, Section 2, Lemma 2.2].

II-B Orlicz spaces

Definition 2 (Young Function, [18, Chapter 3]).

Let $\psi:\mathbb{R}^{+}\to\mathbb{R}^{+}$ be a convex non-decreasing function s.t. $\psi(0)=0$ and $\psi(x)\xrightarrow[x\to\infty]{}\infty$. Then, $\psi$ is called a Young function.

Definition 3 ([18, Definition 5]).

Let $\psi$ be a Young function. Let $(\Omega,\mathcal{F},\mu)$ be a measure space and denote by $L_{0}(\mu)$ the space of all the $\mathcal{F}$-measurable and real-valued functions on $\Omega$. An Orlicz space can be defined as the following set:

$L_{\psi}(\mu)=\left\{f\in L_{0}(\mu):\int_{\Omega}\psi(|\lambda f|)\,d\mu<+\infty\text{ for some }\lambda>0\right\}.$ (3)

Given a Young function $\psi:[0,+\infty)\to\overline{\mathbb{R}}^{+}$, the complementary function to $\psi$ in the sense of Young, denoted as $\psi^{\star}:\mathbb{R}\to\overline{\mathbb{R}}^{+}$, is defined as follows [18, Section 1.3]:

$\psi^{\star}(x)=\sup\left\{\lambda|x|-\psi(\lambda):\lambda\geq 0\right\}.$ (4)

An Orlicz space can be endowed with several norms that render it a Banach space [19]: the Luxemburg norm, the Orlicz norm, and the Amemiya norm. We will henceforth ignore the Orlicz norm, as it is equivalent to the Amemiya norm [19], and we define the Luxemburg norm $\lVert\cdot\rVert_{L^{L}_{\psi}}$ and the Amemiya norm $\lVert\cdot\rVert_{L^{A}_{\psi}}$ as

$\lVert f\rVert_{L^{L}_{\psi}(\mu)}=\inf\left\{\sigma>0:\mu\left(\psi\left(|f|/\sigma\right)\right)\leq 1\right\},\qquad\lVert f\rVert_{L^{A}_{\psi}(\mu)}=\inf_{t>0}\frac{\mu\left(\psi(t|f|)\right)+1}{t}.$ (5)

Orlicz spaces and the corresponding norms recover well-known objects (e.g., $L_p$-norms) and characterise random variables according to their “tail”. For more details, see Appendix A-A.
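As a concrete illustration, the Luxemburg norm in Equation 5 can be computed by bisection over $\sigma$, since $\sigma\mapsto\mu(\psi(|f|/\sigma))$ is non-increasing. The helper below is our own sketch (not from the paper); as a sanity check, for $\psi(x)=x^{p}$ it recovers the usual $L_p(\mu)$ norm on a discrete measure.

import numpy as np

def luxemburg_norm(f, mu, psi, tol=1e-10):
    """Luxemburg norm inf{sigma > 0 : mu(psi(|f|/sigma)) <= 1} via bisection."""
    f, mu = np.abs(np.asarray(f, float)), np.asarray(mu, float)
    lo, hi = tol, 1.0
    while np.sum(mu * psi(f / hi)) > 1:      # grow hi until it is feasible
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if np.sum(mu * psi(f / mid)) > 1 else (lo, mid)
    return hi

mu = np.array([0.2, 0.5, 0.3])
f = np.array([1.0, -2.0, 0.5])
p = 3
assert np.isclose(luxemburg_norm(f, mu, lambda x: x**p),
                  (np.sum(mu * np.abs(f)**p))**(1 / p), atol=1e-6)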

III Contraction of Markov kernels

The contractivity of Markov operators has been studied in particular for $L^p$ spaces, with [16, 20] considering Orlicz spaces as well. As we prove results for both Amemiya and Luxemburg norms, we use $\left\lVert\cdot\right\rVert_{L_{\psi}^{N}(\mu)}$ with $N\in\{A,L\}$ to denote both norms and we use $\left\lVert\cdot\right\rVert_{L_{\psi}^{N^{\star}}(\mu)}$ for the corresponding dual norm (e.g., if $N=A$ and one has an Amemiya norm, then $N^{\star}=L$ and the dual is the Luxemburg norm).

Definition 4.

Let $\psi$ be a Young function, $N\in\{A,L\}$ and $K$ a Markov kernel. The contraction coefficients of $K$ in the spaces $L_{\psi}(\mu),L_{\psi}^{0}(\mu)$ are denoted as follows:

$\sup_{f\in L_{\psi}(\mu)}\frac{\left\lVert Kf\right\rVert_{L_{\psi}^{N}(\mu)}}{\left\lVert f\right\rVert_{L_{\psi}^{N}(\mu)}}=\left\lVert K\right\rVert_{L_{\psi}^{N}(\mu)\to L_{\psi}^{N}(\mu)},\qquad\sup_{f\in L_{\psi}^{0}(\mu)}\frac{\left\lVert Kf\right\rVert_{L_{\psi}^{N}(\mu)}}{\left\lVert f\right\rVert_{L_{\psi}^{N}(\mu)}}=\left\lVert K\right\rVert_{L_{\psi}^{N,0}(\mu)\to L_{\psi}^{N,0}(\mu)},$ (6)

where $L_{\psi}^{0}(\mu)$ denotes the closed subspace of functions with mean $0$, i.e., $L_{\psi}^{0}(\mu)=\{f\in L_{\psi}(\mu):\mu(f)=0\}$. Whenever the measure is clear from the context, for ease of notation, we denote the contraction coefficient as $\left\lVert K\right\rVert_{L_{\psi}^{N}\to L_{\psi}^{N}}$.

A characterisation of how Markovian operators behave in Orlicz spaces is stated below and proved in Appendix C-A.

Theorem 1.

Let $\psi$ be a Young function, $N\in\{A,L\}$, and $K$ a Markov kernel with stationary distribution $\pi$. Then, for every $f\in L_{\psi}(\pi)$,

$\left\lVert Kf\right\rVert_{L^{N}_{\psi}(\pi)}\leq\left\lVert f\right\rVert_{L^{N}_{\psi}(\pi)}\qquad\text{and}\qquad\left\lVert K\right\rVert_{L^{N}_{\psi}\to L^{N}_{\psi}}=1.$ (7)

The key takeaway of Theorem 1 is that, as in $L_p$ spaces, Markov kernels are contractive, but on the general space of $L_{\psi}$-integrable functions said contraction coefficient is trivially $1$. Restricting to the closed subspace of functions with mean $0$ (i.e., $L^{0}_{\psi}(\mu)$) does yield an improvement. In fact, the contraction coefficient of a Markov kernel $K$ in $L_{\psi}^{0}$ is typically less than $1$. This observation is fundamental to characterise convergence to the stationary distribution. Indeed, if $\left\lVert K\right\rVert_{L_{2}^{0}\to L_{2}^{0}}=\gamma<1$, then $K$ admits an $L_2$-spectral gap $(1-\gamma)$, which implies “$L_2$-geometric ergodicity”, as stated below [2, Proposition 3.12].

Proposition 1.

Let $K$ be a Markov kernel with stationary distribution $\pi$ and spectral gap $1-\gamma>0$. For every probability measure $\nu$ on the same space s.t. $\left\lVert d\nu/d\pi\right\rVert_{L_{2}(\pi)}<\infty$, we have

$\left\lVert\nu K^{t}-\pi\right\rVert_{L_{2}(\pi)}\leq\gamma^{t}\left\lVert d\nu/d\pi-1\right\rVert_{L_{2}(\pi)}.$ (8)

In fact, in the specific setting of $L_2$, geometric ergodicity and existence of a spectral gap are equivalent notions [2, Proposition 3.13].
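As a quick numerical check of Proposition 1, consider the binary symmetric channel of parameter $\lambda$, for which $\gamma=1-2\lambda$ and $\pi$ is uniform; in this symmetric case Equation 8 actually holds with equality (the values of $\lambda$ and of the starting law $\nu$ below are arbitrary).

import numpy as np

lam = 0.2
K = np.array([[1 - lam, lam], [lam, 1 - lam]])   # BSC: gamma = 1 - 2*lam
pi = np.array([0.5, 0.5]); gamma = 1 - 2 * lam
nu = np.array([0.9, 0.1])

for t in [1, 5, 10]:
    # lhs = || d(nu K^t)/d(pi) - 1 ||_{L2(pi)}, the left-hand side of Eq. (8).
    lhs = np.sqrt(pi @ (nu @ np.linalg.matrix_power(K, t) / pi - 1)**2)
    rhs = gamma**t * np.sqrt(pi @ (nu / pi - 1)**2)
    assert lhs <= rhs + 1e-12                     # Eq. (8); equality for the BSC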

The purpose of this work is to relate the contractivity of Markovian operators on Orlicz spaces to the convergence of the corresponding Markov chain to its stationary distribution. To do so, we leverage Lemma 1 and show convergence in $L_{\psi}^{N}$-norm, as soon as the contraction coefficient of the dual kernel $K^{\star}$ is $<1$. This is formalised in the result below, which is proved in Appendix C-B.

Theorem 2.

Let $K$ be a Markov kernel with stationary distribution $\pi$, $N\in\{A,L\}$, and $\psi$ a Young function. For every measure $\nu\ll\pi$ and $t\in\mathbb{N}$, we have

$\left\lVert\frac{d\nu K^{t}}{d\pi}-1\right\rVert_{L^{N}_{\psi}(\pi)}\leq\left\lVert K^{\star}\right\rVert_{L_{\psi}^{N,0}\to L_{\psi}^{N,0}}^{t}\left\lVert\frac{d\nu}{d\pi}-1\right\rVert_{L_{\psi}^{N}(\pi)}.$ (9)

Theorem 2 links the convergence of a Markovian process to the contraction coefficient of the dual kernel $K^{\star}$. Indeed, if $\left\lVert K^{\star}\right\rVert_{L_{\psi}^{N,0}\to L_{\psi}^{N,0}}<1$, then the right-hand side of Equation 9 goes to $0$ exponentially fast in the number of steps $t$ and, hence, $\nu K^{\infty}=\pi$. We also note that the proof can be adapted to replace $\left\lVert K^{\star}\right\rVert_{L_{\psi}^{N,0}\to L_{\psi}^{N,0}}^{t}$ with $\left\lVert K^{\star}-1_{\pi}\right\rVert^{t}_{L_{\psi}^{N}\to L_{\psi}^{N}}$, where $1_{\pi}$ is the operator mapping a function $f$ to $\int f\,d\pi$. These two bounds are closely related: $\left\lVert K^{\star}\right\rVert_{L_{\psi}^{N,0}\to L_{\psi}^{N,0}}^{t}$ focuses on $0$-mean functions, while in $\left\lVert K^{\star}-1_{\pi}\right\rVert^{t}_{L_{\psi}^{N}\to L_{\psi}^{N}}$ we directly subtract the mean $1_{\pi}$. More generally, we can relate the contraction coefficient of $K^{\star}$ on $L_{\psi}^{0}$ to that of $K^{\star}-1_{\pi}$ on $L_{\psi}$, see Section A-C.

At this point, we are ready to state our key technical contribution which bounds the contraction coefficient of a kernel and of its dual. Its proof is in Section C-C.

Theorem 3.

Let $\psi$ be a Young function and $\mu$ a probability measure. Let $K$ be a Markov kernel, $K^{\star}=K^{\star}_{\mu}$ its dual, and $N\in\{A,L\}$. Assume that, for every $x$, $K(\cdot|x)$ is absolutely continuous w.r.t. $\mu K$ and, for every $y$, $K^{\star}(\cdot|y)$ is absolutely continuous w.r.t. $\mu$. Denote by $g_{x}=\frac{dK(\cdot|x)}{d\mu K}$ and $g_{y}=\frac{dK^{\star}(\cdot|y)}{d\mu}$ the Radon-Nikodym derivatives and with $g_{X}$ and $g_{Y}$ the corresponding random variables whose law is induced by $X\sim\mu$ and $Y\sim\mu K$. Then, for every $f\in L_{\psi}^{0}(\mu K)$ and $h\in L_{\psi}^{0}(\mu)$,

$\frac{\left\lVert Kf\right\rVert_{L_{\psi}^{N}(\mu)}}{\left\lVert f\right\rVert_{L_{\psi}^{N}(\mu K)}}\leq\left\lVert\left\lVert g_{X}-1\right\rVert_{L_{\psi^{\star}}^{N^{\star}}(\mu K)}\right\rVert_{L_{\psi}^{N}(\mu)}\qquad\text{and}\qquad\frac{\left\lVert K^{\star}h\right\rVert_{L_{\psi}^{N}(\mu K)}}{\left\lVert h\right\rVert_{L_{\psi}^{N}(\mu)}}\leq\left\lVert\left\lVert g_{Y}-1\right\rVert_{L_{\psi^{\star}}^{N^{\star}}(\mu)}\right\rVert_{L_{\psi}^{N}(\mu K)}.$ (10)
Remark 1.

Let $\psi(x)=|x|^{p}$ for $p>1$, and $q=p/(p-1)$. Then, the following is a corollary of Theorem 3 that holds for every $f\in L_{p}^{0}(\mu K)$ and $h\in L_{p}^{0}(\mu)$:

$\frac{\left\lVert Kf\right\rVert_{L_{p}(\mu)}}{\left\lVert f\right\rVert_{L_{p}(\mu K)}}\leq\left\lVert\left\lVert g_{X}-1\right\rVert_{L_{q}(\mu K)}\right\rVert_{L_{p}(\mu)}\qquad\text{and}\qquad\frac{\left\lVert K^{\star}h\right\rVert_{L_{p}(\mu K)}}{\left\lVert h\right\rVert_{L_{p}(\mu)}}\leq\left\lVert\left\lVert g_{Y}-1\right\rVert_{L_{q}(\mu)}\right\rVert_{L_{p}(\mu K)}.$ (11)

Theorem 3 offers a closed-form bound on the contraction coefficient. We remark that, to the best of our knowledge, no such bound in $L_{\psi}^{0}$ spaces exists. If the kernel is induced by the so-called binary symmetric channel of parameter $\lambda$ ($K(x|y)=1-\lambda$ for $x=y$; otherwise, $K(x|y)=\lambda$), then Equation 11 is tight for $p=2$, see Example 2 in Appendix B for details. The case $p=\infty$ represents another setting in which we can prove the tightness of Equation 11, as formalised below and proved in Section C-D.
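For a finite chain with $\mu=\pi$ (so that $\mu K=\pi$), the right-hand side of Equation 11 is directly computable: $g_{x}$ is the $x$-th row of $K$ divided entrywise by $\pi$. A sketch for the binary symmetric channel, where the bound evaluates to $|1-2\lambda|$ for every $p$ (and is therefore tight at $p=2$):

import numpy as np

def lp_norm(v, w, p):
    """L_p norm of v w.r.t. the probability weights w."""
    return (np.sum(w * np.abs(v)**p))**(1 / p)

def eq11_bound(K, pi, p):
    """RHS of Eq. (11) with mu = pi: outer L_p(pi) norm over x of the inner
    L_q(pi) norm over y of g_x - 1, where g_x(y) = K(y|x)/pi(y)."""
    q = p / (p - 1)
    inner = np.array([lp_norm(K[x] / pi - 1, pi, q) for x in range(len(pi))])
    return lp_norm(inner, pi, p)

lam = 0.2
K = np.array([[1 - lam, lam], [lam, 1 - lam]])   # binary symmetric channel
pi = np.array([0.5, 0.5])
for p in [1.5, 2.0, 4.0, 10.0]:
    assert np.isclose(eq11_bound(K, pi, p), 1 - 2 * lam)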

Corollary 1.

Let $K$ be a Markov kernel with stationary distribution $\pi$. Then, the following holds:

$\left\lVert K^{t}-1_{\pi}\right\rVert_{L_{\infty}\to L_{\infty}}\leq\left\lVert\left\lVert g^{t}_{X}-1\right\rVert_{L_{1}(\pi)}\right\rVert_{L_{\infty}(\pi)}=\operatorname*{ess\,sup}_{\pi}2\left\lVert K^{t}(\cdot|X)-\pi\right\rVert_{TV},$ (12)

where $g_{x}^{t}(y)=\frac{dK^{t}(y|x)}{d\pi K^{t}(y)}=\frac{dK^{t}(y|x)}{d\pi(y)}$.

In fact, $\left\lVert K^{t}-1_{\pi}\right\rVert_{L_{\infty}\to L_{\infty}}=\operatorname*{ess\,sup}_{x}2\left\lVert K^{t}(\cdot|x)-\pi\right\rVert_{TV}$ by Proposition 3.23 in [2]. In the Markov chain literature, it is quite common to show that the right-hand side of Equation 12 is bounded. This usually goes under the name of uniform (or strong) ergodicity. Hence, Corollary 1 leads to another proof that uniform ergodicity implies $L_{\infty}$-exponential convergence.

Theorem 3 also recovers other well-known results. Indeed, [16] show that, if ultra-mixing holds (which, in turn, implies Doeblin's minorisation [17, Section 4.3.3]), then one can bound the contraction coefficient of Markovian kernels in arbitrary norms.

Corollary 2.

Given a Markov kernel $K$, assume there exists $0<\epsilon<1$ s.t. the kernel satisfies the ultra-mixing condition, i.e., for all $x,y$,

$\frac{dK(\cdot|x)}{dK(\cdot|y)}\geq\epsilon,\qquad K(\cdot|y)\text{-a.e.}$ (13)

Then, for every Young function $\psi$ and $N\in\{A,L\}$, we have

$\left\lVert\frac{d\nu K}{d\mu K}-1\right\rVert_{L^{N}_{\psi}(\mu K)}\leq(1-\epsilon)\left\lVert\frac{d\nu}{d\mu}-1\right\rVert_{L^{N}_{\psi}(\mu)}.$ (14)

Corollary 2 is indeed a consequence of Theorem 3, as proved in Section C-E.

IV Applications

IV-A Mixing times

Definition 5.

Given a Markov kernel $K$ with stationary distribution $\pi$, $N\in\{A,L\}$, and a Young function $\psi$, the $\psi$-mixing time of $K$ is the function $\tau_{\psi}(K,\cdot):\mathbb{R}^{+}\to\mathbb{N}$ such that for every $\nu\ll\pi$

$\tau_{\psi}(K,\epsilon)=\min\left\{t\in\mathbb{N}:\left\lVert\frac{d\nu K^{t}}{d\pi}-1\right\rVert_{L_{\psi}^{N}(\pi)}\leq\epsilon\right\}.$ (15)

The first application of our framework is stated below and follows from Theorem 3 and Eq. (15).

Corollary 3.

Let $K$ be a Markov operator with stationary distribution $\pi$, $K^{\star}=K^{\star}_{\pi}$ its dual, $N\in\{A,L\}$, and $\psi$ a Young function. Then, the following holds for every $\nu\ll\pi$:

$\tau_{\psi}(K,\epsilon)\leq\frac{\log\left(\left\lVert\frac{d\nu}{d\pi}-1\right\rVert_{L_{\psi}^{N}(\pi)}/\epsilon\right)}{-\log\left(\left\lVert K^{\star}\right\rVert_{L_{\psi}^{N,0}\to L_{\psi}^{N,0}}\right)}.$ (16)
Remark 2.

For discrete spaces and a discrete-valued Markov kernel, by leveraging the convexity of the norms and the fact that convex functions on a compact convex set attain their maximum on an extreme point, we can upper bound $\left\lVert\frac{d\nu}{d\pi}-1\right\rVert_{L_{\psi}^{N}(\pi)}$ for every $\nu$ with $\max_{x}\left\lVert\frac{d\delta_{x}}{d\pi}-1\right\rVert_{L_{\psi}^{N}(\pi)}$, which admits a closed-form expression. See Section D-A for details.
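In the $L_p$ case, this closed form is immediate: $\frac{d\delta_{x}}{d\pi}-1$ equals $1/\pi(x)-1$ at $x$ and $-1$ elsewhere, so its norm is explicit, and (for $p>1$) the maximum over $x$ is attained at the least likely state. A small sketch of this computation (our own illustration):

import numpy as np

def max_delta_norm(pi, p):
    """max_x || d(delta_x)/d(pi) - 1 ||_{L_p(pi)} in closed form."""
    pi = np.asarray(pi, float)
    return np.max((pi * (1 / pi - 1)**p + (1 - pi))**(1 / p))

pi = np.array([0.5, 0.3, 0.2])
p = 3.0
x = np.argmin(pi)                       # worst-case point mass for p > 1
dens = -np.ones_like(pi); dens[x] = 1 / pi[x] - 1
direct = (np.sum(pi * np.abs(dens)**p))**(1 / p)
assert np.isclose(max_delta_norm(pi, p), direct)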

Corollary 3 relates the contraction coefficient of the dual operator to the mixing time of the corresponding Markov chain, thus motivating our study of said contraction coefficient. Moreover, as $\psi$ is arbitrary, a natural question concerns the choice of the norm. To address it, first we derive a general result providing intuition over what it means to be close in a specific norm. In particular, we relate the probability of any event under $\mu K^{t}$ with the probability of the same event under the stationary measure $\pi$ and an $L^{N}_{\psi}$-norm. We then focus on the well-known family of $L_p$-norms (particularly relevant for exponentially decaying probabilities under $\pi$), and show that asking for a bounded norm with larger $p$ guarantees a faster decay of the probability of the same event under $\mu K^{t}$. Finally, we design a Young function tailored to the challenging (and less understood) setting in which $\pi$ behaves like a power law [14].

Theorem 4.

Let $t\in\mathbb{N}$ and $E$ be a measurable event. Then,

$\mu K^{t}(E)\leq\min\begin{cases}\inf_{\psi}\frac{1}{{\psi^{\star}}^{-1}(1/\pi(E))}\left(\left\lVert\frac{d\mu K^{t}}{d\pi}-1\right\rVert_{L_{\psi}^{A}(\pi)}+{\psi^{\star}}^{-1}(1)\right),\\ \inf_{\psi}\pi(E)\,\psi^{-1}(1/\pi(E))\left(\left\lVert\frac{d\mu K^{t}}{d\pi}-1\right\rVert_{L_{\psi}^{L}(\pi)}+1/\psi^{-1}(1)\right),\end{cases}$ (17)

where the infimum is over all Young functions $\psi$.

We now apply Theorem 4 to various $\psi$, and start by demonstrating the advantage of considering $L_p$-norms with $p\neq 2$. The result below is proved in Section D-C.

Corollary 4.

Let $\varphi_{p}(x)=|x|^{p}/p$ with $p>1$, $q=p/(p-1)$, $t\in\mathbb{N}$ and $E$ a measurable event. Then,

$\mu K^{t}(E)\leq\pi(E)^{\frac{1}{q}}\left(\pi(E)^{\frac{1}{p}}+\left\lVert\frac{d\mu K^{t}}{d\pi}-1\right\rVert_{L_{p}(\pi)}\right).$ (18)

As an immediate consequence, if $\left\lVert\frac{d\mu K^{t}}{d\pi}-1\right\rVert_{L_{p}(\pi)}\leq\epsilon$, then for every $E$

$\mu K^{t}(E)\leq\pi(E)^{\frac{1}{q}}\left(\pi(E)^{\frac{1}{p}}+\epsilon\right)=\pi(E)+\pi(E)^{\frac{1}{q}}\epsilon.$ (19)

The following setting is especially interesting for the applicability of Corollary 4. Consider a function $f:\mathcal{X}^{n}\to\mathbb{R}$ and $E=\{(x_{1},\ldots,x_{n}):|f(x_{1},\ldots,x_{n})-\pi(f)|\geq\eta\}$ for some fixed $\eta>0$. If $f(x_{1},\ldots,x_{n})=\frac{1}{n}\sum_{i=1}^{n}f_{i}(x_{i})$ for $f_{1},\ldots,f_{n}$ with $0\leq f_{i}(x)\leq C$, by Hoeffding's inequality one has:

$\pi(E)\leq 2\exp\left(-\frac{2\eta^{2}n}{C^{2}}\right),$ (20)

where the constant $C^{2}$ in the exponent can be improved under additional assumptions, see e.g. [21]. Notice that Equation 20 does not require $\pi$ to be a product distribution. Indeed, for example, if the Markov chain has spectral gap $1-\lambda$, then Equation 20 holds with the additional constant $(1-\lambda)/(1+\lambda)$ in the exponent, see [11, Theorem 1]. Then, given $t\geq\tau_{\varphi_{p}}(K,\epsilon)$ and $q=\frac{p}{p-1}$, we have

$\mu K^{t}(E)\leq 2\exp\left(-\frac{2n\eta^{2}}{C^{2}}\right)+\epsilon\,2^{\frac{1}{q}}\exp\left(-\frac{2n\eta^{2}}{qC^{2}}\right).$ (21)

Starting from Equation 21, one can now select $\epsilon$ such that the probability is exponentially decaying in the dimension of the ambient space $n$. In fact, a bound that decays exponentially in the dimension $n$ is guaranteed whenever, for a given $\eta$, $f$, and finite $p$,

$\epsilon<\exp\left(\gamma\frac{2n\eta^{2}}{qC^{2}}\right),\text{ with }\gamma<1.$ (22)

For example, $\epsilon=\exp\left(n\eta^{2}/(qC^{2})\right)$ guarantees exponential convergence in Equation 21 and plugging this choice in Equation 16 gives the bound below on the mixing time:

$\tau_{p}(K,\epsilon)\leq\frac{1}{-\log\left(\left\lVert K^{\star}\right\rVert_{L_{p}^{0}\to L_{p}^{0}}\right)}\left(\log\left(L_{p}^{\star}(\pi)\right)+\frac{qC^{2}}{n\eta^{2}}\right),$ (23)

where $L_{p}^{\star}(\pi)=\sup_{\nu\ll\pi}\left\lVert\frac{d\nu}{d\pi}-1\right\rVert_{L^{p}(\pi)}$. We note that, besides $\epsilon$, both $L_{p}^{\star}(\pi)$ and $\left\lVert K^{\star}\right\rVert_{L_{p}^{0}\to L_{p}^{0}}$ depend on the dimension $n$. To maximise the rate of exponential decay in Equation 21, one has to take $p\to\infty$, which implies $q\to 1$. The price to pay is an increase of the mixing time and, to quantify such increase, it is crucial to provide tight bounds on the contraction coefficient of Markovian operators, beyond the usual $L_2$ space. Theorem 3 can then be leveraged to compute said bounds, and we now compare it with the standard approach in the literature, i.e., the Riesz-Thorin interpolation theorem between norms, which is stated below for convenience.
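To make Equation 16 concrete before moving to the comparison: given a bound $\gamma_{p}<1$ on $\left\lVert K^{\star}\right\rVert_{L_{p}^{0}\to L_{p}^{0}}$ and the initial distance $\left\lVert d\nu/d\pi-1\right\rVert_{L_{p}(\pi)}$, the mixing time follows from a one-line computation. A sketch for the binary symmetric channel, where the dual coincides with $K$ itself and Equation 11 evaluates to $|1-2\lambda|$ for every $p$ (the values of $\lambda$, $p$ and $\epsilon$ below are arbitrary):

import numpy as np

def mixing_time_bound(gamma_p, d0, eps):
    """Eq. (16): smallest integer t with gamma_p**t * d0 <= eps."""
    assert 0 < gamma_p < 1
    return int(np.ceil(np.log(d0 / eps) / -np.log(gamma_p)))

lam, p = 0.2, 2.0
gamma_p = abs(1 - 2 * lam)              # Eq. (11) bound for the BSC
# Worst initial distance over point masses (Remark 2): the density of a point
# mass minus 1 is (1, -1) under the uniform pi, so the L_p(pi) norm is 1.
d0 = (0.5 * 1.0**p + 0.5)**(1 / p)
print(mixing_time_bound(gamma_p, d0, eps=1e-3))   # log(1e3)/log(1/0.6) -> 14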

Proposition 2.

Let $p\in(1,+\infty)$ and $t\in\mathbb{N}$. Let $K$ be a Markov kernel with stationary distribution $\pi$. If $K$ admits an $L_2$-spectral gap of $(1-\gamma)$, then the following holds:

$\left\lVert K^{t}-1_{\pi}\right\rVert_{L_{p}\to L_{p}}\leq\begin{cases}2^{\frac{2-p}{p}}\gamma^{2t\frac{p-1}{p}},&p\in(1,2),\\ 2^{\frac{p-2}{p}}\gamma^{2t/p},&p\in(2,+\infty).\end{cases}$ (24)

Moreover, one has that:

$\left\lVert K^{t}-1_{\pi}\right\rVert_{L_{2}\to L_{2}}\leq\gamma^{t},\qquad\left\lVert K^{t}-1_{\pi}\right\rVert_{L_{1}\to L_{1}}\leq 2,\qquad\left\lVert K^{t}-1_{\pi}\right\rVert_{L_{\infty}\to L_{\infty}}\leq 2.$ (25)

As an example, consider the binary symmetric channel of parameter $\lambda$. Then, $\gamma=(1-2\lambda)$ and Proposition 2 with $p>2$ yields:

$\left\lVert K^{t}-1_{\pi}\right\rVert_{L_{p}\to L_{p}}\leq 2^{\frac{p-2}{p}}(1-2\lambda)^{2t/p}.$ (26)

In contrast, Theorem 3 gives:

$\left\lVert K^{t}-1_{\pi}\right\rVert_{L_{p}\to L_{p}}\leq(1-2\lambda)^{t},\text{ for every }p\geq 1,$ (27)

which improves over Equation 26 for every $p$. Notably, Equation 27 holds even if $p\to 1$ and $p\to\infty$, while the result of Proposition 2 is vacuous in both cases (cf. Equation 25). Figure 1 considers Markov kernels induced by $5\times 5$ randomly generated stochastic matrices, and it shows that while the bound in Proposition 2 is always vacuous (red dots), the bound given by Equation 11 is always below $1$ and often very close to $0$ (blue dots). Moreover, given that we are in a discrete setting, one can explicitly compute the $L_{\infty}$/$L_{1}$-operator norm for a specific kernel $K$ and interpolate between the $L_{\infty}$/$L_{1}$-operator norm and the spectral gap. The comparison yields the following results (a numerical sketch of the comparison follows the list below):

  • For $n=5$, $t=20$, $p=10$, our upper bound is roughly 50% smaller than Riesz-Thorin interpolation (on average over the 100 random trials);

  • For $n=5$, $t=200$, $p=100$, our upper bound is roughly 70% smaller than Riesz-Thorin interpolation (on average over the 100 random trials);

  • For $n=3$, $t=4$, $p=1.1$, our upper bound is roughly 23% smaller than Riesz-Thorin interpolation (on average over the 100 random trials);

  • For $n=3$, $t=4$, $p=1.5$, our upper bound is roughly 20% smaller than Riesz-Thorin interpolation (on average over the 100 random trials).
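The sketch below reproduces one instance of this comparison (a single random $5\times 5$ stochastic matrix rather than an average over 100 trials). It relies on the standard identity that the $L_2(\pi)$ contraction coefficient of $K-1_{\pi}$ equals the largest singular value of $D_{\pi}^{1/2}KD_{\pi}^{-1/2}$ restricted to the orthogonal complement of $\sqrt{\pi}$.

import numpy as np

rng = np.random.default_rng(1)
n, t, p = 5, 20, 10.0
q = p / (p - 1)

K = rng.random((n, n)); K /= K.sum(axis=1, keepdims=True)   # random kernel
w, V = np.linalg.eig(K.T)                                   # stationary law
pi = np.real(V[:, np.argmin(np.abs(w - 1))]); pi /= pi.sum()

# gamma = ||K - 1_pi||_{L2(pi) -> L2(pi)} via the weighted singular value.
D = np.diag(np.sqrt(pi)); v = np.sqrt(pi)
A = D @ K @ np.linalg.inv(D) @ (np.eye(n) - np.outer(v, v))
gamma = np.linalg.svd(A, compute_uv=False)[0]

riesz_thorin = 2**((p - 2) / p) * gamma**(2 * t / p)        # Eq. (24), p > 2

Kt = np.linalg.matrix_power(K, t)                           # Eq. (11) on K^t
inner = np.array([(pi @ np.abs(Kt[x] / pi - 1)**q)**(1 / q) for x in range(n)])
ours = (pi @ inner**p)**(1 / p)
print(f"Riesz-Thorin: {riesz_thorin:.3g}, Eq. (11): {ours:.3g}")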

Figure 1: Bound on the contraction coefficient given by Equations 11 and 24 for $100$ randomly generated $5\times 5$ stochastic matrices, for $p=100$ and $t=10$.

Figure 2: Bound on $\mu K^{t}(E)$, as a function of $n$, induced by the $L_{1.09}$-norm (Eq. (19)) and $L_{\psi_{5}^{5}}^{L}$ (Eq. (28)), when $\pi(E)$ decays polynomially with exponent $-2.1$.
Remark 3.

So far, we have compared our approach with Riesz-Thorin interpolation. However, part of the Markov chain literature leverages Stein interpolation as well, e.g., [22]. This, however, requires considering the semigroup $H_{z}=e^{-z}\sum_{n=0}^{\infty}\frac{(zK)^{n}}{n!}=e^{-z(I-K)}$ associated with a Markov kernel $K$. Let us thus compare with the argument used in [22, Theorem 3.9] where, given $z$, Stein interpolation is used to provide a bound on $\left\lVert H_{z}-1\right\rVert_{2\to p(z)}$, the operator norm of $H_{z}-1$ from $L_2$ to $L_{p(z)}$, with $p(z)$ a function of $z$ opportunely chosen. In particular, select $q=\infty$ and $z_{\infty}$ in [22, Theorem 3.9] and denote $M_{\infty}=\left\lVert H_{z_{\infty}}-1\right\rVert_{2\to\infty}$. Via Stein interpolation one can provide the following bound for every $z<z_{\infty}$: $\left\lVert H_{z}-1\right\rVert_{2\to p(z)}\leq M_{\infty}^{z/z_{\infty}}$, where $p(z)=2z_{\infty}/(z_{\infty}-z)$. In contrast, an analogous version of Equation 11 gives $\left\lVert H_{z}-1\right\rVert_{2\to p(z)}\leq\left\lVert\left\lVert h_{z}^{X}-1\right\rVert_{2}\right\rVert_{p(z)}$, where $h_{z}^{X}$ denotes the density of $H_{z}$ with respect to the stationary distribution $\pi$. In order to provide a concrete comparison between the two bounds, let $K$ be the binary symmetric channel with parameter $\lambda=1/3$ and select $z_{\infty}=10$. Figure 3 shows how, in this setting, our bound always improves over [22].

Figure 3: Comparison between the bounds induced by an analogous of Equation 11 and Stein interpolation on $\left\lVert H_{z}-1\right\rVert_{2\to p(z)}$ as a function of $z<10$. $K$ denotes the kernel induced by a binary symmetric channel of parameter $\lambda=1/3$ and $M_{\infty}$ is computed numerically.
Remark 4.

A suitable choice of the Young function allows us to treat stationary distributions with heavy tails. In this case, geometric or uniform ergodicity cannot be achieved, and the best one can hope for is a polynomial convergence in total variation [14]. Thus, there is no spectral gap [23, Theorem 2.1], which affects the convergence rates of MCMC algorithms with polynomial target distributions, e.g., Metropolis-Hastings with independence sampler [15, Section 4.3].

For concreteness, consider upper bounding $\mu K^{t}(E)$, when the stationary distribution $\pi$ is a power-law with exponent $-(2+\eta)$ and $t$ is s.t. $\left\lVert\frac{d\mu K^{t}}{d\pi}-1\right\rVert_{L_{p}(\pi)}\leq\epsilon$. Then, if we restrict to $L_p$ spaces, we obtain the bound in Equation 19, with $q>(1+\eta)/\eta$. Here, the lower bound on $q$ comes from the fact that $q=\frac{p}{p-1}$ and the largest $p$ s.t. a random variable with measure $\pi$ belongs to $L_p$ has to satisfy $1<p<1+\eta$ ($p=1$ implies $q=\infty$, leading to a trivial bound in Equation 19).

Instead, by considering Orlicz spaces different from $L_p$, much more refined bounds can be obtained. Let $\psi_{k}^{m}(x)=x^{m}\mathbbm{1}_{1\leq x\leq k}+\left((1+x)\log(1+x)+k^{m}-(1+k)\log(1+k)\right)\mathbbm{1}_{x>k}$, which has a different rate of growth before and after a certain parameter $k$. Then, one can readily verify that $\psi_{k}^{m}$ is a Young function, and a random variable with measure $\pi$ belongs to $L_{\psi_{k}^{m}}$ for any finite $k$. By applying Theorem 4 with this choice of $\psi$, we obtain

$\mu K^{t}(E)\leq\pi(E)\,{(\psi_{k}^{m})}^{-1}\left(\frac{1}{\pi(E)}\right)\left(\epsilon+\frac{1}{(\psi_{k}^{m})^{-1}(1)}\right),$ (28)

which vastly improves upon the bound in Equation 19, as shown in Figure 2. Note that, to maximise the decay of $\mu K^{t}$ in Equation 19, $q$ (resp. $p$) needs to be as small (resp. large) as possible. Hence, with $\eta=0.1$, nearly the best decay is given by $p=1.09$, which implies $q=1.09/0.09$.
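A sketch of the computation behind Figure 2, under the illustrative choices $\pi(E)=10^{-3}$, $\epsilon=0.1$ and $k=m=5$: we implement $\psi_{k}^{m}$ and its generalised inverse, and compare the bound of Equation 28 against the $L_{1.09}$ bound of Equation 19.

import numpy as np

def psi_km(x, k, m):
    """psi_k^m from Remark 4 (zero on [0, 1), x^m on [1, k], log branch after k)."""
    x = np.asarray(x, float)
    return np.where(x <= k, np.where(x >= 1, x**m, 0.0),
                    (1 + x) * np.log1p(x) + k**m - (1 + k) * np.log1p(k))

def psi_km_inv(y, k, m):
    """Generalised inverse inf{x : psi_k^m(x) >= y}."""
    if y <= 1:
        return 1.0
    if y <= k**m:
        return y**(1 / m)
    lo, hi = k, 2 * k
    while psi_km(hi, k, m) < y:          # bracket the root, then bisect
        hi *= 2
    for _ in range(100):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if psi_km(mid, k, m) < y else (lo, mid)
    return hi

k, m, eps, piE = 5, 5, 0.1, 1e-3
bound_28 = piE * psi_km_inv(1 / piE, k, m) * (eps + 1 / psi_km_inv(1.0, k, m))
p = 1.09; q = p / (p - 1)                # nearly the best L_p choice for eta = 0.1
bound_19 = piE + piE**(1 / q) * eps
print(f"Eq. (28): {bound_28:.2e}, Eq. (19): {bound_19:.2e}")   # ~4.4e-03 vs ~5.8e-02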

IV-B Markov Chain Monte Carlo

Proposition 2 was used by [11] to prove concentration for MCMC methods. Here, we show how an application of Theorem 3 improves over such results.

Let us briefly recall the setup. Given a function $f$, we want to estimate the mean $\pi(f)$ with respect to a measure $\pi$ which cannot be directly sampled. A common approach is to consider a Markov chain $\{X_{i}\}_{i\geq 1}$ with stationary distribution $\pi$, and estimate $\pi(f)$ via empirical averages of samples $\{X_{i}\}_{t_{0}+1}^{t_{0}+t}$, where the burn-in period $t_{0}$ ensures that the Markov chain is close to $\pi$.

We start by considering again the binary symmetric channel with parameter $\lambda$. The stationary distribution is $\pi=(1/2,1/2)$ and the spectral gap is $|1-2\lambda|$. Assume $X_{1}\sim\nu$ for some measure $\nu\neq\pi$ and $\lambda<1/2$. Applying Theorem 12 by [11] to $f$ s.t. $f\in[0,\frac{1}{t}]$ yields

$\mathbb{P}\left(\frac{1}{n}\sum_{i=t_{0}+1}^{t_{0}+t}f(X_{i})-\pi^{\otimes t}(f)\geq\eta\right)\leq C(\nu,t_{0},p)\exp\left(-\frac{2}{q}\frac{\lambda}{1-\lambda}t\eta^{2}\right),$ (29)

where $p>1$, $q=p/(p-1)$, $\pi^{\otimes t}$ denotes the tensor product of $\pi$, and

$C(\nu,t_{0},p)\leq\begin{cases}1+(2^{2/p}\mathbbm{1}_{p<2}+\mathbbm{1}_{p=2})(1-2\lambda)^{\frac{2t_{0}}{q}}\left\lVert\frac{d\nu}{d\pi}-1\right\rVert_{L^{p}(\pi)},&\text{if }p\in(1,2],\\ 1+2^{2/q}(1-2\lambda)^{\frac{2t_{0}}{p}}\left\lVert\frac{d\nu}{d\pi}-1\right\rVert_{L^{p}(\pi)},&\text{if }p\in(2,+\infty),\\ \left\lVert\frac{d\nu}{d\pi}\right\rVert_{L^{\infty}(\pi)},&\text{if }p=+\infty.\end{cases}$ (30)

In contrast, an application of Theorem 3 yields $\left\lVert(K-1_{\pi})^{t_{0}}\right\rVert_{L_{p}\to L_{p}}\leq(1-2\lambda)^{t_{0}}$ for every $p\in(1,+\infty]$. Hence, leveraging Equation 18 leads to

$\mathbb{P}\left(\frac{1}{n}\sum_{i=t_{0}+1}^{t_{0}+t}f(X_{i})-\pi^{\otimes t}(f)\geq\eta\right)\leq\exp\left(-\frac{2\lambda t\eta^{2}}{1-\lambda}\right)+|1-2\lambda|^{t_{0}}\exp\left(-\frac{2\lambda t\eta^{2}}{q(1-\lambda)}\right)\left\lVert\frac{d\nu}{d\pi}-1\right\rVert_{L^{p}(\pi)}.$ (31)

Equation 31 improves over Equation 29 thanks to the absence of the $\frac{1}{q}$ factor in the exponent of the first addend of Equation 31 (which provides a faster exponential decay, since $q>1$) and thanks to the improved bound on the contraction coefficient provided by Equation 11, in contrast with Equation 30. Indeed, for $t_{0}>1$, $(1-2\lambda)^{t_{0}}<2^{2/p}(1-2\lambda)^{\frac{2t_{0}}{q}}$ for $1<p<2$ and $(1-2\lambda)^{t_{0}}<2^{2/q}(1-2\lambda)^{\frac{2t_{0}}{p}}$ for $p>2$. The difference in behaviour is the most striking for $p\to\infty$, which maximises the rate of decay in both Equation 29 and Equation 31. Most importantly, considering such limit in Equation 29 leads to a bound that does not depend on $t_{0}$ anymore. This is not the case for Equation 31, where not only do we retrieve the best exponential decay, but we also retain the dependence on $t_{0}$. Hence, for a given accuracy $\eta$ and sample budget $t$, if we start from a distribution $\nu$ and we require a certain confidence $\delta$, we have the following lower bound on the burn-in period:

$t_{0}\geq\frac{\log\left(\frac{\delta-K}{KM}\right)}{\log(|1-2\lambda|)},$ (32)

where $K=\exp\left(-\frac{2\lambda t\eta^{2}}{1-\lambda}\right)$ and $M=\left\lVert d\nu/d\pi-1\right\rVert_{L^{\infty}(\pi)}$. Figure 4 compares the bounds when the kernel is a $10\times 10$ randomly generated stochastic matrix, and it clearly shows the improvement of our approach (in blue) over the interpolation method used by [11] (in red).
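A sketch evaluating the burn-in lower bound in Equation 32 for the binary symmetric channel (all parameter values below are illustrative; the bound is informative only when $K<\delta<K(1+M)$, so that the argument of the logarithm lies in $(0,1)$):

import numpy as np

lam, eta, t, delta = 0.2, 0.1, 500, 0.09    # channel, accuracy, samples, confidence
pi = np.array([0.5, 0.5])
nu = np.array([0.9, 0.1])                   # starting distribution

Kconst = np.exp(-2 * lam * t * eta**2 / (1 - lam))   # first addend of Eq. (31)
M = np.max(np.abs(nu / pi - 1))                      # ||dnu/dpi - 1||_{L_inf(pi)}
assert Kconst < delta < Kconst * (1 + M)

t0 = np.log((delta - Kconst) / (Kconst * M)) / np.log(abs(1 - 2 * lam))
print(f"burn-in lower bound: t0 >= {t0:.1f}")        # Eq. (32): ~4.1 here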

Figure 4: Behaviour of Equation 31 and Equation 29 as a function of $t$. We set $\eta=\frac{1}{2}$, $p=100$ and $t_{0}=100$. The starting distribution $\nu$ is randomly drawn from the $9$-dimensional simplex and the $10\times 10$ stochastic matrix is also obtained randomly. The spectral gap is computed exactly via numerical methods.

IV-C Concentration of measure without independence

Another application comes from leveraging Corollary 4 along with Equation 11 to prove concentration of measure bounds for non-independent random variables. In particular, the following result holds (see Section E-A for the proof).

Corollary 5.

For $i\geq 1$, let $K_{i}$ be a discrete-valued Markov kernel. Let $X_{1},\ldots,X_{t}$ be a sequence of random variables with joint measure $\mathcal{P}_{X^{t}}$ and marginals $\mathcal{P}_{X_{i}}$ with $1\leq i\leq t$. Assume that $\mathcal{P}_{X_{i}}=\mathcal{P}_{X_{i-1}}K_{i}$, i.e., the random variables are Markovian. Then, for every $p>1$ and every $f$ s.t. $|f(x)-f(\hat{x})|\leq\frac{1}{t}$ with $x$ and $\hat{x}$ two $t$-dimensional vectors differing only in the $i$-th position,

$\mathbb{P}\left(\left\lvert f(X_{1},\ldots,X_{t})-\mathcal{P}_{\bigotimes_{i=1}^{t}X_{i}}(f)\right\rvert\geq\eta\right)\leq 2^{\frac{1}{q}}\exp\left(-\frac{2t\eta^{2}}{q}\right)\prod_{i=2}^{t}\left(\left\lVert K_{i}^{\star}\right\rVert_{L_{p}^{0}\to L_{p}^{0}}\,\omega_{p}^{i}+1\right),$ (33)

where $\omega_{p}^{i}=\max_{x}\left((1-\mathcal{P}_{X_{i-1}}(\{x\}))^{p}\,\mathcal{P}_{X_{i-1}}(\{x\})^{1-p}+(1-\mathcal{P}_{X_{i-1}}(\{x\}))\right)^{\frac{1}{p}}$, and $\mathcal{P}_{\bigotimes_{i=1}^{t}X_{i}}(f)$ denotes the expectation of $f$ under the product of the marginals.
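A sketch computing the quantities entering Equation 33 for a time-homogeneous chain started at stationarity (so every marginal equals $\pi$), using the general binary channel discussed next with the illustrative choice $\lambda=\kappa=0.45$; the contraction coefficient of $K^{\star}$ is bounded directly via Equation 11.

import numpy as np

lam, kap = 0.45, 0.45
K = np.array([[1 - lam, lam], [kap, 1 - kap]])   # general binary channel
pi = np.array([kap, lam]) / (kap + lam)          # stationary distribution
t, p = 50, 100.0
q = p / (p - 1)

# Eq. (11) bound on ||K*||_{L_p^0 -> L_p^0} with mu = pi: g_y(x) = K(y|x)/pi(y).
inner = np.array([(pi @ np.abs(K[:, y] / pi[y] - 1)**q)**(1 / q) for y in range(2)])
kstar_bound = (pi @ inner**p)**(1 / p)

# omega_p from Corollary 5; with every marginal equal to pi, omega_p^i = omega_p.
omega_p = np.max(((1 - pi)**p * pi**(1 - p) + (1 - pi))**(1 / p))

eta = 0.5
rhs = 2**(1 / q) * np.exp(-2 * t * eta**2 / q) * (kstar_bound * omega_p + 1)**(t - 1)
print(f"Eq. (33) bound: {rhs:.2e}")              # decays exponentially in t here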

To compare with existing results, we start from a binary setting: $\mathcal{P}_{X_{i}|X_{i-1}}(0|0)=1-\lambda$ and $\mathcal{P}_{X_{i}|X_{i-1}}(1|1)=1-\kappa$ for every $i$, with $\kappa,\lambda<1$. This corresponds to the kernel $K_{\lambda,\kappa}$ induced by the matrix described in Example 2 in Appendix B. Let $\pi=\frac{1}{\kappa+\lambda}(\kappa,\lambda)$ denote the stationary distribution. Theorem 2 by [12] employs the hypercontractivity of the Markov operator $K_{\lambda,\kappa}$ to give a concentration result. Note that in this setting, the hypercontraction coefficient is not known, while the contraction coefficient can still be bounded via Equation 11 as

$\left\lVert K_{\lambda,\kappa}^{\star}\right\rVert_{L_{p}^{0}\to L_{p}^{0}}\leq 2|1-\lambda-\kappa|,$ (34)

which combined with Corollary 5 gives

$\mathbb{P}\left(\left\lvert f(X_{1},\ldots,X_{t})-\mathcal{P}_{\bigotimes_{i=1}^{t}X_{i}}(f)\right\rvert\geq\eta\right)\leq 2^{\frac{1}{q}}\exp\left(-\frac{2t\eta^{2}}{q}+(t-1)\log\left(2|1-\lambda-\kappa|\left(\left(\frac{\kappa}{\lambda}\right)^{p}\frac{\lambda}{\lambda+\kappa}+\frac{\kappa}{\lambda+\kappa}\right)^{\frac{1}{p}}+1\right)\right).$ (35)

Assuming without loss of generality that $\kappa>\lambda$ and taking $p\to\infty$ leads to

$\mathbb{P}\left(\left\lvert f(X_{1},\ldots,X_{t})-\mathcal{P}_{\bigotimes_{i=1}^{t}X_{i}}(f)\right\rvert\geq\eta\right)\leq 2\exp\left(-2t\eta^{2}+(t-1)\log\left(\frac{2|1-\lambda-\kappa|\kappa+\lambda}{\lambda}\right)\right).$ (36)

One can then compute a similar bound leveraging Theorem 2 in [12] and then take the limit $p\to\infty$, which gives

$\mathbb{P}\left(\left\lvert f(X_{1},\ldots,X_{t})-\mathcal{P}_{\bigotimes_{i=1}^{t}X_{i}}(f)\right\rvert\geq\eta\right)\leq 2\exp\left(-2t\eta^{2}+(t-1)\log\left(\frac{\lambda+\kappa}{\lambda}\right)\right).$ (37)

Thus, the approach by [12] guarantees an exponential decay if

$\eta>\sqrt{\frac{1}{2}\log\left(\frac{\kappa+\lambda}{\lambda}\right)(1+o_{t}(1))}.$ (38)

In contrast, the approach provided here leads to exponential concentration if (see Equation 36)

$\eta>\sqrt{\frac{1}{2}\log\left(\frac{2|1-\lambda-\kappa|\kappa+\lambda}{\lambda}\right)(1+o_{t}(1))}.$ (39)

Hence, Equation 36 improves over Equation 37 if $2|1-\lambda-\kappa|<1$, i.e., if the bound on the contraction coefficient in Equation 34 is less than $1$. Notably, if $\lambda=\kappa$, then Equation 11 leads to the same bound as [12, Equation (110)], with the main difference that [12] explicitly compute the required norms, while in this work we exploit our novel bound on the contraction coefficient and still achieve the same result.

Similarly, for arbitrary $m\times m$ doubly-stochastic matrices, Corollary 5 with $p\to\infty$ improves over [12] whenever the bound on the contraction coefficient is less than $1$, see Section E-B for details. Furthermore, for $m\times m$ stochastic matrices, we now compare Corollary 5 with the approach by [12] via numerical experiments. In particular, selecting $\mathcal{P}_{X_{0}}$ as the stationary distribution, [12] prove exponential concentration if

$\eta>\eta_{1}=\sqrt{(1+o_{t}(1))\ln(m)/2}.$

In contrast, Corollary 5 leads to exponential concentration if

$\eta>\eta_{2}=\sqrt{q(1+o_{t}(1))\ln\left(\left\lVert K^{\star}\right\rVert_{L_{p}^{0}\to L_{p}^{0}}\,\omega_{p}+1\right)/2},$

with $\omega_{p}=\omega_{p}^{1}$ as defined in Corollary 5. In Figure 5a and Figure 5b, we compare the two bounds for a $5\times 5$ randomly generated stochastic matrix when: (a) $\eta_{2}<\eta<\eta_{1}$, and (b) $\eta>\eta_{1}>\eta_{2}$. In the first case, [12] cannot provide exponential concentration, and the improvement brought by Corollary 5 is obvious. In the second case, even if both approaches lead to exponential concentration, Corollary 5 is still better for intermediate values of $t$.

Figure 5: Behaviour of Corollary 5 and [12, Thm 1], as a function of $t$, for $p=100$ and two different choices of $\eta$: (a) $\eta=\eta_{2}+0.1$; (b) $\eta=\eta_{1}+0.1$.

Notice that, up to this point, our comparisons have focused solely on [12], as it already compares and improves upon other leading approaches in the literature, specifically [24, 11, 25]. However, for the sake of completeness, we will also present a direct comparison with these approaches in the context of the general binary channel. For ease of exposition, let $\vartheta=|1-\lambda-\kappa|$ and $\varpi=\log(2\vartheta\kappa/\lambda)$. Then, one retrieves the following bounds from the literature:

$\mathbb{P}\left(\left\lvert f(X_{1},\ldots,X_{t})-\mathcal{P}_{\bigotimes_{i=1}^{t}X_{i}}(f)\right\rvert\geq\eta\right)\leq\begin{cases}\exp\left(-\frac{(\kappa+\lambda)^{2}t\eta^{2}}{(1-\vartheta^{t})^{2}}\right),&\text{[24, Thm 1.2]},\\ \exp\left(-\frac{t\eta^{2}(\kappa+\lambda)}{2-\kappa-\lambda}\right),&\text{[11, Thm 1]},\\ \exp\left(-2t\eta^{2}(\lambda+\kappa)^{2}+2t(\kappa+\lambda)\sqrt{t\log(2)/2}\right),&\text{[25]}.\end{cases}$ (40)

Consequently, one can see that the bounds on the right-hand side of Equation 40 are worse than the right-hand side of Equation 36 when

$\eta^{2}>(1+o_{t}(1))\max\begin{cases}\varpi(1-\vartheta^{t})^{2}/\left(2(1-\vartheta^{t})^{2}-(1-\vartheta)^{2}\right),\\ \varpi(1+\vartheta)/(2\vartheta),\\ \sqrt{\varpi}/(1-\vartheta^{2}).\end{cases}$ (41)

For $\lambda=1/3$, $\kappa=1/4$ and $\eta=0.65$, Figure 6 shows that our approach outperforms all previous methods for all $t$.

Figure 6: Behaviour of Equation 36 in comparison with the corresponding bounds induced by [12, 24, 11, 25] as a function of $t$. The kernel $K$ is induced by a general binary channel with $\lambda=1/3$ and $\kappa=1/4$; $\eta$ is set to $0.65$.

V Conclusions

The notion of convergence of a Markov process induced by a kernel $K$ in Orlicz spaces is introduced. We show that one can control the speed of said convergence by studying the contraction coefficient of the dual kernel $K^{\star}$. Specifically, by considering specific norms of densities of $K$ and $K^{\star}$, Theorem 3 provides a novel and closed-form bound on the contraction coefficient in any Orlicz space. As discussed in Section IV, this yields a vast improvement over the state of the art in error bounds on MCMC methods, as well as in the concentration of Markovian processes. Given the generality of the framework, as explained in Remark 4, our analysis is able to tackle under-explored settings, such as those involving heavy-tailed distributions.

We envision applications of the results brought forth to other settings of interest. For instance, multi-armed bandit problems with Markovian rewards represent a severely under-studied setting in the literature [26]. The reason is that bandit algorithms heavily rely on concentration results. Hence, the improvements provided for the concentration of Markovian random variables can impact the development of bandit algorithms. Another setting comes from information theory: data-processing inequalities (DPI) are fundamental results in the field, and they can be significantly improved by computing the strong DPI (SDPI) constant for divergences [13]. Some connections between contraction coefficients of Markov kernels and strong DPI constants already exist [27, 16, 13]. Hence, leveraging Theorem 3 for $L_{\psi}$ spaces, one can also provide improved bounds on SDPI constants for $\psi$-divergences.

Acknowledgments

The authors are partially supported by the 2019 Lopez-Loreta Prize. They would also like to thank the anonymous reviewers for their careful reading of the manuscript and for providing valuable suggestions and comments, identifying a typo and suggesting meaningful references.

References

  • [1] G. O. Roberts and J. S. Rosenthal, “General state space Markov chains and MCMC algorithms,” Probability Surveys, 2004.
  • [2] D. Rudolf, “Explicit error bounds for Markov chain Monte Carlo,” Dissertationes Mathematicae, vol. 485, pp. 1–93, 2012.
  • [3] C. J. Geyer, “Practical Markov Chain Monte Carlo,” Statistical Science, vol. 7, no. 4, pp. 473 – 483, 1992.
  • [4] W. R. Gilks, S. Richardson, and D. Spiegelhalter, Markov Chain Monte Carlo in Practice, ser. Chapman & Hall/CRC Interdisciplinary Statistics.   Taylor & Francis, 1995.
  • [5] K. Łatuszyński, B. Miasojedow, and W. Niemiro, “Nonasymptotic bounds on the estimation error of MCMC algorithms,” Bernoulli, vol. 19, no. 5A, pp. 2033–2066, 2013.
  • [6] D. W. Gillman, “Hidden Markov chains: convergence rates and the complexity of inference,” Ph.D. dissertation, Massachusetts Institute of Technology, Boston, US, 1993.
  • [7] I. H. Dinwoodie, “A probability inequality for the occupation measure of a reversible Markov chain,” The Annals of Applied Probability, vol. 5, no. 1, pp. 37–43, 1995.
  • [8] C. A. León and F. Perron, “Optimal Hoeffding bounds for discrete reversible Markov chains,” The Annals of Applied Probability, vol. 14, no. 2, pp. 958–970, 2004.
  • [9] K.-M. Chung, H. Lam, Z. Liu, and M. Mitzenmacher, “Chernoff-Hoeffding bounds for Markov chains: Generalized and simplified,” in Symposium on Theoretical Aspects of Computer Science, 2012.
  • [10] B. Miasojedow, “Hoeffding’s inequalities for geometrically ergodic Markov chains on general state space,” Statistics & Probability Letters, vol. 87, pp. 115–120, 2014.
  • [11] J. Fan, B. Jiang, and Q. Sun, “Hoeffding's inequality for general Markov chains and its applications to statistical learning,” Journal of Machine Learning Research, vol. 22, no. 139, pp. 1–35, 2021.
  • [12] A. R. Esposito and M. Mondelli, “Concentration without independence via information measures,” IEEE Transactions on Information Theory, vol. 70, no. 6, pp. 3823–3839, 2024.
  • [13] M. Raginsky, “Strong data processing inequalities and $\phi$-Sobolev inequalities for discrete channels,” IEEE Transactions on Information Theory, vol. 62, no. 6, pp. 3355–3389, 2016.
  • [14] S. Jarner and G. Roberts, “Polynomial convergence rates of Markov chains,” Annals of Applied Probability, vol. 12, 2002.
  • [15] ——, “Convergence of heavy-tailed Monte Carlo Markov chain algorithms,” Scandinavian Journal of Statistics, vol. 34, pp. 781–815, 2007.
  • [16] P. Del Moral, M. Ledoux, and L. Miclo, “On contraction properties of Markov kernels,” Probability Theory and Related Fields, vol. 126, pp. 395–420, 2003.
  • [17] O. Cappé, E. Moulines, and T. Ryden, Inference in Hidden Markov Models (Springer Series in Statistics).   Berlin, Heidelberg: Springer-Verlag, 2005.
  • [18] M. M. Rao and Z. D. Ren, Theory of Orlicz Spaces.   New York: M. Dekker, 1991.
  • [19] H. Hudzik and L. Maligranda, “Amemiya norm equals Orlicz norm in general,” Indagationes Mathematicae, vol. 11, no. 4, pp. 573–585, 2000.
  • [20] C. Roberto and B. Zegarlinski, “Hypercontractivity for Markov semi-groups,” Journal of Functional Analysis, vol. 282, no. 12, p. 109439, 2022.
  • [21] M. Raginsky and I. Sason, “Concentration of measure inequalities in information theory, communications, and coding,” Foundations and Trends in Communications and Information Theory, vol. 10, no. 1-2, pp. 1–246, 2013.
  • [22] P. Diaconis and L. Saloff-Coste, “Logarithmic Sobolev inequalities for finite Markov chains,” The Annals of Applied Probability, vol. 6, no. 3, pp. 695 – 750, 1996.
  • [23] G. Roberts and J. Rosenthal, “Geometric ergodicity and hybrid Markov chains,” Electronic Communications in Probability, vol. 2, pp. 13–25, 1997.
  • [24] L. A. Kontorovich and K. Ramanan, “Concentration inequalities for dependent random variables via the martingale method,” The Annals of Probability, vol. 36, no. 6, pp. 2126 – 2158, 2008.
  • [25] K. Marton, “A measure concentration inequality for contracting Markov chains,” Geometric and functional analysis, vol. 6, no. 3, pp. 556–571, 1996.
  • [26] S. Bubeck and N. Cesa-Bianchi, “Regret analysis of stochastic and nonstochastic multi-armed bandit problems,” Foundations and Trends® in Machine Learning, vol. 5, 04 2012.
  • [27] R. L. Dobrushin, “Central limit theorem for nonstationary Markov chains. i,” Theory of Probability & Its Applications, vol. 1, no. 1, pp. 65–80, 1956.
  • [28] R. Vershynin, High-Dimensional Probability: An Introduction with Applications in Data Science, ser. Cambridge Series in Statistical and Probabilistic Mathematics.   Cambridge University Press, 2018.
  • [29] A. R. Esposito, M. Gastpar, and I. Issa, “Generalization error bounds via Rényi-, f-divergences and maximal leakage,” IEEE Transactions on Information Theory, vol. 67, no. 8, pp. 4986–5004, 2021.
  • [30] I. Csiszár, “Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten,” Magyar Tud. Akad. Mat. Kutató Int. Közl., vol. 8, pp. 85–108, 1963.

Appendix A Additional preliminaries and results

A-A Orlicz spaces

Orlicz spaces are quite general, and they include the well-known $L_p$-spaces. Indeed, if $\psi(x)=|x|^{p}$, then $L_{\psi}=L_{p}$ and $\left\lVert\cdot\right\rVert_{L^{L}_{\psi}(\mu)}$ is the $L^p$-norm, see Equation 5. We note that different choices of $\psi$ categorise different tail behaviours of random variables.

Example 1.

Let $\psi(x)=e^{x^{2}}-1$ and, given a measure space $(\Omega,\mathcal{F},\mu)$, define on $(\Omega,\mathcal{F})$ the Orlicz space $L_{\psi}(\mu)$. Given a random variable $X$ defined on this space, $\left\lVert X\right\rVert_{L^{L}_{\psi}(\mu)}$ is the sub-Gaussian norm, i.e.,

$\left\lVert X\right\rVert_{L^{L}_{\psi}(\mu)}=\inf\left\{\sigma>0:\mu\left(\exp\left(\frac{X^{2}}{\sigma^{2}}\right)\right)\leq 2\right\}.$ (42)

Thus, $L_{\psi}(\mu)$ represents the set of all sub-Gaussian random variables with respect to $\mu$ [28, Definition 2.5.6].

Given a Young function and its complementary function, the following generalised Hölder's inequality holds, see [29, Appendix A] for a proof.

Lemma 2.

Let $\psi$ be a Young function and $\psi^{\star}$ its complementary function. For every pair of non-negative random variables $U,V$ respectively defined over $(\Omega_{U},\mathcal{F}_{U},\mathcal{P}_{U})$ and $(\Omega_{V},\mathcal{F}_{V},\mathcal{P}_{V})$, given a coupling $\mathcal{P}_{UV}$, one has that

$\mathcal{P}_{UV}(UV)\leq\lVert U\rVert_{L^{L}_{\psi}(\mathcal{P}_{U})}\lVert V\rVert_{L^{A}_{\psi^{\star}}(\mathcal{P}_{V})}.$ (43)

Choosing $\psi(x)=|x|^{p}$, which implies $\psi^{\star}(x^{\star})=|x^{\star}|^{q}\frac{(p-1)}{p^{q}}$ with $\frac{1}{p}+\frac{1}{q}=1$, Lemma 2 yields the classical Hölder's inequality:

\mathcal{P}_{UV}(UV)\leq\left\lVert U\right\rVert_{L^{p}(\mathcal{P}_{U})}\cdot\left\lVert V\right\rVert_{L^{q}(\mathcal{P}_{V})}. (44)

A-B Proof of Lemma 1

Proof.

Given that \nu\ll\mu, one has that \nu K\ll\mu K. Indeed, let E be an event such that \mu K(E)=0. Then,

\mu K(E)=\int\mathbbm{1}_{E}d\mu K=\int K(\mathbbm{1}_{E})d\mu=0. (45)

Hence, K(\mathbbm{1}_{E}) is 0 \mu-a.e. and, by the absolute continuity of \nu with respect to \mu, K(\mathbbm{1}_{E}) is also 0 \nu-a.e., so \nu K(E)=0. Thus, the function g is well-defined. Moreover, given any function h,

\langle h,g\rangle_{\mu K}=\mu K(h\cdot g)=\int h\frac{d\nu K}{d\mu K}d\mu K (46)
=\int hd\nu K (47)
=\int Khd\nu (48)
=\int(Kh)fd\mu=\langle Kh,f\rangle_{\mu}=\langle h,K^{\star}_{\mu}f\rangle_{\mu K}, (49)

which implies that g=K^{\star}_{\mu}f, \mu K-almost everywhere. ∎

Moreover, one also has that, for every reversible Markov kernel K with stationary distribution \pi, every f\in L_{\psi}^{0}(\pi) with \psi a Young function, and every t\in\mathbb{N}, the following holds true:

\left\lVert K^{t}f\right\rVert_{L_{\psi}^{N}(\pi)}\leq\left\lVert K\right\rVert_{L_{\psi}^{N,0}\to L_{\psi}^{N,0}}\cdot\left\lVert K^{t-1}f\right\rVert_{L_{\psi}^{N}(\pi)}. (50)

Consequently, one has that:

\left\lVert K^{t}\right\rVert_{L_{\psi}^{N,0}\to L_{\psi}^{N,0}}\leq\left\lVert K\right\rVert_{L_{\psi}^{N,0}\to L_{\psi}^{N,0}}^{t}. (51)
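Equations 50 and 51 can be checked numerically in the L_{2} case: for a doubly-stochastic kernel, \pi is uniform, so the L_{2}^{0}\to L_{2}^{0} norm reduces to an ordinary largest singular value on the subspace of mean-zero functions. A minimal Python sketch under these assumptions (numpy assumed available; for concreteness we reuse the matrix that appears later in Equation 68):

import numpy as np

K = np.array([[0.2, 0.1, 0.7],
              [0.3, 0.4, 0.3],
              [0.5, 0.5, 0.0]])   # doubly stochastic, so pi is uniform
m = K.shape[0]
P = np.eye(m) - np.ones((m, m)) / m   # projector onto mean-zero functions

def contraction(M):
    """L_2^0 -> L_2^0 operator norm of M when pi is uniform."""
    return np.linalg.svd(P @ M @ P, compute_uv=False)[0]

gamma = contraction(K)
for t in range(1, 6):
    lhs = contraction(np.linalg.matrix_power(K, t))
    assert lhs <= gamma ** t + 1e-12   # Equation 51 with psi(x) = |x|^2
    print(t, lhs, gamma ** t)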

A-C Relationship between contraction coefficients

Theorem 5.

Let \psi be a Young function, N\in\{A,L\}, and let t\in\mathbb{N}. Then, the following holds:

\left\lVert K^{t}\right\rVert_{L_{\psi}^{N,0}\to L_{\psi}^{N,0}}\leq\left\lVert(K-1_{\pi})^{t}\right\rVert_{L_{\psi}^{N}\to L_{\psi}^{N}}\leq 2\left\lVert K^{t}\right\rVert_{L_{\psi}^{N,0}\to L_{\psi}^{N,0}}. (52)
Remark 5.

Taking \psi(x)=|x|^{p}, which gives L_{p}-norms, we recover Lemma 3.16 by [2].

Proof.

First, note that (K-1_{\pi})^{t}=K^{t}-1_{\pi}, since K1_{\pi}=1_{\pi}K=1_{\pi}^{2}=1_{\pi}. The first inequality is then obtained via the steps below:

\left\lVert K^{t}\right\rVert_{L_{\psi}^{N,0}\to L_{\psi}^{N,0}}=\sup_{g:\left\lVert g\right\rVert_{L_{\psi}^{N}(\pi)}\leq 1,\pi(g)=0}\left\lVert K^{t}g\right\rVert_{L_{\psi}^{N}(\pi)} (53)
=\sup_{g:\left\lVert g\right\rVert_{L_{\psi}^{N}(\pi)}\leq 1,\pi(g)=0}\left\lVert(K^{t}-1_{\pi})g\right\rVert_{L_{\psi}^{N}(\pi)} (54)
\leq\sup_{f:\left\lVert f\right\rVert_{L_{\psi}^{N}(\pi)}\leq 1}\left\lVert(K^{t}-1_{\pi})f\right\rVert_{L_{\psi}^{N}(\pi)} (55)
=\left\lVert K^{t}-1_{\pi}\right\rVert_{L_{\psi}^{N}\to L_{\psi}^{N}}. (56)

As for the second inequality, the following sequence of steps holds:

\left\lVert K^{t}-1_{\pi}\right\rVert_{L_{\psi}^{N}\to L_{\psi}^{N}}=\sup_{f:\left\lVert f\right\rVert_{L_{\psi}^{N}(\pi)}\leq 1}\left\lVert(K^{t}-1_{\pi})f\right\rVert_{L_{\psi}^{N}(\pi)} (57)
=\sup_{f:\left\lVert f\right\rVert_{L_{\psi}^{N}(\pi)}\leq 1}\left\lVert K^{t}(f-\pi(f))\right\rVert_{L_{\psi}^{N}(\pi)} (58)
=2\sup_{f:\left\lVert f\right\rVert_{L_{\psi}^{N}(\pi)}\leq 1}\left\lVert K^{t}\left(\frac{f-\pi(f)}{2}\right)\right\rVert_{L_{\psi}^{N}(\pi)} (59)
\leq 2\sup_{g:\left\lVert g\right\rVert_{L_{\psi}^{N}(\pi)}\leq 1,\pi(g)=0}\left\lVert K^{t}g\right\rVert_{L_{\psi}^{N}(\pi)} (60)
=2\left\lVert K^{t}\right\rVert_{L_{\psi}^{N,0}\to L_{\psi}^{N,0}}. (61)

Indeed, one has that

\left\lVert\frac{f-\pi(f)}{2}\right\rVert_{L^{N}_{\psi}(\pi)}\leq\frac{1}{2}\left(\left\lVert f\right\rVert_{L^{N}_{\psi}(\pi)}+\left\lVert\pi(f)\right\rVert_{L^{N}_{\psi}(\pi)}\right)\leq 1, (62)

where the last step follows from Jensen's inequality and the assumption that \left\lVert f\right\rVert_{L^{N}_{\psi}(\pi)}\leq 1. ∎

To conclude our characterisation of the contraction coefficients, we prove that the operators (K-1_{\pi})^{\star} and K^{\star}-1_{\pi} coincide.

Lemma 3.

Let K be a Markov kernel with stationary distribution \pi, and K^{\star}=K^{\star}_{\pi} the dual kernel. Then, one has that

(K-1_{\pi})^{\star}=K^{\star}-1_{\pi},\quad\pi\text{-a.e.} (63)
Proof.

We first show that 1_{\pi}^{\star}=1_{\pi}, i.e., that for every pair of functions h and f, \langle 1_{\pi}h,f\rangle_{\pi}=\langle h,1_{\pi}f\rangle_{\pi}. Indeed, we have

\langle 1_{\pi}h,f\rangle_{\pi}=\int d\pi f\pi(h)=\pi(f)\pi(h)=\int d\pi h\pi(f)=\langle h,1_{\pi}f\rangle_{\pi}. (64)

From this, one can also conclude that, if K has stationary distribution \pi, then (K-1_{\pi})^{\star}=K^{\star}-1_{\pi}. Indeed, we have

\langle(K-1_{\pi})h,f\rangle_{\pi}=\langle Kh,f\rangle_{\pi}-\langle 1_{\pi}h,f\rangle_{\pi}=\langle h,K^{\star}f\rangle_{\pi}-\langle h,1_{\pi}f\rangle_{\pi}=\langle h,(K^{\star}-1_{\pi})f\rangle_{\pi}. (65) ∎

Appendix B Examples: discrete-valued kernels

Example 2.

Consider a binary setting where the Markov kernel is induced by a 2\times 2 matrix:

K_{\lambda,\kappa}=\begin{bmatrix}1-\lambda&\lambda\\ \kappa&1-\kappa\end{bmatrix}. (66)

In this case, the stationary distribution is \pi=\frac{1}{\kappa+\lambda}(\kappa,\lambda) and, since every two-state chain is reversible with respect to its stationary distribution, the dual kernel is K^{\star}_{\lambda,\kappa}=K_{\lambda,\kappa}; in particular, when \kappa=\lambda, this coincides with the transpose K^{T}_{\lambda,\kappa}. If \kappa=\lambda, we use the shorthands K_{\lambda} and K^{\star}_{\lambda} to denote K_{\lambda,\lambda} and K^{\star}_{\lambda,\lambda}, respectively. We note that K_{\lambda} is known in information theory as the binary symmetric channel with parameter \lambda. Now, both K_{\lambda} and K^{\star}_{\lambda} are doubly-stochastic and the stationary distribution is \pi=\left(\frac{1}{2},\frac{1}{2}\right).

Let us focus on the setting where \psi(x)=|x|^{2}, in which we can compute everything explicitly and thus compare the upper bound to the real quantity. In this case, \left\lVert K_{\lambda,\kappa}\right\rVert_{L_{2}^{0}\to L_{2}^{0}}=\gamma is 1 minus the spectral gap, i.e., the second largest singular value of K^{\star}_{\lambda,\kappa} in L_{2}(\pi). Thus, \gamma=|1-\lambda-\kappa|. Let us now compute the bound on \left\lVert K_{\lambda,\kappa}\right\rVert_{L_{2}^{0}\to L_{2}^{0}} that stems from Equation 11. In particular, one has that

\begin{split}g_{0}(0)&=\frac{(1-\lambda)(\lambda+\kappa)}{\kappa},\\ g_{0}(1)&=g_{1}(0)=\kappa+\lambda,\\ g_{1}(1)&=\frac{(1-\kappa)(\lambda+\kappa)}{\lambda}.\end{split}

Consequently,

|1-\lambda-\kappa|=\left\lVert K_{\lambda,\kappa}\right\rVert_{L_{2}^{0}\to L_{2}^{0}}\leq\left|\sum_{x}\left|\sum_{y}\left|g_{x}(y)-1\right|^{2}\pi(\{y\})\right|\pi(\{x\})\right|^{\frac{1}{2}}=|1-\lambda-\kappa|. (67)
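The equality in Equation 67 can be verified numerically for arbitrary (\lambda,\kappa). The sketch below is illustrative (Python with numpy assumed; the parameters are drawn at random) and computes both sides:

import numpy as np
rng = np.random.default_rng(0)

for _ in range(5):
    lam, kap = rng.uniform(0.05, 0.95, size=2)
    pi = np.array([kap, lam]) / (kap + lam)
    K = np.array([[1 - lam, lam], [kap, 1 - kap]])
    g = K / pi[None, :]                              # g_x(y) = K(y|x)/pi(y)
    inner = ((g - 1) ** 2 * pi[None, :]).sum(axis=1) # ||g_x - 1||^2_{L_2(pi)}
    bound = np.sqrt((inner * pi).sum())              # right-hand side of (67)
    print(abs(1 - lam - kap), bound)                 # the two columns coincide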

In a 3\times 3 example, the bound is still very close to the actual value. Specifically, we consider the following matrix:

\Lambda=\begin{bmatrix}0.2&0.1&0.7\\ 0.3&0.4&0.3\\ 0.5&0.5&0\end{bmatrix}. (68)

Given that \Lambda is doubly-stochastic, the stationary distribution \pi is uniform, i.e., \pi(\{x\})=1/3 for x\in\{0,1,2\}. Similarly to the 2\times 2 setting, \Lambda^{\star}=\Lambda^{T}. We can thus numerically compute the second largest singular value of \Lambda^{T}, which gives \gamma=\left\lVert\Lambda\right\rVert_{L_{2}^{0}\to L_{2}^{0}}. Instead, Equation 11 leads to the bound below, which is very close to the actual value of \gamma:

0.61098529=\gamma\leq\left\lVert\left\lVert g_{X}-1\right\rVert_{L_{2}(\pi)}\right\rVert_{L_{2}(\pi)}=0.6164414. (69)
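Both numerical values in Equation 69 can be reproduced with a few lines of Python (an illustration of ours, assuming numpy):

import numpy as np

Lam = np.array([[0.2, 0.1, 0.7],
                [0.3, 0.4, 0.3],
                [0.5, 0.5, 0.0]])
m = Lam.shape[0]
P = np.eye(m) - np.ones((m, m)) / m           # restrict to mean-zero functions
gamma = np.linalg.svd(P @ Lam @ P, compute_uv=False)[0]  # second s.v. of Lam
bound = np.sqrt(((Lam - 1 / m) ** 2).sum())   # Equation 11 with p = q = 2
print(gamma, bound)                           # 0.61098529... <= 0.6164414...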

More generally, one can prove the following.

Lemma 4.

Consider a discrete-valued Markovian process described by an m\times m doubly-stochastic matrix \Lambda\in[0,1]^{m\times m}. Then, the contraction coefficient of \Lambda in the space L_{p}^{0} can be bounded as follows:

\left\lVert\Lambda^{T}\right\rVert_{L_{p}^{0}\to L_{p}^{0}}\leq\left(\sum_{j}\left(\sum_{i}\left|\lambda_{j,i}-\frac{1}{m}\right|^{q}\right)^{\frac{p}{q}}\right)^{\frac{1}{p}}. (70)
Proof.

The stationary distribution \pi is the uniform distribution: \pi(\{i\})=\frac{1}{m}, for 1\leq i\leq m. It is also easy to see that the dual of the kernel K induced by \Lambda with respect to \pi is the transpose matrix \Lambda^{\star}_{\pi}=\Lambda^{T}. By using Equation 11, we can write

\left\lVert\Lambda^{T}\right\rVert_{L_{p}^{0}\to L_{p}^{0}}^{p}\leq\sum_{j}\frac{1}{m}\left(\sum_{i}\frac{1}{m}|m\lambda_{j,i}-1|^{q}\right)^{\frac{p}{q}} (71)
=m^{\frac{p(q-1)}{q}-1}\sum_{j}\left(\sum_{i}\left|\lambda_{j,i}-\frac{1}{m}\right|^{q}\right)^{\frac{p}{q}}=\sum_{j}\left(\sum_{i}\left|\lambda_{j,i}-\frac{1}{m}\right|^{q}\right)^{\frac{p}{q}}, (72)

which concludes the proof. ∎

In the case of general m\times m doubly-stochastic matrices, the bound maintains some of its properties, but not all. For instance, it is still true that, if the matrix enforces independence (i.e., \lambda_{i,j}=\frac{1}{m} for every i,j), then \left\lVert\Lambda^{T}\right\rVert_{L_{p}^{0}\to L_{p}^{0}}\leq 0. However, the bound on the contraction coefficient is not always strictly smaller than one. Indeed, if \Lambda is deterministic (i.e., for every j there exists an i such that \lambda_{i,j}=1 and \lambda_{k,j}=0 for every k\neq i), one can see that the bound becomes:

\left\lVert\Lambda^{T}\right\rVert_{L_{p}^{0}\to L_{p}^{0}}^{p}\leq((m-1)^{q}+(m-1))^{\frac{p}{q}}m^{1-p}, (73)

which is larger than 1 for m>2 and p\geq 1.
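The two extreme cases just discussed can be evaluated directly. The following Python sketch (illustrative; the helper name lemma4_bound is ours) recovers 0 for the independence-enforcing matrix and the value of Equation 73 for a permutation matrix:

import numpy as np

def lemma4_bound(Lam, p):
    """Right-hand side of Equation 70 for a doubly-stochastic matrix."""
    q = p / (p - 1)
    m = Lam.shape[0]
    rows = (np.abs(Lam - 1 / m) ** q).sum(axis=1)
    return float((rows ** (p / q)).sum() ** (1 / p))

m, p = 3, 2.0
q = p / (p - 1)
uniform = np.full((m, m), 1 / m)     # enforces independence
perm = np.eye(m)[[1, 2, 0]]          # deterministic (permutation) kernel
print(lemma4_bound(uniform, p))      # 0.0
print(lemma4_bound(perm, p))         # sqrt(2) > 1: the bound is vacuous here
print((((m - 1) ** q + (m - 1)) ** (p / q) * m ** (1 - p)) ** (1 / p))  # Eq. 73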

Appendix C Proofs for Section III

C-A Proof of Theorem 1

Fix \epsilon>0 and let \lambda=\left\lVert f\right\rVert_{L^{L}_{\psi}(\pi)}+\epsilon. By the definition of the Luxemburg norm, one has that

\pi\left(\psi\left(\frac{|f|}{\lambda}\right)\right)\leq 1. (74)

Moreover, the following steps hold:

\pi\left(\psi\left(\frac{|Kf|}{\lambda}\right)\right)\leq\pi\left(\psi\left(\frac{K|f|}{\lambda}\right)\right) (75)
\leq\pi\left(K\psi\left(\frac{|f|}{\lambda}\right)\right) (76)
=\pi\left(\psi\left(\frac{|f|}{\lambda}\right)\right)\leq 1, (77)

where Equation 75 follows from the monotonicity of \psi together with |Kf|\leq K|f|, Equation 76 from Jensen's inequality, and Equation 77 from the stationarity of \pi. Hence, since \left\lVert Kf\right\rVert_{L_{\psi}^{L}(\pi)}=\inf\left\{\sigma>0:\pi\left(\psi\left(\frac{|Kf|}{\sigma}\right)\right)\leq 1\right\}, one has that \left\lVert Kf\right\rVert_{L_{\psi}^{L}(\pi)}\leq\lambda=\left\lVert f\right\rVert_{L_{\psi}^{L}(\pi)}+\epsilon. The statement follows from the arbitrariness of \epsilon. To conclude the argument for the Luxemburg norm, select the function g identically equal to \psi^{-1}(1)>0. In order to have \pi\left(\psi\left(\frac{g}{\lambda}\right)\right)=\pi\left(\psi\left(\frac{\psi^{-1}(1)}{\lambda}\right)\right)\leq 1, the following needs to hold:

\lambda\geq\frac{\psi^{-1}(1)}{\psi^{-1}(1)}=1, (78)

and thus \left\lVert g\right\rVert_{L^{L}_{\psi}(\pi)}=1. Hence, since Kg=g, one has that \left\lVert K\right\rVert_{L_{\psi}^{L}\to L_{\psi}^{L}}=1.

As for the Amemiya norm, fix \epsilon>0; by the definition of the norm as an infimum over \lambda>0, there exists \lambda such that

\frac{1+\pi(\psi(\lambda|f|))}{\lambda}\leq\left\lVert f\right\rVert_{L^{A}_{\psi}(\pi)}+\epsilon. (79)

Then, one has that, for every Markov kernel K,

\left\lVert Kf\right\rVert_{L^{A}_{\psi}(\pi)}\leq\frac{1+\pi(\psi(\lambda|Kf|))}{\lambda}\leq\frac{1+\pi(\psi(\lambda K|f|))}{\lambda} (80)
\leq\frac{1+\pi(K\psi(\lambda|f|))}{\lambda} (81)
=\frac{1+\pi(\psi(\lambda|f|))}{\lambda} (82)
\leq\left\lVert f\right\rVert_{L^{A}_{\psi}(\pi)}+\epsilon, (83)

where Equation 81 follows from Jensen's inequality and Equation 82 from the stationarity of \pi. The argument then follows from the arbitrariness of \epsilon. This implies that \left\lVert K\right\rVert_{L_{\psi}^{A}\to L_{\psi}^{A}}\leq 1. Moreover, if one selects g to be the function identically equal to \frac{1}{{\psi^{\star}}^{-1}(1)}, then

\left\lVert g\right\rVert_{L^{A}_{\psi}(\pi)}=\inf_{t>0}\frac{1+\pi\left(\psi\left(\frac{t}{{\psi^{\star}}^{-1}(1)}\right)\right)}{t}=\inf_{t>0}\frac{1+\psi\left(\frac{t}{{\psi^{\star}}^{-1}(1)}\right)}{t}=1. (84)

Indeed, denoting \varphi(x)=\psi(x/{\psi^{\star}}^{-1}(1)), one has that \varphi^{\star}(x)=\psi^{\star}(x{\psi^{\star}}^{-1}(1)) and {\varphi^{\star}}^{-1}(y)={\psi^{\star}}^{-1}(y)/{\psi^{\star}}^{-1}(1). Given that

\inf_{t>0}\frac{1+\psi\left(\frac{t}{{\psi^{\star}}^{-1}(1)}\right)}{t}=\inf_{t>0}\frac{1+\varphi(t)}{t}={\varphi^{\star}}^{-1}(1)=\frac{{\psi^{\star}}^{-1}(1)}{{\psi^{\star}}^{-1}(1)}=1, (85)

the argument follows. Hence, similarly to before, since Kg=g, one has that \left\lVert K\right\rVert_{L_{\psi}^{A}\to L_{\psi}^{A}}=1.
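Both parts of the argument can be probed numerically. The Python sketch below is a minimal illustration of ours, assuming a doubly-stochastic kernel (so that \pi is uniform and stationary) and \psi(x)=e^{x^{2}}-1; it checks that the Luxemburg norm never increases under K and that the constant function \psi^{-1}(1)=\sqrt{\log 2} has norm 1:

import numpy as np

K = np.array([[0.2, 0.1, 0.7],
              [0.3, 0.4, 0.3],
              [0.5, 0.5, 0.0]])
pi = np.full(3, 1 / 3)               # uniform, stationary for doubly-stochastic K
psi = lambda x: np.exp(x ** 2) - 1   # sub-Gaussian Young function

def lux_norm(f, tol=1e-9):
    """Luxemburg norm of f in L_psi(pi), computed by bisection on sigma."""
    modular = lambda s: float((pi * psi(np.abs(f) / s)).sum())
    lo, hi = 0.0, 1.0
    while modular(hi) > 1:
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if modular(mid) > 1 else (lo, mid)
    return hi

rng = np.random.default_rng(1)
for _ in range(5):
    f = rng.normal(size=3)
    assert lux_norm(K @ f) <= lux_norm(f) + 1e-7   # contraction (Theorem 1)
print(lux_norm(np.full(3, np.sqrt(np.log(2)))))    # ~1.0, as in Equation 78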

C-B Proof of Theorem 2

Let t\in\mathbb{N} and let \nu\ll\pi. Then,

\left\lVert\frac{d\nu K^{t}}{d\pi}-1\right\rVert_{L_{\psi}^{N}(\pi)}=\left\lVert\frac{d\nu K^{t}}{d\pi K^{t}}-1\right\rVert_{L_{\psi}^{N}(\pi)} (86)
=\left\lVert{K^{t}}^{\star}\frac{d\nu}{d\pi}-1\right\rVert_{L_{\psi}^{N}(\pi)} (87)
=\left\lVert{K^{t}}^{\star}\left(\frac{d\nu}{d\pi}-1\right)\right\rVert_{L_{\psi}^{N}(\pi)} (88)
\leq\left\lVert{K^{t}}^{\star}\right\rVert_{L_{\psi}^{N,0}\to L_{\psi}^{N,0}}\left\lVert\frac{d\nu}{d\pi}-1\right\rVert_{L_{\psi}^{N}(\pi)} (89)
\leq\left\lVert{K}^{\star}\right\rVert_{L_{\psi}^{N,0}\to L_{\psi}^{N,0}}^{t}\left\lVert\frac{d\nu}{d\pi}-1\right\rVert_{L_{\psi}^{N}(\pi)}, (90)

where Equation 89 follows by the definition of \left\lVert{K}^{\star}\right\rVert_{L_{\psi}^{N,0}\to L_{\psi}^{N,0}} (see Definition 4) and the fact that, if f=\frac{d\nu}{d\pi}-1, then \pi(f)=0; Equation 90 follows from Equation 51.

C-C Proof of Theorem 3

As f\in L_{\psi}^{0}(\mu K), we have that \mu K(f)=0. Thus,

\left\lVert Kf\right\rVert_{L_{\psi}^{N}(\mu)}=\left\lVert|Kf|\right\rVert_{L_{\psi}^{N}(\mu)} (91)
=\left\lVert\left|\int f(y)dK(y|X)\right|\right\rVert_{L_{\psi}^{N}(\mu)} (92)
=\left\lVert\left|\int g_{X}(y)f(y)d\mu K\right|\right\rVert_{L_{\psi}^{N}(\mu)} (93)
=\left\lVert\left|\int(g_{X}(y)-1)f(y)d\mu K\right|\right\rVert_{L_{\psi}^{N}(\mu)} (94)
\leq\left\lVert\int\left|(g_{X}(y)-1)f(y)\right|d\mu K\right\rVert_{L_{\psi}^{N}(\mu)} (95)
\leq\left\lVert f\right\rVert_{L_{\psi}^{N}(\mu K)}\left\lVert\left\lVert g_{X}-1\right\rVert_{L_{\psi^{\star}}^{N^{\star}}(\mu K)}\right\rVert_{L_{\psi}^{N}(\mu)}, (96)

where Equation 91 follows from the fact that Amemiya and Luxemburg norms are “ideal norms”, Equation 95 follows from the fact that \mu K is a probability measure along with Jensen's inequality, and Equation 96 follows from the generalised Hölder's inequality, i.e., Lemma 2. This gives the desired bound on the contraction coefficient of K. The bound on the contraction coefficient of K^{\star} follows from an analogous argument.

C-D Proof of Corollary 1

The inequality follows from Equation 11 with p\to\infty. As for the equality, notice that, for every x and t\in\mathbb{N},

\left\lVert g^{t}_{x}-1\right\rVert_{L_{1}(\pi)}=\left\lVert\frac{dK^{t}(Y|x)}{d\pi(Y)}-1\right\rVert_{L_{1}(\pi)}=\int d\pi(y)\left|\frac{dK^{t}(y|x)}{d\pi(y)}-1\right| (97)
=2\left\lVert K^{t}(\cdot|x)-\pi\right\rVert_{TV}. (98)

The last step follows from the definition of the total variation distance. Hence,

\left\lVert\left\lVert g^{t}_{X}-1\right\rVert_{L_{1}(\pi)}\right\rVert_{L_{q}(\pi)}\xrightarrow[q\to\infty]{}\operatorname*{ess\,sup}_{\pi}\left\lVert g^{t}_{X}-1\right\rVert_{L_{1}(\pi)}=\operatorname*{ess\,sup}_{\pi}2\left\lVert K^{t}(\cdot|X)-\pi\right\rVert_{TV}. (99)
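The identity in Equations 97 and 98 is straightforward to verify numerically; the Python sketch below (illustrative, assuming numpy) does so for the 3\times 3 matrix of Equation 68:

import numpy as np

K = np.array([[0.2, 0.1, 0.7],
              [0.3, 0.4, 0.3],
              [0.5, 0.5, 0.0]])
pi = np.full(3, 1 / 3)   # uniform stationary distribution

for t in range(1, 5):
    Kt = np.linalg.matrix_power(K, t)
    for x in range(3):
        g = Kt[x] / pi                       # g_x^t(y) = K^t(y|x)/pi(y)
        l1 = (pi * np.abs(g - 1)).sum()      # ||g_x^t - 1||_{L_1(pi)}
        tv = 0.5 * np.abs(Kt[x] - pi).sum()  # total variation distance to pi
        assert abs(l1 - 2 * tv) < 1e-12
print("Equation 98 verified: the L_1 norm equals twice the TV distance")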

C-E Proof of Corollary 2

In [16, Lemma 4.1], the authors prove that, if Equation 13 holds, then for every measure \xi one has that:

\frac{dK_{\xi}^{\star}(\cdot|Y)}{d\xi}\geq\epsilon\quad\xi K\text{-a.e.}, (100)

where Y is distributed according to \xi K. Given that 0<\epsilon<1 and selecting \xi=\mu, this implies that g_{y}(x)\geq\epsilon for every x and \mu K-almost every y; thus, for every y and x, |g_{y}(x)-1|\leq|\epsilon-1|=(1-\epsilon). Note that norms are non-decreasing and

\left\lVert\left\lVert c\right\rVert_{L_{\psi^{\star}}^{L}(\mu)}\right\rVert_{L_{\psi}^{A}(\mu K)}=\left\lVert\frac{|c|}{{\psi^{\star}}^{-1}(1)}\right\rVert_{L_{\psi}^{A}(\mu)}=\frac{|c|{\psi^{\star}}^{-1}(1)}{{\psi^{\star}}^{-1}(1)}=|c|=\left\lVert\left\lVert c\right\rVert_{L_{\psi^{\star}}^{A}(\mu K)}\right\rVert_{L_{\psi}^{L}(\mu K)}.

Then, the claim follows from an application of Theorem 2 with t=1 and Theorem 3.

Appendix D Proofs for Section IV-A

D-A Computations for Remark 2

The following upper bounds on \max_{x}\left\lVert\frac{d\delta_{x}}{d\pi}-1\right\rVert_{L_{\psi}^{N}(\pi)} hold:

\max_{x}\left\lVert\frac{d\delta_{x}}{d\pi}-1\right\rVert_{L_{\psi}^{L}(\pi)}\leq\max_{x}\begin{cases}1/{\psi}^{-1}\left(\frac{1}{2\pi(\Omega\setminus\{x\})}\right)&\text{ if }\pi(\{x\})\geq 1/2,\\ \frac{\pi(\Omega\setminus\{x\})}{\pi(\{x\})}\cdot 1/{\psi}^{-1}\left(\frac{1}{2\pi(\{x\})}\right)&\text{ else},\end{cases} (101)
\max_{x}\left\lVert\frac{d\delta_{x}}{d\pi}-1\right\rVert_{L_{\psi}^{A}(\pi)}\leq 2\max_{x}\pi(\Omega\setminus\{x\})\begin{cases}{\psi^{\star}}^{-1}\left(\frac{1}{2\pi(\Omega\setminus\{x\})}\right)&\text{ if }\pi(\{x\})\geq 1/2,\\ {\psi^{\star}}^{-1}\left(\frac{1}{2\pi(\{x\})}\right)&\text{ else},\end{cases} (102)

where \Omega is the space over which \nu and \pi are defined and \psi^{\star} is the Young conjugate, see Equation 4. We start by proving Equation 101. Let x be fixed. Then, for a given \psi and \sigma>0,

\pi\left(\psi\left(\frac{\left|\frac{d\delta_{x}}{d\pi}-1\right|}{\sigma}\right)\right)=\pi(\{x\})\psi\left(\frac{\pi(\Omega\setminus\{x\})}{\pi(\{x\})}\frac{1}{\sigma}\right)+\pi(\Omega\setminus\{x\})\psi\left(\frac{1}{\sigma}\right). (103)

We need to consider two cases. First, assume that \pi(\Omega\setminus\{x\})/\pi(\{x\})\leq 1, which is equivalent to \pi(\{x\})\geq 1/2. Then, since \psi is convex and \psi(0)=0, one has that

\pi(\{x\})\psi\left(\frac{\pi(\Omega\setminus\{x\})}{\pi(\{x\})}\frac{1}{\sigma}\right)+\pi(\Omega\setminus\{x\})\psi\left(\frac{1}{\sigma}\right)\leq 2\pi(\Omega\setminus\{x\})\psi\left(\frac{1}{\sigma}\right). (104)

Thus, in order to have the right-hand side of Equation 104 at most one, one needs

\sigma\geq\frac{1}{\psi^{-1}\left(\frac{1}{2\pi(\Omega\setminus\{x\})}\right)}.

If instead \pi(\Omega\setminus\{x\})/\pi(\{x\})\geq 1, then

\pi(\{x\})\psi\left(\frac{\pi(\Omega\setminus\{x\})}{\pi(\{x\})}\frac{1}{\sigma}\right)+\pi(\Omega\setminus\{x\})\psi\left(\frac{1}{\sigma}\right)\leq 2\pi(\{x\})\psi\left(\frac{\pi(\Omega\setminus\{x\})}{\pi(\{x\})}\frac{1}{\sigma}\right). (105)

Consequently, one needs

\sigma\geq\frac{\pi(\Omega\setminus\{x\})}{\pi(\{x\})\psi^{-1}\left(\frac{1}{2\pi(\{x\})}\right)}.

Analogous considerations allow us to derive a bound on the Amemiya norm. Indeed, if \pi(\Omega\setminus\{x\})/\pi(\{x\})\leq 1, then

\inf_{t>0}\frac{1+\pi\left(\psi\left(t\left|\frac{d\delta_{x}}{d\pi}-1\right|\right)\right)}{t}\leq\inf_{t>0}\frac{1+2\pi(\Omega\setminus\{x\})\psi(t)}{t}=2\pi(\Omega\setminus\{x\}){\psi^{\star}}^{-1}\left(\frac{1}{2\pi(\Omega\setminus\{x\})}\right). (106)

One can then similarly bound the Amemiya norm when \pi(\Omega\setminus\{x\})/\pi(\{x\})\geq 1.

D-B Proof of Theorem 4

The proof leverages the generalised Hölder's inequality. Let \mu K^{t} be the measure obtained in the Markovian process when \mu is the starting distribution, after t applications of the Markov kernel K. Assume that \mu K^{t}\ll\pi, with \pi being the stationary distribution. The following sequence of steps holds for every Young function \psi:

\mu K^{t}(E)=\int_{E}d\mu K^{t}=\int\mathbbm{1}_{E}d\mu K^{t} (107)
=\int\mathbbm{1}_{E}\frac{d\mu K^{t}}{d\pi}d\pi (108)
=\left\langle\mathbbm{1}_{E},\frac{d\mu K^{t}}{d\pi}\right\rangle_{\pi} (109)
\leq\left\lVert\mathbbm{1}_{E}\right\rVert_{L_{{\psi}^{\star}}^{N^{\star}}(\pi)}\left\lVert\frac{d\mu K^{t}}{d\pi}\right\rVert_{L_{\psi}^{N}(\pi)} (110)
\leq\left\lVert\mathbbm{1}_{E}\right\rVert_{L_{{\psi}^{\star}}^{N^{\star}}(\pi)}\left(\left\lVert\frac{d\mu K^{t}}{d\pi}-1\right\rVert_{L_{\psi}^{N}(\pi)}+\left\lVert 1\right\rVert_{L_{\psi}^{N}(\pi)}\right) (111)
=\begin{cases}1/{\psi^{\star}}^{-1}(1/\pi(E))\left(\left\lVert\frac{d\mu K^{t}}{d\pi}-1\right\rVert_{L_{\psi}^{A}(\pi)}+{\psi^{\star}}^{-1}(1)\right)\\ \pi(E)\psi^{-1}(1/\pi(E))\left(\left\lVert\frac{d\mu K^{t}}{d\pi}-1\right\rVert_{L_{\psi}^{L}(\pi)}+1/\psi^{-1}(1)\right)\end{cases}, (112)

where N^{\star} denotes the norm type complementary to N (Luxemburg for Amemiya and vice versa), as paired in Lemma 2.

D-C Proof of Corollary 4

Since we know the exact closed-form expression of the norm when \psi(x)=|x|^{p}/p, it is possible to prove a stronger result directly. Consider the function \hat{\psi}(x)=|x-1|^{p}/p. This is a convex function such that \hat{\psi}(1)=0, and it is no longer a Young function. One can thus define a corresponding f-divergence [30] with f(x)=\hat{\psi}(x). That is, given two measures \mu\ll\nu,

D_{\hat{\psi}}(\mu\|\nu)=\int\hat{\psi}\left(\frac{d\mu}{d\nu}\right)d\nu=\frac{1}{p}\left(\left\lVert\frac{d\mu}{d\nu}-1\right\rVert_{L_{p}(\nu)}\right)^{p}. (113)

As such, it satisfies the data-processing inequality, which we can use to prove Corollary 4. Following the same steps undertaken to prove [29, Theorem 3], for every event E one has that

\frac{D_{\hat{\psi}}(\mu\|\nu)+\nu(E^{c}){\hat{\psi}}^{\star}(0)}{\nu(E)}\geq\hat{\psi}\left(\frac{\mu(E)}{\nu(E)}\right). (114)

If \mu(E)\geq\nu(E), we can then invert \hat{\psi} and obtain

\mu(E)\leq\nu(E)\hat{\psi}^{-1}\left(\frac{D_{\hat{\psi}}(\mu\|\nu)+\nu(E^{c}){\hat{\psi}}^{\star}(0)}{\nu(E)}\right). (115)

Moreover, one has that \hat{\psi}^{\star}(y)=|y|^{q}/q+y, and thus \hat{\psi}^{\star}(0)=0. Hence, one can rewrite Equation 115 as follows:

\mu(E)\leq\nu(E)\left(\left(p\frac{D_{\hat{\psi}}(\mu\|\nu)}{\nu(E)}\right)^{\frac{1}{p}}+1\right)=\nu(E)+\nu(E)^{\frac{1}{q}}\left\lVert\frac{d\mu}{d\nu}-1\right\rVert_{L_{p}(\nu)}. (116)

If \mu(E)\leq\nu(E), then the claim holds trivially, as \nu(E)^{\frac{1}{q}}\left\lVert\frac{d\mu}{d\nu}-1\right\rVert_{L_{p}(\nu)}\geq 0.

Appendix E Proofs for Section IV-C

E-A Proof of Corollary 5

From [12, Theorem 1], we have that

\frac{\mathbb{P}\left(\left\lvert f(X_{1},\ldots,X_{t})-\mathcal{P}_{\bigotimes_{i=1}^{t}X_{i}}(f)\right\rvert\geq\eta\right)}{2^{\frac{1}{q}}\exp\left(\frac{-2t\eta^{2}}{q}\right)}\leq\prod_{i=2}^{t}\max_{x_{i-1}}\left\lVert\frac{d\mathcal{P}_{X_{i}|X_{i-1}=x_{i-1}}}{d\mathcal{P}_{X_{i}}}\right\rVert_{L^{p}(\mathcal{P}_{X_{i}})} (117)
\leq\prod_{i=2}^{t}\max_{x_{i-1}}\left(\left\lVert\frac{d\mathcal{P}_{X_{i}|X_{i-1}=x_{i-1}}}{d\mathcal{P}_{X_{i}}}-1\right\rVert_{L^{p}(\mathcal{P}_{X_{i}})}+1\right) (118)
\leq\prod_{i=2}^{t}\left(\left\lVert K_{i}^{\star}\right\rVert_{L_{p}^{0}\to L_{p}^{0}}\max_{x_{i-1}}\left\lVert\frac{d\delta_{x_{i-1}}}{d\mathcal{P}_{X_{i-1}}}-1\right\rVert_{L^{p}(\mathcal{P}_{X_{i-1}})}+1\right) (119)
=\prod_{i=2}^{t}\left(\left\lVert K_{i}^{\star}\right\rVert_{L_{p}^{0}\to L_{p}^{0}}\omega_{p}^{i}+1\right). (120)

E-B Concentration and doubly-stochastic matrices

Consider an m\times m doubly-stochastic matrix \Lambda with elements \lambda_{i,j}, 1\leq i,j\leq m. The bound on the contraction coefficient \left\lVert\Lambda^{T}\right\rVert_{L_{p}^{0}\to L_{p}^{0}} is given in Lemma 4 and, consequently, the bound provided by Corollary 5 yields

\begin{split}&\frac{\mathbb{P}\left(\left\lvert f(X_{1},\ldots,X_{t})-\mathcal{P}_{\bigotimes_{i=1}^{t}X_{i}}(f)\right\rvert\geq\eta\right)}{2^{\frac{1}{q}}\exp\left(\frac{-2t\eta^{2}}{q}\right)}\leq\\ &\left(\left(\min\left\{\sum_{j}\left(\sum_{i}\left|\lambda_{j,i}-\frac{1}{m}\right|^{q}\right)^{\frac{p}{q}},1\right\}\left(\frac{(m-1)^{p}}{m}+\frac{m-1}{m}\right)\right)^{\frac{1}{p}}+1\right)^{t-1},\end{split} (121)

which implies that exponential convergence is guaranteed whenever

\eta>\sqrt{(1+o_{t}(1))\frac{q\log\left(\left(\min\left\{\sum_{j}\left(\sum_{i}\left|\lambda_{j,i}-\frac{1}{m}\right|^{q}\right)^{\frac{p}{q}},1\right\}\left(\frac{(m-1)^{p}}{m}+\frac{m-1}{m}\right)\right)^{\frac{1}{p}}+1\right)}{2}}. (122)

In contrast, Theorem 1 by [12] gives

\mathbb{P}\left(\left\lvert f(X_{1},\ldots,X_{t})-\mathcal{P}_{\bigotimes_{i=1}^{t}X_{i}}(f)\right\rvert\geq\eta\right)\leq 2^{\frac{1}{q}}\exp\left(\frac{-2t\eta^{2}}{q}\right)\cdot m^{\frac{t-1}{q}}, (123)

which implies that exponential convergence is guaranteed if

\eta>\sqrt{(1+o_{t}(1))\frac{\log(m)}{2}}. (124)

Hence, if

\left\lVert\Lambda^{T}\right\rVert_{L_{p}^{0}\to L_{p}^{0}}\leq\left(\sum_{j}\left(\sum_{i}\left|\lambda_{j,i}-\frac{1}{m}\right|^{q}\right)^{\frac{p}{q}}\right)^{\frac{1}{p}}<\frac{m^{\frac{p-1}{p}}-1}{\left(\frac{(m-1)^{p}}{m}+\frac{m-1}{m}\right)^{\frac{1}{p}}}\xrightarrow[p\to\infty]{}1, (125)

then the numerator in Equation 122 is strictly smaller than the numerator in Equation 124, and Equation 121 improves over Equation 123. In particular, in the limit when p\to\infty (the regime that maximises the rate of decay) and m\geq 2, in order to improve over [12], we need the bound on the contraction coefficient to be bounded away from 1.
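To make the comparison concrete, the following Python sketch (an illustration of ours, assuming numpy; the helper name thresholds is not from the paper) evaluates the two quantities under the square roots in Equations 122 and 124, dropping the o_{t}(1) terms, for the 3\times 3 matrix of Equation 68. For p=2 the classical threshold of Equation 124 is smaller, while already for moderately large p the new bound wins, consistently with the discussion above:

import numpy as np

Lam = np.array([[0.2, 0.1, 0.7],
                [0.3, 0.4, 0.3],
                [0.5, 0.5, 0.0]])
m = Lam.shape[0]

def thresholds(p):
    """Leading constants of Equations 122 and 124 (without o_t(1))."""
    q = p / (p - 1)
    B = min(((np.abs(Lam - 1 / m) ** q).sum(axis=1) ** (p / q)).sum(), 1.0)
    C = (m - 1) ** p / m + (m - 1) / m
    eta_new = np.sqrt(q * np.log((B * C) ** (1 / p) + 1) / 2)  # Equation 122
    eta_old = np.sqrt(np.log(m) / 2)                           # Equation 124
    return eta_new, eta_old

for p in [2.0, 4.0, 10.0, 50.0]:
    eta_new, eta_old = thresholds(p)
    print(p, eta_new, eta_old, "improves" if eta_new < eta_old else "does not")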