
Entropy Contractions in Markov Chains: Half-Step, Full-Step and Continuous-Time

Pietro Caputo [email protected]. Università Roma Tre.    Zongchen Chen [email protected]. Georgia Institute of Technology.    Yuzhou Gu [email protected]. New York University.    Yury Polyanskiy [email protected]. Massachusetts Institute of Technology.
Abstract

This paper considers the speed of convergence (mixing) of a finite Markov kernel $P$ with respect to the Kullback-Leibler divergence (entropy). Given a Markov kernel one defines either a discrete-time Markov chain (with the $n$-step transition kernel given by the matrix power $P^{n}$) or a continuous-time Markov process (with the time-$t$ transition kernel given by $e^{t(P-\mathrm{Id})}$). The contraction of entropy for $n=1$ or $t=0+$ is characterized by two famous functional inequalities, the strong data processing inequality (SDPI) and the modified log-Sobolev inequality (MLSI), respectively. When $P=KK^{*}$ is written as the product of a kernel and its adjoint, one can also consider the "half-step" contraction, which is the SDPI for $K$, while the "full-step" contraction refers to the SDPI for $P$. The work [DMLM03] claimed that these contraction coefficients (half-step, full-step, and continuous-time) are generally comparable, that is, their ratio is bounded from above and below by two absolute constants. We disprove this and related conjectures by working out a number of different counterexamples. In particular, we construct (a) a continuous-time Markov process that contracts arbitrarily faster than its discrete-time counterpart; and (b) a kernel $P$ such that $P^{m+1}$ contracts arbitrarily better than $P^{m}$. Hence, our main conclusion is that the four standard inequalities comparing five common notions of entropy and variance contraction are generally not improvable (even in the subclass of factorizable Markov chains).

In the process of analyzing the counterexamples, we survey and sharpen the tools for bounding the contraction coefficients and characterize properties of extremizers of the respective functional inequalities. For example, we show that while the MLSI extremizers always have full support (unless the MLSI constant equals twice the spectral gap), the SDPI extremizers can have partial support. As our examples range from the Bernoulli-Laplace model and random walks on graphs to birth-death chains, the paper is also intended as a tutorial on computing MLSI, SDPI and other constants for these types of commonly occurring Markov chains.

1 Introduction

Markov chains are widely used in almost all areas of science and engineering, in the form of MCMC averaging in numerical analysis, approximation algorithms in computer science, generative modeling in artificial intelligence, and so on. Perhaps the most important problem in the study of Markov chains is to understand their equilibration properties. One example of such a property is the mixing time, characterizing the time it takes for the marginal distribution to come close to the stationary one, as measured by some statistical distance: the total variation (most commonly) or Wasserstein distance, the $\chi^{2}$ or Kullback-Leibler (KL) divergence.

In this paper, we focus on Markov chains on a finite state space. There are two common ways to define or implement Markov processes. In a discrete-time Markov chain, the state is updated at every discrete time $t\in\mathbb{Z}_{\geq 1}$; meanwhile, in a continuous-time Markov chain, the update times are distributed as a Poisson point process on the positive real axis $\mathbb{R}_{\geq 0}$. For example, if the single-step kernel for the former is given by a row-stochastic matrix $P$, then the corresponding continuous-time chain has kernel $T_{t}=e^{t(P-\mathrm{Id})}$. Due to concentration of the number of updates in the time interval $[0,t]$, both versions are known to have almost equal mixing times in total variation (e.g., [LP17, Theorem 20.3]). Thus, for the study of total-variation mixing times, probabilists are free to switch between discrete and continuous time as convenient.
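For readers who want to experiment numerically, here is a minimal sketch (Python with NumPy/SciPy; the kernel below is an arbitrary small example, not one of the chains studied later) that builds the continuous-time kernel $T_{t}=e^{t(P-\mathrm{Id})}$ and compares the distance to stationarity of $P^{n}$ and $T_{n}$.

```python
import numpy as np
from scipy.linalg import expm

# Illustrative reversible kernel: lazy random walk on a triangle.
P = np.array([[0.5 , 0.25, 0.25],
              [0.25, 0.5 , 0.25],
              [0.25, 0.25, 0.5 ]])
pi = np.ones(3) / 3                      # invariant distribution of P

def tv_to_pi(row):
    return 0.5 * np.abs(row - pi).sum()  # total variation distance to pi

for n in range(1, 6):
    Pn = np.linalg.matrix_power(P, n)        # discrete time: n steps of P
    Tn = expm(n * (P - np.eye(3)))           # continuous time: T_t at t = n
    print(n, tv_to_pi(Pn[0]), tv_to_pi(Tn[0]))
```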

One common approach (e.g., [DSC96]) to show rapid mixing of Markov processes is to prove that they "make progress" step-wise in terms of some $f$-divergence, most commonly the $\chi^{2}$ or the KL divergence. For discrete-time Markov chains, this means that the $f$-divergence to the target distribution decreases by a certain factor in every step, which can be described by the contraction coefficient of the associated Markov kernel. For continuous-time chains, this corresponds to the derivative of the $f$-divergence with respect to time being suitably bounded away from zero, so that the $f$-divergence to the target distribution decreases at a certain speed. For the KL divergence, the respective contraction inequalities are known as the strong data processing inequality (SDPI) for discrete time and the modified log-Sobolev inequality (MLSI) for continuous time. See Eqs. 7 and 15 below for formal definitions.

For a large family of Markov chains, including the Glauber dynamics on product spaces and random walks on high-dimensional simplices (e.g., [KM17]), the transition kernel $P$ consists of two stages, so that $P=KK^{*}$ is written as the product of a kernel and its adjoint. The first stage ($K$) is often described as the downward, forward, or noising move, and the second stage ($K^{*}$) as the upward, backward, or denoising move. Recent successes in analyzing such two-stage processes (e.g., [ALGV19, CLV21, CE22]) consider the decay of the $f$-divergence in a single stage $K$ or $K^{*}$, which for many specific problems turns out to be easier to handle. For such chains it is then natural to define a half-step contraction coefficient (corresponding to the application of $K$ only) and ask how it compares with the full-step, multi-step and continuous-time ones.

Given the equivalence between continuous-time and discrete-time mixing times, one naturally expects a similar equivalence between the contraction coefficients. This turns out to be true for the $\chi^{2}$ contraction ([Rag16, Remark 4.2]). Thus, it was not surprising when the work [DMLM03] claimed to show such an equivalence for the KL divergence contraction as well (that is, showing that the ratio of the MLSI and SDPI constants is universally bounded). Specifically, they claimed an equivalence (up to universal multiplicative factors) between the MLSI and the half-step SDPI, which implies in particular that the MLSI and the full-step SDPI (i.e., continuous-time and discrete-time entropy contraction) are equivalent. The present paper originated from our discovering a gap in their proof (see Section 3.6) and realizing that their claim cannot hold true, because the ratio between the MLSI constant and the half-step SDPI constant can be arbitrarily large for the random transposition model (Example 24). However, the question of whether the MLSI and the full-step SDPI are equivalent remained open.

We answer this question by presenting an example (Example 20) separating the MLSI and the full-step SDPI. That is, there exist cases where the contraction rate of the discrete-time chain is much slower than that of the continuous-time one. The example is adapted from Münch's counterexample to the Peres-Tetali conjecture [Mün23], although there seems to be no direct relationship between the properties used here and those used op. cit.

The separation between continuous-time and discrete-time entropy contraction leads us to a related question: comparing the entropy contraction of the $m$-step kernel $P^{m}$ with that of the $(m+1)$-step kernel $P^{m+1}$. If a separation is possible, then it would explain how a continuous-time chain (which averages over many $P^{m}$'s) could have better convergence properties at finite time than the corresponding discrete-time chain (see the discussion at the end of Section 3.2). It turns out that a counterexample is again possible (see Example 18 below).

To avoid trivial counterexamples, we restrict our attention to factorizable kernels, that is, kernels that can be written in the form $P=KK^{*}$. As discussed above, this is a natural class of Markov kernels to study. All our examples are factorizable.

In all, the main purpose of this paper is to provide a self-contained and thorough introduction to all common notions of relative entropy decay for finite-state Markov chains, for both continuous-time and discrete-time versions. We summarize known comparisons among these notions (including the recent half-step contraction) and we give examples demonstrating that in all cases where comparisons are not available there exist counterexamples, in the sense that the ratio can be arbitrarily large. Along the way, we correct several misstatements that appeared in previous works, and show how to get sharp upper and lower bounds for these coefficients and study the extremizers in the respective functional inequalities.

Organization. This paper’s content is succinctly summarized in three tables: Table 1 gives definitions of five common contraction coefficients, Table 2 lists all known comparison inequalities along with (new) counterexamples for the missing comparisons, and Table 3 summarizes the list of counterexamples. The rest of the introduction defines all notions rigorously and recalls the standard comparison chain Eq. 29:

$$\rho\leq\alpha\leq\delta\leq\rho_{0}\leq 2\lambda\,.$$

Then, after introducing factorizable Markov kernels ($P=KK^{*}$) in Section 2, Section 3 shows that every inequality above can have an arbitrarily large ratio, even when restricted to factorizable kernels. Section 4 concludes with some results on the properties of extremizers of functional inequalities.

Notation. Throughout the paper, let $\mathcal{X}$ be a finite set, and let $P:\mathcal{X}\to\mathcal{X}$ be a Markov kernel with an invariant distribution $\pi$. Assume that $(\pi,P)$ is reversible, i.e., $\pi(x)P(x,y)=\pi(y)P(y,x)$ for all $x,y\in\mathcal{X}$. For a Markov kernel $K:\mathcal{X}\to\mathcal{Y}$ and a distribution $\pi$ on $\mathcal{X}$, we define the reverse channel $K_{\pi}^{*}:\mathcal{Y}\to\mathcal{X}$ as

$$K_{\pi}^{*}(y,x)=\begin{cases}\frac{\pi(x)K(x,y)}{(\pi K)(y)},&\text{if }(\pi K)(y)>0,\\ \pi(x),&\text{otherwise.}\end{cases} \qquad (3)$$

Note that $\pi$ is an invariant distribution of $KK_{\pi}^{*}$ and $(\pi,KK_{\pi}^{*})$ is reversible.
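As a sanity check of Eq. 3, the following sketch (Python/NumPy; the specific $K$ and $\pi$ are arbitrary illustrations) builds $K_{\pi}^{*}$ from a kernel $K$ and verifies that $\pi$ is invariant for $KK_{\pi}^{*}$ and that $(\pi,KK_{\pi}^{*})$ is reversible.

```python
import numpy as np

def reverse_channel(K, pi):
    """Reverse channel K*_pi per Eq. 3 (assumes (pi K)(y) > 0 for all y)."""
    piK = pi @ K                               # output distribution pi K
    return (K * pi[:, None]).T / piK[:, None]  # K*_pi(y, x) = pi(x) K(x, y) / (pi K)(y)

# Arbitrary example: K from a 3-point space to a 2-point space.
K = np.array([[0.9, 0.1],
              [0.5, 0.5],
              [0.2, 0.8]])
pi = np.array([0.5, 0.3, 0.2])

Kstar = reverse_channel(K, pi)
P = K @ Kstar                                  # the factorized kernel P = K K*_pi
print(np.allclose(pi @ P, pi))                 # pi is invariant for P
print(np.allclose(pi[:, None] * P, (pi[:, None] * P).T))  # (pi, P) is reversible
```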

1.1 Continuous-time contraction notions

For $f,g:\mathcal{X}\to\mathbb{R}$, define the Dirichlet form as

$$\mathcal{E}_{\pi,P}(f,g)=-\pi[f(Lg)], \qquad (4)$$

where $L=P-I$ is the Markov generator.

The Poincaré constant (also called the spectral gap) $\lambda=\lambda(\pi,P)$ is the largest number such that

$$\lambda\operatorname{Var}_{\pi}(f)\leq\mathcal{E}_{\pi,P}(f,f),\qquad\forall f:\mathcal{X}\to\mathbb{R}, \qquad (5)$$

where $\operatorname{Var}_{\pi}(f):=\pi[f^{2}]-(\pi[f])^{2}$ is the variance of $f$ under $\pi$. When $f=\frac{d\nu}{d\pi}$ for some $\nu\in\mathcal{P}(\mathcal{X})$ (where $\mathcal{P}(\mathcal{X})$ denotes the space of distributions on $\mathcal{X}$), we have $\operatorname{Var}_{\pi}(f)=\chi^{2}(\nu\|\pi)$, where $\chi^{2}(\cdot\|\cdot)$ stands for the $\chi^{2}$-divergence. Eq. 5 is called the Poincaré inequality.

The log-Sobolev constant (LSC) $\rho=\rho(\pi,P)$ is the largest number such that

$$\rho\operatorname{Ent}_{\pi}(f)\leq\mathcal{E}_{\pi,P}(\sqrt{f},\sqrt{f}),\qquad\forall f:\mathcal{X}\to\mathbb{R}_{\geq 0}, \qquad (6)$$

where $\operatorname{Ent}_{\pi}(f):=\pi\left[f\log\frac{f}{\pi[f]}\right]$ is the entropy of $f$. When $f=\frac{d\nu}{d\pi}$ for some $\nu\in\mathcal{P}(\mathcal{X})$, we have $\operatorname{Ent}_{\pi}(f)=D(\nu\|\pi)$, where $D(\cdot\|\cdot)$ stands for the Kullback-Leibler (KL) divergence. Eq. 6 is called the log-Sobolev inequality (LSI).

The modified log-Sobolev constant (MLSC) $\rho_{0}=\rho_{0}(\pi,P)$ is the largest number such that

$$\rho_{0}\operatorname{Ent}_{\pi}(f)\leq\mathcal{E}_{\pi,P}(f,\log f),\qquad\forall f:\mathcal{X}\to\mathbb{R}_{\geq 0}. \qquad (7)$$

Eq. 7 is called the modified log-Sobolev inequality (MLSI).

For reversible $(\pi,P)$, we always have ([DSC96, BT06])

$$4\rho\leq\rho_{0}\leq 2\lambda. \qquad (8)$$

The constants $\rho$, $\rho_{0}$, and $\lambda$ represent the contraction ability of the continuous-time Markov chain (also known as the Markov semigroup) $(T_{t})_{t\geq 0}$, where $T_{t}=\exp(tL)=e^{t(P-\mathrm{Id})}$. (In this work we focus on continuous-time Markov chains of this type and do not consider more general continuous-time Markov chains.) These constants are properties of the Markov generator $L$ (which can be arbitrarily scaled), rather than of the Markov kernel $P$.
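Among these constants, the spectral gap is the easiest to compute directly; the sketch below (Python/NumPy, assuming $(\pi,P)$ is reversible and irreducible, with an arbitrary illustrative kernel) obtains $\lambda$ as the gap between the top two eigenvalues of the symmetrized kernel.

```python
import numpy as np

def spectral_gap(P, pi):
    """Poincare constant (spectral gap) of a reversible, irreducible pair (pi, P):
    1 minus the second-largest eigenvalue of P, via the symmetrized kernel."""
    D = np.sqrt(pi)
    S = (D[:, None] * P) / D[None, :]      # similar to P; symmetric by reversibility
    eig = np.sort(np.linalg.eigvalsh(S))   # eigenvalues of P in increasing order
    return 1.0 - eig[-2]

# Illustrative example: lazy random walk on the complete graph K_3 (cf. Eq. 48 with n = 3).
P = np.array([[0.5 , 0.25, 0.25],
              [0.25, 0.5 , 0.25],
              [0.25, 0.25, 0.5 ]])
pi = np.ones(3) / 3
print(spectral_gap(P, pi))   # 0.75 = n / (2(n-1)) for n = 3
```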

Let $\nu$ be a distribution on $\mathcal{X}$. Define $\nu_{t}=\nu T_{t}$ to be the distribution of the Markov chain at time $t\geq 0$ when initialized at $\nu$, and define $f_{t}=\frac{d\nu_{t}}{d\pi}$ to be the relative density. A direct computation yields

$$\frac{d}{dt}\operatorname{Var}_{\pi}(f_{t})=-2\mathcal{E}_{\pi,P}(f_{t},f_{t}), \qquad (9)$$
$$\frac{d}{dt}\operatorname{Ent}_{\pi}(f_{t})=-\mathcal{E}_{\pi,P}(f_{t},\log f_{t}). \qquad (10)$$

Therefore the Poincaré inequality and the modified log-Sobolev inequality can be equivalently stated as

$$\frac{d}{dt}\operatorname{Var}_{\pi}(f_{t})\leq-2\lambda\operatorname{Var}_{\pi}(f_{t}), \qquad (11)$$
$$\frac{d}{dt}\operatorname{Ent}_{\pi}(f_{t})\leq-\rho_{0}\operatorname{Ent}_{\pi}(f_{t}). \qquad (12)$$

Eqs. 11 and 12 can also be understood as alternative definitions of $\lambda$ and $\rho_{0}$, from which one immediately obtains

$$\operatorname{Var}_{\pi}(f_{t})\leq\exp(-2\lambda t)\operatorname{Var}_{\pi}(f_{0}), \qquad (13)$$
$$\operatorname{Ent}_{\pi}(f_{t})\leq\exp(-\rho_{0}t)\operatorname{Ent}_{\pi}(f_{0}). \qquad (14)$$
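The identity in Eq. 10 is easy to check numerically (computing $\rho_{0}$ itself requires an optimization and is not attempted here). The sketch below (Python/NumPy/SciPy, with an arbitrary illustrative kernel and initial distribution) compares a finite-difference derivative of $\operatorname{Ent}_{\pi}(f_{t})$ with $-\mathcal{E}_{\pi,P}(f_{t},\log f_{t})$.

```python
import numpy as np
from scipy.linalg import expm

P = np.array([[0.5 , 0.25, 0.25],        # illustrative reversible kernel
              [0.25, 0.5 , 0.25],
              [0.25, 0.25, 0.5 ]])
pi = np.ones(3) / 3
nu = np.array([1.0, 0.0, 0.0])           # initial distribution

def ent(f):                               # Ent_pi(f) = pi[f log(f / pi[f])]
    m = pi @ f
    return pi @ (f * np.log(f / m))

def dirichlet(f, g):                      # E_{pi,P}(f, g) = -pi[f (P - I) g]
    return -pi @ (f * ((P - np.eye(3)) @ g))

t, h = 1.0, 1e-5
f_t  = (nu @ expm(t * (P - np.eye(3)))) / pi
f_th = (nu @ expm((t + h) * (P - np.eye(3)))) / pi
print((ent(f_th) - ent(f_t)) / h)         # finite-difference d/dt Ent_pi(f_t)
print(-dirichlet(f_t, np.log(f_t)))       # right-hand side of Eq. 10
```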

We remark that the log-Sobolev inequality is equivalent to hypercontractivity by [DSC96]. Furthermore, [BG99] shows that the log-Sobolev inequality is equivalent (up to a constant factor) to the Poincaré inequality on an Orlicz space. In this work we consider only the Poincaré inequality on the $L^{2}$ space.

We refer the reader to [DSC96, BT06] for more discussions on (modified) log-Sobolev constants.

1.2 Discrete-time contraction notions

Another class of contraction notions comes from contraction coefficients for $f$-divergences. We refer the reader to [PW24, Chapter 7] for an introduction to $f$-divergences. Let $K:\mathcal{X}\to\mathcal{Y}$ be a Markov kernel and $\pi$ be a distribution on $\mathcal{X}$. For any $f$-divergence, we define the input-restricted $f$-contraction coefficient (one could also consider the input-unrestricted contraction coefficient $\eta_{f}(K):=\sup_{\pi\in\mathcal{P}(\mathcal{X})}\eta_{f}(\pi,K)$; in this paper we focus on the input-restricted version)

$$\eta_{f}(\pi,K):=\sup_{\substack{\nu\in\mathcal{P}(\mathcal{X})\\ 0<D_{f}(\nu\|\pi)<\infty}}\frac{D_{f}(\nu K\|\pi K)}{D_{f}(\nu\|\pi)}. \qquad (15)$$

In other words, we have

$$D_{f}(\nu K\|\pi K)\leq\eta_{f}(\pi,K)D_{f}(\nu\|\pi), \qquad (16)$$

known as the strong data processing inequality (SDPI). By the data processing inequality (DPI), we always have $0\leq\eta_{f}(\pi,K)\leq 1$, and a smaller value means a stronger contraction of the $f$-divergence. The most commonly used $f$-contraction coefficients include the total variation (TV) contraction coefficient $\eta_{\operatorname{TV}}(\pi,K)$, the Kullback-Leibler (KL) contraction coefficient $\eta_{\operatorname{KL}}(\pi,K)$ (the subscript $\operatorname{KL}$ is sometimes omitted), and the $\chi^{2}$-contraction coefficient $\eta_{\chi^{2}}(\pi,K)$. It is known ([AG76, PW17]) that

$$\eta_{\chi^{2}}(\pi,K)\leq\eta_{\operatorname{KL}}(\pi,K)\leq\eta_{\operatorname{TV}}(\pi,K). \qquad (17)$$

The TV contraction coefficient $\eta_{\operatorname{TV}}(\pi,K)$ is also known as Dobrushin's coefficient ([Dob56]), and satisfies

$$\eta_{\operatorname{TV}}(\pi,K)=\max_{x,x^{\prime}\in\mathcal{X}}\operatorname{TV}(K(x,\cdot),K(x^{\prime},\cdot)). \qquad (18)$$

Via Eq. 17, Eq. 18 provides an easy upper bound for $\eta_{\chi^{2}}(\pi,K)$ and $\eta_{\operatorname{KL}}(\pi,K)$.
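Eq. 18 is straightforward to evaluate; a minimal sketch (Python/NumPy, with an arbitrary kernel for illustration):

```python
import numpy as np

def dobrushin(K):
    """TV contraction coefficient of Eq. 18: max TV distance between rows of K."""
    n = K.shape[0]
    return max(0.5 * np.abs(K[x] - K[y]).sum() for x in range(n) for y in range(n))

K = np.array([[0.9, 0.1],
              [0.5, 0.5],
              [0.2, 0.8]])
print(dobrushin(K))   # 0.7; by Eq. 17 this also upper-bounds eta_KL and eta_chi2
```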

We refer the reader to [Rag16, PW24] for more discussions on contraction coefficients.

Let $P=KK_{\pi}^{*}$. We define the half-step entropy contraction coefficient

$$\alpha(\pi,K)=1-\eta_{\operatorname{KL}}(\pi,K)=\inf_{\substack{\nu\in\mathcal{P}(\mathcal{X})\\ 0<D(\nu\|\pi)<\infty}}\frac{D(\nu\|\pi)-D(\nu K\|\pi K)}{D(\nu\|\pi)} \qquad (19)$$

and the full-step entropy contraction coefficient

$$\delta(\pi,P)=1-\eta_{\operatorname{KL}}(\pi,P)=\inf_{\substack{\nu\in\mathcal{P}(\mathcal{X})\\ 0<D(\nu\|\pi)<\infty}}\frac{D(\nu\|\pi)-D(\nu P\|\pi)}{D(\nu\|\pi)}. \qquad (20)$$

By rearranging, we have inequalities

$$\operatorname{Ent}_{\pi K}(K_{\pi}^{*}f)-\operatorname{Ent}_{\pi}(f)\leq-\alpha\operatorname{Ent}_{\pi}(f), \qquad (21)$$
$$\operatorname{Ent}_{\pi}(Pf)-\operatorname{Ent}_{\pi}(f)\leq-\delta\operatorname{Ent}_{\pi}(f), \qquad (22)$$

for all $f:\mathcal{X}\to\mathbb{R}_{\geq 0}$. Eqs. 21 and 22 can be seen as definitions of $\alpha$ and $\delta$, and can be compared with Eqs. 13 and 14. In the next section, we will discuss relationships between the discrete-time contraction notions $\alpha$, $\delta$ and the continuous-time contraction notions $\rho$, $\rho_{0}$, $\lambda$.
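There is no simple closed form for $\alpha$ or $\delta$ in general, but upper bounds follow by plugging any particular $\nu$ into Eqs. 19 and 20. The sketch below (Python/NumPy, with an arbitrary illustrative kernel) evaluates the ratio in Eq. 20 at point distributions, which is also how several of the upper bounds in Section 3 are obtained.

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def delta_upper_bound(P, pi):
    """Upper bound on delta(pi, P): evaluate Eq. 20 at point distributions nu."""
    best = 1.0
    for x in range(len(pi)):
        nu = np.zeros(len(pi)); nu[x] = 1.0
        best = min(best, 1.0 - kl(nu @ P, pi) / kl(nu, pi))
    return best

P = np.array([[0.5 , 0.25, 0.25],
              [0.25, 0.5 , 0.25],
              [0.25, 0.25, 0.5 ]])
pi = np.ones(3) / 3
print(delta_upper_bound(P, pi))
```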

1.3 Continuous-time versus discrete-time contraction

As we discussed above, the log-Sobolev constant $\rho(\pi,P)$, the modified log-Sobolev constant $\rho_{0}(\pi,P)$, and the Poincaré constant $\lambda(\pi,P)$ represent the contraction ability of the continuous-time Markov chain, while the contraction coefficients represent the contraction ability of the discrete-time Markov chain. These constants allow one to derive mixing time bounds for the associated Markov chains in the continuous-time and discrete-time settings respectively; see e.g. [Cap23] and the references therein for more details. In this paper, we give a full comparison between the discrete-time contraction notions and the continuous-time contraction notions.

The $\chi^{2}$-contraction coefficient has a close relationship with the Poincaré constant $\lambda$. We have

$$1-\eta_{\chi^{2}}(\pi,K)=\lambda(\pi,KK_{\pi}^{*}). \qquad (23)$$

For $P=KK_{\pi}^{*}$, the two versions of contraction are equivalent up to a constant factor. By Eq. 23 and $1-\lambda(\pi,P^{2})=(1-\lambda(\pi,P))^{2}$, we have

$$1-\sqrt{\eta_{\chi^{2}}(\pi,P)}=\lambda(\pi,P). \qquad (24)$$

In particular,

$$\frac{1}{2}\left(1-\eta_{\chi^{2}}(\pi,P)\right)\leq\lambda(\pi,P)\leq 1-\eta_{\chi^{2}}(\pi,P). \qquad (25)$$

This shows that the rates of $\chi^{2}$-divergence contraction for continuous-time and discrete-time Markov chains are within a factor of two of each other.

If we consider entropy (KL divergence) rather than the $\chi^{2}$-divergence, then previous works have shown one-sided inequalities. [BCP+21] shows that

$$\delta(\pi,P)\leq\rho_{0}(\pi,P). \qquad (26)$$

[Mic97] proves that

$$\rho(\pi,P)\leq\alpha(\pi,K). \qquad (27)$$

By the data processing inequality, we have

$$\alpha(\pi,K)\leq\delta(\pi,P). \qquad (28)$$

Summarizing the above, we have a chain of inequalities

$$\rho\leq\alpha\leq\delta\leq\rho_{0}\leq 2\lambda. \qquad (29)$$

In other words, we have the following implications:

Log-Sobolev Inequality $\Longrightarrow$ Half-Step Entropy Contraction $\Longrightarrow$ Full-Step Entropy Contraction $\Longrightarrow$ Modified Log-Sobolev Inequality $\Longrightarrow$ Poincaré Inequality $\qquad$ (30)

In Section 3, we show that the gap between any two adjacent constants in Eq. 29 can be arbitrarily large. In particular, unlike for the $\chi^{2}$-divergence, the discrete-time entropy contraction $\delta(\pi,P)$ is not equivalent to the continuous-time entropy contraction $\rho_{0}(\pi,P)$.

Table 1 summarizes the main constants discussed in this paper, and Table 2 summarizes the relationships between these contraction notions. While these contraction notions are in general not equivalent (up to a constant factor), we will see in Section 3.5 that under extra conditions some of them can become equivalent for certain chains.

Symbol | Name | Inequality (Definition)
$\rho(\pi,P)$ | Log-Sobolev Constant | $\rho\operatorname{Ent}_{\pi}(f)\leq\mathcal{E}_{\pi,P}(\sqrt{f},\sqrt{f})$
$\alpha(\pi,K)$ | Half-Step Entropy Contraction | $\alpha\operatorname{Ent}_{\pi}(f)\leq\operatorname{Ent}_{\pi}(f)-\operatorname{Ent}_{\pi K}(K_{\pi}^{*}f)$
$\delta(\pi,P)$ | Full-Step Entropy Contraction | $\delta\operatorname{Ent}_{\pi}(f)\leq\operatorname{Ent}_{\pi}(f)-\operatorname{Ent}_{\pi}(Pf)$
$\rho_{0}(\pi,P)$ | Modified Log-Sobolev Constant | $\rho_{0}\operatorname{Ent}_{\pi}(f)\leq\mathcal{E}_{\pi,P}(f,\log f)$
$\lambda(\pi,P)$ | Poincaré Constant | $\lambda\operatorname{Var}_{\pi}(f)\leq\mathcal{E}_{\pi,P}(f,f)$
Table 1: Contraction notions discussed in this paper. We assume that $P=KK_{\pi}^{*}$. See Table 2 for relationships between the constants.
Relationship | Explanation | Reference
$\rho(\pi,P)\leq\alpha(\pi,K)$ | Log-Sobolev Inequality $\Rightarrow$ Half-Step Entropy Contraction | [Mic97, Prop. 6]
$\rho(\pi,P)\not\gtrsim\alpha(\pi,K)$ | Half-Step Entropy Contraction $\not\Rightarrow$ Log-Sobolev Inequality | Section 3.1
$\alpha(\pi,K)\leq\delta(\pi,P)$ | Half-Step Entropy Contraction $\Rightarrow$ Full-Step Entropy Contraction | Data processing inequality
$\alpha(\pi,K)\not\gtrsim\delta(\pi,P)$ | Full-Step Entropy Contraction $\not\Rightarrow$ Half-Step Entropy Contraction | Section 3.2
$\delta\left(\pi,P^{m}\right)\not\gtrsim\delta\left(\pi,P^{m+1}\right)$ | $(m+1)$-Step Entropy Contraction $\not\Rightarrow$ $m$-Step Entropy Contraction | Section 3.2
$\delta(\pi,P)\leq\rho_{0}(\pi,P)$ | Full-Step Entropy Contraction $\Rightarrow$ Modified Log-Sobolev Inequality | [BCP+21, Lemma 2.7]
$\delta(\pi,P)\not\gtrsim\rho_{0}(\pi,P)$ | Modified Log-Sobolev Inequality $\not\Rightarrow$ Full-Step Entropy Contraction | Section 3.3
$\rho_{0}(\pi,P)\leq 2\lambda(\pi,P)$ | Modified Log-Sobolev Inequality $\Rightarrow$ Poincaré Inequality | [BT06, Prop. 3.6]
$\rho_{0}(\pi,P)\not\gtrsim\lambda(\pi,P)$ | Poincaré Inequality $\not\Rightarrow$ Modified Log-Sobolev Inequality | Section 3.4
Table 2: Relationships between contraction notions. The setting is the same as Table 1. $a\not\gtrsim b$ means $\frac{b}{a}$ can be arbitrarily large. "Inequality A $\Rightarrow$ Inequality B" means Inequality A with constant $a$ implies Inequality B with constant $Ca$ for some absolute constant $C>0$.

2 Factorizable kernels

Definition 1 (Factorizable pairs).

Let $\mathcal{X}$ be a finite set and $P:\mathcal{X}\to\mathcal{X}$ be a Markov kernel with invariant distribution $\pi$. We say $(\pi,P)$ is factorizable if $P=KK_{\pi}^{*}$ for some finite Markov kernel $K:\mathcal{X}\to\mathcal{Y}$.

Lemma 2 (Factorizable implies reversible).

A factorizable pair $(\pi,P)$ is reversible.

Proof.

Suppose $P=KK_{\pi}^{*}$ for a finite Markov kernel $K:\mathcal{X}\to\mathcal{Y}$. For $x,y\in\mathcal{X}$, we have

$$\pi(x)P(x,y)=\pi(x)\sum_{z\in\mathcal{Y}}K(x,z)K_{\pi}^{*}(z,y)=\pi(x)\pi(y)\sum_{z\in\mathcal{Y}}\frac{1}{(\pi K)(z)}K(x,z)K(y,z)=\pi(y)P(y,x). \qquad (31)$$

So $(\pi,P)$ is reversible. ∎

Factorizability is a reasonable assumption for several reasons. Recall that $\rho$, $\rho_{0}$, and $\lambda$ are properties of the Markov generator $L$, while $\alpha$ and $\delta$ are properties of the Markov kernel $P$, and the two are related by $L=P-I$. If we do not make extra assumptions on $P$, then it could happen that for two Markov kernels $P_{1}$ and $P_{2}$, $P_{1}$ contracts better than $P_{2}$, but the corresponding Markov generators $L_{1}$ and $L_{2}$ satisfy that $L_{2}$ contracts better than $L_{1}$. Consider the example where $\mathcal{X}=[2]$, $\pi=\operatorname{Unif}(\mathcal{X})$, and $P_{c}=\begin{pmatrix}1-c&c\\ c&1-c\end{pmatrix}$ for $c\in[0,1]$. Then the contraction ability (as Markov kernels) is increasing for $c\in\left[0,\frac{1}{2}\right]$ and decreasing for $c\in\left[\frac{1}{2},1\right]$. However, the contraction ability of the corresponding Markov generators $L_{c}=P_{c}-I$ is increasing on the whole interval $[0,1]$. To avoid such undesirable behavior, we need to impose extra assumptions like factorizability. Furthermore, Eqs. 27 and 28 imply that

$$\rho(\pi,P)\leq\delta(\pi,P) \qquad (32)$$

for factorizable $P$. While the statement of Eq. 32 does not involve factorizability, it can fail when $P$ is not factorizable. Consider the same example with $c=1$, that is, $P=\begin{pmatrix}0&1\\ 1&0\end{pmatrix}$. Then $\delta(\pi,P)=0$ but $\rho(\pi,P)=1$.

2.1 Important classes of factorizable kernels

Another reason that the factorizability assumption is reasonable is that many natural Markov chains considered in the literature are factorizable. In this section we discuss two important classes of factorizable kernels: lazy kernels and the Glauber dynamics.

Lemma 3 (Lazy implies factorizable).

Let $(\pi,P)$ be a reversible pair. If $P(x,x)\geq\frac{1}{2}$ for all $x\in\mathcal{X}$, then there exist $\mathcal{Y}$ and $K:\mathcal{X}\to\mathcal{Y}$ such that $P=KK_{\pi}^{*}$.

Proof.

Let $\mathcal{Y}=\binom{\mathcal{X}}{1}\cup\binom{\mathcal{X}}{2}$. Let $K:\mathcal{X}\to\mathcal{Y}$ be the map

$$K(x,e)=\begin{cases}2P(x,x)-1,&\text{if }e=\{x\},\\ 2P(x,y),&\text{if }e=\{x,y\},\\ 0,&\text{if }x\not\in e.\end{cases} \qquad (36)$$

Then $K_{\pi}^{*}(e,\cdot)=\operatorname{Unif}(e)$. For $y\neq x\in\mathcal{X}$, we have

$$(KK_{\pi}^{*})(x,y)=K(x,\{x,y\})K_{\pi}^{*}(\{x,y\},y)=P(x,y). \qquad (37)$$

Therefore $P=KK_{\pi}^{*}$. ∎

Lazy chains are quite common. In particular, many of our examples are lazy random walks on regular graphs.

Definition 4 (Lazy random walk Markov chain).

Let $G=(V,E)$ be a $d$-regular graph. We associate with it a canonical Markov chain called the (lazy) random walk. Let $\mathcal{X}=V$, $\mathcal{Y}=E$, and let $K:\mathcal{X}\to\mathcal{Y}$ be given by $K(x,e)=\frac{1}{d}\mathbbm{1}\{x\in e\}$ for $x\in\mathcal{X}$, $e\in\mathcal{Y}$. Then $K_{\pi}^{*}(e,\cdot)=\operatorname{Unif}(e)$ and $P=KK_{\pi}^{*}$ satisfies

$$P(x,y)=\begin{cases}\frac{1}{2},&\text{if }x=y,\\ \frac{1}{2d},&\text{if }(xy)\in E,\\ 0,&\text{otherwise.}\end{cases} \qquad (41)$$
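A minimal sketch of Definition 4 (Python/NumPy; the graph below, a 4-cycle, is an arbitrary illustration) builds $K$ and the reverse channel, and checks that $P=KK_{\pi}^{*}$ matches Eq. 41.

```python
import numpy as np

# 4-cycle: a 2-regular graph on vertices {0, 1, 2, 3}, purely illustrative.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
n, d = 4, 2
pi = np.ones(n) / n                          # uniform is invariant on a regular graph

K = np.zeros((n, len(edges)))                # K(x, e) = (1/d) * 1{x in e}
for j, (u, v) in enumerate(edges):
    K[u, j] = K[v, j] = 1 / d

piK = pi @ K
Kstar = (K * pi[:, None]).T / piK[:, None]   # reverse channel (Eq. 3); rows are Unif(e)
P = K @ Kstar

# Compare with Eq. 41: 1/2 on the diagonal, 1/(2d) across each edge, 0 otherwise.
expected = np.eye(n) / 2
for (u, v) in edges:
    expected[u, v] = expected[v, u] = 1 / (2 * d)
print(np.allclose(P, expected))
```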

The Glauber dynamics is an important class of Markov chains that has shown huge practical and theoretical success in sampling from spin systems. Let us consider a general $\alpha$-weighted block dynamics, defined as follows. Let $\pi$ be a probability measure on $\Omega=[q]^{n}$ (e.g., the Ising model on a graph with $n$ vertices). For any $\sigma,\eta\in\Omega$ and $A\subseteq[n]$, consider the conditional probability of $\eta$ given the configuration $\sigma$ on $A$:

$$\pi(\eta|\sigma_{A})=\frac{\pi(\eta)\mathbbm{1}\{\sigma|_{A}=\eta|_{A}\}}{\sum_{\eta^{\prime}\in\Omega}\pi(\eta^{\prime})\mathbbm{1}\{\sigma|_{A}=\eta^{\prime}|_{A}\}}. \qquad (42)$$

Let $\mathcal{S}=2^{[n]}$ be the set of all subsets of $[n]$. For any probability measure $\alpha=(\alpha_{A})_{A\in\mathcal{S}}$ on $\mathcal{S}$, the $\alpha$-weighted block dynamics is the Markov chain on $\Omega$ with transition matrix

$$P(\sigma,\sigma^{\prime})=\sum_{A\in\mathcal{S}}\alpha_{A}\pi(\sigma^{\prime}|\sigma_{A^{c}}), \qquad (43)$$

where $A^{c}=[n]\backslash A$. The case $\alpha_{A}=\frac{1}{n}\mathbbm{1}\{|A|=1\}$ is known as the Glauber dynamics. The pair $(\pi,P)$ is reversible. A factorization of the $\alpha$-weighted block dynamics can be defined as follows. Let $\mathcal{X}=\Omega$, $\mathcal{Y}=\mathcal{S}\times\Omega$, and let $K:\mathcal{X}\to\mathcal{Y}$ be defined as

$$K(\sigma,(A,\eta))=\alpha_{A}\pi(\eta|\sigma_{A^{c}}). \qquad (44)$$

One can compute that $(\pi K)(A,\eta)=\alpha_{A}\pi(\eta)$ and that $K_{\pi}^{*}((A,\eta),\sigma)=\pi(\sigma|\eta_{A^{c}})$ for all $(A,\eta)\in\mathcal{S}\times\Omega$, $\sigma\in\Omega$. Therefore, for all $\sigma,\sigma^{\prime}\in\Omega$,

$$\sum_{(A,\eta)\in\mathcal{S}\times\Omega}K(\sigma,(A,\eta))K_{\pi}^{*}((A,\eta),\sigma^{\prime})=\sum_{A\in\mathcal{S}}\alpha_{A}\sum_{\eta\in\Omega}\pi(\eta|\sigma_{A^{c}})\pi(\sigma^{\prime}|\eta_{A^{c}})=P(\sigma,\sigma^{\prime}). \qquad (45)$$

Thus $P=KK_{\pi}^{*}$ is a factorization of the $\alpha$-weighted block dynamics. Half-step entropy contractions for such Markov chains have been recently investigated in the context of spin systems under the name of block factorizations of entropy (e.g., [BCC+22]).
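The factorization in Eqs. 44 and 45 can be checked numerically on a tiny instance. Below is a minimal sketch (Python/NumPy; the measure $\pi$ on two sites with two states is an arbitrary positive example, and only singleton blocks, i.e., the Glauber case, are used).

```python
import numpy as np
from itertools import product

n, q = 2, 2
states = list(product(range(q), repeat=n))        # Omega = [q]^n
pi = np.array([0.4, 0.1, 0.2, 0.3])               # arbitrary, indexed like `states`

def cond(eta, sigma, A):
    """pi(eta | sigma_A): condition on agreeing with sigma on the set A (Eq. 42)."""
    if any(eta[i] != sigma[i] for i in A):
        return 0.0
    Z = sum(pi[j] for j, tau in enumerate(states) if all(tau[i] == sigma[i] for i in A))
    return pi[states.index(eta)] / Z

blocks = [(0,), (1,)]                             # Glauber: alpha_A = 1/n on singletons
K = np.zeros((len(states), len(blocks) * len(states)))
for i, sigma in enumerate(states):
    for b, A in enumerate(blocks):
        Ac = tuple(j for j in range(n) if j not in A)
        for k, eta in enumerate(states):
            K[i, b * len(states) + k] = cond(eta, sigma, Ac) / n    # Eq. 44

piK = pi @ K
Kstar = (K * pi[:, None]).T / piK[:, None]        # reverse channel (Eq. 3)

# Transition matrix of Eq. 43 with the same singleton weights, for comparison.
P = np.array([[sum(cond(sp, s, tuple(j for j in range(n) if j not in A)) / n
               for A in blocks) for sp in states] for s in states])
print(np.allclose(K @ Kstar, P))                  # P = K K*_pi, per Eq. 45
```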

2.2 Non-uniqueness of factorization

In general, a factorizable pair $(\pi,P)$ can be factorized in more than one way, and the associated half-step contraction rates may differ considerably. This is illustrated in the following example.

Example 5 (Complete graph).

Let $n\geq 3$ be an integer. Let $\mathcal{X}=[n]$ and $\pi=\operatorname{Unif}(\mathcal{X})$. Let $P:\mathcal{X}\to\mathcal{X}$ be the lazy random walk on the complete graph. That is,

$$P(x,y)=\begin{cases}\frac{1}{2},&\text{if }x=y,\\ \frac{1}{2(n-1)},&\text{if }x\neq y.\end{cases} \qquad (48)$$

Given an integer $2\leq\ell\leq n$, let $\mathcal{Y}$ be the set of all subsets $A\subseteq[n]$ with either $|A|=1$ or $|A|=\ell$, and define $K(\ell):\mathcal{X}\to\mathcal{Y}$ as

$$K(\ell)(x,y)=\begin{cases}\frac{\ell-2}{2(\ell-1)},&\text{if }y=\{x\},\\ \frac{\ell}{2(\ell-1)\binom{n-1}{\ell-1}},&\text{if }|y|=\ell,\ x\in y,\\ 0,&\text{otherwise.}\end{cases} \qquad (52)$$

One can check that $K(\ell)_{\pi}^{*}(y,\cdot)=\operatorname{Unif}(y)$ and that $K(\ell)K(\ell)_{\pi}^{*}=P$. We also note that when $\ell=2$ this reduces to the construction in the proof of Lemma 3.
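This identity is easy to verify numerically for small parameters; a minimal sketch (Python/NumPy, with illustrative values $n=5$, $\ell=3$):

```python
import numpy as np
from itertools import combinations
from math import comb

n, ell = 5, 3                                       # small illustrative parameters
pi = np.ones(n) / n
subsets = [frozenset({x}) for x in range(n)] + \
          [frozenset(S) for S in combinations(range(n), ell)]

K = np.zeros((n, len(subsets)))                     # K(ell) from Eq. 52
for x in range(n):
    for j, S in enumerate(subsets):
        if S == {x}:
            K[x, j] = (ell - 2) / (2 * (ell - 1))
        elif len(S) == ell and x in S:
            K[x, j] = ell / (2 * (ell - 1) * comb(n - 1, ell - 1))

piK = pi @ K
Kstar = (K * pi[:, None]).T / piK[:, None]          # reverse channel (Eq. 3)
P = np.full((n, n), 1 / (2 * (n - 1)))              # target kernel from Eq. 48
np.fill_diagonal(P, 0.5)
print(np.allclose(K @ Kstar, P))                    # True: K(ell) K(ell)*_pi = P
```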

[BC24, Theorem 1.1] computes the half-step entropy contraction coefficient $\alpha(\pi,K(\ell))$ for every $\ell$, and shows that

$$\alpha(\pi,K(\ell))=\frac{\ell\log\ell}{2(\ell-1)\log n}, \qquad (53)$$

achieved at and only at point distributions. From Eq. 53 we see that $\alpha(\pi,K(\ell))$ increases with $\ell$ from the minimum value $\alpha(\pi,K(2))=\frac{\log 2}{\log n}$ to the maximum value $\alpha(\pi,K(n))=\frac{n}{2(n-1)}$. The former is of the same order of magnitude as the log-Sobolev constant $\rho(\pi,P)=\frac{n-2}{2(n-1)\log(n-1)}$ ([DSC96, Corollary A.5]), and the latter matches asymptotically the full-step contraction rate $\delta(\pi,P)=\frac{1}{2}\pm o(1)$ ([GP23, Eq. (131)]).

2.3 Characterizations of factorizable kernels

An interesting question is to characterize the set of factorizable kernels for a fixed $\pi$. Lemma 3 provides a sufficient condition, which is certainly not necessary. For example, any pair $(\pi,P)$ satisfying $P(x,\cdot)=\pi$ for all $x\in\mathcal{X}$ is factorizable (see Example 10). On the other hand, a necessary condition for factorizability is positive semidefiniteness.

Lemma 6 (Factorizable implies positive semidefinite).

Let $(\pi,P)$ be a reversible pair. If $P$ is factorizable, then the matrix $\operatorname{Diag}(\pi)P$ is positive semidefinite (PSD), where $\operatorname{Diag}(\pi)$ denotes the $\mathcal{X}\times\mathcal{X}$ diagonal matrix with diagonal $\pi$.

Proof.

Let $A=\operatorname{Diag}(\pi)P$, and let $K:\mathcal{X}\to\mathcal{Y}$ be a Markov kernel such that $P=KK_{\pi}^{*}$. WLOG assume that $\pi K$ has full support. Then

$$A_{x,y}=\pi(x)\sum_{z\in\mathcal{Y}}K(x,z)K_{\pi}^{*}(z,y)=\pi(x)\pi(y)\sum_{z\in\mathcal{Y}}\frac{1}{(\pi K)(z)}K(x,z)K(y,z). \qquad (54)$$

So $A=MM^{\top}$ where $M=\operatorname{Diag}(\pi)K\operatorname{Diag}(\pi K)^{-1/2}$. This finishes the proof. ∎

In particular, while a factorizable kernel is not necessarily lazy, when $\pi$ has full support $P$ must have strictly positive diagonal entries.

For a distribution $\pi$ on $\mathcal{X}$, let $\mathcal{F}_{\pi}$ denote the set of Markov kernels $P:\mathcal{X}\to\mathcal{X}$ such that $(\pi,P)$ is factorizable.

Lemma 7 ($\mathcal{F}_{\pi}$ is convex).

For any distribution $\pi$, the set $\mathcal{F}_{\pi}$ is convex.

Proof.

Let $P_{0},P_{1}\in\mathcal{F}_{\pi}$ and let $K_{0}:\mathcal{X}\to\mathcal{Y}_{0}$, $K_{1}:\mathcal{X}\to\mathcal{Y}_{1}$ be the corresponding factors. Let $t\in[0,1]$. We prove that $P_{t}:=(1-t)P_{0}+tP_{1}$ is in $\mathcal{F}_{\pi}$. Let $K_{t}:\mathcal{X}\to\mathcal{Y}_{0}\sqcup\mathcal{Y}_{1}$ be defined as

$$K_{t}(x,y)=\begin{cases}(1-t)K_{0}(x,y),&\text{if }y\in\mathcal{Y}_{0},\\ tK_{1}(x,y),&\text{if }y\in\mathcal{Y}_{1}.\end{cases} \qquad (57)$$

Then we can verify that $K_{t}$ is a Markov kernel, and

$$K_{t}(K_{t})_{\pi}^{*}=(1-t)K_{0}(K_{0})_{\pi}^{*}+tK_{1}(K_{1})_{\pi}^{*}=P_{t}. \qquad (58)$$ ∎

For a distribution $\pi$ on $\mathcal{X}$, let $\mathcal{P}_{\pi}$ denote the set of Markov kernels $P:\mathcal{X}\to\mathcal{X}$ such that $\operatorname{Diag}(\pi)P$ is PSD. By Lemmas 6 and 7, $\mathcal{F}_{\pi}$ is a convex subset of $\mathcal{P}_{\pi}$. When the state space is binary, the two sets are equal, but this is not true in general.

Lemma 8.

If $|\mathcal{X}|=2$, then for any distribution $\pi$ on $\mathcal{X}$, we have $\mathcal{F}_{\pi}=\mathcal{P}_{\pi}$.

Proof.

Direct calculation shows that $\mathcal{P}_{\pi}$ is the convex hull of $\{J\operatorname{Diag}(\pi),I\}$, where $J$ is the all-ones matrix. Both extreme points are in $\mathcal{F}_{\pi}$. ∎

Lemma 9.

For $n\geq 5$, $\mathcal{X}=[n]$, and $\pi=\operatorname{Unif}(\mathcal{X})$, the set $\mathcal{F}_{\pi}$ is strictly smaller than $\mathcal{P}_{\pi}$.

Proof.

An $n\times n$ matrix $A$ is called completely positive if it can be written as $A=MM^{\top}$ for some (not necessarily square) matrix $M$ with non-negative entries. Clearly, all completely positive matrices are PSD. It is known ([MM62, BSM03]) that for $n\leq 4$, a PSD matrix is completely positive if and only if all its entries are non-negative, while for $n\geq 5$, there exist PSD matrices with strictly positive entries that are not completely positive.

Fix $n\geq 5$, $\mathcal{X}=[n]$, and $\pi=\operatorname{Unif}(\mathcal{X})$. Then $\mathcal{P}_{\pi}$ is exactly the set of doubly stochastic PSD matrices. Let $A$ be an $n\times n$ PSD matrix with strictly positive entries that is not completely positive. By Sinkhorn's theorem ([MN61, Sin64]), there is a diagonal matrix $D$ with strictly positive entries such that $DAD$ is doubly stochastic. Let $P=DAD$. Clearly $P$ is in $\mathcal{P}_{\pi}$. We claim that $P$ is not in $\mathcal{F}_{\pi}$. Suppose for the sake of contradiction that $P=KK_{\pi}^{*}$ for some $K$. Then

$$A=D^{-1}PD^{-1}=\frac{1}{n}D^{-1}K\operatorname{Diag}(\pi K)^{-1}K^{\top}D^{-1}=MM^{\top}, \qquad (59)$$

where $M=\frac{1}{\sqrt{n}}D^{-1}K\operatorname{Diag}(\pi K)^{-1/2}$ is non-negative. This contradicts the assumption that $A$ is not completely positive. ∎
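The diagonal scaling from Sinkhorn's theorem used in this proof is easy to compute in practice. Below is a minimal sketch (Python/NumPy; the matrix is an arbitrary symmetric positive example, not the completely-positive counterexample itself, and the basic fixed-point iteration has no convergence safeguards).

```python
import numpy as np

def symmetric_sinkhorn(A, iters=500):
    """Return a positive vector d so that diag(d) A diag(d) is doubly stochastic,
    for a symmetric matrix A with strictly positive entries (Sinkhorn's theorem)."""
    n = A.shape[0]
    r, c = np.ones(n), np.ones(n)
    for _ in range(iters):
        r = 1.0 / (A @ c)          # fix row sums of diag(r) A diag(c)
        c = 1.0 / (A.T @ r)        # fix column sums
    return np.sqrt(r * c)          # symmetrize the two scalings

A = np.array([[2.0, 0.5, 0.3],     # arbitrary symmetric matrix with positive entries
              [0.5, 1.0, 0.4],
              [0.3, 0.4, 1.5]])
d = symmetric_sinkhorn(A)
P = d[:, None] * A * d[None, :]
print(P.sum(axis=0), P.sum(axis=1))   # both close to all-ones: P is doubly stochastic
```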

It remains an interesting open problem to characterize $\mathcal{F}_{\pi}$, even for uniform $\pi$.

3 Comparison between constants

In this section we compare the constants in Table 1, showing that there is a superconstant separation between any two of them. Table 3 summarizes the examples in this section.

Example | Description | Separation
Example 10 | One-step chain | Log-Sobolev Constant $\rho(\pi,P)$ vs Half-Step Entropy Contraction $\alpha(\pi,K)$
Example 11 | $1$-to-$k$ chain | Half-Step Entropy Contraction $\alpha(\pi,K)$ vs Full-Step Entropy Contraction $\delta(\pi,P)$
Example 13 | Bernoulli-Laplace model | Half-Step Entropy Contraction $\alpha(\pi,K)$ vs Full-Step Entropy Contraction $\delta(\pi,P)$
Example 16 | Three-state chain | One-Step Entropy Contraction $\delta(\pi,P)$ vs Two-Step Entropy Contraction $\delta\left(\pi,P^{2}\right)$
Example 18 | Birth-death chain | $m$-Step Entropy Contraction $\delta\left(\pi,P^{m}\right)$ vs $(m+1)$-Step Entropy Contraction $\delta\left(\pi,P^{m+1}\right)$
Example 20 | Three-state chain | Full-Step Entropy Contraction $\delta(\pi,P)$ vs Modified Log-Sobolev Constant $\rho_{0}(\pi,P)$
Example 23 | Expander graph | Modified Log-Sobolev Constant $\rho_{0}(\pi,P)$ vs Poincaré Constant $\lambda(\pi,P)$
Example 24 | Random transposition model | Half-Step Entropy Contraction $\alpha(\pi,K)$ vs Modified Log-Sobolev Constant $\rho_{0}(\pi,P)$
Table 3: Examples in Section 3 and the separations they witness.

3.1 Log-Sobolev constant ρ\rho vs half-step entropy contraction α\alpha

By [Mic97], we always have $\rho(\pi,P)\leq\alpha(\pi,K)$ for $P=KK_{\pi}^{*}$. The following example shows that the gap can be arbitrarily large.

Example 10 (A one-step Markov chain).

Let $\mathcal{X}$ be a finite set and let $\pi$ be a distribution on $\mathcal{X}$ with full support. Let $\mathcal{Y}=\{*\}$ and let $K:\mathcal{X}\to\mathcal{Y}$ be the unique Markov kernel from $\mathcal{X}$ to $\mathcal{Y}$. Then $P=KK_{\pi}^{*}$ satisfies $P(x,y)=\pi(y)$ for all $x,y\in\mathcal{X}$. By [DSC96, Theorem A.1],

$$\rho(\pi,P)=\frac{1-2\pi_{*}}{\log(1/\pi_{*}-1)}, \qquad (60)$$

where $\pi_{*}=\min_{x\in\mathcal{X}}\pi(x)$. On the other hand, $\alpha(\pi,K)=1-\eta_{\operatorname{KL}}(\pi,K)=1$. As $\pi_{*}\to 0$, we have $\frac{\alpha(\pi,K)}{\rho(\pi,P)}\to\infty$.

3.2 Half-step entropy contraction α\alpha vs full-step entropy contraction δ\delta

By the data processing inequality, we always have $\alpha(\pi,K)\leq\delta(\pi,P)$ for $P=KK_{\pi}^{*}$. Example 5 with $\ell=2$ already shows that the gap can be arbitrarily large. Below we present a few different examples.

Example 11 (A $1$-to-$k$ Markov chain).

Let $\mathcal{X}=[n]$, $\mathcal{Y}=[n]^{k}$, $K(x,y)=\frac{1}{kn^{k-1}}\sum_{j\in[k]}\mathbbm{1}\{y_{j}=x\}$, and $\pi=\operatorname{Unif}(\mathcal{X})$. In other words, on input $x\in\mathcal{X}$, this chain generates a uniform length-$k$ output string and then overwrites one of the $k$ positions, chosen uniformly at random, with $x$. Then $K_{\pi}^{*}:\mathcal{Y}\to\mathcal{X}$ satisfies $K_{\pi}^{*}(y,x)=\frac{1}{k}\sum_{j\in[k]}\mathbbm{1}\{y_{j}=x\}$. Let $P=KK_{\pi}^{*}$. The motivation for this chain comes from the analysis of the Glauber dynamics on $\operatorname{Unif}\left(\mathcal{X}^{k}\right)$. The Markov kernel $K_{\pi}^{*}$ is the $k$-to-$1$ walk, and its entropy contraction is called entropic independence in [AJK+22]. Proposition 12 shows that for constant $k\geq 2$, as $n\to\infty$, we have $\frac{\delta(\pi,P)}{\alpha(\pi,K)}\to\infty$.

Proposition 12 (A $1$-to-$k$ Markov chain).

Let $\pi,K,P$ be as in Example 11. Then we have $\alpha(\pi,K)=O\left(\frac{1}{\log n}\right)$ and $\delta(\pi,P)\geq 1-\frac{1}{k}$. In particular, for fixed $k\geq 2$, $\frac{\delta(\pi,P)}{\alpha(\pi,K)}=\Omega(\log n)$.

Proof.

Upper bound on $\alpha(\pi,K)$. Let $\nu$ be the point distribution at $1\in\mathcal{X}$. Then $D(\nu\|\pi)=\log n$. Consider the distribution $\nu K$. For $y\in\mathcal{Y}$, if $y$ contains $i$ copies of $1$, then $(\nu K)(y)=\frac{i}{kn^{k-1}}$. So

$$D(\nu K\|\pi K)=\sum_{1\leq i\leq k}\binom{k}{i}(n-1)^{k-i}\cdot\frac{i}{kn^{k-1}}\log\frac{ni}{k}=\log n-\sum_{1\leq i\leq k}\binom{k-1}{i-1}\frac{(n-1)^{k-i}}{n^{k-1}}\log\frac{k}{i}. \qquad (61)$$

For constant $k\geq 2$, as $n\to\infty$, we have $D(\nu K\|\pi K)=\log n-\Theta(1)$. Therefore

$$\alpha(\pi,K)\leq 1-\frac{D(\nu K\|\pi K)}{D(\nu\|\pi)}=\Theta\left(\frac{1}{\log n}\right). \qquad (62)$$

Lower bound on $\delta(\pi,P)$. Let $M:\mathcal{Y}\to[k]\times[n]$ be the Markov kernel defined as $M(y,\cdot)=\operatorname{Unif}(\{(i,y_{i}):i\in[k]\})$. By the data processing inequality,

$$1-\delta(\pi,P)=\eta_{\operatorname{KL}}(\pi,P)\leq\eta_{\operatorname{KL}}(\pi K,K_{\pi}^{*})\leq\eta_{\operatorname{KL}}(\pi K,M). \qquad (63)$$

Let $\nu$ be any distribution on $\mathcal{Y}$, and let $\nu_{i}$ ($i\in[k]$) be the $i$-th marginal of $\nu$. Then

$$D(\nu\|\pi K)=D\left(\nu\|\pi^{\times k}\right)=D(\nu\|\nu_{1}\times\cdots\times\nu_{k})+\sum_{i\in[k]}D(\nu_{i}\|\pi)\geq\sum_{i\in[k]}D(\nu_{i}\|\pi). \qquad (64)$$

On the other hand,

$$D(\nu M\|\pi KM)=D(\nu M\|\operatorname{Unif}([k])\times\pi)=\frac{1}{k}\sum_{i\in[k]}D(\nu_{i}\|\pi). \qquad (65)$$

Therefore

$$\eta_{\operatorname{KL}}(\pi K,M)\leq\frac{1}{k}. \qquad (66)$$

So

$$\delta(\pi,P)\geq 1-\frac{1}{k}. \qquad (67)$$ ∎

Example 13 (Bernoulli-Laplace model).

Let $n$ be a positive integer and $1\leq k\leq n-1$. We define a graph $G=(V,E)$. Let $V=\binom{[n]}{k}$ (i.e., the size-$k$ subsets of $[n]$). Equivalently, $V$ is the set of length-$n$ bit strings with Hamming weight $k$. There is an edge $(x,y)$ for $x,y\in V$ if and only if $\|x-y\|_{1}=2$ (considered as elements of $\{0,1\}^{n}$). The Bernoulli-Laplace model is the lazy random walk on $G$ (Definition 4). That is, $\mathcal{X}=V$, $\pi=\operatorname{Unif}(\mathcal{X})$, $\mathcal{Y}=E$, and $K:\mathcal{X}\to\mathcal{Y}$ is defined as $K(x,e)=\frac{1}{k(n-k)}\mathbbm{1}\{x\in e\}$. For $(ij)\in\binom{[n]}{2}$, define the map $\sigma_{ij}:\mathcal{X}\to\mathcal{X}$ by swapping the $i$-th and $j$-th coordinates (under the bit-string interpretation). Then

$$P(x,\cdot)=\frac{1}{2}\mathbbm{1}_{x}+\frac{1}{2k(n-k)}\sum_{\substack{i\in x\\ j\in[n]\backslash x}}\mathbbm{1}_{\sigma_{ij}(x)}. \qquad (68)$$

Proposition 14 shows that for constant $k\geq 1$, as $n\to\infty$, we have $\frac{\delta(\pi,P)}{\alpha(\pi,K)}\to\infty$.

Proposition 14 (Bernoulli-Laplace model).

Let $\pi,K,P$ be as in Example 13 with $1\leq k\leq n-1$. Then $\alpha(\pi,K)=O\left(\frac{1}{\log\binom{n}{k}}\right)$ and $\delta(\pi,P)\geq\frac{n}{2k(n-k)}$. In particular, for constant $k\geq 1$, we have $\frac{\delta(\pi,P)}{\alpha(\pi,K)}=\Omega(\log n)$.

Proof.

Upper bound on $\alpha(\pi,K)$. Let $\nu$ be the point distribution at any $x\in\mathcal{X}$. Then $D(\nu\|\pi)=\log|\mathcal{X}|=\log\binom{n}{k}$ and

$$\nu K=\frac{1}{k(n-k)}\sum_{\substack{i\in x\\ j\in[n]\backslash x}}\mathbbm{1}_{\{x,\sigma_{ij}(x)\}}. \qquad (69)$$

Because $\pi K=\operatorname{Unif}(\mathcal{Y})$,

$$D(\nu K\|\pi K)=\log|\mathcal{Y}|-\log(k(n-k))=\log\binom{n}{k}-\log 2. \qquad (70)$$

Therefore

$$\alpha(\pi,K)\leq 1-\frac{D(\nu K\|\pi K)}{D(\nu\|\pi)}=\frac{\log 2}{\log\binom{n}{k}}. \qquad (71)$$

Lower bound on δ(π,P)\delta(\pi,P). We make use of the following useful result from [CMS24].

Lemma 15 ([CMS24, Theorem 1]).

Let $d$ be a metric on $\mathcal{X}$, and let $W_{p}$ denote the Wasserstein $p$-distance on $\mathcal{P}(\mathcal{X})$. If

$$W_{\infty}(P(x,\cdot),P(y,\cdot))\leq d(x,y),\qquad\forall x,y\in\mathcal{X}, \qquad (72)$$
$$W_{1}(P(x,\cdot),P(y,\cdot))\leq(1-\kappa)d(x,y),\qquad\forall x,y\in\mathcal{X}, \qquad (73)$$

then

$$\delta(\pi,P)\geq\kappa. \qquad (74)$$

For the Bernoulli-Laplace model, we let $d$ be the graph distance on $V=\binom{[n]}{k}$. To prove Eqs. 72 and 73, it suffices to prove the result for adjacent $x$ and $y$. By symmetry, WLOG assume that $x=\{1,3,4,\ldots,k+1\}$ and $y=\{2,3,\ldots,k+1\}$. We define a coupling between $P(x,\cdot)$ and $P(y,\cdot)$ such that Eqs. 72 and 73 are both satisfied.

1. For $3\leq i\leq k+1$, $k+2\leq j\leq n$, couple $\sigma_{ij}(x)$ with $\sigma_{ij}(y)$. This happens with probability $\frac{(k-1)(n-k-1)}{2k(n-k)}$ and incurs distance $1$.

2. For $k+2\leq j\leq n$, couple $\sigma_{1j}(x)$ with $\sigma_{2j}(y)$. This happens with probability $\frac{n-k-1}{2k(n-k)}$ and incurs distance $0$.

3. For $3\leq i\leq k+1$, couple $\sigma_{2i}(x)$ with $\sigma_{1i}(y)$. This happens with probability $\frac{k-1}{2k(n-k)}$ and incurs distance $0$.

4. Couple $\sigma_{12}(x)$ with $y$, and $x$ with $\sigma_{12}(y)$, each with weight $\frac{1}{2k(n-k)}$. This happens with probability $\frac{1}{k(n-k)}$ and incurs distance $0$.

5. At this point, all remaining mass in $P(x,\cdot)$ (resp. $P(y,\cdot)$) is at $x$ (resp. $y$). Couple them directly. This happens with probability $\frac{1}{2}-\frac{1}{2k(n-k)}$ and incurs distance $1$.

To summarize, the coupling moves distance at most one and has expected distance $1-\frac{n}{2k(n-k)}$. By Lemma 15, we have

$$\delta(\pi,P)\geq\frac{n}{2k(n-k)}. \qquad (75)$$ ∎

We remark that for the Bernoulli-Laplace model, the exact value of $\alpha(\pi,K)$ has been determined in [BC24, Theorem 1.12], where it is shown that Eq. 71 is tight. For $\delta(\pi,P)$, by considering a point distribution at any $x\in\mathcal{X}$, we have

$$\delta(\pi,P)\leq\frac{\log(2n(n-k))}{2\log\binom{n}{k}}. \qquad (76)$$

Therefore, for $n,k$ satisfying $\log k=(1-\Omega(1))\log n$, we have $\delta(\pi,P)=\Theta\left(\frac{1}{k}\right)$.

Examples 11 and 13 show that there is a separation between $\alpha(\pi,K)$ and $\delta(\pi,P)$ for $P=KK_{\pi}^{*}$. In these examples, $K_{\pi}^{*}$ can be quite different from $K$. One natural question is whether there is a separation between $\delta(\pi,P)$ and $\delta\left(\pi,P^{2}\right)$. That is, can running the same Markov kernel twice result in much better contraction than running it only once? The following example shows that indeed such a separation exists. This example is adapted from [Mün23]'s counterexample to the Peres-Tetali conjecture. We note, however, that the properties of this chain used here and in Example 20 are different from those used op. cit.

Example 16 (A three-state Markov chain).

Let $M$ be a positive real number. Let $\mathcal{X}=[3]$ and $\pi=\left(\frac{M}{M+2},\frac{1}{M+2},\frac{1}{M+2}\right)$. Let $P:\mathcal{X}\to\mathcal{X}$ be defined as

$$P=\begin{pmatrix}1-\frac{1}{4M}&\frac{1}{4M}&0\\ \frac{1}{4}&\frac{1}{2}&\frac{1}{4}\\ 0&\frac{1}{4}&\frac{3}{4}\end{pmatrix}. \qquad (77)$$

It is easy to see that $\pi$ is the invariant distribution of $P$ and that $(\pi,P)$ is reversible. Note that by Lemma 3, $P$ is factorizable. By Proposition 17, $\frac{\delta\left(\pi,P^{2}\right)}{\delta(\pi,P)}\to\infty$ as $M\to\infty$.
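A quick numerical check of this example (Python/NumPy, with an arbitrary large value of $M$): it verifies reversibility, evaluates the point-mass upper bound on $\delta(\pi,P)$ used in Proposition 17, and the Dobrushin lower bound on $\delta\left(\pi,P^{2}\right)$.

```python
import numpy as np

M = 1e6
pi = np.array([M, 1.0, 1.0]) / (M + 2)
P = np.array([[1 - 1/(4*M), 1/(4*M), 0   ],
              [0.25,        0.5,     0.25],
              [0,            0.25,   0.75]])

print(np.allclose(pi[:, None] * P, (pi[:, None] * P).T))   # reversibility

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

nu = np.array([0.0, 0.0, 1.0])                    # point mass at state 3
print(1 - kl(nu @ P, pi) / kl(nu, pi))            # O(1/log M) upper bound on delta(pi, P)

P2 = P @ P
eta_tv = max(0.5 * np.abs(P2[x] - P2[y]).sum() for x in range(3) for y in range(3))
print(1 - eta_tv)                                 # Omega(1) lower bound on delta(pi, P^2)
```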

Proposition 17 (A three-state Markov chain).

Let $\pi,P$ be as in Example 16. Then $\delta(\pi,P)=O\left(\frac{1}{\log M}\right)$ and $\delta\left(\pi,P^{2}\right)=\Omega(1)$. In particular, $\frac{\delta\left(\pi,P^{2}\right)}{\delta(\pi,P)}=\Omega(\log M)$.

Proof.

Upper bound on $\delta(\pi,P)$. Let $\nu$ be the point distribution at $3\in\mathcal{X}$. Then $D(\nu\|\pi)=\log(M+2)$ and

$$D(\nu P\|\pi)=\frac{1}{4}\log\frac{M+2}{4}+\frac{3}{4}\log\frac{3(M+2)}{4}=\log(M+2)-h(1/4), \qquad (78)$$

where $h:[0,1]\to[0,\log 2]$ is the binary entropy function

$$h(x)=-x\log x-(1-x)\log(1-x). \qquad (79)$$

Therefore

$$\delta(\pi,P)\leq 1-\frac{D(\nu P\|\pi)}{D(\nu\|\pi)}=O\left(\frac{1}{\log M}\right). \qquad (80)$$

Lower bound on $\delta\left(\pi,P^{2}\right)$. We prove that for large $M$, we have $P^{2}(x,1)=\Omega(1)$ for all $x\in\mathcal{X}$. In fact, $P^{2}(1,1)\geq P(1,1)^{2}=\Omega(1)$, $P^{2}(2,1)\geq P(2,1)P(1,1)=\Omega(1)$, and $P^{2}(3,1)\geq P(3,2)P(2,1)=\Omega(1)$. So $\operatorname{TV}(P^{2}(x,\cdot),P^{2}(x^{\prime},\cdot))=1-\Omega(1)$ for all $x,x^{\prime}\in\mathcal{X}$. By Eq. 18, Dobrushin's coefficient satisfies $\eta_{\operatorname{TV}}(\pi,P^{2})=1-\Omega(1)$. By Eq. 17, we have

$$\delta\left(\pi,P^{2}\right)\geq 1-\eta_{\operatorname{TV}}(\pi,P^{2})=\Omega(1). \qquad (81)$$ ∎

We generalize Example 16 as follows, showing that for any positive integer $m$, there exists a Markov kernel such that running $m+1$ steps results in much better entropy contraction than running $m$ steps. We note that there is a relatively simple characterization of the LSI for birth-death chains ([Che05, Che03]), but for the MLSI or SDPI no such characterization is known, except for partial progress in [Rob01, CDPP09].

Example 18 (A birth-death Markov chain).

We fix a positive integer $m$ and let $M$ be a large positive real number. Let $\mathcal{X}=[m+2]$ and $\pi(x)=\frac{1}{M+m+1}+\mathbbm{1}\{x=1\}\frac{M-1}{M+m+1}$. Let $P:\mathcal{X}\to\mathcal{X}$ be the birth-death Markov chain with $P(x,y)=0$ for $|x-y|\geq 2$, $P(x,x-1)=\frac{1}{4}$ for $2\leq x\leq m+2$, $P(x,x+1)=\frac{1}{4}$ for $2\leq x\leq m+1$, $P(1,2)=\frac{1}{4M}$, and $P(x,x)=1-P(x,x-1)-P(x,x+1)$. It is easy to verify that $(\pi,P)$ is a reversible pair and that $P(x,x)\geq\frac{1}{2}$ for all $x\in[m+2]$; by Lemma 3, $(\pi,P)$ is factorizable. Proposition 19 shows that as $M\to\infty$, we have $\frac{\delta\left(\pi,P^{m+1}\right)}{\delta\left(\pi,P^{m}\right)}\to\infty$.
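A small numerical illustration of this example (Python/NumPy, with specific small values of $m$ and $M$ chosen only for demonstration): it builds $P$, confirms reversibility and laziness, and evaluates the two bounds appearing in Proposition 19.

```python
import numpy as np

m, M = 3, 1e5
n = m + 2
pi = np.full(n, 1.0); pi[0] = M; pi /= pi.sum()      # state 1 carries mass M/(M+m+1)

P = np.zeros((n, n))
for x in range(1, n):                                 # down moves from states 2..m+2
    P[x, x - 1] = 0.25
for x in range(1, n - 1):                             # up moves from states 2..m+1
    P[x, x + 1] = 0.25
P[0, 1] = 1 / (4 * M)
P += np.diag(1 - P.sum(axis=1))                       # fill the lazy diagonal

print(np.allclose(pi[:, None] * P, (pi[:, None] * P).T))   # reversibility
print((np.diag(P) >= 0.5).all())                           # lazy, hence factorizable

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

nu = np.zeros(n); nu[-1] = 1.0                        # point mass at state m+2
Pm = np.linalg.matrix_power(P, m)
print(1 - kl(nu @ Pm, pi) / kl(nu, pi))               # O(1/log M) bound on delta(pi, P^m)

Pm1 = np.linalg.matrix_power(P, m + 1)
eta_tv = max(0.5 * np.abs(Pm1[x] - Pm1[y]).sum() for x in range(n) for y in range(n))
print(1 - eta_tv)                                     # Omega(1) bound on delta(pi, P^{m+1})
```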

Proposition 19 (A birth-death Markov chain).

Let $\pi,P$ be as in Example 18. Then $\delta\left(\pi,P^{m}\right)=O\left(\frac{1}{\log M}\right)$ and $\delta\left(\pi,P^{m+1}\right)=\Omega(1)$. In particular, $\frac{\delta\left(\pi,P^{m+1}\right)}{\delta\left(\pi,P^{m}\right)}=\Omega(\log M)$.

Proof.

Upper bound on $\delta\left(\pi,P^{m}\right)$. Let $\nu$ be the point distribution at $m+2\in\mathcal{X}$. Then $D(\nu\|\pi)=\log(M+m+1)$. Note that $(\nu P^{m})(1)=0$ and $(\nu P^{m})(x)=c_{x}$ with $c_{x}=\Theta_{m}(1)$ for $2\leq x\leq m+2$, where $\Theta_{m}$ hides a constant factor depending on $m$. Furthermore, $(c_{2},\ldots,c_{m+2})$ is a distribution on $\{2,\ldots,m+2\}$. Then

$$D(\nu P^{m}\|\pi)=\log(M+m+1)-H(c_{2},\ldots,c_{m+2}), \qquad (82)$$

where $H$ is the entropy function

$$H(c_{2},\ldots,c_{m+2})=-\sum_{2\leq i\leq m+2}c_{i}\log c_{i}. \qquad (83)$$

Because $c_{i}=\Theta_{m}(1)$ for all $2\leq i\leq m+2$, we have $H(c_{2},\ldots,c_{m+2})=\Theta_{m}(1)$. Therefore

$$\delta\left(\pi,P^{m}\right)\leq 1-\frac{D(\nu P^{m}\|\pi)}{D(\nu\|\pi)}=\Theta_{m}\left(\frac{1}{\log M}\right). \qquad (84)$$

Lower bound on $\delta\left(\pi,P^{m+1}\right)$. Note that for any $x\in[m+2]$, we have

$$P^{m+1}(x,1)\geq P(x,x-1)\cdots P(2,1)\cdot P(1,1)^{m+2-x}=\Omega_{m}(1), \qquad (85)$$

where $\Omega_{m}$ hides a constant factor depending on $m$. By Eqs. 18 and 17,

$$\delta\left(\pi,P^{m+1}\right)\geq 1-\eta_{\operatorname{TV}}(\pi,P^{m+1})=\Omega_{m}(1). \qquad (86)$$ ∎

If a Markov kernel $(\pi,P)$ separates $\delta\left(\pi,P^{m}\right)$ and $\delta\left(\pi,P^{m+1}\right)$, then it also separates $\delta(\pi,T_{t})$ (where $T_{t}=e^{t(P-\mathrm{Id})}$) and $\delta\left(\pi,P^{m}\right)$ for finite $t$. In other words, the continuous-time chain contracts entropy at finite time better than its discrete-time counterpart. To see this, note that for any function $f$ on $\mathcal{X}$, we have

$$\begin{aligned}\operatorname{Ent}_{\pi}(T_{t}f)&\leq\mathbb{E}_{n\sim\operatorname{Pois}(t)}\operatorname{Ent}_{\pi}\left(P^{n}f\right)\\&\leq\mathbb{E}_{n\sim\operatorname{Pois}(t)}\left[\mathbbm{1}\{n\leq m\}\operatorname{Ent}_{\pi}(f)+\mathbbm{1}\{n\geq m+1\}\operatorname{Ent}_{\pi}\left(P^{m+1}f\right)\right]\\&\leq\mathbb{P}[\operatorname{Pois}(t)\leq m]\operatorname{Ent}_{\pi}(f)+\mathbb{P}[\operatorname{Pois}(t)\geq m+1]\left(1-\delta\left(\pi,P^{m+1}\right)\right)\operatorname{Ent}_{\pi}(f)\\&\leq\operatorname{Ent}_{\pi}(f)\left(1-\mathbb{P}[\operatorname{Pois}(t)\geq m+1]\delta\left(\pi,P^{m+1}\right)\right),\end{aligned} \qquad (87)$$

where the first step is by convexity of $\operatorname{Ent}_{\pi}$ and the second step is by the data processing inequality. Therefore

$$\delta(\pi,T_{t})\geq\mathbb{P}[\operatorname{Pois}(t)\geq m+1]\,\delta\left(\pi,P^{m+1}\right), \qquad (88)$$

which is separated from $\delta\left(\pi,P^{m}\right)$ for finite $t$, assuming that $\delta\left(\pi,P^{m+1}\right)$ is separated from $\delta\left(\pi,P^{m}\right)$.

3.3 Full-step entropy contraction δ\delta vs modified log-Sobolev constant ρ0\rho_{0}

By [BCP+21, Lemma 2.7], we always have δ(π,P)ρ0(π,P)\delta(\pi,P)\leq\rho_{0}(\pi,P) for any reversible (π,P)(\pi,P). (They in fact prove a more general statement that works for non-reversible (π,P)(\pi,P).) If we allow non-factorizable (π,P)(\pi,P), then it is easy to give an example where the gap is infinite: take 𝒳=[2]\mathcal{X}=[2], π=Unif(𝒳)\pi=\operatorname{\mathrm{Unif}}(\mathcal{X}), and P=(0110)P=\begin{pmatrix}0&1\\ 1&0\end{pmatrix}. The following example shows that the gap can be arbitrarily large even if we restrict to factorizable (π,P)(\pi,P).

Example 20 (A three-state Markov chain).

This example is the same as Example 16. Proposition 21 shows that we have ρ0(π,P)δ(π,P)\frac{\rho_{0}(\pi,P)}{\delta(\pi,P)}\to\infty as MM\to\infty.

Proposition 21 (A three-state Markov chain).

Let (π,P)(\pi,P) be as in Example 16. Then δ(π,P)=O(1logM)\delta(\pi,P)=O\left(\frac{1}{\log M}\right) and ρ0(π,P)=Θ(loglogMlogM)\rho_{0}(\pi,P)=\Theta\left(\frac{\log\log M}{\log M}\right). In particular, ρ0(π,P)δ(π,P)=Ω(loglogM)\frac{\rho_{0}(\pi,P)}{\delta(\pi,P)}=\Omega(\log\log M).

Proof.

Upper bound on δ(π,P)\delta(\pi,P). We have established in Proposition 17 that δ(π,P)=O(1logM)\delta(\pi,P)=O\left(\frac{1}{\log M}\right).

Lower bound on ρ0(π,P)\rho_{0}(\pi,P). We will show that there exists a sufficiently large universal constant C>0C>0 such that, for any positive function f:[3]+f:[3]\to\mathbb{R}^{+}, it holds

Entπ(f)ClogMloglogMπ,P(f,logf).\displaystyle\operatorname{\mathrm{Ent}}_{\pi}(f)\leq\frac{C\log M}{\log\log M}\mathcal{E}_{\pi,P}(f,\log f). (89)

This implies ρ0(π,P)=Ω(loglogMlogM)\rho_{0}(\pi,P)=\Omega\left(\frac{\log\log M}{\log M}\right). By applying suitable scaling, we assume that f(1)=xf(1)=x, f(2)=1f(2)=1, and f(3)=yf(3)=y for some x,y>0x,y>0. Furthermore, we may assume without loss of generality that xβMx\geq\frac{\beta}{M} where β>0\beta>0 is some tiny absolute constant, due to [TY24, Lemma 2.1]. More specifically, [TY24, Lemma 2.1] shows that to establish a modified log-Sobolev inequality Eq. 89, it suffices to consider a restricted class of functions such that, among other restrictions, f(1)β𝔼πff(1)\geq\beta^{\prime}\mathbb{E}_{\pi}f for an absolute constant β>0\beta^{\prime}>0; such a restriction immediately implies

x=f(1)β(MxM+2+1M+2+yM+2)βM+2β3M.\displaystyle x=f(1)\geq\beta^{\prime}\left(\frac{Mx}{M+2}+\frac{1}{M+2}+\frac{y}{M+2}\right)\geq\frac{\beta^{\prime}}{M+2}\geq\frac{\beta^{\prime}}{3M}. (90)

Hence, we can safely assume xβMx\geq\frac{\beta}{M} where β=β/3\beta=\beta^{\prime}/3.

A direct calculation yields

(M+2)Entπ(f)=Mxlogx+ylogy(Mx+y+1)log(Mx+y+1M+2),\displaystyle(M+2)\cdot\operatorname{\mathrm{Ent}}_{\pi}(f)=Mx\log x+y\log y-(Mx+y+1)\log\left(\frac{Mx+y+1}{M+2}\right), (91)

and

4(M+2)π,P(f,logf)=(x1)logx+(y1)logy.\displaystyle 4(M+2)\cdot\mathcal{E}_{\pi,P}(f,\log f)=(x-1)\log x+(y-1)\log y. (92)

The following claim is helpful.

Claim 22.

We have

Entπ(f)4π,P(f,logf)(2xy1)+ylogy+(y+1)log(1x)(x1)logx+(y1)logy.\displaystyle\frac{\operatorname{\mathrm{Ent}}_{\pi}(f)}{4\,\mathcal{E}_{\pi,P}(f,\log f)}\leq\frac{(2x-y-1)+y\log y+(y+1)\log\left(\frac{1}{x}\right)}{(x-1)\log x+(y-1)\log y}. (93)
Proof.

We rewrite the entropy as

(M+2)Entπ(f)\displaystyle(M+2)\cdot\operatorname{\mathrm{Ent}}_{\pi}(f) =Mxlogx+ylogy(Mx+y+1)log(Mx+y+1M+2)\displaystyle=Mx\log x+y\log y-(Mx+y+1)\log\left(\frac{Mx+y+1}{M+2}\right) (94)
=(Mx+y+1)log((M+2)xMx+y+1)+ylogy+(y+1)log(1x).\displaystyle=(Mx+y+1)\log\left(\frac{(M+2)x}{Mx+y+1}\right)+y\log y+(y+1)\log\left(\frac{1}{x}\right).

Notice that the first term above can be controlled by

(Mx+y+1)log((M+2)xMx+y+1)\displaystyle(Mx+y+1)\log\left(\frac{(M+2)x}{Mx+y+1}\right) =(Mx+y+1)log(1+2xy1Mx+y+1)\displaystyle=(Mx+y+1)\log\left(1+\frac{2x-y-1}{Mx+y+1}\right) (95)
(Mx+y+1)2xy1Mx+y+1=2xy1.\displaystyle\leq(Mx+y+1)\cdot\frac{2x-y-1}{Mx+y+1}=2x-y-1.

The claim then follows. ∎

We consider two separate cases of (x,y)(x,y) to establish Eq. 89.

Case 1: (x,y)(12,32)×(12,32)(x,y)\notin(\frac{1}{2},\frac{3}{2})\times(\frac{1}{2},\frac{3}{2}). In this case, we have

(x1)logx+(y1)logy110.\displaystyle(x-1)\log x+(y-1)\log y\geq\frac{1}{10}. (96)

Since we have

2xy12x12(x1)logx+332((x1)logx+(y1)logy)\displaystyle 2x-y-1\leq 2x-1\leq 2(x-1)\log x+3\leq 32\left((x-1)\log x+(y-1)\log y\right) (97)

and also

ylogy2(y1)logy+112((x1)logx+(y1)logy),\displaystyle y\log y\leq 2(y-1)\log y+1\leq 12\left((x-1)\log x+(y-1)\log y\right), (98)

we deduce from 22 that

Entπ(f)4π,P(f,logf)44+(y+1)log(1x)(x1)logx+(y1)logy.\displaystyle\frac{\operatorname{\mathrm{Ent}}_{\pi}(f)}{4\,\mathcal{E}_{\pi,P}(f,\log f)}\leq 44+\frac{(y+1)\log\left(\frac{1}{x}\right)}{(x-1)\log x+(y-1)\log y}. (99)

Consider three subcases.

  1. (i)

    If x32x\geq\frac{3}{2}, then log(1/x)<0\log(1/x)<0 and hence

    Entπ(f)4π,P(f,logf)44.\displaystyle\frac{\operatorname{\mathrm{Ent}}_{\pi}(f)}{4\,\mathcal{E}_{\pi,P}(f,\log f)}\leq 44. (100)
  2. (ii)

    If x12x\leq\frac{1}{2}, then consider how large yy is. If ylogMloglogMy\leq\frac{\log M}{\log\log M}, then by (x1)logx12log(1/x)(x-1)\log x\geq\frac{1}{2}\log(1/x) and (y1)logy0(y-1)\log y\geq 0 we deduce that

    (y+1)log(1x)(x1)logx+(y1)logy2(y+1)4logMloglogM.\displaystyle\frac{(y+1)\log\left(\frac{1}{x}\right)}{(x-1)\log x+(y-1)\log y}\leq 2(y+1)\leq\frac{4\log M}{\log\log M}. (101)

    If y>logMloglogMy>\frac{\log M}{\log\log M}, then by (x1)logx0(x-1)\log x\geq 0, y+1y13\frac{y+1}{y-1}\leq 3, and xβMx\geq\frac{\beta}{M} we deduce that

    (y+1)log(1x)(x1)logx+(y1)logy(y+1)log(1x)(y1)logy3log(Mβ)log(logMloglogM)C0logMloglogM,\displaystyle\frac{(y+1)\log\left(\frac{1}{x}\right)}{(x-1)\log x+(y-1)\log y}\leq\frac{(y+1)\log\left(\frac{1}{x}\right)}{(y-1)\log y}\leq\frac{3\log(\frac{M}{\beta})}{\log(\frac{\log M}{\log\log M})}\leq\frac{C_{0}\log M}{\log\log M}, (102)

    for some C0=C0(β)>0C_{0}=C_{0}(\beta)>0 when MM is sufficiently large.

  3. (iii)

    If x(12,32)x\in(\frac{1}{2},\frac{3}{2}), then by our assumption it must hold y(12,32)y\notin(\frac{1}{2},\frac{3}{2}). Since log(1/x)1\log(1/x)\leq 1 when x>12x>\frac{1}{2}, we have

    (y+1)log(1x)(x1)logx+(y1)logyy+1(y1)logy13,\displaystyle\frac{(y+1)\log\left(\frac{1}{x}\right)}{(x-1)\log x+(y-1)\log y}\leq\frac{y+1}{(y-1)\log y}\leq 13, (103)

    when y(12,32)y\notin(\frac{1}{2},\frac{3}{2}).

Therefore, in all three subcases we have

Entπ(f)4π,P(f,logf)ClogMloglogM\displaystyle\frac{\operatorname{\mathrm{Ent}}_{\pi}(f)}{4\,\mathcal{E}_{\pi,P}(f,\log f)}\leq\frac{C\log M}{\log\log M} (104)

where C=C(β)>0C=C(\beta)>0 is a constant, whenever MM is sufficiently large.

Case 2: (x,y)(12,32)×(12,32)(x,y)\in(\frac{1}{2},\frac{3}{2})\times(\frac{1}{2},\frac{3}{2}). In this case, we have

(x1)logx+(y1)logy12(x1)2+12(y1)2,\displaystyle(x-1)\log x+(y-1)\log y\geq\frac{1}{2}(x-1)^{2}+\frac{1}{2}(y-1)^{2}, (105)

and also

(2xy1)+ylogy+(y+1)log(1x)\displaystyle~{}(2x-y-1)+y\log y+(y+1)\log\left(\frac{1}{x}\right) (106)
\displaystyle\leq 2(x1)(y1)+y(y1)+(y+1)(1x1)\displaystyle~{}2(x-1)-(y-1)+y(y-1)+(y+1)\left(\frac{1}{x}-1\right)
=\displaystyle= 1x(x1)(2xy1)+(y1)2\displaystyle~{}\frac{1}{x}(x-1)\left(2x-y-1\right)+(y-1)^{2}
=\displaystyle= 2x(x1)21x(x1)(y1)+(y1)2\displaystyle~{}\frac{2}{x}(x-1)^{2}-\frac{1}{x}(x-1)(y-1)+(y-1)^{2}
\displaystyle\leq 4(x1)2+2|(x1)(y1)|+(y1)2\displaystyle~{}4(x-1)^{2}+2|(x-1)(y-1)|+(y-1)^{2}
\displaystyle\leq 5(x1)2+5(y1)2.\displaystyle~{}5(x-1)^{2}+5(y-1)^{2}.

By 22,

Entπ(f)4π,P(f,logf)10.\displaystyle\frac{\operatorname{\mathrm{Ent}}_{\pi}(f)}{4\,\mathcal{E}_{\pi,P}(f,\log f)}\leq 10. (107)

Combining the two cases, we conclude that

ρ0(π,P)=Ω(loglogMlogM).\displaystyle\rho_{0}(\pi,P)=\Omega\left(\frac{\log\log M}{\log M}\right). (108)

In fact, this lower bound on ρ0(π,P)\rho_{0}(\pi,P) is asymptotically tight: a matching upper bound is witnessed by, for example, f(1)=1/Mf(1)=1/M, f(2)=1f(2)=1, and f(3)=logMf(3)=\log M, as given in [Mün23]. Therefore,

ρ0(π,P)=Θ(loglogMlogM)\displaystyle\rho_{0}(\pi,P)=\Theta\left(\frac{\log\log M}{\log M}\right) (109)

as MM\to\infty.
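
As a quick numerical sanity check, one can evaluate Eqs. 91 and 92 directly at the witness f(1)=1/M, f(2)=1, f(3)=logM; the resulting ratio ℰπ,P(f,logf)/Entπ(f) upper-bounds ρ0(π,P) and is of order loglogM/logM up to constant factors. The Python sketch below is our own code and simply evaluates the two displayed formulas.

    import numpy as np

    def ent_and_dirichlet(M, x, y):
        """Ent_pi(f) and E_{pi,P}(f, log f) for f = (x, 1, y), via Eqs. (91)-(92)."""
        ent = (M * x * np.log(x) + y * np.log(y)
               - (M * x + y + 1) * np.log((M * x + y + 1) / (M + 2))) / (M + 2)
        dirichlet = ((x - 1) * np.log(x) + (y - 1) * np.log(y)) / (4 * (M + 2))
        return ent, dirichlet

    for M in [1e3, 1e6, 1e12]:
        ent, dir_form = ent_and_dirichlet(M, x=1.0 / M, y=np.log(M))
        # ratio upper-bounding rho_0, compared with log log M / log M
        print(M, dir_form / ent, np.log(np.log(M)) / np.log(M))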

3.4 Modified log-Sobolev constant ρ0\rho_{0} vs Poincaré constant λ\lambda

[BT06] shows that ρ0(π,P)2λ(π,P)\rho_{0}(\pi,P)\leq 2\lambda(\pi,P), and gives an example showing that the gap can be arbitrarily large. We record their example here.

Example 23 (Expander graphs).

Let G=(V,E)G=(V,E) be an expander graph with bounded degree. Consider the lazy random walk on GG (Definition 4). [BT06] shows that ρ0(π,P)=Θ(1log|V|)=Θ(λ(π,P)log|V|)\rho_{0}(\pi,P)=\Theta\left(\frac{1}{\log|V|}\right)=\Theta\left(\frac{\lambda(\pi,P)}{\log|V|}\right), where the second equality uses that λ(π,P)=Θ(1)\lambda(\pi,P)=\Theta(1) for bounded-degree expanders. Therefore as |V||V|\to\infty, we have λ(π,P)ρ0(π,P)\frac{\lambda(\pi,P)}{\rho_{0}(\pi,P)}\to\infty.

3.5 Other comparisons

[DSC96] shows that the spectral gap λ\lambda and the log-Sobolev constant ρ\rho differ by at most a factor of O(log(1/πmin))O(\log(1/\pi_{\min})) where

πmin=minx𝒳:π(x)>0π(x).\displaystyle\pi_{\min}=\min_{x\in\mathcal{X}:\,\pi(x)>0}\pi(x). (110)

More precisely, [DSC96, Corollary A.4] shows that, assuming πmin1/2\pi_{\min}\leq 1/2, it holds

λ2+log(1/πmin)(12πmin)λlog(1/πmin1)ρλ2.\displaystyle\frac{\lambda}{2+\log(1/\pi_{\min})}\leq\frac{(1-2\pi_{\min})\lambda}{\log(1/\pi_{\min}-1)}\leq\rho\leq\frac{\lambda}{2}. (111)

This in particular shows that all constants discussed in this paper, including also the half-step entropy contraction α\alpha, the full-step entropy contraction δ\delta, and the modified log-Sobolev constant ρ0\rho_{0}, differ by at most a factor of O(log(1/πmin))O(\log(1/\pi_{\min})) from each other.

In a recent work [STY23], Salez, Tikhomirov, and Youssef establish a surprising and remarkable comparison between the modified log-Sobolev constant ρ0\rho_{0} and the log-Sobolev constant ρ\rho. For a reversible Markov kernel PP with respect to a probability measure π\pi, define the sparsity parameter as

pmin=min(x,y)𝒳2:P(x,y)>0P(x,y).\displaystyle p_{\min}=\min_{(x,y)\in\mathcal{X}^{2}:\,P(x,y)>0}P(x,y). (112)

Then, [STY23, Theorem 1] shows that

ρ020log(1/pmin)ρρ04.\displaystyle\frac{\rho_{0}}{20\log(1/p_{\min})}\leq\rho\leq\frac{\rho_{0}}{4}. (113)

Hence, all entropy-related constants discussed in this paper, including also the half-step entropy contraction α\alpha and the full-step entropy contraction δ\delta, differ by at most a factor of O(log(1/pmin))O(\log(1/p_{\min})) from each other.

3.6 Comments on several previous works

[DMLM03, Prop. 5.1] claims that

cρ0α\displaystyle c\rho_{0}\leq\alpha (114)

for some universal constant c>0c>0. In fact, Eq. 114 fails for the random transposition model, showing that the claim must be incorrect.

Example 24 (Random transposition model).

Let nn be a positive integer. Let G=(V,E)G=(V,E) be the Cayley graph on the symmetric group SnS_{n} generated by transpositions. That is, VV is the set of permutations of [n][n], and there is an edge between σ,τV\sigma,\tau\in V if and only if they differ in exactly two entries, i.e., by a transposition. The random transposition model is the lazy random walk on GG (Definition 4). Considering the point measure at any x𝒳x\in\mathcal{X} gives αlog2log(n!)=Θ(1nlogn)\alpha\leq\frac{\log 2}{\log(n!)}=\Theta\left(\frac{1}{n\log n}\right). In fact, [BC24, Theorem 1.9] shows that this is tight, i.e., α=log2log(n!)\alpha=\frac{\log 2}{\log(n!)}. On the other hand, [GQ03] shows that ρ0=Θ(1n)\rho_{0}=\Theta\left(\frac{1}{n}\right). Therefore, as nn\to\infty, we have ρ0α\frac{\rho_{0}}{\alpha}\to\infty.

Our separations of α\alpha vs δ\delta (Section 3.2) and δ\delta vs ρ0\rho_{0} (Section 3.3) also provide alternative counterexamples to the claim. We now explain briefly the issue in the proof of Eq. 114 in [DMLM03]. In their proof of Prop. 5.1, the authors apply a technical result, Lemma 5.2, which represents the (relative) entropy of a function ff with expectation 𝔼n(f)=1\mathbb{E}_{n}(f)=1 (where nn denotes the underlying probability measure) as an integral of the covariance between ftf_{t} and log(ft)\log(f_{t}) where ft=etf+(1et)f_{t}=e^{-t}f+(1-e^{-t}), t0t\in\mathbb{R}_{\geq 0} represents an interpolation between ff and 11. However, in the actual application of Lemma 5.2, the measure nn is a conditional probability measure under which the expectation of ff is no longer 11, and hence the interpolation function ftf_{t} should be replaced by ft=etf+(1et)𝔼n(f)f_{t}=e^{-t}f+(1-e^{-t})\mathbb{E}_{n}(f); this would require the function ftf_{t} to depend on the conditioning and the proofs following afterwards no longer work.

[Rag16, Prop. 4.3] claims that

ρ01c(1α)\displaystyle\rho_{0}\leq 1-c(1-\alpha) (115)

for some universal constant c>0c>0. Our examples do not disprove the claim. However, the proof of [Rag16, Theorem 4.4] is a generalization of that of [DMLM03, Prop. 5.1], so it has the same error. In particular, the last display of the proof of [Rag16, Prop. 4.3] implies that cρ0αc\rho_{0}\leq\alpha for some constant c>0c>0, which we have shown to be incorrect. Therefore the proof of [Rag16, Prop. 4.3] is incorrect and does not establish Eq. 115. It is unclear whether Eq. 115 as stated is correct.

4 Extremal functions

In this section we discuss another difference between the continuous-time entropy contraction constants and the discrete-time entropy contraction constants. It is known ([BT06]) that for irreducible (π,P)(\pi,P), the log-Sobolev constant ρ(π,P)\rho(\pi,P) and the modified log-Sobolev constant ρ0(π,P)\rho_{0}(\pi,P) satisfy a dichotomy: they are either equal to twice the Poincaré constant λ(π,P)\lambda(\pi,P) or achieved at a full-support function. We show that this is no longer true for the discrete-time entropy contraction constants α(π,K)\alpha(\pi,K) and δ(π,P)\delta(\pi,P) by providing explicit examples whose extremal functions have non-full support.

4.1 Log-Sobolev constant ρ\rho, modified log-Sobolev constant ρ0\rho_{0}, Poincaré constant λ\lambda

[BT06] studies extremal functions for ρ\rho, ρ0\rho_{0}, λ\lambda. The extremal functions for the Poincaré constant λ(π,P)\lambda(\pi,P) are easy to describe. They are the (right) eigenfunctions of L-L corresponding to the eigenvalue λ\lambda. In particular, λ\lambda is always achieved at some non-constant function f:𝒳f:\mathcal{X}\to\mathbb{R}.
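
For concreteness, λ(π,P) and an extremal function can be computed numerically for any small reversible pair by diagonalizing the symmetrized kernel; the following Python sketch is our own helper (with λ taken to be the spectral gap of −L=Id−P, consistent with the discussion above), illustrated on a lazy walk on a cycle.

    import numpy as np

    def poincare_constant(pi, P):
        """Spectral gap of -L = Id - P and an extremal eigenfunction, for reversible (pi,P)."""
        d = np.sqrt(pi)
        S = (d[:, None] * P) / d[None, :]      # symmetric when (pi,P) is reversible
        eigval, eigvec = np.linalg.eigh(S)     # eigenvalues in ascending order
        lam = 1.0 - eigval[-2]                 # Poincare constant
        f = eigvec[:, -2] / d                  # right eigenfunction of -L with eigenvalue lam
        return lam, f

    # Example: lazy random walk on a 4-cycle; the gap equals 1/2.
    n = 4
    shift = np.roll(np.eye(n), 1, axis=1)
    P = 0.5 * np.eye(n) + 0.25 * (shift + shift.T)
    pi = np.full(n, 1.0 / n)
    print(poincare_constant(pi, P)[0])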

For the log-Sobolev constant ρ\rho, [BT06] shows that for any reversible (π,P)(\pi,P), either

  1. (i)

    ρ(π,P)=2λ(π,P)\rho(\pi,P)=2\lambda(\pi,P), or

  2. (ii)

    there exists a non-constant function f:𝒳0f:\mathcal{X}\to\mathbb{R}_{\geq 0} with π[f]=1\pi[f]=1 such that

    ρEntπ(f)=π,P(f,f).\displaystyle\rho\operatorname{\mathrm{Ent}}_{\pi}(f)=\mathcal{E}_{\pi,P}(\sqrt{f},\sqrt{f}). (116)

    Furthermore, such ff satisfies the equation

    Lf=ρflogf.\displaystyle-L\sqrt{f}=\rho\sqrt{f}\log f. (117)

    If (π,P)(\pi,P) is irreducible, then ff has full support.

For the modified log-Sobolev constant ρ0(π,P)\rho_{0}(\pi,P), [BT06] shows that for any reversible (π,P)(\pi,P), either

  1. (i)

    ρ0(π,P)=2λ(π,P)\rho_{0}(\pi,P)=2\lambda(\pi,P), or

  2. (ii)

    there exists a non-constant function f:𝒳0f:\mathcal{X}\to\mathbb{R}_{\geq 0} with π[f]=1\pi[f]=1 such that

    ρ0Entπ(f)=π,P(f,logf).\displaystyle\rho_{0}\operatorname{\mathrm{Ent}}_{\pi}(f)=\mathcal{E}_{\pi,P}(f,\log f). (118)

    Furthermore, such ff satisfies the equation

    LffL(logf)=ρ0flogf.\displaystyle-Lf-fL(\log f)=\rho_{0}f\log f. (119)

    If (π,P)(\pi,P) is irreducible, then ff has full support.
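
The dichotomy above also suggests a simple numerical recipe for small chains: estimate ρ0(π,P) by minimizing ℰπ,P(f,logf)/Entπ(f) over positive non-constant f, and compare the result with 2λ(π,P). The following Python sketch is our own code, not part of the paper; it uses the standard Dirichlet form ℰ(f,g)=½∑π(x)P(x,y)(f(x)−f(y))(g(x)−g(y)) for reversible (π,P) and a generic local optimizer, so it only returns an upper bound on ρ0.

    import numpy as np
    from scipy.optimize import minimize

    def entropy(pi, f):
        m = float(np.dot(pi, f))
        return float(np.dot(pi, f * np.log(f))) - m * np.log(m)

    def dirichlet(pi, P, f, g):
        df = f[:, None] - f[None, :]
        dg = g[:, None] - g[None, :]
        return 0.5 * float(np.sum(pi[:, None] * P * df * dg))

    def mlsi_ratio(u, pi, P):
        f = np.exp(u)                          # exponential parametrization keeps f > 0
        ent = entropy(pi, f)
        return np.inf if ent < 1e-12 else dirichlet(pi, P, f, np.log(f)) / ent

    def estimate_rho0(pi, P, trials=20, seed=0):
        rng = np.random.default_rng(seed)
        best = np.inf
        for _ in range(trials):
            res = minimize(mlsi_ratio, rng.normal(size=len(pi)), args=(pi, P),
                           method="Nelder-Mead")
            best = min(best, float(res.fun))
        return best

    # Example: the two-point swap chain from Section 3.3.
    pi = np.array([0.5, 0.5])
    P = np.array([[0.0, 1.0], [1.0, 0.0]])
    print(estimate_rho0(pi, P))

Since near-constant functions are excluded by the entropy threshold, the returned value should always be compared with 2λ(π,P) in view of the dichotomy; for the two-point swap chain in the example one expects a value close to 2λ(π,P)=4, corresponding to case (i).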

4.2 Half-step entropy contraction α\alpha and full-step entropy contraction δ\delta

In this section we study the extremal distributions for ηKL\eta_{\operatorname{\mathrm{KL}}}, which includes both α\alpha and δ\delta.

Lemma 25 (Extremal distributions for ηKL\eta_{\operatorname{\mathrm{KL}}}).

Let π\pi be a distribution on 𝒳\mathcal{X} and K:𝒳𝒴K:\mathcal{X}\to\mathcal{Y} be a Markov kernel. Then either

  1. (i)

    ηKL(π,K)=ηχ2(π,K)\eta_{\operatorname{\mathrm{KL}}}(\pi,K)=\eta_{\chi^{2}}(\pi,K), or

  2. (ii)

    there exists a distribution ν\nu on 𝒳\mathcal{X} such that

    ηKL(π,K)=D(νKπK)D(νπ).\displaystyle\eta_{\operatorname{\mathrm{KL}}}(\pi,K)=\frac{D(\nu K\|\pi K)}{D(\nu\|\pi)}. (120)
Proof.

WLOG assume that π\pi has full support. If Item (ii) does not happen, then there exists a sequence {fn}n\{f_{n}\}_{n} (fn:𝒳0f_{n}:\mathcal{X}\to\mathbb{R}_{\geq 0}, π[fn]=1\pi[f_{n}]=1) satisfying fn𝟙0\|f_{n}-\mathbbm{1}\|_{\infty}\to 0 and EntπK(Kπfn)Entπ(fn)ηKL(π,K)\frac{\operatorname{\mathrm{Ent}}_{\pi K}(K_{\pi}^{*}f_{n})}{\operatorname{\mathrm{Ent}}_{\pi}(f_{n})}\to\eta_{\operatorname{\mathrm{KL}}}(\pi,K) as nn\to\infty.

(Such a sequence exists because, by compactness of the probability simplex and continuity of the ratio in Item (ii) away from ν=π\nu=\pi, a maximizing sequence of distributions not converging to π\pi would have a limit point attaining the supremum.) Write fn=𝟙+ϵngnf_{n}=\mathbbm{1}+\epsilon_{n}g_{n}, where ϵn>0\epsilon_{n}>0, π[gn]=0\pi[g_{n}]=0, π[gn2]=1\pi[g_{n}^{2}]=1 (note fn𝟙f_{n}\neq\mathbbm{1} since Entπ(fn)>0\operatorname{\mathrm{Ent}}_{\pi}(f_{n})>0). Note that the space 𝒢:={g:𝒳:π[g]=0,π[g2]=1}\mathcal{G}:=\{g:\mathcal{X}\to\mathbb{R}:\pi[g]=0,\pi[g^{2}]=1\} is compact. By replacing the sequence {fn}n\{f_{n}\}_{n} with a subsequence, we can WLOG assume that there exists g𝒢g^{*}\in\mathcal{G} such that gng0\|g_{n}-g^{*}\|_{\infty}\to 0 as nn\to\infty.

Now let us prove that

limnEntπK(Kπfn)Entπ(fn)=VarπK(Kπg)Varπ(g).\displaystyle\lim_{n\to\infty}\frac{\operatorname{\mathrm{Ent}}_{\pi K}(K_{\pi}^{*}f_{n})}{\operatorname{\mathrm{Ent}}_{\pi}(f_{n})}=\frac{\operatorname{\mathrm{Var}}_{\pi K}(K_{\pi}^{*}g^{*})}{\operatorname{\mathrm{Var}}_{\pi}(g^{*})}. (121)

If Eq. 121 holds, then

ηχ2(π,K)=supg𝒢VarπK(Kπg)Varπ(g)ηKL(π,K),\displaystyle\eta_{\chi^{2}}(\pi,K)=\sup_{g\in\mathcal{G}}\frac{\operatorname{\mathrm{Var}}_{\pi K}(K_{\pi}^{*}g)}{\operatorname{\mathrm{Var}}_{\pi}(g)}\geq\eta_{\operatorname{\mathrm{KL}}}(\pi,K), (122)

and, combined with the standard inequality ηχ2(π,K)ηKL(π,K)\eta_{\chi^{2}}(\pi,K)\leq\eta_{\operatorname{\mathrm{KL}}}(\pi,K), Item (i) holds.

By Lemma 26,

(1+ϵngn)log(1+ϵngn)(1+ϵng)log(1+ϵng)ϵn(gng)\displaystyle~{}\left\|(1+\epsilon_{n}g_{n})\log(1+\epsilon_{n}g_{n})-(1+\epsilon_{n}g^{*})\log(1+\epsilon_{n}g^{*})-\epsilon_{n}(g_{n}-g^{*})\right\|_{\infty} (123)
=\displaystyle= O(ϵn2(ϵn+gng)).\displaystyle~{}O\left(\epsilon_{n}^{2}(\epsilon_{n}+\|g_{n}-g^{*}\|_{\infty})\right).

Taking expectation and using triangle inequality, we get

|Entπ(fn)Entπ(1+ϵng)|=O(ϵn2(ϵn+gng)).\displaystyle\left|\operatorname{\mathrm{Ent}}_{\pi}(f_{n})-\operatorname{\mathrm{Ent}}_{\pi}(1+\epsilon_{n}g^{*})\right|=O\left(\epsilon_{n}^{2}(\epsilon_{n}+\|g_{n}-g^{*}\|_{\infty})\right). (124)

It is known that

limϵ01ϵ2Entπ(1+ϵg)=12Varπ(g).\displaystyle\lim_{\epsilon\to 0}\frac{1}{\epsilon^{2}}\operatorname{\mathrm{Ent}}_{\pi}(1+\epsilon g^{*})=\frac{1}{2}\operatorname{\mathrm{Var}}_{\pi}(g^{*}). (125)

So

limn1ϵn2Entπ(fn)=12Varπ(g).\displaystyle\lim_{n\to\infty}\frac{1}{\epsilon_{n}^{2}}\operatorname{\mathrm{Ent}}_{\pi}(f_{n})=\frac{1}{2}\operatorname{\mathrm{Var}}_{\pi}(g^{*}). (126)

Similarly

limn1ϵn2EntπK(Kπfn)=12VarπK(Kπg).\displaystyle\lim_{n\to\infty}\frac{1}{\epsilon_{n}^{2}}\operatorname{\mathrm{Ent}}_{\pi K}(K_{\pi}^{*}f_{n})=\frac{1}{2}\operatorname{\mathrm{Var}}_{\pi K}(K_{\pi}^{*}g^{*}). (127)

This finishes the proof of Eq. 121. ∎
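
The local behaviour underlying the proof can also be observed numerically: for f=1+εg with π[g]=0, the entropy ratio approaches the variance ratio as ε→0, by Eqs. 125–127. A small Python illustration (our own code; the kernel and test function below are arbitrary):

    import numpy as np

    def ent(mu, f):
        m = float(np.dot(mu, f))
        return float(np.dot(mu, f * np.log(f))) - m * np.log(m)

    def var(mu, g):
        return float(np.dot(mu, g ** 2) - np.dot(mu, g) ** 2)

    rng = np.random.default_rng(1)
    nx, ny = 4, 3
    K = rng.random((nx, ny)); K /= K.sum(axis=1, keepdims=True)   # random kernel X -> Y
    pi = rng.random(nx); pi /= pi.sum()
    piK = pi @ K
    Kadj = (pi[:, None] * K / piK[None, :]).T   # adjoint K*_pi, mapping functions on X to Y

    g = rng.normal(size=nx)
    g -= float(np.dot(pi, g))                   # center: pi[g] = 0
    g /= np.max(np.abs(g))                      # normalize so that f = 1 + eps*g stays positive
    for eps in [1e-1, 1e-2, 1e-3]:
        f = 1.0 + eps * g
        print(eps, ent(piK, Kadj @ f) / ent(pi, f), var(piK, Kadj @ g) / var(pi, g))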

Lemma 26.

There exists C>0C>0 such that for ϵ,ϵ>0\epsilon,\epsilon^{\prime}>0 small enough, for x,y,z,wx,y,z,w\in\mathbb{R} satisfying |x|,|y|,|z|,|w|ϵ|x|,|y|,|z|,|w|\leq\epsilon, |xz|,|yw|ϵϵ|x-z|,|y-w|\leq\epsilon\epsilon^{\prime}, we have

|(1+x)log(1+y)(1+z)log(1+w)(yw)|Cϵ2(ϵ+ϵ).\displaystyle|(1+x)\log(1+y)-(1+z)\log(1+w)-(y-w)|\leq C\epsilon^{2}(\epsilon+\epsilon^{\prime}). (128)
Proof.

First note that |log(1+y)(yy2/2)|=O(ϵ3)|\log(1+y)-(y-y^{2}/2)|=O(\epsilon^{3}). Then

|(1+x)log(1+y)(1+z)log(1+w)(yw)|\displaystyle~{}|(1+x)\log(1+y)-(1+z)\log(1+w)-(y-w)| (129)
=\displaystyle= |(1+x)(yy2/2)(1+z)(ww2/2)(yw)|+O(ϵ3)\displaystyle~{}|(1+x)(y-y^{2}/2)-(1+z)(w-w^{2}/2)-(y-w)|+O(\epsilon^{3})
=\displaystyle= |xyzw|+|y2w2|/2+|xy2|/2+|zw2|/2+O(ϵ3)\displaystyle~{}|xy-zw|+|y^{2}-w^{2}|/2+|xy^{2}|/2+|zw^{2}|/2+O(\epsilon^{3})
=\displaystyle= |(xz)y+z(yw)|+|(yw)(y+w)|/2+O(ϵ2(ϵ+ϵ))\displaystyle~{}|(x-z)y+z(y-w)|+|(y-w)(y+w)|/2+O(\epsilon^{2}(\epsilon+\epsilon^{\prime}))
=\displaystyle= O(ϵ2(ϵ+ϵ)).\displaystyle~{}O(\epsilon^{2}(\epsilon+\epsilon^{\prime})).

Unlike ρ\rho and ρ0\rho_{0} where the extremal functions (if they exist) have full support, the extremal distributions for ηKL\eta_{\operatorname{\mathrm{KL}}} may have non-full support.

Example 27 (Complete graph).

Let n3n\geq 3 be an integer. Let 𝒳=[n]\mathcal{X}=[n] and π=Unif(𝒳)\pi=\operatorname{\mathrm{Unif}}(\mathcal{X}). Let K:𝒳𝒳K:\mathcal{X}\to\mathcal{X} be the (non-lazy) random walk on the complete graph. That is,

K(x,y)=1n1𝟙{xy}\displaystyle K(x,y)=\frac{1}{n-1}\mathbbm{1}\{x\neq y\} (130)

for x,y𝒳x,y\in\mathcal{X}. [GP23, Prop. 33] proves that ηKL(π,K)=lognlog(n1)logn\eta_{\operatorname{\mathrm{KL}}}(\pi,K)=\frac{\log n-\log(n-1)}{\log n}, and that it is achieved at and only at point distributions.
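
This can be checked numerically; the following Python sketch (our own code) evaluates the ratio D(νK‖πK)/D(ν‖π) at point masses, which reproduces the value above, and at random distributions ν, which stay strictly below it, consistent with [GP23, Prop. 33].

    import numpy as np

    def kl(p, q):
        m = p > 0
        return float(np.sum(p[m] * np.log(p[m] / q[m])))

    n = 5
    K = (np.ones((n, n)) - np.eye(n)) / (n - 1)   # non-lazy walk on the complete graph
    pi = np.full(n, 1.0 / n)                      # uniform, so pi K = pi

    point_ratio = kl(K[0], pi) / kl(np.eye(n)[0], pi)
    print(point_ratio, (np.log(n) - np.log(n - 1)) / np.log(n))   # identical

    rng = np.random.default_rng(0)
    random_best = max(kl(nu @ K, pi) / kl(nu, pi)
                      for nu in rng.dirichlet(np.ones(n), size=2000))
    print(random_best)        # strictly below the point-mass value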

In the above example, KK is not factorizable, so it corresponds to the half-step contraction coefficient α\alpha. The next example shows that the extremal distributions for the full-step contraction coefficient δ\delta can have non-full support.

Example 28 (Complete bipartite graph).

Let n3n\geq 3 be an integer. Let G=Kn,nG=K_{n,n} be the complete bipartite graph. That is, V=[2]×[n]V=[2]\times[n], and there is an edge between (1,i)(1,i) and (2,j)(2,j) for all i,j[n]i,j\in[n]. Let (π,P)(\pi,P) be the lazy random walk on GG (Definition 4). Numerical computation suggests that ηKL(π,P)=logn2log(2n)\eta_{\operatorname{\mathrm{KL}}}(\pi,P)=\frac{\log n}{2\log(2n)}, and that the supremum is achieved at and only at point distributions. Proposition 29 proves this observation for n=3n=3.
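
A sketch of the numerical computation referred to above (our own Python code; it only samples random ν and is not a proof): build the lazy walk on Kn,n, evaluate the ratio at point masses, and compare with random inputs.

    import numpy as np

    def kl(p, q):
        m = p > 0
        return float(np.sum(p[m] * np.log(p[m] / q[m])))

    n = 3
    N = 2 * n                                     # part 1 = {0,...,n-1}, part 2 = {n,...,2n-1}
    A = np.zeros((N, N))
    A[:n, n:] = 1.0
    A[n:, :n] = 1.0
    P = 0.5 * np.eye(N) + 0.5 * A / A.sum(axis=1, keepdims=True)   # lazy random walk
    pi = np.full(N, 1.0 / N)                      # uniform, stationary (P is doubly stochastic)

    point_ratio = kl(P[0], pi) / kl(np.eye(N)[0], pi)
    print(point_ratio, np.log(n) / (2 * np.log(2 * n)))            # identical

    rng = np.random.default_rng(0)
    random_best = max(kl(nu @ P, pi) / kl(nu, pi)
                      for nu in rng.dirichlet(np.ones(N), size=5000))
    print(random_best)    # stays below the point-mass value, in line with Proposition 29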

Proposition 29 (Complete bipartite graph).

Let π,P\pi,P be as in Example 28. For n=3n=3, ηKL(π,P)=logn2log(2n)\eta_{\operatorname{\mathrm{KL}}}(\pi,P)=\frac{\log n}{2\log(2n)}, and equality is achieved at and only at point distributions.

Proof.

Let ν\nu be the point distribution on any x𝒳x\in\mathcal{X}. Then

D(νπ)\displaystyle D(\nu\|\pi) =log(2n),\displaystyle=\log(2n), (131)
D(νPπ)\displaystyle D(\nu P\|\pi) =12logn.\displaystyle=\frac{1}{2}\log n. (132)

So

ηKL(π,P)D(νPπ)D(νπ)=logn2log(2n).\displaystyle\eta_{\operatorname{\mathrm{KL}}}(\pi,P)\geq\frac{D(\nu P\|\pi)}{D(\nu\|\pi)}=\frac{\log n}{2\log(2n)}. (133)

Note that this holds for any n3n\geq 3.

We prove that for n=3n=3, the point distributions are the only maximizers. We represent a distribution ν\nu using a tuple (t,ν1,ν2)(t,\nu_{1},\nu_{2}), where t[0,1]t\in[0,1] and νi\nu_{i} (i=1,2i=1,2) is a distribution on 𝒳i={i}×[n]𝒳\mathcal{X}_{i}=\{i\}\times[n]\subseteq\mathcal{X}. Given ν\nu, the corresponding tuple is ϕ(ν)=(ν(𝒳1),ν|𝒳1,ν|𝒳2)\phi(\nu)=(\nu(\mathcal{X}_{1}),\nu|_{\mathcal{X}_{1}},\nu|_{\mathcal{X}_{2}}), where ν|𝒳i\nu|_{\mathcal{X}_{i}} denotes the conditional distribution (if ν(𝒳i)=0\nu(\mathcal{X}_{i})=0, then choose an arbitrary distribution on 𝒳i\mathcal{X}_{i}). Given a tuple (t,ν1,ν2)(t,\nu_{1},\nu_{2}), the corresponding distribution ν\nu is ψ(t,ν1,ν2)=tν1+(1t)ν2\psi(t,\nu_{1},\nu_{2})=t\nu_{1}+(1-t)\nu_{2}. Let πi=Unif(𝒳i)\pi_{i}=\operatorname{\mathrm{Unif}}(\mathcal{X}_{i}) for i=1,2i=1,2. By symmetry, we only need to consider the case 12t1\frac{1}{2}\leq t\leq 1.

Under the tuple parametrization, we have

D(ψ(t,ν1,ν2)π)\displaystyle D(\psi(t,\nu_{1},\nu_{2})\|\pi) =dKL(t12)+tD(ν1π1)+(1t)D(ν2π2),\displaystyle=d_{\operatorname{\mathrm{KL}}}\left(t\|\frac{1}{2}\right)+tD(\nu_{1}\|\pi_{1})+(1-t)D(\nu_{2}\|\pi_{2}), (134)
ϕ(ψ(t,ν1,ν2)P)\displaystyle\phi(\psi(t,\nu_{1},\nu_{2})P) =(12,tν1+(1t)π1,(1t)ν2+tπ2),\displaystyle=\left(\frac{1}{2},t\nu_{1}+(1-t)\pi_{1},(1-t)\nu_{2}+t\pi_{2}\right), (135)

where dKL(xy)=xlogxy+(1x)log1x1yd_{\operatorname{\mathrm{KL}}}(x\|y)=x\log\frac{x}{y}+(1-x)\log\frac{1-x}{1-y} is the binary KL divergence function. So for ν=ψ(t,ν1,ν2)\nu=\psi(t,\nu_{1},\nu_{2}), we have

D(νPπ)D(νπ)=12D(tν1+(1t)π1π1)+D((1t)ν2+tπ2π2)dKL(t12)+tD(ν1π1)+(1t)D(ν2π2).\displaystyle\frac{D(\nu P\|\pi)}{D(\nu\|\pi)}=\frac{1}{2}\cdot\frac{D(t\nu_{1}+(1-t)\pi_{1}\|\pi_{1})+D((1-t)\nu_{2}+t\pi_{2}\|\pi_{2})}{d_{\operatorname{\mathrm{KL}}}\left(t\|\frac{1}{2}\right)+tD(\nu_{1}\|\pi_{1})+(1-t)D(\nu_{2}\|\pi_{2})}. (136)

When t=1t=1, Eq. 136 simplifies to

D(νPπ)D(νπ)=12D(ν1π1)log2+D(ν1π1)\displaystyle\frac{D(\nu P\|\pi)}{D(\nu\|\pi)}=\frac{1}{2}\cdot\frac{D(\nu_{1}\|\pi_{1})}{\log 2+D(\nu_{1}\|\pi_{1})} (137)

which is maximized when and only when ν1\nu_{1} is a point measure, taking value logn2log(2n)\frac{\log n}{2\log(2n)}. Note that when t=1t=1 and ν1\nu_{1} is a point measure, ν=ψ(t,ν1,ν2)\nu=\psi(t,\nu_{1},\nu_{2}) does not depend on ν2\nu_{2} and is always a point measure.

Now assume that 12t<1\frac{1}{2}\leq t<1. By [GP23, Prop. 17], for fixed D(ν1π1)D(\nu_{1}\|\pi_{1}) (resp. D(ν2π2)D(\nu_{2}\|\pi_{2})), D(tν1+(1t)π1π1)D(t\nu_{1}+(1-t)\pi_{1}\|\pi_{1}) (resp. D((1t)ν2+tπ2π2)D((1-t)\nu_{2}+t\pi_{2}\|\pi_{2})) is uniquely maximized (up to permutation of the alphabet) at a distribution of the form (x,1xn1,,1xn1)\left(x,\frac{1-x}{n-1},\ldots,\frac{1-x}{n-1}\right) where 1nx1\frac{1}{n}\leq x\leq 1. Therefore, we can assume WLOG that ν1=(x,1xn1,,1xn1)\nu_{1}=\left(x,\frac{1-x}{n-1},\ldots,\frac{1-x}{n-1}\right), ν2=(y,1yn1,,1yn1)\nu_{2}=\left(y,\frac{1-y}{n-1},\ldots,\frac{1-y}{n-1}\right) for some x,y[1n,1]x,y\in\left[\frac{1}{n},1\right]. In this case, we can simplify Eq. 136 as

D(νPπ)D(νπ)=12fn(tx+1tn)+fn((1t)y+tn)f2(t)+tfn(x)+(1t)fn(y)=:Fn(t,x,y),\displaystyle\frac{D(\nu P\|\pi)}{D(\nu\|\pi)}=\frac{1}{2}\cdot\frac{f_{n}\left(tx+\frac{1-t}{n}\right)+f_{n}\left((1-t)y+\frac{t}{n}\right)}{f_{2}(t)+tf_{n}(x)+(1-t)f_{n}(y)}=:F_{n}(t,x,y), (138)

where

fn(x)=D((x,1xn1,,1xn1)Unif([n]))=logn+xlogx+(1x)log1xn1.\displaystyle f_{n}(x)=D\left(\left(x,\frac{1-x}{n-1},\ldots,\frac{1-x}{n-1}\right)\|\operatorname{\mathrm{Unif}}([n])\right)=\log n+x\log x+(1-x)\log\frac{1-x}{n-1}. (139)

Therefore, computing ηKL(π,P)\eta_{\operatorname{\mathrm{KL}}}(\pi,P) is equivalent to computing

sup12t11nx,y1(t,x,y)(12,1n,1n)Fn(t,x,y)\displaystyle\sup_{\begin{subarray}{c}\frac{1}{2}\leq t\leq 1\\ \frac{1}{n}\leq x,y\leq 1\\ (t,x,y)\neq\left(\frac{1}{2},\frac{1}{n},\frac{1}{n}\right)\end{subarray}}F_{n}(t,x,y) (140)

We have reduced the original problem of computing ηKL(π,P)\eta_{\operatorname{\mathrm{KL}}}(\pi,P), which is a priori a (2n1)(2n-1)-dimensional optimization problem, to an optimization problem with three real variables t[0,1],x,y[1n,1]t\in[0,1],x,y\in\left[\frac{1}{n},1\right]. The following lemma helps us reduce it further to two real variables.

Lemma 30.

For any n3n\geq 3, there exists t>12t^{*}>\frac{1}{2} such that

sup0<t<t1n<x1fn(tx+1tn)tfn(x)<lognlog(2n).\displaystyle\sup_{\begin{subarray}{c}0<t<t^{*}\\ \frac{1}{n}<x\leq 1\end{subarray}}\frac{f_{n}\left(tx+\frac{1-t}{n}\right)}{tf_{n}(x)}<\frac{\log n}{\log(2n)}. (141)
Proof.

By [GP23, Eqs. (23) and (27)], for fixed 0t10\leq t\leq 1, we have

sup1n<x1fn(tx+1tn)tfn(x)t1logn.\displaystyle\sup_{\frac{1}{n}<x\leq 1}\frac{f_{n}\left(tx+\frac{1-t}{n}\right)}{tf_{n}(x)}\leq t^{\frac{1}{\log n}}. (142)

For any t<exp((logn)loglognlog(2n))t<\exp\left((\log n)\log\frac{\log n}{\log(2n)}\right), we have

t1logn<lognlog(2n).\displaystyle t^{\frac{1}{\log n}}<\frac{\log n}{\log(2n)}. (143)

Furthermore, notice that exp((logn)loglognlog(2n))>12\exp\left((\log n)\log\frac{\log n}{\log(2n)}\right)>\frac{1}{2} for all n3n\geq 3. So we can take tt^{*} to be any number smaller than exp((logn)loglognlog(2n))\exp\left((\log n)\log\frac{\log n}{\log(2n)}\right). ∎

Note that all arguments until this point work for any n3n\geq 3. From now on we will use the assumption n=3n=3. For n=3n=3, we take t=0.58<exp((log3)loglog3log6)t^{*}=0.58<\exp\left((\log 3)\log\frac{\log 3}{\log 6}\right) and apply Lemma 30 to Eq. 140. For 12tt\frac{1}{2}\leq t\leq t^{*}, we can apply Lemma 30 to (t,x)(t,x) and (1t,y)(1-t,y) and get

sup12tt13x,y1(t,x,y)(12,13,13)F3(t,x,y)<log32log6\displaystyle\sup_{\begin{subarray}{c}\frac{1}{2}\leq t\leq t^{*}\\ \frac{1}{3}\leq x,y\leq 1\\ (t,x,y)\neq\left(\frac{1}{2},\frac{1}{3},\frac{1}{3}\right)\end{subarray}}F_{3}(t,x,y)<\frac{\log 3}{2\log 6} (144)

For t<t<1t^{*}<t<1, we have

supt<t<113x,y1F3(t,x,y)limt1F3(t,1,13)=log32log6.\displaystyle\sup_{\begin{subarray}{c}t^{*}<t<1\\ \frac{1}{3}\leq x,y\leq 1\end{subarray}}F_{3}(t,x,y)\geq\lim_{t\to 1^{-}}F_{3}\left(t,1,\frac{1}{3}\right)=\frac{\log 3}{2\log 6}. (145)

Applying Lemma 30 to (1t,y)(1-t,y) we see that for any (t,x,y)(t,x,y) with t<t<1t^{*}<t<1 and F3(t,x,y)log32log6F_{3}(t,x,y)\geq\frac{\log 3}{2\log 6}, we have F3(t,x,y)F3(t,x,13)F_{3}(t,x,y)\leq F_{3}\left(t,x,\frac{1}{3}\right). So

supt<t<113x,y1F3(t,x,y)=supt<t<113x1F3(t,x,13)=supt<t<113x112f3(tx+1t3)f2(t)+tf3(x).\displaystyle\sup_{\begin{subarray}{c}t^{*}<t<1\\ \frac{1}{3}\leq x,y\leq 1\end{subarray}}F_{3}(t,x,y)=\sup_{\begin{subarray}{c}t^{*}<t<1\\ \frac{1}{3}\leq x\leq 1\end{subarray}}F_{3}\left(t,x,\frac{1}{3}\right)=\sup_{\begin{subarray}{c}t^{*}<t<1\\ \frac{1}{3}\leq x\leq 1\end{subarray}}\frac{1}{2}\cdot\frac{f_{3}\left(tx+\frac{1-t}{3}\right)}{f_{2}(t)+tf_{3}(x)}. (146)

In the following, we prove that for t<t<1t^{*}<t<1, 13x1\frac{1}{3}\leq x\leq 1, we have

G3(t,x)=log6log3f3(tx+1t3)(f2(t)+tf3(x))<0.\displaystyle G_{3}(t,x)=\frac{\log 6}{\log 3}\cdot f_{3}\left(tx+\frac{1-t}{3}\right)-(f_{2}(t)+tf_{3}(x))<0. (147)

Note that Eq. 147 implies the desired result by Eqs. 144 and 146.

Case 1: t0.999t\geq 0.999, x0.999x\geq 0.999. For simplicity of notation, write a=1ta=1-t and b=1xb=1-x. Then 0<a0.0010<a\leq 0.001 and 0b0.0010\leq b\leq 0.001. Let c=1(tx+1t3)=23a+babc=1-\left(tx+\frac{1-t}{3}\right)=\frac{2}{3}a+b-ab. Write gn(u)=(1u)log(1u)ulogun0g_{n}(u)=-(1-u)\log(1-u)-u\log\frac{u}{n}\geq 0. Then

G3(1a,1b)\displaystyle~{}G_{3}(1-a,1-b) (148)
=\displaystyle= log6log3(log3g2(c))(log2g1(a)+t(log3g2(b)))\displaystyle~{}\frac{\log 6}{\log 3}\cdot\left(\log 3-g_{2}(c)\right)-(\log 2-g_{1}(a)+t(\log 3-g_{2}(b)))
\displaystyle\leq log6log3(log3g2(c))(log2g1(a)alog3+log3g2(b))\displaystyle~{}\frac{\log 6}{\log 3}\cdot\left(\log 3-g_{2}(c)\right)-(\log 2-g_{1}(a)-a\log 3+\log 3-g_{2}(b))
=\displaystyle= g3(a)+g2(b)log6log3g2(c).\displaystyle~{}g_{3}(a)+g_{2}(b)-\frac{\log 6}{\log 3}\cdot g_{2}(c).

For 0w0.0010\leq w\leq 0.001, we have

0.999w(1w)log(1w)w.\displaystyle 0.999w\leq-(1-w)\log(1-w)\leq w. (149)

So

g3(a)alog3ea,g2(b)blog2eb,g2(c)c(0.999+log2c)=:h(c).\displaystyle g_{3}(a)\leq a\log\frac{3e}{a},\qquad g_{2}(b)\leq b\log\frac{2e}{b},\qquad g_{2}(c)\geq c\left(0.999+\log\frac{2}{c}\right)=:h(c). (150)

Because h(c)h(c) is concave and increasing for 0c0.0020\leq c\leq 0.002, we have

h(c)12h(43a)+12h(2b(1a))12h(43a)+12h(1.998b).\displaystyle h(c)\geq\frac{1}{2}h\left(\frac{4}{3}a\right)+\frac{1}{2}h(2b(1-a))\geq\frac{1}{2}h\left(\frac{4}{3}a\right)+\frac{1}{2}h(1.998b). (151)

For 0<a0.0010<a\leq 0.001 and 0b0.0010\leq b\leq 0.001, we have

alog3ea<log62log3h(43a),blog2eblog62log3h(1.998b).\displaystyle a\log\frac{3e}{a}<\frac{\log 6}{2\log 3}\cdot h\left(\frac{4}{3}a\right),\qquad b\log\frac{2e}{b}\leq\frac{\log 6}{2\log 3}\cdot h(1.998b). (152)

Combining Eqs. 148, 150, 151 and 152, we get g3(a)+g2(b)<log6log3g2(c)g_{3}(a)+g_{2}(b)<\frac{\log 6}{\log 3}\cdot g_{2}(c), and hence G3(t,x)<0G_{3}(t,x)<0 for 0.999t<10.999\leq t<1, 0.999x10.999\leq x\leq 1. This finishes the proof for Case 1.

Case 2: t0.999t\leq 0.999 or x0.999x\leq 0.999. Let A=([t,1]×[13,1])\([0.999,1]×[0.999,1])A=\left(\left[t^{*},1\right]\times\left[\frac{1}{3},1\right]\right)\backslash\left([0.999,1]\times[0.999,1]\right). Our goal is to prove that G3(t,x)<0G_{3}(t,x)<0 for all (t,x)A(t,x)\in A. The proof strategy is as follows. We choose ϵ,δ>0\epsilon,\delta>0 and a finite set AAA^{*}\subseteq A such that the following are true.

  1. (a)

    G3(t,x)<ϵG_{3}(t,x)<-\epsilon for all (t,x)A(t,x)\in A^{*};

  2. (b)

    For any (t,x)A(t,x)\in A, there exists (t,x)A(t^{\prime},x^{\prime})\in A^{*} such that max{|tt|,|xx|}δ\max\{|t-t^{\prime}|,|x-x^{\prime}|\}\leq\delta;

  3. (c)

    For any (t,x),(t,x)A(t,x),(t^{\prime},x^{\prime})\in A, if max{|tt|,|xx|}δ\max\{|t-t^{\prime}|,|x-x^{\prime}|\}\leq\delta, then |G3(t,x)G3(t,x)|ϵ|G_{3}(t,x)-G_{3}(t^{\prime},x^{\prime})|\leq\epsilon.

If all three items hold, then they imply our goal.

We take ϵ=0.00078\epsilon=0.00078, δ=105\delta=10^{-5}, A=A(105×105)A^{*}=A\cap\left(10^{-5}\mathbb{Z}\times 10^{-5}\mathbb{Z}\right). Item (a) is verified using a computer program by iterating over all points in AA^{*}. Item (b) is immediate by our choice of AA^{*}. It remains to prove Item (c). Note that on AA, f3(tx+1t3)f_{3}\left(tx+\frac{1-t}{3}\right) and f2(t)+tf3(x)f_{2}(t)+tf_{3}(x) are non-decreasing in both tt and xx. By convexity and monotonicity,

sup13u,u1|uu|δ|f3(u)f3(u)|\displaystyle\sup_{\begin{subarray}{c}\frac{1}{3}\leq u,u^{\prime}\leq 1\\ |u-u^{\prime}|\leq\delta\end{subarray}}|f_{3}(u)-f_{3}(u^{\prime})| =|f3(1)f3(1δ)|0.00014,\displaystyle=|f_{3}(1)-f_{3}(1-\delta)|\leq 0.00014, (153)
sup12u,u1|uu|δ|f2(u)f2(u)|\displaystyle\sup_{\begin{subarray}{c}\frac{1}{2}\leq u,u^{\prime}\leq 1\\ |u-u^{\prime}|\leq\delta\end{subarray}}|f_{2}(u)-f_{2}(u^{\prime})| =|f2(1)f2(1δ)|0.00013.\displaystyle=|f_{2}(1)-f_{2}(1-\delta)|\leq 0.00013. (154)

So for (t,x),(t,x)A(t,x),(t^{\prime},x^{\prime})\in A with max{|tt|,|xx|}δ\max\{|t-t^{\prime}|,|x-x^{\prime}|\}\leq\delta, we have

|G3(t,x)G3(t,x)|\displaystyle|G_{3}(t,x)-G_{3}(t,x^{\prime})| max{log6log3|f3(tx+1t3)f3(tx+1t3)|,\displaystyle\leq\max\left\{\frac{\log 6}{\log 3}\cdot\left|f_{3}\left(tx+\frac{1-t}{3}\right)-f_{3}\left(tx^{\prime}+\frac{1-t}{3}\right)\right|,\right. (155)
|(f2(t)+tf3(x))(f2(t)+tf3(x))|}\displaystyle\qquad\left.\left|(f_{2}(t)+tf_{3}(x))-(f_{2}(t)+tf_{3}(x^{\prime}))\right|\right\}
max{log6log30.00014,0.00014}0.00023,\displaystyle\leq\max\left\{\frac{\log 6}{\log 3}\cdot 0.00014,0.00014\right\}\leq 0.00023,
|G3(t,x)G3(t,x)|\displaystyle|G_{3}(t,x^{\prime})-G_{3}(t^{\prime},x^{\prime})| max{log6log3|f3(tx+1t3)f3(tx+1t3)|,\displaystyle\leq\max\left\{\frac{\log 6}{\log 3}\cdot\left|f_{3}\left(tx^{\prime}+\frac{1-t}{3}\right)-f_{3}\left(t^{\prime}x^{\prime}+\frac{1-t^{\prime}}{3}\right)\right|,\right. (156)
|(f2(t)+tf3(x))(f2(t)+tf3(x))|}\displaystyle\qquad\left.\left|(f_{2}(t)+tf_{3}(x^{\prime}))-(f_{2}(t^{\prime})+t^{\prime}f_{3}(x^{\prime}))\right|\right\}
max{log6log30.00014,0.00013+δlog3}0.00023.\displaystyle\leq\max\left\{\frac{\log 6}{\log 3}\cdot 0.00014,0.00013+\delta\log 3\right\}\leq 0.00023.

Therefore

|G3(t,x)G3(t,x)|\displaystyle|G_{3}(t,x)-G_{3}(t^{\prime},x^{\prime})| 0.00023+0.00023=0.00046<ϵ.\displaystyle\leq 0.00023+0.00023=0.00046<\epsilon. (157)

This proves Item (c), thus finishing the proof for Case 2.

Eq. 147 follows by combining the two cases. This finishes the proof that ηKL(π,P)=log32log6\eta_{\operatorname{\mathrm{KL}}}(\pi,P)=\frac{\log 3}{2\log 6} and equality is achieved at and only at point distributions. ∎
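
For completeness, here is a Python sketch of the grid check of Item (a) (our own code, not the exact program used for the verification): it evaluates G3 of Eq. 147 on a grid over A and reports the largest value. The actual verification above uses step size 10^{-5} and threshold −ε with ε=0.00078; a coarser default step is used here so that the loop runs quickly.

    import numpy as np

    def f(n, x):
        # f_n(x) from Eq. (139), with the endpoint values handled by clipping.
        x = np.clip(x, 1e-12, 1.0)
        y = np.clip(1.0 - x, 1e-12, 1.0)
        return np.log(n) + x * np.log(x) + y * np.log(y / (n - 1))

    def G3(t, x):
        return (np.log(6) / np.log(3)) * f(3, t * x + (1 - t) / 3) - (f(2, t) + t * f(3, x))

    def largest_G3_on_A(step=1e-3, t_star=0.58):
        ts = np.linspace(t_star, 1.0, int(round((1.0 - t_star) / step)) + 1)
        xs = np.linspace(1.0 / 3.0, 1.0, int(round((2.0 / 3.0) / step)) + 1)
        worst = -np.inf
        for t in ts:
            vals = G3(t, xs)
            keep = ~((t >= 0.999) & (xs >= 0.999))   # exclude the corner square handled in Case 1
            if keep.any():
                worst = max(worst, float(vals[keep].max()))
        return worst

    print(largest_G3_on_A())   # verification of Item (a) requires this to be < -0.00078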

Using the same proof strategy, one could in principle prove the statement of Proposition 29 for any given n3n\geq 3 (assuming it is true). However, it is unclear to us how to prove it uniformly for all n3n\geq 3.

Finally, we provide sufficient conditions for the extremal distribution for ηKL\eta_{\operatorname{\mathrm{KL}}} to have full support. This result can be contrasted with Examples 27 and 28, and with [BC24], where it is shown that for many natural chains (e.g., the random transposition model) the half-step entropy contraction has point measures as its only extremizers.

Lemma 31.

Let (π,P)(\pi,P) be a reversible pair where PP is irreducible. Suppose that either

  1. (i)

    P(x,x)12P(x,x)\geq\frac{1}{2} for all x𝒳x\in\mathcal{X}, and ηKL(π,P)>12\eta_{\operatorname{\mathrm{KL}}}(\pi,P)>\frac{1}{2}, or

  2. (ii)

    P(x,y)>0P(x,y)>0 for all x,y𝒳x,y\in\mathcal{X}.

Let ν\nu be a distribution on 𝒳\mathcal{X} satisfying ηKL(π,P)=D(νPπP)D(νπ)\eta_{\operatorname{\mathrm{KL}}}(\pi,P)=\frac{D(\nu P\|\pi P)}{D(\nu\|\pi)}. Then ν\nu has full support.

Proof.

Let f=dνdπf=\frac{d\nu}{d\pi} be the Radon-Nikodym derivative. For the sake of contradiction assume that suppf𝒳\operatorname{\mathrm{supp}}f\neq\mathcal{X}. Choose a𝒳suppfa\in\mathcal{X}-\operatorname{\mathrm{supp}}f and bsuppfb\in\operatorname{\mathrm{supp}}f such that P(a,b)>0P(a,b)>0. Because PP is irreducible, such (a,b)(a,b) always exists.

Let h=𝟙aπ(a)π(b)𝟙bh=\mathbbm{1}_{a}-\frac{\pi(a)}{\pi(b)}\mathbbm{1}_{b} and fϵ=f+ϵhf_{\epsilon}=f+\epsilon h. For small enough ϵ>0\epsilon>0, we have fϵ0f_{\epsilon}\geq 0 and π[fϵ]=1\pi[f_{\epsilon}]=1. We prove that for ϵ>0\epsilon>0 small enough, we have

ddϵEntπ(Pfϵ)Entπ(fϵ)>0.\displaystyle\frac{d}{d\epsilon}\frac{\operatorname{\mathrm{Ent}}_{\pi}(Pf_{\epsilon})}{\operatorname{\mathrm{Ent}}_{\pi}(f_{\epsilon})}>0. (158)

Computation shows that

ddϵEntπ(Pfϵ)Entπ(fϵ)=1Entπ(fϵ)2(π[Phlog(Pfϵ)]π[fϵlogfϵ]π[hlogfϵ]π[Pfϵlog(Pfϵ)]).\displaystyle\frac{d}{d\epsilon}\frac{\operatorname{\mathrm{Ent}}_{\pi}(Pf_{\epsilon})}{\operatorname{\mathrm{Ent}}_{\pi}(f_{\epsilon})}=\frac{1}{\operatorname{\mathrm{Ent}}_{\pi}(f_{\epsilon})^{2}}\left(\pi[Ph\log(Pf_{\epsilon})]\pi[f_{\epsilon}\log f_{\epsilon}]-\pi[h\log f_{\epsilon}]\pi[Pf_{\epsilon}\log(Pf_{\epsilon})]\right). (159)

Expanding near ϵ=0\epsilon=0 gives

π[hlogfϵ]\displaystyle\pi[h\log f_{\epsilon}] =π(a)logϵ+O(1),\displaystyle=\pi(a)\log\epsilon+O(1), (160)
π[Phlog(Pfϵ)]\displaystyle\pi[Ph\log(Pf_{\epsilon})] =j𝒳π(j)(P(j,a)P(j,b)π(a)π(b))log(Pfϵ(j))\displaystyle=\sum_{j\in\mathcal{X}}\pi(j)\left(P(j,a)-P(j,b)\frac{\pi(a)}{\pi(b)}\right)\log(Pf_{\epsilon}(j)) (161)
=j𝒳supp(Pf)π(j)P(j,a)log(P(j,a)ϵ)+O(1)\displaystyle=\sum_{j\in\mathcal{X}-\operatorname{\mathrm{supp}}(Pf)}\pi(j)P(j,a)\log(P(j,a)\epsilon)+O(1)
=π(a)P(a,𝒳supp(Pf))logϵ+O(1),\displaystyle=\pi(a)P(a,\mathcal{X}-\operatorname{\mathrm{supp}}(Pf))\log\epsilon+O(1),
π[fϵlogfϵ]\displaystyle\pi[f_{\epsilon}\log f_{\epsilon}] =Entπ(f)+o(1),\displaystyle=\operatorname{\mathrm{Ent}}_{\pi}(f)+o(1), (162)
π[Pfϵlog(Pfϵ)]\displaystyle\pi[Pf_{\epsilon}\log(Pf_{\epsilon})] =Entπ(Pf)+o(1).\displaystyle=\operatorname{\mathrm{Ent}}_{\pi}(Pf)+o(1). (163)

Case Item (i). Because P(a,b)>0P(a,b)>0, we have asupp(Pf)a\in\operatorname{\mathrm{supp}}(Pf). Because PP is lazy, we have

P(a,𝒳supp(Pf))1P(a,a)12.\displaystyle P(a,\mathcal{X}-\operatorname{\mathrm{supp}}(Pf))\leq 1-P(a,a)\leq\frac{1}{2}. (164)

By assumption, Entπ(Pf)Entπ(f)=ηKL(π,P)>12\frac{\operatorname{\mathrm{Ent}}_{\pi}(Pf)}{\operatorname{\mathrm{Ent}}_{\pi}(f)}=\eta_{\operatorname{\mathrm{KL}}}(\pi,P)>\frac{1}{2}. Therefore Eq. 158 holds for ϵ>0\epsilon>0 small enough.

Case Item (ii). In this case, since P(x,y)>0P(x,y)>0 for all x,y𝒳x,y\in\mathcal{X}, we have supp(Pf)=𝒳\operatorname{\mathrm{supp}}(Pf)=\mathcal{X} and hence P(a,𝒳supp(Pf))=0P(a,\mathcal{X}-\operatorname{\mathrm{supp}}(Pf))=0. So Eq. 158 holds for ϵ>0\epsilon>0 small enough. ∎

Acknowledgments

The work of Y.G. was supported in part by the National Science Foundation under Grant No. DMS-1926686. The work of Y.P. was supported in part by the MIT-IBM Watson AI Lab and by the National Science Foundation under Grant No. CCF-2131115.

References

  • [AG76] Rudolf Ahlswede and Peter Gács. Spreading of sets in product spaces and hypercontraction of the Markov operator. The Annals of Probability, pages 925–939, 1976.
  • [AJK+22] Nima Anari, Vishesh Jain, Frederic Koehler, Huy Tuan Pham, and Thuy-Duong Vuong. Entropic independence: optimal mixing of down-up random walks. In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 1418–1430, 2022.
  • [ALGV19] Nima Anari, Kuikui Liu, Shayan Oveis Gharan, and Cynthia Vinzant. Log-concave polynomials II: high-dimensional walks and an FPRAS for counting bases of a matroid. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 1–12, 2019.
  • [BC24] Alexandre Bristiel and Pietro Caputo. Entropy inequalities for random walks and permutations. Annales de l’Institut Henri Poincaré (B) Probabilités et Statistiques, 60(1):54–81, 2024.
  • [BCC+22] Antonio Blanca, Pietro Caputo, Zongchen Chen, Daniel Parisi, Daniel Štefankovič, and Eric Vigoda. On mixing of Markov chains: coupling, spectral independence, and entropy factorization. Electronic Journal of Probability, 27:1–42, 2022.
  • [BCP+21] Antonio Blanca, Pietro Caputo, Daniel Parisi, Alistair Sinclair, and Eric Vigoda. Entropy decay in the Swendsen-Wang dynamics on d\mathbb{Z}^{d}. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 1551–1564, 2021.
  • [BG99] Sergej G. Bobkov and Friedrich Götze. Exponential integrability and transportation cost related to logarithmic Sobolev inequalities. Journal of Functional Analysis, 163(1):1–28, 1999.
  • [BSM03] Abraham Berman and Naomi Shaked-Monderer. Completely positive matrices. World Scientific, 2003.
  • [BT06] Sergey G. Bobkov and Prasad Tetali. Modified logarithmic Sobolev inequalities in discrete settings. Journal of Theoretical Probability, 19(2):289–336, 2006.
  • [Cap23] Pietro Caputo. Lecture notes on entropy and Markov chains. Preprint, 2023. Available at: http://www.mat.uniroma3.it/users/caputo/entropy.pdf.
  • [CDPP09] Pietro Caputo, Paolo Dai Pra, and Gustavo Posta. Convex entropy decay via the Bochner-Bakry-Emery approach. Annales de l’IHP Probabilités et statistiques, 45(3):734–753, 2009.
  • [CE22] Yuansi Chen and Ronen Eldan. Localization schemes: A framework for proving mixing bounds for Markov chains. In 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), pages 110–122. IEEE, 2022.
  • [Che03] Mu-Fa Chen. Variational formulas of Poincaré-type inequalities for birth-death processes. Acta Mathematica Sinica, 19(4):625–644, 2003.
  • [Che05] Mu-Fa Chen. Poincaré-type inequalities in dimension one. Eigenvalues, Inequalities, and Ergodic Theory, pages 113–130, 2005.
  • [CLV21] Zongchen Chen, Kuikui Liu, and Eric Vigoda. Optimal mixing of Glauber dynamics: Entropy factorization via high-dimensional expansion. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 1537–1550, 2021.
  • [CMS24] Pietro Caputo, Florentin Münch, and Justin Salez. Entropy and curvature: beyond the Peres-Tetali conjecture. arXiv preprint arXiv:2401.17148, 2024.
  • [DMLM03] Pierre Del Moral, Michel Ledoux, and Laurent Miclo. On contraction properties of Markov kernels. Probability theory and related fields, 126(3):395–420, 2003.
  • [Dob56] Roland L. Dobrushin. Central limit theorem for nonstationary Markov chains. I. Theory of Probability & Its Applications, 1(1):65–80, 1956.
  • [DSC96] Persi Diaconis and Laurent Saloff-Coste. Logarithmic Sobolev inequalities for finite Markov chains. The Annals of Applied Probability, 6(3):695–750, 1996.
  • [GP23] Yuzhou Gu and Yury Polyanskiy. Non-linear log-Sobolev inequalities for the Potts semigroup and applications to reconstruction problems. Communications in Mathematical Physics, 404(2):769–831, 2023.
  • [GQ03] Fuqing Gao and Jeremy Quastel. Exponential decay of entropy in the random transposition and Bernoulli-Laplace models. The Annals of Applied Probability, 13(4):1591–1600, 2003.
  • [KM17] Tali Kaufman and David Mass. High dimensional random walks and colorful expansion. In 8th Innovations in Theoretical Computer Science Conference (ITCS 2017). Schloss-Dagstuhl-Leibniz Zentrum für Informatik, 2017.
  • [LP17] David A. Levin and Yuval Peres. Markov chains and mixing times, volume 107. American Mathematical Soc., 2017.
  • [Mic97] Laurent Miclo. Remarques sur l’hypercontractivité et l’évolution de l’entropie pour des chaînes de Markov finies. In Séminaire de Probabilités XXXI, pages 136–167. Springer, 1997.
  • [MM62] John E. Maxfield and Henryk Minc. On the matrix equation XX=AX^{\prime}X=A. Proceedings of the Edinburgh Mathematical Society, 13(2):125–129, 1962.
  • [MN61] Marvin Marcus and Morris Newman. The permanent of a symmetric matrix. Notices Amer. Math. Soc, 8:595, 1961.
  • [Mün23] Florentin Münch. Ollivier curvature, isoperimetry, concentration, and log-Sobolev inequality. arXiv preprint arXiv:2309.06493, 2023.
  • [PW17] Yury Polyanskiy and Yihong Wu. Strong data-processing inequalities for channels and Bayesian networks. In Convexity and Concentration, pages 211–249. Springer, 2017.
  • [PW24] Yury Polyanskiy and Yihong Wu. Information theory: From coding to learning. Cambridge University Press, 2024.
  • [Rag16] Maxim Raginsky. Strong data processing inequalities and Φ\Phi-Sobolev inequalities for discrete channels. IEEE Transactions on Information Theory, 62(6):3355–3389, 2016.
  • [Rob01] Cyril Roberto. Inégalités de Hardy et de Sobolev logarithmiques. PhD thesis, 2001.
  • [Sin64] Richard Sinkhorn. A relationship between arbitrary positive matrices and doubly stochastic matrices. The Annals of Mathematical Statistics, 35(2):876–879, 1964.
  • [STY23] Justin Salez, Konstantin Tikhomirov, and Pierre Youssef. Upgrading MLSI to LSI for reversible Markov chains. Journal of Functional Analysis, 285(9):110076, 2023.
  • [TY24] Konstantin Tikhomirov and Pierre Youssef. Regularized modified log-Sobolev inequalities and comparison of Markov chains. The Annals of Probability, 52(4):1201–1224, 2024.