
Fast Conditional Mixing of MCMC Algorithms for Non-log-concave Distributions

Xiang Cheng
MIT
[email protected]

Bohan Wang
USTC
[email protected]

Jingzhao Zhang
IIIS, Tsinghua; Shanghai Qizhi Institute
[email protected]

Yusong Zhu
Tsinghua University
[email protected]

Authors are listed in alphabetical order.
Abstract

MCMC algorithms offer empirically efficient tools for sampling from a target distribution \pi(x)\propto\exp(-V(x)). However, on the theory side, MCMC algorithms suffer from slow mixing rates when \pi(x) is non-log-concave. Our work examines this gap and shows that when a Poincaré-style inequality holds on a subset \mathcal{X} of the state space, the conditional distribution of MCMC iterates over \mathcal{X} mixes fast to the true conditional distribution. This fast mixing guarantee can hold in cases where global mixing is provably slow. We formalize the statement and quantify the conditional mixing rate. We further show that conditional mixing has interesting implications for sampling from mixtures of Gaussians, parameter estimation for Gaussian mixture models, and Gibbs sampling with well-connected local minima.

1 Introduction

Sampling from a given target distribution of the form \pi(x)\propto e^{-V(x)} plays a central role in many machine learning problems, such as Bayesian inference, optimization, and generative modeling [9, 14, 15]. The Langevin MCMC algorithm in particular has received a lot of recent attention; it makes use of the first-order gradient information \nabla V(x) and can be viewed as the sampling analog of gradient descent.

Langevin MCMC has been shown to converge quickly when \pi(x) is log-concave [6, 10]. More recently, similar guarantees have been established for \pi(x) satisfying weaker conditions such as the log-Sobolev inequality (LSI) [23]; for instance, \pi(x)\propto e^{-V(x)} can have a good LSI constant when V(x) is a small perturbation of a convex function. However, few guarantees exist for general non-log-concave distributions. One simple example is when \pi(x) is a well-separated mixture of two Gaussian distributions; in this case, one verifies that Langevin MCMC mixes at a rate proportional to the inverse of the exponential of the separation distance, i.e., the mixing time grows exponentially with the separation.

Many important modern machine-learning problems are highly non-convex, one prominent class being functions arising from neural networks. Though finding the global minimum of a non-convex function can be difficult, gradient descent can often be shown to converge rather quickly to a local optimum [3]. This raises the important question:

What is the sampling analog of a local minimum? And how can we sample efficiently from such a minimum?

In [3], the authors provide a partial answer to this question by adopting the view of Langevin Diffusion as the gradient flow of the KL divergence in probability space. Under this perspective, the gradient of the KL divergence is given by the Fisher Information (FI). The authors of [3] show that LMC achieves \epsilon error in FI in O(1/\epsilon^{2}) steps.

However, one crucial question remains unanswered: how does the local optimality of FI help us sample from non-log-concave distributions? Intuitively, small FI is useful for characterizing local convergence when \pi is multi-modal. The authors of [3] suggested this connection, but it remains unclear how local mixing should be defined and how small FI can quantitatively lead to good local mixing. Thus motivated, one of the goals of this paper is to provide a useful interpretation of the FI bound.

To this end, we propose a rigorous quantitative measure of "local mixing". We also provide a more general notion of "stationary point" for the sampling problem. Under these definitions, we show that LMC can achieve \epsilon error in our proposed measure of local mixing in polynomial time. Finally, we consider a discrete-state analog of the aforementioned ideas and prove local convergence for a random walk on the hypercube. Below is a detailed description of our theoretical contributions.

1.1 Main Contributions

  1. We define a notion of conditional convergence: let \mathcal{X}\subseteq\Omega denote a subset of the state space. We study the convergence of \pi_{t}|\mathcal{X} to \pi|\mathcal{X}, where \pi_{t} denotes the distribution of the LMC iterates and \pi denotes the target distribution. This notion of convergence is much weaker than standard global convergence, but in exchange, LMC can achieve fast conditional convergence in settings where global convergence is known to be exponentially slow.

  2. We define a local Logarithmic Sobolev Inequality and show how to combine it with existing results on the convergence of the Fisher information to derive the conditional convergence of LMC.

  3. When the local Logarithmic Sobolev Inequality does not hold, we define a local Poincaré Inequality and the Poincaré Fisher information (an analog of the Fisher information). We show the convergence of the Poincaré Fisher information assuming strong dissipativity, and show the conditional convergence of LMC when the local Poincaré Inequality is present.

  4. To showcase applications of our results, we study sampling from a Gaussian mixture model whose components share the same covariance, and sampling from the power posterior distribution of symmetric two-component Gaussian mixtures. The global isoperimetric constants of these examples can be exponentially bad, so global convergence may be slow. We show that the local isoperimetric constants of these examples are polynomial in the dimension, and prove fast conditional convergence for both examples.

  5. In Theorem 1, we consider an application of our result to Gibbs sampling on discrete state spaces. We show that fast conditional mixing happens on a subset whose restricted chain has a spectral gap that is not too small. We further show in Theorem 2 that a subset has a large spectral gap if it contains only one local minimum and is well connected.

2 Related Work

When the target distribution \pi is strongly log-concave, the entropy \operatorname{Ent}_{\pi}\left[\frac{\mu}{\pi}\right] is known to be strongly convex with respect to \mu, and thus Langevin Dynamics converges exponentially fast [2]. This result was later extended to LMC [8, 4, 9, 7, 21]. Several works further loosen the assumption by using isoperimetric inequalities such as the Logarithmic Sobolev Inequality and the Poincaré Inequality instead of strong log-concavity [24, 5, 25, 11]. However, there are few existing works on general non-log-concave sampling. Recently, Balasubramanian et al. [3] defined convergence in relative Fisher information as a kind of "weak convergence" for sampling and proved a polynomial guarantee in the general non-log-concave case assuming only a global Lipschitz condition; however, that work does not give a rigorous statistical interpretation of this weak convergence. Majka et al. [18] and Erdogdu and Hosseinzadeh [11] study cases where the target distribution is non-log-concave but has good tail growth or curvature; Ma et al. [17] analyze the situation where the target distribution is non-log-concave inside a bounded region but log-concave outside. Although these works give rigorous polynomial-time guarantees in their settings, their results only hold for a narrow class of non-log-concave distributions; a polynomial guarantee for general non-log-concave distributions remains out of reach. Multimodality, as a special case of non-log-concavity, has attracted much attention due to its prevalence in applied science. Many modified MCMC algorithms have been proposed to tackle sampling from such distributions, such as Darting Monte Carlo [1] and Wormhole HMC [16]. However, these algorithms require explicit knowledge of the locations of the modes.

3 Conditional Mixing for MCMC

In this section, we provide a formal definition of "conditional mixing". Specifically, let \mu_{t} denote the distribution of a Markov chain \{Z_{t}\}_{t} at time t. We assume that the Markov chain is reversible with a unique stationary distribution \pi. Existing analyses mostly focus on understanding the rate of convergence measured by d(\mu_{t},\pi), where d is some probability distance.

However, unless the stationary distribution \pi satisfies certain restrictive properties (e.g., log-concavity), the rate of convergence can be exponentially slow in the problem dimension or in the distribution moments, even for simple distributions such as mixtures of Gaussians. For this reason, we consider the weaker notion of convergence below.

Definition 1 (Conditional mixing).

Let \mu_{t} be a distribution supported on the state space \Omega. We say \mu_{t} converges conditioned on a set \mathcal{X}\subseteq\Omega with respect to the divergence d if

d(\mu_{t}|\mathcal{X},\pi|\mathcal{X})\leq\epsilon,

where \pi is the stationary distribution and the conditional distribution is defined as

\mu_{t}|\mathcal{X}(x)=\tfrac{\mu_{t}(x)\mathbbm{1}\left\{x\in\mathcal{X}\right\}}{\mu_{t}(\mathcal{X})}.

For now we can think of the distance d as the total variation distance. Later we will discuss stronger divergences such as the KL divergence and the chi-squared divergence.
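As a concrete illustration of Definition 1, the conditional distance can be estimated from samples. The sketch below is our own illustration (one-dimensional case, histogram discretization with an arbitrary bin count, and the helper name conditional_tv is ours), assuming samples from both \mu_{t} and \pi are available:

import numpy as np

# Sketch: empirical total variation distance between mu_t | X and pi | X,
# estimated from one-dimensional sample arrays via a shared histogram.
def conditional_tv(samples_mu_t, samples_pi, in_X, bins=50):
    x_mu = samples_mu_t[in_X(samples_mu_t)]   # restrict both sample sets to X
    x_pi = samples_pi[in_X(samples_pi)]
    lo = min(x_mu.min(), x_pi.min())
    hi = max(x_mu.max(), x_pi.max())
    p, edges = np.histogram(x_mu, bins=bins, range=(lo, hi))
    q, _ = np.histogram(x_pi, bins=edges)
    p = p / p.sum()                           # empirical conditional distribution mu_t | X
    q = q / q.sum()                           # empirical conditional distribution pi | X
    return 0.5 * np.abs(p - q).sum()          # TV distance between the two histograms

For example, in_X could be lambda x: x < 0 to condition on the left mode of a one-dimensional bimodal target.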

The focus of our work is on identifying several sufficient conditions for fast convergence, and quantitatively bounding the convergence rate under these conditions. We focus on two MCMC algorithms: the Langevin Monte Carlo in continuous space, and the Gibbs sampling algorithm in discrete space. We further discuss the implications of conditional convergence for the two algorithms.

4 Conditional Mixing for Langevin Monte Carlo

In this section, we study the conditional mixing of the Langevin Monte Carlo (LMC) algorithm. This section is organized as follows: in Subsection 4.1, we first introduce the Langevin Monte Carlo algorithm, Langevin Dynamics, functional inequalities, and the Fisher information; in Subsection 4.2, we provide our main results characterizing the conditional convergence of LMC; finally, in Subsection 4.3, we showcase two applications of our main results.

4.1 Preliminaries

Langevin Monte Carlo. We are interested in the convergence of Langevin Monte Carlo (LMC), a standard algorithm for sampling from a target probability density \pi\propto e^{-V}, where V:\mathbb{R}^{d}\rightarrow\mathbb{R} is called the potential function. The pseudocode of LMC is given in Algorithm 1.

Algorithm 1 Langevin Monte Carlo

Input: Initial parameter z, potential function V, step size h, number of iterations T

1: Initialization: z_{0}\leftarrow z
2: For t=0\rightarrow T:
3:     Generate a Gaussian random vector \xi_{t}\sim\mathcal{N}(0,\mathbb{I}_{d})
4:     Update z_{(t+1)h}\leftarrow z_{th}-h\nabla V(z_{th})+\sqrt{2h}\xi_{t}
5: EndFor

LMC can be viewed as a time-discretization of Langevin Dynamics (LD), described by the following stochastic differential equation:

\mathrm{d}Z_{t}=-\nabla V(Z_{t})\mathop{}\!\mathrm{d}{t}+\sqrt{2}\mathrm{d}B_{t}. (1)

We can interpolate LMC in a similar manner to LD:

\mathrm{d}Z_{t}=-\nabla V(Z_{kh})\mathop{}\!\mathrm{d}{t}+\sqrt{2}\mathrm{d}B_{t},~{}t\in[kh,(k+1)h),~{}k\in\mathbb{N}. (2)

One can easily observe that \{Z_{kh}\}_{k=0}^{\infty} in Eq. (2) has the same joint distribution as \{z_{kh}\}_{k=0}^{\infty} in Algorithm 1, and thus Eq. (2) is a continuous-time interpolation of Algorithm 1.
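For readers who prefer code, the following is a minimal NumPy sketch of Algorithm 1 (the function signature is ours, and grad_V is assumed to compute \nabla V; this is an illustration rather than the implementation used in our experiments):

import numpy as np

# Minimal sketch of Algorithm 1 (LMC). grad_V is assumed to return the gradient
# of the potential V at a point in R^d.
def langevin_monte_carlo(grad_V, z0, h, T, rng=np.random.default_rng(0)):
    z = np.array(z0, dtype=float)
    trajectory = [z.copy()]
    for _ in range(T):
        xi = rng.standard_normal(z.shape)              # xi_t ~ N(0, I_d)
        z = z - h * grad_V(z) + np.sqrt(2 * h) * xi    # the update of line 4 in Algorithm 1
        trajectory.append(z.copy())
    return np.array(trajectory)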

Poincaré Inequality & Logarithmic Sobolev Inequality. The convergence of LMC does not hold for all potential functions, and quantitative bounds require specific conditions on the potential function V. The Poincaré Inequality (PI) and the Logarithmic Sobolev Inequality (LSI) are two commonly used conditions in the analysis of LMC convergence. We present them below.

Definition 2 (Poincaré Inequality).

A probability measure \pi on \mathbb{R}^{d} satisfies the Poincaré inequality with constant \rho>0 (abbreviated as \operatorname{PI}(\rho)) if for all functions f:\mathbb{R}^{d}\rightarrow\mathbb{R},

\int\|\nabla f(x)\|^{2}\mathop{}\!\mathrm{d}{\pi(x)}\geq\rho\mathrm{Var}_{\pi}[f]. (PI)
Definition 3 (Logarithmic Sobolev Inequality).

Given a function f:\mathbb{R}^{d}\rightarrow\mathbb{R}^{+} and a probability measure \pi on \mathbb{R}^{d}, define \operatorname{Ent}_{\pi}(f)\triangleq\int f\log f\mathop{}\!\mathrm{d}{\pi(x)}-\int f\mathop{}\!\mathrm{d}{\pi(x)}\left(\log\int f\mathop{}\!\mathrm{d}{\pi(x)}\right). We say that a distribution \pi on \mathbb{R}^{d} satisfies the Logarithmic Sobolev Inequality with constant \alpha>0 (abbreviated as \operatorname{LSI}(\alpha)) if for all functions f:\mathbb{R}^{d}\rightarrow\mathbb{R}^{+},

\int\frac{\|\nabla f(x)\|^{2}}{f}\mathop{}\!\mathrm{d}{\pi(x)}\geq\alpha\operatorname{Ent}_{\pi}(f). (LSI)

Both PI and LSI imply the convergence of LMC when the step size h is small. Specifically, denote by \pi_{th} the distribution of z_{th} in LMC (Algorithm 1), by \tilde{\pi}_{t} the distribution of Z_{t} in Langevin Dynamics (Eq. (1)), and by \pi_{t} the distribution of Z_{t} in the interpolation of LMC (Eq. (2)). Applying Eq. (PI) with f=\frac{\tilde{\pi}_{t}}{\pi}, its left-hand side equals (up to a factor of 2) the negative time derivative of \mathrm{Var}_{\pi}[\frac{\tilde{\pi}_{t}}{\pi}], so PI directly yields exponential convergence of \mathrm{Var}_{\pi}[\frac{\tilde{\pi}_{t}}{\pi}]. The convergence of LMC then follows by bounding the discretization error between LMC and Langevin Dynamics. The methodology under LSI is analogous, with the entropy \operatorname{Ent}_{\pi}\left[\frac{\tilde{\pi}_{t}}{\pi}\right] in place of the variance.
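For reference, the standard continuous-time computation behind this claim reads as follows (a sketch for the Langevin Dynamics of Eq. (1), applying Eq. (PI) to f=\frac{\tilde{\pi}_{t}}{\pi} and then Grönwall's inequality):

\frac{\mathop{}\!\mathrm{d}}{\mathop{}\!\mathrm{d}t}\mathrm{Var}_{\pi}\left[\frac{\tilde{\pi}_{t}}{\pi}\right]=-2\int\left\|\nabla\frac{\tilde{\pi}_{t}}{\pi}(x)\right\|^{2}\mathop{}\!\mathrm{d}{\pi(x)}\leq-2\rho\,\mathrm{Var}_{\pi}\left[\frac{\tilde{\pi}_{t}}{\pi}\right],\qquad\text{so that}\qquad\mathrm{Var}_{\pi}\left[\frac{\tilde{\pi}_{t}}{\pi}\right]\leq e^{-2\rho t}\,\mathrm{Var}_{\pi}\left[\frac{\tilde{\pi}_{0}}{\pi}\right].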

Fisher information. It is well known that if V is strongly convex, or convex with bounded support, then \pi obeys both PI and LSI, and the convergence of LMC follows immediately from the arguments above. However, real-world sampling problems usually have non-convex potential functions, and for these problems we may no longer have either PI or LSI. If we revisit the above methodology for establishing the convergence of LMC under LSI, we find that we still have

\frac{\mathop{}\!\mathrm{d}}{\mathop{}\!\mathrm{d}t}\operatorname{Ent}_{\pi}\left[\frac{\tilde{\pi}_{t}}{\pi}\right]=-\int\left\|\nabla\ln\frac{\tilde{\pi}_{t}}{\pi}(x)\right\|^{2}\mathop{}\!\mathrm{d}{\tilde{\pi}_{t}(x)},\qquad\operatorname{Ent}_{\pi}\left[\frac{\tilde{\pi}_{0}}{\pi}\right]-\operatorname{Ent}_{\pi}\left[\frac{\tilde{\pi}_{T}}{\pi}\right]=\int_{0}^{T}\int\left\|\nabla\ln\frac{\tilde{\pi}_{t}}{\pi}(x)\right\|^{2}\mathop{}\!\mathrm{d}{\tilde{\pi}_{t}(x)}\mathop{}\!\mathrm{d}{t},

and thus \lim_{T\rightarrow\infty}\min_{t\in[0,T]}\int\left\|\nabla\ln\frac{\tilde{\pi}_{t}}{\pi}(x)\right\|^{2}\mathop{}\!\mathrm{d}{\tilde{\pi}_{t}(x)}=0. Following this methodology, [3] considers a notion of convergence defined via the Fisher Information \operatorname{FI}(\mu\|\pi)\triangleq\int\|\nabla\ln\frac{\mu}{\pi}\|^{2}\mathop{}\!\mathrm{d}{\mu} and uses it to analyze the convergence of LMC. We present the result of [3] for completeness:

Proposition 1 (Theorem 2, [3]).

Assume \nabla V is L-Lipschitz. Then, for any step size h\in(0,\frac{1}{6L}),

\frac{1}{Th}\int_{0}^{Th}\operatorname{FI}_{\pi}\left(\pi_{t}\|\pi\right)\mathrm{d}t\leq\frac{2\operatorname{Ent}\left(\frac{\pi_{0}}{\pi}\right)}{Th}+8L^{2}dh.

4.2 Rates of convergence

4.2.1 Conditional mixing under local Logarithmic Sobolev Inequality

We first show that if the target distribution \pi obeys a local Logarithmic Sobolev Inequality (defined below), then convergence of the Fisher information implies conditional convergence.

Definition 4 (Local Logarithmic Sobolev Inequality).

We say that a distribution \pi on \mathbb{R}^{d} satisfies the local Logarithmic Sobolev Inequality over a subset S\subset\mathbb{R}^{d} with constant \alpha (abbreviated as \operatorname{LSI}_{S}(\alpha)) if \pi|S satisfies \operatorname{LSI}(\alpha).

Definition 4 characterizes a local property of a distribution. It is considerably weaker than the global LSI (Definition 3) when S is convex, and it recovers the global LSI when S=\mathbb{R}^{d}. The intuition is that when the distribution is multi-modal, it is quite possible that LSI does not hold (or holds only with an exponentially small constant). In contrast, we can reasonably expect local LSI to hold within each mode. In Subsection 4.3, we will show that a Gaussian mixture model whose components share the same covariance satisfies the local Logarithmic Sobolev Inequality. We show in the following lemma that, whenever we have an estimate of the Fisher information between \mu and \pi, we can turn it into an estimate of the entropy between \mu|S and \pi|S, given \operatorname{LSI}_{S}(\alpha).

Lemma 1.

Let S\subset\mathcal{X} and assume that \pi obeys \operatorname{LSI}_{S}(\alpha). For any distribution \mu such that \operatorname{FI}(\mu\|\pi)\leq\varepsilon, we have that either \mu(S)\leq\frac{\sqrt{\varepsilon}}{\sqrt{\alpha}}, or \operatorname{Ent}_{\pi|{S}}[\frac{\mu|{S}}{\pi|{S}}]\leq\frac{\sqrt{\varepsilon}}{\sqrt{\alpha}}.

We defer the proof of Lemma 1 to Appendix A.1. Intuitively, Lemma 1 states that if the distribution \pi satisfies local LSI over the region S and has small global Fisher information with respect to a distribution \mu, then either the probability mass of \mu over S is small, or \mu is close to \pi conditioned on S. In other words, \mu is close to \pi conditioned on S whenever \mu places a non-negligible mass on S. Note that Proposition 1 already bounds the global Fisher information of LMC under mild assumptions, which together with Lemma 1 leads to the following corollary:

Corollary 1.

Assume \nabla V is L-Lipschitz and S\subset\mathcal{X} satisfies that \pi obeys \operatorname{LSI}_{S}(\alpha). Define \bar{\pi}_{Th}=\frac{\int_{0}^{Th}\pi_{t}\mathop{}\!\mathrm{d}t}{Th}. Choosing step size h=\frac{1}{\sqrt{T}}, we have that either \bar{\pi}_{Th}(S)\leq\mathcal{O}\left(\frac{\sqrt{d}\sqrt[4]{\operatorname{Ent}\left(\frac{\pi_{0}}{\pi}\right)}}{\sqrt{\alpha}\sqrt[4]{T}}\right), or \operatorname{Ent}_{\pi|S}\left[\frac{\bar{\pi}_{Th}|S}{\pi|S}\right]\leq\mathcal{O}\left(\frac{\sqrt{d}\sqrt[4]{\operatorname{Ent}\left(\frac{\pi_{0}}{\pi}\right)}}{\sqrt{\alpha}\sqrt[4]{T}}\right).

Corollary 1 is one of our key results. Intuitively, it shows that LMC achieves local mixing when only local LSI is assumed. As discussed below Definition 4, we can reasonably expect local LSI to hold within each mode of a multi-modal distribution, in which case Corollary 1 shows local mixing at a polynomial rate. Note that the local mixing rate depends only polynomially on the dimension d; for multi-modal distributions, working with local mixing thus avoids the exponential dimension dependence that can appear in the global mixing rate.

4.2.2 Conditional Convergence under local Poincaré Inequality

We have established the conditional convergence of LMC under the local Logarithmic Sobolev Inequality above. However, there are cases where even local LSI fails to hold. To tackle these cases, and inspired by the Poincaré Inequality, we introduce a weaker notion than local LSI, called the local Poincaré Inequality:

Definition 5 (Local Poincaré Inequality).

A distribution \pi on \mathbb{R}^{d} satisfies the local Poincaré Inequality over a subset S\subset\mathbb{R}^{d} with constant \rho (abbreviated as \operatorname{PI}_{S}(\rho)) if \pi|S satisfies \operatorname{PI}(\rho).

As the Poincaré Inequality is weaker than the Logarithmic Sobolev Inequality, one can easily see that the local Poincaré Inequality is also weaker than the local Logarithmic Sobolev Inequality. Based on the local Poincaré Inequality, we have the following Poincaré version of Lemma 1, which converts an estimate of the Poincaré Fisher information into an estimate of the chi-squared divergence between the conditional distributions.

Lemma 2.

Define the Poincaré Fisher information as \operatorname{PFI}(\mu\|\pi)\triangleq\int\|\nabla\frac{\mu}{\pi}\|^{2}\mathop{}\!\mathrm{d}{\pi}. Let S\subset\mathcal{X} and assume that \pi obeys \operatorname{PI}_{S}(\rho). For any distribution \mu such that \operatorname{PFI}(\mu\|\pi)\leq\varepsilon, we have \mu(S)^{2}\mathrm{Var}_{\pi|_{S}}[\frac{\mu|_{S}}{\pi|_{S}}]\leq\frac{\varepsilon\pi(S)}{\rho}.

To derive the conditional convergence of LMC under the local Poincaré inequality, we further need to bound the Poincaré Fisher information of LMC. Such a result, however, does not exist in the literature to the best of our knowledge. In analogy to the analysis in [3], we develop the following proposition to control the Poincaré Fisher information of LD (recall that \tilde{\pi}_{t} is the distribution of Langevin Dynamics at time t).

Proposition 2.

Denote \bar{\tilde{\pi}}_{Th}=\frac{\int_{0}^{Th}\tilde{\pi}_{t}\mathop{}\!\mathrm{d}{t}}{Th}. Then, we have the following estimate of the PFI between \bar{\tilde{\pi}}_{Th} and \pi:

\operatorname{PFI}(\bar{\tilde{\pi}}_{Th}\|\pi)\leq\frac{\mathrm{Var}_{\pi}[\frac{\tilde{\pi}_{0}}{\pi}]}{2Th}.

The proof follows by taking the derivative of \mathrm{Var}_{\pi}[\frac{\tilde{\pi}_{t}}{\pi}] with respect to t; we defer the formal proof to Appendix A.2. Combining Lemma 2, Proposition 2, and recent advances in the discretization error analysis of LMC [19], we obtain the following result showing conditional convergence under the local Poincaré inequality.

Corollary 2.

Assume \nabla V and \nabla^{2}V are L-Lipschitz, V satisfies strong dissipativity, i.e., \langle\nabla V(x),x\rangle\geq m\|x\|^{2}-b, and S\subset\mathcal{X} satisfies that \pi|_{S} obeys \operatorname{PI}(\rho). Initialize z_{0}\sim\mathcal{N}(0,\sigma^{2}\mathbb{I}) and select h=\tilde{\Theta}\left(\frac{\mathrm{Var}_{\pi}[\frac{\tilde{\pi}_{0}}{\pi}]^{\frac{1}{3}}}{T^{\frac{2}{3}}\rho^{\frac{1}{3}}d^{\frac{2}{3}}}\right). Recall that \bar{\pi}_{Th}=\frac{\int_{0}^{Th}\pi_{t}\mathop{}\!\mathrm{d}t}{Th}. Then either \bar{\pi}_{Th}(S)=\mathcal{O}(\frac{d^{\frac{1}{6}}\mathrm{Var}_{\pi}[\frac{\tilde{\pi}_{0}}{\pi}]^{\frac{1}{6}}}{\rho^{\frac{1}{6}}T^{\frac{1}{12}}}), or D_{TV}\left[{\bar{\pi}_{Th}|{S}}||{\pi|{S}}\right]=\mathcal{O}(\frac{d^{\frac{1}{6}}\mathrm{Var}_{\pi}[\frac{\tilde{\pi}_{0}}{\pi}]^{\frac{1}{6}}}{\rho^{\frac{1}{6}}T^{\frac{1}{12}}}).

Corollary 2 shows that conditional mixing can be established when only the local PI holds, which is a weaker assumption than the local LSI. However, the other assumptions of Corollary 2 are stronger than those of its LSI counterpart, and the mixing rate depends on \mathrm{Var}_{\pi}[\frac{\tilde{\pi}_{0}}{\pi}] instead of {\operatorname{Ent}\left(\frac{\tilde{\pi}_{0}}{\pi}\right)}, which can be considerably larger. These drawbacks are mainly due to the fact that the dynamics of the chi-squared divergence are much harder to control than those of the KL divergence, even when the global PI holds [12].

Extension to Rényi divergence. In some prior works, e.g. [5, 23], the Rényi divergence is considered a natural "measure" of convergence, due to the analytical simplicity of the resulting expressions. One may wonder whether our methodology can be extended to establish a local mixing rate in Rényi divergence. Unfortunately, there are technical difficulties that we have not yet found a way to tackle or bypass; we list them in Appendix A.3.

4.3 Applications

In this subsection, we apply Corollary 1 and Corollary 2 to two concrete examples to demonstrate their usage in obtaining local mixing rates. We start by analyzing the example of sampling from Gaussian Mixture Models with uniform covariance and then turn to the example of sampling from the power posterior distribution of symmetric two-component Gaussian mixtures.

4.3.1 Case Study: Sampling from a Gaussian Mixture Model with Uniform Covariance

Target distribution and potential function. We are interested in the Gaussian mixture model formed by Gaussians with the same covariance but different means. Specifically, the target distribution is defined as \pi=\sum_{i=1}^{n}w_{i}p_{i}\in\Delta(\mathbb{R}^{d}), where w_{i}>0, \sum_{i=1}^{n}w_{i}=1, p_{i} is the density of \mathcal{N}(\mu_{i},\bm{\Sigma}), and \bm{\Sigma}\succ\sigma^{2}\mathbb{I}_{d}. The potential function is then defined as V(x)=-\log(\sum_{i=1}^{n}w_{i}p_{i}(x)).

Partition of Space. We divide the space according to which component has the largest density. Specifically, we define S_{i}\triangleq\{x:p_{i}(x)\geq p_{j}(x),\forall j\neq i\}. One can easily verify that \cup_{i=1}^{n}S_{i}=\mathbb{R}^{d}. Furthermore, S_{i} is convex since, by the definition of p_{i}, we have

S_{i}=\{x:(x-\mu_{i})^{\top}\bm{\Sigma}^{-1}(x-\mu_{i})\leq(x-\mu_{j})^{\top}\bm{\Sigma}^{-1}(x-\mu_{j}),\forall j\neq i\}.

As \bm{\Sigma} is positive definite and symmetric, we can decompose \bm{\Sigma} into UU, where U is also positive definite and symmetric. We then obtain

U^{-1}S_{i}=\{x:(x-U^{-1}\mu_{i})^{\top}(x-U^{-1}\mu_{i})\leq(x-U^{-1}\mu_{j})^{\top}(x-U^{-1}\mu_{j}),\forall j\neq i\},

and thus U^{-1}S_{i} is one cell of the Voronoi diagram generated by \{U^{-1}\mu_{i}\}_{i=1}^{n}, which is convex. As S_{i} is obtained by applying the linear transformation U to U^{-1}S_{i}, we conclude that S_{i} is also convex.
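As an illustration of this setup (a sketch under our own naming, not the code used in our experiments), the potential V, its gradient for the LMC update, and the partition index of a point can be computed as follows; the gradient uses the identity \nabla V(x)=\bm{\Sigma}^{-1}(x-\sum_{i}r_{i}(x)\mu_{i}), where r_{i}(x) is the posterior responsibility of component i:

import numpy as np

# Sketch: potential, gradient, and partition index for a shared-covariance mixture.
def make_mixture(means, weights, Sigma):
    Sigma_inv = np.linalg.inv(Sigma)
    _, logdet = np.linalg.slogdet(Sigma)
    d = means.shape[1]

    def log_densities(x):
        # log p_i(x) for each component i (shared covariance Sigma)
        diffs = x - means                                    # shape (n, d)
        quad = np.einsum('nd,de,ne->n', diffs, Sigma_inv, diffs)
        return -0.5 * (quad + logdet + d * np.log(2 * np.pi))

    def V(x):
        # V(x) = -log sum_i w_i p_i(x)
        return -logsumexp(np.log(weights) + log_densities(x))

    def grad_V(x):
        logc = np.log(weights) + log_densities(x)
        post = np.exp(logc - logsumexp(logc))                # responsibilities r_i(x)
        return Sigma_inv @ (x - post @ means)                # Sigma^{-1} (x - sum_i r_i(x) mu_i)

    def partition(x):
        # index i such that x lies in S_i = {p_i(x) >= p_j(x) for all j}
        return int(np.argmax(log_densities(x)))

    return V, grad_V, partition

def logsumexp(a):
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

Here grad_V can be plugged into the LMC sketch after Algorithm 1, and partition assigns each iterate to a region S_i, which is what the conditional quantities in Corollary 3 below are computed over.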

Verification of the local Logarithmic Sobolev Inequality. We prove the local Logarithmic Sobolev Inequality of \pi over each partition S_{i} as follows.

Lemma 3.

For all i\in\{1,\cdots,n\}, \pi|S_{i} obeys \operatorname{LSI}(\frac{\sigma^{-2}\min_{i\in[n]}w_{i}}{\max_{i\in[n]}w_{i}}).

Lemma 3 is proved by first showing that p_{i}|S_{i} obeys \operatorname{LSI}(\sigma^{-2}) via the Bakry-Émery criterion, using the convexity of S_{i} and the strong convexity of the potential of p_{i}, and then applying the Holley-Stroock perturbation principle, viewing \pi as a perturbation of p_{i} over S_{i}. The full proof is deferred to Appendix B.1.

Verification of the Lipschitz gradient. By direct calculation, we derive the following bound on the Lipschitz constant of \nabla V.

Lemma 4.

\forall x\in\mathbb{R}^{d}, \|\nabla^{2}V(x)\|\leq\frac{\max_{i,j}\|\mu_{i}-\mu_{j}\|^{2}}{\sigma^{4}}+\sigma^{-2}.

As a conclusion, we obtain the following guarantee for running LMC on this example.

Corollary 3.

Let the step size of LMC be h=\Theta(\frac{1}{\sqrt{T}}). Assume T>\Theta(\frac{d^{2}}{\varepsilon^{4}}\frac{\max_{i\in[n]}w_{i}^{2}}{\sigma^{-4}\min_{i\in[n]}w_{i}^{2}}). Then, for every i\in[n], either \bar{\pi}_{Th}(S_{i})\leq\varepsilon, or \operatorname{Ent}_{\pi|S_{i}}\left[\frac{\bar{\pi}_{Th}|S_{i}}{\pi|S_{i}}\right]\leq\varepsilon.

Corollary 3 indicates that running LMC on a Gaussian mixture (with shared covariance) ensures local mixing at a cost polynomial in the dimension and independent of the distances between the means. By contrast, it is folklore that even a two-component Gaussian mixture has a global mixing time exponential in the distance between the means; this suggests that the local mixing analysis gives a more fine-grained account of what happens within each component.

4.3.2 Sampling from the power posterior distribution of symmetric Gaussian mixtures

Target distribution and potential function. The symmetric Gaussian mixture model is given as

f_{\theta_{0}}(x)\triangleq\frac{1}{2}\varphi(x;\theta_{0},\mathbb{I}_{d})+\frac{1}{2}\varphi(x;-\theta_{0},\mathbb{I}_{d}).

Here \varphi(x;\theta_{0},\mathbb{I}_{d}) denotes the multivariate Gaussian density with location parameter \theta_{0}\in\mathbb{R}^{d} and covariance matrix \mathbb{I}_{d}. Without loss of generality, we assume \theta_{0}\in\operatorname{span}\{e_{1}\}, i.e., all coordinates of \theta_{0} except the first are zero, and \theta_{0}=\|\theta_{0}\|e_{1}. The power posterior distribution, or the target distribution, is defined as \pi_{n,\beta/n}\left(\theta\mid\left\{X_{i}\right\}_{i=1}^{n}\right):=\frac{\prod_{i=1}^{n}\left(f_{\theta}\left(X_{i}\right)\right)^{\beta/n}\lambda(\theta)}{\int\prod_{i=1}^{n}\left(f_{u}\left(X_{i}\right)\right)^{\beta/n}\lambda(u)du}. We set \lambda\equiv 1 for simplicity. When no ambiguity is possible, we abbreviate \pi_{n,\beta/n}\left(\theta\mid\left\{X_{i}\right\}_{i=1}^{n}\right) as \pi(\theta). The potential function is then given as V(\theta):=-\frac{\beta}{n}\sum_{i=1}^{n}\log\left(\frac{1}{2}\varphi\left(\theta-X_{i}\right)+\frac{1}{2}\varphi\left(\theta+X_{i}\right)\right)-\log\lambda(\theta).
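For illustration (a sketch under our own naming, with \lambda\equiv 1 and \varphi the standard Gaussian density; not the paper's experiment code), the potential and its gradient admit a simple closed form: one can check that \nabla V(\theta)=\beta\big(\theta-\frac{1}{n}\sum_{i=1}^{n}\tanh(\theta^{\top}X_{i})X_{i}\big), which can be plugged into the LMC sketch above.

import numpy as np

# Sketch: potential (up to an additive constant) and gradient of the power posterior
# for the symmetric two-component Gaussian mixture, assuming lambda == 1.
def make_power_posterior(X, beta):
    n, d = X.shape

    def V(theta):
        s = X @ theta
        # log(0.5*phi(theta - X_i) + 0.5*phi(theta + X_i))
        #   = -(||theta||^2 + ||X_i||^2)/2 + log cosh(theta^T X_i) + const
        logs = -(theta @ theta + np.sum(X**2, axis=1)) / 2 + np.logaddexp(s, -s) - np.log(2)
        return -(beta / n) * np.sum(logs)

    def grad_V(theta):
        return beta * (theta - np.mean(np.tanh(X @ theta)[:, None] * X, axis=0))

    return V, grad_V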

Partition of Space. With an accuracy budget \varepsilon, we define R_{1}=[0,A]\times\mathcal{B}(0,M), R_{2}=[-A,0]\times\mathcal{B}(0,M), and R_{3}=(R_{1}\cup R_{2})^{c}, where A and M are \varepsilon-dependent hyperparameters specified later.

Verification of the local Poincaré Inequality over R_{1} and R_{2}. We have the following characterization of the local Poincaré Inequality over R_{1} and R_{2}:

Lemma 5.

If A,M\geq\|\theta_{0}\|+1 and n\geq\tilde{\Theta}((A+M)^{2}d\log(1/\delta)), then with probability at least 1-\delta with respect to the sampling of \{X_{i}\}_{i=1}^{n}, \pi obeys \operatorname{PI}_{R_{1}}(\Theta(1/(A^{4}M^{2}))) and \operatorname{PI}_{R_{2}}(\Theta(1/(A^{4}M^{2}))).

The lemma is derived by first considering the distribution \bar{\pi} corresponding to the potential function \bar{V}=\mathbb{E}V and proving the local Poincaré Inequality for \bar{\pi}|R_{1} (and \bar{\pi}|R_{2}), then bounding the difference between \bar{V} and V through a concentration inequality, and finally applying the Holley-Stroock perturbation principle to pass the local Poincaré Inequality from \bar{\pi}|R_{1} to \pi|R_{1}. A complete proof is deferred to Appendix B.2.

Verification of strong dissipativity. Similarly to the local Poincaré Inequality, we can show that strong dissipativity of V holds with high probability.

Lemma 6.

If n>\tilde{\Theta}((d+\|\theta_{0}\|^{2})\log(1/\delta)), then with probability at least 1-\delta over the sampling of \{X_{i}\}_{i=1}^{n}, \|\nabla^{2}V(\theta)\|\leq\mathcal{O}(1+\|\theta_{0}\|^{2}), \langle\nabla V(\theta),\theta\rangle\geq\Omega(\|\theta\|^{2})-\mathcal{O}\left(\left\|\theta_{0}\right\|^{2}+1\right), and \|\nabla^{3}V(\theta)\|\leq\mathcal{O}(d).

Bounding the probability of R_{3}. Using strong dissipativity, we can bound \pi_{t}(R_{3}) as follows.

Lemma 7.

If n>\tilde{\Theta}((d+\|\theta_{0}\|^{2})\log(1/\delta)), then with probability at least 1-\delta over the sampling of \{X_{i}\}_{i=1}^{n}, we have \pi_{t}(R_{3})\leq 32e^{\frac{16\beta\left(\left\|\theta_{0}\right\|^{2}+1\right)+6d-\min\{A,M\}^{2}}{2\beta}}.

All in all, we obtain the following characterization of running LMC on this example.

Corollary 4.

Initialize z_{0}\sim\mathcal{N}(0,\sigma^{2}\mathbb{I}) with \sigma^{2}\leq\frac{1}{1+L} and select h=\tilde{\Theta}\left(\frac{1}{T^{\frac{2}{3}}}\right). Set A=M=\Theta(d+\log(1/\varepsilon)). If T=\Omega(d^{2}/\varepsilon^{12}) and n>\tilde{\Theta}((d+\|\theta_{0}\|^{2})\log(1/\delta)), then with probability at least 1-\delta over the sampling of \{X_{i}\}_{i=1}^{n}, either \bar{\pi}_{Th}(R_{i})\leq\varepsilon or D_{TV}\left[\bar{\pi}_{Th}|R_{i}||{\pi|R_{i}}\right]\leq\varepsilon for i\in\{1,2\}, and \bar{\pi}_{Th}(R_{3})\leq\varepsilon.

As a comparison, this problem was also studied in [20], where a specialized sampling algorithm that exploits prior knowledge of the problem is constructed to achieve global convergence. In contrast, we analyze plain LMC, which requires no such prior knowledge, and derive its conditional convergence.

5 Conditional Mixing for Gibbs Sampling on Finite States

In the previous sections, we showed that conditional mixing can happen for LMC on a continuous state space. We now show that similar results hold for MCMC algorithms on a finite state space. For simplicity, we consider an energy function f:\left\{0,1\right\}^{d}\to\left\{0,1,2,...,M\right\}=:[M] defined on the vertices of a hypercube, and denote its corresponding Gibbs measure \pi(x)\propto e^{-f(x)}.

We view the vertices of the hypercube as a d-regular graph, where for any x,y\in\left\{0,1\right\}^{d}, an edge x\sim y exists if and only if x and y differ in exactly one coordinate, i.e., d_{\text{Hamming}}(x,y)=1. A lazy Gibbs sampler then has the following transition matrix on this finite graph:

p(x,y)=\begin{cases}\frac{1}{2d}\frac{\pi(y)}{\pi(y)+\pi(x)},&y\sim x,\\ 1-\frac{1}{2d}\sum_{x^{\prime},\text{s.t. }x^{\prime}\sim x}\frac{\pi(x^{\prime})}{\pi(x^{\prime})+\pi(x)},&y=x,\\ 0,&\text{otherwise}.\end{cases}

Note that the process is lazy because p(x,x)\geq 1/2 for any x. This is assumed for simplicity of analysis, to avoid near-periodic behavior. The assumption does not make the analysis less general, as a lazy self-loop with probability 1/2 only changes the mixing time by a multiplicative absolute constant (see Corollary 9.5 of [22]).

To prove conditional convergence, we need an analogue of the local Poincaré Inequality. The story for discrete state spaces is more involved, as the transition matrix, in addition to the stationary distribution, plays a role here. Many of the analyses below are inspired by [13].

First, given a finite number of subsets \{\mathcal{X}_{i}\}_{i\leq m}, we define the conditional probability \pi_{i}=\pi|\mathcal{X}_{i} supported on \mathcal{X}_{i}, and hence, for w_{i}=\pi(\mathcal{X}_{i}),

\pi(x)=\sum_{i=1}^{m}w_{i}\pi_{i}(x)\mathbbm{1}\left\{x\in\mathcal{X}_{i}\right\}.

We also need to design a conditioned transition kernel such that \forall x,y\in\mathcal{X}_{i},

p_{i}(x,y)=p(x,y)+\mathbbm{1}\left\{x=y\right\}P(x,\mathcal{X}_{i}^{c}), (3)

where \mathcal{X}_{i}^{c} denotes the complement of \mathcal{X}_{i}; the conditioned kernel simply rejects all outgoing transitions. One can check that p_{i} is reversible with unique stationary distribution \pi_{i}. We are now ready to give an analogue of Corollary 1.
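The following sketch (our own illustration under arbitrary naming, feasible only for small d since it enumerates all 2^d states) builds the lazy Gibbs kernel p(x,y) and the conditioned kernel p_i of Eq. (3) as explicit matrices:

import numpy as np
from itertools import product

# Sketch: lazy Gibbs kernel on {0,1}^d and the conditioned kernel of Eq. (3).
def gibbs_kernel(f, d):
    states = [np.array(s) for s in product([0, 1], repeat=d)]
    pi = np.array([np.exp(-f(x)) for x in states])
    pi /= pi.sum()                               # Gibbs measure pi(x) proportional to e^{-f(x)}
    n = len(states)
    P = np.zeros((n, n))
    for i, x in enumerate(states):
        for k in range(d):
            y = x.copy()
            y[k] ^= 1                            # flip one coordinate: a neighbor of x
            j = int("".join(map(str, y)), 2)     # index of y in the enumeration order
            P[i, j] = pi[j] / (2 * d * (pi[i] + pi[j]))
        P[i, i] = 1.0 - P[i].sum()               # lazy self-loop, at least 1/2
    return states, pi, P

def conditioned_kernel(P, idx):
    # Restrict to the state indices in idx, rejecting all outgoing moves (Eq. (3)).
    Pi = P[np.ix_(idx, idx)].copy()
    Pi[np.diag_indices_from(Pi)] += 1.0 - Pi.sum(axis=1)  # add P(x, X_i^c) to the diagonal
    return Pi

The spectral gap of p_i appearing in Theorem 1 below is then one minus the second-largest eigenvalue of the returned matrix (see the numerical check after Theorem 2).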

Theorem 1.

Given a sequence of subsets \left\{\mathcal{X}_{i}\right\}_{i\leq m}, suppose that for every i, the kernel p_{i} defined in (3), viewed as a Markov chain on \mathcal{X}_{i}, has spectral gap at least \alpha. Then either, for some t, the distribution \mu_{t} has small probability on \mathcal{X}_{i}, namely \mu_{t}(\mathcal{X}_{i})\leq\pi(\mathcal{X}_{i})T^{-1/4}, or conditional mixing happens with respect to the chi-squared divergence:

\frac{1}{T}\sum_{t}\operatorname{Var}_{\pi_{i}}\left[\frac{\mu_{t}|\mathcal{X}_{i}}{\pi|\mathcal{X}_{i}}\right]\leq\frac{1}{\alpha\sqrt{T}}\operatorname{Var}_{\pi}\left[\frac{\mu_{0}}{\pi}\right].

The proof builds upon the convergence of the Dirichlet form and can be found in Appendix C. The theorem says that if the spectral gap of the Gibbs sampler restricted to a local subset is lower bounded, then after a polynomial number of iterations, either \mu_{t} has very small probability on this set, or, conditioned on the set, the distribution is close to the stationary distribution.

What is less clear for finite-state Gibbs sampling, compared with LMC, is when the spectral gap of the restricted chain is large. For the Langevin Monte Carlo algorithm, classical results show that the spectral gap cannot be too small if the stationary distribution is locally close to log-concave. Below we provide an analogue of this observation for discrete state spaces.

5.1 Spectral Gap for Gibbs Sampling

We first define a graph \mathcal{G}=(V,E), where V=\{0,1\}^{d} and (x,y)\in E if and only if d_{\text{Hamming}}(x,y)=1. Note that Gibbs sampling, in each iteration, can only move to a neighbor or stay at the current vertex. We now introduce a counterpart of quasi-convexity for functions defined on the vertices of a graph.

Definition 6.

Let \mathcal{G}^{\prime}=(V^{\prime},E^{\prime}) be a subgraph, with V^{\prime}\subseteq V and E^{\prime}=\{(x,y)\in E\,|\,x\in V^{\prime},y\in V^{\prime}\}. We say a function f:V\to[M] is quasi-convex with radius D on the subgraph \mathcal{G}^{\prime}=(V^{\prime},E^{\prime}) if there exists a local minimum v^{*}\in V^{\prime} such that for any v\in V^{\prime}, any shortest path v\to v_{1}\to...\to v^{*} from v to v^{*} is of length at most D, and f is non-increasing along the path.

Before providing a guarantee for the spectral gap, we give two examples of quasi-convexity.

Example 5.1.

If g:\mathbbm{R}^{+}\to\mathbbm{R} is a quasi-convex function defined on the positive reals, then for a given x^{*}\in V^{\prime}\subseteq V, any function f(x)=g(a\cdot d(x,x^{*})+b) is a quasi-convex function on the graph, where a,b are reals and d(x,y) is the shortest-path length from x to y.

Example 5.2.

If on a subset V^{\prime}\subseteq V, f(x)=c^{T}x for some c\in\mathbbm{R}^{d}, x\in\{0,1\}^{d} (i.e., f is a linear function), then f is quasi-convex on the graph.

The next theorem states that for such a function, the spectral gap of the conditioned chain is at least inverse polynomial in the problem dimension d and the radius D of the region.

Theorem 2.

If a function f is quasi-convex with radius D on a subset \mathcal{X}_{i}\subseteq\{0,1\}^{d}, then the conditional transition of Gibbs sampling defined in (3) has a spectral gap lower bounded by \frac{1}{16d^{2}D^{2}}.

The proof studies the conductance of the Markov chain and can be found in Appendix D. The theorem suggests that although Gibbs sampling can mix very slowly globally, on any subset with a well-connected local minimum, conditional mixing to the stationary distribution can be fast. This concludes our analysis, and we move on to verify our statements with experiments in the next section.
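As a quick numerical sanity check (a sketch reusing the gibbs_kernel and conditioned_kernel helpers from the sketch in Section 5; the dimension and the vector c are arbitrary illustrative choices), one can compare the observed spectral gap for a linear energy, which is quasi-convex on the whole cube by Example 5.2 with radius D\leq d, against the bound of Theorem 2:

import numpy as np

# Assumes gibbs_kernel / conditioned_kernel from the sketch in Section 5.
d = 4
c = np.linspace(0.5, 2.0, d)
states, pi, P = gibbs_kernel(lambda x: float(c @ x), d)
idx = list(range(len(states)))                   # take X_i to be the whole cube here
Pi = conditioned_kernel(P, idx)
pi_i = pi[idx] / pi[idx].sum()                   # conditional stationary distribution pi_i
S = Pi * np.sqrt(pi_i[:, None] / pi_i[None, :])  # symmetric, since Pi is reversible w.r.t. pi_i
gap = 1.0 - np.sort(np.linalg.eigvalsh(S))[-2]
print(gap, 1.0 / (16 * d**2 * d**2))             # observed gap vs. the Theorem 2 lower bound (D <= d)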

6 Experiments

6.1 Observations on Gaussian Mixture

In this section, we conduct experiments to verify the theoretical results and to compare global mixing with conditional mixing for Gaussian mixture models. We take three Gaussian mixtures as target distributions: \nu_{1}=0.9\mathbb{N}_{1}(-10,1)+0.1\mathbb{N}_{1}(10,1), \nu_{2}=0.15\mathbb{N}_{1}(-5,1)+0.15\mathbb{N}_{1}(-2.5,1)+0.3\mathbb{N}_{1}(0,1)+0.2\mathbb{N}_{1}(2.5,1)+0.2\mathbb{N}_{1}(5,1), and \nu_{3}=0.4\mathbb{N}_{2}((-5,-5),I_{2})+0.4\mathbb{N}_{2}((5,5),I_{2})+0.1\mathbb{N}_{2}((-5,5),I_{2})+0.1\mathbb{N}_{2}((5,-5),I_{2}). We use Algorithm 1 as the sampling algorithm with step size h=10^{-2}. The initial distributions are uniform over a sufficiently large range. We plot the sampling distributions after T=500,5000,500 rounds respectively in Figures 1(a), 1(b), and 1(c), and plot the conditional and global KL divergences in Figures 1(d), 1(e), and 1(f).

Figure 1: (a)-(c) sample distributions after T iterations for \nu_{1}, \nu_{2}, \nu_{3}; (d)-(f) conditional and global KL divergences with respect to t for \nu_{1}, \nu_{2}, \nu_{3}.

We make the following observations: (1) The global KL divergences of the sampling distributions for \nu_{1} and \nu_{3} decrease fast at first, then plateau at a constant level and never converge to 0. This is expected, since \nu_{1} and \nu_{3} have very bad LSI constants (exponential in the distance between the means); by classical Langevin Dynamics analyses [23], the global KL divergence then has an exponentially slow convergence rate. (2) Both the global and the conditional divergences of the sampling distribution for \nu_{2} converge very fast. This is because dense Gaussian mixtures like \nu_{2} have good LSI constants. The KL curves are noisy due to the limited numerical precision of the KL estimates. (3) The conditional divergence of the sampling distribution for \nu_{2} converges faster than the global divergence, because the global LSI constant bound for \nu_{2} is much worse than the local guarantee in Corollary 1; that is, conditional convergence is faster than global convergence. (4) The conditional KL divergences of the sampling distributions for \nu_{1} and \nu_{3} converge to 0 very fast (the flat part is due to limited numerical precision), which can be seen as a verification of our theoretical results. (5) The sampling distributions for \nu_{1} and \nu_{3} after T iterations contain all of the Gaussian components of the target distributions; the only difference lies in the weights. Since learning the right weights and components would directly lead to global KL convergence, this observation illustrates the gap between global convergence and conditional convergence.

6.2 Observations on LMC with restarts

In the LMC analysis, we study the evolution of a distribution p_{0} that is usually absolutely continuous. However, in practice, a more common implementation is the following: one first randomly generates an initial point x_{0}, runs the sampling algorithm for T iterations, and then collects samples along a single trajectory. A gap between theory and practice here is that the theory assumes continuity and smoothness conditions on the initial distribution, while in practice we only generate a limited number of initial points (sometimes only one) to run the sampling algorithm.

For log-concave sampling, this gap is usually negligible, since the ergodicity of Langevin Dynamics guarantees that a single trajectory eventually captures the features of the target distribution. It is therefore reasonable that there are plenty of works on the discretization error in time, while the approximation error of the initial distribution receives little attention. For non-log-concave sampling, however, this gap may become crucial. We conduct several experiments in Appendix E to examine this conjecture and show that LMC with restarts can empirically close the gap and improve the convergence speed.
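A minimal sketch of the restart heuristic discussed above (our own illustration, not the code behind Appendix E; the initialization range and the number of chains are arbitrary): run several independent LMC chains from fresh random initial points and pool the iterates.

import numpy as np

# Sketch: LMC with restarts. grad_V is assumed to return the gradient of the potential.
def lmc_with_restarts(grad_V, d, h=1e-2, n_steps=5000, n_restarts=10, init_scale=10.0,
                      rng=np.random.default_rng(0)):
    samples = []
    for _ in range(n_restarts):
        z = init_scale * (2 * rng.random(d) - 1)      # fresh uniform initialization
        for _ in range(n_steps):
            z = z - h * grad_V(z) + np.sqrt(2 * h) * rng.standard_normal(d)
            samples.append(z.copy())
    return np.array(samples)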

7 Conclusions

Our work examines sampling problems where the global mixing of an MCMC algorithm is slow. We show that in such cases, fast conditional mixing can still be achieved on subsets where the target distribution has a benign local structure. We make these statements rigorous and provide polynomial-time guarantees for conditional mixing. We give several examples, such as mixtures of Gaussians and the power posterior, showing that a benign local structure often exists even when the global mixing rate is exponentially slow.

Much remains to be done. Theoretically, whether faster convergence rates can be achieved, or whether a lower bound exists, remains unknown. Instantiating our analyses for more MCMC algorithms may also lead to new observations. More importantly, the implications of being able to sample efficiently from local distributions require more careful analysis. This may lead to new theoretical guarantees for problems with symmetry (such as permutation symmetry, sign symmetry, or rotation invariance), where all local minima are equally good and sampling from any mode suffices.

Acknowledgments and Disclosure of Funding

Jingzhao Zhang acknowledges support from Tsinghua University Initiative Scientific Research Program.

References

  • [1] S. Ahn, Y. Chen, and M. Welling. Distributed and adaptive darting monte carlo through regenerations. In International Conference on Artificial Intelligence and Statistics, 2013.
  • [2] D. Bakry and M. Émery. Diffusions hypercontractives. In Séminaire de Probabilités XIX 1983/84: Proceedings, pages 177–206. Springer, 2006.
  • [3] K. Balasubramanian, S. Chewi, M. A. Erdogdu, A. Salim, and S. Zhang. Towards a theory of non-log-concave sampling: first-order stationarity guarantees for langevin monte carlo. In Conference on Learning Theory, pages 2896–2923. PMLR, 2022.
  • [4] X. Cheng and P. Bartlett. Convergence of langevin mcmc in kl-divergence. In Algorithmic Learning Theory, pages 186–211. PMLR, 2018.
  • [5] S. Chewi, M. A. Erdogdu, M. B. Li, R. Shen, and M. Zhang. Analysis of langevin monte carlo from poincaré to log-sobolev, 2021.
  • [6] A. S. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society. Series B (Statistical Methodology), pages 651–676, 2017.
  • [7] A. S. Dalalyan and A. Karagulyan. User-friendly guarantees for the langevin monte carlo with inaccurate gradient. Stochastic Processes and their Applications, 129(12):5278–5311, 2019.
  • [8] A. S. Dalalyan and A. B. Tsybakov. Sparse regression learning by aggregation and langevin monte-carlo. Journal of Computer and System Sciences, 78(5):1423–1443, 2012.
  • [9] A. Durmus, S. Majewski, and B. Miasojedow. Analysis of langevin monte carlo via convex optimization. The Journal of Machine Learning Research, 20(1):2666–2711, 2019.
  • [10] A. Durmus and E. Moulines. Nonasymptotic convergence analysis for the unadjusted langevin algorithm. 2017.
  • [11] M. A. Erdogdu and R. Hosseinzadeh. On the convergence of langevin monte carlo: The interplay between tail growth and smoothness, 2020.
  • [12] M. A. Erdogdu, R. Hosseinzadeh, and S. Zhang. Convergence of langevin monte carlo in chi-squared and rényi divergence. In International Conference on Artificial Intelligence and Statistics, pages 8151–8175. PMLR, 2022.
  • [13] Y. Guan and S. M. Krone. Small-world mcmc and convergence to multi-modal distributions: From slow mixing to fast mixing. 2007.
  • [14] A. T. Kalai and S. Vempala. Simulated annealing for convex optimization. Mathematics of Operations Research, 31(2):253–266, 2006.
  • [15] D. Kingma, T. Salimans, B. Poole, and J. Ho. Variational diffusion models. Advances in neural information processing systems, 34:21696–21707, 2021.
  • [16] S. Lan, J. Streets, and B. Shahbaba. Wormhole hamiltonian monte carlo, 2014.
  • [17] Y.-A. Ma, Y. Chen, C. Jin, N. Flammarion, and M. I. Jordan. Sampling can be faster than optimization. Proceedings of the National Academy of Sciences, 116(42):20881–20885, sep 2019.
  • [18] M. B. Majka, A. Mijatović, and L. Szpruch. Non-asymptotic bounds for sampling algorithms without log-concavity, 2019.
  • [19] W. Mou, N. Flammarion, M. J. Wainwright, and P. L. Bartlett. Improved bounds for discretization of langevin diffusions: Near-optimal rates without convexity. Bernoulli, 28(3):1577–1601, 2022.
  • [20] W. Mou, N. Ho, M. J. Wainwright, P. L. Bartlett, and M. I. Jordan. Sampling for bayesian mixture models: Mcmc with polynomial-time mixing. arXiv preprint arXiv:1912.05153, 2019.
  • [21] A. Mousavi-Hosseini, T. Farghly, Y. He, K. Balasubramanian, and M. A. Erdogdu. Towards a complete analysis of langevin monte carlo: Beyond poincaré inequality. arXiv preprint arXiv:2303.03589, 2023.
  • [22] Y. Peres and P. Sousi. Mixing times are hitting times of large sets. Journal of Theoretical Probability, 28:488–519, 2015.
  • [23] S. Vempala and A. Wibisono. Rapid convergence of the unadjusted langevin algorithm: Isoperimetry suffices. Advances in neural information processing systems, 32, 2019.
  • [24] S. S. Vempala. Recent progress and open problems in algorithmic convex geometry. In IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2010). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2010.
  • [25] A. Wibisono. Proximal langevin algorithm: Rapid convergence under isoperimetry. arXiv preprint arXiv:1911.01469, 2019.

Appendix A Proofs of results in Section 4.2

A.1 Local LSI: Proof of Lemma 1

Proof.

Since

\mu(S)\int_{S}\left\|\nabla\frac{\mu|_{S}}{\pi|_{S}}(x)\right\|^{2}\frac{\pi|_{S}(x)^{2}}{\mu|_{S}(x)}\mathop{}\!\mathrm{d}{x}=\int_{S}\left\|\nabla\frac{\mu}{\pi}(x)\right\|^{2}\frac{\pi(x)^{2}}{\mu(x)}\mathop{}\!\mathrm{d}{x}\leq\int_{\mathcal{X}}\left\|\nabla\frac{\mu}{\pi}(x)\right\|^{2}\frac{\pi(x)^{2}}{\mu(x)}\mathop{}\!\mathrm{d}{x}=\operatorname{FI}(\mu\|\pi)\leq\varepsilon,

if \mu(S)\leq\sqrt{\frac{\varepsilon}{\alpha}}, the proof is finished. Otherwise, we have

\alpha\operatorname{Ent}_{\pi|_{S}}\left[\frac{\mu|_{S}}{\pi|_{S}}\right]\leq\int_{S}\left\|\nabla\frac{\mu|_{S}}{\pi|_{S}}(x)\right\|^{2}\frac{\pi|_{S}(x)^{2}}{\mu|_{S}(x)}\mathop{}\!\mathrm{d}{x}\leq\frac{\varepsilon}{\mu(S)}\leq\sqrt{\varepsilon\alpha},

and hence \operatorname{Ent}_{\pi|_{S}}\left[\frac{\mu|_{S}}{\pi|_{S}}\right]\leq\sqrt{\frac{\varepsilon}{\alpha}}. The proof is completed. ∎

A.2 Local PI: Proof of Proposition 2

To begin with, we provide the proof of Lemma 2.

Proof of Lemma 2.

Since

\frac{\mu(S)^{2}}{\pi(S)}\int_{S}\left\|\nabla\frac{\mu|_{S}}{\pi|_{S}}(x)\right\|^{2}\pi|_{S}(x)\mathop{}\!\mathrm{d}{x}=\int_{S}\left\|\nabla\frac{\mu}{\pi}(x)\right\|^{2}\pi(x)\mathop{}\!\mathrm{d}{x}\leq\int_{\mathcal{X}}\left\|\nabla\frac{\mu}{\pi}(x)\right\|^{2}\pi(x)\mathop{}\!\mathrm{d}{x}=\operatorname{PFI}(\mu\|\pi)\leq\varepsilon,

the proof is completed by noticing that

\int_{S}\left\|\nabla\frac{\mu|_{S}}{\pi|_{S}}(x)\right\|^{2}\pi|_{S}(x)\mathop{}\!\mathrm{d}{x}=\operatorname{PFI}(\mu|_{S}\|\pi|_{S})\geq\rho\mathrm{Var}_{\pi|_{S}}\left[\frac{\mu|_{S}}{\pi|_{S}}\right]. ∎

We then prove Proposition 2.

Proof of Proposition 2.

By the definition of LD, we have

\frac{\mathrm{d}\mathrm{Var}_{\pi}[\frac{\tilde{\pi}_{t}}{\pi}]}{\mathrm{d}t}=-2\operatorname{PFI}(\tilde{\pi}_{t}\|\pi).

Integrating both sides of the above equation gives

\mathrm{Var}_{\pi}\left[\frac{\tilde{\pi}_{0}}{\pi}\right]\geq\mathrm{Var}_{\pi}\left[\frac{\tilde{\pi}_{0}}{\pi}\right]-\mathrm{Var}_{\pi}\left[\frac{\tilde{\pi}_{Th}}{\pi}\right]=2\int_{0}^{Th}\operatorname{PFI}(\tilde{\pi}_{t}\|\pi)\,\mathop{}\!\mathrm{d}t.

The proof is then completed by noting that, by Jensen's inequality (convexity of the squared Euclidean norm),

\operatorname{PFI}(\bar{\tilde{\pi}}_{Th}\|\pi)=\int_{\mathcal{X}}\left\|\nabla\frac{\frac{1}{Th}\int_{0}^{Th}\tilde{\pi}_{t}\mathop{}\!\mathrm{d}t}{\pi}(x)\right\|^{2}\pi(x)\mathop{}\!\mathrm{d}{x}\leq\frac{1}{Th}\int_{0}^{Th}\int_{\mathcal{X}}\left\|\nabla\frac{\tilde{\pi}_{t}}{\pi}(x)\right\|^{2}\pi(x)\mathop{}\!\mathrm{d}{x}\mathop{}\!\mathrm{d}t=\frac{1}{Th}\int_{0}^{Th}\operatorname{PFI}(\tilde{\pi}_{t}\|\pi)\,\mathop{}\!\mathrm{d}t\leq\frac{\mathrm{Var}_{\pi}[\frac{\tilde{\pi}_{0}}{\pi}]}{2Th}. ∎

We then restate the main result from [19], which provides a tight characterization of the discretization error between LMC and LD.

Proposition 3 (Theorem 2, [19]).

Assume \nabla V and \nabla^{2}V are L-Lipschitz. Initialize z_{0} from a Gaussian and select h=\mathcal{O}(1/L_{1}). Then for t\in[0,Th], \operatorname{KL}(\tilde{\pi}_{t}||\pi_{t})=\mathcal{O}(h^{2}d^{2}T).

It should be noted that the original version of Theorem 2 in [19] states the result only for integer times, i.e., t=kh with k\in\{0,\cdots,T\}. However, their proof also applies to non-integer times, which leads to the above proposition.

We are now ready to prove Corollary 2.

Proof of Corollary 2.

Combining Proposition 2 and Lemma 2, we have that

\bar{\tilde{\pi}}_{Th}(S)^{2}\mathrm{Var}_{\pi|_{S}}\left[\frac{\bar{\tilde{\pi}}_{Th}|_{S}}{\pi|_{S}}\right]\leq\frac{\pi(S)\mathrm{Var}_{\pi}[\frac{\tilde{\pi}_{0}}{\pi}]}{2Th\rho}.

Using Pinsker’s inequality, we further obtain

2\bar{\tilde{\pi}}_{Th}(S)^{2}D_{\operatorname{TV}}\left[{\bar{\tilde{\pi}}_{Th}|_{S}}||{\pi|_{S}}\right]^{2}\leq\frac{\pi(S)\mathrm{Var}_{\pi}[\frac{\tilde{\pi}_{0}}{\pi}]}{2Th\rho}. (4)

On the other hand, according to Proposition 3 and Pinsker’s inequality, we obtain that for t\in[0,Th],

D_{\operatorname{TV}}\left[\tilde{\pi}_{t}||\pi_{t}\right]^{2}\leq\mathcal{O}(h^{2}d^{2}T),

and thus

D_{\operatorname{TV}}\left[\tilde{\pi}_{t}||\pi_{t}\right]\leq\mathcal{O}(hd\sqrt{T}).

Due to the convexity of the L_{1} norm, we then obtain

D_{\operatorname{TV}}\left[\bar{\tilde{\pi}}_{Th}||\bar{\pi}_{Th}\right]=\sup_{A}\left|\frac{1}{Th}\int_{0}^{Th}\tilde{\pi}_{t}(A)\mathop{}\!\mathrm{d}t-\frac{1}{Th}\int_{0}^{Th}{\pi}_{t}(A)\mathop{}\!\mathrm{d}t\right|\leq\frac{1}{Th}\int_{0}^{Th}\sup_{A}\left|\tilde{\pi}_{t}(A)-{\pi}_{t}(A)\right|\mathop{}\!\mathrm{d}t\leq\frac{1}{Th}\int_{0}^{Th}D_{\operatorname{TV}}\left[\tilde{\pi}_{t}||\pi_{t}\right]\mathop{}\!\mathrm{d}t=\mathcal{O}(hd\sqrt{T}). (5)

Let A be a positive constant. According to Eq. (4), we have either \bar{\tilde{\pi}}_{Th}(S)\leq A, or D_{\operatorname{TV}}\left[{\bar{\tilde{\pi}}_{Th}|_{S}}||{\pi|_{S}}\right]=\mathcal{O}\left(\frac{\sqrt{\mathrm{Var}_{\pi}[\frac{\tilde{\pi}_{0}}{\pi}]}}{A\sqrt{Th\rho}}\right). In the former case, combined with Eq. (5), we have that

\bar{\pi}_{Th}(S)\leq A+\mathcal{O}(hd\sqrt{T}).

In the latter case, we have that

D_{\operatorname{TV}}\left[\bar{\tilde{\pi}}_{Th}|_{S}||\bar{\pi}_{Th}|_{S}\right]\leq\mathcal{O}\left(\frac{hd\sqrt{T}}{A}\right),

and thus

D_{\operatorname{TV}}\left[\pi|_{S}||\bar{\pi}_{Th}|_{S}\right]\leq\mathcal{O}\left(\frac{\sqrt{\mathrm{Var}_{\pi}[\frac{\tilde{\pi}_{0}}{\pi}]}}{A\sqrt{Th\rho}}\right)+\mathcal{O}\left(\frac{hd\sqrt{T}}{A}\right).

The proof is completed by selecting h=\frac{\mathrm{Var}_{\pi}[\frac{\tilde{\pi}_{0}}{\pi}]^{\frac{1}{3}}}{T^{\frac{2}{3}}\rho^{\frac{1}{3}}d^{\frac{2}{3}}} and A=\frac{d^{\frac{1}{6}}\mathrm{Var}_{\pi}[\frac{\tilde{\pi}_{0}}{\pi}]^{\frac{1}{6}}}{\rho^{\frac{1}{6}}T^{\frac{1}{12}}}. ∎

A.3 Difficulty of extension to Rényi divergence

Given two distributions \pi and \mu, the Rényi divergence of order q between them, \mathcal{R}_{q}(\mu\|\pi), is defined as follows:

\mathcal{R}_{q}(\mu\|\pi)\triangleq\frac{1}{q-1}\ln\mathbb{E}_{x\sim\pi}\left(\frac{\mu}{\pi}(x)\right)^{q}.

For simplicity, here we consider the Langevin Dynamics defined in Eq. (1), whose distribution at time t we denote (with slight abuse of notation) by \pi_{t}. The measure of convergence is then \mathcal{R}_{q}(\pi_{t}\|\pi), whose derivative, according to (Lemma 6, [23]), is given by

\frac{d}{dt}\mathcal{R}_{q}(\pi_{t}\|\pi)=-q\frac{G_{q}\left(\pi_{t}||\pi\right)}{F_{q}\left(\pi_{t}||\pi\right)},

where

G_{q}\left(\pi_{t}||\pi\right)=\mathbb{E}_{\pi}\left[\left(\frac{\pi_{t}}{\pi}\right)^{q}\left\|\nabla\log\frac{\pi_{t}}{\pi}\right\|^{2}\right]=\mathbb{E}_{\pi}\left[\left(\frac{\pi_{t}}{\pi}\right)^{q-2}\left\|\nabla\frac{\pi_{t}}{\pi}\right\|^{2}\right]=\frac{4}{q^{2}}\mathbb{E}_{\pi}\left[\left\|\nabla\left(\frac{\pi_{t}}{\pi}\right)^{\frac{q}{2}}\right\|^{2}\right],\qquad F_{q}\left(\pi_{t}||\pi\right)=\mathbb{E}_{\pi}\left[\left(\frac{\pi_{t}}{\pi}\right)^{q}\right].

To follow the approach of Lemma 1, we would need to first lower bound \frac{G_{q}\left(\pi_{t}||\pi\right)}{F_{q}\left(\pi_{t}||\pi\right)} by \frac{G_{q}\left((\pi_{t}|S)||(\pi|S)\right)}{F_{q}\left((\pi_{t}|S)||(\pi|S)\right)}, and then translate \frac{G_{q}\left((\pi_{t}|S)||(\pi|S)\right)}{F_{q}\left((\pi_{t}|S)||(\pi|S)\right)} back into \mathcal{R}_{q}((\pi_{t}|S)\|(\pi|S)). The second step can be accomplished following the same routine as (Lemma 9, [24]), using the local PI. However, we do not know how to implement the first step. This is because, although we have

Gq(πt||π)=4q2𝔼π[(πtπ)q22]4q2𝔼π|S[(πt|Sπ|S)q22]×πt(S)qπ(S)q1,\displaystyle G_{q}\left(\pi_{t}||\pi\right)=\frac{4}{q^{2}}\mathbb{E}_{\pi}\left[\left\|\nabla\left(\frac{\pi_{t}}{\pi}\right)^{\frac{q}{2}}\right\|^{2}\right]\geq\frac{4}{q^{2}}\mathbb{E}_{\pi|S}\left[\left\|\nabla\left(\frac{\pi_{t}|S}{\pi|S}\right)^{\frac{q}{2}}\right\|^{2}\right]\times\frac{\pi_{t}(S)^{q}}{\pi(S)^{q-1}},

which is similar to the analysis of local LSI, we also have

F_{q}\left(\pi_{t}||\pi\right)=\mathbb{E}_{\pi}\left[\left(\frac{\pi_{t}}{\pi}\right)^{q}\right]\geq\mathbb{E}_{\pi|S}\left[\left(\frac{\pi_{t}|S}{\pi|S}\right)^{q}\right]\times\frac{\pi_{t}(S)^{q}}{\pi(S)^{q-1}}.

In other words, when restricting to SS, both the numerator and the denominator of \frac{G_{q}\left(\pi_{t}||\pi\right)}{F_{q}\left(\pi_{t}||\pi\right)} become smaller, so that \frac{G_{q}\left(\pi_{t}||\pi\right)}{F_{q}\left(\pi_{t}||\pi\right)} is no longer lower bounded by \frac{G_{q}\left((\pi_{t}|S)||(\pi|S)\right)}{F_{q}\left((\pi_{t}|S)||(\pi|S)\right)}. We leave resolving this challenge as interesting future work.

Appendix B Proof of Applications

B.1 Proof of sampling from Gaussian mixtures

To start with, we recall the Bakry–Émery criterion and the Holley–Stroock perturbation principle.

Lemma 8 (Bakry-Émery criterion).

Let Ωn\Omega\subset\mathbb{R}^{n} be convex and let H:ΩH:\Omega\rightarrow\mathbb{R} be a Hamiltonian with Gibbs measure μ(x)eH(x)1Ω(x)\mu(x)\propto e^{-H(x)}1_{\Omega}(x) and assume that 2H(x)κ>0\nabla^{2}H(x)\geq\kappa>0 for all xsupp(μ)x\in supp(\mu). Then μ\mu satisfies LSI(κ)\operatorname{LSI}(\kappa).
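For instance (a standard special case, spelled out because it is the form used in the proof of Lemma 3 below): if p_{i}=\mathcal{N}(\mu_{i},\sigma^{2}\mathbb{I}_{d}) and S_{i} is convex, then p_{i}|_{S_{i}} has Hamiltonian H(x)=\frac{\|x-\mu_{i}\|^{2}}{2\sigma^{2}} (up to an additive constant) with \nabla^{2}H=\sigma^{-2}\mathbb{I}_{d}, so Lemma 8 gives \operatorname{LSI}(\sigma^{-2}).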

Lemma 9 (Holley-Stroock perturbation principle).

If p\in\Delta(\Omega) satisfies \operatorname{LSI}(\kappa), and \psi:\Omega\rightarrow\mathbb{R} satisfies m\leq\psi\leq M, where m,M>0, then q\in\Delta(\Omega) with q\propto\psi p satisfies \operatorname{LSI}(\frac{m}{M}\kappa).

We are now ready to prove Lemma 3.

Proof of Lemma 3.

Since S_{i} is convex, by applying Lemma 8 we obtain that p_{i}|_{S_{i}} satisfies \operatorname{LSI}(\sigma^{-2}). Let c=\min_{i}w_{i}.

Meanwhile, over S_{j} we have

cp_{j}\leq w_{j}p_{j}\leq p=\sum_{i\neq j}w_{i}p_{i}+w_{j}p_{j}\leq\sum_{i\neq j}w_{i}p_{j}+w_{j}p_{j}=p_{j},

where the last inequality uses that p_{i}\leq p_{j} on S_{j} for every i\neq j.

Therefore, by the Holley–Stroock perturbation principle, we have that p|_{S_{j}} satisfies \operatorname{LSI}(c\sigma^{-2}). ∎

Proof of Lemma 4.

Through direct calculation, we obtain

2V=Σ112Σ1i,jwiwjpi(x)pj(x)(μiμj)(μiμj)(i=1nwipi(x))2Σ1.\displaystyle\nabla^{2}V=\Sigma^{-1}-\frac{1}{2}\Sigma^{-1}\frac{\sum_{i,j}w_{i}w_{j}p_{i}(x)p_{j}(x)(\mu_{i}-\mu_{j})(\mu_{i}-\mu_{j})^{\top}}{(\sum_{i=1}^{n}w_{i}p_{i}(x))^{2}}\Sigma^{-1}.

Then, for any \bm{a}\in\mathbb{R}^{d} with \|\bm{a}\|=1, we have

\bm{a}^{\top}\nabla^{2}V\bm{a}=\bm{a}^{\top}\Sigma^{-1}\bm{a}-\frac{1}{2}\frac{\sum_{i,j}w_{i}w_{j}p_{i}(x)p_{j}(x)|(\mu_{i}-\mu_{j})^{\top}\Sigma^{-1}\bm{a}|^{2}}{(\sum_{i=1}^{n}w_{i}p_{i}(x))^{2}}\leq\bm{a}^{\top}\Sigma^{-1}\bm{a}\leq\sigma^{-2},

and

\bm{a}^{\top}\nabla^{2}V\bm{a}=\bm{a}^{\top}\Sigma^{-1}\bm{a}-\frac{1}{2}\frac{\sum_{i,j}w_{i}w_{j}p_{i}(x)p_{j}(x)|(\mu_{i}-\mu_{j})^{\top}\Sigma^{-1}\bm{a}|^{2}}{(\sum_{i=1}^{n}w_{i}p_{i}(x))^{2}}\geq-\frac{1}{2}\frac{\sum_{i,j}w_{i}w_{j}p_{i}(x)p_{j}(x)|(\mu_{i}-\mu_{j})^{\top}\Sigma^{-1}\bm{a}|^{2}}{(\sum_{i=1}^{n}w_{i}p_{i}(x))^{2}}\geq-\frac{\max_{i,j}\|\mu_{i}-\mu_{j}\|^{2}}{\sigma^{4}}.

The proof is completed. ∎
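As an independent numerical sanity check of the Hessian formula above, one can compare the closed form against finite differences. The sketch below is our own illustration; the mixture parameters, covariance, and evaluation point are arbitrary choices and are not taken from the paper.

import numpy as np

# Check the closed-form Hessian of V(x) = -log sum_i w_i N(x; mu_i, Sigma)
# against a central finite-difference approximation (illustrative parameters only).
rng = np.random.default_rng(0)
d = 2
Sigma = 0.5 * np.eye(d)
Sigma_inv = np.linalg.inv(Sigma)
mus = [np.array([1.0, 0.0]), np.array([-1.0, 1.0]), np.array([0.0, -2.0])]
ws = np.array([0.5, 0.3, 0.2])

def gauss(x, mu):
    diff = x - mu
    return np.exp(-0.5 * diff @ Sigma_inv @ diff) / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

def V(x):
    return -np.log(sum(w * gauss(x, mu) for w, mu in zip(ws, mus)))

def hessian_closed_form(x):
    # Sigma^{-1} - (1/2) Sigma^{-1} [sum_{i,j} w_i w_j p_i p_j (mu_i-mu_j)(mu_i-mu_j)^T / (sum_i w_i p_i)^2] Sigma^{-1}
    p = np.array([gauss(x, mu) for mu in mus])
    denom = float(ws @ p)
    M = np.zeros((d, d))
    for i in range(len(mus)):
        for j in range(len(mus)):
            diff = (mus[i] - mus[j]).reshape(-1, 1)
            M += ws[i] * ws[j] * p[i] * p[j] * (diff @ diff.T)
    return Sigma_inv - 0.5 * Sigma_inv @ (M / denom ** 2) @ Sigma_inv

def hessian_finite_diff(x, eps=1e-4):
    # central second differences of V
    H = np.zeros((d, d))
    I = np.eye(d)
    for a in range(d):
        for b in range(d):
            H[a, b] = (V(x + eps * I[a] + eps * I[b]) - V(x + eps * I[a] - eps * I[b])
                       - V(x - eps * I[a] + eps * I[b]) + V(x - eps * I[a] - eps * I[b])) / (4 * eps ** 2)
    return H

x0 = rng.normal(size=d)
print(np.abs(hessian_closed_form(x0) - hessian_finite_diff(x0)).max())

The printed maximum entrywise discrepancy should be on the order of the finite-difference error (roughly 10^{-6} for this eps), which serves purely as a consistency check of the displayed formula.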

B.2 Proof of sampling from the power posterior distribution of symmetric Gaussian mixtures

To begin with, define the expected potential function \bar{V} as

\bar{V}(\theta):=-\beta\mathbb{E}_{X}\log\left(\frac{1}{2}\varphi\left(\theta-X\right)+\frac{1}{2}\varphi\left(\theta+X\right)\right)-\log\lambda(\theta).

We further define the corresponding distribution as π¯eV¯\bar{\pi}\propto e^{-\bar{V}}.

The following lemmas will be needed in the proof.

Lemma 10 (Theorem 2, [20]).

If A,M\geq\|\theta_{0}\|+1, we have that \bar{\pi}|_{R_{1}} satisfies \operatorname{PI}(\Theta(1/A^{4}M^{2})).

Lemma 11 (Lemma 4, [20]).

If A,Mθ0+1A,M\geq\|\theta_{0}\|+1, we have with probability at least 1δ1-\delta,

supθ[0,A]×(0,M)|V(θ)V¯(θ)|𝒪((1+A+M)dnlogn(A+M+d)δ).\sup_{\theta\in[0,A]\times\mathcal{B}(0,M)}|V(\theta)-\bar{V}(\theta)|\leq\mathcal{O}((1+A+M)\sqrt{\frac{d}{n}\log\frac{n(A+M+d)}{\delta}}).
Lemma 12.

Suppose V is strongly dissipative, i.e., \nabla V is L-Lipschitz and \langle\nabla V(x),x\rangle\geq m\|x\|^{2}-b. Then, running LMC with step size h<\frac{1}{16L} and h\leq\frac{8m}{4b+32d}, we have

𝔼xkπke14mxk232exp(14m(8b+64d)).\displaystyle\mathbb{E}_{x_{k}\sim\pi_{k}}{e^{\frac{1}{4m}{\left\|x_{k}\right\|}^{2}}}\leq 32\cdot\exp\left(\frac{1}{4m}\left(8b+64d\right)\right).
Proof.

Define f(x)=e14mx2f(x)=e^{\frac{1}{4m}{\left\|x\right\|}^{2}}. Assume wlog that V(0)=0\nabla V(0)=0.

Then,

f(x_{k+1})=\displaystyle\exp\left(\frac{1}{4m}{\left\|x_{k}\right\|}^{2}+\frac{1}{2m}\left\langle-h\nabla V(x_{k})+\sqrt{2h}\xi_{k},x_{k}\right\rangle+\frac{1}{4m}{\left\|-h\nabla V(x_{k})+\sqrt{2h}\xi_{k}\right\|}^{2}\right)
\displaystyle\leq\exp\left(\frac{1}{4m}{\left\|x_{k}\right\|}^{2}-\frac{h}{2m}\left({\left\|x_{k}\right\|}^{2}-b\right)+\frac{\sqrt{2h}}{2m}\left\langle\xi_{k},x_{k}\right\rangle+\frac{1}{2m}{\left\|h\nabla V(x_{k})\right\|}^{2}+\frac{h}{m}{\left\|\xi_{k}\right\|}^{2}\right)
\displaystyle\leq\exp\left(\frac{1}{4m}{\left\|x_{k}\right\|}^{2}-\frac{h}{4m}\left({\left\|x_{k}\right\|}^{2}-2b\right)+\frac{\sqrt{2h}}{2m}\left\langle\xi_{k},x_{k}\right\rangle+\frac{h}{m}{\left\|\xi_{k}\right\|}^{2}\right).

Let \mathbb{E}_{k} denote expectation wrt \xi_{k}. Then

\mathbb{E}_{k}{\left[f(x_{k+1})\right]}\leq\displaystyle\exp\left(\frac{1}{4m}{\left\|x_{k}\right\|}^{2}-\frac{h}{4m}\left({\left\|x_{k}\right\|}^{2}-2b\right)\right)\cdot\mathbb{E}_{k}{\left[\exp\left(\frac{\sqrt{2h}}{2m}\left\langle\xi_{k},x_{k}\right\rangle+\frac{h}{m}{\left\|\xi_{k}\right\|}^{2}\right)\right]}
\displaystyle\leq\exp\left(\frac{1}{4m}{\left\|x_{k}\right\|}^{2}-\frac{h}{4m}\left({\left\|x_{k}\right\|}^{2}-2b\right)\right)\cdot\mathbb{E}_{k}{\left[\exp\left(\frac{\sqrt{2h}}{m}\left\langle\xi_{k},x_{k}\right\rangle\right)\right]}^{1/2}\cdot\mathbb{E}_{k}{\left[\exp\left(\frac{2h}{m}{\left\|\xi_{k}\right\|}^{2}\right)\right]}^{1/2},

where the second inequality is Cauchy–Schwarz.

Using the fact that a \chi^{2}-distributed random variable is sub-exponential,

𝔼k[exp(2hmξk2)]exp(4hdm)\displaystyle\mathbb{E}_{k}{\left[\exp\left(\frac{2h}{m}{\left\|\xi_{k}\right\|}^{2}\right)\right]}\leq\exp\left(\frac{4hd}{m}\right)
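Concretely (a standard moment-generating-function bound, spelled out for completeness): since \|\xi_{k}\|^{2}\sim\chi^{2}_{d}, we have \mathbb{E}\left[e^{s\|\xi_{k}\|^{2}}\right]=(1-2s)^{-d/2}\leq e^{2sd} for 0\leq s\leq\frac{1}{4}; the display above is the case s=\frac{2h}{m}, which is admissible as long as h\leq\frac{m}{8}.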

On the other hand, notice that \left\langle\xi_{k},x_{k}\right\rangle\sim\mathcal{N}(0,{\left\|x_{k}\right\|}^{2}) is a one-dimensional Gaussian random variable. It can be shown that

\mathbb{E}_{k}{\left[\exp\left(\frac{\sqrt{2h}}{m}\left\langle\xi_{k},x_{k}\right\rangle\right)\right]}\leq\exp\left(\frac{2h}{m^{2}}{\left\|x_{k}\right\|}^{2}\right). (6)
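Indeed (a one-line verification): for Z\sim\mathcal{N}(0,\|x_{k}\|^{2}) we have \mathbb{E}\left[e^{sZ}\right]=e^{\frac{s^{2}\|x_{k}\|^{2}}{2}}; taking s=\frac{\sqrt{2h}}{m} gives \exp\left(\frac{h}{m^{2}}\|x_{k}\|^{2}\right)\leq\exp\left(\frac{2h}{m^{2}}\|x_{k}\|^{2}\right), which is exactly Eq. (6).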

Combining the above,

𝔼k[f(xk+1)]\displaystyle\mathbb{E}_{k}{\left[f(x_{k+1})\right]}\leq exp(14mxk2h4m(xk22b)+2hdm+hm2xk2)\displaystyle\exp\left(\frac{1}{4m}{\left\|x_{k}\right\|}^{2}-\frac{h}{4m}\left({\left\|x_{k}\right\|}^{2}-2b\right)+\frac{2hd}{m}+\frac{h}{m^{2}}{\left\|x_{k}\right\|}^{2}\right)
\displaystyle\leq exp(14mxk2h8m(xk24b32d))\displaystyle\exp\left(\frac{1}{4m}{\left\|x_{k}\right\|}^{2}-\frac{h}{8m}\left({\left\|x_{k}\right\|}^{2}-4b-32d\right)\right)

Let 𝔼\mathbb{E} denote expectation wrt all randomness. Then

𝔼[f(xk+1)]\displaystyle\mathbb{E}{\left[f(x_{k+1})\right]}
\displaystyle\leq 𝔼[exp(14mxk2h8m(xk24b32d))]\displaystyle\mathbb{E}{\left[\exp\left(\frac{1}{4m}{\left\|x_{k}\right\|}^{2}-\frac{h}{8m}\left({\left\|x_{k}\right\|}^{2}-4b-32d\right)\right)\right]}
=\displaystyle= 𝔼[exp(14mxk2h8m(xk24b32d))1{xk28b+64d}]\displaystyle\mathbb{E}{\left[\exp\left(\frac{1}{4m}{\left\|x_{k}\right\|}^{2}-\frac{h}{8m}\left({\left\|x_{k}\right\|}^{2}-4b-32d\right)\right)1\left\{{\left\|x_{k}\right\|}^{2}\geq 8b+64d\right\}\right]}
+𝔼[exp(14mxk2h8m(xk24b32d))1{xk28b+64d}]\displaystyle\quad+\mathbb{E}{\left[\exp\left(\frac{1}{4m}{\left\|x_{k}\right\|}^{2}-\frac{h}{8m}\left({\left\|x_{k}\right\|}^{2}-4b-32d\right)\right)1\left\{{\left\|x_{k}\right\|}^{2}\leq 8b+64d\right\}\right]}
\displaystyle\leq 𝔼[exp(14mxk2h16m(4b+32d))1{xk28b+64d}]\displaystyle\mathbb{E}{\left[\exp\left(\frac{1}{4m}{\left\|x_{k}\right\|}^{2}-\frac{h}{16m}\left(4b+32d\right)\right)1\left\{{\left\|x_{k}\right\|}^{2}\geq 8b+64d\right\}\right]}
+𝔼[exp(14mxk2+h8m(4b+32d))1{xk28b+64d}]\displaystyle\quad+\mathbb{E}{\left[\exp\left(\frac{1}{4m}{\left\|x_{k}\right\|}^{2}+\frac{h}{8m}\left(4b+32d\right)\right)1\left\{{\left\|x_{k}\right\|}^{2}\leq 8b+64d\right\}\right]}
\displaystyle\leq 𝔼[exp(14mxk2)1{xk28b+64d}(1h32m(4b+32d))]\displaystyle\mathbb{E}{\left[\exp\left(\frac{1}{4m}{\left\|x_{k}\right\|}^{2}\right)1\left\{{\left\|x_{k}\right\|}^{2}\geq 8b+64d\right\}\cdot\left(1-\frac{h}{32m}\left(4b+32d\right)\right)\right]}
+𝔼[exp(14mxk2)1{xk28b+64d}(1+h4m(4b+32d))]\displaystyle\quad+\mathbb{E}{\left[\exp\left(\frac{1}{4m}{\left\|x_{k}\right\|}^{2}\right)1\left\{{\left\|x_{k}\right\|}^{2}\leq 8b+64d\right\}\cdot\left(1+\frac{h}{4m}\left(4b+32d\right)\right)\right]}
=\displaystyle= 𝔼[exp(14mxk2)]h32m(4b+32d)𝔼[exp(14mxk2)1{xk28b+64d}]\displaystyle\mathbb{E}{\left[\exp\left(\frac{1}{4m}{\left\|x_{k}\right\|}^{2}\right)\right]}-\frac{h}{32m}\left(4b+32d\right)\mathbb{E}{\left[\exp\left(\frac{1}{4m}{\left\|x_{k}\right\|}^{2}\right)1\left\{{\left\|x_{k}\right\|}^{2}\geq 8b+64d\right\}\right]}
+h4m(4b+32d)𝔼[exp(14mxk2)1{xk28b+64d}]\displaystyle\quad+\frac{h}{4m}\left(4b+32d\right)\mathbb{E}{\left[\exp\left(\frac{1}{4m}{\left\|x_{k}\right\|}^{2}\right)1\left\{{\left\|x_{k}\right\|}^{2}\leq 8b+64d\right\}\right]}
\displaystyle\leq 𝔼[exp(14mxk2)]h32m(4b+32d)𝔼[exp(14mxk2)]\displaystyle\mathbb{E}{\left[\exp\left(\frac{1}{4m}{\left\|x_{k}\right\|}^{2}\right)\right]}-\frac{h}{32m}\left(4b+32d\right)\mathbb{E}{\left[\exp\left(\frac{1}{4m}{\left\|x_{k}\right\|}^{2}\right)\right]}
+hm(4b+32d)exp(14m(8b+64d))\displaystyle\quad+\frac{h}{m}\left(4b+32d\right)\cdot\exp\left(\frac{1}{4m}\left(8b+64d\right)\right)
=\displaystyle= 𝔼[f(xk)]h32m(4b+32d)𝔼[f(xk)]+hm(4b+32d)exp(14m(8b+64d)).\displaystyle\mathbb{E}{\left[f(x_{k})\right]}-\frac{h}{32m}\left(4b+32d\right)\mathbb{E}{\left[f(x_{k})\right]}+\frac{h}{m}\left(4b+32d\right)\cdot\exp\left(\frac{1}{4m}\left(8b+64d\right)\right).

Suppose that xkx_{k} is drawn from the invariant distribution under LMC. Then we know that 𝔼f(xk)=𝔼f(xk+1)\mathbb{E}{f(x_{k})}=\mathbb{E}{f(x_{k+1})}. In this case,

0h32m(4b+32d)𝔼[f(xk)]+hm(4b+32d)exp(14m(8b+64d)).\displaystyle 0\leq-\frac{h}{32m}\left(4b+32d\right)\mathbb{E}{\left[f(x_{k})\right]}+\frac{h}{m}\left(4b+32d\right)\cdot\exp\left(\frac{1}{4m}\left(8b+64d\right)\right).

Rearranging gives

\displaystyle\mathbb{E}{f(x_{k})}\leq 32\cdot\exp\left(\frac{1}{4m}\left(8b+64d\right)\right). ∎

We are now ready to prove the lemmas in the main text.

Proof of Lemma 5.

Based on Lemma 11, if nΘ~((A+M)2dlog(1/δ))n\geq\tilde{\Theta}((A+M)^{2}d\log(1/\delta)), we have with probability at least 1δ1-\delta,

supθR1|V(θ)V¯(θ)|𝒪(1).\sup_{\theta\in R_{1}}|V(\theta)-\bar{V}(\theta)|\leq\mathcal{O}(1).

As a result, we have π¯|R1π|R1(θ)=Θ(1)\frac{\bar{\pi}|R_{1}}{\pi|R_{1}}(\theta)=\Theta(1), and by Lemma 9 and Lemma 10, the proof is completed. ∎

Proof of Lemma 6.

To begin with, set ξ𝒩(0,𝕀d)\xi\sim\mathcal{N}(0,\mathbb{I}_{d}), we have

V¯(θ),θ=\displaystyle\left\langle\nabla\bar{V}(\theta),\theta\right\rangle= βθ2+β𝔼(φ(θ0e1+ξθ)+φ(θ0e1+ξ+θ)φ(θ0e1+ξθ)+φ(θ0e1+ξ+θ)θ(θ0e1+ξ))\displaystyle\beta\|\theta\|^{2}+\beta\mathbb{E}\left(\frac{-\varphi\left(\|\theta_{0}\|e_{1}+\xi-\theta\right)+\varphi\left(\|\theta_{0}\|e_{1}+\xi+\theta\right)}{\varphi\left(\|\theta_{0}\|e_{1}+\xi-\theta\right)+\varphi\left(\|\theta_{0}\|e_{1}+\xi+\theta\right)}\theta^{\top}\left(\|\theta_{0}\|e_{1}+\xi\right)\right)
β2θ2β(θ02+1).\displaystyle\geq\frac{\beta}{2}\|\theta\|^{2}-\beta\left(\left\|\theta_{0}\right\|^{2}+1\right).

We obtain

V(θ),θβ2θ22β(θ02+1)\displaystyle\left\langle\nabla V(\theta),\theta\right\rangle\geq\frac{\beta}{2}\|\theta\|^{2}-2\beta\left(\left\|\theta_{0}\right\|^{2}+1\right)

following a standard empirical process argument, since n\geq\tilde{\Theta}((A+M)^{2}d\log(1/\delta)) (see, e.g., Lemma 6 of [20]).

Meanwhile, denote \tau_{i}=1 if X_{i} is sampled from \mathcal{N}(\theta_{0},\mathbb{I}_{d}) and \tau_{i}=-1 otherwise. We have

\displaystyle\nabla^{2}V(\theta)=\beta\mathbb{I}_{d}-\frac{\|\theta_{0}\|^{2}\beta}{n}\sum_{i=1}^{n}\left(\frac{4\varphi(X_{i}-\theta)\varphi(X_{i}+\theta)}{(\varphi(X_{i}-\theta)+\varphi(X_{i}+\theta))^{2}}\right)e_{1}e_{1}^{\top}
\displaystyle\quad-\frac{\beta}{n}\sum_{i=1}^{n}\left(\frac{4\varphi(X_{i}-\theta)\varphi(X_{i}+\theta)}{(\varphi(X_{i}-\theta)+\varphi(X_{i}+\theta))^{2}}(X_{i}-\tau_{i}\|\theta_{0}\|e_{1})(X_{i}-\tau_{i}\|\theta_{0}\|e_{1})^{\top}\right),

and thus (\beta-\|\theta_{0}\|^{2}\beta-\frac{\beta}{n}\sum_{i=1}^{n}\|X_{i}-\tau_{i}\|\theta_{0}\|e_{1}\|^{2})\mathbb{I}_{d}\leq\nabla^{2}V(\theta)\leq\beta\mathbb{I}_{d}. As X_{i}-\tau_{i}\|\theta_{0}\|e_{1}\sim\mathcal{N}(0,\mathbb{I}_{d}), by the concentration inequality for chi-squared variables and since n\geq\tilde{\Theta}((A+M)^{2}d\log(1/\delta)), we have with probability at least 1-\delta,

2V(θ)2β(1+θ02).\|\nabla^{2}V(\theta)\|\leq 2\beta(1+\|\theta_{0}\|^{2}).

The claim for \|\nabla^{3}V\| follows from a similar calculation. The proof is completed. ∎

Proof of Lemma 7.

The claim directly follows by applying Markov’s inequality to Lemma 12. ∎

Appendix C Proof of Theorem 1

Proof.

We consider the variational form of the spectral gap. We first define the Dirichlet form and the variance:

P(ϕ,ϕ)\displaystyle\mathcal{E}_{P}(\phi,\phi) =ϕ,(IP)ϕπ=x,yπ(x)ϕ(x)(𝕀(x,y)P(x,y))ϕ(y),\displaystyle=\left\langle\phi,(I-P)\phi\right\rangle_{\pi}=\sum_{x,y}\pi(x)\phi(x)(\mathbb{I}(x,y)-P(x,y))\phi(y),
Varπ[ϕ]\displaystyle\operatorname{Var}_{\pi}[\phi] =xπ(x)(ϕ(x)𝔼π[ϕ])2\displaystyle=\sum_{x}\pi(x)(\phi(x)-\mathbb{E}_{\pi}[\phi])^{2}

Then, by the fact that P is lazy, and hence P=\frac{1}{2}(I+\hat{P}) for some reversible transition matrix \hat{P} (so that the spectrum of P lies in [0,1]), we have

Varπ[Pϕ]Varπ[ϕ]P(ϕ,ϕ),\displaystyle\operatorname{Var}_{\pi}[P\phi]\leq\operatorname{Var}_{\pi}[\phi]-\mathcal{E}_{P}(\phi,\phi),

Applying this recursively with \phi=\frac{\mu_{t}}{\pi} (noting that \frac{\mu_{t+1}}{\pi}=P\frac{\mu_{t}}{\pi} by reversibility) and using the nonnegativity of the variance, we have

tTP(μtπ,μtπ)Varπ[μ0π].\displaystyle\sum_{t\leq T}\mathcal{E}_{P}\left(\frac{\mu_{t}}{\pi},\frac{\mu_{t}}{\pi}\right)\leq\operatorname{Var}_{\pi}\left[\frac{\mu_{0}}{\pi}\right]. (7)

By the variational form of spectral gap,

αinfϕnon-constantP(ϕ,ϕ)Varπ[ϕ].\displaystyle\alpha\leq\inf_{\phi\ \text{non-constant}}\frac{\mathcal{E}_{P}(\phi,\phi)}{\operatorname{Var}_{\pi}[\phi]}.
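As an aside, the variational characterization can be checked numerically on a toy chain. The sketch below is our own illustration (the potential values and the lazy Metropolis construction are arbitrary and unrelated to the paper's setting); it verifies that the ratio \mathcal{E}_{P}(\phi,\phi)/\operatorname{Var}_{\pi}[\phi] evaluated at the second eigenfunction equals the spectral gap 1-\lambda_{2}.

import numpy as np

n = 6
potential = np.array([0.0, 1.0, 2.0, 1.5, 0.5, 2.5])   # arbitrary potential values
pi = np.exp(-potential)
pi /= pi.sum()                                          # stationary distribution

# Lazy Metropolis walk on the path {0,...,n-1}: propose each neighbor w.p. 1/4.
P = np.zeros((n, n))
for x in range(n):
    for y in (x - 1, x + 1):
        if 0 <= y < n:
            P[x, y] = 0.25 * min(1.0, pi[y] / pi[x])
    P[x, x] = 1.0 - P[x].sum()                          # holding probability >= 1/2

def dirichlet(phi):
    # E_P(phi,phi) = (1/2) sum_{x,y} pi(x) P(x,y) (phi(x)-phi(y))^2
    return 0.5 * sum(pi[x] * P[x, y] * (phi[x] - phi[y]) ** 2
                     for x in range(n) for y in range(n))

def var_pi(phi):
    return float(pi @ (phi - pi @ phi) ** 2)

# Spectral gap from the symmetrized kernel D^{1/2} P D^{-1/2} (symmetric by reversibility).
D_half = np.diag(np.sqrt(pi))
D_half_inv = np.diag(1.0 / np.sqrt(pi))
eigvals, eigvecs = np.linalg.eigh(D_half @ P @ D_half_inv)   # eigenvalues in ascending order
gap = 1.0 - eigvals[-2]

# The infimum in the variational form is attained at phi_2 = D^{-1/2} v_2;
# any other non-constant test function only gives an upper bound on the gap.
phi2 = D_half_inv @ eigvecs[:, -2]
print(gap, dirichlet(phi2) / var_pi(phi2))              # the two printed numbers coincide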

Therefore, applying this to the Markov chain generated by P_{i}, we have that for any i\leq m,

tVarπi[μt|𝒳iπ|𝒳i]1αtPi(μt|𝒳iπ|𝒳i,μt|𝒳iπ|𝒳i)\displaystyle\sum_{t}\operatorname{Var}_{\pi_{i}}\left[\frac{\mu_{t}|\mathcal{X}_{i}}{\pi|\mathcal{X}_{i}}\right]\leq\frac{1}{\alpha}\sum_{t}\mathcal{E}_{P_{i}}\left(\frac{\mu_{t}|\mathcal{X}_{i}}{\pi|\mathcal{X}_{i}},\frac{\mu_{t}|\mathcal{X}_{i}}{\pi|\mathcal{X}_{i}}\right) (8)

We then note that

P(μtπ,μtπ)\displaystyle\mathcal{E}_{P}\left(\frac{\mu_{t}}{\pi},\frac{\mu_{t}}{\pi}\right) =12x,yπ(x)P(x,y)(μtπ(x)μtπ(y))2\displaystyle=\frac{1}{2}\sum_{x,y}\pi(x)P(x,y)(\frac{\mu_{t}}{\pi}(x)-\frac{\mu_{t}}{\pi}(y))^{2}
12x,y𝒳iπ(x)P(x,y)(μtπ(x)μtπ(y))2\displaystyle\geq\frac{1}{2}\sum_{x,y\in\mathcal{X}_{i}}\pi(x)P(x,y)(\frac{\mu_{t}}{\pi}(x)-\frac{\mu_{t}}{\pi}(y))^{2}
=12x,y𝒳iπ(x)Pi(x,y)(μtπ(x)μtπ(y))2\displaystyle=\frac{1}{2}\sum_{x,y\in\mathcal{X}_{i}}\pi(x)P_{i}(x,y)(\frac{\mu_{t}}{\pi}(x)-\frac{\mu_{t}}{\pi}(y))^{2}
+12x,y𝒳iπ(x)(P(x,y)Pi(x,y))(μtπ(x)μtπ(y))2\displaystyle+\frac{1}{2}\sum_{x,y\in\mathcal{X}_{i}}\pi(x)(P(x,y)-P_{i}(x,y))(\frac{\mu_{t}}{\pi}(x)-\frac{\mu_{t}}{\pi}(y))^{2}

We note by (PI) that P(x,y)=P_{i}(x,y) for any x,y\in\mathcal{X}_{i} with x\neq y. Therefore, the second term vanishes. Hence,

P(μtπ,μtπ)π(𝒳i)22μt(𝒳i)2Pi(μt|𝒳iπ|𝒳i,μt|𝒳iπ|𝒳i)\displaystyle\mathcal{E}_{P}\left(\frac{\mu_{t}}{\pi},\frac{\mu_{t}}{\pi}\right)\geq\frac{\pi(\mathcal{X}_{i})^{2}}{2\mu_{t}(\mathcal{X}_{i})^{2}}\mathcal{E}_{P_{i}}\left(\frac{\mu_{t}|\mathcal{X}_{i}}{\pi|\mathcal{X}_{i}},\frac{\mu_{t}|\mathcal{X}_{i}}{\pi|\mathcal{X}_{i}}\right)

Combining with  (7) and  (8) we get,

1TtVarπi[μt|𝒳iπ|𝒳i]/μt(𝒳i)22π(𝒳i)2αTVarπ[μ0π]\displaystyle\frac{1}{T}\sum_{t}\operatorname{Var}_{\pi_{i}}\left[\frac{\mu_{t}|\mathcal{X}_{i}}{\pi|\mathcal{X}_{i}}\right]/\mu_{t}(\mathcal{X}_{i})^{2}\leq\frac{2}{\pi(\mathcal{X}_{i})^{2}\alpha T}\operatorname{Var}_{\pi}\left[\frac{\mu_{0}}{\pi}\right] (9)

The claim then follows by a case analysis: if \mu_{t}(\mathcal{X}_{i})\leq\pi(\mathcal{X}_{i})T^{-1/4} for all t, then (9) yields

\frac{1}{T}\sum_{t}\operatorname{Var}_{\pi_{i}}\left[\frac{\mu_{t}|\mathcal{X}_{i}}{\pi|\mathcal{X}_{i}}\right]\leq\frac{2}{\alpha T^{1/2}}\operatorname{Var}_{\pi}\left[\frac{\mu_{0}}{\pi}\right]. (10)

∎

Appendix D Proof for Theorem 2

Proof.

Given the subgraph V^{\prime}, the conditional transition kernel is

p^{\prime}(x,y)=p(x,y)+\mathbbm{1}\left\{x=y\right\}p(x,(V^{\prime})^{c}).

We analyze the conductance of the chain on V^{\prime} induced by p^{\prime}:

Φ=minV1VxV1π(x)p(x,V1c)min{π(V1),1π(V1)}\displaystyle\Phi=\min_{V_{1}\subset V^{\prime}}\frac{\sum_{x\in V_{1}}\pi(x)p^{\prime}(x,V_{1}^{c})}{\min\{\pi(V_{1}),1-\pi(V_{1})\}}

For any partition V_{1},V_{2} of V^{\prime}, assume without loss of generality that the local minimum of f, denoted v^{*}, lies in V_{2}. For simplicity, for any x\in V_{1}, we let \tau(x) denote the last element of V_{1} along its shortest path to v^{*}. We note that

d(x,τ(x))D,π(x)π(τ(x)).\displaystyle d(x,\tau(x))\leq D,\pi(x)\leq\pi(\tau(x)).

Denote by \tau(V_{1}) the set of such elements in V_{1}. Then we have

xV1π(x)p(x,V1c)xτ(V1)π(x)4d.\displaystyle\sum_{x\in V_{1}}\pi(x)p^{\prime}(x,V_{1}^{c})\geq\frac{\sum_{x\in\tau(V_{1})}\pi(x)}{4d}. (11)

Further, for r\in\{0,1,\ldots,D\}, denote by \tau(x,r) the set of elements at distance r from \tau(x) along the descending shortest paths. We provide an illustration in Figure 2. Then we have that

\displaystyle\pi(V_{1})=\sum_{\tau(x)\in\tau(V_{1})}\sum_{r=1}^{D}\pi(\tau(x,r)). (12)

We note by reversibility that

\displaystyle\pi(\tau(x,r))p^{\prime}(\tau(x,r),\tau(x,r+1))=\pi(\tau(x,r+1))p^{\prime}(\tau(x,r+1),\tau(x,r)). (13)

Further, by the fact that the function along the path is descending, we get that

\displaystyle p^{\prime}(\tau(x,r),\tau(x,r+1))\leq p^{\prime}(\tau(x,r+1),\tau(x,r)). (14)

Therefore,

π(τ(x,r))π(τ(x,r+1)).\displaystyle\pi(\tau(x,r))\geq\pi(\tau(x,r+1)). (15)

Hence we have

\displaystyle\pi(V_{1})=\sum_{\tau(x)\in\tau(V_{1})}\sum_{r=1}^{D}\pi(\tau(x,r))\leq D\pi(\tau(V_{1})) (16)

Therefore, when π(V1)1/2\pi(V_{1})\leq 1/2 we have

xV1π(x)p(x,V1c)min{π(V1),1π(V1)}14dD.\displaystyle\frac{\sum_{x\in V_{1}}\pi(x)p^{\prime}(x,V_{1}^{c})}{\min\{\pi(V_{1}),1-\pi(V_{1})\}}\geq\frac{1}{4dD}.

when π(V1)1/2\pi(V_{1})\geq 1/2 we have

\displaystyle\frac{\sum_{x\in V_{1}}\pi(x)p^{\prime}(x,V_{1}^{c})}{\min\{\pi(V_{1}),1-\pi(V_{1})\}}\geq 2\sum_{x\in V_{1}}\pi(x)p^{\prime}(x,V_{1}^{c})\geq\frac{\sum_{x\in\tau(V_{1})}\pi(x)}{2d}=\frac{\pi(\tau(V_{1}))}{2d}\geq\frac{\pi(V_{1})}{2dD}\geq\frac{1}{4dD}.

The theorem then follows from Cheeger’s inequality. ∎
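For completeness, recall the standard discrete Cheeger bound invoked here: for a lazy reversible chain with conductance \Phi, the spectral gap \lambda satisfies

\frac{\Phi^{2}}{2}\leq\lambda\leq 2\Phi,

so the bound \Phi\geq\frac{1}{4dD} above yields a spectral gap of at least \frac{1}{32d^{2}D^{2}} for the restricted chain p^{\prime}.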

Figure 2: An illustration of the definitions of \tau(x) and \tau(x,r).

Appendix E Additional Experiments

We again use \pi_{3} from Section 6.1 as our target distribution. We run the LMC sampling algorithm with n=1, 10, and 2000 particles, respectively, and collect the particle locations after T=1000 iterations as samples. The target distribution is shown in Figure 3(a); the empirical distributions obtained with n=1, 10, and 2000 particles are shown in Figures 3(b), 3(c), and 3(d), respectively.
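For concreteness, a minimal sketch of this experimental setup is given below. It is our own reimplementation with illustrative parameters: the mode locations, component variance, step size, and initialization are guesses and need not match the exact \pi_{3} of Section 6.1.

import numpy as np

rng = np.random.default_rng(0)
centers = np.array([[3.0, 3.0], [-3.0, 3.0], [-3.0, -3.0], [3.0, -3.0]])  # one mode per quadrant
sigma2 = 0.5                                   # common component variance (illustrative)

def grad_V(x):
    # Gradient of V(x) = -log sum_k (1/4) * N(x; centers[k], sigma2 * I).
    diffs = x[None, :] - centers               # shape (4, 2)
    logw = -0.5 * np.sum(diffs ** 2, axis=1) / sigma2
    w = np.exp(logw - logw.max())
    w /= w.sum()                               # component responsibilities
    return (w[:, None] * diffs).sum(axis=0) / sigma2

def run_lmc(n_particles, T=1000, h=1e-2):
    X = 0.1 * rng.normal(size=(n_particles, 2))            # all particles start near the origin
    for _ in range(T):
        grads = np.array([grad_V(x) for x in X])
        X = X - h * grads + np.sqrt(2 * h) * rng.normal(size=X.shape)
    return X

for n in (1, 10, 2000):
    X = run_lmc(n)
    quadrant = (X[:, 0] > 0).astype(int) + 2 * (X[:, 1] > 0).astype(int)
    print(n, np.bincount(quadrant, minlength=4) / n)        # fraction of particles per quadrant

Printing the fraction of particles per quadrant should reproduce the qualitative behavior discussed below: with n=1 all mass ends up in a single quadrant, while larger n spreads mass over all four.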

Figure 3: (a) target distribution; (b) 1 particle; (c) 10 particles; (d) 2000 particles.

We highlight two observations from the experiments. First, when we use only one particle, it is always trapped in a single Gaussian component and we never obtain samples from the other modes. Global convergence is therefore very poor, but conditional convergence is still good: conditioned on the first quadrant, the empirical distribution after T iterations matches the target well, and it places no probability mass on the other three quadrants. Second, when the number of particles grows, the empirical distribution after T iterations places non-negligible probability mass on all four quadrants, which means we capture all of the Gaussian components. The results indicate that although restarting LMC may not directly lead to global convergence, it can help capture the features of multi-modal distributions, which further suggests that we may empirically eliminate the "small probability mass" condition in Corollary 1 and mitigate the gap between the sampling distribution and the target distribution.

A broader implication concerns ensembling and stochastic weight averaging in neural network training: ensembling follows the restart procedure, whereas stochastic averaging is closer to sampling from a single trajectory. Our theory and experiments suggest that the two can end up with very different distributions.