
Resolving the Mixing Time of the Langevin Algorithm to its Stationary Distribution for Log-Concave Sampling

Jason M. Altschuler¹
NYU
[email protected]
   Kunal Talwar
Apple
[email protected]

¹This work was done in part while JA was an intern at Apple during Summer 2021. JA was also supported in part by NSF Graduate Research Fellowship 1122374, a TwoSigma PhD Fellowship, and a Faculty Fellowship from the NYU Center for Data Science.
Abstract

Sampling from a high-dimensional distribution is a fundamental task in statistics, engineering, and the sciences. A canonical approach is the Langevin Algorithm, i.e., the Markov chain for the discretized Langevin Diffusion. This is the sampling analog of Gradient Descent. Despite being studied for several decades in multiple communities, tight mixing bounds for this algorithm remain unresolved even in the seemingly simple setting of log-concave distributions over a bounded domain. This paper completely characterizes the mixing time of the Langevin Algorithm to its stationary distribution in this setting (and others). This mixing result can be combined with any bound on the discretization bias in order to sample from the stationary distribution of the continuous Langevin Diffusion. In this way, we disentangle the study of the mixing and bias of the Langevin Algorithm.

Our key insight is to introduce a technique from the differential privacy literature to the sampling literature. This technique, called Privacy Amplification by Iteration, uses as a potential a variant of Rényi divergence that is made geometrically aware via Optimal Transport smoothing. This gives a short, simple proof of optimal mixing bounds and has several additional appealing properties. First, our approach removes all unnecessary assumptions required by other sampling analyses. Second, our approach unifies many settings: it extends unchanged if the Langevin Algorithm uses projections, stochastic mini-batch gradients, or strongly convex potentials (whereby our mixing time improves exponentially). Third, our approach exploits convexity only through the contractivity of a gradient step—reminiscent of how convexity is used in textbook proofs of Gradient Descent. In this way, we offer a new approach towards further unifying the analyses of optimization and sampling algorithms.

1 Introduction

Generating samples from a high-dimensional distribution is a ubiquitous problem with applications in diverse fields such as machine learning and statistics [4, 76], numerical integration and scientific computing [50, 51, 61], and differential privacy [68, 37], to name just a few. Here we revisit this problem of sampling in one of the most foundational and well-studied settings: the setting where the target distribution $\pi$ is log-concave, i.e.,

$\pi(x) \propto e^{-f(x)}$

for some convex potential $f:\mathbb{R}^d\to\mathbb{R}$. The purpose of this paper is to investigate the mixing time of a canonical algorithm—the Langevin Algorithm—for this problem.

The Langevin Algorithm.

A celebrated, classical fact is that one can generate a sample from $\pi$ via the Langevin Diffusion, i.e., the solution to the Stochastic Differential Equation given by

$dX_t = -\nabla f(X_t)\,dt + \sqrt{2}\,dB_t\,,$ (1.1)

where $B_t$ is a standard Brownian motion on $\mathbb{R}^d$. More precisely, under mild conditions on the tail growth of $f$, the distribution of $X_t$ tends to $\pi$ in the limit $t\to\infty$; see, e.g., [78, 10].

Although this continuous-time Markov chain is not directly implementable, it leads naturally to an algorithm: run a discretization of the Langevin Diffusion (1.1) for a sufficiently long time. This idea dates back at least to Parisi’s work in 1981 [74]. In its simplest form, the resulting algorithm is the discrete-time Markov chain given by

$X_{t+1} = X_t - \eta\nabla f(X_t) + Z_t\,,$ (1.2)

where the stepsize $\eta>0$ is the discretization parameter, and $Z_t\sim\mathcal{N}(0,2\eta I_d)$ are independent Gaussians. In words, this algorithm is identical to the Langevin Diffusion except that it uses the same gradient for $\eta$ units of time before recomputing. Note that this algorithm is applicable in the common practical setup where $\pi$ is known only up to its normalizing constant (e.g., as is the case in Bayesian statistics), since this normalization is irrelevant for computing $\nabla f$.

A more general form of this discretized update (1.2) enables handling constrained distributions via projection and large-scale finite-sum potentials $f=\sum_{i=1}^n f_i$ via stochastic gradients. Since all of our results apply without change in this more general setup, we slightly abuse notation and call this the Langevin Algorithm (rather than the Projected Stochastic Langevin Algorithm).

Definition 1.1.

For a convex set $\mathcal{K}\subseteq\mathbb{R}^d$, potential $f=\sum_{i=1}^n f_i:\mathcal{K}\to\mathbb{R}$, batch size $b\leqslant n$, stepsize $\eta>0$, and initialization $X_0\in\mathcal{K}$, the Langevin Algorithm is the Markov chain

$X_{t+1} = \Pi_{\mathcal{K}}\left[X_t - \eta G_t + Z_t\right],$ (1.3)

where $\Pi_{\mathcal{K}}$ denotes the Euclidean projection onto $\mathcal{K}$, $Z_t\sim\mathcal{N}(0,2\eta I_d)$ are independent Gaussians, and $G_t=\frac{1}{b}\sum_{i\in B_t}\nabla f_i(X_t)$ are the average gradients on uniform random batches $B_t\subseteq[n]$ of size $b$.
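For concreteness, here is a minimal NumPy sketch of the update (1.3). The quadratic potentials and Euclidean-ball constraint in the usage example are illustrative placeholders (chosen so the projection has a closed form), not part of Definition 1.1.

```python
import numpy as np

def langevin_algorithm(grads, project, x0, eta, T, b, rng):
    """Sketch of the (Projected Stochastic) Langevin Algorithm of Definition 1.1.

    grads   : list of n gradient oracles, grads[i](x) = grad of f_i at x
    project : Euclidean projection onto the convex set K
    x0      : initialization X_0 in K
    eta     : stepsize (the analysis assumes eta <= 2/M for M-smooth potentials)
    b       : batch size b <= n
    """
    n, x = len(grads), np.asarray(x0, dtype=float)
    for _ in range(T):
        batch = rng.choice(n, size=b, replace=False)          # uniform random batch B_t
        g = np.mean([grads[i](x) for i in batch], axis=0)     # average batch gradient G_t
        z = rng.normal(scale=np.sqrt(2 * eta), size=x.shape)  # Z_t ~ N(0, 2*eta*I_d)
        x = project(x - eta * g + z)                          # update (1.3)
    return x

# Illustrative usage: f_i(x) = ||x||^2/2 for each i, constrained to the ball of radius D/2.
d, n, D = 5, 10, 2.0
project_ball = lambda x: x if np.linalg.norm(x) <= D/2 else x * (D/2) / np.linalg.norm(x)
x_T = langevin_algorithm([lambda x: x] * n, project_ball, np.zeros(d),
                         eta=0.05, T=1000, b=3, rng=np.random.default_rng(0))
```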

This Langevin Algorithm has an immense literature in part because it has been studied in several communities, in either identical or similar forms. For example, this algorithm is the direct analog of (Projected Stochastic) Gradient Descent in convex optimization, the only difference being the noise $Z_t$. As another example, this algorithm (with different noise scaling) has been studied extensively in the differential privacy literature under the names (Projected Stochastic) Noisy Gradient Descent and Differentially Private Gradient Descent. Note that in the sampling, scientific computing, and PDE literatures, this algorithm goes by various other names, such as (Projected Stochastic variants of) the Langevin Monte Carlo, the Unadjusted Langevin Algorithm, the Overdamped Langevin Algorithm, and the Forward Euler or Euler-Maruyama discretization of the Langevin Diffusion.

Figure 1: Bottom: The Langevin Diffusion is a continuous-time Markov chain with stationary distribution $\pi$. Top: The Langevin Algorithm is a discrete-time Markov chain with stationary distribution $\pi_\eta$. This paper establishes tight bounds for the Langevin Algorithm to mix to its stationary distribution $\pi_\eta$ (the boxed arrow) for any non-trivial discretization stepsize $\eta>0$ in the setting where $\pi$ is (strongly) log-concave. If desired, this result can be combined with any bound on the discretization bias in order to sample from $\pi$. In this way, we aim to disentangle the study of the mixing and bias of the Langevin Algorithm.
Rapid mixing of the Langevin Algorithm?

While the asymptotic convergence of the Langevin Algorithm is classically understood [78, 10], non-asymptotic convergence bounds (a.k.a., mixing time bounds) are significantly more challenging. Yet this question is essential since algorithms are implemented in non-asymptotic regimes, and moreover since it is in general impossible to obtain an a posteriori error estimate from a sample.

There are two types of “mixing” results that one might hope to show: convergence of the Langevin Algorithm to its stationary distribution $\pi_\eta$, or to the target distribution $\pi$. Note that these two distributions differ for any discretization stepsize $\eta>0$; the gap between them is called the bias of the Langevin Algorithm, see Figure 1. Henceforth, we refer to the convergence of the Langevin Algorithm to its stationary distribution $\pi_\eta$ as “mixing”, since this is the standard notion of mixing for a Markov chain, and we disambiguate convergence to $\pi$ as “mixing to $\pi$”. (Note that much of the literature calls the latter notion “mixing”, even though this is a slight abuse of terminology.)

There has been an incredible flurry of activity around the mixing time of the Langevin Algorithm since the first results in the seminal paper [26]. However, tight mixing bounds (to either $\pi_\eta$ or $\pi$) remain open even in the seemingly simple setting of strongly convex and smooth potentials over $\mathbb{R}^d$—let alone in less idealized settings where the potential is convex rather than strongly convex, and/or the domain is a subset of $\mathbb{R}^d$ rather than the whole space, and/or the gradients used by the Langevin Algorithm are stochastic rather than exact. See the prior work section §1.3 for details.

The purpose of this paper is to determine the mixing time of the Langevin Algorithm to its stationary distribution $\pi_\eta$ in all these settings. If desired, these mixing bounds can then be combined with any bound on the discretization bias in order to sample from $\pi$, see Figure 1. This differs from most previous work, which has primarily focused on directly analyzing mixing bounds to $\pi$. In this way, on one hand we answer a natural question about the Markov chain defining the Langevin Algorithm, and on the other hand we aim to disentangle the study of its mixing and bias. See the discussion in §5 for future directions about understanding the bias.

The rest of the introduction is organized as follows: we detail our contributions in §1.1, our analysis techniques in §1.2, and then further contextualize with related work in §1.3.

1.1 Contributions

The main result of this paper is a characterization of the mixing time of the (Projected Stochastic) Langevin Algorithm to its stationary distribution $\pi_\eta$ for discretization parameter $\eta>0$ in the log-concave setting. For simplicity, we state this result below for mixing in the total variation metric. Extensions to other notions of distance are straightforward and summarized in Table 1. In what follows, we denote the mixing time for a divergence or metric $D$ by $T_{\mathrm{mix},D}(\varepsilon)$; this is the smallest $T\in\mathbb{N}$ for which $D(X_T\,\|\,\pi_\eta)\leqslant\varepsilon$ for any initialization $X_0$ of the Langevin Algorithm.

Theorem 1.2 (Informal versions of Theorems 3.1 and 3.2).

For any convex set $\mathcal{K}\subset\mathbb{R}^d$ of diameter $D$, any $M$-smooth convex potentials $f_1,\dots,f_n:\mathcal{K}\to\mathbb{R}$, any batch size $b\leqslant n$, and any stepsize $\eta\leqslant 2/M$, the Langevin Algorithm in Definition 1.1 mixes to its stationary distribution $\pi_\eta$ in time

$T_{\mathrm{mix},\,\mathsf{TV}}\left(\tfrac{1}{4}\right) \asymp \frac{D^2}{\eta}\,.$

This result is proved by establishing an upper bound (Theorem 3.1) and lower bound (Theorem 3.2) that match up to a constant factor.

Discussion of dependence on parameters.
  • Dependence on error $\varepsilon$. As is standard, Theorem 1.2 is stated for mixing to constant error, here $\varepsilon=1/4$. A basic fact about Markov chains is that a mixing bound to constant error can be boosted to arbitrarily small error $\varepsilon>0$ at an exponential rate, see, e.g., [58, §4.5]. Specifically, Theorem 3.1 implies $T_{\mathrm{mix},\,\mathsf{TV}}(\varepsilon) \leqslant \frac{D^2}{\eta}\log\frac{1}{\varepsilon}$.

  • Dependence on diameter $D$ and stepsize $\eta$. Note that $\Theta(D^2/\eta)$ is the number of iterations it takes for a univariate random walk with i.i.d. $\mathcal{N}(0,2\eta)$ Gaussian increments to move a distance $D$ from its initialization. Thus, $\Theta(D^2/\eta)$ is the number of iterations it takes for the Langevin Algorithm to just reach the opposite side of a univariate interval $\mathcal{K}=[-D/2,D/2]$, even in the simple setting of zero potentials (a.k.a., uniform $\pi$). Since reachability is obviously a prerequisite for mixing, $\Theta(D^2/\eta)$ is therefore a lower bound on the mixing time. (This is the intuition behind the lower bound in Theorem 3.2.) The matching upper bound in Theorem 3.1 shows that, modulo a constant factor, the Langevin Algorithm takes the same amount of time to mix as it does to just reach the opposite end of $\mathcal{K}$.

  • Dependence on other parameters. Note that the fast mixing in Theorem 1.2 depends on no other parameters, and in particular has no dimension dependence for $\eta$ of constant size (a common regime in differentially private optimization). However, if one seeks to sample from $\pi$, then the discretization bias must be small, i.e., $\pi_\eta\approx\pi$, which requires $1/\eta$ to scale polynomially in the dimension and inverse accuracy. This is why mixing bounds to $\pi$ inevitably have polynomial dependence on the dimension and inverse accuracy. Theorem 1.2 clarifies that these polynomial dependencies arise solely through the single factor of $1/\eta$.

Streamlined analysis and implications.

We prove Theorem 3.1 by introducing a technique from the differential privacy literature to the sampling literature. See §1.2 below for an overview of this technique, called Privacy Amplification by Iteration. This approach gives a short, simple proof of optimal mixing bounds—and moreover has several additional appealing properties, which are a mix of technical and pedagogical in nature:

|                                                      | Log-concave                                                  | Strongly log-concave                                    |
| $\mathsf{TV}$, $\mathsf{KL}$, $\chi^2$, $\mathsf{H}$ | $\frac{D^2}{\eta}\log\frac{1}{\varepsilon}$                  | $\frac{1}{\eta m}\log\frac{D}{\eta\varepsilon}$         |
| $\mathcal{D}_\alpha$                                 | $\frac{D^2}{\eta}\big(\alpha+\log\frac{1}{\varepsilon}\big)$ | $\frac{1}{\eta m}\log\frac{\alpha D}{\eta\varepsilon}$  |
Table 1: Summary of our mixing time bounds for the (Projected Stochastic) Langevin Algorithm to its stationary distribution $\pi_\eta$. Here, $\eta$ is the stepsize, $\varepsilon$ is the mixing error, $D$ is the diameter of the convex set $\mathcal{K}$, $m$ is the strong-convexity constant of the potential $f$, $M$ is the smoothness constant of $f$, and $\alpha$ is the Rényi parameter. We take $\eta\leqslant 1/M$ here to simplify the asymptotics. These results apply to stochastic gradients with arbitrary batch size. Extensions to unconstrained settings are straightforward, see Appendix A.4. We report mixing times for Total Variation ($\mathsf{TV}$), Kullback-Leibler ($\mathsf{KL}$), Chi-Squared ($\chi^2$), Hellinger ($\mathsf{H}$), and Rényi ($\mathcal{D}_\alpha$) divergences. For full details, see Corollary 3.3 and Theorem 4.1 for the cases of log-concave and strongly log-concave targets $\pi$, respectively.
  • Minimal assumptions. Theorem 3.1 requires an essentially minimal set of assumptions. This is in part because of our direct analysis approach, and also in part because our analysis disentangles the study of the mixing and the bias of the Langevin Algorithm. Indeed, assumptions like curvature of the set and Lipschitzness of the potential² appear necessary for discretization bounds (see, e.g., [14])—but as we show, these are unnecessary for fast mixing. Note also that our result requires no warm start.
²In the constrained setting, Lipschitzness is implied by smoothness; however, in the unconstrained setting, Lipschitzness is not implied, and yet our techniques still apply.

    In fact, Theorem 3.1 only makes three assumptions. The first is that $\mathcal{K}$ is bounded; this is easily relaxed, see the following paragraph. The second is that the potential $f$ and the set $\mathcal{K}$ are convex; it is unavoidable to have at least some type of assumption in this direction, since arbitrary non-convexity leads to computational hardness due to reductions to non-convex optimization (see §5). The third is that the stepsize $\eta\leqslant 2/M$ is not too large, since otherwise $\pi_\eta$ is a meaningless approximation of $\pi$ and is in fact even transient in the unconstrained setting.³
³This can be seen even in one dimension with the quadratic potential $f(x)=\tfrac{M}{2}x^2$ on $\mathcal{K}=[-D/2,D/2]$. Briefly, if $\eta=(2+\varepsilon)/M$, then $\pi_\eta$ concentrates away from $0$, whereas $\pi$ concentrates around $0$—in fact arbitrarily so as $M\to\infty$. Thus $\pi$ and $\pi_\eta$ can be arbitrarily different if $\eta>2/M$. Moreover, in the unconstrained setting, this Markov chain is transient.

  • Unification of different problem settings. Our analysis technique extends nearly unchanged and yields tight bounds regardless of whether the Langevin Algorithm uses mini-batch stochastic gradients, projections, or strongly log-concave targets $\pi$. In the stochastic setting, our mixing result is identical regardless of the batch size. In the unconstrained setting, our mixing result is identical except with $D$ replaced by the diameter of a ball that captures all but $\varepsilon$ mass of $\pi_\eta$; such a diameter proxy is finite and leads to tight mixing bounds, details in Appendix A.4. And in the strongly log-concave setting, our mixing result improves exponentially in $D$; details in §4.

  • Pedagogical hope towards unifying sampling and optimization. While the problems of optimization and sampling are well known to exhibit appealing syntactic similarities in terms of problem formulations and algorithms, it is less clear how these correspondences should extend to analyses and error bounds. One particularly popular approach, initiated by the seminal work [88], exploits the JKO Theorem, which states that the Langevin Diffusion is equivalent to a KL gradient flow over the space $\mathcal{P}(\mathbb{R}^d)$ of probability distributions [52]. See the previous work section §1.3 for details. The present paper offers a different approach towards unifying the analyses and error bounds of sampling and optimization algorithms that does not require lifting from $\mathbb{R}^d$ to $\mathcal{P}(\mathbb{R}^d)$, thereby avoiding the (beautiful yet) high-powered associated mathematical machinery. From an analysis perspective, we exploit convexity through the contractivity of a gradient update (in a different way from [26, 27]). This is reminiscent of how convexity is used in textbook proofs of Gradient Descent (see, e.g., [13, pages 271 and 279] for the convex and strongly convex settings, respectively), and leads us to error bounds for sampling that in many ways mirror classical results in optimization, see Table 2.⁴
⁴In fact, it is worth mentioning that our techniques yield analyses and error bounds for sampling which are even simpler than those for optimization in the setting of stochastic gradients—since our sampling bounds are unchanged for any minibatch size $b$, whereas optimization bounds certainly are not.

|                                 | Optimization                                       | Sampling                                                                    |
| Task                            | Optimize $\min_x f(x)$                             | Sample $\pi(x)\propto e^{-f(x)}$                                            |
| Simple first-order algorithm    | Gradient Descent                                   | Langevin Algorithm                                                          |
| Algorithm update                | $X_{t+1}=\Pi_{\mathcal{K}}[X_t-\eta\nabla f(X_t)]$ | $X_{t+1}=\Pi_{\mathcal{K}}[X_t-\eta\nabla f(X_t)+\mathcal{N}(0,2\eta I_d)]$ |
| “Error” for $f$ convex          | $MD^2/T$                                           | $e^{-T/MD^2}$ (Theorem 3.1)                                                 |
| “Error” for $f$ strongly convex | $c^T MD^2$                                         | $c^T MD^2$ (Theorem 4.1)                                                    |
Table 2: The research areas of optimization and sampling exhibit appealing syntactic similarities in terms of problem formulations and algorithms. This paper analyzes the mixing of the Langevin Algorithm by exploiting convexity of the potential $f$ (a.k.a., log-concavity of $\pi$) in a way that is reminiscent of the classical analysis of Gradient Descent. This leads to syntactically similar error bounds. To highlight the similarities, here the stepsize is $\eta\asymp 1/M$, the optimization “error” is reported for function suboptimality $f(X_T)-f^*$ [13, Theorems 3.3 and 3.12], and the sampling “error” is reported for mixing to $\pi_\eta$. Here, $T$ is the number of iterations, $D$ is the diameter, $m$ is the strong convexity of $f$, $M$ is the smoothness of $f$, and $c=\max_{\lambda\in\{m,M\}}|1-\eta\lambda|\leqslant 1$ is the contraction rate of a Gradient Descent step.

1.2 Techniques

1.2.1 Upper bound on mixing time

We establish rapid mixing by introducing to the sampling community a technique from differential privacy. This technique, called Privacy Amplification by Iteration (PABI), was originally proposed in [43] and then recently developed in [3] into the form used here, which is more amenable to sampling. See the preliminaries §2.3 for a discussion of these differences and for full technical details about PABI. Here we briefly overview this PABI technique and how we use it—in an effort both to promote accessibility to the sampling community and to emphasize the simplicity of this mixing time argument.

The Langevin Algorithm as a Contractive Noisy Process.

Rather than view the Langevin Algorithm as a discretized approximation of a continuous SDE, we analyze it directly by exploiting the fact that, modulo a slight shifting of the process described in §3.1, the update equation (1.3) of the Langevin Algorithm decomposes into two simple operations:

  • A contractive map, namely the composition of a projection operation and a Stochastic Gradient Descent update, both of which are contractive.

  • A noise convolution, namely adding Gaussian noise $\mathcal{N}(0,2\eta I_d)$.

A simple but key insight is that it is harder to distinguish two random variables $X, X'$ from each other after applying either of these operations. Indeed, applying the same contraction map to two points can only bring them closer together, and adding the same noise to two random variables can only lower the signal-to-noise ratio. Such observations naturally lend themselves to mixing-time analyses, where, by definition, one seeks to understand how distinguishable $X_T, X_T'$ are if they are the final iterates of the same Markov chain run on different initializations $X_0, X_0'$.

The key issue is how to quantify this intuition: how much less distinguishable do two random variables become after the (repeated, interleaved) application of a contractive map or noise convolution? This is precisely the sort of question that the PABI technique is designed for.

Lyapunov function.

The PABI technique measures progress through the shifted Rényi divergence, which is a variant of the classical Rényi divergence that is smoothed by the Optimal Transport (a.k.a., Wasserstein) distance.⁵ For simplicity, here we describe PABI in the special case of the shifted KL divergence; see §2.3 for full details on the shifted Rényi divergence, which generalizes the shifted KL divergence by bounding “all moments” simultaneously instead of just “one moment”.
⁵Recall that the infinity-Wasserstein distance $W_\infty(\mu,\mu')$ between distributions $\mu,\mu'$ on $\mathbb{R}^d$ is the smallest $z\geqslant 0$ for which there exists a joint distribution $P$ with marginals $X\sim\mu$, $X'\sim\mu'$ satisfying $\|X-X'\|\leqslant z$ almost surely. Throughout this paper, $\|\cdot\|$ denotes the Euclidean norm.

The shifted KL divergence is defined as

$\mathsf{KL}^{(z)}\left(\mu\,\|\,\nu\right) := \inf_{\mu'\,:\,W_\infty(\mu,\mu')\leqslant z} \mathsf{KL}\left(\mu'\,\|\,\nu\right)\,.$

In words, the shift $z\geqslant 0$ makes the KL divergence “geometrically aware”. Indeed, a major drawback of the KL divergence is that it is invariant under a permutation of points in the underlying space $\mathbb{R}^d$. The shifted KL divergence is more natural for analyzing sampling algorithms since closeness of points is captured by the Optimal Transport shift. As a toy illustrative example, consider two Dirac distributions at points $x$ and $x+\varepsilon$, where $\varepsilon$ is a vector of small norm: while the KL divergence between these two Diracs is infinite, their shifted KL divergence is $0$ for any shift $z\geqslant\|\varepsilon\|$.

Tracking the Lyapunov function.

Operationally, the PABI technique is based on two core lemmas, which disentangle how the two aforementioned operations (contraction and convolution) affect the distinguishability of the iterates, as measured in shifted KL divergence. We recall both lemmas here in a simplified form that suffices for the present purpose of mixing analyses; see §2.3 for a discussion of the differences to PABI in differential privacy analyses.

  • The contraction-reduction lemma establishes that for any contraction $\phi$,

    $\mathsf{KL}^{(z)}\left(\phi_\#\mu\,\|\,\phi_\#\nu\right) \leqslant \mathsf{KL}^{(z)}\left(\mu\,\|\,\nu\right)\,.$

    Intuitively, this can be viewed as a generalization of the classical data-processing inequality for the KL divergence to the shifted KL divergence—so long as the data-processing is contractive.

  • The shift-reduction lemma establishes that for any shift $a\geqslant 0$,

    $\mathsf{KL}^{(z)}\left(\mu*\mathcal{N}(0,\sigma^2 I_d)\,\|\,\nu*\mathcal{N}(0,\sigma^2 I_d)\right) \leqslant \mathsf{KL}^{(z+a)}\left(\mu\,\|\,\nu\right) + \frac{a^2}{2\sigma^2}\,.$

    Intuitively, this can be thought of as a strengthening of the trivial bound $\mathsf{KL}(\mu*\xi\,\|\,\nu*\xi) \leqslant \mathsf{KL}(\mu\otimes\xi\,\|\,\nu\otimes\xi) = \mathsf{KL}(\mu\,\|\,\nu) + \mathsf{KL}(\xi\,\|\,\xi)$, which follows from the data-processing and tensorization properties of the KL divergence (note that $\mathsf{KL}(\xi\,\|\,\xi)=0$). In words, the shift-reduction lemma sharpens this trivial bound by moving $a$ units of “displacement” between $\mu$ and $\nu$ from the first term to the second term—lowering the KL divergence between $\mu$ and $\nu$, at a penalty given by how well the noise $\xi=\mathcal{N}(0,\sigma^2 I_d)$ masks a displacement of size $a$. Indeed, this is precisely how the penalty arises, since $\sup_{w:\|w\|\leqslant a} \mathsf{KL}(\mathcal{N}(w,\sigma^2 I_d)\,\|\,\mathcal{N}(0,\sigma^2 I_d)) = a^2/(2\sigma^2)$; see the numerical check after this list. Operationally, this increase in shift allows us to “hide” the gap between two different Langevin initializations, as described next.
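To see concretely where this penalty comes from, the following sketch (assuming NumPy and SciPy are available) checks numerically in one dimension that $\mathsf{KL}(\mathcal{N}(w,\sigma^2)\,\|\,\mathcal{N}(0,\sigma^2)) = w^2/(2\sigma^2)$; taking the supremum over $|w|\leqslant a$ then gives exactly the penalty $a^2/(2\sigma^2)$.

```python
import numpy as np
from scipy.integrate import quad

w, sigma = 0.7, 1.3  # illustrative mean displacement and noise scale
p = lambda x: np.exp(-(x - w)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
q = lambda x: np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

kl_numeric, _ = quad(lambda x: p(x) * np.log(p(x) / q(x)), -20, 20)
kl_closed_form = w**2 / (2 * sigma**2)  # mean-shift KL between equal-variance Gaussians
print(kl_numeric, kl_closed_form)       # both ~0.14497
```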

Combining these two lemmas bounds the joint effect of the two operations—a.k.a., the effect of one iteration of the Langevin Algorithm:

$\mathsf{KL}^{(z)}\left((\phi_\#\mu)*\mathcal{N}(0,2\eta I_d)\,\|\,(\phi_\#\nu)*\mathcal{N}(0,2\eta I_d)\right) \leqslant \mathsf{KL}^{(z+a)}\left(\mu\,\|\,\nu\right) + \frac{a^2}{4\eta}\,,$

for any choice of $a\geqslant 0$. In particular, we can analyze $T$ iterations of the Langevin Algorithm by repeating this argument $T$ times and setting $a=D/T$, yielding

$\mathsf{KL}\left(X_T\,\|\,X_T'\right) = \mathsf{KL}^{(0)}\left(X_T\,\|\,X_T'\right) \leqslant \mathsf{KL}^{(D)}\left(X_0\,\|\,X_0'\right) + T\frac{a^2}{4\eta} = \frac{D^2}{4\eta T}\,.$

Above, the first equality is because the $0$-shifted KL is the standard KL divergence, and the final equality is because the $D$-shifted KL vanishes if the two distributions are supported on a set of diameter $D$. For constant $\varepsilon$, this is already enough to conclude the claimed $\mathsf{KL}$ mixing time bound of $T_{\mathrm{mix},\mathsf{KL}}(\varepsilon) \lesssim D^2/\eta$. For sub-constant $\varepsilon$, a standard boosting argument improves the dependence on $\varepsilon$ from $1/\varepsilon$ to $\log\frac{1}{\varepsilon}$.
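The bookkeeping in this telescoping argument is simple enough to spell out in code. The following sketch (parameter values illustrative) evaluates the resulting KL bound and the total variation mixing time it implies via Pinsker’s inequality.

```python
import math

def kl_bound(D, eta, T):
    # T applications of the one-step inequality with per-step shift a = D/T:
    # the D-shifted KL at initialization is 0, and each step pays a^2/(4*eta).
    a = D / T
    return T * a**2 / (4 * eta)  # = D^2 / (4*eta*T)

def tv_mixing_time(D, eta, eps):
    # Pinsker: TV <= sqrt(KL/2), so KL <= 2*eps^2 suffices; solve kl_bound(D, eta, T) <= 2*eps^2.
    return math.ceil(D**2 / (8 * eta * eps**2))

print(kl_bound(D=1.0, eta=0.01, T=200))           # 0.125
print(tv_mixing_time(D=1.0, eta=0.01, eps=0.25))  # 200 iterations, i.e., order D^2/eta
```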

Extensions.

This argument readily extends to different settings. For example, it applies unchanged with stochastic gradients because, for any choice of minibatch, a Stochastic Gradient Descent update is contractive. As a second example, this fast mixing bound for KL immediately implies a fast mixing bound for other notions of distance (e.g., Total Variation) via standard comparison inequalities (e.g., Pinsker’s inequality) and the fact that the mixing contraction rate is slowest for Total Variation among all $f$-divergences; details in §A.3. As a third example, this argument recovers tight, exponentially better mixing rates if the potentials are strongly convex: in this case, the contraction map is $c$-contractive for some $c<1$, which improves the contraction-reduction lemma by a factor of $c$ and the resulting $T$-iteration argument by a factor of $c^T$; details in §2.3.

1.2.2 Lower bound on mixing time

Our (algorithm-dependent) lower bounds for sampling are loosely motivated by classical lower bounds from convex optimization: although we use somewhat different constructions and analyses, the key reasons behind the lower bounds are mirrored—the log-concave lower bound is due to reachability, and the strongly log-concave lower bound is due to strong contractivity. Below we briefly describe the core ideas behind both. As an aside, an appealing property of both lower bounds is that they apply even in the simple setting of univariate quadratic potentials with full-batch gradients.

Log-concave lower bound via reachability.

Briefly, the simple but key observation is that if the Langevin Algorithm is initialized near the bottom of the interval $\mathcal{K}=[-D/2,D/2]\subset\mathbb{R}$ and the potential $f\equiv 0$ is identically zero, then the iterates form a constrained random walk with $\mathcal{N}(0,2\eta)$ increments—and thus, with high probability, the Langevin Algorithm requires the minimax-optimal number $\Theta(D^2/\eta)$ of iterations before it can even reach the top of the interval $\mathcal{K}$, let alone mix. Note that this construction is different from optimization lower bounds (since minimizing the identically zero function is trivial). However, this construction intuitively leads to similar rates as in convex optimization due to the same core challenge of reachability—reachability of a minimizer in optimization, and reachability of the full set in sampling; details in §3.2.
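To make this reachability obstruction concrete, here is a small simulation (parameters illustrative) that estimates the first time the projected random walk started at the bottom of $\mathcal{K}=[-D/2,D/2]$ reaches the top. A standard heuristic for reflected random walks suggests an expected hitting time of order $D^2/(2\eta)$, and the empirical averages track this scaling.

```python
import numpy as np

def hitting_time(D, eta, rng, max_iter=10**6):
    """First t with X_t = D/2, where X_{t+1} = Pi_K[X_t + Z_t], Z_t ~ N(0, 2*eta), X_0 = -D/2."""
    x = -D / 2
    for t in range(1, max_iter + 1):
        x = min(max(x + rng.normal(scale=np.sqrt(2 * eta)), -D / 2), D / 2)  # projected step
        if x == D / 2:  # the projection returns the endpoint exactly once the walk overshoots
            return t
    return max_iter

rng = np.random.default_rng(1)
for D, eta in [(1.0, 0.01), (2.0, 0.01), (1.0, 0.0025)]:
    mean_t = np.mean([hitting_time(D, eta, rng) for _ in range(200)])
    print(f"D={D}, eta={eta}: mean hitting time {mean_t:.0f} vs D^2/(2*eta) = {D**2/(2*eta):.0f}")
```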

Strongly log-concave lower bound via strong contractivity.

Here, we can re-use the classical lower bound construction for Gradient Descent on strongly convex functions. Specifically, consider running the Langevin Algorithm on the univariate quadratic $x\mapsto\frac{\lambda}{2}x^2$, whose second derivative $\lambda\in[m,M]$ is chosen so as to worsen the corresponding contraction coefficient $c=|1-\lambda\eta|$ for a Gradient Descent update with stepsize $\eta$. In convex optimization, this is already sufficient to prove that Gradient Descent converges at a rate no faster than $c^{\Theta(T)}$; for sampling, however, one must also understand how the Gaussian noise affects the probabilistic goal of being close to $\pi_\eta$. Briefly, the simple but key observation is that in the unconstrained setting (which affects the mixing time only up to a logarithmic factor), the Langevin Algorithm’s iterates have explicitly computable Gaussian laws, since each iteration contracts the previous iterate by a factor of $c$ and then injects fresh Gaussian noise. More precisely, the $T$-th iterate $X_T$ has variance $1-c^{2T}$ after normalizing the stationary variance to $1$, and thus $\mathsf{KL}(X_T\,\|\,\pi_\eta) = \mathsf{KL}(X_T\,\|\,X_\infty) = \mathsf{KL}(\mathcal{N}(0,1-c^{2T})\,\|\,\mathcal{N}(0,1))$, which by a direct calculation is of the desired order $c^{\Theta(T)}$; details in §4.2.
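The last display can be evaluated directly. Normalizing the stationary variance to $1$ and using the closed form $\mathsf{KL}(\mathcal{N}(0,s^2)\,\|\,\mathcal{N}(0,1)) = \tfrac{1}{2}(s^2-1-\log s^2)$ with $s^2 = 1-c^{2T}$, the sketch below (NumPy only, parameters illustrative) exhibits the $c^{\Theta(T)}$ decay—here approximately $c^{4T}/4$ once $T$ is large.

```python
import numpy as np

def kl_to_stationary(c, T):
    """KL( N(0, 1 - c^(2T)) || N(0, 1) ): the Gaussian law of the T-th iterate vs stationarity."""
    s2 = 1.0 - c**(2 * T)
    return 0.5 * (s2 - 1.0 - np.log(s2))

c = 0.9
for T in [10, 20, 40, 80]:
    print(T, kl_to_stationary(c, T), c**(4 * T) / 4)  # second column: the asymptotic c^(4T)/4
```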

1.3 Related Work

This paper is at the intersection of several lines of related work. We further contextualize this here.

Other sampling algorithms.

Due to the many applications of sampling, there is an extensive literature on different sampling algorithms spanning multiple communities and decades. Much of this work was inspired by the seminal paper [38], which gave the first randomized polynomial-time algorithm for computing the volume of a convex body via a reduction to (log-concave) sampling. A beautiful line of work from the theoretical computer science community has improved the (originally large) polynomial runtime factors by developing elegant algorithms based on geometric random walks such as the Metropolized Random Walk [69, 79], the Ball Walk [63, 64], and the Hit-and-Run Walk [66, 65, 62, 53, 8], to name a few; for details see the survey [87]. A key distinguishing aspect of this line of work is that these geometric random walk algorithms only access the target distribution $\pi$ by evaluating the density up to a constant factor—in contrast to the Langevin Algorithm, which also evaluates gradient information about the density. This distinction is formalized through the notion of zeroth-order and first-order algorithms: the former access $\pi$ only through evaluations of the potential $f$, whereas the latter also evaluate gradients of $f$. Most first-order algorithms are variants, at least intuitively, of the Langevin Algorithm. These variations come in many forms and can be essential for faster runtimes in theory, practice, or both. A few examples of these variations include: dealing with constrained sets by using projected [14], proximal [81, 12, 34], or mirror [1, 93, 49] variants of the Langevin Algorithm; augmenting the state space by using Hamiltonian Monte Carlo [29, 71] or its variants such as Riemannian Hamiltonian Monte Carlo [47, 55, 54], NUTS [48], or the Underdamped Langevin Algorithm [18, 39, 15]; or using different discretization schemes such as the Randomized Midpoint Discretization [84] or the Leap-Frog Integrator [11]. These algorithms can be combined with Metropolis-Hastings filters to correct the bias of the algorithm’s stationary distribution; this typically improves mixing times to $\pi$ from polynomial to polylogarithmic in the inverse accuracy. A notable example is the Metropolis-Adjusted Langevin Algorithm [78], whose mixing time has been the focus of much recent work [22, 36] and was recently characterized in [91]. While there have been many more exciting developments, this literature is far too large to attempt to comprehensively cite here, and we refer the reader to textbooks and surveys such as [4, 76, 61, 77, 20] for further background.

The present paper fits into this line of work by analyzing the (Projected Stochastic) Langevin Algorithm. We focus on this algorithm because it is arguably the simplest first-order sampling algorithm and also is the direct sampling analog of (Projected Stochastic) Gradient Descent.

Langevin Algorithm analyses under different assumptions.

While the Langevin Diffusion converges asymptotically under mild tail assumptions [78], non-asymptotic convergence of the Langevin Algorithm requires stronger assumptions. This is in part because sampling is in general computationally hard if the potential is non-convex, e.g., due to reductions from non-convex optimization. Non-asymptotic mixing bounds for the Langevin Algorithm were first derived in the seminal paper [26] in the unconstrained setting where $f$ is strongly convex and smooth. This led to a rapidly growing line of work on mixing times (primarily to $\pi$) in many settings—for example, the log-concave/strongly log-concave and constrained/unconstrained settings already mentioned above, but also settings where the potential satisfies weaker structural assumptions, for example when it is non-smooth [35, 60, 57, 73, 16], or is non-convex and only satisfies tail-growth or isoperimetric conditions [70, 75, 42, 40, 73, 41, 17, 6, 88]. Another line of work has focused on sampling from non-convex manifolds [59, 19, 46].

We highlight a recent line of work which relaxes (strong) convexity of the potential to functional inequalities of the target distribution $\pi$. This idea was pioneered by the paper [88], which pointed out that smoothness and functional inequalities suffice to establish rapid mixing—essentially replacing convexity and strong convexity by Poincaré and log-Sobolev inequality assumptions, respectively. A rapidly growing line of recent work has established even faster mixing bounds under these assumptions (e.g., [21, 6, 42, 40]), as well as found appealing ways to interpolate between these functional inequalities [21]. It should be noted that for the purpose of mixing times, convexity and functional-inequality assumptions are incomparable: on one hand, functional inequalities allow for (mild) non-convexity; on the other hand, while convexity implies a Poincaré inequality with a finite constant, there is no uniform bound on this Poincaré constant.

In this paper we focus on the foundational setting of smooth convex potentials $f$ (possibly also with strong convexity and/or bounded domains and/or stochastic gradients) for several reasons: (i) this is the direct sampling analog of the most classical setup in convex optimization; (ii) this setup has many applications in machine learning, statistics, and scientific computing; and (iii) optimal mixing times to $\pi_\eta$ were previously unknown even in this seemingly simple setting.

State-of-the-art mixing bounds for the Langevin Algorithm.

As mentioned earlier, most previous work focuses on mixing to $\pi$. Existing bounds vary depending on the precise set of assumptions, the precise variant of the algorithm, and the notion of mixing error. For several common combinations thereof, state-of-the-art mixing bounds have changed in essentially each of the past few years—and may continue to change in the years to come, as there is no⁶ setting in which current mixing bounds for the (Unadjusted) Langevin Algorithm are known to be optimal.
⁶At the time of writing, there was no setting in which current mixing bounds for the (Unadjusted) Langevin Algorithm were known to be optimal. The concurrent work [23] also shows an optimality result. Their result concerns an entirely different setting (non-convex potentials where error is measured in Fisher information) and applies to a variant of the Langevin Algorithm (that outputs the averaged iterate) in the specific regime of very large mixing error (namely $\varepsilon\approx\sqrt{d}$, where $d$ is the dimension). We also note that our lower bound (Theorem 4.3) establishes that an upper bound in [88] for mixing to $\pi_\eta$ is tight in certain metrics in the unconstrained, non-stochastic, strongly log-concave, and smooth setting. See the main text for details.

For example, for the unconstrained, smooth, strongly log-concave setting originally investigated by [26, 31, 32, 27], a line of recent work [88, 45, 42, 21, 35] has led to state-of-the-art mixing results to $\pi$ in Rényi divergence $\mathcal{D}_\alpha$ of order roughly $\tilde{O}(\alpha d\kappa^2/\varepsilon)$ for $\alpha$ of moderate size [21], and even faster $\tilde{O}(d\kappa/\varepsilon)$ for the KL divergence using averaged iterates [35]. Here $d$ denotes the dimension, $\varepsilon$ denotes the mixing error to $\pi$, $m$ denotes the strong convexity of the potential $f$, $M$ denotes its smoothness, and $\kappa=M/m$ denotes the condition number. In this setting, these Rényi and KL guarantees recover state-of-the-art Wasserstein guarantees [26, 27, 33] by Talagrand’s inequality.

For this unconstrained, smooth, strongly log-concave setting, the recent revision⁷ of [88, Lemmas 4 and 8] provided an upper bound of $\tilde{O}(\frac{\alpha}{m\eta})$ on the mixing time of the Langevin Algorithm to $\pi_\eta$. Our lower bound (Theorem 4.3) establishes that their upper bound is tight for Rényi divergences of constant size $\alpha=\Theta(1)$. Our upper bound $\tilde{O}(\frac{1}{\eta m})$ for this setting (Theorem 4.1) has exponentially better (and optimal) dependence on the Rényi parameter $\alpha$, does not require as onerous a warm start, and extends to other settings (e.g., constrained, stochastic, all non-trivial stepsizes, and log-concave target distributions).
⁷We thank Andre Wibisono for making us aware of this revision.

For the unconstrained, smooth, (non-strongly) log-concave setting, state-of-the-art mixing results to $\pi$ in $\mathsf{TV}$ are of order roughly $\tilde{O}(d^2 M^2 c_{\mathrm{PI}}^2/\varepsilon^4)$ [6] and $\tilde{O}(d^3 M^2 c_{\mathrm{PI}}^2/\varepsilon^2)$ [21]. Here, $c_{\mathrm{PI}}$ is the Poincaré constant, which is guaranteed to be finite but cannot be uniformly bounded. The lack of strong log-concavity poses technical challenges, leading to some results requiring averaged iterates and/or having unstable guarantees as $T$ grows. One approach for bypassing these challenges is to consider “modified” algorithms which are run on a quadratically regularized (and therefore strongly convex) potential [26]; this has led to state-of-the-art guarantees for Wasserstein mixing [28] and MALA [36]. For further details and history, see, e.g., [36, Table 1] and [21, Table 2].

For constrained settings, much less is known. Indeed, the only mixing time result we are aware of for the Projected Langevin Algorithm is [14], which in a tour-de-force analysis shows a polynomial mixing time to $\pi$, albeit with large polynomial factors like $d^{12}$. Faster upper bounds for mixing to $\pi$ were recently shown for different algorithms, notably the Proximal Langevin Algorithm [34, 12] and the Mirror Langevin Algorithm [1]. The paper [14] states two open questions for the Projected Langevin Algorithm: extend their mixing results to mini-batch gradients, and obtain optimal dependence on the dimension and accuracy. We accomplish both goals for $\pi_\eta$, and hope that this will help pave the way towards the answers for $\pi$.

The present paper fits into this line of work by establishing tight mixing times to $\pi_\eta$ for the (Projected Stochastic) Langevin Algorithm in a variety of settings: constrained/unconstrained, log-concave/strongly log-concave, and exact/stochastic gradients. An appealing aspect of our analysis technique is that it extends to all these settings with essentially no change. An appealing aspect of our results on mixing to $\pi_\eta$ is that they clarify that the dimension and error dependencies for mixing to $\pi$ arise only through a single factor of $1/\eta$, see the discussion following Theorem 1.2.

Lower bounds.

Whereas the study of lower bounds is well developed in the field of optimization, much less is known for sampling. We refer the reader to the recent paper [24] (which settles this question in dimension $1$ for unconstrained, smooth, strongly log-concave sampling) for a discussion of this gap (in dimensions greater than $1$) and the technical challenges in overcoming it. Because of these challenges, existing sampling lower bounds are primarily algorithm-specific, e.g., for the Underdamped Langevin Algorithm [15], the Unadjusted Langevin Algorithm [23], or the Metropolis-Adjusted Langevin Algorithm [91, 56, 22].

The present paper fits into this line of work on algorithm-specific lower bounds by establishing tight lower bounds for the Langevin Algorithm to mix to $\pi_\eta$ in the log-concave setting. These are the first tight lower bounds for the mixing time of the (Unadjusted) Langevin Algorithm (to either $\pi_\eta$ or $\pi$); see Footnote 6.

Connections between optimization and sampling.

Much of the literature on sampling is inspired by similarities to optimization. These parallels are particularly strong between the Langevin Algorithm and Gradient Descent—the simplest and most canonical first-order algorithms for sampling and optimization, respectively. Broadly speaking, there are two approaches for extending this connection from algorithm definition to analysis. One approach views the Langevin Algorithm as Gradient Descent plus noise, and borrows classical techniques from optimization over $\mathbb{R}^d$ to prove sampling bounds—indeed, this was the motivation of the original non-asymptotic bounds in [26]; see the discussion in [27, §3]. The other approach links sampling on $\mathbb{R}^d$ not with optimization on $\mathbb{R}^d$, but rather with optimization over the lifted space of probability measures $\mathcal{P}(\mathbb{R}^d)$ endowed with the geometry of the $2$-Wasserstein metric. This connection is based on the celebrated JKO Theorem [52], which states that the Langevin Diffusion is the gradient flow on $\mathcal{P}(\mathbb{R}^d)$ with respect to the objective functional $\mathsf{KL}(\cdot\,\|\,\pi)$. This connection is strengthened by the fact that properties of $\pi$ such as log-concavity, strong log-concavity, and log-Sobolev inequalities imply properties of the functional $\mathsf{KL}(\cdot\,\|\,\pi)$ such as convexity, strong convexity, and gradient-domination inequalities, respectively. A fruitful line of recent work has exploited this connection by analyzing the Langevin Diffusion (and its variants) via optimization techniques on Wasserstein space, see, e.g., [89, 9, 35, 6, 67].

The present paper fits into this interface of optimization and sampling by providing a new way to extract quantitative bounds from the first approach (linking sampling and optimization over $\mathbb{R}^d$). Specifically, on the analysis side, we show how the Privacy Amplification by Iteration technique implies tight mixing bounds for the Langevin Algorithm by exploiting the foundational fact from optimization that a Gradient Descent step is contractive (but in a different way from [26, 27]). This leads to error bounds for sampling that in many ways mirror classical results in optimization, see Table 2.

Connections to differential privacy.

The Langevin Algorithm is identical to Noisy-SGD—the standard algorithm for large-scale machine learning with differential privacy (DP)—modulo a different noise scaling and possibly gradient clipping. What it means for Noisy-SGD to satisfy DP is similar, at least intuitively, to what it means for the Langevin Algorithm to mix. Essentially, DP means that the algorithm yields indistinguishable outputs when run from identical initializations on different objectives, whereas mixing means that the algorithm yields indistinguishable outputs when run from different initializations on identical objectives. These syntactic similarities suggest that analyses might carry over if appropriately modified, and indeed this cross-pollination has proven fruitful in both directions. In one direction, ideas from sampling have recently been used to achieve tight DP guarantees for Noisy-SGD in certain regimes [25, 80, 92]. In the other direction, the technique of adaptive composition from DP has recently been used to prove the first mixing bounds for the Langevin Algorithm in the Rényi divergence [45, 42].

The present paper fits into this interface of sampling and DP in the sense that we introduce to the sampling literature a new analysis technique from the DP literature (namely, Privacy Amplification by Iteration), and we show that this gives optimal mixing bounds for the Langevin Algorithm.

Privacy Amplification by Iteration.

As mentioned earlier, the technique of Privacy Amplification by Iteration was first introduced in [43]. It has been used to derive fast algorithms with optimal privacy-utility tradeoffs for stochastic convex optimization [44]. [7] considered generalizations of this approach, and in particular showed a strengthening for strongly convex losses. [5] proved a version of Privacy Amplification by Iteration that holds for bounded sets by using contraction coefficients for the hockey-stick divergence, and [85] extended these results to shuffled SGD. Recent work by the authors [3] provides a diameter-aware version of Privacy Amplification by Iteration that also incorporates Privacy Amplification by Sampling for the Sampled Gaussian Noise Mechanism. This is the version of Privacy Amplification by Iteration exploited in this paper.

The present paper fits into this line of work by showing how this technique, previously only used in the privacy literature, also enables understanding questions in the sampling literature.

1.4 Organization

§2 recalls relevant preliminaries. §3 and §4 contain our main results, which characterize the mixing time of the Langevin Algorithm for log-concave and strongly log-concave targets, respectively. §5 describes several directions for future research that are motivated by the results in this paper. For brevity, several proofs and auxiliary results are deferred to the Appendix.

2 Preliminaries

In this section, we recall relevant preliminaries about regularity properties in §2.1, the Rényi divergence in §2.2, and the shifted Rényi divergence and the analysis technique of Privacy Amplification by Iteration in §2.3. In an effort towards self-containedness as well as accessibility of this article within the sampling community, the Appendix contains proofs of the convex optimization lemma in §2.1 and of the differential privacy lemmas in §2.3.

Notation.

Throughout, $\|\cdot\|$ denotes the Euclidean norm on $\mathbb{R}^d$, and $\mathcal{P}(\mathbb{R}^d)$ denotes the set of Borel-measurable probability distributions over $\mathbb{R}^d$. A map $f:\mathbb{R}^d\to\mathbb{R}^d$ is said to be $c$-contractive (equivalently, $c$-Lipschitz) if $\|f(x)-f(y)\|\leqslant c\|x-y\|$ for all $x,y\in\mathbb{R}^d$. If $c=1$, then we simply say $f$ is contractive; and if $c<1$, then we say $f$ is $c$-strongly contractive. Throughout, the potentials are assumed differentiable so that the Langevin Algorithm is well-defined, and we assume $\int e^{-f(x)}\,dx<\infty$ so that the target distribution $\pi$ is well-defined. Also, throughout, $\mathcal{K}$ is a non-empty, closed, convex subset of $\mathbb{R}^d$, and for brevity we drop the quantifiers “non-empty” and “closed”. These assumptions ensure that the projection operator $\Pi_{\mathcal{K}}(x) := \operatorname{argmin}_{y\in\mathcal{K}}\|y-x\|$ onto $\mathcal{K}$ is well-defined (see, e.g., [72, Theorem 3.1.10]). All other notation is introduced in the main text.

2.1 Convexity, Smoothness, and Contractivity

The convergence of the Langevin Algorithm depends, of course, on the regularity properties of the potentials. We recall two of the most basic assumptions below: strong convexity and smoothness, which amount to under- and over-estimates, respectively, of the second-order Taylor expansion. Or, if the function is twice differentiable, then $m$-strong convexity and $M$-smoothness are equivalent to the simple eigenvalue condition $mI_d \preceq \nabla^2 f(x) \preceq MI_d$ for all $x$. Note also that $0$-strong convexity corresponds to convexity.

Definition 2.1 (Regularity assumptions on potentials).

A differentiable function $f:\mathbb{R}^d\to\mathbb{R}$ is:

  • $m$-strongly convex if $f(x) \geqslant f(y) + \langle\nabla f(y), x-y\rangle + \frac{m}{2}\|x-y\|^2$ for all $x,y\in\mathbb{R}^d$.

  • $M$-smooth if $f(x) \leqslant f(y) + \langle\nabla f(y), x-y\rangle + \frac{M}{2}\|x-y\|^2$ for all $x,y\in\mathbb{R}^d$.

For settings in which the potentials are both $m$-strongly convex and $M$-smooth, the corresponding condition number is denoted by $\kappa := M/m \geqslant 1$.

Since these concepts are by now standard in the sampling literature, for brevity we refer to the monograph [72] for a detailed discussion of them and their many equivalent formulations (e.g., a differentiable function $f$ is $M$-smooth if and only if $\nabla f$ is $M$-Lipschitz). However, we do recall here one fundamental consequence of these properties that we use repeatedly in the sequel: smoothness implies contractivity of Gradient Descent updates. This improves to strong contractivity under the additional assumption of strong convexity. This contractivity property is arguably one of the most important consequences of strong convexity and smoothness in convex optimization. Since this fact is key to our results, for the convenience of the reader we provide a short proof in Appendix A.1.

Lemma 2.2 (Contractivity of Gradient Descent updates).

Suppose $f$ is an $m$-strongly convex and $M$-smooth function for $0\leqslant m\leqslant M<\infty$. For any stepsize $\eta\leqslant 2/M$, the Gradient Descent update

$x \mapsto x - \eta\nabla f(x)$

is $c$-contractive for

$c = \max_{\lambda\in\{m,M\}} |1-\eta\lambda|\,.$

In particular, if $0<m\leqslant M<\infty$, then $c=(\kappa-1)/(\kappa+1)$ for the stepsize choice $\eta=2/(M+m)$.
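As a quick sanity check of Lemma 2.2, the following sketch (NumPy only, with an illustrative quadratic) verifies the contraction factor empirically: for $f(x)=\tfrac{1}{2}x^\top\mathrm{diag}(\lambda)x$ with spectrum in $[m,M]$, the Gradient Descent map is linear and the stated factor is attained along the extreme eigendirections.

```python
import numpy as np

rng = np.random.default_rng(0)
m, M, d = 0.5, 4.0, 8
eta = 2 / (m + M)                                  # stepsize from the last claim of Lemma 2.2
lam = rng.uniform(m, M, size=d); lam[0], lam[-1] = m, M
gd = lambda x: x - eta * lam * x                   # GD update for f(x) = 0.5 * sum(lam * x^2)

c_lemma = max(abs(1 - eta * m), abs(1 - eta * M))  # = (kappa - 1)/(kappa + 1) here
ratios = [np.linalg.norm(gd(x) - gd(y)) / np.linalg.norm(x - y)
          for x, y in (rng.normal(size=(2, d)) for _ in range(1000))]
print(max(ratios), c_lemma, (M/m - 1) / (M/m + 1))  # max empirical ratio <= c ~ 0.778
```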

In the sequel, we also make use of the basic fact from convex analysis that the projection operator onto a convex set is contractive. A proof can be found in, e.g., [83, Theorem 1.2.1].

Lemma 2.3 (Contractivity of projections onto convex sets).

If $\mathcal{K}\subset\mathbb{R}^d$ is convex, then $\Pi_{\mathcal{K}}$ is contractive.

2.2 Rényi Divergence

Here we define the Rényi divergence, as it plays a central role in our analysis. See the survey [86] for a detailed discussion of Rényi divergences and their many properties. In what follows, we adopt the standard conventions that $0/0=0$ and $x/0=\infty$ for $x\neq 0$.

Definition 2.4 (Rényi divergence).

The Rényi divergence between probability distributions $\mu,\nu\in\mathcal{P}(\mathbb{R}^d)$ of order $\alpha\in(1,\infty)$ is

$\mathcal{D}_\alpha(\mu\,\|\,\nu) = \frac{1}{\alpha-1}\log\int\left(\frac{\mu(x)}{\nu(x)}\right)^\alpha \nu(x)\,dx$

if $\mu\ll\nu$, and is $\infty$ otherwise. The Rényi divergences of order $\alpha\in\{1,\infty\}$ are defined by continuity.

Our analysis makes use of the following facts about the Rényi divergence. First, we recall standard comparison inequalities that upper bound various probability metrics by the Rényi divergence. Since we prove fast mixing with respect to the Rényi divergence, these inequalities immediately imply fast mixing with respect to all these other metrics. Note that for the KL and chi-squared divergences, no inequality is needed, since these are equal to the Rényi divergence with parameters $\alpha=1$ and $2$, respectively (modulo a simple transform for the latter). The total variation bound is Pinsker’s inequality, and the Hellinger bound can be found in, e.g., [86, equation (7)].

Lemma 2.5 (Standard comparison inequalities to Rényi divergence).

Suppose $\mu\ll\nu$. Then:

  • Kullback-Leibler divergence: $\mathsf{KL}(\mu\,\|\,\nu) = \mathcal{D}_1(\mu\,\|\,\nu)$.

  • Chi-squared divergence: $\chi^2(\mu,\nu) = \exp\left(\mathcal{D}_2(\mu\,\|\,\nu)\right) - 1$.

  • Total variation distance: $\mathsf{TV}(\mu,\nu) \leqslant \sqrt{\tfrac{1}{2}\mathcal{D}_1(\mu\,\|\,\nu)}$.

  • Hellinger divergence: $\mathsf{H}(\mu,\nu) \leqslant \sqrt{\mathcal{D}_1(\mu\,\|\,\nu)}$.

We also make use of the data-processing inequality for the Rényi divergence. This generalizes the standard data-processing inequality for the KL divergence to all $\alpha\geqslant 1$. A proof can be found in, e.g., [86, Theorem 9].

Lemma 2.6 (Data-processing inequality for Rényi divergence).

For any Rényi divergence parameter $\alpha\geqslant 1$, any (possibly random) function $h:\mathbb{R}^d\to\mathbb{R}^{d'}$, and any distributions $\mu,\nu\in\mathcal{P}(\mathbb{R}^d)$,

$\mathcal{D}_\alpha\left(h_\#\mu\,\|\,h_\#\nu\right) \leqslant \mathcal{D}_\alpha\left(\mu\,\|\,\nu\right)\,.$

The next fact is a closed-form expression for the Rényi divergence between two Gaussian distributions. This identity can be found, e.g., in [86, equation (10)].

Lemma 2.7 (Rényi divergence between Gaussians).

If $\sigma_\alpha^2 = (1-\alpha)\sigma_0^2 + \alpha\sigma_1^2$ is non-zero, then

$\mathcal{D}_\alpha\left(\mathcal{N}(\mu_0,\sigma_0^2)\,\|\,\mathcal{N}(\mu_1,\sigma_1^2)\right) = \frac{\alpha(\mu_1-\mu_0)^2}{2\sigma_\alpha^2} + \frac{1}{1-\alpha}\log\frac{\sigma_\alpha}{\sigma_0^{1-\alpha}\sigma_1^\alpha}\,.$
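This identity is easy to implement and sanity-check. The sketch below (NumPy only) confirms the equal-variance case, where the formula reduces to $\alpha(\mu_1-\mu_0)^2/(2\sigma^2)$, and the $\alpha\to 1$ limit, where it recovers the Gaussian KL divergence.

```python
import numpy as np

def renyi_gaussians(alpha, mu0, s0, mu1, s1):
    """D_alpha( N(mu0, s0^2) || N(mu1, s1^2) ) via Lemma 2.7 (requires sigma_alpha^2 > 0)."""
    s_alpha2 = (1 - alpha) * s0**2 + alpha * s1**2
    assert s_alpha2 > 0, "formula requires sigma_alpha^2 > 0"
    log_term = np.log(np.sqrt(s_alpha2)) - (1 - alpha) * np.log(s0) - alpha * np.log(s1)
    return alpha * (mu1 - mu0)**2 / (2 * s_alpha2) + log_term / (1 - alpha)

# Equal variances: reduces to alpha * (mu1 - mu0)^2 / (2 * sigma^2).
print(renyi_gaussians(3.0, 0.0, 1.0, 0.5, 1.0), 3.0 * 0.5**2 / 2)  # 0.375, 0.375
# alpha -> 1 recovers KL( N(mu0, s0^2) || N(mu1, s1^2) ).
kl = np.log(1.1 / 0.8) + (0.8**2 + 0.3**2) / (2 * 1.1**2) - 0.5
print(renyi_gaussians(1 + 1e-7, 0.2, 0.8, -0.1, 1.1), kl)          # values agree
```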

2.3 Shifted Rényi Divergence and Privacy Amplification by Iteration

Here we describe the core technique in our mixing time analysis: Privacy Amplification by Iteration. This technique was originally introduced in [43]; however, for the purposes of this paper, it is crucial to use the diameter-aware variant of this technique which was recently developed in [3]. We note that everything in this subsection can be extended to arbitrary noise distributions over Banach spaces, but for simplicity of exposition we restrict to Gaussian noise over $\mathbb{R}^d$, as this suffices for the present purpose of mixing time bounds.

We begin by defining the shifted Rényi divergence, which is the key quantity that Privacy Amplification by Iteration concerns. In words, the shifted Rényi divergence is a version of the Rényi divergence that is smoothed out in the Optimal Transport (a.k.a. Wasserstein) distance. This smoothing makes the Rényi divergence geometrically aware in a way that is natural for mixing analyses; see the discussion in the techniques section §1.2.

Definition 2.8 (Shifted Rényi divergence).

The shifted Rényi divergence between probability distributions μ,ν𝒫(d)\mu,\nu\in\mathcal{P}(\mathbb{R}^{d}) of order α[1,]\alpha\in[1,\infty] and shift z0z\geqslant 0 is

𝒟α(z)(μν)=infμ:W(μ,μ)z𝒟α(μν).\mathcal{D}_{\alpha}^{(z)}\left(\mu\;\|\;\nu\right)=\inf_{\mu^{\prime}\;:\;W_{\infty}(\mu,\mu^{\prime})\leqslant z}\mathcal{D}_{\alpha}\left(\mu^{\prime}\;\|\;\nu\right).

The technique of Privacy Amplification by Iteration uses the shifted Rényi divergence as an intermediate Lyapunov function in order to prove that the (unshifted) Rényi divergence between the final iterates of two processes is small, so long as these processes are generated by interleaving contraction maps and additive Gaussian noise. This notion is formalized as follows.

Definition 2.9 (Contractive Noisy Iteration).

Consider an initial distribution μ0𝒫(d)\mu_{0}\in\mathcal{P}(\mathbb{R}^{d}), a sequence of (random) cc-contractions ϕt:dd\phi_{t}:\mathbb{R}^{d}\to\mathbb{R}^{d}, and a sequence of noise distributions ξt\xi_{t}. The cc-Contractive Noisy Iteration (cc-CNI) is the process generated by

Xt+1=ϕt+1(Xt)+Zt,X_{t+1}=\phi_{t+1}(X_{t})+Z_{t}\,,

where X0μ0X_{0}\sim\mu_{0} and ZtξtZ_{t}\sim\xi_{t} independently. For shorthand, we denote the law of the final iterate XTX_{T} by CNIc(μ0,{ϕt},{ξt})\mathrm{CNI}_{c}(\mu_{0},\{\phi_{t}\},\{\xi_{t}\}).
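To fix ideas, the following Python sketch simulates a CNI in the simplest concrete instance we could pick: linear maps φt(x) = c·x as the c-contractions and isotropic Gaussian noise. The setup and all parameter values are ours, purely for illustration.

```python
import numpy as np

def cni_final_iterate(x0, phis, sigma, rng):
    """One sample of X_T from Definition 2.9: X_{t+1} = phi_{t+1}(X_t) + Z_t,
    with independent Gaussian noise Z_t ~ N(0, sigma^2 I_d)."""
    x = np.array(x0, dtype=float)
    for phi in phis:
        x = phi(x) + sigma * rng.standard_normal(x.shape)
    return x

rng = np.random.default_rng(0)
c, sigma, T, d = 0.9, 1.0, 50, 3
phis = [lambda x: c * x] * T              # the simplest c-contractions
samples = np.stack([cni_final_iterate(np.ones(d), phis, sigma, rng)
                    for _ in range(10_000)])
# For these linear maps, each coordinate of X_T approaches N(0, sigma^2/(1-c^2)).
print(samples.var(axis=0), sigma**2 / (1 - c**2))
```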

The Privacy Amplification by Iteration technique (or, more precisely, the modification thereof in [3]) provides the following key estimate. We remark that while we state here separate bounds for the cases c<1 and c=1, the true PABI bound is slightly sharper and is continuous in c, see Remark A.4; however, we ignore this sharpening because it complicates the expressions and is unnecessary for the purposes of this paper.

Proposition 2.10 (Diameter-aware Privacy Amplification by Iteration bound from [3], simplified version).

Suppose XTCNIc(μ0,{ϕt},{ξt})X_{T}\sim\mathrm{CNI}_{c}(\mu_{0},\{\phi_{t}\},\{\xi_{t}\}) and XTCNIc(μ0,{ϕt},{ξt})X_{T}^{\prime}\sim\mathrm{CNI}_{c}(\mu_{0}^{\prime},\{\phi_{t}\},\{\xi_{t}\}) where the initial distributions satisfy W(μ0,μ0)DW_{\infty}(\mu_{0},\mu_{0}^{\prime})\leqslant D, the update functions ϕt\phi_{t} are cc-contractive, and the noise distributions ξt=𝒩(0,σ2Id)\xi_{t}=\mathcal{N}(0,\sigma^{2}I_{d}) are Gaussian. Then

𝒟α(XTXT)αD22σ2{1/T for c=1c2T for c<1\displaystyle\mathcal{D}_{\alpha}\left(X_{T}\;\|\;X_{T}^{\prime}\right)\leqslant\frac{\alpha D^{2}}{2\sigma^{2}}\cdot\begin{cases}1/T&\text{ for }c=1\\ c^{2T}&\text{ for }c<1\end{cases}

The proof of this proposition is a simple combination of two lemmas, which disentangle how the shifted Rényi divergence is affected by the contraction and noise operations defining the CNI. See the techniques section §1.2 for a high-level overview. Because these arguments have previously appeared only in the differential privacy literature, and because they can be simplified for the present purpose of mixing time analyses, we recall in Appendix A.2 these two lemmas, their proofs, and how they imply Proposition 2.10, in the interest of accessibility to the sampling community.888 The Privacy Amplification by Iteration bound [3, Proposition 3.2] from DP generalizes Proposition 2.10 in two ways: the update functions ϕt in the two CNI are different, and the bound uses an intermediate time between 0 and T. The simplifications are possible here because for mixing, the goal is to understand the stability of the same algorithm when run on two different initializations; see §1.3 for a discussion of the differences between DP and mixing.
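In the special case of linear contractions φt(x) = c·x with deterministic initializations, both CNI laws are Gaussian, so Proposition 2.10 can be checked against the exact Rényi divergence from Lemma 2.7. The one-dimensional Python sketch below is our own check, not the paper's.

```python
import numpy as np

def exact_renyi(alpha, c, sigma, D, T):
    """Exact D_alpha(X_T || X_T') for the 1-d CNI with phi_t(x) = c*x and
    deterministic initializations a distance D apart: both laws are Gaussian
    with common variance sigma^2 * sum_i c^(2i) and means c^T * D apart."""
    var_T = sigma**2 * sum(c**(2 * i) for i in range(T))
    return alpha * (c**T * D)**2 / (2 * var_T)   # Lemma 2.7 with equal variances

def pabi_bound(alpha, c, sigma, D, T):
    """The simplified bound of Proposition 2.10."""
    return alpha * D**2 / (2 * sigma**2) * (1 / T if c == 1 else c**(2 * T))

alpha, sigma, D, T = 2.0, 1.0, 1.0, 25
for c in (1.0, 0.95, 0.8):
    print(c, exact_renyi(alpha, c, sigma, D, T), pabi_bound(alpha, c, sigma, D, T))
# For c = 1 the two expressions coincide; for c < 1 the exact divergence is
# smaller, consistent with the proposition.
```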

3 Mixing Time for Log-Concave Distributions

In this section we establish tight mixing bounds on the Langevin Algorithm to its stationary distribution πη\pi_{\eta} under the assumption that the target distribution is log-concave (equivalently, the potentials are convex).

Theorem 3.1 (Upper bound: mixing of the Langevin Algorithm for log-concave targets).

Consider any convex set 𝒦d\mathcal{K}\subset\mathbb{R}^{d} of diameter DD, any MM-smooth convex potentials fi:𝒦f_{i}:\mathcal{K}\to\mathbb{R}, any stepsize η2/M\eta\leqslant 2/M, and any batch size b[n]b\in[n]. The Langevin Algorithm mixes to its stationary distribution πη\pi_{\eta} in time

Tmix,𝖳𝖵(14)2D2η.T_{\mathrm{mix},\,\mathsf{TV}}\left(\frac{1}{4}\right)\leqslant\frac{2D^{2}}{\eta}\,.

The next result proves that this mixing bound is tight up to a constant factor. No attempt has been made to optimize this constant.

Theorem 3.2 (Lower bound: mixing of the Langevin Algorithm for log-concave targets).

Consider running the Langevin Algorithm with any stepsize η>0\eta>0 and any batch size b[n]b\in[n] on the identically zero potentials fi0f_{i}\equiv 0 over the interval 𝒦=[D/2,D/2]\mathcal{K}=[-D/2,D/2]\subset\mathbb{R}. Then

Tmix,𝖳𝖵(14)D2100η.T_{\mathrm{mix},\,\mathsf{TV}}\left(\frac{1}{4}\right)\geqslant\frac{D^{2}}{100\eta}\,.

We remark that the fast mixing in total variation in Theorem 3.1 implies the following fast mixing in other notions of distance via standard comparison inequalities and the fact that the mixing contraction rate for total variation is worst-case among all ff-divergences; details in §A.3.

Corollary 3.3 (Mixing bounds for other notions of distance).

Consider the setup in Theorem 3.1. The Langevin Algorithm mixes to ε\varepsilon error in:

  • Tmix,Dα(ε)D2η(α+log1ε)T_{\mathrm{mix},\,D_{\alpha}}(\varepsilon)\lesssim\frac{D^{2}}{\eta}(\alpha+\operatorname{\log\tfrac{1}{\varepsilon}}).

  • Tmix,𝖪𝖫(ε)D2ηlog1εT_{\mathrm{mix},\,\mathsf{KL}}(\varepsilon)\lesssim\frac{D^{2}}{\eta}\operatorname{\log\tfrac{1}{\varepsilon}}.

  • Tmix,χ2(ε)D2ηlog1εT_{\mathrm{mix},\,\chi^{2}}(\varepsilon)\lesssim\frac{D^{2}}{\eta}\operatorname{\log\tfrac{1}{\varepsilon}}.

  • Tmix,𝖳𝖵(ε)D2ηlog1εT_{\mathrm{mix},\,\mathsf{TV}}(\varepsilon)\lesssim\frac{D^{2}}{\eta}\operatorname{\log\tfrac{1}{\varepsilon}}.

  • Tmix,𝖧(ε)D2ηlog1εT_{\mathrm{mix},\,\mathsf{H}}(\varepsilon)\lesssim\frac{D^{2}}{\eta}\operatorname{\log\tfrac{1}{\varepsilon}}.

3.1 Upper Bound (Proof of Theorem 3.1)

Step 1: Coupling the iterates.

Consider running the Langevin Algorithm for TT iterations from two different initializations X0μ0X_{0}\sim\mu_{0} and X0πηX_{0}^{\prime}\sim\pi_{\eta}, where μ0\mu_{0} is an arbitrary probability distribution supported on 𝒦\mathcal{K}, and πη\pi_{\eta} is the stationary distribution of the algorithm. Then by definition of the algorithm’s updates, we can couple the random batches {Bt}t=0T1\{B_{t}\}_{t=0}^{T-1} and the noise {Zt}t=0T1\{Z_{t}\}_{t=0}^{T-1} so that

Xt+1\displaystyle X_{t+1} =Π𝒦[XtηbiBtfi(Xt)+Zt]\displaystyle=\Pi_{\mathcal{K}}\left[X_{t}-\frac{\eta}{b}\sum_{i\in B_{t}}\nabla f_{i}(X_{t})+Z_{t}\right]
Xt+1\displaystyle X_{t+1}^{\prime} =Π𝒦[XtηbiBtfi(Xt)+Zt]\displaystyle=\Pi_{\mathcal{K}}\left[X_{t}^{\prime}-\frac{\eta}{b}\sum_{i\in B_{t}}\nabla f_{i}(X_{t}^{\prime})+Z_{t}\right]

where Zt𝒩(0,2ηId)Z_{t}\sim\mathcal{N}(0,2\eta I_{d}) are drawn independently.
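For concreteness, here is a minimal Python sketch of the projected mini-batch update coupled above; the potentials, the constraint set, and all parameter values below are our own illustrative choices, not the paper's.

```python
import numpy as np

def projected_langevin(grads, proj, x0, eta, b, T, rng):
    """Projected Langevin Algorithm with mini-batch stochastic gradients:
    X_{t+1} = Proj_K[X_t - (eta/b) * sum_{i in B_t} grad f_i(X_t) + Z_t],
    with Z_t ~ N(0, 2*eta*I_d) drawn independently."""
    n, x = len(grads), np.asarray(x0, dtype=float)
    for _ in range(T):
        batch = rng.choice(n, size=b, replace=False)
        g = sum(grads[i](x) for i in batch) / b
        z = np.sqrt(2 * eta) * rng.standard_normal(x.shape)
        x = proj(x - eta * g + z)
    return x

# Illustration: K is the Euclidean ball of diameter D; f_i(x) = ||x - a_i||^2 / 2.
rng = np.random.default_rng(1)
d, n, D = 2, 10, 4.0
anchors = rng.standard_normal((n, d))
grads = [lambda x, a=a: x - a for a in anchors]   # grad f_i(x) = x - a_i (M = 1)
proj = lambda x: x if np.linalg.norm(x) <= D / 2 else x * (D / 2) / np.linalg.norm(x)
eta = 0.1                                         # satisfies eta <= 2/M
T = int(2 * D**2 / eta)                           # mixing time from Theorem 3.1
print(projected_langevin(grads, proj, np.zeros(d), eta, b=3, T=T, rng=rng))
```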

Step 2: Construction of auxiliary Contractive Noisy Iterations.
999This construction allows us to combine the two contraction maps (projection and stochastic gradient update) into a single contraction, so that the new sequences are bona fide CNI. Alternatively, one can directly argue about the original {Xt}, {Xt′} sequences as “CNCI” or “Projected CNI” by using the contraction-reduction lemma twice to analyze each iteration, as described in [3]. We adopt the present reformulation for simplicity of exposition.

Define auxiliary sequences {Yt}t=0T\{Y_{t}\}_{t=0}^{T} and {Yt}t=0T\{Y^{\prime}_{t}\}_{t=0}^{T} as follows. Initialize Y0=X0Y_{0}=X_{0} and Y0=X0Y^{\prime}_{0}=X^{\prime}_{0}. Then update via

Yt+1\displaystyle Y_{t+1} =ϕt+1(Yt)+Zt\displaystyle=\phi_{t+1}(Y_{t})+Z_{t}
Yt+1\displaystyle Y^{\prime}_{t+1} =ϕt+1(Yt)+Zt\displaystyle=\phi_{t+1}(Y^{\prime}_{t})+Z_{t}

using the same update function, which is defined as

ϕt+1(y)\displaystyle\phi_{t+1}(y) =Π𝒦[y]ηbiBtfi(Π𝒦[y]).\displaystyle=\Pi_{\mathcal{K}}\left[y\right]-\frac{\eta}{b}\sum_{i\in B_{t}}\nabla f_{i}(\Pi_{\mathcal{K}}\left[y\right]).

Observe that by induction, these auxiliary processes satisfy Xt=Π𝒦[Yt]X_{t}=\Pi_{\mathcal{K}}\left[Y_{t}\right] and Xt=Π𝒦[Yt]X_{t}^{\prime}=\Pi_{\mathcal{K}}\left[Y_{t}^{\prime}\right].

In order to show that these two auxiliary processes are 11-CNI (henceforth simply called CNI), it suffices to check that the stochastic update functions ϕt\phi_{t} are contractive, as shown next.

Observation 3.4 (Contractivity of ϕt\phi_{t}).

Suppose the potentials fif_{i} are convex and MM-smooth. If η2/M\eta\leqslant 2/M, then ϕt\phi_{t} is contractive.

Proof.

The claim follows because ϕt\phi_{t} is the composition of a projection and a stochastic gradient descent update, both of which are contractive. Formally, consider any y,ydy,y^{\prime}\in\mathbb{R}^{d}; we show that ϕt(y)ϕt(y)yy\|\phi_{t}(y)-\phi_{t}(y^{\prime})\|\leqslant\|y-y^{\prime}\|. For shorthand, denote x=Π𝒦[y]x=\Pi_{\mathcal{K}}\left[y\right] and x=Π𝒦[y]x^{\prime}=\Pi_{\mathcal{K}}\left[y^{\prime}\right]. Then

ϕt(y)ϕt(y)\displaystyle\big{\|}\phi_{t}(y)-\phi_{t}(y^{\prime})\big{\|} =(xηbiBtfi(x))(xηbiBtfi(x))\displaystyle=\big{\|}\big{(}x-\frac{\eta}{b}\sum_{i\in B_{t}}\nabla f_{i}(x)\big{)}-\big{(}x^{\prime}-\frac{\eta}{b}\sum_{i\in B_{t}}\nabla f_{i}(x^{\prime})\big{)}\big{\|}
1biBt(xηfi(x))(xηfi(x))\displaystyle\leqslant\frac{1}{b}\sum_{i\in B_{t}}\big{\|}\left(x-\eta\nabla f_{i}(x)\right)-\left(x^{\prime}-\eta\nabla f_{i}(x^{\prime})\right)\big{\|}
xx\displaystyle\leqslant\|x-x^{\prime}\| (3.1)
=Π𝒦[y]Π𝒦[y]\displaystyle=\|\Pi_{\mathcal{K}}\left[y\right]-\Pi_{\mathcal{K}}\left[y^{\prime}\right]\|
yy.\displaystyle\leqslant\|y-y^{\prime}\|\,.

Above, the first inequality uses the triangle inequality, the second inequality uses the fact that a Gradient Descent update is contractive (Lemma 2.2), and the third inequality uses the fact that the projection function onto a convex set is contractive (Lemma 2.3). ∎

Step 3: Fast mixing via Privacy Amplification by Iteration.

Since we have established that the two auxiliary sequences {Yt}t=0T\{Y_{t}\}_{t=0}^{T} and {Yt}t=0T\{Y_{t}^{\prime}\}_{t=0}^{T} are CNI with identical update functions, Proposition 2.10 with α=1\alpha=1 implies

𝖪𝖫(YTYT)D24ηT.\mathsf{KL}\left(Y_{T}\;\|\;Y_{T}^{\prime}\right)\leqslant\frac{D^{2}}{4\eta T}\,.

Now make two observations. First, by the data-processing property of the KL divergence (Lemma 2.6) and the fact that XT=Π𝒦[YT]X_{T}=\Pi_{\mathcal{K}}\left[Y_{T}\right] and XT=Π𝒦[YT]X^{\prime}_{T}=\Pi_{\mathcal{K}}\left[Y^{\prime}_{T}\right], we have 𝖪𝖫(XTXT)𝖪𝖫(YTYT)\mathsf{KL}\left(X_{T}\;\|\;X_{T}^{\prime}\right)\leqslant\mathsf{KL}\left(Y_{T}\;\|\;Y_{T}^{\prime}\right). Second, by stationarity of X0πηX_{0}^{\prime}\sim\pi_{\eta}, we have by induction that XTπηX_{T}^{\prime}\sim\pi_{\eta}. It follows that

𝖪𝖫(XTπη)=𝖪𝖫(XTXT)𝖪𝖫(YTYT)D24ηT.\displaystyle\mathsf{KL}\left(X_{T}\;\|\;\pi_{\eta}\right)=\mathsf{KL}\left(X_{T}\;\|\;X_{T}^{\prime}\right)\leqslant\mathsf{KL}\left(Y_{T}\;\|\;Y_{T}^{\prime}\right)\leqslant\frac{D^{2}}{4\eta T}\,. (3.2)

Therefore the KL divergence is at most 1/8 after T=2D²/η iterations. Since a KL divergence bound of 1/8 implies a TV bound of 1/4 by Pinsker’s inequality (Lemma 2.5), we conclude that

Tmix,𝖳𝖵(1/4)2D2η.\displaystyle T_{\mathrm{mix},\,\mathsf{TV}}(1/4)\leqslant\frac{2D^{2}}{\eta}\,. (3.3)

3.2 Lower Bound (Proof of Theorem 3.2)

For these identically zero potentials, the update of the Langevin Algorithm simplifies to

Xt+1=Π[D/2,D/2][Xt+Zt]X_{t+1}=\Pi_{[-D/2,D/2]}\left[X_{t}+Z_{t}\right]

where Zt ∼ 𝒩(0, 2η). Observe that the stationary distribution πη puts half its mass on the set [0, D/2] by symmetry. (Indeed, if the Markov chain were initialized at 0, then by induction the law of each iterate is symmetric around 0, hence the stationary distribution is too.)

Thus, in order for the Langevin Algorithm to mix to total variation error 𝖳𝖵(XT,πη)1/4\mathsf{TV}(X_{T},\,\pi_{\eta})\leqslant 1/4, it must be the case that

[XT0]1/4.\displaystyle\mathbb{P}\big{[}X_{T}\geqslant 0\big{]}\geqslant 1/4. (3.4)

Therefore, in order to prove the present theorem, it suffices to show that this inequality (3.4) is false for initialization X0=D/4X_{0}=-D/4. To this end, observe that

[XT0][maxtT|s=1tZs|D4]exp(D264Tη).\mathbb{P}\Bigg{[}X_{T}\geqslant 0\Bigg{]}\leqslant\mathbb{P}\Bigg{[}\max_{t\leqslant T}\left\lvert\sum_{s=1}^{t}Z_{s}\right\rvert\geqslant\frac{D}{4}\Bigg{]}\leqslant\exp\Bigg{(}-\frac{D^{2}}{64T\eta}\Bigg{)}.

Above, the first step is because under the assumption that X0 = −D/4 and max_{t⩽T} |∑_{s=1}^{t} Zs| < D/4, there is no projection and thus Xt = −D/4 + ∑_{s=1}^{t} Zs stays within the interval (−D/2, 0). The second step is by a standard martingale concentration inequality on the supremum of a discrete Gaussian random walk (Lemma 3.5 below). We conclude in particular that if T < aD²/η for the choice of constant a = 1/(64 ln 4) ⩾ 1/100, then ℙ[X_T ⩾ 0] < 1/4, which contradicts (3.4) and thus completes the proof of Theorem 3.2.

It remains only to state the concentration inequality used above. This fact can be proved, e.g., as a consequence of Doob’s Submartingale Inequality, see [90, Page 139].

Lemma 3.5 (Concentration inequality for supremum of random walk).

Suppose ξ1,,ξT𝒩(0,1)\xi_{1},\dots,\xi_{T}\sim\mathcal{N}(0,1) are independent. Then for any a>0a>0,

(maxtT|s=1tξs|a)exp(a22T).\mathbb{P}\left(\max_{t\leqslant T}\left\lvert\sum_{s=1}^{t}\xi_{s}\right\rvert\geqslant a\right)\leqslant\exp\left(-\frac{a^{2}}{2T}\right).
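A quick Monte Carlo check of Lemma 3.5 (ours, purely illustrative; the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
T, a, trials = 100, 25.0, 50_000
# Partial sums S_t = xi_1 + ... + xi_t of i.i.d. standard Gaussians.
walks = np.cumsum(rng.standard_normal((trials, T)), axis=1)
empirical = np.mean(np.abs(walks).max(axis=1) >= a)
print(empirical, np.exp(-a**2 / (2 * T)))   # empirical frequency vs. the bound
```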

4 Mixing Time for Strongly Log-Concave Distributions

In this section we establish tight mixing bounds on the Langevin Algorithm to its stationary distribution πη under the assumption that the target distribution is strongly log-concave (equivalently, the potentials are strongly convex). In this case, the mixing time improves exponentially as a function of the diameter D: from polynomial in D to logarithmic. The following is the analog of Theorem 3.1 and Corollary 3.3.

Theorem 4.1 (Upper bound: mixing of the Langevin Algorithm for strongly log-concave targets).

Consider any convex set 𝒦d\mathcal{K}\subset\mathbb{R}^{d} of diameter DD, any mm-strongly convex and MM-smooth potentials fi:𝒦f_{i}:\mathcal{K}\to\mathbb{R}, any stepsize η<2/M\eta<2/M, and any batch size b[n]b\in[n]. Then the Langevin Algorithm mixes to its stationary distribution πη\pi_{\eta} in time

  • Tmix,Dα(ε)log1/cαDηεT_{\mathrm{mix},\,D_{\alpha}}(\varepsilon)\lesssim\log_{1/c}\frac{\alpha D}{\eta\varepsilon}.

  • Tmix,𝖪𝖫(ε)log1/cDηεT_{\mathrm{mix},\,\mathsf{KL}}(\varepsilon)\lesssim\log_{1/c}\frac{D}{\eta\varepsilon}.

  • Tmix,χ2(ε)log1/cDηεT_{\mathrm{mix},\,\chi^{2}}(\varepsilon)\lesssim\log_{1/c}\frac{D}{\eta\varepsilon}.

  • Tmix,𝖳𝖵(ε)log1/cDηεT_{\mathrm{mix},\,\mathsf{TV}}(\varepsilon)\lesssim\log_{1/c}\frac{D}{\eta\varepsilon}.

  • Tmix,𝖧(ε)log1/cDηεT_{\mathrm{mix},\,\mathsf{H}}(\varepsilon)\lesssim\log_{1/c}\frac{D}{\eta\varepsilon}.

where c=maxλ{m,M}|1ηλ|c=\max_{\lambda\in\{m,M\}}|1-\eta\lambda|.

Remark 4.2 (Special cases of contraction coefficient cc).

It always holds that c < 1. In the setting η < 1/M, the contraction coefficient simplifies to c = 1 − ηm ⩽ exp(−ηm), and thus the mixing time is of order (1/(ηm)) log(D/(ηε)). Another important case is the stepsize choice η = 2/(M+m), which minimizes the contraction coefficient c; in this case c = (κ−1)/(κ+1) ⩽ 1 − 1/κ ⩽ exp(−1/κ), and thus the mixing time is of order κ log(D/(ηε)), where κ := M/m denotes the condition number.
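The following snippet (ours; the values of m and M are arbitrary) tabulates the contraction coefficient c and the implied number of iterations per e-fold of contraction, 1/log(1/c), for a few stepsizes, illustrating the special cases in Remark 4.2.

```python
import numpy as np

def contraction_coefficient(eta, m, M):
    """c = max(|1 - eta*m|, |1 - eta*M|), as in Theorem 4.1; requires eta < 2/M."""
    assert 0 < eta < 2 / M
    return max(abs(1 - eta * m), abs(1 - eta * M))

m, M = 1.0, 10.0                      # condition number kappa = 10
for eta in (0.5 / M, 1 / M, 2 / (M + m)):
    c = contraction_coefficient(eta, m, M)
    print(f"eta = {eta:.4f}:  c = {c:.4f},  iters per e-fold = {1 / np.log(1 / c):.1f}")
# eta = 2/(M+m) minimizes c, giving c = (kappa - 1)/(kappa + 1) = 9/11 here.
```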

We also show a lower bound on the mixing which decays at the same exponential rate cΘ(T)c^{\Theta(T)} as the upper bound in Theorem 4.1. Thus our upper and lower bounds on the mixing time match up to a logarithmic factor. For simplicity, we make two choices when stating this lower bound. First, we state it for the unconstrained setting since the upper bound in Theorem 4.1 extends to unconstrained settings at the expense of at most a logarithmic factor, see Appendix A.4. Second, we state this lower bound for mixing in Rényi divergence; this implies analogous lower bounds in divergences such as KL and chi-squared, since those are equal to certain Rényi divergences (modulo simple transformations). Analogous mixing time lower bounds in other notions of distance (e.g., 𝖳𝖵\mathsf{TV} or Hellinger) are easily proved by using the same lower bound construction and changing only the final step in the proof which bounds d(𝒩(0,1c2T),𝒩(0,1))d(\mathcal{N}(0,1-c^{2T}),\,\mathcal{N}(0,1)) in the relevant metric dd.

Theorem 4.3 (Lower bound: mixing of the Langevin Algorithm for strongly log-concave targets).

For any stepsize η > 0, there exist univariate quadratic potentials fi over 𝒦 = ℝ that are m-strongly convex, M-smooth, and satisfy the following. For any Rényi divergence parameter α ⩾ 1, any number of iterations T ∈ ℕ, and any batch size b ∈ [n], the T-th iterate X_T of the Langevin Algorithm mixes to its stationary distribution πη no faster than

\mathcal{D}_{\alpha}(X_{T}\;\|\;\pi_{\eta})\geqslant\frac{\alpha c^{4T}}{4}\,,

where c=maxλ{m,M}|1ηλ|c=\max_{\lambda\in\{m,M\}}|1-\eta\lambda|. Thus in particular,

Tmix,Dα(ε)log1/cαε.T_{\mathrm{mix},\,D_{\alpha}}(\varepsilon)\gtrsim\log_{1/c}\frac{\alpha}{\varepsilon}\,.

4.1 Upper Bound (Proof of Theorem 4.1)

The analysis is essentially identical to the analysis of log-concave distributions in Theorem 3.1. The only difference is that in the present setting of strongly convex potentials, the stochastic update functions ϕt\phi_{t} are not just contractive, but in fact cc-strongly contractive (proven in Observation 4.4 below). In other words, the auxiliary sequences {Yt}t\{Y_{t}\}_{t} and {Yt}t\{Y_{t}^{\prime}\}_{t} are generated by cc-CNI for c<1c<1. Thus the Privacy Amplification by Iteration Bound for cc-CNI (Proposition 2.10) implies

𝒟α(YTYT)αD2c2T4η.\mathcal{D}_{\alpha}\left(Y_{T}\;\|\;Y_{T}^{\prime}\right)\leqslant\frac{\alpha D^{2}c^{2T}}{4\eta}\,.

The rest of the proof proceeds identically to give the desired mixing time for the Rényi divergence. This immediately implies the fast mixing bounds in all the other desired metrics by the standard comparison inequalities in Lemma 2.5.

(Note that in the present strongly log-concave setting, there is no need to argue indirectly through the Total Variation metric, because here the Rényi mixing bound obtained by the Privacy Amplification by Iteration argument is already exponentially fast, whereas it is not in the log-concave setting; cf. Appendix A.3.)

It remains only to prove the strong contractivity of ϕt\phi_{t} in the setting of strongly convex fif_{i}.

Observation 4.4 (Strong contractivity of ϕt\phi_{t}).

Assume the functions fif_{i} are mm-strongly convex and MM-smooth. If η<2/M\eta<2/M, then ϕt\phi_{t} is cc-strongly contractive for c=maxλ{m,M}|1ηλ|c=\max_{\lambda\in\{m,M\}}|1-\eta\lambda|.

Proof.

Identical to the proof of Observation 3.4, except that the inequality (3.1) tightens by a factor of cc because by Lemma 2.2, a Gradient Descent update is cc-strongly contractive rather than just contractive in the present setting where ff is strongly convex rather than just convex. ∎

4.2 Lower Bound (Proof of Theorem 4.3)

Consider the univariate quadratic potential

f(x)=λ2x2,f(x)=\frac{\lambda}{2}x^{2}\,,

for the choice of λ=argmaxλ{m,M}|1ηλ|\lambda=\operatorname*{argmax}_{\lambda\in\{m,M\}}|1-\eta\lambda|. Clearly this potential is mm-strongly convex and MM-smooth as its second derivative is f′′(x)=λ[m,M]f^{\prime\prime}(x)=\lambda\in[m,M]. For this potential ff and the present unconstrained setting 𝒦=\mathcal{K}=\mathbb{R}, the update step (1.3) of the Langevin Algorithm simplifies to:

Xt+1=(1ηλ)Xt+Zt,X_{t+1}=(1-\eta\lambda)X_{t}+Z_{t}\,,

where Zt𝒩(0,2η)Z_{t}\sim\mathcal{N}(0,2\eta). By unrolling this update and considering the initialization X0=0X_{0}=0 (such a choice suffices for the present purpose of proving a lower bound on mixing), we have that

Xt+1=s=0t(1ηλ)tsZs.X_{t+1}=\sum_{s=0}^{t}(1-\eta\lambda)^{t-s}Z_{s}\,.

Now by our choice of λ as well as the basic fact that a sum of independent Gaussians is Gaussian with variance equal to the sum of the constituents’ variances, it follows that

XT𝒩(0,2ηs=0T1(1ηλ)2(T1s))=𝒩(0,2η1c2T1c2)\displaystyle X_{T}\sim\mathcal{N}\left(0,2\eta\sum_{s=0}^{T-1}(1-\eta\lambda)^{2(T-1-s)}\right)=\mathcal{N}\left(0,2\eta\,\frac{1-c^{2T}}{1-c^{2}}\right) (4.1)

Since the Langevin Algorithm has stationary distribution πη\pi_{\eta}, taking the limit TT\to\infty implies

πη=𝒩(0,2η11c2)\displaystyle\pi_{\eta}=\mathcal{N}\left(0,2\eta\,\frac{1}{1-c^{2}}\right) (4.2)

Now that we have explicitly computed the laws of intermediate iterates XTX_{T} and the stationary distribution πη\pi_{\eta}, both as Gaussians, it is straightforward to compute their Rényi divergence:

𝒟α(XTπη)\displaystyle\mathcal{D}_{\alpha}\left(X_{T}\;\|\;\pi_{\eta}\right) =𝒟α(𝒩(0,2η1c2T1c2)𝒩(0,2η11c2))\displaystyle=\mathcal{D}_{\alpha}\left(\mathcal{N}\left(0,2\eta\,\frac{1-c^{2T}}{1-c^{2}}\right)\;\|\;\mathcal{N}\left(0,2\eta\,\frac{1}{1-c^{2}}\right)\right)
=𝒟α(𝒩(0,1c2T)𝒩(0,1))\displaystyle=\mathcal{D}_{\alpha}\left(\mathcal{N}\left(0,1-c^{2T}\right)\;\|\;\mathcal{N}\left(0,1\right)\right)
=12βlog(1βc2T)12log(1c2T).\displaystyle=\frac{1}{2\beta}\log\left(1-\beta c^{2T}\right)-\frac{1}{2}\log\left(1-c^{2T}\right)\,. (4.3)

Above, the first step is by plugging in the laws of X_T and πη from (4.1) and (4.2); the second step is by a change of variables (rescaling both arguments by the same factor leaves the Rényi divergence unchanged); the third step is by the explicit formula for the Rényi divergence between Gaussians (Lemma 2.7) and simplifying. In this final step, we denote β := 1 − α ⩽ 0 for shorthand.

The desired asymptotics 𝒟α(X_T ‖ πη) ≈ αc^{4T}/4 now follow from (4.3) by using the second-order Taylor expansion log(1+ε) ≈ ε − ε²/2 to expand both logarithmic terms. Formally, use the elementary inequality log(1+x) ⩾ x − x²/2 for x ⩾ 0 to bound the first term, and use the elementary inequality log(1−x) ⩽ −x − x²/2 for x ⩾ 0 to bound the latter term. We conclude that

\mathcal{D}_{\alpha}\left(X_{T}\;\|\;\pi_{\eta}\right)\geqslant\frac{1}{2\beta}\left(-\beta c^{2T}-\frac{\beta^{2}c^{4T}}{2}\right)-\frac{1}{2}\left(-c^{2T}-\frac{c^{4T}}{2}\right)=\frac{(1-\beta)c^{4T}}{4}=\frac{\alpha c^{4T}}{4}\,.
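The exact expression (4.3) is easy to evaluate numerically; the Python sketch below (ours) does so for α = 2, where (4.3) simplifies to −½ log(1 − c^{4T}), and compares it with αc^{4T}/4.

```python
import numpy as np

def renyi_iterate_vs_stationary(alpha, c, T):
    """Exact D_alpha(N(0, 1 - c^(2T)) || N(0, 1)) from equation (4.3); alpha > 1."""
    beta = 1 - alpha
    r = c**(2 * T)
    return np.log(1 - beta * r) / (2 * beta) - np.log(1 - r) / 2

alpha, c = 2.0, 0.9
for T in (5, 10, 20, 40):
    print(T, renyi_iterate_vs_stationary(alpha, c, T), alpha * c**(4 * T) / 4)
# For alpha = 2 the exact divergence equals -log(1 - c^(4T))/2, which dominates
# alpha * c^(4T) / 4 and matches it as T grows.
```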

5 Discussion

We mention a few natural directions for future work that are motivated by the results of this paper.

  • Bias of the Langevin Algorithm? The purpose of this paper is to characterize the mixing time of the Langevin Algorithm to its stationary distribution πη\pi_{\eta}, for any discretization stepsize η>0\eta>0. In order to sample from the target distribution π\pi, these mixing bounds must be combined with a discretization bound on the error between πη\pi_{\eta} and π\pi. While bias bounds are known for certain settings (e.g., [14, 88, 6, 35, 21]), for the vast majority of settings it remains unclear if the current bounds are tight, and at least in certain settings it seems clear that they are not. Our result on the mixing time of the Langevin Algorithm further motivates the quest for optimal bias bounds.

A related challenge is whether bias bounds can be obtained by “directly” comparing the stationary distributions π and πη—in contrast to all known bias bounds, which are obtained by taking non-asymptotic mixing bounds to π (which compare π to the law πη,T of the T-th iterate of the Langevin Algorithm) and sending T→∞. It seems that any such bias bound is useless for combining with our mixing bounds to πη, since the indirectness of this argument makes the resulting mixing bound (comparing πη,T to π) worse than the original mixing bound used to bound the bias. In fact, directly arguing about properties of the stationary distribution πη appears to require new techniques (see the related open questions in [88]). The only results in this vein that we are aware of are in the recent revision to [88] and in the forthcoming work [2], but tight bias bounds remain unclear in the settings of this paper.

  • Broader use of these techniques? The techniques we use in this paper apply at the generality of Contractive Noisy Iterations, which is broader than just the Langevin Algorithm. Can these techniques shed light on the rapid mixing of other sampling algorithms? Most first-order algorithms can be viewed, at least intuitively, as variants of the Langevin Algorithm (see the discussion in §1.3). In particular, can our techniques help understand the mixing time of more sophisticated sampling algorithms which can perform better in practice, such as the Metropolis-Adjusted Langevin Algorithm or Hamiltonian Monte Carlo?

  • Reconciliation of mixing analyses? As discussed in §1.3, our mixing analysis exploits connections between sampling over d\mathbb{R}^{d} and optimization over d\mathbb{R}^{d}, whereas many recent analyses exploit connections between sampling over d\mathbb{R}^{d} and optimization over the space 𝒫(d)\mathcal{P}(\mathbb{R}^{d}) of probability distributions. From both a technical as well as pedagogical perspective, it is interesting to ask: to what extent can these seemingly disparate techniques be reconciled?

  • Beyond log-concavity? An important but difficult question for bridging theory and practice is to understand to what extent fast mixing carries over to families of target distributions π under weaker assumptions than log-concavity (i.e., weaker assumptions on potentials than convexity). There have been several recent breakthrough results in this direction under various assumptions like dissipativity or isoperimetry, or under weaker notions of convergence (see the previous work section §1.3 and the very recent papers [6, 23]). However, on one hand tight mixing bounds are unknown for any of these settings (see Footnote 6), and on the other hand the non-convexity assumptions made in the sampling literature are often markedly different from those in the optimization literature. To what extent can either of these issues be reconciled? Can our techniques help shed light on either question? The key issue with extending our techniques to these settings is that without convexity of the potential, a Gradient Descent step is not contractive, which seems to preclude using the Privacy Amplification by Iteration argument.

Acknowledgements.

We are grateful to Adam Smith for an insightful discussion about [3] which led us to the realization of the rapid mixing in Theorem 3.1, to John Duchi for a reference to Lemma A.6, to Sinho Chewi for sharing an early draft of his book [20], and to Murat Erdogdu and Andre Wibisono for encouragement and helpful conversations about the related literature.

Appendix A Deferred Details

A.1 Contractivity of Gradient Steps

Proof of Lemma 2.2.

Since our purpose here is to explain the idea behind this classical lemma, we will assume for simplicity that ff is twice differentiable. (A proof for the general case can be found in standard convex optimization texts.) Consider any two points x,xx,x^{\prime}. By the Fundamental Theorem of Calculus,

f(x)f(x)=012f(tx+(1t)x)(xx)𝑑t.\nabla f(x)-\nabla f(x^{\prime})=\int_{0}^{1}\nabla^{2}f\left(tx+(1-t)x^{\prime}\right)(x-x^{\prime})dt\,.

Thus

(xηf(x))(xηf(x))\displaystyle\left\|\left(x-\eta\nabla f(x)\right)-\left(x^{\prime}-\eta\nabla f(x^{\prime})\right)\right\| =(Idη012f(tx+(1t)x)𝑑t)(xx)\displaystyle=\left\|\left(I_{d}-\eta\int_{0}^{1}\nabla^{2}f\left(tx+(1-t)x^{\prime}\right)dt\right)(x-x^{\prime})\right\|
Idη012f(tx+(1t)x)𝑑txx\displaystyle\leqslant\left\|I_{d}-\eta\int_{0}^{1}\nabla^{2}f\left(tx+(1-t)x^{\prime}\right)dt\right\|\left\|x-x^{\prime}\right\|
maxλ[m,M]|1ηλ|xx\displaystyle\leqslant\max_{\lambda\in[m,M]}\left\lvert 1-\eta\lambda\right\rvert\left\|x-x^{\prime}\right\|
=cxx.\displaystyle=c\left\|x-x^{\prime}\right\|\,.

Above, the first step is by factoring, the second step is by sub-multiplicativity of the operator norm, the third step is because a twice differentiable function is m-strongly convex and M-smooth if and only if the eigenvalues of its Hessians lie within [m, M], and the last step is because λ ↦ |1 − ηλ| is convex and thus maximized at the boundary of the interval [m, M]. This establishes c-contractivity, as desired. ∎
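A numerical illustration of Lemma 2.2 (our sketch; the random quadratic and all parameter values are arbitrary choices) on a Hessian whose spectrum lies in [m, M]:

```python
import numpy as np

rng = np.random.default_rng(3)
d, m, M, eta = 5, 1.0, 10.0, 0.15
evals = np.linspace(m, M, d)
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
A = Q @ np.diag(evals) @ Q.T                 # Hessian with spectrum in [m, M]
step = lambda x: x - eta * (A @ x)           # gradient step for f(x) = x'Ax/2
c = max(abs(1 - eta * m), abs(1 - eta * M))  # contraction coefficient of Lemma 2.2
x, y = rng.standard_normal(d), rng.standard_normal(d)
ratio = np.linalg.norm(step(x) - step(y)) / np.linalg.norm(x - y)
print(ratio, c)                              # ratio <= c always holds
```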

A.2 Privacy Amplification by Iteration

Here we recall (simplified forms of) the two lemmas underpinning the PABI technique, their proofs, and how they imply Proposition 2.10. These proofs have already appeared in previous work on PABI [43, 3] and are provided here only for completeness and for the sake of accessibility of this article beyond the differential privacy community. See the techniques section §1.2 for a high-level overview of the PABI technique.

A.2.1 Contraction-reduction lemma

We recall this lemma in a simplified form, where both measures are pushed-forward through the same contraction, as this suffices for the present purpose of sampling analyses (in differential privacy one must consider two different contraction maps). In this setting, the proof simplifies, and the lemma can be interpreted as an extension of the classical data-processing inequality for Rényi divergence to shifted Rényi divergence—under the condition that the data-processing operation is contractive.

Lemma A.1 (Contraction-reduction lemma, simplified version).

For any distributions μ, ν ∈ 𝒫(ℝ^d) and any (random) c-contractive function ϕ: ℝ^d → ℝ^d,

𝒟α(cz)(ϕ#μϕ#ν)𝒟α(z)(μν).\mathcal{D}_{\alpha}^{(cz)}\left(\phi_{\#}\mu\;\|\;\phi_{\#}\nu\right)\leqslant\mathcal{D}_{\alpha}^{(z)}\left(\mu\;\|\;\nu\right)\,.
Proof.

Let μ′ be a distribution achieving the infimum in the definition of 𝒟α^{(z)}(μ ‖ ν); that is, μ′ satisfies W∞(μ, μ′) ⩽ z and 𝒟α(μ′ ‖ ν) = 𝒟α^{(z)}(μ ‖ ν). Because ϕ is a c-contraction, it follows that W∞(ϕ#μ, ϕ#μ′) ⩽ cW∞(μ, μ′) ⩽ cz. Thus

𝒟α(cz)(ϕ#μϕ#ν)𝒟α(ϕ#μϕ#ν)𝒟α(μν)=𝒟α(z)(μν),\mathcal{D}_{\alpha}^{(cz)}\left(\phi_{\#}\mu\;\|\;\phi_{\#}\nu\right)\leqslant\mathcal{D}_{\alpha}\left(\phi_{\#}\mu^{\prime}\;\|\;\phi_{\#}\nu\right)\leqslant\mathcal{D}_{\alpha}\left(\mu^{\prime}\;\|\;\nu\right)=\mathcal{D}_{\alpha}^{(z)}(\mu\;\|\;\nu)\,,

where the middle step is due to the data-processing inequality for Rényi divergences (Lemma 2.6). ∎

A.2.2 Shift-Reduction Lemma

We state this lemma in the form used in differential privacy, with only the minor simplification that the noise does not come from an arbitrary distribution on a Banach space, but instead is simply a Gaussian distribution on d\mathbb{R}^{d}. See the techniques section §1.2 for intuition of how this lemma can be thought of as a strengthening of the trivial bound

𝒟α(μξνξ)𝒟α(μξνξ)=𝒟α(μν)+𝒟α(ξξ)\displaystyle\mathcal{D}_{\alpha}\left(\mu\ast\xi\;\|\;\nu\ast\xi\right)\leqslant\mathcal{D}_{\alpha}\left(\mu\otimes\xi\;\|\;\nu\otimes\xi\right)=\mathcal{D}_{\alpha}\left(\mu\;\|\;\nu\right)+\mathcal{D}_{\alpha}\left(\xi\;\|\;\xi\right) (A.1)

which follows from the data-processing and tensorization properties of Rényi divergence—where in words, the strengthening is achieved by moving aa units of “displacement” between μ\mu and ν\nu from the first term to the second term.

Lemma A.2 (Shift-reduction lemma).

For any distributions μ, ν ∈ 𝒫(ℝ^d) and any shifts z, a ⩾ 0,

𝒟α(z)(μ𝒩(0,σ2Id)ν𝒩(0,σ2Id))𝒟α(z+a)(μν)+αa22σ2.\mathcal{D}_{\alpha}^{(z)}\left(\mu\ast\mathcal{N}(0,\sigma^{2}I_{d})\;\|\;\nu\ast\mathcal{N}(0,\sigma^{2}I_{d})\right)\leqslant\mathcal{D}_{\alpha}^{(z+a)}\left(\mu\;\|\;\nu\right)+\frac{\alpha a^{2}}{2\sigma^{2}}\,.

Below we reproduce the proof of this lemma from [43] for completeness, although with a slightly modified presentation that emphasizes the connection to (A.1), as isolated in the following lemma. Intuitively, Lemma A.3 can be thought of as a variant of the Chain Rule for the KL divergence, with expectation replaced by supremum so that the inequality extends from the KL divergence (α = 1) to all Rényi divergences (α ⩾ 1).

Lemma A.3 (Analog of (A.1) used in Lemma A.2).

Consider any random vectors X,Y,X,YX,Y,X^{\prime},Y^{\prime} in d\mathbb{R}^{d}. If XX^{\prime} and YY^{\prime} are independent, then

𝒟α(X+YX+Y)𝒟α(XX)+esssupx𝒟α(Y|X=xY)\mathcal{D}_{\alpha}\left(X+Y\;\|\;X^{\prime}+Y^{\prime}\right)\leqslant\mathcal{D}_{\alpha}\left(X\;\|\;X^{\prime}\right)+\operatorname*{ess\,sup}_{x}\mathcal{D}_{\alpha}\left(Y\big{|}_{X=x}\;\|\;Y^{\prime}\right)

where Y|X=xY\big{|}_{X=x} denotes the distribution of YY conditional on X=xX=x.

Proof.

By the data-processing property of the Rényi divergence, the definition of the Rényi divergence, the independence assumption, Hölder’s inequality, and again the definition of Rényi divergence,

𝒟α(X+YX+Y)\displaystyle\mathcal{D}_{\alpha}\left(X+Y\;\|\;X^{\prime}+Y^{\prime}\right) 𝒟α((X,Y)(X,Y))\displaystyle\leqslant\mathcal{D}_{\alpha}\left((X,Y)\;\|\;(X^{\prime},Y^{\prime})\right)
=1α1log((X,Y)(x,y)(X,Y)(x,y))α(X,Y)(x,y)𝑑x𝑑y\displaystyle=\frac{1}{\alpha-1}\log\iint\left(\frac{\mathbb{P}_{(X,Y)}(x,y)}{\mathbb{P}_{(X^{\prime},Y^{\prime})}(x,y)}\right)^{\alpha}\mathbb{P}_{(X^{\prime},Y^{\prime})}(x,y)dxdy
=1α1log(X(x)Y|X=x(y)X(x)Y(y))αX(x)Y(y)𝑑x𝑑y\displaystyle=\frac{1}{\alpha-1}\log\iint\left(\frac{\mathbb{P}_{X}(x)\mathbb{P}_{Y|X=x}(y)}{\mathbb{P}_{X^{\prime}}(x)\mathbb{P}_{Y^{\prime}}(y)}\right)^{\alpha}\mathbb{P}_{X^{\prime}}(x)\mathbb{P}_{Y^{\prime}}(y)dxdy
1α1log(X(x)X(x))αX(x)𝑑xesssupx(Y|X=x(y)Y(y))αY(y)𝑑y\displaystyle\leqslant\frac{1}{\alpha-1}\log\int\left(\frac{\mathbb{P}_{X}(x)}{\mathbb{P}_{X^{\prime}}(x)}\right)^{\alpha}\mathbb{P}_{X^{\prime}}(x)dx\cdot\operatorname*{ess\,sup}_{x}\int\left(\frac{\mathbb{P}_{Y|X=x}(y)}{\mathbb{P}_{Y^{\prime}}(y)}\right)^{\alpha}\mathbb{P}_{Y^{\prime}}(y)dy
=𝒟α(XX)+esssupx𝒟α(Y|X=xY)\displaystyle=\mathcal{D}_{\alpha}\left(X\;\|\;X^{\prime}\right)+\operatorname*{ess\,sup}_{x}\mathcal{D}_{\alpha}\left(Y\big{|}_{X=x}\;\|\;Y^{\prime}\right)

Proof of Lemma A.2.

Case of z=0z=0. Let μ\mu^{\prime} be a distribution guaranteed by the definition of the shifted Rényi divergence 𝒟α(a)(μν)\mathcal{D}_{\alpha}^{(a)}(\mu\;\|\;\nu); that is, μ\mu^{\prime} is a distribution satisfying W(μ,μ)aW_{\infty}(\mu,\mu^{\prime})\leqslant a and 𝒟α(μν)=𝒟α(a)(μν)\mathcal{D}_{\alpha}\left(\mu^{\prime}\;\|\;\nu\right)=\mathcal{D}_{\alpha}^{(a)}\left(\mu\;\|\;\nu\right). Let (U,U)(U,U^{\prime}) be the coupling of (μ,μ)(\mu,\mu^{\prime}) guaranteed by the definition of W(μ,μ)W_{\infty}(\mu,\mu^{\prime}); then W:=UUW:=U-U^{\prime} satisfies Wa\|W\|\leqslant a almost surely. Now bound

𝒟α(μ𝒩(0,σ2Id)ν𝒩(0,σ2Id))\displaystyle\mathcal{D}_{\alpha}\left(\mu\ast\mathcal{N}(0,\sigma^{2}I_{d})\;\|\;\nu\ast\mathcal{N}(0,\sigma^{2}I_{d})\right) =𝒟α(μ𝒩(W,σ2Id)ν𝒩(0,σ2Id))\displaystyle=\mathcal{D}_{\alpha}\left(\mu^{\prime}\ast\mathcal{N}(W,\sigma^{2}I_{d})\;\|\;\nu\ast\mathcal{N}(0,\sigma^{2}I_{d})\right)
𝒟α(μν)+supw:wa𝒟α(𝒩(w,σ2Id)𝒩(0,σ2Id))\displaystyle\leqslant\mathcal{D}_{\alpha}\left(\mu^{\prime}\;\|\;\nu\right)+\sup_{w\;:\;\|w\|\leqslant a}\mathcal{D}_{\alpha}\left(\mathcal{N}(w,\sigma^{2}I_{d})\;\|\;\mathcal{N}(0,\sigma^{2}I_{d})\right)
=𝒟α(a)(μν)+αa22σ2.\displaystyle=\mathcal{D}_{\alpha}^{(a)}\left(\mu\;\|\;\nu\right)+\frac{\alpha a^{2}}{2\sigma^{2}}\,.

Above, the first step is by adding and subtracting WW, the second step is by Lemma A.3, and the third step is by construction of μ\mu^{\prime} and the identity 𝒟α(𝒩(w,σ2Id)𝒩(0,σ2Id))=αw22σ2\mathcal{D}_{\alpha}\left(\mathcal{N}(w,\sigma^{2}I_{d})\;\|\;\mathcal{N}(0,\sigma^{2}I_{d})\right)=\frac{\alpha\|w\|^{2}}{2\sigma^{2}} which can be found in, e.g., [86, Example 3].

Case of general z0z\geqslant 0. Define μ,U,U,W\mu^{\prime},U,U^{\prime},W analogously, now in terms of 𝒟α(z+a)(μν)\mathcal{D}_{\alpha}^{(z+a)}(\mu\;\|\;\nu) rather than 𝒟α(a)(μν)\mathcal{D}_{\alpha}^{(a)}(\mu\;\|\;\nu). Decompose W=W1+W2W=W_{1}+W_{2} for W1=Wmin(1,z/W)W_{1}=W\cdot\min(1,z/\|W\|); this ensures W1z\|W_{1}\|\leqslant z and W2a\|W_{2}\|\leqslant a. Observe that

𝒟α(z)(μ𝒩(0,σ2Id)ν𝒩(0,σ2Id))𝒟α(μW1𝒩(0,σ2Id)ν𝒩(0,σ2Id))𝒟α(a)(μW1ν)+αa22σ2.\displaystyle\mathcal{D}_{\alpha}^{(z)}\left(\mu\ast\mathcal{N}(0,\sigma^{2}I_{d})\;\|\;\nu\ast\mathcal{N}(0,\sigma^{2}I_{d})\right)\leqslant\mathcal{D}_{\alpha}\left(\mu\ast\mathbb{P}_{W_{1}}\ast\mathcal{N}(0,\sigma^{2}I_{d})\;\|\;\nu\ast\mathcal{N}(0,\sigma^{2}I_{d})\right)\leqslant\mathcal{D}_{\alpha}^{(a)}\left(\mu\ast\mathbb{P}_{W_{1}}\;\|\;\nu\right)+\frac{\alpha a^{2}}{2\sigma^{2}}\,.

Above, the first step is because W1z\|W_{1}\|\leqslant z, and the second step is by the case z=0z=0 proved above. The proof is complete by observing that

𝒟α(a)(μW1ν)𝒟α(μW1W2ν)=𝒟α(μν)=𝒟α(z+a)(μν),\displaystyle\mathcal{D}_{\alpha}^{(a)}\left(\mu\ast\mathbb{P}_{W_{1}}\;\|\;\nu\right)\leqslant\mathcal{D}_{\alpha}\left(\mu\ast\mathbb{P}_{W_{1}}\ast\mathbb{P}_{W_{2}}\;\|\;\nu\right)=\mathcal{D}_{\alpha}\left(\mu^{\prime}\;\|\;\nu\right)=\mathcal{D}_{\alpha}^{(z+a)}\left(\mu\;\|\;\nu\right)\,,

where above the first step is by the bound ‖W₂‖ ⩽ a, the second step is by the decompositions W = W₁ + W₂ and U = U′ + W, and the third step is by the bound ‖W‖ ⩽ z + a. ∎

A.2.3 Proof of Diameter-Aware PABI Bound

Finally, we recall how the combination of Lemmas A.1 and A.2 implies the diameter-aware PABI bound in Proposition 2.10. This argument is from [3] and is reproduced below for completeness, albeit with the simplification that the two CNI use the same contractions, since this suffices for sampling analyses.

Proof of Proposition 2.10.

Consider any sequence a₁, …, a_T ⩾ 0 satisfying D = ∑_{t=1}^{T} c^{−t} a_t. Relate the shifted Rényi divergence at time T to that at time T−1 by bounding

𝒟α(XTXT)\displaystyle\mathcal{D}_{\alpha}\left(X_{T}\;\|\;X_{T}^{\prime}\right) =𝒟α(0)(ϕT(XT1)+ZT1ϕT(XT1)+ZT1)\displaystyle=\mathcal{D}_{\alpha}^{(0)}\left(\phi_{T}(X_{T-1})+Z_{T-1}\;\|\;\phi_{T}(X_{T-1}^{\prime})+Z_{T-1}\right)
\displaystyle\leqslant\mathcal{D}_{\alpha}^{(a_{T})}\left(\phi_{T}(X_{T-1})\;\|\;\phi_{T}(X_{T-1}^{\prime})\right)+\frac{\alpha a_{T}^{2}}{2\sigma^{2}}
𝒟α(caT)(XT1XT1)+αaT22σ2,\displaystyle\leqslant\mathcal{D}_{\alpha}^{(ca_{T})}\left(X_{T-1}\;\|\;X_{T-1}^{\prime}\right)+\frac{\alpha a_{T}^{2}}{2\sigma^{2}}\,,

where above, the first step is by definition of the CNI and the fact that the standard Rényi divergence is the same as 0-shifted Rényi divergence; the second step is by the shift-reduction lemma (Lemma A.2); the third step is by the contraction-reduction lemma (Lemma A.1).

By repeating this argument TT times, we can relate the shifted Rényi divergence at time TT to that at time 0 by bounding

\mathcal{D}_{\alpha}\left(X_{T}\;\|\;X_{T}^{\prime}\right)\leqslant\mathcal{D}_{\alpha}^{(\sum_{t=1}^{T}c^{-t}a_{t})}\left(X_{0}\;\|\;X_{0}^{\prime}\right)+\frac{\alpha}{2\sigma^{2}}\sum_{t=1}^{T}a_{t}^{2}\,.

Now this shifted Rényi divergence at time 0 vanishes because W∞(X₀, X₀′) ⩽ D = ∑_{t=1}^{T} c^{−t} a_t (indeed, X₀ and X₀′ both lie in the set 𝒦, which has diameter D). We conclude that

\mathcal{D}_{\alpha}\left(X_{T}\;\|\;X_{T}^{\prime}\right)\leqslant\frac{\alpha}{2\sigma^{2}}\cdot\Bigg{(}\;\inf_{\begin{subarray}{c}a_{1},\dots,a_{T}\geqslant 0\\ \text{s.t. }\sum_{t=1}^{T}c^{-t}a_{t}=D\end{subarray}}\sum_{t=1}^{T}a_{t}^{2}\;\Bigg{)}\,. (A.2)

The proof is complete by setting a₁ = ⋯ = a_T = D/T if c = 1, or setting a_t = c^{T} D 𝟙_{t=T} if c < 1. ∎

Remark A.4 (PABI bound that is continuous at c=1c=1).

The statement of Proposition 2.10 is discontinuous in c, as mentioned in the discussion preceding it. This is just for simplicity of exposition. Indeed, by actually solving the optimization problem in (A.2), we obtain a slightly sharper bound that is continuous in c. Namely, taking the optimal solution a_t = c^{−t}βD, where β = (c²−1)/(1−c^{−2T}), we have

𝒟α(XTXT)αD22σ2c211c2T.\displaystyle\mathcal{D}_{\alpha}\left(X_{T}\;\|\;X_{T}^{\prime}\right)\leqslant\frac{\alpha D^{2}}{2\sigma^{2}}\cdot\frac{c^{2}-1}{1-c^{-2T}}\,. (A.3)

This is continuous in cc because limc1(c21)/(1c2T)=1/T\lim_{c\to 1}(c^{2}-1)/(1-c^{-2T})=1/T. However, we keep Proposition 2.10 as stated because the exact bound (A.3) is harder to parse and only yields a small constant improvement for c<1c<1.
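The optimization in (A.2) is a minimum-norm problem solvable by Cauchy–Schwarz; the snippet below (ours, purely illustrative) confirms that its inner value matches the factor appearing in (A.3) and is slightly smaller than the value attained by the simplified c < 1 choice in Proposition 2.10.

```python
import numpy as np

def opt_value(c, D, T):
    """Value of (A.2)'s inner problem: min sum_t a_t^2  s.t.  sum_t c^(-t) a_t = D.
    By Cauchy-Schwarz the minimum is D^2 / sum_t c^(-2t)."""
    return D**2 / sum(c**(-2 * t) for t in range(1, T + 1))

c, D, T = 0.9, 2.0, 30
closed_form = D**2 * (c**2 - 1) / (1 - c**(-2 * T))   # factor appearing in (A.3)
print(opt_value(c, D, T), closed_form)                # the two coincide
print(D**2 * c**(2 * T))   # value of the simplified c < 1 choice; slightly larger
```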

A.3 Mixing in Other Notions of Distance

A.3.1 Preliminaries on Dobrushin’s Coefficient and ff-Divergences

A simple, helpful lemma for proving fast mixing bounds for the total variation metric is the following. Operationally, it reduces proving mixing bounds to arbitrary error ε\varepsilon to just the case of constant ε\varepsilon, say 1/41/4. This holds for arbitrary Markov chains (not just the Langevin Algorithm), and is a straightforward consequence of the submultiplicativity of the function tsupμ,νPtμPtν𝖳𝖵t\mapsto\sup_{\mu,\nu}\|P^{t}\mu-P^{t}\nu\|_{\mathsf{TV}} for any Markov kernel PP; a proof can be found in, e.g., [58, §4.5].

Lemma A.5 (Mixing times are exponentially fast for total variation).
Tmix,𝖳𝖵(ε)log2ε1Tmix,𝖳𝖵(1/4).T_{\mathrm{mix},\,\mathsf{TV}}(\varepsilon)\leqslant\lceil\log_{2}\varepsilon^{-1}\rceil\cdot T_{\mathrm{mix},\,\mathsf{TV}}(1/4)\,.

Although Lemma A.5 and its proof are tailored to the total variation metric, this fact is helpful for proving rapid mixing bounds for other divergences. This is due to the following lemma from information theory which states that, among all ff-divergences, the “mixing rate” is worst for the total variation distance. This mixing rate is quantified in terms of the so-called strong data processing constant ρ𝒟(P):=supμν𝒟(PμPν)/𝒟(μν)\rho_{\mathcal{D}}(P):=\sup_{\mu\neq\nu}\mathcal{D}(P\mu\|P\nu)/\mathcal{D}(\mu\|\nu) for a Markov chain PP with respect to an ff-divergence 𝒟\mathcal{D}, a.k.a., the Dobrushin contraction coefficient in the case of the total variation distance 𝒟=𝖳𝖵\mathcal{D}=\mathsf{TV}. For further details on this lemma, see [30, §11.1].

Lemma A.6 (Total variation mixing is worst case among all ff-divergences).

For any ff-divergence 𝒟\mathcal{D} and any Markov chain PP,

\rho_{\mathcal{D}}(P)\leqslant\rho_{\mathsf{TV}}(P)\,.

Lemma A.6 applies, for example, to the KL, Chi-Squared, and Hellinger divergences since those are all ff-divergences. However, applying this lemma to the Rényi divergence requires an intermediate step since the Rényi divergence is not an ff-divergence. This can be achieved by relating the Rényi divergence 𝒟α\mathcal{D}_{\alpha} to the generalized Hellinger divergence 𝖧α\mathsf{H}_{\alpha}, and recalling that the latter is an ff-divergence. We record the necessary preliminaries for this connection in the following lemma; further details can be found in, e.g., [82, equations (52) and (80)].

Lemma A.7 (Preliminaries for using Lemma A.6 for Rényi divergence).

Suppose α ∈ (1, ∞). Then 𝖧α is an f-divergence and satisfies

𝒟α(μ,ν)=1α1log(1+(α1)𝖧α(μ,ν)).\mathcal{D}_{\alpha}(\mu,\nu)=\frac{1}{\alpha-1}\log\left(1+(\alpha-1)\mathsf{H}_{\alpha}(\mu,\nu)\right)\,.

A.3.2 Proof of Corollary 3.3

By a basic mixing time argument (Lemma A.5) which applies to any Markov chain, we can boost the total variation mixing time proved in Theorem 3.1 for constant error 1/41/4 to arbitrarily small error ε>0\varepsilon>0, namely

Tmix,𝖳𝖵(ε)2D2ηlog2(1/ε).T_{\mathrm{mix},\,\mathsf{TV}}(\varepsilon)\leqslant\frac{2D^{2}}{\eta}\lceil\log_{2}(1/\varepsilon)\rceil\,.

We now show how this fast mixing bound for total variation implies fast mixing for the Rényi divergence 𝒟α for α ∈ [1, ∞). We do this by proving fast mixing for the more stringent notion of the Hellinger divergence 𝖧α. Our analysis proceeds in two phases. First, we argue about the number of iterations it takes to mix to error 2e^{α−1} in Hellinger divergence. To this end, observe that T_{mix, 𝒟α}(1) ≲ αD²/η by the analog of (3.2) for 𝒟α (obtained by applying Proposition 2.10 with general α); and also observe that if the Rényi divergence 𝒟α between two distributions is at most 1, then the Hellinger divergence 𝖧α is at most 2e^{α−1}, due to the identity in Lemma A.7 combined with a crude bound that uses the inequality e^{α−1} ⩽ 1 + 2(α−1) for α ∈ [1, 2]. Together, these two observations imply

Tmix,𝖧α(2eα1)Tmix,Dα(1)αD2η.T_{\mathrm{mix},\mathsf{H}_{\alpha}}\left(2e^{\alpha-1}\right)\leqslant T_{\mathrm{mix},\,D_{\alpha}}(1)\lesssim\frac{\alpha D^{2}}{\eta}\,.

Next, we claim that it takes at most

\frac{D^{2}}{\eta}\log\frac{2e^{\alpha-1}}{\varepsilon}\lesssim\frac{D^{2}}{\eta}\left[(\alpha-1)+\log\frac{1}{\varepsilon}\right]

iterations to mix to ε error in Hellinger divergence 𝖧α after we have reached error 2e^{α−1}. This is a consequence of two observations: (i) the Dobrushin contraction coefficient corresponding to 2D²/η iterations of the Langevin Algorithm is bounded above by 1/4 by (3.3); and (ii) the analogous contraction coefficient for the Hellinger divergence 𝖧α is no worse, by Lemma A.6 and the fact that 𝖧α is an f-divergence (Lemma A.7). By combining the above two displays, we conclude that

Tmix,𝖧α(ε)D2η[(α1)+log1ε].T_{\mathrm{mix},\,\mathsf{H}_{\alpha}}(\varepsilon)\lesssim\frac{D^{2}}{\eta}\left[(\alpha-1)+\log\frac{1}{\varepsilon}\right]\,.

Now, because the Rényi divergence 𝒟α\mathcal{D}_{\alpha} is always bounded above by the Hellinger divergence 𝖧α\mathsf{H}_{\alpha} (this follows from Lemma A.7 and the elementary inequality log(1+x)x\log(1+x)\leqslant x), the above display implies the claimed mixing time bound for the Rényi divergence. This immediately implies the claimed fast mixing bounds in the other divergences (KL, Hellinger, and Chi-Squared) due to the standard comparison inequalities in Lemma 2.5.

A.4 Extension to Unconstrained Settings

Here, we briefly mention how our analysis technique readily extends unchanged to unconstrained settings, i.e., if 𝒦=d\mathcal{K}=\mathbb{R}^{d}, or equivalently the Langevin Algorithm makes no projection in its update. This leads to tight mixing bounds and only requires two simple changes in the analysis.

First, we must slightly restrict the initialization, since otherwise the Langevin Algorithm takes arbitrarily long to even come close to where πη\pi_{\eta} is concentrated. To remedy this, simply initialize at any mode xx^{*} of π\pi, a.k.a., any minimizer of the potential ff. Such an xx^{*} is readily computed using Gradient Descent; note that this uses the same first-order access to gradients of ff as the Langevin Algorithm.

Second, the diameter DD is replaced by a proxy Dη,εD_{\eta,\varepsilon} which is the diameter of a ball that captures all but ε\varepsilon mass of πη\pi_{\eta}. Formally, let Dη,εD_{\eta,\varepsilon} satisfy

Xπη[XxDη,ε]ε.\displaystyle\mathbb{P}_{X\sim\pi_{\eta}}\left[\|X-x^{*}\|\geqslant D_{\eta,\varepsilon}\right]\leqslant\varepsilon\,. (A.4)

for any minimizer xx^{*} of ff. Bounds on Dη,εD_{\eta,\varepsilon} for both the convex and strongly convex settings are in [2]. A bound for the strongly convex setting can also be obtained by Theorem 8 of the recent revision of [88].

With these two minor changes, the proofs of our 𝖳𝖵\mathsf{TV} mixing time upper bounds for convex potentials (Theorem 3.1) and strongly convex potentials (Theorem 4.1) carry over to the present unconstrained setting, establishing that for the same mixing time TT (except with DD replaced by Dη,εD_{\eta,\varepsilon}), we have 𝖳𝖵(XT,πη,ε)ε\mathsf{TV}(X_{T},\pi_{\eta,\varepsilon})\leqslant\varepsilon, where πη,ε\pi_{\eta,\varepsilon} is the truncation of πη\pi_{\eta} to the ball of radius Dη,εD_{\eta,\varepsilon}; that is, πη,ε(x)πη(x)𝟙[xDη,ε]\pi_{\eta,\varepsilon}(x)\propto\pi_{\eta}(x)\cdot\mathds{1}[\|x\|\leqslant D_{\eta,\varepsilon}]. By definition of πη,ε\pi_{\eta,\varepsilon}, we have 𝖳𝖵(πη,πη,ε)2ε\mathsf{TV}(\pi_{\eta},\pi_{\eta,\varepsilon})\leqslant 2\varepsilon. Thus 𝖳𝖵(XT,πη)3ε\mathsf{TV}(X_{T},\pi_{\eta})\leqslant 3\varepsilon by the triangle inequality, so we conclude that in this unconstrained setting, TT iterations suffices for the Langevin Algorithm initialized at xx^{*} to converge to 𝖳𝖵\mathsf{TV} distance 3ε3\varepsilon from πη\pi_{\eta}.

This leads to tight mixing bounds. Indeed, for the case of strongly convex potentials, our mixing lower bound matches up to a mild logarithmic term; see the discussion in §4. And for the case of convex potentials, the reachability lower bound argument in §3.2 applies to the unconstrained setting here: it takes roughly D²_{η,ε}/η iterations for the Langevin Algorithm to move a distance of D_{η,ε} and thus to reach all parts of the ball of diameter D_{η,ε}, and this is a prerequisite for mixing to Θ(ε) error in 𝖳𝖵.

References

  • Ahn and Chewi [2021] Kwangjun Ahn and Sinho Chewi. Efficient constrained sampling via the mirror-Langevin algorithm. Advances in Neural Information Processing Systems, 34:28405–28418, 2021.
  • Altschuler and Talwar [2022a] Jason M Altschuler and Kunal Talwar. Concentration of the Langevin Algorithm’s stationary distribution. Forthcoming, 2022a.
  • Altschuler and Talwar [2022b] Jason M Altschuler and Kunal Talwar. Privacy of Noisy Stochastic Gradient Descent: More iterations without more privacy loss. In Advances in Neural Information Processing Systems, 2022b.
  • Andrieu et al. [2003] Christophe Andrieu, Nando De Freitas, Arnaud Doucet, and Michael I Jordan. An introduction to MCMC for machine learning. Machine Learning, 50(1):5–43, 2003.
  • Asoodeh et al. [2020] Shahab Asoodeh, Mario Díaz, and Flávio du Pin Calmon. Privacy amplification of iterative algorithms via contraction coefficients. 2020 IEEE International Symposium on Information Theory, pages 896–901, 2020.
  • Balasubramanian et al. [2022] Krishna Balasubramanian, Sinho Chewi, Murat A Erdogdu, Adil Salim, and Shunshi Zhang. Towards a theory of non-log-concave sampling: first-order stationarity guarantees for Langevin Monte Carlo. In Conference on Learning Theory, pages 2896–2923. PMLR, 2022.
  • Balle et al. [2019] Borja Balle, Gilles Barthe, Marco Gaboardi, and Joseph Geumlek. Privacy Amplification by Mixing and Diffusion Mechanisms. Curran Associates Inc., Red Hook, NY, USA, 2019.
  • Bélisle et al. [1993] Claude JP Bélisle, H Edwin Romeijn, and Robert L Smith. Hit-and-run algorithms for generating multivariate distributions. Mathematics of Operations Research, 18(2):255–266, 1993.
  • Bernton [2018] Espen Bernton. Langevin Monte Carlo and JKO splitting. In Conference On Learning Theory, pages 1777–1798. PMLR, 2018.
  • Bhattacharya [1978] RN Bhattacharya. Criteria for recurrence and existence of invariant measures for multidimensional diffusions. The Annals of Probability, pages 541–553, 1978.
  • Bishop and Nasrabadi [2006] Christopher M Bishop and Nasser M Nasrabadi. Pattern recognition and machine learning, volume 4. Springer, 2006.
  • Brosse et al. [2017] Nicolas Brosse, Alain Durmus, Éric Moulines, and Marcelo Pereyra. Sampling from a log-concave distribution with compact support with proximal Langevin Monte Carlo. In Conference on Learning Theory, pages 319–342. PMLR, 2017.
  • Bubeck [2015] Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.
  • Bubeck et al. [2018] Sébastien Bubeck, Ronen Eldan, and Joseph Lehec. Sampling from a log-concave distribution with Projected Langevin Monte Carlo. Discrete & Computational Geometry, 59(4):757–783, 2018.
  • Cao et al. [2020] Yu Cao, Jianfeng Lu, and Lihan Wang. Complexity of randomized algorithms for underdamped Langevin dynamics. arXiv preprint arXiv:2003.09906, 2020.
  • Chatterji et al. [2020] Niladri Chatterji, Jelena Diakonikolas, Michael I Jordan, and Peter Bartlett. Langevin Monte Carlo without smoothness. In International Conference on Artificial Intelligence and Statistics, pages 1716–1726. PMLR, 2020.
  • Cheng et al. [2018a] Xiang Cheng, Niladri S Chatterji, Yasin Abbasi-Yadkori, Peter L Bartlett, and Michael I Jordan. Sharp convergence rates for Langevin dynamics in the nonconvex setting. arXiv preprint arXiv:1805.01648, 2018a.
  • Cheng et al. [2018b] Xiang Cheng, Niladri S Chatterji, Peter L Bartlett, and Michael I Jordan. Underdamped Langevin MCMC: A non-asymptotic analysis. In Conference on Learning Theory, pages 300–323. PMLR, 2018b.
  • Cheng et al. [2022] Xiang Cheng, Jingzhao Zhang, and Suvrit Sra. Theory and algorithms for diffusion processes on Riemannian Manifolds. arXiv preprint arXiv:2204.13665, 2022.
  • Chewi [2022] Sinho Chewi. Log-Concave Sampling. Forthcoming, 2022.
  • Chewi et al. [2021a] Sinho Chewi, Murat A Erdogdu, Mufan Bill Li, Ruoqi Shen, and Matthew Zhang. Analysis of Langevin Monte Carlo from Poincaré to Log-Sobolev. arXiv preprint arXiv:2112.12662, 2021a.
  • Chewi et al. [2021b] Sinho Chewi, Chen Lu, Kwangjun Ahn, Xiang Cheng, Thibaut Le Gouic, and Philippe Rigollet. Optimal dimension dependence of the Metropolis-Adjusted Langevin Algorithm. In Conference on Learning Theory, pages 1260–1300. PMLR, 2021b.
  • Chewi et al. [2022a] Sinho Chewi, Patrik Gerber, Holden Lee, and Chen Lu. Fisher information lower bounds for sampling. arXiv preprint arXiv:2210.02482, 2022a.
  • Chewi et al. [2022b] Sinho Chewi, Patrik R Gerber, Chen Lu, Thibaut Le Gouic, and Philippe Rigollet. The query complexity of sampling from strongly log-concave distributions in one dimension. In Conference on Learning Theory, pages 2041–2059. PMLR, 2022b.
  • Chourasia et al. [2021] Rishav Chourasia, Jiayuan Ye, and Reza Shokri. Differential privacy dynamics of Langevin diffusion and noisy gradient descent. In Advances in Neural Information Processing Systems, volume 34, pages 14771–14781, 2021.
  • Dalalyan [2017a] Arnak Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79, 6 2017a.
  • Dalalyan [2017b] Arnak Dalalyan. Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent. In Conference on Learning Theory, pages 678–689. PMLR, 2017b.
  • Dalalyan et al. [2019] Arnak S Dalalyan, Avetik Karagulyan, and Lionel Riou-Durand. Bounding the error of discretized Langevin algorithms for non-strongly log-concave targets. arXiv preprint arXiv:1906.08530, 2019.
  • Duane et al. [1987] Simon Duane, Anthony D Kennedy, Brian J Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics letters B, 195(2):216–222, 1987.
  • Duchi [2021] John Duchi. Lecture notes for statistics 311/electrical engineering 377. https://web.stanford.edu/class/stats311/lecture-notes.pdf, page 324, 2021.
  • Durmus and Moulines [2016] Alain Durmus and Eric Moulines. Sampling from strongly log-concave distributions with the Unadjusted Langevin Algorithm. arXiv preprint arXiv:1605.01559, 2016.
  • Durmus and Moulines [2017] Alain Durmus and Eric Moulines. Nonasymptotic convergence analysis for the Unadjusted Langevin Algorithm. The Annals of Applied Probability, 27(3):1551–1587, 2017.
  • Durmus and Moulines [2019] Alain Durmus and Eric Moulines. High-dimensional Bayesian inference via the Unadjusted Langevin Algorithm. Bernoulli, 25:2854–2882, 2019.
  • Durmus et al. [2018] Alain Durmus, Eric Moulines, and Marcelo Pereyra. Efficient Bayesian computation by Proximal Markov Chain Monte Carlo: when Langevin meets Moreau. SIAM Journal on Imaging Sciences, 11(1):473–506, 2018.
  • Durmus et al. [2019] Alain Durmus, Szymon Majewski, and Błażej Miasojedow. Analysis of Langevin Monte Carlo via convex optimization. The Journal of Machine Learning Research, 20(1):2666–2711, 2019.
  • Dwivedi et al. [2018] Raaz Dwivedi, Yuansi Chen, Martin J Wainwright, and Bin Yu. Log-concave sampling: Metropolis-Hastings algorithms are fast! In Conference on Learning Theory, pages 793–797. PMLR, 2018.
  • Dwork and Roth [2014] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.
  • Dyer et al. [1991] Martin Dyer, Alan Frieze, and Ravi Kannan. A random polynomial-time algorithm for approximating the volume of convex bodies. Journal of the ACM, 38(1):1–17, 1991.
  • Eberle et al. [2019] Andreas Eberle, Arnaud Guillin, and Raphael Zimmer. Couplings and quantitative contraction rates for Langevin dynamics. The Annals of Probability, 47(4):1982–2010, 2019.
  • Erdogdu and Hosseinzadeh [2021] Murat A Erdogdu and Rasa Hosseinzadeh. On the convergence of Langevin Monte Carlo: The interplay between tail growth and smoothness. In Conference on Learning Theory, pages 1776–1822. PMLR, 2021.
  • Erdogdu et al. [2018] Murat A Erdogdu, Lester Mackey, and Ohad Shamir. Global non-convex optimization with discretized diffusions. Advances in Neural Information Processing Systems, 31, 2018.
  • Erdogdu et al. [2022] Murat A Erdogdu, Rasa Hosseinzadeh, and Shunshi Zhang. Convergence of Langevin Monte Carlo in chi-squared and Rényi divergence. In International Conference on Artificial Intelligence and Statistics, pages 8151–8175. PMLR, 2022.
  • Feldman et al. [2018] Vitaly Feldman, Ilya Mironov, Kunal Talwar, and Abhradeep Thakurta. Privacy amplification by iteration. In Symposium on Foundations of Computer Science, pages 521–532. IEEE, 2018.
  • Feldman et al. [2020] Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: optimal rates in linear time. In Symposium on the Theory of Computing, pages 439–449, 2020.
  • Ganesh and Talwar [2020] Arun Ganesh and Kunal Talwar. Faster differentially private samplers via Rényi divergence analysis of discretized Langevin MCMC. In Advances in Neural Information Processing Systems, volume 33, pages 7222–7233, 2020.
  • Gatmiry and Vempala [2022] Khashayar Gatmiry and Santosh S Vempala. Convergence of the Riemannian Langevin Algorithm. arXiv preprint arXiv:2204.10818, 2022.
  • Girolami and Calderhead [2011] Mark Girolami and Ben Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011.
  • Hoffman and Gelman [2014] Matthew D Hoffman and Andrew Gelman. The No-U-Turn Sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623, 2014.
  • Hsieh et al. [2018] Ya-Ping Hsieh, Ali Kavis, Paul Rolland, and Volkan Cevher. Mirrored Langevin dynamics. Advances in Neural Information Processing Systems, 31, 2018.
  • James [1980] Frederick James. Monte Carlo theory and practice. Reports on Progress in Physics, 43(9):1145, 1980.
  • Jerrum and Sinclair [1996] Mark Jerrum and Alistair Sinclair. The Markov chain Monte Carlo method: an approach to approximate counting and integration. In Approximation Algorithms for NP-hard Problems. PWS Publishing, 1996.
  • Jordan et al. [1998] Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the Fokker–Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1–17, 1998.
  • Kannan et al. [1995] Ravi Kannan, László Lovász, and Miklós Simonovits. Isoperimetric problems for convex bodies and a localization lemma. Discrete & Computational Geometry, 13(3):541–559, 1995.
  • Kook et al. [2022] Yunbum Kook, Yin Tat Lee, Ruoqi Shen, and Santosh S Vempala. Sampling with Riemannian Hamiltonian Monte Carlo in a constrained space. arXiv preprint arXiv:2202.01908, 2022.
  • Lee and Vempala [2018] Yin Tat Lee and Santosh S Vempala. Convergence rate of Riemannian Hamiltonian Monte Carlo and faster polytope volume computation. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1115–1121, 2018.
  • Lee et al. [2021] Yin Tat Lee, Ruoqi Shen, and Kevin Tian. Lower bounds on Metropolized sampling methods for well-conditioned distributions. Advances in Neural Information Processing Systems, 34:18812–18824, 2021.
  • Lehec [2021] Joseph Lehec. The Langevin Monte Carlo algorithm in the non-smooth log-concave case. arXiv preprint arXiv:2101.10695, 2021.
  • Levin and Peres [2017] David A Levin and Yuval Peres. Markov chains and mixing times, volume 107. American Mathematical Society, 2017.
  • Li and Erdogdu [2020] Mufan Bill Li and Murat A Erdogdu. Riemannian Langevin algorithm for solving semidefinite programs. arXiv preprint arXiv:2010.11176, 2020.
  • Liang and Chen [2021] Jiaming Liang and Yongxin Chen. A proximal algorithm for sampling from non-smooth potentials. arXiv preprint arXiv:2110.04597, 2021.
  • Liu [2001] Jun S Liu. Monte Carlo strategies in scientific computing, volume 10. Springer, 2001.
  • Lovász [1999] László Lovász. Hit-and-run mixes fast. Mathematical Programming, 86(3):443–461, 1999.
  • Lovász and Simonovits [1990] László Lovász and Miklós Simonovits. The mixing rate of Markov chains, an isoperimetric inequality, and computing the volume. In Symposium on Foundations of Computer Science, pages 346–354. IEEE, 1990.
  • Lovász and Simonovits [1993] László Lovász and Miklós Simonovits. Random walks in a convex body and an improved volume algorithm. Random Structures & Algorithms, 4(4):359–412, 1993.
  • Lovász and Vempala [2006] László Lovász and Santosh Vempala. Hit-and-run from a corner. SIAM Journal on Computing, 35(4):985–1005, 2006.
  • Lovász and Vempala [2007] László Lovász and Santosh Vempala. The geometry of logconcave functions and sampling algorithms. Random Structures & Algorithms, 30(3):307–358, 2007.
  • Ma et al. [2021] Yi-An Ma, Niladri S Chatterji, Xiang Cheng, Nicolas Flammarion, Peter L Bartlett, and Michael I Jordan. Is there an analog of Nesterov acceleration for gradient-based MCMC? Bernoulli, 27(3):1942–1992, 2021.
  • McSherry and Talwar [2007] Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In Symposium on Foundations of Computer Science, pages 94–103. IEEE, 2007.
  • Mengersen and Tweedie [1996] Kerrie L Mengersen and Richard L Tweedie. Rates of convergence of the Hastings and Metropolis algorithms. The Annals of Statistics, 24(1):101–121, 1996.
  • Mou et al. [2022] Wenlong Mou, Nicolas Flammarion, Martin J Wainwright, and Peter L Bartlett. Improved bounds for discretization of Langevin diffusions: Near-optimal rates without convexity. Bernoulli, 28(3):1577–1601, 2022.
  • Neal [2011] Radford M Neal. MCMC using Hamiltonian dynamics. In Handbook of Markov Chain Monte Carlo, pages 113–162. Chapman & Hall/CRC, 2011.
  • Nesterov [2003] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2003.
  • Nguyen et al. [2021] Dao Nguyen, Xin Dang, and Yixin Chen. Unadjusted Langevin algorithm for non-convex weakly smooth potentials. arXiv preprint arXiv:2101.06369, 2021.
  • Parisi [1981] Giorgio Parisi. Correlation functions and computer simulations. Nuclear Physics B, 180(3):378–384, 1981.
  • Raginsky et al. [2017] Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In Conference on Learning Theory, pages 1674–1703. PMLR, 2017.
  • Robert and Casella [1999] Christian P Robert and George Casella. Monte Carlo statistical methods, volume 2. Springer, 1999.
  • Roberts and Rosenthal [2004] Gareth O Roberts and Jeffrey S Rosenthal. General state space Markov chains and MCMC algorithms. Probability Surveys, 1:20–71, 2004.
  • Roberts and Tweedie [1996a] Gareth O Roberts and Richard L Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363, 1996a.
  • Roberts and Tweedie [1996b] Gareth O Roberts and Richard L Tweedie. Geometric convergence and central limit theorems for multidimensional Hastings and Metropolis algorithms. Biometrika, 83(1):95–110, 1996b.
  • Ryffel et al. [2022] Theo Ryffel, Francis Bach, and David Pointcheval. Differential privacy guarantees for stochastic gradient Langevin dynamics. arXiv preprint arXiv:2201.11980, 2022.
  • Salim and Richtárik [2020] Adil Salim and Peter Richtárik. Primal dual interpretation of the proximal stochastic gradient Langevin algorithm. Advances in Neural Information Processing Systems, 33:3786–3796, 2020.
  • Sason and Verdú [2016] Igal Sason and Sergio Verdú. f-divergence inequalities. IEEE Transactions on Information Theory, 62(11):5973–6006, 2016.
  • Schneider [2014] Rolf Schneider. Convex bodies: the Brunn–Minkowski theory. Number 151. Cambridge University Press, 2014.
  • Shen and Lee [2019] Ruoqi Shen and Yin Tat Lee. The randomized midpoint method for log-concave sampling. Advances in Neural Information Processing Systems, 32, 2019.
  • Sordello et al. [2021] Matteo Sordello, Zhiqi Bu, and Jinshuo Dong. Privacy amplification via iteration for shuffled and online PNSGD. arXiv preprint arXiv:2106.11767, 2021.
  • Van Erven and Harremos [2014] Tim Van Erven and Peter Harremos. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820, 2014.
  • Vempala [2005] Santosh Vempala. Geometric random walks: a survey. Combinatorial and Computational Geometry, 52:573–612, 2005.
  • Vempala and Wibisono [2019] Santosh Vempala and Andre Wibisono. Rapid convergence of the unadjusted Langevin algorithm: Isoperimetry suffices. In Advances in Neural Information Processing Systems 32, pages 8092–8104. 2019.
  • Wibisono [2018] Andre Wibisono. Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem. In Conference on Learning Theory, pages 2093–3027. PMLR, 2018.
  • Williams [1991] David Williams. Probability with martingales. Cambridge University Press, 1991.
  • Wu et al. [2021] Keru Wu, Scott Schmidler, and Yuansi Chen. Minimax mixing time of the Metropolis-Adjusted Langevin Algorithm for log-concave sampling. arXiv preprint arXiv:2109.13055, 2021.
  • Ye and Shokri [2022] Jiayuan Ye and Reza Shokri. Differentially private learning needs hidden state (or much faster convergence). arXiv preprint arXiv:2203.05363, 2022.
  • Zhang et al. [2020] Kelvin Shuangjian Zhang, Gabriel Peyré, Jalal Fadili, and Marcelo Pereyra. Wasserstein control of mirror Langevin Monte Carlo. In Conference on Learning Theory, pages 3814–3841. PMLR, 2020.