
Concentration of the Langevin Algorithm’s Stationary Distribution

Jason M. Altschuler*
UPenn
[email protected]
   Kunal Talwar
Apple
[email protected]

*JMA was partially supported by an NSF Graduate Research Fellowship, a TwoSigma PhD Fellowship, and an NYU Data Science Faculty Fellowship.
Abstract

A canonical algorithm for log-concave sampling is the Langevin Algorithm, aka the Langevin Diffusion run with some discretization stepsize $\eta>0$. This discretization leads the Langevin Algorithm to have a stationary distribution $\pi_{\eta}$ which differs from the stationary distribution $\pi$ of the Langevin Diffusion, and it is an important challenge to understand whether the well-known properties of $\pi$ extend to $\pi_{\eta}$. In particular, while concentration properties such as isoperimetry and rapidly decaying tails are classically known for $\pi$, the analogous properties for $\pi_{\eta}$ are open questions with algorithmic implications. This note provides a first step in this direction by establishing concentration results for $\pi_{\eta}$ that mirror classical results for $\pi$. Specifically, we show that for any nontrivial stepsize $\eta>0$, $\pi_{\eta}$ is sub-exponential (respectively, sub-Gaussian) when the potential is convex (respectively, strongly convex). Moreover, the concentration bounds we show are essentially tight. We also show that these concentration bounds extend to all iterates along the trajectory of the Langevin Algorithm, and to inexact implementations which use sub-Gaussian estimates of the gradient.

Key to our analysis is the use of a rotation-invariant moment generating function (aka Bessel function) to study the stationary dynamics of the Langevin Algorithm. This technique may be of independent interest because it enables directly analyzing the discrete-time stationary distribution $\pi_{\eta}$ without going through the continuous-time stationary distribution $\pi$ as an intermediary.

1 Introduction

Sampling from a log-concave distribution $\pi\propto e^{-f}$ over $\mathbb{R}^{d}$ is a foundational problem with applications throughout statistics, engineering, and the sciences. A canonical and well-studied approach is the Langevin Algorithm, which is the discrete-time Markov process

$X_{t+1}=X_{t}-\eta\nabla f(X_{t})+Z_{t}\,,$   (1.1)

where $\eta>0$ is a stepsize parameter, and $Z_{t}\sim\mathcal{N}(0,2\eta I_{d})$ are independent Gaussians. Briefly, the intuition behind the Langevin Algorithm is that it discretizes the Langevin Diffusion, which is a continuous-time Markov process with stationary distribution $\pi$. More precisely, the Langevin Diffusion is the stochastic differential equation

$dX_{t}=-\nabla f(X_{t})\,dt+\sqrt{2}\,dB_{t}\,,$   (1.2)

where $B_{t}$ is a standard Brownian motion on $\mathbb{R}^{d}$, and the Langevin Algorithm is identical except that it updates the gradient $\nabla f(X_{t})$ every $\eta$ units of time rather than continuously. We refer to the textbooks [7, 2, 16, 10, 13] for a detailed historical account of the Langevin Algorithm and the truly extensive literature surrounding it, which spans multiple decades and communities.
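To make the update rule (1.1) concrete, here is a short numerical illustration (ours, not part of the paper) that runs many independent chains of the Langevin Algorithm on the strongly convex quadratic potential $f(x)=\frac{1}{2}\|x\|^{2}$, for which $\nabla f(x)=x$ and the stationary distribution $\pi_{\eta}$ is an explicit Gaussian:

```python
import numpy as np

# A minimal sketch of the Langevin Algorithm (1.1), run as many independent
# chains in parallel.  Each row of x is one chain; each step is a gradient
# descent update plus an independent Gaussian noise Z_t ~ N(0, 2*eta*I_d).
def langevin_chains(grad_f, n_chains, d, eta, n_steps, rng):
    x = np.zeros((n_chains, d))
    for _ in range(n_steps):
        z = rng.normal(scale=np.sqrt(2.0 * eta), size=x.shape)
        x = x - eta * grad_f(x) + z
    return x

rng = np.random.default_rng(0)
eta, d = 0.1, 5
# Potential f(x) = ||x||^2 / 2 (so m = M = 1 and grad f(x) = x).  Here
# pi_eta is Gaussian with per-coordinate variance 2*eta/(1-(1-eta)^2),
# which simplifies to 2/(2-eta); see the tightness example in Section 4.2.
samples = langevin_chains(lambda x: x, 4000, d, eta, 500, rng)
```

After 500 steps the chains are essentially stationary (each step contracts toward the minimizer by the factor $1-\eta$), and the empirical per-coordinate variance is close to $2/(2-\eta)\approx 1.05$.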

Although the discretization enables simulating the Langevin Diffusion algorithmically, it introduces an asymptotic bias. Specifically, for any discretization stepsize $\eta>0$, the stationary distribution $\pi_{\eta}$ of the Langevin Algorithm differs from the stationary distribution $\pi$ of the Langevin Diffusion. An important question is:

How similar are $\pi_{\eta}$ and $\pi$?

Concentration?

The motivation of this note is that this question is wide open for concentration properties. Indeed, while concentration properties for $\pi$ have been classically known for several decades, the analogous properties for $\pi_{\eta}$ have remained unknown. Perhaps the most notable example of this is isoperimetry: while classical results imply that $\pi$ satisfies Poincaré/log-Sobolev isoperimetric inequalities when the potential $f$ is convex/strongly convex [4, 6, 11], the analogous properties for $\pi_{\eta}$ are unknown. The influential paper [19] stated this as an open question and proved rapid mixing of the Langevin Algorithm in Rényi divergence under this conjectural assumption of isoperimetry of $\pi_{\eta}$. Another example is rapidly decaying tails: while classical results imply that $\pi$ has sub-exponential/sub-Gaussian tails when the potential $f$ is convex/strongly convex [12, 3], the analogous properties for $\pi_{\eta}$ were previously unknown. Such tail decay results can be exploited to extend fast-mixing results for the Langevin Algorithm from constrained settings to unconstrained settings [1].

Contributions.

The purpose of this note is to provide a first step in this direction. Specifically, in §4 we show that the stationary distribution $\pi_{\eta}$ of the Langevin Algorithm is sub-Gaussian when the potential $f$ is strongly convex; and in §5 we show that $\pi_{\eta}$ is sub-exponential when $f$ is convex. These results hold for any nontrivial discretization stepsize $\eta>0$ and mirror the aforementioned classical results for $\pi$. (Our results apply for $\eta=O(1/M)$. This is the relevant regime in both theory and practice, since the Langevin Algorithm is transient if $\eta>2/M$; in fact, $\eta$ is typically polynomially smaller than $O(1/M)$ to ensure small bias [7].) Moreover, the concentration bounds we show are essentially tight.

In §6, we mention two extensions of these results: the concentration bounds extend to all iterates $X_{t}$ along the trajectory of the Langevin Algorithm, and to inexact implementations of the Langevin Algorithm which use sub-Gaussian estimates of the gradient.

Techniques.

A central obstacle for proving this result is that it requires establishing properties of $\pi_{\eta}$ for any discretization stepsize $\eta>0$, not just $\eta$ arbitrarily small. Since the difference between $\pi$ and $\pi_{\eta}$ grows with $\eta$, we approach this problem by directly analyzing $\pi_{\eta}$, rather than analyzing properties of $\pi$ and then trying to conclude properties of $\pi_{\eta}$ via approximation bounds between $\pi$ and $\pi_{\eta}$. However, directly analyzing $\pi_{\eta}$ comes with the difficulty that many standard analysis techniques for the continuous-time Langevin Diffusion become much more complicated after discretization. To overcome this, our analysis makes use of a rotation-invariant moment generating function (aka Bessel function) to directly study the dynamics of the discrete-time Langevin Algorithm at stationarity. Briefly, the key point is that this Lyapunov function readily tracks the effect of gradient descent updates and Gaussian noise convolutions (the two components of the update equation (1.1) for the Langevin Algorithm), thereby enabling a simple and direct analysis of this discrete-time process. This technique may be of independent interest and is developed in §3.

Independent work.

After completing an initial draft of this manuscript, we found out that a recent revision of [19] (arXiv v4) has a new result (Theorem 8) which shows that in the strongly convex setting, $\pi_{\eta}$ satisfies LSI and thus is sub-Gaussian with concentration parameters that exactly match our Theorem 4.1. The proof techniques are completely different, and the final results are incomparable in their level of generality. On one hand, their result subsumes Theorem 4.1 in that it shows LSI, which implies sub-Gaussianity. On the other hand, our techniques are more general in that they readily extend to the (non-strongly) convex setting, while the techniques of [19] only apply to the strongly convex setting (it is stated as an open question in their paper whether one can obtain the analogous results for this more general setting). We are grateful to Andre Wibisono for making us aware of this recent revision and for discussing the differences with us.

2 Preliminaries

Notation.

We write $\mathcal{S}^{d-1}$ to denote the unit sphere in $\mathbb{R}^{d}$, and we abuse notation slightly by writing $v\sim\mathcal{S}^{d-1}$ to indicate that $v$ is a random vector drawn uniformly from $\mathcal{S}^{d-1}$. Throughout, $\|\cdot\|$ denotes the Euclidean norm. Recall that a differentiable function $f:\mathbb{R}^{d}\to\mathbb{R}$ is said to be $M$-smooth if $\nabla f$ is $M$-Lipschitz; and is said to be $m$-strongly convex if $f(y)\geqslant f(x)+\langle\nabla f(x),y-x\rangle+\frac{m}{2}\|x-y\|^{2}$ for all $x,y$. All logarithms are natural. All other notation is introduced in the main text.

Integrability assumption.

Throughout, we assume that the convex potential $f:\mathbb{R}^{d}\to\mathbb{R}$ satisfies $\int e^{-f(x)}dx<\infty$ since this suffices to ensure that the stationary distribution of the Langevin Diffusion exists and is equal to $\pi\propto e^{-f}$ [17, 5]. This assumption is automatically satisfied in the strongly convex setting, and is a mild assumption in the convex setting (e.g., it holds if there is a minimizer $x^{*}$ of $f$ and a large enough radius $R$ such that $f(x)>f(x^{*})$ at all points $x$ satisfying $\|x-x^{*}\|=R$).

3 Lyapunov function

We propose to use the following Lyapunov function in order to analyze $\pi_{\eta}$.

Definition 3.1 (Lyapunov function).

For any dimension $d\in\mathbb{N}$ and weight $\lambda>0$, let $\Phi_{d,\lambda}:\mathbb{R}^{d}\to\mathbb{R}$ denote the function

$\Phi_{d,\lambda}(x):=\mathbb{E}_{v\sim\mathcal{S}^{d-1}}\left[e^{\lambda\langle v,x\rangle}\right]\,.$   (3.1)

This definition extends to random variables $X$ by taking the expectation $\Phi_{d,\lambda}(X)=\mathbb{E}[\Phi_{d,\lambda}(X)]$. The rest of this section discusses properties and interpretations of this Lyapunov function.
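As a quick numerical illustration (ours), the average in (3.1) can be estimated by Monte Carlo, and one can check two facts used repeatedly below: the function depends on $x$ only through $\|x\|$ (rotation invariance, formalized in §3.2), and in dimension $d=1$ the sphere is $\{\pm 1\}$ so $\Phi_{1,\lambda}(x)=\cosh(\lambda x)$ exactly (for $d=3$ the average likewise has the closed form $\sinh(z)/z$ with $z=\lambda\|x\|$):

```python
import numpy as np

def phi_mc(x, lam, n=200_000, seed=0):
    """Monte Carlo estimate of Phi_{d,lam}(x) = E_{v ~ S^{d-1}} exp(lam*<v,x>)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    g = rng.normal(size=(n, x.size))
    v = g / np.linalg.norm(g, axis=1, keepdims=True)  # uniform on the unit sphere
    return float(np.mean(np.exp(lam * (v @ x))))

# Rotation invariance: two points with the same norm give the same value
# (up to Monte Carlo error).
a = phi_mc([2.0, 0.0, 0.0], lam=1.0)
b = phi_mc([0.0, np.sqrt(2.0), np.sqrt(2.0)], lam=1.0, seed=1)
```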

3.1 Relation to rotation-invariant MGF

This Lyapunov function has a natural interpretation in terms of a rotationally invariant version of the standard moment generating function (MGF). Specifically, rather than bounding the standard MGF of a distribution $\mu$ (which we recall is $\mathbb{E}_{X\sim\mu}e^{\lambda\langle v,X\rangle}$, as a function of a fixed vector $v\in\mathcal{S}^{d-1}$), we will bound the average $\mathbb{E}_{X\sim\mu}\Phi_{d,\lambda}(X)=\mathbb{E}_{v\sim\mathcal{S}^{d-1}}\mathbb{E}_{X\sim\mu}e^{\lambda\langle v,X\rangle}$. The latter is precisely the average of the standard MGF over all vectors $v\in\mathcal{S}^{d-1}$, hence why we refer to it as the "rotation-invariant MGF".

How do bounds on this rotation-invariant MGF imply concentration inequalities? Since properties like sub-Gaussianity and sub-exponentiality of a distribution are by definition equivalent to MGF bounds which hold uniformly over all vectors $v\in\mathcal{S}^{d-1}$ [15], these properties of course imply the same bounds for the rotation-invariant MGF (simply average over $v\sim\mathcal{S}^{d-1}$). The converse is in general lossy, in that a bound on the rotation-invariant MGF does not imply the same bound on the standard MGF uniformly over all vectors $v\in\mathcal{S}^{d-1}$. (It is possible to extract a weaker bound for the standard MGF, but this loses dimension-dependent factors, which leads to loose bounds; in contrast, our analysis is tight up to a small constant factor.) But this conversion is of course lossless for rotationally invariant distributions $\mu$, and this suffices for the purposes of this note.

Let us elaborate: in order to prove concentration inequalities for $\|X\|$ where $X$ is drawn from the (not rotationally symmetric) distribution $\pi_{\eta}$, it suffices to prove the same concentration inequalities for $\|\tilde{X}\|$, where $\tilde{X}$ is a randomly rotated version of $X$, namely $\tilde{X}=RX$ where $R$ is a random rotation. Since our analysis in the sequel lets us bound the rotation-invariant MGF for the law of $\tilde{X}$, this immediately gives concentration inequalities for $\|\tilde{X}\|$ and thus also for $\|X\|$, as desired.

Briefly, the upshot of using this rotation-invariant MGF is that, unlike the standard MGF, it is a function of the norm (see §3.2), and as we describe in the sequel, this enables exploiting contractivity of the gradient descent step in the Langevin Algorithm update. We also mention that since our goal is to prove concentration inequalities on the norm, of course no information is lost by using the rotation-invariant MGF rather than the standard MGF, since the only difference is averaging over directions.

3.2 Explicit expression via Bessel functions

Of course, by rotational invariance, this Lyapunov function $\Phi_{d,\lambda}(x)$ depends on the point $x$ only through its norm $\|x\|$. That is,

$\Phi_{d,\lambda}(x)=\phi_{d}(\lambda\|x\|)$   (3.2)

where $\phi_{d}:\mathbb{R}_{\geqslant 0}\to\mathbb{R}_{\geqslant 0}$ is the univariate function given by

$\phi_{d}(z)=\mathbb{E}_{v\sim\mathcal{S}^{d-1}}\left[e^{z\langle v,e_{1}\rangle}\right]\,.$   (3.3)

The following lemma provides an elegant closed-form expression for this univariate function $\phi_{d}$ in terms of modified Bessel functions $I_{n}(\cdot)$ of the first kind and the Gamma function $\Gamma(\cdot)$.

Lemma 3.2 (Explicit formula for Lyapunov function).

For any dimension $d\geqslant 2$ and argument $z>0$,

$\phi_{d}(z)=\Gamma(\alpha+1)\cdot\left(\frac{2}{z}\right)^{\alpha}\cdot I_{\alpha}(z)\,,$

where $\alpha:=(d-2)/2$. (In dimension $d=1$, we simply have $\phi_{d}(z)=\cosh(z)$.)

Proof.

It is a classical fact (see, e.g., [18]) that if $v$ is drawn uniformly at random from the unit sphere $\mathcal{S}^{d-1}$, then the induced law $\nu_{d-1}$ of $v_{1}$ has density

$\frac{d\nu_{d-1}(t)}{dt}=\frac{\Gamma(\alpha+1)}{\sqrt{\pi}\,\Gamma(\alpha+1/2)}(1-t^{2})^{\alpha-1/2}\mathds{1}_{t\in(-1,1)}\,.$

Therefore, writing $z:=\lambda\|x\|$ and using rotational invariance to align $x$ with $e_{1}$,

$\Phi_{d,\lambda}(x)=\mathbb{E}_{v\sim\mathcal{S}^{d-1}}\left[e^{\lambda\langle v,x\rangle}\right]=\mathbb{E}_{v\sim\mathcal{S}^{d-1}}\left[e^{zv_{1}}\right]=\frac{\Gamma(\alpha+1)}{\sqrt{\pi}\,\Gamma(\alpha+1/2)}\int_{-1}^{1}e^{zt}(1-t^{2})^{\alpha-1/2}\,dt\,.$

We conclude by using the Poisson integral representation

$I_{\alpha}(z)=\frac{(z/2)^{\alpha}}{\sqrt{\pi}\,\Gamma(\alpha+1/2)}\int_{-1}^{1}e^{zt}(1-t^{2})^{\alpha-1/2}\,dt$

of the Bessel functions; see, e.g., [8, equation (10.32.2)]. ∎
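As a sanity check of this formula (our own illustration, not part of the paper), in dimension $d=2$ we have $\alpha=0$ and the closed form reduces to $\phi_{2}(z)=I_{0}(z)$; the defining average over the unit circle can be computed by quadrature and compared against NumPy's built-in $I_{0}$:

```python
import numpy as np

def phi2(z, n=20_000):
    """phi_2(z) = E_{v ~ S^1} exp(z <v, e_1>), computed by quadrature over
    the angle parametrization v = (cos t, sin t) of the unit circle."""
    theta = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    return float(np.mean(np.exp(z * np.cos(theta))))

for z in (0.5, 2.0, 7.0):
    # Lemma 3.2 with alpha = 0: phi_2(z) = Gamma(1) * (2/z)^0 * I_0(z) = I_0(z).
    assert np.isclose(phi2(z), np.i0(z), rtol=1e-8)
```

(The trapezoid rule on a smooth periodic integrand converges extremely fast, so the agreement here is essentially to machine precision.)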

3.3 Properties of the Lyapunov function

The key reason we use the Lyapunov function $\Phi$ to analyze the dynamics of the Langevin Algorithm is that the Langevin Algorithm's update decomposes into a gradient descent step and a noise convolution step, and $\Phi$ enables tracking the effect of both steps in an elegant manner.

The following lemma describes how $\Phi$ is affected by the noise convolution. While a similar property holds for the standard MGF, the advantage of the rotation-invariant MGF over the standard MGF is that it enables exploiting contractivity of the gradient descent step. However, the effect of the gradient descent step is slightly different depending on whether the potential is just convex or also strongly convex, and thus is treated separately in the corresponding sections below.

Lemma 3.3 (Behavior of $\Phi$ under Gaussian convolution).

For any dimension $d\in\mathbb{N}$, point $x\in\mathbb{R}^{d}$, weight $\lambda>0$, and noise variance $\sigma^{2}$,

$\mathbb{E}_{Z\sim\mathcal{N}(0,\sigma^{2}I_{d})}\left[\Phi_{d,\lambda}(x+Z)\right]=e^{\lambda^{2}\sigma^{2}/2}\cdot\Phi_{d,\lambda}(x)\,.$
Proof.

We compute:

$\mathbb{E}_{Z\sim\mathcal{N}(0,\sigma^{2}I_{d})}[\Phi_{d,\lambda}(x+Z)]=\mathbb{E}_{Z\sim\mathcal{N}(0,\sigma^{2}I_{d})}\mathbb{E}_{v\sim\mathcal{S}^{d-1}}\left[e^{\lambda\langle v,x+Z\rangle}\right]$
$=\mathbb{E}_{v\sim\mathcal{S}^{d-1}}\left[e^{\lambda\langle v,x\rangle}\,\mathbb{E}_{Z\sim\mathcal{N}(0,\sigma^{2}I_{d})}[e^{\lambda\langle v,Z\rangle}]\right]$
$=\mathbb{E}_{v\sim\mathcal{S}^{d-1}}\left[e^{\lambda\langle v,x\rangle}\,e^{\lambda^{2}\sigma^{2}\|v\|^{2}/2}\right]$
$=e^{\lambda^{2}\sigma^{2}/2}\cdot\Phi_{d,\lambda}(x)\,.$

Above, the first and last steps are by definition of $\Phi$; the second step is by Fubini's Theorem; and the third step is by the well-known formula for the moment generating function of the multivariate Gaussian distribution (see, e.g., [9, §5.8]). ∎
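Lemma 3.3 is also easy to verify numerically. The following check (ours) uses dimension $d=1$, where $\Phi_{1,\lambda}(x)=\cosh(\lambda x)$, and computes the Gaussian expectation by Gauss-Hermite quadrature:

```python
import numpy as np

def gaussian_expectation(g, sigma, n=60):
    """E_{Z ~ N(0, sigma^2)}[g(Z)] via Gauss-Hermite quadrature: substituting
    z = sqrt(2)*sigma*t turns the Gaussian integral into the e^{-t^2} weight."""
    t, w = np.polynomial.hermite.hermgauss(n)
    return float((w * g(np.sqrt(2.0) * sigma * t)).sum() / np.sqrt(np.pi))

# Check Lemma 3.3 in d = 1, where Phi_{1,lam}(x) = cosh(lam * x).
x, lam, sigma = 1.3, 0.7, 0.9
lhs = gaussian_expectation(lambda z: np.cosh(lam * (x + z)), sigma)
rhs = np.exp(lam**2 * sigma**2 / 2.0) * np.cosh(lam * x)
assert np.isclose(lhs, rhs, rtol=1e-10)
```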

The next lemma collects various basic properties of $\Phi$ that we use in our analysis in the sequel. We remark that in the setting of strongly convex potentials, our analysis only makes use of the positivity and monotonicity in item (i) of this lemma; the setting of convex potentials has a more involved analysis and requires the eventual exponential growth in item (ii) of this lemma.

Lemma 3.4 (Properties of rotation-invariant MGF).

For any dimension $d\in\mathbb{N}$:

  • (i) The function $\phi_{d}$ is positive and increasing on $\mathbb{R}_{\geqslant 0}$.

  • (ii) There exists some $r_{0}:=r_{0}(d)$ such that

    $\phi_{d}(r+\Delta)\geqslant e^{\Delta/2}\phi_{d}(r)$

    for all $r\geqslant r_{0}$ and $\Delta>0$.

Proof.

Item (i) is immediate because $\phi_{d}$ is by definition an expectation over functions which are each positive and increasing. For item (ii), observe that Lemma 3.2 expresses $\phi_{d}(z)$ as $\Gamma(\alpha+1)(2/z)^{\alpha}=\Theta(z^{-\alpha})$, which grows inverse polynomially in $z$, times the Bessel function $I_{\alpha}(z)$, which grows exponentially for large arguments $z\gg\sqrt{\alpha}=\Theta(\sqrt{d})$ due to the Hankel expansion $I_{\alpha}(z)\sim e^{z}(2\pi z)^{-1/2}\left(1-\frac{4\alpha^{2}-1}{8z}+\frac{(4\alpha^{2}-1)(4\alpha^{2}-9)}{2!(8z)^{2}}-\dots\right)$ [8, equation (10.40.1)]. The desired exponential growth immediately follows. (The "fudge factor" of $1/2$ in the exponent is used to simply and crudely bound the lower-order polynomial terms, and is irrelevant for our purposes in the sequel.) ∎

Although unnecessary for the purposes of this paper, we remark in passing that non-asymptotic bounds on the exponential growth in item (ii) of the above lemma can be computed in several ways. One way is to use more precise non-asymptotic bounds on Bessel functions since, as mentioned above, $\phi_{d}(z)$ is equal to $I_{\alpha}(z)$ modulo a lower-order polynomial term. Another way is to show that the logarithmic derivative of $\phi_{d}$ is eventually lower bounded by a constant; we cannot help but mention that an elegant identity for this purpose is that this logarithmic derivative simplifies to $\frac{d}{dz}\log\phi_{d}(z)=I_{\alpha+1}(z)/I_{\alpha}(z)$, which is the ratio of Bessel functions of different orders.
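To illustrate the last remark numerically (our own check, in dimension $d=2$ where $\alpha=0$ and the ratio is $I_{1}(z)/I_{0}(z)$), one can compute $\phi_{2}$ and its derivative by quadrature over the unit circle and watch the logarithmic derivative rise toward $1$; in particular it eventually exceeds the constant $1/2$ appearing in item (ii) of Lemma 3.4:

```python
import numpy as np

theta = np.linspace(0.0, 2.0 * np.pi, 20_000, endpoint=False)

def log_deriv_phi2(z):
    """(d/dz) log phi_2(z) by quadrature: phi_2(z) is the circle average of
    e^{z cos t}, so its derivative averages cos(t) * e^{z cos t}."""
    weights = np.exp(z * np.cos(theta))
    return float((np.cos(theta) * weights).mean() / weights.mean())

# The ratio I_1(z)/I_0(z) increases toward 1 as z grows (e.g. it is
# roughly 0.89 at z = 5), so it is eventually bounded below by 1/2.
ratios = [log_deriv_phi2(z) for z in (1.0, 2.0, 5.0, 20.0)]
```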

4 Sub-Gaussian concentration for strongly convex potentials

In this section we show that if the potential $f$ is strongly convex and smooth, then the stationary distribution $\pi_{\eta}$ of the Langevin Algorithm has sub-Gaussian tails. This parallels the classically known fact that in this strongly convex setting, the unbiased stationary distribution $\pi\propto e^{-f}$ of the continuous Langevin Diffusion is sub-Gaussian [12].

Theorem 4.1 (Sub-Gaussianity of $\pi_{\eta}$ for strongly convex potentials).

Suppose $f:\mathbb{R}^{d}\to\mathbb{R}$ is $m$-strongly convex and $M$-smooth for some $0<m\leqslant M<\infty$. Consider running the Langevin Algorithm with stepsize $\eta<2/M$. Then its stationary distribution $\pi_{\eta}$ satisfies

$\mathbb{P}_{X\sim\pi_{\eta}}\left[\|X-x^{*}\|\geqslant 4\sqrt{\frac{\eta}{1-c}}\left(\sqrt{2d}+\sqrt{\log 1/\delta}\right)\right]\leqslant\delta\,,\quad\forall\delta\in(0,1)\,,$

where $x^{*}$ denotes the unique minimizer of $f$, and $c:=\max_{\rho\in\{m,M\}}|1-\eta\rho|$.

The operational interpretation of $c$ is that it is the contraction coefficient corresponding to a gradient descent step (see Lemma 4.2 below). Note that $c<1$ because $\eta<2/M$. In the typical setting of $\eta\leqslant 1/M$, it simplifies to $c=1-\eta m$, whereby Theorem 4.1 simplifies to

$\mathbb{P}_{X\sim\pi_{\eta}}\left[\|X-x^{*}\|\geqslant\frac{4}{\sqrt{m}}\left(\sqrt{2d}+\sqrt{\log 1/\delta}\right)\right]\leqslant\delta\,,\quad\forall\delta\in(0,1)\,.$
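As a numerical sanity check (ours, not from the paper), for an isotropic quadratic potential the stationary distribution $\pi_{\eta}$ is exactly Gaussian (see the tightness example in §4.2), so the simplified tail bound above can be tested by direct sampling:

```python
import numpy as np

# For f(x) = (m/2)||x||^2 (so m = M) with eta <= 1/M, one has c = 1 - eta*m
# and pi_eta = N(0, 2*eta/(1-c^2) I_d), so we can sample pi_eta exactly and
# compare the empirical tail frequency against the bound delta.
rng = np.random.default_rng(1)
m, eta, d, delta = 2.0, 0.1, 10, 0.01
c = 1.0 - eta * m
var = 2.0 * eta / (1.0 - c**2)          # per-coordinate stationary variance
X = rng.normal(scale=np.sqrt(var), size=(100_000, d))
threshold = (4.0 / np.sqrt(m)) * (np.sqrt(2.0 * d) + np.sqrt(np.log(1.0 / delta)))
tail_freq = float((np.linalg.norm(X, axis=1) >= threshold).mean())
# tail_freq is at most delta (indeed far smaller, since the bound is
# stated up to comfortable constants).
```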

The rest of the section is organized as follows. In §4.1 we prove Theorem 4.1 by showing that the rotationally symmetrized version of $\pi_{\eta}$ is sub-Gaussian with variance proxy $2\eta/(1-c)$, and in §4.2 we show that this bound is tight in all parameters (up to a constant factor of at most $2$).

4.1 Proof

The proof makes use of the following helper lemma. This fact is well-known in convex optimization because it immediately implies a tight bound on the convergence rate of gradient descent for strongly-convex, smooth objectives. A short proof can be found in, e.g., Appendix A.1 of [1].

Lemma 4.2 (Contractivity of gradient descent step).

Suppose $f$ is $m$-strongly convex and $M$-smooth over $\mathbb{R}^{d}$ for some $0\leqslant m\leqslant M<\infty$, and consider any stepsize $\eta\leqslant 2/M$. Then

$\|x-\eta\nabla f(x)-x^{*}\|\leqslant c\|x-x^{*}\|\,,\quad\forall x\in\mathbb{R}^{d}\,,$

where $x^{*}$ is any minimizer of $f$, and $c:=\max_{\rho\in\{m,M\}}|1-\eta\rho|\leqslant 1$.
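The contraction coefficient $c$ is easy to check numerically on a diagonal quadratic, where a gradient descent step scales each coordinate by $1-\eta\rho_{i}$ and $|1-\eta\rho|$ is maximized over $\rho\in[m,M]$ at the endpoints. A short check (ours):

```python
import numpy as np

# Numerical check of Lemma 4.2 on f(x) = (1/2) sum_i rho_i x_i^2 with
# curvatures rho_i in [m, M]; here grad f(x) = rho * x and x* = 0.
rng = np.random.default_rng(0)
m, M, eta = 0.5, 4.0, 0.3               # note eta <= 2/M = 0.5
rho = rng.uniform(m, M, size=20)
c = max(abs(1.0 - eta * m), abs(1.0 - eta * M))
for _ in range(100):
    x = rng.normal(size=20)
    gd = x - eta * (rho * x)            # one gradient descent step
    assert np.linalg.norm(gd) <= c * np.linalg.norm(x) + 1e-12
```

(The design point is that $\rho\mapsto|1-\eta\rho|$ is convex, so its maximum over $[m,M]$ is attained at an endpoint; here $c=\max(0.85,0.2)=0.85$.)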

Proof of Theorem 4.1.

To avoid cumbersome notation, let us assume that $f$ is centered to have minimizer $x^{*}=0$ (this is without loss of generality after a translation). We bound the change in the Lyapunov function $\Phi_{d,\lambda}(\cdot)$ from running an iteration of the Langevin Algorithm from some point $x\in\mathbb{R}^{d}$ to

$x^{\prime}=x-\eta\nabla f(x)+Z\,,$

where $Z\sim\mathcal{N}(0,2\eta I_{d})$. To analyze this, we disassociate the two parts of this update: the gradient descent step and the noise convolution.

First, we analyze the gradient descent step:

$\Phi_{d,\lambda}(x-\eta\nabla f(x))=\phi_{d}(\lambda\|x-\eta\nabla f(x)\|)\leqslant\phi_{d}(c\lambda\|x\|)\leqslant\big(\phi_{d}(\lambda\|x\|)\big)^{c}=\big(\Phi_{d,\lambda}(x)\big)^{c}\,.$

Above, the first and last steps are by definition of $\phi$ in terms of $\Phi$ (see (3.2)); the second step is by contractivity of a gradient descent iteration (Lemma 4.2) and monotonicity of $\phi_{d}$ (item (i) of Lemma 3.4); and the third step is by an application of Jensen's inequality $\mathbb{E}[Y^{c}]\leqslant\mathbb{E}[Y]^{c}$ where $Y:=e^{\lambda\langle v,x\rangle}$, which applies since the function $y\mapsto y^{c}$ is concave for $c\leqslant 1$.

Second, we use Lemma 3.3 to bound the change in the Lyapunov function from the Gaussian noise convolution:

$\Phi_{d,\lambda}(x^{\prime})=\Phi_{d,\lambda}(x-\eta\nabla f(x)+Z)=e^{\eta\lambda^{2}}\,\Phi_{d,\lambda}(x-\eta\nabla f(x))\,.$

By combining the above two displays, we conclude that

$\Phi_{d,\lambda}(x^{\prime})\leqslant e^{\eta\lambda^{2}}\big(\Phi_{d,\lambda}(x)\big)^{c}\,.$   (4.1)

Now take expectations on both sides, drawing $X\sim\pi_{\eta}$. Note that by stationarity, $X^{\prime}\sim\pi_{\eta}$. Thus

$\mathbb{E}_{X\sim\pi_{\eta}}\big[\Phi_{d,\lambda}(X)\big]\leqslant e^{\eta\lambda^{2}}\,\mathbb{E}_{X\sim\pi_{\eta}}\big[\Phi_{d,\lambda}(X)\big]^{c}\,,$

where above we have again used Jensen's inequality on the concave function $y\mapsto y^{c}$. Rearranging and simplifying yields

$\mathbb{E}_{X\sim\pi_{\eta}}\big[\Phi_{d,\lambda}(X)\big]\leqslant e^{\eta\lambda^{2}/(1-c)}\,.$   (4.2)

By definition of the Lyapunov function $\Phi_{d,\lambda}$, this implies that

$\mathbb{E}_{X\sim\pi_{\eta}}\mathbb{E}_{R\sim\mathcal{R}}\big[e^{\lambda\langle u,RX\rangle}\big]\leqslant e^{\eta\lambda^{2}/(1-c)}\,,\qquad\forall u\in\mathcal{S}^{d-1}\,,$   (4.3)

where $\mathcal{R}$ denotes the Haar measure over rotations of $\mathbb{R}^{d}$. Define $\tilde{\pi}_{\eta}$ to be the law of $RX$ (in words, $\tilde{\pi}_{\eta}$ is a rotationally symmetrized version of $\pi_{\eta}$). Then by definition of sub-Gaussianity, the above display establishes that $\tilde{\pi}_{\eta}$ is sub-Gaussian with variance proxy $2\eta/(1-c)$. By a standard concentration inequality on the norm of a sub-Gaussian vector (see, e.g., [15, Theorem 1.19]),

$\mathbb{P}_{X\sim\tilde{\pi}_{\eta}}\left[\|X\|\geqslant 4\sqrt{\frac{\eta}{1-c}}\left(\sqrt{2d}+\sqrt{\log 1/\delta}\right)\right]\leqslant\delta\,,\quad\forall\delta\in(0,1)\,.$

Since the Euclidean norm is invariant under rotations, this concentration inequality remains true if $\tilde{\pi}_{\eta}$ is replaced by $\pi_{\eta}$ (cf. the discussion in §3.1). This concludes the proof. ∎

4.2 Tightness

Here we observe that, modulo a constant factor of at most $2$, the sub-Gaussianity we proved is tight in all parameters. It is simplest to explain this tightness in terms of the sub-Gaussian parameters since the final concentration inequality in Theorem 4.1 was proved as a direct consequence of this. To this end, recall that in the proof of Theorem 4.1 above, we proved that the rotationally symmetrized version $\tilde{\pi}_{\eta}$ of $\pi_{\eta}$ is sub-Gaussian with variance proxy $2\eta/(1-c)$. Following is a simple construction which matches this upper bound. This calculation is similar to [19, Example 4].

Example 4.3 (Tightness of sub-Gaussianity parameters).

Consider running the Langevin Algorithm on the univariate quadratic potential $f(x)=\frac{\rho}{2}x^{2}$, where the curvature $\rho$ is chosen so as to maximize the contraction coefficient $c:=\max_{\rho\in\{m,M\}}|1-\eta\rho|$. Then, for any stepsize $\eta>0$, the Langevin Algorithm has stationary distribution

$\pi_{\eta}=\mathcal{N}\left(0,\frac{2\eta}{1-c^{2}}\right)$

when it is initialized to $X_{0}=0$. (This fact is proved in [1, §4.2] and follows by simply unraveling the definition of a Langevin Algorithm update, composing $t$ times, and taking $t\to\infty$.) Note that $\tilde{\pi}_{\eta}=\pi_{\eta}$ in this example since $\pi_{\eta}$ is already rotationally symmetric. Moreover, since $1-c^{2}=(1-c)(1+c)$ and since also $(1+c)\in(1,2)$, this construction matches our aforementioned upper bound on the sub-Gaussian parameter of $\tilde{\pi}_{\eta}$ up to a constant factor of at most $2$.
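The variance computation behind this example is a one-line recursion, which the following sketch (ours) iterates to its fixed point:

```python
# One Langevin step on f(x) = (rho/2) x^2 maps X -> (1 - eta*rho) X + Z with
# Z ~ N(0, 2*eta), so the variance v of X_t obeys v -> (1 - eta*rho)^2 v + 2*eta.
# Started from X_0 = 0 (i.e. v = 0), it converges to the fixed point 2*eta/(1 - c^2).
eta, rho = 0.05, 3.0
c = abs(1.0 - eta * rho)
v = 0.0
for _ in range(2000):
    v = c ** 2 * v + 2.0 * eta
stationary_var = 2.0 * eta / (1.0 - c ** 2)
```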

5 Sub-exponential concentration for convex potentials

In this section we show that if the potential $f$ is convex and smooth, then the stationary distribution $\pi_{\eta}$ of the Langevin Algorithm has sub-exponential tails. This parallels the classically known fact that in this convex setting, the unbiased stationary distribution $\pi\propto e^{-f}$ of the continuous Langevin Diffusion is sub-exponential (see Lemma 5.2 below).

Theorem 5.1 (Sub-exponentiality of $\pi_{\eta}$ for convex potentials).

Consider running the Langevin Algorithm on a convex, $M$-smooth potential $f:\mathbb{R}^{d}\to\mathbb{R}$ with stepsize $\eta\leqslant 1/M$. Then the stationary distribution $\pi_{\eta}$ of the Langevin Algorithm has sub-exponential concentration; i.e., there exist $A,C,R>0$ and a minimizer $x^{*}$ of $f$ such that

$\mathbb{P}_{X\sim\pi_{\eta}}\left[\lVert X-x^{*}\rVert\geqslant R+C\log(A/\delta)\right]\leqslant\delta\,,\quad\forall\delta\in(0,1)\,.$

Note that in this sub-exponential concentration inequality, the parameters $A$, $C$, and $R$ depend on $f$ and $\eta$ (this dependence is made explicit in the proof). A dependence on $f$ is inevitable since even the unbiased distribution $\pi$ has sub-exponential parameters which depend on $f$. As a simple concrete example, consider the potential $f(x)=d(x,B_{R})$ where $B_{R}$ is the Euclidean ball of radius $R$ centered at the origin, and $d(x,B_{R})$ is the Euclidean distance to the ball. The distribution $\pi$ then concentrates no better than the uniform distribution on the Euclidean ball of radius $R$, and a randomly drawn point $X$ from $B_{R}$ has norm $\Theta(R)$ with high probability. See also the tightness discussion in §5.2 for a fully worked-out example.


The rest of the section is organized as follows. In §5.1 we prove Theorem 5.1, and in §5.2 we discuss tightness of this bound.

5.1 Proof

We begin by recalling the classical fact from functional analysis that every log-concave distribution is sub-exponential. For a proof, see e.g., [3, Lemma 10.6.1].

Lemma 5.2 (Log-concavity implies sub-exponentiality).

If $\pi$ is a log-concave probability distribution, then there exist $a,b>0$ such that $\pi(x)\leqslant ae^{-b\|x\|}$ for all $x\in\mathbb{R}^{d}$.

To apply this lemma, it is convenient to first assume without loss of generality that $f$ is translated and shifted so that $0$ is a minimizer of $f$ with value $f(0)=0$. Then $f(x)=\log(\pi(0)/\pi(x))$. By Lemma 5.2, we conclude that

$f(x)\geqslant-\alpha+\beta\|x\|\,,\quad\forall x\in\mathbb{R}^{d}\,,$   (5.1)

for some $\alpha,\beta\in\mathbb{R}$. (Here, $\alpha$ plays the role of $\log a-\log\pi(0)$ and $\beta$ plays the role of $b$, where $a$ and $b$ are the quantities from Lemma 5.2.) We exploit this super-linear growth (5.1) of $f$ to show that a gradient descent update on $f$ makes significant progress towards the minimizer $x^{*}=0$ when the current iterate has sufficiently large norm.

Lemma 5.3 (Gradient descent updates make significant progress for super-linear objectives).

Suppose $f:\mathbb{R}^{d}\to\mathbb{R}$ is convex, $M$-smooth, achieves its minimum value at $f(0)=0$, and satisfies (5.1). Then there exists $r_{1}:=r_{1}(f)$ such that for any stepsize $\eta\leqslant 1/M$, it holds that

$\|x-\eta\nabla f(x)\|\leqslant\|x\|-\frac{\eta\beta}{4}\,,\quad\forall\|x\|\geqslant r_{1}\,.$
Proof.

We bound

$\|x-\eta\nabla f(x)\|=\sqrt{\|x-\eta\nabla f(x)\|^{2}}$
$=\sqrt{\|x\|^{2}-2\eta\langle\nabla f(x),x\rangle+\eta^{2}\|\nabla f(x)\|^{2}}$
$\leqslant\sqrt{\|x\|^{2}-2\eta\langle\nabla f(x),x\rangle+\eta^{2}M\langle\nabla f(x),x\rangle}$
$\leqslant\sqrt{\|x\|^{2}-\eta\langle\nabla f(x),x\rangle}$
$\leqslant\|x\|-\eta\langle\nabla f(x),x\rangle/(2\|x\|)$
$\leqslant\|x\|-\eta f(x)/(2\|x\|)$
$\leqslant\|x\|-\eta\beta/2+\eta\alpha/(2\|x\|)\,.$

Above, the second step expands the square. The third step uses the folklore fact from optimization (see, e.g., [14, Equation (2.1.8)]) that $M$-smoothness implies $\langle\nabla f(x)-\nabla f(y),x-y\rangle\geqslant\frac{1}{M}\|\nabla f(x)-\nabla f(y)\|^{2}$, which we use here for $y=0$ (note that $\nabla f(0)=0$ since $0$ is a minimizer). The fourth step uses the assumption that $\eta\leqslant 1/M$. The fifth step adds a non-negative quantity to complete the square. The penultimate step uses convexity and the assumption $f(0)=0$ to bound $\langle\nabla f(x),x\rangle\geqslant f(x)-f(0)=f(x)$. The final step uses the super-linear growth condition (5.1). We conclude that if $\|x\|\geqslant r_{1}:=2\alpha/\beta$, then $\|x-\eta\nabla f(x)\|\leqslant\|x\|-\eta\beta/4$. ∎
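The lemma is easy to test numerically. The following check (our illustration; the potential is our choice, not from the paper) uses $f(x)=\sum_{i}\log\cosh(x_{i})$, which is convex, $1$-smooth, minimized at $0$ with $f(0)=0$, and satisfies (5.1) with $\beta=1$ and $\alpha=d\log 2$ (since $\log\cosh(t)\geqslant|t|-\log 2$); the proof above gives $r_{1}=2\alpha/\beta$:

```python
import numpy as np

# Numerical check of Lemma 5.3 for f(x) = sum_i log(cosh(x_i)), for which
# grad f(x) = tanh(x) (componentwise), M = 1, beta = 1, alpha = d*log(2).
rng = np.random.default_rng(0)
d, eta, beta = 5, 1.0, 1.0              # eta <= 1/M with M = 1
alpha = d * np.log(2.0)
r1 = 2.0 * alpha / beta
for _ in range(200):
    x = rng.normal(size=d)
    x *= (r1 + 3.0 * rng.random()) / np.linalg.norm(x)   # rescale so ||x|| >= r1
    gd = x - eta * np.tanh(x)           # one gradient descent step
    assert np.linalg.norm(gd) <= np.linalg.norm(x) - eta * beta / 4.0
```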

Proof of Theorem 5.1.

As in the proof of Theorem 4.1, we bound the change in the Lyapunov function Φd,λ()\Phi_{d,\lambda}(\cdot) from running an iteration of the Langevin Algorithm from some point xdx\in\mathbb{R}^{d} to

x=xηf(x)+Z,x^{\prime}=x-\eta\nabla f(x)+Z\,,

where Z\sim\mathcal{N}(0,2\eta I_{d}). To analyze this, we treat the two parts of the update separately: the gradient descent step and the Gaussian noise convolution.

First, we analyze the gradient descent step. We consider two cases depending on whether the norm x\|x\| of the current iterate is larger than R:=max(r0/λ+ηβ/4,r1)R:=\max(r_{0}/\lambda+\eta\beta/4,r_{1}), where r0r_{0} is the quantity from Lemma 3.4, and r1r_{1} is the quantity from Lemma 5.3. (We will set λ=β/16\lambda=\beta/16.)

  • Case 1: xR\|x\|\leqslant R. Then by the fact that a gradient descent step is non-expansive (Lemma 4.2 with m=0m=0), we know that xηf(x)xR\|x-\eta\nabla f(x)\|\leqslant\|x\|\leqslant R. Thus

    \displaystyle\Phi_{d,\lambda}(x-\eta\nabla f(x))=\phi_{d}(\lambda\|x-\eta\nabla f(x)\|)\leqslant\phi_{d}(\lambda R)\,,

    where above the first step is by definition of ϕ\phi in terms of Φ\Phi (see (3.2)), and the second step is by monotonicity of ϕd\phi_{d} (item (i) of Lemma 3.4).

  • Case 2: x>R\|x\|>R. Then xr1\|x\|\geqslant r_{1}, so xηf(x)xηβ/4\|x-\eta\nabla f(x)\|\leqslant\|x\|-\eta\beta/4 by Lemma 5.3. Thus

    Φd,λ(xηf(x))=ϕd(λxηf(x))ϕd(λ(xηβ/4))eηβλ/8ϕd(λx)=eηβλ/8Φd,λ(x),\displaystyle\Phi_{d,\lambda}(x-\eta\nabla f(x))=\phi_{d}(\lambda\|x-\eta\nabla f(x)\|)\leqslant\phi_{d}(\lambda(\|x\|-\eta\beta/4))\leqslant e^{-\eta\beta\lambda/8}\phi_{d}(\lambda\|x\|)=e^{-\eta\beta\lambda/8}\Phi_{d,\lambda}(x)\,,

    where above the first and last steps are by definition of ϕ\phi in terms of Φ\Phi (see (3.2)), the second step is by monotonicity of ϕd\phi_{d} (item (i) of Lemma 3.4), and the third step is due to the super-exponential growth of ϕd\phi_{d} above λ(Rηβ/4)r0\lambda(R-\eta\beta/4)\geqslant r_{0} (item (ii) of Lemma 3.4).

Now, since ϕ\phi is non-negative (item (i) of Lemma 3.4), we can crudely combine these two cases to conclude that

Φd,λ(xηf(x))ϕd(λR)+eηβλ/8Φd,λ(x).\displaystyle\Phi_{d,\lambda}(x-\eta\nabla f(x))\leqslant\phi_{d}(\lambda R)+e^{-\eta\beta\lambda/8}\Phi_{d,\lambda}(x)\,.

Now use Lemma 3.3 to bound the change in the Lyapunov function from the Gaussian noise convolution:

\displaystyle\mathbb{E}_{Z}\,\Phi_{d,\lambda}(x^{\prime})=\mathbb{E}_{Z}\,\Phi_{d,\lambda}(x-\eta\nabla f(x)+Z)=e^{\eta\lambda^{2}}\,\Phi_{d,\lambda}(x-\eta\nabla f(x))\,. (5.2)

By combining the above two displays, we conclude that

\mathbb{E}_{Z}\,\Phi_{d,\lambda}(x^{\prime})\leqslant e^{\eta\lambda^{2}}\left[\phi_{d}(\lambda R)+e^{-\eta\beta\lambda/8}\Phi_{d,\lambda}(x)\right]\,.

Now take expectations on both sides, drawing XπηX\sim\pi_{\eta}. Note that by stationarity, XπηX^{\prime}\sim\pi_{\eta}. Thus

𝔼Xπη[Φd,λ(X)]eηλ2(ϕd(λR)+eηβλ/8𝔼Xπη[Φd,λ(X)]).\mathbb{E}_{X\sim\pi_{\eta}}\big{[}\Phi_{d,\lambda}(X)\big{]}\leqslant e^{\eta\lambda^{2}}\left(\phi_{d}(\lambda R)+e^{-\eta\beta\lambda/8}\mathbb{E}_{X\sim\pi_{\eta}}\left[\Phi_{d,\lambda}(X)\right]\right)\,.

Re-arranging and simplifying yields

𝔼Xπη[Φd,λ(X)]eηλ21eηλ(λβ/8)ϕd(λR).\mathbb{E}_{X\sim\pi_{\eta}}\big{[}\Phi_{d,\lambda}(X)\big{]}\leqslant\frac{e^{\eta\lambda^{2}}}{1-e^{\eta\lambda(\lambda-\beta/8)}}\phi_{d}(\lambda R)\,.

Setting λ=β/16\lambda=\beta/16 yields

𝔼Xπη[Φd,λ(X)]Aϕd(λR),\displaystyle\mathbb{E}_{X\sim\pi_{\eta}}\big{[}\Phi_{d,\lambda}(X)\big{]}\leqslant A\phi_{d}(\lambda R)\,, (5.3)

where for notational shorthand we have defined A:=eηβ2/256/(1eηβ2/256)>1A:=e^{\eta\beta^{2}/256}/(1-e^{-\eta\beta^{2}/256})>1.

From this, we argue a concentration inequality in a way analogous to Chernoff bounds, except using our rotationally-invariant version of the moment generating function. Specifically, bound

Xπη[XR+2/λlog(A/δ)]\displaystyle\mathbb{P}_{X\sim\pi_{\eta}}\left[\|X\|\geqslant R+2/\lambda\log(A/\delta)\right] =Xπη[ϕd(λX)ϕd(λR+2log(A/δ))]\displaystyle=\mathbb{P}_{X\sim\pi_{\eta}}\left[\phi_{d}(\lambda\|X\|)\geqslant\phi_{d}\left(\lambda R+2\log(A/\delta)\right)\right]
𝔼Xπη[ϕd(λX)]ϕd(λR+2log(A/δ))\displaystyle\leqslant\frac{\mathbb{E}_{X\sim\pi_{\eta}}\left[\phi_{d}(\lambda\|X\|)\right]}{\phi_{d}\left(\lambda R+2\log(A/\delta)\right)}
Aϕd(λR)elog(A/δ)ϕd(λR)\displaystyle\leqslant\frac{A\phi_{d}(\lambda R)}{e^{\log(A/\delta)}\phi_{d}\left(\lambda R\right)}
=δ.\displaystyle=\delta\,.

Above, the first step is by monotonicity of ϕd\phi_{d} (item (i) of Lemma 3.4), the second step is by Markov’s inequality which is applicable since ϕd\phi_{d} is non-negative (item (i) of Lemma 3.4), and the third step is by (5.3) and the super-exponential growth of ϕd\phi_{d} above λRr0\lambda R\geqslant r_{0} (item (ii) of Lemma 3.4). ∎
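As an illustrative sanity check (under assumed parameter choices, not from the paper), one can simulate the Langevin Algorithm in d=1 on the convex, 11-smooth potential f(x)=\sqrt{1+x^{2}}-1 and observe the sub-exponential tail behavior that Theorem 5.1 predicts:

```python
import math
import random

# Simulate the Langevin Algorithm x' = x - eta*grad_f(x) + Z with
# Z ~ N(0, 2*eta), in d = 1, on the illustrative convex 1-smooth potential
# f(x) = sqrt(1 + x^2) - 1 (an assumption for this sketch). Theorem 5.1
# then predicts sub-exponential tails for pi_eta.
random.seed(0)

def grad_f(x):
    return x / math.sqrt(1.0 + x * x)

eta = 0.1                      # stepsize eta <= 1/M = 1
x = 0.0
samples = []
for t in range(200_000):
    x = x - eta * grad_f(x) + math.sqrt(2.0 * eta) * random.gauss(0.0, 1.0)
    if t >= 10_000:            # discard burn-in iterations
        samples.append(abs(x))

# Empirical tail probabilities at increasing thresholds: sub-exponential
# concentration predicts (at least) geometric decay in the threshold.
tails = [sum(s >= r for s in samples) / len(samples) for r in (2.0, 4.0, 6.0, 8.0)]
assert tails[0] > tails[-1]
assert tails[-1] < 1e-2
print(tails)
```

The empirical tail fractions shrink rapidly with the threshold, consistent with the sub-exponential bound.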

5.2 Tightness

Here we observe that, up to a constant factor, the sub-exponential concentration bound that we proved for \pi_{\eta} matches the known concentration of \pi, both in terms of where the sub-exponentiality threshold occurs and in terms of the sub-exponential decay rate beyond this threshold. Although no techniques currently exist for transferring the corresponding lower bounds from \pi to \pi_{\eta} (an interesting open question), we conjecture that this tightness does transfer, since we suspect that \pi_{\eta} should not concentrate better than \pi.

Example 5.4 (Tightness of sub-exponentiality parameters).

Consider running the Langevin Algorithm where the potential ff is the univariate Huber function

f(x)={βx2/2|x|1β|x|α|x|1f(x)=\begin{cases}\beta x^{2}/2&|x|\leqslant 1\\ \beta|x|-\alpha&|x|\geqslant 1\end{cases}

where α:=β/2\alpha:=\beta/2 is set so that ff and ff^{\prime} are continuous at the breakpoints. Intuitively, modulo a translation and a quadratic smoothing around 0, ff is essentially the absolute value function times β\beta, and therefore π\pi is essentially the exponential distribution with parameter β\beta. It is straightforward to check that ff is convex and MM-smooth with M=βM=\beta, and thus satisfies the conditions in Theorem 5.1. In this example, the quantities in the proof of Theorem 5.1 are as follows. To simplify the asymptotics, suppose that β(0,1)\beta\in(0,1).

  • The quantities α\alpha and β\beta in the super-linear growth inequality (5.1) exactly match the α\alpha and β\beta in the construction of this function ff.

  • The quantity r_{0} from Lemma 3.4 can be set to r_{0}=\log(3)/2\approx 0.5493 because in dimension d=1, the Lyapunov function \phi_{d}(r)=\cosh(r), and it holds that \cosh(r+\Delta)\geqslant\exp(\Delta/2)\cosh(r) for all r\geqslant\log(3)/2 and all \Delta\geqslant 0. (Proof: By Jensen’s inequality, \frac{3}{4}\exp(\Delta)+\frac{1}{4}\exp(-\Delta)\geqslant\exp(\Delta/2). Re-arranging gives e^{2r}\geqslant 3\geqslant\frac{\exp(\Delta/2)-\exp(-\Delta)}{\exp(\Delta)-\exp(\Delta/2)} for any r\geqslant\log(3)/2. Re-arranging further establishes the desired inequality.)

  • The quantity r1r_{1} from Lemma 5.3 is r1=2αβ=1r_{1}=\frac{2\alpha}{\beta}=1.

  • The quantity R=\max(\frac{r_{0}}{\lambda}+\frac{\eta\beta}{4},r_{1})\asymp\max(\frac{1}{\beta}+\frac{\eta\beta}{4},1)\asymp\frac{1}{\beta} since \eta\leqslant\frac{1}{M}=\frac{1}{\beta}.

Therefore, for this potential ff, Theorem 5.1 establishes sub-exponential tails that decay at rate 2\lambda^{-1}\log\tfrac{1}{\delta}\asymp\beta^{-1}\log\tfrac{1}{\delta} beyond norms of size R+2\lambda^{-1}\log A\asymp\beta^{-1}. Similarly, \pi\propto e^{-f} has sub-exponential tails that decay at a rate of order \beta^{-1}\log\tfrac{1}{\delta} beyond norms of order \beta^{-1}, because \pi has tails similar to those of an exponential random variable with parameter \beta. We conclude that for all stepsizes \eta\leqslant 1/M that are not exponentially small in the problem parameters, the concentration inequality in Theorem 5.1 matches that of \pi, both in terms of when the sub-exponentiality threshold occurs and in terms of the rate of sub-exponential decay beyond this threshold. (The restriction to not-exponentially-small \eta is mild: in all relevant parameter regimes for sampling, \eta is only polynomially small; moreover, for exponentially small \eta, sub-exponential concentration of \pi_{\eta} is immediate from combining any bias bound between \pi and \pi_{\eta}, as such bounds depend polynomially on \eta, with the classical result that \pi is sub-exponential (Lemma 5.2).)
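The claims in Example 5.4 about the Huber potential are easy to verify numerically; the following sketch (with the illustrative choice \beta=1/2) checks continuity at the breakpoints, the growth condition (5.1), and MM-smoothness with M=\beta:

```python
import itertools

# Numerical sanity check of Example 5.4's Huber potential (d = 1), with an
# illustrative beta in (0, 1) and alpha = beta/2 so that f and f' are
# continuous at the breakpoints |x| = 1.
beta = 0.5
alpha = beta / 2.0

def f(x):
    return beta * x * x / 2.0 if abs(x) <= 1.0 else beta * abs(x) - alpha

def fprime(x):
    return beta * x if abs(x) <= 1.0 else beta * (1.0 if x > 0 else -1.0)

# Continuity of f and f' at the breakpoint x = 1.
eps = 1e-9
assert abs(f(1.0 - eps) - f(1.0 + eps)) < 1e-8
assert abs(fprime(1.0 - eps) - fprime(1.0 + eps)) < 1e-8

# Super-linear growth (5.1): f(x) >= beta*|x| - alpha for all x.
xs = [0.01 * k - 5.0 for k in range(1001)]
assert all(f(x) >= beta * abs(x) - alpha - 1e-12 for x in xs)

# M-smoothness with M = beta: f' is beta-Lipschitz.
grid = [0.1 * k - 5.0 for k in range(101)]
assert all(abs(fprime(a) - fprime(b)) <= beta * abs(a - b) + 1e-12
           for a, b in itertools.product(grid, grid))
print("Huber potential checks pass: continuity, growth (5.1), M = beta smoothness")
```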

6 Further extensions

6.1 Concentration along the trajectory

In this subsection we mention that the concentration we showed for πη\pi_{\eta} extends to concentration of the tt-th iterate XtX_{t} of the Langevin Algorithm, for every tt. This is a straightforward extension of our proofs for πη\pi_{\eta}, recovers that result in the limit tt\to\infty, and provides slightly tighter concentration bounds for every finite tt. Moreover, it applies to both the settings of strongly and non-strongly convex potentials, which we studied in §4 and §5, respectively. We state the results for both settings below. These results are most simply stated when the algorithm is initialized to a minimizer xx^{*} of ff, which is a standard pre-processing for sampling algorithms (see, e.g., [7]) and can be done efficiently using the same access to first-order queries (e.g., via gradient descent).

Theorem 6.1 (Sub-Gaussianity of Langevin iterates for strongly convex potentials).

Consider the setting of Theorem 4.1 and suppose the Langevin Algorithm is initialized at xx^{*}. Then the concentration shown for πη\pi_{\eta} in Theorem 4.1 also applies to every iterate of the Langevin Algorithm.

Proof.

The proof is a simple extension of the proof of Theorem 4.1. Recall the key step (4.1) in that proof. Applying this inequality inductively with X_{t} in place of x, taking expectations, and using Jensen’s inequality (z\mapsto z^{c} is concave since c\leqslant 1), we obtain the following recursion for the rotation-invariant MGF:

\displaystyle\mathbb{E}\,\Phi_{d,\lambda}(X_{t+1})\leqslant e^{\eta\lambda^{2}}\big(\mathbb{E}\,\Phi_{d,\lambda}(X_{t})\big)^{c}\,.

Moreover, note that \Phi_{d,\lambda}(X_{0})=1 since the initialization is X_{0}=x^{*}=0. Unraveling this recursion, it follows that

\displaystyle\mathbb{E}\,\Phi_{d,\lambda}(X_{t})\leqslant e^{\eta\lambda^{2}\sum_{s=0}^{t-1}c^{s}}=e^{\eta\lambda^{2}(1-c^{t})/(1-c)}\,,

for all iterations t\geqslant 0. This recovers (and improves, for any finite t) the upper bound in (4.2) for the stationary distribution (i.e., the limit t\to\infty). The rest of the proof then proceeds exactly as before (simply use sub-Gaussian concentration). ∎
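The unraveling step above can be checked numerically; the following sketch iterates the recursion with illustrative values of \eta,\lambda,c and compares against the closed form:

```python
import math

# Check of the unraveling step in the proof of Theorem 6.1: iterating
#   Phi_{t+1} = e^{eta*lambda^2} * Phi_t^c,   Phi_0 = 1,
# gives Phi_t = e^{eta*lambda^2 * (1 - c^t)/(1 - c)}.
# The parameter values below are illustrative, not from the paper.
eta, lam, c = 0.05, 2.0, 0.9
phi = 1.0
for t in range(1, 51):
    phi = math.exp(eta * lam * lam) * phi ** c
    closed_form = math.exp(eta * lam * lam * (1.0 - c ** t) / (1.0 - c))
    assert abs(phi - closed_form) < 1e-9 * closed_form
print("recursion matches the geometric-series closed form for all t checked")
```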

Theorem 6.2 (Sub-exponentiality of Langevin iterates for convex potentials).

Consider the setting of Theorem 5.1 and suppose the Langevin Algorithm is initialized at any minimizer of ff. Then the concentration shown for πη\pi_{\eta} in Theorem 5.1 also applies to every iterate of the Langevin Algorithm.

Proof.

The proof is a simple extension of the proof of Theorem 5.1 (mirroring the extension described above for the strongly convex case). Recall the key step (5.2) in the proof of Theorem 5.1. By applying this inequality inductively to the iterates of the Langevin Algorithm (i.e., with X_{t} in place of x, taking expectations) and setting \lambda=\beta/16 as in the proof of Theorem 5.1, we obtain the following recursion for the rotation-invariant MGF:

\displaystyle\mathbb{E}\,\Phi_{d,\lambda}(X_{t+1})\leqslant\rho^{-1}\phi_{d}(\lambda R)+\rho\,\mathbb{E}\,\Phi_{d,\lambda}(X_{t})\,,

where for shorthand we denote \rho:=e^{-\eta\beta^{2}/256}. Moreover, note that \Phi_{d,\lambda}(X_{0})=1 since the initialization is X_{0}=x^{*}=0. Unraveling this recursion, it follows that

\displaystyle\mathbb{E}\,\Phi_{d,\lambda}(X_{t})\leqslant\frac{1-\rho^{t}}{\rho(1-\rho)}\phi_{d}(\lambda R)+\rho^{t}\leqslant\frac{1-\rho^{t+1}}{\rho(1-\rho)}\phi_{d}(\lambda R)\,, (6.1)

for all iterations t\geqslant 0. Above, the last step uses the crude bounds \rho\leqslant 1 and 1=\phi_{d}(0)\leqslant\phi_{d}(\lambda R). This recovers (and improves, for any finite t) the upper bound in (5.3) for the stationary distribution (i.e., the limit t\to\infty). The rest of the proof then proceeds exactly as before (i.e., by a Chernoff-type argument). ∎
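The unraveling of this affine recursion can likewise be checked numerically with illustrative values of \rho and \phi_{d}(\lambda R):

```python
# Check of the unraveling step (6.1) in the proof of Theorem 6.2: iterating
#   a_{t+1} = phi/rho + rho * a_t,   a_0 = 1,
# gives a_t = (1 - rho^t)/(rho*(1 - rho)) * phi + rho^t, which is at most
# (1 - rho^{t+1})/(rho*(1 - rho)) * phi whenever phi >= 1.
# Values below are illustrative: rho = e^{-eta*beta^2/256} lies in (0, 1),
# and phi stands in for phi_d(lam*R) >= 1.
rho, phi = 0.95, 3.0
a = 1.0
for t in range(1, 201):
    a = phi / rho + rho * a
    exact = (1.0 - rho ** t) / (rho * (1.0 - rho)) * phi + rho ** t
    bound = (1.0 - rho ** (t + 1)) / (rho * (1.0 - rho)) * phi
    assert abs(a - exact) < 1e-9 * exact
    assert a <= bound + 1e-9
print("recursion matches the closed form and respects the bound (6.1)")
```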

6.2 Using inexact gradients

In this subsection we mention that the results in this paper apply to inexact implementations of the Langevin Algorithm which use sub-Gaussian estimates of the true gradients. Specifically, here we consider implementations of the form

Xt+1=XtηGt+Zt,\displaystyle X_{t+1}=X_{t}-\eta G_{t}+Z_{t}\,, (6.2)

where ξt:=Gtf(Xt)\xi_{t}:=G_{t}-\nabla f(X_{t}) is sub-Gaussian with some variance proxy σ2\sigma^{2} and, conditional on XtX_{t}, is independent of all other randomness.

This extension applies to all settings in this paper—both the settings of strongly and non-strongly convex potentials, and both the concentration of the stationary distribution πη\pi_{\eta} as well as the entire trajectory of the Langevin Algorithm. For brevity, we state these results for the trajectory; the same bounds then hold for the stationary distribution by taking the limit tt\to\infty. In what follows, we use the standard convention that ξ\xi being sub-Gaussian with variance proxy σ2\sigma^{2} means 𝔼[ξ]=0\mathbb{E}[\xi]=0 and 𝔼[esv,ξ]es2σ2/2\mathbb{E}[e^{s\langle v,\xi\rangle}]\leqslant e^{s^{2}\sigma^{2}/2} for any unit vector vv.
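As a concrete illustration of the model (6.2), the following sketch uses assumed, illustrative choices (the quadratic potential f(x)=x^{2}/2 and Gaussian gradient noise, which is sub-Gaussian with variance proxy \sigma^{2}) to implement one inexact step and to Monte-Carlo-check the MGF convention above:

```python
import math
import random

# Sketch of the inexact update (6.2), under illustrative assumptions: the
# strongly convex potential f(x) = x^2/2 (so grad f(x) = x) and Gaussian
# gradient noise xi_t ~ N(0, sigma^2), which is sub-Gaussian with variance
# proxy sigma^2.
random.seed(1)
sigma = 0.3

def inexact_langevin_step(x, eta):
    g = x + random.gauss(0.0, sigma)                  # G_t = grad f(X_t) + xi_t
    return x - eta * g + math.sqrt(2.0 * eta) * random.gauss(0.0, 1.0)

# Monte Carlo check of the sub-Gaussian MGF convention from the text:
# E[e^{s*xi}] <= e^{s^2*sigma^2/2}, with equality for Gaussian xi.
s, n = 1.5, 200_000
mgf = sum(math.exp(s * random.gauss(0.0, sigma)) for _ in range(n)) / n
assert abs(mgf - math.exp(s * s * sigma * sigma / 2.0)) < 0.01

# Iterates of the inexact algorithm remain tightly concentrated, as
# Theorem 6.3 predicts.
x = 0.0
for _ in range(1_000):
    x = inexact_langevin_step(x, 0.1)
assert abs(x) < 20.0
print("Gaussian gradient noise satisfies the sub-Gaussian MGF bound")
```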

Theorem 6.3 (Sub-Gaussianity of inexact Langevin for strongly convex potentials).

Consider the setting of Theorem 6.1, except now with the Langevin Algorithm implemented inexactly as in (6.2), where ξt:=Gtf(Xt)\xi_{t}:=G_{t}-\nabla f(X_{t}) is sub-Gaussian with variance proxy σ2\sigma^{2}. Then for each tt, the tt-th iterate of this algorithm satisfies

\mathbb{P}\left[\|X_{t}-x^{*}\|\geqslant 4\sqrt{\frac{\eta+\eta^{2}\sigma^{2}/2}{1-c}}\left(\sqrt{2d}+\sqrt{\log 1/\delta}\right)\right]\leqslant\delta\,,\quad\forall\delta\in(0,1)\,.
Proof.

By a nearly identical argument, the key recursion (4.1) becomes

\displaystyle\mathbb{E}\,\Phi_{d,\lambda}(x^{\prime})\leqslant e^{\eta\lambda^{2}+\eta^{2}\lambda^{2}\sigma^{2}/2}\;\Phi_{d,\lambda}(x-\eta\nabla f(x))\,,

where the only difference is the extra factor of \mathbb{E}\,\Phi_{d,\lambda}(\eta\xi_{t})\leqslant e^{\eta^{2}\lambda^{2}\sigma^{2}/2} arising from the inexact gradients and the definition of sub-Gaussianity (this also turns the equality into an inequality). The rest of the proof is then identical, with the sub-Gaussian proxy changing from \frac{2\eta}{1-c} to \frac{2(\eta+\eta^{2}\sigma^{2}/2)}{1-c} due to this extra term. ∎

Theorem 6.4 (Sub-exponentiality of inexact Langevin for convex potentials).

Consider the setting of Theorem 6.2, except now with the Langevin Algorithm implemented inexactly as in (6.2), where ξt:=Gtf(Xt)\xi_{t}:=G_{t}-\nabla f(X_{t}) is sub-Gaussian with variance proxy σ2\sigma^{2}. Then there exist A,C,R>0A,C,R>0 such that for each tt, the tt-th iterate of this algorithm satisfies

\mathbb{P}\left[\lVert X_{t}-x^{*}\rVert\geqslant R+C\log(A/\delta)\right]\leqslant\delta\,,\quad\forall\delta\in(0,1)\,.
Proof.

By a nearly identical argument, the key recursion (5.2) becomes

Φd,λ(x)eηλ2+η2λ2σ2/2[ϕd(λR)+eηβλ/8Φd,λ(x)],\displaystyle\Phi_{d,\lambda}(x^{\prime})\leqslant e^{\eta\lambda^{2}+\eta^{2}\lambda^{2}\sigma^{2}/2}\left[\phi_{d}(\lambda R)+e^{-\eta\beta\lambda/8}\Phi_{d,\lambda}(x)\right]\,,

due to the sub-Gaussian term, as explained in the above proof for the strongly convex case. The rest of the proof is then identical to that of Theorem 6.2, except with λ\lambda now set to β/(16+8ησ2)\beta/(16+8\eta\sigma^{2}), so that ρ\rho now becomes eηβλ/16=eηβ2/(256+128ησ2)e^{-\eta\beta\lambda/16}=e^{-\eta\beta^{2}/(256+128\eta\sigma^{2})}, which just changes the definition of A=1ρ(1ρ)A=\tfrac{1}{\rho(1-\rho)}. ∎

Acknowledgements.

We thank Sinho Chewi, Murat Erdogdu, and Andre Wibisono for helpful conversations about the related literature. We also thank Sinho Chewi for sharing an early draft of his book [7] on log-concave sampling.

References

  • Altschuler and Talwar [2023] Jason M. Altschuler and Kunal Talwar. Resolving the mixing time of the Langevin Algorithm to its stationary distribution for log-concave sampling. In Conference on Learning Theory, pages 2509–2510, 2023.
  • Andrieu et al. [2003] Christophe Andrieu, Nando De Freitas, Arnaud Doucet, and Michael I Jordan. An introduction to MCMC for machine learning. Machine Learning, 50(1):5–43, 2003.
  • Artstein-Avidan et al. [2015] Shiri Artstein-Avidan, Apostolos Giannopoulos, and Vitali D Milman. Asymptotic Geometric Analysis, Part I, volume 202. American Mathematical Society, 2015.
  • Bakry and Émery [1985] Dominique Bakry and Michel Émery. Diffusions hypercontractives. In Seminaire de probabilités XIX 1983/84, pages 177–206. Springer, 1985.
  • Bhattacharya [1978] RN Bhattacharya. Criteria for recurrence and existence of invariant measures for multidimensional diffusions. The Annals of Probability, pages 541–553, 1978.
  • Bobkov [2003] Sergey G Bobkov. Spectral gap and concentration for some spherically symmetric probability measures. In Geometric Aspects of Functional Analysis, pages 37–43. Springer, 2003.
  • Chewi [2022] Sinho Chewi. Log-Concave Sampling. Forthcoming, 2022.
  • DLMF [2019] NIST Digital Library of Mathematical Functions. http://dlmf.nist.gov/, Release 1.0.23 of 2019-06-15. F. W. J. Olver, A. B. Olde Daalhuis, D. W. Lozier, B. I. Schneider, R. F. Boisvert, C. W. Clark, B. R. Miller, and B. V. Saunders, eds.
  • Grimmett and Stirzaker [2020] Geoffrey Grimmett and David Stirzaker. Probability and Random Processes. Oxford University Press, 2020.
  • Jerrum and Sinclair [1996] Mark Jerrum and Alistair Sinclair. The Markov chain Monte Carlo method: an approach to approximate counting and integration. Approximation Algorithms for NP-hard problems, PWS Publishing, 1996.
  • Kannan et al. [1995] Ravi Kannan, László Lovász, and Miklós Simonovits. Isoperimetric problems for convex bodies and a localization lemma. Discrete & Computational Geometry, 13(3):541–559, 1995.
  • Ledoux [1999] Michel Ledoux. Concentration of measure and logarithmic Sobolev inequalities. In Seminaire de probabilites XXXIII, pages 120–216. Springer, 1999.
  • Liu [2001] Jun S Liu. Monte Carlo strategies in scientific computing, volume 10. Springer, 2001.
  • Nesterov [2003] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2003.
  • Rigollet and Hütter [2015] Phillippe Rigollet and Jan-Christian Hütter. High dimensional statistics. Lecture notes for course 18S997, 813(814):46, 2015.
  • Robert and Casella [1999] Christian P Robert and George Casella. Monte Carlo statistical methods, volume 2. Springer, 1999.
  • Roberts and Tweedie [1996] Gareth O Roberts and Richard L Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, pages 341–363, 1996.
  • Szegő [1939] Gábor Szegő. Orthogonal polynomials, volume 23. American Mathematical Society, 1939.
  • Vempala and Wibisono [2019] Santosh Vempala and Andre Wibisono. Rapid convergence of the unadjusted Langevin algorithm: Isoperimetry suffices. In Advances in Neural Information Processing Systems, pages 8092–8104. 2019.