
Structured Logconcave Sampling with a Restricted Gaussian Oracle

Yin Tat Lee University of Washington and Microsoft Research, [email protected]    Ruoqi Shen University of Washington, [email protected]    Kevin Tian Stanford University, [email protected]

1 Introduction

Since its study was pioneered by the celebrated randomized convex body volume approximation algorithm of Dyer, Frieze, and Kannan [DFK91], designing samplers for logconcave distributions has been a very active area of research in theoretical computer science and statistics with many connections to other fields. In a generic form, the problem can be stated as: sample from a distribution whose negative log-density is convex, under various access models to the distribution.

Developing efficient algorithms for sampling from structured logconcave densities is a topic that has received significant recent interest due to its widespread practical applications. Densities commonplace in applications may possess many types of structure that are exploitable for improved runtimes. Examples include derivative bounds (“well-conditioned densities”) and various types of separability (e.g. “composite densities” corresponding to possibly non-smooth regularization or restrictions to a set, and “logconcave finite sums” corresponding to averages over multiple data points); we make this terminology precise in Section 2.1, which contains the definitions used in this paper. Building an algorithmic theory for sampling these latter two families, which are not well-understood in the literature, is a primary motivation of this work.

There are strong parallels between the types of structured logconcave families garnering recent attention and the classes of convex functions known to admit efficient first-order optimization algorithms. Notably, gradient descent and its accelerated counterpart [Nes83] are well-known to quickly optimize a well-conditioned function, and have become ubiquitous in both practice and theory. Similarly, various methods have been developed for efficiently optimizing non-smooth but structured composite objectives [BT09] and well-conditioned finite sums [All17].

Logconcave sampling and convex optimization are intimately related primitives (cf. e.g. [BV04, AH16]), so it is perhaps unsurprising that there are analogies between the types of structure algorithm designers may exploit. Nonetheless, our understanding of the complexity landscape for sampling is quite a bit weaker in comparison to counterparts in the field of optimization; few lower bounds are known for the complexity of sampling tasks, and obtaining stronger upper bounds is an extremely active research area (contrary to optimization, where matching bounds exist in many cases). Moreover (and perhaps relatedly), the toolkit for designing logconcave samplers is comparatively lacking; for many important primitives in optimization, it is unclear if there are analogs in sampling, possibly impeding improved bounds. Our work broadly falls under the themes of (1) understanding which types of structured logconcave distributions admit efficient samplers, and (2) leveraging connections between optimization and sampling for algorithm design. We address these problems on two fronts, which constitute the primary technical contributions of this paper.

  1. We give a general reduction framework for bootstrapping samplers with mixing times depending polynomially on a conditioning measure $\kappa$ to samplers with mixing times depending linearly on $\kappa$. The framework is heavily motivated by a perspective on proximal point methods in structured convex optimization as instances of optimizing composite objectives, and leverages this connection via a surprisingly simple analysis (cf. Theorem 1).

  2. We develop novel “base samplers” for composite logconcave distributions and logconcave finite sums (cf. Theorems 2 and 3). The former is the first composite sampler with stronger guarantees than those known in the general logconcave setting. The latter constitutes the first high-accuracy finite sum sampler whose gradient query complexity improves upon the naïve strategy of querying full gradients of the negative log-density in each iteration.

Using our novel base samplers within our reduction framework, we obtain state-of-the-art samplers for all of the aforementioned structured families, i.e. well-conditioned, composite, and finite sum, as Corollaries 1, 2, and 3. We emphasize that even without our reduction technique, the guarantees of our base samplers for composite and finite sum-structured densities are the first of their kind. However, by boosting their mixing via our reduction, we obtain guarantees for these structured distribution families which are essentially the best one can hope for without a significant improvement in the most commonly studied well-conditioned regime (cf. discussion in Section 1.1).

We now formally state our results in Section 1.1, and situate them in the literature in Section 1.2. Section 1.3 is a technical overview of our approaches for developing our base samplers for composite and finite sum-structured densities (Sections 1.3.1 and 1.3.2), as well as our proximal reduction framework (Section 1.3.3). Finally, Section 1.5 gives a roadmap for the rest of the paper.

1.1 Our results

Before stating our results, we first require the notion of a restricted Gaussian oracle, whose definition is a key ingredient in giving our reduction framework as well as our later composite samplers.

Definition 1 (Restricted Gaussian oracle).

$\mathcal{O}(\lambda,v)$ is a restricted Gaussian oracle (RGO) for convex $g:\mathbb{R}^{d}\rightarrow\mathbb{R}$ if it returns

\[
\mathcal{O}(\lambda,v)\leftarrow\textup{sample from the distribution with density}\propto\exp\left(-\frac{1}{2\lambda}\left\lVert x-v\right\rVert_{2}^{2}-g(x)\right).
\]

In other words, an RGO asks to sample from a multivariate Gaussian (with covariance a multiple of the identity), “restricted” by some convex function $g$. Intuitively, if we can reduce a sampling problem for the density $\propto\exp(-g)$ to calling an RGO a small number of times with a small value of $\lambda$, each RGO subproblem could be much easier to solve than the original problem. This can happen for a variety of reasons, e.g. if the regularized density is extremely well-conditioned, or because it inherits concentration properties of a Gaussian. This idea of reducing a sampling problem to multiple subproblems, each implementing an RGO, underlies the framework of Theorem 1. Because regularization by a large Gaussian component appears repeatedly in this paper, we make the following more specific definition for convenience, which bounds the width of the Gaussian component.

Definition 2 ($\eta$-RGO).

We say $\mathcal{O}(\lambda,v)$ is an $\eta$-restricted Gaussian oracle ($\eta$-RGO) if it satisfies Definition 1 with the restriction that the parameter $\lambda$ is required to be at most $\eta$ in all calls to $\mathcal{O}$.

Variants of our notion of an RGO have implicitly appeared previously [CV18, MFWB19], and efficient RGO implementation was a key subroutine in the fastest sampling algorithm for general logconcave distributions [CV18]. It also extends a similar oracle used in composite optimization, which we will discuss shortly. However, the explicit use of RGOs in a framework such as Theorem 1 is a novel technical innovation of our work, and we believe this abstraction will find further uses.

Proximal reduction framework.

In Section 3, we prove correctness of our proximal reduction framework, whose guarantees are stated in the following Theorem 1.

Theorem 1.

Let $\pi$ be a distribution on $\mathbb{R}^{d}$ with $\tfrac{d\pi}{dx}(x)\propto\exp(-f_{\textup{oracle}}(x))$ such that $f_{\textup{oracle}}$ is $\mu$-strongly convex, and let $\epsilon\in(0,1)$. Let $\eta\leq\frac{1}{\mu}$, $T=\Theta(\frac{1}{\eta\mu}\log\frac{d}{\eta\mu\epsilon})$, and let $\mathcal{O}$ be an $\eta$-RGO for $f_{\textup{oracle}}$. Algorithm 1, initialized at the minimizer of $f_{\textup{oracle}}$, runs in $T$ iterations, each querying $\mathcal{O}$ a constant number of times, and obtains $\epsilon$ total variation distance to $\pi$.

In other words, if we can implement an $\eta$-RGO for a $\mu$-strongly convex function $f_{\textup{oracle}}$ in time $\mathcal{T}_{\textup{RGO}}$, we can sample from the density $\propto\exp(-f_{\textup{oracle}})$ in time $\widetilde{O}(\tfrac{1}{\eta\mu}\cdot\mathcal{T}_{\textup{RGO}})$. To highlight the power of this reduction framework, suppose there were an existing sampler $\mathcal{A}$ for densities $\propto\exp(-f)$ with mixing time $\widetilde{O}(\kappa^{10}\sqrt{d})$, where $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is $L$-smooth and $\mu$-strongly convex with condition number $\kappa=\tfrac{L}{\mu}$ (cf. Section 2.1 for definitions); no sampler with mixing time scaling as $\textup{poly}(\kappa)\sqrt{d}$ is currently known. Choosing $\eta=\tfrac{1}{L}$ and $f_{\textup{oracle}}\leftarrow f$ in Theorem 1 yields a sampler whose mixing time is $\widetilde{O}(\kappa\cdot\mathcal{T}_{\textup{RGO}})$, where $\mathcal{T}_{\textup{RGO}}$ is the cost of sampling from a density proportional to

\[
\exp\left(-\frac{L}{2}\left\lVert x-v\right\rVert_{2}^{2}-f(x)\right),
\]

for some $v\in\mathbb{R}^{d}$. However, observe that this distribution has a negative log-density with condition number at most $\tfrac{L+L}{L+\mu}\leq 2$! By using $\mathcal{A}$ as our RGO, we have $\mathcal{T}_{\textup{RGO}}=\widetilde{O}(\sqrt{d})$, and the overall mixing time is $\widetilde{O}(\kappa\sqrt{d})$. Leveraging Theorem 1 in applications, we obtain the following new results, improving the mixing of various “base samplers” which we bootstrap as RGOs for regularized densities.
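To spell out the conditioning claim (assuming for simplicity that $f$ is twice differentiable), a one-line Hessian calculation verifies the bound:

```latex
% RGO subproblem negative log-density with \eta = 1/L:
h(x) \;=\; f(x) + \frac{L}{2}\left\lVert x - v\right\rVert_2^2 ,
\qquad
(\mu + L)\, I \;\preceq\; \nabla^2 h(x) \;=\; \nabla^2 f(x) + L I \;\preceq\; 2L\, I,
% hence the condition number of h satisfies
\kappa(h) \;\le\; \frac{2L}{\mu + L} \;\le\; \frac{2L}{L} \;=\; 2 .
```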

Well-conditioned densities.

In [LST20], it was shown that a variant of Metropolized Hamiltonian Monte Carlo obtains a mixing time of $\widetilde{O}(\kappa d\log^{3}\tfrac{\kappa d}{\epsilon})$ for sampling a density on $\mathbb{R}^{d}$ with condition number $\kappa$. The analysis of [LST20] was somewhat delicate, and required reasoning about conditioning on a nonconvex set with desirable concentration properties. In Section 4.1, we prove Corollary 1, improving [LST20] by roughly a logarithmic factor with a significantly simpler analysis.

Corollary 1.

Let $\pi$ be a distribution on $\mathbb{R}^{d}$ with $\tfrac{d\pi}{dx}(x)\propto\exp(-f(x))$ such that $f$ is $L$-smooth and $\mu$-strongly convex, and let $\epsilon\in(0,1)$, $\kappa=\tfrac{L}{\mu}$. Assume access to $x^{*}=\textup{argmin}_{x\in\mathbb{R}^{d}}f(x)$. Algorithm 1 with $\eta=\tfrac{1}{8Ld\log\kappa}$, using Algorithm 2 as a restricted Gaussian oracle for $f$, uses $O(\kappa d\log\kappa\log\tfrac{\kappa d}{\epsilon})$ gradient queries in expectation, and obtains $\epsilon$ total variation distance to $\pi$.

We include Corollary 1 as a warmup for our more complicated results, and as a way to showcase the use of our reduction framework in a slightly different manner than the one outlined earlier. In particular, in proving Corollary 1, we choose a significantly smaller value of $\eta$, at which point a simple rejection sampling scheme implements each RGO using a constant number of gradient queries in expectation.
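To illustrate how rejection sampling can implement an RGO, here is a minimal sketch (the function name and the quadratic test density are our illustrative choices, not the paper's Algorithm 2): propose from the Gaussian part of the density and accept with probability $\exp(f(x^{*})-f(x))\leq 1$, which yields exact samples from the regularized density. The acceptance rate is high when $\eta$ is small and $v$ lies near the minimizer.

```python
import numpy as np

def rgo_rejection(f, f_min, eta, v, rng, max_tries=10_000):
    """Exact sample from the density ∝ exp(-f(x) - ||x - v||^2 / (2 * eta)):
    propose x ~ N(v, eta * I) and accept with probability
    exp(f_min - f(x)) <= 1, where f_min = min_x f(x).  Accepted points
    follow exactly the target (standard rejection sampling)."""
    d = len(v)
    for _ in range(max_tries):
        x = v + np.sqrt(eta) * rng.standard_normal(d)
        if rng.random() < np.exp(f_min - f(x)):
            return x
    raise RuntimeError("acceptance too low; decrease eta or recenter f")

# Illustrative target: f(x) = ||x||^2 / 2, minimized at 0 with value 0.
# For this quadratic f, the regularized density is Gaussian with mean
# v / (1 + eta), which the empirical mean below should approximate.
rng = np.random.default_rng(0)
f = lambda x: 0.5 * np.dot(x, x)
samples = np.array([rgo_rejection(f, 0.0, 0.01, np.ones(2), rng)
                    for _ in range(2000)])
print(samples.mean(axis=0))
```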

We give another algorithm matching Corollary 1 with a deterministic query complexity bound as Corollary 5. The algorithm of Corollary 5 is interesting in that it is entirely a zeroth-order algorithm, and does not require access to a gradient oracle. To our knowledge, in the well-conditioned optimization setting, no zeroth-order query complexities better than roughly $\sqrt{\kappa}d$ are known, e.g. by simulating accelerated gradient descent with a value oracle; thus, our sampling algorithm has a query bound off by only $\widetilde{O}(\sqrt{\kappa})$ from the best-known optimization algorithm. We are hopeful this result may help in the search for query lower bounds for structured logconcave sampling.

Composite densities with a restricted Gaussian oracle.

In Section 5, we develop a sampler for densities on $\mathbb{R}^{d}$ proportional to $\exp(-f(x)-g(x))$, where $f$ has condition number $\kappa$ and $g$ admits a restricted Gaussian oracle $\mathcal{O}$. We state its guarantees here.

Theorem 2.

Let $\pi$ be a distribution on $\mathbb{R}^{d}$ with $\tfrac{d\pi}{dx}(x)\propto\exp(-f(x)-g(x))$ such that $f$ is $L$-smooth and $\mu$-strongly convex, and let $\epsilon\in(0,1)$. Let $\eta\leq\tfrac{1}{32L\kappa d\log(\kappa/\epsilon)}$ (where $\kappa=\tfrac{L}{\mu}$), $T=\Theta(\tfrac{1}{\eta\mu}\log\tfrac{\kappa d}{\epsilon})$, and let $\mathcal{O}$ be an $\eta$-RGO for $g$. Further, assume access to the minimizer $x^{*}=\textup{argmin}_{x\in\mathbb{R}^{d}}\{f(x)+g(x)\}$. There is an algorithm which runs in $T$ iterations in expectation, each querying a gradient oracle of $f$ and $\mathcal{O}$ a constant number of times, and obtains $\epsilon$ total variation distance to $\pi$.

The assumption that the composite component $g$ admits an RGO can be thought of as a measure of “simplicity” of $g$. This mirrors the widespread use of a proximal oracle, which we now define, as a measure of simplicity in the context of composite optimization [BT09].

Definition 3 (Proximal oracle).

$\mathcal{O}(\lambda,v)$ is a proximal oracle for convex $g:\mathbb{R}^{d}\rightarrow\mathbb{R}$ if it returns

\[
\mathcal{O}(\lambda,v)\leftarrow\textup{argmin}_{x\in\mathbb{R}^{d}}\left\{\frac{1}{2\lambda}\left\lVert x-v\right\rVert_{2}^{2}+g(x)\right\}.
\]

Many regularizers $g$ used in defining composite optimization objectives, often chosen to enforce a quality such as sparsity or “simplicity” in a solution, admit efficient proximal oracles. In particular, if the proximal oracle subproblem admits a closed form solution (or is otherwise computable in $O(d)$ time), the regularized objective can be optimized at essentially no asymptotic loss. It is readily apparent that our RGO (Definition 1) is the extension of Definition 3 to the sampling setting. In [MFWB19], a variety of regularizations arising in practical applications, including coordinate-separable $g$ (such as restrictions to a coordinate-wise box, e.g. for a Bayesian inference task where we have side information on the ranges of parameters) and $\ell_{1}$ or group Lasso regularized densities, were shown to admit RGOs. Our composite sampling results achieve a similar “no loss” phenomenon for such regularizations, with respect to existing well-conditioned samplers.
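As a concrete instance of the coordinate-separable case, the following hedged sketch (the name `box_rgo` and all parameters are illustrative) implements an RGO for $g$ indicating a coordinate-wise box $[l,u]^{d}$: each call must sample $N(v,\lambda I)$ restricted to the box, which separates across coordinates, so each coordinate can be drawn exactly by inverting the truncated-normal CDF.

```python
import numpy as np
from statistics import NormalDist

def box_rgo(lam, v, lo, hi, rng):
    """Sample from the density ∝ exp(-||x - v||^2 / (2 * lam) - g(x)),
    where g indicates the box [lo, hi]^d (0 inside, +infinity outside).
    This is N(v, lam * I) truncated to the box; coordinates are
    independent, so sample each by inverse-CDF of a truncated normal."""
    out = np.empty_like(v)
    for i, vi in enumerate(v):
        nd = NormalDist(mu=float(vi), sigma=float(np.sqrt(lam)))
        a, b = nd.cdf(lo), nd.cdf(hi)
        # Map a uniform draw into the CDF mass lying inside [lo, hi].
        out[i] = nd.inv_cdf(a + (b - a) * rng.random())
    return out

# Illustrative call: box [-1, 1]^2, Gaussian centered near two corners.
rng = np.random.default_rng(1)
xs = np.array([box_rgo(0.5, np.array([0.9, -0.9]), -1.0, 1.0, rng)
               for _ in range(1000)])
```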

By choosing the largest possible value of $\eta$ in Theorem 2, we obtain an iteration bound of $\widetilde{O}(\kappa^{2}d)$. In Section 4.2, we prove Corollary 2, which improves Theorem 2 by roughly a $\kappa$ factor.

Corollary 2.

Let $\pi$ be a distribution on $\mathbb{R}^{d}$ with $\tfrac{d\pi}{dx}(x)\propto\exp(-f(x)-g(x))$ such that $f$ is $L$-smooth and $\mu$-strongly convex, and let $\epsilon\in(0,1)$, $\kappa=\tfrac{L}{\mu}$. Assume access to $x^{*}=\textup{argmin}_{x\in\mathbb{R}^{d}}\{f(x)+g(x)\}$ and let $\mathcal{O}$ be a restricted Gaussian oracle for $g$. There is an algorithm (Algorithm 1 using Theorem 2 as a restricted Gaussian oracle) which runs in $O(\kappa d\log^{3}\tfrac{\kappa d}{\epsilon})$ iterations in expectation, each querying a gradient of $f$ and $\mathcal{O}$ a constant number of times, and obtains $\epsilon$ total variation distance to $\pi$.

To sketch the proof, choosing $\eta=\tfrac{1}{L}$ in Theorem 1 yields an algorithm running in $\widetilde{O}(\tfrac{1}{\eta\mu})=\widetilde{O}(\kappa)$ iterations. In each iteration, the RGO subproblem asks to sample from the distribution whose negative log-density is $f(x)+g(x)+\tfrac{L}{2}\left\lVert x-v\right\rVert_{2}^{2}$ for some $v\in\mathbb{R}^{d}$, so we can call Theorem 2, where the “well-conditioned” portion $f(x)+\tfrac{L}{2}\left\lVert x-v\right\rVert_{2}^{2}$ has constant condition number. Thus, Theorem 2 solves each subproblem in $\widetilde{O}(d)$ iterations, yielding the result. In fact, Corollary 2 nearly matches Corollary 1 in the case where $g$ is uniformly zero. Surprisingly, this recovers the runtime of [LST20] without appealing to strong gradient concentration bounds (e.g. [LST20], Theorem 3.2).

Logconcave finite sums.

In Section 6, we initiate the study of mixing times for sampling logconcave finite sums with polylogarithmic dependence on accuracy. We give the following result.

Theorem 3.

Let $\pi$ be a distribution on $\mathbb{R}^{d}$ with $\tfrac{d\pi}{dx}(x)\propto\exp(-F(x))$, where $F(x)=\tfrac{1}{n}\sum_{i=1}^{n}f_{i}(x)$ is $\mu$-strongly convex, $f_{i}$ is $L$-smooth and convex for all $i\in[n]$, $\kappa=\tfrac{L}{\mu}$, and $\epsilon\in(0,1)$. Assume access to $x^{*}=\textup{argmin}_{x\in\mathbb{R}^{d}}F(x)$. Algorithm 6 uses $O(\kappa^{2}d\log^{4}\tfrac{n\kappa d}{\epsilon})$ value queries to the summands $\{f_{i}\}_{i\in[n]}$, and obtains $\epsilon$ total variation distance to $\pi$.

For a zeroth-order algorithm, Theorem 3 serves as a surprisingly strong baseline, as it nearly matches the previously best-known bound for zeroth-order well-conditioned sampling when $n=1$; however, when e.g. $\kappa\approx d$, the complexity bound is at least cubic. By using Theorem 3 within the framework of Theorem 1, we obtain the following improved result.

Corollary 3 (Improved first-order logconcave finite sum sampling).

In the setting of Theorem 3, Algorithm 1, using Algorithm 6 and SVRG [JZ13] as a restricted Gaussian oracle for $F$, uses

\[
O\left(n\log\left(\frac{n\kappa d}{\epsilon}\right)+\kappa\sqrt{nd}\log^{3.5}\left(\frac{n\kappa d}{\epsilon}\right)+\kappa d\log^{5}\left(\frac{n\kappa d}{\epsilon}\right)\right)=\widetilde{O}\left(n+\kappa\max\left(d,\sqrt{nd}\right)\right)
\]

queries to first-order oracles for the summands $\{f_{i}\}_{i\in[n]}$, and obtains $\epsilon$ total variation distance to $\pi$.

Corollary 3 has several surprising properties. First, its bound when $n=1$ gives yet another way of recovering (up to polylogarithmic factors) the runtime of [LST20] without gradient concentration. Second, up to a $\widetilde{O}(\max(1,\sqrt{\tfrac{n}{d}}))$ factor, it is essentially the best runtime one could hope for without an improvement in the $n=1$ case. This is in the sense that $\widetilde{O}(\kappa d)$ is the best known runtime for $n=1$, and to our knowledge every efficient well-conditioned sampler requires minimizer access, i.e. $\widetilde{O}(n)$ gradient queries [WS16]. Interestingly, when $n=1$, Algorithm 6 can be significantly simplified, and becomes the standard Metropolized random walk [DCWY18]; this yields Corollary 5, an algorithm attaining the iteration complexity of Corollary 1 while only querying a value oracle for $f$.

1.2 Previous work

Logconcave sampling is a problem that, within theoretical computer science, has its origins in convex body sampling (a problem it generalizes). A long sequence of developments has made significant advances in the general model, where only convexity is assumed about the negative log-density, and only value oracle access is given. In this prior work discussion, we focus on the more structured cases where all or part of the negative log-density has finite condition number, and refer the reader to [Vem05, LV06a, CV15] for an account of progress in the general case.

Well-conditioned densities.

Significant recent efforts in the machine learning and statistics communities have focused on obtaining provable guarantees for well-conditioned distributions, starting from the pioneering work of [Dal17], and continued in e.g. [CCBJ18, DR18, CV19, CDWY19, DCWY18, DM19a, DMM19, LSV18, MMW+19, SL19, LST20]. In this setting, many methods based on discretizations of continuous-time first-order processes (such as the Langevin dynamics) have been proposed. Typically, error guarantees come in two forms: either in the 2-Wasserstein ($W_{2}$) distance, or in total variation (TV). The line of work [DCWY18, CDWY19, LST20] has brought the gradient complexity for obtaining $\epsilon$ TV distance to $\widetilde{O}(\kappa d)$, where $d$ is the dimension, by exploiting gradient concentration properties. For progress in complexities depending polynomially on $\epsilon^{-1}$, attaining $W_{2}$ guarantees (typically incomparable to TV bounds), we defer to [SL19], the state-of-the-art using $\widetilde{O}(\kappa^{\frac{7}{6}}\epsilon^{-\frac{1}{3}}+\kappa\epsilon^{-\frac{2}{3}})$ queries to obtain $W_{2}$ distance $\epsilon\sqrt{d\mu^{-1}}$ from the target (here, $\sqrt{d\mu^{-1}}$ is the effective diameter; this accuracy measure allows for scale-invariant $W_{2}$ guarantees). We note incomparable guarantees can be obtained by assuming higher derivative bounds (e.g. a Lipschitz Hessian); our work uses only the minimal assumption of bounded second derivatives.

Composite densities.

Recent works have studied sampling from densities of the form (1), or its specializations (e.g. restrictions to a convex set). Several [Per16, BDMP17, Ber18] are based on Moreau envelope or proximal regularization strategies, and demonstrate efficiency under more stringent assumptions on the structure of the composite function $g$, but under minimal assumptions obtain fairly large provable mixing times $\Omega(d^{5})$. Proximal regularization algorithms have also been considered for non-composite sampling [Wib19]. Another discretization strategy based on projections was studied by [BEL18], but obtained mixing time $\Omega(d^{7})$. Finally, improved algorithms for special constrained sampling problems have been proposed, such as simplex restrictions [HKRC18].

Of particular relevance and inspiration to this work is [MFWB19]. By generalizing and adapting the Metropolized HMC algorithms of [DCWY18, CDWY19], adopting a Moreau envelope strategy, and using (a stronger version of) the RGO access model, [MFWB19] obtained a runtime which in the best case scales as $\widetilde{O}(\kappa^{2}d)$, similar to the guarantee of our base sampler in Theorem 2. However, this result required a variety of additional assumptions, such as access to the normalization factors of restricted Gaussians, Lipschitzness of $g$, warmness of the start, and various problem parameter tradeoffs. To the best of our knowledge, the general problem of sampling from (1) under minimal assumptions more efficiently than general-purpose logconcave algorithms was unresolved (even under restricted Gaussian oracle access) prior to our mixing time bound, a novel contribution of this work. Our results also suggest that the RGO is a natural notion of tractability for the composite sampling problem.

Logconcave finite sums.

Since [WT11] proposed the stochastic gradient Langevin dynamics, which at each step stochastically estimates the full gradient of the function, there has been a long line of work giving bounds for this method and other similar algorithms [DK19, GGZ18, SKR19, BCM+18, NF19]. These convergence rates depend heavily on the variance of the stochastic estimates. Inspired by variance-reduced methods in convex optimization, samplers based on low-variance estimators have also been proposed [DRW+16, DSM+16, BFR+19, BFFN19, NDH+17, CWZ+17, ZXG18, CFM+18]. Although our reduction-based approach is not designed specifically for solving problems of finite sum structure, our speedup can be viewed as due to a lower variance estimator implicitly defined through the oracle subproblems of Theorem 1 via repeated re-centering.

In Table 1, we list prior runtimes [ZXG18, CFM+18] for sampling logconcave finite sums; note these results additionally require bounded higher derivatives (with the exception of the bound with $\kappa^{4}$ dependence), obtain guarantees only in Wasserstein distance, and have polynomial dependences on $\epsilon^{-1}$. On the other hand, our reduction-based approach obtains total variation bounds with linear dependence on $\kappa$ and polylogarithmic dependence on $\epsilon^{-1}$. Our bound also simultaneously matches the state-of-the-art bound when $n=1$, a feature not shared by prior stochastic algorithms. To our knowledge, no previous nontrivial bounds were known in the high-accuracy regime before our work (as mentioned previously, one can always compute the full $\nabla F$ in every iteration of a well-conditioned sampler).

| Method | Gradient oracle complexity ($W_{2}\leq\epsilon$, $\mu=1$) | Gradient oracle complexity ($W_{2}\leq\epsilon\sqrt{d\mu^{-1}}$) |
| --- | --- | --- |
| SAGA-LD [CFM+18] | $n+\frac{\kappa^{1.5}\sqrt{d}+\kappa d+Md}{\epsilon}+\frac{\kappa d^{4/3}}{\epsilon^{2/3}}$ | $n+\frac{\kappa^{1.5}+\kappa\sqrt{d}+M\sqrt{d}}{\epsilon}+\frac{\kappa d^{2/3}}{\epsilon^{2/3}}$ |
| SVRG-LD [CFM+18] | $n+\frac{\kappa^{1.5}\sqrt{d}+\kappa d+Md}{\epsilon}+\frac{\kappa d^{4/3}}{\epsilon^{2/3}}$ | $n+\frac{\kappa^{3}}{\epsilon^{2}}+\frac{\kappa^{1.5}+M\sqrt{d}}{\epsilon}$ |
| CV-ULD [CFM+18] | $n+\frac{\kappa^{4}d^{1.5}}{\epsilon^{3}}$ | $n+\frac{\kappa^{4}}{\epsilon^{3}}$ |
| SVRG-LD [ZXG18] | $n+\frac{\kappa^{1.5}\sqrt{d}+Md}{\epsilon}+\frac{\kappa\sqrt{nd}}{\epsilon}$ | $n+\frac{\kappa^{1.5}+M\sqrt{d}}{\epsilon}+\frac{\kappa\sqrt{n}}{\epsilon}$ |
| State-of-the-art, $n=1$ [SL19] | $\frac{\kappa^{7/6}d^{1/6}}{\epsilon^{1/3}}+\frac{\kappa d^{1/3}}{\epsilon^{2/3}}$ | $\frac{\kappa^{7/6}}{\epsilon^{1/3}}+\frac{\kappa}{\epsilon^{2/3}}$ |

| Method | Gradient oracle complexity (TV $\leq\epsilon$) |
| --- | --- |
| Corollary 3 | $n+\kappa d+\kappa\sqrt{nd}$ |
| State-of-the-art, $n=1$ [LST20] | $\kappa d$ |

Table 1: Complexity of sampling from $e^{-F(x)}$ where $F(x)=\frac{1}{n}\sum_{i\in[n]}f_{i}(x)$ on $\mathbb{R}^{d}$ is $\mu$-strongly convex, each $f_{i}$ is convex and $L$-smooth, and $\kappa=\tfrac{L}{\mu}$. For relevant lines, $M$ is the Lipschitz constant of the Hessian $\nabla^{2}F$, on which our algorithm has no dependence. Complexity is measured in terms of the number of calls to $f_{i}$ or $\nabla f_{i}$ for summands $\{f_{i}\}_{i\in[n]}$. We hide $\textup{polylog}(\tfrac{n\kappa d}{\epsilon})$ factors for simplicity.
Preliminary version.

A preliminary version of this work, containing the results of Section 5, appeared as [STL20]. The preliminary version also contained an experimental evaluation of the algorithm in Section 5 for the task of sampling a (non-diagonal covariance) multivariate Gaussian restricted to a box, and demonstrated the efficacy of our method in comparison to general-purpose logconcave samplers (i.e. the hit-and-run method [LV06c]). The focus of the present version is giving theoretical guarantees for structured logconcave sampling tasks, so we omit empirical evaluations, and defer evaluation of the new methods developed in this paper to follow-up work.

1.3 Technical overview

1.3.1 Composite logconcave sampling

We study the problem of approximately sampling from a distribution $\pi$ on $\mathbb{R}^{d}$, with density

\[
\frac{d\pi(x)}{dx}\propto\exp\left(-f(x)-g(x)\right). \tag{1}
\]

Here, $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is assumed to be “well-behaved” (i.e. has finite condition number), and $g:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is a convex, but possibly non-smooth, function. This problem generalizes the special case of sampling from $\propto\exp(-f(x))$ for well-conditioned $f$, simply by letting $g$ vanish. Even the specialization of (1) where $g$ indicates a convex set (i.e. is $0$ inside the set, and $\infty$ outside) is not well-understood; existing mixing time bounds for this restricted setting are large polynomials in $d$ [BDMP17, BEL18], and are typically weaker than guarantees in the general logconcave setting [LV06c, LV06b]. This is in contrast to the convex optimization setting, where first-order methods readily generalize to solve problem families such as $\min_{x\in\mathcal{X}}f(x)$, where $\mathcal{X}\subseteq\mathbb{R}^{d}$ is a convex set, as well as its generalization

\[
\min_{x\in\mathbb{R}^{d}}f(x)+g(x),\text{ where }g:\mathbb{R}^{d}\rightarrow\mathbb{R}\text{ is convex and admits a proximal oracle.} \tag{2}
\]

We defined proximal oracles in Definition 3; in short, they are procedures which minimize the sum of a quadratic and $g$. Definition 3 is desirable because many natural non-smooth composite objectives arising in learning settings, such as the Lasso [Tib96] and elastic net [ZH05], admit efficient proximal oracles. It is clear that the definition of a proximal oracle implies it can also handle arbitrary sums of linear functions and quadratics, as the resulting objective can be rewritten as the sum of a constant and a single quadratic. The seminal work [BT09] extends fast gradient methods to solve (2) via proximal oracles, and has prompted many follow-up studies.
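For instance, the proximal oracle for the $\ell_{1}$ regularizer $g(x)=\gamma\left\lVert x\right\rVert_{1}$ underlying the Lasso has the well-known closed form of coordinate-wise soft-thresholding; a minimal sketch (the function name and test values are illustrative):

```python
import numpy as np

def prox_l1(lam, v, gamma=1.0):
    """Proximal oracle (Definition 3) for g(x) = gamma * ||x||_1:
    argmin_x { ||x - v||^2 / (2 * lam) + gamma * ||x||_1 }.
    The problem separates per coordinate, and each 1-d problem is
    solved by soft-thresholding v_i at level lam * gamma."""
    t = lam * gamma
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

print(prox_l1(0.5, np.array([3.0, -0.2, 1.0])))
```

Coordinates with magnitude below the threshold are zeroed out exactly, which is why this oracle is commonly used to enforce sparsity.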

Motivated by the success of the proximal oracle framework in convex optimization, we study sampling from the family (1) through the lens of RGOs, a natural extension of Definition 3. The main result of Section 5 is a “base” algorithm efficiently sampling from (1), assuming access to an RGO for gg. We now survey the main components of this algorithm.

Reduction to shared minimizers. We first observe that, without loss of generality, $f$ and $g$ share a minimizer: shifting $f$ and $g$ by linear terms, i.e. setting $\tilde{f}(x):=f(x)-\left\langle\nabla f(x^{*}),x\right\rangle$ and $\tilde{g}(x):=g(x)+\left\langle\nabla f(x^{*}),x\right\rangle$, where $x^{*}$ minimizes $f+g$, first-order optimality implies both $\tilde{f}$ and $\tilde{g}$ are minimized by $x^{*}$. Moreover, implementing a first-order oracle for $\tilde{f}$ and an RGO for $\tilde{g}$ is immediate without additional assumptions. This modification becomes crucial for our later developments, and we hope this simple observation, reminiscent of “variance reduction” techniques in stochastic optimization [JZ13], is broadly applicable to improving algorithms for the sampling problem induced by (1).
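The shift is easy to sanity-check numerically; in this hedged one-dimensional sketch (the quadratic $f$, $g$ are chosen purely for illustration), both shifted functions have vanishing gradient at the joint minimizer $x^{*}$, while $f+g$ is unchanged:

```python
# Illustrative smooth pair: f(x) = (x - 1)^2, g(x) = (x + 2)^2.
f  = lambda x: (x - 1.0) ** 2
df = lambda x: 2.0 * (x - 1.0)
g  = lambda x: (x + 2.0) ** 2
dg = lambda x: 2.0 * (x + 2.0)

# Minimizer of f + g: 2(x - 1) + 2(x + 2) = 0  =>  x* = -1/2.
x_star = -0.5
c = df(x_star)  # gradient of f at x*, used for the linear shift

# Shifted functions from the reduction:
#   f~(x) = f(x) - <grad f(x*), x>,   g~(x) = g(x) + <grad f(x*), x>.
df_tilde = lambda x: df(x) - c
dg_tilde = lambda x: dg(x) + c

print(df_tilde(x_star), dg_tilde(x_star))  # both gradients vanish at x*
```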

Beyond Moreau envelopes: expanding the space. A typical approach in convex optimization to handling a non-smooth objective $g$ is to instead optimize its Moreau envelope, defined by

\[
g^{\eta}(y):=\min_{x\in\mathbb{R}^{d}}\left\{g(x)+\frac{1}{2\eta}\left\lVert x-y\right\rVert_{2}^{2}\right\}. \tag{3}
\]

Intuitively, the envelope $g^{\eta}$ trades off function value with proximity to $y$; a standard exercise shows that $g^{\eta}$ is smooth (has a Lipschitz gradient), with smoothness depending on $\eta$, and moreover that computing gradients of $g^{\eta}$ reduces to calling a proximal oracle (Definition 3). It is natural to extend this idea to the composite sampling setting, e.g. via sampling from the density

\[
\exp\left(-f(x)-g^{\eta}(x)\right).
\]

However, a variety of complications prevent such strategies from obtaining rates comparable to their non-composite, well-conditioned counterparts, including difficulty in bounding closeness of the resulting distribution, as well as biased drifts of the sampling process due to error in gradients.
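As a concrete check of the smoothing in (3): the Moreau envelope of the absolute value is the classical Huber function, and the hedged sketch below (function names are ours) verifies this numerically by plugging the proximal point, which is soft-thresholding, back into the definition:

```python
import numpy as np

def moreau_env_abs(eta, y):
    """Moreau envelope (3) of g(x) = |x|, computed via its minimizer:
    the proximal point of |.| is soft-thresholding of y at level eta."""
    x = np.sign(y) * max(abs(y) - eta, 0.0)  # argmin of (3) for g = |.|
    return abs(x) + (x - y) ** 2 / (2.0 * eta)

def huber(eta, y):
    """Known closed form: the envelope of |.| is the Huber function,
    quadratic near the origin and linear (slope 1) in the tails."""
    return y * y / (2.0 * eta) if abs(y) <= eta else abs(y) - eta / 2.0

vals = [(moreau_env_abs(0.5, y), huber(0.5, y)) for y in np.linspace(-2, 2, 9)]
```

The quadratic cap near the origin is exactly where the envelope gains its $\tfrac{1}{\eta}$-Lipschitz gradient, while agreeing with $|y|$ up to an additive constant in the tails.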

Our approach departs from this smoothing strategy in a crucial way, inspired by Hamiltonian Monte Carlo (HMC) methods [Kra40, Nea11]. HMC can be seen as a discretization of the ubiquitous Langevin dynamics on an expanded space. In particular, discretizations of the Langevin dynamics simulate the stochastic differential equation $\tfrac{dx_{t}}{dt}=-\nabla f(x_{t})+\sqrt{2}\tfrac{dW_{t}}{dt}$, where $W_{t}$ is Brownian motion. HMC methods instead simulate dynamics on an extended space $\mathbb{R}^{d}\times\mathbb{R}^{d}$, via an auxiliary “velocity” variable which accumulates gradient information. This is sometimes interpreted as a discretization of the underdamped Langevin dynamics [CCBJ18]. HMC often has desirable stability properties, and expanding the dimension via an auxiliary variable has been used in algorithms obtaining the fastest rates in the well-conditioned logconcave sampling regime [SL19, LST20]. Inspired by this phenomenon, we consider the density on $\mathbb{R}^{d}\times\mathbb{R}^{d}$

\frac{d\hat{\pi}}{dz}(z):=\exp\left(-f(y)-g(x)-\frac{1}{2\eta}\left\lVert x-y\right\rVert_{2}^{2}\right)\text{ where }z=(x,y). (4)

Due to technical reasons, the family of distributions we use in our final algorithms is of a slightly different form than (4), but this simplification is useful to build intuition. Note in particular that the form of (4) is directly inspired by (3), where rather than minimizing over x, we directly expand the space. The idea is that for small enough \eta and a set of x of large measure, smoothness of f will guarantee that under (4), the conditional distribution of y concentrates near x, a fact we make rigorous. To sample from (1), we then show that a rejection filter applied to a sample x from the marginal of (4) will terminate in a constant number of steps. Consequently, it suffices to develop a fast sampler for (4).

Alternating sampling with an oracle. The form of the distribution (4) suggests a natural strategy for sampling from it: starting from a current state (x_{k},y_{k}), we iterate

  1. Sample y_{k+1}\sim\exp\left(-f(y)-\tfrac{1}{2\eta}\left\lVert x_{k}-y\right\rVert_{2}^{2}\right).

  2. Sample x_{k+1}\sim\exp\left(-g(x)-\frac{1}{2\eta}\left\lVert x-y_{k+1}\right\rVert_{2}^{2}\right), via a restricted Gaussian oracle.

When f and g share a minimizer, taking a first-order approximation in the first step, i.e. sampling y_{k+1}\sim\exp(-f(x_{k})-\left\langle\nabla f(x_{k}),y-x_{k}\right\rangle-\tfrac{1}{2\eta}\left\lVert y-x_{k}\right\rVert_{2}^{2}), can be shown to generalize the leapfrog step of HMC updates. However, for \eta very small (as in our setting), we observe that the first step itself reduces to sampling from a distribution with constant condition number, performable in \tilde{O}(d) gradient calls by e.g. Metropolized HMC [DCWY18, CDWY19, LST20]. Moreover, it is not hard to see that this “alternating marginal” sampling strategy preserves the stationary distribution exactly, so no filtering is necessary. Directly bounding the conductance of this random walk, for small enough \eta, leads to an algorithm running in \tilde{O}\left(\kappa^{2}d^{2}\right) iterations, each calling an RGO once, and a gradient oracle for f roughly \tilde{O}\left(d\right) times. This latter guarantee follows by appealing to known bounds [CDWY19, LST20] on the mixing time in high dimensions of Metropolized HMC for a well-conditioned distribution, a property satisfied by the y-marginal of (4) for small \eta.
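As a sanity check of this alternating scheme, here is a sketch in one dimension with the illustrative quadratic choices f(y)=ay^{2}/2 and g(x)=bx^{2}/2 (ours, not the paper's), where both steps are exact Gaussian draws and the x-marginal of (4) has a closed-form precision the chain should reproduce.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, eta = 2.0, 3.0, 0.1  # f(y) = a*y^2/2, g(x) = b*x^2/2, coupling parameter eta

def step(x):
    # Step 1: y ~ exp(-f(y) - (x - y)^2/(2*eta)); Gaussian since f is quadratic.
    y = rng.normal(x / (1 + eta * a), np.sqrt(1 / (a + 1 / eta)))
    # Step 2: x ~ exp(-g(x) - (x - y)^2/(2*eta)); the RGO call, also Gaussian here.
    return rng.normal(y / (1 + eta * b), np.sqrt(1 / (b + 1 / eta)))

x, xs = 0.0, []
for k in range(100000):
    x = step(x)
    if k >= 1000:  # discard burn-in
        xs.append(x)
xs = np.array(xs)

# x-marginal precision of the joint density (4) in this Gaussian case
# (Schur complement of the 2x2 joint precision matrix).
prec = (b + 1 / eta) - (1 / eta) ** 2 / (a + 1 / eta)
```

Since the two steps are exact conditional draws, no Metropolis filter is needed and the empirical marginal matches the analytic one.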

Stability of Gaussians under bounded perturbations. To obtain our tightest runtime result, we use the fact that \eta is chosen to be much smaller than L^{-1} to show structural results about distributions of the form (4), yielding tighter concentration for bounded perturbations of a Gaussian (i.e. the Gaussian has covariance \tfrac{1}{\eta}\mathbf{I}, and is restricted by L-smooth f for \eta\ll L^{-1}). To illustrate, let

\frac{d\mathcal{P}_{x}(y)}{dy}\propto\exp\left(-f(y)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right)

and let its mean and mode be \bar{y}_{x} and y^{*}_{x} respectively. It is standard that \left\lVert\bar{y}_{x}-y^{*}_{x}\right\rVert_{2}\leq\sqrt{d\eta}, by \eta^{-1}-strong logconcavity of \mathcal{P}_{x}. Informally, we show that for \eta\ll L^{-1} and x not too far from the minimizer of f, we can improve this to \left\lVert\bar{y}_{x}-y^{*}_{x}\right\rVert_{2}=O(\sqrt{\eta}); see Proposition 12 for a precise statement.

Using our structural results, we sharpen conductance bounds, improve the warmness of a starting distribution, and develop a simple rejection sampling scheme for sampling the y variable in an expected constant number of gradient queries. Our proofs are continuous in flavor and based on gradually perturbing the Gaussian and solving a differential inequality; we believe they may be of independent interest. These improvements lead to an algorithm running in \tilde{O}\left(\kappa^{2}d\right) iterations; ultimately, we use our reduction framework, stated in Theorem 1, to improve this dependence to \tilde{O}\left(\kappa d\right).

1.3.2 Logconcave finite sums

We initiate the algorithmic study of the following task in the high-accuracy regime: sample x\sim\pi within total variation distance \epsilon, where \tfrac{d\pi}{dx}(x)\propto\exp(-F(x)) and

F(x)=\frac{1}{n}\sum_{i\in[n]}f_{i}(x), (5)

all f_{i}:\mathbb{R}^{d}\rightarrow\mathbb{R} are convex and L-smooth, and F is \mu-strongly convex. We call such a distribution \pi a (well-conditioned) logconcave finite sum.

In applications (where summands correspond to points in a dataset, e.g. in Bayesian linear and logistic regression tasks [DCWY18]), querying \nabla F may be prohibitively expensive, so a natural goal is to bound the number of required queries to summand gradients \nabla f_{i} for i\in[n]. This motivation also underlies the development of stochastic gradient methods in optimization, a foundational tool in modern statistics and data processing. Naïvely, one can complete the task by using existing samplers for well-conditioned distributions and querying the full gradient \nabla F in each iteration, resulting in a summand gradient query complexity of \widetilde{O}(n\kappa d) [LST20]. Many recent works, inspired by developments in the complexity of optimizing a well-conditioned finite sum, have developed subsampled gradient methods for the sampling problem. However, to our knowledge, all such guarantees depend polynomially on the accuracy \epsilon and are measured in the 2-Wasserstein distance; in the high-accuracy, total variation case, no nontrivial query complexity is currently known.

We show in Section 6 that given access to the minimizer of F, a simple zeroth-order algorithm which queries \widetilde{O}(\kappa^{2}d) values of summands \{f_{i}\}_{i\in[n]} succeeds (i.e. it never requires a full value or gradient query of F). The algorithm is essentially the Metropolized random walk proposed in [DCWY18] for the n=1 case, with a cheaper subsampled filter step. Notably, because the random walk is conducted with respect to F, we cannot efficiently query the function value at any point; nonetheless, by randomly sampling to compute a nearly-unbiased estimator of the rejection probability, we do not incur too much error. This random walk was shown in [CDWY19] to mix in \widetilde{O}(\kappa^{2}d) iterations; we implement each step to sufficient accuracy using \widetilde{O}(1) function evaluations.
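The subsampling idea behind the filter step can be illustrated as follows (a simplified sketch with synthetic quadratic summands of our own choosing; the paper's actual filter controls the estimator's error far more carefully): the value gap F(y)-F(x) entering the rejection probability is estimated from a small random batch of summand values only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 5
A = rng.normal(size=(n, d))  # synthetic data rows; f_i(x) = (a_i . x)^2 / 2

def f_i(i, x):
    return 0.5 * float(A[i] @ x) ** 2

def exact_gap(x, y):
    # F(y) - F(x) with F = (1/n) sum_i f_i: requires all n summand values.
    return np.mean([f_i(i, y) - f_i(i, x) for i in range(n)])

def subsampled_gap(x, y, batch):
    # Unbiased estimate of F(y) - F(x) from only `batch` random summand values.
    idx = rng.integers(n, size=batch)
    return np.mean([f_i(i, y) - f_i(i, x) for i in idx])
```

Because the proposals y of the random walk stay close to x, the per-summand gaps are small, so a small batch already concentrates well around the exact gap.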

It is natural to ask if first-order information can be used to improve this query complexity, perhaps through “variance reduction” techniques (e.g. [JZ13]) developed for stochastic optimization. The idea behind variance reduction is to recenter gradient estimates in phases, occasionally computing full gradients to improve the estimate quality. One fundamental difficulty arising from the use of variance reduction in high-accuracy sampling is that the resulting algorithms are not stateless: the variance-reduced estimates depend on the history of the algorithm, making it difficult to ascertain correctness of the stationary distribution. We take a different approach to achieve a linear query dependence on the conditioning \kappa, described in the following section.

1.3.3 Proximal point reduction framework

To motivate Theorem 1, we first recast existing “proximal point” reduction-based optimization methods through the lens of composite optimization, and subsequently show that similar ideas underlying our composite sampler in Section 1.3.1 yield an analogous “proximal point reduction framework” for sampling. Chronologically, our composite sampler (originally announced in [STL20]) predates our reduction framework, which was then inspired by the perspective given here. We hope these insights prove fruitful for further development of proximal approaches to sampling.

Proximal point methods as composite optimization.

Proximal point methods are a well-studied primitive in optimization, developed by [Roc76]; cf. [PB14] for a modern survey. The principal idea is that to minimize convex F:\mathbb{R}^{d}\rightarrow\mathbb{R}, it suffices to solve a sequence of subproblems

x_{k+1}\leftarrow\textup{argmin}_{x\in\mathbb{R}^{d}}\left\{F(x)+\frac{1}{2\lambda}\left\lVert x-x_{k}\right\rVert_{2}^{2}\right\}. (6)

Intuitively, by tuning the parameter \lambda\geq 0, we trade off how regularized the subproblems (6) are against how rapidly the overall method converges. Smaller values of \lambda result in larger amounts of regularization, which are amenable to algorithms for minimizing well-conditioned objectives.
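A minimal sketch of the iteration (6), for an illustrative strongly convex quadratic F where each subproblem has a closed-form solution:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
M = rng.normal(size=(d, d))
H = M @ M.T + np.eye(d)          # Hessian of F(x) = x.H.x/2 - b.x (strongly convex)
b = rng.normal(size=d)
x_star = np.linalg.solve(H, b)   # true minimizer of F

lam = 0.5
x = np.zeros(d)
for _ in range(200):
    # Subproblem (6): argmin_x { F(x) + ||x - x_k||^2/(2*lam) },
    # solved exactly here since F is quadratic.
    x = np.linalg.solve(H + np.eye(d) / lam, b + x / lam)
```

Along a Hessian eigendirection with eigenvalue h, each step contracts the error by 1/(1+\lambda h), so smaller \lambda gives easier subproblems but slower outer convergence, matching the trade-off above.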

For optimizing functions of the form (5) via stochastic gradient estimates to \epsilon error, [JZ13, DBL14, SRB17] developed variance-reduced methods obtaining a query complexity of \widetilde{O}(n+\kappa). To match a known lower bound of \widetilde{O}(n+\sqrt{n\kappa}) due to [WS16], two works [LMH15, FGKS15] appropriately applied instances of accelerated proximal point methods [Gul92], with careful analyses of how accurately the subproblems (6) needed to be solved. These algorithms called the \widetilde{O}(n+\kappa) runtime as a black-box oracle to solve the subproblems (6) for an appropriate choice of \lambda, obtaining an accelerated rate (an improved runtime without extraneous logarithmic factors was later obtained by [All17]). To shed some light on this acceleration procedure, we adopt an alternative view on proximal point methods, a perspective which can also be found in the lecture notes [Lee18]. Consider the following known composite optimization result.

Proposition 1 (Informal statement of [BT09]).

Let f:\mathbb{R}^{d}\rightarrow\mathbb{R} be L-smooth and \mu-strongly convex, and let g:\mathbb{R}^{d}\rightarrow\mathbb{R} admit a proximal oracle \mathcal{O}(\lambda,v) (cf. Definition 3). There is an algorithm taking \widetilde{O}(\sqrt{\kappa}) iterations for \kappa=\tfrac{L}{\mu} to find an \epsilon-approximate minimizer of f+g, each querying \nabla f and \mathcal{O} a constant number of times. Further, \lambda=\tfrac{1}{L} in all calls to \mathcal{O}.

Ignoring subtleties of the error tolerance of \mathcal{O}, we show how to use an instance of Proposition 1 to recover the \widetilde{O}(n+\sqrt{n\kappa}) query complexity for optimizing (5). Let f(x)=\tfrac{\mu}{2}\left\lVert x\right\rVert_{2}^{2}, and g=F-f. For any \Lambda\geq\mu, f is both \mu-strongly convex and \Lambda-smooth. Moreover, note that all calls to the proximal oracle \mathcal{O} for g require solving subproblems of the form

\textup{argmin}_{x\in\mathbb{R}^{d}}\left\{F(x)-\frac{\mu}{2}\left\lVert x\right\rVert_{2}^{2}+\frac{\Lambda}{2}\left\lVert x-v\right\rVert_{2}^{2}\right\}. (7)

The upshot of choosing a smoothness bound \Lambda\geq\mu is that the regularization amount in (7) increases, improving the conditioning of the subproblem, which is \Lambda-strongly convex and (L+\Lambda)-smooth. The algorithm of e.g. [JZ13] solves each subproblem (7) in \widetilde{O}(n+\tfrac{L+\Lambda}{\Lambda}) gradient queries, leading to an overall query complexity (for Proposition 1) of

\widetilde{O}\left(\sqrt{\frac{\Lambda}{\mu}}\cdot\left(n+\frac{L}{\Lambda}\right)\right).

Optimizing over \Lambda\geq\mu, i.e. taking \Lambda=\max(\mu,\tfrac{L}{n}), yields the desired bound of \widetilde{O}(n+\sqrt{n\kappa}).
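This optimization over \Lambda is easy to verify numerically; a small sketch (with arbitrary illustrative values of n, L, \mu) checks that \Lambda=\max(\mu,L/n) attains the minimum of the bound, whose value is 2\sqrt{nL/\mu}=2\sqrt{n\kappa} when L/n\geq\mu.

```python
import numpy as np

def cost(lam_, n, L, mu):
    # The query bound sqrt(Lambda/mu) * (n + L/Lambda), ignoring log factors.
    return np.sqrt(lam_ / mu) * (n + L / lam_)

n, L, mu = 1000, 1.0e4, 1.0       # illustrative values; kappa = L/mu = 1e4
lams = np.logspace(0, 4, 400)     # grid over Lambda >= mu
best = lams[np.argmin(cost(lams, n, L, mu))]
claimed = max(mu, L / n)          # the choice from the text
```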

Applications to sampling.

In Sections 5 and 6, we develop samplers for structured families with quadratic dependence on the conditioning \kappa. The proximal point approach suggests a strategy for accelerating these runtimes: namely, if there is a framework which repeatedly calls a sampler for a regularized density (analogous to the calls to (6)), one could trade off the regularization against the rate of the outer loop. Fortunately, in the spirit of interpreting proximal point methods as composite optimization, the composite sampler of Section 5 itself meets these reduction framework criteria.

We briefly recall properties of our composite sampler here. Let \pi be a distribution on \mathbb{R}^{d} with \tfrac{d\pi}{dx}(x)\propto\exp(-f_{\textup{wc}}(x)-f_{\textup{oracle}}(x)) (to disambiguate, we sometimes use the notation f_{\textup{wc}}+f_{\textup{oracle}} rather than f+g in defining instances of our reduction framework or composite samplers, when convenient in context), where f_{\textup{wc}} is well-conditioned (has finite condition number \kappa) and f_{\textup{oracle}} admits an RGO, which solves subproblems of the form

\mathcal{O}(\eta,v)\sim\text{the density proportional to }\exp\left(-\frac{1}{2\eta}\left\lVert x-v\right\rVert_{2}^{2}-f_{\textup{oracle}}(x)\right). (8)

The algorithm of Section 5 only calls \mathcal{O} with a fixed \eta; note the strong parallel between the RGO subproblem and the proximal oracle of Proposition 1. For a given value of \eta\geq 0, our composite sampler runs in \widetilde{O}(\tfrac{1}{\eta\mu}) iterations, each requiring a call to \mathcal{O}. Smaller \eta improves the conditioning of the negative log-density of the subproblem (8), but increases the overall iteration count, yielding a range of trade-offs. The algorithm of Section 5 has an upper bound requirement on \eta (cf. Theorem 2); in Section 3, we observe that this requirement may be lifted when f_{\textup{wc}}=0 uniformly, allowing for a full range of choices. Moreover, the analysis of the composite sampler becomes much simpler when f_{\textup{wc}}=0, as in Theorem 1. We give the framework as Algorithm 1, along with a full (fairly short) convergence analysis. By trading off the regularization amount against the cost of implementing (8) via bootstrapping base samplers, we obtain a host of improved runtimes.

Beyond our specific applications, the framework we provide has strong implications as a generic reduction from mixing times scaling polynomially in \kappa to improved methods scaling linearly in \kappa. This is akin to the observation in [LMH15] that accelerated proximal point methods generically improve \text{poly}(\kappa) dependences to \sqrt{\kappa} dependences in optimization. We are optimistic this insight will find further implications in the logconcave sampling literature.

1.4 Erratum, and a word of warning for o(d) mixing

The initial version of this paper, presented at COLT 2021, had an incorrect proof of Theorem 1. This was due to our reliance on the average conductance (“spectral profile”) technique of [LK99] for bounding mixing. Roughly speaking, the mistake was caused by a misunderstanding that for stationary measures satisfying \mu-log-isoperimetry (for example, \mu-strongly logconcave densities) and with transition distributions of \Delta-close points having constant overlap, [LK99] provides mixing time bounds of the form (where \beta is a warmness parameter of the starting distribution)

\int_{\frac{1}{\beta}}^{\frac{1}{2}}\frac{1}{s\Phi(s)^{2}}ds\lesssim\frac{1}{\mu\Delta^{2}}\int_{\frac{1}{\beta}}^{\frac{1}{2}}\frac{1}{s\log(s)}ds\approx\frac{1}{\mu\Delta^{2}}\log\log\beta. (9)

Here, \Phi(s) is the s-conductance of the Markov chain, which can typically be lower bounded by \Omega(\sqrt{\mu\log(s)}\Delta) under a stationary density exhibiting log-isoperimetry. However, the trivial bound \Phi(s)\leq 1 demonstrates that there is an additive \log(\beta) term in (9). This is a bottleneck towards mixing times scaling as o(d) for distributions where only an \exp(\Omega(d))-warm start is feasible; in particular, the conductance actually scales as \min(1,\Omega(\sqrt{\mu\log(s)}\Delta)), causing this additive term. In settings where \mu\Delta^{2}\geq d^{-1} (such as our reductions, where this term often scales as a condition number of the problem), this additive term \log(\beta)=\Omega(d) may dominate. This observation (and the fix) came out of conversations with Sinho Chewi; we are immensely grateful for his help.

For the particular structure of the algorithm in Theorem 1, we are able to give an alternative analysis going through W_{2} convergence bounds, preserving the correctness of the theorem. However, this bottleneck is a general phenomenon which may cause future attempts to use Metropolized algorithms from exponentially warm starts to be stuck at \Omega(d) iterations, which merits further investigation. We write this section as a word of warning to future researchers aiming at sublinear dimension dependences in Metropolized algorithms, and as a suggested open research direction.

1.5 Roadmap

We give notation and preliminaries in Section 2. In Section 3 we give our framework for bootstrapping fast regularized samplers, and prove its correctness (Theorem 1). Assuming the “base samplers” of Theorems 2 and 3, in Section 4 we apply our reduction to obtain all of our strongest guarantees, namely Corollaries 1, 2, and 3. We then prove Theorems 2 and 3 in Sections 5 and 6.

2 Preliminaries

2.1 Notation

General notation.

For d\in\mathbb{N}, [d] refers to the set of naturals 1\leq i\leq d; \left\lVert\cdot\right\rVert_{2} is the Euclidean norm on \mathbb{R}^{d} when d is clear from context. \mathcal{N}(\mu,\boldsymbol{\Sigma}) is the multivariate Gaussian of specified mean and covariance, \mathbf{I} is the identity of appropriate dimension when clear from context, and \preceq is the Loewner order on symmetric matrices.

Functions.

We say a twice-differentiable function f:\mathbb{R}^{d}\rightarrow\mathbb{R} is L-smooth and \mu-strongly convex if \mu\mathbf{I}\preceq\nabla^{2}f(x)\preceq L\mathbf{I} for all x\in\mathbb{R}^{d}; it is well-known that L-smoothness implies that f has an L-Lipschitz gradient, and that for any x,y\in\mathbb{R}^{d},

f(x)+\left\langle\nabla f(x),y-x\right\rangle+\frac{\mu}{2}\left\lVert y-x\right\rVert_{2}^{2}\leq f(y)\leq f(x)+\left\langle\nabla f(x),y-x\right\rangle+\frac{L}{2}\left\lVert y-x\right\rVert_{2}^{2}.

If f is L-smooth and \mu-strongly convex, we say it has a condition number \kappa:=\tfrac{L}{\mu}. We call a zeroth-order oracle, or “value oracle”, an oracle which returns f(x) on any input point x\in\mathbb{R}^{d}; similarly, a first-order oracle, or “gradient oracle”, returns both the value and gradient (f(x),\nabla f(x)).
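As a quick numerical check of the two-sided bound above, the sketch below uses the 1-D example f(x)=x^{2}/2+\log(1+e^{x}) (our illustrative choice), whose second derivative lies in (1, 1.25], so \mu=1 and L=1.25.

```python
import numpy as np

MU, L = 1.0, 1.25

def f(x):
    # f(x) = x^2/2 + log(1 + e^x); f''(x) = 1 + sigma(x)*(1 - sigma(x)) in (1, 1.25].
    return 0.5 * x ** 2 + np.logaddexp(0.0, x)

def df(x):
    # f'(x) = x + sigma(x), with sigma the logistic function.
    return x + 1.0 / (1.0 + np.exp(-x))
```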

Distributions.

We call a distribution \pi on \mathbb{R}^{d} logconcave if \tfrac{d\pi}{dx}(x)=\exp(-f(x)) for convex f; \pi is \mu-strongly logconcave if f is \mu-strongly convex. For A\subseteq\mathbb{R}^{d}, A^{c} is its complement, and we let \pi(A):=\int_{x\in A}d\pi(x). We say a distribution \rho is \beta-warm with respect to \pi if \tfrac{d\rho}{d\pi}(x)\leq\beta everywhere, and define the total variation \left\lVert\pi-\rho\right\rVert_{\textup{TV}}:=\sup_{A\subseteq\mathbb{R}^{d}}\pi(A)-\rho(A). We will frequently use the fact that \left\lVert\pi-\rho\right\rVert_{\textup{TV}} is also the probability that x\sim\pi and x^{\prime}\sim\rho are unequal under the best coupling of (\pi,\rho); this allows us to “locally share randomness” when comparing two random walk procedures. We define the expectation \mathbb{E}_{\pi} and variance \textup{Var}_{\pi} with respect to distribution \pi in the standard way,

\mathbb{E}_{\pi}[h(x)]:=\int h(x)d\pi(x),\;\textup{Var}_{\pi}[h(x)]:=\mathbb{E}_{\pi}\left[(h(x))^{2}\right]-\left(\mathbb{E}_{\pi}[h(x)]\right)^{2}.
Structured distributions.

This work considers two types of distributions with additional structure, which we call composite logconcave densities and logconcave finite sums. A composite logconcave density has the form \exp(-f(x)-g(x)), where both f and g are convex. In this context throughout, f will either be uniformly 0 or have a finite condition number (be “well-conditioned”), and g will represent a “simple” but possibly non-smooth function, as measured by admitting an RGO (cf. Definition 1). We will sometimes refer to the components f and g as f_{\textup{wc}} and f_{\textup{oracle}} respectively, to disambiguate when the functions f and g are already defined in context. In our reduction-based approaches, we have additional structure on the parameter \lambda which an RGO is called with (cf. Definition 2). Specifically, in our instances typically \lambda^{-1}\gg L (or some other “niceness” parameter associated with the negative log-density); this can be seen as heavily regularizing the negative log-density, and often makes the implementation simpler.

Finally, a logconcave finite sum has density of the form \exp(-F(x)) where F(x)=\tfrac{1}{n}\sum_{i\in[n]}f_{i}(x). When treating such densities, we make the assumption that each constituent summand f_{i} is L-smooth and convex, and the overall function F is \mu-strongly convex. We measure the complexity of algorithms for logconcave finite sums by gradient queries to summands, i.e. \nabla f_{i}(x) for some i\in[n].

Optimization.

Throughout this work, we are somewhat liberal in assuming access to minimizers of various functions (namely, the negative log-densities of target distributions). We give a more thorough discussion of this assumption in Appendix A, but note here that for all function families we consider (well-conditioned, composite, and finite sum), efficient first-order methods exist for obtaining high-accuracy minimizers, and this optimization query complexity is never the leading-order term in any of our algorithms, assuming polynomially bounded initial error.

2.2 Technical facts

We will repeatedly use the following results.

Fact 1 (Gaussian integral).

For any \lambda\geq 0 and v\in\mathbb{R}^{d},

\int\exp\left(-\frac{1}{2\lambda}\left\lVert x-v\right\rVert_{2}^{2}\right)dx=(2\pi\lambda)^{\frac{d}{2}}.
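A 1-D numerical check of Fact 1 (the wide-grid Riemann sum and the parameter values are our own sketch):

```python
import numpy as np

def gaussian_integral_1d(lam, v):
    # Riemann sum of exp(-(x - v)^2/(2*lam)) over a grid wide enough
    # that the truncated tails are negligible; compare against sqrt(2*pi*lam).
    x = np.linspace(v - 50.0, v + 50.0, 400001)
    vals = np.exp(-((x - v) ** 2) / (2 * lam))
    return vals.sum() * (x[1] - x[0])
```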
Fact 2 ([DCWY18], Lemma 1).

Let \pi be a \mu-strongly logconcave distribution, and let x^{*} minimize its negative log-density. Then, for x\sim\pi and any \delta\in[0,1], with probability at least 1-\delta,

\left\lVert x-x^{*}\right\rVert_{2}\leq\sqrt{\frac{d}{\mu}}\left(2+2\max\left(\sqrt[4]{\frac{\log(1/\delta)}{d}},\sqrt{\frac{\log(1/\delta)}{d}}\right)\right).
Fact 3 ([Har04], Theorem 1.1).

Let \pi be a \mu-strongly logconcave density. Let d\gamma_{\mu}(x) be the Gaussian density with covariance matrix \mu^{-1}\mathbf{I}. For any convex function h,

\mathbb{E}_{\pi}[h(x-\mathbb{E}_{\pi}[x])]\leq\mathbb{E}_{\gamma_{\mu}}[h(x-\mathbb{E}_{\gamma_{\mu}}[x])].
Fact 4 ([DM19a], Theorem 1).

Let \pi be a \mu-strongly logconcave distribution, and let x^{*} minimize its negative log-density. Then, \mathbb{E}_{\pi}[\left\lVert x-x^{*}\right\rVert_{2}^{2}]\leq\tfrac{d}{\mu}.
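A quick Monte Carlo check of Fact 4, using a Gaussian with covariance \preceq\mu^{-1}\mathbf{I} as the \mu-strongly logconcave example (an illustrative instance of ours, where the second moment is exactly the trace of the covariance):

```python
import numpy as np

rng = np.random.default_rng(4)
d, mu = 10, 2.0
# Diagonal covariances s_i <= 1/mu make this Gaussian mu-strongly logconcave.
s = rng.uniform(0.1, 1.0 / mu, size=d)
x_star = rng.normal(size=d)  # the mode, i.e. the negative log-density minimizer
samples = x_star + rng.normal(size=(100000, d)) * np.sqrt(s)
second_moment = np.mean(np.sum((samples - x_star) ** 2, axis=1))
```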

3 Proximal reduction framework

The reduction framework of Theorem 1 can be thought of as a specialization of a more general composite sampler which we develop in Section 5, whose guarantees are reproduced here.

See 2

Our main observation, elaborated on more formally for specific applications in Section 4, is that a variety of structured logconcave densities have negative log-density components f_{\textup{oracle}} for which we can implement an efficient restricted Gaussian oracle by calling an existing sampling method. Crucially, in these instantiations we use the fact that the distributions which \mathcal{O} is required to sample from are heavily regularized (restricted by a quadratic with large leading coefficient) to obtain fast samplers. We further note that the upper bound requirement on \eta in Theorem 2 can be lifted when the “well-conditioned” component is uniformly 0. Instead of setting f=0 and g=f_{\textup{oracle}} in Theorem 2 and refining the analysis for this special case to tolerate arbitrary \eta, we provide a self-contained proof here. This particular structure (the composite setting where f_{\textup{wc}} is uniformly zero and f_{\textup{oracle}} is strongly convex) admits significant simplifications from the more general case; using slightly different proof techniques, we obtain stronger convergence guarantees for this particular problem, allowing for mixing in fewer than d iterations from a feasible start (see Section 1.4).

See 1

For simplicity of notation, we replace f_{\textup{oracle}} in the statement of Theorem 1 with g throughout just this section. Let \pi be a density on \mathbb{R}^{d} with \tfrac{d\pi}{dx}(x)\propto\exp(-g(x)) where g is \mu-strongly convex (but possibly non-smooth), and let \mathcal{O} be a restricted Gaussian oracle for g. Consider the joint distribution \hat{\pi} supported on an expanded space z=(x,y)\in\mathbb{R}^{d}\times\mathbb{R}^{d} with density, for some \eta>0,

\frac{d\hat{\pi}}{dz}(z)\propto\exp\left(-g(x)-\frac{1}{2\eta}\left\lVert x-y\right\rVert_{2}^{2}\right).

Note that the x-marginal of \hat{\pi} is precisely \pi, so it suffices to sample from the x-marginal. We consider a simple alternating Markov chain for sampling from \hat{\pi}, described in the following Algorithm 1.

Algorithm 1 \texttt{AlternateSample}(g,\eta,T)

Input: \mu-strongly convex g:\mathbb{R}^{d}\rightarrow\mathbb{R}, \eta>0, T\in\mathbb{N}, x_{0}=\textup{argmin}_{x}g(x).

1:for k\in[T] do
2:     Sample y_{k}\sim\pi_{x_{k-1}}, where for all x\in\mathbb{R}^{d}, \tfrac{d\pi_{x}}{dy}(y)\propto\exp\left(-\tfrac{1}{2\eta}\left\lVert x-y\right\rVert_{2}^{2}\right).
3:     Sample x_{k}\sim\pi_{y_{k}}, where for all y\in\mathbb{R}^{d}, \tfrac{d\pi_{y}}{dx}(x)\propto\exp\left(-g(x)-\tfrac{1}{2\eta}\left\lVert x-y\right\rVert_{2}^{2}\right).
4:end for
5:return x_{T}
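A direct transcription of Algorithm 1 with the RGO supplied as a callback; to exercise it we use the illustrative choice g(x)=\mu\left\lVert x\right\rVert_{2}^{2}/2 (ours, not the paper's), for which the RGO draw in Line 3 is an explicit Gaussian.

```python
import numpy as np

rng = np.random.default_rng(5)

def alternate_sample(rgo, x0, eta, T):
    # Algorithm 1: alternate y_k ~ N(x_{k-1}, eta*I) (Line 2) with an RGO call (Line 3).
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        y = x + np.sqrt(eta) * rng.normal(size=x.shape)
        x = rgo(eta, y)
    return x

mu, eta = 2.0, 0.25

def rgo_quadratic(eta, y):
    # For g(x) = mu*||x||^2/2 the density exp(-g(x) - ||x-y||^2/(2*eta))
    # is Gaussian with mean y/(1 + eta*mu) and precision mu + 1/eta.
    return y / (1 + eta * mu) + rng.normal(size=y.shape) / np.sqrt(mu + 1 / eta)

# many independent runs from x_0 = argmin g = 0; the outputs should follow pi = N(0, 1/mu)
xs = np.array([alternate_sample(rgo_quadratic, np.zeros(1), eta, 40)[0]
               for _ in range(10000)])
```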

By observing that the distributions \pi_{x} and \pi_{y} in the above method are precisely the marginal distributions of \hat{\pi} with one variable fixed, it is straightforward to see that \hat{\pi} is a stationary distribution of the process. We make this formal in the following lemma.

Lemma 1 (Alternating marginal sampling).

Let \hat{\pi} be a density on two blocks (x,y). Sample (x,y)\sim\hat{\pi}, and then sample \tilde{x}\sim\hat{\pi}(\cdot,y), \tilde{y}\sim\hat{\pi}(\tilde{x},\cdot). Then, the distribution of (\tilde{x},\tilde{y}) is \hat{\pi}. Moreover, the alternating marginal sampling Markov chain on either marginal is reversible.

Proof.

The density of the resulting distribution at (\tilde{x},y) is proportional to the product of the (marginal) density at y and the conditional density of \tilde{x}\mid y, which by definition is \hat{\pi}. Therefore, (\tilde{x},y) is distributed as \hat{\pi}, and the argument for \tilde{y} follows symmetrically. To see reversibility on the x-marginal, letting \hat{\pi}^{y} denote the y-marginal density of \hat{\pi}, it suffices to note that the probability flow from x to x^{\prime} (the stationary density at x times the transition density) is proportional to

\int_{y}\frac{\hat{\pi}(x,y)\hat{\pi}(x^{\prime},y)}{\hat{\pi}^{y}(y)}dy,

which is a symmetric function of x and x^{\prime}. A similar argument holds for the y-marginal. ∎

We also state a simple observation about alternating schemes such as Algorithm 1, which will be useful later. Let \mathcal{P}_{x} be the density of y_{k} after one step of the above procedure starting from x_{k-1}=x, and let \mathcal{T}_{x} be the resulting density of x_{k}.

Observation 1.

For any two points x, x^{\prime}\in\mathbb{R}^{d}, \left\lVert\mathcal{T}_{x}-\mathcal{T}_{x^{\prime}}\right\rVert_{\textup{TV}}\leq\left\lVert\mathcal{P}_{x}-\mathcal{P}_{x^{\prime}}\right\rVert_{\textup{TV}}.

Proof.

This follows by the coupling characterization of total variation (see e.g. Chapter 5 of [LPW09]). Under the optimal coupling of y\sim\mathcal{P}_{x} and y^{\prime}\sim\mathcal{P}_{x^{\prime}}, whenever the coupling sets y=y^{\prime} in Line 2 of \texttt{AlternateSample}, we can couple the resulting distributions in Line 3 as well. ∎

In order to prove Theorem 1, we first show that the random walk in Algorithm 1 converges rapidly in the 2-Wasserstein distance (denoted W_{2} in this section).

Lemma 2.

Let \pi_{0} be the starting distribution of x in Algorithm 1. Let \pi_{k} be the distribution of x_{k} and \pi be the x-marginal of \hat{\pi}. For all k\geq 0,

W_{2}^{2}(\pi_{k+1},\pi)\leq\frac{1}{(1+\eta\mu)^{2}}W^{2}_{2}(\pi_{k},\pi).

Hence, for any \eta\leq\frac{1}{\mu}, in T^{\prime}=O\left(\frac{1}{\eta\mu}\log{\frac{d}{\mu\Delta}}\right) iterations, the random walk mixes to

W_{2}(\pi_{T^{\prime}},\pi)\leq\Delta.
Proof.

Let \Gamma_{x_{k}} be the optimal coupling between x_{k}\sim\pi_{k} and \hat{x}\sim\pi according to the W_{2} distance. Coupling the Gaussian random variables generating y_{k+1}\sim\pi_{x_{k}} and \hat{y}\sim\pi_{\hat{x}} gives a coupling \Gamma_{y_{k+1}} between y_{k+1} and \hat{y} such that

\mathbb{E}_{\Gamma_{y_{k+1}}}\left[\left\lVert y_{k+1}-\hat{y}\right\rVert_{2}^{2}\right]=\mathbb{E}_{\Gamma_{x_{k}}}\left[\left\lVert x_{k}-\hat{x}\right\rVert_{2}^{2}\right]. (10)

Then, let \pi_{y} be the distribution of x_{k+1} in a run of Line 3 of Algorithm 1 starting from y_{k+1}=y, and let \pi_{\hat{y}} be the distribution of \hat{x} in Line 3 starting from \hat{y}, respectively. Since \pi_{\hat{y}} is (\mu+\frac{1}{\eta})-strongly logconcave, \pi_{\hat{y}} satisfies a log-Sobolev inequality with constant \mu+\frac{1}{\eta} (Theorem 2 of [OV00]). Hence,

W_{2}^{2}(\pi_{y},\pi_{\hat{y}}) \leq\frac{2}{\mu+\frac{1}{\eta}}d_{\textup{KL}}(\pi_{y}\|\pi_{\hat{y}})
\leq\frac{1}{\left(\mu+\frac{1}{\eta}\right)^{2}}\mathbb{E}_{\pi_{y}}\left[\left\lVert\nabla\log\frac{\pi_{y}}{\pi_{\hat{y}}}\right\rVert_{2}^{2}\right]
\leq\frac{1}{(1+\eta\mu)^{2}}\left\lVert y-\hat{y}\right\rVert^{2}_{2}.

The first step used the Talagrand transportation inequality (Theorem 1 of [OV00]). The second step used the log-Sobolev inequality. The third step used

\nabla\log\frac{\pi_{y}(x)}{\pi_{\hat{y}}(x)} =\nabla\log\frac{\exp(-g(x)-\frac{1}{2\eta}\left\lVert x-y\right\rVert_{2}^{2})\int\exp(-g(x^{\prime})-\frac{1}{2\eta}\left\lVert x^{\prime}-\hat{y}\right\rVert_{2}^{2})dx^{\prime}}{\exp(-g(x)-\frac{1}{2\eta}\left\lVert x-\hat{y}\right\rVert_{2}^{2})\int\exp(-g(x^{\prime})-\frac{1}{2\eta}\left\lVert x^{\prime}-y\right\rVert_{2}^{2})dx^{\prime}}
=\frac{1}{2\eta}\nabla\left(\left\lVert x-\hat{y}\right\rVert_{2}^{2}-\left\lVert x-y\right\rVert_{2}^{2}\right)=\frac{1}{\eta}(y-\hat{y}). (11)

Taking expectations over \Gamma_{y_{k+1}} and using (10) shows that

W_{2}^{2}(\pi_{k+1},\pi)\leq\frac{1}{(1+\eta\mu)^{2}}W^{2}_{2}(\pi_{k},\pi).

Algorithm 1 starts from the distribution $\pi_{0}=\delta_{x^{*}}$, where $x^{*}=\textup{argmin}_{x}g(x)$. Since $\pi$ is $\mu$-strongly logconcave, we have (see e.g. Proposition 1 of [DM19b])

\[
W_{2}^{2}(\pi_{0},\pi)=\mathbb{E}_{x\sim\pi}\left[\left\lVert x^{*}-x\right\rVert_{2}^{2}\right]\leq\frac{d}{\mu}.
\]

Then, for $\eta<\frac{1}{\mu}$ we have $\frac{1}{1+\eta\mu}\leq 1-\frac{\eta\mu}{2}$, so after $T'=O(\frac{1}{\eta\mu}\log\frac{d}{\mu\Delta})$ iterations, $W_{2}(\pi_{T'},\pi)\leq\Delta$. ∎
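To illustrate the contraction in Lemma 2 concretely, consider the special case $g(x)=\frac{\mu}{2}x^{2}$ in one dimension, where both steps of Algorithm 1 are Gaussian: the $y$-step draws $y\sim\mathcal{N}(x,\eta)$, and the restricted Gaussian oracle step draws $x\sim\mathcal{N}(\frac{y}{1+\eta\mu},\frac{\eta}{1+\eta\mu})$. The following sketch (our own toy check, not part of the paper) couples two chains by sharing their Gaussian noise, in the spirit of the coupling $\Gamma_{y_{k+1}}$, and verifies the per-iteration contraction factor $\frac{1}{1+\eta\mu}$.

```python
import math
import random

def coupled_step(x1, x2, eta, mu, xi, z):
    # y-step: y = x + sqrt(eta) * xi; shared noise gives |y1 - y2| = |x1 - x2|, as in (10)
    y1, y2 = x1 + math.sqrt(eta) * xi, x2 + math.sqrt(eta) * xi
    # RGO step for g(x) = mu * x^2 / 2: x' ~ N(y / (1 + eta*mu), eta / (1 + eta*mu))
    s = math.sqrt(eta / (1 + eta * mu))
    return y1 / (1 + eta * mu) + s * z, y2 / (1 + eta * mu) + s * z

random.seed(0)
eta, mu = 0.3, 2.0
x1, x2 = 5.0, -1.0
for _ in range(10):
    xi, z = random.gauss(0, 1), random.gauss(0, 1)
    n1, n2 = coupled_step(x1, x2, eta, mu, xi, z)
    # the coupled distance contracts by exactly 1 / (1 + eta * mu) per iteration
    assert abs(abs(n1 - n2) / abs(x1 - x2) - 1 / (1 + eta * mu)) < 1e-12
    x1, x2 = n1, n2
```

Under this synchronous coupling the contraction is exact every iteration; the lemma's Wasserstein bound is the analogous statement for general strongly logconcave $g$.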

Next, we bound the KL divergence between the output of Algorithm 1 and the target distribution $\pi$. We need the following standard lemma on KL divergences of marginal distributions.

Lemma 3.

Let $P_{z}$ and $Q_{z}$ be distributions supported on $\mathcal{X}$, indexed by a random variable $z$ distributed as $\pi_{z}$. Let $\widetilde{P}$ be the joint distribution of $(x,z)$ for $x\sim P_{z}$ and $z\sim\pi_{z}$, and $\widetilde{Q}$ be the joint distribution of $(x,z)$ for $x\sim Q_{z}$ and $z\sim\pi_{z}$. Let $P$ and $Q$ be the marginal distributions of $\widetilde{P}$ and $\widetilde{Q}$ on $x$, averaged over $z$. Then,

\[
d_{\textup{KL}}(P\|Q)\leq\mathbb{E}_{z\sim\pi_{z}}\left[d_{\textup{KL}}(P_{z}\|Q_{z})\right].
\]
Proof.

By the definition of $d_{\textup{KL}}$,

\begin{align*}
d_{\textup{KL}}(\widetilde{P}\|\widetilde{Q}) &= \mathbb{E}_{(x,z)\sim\widetilde{P}}\left[\log\frac{\widetilde{P}(x,z)}{\widetilde{Q}(x,z)}\right] \\
&= \mathbb{E}_{z\sim\pi_{z}}\left[\mathbb{E}_{x\sim P_{z}}\left[\log\frac{\widetilde{P}(x,z)}{\widetilde{Q}(x,z)}\right]\right] \\
&= \mathbb{E}_{z\sim\pi_{z}}\left[\mathbb{E}_{x\sim P_{z}}\left[\log\frac{P_{z}(x)}{Q_{z}(x)}\right]\right] \\
&= \mathbb{E}_{z\sim\pi_{z}}\left[d_{\textup{KL}}(P_{z}\|Q_{z})\right].
\end{align*}

Finally, by the data processing inequality,

\[
d_{\textup{KL}}(P\|Q)\leq d_{\textup{KL}}(\widetilde{P}\|\widetilde{Q})=\mathbb{E}_{z\sim\pi_{z}}\left[d_{\textup{KL}}(P_{z}\|Q_{z})\right].
\]
∎
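Lemma 3 is easy to verify numerically on discrete distributions, where every quantity is a finite sum. The toy check below (our own illustration; the distributions are arbitrary) computes both sides of the inequality for a two-valued $z$ and a three-point space $\mathcal{X}$.

```python
import math

def kl(p, q):
    # discrete KL divergence sum_i p_i log(p_i / q_i)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# z takes two values with probabilities pi_z; P_z, Q_z are conditionals on a
# 3-point space (toy numbers, chosen only for illustration)
pi_z = [0.4, 0.6]
P = {0: [0.7, 0.2, 0.1], 1: [0.1, 0.3, 0.6]}
Q = {0: [0.5, 0.3, 0.2], 1: [0.2, 0.5, 0.3]}

# marginals of the joints, averaged over z
P_marg = [sum(pi_z[z] * P[z][i] for z in (0, 1)) for i in range(3)]
Q_marg = [sum(pi_z[z] * Q[z][i] for z in (0, 1)) for i in range(3)]

lhs = kl(P_marg, Q_marg)
rhs = sum(pi_z[z] * kl(P[z], Q[z]) for z in (0, 1))
assert lhs <= rhs + 1e-12  # Lemma 3: marginal KL <= expected conditional KL
```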

The following lemma shows that a 2-Wasserstein distance bound on the distribution at iteration $k$ implies a KL divergence bound at iteration $k+1$.

Lemma 4.

Let $\pi_{k}$ be the distribution of $x_{k}$ for some $k$ such that $W_{2}(\pi_{k},\pi)\leq\Delta$, and let $\pi$ be the $x$-marginal of $\hat{\pi}$. Then,

\[
d_{\textup{KL}}(\pi_{k+1}\|\pi)\leq\frac{\Delta^{2}}{2\eta}.
\]
Proof.

As in Lemma 2, let $\Gamma_{x_{k}}$ be the optimal coupling between $x_{k}\sim\pi_{k}$ and $\hat{x}\sim\pi$, which yields a coupling $\Gamma_{y_{k+1}}$ between $y_{k+1}$ and $\hat{y}$ such that

\[
\mathbb{E}_{\Gamma_{y_{k+1}}}\left[\left\lVert y_{k+1}-\hat{y}\right\rVert_{2}^{2}\right]=\mathbb{E}_{\Gamma_{x_{k}}}\left[\left\lVert x_{k}-\hat{x}\right\rVert_{2}^{2}\right]\leq\Delta^{2}. \tag{12}
\]

Then,

\begin{align*}
d_{\textup{KL}}(\pi_{k+1}\|\pi) &\leq \mathbb{E}_{(y_{k+1},\hat{y})\sim\Gamma_{y_{k+1}}}\left[d_{\textup{KL}}(\pi_{y_{k+1}}\|\pi_{\hat{y}})\right] \\
&\leq \frac{1}{2\eta^{2}\left(\mu+\frac{1}{\eta}\right)}\mathbb{E}_{(y_{k+1},\hat{y})\sim\Gamma_{y_{k+1}}}\left[\left\lVert y_{k+1}-\hat{y}\right\rVert_{2}^{2}\right]\leq\frac{\Delta^{2}}{2\eta}.
\end{align*}

The first inequality followed from Lemma 3 by taking $P=\pi_{k+1}$, $Q=\pi$, and $z=(y_{k+1},\hat{y})$. The second inequality used the log-Sobolev inequality together with the identity (11). The last inequality used (12). ∎

Finally, putting the pieces together, Theorem 1 follows from Lemma 2 and Lemma 4.

Proof of Theorem 1.

By Lemma 2 and Lemma 4, there is $T=O\left(\frac{1}{\eta\mu}\log\frac{d}{\eta\mu\epsilon}\right)$ such that $d_{\textup{KL}}(\pi_{T}\|\pi)\leq 2\epsilon^{2}$. By Pinsker's inequality,

\[
\left\lVert\pi_{T}-\pi\right\rVert_{\textup{TV}}\leq\sqrt{\frac{1}{2}d_{\textup{KL}}(\pi_{T}\|\pi)}\leq\epsilon.
\]
∎

We note that Theorem 1 is robust to a small amount of error tolerance in the sampler $\mathcal{O}$. Specifically, if $\mathcal{O}$ has tolerance $\frac{\epsilon}{2T}$, then by calling Theorem 1 with desired accuracy $\frac{\epsilon}{2}$ and adjusting constants appropriately, the cumulative error incurred by all calls to $\mathcal{O}$ is within the total requisite bound (formally, this can be shown via the coupling characterization of total variation distance). We defer a more formal treatment of this inexactness argument to Appendix A and the proof of Proposition 5.

4 Tighter runtimes for structured densities

In this section, we apply Theorem 1 to obtain simple analyses yielding state-of-the-art high-accuracy runtimes for the well-conditioned densities studied in [DCWY18, CDWY19, LST20], as well as for the composite and finite sum densities studied in this work. We assume the conclusions of Theorems 2 and 3 in deriving the results of Sections 4.2 and 4.3, respectively.

4.1 Well-conditioned logconcave sampling: proof of Corollary 1

In this section, let $\pi$ be a distribution on $\mathbb{R}^{d}$ with density proportional to $\exp(-f(x))$, where $f$ is $L$-smooth and $\mu$-strongly convex (with $\kappa=\frac{L}{\mu}$) and has a pre-computed minimizer $x^{*}$. We instantiate Theorem 1 with $f_{\textup{oracle}}(x)=f(x)$, and choose $\eta=\frac{1}{8Ld\log\kappa}$. We thus require an $\eta$-RGO $\mathcal{O}$ for $f_{\textup{oracle}}=f$ to use in Theorem 1.

Our implementation of $\mathcal{O}$ is a rejection sampling scheme. We use the following helpful guarantee.

Lemma 5 (Rejection sampling).

Let $\pi$, $\hat{\pi}$ be distributions on $\mathbb{R}^{d}$ with $\frac{d\pi}{dx}(x)\propto p(x)$ and $\frac{d\hat{\pi}}{dx}(x)\propto\hat{p}(x)$. Suppose for some $C\geq 1$ and all $x\in\mathbb{R}^{d}$, $\frac{p(x)}{\hat{p}(x)}\leq C$. The following procedure is termed "rejection sampling": repeat independent runs of the following two steps until a point is output.

  1. Draw $x\sim\hat{\pi}$.

  2. With probability $\frac{p(x)}{C\hat{p}(x)}$, output $x$.

Rejection sampling terminates in $\frac{C\int\hat{p}(x)dx}{\int p(x)dx}$ runs in expectation, and the output distribution is $\pi$.

Proof.

The second claim follows from Bayes' rule, which implies the conditional density of the output point is proportional to $\hat{p}(x)\cdot\frac{p(x)}{C\hat{p}(x)}\propto p(x)$, so the output distribution is $\pi$. For the first claim, the probability that any given run outputs a point is

\[
\int\frac{p(x)}{C\hat{p}(x)}\,d\hat{\pi}(x)=\frac{1}{C}\int\frac{p(x)}{\hat{p}(x)}\cdot\frac{\hat{p}(x)}{\int\hat{p}(x')dx'}\,dx=\frac{\int p(x)dx}{C\int\hat{p}(x)dx}.
\]

The conclusion follows by independence and linearity of expectation. ∎
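Lemma 5 can be checked empirically. In the sketch below (a hypothetical one-dimensional example of our own), we take $p(x)=\exp(-\frac{x^{2}}{2}-\frac{x^{4}}{10})$ with a standard Gaussian envelope $\hat{p}(x)=\exp(-\frac{x^{2}}{2})$, so that $\frac{p}{\hat{p}}\leq 1$ and we may take $C=1$; the empirical number of runs per accepted sample should match $\frac{C\int\hat{p}}{\int p}$.

```python
import math
import random

f = lambda x: 0.5 * x * x + 0.1 * x ** 4      # target negative log-density
p = lambda x: math.exp(-f(x))
p_hat = lambda x: math.exp(-0.5 * x * x)      # Gaussian envelope; p/p_hat <= 1, so C = 1

# expected runs per sample: C * int(p_hat) / int(p), via a Riemann sum on [-8, 8]
xs = [-8 + 0.001 * i for i in range(16001)]
Zp = sum(p(x) for x in xs) * 0.001
Zhat = sum(p_hat(x) for x in xs) * 0.001
expected_runs = Zhat / Zp

random.seed(1)
runs, samples = 0, 0
while samples < 20000:
    x = random.gauss(0, 1)                    # draw from the envelope
    runs += 1
    if random.random() <= p(x) / p_hat(x):    # accept with probability p / (C * p_hat)
        samples += 1
assert abs(runs / samples - expected_runs) < 0.05
```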

We further state a concentration bound, shown first in [LST20], on the gradient norm at a point drawn from a logsmooth distribution.

Proposition 2 (Logsmooth gradient concentration, Corollary 3.3, [LST20]).

Let $\pi$ be a distribution on $\mathbb{R}^{d}$ with $\frac{d\pi}{dx}(x)\propto\exp(-f(x))$, where $f$ is convex and $L$-smooth. With probability at least $1-\kappa^{-d}$ over $x\sim\pi$,

\[
\left\lVert\nabla f(x)\right\rVert_{2}\leq 3\sqrt{L}d\log\kappa. \tag{13}
\]

By the requirements of Theorem 1, the restricted Gaussian oracle $\mathcal{O}$ must only be able to draw samples from densities of the form, for some $y\in\mathbb{R}^{d}$,

\[
\exp\left(-f_{\textup{oracle}}(x)-\frac{1}{2\eta}\left\lVert x-y\right\rVert_{2}^{2}\right)=\exp\left(-f(x)-4Ld\log\kappa\left\lVert x-y\right\rVert_{2}^{2}\right). \tag{14}
\]

We use the following Algorithm 2 to implement $\mathcal{O}$.

Algorithm 2 $\texttt{XSample}(f,y,\eta)$

Input: $L$-smooth, $\mu$-strongly convex $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$, $y\in\mathbb{R}^{d}$, $\eta>0$

1: if $\left\lVert\nabla f(y)\right\rVert_{2}\leq 3\sqrt{L}d\log\kappa$ then
2:     while true do
3:         Draw $x\sim\mathcal{N}(y-\eta\nabla f(y),\eta\mathbf{I})$
4:         Draw $\tau\sim\textup{Unif}[0,1]$
5:         if $\tau\leq\exp\left(f(y)+\left\langle\nabla f(y),x-y\right\rangle-f(x)\right)$ then
6:             return $x$
7:         end if
8:     end while
9: end if
10: Use [CDWY19] to sample $x$ from (14) to total variation distance $\frac{\epsilon}{\Theta(\kappa d^{2}\log^{3}(\frac{\kappa d}{\epsilon}))}$ using $O(d\log\frac{\kappa d}{\epsilon})$ queries to $\nabla f$ (Theorem 1, [CDWY19], where (14) has constant condition number)
11: return $x$
Lemma 6.

Let $\eta=\frac{1}{8Ld\log\kappa}$, and suppose $y$ satisfies the bound in (13), i.e. $\left\lVert\nabla f(y)\right\rVert_{2}\leq 3\sqrt{L}d\log\kappa$. Then, whenever the condition of Line 1 is met, Line 3 of Algorithm 2 runs an expected at most 2 times, and Algorithm 2 samples exactly from (14).

Proof.

Note that when the condition of Line 1 is met, Algorithm 2 is an instantiation of rejection sampling (Lemma 5) with

\begin{align*}
p(x) &= \exp\left(-f(x)-\frac{1}{2\eta}\left\lVert x-y\right\rVert_{2}^{2}\right), \\
\hat{p}(x) &= \exp\left(-f(y)-\left\langle\nabla f(y),x-y\right\rangle-\frac{1}{2\eta}\left\lVert x-y\right\rVert_{2}^{2}\right).
\end{align*}

By convexity of $f$, we may take $C=1$. Next, by applying Fact 1 twice and using $L$-smoothness of $f$,

\begin{align*}
\int p(x)dx &\geq \int\exp\left(-f(y)-\left\langle\nabla f(y),x-y\right\rangle-\frac{1+\eta L}{2\eta}\left\lVert x-y\right\rVert_{2}^{2}\right)dx \\
&= \exp\left(-f(y)+\frac{\eta}{2(1+\eta L)}\left\lVert\nabla f(y)\right\rVert_{2}^{2}\right)\int\exp\left(-\frac{1+\eta L}{2\eta}\left\lVert x-y+\frac{\eta}{1+\eta L}\nabla f(y)\right\rVert_{2}^{2}\right)dx \\
&= \exp\left(-f(y)+\frac{\eta}{2(1+\eta L)}\left\lVert\nabla f(y)\right\rVert_{2}^{2}\right)\left(\frac{2\pi\eta}{1+\eta L}\right)^{\frac{d}{2}}, \\
\int\hat{p}(x)dx &= \exp\left(-f(y)+\frac{\eta}{2}\left\lVert\nabla f(y)\right\rVert_{2}^{2}\right)(2\pi\eta)^{\frac{d}{2}},
\end{align*}

which implies the desired bound (recalling Lemma 5 and the assumed bound on $\left\lVert\nabla f(y)\right\rVert_{2}$):

\begin{align*}
\frac{\int\hat{p}(x)dx}{\int p(x)dx} &\leq \exp\left(\left(\frac{\eta}{2}-\frac{\eta}{2(1+\eta L)}\right)\left\lVert\nabla f(y)\right\rVert_{2}^{2}\right)(1+\eta L)^{\frac{d}{2}} \\
&\leq 1.5\exp\left(\frac{\eta^{2}L}{2(1+\eta L)}\left\lVert\nabla f(y)\right\rVert_{2}^{2}\right)\leq 2.
\end{align*}
∎
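The bound in Lemma 6 can also be checked numerically in one dimension. The sketch below (a toy instance we constructed, not from the paper) uses $f(x)=x^{2}+0.4\cos(2x)$, which is $L$-smooth and $\mu$-strongly convex with $L=3.6$ and $\mu=0.4$, sets $\eta=\frac{1}{8Ld\log\kappa}$, and verifies both that $C=1$ is valid and that $\frac{\int\hat{p}}{\int p}\leq 2$ at a point $y$ satisfying (13).

```python
import math

L_, mu_, d_ = 3.6, 0.4, 1
kappa = L_ / mu_
eta = 1.0 / (8 * L_ * d_ * math.log(kappa))   # the paper's choice of eta

f = lambda x: x * x + 0.4 * math.cos(2 * x)   # f'' in [0.4, 3.6]
df = lambda x: 2 * x - 0.8 * math.sin(2 * x)

y = 0.5
assert abs(df(y)) <= 3 * math.sqrt(L_) * d_ * math.log(kappa)  # condition (13)

p = lambda x: math.exp(-f(x) - (x - y) ** 2 / (2 * eta))
p_hat = lambda x: math.exp(-f(y) - df(y) * (x - y) - (x - y) ** 2 / (2 * eta))

xs = [y - 2 + 0.0005 * i for i in range(8001)]
# convexity: f lies above its tangent at y, so p <= p_hat pointwise and C = 1
assert all(p(x) <= p_hat(x) + 1e-12 for x in xs)
Zp = sum(p(x) for x in xs) * 0.0005
Zhat = sum(p_hat(x) for x in xs) * 0.0005
assert Zhat / Zp <= 2   # expected number of rejection rounds is at most 2
```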

We are now equipped to prove our main result concerning well-conditioned densities.

See 1

Proof.

By applying Theorem 1 with the chosen $\eta$, and noting that the cumulative error due to all calls to Line 10 cannot amount to more than $\frac{\epsilon}{2}$ total variation error throughout the algorithm, it suffices to show that Algorithm 2 uses $O(1)$ gradient queries per iteration in expectation. By Lemma 6, this holds whenever the condition in Line 1 is met, so it remains to show that Line 10 is executed with probability $O((d\log\frac{\kappa d}{\epsilon})^{-1})$.

To show this, note that combining Proposition 2 with the warmness of the start $x_{0}$ in Algorithm 2, this event occurs with probability at most $\kappa^{-\frac{d}{2}}$ in the first iteration. (Formally, Line 2 of Algorithm 1 has $y_{1}\sim\mathcal{N}(x_{0},\eta\mathbf{I})$, but by smoothness $\left\lVert\nabla f(y_{1})\right\rVert_{2}\leq\left\lVert\nabla f(x_{0})\right\rVert_{2}+L\left\lVert x_{0}-y_{1}\right\rVert_{2}$, and $L\left\lVert x_{0}-y_{1}\right\rVert_{2}\leq\widetilde{O}(L\sqrt{\eta})$ with high probability, adding a negligible constant to the bound of Proposition 2.) Since warmness is monotonically decreasing throughout Algorithm 1 when using an exact oracle (a standard fact in the literature: each transition step of the chain is a convex combination of warm point masses, preserving warmness), and the total error accumulated due to Line 10 throughout the algorithm is $O((d\log\frac{\kappa d}{\epsilon})^{-1})$, we have the desired conclusion. ∎

We show a bound nearly matching Corollary 1 using only value access to $f$, and with a deterministic (rather than expected) iteration complexity, as Corollary 5 in Section 4.3.

4.2 Composite logconcave sampling: proof of Corollary 2

In this section, let $\pi$ be a distribution on $\mathbb{R}^{d}$ with density proportional to $\exp(-f(x)-g(x))$, where $f$ is $L$-smooth and $\mu$-strongly convex (with $\kappa=\frac{L}{\mu}$), and $g$ is convex and admits a restricted Gaussian oracle $\mathcal{O}$. Without loss of generality, we assume that $f$ and $g$ share a pre-computed minimizer $x^{*}$; if this is not the case, we can redefine $f(x)\leftarrow f(x)-\left\langle\nabla f(x^{*}),x\right\rangle$ and $g(x)\leftarrow g(x)+\left\langle\nabla f(x^{*}),x\right\rangle$; see Section 5.1 for this reduction.

We instantiate Theorem 1 with $f_{\textup{oracle}}=f+g$, which is a $\mu$-strongly convex function. Our main result of this section follows directly from Theorem 1, using Theorem 2 as the required oracle $\mathcal{O}$; we state this more precisely in the following.

See 2

Proof.

As discussed at the beginning of this section, assume without loss of generality that $f$ and $g$ are both minimized by $x^{*}$. We apply the algorithm of Theorem 1 with $\eta=\frac{1}{L}$ to the $\mu$-strongly convex function $f+g$, which requires one call to $\mathcal{O}$ to implement. Thus, the iteration count parameter in Theorem 1 is $T=O(\kappa\log\frac{\kappa d}{\epsilon})$.

Recall that we chose $\eta=\frac{1}{L}$. To bound the total complexity of this algorithm, it suffices to give an $\eta$-RGO $\mathcal{O}^{+}$ for sampling from distributions with densities of the form, for some $y\in\mathbb{R}^{d}$,

\[
\exp\left(-f(x)-g(x)-\frac{1}{2\eta}\left\lVert x-y\right\rVert_{2}^{2}\right)=\exp\left(-f(x)-g(x)-\frac{L}{2}\left\lVert x-y\right\rVert_{2}^{2}\right)
\]

to total variation distance $\frac{\epsilon}{\Theta(T)}$ (see the discussion at the end of Section 3). To this end, we apply Theorem 2 with well-conditioned component $f(x)+\frac{L}{2}\left\lVert x-y\right\rVert_{2}^{2}$, composite component $g(x)$, and the largest possible choice of $\eta$. Note that we indeed have access to a restricted Gaussian oracle for $g$ (namely, $\mathcal{O}$), and this choice of well-conditioned component is $2L$-smooth and $L$-strongly convex, so its condition number is a constant. Thus, Theorem 2 requires $O(d\log^{2}\frac{\kappa d}{\epsilon})$ calls to $\mathcal{O}$ and gradients of $f$ to implement the desired $\mathcal{O}^{+}$ on any query $y$ (where we note $\frac{\epsilon}{\Theta(T)}=\frac{1}{\textup{poly}(\kappa,d,\epsilon^{-1})}$). Combining these complexity bounds yields the desired conclusion. ∎

4.3 Sampling logconcave finite sums: proof of Corollary 3

In this section, let $\pi$ be a distribution on $\mathbb{R}^{d}$ with density proportional to $\exp(-F(x))$, where $F(x)=\frac{1}{n}\sum_{i\in[n]}f_{i}(x)$ is $\mu$-strongly convex and each $f_{i}$, $i\in[n]$, is $L$-smooth (with $\kappa=\frac{L}{\mu}$). We instantiate Theorem 1 with $f_{\textup{oracle}}(x)=F(x)$, using Theorem 3 as an $\eta$-RGO for some choice of $\eta$.

More precisely, Theorem 3 shows that given access to the minimizer $x^{*}$, only zeroth-order access to the summands of $F$ is necessary to obtain the stated iteration bound. However, to obtain the minimizer to high accuracy, variance-reduced stochastic gradient methods (e.g. [JZ13]) require $\Omega(n+\kappa)$ gradient queries, which amounts to $\Omega((n+\kappa)d)$ function evaluations. We state a convenient corollary of Theorem 3 which removes the requirement of access to $x^{*}$, via an optimization pre-processing step using the method of [JZ13] (see further discussion in Appendix A). This is useful to us in proving Corollary 3 because in the sampling tasks required by the RGO, the minimizer changes, and thus must be recomputed every time.

Corollary 4 (First-order logconcave finite sum sampling).

In the setting of Theorem 3, using [JZ13] to precompute the minimizer $x^{*}$ and running Algorithm 6 uses $O(n\log\frac{\kappa d}{\epsilon}+\kappa^{2}d\log^{4}\frac{n\kappa d}{\epsilon})$ first-order oracle queries to the summands $\{f_{i}\}_{i\in[n]}$ and obtains $\epsilon$ total variation distance to $\pi$.

We now apply the reduction framework developed in Section 2 to our Algorithm 6 to obtain an improved query complexity for sampling from logconcave finite sums.

See 3

Proof.

We apply Theorem 1 with the $\mu$-strongly convex $f_{\textup{oracle}}=F$, using Algorithm 6 as the required $\eta$-RGO $\mathcal{O}$ for sampling from distributions with densities of the form

\[
\exp\left(-F(x)-\frac{1}{2\eta}\left\lVert x-y\right\rVert_{2}^{2}\right)
\]

for some $y\in\mathbb{R}^{d}$, to total variation distance $\frac{\epsilon}{\Theta(T)}$ (see Section 3), for $T$ the iteration bound of Algorithm 1. We apply Theorem 3 to the function $\widetilde{F}(x)=F(x)+\frac{1}{2\eta}\left\lVert x-y\right\rVert_{2}^{2}$; we can express this in finite sum form by adding $\frac{1}{2\eta}\left\lVert x-y\right\rVert_{2}^{2}$ to every constituent function, whose effect on gradient oracles is an additive $\frac{1}{\eta}(x-y)$. Note that $\widetilde{F}$ has condition number $O(1+\eta L)$. For a given $\eta$, the overall complexity is

\[
\frac{\log\frac{\kappa d}{\epsilon}}{\eta\mu}\left(n\log\left(\frac{n\kappa d}{\epsilon}\right)+d\log^{4}\left(\frac{n\kappa d}{\epsilon}\right)+(\eta L)^{2}d\log^{4}\left(\frac{n\kappa d}{\epsilon}\right)\right).
\]

Here, the inner loop complexity uses Corollary 4 (which also finds the minimizer, used for warm starts), and the outer loop complexity is from Theorem 1. The result follows by optimizing over $\eta$, namely picking $\eta=\max(\frac{1}{L},\sqrt{\frac{n}{L^{2}d\log^{3}(n\kappa d/\epsilon)}})$, and using that Algorithm 1 always performs at least one iteration. ∎

Note that the only place Corollary 3 used gradient evaluations was in determining minimizers of subproblems, via the first step of Corollary 4. Consider now the case $n=1$. By running e.g. accelerated gradient descent for smooth and strongly convex functions [Nes83], we can obtain a minimizer in $\widetilde{O}(\sqrt{\kappa})$ iterations, each querying a gradient oracle, where $\kappa$ is the condition number. By smoothness, we can approximate every coordinate of the gradient to arbitrary precision using $2$ function evaluations, so this yields a $\widetilde{O}(\sqrt{\kappa}d)$ value oracle complexity.
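The two-evaluations-per-coordinate gradient approximation referenced above is the standard central finite difference; the sketch below (our own illustration, not code from the paper) recovers the gradient of a smooth, strongly convex quadratic to high precision using $2d$ value queries.

```python
def fd_gradient(f, x, h=1e-6):
    """Approximate the gradient of f at x by central differences:
    2 function evaluations per coordinate, 2d in total."""
    grad = []
    for i in range(len(x)):
        xp = list(x); xp[i] += h
        xm = list(x); xm[i] -= h
        grad.append((f(xp) - f(xm)) / (2 * h))
    return grad

# check against the exact gradient of a smooth, strongly convex quadratic
f = lambda x: 0.5 * (3 * x[0] ** 2 + x[1] ** 2) + x[0] * x[1]
x = [1.0, -2.0]
exact = [3 * x[0] + x[1], x[1] + x[0]]
approx = fd_gradient(f, x)
assert all(abs(a - e) < 1e-4 for a, e in zip(approx, exact))
```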

Finally, for every optimization subproblem in Corollary 3 where $\eta=(L\cdot\textup{polylog}\frac{\kappa d}{\epsilon})^{-1}$, the condition number is a constant, which amounts to a $\widetilde{O}(d)$ value oracle complexity for computing a minimizer. This is never the dominant term compared to Theorem 3, yielding the following conclusion.

Corollary 5.

In the setting of Corollary 1, Algorithm 1 using Algorithm 6 as a restricted Gaussian oracle uses $O(\kappa d\log^{2}\frac{\kappa d}{\epsilon})$ value queries and obtains $\epsilon$ total variation distance to $\pi$.

We note that the polylogarithmic factor is significantly improved compared to Corollary 3, by removing the random sampling steps in Algorithm 6. A precise complexity bound on the resulting Metropolized random walk, a zeroth-order algorithm mixing in $O(\kappa^{2}d\log\frac{\kappa d}{\epsilon})$ iterations for a logconcave distribution with condition number $\kappa$, is given as Theorem 2 of [CDWY19].

Finally, in the case $n\geq 1$, we also exhibit an improved query complexity via an entirely zeroth-order sampling algorithm which interpolates with Corollary 5 (up to logarithmic factors). By trading off the $\widetilde{O}(nd+\kappa d)$ zeroth-order complexity of minimizing a finite sum function [JZ13] against the $\widetilde{O}(\kappa^{2}d)$ zeroth-order complexity of sampling, we can run Theorem 1 with the optimal choice $\eta=\widetilde{O}(\frac{\sqrt{n}}{L})$. The overall zeroth-order complexity can be seen to be $\widetilde{O}(nd+\sqrt{n}\kappa d)$.

5 Composite logconcave sampling with a restricted Gaussian oracle

In this section, we provide our "base sampler" for composite logconcave densities as Algorithm 3, and give its guarantees by proving Theorem 2. Throughout, fix a distribution $\pi$ with density

\begin{align}
\frac{d\pi}{dx}(x)\propto\exp\left(-f(x)-g(x)\right),&\quad\text{where }f:\mathbb{R}^{d}\rightarrow\mathbb{R}\text{ is }L\text{-smooth and }\mu\text{-strongly convex,} \tag{15}\\
&\quad\text{and }g:\mathbb{R}^{d}\rightarrow\mathbb{R}\text{ admits a restricted Gaussian oracle }\mathcal{O}.\nonumber
\end{align}

We define $\kappa:=\frac{L}{\mu}$, and assume that we have precomputed $x^{*}:=\textup{argmin}_{x\in\mathbb{R}^{d}}\left\{f(x)+g(x)\right\}$. Our algorithm proceeds in stages following the outline in Section 1.3.1.

  1. Composite-Sample is reduced to Composite-Sample-Shared-Min, which takes as input a distribution with negative log-density $f+g$, where $f$ and $g$ share a minimizer; this reduction is given in Section 5.1, and the remainder of the section handles the shared-minimizer case.

  2. The algorithm Composite-Sample-Shared-Min is a rejection sampling scheme built on top of sampling from a joint distribution $\hat{\pi}$ on $(x,y)\in\mathbb{R}^{d}\times\mathbb{R}^{d}$ whose $x$-marginal approximates $\pi$. We give this reduction in Section 5.2.

  3. The bulk of our analysis is for Sample-Joint-Dist, an alternating marginal sampling algorithm for sampling from $\hat{\pi}$. To implement marginal sampling, it alternates calls to $\mathcal{O}$ and a rejection sampling subroutine YSample. We prove its correctness in Section 5.3.

We put these pieces together in Section 5.4 to prove Theorem 2. We remark that for simplicity, we give the algorithms corresponding to the largest value of the step size $\eta$ in the theorem statement; it is straightforward to modify the bounds to tolerate smaller values of $\eta$, which correspondingly increases the mixing time (in particular, the value of $K$ in Algorithm 5).

Algorithm 3 $\texttt{Composite-Sample}(\pi,x^{*},\epsilon)$

Input: Distribution $\pi$ of the form (15), $x^{*}$ minimizing the negative log-density of $\pi$, $\epsilon\in[0,1]$.
Output: Sample $x$ from a distribution $\pi'$ with $\left\lVert\pi'-\pi\right\rVert_{\textup{TV}}\leq\epsilon$.

1: $\tilde{f}(x)\leftarrow f(x)-\left\langle\nabla f(x^{*}),x\right\rangle$, $\tilde{g}(x)\leftarrow g(x)+\left\langle\nabla f(x^{*}),x\right\rangle$
2: return $\texttt{Composite-Sample-Shared-Min}(\pi,\tilde{f},\tilde{g},x^{*},\epsilon)$
Algorithm 4 $\texttt{Composite-Sample-Shared-Min}(\pi,f,g,x^{*},\epsilon)$

Input: Distribution $\pi$ of the form (15), where $f$ and $g$ are both minimized by $x^{*}$, $\epsilon\in[0,1]$.
Output: Sample $x$ from a distribution $\pi'$ with $\left\lVert\pi'-\pi\right\rVert_{\textup{TV}}\leq\epsilon$.

1: while true do
2:     Define the set
\[
\Omega:=\left\{x\mid\left\lVert x-x^{*}\right\rVert_{2}\leq 4\sqrt{\frac{d\log(288\kappa/\epsilon)}{\mu}}\right\} \tag{16}
\]
3:     $x\leftarrow\texttt{Sample-Joint-Dist}(f,g,x^{*},\mathcal{O},\frac{\epsilon}{18})$
4:     if $x\in\Omega$ then
5:         $\tau\sim\textup{Unif}[0,1]$
6:         $y\leftarrow\texttt{YSample}(f,x,\eta)$
7:         $\alpha\leftarrow\exp\left(f(y)-\left\langle\nabla f(x),y-x\right\rangle-\frac{L}{2}\left\lVert y-x\right\rVert_{2}^{2}+g(x)+\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)$
8:         $\hat{\theta}\leftarrow\exp\left(-f(x)-g(x)+\frac{\eta}{2(1+\eta L)}\left\lVert\nabla f(x)\right\rVert_{2}^{2}\right)(1+\eta L)^{\frac{d}{2}}\alpha$
9:         if $\tau\leq\frac{\hat{\theta}}{4}$ then
10:             return $x$
11:         end if
12:     end if
13: end while
Algorithm 5 $\texttt{Sample-Joint-Dist}(f,g,x^{*},\eta,\mathcal{O},\delta)$

Input: $f$, $g$ of the form (15), both minimized by $x^{*}$, $\delta\in[0,1]$, $\eta>0$, restricted Gaussian oracle $\mathcal{O}$ for $g$.
Output: Sample $x$ from a distribution $\hat{\pi}'$ with $\left\lVert\hat{\pi}'-\hat{\pi}\right\rVert_{\textup{TV}}\leq\delta$, where we overload $\hat{\pi}$ to mean the marginal of (17) on the $x$ variable.

1: $\eta\leftarrow\frac{1}{32L\kappa d\log(16\kappa/\delta)}$
2: Let $\hat{\pi}$ be the joint density with
\[
\frac{d\hat{\pi}}{d(x,y)}(x,y)\propto\exp\left(-f(y)-g(x)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right) \tag{17}
\]
3: Call $\mathcal{O}$ to sample $x_{0}\sim\pi_{\textup{start}}$, for
\[
\frac{d\pi_{\textup{start}}}{dx}(x)\propto\exp\left(-\frac{L+\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}-g(x)\right) \tag{18}
\]
4: $K\leftarrow\frac{2^{26}\cdot 100}{\eta\mu}\log\left(\frac{d\log(16\kappa)}{4\delta}\right)$ (see Remark 1)
5: for $k\in[K]$ do
6:     Call $\texttt{YSample}\left(f,x_{k-1},\eta,\frac{\delta}{2Kd\log(\frac{d\kappa}{\delta})}\right)$ to sample $y_{k}\sim\pi_{x_{k-1}}$ (Algorithm 7), for
\[
\frac{d\pi_{x}}{dy}(y)\propto\exp\left(-f(y)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right) \tag{19}
\]
7:     Call $\mathcal{O}$ to sample $x_{k}\sim\pi_{y_{k}}$, for
\[
\frac{d\pi_{y}}{dx}(x)\propto\exp\left(-g(x)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right) \tag{20}
\]
8: end for
9: return $x_{K}$

5.1 Reduction from Composite-Sample to Composite-Sample-Shared-Min

Correctness of Composite-Sample is via the following properties.

Proposition 3.

Let $\tilde{f}$ and $\tilde{g}$ be defined as in Composite-Sample.

  1. The density $\propto\exp(-f(x)-g(x))$ is the same as the density $\propto\exp(-\tilde{f}(x)-\tilde{g}(x))$.

  2. Assuming first-order (function and gradient evaluation) access to $f$ and restricted Gaussian oracle access to $g$, we can implement the same accesses to $\tilde{f}$, $\tilde{g}$ with constant overhead.

  3. $\tilde{f}$ and $\tilde{g}$ are both minimized by $x^{*}$.

Proof.

For $f$ and $g$ with properties as in (15), with $x^{*}$ minimizing $f+g$, define the functions

\[
\tilde{f}(x):=f(x)-\left\langle\nabla f(x^{*}),x\right\rangle,\qquad\tilde{g}(x):=g(x)+\left\langle\nabla f(x^{*}),x\right\rangle,
\]

and observe that $\tilde{f}+\tilde{g}=f+g$ everywhere. This proves the first claim. Further, implementation of a first-order oracle for $\tilde{f}$ and a restricted Gaussian oracle for $\tilde{g}$ is immediate given a first-order oracle for $f$ and a restricted Gaussian oracle for $g$, showing the second claim; any quadratic shifted by a linear term is the sum of a (shifted) quadratic and a constant. We now show $\tilde{f}$ and $\tilde{g}$ have the same minimizer. By strong convexity, $\tilde{f}$ has a unique minimizer; first-order optimality shows that

\[
\nabla\tilde{f}(x^{*})=\nabla f(x^{*})-\nabla f(x^{*})=0,
\]

so this unique minimizer is $x^{*}$. Moreover, optimality of $x^{*}$ for $f+g$ implies that for all $x\in\mathbb{R}^{d}$,

\[
\left\langle\partial g(x^{*})+\nabla f(x^{*}),x^{*}-x\right\rangle\leq 0.
\]

Here, $\partial g$ denotes a subgradient. This shows first-order optimality of $x^{*}$ for $\tilde{g}$ as well, so $x^{*}$ minimizes $\tilde{g}$. ∎
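As a quick numeric illustration of this reduction (a toy one-dimensional example of our own, not from the paper): for a smooth $f$ and a convex nonsmooth $g$, the shifted pair $\tilde{f},\tilde{g}$ sums to $f+g$ pointwise, and both shifted functions are minimized at the joint minimizer $x^{*}$.

```python
f = lambda x: (x - 1) ** 2          # smooth, strongly convex component
g = lambda x: 2 * abs(x)            # convex, nonsmooth component
grad_f = lambda x: 2 * (x - 1)

x_star = 0.0                        # argmin of f + g (0 is in the subdifferential there)
shift = grad_f(x_star)
f_t = lambda x: f(x) - shift * x    # f tilde
g_t = lambda x: g(x) + shift * x    # g tilde

xs = [-2 + 0.01 * i for i in range(401)]
# claim 1: the sum (hence the density) is unchanged
assert all(abs((f_t(x) + g_t(x)) - (f(x) + g(x))) < 1e-9 for x in xs)
# claim 3: x* minimizes both shifted functions
assert all(f_t(x) >= f_t(x_star) - 1e-9 for x in xs)
assert all(g_t(x) >= g_t(x_star) - 1e-9 for x in xs)
```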

5.2 Reduction from Composite-Sample-Shared-Min to Sample-Joint-Dist

Composite-Sample-Shared-Min is a rejection sampling scheme which accepts samples from the subroutine Sample-Joint-Dist in the high-probability region $\Omega$ defined in (16). We give a general analysis of approximate rejection sampling in Appendix B.1.1, and Appendix B.1.2 bounds relationships between the distributions $\pi$ and $\hat{\pi}$, defined in (15) and (17) respectively (i.e. relative densities and ratios of normalization constants). Combining these pieces proves the following main claim.

Proposition 4.

Let $\eta=\frac{1}{32L\kappa d\log(288\kappa/\epsilon)}$, and assume $\texttt{Sample-Joint-Dist}(f,g,x^{*},\mathcal{O},\delta)$ samples within $\delta$ total variation of the $x$-marginal of (17). Then Composite-Sample-Shared-Min outputs a sample within total variation distance $\epsilon$ of (15) in an expected $O(1)$ calls to Sample-Joint-Dist.

5.3 Implementing Sample-Joint-Dist

Sample-Joint-Dist alternates between sampling the marginals of the joint distribution $\hat{\pi}$, as seen from definitions (19) and (20). We showed in Lemma 1 that this marginal sampling attains the correct stationary distribution. We bound the conductance of the induced walk on the iterates $\{x_{k}\}$ by combining an isoperimetry bound with a total variation guarantee between transitions of nearby points in Appendix B.2.1. Finally, we give a simple rejection sampling scheme YSample, as Algorithm 7, implementing the step (19). Since the $y$-marginal of $\hat{\pi}$ is a bounded perturbation of a Gaussian (intuitively, $f$ is $L$-smooth and $\eta^{-1}\gg L$), we show that in a high-probability region, rejection sampling from the sum of a first-order approximation of $f$ and the Gaussian succeeds within $2$ expected iterations.

Remark 1.

For simplicity of presentation, we were conservative in bounding constants throughout; in practice, the constant in Line 4 is orders of magnitude too large (a constant $<10$ sufficed; see Section 4 of [STL20]). Several constants were inherited from prior analyses, which we do not rederive to avoid redundancy.

We now give a complete guarantee on the complexity of Sample-Joint-Dist.

Proposition 5.

Sample-Joint-Dist outputs a point with distribution within $\delta$ total variation distance of the $x$-marginal of $\hat{\pi}$. The expected number of gradient queries per iteration is constant.

5.4 Putting it all together: proof of Theorem 2

We show Theorem 2 follows from the guarantees of Propositions 3, 4, and 5. Formally, Theorem 2 is stated for an arbitrary value of $\eta$ upper bounded by the value in Line 1 of Algorithm 5; it is straightforward to see that all our proofs go through for any smaller value. By observing the value of $K$ in Sample-Joint-Dist, the number of total iterations in each call to Sample-Joint-Dist is $O\left(\frac{1}{\eta\mu}\log(\frac{\kappa d}{\delta})\right)=O\left(\kappa^{2}d\log^{2}\left(\frac{\kappa d}{\delta}\right)\right)$. Proposition 5 shows that every iteration requires an expected constant number of gradient queries and calls to $\mathcal{O}$, the restricted Gaussian oracle for $g$, and that the resulting distribution is within $\delta$ total variation of the desired marginal of $\hat{\pi}$. Next, Proposition 4 implies that the number of calls to Sample-Joint-Dist in a run of Composite-Sample-Shared-Min is bounded by a constant, that the choice of $\delta$ is $\Theta(\epsilon)$, and that the resulting point is within total variation distance $\epsilon$ of the original distribution $\pi$. Finally, Proposition 3 shows that sampling from a general distribution of the form (1) is reducible to one call of Composite-Sample-Shared-Min, and that the requisite oracles are implementable. ∎

6 Logconcave finite sums

In this section, we provide our “base sampler” for logconcave finite sums as Algorithm 6, and give its guarantees by proving Theorem 3. Throughout, fix distribution π\pi with density

dπdx(x)exp(F(x)), where F(x)=1ni[n]fi(x) is μ-strongly convex,\displaystyle\frac{d\pi}{dx}(x)\propto\exp(-F(x)),\text{ where }F(x)=\frac{1}{n}\sum_{i\in[n]}f_{i}(x)\text{ is }\mu\text{-strongly convex,}
and for all i[n],fi is L-smooth.\displaystyle\text{ and for all }i\in[n],\;f_{i}\text{ is }L\text{-smooth}.

We will define κ:=Lμ\kappa:=\frac{L}{\mu}, and assume that we have precomputed x:=argminxd{F(x)}x^{*}:=\textup{argmin}_{x\in\mathbb{R}^{d}}\{F(x)\}. We will also assume explicitly that fi(x)=0\nabla f_{i}(x^{*})=0 for all i[n]i\in[n] throughout this section (i.e. all fif_{i} are minimized at the same point); this is without loss of generality, by a similar argument as in Proposition 3.

Algorithm 6 FiniteSum-MRW(F,h,x0,p,K)\texttt{FiniteSum-MRW}(F,h,x_{0},p,K)

Input: F(x)=1ni[n]fi(x)F(x)=\frac{1}{n}\sum_{i\in[n]}f_{i}(x), step size h>0h>0, initial x0x_{0}, p[0,1]p\in[0,1], iteration count KK\in\mathbb{N}

1:for 0k<K0\leq k<K do
2:     Draw ξk𝒩(0,𝐈)\xi_{k}\sim\mathcal{N}(0,\mathbf{I})
3:     yk+1xk+2hξky_{k+1}\leftarrow x_{k}+\sqrt{2h}\xi_{k}
4:     Draw Sk[n]S_{k}\subseteq[n] by including each iSki\in S_{k} independently with probability pp
5:     For each i[n]i\in[n],
γk(i){1p(exp(1nfi(yk+1)+1nfi(xk))1)+1iSk1iSk\gamma_{k}^{(i)}\leftarrow\begin{cases}\frac{1}{p}\left(\sqrt{\exp\left(-\frac{1}{n}f_{i}(y_{k+1})+\frac{1}{n}f_{i}(x_{k})\right)}-1\right)+1&i\in S_{k}\\ 1&i\not\in S_{k}\end{cases}
6:     γki=1nγk(i)\gamma_{k}\leftarrow\prod_{i=1}^{n}\gamma_{k}^{(i)}, τUnif[0,1]\tau\sim\text{Unif}[0,1]
7:     if τ34γk\tau\leq\tfrac{3}{4}\gamma_{k} and |Sk|2pn|S_{k}|\leq 2pn then
8:         xk+1yk+1x_{k+1}\leftarrow y_{k+1}
9:     else
10:         xk+1xkx_{k+1}\leftarrow x_{k}
11:     end if
12:end for
13:return xKx_{K}.

Algorithm 6 is the zeroth-order Metropolized random walk of [DCWY18] with an efficient, but biased, filter step; the goal of our analysis is to show this bias does not incur significant error.
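For concreteness, the loop above can be transcribed directly into NumPy. The following is an illustrative sketch only: the callables in `fs` and all parameter choices are hypothetical, and no attempt is made at the careful constant choices required by the analysis.

```python
import numpy as np

def finite_sum_mrw(fs, x0, h, p, K, rng=None):
    """Sketch of Algorithm 6 (FiniteSum-MRW). `fs` is a list of callables
    f_i, so F(x) = (1/n) * sum_i f_i(x); all parameters are hypothetical."""
    rng = np.random.default_rng() if rng is None else rng
    n, x = len(fs), np.array(x0, dtype=float)
    for _ in range(K):
        xi = rng.standard_normal(x.shape)            # Line 2
        y = x + np.sqrt(2 * h) * xi                  # Line 3: Gaussian proposal
        S = np.flatnonzero(rng.random(n) < p)        # Line 4: random subsample
        # Line 5: multiplicative estimator of sqrt(exp(-F(y) + F(x))),
        # touching only the summands indexed by S
        gamma = 1.0
        for i in S:
            gamma *= (np.exp((-fs[i](y) + fs[i](x)) / (2 * n)) - 1) / p + 1
        # Lines 6-11: accept with probability (3/4) * gamma, always
        # rejecting when the subsample is unexpectedly large
        if rng.random() <= 0.75 * gamma and len(S) <= 2 * p * n:
            x = y
    return x
```

Note that each iteration evaluates only O(pn)O(pn) of the fif_{i}, rather than all nn, which is the point of the subsampled filter.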

6.1 Approximate Metropolis-Hastings

We first recall the following well-known fact underlying Metropolis-Hastings (MH) filters.

Proposition 6.

Consider a random walk on d\mathbb{R}^{d} with proposal distributions {𝒫x}xd\{\mathcal{P}_{x}\}_{x\in\mathbb{R}^{d}} and acceptance probabilities {α(x,x)}x,xd\{\alpha(x,x^{\prime})\}_{x,x^{\prime}\in\mathbb{R}^{d}} conducted as follows: at a current point xx,

  1. 1.

    Draw a point x𝒫xx^{\prime}\sim\mathcal{P}_{x}.

  2. 2.

    Move the random walk to xx^{\prime} with probability α(x,x)\alpha(x,x^{\prime}), else stay at xx.

Suppose 𝒫x(x)=𝒫x(x)\mathcal{P}_{x}(x^{\prime})=\mathcal{P}_{x^{\prime}}(x) for all pairs x,xdx,x^{\prime}\in\mathbb{R}^{d}, and further dπdx(x)α(x,x)=dπdx(x)α(x,x)\tfrac{d\pi}{dx}(x)\alpha(x,x^{\prime})=\tfrac{d\pi}{dx}(x^{\prime})\alpha(x^{\prime},x). Then, π\pi is a stationary distribution for the random walk.

Proof.

This follows because the walk satisfies detailed balance (reversibility) with respect to π\pi. ∎
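As a sanity check (ours, not from the paper), one can verify Proposition 6 numerically on a toy three-state chain, instantiating the acceptance probability with the standard Metropolis rule α(x,x)=min(1,π(x)/π(x))\alpha(x,x^{\prime})=\min(1,\pi(x^{\prime})/\pi(x)), which satisfies the stated symmetry condition:

```python
import numpy as np

# Toy check of Proposition 6 on a 3-state chain: a symmetric proposal plus
# the standard Metropolis rule alpha(x, x') = min(1, pi(x') / pi(x)), which
# satisfies pi(x) * alpha(x, x') = pi(x') * alpha(x', x).
pi = np.array([0.5, 0.3, 0.2])
prop = np.full((3, 3), 1.0 / 3.0)                    # symmetric proposal
alpha = np.minimum(1.0, pi[None, :] / pi[:, None])   # alpha[x, x']
P = prop * alpha
np.fill_diagonal(P, 0.0)
np.fill_diagonal(P, 1.0 - P.sum(axis=1))             # rejection mass stays put
assert np.allclose(pi @ P, pi)                       # pi is stationary
```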

We propose an algorithm that applies a variant of the Metropolis-Hastings filter to a Gaussian random walk. Specifically, we define the following algorithm, which we call Inefficient-MRW.

Definition 4 (Inefficient-MRW).

Consider the following random walk for some step size h>0h>0: for each iteration kk at a current point xkdx_{k}\in\mathbb{R}^{d},

  1. 1.

    Set yk+1xk+2hξy_{k+1}\leftarrow x_{k}+\sqrt{2h}\xi, where ξ𝒩(0,𝐈)\xi\sim\mathcal{N}(0,\mathbf{I}).

  2. 2.

    xk+1yk+1x_{k+1}\leftarrow y_{k+1} with probability α(xk,yk+1)\alpha(x_{k},y_{k+1}) (otherwise, xk+1xkx_{k+1}\leftarrow x_{k}), where

    α(x,y)={1exp(F(y))exp(F(x))>43,34exp(F(y))exp(F(x))34exp(F(y))exp(F(x))43,exp(F(y))exp(F(x))exp(F(y))exp(F(x))<34.\alpha(x,y)=\begin{cases}1&\sqrt{\frac{\exp(-F(y))}{\exp(-F(x))}}>\frac{4}{3},\\ \frac{3}{4}\sqrt{\frac{\exp(-F(y))}{\exp(-F(x))}}&\frac{3}{4}\leq\sqrt{\tfrac{\exp(-F(y))}{\exp(-F(x))}}\leq\frac{4}{3},\\ \frac{\exp(-F(y))}{\exp(-F(x))}&\sqrt{\frac{\exp(-F(y))}{\exp(-F(x))}}<\frac{3}{4}.\end{cases} (21)
Lemma 7.

Distribution π\pi with dπdx(x)exp(F(x))\tfrac{d\pi}{dx}(x)\propto\exp(-F(x)) is stationary for Inefficient-MRW.

Proof.

Without loss of generality, assume that π\pi has been normalized so that dπdx(x)=exp(F(x))\tfrac{d\pi}{dx}(x)=\exp(-F(x)). We apply Proposition 6, dropping subscripts in the following. It is clear that 𝒫x(y)=𝒫y(x)\mathcal{P}_{x}(y)=\mathcal{P}_{y}(x) for any x,yx,y, so it suffices to check the second condition. When 34exp(F(y))exp(F(x))43\tfrac{3}{4}\leq\sqrt{\tfrac{\exp(-F(y))}{\exp(-F(x))}}\leq\tfrac{4}{3}, this follows from

dπdx(x)α(x,y)=34exp(F(x)F(y))=dπdx(y)α(y,x).\frac{d\pi}{dx}(x)\alpha(x,y)=\frac{3}{4}\sqrt{\exp(-F(x)-F(y))}=\frac{d\pi}{dx}(y)\alpha(y,x).

The other case is similar (as it is a standard Metropolis-Hastings filter). ∎
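In code, the three-case filter (21) is a small piecewise function of the ratio r=exp(F(y))/exp(F(x))r=\sqrt{\exp(-F(y))/\exp(-F(x))}; the sketch below (our own illustration) also makes it easy to check the detailed-balance condition of Proposition 6 numerically:

```python
import numpy as np

def alpha_modified(F_x, F_y):
    """The acceptance probability (21), in terms of the ratio
    r = sqrt(exp(-F(y)) / exp(-F(x)))."""
    r = np.exp((F_x - F_y) / 2.0)
    if r > 4.0 / 3.0:
        return 1.0
    if r >= 3.0 / 4.0:
        return 0.75 * r   # the middle case, estimated by Algorithm 6
    return r * r          # the usual Metropolis ratio exp(-F(y) + F(x))
```

One can check that exp(F(x))α(x,y)\exp(-F(x))\alpha(x,y) is symmetric in its two arguments in every case, which is exactly the condition required by Proposition 6.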

In Algorithm 6, we implement an approximate version of the modified MH filter in Definition 4, where we always assume the pair xx, yy lies in the second case of (21). In Lemma 8, we show that if a certain boundedness condition holds, then Algorithm 6 approximates Inefficient-MRW well. We then show in Lemma 9 that the output distributions of Inefficient-MRW and Algorithm 6 have small total variation distance.

Lemma 8.

Suppose that in an iteration 0k<K0\leq k<K of Algorithm 6, the following three conditions hold for some parameters RxR_{x}, CξC_{\xi}, Cx0C_{x}\in\mathbb{R}_{\geq 0}:

  1. 1.

    xkx2Rx\left\lVert x_{k}-x^{*}\right\rVert_{2}\leq R_{x}.

  2. 2.

    ξk2Cξd\left\lVert\xi_{k}\right\rVert_{2}\leq C_{\xi}\sqrt{d}.

  3. 3.

    For all i[n]i\in[n], |fi(xk)ξk|Cxfi(xk)2|\nabla f_{i}(x_{k})^{\top}\xi_{k}|\leq C_{x}\left\lVert\nabla f_{i}(x_{k})\right\rVert_{2}.

Then, for any

h198Cx2L2Rx2+7LCξ2d,h\leq\frac{1}{98C_{x}^{2}L^{2}R_{x}^{2}+7LC_{\xi}^{2}d}, (22)

34exp(F(yk+1))exp(F(xk))43\tfrac{3}{4}\leq\sqrt{\tfrac{\exp(-F(y_{k+1}))}{\exp(-F(x_{k}))}}\leq\tfrac{4}{3}. Moreover, we have 𝔼[γk]=exp(F(yk+1))exp(F(xk))\mathbb{E}\left[\gamma_{k}\right]=\sqrt{\frac{\exp(-F(y_{k+1}))}{\exp(-F(x_{k}))}}, and when |Sk|2pn|S_{k}|\leq 2pn, γk43\gamma_{k}\leq\tfrac{4}{3}.

Proof.

We first show 𝔼[γk]=exp(F(yk+1))exp(F(xk))\mathbb{E}\left[\gamma_{k}\right]=\sqrt{\frac{\exp(-F(y_{k+1}))}{\exp(-F(x_{k}))}}. Since each iSki\in S_{k} is generated independently,

𝔼[γk]\displaystyle\mathbb{E}\left[\gamma_{k}\right] =i[n]𝔼[γk(i)]\displaystyle=\prod_{i\in[n]}\mathbb{E}\left[\gamma_{k}^{(i)}\right]
=i[n][(1p)+p(1p(exp(1nfi(yk+1)+1nfi(xk))1)+1)]\displaystyle=\prod_{i\in[n]}\left[(1-p)+p\left(\frac{1}{p}\left(\sqrt{\exp\left(-\frac{1}{n}f_{i}(y_{k+1})+\frac{1}{n}f_{i}(x_{k})\right)}-1\right)+1\right)\right]
=i[n]exp(1nfi(yk+1)+1nfi(xk))=exp(F(yk+1))exp(F(xk)).\displaystyle=\prod_{i\in[n]}\sqrt{\exp\left(-\frac{1}{n}f_{i}(y_{k+1})+\frac{1}{n}f_{i}(x_{k})\right)}=\sqrt{\frac{\exp(-F(y_{k+1}))}{\exp(-F(x_{k}))}}.

Next, for any i[n]i\in[n], we lower and upper bound fi(yk+1)+fi(xk)-f_{i}(y_{k+1})+f_{i}(x_{k}). First,

fi(yk+1)+fi(xk)\displaystyle-f_{i}(y_{k+1})+f_{i}(x_{k}) fi(xk)(xkyk+1)\displaystyle\leq\nabla f_{i}(x_{k})^{\top}\left(x_{k}-y_{k+1}\right)
2hCxfi(xk)22hCxLRx.\displaystyle\leq\sqrt{2h}C_{x}\left\|\nabla f_{i}(x_{k})\right\|_{2}\leq\sqrt{2h}C_{x}LR_{x}.

The first inequality follows from convexity of fif_{i}, the second from yk+1xk=2hξky_{k+1}-x_{k}=\sqrt{2h}\xi_{k} and our assumed bound, and the third from smoothness and fi(x)=0\nabla f_{i}(x^{*})=0. To show a lower bound,

fi(yk+1)fi(xk)\displaystyle f_{i}(y_{k+1})-f_{i}(x_{k}) fi(xk)(yk+1xk)+L2yk+1xk22\displaystyle\leq\nabla f_{i}(x_{k})^{\top}\left(y_{k+1}-x_{k}\right)+\frac{L}{2}\left\|y_{k+1}-x_{k}\right\|_{2}^{2}
2hCxLRx+hLCξ2d.\displaystyle\leq\sqrt{2h}C_{x}LR_{x}+hLC_{\xi}^{2}d.

The first inequality follows from smoothness. Repeating this argument for each i[n]i\in[n] and averaging,

2hCxLRxhLCξ2dF(yk+1)+F(xk)2hCxLRx.-\sqrt{2h}C_{x}LR_{x}-hLC_{\xi}^{2}d\leq-F(y_{k+1})+F(x_{k})\leq\sqrt{2h}C_{x}LR_{x}. (23)

Then, when h198Cx2L2Rx2+7LCξ2dh\leq\tfrac{1}{98C_{x}^{2}L^{2}R_{x}^{2}+7LC_{\xi}^{2}d},

34exp(F(yk+1))exp(F(xk))43, and for all i[n],fi(yk+1)+fi(xk)14.\frac{3}{4}\leq\sqrt{\frac{\exp(-F(y_{k+1}))}{\exp(-F(x_{k}))}}\leq\frac{4}{3},\text{ and for all }i\in[n],\;-f_{i}(y_{k+1})+f_{i}(x_{k})\leq\frac{1}{4}.

Thus, we can bound each γk(i)\gamma_{k}^{(i)}:

γk(i)\displaystyle\gamma_{k}^{(i)} 1p(exp(18n)1)+11+17pn.\displaystyle\leq\frac{1}{p}\left(\exp\left(\frac{1}{8n}\right)-1\right)+1\leq 1+\frac{1}{7pn}.

Finally, when |Sk|2pn|S_{k}|\leq 2pn, γk(1+17pn)2pn43\gamma_{k}\leq(1+\frac{1}{7pn})^{2pn}\leq\tfrac{4}{3} as desired. ∎
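The unbiasedness claim 𝔼[γk]=exp(F(yk+1))/exp(F(xk))\mathbb{E}[\gamma_{k}]=\sqrt{\exp(-F(y_{k+1}))/\exp(-F(x_{k}))} is easy to confirm by Monte Carlo. In the sketch below (illustrative, not from the paper), arbitrary small values deltas[i] stand in for the quantities (fi(yk+1)+fi(xk))/n(-f_{i}(y_{k+1})+f_{i}(x_{k}))/n:

```python
import numpy as np

# Monte Carlo check that gamma_k is unbiased for sqrt(exp(-F(y) + F(x))).
# deltas[i] plays the role of (-f_i(y_{k+1}) + f_i(x_k)) / n; the values
# are arbitrary and chosen small for illustration.
rng = np.random.default_rng(0)
n, p = 8, 0.5
deltas = rng.normal(scale=0.02, size=n)
target = np.exp(deltas.sum() / 2.0)   # sqrt(exp(-F(y) + F(x)))

trials = 200_000
included = rng.random((trials, n)) < p      # each i in S_k independently w.p. p
terms = np.where(included, (np.exp(deltas / 2.0) - 1.0) / p + 1.0, 1.0)
est = terms.prod(axis=1)                    # gamma_k for each trial
assert abs(est.mean() - target) < 5e-3
```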

Lemma 9.

Draw x0𝒩(x,1L𝐈)x_{0}\sim\mathcal{N}(x^{*},\tfrac{1}{L}\mathbf{I}). Let π^K\hat{\pi}_{K} be the output distribution of the algorithm of Definition 4 for KK steps starting from x0x_{0}, and let πK\pi_{K} be the output distribution of Algorithm 6 starting from x0x_{0}. For any δ[0,1]\delta\in[0,1], let p=5log12Kδnp=\tfrac{5\log\frac{12K}{\delta}}{n} in Algorithm 6. There exist

Cξ=O(1+logKδd),Cx=O(lognKδ), and Rx=O(dlogκKδμ),C_{\xi}=O\left(1+\sqrt{\frac{\log\frac{K}{\delta}}{d}}\right),\;C_{x}=O\left(\sqrt{\log\frac{nK}{\delta}}\right),\;\text{ and }R_{x}=O\left(\sqrt{\frac{d\log\frac{\kappa K}{\delta}}{\mu}}\right),

so that when h198Cx2L2Rx2+7LCξ2dh\leq\tfrac{1}{98C_{x}^{2}L^{2}R_{x}^{2}+7LC_{\xi}^{2}d}, we have πKπ^KTVδ\left\lVert\pi_{K}-\hat{\pi}_{K}\right\rVert_{\textup{TV}}\leq\delta.

Proof.

By the coupling definition of total variation, it suffices to upper bound the probability that the algorithms’ trajectories, sharing all randomness in proposing points yk+1y_{k+1}, differ. This can happen for two reasons: either we used an incorrect filtering step (i.e. the pair (xk,yk+1)(x_{k},y_{k+1}) did not lie in the second case of (21)), or we incorrectly rejected in Line 7 of Algorithm 6 because |Sk|>2pn|S_{k}|>2pn. We bound the probability of either happening over all iterations by δ\delta, yielding the conclusion.

Incorrect filtering.

Consider some iteration kk. Lemma 8 shows that as long as its three conditions hold in iteration kk, we are in the second case of (21), so it suffices to show all conditions hold. By Fact 2 and as ξk\xi_{k} is independent of all {fi(xk)}i[n]\{\nabla f_{i}(x_{k})\}_{i\in[n]}, with probability at least 1δ2K1-\tfrac{\delta}{2K}, both of the conditions ξk2Cξd\left\lVert\xi_{k}\right\rVert_{2}\leq C_{\xi}\sqrt{d} and |fi(xk)ξk|Cxfi(xk)2|\nabla f_{i}(x_{k})^{\top}\xi_{k}|\leq C_{x}\left\lVert\nabla f_{i}(x_{k})\right\rVert_{2} for all i[n]i\in[n] hold (recall that the distribution of vξv^{\top}\xi for ξ𝒩(0,𝐈)\xi\sim\mathcal{N}(0,\mathbf{I}) is the one-dimensional 𝒩(0,v22)\mathcal{N}(0,\left\lVert v\right\rVert_{2}^{2})), for some

Cξ=O(1+logKδd),Cx=O(lognKδ).C_{\xi}=O\left(1+\sqrt{\frac{\log\frac{K}{\delta}}{d}}\right),\;C_{x}=O\left(\sqrt{\log\frac{nK}{\delta}}\right).

Next, x0𝒩(x,1L𝐈)x_{0}\sim\mathcal{N}(x^{*},\tfrac{1}{L}\mathbf{I}) is drawn from a κd2\kappa^{\frac{d}{2}}-warm start for π\pi. By Fact 2, xx2Rx\left\lVert x-x^{*}\right\rVert_{2}\leq R_{x} holds for xx drawn from π\pi with probability at least 1δ4Kκd21-\tfrac{\delta}{4K}\cdot\kappa^{-\frac{d}{2}}, for some

Rx=O(dlogκKδμ).R_{x}=O\left(\sqrt{\frac{d\log\frac{\kappa K}{\delta}}{\mu}}\right).

Since warmness of the exact algorithm of Definition 4 is monotonic, as long as the trajectories have not differed up to iteration kk, xkx2Rx\left\lVert x_{k}-x^{*}\right\rVert_{2}\leq R_{x} also holds with probability 1δ4K\geq 1-\tfrac{\delta}{4K}. Inductively, the total variation error caused by incorrect filtering over KK steps is at most 3δ4\tfrac{3\delta}{4}.

Error due to large |Sk||S_{k}|.

Supposing all the conditions of Lemma 8 are satisfied in iteration kk, we show that with high probability, Inefficient-MRW and Algorithm 6 make the same accept or reject decision. By Lemma 8, Inefficient-MRW accepts (via the second case of (21)) with probability αk=34exp(F(yk+1))exp(F(xk))\alpha^{\prime}_{k}=\tfrac{3}{4}\sqrt{\tfrac{\exp(-F(y_{k+1}))}{\exp(-F(x_{k}))}}. On the other hand, Algorithm 6 accepts with probability

αk=34𝔼[γk|Sk|2pn]Pr[|Sk|2pn].\alpha_{k}=\frac{3}{4}\mathbb{E}\left[\gamma_{k}\mid|S_{k}|\leq 2pn\right]\cdot\Pr[|S_{k}|\leq 2pn].

The total variation between the output distributions is |αkαk||\alpha_{k}-\alpha^{\prime}_{k}|. Further, since by Lemma 8,

αk\displaystyle\alpha^{\prime}_{k} =34𝔼[γk]\displaystyle=\frac{3}{4}\mathbb{E}\left[\gamma_{k}\right]
=34(𝔼[γk|Sk|2pn]Pr[|Sk|2pn]+𝔼[γk|Sk|>2pn]Pr[|Sk|>2pn])\displaystyle=\frac{3}{4}\left(\mathbb{E}\left[\gamma_{k}\mid|S_{k}|\leq 2pn\right]\cdot\Pr[|S_{k}|\leq 2pn]+\mathbb{E}\left[\gamma_{k}\mid|S_{k}|>2pn\right]\cdot\Pr[|S_{k}|>2pn]\right)
=αk+34𝔼[γk|Sk|>2pn]Pr[|Sk|>2pn],\displaystyle=\alpha_{k}+\frac{3}{4}\mathbb{E}\left[\gamma_{k}\mid|S_{k}|>2pn\right]\cdot\Pr[|S_{k}|>2pn],

it suffices to upper bound this latter quantity. First, by Lemma 10, when p=5log12Kδnp=\tfrac{5\log\frac{12K}{\delta}}{n}, we have Pr[|Sk|>2pn]δ12K\Pr[|S_{k}|>2pn]\leq\tfrac{\delta}{12K}. Finally, since each iSki\in S_{k} is generated independently,

𝔼[γk|Sk|>2pn]\displaystyle\mathbb{E}\left[\gamma_{k}\mid|S_{k}|>2pn\right] maxS:|S|=2pn𝔼[i[n]γk(i)SSk]\displaystyle\leq\max_{S^{\prime}:|S^{\prime}|=2pn}\mathbb{E}\left[\prod_{i\in[n]}\gamma_{k}^{(i)}\mid S^{\prime}\subseteq S_{k}\right]
2𝔼[i[n]Sγk(i)]=2i[n]Sexp(1nfi(yk+1)+1nfi(xk))4.\displaystyle\leq 2\mathbb{E}\left[\prod_{i\in[n]\setminus S^{\prime}}\gamma_{k}^{(i)}\right]=2\sqrt{\prod_{i\in[n]\setminus S^{\prime}}\exp\left(-\frac{1}{n}f_{i}(y_{k+1})+\frac{1}{n}f_{i}(x_{k})\right)}\leq 4.

Here, we used Lemma 8 applied to the set SS^{\prime}, and the upper bound (23) we derived earlier. Combining these calculations shows that the total variation distance incurred in any iteration kk due to |Sk||S_{k}| being too large is at most δ4K\tfrac{\delta}{4K}, so the overall contribution over KK steps is at most δ4\tfrac{\delta}{4}. ∎

We used the following helper lemma in our analysis.

Lemma 10.

Let S[n]S\subseteq[n] be formed by independently including each i[n]i\in[n] with probability pp. Then,

Pr[|S|>2pn]exp(3pn14).\Pr\left[|S|>2pn\right]\leq\exp\left(-\frac{3pn}{14}\right).
Proof.

For i[n]i\in[n], let 𝟏iS\mathbf{1}_{i\in S} be the indicator random variable of the event iSi\in S, so 𝔼[𝟏iS]=p\mathbb{E}\left[\mathbf{1}_{i\in S}\right]=p and

Var[𝟏iSp]=p(1p)2+(1p)p22p.\textup{Var}\left[\mathbf{1}_{i\in S}-p\right]=p(1-p)^{2}+(1-p)p^{2}\leq 2p.

By Bernstein’s inequality,

Pr[i[n]𝟏iSnp+r]exp(12r22np+13r).\Pr\left[\sum_{i\in[n]}\mathbf{1}_{i\in S}\geq np+r\right]\leq\exp\left(-\frac{\frac{1}{2}r^{2}}{2np+\frac{1}{3}r}\right).

In particular, when r=pnr=pn, we have the desired conclusion. ∎
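The tail bound of Lemma 10 can likewise be sanity-checked empirically; the sketch below (ours, with arbitrary choices n=100n=100, p=0.1p=0.1) compares the empirical frequency of the event {|S|>2pn}\{|S|>2pn\} against exp(3pn/14)\exp(-3pn/14):

```python
import numpy as np

# Empirical check of Lemma 10: each index is included independently with
# probability p, and Pr[|S| > 2pn] should fall below exp(-3pn / 14).
rng = np.random.default_rng(0)
n, p = 100, 0.1
bound = np.exp(-3 * p * n / 14)                   # Bernstein-type bound
sizes = (rng.random((50_000, n)) < p).sum(axis=1)  # |S| across trials
empirical = (sizes > 2 * p * n).mean()
assert empirical <= bound
```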

6.2 Conductance analysis

We next bound the mixing time of Inefficient-MRW, using the following result from prior work. We remark (see Section 1.4) that in our application, the logβ\log\beta term is non-dominant.

Proposition 7 (Lemma 1, Lemma 2, [CDWY19]).

Let a random walk with a μ\mu-strongly logconcave stationary distribution π\pi on xdx\in\mathbb{R}^{d} have transition distributions {𝒯x}xd\{\mathcal{T}_{x}\}_{x\in\mathbb{R}^{d}}. For some ϵ[0,1]\epsilon\in[0,1], let convex set Ωd\Omega\subseteq\mathbb{R}^{d} have π(Ω)1ϵ22β2\pi(\Omega)\geq 1-\tfrac{\epsilon^{2}}{2\beta^{2}}. Let πstart\pi_{\textup{start}} be a β\beta-warm start for π\pi, and let the algorithm be initialized at x0πstartx_{0}\sim\pi_{\textup{start}}. Suppose for any x,xΩx,x^{\prime}\in\Omega with xx2Δ\left\lVert x-x^{\prime}\right\rVert_{2}\leq\Delta,

𝒯x𝒯xTV78.\left\lVert\mathcal{T}_{x}-\mathcal{T}_{x^{\prime}}\right\rVert_{\textup{TV}}\leq\frac{7}{8}. (24)

Then, the random walk mixes to total variation distance within ϵ\epsilon of π\pi in O(logβ+1Δ2μloglogβϵ)O(\log\beta+\tfrac{1}{\Delta^{2}\mu}\log\tfrac{\log\beta}{\epsilon}) iterations.

Consider an iteration of Inefficient-MRW from xk=xx_{k}=x. Let 𝒫x\mathcal{P}_{x} be the density of yk+1y_{k+1}, and let 𝒯x\mathcal{T}_{x} be the density of xk+1x_{k+1} after filtering. Define a convex set Ωd\Omega\subseteq\mathbb{R}^{d} parameterized by RΩ0R_{\Omega}\in\mathbb{R}_{\geq 0}:

Ω={xd:xx2RΩ}.\displaystyle\Omega=\{x\in\mathbb{R}^{d}:\left\lVert x-x^{*}\right\rVert_{2}\leq R_{\Omega}\}.

We show that for two close points x,xΩx,x^{\prime}\in\Omega, the total variation between 𝒯x\mathcal{T}_{x} and 𝒯x\mathcal{T}_{x^{\prime}} is small.

Lemma 11.

For some h=O(1L2RΩ2+Ld)h=O(\tfrac{1}{L^{2}R_{\Omega}^{2}+Ld}) and x,xΩx,x^{\prime}\in\Omega with xx218h\left\lVert x-x^{\prime}\right\rVert_{2}\leq\tfrac{1}{8}\sqrt{h}, 𝒯x𝒯xTV78\left\lVert\mathcal{T}_{x}-\mathcal{T}_{x^{\prime}}\right\rVert_{\textup{TV}}\leq\tfrac{7}{8}.

Proof.

By the triangle inequality of total variation distance,

𝒯x𝒯xTV𝒯x𝒫xTV+𝒫x𝒫xTV+𝒯x𝒫xTV.\left\lVert\mathcal{T}_{x}-\mathcal{T}_{x^{\prime}}\right\rVert_{\textup{TV}}\leq\left\lVert\mathcal{T}_{x}-\mathcal{P}_{x}\right\rVert_{\textup{TV}}+\left\lVert\mathcal{P}_{x}-\mathcal{P}_{x^{\prime}}\right\rVert_{\textup{TV}}+\left\lVert\mathcal{T}_{x^{\prime}}-\mathcal{P}_{x^{\prime}}\right\rVert_{\textup{TV}}.

First, by Pinsker’s inequality and the KL divergence between Gaussian distributions,

𝒫x𝒫xTV2KL(𝒫x||𝒫x)=xx22h.\left\|\mathcal{P}_{x}-\mathcal{P}_{x^{\prime}}\right\|_{\textup{TV}}\leq\sqrt{2\textup{KL}(\mathcal{P}_{x}||\mathcal{P}_{x^{\prime}})}=\frac{\left\|x-x^{\prime}\right\|_{2}}{\sqrt{2h}}.

When xx218h\left\lVert x-x^{\prime}\right\rVert_{2}\leq\tfrac{1}{8}\sqrt{h}, 𝒫x𝒫xTV18\left\lVert\mathcal{P}_{x}-\mathcal{P}_{x^{\prime}}\right\rVert_{\textup{TV}}\leq\tfrac{1}{8}. Next, we bound 𝒯x𝒫xTV\left\lVert\mathcal{T}_{x}-\mathcal{P}_{x}\right\rVert_{\textup{TV}}: by a standard calculation (e.g. Lemma D.1 of [LST20]), we have

𝒯x𝒫xTV\displaystyle\left\lVert\mathcal{T}_{x}-\mathcal{P}_{x}\right\rVert_{\textup{TV}} =134𝔼ξk+1[exp(F(yk+1))exp(F(xk))].\displaystyle=1-\frac{3}{4}\mathbb{E}_{\xi_{k+1}}\left[\sqrt{\frac{\exp\left(-F(y_{k+1})\right)}{\exp\left(-F(x_{k})\right)}}\right].

We show that 𝒯x𝒫xTV38\left\lVert\mathcal{T}_{x}-\mathcal{P}_{x}\right\rVert_{\textup{TV}}\leq\frac{3}{8}. It suffices to show that 𝔼ξk+1[exp(F(yk+1)+F(xk))]56.\mathbb{E}_{\xi_{k+1}}\left[\sqrt{\exp\left(-F(y_{k+1})+F(x_{k})\right)}\right]\geq\frac{5}{6}.

Since 1516exp(116)56\frac{15}{16}\sqrt{\exp\left(-\frac{1}{16}\right)}\geq\frac{5}{6}, it suffices to show that with probability at least 1516\tfrac{15}{16} over the randomness of ξk+1\xi_{k+1}, F(yk+1)+F(xk)116-F(y_{k+1})+F(x_{k})\geq-\frac{1}{16}. As ξk+1𝒩(0,𝕀d)\xi_{k+1}\sim\mathcal{N}(0,\mathbb{I}_{d}), by applying Fact 2 twice,

Pr[ξk+122>36d]exp(4)132,\displaystyle\Pr\left[\left\lVert\xi_{k+1}\right\rVert_{2}^{2}>36d\right]\leq\exp(-4)\leq\frac{1}{32}, (25)
Pr[|F(xk)ξk+1|236F(xk)22]132.\displaystyle\Pr\left[\left|\nabla F(x_{k})^{\top}\xi_{k+1}\right|^{2}\geq 36\left\lVert\nabla F(x_{k})\right\rVert_{2}^{2}\right]\leq\frac{1}{32}.

We upper bound the term F(yk+1)F(xk)F(y_{k+1})-F(x_{k}) by smoothness and Cauchy-Schwarz:

F(yk+1)F(xk)\displaystyle F(y_{k+1})-F(x_{k}) F(xk)(yk+1xk)+L2yk+1xk22\displaystyle\leq\nabla F(x_{k})^{\top}\left(y_{k+1}-x_{k}\right)+\frac{L}{2}\left\|y_{k+1}-x_{k}\right\|_{2}^{2}
2h|F(xk)ξk+1|+hLξk+122.\displaystyle\leq\sqrt{2h}\left|\nabla F(x_{k})^{\top}\xi_{k+1}\right|+hL\left\lVert\xi_{k+1}\right\rVert_{2}^{2}.

Then, since F(xk)2LRΩ\left\lVert\nabla F(x_{k})\right\rVert_{2}\leq LR_{\Omega} when xΩx\in\Omega, it is enough to choose h=O(1L2RΩ2+Ld)h=O(\tfrac{1}{L^{2}R_{\Omega}^{2}+Ld}) so that

F(yk+1)+F(xk)116,-F(y_{k+1})+F(x_{k})\geq-\frac{1}{16},

as long as the events of (25) hold, which occur with probability at least 1516\frac{15}{16}. Similarly, we can show that 𝒯x𝒫xTV38\left\lVert\mathcal{T}_{x^{\prime}}-\mathcal{P}_{x^{\prime}}\right\rVert_{\textup{TV}}\leq\tfrac{3}{8}. Combining the three bounds, we have the desired conclusion. ∎
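The first step of the proof, bounding 𝒫x𝒫xTV\left\lVert\mathcal{P}_{x}-\mathcal{P}_{x^{\prime}}\right\rVert_{\textup{TV}} via Pinsker's inequality, can be checked against the exact total variation between the two Gaussian proposals 𝒩(x,2h𝐈)\mathcal{N}(x,2h\mathbf{I}) and 𝒩(x,2h𝐈)\mathcal{N}(x^{\prime},2h\mathbf{I}), which depends only on the mean shift and has the closed form 2Φ(xx2/(22h))12\Phi(\left\lVert x-x^{\prime}\right\rVert_{2}/(2\sqrt{2h}))-1 (a closed form not needed in the proof; the sketch is ours):

```python
import math

# Compare the Pinsker bound ||P_x - P_x'||_TV <= ||x - x'||_2 / sqrt(2h)
# with the exact TV distance between N(x, 2h*I) and N(x', 2h*I), which
# depends only on d = ||x - x'||_2 and equals 2*Phi(d / (2*sigma)) - 1
# for sigma = sqrt(2h).
def normal_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

h = 0.01
sigma = math.sqrt(2 * h)
for d in [1e-3, 1e-2, 5e-2, math.sqrt(h) / 8]:
    exact_tv = 2 * normal_cdf(d / (2 * sigma)) - 1
    pinsker = d / math.sqrt(2 * h)
    assert exact_tv <= pinsker   # Pinsker dominates the exact TV
```

In particular, at the threshold xx2=18h\left\lVert x-x^{\prime}\right\rVert_{2}=\tfrac{1}{8}\sqrt{h} used in Lemma 11, the Pinsker bound is 182118\tfrac{1}{8\sqrt{2}}\leq\tfrac{1}{8}.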

We are now ready to prove Theorem 3.

Proof.

First, 𝒩(x,1L𝐈)\mathcal{N}(x^{*},\tfrac{1}{L}\mathbf{I}) yields a β=κd2\beta=\kappa^{\frac{d}{2}}-warm start for π\pi (see e.g. [DCWY18]). For this value of β\beta, by Fact 2 it suffices to choose

RΩ=Θ(dlogκϵμ)R_{\Omega}=\Theta\left(\sqrt{\frac{d\log\frac{\kappa}{\epsilon}}{\mu}}\right)

for π(Ω)1ϵ22β2\pi(\Omega)\geq 1-\tfrac{\epsilon^{2}}{2\beta^{2}}. Letting δ=ϵ2\delta=\tfrac{\epsilon}{2}, we will choose the step size hh and iteration count KK so that

1h=Θ(Lκdlog2nκdϵ),K=Θ(κ2dlog3nκdϵ)\displaystyle\frac{1}{h}=\Theta\left(L\kappa d\log^{2}\frac{n\kappa d}{\epsilon}\right),\;K=\Theta\left(\kappa^{2}d\log^{3}\frac{n\kappa d}{\epsilon}\right)

have constants compatible with Lemma 9. Note that this choice of hh is also sufficiently small to apply Lemma 11 for our choice of RΩR_{\Omega}. By applying Proposition 7 to the algorithm of Definition 4, and using the bound from Lemma 11, in KK iterations Inefficient-MRW will mix to total variation distance δ\delta to π\pi. Furthermore, applying Lemma 9, we conclude that Algorithm 6 has total variation distance at most 2δ=ϵ2\delta=\epsilon from π\pi.

It remains to bound the oracle complexity of Algorithm 6. Note that in every iteration, we never compute more than 4pn4pn values of {fi}i[n]\{f_{i}\}_{i\in[n]}, since we always reject if |Sk|>2pn|S_{k}|>2pn, and we only compute values for indices in SkS_{k}. For the value of pp in Lemma 9, this amounts to O(lognκdϵ)O(\log\tfrac{n\kappa d}{\epsilon}) value queries per iteration. ∎

Acknowledgments

YL and RS are supported by NSF awards CCF-1749609, CCF-1740551, DMS-1839116, and DMS-2023166, a Microsoft Research Faculty Fellowship, a Sloan Research Fellowship, and a Packard Fellowship. KT is supported by NSF CAREER Award CCF-1844855 and a PayPal research gift.

RS and KT would like to thank Sinho Chewi for his extremely generous help, in particular insightful conversations which led to our discovery of the gap in Section 3, as well as his suggested fix.

References

  • [AH16] Jacob D. Abernethy and Elad Hazan. Faster convex optimization: Simulated annealing with an efficient universal barrier. In Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 2520–2528, 2016.
  • [All17] Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. J. Mach. Learn. Res., 18:221:1–221:51, 2017.
  • [BCM+18] M Barkhagen, NH Chau, É Moulines, M Rásonyi, S Sabanis, and Y Zhang. On stochastic gradient langevin dynamics with dependent data streams in the logconcave case. arXiv preprint arXiv:1812.02709, 2018.
  • [BDMP17] Nicolas Brosse, Alain Durmus, Eric Moulines, and Marcelo Pereyra. Sampling from a log-concave distribution with compact support with proximal langevin monte carlo. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, Amsterdam, The Netherlands, 7-10 July 2017, pages 319–342, 2017.
  • [BEL18] Sébastien Bubeck, Ronen Eldan, and Joseph Lehec. Sampling from a log-concave distribution with projected langevin monte carlo. Discret. Comput. Geom., 59(4):757–783, 2018.
  • [Ber18] Espen Bernton. Langevin monte carlo and JKO splitting. In Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018, pages 1777–1798, 2018.
  • [BFFN19] Jack Baker, Paul Fearnhead, Emily B Fox, and Christopher Nemeth. Control variates for stochastic gradient mcmc. Statistics and Computing, 29(3):599–615, 2019.
  • [BFR+19] Joris Bierkens, Paul Fearnhead, Gareth Roberts, et al. The zig-zag process and super-efficient sampling for bayesian analysis of big data. The Annals of Statistics, 47(3):1288–1320, 2019.
  • [BT09] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sciences, 2(1):183–202, 2009.
  • [BV04] Dimitris Bertsimas and Santosh S. Vempala. Solving convex programs by random walks. J. ACM, 51(4):540–556, 2004.
  • [CCBJ18] Xiang Cheng, Niladri S. Chatterji, Peter L. Bartlett, and Michael I. Jordan. Underdamped langevin MCMC: A non-asymptotic analysis. In Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018, pages 300–323, 2018.
  • [CDWY19] Yuansi Chen, Raaz Dwivedi, Martin J. Wainwright, and Bin Yu. Fast mixing of metropolized hamiltonian monte carlo: Benefits of multi-step gradients. CoRR, abs/1905.12247, 2019.
  • [CFM+18] Niladri Chatterji, Nicolas Flammarion, Yian Ma, Peter Bartlett, and Michael Jordan. On the theory of variance reduction for stochastic gradient monte carlo. In International Conference on Machine Learning, pages 764–773, 2018.
  • [CV15] Benjamin Cousins and Santosh Vempala. Bypassing kls: Gaussian cooling and an o^*(n3) volume algorithm. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 539–548, 2015.
  • [CV18] Ben Cousins and Santosh S. Vempala. Gaussian cooling and O(n3)O^{*}(n^{3}) algorithms for volume and gaussian volume. SIAM J. Comput., 47(3):1237–1273, 2018.
  • [CV19] Zongchen Chen and Santosh S. Vempala. Optimal convergence rate of hamiltonian monte carlo for strongly logconcave distributions. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2019, September 20-22, 2019, Massachusetts Institute of Technology, Cambridge, MA, USA, pages 64:1–64:12, 2019.
  • [CWZ+17] Changyou Chen, Wenlin Wang, Yizhe Zhang, Qinliang Su, and Lawrence Carin. A convergence analysis for a class of practical variance-reduction stochastic gradient mcmc. arXiv preprint arXiv:1709.01180, 2017.
  • [Dal17] Arnak S Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 3(79):651–676, 2017.
  • [DBL14] Aaron Defazio, Francis R. Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 1646–1654, 2014.
  • [DCWY18] Raaz Dwivedi, Yuansi Chen, Martin J. Wainwright, and Bin Yu. Log-concave sampling: Metropolis-hastings algorithms are fast! In Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018, pages 793–797, 2018.
  • [DFK91] Martin E. Dyer, Alan M. Frieze, and Ravi Kannan. A random polynomial time algorithm for approximating the volume of convex bodies. J. ACM, 38(1):1–17, 1991.
  • [DK19] Arnak S Dalalyan and Avetik Karagulyan. User-friendly guarantees for the langevin monte carlo with inaccurate gradient. Stochastic Processes and their Applications, 129(12):5278–5311, 2019.
  • [DM19a] Alain Durmus and Éric Moulines. High-dimensional bayesian inference via the unadjusted langevin algorithm. Bernoulli, 25(4A):2854–2882, 2019.
  • [DM19b] Alain Durmus and Eric Moulines. High-dimensional bayesian inference via the unadjusted langevin algorithm. Bernoulli, 25(4A):2854–2882, 2019.
  • [DMM19] Alain Durmus, Szymon Majewski, and Blazej Miasojedow. Analysis of langevin monte carlo via convex optimization. J. Mach. Learn. Res., 20:73:1–73:46, 2019.
  • [DR18] Arnak S. Dalalyan and Lionel Riou-Durand. On sampling from a log-concave density using kinetic langevin diffusions. CoRR, abs/1807.09382, 2018.
  • [DRW+16] Kumar Avinava Dubey, Sashank J Reddi, Sinead A Williamson, Barnabas Poczos, Alexander J Smola, and Eric P Xing. Variance reduction in stochastic gradient langevin dynamics. In Advances in neural information processing systems, pages 1154–1162, 2016.
  • [DSM+16] Alain Durmus, Umut Simsekli, Eric Moulines, Roland Badeau, and Gaël Richard. Stochastic gradient richardson-romberg markov chain monte carlo. In Advances in Neural Information Processing Systems, pages 2047–2055, 2016.
  • [FGKS15] Roy Frostig, Rong Ge, Sham M. Kakade, and Aaron Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 2540–2548, 2015.
  • [GGZ18] Xuefeng Gao, Mert Gürbüzbalaban, and Lingjiong Zhu. Global convergence of stochastic gradient hamiltonian monte carlo for non-convex stochastic optimization: Non-asymptotic performance bounds and momentum-based acceleration. arXiv preprint arXiv:1809.04618, 2018.
  • [Gul92] Osman Guler. New proximal point algorithms for convex minimization. SIAM Journal on Optimization, 2(4):649–664, 1992.
  • [Har04] Gilles Hargé. A convex/log-concave correlation inequality for gaussian measure and an application to abstract wiener spaces. Probability theory and related fields, 130(3):415–440, 2004.
  • [HKRC18] Ya-Ping Hsieh, Ali Kavis, Paul Rolland, and Volkan Cevher. Mirrored langevin dynamics. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 2883–2892, 2018.
  • [JZ13] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 315–323, 2013.
  • [Kra40] Hendrik Anthony Kramers. Brownian motion in a field of force and the diffusion model of chemical reactions. Physica, 7(4):284–304, 1940.
  • [Lee18] Yin Tat Lee. Lecture 8: Stochastic methods and applications. Class notes, UW CSE 599: Interplay between Convex Optimization and Geometry, 2018.
  • [LK99] László Lovász and Ravi Kannan. Faster mixing via average conductance. In Proceedings of the Thirty-First Annual ACM Symposium on Theory of Computing, May 1-4, 1999, Atlanta, Georgia, USA, pages 282–287, 1999.
  • [LMH15] Hongzhou Lin, Julien Mairal, and Zaïd Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 3384–3392, 2015.
  • [LPW09] David Asher Levin, Yuval Peres, and Elizabeth Wilmer. Markov Chains and Mixing Times. American Mathematical Society, 2009.
  • [LST20] Yin Tat Lee, Ruoqi Shen, and Kevin Tian. Logsmooth gradient concentration and tighter runtimes for metropolized hamiltonian monte carlo. In Conference on Learning Theory, COLT 2020, 2020.
  • [LSV18] Yin Tat Lee, Zhao Song, and Santosh S. Vempala. Algorithmic theory of odes and sampling from well-conditioned logconcave densities. CoRR, abs/1812.06243, 2018.
  • [LV06a] László Lovász and Santosh Vempala. Simulated annealing in convex bodies and an o*(n4) volume algorithm. Journal of Computer and System Sciences, 72(2):392–417, 2006.
  • [LV06b] László Lovász and Santosh S. Vempala. Fast algorithms for logconcave functions: Sampling, rounding, integration and optimization. In 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2006), 21-24 October 2006, Berkeley, California, USA, Proceedings, pages 57–68, 2006.
  • [LV06c] László Lovász and Santosh S. Vempala. Hit-and-run from a corner. SIAM J. Comput., 35(4):985–1005, 2006.
  • [MFWB19] Wenlong Mou, Nicolas Flammarion, Martin J. Wainwright, and Peter L. Bartlett. An efficient sampling algorithm for non-smooth composite potentials. CoRR, abs/1910.00551, 2019.
  • [MMW+19] Wenlong Mou, Yi-An Ma, Martin J. Wainwright, Peter L. Bartlett, and Michael I. Jordan. High-order langevin diffusion yields an accelerated MCMC algorithm. CoRR, abs/1908.10859, 2019.
  • [NDH+17] Tigran Nagapetyan, Andrew B Duncan, Leonard Hasenclever, Sebastian J Vollmer, Lukasz Szpruch, and Konstantinos Zygalakis. The true cost of stochastic gradient langevin dynamics. arXiv preprint arXiv:1706.02692, 2017.
  • [Nea11] Radford M Neal. Mcmc using hamiltonian dynamics. Handbook of Markov chain Monte Carlo, 2(11):2, 2011.
  • [Nes83] Yurii Nesterov. A method for solving a convex programming problem with convergence rate O(1/k^2). Doklady AN SSSR, 269:543–547, 1983.
  • [NF19] Christopher Nemeth and Paul Fearnhead. Stochastic gradient Markov chain Monte Carlo. arXiv preprint arXiv:1907.06986, 2019.
  • [OV00] Felix Otto and Cédric Villani. Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality. Journal of Functional Analysis, 173(2):361–400, 2000.
  • [PB14] Neal Parikh and Stephen P. Boyd. Proximal algorithms. Found. Trends Optim., 1(3):127–239, 2014.
  • [Per16] Marcelo Pereyra. Proximal Markov chain Monte Carlo algorithms. Stat. Comput., 26(4):745–760, 2016.
  • [Roc76] R. Tyrrell Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877–898, 1976.
  • [SKR19] Adil Salim, Dmitry Kovalev, and Peter Richtárik. Stochastic proximal Langevin algorithm: Potential splitting and nonasymptotic rates. In Advances in Neural Information Processing Systems, pages 6653–6664, 2019.
  • [SL19] Ruoqi Shen and Yin Tat Lee. The randomized midpoint method for log-concave sampling. In Advances in Neural Information Processing Systems, pages 2100–2111, 2019.
  • [SRB17] Mark Schmidt, Nicolas Le Roux, and Francis R. Bach. Minimizing finite sums with the stochastic average gradient. Math. Program., 162(1-2):83–112, 2017.
  • [STL20] Ruoqi Shen, Kevin Tian, and Yin Tat Lee. Composite logconcave sampling with a restricted gaussian oracle. CoRR, abs/2006.05976, 2020.
  • [Tib96] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):267–288, 1996.
  • [Vem05] Santosh Vempala. Geometric random walks: A survey. MSRI Combinatorial and Computational Geometry, 52:573–612, 2005.
  • [Wib19] Andre Wibisono. Proximal Langevin algorithm: Rapid convergence under isoperimetry. CoRR, abs/1911.01469, 2019.
  • [WS16] Blake E. Woodworth and Nati Srebro. Tight complexity bounds for optimizing composite objectives. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 3639–3647, 2016.
  • [WT11] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.
  • [ZH05] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B (Methodological), 67(2):301–320, 2005.
  • [ZXG18] Difan Zou, Pan Xu, and Quanquan Gu. Subsampled stochastic variance-reduced gradient Langevin dynamics. In International Conference on Uncertainty in Artificial Intelligence, 2018.

Appendix A Discussion of inexactness tolerance

We briefly discuss the tolerance of our algorithms to approximation error in two places: the computation of minimizers, and the implementation of RGOs in the methods of Sections 3 and 5.

Inexact minimization.

For all function classes considered in this work, there exist efficient optimization methods converging to a minimizer with logarithmic dependence on the target accuracy.

Specifically, for negative log-densities with condition number $\kappa$, accelerated gradient descent [Nes83] converges at a rate $O(\sqrt{\kappa})$ with logarithmic dependence on initial error and target accuracy (we implicitly assumed in stating our runtimes that one can attain initial error polynomial in problem parameters for negative log-densities; otherwise, there is additional logarithmic overhead in the quality of the initial point to optimization procedures). For composite functions $f_{\textup{wc}}+f_{\textup{oracle}}$ where $f_{\textup{wc}}$ has condition number $\kappa$, the FISTA method of [BT09] converges at the same rate, with each iteration querying $\nabla f_{\textup{wc}}$ and a proximal oracle for $f_{\textup{oracle}}$ once; typically, access to a proximal oracle is a weaker assumption than access to a restricted Gaussian oracle, so this is not restrictive. Finally, for minimizing finite sums with condition number $\kappa$, the algorithm of [All17] obtains a convergence rate linearly dependent on $n+\sqrt{n\kappa}\leq n+\kappa$; alternatively, [JZ13] has a dependence on $n+\kappa$. In all our final runtimes, these optimization rates do not constitute the bottleneck for oracle complexities.
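To make the first claim concrete, the following is a minimal sketch (ours, not the paper's implementation) of constant-momentum accelerated gradient descent on a well-conditioned quadratic; the error contracts at rate roughly $1-1/\sqrt{\kappa}$ per step, so reaching accuracy $\epsilon$ costs $O(\sqrt{\kappa}\log(1/\epsilon))$ gradient queries.

```python
import numpy as np

def agd(grad, x0, L, mu, iters):
    """Constant-momentum accelerated gradient descent for an L-smooth,
    mu-strongly convex objective, given only a gradient oracle."""
    beta = (np.sqrt(L / mu) - 1) / (np.sqrt(L / mu) + 1)  # momentum weight
    x, x_prev = x0.copy(), x0.copy()
    for _ in range(iters):
        y = x + beta * (x - x_prev)     # extrapolation (momentum) step
        x, x_prev = y - grad(y) / L, x  # gradient step from the extrapolated point
    return x

# Toy example: ill-conditioned quadratic f(x) = (1/2) x^T diag(h) x with kappa = 1e3.
h = np.logspace(0, 3, 50)
grad = lambda x: h * x
x = agd(grad, np.ones(50), L=h.max(), mu=h.min(), iters=800)  # ||x|| is now tiny
```

The comparable vanilla gradient descent run would need on the order of $\kappa\log(1/\epsilon)$ queries on this instance.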

The only additional difficulty our algorithms may present is if the function requiring minimization, say of the form $f_{\textup{oracle}}(x)+\tfrac{1}{2\eta}\left\lVert x-y\right\rVert_{2}^{2}$ for some $y\in\mathbb{R}^{d}$, where we have computed the minimizer $x^{*}$ of $f_{\textup{oracle}}$, has $\left\lVert y-x^{*}\right\rVert_{2}^{2}$ very large (so the initial function error is large). However, in all our settings $y$ is drawn from a distribution with sub-Gaussian tails, so the probability that $\left\lVert y-x^{*}\right\rVert_{2}^{2}$ is large decays exponentially (whereas the complexity of first-order methods increases only logarithmically), negligibly affecting the expected oracle query complexity of our methods.

Finally, by solving the relevant optimization problems to high accuracy as a subroutine in each of our methods, and adjusting various distance bounds to the minimizer by constants (e.g. by expanding the radius in the definition of the sets $\Omega$ in Algorithm 4 and Section 6.2), we can accommodate tolerance to inexact minimization while only affecting the bounds throughout the paper by constants. The only other place that $x^{*}$ is used in our algorithms is in initializing warm starts; tolerance to inexactness in our warmness calculations follows essentially identically to Section 3.2.1 of [DCWY18].

Inexact oracle implementation.

Our algorithms based on restricted Gaussian oracle access are tolerant to total variation error inverse polynomial in problem parameters in the restricted Gaussian oracle for $g$. We discussed this at the end of Section 3, in the case of RGO use for our reduction framework. To see this in the case of the composite sampler in Section 5, we pessimistically handled the case where the sampler YSample for a quadratic restriction of $f$ resulted in total variation error in the proof of Proposition 5, assuming that the error was incurred in every iteration. By accounting for similar amounts of error in calls to $\mathcal{O}$ (on the order of $\tfrac{\epsilon}{T}$, where $T$ is the number of times an RGO was used), the bounds in our algorithm are only affected by constants.

Appendix B Deferred proofs from Section 5

B.1 Deferred proofs from Section 5.2

B.1.1 Approximate rejection sampling

We first define the rejection sampling framework we will use, and prove various properties.

Definition 5 (Approximate rejection sampling).

Let $\pi$ be a distribution, with $\tfrac{d\pi}{dx}(x)\propto p(x)$. Suppose the set $\Omega$ has $\pi(\Omega)=1-\epsilon^{\prime}$, and the distribution $\hat{\pi}$ with $\frac{d\hat{\pi}}{dx}(x)\propto\hat{p}(x)$ has, for some $C\geq 1$,

\frac{p(x)}{\hat{p}(x)}\leq C\text{ for all }x\in\Omega,\text{ and }\frac{\int\hat{p}(x)dx}{\int p(x)dx}\leq 1.

Suppose there is an algorithm $\mathcal{A}$ which draws samples from a distribution $\hat{\pi}^{\prime}$, such that $\left\lVert\hat{\pi}^{\prime}-\hat{\pi}\right\rVert_{\textup{TV}}\leq\delta$. We call the following scheme approximate rejection sampling: repeat independent runs of the following procedure until a point is outputted.

  1. Draw $x$ via $\mathcal{A}$ until $x\in\Omega$.

  2. With probability $\tfrac{p(x)}{C\hat{p}(x)}$, output $x$.
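To make the scheme concrete, here is a small self-contained sketch (our illustration, not code from the paper); `sample_A`, `p`, `p_hat`, `in_Omega`, and `C` play the roles of $\mathcal{A}$, $p$, $\hat{p}$, $\Omega$, and $C$ above, and the toy Gaussian instance at the bottom is hypothetical.

```python
import math
import random

def approximate_rejection_sampling(sample_A, p, p_hat, in_Omega, C):
    """Approximate rejection sampling in the sense of Definition 5: sample_A
    draws from a distribution close to pi-hat (density proportional to p_hat),
    and accepted points are approximately distributed as pi (proportional to p)."""
    while True:
        x = sample_A()
        if not in_Omega(x):  # step 1: redraw until x lands in Omega
            continue
        if random.random() <= p(x) / (C * p_hat(x)):  # step 2: accept w.p. p(x) / (C p_hat(x))
            return x

# Toy instance: target p ~ N(0, 1), proposal p_hat ~ N(0, 1/2), Omega = [-3, 3].
# Then p / p_hat = exp(x^2 / 2) <= C := exp(4.5) on Omega, and the integral of
# p_hat is smaller than that of p, as Definition 5 requires.
target = lambda x: math.exp(-0.5 * x * x)
proposal = lambda x: math.exp(-x * x)
draw = approximate_rejection_sampling(
    sample_A=lambda: random.gauss(0.0, math.sqrt(0.5)),
    p=target, p_hat=proposal,
    in_Omega=lambda x: abs(x) <= 3.0, C=math.exp(4.5),
)
```

On this instance the acceptance probability per draw is roughly $\sqrt{2}e^{-4.5}$, illustrating why Lemma 12's expected-calls bound degrades linearly in $C$.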

Lemma 12.

Consider an approximate rejection sampling scheme with relevant parameters defined as in Definition 5, with $2\delta\leq\tfrac{1-\epsilon^{\prime}}{C}$. The algorithm terminates in at most

\frac{1}{\frac{1-\epsilon^{\prime}}{C}-2\delta} (26)

calls to $\mathcal{A}$ in expectation, and outputs a point from a distribution $\pi^{\prime}$ with $\left\lVert\pi^{\prime}-\pi\right\rVert_{\textup{TV}}\leq\epsilon^{\prime}+\frac{2\delta C}{1-\epsilon^{\prime}}$.

Proof.

Define for notational simplicity the normalization constants $Z:=\int p(x)dx$ and $\hat{Z}:=\int\hat{p}(x)dx$. First, we bound the probability that any particular call to $\mathcal{A}$ returns in the scheme:

\int_{x\in\Omega}\frac{p(x)}{C\hat{p}(x)}d\hat{\pi}^{\prime}(x)\geq\int_{x\in\Omega}\frac{p(x)}{C\hat{p}(x)}d\hat{\pi}(x)-\left|\int_{x\in\Omega}\frac{p(x)}{C\hat{p}(x)}(d\hat{\pi}^{\prime}(x)-d\hat{\pi}(x))\right| (27)
=\int_{x\in\Omega}\frac{Z}{C\hat{Z}}d\pi(x)-\left|\int_{x\in\Omega}\frac{p(x)}{C\hat{p}(x)}(d\hat{\pi}^{\prime}(x)-d\hat{\pi}(x))\right|
\geq\frac{1-\epsilon^{\prime}}{C}-\int_{x\in\Omega}|d\hat{\pi}^{\prime}(x)-d\hat{\pi}(x)|\geq\frac{1-\epsilon^{\prime}}{C}-2\delta.

The second line followed from the definitions of $Z$ and $\hat{Z}$, and the third followed from the triangle inequality, the assumed lower bound on $Z/\hat{Z}$, and the total variation distance between $\hat{\pi}^{\prime}$ and $\hat{\pi}$. By linearity of expectation and independence, this proves the first claim.

Next, we claim the output distribution is close in total variation distance to the conditional distribution of $\pi$ restricted to $\Omega$. The derivation of (27) implies

\int_{x\in\Omega}\frac{p(x)}{C\hat{p}(x)}d\hat{\pi}(x)\geq\frac{1-\epsilon^{\prime}}{C},\;\left|\int_{x\in\Omega}\frac{p(x)}{C\hat{p}(x)}(d\hat{\pi}^{\prime}(x)-d\hat{\pi}(x))\right|\leq 2\delta, (28)
\implies 1-\frac{2\delta C}{1-\epsilon^{\prime}}\leq\frac{\int_{x\in\Omega}\frac{p(x)}{C\hat{p}(x)}d\hat{\pi}^{\prime}(x)}{\int_{x\in\Omega}\frac{p(x)}{C\hat{p}(x)}d\hat{\pi}(x)}\leq 1+\frac{2\delta C}{1-\epsilon^{\prime}}.

Thus, the total variation of the true output distribution from π\pi restricted to Ω\Omega is

12xΩ|dπ(x)1ϵp(x)Cp^(x)dπ^(x)xΩp(x)Cp^(x)𝑑π^(x)|\displaystyle\frac{1}{2}\int_{x\in\Omega}\left|\frac{d\pi(x)}{1-\epsilon^{\prime}}-\frac{\frac{p(x)}{C\hat{p}(x)}d\hat{\pi}^{\prime}(x)}{\int_{x\in\Omega}\frac{p(x)}{C\hat{p}(x)}d\hat{\pi}^{\prime}(x)}\right|
12xΩ|dπ(x)1ϵp(x)Cp^(x)dπ^(x)xΩp(x)Cp^(x)𝑑π^(x)|+12xΩ|p(x)Cp^(x)dπ^(x)xΩp(x)Cp^(x)𝑑π^(x)p(x)Cp^(x)dπ^(x)xΩp(x)Cp^(x)𝑑π^(x)|\displaystyle\leq\frac{1}{2}\int_{x\in\Omega}\left|\frac{d\pi(x)}{1-\epsilon^{\prime}}-\frac{\frac{p(x)}{C\hat{p}(x)}d\hat{\pi}^{\prime}(x)}{\int_{x\in\Omega}\frac{p(x)}{C\hat{p}(x)}d\hat{\pi}(x)}\right|+\frac{1}{2}\int_{x\in\Omega}\left|\frac{\frac{p(x)}{C\hat{p}(x)}d\hat{\pi}^{\prime}(x)}{\int_{x\in\Omega}\frac{p(x)}{C\hat{p}(x)}d\hat{\pi}(x)}-\frac{\frac{p(x)}{C\hat{p}(x)}d\hat{\pi}^{\prime}(x)}{\int_{x\in\Omega}\frac{p(x)}{C\hat{p}(x)}d\hat{\pi}^{\prime}(x)}\right|
12xΩ|dπ(x)1ϵp(x)Cp^(x)dπ^(x)xΩp(x)Cp^(x)𝑑π^(x)|+δC1ϵ=12xΩdπ(x)1ϵ|1dπ^dπ^(x)|+δC1ϵ.\displaystyle\leq\frac{1}{2}\int_{x\in\Omega}\left|\frac{d\pi(x)}{1-\epsilon^{\prime}}-\frac{\frac{p(x)}{C\hat{p}(x)}d\hat{\pi}^{\prime}(x)}{\int_{x\in\Omega}\frac{p(x)}{C\hat{p}(x)}d\hat{\pi}(x)}\right|+\frac{\delta C}{1-\epsilon^{\prime}}=\frac{1}{2}\int_{x\in\Omega}\frac{d\pi(x)}{1-\epsilon^{\prime}}\left|1-\frac{d\hat{\pi}^{\prime}}{d\hat{\pi}}(x)\right|+\frac{\delta C}{1-\epsilon^{\prime}}.

The first inequality was the triangle inequality, and we bounded the second term via (28). To obtain the final equality, we used

\int_{x\in\Omega}\frac{p(x)}{C\hat{p}(x)}d\hat{\pi}(x)=\int_{x\in\Omega}\frac{Z}{C\hat{Z}}d\pi(x)=\frac{(1-\epsilon^{\prime})Z}{C\hat{Z}}
\implies\frac{\frac{p(x)}{C\hat{p}(x)}d\hat{\pi}^{\prime}(x)}{\int_{x\in\Omega}\frac{p(x)}{C\hat{p}(x)}d\hat{\pi}(x)}=\frac{p(x)}{Z}\cdot\frac{\hat{Z}}{\hat{p}(x)}\cdot\frac{1}{1-\epsilon^{\prime}}\cdot d\hat{\pi}^{\prime}(x)=\frac{d\pi(x)}{1-\epsilon^{\prime}}\cdot\frac{d\hat{\pi}^{\prime}}{d\hat{\pi}}(x).

We now bound this final term. Observe that the given conditions imply that $\tfrac{d\pi}{d\hat{\pi}}(x)$ is bounded by $C$ everywhere in $\Omega$. Thus, expanding, we have

\frac{1}{2}\int_{x\in\Omega}\frac{d\pi(x)}{1-\epsilon^{\prime}}\left|1-\frac{d\hat{\pi}^{\prime}}{d\hat{\pi}}(x)\right|\leq\frac{C}{2(1-\epsilon^{\prime})}\int_{x\in\Omega}|d\hat{\pi}(x)-d\hat{\pi}^{\prime}(x)|\leq\frac{\delta C}{1-\epsilon^{\prime}}.

Finally, combining these guarantees, and the fact that restricting $\pi$ to $\Omega$ loses $\epsilon^{\prime}$ in total variation distance, yields the desired conclusion by the triangle inequality. ∎

Corollary 6.

Let $\hat{\theta}(x)$ be an unbiased estimator for $\tfrac{p(x)}{\hat{p}(x)}$, and suppose $\hat{\theta}(x)\leq C$ with probability $1$ for all $x\in\Omega$. Then, implementing the procedure of Definition 5 with acceptance probability $\tfrac{\hat{\theta}(x)}{C}$ has the same runtime bound and total variation guarantee as given by Lemma 12.

Proof.

It suffices to take expectations over the randomness of $\hat{\theta}$ everywhere in the proof of Lemma 12. ∎

B.1.2 Distribution ratio bounds

We next show two bounds relating the densities of the distributions $\pi$ and $\hat{\pi}$. We first define the normalization constants of (15), (17) for shorthand, and then tightly bound their ratio.

Definition 6 (Normalization constants).

We denote the normalization constants of $\pi$ and $\hat{\pi}$ by

Z_{\pi}:=\int_{x}\exp\left(-f(x)-g(x)\right)dx,
Z_{\hat{\pi}}:=\int_{x,y}\exp\left(-f(y)-g(x)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)dxdy.
Lemma 13 (Normalization constant bounds).

Let $Z_{\pi}$ and $Z_{\hat{\pi}}$ be as in Definition 6. Then,

\left(\frac{2\pi\eta}{1+\eta L}\right)^{\frac{d}{2}}\left(1+\frac{\eta L^{2}}{\mu}\right)^{-\frac{d}{2}}\leq\frac{Z_{\hat{\pi}}}{Z_{\pi}}\leq(2\pi\eta)^{\frac{d}{2}}.
Proof.

For each $x$, by convexity we have

\int_{y}\exp\left(-f(y)-g(x)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)dy (29)
\leq\exp\left(-g(x)-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)\int_{y}\exp\left(-f(x)-\left\langle\nabla f(x),y-x\right\rangle-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right)dy
=\exp\left(-f(x)-g(x)-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)\int_{y}\exp\left(\frac{\eta}{2}\left\lVert\nabla f(x)\right\rVert_{2}^{2}-\frac{1}{2\eta}\left\lVert y-x+\eta\nabla f(x)\right\rVert_{2}^{2}\right)dy
=(2\pi\eta)^{\frac{d}{2}}\exp\left(-f(x)-g(x)\right)\exp\left(\frac{\eta}{2}\left\lVert\nabla f(x)\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)
\leq(2\pi\eta)^{\frac{d}{2}}\exp\left(-f(x)-g(x)\right).

Integrating both sides over $x$ yields the upper bound on $\tfrac{Z_{\hat{\pi}}}{Z_{\pi}}$. Next, for the lower bound we have a similar derivation. For each $x$, by smoothness,

\int_{y}\exp\left(-f(y)-g(x)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)dy
\geq\exp\left(-f(x)-g(x)-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)\int_{y}\exp\left(\left\langle\nabla f(x),x-y\right\rangle-\frac{1+\eta L}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right)dy
=\exp\left(-f(x)-g(x)-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}+\frac{\eta}{2(1+\eta L)}\left\lVert\nabla f(x)\right\rVert_{2}^{2}\right)\left(\frac{2\pi\eta}{1+\eta L}\right)^{\frac{d}{2}}
\geq\exp\left(-f(x)-g(x)-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)\left(\frac{2\pi\eta}{1+\eta L}\right)^{\frac{d}{2}}.

Integrating both sides over $x$ yields

\frac{Z_{\hat{\pi}}}{Z_{\pi}}\geq\left(\frac{2\pi\eta}{1+\eta L}\right)^{\frac{d}{2}}\frac{\int_{x}\exp\left(-f(x)-g(x)-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)dx}{\int_{x}\exp\left(-f(x)-g(x)\right)dx}\geq\left(\frac{2\pi\eta}{1+\eta L}\right)^{\frac{d}{2}}\left(1+\frac{\eta L^{2}}{\mu}\right)^{-\frac{d}{2}}.

The last inequality followed from Proposition 11, where we used that $f+g$ is $\mu$-strongly convex. ∎
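Both directions of the argument above reduce to the following standard completing-the-square identity for Gaussian integrals (recorded here for convenience; it is used with $\sigma=\eta$ and $\sigma=\tfrac{\eta}{1+\eta L}$ respectively): for any $a,x\in\mathbb{R}^{d}$ and $\sigma>0$,

```latex
\int_{y}\exp\left(-\left\langle a,y-x\right\rangle-\frac{1}{2\sigma}\left\lVert y-x\right\rVert_{2}^{2}\right)dy
=\exp\left(\frac{\sigma}{2}\left\lVert a\right\rVert_{2}^{2}\right)\int_{y}\exp\left(-\frac{1}{2\sigma}\left\lVert y-x+\sigma a\right\rVert_{2}^{2}\right)dy
=(2\pi\sigma)^{\frac{d}{2}}\exp\left(\frac{\sigma}{2}\left\lVert a\right\rVert_{2}^{2}\right).
```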

Lemma 14 (Relative density bounds).

Let $\eta=\tfrac{1}{32L\kappa d\log(288\kappa/\epsilon)}$. For all $x\in\Omega$, as defined in (16), $\frac{d\pi}{d\hat{\pi}}(x)\leq 2$. Here, $\tfrac{d\hat{\pi}}{dx}(x)$ denotes the marginal density of $\hat{\pi}$. Moreover, for all $x\in\mathbb{R}^{d}$, $\frac{d\pi}{d\hat{\pi}}(x)\geq\tfrac{1}{2}$.

Proof.

We first show the upper bound. By Lemma 13,

\frac{d\pi}{d\hat{\pi}}(x)=\frac{\exp\left(-f(x)-g(x)\right)}{\int_{y}\exp\left(-f(y)-g(x)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)dy}\cdot\frac{Z_{\hat{\pi}}}{Z_{\pi}} (30)
\leq\frac{\exp\left(-f(x)-g(x)\right)}{\int_{y}\exp\left(-f(y)-g(x)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)dy}\cdot(2\pi\eta)^{\frac{d}{2}}.

We now bound the first term, for $x\in\Omega$. By smoothness, we have

\frac{\exp\left(-f(y)-g(x)\right)}{\exp\left(-f(x)-g(x)\right)}\geq\exp\left(\left\langle\nabla f(x),x-y\right\rangle-\frac{L}{2}\left\lVert y-x\right\rVert_{2}^{2}\right),

so applying this for each $y$,

\frac{\int_{y}\exp\left(-f(y)-g(x)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)dy}{\exp\left(-f(x)-g(x)\right)}
\geq\exp\left(-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)\int_{y}\exp\left(\left\langle\nabla f(x),x-y\right\rangle-\frac{1+\eta L}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right)dy
=\exp\left(-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}+\frac{\eta}{2(1+\eta L)}\left\lVert\nabla f(x)\right\rVert_{2}^{2}\right)\int_{y}\exp\left(-\frac{1+\eta L}{2\eta}\left\lVert x-y-\frac{\eta}{1+\eta L}\nabla f(x)\right\rVert_{2}^{2}\right)dy
\geq\exp\left(-\frac{\eta L^{2}}{2}\cdot\frac{16d\log(288\kappa/\epsilon)}{\mu}\right)\left(\frac{2\pi\eta}{1+\eta L}\right)^{\frac{d}{2}}\geq\frac{3}{4}\left(\frac{2\pi\eta}{1+\eta L}\right)^{\frac{d}{2}}.

In the last line, we used that $x\in\Omega$ implies $\left\lVert x-x^{*}\right\rVert_{2}^{2}\leq\tfrac{16d\log(288\kappa/\epsilon)}{\mu}$, and the definition of $\eta$. Combining this bound with (30), we have the desired

\frac{d\pi}{d\hat{\pi}}(x)\leq\frac{4}{3}\left(1+\eta L\right)^{\frac{d}{2}}\leq 2.

Next, we consider the lower bound. By combining (29) with Lemma 13, we have the desired

\frac{d\pi}{d\hat{\pi}}(x)=\frac{\exp\left(-f(x)-g(x)\right)}{\int_{y}\exp\left(-f(y)-g(x)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)dy}\cdot\frac{Z_{\hat{\pi}}}{Z_{\pi}}
\geq(2\pi\eta)^{-\frac{d}{2}}\cdot\left(\frac{2\pi\eta}{1+\eta L}\right)^{\frac{d}{2}}\left(1+\frac{\eta L^{2}}{\mu}\right)^{-\frac{d}{2}}=\left(\frac{1}{1+\eta L}\right)^{\frac{d}{2}}\left(1+\eta L\kappa\right)^{-\frac{d}{2}}\geq\frac{1}{2}. ∎

B.1.3 Correctness of Composite-Sample-Shared-Min

See Proposition 4.

Proof.

We remark that $\eta=\tfrac{1}{32L\kappa d\log(288\kappa/\epsilon)}$ is precisely the choice of $\eta$ in Sample-Joint-Dist where $\delta=\epsilon/18$, as in Composite-Sample-Shared-Min. First, we may apply Fact 2 to conclude that the measure of the set $\Omega$ with respect to the $\mu$-strongly logconcave density $\pi$ is at least $1-\epsilon/3$. The conclusion of correctness will follow from an appeal to Corollary 6, with parameters

C=4,\;\epsilon^{\prime}=\frac{\epsilon}{3},\;\delta=\frac{\epsilon}{18}.

Note that indeed $\epsilon^{\prime}+\tfrac{2\delta C}{1-\epsilon^{\prime}}$ is bounded by $\epsilon$, as $1-\epsilon^{\prime}\geq\tfrac{2}{3}$. Moreover, the expected number of calls (26) is clearly bounded by a constant as well.

We now show that these parameters satisfy the requirements of Corollary 6. Define the functions

p(x):=\exp(-f(x)-g(x)),
\hat{p}(x):=(2\pi\eta)^{-\frac{d}{2}}\int_{y}\exp\left(-f(y)-g(x)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)dy,

and observe that clearly the densities of $\pi$ and $\hat{\pi}$ are respectively proportional to $p$ and $\hat{p}$. Moreover, define $Z=\int p(x)dx$ and $\hat{Z}=\int\hat{p}(x)dx$. By comparing these definitions with Lemma 13, we have $Z=Z_{\pi}$ and $\hat{Z}=(2\pi\eta)^{-\frac{d}{2}}Z_{\hat{\pi}}$, so by the upper bound in Lemma 13, $\hat{Z}/Z\leq 1$. Next, we claim that the following procedure produces an unbiased estimator for $\tfrac{p(x)}{\hat{p}(x)}$.

  1. Sample $y\sim\pi_{x}$, where $\tfrac{d\pi_{x}(y)}{dy}\propto\exp\left(-f(y)-\tfrac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right)$.

  2. $\alpha\leftarrow\exp\left(f(y)-\left\langle\nabla f(x),y-x\right\rangle-\tfrac{L}{2}\left\lVert y-x\right\rVert_{2}^{2}+g(x)+\tfrac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)$.

  3. Output $\hat{\theta}(x)\leftarrow\exp\left(-f(x)-g(x)-\tfrac{\eta}{2(1+\eta L)}\left\lVert\nabla f(x)\right\rVert_{2}^{2}\right)(1+\eta L)^{\frac{d}{2}}\alpha$.

To prove correctness of this estimator $\hat{\theta}$, define for simplicity

Z_{x}:=\int_{y}\exp\left(-f(y)-g(x)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)dy.

We compute, using $\tfrac{d\pi_{x}(y)}{dy}=\tfrac{\exp(-f(y)-g(x)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2})}{Z_{x}}$ (which agrees with the density in step 1, as the additional terms are constant in $y$), that

\mathbb{E}_{\pi_{x}}\left[\alpha\right]=\int_{y}\exp\left(f(y)-\left\langle\nabla f(x),y-x\right\rangle-\frac{L}{2}\left\lVert y-x\right\rVert_{2}^{2}+g(x)+\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)d\pi_{x}(y)
=\frac{1}{Z_{x}}\int_{y}\exp\left(-\left\langle\nabla f(x),y-x\right\rangle-\frac{L}{2}\left\lVert y-x\right\rVert_{2}^{2}-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right)dy
=\frac{1}{Z_{x}}\exp\left(\frac{\eta}{2(1+\eta L)}\left\lVert\nabla f(x)\right\rVert_{2}^{2}\right)\left(\frac{2\pi\eta}{1+\eta L}\right)^{\frac{d}{2}},

where the last equality completed the square. This implies that the output quantity

\hat{\theta}(x)=\exp\left(-f(x)-g(x)-\frac{\eta}{2(1+\eta L)}\left\lVert\nabla f(x)\right\rVert_{2}^{2}\right)(1+\eta L)^{\frac{d}{2}}\alpha

is unbiased for $\tfrac{p(x)}{\hat{p}(x)}=\exp(-f(x)-g(x))Z_{x}^{-1}(2\pi\eta)^{\frac{d}{2}}$. Finally, note that for any $y$ used in the definition of $\hat{\theta}(x)$, we have $\alpha\leq\exp\left(f(x)+g(x)+\tfrac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)$, by using $f(y)-f(x)-\left\langle\nabla f(x),y-x\right\rangle-\tfrac{L}{2}\left\lVert y-x\right\rVert_{2}^{2}\leq 0$ via smoothness. Therefore,

\hat{\theta}(x)\leq(1+\eta L)^{\frac{d}{2}}\exp\left(\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)\leq 4.

Here, we used the definition of $\eta$ and $L^{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\leq 16L\kappa d\log(288\kappa/\epsilon)$ by the definition of $\Omega$. ∎
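As an illustration, one draw of this estimator can be sketched as follows (our code, not from the paper; `sample_pi_x` stands for an exact sampler for the density of step 1, and the oracles `f`, `grad_f`, `g` and the point `x_star` are assumed given). The sign of the $\tfrac{\eta}{2(1+\eta L)}\left\lVert\nabla f(x)\right\rVert_{2}^{2}$ correction is chosen so the estimator is exactly unbiased; when $f$ is a quadratic with curvature exactly $L$ and $g\equiv 0$, the smoothness inequality is tight and the estimator becomes deterministic, equal to $\tfrac{p(x)}{\hat{p}(x)}$.

```python
import math

import numpy as np

def theta_hat(x, f, grad_f, g, x_star, eta, L, d, sample_pi_x):
    """One draw of an unbiased estimator for p(x) / p_hat(x)."""
    # Step 1: y ~ pi_x, with density proportional to exp(-f(y) - ||y - x||^2 / (2 eta)).
    y = sample_pi_x(x)
    gfx = grad_f(x)
    # Step 2: alpha from the drawn y.
    alpha = math.exp(
        f(y) - float(gfx @ (y - x)) - 0.5 * L * float((y - x) @ (y - x))
        + g(x) + 0.5 * eta * L**2 * float((x - x_star) @ (x - x_star))
    )
    # Step 3: rescale so that E[theta_hat] = p(x) / p_hat(x).
    return (
        math.exp(-f(x) - g(x) - eta / (2 * (1 + eta * L)) * float(gfx @ gfx))
        * (1 + eta * L) ** (d / 2) * alpha
    )
```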

B.2 Deferred proofs from Section 5.3

Throughout this section, for an error tolerance $\delta\in[0,1]$ which parameterizes Sample-Joint-Dist, we denote for shorthand a high-probability region $\Omega_{\delta}$ and its radius $R_{\delta}$ by

\Omega_{\delta}:=\left\{x\mid\left\lVert x-x^{*}\right\rVert_{2}\leq R_{\delta}\right\},\text{ for }R_{\delta}:=4\sqrt{\frac{d\log(16\kappa/\delta)}{\mu}}. (31)

The following density ratio bounds hold within this region, by simply modifying Lemma 14.

Corollary 7.

Let $\eta=\tfrac{1}{32L\kappa d\log(16\kappa/\delta)}$, and let $\hat{\pi}$ be parameterized by this choice of $\eta$ in (17). For all $x\in\Omega_{\delta}$, as defined in (31), $\frac{d\pi}{d\hat{\pi}}(x)\leq 2$. Moreover, for all $x\in\mathbb{R}^{d}$, $\frac{d\pi}{d\hat{\pi}}(x)\geq\frac{1}{2}$.

The following claim follows immediately from applying Fact 2.

Lemma 15.

With probability at least $1-\tfrac{\delta^{2}}{8(1+\kappa)^{d}}$, $x\sim\hat{\pi}$ lies in $\Omega_{\delta}$.

Finally, when clear from context, we overload $\hat{\pi}$ as a distribution on $x\in\mathbb{R}^{d}$ to be the $x$-component marginal of the distribution (17), i.e. with density

\frac{d\hat{\pi}}{dx}(x)\propto\int_{y}\exp\left(-f(y)-g(x)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)dy.

We first note that $\hat{\pi}$ is stationary for Sample-Joint-Dist; this follows immediately from Lemma 1. In Section B.2.1, we bound the conductance of the walk. We then use this bound in Section B.2.2 to bound the mixing time and overall complexity of Sample-Joint-Dist.

B.2.1 Conductance of Sample-Joint-Dist

We bound the conductance of this random walk, as a process on the iterates $\{x_{k}\}$, to show the final point has distribution close to the marginal of $\hat{\pi}$ on $x$. To do so, we break Proposition 7 into two pieces, which we will use in a more white-box manner to prove our conductance bound.

Definition 7 (Restricted conductance).

Let a random walk with stationary distribution $\hat{\pi}$ on $x\in\mathbb{R}^{d}$ have transition densities $\mathcal{T}_{x}$, and let $\Omega\subseteq\mathbb{R}^{d}$. The $\Omega$-restricted conductance, for $v\in(0,\tfrac{1}{2}\hat{\pi}(\Omega))$, is

\Phi_{\Omega}(v)=\inf_{\hat{\pi}(S\cap\Omega)\in(0,v]}\frac{\mathcal{T}_{S}(S^{c})}{\hat{\pi}(S\cap\Omega)},\text{ where }\mathcal{T}_{S}(S^{c}):=\int_{x\in S}\int_{x^{\prime}\in S^{c}}\mathcal{T}_{x}(x^{\prime})d\hat{\pi}(x)dx^{\prime}.
Proposition 8 (Lemma 1, [CDWY19]).

Let $\pi_{\textup{start}}$ be a $\beta$-warm start for $\hat{\pi}$, and let $x_{0}\sim\pi_{\textup{start}}$. For some $\delta>0$, let $\Omega\subseteq\mathbb{R}^{d}$ have $\hat{\pi}(\Omega)\geq 1-\tfrac{\delta^{2}}{2\beta^{2}}$. Suppose that a random walk with stationary distribution $\hat{\pi}$ satisfies the $\Omega$-restricted conductance bound

\Phi_{\Omega}(v)\geq\sqrt{B\log\left(\frac{1}{v}\right)},\text{ for all }v\in\left[\frac{4}{\beta},\frac{1}{2}\right].

Let $x_{K}$ be the result of $K$ steps of this random walk, starting from $x_{0}$. Then, for

K\geq\frac{64}{B}\log\left(\frac{\log\beta}{2\delta}\right),

the resulting distribution of $x_{K}$ has total variation at most $\tfrac{\delta}{2}$ from $\hat{\pi}$.

We state a well-known strategy for lower bounding conductance, via showing the stationary distribution has good isoperimetry and that transition distributions of nearby points have large overlap.

Proposition 9 (Lemma 2, [CDWY19]).

Let a random walk with stationary distribution $\hat{\pi}$ on $x\in\mathbb{R}^{d}$ have transition distribution densities $\mathcal{T}_{x}$, let $\Omega\subseteq\mathbb{R}^{d}$, and let $\hat{\pi}_{\Omega}$ be the conditional distribution of $\hat{\pi}$ on $\Omega$. Suppose for any $x,x^{\prime}\in\Omega$ with $\left\lVert x-x^{\prime}\right\rVert_{2}\leq\Delta$,

\left\lVert\mathcal{T}_{x}-\mathcal{T}_{x^{\prime}}\right\rVert_{\textup{TV}}\leq\frac{1}{2}.

Also, suppose $\hat{\pi}_{\Omega}$ satisfies, for any partition $S_{1}$, $S_{2}$, $S_{3}$ of $\Omega$, where $d(S_{1},S_{2})$ is the minimum Euclidean distance between points in $S_{1}$, $S_{2}$, the log-isoperimetric inequality

\hat{\pi}_{\Omega}(S_{3})\geq\frac{1}{2\psi}d(S_{1},S_{2})\cdot\min\left(\hat{\pi}_{\Omega}(S_{1}),\hat{\pi}_{\Omega}(S_{2})\right)\cdot\sqrt{\log\left(1+\frac{1}{\min\left(\hat{\pi}_{\Omega}(S_{1}),\hat{\pi}_{\Omega}(S_{2})\right)}\right)}. (32)

Then, we have the following bound for all $v\in(0,\tfrac{1}{2}]$:

\Phi_{\Omega}(v)\geq\min\left(1,\frac{\Delta}{128\psi}\sqrt{\log\left(\frac{1}{v}\right)}\right).

To utilize Propositions 8 and 9, we prove the following bounds in Appendices C.1, C.2, and C.3.

Lemma 16 (Warm start).

For $\eta\leq\tfrac{1}{L\kappa d}$, the distribution $\pi_{\textup{start}}$ defined in (18) is a $2(1+\kappa)^{\frac{d}{2}}$-warm start for $\hat{\pi}$.

Lemma 17 (Transitions of nearby points).

Suppose $\eta L\leq 1$, $\eta L^{2}R_{\delta}^{2}\leq\tfrac{1}{2}$, and $400d^{2}\eta\leq R_{\delta}^{2}$. For a point $x$, let $\mathcal{T}_{x}$ be the density of $x_{k}$ after sampling according to Lines 6 and 7 of Algorithm 5 from $x_{k-1}=x$. For $x,x^{\prime}\in\Omega_{\delta}$ with $\left\lVert x-x^{\prime}\right\rVert_{2}\leq\tfrac{\sqrt{\eta}}{10}$, for $\Omega_{\delta}$ defined in (31), we have $\left\lVert\mathcal{T}_{x}-\mathcal{T}_{x^{\prime}}\right\rVert_{\textup{TV}}\leq\tfrac{1}{2}$.

Lemma 18 (Isoperimetry).

The density $\hat{\pi}$ and set $\Omega_{\delta}$ defined in (17), (31) satisfy (32) with $\psi=8\mu^{-\frac{1}{2}}$.

We note that the parameters of Algorithm 5 and the set $\Omega_{\delta}$ in (31) satisfy all assumptions of Lemmas 16, 17, and 18. By combining these results in the context of Proposition 9, we see that the random walk satisfies the following bound for all $v\in(0,\tfrac{1}{2}]$:

\Phi_{\Omega_{\delta}}(v)\geq\sqrt{\frac{\eta\mu}{2^{20}\cdot 100}\cdot\log\left(\frac{1}{v}\right)}.

Plugging this conductance lower bound, the high-probability guarantee of Ωδ\Omega_{\delta} by Lemma 15, and the warm start bound of Lemma 16 into Proposition 8, we have the following conclusion.

Corollary 8 (Mixing time of ideal Sample-Joint-Dist).

Assume that calls to YSample are exact in the implementation of Sample-Joint-Dist. Then, for any error parameter δ\delta, and

K:=226100ημlog(dlog(16κ)4δ),K:=\frac{2^{26}\cdot 100}{\eta\mu}\log\left(\frac{d\log(16\kappa)}{4\delta}\right),

the distribution of xKx_{K} has total variation at most δ2\tfrac{\delta}{2} from π^\hat{\pi}.

B.2.2 Complexity of Sample-Joint-Dist

We first state a guarantee on the subroutine YSample, which we prove in Appendix C.4.

Lemma 19 (YSample guarantee).

For δ[0,1]\delta\in[0,1], define RδR_{\delta} as in (31), and let η=132Lκdlog(16κ/δ)\eta=\tfrac{1}{32L\kappa d\log(16\kappa/\delta)}. For any xx with xx2κdlog(16κ/δ)Rδ\left\lVert x-x^{*}\right\rVert_{2}\leq\sqrt{\kappa d\log(16\kappa/\delta)}\cdot R_{\delta}, Algorithm 7 (YSample) draws an exact sample yy from the density proportional to exp(f(y)12ηyx22)\exp\left(-f(y)-\tfrac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right) in an expected 22 iterations.

We also state a result due to [CDWY19], which bounds the mixing time of 1-step Metropolized HMC for well-conditioned distributions; this handles the case when xx2\left\lVert x-x^{*}\right\rVert_{2} is large in Algorithm 7.

Proposition 10 (Theorem 1, [CDWY19]).

Let \pi be a distribution on \mathbb{R}^{d} whose negative log-density is convex and has condition number bounded by a constant. Then, Metropolized HMC initialized at an explicit starting distribution mixes to within total variation distance \delta of \pi in O(d\log(\tfrac{d}{\delta})) iterations.

See 5

Proof.

Under an exact YSample, Corollary 8 shows the output distribution of Sample-Joint-Dist has total variation at most \tfrac{\delta}{2} from \hat{\pi}. Next, the distribution produced by the subroutine YSample is never more than \delta/(2Kd\log(\frac{d\kappa}{\delta})) away in total variation distance from that of an exact sampler. Since the algorithm runs for K steps, the coupling characterization of total variation implies that this inexactness incurs additional error at most \delta/(2d\log(\frac{d\kappa}{\delta})), proving correctness (in fact, the resulting distribution is always at most O((d\log(d\kappa/\delta))^{-1}) away in total variation from that of an exact YSample).

Next, we prove the guarantee on the expected number of gradient evaluations per iteration. Lemma 19 shows that whenever the current iterate x_{k} satisfies \left\lVert x_{k}-x^{*}\right\rVert_{2}\leq\sqrt{\kappa d\log(16\kappa/\delta)}\cdot R_{\delta}, the expected number of gradient evaluations is constant; moreover, Proposition 10 shows that the number of gradient evaluations is never larger than O(d\log(\tfrac{d\kappa}{\delta})), where we use that the condition number of the log-density in (19) is bounded by a constant. Therefore, it suffices to show that in every iteration 0\leq k\leq K, the probability that \left\lVert x_{k}-x^{*}\right\rVert_{2}>\sqrt{\kappa d\log(16\kappa/\delta)}\cdot R_{\delta} is O((d\log(d\kappa/\delta))^{-1}). By the warmness guarantee of Lemma 16 and the concentration bound in Fact 2, the probability that x_{0} fails this bound is negligible (inverse exponential in \kappa d^{2}\log(\kappa/\delta)). Since warmness is monotonically decreasing under an exact sampler (this fact is well-known in the literature; a simple proof is that taking one step of the Markov chain from a warm distribution induces a convex combination of warm point masses, which is again warm), and the accumulated error due to inexactness of YSample is at most O((d\log(d\kappa/\delta))^{-1}) over the whole algorithm, this holds for all iterations. ∎

Appendix C Mixing time ingredients

We now prove facts which are used in the mixing time analysis of Sample-Joint-Dist. Throughout this section, as in the specification of Sample-Joint-Dist, ff and gg are functions with properties as in (15), and share a minimizer xx^{*}.

C.1 Warm start

We show that we obtain a warm start for the distribution π^\hat{\pi} in algorithm Sample-Joint-Dist via one call to the restricted Gaussian oracle for gg, by proving Lemma 16.

See 16

Proof.

By the definitions of π^\hat{\pi} and πstart\pi_{\textup{start}} in (17), (18), we wish to bound everywhere the quantity

dπstartdπ^(x)=Zπ^Zstartexp(L2xx22ηL22xx22g(x))yexp(f(y)g(x)12ηyx22ηL22xx22)𝑑y.\frac{d\pi_{\textup{start}}}{d\hat{\pi}}(x)=\frac{Z_{\hat{\pi}}}{Z_{\textup{start}}}\cdot\frac{\exp\left(-\frac{L}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}-g(x)\right)}{\int_{y}\exp\left(-f(y)-g(x)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)dy}. (33)

Here, Zπ^Z_{\hat{\pi}} is as in Definition 6, and we let ZstartZ_{\textup{start}} denote the normalization constant of πstart\pi_{\textup{start}}, i.e.

Zstart:=xexp(L2xx22ηL22xx22g(x))𝑑x.Z_{\textup{start}}:=\int_{x}\exp\left(-\frac{L}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}-g(x)\right)dx.

Regarding the first term of (33), the earlier derivation (29) showed

yexp(f(y)g(x)12ηyx22ηL22xx22)𝑑y(2πη)d2exp(f(x)g(x)).\int_{y}\exp\left(-f(y)-g(x)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)dy\leq(2\pi\eta)^{\frac{d}{2}}\exp\left(-f(x)-g(x)\right).

Then, integrating, we can bound the ratio of the normalization constants

\displaystyle\frac{Z_{\hat{\pi}}}{Z_{\textup{start}}} \leq\frac{\int_{x}(2\pi\eta)^{\frac{d}{2}}\exp\left(-f(x)-g(x)\right)dx}{\int_{x}\exp\left(-\frac{L}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}-g(x)\right)dx} (34)
x(2πη)d2exp(f(x)μ2xx22g(x))𝑑xxexp(L2xx22μ2xx22g(x))𝑑x\displaystyle\leq\frac{\int_{x}(2\pi\eta)^{\frac{d}{2}}\exp\left(-f(x^{*})-\frac{\mu}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}-g(x)\right)dx}{\int_{x}\exp\left(-\frac{L}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}-\frac{\mu}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}-g(x)\right)dx}
(2πη)d2exp(f(x))(1+Lμ)d2.\displaystyle\leq(2\pi\eta)^{\frac{d}{2}}\exp\left(-f(x^{*})\right)\left(1+\frac{L}{\mu}\right)^{\frac{d}{2}}.

The second inequality followed from the \mu-strong convexity of f and the assumption \eta L^{2}\leq\mu. The last inequality followed from Proposition 11, where we used that \frac{\mu}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}+g(x) is \mu-strongly convex. Next, to bound the second term of (33), notice first that

exp(L2xx22ηL22xx22g(x))yexp(f(y)g(x)12ηyx22ηL22xx22)𝑑y=exp(L2xx22)yexp(f(y)12ηyx22)𝑑y.\displaystyle\frac{\exp\left(-\frac{L}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}-g(x)\right)}{\int_{y}\exp\left(-f(y)-g(x)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)dy}=\frac{\exp\left(-\frac{L}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)}{\int_{y}\exp\left(-f(y)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right)dy}.

It thus suffices to lower bound exp(L2xx22)yexp(f(y)12ηyx22)𝑑y\exp\left(\frac{L}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)\int_{y}\exp\left(-f(y)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right)dy. We have

exp(L2xx22)yexp(f(y)12ηyx22)𝑑y\displaystyle\exp\left(\frac{L}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)\int_{y}\exp\left(-f(y)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right)dy (35)
exp(f(x)+L2xx22)yexp(f(x),yx(12η+L2)yx22)𝑑y\displaystyle\geq\exp\left(-f(x)+\frac{L}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)\int_{y}\exp\left(-\langle\nabla f(x),y-x\rangle-\left(\frac{1}{2\eta}+\frac{L}{2}\right)\left\lVert y-x\right\rVert_{2}^{2}\right)dy
=exp(f(x)+L2xx22)(2πη1+Lη)d2exp(η2(1+Lη)f(x)22)\displaystyle=\exp\left(-f(x)+\frac{L}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)\left(\frac{2\pi\eta}{1+L\eta}\right)^{\frac{d}{2}}\exp\left(\frac{\eta}{2(1+L\eta)}\left\lVert\nabla f(x)\right\rVert_{2}^{2}\right)
exp(f(x))(2πη1+Lη)d2\displaystyle\geq\exp(-f(x^{*}))\left(\frac{2\pi\eta}{1+L\eta}\right)^{\frac{d}{2}}

The first and third steps followed from LL-smoothness of ff, and the second applied the Gaussian integral (Fact 1). Combining the bounds in (34) and (35), (33) becomes

dπstartdπ^(x)(1+Lμ)d2(1+Lη)d22(1+κ)d2,\displaystyle\frac{d\pi_{\textup{start}}}{d\hat{\pi}}(x)\leq\left(1+\frac{L}{\mu}\right)^{\frac{d}{2}}\left(1+L\eta\right)^{\frac{d}{2}}\leq 2(1+\kappa)^{\frac{d}{2}},

where xdx\in\mathbb{R}^{d} was arbitrary, which completes the proof. ∎

C.2 Transitions of nearby points

Here, we prove Lemma 17. Throughout this section, \mathcal{T}_{x} is the density of x_{k} after the steps in Lines 6 and 7 of Sample-Joint-Dist (Algorithm 5), starting at x_{k-1}=x. We also define \mathcal{P}_{x} to be the density of y_{k} induced by the step in Line 6 alone. We first make a simplifying observation: by Observation 1, for any two points x, x^{\prime}, we have

𝒯x𝒯xTV𝒫x𝒫xTV.\left\lVert\mathcal{T}_{x}-\mathcal{T}_{x^{\prime}}\right\rVert_{\textup{TV}}\leq\left\lVert\mathcal{P}_{x}-\mathcal{P}_{x^{\prime}}\right\rVert_{\textup{TV}}.

Thus, it suffices to understand 𝒫x𝒫xTV\left\lVert\mathcal{P}_{x}-\mathcal{P}_{x^{\prime}}\right\rVert_{\textup{TV}} for nearby x,xΩδx,x^{\prime}\in\Omega_{\delta}. Our proof of Lemma 17 combines two pieces: (1) bounding the ratio of normalization constants ZxZ_{x}, ZxZ_{x^{\prime}} of 𝒫x\mathcal{P}_{x} and 𝒫x\mathcal{P}_{x^{\prime}} for nearby xx, xx^{\prime} in Lemma 22 and (2) the structural result Proposition 12. To bound the normalization constant ratio, we state two helper lemmas. Lemma 20 characterizes facts about the minimizer of

f(y)+12ηyx22.f(y)+\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}. (36)
Lemma 20.

Let f be convex with minimizer x^{*}, and for a given x, let y_{x} minimize (36). Then,

  1. For any x, x^{\prime}: \left\lVert y_{x}-y_{x^{\prime}}\right\rVert_{2}\leq\left\lVert x-x^{\prime}\right\rVert_{2}.

  2. For any x: \left\lVert y_{x}-x^{*}\right\rVert_{2}\leq\left\lVert x-x^{*}\right\rVert_{2}.

  3. For any x with \left\lVert x-x^{*}\right\rVert_{2}\leq R: \left\lVert x-y_{x}\right\rVert_{2}\leq\eta LR.

Proof.

By optimality conditions in the definition of yxy_{x},

ηf(yx)=xyx.\eta\nabla f(y_{x})=x-y_{x}.

Fix two points xx, xx^{\prime}, and let xt:=(1t)x+txx_{t}:=(1-t)x+tx^{\prime}. Letting 𝐉x(yx)\mathbf{J}_{x}(y_{x}) be the Jacobian matrix of yxy_{x},

ddtηf(yxt)=ddt(xtyxt)\displaystyle\frac{d}{dt}\eta\nabla f(y_{x_{t}})=\frac{d}{dt}\left(x_{t}-y_{x_{t}}\right) η2f(yxt)𝐉x(yxt)(xx)=(𝐈𝐉x(yxt))(xx)\displaystyle\implies\eta\nabla^{2}f(y_{x_{t}})\mathbf{J}_{x}(y_{x_{t}})(x^{\prime}-x)=(\mathbf{I}-\mathbf{J}_{x}(y_{x_{t}}))(x^{\prime}-x)
𝐉x(yxt)(xx)=(𝐈+η2f(yxt))1(xx).\displaystyle\implies\mathbf{J}_{x}(y_{x_{t}})(x^{\prime}-x)=(\mathbf{I}+\eta\nabla^{2}f(y_{x_{t}}))^{-1}(x^{\prime}-x).

We can then compute

yxyx=01ddtyxt𝑑t=01𝐉x(yxt)(xx)𝑑t=01(𝐈+η2f(yxt))1(xx)𝑑t.y_{x^{\prime}}-y_{x}=\int_{0}^{1}\frac{d}{dt}y_{x_{t}}dt=\int_{0}^{1}\mathbf{J}_{x}(y_{x_{t}})(x^{\prime}-x)dt=\int_{0}^{1}(\mathbf{I}+\eta\nabla^{2}f(y_{x_{t}}))^{-1}(x^{\prime}-x)dt.

By triangle inequality and convexity of ff, the first claim follows:

yxyx201(𝐈+η2f(yxt))12xx2𝑑txx2.\left\lVert y_{x^{\prime}}-y_{x}\right\rVert_{2}\leq\int_{0}^{1}\left\lVert(\mathbf{I}+\eta\nabla^{2}f(y_{x_{t}}))^{-1}\right\rVert_{2}\left\lVert x^{\prime}-x\right\rVert_{2}dt\leq\left\lVert x^{\prime}-x\right\rVert_{2}.

The second claim follows from the first by yx=xy_{x^{*}}=x^{*}. The third claim follows from the second via

\left\lVert x-y_{x}\right\rVert_{2}=\eta\left\lVert\nabla f(y_{x})\right\rVert_{2}\leq\eta L\left\lVert y_{x}-x^{*}\right\rVert_{2}\leq\eta LR. ∎
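As an illustrative aside (not part of the formal development), the three claims of Lemma 20 are easy to check numerically. The sketch below, under the hypothetical choice f(y) = \sum_{i}\log\cosh(y_{i}) (so that \nabla f = \tanh, L = 1, and x^{*} = 0), computes y_{x} by iterating the optimality condition \eta\nabla f(y_{x}) = x - y_{x}.

```python
import numpy as np

def prox_point(grad_f, x, eta, iters=300):
    """Compute y_x = argmin_y f(y) + ||y - x||^2 / (2 * eta).

    The optimality condition eta * grad_f(y_x) = x - y_x makes
    y -> x - eta * grad_f(y) an (eta * L)-contraction, so plain
    fixed-point iteration converges whenever eta * L < 1.
    """
    y = np.array(x, dtype=float)
    for _ in range(iters):
        y = x - eta * grad_f(y)
    return y
```

On random inputs, one can then confirm all three claims: the map x \mapsto y_{x} is 1-Lipschitz, y_{x} is no farther from x^{*} than x is, and \left\lVert x-y_{x}\right\rVert_{2}\leq\eta L\left\lVert x-x^{*}\right\rVert_{2}.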

Next, Lemma 21 states well-known bounds on the integral of \exp(-h) for a well-conditioned function h.

Lemma 21.

Let h be an L_{h}-smooth, \mu_{h}-strongly convex function, and let y^{*}_{h} be its minimizer. Then

\left(2\pi L_{h}^{-1}\right)^{\frac{d}{2}}\exp\left(-h(y^{*}_{h})\right)\leq\int_{y}\exp\left(-h(y)\right)dy\leq\left(2\pi\mu_{h}^{-1}\right)^{\frac{d}{2}}\exp\left(-h(y^{*}_{h})\right).
Proof.

By smoothness and strong convexity,

exp(h(yh)Lh2yyh22)exp(h(y))exp(h(yh)μh2yyh22).\exp\left(-h(y^{*}_{h})-\frac{L_{h}}{2}\left\lVert y-y^{*}_{h}\right\rVert_{2}^{2}\right)\leq\exp(-h(y))\leq\exp\left(-h(y^{*}_{h})-\frac{\mu_{h}}{2}\left\lVert y-y^{*}_{h}\right\rVert_{2}^{2}\right).

The result follows by Gaussian integrals, i.e. Fact 1. ∎
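As a quick numerical sanity check of Lemma 21 (illustrative only), the sketch below integrates e^{-h} in one dimension for the hypothetical choice h(y) = \tfrac{\mu_{h}}{2}y^{2} + \log\cosh(y), which is \mu_{h}-strongly convex and (\mu_{h}+1)-smooth with minimizer 0 and h(0) = 0.

```python
import numpy as np

mu_h = 0.5
L_h = mu_h + 1.0  # (log cosh)'' = 1 - tanh(y)^2 lies in [0, 1]
h = lambda y: 0.5 * mu_h * y**2 + np.log(np.cosh(y))

# Riemann sum over a wide interval; the integrand is negligible outside it.
y = np.linspace(-30.0, 30.0, 600_001)
integral = np.sum(np.exp(-h(y))) * (y[1] - y[0])

lower = np.sqrt(2 * np.pi / L_h)   # (2*pi*L_h^{-1})^{d/2} * exp(-h(0)), d = 1
upper = np.sqrt(2 * np.pi / mu_h)  # (2*pi*mu_h^{-1})^{d/2} * exp(-h(0))
```

The computed integral indeed lands between the two Gaussian integrals predicted by the lemma.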

We now define the normalization constants of 𝒫x\mathcal{P}_{x} and 𝒫x\mathcal{P}_{x^{\prime}}:

Zx=yexp(f(y)12ηyx22)𝑑y,\displaystyle Z_{x}=\int_{y}\exp\left(-f(y)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right)dy, (37)
Zx=yexp(f(y)12ηyx22)𝑑y.\displaystyle Z_{x^{\prime}}=\int_{y}\exp\left(-f(y)-\frac{1}{2\eta}\left\lVert y-x^{\prime}\right\rVert_{2}^{2}\right)dy.

We apply Lemma 20 and Lemma 21 to bound the ratio of ZxZ_{x} and ZxZ_{x^{\prime}}.

Lemma 22.

Let ff be μ\mu-strongly convex and LL-smooth. Let x,xΩδx,x^{\prime}\in\Omega_{\delta}, for Ωδ\Omega_{\delta} defined in (31), and let xx2Δ\left\lVert x-x^{\prime}\right\rVert_{2}\leq\Delta. Then, the normalization constants ZxZ_{x} and ZxZ_{x^{\prime}} in (37) satisfy

ZxZx1.05exp(3LRΔ+LΔ22).\frac{Z_{x}}{Z_{x^{\prime}}}\leq 1.05\exp\left(3LR\Delta+\frac{L\Delta^{2}}{2}\right).
Proof.

First, applying Lemma 21 to ZxZ_{x} and ZxZ_{x^{\prime}} yields that the ratio is bounded by

\displaystyle\frac{Z_{x}}{Z_{x^{\prime}}} \leq\frac{\exp\left(-f(y_{x})-\frac{1}{2\eta}\left\lVert y_{x}-x\right\rVert_{2}^{2}\right)\left(2\pi\left(\mu+\frac{1}{\eta}\right)^{-1}\right)^{\frac{d}{2}}}{\exp\left(-f(y_{x^{\prime}})-\frac{1}{2\eta}\left\lVert y_{x^{\prime}}-x^{\prime}\right\rVert_{2}^{2}\right)\left(2\pi\left(L+\frac{1}{\eta}\right)^{-1}\right)^{\frac{d}{2}}}
1.05exp(f(yx)f(yx)+12η(yxx22yxx22)).\displaystyle\leq 1.05\exp\left(f(y_{x^{\prime}})-f(y_{x})+\frac{1}{2\eta}\left(\left\lVert y_{x^{\prime}}-x^{\prime}\right\rVert_{2}^{2}-\left\lVert y_{x}-x\right\rVert_{2}^{2}\right)\right).

Here, since \eta^{-1}\geq 32Ld, we used the bound

(L+1ημ+1η)d/21.05.\left(\frac{L+\frac{1}{\eta}}{\mu+\frac{1}{\eta}}\right)^{d/2}\leq 1.05.

Regarding the remaining term, recall xx, xx^{\prime} both belong to Ωδ\Omega_{\delta}, and xx2Δ\left\lVert x-x^{\prime}\right\rVert_{2}\leq\Delta. We have

f(yx)f(yx)+12η(yxx22yxx22)\displaystyle f(y_{x^{\prime}})-f(y_{x})+\frac{1}{2\eta}\left(\left\lVert y_{x^{\prime}}-x^{\prime}\right\rVert_{2}^{2}-\left\lVert y_{x}-x\right\rVert_{2}^{2}\right)
f(yx),yxyx+L2yxyx22+12ηyxx+yxx,yxyx+xx\displaystyle\leq\left\langle\nabla f(y_{x}),y_{x^{\prime}}-y_{x}\right\rangle+\frac{L}{2}\left\lVert y_{x^{\prime}}-y_{x}\right\rVert_{2}^{2}+\frac{1}{2\eta}\left\langle y_{x^{\prime}}-x^{\prime}+y_{x}-x,y_{x^{\prime}}-y_{x}+x-x^{\prime}\right\rangle
LRΔ+LΔ22+12η(yxx2+yxx2)(yxyx2+xx2)\displaystyle\leq LR\Delta+\frac{L\Delta^{2}}{2}+\frac{1}{2\eta}\left(\left\lVert y_{x}-x\right\rVert_{2}+\left\lVert y_{x^{\prime}}-x^{\prime}\right\rVert_{2}\right)\left(\left\lVert y_{x^{\prime}}-y_{x}\right\rVert_{2}+\left\lVert x^{\prime}-x\right\rVert_{2}\right)
LRΔ+LΔ22+2ηLR2η(yxyx2+xx2)3LRΔ+LΔ22.\displaystyle\leq LR\Delta+\frac{L\Delta^{2}}{2}+\frac{2\eta LR}{2\eta}\left(\left\lVert y_{x^{\prime}}-y_{x}\right\rVert_{2}+\left\lVert x^{\prime}-x\right\rVert_{2}\right)\leq 3LR\Delta+\frac{L\Delta^{2}}{2}.

The first inequality was by smoothness and expanding the difference of quadratics. The second was by \left\lVert\nabla f(y_{x})\right\rVert_{2}\leq L\left\lVert y_{x}-x^{*}\right\rVert_{2}\leq LR and \left\lVert y_{x^{\prime}}-y_{x}\right\rVert_{2}\leq\Delta, where we used the first and second parts of Lemma 20; we also applied Cauchy-Schwarz and the triangle inequality. The third used the third part of Lemma 20. Finally, the last inequality was by the first part of Lemma 20 and \left\lVert x^{\prime}-x\right\rVert_{2}\leq\Delta. ∎

We are now ready to prove Lemma 17. See 17

Proof.

First, by Observation 1, it suffices to show 𝒫x𝒫xTV12\left\lVert\mathcal{P}_{x}-\mathcal{P}_{x^{\prime}}\right\rVert_{\textup{TV}}\leq\tfrac{1}{2}. Pinsker’s inequality states

𝒫x𝒫xTV12dKL(𝒫x,𝒫x),\left\lVert\mathcal{P}_{x}-\mathcal{P}_{x^{\prime}}\right\rVert_{\textup{TV}}\leq\sqrt{\frac{1}{2}d_{\text{KL}}\left(\mathcal{P}_{x},\mathcal{P}_{x^{\prime}}\right)},

where dKLd_{\text{KL}} is KL-divergence, so it is enough to show dKL(𝒫x,𝒫x)12d_{\text{KL}}\left(\mathcal{P}_{x},\mathcal{P}_{x^{\prime}}\right)\leq\tfrac{1}{2}. Notice that

dKL(𝒫x,𝒫x)=log(ZxZx)+y𝒫x(y)log(exp(f(y)12ηyx22)exp(f(y)12ηyx22))𝑑y.\displaystyle d_{\text{KL}}\left(\mathcal{P}_{x},\mathcal{P}_{x^{\prime}}\right)=\log\left(\frac{Z_{x^{\prime}}}{Z_{x}}\right)+\int_{y}\mathcal{P}_{x}(y)\log\left(\frac{\exp\left(-f(y)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right)}{\exp\left(-f(y)-\frac{1}{2\eta}\left\lVert y-x^{\prime}\right\rVert_{2}^{2}\right)}\right)dy.

By Lemma 22, the first term satisfies, for Δ:=η10\Delta:=\tfrac{\sqrt{\eta}}{10},

log(ZxZx)3LRΔ+LΔ22+log(1.05).\log\left(\frac{Z_{x^{\prime}}}{Z_{x}}\right)\leq 3LR\Delta+\frac{L\Delta^{2}}{2}+\log(1.05).

To bound the second term, we have

y𝒫x(y)log(exp(f(y)12ηyx22)exp(f(y)12ηyx22))𝑑y\displaystyle\int_{y}\mathcal{P}_{x}(y)\log\left(\frac{\exp\left(-f(y)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right)}{\exp\left(-f(y)-\frac{1}{2\eta}\left\lVert y-x^{\prime}\right\rVert_{2}^{2}\right)}\right)dy =12ηy𝒫x(y)(yx22yx22)𝑑y\displaystyle=\frac{1}{2\eta}\int_{y}\mathcal{P}_{x}(y)\left(\left\lVert y-x^{\prime}\right\rVert_{2}^{2}-\left\lVert y-x\right\rVert_{2}^{2}\right)dy
=12ηy𝒫x(y)xx,2(yx)+(xx)𝑑y\displaystyle=\frac{1}{2\eta}\int_{y}\mathcal{P}_{x}(y)\left\langle x-x^{\prime},2\left(y-x\right)+\left(x-x^{\prime}\right)\right\rangle dy
Δ22η+Δηyy𝒫x(y)𝑑yx2.\displaystyle\leq\frac{\Delta^{2}}{2\eta}+\frac{\Delta}{\eta}\left\|\int_{y}y\mathcal{P}_{x}(y)dy-x\right\|_{2}.

Here, the second line was by expanding and the third line was by xx2Δ\left\lVert x-x^{\prime}\right\rVert_{2}\leq\Delta and Cauchy-Schwarz. By Proposition 12, yy𝒫x(y)𝑑yx22ηLR\left\|\int_{y}y\mathcal{P}_{x}(y)dy-x\right\|_{2}\leq 2\eta LR, where by assumption the parameters satisfy the conditions of Proposition 12. Then, combining the two bounds, we have

dKL(𝒫x,𝒫x)3LRΔ+LΔ22+Δ22η+2LRΔ+log(1.05)=5LRΔ+LΔ22+Δ22η+log(1.05).d_{\text{KL}}\left(\mathcal{P}_{x},\mathcal{P}_{x^{\prime}}\right)\leq 3LR\Delta+\frac{L\Delta^{2}}{2}+\frac{\Delta^{2}}{2\eta}+2LR\Delta+\log(1.05)=5LR\Delta+\frac{L\Delta^{2}}{2}+\frac{\Delta^{2}}{2\eta}+\log(1.05).

When Δ=η10\Delta=\tfrac{\sqrt{\eta}}{10}, ηL1\eta L\leq 1, and ηL2R212\eta L^{2}R^{2}\leq\tfrac{1}{2}, we have the desired

d_{\text{KL}}\left(\mathcal{P}_{x},\mathcal{P}_{x^{\prime}}\right)\leq\frac{\sqrt{\eta}LR}{2}+\frac{L\eta}{200}+\frac{1}{200}+\log(1.05)\leq\frac{1}{2}. ∎
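As an aside, the use of Pinsker's inequality above is easy to illustrate numerically in the Gaussian case (a sketch under hypothetical one-dimensional parameters): for two Gaussians with common variance \eta and means \Delta apart, d_{\text{KL}} = \Delta^{2}/(2\eta), and the total variation distance computed by quadrature sits below \sqrt{d_{\text{KL}}/2}.

```python
import numpy as np

eta = 0.04
delta = np.sqrt(eta) / 10  # the separation scale Delta used in Lemma 17

y = np.linspace(-2.0, 2.0, 400_001)
dy = y[1] - y[0]
p = np.exp(-(y**2) / (2 * eta)) / np.sqrt(2 * np.pi * eta)
q = np.exp(-((y - delta) ** 2) / (2 * eta)) / np.sqrt(2 * np.pi * eta)

tv = 0.5 * np.sum(np.abs(p - q)) * dy  # total variation by quadrature
kl = delta**2 / (2 * eta)              # closed form for equal-variance Gaussians
```

With these parameters the computed total variation is well below \tfrac{1}{2}, consistent with the conclusion of Lemma 17.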

C.3 Isoperimetry

In this section, we prove Lemma 18, which asserts that \hat{\pi}_{\Omega_{\delta}} satisfies the log-isoperimetric inequality (32). Here, we define \hat{\pi}_{\Omega_{\delta}} to be the x-marginal of \hat{\pi} conditioned on the set \Omega_{\delta}. We recall this means that for any partition S_{1}, S_{2}, S_{3} of \Omega_{\delta},

π^Ωδ(S3)12ψd(S1,S2)min(π^Ωδ(S1),π^Ωδ(S2))log(1+1min(π^Ωδ(S1),π^Ωδ(S2))).\hat{\pi}_{\Omega_{\delta}}(S_{3})\geq\frac{1}{2\psi}d(S_{1},S_{2})\cdot\min\left(\hat{\pi}_{\Omega_{\delta}}(S_{1}),\hat{\pi}_{\Omega_{\delta}}(S_{2})\right)\cdot\sqrt{\log\left(1+\frac{1}{\min\left(\hat{\pi}_{\Omega_{\delta}}(S_{1}),\hat{\pi}_{\Omega_{\delta}}(S_{2})\right)}\right)}.

The following fact was shown in [CDWY19].

Lemma 23 ([CDWY19], Lemma 11).

Any μ\mu-strongly logconcave distribution π\pi satisfies the log-isoperimetric inequality (32) with ψ=μ12\psi=\mu^{-\frac{1}{2}}.

Observe that \pi_{\Omega_{\delta}}, the restriction of \pi to the convex set \Omega_{\delta}, is \mu-strongly logconcave by the definition of \pi (15), so it satisfies a log-isoperimetric inequality. We now combine this fact with the relative density bounds of Lemma 14 to prove Lemma 18.

See 18

Proof.

Fix some partition S1S_{1}, S2S_{2}, S3S_{3} of Ωδ\Omega_{\delta}, and without loss of generality let π^Ωδ(S1)π^Ωδ(S2)\hat{\pi}_{\Omega_{\delta}}(S_{1})\leq\hat{\pi}_{\Omega_{\delta}}(S_{2}). First, by applying Corollary 7, which shows dπdπ^(x)[12,2]\tfrac{d\pi}{d\hat{\pi}}(x)\in[\tfrac{1}{2},2] everywhere in Ωδ\Omega_{\delta}, we have the bounds

12πΩδ(S1)π^Ωδ(S1)2πΩδ(S1),12πΩδ(S2)π^Ωδ(S2)2πΩδ(S2),andπ^Ωδ(S3)12πΩδ(S3).\frac{1}{2}\pi_{\Omega_{\delta}}(S_{1})\leq\hat{\pi}_{\Omega_{\delta}}(S_{1})\leq 2\pi_{\Omega_{\delta}}(S_{1}),\;\frac{1}{2}\pi_{\Omega_{\delta}}(S_{2})\leq\hat{\pi}_{\Omega_{\delta}}(S_{2})\leq 2\pi_{\Omega_{\delta}}(S_{2}),\;\text{and}\;\hat{\pi}_{\Omega_{\delta}}(S_{3})\geq\frac{1}{2}\pi_{\Omega_{\delta}}(S_{3}).

Therefore, we have the sequence of conclusions

π^Ωδ(S3)\displaystyle\hat{\pi}_{\Omega_{\delta}}(S_{3}) 12πΩδ(S3)\displaystyle\geq\frac{1}{2}\pi_{\Omega_{\delta}}(S_{3})
d(S1,S2)μ4min(πΩδ(S1),πΩδ(S2))log(1+1min(πΩδ(S1),πΩδ(S2)))\displaystyle\geq\frac{d(S_{1},S_{2})\sqrt{\mu}}{4}\cdot\min\left(\pi_{\Omega_{\delta}}(S_{1}),\pi_{\Omega_{\delta}}(S_{2})\right)\cdot\sqrt{\log\left(1+\frac{1}{\min\left(\pi_{\Omega_{\delta}}(S_{1}),\pi_{\Omega_{\delta}}(S_{2})\right)}\right)}
d(S1,S2)μ8π^Ωδ(S1)log(1+12π^Ωδ(S1))\displaystyle\geq\frac{d(S_{1},S_{2})\sqrt{\mu}}{8}\cdot\hat{\pi}_{\Omega_{\delta}}(S_{1})\cdot\sqrt{\log\left(1+\frac{1}{2\hat{\pi}_{\Omega_{\delta}}(S_{1})}\right)}
d(S1,S2)μ16π^Ωδ(S1)log(1+1π^Ωδ(S1)).\displaystyle\geq\frac{d(S_{1},S_{2})\sqrt{\mu}}{16}\cdot\hat{\pi}_{\Omega_{\delta}}(S_{1})\cdot\sqrt{\log\left(1+\frac{1}{\hat{\pi}_{\Omega_{\delta}}(S_{1})}\right)}.

Here, the second line was by applying Lemma 23 to the μ\mu-strongly logconcave distribution πΩδ\pi_{\Omega_{\delta}}, and the final line used log(1+α)2log(1+α2)\sqrt{\log(1+\alpha)}\leq 2\sqrt{\log(1+\tfrac{\alpha}{2})} for all α>0\alpha>0. ∎

C.4 Correctness of YSample

In this section, we show how we can sample yy efficiently in the alternating scheme of the algorithm Sample-Joint-Dist, within an extremely high probability region. Specifically, for any xx with xx2κdlog(16κ/δ)Rδ\left\lVert x-x^{*}\right\rVert_{2}\leq\sqrt{\kappa d\log(16\kappa/\delta)}\cdot R_{\delta}, where RδR_{\delta} is defined in (31), we give a method for implementing

draw yexp(f(y)12ηyx22)dy.\text{draw }y\propto\exp\left(-f(y)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right)dy.

The algorithm is Algorithm 7, which is a simple rejection sampling scheme.

Algorithm 7 YSample(f,x,η,δ)\texttt{YSample}(f,x,\eta,\delta)

Input: LL-smooth, μ\mu-strongly convex f:df:\mathbb{R}^{d}\to\mathbb{R} with minimizer xx^{*}, η>0\eta>0, δ[0,1]\delta\in[0,1], xdx\in\mathbb{R}^{d}.
Output: If xx2κdlog(16κ/δ)Rδ\left\lVert x-x^{*}\right\rVert_{2}\leq\sqrt{\kappa d\log(16\kappa/\delta)}\cdot R_{\delta}, return exact sample from distribution with density exp(f(y)12ηyx22)\propto\exp(-f(y)-\tfrac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}) (see (31) for definition of RδR_{\delta}). Otherwise, return sample within δ\delta TV from distribution with density exp(f(y)12ηyx22)\propto\exp(-f(y)-\tfrac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}).

1:if xx2κdlog(16κ/δ)Rδ\left\lVert x-x^{*}\right\rVert_{2}\leq\sqrt{\kappa d\log(16\kappa/\delta)}\cdot R_{\delta} then
2:     while true do
3:         Draw y𝒩(xηf(x),η𝐈)y\sim\mathcal{N}(x-\eta\nabla f(x),\eta\mathbf{I})
4:         τUnif[0,1]\tau\sim\text{Unif}[0,1]
5:         if τexp(f(x)+f(x),yxf(y))\tau\leq\exp(f(x)+\left\langle\nabla f(x),y-x\right\rangle-f(y)) then
6:              return yy
7:         end if
8:     end while
9:end if
10:return a sample y within TV distance \delta of the density \propto\exp(-f(y)-\tfrac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}), using [CDWY19]
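As a concrete illustration, the inner rejection loop (Lines 2 through 8 of Algorithm 7) can be sketched in a few lines of Python; the quadratic f used in the test is a hypothetical stand-in chosen so that the target density is an explicit Gaussian.

```python
import numpy as np

def ysample_loop(f, grad_f, x, eta, rng):
    """Rejection sampling for the density proportional to
    exp(-f(y) - ||y - x||_2^2 / (2 * eta)), for convex f.

    Proposal: y ~ N(x - eta * grad_f(x), eta * I). The acceptance
    probability exp(f(x) + <grad_f(x), y - x> - f(y)) is at most 1
    by convexity of f.
    """
    g = grad_f(x)
    iters = 0
    while True:
        iters += 1
        y = x - eta * g + np.sqrt(eta) * rng.standard_normal(x.shape[0])
        if np.log(rng.uniform()) <= f(x) + g @ (y - x) - f(y):
            return y, iters
```

For f(y) = \tfrac{1}{2}\left\lVert y\right\rVert_{2}^{2} and small \eta, the target is a Gaussian with mean x/(1+\eta), and the empirical acceptance rate is close to 1, consistent with the expected-two-iterations guarantee of Lemma 19.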

We recall that we gave guarantees on rejection sampling procedures in Lemma 5 (an “exact” version of Lemma 12 and Corollary 6). We now prove Lemma 19 via a direct application of Lemma 5.

See 19

Proof.

For xx2κdlog(16κ/δ)Rδ\left\lVert x-x^{*}\right\rVert_{2}\leq\sqrt{\kappa d\log(16\kappa/\delta)}\cdot R_{\delta}, YSample is a rejection sampling scheme with

p(y)=exp(f(y)12ηyx22),p^(y)=exp(f(x)f(x),yx12ηyx22).p(y)=\exp\left(-f(y)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right),\;\hat{p}(y)=\exp\left(-f(x)-\left\langle\nabla f(x),y-x\right\rangle-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right).

It is clear that p(y)p^(y)p(y)\leq\hat{p}(y) everywhere by convexity of ff, so we may choose C=1C=1. To bound the expected number of iterations and obtain the desired conclusion, Lemma 5 requires a bound on

yexp(f(x)f(x),yx12ηyx22)𝑑yyexp(f(y)12ηyx22)𝑑y,\frac{\int_{y}\exp\left(-f(x)-\left\langle\nabla f(x),y-x\right\rangle-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right)dy}{\int_{y}\exp\left(-f(y)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right)dy}, (38)

the ratio of the normalization constants of p^\hat{p} and pp. First, by Fact 1,

yexp(f(x)f(x),yx12ηyx22)𝑑y=exp(f(x)+η2f(x)22)(2πη)d2.\int_{y}\exp\left(-f(x)-\left\langle\nabla f(x),y-x\right\rangle-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right)dy=\exp\left(-f(x)+\frac{\eta}{2}\left\lVert\nabla f(x)\right\rVert_{2}^{2}\right)(2\pi\eta)^{\frac{d}{2}}.

Next, by smoothness and Fact 1 once more,

yexp(f(y)12ηyx22)𝑑y\displaystyle\int_{y}\exp\left(-f(y)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right)dy yexp(f(x)f(x),yx1+ηL2ηyx22)𝑑y\displaystyle\geq\int_{y}\exp\left(-f(x)-\left\langle\nabla f(x),y-x\right\rangle-\frac{1+\eta L}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right)dy
=exp(f(x)+η2(1+ηL)f(x)22)(2πη1+ηL)d2.\displaystyle=\exp\left(-f(x)+\frac{\eta}{2(1+\eta L)}\left\lVert\nabla f(x)\right\rVert_{2}^{2}\right)\left(\frac{2\pi\eta}{1+\eta L}\right)^{\frac{d}{2}}.

Taking a ratio, the quantity in (38) is bounded above by

exp((η2η2(1+ηL))f(x)22)(1+ηL)d2\displaystyle\exp\left(\left(\frac{\eta}{2}-\frac{\eta}{2(1+\eta L)}\right)\left\lVert\nabla f(x)\right\rVert_{2}^{2}\right)\left(1+\eta L\right)^{\frac{d}{2}} 1.5exp(η2L2(1+ηL)f(x)22)\displaystyle\leq 1.5\exp\left(\frac{\eta^{2}L}{2(1+\eta L)}\left\lVert\nabla f(x)\right\rVert_{2}^{2}\right)
1.5exp(η2L32(16κd2log2(16κ/δ)μ))2.\displaystyle\leq 1.5\exp\left(\frac{\eta^{2}L^{3}}{2}\cdot\left(\frac{16\kappa d^{2}\log^{2}(16\kappa/\delta)}{\mu}\right)\right)\leq 2.

The first inequality was (1+ηL)d21.5(1+\eta L)^{\frac{d}{2}}\leq 1.5, the second used smoothness and the assumed bound on xx2\left\lVert x-x^{*}\right\rVert_{2}, and the third again used our choice of η\eta. ∎

Appendix D Structural results

Here, we prove two structural results about distributions whose negative log-densities are small perturbations of a quadratic, which admit tighter concentration guarantees than naive bounds on strongly logconcave distributions. They are used to obtain our bounds in Appendix C (and the warm start bounds in Section 4), but we hope both the statements and the proof techniques are of independent interest to the community. Our first structural result is a bound on ratios of normalization constants, used throughout the paper.

Proposition 11.

Let f:df:\mathbb{R}^{d}\rightarrow\mathbb{R} be μ\mu-strongly convex with minimizer xx^{*}, and let λ>0\lambda>0. Then,

exp(f(x))𝑑xexp(f(x)12λxx22)𝑑x(1+1μλ)d2.\frac{\int\exp(-f(x))dx}{\int\exp\left(-f(x)-\frac{1}{2\lambda}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)dx}\leq\left(1+\frac{1}{\mu\lambda}\right)^{\frac{d}{2}}.
Proof.

Define the function

R(α):=exp(f(x)12λαxx22)𝑑xexp(f(x)12λxx22)𝑑x.R(\alpha):=\frac{\int\exp\left(-f(x)-\frac{1}{2\lambda\alpha}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)dx}{\int\exp\left(-f(x)-\frac{1}{2\lambda}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)dx}.

Let dπα(x)d\pi_{\alpha}(x) be the density proportional to exp(f(x)12λαxx22)dx\exp\left(-f(x)-\tfrac{1}{2\lambda\alpha}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)dx. We compute

ddαR(α)\displaystyle\frac{d}{d\alpha}R(\alpha) =exp(f(x)12λαxx22)exp(f(x)12λxx22)𝑑x12λα2xx22𝑑x\displaystyle=\int\frac{\exp\left(-f(x)-\frac{1}{2\lambda\alpha}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)}{\int\exp\left(-f(x)-\frac{1}{2\lambda}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)dx}\frac{1}{2\lambda\alpha^{2}}\left\lVert x-x^{*}\right\rVert_{2}^{2}dx
=R(α)2λα2exp(f(x)12λαxx22)xx22exp(f(x)12λαxx22)𝑑x𝑑x\displaystyle=\frac{R(\alpha)}{2\lambda\alpha^{2}}\int\frac{\exp\left(-f(x)-\frac{1}{2\lambda\alpha}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)\left\lVert x-x^{*}\right\rVert_{2}^{2}}{\int\exp\left(-f(x)-\frac{1}{2\lambda\alpha}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)dx}dx
=R(α)2λα2xx22𝑑πα(x)R(α)2αdμλα+1.\displaystyle=\frac{R(\alpha)}{2\lambda\alpha^{2}}\int\left\lVert x-x^{*}\right\rVert_{2}^{2}d\pi_{\alpha}(x)\leq\frac{R(\alpha)}{2\alpha}\cdot\frac{d}{\mu\lambda\alpha+1}.

Here, the last inequality was by Fact 4, using the fact that the function f(x)+12λαxx22f(x)+\tfrac{1}{2\lambda\alpha}\left\lVert x-x^{*}\right\rVert_{2}^{2} is μ+1λα\mu+\tfrac{1}{\lambda\alpha}-strongly convex. Moreover, note that R(1)=1R(1)=1, and

ddαlog(αμλα+1)=1αμλμλα+1=1μλα2+α.\frac{d}{d\alpha}\log\left(\frac{\alpha}{\mu\lambda\alpha+1}\right)=\frac{1}{\alpha}-\frac{\mu\lambda}{\mu\lambda\alpha+1}=\frac{1}{\mu\lambda\alpha^{2}+\alpha}.

Solving the differential inequality

ddαlog(R(α))=dR(α)dα1R(α)d21μλα2+α,\frac{d}{d\alpha}\log(R(\alpha))=\frac{dR(\alpha)}{d\alpha}\cdot\frac{1}{R(\alpha)}\leq\frac{d}{2}\cdot\frac{1}{\mu\lambda\alpha^{2}+\alpha},

we obtain the bound for any α1\alpha\geq 1 (since log(R(1))=0\log(R(1))=0)

log(R(α))d2log(μλα+αμλα+1)R(α)(μλα+αμλα+1)d2(1+1μλ)d2.\log(R(\alpha))\leq\frac{d}{2}\log\left(\frac{\mu\lambda\alpha+\alpha}{\mu\lambda\alpha+1}\right)\implies R(\alpha)\leq\left(\frac{\mu\lambda\alpha+\alpha}{\mu\lambda\alpha+1}\right)^{\frac{d}{2}}\leq\left(1+\frac{1}{\mu\lambda}\right)^{\frac{d}{2}}.

Taking a limit α\alpha\rightarrow\infty yields the conclusion. ∎
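In one dimension, the bound of Proposition 11 can be checked directly by quadrature. The sketch below uses the hypothetical choice f(x) = \tfrac{\mu}{2}x^{2} + \log\cosh(x), which is \mu-strongly convex with minimizer x^{*} = 0.

```python
import numpy as np

mu, lam = 0.5, 2.0
f = lambda x: 0.5 * mu * x**2 + np.log(np.cosh(x))  # mu-strongly convex, x* = 0

# Riemann sums over a wide interval; the integrands are negligible outside it.
x = np.linspace(-30.0, 30.0, 600_001)
dx = x[1] - x[0]
numer = np.sum(np.exp(-f(x))) * dx
denom = np.sum(np.exp(-f(x) - x**2 / (2 * lam))) * dx

ratio = numer / denom
bound = (1 + 1 / (mu * lam)) ** 0.5  # (1 + 1/(mu*lam))^{d/2} with d = 1
```

The ratio is at least 1 (the regularized integrand is pointwise smaller) and at most the claimed bound.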

Our second structural result uses a similar proof technique to show that the mean of a bounded perturbation of a Gaussian is not far from its mode, as long as the gradient at the mode is small. We remark that one may directly apply strong logconcavity, i.e. a variant of Fact 4, to obtain a bound weaker by roughly a \sqrt{d} factor, which would result in a loss of \Omega(d) in the guarantees of Theorem 2. This tighter analysis is crucial to our improved mixing time result.

Before stating the bound, we apply Fact 3 to the convex functions h(x)=(θx)2h(x)=(\theta^{\top}x)^{2} and h(x)=x24h(x)=\left\lVert x\right\rVert_{2}^{4} to obtain the following conclusions which will be used in the proof of Proposition 12.

Corollary 9.

Let π\pi be a μ\mu-strongly logconcave density. Then,

  1. \mathbb{E}_{\pi}[(\theta^{\top}(x-\mathbb{E}_{\pi}[x]))^{2}]\leq\mu^{-1} for all unit vectors \theta.

  2. \mathbb{E}_{\pi}[\left\lVert x-\mathbb{E}_{\pi}[x]\right\rVert_{2}^{4}]\leq 3d^{2}\mu^{-2}.

Proposition 12.

Let f:\mathbb{R}^{d}\rightarrow\mathbb{R} be L-smooth and convex with minimizer x^{*}, let x\in\mathbb{R}^{d} with \left\lVert x-x^{*}\right\rVert_{2}\leq R, and let d\pi_{\eta}(y) be the density proportional to \exp\left(-f(y)-\tfrac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right)dy. Suppose that \eta\leq\min\left(\tfrac{1}{2L^{2}R^{2}},\tfrac{R^{2}}{400d^{2}}\right). Then,

\left\lVert\mathbb{E}_{\pi_{\eta}}[y]-x\right\rVert_{2}\leq 2\eta LR.
Proof.

Define a family of distributions \pi^{\alpha} for \alpha\in[0,1], with

d\pi^{\alpha}(y)\propto\exp\left(-\alpha\left(f(y)-f(x)-\left\langle\nabla f(x),y-x\right\rangle\right)-f(x)-\left\langle\nabla f(x),y-x\right\rangle-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right)dy.

In particular, \pi^{1}=\pi_{\eta}, and \pi^{0} is a Gaussian with mean x-\eta\nabla f(x). We define \bar{y}_{\alpha}:=\mathbb{E}_{\pi^{\alpha}}[y], and

y^{*}_{\alpha}:=\textup{argmin}_{y}\left\{\alpha\left(f(y)-f(x)-\left\langle\nabla f(x),y-x\right\rangle\right)+f(x)+\left\langle\nabla f(x),y-x\right\rangle+\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right\}.

Define the function D(\alpha):=\left\lVert\bar{y}_{\alpha}-x\right\rVert_{2}, so that we wish to bound D(1). First, by smoothness,

D(0)=\left\lVert\mathbb{E}_{\pi^{0}}[y]-x\right\rVert_{2}=\left\lVert\eta\nabla f(x)\right\rVert_{2}\leq\eta LR.

Next, we observe

\frac{d}{d\alpha}D(\alpha)=\left\langle\frac{\bar{y}_{\alpha}-x}{\left\lVert\bar{y}_{\alpha}-x\right\rVert_{2}},\frac{d\bar{y}_{\alpha}}{d\alpha}\right\rangle\leq\left\lVert\frac{d\bar{y}_{\alpha}}{d\alpha}\right\rVert_{2}.

In order to bound \left\lVert\tfrac{d\bar{y}_{\alpha}}{d\alpha}\right\rVert_{2}, fix a unit vector \theta. We have

\left\langle\frac{d\bar{y}_{\alpha}}{d\alpha},\theta\right\rangle =\frac{d}{d\alpha}\left\langle\int(y-x)d\pi^{\alpha}(y),\theta\right\rangle (39)
=\textup{Cov}_{\pi^{\alpha}}\left(\left\langle y-x,\theta\right\rangle,\;f(x)+\left\langle\nabla f(x),y-x\right\rangle-f(y)\right)
\leq\sqrt{\int\left\langle y-x,\theta\right\rangle^{2}d\pi^{\alpha}(y)}\sqrt{\int\left(f(x)+\left\langle\nabla f(x),y-x\right\rangle-f(y)\right)^{2}d\pi^{\alpha}(y)}
\leq\sqrt{\int\left\langle y-x,\theta\right\rangle^{2}d\pi^{\alpha}(y)}\sqrt{\int\frac{L^{2}}{4}\left\lVert y-x\right\rVert_{2}^{4}d\pi^{\alpha}(y)}.

The second line is the standard formula for differentiating an expectation through an exponential-family parameter; the third line follows by Cauchy-Schwarz, since the magnitude of a covariance is at most the product of the (uncentered) second moments; and the last line used smoothness and convexity, i.e.

-\frac{L}{2}\left\lVert y-x\right\rVert_{2}^{2}\leq f(x)+\left\langle\nabla f(x),y-x\right\rangle-f(y)\leq 0.

We now bound these terms. First,

\int\left\langle y-x,\theta\right\rangle^{2}d\pi^{\alpha}(y) \leq 2\int\left\langle y-\bar{y}_{\alpha},\theta\right\rangle^{2}d\pi^{\alpha}(y)+2\int\left\langle\bar{y}_{\alpha}-x,\theta\right\rangle^{2}d\pi^{\alpha}(y) (40)
\leq 2\eta+2\left\lVert\bar{y}_{\alpha}-x\right\rVert_{2}^{2}=2\eta+2D(\alpha)^{2}.

Here, we applied the first part of Corollary 9, as \pi^{\alpha} is \eta^{-1}-strongly logconcave, and the definition of D(\alpha). Next, using, for any a,b\in\mathbb{R}^{d}, \left\lVert a+b\right\rVert_{2}^{4}\leq(\left\lVert a\right\rVert_{2}+\left\lVert b\right\rVert_{2})^{4}\leq 16\left\lVert a\right\rVert_{2}^{4}+16\left\lVert b\right\rVert_{2}^{4}, we have

\int\frac{L^{2}}{4}\left\lVert y-x\right\rVert_{2}^{4}d\pi^{\alpha}(y) \leq\int 4L^{2}\left\lVert y-\bar{y}_{\alpha}\right\rVert_{2}^{4}d\pi^{\alpha}(y)+\int 4L^{2}\left\lVert x-\bar{y}_{\alpha}\right\rVert_{2}^{4}d\pi^{\alpha}(y) (41)
\leq 12L^{2}d^{2}\eta^{2}+4L^{2}D(\alpha)^{4}.
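The elementary inequality used above, (\left\lVert a\right\rVert_{2}+\left\lVert b\right\rVert_{2})^{4}\leq 16\left\lVert a\right\rVert_{2}^{4}+16\left\lVert b\right\rVert_{2}^{4}, reduces to the scalar fact (s+t)^{4}\leq 16(s^{4}+t^{4}) for s,t\geq 0. A quick randomized check (illustrative only) confirms the worst-case ratio is in fact 8, attained at s=t, so the constant 16 used in the proof is safe:

```python
import random

random.seed(1)
worst = 0.0
for _ in range(10000):
    s, t = random.uniform(0.01, 10), random.uniform(0.01, 10)
    # track the largest observed ratio (s+t)^4 / (s^4 + t^4)
    worst = max(worst, (s + t) ** 4 / (s ** 4 + t ** 4))
print(worst)  # approaches 8 near s = t, comfortably below 16
```

The sharp constant 8 follows from convexity of u \mapsto u^{4}; the proof only needs 16.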

Here, we used the second part of Corollary 9. Maximizing (39) over \theta, and applying (40) and (41),

\frac{d}{d\alpha}D(\alpha)\leq\left\lVert\frac{d\bar{y}_{\alpha}}{d\alpha}\right\rVert_{2} \leq\sqrt{8L^{2}(\eta+D(\alpha)^{2})(3d^{2}\eta^{2}+D(\alpha)^{4})}
\leq 4L(\sqrt{\eta}+D(\alpha))\cdot\max(2\eta d,D(\alpha)^{2}). (42)

Assume for contradiction that D(1)>2\eta LR, violating the conclusion of the proposition. By continuity of D, there must be some \bar{\alpha}\in(0,1) where D(\bar{\alpha})=2\eta LR, and for all 0\leq\alpha<\bar{\alpha}, D(\alpha)<2\eta LR. By the mean value theorem, there then exists 0<\hat{\alpha}<\bar{\alpha} such that

\frac{dD(\hat{\alpha})}{d\alpha}=\frac{D(\bar{\alpha})-D(0)}{\bar{\alpha}}>\eta LR.

Here, the last inequality used D(0)\leq\eta LR and \bar{\alpha}<1. On the other hand, by our assumption that 2\eta L^{2}R^{2}\leq 1, for any d\geq 1 it follows that

2\eta d\geq 4\eta^{2}L^{2}R^{2}>D(\hat{\alpha})^{2},\;\sqrt{2\eta}\geq 2\eta LR>D(\hat{\alpha}).

Then, plugging these bounds into (42) and using \sqrt{\eta}+D(\hat{\alpha})\leq\tfrac{5}{2}\sqrt{\eta} as \sqrt{2}\leq\tfrac{3}{2},

\frac{d}{d\alpha}D(\hat{\alpha})\leq 4L\cdot\frac{5}{2}\sqrt{\eta}\cdot 2\eta d=20\sqrt{\eta}\frac{d}{R}\cdot\eta LR\leq\eta LR.

We used \eta\leq\tfrac{R^{2}}{400d^{2}} in the last inequality. This is a contradiction, implying D(1)\leq 2\eta LR. ∎
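For a concrete instance of Proposition 12 (our illustration, not from the paper), take the quadratic f(y)=\tfrac{L}{2}\left\lVert y-x^{*}\right\rVert_{2}^{2}. Then \pi_{\eta} is an explicit Gaussian whose mean minimizes \tfrac{L}{2}\left\lVert y-x^{*}\right\rVert_{2}^{2}+\tfrac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}, namely (x+\eta Lx^{*})/(1+\eta L), so its distance from x is \eta L\left\lVert x-x^{*}\right\rVert_{2}/(1+\eta L)\leq\eta LR, within the 2\eta LR guarantee. The snippet below verifies this with arbitrary sample values of d, L, x, x^{*} and \eta at the edge of the allowed range:

```python
import math

d, L = 4, 3.0
x_star = [0.0] * d
x = [0.5] * d  # so that R = ||x - x*||_2 = 1
R = math.sqrt(sum((xi - si) ** 2 for xi, si in zip(x, x_star)))
eta = min(1 / (2 * L**2 * R**2), R**2 / (400 * d**2))  # step size bound from Prop. 12

# For f(y) = (L/2)||y - x*||^2, the mean of pi_eta minimizes
# (L/2)||y - x*||^2 + (1/(2*eta))||y - x||^2:
mean = [(xi + eta * L * si) / (1 + eta * L) for xi, si in zip(x, x_star)]
dist = math.sqrt(sum((mi - xi) ** 2 for mi, xi in zip(mean, x)))
print(dist, 2 * eta * L * R)  # the distance sits below the 2*eta*L*R bound
```

In this Gaussian case the distance is exactly \eta LR/(1+\eta L); the factor 2 in the proposition absorbs the extra slack needed for general smooth convex f.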