
Structured Logconcave Sampling with a Restricted Gaussian Oracle

Yin Tat Lee University of Washington and Microsoft Research, [email protected]    Ruoqi Shen University of Washington, [email protected]    Kevin Tian Stanford University, [email protected]

1 Introduction

Since its study was pioneered by the celebrated randomized convex body volume approximation algorithm of Dyer, Frieze, and Kannan [DFK91], designing samplers for logconcave distributions has been a very active area of research in theoretical computer science and statistics with many connections to other fields. In a generic form, the problem can be stated as: sample from a distribution whose negative log-density is convex, under various access models to the distribution.

Developing efficient algorithms for sampling from structured logconcave densities is a topic that has received significant recent interest due to its widespread practical applications. Densities commonplace in applications may possess many types of structure that are exploitable for improved runtimes. Examples include derivative bounds (“well-conditioned densities”) and various types of separability (e.g. “composite densities” corresponding to possibly non-smooth regularization or restrictions to a set, and “logconcave finite sums” corresponding to averages over multiple data points); we make this terminology precise in Section 2.1, which contains the definitions used in this paper. Building an algorithmic theory for sampling these latter two families, which are not well-understood in the literature, is a primary motivation of this work.

There are strong parallels between the types of structured logconcave families garnering recent attention and the classes of convex functions known to admit efficient first-order optimization algorithms. Notably, gradient descent and its accelerated counterpart [Nes83] are well-known to quickly optimize a well-conditioned function, and have become ubiquitous in both practice and theory. Similarly, various methods have been developed for efficiently optimizing non-smooth but structured composite objectives [BT09] and well-conditioned finite sums [All17].

Logconcave sampling and convex optimization are intimately related primitives (cf. e.g. [BV04, AH16]), so it is perhaps unsurprising that there are analogies between the types of structure algorithm designers may exploit. Nonetheless, our understanding of the complexity landscape for sampling is quite a bit weaker in comparison to counterparts in the field of optimization; few lower bounds are known for the complexity of sampling tasks, and obtaining stronger upper bounds is an extremely active research area (contrary to optimization, where matching bounds exist in many cases). Moreover (and perhaps relatedly), the toolkit for designing logconcave samplers is comparatively lacking; for many important primitives in optimization, it is unclear if there are analogs in sampling, possibly impeding improved bounds. Our work broadly falls under the themes of (1) understanding which types of structured logconcave distributions admit efficient samplers, and (2) leveraging connections between optimization and sampling for algorithm design. We address these problems on two fronts, which constitute the primary technical contributions of this paper.

  1. We give a general reduction framework for bootstrapping samplers with mixing times depending polynomially on a conditioning measure $\kappa$ to samplers with mixing times depending linearly on $\kappa$. The framework is heavily motivated by a perspective on proximal point methods in structured convex optimization as instances of optimizing composite objectives, and leverages this connection via a surprisingly simple analysis (cf. Theorem 1).

  2. We develop novel “base samplers” for composite logconcave distributions and logconcave finite sums (cf. Theorems 2 and 3). The former is the first composite sampler with stronger guarantees than those known in the general logconcave setting. The latter constitutes the first high-accuracy finite sum sampler whose gradient query complexity improves upon the naïve strategy of querying full gradients of the negative log-density in each iteration.

Using our novel base samplers within our reduction framework, we obtain state-of-the-art samplers for all of the aforementioned structured families, i.e. well-conditioned, composite, and finite sum, as Corollaries 1, 2, and 3. We emphasize that even without our reduction technique, the guarantees of our base samplers for composite and finite sum-structured densities are the first of their kind. However, by boosting their mixing via our reduction, we obtain guarantees for these structured distribution families which are essentially the best one can hope for without a significant improvement in the most commonly studied well-conditioned regime (cf. discussion in Section 1.1).

We now formally state our results in Section 1.1, and situate them in the literature in Section 1.2. Section 1.3 is a technical overview of our approaches for developing our base samplers for composite and finite sum-structured densities (Sections 1.3.1 and 1.3.2), as well as our proximal reduction framework (Section 1.3.3). Finally, Section 1.5 gives a roadmap for the rest of the paper.

1.1 Our results

Before stating our results, we first require the notion of a restricted Gaussian oracle, whose definition is a key ingredient in giving our reduction framework as well as our later composite samplers.

Definition 1 (Restricted Gaussian oracle).

$\mathcal{O}(\lambda,v)$ is a restricted Gaussian oracle (RGO) for convex $g:\mathbb{R}^{d}\rightarrow\mathbb{R}$ if it returns

\[
\mathcal{O}(\lambda,v)\leftarrow\textup{sample from the distribution with density}\propto\exp\left(-\frac{1}{2\lambda}\left\lVert x-v\right\rVert_{2}^{2}-g(x)\right).
\]

In other words, an RGO asks to sample from a multivariate Gaussian (with covariance a multiple of the identity), “restricted” by some convex function $g$. Intuitively, if we can reduce a sampling problem for the density $\propto\exp(-g)$ to calling an RGO a small number of times with a small value of $\lambda$, each RGO subproblem could be much easier to solve than the original problem. This can happen for a variety of reasons, e.g. if the regularized density is extremely well-conditioned, or because it inherits concentration properties of a Gaussian. This idea of reducing a sampling problem to multiple subproblems, each implementing an RGO, underlies the framework of Theorem 1. Because regularization by a large Gaussian component appears repeatedly in this paper, we make the following more specific definition for convenience, which bounds the width of the Gaussian component.

Definition 2 ($\eta$-RGO).

We say $\mathcal{O}(\lambda,v)$ is an $\eta$-restricted Gaussian oracle ($\eta$-RGO) if it satisfies Definition 1 with the restriction that the parameter $\lambda$ is required to be at most $\eta$ in all calls to $\mathcal{O}$.

Variants of our notion of an RGO have implicitly appeared previously [CV18, MFWB19], and efficient RGO implementation was a key subroutine in the fastest sampling algorithm for general logconcave distributions [CV18]. It also extends a similar oracle used in composite optimization, which we will discuss shortly. However, the explicit use of RGOs in a framework such as Theorem 1 is a novel technical innovation of our work, and we believe this abstraction will find further uses.

Proximal reduction framework.

In Section 3, we prove correctness of our proximal reduction framework, whose guarantees are stated in the following Theorem 1.

Theorem 1.

Let $\pi$ be a distribution on $\mathbb{R}^{d}$ with $\tfrac{d\pi}{dx}(x)\propto\exp(-f_{\textup{oracle}}(x))$ such that $f_{\textup{oracle}}$ is $\mu$-strongly convex, and let $\epsilon\in(0,1)$. Let $\eta\leq\frac{1}{\mu}$, $T=\Theta(\frac{1}{\eta\mu}\log\frac{d}{\eta\mu\epsilon})$, and let $\mathcal{O}$ be an $\eta$-RGO for $f_{\textup{oracle}}$. Algorithm 1, initialized at the minimizer of $f_{\textup{oracle}}$, runs in $T$ iterations, each querying $\mathcal{O}$ a constant number of times, and obtains $\epsilon$ total variation distance to $\pi$.

In other words, if we can implement an $\eta$-RGO for a $\mu$-strongly convex function $f_{\textup{oracle}}$ in time $\mathcal{T}_{\textup{RGO}}$, we can sample from the density $\propto\exp(-f_{\textup{oracle}})$ in time $\widetilde{O}(\tfrac{1}{\eta\mu}\cdot\mathcal{T}_{\textup{RGO}})$. To highlight the power of this reduction framework, suppose there were an existing sampler $\mathcal{A}$ for densities $\propto\exp(-f)$ with mixing time $\widetilde{O}(\kappa^{10}\sqrt{d})$, where $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is $L$-smooth and $\mu$-strongly convex with condition number $\kappa=\tfrac{L}{\mu}$ (cf. Section 2.1 for definitions); no sampler with mixing time scaling as $\textup{poly}(\kappa)\sqrt{d}$ is currently known. Choosing $\eta=\tfrac{1}{L}$ and $f_{\textup{oracle}}\leftarrow f$ in Theorem 1 yields a sampler whose mixing time is $\widetilde{O}(\kappa\cdot\mathcal{T}_{\textup{RGO}})$, where $\mathcal{T}_{\textup{RGO}}$ is the cost of sampling from a density proportional to

\[
\exp\left(-\frac{L}{2}\left\lVert x-v\right\rVert_{2}^{2}-f(x)\right),
\]

for some $v\in\mathbb{R}^{d}$. However, observe that this distribution has a negative log-density with condition number at most $\tfrac{L+L}{L+\mu}\leq 2$! By using $\mathcal{A}$ as our RGO, we have $\mathcal{T}_{\textup{RGO}}=\widetilde{O}(\sqrt{d})$, and the overall mixing time is $\widetilde{O}(\kappa\sqrt{d})$. Leveraging Theorem 1 in applications, we obtain the following new results, improving the mixing of various “base samplers” which we bootstrap as RGOs for regularized densities.
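To spell out the conditioning claim (assuming for simplicity that $f$ is twice differentiable), a one-line Hessian calculation verifies the bound:

```latex
% RGO subproblem negative log-density with \eta = 1/L:
h(x) \;=\; f(x) + \frac{L}{2}\left\lVert x - v\right\rVert_2^2 ,
\qquad
(\mu + L)\, I \;\preceq\; \nabla^2 h(x) \;=\; \nabla^2 f(x) + L I \;\preceq\; 2L\, I,
% hence the condition number of h satisfies
\kappa(h) \;\le\; \frac{2L}{\mu + L} \;\le\; \frac{2L}{L} \;=\; 2 .
```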

Well-conditioned densities.

In [LST20], it was shown that a variant of Metropolized Hamiltonian Monte Carlo obtains a mixing time of $\widetilde{O}(\kappa d\log^{3}\tfrac{\kappa d}{\epsilon})$ for sampling a density on $\mathbb{R}^{d}$ with condition number $\kappa$. The analysis of [LST20] was somewhat delicate, and required reasoning about conditioning on a nonconvex set with desirable concentration properties. In Section 4.1, we prove Corollary 1, improving [LST20] by roughly a logarithmic factor with a significantly simpler analysis.

Corollary 1.

Let $\pi$ be a distribution on $\mathbb{R}^{d}$ with $\tfrac{d\pi}{dx}(x)\propto\exp(-f(x))$ such that $f$ is $L$-smooth and $\mu$-strongly convex, and let $\epsilon\in(0,1)$, $\kappa=\tfrac{L}{\mu}$. Assume access to $x^{*}=\textup{argmin}_{x\in\mathbb{R}^{d}}f(x)$. Algorithm 1 with $\eta=\tfrac{1}{8Ld\log\kappa}$, using Algorithm 2 as a restricted Gaussian oracle for $f$, uses $O(\kappa d\log\kappa\log\tfrac{\kappa d}{\epsilon})$ gradient queries in expectation, and obtains $\epsilon$ total variation distance to $\pi$.

We include Corollary 1 as a warmup for our more complicated results, and as a way to showcase the use of our reduction framework in a slightly different manner than the one outlined earlier. In particular, in proving Corollary 1, we choose a significantly smaller value of $\eta$, at which point a simple rejection sampling scheme implements each RGO using a constant number of gradient queries in expectation.
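To illustrate how rejection sampling can implement an RGO, here is a minimal sketch (the function name and the quadratic test density are our illustrative choices, not the paper's Algorithm 2): propose from the Gaussian part of the density and accept with probability $\exp(f(x^{*})-f(x))\leq 1$, which yields exact samples from the regularized density. The acceptance rate is high when $\eta$ is small and $v$ lies near the minimizer.

```python
import numpy as np

def rgo_rejection(f, f_min, eta, v, rng, max_tries=10_000):
    """Exact sample from the density ∝ exp(-f(x) - ||x - v||^2 / (2 * eta)):
    propose x ~ N(v, eta * I) and accept with probability
    exp(f_min - f(x)) <= 1, where f_min = min_x f(x).  Accepted points
    follow exactly the target (standard rejection sampling)."""
    d = len(v)
    for _ in range(max_tries):
        x = v + np.sqrt(eta) * rng.standard_normal(d)
        if rng.random() < np.exp(f_min - f(x)):
            return x
    raise RuntimeError("acceptance too low; decrease eta or recenter f")

# Illustrative target: f(x) = ||x||^2 / 2, minimized at 0 with value 0.
# For this quadratic f, the regularized density is Gaussian with mean
# v / (1 + eta), which the empirical mean below should approximate.
rng = np.random.default_rng(0)
f = lambda x: 0.5 * np.dot(x, x)
samples = np.array([rgo_rejection(f, 0.0, 0.01, np.ones(2), rng)
                    for _ in range(2000)])
print(samples.mean(axis=0))
```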

We give another algorithm matching Corollary 1 with a deterministic query complexity bound as Corollary 5. The algorithm of Corollary 5 is interesting in that it is entirely a zeroth-order algorithm, and does not require access to a gradient oracle. To our knowledge, in the well-conditioned optimization setting, no zeroth-order query complexities better than roughly $\sqrt{\kappa}d$ are known, e.g. by simulating accelerated gradient descent with a value oracle; thus, our sampling algorithm has a query bound off by only $\widetilde{O}(\sqrt{\kappa})$ from the best-known optimization algorithm. We are hopeful this result may help in the search for query lower bounds for structured logconcave sampling.

Composite densities with a restricted Gaussian oracle.

In Section 5, we develop a sampler for densities on $\mathbb{R}^{d}$ proportional to $\exp(-f(x)-g(x))$, where $f$ has condition number $\kappa$ and $g$ admits a restricted Gaussian oracle $\mathcal{O}$. We state its guarantees here.

Theorem 2.

Let $\pi$ be a distribution on $\mathbb{R}^{d}$ with $\tfrac{d\pi}{dx}(x)\propto\exp(-f(x)-g(x))$ such that $f$ is $L$-smooth and $\mu$-strongly convex, and let $\epsilon\in(0,1)$. Let $\eta\leq\tfrac{1}{32L\kappa d\log(\kappa/\epsilon)}$ (where $\kappa=\tfrac{L}{\mu}$), $T=\Theta(\tfrac{1}{\eta\mu}\log\tfrac{\kappa d}{\epsilon})$, and let $\mathcal{O}$ be an $\eta$-RGO for $g$. Further, assume access to the minimizer $x^{*}=\textup{argmin}_{x\in\mathbb{R}^{d}}\{f(x)+g(x)\}$. There is an algorithm which runs in $T$ iterations in expectation, each querying a gradient oracle of $f$ and $\mathcal{O}$ a constant number of times, and obtains $\epsilon$ total variation distance to $\pi$.

The assumption that the composite component $g$ admits an RGO can be thought of as a measure of “simplicity” of $g$. This mirrors the widespread use of a proximal oracle, which we now define, as a measure of simplicity in the context of composite optimization [BT09].

Definition 3 (Proximal oracle).

$\mathcal{O}(\lambda,v)$ is a proximal oracle for convex $g:\mathbb{R}^{d}\rightarrow\mathbb{R}$ if it returns

\[
\mathcal{O}(\lambda,v)\leftarrow\textup{argmin}_{x\in\mathbb{R}^{d}}\left\{\frac{1}{2\lambda}\left\lVert x-v\right\rVert_{2}^{2}+g(x)\right\}.
\]

Many regularizers $g$ used in defining composite optimization objectives, often chosen to enforce a quality such as sparsity or “simplicity” in a solution, admit efficient proximal oracles. In particular, if the proximal oracle subproblem admits a closed form solution (or is otherwise computable in $O(d)$ time), the regularized objective can be optimized at essentially no asymptotic loss. It is readily apparent that our RGO (Definition 1) is the extension of Definition 3 to the sampling setting. In [MFWB19], a variety of regularizations arising in practical applications, including coordinate-separable $g$ (such as restrictions to a coordinate-wise box, e.g. for a Bayesian inference task where we have side information on the ranges of parameters) and $\ell_{1}$ or group Lasso regularized densities, were shown to admit RGOs. Our composite sampling results achieve a similar “no loss” phenomenon for such regularizations, with respect to existing well-conditioned samplers.
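As a concrete instance of the coordinate-separable case, the following hedged sketch (the name `box_rgo` and all parameters are illustrative) implements an RGO for $g$ indicating a coordinate-wise box $[l,u]^{d}$: each call must sample $N(v,\lambda I)$ restricted to the box, which separates across coordinates, so each coordinate can be drawn exactly by inverting the truncated-normal CDF.

```python
import numpy as np
from statistics import NormalDist

def box_rgo(lam, v, lo, hi, rng):
    """Sample from the density ∝ exp(-||x - v||^2 / (2 * lam) - g(x)),
    where g indicates the box [lo, hi]^d (0 inside, +infinity outside).
    This is N(v, lam * I) truncated to the box; coordinates are
    independent, so sample each by inverse-CDF of a truncated normal."""
    out = np.empty_like(v)
    for i, vi in enumerate(v):
        nd = NormalDist(mu=float(vi), sigma=float(np.sqrt(lam)))
        a, b = nd.cdf(lo), nd.cdf(hi)
        # Map a uniform draw into the CDF mass lying inside [lo, hi].
        out[i] = nd.inv_cdf(a + (b - a) * rng.random())
    return out

# Illustrative call: box [-1, 1]^2, Gaussian centered near two corners.
rng = np.random.default_rng(1)
xs = np.array([box_rgo(0.5, np.array([0.9, -0.9]), -1.0, 1.0, rng)
               for _ in range(1000)])
```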

By choosing the largest possible value of $\eta$ in Theorem 2, we obtain an iteration bound of $\widetilde{O}(\kappa^{2}d)$. In Section 4.2, we prove Corollary 2, which improves Theorem 2 by roughly a $\kappa$ factor.

Corollary 2.

Let $\pi$ be a distribution on $\mathbb{R}^{d}$ with $\tfrac{d\pi}{dx}(x)\propto\exp(-f(x)-g(x))$ such that $f$ is $L$-smooth and $\mu$-strongly convex, and let $\epsilon\in(0,1)$, $\kappa=\tfrac{L}{\mu}$. Assume access to $x^{*}=\textup{argmin}_{x\in\mathbb{R}^{d}}\{f(x)+g(x)\}$ and let $\mathcal{O}$ be a restricted Gaussian oracle for $g$. There is an algorithm (Algorithm 1 using Theorem 2 as a restricted Gaussian oracle) which runs in $O(\kappa d\log^{3}\tfrac{\kappa d}{\epsilon})$ iterations in expectation, each querying a gradient of $f$ and $\mathcal{O}$ a constant number of times, and obtains $\epsilon$ total variation distance to $\pi$.

To sketch the proof, choosing $\eta=\tfrac{1}{L}$ in Theorem 1 yields an algorithm running in $\widetilde{O}(\tfrac{1}{\eta\mu})=\widetilde{O}(\kappa)$ iterations. In each iteration, the RGO subproblem asks to sample from the distribution whose negative log-density is $f(x)+g(x)+\tfrac{L}{2}\left\lVert x-v\right\rVert_{2}^{2}$ for some $v\in\mathbb{R}^{d}$, so we can call Theorem 2, where the “well-conditioned” portion $f(x)+\tfrac{L}{2}\left\lVert x-v\right\rVert_{2}^{2}$ has constant condition number. Thus, Theorem 2 solves each subproblem in $\widetilde{O}(d)$ iterations, yielding the result. In fact, Corollary 2 nearly matches Corollary 1 in the case where $g$ is uniformly zero. Surprisingly, this recovers the runtime of [LST20] without appealing to strong gradient concentration bounds (e.g. [LST20], Theorem 3.2).

Logconcave finite sums.

In Section 6, we initiate the study of mixing times for sampling logconcave finite sums with polylogarithmic dependence on accuracy. We give the following result.

Theorem 3.

Let $\pi$ be a distribution on $\mathbb{R}^{d}$ with $\tfrac{d\pi}{dx}(x)\propto\exp(-F(x))$, where $F(x)=\tfrac{1}{n}\sum_{i=1}^{n}f_{i}(x)$ is $\mu$-strongly convex, $f_{i}$ is $L$-smooth and convex for all $i\in[n]$, $\kappa=\tfrac{L}{\mu}$, and $\epsilon\in(0,1)$. Assume access to $x^{*}=\textup{argmin}_{x\in\mathbb{R}^{d}}F(x)$. Algorithm 6 uses $O(\kappa^{2}d\log^{4}\tfrac{n\kappa d}{\epsilon})$ value queries to the summands $\{f_{i}\}_{i\in[n]}$, and obtains $\epsilon$ total variation distance to $\pi$.

For a zeroth-order algorithm, Theorem 3 serves as a surprisingly strong baseline, as it nearly matches the previously best-known bound for zeroth-order well-conditioned sampling when $n=1$; however, when e.g. $\kappa\approx d$, the complexity bound is at least cubic. By using Theorem 3 within the framework of Theorem 1, we obtain the following improved result.

Corollary 3 (Improved first-order logconcave finite sum sampling).

In the setting of Theorem 3, Algorithm 1, using Algorithm 6 and SVRG [JZ13] as a restricted Gaussian oracle for $F$, uses

\[
O\left(n\log\left(\frac{n\kappa d}{\epsilon}\right)+\kappa\sqrt{nd}\log^{3.5}\left(\frac{n\kappa d}{\epsilon}\right)+\kappa d\log^{5}\left(\frac{n\kappa d}{\epsilon}\right)\right)=\widetilde{O}\left(n+\kappa\max\left(d,\sqrt{nd}\right)\right)
\]

queries to first-order oracles for the summands $\{f_{i}\}_{i\in[n]}$, and obtains $\epsilon$ total variation distance to $\pi$.

Corollary 3 has several surprising properties. First, its bound when $n=1$ gives yet another way of recovering (up to polylogarithmic factors) the runtime of [LST20] without gradient concentration. Second, up to a $\widetilde{O}(\max(1,\sqrt{\tfrac{n}{d}}))$ factor, it is essentially the best runtime one could hope for without an improvement in the $n=1$ case. This is in the sense that $\widetilde{O}(\kappa d)$ is the best known runtime for $n=1$, and to our knowledge every efficient well-conditioned sampler requires minimizer access, i.e. $\widetilde{O}(n)$ gradient queries [WS16]. Interestingly, when $n=1$, Algorithm 6 can be significantly simplified, and becomes the standard Metropolized random walk [DCWY18]; this yields Corollary 5, an algorithm attaining the iteration complexity of Corollary 1 while only querying a value oracle for $f$.

1.2 Previous work

Logconcave sampling is a problem that, within theoretical computer science, has its origins in convex body sampling (a problem it generalizes). A long sequence of developments has made significant advances in the general model, where only convexity is assumed about the negative log-density, and only value oracle access is given. In this prior work discussion, we focus on the more structured cases where all or part of the negative log-density has finite condition number, and refer the reader to [Vem05, LV06a, CV15] for an account of progress in the general case.

Well-conditioned densities.

Significant recent efforts in the machine learning and statistics communities have focused on obtaining provable guarantees for well-conditioned distributions, starting from the pioneering work of [Dal17], and continued in e.g. [CCBJ18, DR18, CV19, CDWY19, DCWY18, DM19a, DMM19, LSV18, MMW+19, SL19, LST20]. In this setting, many methods based on discretizations of continuous-time first-order processes (such as the Langevin dynamics) have been proposed. Typically, error guarantees come in two forms: either in the 2-Wasserstein ($W_{2}$) distance, or in total variation (TV). The line of work [DCWY18, CDWY19, LST20] has brought the gradient complexity for obtaining $\epsilon$ TV distance to $\widetilde{O}(\kappa d)$, where $d$ is the dimension, by exploiting gradient concentration properties. For progress in complexities depending polynomially on $\epsilon^{-1}$, attaining $W_{2}$ guarantees (typically incomparable to TV bounds), we defer to [SL19], the state-of-the-art using $\widetilde{O}(\kappa^{\frac{7}{6}}\epsilon^{-\frac{1}{3}}+\kappa\epsilon^{-\frac{2}{3}})$ queries to obtain $W_{2}$ distance $\epsilon\sqrt{d\mu^{-1}}$ from the target (here, $\sqrt{d\mu^{-1}}$ is the effective diameter; this accuracy measure allows for scale-invariant $W_{2}$ guarantees). We note incomparable guarantees can be obtained by assuming higher derivative bounds (e.g. a Lipschitz Hessian); our work uses only the minimal assumption of bounded second derivatives.

Composite densities.

Recent works have studied sampling from densities of the form (1), or its specializations (e.g. restrictions to a convex set). Several [Per16, BDMP17, Ber18] are based on Moreau envelope or proximal regularization strategies, and demonstrate efficiency under more stringent assumptions on the structure of the composite function $g$, but under minimal assumptions obtain fairly large provable mixing times $\Omega(d^{5})$. Proximal regularization algorithms have also been considered for non-composite sampling [Wib19]. Another discretization strategy based on projections was studied by [BEL18], but obtained mixing time $\Omega(d^{7})$. Finally, improved algorithms for special constrained sampling problems have been proposed, such as simplex restrictions [HKRC18].

Of particular relevance and inspiration to this work is [MFWB19]. By generalizing and adapting the Metropolized HMC algorithms of [DCWY18, CDWY19], adopting a Moreau envelope strategy, and using (a stronger version of) the RGO access model, [MFWB19] obtained a runtime which in the best case scales as $\widetilde{O}(\kappa^{2}d)$, similar to the guarantee of our base sampler in Theorem 2. However, this result required a variety of additional assumptions, such as access to the normalization factors of restricted Gaussians, Lipschitzness of $g$, warmness of the start, and various problem parameter tradeoffs. To the best of our knowledge, the general problem of sampling from (1) under minimal assumptions more efficiently than general-purpose logconcave algorithms was unresolved (even under restricted Gaussian oracle access) prior to our mixing time bound, a novel contribution of this work. Our results also suggest that the RGO is a natural notion of tractability for the composite sampling problem.

Logconcave finite sums.

Since [WT11] proposed the stochastic gradient Langevin dynamics, which at each step stochastically estimates the full gradient of the function, there has been a long line of work giving bounds for this method and other similar algorithms [DK19, GGZ18, SKR19, BCM+18, NF19]. These convergence rates depend heavily on the variance of the stochastic estimates. Inspired by variance-reduced methods in convex optimization, samplers based on low-variance estimators have also been proposed [DRW+16, DSM+16, BFR+19, BFFN19, NDH+17, CWZ+17, ZXG18, CFM+18]. Although our reduction-based approach is not designed specifically for solving problems of finite sum structure, our speedup can be viewed as due to a lower variance estimator implicitly defined through the oracle subproblems of Theorem 1 via repeated re-centering.

In Table 1, we list prior runtimes [ZXG18, CFM+18] for sampling logconcave finite sums; note these results additionally require bounded higher derivatives (with the exception of the bound with $\kappa^{4}$ dependence), obtain guarantees only in Wasserstein distance, and have polynomial dependences on $\epsilon^{-1}$. On the other hand, our reduction-based approach obtains total variation bounds with linear dependence on $\kappa$ and polylogarithmic dependence on $\epsilon^{-1}$. Our bound also simultaneously matches the state-of-the-art bound when $n=1$, a feature not shared by prior stochastic algorithms. To our knowledge, no previous nontrivial bounds were known in the high-accuracy regime before our work (as mentioned previously, one can always compute the full $\nabla F$ in every iteration of a well-conditioned sampler).

| Method | Gradient oracle complexity ($W_{2}\leq\epsilon$, $\mu=1$) | Gradient oracle complexity ($W_{2}\leq\epsilon\sqrt{d\mu^{-1}}$) |
| --- | --- | --- |
| SAGA-LD [CFM+18] | $n+\frac{\kappa^{1.5}\sqrt{d}+\kappa d+Md}{\epsilon}+\frac{\kappa d^{4/3}}{\epsilon^{2/3}}$ | $n+\frac{\kappa^{1.5}+\kappa\sqrt{d}+M\sqrt{d}}{\epsilon}+\frac{\kappa d^{2/3}}{\epsilon^{2/3}}$ |
| SVRG-LD [CFM+18] | $n+\frac{\kappa^{1.5}\sqrt{d}+\kappa d+Md}{\epsilon}+\frac{\kappa d^{4/3}}{\epsilon^{2/3}}$ | $n+\frac{\kappa^{3}}{\epsilon^{2}}+\frac{\kappa^{1.5}+M\sqrt{d}}{\epsilon}$ |
| CV-ULD [CFM+18] | $n+\frac{\kappa^{4}d^{1.5}}{\epsilon^{3}}$ | $n+\frac{\kappa^{4}}{\epsilon^{3}}$ |
| SVRG-LD [ZXG18] | $n+\frac{\kappa^{1.5}\sqrt{d}+Md}{\epsilon}+\frac{\kappa\sqrt{nd}}{\epsilon}$ | $n+\frac{\kappa^{1.5}+M\sqrt{d}}{\epsilon}+\frac{\kappa\sqrt{n}}{\epsilon}$ |
| State-of-the-art, $n=1$ [SL19] | $\frac{\kappa^{7/6}d^{1/6}}{\epsilon^{1/3}}+\frac{\kappa d^{1/3}}{\epsilon^{2/3}}$ | $\frac{\kappa^{7/6}}{\epsilon^{1/3}}+\frac{\kappa}{\epsilon^{2/3}}$ |

| Method | Gradient oracle complexity (TV $\leq\epsilon$) |
| --- | --- |
| Corollary 3 | $n+\kappa d+\kappa\sqrt{nd}$ |
| State-of-the-art, $n=1$ [LST20] | $\kappa d$ |

Table 1: Complexity of sampling from $e^{-F(x)}$ where $F(x)=\frac{1}{n}\sum_{i\in[n]}f_{i}(x)$ on $\mathbb{R}^{d}$ is $\mu$-strongly convex, each $f_{i}$ is convex and $L$-smooth, and $\kappa=\tfrac{L}{\mu}$. For relevant lines, $M$ is the Lipschitz constant of the Hessian $\nabla^{2}F$, on which our algorithm has no dependence. Complexity is measured in terms of the number of calls to $f_{i}$ or $\nabla f_{i}$ for summands $\{f_{i}\}_{i\in[n]}$. We hide $\textup{polylog}(\tfrac{n\kappa d}{\epsilon})$ factors for simplicity.
Preliminary version.

A preliminary version of this work, containing the results of Section 5, appeared as [STL20]. The preliminary version also contained an experimental evaluation of the algorithm in Section 5 for the task of sampling a (non-diagonal covariance) multivariate Gaussian restricted to a box, and demonstrated the efficacy of our method in comparison to general-purpose logconcave samplers (i.e. the hit-and-run method [LV06c]). The focus of the present version is giving theoretical guarantees for structured logconcave sampling tasks, so we omit empirical evaluations, and defer evaluation of the new methods developed in this paper to follow-up work.

1.3 Technical overview

1.3.1 Composite logconcave sampling

We study the problem of approximately sampling from a distribution $\pi$ on $\mathbb{R}^{d}$, with density

\[
\frac{d\pi(x)}{dx}\propto\exp\left(-f(x)-g(x)\right). \tag{1}
\]

Here, $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is assumed to be “well-behaved” (i.e. has finite condition number), and $g:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is a convex, but possibly non-smooth, function. This problem generalizes the special case of sampling from $\propto\exp(-f(x))$ for well-conditioned $f$, simply by letting $g$ vanish. Even the specialization of (1) where $g$ indicates a convex set (i.e. is $0$ inside the set, and $\infty$ outside) is not well-understood; existing mixing time bounds for this restricted setting are large polynomials in $d$ [BDMP17, BEL18], and are typically weaker than guarantees in the general logconcave setting [LV06c, LV06b]. This is in contrast to the convex optimization setting, where first-order methods readily generalize to solve problem families such as $\min_{x\in\mathcal{X}}f(x)$, where $\mathcal{X}\subseteq\mathbb{R}^{d}$ is a convex set, as well as its generalization

\[
\min_{x\in\mathbb{R}^{d}}f(x)+g(x),\text{ where }g:\mathbb{R}^{d}\rightarrow\mathbb{R}\text{ is convex and admits a proximal oracle.} \tag{2}
\]

We defined proximal oracles in Definition 3; in short, they are procedures which minimize the sum of a quadratic and $g$. Definition 3 is desirable because many natural non-smooth composite objectives arising in learning settings, such as the Lasso [Tib96] and elastic net [ZH05], admit efficient proximal oracles. It is clear that the definition of a proximal oracle implies it can also handle arbitrary sums of linear functions and quadratics, as the resulting objective can be rewritten as the sum of a constant and a single quadratic. The seminal work [BT09] extends fast gradient methods to solve (2) via proximal oracles, and has prompted many follow-up studies.
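For instance, the proximal oracle for the $\ell_{1}$ regularizer $g(x)=\gamma\left\lVert x\right\rVert_{1}$ underlying the Lasso has the well-known closed form of coordinate-wise soft-thresholding; a minimal sketch (the function name and test values are illustrative):

```python
import numpy as np

def prox_l1(lam, v, gamma=1.0):
    """Proximal oracle (Definition 3) for g(x) = gamma * ||x||_1:
    argmin_x { ||x - v||^2 / (2 * lam) + gamma * ||x||_1 }.
    The problem separates per coordinate, and each 1-d problem is
    solved by soft-thresholding v_i at level lam * gamma."""
    t = lam * gamma
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

print(prox_l1(0.5, np.array([3.0, -0.2, 1.0])))
```

Coordinates with magnitude below the threshold are zeroed out exactly, which is why this oracle is commonly used to enforce sparsity.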

Motivated by the success of the proximal oracle framework in convex optimization, we study sampling from the family (1) through the lens of RGOs, a natural extension of Definition 3. The main result of Section 5 is a “base” algorithm efficiently sampling from (1), assuming access to an RGO for gg. We now survey the main components of this algorithm.

Reduction to shared minimizers. We first observe that, without loss of generality, $f$ and $g$ share a minimizer: shifting $f$ and $g$ by linear terms, i.e. setting $\tilde{f}(x):=f(x)-\left\langle\nabla f(x^{*}),x\right\rangle$ and $\tilde{g}(x):=g(x)+\left\langle\nabla f(x^{*}),x\right\rangle$, where $x^{*}$ minimizes $f+g$, first-order optimality implies both $\tilde{f}$ and $\tilde{g}$ are minimized by $x^{*}$. Moreover, implementing a first-order oracle for $\tilde{f}$ and an RGO for $\tilde{g}$ is immediate without additional assumptions. This modification becomes crucial for our later developments, and we hope this simple observation, reminiscent of “variance reduction” techniques in stochastic optimization [JZ13], is broadly applicable to improving algorithms for the sampling problem induced by (1).
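The shift is easy to sanity-check numerically; in this hedged one-dimensional sketch (the quadratic $f$, $g$ are chosen purely for illustration), both shifted functions have vanishing gradient at the joint minimizer $x^{*}$, while $f+g$ is unchanged:

```python
# Illustrative smooth pair: f(x) = (x - 1)^2, g(x) = (x + 2)^2.
f  = lambda x: (x - 1.0) ** 2
df = lambda x: 2.0 * (x - 1.0)
g  = lambda x: (x + 2.0) ** 2
dg = lambda x: 2.0 * (x + 2.0)

# Minimizer of f + g: 2(x - 1) + 2(x + 2) = 0  =>  x* = -1/2.
x_star = -0.5
c = df(x_star)  # gradient of f at x*, used for the linear shift

# Shifted functions from the reduction:
#   f~(x) = f(x) - <grad f(x*), x>,   g~(x) = g(x) + <grad f(x*), x>.
df_tilde = lambda x: df(x) - c
dg_tilde = lambda x: dg(x) + c

print(df_tilde(x_star), dg_tilde(x_star))  # both gradients vanish at x*
```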

Beyond Moreau envelopes: expanding the space. A typical approach in convex optimization to handling a non-smooth objective $g$ is to instead optimize its Moreau envelope, defined by

\[
g^{\eta}(y):=\min_{x\in\mathbb{R}^{d}}\left\{g(x)+\frac{1}{2\eta}\left\lVert x-y\right\rVert_{2}^{2}\right\}. \tag{3}
\]

Intuitively, the envelope $g^{\eta}$ trades off function value with proximity to $y$; a standard exercise shows that $g^{\eta}$ is smooth (has a Lipschitz gradient), with smoothness depending on $\eta$, and moreover that computing gradients of $g^{\eta}$ reduces to calling a proximal oracle (Definition 3). It is natural to extend this idea to the composite sampling setting, e.g. via sampling from the density

\[
\exp\left(-f(x)-g^{\eta}(x)\right).
\]

However, a variety of complications prevent such strategies from obtaining rates comparable to their non-composite, well-conditioned counterparts, including difficulty in bounding closeness of the resulting distribution, as well as biased drifts of the sampling process due to error in gradients.
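As a concrete check of the smoothing in (3): the Moreau envelope of the absolute value is the classical Huber function, and the hedged sketch below (function names are ours) verifies this numerically by plugging the proximal point, which is soft-thresholding, back into the definition:

```python
import numpy as np

def moreau_env_abs(eta, y):
    """Moreau envelope (3) of g(x) = |x|, computed via its minimizer:
    the proximal point of |.| is soft-thresholding of y at level eta."""
    x = np.sign(y) * max(abs(y) - eta, 0.0)  # argmin of (3) for g = |.|
    return abs(x) + (x - y) ** 2 / (2.0 * eta)

def huber(eta, y):
    """Known closed form: the envelope of |.| is the Huber function,
    quadratic near the origin and linear (slope 1) in the tails."""
    return y * y / (2.0 * eta) if abs(y) <= eta else abs(y) - eta / 2.0

vals = [(moreau_env_abs(0.5, y), huber(0.5, y)) for y in np.linspace(-2, 2, 9)]
```

The quadratic cap near the origin is exactly where the envelope gains its $\tfrac{1}{\eta}$-Lipschitz gradient, while agreeing with $|y|$ up to an additive constant in the tails.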

Our approach departs from this smoothing strategy in a crucial way, inspired by Hamiltonian Monte Carlo (HMC) methods [Kra40, Nea11]. HMC can be seen as a discretization of the ubiquitous Langevin dynamics on an expanded space. In particular, discretizations of the Langevin dynamics simulate the stochastic differential equation $\tfrac{dx_{t}}{dt}=-\nabla f(x_{t})+\sqrt{2}\tfrac{dW_{t}}{dt}$, where $W_{t}$ is Brownian motion. HMC methods instead simulate dynamics on an extended space $\mathbb{R}^{d}\times\mathbb{R}^{d}$, via an auxiliary “velocity” variable which accumulates gradient information. This is sometimes interpreted as a discretization of the underdamped Langevin dynamics [CCBJ18]. HMC often has desirable stability properties, and expanding the dimension via an auxiliary variable has been used in algorithms obtaining the fastest rates in the well-conditioned logconcave sampling regime [SL19, LST20]. Inspired by this phenomenon, we consider the density on $\mathbb{R}^{d}\times\mathbb{R}^{d}$

\frac{d\hat{\pi}}{dz}(z):=\exp\left(-f(y)-g(x)-\frac{1}{2\eta}\left\lVert x-y\right\rVert_{2}^{2}\right)\text{ where }z=(x,y). (4)

Due to technical reasons, the family of distributions we use in our final algorithms is of a slightly different form than (4), but this simplification is useful to build intuition. Note in particular that the form of (4) is directly inspired by (3), where rather than minimizing over x, we directly expand the space. The idea is that for small enough \eta and a set of x of large measure, smoothness of f will guarantee that under (4), the conditional distribution of y concentrates near x, a fact we make rigorous. To sample from (1), we then show that a rejection filter applied to a sample x from the marginal of (4) will terminate in a constant number of steps. Consequently, it suffices to develop a fast sampler for (4).

Alternating sampling with an oracle. The form of the distribution (4) suggests a natural strategy for sampling from it: starting from a current state (x_{k},y_{k}), we iterate

  1. Sample y_{k+1}\sim\exp\left(-f(y)-\tfrac{1}{2\eta}\left\lVert x_{k}-y\right\rVert_{2}^{2}\right).

  2. Sample x_{k+1}\sim\exp\left(-g(x)-\frac{1}{2\eta}\left\lVert x-y_{k+1}\right\rVert_{2}^{2}\right), via a restricted Gaussian oracle.

When f and g share a minimizer, taking a first-order approximation in the first step, i.e. sampling y_{k+1}\sim\exp(-f(x_{k})-\left\langle\nabla f(x_{k}),y-x_{k}\right\rangle-\tfrac{1}{2\eta}\left\lVert y-x_{k}\right\rVert_{2}^{2}), can be shown to generalize the leapfrog step of HMC updates. However, for \eta very small (as in our setting), we observe that the first step itself reduces to sampling from a distribution with constant condition number, performable in \tilde{O}(d) gradient calls by e.g. Metropolized HMC [DCWY18, CDWY19, LST20]. Moreover, it is not hard to see that this “alternating marginal” sampling strategy preserves the stationary distribution exactly, so no filtering is necessary. Directly bounding the conductance of this random walk, for small enough \eta, leads to an algorithm running in \tilde{O}\left(\kappa^{2}d^{2}\right) iterations, each calling an RGO once, and a gradient oracle for f roughly \tilde{O}\left(d\right) times. This latter guarantee follows by appealing to known bounds [CDWY19, LST20] on the mixing time in high dimensions of Metropolized HMC for a well-conditioned distribution, a property satisfied by the y-marginal of (4) for small \eta.
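As a sanity check of this alternating scheme, here is a sketch in one dimension with the illustrative quadratic choices f(y)=ay^{2}/2 and g(x)=bx^{2}/2 (ours, not the paper's), where both steps are exact Gaussian draws and the x-marginal of (4) has a closed-form precision the chain should reproduce.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, eta = 2.0, 3.0, 0.1  # f(y) = a*y^2/2, g(x) = b*x^2/2, coupling parameter eta

def step(x):
    # Step 1: y ~ exp(-f(y) - (x - y)^2/(2*eta)); Gaussian since f is quadratic.
    y = rng.normal(x / (1 + eta * a), np.sqrt(1 / (a + 1 / eta)))
    # Step 2: x ~ exp(-g(x) - (x - y)^2/(2*eta)); the RGO call, also Gaussian here.
    return rng.normal(y / (1 + eta * b), np.sqrt(1 / (b + 1 / eta)))

x, xs = 0.0, []
for k in range(100000):
    x = step(x)
    if k >= 1000:  # discard burn-in
        xs.append(x)
xs = np.array(xs)

# x-marginal precision of the joint density (4) in this Gaussian case
# (Schur complement of the 2x2 joint precision matrix).
prec = (b + 1 / eta) - (1 / eta) ** 2 / (a + 1 / eta)
```

Since the two steps are exact conditional draws, no Metropolis filter is needed and the empirical marginal matches the analytic one.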

Stability of Gaussians under bounded perturbations. To obtain our tightest runtime result, we use the fact that \eta is chosen to be much smaller than L^{-1} to show structural results about distributions of the form (4), yielding tighter concentration for bounded perturbations of a Gaussian (i.e. the Gaussian has covariance \tfrac{1}{\eta}\mathbf{I}, and is restricted by L-smooth f for \eta\ll L^{-1}). To illustrate, let

\frac{d\mathcal{P}_{x}(y)}{dy}\propto\exp\left(-f(y)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right)

and let its mean and mode be \bar{y}_{x} and y^{*}_{x} respectively. It is standard that \left\lVert\bar{y}_{x}-y^{*}_{x}\right\rVert_{2}\leq\sqrt{d\eta}, by \eta^{-1}-strong logconcavity of \mathcal{P}_{x}. Informally, we show that for \eta\ll L^{-1} and x not too far from the minimizer of f, we can improve this to \left\lVert\bar{y}_{x}-y^{*}_{x}\right\rVert_{2}=O(\sqrt{\eta}); see Proposition 12 for a precise statement.

Using our structural results, we sharpen conductance bounds, improve the warmness of a starting distribution, and develop a simple rejection sampling scheme for sampling the y variable in an expected constant number of gradient queries. Our proofs are continuous in flavor and based on gradually perturbing the Gaussian and solving a differential inequality; we believe they may be of independent interest. These improvements lead to an algorithm running in \tilde{O}\left(\kappa^{2}d\right) iterations; ultimately, we use our reduction framework, stated in Theorem 1, to improve this dependence to \tilde{O}\left(\kappa d\right).

1.3.2 Logconcave finite sums

We initiate the algorithmic study of the following task in the high-accuracy regime: sample x\sim\pi within total variation distance \epsilon, where \tfrac{d\pi}{dx}(x)\propto\exp(-F(x)) and

F(x)=\frac{1}{n}\sum_{i\in[n]}f_{i}(x), (5)

all f_{i}:\mathbb{R}^{d}\rightarrow\mathbb{R} are convex and L-smooth, and F is \mu-strongly convex. We call such a distribution \pi a (well-conditioned) logconcave finite sum.

In applications (where summands correspond to points in a dataset, e.g. in Bayesian linear and logistic regression tasks [DCWY18]), querying \nabla F may be prohibitively expensive, so a natural goal is to bound the number of required queries to summand gradients \nabla f_{i} for i\in[n]. This motivation also underlies the development of stochastic gradient methods in optimization, a foundational tool in modern statistics and data processing. Naïvely, one can complete the task by using existing samplers for well-conditioned distributions and querying the full gradient \nabla F in each iteration, resulting in a summand gradient query complexity of \widetilde{O}(n\kappa d) [LST20]. Many recent works, inspired by developments in the complexity of optimizing a well-conditioned finite sum, have developed subsampled gradient methods for the sampling problem. However, to our knowledge, all such guarantees depend polynomially on the accuracy \epsilon and are measured in the 2-Wasserstein distance; in the high-accuracy, total variation case, no nontrivial query complexity is currently known.

We show in Section 6 that given access to the minimizer of F, a simple zeroth-order algorithm which queries \widetilde{O}(\kappa^{2}d) values of summands \{f_{i}\}_{i\in[n]} succeeds (i.e. it never requires a full value or gradient query of F). The algorithm is essentially the Metropolized random walk proposed in [DCWY18] for the n=1 case, with a cheaper subsampled filter step. Notably, because the random walk is conducted with respect to F, we cannot efficiently query the function value at any point; nonetheless, by randomly sampling to compute a nearly-unbiased estimator of the rejection probability, we do not incur too much error. This random walk was shown in [CDWY19] to mix in \widetilde{O}(\kappa^{2}d) iterations; we implement each step to sufficient accuracy using \widetilde{O}(1) function evaluations.
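The subsampling idea behind the filter step can be illustrated as follows (a simplified sketch with synthetic quadratic summands of our own choosing; the paper's actual filter controls the estimator's error far more carefully): the value gap F(y)-F(x) entering the rejection probability is estimated from a small random batch of summand values only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 5
A = rng.normal(size=(n, d))  # synthetic data rows; f_i(x) = (a_i . x)^2 / 2

def f_i(i, x):
    return 0.5 * float(A[i] @ x) ** 2

def exact_gap(x, y):
    # F(y) - F(x) with F = (1/n) sum_i f_i: requires all n summand values.
    return np.mean([f_i(i, y) - f_i(i, x) for i in range(n)])

def subsampled_gap(x, y, batch):
    # Unbiased estimate of F(y) - F(x) from only `batch` random summand values.
    idx = rng.integers(n, size=batch)
    return np.mean([f_i(i, y) - f_i(i, x) for i in idx])
```

Because the proposals y of the random walk stay close to x, the per-summand gaps are small, so a small batch already concentrates well around the exact gap.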

It is natural to ask if first-order information can be used to improve this query complexity, perhaps through “variance reduction” techniques (e.g. [JZ13]) developed for stochastic optimization. The idea behind variance reduction is to recenter gradient estimates in phases, occasionally computing full gradients to improve the estimate quality. One fundamental difficulty arising from the use of variance reduction in high-accuracy sampling is that the resulting algorithms are not stateless: the variance-reduced estimates depend on the history of the algorithm, making it difficult to ascertain correctness of the stationary distribution. We take a different approach to achieve a linear query dependence on the conditioning \kappa, described in the following section.

1.3.3 Proximal point reduction framework

To motivate Theorem 1, we first recast existing “proximal point” reduction-based optimization methods through the lens of composite optimization, and subsequently show that similar ideas underlying our composite sampler in Section 1.3.1 yield an analogous “proximal point reduction framework” for sampling. Chronologically, our composite sampler (originally announced in [STL20]) predates our reduction framework, which was then inspired by the perspective given here. We hope these insights prove fruitful for further development of proximal approaches to sampling.

Proximal point methods as composite optimization.

Proximal point methods are a well-studied primitive in optimization, developed by [Roc76]; cf. [PB14] for a modern survey. The principal idea is that to minimize convex F:\mathbb{R}^{d}\rightarrow\mathbb{R}, it suffices to solve a sequence of subproblems

x_{k+1}\leftarrow\textup{argmin}_{x\in\mathbb{R}^{d}}\left\{F(x)+\frac{1}{2\lambda}\left\lVert x-x_{k}\right\rVert_{2}^{2}\right\}. (6)

Intuitively, by tuning the parameter \lambda\geq 0, we trade off how regularized the subproblems (6) are against how rapidly the overall method converges. Smaller values of \lambda result in larger amounts of regularization, which are amenable to algorithms for minimizing well-conditioned objectives.
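A minimal sketch of the iteration (6), for an illustrative strongly convex quadratic F where each subproblem has a closed-form solution:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
M = rng.normal(size=(d, d))
H = M @ M.T + np.eye(d)          # Hessian of F(x) = x.H.x/2 - b.x (strongly convex)
b = rng.normal(size=d)
x_star = np.linalg.solve(H, b)   # true minimizer of F

lam = 0.5
x = np.zeros(d)
for _ in range(200):
    # Subproblem (6): argmin_x { F(x) + ||x - x_k||^2/(2*lam) },
    # solved exactly here since F is quadratic.
    x = np.linalg.solve(H + np.eye(d) / lam, b + x / lam)
```

Along a Hessian eigendirection with eigenvalue h, each step contracts the error by 1/(1+\lambda h), so smaller \lambda gives easier subproblems but slower outer convergence, matching the trade-off above.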

For optimizing functions of the form (5) via stochastic gradient estimates to \epsilon error, [JZ13, DBL14, SRB17] developed variance-reduced methods obtaining a query complexity of \widetilde{O}(n+\kappa). To match a known lower bound of \widetilde{O}(n+\sqrt{n\kappa}) due to [WS16], two works [LMH15, FGKS15] appropriately applied instances of accelerated proximal point methods [Gul92], with careful analyses of how accurately the subproblems (6) needed to be solved. These algorithms called the \widetilde{O}(n+\kappa) runtime as a black-box oracle to solve the subproblems (6) for an appropriate choice of \lambda, obtaining an accelerated rate (an improved runtime without extraneous logarithmic factors was later obtained by [All17]). To shed some light on this acceleration procedure, we adopt an alternative view on proximal point methods, a perspective which can also be found in the lecture notes [Lee18]. Consider the following known composite optimization result.

Proposition 1 (Informal statement of [BT09]).

Let f:\mathbb{R}^{d}\rightarrow\mathbb{R} be L-smooth and \mu-strongly convex, and let g:\mathbb{R}^{d}\rightarrow\mathbb{R} admit a proximal oracle \mathcal{O}(\lambda,v) (cf. Definition 3). There is an algorithm taking \widetilde{O}(\sqrt{\kappa}) iterations for \kappa=\tfrac{L}{\mu} to find an \epsilon-approximate minimizer of f+g, each querying \nabla f and \mathcal{O} a constant number of times. Further, \lambda=\tfrac{1}{L} in all calls to \mathcal{O}.

Ignoring subtleties of the error tolerance of \mathcal{O}, we show how to use an instance of Proposition 1 to recover the \widetilde{O}(n+\sqrt{n\kappa}) query complexity for optimizing (5). Let f(x)=\tfrac{\mu}{2}\left\lVert x\right\rVert_{2}^{2}, and g=F-f. For any \Lambda\geq\mu, f is both \mu-strongly convex and \Lambda-smooth. Moreover, note that all calls to the proximal oracle \mathcal{O} for g require solving subproblems of the form

\textup{argmin}_{x\in\mathbb{R}^{d}}\left\{F(x)-\frac{\mu}{2}\left\lVert x\right\rVert_{2}^{2}+\frac{\Lambda}{2}\left\lVert x-v\right\rVert_{2}^{2}\right\}. (7)

The upshot of choosing a smoothness bound \Lambda\geq\mu is that the regularization amount in (7) increases, improving the conditioning of the subproblem, which is \Lambda-strongly convex and (L+\Lambda)-smooth. The algorithm of e.g. [JZ13] solves each subproblem (7) in \widetilde{O}(n+\tfrac{L+\Lambda}{\Lambda}) gradient queries, leading to an overall query complexity (for Proposition 1) of

\widetilde{O}\left(\sqrt{\frac{\Lambda}{\mu}}\cdot\left(n+\frac{L}{\Lambda}\right)\right).

Optimizing over \Lambda\geq\mu, i.e. taking \Lambda=\max(\mu,\tfrac{L}{n}), yields the desired bound of \widetilde{O}(n+\sqrt{n\kappa}).
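This optimization over \Lambda is easy to verify numerically; a small sketch (with arbitrary illustrative values of n, L, \mu) checks that \Lambda=\max(\mu,L/n) attains the minimum of the bound, whose value is 2\sqrt{nL/\mu}=2\sqrt{n\kappa} when L/n\geq\mu.

```python
import numpy as np

def cost(lam_, n, L, mu):
    # The query bound sqrt(Lambda/mu) * (n + L/Lambda), ignoring log factors.
    return np.sqrt(lam_ / mu) * (n + L / lam_)

n, L, mu = 1000, 1.0e4, 1.0       # illustrative values; kappa = L/mu = 1e4
lams = np.logspace(0, 4, 400)     # grid over Lambda >= mu
best = lams[np.argmin(cost(lams, n, L, mu))]
claimed = max(mu, L / n)          # the choice from the text
```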

Applications to sampling.

In Sections 5 and 6, we develop samplers for structured families with quadratic dependence on the conditioning \kappa. The proximal point approach suggests a strategy for accelerating these runtimes: namely, if there is a framework which repeatedly calls a sampler for a regularized density (analogous to the calls to (6)), one could trade off the regularization against the rate of the outer loop. Fortunately, in the spirit of interpreting proximal point methods as composite optimization, the composite sampler of Section 5 itself meets these reduction framework criteria.

We briefly recall properties of our composite sampler here. Let \pi be a distribution on \mathbb{R}^{d} with \tfrac{d\pi}{dx}(x)\propto\exp(-f_{\textup{wc}}(x)-f_{\textup{oracle}}(x)) (to disambiguate, we sometimes use the notation f_{\textup{wc}}+f_{\textup{oracle}} rather than f+g in defining instances of our reduction framework or composite samplers, when convenient in context), where f_{\textup{wc}} is well-conditioned (has finite condition number \kappa) and f_{\textup{oracle}} admits an RGO, which solves subproblems of the form

\mathcal{O}(\eta,v)\sim\text{the density proportional to }\exp\left(-\frac{1}{2\eta}\left\lVert x-v\right\rVert_{2}^{2}-f_{\textup{oracle}}(x)\right). (8)

The algorithm of Section 5 only calls \mathcal{O} with a fixed \eta; note the strong parallel between the RGO subproblem and the proximal oracle of Proposition 1. For a given value of \eta\geq 0, our composite sampler runs in \widetilde{O}(\tfrac{1}{\eta\mu}) iterations, each requiring a call to \mathcal{O}. Smaller \eta improves the conditioning of the negative log-density of the subproblem (8), but increases the overall iteration count, yielding a range of trade-offs. The algorithm of Section 5 has an upper bound requirement on \eta (cf. Theorem 2); in Section 3, we observe that this requirement may be lifted when f_{\textup{wc}}=0 uniformly, allowing for a full range of choices. Moreover, the analysis of the composite sampler becomes much simpler when f_{\textup{wc}}=0, as in Theorem 1. We give the framework as Algorithm 1, along with a full (fairly short) convergence analysis. By trading off the regularization amount against the cost of implementing (8) via bootstrapping base samplers, we obtain a host of improved runtimes.

Beyond our specific applications, the framework we provide has strong implications as a generic reduction from mixing times scaling polynomially in \kappa to improved methods scaling linearly in \kappa. This is akin to the observation in [LMH15] that accelerated proximal point methods generically improve \text{poly}(\kappa) dependences to \sqrt{\kappa} dependences in optimization. We are optimistic this insight will find further implications in the logconcave sampling literature.

1.4 Erratum, and a word of warning for o(d) mixing

The initial version of this paper, presented at COLT 2021, had an incorrect proof of Theorem 1. This was due to our reliance on the average conductance (“spectral profile”) technique of [LK99] for bounding mixing. Roughly speaking, the mistake was caused by a misunderstanding that for stationary measures satisfying \mu-log-isoperimetry (for example, \mu-strongly logconcave densities) and with transition distributions of \Delta-close points having constant overlap, [LK99] provides mixing time bounds of the form (where \beta is a warmness parameter of the starting distribution)

\int_{\frac{1}{\beta}}^{\frac{1}{2}}\frac{1}{s\Phi(s)^{2}}ds\lesssim\frac{1}{\mu\Delta^{2}}\int_{\frac{1}{\beta}}^{\frac{1}{2}}\frac{1}{s\log(s)}ds\approx\frac{1}{\mu\Delta^{2}}\log\log\beta. (9)

Here, \Phi(s) is the s-conductance of the Markov chain, which can typically be lower bounded by \Omega(\sqrt{\mu\log(s)}\Delta) under a stationary density exhibiting log-isoperimetry. However, the trivial bound \Phi(s)\leq 1 demonstrates that there is an additive \log(\beta) term in (9). This is a bottleneck towards mixing times scaling as o(d) for distributions where only an \exp(\Omega(d))-warm start is feasible; in particular, the conductance actually scales as \min(1,\Omega(\sqrt{\mu\log(s)}\Delta)), causing this additive term. In settings where \mu\Delta^{2}\geq d^{-1} (such as our reductions, where this term often scales as a condition number of the problem), this additive term \log(\beta)=\Omega(d) may dominate. This observation (and the fix) came out of conversations with Sinho Chewi; we are immensely grateful for his help.

For the particular structure of the algorithm in Theorem 1, we are able to give an alternative analysis going through W_{2} convergence bounds, preserving the correctness of the theorem. However, this bottleneck is a general phenomenon which may cause future attempts to use Metropolized algorithms from exponentially warm starts to be stuck at \Omega(d) iterations, which merits further investigation. We write this section as a word of warning to future researchers aiming at sublinear dimension dependences in Metropolized algorithms, and as a suggested open research direction.

1.5 Roadmap

We give notation and preliminaries in Section 2. In Section 3 we give our framework for bootstrapping fast regularized samplers, and prove its correctness (Theorem 1). Assuming the “base samplers” of Theorems 2 and 3, in Section 4 we apply our reduction to obtain all of our strongest guarantees, namely Corollaries 1, 2, and 3. We then prove Theorems 2 and 3 in Sections 5 and 6.

2 Preliminaries

2.1 Notation

General notation.

For d\in\mathbb{N}, [d] refers to the set of naturals 1\leq i\leq d; \left\lVert\cdot\right\rVert_{2} is the Euclidean norm on \mathbb{R}^{d} when d is clear from context. \mathcal{N}(\mu,\boldsymbol{\Sigma}) is the multivariate Gaussian of specified mean and covariance, \mathbf{I} is the identity of appropriate dimension when clear from context, and \preceq is the Loewner order on symmetric matrices.

Functions.

We say a twice-differentiable function f:\mathbb{R}^{d}\rightarrow\mathbb{R} is L-smooth and \mu-strongly convex if \mu\mathbf{I}\preceq\nabla^{2}f(x)\preceq L\mathbf{I} for all x\in\mathbb{R}^{d}; it is well-known that L-smoothness implies that f has an L-Lipschitz gradient, and that for any x,y\in\mathbb{R}^{d},

f(x)+\left\langle\nabla f(x),y-x\right\rangle+\frac{\mu}{2}\left\lVert y-x\right\rVert_{2}^{2}\leq f(y)\leq f(x)+\left\langle\nabla f(x),y-x\right\rangle+\frac{L}{2}\left\lVert y-x\right\rVert_{2}^{2}.

If f is L-smooth and \mu-strongly convex, we say it has a condition number \kappa:=\tfrac{L}{\mu}. We call a zeroth-order oracle, or “value oracle”, an oracle which returns f(x) on any input point x\in\mathbb{R}^{d}; similarly, a first-order oracle, or “gradient oracle”, returns both the value and gradient (f(x),\nabla f(x)).
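As a quick numerical check of the two-sided bound above, the sketch below uses the 1-D example f(x)=x^{2}/2+\log(1+e^{x}) (our illustrative choice), whose second derivative lies in (1, 1.25], so \mu=1 and L=1.25.

```python
import numpy as np

MU, L = 1.0, 1.25

def f(x):
    # f(x) = x^2/2 + log(1 + e^x); f''(x) = 1 + sigma(x)*(1 - sigma(x)) in (1, 1.25].
    return 0.5 * x ** 2 + np.logaddexp(0.0, x)

def df(x):
    # f'(x) = x + sigma(x), with sigma the logistic function.
    return x + 1.0 / (1.0 + np.exp(-x))
```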

Distributions.

We call a distribution \pi on \mathbb{R}^{d} logconcave if \tfrac{d\pi}{dx}(x)=\exp(-f(x)) for convex f; \pi is \mu-strongly logconcave if f is \mu-strongly convex. For A\subseteq\mathbb{R}^{d}, A^{c} is its complement, and we let \pi(A):=\int_{x\in A}d\pi(x). We say a distribution \rho is \beta-warm with respect to \pi if \tfrac{d\rho}{d\pi}(x)\leq\beta everywhere, and define the total variation \left\lVert\pi-\rho\right\rVert_{\textup{TV}}:=\sup_{A\subseteq\mathbb{R}^{d}}\pi(A)-\rho(A). We will frequently use the fact that \left\lVert\pi-\rho\right\rVert_{\textup{TV}} is also the probability that x\sim\pi and x^{\prime}\sim\rho are unequal under the best coupling of (\pi,\rho); this allows us to “locally share randomness” when comparing two random walk procedures. We define the expectation \mathbb{E}_{\pi} and variance \textup{Var}_{\pi} with respect to distribution \pi in the standard way,

\mathbb{E}_{\pi}[h(x)]:=\int h(x)d\pi(x),\;\textup{Var}_{\pi}[h(x)]:=\mathbb{E}_{\pi}\left[(h(x))^{2}\right]-\left(\mathbb{E}_{\pi}[h(x)]\right)^{2}.
Structured distributions.

This work considers two types of distributions with additional structure, which we call composite logconcave densities and logconcave finite sums. A composite logconcave density has the form \exp(-f(x)-g(x)), where both f and g are convex. In this context throughout, f will either be uniformly 0 or have a finite condition number (be “well-conditioned”), and g will represent a “simple” but possibly non-smooth function, as measured by admitting an RGO (cf. Definition 1). We will sometimes refer to the components f and g as f_{\textup{wc}} and f_{\textup{oracle}} respectively, to disambiguate when the functions f and g are already defined in context. In our reduction-based approaches, we have additional structure on the parameter \lambda which an RGO is called with (cf. Definition 2). Specifically, in our instances typically \lambda^{-1}\gg L (or some other “niceness” parameter associated with the negative log-density); this can be seen as heavily regularizing the negative log-density, and often makes the implementation simpler.

Finally, a logconcave finite sum has density of the form \exp(-F(x)) where F(x)=\tfrac{1}{n}\sum_{i\in[n]}f_{i}(x). When treating such densities, we make the assumption that each constituent summand f_{i} is L-smooth and convex, and the overall function F is \mu-strongly convex. We measure the complexity of algorithms for logconcave finite sums by gradient queries to summands, i.e. \nabla f_{i}(x) for some i\in[n].

Optimization.

Throughout this work, we are somewhat liberal in assuming access to minimizers of various functions (namely, the negative log-densities of target distributions). We give a more thorough discussion of this assumption in Appendix A, but note here that for all function families we consider (well-conditioned, composite, and finite sum), efficient first-order methods exist for obtaining high-accuracy minimizers, and this optimization query complexity is never the leading-order term in any of our algorithms, assuming polynomially bounded initial error.

2.2 Technical facts

We will repeatedly use the following results.

Fact 1 (Gaussian integral).

For any \lambda\geq 0 and v\in\mathbb{R}^{d},

\int\exp\left(-\frac{1}{2\lambda}\left\lVert x-v\right\rVert_{2}^{2}\right)dx=(2\pi\lambda)^{\frac{d}{2}}.
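A 1-D numerical check of Fact 1 (the wide-grid Riemann sum and the parameter values are our own sketch):

```python
import numpy as np

def gaussian_integral_1d(lam, v):
    # Riemann sum of exp(-(x - v)^2/(2*lam)) over a grid wide enough
    # that the truncated tails are negligible; compare against sqrt(2*pi*lam).
    x = np.linspace(v - 50.0, v + 50.0, 400001)
    vals = np.exp(-((x - v) ** 2) / (2 * lam))
    return vals.sum() * (x[1] - x[0])
```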
Fact 2 ([DCWY18], Lemma 1).

Let \pi be a \mu-strongly logconcave distribution, and let x^{*} minimize its negative log-density. Then, for x\sim\pi and any \delta\in[0,1], with probability at least 1-\delta,

\left\lVert x-x^{*}\right\rVert_{2}\leq\sqrt{\frac{d}{\mu}}\left(2+2\max\left(\sqrt[4]{\frac{\log(1/\delta)}{d}},\sqrt{\frac{\log(1/\delta)}{d}}\right)\right).
Fact 3 ([Har04], Theorem 1.1).

Let \pi be a \mu-strongly logconcave density. Let d\gamma_{\mu}(x) be the Gaussian density with covariance matrix \mu^{-1}\mathbf{I}. For any convex function h,

\mathbb{E}_{\pi}[h(x-\mathbb{E}_{\pi}[x])]\leq\mathbb{E}_{\gamma_{\mu}}[h(x-\mathbb{E}_{\gamma_{\mu}}[x])].
Fact 4 ([DM19a], Theorem 1).

Let \pi be a \mu-strongly logconcave distribution, and let x^{*} minimize its negative log-density. Then, \mathbb{E}_{\pi}[\left\lVert x-x^{*}\right\rVert_{2}^{2}]\leq\tfrac{d}{\mu}.
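A quick Monte Carlo check of Fact 4, using a Gaussian with covariance \preceq\mu^{-1}\mathbf{I} as the \mu-strongly logconcave example (an illustrative instance of ours, where the second moment is exactly the trace of the covariance):

```python
import numpy as np

rng = np.random.default_rng(4)
d, mu = 10, 2.0
# Diagonal covariances s_i <= 1/mu make this Gaussian mu-strongly logconcave.
s = rng.uniform(0.1, 1.0 / mu, size=d)
x_star = rng.normal(size=d)  # the mode, i.e. the negative log-density minimizer
samples = x_star + rng.normal(size=(100000, d)) * np.sqrt(s)
second_moment = np.mean(np.sum((samples - x_star) ** 2, axis=1))
```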

3 Proximal reduction framework

The reduction framework of Theorem 1 can be thought of as a specialization of a more general composite sampler which we develop in Section 5, whose guarantees are reproduced here.

See 2

Our main observation, elaborated on more formally for specific applications in Section 4, is that a variety of structured logconcave densities have negative log-density components f_{\textup{oracle}} for which we can implement an efficient restricted Gaussian oracle by calling an existing sampling method. Crucially, in these instantiations we use the fact that the distributions which \mathcal{O} is required to sample from are heavily regularized (restricted by a quadratic with large leading coefficient) to obtain fast samplers. We further note that the upper bound requirement on \eta in Theorem 2 can be lifted when the “well-conditioned” component is uniformly 0. Instead of setting f=0 and g=f_{\textup{oracle}} in Theorem 2 and refining the analysis for this special case to tolerate arbitrary \eta, we provide a self-contained proof here. This particular structure (the composite setting where f_{\textup{wc}} is uniformly zero and f_{\textup{oracle}} is strongly convex) admits significant simplifications from the more general case; using slightly different proof techniques, we obtain stronger convergence guarantees for this particular problem, allowing for mixing in fewer than d iterations from a feasible start (see Section 1.4).

See 1

For simplicity of notation, we replace f_{\textup{oracle}} in the statement of Theorem 1 with g throughout just this section. Let \pi be a density on \mathbb{R}^{d} with \tfrac{d\pi}{dx}(x)\propto\exp(-g(x)) where g is \mu-strongly convex (but possibly non-smooth), and let \mathcal{O} be a restricted Gaussian oracle for g. Consider the joint distribution \hat{\pi} supported on an expanded space z=(x,y)\in\mathbb{R}^{d}\times\mathbb{R}^{d} with density, for some \eta>0,

\frac{d\hat{\pi}}{dz}(z)\propto\exp\left(-g(x)-\frac{1}{2\eta}\left\lVert x-y\right\rVert_{2}^{2}\right).

Note that the x-marginal of \hat{\pi} is precisely \pi, so it suffices to sample from the x-marginal. We consider a simple alternating Markov chain for sampling from \hat{\pi}, described in the following Algorithm 1.

Algorithm 1 \texttt{AlternateSample}(g,\eta,T)

Input: \mu-strongly convex g:\mathbb{R}^{d}\rightarrow\mathbb{R}, \eta>0, T\in\mathbb{N}, x_{0}=\textup{argmin}_{x}g(x).

1:for k\in[T] do
2:     Sample y_{k}\sim\pi_{x_{k-1}}, where for all x\in\mathbb{R}^{d}, \tfrac{d\pi_{x}}{dy}(y)\propto\exp\left(-\tfrac{1}{2\eta}\left\lVert x-y\right\rVert_{2}^{2}\right).
3:     Sample x_{k}\sim\pi_{y_{k}}, where for all y\in\mathbb{R}^{d}, \tfrac{d\pi_{y}}{dx}(x)\propto\exp\left(-g(x)-\tfrac{1}{2\eta}\left\lVert x-y\right\rVert_{2}^{2}\right).
4:end for
5:return x_{T}
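A direct transcription of Algorithm 1 with the RGO supplied as a callback; to exercise it we use the illustrative choice g(x)=\mu\left\lVert x\right\rVert_{2}^{2}/2 (ours, not the paper's), for which the RGO draw in Line 3 is an explicit Gaussian.

```python
import numpy as np

rng = np.random.default_rng(5)

def alternate_sample(rgo, x0, eta, T):
    # Algorithm 1: alternate y_k ~ N(x_{k-1}, eta*I) (Line 2) with an RGO call (Line 3).
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        y = x + np.sqrt(eta) * rng.normal(size=x.shape)
        x = rgo(eta, y)
    return x

mu, eta = 2.0, 0.25

def rgo_quadratic(eta, y):
    # For g(x) = mu*||x||^2/2 the density exp(-g(x) - ||x-y||^2/(2*eta))
    # is Gaussian with mean y/(1 + eta*mu) and precision mu + 1/eta.
    return y / (1 + eta * mu) + rng.normal(size=y.shape) / np.sqrt(mu + 1 / eta)

# many independent runs from x_0 = argmin g = 0; the outputs should follow pi = N(0, 1/mu)
xs = np.array([alternate_sample(rgo_quadratic, np.zeros(1), eta, 40)[0]
               for _ in range(10000)])
```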

By observing that the distributions \pi_{x} and \pi_{y} in the above method are precisely the marginal distributions of \hat{\pi} with one variable fixed, it is straightforward to see that \hat{\pi} is a stationary distribution of the process. We make this formal in the following lemma.

Lemma 1 (Alternating marginal sampling).

Let \hat{\pi} be a density on two blocks (x,y). Sample (x,y)\sim\hat{\pi}, and then sample \tilde{x}\sim\hat{\pi}(\cdot,y), \tilde{y}\sim\hat{\pi}(\tilde{x},\cdot). Then, the distribution of (\tilde{x},\tilde{y}) is \hat{\pi}. Moreover, the alternating marginal sampling Markov chain on either marginal is reversible.

Proof.

The density of the resulting distribution at (\tilde{x},y) is proportional to the product of the (marginal) density at y and the conditional density of \tilde{x}\mid y, which by definition is \hat{\pi}. Therefore, (\tilde{x},y) is distributed as \hat{\pi}, and the argument for \tilde{y} follows symmetrically. To see reversibility on the x-marginal, letting \hat{\pi}^{y} denote the y-marginal density of \hat{\pi}, it suffices to note that the probability flow from x to x^{\prime} (the stationary density at x times the transition density) is proportional to

\int_{y}\frac{\hat{\pi}(x,y)\hat{\pi}(x^{\prime},y)}{\hat{\pi}^{y}(y)}dy,

which is a symmetric function of x and x^{\prime}. A similar argument holds for the y-marginal. ∎

We also state a simple observation about alternating schemes such as Algorithm 1, which will be useful later. Let \mathcal{P}_{x} be the density of y_{k} after one step of the above procedure starting from x_{k-1}=x, and let \mathcal{T}_{x} be the resulting density of x_{k}.

Observation 1.

For any two points x, x^{\prime}\in\mathbb{R}^{d}, \left\lVert\mathcal{T}_{x}-\mathcal{T}_{x^{\prime}}\right\rVert_{\textup{TV}}\leq\left\lVert\mathcal{P}_{x}-\mathcal{P}_{x^{\prime}}\right\rVert_{\textup{TV}}.

Proof.

This follows by the coupling characterization of total variation (see e.g. Chapter 5 of [LPW09]). Under the optimal coupling of y\sim\mathcal{P}_{x} and y^{\prime}\sim\mathcal{P}_{x^{\prime}}, whenever the coupling sets y=y^{\prime} in Line 2 of \texttt{AlternateSample}, we can couple the resulting distributions in Line 3 as well. ∎

In order to prove Theorem 1, we first show that the random walk in Algorithm 1 converges rapidly in the 2-Wasserstein distance (denoted W_{2} in this section).

Lemma 2.

Let \pi_{0} be the starting distribution of x in Algorithm 1. Let \pi_{k} be the distribution of x_{k} and \pi be the x-marginal of \hat{\pi}. For all k\geq 0,

W_{2}^{2}(\pi_{k+1},\pi)\leq\frac{1}{(1+\eta\mu)^{2}}W^{2}_{2}(\pi_{k},\pi).

Hence, for any \eta\leq\frac{1}{\mu}, in T^{\prime}=O\left(\frac{1}{\eta\mu}\log{\frac{d}{\mu\Delta}}\right) iterations, the random walk mixes to

W_{2}(\pi_{T^{\prime}},\pi)\leq\Delta.
Proof.

Let \Gamma_{x_{k}} be the optimal coupling between x_{k}\sim\pi_{k} and \hat{x}\sim\pi according to the W_{2} distance. Coupling the Gaussian random variables generating y_{k+1}\sim\pi_{x_{k}} and \hat{y}\sim\pi_{\hat{x}} gives a coupling \Gamma_{y_{k+1}} between y_{k+1} and \hat{y} such that

\mathbb{E}_{\Gamma_{y_{k+1}}}\left[\left\lVert y_{k+1}-\hat{y}\right\rVert_{2}^{2}\right]=\mathbb{E}_{\Gamma_{x_{k}}}\left[\left\lVert x_{k}-\hat{x}\right\rVert_{2}^{2}\right]. (10)

Then, let \pi_{y} be the distribution of x_{k+1} in a run of Line 3 of Algorithm 1 starting from y_{k+1}=y, and let \pi_{\hat{y}} be the distribution of \hat{x} in Line 3 starting from \hat{y}, respectively. Since \pi_{\hat{y}} is (\mu+\frac{1}{\eta})-strongly logconcave, \pi_{\hat{y}} satisfies a log-Sobolev inequality with constant \mu+\frac{1}{\eta} (Theorem 2 of [OV00]). Hence,

W_{2}^{2}(\pi_{y},\pi_{\hat{y}}) \leq\frac{2}{\mu+\frac{1}{\eta}}d_{\textup{KL}}(\pi_{y}\|\pi_{\hat{y}})
\leq\frac{1}{\left(\mu+\frac{1}{\eta}\right)^{2}}\mathbb{E}_{\pi_{y}}\left[\left\lVert\nabla\log\frac{\pi_{y}}{\pi_{\hat{y}}}\right\rVert_{2}^{2}\right]
\leq\frac{1}{(1+\eta\mu)^{2}}\left\lVert y-\hat{y}\right\rVert^{2}_{2}.

The first step used the Talagrand transportation inequality (Theorem 1 of [OV00]). The second step used the log-Sobolev inequality. The third step used

\nabla\log\frac{\pi_{y}(x)}{\pi_{\hat{y}}(x)} =\nabla\log\frac{\exp(-g(x)-\frac{1}{2\eta}\left\lVert x-y\right\rVert_{2}^{2})\int\exp(-g(x^{\prime})-\frac{1}{2\eta}\left\lVert x^{\prime}-\hat{y}\right\rVert_{2}^{2})dx^{\prime}}{\exp(-g(x)-\frac{1}{2\eta}\left\lVert x-\hat{y}\right\rVert_{2}^{2})\int\exp(-g(x^{\prime})-\frac{1}{2\eta}\left\lVert x^{\prime}-y\right\rVert_{2}^{2})dx^{\prime}}
=\frac{1}{2\eta}\nabla\left(\left\lVert x-\hat{y}\right\rVert_{2}^{2}-\left\lVert x-y\right\rVert_{2}^{2}\right)=\frac{1}{\eta}(y-\hat{y}). (11)

Taking expectations over \Gamma_{y_{k+1}} and using (10) shows that

W_{2}^{2}(\pi_{k+1},\pi)\leq\frac{1}{(1+\eta\mu)^{2}}W^{2}_{2}(\pi_{k},\pi).

Algorithm 1 starts from the distribution $\pi_{0}=\delta_{x^{*}}$, where $x^{*}=\textup{argmin}_{x}g(x)$. Since $\pi$ is $\mu$-strongly logconcave, we have (see e.g. Proposition 1 of [DM19b])

\[
W_{2}^{2}(\pi_{0},\pi)=\mathbb{E}_{x\sim\pi}\left[\left\lVert x^{*}-x\right\rVert_{2}^{2}\right]\leq\frac{d}{\mu}.
\]

Then, for $\eta<\frac{1}{\mu}$ we have $\frac{1}{1+\eta\mu}\leq 1-\frac{\eta\mu}{2}$, so after $T'=O(\frac{1}{\eta\mu}\log\frac{d}{\mu\Delta})$ iterations, $W_{2}(\pi_{T'},\pi)\leq\Delta$. ∎
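To illustrate the contraction in Lemma 2 concretely, consider the special case $g(x)=\frac{\mu}{2}x^{2}$ in one dimension, where both steps of Algorithm 1 are Gaussian: the $y$-step draws $y\sim\mathcal{N}(x,\eta)$, and the restricted Gaussian oracle step draws $x\sim\mathcal{N}(\frac{y}{1+\eta\mu},\frac{\eta}{1+\eta\mu})$. The following sketch (our own toy check, not part of the paper) couples two chains by sharing their Gaussian noise, in the spirit of the coupling $\Gamma_{y_{k+1}}$, and verifies the per-iteration contraction factor $\frac{1}{1+\eta\mu}$.

```python
import math
import random

def coupled_step(x1, x2, eta, mu, xi, z):
    # y-step: y = x + sqrt(eta) * xi; shared noise gives |y1 - y2| = |x1 - x2|, as in (10)
    y1, y2 = x1 + math.sqrt(eta) * xi, x2 + math.sqrt(eta) * xi
    # RGO step for g(x) = mu * x^2 / 2: x' ~ N(y / (1 + eta*mu), eta / (1 + eta*mu))
    s = math.sqrt(eta / (1 + eta * mu))
    return y1 / (1 + eta * mu) + s * z, y2 / (1 + eta * mu) + s * z

random.seed(0)
eta, mu = 0.3, 2.0
x1, x2 = 5.0, -1.0
for _ in range(10):
    xi, z = random.gauss(0, 1), random.gauss(0, 1)
    n1, n2 = coupled_step(x1, x2, eta, mu, xi, z)
    # the coupled distance contracts by exactly 1 / (1 + eta * mu) per iteration
    assert abs(abs(n1 - n2) / abs(x1 - x2) - 1 / (1 + eta * mu)) < 1e-12
    x1, x2 = n1, n2
```

Under this synchronous coupling the contraction is exact every iteration; the lemma's Wasserstein bound is the analogous statement for general strongly logconcave $g$.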

Next, we bound the KL divergence between the output of Algorithm 1 and the target distribution $\pi$. We need the following standard lemma on KL divergences of marginal distributions.

Lemma 3.

Let $P_{z}$ and $Q_{z}$ be distributions supported on $\mathcal{X}$, indexed by a random variable $z$ distributed as $\pi_{z}$. Let $\widetilde{P}$ be the joint distribution of $(x,z)$ for $x\sim P_{z}$ and $z\sim\pi_{z}$, and $\widetilde{Q}$ be the joint distribution of $(x,z)$ for $x\sim Q_{z}$ and $z\sim\pi_{z}$. Let $P$ and $Q$ be the marginal distributions of $\widetilde{P}$ and $\widetilde{Q}$ on $x$, averaged over $z$. Then,

\[
d_{\textup{KL}}(P\|Q)\leq\mathbb{E}_{z\sim\pi_{z}}\left[d_{\textup{KL}}(P_{z}\|Q_{z})\right].
\]
Proof.

By the definition of $d_{\textup{KL}}$,

\begin{align*}
d_{\textup{KL}}(\widetilde{P}\|\widetilde{Q}) &= \mathbb{E}_{(x,z)\sim\widetilde{P}}\left[\log\frac{\widetilde{P}(x,z)}{\widetilde{Q}(x,z)}\right] \\
&= \mathbb{E}_{z\sim\pi_{z}}\left[\mathbb{E}_{x\sim P_{z}}\left[\log\frac{\widetilde{P}(x,z)}{\widetilde{Q}(x,z)}\right]\right] \\
&= \mathbb{E}_{z\sim\pi_{z}}\left[\mathbb{E}_{x\sim P_{z}}\left[\log\frac{P_{z}(x)}{Q_{z}(x)}\right]\right] \\
&= \mathbb{E}_{z\sim\pi_{z}}\left[d_{\textup{KL}}(P_{z}\|Q_{z})\right].
\end{align*}

Finally, by the data processing inequality,

\[
d_{\textup{KL}}(P\|Q)\leq d_{\textup{KL}}(\widetilde{P}\|\widetilde{Q})=\mathbb{E}_{z\sim\pi_{z}}\left[d_{\textup{KL}}(P_{z}\|Q_{z})\right].
\]
∎
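Lemma 3 is easy to verify numerically on discrete distributions, where every quantity is a finite sum. The toy check below (our own illustration; the distributions are arbitrary) computes both sides of the inequality for a two-valued $z$ and a three-point space $\mathcal{X}$.

```python
import math

def kl(p, q):
    # discrete KL divergence sum_i p_i log(p_i / q_i)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# z takes two values with probabilities pi_z; P_z, Q_z are conditionals on a
# 3-point space (toy numbers, chosen only for illustration)
pi_z = [0.4, 0.6]
P = {0: [0.7, 0.2, 0.1], 1: [0.1, 0.3, 0.6]}
Q = {0: [0.5, 0.3, 0.2], 1: [0.2, 0.5, 0.3]}

# marginals of the joints, averaged over z
P_marg = [sum(pi_z[z] * P[z][i] for z in (0, 1)) for i in range(3)]
Q_marg = [sum(pi_z[z] * Q[z][i] for z in (0, 1)) for i in range(3)]

lhs = kl(P_marg, Q_marg)
rhs = sum(pi_z[z] * kl(P[z], Q[z]) for z in (0, 1))
assert lhs <= rhs + 1e-12  # Lemma 3: marginal KL <= expected conditional KL
```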

The following lemma shows that a 2-Wasserstein distance bound on the distribution at iteration $k$ implies a KL divergence bound at iteration $k+1$.

Lemma 4.

Let $\pi_{k}$ be the distribution of $x_{k}$ for some $k$ such that $W_{2}(\pi_{k},\pi)\leq\Delta$, and let $\pi$ be the $x$-marginal of $\hat{\pi}$. Then,

\[
d_{\textup{KL}}(\pi_{k+1}\|\pi)\leq\frac{\Delta^{2}}{2\eta}.
\]
Proof.

As in Lemma 2, let $\Gamma_{x_{k}}$ be the optimal coupling between $x_{k}\sim\pi_{k}$ and $\hat{x}\sim\pi$, which yields a coupling $\Gamma_{y_{k+1}}$ between $y_{k+1}$ and $\hat{y}$ such that

\[
\mathbb{E}_{\Gamma_{y_{k+1}}}\left[\left\lVert y_{k+1}-\hat{y}\right\rVert_{2}^{2}\right]=\mathbb{E}_{\Gamma_{x_{k}}}\left[\left\lVert x_{k}-\hat{x}\right\rVert_{2}^{2}\right]\leq\Delta^{2}. \tag{12}
\]

Then,

\begin{align*}
d_{\textup{KL}}(\pi_{k+1}\|\pi) &\leq \mathbb{E}_{(y_{k+1},\hat{y})\sim\Gamma_{y_{k+1}}}\left[d_{\textup{KL}}(\pi_{y_{k+1}}\|\pi_{\hat{y}})\right] \\
&\leq \frac{1}{2\eta^{2}\left(\mu+\frac{1}{\eta}\right)}\mathbb{E}_{(y_{k+1},\hat{y})\sim\Gamma_{y_{k+1}}}\left[\left\lVert y_{k+1}-\hat{y}\right\rVert_{2}^{2}\right]\leq\frac{\Delta^{2}}{2\eta}.
\end{align*}

The first inequality followed from Lemma 3 by taking $P=\pi_{k+1}$, $Q=\pi$, and $z=(y_{k+1},\hat{y})$. The second inequality used the log-Sobolev inequality together with the identity (11). The last inequality used (12). ∎

Finally, putting the pieces together, Theorem 1 follows from Lemma 2 and Lemma 4.

Proof of Theorem 1.

By Lemma 2 and Lemma 4, there is $T=O\left(\frac{1}{\eta\mu}\log\frac{d}{\eta\mu\epsilon}\right)$ such that $d_{\textup{KL}}(\pi_{T}\|\pi)\leq 2\epsilon^{2}$. By Pinsker's inequality,

\[
\left\lVert\pi_{T}-\pi\right\rVert_{\textup{TV}}\leq\sqrt{\frac{1}{2}d_{\textup{KL}}(\pi_{T}\|\pi)}\leq\epsilon.
\]
∎

We note that Theorem 1 is robust to a small amount of error tolerance in the sampler $\mathcal{O}$. Specifically, if $\mathcal{O}$ has tolerance $\frac{\epsilon}{2T}$, then by calling Theorem 1 with desired accuracy $\frac{\epsilon}{2}$ and adjusting constants appropriately, the cumulative error incurred by all calls to $\mathcal{O}$ is within the total requisite bound (formally, this can be shown via the coupling characterization of total variation distance). We defer a more formal treatment of this inexactness argument to Appendix A and the proof of Proposition 5.

4 Tighter runtimes for structured densities

In this section, we apply Theorem 1 to obtain simple analyses yielding state-of-the-art high-accuracy runtimes for the well-conditioned densities studied in [DCWY18, CDWY19, LST20], as well as for the composite and finite sum densities studied in this work. We assume the conclusions of Theorems 2 and 3 in deriving the results of Sections 4.2 and 4.3, respectively.

4.1 Well-conditioned logconcave sampling: proof of Corollary 1

In this section, let $\pi$ be a distribution on $\mathbb{R}^{d}$ with density proportional to $\exp(-f(x))$, where $f$ is $L$-smooth and $\mu$-strongly convex (with $\kappa=\frac{L}{\mu}$) and has a pre-computed minimizer $x^{*}$. We instantiate Theorem 1 with $f_{\textup{oracle}}(x)=f(x)$, and choose $\eta=\frac{1}{8Ld\log\kappa}$. We thus require an $\eta$-RGO $\mathcal{O}$ for $f_{\textup{oracle}}=f$ to use in Theorem 1.

Our implementation of $\mathcal{O}$ is a rejection sampling scheme. We use the following helpful guarantee.

Lemma 5 (Rejection sampling).

Let $\pi$, $\hat{\pi}$ be distributions on $\mathbb{R}^{d}$ with $\frac{d\pi}{dx}(x)\propto p(x)$ and $\frac{d\hat{\pi}}{dx}(x)\propto\hat{p}(x)$. Suppose for some $C\geq 1$ and all $x\in\mathbb{R}^{d}$, $\frac{p(x)}{\hat{p}(x)}\leq C$. The following procedure is termed "rejection sampling": repeat independent runs of the following two steps until a point is output.

  1. Draw $x\sim\hat{\pi}$.

  2. With probability $\frac{p(x)}{C\hat{p}(x)}$, output $x$.

Rejection sampling terminates in $\frac{C\int\hat{p}(x)dx}{\int p(x)dx}$ runs in expectation, and the output distribution is $\pi$.

Proof.

The second claim follows from Bayes' rule, which implies the conditional density of the output point is proportional to $\hat{p}(x)\cdot\frac{p(x)}{C\hat{p}(x)}\propto p(x)$, so the output distribution is $\pi$. For the first claim, the probability that any given run outputs a point is

\[
\int\frac{p(x)}{C\hat{p}(x)}\,d\hat{\pi}(x)=\frac{1}{C}\int\frac{p(x)}{\hat{p}(x)}\cdot\frac{\hat{p}(x)}{\int\hat{p}(x')dx'}\,dx=\frac{\int p(x)dx}{C\int\hat{p}(x)dx}.
\]

The conclusion follows by independence and linearity of expectation. ∎
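Lemma 5 can be checked empirically. In the sketch below (a hypothetical one-dimensional example of our own), we take $p(x)=\exp(-\frac{x^{2}}{2}-\frac{x^{4}}{10})$ with a standard Gaussian envelope $\hat{p}(x)=\exp(-\frac{x^{2}}{2})$, so that $\frac{p}{\hat{p}}\leq 1$ and we may take $C=1$; the empirical number of runs per accepted sample should match $\frac{C\int\hat{p}}{\int p}$.

```python
import math
import random

f = lambda x: 0.5 * x * x + 0.1 * x ** 4      # target negative log-density
p = lambda x: math.exp(-f(x))
p_hat = lambda x: math.exp(-0.5 * x * x)      # Gaussian envelope; p/p_hat <= 1, so C = 1

# expected runs per sample: C * int(p_hat) / int(p), via a Riemann sum on [-8, 8]
xs = [-8 + 0.001 * i for i in range(16001)]
Zp = sum(p(x) for x in xs) * 0.001
Zhat = sum(p_hat(x) for x in xs) * 0.001
expected_runs = Zhat / Zp

random.seed(1)
runs, samples = 0, 0
while samples < 20000:
    x = random.gauss(0, 1)                    # draw from the envelope
    runs += 1
    if random.random() <= p(x) / p_hat(x):    # accept with probability p / (C * p_hat)
        samples += 1
assert abs(runs / samples - expected_runs) < 0.05
```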

We further state a concentration bound, shown first in [LST20], on the gradient norm at a point drawn from a logsmooth distribution.

Proposition 2 (Logsmooth gradient concentration, Corollary 3.3, [LST20]).

Let $\pi$ be a distribution on $\mathbb{R}^{d}$ with $\frac{d\pi}{dx}(x)\propto\exp(-f(x))$, where $f$ is convex and $L$-smooth. With probability at least $1-\kappa^{-d}$ over $x\sim\pi$,

\[
\left\lVert\nabla f(x)\right\rVert_{2}\leq 3\sqrt{L}d\log\kappa. \tag{13}
\]

By the requirements of Theorem 1, the restricted Gaussian oracle $\mathcal{O}$ must only be able to draw samples from densities of the form, for some $y\in\mathbb{R}^{d}$,

\[
\exp\left(-f_{\textup{oracle}}(x)-\frac{1}{2\eta}\left\lVert x-y\right\rVert_{2}^{2}\right)=\exp\left(-f(x)-4Ld\log\kappa\left\lVert x-y\right\rVert_{2}^{2}\right). \tag{14}
\]

We use the following Algorithm 2 to implement $\mathcal{O}$.

Algorithm 2 $\texttt{XSample}(f,y,\eta)$

Input: $L$-smooth, $\mu$-strongly convex $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$, $y\in\mathbb{R}^{d}$, $\eta>0$

1: if $\left\lVert\nabla f(y)\right\rVert_{2}\leq 3\sqrt{L}d\log\kappa$ then
2:     while true do
3:         Draw $x\sim\mathcal{N}(y-\eta\nabla f(y),\eta\mathbf{I})$
4:         Draw $\tau\sim\textup{Unif}[0,1]$
5:         if $\tau\leq\exp\left(f(y)+\left\langle\nabla f(y),x-y\right\rangle-f(x)\right)$ then
6:             return $x$
7:         end if
8:     end while
9: end if
10: Use [CDWY19] to sample $x$ from (14) to total variation distance $\frac{\epsilon}{\Theta(\kappa d^{2}\log^{3}(\frac{\kappa d}{\epsilon}))}$ using $O(d\log\frac{\kappa d}{\epsilon})$ queries to $\nabla f$ (Theorem 1, [CDWY19], where (14) has constant condition number)
11: return $x$
Lemma 6.

Let $\eta=\frac{1}{8Ld\log\kappa}$, and suppose $y$ satisfies the bound in (13), i.e. $\left\lVert\nabla f(y)\right\rVert_{2}\leq 3\sqrt{L}d\log\kappa$. Then, whenever the condition of Line 1 is met, Line 3 of Algorithm 2 runs an expected at most 2 times, and Algorithm 2 samples exactly from (14).

Proof.

Note that when the condition of Line 1 is met, Algorithm 2 is an instantiation of rejection sampling (Lemma 5) with

\begin{align*}
p(x) &= \exp\left(-f(x)-\frac{1}{2\eta}\left\lVert x-y\right\rVert_{2}^{2}\right), \\
\hat{p}(x) &= \exp\left(-f(y)-\left\langle\nabla f(y),x-y\right\rangle-\frac{1}{2\eta}\left\lVert x-y\right\rVert_{2}^{2}\right).
\end{align*}

By convexity of $f$, we may take $C=1$. Next, by applying Fact 1 twice and using $L$-smoothness of $f$,

\begin{align*}
\int p(x)dx &\geq \int\exp\left(-f(y)-\left\langle\nabla f(y),x-y\right\rangle-\frac{1+\eta L}{2\eta}\left\lVert x-y\right\rVert_{2}^{2}\right)dx \\
&= \exp\left(-f(y)+\frac{\eta}{2(1+\eta L)}\left\lVert\nabla f(y)\right\rVert_{2}^{2}\right)\int\exp\left(-\frac{1+\eta L}{2\eta}\left\lVert x-y+\frac{\eta}{1+\eta L}\nabla f(y)\right\rVert_{2}^{2}\right)dx \\
&= \exp\left(-f(y)+\frac{\eta}{2(1+\eta L)}\left\lVert\nabla f(y)\right\rVert_{2}^{2}\right)\left(\frac{2\pi\eta}{1+\eta L}\right)^{\frac{d}{2}}, \\
\int\hat{p}(x)dx &= \exp\left(-f(y)+\frac{\eta}{2}\left\lVert\nabla f(y)\right\rVert_{2}^{2}\right)(2\pi\eta)^{\frac{d}{2}},
\end{align*}

which implies the desired bound (recalling Lemma 5 and the assumed bound on $\left\lVert\nabla f(y)\right\rVert_{2}$):

\begin{align*}
\frac{\int\hat{p}(x)dx}{\int p(x)dx} &\leq \exp\left(\left(\frac{\eta}{2}-\frac{\eta}{2(1+\eta L)}\right)\left\lVert\nabla f(y)\right\rVert_{2}^{2}\right)(1+\eta L)^{\frac{d}{2}} \\
&\leq 1.5\exp\left(\frac{\eta^{2}L}{2(1+\eta L)}\left\lVert\nabla f(y)\right\rVert_{2}^{2}\right)\leq 2.
\end{align*}
∎
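The bound in Lemma 6 can also be checked numerically in one dimension. The sketch below (a toy instance we constructed, not from the paper) uses $f(x)=x^{2}+0.4\cos(2x)$, which is $L$-smooth and $\mu$-strongly convex with $L=3.6$ and $\mu=0.4$, sets $\eta=\frac{1}{8Ld\log\kappa}$, and verifies both that $C=1$ is valid and that $\frac{\int\hat{p}}{\int p}\leq 2$ at a point $y$ satisfying (13).

```python
import math

L_, mu_, d_ = 3.6, 0.4, 1
kappa = L_ / mu_
eta = 1.0 / (8 * L_ * d_ * math.log(kappa))   # the paper's choice of eta

f = lambda x: x * x + 0.4 * math.cos(2 * x)   # f'' in [0.4, 3.6]
df = lambda x: 2 * x - 0.8 * math.sin(2 * x)

y = 0.5
assert abs(df(y)) <= 3 * math.sqrt(L_) * d_ * math.log(kappa)  # condition (13)

p = lambda x: math.exp(-f(x) - (x - y) ** 2 / (2 * eta))
p_hat = lambda x: math.exp(-f(y) - df(y) * (x - y) - (x - y) ** 2 / (2 * eta))

xs = [y - 2 + 0.0005 * i for i in range(8001)]
# convexity: f lies above its tangent at y, so p <= p_hat pointwise and C = 1
assert all(p(x) <= p_hat(x) + 1e-12 for x in xs)
Zp = sum(p(x) for x in xs) * 0.0005
Zhat = sum(p_hat(x) for x in xs) * 0.0005
assert Zhat / Zp <= 2   # expected number of rejection rounds is at most 2
```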

We are now equipped to prove our main result concerning well-conditioned densities.

See 1

Proof.

By applying Theorem 1 with the chosen $\eta$, and noting that the cumulative error due to all calls to Line 10 cannot amount to more than $\frac{\epsilon}{2}$ total variation error throughout the algorithm, it suffices to show that Algorithm 2 uses $O(1)$ gradient queries per iteration in expectation. By Lemma 6, this holds whenever the condition in Line 1 is met, so it remains to show that Line 10 is executed with probability $O((d\log\frac{\kappa d}{\epsilon})^{-1})$.

To show this, note that combining Proposition 2 with the warmness of the start $x_{0}$ in Algorithm 2, this event occurs with probability at most $\kappa^{-\frac{d}{2}}$ in the first iteration. (Formally, Line 2 of Algorithm 1 has $y_{1}\sim\mathcal{N}(x_{0},\eta\mathbf{I})$, but by smoothness $\left\lVert\nabla f(y_{1})\right\rVert_{2}\leq\left\lVert\nabla f(x_{0})\right\rVert_{2}+L\left\lVert x_{0}-y_{1}\right\rVert_{2}$, and $L\left\lVert x_{0}-y_{1}\right\rVert_{2}\leq\widetilde{O}(L\sqrt{\eta})$ with high probability, adding a negligible constant to the bound of Proposition 2.) Since warmness is monotonically decreasing throughout Algorithm 1 when using an exact oracle (a standard fact in the literature: each transition step of the chain is a convex combination of warm point masses, preserving warmness), and the total error accumulated due to Line 10 throughout the algorithm is $O((d\log\frac{\kappa d}{\epsilon})^{-1})$, we have the desired conclusion. ∎

We show a bound nearly matching Corollary 1 using only value access to $f$, and with a deterministic (rather than expected) iteration complexity, as Corollary 5 in Section 4.3.

4.2 Composite logconcave sampling: proof of Corollary 2

In this section, let $\pi$ be a distribution on $\mathbb{R}^{d}$ with density proportional to $\exp(-f(x)-g(x))$, where $f$ is $L$-smooth and $\mu$-strongly convex (with $\kappa=\frac{L}{\mu}$), and $g$ is convex and admits a restricted Gaussian oracle $\mathcal{O}$. Without loss of generality, we assume that $f$ and $g$ share a pre-computed minimizer $x^{*}$; if this is not the case, we can redefine $f(x)\leftarrow f(x)-\left\langle\nabla f(x^{*}),x\right\rangle$ and $g(x)\leftarrow g(x)+\left\langle\nabla f(x^{*}),x\right\rangle$; see Section 5.1 for this reduction.

We instantiate Theorem 1 with $f_{\textup{oracle}}=f+g$, which is a $\mu$-strongly convex function. Our main result of this section follows directly from Theorem 1, using Theorem 2 as the required oracle $\mathcal{O}$; we state this more precisely in the following.

See 2

Proof.

As discussed at the beginning of this section, assume without loss of generality that $f$ and $g$ are both minimized by $x^{*}$. We apply the algorithm of Theorem 1 with $\eta=\frac{1}{L}$ to the $\mu$-strongly convex function $f+g$, which requires one call to $\mathcal{O}$ to implement. Thus, the iteration count parameter in Theorem 1 is $T=O(\kappa\log\frac{\kappa d}{\epsilon})$.

Recall that we chose $\eta=\frac{1}{L}$. To bound the total complexity of this algorithm, it suffices to give an $\eta$-RGO $\mathcal{O}^{+}$ for sampling from distributions with densities of the form, for some $y\in\mathbb{R}^{d}$,

\[
\exp\left(-f(x)-g(x)-\frac{1}{2\eta}\left\lVert x-y\right\rVert_{2}^{2}\right)=\exp\left(-f(x)-g(x)-\frac{L}{2}\left\lVert x-y\right\rVert_{2}^{2}\right)
\]

to total variation distance $\frac{\epsilon}{\Theta(T)}$ (see the discussion at the end of Section 3). To this end, we apply Theorem 2 with well-conditioned component $f(x)+\frac{L}{2}\left\lVert x-y\right\rVert_{2}^{2}$, composite component $g(x)$, and the largest possible choice of $\eta$. Note that we indeed have access to a restricted Gaussian oracle for $g$ (namely, $\mathcal{O}$), and this choice of well-conditioned component is $2L$-smooth and $L$-strongly convex, so its condition number is a constant. Thus, Theorem 2 requires $O(d\log^{2}\frac{\kappa d}{\epsilon})$ calls to $\mathcal{O}$ and gradients of $f$ to implement the desired $\mathcal{O}^{+}$ on any query $y$ (where we note $\frac{\epsilon}{\Theta(T)}=\frac{1}{\textup{poly}(\kappa,d,\epsilon^{-1})}$). Combining these complexity bounds yields the desired conclusion. ∎

4.3 Sampling logconcave finite sums: proof of Corollary 3

In this section, let $\pi$ be a distribution on $\mathbb{R}^{d}$ with density proportional to $\exp(-F(x))$, where $F(x)=\frac{1}{n}\sum_{i\in[n]}f_{i}(x)$ is $\mu$-strongly convex and each $f_{i}$, $i\in[n]$, is $L$-smooth (with $\kappa=\frac{L}{\mu}$). We instantiate Theorem 1 with $f_{\textup{oracle}}(x)=F(x)$, using Theorem 3 as an $\eta$-RGO for some choice of $\eta$.

More precisely, Theorem 3 shows that given access to the minimizer $x^{*}$, only zeroth-order access to the summands of $F$ is necessary to obtain the stated iteration bound. However, to obtain the minimizer to high accuracy, variance-reduced stochastic gradient methods (e.g. [JZ13]) require $\Omega(n+\kappa)$ gradient queries, which amounts to $\Omega((n+\kappa)d)$ function evaluations. We state a convenient corollary of Theorem 3 which removes the requirement of access to $x^{*}$, via an optimization pre-processing step using the method of [JZ13] (see further discussion in Appendix A). This is useful to us in proving Corollary 3 because in the sampling tasks required by the RGO, the minimizer changes, and thus must be recomputed every time.

Corollary 4 (First-order logconcave finite sum sampling).

In the setting of Theorem 3, using [JZ13] to precompute the minimizer $x^{*}$ and running Algorithm 6 uses $O(n\log\frac{\kappa d}{\epsilon}+\kappa^{2}d\log^{4}\frac{n\kappa d}{\epsilon})$ first-order oracle queries to the summands $\{f_{i}\}_{i\in[n]}$ and obtains $\epsilon$ total variation distance to $\pi$.

We now apply the reduction framework developed in Section 2 to our Algorithm 6 to obtain an improved query complexity for sampling from logconcave finite sums.

See 3

Proof.

We apply Theorem 1 with the $\mu$-strongly convex $f_{\textup{oracle}}=F$, using Algorithm 6 as the required $\eta$-RGO $\mathcal{O}$ for sampling from distributions with densities of the form

\[
\exp\left(-F(x)-\frac{1}{2\eta}\left\lVert x-y\right\rVert_{2}^{2}\right)
\]

for some $y\in\mathbb{R}^{d}$, to total variation distance $\frac{\epsilon}{\Theta(T)}$ (see Section 3), for $T$ the iteration bound of Algorithm 1. We apply Theorem 3 to the function $\widetilde{F}(x)=F(x)+\frac{1}{2\eta}\left\lVert x-y\right\rVert_{2}^{2}$; we can express this in finite sum form by adding $\frac{1}{2\eta}\left\lVert x-y\right\rVert_{2}^{2}$ to every constituent function, whose effect on gradient oracles is an additive $\frac{1}{\eta}(x-y)$. Note that $\widetilde{F}$ has condition number $O(1+\eta L)$. For a given $\eta$, the overall complexity is

\[
\frac{\log\frac{\kappa d}{\epsilon}}{\eta\mu}\left(n\log\left(\frac{n\kappa d}{\epsilon}\right)+d\log^{4}\left(\frac{n\kappa d}{\epsilon}\right)+(\eta L)^{2}d\log^{4}\left(\frac{n\kappa d}{\epsilon}\right)\right).
\]

Here, the inner loop complexity uses Corollary 4 (which also finds the minimizer, used for warm starts), and the outer loop complexity is from Theorem 1. The result follows by optimizing over $\eta$, namely picking $\eta=\max(\frac{1}{L},\sqrt{\frac{n}{L^{2}d\log^{3}(n\kappa d/\epsilon)}})$, and using that Algorithm 1 always performs at least one iteration. ∎

Note that the only place Corollary 3 used gradient evaluations was in determining minimizers of subproblems, via the first step of Corollary 4. Consider now the case $n=1$. By running e.g. accelerated gradient descent for smooth and strongly convex functions [Nes83], we can obtain a minimizer in $\widetilde{O}(\sqrt{\kappa})$ iterations, each querying a gradient oracle, where $\kappa$ is the condition number. By smoothness, we can approximate every coordinate of the gradient to arbitrary precision using $2$ function evaluations, so this yields a $\widetilde{O}(\sqrt{\kappa}d)$ value oracle complexity.
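The two-evaluations-per-coordinate gradient approximation referenced above is the standard central finite difference; the sketch below (our own illustration, not code from the paper) recovers the gradient of a smooth, strongly convex quadratic to high precision using $2d$ value queries.

```python
def fd_gradient(f, x, h=1e-6):
    """Approximate the gradient of f at x by central differences:
    2 function evaluations per coordinate, 2d in total."""
    grad = []
    for i in range(len(x)):
        xp = list(x); xp[i] += h
        xm = list(x); xm[i] -= h
        grad.append((f(xp) - f(xm)) / (2 * h))
    return grad

# check against the exact gradient of a smooth, strongly convex quadratic
f = lambda x: 0.5 * (3 * x[0] ** 2 + x[1] ** 2) + x[0] * x[1]
x = [1.0, -2.0]
exact = [3 * x[0] + x[1], x[1] + x[0]]
approx = fd_gradient(f, x)
assert all(abs(a - e) < 1e-4 for a, e in zip(approx, exact))
```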

Finally, for every optimization subproblem in Corollary 3 where $\eta=(L\cdot\textup{polylog}\frac{\kappa d}{\epsilon})^{-1}$, the condition number is a constant, which amounts to a $\widetilde{O}(d)$ value oracle complexity for computing a minimizer. This is never the dominant term compared to Theorem 3, yielding the following conclusion.

Corollary 5.

In the setting of Corollary 1, Algorithm 1 using Algorithm 6 as a restricted Gaussian oracle uses $O(\kappa d\log^{2}\frac{\kappa d}{\epsilon})$ value queries and obtains $\epsilon$ total variation distance to $\pi$.

We note that the polylogarithmic factor is significantly improved compared to Corollary 3, by removing the random sampling steps in Algorithm 6. A precise complexity bound on the resulting Metropolized random walk, a zeroth-order algorithm mixing in $O(\kappa^{2}d\log\frac{\kappa d}{\epsilon})$ iterations for a logconcave distribution with condition number $\kappa$, is given as Theorem 2 of [CDWY19].

Finally, in the case $n\geq 1$, we also exhibit an improved query complexity via an entirely zeroth-order sampling algorithm which interpolates with Corollary 5 (up to logarithmic factors). By trading off the $\widetilde{O}(nd+\kappa d)$ zeroth-order complexity of minimizing a finite sum function [JZ13] against the $\widetilde{O}(\kappa^{2}d)$ zeroth-order complexity of sampling, we can run Theorem 1 with the optimal choice $\eta=\widetilde{O}(\frac{\sqrt{n}}{L})$. The overall zeroth-order complexity can be seen to be $\widetilde{O}(nd+\sqrt{n}\kappa d)$.

5 Composite logconcave sampling with a restricted Gaussian oracle

In this section, we provide our "base sampler" for composite logconcave densities as Algorithm 3, and give its guarantees by proving Theorem 2. Throughout, fix a distribution $\pi$ with density

\begin{align}
\frac{d\pi}{dx}(x)\propto\exp\left(-f(x)-g(x)\right),&\quad\text{where }f:\mathbb{R}^{d}\rightarrow\mathbb{R}\text{ is }L\text{-smooth and }\mu\text{-strongly convex,} \tag{15}\\
&\quad\text{and }g:\mathbb{R}^{d}\rightarrow\mathbb{R}\text{ admits a restricted Gaussian oracle }\mathcal{O}.\nonumber
\end{align}

We define $\kappa:=\frac{L}{\mu}$, and assume that we have precomputed $x^{*}:=\textup{argmin}_{x\in\mathbb{R}^{d}}\left\{f(x)+g(x)\right\}$. Our algorithm proceeds in stages following the outline in Section 1.3.1.

  1. Composite-Sample is reduced to Composite-Sample-Shared-Min, which takes as input a distribution with negative log-density $f+g$, where $f$ and $g$ share a minimizer; this reduction is given in Section 5.1, and the remainder of the section handles the shared-minimizer case.

  2. The algorithm Composite-Sample-Shared-Min is a rejection sampling scheme built on top of sampling from a joint distribution $\hat{\pi}$ on $(x,y)\in\mathbb{R}^{d}\times\mathbb{R}^{d}$ whose $x$-marginal approximates $\pi$. We give this reduction in Section 5.2.

  3. The bulk of our analysis is for Sample-Joint-Dist, an alternating marginal sampling algorithm for sampling from $\hat{\pi}$. To implement marginal sampling, it alternates calls to $\mathcal{O}$ and a rejection sampling subroutine YSample. We prove its correctness in Section 5.3.

We put these pieces together in Section 5.4 to prove Theorem 2. We remark that for simplicity, we give the algorithms corresponding to the largest value of the step size $\eta$ in the theorem statement; it is straightforward to modify the bounds to tolerate smaller values of $\eta$, which correspondingly increases the mixing time (in particular, the value of $K$ in Algorithm 5).

Algorithm 3 $\texttt{Composite-Sample}(\pi,x^{*},\epsilon)$

Input: Distribution $\pi$ of the form (15), $x^{*}$ minimizing the negative log-density of $\pi$, $\epsilon\in[0,1]$.
Output: Sample $x$ from a distribution $\pi'$ with $\left\lVert\pi'-\pi\right\rVert_{\textup{TV}}\leq\epsilon$.

1: $\tilde{f}(x)\leftarrow f(x)-\left\langle\nabla f(x^{*}),x\right\rangle$, $\tilde{g}(x)\leftarrow g(x)+\left\langle\nabla f(x^{*}),x\right\rangle$
2: return $\texttt{Composite-Sample-Shared-Min}(\pi,\tilde{f},\tilde{g},x^{*},\epsilon)$
Algorithm 4 $\texttt{Composite-Sample-Shared-Min}(\pi,f,g,x^{*},\epsilon)$

Input: Distribution $\pi$ of the form (15), where $f$ and $g$ are both minimized by $x^{*}$, $\epsilon\in[0,1]$.
Output: Sample $x$ from a distribution $\pi'$ with $\left\lVert\pi'-\pi\right\rVert_{\textup{TV}}\leq\epsilon$.

1: while true do
2:     Define the set
\[
\Omega:=\left\{x\mid\left\lVert x-x^{*}\right\rVert_{2}\leq 4\sqrt{\frac{d\log(288\kappa/\epsilon)}{\mu}}\right\} \tag{16}
\]
3:     $x\leftarrow\texttt{Sample-Joint-Dist}(f,g,x^{*},\mathcal{O},\frac{\epsilon}{18})$
4:     if $x\in\Omega$ then
5:         $\tau\sim\textup{Unif}[0,1]$
6:         $y\leftarrow\texttt{YSample}(f,x,\eta)$
7:         $\alpha\leftarrow\exp\left(f(y)-\left\langle\nabla f(x),y-x\right\rangle-\frac{L}{2}\left\lVert y-x\right\rVert_{2}^{2}+g(x)+\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)$
8:         $\hat{\theta}\leftarrow\exp\left(-f(x)-g(x)+\frac{\eta}{2(1+\eta L)}\left\lVert\nabla f(x)\right\rVert_{2}^{2}\right)(1+\eta L)^{\frac{d}{2}}\alpha$
9:         if $\tau\leq\frac{\hat{\theta}}{4}$ then
10:             return $x$
11:         end if
12:     end if
13: end while
Algorithm 5 $\texttt{Sample-Joint-Dist}(f,g,x^{*},\eta,\mathcal{O},\delta)$

Input: $f$, $g$ of the form (15), both minimized by $x^{*}$, $\delta\in[0,1]$, $\eta>0$, restricted Gaussian oracle $\mathcal{O}$ for $g$.
Output: Sample $x$ from a distribution $\hat{\pi}'$ with $\left\lVert\hat{\pi}'-\hat{\pi}\right\rVert_{\textup{TV}}\leq\delta$, where we overload $\hat{\pi}$ to mean the marginal of (17) on the $x$ variable.

1: $\eta\leftarrow\frac{1}{32L\kappa d\log(16\kappa/\delta)}$
2: Let $\hat{\pi}$ be the joint density with
\[
\frac{d\hat{\pi}}{d(x,y)}(x,y)\propto\exp\left(-f(y)-g(x)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right) \tag{17}
\]
3: Call $\mathcal{O}$ to sample $x_{0}\sim\pi_{\textup{start}}$, for
\[
\frac{d\pi_{\textup{start}}}{dx}(x)\propto\exp\left(-\frac{L+\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}-g(x)\right) \tag{18}
\]
4: $K\leftarrow\frac{2^{26}\cdot 100}{\eta\mu}\log\left(\frac{d\log(16\kappa)}{4\delta}\right)$ (see Remark 1)
5: for $k\in[K]$ do
6:     Call $\texttt{YSample}\left(f,x_{k-1},\eta,\frac{\delta}{2Kd\log(\frac{d\kappa}{\delta})}\right)$ to sample $y_{k}\sim\pi_{x_{k-1}}$ (Algorithm 7), for
\[
\frac{d\pi_{x}}{dy}(y)\propto\exp\left(-f(y)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right) \tag{19}
\]
7:     Call $\mathcal{O}$ to sample $x_{k}\sim\pi_{y_{k}}$, for
\[
\frac{d\pi_{y}}{dx}(x)\propto\exp\left(-g(x)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right) \tag{20}
\]
8: end for
9: return $x_{K}$

5.1 Reduction from Composite-Sample to Composite-Sample-Shared-Min

Correctness of Composite-Sample is via the following properties.

Proposition 3.

Let $\tilde{f}$ and $\tilde{g}$ be defined as in Composite-Sample.

  1. The density $\propto\exp(-f(x)-g(x))$ is the same as the density $\propto\exp(-\tilde{f}(x)-\tilde{g}(x))$.

  2. Assuming first-order (function and gradient evaluation) access to $f$ and restricted Gaussian oracle access to $g$, we can implement the same accesses to $\tilde{f}$, $\tilde{g}$ with constant overhead.

  3. $\tilde{f}$ and $\tilde{g}$ are both minimized by $x^{*}$.

Proof.

For $f$ and $g$ with properties as in (15), with $x^{*}$ minimizing $f+g$, define the functions

\[
\tilde{f}(x):=f(x)-\left\langle\nabla f(x^{*}),x\right\rangle,\qquad\tilde{g}(x):=g(x)+\left\langle\nabla f(x^{*}),x\right\rangle,
\]

and observe that $\tilde{f}+\tilde{g}=f+g$ everywhere. This proves the first claim. Further, implementation of a first-order oracle for $\tilde{f}$ and a restricted Gaussian oracle for $\tilde{g}$ is immediate given a first-order oracle for $f$ and a restricted Gaussian oracle for $g$, showing the second claim; any quadratic shifted by a linear term is the sum of a (shifted) quadratic and a constant. We now show $\tilde{f}$ and $\tilde{g}$ have the same minimizer. By strong convexity, $\tilde{f}$ has a unique minimizer; first-order optimality shows that

\[
\nabla\tilde{f}(x^{*})=\nabla f(x^{*})-\nabla f(x^{*})=0,
\]

so this unique minimizer is $x^{*}$. Moreover, optimality of $x^{*}$ for $f+g$ implies that for all $x\in\mathbb{R}^{d}$,

\[
\left\langle\partial g(x^{*})+\nabla f(x^{*}),x^{*}-x\right\rangle\leq 0.
\]

Here, $\partial g$ denotes a subgradient. This shows first-order optimality of $x^{*}$ for $\tilde{g}$ as well, so $x^{*}$ minimizes $\tilde{g}$. ∎
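As a quick numeric illustration of this reduction (a toy one-dimensional example of our own, not from the paper): for a smooth $f$ and a convex nonsmooth $g$, the shifted pair $\tilde{f},\tilde{g}$ sums to $f+g$ pointwise, and both shifted functions are minimized at the joint minimizer $x^{*}$.

```python
f = lambda x: (x - 1) ** 2          # smooth, strongly convex component
g = lambda x: 2 * abs(x)            # convex, nonsmooth component
grad_f = lambda x: 2 * (x - 1)

x_star = 0.0                        # argmin of f + g (0 is in the subdifferential there)
shift = grad_f(x_star)
f_t = lambda x: f(x) - shift * x    # f tilde
g_t = lambda x: g(x) + shift * x    # g tilde

xs = [-2 + 0.01 * i for i in range(401)]
# claim 1: the sum (hence the density) is unchanged
assert all(abs((f_t(x) + g_t(x)) - (f(x) + g(x))) < 1e-9 for x in xs)
# claim 3: x* minimizes both shifted functions
assert all(f_t(x) >= f_t(x_star) - 1e-9 for x in xs)
assert all(g_t(x) >= g_t(x_star) - 1e-9 for x in xs)
```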

5.2 Reduction from Composite-Sample-Shared-Min to Sample-Joint-Dist

Composite-Sample-Shared-Min is a rejection sampling scheme which accepts samples from the subroutine Sample-Joint-Dist in the high-probability region $\Omega$ defined in (16). We give a general analysis of approximate rejection sampling in Appendix B.1.1, and Appendix B.1.2 bounds relationships between the distributions $\pi$ and $\hat{\pi}$, defined in (15) and (17) respectively (i.e. relative densities and ratios of normalization constants). Combining these pieces proves the following main claim.

Proposition 4.

Let $\eta=\frac{1}{32L\kappa d\log(288\kappa/\epsilon)}$, and assume $\texttt{Sample-Joint-Dist}(f,g,x^{*},\mathcal{O},\delta)$ samples within $\delta$ total variation of the $x$-marginal of (17). Then Composite-Sample-Shared-Min outputs a sample within total variation distance $\epsilon$ of (15) in an expected $O(1)$ calls to Sample-Joint-Dist.

5.3 Implementing Sample-Joint-Dist

Sample-Joint-Dist alternates between sampling the marginals of the joint distribution $\hat{\pi}$, as seen from definitions (19) and (20). We showed in Lemma 1 that this marginal sampling attains the correct stationary distribution. We bound the conductance of the induced walk on the iterates $\{x_{k}\}$ by combining an isoperimetry bound with a total variation guarantee between transitions of nearby points in Appendix B.2.1. Finally, we give a simple rejection sampling scheme YSample, as Algorithm 7, implementing the step (19). Since the $y$-marginal of $\hat{\pi}$ is a bounded perturbation of a Gaussian (intuitively, $f$ is $L$-smooth and $\eta^{-1}\gg L$), we show that in a high-probability region, rejection sampling from the sum of a first-order approximation of $f$ and the Gaussian succeeds within $2$ expected iterations.

Remark 1.

For simplicity of presentation, we were conservative in bounding constants throughout; in practice, the constant in Line 4 is orders of magnitude too large (a constant $<10$ sufficed; see Section 4 of [STL20]). Several constants were inherited from prior analyses, which we do not rederive to avoid redundancy.

We now give a complete guarantee on the complexity of Sample-Joint-Dist.

Proposition 5.

Sample-Joint-Dist outputs a point with distribution within $\delta$ total variation distance of the $x$-marginal of $\hat{\pi}$. The expected number of gradient queries per iteration is constant.

5.4 Putting it all together: proof of Theorem 2

We show Theorem 2 follows from the guarantees of Propositions 3, 4, and 5. Formally, Theorem 2 is stated for an arbitrary value of $\eta$ upper bounded by the value in Line 1 of Algorithm 5; it is straightforward to see that all our proofs go through for any smaller value. By observing the value of $K$ in Sample-Joint-Dist, the number of total iterations in each call to Sample-Joint-Dist is $O\left(\frac{1}{\eta\mu}\log(\frac{\kappa d}{\delta})\right)=O\left(\kappa^{2}d\log^{2}\left(\frac{\kappa d}{\delta}\right)\right)$. Proposition 5 shows that every iteration requires an expected constant number of gradient queries and calls to $\mathcal{O}$, the restricted Gaussian oracle for $g$, and that the resulting distribution is within $\delta$ total variation of the desired marginal of $\hat{\pi}$. Next, Proposition 4 implies that the number of calls to Sample-Joint-Dist in a run of Composite-Sample-Shared-Min is bounded by a constant, that the choice of $\delta$ is $\Theta(\epsilon)$, and that the resulting point is within total variation distance $\epsilon$ of the original distribution $\pi$. Finally, Proposition 3 shows that sampling from a general distribution of the form (1) is reducible to one call of Composite-Sample-Shared-Min, and that the requisite oracles are implementable. ∎

6 Logconcave finite sums

In this section, we provide our “base sampler” for logconcave finite sums as Algorithm 6, and give its guarantees by proving Theorem 3. Throughout, fix distribution π\pi with density

dπdx(x)exp(F(x)), where F(x)=1ni[n]fi(x) is μ-strongly convex,\displaystyle\frac{d\pi}{dx}(x)\propto\exp(-F(x)),\text{ where }F(x)=\frac{1}{n}\sum_{i\in[n]}f_{i}(x)\text{ is }\mu\text{-strongly convex,}
and for all i[n],fi is L-smooth.\displaystyle\text{ and for all }i\in[n],\;f_{i}\text{ is }L\text{-smooth}.

We will define κ:=Lμ\kappa:=\frac{L}{\mu}, and assume that we have precomputed x:=argminxd{F(x)}x^{*}:=\textup{argmin}_{x\in\mathbb{R}^{d}}\{F(x)\}. We will also assume explicitly that fi(x)=0\nabla f_{i}(x^{*})=0 for all i[n]i\in[n] throughout this section (i.e. all fif_{i} are minimized at the same point); this is without loss of generality, by a similar argument as in Proposition 3.

Algorithm 6 FiniteSum-MRW(F,h,x0,p,K)\texttt{FiniteSum-MRW}(F,h,x_{0},p,K)

Input: F(x)=1ni[n]fi(x)F(x)=\frac{1}{n}\sum_{i\in[n]}f_{i}(x), step size h>0h>0, initial x0x_{0}, p[0,1]p\in[0,1], iteration count KK\in\mathbb{N}

1:for 0k<K0\leq k<K do
2:     Draw ξk𝒩(0,𝐈)\xi_{k}\sim\mathcal{N}(0,\mathbf{I})
3:     yk+1xk+2hξky_{k+1}\leftarrow x_{k}+\sqrt{2h}\xi_{k}
4:     Draw Sk[n]S_{k}\subseteq[n] by including each iSki\in S_{k} independently with probability pp
5:     For each i[n]i\in[n],
γk(i){1p(exp(1nfi(yk+1)+1nfi(xk))1)+1iSk1iSk\gamma_{k}^{(i)}\leftarrow\begin{cases}\frac{1}{p}\left(\sqrt{\exp\left(-\frac{1}{n}f_{i}(y_{k+1})+\frac{1}{n}f_{i}(x_{k})\right)}-1\right)+1&i\in S_{k}\\ 1&i\not\in S_{k}\end{cases}
6:     γki=1nγk(i)\gamma_{k}\leftarrow\prod_{i=1}^{n}\gamma_{k}^{(i)}, τUnif[0,1]\tau\sim\text{Unif}[0,1]
7:     if τ34γk\tau\leq\tfrac{3}{4}\gamma_{k} and |Sk|2pn|S_{k}|\leq 2pn then
8:         xk+1yk+1x_{k+1}\leftarrow y_{k+1}
9:     else
10:         xk+1xkx_{k+1}\leftarrow x_{k}
11:     end if
12:end for
13:return xKx_{K}.

Algorithm 6 is the zeroth-order Metropolized random walk of [DCWY18] with an efficient, but biased, filter step; the goal of our analysis is to show this bias does not incur significant error.
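For concreteness, the loop above can be transcribed directly into NumPy. The following is an illustrative sketch only: the callables in `fs` and all parameter choices are hypothetical, and no attempt is made at the careful constant choices required by the analysis.

```python
import numpy as np

def finite_sum_mrw(fs, x0, h, p, K, rng=None):
    """Sketch of Algorithm 6 (FiniteSum-MRW). `fs` is a list of callables
    f_i, so F(x) = (1/n) * sum_i f_i(x); all parameters are hypothetical."""
    rng = np.random.default_rng() if rng is None else rng
    n, x = len(fs), np.array(x0, dtype=float)
    for _ in range(K):
        xi = rng.standard_normal(x.shape)            # Line 2
        y = x + np.sqrt(2 * h) * xi                  # Line 3: Gaussian proposal
        S = np.flatnonzero(rng.random(n) < p)        # Line 4: random subsample
        # Line 5: multiplicative estimator of sqrt(exp(-F(y) + F(x))),
        # touching only the summands indexed by S
        gamma = 1.0
        for i in S:
            gamma *= (np.exp((-fs[i](y) + fs[i](x)) / (2 * n)) - 1) / p + 1
        # Lines 6-11: accept with probability (3/4) * gamma, always
        # rejecting when the subsample is unexpectedly large
        if rng.random() <= 0.75 * gamma and len(S) <= 2 * p * n:
            x = y
    return x
```

Note that each iteration evaluates only O(pn)O(pn) of the fif_{i}, rather than all nn, which is the point of the subsampled filter.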

6.1 Approximate Metropolis-Hastings

We first recall the following well-known fact underlying Metropolis-Hastings (MH) filters.

Proposition 6.

Consider a random walk on d\mathbb{R}^{d} with proposal distributions {𝒫x}xd\{\mathcal{P}_{x}\}_{x\in\mathbb{R}^{d}} and acceptance probabilities {α(x,x)}x,xd\{\alpha(x,x^{\prime})\}_{x,x^{\prime}\in\mathbb{R}^{d}} conducted as follows: at a current point xx,

  1. 1.

    Draw a point x𝒫xx^{\prime}\sim\mathcal{P}_{x}.

  2. 2.

    Move the random walk to xx^{\prime} with probability α(x,x)\alpha(x,x^{\prime}), else stay at xx.

Suppose 𝒫x(x)=𝒫x(x)\mathcal{P}_{x}(x^{\prime})=\mathcal{P}_{x^{\prime}}(x) for all pairs x,xdx,x^{\prime}\in\mathbb{R}^{d}, and further dπdx(x)α(x,x)=dπdx(x)α(x,x)\tfrac{d\pi}{dx}(x)\alpha(x,x^{\prime})=\tfrac{d\pi}{dx}(x^{\prime})\alpha(x^{\prime},x). Then, π\pi is a stationary distribution for the random walk.

Proof.

This follows because the walk satisfies detailed balance (reversibility) with respect to π\pi. ∎
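As a sanity check (ours, not from the paper), one can verify Proposition 6 numerically on a toy three-state chain, instantiating the acceptance probability with the standard Metropolis rule α(x,x)=min(1,π(x)/π(x))\alpha(x,x^{\prime})=\min(1,\pi(x^{\prime})/\pi(x)), which satisfies the stated symmetry condition:

```python
import numpy as np

# Toy check of Proposition 6 on a 3-state chain: a symmetric proposal plus
# the standard Metropolis rule alpha(x, x') = min(1, pi(x') / pi(x)), which
# satisfies pi(x) * alpha(x, x') = pi(x') * alpha(x', x).
pi = np.array([0.5, 0.3, 0.2])
prop = np.full((3, 3), 1.0 / 3.0)                    # symmetric proposal
alpha = np.minimum(1.0, pi[None, :] / pi[:, None])   # alpha[x, x']
P = prop * alpha
np.fill_diagonal(P, 0.0)
np.fill_diagonal(P, 1.0 - P.sum(axis=1))             # rejection mass stays put
assert np.allclose(pi @ P, pi)                       # pi is stationary
```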

We propose an algorithm that applies a variant of the Metropolis-Hastings filter to a Gaussian random walk. Specifically, we define the following algorithm, which we call Inefficient-MRW.

Definition 4 (Inefficient-MRW).

Consider the following random walk for some step size h>0h>0: for each iteration kk at a current point xkdx_{k}\in\mathbb{R}^{d},

  1. 1.

    Set yk+1xk+2hξy_{k+1}\leftarrow x_{k}+\sqrt{2h}\xi, where ξ𝒩(0,𝐈)\xi\sim\mathcal{N}(0,\mathbf{I}).

  2. 2.

    xk+1yk+1x_{k+1}\leftarrow y_{k+1} with probability α(xk,yk+1)\alpha(x_{k},y_{k+1}) (otherwise, xk+1xkx_{k+1}\leftarrow x_{k}), where

    α(x,y)={1exp(F(y))exp(F(x))>43,34exp(F(y))exp(F(x))34exp(F(y))exp(F(x))43,exp(F(y))exp(F(x))exp(F(y))exp(F(x))<34.\alpha(x,y)=\begin{cases}1&\sqrt{\frac{\exp(-F(y))}{\exp(-F(x))}}>\frac{4}{3},\\ \frac{3}{4}\sqrt{\frac{\exp(-F(y))}{\exp(-F(x))}}&\frac{3}{4}\leq\sqrt{\tfrac{\exp(-F(y))}{\exp(-F(x))}}\leq\frac{4}{3},\\ \frac{\exp(-F(y))}{\exp(-F(x))}&\sqrt{\frac{\exp(-F(y))}{\exp(-F(x))}}<\frac{3}{4}.\end{cases} (21)
Lemma 7.

Distribution π\pi with dπdx(x)exp(F(x))\tfrac{d\pi}{dx}(x)\propto\exp(-F(x)) is stationary for Inefficient-MRW.

Proof.

Without loss of generality, assume that π\pi has been normalized so that dπdx(x)=exp(F(x))\tfrac{d\pi}{dx}(x)=\exp(-F(x)). We apply Proposition 6, dropping subscripts in the following. It is clear that 𝒫x(y)=𝒫y(x)\mathcal{P}_{x}(y)=\mathcal{P}_{y}(x) for any x,yx,y, so it suffices to check the second condition. When 34exp(F(y))exp(F(x))43\tfrac{3}{4}\leq\sqrt{\tfrac{\exp(-F(y))}{\exp(-F(x))}}\leq\tfrac{4}{3}, this follows from

dπdx(x)α(x,y)=34exp(F(x)F(y))=dπdx(y)α(y,x).\frac{d\pi}{dx}(x)\alpha(x,y)=\frac{3}{4}\sqrt{\exp(-F(x)-F(y))}=\frac{d\pi}{dx}(y)\alpha(y,x).

The other case is similar (as it is a standard Metropolis-Hastings filter). ∎
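In code, the three-case filter (21) is a small piecewise function of the ratio r=exp(F(y))/exp(F(x))r=\sqrt{\exp(-F(y))/\exp(-F(x))}; the sketch below (our own illustration) also makes it easy to check the detailed-balance condition of Proposition 6 numerically:

```python
import numpy as np

def alpha_modified(F_x, F_y):
    """The acceptance probability (21), in terms of the ratio
    r = sqrt(exp(-F(y)) / exp(-F(x)))."""
    r = np.exp((F_x - F_y) / 2.0)
    if r > 4.0 / 3.0:
        return 1.0
    if r >= 3.0 / 4.0:
        return 0.75 * r   # the middle case, estimated by Algorithm 6
    return r * r          # the usual Metropolis ratio exp(-F(y) + F(x))
```

One can check that exp(F(x))α(x,y)\exp(-F(x))\alpha(x,y) is symmetric in its two arguments in every case, which is exactly the condition required by Proposition 6.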

In Algorithm 6, we implement an approximate version of the modified MH filter in Definition 4, where we always assume the pair xx, yy lies in the second case of (21). In Lemma 8, we show that if a certain boundedness condition holds, then Algorithm 6 approximates Inefficient-MRW well. We then show in Lemma 9 that the output distributions of Inefficient-MRW and Algorithm 6 have small total variation distance.

Lemma 8.

Suppose that in an iteration 0k<K0\leq k<K of Algorithm 6, the following three conditions hold for some parameters RxR_{x}, CξC_{\xi}, Cx0C_{x}\in\mathbb{R}_{\geq 0}:

  1. 1.

    xkx2Rx\left\lVert x_{k}-x^{*}\right\rVert_{2}\leq R_{x}.

  2. 2.

    ξk2Cξd\left\lVert\xi_{k}\right\rVert_{2}\leq C_{\xi}\sqrt{d}.

  3. 3.

    For all i[n]i\in[n], |fi(xk)ξk|Cxfi(xk)2|\nabla f_{i}(x_{k})^{\top}\xi_{k}|\leq C_{x}\left\lVert\nabla f_{i}(x_{k})\right\rVert_{2}.

Then, for any

h198Cx2L2Rx2+7LCξ2d,h\leq\frac{1}{98C_{x}^{2}L^{2}R_{x}^{2}+7LC_{\xi}^{2}d}, (22)

34exp(F(yk+1))exp(F(xk))43\tfrac{3}{4}\leq\sqrt{\tfrac{\exp(-F(y_{k+1}))}{\exp(-F(x_{k}))}}\leq\tfrac{4}{3}. Moreover, we have 𝔼[γk]=exp(F(yk+1))exp(F(xk))\mathbb{E}\left[\gamma_{k}\right]=\sqrt{\frac{\exp(-F(y_{k+1}))}{\exp(-F(x_{k}))}}, and when |Sk|2pn|S_{k}|\leq 2pn, γk43\gamma_{k}\leq\tfrac{4}{3}.

Proof.

We first show 𝔼[γk]=exp(F(yk+1))exp(F(xk))\mathbb{E}\left[\gamma_{k}\right]=\sqrt{\frac{\exp(-F(y_{k+1}))}{\exp(-F(x_{k}))}}. Since each iSki\in S_{k} is generated independently,

𝔼[γk]\displaystyle\mathbb{E}\left[\gamma_{k}\right] =i[n]𝔼[γk(i)]\displaystyle=\prod_{i\in[n]}\mathbb{E}\left[\gamma_{k}^{(i)}\right]
=i[n][(1p)+p(1p(exp(1nfi(yk+1)+1nfi(xk))1)+1)]\displaystyle=\prod_{i\in[n]}\left[(1-p)+p\left(\frac{1}{p}\left(\sqrt{\exp\left(-\frac{1}{n}f_{i}(y_{k+1})+\frac{1}{n}f_{i}(x_{k})\right)}-1\right)+1\right)\right]
=i[n]exp(1nfi(yk+1)+1nfi(xk))=exp(F(yk+1))exp(F(xk)).\displaystyle=\prod_{i\in[n]}\sqrt{\exp\left(-\frac{1}{n}f_{i}(y_{k+1})+\frac{1}{n}f_{i}(x_{k})\right)}=\sqrt{\frac{\exp(-F(y_{k+1}))}{\exp(-F(x_{k}))}}.

Next, for any i[n]i\in[n], we lower and upper bound fi(yk+1)+fi(xk)-f_{i}(y_{k+1})+f_{i}(x_{k}). First,

fi(yk+1)+fi(xk)\displaystyle-f_{i}(y_{k+1})+f_{i}(x_{k}) fi(xk)(xkyk+1)\displaystyle\leq\nabla f_{i}(x_{k})^{\top}\left(x_{k}-y_{k+1}\right)
2hCxfi(xk)22hCxLRx.\displaystyle\leq\sqrt{2h}C_{x}\left\|\nabla f_{i}(x_{k})\right\|_{2}\leq\sqrt{2h}C_{x}LR_{x}.

The first inequality follows from convexity of fif_{i}, the second from yk+1xk=2hξky_{k+1}-x_{k}=\sqrt{2h}\xi_{k} and our assumed bound, and the third from smoothness and fi(x)=0\nabla f_{i}(x^{*})=0. To show a lower bound,

fi(yk+1)fi(xk)\displaystyle f_{i}(y_{k+1})-f_{i}(x_{k}) fi(xk)(yk+1xk)+L2yk+1xk22\displaystyle\leq\nabla f_{i}(x_{k})^{\top}\left(y_{k+1}-x_{k}\right)+\frac{L}{2}\left\|y_{k+1}-x_{k}\right\|_{2}^{2}
2hCxLRx+hLCξ2d.\displaystyle\leq\sqrt{2h}C_{x}LR_{x}+hLC_{\xi}^{2}d.

The first inequality follows from smoothness. Repeating this argument for each i[n]i\in[n] and averaging,

2hCxLRxhLCξ2dF(yk+1)+F(xk)2hCxLRx.-\sqrt{2h}C_{x}LR_{x}-hLC_{\xi}^{2}d\leq-F(y_{k+1})+F(x_{k})\leq\sqrt{2h}C_{x}LR_{x}. (23)

Then, when h198Cx2L2Rx2+7LCξ2dh\leq\tfrac{1}{98C_{x}^{2}L^{2}R_{x}^{2}+7LC_{\xi}^{2}d},

34exp(F(yk+1))exp(F(xk))43, and for all i[n],fi(yk+1)+fi(xk)14.\frac{3}{4}\leq\sqrt{\frac{\exp(-F(y_{k+1}))}{\exp(-F(x_{k}))}}\leq\frac{4}{3},\text{ and for all }i\in[n],\;-f_{i}(y_{k+1})+f_{i}(x_{k})\leq\frac{1}{4}.

Thus, we can bound each γk(i)\gamma_{k}^{(i)}:

γk(i)\displaystyle\gamma_{k}^{(i)} 1p(exp(18n)1)+11+17pn.\displaystyle\leq\frac{1}{p}\left(\exp\left(\frac{1}{8n}\right)-1\right)+1\leq 1+\frac{1}{7pn}.

Finally, when |Sk|2pn|S_{k}|\leq 2pn, γk(1+17pn)2pn43\gamma_{k}\leq(1+\frac{1}{7pn})^{2pn}\leq\tfrac{4}{3} as desired. ∎
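The unbiasedness claim 𝔼[γk]=exp(F(yk+1))/exp(F(xk))\mathbb{E}[\gamma_{k}]=\sqrt{\exp(-F(y_{k+1}))/\exp(-F(x_{k}))} is easy to confirm by Monte Carlo. In the sketch below (illustrative, not from the paper), arbitrary small values deltas[i] stand in for the quantities (fi(yk+1)+fi(xk))/n(-f_{i}(y_{k+1})+f_{i}(x_{k}))/n:

```python
import numpy as np

# Monte Carlo check that gamma_k is unbiased for sqrt(exp(-F(y) + F(x))).
# deltas[i] plays the role of (-f_i(y_{k+1}) + f_i(x_k)) / n; the values
# are arbitrary and chosen small for illustration.
rng = np.random.default_rng(0)
n, p = 8, 0.5
deltas = rng.normal(scale=0.02, size=n)
target = np.exp(deltas.sum() / 2.0)   # sqrt(exp(-F(y) + F(x)))

trials = 200_000
included = rng.random((trials, n)) < p      # each i in S_k independently w.p. p
terms = np.where(included, (np.exp(deltas / 2.0) - 1.0) / p + 1.0, 1.0)
est = terms.prod(axis=1)                    # gamma_k for each trial
assert abs(est.mean() - target) < 5e-3
```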

Lemma 9.

Draw x0𝒩(x,1L𝐈)x_{0}\sim\mathcal{N}(x^{*},\tfrac{1}{L}\mathbf{I}). Let π^K\hat{\pi}_{K} be the output distribution of the algorithm of Definition 4 for KK steps starting from x0x_{0}, and let πK\pi_{K} be the output distribution of Algorithm 6 starting from x0x_{0}. For any δ[0,1]\delta\in[0,1], let p=5log12Kδnp=\tfrac{5\log\frac{12K}{\delta}}{n} in Algorithm 6. There exist

Cξ=O(1+logKδd),Cx=O(lognKδ), and Rx=O(dlogκKδμ),C_{\xi}=O\left(1+\sqrt{\frac{\log\frac{K}{\delta}}{d}}\right),\;C_{x}=O\left(\sqrt{\log\frac{nK}{\delta}}\right),\;\text{ and }R_{x}=O\left(\sqrt{\frac{d\log\frac{\kappa K}{\delta}}{\mu}}\right),

so that when h198Cx2L2Rx2+7LCξ2dh\leq\tfrac{1}{98C_{x}^{2}L^{2}R_{x}^{2}+7LC_{\xi}^{2}d}, we have πKπ^KTVδ\left\lVert\pi_{K}-\hat{\pi}_{K}\right\rVert_{\textup{TV}}\leq\delta.

Proof.

By the coupling definition of total variation, it suffices to upper bound the probability that the algorithms’ trajectories, sharing all randomness in proposing points yk+1y_{k+1}, differ. This can happen for two reasons: either we used an incorrect filtering step (i.e. the pair (xk,yk+1)(x_{k},y_{k+1}) did not lie in the second case of (21)), or we incorrectly rejected in Line 7 of Algorithm 6 because |Sk|>2pn|S_{k}|>2pn. We bound the probability of either happening over all iterations by δ\delta, yielding the conclusion.

Incorrect filtering.

Consider some iteration kk. Lemma 8 shows that as long as its three conditions hold in iteration kk, we are in the second case of (21), so it suffices to show all conditions hold. By Fact 2 and as ξk\xi_{k} is independent of all {fi(xk)}i[n]\{\nabla f_{i}(x_{k})\}_{i\in[n]}, with probability at least 1δ2K1-\tfrac{\delta}{2K}, both of the conditions ξk2Cξd\left\lVert\xi_{k}\right\rVert_{2}\leq C_{\xi}\sqrt{d} and |fi(xk)ξk|Cxfi(xk)2|\nabla f_{i}(x_{k})^{\top}\xi_{k}|\leq C_{x}\left\lVert\nabla f_{i}(x_{k})\right\rVert_{2} for all i[n]i\in[n] hold (recall that the distribution of vξv^{\top}\xi for ξ𝒩(0,𝐈)\xi\sim\mathcal{N}(0,\mathbf{I}) is the one-dimensional 𝒩(0,v22)\mathcal{N}(0,\left\lVert v\right\rVert_{2}^{2})), for some

Cξ=O(1+logKδd),Cx=O(lognKδ).C_{\xi}=O\left(1+\sqrt{\frac{\log\frac{K}{\delta}}{d}}\right),\;C_{x}=O\left(\sqrt{\log\frac{nK}{\delta}}\right).

Next, x0𝒩(x,1L𝐈)x_{0}\sim\mathcal{N}(x^{*},\tfrac{1}{L}\mathbf{I}) is drawn from a κd2\kappa^{\frac{d}{2}}-warm start for π\pi. By Fact 2, xx2Rx\left\lVert x-x^{*}\right\rVert_{2}\leq R_{x} holds for xx drawn from π\pi with probability at least 1δ4Kκd21-\tfrac{\delta}{4K}\cdot\kappa^{-\frac{d}{2}}, for some

Rx=O(dlogκKδμ).R_{x}=O\left(\sqrt{\frac{d\log\frac{\kappa K}{\delta}}{\mu}}\right).

Since warmness of the exact algorithm of Definition 4 is monotonic, as long as the trajectories have not differed up to iteration kk, xkx2Rx\left\lVert x_{k}-x^{*}\right\rVert_{2}\leq R_{x} also holds with probability 1δ4K\geq 1-\tfrac{\delta}{4K}. Inductively, the total variation error caused by incorrect filtering over KK steps is at most 3δ4\tfrac{3\delta}{4}.

Error due to large |Sk||S_{k}|.

Supposing all the conditions of Lemma 8 are satisfied in iteration kk, we show that with high probability, Inefficient-MRW and Algorithm 6 make the same accept or reject decision. By Lemma 8, Inefficient-MRW accepts (via the second case of (21)) with probability αk=34exp(F(yk+1))exp(F(xk))\alpha^{\prime}_{k}=\tfrac{3}{4}\sqrt{\tfrac{\exp(-F(y_{k+1}))}{\exp(-F(x_{k}))}}. On the other hand, Algorithm 6 accepts with probability

αk=34𝔼[γk|Sk|2pn]Pr[|Sk|2pn].\alpha_{k}=\frac{3}{4}\mathbb{E}\left[\gamma_{k}\mid|S_{k}|\leq 2pn\right]\cdot\Pr[|S_{k}|\leq 2pn].

The total variation between the output distributions is |αkαk||\alpha_{k}-\alpha^{\prime}_{k}|. Further, since by Lemma 8,

αk\displaystyle\alpha^{\prime}_{k} =34𝔼[γk]\displaystyle=\frac{3}{4}\mathbb{E}\left[\gamma_{k}\right]
=34(𝔼[γk|Sk|2pn]Pr[|Sk|2pn]+𝔼[γk|Sk|>2pn]Pr[|Sk|>2pn])\displaystyle=\frac{3}{4}\left(\mathbb{E}\left[\gamma_{k}\mid|S_{k}|\leq 2pn\right]\cdot\Pr[|S_{k}|\leq 2pn]+\mathbb{E}\left[\gamma_{k}\mid|S_{k}|>2pn\right]\cdot\Pr[|S_{k}|>2pn]\right)
=αk+34𝔼[γk|Sk|>2pn]Pr[|Sk|>2pn],\displaystyle=\alpha_{k}+\frac{3}{4}\mathbb{E}\left[\gamma_{k}\mid|S_{k}|>2pn\right]\cdot\Pr[|S_{k}|>2pn],

it suffices to upper bound this latter quantity. First, by Lemma 10, when p=5log12Kδnp=\tfrac{5\log\frac{12K}{\delta}}{n}, we have Pr[|Sk|>2pn]δ12K\Pr[|S_{k}|>2pn]\leq\tfrac{\delta}{12K}. Finally, since each iSki\in S_{k} is generated independently,

𝔼[γk|Sk|>2pn]\displaystyle\mathbb{E}\left[\gamma_{k}\mid|S_{k}|>2pn\right] maxS:|S|=2pn𝔼[i[n]γk(i)SSk]\displaystyle\leq\max_{S^{\prime}:|S^{\prime}|=2pn}\mathbb{E}\left[\prod_{i\in[n]}\gamma_{k}^{(i)}\mid S^{\prime}\subseteq S_{k}\right]
2𝔼[i[n]Sγk(i)]=2i[n]Sexp(1nfi(yk+1)+1nfi(xk))4.\displaystyle\leq 2\mathbb{E}\left[\prod_{i\in[n]\setminus S^{\prime}}\gamma_{k}^{(i)}\right]=2\sqrt{\prod_{i\in[n]\setminus S^{\prime}}\exp\left(-\frac{1}{n}f_{i}(y_{k+1})+\frac{1}{n}f_{i}(x_{k})\right)}\leq 4.

Here, we used Lemma 8 applied to the set SS^{\prime}, and the upper bound (23) we derived earlier. Combining these calculations shows that the total variation distance incurred in any iteration kk due to |Sk||S_{k}| being too large is at most δ4K\tfrac{\delta}{4K}, so the overall contribution over KK steps is at most δ4\tfrac{\delta}{4}. ∎

We used the following helper lemma in our analysis.

Lemma 10.

Let S[n]S\subseteq[n] be formed by independently including each i[n]i\in[n] with probability pp. Then,

Pr[|S|>2pn]exp(3pn14).\Pr\left[|S|>2pn\right]\leq\exp\left(-\frac{3pn}{14}\right).
Proof.

For i[n]i\in[n], let 𝟏iS\mathbf{1}_{i\in S} be the indicator random variable of the event iSi\in S, so 𝔼[𝟏iS]=p\mathbb{E}\left[\mathbf{1}_{i\in S}\right]=p and

Var[𝟏iSp]=p(1p)2+(1p)p22p.\textup{Var}\left[\mathbf{1}_{i\in S}-p\right]=p(1-p)^{2}+(1-p)p^{2}\leq 2p.

By Bernstein’s inequality,

Pr[i[n]𝟏iSnp+r]exp(12r22np+13r).\Pr\left[\sum_{i\in[n]}\mathbf{1}_{i\in S}\geq np+r\right]\leq\exp\left(-\frac{\frac{1}{2}r^{2}}{2np+\frac{1}{3}r}\right).

In particular, when r=pnr=pn, we have the desired conclusion. ∎
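The tail bound of Lemma 10 can likewise be sanity-checked empirically; the sketch below (ours, with arbitrary choices n=100n=100, p=0.1p=0.1) compares the empirical frequency of the event {|S|>2pn}\{|S|>2pn\} against exp(3pn/14)\exp(-3pn/14):

```python
import numpy as np

# Empirical check of Lemma 10: each index is included independently with
# probability p, and Pr[|S| > 2pn] should fall below exp(-3pn / 14).
rng = np.random.default_rng(0)
n, p = 100, 0.1
bound = np.exp(-3 * p * n / 14)                   # Bernstein-type bound
sizes = (rng.random((50_000, n)) < p).sum(axis=1)  # |S| across trials
empirical = (sizes > 2 * p * n).mean()
assert empirical <= bound
```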

6.2 Conductance analysis

We next bound the mixing time of Inefficient-MRW, using the following result from prior work. We remark (see Section 1.4) that in our application, the logβ\log\beta term is non-dominant.

Proposition 7 (Lemma 1, Lemma 2, [CDWY19]).

Let a random walk with a μ\mu-strongly logconcave stationary distribution π\pi on xdx\in\mathbb{R}^{d} have transition distributions {𝒯x}xd\{\mathcal{T}_{x}\}_{x\in\mathbb{R}^{d}}. For some ϵ[0,1]\epsilon\in[0,1], let convex set Ωd\Omega\subseteq\mathbb{R}^{d} have π(Ω)1ϵ22β2\pi(\Omega)\geq 1-\tfrac{\epsilon^{2}}{2\beta^{2}}. Let πstart\pi_{\textup{start}} be a β\beta-warm start for π\pi, and let the algorithm be initialized at x0πstartx_{0}\sim\pi_{\textup{start}}. Suppose for any x,xΩx,x^{\prime}\in\Omega with xx2Δ\left\lVert x-x^{\prime}\right\rVert_{2}\leq\Delta,

𝒯x𝒯xTV78.\left\lVert\mathcal{T}_{x}-\mathcal{T}_{x^{\prime}}\right\rVert_{\textup{TV}}\leq\frac{7}{8}. (24)

Then, the random walk mixes to total variation distance within ϵ\epsilon of π\pi in O(logβ+1Δ2μloglogβϵ)O(\log\beta+\tfrac{1}{\Delta^{2}\mu}\log\tfrac{\log\beta}{\epsilon}) iterations.

Consider an iteration of Inefficient-MRW from xk=xx_{k}=x. Let 𝒫x\mathcal{P}_{x} be the density of yk+1y_{k+1}, and let 𝒯x\mathcal{T}_{x} be the density of xk+1x_{k+1} after filtering. Define a convex set Ωd\Omega\subseteq\mathbb{R}^{d} parameterized by RΩ0R_{\Omega}\in\mathbb{R}_{\geq 0}:

Ω={xd:xx2RΩ}.\displaystyle\Omega=\{x\in\mathbb{R}^{d}:\left\lVert x-x^{*}\right\rVert_{2}\leq R_{\Omega}\}.

We show that for two close points x,xΩx,x^{\prime}\in\Omega, the total variation between 𝒯x\mathcal{T}_{x} and 𝒯x\mathcal{T}_{x^{\prime}} is small.

Lemma 11.

For some h=O(1L2RΩ2+Ld)h=O(\tfrac{1}{L^{2}R_{\Omega}^{2}+Ld}) and x,xΩx,x^{\prime}\in\Omega with xx218h\left\lVert x-x^{\prime}\right\rVert_{2}\leq\tfrac{1}{8}\sqrt{h}, 𝒯x𝒯xTV78\left\lVert\mathcal{T}_{x}-\mathcal{T}_{x^{\prime}}\right\rVert_{\textup{TV}}\leq\tfrac{7}{8}.

Proof.

By the triangle inequality of total variation distance,

𝒯x𝒯xTV𝒯x𝒫xTV+𝒫x𝒫xTV+𝒯x𝒫xTV.\left\lVert\mathcal{T}_{x}-\mathcal{T}_{x^{\prime}}\right\rVert_{\textup{TV}}\leq\left\lVert\mathcal{T}_{x}-\mathcal{P}_{x}\right\rVert_{\textup{TV}}+\left\lVert\mathcal{P}_{x}-\mathcal{P}_{x^{\prime}}\right\rVert_{\textup{TV}}+\left\lVert\mathcal{T}_{x^{\prime}}-\mathcal{P}_{x^{\prime}}\right\rVert_{\textup{TV}}.

First, by Pinsker’s inequality and the KL divergence between Gaussian distributions,

𝒫x𝒫xTV2KL(𝒫x||𝒫x)=xx22h.\left\|\mathcal{P}_{x}-\mathcal{P}_{x^{\prime}}\right\|_{\textup{TV}}\leq\sqrt{2\textup{KL}(\mathcal{P}_{x}||\mathcal{P}_{x^{\prime}})}=\frac{\left\|x-x^{\prime}\right\|_{2}}{\sqrt{2h}}.

When xx218h\left\lVert x-x^{\prime}\right\rVert_{2}\leq\tfrac{1}{8}\sqrt{h}, 𝒫x𝒫xTV18\left\lVert\mathcal{P}_{x}-\mathcal{P}_{x^{\prime}}\right\rVert_{\textup{TV}}\leq\tfrac{1}{8}. Next, we bound 𝒯x𝒫xTV\left\lVert\mathcal{T}_{x}-\mathcal{P}_{x}\right\rVert_{\textup{TV}}: by a standard calculation (e.g. Lemma D.1 of [LST20]), we have

𝒯x𝒫xTV\displaystyle\left\lVert\mathcal{T}_{x}-\mathcal{P}_{x}\right\rVert_{\textup{TV}} =134𝔼ξk+1[exp(F(yk+1))exp(F(xk))].\displaystyle=1-\frac{3}{4}\mathbb{E}_{\xi_{k+1}}\left[\sqrt{\frac{\exp\left(-F(y_{k+1})\right)}{\exp\left(-F(x_{k})\right)}}\right].

We show that 𝒯x𝒫xTV38\left\lVert\mathcal{T}_{x}-\mathcal{P}_{x}\right\rVert_{\textup{TV}}\leq\frac{3}{8}. It suffices to show that 𝔼ξk+1[exp(F(yk+1)+F(xk))]56.\mathbb{E}_{\xi_{k+1}}\left[\sqrt{\exp\left(-F(y_{k+1})+F(x_{k})\right)}\right]\geq\frac{5}{6}.

Since 1516exp(116)56\frac{15}{16}\sqrt{\exp\left(-\frac{1}{16}\right)}\geq\frac{5}{6}, it suffices to show that with probability at least 1516\tfrac{15}{16} over the randomness of ξk+1\xi_{k+1}, F(yk+1)+F(xk)116-F(y_{k+1})+F(x_{k})\geq-\frac{1}{16}. As ξk+1𝒩(0,𝕀d)\xi_{k+1}\sim\mathcal{N}(0,\mathbb{I}_{d}), by applying Fact 2 twice,

Pr[ξk+122>36d]exp(4)132,\displaystyle\Pr\left[\left\lVert\xi_{k+1}\right\rVert_{2}^{2}>36d\right]\leq\exp(-4)\leq\frac{1}{32}, (25)
Pr[|F(xk)ξk+1|236F(xk)22]132.\displaystyle\Pr\left[\left|\nabla F(x_{k})^{\top}\xi_{k+1}\right|^{2}\geq 36\left\lVert\nabla F(x_{k})\right\rVert_{2}^{2}\right]\leq\frac{1}{32}.

We upper bound the term F(yk+1)F(xk)F(y_{k+1})-F(x_{k}) by smoothness and Cauchy-Schwarz:

F(yk+1)F(xk)\displaystyle F(y_{k+1})-F(x_{k}) F(xk)(yk+1xk)+L2yk+1xk22\displaystyle\leq\nabla F(x_{k})^{\top}\left(y_{k+1}-x_{k}\right)+\frac{L}{2}\left\|y_{k+1}-x_{k}\right\|_{2}^{2}
2h|F(xk)ξk+1|+hLξk+122.\displaystyle\leq\sqrt{2h}\left|\nabla F(x_{k})^{\top}\xi_{k+1}\right|+hL\left\lVert\xi_{k+1}\right\rVert_{2}^{2}.

Then, since F(xk)2LRΩ\left\lVert\nabla F(x_{k})\right\rVert_{2}\leq LR_{\Omega} when xΩx\in\Omega, it is enough to choose h=O(1L2RΩ2+Ld)h=O(\tfrac{1}{L^{2}R_{\Omega}^{2}+Ld}) so that

F(yk+1)+F(xk)116,-F(y_{k+1})+F(x_{k})\geq-\frac{1}{16},

as long as the events of (25) hold, which occur with probability at least 1516\frac{15}{16}. Similarly, we can show that 𝒯x𝒫xTV38\left\lVert\mathcal{T}_{x^{\prime}}-\mathcal{P}_{x^{\prime}}\right\rVert_{\textup{TV}}\leq\tfrac{3}{8}. Combining the three bounds, we have the desired conclusion. ∎
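The first step of the proof, bounding 𝒫x𝒫xTV\left\lVert\mathcal{P}_{x}-\mathcal{P}_{x^{\prime}}\right\rVert_{\textup{TV}} via Pinsker's inequality, can be checked against the exact total variation between the two Gaussian proposals 𝒩(x,2h𝐈)\mathcal{N}(x,2h\mathbf{I}) and 𝒩(x,2h𝐈)\mathcal{N}(x^{\prime},2h\mathbf{I}), which depends only on the mean shift and has the closed form 2Φ(xx2/(22h))12\Phi(\left\lVert x-x^{\prime}\right\rVert_{2}/(2\sqrt{2h}))-1 (a closed form not needed in the proof; the sketch is ours):

```python
import math

# Compare the Pinsker bound ||P_x - P_x'||_TV <= ||x - x'||_2 / sqrt(2h)
# with the exact TV distance between N(x, 2h*I) and N(x', 2h*I), which
# depends only on d = ||x - x'||_2 and equals 2*Phi(d / (2*sigma)) - 1
# for sigma = sqrt(2h).
def normal_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

h = 0.01
sigma = math.sqrt(2 * h)
for d in [1e-3, 1e-2, 5e-2, math.sqrt(h) / 8]:
    exact_tv = 2 * normal_cdf(d / (2 * sigma)) - 1
    pinsker = d / math.sqrt(2 * h)
    assert exact_tv <= pinsker   # Pinsker dominates the exact TV
```

In particular, at the threshold xx2=18h\left\lVert x-x^{\prime}\right\rVert_{2}=\tfrac{1}{8}\sqrt{h} used in Lemma 11, the Pinsker bound is 182118\tfrac{1}{8\sqrt{2}}\leq\tfrac{1}{8}.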

We are now ready to prove Theorem 3.

Proof.

First, 𝒩(x,1L𝐈)\mathcal{N}(x^{*},\tfrac{1}{L}\mathbf{I}) yields a β=κd2\beta=\kappa^{\frac{d}{2}}-warm start for π\pi (see e.g. [DCWY18]). For this value of β\beta, by Fact 2 it suffices to choose

RΩ=Θ(dlogκϵμ)R_{\Omega}=\Theta\left(\sqrt{\frac{d\log\frac{\kappa}{\epsilon}}{\mu}}\right)

for π(Ω)1ϵ22β2\pi(\Omega)\geq 1-\tfrac{\epsilon^{2}}{2\beta^{2}}. Letting δ=ϵ2\delta=\tfrac{\epsilon}{2}, we will choose the step size hh and iteration count KK so that

1h=Θ(Lκdlog2nκdϵ),K=Θ(κ2dlog3nκdϵ)\displaystyle\frac{1}{h}=\Theta\left(L\kappa d\log^{2}\frac{n\kappa d}{\epsilon}\right),\;K=\Theta\left(\kappa^{2}d\log^{3}\frac{n\kappa d}{\epsilon}\right)

have constants compatible with Lemma 9. Note that this choice of hh is also sufficiently small to apply Lemma 11 for our choice of RΩR_{\Omega}. By applying Proposition 7 to the algorithm of Definition 4, and using the bound from Lemma 11, in KK iterations Inefficient-MRW will mix to total variation distance δ\delta to π\pi. Furthermore, applying Lemma 9, we conclude that Algorithm 6 has total variation distance at most 2δ=ϵ2\delta=\epsilon from π\pi.

It remains to bound the oracle complexity of Algorithm 6. Note that in every iteration, we never compute more than 4pn4pn values of {fi}i[n]\{f_{i}\}_{i\in[n]}, since we always reject if |Sk|>2pn|S_{k}|>2pn, and we only compute values for indices in SkS_{k}. For the value of pp in Lemma 9, this amounts to O(lognκdϵ)O(\log\tfrac{n\kappa d}{\epsilon}) value queries per iteration. ∎

Acknowledgments

YL and RS are supported by NSF awards CCF-1749609, CCF-1740551, DMS-1839116, and DMS-2023166, a Microsoft Research Faculty Fellowship, a Sloan Research Fellowship, and a Packard Fellowship. KT is supported by NSF CAREER Award CCF-1844855 and a PayPal research gift.

RS and KT would like to thank Sinho Chewi for his extremely generous help, in particular insightful conversations which led to our discovery of the gap in Section 3, as well as his suggested fix.

References

  • [AH16] Jacob D. Abernethy and Elad Hazan. Faster convex optimization: Simulated annealing with an efficient universal barrier. In Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 2520–2528, 2016.
  • [All17] Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. J. Mach. Learn. Res., 18:221:1–221:51, 2017.
  • [BCM+18] M Barkhagen, NH Chau, É Moulines, M Rásonyi, S Sabanis, and Y Zhang. On stochastic gradient langevin dynamics with dependent data streams in the logconcave case. arXiv preprint arXiv:1812.02709, 2018.
  • [BDMP17] Nicolas Brosse, Alain Durmus, Eric Moulines, and Marcelo Pereyra. Sampling from a log-concave distribution with compact support with proximal langevin monte carlo. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, Amsterdam, The Netherlands, 7-10 July 2017, pages 319–342, 2017.
  • [BEL18] Sébastien Bubeck, Ronen Eldan, and Joseph Lehec. Sampling from a log-concave distribution with projected langevin monte carlo. Discret. Comput. Geom., 59(4):757–783, 2018.
  • [Ber18] Espen Bernton. Langevin monte carlo and JKO splitting. In Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018, pages 1777–1798, 2018.
  • [BFFN19] Jack Baker, Paul Fearnhead, Emily B Fox, and Christopher Nemeth. Control variates for stochastic gradient mcmc. Statistics and Computing, 29(3):599–615, 2019.
  • [BFR+19] Joris Bierkens, Paul Fearnhead, Gareth Roberts, et al. The zig-zag process and super-efficient sampling for bayesian analysis of big data. The Annals of Statistics, 47(3):1288–1320, 2019.
  • [BT09] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sciences, 2(1):183–202, 2009.
  • [BV04] Dimitris Bertsimas and Santosh S. Vempala. Solving convex programs by random walks. J. ACM, 51(4):540–556, 2004.
  • [CCBJ18] Xiang Cheng, Niladri S. Chatterji, Peter L. Bartlett, and Michael I. Jordan. Underdamped langevin MCMC: A non-asymptotic analysis. In Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018, pages 300–323, 2018.
  • [CDWY19] Yuansi Chen, Raaz Dwivedi, Martin J. Wainwright, and Bin Yu. Fast mixing of metropolized hamiltonian monte carlo: Benefits of multi-step gradients. CoRR, abs/1905.12247, 2019.
  • [CFM+18] Niladri Chatterji, Nicolas Flammarion, Yian Ma, Peter Bartlett, and Michael Jordan. On the theory of variance reduction for stochastic gradient monte carlo. In International Conference on Machine Learning, pages 764–773, 2018.
  • [CV15] Benjamin Cousins and Santosh Vempala. Bypassing kls: Gaussian cooling and an o^*(n3) volume algorithm. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 539–548, 2015.
  • [CV18] Ben Cousins and Santosh S. Vempala. Gaussian cooling and O(n3)O^{*}(n^{3}) algorithms for volume and gaussian volume. SIAM J. Comput., 47(3):1237–1273, 2018.
  • [CV19] Zongchen Chen and Santosh S. Vempala. Optimal convergence rate of hamiltonian monte carlo for strongly logconcave distributions. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2019, September 20-22, 2019, Massachusetts Institute of Technology, Cambridge, MA, USA, pages 64:1–64:12, 2019.
  • [CWZ+17] Changyou Chen, Wenlin Wang, Yizhe Zhang, Qinliang Su, and Lawrence Carin. A convergence analysis for a class of practical variance-reduction stochastic gradient mcmc. arXiv preprint arXiv:1709.01180, 2017.
  • [Dal17] Arnak S Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 3(79):651–676, 2017.
  • [DBL14] Aaron Defazio, Francis R. Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 1646–1654, 2014.
  • [DCWY18] Raaz Dwivedi, Yuansi Chen, Martin J. Wainwright, and Bin Yu. Log-concave sampling: Metropolis-hastings algorithms are fast! In Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018, pages 793–797, 2018.
  • [DFK91] Martin E. Dyer, Alan M. Frieze, and Ravi Kannan. A random polynomial time algorithm for approximating the volume of convex bodies. J. ACM, 38(1):1–17, 1991.
  • [DK19] Arnak S Dalalyan and Avetik Karagulyan. User-friendly guarantees for the langevin monte carlo with inaccurate gradient. Stochastic Processes and their Applications, 129(12):5278–5311, 2019.
  • [DM19a] Alain Durmus and Éric Moulines. High-dimensional bayesian inference via the unadjusted langevin algorithm. Bernoulli, 25(4A):2854–2882, 2019.
  • [DM19b] Alain Durmus and Eric Moulines. High-dimensional bayesian inference via the unadjusted langevin algorithm. Bernoulli, 25(4A):2854–2882, 2019.
  • [DMM19] Alain Durmus, Szymon Majewski, and Blazej Miasojedow. Analysis of langevin monte carlo via convex optimization. J. Mach. Learn. Res., 20:73:1–73:46, 2019.
  • [DR18] Arnak S. Dalalyan and Lionel Riou-Durand. On sampling from a log-concave density using kinetic langevin diffusions. CoRR, abs/1807.09382, 2018.
  • [DRW+16] Kumar Avinava Dubey, Sashank J Reddi, Sinead A Williamson, Barnabas Poczos, Alexander J Smola, and Eric P Xing. Variance reduction in stochastic gradient langevin dynamics. In Advances in neural information processing systems, pages 1154–1162, 2016.
  • [DSM+16] Alain Durmus, Umut Simsekli, Eric Moulines, Roland Badeau, and Gaël Richard. Stochastic gradient richardson-romberg markov chain monte carlo. In Advances in Neural Information Processing Systems, pages 2047–2055, 2016.
  • [FGKS15] Roy Frostig, Rong Ge, Sham M. Kakade, and Aaron Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 2540–2548, 2015.
  • [GGZ18] Xuefeng Gao, Mert Gürbüzbalaban, and Lingjiong Zhu. Global convergence of stochastic gradient hamiltonian monte carlo for non-convex stochastic optimization: Non-asymptotic performance bounds and momentum-based acceleration. arXiv preprint arXiv:1809.04618, 2018.
  • [Gul92] Osman Guler. New proximal point algorithms for convex minimization. SIAM Journal on Optimization, 2(4):649–664, 1992.
  • [Har04] Gilles Hargé. A convex/log-concave correlation inequality for gaussian measure and an application to abstract wiener spaces. Probability theory and related fields, 130(3):415–440, 2004.
  • [HKRC18] Ya-Ping Hsieh, Ali Kavis, Paul Rolland, and Volkan Cevher. Mirrored langevin dynamics. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 2883–2892, 2018.
  • [JZ13] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 315–323, 2013.
  • [Kra40] Hendrik Anthony Kramers. Brownian motion in a field of force and the diffusion model of chemical reactions. Physica, 7(4):284–304, 1940.
  • [Lee18] Yin Tat Lee. Lecture 8: Stochastic methods and applications. Class notes, UW CSE 599: Interplay between Convex Optimization and Geometry, 2018.
  • [LK99] László Lovász and Ravi Kannan. Faster mixing via average conductance. In Proceedings of the Thirty-First Annual ACM Symposium on Theory of Computing, May 1-4, 1999, Atlanta, Georgia, USA, pages 282–287, 1999.
  • [LMH15] Hongzhou Lin, Julien Mairal, and Zaïd Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 3384–3392, 2015.
  • [LPW09] David Asher Levin, Yuval Peres, and Elizabeth Wilmer. Markov Chains and Mixing Times. American Mathematical Society, 2009.
  • [LST20] Yin Tat Lee, Ruoqi Shen, and Kevin Tian. Logsmooth gradient concentration and tighter runtimes for metropolized hamiltonian monte carlo. In Conference on Learning Theory, COLT 2020, 2020.
  • [LSV18] Yin Tat Lee, Zhao Song, and Santosh S. Vempala. Algorithmic theory of odes and sampling from well-conditioned logconcave densities. CoRR, abs/1812.06243, 2018.
  • [LV06a] László Lovász and Santosh Vempala. Simulated annealing in convex bodies and an o*(n4) volume algorithm. Journal of Computer and System Sciences, 72(2):392–417, 2006.
  • [LV06b] László Lovász and Santosh S. Vempala. Fast algorithms for logconcave functions: Sampling, rounding, integration and optimization. In 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2006), 21-24 October 2006, Berkeley, California, USA, Proceedings, pages 57–68, 2006.
  • [LV06c] László Lovász and Santosh S. Vempala. Hit-and-run from a corner. SIAM J. Comput., 35(4):985–1005, 2006.
  • [MFWB19] Wenlong Mou, Nicolas Flammarion, Martin J. Wainwright, and Peter L. Bartlett. An efficient sampling algorithm for non-smooth composite potentials. CoRR, abs/1910.00551, 2019.
  • [MMW+19] Wenlong Mou, Yi-An Ma, Martin J. Wainwright, Peter L. Bartlett, and Michael I. Jordan. High-order langevin diffusion yields an accelerated MCMC algorithm. CoRR, abs/1908.10859, 2019.
  • [NDH+17] Tigran Nagapetyan, Andrew B Duncan, Leonard Hasenclever, Sebastian J Vollmer, Lukasz Szpruch, and Konstantinos Zygalakis. The true cost of stochastic gradient langevin dynamics. arXiv preprint arXiv:1706.02692, 2017.
  • [Nea11] Radford M Neal. Mcmc using hamiltonian dynamics. Handbook of Markov chain Monte Carlo, 2(11):2, 2011.
  • [Nes83] Yurii Nesterov. A method for solving a convex programming problem with convergence rate O(1/k^2). Doklady AN SSSR, 269:543–547, 1983.
  • [NF19] Christopher Nemeth and Paul Fearnhead. Stochastic gradient Markov chain Monte Carlo. arXiv preprint arXiv:1907.06986, 2019.
  • [OV00] Felix Otto and Cédric Villani. Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality. Journal of Functional Analysis, 173(2):361–400, 2000.
  • [PB14] Neal Parikh and Stephen P. Boyd. Proximal algorithms. Found. Trends Optim., 1(3):127–239, 2014.
  • [Per16] Marcelo Pereyra. Proximal Markov chain Monte Carlo algorithms. Stat. Comput., 26(4):745–760, 2016.
  • [Roc76] R. Tyrrell Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877–898, 1976.
  • [SKR19] Adil Salim, Dmitry Kovalev, and Peter Richtárik. Stochastic proximal Langevin algorithm: Potential splitting and nonasymptotic rates. In Advances in Neural Information Processing Systems, pages 6653–6664, 2019.
  • [SL19] Ruoqi Shen and Yin Tat Lee. The randomized midpoint method for log-concave sampling. In Advances in Neural Information Processing Systems, pages 2100–2111, 2019.
  • [SRB17] Mark Schmidt, Nicolas Le Roux, and Francis R. Bach. Minimizing finite sums with the stochastic average gradient. Math. Program., 162(1-2):83–112, 2017.
  • [STL20] Ruoqi Shen, Kevin Tian, and Yin Tat Lee. Composite logconcave sampling with a restricted gaussian oracle. CoRR, abs/2006.05976, 2020.
  • [Tib96] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):267–288, 1996.
  • [Vem05] Santosh Vempala. Geometric random walks: A survey. MSRI Combinatorial and Computational Geometry, 52:573–612, 2005.
  • [Wib19] Andre Wibisono. Proximal Langevin algorithm: Rapid convergence under isoperimetry. CoRR, abs/1911.01469, 2019.
  • [WS16] Blake E. Woodworth and Nati Srebro. Tight complexity bounds for optimizing composite objectives. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 3639–3647, 2016.
  • [WT11] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.
  • [ZH05] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B (Methodological), 67(2):301–320, 2005.
  • [ZXG18] Difan Zou, Pan Xu, and Quanquan Gu. Subsampled stochastic variance-reduced gradient Langevin dynamics. In International Conference on Uncertainty in Artificial Intelligence, 2018.

Appendix A Discussion of inexactness tolerance

We briefly discuss the tolerance of our algorithms to approximation error in two places: the computation of minimizers, and the implementation of RGOs in the methods of Sections 3 and 5.

Inexact minimization.

For all function classes considered in this work, there exist efficient optimization methods converging to a minimizer with logarithmic dependence on the target accuracy.

Specifically, for negative log-densities with condition number $\kappa$, accelerated gradient descent [Nes83] converges at a rate $O(\sqrt{\kappa})$ with logarithmic dependence on initial error and target accuracy (we implicitly assumed in stating our runtimes that one can attain initial error polynomial in problem parameters for negative log-densities; otherwise, there is additional logarithmic overhead in the quality of the initial point to optimization procedures). For composite functions $f_{\textup{wc}}+f_{\textup{oracle}}$ where $f_{\textup{wc}}$ has condition number $\kappa$, the FISTA method of [BT09] converges at the same rate, with each iteration querying $\nabla f_{\textup{wc}}$ and a proximal oracle for $f_{\textup{oracle}}$ once; typically, access to a proximal oracle is a weaker assumption than access to a restricted Gaussian oracle, so this is not restrictive. Finally, for minimizing finite sums with condition number $\kappa$, the algorithm of [All17] obtains a convergence rate linearly dependent on $n+\sqrt{n\kappa}\leq n+\kappa$; alternatively, [JZ13] has a dependence on $n+\kappa$. In all our final runtimes, these optimization rates do not constitute the bottleneck for oracle complexities.
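To make the first claim concrete, the following is a minimal sketch (ours, not the paper's implementation) of constant-momentum accelerated gradient descent on a well-conditioned quadratic; the error contracts at rate roughly $1-1/\sqrt{\kappa}$ per step, so reaching accuracy $\epsilon$ costs $O(\sqrt{\kappa}\log(1/\epsilon))$ gradient queries.

```python
import numpy as np

def agd(grad, x0, L, mu, iters):
    """Constant-momentum accelerated gradient descent for an L-smooth,
    mu-strongly convex objective, given only a gradient oracle."""
    beta = (np.sqrt(L / mu) - 1) / (np.sqrt(L / mu) + 1)  # momentum weight
    x, x_prev = x0.copy(), x0.copy()
    for _ in range(iters):
        y = x + beta * (x - x_prev)     # extrapolation (momentum) step
        x, x_prev = y - grad(y) / L, x  # gradient step from the extrapolated point
    return x

# Toy example: ill-conditioned quadratic f(x) = (1/2) x^T diag(h) x with kappa = 1e3.
h = np.logspace(0, 3, 50)
grad = lambda x: h * x
x = agd(grad, np.ones(50), L=h.max(), mu=h.min(), iters=800)  # ||x|| is now tiny
```

The comparable vanilla gradient descent run would need on the order of $\kappa\log(1/\epsilon)$ queries on this instance.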

The only additional difficulty our algorithms may present is if the function requiring minimization, say of the form $f_{\textup{oracle}}(x)+\tfrac{1}{2\eta}\left\lVert x-y\right\rVert_{2}^{2}$ for some $y\in\mathbb{R}^{d}$, where we have computed the minimizer $x^{*}$ of $f_{\textup{oracle}}$, has $\left\lVert y-x^{*}\right\rVert_{2}^{2}$ very large (so the initial function error is large). However, in all our settings $y$ is drawn from a distribution with sub-Gaussian tails, so the probability that $\left\lVert y-x^{*}\right\rVert_{2}^{2}$ is large decays exponentially (whereas the complexity of first-order methods increases only logarithmically), negligibly affecting the expected oracle query complexity of our methods.

Finally, by solving the relevant optimization problems to high accuracy as a subroutine in each of our methods, and adjusting various distance bounds to the minimizer by constants (e.g. by expanding the radius in the definition of the sets $\Omega$ in Algorithm 4 and Section 6.2), we can accommodate tolerance to inexact minimization while only affecting the bounds throughout the paper by constants. The only other place that $x^{*}$ is used in our algorithms is in initializing warm starts; tolerance to inexactness in our warmness calculations follows essentially identically to Section 3.2.1 of [DCWY18].

Inexact oracle implementation.

Our algorithms based on restricted Gaussian oracle access are tolerant to total variation error inverse polynomial in problem parameters in the restricted Gaussian oracle for $g$. We discussed this at the end of Section 3, in the case of RGO use for our reduction framework. To see this in the case of the composite sampler in Section 5, we pessimistically handled the case where the sampler YSample for a quadratic restriction of $f$ resulted in total variation error in the proof of Proposition 5, assuming that the error was incurred in every iteration. By accounting for similar amounts of error in calls to $\mathcal{O}$ (on the order of $\tfrac{\epsilon}{T}$, where $T$ is the number of times an RGO was used), the bounds in our algorithm are only affected by constants.

Appendix B Deferred proofs from Section 5

B.1 Deferred proofs from Section 5.2

B.1.1 Approximate rejection sampling

We first define the rejection sampling framework we will use, and prove various properties.

Definition 5 (Approximate rejection sampling).

Let $\pi$ be a distribution, with $\tfrac{d\pi}{dx}(x)\propto p(x)$. Suppose the set $\Omega$ has $\pi(\Omega)=1-\epsilon^{\prime}$, and the distribution $\hat{\pi}$ with $\frac{d\hat{\pi}}{dx}(x)\propto\hat{p}(x)$ has, for some $C\geq 1$,

\frac{p(x)}{\hat{p}(x)}\leq C\text{ for all }x\in\Omega,\text{ and }\frac{\int\hat{p}(x)dx}{\int p(x)dx}\leq 1.

Suppose there is an algorithm $\mathcal{A}$ which draws samples from a distribution $\hat{\pi}^{\prime}$, such that $\left\lVert\hat{\pi}^{\prime}-\hat{\pi}\right\rVert_{\textup{TV}}\leq\delta$. We call the following scheme approximate rejection sampling: repeat independent runs of the following procedure until a point is outputted.

  1. Draw $x$ via $\mathcal{A}$ until $x\in\Omega$.

  2. With probability $\tfrac{p(x)}{C\hat{p}(x)}$, output $x$.
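To make the scheme concrete, here is a small self-contained sketch (our illustration, not code from the paper); `sample_A`, `p`, `p_hat`, `in_Omega`, and `C` play the roles of $\mathcal{A}$, $p$, $\hat{p}$, $\Omega$, and $C$ above, and the toy Gaussian instance at the bottom is hypothetical.

```python
import math
import random

def approximate_rejection_sampling(sample_A, p, p_hat, in_Omega, C):
    """Approximate rejection sampling in the sense of Definition 5: sample_A
    draws from a distribution close to pi-hat (density proportional to p_hat),
    and accepted points are approximately distributed as pi (proportional to p)."""
    while True:
        x = sample_A()
        if not in_Omega(x):  # step 1: redraw until x lands in Omega
            continue
        if random.random() <= p(x) / (C * p_hat(x)):  # step 2: accept w.p. p(x) / (C p_hat(x))
            return x

# Toy instance: target p ~ N(0, 1), proposal p_hat ~ N(0, 1/2), Omega = [-3, 3].
# Then p / p_hat = exp(x^2 / 2) <= C := exp(4.5) on Omega, and the integral of
# p_hat is smaller than that of p, as Definition 5 requires.
target = lambda x: math.exp(-0.5 * x * x)
proposal = lambda x: math.exp(-x * x)
draw = approximate_rejection_sampling(
    sample_A=lambda: random.gauss(0.0, math.sqrt(0.5)),
    p=target, p_hat=proposal,
    in_Omega=lambda x: abs(x) <= 3.0, C=math.exp(4.5),
)
```

On this instance the acceptance probability per draw is roughly $\sqrt{2}e^{-4.5}$, illustrating why Lemma 12's expected-calls bound degrades linearly in $C$.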

Lemma 12.

Consider an approximate rejection sampling scheme with relevant parameters defined as in Definition 5, with $2\delta\leq\tfrac{1-\epsilon^{\prime}}{C}$. The algorithm terminates in at most

\frac{1}{\frac{1-\epsilon^{\prime}}{C}-2\delta} (26)

calls to $\mathcal{A}$ in expectation, and outputs a point from a distribution $\pi^{\prime}$ with $\left\lVert\pi^{\prime}-\pi\right\rVert_{\textup{TV}}\leq\epsilon^{\prime}+\frac{2\delta C}{1-\epsilon^{\prime}}$.

Proof.

Define for notational simplicity the normalization constants $Z:=\int p(x)dx$ and $\hat{Z}:=\int\hat{p}(x)dx$. First, we bound the probability that any particular call to $\mathcal{A}$ returns in the scheme:

\int_{x\in\Omega}\frac{p(x)}{C\hat{p}(x)}d\hat{\pi}^{\prime}(x)\geq\int_{x\in\Omega}\frac{p(x)}{C\hat{p}(x)}d\hat{\pi}(x)-\left|\int_{x\in\Omega}\frac{p(x)}{C\hat{p}(x)}(d\hat{\pi}^{\prime}(x)-d\hat{\pi}(x))\right| (27)
=\int_{x\in\Omega}\frac{Z}{C\hat{Z}}d\pi(x)-\left|\int_{x\in\Omega}\frac{p(x)}{C\hat{p}(x)}(d\hat{\pi}^{\prime}(x)-d\hat{\pi}(x))\right|
\geq\frac{1-\epsilon^{\prime}}{C}-\int_{x\in\Omega}|d\hat{\pi}^{\prime}(x)-d\hat{\pi}(x)|\geq\frac{1-\epsilon^{\prime}}{C}-2\delta.

The second line followed from the definitions of $Z$ and $\hat{Z}$, and the third followed from the triangle inequality, the assumed lower bound on $Z/\hat{Z}$, and the total variation distance between $\hat{\pi}^{\prime}$ and $\hat{\pi}$. By linearity of expectation and independence, this proves the first claim.

Next, we claim the output distribution is close in total variation distance to the conditional distribution of $\pi$ restricted to $\Omega$. The derivation of (27) implies

\int_{x\in\Omega}\frac{p(x)}{C\hat{p}(x)}d\hat{\pi}(x)\geq\frac{1-\epsilon^{\prime}}{C},\;\left|\int_{x\in\Omega}\frac{p(x)}{C\hat{p}(x)}(d\hat{\pi}^{\prime}(x)-d\hat{\pi}(x))\right|\leq 2\delta, (28)
\implies 1-\frac{2\delta C}{1-\epsilon^{\prime}}\leq\frac{\int_{x\in\Omega}\frac{p(x)}{C\hat{p}(x)}d\hat{\pi}^{\prime}(x)}{\int_{x\in\Omega}\frac{p(x)}{C\hat{p}(x)}d\hat{\pi}(x)}\leq 1+\frac{2\delta C}{1-\epsilon^{\prime}}.

Thus, the total variation of the true output distribution from π\pi restricted to Ω\Omega is

12xΩ|dπ(x)1ϵp(x)Cp^(x)dπ^(x)xΩp(x)Cp^(x)𝑑π^(x)|\displaystyle\frac{1}{2}\int_{x\in\Omega}\left|\frac{d\pi(x)}{1-\epsilon^{\prime}}-\frac{\frac{p(x)}{C\hat{p}(x)}d\hat{\pi}^{\prime}(x)}{\int_{x\in\Omega}\frac{p(x)}{C\hat{p}(x)}d\hat{\pi}^{\prime}(x)}\right|
12xΩ|dπ(x)1ϵp(x)Cp^(x)dπ^(x)xΩp(x)Cp^(x)𝑑π^(x)|+12xΩ|p(x)Cp^(x)dπ^(x)xΩp(x)Cp^(x)𝑑π^(x)p(x)Cp^(x)dπ^(x)xΩp(x)Cp^(x)𝑑π^(x)|\displaystyle\leq\frac{1}{2}\int_{x\in\Omega}\left|\frac{d\pi(x)}{1-\epsilon^{\prime}}-\frac{\frac{p(x)}{C\hat{p}(x)}d\hat{\pi}^{\prime}(x)}{\int_{x\in\Omega}\frac{p(x)}{C\hat{p}(x)}d\hat{\pi}(x)}\right|+\frac{1}{2}\int_{x\in\Omega}\left|\frac{\frac{p(x)}{C\hat{p}(x)}d\hat{\pi}^{\prime}(x)}{\int_{x\in\Omega}\frac{p(x)}{C\hat{p}(x)}d\hat{\pi}(x)}-\frac{\frac{p(x)}{C\hat{p}(x)}d\hat{\pi}^{\prime}(x)}{\int_{x\in\Omega}\frac{p(x)}{C\hat{p}(x)}d\hat{\pi}^{\prime}(x)}\right|
12xΩ|dπ(x)1ϵp(x)Cp^(x)dπ^(x)xΩp(x)Cp^(x)𝑑π^(x)|+δC1ϵ=12xΩdπ(x)1ϵ|1dπ^dπ^(x)|+δC1ϵ.\displaystyle\leq\frac{1}{2}\int_{x\in\Omega}\left|\frac{d\pi(x)}{1-\epsilon^{\prime}}-\frac{\frac{p(x)}{C\hat{p}(x)}d\hat{\pi}^{\prime}(x)}{\int_{x\in\Omega}\frac{p(x)}{C\hat{p}(x)}d\hat{\pi}(x)}\right|+\frac{\delta C}{1-\epsilon^{\prime}}=\frac{1}{2}\int_{x\in\Omega}\frac{d\pi(x)}{1-\epsilon^{\prime}}\left|1-\frac{d\hat{\pi}^{\prime}}{d\hat{\pi}}(x)\right|+\frac{\delta C}{1-\epsilon^{\prime}}.

The first inequality was the triangle inequality, and we bounded the second term via (28). To obtain the final equality, we used

\int_{x\in\Omega}\frac{p(x)}{C\hat{p}(x)}d\hat{\pi}(x)=\int_{x\in\Omega}\frac{Z}{C\hat{Z}}d\pi(x)=\frac{(1-\epsilon^{\prime})Z}{C\hat{Z}}
\implies\frac{\frac{p(x)}{C\hat{p}(x)}d\hat{\pi}^{\prime}(x)}{\int_{x\in\Omega}\frac{p(x)}{C\hat{p}(x)}d\hat{\pi}(x)}=\frac{p(x)}{Z}\cdot\frac{\hat{Z}}{\hat{p}(x)}\cdot\frac{1}{1-\epsilon^{\prime}}\cdot d\hat{\pi}^{\prime}(x)=\frac{d\pi(x)}{1-\epsilon^{\prime}}\cdot\frac{d\hat{\pi}^{\prime}}{d\hat{\pi}}(x).

We now bound this final term. Observe that the given conditions imply that $\tfrac{d\pi}{d\hat{\pi}}(x)$ is bounded by $C$ everywhere in $\Omega$. Thus, expanding, we have

\frac{1}{2}\int_{x\in\Omega}\frac{d\pi(x)}{1-\epsilon^{\prime}}\left|1-\frac{d\hat{\pi}^{\prime}}{d\hat{\pi}}(x)\right|\leq\frac{C}{2(1-\epsilon^{\prime})}\int_{x\in\Omega}|d\hat{\pi}(x)-d\hat{\pi}^{\prime}(x)|\leq\frac{\delta C}{1-\epsilon^{\prime}}.

Finally, combining these guarantees, and the fact that restricting $\pi$ to $\Omega$ loses $\epsilon^{\prime}$ in total variation distance, yields the desired conclusion by the triangle inequality. ∎

Corollary 6.

Let $\hat{\theta}(x)$ be an unbiased estimator for $\tfrac{p(x)}{\hat{p}(x)}$, and suppose $\hat{\theta}(x)\leq C$ with probability $1$ for all $x\in\Omega$. Then, implementing the procedure of Definition 5 with acceptance probability $\tfrac{\hat{\theta}(x)}{C}$ has the same runtime bound and total variation guarantee as given by Lemma 12.

Proof.

It suffices to take expectations over the randomness of $\hat{\theta}$ everywhere in the proof of Lemma 12. ∎

B.1.2 Distribution ratio bounds

We next show two bounds relating the densities of the distributions $\pi$ and $\hat{\pi}$. We first define the normalization constants of (15), (17) for shorthand, and then tightly bound their ratio.

Definition 6 (Normalization constants).

We denote the normalization constants of $\pi$ and $\hat{\pi}$ by

Z_{\pi}:=\int_{x}\exp\left(-f(x)-g(x)\right)dx,
Z_{\hat{\pi}}:=\int_{x,y}\exp\left(-f(y)-g(x)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)dxdy.
Lemma 13 (Normalization constant bounds).

Let $Z_{\pi}$ and $Z_{\hat{\pi}}$ be as in Definition 6. Then,

\left(\frac{2\pi\eta}{1+\eta L}\right)^{\frac{d}{2}}\left(1+\frac{\eta L^{2}}{\mu}\right)^{-\frac{d}{2}}\leq\frac{Z_{\hat{\pi}}}{Z_{\pi}}\leq(2\pi\eta)^{\frac{d}{2}}.
Proof.

For each $x$, by convexity we have

\int_{y}\exp\left(-f(y)-g(x)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)dy (29)
\leq\exp\left(-g(x)-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)\int_{y}\exp\left(-f(x)-\left\langle\nabla f(x),y-x\right\rangle-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right)dy
=\exp\left(-f(x)-g(x)-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)\int_{y}\exp\left(\frac{\eta}{2}\left\lVert\nabla f(x)\right\rVert_{2}^{2}-\frac{1}{2\eta}\left\lVert y-x+\eta\nabla f(x)\right\rVert_{2}^{2}\right)dy
=(2\pi\eta)^{\frac{d}{2}}\exp\left(-f(x)-g(x)\right)\exp\left(\frac{\eta}{2}\left\lVert\nabla f(x)\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)
\leq(2\pi\eta)^{\frac{d}{2}}\exp\left(-f(x)-g(x)\right).

Integrating both sides over $x$ yields the upper bound on $\tfrac{Z_{\hat{\pi}}}{Z_{\pi}}$. Next, for the lower bound we have a similar derivation. For each $x$, by smoothness,

\int_{y}\exp\left(-f(y)-g(x)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)dy
\geq\exp\left(-f(x)-g(x)-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)\int_{y}\exp\left(\left\langle\nabla f(x),x-y\right\rangle-\frac{1+\eta L}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right)dy
=\exp\left(-f(x)-g(x)-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}+\frac{\eta}{2(1+\eta L)}\left\lVert\nabla f(x)\right\rVert_{2}^{2}\right)\left(\frac{2\pi\eta}{1+\eta L}\right)^{\frac{d}{2}}
\geq\exp\left(-f(x)-g(x)-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)\left(\frac{2\pi\eta}{1+\eta L}\right)^{\frac{d}{2}}.

Integrating both sides over $x$ yields

\frac{Z_{\hat{\pi}}}{Z_{\pi}}\geq\left(\frac{2\pi\eta}{1+\eta L}\right)^{\frac{d}{2}}\frac{\int_{x}\exp\left(-f(x)-g(x)-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)dx}{\int_{x}\exp\left(-f(x)-g(x)\right)dx}\geq\left(\frac{2\pi\eta}{1+\eta L}\right)^{\frac{d}{2}}\left(1+\frac{\eta L^{2}}{\mu}\right)^{-\frac{d}{2}}.

The last inequality followed from Proposition 11, where we used that $f+g$ is $\mu$-strongly convex. ∎
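Both directions of the argument above reduce to the following standard completing-the-square identity for Gaussian integrals (recorded here for convenience; it is used with $\sigma=\eta$ and $\sigma=\tfrac{\eta}{1+\eta L}$ respectively): for any $a,x\in\mathbb{R}^{d}$ and $\sigma>0$,

```latex
\int_{y}\exp\left(-\left\langle a,y-x\right\rangle-\frac{1}{2\sigma}\left\lVert y-x\right\rVert_{2}^{2}\right)dy
=\exp\left(\frac{\sigma}{2}\left\lVert a\right\rVert_{2}^{2}\right)\int_{y}\exp\left(-\frac{1}{2\sigma}\left\lVert y-x+\sigma a\right\rVert_{2}^{2}\right)dy
=(2\pi\sigma)^{\frac{d}{2}}\exp\left(\frac{\sigma}{2}\left\lVert a\right\rVert_{2}^{2}\right).
```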

Lemma 14 (Relative density bounds).

Let $\eta=\tfrac{1}{32L\kappa d\log(288\kappa/\epsilon)}$. For all $x\in\Omega$, as defined in (16), $\frac{d\pi}{d\hat{\pi}}(x)\leq 2$. Here, $\tfrac{d\hat{\pi}}{dx}(x)$ denotes the marginal density of $\hat{\pi}$. Moreover, for all $x\in\mathbb{R}^{d}$, $\frac{d\pi}{d\hat{\pi}}(x)\geq\tfrac{1}{2}$.

Proof.

We first show the upper bound. By Lemma 13,

\frac{d\pi}{d\hat{\pi}}(x)=\frac{\exp\left(-f(x)-g(x)\right)}{\int_{y}\exp\left(-f(y)-g(x)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)dy}\cdot\frac{Z_{\hat{\pi}}}{Z_{\pi}} (30)
\leq\frac{\exp\left(-f(x)-g(x)\right)}{\int_{y}\exp\left(-f(y)-g(x)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)dy}\cdot(2\pi\eta)^{\frac{d}{2}}.

We now bound the first term, for $x\in\Omega$. By smoothness, we have

\frac{\exp\left(-f(y)-g(x)\right)}{\exp\left(-f(x)-g(x)\right)}\geq\exp\left(\left\langle\nabla f(x),x-y\right\rangle-\frac{L}{2}\left\lVert y-x\right\rVert_{2}^{2}\right),

so applying this for each $y$,

\frac{\int_{y}\exp\left(-f(y)-g(x)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)dy}{\exp\left(-f(x)-g(x)\right)}
\geq\exp\left(-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)\int_{y}\exp\left(\left\langle\nabla f(x),x-y\right\rangle-\frac{1+\eta L}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right)dy
=\exp\left(-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}+\frac{\eta}{2(1+\eta L)}\left\lVert\nabla f(x)\right\rVert_{2}^{2}\right)\int_{y}\exp\left(-\frac{1+\eta L}{2\eta}\left\lVert x-y-\frac{\eta}{1+\eta L}\nabla f(x)\right\rVert_{2}^{2}\right)dy
\geq\exp\left(-\frac{\eta L^{2}}{2}\cdot\frac{16d\log(288\kappa/\epsilon)}{\mu}\right)\left(\frac{2\pi\eta}{1+\eta L}\right)^{\frac{d}{2}}\geq\frac{3}{4}\left(\frac{2\pi\eta}{1+\eta L}\right)^{\frac{d}{2}}.

In the last line, we used that $x\in\Omega$ implies $\left\lVert x-x^{*}\right\rVert_{2}^{2}\leq\tfrac{16d\log(288\kappa/\epsilon)}{\mu}$, and the definition of $\eta$. Combining this bound with (30), we have the desired

\frac{d\pi}{d\hat{\pi}}(x)\leq\frac{4}{3}\left(1+\eta L\right)^{\frac{d}{2}}\leq 2.

Next, we consider the lower bound. By combining (29) with Lemma 13, we have the desired

\frac{d\pi}{d\hat{\pi}}(x)=\frac{\exp\left(-f(x)-g(x)\right)}{\int_{y}\exp\left(-f(y)-g(x)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)dy}\cdot\frac{Z_{\hat{\pi}}}{Z_{\pi}}
\geq(2\pi\eta)^{-\frac{d}{2}}\cdot\left(\frac{2\pi\eta}{1+\eta L}\right)^{\frac{d}{2}}\left(1+\frac{\eta L^{2}}{\mu}\right)^{-\frac{d}{2}}=\left(\frac{1}{1+\eta L}\right)^{\frac{d}{2}}\left(1+\eta L\kappa\right)^{-\frac{d}{2}}\geq\frac{1}{2}. ∎

B.1.3 Correctness of Composite-Sample-Shared-Min

See Proposition 4.

Proof.

We remark that $\eta=\tfrac{1}{32L\kappa d\log(288\kappa/\epsilon)}$ is precisely the choice of $\eta$ in Sample-Joint-Dist where $\delta=\epsilon/18$, as in Composite-Sample-Shared-Min. First, we may apply Fact 2 to conclude that the measure of the set $\Omega$ with respect to the $\mu$-strongly logconcave density $\pi$ is at least $1-\epsilon/3$. The conclusion of correctness will follow from an appeal to Corollary 6, with parameters

C=4,\;\epsilon^{\prime}=\frac{\epsilon}{3},\;\delta=\frac{\epsilon}{18}.

Note that indeed $\epsilon^{\prime}+\tfrac{2\delta C}{1-\epsilon^{\prime}}$ is bounded by $\epsilon$, as $1-\epsilon^{\prime}\geq\tfrac{2}{3}$. Moreover, the expected number of calls (26) is clearly bounded by a constant as well.

We now show that these parameters satisfy the requirements of Corollary 6. Define the functions

p(x):=\exp(-f(x)-g(x)),
\hat{p}(x):=(2\pi\eta)^{-\frac{d}{2}}\int_{y}\exp\left(-f(y)-g(x)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)dy,

and observe that clearly the densities of $\pi$ and $\hat{\pi}$ are respectively proportional to $p$ and $\hat{p}$. Moreover, define $Z=\int p(x)dx$ and $\hat{Z}=\int\hat{p}(x)dx$. By comparing these definitions with Lemma 13, we have $Z=Z_{\pi}$ and $\hat{Z}=(2\pi\eta)^{-\frac{d}{2}}Z_{\hat{\pi}}$, so by the upper bound in Lemma 13, $\hat{Z}/Z\leq 1$. Next, we claim that the following procedure produces an unbiased estimator for $\tfrac{p(x)}{\hat{p}(x)}$.

  1. Sample $y\sim\pi_{x}$, where $\tfrac{d\pi_{x}(y)}{dy}\propto\exp\left(-f(y)-\tfrac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right)$.

  2. $\alpha\leftarrow\exp\left(f(y)-\left\langle\nabla f(x),y-x\right\rangle-\tfrac{L}{2}\left\lVert y-x\right\rVert_{2}^{2}+g(x)+\tfrac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)$.

  3. Output $\hat{\theta}(x)\leftarrow\exp\left(-f(x)-g(x)-\tfrac{\eta}{2(1+\eta L)}\left\lVert\nabla f(x)\right\rVert_{2}^{2}\right)(1+\eta L)^{\frac{d}{2}}\alpha$.

To prove correctness of this estimator $\hat{\theta}$, define for simplicity

Z_{x}:=\int_{y}\exp\left(-f(y)-g(x)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)dy.

We compute, using $\tfrac{d\pi_{x}(y)}{dy}=\tfrac{\exp(-f(y)-g(x)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2})}{Z_{x}}$ (which agrees with the density in step 1, as the additional terms are constant in $y$), that

\mathbb{E}_{\pi_{x}}\left[\alpha\right]=\int_{y}\exp\left(f(y)-\left\langle\nabla f(x),y-x\right\rangle-\frac{L}{2}\left\lVert y-x\right\rVert_{2}^{2}+g(x)+\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)d\pi_{x}(y)
=\frac{1}{Z_{x}}\int_{y}\exp\left(-\left\langle\nabla f(x),y-x\right\rangle-\frac{L}{2}\left\lVert y-x\right\rVert_{2}^{2}-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right)dy
=\frac{1}{Z_{x}}\exp\left(\frac{\eta}{2(1+\eta L)}\left\lVert\nabla f(x)\right\rVert_{2}^{2}\right)\left(\frac{2\pi\eta}{1+\eta L}\right)^{\frac{d}{2}},

where the last equality completed the square. This implies that the output quantity

\hat{\theta}(x)=\exp\left(-f(x)-g(x)-\frac{\eta}{2(1+\eta L)}\left\lVert\nabla f(x)\right\rVert_{2}^{2}\right)(1+\eta L)^{\frac{d}{2}}\alpha

is unbiased for $\tfrac{p(x)}{\hat{p}(x)}=\exp(-f(x)-g(x))Z_{x}^{-1}(2\pi\eta)^{\frac{d}{2}}$. Finally, note that for any $y$ used in the definition of $\hat{\theta}(x)$, we have $\alpha\leq\exp\left(f(x)+g(x)+\tfrac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)$, by using $f(y)-f(x)-\left\langle\nabla f(x),y-x\right\rangle-\tfrac{L}{2}\left\lVert y-x\right\rVert_{2}^{2}\leq 0$ via smoothness. Therefore,

\hat{\theta}(x)\leq(1+\eta L)^{\frac{d}{2}}\exp\left(\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)\leq 4.

Here, we used the definition of $\eta$ and $L^{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\leq 16L\kappa d\log(288\kappa/\epsilon)$ by the definition of $\Omega$. ∎
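As an illustration, one draw of this estimator can be sketched as follows (our code, not from the paper; `sample_pi_x` stands for an exact sampler for the density of step 1, and the oracles `f`, `grad_f`, `g` and the point `x_star` are assumed given). The sign of the $\tfrac{\eta}{2(1+\eta L)}\left\lVert\nabla f(x)\right\rVert_{2}^{2}$ correction is chosen so the estimator is exactly unbiased; when $f$ is a quadratic with curvature exactly $L$ and $g\equiv 0$, the smoothness inequality is tight and the estimator becomes deterministic, equal to $\tfrac{p(x)}{\hat{p}(x)}$.

```python
import math

import numpy as np

def theta_hat(x, f, grad_f, g, x_star, eta, L, d, sample_pi_x):
    """One draw of an unbiased estimator for p(x) / p_hat(x)."""
    # Step 1: y ~ pi_x, with density proportional to exp(-f(y) - ||y - x||^2 / (2 eta)).
    y = sample_pi_x(x)
    gfx = grad_f(x)
    # Step 2: alpha from the drawn y.
    alpha = math.exp(
        f(y) - float(gfx @ (y - x)) - 0.5 * L * float((y - x) @ (y - x))
        + g(x) + 0.5 * eta * L**2 * float((x - x_star) @ (x - x_star))
    )
    # Step 3: rescale so that E[theta_hat] = p(x) / p_hat(x).
    return (
        math.exp(-f(x) - g(x) - eta / (2 * (1 + eta * L)) * float(gfx @ gfx))
        * (1 + eta * L) ** (d / 2) * alpha
    )
```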

B.2 Deferred proofs from Section 5.3

Throughout this section, for an error tolerance $\delta\in[0,1]$ which parameterizes Sample-Joint-Dist, we denote for shorthand a high-probability region $\Omega_{\delta}$ and its radius $R_{\delta}$ by

\Omega_{\delta}:=\left\{x\mid\left\lVert x-x^{*}\right\rVert_{2}\leq R_{\delta}\right\},\text{ for }R_{\delta}:=4\sqrt{\frac{d\log(16\kappa/\delta)}{\mu}}. (31)

The following density ratio bounds hold within this region, by simply modifying Lemma 14.

Corollary 7.

Let $\eta=\tfrac{1}{32L\kappa d\log(16\kappa/\delta)}$, and let $\hat{\pi}$ be parameterized by this choice of $\eta$ in (17). For all $x\in\Omega_{\delta}$, as defined in (31), $\frac{d\pi}{d\hat{\pi}}(x)\leq 2$. Moreover, for all $x\in\mathbb{R}^{d}$, $\frac{d\pi}{d\hat{\pi}}(x)\geq\frac{1}{2}$.

The following claim follows immediately from applying Fact 2.

Lemma 15.

With probability at least $1-\tfrac{\delta^{2}}{8(1+\kappa)^{d}}$, $x\sim\hat{\pi}$ lies in $\Omega_{\delta}$.

Finally, when clear from context, we overload $\hat{\pi}$ as a distribution on $x\in\mathbb{R}^{d}$ to be the $x$-component marginal of the distribution (17), i.e. with density

\frac{d\hat{\pi}}{dx}(x)\propto\int_{y}\exp\left(-f(y)-g(x)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)dy.

We first note that $\hat{\pi}$ is stationary for Sample-Joint-Dist; this follows immediately from Lemma 1. In Section B.2.1, we bound the conductance of the walk. We then use this bound in Section B.2.2 to bound the mixing time and overall complexity of Sample-Joint-Dist.

B.2.1 Conductance of Sample-Joint-Dist

We bound the conductance of this random walk, as a process on the iterates $\{x_{k}\}$, to show the final point has distribution close to the marginal of $\hat{\pi}$ on $x$. To do so, we break Proposition 7 into two pieces, which we will use in a more white-box manner to prove our conductance bound.

Definition 7 (Restricted conductance).

Let a random walk with stationary distribution $\hat{\pi}$ on $x\in\mathbb{R}^{d}$ have transition densities $\mathcal{T}_{x}$, and let $\Omega\subseteq\mathbb{R}^{d}$. The $\Omega$-restricted conductance, for $v\in(0,\tfrac{1}{2}\hat{\pi}(\Omega))$, is

\Phi_{\Omega}(v)=\inf_{\hat{\pi}(S\cap\Omega)\in(0,v]}\frac{\mathcal{T}_{S}(S^{c})}{\hat{\pi}(S\cap\Omega)},\text{ where }\mathcal{T}_{S}(S^{c}):=\int_{x\in S}\int_{x^{\prime}\in S^{c}}\mathcal{T}_{x}(x^{\prime})d\hat{\pi}(x)dx^{\prime}.
Proposition 8 (Lemma 1, [CDWY19]).

Let $\pi_{\textup{start}}$ be a $\beta$-warm start for $\hat{\pi}$, and let $x_{0}\sim\pi_{\textup{start}}$. For some $\delta>0$, let $\Omega\subseteq\mathbb{R}^{d}$ have $\hat{\pi}(\Omega)\geq 1-\tfrac{\delta^{2}}{2\beta^{2}}$. Suppose that a random walk with stationary distribution $\hat{\pi}$ satisfies the $\Omega$-restricted conductance bound

\Phi_{\Omega}(v)\geq\sqrt{B\log\left(\frac{1}{v}\right)},\text{ for all }v\in\left[\frac{4}{\beta},\frac{1}{2}\right].

Let $x_{K}$ be the result of $K$ steps of this random walk, starting from $x_{0}$. Then, for

K\geq\frac{64}{B}\log\left(\frac{\log\beta}{2\delta}\right),

the resulting distribution of $x_{K}$ has total variation at most $\tfrac{\delta}{2}$ from $\hat{\pi}$.

We state a well-known strategy for lower bounding conductance, via showing the stationary distribution has good isoperimetry and that transition distributions of nearby points have large overlap.

Proposition 9 (Lemma 2, [CDWY19]).

Let a random walk with stationary distribution $\hat{\pi}$ on $x\in\mathbb{R}^{d}$ have transition distribution densities $\mathcal{T}_{x}$, let $\Omega\subseteq\mathbb{R}^{d}$, and let $\hat{\pi}_{\Omega}$ be the conditional distribution of $\hat{\pi}$ on $\Omega$. Suppose for any $x,x^{\prime}\in\Omega$ with $\left\lVert x-x^{\prime}\right\rVert_{2}\leq\Delta$,

\left\lVert\mathcal{T}_{x}-\mathcal{T}_{x^{\prime}}\right\rVert_{\textup{TV}}\leq\frac{1}{2}.

Also, suppose $\hat{\pi}_{\Omega}$ satisfies, for any partition $S_{1}$, $S_{2}$, $S_{3}$ of $\Omega$, where $d(S_{1},S_{2})$ is the minimum Euclidean distance between points in $S_{1}$, $S_{2}$, the log-isoperimetric inequality

\hat{\pi}_{\Omega}(S_{3})\geq\frac{1}{2\psi}d(S_{1},S_{2})\cdot\min\left(\hat{\pi}_{\Omega}(S_{1}),\hat{\pi}_{\Omega}(S_{2})\right)\cdot\sqrt{\log\left(1+\frac{1}{\min\left(\hat{\pi}_{\Omega}(S_{1}),\hat{\pi}_{\Omega}(S_{2})\right)}\right)}. (32)

Then, we have the following bound for all $v\in(0,\tfrac{1}{2}]$:

\Phi_{\Omega}(v)\geq\min\left(1,\frac{\Delta}{128\psi}\sqrt{\log\left(\frac{1}{v}\right)}\right).

To utilize Propositions 8 and 9, we prove the following bounds in Appendices C.1, C.2, and C.3.

Lemma 16 (Warm start).

For $\eta\leq\tfrac{1}{L\kappa d}$, the distribution $\pi_{\textup{start}}$ defined in (18) is a $2(1+\kappa)^{\frac{d}{2}}$-warm start for $\hat{\pi}$.

Lemma 17 (Transitions of nearby points).

Suppose $\eta L\leq 1$, $\eta L^{2}R_{\delta}^{2}\leq\tfrac{1}{2}$, and $400d^{2}\eta\leq R_{\delta}^{2}$. For a point $x$, let $\mathcal{T}_{x}$ be the density of $x_{k}$ after sampling according to Lines 6 and 7 of Algorithm 5 from $x_{k-1}=x$. For $x,x^{\prime}\in\Omega_{\delta}$ with $\left\lVert x-x^{\prime}\right\rVert_{2}\leq\tfrac{\sqrt{\eta}}{10}$, for $\Omega_{\delta}$ defined in (31), we have $\left\lVert\mathcal{T}_{x}-\mathcal{T}_{x^{\prime}}\right\rVert_{\textup{TV}}\leq\tfrac{1}{2}$.

Lemma 18 (Isoperimetry).

The density $\hat{\pi}$ and set $\Omega_{\delta}$ defined in (17), (31) satisfy (32) with $\psi=8\mu^{-\frac{1}{2}}$.

We note that the parameters of Algorithm 5 and the set $\Omega_{\delta}$ in (31) satisfy all assumptions of Lemmas 16, 17, and 18. By combining these results in the context of Proposition 9, we see that the random walk satisfies the following bound for all $v\in(0,\tfrac{1}{2}]$:

\Phi_{\Omega_{\delta}}(v)\geq\sqrt{\frac{\eta\mu}{2^{20}\cdot 100}\cdot\log\left(\frac{1}{v}\right)}.

Plugging this conductance lower bound, the high-probability guarantee of Ωδ\Omega_{\delta} by Lemma 15, and the warm start bound of Lemma 16 into Proposition 8, we have the following conclusion.

Corollary 8 (Mixing time of ideal Sample-Joint-Dist).

Assume that calls to YSample are exact in the implementation of Sample-Joint-Dist. Then, for any error parameter δ\delta, and

K:=226100ημlog(dlog(16κ)4δ),K:=\frac{2^{26}\cdot 100}{\eta\mu}\log\left(\frac{d\log(16\kappa)}{4\delta}\right),

the distribution of xKx_{K} has total variation at most δ2\tfrac{\delta}{2} from π^\hat{\pi}.

B.2.2 Complexity of Sample-Joint-Dist

We first state a guarantee on the subroutine YSample, which we prove in Appendix C.4.

Lemma 19 (YSample guarantee).

For δ[0,1]\delta\in[0,1], define RδR_{\delta} as in (31), and let η=132Lκdlog(16κ/δ)\eta=\tfrac{1}{32L\kappa d\log(16\kappa/\delta)}. For any xx with xx2κdlog(16κ/δ)Rδ\left\lVert x-x^{*}\right\rVert_{2}\leq\sqrt{\kappa d\log(16\kappa/\delta)}\cdot R_{\delta}, Algorithm 7 (YSample) draws an exact sample yy from the density proportional to exp(f(y)12ηyx22)\exp\left(-f(y)-\tfrac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right) in an expected 22 iterations.

We also state a result due to [CDWY19], which bounds the mixing time of 1-step Metropolized HMC for well-conditioned distributions; this handles the case when xx2\left\lVert x-x^{*}\right\rVert_{2} is large in Algorithm 7.

Proposition 10 (Theorem 1, [CDWY19]).

Let \pi be a distribution on \mathbb{R}^{d} whose negative log-density is convex and has condition number bounded by a constant. Then, Metropolized HMC initialized at an explicit starting distribution mixes to within total variation distance \delta of \pi in O(d\log(\tfrac{d}{\delta})) iterations.

See 5

Proof.

Under an exact YSample, Corollary 8 shows the output distribution of Sample-Joint-Dist has total variation at most \tfrac{\delta}{2} from \hat{\pi}. Next, the distribution produced by the subroutine YSample is never more than \delta/(2Kd\log(\frac{d\kappa}{\delta})) away in total variation distance from that of an exact sampler. Since the algorithm runs for K steps, the coupling characterization of total variation implies that this inexactness incurs additional error at most \delta/(2d\log(\frac{d\kappa}{\delta})), proving correctness (in fact, the resulting distribution is always at most O((d\log(d\kappa/\delta))^{-1}) away in total variation from that of an exact YSample).

Next, we prove the guarantee on the expected number of gradient evaluations per iteration. Lemma 19 shows that whenever the current iterate x_{k} satisfies \left\lVert x_{k}-x^{*}\right\rVert_{2}\leq\sqrt{\kappa d\log(16\kappa/\delta)}\cdot R_{\delta}, the expected number of gradient evaluations is constant; moreover, Proposition 10 shows that the number of gradient evaluations is never larger than O(d\log(\tfrac{d\kappa}{\delta})), where we use that the condition number of the log-density in (19) is bounded by a constant. Therefore, it suffices to show that in every iteration 0\leq k\leq K, the probability that \left\lVert x_{k}-x^{*}\right\rVert_{2}>\sqrt{\kappa d\log(16\kappa/\delta)}\cdot R_{\delta} is O((d\log(d\kappa/\delta))^{-1}). By the warmness guarantee of Lemma 16 and the concentration bound in Fact 2, the probability that x_{0} fails this bound is negligible (inverse exponential in \kappa d^{2}\log(\kappa/\delta)). Since warmness is monotonically decreasing under an exact sampler (this fact is well-known in the literature; a simple proof is that taking one step of the Markov chain from a warm distribution induces a convex combination of warm point masses, which is again warm), and the accumulated error due to inexactness of YSample is at most O((d\log(d\kappa/\delta))^{-1}) over the whole algorithm, this holds for all iterations. ∎

Appendix C Mixing time ingredients

We now prove facts which are used in the mixing time analysis of Sample-Joint-Dist. Throughout this section, as in the specification of Sample-Joint-Dist, ff and gg are functions with properties as in (15), and share a minimizer xx^{*}.

C.1 Warm start

We show that we obtain a warm start for the distribution π^\hat{\pi} in algorithm Sample-Joint-Dist via one call to the restricted Gaussian oracle for gg, by proving Lemma 16.

See 16

Proof.

By the definitions of π^\hat{\pi} and πstart\pi_{\textup{start}} in (17), (18), we wish to bound everywhere the quantity

dπstartdπ^(x)=Zπ^Zstartexp(L2xx22ηL22xx22g(x))yexp(f(y)g(x)12ηyx22ηL22xx22)𝑑y.\frac{d\pi_{\textup{start}}}{d\hat{\pi}}(x)=\frac{Z_{\hat{\pi}}}{Z_{\textup{start}}}\cdot\frac{\exp\left(-\frac{L}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}-g(x)\right)}{\int_{y}\exp\left(-f(y)-g(x)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)dy}. (33)

Here, Zπ^Z_{\hat{\pi}} is as in Definition 6, and we let ZstartZ_{\textup{start}} denote the normalization constant of πstart\pi_{\textup{start}}, i.e.

Zstart:=xexp(L2xx22ηL22xx22g(x))𝑑x.Z_{\textup{start}}:=\int_{x}\exp\left(-\frac{L}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}-g(x)\right)dx.

Regarding the first term of (33), the earlier derivation (29) showed

yexp(f(y)g(x)12ηyx22ηL22xx22)𝑑y(2πη)d2exp(f(x)g(x)).\int_{y}\exp\left(-f(y)-g(x)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)dy\leq(2\pi\eta)^{\frac{d}{2}}\exp\left(-f(x)-g(x)\right).

Then, integrating, we can bound the ratio of the normalization constants

\displaystyle\frac{Z_{\hat{\pi}}}{Z_{\textup{start}}} \leq\frac{\int_{x}(2\pi\eta)^{\frac{d}{2}}\exp\left(-f(x)-g(x)\right)dx}{\int_{x}\exp\left(-\frac{L}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}-g(x)\right)dx} (34)
x(2πη)d2exp(f(x)μ2xx22g(x))𝑑xxexp(L2xx22μ2xx22g(x))𝑑x\displaystyle\leq\frac{\int_{x}(2\pi\eta)^{\frac{d}{2}}\exp\left(-f(x^{*})-\frac{\mu}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}-g(x)\right)dx}{\int_{x}\exp\left(-\frac{L}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}-\frac{\mu}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}-g(x)\right)dx}
(2πη)d2exp(f(x))(1+Lμ)d2.\displaystyle\leq(2\pi\eta)^{\frac{d}{2}}\exp\left(-f(x^{*})\right)\left(1+\frac{L}{\mu}\right)^{\frac{d}{2}}.

The second inequality followed from the \mu-strong convexity of f and the assumption \eta L^{2}\leq\mu. The last inequality followed from Proposition 11, where we used that \frac{\mu}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}+g(x) is \mu-strongly convex. Next, to bound the second term of (33), notice first that

exp(L2xx22ηL22xx22g(x))yexp(f(y)g(x)12ηyx22ηL22xx22)𝑑y=exp(L2xx22)yexp(f(y)12ηyx22)𝑑y.\displaystyle\frac{\exp\left(-\frac{L}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}-g(x)\right)}{\int_{y}\exp\left(-f(y)-g(x)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}-\frac{\eta L^{2}}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)dy}=\frac{\exp\left(-\frac{L}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)}{\int_{y}\exp\left(-f(y)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right)dy}.

It thus suffices to lower bound exp(L2xx22)yexp(f(y)12ηyx22)𝑑y\exp\left(\frac{L}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)\int_{y}\exp\left(-f(y)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right)dy. We have

exp(L2xx22)yexp(f(y)12ηyx22)𝑑y\displaystyle\exp\left(\frac{L}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)\int_{y}\exp\left(-f(y)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right)dy (35)
exp(f(x)+L2xx22)yexp(f(x),yx(12η+L2)yx22)𝑑y\displaystyle\geq\exp\left(-f(x)+\frac{L}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)\int_{y}\exp\left(-\langle\nabla f(x),y-x\rangle-\left(\frac{1}{2\eta}+\frac{L}{2}\right)\left\lVert y-x\right\rVert_{2}^{2}\right)dy
=exp(f(x)+L2xx22)(2πη1+Lη)d2exp(η2(1+Lη)f(x)22)\displaystyle=\exp\left(-f(x)+\frac{L}{2}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)\left(\frac{2\pi\eta}{1+L\eta}\right)^{\frac{d}{2}}\exp\left(\frac{\eta}{2(1+L\eta)}\left\lVert\nabla f(x)\right\rVert_{2}^{2}\right)
exp(f(x))(2πη1+Lη)d2\displaystyle\geq\exp(-f(x^{*}))\left(\frac{2\pi\eta}{1+L\eta}\right)^{\frac{d}{2}}

The first and third steps followed from LL-smoothness of ff, and the second applied the Gaussian integral (Fact 1). Combining the bounds in (34) and (35), (33) becomes

dπstartdπ^(x)(1+Lμ)d2(1+Lη)d22(1+κ)d2,\displaystyle\frac{d\pi_{\textup{start}}}{d\hat{\pi}}(x)\leq\left(1+\frac{L}{\mu}\right)^{\frac{d}{2}}\left(1+L\eta\right)^{\frac{d}{2}}\leq 2(1+\kappa)^{\frac{d}{2}},

where xdx\in\mathbb{R}^{d} was arbitrary, which completes the proof. ∎

C.2 Transitions of nearby points

Here, we prove Lemma 17. Throughout this section, \mathcal{T}_{x} is the density of x_{k} after the steps in Lines 6 and 7 of Sample-Joint-Dist (Algorithm 5), starting at x_{k-1}=x. We also define \mathcal{P}_{x} to be the density of y_{k} induced by the step in Line 6 alone. We first make a simplifying observation: by Observation 1, for any two points x, x^{\prime}, we have

𝒯x𝒯xTV𝒫x𝒫xTV.\left\lVert\mathcal{T}_{x}-\mathcal{T}_{x^{\prime}}\right\rVert_{\textup{TV}}\leq\left\lVert\mathcal{P}_{x}-\mathcal{P}_{x^{\prime}}\right\rVert_{\textup{TV}}.

Thus, it suffices to understand 𝒫x𝒫xTV\left\lVert\mathcal{P}_{x}-\mathcal{P}_{x^{\prime}}\right\rVert_{\textup{TV}} for nearby x,xΩδx,x^{\prime}\in\Omega_{\delta}. Our proof of Lemma 17 combines two pieces: (1) bounding the ratio of normalization constants ZxZ_{x}, ZxZ_{x^{\prime}} of 𝒫x\mathcal{P}_{x} and 𝒫x\mathcal{P}_{x^{\prime}} for nearby xx, xx^{\prime} in Lemma 22 and (2) the structural result Proposition 12. To bound the normalization constant ratio, we state two helper lemmas. Lemma 20 characterizes facts about the minimizer of

f(y)+12ηyx22.f(y)+\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}. (36)
Lemma 20.

Let f be convex with minimizer x^{*}, and for a given x, let y_{x} minimize (36). Then,

  1. For any x, x^{\prime}: \left\lVert y_{x}-y_{x^{\prime}}\right\rVert_{2}\leq\left\lVert x-x^{\prime}\right\rVert_{2}.

  2. For any x: \left\lVert y_{x}-x^{*}\right\rVert_{2}\leq\left\lVert x-x^{*}\right\rVert_{2}.

  3. For any x with \left\lVert x-x^{*}\right\rVert_{2}\leq R: \left\lVert x-y_{x}\right\rVert_{2}\leq\eta LR.

Proof.

By optimality conditions in the definition of yxy_{x},

ηf(yx)=xyx.\eta\nabla f(y_{x})=x-y_{x}.

Fix two points xx, xx^{\prime}, and let xt:=(1t)x+txx_{t}:=(1-t)x+tx^{\prime}. Letting 𝐉x(yx)\mathbf{J}_{x}(y_{x}) be the Jacobian matrix of yxy_{x},

ddtηf(yxt)=ddt(xtyxt)\displaystyle\frac{d}{dt}\eta\nabla f(y_{x_{t}})=\frac{d}{dt}\left(x_{t}-y_{x_{t}}\right) η2f(yxt)𝐉x(yxt)(xx)=(𝐈𝐉x(yxt))(xx)\displaystyle\implies\eta\nabla^{2}f(y_{x_{t}})\mathbf{J}_{x}(y_{x_{t}})(x^{\prime}-x)=(\mathbf{I}-\mathbf{J}_{x}(y_{x_{t}}))(x^{\prime}-x)
𝐉x(yxt)(xx)=(𝐈+η2f(yxt))1(xx).\displaystyle\implies\mathbf{J}_{x}(y_{x_{t}})(x^{\prime}-x)=(\mathbf{I}+\eta\nabla^{2}f(y_{x_{t}}))^{-1}(x^{\prime}-x).

We can then compute

yxyx=01ddtyxt𝑑t=01𝐉x(yxt)(xx)𝑑t=01(𝐈+η2f(yxt))1(xx)𝑑t.y_{x^{\prime}}-y_{x}=\int_{0}^{1}\frac{d}{dt}y_{x_{t}}dt=\int_{0}^{1}\mathbf{J}_{x}(y_{x_{t}})(x^{\prime}-x)dt=\int_{0}^{1}(\mathbf{I}+\eta\nabla^{2}f(y_{x_{t}}))^{-1}(x^{\prime}-x)dt.

By triangle inequality and convexity of ff, the first claim follows:

yxyx201(𝐈+η2f(yxt))12xx2𝑑txx2.\left\lVert y_{x^{\prime}}-y_{x}\right\rVert_{2}\leq\int_{0}^{1}\left\lVert(\mathbf{I}+\eta\nabla^{2}f(y_{x_{t}}))^{-1}\right\rVert_{2}\left\lVert x^{\prime}-x\right\rVert_{2}dt\leq\left\lVert x^{\prime}-x\right\rVert_{2}.

The second claim follows from the first by yx=xy_{x^{*}}=x^{*}. The third claim follows from the second via

\left\lVert x-y_{x}\right\rVert_{2}=\eta\left\lVert\nabla f(y_{x})\right\rVert_{2}\leq\eta L\left\lVert y_{x}-x^{*}\right\rVert_{2}\leq\eta LR. ∎
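As an illustrative aside (not part of the formal development), the three claims of Lemma 20 are easy to check numerically. The sketch below, under the hypothetical choice f(y) = \sum_{i}\log\cosh(y_{i}) (so that \nabla f = \tanh, L = 1, and x^{*} = 0), computes y_{x} by iterating the optimality condition \eta\nabla f(y_{x}) = x - y_{x}.

```python
import numpy as np

def prox_point(grad_f, x, eta, iters=300):
    """Compute y_x = argmin_y f(y) + ||y - x||^2 / (2 * eta).

    The optimality condition eta * grad_f(y_x) = x - y_x makes
    y -> x - eta * grad_f(y) an (eta * L)-contraction, so plain
    fixed-point iteration converges whenever eta * L < 1.
    """
    y = np.array(x, dtype=float)
    for _ in range(iters):
        y = x - eta * grad_f(y)
    return y
```

On random inputs, one can then confirm all three claims: the map x \mapsto y_{x} is 1-Lipschitz, y_{x} is no farther from x^{*} than x is, and \left\lVert x-y_{x}\right\rVert_{2}\leq\eta L\left\lVert x-x^{*}\right\rVert_{2}.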

Next, Lemma 21 states well-known bounds on the integral of \exp(-h) for a well-conditioned function h.

Lemma 21.

Let h be an L_{h}-smooth, \mu_{h}-strongly convex function, and let y^{*}_{h} be its minimizer. Then

\left(2\pi L_{h}^{-1}\right)^{\frac{d}{2}}\exp\left(-h(y^{*}_{h})\right)\leq\int_{y}\exp\left(-h(y)\right)dy\leq\left(2\pi\mu_{h}^{-1}\right)^{\frac{d}{2}}\exp\left(-h(y^{*}_{h})\right).
Proof.

By smoothness and strong convexity,

exp(h(yh)Lh2yyh22)exp(h(y))exp(h(yh)μh2yyh22).\exp\left(-h(y^{*}_{h})-\frac{L_{h}}{2}\left\lVert y-y^{*}_{h}\right\rVert_{2}^{2}\right)\leq\exp(-h(y))\leq\exp\left(-h(y^{*}_{h})-\frac{\mu_{h}}{2}\left\lVert y-y^{*}_{h}\right\rVert_{2}^{2}\right).

The result follows by Gaussian integrals, i.e. Fact 1. ∎
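As a quick numerical sanity check of Lemma 21 (illustrative only), the sketch below integrates e^{-h} in one dimension for the hypothetical choice h(y) = \tfrac{\mu_{h}}{2}y^{2} + \log\cosh(y), which is \mu_{h}-strongly convex and (\mu_{h}+1)-smooth with minimizer 0 and h(0) = 0.

```python
import numpy as np

mu_h = 0.5
L_h = mu_h + 1.0  # (log cosh)'' = 1 - tanh(y)^2 lies in [0, 1]
h = lambda y: 0.5 * mu_h * y**2 + np.log(np.cosh(y))

# Riemann sum over a wide interval; the integrand is negligible outside it.
y = np.linspace(-30.0, 30.0, 600_001)
integral = np.sum(np.exp(-h(y))) * (y[1] - y[0])

lower = np.sqrt(2 * np.pi / L_h)   # (2*pi*L_h^{-1})^{d/2} * exp(-h(0)), d = 1
upper = np.sqrt(2 * np.pi / mu_h)  # (2*pi*mu_h^{-1})^{d/2} * exp(-h(0))
```

The computed integral indeed lands between the two Gaussian integrals predicted by the lemma.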

We now define the normalization constants of 𝒫x\mathcal{P}_{x} and 𝒫x\mathcal{P}_{x^{\prime}}:

Zx=yexp(f(y)12ηyx22)𝑑y,\displaystyle Z_{x}=\int_{y}\exp\left(-f(y)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right)dy, (37)
Zx=yexp(f(y)12ηyx22)𝑑y.\displaystyle Z_{x^{\prime}}=\int_{y}\exp\left(-f(y)-\frac{1}{2\eta}\left\lVert y-x^{\prime}\right\rVert_{2}^{2}\right)dy.

We apply Lemma 20 and Lemma 21 to bound the ratio of ZxZ_{x} and ZxZ_{x^{\prime}}.

Lemma 22.

Let ff be μ\mu-strongly convex and LL-smooth. Let x,xΩδx,x^{\prime}\in\Omega_{\delta}, for Ωδ\Omega_{\delta} defined in (31), and let xx2Δ\left\lVert x-x^{\prime}\right\rVert_{2}\leq\Delta. Then, the normalization constants ZxZ_{x} and ZxZ_{x^{\prime}} in (37) satisfy

ZxZx1.05exp(3LRΔ+LΔ22).\frac{Z_{x}}{Z_{x^{\prime}}}\leq 1.05\exp\left(3LR\Delta+\frac{L\Delta^{2}}{2}\right).
Proof.

First, applying Lemma 21 to ZxZ_{x} and ZxZ_{x^{\prime}} yields that the ratio is bounded by

\displaystyle\frac{Z_{x}}{Z_{x^{\prime}}} \leq\frac{\exp\left(-f(y_{x})-\frac{1}{2\eta}\left\lVert y_{x}-x\right\rVert_{2}^{2}\right)\left(2\pi\left(\mu+\frac{1}{\eta}\right)^{-1}\right)^{\frac{d}{2}}}{\exp\left(-f(y_{x^{\prime}})-\frac{1}{2\eta}\left\lVert y_{x^{\prime}}-x^{\prime}\right\rVert_{2}^{2}\right)\left(2\pi\left(L+\frac{1}{\eta}\right)^{-1}\right)^{\frac{d}{2}}}
1.05exp(f(yx)f(yx)+12η(yxx22yxx22)).\displaystyle\leq 1.05\exp\left(f(y_{x^{\prime}})-f(y_{x})+\frac{1}{2\eta}\left(\left\lVert y_{x^{\prime}}-x^{\prime}\right\rVert_{2}^{2}-\left\lVert y_{x}-x\right\rVert_{2}^{2}\right)\right).

Here, since \eta^{-1}\geq 32Ld, we used the bound

(L+1ημ+1η)d/21.05.\left(\frac{L+\frac{1}{\eta}}{\mu+\frac{1}{\eta}}\right)^{d/2}\leq 1.05.

Regarding the remaining term, recall xx, xx^{\prime} both belong to Ωδ\Omega_{\delta}, and xx2Δ\left\lVert x-x^{\prime}\right\rVert_{2}\leq\Delta. We have

f(yx)f(yx)+12η(yxx22yxx22)\displaystyle f(y_{x^{\prime}})-f(y_{x})+\frac{1}{2\eta}\left(\left\lVert y_{x^{\prime}}-x^{\prime}\right\rVert_{2}^{2}-\left\lVert y_{x}-x\right\rVert_{2}^{2}\right)
f(yx),yxyx+L2yxyx22+12ηyxx+yxx,yxyx+xx\displaystyle\leq\left\langle\nabla f(y_{x}),y_{x^{\prime}}-y_{x}\right\rangle+\frac{L}{2}\left\lVert y_{x^{\prime}}-y_{x}\right\rVert_{2}^{2}+\frac{1}{2\eta}\left\langle y_{x^{\prime}}-x^{\prime}+y_{x}-x,y_{x^{\prime}}-y_{x}+x-x^{\prime}\right\rangle
LRΔ+LΔ22+12η(yxx2+yxx2)(yxyx2+xx2)\displaystyle\leq LR\Delta+\frac{L\Delta^{2}}{2}+\frac{1}{2\eta}\left(\left\lVert y_{x}-x\right\rVert_{2}+\left\lVert y_{x^{\prime}}-x^{\prime}\right\rVert_{2}\right)\left(\left\lVert y_{x^{\prime}}-y_{x}\right\rVert_{2}+\left\lVert x^{\prime}-x\right\rVert_{2}\right)
LRΔ+LΔ22+2ηLR2η(yxyx2+xx2)3LRΔ+LΔ22.\displaystyle\leq LR\Delta+\frac{L\Delta^{2}}{2}+\frac{2\eta LR}{2\eta}\left(\left\lVert y_{x^{\prime}}-y_{x}\right\rVert_{2}+\left\lVert x^{\prime}-x\right\rVert_{2}\right)\leq 3LR\Delta+\frac{L\Delta^{2}}{2}.

The first inequality was by smoothness and expanding the difference of quadratics. The second was by \left\lVert\nabla f(y_{x})\right\rVert_{2}\leq L\left\lVert y_{x}-x^{*}\right\rVert_{2}\leq LR and \left\lVert y_{x^{\prime}}-y_{x}\right\rVert_{2}\leq\Delta, where we used the first and second parts of Lemma 20; we also applied Cauchy-Schwarz and the triangle inequality. The third used the third part of Lemma 20. Finally, the last inequality was by the first part of Lemma 20 and \left\lVert x^{\prime}-x\right\rVert_{2}\leq\Delta. ∎

We are now ready to prove Lemma 17. See 17

Proof.

First, by Observation 1, it suffices to show 𝒫x𝒫xTV12\left\lVert\mathcal{P}_{x}-\mathcal{P}_{x^{\prime}}\right\rVert_{\textup{TV}}\leq\tfrac{1}{2}. Pinsker’s inequality states

𝒫x𝒫xTV12dKL(𝒫x,𝒫x),\left\lVert\mathcal{P}_{x}-\mathcal{P}_{x^{\prime}}\right\rVert_{\textup{TV}}\leq\sqrt{\frac{1}{2}d_{\text{KL}}\left(\mathcal{P}_{x},\mathcal{P}_{x^{\prime}}\right)},

where dKLd_{\text{KL}} is KL-divergence, so it is enough to show dKL(𝒫x,𝒫x)12d_{\text{KL}}\left(\mathcal{P}_{x},\mathcal{P}_{x^{\prime}}\right)\leq\tfrac{1}{2}. Notice that

dKL(𝒫x,𝒫x)=log(ZxZx)+y𝒫x(y)log(exp(f(y)12ηyx22)exp(f(y)12ηyx22))𝑑y.\displaystyle d_{\text{KL}}\left(\mathcal{P}_{x},\mathcal{P}_{x^{\prime}}\right)=\log\left(\frac{Z_{x^{\prime}}}{Z_{x}}\right)+\int_{y}\mathcal{P}_{x}(y)\log\left(\frac{\exp\left(-f(y)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right)}{\exp\left(-f(y)-\frac{1}{2\eta}\left\lVert y-x^{\prime}\right\rVert_{2}^{2}\right)}\right)dy.

By Lemma 22, the first term satisfies, for Δ:=η10\Delta:=\tfrac{\sqrt{\eta}}{10},

log(ZxZx)3LRΔ+LΔ22+log(1.05).\log\left(\frac{Z_{x^{\prime}}}{Z_{x}}\right)\leq 3LR\Delta+\frac{L\Delta^{2}}{2}+\log(1.05).

To bound the second term, we have

y𝒫x(y)log(exp(f(y)12ηyx22)exp(f(y)12ηyx22))𝑑y\displaystyle\int_{y}\mathcal{P}_{x}(y)\log\left(\frac{\exp\left(-f(y)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right)}{\exp\left(-f(y)-\frac{1}{2\eta}\left\lVert y-x^{\prime}\right\rVert_{2}^{2}\right)}\right)dy =12ηy𝒫x(y)(yx22yx22)𝑑y\displaystyle=\frac{1}{2\eta}\int_{y}\mathcal{P}_{x}(y)\left(\left\lVert y-x^{\prime}\right\rVert_{2}^{2}-\left\lVert y-x\right\rVert_{2}^{2}\right)dy
=12ηy𝒫x(y)xx,2(yx)+(xx)𝑑y\displaystyle=\frac{1}{2\eta}\int_{y}\mathcal{P}_{x}(y)\left\langle x-x^{\prime},2\left(y-x\right)+\left(x-x^{\prime}\right)\right\rangle dy
Δ22η+Δηyy𝒫x(y)𝑑yx2.\displaystyle\leq\frac{\Delta^{2}}{2\eta}+\frac{\Delta}{\eta}\left\|\int_{y}y\mathcal{P}_{x}(y)dy-x\right\|_{2}.

Here, the second line was by expanding and the third line was by xx2Δ\left\lVert x-x^{\prime}\right\rVert_{2}\leq\Delta and Cauchy-Schwarz. By Proposition 12, yy𝒫x(y)𝑑yx22ηLR\left\|\int_{y}y\mathcal{P}_{x}(y)dy-x\right\|_{2}\leq 2\eta LR, where by assumption the parameters satisfy the conditions of Proposition 12. Then, combining the two bounds, we have

dKL(𝒫x,𝒫x)3LRΔ+LΔ22+Δ22η+2LRΔ+log(1.05)=5LRΔ+LΔ22+Δ22η+log(1.05).d_{\text{KL}}\left(\mathcal{P}_{x},\mathcal{P}_{x^{\prime}}\right)\leq 3LR\Delta+\frac{L\Delta^{2}}{2}+\frac{\Delta^{2}}{2\eta}+2LR\Delta+\log(1.05)=5LR\Delta+\frac{L\Delta^{2}}{2}+\frac{\Delta^{2}}{2\eta}+\log(1.05).

When Δ=η10\Delta=\tfrac{\sqrt{\eta}}{10}, ηL1\eta L\leq 1, and ηL2R212\eta L^{2}R^{2}\leq\tfrac{1}{2}, we have the desired

d_{\text{KL}}\left(\mathcal{P}_{x},\mathcal{P}_{x^{\prime}}\right)\leq\frac{\sqrt{\eta}LR}{2}+\frac{L\eta}{200}+\frac{1}{200}+\log(1.05)\leq\frac{1}{2}. ∎
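As an aside, the use of Pinsker's inequality above is easy to illustrate numerically in the Gaussian case (a sketch under hypothetical one-dimensional parameters): for two Gaussians with common variance \eta and means \Delta apart, d_{\text{KL}} = \Delta^{2}/(2\eta), and the total variation distance computed by quadrature sits below \sqrt{d_{\text{KL}}/2}.

```python
import numpy as np

eta = 0.04
delta = np.sqrt(eta) / 10  # the separation scale Delta used in Lemma 17

y = np.linspace(-2.0, 2.0, 400_001)
dy = y[1] - y[0]
p = np.exp(-(y**2) / (2 * eta)) / np.sqrt(2 * np.pi * eta)
q = np.exp(-((y - delta) ** 2) / (2 * eta)) / np.sqrt(2 * np.pi * eta)

tv = 0.5 * np.sum(np.abs(p - q)) * dy  # total variation by quadrature
kl = delta**2 / (2 * eta)              # closed form for equal-variance Gaussians
```

With these parameters the computed total variation is well below \tfrac{1}{2}, consistent with the conclusion of Lemma 17.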

C.3 Isoperimetry

In this section, we prove Lemma 18, which asserts that \hat{\pi}_{\Omega_{\delta}} satisfies the log-isoperimetric inequality (32). Here, we define \hat{\pi}_{\Omega_{\delta}} to be the x-marginal of \hat{\pi} conditioned on the set \Omega_{\delta}. We recall this means that for any partition S_{1}, S_{2}, S_{3} of \Omega_{\delta},

π^Ωδ(S3)12ψd(S1,S2)min(π^Ωδ(S1),π^Ωδ(S2))log(1+1min(π^Ωδ(S1),π^Ωδ(S2))).\hat{\pi}_{\Omega_{\delta}}(S_{3})\geq\frac{1}{2\psi}d(S_{1},S_{2})\cdot\min\left(\hat{\pi}_{\Omega_{\delta}}(S_{1}),\hat{\pi}_{\Omega_{\delta}}(S_{2})\right)\cdot\sqrt{\log\left(1+\frac{1}{\min\left(\hat{\pi}_{\Omega_{\delta}}(S_{1}),\hat{\pi}_{\Omega_{\delta}}(S_{2})\right)}\right)}.

The following fact was shown in [CDWY19].

Lemma 23 ([CDWY19], Lemma 11).

Any μ\mu-strongly logconcave distribution π\pi satisfies the log-isoperimetric inequality (32) with ψ=μ12\psi=\mu^{-\frac{1}{2}}.

Observe that \pi_{\Omega_{\delta}}, the restriction of \pi to the convex set \Omega_{\delta}, is \mu-strongly logconcave by the definition of \pi (15), so it satisfies a log-isoperimetric inequality. We now combine this fact with the relative density bounds of Lemma 14 to prove Lemma 18.

See 18

Proof.

Fix some partition S1S_{1}, S2S_{2}, S3S_{3} of Ωδ\Omega_{\delta}, and without loss of generality let π^Ωδ(S1)π^Ωδ(S2)\hat{\pi}_{\Omega_{\delta}}(S_{1})\leq\hat{\pi}_{\Omega_{\delta}}(S_{2}). First, by applying Corollary 7, which shows dπdπ^(x)[12,2]\tfrac{d\pi}{d\hat{\pi}}(x)\in[\tfrac{1}{2},2] everywhere in Ωδ\Omega_{\delta}, we have the bounds

12πΩδ(S1)π^Ωδ(S1)2πΩδ(S1),12πΩδ(S2)π^Ωδ(S2)2πΩδ(S2),andπ^Ωδ(S3)12πΩδ(S3).\frac{1}{2}\pi_{\Omega_{\delta}}(S_{1})\leq\hat{\pi}_{\Omega_{\delta}}(S_{1})\leq 2\pi_{\Omega_{\delta}}(S_{1}),\;\frac{1}{2}\pi_{\Omega_{\delta}}(S_{2})\leq\hat{\pi}_{\Omega_{\delta}}(S_{2})\leq 2\pi_{\Omega_{\delta}}(S_{2}),\;\text{and}\;\hat{\pi}_{\Omega_{\delta}}(S_{3})\geq\frac{1}{2}\pi_{\Omega_{\delta}}(S_{3}).

Therefore, we have the sequence of conclusions

π^Ωδ(S3)\displaystyle\hat{\pi}_{\Omega_{\delta}}(S_{3}) 12πΩδ(S3)\displaystyle\geq\frac{1}{2}\pi_{\Omega_{\delta}}(S_{3})
d(S1,S2)μ4min(πΩδ(S1),πΩδ(S2))log(1+1min(πΩδ(S1),πΩδ(S2)))\displaystyle\geq\frac{d(S_{1},S_{2})\sqrt{\mu}}{4}\cdot\min\left(\pi_{\Omega_{\delta}}(S_{1}),\pi_{\Omega_{\delta}}(S_{2})\right)\cdot\sqrt{\log\left(1+\frac{1}{\min\left(\pi_{\Omega_{\delta}}(S_{1}),\pi_{\Omega_{\delta}}(S_{2})\right)}\right)}
d(S1,S2)μ8π^Ωδ(S1)log(1+12π^Ωδ(S1))\displaystyle\geq\frac{d(S_{1},S_{2})\sqrt{\mu}}{8}\cdot\hat{\pi}_{\Omega_{\delta}}(S_{1})\cdot\sqrt{\log\left(1+\frac{1}{2\hat{\pi}_{\Omega_{\delta}}(S_{1})}\right)}
d(S1,S2)μ16π^Ωδ(S1)log(1+1π^Ωδ(S1)).\displaystyle\geq\frac{d(S_{1},S_{2})\sqrt{\mu}}{16}\cdot\hat{\pi}_{\Omega_{\delta}}(S_{1})\cdot\sqrt{\log\left(1+\frac{1}{\hat{\pi}_{\Omega_{\delta}}(S_{1})}\right)}.

Here, the second line was by applying Lemma 23 to the μ\mu-strongly logconcave distribution πΩδ\pi_{\Omega_{\delta}}, and the final line used log(1+α)2log(1+α2)\sqrt{\log(1+\alpha)}\leq 2\sqrt{\log(1+\tfrac{\alpha}{2})} for all α>0\alpha>0. ∎

C.4 Correctness of YSample

In this section, we show how we can sample yy efficiently in the alternating scheme of the algorithm Sample-Joint-Dist, within an extremely high probability region. Specifically, for any xx with xx2κdlog(16κ/δ)Rδ\left\lVert x-x^{*}\right\rVert_{2}\leq\sqrt{\kappa d\log(16\kappa/\delta)}\cdot R_{\delta}, where RδR_{\delta} is defined in (31), we give a method for implementing

draw yexp(f(y)12ηyx22)dy.\text{draw }y\propto\exp\left(-f(y)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right)dy.

The algorithm is Algorithm 7, which is a simple rejection sampling scheme.

Algorithm 7 YSample(f,x,η,δ)\texttt{YSample}(f,x,\eta,\delta)

Input: LL-smooth, μ\mu-strongly convex f:df:\mathbb{R}^{d}\to\mathbb{R} with minimizer xx^{*}, η>0\eta>0, δ[0,1]\delta\in[0,1], xdx\in\mathbb{R}^{d}.
Output: If xx2κdlog(16κ/δ)Rδ\left\lVert x-x^{*}\right\rVert_{2}\leq\sqrt{\kappa d\log(16\kappa/\delta)}\cdot R_{\delta}, return exact sample from distribution with density exp(f(y)12ηyx22)\propto\exp(-f(y)-\tfrac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}) (see (31) for definition of RδR_{\delta}). Otherwise, return sample within δ\delta TV from distribution with density exp(f(y)12ηyx22)\propto\exp(-f(y)-\tfrac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}).

1:if xx2κdlog(16κ/δ)Rδ\left\lVert x-x^{*}\right\rVert_{2}\leq\sqrt{\kappa d\log(16\kappa/\delta)}\cdot R_{\delta} then
2:     while true do
3:         Draw y𝒩(xηf(x),η𝐈)y\sim\mathcal{N}(x-\eta\nabla f(x),\eta\mathbf{I})
4:         τUnif[0,1]\tau\sim\text{Unif}[0,1]
5:         if τexp(f(x)+f(x),yxf(y))\tau\leq\exp(f(x)+\left\langle\nabla f(x),y-x\right\rangle-f(y)) then
6:              return yy
7:         end if
8:     end while
9:end if
10:return a sample y within TV distance \delta of the density \propto\exp(-f(y)-\tfrac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}), using [CDWY19]
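As a concrete illustration, the inner rejection loop (Lines 2 through 8 of Algorithm 7) can be sketched in a few lines of Python; the quadratic f used in the test is a hypothetical stand-in chosen so that the target density is an explicit Gaussian.

```python
import numpy as np

def ysample_loop(f, grad_f, x, eta, rng):
    """Rejection sampling for the density proportional to
    exp(-f(y) - ||y - x||_2^2 / (2 * eta)), for convex f.

    Proposal: y ~ N(x - eta * grad_f(x), eta * I). The acceptance
    probability exp(f(x) + <grad_f(x), y - x> - f(y)) is at most 1
    by convexity of f.
    """
    g = grad_f(x)
    iters = 0
    while True:
        iters += 1
        y = x - eta * g + np.sqrt(eta) * rng.standard_normal(x.shape[0])
        if np.log(rng.uniform()) <= f(x) + g @ (y - x) - f(y):
            return y, iters
```

For f(y) = \tfrac{1}{2}\left\lVert y\right\rVert_{2}^{2} and small \eta, the target is a Gaussian with mean x/(1+\eta), and the empirical acceptance rate is close to 1, consistent with the expected-two-iterations guarantee of Lemma 19.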

We recall that we gave guarantees on rejection sampling procedures in Lemma 5 (an “exact” version of Lemma 12 and Corollary 6). We now prove Lemma 19 via a direct application of Lemma 5.

See 19

Proof.

For xx2κdlog(16κ/δ)Rδ\left\lVert x-x^{*}\right\rVert_{2}\leq\sqrt{\kappa d\log(16\kappa/\delta)}\cdot R_{\delta}, YSample is a rejection sampling scheme with

p(y)=exp(f(y)12ηyx22),p^(y)=exp(f(x)f(x),yx12ηyx22).p(y)=\exp\left(-f(y)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right),\;\hat{p}(y)=\exp\left(-f(x)-\left\langle\nabla f(x),y-x\right\rangle-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right).

It is clear that p(y)p^(y)p(y)\leq\hat{p}(y) everywhere by convexity of ff, so we may choose C=1C=1. To bound the expected number of iterations and obtain the desired conclusion, Lemma 5 requires a bound on

yexp(f(x)f(x),yx12ηyx22)𝑑yyexp(f(y)12ηyx22)𝑑y,\frac{\int_{y}\exp\left(-f(x)-\left\langle\nabla f(x),y-x\right\rangle-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right)dy}{\int_{y}\exp\left(-f(y)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right)dy}, (38)

the ratio of the normalization constants of p^\hat{p} and pp. First, by Fact 1,

yexp(f(x)f(x),yx12ηyx22)𝑑y=exp(f(x)+η2f(x)22)(2πη)d2.\int_{y}\exp\left(-f(x)-\left\langle\nabla f(x),y-x\right\rangle-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right)dy=\exp\left(-f(x)+\frac{\eta}{2}\left\lVert\nabla f(x)\right\rVert_{2}^{2}\right)(2\pi\eta)^{\frac{d}{2}}.

Next, by smoothness and Fact 1 once more,

yexp(f(y)12ηyx22)𝑑y\displaystyle\int_{y}\exp\left(-f(y)-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right)dy yexp(f(x)f(x),yx1+ηL2ηyx22)𝑑y\displaystyle\geq\int_{y}\exp\left(-f(x)-\left\langle\nabla f(x),y-x\right\rangle-\frac{1+\eta L}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right)dy
=exp(f(x)+η2(1+ηL)f(x)22)(2πη1+ηL)d2.\displaystyle=\exp\left(-f(x)+\frac{\eta}{2(1+\eta L)}\left\lVert\nabla f(x)\right\rVert_{2}^{2}\right)\left(\frac{2\pi\eta}{1+\eta L}\right)^{\frac{d}{2}}.

Taking a ratio, the quantity in (38) is bounded above by

exp((η2η2(1+ηL))f(x)22)(1+ηL)d2\displaystyle\exp\left(\left(\frac{\eta}{2}-\frac{\eta}{2(1+\eta L)}\right)\left\lVert\nabla f(x)\right\rVert_{2}^{2}\right)\left(1+\eta L\right)^{\frac{d}{2}} 1.5exp(η2L2(1+ηL)f(x)22)\displaystyle\leq 1.5\exp\left(\frac{\eta^{2}L}{2(1+\eta L)}\left\lVert\nabla f(x)\right\rVert_{2}^{2}\right)
1.5exp(η2L32(16κd2log2(16κ/δ)μ))2.\displaystyle\leq 1.5\exp\left(\frac{\eta^{2}L^{3}}{2}\cdot\left(\frac{16\kappa d^{2}\log^{2}(16\kappa/\delta)}{\mu}\right)\right)\leq 2.

The first inequality was (1+ηL)d21.5(1+\eta L)^{\frac{d}{2}}\leq 1.5, the second used smoothness and the assumed bound on xx2\left\lVert x-x^{*}\right\rVert_{2}, and the third again used our choice of η\eta. ∎

Appendix D Structural results

Here, we prove two structural results about distributions whose negative log-densities are small perturbations of a quadratic, which admit tighter concentration guarantees than naive bounds on strongly logconcave distributions. They are used to obtain our bounds in Appendix C (and the warm start bounds in Section 4), but we hope both the statements and the proof techniques are of independent interest to the community. Our first structural result is a bound on ratios of normalization constants, used throughout the paper.

Proposition 11.

Let f:df:\mathbb{R}^{d}\rightarrow\mathbb{R} be μ\mu-strongly convex with minimizer xx^{*}, and let λ>0\lambda>0. Then,

exp(f(x))𝑑xexp(f(x)12λxx22)𝑑x(1+1μλ)d2.\frac{\int\exp(-f(x))dx}{\int\exp\left(-f(x)-\frac{1}{2\lambda}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)dx}\leq\left(1+\frac{1}{\mu\lambda}\right)^{\frac{d}{2}}.
Proof.

Define the function

R(α):=exp(f(x)12λαxx22)𝑑xexp(f(x)12λxx22)𝑑x.R(\alpha):=\frac{\int\exp\left(-f(x)-\frac{1}{2\lambda\alpha}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)dx}{\int\exp\left(-f(x)-\frac{1}{2\lambda}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)dx}.

Let dπα(x)d\pi_{\alpha}(x) be the density proportional to exp(f(x)12λαxx22)dx\exp\left(-f(x)-\tfrac{1}{2\lambda\alpha}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)dx. We compute

ddαR(α)\displaystyle\frac{d}{d\alpha}R(\alpha) =exp(f(x)12λαxx22)exp(f(x)12λxx22)𝑑x12λα2xx22𝑑x\displaystyle=\int\frac{\exp\left(-f(x)-\frac{1}{2\lambda\alpha}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)}{\int\exp\left(-f(x)-\frac{1}{2\lambda}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)dx}\frac{1}{2\lambda\alpha^{2}}\left\lVert x-x^{*}\right\rVert_{2}^{2}dx
=R(α)2λα2exp(f(x)12λαxx22)xx22exp(f(x)12λαxx22)𝑑x𝑑x\displaystyle=\frac{R(\alpha)}{2\lambda\alpha^{2}}\int\frac{\exp\left(-f(x)-\frac{1}{2\lambda\alpha}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)\left\lVert x-x^{*}\right\rVert_{2}^{2}}{\int\exp\left(-f(x)-\frac{1}{2\lambda\alpha}\left\lVert x-x^{*}\right\rVert_{2}^{2}\right)dx}dx
=R(α)2λα2xx22𝑑πα(x)R(α)2αdμλα+1.\displaystyle=\frac{R(\alpha)}{2\lambda\alpha^{2}}\int\left\lVert x-x^{*}\right\rVert_{2}^{2}d\pi_{\alpha}(x)\leq\frac{R(\alpha)}{2\alpha}\cdot\frac{d}{\mu\lambda\alpha+1}.

Here, the last inequality was by Fact 4, using the fact that the function f(x)+12λαxx22f(x)+\tfrac{1}{2\lambda\alpha}\left\lVert x-x^{*}\right\rVert_{2}^{2} is μ+1λα\mu+\tfrac{1}{\lambda\alpha}-strongly convex. Moreover, note that R(1)=1R(1)=1, and

ddαlog(αμλα+1)=1αμλμλα+1=1μλα2+α.\frac{d}{d\alpha}\log\left(\frac{\alpha}{\mu\lambda\alpha+1}\right)=\frac{1}{\alpha}-\frac{\mu\lambda}{\mu\lambda\alpha+1}=\frac{1}{\mu\lambda\alpha^{2}+\alpha}.

Solving the differential inequality

ddαlog(R(α))=dR(α)dα1R(α)d21μλα2+α,\frac{d}{d\alpha}\log(R(\alpha))=\frac{dR(\alpha)}{d\alpha}\cdot\frac{1}{R(\alpha)}\leq\frac{d}{2}\cdot\frac{1}{\mu\lambda\alpha^{2}+\alpha},

we obtain the bound for any α1\alpha\geq 1 (since log(R(1))=0\log(R(1))=0)

log(R(α))d2log(μλα+αμλα+1)R(α)(μλα+αμλα+1)d2(1+1μλ)d2.\log(R(\alpha))\leq\frac{d}{2}\log\left(\frac{\mu\lambda\alpha+\alpha}{\mu\lambda\alpha+1}\right)\implies R(\alpha)\leq\left(\frac{\mu\lambda\alpha+\alpha}{\mu\lambda\alpha+1}\right)^{\frac{d}{2}}\leq\left(1+\frac{1}{\mu\lambda}\right)^{\frac{d}{2}}.

Taking a limit α\alpha\rightarrow\infty yields the conclusion. ∎
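In one dimension, the bound of Proposition 11 can be checked directly by quadrature. The sketch below uses the hypothetical choice f(x) = \tfrac{\mu}{2}x^{2} + \log\cosh(x), which is \mu-strongly convex with minimizer x^{*} = 0.

```python
import numpy as np

mu, lam = 0.5, 2.0
f = lambda x: 0.5 * mu * x**2 + np.log(np.cosh(x))  # mu-strongly convex, x* = 0

# Riemann sums over a wide interval; the integrands are negligible outside it.
x = np.linspace(-30.0, 30.0, 600_001)
dx = x[1] - x[0]
numer = np.sum(np.exp(-f(x))) * dx
denom = np.sum(np.exp(-f(x) - x**2 / (2 * lam))) * dx

ratio = numer / denom
bound = (1 + 1 / (mu * lam)) ** 0.5  # (1 + 1/(mu*lam))^{d/2} with d = 1
```

The ratio is at least 1 (the regularized integrand is pointwise smaller) and at most the claimed bound.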

Our second structural result uses a similar proof technique to show that the mean of a bounded perturbation of a Gaussian is not far from its mode, as long as the gradient at the mode is small. We remark that one may directly apply strong logconcavity, i.e. a variant of Fact 4, to obtain a bound weaker by roughly a \sqrt{d} factor, which would result in a loss of \Omega(d) in the guarantees of Theorem 2. This tighter analysis is crucial to our improved mixing time result.

Before stating the bound, we apply Fact 3 to the convex functions h(x)=(θx)2h(x)=(\theta^{\top}x)^{2} and h(x)=x24h(x)=\left\lVert x\right\rVert_{2}^{4} to obtain the following conclusions which will be used in the proof of Proposition 12.

Corollary 9.

Let π\pi be a μ\mu-strongly logconcave density. Then,

  1. \mathbb{E}_{\pi}[(\theta^{\top}(x-\mathbb{E}_{\pi}[x]))^{2}]\leq\mu^{-1} for all unit vectors \theta.

  2. \mathbb{E}_{\pi}[\left\lVert x-\mathbb{E}_{\pi}[x]\right\rVert_{2}^{4}]\leq 3d^{2}\mu^{-2}.

Proposition 12.

Let f:\mathbb{R}^{d}\rightarrow\mathbb{R} be L-smooth and convex with minimizer x^{*}, let x\in\mathbb{R}^{d} with \left\lVert x-x^{*}\right\rVert_{2}\leq R, and let d\pi_{\eta}(y) be the density proportional to \exp\left(-f(y)-\tfrac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right)dy. Suppose that \eta\leq\min\left(\tfrac{1}{2L^{2}R^{2}},\tfrac{R^{2}}{400d^{2}}\right). Then,

\left\lVert\mathbb{E}_{\pi_{\eta}}[y]-x\right\rVert_{2}\leq 2\eta LR.
Proof.

Define a family of distributions \pi^{\alpha} for \alpha\in[0,1], with

d\pi^{\alpha}(y)\propto\exp\left(-\alpha\left(f(y)-f(x)-\left\langle\nabla f(x),y-x\right\rangle\right)-f(x)-\left\langle\nabla f(x),y-x\right\rangle-\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right)dy.

In particular, \pi^{1}=\pi_{\eta}, and \pi^{0} is a Gaussian with mean x-\eta\nabla f(x). We define \bar{y}_{\alpha}:=\mathbb{E}_{\pi^{\alpha}}[y], and

y^{*}_{\alpha}:=\textup{argmin}_{y}\left\{\alpha\left(f(y)-f(x)-\left\langle\nabla f(x),y-x\right\rangle\right)+f(x)+\left\langle\nabla f(x),y-x\right\rangle+\frac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}\right\}.

Define the function D(\alpha):=\left\lVert\bar{y}_{\alpha}-x\right\rVert_{2}, so that we wish to bound D(1). First, by smoothness,

D(0)=\left\lVert\mathbb{E}_{\pi^{0}}[y]-x\right\rVert_{2}=\left\lVert\eta\nabla f(x)\right\rVert_{2}\leq\eta LR.

Next, we observe

\frac{d}{d\alpha}D(\alpha)=\left\langle\frac{\bar{y}_{\alpha}-x}{\left\lVert\bar{y}_{\alpha}-x\right\rVert_{2}},\frac{d\bar{y}_{\alpha}}{d\alpha}\right\rangle\leq\left\lVert\frac{d\bar{y}_{\alpha}}{d\alpha}\right\rVert_{2}.

In order to bound \left\lVert\tfrac{d\bar{y}_{\alpha}}{d\alpha}\right\rVert_{2}, fix a unit vector \theta. We have

\left\langle\frac{d\bar{y}_{\alpha}}{d\alpha},\theta\right\rangle =\frac{d}{d\alpha}\left\langle\int(y-x)d\pi^{\alpha}(y),\theta\right\rangle (39)
=\textup{Cov}_{\pi^{\alpha}}\left(\left\langle y-x,\theta\right\rangle,\;f(x)+\left\langle\nabla f(x),y-x\right\rangle-f(y)\right)
\leq\sqrt{\int\left\langle y-x,\theta\right\rangle^{2}d\pi^{\alpha}(y)}\sqrt{\int\left(f(x)+\left\langle\nabla f(x),y-x\right\rangle-f(y)\right)^{2}d\pi^{\alpha}(y)}
\leq\sqrt{\int\left\langle y-x,\theta\right\rangle^{2}d\pi^{\alpha}(y)}\sqrt{\int\frac{L^{2}}{4}\left\lVert y-x\right\rVert_{2}^{4}d\pi^{\alpha}(y)}.

The second line is the standard formula for differentiating an expectation through an exponential-family parameter; the third line follows by Cauchy-Schwarz, since the magnitude of a covariance is at most the product of the (uncentered) second moments; and the last line used smoothness and convexity, i.e.

-\frac{L}{2}\left\lVert y-x\right\rVert_{2}^{2}\leq f(x)+\left\langle\nabla f(x),y-x\right\rangle-f(y)\leq 0.

We now bound these terms. First,

\int\left\langle y-x,\theta\right\rangle^{2}d\pi^{\alpha}(y) \leq 2\int\left\langle y-\bar{y}_{\alpha},\theta\right\rangle^{2}d\pi^{\alpha}(y)+2\int\left\langle\bar{y}_{\alpha}-x,\theta\right\rangle^{2}d\pi^{\alpha}(y) (40)
\leq 2\eta+2\left\lVert\bar{y}_{\alpha}-x\right\rVert_{2}^{2}=2\eta+2D(\alpha)^{2}.

Here, we applied the first part of Corollary 9, as \pi^{\alpha} is \eta^{-1}-strongly logconcave, and the definition of D(\alpha). Next, using, for any a,b\in\mathbb{R}^{d}, \left\lVert a+b\right\rVert_{2}^{4}\leq(\left\lVert a\right\rVert_{2}+\left\lVert b\right\rVert_{2})^{4}\leq 16\left\lVert a\right\rVert_{2}^{4}+16\left\lVert b\right\rVert_{2}^{4}, we have

\int\frac{L^{2}}{4}\left\lVert y-x\right\rVert_{2}^{4}d\pi^{\alpha}(y) \leq\int 4L^{2}\left\lVert y-\bar{y}_{\alpha}\right\rVert_{2}^{4}d\pi^{\alpha}(y)+\int 4L^{2}\left\lVert x-\bar{y}_{\alpha}\right\rVert_{2}^{4}d\pi^{\alpha}(y) (41)
\leq 12L^{2}d^{2}\eta^{2}+4L^{2}D(\alpha)^{4}.
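The elementary inequality used above, (\left\lVert a\right\rVert_{2}+\left\lVert b\right\rVert_{2})^{4}\leq 16\left\lVert a\right\rVert_{2}^{4}+16\left\lVert b\right\rVert_{2}^{4}, reduces to the scalar fact (s+t)^{4}\leq 16(s^{4}+t^{4}) for s,t\geq 0. A quick randomized check (illustrative only) confirms the worst-case ratio is in fact 8, attained at s=t, so the constant 16 used in the proof is safe:

```python
import random

random.seed(1)
worst = 0.0
for _ in range(10000):
    s, t = random.uniform(0.01, 10), random.uniform(0.01, 10)
    # track the largest observed ratio (s+t)^4 / (s^4 + t^4)
    worst = max(worst, (s + t) ** 4 / (s ** 4 + t ** 4))
print(worst)  # approaches 8 near s = t, comfortably below 16
```

The sharp constant 8 follows from convexity of u \mapsto u^{4}; the proof only needs 16.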

Here, we used the second part of Corollary 9. Maximizing (39) over \theta, and applying (40) and (41),

\frac{d}{d\alpha}D(\alpha)\leq\left\lVert\frac{d\bar{y}_{\alpha}}{d\alpha}\right\rVert_{2} \leq\sqrt{8L^{2}(\eta+D(\alpha)^{2})(3d^{2}\eta^{2}+D(\alpha)^{4})}
\leq 4L(\sqrt{\eta}+D(\alpha))\cdot\max(2\eta d,D(\alpha)^{2}). (42)

Assume for contradiction that D(1)>2\eta LR, violating the conclusion of the proposition. By continuity of D, there must be some \bar{\alpha}\in(0,1) where D(\bar{\alpha})=2\eta LR, and for all 0\leq\alpha<\bar{\alpha}, D(\alpha)<2\eta LR. By the mean value theorem, there then exists 0<\hat{\alpha}<\bar{\alpha} such that

\frac{dD(\hat{\alpha})}{d\alpha}=\frac{D(\bar{\alpha})-D(0)}{\bar{\alpha}}>\eta LR.

Here, the last inequality used D(0)\leq\eta LR and \bar{\alpha}<1. On the other hand, by our assumption that 2\eta L^{2}R^{2}\leq 1, for any d\geq 1 it follows that

2\eta d\geq 4\eta^{2}L^{2}R^{2}>D(\hat{\alpha})^{2},\;\sqrt{2\eta}\geq 2\eta LR>D(\hat{\alpha}).

Then, plugging these bounds into (42) and using \sqrt{\eta}+D(\hat{\alpha})\leq\tfrac{5}{2}\sqrt{\eta} as \sqrt{2}\leq\tfrac{3}{2},

\frac{d}{d\alpha}D(\hat{\alpha})\leq 4L\cdot\frac{5}{2}\sqrt{\eta}\cdot 2\eta d=20\sqrt{\eta}\frac{d}{R}\cdot\eta LR\leq\eta LR.

We used \eta\leq\tfrac{R^{2}}{400d^{2}} in the last inequality. This is a contradiction, implying D(1)\leq 2\eta LR. ∎
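For a concrete instance of Proposition 12 (our illustration, not from the paper), take the quadratic f(y)=\tfrac{L}{2}\left\lVert y-x^{*}\right\rVert_{2}^{2}. Then \pi_{\eta} is an explicit Gaussian whose mean minimizes \tfrac{L}{2}\left\lVert y-x^{*}\right\rVert_{2}^{2}+\tfrac{1}{2\eta}\left\lVert y-x\right\rVert_{2}^{2}, namely (x+\eta Lx^{*})/(1+\eta L), so its distance from x is \eta L\left\lVert x-x^{*}\right\rVert_{2}/(1+\eta L)\leq\eta LR, within the 2\eta LR guarantee. The snippet below verifies this with arbitrary sample values of d, L, x, x^{*} and \eta at the edge of the allowed range:

```python
import math

d, L = 4, 3.0
x_star = [0.0] * d
x = [0.5] * d  # so that R = ||x - x*||_2 = 1
R = math.sqrt(sum((xi - si) ** 2 for xi, si in zip(x, x_star)))
eta = min(1 / (2 * L**2 * R**2), R**2 / (400 * d**2))  # step size bound from Prop. 12

# For f(y) = (L/2)||y - x*||^2, the mean of pi_eta minimizes
# (L/2)||y - x*||^2 + (1/(2*eta))||y - x||^2:
mean = [(xi + eta * L * si) / (1 + eta * L) for xi, si in zip(x, x_star)]
dist = math.sqrt(sum((mi - xi) ** 2 for mi, xi in zip(mean, x)))
print(dist, 2 * eta * L * R)  # the distance sits below the 2*eta*L*R bound
```

In this Gaussian case the distance is exactly \eta LR/(1+\eta L); the factor 2 in the proposition absorbs the extra slack needed for general smooth convex f.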