Structured Logconcave Sampling with a Restricted Gaussian Oracle
1 Introduction
Since its study was pioneered by the celebrated randomized convex body volume approximation algorithm of Dyer, Frieze, and Kannan [DFK91], designing samplers for logconcave distributions has been a very active area of research in theoretical computer science and statistics with many connections to other fields. In a generic form, the problem can be stated as: sample from a distribution whose negative log-density is convex, under various access models to the distribution.
Developing efficient algorithms for sampling from structured logconcave densities is a topic that has received significant recent interest due to its widespread practical applications. There are many types of structure which densities commonplace in applications may possess that are exploitable for improved runtimes. Examples of such structure include derivative bounds ("well-conditioned densities") and various types of separability (e.g. "composite densities" corresponding to possibly non-smooth regularization or restrictions to a set, and "logconcave finite sums" corresponding to averages over multiple data points). (We make this terminology more precise in Section 2.1, which contains various definitions used in this paper.) Building an algorithmic theory for sampling these latter two families, which are not well-understood in the literature, is a primary motivation of this work.
There are strong parallels between the types of structured logconcave families garnering recent attention and the classes of convex functions known to admit efficient first-order optimization algorithms. Notably, gradient descent and its accelerated counterpart [Nes83] are well-known to quickly optimize a well-conditioned function, and have become ubiquitous in both practice and theory. Similarly, various methods have been developed for efficiently optimizing non-smooth but structured composite objectives [BT09] and well-conditioned finite sums [All17].
Logconcave sampling and convex optimization are intimately related primitives (cf. e.g. [BV04, AH16]), so it is perhaps unsurprising that there are analogies between the types of structure algorithm designers may exploit. Nonetheless, our understanding of the complexity landscape for sampling is quite a bit weaker in comparison to counterparts in the field of optimization; few lower bounds are known for the complexity of sampling tasks, and obtaining stronger upper bounds is an extremely active research area (contrary to optimization, where matching bounds exist in many cases). Moreover (and perhaps relatedly), the toolkit for designing logconcave samplers is comparatively lacking; for many important primitives in optimization, it is unclear if there are analogs in sampling, possibly impeding improved bounds. Our work broadly falls under the themes of (1) understanding which types of structured logconcave distributions admit efficient samplers, and (2) leveraging connections between optimization and sampling for algorithm design. We address these problems on two fronts, which constitute the primary technical contributions of this paper.
1. We give a general reduction framework for bootstrapping samplers whose mixing times depend polynomially on a conditioning measure $\kappa$ into samplers whose mixing times depend linearly on $\kappa$. The framework is heavily motivated by a perspective on proximal point methods in structured convex optimization as instances of optimizing composite objectives, and leverages this connection via a surprisingly simple analysis (cf. Theorem 1).
2. We develop novel "base samplers" for composite logconcave distributions and logconcave finite sums (cf. Theorems 2, 3). The former is the first composite sampler with stronger guarantees than those known in the general logconcave setting. The latter constitutes the first high-accuracy finite sum sampler whose gradient query complexity improves upon the naïve strategy of querying full gradients of the negative log-density in each iteration.
Using our novel base samplers within our reduction framework, we obtain state-of-the-art samplers for all of the aforementioned structured families, i.e. well-conditioned, composite, and finite sum, as Corollaries 1, 2, and 3. We emphasize that even without our reduction technique, the guarantees of our base samplers for composite and finite sum-structured densities are the first of their kind. However, by boosting their mixing via our reduction, we obtain guarantees for these structured distribution families which are essentially the best one can hope for without a significant improvement in the most commonly studied well-conditioned regime (cf. discussion in Section 1.1).
We now formally state our results in Section 1.1, and situate them in the literature in Section 1.2. Section 1.3 is a technical overview of our approaches for developing our base samplers for composite and finite sum-structured densities (Sections 1.3.1 and 1.3.2), as well as our proximal reduction framework (Section 1.3.3). Section 1.4 discusses an erratum from an earlier version of this paper, and Section 1.5 gives a roadmap for the rest of the paper.
1.1 Our results
Before stating our results, we first require the notion of a restricted Gaussian oracle, whose definition is a key ingredient in giving our reduction framework as well as our later composite samplers.
Definition 1 (Restricted Gaussian oracle).
$\mathcal{O}(\lambda, v)$ is a restricted Gaussian oracle (RGO) for convex $g: \mathbb{R}^d \to \mathbb{R}$ if it returns a sample from the distribution with density proportional to
$$\exp\left(-g(x) - \frac{1}{2\lambda}\|x - v\|_2^2\right).$$
In other words, an RGO asks to sample from a multivariate Gaussian (with covariance a multiple of the identity), "restricted" by some convex function $g$. Intuitively, if we can reduce a sampling problem for a density to calling an RGO a small number of times with a small value of $\lambda$, each RGO subproblem could be much easier to solve than the original problem. This can happen for a variety of reasons, e.g. if the regularized density is extremely well-conditioned, or because it inherits concentration properties of a Gaussian. This idea of reducing a sampling problem to multiple subproblems, each implementing an RGO, underlies the framework of Theorem 1. Because the idea of regularization by a large Gaussian component repeatedly appears in this paper, we make the following more specific definition for convenience, which lower bounds the size of the Gaussian.
Definition 2 ($\lambda$-RGO).
We say $\mathcal{O}$ is a $\lambda$-restricted Gaussian oracle ($\lambda$-RGO) for convex $g$ if it satisfies Definition 1, with the restriction that the parameter is required to be at most $\lambda$ in all calls to $\mathcal{O}$.
Variants of our notion of an RGO have implicitly appeared previously [CV18, MFWB19], and efficient RGO implementation was a key subroutine in the fastest sampling algorithm for general logconcave distributions [CV18]. It also extends a similar oracle used in composite optimization, which we will discuss shortly. However, the explicit use of RGOs in a framework such as Theorem 1 is a novel technical innovation of our work, and we believe this abstraction will find further uses.
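To make the definition concrete, the following is a minimal sketch (our illustration, not an oracle used verbatim by the algorithms in this paper) of an RGO for the common case where $g$ is the indicator of a coordinate-wise box, one of the coordinate-separable regularizers discussed in Section 1.1. Since both the Gaussian and the constraint factor over coordinates, each RGO call reduces to $d$ independent one-dimensional truncated normals:

```python
import numpy as np
from scipy.stats import truncnorm

def box_rgo(lam, v, lo, hi):
    """Restricted Gaussian oracle for g = indicator of the box [lo, hi]:
    samples exactly from the density proportional to
        exp(-g(x) - ||x - v||^2 / (2 * lam)),
    i.e. N(v, lam * I) restricted to the box. Both the Gaussian and the
    constraint are coordinate-separable, so each coordinate is an
    independent one-dimensional truncated normal."""
    sd = np.sqrt(lam)
    # truncnorm takes the truncation bounds in standard-deviation units.
    a, b = (lo - v) / sd, (hi - v) / sd
    return truncnorm.rvs(a, b, loc=v, scale=sd)

# Example: one RGO call with lambda = 0.1, centered at v, restricted to [0, 1]^3.
x = box_rgo(0.1, np.array([0.2, -0.5, 1.3]), np.zeros(3), np.ones(3))
```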
Proximal reduction framework.
In Section 3, we prove correctness of our proximal reduction framework, whose guarantees are stated in the following Theorem 1.
Theorem 1.
Let $\pi$ be a distribution on $\mathbb{R}^d$ with $\frac{d\pi}{dx}(x) \propto \exp(-f(x))$ such that $f$ is $\mu$-strongly convex, and let $\epsilon \in (0, 1)$. Let $\eta > 0$ be a step size, and let $\mathcal{O}$ be an $\eta$-RGO for $f$. Algorithm 1, initialized at the minimizer of $f$, runs in $O\left(\frac{1}{\eta\mu}\log\frac{d}{\eta\mu\epsilon}\right)$ iterations, each querying $\mathcal{O}$ a constant number of times, and obtains total variation distance $\epsilon$ to $\pi$.
In other words, if we can implement an $\eta$-RGO for a $\mu$-strongly convex function in time $\mathcal{T}_{\mathcal{O}}$, we can sample from $\pi$ in time $\widetilde{O}\left(\frac{\mathcal{T}_{\mathcal{O}}}{\eta\mu}\right)$; throughout, $\widetilde{O}$ hides polylogarithmic factors in problem parameters such as $d$, $\kappa$, and $\epsilon^{-1}$. To highlight the power of this reduction framework, suppose there was an existing sampler for densities $\propto \exp(-h(x))$ with mixing time $\kappa^c \cdot \mathsf{poly}\left(d, \log\frac{1}{\epsilon}\right)$ for some $c > 1$, where $h$ is $L$-smooth, $\mu$-strongly convex, and has condition number $\kappa = \frac{L}{\mu}$ (cf. Section 2.1 for definitions). (No sampler with mixing time scaling sublinearly in $\kappa$ is currently known.) Choosing $\eta = \frac{1}{L}$ in Theorem 1 yields a sampler whose mixing time is $\widetilde{O}(\kappa \cdot \mathcal{T}_{\mathcal{O}})$, where $\mathcal{T}_{\mathcal{O}}$ is the cost of sampling from a density proportional to
$$\exp\left(-h(x) - \frac{L}{2}\|x - v\|_2^2\right)$$
for some $v \in \mathbb{R}^d$. However, observe that this distribution has a negative log-density with condition number at most $2$! By using the assumed sampler as our RGO, we have $\mathcal{T}_{\mathcal{O}} = O\left(2^c \cdot \mathsf{poly}\left(d, \log\frac{1}{\epsilon}\right)\right)$, and the overall mixing time scales linearly in $\kappa$. Leveraging Theorem 1 in applications, we obtain the following new results, improving mixing of various "base samplers" which we bootstrap as RGOs for regularized densities.
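To verify the conditioning claim, note that if $\mu I \preceq \nabla^2 h(x) \preceq L I$ for all $x$, then
$$(L + \mu) I \preceq \nabla^2\left(h(x) + \frac{L}{2}\|x - v\|_2^2\right) \preceq 2L I,$$
so the regularized negative log-density has condition number at most $\frac{2L}{L + \mu} \le 2$.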
Well-conditioned densities.
In [LST20], it was shown that a variant of Metropolized Hamiltonian Monte Carlo obtains a mixing time of $O\left(\kappa d \log^3\frac{\kappa d}{\epsilon}\right)$ for sampling a density on $\mathbb{R}^d$ with condition number $\kappa$. The analysis of [LST20] was somewhat delicate, and required reasoning about conditioning on a nonconvex set with desirable concentration properties. In Section 4.1, we prove Corollary 1, improving [LST20] by roughly a logarithmic factor with a significantly simpler analysis.
Corollary 1.
Let $\pi$ be a distribution on $\mathbb{R}^d$ with $\frac{d\pi}{dx}(x) \propto \exp(-f(x))$, where $f$ has condition number $\kappa$ and precomputed minimizer $x^*$, and let $\epsilon \in (0, 1)$. There is an algorithm (an instantiation of Algorithm 1) which uses $O\left(\kappa d \log^2\frac{\kappa d}{\epsilon}\right)$ expected gradient queries to $f$, and obtains total variation distance $\epsilon$ to $\pi$.
We include Corollary 1 as a warmup for our more complicated results, as a way to showcase the use of our reduction framework in a slightly different way than the one outlined earlier. In particular, in proving Corollary 1, we will choose a significantly smaller value of $\eta$, at which point a simple rejection sampling scheme implements each RGO with an expected constant number of gradient queries.
We give another algorithm matching Corollary 1 with a deterministic query complexity bound as Corollary 5. The algorithm of Corollary 5 is interesting in that it is entirely a zeroth-order algorithm, and does not require access to a gradient oracle. To our knowledge, in the well-conditioned optimization setting, no zeroth-order query complexities better than roughly $d\sqrt{\kappa}$ are known, e.g. by simulating accelerated gradient descent with a value oracle; thus, our sampling algorithm has a query bound off by only a $\sqrt{\kappa}$ factor from the best-known optimization algorithm. We are hopeful this result may help in the search for query lower bounds for structured logconcave sampling.
Composite densities with a restricted Gaussian oracle.
In Section 5, we develop a sampler for densities on $\mathbb{R}^d$ proportional to $\exp(-f(x) - g(x))$, where $f$ has condition number $\kappa$ and $g$ admits a restricted Gaussian oracle $\mathcal{O}$. We state its guarantees here.
Theorem 2.
Let $\pi$ be a distribution on $\mathbb{R}^d$ with $\frac{d\pi}{dx}(x) \propto \exp(-f(x) - g(x))$ such that $f$ is $L$-smooth and $\mu$-strongly convex, and let $\epsilon \in (0, 1)$. Let $\eta = \Theta\left(\frac{1}{L\kappa d}\right)$ (where $\kappa := \frac{L}{\mu}$), up to an appropriate constant and polylogarithmic factors, and let $\mathcal{O}$ be an $\eta$-RGO for $g$. Further, assume access to the minimizer $x^* = \mathrm{argmin}_x\{f(x) + g(x)\}$. There is an algorithm which runs in $\widetilde{O}\left(\frac{1}{\eta\mu}\right)$ iterations in expectation, each querying a gradient oracle of $f$ and $\mathcal{O}$ a constant number of times, and obtains total variation distance $\epsilon$ to $\pi$.
The assumption that the composite component admits an RGO can be thought of as a measure of "simplicity" of $g$. This mirrors the widespread use of a proximal oracle as a measure of simplicity in the context of composite optimization [BT09], which we now define.
Definition 3 (Proximal oracle).
$\mathcal{O}_{\mathrm{prox}}(\lambda, v)$ is a proximal oracle for convex $g: \mathbb{R}^d \to \mathbb{R}$ if it returns
$$\mathcal{O}_{\mathrm{prox}}(\lambda, v) := \mathrm{argmin}_{x \in \mathbb{R}^d}\left\{g(x) + \frac{1}{2\lambda}\|x - v\|_2^2\right\}.$$
Many regularizers used in defining composite optimization objectives, which are often used to enforce a quality such as sparsity or "simplicity" in a solution, admit efficient proximal oracles. In particular, if the proximal oracle subproblem admits a closed form solution (or is otherwise efficiently computable), the regularized objective can be optimized at essentially no asymptotic loss. It is readily apparent that our RGO (Definition 1) is the extension of Definition 3 to the sampling setting. In [MFWB19], a variety of regularizations arising in practical applications, including coordinate-separable (such as restrictions to a coordinate-wise box, e.g. for a Bayesian inference task where we have side information on the ranges of parameters) and $\ell_1$ or group Lasso regularized densities, were shown to admit RGOs. Our composite sampling results achieve a similar "no loss" phenomenon for such regularizations, with respect to existing well-conditioned samplers.
By choosing the largest possible value of $\eta$ in Theorem 2, we obtain an iteration bound of $\widetilde{O}(\kappa^2 d)$. In Section 4.2, we prove Corollary 2, which improves Theorem 2 by roughly a factor of $\kappa$.
Corollary 2.
Let $\pi$ be a distribution on $\mathbb{R}^d$ with $\frac{d\pi}{dx}(x) \propto \exp(-f(x) - g(x))$ such that $f$ is $L$-smooth and $\mu$-strongly convex, and let $\epsilon \in (0, 1)$, $\kappa := \frac{L}{\mu}$. Assume access to $x^* = \mathrm{argmin}_x\{f(x) + g(x)\}$ and let $\mathcal{O}$ be a restricted Gaussian oracle for $g$. There is an algorithm (Algorithm 1 using Theorem 2 as a restricted Gaussian oracle) which runs in $\widetilde{O}(\kappa d)$ iterations in expectation, each querying a gradient of $f$ and $\mathcal{O}$ a constant number of times, and obtains total variation distance $\epsilon$ to $\pi$.
To sketch the proof, choosing $\eta = \frac{1}{L}$ in Theorem 1 yields an algorithm running in $\widetilde{O}(\kappa)$ iterations. In each iteration, the RGO subproblem asks to sample from the distribution whose negative log-density is $f(x) + g(x) + \frac{L}{2}\|x - v\|_2^2$ for some $v$, so we can call Theorem 2, where the "well-conditioned" portion $f(x) + \frac{L}{2}\|x - v\|_2^2$ has constant condition number. Thus, Theorem 2 runs in $\widetilde{O}(d)$ iterations to solve the subproblem, yielding the result. In fact, Corollary 2 nearly matches Corollary 1 in the case $g = 0$ uniformly. Surprisingly, this recovers the runtime of [LST20] without appealing to strong gradient concentration bounds (e.g. [LST20], Theorem 3.2).
Logconcave finite sums.
In Section 6, we initiate the study of mixing times for sampling logconcave finite sums with polylogarithmic dependence on the accuracy $\epsilon$. We give the following result.
Theorem 3.
Let $\pi$ be a distribution on $\mathbb{R}^d$ with $\frac{d\pi}{dx}(x) \propto \exp(-F(x))$, where $F = \frac{1}{n}\sum_{i \in [n]} f_i$ is $\mu$-strongly convex, each $f_i$ is $L$-smooth and convex, $\kappa := \frac{L}{\mu}$, and $\epsilon \in (0, 1)$. Assume access to $x^* = \mathrm{argmin}_x F(x)$. Algorithm 6 uses $\widetilde{O}(n + \kappa^2 d)$ value queries to summands $\{f_i\}_{i \in [n]}$, and obtains total variation distance $\epsilon$ to $\pi$.
For a zeroth-order algorithm, Theorem 3 serves as a surprisingly strong baseline, as it nearly matches the previously best-known bound for zeroth-order well-conditioned sampling when $n = 1$; however, when e.g. $\kappa = d$, the complexity bound is at least cubic. By using Theorem 3 within the framework of Theorem 1, we obtain the following improved result.
Corollary 3 (Improved first-order logconcave finite sum sampling).
In the setting of Theorem 3, there is an algorithm (Algorithm 1 using Algorithm 6 as a restricted Gaussian oracle) which uses $\widetilde{O}\left(n + \kappa\max\left(d, \sqrt{nd}\right)\right)$ gradient queries to summands $\{f_i\}_{i \in [n]}$, and obtains total variation distance $\epsilon$ to $\pi$.
Corollary 3 has several surprising properties. First, its bound when $n = 1$ gives yet another way of (up to polylogarithmic factors) recovering the runtime of [LST20] without gradient concentration. Second, up to a $\max\left(1, \sqrt{n/d}\right)$ factor, it is essentially the best runtime one could hope for without an improvement when $n = 1$. This is in the sense that $\widetilde{O}(\kappa d)$ is the best known runtime for $n = 1$, and to our knowledge every efficient well-conditioned sampler requires minimizer access, i.e. $\Omega(n + \sqrt{n\kappa})$ gradient queries [WS16]. Interestingly, when $n = 1$, Algorithm 6 can be significantly simplified, and becomes the standard Metropolized random walk [DCWY18]; this yields Corollary 5, an algorithm attaining the iteration complexity of Corollary 1 while only querying a value oracle for $f$.
1.2 Previous work
Logconcave sampling is a problem that, within the theoretical computer science field, has its origins in convex body sampling (a problem it generalizes). A long sequence of developments have made significant advances in the general model, where only convexity is assumed about the negative log-density, and only value oracle access is given. In this prior work discussion, we focus on more structured cases where all or part of the negative log-density has finite condition number, and refer the reader to [Vem05, LV06a, CV15] for an account on progress in the general case.
Well-conditioned densities.
Significant recent efforts in the machine learning and statistics communities focused on obtaining provable guarantees for well-conditioned distributions, starting from pioneering work of [Dal17], and continued in e.g. [CCBJ18, DR18, CV19, CDWY19, DCWY18, DM19a, DMM19, LSV18, MMW+19, SL19, LST20]. In this setting, many methods based on discretizations of continuous-time first-order processes (such as the Langevin dynamics) have been proposed. Typically, error guarantees come in two forms: either in the 2-Wasserstein ($W_2$) distance, or in total variation (TV). The line [DCWY18, CDWY19, LST20] has brought the gradient complexity for obtaining $\epsilon$ TV distance to $\widetilde{O}(\kappa d)$, where $d$ is the dimension, by exploiting gradient concentration properties. For progress in complexities depending polynomially on $\epsilon^{-1}$, attaining $W_2$ guarantees (typically incomparable to TV bounds), we defer to [SL19], the state-of-the-art using $\widetilde{O}\left(\kappa^{7/6}\epsilon^{-1/3} + \kappa d^{1/3}\epsilon^{-2/3}\right)$ queries to obtain $W_2$ distance $\epsilon\sqrt{\frac{d}{\mu}}$ from the target. (Here, $\sqrt{\frac{d}{\mu}}$ is the effective diameter; this accuracy measure allows for scale-invariant guarantees.) We note incomparable guarantees can be obtained by assuming higher derivative bounds (e.g. a Lipschitz Hessian); our work uses only the minimal assumption of bounded second derivatives.
Composite densities.
Recent works have studied sampling from densities of the form (1), or its specializations (e.g. restrictions to a convex set). Several [Per16, BDMP17, Ber18] are based on Moreau envelope or proximal regularization strategies, and demonstrate efficiency under more stringent assumptions on the structure of the composite function $g$, but under minimal assumptions obtain fairly large provable mixing times, scaling as high-degree polynomials in $d$ and $\epsilon^{-1}$. Proximal regularization algorithms have also been considered for non-composite sampling [Wib19]. Another discretization strategy based on projections was studied by [BEL18], but also obtained a mixing time scaling as a large polynomial in $d$ and $\epsilon^{-1}$. Finally, improved algorithms for special constrained sampling problems have been proposed, such as simplex restrictions [HKRC18].
Of particular relevance and inspiration to this work is [MFWB19]. By generalizing and adapting Metropolized HMC algorithms of [DCWY18, CDWY19], adopting a Moreau envelope strategy, and using (a stronger version of) the RGO access model, [MFWB19] obtained a runtime which in the best case scales as $\widetilde{O}(\kappa^2 d)$, similar to the guarantee of our base sampler in Theorem 2. However, this result required a variety of additional assumptions, such as access to the normalization factor of restricted Gaussians, Lipschitzness of $g$, warmness of the start, and various problem parameter tradeoffs. The general problem of sampling from (1) under minimal assumptions more efficiently than general-purpose logconcave algorithms was, to the best of our knowledge, unresolved before this work (even under restricted Gaussian oracle access); resolving it is a novel contribution of our mixing time bound. Our results also suggest that the RGO is a natural notion of tractability for the composite sampling problem.
Logconcave finite sums.
Since [WT11] proposed the stochastic gradient Langevin dynamics, which at each step stochastically estimates the full gradient of the function, there has been a long line of work giving bounds for this method and other similar algorithms [DK19, GGZ18, SKR19, BCM+18, NF19]. These convergence rates depend heavily on the variance of the stochastic estimates. Inspired by variance-reduced methods in convex optimization, samplers based on low-variance estimators have also been proposed [DRW+16, DSM+16, BFR+19, BFFN19, NDH+17, CWZ+17, ZXG18, CFM+18]. Although our reduction-based approach is not designed specifically for solving problems of finite sum structure, our speedup can be viewed as due to a lower variance estimator implicitly defined through the oracle subproblems of Theorem 1 via repeated re-centering.
In Table 1, we list prior runtimes [ZXG18, CFM+18] for sampling logconcave finite sums; note these results additionally require bounded higher derivatives (with the exception of one dependence), obtain guarantees only in Wasserstein distance, and have polynomial dependences on $\epsilon^{-1}$. On the other hand, our reduction-based approach obtains total variation bounds with linear dependence on $\kappa$ and polylogarithmic dependence on $\epsilon^{-1}$. Our bound also simultaneously matches the state-of-the-art bound when $n = 1$, a feature not shared by prior stochastic algorithms. To our knowledge, no previous nontrivial bounds were known in the high-accuracy regime before our work. (As mentioned previously, one can always compute the full gradient of $F$ in every iteration in a well-conditioned sampler.)
Preliminary version.
A preliminary version of this work, containing the results of Section 5, appeared as [STL20]. The preliminary version also contained an experimental evaluation of the algorithm in Section 5 for the task of sampling a (non-diagonal covariance) multivariate Gaussian restricted to a box, and demonstrated the efficacy of our method in comparison to general-purpose logconcave samplers (i.e. the hit-and-run method [LV06c]). The focus of the present version is giving theoretical guarantees for structured logconcave sampling tasks, so we omit empirical evaluations, and defer an evaluation of the new methods developed in this paper to interesting follow-up work.
1.3 Technical overview
1.3.1 Composite logconcave sampling
We study the problem of approximately sampling from a distribution on , with density
$$\frac{d\pi}{dx}(x) \propto \exp\left(-f(x) - g(x)\right). \tag{1}$$
Here, $f$ is assumed to be "well-behaved" (i.e. has finite condition number), and $g$ is a convex, but possibly non-smooth function. This problem generalizes the special case of sampling from $\propto \exp(-f(x))$ for well-conditioned $f$, simply by letting $g$ vanish. Even the specialization of (1) where $g$ indicates a convex set $\mathcal{X}$ (i.e. $g(x) = 0$ for $x \in \mathcal{X}$, and $g(x) = \infty$ outside) is not well-understood; existing mixing time bounds for this restricted setting are large polynomials in $d$ [BDMP17, BEL18], and are typically weaker than guarantees in the general logconcave setting [LV06c, LV06b]. This is in contrast to the convex optimization setting, where first-order methods readily generalize to solve problem families such as $\min_{x \in \mathcal{X}} f(x)$, where $\mathcal{X}$ is a convex set, as well as its generalization
$$\min_{x \in \mathbb{R}^d} f(x) + g(x). \tag{2}$$
We defined proximal oracles in Definition 3; in short, they are procedures which minimize the sum of a quadratic and $g$. Definition 3 is desirable as many natural non-smooth composite objectives arising in learning settings, such as the Lasso [Tib96] and elastic net [ZH05], admit efficient proximal oracles. It is clear that the definition of a proximal oracle implies it can also handle arbitrary sums of linear functions and quadratics, as the resulting function can be rewritten as the sum of a constant and a single quadratic. The seminal work [BT09] extends fast gradient methods to solve (2) via proximal oracles, and has prompted many follow-up studies.
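As a concrete illustration (a standard fact, stated in our notation), the proximal oracle for the $\ell_1$ regularizer $g(x) = \gamma\|x\|_1$ has the closed-form "soft thresholding" solution, which is what makes Lasso-type objectives cheap to handle in this framework:

```python
import numpy as np

def prox_l1(v, lam, gamma):
    """Proximal oracle for g(x) = gamma * ||x||_1, i.e.
        argmin_x { gamma * ||x||_1 + ||x - v||^2 / (2 * lam) }.
    The problem separates over coordinates, and each one-dimensional
    subproblem is solved by soft thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - lam * gamma, 0.0)
```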
Motivated by the success of the proximal oracle framework in convex optimization, we study sampling from the family (1) through the lens of RGOs, a natural extension of Definition 3. The main result of Section 5 is a “base” algorithm efficiently sampling from (1), assuming access to an RGO for . We now survey the main components of this algorithm.
Reduction to shared minimizers. We first observe that without loss of generality, $f$ and $g$ share a minimizer: by shifting $f$ and $g$ by linear terms, i.e. $\tilde{f}(x) \gets f(x) - \langle\nabla f(x^*), x\rangle$ and $\tilde{g}(x) \gets g(x) + \langle\nabla f(x^*), x\rangle$, where $x^*$ minimizes $f + g$, first-order optimality implies both $\tilde{f}$ and $\tilde{g}$ are minimized by $x^*$. Moreover, implementation of a first-order oracle for $\tilde{f}$ and an RGO for $\tilde{g}$ are immediate without additional assumptions. This modification becomes crucial for our later developments, and we hope this simple observation, reminiscent of "variance reduction" techniques in stochastic optimization [JZ13], is broadly applicable to improving algorithms for the sampling problem induced by (1).
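A minimal sketch of this reduction (our code; `grad_f` and `rgo_g` are assumed oracles for the original $f$ and $g$): the linear shift changes the gradient oracle by a constant vector, and changes an RGO query only by re-centering, since $\langle c, x\rangle + \frac{1}{2\lambda}\|x - v\|_2^2 = \frac{1}{2\lambda}\|x - (v - \lambda c)\|_2^2 + \mathrm{const}$.

```python
def shift_to_shared_minimizer(grad_f, rgo_g, x_star):
    """Given a gradient oracle for f, an RGO for g, and x* minimizing f + g,
    returns oracles for f~(x) = f(x) - <c, x> and g~(x) = g(x) + <c, x>
    with c = grad_f(x*), so that f~ and g~ are both minimized at x*."""
    c = grad_f(x_star)
    grad_f_tilde = lambda x: grad_f(x) - c
    # exp(-g(x) - <c, x> - ||x - v||^2 / (2*lam)) is proportional to
    # exp(-g(x) - ||x - (v - lam*c)||^2 / (2*lam)): re-center the RGO query.
    rgo_g_tilde = lambda lam, v: rgo_g(lam, v - lam * c)
    return grad_f_tilde, rgo_g_tilde
```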
Beyond Moreau envelopes: expanding the space. A typical approach in convex optimization in handling a non-smooth objective is to instead optimize its Moreau envelope, defined by
$$g^\eta(x) := \min_{y \in \mathbb{R}^d}\left\{g(y) + \frac{1}{2\eta}\|y - x\|_2^2\right\}. \tag{3}$$
Intuitively, the envelope trades off function value with proximity to $x$; a standard exercise shows that $g^\eta$ is smooth (has a $\frac{1}{\eta}$-Lipschitz gradient), with smoothness depending on $\eta$, and moreover that computing gradients of $g^\eta$ reduces to calling a proximal oracle (Definition 3). It is natural to extend this idea to the composite sampling setting, e.g. via sampling from the density
$$\propto \exp\left(-f(x) - g^\eta(x)\right).$$
However, a variety of complications prevent such strategies from obtaining rates comparable to their non-composite, well-conditioned counterparts, including difficulty in bounding closeness of the resulting distribution, as well as biased drifts of the sampling process due to error in gradients.
Our approach departs from this smoothing strategy in a crucial way, inspired by Hamiltonian Monte Carlo (HMC) methods [Kra40, Nea11]. HMC can be seen as a discretization of the ubiquitous Langevin dynamics, on an expanded space. In particular, discretizations of Langevin dynamics simulate the stochastic differential equation $dx_t = -\nabla f(x_t)\,dt + \sqrt{2}\,dW_t$, where $W_t$ is Brownian motion. HMC methods instead simulate dynamics on an extended space, via an auxiliary "velocity" variable which accumulates gradient information. This is sometimes interpreted as a discretization of the underdamped Langevin dynamics [CCBJ18]. HMC often has desirable stability properties, and expanding the dimension via an auxiliary variable has been used in algorithms obtaining the fastest rates in the well-conditioned logconcave sampling regime [SL19, LST20]. Inspired by this phenomenon, we consider the density on $\mathbb{R}^{2d}$
$$\frac{d\hat{\pi}}{d(y, x)}(y, x) \propto \exp\left(-f(y) - g(x) - \frac{1}{2\eta}\|y - x\|_2^2\right). \tag{4}$$
Due to technical reasons, the family of distributions we use in our final algorithms are of slightly different form than (4), but this simplification is useful to build intuition. Note in particular that the form of (4) is directly inspired by (3), where rather than maximizing over the auxiliary variable in the exponent, we directly expand the space. The idea is that for small enough $\eta$ and a set of $x$ of large measure, smoothness of $f$ will guarantee that the $x$-marginal of (4) will concentrate near the desired density $\propto \exp(-f(x) - g(x))$, a fact we make rigorous. To sample from (1), we then show that a rejection filter applied to a sample from the $x$-marginal of (4) will terminate in a constant number of steps in expectation. Consequently, it suffices to develop a fast sampler for (4).
Alternating sampling with an oracle. The form of the distribution (4) suggests a natural strategy for sampling from it: starting from a current state $x_k$, we iterate
1. Sample $y_k \sim \exp\left(-f(y) - \frac{1}{2\eta}\|y - x_k\|_2^2\right)$.
2. Sample $x_{k+1} \sim \exp\left(-g(x) - \frac{1}{2\eta}\|x - y_k\|_2^2\right)$, via a restricted Gaussian oracle.
When $f$ and $g$ share a minimizer, taking a first-order approximation in the first step, i.e. sampling $y_k \sim \mathcal{N}(x_k - \eta\nabla f(x_k), \eta I)$, can be shown to generalize the leapfrog step of HMC updates. However, for very small $\eta$ (as in our setting), we observe the first step itself reduces to the case of sampling from a distribution with constant condition number, performable in few gradient calls by e.g. Metropolized HMC [DCWY18, CDWY19, LST20]. Moreover, it is not hard to see that this "alternating marginal" sampling strategy preserves the stationary distribution exactly, so no filtering is necessary. Directly bounding the conductance of this random walk, for small enough $\eta$, leads to an algorithm running in $\widetilde{O}(\kappa^2 d)$ iterations, each calling an RGO once, and a gradient oracle for $f$ roughly $\sqrt{d}$ times. This latter guarantee is by an appeal to known bounds [CDWY19, LST20] on the mixing time in high dimensions of Metropolized HMC for a well-conditioned distribution, a property satisfied by the $y$-marginal of (4) for small $\eta$.
Stability of Gaussians under bounded perturbations. To obtain our tightest runtime result, we use that $\eta$ is chosen to be much smaller than $\frac{1}{L}$ to show structural results about distributions of the form (4), yielding tighter concentration for bounded perturbations of a Gaussian (i.e. the Gaussian has covariance $\eta I$, and is restricted by $L$-smooth $f$ for $\eta \ll \frac{1}{L}$). To illustrate, let
$$\frac{d\pi'}{dy}(y) \propto \exp\left(-f(y) - \frac{1}{2\eta}\|y - x\|_2^2\right),$$
and let its mean and mode be $\bar{y}$, $y^*$. It is standard that $\|\bar{y} - y^*\|_2 \le \sqrt{\eta d}$, by $\frac{1}{\eta}$-strong logconcavity of $\pi'$. Informally, we show that for $\eta$ sufficiently small and $x$ not too far from the minimizer of $f$, we can improve this bound substantially; see Proposition 12 for a precise statement.
Using our structural results, we sharpen conductance bounds, improve the warmness of a starting distribution, and develop a simple rejection sampling scheme for sampling the $y$ variable in an expected constant number of gradient queries. Our proofs are continuous in flavor and based on gradually perturbing the Gaussian and solving a differential inequality; we believe they may be of independent interest. These improvements lead to an algorithm running in $\widetilde{O}(\kappa^2 d)$ iterations; ultimately, we use our reduction framework, stated in Theorem 1, to improve this dependence to $\widetilde{O}(\kappa d)$.
1.3.2 Logconcave finite sums
We initiate the algorithmic study of the following task in the high-accuracy regime: sample within total variation distance $\epsilon$ from the distribution $\pi$ with $\frac{d\pi}{dx}(x) \propto \exp(-F(x))$, where
$$F(x) := \frac{1}{n}\sum_{i \in [n]} f_i(x), \tag{5}$$
all $\{f_i\}_{i \in [n]}$ are convex and $L$-smooth, and $F$ is $\mu$-strongly convex. We call such a distribution a (well-conditioned) logconcave finite sum.
In applications (where summands correspond to points in a dataset, e.g. in Bayesian linear and logistic regression tasks [DCWY18]), querying $\nabla F$ may be prohibitively expensive, so a natural goal is to obtain bounds on the number of required queries to summands $\{f_i\}_{i \in [n]}$. This motivation also underlies the development of stochastic gradient methods in optimization, a foundational tool in modern statistics and data processing. Naïvely, one can complete the task by using existing samplers for well-conditioned distributions and querying the full gradient in each iteration, resulting in a summand gradient query complexity of $\widetilde{O}(n\kappa d)$ [LST20]. Many recent works, inspired by recent developments in the complexity of optimizing a well-conditioned finite sum, have developed subsampled gradient methods for the sampling problem. However, to our knowledge, all such guarantees depend polynomially on the accuracy $\epsilon^{-1}$ and are measured in the 2-Wasserstein distance; in the high-accuracy, total variation case no nontrivial query complexity is currently known.
We show in Section 6 that given access to the minimizer $x^*$ of $F$, a simple zeroth-order algorithm which queries $\widetilde{O}(n + \kappa^2 d)$ values of summands succeeds (i.e. it never requires a full value or gradient query of $F$). The algorithm is essentially the Metropolized random walk proposed in [DCWY18] for the $n = 1$ case, with a cheaper subsampled filter step. Notably, because the random walk is conducted with respect to $F$, we cannot efficiently query the function value at any point; nonetheless, by randomly sampling summands to compute a nearly-unbiased estimator of the rejection probability, we do not incur too much error. This random walk was shown in [CDWY19] to mix in $\widetilde{O}(\kappa^2 d)$ iterations; we implement each step to sufficient accuracy using subsampled function evaluations.
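The following caricature (our simplification; Algorithm 6 controls the bias of this estimate much more carefully, e.g. via variance reduction around $x^*$) illustrates one step of a Metropolized random walk whose accept/reject filter is evaluated on a random batch of summands rather than on the full $F$:

```python
import numpy as np

def subsampled_mrw_step(x, fs, step, batch, rng):
    """One Metropolized random walk step for the density exp(-F), where
    F = mean(fs), with the filter *estimated* from `batch` random summands.
    Note the estimate of F(x') - F(x) is unbiased, but becomes biased once
    exponentiated; the actual algorithm must control this error carefully."""
    n = len(fs)
    x_prop = x + np.sqrt(2 * step) * rng.standard_normal(x.shape)
    idx = rng.integers(0, n, size=batch)
    # Estimate F(x_prop) - F(x) from `batch` summand value queries.
    delta_hat = np.mean([fs[i](x_prop) - fs[i](x) for i in idx])
    accept_prob = min(1.0, np.exp(-delta_hat))
    return x_prop if rng.random() < accept_prob else x
```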
It is natural to ask if first-order information can be used to improve this query complexity, perhaps through "variance reduction" techniques (e.g. [JZ13]) developed for stochastic optimization. The idea behind variance reduction is to recenter gradient estimates in phases, occasionally computing full gradients to improve the estimate quality. One fundamental difficulty which arises from using variance reduction in high-accuracy sampling is that the resulting algorithms are not stateless. By this, we mean that the variance-reduced estimates depend on the history of the algorithm, and thus it is difficult to ascertain correctness of the stationary distribution. We take a different approach to achieve a linear query dependence on the conditioning $\kappa$, described in the following section.
1.3.3 Proximal point reduction framework
To motivate Theorem 1, we first recast existing "proximal point" reduction-based optimization methods through the lens of composite optimization, and subsequently show that similar ideas underlying our composite sampler in Section 1.3.1 yield an analogous "proximal point reduction framework" for sampling. Chronologically, our composite sampler (originally announced in [STL20]) predates our reduction framework, which was then inspired by the perspective given here. We hope these insights prove fruitful for further development of proximal approaches to sampling.
Proximal point methods as composite optimization.
Proximal point methods are a well-studied primitive in optimization, developed by [Roc76]; cf. [PB14] for a modern survey. The principal idea is that to minimize convex $F$, it suffices to solve a sequence of subproblems
$$x_{k+1} \gets \mathrm{argmin}_{x \in \mathbb{R}^d}\left\{F(x) + \frac{1}{2\lambda}\|x - x_k\|_2^2\right\}. \tag{6}$$
Intuitively, by tuning the parameter $\lambda$, we trade off how regularized the subproblems (6) are with how rapidly the overall method converges. Smaller values of $\lambda$ result in larger regularization amounts, which are amenable to algorithms for minimizing well-conditioned objectives.
For optimizing functions of the form (5) via stochastic gradient estimates to $\epsilon$ error, [JZ13, DBL14, SRB17] developed variance-reduced methods obtaining a query complexity of $\widetilde{O}(n + \kappa)$. To match a known lower bound of $\Omega(n + \sqrt{n\kappa})$ due to [WS16], two works [LMH15, FGKS15] appropriately applied instances of accelerated proximal point methods [Gul92] with careful analyses of how accurately subproblems (6) needed to be solved. These algorithms black-box called the $\widetilde{O}(n + \kappa)$ runtime as an oracle to solve the subproblems (6) for an appropriate choice of $\lambda$, obtaining an accelerated rate. (We note that an improved runtime without extraneous logarithmic factors was later obtained by [All17].) To shed some light on this acceleration procedure, we adopt an alternative view on proximal point methods. (This perspective can also be found in the lecture notes [Lee18].) Consider the following known composite optimization result.
Proposition 1 (Informal statement of [BT09]).
Let $f$ be $L$-smooth, let $f + g$ be $\mu$-strongly convex, and let $g$ admit a proximal oracle $\mathcal{O}_{\mathrm{prox}}$ (cf. Definition 3). There is an algorithm taking $O\left(\sqrt{\frac{L}{\mu}}\log\frac{1}{\epsilon}\right)$ iterations to find an $\epsilon$-approximate minimizer to $f + g$, each querying $\nabla f$ and $\mathcal{O}_{\mathrm{prox}}$ a constant number of times. Further, $\lambda = \Theta\left(\frac{1}{L}\right)$ in all calls to $\mathcal{O}_{\mathrm{prox}}$.
Ignoring subtleties of the error tolerance of $\mathcal{O}_{\mathrm{prox}}$, we show how to use an instance of Proposition 1 to recover the $\widetilde{O}(n + \sqrt{n\kappa})$ query complexity for optimizing (5). Let $f(x) := \frac{\lambda}{2}\|x\|_2^2$, and $g(x) := F(x) - \frac{\lambda}{2}\|x\|_2^2$. For any $\lambda$, $f$ is both $\lambda$-strongly convex and $\lambda$-smooth. Moreover, note that all calls to the proximal oracle for $g$ require solving subproblems of the form
$$\mathrm{argmin}_{x \in \mathbb{R}^d}\left\{F(x) - \frac{\lambda}{2}\|x\|_2^2 + \frac{\Theta(\lambda)}{2}\|x - v\|_2^2\right\}. \tag{7}$$
The upshot of choosing a larger smoothness bound $\lambda$ is that the regularization amount in (7) increases, improving the conditioning of the subproblem, which is $\Theta(\lambda)$-strongly convex and $O(L)$-smooth. The algorithm of e.g. [JZ13] solves each subproblem (7) in $\widetilde{O}\left(n + \frac{L}{\lambda}\right)$ gradient queries, leading to an overall query complexity (for Proposition 1) of
$$\widetilde{O}\left(\left(n + \frac{L}{\lambda}\right)\sqrt{\frac{\lambda}{\mu}}\right).$$
Optimizing over $\lambda$, i.e. taking $\lambda = \Theta\left(\max\left(\mu, \frac{L}{n}\right)\right)$, yields the desired bound of $\widetilde{O}(n + \sqrt{n\kappa})$.
Applications to sampling.
In Sections 5 and 6, we develop samplers for structured families with quadratic dependence on the conditioning $\kappa$. The proximal point approach suggests a strategy for accelerating these runtimes. Namely, if there is a framework which repeatedly calls a sampler for a regularized density (analogous to calls to (6)), one could trade off the regularization with the rate of the outer loop. Fortunately, in the spirit of interpreting proximal point methods as composite optimization, the composite sampler of Section 5 itself meets these reduction framework criteria.
We briefly recall properties of our composite sampler here. Let $\pi$ be a distribution on $\mathbb{R}^d$ with $\frac{d\pi}{dx}(x) \propto \exp(-f(x) - g(x))$ (to disambiguate, we sometimes also use the notation $f^{\mathrm{oracle}}$ rather than $g$ in defining instances of our reduction framework or composite samplers, when convenient in the context), where $f$ is well-conditioned (has finite condition number $\kappa$) and $g$ admits an RGO, which solves subproblems of the form
$$\text{sample } x \sim \exp\left(-g(x) - \frac{1}{2\eta}\|x - v\|_2^2\right). \tag{8}$$
The algorithm of Section 5 only calls $\mathcal{O}$ with a fixed value of $\eta$; note the strong parallel between the RGO subproblem and the proximal oracle of Proposition 1. For a given value of $\eta$, our composite sampler runs in $\widetilde{O}\left(\frac{1}{\eta\mu}\right)$ iterations, each requiring a call to $\mathcal{O}$. Smaller $\eta$ improve the conditioning of the negative log-density of subproblem (8), but increase the overall iteration count, yielding a range of trade-offs. The algorithm of Section 5 has an upper bound requirement on $\eta$ (cf. Theorem 2); in Section 3, we observe that this may be lifted when $f = 0$ uniformly, allowing for a full range of choices. Moreover, the analysis of the composite sampler becomes much simpler when $f = 0$, as in Theorem 1. We give the framework as Algorithm 1, as well as a full (fairly short) convergence analysis. By trading off the regularization amount with the cost of implementing (8) via bootstrapping base samplers, we obtain a host of improved runtimes.
Beyond our specific applications, the framework we provide has strong implications as a generic reduction from mixing times scaling polynomially in $\kappa$ to improved methods scaling linearly in $\kappa$. This is akin to the observation in [LMH15] that accelerated proximal point methods generically improve $\mathrm{poly}(\kappa)$ dependences to $\sqrt{\kappa}$ dependences for optimization. We are optimistic this insight will find further implications in the logconcave sampling literature.
1.4 Erratum, and a word of warning for mixing
The initial version of this paper, presented at COLT 2021, had an incorrect proof of Theorem 1. This was due to our reliance on the average conductance ("spectral profile") technique of [LK99] for bounding mixing. Roughly speaking, the mistake was caused by a misunderstanding that for stationary measures satisfying a log-isoperimetric inequality (for example, $\mu$-strongly logconcave densities) and with transition distributions of sufficiently close points having constant overlap, [LK99] provides mixing time bounds of the form (where $\beta$ is a warmness parameter of the starting distribution)
$$t_{\mathrm{mix}}(\epsilon, \beta) = O\left(\int_{4/\beta}^{1/2}\frac{\mathrm{d}s}{s\,\Phi(s)^2} + \frac{\log\frac{1}{\epsilon}}{\Phi\left(\frac{1}{2}\right)^2}\right). \tag{9}$$
Here, $\Phi(s)$ is the $s$-conductance of the Markov chain, which can typically be lower bounded under a stationary density exhibiting log-isoperimetry. However, the trivial bound $\Phi(s) \le 1$ demonstrates that there is an additive $\Omega(\log\beta)$ term in (9). This is a bottleneck towards mixing times scaling sublinearly in the dimension for distributions where only an exponentially warm start is feasible; in particular, it is the conductance of exponentially small sets which governs this additive term. In settings where $\log\beta$ is large (such as our reductions, where this term often scales as a condition number of the problem), this additive term may dominate. This observation (and the fix) came out of conversations with Sinho Chewi; we are immensely grateful for his help.
For the particular structure of the algorithm in Theorem 1, we are able to give an alternative analysis going through Wasserstein and KL divergence convergence bounds, preserving the correctness of the theorem. However, this bottleneck is a general phenomenon which may cause future attempts to use Metropolized algorithms from exponentially warm starts to be stuck at $\Omega(d)$ iterations, which merits further investigation. We write this section as a word of warning to future researchers aiming at sublinear dimension dependences in Metropolized algorithms, and as a suggested open research direction.
1.5 Roadmap
We give notations and preliminaries in Section 2. In Section 3 we give our framework for bootstrapping fast regularized samplers, and prove its correctness (Theorem 1). Assuming the “base samplers” of Theorems 2 and 3, in Section 4 we apply our reduction to obtain all of our strongest guarantees, namely Corollaries 1, 2, and 3. We then prove Theorems 2 and 3 in Sections 5 and 6.
2 Preliminaries
2.1 Notation
General notation.
For $n \in \mathbb{N}$, $[n]$ refers to the set of naturals $\{1, 2, \ldots, n\}$; $\|\cdot\|_2$ is the Euclidean norm on $\mathbb{R}^d$ when the dimension $d$ is clear from context. $\mathcal{N}(\mu, \Sigma)$ is the multivariate Gaussian with specified mean and covariance, $I$ is the identity matrix of appropriate dimension when clear from context, and $\preceq$ is the Loewner order on symmetric matrices.
Functions.
We say a twice-differentiable function $f: \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth and $\mu$-strongly convex if $\mu I \preceq \nabla^2 f(x) \preceq L I$ for all $x \in \mathbb{R}^d$; it is well-known that $L$-smoothness implies that $f$ has an $L$-Lipschitz gradient, and that for any $x, y \in \mathbb{R}^d$,
$$\frac{\mu}{2}\|y - x\|_2^2 \le f(y) - f(x) - \langle\nabla f(x), y - x\rangle \le \frac{L}{2}\|y - x\|_2^2.$$
If $f$ is $L$-smooth and $\mu$-strongly convex, we say it has a condition number $\kappa := \frac{L}{\mu}$. We call a zeroth-order oracle, or "value oracle", an oracle which returns $f(x)$ on any input point $x$; similarly, a first-order oracle, or "gradient oracle", returns both the value $f(x)$ and gradient $\nabla f(x)$.
Distributions.
We call a distribution $\pi$ on $\mathbb{R}^d$ logconcave if $\frac{d\pi}{dx}(x) \propto \exp(-f(x))$ for convex $f$; $\pi$ is $\mu$-strongly logconcave if $f$ is $\mu$-strongly convex. For $A \subseteq \mathbb{R}^d$, $A^c$ is its complement, and we let $\pi(A) := \Pr_{x \sim \pi}[x \in A]$. We say distribution $\pi'$ is $\beta$-warm with respect to $\pi$ if $\frac{d\pi'}{d\pi}(x) \le \beta$ everywhere, and define the total variation distance $\|\pi' - \pi\|_{TV} := \sup_{A \subseteq \mathbb{R}^d}|\pi'(A) - \pi(A)|$. We will frequently use the fact that $\|\pi' - \pi\|_{TV}$ is also the probability that $x'$ and $x$ are unequal under the best coupling of $(\pi', \pi)$; this allows us to "locally share randomness" when comparing two random walk procedures. We define the expectation and variance with respect to a distribution in the standard way,
$$\mathbb{E}_\pi[h] := \int h(x)\,d\pi(x), \qquad \mathrm{Var}_\pi[h] := \mathbb{E}_\pi\left[h^2\right] - \left(\mathbb{E}_\pi[h]\right)^2.$$
Structured distributions.
This work considers two types of distributions with additional structure, which we call composite logconcave densities and logconcave finite sums. A composite logconcave density has the form $\propto \exp(-f(x) - g(x))$, where both $f$ and $g$ are convex. In this context throughout, $f$ will either be $0$ uniformly or have a finite condition number (be "well-conditioned"), and $g$ will represent a "simple" but possibly non-smooth function, as measured by admitting an RGO (cf. Definition 1). We will sometimes refer to the components $f$ and $g$ as $f^{\mathrm{wc}}$ and $f^{\mathrm{oracle}}$ respectively, to disambiguate when the functions $f$ and $g$ are already defined in context. In our reduction-based approaches, we have additional structure on the parameter $\eta$ which an RGO is called with (cf. Definition 2). Specifically, in our instances typically $\frac{1}{\eta}$ is significantly larger than $L$ (or some other "niceness" parameter associated with the negative log-density); this can be seen as heavily regularizing the negative log-density, and often makes the implementation simpler.
Finally, a logconcave finite sum has density of the form $\propto \exp(-F(x))$ where $F = \frac{1}{n}\sum_{i \in [n]} f_i$. When treating such densities, we make the assumption that each constituent summand $f_i$ is $L$-smooth and convex, and the overall function $F$ is $\mu$-strongly convex. We measure complexity of algorithms for logconcave finite sums by gradient queries to summands, i.e. $\nabla f_i(x)$ for some $i \in [n]$.
Optimization.
Throughout this work, we are somewhat liberal with assuming access to minimizers to various functions (namely, the negative log-densities of target distributions). We give a more thorough discussion of this assumption in Appendix A, but note here that for all function families we consider (well-conditioned, composite, and finite sum), efficient first-order methods exist for obtaining high accuracy minimizers, and this optimization query complexity is never the leading-order term in any of our algorithms assuming polynomially bounded initial error.
2.2 Technical facts
We will repeatedly use the following results.
Fact 1 (Gaussian integral).
For any $\lambda > 0$ and $v \in \mathbb{R}^d$,
$$\int_{\mathbb{R}^d}\exp\left(-\frac{1}{2\lambda}\|x - v\|_2^2\right)dx = (2\pi\lambda)^{\frac{d}{2}}.$$
Fact 2 ([DCWY18], Lemma 1).
Let $\pi$ be a $\mu$-strongly logconcave distribution, and let $x^*$ minimize its negative log-density. Then, for $x \sim \pi$ and any $\delta \in (0, 1)$, with probability at least $1 - \delta$,
$$\|x - x^*\|_2 \le \sqrt{\frac{d}{\mu}} + \sqrt{\frac{2\log\frac{1}{\delta}}{\mu}}.$$
Fact 3 ([Har04], Theorem 1.1).
Let $\pi$ be a $\mu$-strongly logconcave density, and let $\gamma$ be the Gaussian density with covariance matrix $\frac{1}{\mu}I$. For any convex function $h$,
$$\mathbb{E}_{x \sim \pi}\left[h\left(x - \mathbb{E}_\pi[x]\right)\right] \le \mathbb{E}_{x \sim \gamma}\left[h(x)\right].$$
Fact 4 ([DM19a], Theorem 1).
Let $\pi$ be a $\mu$-strongly logconcave distribution, and let $x^*$ minimize its negative log-density. Then, $\mathbb{E}_{x \sim \pi}\left[\|x - x^*\|_2^2\right] \le \frac{d}{\mu}$.
3 Proximal reduction framework
The reduction framework of Theorem 1 can be thought of as a specialization of a more general composite sampler which we develop in Section 5, whose guarantees are reproduced here.
Theorem 2 (restated; see Section 1.1).
Our main observation, elaborated on more formally for specific applications in Section 4, is that a variety of structured logconcave densities have negative log-densities $f^{\mathrm{oracle}}$, where we can implement an efficient restricted Gaussian oracle for $f^{\mathrm{oracle}}$ via calling an existing sampling method. Crucially, in these instantiations we use the fact that the distributions which $\mathcal{O}$ is required to sample from are heavily regularized (restricted by a quadratic with large leading coefficient) to obtain fast samplers. We further note that the upper bound requirement on $\eta$ in Theorem 2 can be lifted when the "well-conditioned" component is uniformly $0$. Instead of setting $f = 0$ and $g = f^{\mathrm{oracle}}$ in Theorem 2, and refining the analysis for this special case to tolerate arbitrary $\eta$, we provide a self-contained proof here. This particular structure (the composite setting where $f$ is uniformly zero and $g$ is strongly convex) admits significant simplifications from the more general case, so using slightly different proof techniques, we are able to obtain stronger convergence guarantees for this particular problem, allowing for mixing in fewer iterations than the warm-start bottleneck discussed in Section 1.4 would otherwise permit.
Theorem 1 (restated; see Section 1.1).
For simplicity of notation, we replace $f^{\mathrm{oracle}}$ in the statement of Theorem 1 with $f$ throughout just this section. Let $\pi$ be a density on $\mathbb{R}^d$ with $\frac{d\pi}{dx}(x) \propto \exp(-f(x))$ where $f$ is $\mu$-strongly convex (but possibly non-smooth), and let $\mathcal{O}$ be a restricted Gaussian oracle for $f$. Consider the joint distribution $\hat{\pi}$ supported on an expanded space $\mathbb{R}^d \times \mathbb{R}^d$ with density, for some $\eta > 0$,
$$\frac{d\hat{\pi}}{d(x, y)}(x, y) \propto \exp\left(-f(x) - \frac{1}{2\eta}\|x - y\|_2^2\right).$$
Note that the $x$-marginal of $\hat{\pi}$ is precisely $\pi$, so it suffices to sample from the $x$-marginal. We consider a simple alternating Markov chain for sampling from $\hat{\pi}$, described in the following Algorithm 1.
Input: $\mu$-strongly convex $f$, step size $\eta > 0$, iteration count $T$, initial point $x_0$; access to a restricted Gaussian oracle $\mathcal{O}$ for $f$. For $0 \le k < T$: (Line 2) sample $y_k \sim \mathcal{N}(x_k, \eta I)$; (Line 3) sample $x_{k+1} \sim \mathcal{O}(\eta, y_k)$, i.e. from the density $\propto \exp\left(-f(x) - \frac{1}{2\eta}\|x - y_k\|_2^2\right)$.
By observing that the distributions sampled in Lines 2 and 3 of the above method are precisely the marginal distributions of $\hat{\pi}$ with one variable fixed, it is straightforward to see that $\hat{\pi}$ is a stationary distribution of the process. We make this formal in the following lemma.
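In code, one iteration of this scheme is just a Gaussian perturbation followed by an oracle call (a minimal sketch, assuming `rgo_f(eta, y)` returns an exact sample from the density $\propto \exp(-f(x) - \frac{1}{2\eta}\|x - y\|_2^2)$):

```python
import numpy as np

def alternate_sample(rgo_f, x0, eta, T, rng=np.random.default_rng()):
    """Alternating marginal sampler for the joint density
    pi-hat(x, y) ~ exp(-f(x) - ||x - y||^2 / (2 * eta)).

    Line 2: y_k | x_k is Gaussian (the y-conditional given x).
    Line 3: x_{k+1} | y_k is exactly the restricted Gaussian oracle call.
    By Lemma 1, the x-marginal pi ~ exp(-f) is stationary, with no filter."""
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        y = x + np.sqrt(eta) * rng.standard_normal(x.shape)   # Line 2
        x = rgo_f(eta, y)                                     # Line 3 (RGO)
    return x
```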
Lemma 1 (Alternating marginal sampling).
Let $\hat{\pi}$ be a density on two blocks $(x, y)$. Sample $x \sim \hat{\pi}_x$, the marginal density of $x$, and then sample $y \sim \hat{\pi}_{y \mid x}$, the conditional density of $y$ given $x$. Then, the pair $(x, y)$ is distributed as $\hat{\pi}$. Moreover, the alternating marginal sampling Markov chain on either marginal is reversible.
Proof.
The density of the resulting distribution at $(x, y)$ is proportional to the product of the marginal density at $x$ and the conditional density of $y$ given $x$, which by definition is $\hat{\pi}(x, y)$. Therefore, $(x, y)$ is distributed as $\hat{\pi}$, and the argument when the order of sampling is swapped follows symmetrically. To see reversibility on the $x$ marginal, it suffices to note that $\hat{\pi}_x(x)$ times the probability we move from $x$ to $x'$ is
$$\hat{\pi}_x(x)\int \hat{\pi}_{y \mid x}(y \mid x)\,\hat{\pi}_{x \mid y}(x' \mid y)\,dy = \int \frac{\hat{\pi}(x, y)\,\hat{\pi}(x', y)}{\hat{\pi}_y(y)}\,dy,$$
which is a symmetric function of $x$ and $x'$, verifying detailed balance. A similar argument holds for the $y$ marginal. ∎
We also state a simple observation about alternating schemes such as Algorithm 1, which will be useful later. For a starting point $x$, let $\mathcal{Q}_x$ be the density of $y$ after one step of Line 2 of the above procedure starting from $x$, and let $\mathcal{P}_x$ be the resulting density of the next iterate after Line 3.
Observation 1.
For any two points $x, x' \in \mathbb{R}^d$, $\|\mathcal{P}_x - \mathcal{P}_{x'}\|_{TV} \le \|\mathcal{Q}_x - \mathcal{Q}_{x'}\|_{TV}$.
Proof.
This follows by the coupling characterization of total variation (see e.g. Chapter 5 of [LPW09]). Per the optimal coupling of $\mathcal{Q}_x$ and $\mathcal{Q}_{x'}$, whenever the coupled samples of Line 2 of AlternateSample agree, we can couple the resulting distributions in Line 3 as well. ∎
In order to prove Theorem 1, we first show that the random walk in Algorithm 1 converges rapidly in the 2-Wasserstein distance (denoted $W_2$ in this section).
Lemma 2.
Let $\pi_0$ be the starting distribution of $x_0$ in Algorithm 1. Let $\pi_k$ be the distribution of $x_k$, and let $\pi$ be the $x$-marginal of $\hat{\pi}$. For all $k \ge 0$,
$$W_2(\pi_k, \pi) \le \left(\frac{1}{1 + \eta\mu}\right)^k W_2(\pi_0, \pi).$$
Hence, for any $\epsilon' > 0$, in $O\left(\frac{1}{\eta\mu}\log\frac{d}{\mu(\epsilon')^2}\right)$ iterations, the random walk mixes to $W_2(\pi_k, \pi) \le \epsilon'$.
Proof.
Let $(x_k, x)$ be the optimal coupling between $\pi_k$ and $\pi$ according to the $W_2$ distance. Coupling the Gaussian random variable generating $y_k$ from $x_k$ and $y$ from $x$ (i.e. sharing the noise $\xi \sim \mathcal{N}(0, \eta I)$) gives a coupling between $y_k$ and $y$ such that
$$\|y_k - y\|_2 = \|x_k - x\|_2. \tag{10}$$
Then, let $\mathcal{P}_y$ be the distribution of $x_{k+1}$ in a run of Line 3 of Algorithm 1 starting from $y_k = y$, and $\mathcal{P}_{y'}$ be the distribution of $x_{k+1}$ in Line 3 starting from $y_k = y'$, respectively. Since $\mathcal{P}_y \propto \exp\left(-f(x) - \frac{1}{2\eta}\|x - y\|_2^2\right)$ is $\left(\mu + \frac{1}{\eta}\right)$-strongly log-concave, it satisfies a log-Sobolev inequality with constant $\mu + \frac{1}{\eta}$ (Theorem 2 of [OV00]). Hence,
$$W_2(\mathcal{P}_y, \mathcal{P}_{y'}) \le \sqrt{\frac{2 D_{\mathrm{KL}}\left(\mathcal{P}_{y'} \| \mathcal{P}_y\right)}{\mu + \frac{1}{\eta}}} \le \frac{1}{\mu + \frac{1}{\eta}}\left\|\nabla\log\frac{d\mathcal{P}_{y'}}{d\mathcal{P}_y}\right\|_2 = \frac{1}{1 + \eta\mu}\|y - y'\|_2.$$
The first step used the Talagrand transportation inequality (Theorem 1 of [OV00]). The second step used the log-Sobolev inequality. The third step used
$$\nabla\log\frac{d\mathcal{P}_{y'}}{d\mathcal{P}_y}(x) = \frac{1}{\eta}(y' - y), \text{ a constant vector of norm } \frac{1}{\eta}\|y - y'\|_2. \tag{11}$$
Taking expectation over the coupling and using (10) shows that
$$W_2(\pi_{k+1}, \pi) \le \frac{1}{1 + \eta\mu} W_2(\pi_k, \pi).$$
Algorithm 1 starts from the distribution $\pi_0 = \delta_{x^*}$, where $x^* = \mathrm{argmin}_x f(x)$. Since $\pi$ is $\mu$-strongly logconcave, we have (see e.g. Proposition 1 of [DM19b])
$$W_2(\pi_0, \pi)^2 \le \mathbb{E}_{x \sim \pi}\left[\|x - x^*\|_2^2\right] \le \frac{d}{\mu}.$$
Then, for $k \ge \left(1 + \frac{1}{\eta\mu}\right)\log\frac{W_2(\pi_0, \pi)}{\epsilon'}$, we have $(1 + \eta\mu)^{-k}W_2(\pi_0, \pi) \le \epsilon'$, so after $O\left(\frac{1}{\eta\mu}\log\frac{d}{\mu(\epsilon')^2}\right)$ iterations, $W_2(\pi_k, \pi) \le \epsilon'$. ∎
Next, we bound the KL divergence between the output of Algorithm 1 and the target distribution $\pi$. We need the following standard lemma regarding KL divergences of marginal distributions.
Lemma 3.
Let $\{\mathcal{P}_s\}$ and $\{\mathcal{Q}_s\}$ be families of distributions supported on $\Omega$, indexed by $s$, a random variable distributed as $\sigma$. Let $\mathcal{P}^{\mathrm{joint}}$ be the joint distribution of $(s, x)$ for $s \sim \sigma$ and $x \sim \mathcal{P}_s$, and $\mathcal{Q}^{\mathrm{joint}}$ be the joint distribution of $(s, x)$ as $s \sim \sigma$ and $x \sim \mathcal{Q}_s$. Let $\bar{\mathcal{P}}$ and $\bar{\mathcal{Q}}$ be the marginal distributions of $\mathcal{P}^{\mathrm{joint}}$ and $\mathcal{Q}^{\mathrm{joint}}$ on $x$, averaged over $s \sim \sigma$. Then,
$$D_{\mathrm{KL}}\left(\bar{\mathcal{P}} \| \bar{\mathcal{Q}}\right) \le \mathbb{E}_{s \sim \sigma}\left[D_{\mathrm{KL}}\left(\mathcal{P}_s \| \mathcal{Q}_s\right)\right].$$
Proof.
By the definition of $D_{\mathrm{KL}}$ and the chain rule for KL divergence, since the marginals of $\mathcal{P}^{\mathrm{joint}}$ and $\mathcal{Q}^{\mathrm{joint}}$ on $s$ agree,
$$D_{\mathrm{KL}}\left(\mathcal{P}^{\mathrm{joint}} \| \mathcal{Q}^{\mathrm{joint}}\right) = \mathbb{E}_{s \sim \sigma}\left[D_{\mathrm{KL}}\left(\mathcal{P}_s \| \mathcal{Q}_s\right)\right].$$
Finally, by the data processing inequality (applied to the projection $(s, x) \mapsto x$),
$$D_{\mathrm{KL}}\left(\bar{\mathcal{P}} \| \bar{\mathcal{Q}}\right) \le D_{\mathrm{KL}}\left(\mathcal{P}^{\mathrm{joint}} \| \mathcal{Q}^{\mathrm{joint}}\right).$$
∎
The following lemma shows that a 2-Wasserstein distance bound on the distribution at iteration $k$ implies a KL divergence bound at iteration $k + 1$.
Lemma 4.
Let $\pi_k$ be the distribution of $x_k$ in Algorithm 1 for some $k \ge 0$ such that $W_2(\pi_k, \pi) \le \epsilon'$, where $\pi$ is the $x$-marginal of $\hat{\pi}$. Then,
$$D_{\mathrm{KL}}\left(\pi_{k+1} \| \pi\right) \le \frac{(\epsilon')^2}{2\eta}.$$
Proof.
Let $(x_k, x)$ realize the optimal $W_2$ coupling of $\pi_k$ and $\pi$, and let $\rho_k$, $\rho$ be the distributions of $y_k \sim \mathcal{N}(x_k, \eta I)$ and $y \sim \mathcal{N}(x, \eta I)$ respectively; by Lemma 1, applying Line 3 to $y \sim \rho$ yields a sample from $\pi$. By Lemma 3 (indexing the families of Gaussians by the coupled pair) and the formula for the KL divergence between translated Gaussians,
$$D_{\mathrm{KL}}\left(\rho_k \| \rho\right) \le \mathbb{E}\left[\frac{\|x_k - x\|_2^2}{2\eta}\right] = \frac{W_2(\pi_k, \pi)^2}{2\eta} \le \frac{(\epsilon')^2}{2\eta}.$$
Since $\pi_{k+1}$ and $\pi$ are obtained by applying the same Markov kernel (Line 3) to $\rho_k$ and $\rho$, the data processing inequality yields the conclusion. ∎
Combining Lemmas 2 and 4 with appropriate parameter choices, and converting the resulting KL divergence bound to total variation (e.g. via Pinsker's inequality), completes the proof of Theorem 1. We also note that Theorem 1 is robust to a small amount of error tolerance in the sampler $\mathcal{O}$. Specifically, if $\mathcal{O}$ has total variation tolerance $\frac{\epsilon}{2T}$ in each of $T$ calls, then by calling Theorem 1 with desired accuracy $\frac{\epsilon}{2}$ and adjusting constants appropriately, the cumulative error incurred by all calls to $\mathcal{O}$ is within the total requisite bound (formally, this can be shown via the coupling characterization of total variation). We defer a more formal elaboration on this inexactness argument to Appendix A and the proof of Proposition 5.
4 Tighter runtimes for structured densities
In this section, we use applications of Theorem 1 to obtain simple analyses of novel state-of-the-art high-accuracy runtimes for the well-conditioned densities studied in [DCWY18, CDWY19, LST20], as well as the composite and finite sum densities studied in this work. We will assume the conclusions of Theorems 2 and 3 respectively in deriving the results of Sections 4.2 and 4.3.
4.1 Well-conditioned logconcave sampling: proof of Corollary 1
In this section, let $\pi$ be a distribution on $\mathbb{R}^d$ with density proportional to $\exp(-f(x))$, where $f$ is $L$-smooth and $\mu$-strongly convex (and $\kappa := \frac{L}{\mu}$) and has pre-computed minimizer $x^*$. We will instantiate Theorem 1 with $f^{\mathrm{oracle}} \gets f$, and choose $\eta = \Theta\left(\frac{1}{Ld}\right)$ with an appropriately small constant. We now require an $\eta$-RGO for $f$ to use in Theorem 1.
Our implementation of $\mathcal{O}$ is a rejection sampling scheme. We use the following helpful guarantee.
Lemma 5 (Rejection sampling).
Let $\pi', \hat{\pi}'$ be distributions on $\mathbb{R}^d$ with $\frac{d\pi'}{dx}(x) \propto p(x)$, $\frac{d\hat{\pi}'}{dx}(x) \propto \hat{p}(x)$. Suppose for some $C \ge 1$ and all $x \in \mathbb{R}^d$, $\frac{p(x)}{\hat{p}(x)} \le C$. The following is termed "rejection sampling": repeat independent runs of the following procedure until a point is outputted.
1. Draw $x \sim \hat{\pi}'$.
2. With probability $\frac{p(x)}{C\hat{p}(x)}$, output $x$.
Rejection sampling terminates in $\frac{C\int\hat{p}(x)\,dx}{\int p(x)\,dx}$ runs in expectation, and the output distribution is $\pi'$.
Proof.
The second claim follows from Bayes' rule, which implies the conditional density of the output point is proportional to $\hat{p}(x)\cdot\frac{p(x)}{C\hat{p}(x)} \propto p(x)$, so the distribution is $\pi'$. To see the first claim, the probability any single run outputs a point is
$$\mathbb{E}_{x \sim \hat{\pi}'}\left[\frac{p(x)}{C\hat{p}(x)}\right] = \int \frac{p(x)}{C\hat{p}(x)}\cdot\frac{\hat{p}(x)}{\int\hat{p}(x')\,dx'}\,dx = \frac{\int p(x)\,dx}{C\int\hat{p}(x)\,dx}.$$
The conclusion follows by independence and linearity of expectation. ∎
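A direct transcription of this procedure (our sketch; `p` and `p_hat` are unnormalized densities with $p \le C\hat{p}$ pointwise, and `sample_hat` draws from $\hat{\pi}'$):

```python
import numpy as np

def rejection_sample(p, p_hat, sample_hat, C, rng=np.random.default_rng()):
    """Lemma 5: returns an exact sample from the density proportional to p,
    given p(x) <= C * p_hat(x) for all x and a sampler for p_hat.
    The expected number of runs is C * (integral of p_hat) / (integral of p)."""
    while True:
        x = sample_hat()
        # Accept with probability p(x) / (C * p_hat(x)) <= 1.
        if rng.random() < p(x) / (C * p_hat(x)):
            return x
```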
We further state a concentration bound shown first in [LST20] regarding the norm of the gradient of a point drawn from a logsmooth distribution.
Proposition 2 (Logsmooth gradient concentration, Corollary 3.3, [LST20]).
Let $\pi$ be a distribution on $\mathbb{R}^d$ with $\frac{d\pi}{dx}(x) \propto \exp(-f(x))$ where $f$ is convex and $L$-smooth. With probability at least $1 - \delta$ over $x \sim \pi$,
$$\|\nabla f(x)\|_2 = O\left(\sqrt{L\left(d + \log\frac{1}{\delta}\right)}\right). \tag{13}$$
By the requirements of Theorem 1, the restricted Gaussian oracle only must be able to draw samples from densities of the form, for some $v \in \mathbb{R}^d$,
$$\propto \exp\left(-f(x) - \frac{1}{2\eta}\|x - v\|_2^2\right). \tag{14}$$
We will use the following Algorithm 2 to implement $\mathcal{O}$.
Lemma 6.
Suppose the condition in Line 1 of Algorithm 2 is met, i.e. the gradient norm $\|\nabla f(v)\|_2$ is appropriately bounded. Then Algorithm 2 returns a sample distributed exactly as (14), using an expected constant number of gradient queries to $f$.
Proof.
This follows by combining Lemma 5 with smoothness: when the gradient of $f$ at the query point is bounded as in Line 1 and $\eta = \Theta\left(\frac{1}{Ld}\right)$, the ratio between the target density (14) and an appropriately re-centered Gaussian proposal is bounded by a constant, so rejection sampling terminates in an expected constant number of runs. ∎
We are now equipped to prove our main result concerning well-conditioned densities.
Corollary 1 (restated; see Section 1.1).
Proof.
By applying Theorem 1 with the chosen $\eta$, and noting that the cumulative error due to all calls to Line 10 cannot amount to more than $\frac{\epsilon}{2}$ total variation error throughout the algorithm, it suffices to show that Algorithm 2 uses $O(1)$ gradient queries each iteration in expectation. This happens whenever the condition in Line 1 is met via Lemma 6, so we must show Line 10 is executed with small probability.
To show this, note that combining Proposition 2 with the warmness of the start in Algorithm 2, this event occurs with probability $O\left(\frac{\epsilon}{T}\right)$ in the first iteration. (Formally, Line 2 of Algorithm 1 queries the oracle at $y_k \sim \mathcal{N}(x_k, \eta I)$ rather than at $x_k$, but by smoothness and Gaussian concentration, this adds a negligible constant to the bound of Proposition 2.) Since warmness is monotonically decreasing throughout using an exact oracle in Algorithm 1 (a standard fact in the literature, which can be seen as follows: each transition step in the chain is a convex combination of warm point masses, preserving warmness), and the total error accumulated due to Line 10 throughout the algorithm is $O(\epsilon)$, we have the desired conclusion. ∎
4.2 Composite logconcave sampling: proof of Corollary 2
In this section, let $\pi$ be a distribution on $\mathbb{R}^d$ with density proportional to $\exp(-f(x) - g(x))$, where $f$ is $L$-smooth and $\mu$-strongly convex (and $\kappa := \frac{L}{\mu}$), and $g$ is convex and admits a restricted Gaussian oracle $\mathcal{O}_g$. Without loss of generality, we assume that $f$ and $g$ share a minimizer $x^*$ which we have pre-computed; if this is not the case, we can redefine $f(x) \gets f(x) - \langle\nabla f(x^*), x\rangle$ and $g(x) \gets g(x) + \langle\nabla f(x^*), x\rangle$; see Section 5.1 for this reduction.
We will instantiate Theorem 1 with $f^{\mathrm{oracle}} \gets f + g$, which is a $\mu$-strongly convex function. Our main result of this section follows directly from Theorem 1 and using Theorem 2 as the required oracle $\mathcal{O}$, stated more precisely in the following.
Corollary 2 (restated; see Section 1.1).
Proof.
As discussed at the beginning of this section, assume without loss of generality that $f$ and $g$ are both minimized by $x^*$. We apply the algorithm of Theorem 1 with $\eta = \frac{1}{L}$ to the $\mu$-strongly convex function $f + g$, which requires one call to $\mathcal{O}$ per iteration to implement. Thus, the iteration count parameter in Theorem 1 is $T = O\left(\kappa\log\frac{\kappa d}{\epsilon}\right)$.
Recall that we chose $\eta = \frac{1}{L}$. To bound the total complexity of this algorithm, it suffices to give an $\eta$-RGO for sampling from distributions with densities of the form, for some $v \in \mathbb{R}^d$,
$$\propto \exp\left(-f(x) - g(x) - \frac{L}{2}\|x - v\|_2^2\right),$$
to total variation distance $O\left(\frac{\epsilon}{T}\right)$ (see discussion at the end of Section 3). To this end, we apply Theorem 2 with the well-conditioned component $f(x) + \frac{L}{2}\|x - v\|_2^2$, the composite component $g$, and the largest possible choice of its step size. Note that we indeed have access to a restricted Gaussian oracle for $g$ (namely, $\mathcal{O}_g$), and this choice of well-conditioned component is $2L$-smooth and $L$-strongly convex, so its condition number is a constant. Thus, Theorem 2 requires $\widetilde{O}(d)$ calls to $\mathcal{O}_g$ and gradients of $f$ to implement the desired RGO on any query (where we note the minimizer of each subproblem can be computed to sufficient accuracy at no asymptotic overhead). Combining these complexity bounds yields the desired conclusion. ∎
4.3 Sampling logconcave finite sums: proof of Corollary 3
In this section, let $\pi$ be a distribution on $\mathbb{R}^d$ with density proportional to $\exp(-F(x))$, where $F = \frac{1}{n}\sum_{i \in [n]} f_i$ is $\mu$-strongly convex, and for all $i \in [n]$, $f_i$ is $L$-smooth (and $\kappa := \frac{L}{\mu}$). We will instantiate Theorem 1 with $f^{\mathrm{oracle}} \gets F$, and Theorem 3 as an $\eta$-RGO for some choice of $\eta$.
More precisely, Theorem 3 shows that given access to the minimizer $x^*$, only zeroth-order access to the summands of $F$ is necessary to obtain its query bound. In order to obtain the minimizer to high accuracy however, variance reduced stochastic gradient methods (e.g. [JZ13]) require $\widetilde{O}(n + \kappa)$ gradient queries, which amounts to $\widetilde{O}((n + \kappa)d)$ function evaluations. We state a convenient corollary of Theorem 3 which removes the requirement of accessing $x^*$, via an optimization pre-processing step using the method of [JZ13] (see further discussion in Appendix A). This is useful to us in proving Corollary 3 because in the sampling tasks required by the RGO, the minimizer changes (and thus must be recomputed every time).
Corollary 4 (First-order logconcave finite sum sampling).
In the setting of Theorem 3, without assuming access to $x^*$, there is an algorithm which uses $\widetilde{O}(n + \kappa)$ gradient queries and $\widetilde{O}(n + \kappa^2 d)$ value queries to summands $\{f_i\}_{i \in [n]}$, and obtains total variation distance $\epsilon$ to $\pi$.
We now apply the reduction framework developed in Section 3 to our Algorithm 6 to obtain an improved query complexity for sampling from logconcave finite sums.
Corollary 3 (restated; see Section 1.1).
Proof.
We apply Theorem 1 with the $\mu$-strongly convex $f^{\mathrm{oracle}} \gets F$, using Algorithm 6 as the required $\eta$-RGO for sampling from distributions with densities of the form
$$\propto \exp\left(-F(x) - \frac{1}{2\eta}\|x - v\|_2^2\right)$$
for some $v \in \mathbb{R}^d$, to total variation $O\left(\frac{\epsilon}{T}\right)$ (see Section 3), where $T = O\left(\frac{1}{\eta\mu}\log\frac{\kappa d}{\epsilon}\right)$ is the iteration bound of Algorithm 1. We apply Theorem 3 to the function $F(x) + \frac{1}{2\eta}\|x - v\|_2^2$; we can express this in finite sum form by adding $\frac{1}{2\eta}\|x - v\|_2^2$ to every constituent function, and the effect on gradient oracles is an additive term $\frac{1}{\eta}(x - v)$. Note $F(x) + \frac{1}{2\eta}\|x - v\|_2^2$ has condition number $O(1 + \eta L)$. For a given $\eta$, the overall complexity is
$$\widetilde{O}\left(\frac{1}{\eta\mu}\left(n + (1 + \eta L)^2 d\right)\right).$$
Here, the inner loop complexity uses Corollary 4 to also find the minimizer of each subproblem (for warm starts), and the outer loop complexity is by Theorem 1. The result follows by optimizing over $\eta$, namely picking $\eta = \Theta\left(\frac{1}{L}\max\left(1, \sqrt{\frac{n}{d}}\right)\right)$, and noting that Algorithm 1 always must have at least one iteration (costing $\widetilde{\Omega}(n)$). ∎
Note the only place that Corollary 3 used gradient evaluations was in determining minimizers of subproblems, via the first step of Corollary 4. Consider now the $n = 1$ case. By running e.g. accelerated gradient descent for smooth and strongly convex functions, it is well-known [Nes83] that we can obtain a minimizer to sufficient accuracy in $\widetilde{O}\left(\sqrt{\kappa'}\right)$ iterations, each querying a gradient oracle, where $\kappa'$ is the condition number. By smoothness, we can approximate every coordinate of the gradient to arbitrary precision using two function evaluations, so this is a $\widetilde{O}\left(d\sqrt{\kappa'}\right)$ value oracle complexity.
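Concretely (a standard forward-difference sketch, in our notation), each gradient coordinate can be approximated with two value queries, so one full gradient costs $d + 1$ evaluations:

```python
import numpy as np

def fd_gradient(f, x, h=1e-6):
    """Approximate grad f(x) using d + 1 value queries via forward
    differences. For L-smooth f, the per-coordinate error is O(L * h),
    so h can be taken polynomially small to reach any target precision."""
    fx = f(x)
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - fx) / h
    return g
```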
Finally, for every optimization subproblem in Corollary 3 with $n = 1$, the condition number $\kappa'$ is a constant, which amounts to a $\widetilde{O}(d)$ value oracle complexity for computing a minimizer. This is never the dominant term compared to Theorem 3, yielding the following conclusion.
Corollary 5.
Let $\pi$ be a distribution on $\mathbb{R}^d$ with $\frac{d\pi}{dx}(x) \propto \exp(-f(x))$, where $f$ has condition number $\kappa$, and let $\epsilon \in (0, 1)$. There is a zeroth-order algorithm (Algorithm 1 using a Metropolized random walk to implement its restricted Gaussian oracle) which uses $\widetilde{O}(\kappa d)$ value queries to $f$, and obtains total variation distance $\epsilon$ to $\pi$.
We note that the polylogarithmic factor is significantly improved when compared to Corollary 3 by removing the random sampling steps in Algorithm 6. A precise complexity bound of the resulting Metropolized random walk, a zeroth-order algorithm mixing in $\widetilde{O}(\kappa^2 d)$ iterations for a logconcave distribution with condition number $\kappa$, is given as Theorem 2 of [CDWY19].
Finally, in the general finite sum case $n > 1$, we also exhibit an improved query complexity in terms of an entirely zeroth-order sampling algorithm which interpolates with Corollary 5 (up to logarithmic factors). By trading off the zeroth-order complexity of minimizing a finite sum function [JZ13], and the zeroth-order complexity of sampling, we can run Theorem 1 with the optimal choice of $\eta$. The overall zeroth-order complexity can be seen to be $\widetilde{O}\left(\left(n + \sqrt{n}\kappa\right)d\right)$.
5 Composite logconcave sampling with a restricted Gaussian oracle
In this section, we provide our "base sampler" for composite logconcave densities as Algorithm 3, and give its guarantees by proving Theorem 2. Throughout, fix a distribution $\pi$ with density
$$\frac{d\pi}{dx}(x) \propto \exp\left(-f(x) - g(x)\right), \text{ where $f$ is $L$-smooth and $\mu$-strongly convex, and $g$ is convex.} \tag{15}$$
We will define $\kappa := \frac{L}{\mu}$, and assume that we have precomputed $x^* := \mathrm{argmin}_x\{f(x) + g(x)\}$. Our algorithm proceeds in stages following the outline in Section 1.3.1.
1. Composite-Sample is reduced to Composite-Sample-Shared-Min, which takes as input a distribution with negative log-density $f + g$, where $f$ and $g$ share a minimizer; this reduction is given in Section 5.1, and the remainder of the section handles the shared-minimizer case.
2. The algorithm Composite-Sample-Shared-Min is a rejection sampling scheme built on top of sampling from a joint distribution $\hat{\pi}$ on $\mathbb{R}^d \times \mathbb{R}^d$ whose $x$-marginal approximates $\pi$. We give this reduction in Section 5.2.
3. The bulk of our analysis is for Sample-Joint-Dist, an alternating marginal sampling algorithm for sampling from $\hat{\pi}$. To implement marginal sampling, it alternates calls to $\mathcal{O}_g$ and a rejection sampling algorithm YSample. We prove its correctness in Section 5.3.
We put these pieces together in Section 5.4 to prove Theorem 2. We remark that for simplicity, we will give the algorithms corresponding to the largest value of step size $\eta$ in the theorem statement; it is straightforward to modify the bounds to tolerate smaller values of $\eta$, which will cause the mixing time to become correspondingly larger (in particular, through the value of $T$ in Algorithm 5).
Input: Distribution of form (15), minimizing negative log-density of , .
Output: Sample from a distribution with .
Input: Distribution of form (15), where and are both minimized by , .
Output: Sample from a distribution with .
(16)
5.1 Reduction from Composite-Sample to Composite-Sample-Shared-Min
Correctness of Composite-Sample is via the following properties.
Proposition 3.
Let and be defined as in Composite-Sample.
1. The density is the same as the density .
2. Assuming first-order (function and gradient evaluation) access to , and restricted Gaussian oracle access to , we can implement the same accesses to , with constant overhead.
3. and are both minimized by .
Proof.
For f and g with properties as in (15), with x⋆ minimizing f + g, define the functions
f̃(x) := f(x) − ⟨∇f(x⋆), x⟩, g̃(x) := g(x) + ⟨∇f(x⋆), x⟩,
and observe that f̃(x) + g̃(x) = f(x) + g(x) everywhere. This proves the first claim. Further, implementation of a first-order oracle for f̃ and a restricted Gaussian oracle for g̃ are immediate assuming a first-order oracle for f and a restricted Gaussian oracle for g, showing the second claim; any quadratic shifted by a linear term is the sum of a quadratic and a constant. We now show f̃ and g̃ are both minimized by x⋆. By strong convexity, f̃ has a unique minimizer; first-order optimality shows that
∇f̃(x⋆) = ∇f(x⋆) − ∇f(x⋆) = 0,
so this unique minimizer is x⋆. Moreover, optimality of x⋆ for f + g implies that for all x,
g(x) ≥ g(x⋆) − ⟨∇f(x⋆), x − x⋆⟩.
Here, −∇f(x⋆) is a subgradient of g at x⋆, so 0 ∈ ∂g̃(x⋆) = ∂g(x⋆) + ∇f(x⋆). This shows first-order optimality of x⋆ for g̃ also, so x⋆ minimizes g̃. ∎
5.2 Reduction from Composite-Sample-Shared-Min to Sample-Joint-Dist
Composite-Sample-Shared-Min is a rejection sampling scheme, which accepts samples from subroutine Sample-Joint-Dist in the high-probability region defined in (16). We give a general analysis for approximate rejection sampling in Appendix B.1.1, and Appendix B.1.2 bounds relationships between distributions and , defined in (15) and (17) respectively (i.e. relative densities and normalization constant ratios). Combining these pieces proves the following main claim.
5.3 Implementing Sample-Joint-Dist
Sample-Joint-Dist alternates between sampling marginals in the joint distribution , as seen by definitions (19), (20). We showed that marginal sampling attains the correct stationary distribution as Lemma 1. We bound the conductance of the induced walk on iterates by combining an isoperimetry bound with a total variation guarantee between transitions of nearby points in Appendix B.2.1. Finally, we give a simple rejection sampling scheme YSample as Algorithm 7 for implementing the step (19). Since the -marginal of is a bounded perturbation of a Gaussian (intuitively, is -smooth and ), we show that, within a high-probability region, rejection sampling against the sum of a first-order approximation to and the Gaussian succeeds in iterations.
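To make the alternating scheme concrete, here is a toy, runnable instance (our own illustration; parameter choices are not those of the paper) for f(y) = ‖y‖²/2 and g the indicator of the box [−1, 1]^d, assuming the joint density takes the form exp(−f(y) − g(x) − ‖x − y‖²/(2η)) as described. For this f, the marginal y | x is exactly Gaussian (so YSample is unnecessary), and the marginal x | y is precisely a restricted Gaussian oracle call, implementable coordinate-wise by truncated Gaussian sampling.

```python
import numpy as np
from scipy.stats import truncnorm

d, eta, T = 5, 0.5, 1000
rng = np.random.default_rng(0)

def rgo_box(v, eta):
    # RGO for g = indicator([-1,1]^d): sample proportional to
    # exp(-g(x) - ||x - v||^2 / (2*eta)), i.e. N(v, eta*I) truncated to the box.
    s = np.sqrt(eta)
    a, b = (-1 - v) / s, (1 - v) / s  # standardized truncation limits
    return truncnorm.rvs(a, b, loc=v, scale=s, random_state=rng)

x, samples = np.zeros(d), []
for _ in range(T):
    prec = 1.0 + 1.0 / eta                      # y | x is Gaussian here
    y = x / (eta * prec) + rng.normal(size=d) / np.sqrt(prec)
    x = rgo_box(y, eta)                         # x | y via the RGO
    samples.append(x.copy())
print(np.round(np.mean(samples, axis=0), 2))    # ~0 by symmetry
```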
Remark 1.
For simplicity of presentation, we were conservative in bounding constants throughout; in practice, we found that the constant in Line 4 is orders of magnitude too large (a constant sufficed); see Section 4 of [STL20]. Several constants were inherited from prior analyses, which we do not rederive to avoid redundancy.
We now give a complete guarantee on the complexity of Sample-Joint-Dist.
Proposition 5.
Sample-Joint-Dist outputs a point with distribution within total variation distance from the -marginal of . The expected number of gradient queries per iteration is constant.
5.4 Putting it all together: proof of Theorem 2
We show Theorem 2 follows from the guarantees of Propositions 3, 4, and 5. Formally, Theorem 2 is stated for an arbitrary value of which is upper bounded by the value in Line 1 of Algorithm 5; however, it is straightforward to see that all our proofs go through for any smaller value. By observing the value of in Sample-Joint-Dist, we can bound the total number of iterations in each call to Sample-Joint-Dist. Proposition 5 also shows that in every iteration, we require an expected constant number of gradient queries and calls to , the restricted Gaussian oracle for , and that the resulting distribution has total variation from the desired marginal of . Next, Proposition 4 implies that the number of calls to Sample-Joint-Dist in a run of Composite-Sample-Shared-Min is bounded by a constant, the choice of is , and the resulting point has total variation from the original distribution . Finally, Proposition 3 shows sampling from a general distribution of the form (1) is reducible to one call of Composite-Sample-Shared-Min, and the requisite oracles are implementable.
6 Logconcave finite sums
In this section, we provide our “base sampler” for logconcave finite sums as Algorithm 6, and give its guarantees by proving Theorem 3. Throughout, fix distribution π with density
dπ(x) ∝ exp(−F(x)) dx, where F(x) := (1/n) Σ_{i∈[n]} f_i(x).
We will define x⋆ := argmin_x F(x), and assume that we have precomputed x⋆. We will also assume explicitly that ∇f_i(x⋆) = 0 for all i ∈ [n] throughout this section (i.e. all f_i are minimized at the same point); this is without loss of generality, by a similar argument as in Proposition 3.
Input: , step size , initial , , iteration count
Algorithm 6 is the zeroth-order Metropolized random walk of [DCWY18] with an efficient, but biased, filter step; the goal of our analysis is to show this bias does not incur significant error.
6.1 Approximate Metropolis-Hastings
We first recall the following well-known fact underlying Metropolis-Hastings (MH) filters.
Proposition 6.
Consider a random walk on with proposal distributions and acceptance probabilities conducted as follows: at a current point ,
1. Draw a point .
2. Move the random walk to with probability , else stay at .
Suppose for all pairs , and further . Then, is a stationary distribution for the random walk.
Proof.
This follows because the walk satisfies detailed balance (reversibility) with respect to . ∎
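As a concrete special case of Proposition 6, the standard Metropolis-Hastings acceptance rule min{1, π(z)p(z, x)/(π(x)p(x, z))} satisfies the required symmetry, since both sides of detailed balance then equal min{π(x)p(x, z), π(z)p(z, x)}. A minimal sketch (our own illustration):

```python
import numpy as np

def mh_step(x, log_pi, propose, log_q, rng):
    """One Metropolis-Hastings step: propose z ~ p(x, .), accept with
    probability min(1, pi(z) q(z, x) / (pi(x) q(x, z))), which satisfies
    detailed balance pi(x) p(x, z) A(x, z) = pi(z) p(z, x) A(z, x)."""
    z = propose(x, rng)
    log_ratio = log_pi(z) + log_q(z, x) - log_pi(x) - log_q(x, z)
    return z if np.log(rng.uniform()) < log_ratio else x

# Check on a 1-D standard Gaussian target with a symmetric proposal.
rng = np.random.default_rng(1)
log_pi = lambda x: -0.5 * x * x
propose = lambda x, rng: x + 0.5 * rng.normal()
log_q = lambda x, z: 0.0  # symmetric, so the q terms cancel
x, xs = 0.0, []
for _ in range(20000):
    x = mh_step(x, log_pi, propose, log_q, rng)
    xs.append(x)
print(np.var(xs[2000:]))  # ~1
```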
We propose an algorithm that applies a variant of the Metropolis-Hastings filter to a Gaussian random walk. Specifically, we define the following algorithm, which we call Inefficient-MRW.
Definition 4 (Inefficient-MRW).
Consider the following random walk for some step size : for each iteration at a current point ,
1. Set , where .
2. with probability (otherwise, ), where
(21)
Lemma 7.
Distribution with is stationary for Inefficient-MRW.
Proof.
Without loss of generality, assume that has been normalized so that . We apply Proposition 6, dropping subscripts in the following. It is clear that for any , so it suffices to check the second condition. When , this follows from
The other case is similar (as it is a standard Metropolis-Hastings filter). ∎
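For reference, the second case of (21) is the standard Metropolis-Hastings filter for the Gaussian random walk, sketched below (our own illustration, assuming the proposal x' = x + √(2η)ξ with ξ standard Gaussian; the first case of (21) is the technical modification handled in the analysis).

```python
import numpy as np

def mrw_step(x, F, eta, rng):
    """One Metropolized random walk step: Gaussian proposal with step size
    eta, accepted with probability min(1, exp(F(x) - F(x')))."""
    x_prop = x + np.sqrt(2.0 * eta) * rng.normal(size=x.shape)
    return x_prop if np.log(rng.uniform()) < F(x) - F(x_prop) else x

# Sample from the density proportional to exp(-F), F(x) = ||x||^2 / 2.
rng = np.random.default_rng(6)
F = lambda x: 0.5 * float(x @ x)
x, xs = np.zeros(3), []
for _ in range(20000):
    x = mrw_step(x, F, eta=0.1, rng=rng)
    xs.append(x.copy())
print(np.round(np.cov(np.array(xs[5000:]).T), 1))  # ~identity
```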
In Algorithm 6, we implement an approximate version of the modified MH filter in Definition 4, where we always assume the pair , is in the second case of (21). In Lemma 8, we show that if a certain boundedness condition holds, then Algorithm 6 approximates Inefficient-MRW well. We then show that the output distributions of Inefficient-MRW and our Algorithm 6 have small total variation distance in Lemma 9.
Lemma 8.
Suppose that in an iteration of Algorithm 6, the following three conditions hold for some parameters , , :
1. .
2. .
3. For all , .
Then, for any
(22)
Moreover, we have , and when , .
Proof.
We first show . Since each is generated independently,
Next, for any , we lower and upper bound . First,
The first inequality followed from convexity of , the second from and our assumed bound, and the third from smoothness and . To show a lower bound,
The first inequality was smoothness. Repeating this argument for each and averaging,
(23)
Then, when ,
Thus, we can bound each :
Finally, when , as desired. ∎
Lemma 9.
Proof.
By the coupling definition of total variation, it suffices to upper bound the probability that the algorithms' trajectories, sharing all randomness in proposing points , differ. This can happen for two reasons: either we used an incorrect filtering step (i.e. the pair did not lie in the second case of (21)), or we incorrectly rejected in Line 7 of Algorithm 6 because . We bound the probability that either occurs in any iteration by , yielding the conclusion.
Incorrect filtering.
Consider some iteration . Lemma 8 shows that as long as its three conditions hold in iteration , we are in the second case of (21), so it suffices to show all conditions hold. By Fact 2 and as is independent of all , with probability at least , both of the conditions and (we recall that the distribution of for is the one-dimensional ) for all hold for some
Next, is drawn from a warm start for . By Fact 2, we have for drawn from with probability at least , for some
Since warmness of the exact algorithm of Definition 4 is monotonic, as long as the trajectories have not differed up to iteration , also holds with probability . Inductively, the total variation error caused by incorrect filtering over steps is at most .
Error due to large .
Supposing all the conditions of Lemma 8 are satisfied in iteration , we show that with high probability, Inefficient-MRW and Algorithm 6 make the same accept or reject decision. By Lemma 8, Inefficient-MRW (21) accepts with probability . On the other hand, Algorithm 6 accepts with probability
The total variation between the output distributions is . Further, since by Lemma 8,
it suffices to upper bound this latter quantity. First, by Lemma 10, when , we have . Finally, since each is generated independently,
Here, we used Lemma 8 applied to the set , and the upper bound (23) we derived earlier. Combining these calculations shows that the total variation distance incurred in any iteration due to being too large is at most , so the overall contribution over steps is at most . ∎
We used the following helper lemma in our analysis.
Lemma 10.
Let be formed by independently including each with probability . Then,
Proof.
For , let be the indicator random variable of the event , so and
By Bernstein’s inequality,
In particular, when , we have the desired conclusion. ∎
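As a quick numerical sanity check of Lemma 10 (purely illustrative): |S| is Binomial(n, p), so its deviations from np concentrate at the √(np) scale predicted by Bernstein's inequality.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, trials = 10_000, 0.1, 2000
sizes = rng.binomial(n, p, size=trials)     # |S| ~ Binomial(n, p)
dev = np.abs(sizes - n * p)
print(np.mean(dev > 2.0 * np.sqrt(n * p)))  # small: roughly the 2-sigma tail
```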
6.2 Conductance analysis
We next bound the mixing time of Inefficient-MRW, using the following result from prior work. We remark that (see Section 1.4) in our application, the term is non-dominant.
Proposition 7 (Lemma 1, Lemma 2, [CDWY19]).
Let a random walk with a -strongly logconcave stationary distribution on have transition distributions . For some , let convex set have . Let be a -warm start for , and let the algorithm be initialized at . Suppose for any with ,
(24)
Then, the random walk mixes to total variation distance within of in iterations.
Consider an iteration of Inefficient-MRW from . Let be the density of , and let be the density of after filtering. Define a convex set parameterized by :
We show that for two close points , the total variation between and is small.
Lemma 11.
For some and with , .
Proof.
By the triangle inequality of total variation distance,
First, by Pinsker’s inequality and the KL divergence between Gaussian distributions,
When , . Next, we bound : by a standard calculation (e.g. Lemma D.1 of [LST20]), we have
We show that . It suffices to show that
Since , it suffices to show that with probability at least over the randomness of , . As , by applying Fact 2 twice,
(25)
We upper bound the term by smoothness and Cauchy-Schwarz:
Then, since when , it is enough to choose so that
as long as the events of (25) hold, which occurs with probability at least . Similarly, we can show that . Combining the three bounds, we have the desired conclusion.
∎
See Theorem 3.
Proof.
First, yields a -warm start for (see e.g. [DCWY18]). For this value of , by Fact 2 it suffices to choose
for . Letting , we will choose the step size and iteration count so that
have constants compatible with Lemma 9. Note that this choice of is also sufficiently small to apply Lemma 11 for our choice of . By applying Proposition 7 to the algorithm of Definition 4, and using the bound from Lemma 11, in iterations Inefficient-MRW will mix to total variation distance to . Furthermore, applying Lemma 9, we conclude that Algorithm 6 has total variation distance at most from .
Acknowledgments
YL and RS are supported by NSF awards CCF-1749609, CCF-1740551, DMS-1839116, and DMS-2023166, a Microsoft Research Faculty Fellowship, a Sloan Research Fellowship, and a Packard Fellowship. KT is supported by NSF CAREER Award CCF-1844855 and a PayPal research gift.
RS and KT would like to thank Sinho Chewi for his extremely generous help, in particular insightful conversations which led to our discovery of the gap in Section 3, as well as his suggested fix.
References
- [AH16] Jacob D. Abernethy and Elad Hazan. Faster convex optimization: Simulated annealing with an efficient universal barrier. In Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 2520–2528, 2016.
- [All17] Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. J. Mach. Learn. Res., 18:221:1–221:51, 2017.
- [BCM+18] M. Barkhagen, N. H. Chau, É. Moulines, M. Rásonyi, S. Sabanis, and Y. Zhang. On stochastic gradient Langevin dynamics with dependent data streams in the logconcave case. arXiv preprint arXiv:1812.02709, 2018.
- [BDMP17] Nicolas Brosse, Alain Durmus, Eric Moulines, and Marcelo Pereyra. Sampling from a log-concave distribution with compact support with proximal Langevin Monte Carlo. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, Amsterdam, The Netherlands, 7-10 July 2017, pages 319–342, 2017.
- [BEL18] Sébastien Bubeck, Ronen Eldan, and Joseph Lehec. Sampling from a log-concave distribution with projected Langevin Monte Carlo. Discret. Comput. Geom., 59(4):757–783, 2018.
- [Ber18] Espen Bernton. Langevin Monte Carlo and JKO splitting. In Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018, pages 1777–1798, 2018.
- [BFFN19] Jack Baker, Paul Fearnhead, Emily B. Fox, and Christopher Nemeth. Control variates for stochastic gradient MCMC. Statistics and Computing, 29(3):599–615, 2019.
- [BFR+19] Joris Bierkens, Paul Fearnhead, Gareth Roberts, et al. The zig-zag process and super-efficient sampling for Bayesian analysis of big data. The Annals of Statistics, 47(3):1288–1320, 2019.
- [BT09] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sciences, 2(1):183–202, 2009.
- [BV04] Dimitris Bertsimas and Santosh S. Vempala. Solving convex programs by random walks. J. ACM, 51(4):540–556, 2004.
- [CCBJ18] Xiang Cheng, Niladri S. Chatterji, Peter L. Bartlett, and Michael I. Jordan. Underdamped langevin MCMC: A non-asymptotic analysis. In Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018, pages 300–323, 2018.
- [CDWY19] Yuansi Chen, Raaz Dwivedi, Martin J. Wainwright, and Bin Yu. Fast mixing of Metropolized Hamiltonian Monte Carlo: benefits of multi-step gradients. CoRR, abs/1905.12247, 2019.
- [CFM+18] Niladri Chatterji, Nicolas Flammarion, Yian Ma, Peter Bartlett, and Michael Jordan. On the theory of variance reduction for stochastic gradient Monte Carlo. In International Conference on Machine Learning, pages 764–773, 2018.
- [CV15] Benjamin Cousins and Santosh Vempala. Bypassing KLS: Gaussian cooling and an O^*(n^3) volume algorithm. In Proceedings of the forty-seventh annual ACM Symposium on Theory of Computing, pages 539–548, 2015.
- [CV18] Ben Cousins and Santosh S. Vempala. Gaussian cooling and O^*(n^3) algorithms for volume and Gaussian volume. SIAM J. Comput., 47(3):1237–1273, 2018.
- [CV19] Zongchen Chen and Santosh S. Vempala. Optimal convergence rate of Hamiltonian Monte Carlo for strongly logconcave distributions. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2019, September 20-22, 2019, Massachusetts Institute of Technology, Cambridge, MA, USA, pages 64:1–64:12, 2019.
- [CWZ+17] Changyou Chen, Wenlin Wang, Yizhe Zhang, Qinliang Su, and Lawrence Carin. A convergence analysis for a class of practical variance-reduction stochastic gradient MCMC. arXiv preprint arXiv:1709.01180, 2017.
- [Dal17] Arnak S Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 3(79):651–676, 2017.
- [DBL14] Aaron Defazio, Francis R. Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 1646–1654, 2014.
- [DCWY18] Raaz Dwivedi, Yuansi Chen, Martin J. Wainwright, and Bin Yu. Log-concave sampling: Metropolis-Hastings algorithms are fast! In Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018, pages 793–797, 2018.
- [DFK91] Martin E. Dyer, Alan M. Frieze, and Ravi Kannan. A random polynomial time algorithm for approximating the volume of convex bodies. J. ACM, 38(1):1–17, 1991.
- [DK19] Arnak S Dalalyan and Avetik Karagulyan. User-friendly guarantees for the langevin monte carlo with inaccurate gradient. Stochastic Processes and their Applications, 129(12):5278–5311, 2019.
- [DM19a] Alain Durmus and Éric Moulines. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Bernoulli, 25(4A):2854–2882, 2019.
- [DM19b] Alain Durmus and Eric Moulines. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Bernoulli, 25(4A):2854–2882, 2019.
- [DMM19] Alain Durmus, Szymon Majewski, and Blazej Miasojedow. Analysis of Langevin Monte Carlo via convex optimization. J. Mach. Learn. Res., 20:73:1–73:46, 2019.
- [DR18] Arnak S. Dalalyan and Lionel Riou-Durand. On sampling from a log-concave density using kinetic Langevin diffusions. CoRR, abs/1807.09382, 2018.
- [DRW+16] Kumar Avinava Dubey, Sashank J. Reddi, Sinead A. Williamson, Barnabas Poczos, Alexander J. Smola, and Eric P. Xing. Variance reduction in stochastic gradient Langevin dynamics. In Advances in Neural Information Processing Systems, pages 1154–1162, 2016.
- [DSM+16] Alain Durmus, Umut Simsekli, Eric Moulines, Roland Badeau, and Gaël Richard. Stochastic gradient Richardson-Romberg Markov chain Monte Carlo. In Advances in Neural Information Processing Systems, pages 2047–2055, 2016.
- [FGKS15] Roy Frostig, Rong Ge, Sham M. Kakade, and Aaron Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 2540–2548, 2015.
- [GGZ18] Xuefeng Gao, Mert Gürbüzbalaban, and Lingjiong Zhu. Global convergence of stochastic gradient hamiltonian monte carlo for non-convex stochastic optimization: Non-asymptotic performance bounds and momentum-based acceleration. arXiv preprint arXiv:1809.04618, 2018.
- [Gul92] Osman Guler. New proximal point algorithms for convex minimization. SIAM Journal on Optimization, 2(4):649–664, 1992.
- [Har04] Gilles Hargé. A convex/log-concave correlation inequality for gaussian measure and an application to abstract wiener spaces. Probability theory and related fields, 130(3):415–440, 2004.
- [HKRC18] Ya-Ping Hsieh, Ali Kavis, Paul Rolland, and Volkan Cevher. Mirrored Langevin dynamics. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 2883–2892, 2018.
- [JZ13] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 315–323, 2013.
- [Kra40] Hendrik Anthony Kramers. Brownian motion in a field of force and the diffusion model of chemical reactions. Physica, 7(4):284–304, 1940.
- [Lee18] Yin Tat Lee. Lecture 8: Stochastic methods and applications. Class notes, UW CSE 599: Interplay between Convex Optimization and Geometry, 2018.
- [LK99] László Lovász and Ravi Kannan. Faster mixing via average conductance. In Proceedings of the Thirty-First Annual ACM Symposium on Theory of Computing, May 1-4, 1999, Atlanta, Georgia, USA, pages 282–287, 1999.
- [LMH15] Hongzhou Lin, Julien Mairal, and Zaïd Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 3384–3392, 2015.
- [LPW09] David Asher Levin, Yuval Peres, and Elizabeth Wilmer. Markov Chains and Mixing Times. American Mathematical Society, 2009.
- [LST20] Yin Tat Lee, Ruoqi Shen, and Kevin Tian. Logsmooth gradient concentration and tighter runtimes for Metropolized Hamiltonian Monte Carlo. In Conference on Learning Theory, COLT 2020, 2020.
- [LSV18] Yin Tat Lee, Zhao Song, and Santosh S. Vempala. Algorithmic theory of odes and sampling from well-conditioned logconcave densities. CoRR, abs/1812.06243, 2018.
- [LV06a] László Lovász and Santosh Vempala. Simulated annealing in convex bodies and an O^*(n^4) volume algorithm. Journal of Computer and System Sciences, 72(2):392–417, 2006.
- [LV06b] László Lovász and Santosh S. Vempala. Fast algorithms for logconcave functions: Sampling, rounding, integration and optimization. In 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2006), 21-24 October 2006, Berkeley, California, USA, Proceedings, pages 57–68, 2006.
- [LV06c] László Lovász and Santosh S. Vempala. Hit-and-run from a corner. SIAM J. Comput., 35(4):985–1005, 2006.
- [MFWB19] Wenlong Mou, Nicolas Flammarion, Martin J. Wainwright, and Peter L. Bartlett. An efficient sampling algorithm for non-smooth composite potentials. CoRR, abs/1910.00551, 2019.
- [MMW+19] Wenlong Mou, Yi-An Ma, Martin J. Wainwright, Peter L. Bartlett, and Michael I. Jordan. High-order langevin diffusion yields an accelerated MCMC algorithm. CoRR, abs/1908.10859, 2019.
- [NDH+17] Tigran Nagapetyan, Andrew B Duncan, Leonard Hasenclever, Sebastian J Vollmer, Lukasz Szpruch, and Konstantinos Zygalakis. The true cost of stochastic gradient langevin dynamics. arXiv preprint arXiv:1706.02692, 2017.
- [Nea11] Radford M. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.
- [Nes83] Yurii Nesterov. A method for solving a convex programming problem with convergence rate O(1/k^2). Doklady AN SSSR, 269:543–547, 1983.
- [NF19] Christopher Nemeth and Paul Fearnhead. Stochastic gradient Markov chain Monte Carlo. arXiv preprint arXiv:1907.06986, 2019.
- [OV00] Felix Otto and Cédric Villani. Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality. Journal of Functional Analysis, 173(2):361–400, 2000.
- [PB14] Neal Parikh and Stephen P. Boyd. Proximal algorithms. Found. Trends Optim., 1(3):127–239, 2014.
- [Per16] Marcelo Pereyra. Proximal Markov chain Monte Carlo algorithms. Stat. Comput., 26(4):745–760, 2016.
- [Roc76] R Tyrell Rockafellar. Monotone operators and the proximal point algorithm. SIAM journal on control and optimization, 14(5):877–898, 1976.
- [SKR19] Adil Salim, Dmitry Kovalev, and Peter Richtárik. Stochastic proximal Langevin algorithm: potential splitting and nonasymptotic rates. In Advances in Neural Information Processing Systems, pages 6653–6664, 2019.
- [SL19] Ruoqi Shen and Yin Tat Lee. The randomized midpoint method for log-concave sampling. In Advances in Neural Information Processing Systems, pages 2100–2111, 2019.
- [SRB17] Mark Schmidt, Nicolas Le Roux, and Francis R. Bach. Minimizing finite sums with the stochastic average gradient. Math. Program., 162(1-2):83–112, 2017.
- [STL20] Ruoqi Shen, Kevin Tian, and Yin Tat Lee. Composite logconcave sampling with a restricted gaussian oracle. CoRR, abs/2006.05976, 2020.
- [Tib96] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):267–288, 1996.
- [Vem05] Santosh Vempala. Geometric random walks: A survey. MSRI Combinatorial and Computational Geometry, 52:573–612, 2005.
- [Wib19] Andre Wibisono. Proximal Langevin algorithm: rapid convergence under isoperimetry. CoRR, abs/1911.01469, 2019.
- [WS16] Blake E. Woodworth and Nati Srebro. Tight complexity bounds for optimizing composite objectives. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 3639–3647, 2016.
- [WT11] Max Welling and Yee W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.
- [ZH05] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B (Methodological), 67(2):301–320, 2005.
- [ZXG18] Difan Zou, Pan Xu, and Quanquan Gu. Subsampled stochastic variance-reduced gradient langevin dynamics. In International Conference on Uncertainty in Artificial Intelligence, 2018.
Appendix A Discussion of inexactness tolerance
We briefly discuss the tolerance of our algorithm to approximation error in two places: computation of minimizers, and implementation of RGOs in the methods of Sections 3 and 5.
Inexact minimization.
For all function classes considered in this work, there exist efficient optimization methods converging to a minimizer with logarithmic dependence on the target accuracy.
Specifically, for negative log-densities with condition number , accelerated gradient descent [Nes83] converges at a rate with logarithmic dependence on initial error and target accuracy (we implicitly assumed in stating our runtimes that one can attain initial error polynomial in problem parameters for negative log-densities; otherwise, there is additional logarithmic overhead in the quality of the initial point to optimization procedures). For composite functions where has condition number , the FISTA method of [BT09] converges at the same rate with each iteration querying and a proximal oracle for once; typically, access to a proximal oracle is a weaker assumption than access to a restricted Gaussian oracle, so this is not restrictive. Finally, for minimizing finite sums with condition number , the algorithm of [All17] obtains a convergence rate linearly dependent on ; alternatively, [JZ13] has a dependence on . In all our final runtimes, these optimization rates do not constitute the bottleneck for oracle complexities.
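For concreteness, a minimal sketch of one standard constant-stepsize variant of accelerated gradient descent for an L-smooth, μ-strongly convex f (our own illustration; [Nes83] and its textbook treatments give several equivalent forms). It converges linearly, so accuracy ε costs O(√κ log(1/ε)) gradient queries with κ = L/μ.

```python
import numpy as np

def agd(grad, x0, L, mu, iters):
    """Accelerated gradient descent for L-smooth, mu-strongly convex f:
    a gradient step from an extrapolated point, plus momentum."""
    kappa = L / mu
    beta = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)
    x, y = x0.copy(), x0.copy()
    for _ in range(iters):
        x_new = y - grad(y) / L
        y = x_new + beta * (x_new - x)
        x = x_new
    return x

# Example: quadratic with spectrum in [mu, L]; the minimizer is 0.
rng = np.random.default_rng(3)
d, mu, L = 50, 1.0, 100.0
diag = rng.uniform(mu, L, size=d)
grad = lambda x: diag * x
print(np.linalg.norm(agd(grad, np.ones(d), L, mu, 200)))  # ~0
```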
The only additional difficulty our algorithms may present is if the function requiring minimization, say of the form for some where we have computed the minimizer to , has very large (so the initial function error is bad). However, in all our settings is drawn from a distribution with sub-Gaussian tails, so decays exponentially (whereas the complexity of first-order methods increases only logarithmically), negligibly affecting the expected oracle query complexity for our methods.
Finally, by solving the relevant optimization problems to high accuracy as a subroutine in each of our methods, and adjusting various distance bounds to the minimizer by constants (e.g. by expanding the radius in the definition of the sets in Algorithm 4 and Section 6.2), we can accommodate inexact minimization while affecting the bounds throughout the paper by only constant factors. The only other place that is used in our algorithms is in initializing warm starts; tolerance to inexactness in our warmness calculations follows essentially identically to Section 3.2.1 of [DCWY18].
Inexact oracle implementation.
Our algorithms based on restricted Gaussian oracle access are tolerant to total variation error inverse polynomial in problem parameters for the restricted Gaussian oracle for . We discussed this at the end of Section 3, in the case of RGO use for our reduction framework. To see this in the case of the composite sampler in Section 5, we pessimistically handled the case where the sampler YSample for a quadratic restriction of resulted in total variation error in the proof of Proposition 5, assuming that the error was incurred in every iteration. By accounting for similar amounts of error in calls to (on the order of , where is the number of times an RGO was used), the bounds in our algorithm are only affected by constants.
Appendix B Deferred proofs from Section 5
B.1 Deferred proofs from Section 5.2
B.1.1 Approximate rejection sampling
We first define the rejection sampling framework we will use, and prove various properties.
Definition 5 (Approximate rejection sampling).
Let be a distribution, with . Suppose set has , and distribution with has for some ,
Suppose there is an algorithm which draws samples from a distribution , such that . We call the following scheme approximate rejection sampling: repeat independent runs of the following procedure until a point is outputted.
1. Draw via until .
2. With probability , output .
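A minimal runnable instance of the scheme in Definition 5 (our own toy example, with an exact proposal sampler, i.e. δ = 0, and Ω the whole line): the target density is proportional to exp(−x²/2 − |x|), the proposal is proportional to exp(−x²/2), and the unnormalized density ratio is bounded by C = 1.

```python
import numpy as np

rng = np.random.default_rng(4)

def rejection_sample():
    # Repeat: draw from the proposal, accept w.p. ratio / C = exp(-|x|).
    while True:
        x = rng.normal()
        if rng.uniform() < np.exp(-abs(x)):
            return x

xs = np.array([rejection_sample() for _ in range(50_000)])
print(np.round(np.mean(xs), 2))  # ~0: the target is symmetric
```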
Lemma 12.
Consider an approximate rejection sampling scheme with relevant parameters defined as in Definition 5, with . The algorithm terminates in at most
(26)
calls to in expectation, and outputs a point from a distribution with .
Proof.
Define for notational simplicity normalization constants and . First, we bound the probability that any particular call to returns in the scheme:
(27)
The second line followed from the definitions of and , and the third followed from the triangle inequality, the assumed lower bound on , and the total variation distance between and . By linearity of expectation and independence, this proves the first claim.
Next, we claim the output distribution is close in total variation distance to the conditional distribution of restricted to . The derivation of (27) implies
(28)
Thus, the total variation of the true output distribution from restricted to is
The first inequality was the triangle inequality, and we bounded the second term by (28). To obtain the final equality, we used
We now bound this final term. Observe that the given conditions imply that is bounded by everywhere in . Thus, expanding we have
Finally, combining these guarantees, and the fact that restricting to loses in total variation distance, yields the desired conclusion by triangle inequality. ∎
Corollary 6.
Proof.
It suffices to take expectations over the randomness of everywhere in the proof of Lemma 12. ∎
B.1.2 Distribution ratio bounds
We next show two bounds relating the densities of distributions and . We first define the normalization constants of (15), (17) for shorthand, and then tightly bound their ratio.
Definition 6 (Normalization constants).
We denote normalization constants of and by
Lemma 13 (Normalization constant bounds).
Let and be as in Definition 6. Then,
Proof.
For each , by convexity we have
(29)
Integrating both sides over yields the upper bound on . Next, for the lower bound we have a similar derivation. For each , by smoothness
Integrating both sides over yields
The last inequality followed from Proposition 11, where we used that is -strongly convex. ∎
Lemma 14 (Relative density bounds).
Let . For all , as defined in (16), . Here, denotes the marginal density of . Moreover, for all , .
Proof.
We first show the upper bound. By Lemma 13,
(30)
We now bound the first term, for . By smoothness, we have
so applying this for each ,
In the last line, we used that implies , and the definition of . Combining this bound with (30), we have the desired
Next, we consider the lower bound. By combining (29) with Lemma 13, we have the desired
∎
B.1.3 Correctness of Composite-Sample-Shared-Min
See Proposition 4.
Proof.
We remark that is precisely the choice of in Sample-Joint-Dist where , as in Composite-Sample-Shared-Min. First, we may apply Fact 2 to conclude that the measure of set with respect to the -strongly logconcave density is at least . The conclusion of correctness will follow from an appeal to Corollary 6, with parameters
Note that indeed we have is bounded by , as . Moreover, the expected number of calls (26) is clearly bounded by a constant as well.
We now show that these parameters satisfy the requirements of Corollary 6. Define the functions
and observe that clearly the densities of and are respectively proportional to and . Moreover, define and . By comparing these definitions with Lemma 13, we have and , so by the upper bound in Lemma 13, . Next, we claim that the following procedure produces an unbiased estimator for .
1. Sample , where
2. 
3. Output
To prove correctness of this estimator , define for simplicity
We compute, using , that
This implies that the output quantity
is unbiased for . Finally, note that for any used in the definition of , by using via smoothness, we have
Here, we used the definition of and by the definition of . ∎
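The estimator above is an instance of the standard importance-sampling identity for unnormalized densities e^{−h} and e^{−q}: Z_h / Z_q = E_{x∼e^{−q}/Z_q}[exp(q(x) − h(x))]. A one-dimensional numerical illustration (toy densities our own):

```python
import numpy as np
from scipy.stats import norm

# q(x) = x^2/2 (standard Gaussian), h(x) = x^2/2 + |x|.
rng = np.random.default_rng(7)
x = rng.normal(size=200_000)                # samples from e^{-q} / Z_q
est = np.mean(np.exp(-np.abs(x)))           # exp(q(x) - h(x)) = exp(-|x|)
exact = 2.0 * np.exp(0.5) * norm.cdf(-1.0)  # closed form for E[exp(-|X|)]
print(est, exact)                           # both ~0.52
```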
B.2 Deferred proofs from Section 5.3
Throughout this section, for error tolerance which parameterizes Sample-Joint-Dist, we denote for shorthand a high-probability region and its radius by
(31)
The following density ratio bounds hold within this region, by simply modifying Lemma 14.
Corollary 7.
The following claim follows immediately from applying Fact 2.
Lemma 15.
With probability at least , lies in .
Finally, when clear from context, we overload as a distribution on to be the component marginal of the distribution (17), i.e. with density
We first note that is stationary for Sample-Joint-Dist; this follows immediately from Lemma 1. In Section B.2.1, we bound the conductance of the walk. We then use this bound in Section B.2.2 to bound the mixing time and overall complexity of Sample-Joint-Dist.
B.2.1 Conductance of Sample-Joint-Dist
We bound the conductance of this random walk, as a process on the iterates , to show the final point has distribution close to the marginal of on . To do so, we break Proposition 7 into two pieces, which we will use in a more white-box manner to prove our conductance bound.
Definition 7 (Restricted conductance).
Let a random walk with stationary distribution on have transition densities , and let . The -restricted conductance, for , is
Proposition 8 (Lemma 1, [CDWY19]).
Let be a -warm start for , and let . For some , let have . Suppose that a random walk with stationary distribution satisfies the -restricted conductance bound
Let be the result of steps of this random walk, starting from . Then, for
the resulting distribution of has total variation at most from .
We state a well-known strategy for lower bounding conductance, via showing the stationary distribution has good isoperimetry and that transition distributions of nearby points have large overlap.
Proposition 9 (Lemma 2, [CDWY19]).
Let a random walk with stationary distribution on have transition distribution densities , and let , and let be the conditional distribution of on . Suppose for any with ,
Also, suppose satisfies, for any partition , , of , where is the minimum Euclidean distance between points in , , the log-isoperimetric inequality
(32) |
Then, we have the bound for all
Lemma 16 (Warm start).
For , defined in (18) is a -warm start for .
Lemma 17 (Transitions of nearby points).
We note that the parameters of Algorithm 5 and the set in (31) satisfy all assumptions of Lemmas 16, 17, and 18. By combining these results in the context of Proposition 9, we see that the random walk satisfies the bound for all :
Plugging this conductance lower bound, the high-probability guarantee of by Lemma 15, and the warm start bound of Lemma 16 into Proposition 8, we have the following conclusion.
Corollary 8 (Mixing time of ideal Sample-Joint-Dist).
Assume that calls to YSample are exact in the implementation of Sample-Joint-Dist. Then, for any error parameter , and
the distribution of has total variation at most from .
B.2.2 Complexity of Sample-Joint-Dist
We first state a guarantee on the subroutine YSample, which we prove in Appendix C.4.
Lemma 19 (YSample guarantee).
We also state a result due to [CDWY19], which bounds the mixing time of 1-step Metropolized HMC for well-conditioned distributions; this handles the case when is large in Algorithm 7.
Proposition 10 (Theorem 1, [CDWY19]).
Let be a distribution on whose negative log-density is convex and has condition number bounded by a constant. Then, Metropolized HMC from an explicit starting distribution mixes to total variation to the distribution in iterations.
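A minimal sketch of Metropolized HMC (our own illustration, not the exact variant analyzed in [CDWY19]): simulate Hamiltonian dynamics with the leapfrog integrator, then accept/reject to correct the discretization bias; with n_leap = 1 this is essentially the one-step method referenced above.

```python
import numpy as np

def hmc_step(x, log_pi, grad_log_pi, step, n_leap, rng):
    """One Metropolized HMC iteration with leapfrog integration; the MH
    correction uses the Hamiltonian H(x, p) = -log_pi(x) + ||p||^2 / 2."""
    p = rng.normal(size=x.shape)
    x_new, p_new = x.copy(), p.copy()
    p_new = p_new + 0.5 * step * grad_log_pi(x_new)   # half momentum step
    for i in range(n_leap):
        x_new = x_new + step * p_new                  # position step
        if i < n_leap - 1:
            p_new = p_new + step * grad_log_pi(x_new)
    p_new = p_new + 0.5 * step * grad_log_pi(x_new)   # final half step
    log_acc = (log_pi(x_new) - 0.5 * p_new @ p_new) - (log_pi(x) - 0.5 * p @ p)
    return x_new if np.log(rng.uniform()) < log_acc else x

# Standard Gaussian target; one leapfrog step per iteration.
rng = np.random.default_rng(8)
log_pi = lambda x: -0.5 * float(x @ x)
grad_log_pi = lambda x: -x
x, xs = np.full(4, 3.0), []
for _ in range(5000):
    x = hmc_step(x, log_pi, grad_log_pi, step=0.8, n_leap=1, rng=rng)
    xs.append(x.copy())
print(np.round(np.var(np.array(xs[1000:])), 1))  # ~1
```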
See Proposition 5.
Proof.
Under an exact YSample, Corollary 8 shows the output distribution of Sample-Joint-Dist has total variation at most from . Next, the resulting distribution of the subroutine YSample is never larger than in total variation distance away from an exact sampler. By running for steps, and using the coupling characterization of total variation, it follows that this can only incur additional error , proving correctness (in fact, the distribution is always at most away in total variation from an exact YSample).
Next, we prove the guarantee on the expected gradient evaluations per iteration. Lemma 19 shows whenever the current iterate has , the expected number of gradient evaluations is constant, and moreover Proposition 10 shows that the number of gradient evaluations is never larger than , where we use that the condition number of the log-density in (19) is bounded by a constant. Therefore, it suffices to show in every iteration , the probability is . By the warmness assumption in Lemma 16, and the concentration bound in Fact 2, the probability does not satisfy this bound is negligible (inverse exponential in ). Since warmness is monotonically decreasing with an exact sampler (this fact is well-known in the literature; a simple proof is that if a distribution is warm, then taking one step of the Markov chain induces a convex combination of warm point masses, and is thus also warm), and the accumulated error due to inexactness of YSample is at most through the whole algorithm, this holds for all iterations. ∎
Appendix C Mixing time ingredients
We now prove facts which are used in the mixing time analysis of Sample-Joint-Dist. Throughout this section, as in the specification of Sample-Joint-Dist, and are functions with properties as in (15), and share a minimizer .
C.1 Warm start
We show that we obtain a warm start for the distribution in algorithm Sample-Joint-Dist via one call to the restricted Gaussian oracle for , by proving Lemma 16.
See Lemma 16.
Proof.
By the definitions of and in (17), (18), we wish to bound everywhere the quantity
(33)
Here, is as in Definition 6, and we let denote the normalization constant of , i.e.
Regarding the first term of (33), the earlier derivation (29) showed
Then, integrating, we can bound the ratio of the normalization constants
(34)
The second inequality followed since is -strongly convex and by assumption. The last inequality followed from Proposition 11, where we used that is -strongly convex. Next, to bound the second term of (33), notice first that
It thus suffices to lower bound . We have
(35)
The first and third steps followed from -smoothness of , and the second applied the Gaussian integral (Fact 1). Combining the bounds in (34) and (35), (33) becomes
where was arbitrary, which completes the proof. ∎
C.2 Transitions of nearby points
Here, we prove Lemma 17. Throughout this section, is the density of , according to the steps in Lines 6 and 7 of Sample-Joint-Dist (Algorithm 5) starting at . We also define to be the density of , by just the step in Line 6. We first make a simplifying observation: by Observation 1, for any two points , , we have
Thus, it suffices to understand for nearby . Our proof of Lemma 17 combines two pieces: (1) bounding the ratio of normalization constants , of and for nearby , in Lemma 22 and (2) the structural result Proposition 12. To bound the normalization constant ratio, we state two helper lemmas. Lemma 20 characterizes facts about the minimizer of
(36)
Lemma 20.
Let be convex with minimizer , and minimize (36) for a given . Then,
1. .
2. For any , .
3. For any with , .
Proof.
By optimality conditions in the definition of ,
Fix two points , , and let . Letting be the Jacobian matrix of ,
We can then compute
By triangle inequality and convexity of , the first claim follows:
The second claim follows from the first by . The third claim follows from the second via
∎
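Lemma 20 concerns the minimizer y_z of the proximal objective (36); assuming (36) has the standard proximal form h(y) + ‖y − z‖²/(2η) for convex h (consistent with the optimality condition used above), the map z ↦ y_z is nonexpansive, which we can check numerically (illustration only; the prox below is computed by gradient descent on a strongly convex objective):

```python
import numpy as np

def prox(z, grad_h, eta, iters=2000, lr=0.5):
    """argmin_y h(y) + ||y - z||^2 / (2*eta), by gradient descent; the
    objective is (1/eta)-strongly convex, so this converges linearly."""
    y = z.copy()
    for _ in range(iters):
        y = y - lr * (grad_h(y) + (y - z) / eta)
    return y

# h(x) = sum_i sqrt(1 + x_i^2): smooth and convex, with gradient below.
grad_h = lambda x: x / np.sqrt(1.0 + x**2)
rng = np.random.default_rng(5)
eta = 1.0
for _ in range(3):
    z1, z2 = rng.normal(size=4), rng.normal(size=4)
    p1, p2 = prox(z1, grad_h, eta), prox(z2, grad_h, eta)
    print(np.linalg.norm(p1 - p2) <= np.linalg.norm(z1 - z2) + 1e-9)  # True
```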
Next, Lemma 21 states well-known bounds on the integral of a well-conditioned function .
Lemma 21.
Let be a -smooth, -strongly convex function and let be its minimizer. Then
Proof.
We now define the normalization constants of and :
(37)
Lemma 22.
Proof.
First, applying Lemma 21 to and yields that the ratio is bounded by
Here, we used the bound for that
Regarding the remaining term, recall , both belong to , and . We have
The first inequality was smoothness and expanding the difference of quadratics. The second was by and , where we used the first and second parts of Lemma 20; we also applied Cauchy-Schwarz and the triangle inequality. The third used the third part of Lemma 20. Finally, the last inequality was by the first part of Lemma 20 and . ∎
Proof.
First, by Observation 1, it suffices to show . Pinsker’s inequality states
where is KL-divergence, so it is enough to show . Notice that
By Lemma 22, the first term satisfies, for ,
To bound the second term, we have
Here, the second line was by expanding and the third line was by and Cauchy-Schwarz. By Proposition 12, , where by assumption the parameters satisfy the conditions of Proposition 12. Then, combining the two bounds, we have
When , , and , we have the desired
∎
C.3 Isoperimetry
In this section, we prove Lemma 18, which asks to show that satisfies a log-isoperimetric inequality (32). Here, we define to be the conditional distribution of the -marginal on set . We recall this means that for any partition , , of ,
The following fact was shown in [CDWY19].
Lemma 23 ([CDWY19], Lemma 11).
Any -strongly logconcave distribution satisfies the log-isoperimetric inequality (32) with .
Observe that , the restriction of to the convex set , is -strongly logconcave by the definition of (15), so it satisfies a log-isoperimetric inequality. We now combine this fact with the relative density bounds Lemma 14 to prove Lemma 18.
See Lemma 18.
Proof.
Fix some partition , , of , and without loss of generality let . First, by applying Corollary 7, which shows everywhere in , we have the bounds
Therefore, we have the sequence of conclusions
Here, the second line was by applying Lemma 23 to the -strongly logconcave distribution , and the final line used for all . ∎
C.4 Correctness of YSample
In this section, we show how we can sample efficiently in the alternating scheme of the algorithm Sample-Joint-Dist, within an extremely high probability region. Specifically, for any with , where is defined in (31), we give a method for implementing
The algorithm is Algorithm 7, which is a simple rejection sampling scheme.
Input: -smooth, -strongly convex with minimizer , , , .
Output: If , return exact sample from distribution with density (see (31) for definition of ). Otherwise, return sample within TV from distribution with density .
We recall that we gave guarantees on rejection sampling procedures in Lemma 5 (an “exact” version of Lemma 12 and Corollary 6). We now prove Lemma 19 via a direct application of Lemma 5.
See Lemma 19.
Proof.
For , YSample is a rejection sampling scheme with
It is clear that everywhere by convexity of , so we may choose . To bound the expected number of iterations and obtain the desired conclusion, Lemma 5 requires a bound on
(38)
the ratio of the normalization constants of and . First, by Fact 1,
Next, by smoothness and Fact 1 once more,
Taking a ratio, the quantity in (38) is bounded above by
The first inequality was , the second used smoothness and the assumed bound on , and the third again used our choice of . ∎
Appendix D Structural results
Here, we prove two structural results about distributions whose negative log-densities are small perturbations of a quadratic, which obtain tighter concentration guarantees compared to naive bounds on strongly logconcave distributions. They are used in obtaining our bounds in Section C (and for the warm start bounds in Section 4), but we hope both the statements and proof techniques are of independent interest to the community. Our first structural result is a bound on normalization constant ratios, used throughout the paper.
Proposition 11.
Let be -strongly convex with minimizer , and let . Then,
Proof.
Define the function
Let be the density proportional to . We compute
Here, the last inequality was by Fact 4, using the fact that the function is -strongly convex. Moreover, note that , and
Solving the differential inequality
we obtain the bound for any (since )
Taking a limit yields the conclusion. ∎
Our second structural result uses a similar proof technique to show that the mean of a bounded perturbation of a Gaussian is not far from its mode, as long as the gradient of the mode is small. We remark that one may directly apply strong logconcavity, i.e. a variant of Fact 4, to obtain a weaker bound by roughly a factor, which would result in a loss of in the guarantees of Theorem 2. This tighter analysis is crucial in our improved mixing time result.
Before stating the bound, we apply Fact 3 to the convex functions and to obtain the following conclusions which will be used in the proof of Proposition 12.
Corollary 9.
Let be a -strongly logconcave density. Then,
1. , for all unit vectors .
2. .
Proposition 12.
Let be -smooth and convex with minimizer , let with , and let be the density proportional to . Suppose that . Then,
Proof.
Define a family of distributions for , with
In particular, , and is a Gaussian with mean . We define , and
Define the function , such that we wish to bound . First, by smoothness
Next, we observe
In order to bound , fix a unit vector . We have
(39)
The third line was Cauchy-Schwarz and the last line used smoothness and convexity, i.e.
We now bound these terms. First,
(40)
Here, we applied the first part of Corollary 9, as is -strongly logconcave, and the definition of . Next, using for any , , we have
(41)
Here, we used the second part of Corollary 9. Maximizing (39) over , and applying (40), (41),
(42)
Assume for contradiction that , violating the conclusion of the proposition. By continuity of , there must have been some where , and for all , . By the mean value theorem, there then exists such that
On the other hand, by our assumption that , for any it follows that
Then, plugging these bounds into (42) and using as ,
We used in the last inequality. This is a contradiction, implying . ∎