Complexity of zigzag sampling algorithm for strongly log-concave distributions
Abstract
We study the computational complexity of the zigzag sampling algorithm for strongly log-concave distributions. The zigzag process has the advantage of not requiring time discretization for implementation, and each proposed bouncing event requires only one evaluation of a partial derivative of the potential, while its convergence rate is dimension independent. Using these properties, we prove that the zigzag sampling algorithm achieves $\varepsilon$ error in chi-square divergence with a computational cost equivalent to $O\bigl(\kappa^2 d^{1/2}(\log\frac{1}{\varepsilon})^{3/2}\bigr)$ gradient evaluations in the regime $\kappa \lesssim \sqrt{d}/\log d$ under a warm start assumption, where $\kappa$ is the condition number and $d$ is the dimension.
Keywords: Monte Carlo sampling; zigzag sampler; log-concave distribution; computational complexity
1 Introduction and Main Results
Monte Carlo sampling from a high-dimensional probability distribution is a fundamental problem with applications in various areas including Bayesian statistics, machine learning, and statistical physics. Many sampling algorithms, especially those for continuous state spaces like $\mathbb{R}^d$, are based on continuous-time Markov processes. Examples of these processes include the overdamped Langevin dynamics, whose invariant measure is the target measure, as well as the underdamped Langevin dynamics and Hamiltonian Monte Carlo (HMC) [duane1987hybrid], both of which augment the state space with a velocity variable $v$ and have the $x$-marginal distribution of their invariant measure equal to the target measure. For strongly log-concave distributions, all these processes converge to equilibrium exponentially fast with rates independent of the dimension, making them suitable for sampling purposes. On the other hand, all of these processes require time discretization for implementation, which not only induces further numerical error but also requires the time step to be small, leading to higher computational complexity when a small bias is desired. To remove the bias due to discretization, the conventional remedy is to introduce a Metropolis-Hastings acceptance-rejection step, but rejections represent wasted computational resources.
A very different line of sampling algorithms has recently been developed in the statistical physics and statistics literature [peters2012rejection], based on piecewise deterministic Markov processes (PDMPs) [davis1984piecewise]. These processes are non-reversible, and thus may mix faster than reversible MCMC methods [diaconis2000analysis, turitsyn2011irreversible]. Examples of such samplers include the randomized Hamiltonian Monte Carlo [bou2017randomized], the zigzag process [bierkens2019zig], the bouncy particle sampler [peters2012rejection, bouchard2018bouncy], and others [vanetti2017piecewise, michel2014generalized, bierkens2020boomerang]. The zigzag and bouncy particle samplers are appealing for big data applications, as they can be unbiased even if stochastic gradients are used [bouchard2018bouncy, bierkens2019zig]. These algorithms, as they are still relatively new, have not yet been thoroughly analyzed. In particular, no non-asymptotic computational complexity bounds for these algorithms have been established yet, to the best of our knowledge. Our previous work [lu2020explicit] gives explicit exponential convergence rates for PDMPs with log-concave potentials, which opens the possibility of deriving such complexity bounds for PDMPs and provides the foundation of this work.
1.1 Algorithm and Assumptions
Let $x$ denote the state variable in $\mathbb{R}^d$, where $d$ is the dimension. The target distribution we want to sample from is denoted by
$$\pi(\mathrm{d}x) = \frac{1}{Z}\,e^{-U(x)}\,\mathrm{d}x,$$
where $U$ is the potential and $Z = \int_{\mathbb{R}^d} e^{-U(x)}\,\mathrm{d}x$ is the normalizing constant. Although the zigzag process can also be applied to sample non-log-concave distributions, we will restrict our analysis to strongly log-concave distributions, namely, we make the following assumption throughout:
Assumption 1.
The potential function $U$ satisfies
(1) $\quad m I_d \preceq \nabla^2 U(x) \preceq L I_d \quad$ for all $x \in \mathbb{R}^d$,
for some $0 < m \le L < \infty$. Moreover, $U$ has a unique minimizer at $x = 0$, and $U(0) = 0$.
For any random variable $X$, we use $\operatorname{law}(X)$ to denote its law. In this paper, we use the chi-square divergence to measure the difference between two probability measures: for probability measures $\mu, \nu$ such that $\mu \ll \nu$, it is defined as
$$\chi^2(\mu \,\|\, \nu) = \int \Bigl(\frac{\mathrm{d}\mu}{\mathrm{d}\nu} - 1\Bigr)^2 \mathrm{d}\nu = \int \Bigl(\frac{\mathrm{d}\mu}{\mathrm{d}\nu}\Bigr)^2 \mathrm{d}\nu - 1.$$
The zigzag sampling algorithm is based on a piecewise deterministic Markov process, called the zigzag process. Besides the variable $x$, we augment the state space by an auxiliary velocity variable $v$ taking values in $\mathbb{R}^d$. A trajectory of the zigzag process, denoted by $\bigl(x(t), v(t)\bigr)_{t \ge 0}$, can be described as follows. Given some initial $\bigl(x(0), v(0)\bigr)$, the position always evolves according to $\dot{x}(t) = v(t)$, while the velocity is piecewise constant and only changes when bouncing or refreshing events occur at random times following Poisson clocks. Bouncing events in the $i$-th direction occur with rate $\bigl(v_i\,\partial_i U(x)\bigr)_+$, and at such an event the velocity changes by flipping its $i$-th component to $-v_i$. Refreshing events occur with rate $\lambda$ for some fixed $\lambda$, at which the velocity is completely redrawn from the standard normal $\mathcal{N}(0, I_d)$.
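To make the dynamics concrete, the following is a minimal sketch of a zigzag trajectory. It is not the exact variant analyzed in this paper (which refreshes the velocity from a Gaussian): for simplicity it uses the classic zigzag velocity set $v \in \{-1, 1\}^d$ with no refreshment, and the standard Gaussian potential $U(x) = |x|^2/2$, so that $\partial_i U(x) = x_i$ and the bouncing clocks can be simulated exactly by thinning against a linear-in-time bound.

```python
import numpy as np

def zigzag_gaussian_moments(T, d, rng):
    """Exact zigzag trajectory for U(x) = |x|^2 / 2 (standard Gaussian target),
    returning time averages of x and x^2 over [0, T].

    In direction i, the bouncing rate along the free flight x + s*v is
    (v_i (x_i + v_i s))_+ <= (v_i x_i)_+ + s   (using v_i^2 = 1),
    and a Poisson clock with the linear bound a + s can be inverted in closed
    form; proposals are then accepted with probability (true rate) / (bound).
    """
    x = rng.standard_normal(d)
    v = rng.choice([-1.0, 1.0], size=d)
    t = 0.0
    m1, m2 = np.zeros(d), np.zeros(d)   # running integrals of x(t) and x(t)^2
    while t < T:
        a = np.maximum(v * x, 0.0)       # intercepts of the bounding rates
        e = rng.exponential(size=d)      # Exp(1) clocks, one per direction
        # first arrival for the bound a_i + s: solve a_i*s + s^2/2 = e_i
        s = -a + np.sqrt(a * a + 2.0 * e)
        i = int(np.argmin(s))
        dt = min(s[i], T - t)
        m1 += x * dt + 0.5 * v * dt**2                  # integral of x_j + v_j s
        m2 += x**2 * dt + x * v * dt**2 + dt**3 / 3.0   # integral of (x_j + v_j s)^2
        x = x + dt * v
        t += dt
        if T - t < 1e-12:
            break
        # thinning: accept the proposed bounce in direction i
        if dt == s[i] and rng.uniform() < max(v[i] * x[i], 0.0) / (a[i] + s[i]):
            v[i] = -v[i]
    return m1 / T, m2 / T
```

Time averages along a single long trajectory then approximate the moments of $\mathcal{N}(0, I_d)$ (mean $0$, second moment $1$), illustrating that no discretization error enters the simulation.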
It has been established [andrieu2018hypocoercivity, bierkens2019ergodicity, lu2020explicit] that under Assumption 1, the law of $\bigl(x(t), v(t)\bigr)$ converges to the invariant measure of the zigzag process, which is the product measure of the target measure in $x$ and the standard Gaussian in $v$:
$$\pi(\mathrm{d}x) \otimes (2\pi)^{-d/2} e^{-|v|^2/2}\,\mathrm{d}v.$$
Our analysis relies on the following more quantitative convergence result for the zigzag process proved in [lu2020explicit], which also specifies the optimal choice of the refreshing rate $\lambda$. ([lu2020explicit] shows exponential convergence for the backward equation; by duality, the exponential convergence of the backward equation is equivalent to the exponential convergence of the forward equation in chi-square divergence with the same rate.) We would like to comment here that this choice of $\lambda$ is completely technical, since it optimizes the theoretical convergence rate (up to a universal constant) of the zigzag process established in [lu2020explicit]. The zigzag process is ergodic even if $\lambda = 0$, and in practice the choice $\lambda = 0$ is common.
Proposition 1.1.
[lu2020explicit]*Theorem 1 Under Assumption 1, there exists a universal constant $c > 0$, independent of all parameters, such that for any initial density, the zigzag process with a suitably chosen friction parameter $\lambda$ satisfies
(2)
The left-hand side of (2) controls the desired divergence of $\operatorname{law}(x(T))$ with respect to the target measure $\pi$, since the chi-square divergence of a marginal is bounded by that of the joint law:
$$\chi^2\bigl(\operatorname{law}(x(T))\,\big\|\,\pi\bigr) \le \chi^2\bigl(\operatorname{law}(x(T), v(T))\,\big\|\,\pi \otimes \mathcal{N}(0, I_d)\bigr).$$
Moreover, we take the initial condition in the product form
(3) $\quad \operatorname{law}\bigl(x(0), v(0)\bigr) = \rho_0 \otimes \mathcal{N}(0, I_d),$
which implies that $\chi^2\bigl(\rho_0 \otimes \mathcal{N}(0, I_d)\,\big\|\,\pi \otimes \mathcal{N}(0, I_d)\bigr) = \chi^2(\rho_0 \,\|\, \pi)$. Therefore, combining this with (2), we get
(4)
which gives the total time $T$ needed to achieve the desired control of the chi-square divergence.
Of course, in practice, we cannot simulate the zigzag process directly, as simulating the Poisson process associated with the bouncing events would require integrating the rates along the trajectory. To turn the zigzag process into an efficient and practical sampling algorithm, the Poisson process for the bouncing events is usually simulated using the Poisson thinning trick (see e.g. the discussion in [bierkens2019zig]*Section 3). Under Assumption 1, we will use the following upper bound estimate for the rate:
(5) $\quad \bigl(v_i\,\partial_i U(x + tv)\bigr)_+ \le |v_i|\,\bigl|\nabla U(x + tv)\bigr| \le L\,|v_i|\,\bigl(|x| + t\,|v|\bigr).$
This upper bound has the advantage of not involving evaluations of $U$ or its partial derivatives, which greatly reduces the computational cost compared with using numerical quadrature for the Poisson clocks. The price to pay is the increased frequency of potential bouncing events, which scales like $\sqrt{d}$, since the pessimistic bound $|\partial_i U| \le |\nabla U|$ for the partial derivative typically sacrifices a factor of $\sqrt{d}$ in the first inequality.
Following the above discussion, the zigzag sampling algorithm is described in Algorithm 1, where Step LABEL:state:magrate uses the upper bound estimate in (5), while Steps LABEL:state:thinning1–LABEL:state:thinning2 correspond to the Poisson thinning step. Note that for each potential bouncing event, the algorithm requires one evaluation of a partial derivative in Step LABEL:state:thinning1. In practice, accessing the partial derivatives of $U$ is typically the most time-consuming step; therefore, in our complexity analysis, we focus on the number of accesses to partial derivatives.
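Abstractly, the thinning step works as follows: to simulate a Poisson clock with an intensity $\lambda(t)$ that is intractable but dominated by a tractable bound, propose events from the bound and accept each proposal with probability $\lambda(t)/\text{bound}$; in Algorithm 1, evaluating this acceptance probability is exactly where the single partial derivative evaluation occurs. Here is a generic sketch with a constant bound (the algorithm itself uses the linear-in-time bound (5)); the rate function below is illustrative only.

```python
import numpy as np

def thinned_poisson_times(rate, rate_bound, T, rng):
    """Event times on [0, T] of an inhomogeneous Poisson process with
    intensity rate(t) <= rate_bound, simulated by thinning: propose from a
    homogeneous Poisson(rate_bound) process, accept each proposal with
    probability rate(t) / rate_bound."""
    times, t = [], 0.0
    while True:
        t += rng.exponential(1.0 / rate_bound)  # homogeneous proposal gaps
        if t > T:
            return times
        if rng.uniform() < rate(t) / rate_bound:  # thinning (accept) step
            times.append(t)

# sanity check: rate(t) = 1 + sin(t)^2 has average intensity 3/2
rng = np.random.default_rng(0)
events = thinned_poisson_times(lambda t: 1.0 + np.sin(t) ** 2, 2.0, 3000.0, rng)
```

The expected number of accepted events is $\int_0^T \lambda(t)\,\mathrm{d}t$, while the number of proposals scales with the bound, which is the source of the extra $\sqrt{d}$ factor discussed above.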
We also need the following assumption for technical purposes, as will be discussed after stating the main results:
Assumption 2.
The initial distribution $\rho_0$ satisfies a warm-start condition:
(6)
where $\kappa = L/m$ is the condition number, and $c$ is the same universal constant as in (2). Furthermore, the initial distribution is concentrated in the sense that
(7)
Remark 1.2.
The concentration condition (7) can be easily satisfied: by the Gaussian Annulus Theorem, it holds for a suitable Gaussian initial distribution up to a failure probability that is exponentially small in $d$, and the failure probability only gets smaller for more concentrated choices. The warm start condition (6) is more stringent but can be achieved by first running Langevin Monte Carlo (LMC). We will discuss this after presenting our main result.
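The Gaussian Annulus phenomenon invoked here is easy to check numerically: for $\xi \sim \mathcal{N}(0, I_d)$, the norm $|\xi|$ concentrates around $\sqrt{d}$ with fluctuations of constant order. A quick illustration (the dimension and thresholds below are for illustration and are not the constants of the theorem):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10_000
# 500 independent standard Gaussian vectors in dimension d
norms = np.linalg.norm(rng.standard_normal((500, d)), axis=1)
rel_dev = np.abs(norms / np.sqrt(d) - 1.0)  # relative deviation from sqrt(d)
print(rel_dev.max())  # typically a few percent at most
```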
1.2 Main Results
Theorem 1.
Remark 1.3.
By repeated trials, the theorem implies that for any $\delta > 0$, with probability $1 - \delta$, Algorithm 1 returns the desired output with a computational cost of $\widetilde{O}\bigl(\kappa^2 d^{3/2}(\log\frac{1}{\varepsilon})^{3/2}\bigr)$ evaluations of partial derivatives of $U$, where $\widetilde{O}$ hides logarithmic factors.
With the common computational model that $d$ evaluations of partial derivatives of $U$ are equivalent to one evaluation of $\nabla U$ in complexity, the complexity of the zigzag algorithm is equivalent to $\widetilde{O}\bigl(\kappa^2 d^{1/2}(\log\frac{1}{\varepsilon})^{3/2}\bigr)$ evaluations of $\nabla U$.
Let us explain the choice of $T$ in (9): for the zigzag sampling algorithm to reach the target accuracy according to (4), the terminal time $T$ needs to be large enough. Meanwhile, Assumption 2 guarantees that $T$ is not too large, as otherwise we could not effectively control the number of bouncing events, either due to a very large velocity drawn at a velocity refreshing event or due to the trajectory reaching regions with large gradient. These considerations motivate our previous Assumption 2 on the initial distribution $\rho_0$, as well as the restriction that $\varepsilon$ cannot be too small. We remark that the assumption on $\varepsilon$ is not prohibitive, as we are interested in high-dimensional cases and the allowed error threshold is exponentially small in $d$.
The warm start condition (6) can be achieved if we start with a Gaussian distribution in $x$ and run Langevin Monte Carlo
(10) $\quad x_{k+1} = x_k - h\,\nabla U(x_k) + \sqrt{2h}\,\xi_k,$
where $h$ is the step size, and $\xi_k \sim \mathcal{N}(0, I_d)$ are i.i.d. random variables. This leads to the following corollary:
Corollary 1.4.
Suppose the potential satisfies Assumption 1 with parameters such that the condition number is bounded by some computable (from [erdogdu2020convergence]) universal constant threshold. Then, for any prescribed accuracy $\varepsilon$, if we initialize with a suitable Gaussian, the hybrid algorithm (first running LMC (10) for a suitable number of steps with a suitable step size, and then Algorithm 1 up to time $T$) outputs a random variable such that
(11)
Moreover, with high probability, Algorithm 1 returns an output with a computational cost of
evaluations of partial derivatives of .
Proof.
It is easy to verify that our choice of step size satisfies the required conditions (as in [erdogdu2020convergence]*Lemma 14). Therefore, we may appeal to Lemmas 2, 14, 25, 26 of [erdogdu2020convergence], so that the random variable produced in (10) satisfies
(12)
for some universal constant. This, combined with our assumption on the condition number, guarantees that (6) holds with the output law of the LMC stage playing the role of the initial distribution. We can also check the validity of (7) by
Therefore we may apply Theorem 1 and derive that the total computational cost (in terms of the number of partial derivative evaluations of $U$) equals
∎
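For concreteness, here is a minimal sketch of the LMC recursion (10); the target, step size, and iteration counts below are illustrative and are not the tuned values used in the corollary.

```python
import numpy as np

def lmc(grad_U, x0, h, n_steps, rng):
    """Unadjusted Langevin Monte Carlo:
    x_{k+1} = x_k - h * grad_U(x_k) + sqrt(2h) * xi_k,  xi_k ~ N(0, I)."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x - h * grad_U(x) + np.sqrt(2.0 * h) * rng.standard_normal(x.shape)
    return x

# For U(x) = |x|^2 / 2 the chain is an AR(1) process whose stationary variance
# is 2h / (1 - (1 - h)^2) = 1 / (1 - h/2) > 1: LMC carries an O(h)
# discretization bias, which is exactly the bias the zigzag sampler avoids.
rng = np.random.default_rng(0)
h = 0.1
samples = lmc(lambda x: x, np.zeros((4000, 1)), h, 300, rng)
```

This illustrates why LMC serves here only as an initialization: its output is close enough to $\pi$ to satisfy the warm start, while the bias-free zigzag stage then delivers the accuracy.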
Theorem 1 guarantees that the zigzag sampling algorithm (Algorithm 1) outputs a sample from a distribution with $\chi^2$-divergence at most $\varepsilon$ from the target density, at a computational complexity equivalent to the stated number of partial derivative evaluations (equivalently, gradient evaluations), in the considered regime under a warm-start condition. Corollary 1.4 establishes that the hybrid LMC-zigzag algorithm outputs such a sample at a comparable computational complexity in gradient evaluations. The initialization using LMC is added only for technical reasons, since we currently do not have complexity guarantees starting from an explicit initial distribution otherwise, nor is it necessary for actual implementations. We would also like to comment that our goal is to obtain the best possible scaling in $d$; the scaling in $\kappa$ might possibly be improved by a more careful analysis.
Our analysis is based on the quantitative convergence rate of the zigzag process established in [lu2020explicit], which applies to $m$-convex and $L$-smooth potentials. The rest of our proof is based on estimating the size of a single trajectory of the zigzag process and subsequently turning this into an estimate on the number of potential bouncing events, and hence on the number of partial derivative evaluations. Our analysis utilizes two important and desirable features of the zigzag sampling process:
• The implementation of the zigzag process does not need time discretization, as the velocity in each deterministic portion of the trajectory remains constant; this makes it possible to simulate the exact trajectories of the zigzag process while eliminating an important source of error. This is the reason that the complexity of the zigzag process has only logarithmic dependence on $\frac{1}{\varepsilon}$, without Metropolis acceptance/rejection.
• Moreover, each potential bouncing event of the zigzag process requires only one evaluation of a partial derivative of the potential, which is cheaper in computational cost than a full gradient evaluation under the usual model of computation.
We would also remark that we quantify the error of the output distribution in terms of $\chi^2$-divergence, which provides a stronger guarantee than total variation, KL divergence, or $2$-Wasserstein distance. While $\chi^2$-divergence is relatively convenient for obtaining convergence rates of continuous processes based on the Poincaré inequality [cao2019explicit, lu2020explicit], it does not seem easy to use for analyzing the discretization error of SDEs. The work [vempala2019rapid] assumed a Poincaré inequality for the discrete invariant measure, which is difficult to verify. We are fortunate to avoid this problem for the zigzag sampler, thanks to the fact that zigzag does not need time discretization. After the first version of this work appeared online, [erdogdu2020convergence] established convergence of LMC in $\chi^2$- and Rényi divergence, using the exponential convergence of the continuous-time overdamped Langevin dynamics in Rényi divergence [cao2019exponential, vempala2019rapid].
1.3 Previous Works
Here we focus on results on non-asymptotic analysis of sampling algorithms, which has been an active research area in recent years. Many sampling algorithms have been analyzed, including algorithms based on overdamped Langevin dynamics [dalalyan2017theoretical, durmus2019high, durmus2019analysis, vempala2019rapid, li2019stochastic, ding2020randomlangevin], underdamped Langevin dynamics [cheng2018underdamped, dalalyan2020sampling, ma2019there, shen2019randomized, ding2020randomunderdamped, monmarche2020high], Hamiltonian Monte Carlo [mangoubi2017rapid, lee2018algorithmic, chen2019optimal, mangoubi2018dimensionally, bou2020coupling], and high-order Langevin dynamics [mou2019high], among others. These methods involve discretization of ODEs or SDEs, which yields an error that scales polynomially with the step size. Thus the complexity of these algorithms has polynomial dependence on $\frac{1}{\varepsilon}$, where $\varepsilon$ is the desired accuracy threshold.
Metropolized variants of sampling algorithms, including Metropolized HMC and the Metropolis Adjusted Langevin Algorithm (MALA), have also been studied in [dwivedi2018log, chen2020fast, lee2020logsmooth]; their complexities have only logarithmic dependence on $\frac{1}{\varepsilon}$, similar to the zigzag sampling algorithm analyzed here. In [dwivedi2018log] a complexity upper bound for MALA is established under a warm start condition, and another with a feasible start. In [chen2020fast] the complexity upper bound for MALA with a feasible start is improved. The work [chen2020fast] also established bounds for Metropolized HMC, with a warm start (which is in fact more stringent than our Assumption 2) in a restricted regime, and with a feasible start if the target potential function has a bounded Hessian. The complexity upper bound has been further improved in [lee2020logsmooth] for both Metropolized HMC and MALA with a feasible start, based on a refined analysis using concentration of the gradient norm. In comparison, our result for zigzag relies on a warm start (which is achievable by LMC), while the complexity upper bound has better dependence on $d$. The issue of feasible start will be further discussed in Section 3.
Regarding asymptotic analysis of the convergence of the zigzag process, ergodicity was first established in [bierkens2019ergodicity]. Exponential convergence of the zigzag process was established in [fontbona2016long, bierkens2017piecewise] using a Lyapunov function argument. A central limit theorem for the zigzag process was established in [bierkens2017limit], and a large deviation principle was established for the empirical measure in [bierkens2019large]. The spectrum of the zigzag process has been studied in [bierkens2019spectral, guillin2020low]. A dimension-independent exponential convergence rate for the zigzag process was established in [andrieu2018hypocoercivity], using the hypocoercivity framework developed in [dolbeault2015hypocoercivity]. Finally, a more quantitative convergence estimate was established in [lu2020explicit], on which our analysis of the sampling algorithm is based.
2 Strategy of the Proof
Since Algorithm 1 always simulates exact trajectories of the zigzag process, (8) is guaranteed with the correct choice of $T$. Therefore we only need to estimate the computational complexity. The strategy of the proof is to first give an estimate on the size of the trajectory (Lemma 2.1), which directly controls the proposal rates; this upper bound in turn provides an estimate on the number of partial derivative evaluations of $U$. The complexity upper bound we derive holds with high probability; while it does not always hold (for example, the number of proposed bouncing events from the Poisson clocks might be atypically high), such events occur only with very small probability, which will be controlled in the proof.
Let $N$ be the total number of velocity refreshments (including the initial refreshment); then $N - 1$ is a Poisson random variable with mean $\lambda T$, i.e.,
(13)
Let $T_1 < T_2 < \cdots$ be the refresh times, and let $V_j$ be the velocity variable after the refreshment at time $T_j$. For each $j$, we use $\tau_j$ to denote the time duration between consecutive refreshments. For convenience, we also treat the terminal time $T$ as the end of the last such interval.
The first step of the proof is the following lemma, which controls the size of the trajectory conditioned on some high-probability events. The proof will be deferred to the appendix.
Lemma 2.1.
The next element of the proof is to control the failure event that (14) does not hold. The control of the first four events is relatively straightforward and will thus be carried out directly in the proof of the theorem below; we state the probability for event (14e) to hold as the following lemma, which will also be proved in the appendix.
Lemma 2.2.
There exists a universal constant such that, if , then with probability , condition (14e) holds.
The final component of the proof is to turn the estimate on the trajectory into an upper bound on the number of proposed bouncing events.
Proof of Theorem 1.
We bound the probability that each condition in (14) of Lemma 2.1 fails. We start with condition (14a) of Lemma 2.1. For the Poisson process with the refresh times as arrival times, we may estimate the first failure probability (here and for the rest of the proofs, $C$ denotes a universal constant that may change from line to line):
(16)
We now check conditions (14b) and (14c) of Lemma 2.1. By the Gaussian Annulus Theorem, for each refreshment, we have
(17)
where $C$ is some universal constant. We also require each refreshed velocity to satisfy a norm bound, which has failure probability
Since we have to draw the velocity $N$ times, cumulatively this yields a failure probability
(18)
Recall the assumption as well as (6) (and that ), which implies that for our choice of as in (9). Together with condition (14a), we derive (neglecting the obviously smaller term )
The failure probability for condition (14d) is straightforward to estimate. Using Assumption 1, we have
which indicates
Finally, the failure probability of (14e) was already estimated in Lemma 2.2. In summary, the total failure probability of (14) can be bounded as
(19)
We now assume that condition (14) holds. Thus, Lemma 2.1 together with Assumption 1 implies that
(20)
After each refreshment or bouncing event, Algorithm 1 runs independent Poisson clocks defined in Step LABEL:state:magrate where, in view of (20),
(21)
This motivates us to consider the following counting process : suppose are i.i.d. random variables with where and , and let . By construction, the probability of under condition (14) is controlled by . Therefore, it suffices to estimate .
We compute the expectation of this counting process:
(22)
On the other hand,
Therefore we may appeal to Kolmogorov's inequality [durrett2019probability]*Theorem 2.5.2:
$$\overset{(22)}{\le} \mathbb{P}\bigl(S_{8AT} - \mathbb{E}S_{8AT} < -T\bigr)$$
To sum up, we have established that with high probability the number of partial derivative evaluations is bounded by
3 Discussion
We establish non-asymptotic complexity bounds for the zigzag sampling algorithm. While we focus on zigzag sampler in this work, we expect that similar analysis for other PDMPs [bouchard2018bouncy, vanetti2017piecewise, michel2014generalized, bierkens2020boomerang] can be carried out. We leave these for future research.
We admit that our warm-start requirement (6) may be stringent. We observe that (6) implicitly requires the condition number to be much smaller than the corresponding dimension-dependent threshold, as otherwise (6) would impose an unrealistic requirement on the initial distribution. Corollary 1.4 essentially requires a comparable restriction for the analysis to hold. This restriction on the condition number is not completely unexpected, since the zigzag sampler does perform poorly for highly anisotropic densities (see for example the numerical results in [michel2014generalized]).
A major issue with the warm-start assumption comes from our choice of $\chi^2$-divergence, rather than total variation, $2$-Wasserstein distance, or KL divergence as in previous works on non-asymptotic analysis of sampling algorithms. In particular, if we choose the initial condition
(23)
as in previous works, then we have
which violates (6). On the other hand, for the same choice of initial distribution, as long as $U$ satisfies Assumption 1, one can estimate
This means the corresponding quantity grows at most algebraically in $d$, and consequently the logarithms of the total variation and $2$-Wasserstein distances are controlled by an algebraic power of $d$, making this choice suitable for initialization under those metrics. We hope the following conjecture is true:
Conjecture 1.
Under Assumption 1, there exists a universal constant independent of all parameters, such that for any initial density , the zigzag process with friction parameter satisfies
If this is indeed true, we can establish the convergence in KL divergence of the pure zigzag sampler using a feasible start, without using LMC for initialization.
Another interesting open question is whether one can find a tighter upper bound than that in Step LABEL:state:magrate of Algorithm 1 in order to reduce the computational complexity, since the current bound magnifies the proposed bouncing rates by a factor of $\sqrt{d}$. The following lemma, which might be of independent interest, provides a concentration bound for the gradient norm, so that we might be able to give up a small probability of failure in exchange for a much sharper bouncing rate control.
Lemma 3.1.
Let $U$ satisfy Assumption 1. Then,
(24)
The proof of this lemma, deferred to the appendix, is inspired by [lee2020logsmooth], which uses the following Brascamp-Lieb inequality [brascamp2002extensions]:
Lemma 3.2.
Let $U$ satisfy Assumption 1; then for any smooth function $g$,
(25) $\quad \operatorname{Var}_\pi(g) \le \mathbb{E}_\pi\bigl[(\nabla g)^{\top}(\nabla^2 U)^{-1}\nabla g\bigr].$
With Lemma 3.1, it might be possible to improve Algorithm 1, while surrendering a small probability of failure, by replacing Step LABEL:state:magrate with a sharper bound, since the gradient norm stays moderate with high probability. This motivates the following conjecture:
Conjecture 2.
Under the Assumption 1, for any and that are both smaller than some algebraic power of , there exists an algorithm that gives a random variable such that
(26)
Moreover, with high probability, the algorithm requires evaluations of partial derivatives of .
Unfortunately, there are several difficulties in proving the conjecture. One is that although the gradient norm does not exceed the sharper bound with high probability under $\pi$, we are unable to control the partial derivatives along a trajectory of the zigzag process. Another issue is that since some trajectories of the zigzag process may reach regions where partial derivatives exceed the bound, we would no longer always simulate exact trajectories, which introduces bias into the sampling.
Acknowledgment. This work is supported in part by National Science Foundation via grants CCF-1910571 and DMS-2012286. We would like to thank Murat Erdogdu for pointing us to their complexity analysis of Langevin Monte Carlo in chi-square divergence [erdogdu2020convergence] to remove the warm start assumptions.
Appendix A Proof of Lemma 2.1
Proof.
Let . If no bouncing happens, then
In addition, the quantity decreases when bouncing happens, since some positive term is changed to its negative while the others remain unchanged. Therefore, since the velocity does not change between refreshments, we have for any such time (we remark that the derivative is not well-defined at the bouncing times; nevertheless, (27) still makes sense since the quantity decreases at the bouncing events, and since we only use (27) in the time-integral sense, this will not cause any problem):
(27)
Notice that for a convex function satisfying Assumption 1, we have by co-coercivity
therefore for any , and any ,
(28)
In particular,
Choosing , we have
Now we apply the above formula iteratively and derive
$\overset{(14\mathrm{d}),(14\mathrm{e})}{\le} C\,L\,T\,d.$
Here we used one estimate, which is true due to (14a), and another, which is true with our choice of $T$ in (9). ∎
Appendix B Proof of Lemma 2.2
Proof.
By properties of the Poisson process [durrett1999essentials], if we condition on the number of arrivals in $[0, T]$, then the arrival times have the same joint distribution as the order statistics of that many i.i.d. random variables uniformly distributed in $[0, T]$. This means
(29)
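The conditioning fact used here, namely that given the number of Poisson arrivals in $[0, T]$ the arrival times are jointly distributed as the order statistics of that many i.i.d. $\mathrm{Uniform}[0, T]$ variables, can be sanity-checked numerically: conditioned on exactly $5$ arrivals in $[0, 1]$, the expected sum of the arrival times is $5 \cdot \tfrac{1}{2}$. (A numerical illustration only, not part of the proof.)

```python
import numpy as np

rng = np.random.default_rng(1)
T, lam, target = 1.0, 5.0, 5
sums = []
for _ in range(20000):
    t, times = 0.0, []
    while True:
        t += rng.exponential(1.0 / lam)   # i.i.d. Exp(lam) inter-arrival gaps
        if t > T:
            break
        times.append(t)
    if len(times) == target:              # condition on exactly 5 arrivals
        sums.append(sum(times))
mean_sum = np.mean(sums)
# conditioned on N = 5 arrivals, E[T_1 + ... + T_5] = 5 * T / 2 = 2.5
```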
To calculate , let us define
and compute by induction in . For , as the sum contains only one term, . An easy calculation shows that . We will show in general
(30)
Indeed, suppose (30) holds for , we want to prove (30) for , the starting point of which is the following observation:
The first integral can be treated by integrating the variables one by one, working from the innermost variable outward:
(31)
By the induction assumption (30) for we have
Combining the above with (31), we finish the induction step. Therefore
The full expectation then follows since the number of arrivals is a Poisson random variable:
To get the desired estimate, we apply Chebyshev’s inequality using the second moment. By the same arguments leading towards (29), we have
Denote
Using the same induction argument as in the proof of (30), we can prove
This can be easily verified in the base case, and the induction follows from the calculation:
This shows , and therefore
This means
where the inequality above holds when the relevant quantity is larger than some universal constant (which we may assume, as that is the interesting parameter regime).
Finally, to conclude the proof, we apply Chebyshev's inequality to estimate the failure probability as
Appendix C Proof of Lemma 3.1
Proof.
The first step is to show that
(32) $\quad \mathbb{E}_\pi |\nabla U| \le \sqrt{Ld}.$
This is straightforward, since using integration by parts,
(33) $\quad \mathbb{E}_\pi |\nabla U|^2 = \mathbb{E}_\pi \Delta U \le Ld,$
and (32) then follows from the Cauchy-Schwarz inequality.
The next step is to establish a concentration bound. Let , where is a smooth nonnegative increasing function satisfying
and . By the construction of , we have
(34)
Then . By Lemma 3.2 for , we have
Thus for we have
(35)
Now we use (35) recursively, and we obtain for ,
(36)
Notice
(37)
Moreover, by [bobkov1997poincare]*Proposition 4.1,
(38)
Substituting (37) and (38) into (36), we obtain
Finally, combining the above exponential moment bound of with Chebyshev inequality, we get
Now take , and , and using (34) (noticing when since ), we arrive at