
Complexity of zigzag sampling algorithm for strongly log-concave distributions

Jianfeng Lu Department of Mathematics, Department of Physics, and Department of Chemistry, Duke University, Durham NC 27708, USA ([email protected])    Lihan Wang For correspondence; Department of Mathematics, Duke University, Durham NC 27708 ([email protected]). Current Address: Department of Mathematical Sciences, Carnegie Mellon University, 311 Hamerschlag Drive, Pittsburgh, PA, 15213, USA ([email protected])
Abstract

We study the computational complexity of the zigzag sampling algorithm for strongly log-concave distributions. The zigzag process has the advantages that its implementation does not require time discretization and that each proposed bouncing event requires only one evaluation of a partial derivative of the potential, while its convergence rate is dimension independent. Using these properties, we prove that the zigzag sampling algorithm achieves \varepsilon error in chi-square divergence with a computational cost equivalent to O\bigl{(}\kappa^{2}d^{\frac{1}{2}}(\log\frac{1}{\varepsilon})^{\frac{3}{2}}\bigr{)} gradient evaluations in the regime \kappa\ll\frac{d}{\log d} under a warm start assumption, where \kappa is the condition number and d is the dimension.

Keywords: Monte Carlo sampling; zigzag sampler; log-concave distribution; computational complexity

1 Introduction and Main Results

Monte Carlo sampling from a high-dimensional probability distribution is a fundamental problem with applications in various areas including Bayesian statistics, machine learning, and statistical physics. Many sampling algorithms, especially those for continuous state spaces like \mathbb{R}^{d}, are based on continuous-time Markov processes. Examples include the overdamped Langevin dynamics, whose invariant measure is the target measure, and the underdamped Langevin dynamics and Hamiltonian Monte Carlo (HMC) [duane1987hybrid], both of which augment the state space with a velocity variable v and have the x-marginal of their invariant measure equal to the target measure. For strongly log-concave distributions, all these processes converge to equilibrium exponentially fast with rates independent of the dimension, making them suitable for sampling purposes. On the other hand, all of these processes require time discretization for implementation, which not only introduces additional numerical error but also forces the time step to be small, so that a higher computational cost is required if a small bias is desired. To remove the bias due to discretization, the conventional procedure is to introduce a Metropolis-Hastings acceptance-rejection step, but rejections waste computational resources.

A very different line of sampling algorithms has recently been developed in the statistical physics and statistics literature [peters2012rejection], based on piecewise deterministic Markov processes (PDMPs) [davis1984piecewise]. These processes are non-reversible, and thus may mix faster than reversible MCMC methods [diaconis2000analysis, turitsyn2011irreversible]. Examples of such samplers include the randomized Hamiltonian Monte Carlo [bou2017randomized], the zigzag process [bierkens2019zig], the bouncy particle sampler [peters2012rejection, bouchard2018bouncy], and others [vanetti2017piecewise, michel2014generalized, bierkens2020boomerang]. The zigzag and bouncy particle samplers are appealing for big data applications, as they can be unbiased even if stochastic gradients are used [bouchard2018bouncy, bierkens2019zig]. These algorithms, being still relatively new, have not yet been thoroughly analyzed; in particular, to the best of our knowledge, no non-asymptotic computational complexity bounds have been established for them. Our previous work [lu2020explicit] gives explicit exponential convergence rates for PDMPs with log-concave potentials, which opens the possibility of deriving such complexity bounds for PDMPs and provides the foundation of this work.

1.1 Algorithm and Assumptions

Let x denote the state variable in \mathbb{R}^{d}, where d is the dimension. The target distribution we want to sample from is denoted by

\,\mathrm{d}\mu(x)=Z^{-1}\exp(-U(x))\,\mathrm{d}x,

where U(x) is the potential and Z=\int_{\mathbb{R}^{d}}\exp(-U(x))\,\mathrm{d}x is the normalizing constant. Although the zigzag process can also be applied to sample non-log-concave distributions, we will restrict our analysis to strongly log-concave distributions, namely, we make the following assumption throughout:

Assumption 1.

The potential function U(x) satisfies

m\mathrm{Id}\leq\nabla^{2}U(x)\leq L\mathrm{Id}, (1)

for some 0<m\leq 1\leq L. Moreover, U(x) has a unique minimizer at x=0, and U(0)=0.
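For example, the Gaussian potential U(x)=\frac{1}{2}x^{\top}Ax, with A a symmetric matrix satisfying m\mathrm{Id}\leq A\leq L\mathrm{Id}, satisfies Assumption 1, and the corresponding target \mu is the Gaussian \mathcal{N}(0,A^{-1}) with condition number \kappa=L/m (as defined below).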

For any random variable X, we use \rho(X) to denote its law. In this paper, we use the chi-square divergence to measure the difference between two probability measures: for probability measures \rho_{1},\rho_{2} such that \rho_{1}\ll\rho_{2}, it is defined as

\chi^{2}(\rho_{1}\,\|\,\rho_{2}):=\int_{\mathbb{R}^{d}}\Bigl{(}\frac{\,\mathrm{d}\rho_{1}}{\,\mathrm{d}\rho_{2}}-1\Bigr{)}^{2}\,\mathrm{d}\rho_{2}.
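For example, for the one-dimensional Gaussians \rho_{1}=\mathcal{N}(\theta,1) and \rho_{2}=\mathcal{N}(0,1), a direct computation gives \chi^{2}(\rho_{1}\,\|\,\rho_{2})=e^{\theta^{2}}-1; the chi-square divergence between even mildly separated measures is thus exponentially large, which is the scale on which the warm-start condition (6) below should be read.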

The zigzag sampling algorithm is based on a piecewise deterministic Markov process, called the zigzag process. Besides the variable x, we augment the state space by an auxiliary velocity variable taking values in \mathbb{R}^{d}. A trajectory of the zigzag process, denoted by (X_{t},V_{t}), can be described as follows. Given some initial (X_{0},V_{0}), the position X_{t} always evolves according to \frac{\,\mathrm{d}}{\,\mathrm{d}t}X_{t}=V_{t}, while the velocity V_{t} is piecewise constant and changes only when bouncing or refreshing events occur at random times governed by Poisson clocks. Bouncing events in the j-th direction occur with rate (V_{t}^{(j)}\partial_{x_{j}}U(X_{t}))_{+}, and at such an event the velocity V_{t} changes by flipping its j-th component to -V_{t}^{(j)}. Refreshing events occur with rate \lambda for some fixed \lambda>0, at which the velocity V_{t} is completely redrawn from the standard normal \mathcal{N}(0,\mathrm{Id}).

It has been established [andrieu2018hypocoercivity, bierkens2019ergodicity, lu2020explicit] that under Assumption 1, \rho(X_{t},V_{t}) converges to the invariant measure of the zigzag process, which is the product of the target measure in x and the standard Gaussian in v:

\,\mathrm{d}\bar{\mu}(x,v)=\,\mathrm{d}\mu(x)\,\mathrm{d}\nu(v)\ \mbox{ where }\,\mathrm{d}\nu(v)=(2\pi)^{-\frac{d}{2}}\exp\bigl{(}-\frac{|v|^{2}}{2}\bigr{)}\,\mathrm{d}v.

Our analysis relies on the following more quantitative convergence result for the zigzag process proved in [lu2020explicit], which also specifies the optimal choice of the refreshing rate \lambda. ([lu2020explicit] shows exponential convergence for the backward equation; by duality, exponential convergence of the backward equation in L^{2}(\bar{\mu}) is equivalent to exponential convergence of the forward equation in \chi^{2} with the same rate.) We would like to comment here that the choice \lambda=\sqrt{L} is purely technical, since it optimizes the theoretical convergence rate (up to a universal constant) of the zigzag process established in [lu2020explicit]. The zigzag process is ergodic even if \lambda=0, and in practice the choice \lambda=0 is common.

Proposition 1.1.

([lu2020explicit], Theorem 1) Under Assumption 1, there exists a universal constant K independent of all parameters, such that for any initial density \bar{\mu}_{0}, the zigzag process with refreshing rate \lambda=\sqrt{L} satisfies

\chi^{2}(\rho(X_{T},V_{T})\,\|\,\bar{\mu})\leq K\exp\bigl{(}-\frac{m}{K\sqrt{L}}T\bigr{)}\chi^{2}(\bar{\mu}_{0}\,\|\,\bar{\mu}). (2)

The left-hand side of (2) controls the desired divergence of \rho(X_{T}) with respect to the target measure \mu, as we have

\chi^{2}(\rho(X_{T},V_{T})\,\|\,\bar{\mu}) =\int_{\mathbb{R}^{d}\times\mathbb{R}^{d}}\Bigl{(}\frac{\,\mathrm{d}\rho(X_{T},V_{T})}{\,\mathrm{d}\bar{\mu}}\Bigr{)}^{2}\,\mathrm{d}\bar{\mu}(x,v)-1
=\int_{\mathbb{R}^{d}}\Bigl{(}\frac{\,\mathrm{d}\rho(X_{T})}{\,\mathrm{d}\mu}\Bigr{)}^{2}\Bigl{(}\int_{\mathbb{R}^{d}}\Bigl{(}\frac{\,\mathrm{d}\rho(V_{T}\mid X_{T})}{\,\mathrm{d}\nu(v)}\Bigr{)}^{2}\,\mathrm{d}\nu(v)\Bigr{)}\,\mathrm{d}\mu(x)-1
=\int_{\mathbb{R}^{d}}\Bigl{(}\frac{\,\mathrm{d}\rho(X_{T})}{\,\mathrm{d}\mu}\Bigr{)}^{2}\Bigl{(}1+\chi^{2}\bigl{(}\rho(V_{T}\mid X_{T})\,\|\,\nu\bigr{)}\Bigr{)}\,\mathrm{d}\mu(x)-1
\geq\int_{\mathbb{R}^{d}}\Bigl{(}\frac{\,\mathrm{d}\rho(X_{T})}{\,\mathrm{d}\mu}\Bigr{)}^{2}\,\mathrm{d}\mu(x)-1=\chi^{2}(\rho(X_{T})\,\|\,\mu).

Moreover, we take the initial condition in the form

(X_{0},V_{0})\sim\bar{\mu}_{0}(x,v)=\mu_{0}(x)\nu(v), (3)

which implies that \chi^{2}(\bar{\mu}_{0}\,\|\,\bar{\mu})=\chi^{2}(\mu_{0}\,\|\,\mu). Therefore, we get

\chi^{2}(\rho(X_{T})\,\|\,\mu)\leq K\exp\bigl{(}-\frac{m}{K\sqrt{L}}T\bigr{)}\chi^{2}(\mu_{0}\,\|\,\mu), (4)

which suggests the total time T needed to achieve the desired control of the chi-square divergence.
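Indeed, requiring the right-hand side of (4) to be at most \varepsilon and solving for T gives

T\geq\frac{K\sqrt{L}}{m}\Bigl{(}\log\frac{1}{\varepsilon}+\log\chi^{2}(\mu_{0}\,\|\,\mu)+\log K\Bigr{)},

which is the choice (9) made in Theorem 1 below.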

Of course, in practice, we cannot simulate the zigzag process directly, as simulating the Poisson process associated with the bouncing events would require integrating (V_{t}^{(j)}\partial_{x_{j}}U(X_{t}))_{+} along the trajectory. To turn the zigzag process into an efficient and practical sampling algorithm, the Poisson processes for the bouncing events are usually simulated using the Poisson thinning trick (see, e.g., the discussion in [bierkens2019zig], Section 3). Under Assumption 1, we will use the following upper bound estimate for the rate:

(v_{i}\partial_{x_{i}}U(x+vt))_{+}\leq\lvert v_{i}\partial_{x_{i}}U(x+vt)\rvert\leq\lvert v_{i}\rvert\lvert\partial_{x_{i}}U(x+vt)\rvert\leq L\lvert v_{i}\rvert(\lvert x\rvert+t\lvert v\rvert). (5)

This upper bound has the advantage of not involving evaluations of U or its partial derivatives, which greatly reduces the computational cost compared with using numerical quadrature for the d Poisson clocks. The price to pay is the increased frequency of potential bouncing events, which scales like O(\sqrt{d}), since the pessimistic bound for the partial derivative |\partial_{x_{i}}U(x)|\leq|\nabla U(x)|\leq L|x| typically sacrifices a factor of O(\sqrt{d}) in the first inequality.

Following the above discussion, the zigzag sampling algorithm is described in Algorithm 1, where Step LABEL:state:magrate uses the upper bound estimate (5), while Steps LABEL:state:thinning1–LABEL:state:thinning2 correspond to the Poisson thinning step. Note that for each potential bouncing event, the algorithm requires one evaluation of \partial_{x_{i}}U in Step LABEL:state:thinning1. In practice, accessing the partial derivatives of U is typically the most time-consuming step; therefore, in our complexity analysis, we focus on the number of accesses to partial derivatives.

Algorithm 1 The zigzag sampling algorithm 
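To make the structure of Algorithm 1 concrete, the following is a minimal Python sketch of the zigzag sampler with the thinning bound (5) and refreshing rate \lambda=\sqrt{L}; the helper partial_U(x, i), assumed to return \partial_{x_{i}}U(x), and the other names are illustrative conventions rather than the paper's exact pseudocode.

import numpy as np

def zigzag_sample(partial_U, L, T, x0, rng=None):
    # Sketch of the zigzag sampler: Poisson thinning with the bound (5),
    # refreshment events at rate sqrt(L).
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    d = x.size
    v = rng.standard_normal(d)                      # velocity drawn from N(0, Id)
    t = 0.0
    while t < T:
        t_ref = rng.exponential(1.0 / np.sqrt(L))   # refreshment clock, rate sqrt(L)
        # Dominating clocks with rate L|v_i|(|x| + s|v|): invert the integrated
        # rate a_i s + b_i s^2 / 2 = Exp(1), coordinate by coordinate.
        a = L * np.abs(v) * np.linalg.norm(x)
        b = L * np.abs(v) * np.linalg.norm(v)
        e = rng.exponential(size=d)
        tau = np.full(d, np.inf)
        pos = b > 0
        tau[pos] = (-a[pos] + np.sqrt(a[pos] ** 2 + 2.0 * b[pos] * e[pos])) / b[pos]
        i = int(np.argmin(tau))
        dt = min(tau[i], t_ref)
        if t + dt >= T:                              # stop exactly at the terminal time
            x += (T - t) * v
            break
        anchor_norm = np.linalg.norm(x)              # |x| at the last event, used in (5)
        x += dt * v
        t += dt
        if t_ref < tau[i]:                           # refreshment event
            v = rng.standard_normal(d)
        else:                                        # proposed bounce: Poisson thinning
            rate = max(v[i] * partial_U(x, i), 0.0)  # one partial derivative evaluation
            bound = L * abs(v[i]) * (anchor_norm + dt * np.linalg.norm(v))
            if rng.uniform() < rate / bound:
                v[i] = -v[i]                         # flip the i-th velocity component
    return x

Each proposed event costs a single call to partial_U, which is the quantity counted in the complexity analysis below.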

We also need the following assumption for technical purposes, as will be discussed after stating the main results:

Assumption 2.

The initial distribution \mu_{0}(x) satisfies a warm-start condition:

\chi^{2}(\mu_{0}\,\|\,\mu)\leq\exp\Bigl{(}\frac{d}{8K\kappa\log d}\Bigr{)}, (6)

where \kappa:=L/m is the condition number, and K is the same universal constant as in (2). Furthermore, the initial distribution is concentrated in the sense that

\eta:=\mathbb{P}_{\mu_{0}}\Bigl{(}|x|>\sqrt{\frac{2d}{m}}\Bigr{)}<\frac{1}{4}. (7)
Remark 1.2.

The concentration condition (7) is easily satisfied. By the Gaussian Annulus Theorem, if we pick \mu_{0}=\mathcal{N}(0,\frac{1}{m}\mathrm{Id}), then \mathbb{P}_{\mu_{0}}\bigl{(}|x|>\sqrt{\frac{2d}{m}}\bigr{)}\leq 3e^{-cd} for some universal constant c. The failure probability becomes even smaller if we take \mu_{0}=\mathcal{N}(0,\frac{1}{L}\mathrm{Id}) or \mu_{0}=\mu. The warm-start condition (6) is more stringent, but it can be achieved by first running Langevin Monte Carlo (LMC); we discuss this after presenting our main result.

1.2 Main Results

Theorem 1.

Under Assumption 1, for any prescribed accuracy \varepsilon>0, Algorithm 1 outputs a random variable X such that

\chi^{2}(\rho(X)\,\|\,\mu)\leq\varepsilon, (8)

for the terminal time T chosen as

T=K\Bigl{(}\frac{\sqrt{L}}{m}\bigl{(}\log\frac{1}{\varepsilon}+\log\chi^{2}(\mu_{0}\,\|\,\mu)+\log K\bigr{)}\Bigr{)}, (9)

where K is the universal constant in (2).

Moreover, if \varepsilon\geq\exp\bigl{(}-\frac{d}{8K\kappa\log d}\bigr{)}, then, under Assumption 2, with probability 1-\frac{C}{\sqrt{L}T}-C\log^{-\frac{3}{2}}d-\eta, Algorithm 1 returns an output with a computational cost of

O\Bigl{(}d^{\frac{3}{2}}\kappa^{2}\bigl{(}\log^{\frac{3}{2}}\frac{1}{\varepsilon}+\log^{\frac{3}{2}}\chi^{2}(\mu_{0}\,\|\,\mu)\bigr{)}\Bigr{)}

evaluations of partial derivatives of U, where \eta is defined in (7) and C is a universal constant.

Remark 1.3.

By repeated trials, the theorem implies that for any \delta\in(0,\frac{1}{4}), with probability 1-\delta, Algorithm 1 returns the desired output with a computational cost of

O\biggl{(}d^{\frac{3}{2}}\kappa^{2}\Bigl{(}\log^{\frac{3}{2}}\frac{1}{\varepsilon}+\log^{\frac{3}{2}}\chi^{2}(\mu_{0}\,\|\,\mu)\Bigr{)}\log\frac{1}{\delta}\,\Bigl{\lvert}\log^{-1}\bigl{(}\frac{1}{\sqrt{L}T}+\log^{-\frac{3}{2}}d+\eta\bigr{)}\Bigr{\rvert}\biggr{)},

that is, \widetilde{O}(d^{\frac{3}{2}}\kappa^{2}) evaluations of partial derivatives of U, where \widetilde{O}(\cdot) hides logarithmic factors.

In the common computational model where d evaluations of partial derivatives of U are equivalent in complexity to one evaluation of \nabla U, the complexity of zigzag is thus equivalent to \widetilde{O}(d^{\frac{1}{2}}\kappa^{2}) evaluations of \nabla U.

Let us explain the choice of T in (9): for the zigzag sampling algorithm to reach the target accuracy \varepsilon according to (4), the terminal time T needs to be large enough. Meanwhile, Assumption 2 guarantees that T is not too large, as otherwise we could not effectively control the number of bouncing events, either due to a very large V drawn at a velocity refreshment event or due to the trajectory reaching regions with large gradient. These considerations motivate Assumption 2 on the initial distribution \mu_{0}, as well as the restriction that \varepsilon cannot be too small compared to d. We remark that the assumption on \varepsilon is not prohibitive, as we are interested in high-dimensional cases and the lower threshold on \varepsilon is exponentially small in d.

The warm-start condition (6) can be achieved if we start with a Gaussian distribution in x and run the Langevin Monte Carlo (LMC) iteration

X_{n+1}=X_{n}-h\nabla U(X_{n})+\sqrt{2h}\,\xi_{n} (10)

where h is the step size and the \xi_{n} are i.i.d. \mathcal{N}(0,\mathrm{Id}) random variables.
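A corresponding minimal sketch of this LMC warm start, under the same illustrative conventions (grad_U is assumed to return \nabla U), is:

import numpy as np

def lmc_warm_start(grad_U, x0, h, n_steps, rng=None):
    # LMC iteration (10): X_{n+1} = X_n - h * grad U(X_n) + sqrt(2h) * xi_n, xi_n ~ N(0, Id).
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x - h * grad_U(x) + np.sqrt(2.0 * h) * rng.standard_normal(x.size)
    return x

This leads to the following corollary: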

Corollary 1.4.

Let d\gg 1. Suppose the potential U satisfies Assumption 1 for some \kappa\geq 1 such that \kappa^{\frac{9}{5}}\leq\frac{d^{\frac{4}{5}}}{C\log^{3}d} for some computable (from [erdogdu2020convergence]) universal constant C. Then, for any prescribed accuracy \varepsilon>0, if we initialize X_{0}\sim\mathcal{N}(0,\frac{1}{2L}\mathrm{Id}), the hybrid algorithm that first runs LMC (10) for N=d^{4/5}\kappa^{16/5} steps with step size h=\frac{4}{5}d^{-4/5}\kappa^{-16/5}m^{-1}\log\frac{d}{\kappa} and then runs Algorithm 1 up to time T=K\Bigl{(}\frac{\sqrt{L}}{m}\bigl{(}\log\frac{1}{\varepsilon}+d^{\frac{1}{5}}\kappa^{\frac{4}{5}}\log^{2}\frac{d}{\kappa}+\log K\bigr{)}\Bigr{)} outputs a random variable X such that

\chi^{2}(\rho(X)\,\|\,\mu)\leq\varepsilon. (11)

Moreover, if \varepsilon\geq\exp\bigl{(}-\frac{d}{8K\kappa\log d}\bigr{)}, then with probability 1-\frac{C}{\sqrt{L}T}-C\log^{-\frac{3}{2}}d-C\exp(Cd^{\frac{1}{5}}\kappa^{\frac{4}{5}}\log^{2}\frac{d}{\kappa}-cd), the hybrid algorithm returns an output with a computational cost of

O\Bigl{(}d^{\frac{1}{2}}\kappa^{2}\log^{\frac{3}{2}}\frac{1}{\varepsilon}+d^{\frac{4}{5}}\kappa^{\frac{16}{5}}\log^{3}\frac{d}{\kappa}\Bigr{)}

evaluations of \nabla U (equivalently, d times as many evaluations of partial derivatives of U).

Proof.

It is easy to verify that our choice of N,h satisfies h\leq\frac{m}{4L^{2}} and Nh^{2}\leq\frac{1}{196c\kappa^{2}L^{2}} (where c satisfies [erdogdu2020convergence], Lemma 14). Therefore, we may appeal to Lemmas 2, 14, 25, and 26 of [erdogdu2020convergence], so that the random variable X_{N} produced by (10) satisfies

\chi^{2}(\rho(X_{N})\|\mu)\leq\exp\Bigl{(}Cd\exp(-Nhm)+CNh^{2}\kappa^{2}L^{2}(d+\log N)\Bigr{)}=\exp\Bigl{(}Cd^{\frac{1}{5}}\kappa^{\frac{4}{5}}\log^{2}\frac{d}{\kappa}\Bigr{)} (12)

for some universal constant C. This, combined with our assumption on \kappa, guarantees that (6) holds with \rho(X_{N}) playing the role of \mu_{0}. We can also check the validity of (7) by

\mathbb{P}_{\rho(X_{N})}\bigl{(}|x|>\sqrt{\frac{2d}{m}}\bigr{)}\leq\Bigl{(}1+\chi^{2}(\rho(X_{N})\|\mu)\Bigr{)}^{\frac{1}{2}}\Bigl{(}\mathbb{P}_{\mu}\bigl{(}|x|>\sqrt{\frac{2d}{m}}\bigr{)}\Bigr{)}^{\frac{1}{2}}\leq C\exp\Bigl{(}Cd^{\frac{1}{5}}\kappa^{\frac{4}{5}}\log^{2}\frac{d}{\kappa}-cd\Bigr{)}\ll 1.

Therefore, we may apply Theorem 1 with \mu_{0}=\rho(X_{N}) and conclude that the total computational cost (in terms of the number of evaluations of \nabla U) equals

O\Bigl{(}N+d^{\frac{1}{2}}\kappa^{2}\bigl{(}\log^{\frac{3}{2}}\frac{1}{\varepsilon}+\log^{\frac{3}{2}}\chi^{2}(\rho(X_{N})\|\mu)\bigr{)}\Bigr{)}=O\Bigl{(}d^{\frac{4}{5}}\kappa^{\frac{16}{5}}\log^{3}\frac{d}{\kappa}+d^{\frac{1}{2}}\kappa^{2}\log^{\frac{3}{2}}\frac{1}{\varepsilon}\Bigr{)}.\qed
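For concreteness, the parameter choices in Corollary 1.4 can be instantiated as in the short sketch below; the values of d, m, and L are illustrative assumptions, and the terminal time T is left in terms of the universal constant K from (2).

import numpy as np

d, m, L = 10 ** 6, 1.0, 4.0                                        # illustrative problem sizes
kappa = L / m
N = int(np.ceil(d ** 0.8 * kappa ** 3.2))                          # LMC steps in Corollary 1.4
h = 0.8 * d ** (-0.8) * kappa ** (-3.2) / m * np.log(d / kappa)    # LMC step size
# Zigzag terminal time: T = K * (sqrt(L) / m) * (log(1/eps)
#   + d**0.2 * kappa**0.8 * np.log(d / kappa)**2 + log K), with K as in (2).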

Theorem 1 guarantees that the zigzag sampling algorithm (Algorithm 1) outputs a sample from a distribution whose \chi^{2}-divergence from the target density is at most \varepsilon, at a computational complexity equivalent to \widetilde{O}(d^{\frac{3}{2}}\kappa^{2}) partial derivative evaluations (that is, \widetilde{O}(d^{\frac{1}{2}}\kappa^{2}) gradient evaluations), in the regime \max\{\kappa,\log\frac{1}{\varepsilon}\}\ll\frac{d}{\log d} under a warm-start condition. Corollary 1.4 establishes that the hybrid LMC-zigzag algorithm outputs such a sample at a computational complexity of \widetilde{O}(d^{\frac{4}{5}}\kappa^{\frac{16}{5}}) gradient evaluations. The initialization using LMC is added only for technical reasons, as we currently do not have complexity guarantees otherwise with an explicit initial distribution, nor is it necessary for actual implementations. We would also like to comment that our goal is to obtain the best possible scaling in d, and the scaling in \kappa might be improved by a more careful analysis.

Our analysis is based on the quantitative convergence rate of the zigzag process established in [lu2020explicit], which is O(\frac{m}{\sqrt{L}}) for m-convex and L-smooth potentials. The rest of our proof is based on estimating \sup|X_{t}| along a single trajectory of the zigzag process and subsequently turning this into an estimate on the number of potential bouncing events, and hence on the number of partial derivative evaluations. Our analysis utilizes two important and desirable features of the zigzag sampling process:

  • The implementation of the zigzag process does not need time discretization, since the velocity remains constant on the deterministic portions of the trajectory; this makes it possible to simulate the exact trajectories of the zigzag process and eliminates an important source of error. This is the reason that the complexity of the zigzag process has only logarithmic dependence on \frac{1}{\varepsilon}, without Metropolis acceptance/rejection.

  • Moreover, each potential bouncing event of the zigzag process requires only one evaluation of a partial derivative of the potential, which is O(d) cheaper in computational cost than a full gradient evaluation under the usual model of computation.

We also remark that we quantify the error of the distribution in terms of the \chi^{2}-divergence, which provides a stronger guarantee than total variation, KL divergence, or 2-Wasserstein distance. While the \chi^{2}-divergence is relatively convenient for obtaining convergence rates of continuous processes based on the Poincaré inequality [cao2019explicit, lu2020explicit], it does not seem easy to use for analyzing the discretization error of SDEs. The work [vempala2019rapid] assumed a Poincaré inequality for the discrete invariant measure, which is difficult to verify. We are fortunate to avoid this problem for the zigzag sampler, thanks to the fact that zigzag does not need time discretization. After the first version of this work appeared online, [erdogdu2020convergence] established convergence of LMC in \chi^{2}- and Rényi divergence, using the exponential convergence of the continuous-time overdamped Langevin dynamics in Rényi divergence [cao2019exponential, vempala2019rapid].

1.3 Previous Works

Here we focus on results on the non-asymptotic analysis of sampling algorithms, which has been an active research area in recent years. Many sampling algorithms have been analyzed, including algorithms based on the overdamped Langevin dynamics [dalalyan2017theoretical, durmus2019high, durmus2019analysis, vempala2019rapid, li2019stochastic, ding2020randomlangevin], the underdamped Langevin dynamics [cheng2018underdamped, dalalyan2020sampling, ma2019there, shen2019randomized, ding2020randomunderdamped, monmarche2020high], Hamiltonian Monte Carlo [mangoubi2017rapid, lee2018algorithmic, chen2019optimal, mangoubi2018dimensionally, bou2020coupling], and high-order Langevin dynamics [mou2019high], among others. These methods involve discretization of ODEs or SDEs, which yields an error that scales polynomially with the step size. Thus the complexity of these algorithms has polynomial dependence on \varepsilon^{-1}, where \varepsilon is the desired accuracy threshold.

Metropolized variants of sampling algorithms, including Metropolized HMC and the Metropolis Adjusted Langevin Algorithm (MALA), have also been studied in [dwivedi2018log, chen2020fast, lee2020logsmooth]; their complexities have only logarithmic dependence on \varepsilon^{-1}, similar to the zigzag sampling algorithm analyzed here. In [dwivedi2018log] the complexity upper bound for MALA is established as \widetilde{O}(\kappa d+\kappa^{\frac{3}{2}}d^{\frac{1}{2}}) under a warm start condition, and \widetilde{O}(\kappa d^{2}+\kappa^{\frac{3}{2}}d^{\frac{3}{2}}) with a feasible start. In [chen2020fast] the complexity upper bound for MALA is improved to \widetilde{O}(\kappa d+\kappa^{\frac{3}{2}}d^{\frac{1}{2}}) with a feasible start (where \mu_{0}=\mathcal{N}(0,\frac{1}{L}\mathrm{Id})). The work [chen2020fast] also established bounds for Metropolized HMC: \widetilde{O}(\kappa d^{\frac{11}{12}}) with a warm start (which is in fact more stringent than our Assumption 2) in the regime \kappa=O(d^{\frac{2}{3}}), and \widetilde{O}(\kappa^{\frac{3}{4}}d+\kappa^{\frac{7}{4}}d^{\frac{1}{2}}) with a feasible start if the target potential function has a bounded Hessian. The complexity upper bound has been improved in [lee2020logsmooth] to \widetilde{O}(\kappa d) for both Metropolized HMC and MALA with a feasible start, based on a refined analysis using concentration of the gradient norm. In comparison, our result for zigzag relies on a warm start (which is achievable by LMC), while the complexity upper bound has better dependence on d. The issue of feasible starts will be further discussed in Section 3.

Regarding asymptotic analysis of the convergence of the zigzag process, ergodicity was first established in [bierkens2019ergodicity]. Exponential convergence of the zigzag process is established in [fontbona2016long, bierkens2017piecewise] using a Lyapunov function argument. A central limit theorem for the zigzag process is established in [bierkens2017limit], and a large deviation principle for the empirical measure is established in [bierkens2019large]. The spectrum of the zigzag process has been studied in [bierkens2019spectral, guillin2020low]. A dimension-independent exponential convergence rate for the zigzag process is established in [andrieu2018hypocoercivity], using the hypocoercivity framework developed in [dolbeault2015hypocoercivity]. Finally, a more quantitative convergence estimate was established in [lu2020explicit], on which our analysis of the sampling algorithm is based.

2 Strategy of the Proof

Since Algorithm 1 always simulates exact trajectories of the zigzag process, (8) is guaranteed with the correct choice of T. Therefore we only need to estimate the computational complexity. The strategy of the proof is to first give an estimate on \sup_{t\in[0,T]}U(X_{t}) (Lemma 2.1), which directly controls \sup_{t\in[0,T]}|X_{t}|. The upper bound on |X_{t}| in turn provides an upper bound on the number of partial derivative evaluations of U. The complexity upper bound we derive holds with high probability; it does not always hold (for example, the number of proposed bouncing events from the Poisson clocks might be atypically high), but such events occur only with very small probability, which will be controlled in the proof.

Let N+1 be the total number of velocity refreshments (including the initial refreshment), so that N is a Poisson random variable with

\mathbb{P}(N=n)=\dfrac{(\sqrt{L}T)^{n}}{n!}e^{-\sqrt{L}T}. (13)

Let 0=T_{0}<T_{1}<T_{2}<\cdots<T_{N}\leq T<T_{N+1} be the refreshment times, and let V_{T_{k}} be the velocity variable after the refreshment at time T_{k}. For k=1,\cdots,N, we use t_{k}=T_{k}-T_{k-1} to denote the time duration between refreshments. For convenience, we will also denote t_{N+1}=T-T_{N}.

The first step of the proof is the following lemma, which controls \sup_{t\in[0,T]}U(X_{t}) conditioned on some high-probability events. The proof is deferred to the appendix.

Lemma 2.1.

Under Assumptions 1 and 2, suppose the following conditions hold:

\frac{1}{2}\sqrt{L}T\leq N\leq\frac{3}{2}\sqrt{L}T; (14a)
\lvert V_{T_{k}}\cdot\nabla U(X_{T_{k}})\rvert\leq\Bigl{(}\frac{d}{\sqrt{L}T}\Bigr{)}^{1/2}|\nabla U(X_{T_{k}})|,\quad\forall k=1,\cdots,N; (14b)
\lvert V_{T_{k}}\rvert\leq 2\sqrt{d},\quad\forall k=1,\cdots,N; (14c)
U(X_{0})\leq\kappa d; (14d)
\sum_{k=1}^{N+1}t_{k}^{2}\leq\dfrac{4T}{\sqrt{L}}. (14e)

Then there exists a universal constant C such that

\sup_{t\in[0,T]}U(X_{t})\leq C\sqrt{L}Td. (15)

The next element of the proof is to control the failure event that (14) does not hold. The control of the first four events is relatively straightforward and will thus be carried out directly in the proof of the theorem below; we state the probability for event (14e) to hold as the following lemma, which is also proved in the appendix.

Lemma 2.2.

There exists a universal constant C such that, if \sqrt{L}T>C, then with probability at least 1-\frac{2}{\sqrt{L}T}, condition (14e) holds.

The final component of the proof is to turn the estimate on \sup_{t\in[0,T]}U(X_{t}) into an upper bound on the number of proposed bouncing events.

Proof of Theorem 1.

Let p_{i} denote the probability that condition i in (14) of Lemma 2.1 fails. We start with condition (14a) of Lemma 2.1. For the Poisson process with the t_{i} as inter-arrival times, we may estimate the first failure probability (here and in the rest of the proofs, C denotes a universal constant that may change from line to line)

p_{a}\leq\exp\bigl{(}-\frac{1}{C}\sqrt{L}T\bigr{)}\leq\frac{C}{\sqrt{L}T}. (16)

We now check conditions (14b) and (14c) of Lemma 2.1. By the Gaussian Annulus Theorem, for each refreshment we have

\mathbb{P}(\lvert V_{T_{k}}\rvert\geq 2\sqrt{d})\leq 3e^{-cd}, (17)

where c>0 is some universal constant. We also require V_{T_{k}} to satisfy \lvert V_{T_{k}}\cdot n(X_{T_{k}})\rvert\leq\bigl{(}\frac{d}{\sqrt{L}T}\bigr{)}^{1/2}, where n(X_{T_{k}})=\frac{\nabla U(X_{T_{k}})}{|\nabla U(X_{T_{k}})|}, which has failure probability

\mathbb{P}\Bigl{(}\lvert V\cdot n(X)\rvert\geq\Bigl{(}\frac{d}{\sqrt{L}T}\Bigr{)}^{1/2}\Bigr{)} =\dfrac{1}{\sqrt{2\pi}}\int_{(\frac{d}{\sqrt{L}T})^{1/2}}^{\infty}\exp\Bigl{(}-\frac{r^{2}}{2}\Bigr{)}\,\mathrm{d}r
\leq\dfrac{1}{\sqrt{2\pi}}\int_{(\frac{d}{\sqrt{L}T})^{1/2}}^{\infty}\exp\Bigl{(}-\frac{r}{2}\bigl{(}\frac{d}{\sqrt{L}T}\bigr{)}^{1/2}\Bigr{)}\,\mathrm{d}r
\leq\sqrt{\dfrac{2}{\pi}}\Bigl{(}\frac{\sqrt{L}T}{d}\Bigr{)}^{1/2}\exp\Bigl{(}-\frac{d}{2\sqrt{L}T}\Bigr{)}.

Since we have to draw V a total of N times, cumulatively this yields a failure probability

p_{b}+p_{c}\leq C\biggl{(}e^{-cd}+\Bigl{(}\frac{\sqrt{L}T}{d}\Bigr{)}^{1/2}\exp\Bigl{(}-\frac{d}{2\sqrt{L}T}\Bigr{)}\biggr{)}\mathbb{E}N. (18)

Recall the assumption \varepsilon\geq\exp\bigl{(}-\frac{d}{8K\kappa\log d}\bigr{)} as well as (6) (and that \kappa K\log K\leq\frac{d}{4\log d}), which implies that \sqrt{L}T\leq\frac{d}{2\log d} for our choice of T in (9). Together with condition (14a), we derive (neglecting the obviously smaller term e^{-cd})

p_{b}+p_{c}\leq C\sqrt{L}T\Bigl{(}\frac{\sqrt{L}T}{d}\Bigr{)}^{1/2}\exp\Bigl{(}-\frac{d}{2\sqrt{L}T}\Bigr{)}\leq C\log^{-\frac{3}{2}}d.

The failure probability for condition (14d) is straightforward to estimate. Using Assumption 1, we have

U(X_{0})\leq\frac{L}{2}|X_{0}|^{2},

which indicates

p_{d}\leq\eta=\mathbb{P}\Bigl{(}|X_{0}|\geq\sqrt{\frac{2d}{m}}\Bigr{)}.

Finally, p_{e} is already estimated in Lemma 2.2, which yields p_{e}\leq\frac{2}{\sqrt{L}T}. In summary, the total failure probability of (14) can be bounded as

p_{a}+p_{b}+p_{c}+p_{d}+p_{e}\leq\frac{C}{\sqrt{L}T}+C\log^{-\frac{3}{2}}d+\eta. (19)

We now assume that condition (14) holds. Thus, Lemma 2.1 together with Assumption 1 implies that

\sup_{t\in[0,T]}\lvert X_{t}\rvert\leq\Bigl{(}\dfrac{2}{m}\sup_{t\in[0,T]}U(X_{t})\Bigr{)}^{1/2}\leq C\Bigl{(}\frac{\sqrt{L}}{m}Td\Bigr{)}^{1/2}. (20)

After each refreshment or bouncing event, Algorithm 1 runs d independent Poisson clocks \{\tau_{i}\}_{i=1,\cdots,d} defined in Step LABEL:state:magrate where, noticing \sum_{i}\lvert V_{i}\rvert\leq\sqrt{d}\lvert V\rvert\leq 2d,

\mathbb{P}(\min\tau_{i}\geq t)\geq\exp\Bigl{(}-tL\lvert X\rvert\sum_{i}\lvert V_{i}\rvert-\frac{t^{2}}{2}L\lvert V\rvert\sum_{i}\lvert V_{i}\rvert\Bigr{)}\geq\exp\bigl{(}-Cd^{\frac{3}{2}}(L^{\frac{5}{4}}m^{-\frac{1}{2}}T^{\frac{1}{2}}t+Lt^{2})\bigr{)}. (21)

This motivates us to consider the following counting process \tilde{N}_{t}: suppose \tilde{t}_{1},\cdots are i.i.d. random variables with \mathbb{P}(\tilde{t}_{i}\geq s)=\exp(-As-Bs^{2}), where A=Cd^{\frac{3}{2}}L^{\frac{5}{4}}m^{-\frac{1}{2}}T^{\frac{1}{2}} and B=Cd^{\frac{3}{2}}L, and let \tilde{N}_{t}=\inf\{n:\sum_{i=1}^{n}\tilde{t}_{i}>t\}. By construction, under condition (14) the probability that the number of proposed bouncing events exceeds 8AT is controlled by \mathbb{P}(\tilde{N}_{T}>8AT). Therefore, it suffices to estimate \mathbb{P}(\tilde{N}_{T}>8AT).

We compute the expectation of \tilde{t}_{1} (note that here A\gg B\gg 1):

\mathbb{E}\tilde{t}_{1} =\int_{0}^{\infty}s(A+2Bs)\exp(-As-Bs^{2})\,\mathrm{d}s\geq\int_{0}^{\frac{A}{B}}s(A+2Bs)\exp(-2As)\,\mathrm{d}s
=\frac{1}{4A}+\frac{B}{2A^{3}}-\Bigl{(}\frac{3A}{2B}+\frac{5}{4A}+\frac{B}{2A^{3}}\Bigr{)}e^{-\frac{2A^{2}}{B}}\geq\frac{1}{4A}. (22)

On the other hand,

\mathbb{E}\tilde{t}_{1}^{2}\leq\int_{0}^{\infty}s^{2}(A+2Bs)\exp(-As)\,\mathrm{d}s=\frac{2}{A^{2}}+\frac{12B}{A^{4}}\leq\frac{33}{16A^{2}}.

Therefore we may appeal to Kolmogorov’s inequality ([durrett2019probability], Theorem 2.5.2); here S_{n} denotes \sum_{i=1}^{n}\tilde{t}_{i}:

\mathbb{P}(\tilde{N}_{T}>8AT) =\mathbb{P}(S_{8AT}<T)=\mathbb{P}\bigl{(}S_{8AT}-\mathbb{E}S_{8AT}<T-\mathbb{E}S_{8AT}\bigr{)}\leq\mathbb{P}\bigl{(}S_{8AT}-\mathbb{E}S_{8AT}<-T\bigr{)}\quad\text{(by (22))}
\leq\frac{1}{T^{2}}\operatorname{Var}S_{8AT}=\frac{8A}{T}\operatorname{Var}\tilde{t}_{1}\leq\frac{16}{AT}\leq\frac{C}{\sqrt{L}T}.

To sum up, we have established that with high probability the number of partial derivative evaluations is bounded by

O(AT)=O(d^{\frac{3}{2}}L^{\frac{5}{4}}m^{-\frac{1}{2}}T^{\frac{3}{2}})=O\Bigl{(}d^{\frac{3}{2}}\kappa^{2}\bigl{(}\log^{\frac{3}{2}}\frac{1}{\varepsilon}+\log^{\frac{3}{2}}\chi^{2}(\mu_{0}\,\|\,\mu)\bigr{)}\Bigr{)}.\qed

3 Discussion

We establish non-asymptotic complexity bounds for the zigzag sampling algorithm. While we focus on zigzag sampler in this work, we expect that similar analysis for other PDMPs [bouchard2018bouncy, vanetti2017piecewise, michel2014generalized, bierkens2020boomerang] can be carried out. We leave these for future research.

We admit that our warm-start requirement (6) may be stringent. We observe that (6) implicitly requires the condition number \kappa to be much smaller than d: otherwise, if \kappa\sim d, then (6) requires \chi^{2}(\mu_{0}\,\|\,\mu)=O(1), which is unrealistic. Corollary 1.4 essentially requires \kappa\ll d^{\frac{4}{9}} for the analysis to hold. This restriction on the condition number is not completely unexpected, since the zigzag sampler does perform poorly for highly anisotropic densities (see, for example, the numerical results in [michel2014generalized]).

A major issue with the warm-start assumption comes from our choice of the \chi^{2} divergence, rather than total variation, 2-Wasserstein distance, or KL divergence as in previous works on the non-asymptotic analysis of sampling algorithms. In particular, if we choose the initial condition

\,\mathrm{d}\mu_{0}(x)=\Bigl{(}\dfrac{L}{2\pi}\Bigr{)}^{\frac{d}{2}}\exp\Bigl{(}-\frac{L|x|^{2}}{2}\Bigr{)}\,\mathrm{d}x, (23)

as in previous works, then for U(x)=\frac{m|x|^{2}}{2} we have

\chi^{2}(\mu_{0}\,\|\,\mu) =Z\bigl{(}\frac{L}{2\pi}\bigr{)}^{d}\int_{\mathbb{R}^{d}}\exp(-L|x|^{2}+U(x))\,\mathrm{d}x-1
=\kappa^{\frac{d}{2}}\bigl{(}\frac{L}{2\pi}\bigr{)}^{\frac{d}{2}}\int_{\mathbb{R}^{d}}\exp\bigl{(}-(L-\frac{m}{2})|x|^{2}\bigr{)}\,\mathrm{d}x-1=\kappa^{\frac{d}{2}}\bigl{(}\frac{L}{2L-m}\bigr{)}^{\frac{d}{2}}-1,

which violates (6). On the other hand, for the same choice of \mu_{0}, as long as U satisfies Assumption 1, one can estimate

\mathrm{KL}(\mu_{0}\,\|\,\mu)=\bigl{(}\dfrac{L}{2\pi}\bigr{)}^{\frac{d}{2}}\int_{\mathbb{R}^{d}}\Bigl{(}\frac{d}{2}\log\frac{L}{2\pi}+\log Z-\frac{L}{2}|x|^{2}+U(x)\Bigr{)}\exp\bigl{(}-\frac{L}{2}|x|^{2}\bigr{)}\,\mathrm{d}x\leq\frac{d}{2}\log\kappa.

This means that \log\mathrm{KL}(\mu_{0}\,\|\,\mu), and consequently the logarithms of the total variation and 2-Wasserstein distances, are much smaller than any algebraic power of d, which makes this initialization suitable. We hope the following conjecture is true:

Conjecture 1.

Under Assumption 1, there exists a universal constant K independent of all parameters, such that for any initial density \bar{\mu}_{0}, the zigzag process with refreshing rate \lambda=\sqrt{L} satisfies

\mathrm{KL}(\rho(X_{T},V_{T})\,\|\,\bar{\mu})\leq K\exp\bigl{(}-\frac{m}{K\sqrt{L}}T\bigr{)}\,\mathrm{KL}(\bar{\mu}_{0}\,\|\,\bar{\mu}).

If this is indeed true, we can establish the convergence in KL divergence of the pure zigzag sampler using a feasible start, without using LMC for initialization.

Another interesting open question is whether one can find a tighter upper bound than the one in Step LABEL:state:magrate of Algorithm 1 in order to reduce the computational complexity, since that bound magnifies the proposed bouncing rates by a factor of O(\sqrt{d}). The following lemma, which might be of independent interest, provides a concentration bound for |\partial_{x_{i}}U|, so that we might be able to give up a small probability in exchange for much sharper control of the bouncing rate.

Lemma 3.1.

Let U(x) satisfy Assumption 1. Then for any c>0,

\mathbb{P}_{\mu}\Bigl{(}|\partial_{x_{i}}U|\geq 2\sqrt{L}+2c\sqrt{L}\log d\Bigr{)}\leq 3d^{-c}. (24)
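As a quick numerical illustration of the scale in (24), one can check the bound for a Gaussian target, for which the partial derivatives under \mu can be sampled exactly; the dimension, the constants, and the diagonal Hessian below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
d, m, L, c = 200, 1.0, 4.0, 1.0
eigs = rng.uniform(m, L, size=d)                  # Hessian eigenvalues in [m, L]
# For U(x) = x^T diag(eigs) x / 2 we have mu = N(0, diag(1/eigs)) and d_i U(x) = eigs_i * x_i.
samples = rng.standard_normal((100_000, d)) / np.sqrt(eigs)
partials = samples * eigs
threshold = 2 * np.sqrt(L) + 2 * c * np.sqrt(L) * np.log(d)
print(np.mean(np.abs(partials) >= threshold), 3 * d ** (-c))   # empirical tail vs. the bound 3 d^{-c}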

The proof of this lemma, deferred to the appendix, is inspired by [lee2020logsmooth], which uses the following Brascamp-Lieb inequality [brascamp2002extensions]:

Lemma 3.2.

Let U(x) satisfy Assumption 1. Then for any g\in H^{1}(\mu),

\operatorname{Var}_{\mu}g\leq\int_{\mathbb{R}^{d}}\nabla g(\nabla^{2}U)^{-1}\nabla g\,\mathrm{d}\mu. (25)

With Lemma 3.1, it might be possible to improve Algorithm 1, at the cost of a small failure probability, by replacing Step LABEL:state:magrate with \mathbb{P}(\tau_{i}\geq s)=\exp(-cs\sqrt{L}|v_{i}|\log d), since (v_{i}\partial_{x_{i}}U(x+vs))_{+}\leq c\sqrt{L}|v_{i}|\log d with high probability. This motivates the following conjecture:

Conjecture 2.

Under Assumption 1, for any \kappa and \log\frac{1}{\varepsilon} that are both smaller than some algebraic power of d, there exists an algorithm that outputs a random variable X such that

\chi^{2}(\rho(X)\,\|\,\mu)\leq\varepsilon. (26)

Moreover, with high probability, the algorithm requires O\Bigl{(}d\kappa\log d\bigl{(}\log\frac{1}{\varepsilon}+\log\chi^{2}(\mu_{0}\,\|\,\mu)\bigr{)}\Bigr{)} evaluations of partial derivatives of U.

Unfortunately, there are several difficulties in proving this conjecture. One is that, although \partial_{x_{i}}U does not exceed O(\log d) with high probability under \mu, we are unable to control the partial derivatives along a trajectory of the zigzag process. Another issue is that, since some trajectories of the zigzag process may enter regions where the partial derivatives exceed O(\log d), we would no longer always simulate exact trajectories, which introduces bias into the sampling.


Acknowledgment. This work is supported in part by the National Science Foundation via grants CCF-1910571 and DMS-2012286. We would like to thank Murat Erdogdu for pointing us to their complexity analysis of Langevin Monte Carlo in chi-square divergence [erdogdu2020convergence], which allows us to remove the warm-start assumption.

Appendix A Proof of Lemma 2.1

Proof.

Let \lambda(t)=V_{t}\cdot\nabla_{x}U(X_{t}). If no bouncing happens, then

\dfrac{\,\mathrm{d}}{\,\mathrm{d}t}\lambda(t)=V_{t}^{\top}\nabla_{x}^{2}U(X_{t})V_{t}\leq L\lvert V_{t}\rvert^{2}.

In addition, \lambda(t) decreases when bouncing happens, since some positive V_{t}^{(i)}\partial_{x_{i}}U(X_{t}) is changed to -V_{t}^{(i)}\partial_{x_{i}}U(X_{t}) while X_{t} and the other components of V_{t} remain unchanged. Therefore, since \lvert V_{t}\rvert does not change between refreshments, we have for any t\in(0,T_{k+1}-T_{k}) (we remark that \lambda(t) is not well defined at the bouncing times; nevertheless, (27) still makes sense, since \lambda(t) decreases at the bouncing events and we only use (27) in the time-integrated sense, so this causes no problem):

\lambda(T_{k}+t)\leq\lambda(T_{k})+tL\lvert V_{T_{k}}\rvert^{2}. (27)

Notice that for a convex function U(x) that satisfies Assumption 1, we have by co-coercivity (using U(0)=\min U=0)

\lvert\nabla U(x)\rvert^{2}\leq 2LU(x),

therefore, for any t\in[0,T_{k+1}-T_{k}) and any \alpha>0,

U(X_{T_{k}+t}) =U(X_{T_{k}})+\int_{0}^{t}\lambda(T_{k}+\tau)\,\mathrm{d}\tau (28)
\leq U(X_{T_{k}})+t\lambda(T_{k})+\dfrac{Lt^{2}}{2}\lvert V_{T_{k}}\rvert^{2}
\leq U(X_{T_{k}})+t\Bigl{(}\frac{d}{\sqrt{L}T}\Bigr{)}^{1/2}\lvert\nabla U(X_{T_{k}})\rvert+2Lt^{2}d\qquad\text{(by (14b), (14c))}
\leq U(X_{T_{k}})+t\Bigl{(}\frac{2d\sqrt{L}}{T}\Bigr{)}^{1/2}\sqrt{U(X_{T_{k}})}+2Lt^{2}d
\leq(1+\alpha)U(X_{T_{k}})+d\sqrt{L}t^{2}\Bigl{(}\dfrac{1}{\sqrt{2}T\alpha}+2\sqrt{L}\Bigr{)}.

In particular,

U(X_{T_{k+1}})\leq(1+\alpha)U(X_{T_{k}})+d\sqrt{L}t_{k+1}^{2}\Bigl{(}\dfrac{1}{\sqrt{2}T\alpha}+2\sqrt{L}\Bigr{)}.

Choosing \alpha=\frac{1}{\sqrt{L}T}, we have

U(X_{T_{k+1}})\leq(1+\alpha)U(X_{T_{k}})+CdLt_{k+1}^{2}.

Now we apply the above formula iteratively and derive

U(X_{T}) \leq(1+\alpha)^{N+1}U(X_{0})+CLd\sum_{k=1}^{N+1}(1+\alpha)^{N-k+1}t_{k}^{2}
\leq(1+\alpha)^{N+1}\Bigl{(}U(X_{0})+CLd\sum_{k=1}^{N+1}t_{k}^{2}\Bigr{)}
\leq C\sqrt{L}Td\qquad\text{(by (14d), (14e))}.

Here we used \alpha=\frac{1}{\sqrt{L}T}=O(\frac{1}{N}), so that (1+\alpha)^{N+1}=O(1), which holds due to (14a), and the fact that \kappa\leq\sqrt{L}T, which holds for our choice of T in (9). ∎

Appendix B Proof of Lemma 2.2

Proof.

Let \Xi=\sum_{k=1}^{N+1}t_{k}^{2}. By properties of the Poisson process [durrett1999essentials], conditioned on N, the refreshment times T_{1},T_{2},\cdots,T_{N} have the same joint distribution as the order statistics of N i.i.d. random variables uniformly distributed in (0,T). This means

\mathbb{E}(\Xi\mid N)=\dfrac{N!}{T^{N}}\int_{t_{1}+\cdots+t_{N}<T}\Bigl{(}\sum_{k=1}^{N}t_{k}^{2}+\bigl{(}T-\sum_{k=1}^{N}t_{k}\bigr{)}^{2}\Bigr{)}\,\mathrm{d}t_{N}\cdots\,\mathrm{d}t_{1}. (29)

To calculate \mathbb{E}(\Xi\mid N), let us define

I_{1}(N,T)=\int_{t_{1}+\cdots+t_{N}<T}\Bigl{(}\sum_{k=1}^{N}t_{k}^{2}+\bigl{(}T-\sum_{k=1}^{N}t_{k}\bigr{)}^{2}\Bigr{)}\,\mathrm{d}t_{N}\cdots\,\mathrm{d}t_{1}

and compute I_{1}(N,T) by induction on N. For N=0, as the sum contains only one term, I_{1}(0,T)=T^{2}. An easy calculation shows that I_{1}(1,T)=\frac{2}{3}T^{3}. We will show that in general

I_{1}(N,T)=\frac{2(N+1)}{(N+2)!}T^{N+2}. (30)

Indeed, suppose (30) holds for N-1; we want to prove (30) for N. The starting point is the following observation:

I_{1}(N,T)=\int_{t_{1}+\cdots+t_{N}<T}t_{1}^{2}\,\mathrm{d}t_{N}\cdots\,\mathrm{d}t_{1}+\int_{0}^{T}I_{1}(N-1,T-t_{1})\,\mathrm{d}t_{1}.

The first integral can be computed by integrating the variables one by one, from t_{N} to t_{N-1}, then t_{N-2}, etc.:

\int_{t_{1}+\cdots+t_{N}<T}t_{1}^{2}\,\mathrm{d}t_{N}\cdots\,\mathrm{d}t_{1} =\int_{t_{1}+\cdots+t_{N-1}<T}t_{1}^{2}(T-t_{1}-\cdots-t_{N-1})\,\mathrm{d}t_{N-1}\cdots\,\mathrm{d}t_{1} (31)
=\frac{1}{2}\int_{t_{1}+\cdots+t_{N-2}<T}t_{1}^{2}(T-t_{1}-\cdots-t_{N-2})^{2}\,\mathrm{d}t_{N-2}\cdots\,\mathrm{d}t_{1}
=\cdots
=\dfrac{1}{(N-1)!}\int_{0}^{T}t_{1}^{2}(T-t_{1})^{N-1}\,\mathrm{d}t_{1}=\dfrac{2}{(N+2)!}T^{N+2}.

By the induction hypothesis (30) for N-1, we have

\int_{0}^{T}I_{1}(N-1,T-t_{1})\,\mathrm{d}t_{1}=\int_{0}^{T}\dfrac{2N}{(N+1)!}(T-t_{1})^{N+1}\,\mathrm{d}t_{1}=\dfrac{2N}{(N+2)!}T^{N+2}.

Combining the above with (31), we finish the induction step. Therefore,

\mathbb{E}(\Xi\mid N)=\dfrac{N!}{T^{N}}\dfrac{2(N+1)T^{N+2}}{(N+2)!}=\dfrac{2T^{2}}{N+2}.

The full expectation \mathbb{E}\Xi follows since N is a Poisson random variable:

\mathbb{E}\Xi =\sum_{n=0}^{\infty}\mathbb{E}(\Xi\mid N=n)\mathbb{P}(N=n)
=\sum_{n=0}^{\infty}\dfrac{2T^{2}}{n+2}\dfrac{(\sqrt{L}T)^{n}}{n!}e^{-\sqrt{L}T}
=2T^{2}e^{-\sqrt{L}T}\sum_{n=0}^{\infty}\Bigl{(}\dfrac{(\sqrt{L}T)^{n}}{(n+1)!}-\dfrac{(\sqrt{L}T)^{n}}{(n+2)!}\Bigr{)}
=\dfrac{2T}{\sqrt{L}}-\frac{2}{L}+\frac{2e^{-\sqrt{L}T}}{L}\leq\frac{2T}{\sqrt{L}}.

To get the desired estimate, we apply Chebyshev’s inequality using the second moment. By the same arguments leading towards (29), we have

\mathbb{E}(\Xi^{2}\mid N)=\dfrac{N!}{T^{N}}\int_{t_{1}+\cdots+t_{N}<T}\Bigl{(}\sum_{k=1}^{N}t_{k}^{2}+\bigl{(}T-\sum_{k=1}^{N}t_{k}\bigr{)}^{2}\Bigr{)}^{2}\,\mathrm{d}t_{N}\cdots\,\mathrm{d}t_{1}.

Denote

I_{2}(N,T)=\int_{t_{1}+\cdots+t_{N}<T}\Bigl{(}\sum_{k=1}^{N}t_{k}^{2}+\bigl{(}T-\sum_{k=1}^{N}t_{k}\bigr{)}^{2}\Bigr{)}^{2}\,\mathrm{d}t_{N}\cdots\,\mathrm{d}t_{1}.

Using the same induction argument as in the proof of (30), we can show that

I_{2}(N,T)=\dfrac{4(N+1)(N+6)}{(N+4)!}T^{N+4}.

This can easily be verified for N=0,1, and the induction step follows from the calculation

I_{2}(N,T) =\int_{t_{1}+\cdots+t_{N}<T}t_{1}^{4}\,\mathrm{d}t_{N}\cdots\,\mathrm{d}t_{1}+\int_{0}^{T}I_{2}(N-1,T-t_{1})\,\mathrm{d}t_{1}+2\int_{0}^{T}t_{1}^{2}I_{1}(N-1,T-t_{1})\,\mathrm{d}t_{1}
=\dfrac{1}{(N-1)!}\int_{0}^{T}t_{1}^{4}(T-t_{1})^{N-1}\,\mathrm{d}t_{1}+\dfrac{4N(N+5)}{(N+3)!}\int_{0}^{T}(T-t_{1})^{N+3}\,\mathrm{d}t_{1}+\dfrac{4N}{(N+1)!}\int_{0}^{T}t_{1}^{2}(T-t_{1})^{N+1}\,\mathrm{d}t_{1}
=\dfrac{4(N+1)(N+6)}{(N+4)!}T^{N+4}.

This shows \mathbb{E}(\Xi^{2}\mid N)=\frac{N!}{T^{N}}I_{2}(N,T)=\frac{4(N+6)}{(N+2)(N+3)(N+4)}T^{4}, and therefore

\mathbb{E}\Xi^{2} =\sum_{n=0}^{\infty}\mathbb{E}(\Xi^{2}\mid N=n)\mathbb{P}(N=n)
=\sum_{n=0}^{\infty}\dfrac{4T^{4}(n+6)}{(n+2)(n+3)(n+4)}\dfrac{(\sqrt{L}T)^{n}}{n!}e^{-\sqrt{L}T}
=4T^{4}e^{-\sqrt{L}T}\sum_{n=0}^{\infty}\Bigl{(}\dfrac{(\sqrt{L}T)^{n}}{(n+2)!}-6\dfrac{(\sqrt{L}T)^{n}}{(n+4)!}\Bigr{)}
=\frac{4T^{2}}{L}-\frac{24}{L^{2}}+8e^{-\sqrt{L}T}\Bigl{(}\frac{T^{2}}{L}+\frac{3T}{L^{\frac{3}{2}}}+\frac{3}{L^{2}}\Bigr{)}.

This means

\mathbb{E}(\Xi-\mathbb{E}\Xi)^{2}=\frac{8T}{L^{\frac{3}{2}}}-\frac{28}{L^{2}}+8e^{-\sqrt{L}T}\Bigl{(}\frac{T^{2}}{L}+\frac{2T}{L^{\frac{3}{2}}}+\frac{4}{L^{2}}-\frac{e^{-\sqrt{L}T}}{2L^{2}}\Bigr{)}\leq\frac{8T}{L^{\frac{3}{2}}},

where the last inequality holds for \sqrt{L}T larger than some universal constant (which we assume, as this is the interesting parameter regime).

Finally, to conclude the proof, we apply Chebyshev inequality to estimate the failure probability as

\mathbb{P}\Bigl{(}\Xi\geq\frac{4T}{\sqrt{L}}\Bigr{)}\leq\mathbb{P}\Bigl{(}\Xi-\mathbb{E}\Xi\geq\frac{2T}{\sqrt{L}}\Bigr{)}\leq\frac{L\,\mathbb{E}(\Xi-\mathbb{E}\Xi)^{2}}{4T^{2}}\leq\frac{2}{\sqrt{L}T}.\qed

Appendix C Proof of Lemma 3.1

Proof.

The first step is to show that

\mathbb{E}_{\mu}|\partial_{x_{i}}U|\leq\sqrt{L}. (32)

This is straightforward, since using integration by parts,

\mathbb{E}_{\mu}|\partial_{x_{i}}U|^{2}=\int_{\mathbb{R}^{d}}(\partial_{x_{i}}U)^{2}\,\mathrm{d}\mu=\int_{\mathbb{R}^{d}}\partial_{x_{i}x_{i}}U\,\mathrm{d}\mu\leq L, (33)

and (32) then follows from Cauchy-Schwarz inequality.

The next step is to establish a concentration bound. Let G(x)=\psi(\partial_{x_{i}}U), where \psi(a)=\psi(|a|) is a smooth nonnegative increasing function satisfying

\psi(0)=\psi^{\prime}(0)=0,\quad\psi(a)=|a|\ \text{ for }|a|\geq 1,\quad\text{and}\quad|\psi^{\prime}(a)|\leq 2,

and let g(x)=\exp(\frac{1}{2}\lambda G(x)). By the construction of G, we have

\mathbb{E}_{\mu}G=\mathbb{E}_{\mu}\psi(\partial_{x_{i}}U)\leq 2\mathbb{E}_{\mu}|\partial_{x_{i}}U|\leq 2\sqrt{L}. (34)

Then \nabla g(x)=\frac{\lambda}{2}\psi^{\prime}(\partial_{x_{i}}U)\nabla(\partial_{x_{i}}U)g(x). By Lemma 3.2 applied to g(x), we have

\mathbb{E}_{\mu}\exp(\lambda G)-\Bigl{(}\mathbb{E}_{\mu}\exp\bigl{(}\dfrac{\lambda G}{2}\bigr{)}\Bigr{)}^{2}=\operatorname{Var}_{\mu}g(x)
\leq\dfrac{\lambda^{2}}{4}\int_{\mathbb{R}^{d}}(\psi^{\prime}(\partial_{x_{i}}U))^{2}\nabla(\partial_{x_{i}}U)(\nabla^{2}U)^{-1}\nabla(\partial_{x_{i}}U)g^{2}(x)\,\mathrm{d}\mu
\leq\lambda^{2}\int_{\mathbb{R}^{d}}\nabla(\partial_{x_{i}}U)(\nabla^{2}U)^{-1}\nabla(\partial_{x_{i}}U)g^{2}(x)\,\mathrm{d}\mu
=\lambda^{2}\int_{\mathbb{R}^{d}}\partial_{x_{i}x_{i}}U\,g^{2}(x)\,\mathrm{d}\mu\leq\lambda^{2}L\,\mathbb{E}_{\mu}\exp(\lambda G).

Thus, for \lambda\leq\frac{1}{2\sqrt{L}}, we have

\mathbb{E}_{\mu}\exp(\lambda G)\leq\dfrac{1}{1-\lambda^{2}L}\Bigl{(}\mathbb{E}_{\mu}\exp\bigl{(}\dfrac{\lambda G}{2}\bigr{)}\Bigr{)}^{2}. (35)

Now we use (35) recursively and obtain, for H(\lambda):=\mathbb{E}_{\mu}\exp(\lambda G),

H(\lambda)\leq\prod_{k=0}^{\infty}\Bigl{(}\dfrac{1}{1-\frac{\lambda^{2}L}{4^{k}}}\Bigr{)}^{2^{k}}\lim_{\ell\to\infty}H\Bigl{(}\dfrac{\lambda}{\ell}\Bigr{)}^{\ell}. (36)

Notice

\lim_{\ell\to\infty}H\Bigl{(}\dfrac{\lambda}{\ell}\Bigr{)}^{\ell}=\lim_{\ell\to\infty}\Bigl{(}\mathbb{E}_{\mu}\exp\bigl{(}\dfrac{\lambda G}{\ell}\bigr{)}\Bigr{)}^{\ell}=\lim_{\ell\to\infty}\Bigl{(}1+\mathbb{E}_{\mu}\dfrac{\lambda G}{\ell}\Bigr{)}^{\ell}=\exp(\lambda\mathbb{E}_{\mu}G). (37)

Moreover, by [bobkov1997poincare], Proposition 4.1,

\prod_{k=0}^{\infty}\Bigl{(}\dfrac{1}{1-\frac{\lambda^{2}L}{4^{k}}}\Bigr{)}^{2^{k}}\leq\frac{1+\lambda\sqrt{L}}{1-\lambda\sqrt{L}}. (38)

Substituting (37) and (38) into (36), we obtain

H(\lambda)\leq\dfrac{1+\lambda\sqrt{L}}{1-\lambda\sqrt{L}}\exp(\lambda\mathbb{E}_{\mu}G).

Finally, combining the above exponential moment bound for G with the Chebyshev inequality, we get

\mathbb{P}_{\mu}\Bigl{(}G(x)\geq\mathbb{E}_{\mu}G+r\Bigr{)}\leq\exp(-\lambda r)\dfrac{1+\lambda\sqrt{L}}{1-\lambda\sqrt{L}}.

Now take \lambda=\frac{1}{2\sqrt{L}} and r=2c\sqrt{L}\log d, and use (34) (noticing that G(x)=|\partial_{x_{i}}U| when G(x)\geq r, since r\geq 1); we arrive at

\mathbb{P}_{\mu}\Bigl{(}|\partial_{x_{i}}U|\geq 2\sqrt{L}+2c\sqrt{L}\log d\Bigr{)}\leq 3d^{-c}.\qed

References