Complexity of zigzag sampling algorithm for strongly log-concave distributions

Jianfeng Lu Department of Mathematics, Department of Physics, and Department of Chemistry, Duke University, Durham NC 27708, USA ([email protected]) Lihan Wang For correspondence; Department of Mathematics, Duke University, Durham NC 27708 ([email protected]). Current Address: Department of Mathematical Sciences, Carnegie Mellon University, 311 Hamerschlag Drive, Pittsburgh, PA, 15213, USA ([email protected])

Abstract

We study the computational complexity of zigzag sampling algorithm for strongly log-concave distributions. The zigzag process has the advantage of not requiring time discretization for implementation, and that each proposed bouncing event requires only one evaluation of partial derivative of the potential, while its convergence rate is dimension independent. Using these properties, we prove that the zigzag sampling algorithm achieves $\varepsilon$ error in chi-square divergence with a computational cost equivalent to $O\bigl{(}\kappa^{2}d^{\frac{1}{2}}(\log\frac{1}{\varepsilon})^{\frac{3}{2}}\bigr{)}$ gradient evaluations in the regime $\kappa\ll\frac{d}{\log d}$ under a warm start assumption, where $\kappa$ is the condition number and $d$ is the dimension.

Keywords: Monte Carlo sampling; zigzag sampler; log-concave distribution; computational complexity

1 Introduction and Main Results

Monte Carlo sampling from a high-dimensional probability distribution is a fundamental problem with applications in various areas including Bayesian statistics, machine learning, and statistical physics. Many sampling algorithms, especially those for continuous state space like $\mathbb{R}^{d}$ , are based on continuous time Markov processes. Examples of these processes include the overdamped Langevin dynamics, whose invariant measure is the target measure, the underdamped Langevin dynamics and Hamiltonian Monte Carlo (HMC) [duane1987hybrid], both augment the state space with a velocity variable $v$ , and have the $x$ -marginal distribution of the invariant measure as the target measure. For strongly log-concave distributions, all these processes converge to the equilibrium exponentially fast with rates independent of the dimension, making them suitable for sampling purposes. On the other hand, all of these processes require time discretizations for implementation, which not only induces further numerical errors but requires the time step to be small as well, requiring higher computational complexity if a small bias is desired. To remove such bias due to discretization, the conventional procedure is to introduce the Metropolis-Hastings acceptance-rejection step, but rejections indicate waste of computational resources.

A very different line of sampling algorithms have been recently developed in statistical physics and statistics literature [peters2012rejection], which are based on piecewise deterministic Markov processes (PDMPs) [davis1984piecewise]. These processes are non-reversible, which may mix faster than reversible MCMC methods [diaconis2000analysis, turitsyn2011irreversible]. Examples of such samplers include the randomized Hamiltonian Monte Carlo [bou2017randomized], the zigzag process [bierkens2019zig], the bouncy particle sampler [peters2012rejection, bouchard2018bouncy], and some others [vanetti2017piecewise, michel2014generalized, bierkens2020boomerang]. The zigzag and bouncy particle samplers are appealing for big data applications, as they can be unbiased even if stochastic gradient is used [bouchard2018bouncy, bierkens2019zig]. These algorithms, as they are still relatively new, have not yet been thoroughly analyzed. In particular, no non-asymptotic computational complexity bounds on these algorithms have been established yet, to the best of our knowledge. Our previous work [lu2020explicit] gives explicit exponential convergence rates for the PDMPs with log-concave potentials, which opens the possibility of deriving such complexity bounds for PDMPs, and provides the foundation of this work.

1.1 Algorithm and Assumptions

Let $x$ denote the state variable in $\mathbb{R}^{d}$ where $d$ is the dimension. The target distribution we want to sample from is denoted by

\,\mathrm{d}\mu(x)=Z^{-1}\exp(-U(x))\,\mathrm{d}x,

where $U(x)$ is the potential and $Z=\int_{\mathbb{R}^{d}}\exp(-U(x))\,\mathrm{d}x$ is the normalizing constant. Although the zigzag process can also be applied to sample non log-concave distributions, we will restrict our analysis to strongly log-concave distributions, namely, we make the following assumption throughout:

Assumption 1.

The potential function $U(x)$ satisfies

m\mathrm{Id}\leq\nabla^{2}U(x)\leq L\mathrm{Id},

(1)

for some $0<m\leq 1\leq L$ . Moreover, $U(x)$ has a unique minimizer at $x=0$ , and $U(0)=0$ .

For any random variable $X$ , we use $\rho(X)$ to denote its law. In this paper, we use chi-square divergence to measure the difference between two probability measures: for probability measures $\rho_{1},\rho_{2}$ that $\rho_{1}\ll\rho_{2}$ , it is defined as

\chi^{2}(\rho_{1}\,\|\,\rho_{2}):=\int_{\mathbb{R}^{d}}\Bigl{(}\frac{\,\mathrm{d}\rho_{1}}{\,\mathrm{d}\rho_{2}}-1\Bigr{)}^{2}\,\mathrm{d}\rho_{2}.

The zigzag sampling algorithm is based on a piecewise deterministic Markov process, called zigzag process. Besides the variable $x$ , we augment the state space by an auxiliary velocity variable taking value in $\mathbb{R}^{d}$ . A trajectory of the zigzag process, denoted by $(X_{t},V_{t})$ , can be described as follows. Given some initial $(X_{0},V_{0})$ , the position $X_{t}$ always evolves according to $\frac{\,\mathrm{d}}{\,\mathrm{d}t}X_{t}=V_{t}$ , while the velocity $V_{t}$ is piecewise constant which only changes when bouncing or refreshing events occur at some random time following Poisson clocks. Bouncing events on the $j$ -th direction occur with rate $(V_{t}^{(j)}\partial_{x_{j}}U(X_{t}))_{+}$ , and at such an event the velocity $V_{t}$ changes by flipping its $j$ -th component to $-V_{t}^{(j)}$ . Refreshing events occur with rate $\lambda$ for some fixed $\lambda>0$ , when the velocity $V_{t}$ is completely redrawn from the standard normal $\mathcal{N}(0,\mathrm{Id})$ .

It has been established [andrieu2018hypocoercivity, bierkens2019ergodicity, lu2020explicit] that under Assumption 1, $\rho(X_{t},V_{t})$ converges to the invariant measure of the zigzag process, which is a product measure of the target measure in $x$ and the standard Gaussian in $v$ :

\,\mathrm{d}\bar{\mu}(x,v)=\,\mathrm{d}\mu(x)\,\mathrm{d}\nu(v)\ \mbox{ where }\,\mathrm{d}\nu(v)=(2\pi)^{-\frac{d}{2}}\exp\bigl{(}-\frac{|v|^{2}}{2}\bigr{)}\,\mathrm{d}v.

Our analysis relies on the following more quantitative convergence result for zigzag process proved in [lu2020explicit], which also specifies the optimal choice of refreshing rate $\lambda$ .¹¹1[lu2020explicit] shows exponential convergence for the backward equation. By duality the exponential convergence of the backward equation in $L^{2}(\bar{\mu})$ is equivalent to the exponential convergence of the forward equation in $\chi^{2}$ with the same rate. We would like to comment here that the choice of $\lambda=\sqrt{L}$ is completely technical since it optimizes the theoretical convergence rate (up to a universal constant) of the zigzag process established in [lu2020explicit]. The zigzag process is ergodic even if $\lambda=0$ and in practice the choice $\lambda=0$ is common.

Proposition 1.1.

[lu2020explicit]*Theorem 1 Under Assumption 1, there exists a universal constant $K$ independent of all parameters, such that for any initial density $\bar{\mu}_{0}$ , the zigzag process with friction parameter $\lambda=\sqrt{L}$ satisfies

\chi^{2}(\rho(X_{T},V_{T})\,\|\,\bar{\mu})\leq K\exp\bigl{(}-\frac{m}{K\sqrt{L}}T\bigr{)}\chi^{2}(\bar{\mu}_{0}\,\|\,\bar{\mu}).

(2)

The left-hand side of (2) controls desired divergence of $\rho(X)$ with respect to the target measure $\mu$ , as we have

	$\displaystyle\chi^{2}(\rho(X_{T},V_{T})\,\\|\,\bar{\mu})$	$\displaystyle=\int_{\mathbb{R}^{d}\times\mathbb{R}^{d}}\Bigl{(}\frac{\,\mathrm{d}\rho(X_{T},V_{T})}{\,\mathrm{d}\bar{\mu}}\Bigr{)}^{2}\,\mathrm{d}\bar{\mu}(x,v)-1$
		$\displaystyle=\int_{\mathbb{R}^{d}}\Bigl{(}\frac{\,\mathrm{d}\rho(X_{T})}{\,\mathrm{d}\mu}\Bigr{)}^{2}\Bigl{(}\int_{\mathbb{R}^{d}}\Bigl{(}\frac{\,\mathrm{d}\rho(V_{T}\mid X_{T})}{\,\mathrm{d}\nu(v)}\Bigr{)}^{2}\,\mathrm{d}\nu(v)\Bigr{)}\,\mathrm{d}\mu(x)-1$
		$\displaystyle=\int_{\mathbb{R}^{d}}\Bigl{(}\frac{\,\mathrm{d}\rho(X_{T})}{\,\mathrm{d}\mu}\Bigr{)}^{2}\Bigl{(}1+\chi^{2}\bigl{(}\rho(V_{T}\mid X_{T})\,\\|\,\nu\bigr{)}\Bigr{)}\,\mathrm{d}\mu(x)-1$
		$\displaystyle\geq\int_{\mathbb{R}^{d}}\Bigl{(}\frac{\,\mathrm{d}\rho(X_{T})}{\,\mathrm{d}\mu}\Bigr{)}^{2}\,\mathrm{d}\mu(x)-1=\chi^{2}(\rho(X_{T})\,\\|\,\mu).$

Moreover, we would take initial condition in the form of

(X_{0},V_{0})\sim\bar{\mu}_{0}(x,v)=\mu_{0}(x)\nu(v),

(3)

which implies that $\chi^{2}(\bar{\mu}_{0}\,\|\,\bar{\mu})=\chi^{2}(\mu_{0}\,\|\,\mu)$ . Therefore, we get

\chi^{2}(\rho(X_{T})\,\|\,\mu)\leq K\exp\bigl{(}-\frac{m}{K\sqrt{L}}T\bigr{)}\chi^{2}(\mu_{0}\,\|\,\mu),

(4)

which suggests the total time $T$ needed to achieve control of chi-square divergence.

Of course, in practice, we cannot simulate the zigzag process directly, as simulating the Poisson process associated with the bouncing event would require integrating $(V_{t}^{(j)}\partial_{x_{j}}U(X_{t}))_{+}$ along the trajectory. To turn the zigzag process into an efficient and practical sampling algorithm, the Poisson process for the bouncing events are usually simulated using the Poisson thinning trick (see e.g., discussions in [bierkens2019zig]*Section 3). Under Assumption 1, we will use the following upper bound estimate for the rate:

(v_{i}\partial_{x_{i}}U(x+vt))_{+}\leq\lvert v_{i}\partial_{x_{i}}U(x+vt)\rvert\leq\lvert v_{i}\rvert\lvert\partial_{x_{i}}U(x+vt)\rvert\leq L\lvert v_{i}\rvert(\lvert x\rvert+t\lvert v\rvert).

(5)

This upper bound has the advantage of not involving evaluations of $U$ and its partial derivatives, which greatly reduces the computational cost, compared with using numerical quadrature for $d$ Poisson clocks. The price to pay is the increased frequency of potential bouncing events, which scales like $O(\sqrt{d})$ since the pessimistic bound for the partial derivative $|\partial_{x_{i}}U(x)|\leq|\nabla U(x)|\leq L|x|$ typically sacrifices a factor of $O(\sqrt{d})$ in the first inequality.

Following the above discussions, the zigzag sampling algorithm is described in Algorithm 1, where Step LABEL:state:magrate uses the upper bound estimate in (5), while Steps LABEL:state:thinning1–LABEL:state:thinning2 correspond to the Poisson thinning step. Note that for each potential bouncing event, the algorithm requires one evaluation of $\partial_{x_{i}}U$ in Step LABEL:state:thinning1. In practice, typically accessing the partial derivatives of $U$ is the most time consuming step, therefore, in our complexity analysis, we focus on the number of access to partial derivatives.

Algorithm 1 The zigzag sampling algorithm

We also need the following assumption for technical purposes, as will be discussed after stating the main results:

Assumption 2.

The initial distribution $\mu_{0}(x)$ satisfies a warm-start condition:

\chi^{2}(\mu_{0}\,\|\,\mu)\leq\exp\Bigl{(}\frac{d}{8K\kappa\log d}\Bigr{)},

(6)

where $\kappa:=L/m$ is the condition number, and $K$ is the same universal constant as in (2). Furthermore, the initial distribution is concentrated in the sense of

\eta:=\mathbb{P}_{\mu_{0}}\Bigl{(}|x|>\sqrt{\frac{2d}{m}}\Bigr{)}<\frac{1}{4}.

(7)

Remark 1.2.

The concentration condition (7) can be easily satisfied. By Gaussian Annulus Theorem, if we pick $\mu_{0}=\mathcal{N}(0,\frac{1}{m}\mathrm{Id})$ , then $\mathbb{P}_{\mu_{0}}\bigl{(}|x|>\sqrt{\frac{2d}{m}}\bigr{)}\leq 3e^{-cd}$ for some universal constant $c$ . The failure probability gets smaller if we take $\mu_{0}=\mathcal{N}(0,\frac{1}{L}\mathrm{Id})$ or $\mu_{0}=\mu$ . The warm start condition (6) is more stringent but can be achieved by first running Langevin Monte Carlo (LMC). We will discuss that after presenting our main result.

1.2 Main Results

Theorem 1.

Under Assumption 1, for any prescribed accuracy $\varepsilon>0$ , Algorithm 1 outputs a random variable $X$ such that

\chi^{2}(\rho(X)\,\|\,\mu)\leq\varepsilon,

(8)

for terminal time $T$ chosen as

T=K\Bigl{(}\frac{\sqrt{L}}{m}\bigl{(}\log\frac{1}{\varepsilon}+\log\chi^{2}(\mu_{0}\,\|\,\mu)+\log K\bigr{)}\Bigr{)},

(9)

where $K$ is the universal constant in (2).

Moreover, if $\varepsilon\geq\exp\bigl{(}-\frac{d}{8K\kappa\log d}\bigr{)},$ then, under Assumption 2, with probability $1-\frac{C}{\sqrt{L}T}-C\log^{-\frac{3}{2}}d-\eta$ , Algorithm 1 returns an output with a computational cost of

O\Bigl{(}d^{\frac{3}{2}}\kappa^{2}\bigl{(}\log^{\frac{3}{2}}\frac{1}{\varepsilon}+\log^{\frac{3}{2}}\chi^{2}(\mu_{0}\,\|\,\mu)\bigr{)}\Bigr{)}

evaluations of partial derivatives of $U$ , where $\eta$ is defined in (7) and $C$ is a universal constant.

Remark 1.3.

By repeated trials, the theorem implies that for any $\delta\in(0,\frac{1}{4})$ , with probability $1-\delta$ , Algorithm 1 returns the desired output with a computational cost of

O\biggl{(}d^{\frac{3}{2}}\kappa^{2}\Bigl{(}\log^{\frac{3}{2}}\frac{1}{\varepsilon}+\log^{\frac{3}{2}}\chi^{2}(\mu_{0}\,\|\,\mu)\Bigr{)}\log\frac{1}{\delta}\,\Bigl{\lvert}\log^{-1}\bigl{(}\frac{1}{\sqrt{L}T}+\log^{-\frac{3}{2}}d+\eta\bigr{)}\Bigr{\rvert}\biggr{)},

that is $\widetilde{O}(d^{\frac{3}{2}}\kappa^{2})$ evaluations of partial derivatives of $U$ , where $\widetilde{O}(\cdot)$ hides logarithmic factors.

With the common computational model that $d$ evaluations of partial derivatives of $U$ is equivalent to one evaluation of $\nabla U$ in complexity, the complexity of zigzag is equivalent to $\widetilde{O}(d^{\frac{1}{2}}\kappa^{2})$ evaluations of $\nabla U$ .

Let us explain the choice of $T$ in (9): For the zigzag sampling algorithm to reach the target $\varepsilon$ accuracy according to (4), the terminal time $T$ needs to be large enough. Meanwhile, the Assumption 2 guarantees that $T$ is not too large, as otherwise we cannot effectively control the number of bouncing events either due to a very large $V$ drawn from a velocity refreshing event or the trajectory reaching regions with large gradient. These motivate our previous Assumption 2 on the initial distribution $\mu_{0}$ , as well as the restriction on $\varepsilon$ that it cannot be too small compared to $d$ . We remark that the assumption on $\varepsilon$ is not prohibitive as we are interested in high dimensional cases and the error threshold is exponentially small in $d$ .

The warm start condition (6) can be achieved if we start with a Gaussian distribution in $x$ and run Langevin Monte Carlo

X_{n+1}=X_{n}-h\nabla U(X_{n})+\sqrt{2h}\,\xi_{n}

(10)

where $h$ is the step size, and $\xi_{n}$ are i.i.d. $\mathcal{N}(0,\mathrm{Id})$ random variables. This leads to the following corollary:

Corollary 1.4.

Let $d\gg 1$ . Suppose the potential $U$ satisfies Assumption 1 for some $\kappa\geq 1$ such that $\kappa^{\frac{9}{5}}\leq\frac{d^{\frac{4}{5}}}{C\log^{3}d}$ for some computable (from [erdogdu2020convergence]) universal constant $C$ . Then, for any prescribed accuracy $\varepsilon>0$ , if we initialize $X_{0}\sim\mathcal{N}(0,\frac{1}{2L}\mathrm{Id})$ , the hybrid algorithm by first running LMC (10) for $N=d^{4/5}\kappa^{16/5}$ steps with step size $h=\frac{4}{5}d^{-4/5}\kappa^{-16/5}m^{-1}\log\frac{d}{\kappa}$ and then Algorithm 1 up to time $T=K\Bigl{(}\frac{\sqrt{L}}{m}\bigl{(}\log\frac{1}{\varepsilon}+d^{\frac{1}{5}}\kappa^{\frac{4}{5}}\log^{2}\frac{d}{\kappa}+\log K\bigr{)}\Bigr{)}$ outputs a random variable $X$ such that

\chi^{2}(\rho(X)\,\|\,\mu)\leq\varepsilon.

(11)

Moreover, if $\varepsilon\geq\exp\bigl{(}-\frac{d}{8K\kappa\log d}\bigr{)},$ with probability $1-\frac{C}{\sqrt{L}T}-C\log^{-\frac{3}{2}}d-C\exp(Cd^{\frac{1}{5}}\kappa^{\frac{4}{5}}\log^{2}\frac{d}{\kappa}-cd)$ , Algorithm 1 returns an output with a computational cost of

O\Bigl{(}d^{\frac{1}{2}}\kappa^{2}\log^{\frac{3}{2}}\frac{1}{\varepsilon}+d^{\frac{4}{5}}\kappa^{\frac{16}{5}}\log^{3}\frac{d}{\kappa}\Bigr{)}

evaluations of partial derivatives of $U$ .

Proof.

It is easy to verify that our choice of $N,h$ satisfies $h\leq\frac{m}{4L^{2}}$ and $Nh^{2}\leq\frac{1}{196c\kappa^{2}L^{2}}$ (where $c$ satisfies [erdogdu2020convergence]*Lemma 14). Therefore, we may appeal Lemmas 2, 14, 25, 26 of [erdogdu2020convergence], so that the random variable $X_{N}$ produced in (10) satisfies

\chi^{2}(\rho(X_{N})\|\mu)\leq\exp\Bigl{(}Cd\exp(-Nhm)+CNh^{2}\kappa^{2}L^{2}(d+\log N)\Bigr{)}=\exp\Bigl{(}Cd^{\frac{1}{5}}\kappa^{\frac{4}{5}}\log^{2}\frac{d}{\kappa}\Bigr{)}

(12)

for some universal constant $C$ . This, combined with our assumption on $\kappa$ , guarantees that (6) holds with $\rho(X_{N})$ playing the role of $\mu_{0}$ . We can also check the validity of (7) by

\displaystyle\mathbb{P}_{\rho(X_{N})}\bigl{(}|x|>\sqrt{\frac{2d}{m}}\bigr{)}\leq\Bigl{(}1+\chi^{2}(\rho(X_{N})\|\mu)\Bigr{)}^{\frac{1}{2}}\Bigl{(}\mathbb{P}_{\mu}\bigl{(}|x|>\sqrt{\frac{2d}{m}}\bigr{)}\Bigr{)}^{\frac{1}{2}}\leq C\exp\Bigl{(}Cd^{\frac{1}{5}}\kappa^{\frac{4}{5}}\log^{2}\frac{d}{\kappa}-cd\Bigr{)}\ll 1.

Therefore we may apply Theorem 1 with $\mu_{0}=\rho(X_{N})$ , and derive that the total computational cost (in terms of number of evaluations of $\nabla U$ ) equals to

O\Bigl{(}N+d^{\frac{1}{2}}\kappa^{2}\bigl{(}\log^{\frac{3}{2}}\frac{1}{\varepsilon}+\log^{\frac{3}{2}}\chi^{2}(\rho(X_{N})\|\mu)\bigr{)}\Bigr{)}=O\Bigl{(}d^{\frac{4}{5}}\kappa^{\frac{16}{5}}\log^{3}\frac{d}{\kappa}+d^{\frac{1}{2}}\kappa^{2}\log^{\frac{3}{2}}\frac{1}{\varepsilon}\Bigr{)}.

∎

Theorem 1 guarantees that the zigzag sampling algorithm (Algorithm 1) outputs a sample from a distribution with $\chi^{2}$ -divergence at most $\varepsilon$ away from the target density for a computational complexity equivalent to $\widetilde{O}(d^{\frac{3}{2}}\kappa^{2})$ partial derivative evaluations (i.e., amounts to $\widetilde{O}(d^{\frac{1}{2}}\kappa^{2})$ gradient evaluations), in the regime $\max\{\kappa,\log\frac{1}{\varepsilon}\}\ll\frac{d}{\log d}$ with a warm-start condition. Corollary 1.4 establishes that the hybrid LMC-zigzag algorithm outputs a sample for a computational complexity $\widetilde{O}(d^{\frac{4}{5}}\kappa^{\frac{16}{5}})$ gradient evaluations. The initialization using LMC is added only for technical reasons as we currently do not have complexity guarantees otherwise with an explicit initial distribution, nor is it necessary for actual implementations. We would also like to comment that our goal is to obtain the best possible scaling in $d$ , and the scaling in $\kappa$ might be possibly improved by a more careful analysis.

Our analysis is based on the quantitative convergence rate of the zigzag process established in [lu2020explicit], which is $O(\frac{m}{\sqrt{L}})$ for $m$ -convex and $L$ -smooth potentials. The rest of our proof is based on estimating $\sup|X_{t}|$ along a single trajectory of the zigzag process and subsequently turn this into an estimate on the number of potential bouncing events, and hence number of partial derivative evaluations. Our analysis utilizes the two important and desirable features of the zigzag sampling process:

•

The implementation of the zigzag process does not need time discretization, as the velocity in deterministic portion of the trajectory remains constant, which makes it possible to simulate the exact trajectories of the zigzag process while eliminating an important source of error. This is the reason that the complexity of the zigzag process only has logarithmic dependence on $\frac{1}{\varepsilon}$ , without Metropolis acceptance/rejection.
•

Moreover, for each potential bouncing event of zigzag, only one evaluation of a partial derivative of the potential is required, which is $O(d)$ cheaper than a full gradient evaluation in computational cost for usual model of computation.

We would also remark that we quantify the error of distribution in terms of $\chi^{2}$ -divergence, which provides stronger guarantee than total variation, KL divergence or $2$ -Wasserstein distance. While $\chi^{2}$ -divergence is relatively convenient for obtaining convergence rates of continuous processes based on Poincarè inequality [cao2019explicit, lu2020explicit], it does not seem easy to use for analyzing discretization error of SDEs. The work [vempala2019rapid] made assumptions of Poincarè inequality for the discrete invariant measure, which is difficult to verify. We are fortunate to avoid such problem for zigzag sampler, thanks to the fact that zigzag does not need time discretization. After the first version of this work appears online, [erdogdu2020convergence] established convergence of LMC in $\chi^{2}$ - and Rényi divergence, using the exponential convergence of continuous time overdamped Langevin dynamics in Rényi divergence [cao2019exponential, vempala2019rapid].

1.3 Previous Works

Here we focus on results on non-asymptotic analysis of sampling algorithms, which has been a focused research area in recent years. Many sampling algorithms have been analyzed including algorithms based on overdamped Langevin dynamics [dalalyan2017theoretical, durmus2019high, durmus2019analysis, vempala2019rapid, li2019stochastic, ding2020randomlangevin], underdamped Langevin dynamics [cheng2018underdamped, dalalyan2020sampling, ma2019there, shen2019randomized, ding2020randomunderdamped, monmarche2020high], Hamiltonian Monte Carlo [mangoubi2017rapid, lee2018algorithmic, chen2019optimal, mangoubi2018dimensionally, bou2020coupling], or high order Langevin dynamics [mou2019high], among others. These methods involve discretization of ODEs or SDEs, which yields an error that scales polynomially with step size. Thus the complexity of these algorithms has polynomial dependence on $\varepsilon^{-1}$ , where $\varepsilon$ is the desired accuracy threshold.

Metropolized variants of sampling algorithms, including Metropolized HMC and Metropolis Adjusted Langevin Algorithm (MALA), have also been studied in [dwivedi2018log, chen2020fast, lee2020logsmooth], the complexities of which have only logarithmic dependence on $\varepsilon^{-1}$ , similar to the zigzag sampling process analyzed here. In [dwivedi2018log] the complexity upper bound for MALA is established as $\widetilde{O}(\kappa d+\kappa^{\frac{3}{2}}d^{\frac{1}{2}})$ under warm start condition, and $\widetilde{O}(\kappa d^{2}+\kappa^{\frac{3}{2}}d^{\frac{3}{2}})$ with a feasible start. In [chen2020fast] the complexity upper bound for MALA is improved to $\widetilde{O}(\kappa d+\kappa^{\frac{3}{2}}d^{\frac{1}{2}})$ with feasible start (where $\mu_{0}=\mathcal{N}(0,\frac{1}{L}\mathrm{Id})$ ). The work [chen2020fast] also established bounds for Metropolized HMC, which is $\widetilde{O}(\kappa d^{\frac{11}{12}})$ with warm start (which is in fact more stringent than our Assumption 2) in the regime $\kappa=O(d^{\frac{2}{3}})$ , and $\widetilde{O}(\kappa^{\frac{3}{4}}d+\kappa^{\frac{7}{4}}d^{\frac{1}{2}})$ with feasible start if the target potential function has a bounded Hessian. The complexity upper bound has been improved in [lee2020logsmooth] to $\widetilde{O}(\kappa d)$ for both Metropolized HMC and MALA with a feasible start, based on a refined analysis using concentration of gradient norm. In comparison, our result for zigzag relies on a warm start (which is achievable by LMC), while the complexity upper bound has better dependence in $d$ . The issue of feasible start will be further discussed in Section 3.

Regarding asymptotic analysis for the convergence of zigzag process, the ergodicity was first established in [bierkens2019ergodicity]. Exponential convergence of the zigzag process is established in [fontbona2016long, bierkens2017piecewise] using a Lyapunov function argument. A central limit theorem of the zigzag process is established in [bierkens2017limit], and a large deviation principle is established for the empirical measure in [bierkens2019large]. The spectrum of the zigzag process has been studied in [bierkens2019spectral, guillin2020low]. A dimension independent exponential convergence rate for the zigzag process is established in [andrieu2018hypocoercivity], using the hypocoercivity framework developed in [dolbeault2015hypocoercivity]. Finally, a more quantitative convergence estimate was established in [lu2020explicit], for which our analysis of the sampling algorithm is based on.

2 Strategy of the Proof

Since Algorithm 1 always simulates exact trajectories of the zigzag process, we see that (8) is guaranteed with the correct choice of $T$ . Therefore we only need to estimate the computational complexity. The strategy of the proof is to first give an estimate on $\sup_{t\in[0,T]}U(X_{t})$ (Lemma 2.1), which directly controls $\sup_{t\in[0,T]}|X_{t}|$ . The upper bound on $|X_{t}|$ in turn provides us an estimate of upper bound on the number of partial derivative evaluations of $U$ . The complexity upper bound we derive holds with high probability, while it does not always hold (for example, the number of proposed bouncing events from the Poisson clock might be atypically high), such events only occur with very small probability, which will be controlled in the proof.

Let $N+1$ be the total number of velocity refreshments (including the initial refreshment), therefore $N$ is a Poisson random variable such that

\mathbb{P}(N=n)=\dfrac{(\sqrt{L}T)^{n}}{n!}e^{-\sqrt{L}T}.

(13)

Let $0=T_{0}<T_{1}<T_{2}<\cdots<T_{N}\leq T<T_{N+1}$ be the refresh times, and $V_{T_{k}}$ be the velocity variable after refreshment at time $T_{k}$ . For $k=1,\cdots,N$ , we use $t_{k}=T_{k}-T_{k-1}$ to denote the time duration between refreshments. For convenience, we will also denote $t_{N+1}=T-T_{N}$ .

The first step of the proof is the following lemma which controls $\sup_{t\in[0,T]}U(X_{t})$ condition on some high probability events. The proof will be deferred to the appendix.

Lemma 2.1.

Under Assumptions 1 and 2, suppose the following conditions hold:


	$\displaystyle\frac{1}{2}\sqrt{L}T\leq N\leq\frac{3}{2}\sqrt{L}T;$		(14a)
	$\displaystyle\lvert V_{T_{k}}\cdot\nabla U(X_{T_{k}})\rvert\leq\Bigl{(}\frac{d}{\sqrt{L}T}\Bigr{)}^{1/2}\|\nabla U(X_{T_{k}})\|,\quad\forall k=1,\cdots,N;$		(14b)
	$\displaystyle\lvert V_{T_{k}}\rvert\leq 2\sqrt{d},\quad\forall k=1,\cdots,N$		(14c)
	$\displaystyle U(X_{0})\leq\kappa d;$		(14d)
	$\displaystyle\sum_{k=1}^{N+1}t_{k}^{2}\leq\dfrac{4T}{\sqrt{L}}.$		(14e)

Then there exists a universal constant $C$ such that

\sup_{t\in[0,T]}U(X_{t})\leq C\sqrt{L}Td.

(15)

The next element in the proof is to control the failure event that (14) does not hold. The control of the first four events are relatively straightforward and will thus be directly carried out in the proof of theorem below; we state the probability for the event (14e) to hold as the following lemma, which will also be proved in the appendix.

Lemma 2.2.

There exists a universal constant $C$ such that, if $\sqrt{L}T>C$ , then with probability $1-\frac{2}{\sqrt{L}T}$ , condition (14e) holds.

The final component of the proof is to turn the estimate for $\sup_{t\in[0,T]}U(X_{t})$ to an upper bound for the number of proposed bouncing events.

Proof of Theorem 1.

Let $p_{i}$ be the probability that condition $i$ in (14) of Lemma 2.1 fails. We start with condition (14a) of Lemma 2.1. For Poisson process with $t_{i}$ as the arrival times, we may estimate the first failure probability (here and for the rest of the proofs $C$ denotes a universal constant that may change from line to line)

p_{a}\leq\exp(-\frac{1}{C}\sqrt{L}T)\leq\frac{C}{\sqrt{L}T}.

(16)

We now check the conditions (14b) and (14c) of Lemma 2.1. By Gaussian Annulus Theorem, for each refreshment, we have

\mathbb{P}(\lvert V_{T_{k}}\rvert\geq 2\sqrt{d})\leq 3e^{-cd},

(17)

where $c>0$ is some universal constant. We also require $V_{T_{k}}$ to satisfy $\lvert V_{T_{k}}\cdot n(X_{T_{k}})\rvert\leq\bigl{(}\frac{d}{\sqrt{L}T}\bigr{)}^{1/2}$ , where $n(X_{T_{k}})=\frac{\nabla U(X_{T_{k}})}{|\nabla U(X_{T_{k}})|}$ , which has failure probability

	$\displaystyle\mathbb{P}(\lvert V\cdot n(X)\rvert\geq\Bigl{(}\frac{d}{\sqrt{L}T}\Bigr{)}^{1/2})$	$\displaystyle=\dfrac{1}{\sqrt{2\pi}}\int_{\bigl{(}\frac{d}{\sqrt{L}T}\bigr{)}^{1/2}}^{\infty}\exp(-\frac{r^{2}}{2})\,\mathrm{d}r$
		$\displaystyle\leq\dfrac{1}{\sqrt{2\pi}}\int_{\bigl{(}\frac{d}{\sqrt{L}T}\bigr{)}^{1/2}}^{\infty}\exp\Bigl{(}-\frac{r}{2}\bigl{(}\frac{d}{\sqrt{L}T}\bigr{)}^{1/2}\Bigr{)}\,\mathrm{d}r$
		$\displaystyle\leq\sqrt{\dfrac{2}{\pi}}\Bigl{(}\frac{\sqrt{L}T}{d}\Bigr{)}^{1/2}\exp(-\frac{d}{2\sqrt{L}T}).$

Since we have to draw $V$ for $N$ times, cumulatively this yields a failure probability

p_{b}+p_{c}\leq C\biggl{(}e^{-cd}+\Bigl{(}\frac{\sqrt{L}T}{d}\Bigr{)}^{1/2}\exp\Bigl{(}-\frac{d}{2\sqrt{L}T}\Bigr{)}\biggr{)}\mathbb{E}N.

(18)

Recall the assumption $\varepsilon\geq\exp(-\frac{d}{8K\kappa\log d})$ as well as (6) (and that $\kappa K\log K\leq\frac{d}{4\log d}$ ), which implies that $\sqrt{L}T\leq\frac{d}{2\log d}$ for our choice of $T$ as in (9). Together with condition (14a), we derive (neglecting the obviously smaller term $e^{-cd}$ )

p_{b}+p_{c}\leq C\sqrt{L}T\Bigl{(}\frac{\sqrt{L}T}{d}\Bigr{)}^{1/2}\exp\Bigl{(}-\frac{d}{2\sqrt{L}T}\Bigr{)}\leq C\log^{-\frac{3}{2}}d.

The failure probability for condition (14d) is straightforward to estimate. Using Assumption 1, we have

U(X_{0})\leq\frac{L}{2}|X_{0}|^{2},

which indicates

p_{d}\leq\eta=\mathbb{P}(|X_{0}|\geq\sqrt{\frac{2d}{m}}).

Finally, $p_{e}$ is already estimated in Lemma 2.2, which yields $p_{e}\leq\frac{2}{\sqrt{L}T}$ . In summary, the total failure probability of (14) can be bounded as

p_{a}+p_{b}+p_{c}+p_{d}+p_{e}\leq\frac{C}{\sqrt{L}T}+C\log^{-\frac{3}{2}}d+\eta.

(19)

We now assume that condition (14) holds. Thus, Lemma 2.1 together with Assumption 1 implies that

\sup_{t\in[0,T]}\lvert X_{t}\rvert\leq\Bigl{(}\dfrac{2}{m}\sup_{t\in[0,T]}U(X_{t})\Bigr{)}^{1/2}\leq C\Bigl{(}\frac{\sqrt{L}}{m}Td\Bigr{)}^{1/2}.

(20)

After each refreshment or bouncing event, Algorithm 1 runs $d$ independent Poisson clocks $\{\tau_{i}\}_{i=1,\cdots,d}$ defined in Step LABEL:state:magrate where, noticing $\sum_{i}\lvert V_{i}\rvert\leq\sqrt{d}\lvert V\rvert\leq 2d$ ,

\mathbb{P}(\min\tau_{i}\geq t)\geq\exp\Bigl{(}-tL\lvert X\rvert\sum_{i}\lvert V_{i}\rvert-\frac{t^{2}}{2}\lvert V\rvert\sum_{i}\lvert V_{i}\rvert\Bigr{)}\geq\exp\bigl{(}-Cd^{\frac{3}{2}}(L^{\frac{5}{4}}m^{-\frac{1}{2}}T^{\frac{1}{2}}t+t^{2})\bigr{)}.

(21)

This motivates us to consider the following counting process $\tilde{N}_{t}$ : suppose $\tilde{t}_{1},\cdots$ are i.i.d. random variables with $\mathbb{P}(\tilde{t}_{i}\geq s)=\exp(-As-Bs^{2})$ where $A=Cd^{\frac{3}{2}}L^{\frac{5}{4}}m^{-\frac{1}{2}}T^{\frac{1}{2}}$ and $B=Cd^{\frac{3}{2}}$ , and let $\tilde{N}_{t}=\inf_{n}\{\sum_{i=1}^{n}\tilde{t}_{i}>t\}$ . By construction, the probability of $N>8AT$ under condition (14) is controlled by $\mathbb{P}(\tilde{N}_{T}>8AT)$ . Therefore, it suffices to estimate $\mathbb{P}(\tilde{N}_{T}>8AT)$ .

We compute the expectation of $\tilde{t}_{1}$ (here notice $A\gg B\gg 1$ ):

	$\displaystyle\mathbb{E}\tilde{t}_{1}$	$\displaystyle=\int_{0}^{\infty}s(A+2Bs)\exp(-As-Bs^{2})\,\mathrm{d}s\geq\int_{0}^{\frac{A}{B}}s(A+2Bs)\exp(-2As)\,\mathrm{d}s$
		$\displaystyle=\frac{1}{4A}+\frac{B}{2A^{3}}-\Bigl{(}\frac{3A}{2B}+\frac{5}{4A}+\frac{B}{2A^{3}}\Bigr{)}e^{-\frac{2A^{2}}{B}}\geq\frac{1}{4A}.$		(22)

On the other hand,

\mathbb{E}\tilde{t}_{1}^{2}\leq\int_{0}^{\infty}s^{2}(A+2Bs)\exp(-As)\,\mathrm{d}s=\frac{2}{A^{2}}+\frac{12B}{A^{4}}\leq\frac{33}{16A^{2}}.

Therefore we may appeal to Kolmogorov’s inequality [durrett2019probability]*Theorem 2.5.2 (here $S_{n}$ denotes $\sum_{i=1}^{n}\tilde{t}_{i}$ ):

	$\displaystyle\mathbb{P}(\tilde{N}_{T}>8AT)$	$\displaystyle=\mathbb{P}(S_{8AT}<T)=\mathbb{P}(S_{8AT}-\mathbb{E}S_{8AT}<T-\mathbb{E}S_{8AT}$ ≤^(22)P(S_8AT- ES_8AT¡-T)
		$\displaystyle\leq\frac{1}{T^{2}}\operatorname{Var}S_{8AT}=\frac{8A}{T}\operatorname{Var}\tilde{t}_{1}\leq\frac{16}{AT}\leq\frac{C}{\sqrt{L}T}.$

To sum up, we have established that with high probability the number of partial derivative evaluations is bounded by

O(AT)=O(d^{\frac{3}{2}}L^{\frac{5}{4}}m^{-\frac{1}{2}}T^{\frac{3}{2}})=O\Bigl{(}d^{\frac{3}{2}}\kappa^{2}\bigl{(}\log^{\frac{3}{2}}\frac{1}{\varepsilon}+\log^{\frac{3}{2}}\chi^{2}(\mu_{0}\,\|\,\mu)\bigr{)}\Bigr{)}.\qed

3 Discussion

We establish non-asymptotic complexity bounds for the zigzag sampling algorithm. While we focus on zigzag sampler in this work, we expect that similar analysis for other PDMPs [bouchard2018bouncy, vanetti2017piecewise, michel2014generalized, bierkens2020boomerang] can be carried out. We leave these for future research.

We admit that our warm-start requirement (6) may be stringent. We observe that (6) implicitly requires the condition number $\kappa$ to be much smaller than $d$ , as otherwise, if $\kappa\sim d$ , (6) requires $\chi^{2}(\mu_{0}\,\|\,\mu)=O(1)$ which is unrealistic. Corollary 1.4 essentially requires $\kappa\ll d^{\frac{4}{9}}$ for the analysis to hold. This restriction on condition number is not completely unexpected since the zigzag sampler does perform poorly for highly anisotropic densities (see for example numerical results in [michel2014generalized]).

A major issue of the warm-start assumption comes from our choice of $\chi^{2}$ divergence, rather than total variation, $2$ -Wasserstein distance, or KL divergence as in previous works for non-asymptotic analysis of sampling algorithms. In particular, if we choose the initial condition

\,\mathrm{d}\mu_{0}(x)=(\dfrac{L}{2\pi})^{\frac{d}{2}}\exp(-\frac{L|x|^{2}}{2})\,\mathrm{d}x,

(23)

as in previous works, then for $U(x)=\frac{m|x|^{2}}{2}$ , we have

	$\displaystyle\chi^{2}(\mu_{0}\,\\|\,\mu)$	$\displaystyle=Z(\frac{L}{2\pi})^{d}\int_{\mathbb{R}^{d}}\exp(-L\|x\|^{2}+U(x))\,\mathrm{d}x-1$
		$\displaystyle=\kappa^{\frac{d}{2}}(\frac{L}{2\pi})^{\frac{d}{2}}\int_{\mathbb{R}^{d}}\exp\bigl{(}-(L-\frac{m}{2})\|x\|^{2}\bigr{)}\,\mathrm{d}x-1=\kappa^{\frac{d}{2}}(\frac{L}{2L-m})^{\frac{d}{2}}-1,$

which violates (6). On the other hand, for the same choice of $\mu_{0}$ , as long as $U$ satisfies Assumption 1, one can estimate

\displaystyle\mathrm{KL}(\mu_{0}\,\|\,\mu)=(\dfrac{L}{2\pi})^{\frac{d}{2}}\int_{\mathbb{R}^{d}}\Bigl{(}\frac{d}{2}\log\frac{L}{2\pi}+\log Z-\frac{L}{2}|x|^{2}+U(x)\Bigr{)}\exp(-\frac{L}{2}|x|^{2})\,\mathrm{d}x\leq\frac{d}{2}\log\kappa.

This means $\log\mathrm{KL}(\mu_{0}\,\|\,\mu)$ , and consequently the logarithm of total variation or $2$ -Wasserstein distances are much smaller than any algebraic power of $d$ , making it suitable for initialization. We hope the following conjecture is true:

Conjecture 1.

Under Assumption 1, there exists a universal constant $K$ independent of all parameters, such that for any initial density $\bar{\mu}_{0}$ , the zigzag process with friction parameter $\lambda=\sqrt{L}$ satisfies

\mathrm{KL}(\rho(X_{T},V_{T})\,\|\,\bar{\mu})\leq K\exp\bigl{(}-\frac{m}{K\sqrt{L}}T\bigr{)}\,\mathrm{KL}(\bar{\mu}_{0}\,\|\,\bar{\mu}).

If this is indeed true, we can establish the convergence in KL divergence of the pure zigzag sampler using a feasible start, without using LMC for initialization.

Another interesting open question is whether one can find a tighter upper bound than Step LABEL:state:magrate of Algorithm 1 in order to reduce the computational complexity, since it magnifies the proposed bouncing rates by $O(\sqrt{d})$ . The following lemma, which might be of independent interest, provides a concentration bound for $|\partial_{x_{i}}U|$ so that we might be able to give up a small probability to obtain a much sharper bouncing rate control.

Lemma 3.1.

Let $U(x)$ satisfy Assumption 1, then for any $c>0$ ,

\mathbb{P}_{\mu}\Bigl{(}|\partial_{x_{i}}U|\geq 2\sqrt{L}+2c\sqrt{L}\log d\Bigr{)}\leq 3d^{-c}.

(24)

The proof of this lemma, deferred to the appendix, is inspired by [lee2020logsmooth], which uses the following Brascamp-Lieb inequality [brascamp2002extensions]:

Lemma 3.2.

Let $U(x)$ satisfy Assumption 1, then for any $g\in H^{1}(\mu)$ ,

\operatorname{Var}_{\mu}g\leq\int_{\mathbb{R}^{d}}\nabla g(\nabla^{2}U)^{-1}\nabla g\,\mathrm{d}\mu.

(25)

With Lemma 3.1, it might be possible to improve Algorithm 1 while surrendering a small probability by replacing Step LABEL:state:magrate with $\mathbb{P}(\tau_{i}\geq s)=\exp(-cs\sqrt{L}|v_{i}|\log d)$ since $(v_{i}\partial_{x_{i}}U(x+vs))_{+}\leq c\sqrt{L}|v_{i}|\log d$ with high probability. This motivates the following conjecture:

Conjecture 2.

Under the Assumption 1, for any $\kappa$ and $\log\frac{1}{\varepsilon}$ that are both smaller than some algebraic power of $d$ , there exists an algorithm that gives a random variable $X$ such that

\chi^{2}(\rho(X)\,\|\,\mu)\leq\varepsilon.

(26)

Moreover, with high probability, the algorithm requires $O\Bigl{(}d\kappa\log d\bigl{(}\log\frac{1}{\varepsilon}+\log\chi^{2}(\mu_{0}\,\|\,\mu)\bigr{)}\Bigr{)}$ evaluations of partial derivatives of $U$ .

Unfortunately there are several difficulties for proving the conjecture. One is that although $\partial_{x_{i}}U$ does not exceed $O(\log d)$ with high probability, we are unable to control the partial derivatives for a trajectory of the zigzag process. Another issue is that since some trajectories of the zigzag process may go to regions with partial derivatives exceeding $O(\log d)$ , we do not always simulate the exact trajectories, which introduces bias in the sampling.

Acknowledgment. This work is supported in part by National Science Foundation via grants CCF-1910571 and DMS-2012286. We would like to thank Murat Erdogdu for pointing us to their complexity analysis of Langevin Monte Carlo in chi-square divergence [erdogdu2020convergence] to remove the warm start assumptions.

Appendix A Proof of Lemma 2.1

Proof.

Let $\lambda(t)=V_{t}\cdot\nabla_{x}U(X_{t})$ . If no bouncing happens, then

\dfrac{\,\mathrm{d}}{\,\mathrm{d}t}\lambda(t)=V_{t}^{\top}\nabla_{x}^{2}U(X_{t})V_{t}\leq L\lvert V_{t}\rvert^{2}.

In addition, $\lambda(t)$ decreases when bouncing happens, since there is some positive $V_{t}^{(i)}\partial_{x_{i}}U(X_{t})$ being changed to $-V_{t}^{(i)}\partial_{x_{i}}U(X_{t})$ while $X_{t}$ and other $V_{t}^{(j)}$ ’s remain unchanged. Therefore, since $\lvert V_{t}\rvert$ does not change between refreshments, we have for any $t\in(0,T_{k+1}-T_{k})$ , ²²2We remark here that $\lambda(t)$ is not well-defined at the bouncing times. Nevertheless, (27) still makes sense since $\lambda(t)$ decreases at the bouncing events, and since we only use (27) in the time integral sense, this will not cause any problem.

\lambda(T_{k}+t)\leq\lambda(T_{k})+tL\lvert V_{T_{k}}\rvert^{2}.

(27)

Notice for a convex function $U(x)$ that satisfy Assumption 1, we have by co-coercivity

\lvert\nabla U(x)\rvert^{2}\leq 2LU(x),

therefore for any $t\in[0,T_{k+1}-T_{k})$ , and any $\alpha>0$ ,

$\displaystyle U(X_{T_{k}+t})$	$\displaystyle=U(X_{T_{k}})+\int_{0}^{t}\lambda(T_{k}+\tau)\,\mathrm{d}\tau$	(28)
	$\displaystyle\leq U(X_{T_{k}})+t\lambda(T_{k})+\dfrac{Lt^{2}}{2}\lvert V_{T_{k}}\rvert^{2}$
	$\displaystyle\leavevmode\kern-46.00009pt\mathrel{\mathop{\leq}\limits^{\eqref{eqn:condV1},\eqref{eqn:condV2}}}U(X_{T_{k}})+t\Bigl{(}\frac{d}{\sqrt{L}T}\Bigr{)}^{1/2}\lvert\nabla U(X_{T_{k}})\rvert+2Lt^{2}d$
	$\displaystyle\leq U(X_{T_{k}})+t\Bigl{(}\frac{2d\sqrt{L}}{T}\Bigr{)}^{1/2}\sqrt{U(X_{T_{k}})}+2Lt^{2}d$
	$\displaystyle\leq(1+\alpha)U(X_{T_{k}})+d\sqrt{L}t^{2}(\dfrac{1}{\sqrt{2}T\alpha}+2\sqrt{L}).$

In particular,

U(X_{T_{k+1}})\leq(1+\alpha)U(X_{T_{k}})+d\sqrt{L}t_{k+1}^{2}(\dfrac{1}{\sqrt{2}T\alpha}+2\sqrt{L}).

Choosing $\alpha=\frac{1}{\sqrt{L}T}$ , we have

U(X_{T_{k+1}})\leq(1+\alpha)U(X_{T_{k}})+CdLt_{k+1}^{2}.

Now we apply the above formula iteratively and derive

	$\displaystyle U(X_{T})$	$\displaystyle\leq(1+\alpha)^{N+1}U(X_{0})+CLd\sum_{k=1}^{N+1}(1+\alpha)^{N-k+1}t_{k}^{2}$
		$\displaystyle\leq(1+\alpha)^{N+1}\Bigl{(}U(X_{0})+CLd\sum_{k=1}^{N+1}t_{k}^{2}\Bigr{)}$
		≤^(14d),(14e)CLTd.

Here we used $\alpha=\frac{1}{\sqrt{L}T}=O(\frac{1}{N})$ so $(1+\alpha)^{N+1}=O(1)$ , which is true due to (14a), and that $\kappa\leq\sqrt{L}T$ , which is true with our choice of $T$ in (9). ∎

Appendix B Proof of Lemma 2.2

Proof.

Let $\Xi=\sum_{k=1}^{N+1}t_{k}^{2}$ . By properties of the Poisson process [durrett1999essentials], if we condition on $N$ , the distribution of $T_{1},T_{2},\cdots,T_{N}$ has the same joint distribution as that of $N$ i.i.d. random variables uniformly distributed in $(0,T)$ . This means

\mathbb{E}(\Xi\mid N)=\dfrac{N!}{T^{N}}\int_{t_{1}+\cdots+t_{N}<T}\Bigl{(}\sum_{k=1}^{N}t_{k}^{2}+\bigl{(}T-\sum_{k=1}^{N}t_{k}\bigr{)}^{2}\Bigr{)}\,\mathrm{d}t_{N}\cdots\,\mathrm{d}t_{1}.

(29)

To calculate $\mathbb{E}(\Xi\mid N)$ , let us define

I_{1}(N,T)=\int_{t_{1}+\cdots+t_{N}<T}\Bigl{(}\sum_{k=1}^{N}t_{k}^{2}+\bigl{(}T-\sum_{k=1}^{N}t_{k}\bigr{)}^{2}\Bigr{)}\,\mathrm{d}t_{N}\cdots\,\mathrm{d}t_{1}

and compute $I_{1}(N,T)$ by induction in $N$ . For $N=0$ , as the sum contains only one term, $I_{1}(0,T)=T^{2}$ . An easy calculation shows that $I_{1}(1,T)=\frac{2}{3}T^{3}$ . We will show in general

I_{1}(N,T)=\frac{2(N+1)}{(N+2)!}T^{N+2}.

(30)

Indeed, suppose (30) holds for $N-1$ , we want to prove (30) for $N$ , the starting point of which is the following observation:

I_{1}(N,T)=\int_{t_{1}+\cdots+t_{N}<T}t_{1}^{2}\,\mathrm{d}t_{N}\cdots\,\mathrm{d}t_{1}+\int_{0}^{T}I_{1}(N-1,T-t_{1})\,\mathrm{d}t_{1}.

The first integral can be treated by integrating the variables one by one, from $t_{N}$ to $t_{N-1}$ and then $t_{N-2}$ , etc.

$\displaystyle\int_{t_{1}+\cdots+t_{N}<T}t_{1}^{2}\,\mathrm{d}t_{N}\cdots\,\mathrm{d}t_{1}$	$\displaystyle=\int_{t_{1}+\cdots+t_{N-1}<T}t_{1}^{2}(T-t_{1}-\cdots-t_{N-1})\,\mathrm{d}t_{N-1}\cdots\,\mathrm{d}t_{1}$	(31)
	$\displaystyle=\frac{1}{2}\int_{t_{1}+\cdots+t_{N-2}<T}t_{1}^{2}(T-t_{1}-\cdots-t_{N-2})^{2}\,\mathrm{d}t_{N-2}\cdots\,\mathrm{d}t_{1}$
	$\displaystyle=\cdots$
	$\displaystyle=\dfrac{1}{(N-1)!}\int_{0}^{T}t_{1}^{2}(T-t_{1})^{N-1}\,\mathrm{d}t_{1}=\dfrac{2}{(N+2)!}T^{N+2}.$

By the induction assumption (30) for $N-1$ we have

\int_{0}^{T}I_{1}(N-1,T-t_{1})\,\mathrm{d}t_{1}=\int_{0}^{T}\dfrac{2N}{(N+1)!}(T-t_{1})^{N+1}\,\mathrm{d}t_{1}=\dfrac{2N}{(N+2)!}T^{N+2}.

Combining above with (31) we finish the proof for $N$ . Therefore

\mathbb{E}(\Xi\mid N)=\dfrac{N!}{T^{N}}\dfrac{2(N+1)T^{N+2}}{(N+2)!}=\dfrac{2T^{2}}{N+2}.

The full expectation $\mathbb{E}\Xi$ follows as $N$ is a Poisson random variable

	$\displaystyle\mathbb{E}\Xi$	$\displaystyle=\sum_{n=0}^{\infty}\mathbb{E}(\Xi\mid N=n)\mathbb{P}(N=n)$
		$\displaystyle=\sum_{n=0}^{\infty}\dfrac{2T^{2}}{n+2}\dfrac{(\sqrt{L}T)^{n}}{n!}e^{-\sqrt{L}T}$
		$\displaystyle=2T^{2}e^{-\sqrt{L}T}\sum_{n=0}^{\infty}\Bigl{(}\dfrac{(\sqrt{L}T)^{n}}{(n+1)!}-\dfrac{(\sqrt{L}T)^{n}}{(n+2)!}\Bigr{)}$
		$\displaystyle=\dfrac{2T}{\sqrt{L}}-\frac{2}{L}+\frac{2e^{-\sqrt{L}T}}{L}\leq\frac{2T}{\sqrt{L}}.$

To get the desired estimate, we apply Chebyshev’s inequality using the second moment. By the same arguments leading towards (29), we have

\mathbb{E}(\Xi^{2}\mid N)=\dfrac{N!}{T^{N}}\int_{t_{1}+\cdots+t_{N}<T}\Bigl{(}\sum_{k=1}^{N}t_{k}^{2}+(T-\sum_{k=1}^{N}t_{k})^{2}\Bigr{)}^{2}.

Denote

I_{2}(N,T)=\int_{t_{1}+\cdots+t_{N}<T}\Bigl{(}\sum_{k=1}^{N}t_{k}^{2}+(T-\sum_{k=1}^{N}t_{k})^{2}\Bigr{)}^{2}.

Using the same induction argument as the proof of (30), we can prove

I_{2}(N,T)=\dfrac{4(N+1)(N+6)}{(N+4)!}T^{N+4}.

This can be easily verified for $N=0,1$ and the induction follows form the calculation:

	$\displaystyle I_{2}(N,T)$	$\displaystyle=\int_{t_{1}+\cdots+t_{N}<T}t_{1}^{4}+\int_{0}^{T}I_{2}(N-1,T-t_{1})\,\mathrm{d}t_{1}+2\int_{0}^{T}t_{1}^{2}I_{1}(N-1,T-t_{1})\,\mathrm{d}t_{1}$
		$\displaystyle=\dfrac{1}{(N-1)!}\int_{0}^{T}t_{1}^{4}(T-t_{1})^{N-1}\,\mathrm{d}t_{1}+\dfrac{4N(N+5)}{(N+3)!}\int_{0}^{T}(T-t_{1})^{N+3}\,\mathrm{d}t_{1}$
		$\displaystyle\qquad+\dfrac{4N}{(N+1)!}\int_{0}^{T}t_{1}^{2}(T-t_{1})^{N+1}\,\mathrm{d}t_{1}$
		$\displaystyle=\dfrac{4(N+1)(N+6)}{(N+4)!}T^{N+4}.$

This shows $\mathbb{E}(\Xi^{2}\mid N)=\frac{N!}{T^{N}}I_{2}(N,T)=\frac{4(N+6)}{(N+2)(N+3)(N+4)}T^{4}$ , and therefore

	$\displaystyle\mathbb{E}\Xi^{2}$	$\displaystyle=\sum_{n=0}^{\infty}\mathbb{E}(\Xi^{2}\mid N=n)\mathbb{P}(N=n)$
		$\displaystyle=\sum_{n=0}^{\infty}\dfrac{4T^{4}(n+6)}{(n+2)(n+3)(n+4)}\dfrac{(\sqrt{L}T)^{n}}{n!}e^{-\sqrt{L}T}$
		$\displaystyle=4T^{4}e^{-\sqrt{L}T}\sum_{n=0}^{\infty}\Bigl{(}\dfrac{(\sqrt{L}T)^{n}}{(n+2)!}-6\dfrac{(\sqrt{L}T)^{n}}{(n+4)!}\Bigr{)}$
		$\displaystyle=\frac{4T^{2}}{L}-\frac{24}{L^{2}}+8e^{-\sqrt{L}T}(\frac{T^{2}}{L}+\frac{3T}{L^{\frac{3}{2}}}+\frac{3}{L^{2}}).$

This means

\mathbb{E}(\Xi-\mathbb{E}\Xi)^{2}=\frac{8T}{L^{\frac{3}{2}}}-\frac{28}{L^{2}}+8e^{-\sqrt{L}T}(\frac{T^{2}}{L}+\frac{2T}{L^{\frac{3}{2}}}+\frac{4}{L^{2}}-\frac{4e^{-\sqrt{L}T}}{L^{2}})\leq\frac{8T}{L^{\frac{3}{2}}},

where the inequality above holds for $\sqrt{L}T$ larger than some universal constant (which we would assume as it is the interesting parameter regime).

Finally, to conclude the proof, we apply Chebyshev inequality to estimate the failure probability as

\mathbb{P}(\Xi\geq\frac{4T}{\sqrt{L}})\leq\mathbb{P}(\Xi-\mathbb{E}\Xi\geq\frac{2T}{\sqrt{L}})\leq\frac{L\mathbb{E}(\Xi-\mathbb{E}\Xi)^{2}}{4T^{2}}\leq\frac{2}{\sqrt{L}T}.\qed

Appendix C Proof of Lemma 3.1

Proof.

The first step is to show that

\mathbb{E}_{\mu}|\partial_{x_{i}}U|\leq\sqrt{L}.

(32)

This is straightforward, since using integration by parts,

\mathbb{E}_{\mu}|\partial_{x_{i}}U|^{2}=\int_{\mathbb{R}^{d}}(\partial_{x_{i}}U)^{2}\,\mathrm{d}\mu=\int_{\mathbb{R}^{d}}\partial_{x_{i}x_{i}}U\,\mathrm{d}\mu\leq L,

(33)

and (32) then follows from Cauchy-Schwarz inequality.

The next step is to establish a concentration bound. Let $G(x)=\psi(\partial_{x_{i}}U)$ , where $\psi(a)=\psi(|a|)$ is a smooth nonnegative increasing function satisfying

\psi(0)=\psi^{\prime}(0)=0,\ \psi(a)=|a|\ \text{ for }|a|\geq 1,\ \text{and }|\psi^{\prime}(a)|\leq 2,

and $g(x)=\exp(\frac{1}{2}\lambda G(x))$ . By the construction of $G$ , we have

\mathbb{E}_{\mu}G=\mathbb{E}_{\mu}\psi(\partial_{x_{i}}U)\leq 2\mathbb{E}_{\mu}|\partial_{x_{i}}U|\leq 2\sqrt{L}.

(34)

Then $\nabla g(x)=\frac{\lambda}{2}\psi^{\prime}(\partial_{x_{i}}U)\nabla(\partial_{x_{i}}U)g(x)$ . By Lemma 3.2 for $g(x)$ , we have

	$\displaystyle\mathbb{E}_{\mu}\exp(\lambda G)$	$\displaystyle-\bigl{(}\mathbb{E}_{\mu}\exp\bigl{(}\dfrac{\lambda G}{2}\bigr{)}\bigr{)}^{2}=\operatorname{Var}_{\mu}g(x)$
		$\displaystyle\leq\dfrac{\lambda^{2}}{4}\int_{\mathbb{R}^{d}}(\psi^{\prime}(\partial_{x_{i}}U))^{2}\nabla(\partial_{x_{i}}U)(\nabla^{2}U)^{-1}\nabla(\partial_{x_{i}}U)g^{2}(x)\,\mathrm{d}\mu$
		$\displaystyle\leq\lambda^{2}\int_{\mathbb{R}^{d}}\nabla(\partial_{x_{i}}U)(\nabla^{2}U)^{-1}\nabla(\partial_{x_{i}}U)g^{2}(x)\,\mathrm{d}\mu$
		$\displaystyle=\lambda^{2}\int_{\mathbb{R}^{d}}\partial_{x_{i}x_{i}}Ug^{2}(x)\,\mathrm{d}\mu\leq\lambda^{2}L\mathbb{E}_{\mu}\exp(\lambda G).$

Thus for $\lambda\leq\frac{1}{2\sqrt{L}}$ we have

\mathbb{E}_{\mu}\exp(\lambda G)\leq\dfrac{1}{1-\lambda^{2}L}\bigl{(}\mathbb{E}_{\mu}\exp(\dfrac{\lambda G}{2})\bigr{)}^{2}.

(35)

Now we use (35) recursively, and we obtain for $H(\lambda):=\mathbb{E}_{\mu}\exp(\lambda G)$ ,

H(\lambda)\leq\prod_{k=0}^{\infty}\Bigl{(}\dfrac{1}{1-\frac{\lambda^{2}L}{4^{k}}}\Bigr{)}^{2^{k}}\lim_{\ell\to\infty}H(\dfrac{\lambda}{\ell})^{\ell}.

(36)

Notice

\lim_{\ell\to\infty}H(\dfrac{\lambda}{\ell})^{\ell}=\lim_{\ell\to\infty}\Bigl{(}\mathbb{E}_{\mu}\exp(\dfrac{\lambda G}{\ell})\Bigr{)}^{\ell}=\lim_{\ell\to\infty}\Bigl{(}1+\mathbb{E}_{\mu}\dfrac{\lambda G}{\ell}\Bigr{)}^{\ell}=\exp(\lambda\mathbb{E}_{\mu}G).

(37)

Moreover, by [bobkov1997poincare]*Proposition 4.1,

\prod_{k=0}^{\infty}\Bigl{(}\dfrac{1}{1-\frac{\lambda^{2}L}{4^{k}}}\Bigr{)}^{2^{k}}\leq\frac{1+\lambda\sqrt{L}}{1-\lambda\sqrt{L}}.

(38)

Substituting (37) and (38) into (36), we obtain

H(\lambda)\leq\dfrac{1+\lambda\sqrt{L}}{1-\lambda\sqrt{L}}\exp(\lambda\mathbb{E}_{\mu}G).

Finally, combining the above exponential moment bound of $G$ with Chebyshev inequality, we get

\mathbb{P}_{\mu}\Bigl{(}G(x)\geq\mathbb{E}_{\mu}G+r\Bigr{)}\leq\exp(-\lambda r)\dfrac{1+\lambda\sqrt{L}}{1-\lambda\sqrt{L}}.

Now take $\lambda=1/2\sqrt{L}$ , and $r=2c\sqrt{L}\log d$ , and using (34) (noticing $G(x)=|\partial_{x_{i}}U|$ when $G(x)\geq r$ since $r\geq 1$ ), we arrive at

\mathbb{P}_{\mu}\Bigl{(}|\partial_{x_{i}}U|\geq 2\sqrt{L}+2c\sqrt{L}\log d\Bigr{)}\leq 3d^{-c}.\qed

	$\displaystyle\chi^{2}(\rho(X_{T},V_{T})\,\\|\,\bar{\mu})$	$\displaystyle=\int_{\mathbb{R}^{d}\times\mathbb{R}^{d}}\Bigl{(}\frac{\,\mathrm{d}\rho(X_{T},V_{T})}{\,\mathrm{d}\bar{\mu}}\Bigr{)}^{2}\,\mathrm{d}\bar{\mu}(x,v)-1$
		$\displaystyle=\int_{\mathbb{R}^{d}}\Bigl{(}\frac{\,\mathrm{d}\rho(X_{T})}{\,\mathrm{d}\mu}\Bigr{)}^{2}\Bigl{(}\int_{\mathbb{R}^{d}}\Bigl{(}\frac{\,\mathrm{d}\rho(V_{T}\mid X_{T})}{\,\mathrm{d}\nu(v)}\Bigr{)}^{2}\,\mathrm{d}\nu(v)\Bigr{)}\,\mathrm{d}\mu(x)-1$
		$\displaystyle=\int_{\mathbb{R}^{d}}\Bigl{(}\frac{\,\mathrm{d}\rho(X_{T})}{\,\mathrm{d}\mu}\Bigr{)}^{2}\Bigl{(}1+\chi^{2}\bigl{(}\rho(V_{T}\mid X_{T})\,\\|\,\nu\bigr{)}\Bigr{)}\,\mathrm{d}\mu(x)-1$
		$\displaystyle\geq\int_{\mathbb{R}^{d}}\Bigl{(}\frac{\,\mathrm{d}\rho(X_{T})}{\,\mathrm{d}\mu}\Bigr{)}^{2}\,\mathrm{d}\mu(x)-1=\chi^{2}(\rho(X_{T})\,\\|\,\mu).$

Complexity of zigzag sampling algorithm for strongly log-concave distributions

Abstract

1 Introduction and Main Results

1.1 Algorithm and Assumptions

Assumption 1.

Proposition 1.1.

Assumption 2.

Remark 1.2.

1.2 Main Results

Theorem 1.

Remark 1.3.

Corollary 1.4.

Proof.

1.3 Previous Works

2 Strategy of the Proof

Lemma 2.1.

Lemma 2.2.

Proof of Theorem 1.

3 Discussion

Conjecture 1.

Lemma 3.1.

Lemma 3.2.

Conjecture 2.

Appendix A Proof of Lemma 2.1

Proof.

Appendix B Proof of Lemma 2.2

Proof.

Appendix C Proof of Lemma 3.1

Proof.

References