This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

An Introduction to Hamiltonian Monte Carlo Method for Sampling

Nisheeth K. Vishnoi
Abstract

The goal of this article is to introduce the Hamiltonian Monte Carlo method – a Hamiltonian dynamics inspired algorithm for sampling from a Gibbs density π(x)ef(x)\pi(x)\propto e^{-f(x)}. We focus on the “idealized” case, where one can compute continuous trajectories exactly. We show that idealized HMC preserves π\pi and we establish its convergence when ff is strongly convex and smooth.

1 Introduction

Many fundamental tasks in disciplines such as machine learning (ML), optimization, statistics, theoretical computer science, and molecular dynamics rely on the ability to sample from a probability distribution. Sampling is used both as a tool to select training data and to generate samples and infer statistics from models such as highly multi-class classifiers, Bayesian networks, Boltzmann machines, and GANs; see [GBW14, SM08, KF09, GBC16]. Sampling is also useful to provide robustness to optimization methods used to train deep networks by allowing them to escape local minima/saddle points, and to prevent them from overfitting [WT11, DPG+14, SM08]. Sampling methods are often the key to solve integration, counting, and volume computation problems that are rampant in various applications at the intersection of sciences and ML [DFK91]. In chemistry and molecular biology sampling is used to estimate reaction rates and simulate molecular dynamics which, in turn, is used for discovering new materials and drugs [CCHK89, RMZT02, ZRC13]. In finance, sampling is used to optimize expected return of portfolios [DGR03]. Sampling algorithms are also used to solve partial differential equations in applications such as geophysics [CGST11].

Mathematically, several of the above applications reduce to the following problem: given a function f:d,f:\mathbb{R}^{d}\rightarrow\mathbb{R}, generate independent samples from the distribution π(x)ef(x)\pi(x)\propto e^{-f(x)}, referred to as the Gibbs or Boltzmann distribution. Markov chain Monte Carlo (MCMC) is one of the most influential frameworks to design algorithms to sample from Gibbs distributions [GBC16]. Examples of MCMC algorithms include Random Walk Metropolis (RWM), ball walk, Gibbs samplers, Grid walks [DFK91], Langevin Monte Carlo [RR98], and a physics-inspired meta-algorithm – Hamiltonian Monte Carlo (HMC) [DKPR87] – that subsumes several of the aforementioned methods as its special cases. HMC was first discovered by physicists [DKPR87], and was adopted with much success in ML [Nea92, War97, CFG14, BG15], and is currently the main algorithm used in the popular software package Stan [CGH+16].

HMC algorithms are inspired from a physics viewpoint and reduce the problem of sampling from the Gibbs distribution of ff to simulating the trajectory of a particle in d\mathbb{R}^{d} according to the laws of Hamiltonian dynamics where ff plays the role of a “potential energy” function. We show here that the physical laws that govern HMC also endow it with the ability to take long steps, at least in the setting where one can simulate Hamiltonian dynamics exactly. Moreover, we present sophisticated discretizations of continuous trajectories of Hamiltonian dynamics that can lead to sampling algorithms for Gibbs distribution for “regular-enough” distributions.

2 Related Algorithms for Sampling

We give a brief overview of a number of different algorithms available for sampling from continuous distributions. In a typical MCMC algorithm, one sets up a Markov chain whose domain includes that of ff and whose transition function determines the motion of a particle in the target distribution’s domain.

Traditional algorithms: Ball Walk, Random Walk Metropolis.

In the ball walk and related random walk metropolis (RWM) algorithms, each step of the Markov chain is computed by proposing a step X~i+1=Xi+ηvi\tilde{X}_{i+1}=X_{i}+\eta v_{i}, where viv_{i} is uniformly distributed on the unit ball for the ball walk and standard multivariate Gaussian for RWM, and η>0\eta>0 is the “step size”. The proposal is then passed through a Metropolis filter, which accepts the proposal with probability min(π(X~i+1)π(X~i),1)\min\left(\frac{\pi(\tilde{X}_{i+1})}{\pi(\tilde{X}_{i})},1\right), ensuring that the target distribution π\pi is a stationary distribution of the Markov chain.

Some instances and applications of these algorithms include sampling from and computing the volume of a polytope [DFK91, AK91, LS90, LS92, LS93, LV03] as well as sampling from a non-uniform distribution π\pi and the related problem of integrating functions [GGR97, LV06]. One advantage of these algorithms is that they only require a membership oracle for the domain and a zeroth-order oracle for ff. On the other hand, the running times of these algorithms are sub-optimal in situations where one does have access to information such as the gradient of ff, since they are unable to make use of this additional higher-order information. To ensure a high acceptance probability, the step size η\eta in the RWM or ball walk algorithms must be chosen to be sufficiently small so that π(Xi)\pi(X_{i}) does not change too much at any given step. For instance, if π\pi is N(0,Id)N(0,I_{d}), then the optimal choice of step size is roughly η1d\eta\approx\frac{1}{\sqrt{d}}, implying that ηvi1\|\eta v_{i}\|\approx 1 with high probability [GGR97]. Therefore, since most of the probability of a standard spherical Gaussian lies in a ball of radius d\sqrt{d}, one would expect all of these algorithms to take roughly (d)2=d(\sqrt{d})^{2}=d steps to explore a sufficiently regular target distribution.

Langevin algorithms.

The Langevin algorithm makes use of the gradient of ff to improve on the provable running times of the RWM and ball walk algorithms. Each update is computed as Xi+1=Xiηf(Xi)+2ηViX_{i+1}=X_{i}-\eta\nabla f(X_{i})+\sqrt{2\eta}V_{i}, where V1,V2,N(0,Id)V_{1},V_{2},\ldots\sim N(0,I_{d}). Recently, provable running time bounds have been shown for different versions of the Langevin algorithm for mm-strongly log-concave distributions with MM-Lipschitz gradient [Dal17, DM17, CCBJ18, DM17], including an O~(max(dκ,d12κ1.5)log(1ε))\tilde{O}(\max(d\kappa,d^{\frac{1}{2}}\kappa^{1.5})\log(\frac{1}{\varepsilon})) bound for getting ε\varepsilon close to the target distribution π\pi in the “total variation” (TV) metric, where κ=Mm\kappa=\frac{M}{m} [DCWY19]. This bound has been improved to roughly dκd\kappa in [LST21].

Running time bounds for Langevin in the non-convex setting have also recently been obtained in terms of isoperimetric constants for π\pi [RRT17]. For sufficiently regular target distributions, the optimal step size of the Langevin algorithm can be shown to be ηd14\eta\approx d^{-\frac{1}{4}} without Metropolis adjustment and d16d^{-\frac{1}{6}} with Metropolis adjustment [RR98, Nea11]. Since the random “momentum” terms V1,V2,V_{1},V_{2},\ldots are independent, the distance traveled by Langevin in a given number of steps is still roughly proportional to the square of its inverse numerical step size η1\eta^{-1}, meaning that the running time is at least η2=d12\eta^{-2}=d^{\frac{1}{2}} without Metropolis adjustment and η2=d13\eta^{-2}=d^{\frac{1}{3}} with the Metropolis adjustment. The fact that the “momentum” is discarded after each numerical step is therefore a barrier in further improving the running time of Langevin algorithms.

3 Hamiltonian Dynamics

We present an overview of the Hamiltonian dynamics and its properties that play a crucial role in the description and analysis of HMC. Imagine a particle of unit mass in a potential well f:d{f}:\mathbb{R}^{d}\rightarrow\mathbb{R}. If the particle has position xdx\in\mathbb{R}^{d} and velocity vdv\in\mathbb{R}^{d}, its total energy is given by the function H:2d{H}:\mathbb{R}^{2d}\rightarrow\mathbb{R} defined as

H(x,v)=f(x)+12v2.{H}(x,v)={f}(x)+\frac{1}{2}\|v\|^{2}.

This function is called the Hamiltonian of the particle. The Hamiltonian dynamics for this particle are:

dxdt=Hvanddvdt=Hx.\frac{{d}x}{{d}t}=\frac{\partial H}{\partial v}\ \ \mbox{and}\ \ \frac{{d}v}{{d}t}=-\frac{\partial H}{\partial x}.

In the case where the Hamiltonian has a simple forms such as mentioned above, these equations reduce to:

dxdt=vanddvdt=f(x).\frac{{d}x}{{d}t}=v\ \ \mbox{and}\ \ \frac{{d}v}{{d}t}=-\nabla{f}(x).

For a starting configuration (x,v)(x,v), we denote the solutions to these equations by (xt(x,v),vt(x,v))(x_{t}(x,v),v_{t}(x,v)) for t0t\geq 0. Let the “Hamiltonian flow” φt:(x,v)(xt(x,v),vt(x,v))\varphi_{t}:(x,v)\mapsto(x_{t}(x,v),v_{t}(x,v)) be the position and velocity after time tt starting from (x,v)(x,v).

Interestingly, these solutions satisfy a number of conservation properties.

  1. 1.

    Hamiltonian dynamics conserves the Hamiltonian: H(x(t),v(t))=H(x(0),v(0)){H}(x(t),v(t))={H}(x(0),v(0)) for all t0t\geq 0. To see this note that, since HH does not depend on tt explicitly (or Ht=0\frac{\partial H}{\partial t}=0),

    dHdt=i[d]dxidtHxi+dvidtHvi=0.\frac{dH}{dt}=\sum_{i\in[d]}\frac{dx_{i}}{dt}\frac{\partial H}{\partial x_{i}}+\frac{dv_{i}}{dt}\frac{\partial H}{\partial v_{i}}=0.
  2. 2.

    Hamiltonian dynamics conserves the volume in the “phase space.” Formally, let F=(dxdt,dvdt)F=\left(\frac{dx}{dt},\frac{dv}{dt}\right) be the vector field associated to the Hamiltonian in the phase space d×d.\mathbb{R}^{d}\times\mathbb{R}^{d}. First note that the divergence of FF is zero:

    divF\displaystyle\mathrm{div}F =\displaystyle= F=i[d]xidxidt+vidvidt\displaystyle\nabla\cdot F=\sum_{i\in[d]}\frac{\partial}{\partial x_{i}}\frac{dx_{i}}{dt}+\frac{\partial}{\partial v_{i}}\frac{dv_{i}}{dt}
    =\displaystyle= i[d]xiHviviHxi\displaystyle\sum_{i\in[d]}\frac{\partial}{\partial x_{i}}\frac{\partial H}{\partial v_{i}}-\frac{\partial}{\partial v_{i}}\frac{\partial H}{\partial x_{i}}
    =\displaystyle= 0.\displaystyle 0.

    Since divergence represents the volume density of the outward flux of a vector field from an infinitesimal volume around a given point, it being zero everywhere implies volume preservation. Another way to see this is that divergence is the trace of the Jacobian of the map FF, and the trace of the Jacobian is the derivative of the determinant of the Jacobian. Hence, the trace being 0 implies that the determinant of the Jacobian of FF does not change.

  3. 3.

    Hamiltonian dynamics of the form mentioned above are time reversible for t0t\geq 0:

    φt(xt(x,v),vt(x,v))=(x,v).\varphi_{t}(x_{t}(x,v),-v_{t}(x,v))=(x,-v).

The underlying geometry and conservation laws have been generalized significantly in physics and have led to the area of symplectic geometry in mathematics [DSDS08].

4 Hamiltonian Monte Carlo

To improve on the running time of the Langevin algorithms, one must find a way to take longer steps while still conserving the target distribution. Hamiltonian Monte Carlo (HMC) algorithms accomplish this by taking advantage of conservation properties of Hamiltonian dynamics. These conservation properties allow HMC to choose the momentum at the beginning of each step and simulate the trajectory of the particle for a long time. HMC is a large class of algorithms, and includes the Langevin algorithms and RWM as special cases.

Each step of the HMC Markov chain X1,X2,X_{1},X_{2},\ldots is determined by first sampling a new independent momentum ξN(0,Id)\xi\sim N(0,I_{d}), and then running Hamilton’s equations for a fixed time TT, that is Xi=xT(Xi1,ξ)X_{i}=x_{T}(X_{i-1},\xi). This is called the idealized HMC, since its trajectories are the continuous solutions to Hamilton’s equations rather than a numerical approximation.

Input: First-order oracle for f:df:\mathbb{R}^{d}\rightarrow\mathbb{R}, an initial point X0dX_{0}\in\mathbb{R}^{d}, T>0T\in\mathbb{R}_{>0}, kk\in\mathbb{N}
for i=1,,ki=1,\ldots,k do
      
      Sample a momentum ξN(0,Id)\xi\sim N(0,I_{d})
      Set (Xi,Vi)=φT(Xi1,ξ)(X_{i},V_{i})=\varphi_{T}(X_{i-1},\xi)
end for
Output XkX_{k}
Algorithm 1 Idealized Hamiltonian Monte Carlo

Despite its simplicity, popularity, and the widespread belief that HMC is faster than its competitor algorithms in a wide range of high-dimensional sampling problems [Cre88, BPR+13, Nea11, BBG14], its theoretical properties are relatively less understood compared to its older competitor MCMC algorithms, such as the Random Walk Metropolis [MPS+12] or Langevin [DM19, DMss, Dal17] algorithms. Thus, for instance, it is more difficult to tune the parameters of HMC. Several papers have have made progress in bridging this gap, showing that HMC is geometrically ergodic for a large class of problems [LBBG19, DMS17] and proving quantitative bounds for the convergence rate of the idealized HMC for special cases [SRSH14, MS18].

These analysis benefit from the observation that there are various invariants that are preserved along the Hamiltonian trajectories, which in principle obviate the need for a Metropolis step, and raise the possibility of taking very long steps (large TT). Using these invariance properties, we first show that the idealized HMC “preserves” the target density (Theorem 5.1) and subsequently give a dimension-independent bound on TT when ff is strongly convex and smooth (Theorem 6.1). While not the focus of this article, in Section 7, we discuss different numerical integrators (approximate algorithms to simulate Hamiltonian dynamics dynamics in the non-idealized setting) and bounds associated to them.

5 Stationarity: HMC Preserves the Target Density

Recall that for (x,v)d×d(x,v)\in\mathbb{R}^{d}\times\mathbb{R}^{d} and T0T\geq 0, φT(x,v)\varphi_{T}(x,v) denotes the position in the phase space of the particle moving according to the Hamiltonian dynamics with respect to a Hamiltonian H(x,v)=f(x)+12v2H(x,v)=f(x)+\frac{1}{2}\|v\|^{2}. Let μ\mu be the Lebesgue measure on d×d\mathbb{R}^{d}\times\mathbb{R}^{d} with respect to which all densities are defined.

Theorem 5.1

Let f:df:\mathbb{R}^{d}\to\mathbb{R} be a differentiable function. Let T>0T>0 be the step size of the HMC. Suppose (X,V)(X,V) is a sample from the density

π(x,v)=ef(x)12v2ef(y)12w2𝑑μ(y,w).\pi(x,v)=\frac{e^{-f(x)-\frac{1}{2}\|v\|^{2}}}{\int e^{-f(y)-\frac{1}{2}\|w\|^{2}}d\mu(y,w)}.

Then the density of φT(X,V)\varphi_{T}(X,V) is π\pi for any T0T\geq 0. Moreover the density of φT(X,ξ)\varphi_{T}(X,\xi), where ξN(0,Id)\xi\sim N(0,I_{d}) is also π\pi. Thus, the idealized HMC algorithms preserves π\pi.

The proof of this theorem heavily relies on the properties of Hamiltonian dynamics. For T0T\geq 0, let (x~,v~)=φT(x,y)(\tilde{x},\tilde{v})=\varphi_{T}(x,y). Then, time reversibility of Hamiltonian dynamics implies that

(x,v)=φT(x~,v~).(x,-v)=\varphi_{T}(\tilde{x},-\tilde{v}).

And, it follows from the preservation of Hamiltonian along trajectories that

H(x,v)=H(φT(x~,v~)).H(x,-v)=H(\varphi_{T}(\tilde{x},-\tilde{v})).

Thus,

H(x,v)\displaystyle H(x,v) =\displaystyle= f(x)+12v2\displaystyle f(x)+\frac{1}{2}\|v\|^{2}
=\displaystyle= H(x,v)\displaystyle H(x,-v)
=\displaystyle= H(φT(x~,v~))\displaystyle H(\varphi_{T}(\tilde{x},-\tilde{v}))
=\displaystyle= H(x~,v~)\displaystyle H(\tilde{x},-\tilde{v})
=\displaystyle= f(x~)+12v~2\displaystyle f(\tilde{x})+\frac{1}{2}\|-\tilde{v}\|^{2}
=\displaystyle= H(x~,v~).\displaystyle H(\tilde{x},\tilde{v}).

Thus, eH(x,v)e^{-H(x,v)}, the value of density associated to (x~,v~)(\tilde{x},\tilde{v}), is the same as eH(x~,v~)e^{-H(\tilde{x},\tilde{v})}. Let μ\mu_{*} be the pushforward of μ\mu under the map φT\varphi_{T}. The property that Hamiltonian dynamics preserves volume in phase space implies that μ=μ\mu_{*}=\mu as the determinant of the Jacobian of the map φT\varphi_{T} is 11. Thus, the density π\pi remains invariant under φT\varphi_{T}:

π~(x~,v~)=eH(x~,v~)eH(y,w)𝑑μ(y,w)=eH(x~,v~)eH(y,w)𝑑μ(y,w)=π(x~,v~).\tilde{\pi}(\tilde{x},\tilde{v})=\frac{e^{-H(\tilde{x},\tilde{v})}}{\int e^{-H(y,w)}d\mu_{*}(y,w)}=\frac{e^{-H(\tilde{x},\tilde{v})}}{\int e^{-H(y,w)}d\mu(y,w)}=\pi(\tilde{x},\tilde{v}).

To see the second part, note that the marginal density of vv drawn from π\pi is the same as that of N(0,Id)N(0,I_{d}). Thus, π\pi is an invariant density of the idealized HMC algorithm.

6 Convergence: Running Time Bound for Strongly Convex and Smooth Potentials

For densities π1\pi_{1} and π2\pi_{2} with the same base measure, the Wasserstein distance W2W_{2} is defined to be the infimum, over all joint distributions of the random variables XX and YY with marginals π1\pi_{1} and π2\pi_{2}, of the expectation of the squared Euclidean distance XY2\|X-Y\|^{2}.

Theorem 6.1

Let f:df:\mathbb{R}^{d}\to\mathbb{R} be a twice-differentiable function which satisfies mI2f(x)MImI\preceq\nabla^{2}f(x)\preceq MI. Let νk\nu_{k} be the distribution of XkX_{k} at step kk\in\mathbb{Z}^{\star} from Algorithm 1. Suppose that both ν0\nu_{0} and π\pi have mean and variance bounded by O(1)O(1). Then given any ε>0\varepsilon>0, for T=Ω(mM)T=\Omega\left(\frac{\sqrt{m}}{M}\right) and k=O((Mm)2log1ε)k=O\left((\frac{M}{m})^{2}\log\frac{1}{\varepsilon}\right), we have that W2(νk,π)εW_{2}(\nu_{k},\pi)\leq\varepsilon.

The bound in this theorem can be improved to O(Mmlog1ε)O\left(\frac{M}{m}\log\frac{1}{\varepsilon}\right); see [CV19]. To prove Theorem 6.1 we use the coupling method. In addition to the idealized HMC Markov chain XX which is initialized at some arbitrary point X0X_{0}, to prove Theorem 6.1 we also consider another “copy” Y=Y0,Y1,Y=Y_{0},Y_{1},\ldots of the idealized HMC Markov chain defined in Algorithm 1. To initialize YY we imagine that we sample a point Y0Y_{0} from the density π\pi. Since we have already shown that π\pi is a stationary density of the idealized HMC Markov chain (Theorem 5.1), YkY_{k} will preserve the distribution π\pi at every step kk\in\mathbb{Z}^{\ast}.

To show that the density of XX converges to π\pi in the Wasserstein distance, we design a coupling of the two Markov chains such that the distance between XkX_{k} and YkY_{k} contracts at each step kk. If π\pi has mean and variance bounded by O(1)O(1), and we initialize, e.g., X0=0X_{0}=0, then the we have that W2(ν0,π)=O(1)W_{2}(\nu_{0},\pi)=O(1) as well since Y0πY_{0}\sim\pi. Hence,

W2(ν0,π)𝔼[Y0X02]=𝔼[Y02]Var(Y0)+𝔼[Y0]2=O(1).W_{2}(\nu_{0},\pi)\leq\mathbb{E}[\|Y_{0}-X_{0}\|^{2}]=\mathbb{E}[\|Y_{0}\|^{2}]\leq\mathrm{Var}(Y_{0})+\|\mathbb{E}[Y_{0}]\|^{2}=O(1).

Thus, if we can find a coupling such that XkYk(1γ)Xk1Yk1\|X_{k}-Y_{k}\|\leq(1-\gamma)\|X_{k-1}-Y_{k-1}\| for each kk and some 0<γ<10<\gamma<1, we would have that

W2(νk,π)𝔼[XkYk2](1γ)2k𝔼[X0Y02]=(1γ)2k×O(1).W_{2}(\nu_{k},\pi)\leq\mathbb{E}[\|X_{k}-Y_{k}\|^{2}]\leq(1-\gamma)^{2k}\mathbb{E}[\|X_{0}-Y_{0}\|^{2}]=(1-\gamma)^{2k}\times O(1).

We define a coupling of XX and YY as follows: At each step i1i\geq 1 of the Markov chain XX we sample an initial momentum ξi1N(0,Id)\xi_{i-1}\sim N(0,I_{d}), and set (Xi,Vi)=φT(Xi1,ξi1)(X_{i},V_{i})=\varphi_{T}(X_{i-1},\xi_{i-1}). And at each step i1i\geq 1 of the Markov chain YY we sample momentum ξi1N(0,Id)\xi_{i-1}^{\prime}\sim N(0,I_{d}), and set (Yi,Vi)=φT(Yi1,ξi1)(Y_{i},V_{i})=\varphi_{T}(Y_{i-1},\xi_{i-1}^{\prime}). To couple the two Markov chains, we will give the same initial momentum ξi1ξi1\xi_{i-1}^{\prime}\leftarrow\xi_{i-1} to the Markov chain YY; see Figure 1. This coupling preserves the marginal density of YY since, in this coupling, ξi1\xi_{i-1}^{\prime} and ξi1\xi_{i-1} both have marginal density N(0,Id)N(0,I_{d}). Thus, we still have that the marginal density of YiY_{i} is π\pi at each step ii.

Refer to caption
Figure 1: Coupling two copies XX (blue) and Y{Y} (red) of idealized HMC by choosing the same momentum at every step.

Spherical harmonic oscillator.

As a simple example, we consider the spherical harmonic oscillator with potential function f(x)12xxf(x)\coloneqq\frac{1}{2}x^{\top}x. In this case the gradient is f(x)=x\nabla f(x)=x at every point xx. Define (xt,vt)φt(Xi,ξi)(x_{t},v_{t})\coloneqq\varphi_{t}(X_{i},\xi_{i}) to be the position and momentum of the Hamiltonian flow which determines the update of the Markov chain XX at each step ii, and (yt,ut)φt(Yi,ξi)(y_{t},u_{t})\coloneqq\varphi_{t}(Y_{i},\xi_{i}) to be the position and momentum of the Hamiltonian flow which determines the update of the Markov chain YY. Then we have

dvtdtdutdt=f(xt)+f(yt)=ytxt.\frac{{d}v_{t}}{{d}t}-\frac{{d}u_{t}}{{d}t}=-\nabla f(x_{t})+\nabla f(y_{t})=y_{t}-x_{t}.

Thus, the difference between the force on the particle at xtx_{t} and the particle at yty_{t} points in the direction of the particle at yty_{t}. This means that, in the case of the spherical harmonic oscillator, after a sufficient amount of time tt the particle xtx_{t} will reach the point 0 and we will have xtyt=0x_{t}-y_{t}=0.

We can solve for the trajectory of xtytx_{t}-y_{t} exactly. We have

d2(xtyt)dt2=(xtyt).\frac{{d}^{2}(x_{t}-y_{t})}{{d}t^{2}}=-(x_{t}-y_{t}). (1)

Since the two Markov chains are coupled such that the particles xx and yy have the same initial momentum, v0=u0v_{0}=u_{0}, the initial conditions for the ODE in Equation (1) are d(xtyt)dt=v0u0=0\frac{d(x_{t}-y_{t})}{dt}=v_{0}-u_{0}=0. Therefore, the solution to this ODE is

xtyt=cos(t)×(x0y0).x_{t}-y_{t}=\cos(t)\times(x_{0}-y_{0}). (2)

Hence, after time t=π2t=\frac{\pi}{2} we have xtyt=0x_{t}-y_{t}=0.

General harmonic oscillator.

More generally, we can consider a harmonic oscillator f(x)=j=1dcjxj2f(x)=\sum_{j=1}^{d}c_{j}x_{j}^{2}, where mcjMm\leq c_{j}\leq M. In this case the gradient is f(x)=2Cx\nabla f(x)=2Cx, where CC is the diagonal matrix with jj-th diagonal entry cjc_{j}.

In this case, we have

dvtdtdutdt=f(xt)+f(yt)=2C(ytxt).\frac{{d}v_{t}}{{d}t}-\frac{{d}u_{t}}{{d}t}=-\nabla f(x_{t})+\nabla f(y_{t})=2C(y_{t}-x_{t}).

Thus, since cj>0c_{j}>0 for all jj, the difference dvtdtdutdt\frac{{d}v_{t}}{{d}t}-\frac{{d}u_{t}}{{d}t} between the force on the particle at force on the particle at xtx_{t} and the particle at yty_{t} is has a component in the direction ytxty_{t}-x_{t}. This means that, for small values of tt, two particles will move towards each other at a rate of roughly

1ytxt(ytxt)(2C)(ytxt)t=1ytxt2C(ytxt)2t,\frac{1}{\|y_{t}-x_{t}\|}(y_{t}-x_{t})^{\top}(2C)(y_{t}-x_{t})t=\frac{1}{\|y_{t}-x_{t}\|}\|\sqrt{2C}(y_{t}-x_{t})\|^{2}t,

since the initial velocities v0v_{0} and u0u_{0} of the two particles are equal. However, unless CC is a multiple of the identity matrix, the difference between the forces also has a component orthogonal to ytxty_{t}-x_{t}. Thus, unlike in the case of the spherical harmonic oscillator, the distance between the two particles will not in general contract to zero for any value of tt.

As in the case of the spherical harmonic oscillator, we can compute the distance between the two particles at any value of tt by solving for the trajectory xtytx_{t}-y_{t} exactly. Here we have

d2(xtyt)dt2=2C(xtyt)\frac{{d}^{2}(x_{t}-y_{t})}{{d}t^{2}}=-2C(x_{t}-y_{t}) (3)

with initial conditions d(xtyt)dt=0\frac{d(x_{t}-y_{t})}{dt}=0. Since CC is diagonal, the ODE is separable along the coordinate directions, and we have, for all j[d]j\in[d]

d2(xt[j]yt[j])dt2=2cj(xt[j]yt[j])\frac{{d}^{2}(x_{t}[j]-y_{t}[j])}{{d}t^{2}}=-2c_{j}(x_{t}[j]-y_{t}[j]) (4)

with initial conditions d(xt[j]yt[j])dt=0\frac{d(x_{t}[j]-y_{t}[j])}{dt}=0. Therefore, xt[j]yt[j]=cos(2cjt)×(x0[j]y0[j])x_{t}[j]-y_{t}[j]=\cos(\sqrt{2c_{j}}t)\times(x_{0}[j]-y_{0}[j]). Hence, since cjMc_{j}\leq M, the distance xt[j]yt[j]x_{t}[j]-y_{t}[j] will contract up to at least time T=π2×12MT=\frac{\pi}{2}\times\frac{1}{\sqrt{2M}}. In particular, for any jj such that cj=Mc_{j}=M after time T=π2×12MT=\frac{\pi}{2}\times\frac{1}{\sqrt{2M}} we have that xT[j]yT[j]=0x_{T}[j]-y_{T}[j]=0. However, since the cjc_{j} are values such that mcjMm\leq c_{j}\leq M, there is in general no single value of TT such that all xT[j]yT[j]=0x_{T}[j]-y_{T}[j]=0. However, we can show that for T=π2×12MT=\frac{\pi}{2}\times\frac{1}{\sqrt{2M}},

xT[j]yT[j]\displaystyle x_{T}[j]-y_{T}[j] =cos(2cjT)×(x0[j]y0[j])\displaystyle=\cos\left(\sqrt{2c_{j}}T\right)\times(x_{0}[j]-y_{0}[j]) (5)
cos(2m×π2×12M)×(x0[j]y0[j])\displaystyle\leq\cos\left(\sqrt{2m}\times\frac{\pi}{2}\times\frac{1}{\sqrt{2M}}\right)\times(x_{0}[j]-y_{0}[j]) (6)
(118(2m×π2×12M)2)×(x0[j]y0[j])\displaystyle\leq\left(1-\frac{1}{8}\left(\sqrt{2m}\times\frac{\pi}{2}\times\frac{1}{\sqrt{2M}}\right)^{2}\right)\times(x_{0}[j]-y_{0}[j]) (7)
(1Ω(mM))×(x0[j]y0[j])j[d],\displaystyle\leq\left(1-\Omega\left(\frac{m}{M}\right)\right)\times(x_{0}[j]-y_{0}[j])\qquad\forall j\in[d], (8)

where the inequality holds because the fact that ff is mm-strongly convex implies that cjmc_{j}\geq m, cos\cos is monotone decreasing on the interval [0,π][0,\pi], and 2m×π2×12M[0,π]\sqrt{2m}\times\frac{\pi}{2}\times\frac{1}{\sqrt{2M}}\in[0,\pi] since 0<mM0<m\leq M. The second inequality holds because cos(s)118s2\cos(s)\leq 1-\frac{1}{8}s^{2} for all s[0,π]s\in[0,\pi]. Hence, we have that

xTyT\displaystyle\|x_{T}-y_{T}\| (1Ω(mM))×x0y0.\displaystyle\leq\left(1-\Omega\left(\frac{m}{M}\right)\right)\times\|x_{0}-y_{0}\|. (9)

General strongly convex and smooth ff (sketch).

Recall that in the case of the spherical Harmonic oscillator, f(x)=mxxf(x)=mx^{\top}x, the difference between the forces on the two particles is

f(xt)+f(yt)=2m(ytxt)-\nabla f(x_{t})+\nabla f(y_{t})=2m(y_{t}-x_{t}) (10)

and thus, the difference between the force acting on xtx_{t} and the force acting on yty_{t} is a vector which points exactly in the direction of the vector from xtx_{t} to yty_{t}.

In more general settings where ff is mm-strongly convex and MM-smooth, but not necessarily a quadratic/harmonic oscillator potential, strong convexity implies that the component of the vector f(xt)+f(yt)-\nabla f(x_{t})+\nabla f(y_{t}) in the direction (ytxt)(y_{t}-x_{t}) still points in the direction (ytxt)(y_{t}-x_{t}) and still has magnitude at least 2mytxt2m\|y_{t}-x_{t}\|. However, unless m=Mm=M, the difference in the forces, f(xt)+f(yt)-\nabla f(x_{t})+\nabla f(y_{t}), may also have a component orthogonal to the vector ytxty_{t}-x_{t}. Since ff is MM-smooth, this orthogonal component has magnitude no larger than 2Mytxt2M\|y_{t}-x_{t}\|.

Thus, since the two Markov chains are coupled such that the two particles have the same initial velocity, that is, v0u0=0v_{0}-u_{0}=0, we have, in the worst case,

xtyt\displaystyle x_{t}-y_{t} =x0y0+(v0u0)×tmt2(x0y0)+Mt2zx0y0+Higher-order terms(t)\displaystyle=x_{0}-y_{0}+(v_{0}-u_{0})\times t-mt^{2}(x_{0}-y_{0})+Mt^{2}z\|x_{0}-y_{0}\|+\textrm{Higher-order terms}(t) (11)
=x0y0mt2(x0y0)+Mt2zx0y0+Higher-order terms(t),\displaystyle=x_{0}-y_{0}-mt^{2}(x_{0}-y_{0})+Mt^{2}z\|x_{0}-y_{0}\|+\textrm{Higher-order terms}(t), (12)

where zz is a unit vector orthogonal to (x0y0)(x_{0}-y_{0}). We would like to determine the value of tt which minimizes xtyt\|x_{t}-y_{t}\|, and the extent to which the distance xtyt\|x_{t}-y_{t}\| contracts at this value of tt. In this proof sketch we will ignore the higher-order terms. These higher-order terms can be bounded using comparison theorems for ordinary differential equations; see Section 4 of [MSss].

Ignoring the higher-order terms, we have (since zz is a unit vector orthogonal to (x0y0)(x_{0}-y_{0})),

xtyt2\displaystyle\|x_{t}-y_{t}\|^{2} (1mt2)2x0y02+(Mt2)2x0y02\displaystyle\leq(1-mt^{2})^{2}\|x_{0}-y_{0}\|^{2}+(Mt^{2})^{2}\|x_{0}-y_{0}\|^{2} (13)
=((1mt2)2+(Mt2)2)x0y02\displaystyle=\left((1-mt^{2})^{2}+(Mt^{2})^{2}\right)\|x_{0}-y_{0}\|^{2} (14)
=(12mt2+m2t4+M2t4)x0y02\displaystyle=\left(1-2mt^{2}+m^{2}t^{4}+M^{2}t^{4}\right)\|x_{0}-y_{0}\|^{2} (15)
(12mt2+2M2t4)x0y02.\displaystyle\leq\left(1-2mt^{2}+2M^{2}t^{4}\right)\|x_{0}-y_{0}\|^{2}. (16)

The RHS of (13) is minimized at t=m2Mt=\frac{\sqrt{m}}{\sqrt{2}M}. Thus, for t=m2Mt=\frac{\sqrt{m}}{\sqrt{2}M} we have:

xtyt2\displaystyle\|x_{t}-y_{t}\|^{2} (12m(m2M)2+M2(m2M)4)×x0y02\displaystyle\leq\left(1-2m\left(\frac{\sqrt{m}}{\sqrt{2}M}\right)^{2}+M^{2}\left(\frac{\sqrt{m}}{\sqrt{2}M}\right)^{4}\right)\times\|x_{0}-y_{0}\|^{2} (17)
=(1m22M2)×x0y02.\displaystyle=\left(1-\frac{m^{2}}{2M^{2}}\right)\times\|x_{0}-y_{0}\|^{2}. (18)

Thus, we have γ=m22M2\gamma=\frac{m^{2}}{2M^{2}} for T=Θ(mM)T=\Theta\left(\frac{\sqrt{m}}{M}\right).

7 Discretizing HMC

To prove (non-asymptotic) running time bounds on HMC, we must approximate xx and vv in the idealized HMC with some numerical method. One can use a numerical method such as the Euler [GH10] or leapfrog integrators [HLW03]. The earliest theoretical analyses of HMC were the asymptotic “optimal scaling” results of [KP91], for the special case when the target distribution is a multivariate Gaussian. Specifically, they showed that the Metropolis-adjusted implementation of HMC with leapfrog integrator requires a numerical step size of O(d14)O^{*}(d^{-\frac{1}{4}}) to maintain an Ω(1)\Omega(1) Metropolis acceptance probability in the limit as the dimension dd\rightarrow\infty. They then showed that for this choice of numerical step size the number of numerical steps HMC requires to obtain samples from Gaussian targets with a small autocorrelation is O(d14)O^{*}(d^{\frac{1}{4}}) in the large-dd limit. [PST12] extended their asymptotic analysis of the acceptance probability to more general classes of separable distributions.

The earliest non-asymptotic analysis of an HMC Markov chain was provided in [SRSH14] for an idealized version of HMC based on continuous Hamiltonian dynamics, in the special case of Gaussian target distributions. As mentioned earlier, [MS17] show that idealized HMC can sample from general mm-strongly logconcave target distributions with MM-Lipschitz gradient in O~(κ2)\tilde{O}(\kappa^{2}) steps, where κ=Mm\kappa=\frac{M}{m} (see also [BRSS17, BREZ20] for more work on idealized HMC). They also show that an unadjusted implementation of HMC with first-order discretization can sample with Wasserstein error ε>0\varepsilon>0 in O~(d12κ6.5ε1)\tilde{O}(d^{\frac{1}{2}}\kappa^{6.5}\varepsilon^{-1}) gradient evaluations. In addition, they show that a second-order discretization of HMC can sample from separable target distributions in O~(d14ε1f(m,M,B))\tilde{O}(d^{\frac{1}{4}}\varepsilon^{-1}f(m,M,B)) gradient evaluations, where ff is an unknown (non-polynomial) function of m,M,Bm,M,B, if the operator norms of the first four Fréchet derivatives of the restriction of UU to the coordinate directions are bounded by BB. [LV18] use the conductance method to show that an idealized version of the Riemannian variant of HMC (RHMC) has mixing time with total variation (TV) error ε>0\varepsilon>0 of roughly O~(1ψ2T2Rlog(1ε))\tilde{O}(\frac{1}{\psi^{2}T^{2}}R\log(\frac{1}{\varepsilon})), for any 0Td140\leq T\leq d^{-\frac{1}{4}}, where RR is a regularity parameter for UU and ψ\psi is an isoperimetric constant for π\pi. Metropolized variants of HMC have also been studied recently; see [CDWY20] and the references in there.

A second-order discretization.

Here we discuss the approach of [MV18]. They approximate a Hamiltonian trajectory with a second-order Euler integrator that iteratively computes second-order Taylor expansions (x^η,v^η)(\hat{x}_{\eta},\hat{v}_{\eta}) of Hamilton’s equations, where

x^η(x,v)=x+vη12η2f(x),v^η(x,v)=vηf(x)12η22f(x)v.\hat{x}_{\eta}(x,v)=x+v\eta-\frac{1}{2}\eta^{2}\nabla{f}(x),\qquad\hat{v}_{\eta}(x,v)=v-\eta\nabla{f}(x)-\frac{1}{2}\eta^{2}\nabla^{2}f(x)v.

Here, η>0\eta>0 is the parameter corresponding to the “step size”. If we approximate

2f(x)vf(x^η)f(x)η,\nabla^{2}f(x)v\approx\frac{\nabla f(\hat{x}_{\eta})-\nabla f(x)}{\eta},

we obtain the following numerical integrator:

x^η(x,v)=x+vη12η2f(x),v^η(x,v)=v12η(f(x)f(x^η)).\hat{x}_{\eta}(x,v)=x+v\eta-\frac{1}{2}\eta^{2}\nabla{f}(x),\qquad\hat{v}_{\eta}(x,v)=v-\frac{1}{2}\eta\left(\nabla f(x)-\nabla f(\hat{x}_{\eta})\right).

This leads to the following algorithm.

Input: First-order oracle for f:df:\mathbb{R}^{d}\rightarrow\mathbb{R}, an initial point X0dX_{0}\in\mathbb{R}^{d}, T>0T\in\mathbb{R}_{>0}, kk\in\mathbb{N}, η>0\eta>0
for i=1,,ki=1,\ldots,k do
      
      Sample ξN(0,Id)\xi\sim N(0,I_{d}).
       Set q0=Xiq_{0}=X_{i} and p0=ξp_{0}=\xi
       for j=1,,Tηj=1,\ldots,\frac{T}{\eta} do
             Set qj=qj1+ηpj112η2f(qj1)q_{j}=q_{j-1}+\eta p_{j-1}-\frac{1}{2}\eta^{2}\nabla f(q_{j-1}) and pj=pj112η(f(qj1)f(qj))p_{j}=p_{j-1}-\frac{1}{2}\eta\left(\nabla f(q_{j-1})-\nabla f(q_{j})\right)
       end for
      Set Xi=qTη(Xi1,ξ)X_{i}=q_{\frac{T}{\eta}}(X_{i-1},\xi)
end for
Output XkX_{k}
Algorithm 2 Unadjusted Hamiltonian Monte Carlo

It can be shown that under a mild “regularity” condition, this unadjusted HMC requires at most (roughly) O(d14ε12)O\left(d^{\frac{1}{4}}\varepsilon^{-\frac{1}{2}}\right) gradient evaluations; see [MV18] for details.

Higher-order integrators. More generally, one can replace the second-order Taylor expansion with a kkth-order Taylor expansion for any k1k\geq 1, to obtain a kkth-order numerical integrator for Hamiltonian trajectories. Unfortunately, the number of components in the Taylor expansion grows exponentially with kk because of the product rule, so it is difficult to compute the Taylor expansion exactly. However, it is possible to compute this expansion approximately using polynomial interpolation methods such as the “Collocation Method” of [LV17] that was used for the special case when ff is constant on a polytope [LV18]. While higher-order integrators can give theoretical bounds, they are generally unstable in practice as they are not symplectic.

Developing practical higher-order methods and identifying other interesting regularity conditions on the target density that lead to fast algorithms remain interesting future directions.

Acknowledgments

The author would like to thank Oren Mangoubi for his help with the proof of Theorem 6.1 which initially appeared in [MS17], and Yin-Tat Lee, Anay Mehrotra, and Andre Wibisono for useful comments. The author would also like to acknowledge the support of NSF CCF-1908347.

References

  • [AK91] David Applegate and Ravi Kannan. Sampling and integration of near log-concave functions. In Proceedings of the twenty-third annual ACM symposium on Theory of computing, pages 156–163. ACM, 1991.
  • [BBG14] MJ Betancourt, Simon Byrne, and Mark Girolami. Optimizing the integrator step size for Hamiltonian Monte Carlo. arXiv preprint arXiv:1411.6669, 2014.
  • [BG15] Michael Betancourt and Mark Girolami. Hamiltonian Monte Carlo for hierarchical models. Current trends in Bayesian methodology with applications, 79:30, 2015.
  • [BPR+13] Alexandros Beskos, Natesh Pillai, Gareth Roberts, Jesus-Maria Sanz-Serna, and Andrew Stuart. Optimal tuning of the hybrid Monte Carlo algorithm. Bernoulli, 19(5A):1501–1534, 2013.
  • [BREZ20] Nawaf Bou-Rabee, Andreas Eberle, and Raphael Zimmer. Coupling and convergence for Hamiltonian Monte Carlo. The Annals of Applied Probability, 30(3):1209–1250, 2020.
  • [BRSS17] Nawaf Bou-Rabee and Jesús María Sanz-Serna. Randomized hamiltonian monte carlo. The Annals of Applied Probability, 27(4):2159–2194, 2017.
  • [CCBJ18] Xiang Cheng, Niladri S. Chatterji, Peter L. Bartlett, and Michael I. Jordan. Underdamped langevin MCMC: A non-asymptotic analysis. In COLT, volume 75 of Proceedings of Machine Learning Research, pages 300–323. PMLR, 2018.
  • [CCHK89] EA Carter, Giovanni Ciccotti, James T Hynes, and Raymond Kapral. Constrained reaction coordinate dynamics for the simulation of rare events. Chemical Physics Letters, 156(5):472–477, 1989.
  • [CDWY20] Yuansi Chen, Raaz Dwivedi, Martin J. Wainwright, and Bin Yu. Fast mixing of metropolized hamiltonian monte carlo: Benefits of multi-step gradients. J. Mach. Learn. Res., 21:92:1–92:72, 2020.
  • [CFG14] Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, pages 1683–1691, 2014.
  • [CGH+16] Bob Carpenter, Andrew Gelman, Matt Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Michael A Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan: A probabilistic programming language. Journal of Statistical Software, 20:1–37, 2016.
  • [CGST11] K Andrew Cliffe, Mike B Giles, Robert Scheichl, and Aretha L Teckentrup. Multilevel Monte Carlo methods and applications to elliptic pdes with random coefficients. Computing and Visualization in Science, 14(1):3, 2011.
  • [Cre88] M. Creutz. Global Monte Carlo algorithms for many-fermion systems. Phy. Rev. D, 38(4):1228, 1988.
  • [CV19] Zongchen Chen and Santosh S. Vempala. Optimal convergence rate of hamiltonian monte carlo for strongly logconcave distributions. In Dimitris Achlioptas and László A. Végh, editors, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2019, September 20-22, 2019, Massachusetts Institute of Technology, Cambridge, MA, USA, volume 145 of LIPIcs, pages 64:1–64:12. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019.
  • [Dal17] Arnak S Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017.
  • [DCWY19] Raaz Dwivedi, Yuansi Chen, Martin J. Wainwright, and Bin Yu. Log-concave sampling: Metropolis-hastings algorithms are fast. J. Mach. Learn. Res., 20:183:1–183:42, 2019.
  • [DFK91] Martin Dyer, Alan Frieze, and Ravi Kannan. A random polynomial-time algorithm for approximating the volume of convex bodies. Journal of the ACM (JACM), 38(1):1–17, 1991.
  • [DGR03] Jerome B Detemple, Ren Garcia, and Marcel Rindisbacher. A Monte Carlo method for optimal portfolios. The journal of Finance, 58(1):401–446, 2003.
  • [DKPR87] Simon Duane, Anthony D Kennedy, Brian J Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics letters B, 195(2):216–222, 1987.
  • [DM17] Alain Durmus and Eric Moulines. Nonasymptotic convergence analysis for the unadjusted langevin algorithm. The Annals of Applied Probability, 27(3):1551–1587, 2017.
  • [DM19] Alain Durmus and Eric Moulines. High-dimensional bayesian inference via the unadjusted langevin algorithm. Bernoulli, 25(4A):2854–2882, 2019.
  • [DMss] Alain Durmus and Eric Moulines. Non-asymptotic convergence analysis for the Unadjusted Langevin Algorithm. The Annals of Applied Probability, in press.
  • [DMS17] Alain Durmus, Eric Moulines, and Eero Saksman. On the convergence of Hamiltonian Monte Carlo. arXiv preprint arXiv:1705.00166, 2017.
  • [DPG+14] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in neural information processing systems, pages 2933–2941, 2014.
  • [DSDS08] Ana Cannas Da Silva and A Cannas Da Salva. Lectures on symplectic geometry. Springer, 2008.
  • [GBC16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
  • [GBW14] Maya R. Gupta, Samy Bengio, and Jason Weston. Training highly multiclass classifiers. J. Mach. Learn. Res., 15(1):1461–1492, January 2014.
  • [GGR97] Andrew Gelman, Walter R Gilks, and Gareth O Roberts. Weak convergence and optimal scaling of random walk Metropolis algorithms. The Annals of Applied Probability, 7(1):110–120, 1997.
  • [GH10] David F Griffiths and Desmond J Higham. Numerical methods for ordinary differential equations: initial value problems. Springer Science & Business Media, 2010.
  • [HLW03] Ernst Hairer, Christian Lubich, and Gerhard Wanner. Geometric numerical integration illustrated by the Störmer–Verlet method. Acta numerica, 12:399–450, 2003.
  • [KF09] Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning. The MIT Press, 2009.
  • [KP91] AD Kennedy and Brian Pendleton. Acceptances and autocorrelations in hybrid Monte Carlo. Nuclear Physics B-Proceedings Supplements, 20:118–121, 1991.
  • [LBBG19] Samuel Livingstone, Michael Betancourt, Simon Byrne, and Mark Girolami. On the geometric ergodicity of hamiltonian monte carlo. Bernoulli, 25(4A):3109–3138, 2019.
  • [LS90] László Lovász and Miklós Simonovits. The mixing rate of Markov chains, an isoperimetric inequality, and computing the volume. In Foundations of Computer Science, 1990. Proceedings., 31st Annual Symposium on, pages 346–354. IEEE, 1990.
  • [LS92] László Lovász and Miklós Simonovits. On the randomized complexity of volume and diameter. In Foundations of Computer Science, 1992. Proceedings., 33rd Annual Symposium on, pages 482–492. IEEE, 1992.
  • [LS93] László Lovász and Miklós Simonovits. Random walks in a convex body and an improved volume algorithm. Random structures & algorithms, 4(4):359–412, 1993.
  • [LST21] Yin Tat Lee, Ruoqi Shen, and Kevin Tian. Structured logconcave sampling with a restricted gaussian oracle. In Mikhail Belkin and Samory Kpotufe, editors, Conference on Learning Theory, COLT 2021, 15-19 August 2021, Boulder, Colorado, USA, volume 134 of Proceedings of Machine Learning Research, pages 2993–3050. PMLR, 2021.
  • [LV03] László Lovász and Santosh Vempala. Hit-and-run is fast and fun. preprint, Microsoft Research, 2003.
  • [LV06] László Lovász and Santosh Vempala. Fast algorithms for logconcave functions: Sampling, rounding, integration and optimization. In Foundations of Computer Science, 2006. FOCS’06. 47th Annual IEEE Symposium on, pages 57–68. IEEE, 2006.
  • [LV17] Yin Tat Lee and Santosh S Vempala. Geodesic walks in polytopes. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 927–940. ACM, 2017.
  • [LV18] Yin Tat Lee and Santosh S Vempala. Convergence rate of Riemannian Hamiltonian Monte Carlo and faster polytope volume computation. In STOC, 2018.
  • [MPS+12] Jonathan C Mattingly, Natesh S Pillai, Andrew M Stuart, et al. Diffusion limits of the random walk Metropolis algorithm in high dimensions. The Annals of Applied Probability, 22(3):881–930, 2012.
  • [MS17] Oren Mangoubi and Aaron Smith. Rapid mixing of Hamiltonian Monte Carlo on strongly log-concave distributions. arXiv preprint arXiv:1708.07114, 2017.
  • [MS18] Oren Mangoubi and Aaron Smith. Rapid mixing of geodesic walks on manifolds with positive curvature. The Annals of Applied Probability, 2018.
  • [MSss] Oren Mangoubi and Aaron Smith. Mixing of Hamiltonian Monte Carlo on strongly log-concave distributions: Continuous dynamics. Annals of Applied Probability, in press.
  • [MV18] Oren Mangoubi and Nisheeth K Vishnoi. Dimensionally tight running time bounds for second-order Hamiltonian Monte Carlo. In NeurIPS. Available at arXiv:1802.08898, 2018.
  • [Nea92] Radford M Neal. Bayesian training of backpropagation networks by the hybrid Monte Carlo method. Technical report, CRG-TR-92-1, Dept. of Computer Science, University of Toronto, 1992.
  • [Nea11] Radford M Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2:113–162, 2011.
  • [PST12] Natesh S Pillai, Andrew M Stuart, and Alexandre H Thiéry. Optimal scaling and diffusion limits for the Langevin algorithm in high dimensions. The Annals of Applied Probability, 22(6):2320–2356, 2012.
  • [RMZT02] Lula Rosso, Peter Mináry, Zhongwei Zhu, and Mark E Tuckerman. On the use of the adiabatic molecular dynamics technique in the calculation of free energy profiles. The Journal of chemical physics, 116(11):4389–4402, 2002.
  • [RR98] Gareth O Roberts and Jeffrey S Rosenthal. Optimal scaling of discrete approximations to Langevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(1):255–268, 1998.
  • [RRT17] Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient langevin dynamics: a nonasymptotic analysis. In Conference on Learning Theory, pages 1674–1703, 2017.
  • [SM08] Ruslan Salakhutdinov and Andriy Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th ICML, pages 880–887. ACM, 2008.
  • [SRSH14] Christof Seiler, Simon Rubinstein-Salzedo, and Susan Holmes. Positive curvature and Hamiltonian Monte Carlo. In Advances in Neural Information Processing Systems, pages 586–594, 2014.
  • [War97] Bradley A Warner. Bayesian learning for neural networks (lecture notes in statistics vol. 118), Radford M. Neal. Journal of American Statistical Association, 92:791–791, 1997.
  • [WT11] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.
  • [ZRC13] Wenwei Zheng, Mary A Rohrdanz, and Cecilia Clementi. Rapid exploration of configuration space with diffusion-map-directed molecular dynamics. The journal of physical chemistry B, 117(42):12769–12776, 2013.