An Introduction to Hamiltonian Monte Carlo Method for Sampling

Nisheeth K. Vishnoi

Abstract

The goal of this article is to introduce the Hamiltonian Monte Carlo method – a Hamiltonian dynamics inspired algorithm for sampling from a Gibbs density $\pi(x)\propto e^{-f(x)}$ . We focus on the “idealized” case, where one can compute continuous trajectories exactly. We show that idealized HMC preserves $\pi$ and we establish its convergence when $f$ is strongly convex and smooth.

1 Introduction

Many fundamental tasks in disciplines such as machine learning (ML), optimization, statistics, theoretical computer science, and molecular dynamics rely on the ability to sample from a probability distribution. Sampling is used both as a tool to select training data and to generate samples and infer statistics from models such as highly multi-class classifiers, Bayesian networks, Boltzmann machines, and GANs; see [GBW14, SM08, KF09, GBC16]. Sampling is also useful for making the optimization methods used to train deep networks more robust, by allowing them to escape local minima/saddle points and by preventing them from overfitting [WT11, DPG⁺14, SM08]. Sampling methods are often the key to solving integration, counting, and volume computation problems that are rampant in various applications at the intersection of sciences and ML [DFK91]. In chemistry and molecular biology, sampling is used to estimate reaction rates and simulate molecular dynamics which, in turn, is used for discovering new materials and drugs [CCHK89, RMZT02, ZRC13]. In finance, sampling is used to optimize expected return of portfolios [DGR03]. Sampling algorithms are also used to solve partial differential equations in applications such as geophysics [CGST11].

Mathematically, several of the above applications reduce to the following problem: given a function $f:\mathbb{R}^{d}\rightarrow\mathbb{R},$ generate independent samples from the distribution $\pi(x)\propto e^{-f(x)}$ , referred to as the Gibbs or Boltzmann distribution. Markov chain Monte Carlo (MCMC) is one of the most influential frameworks to design algorithms to sample from Gibbs distributions [GBC16]. Examples of MCMC algorithms include Random Walk Metropolis (RWM), ball walk, Gibbs samplers, Grid walks [DFK91], Langevin Monte Carlo [RR98], and a physics-inspired meta-algorithm – Hamiltonian Monte Carlo (HMC) [DKPR87] – that subsumes several of the aforementioned methods as its special cases. HMC was first discovered by physicists [DKPR87], and was adopted with much success in ML [Nea92, War97, CFG14, BG15], and is currently the main algorithm used in the popular software package Stan [CGH⁺16].

HMC algorithms are inspired by physics and reduce the problem of sampling from the Gibbs distribution of $f$ to simulating the trajectory of a particle in $\mathbb{R}^{d}$ according to the laws of Hamiltonian dynamics, where $f$ plays the role of a “potential energy” function. We show here that the physical laws that govern HMC also endow it with the ability to take long steps, at least in the setting where one can simulate Hamiltonian dynamics exactly. Moreover, we present sophisticated discretizations of continuous trajectories of Hamiltonian dynamics that can lead to sampling algorithms for “regular-enough” Gibbs distributions.

2 Related Algorithms for Sampling

We give a brief overview of a number of different algorithms available for sampling from continuous distributions. In a typical MCMC algorithm, one sets up a Markov chain whose domain includes that of $f$ and whose transition function determines the motion of a particle in the target distribution’s domain.

Traditional algorithms: Ball Walk, Random Walk Metropolis.

In the ball walk and the related Random Walk Metropolis (RWM) algorithms, each step of the Markov chain is computed by proposing a step $\tilde{X}_{i+1}=X_{i}+\eta v_{i}$, where $v_{i}$ is uniformly distributed on the unit ball for the ball walk and standard multivariate Gaussian for RWM, and $\eta>0$ is the “step size”. The proposal is then passed through a Metropolis filter, which accepts the proposal with probability $\min\left(\frac{\pi(\tilde{X}_{i+1})}{\pi(X_{i})},1\right)$, where $X_{i}$ is the current state, ensuring that the target distribution $\pi$ is a stationary distribution of the Markov chain.

Some instances and applications of these algorithms include sampling from and computing the volume of a polytope [DFK91, AK91, LS90, LS92, LS93, LV03] as well as sampling from a non-uniform distribution $\pi$ and the related problem of integrating functions [GGR97, LV06]. One advantage of these algorithms is that they only require a membership oracle for the domain and a zeroth-order oracle for $f$ . On the other hand, the running times of these algorithms are sub-optimal in situations where one does have access to information such as the gradient of $f$ , since they are unable to make use of this additional higher-order information. To ensure a high acceptance probability, the step size $\eta$ in the RWM or ball walk algorithms must be chosen to be sufficiently small so that $\pi(X_{i})$ does not change too much at any given step. For instance, if $\pi$ is $N(0,I_{d})$ , then the optimal choice of step size is roughly $\eta\approx\frac{1}{\sqrt{d}}$ , implying that $\|\eta v_{i}\|\approx 1$ with high probability [GGR97]. Therefore, since most of the probability of a standard spherical Gaussian lies in a ball of radius $\sqrt{d}$ , one would expect all of these algorithms to take roughly $(\sqrt{d})^{2}=d$ steps to explore a sufficiently regular target distribution.
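The proposal and Metropolis filter described above can be sketched in a few lines. This is a minimal illustration, not a tuned implementation: it assumes the target $N(0,I_{d})$, that is, $f(x)=\frac{1}{2}\|x\|^{2}$, and the names `rwm_step`, `standard_gaussian_potential`, and `run_rwm` are hypothetical helpers introduced only for this example.

```python
import math
import random


def rwm_step(x, f, eta, rng):
    """One Random Walk Metropolis step for the target pi(x) ∝ exp(-f(x))."""
    # Gaussian proposal: X~ = X + eta * v with v ~ N(0, I_d).
    proposal = [xi + eta * rng.gauss(0.0, 1.0) for xi in x]
    # Metropolis filter: accept with probability min(pi(X~) / pi(X), 1),
    # evaluated in log space as exp(f(X) - f(X~)).
    log_ratio = f(x) - f(proposal)
    if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
        return proposal, True
    return x, False


def standard_gaussian_potential(x):
    """f(x) = ||x||^2 / 2, so that pi = N(0, I_d) (an assumed example target)."""
    return 0.5 * sum(xi * xi for xi in x)


def run_rwm(d, n_steps, eta, seed=0):
    """Run RWM from the origin; return the final state and the acceptance rate."""
    rng = random.Random(seed)
    x = [0.0] * d
    accepted = 0
    for _ in range(n_steps):
        x, ok = rwm_step(x, standard_gaussian_potential, eta, rng)
        accepted += ok
    return x, accepted / n_steps
```

Calling `run_rwm(50, 2000, 1 / math.sqrt(50))` and then `run_rwm(50, 2000, 1.0)` illustrates the step-size trade-off discussed above: with $\eta\approx\frac{1}{\sqrt{d}}$ a constant fraction of proposals is accepted, while a step size of order one is rejected almost always.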

Langevin algorithms.

The Langevin algorithm makes use of the gradient of $f$ to improve on the provable running times of the RWM and ball walk algorithms. Each update is computed as $X_{i+1}=X_{i}-\eta\nabla f(X_{i})+\sqrt{2\eta}V_{i}$, where $V_{1},V_{2},\ldots\sim N(0,I_{d})$. Recently, provable running time bounds have been shown for different versions of the Langevin algorithm for $m$-strongly log-concave distributions with $M$-Lipschitz gradient [Dal17, DM17, CCBJ18], including an $\tilde{O}(\max(d\kappa,d^{\frac{1}{2}}\kappa^{1.5})\log(\frac{1}{\varepsilon}))$ bound for getting $\varepsilon$ close to the target distribution $\pi$ in the “total variation” (TV) metric, where $\kappa=\frac{M}{m}$ [DCWY19]. This bound has been improved to roughly $d\kappa$ in [LST21].

Running time bounds for Langevin in the non-convex setting have also recently been obtained in terms of isoperimetric constants for $\pi$ [RRT17]. For sufficiently regular target distributions, the optimal step size of the Langevin algorithm can be shown to be $\eta\approx d^{-\frac{1}{4}}$ without Metropolis adjustment and $d^{-\frac{1}{6}}$ with Metropolis adjustment [RR98, Nea11]. Since the random “momentum” terms $V_{1},V_{2},\ldots$ are independent, the distance traveled by Langevin in a given number of steps is still roughly proportional to the square of its inverse numerical step size $\eta^{-1}$ , meaning that the running time is at least $\eta^{-2}=d^{\frac{1}{2}}$ without Metropolis adjustment and $\eta^{-2}=d^{\frac{1}{3}}$ with the Metropolis adjustment. The fact that the “momentum” is discarded after each numerical step is therefore a barrier in further improving the running time of Langevin algorithms.
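As an illustration of the Langevin update $X_{i+1}=X_{i}-\eta\nabla f(X_{i})+\sqrt{2\eta}V_{i}$, here is a minimal sketch of the unadjusted variant (no Metropolis filter). It assumes the example target $N(0,I_{d})$, for which $\nabla f(x)=x$; the function names `ula_step`, `grad_standard_gaussian`, and `run_ula` are hypothetical.

```python
import math
import random


def ula_step(x, grad_f, eta, rng):
    """X_{i+1} = X_i - eta * grad f(X_i) + sqrt(2 * eta) * V_i, with V_i ~ N(0, I_d)."""
    g = grad_f(x)
    noise_scale = math.sqrt(2.0 * eta)
    return [xi - eta * gi + noise_scale * rng.gauss(0.0, 1.0)
            for xi, gi in zip(x, g)]


def grad_standard_gaussian(x):
    """Gradient of f(x) = ||x||^2 / 2 (an assumed example potential)."""
    return list(x)


def run_ula(d, n_steps, eta, seed=0):
    """Run the unadjusted chain from the origin.

    Returns the final state and the per-step average of the squared coordinates.
    """
    rng = random.Random(seed)
    x = [0.0] * d
    second_moments = []
    for _ in range(n_steps):
        x = ula_step(x, grad_standard_gaussian, eta, rng)
        second_moments.append(sum(xi * xi for xi in x) / d)
    return x, second_moments
```

Note that a fresh Gaussian vector is drawn at every numerical step, which is the "momentum is discarded" behavior described above. For this Gaussian example the unadjusted chain is a linear recursion whose stationary variance per coordinate is $\frac{1}{1-\eta/2}$ rather than $1$, so the discretization bias vanishes only as $\eta\rightarrow 0$.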

3 Hamiltonian Dynamics

We present an overview of the Hamiltonian dynamics and its properties that play a crucial role in the description and analysis of HMC. Imagine a particle of unit mass in a potential well ${f}:\mathbb{R}^{d}\rightarrow\mathbb{R}$ . If the particle has position $x\in\mathbb{R}^{d}$ and velocity $v\in\mathbb{R}^{d}$ , its total energy is given by the function ${H}:\mathbb{R}^{2d}\rightarrow\mathbb{R}$ defined as

{H}(x,v)={f}(x)+\frac{1}{2}\|v\|^{2}.

This function is called the Hamiltonian of the particle. The Hamiltonian dynamics for this particle are:

\frac{{d}x}{{d}t}=\frac{\partial H}{\partial v}\ \ \mbox{and}\ \ \frac{{d}v}{{d}t}=-\frac{\partial H}{\partial x}.

In the case where the Hamiltonian has the simple form given above, these equations reduce to:

\frac{{d}x}{{d}t}=v\ \ \mbox{and}\ \ \frac{{d}v}{{d}t}=-\nabla{f}(x).

For a starting configuration $(x,v)$ , we denote the solutions to these equations by $(x_{t}(x,v),v_{t}(x,v))$ for $t\geq 0$ . Let the “Hamiltonian flow” $\varphi_{t}:(x,v)\mapsto(x_{t}(x,v),v_{t}(x,v))$ be the position and velocity after time $t$ starting from $(x,v)$ .

Interestingly, these solutions satisfy a number of conservation properties.

Hamiltonian dynamics conserves the Hamiltonian: ${H}(x(t),v(t))={H}(x(0),v(0))$ for all $t\geq 0$ . To see this note that, since $H$ does not depend on $t$ explicitly (or $\frac{\partial H}{\partial t}=0$ ),

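\frac{dH}{dt}=\sum_{i\in[d]}\left(\frac{dx_{i}}{dt}\frac{\partial H}{\partial x_{i}}+\frac{dv_{i}}{dt}\frac{\partial H}{\partial v_{i}}\right)=\sum_{i\in[d]}\left(\frac{\partial H}{\partial v_{i}}\frac{\partial H}{\partial x_{i}}-\frac{\partial H}{\partial x_{i}}\frac{\partial H}{\partial v_{i}}\right)=0.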

Hamiltonian dynamics conserves the volume in the “phase space.” Formally, let $F=\left(\frac{dx}{dt},\frac{dv}{dt}\right)$ be the vector field associated to the Hamiltonian in the phase space $\mathbb{R}^{d}\times\mathbb{R}^{d}.$ First note that the divergence of $F$ is zero:

\begin{align*}
\mathrm{div}F&=\nabla\cdot F=\sum_{i\in[d]}\left(\frac{\partial}{\partial x_{i}}\frac{dx_{i}}{dt}+\frac{\partial}{\partial v_{i}}\frac{dv_{i}}{dt}\right)\\
&=\sum_{i\in[d]}\left(\frac{\partial}{\partial x_{i}}\frac{\partial H}{\partial v_{i}}-\frac{\partial}{\partial v_{i}}\frac{\partial H}{\partial x_{i}}\right)\\
&=0.
\end{align*}

Since divergence represents the volume density of the outward flux of a vector field from an infinitesimal volume around a given point, it being zero everywhere implies volume preservation. Another way to see this is that divergence is the trace of the Jacobian of the vector field $F$, and this trace governs the time derivative of the determinant of the Jacobian of the flow map $\varphi_{t}$. Hence, the trace being $0$ implies that the determinant of the Jacobian of $\varphi_{t}$ does not change with $t$; since $\varphi_{0}$ is the identity, this determinant equals $1$ for all $t$.

Hamiltonian dynamics of the form mentioned above are time reversible for $t\geq 0$ :

$\varphi_{t}(x_{t}(x,v),-v_{t}(x,v))=(x,-v).$

The underlying geometry and conservation laws have been generalized significantly in physics and have led to the area of symplectic geometry in mathematics [DSDS08].
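The conservation and reversibility properties can be checked numerically in a case where the flow is known in closed form. The sketch below assumes the potential $f(x)=\frac{1}{2}\|x\|^{2}$, for which Hamilton's equations $\frac{dx}{dt}=v$, $\frac{dv}{dt}=-x$ are solved by a rotation in each $(x_{i},v_{i})$ plane; `harmonic_flow` and `hamiltonian` are hypothetical helper names.

```python
import math


def harmonic_flow(x, v, t):
    """Exact Hamiltonian flow phi_t for f(x) = ||x||^2 / 2 (so grad f(x) = x)."""
    c, s = math.cos(t), math.sin(t)
    x_t = [c * xi + s * vi for xi, vi in zip(x, v)]
    v_t = [-s * xi + c * vi for xi, vi in zip(x, v)]
    return x_t, v_t


def hamiltonian(x, v):
    """H(x, v) = f(x) + ||v||^2 / 2 with f(x) = ||x||^2 / 2."""
    return 0.5 * sum(xi * xi for xi in x) + 0.5 * sum(vi * vi for vi in v)


x0, v0 = [1.0, -2.0, 0.5], [0.3, 0.0, -1.2]
t = 1.7
x_t, v_t = harmonic_flow(x0, v0, t)

# Conservation of the Hamiltonian along the trajectory.
energy_drift = abs(hamiltonian(x_t, v_t) - hamiltonian(x0, v0))

# Time reversibility: flip the velocity, flow for time t again, recover (x0, -v0).
x_back, v_back = harmonic_flow(x_t, [-vi for vi in v_t], t)
reversal_error = max(
    abs(a - b) for a, b in zip(x_back + v_back, x0 + [-vi for vi in v0])
)
```

Both `energy_drift` and `reversal_error` are zero up to floating-point rounding. Volume preservation is also visible here: in each coordinate the flow is a rotation of the $(x_{i},v_{i})$ plane, whose Jacobian determinant is $\cos^{2}t+\sin^{2}t=1$.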

4 Hamiltonian Monte Carlo

To improve on the running time of the Langevin algorithms, one must find a way to take longer steps while still conserving the target distribution. Hamiltonian Monte Carlo (HMC) algorithms accomplish this by taking advantage of conservation properties of Hamiltonian dynamics. These conservation properties allow HMC to choose the momentum at the beginning of each step and simulate the trajectory of the particle for a long time. HMC is a large class of algorithms, and includes the Langevin algorithms and RWM as special cases.

Each step of the HMC Markov chain $X_{1},X_{2},\ldots$ is determined by first sampling a new independent momentum $\xi\sim N(0,I_{d})$ , and then running Hamilton’s equations for a fixed time $T$ , that is $X_{i}=x_{T}(X_{i-1},\xi)$ . This is called the idealized HMC, since its trajectories are the continuous solutions to Hamilton’s equations rather than a numerical approximation.

Input: First-order oracle for $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$, an initial point $X_{0}\in\mathbb{R}^{d}$, $T\in\mathbb{R}_{>0}$, $k\in\mathbb{N}$
for $i=1,\ldots,k$ do
    Sample a momentum $\xi\sim N(0,I_{d})$
    Set $(X_{i},V_{i})=\varphi_{T}(X_{i-1},\xi)$
end for
Output $X_{k}$

Algorithm 1 Idealized Hamiltonian Monte Carlo
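Algorithm 1 can only be run literally when the flow $\varphi_{T}$ is available in closed form. The sketch below does this for one such case, assuming the target $N(0,I_{d})$ (that is, $f(x)=\frac{1}{2}\|x\|^{2}$), where the flow is a rotation. The names `harmonic_flow`, `idealized_hmc`, and `sample_many` are hypothetical, and the `flow` argument stands in for an exact Hamiltonian flow that is generally not computable.

```python
import math
import random


def harmonic_flow(x, v, t):
    """Exact flow phi_t for f(x) = ||x||^2 / 2 (assumed example potential)."""
    c, s = math.cos(t), math.sin(t)
    x_t = [c * xi + s * vi for xi, vi in zip(x, v)]
    v_t = [-s * xi + c * vi for xi, vi in zip(x, v)]
    return x_t, v_t


def idealized_hmc(x0, T, k, flow, rng):
    """Run k steps of idealized HMC from x0 and return X_k."""
    x = list(x0)
    for _ in range(k):
        # Sample a fresh momentum xi ~ N(0, I_d).
        momentum = [rng.gauss(0.0, 1.0) for _ in x]
        # Follow the exact flow for time T and keep only the position.
        x, _ = flow(x, momentum, T)
    return x


def sample_many(n_chains, d, T, k, seed=0):
    """Run independent chains, all started far from the mode at (3, ..., 3)."""
    rng = random.Random(seed)
    return [idealized_hmc([3.0] * d, T, k, harmonic_flow, rng)
            for _ in range(n_chains)]
```

For example, `sample_many(2000, 2, 1.0, 10)` returns points whose empirical mean and variance are close to $0$ and $1$. For this particular target, each step acts as $X_{i}=\cos(T)X_{i-1}+\sin(T)\xi$, so the choice $T=\frac{\pi}{2}$ would produce an exact sample in a single step; this is a feature of the Gaussian example and not of HMC in general.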

Despite its simplicity, popularity, and the widespread belief that HMC is faster than its competitor algorithms in a wide range of high-dimensional sampling problems [Cre88, BPR⁺13, Nea11, BBG14], its theoretical properties are less well understood than those of its older competitor MCMC algorithms, such as the Random Walk Metropolis [MPS⁺12] or Langevin [DM19, DMss, Dal17] algorithms. Thus, for instance, it is more difficult to tune the parameters of HMC. Several papers have made progress in bridging this gap, showing that HMC is geometrically ergodic for a large class of problems [LBBG19, DMS17] and proving quantitative bounds for the convergence rate of the idealized HMC for special cases [SRSH14, MS18].

These analyses benefit from the observation that there are various invariants that are preserved along the Hamiltonian trajectories, which in principle obviate the need for a Metropolis step, and raise the possibility of taking very long steps (large $T$). Using these invariance properties, we first show that the idealized HMC “preserves” the target density (Theorem 5.1) and subsequently give a dimension-independent bound on $T$ when $f$ is strongly convex and smooth (Theorem 6.1). While not the focus of this article, in Section 7, we discuss different numerical integrators (approximate algorithms to simulate Hamiltonian dynamics in the non-idealized setting) and bounds associated to them.

5 Stationarity: HMC Preserves the Target Density

Recall that for $(x,v)\in\mathbb{R}^{d}\times\mathbb{R}^{d}$ and $T\geq 0$ , $\varphi_{T}(x,v)$ denotes the position in the phase space of the particle moving according to the Hamiltonian dynamics with respect to a Hamiltonian $H(x,v)=f(x)+\frac{1}{2}\|v\|^{2}$ . Let $\mu$ be the Lebesgue measure on $\mathbb{R}^{d}\times\mathbb{R}^{d}$ with respect to which all densities are defined.

Theorem 5.1

Let $f:\mathbb{R}^{d}\to\mathbb{R}$ be a differentiable function. Let $T>0$ be the step size of the HMC. Suppose $(X,V)$ is a sample from the density

\pi(x,v)=\frac{e^{-f(x)-\frac{1}{2}\|v\|^{2}}}{\int e^{-f(y)-\frac{1}{2}\|w\|^{2}}d\mu(y,w)}.

Then the density of $\varphi_{T}(X,V)$ is $\pi$ for any $T\geq 0$. Moreover, the density of $\varphi_{T}(X,\xi)$, where $\xi\sim N(0,I_{d})$, is also $\pi$. Thus, the idealized HMC algorithm preserves $\pi$.

The proof of this theorem relies heavily on the properties of Hamiltonian dynamics. For $T\geq 0$, let $(\tilde{x},\tilde{v})=\varphi_{T}(x,v)$. Then, time reversibility of Hamiltonian dynamics implies that

(x,-v)=\varphi_{T}(\tilde{x},-\tilde{v}).

Combining this with the conservation of the Hamiltonian along trajectories, we get

H(x,-v)=H(\varphi_{T}(\tilde{x},-\tilde{v}))=H(\tilde{x},-\tilde{v}).

Thus,

\begin{align*}
H(x,v)&=f(x)+\frac{1}{2}\|v\|^{2}\\
&=H(x,-v)\\
&=H(\varphi_{T}(\tilde{x},-\tilde{v}))\\
&=H(\tilde{x},-\tilde{v})\\
&=f(\tilde{x})+\frac{1}{2}\|-\tilde{v}\|^{2}\\
&=H(\tilde{x},\tilde{v}).
\end{align*}

Thus, $e^{-H(x,v)}$, the (unnormalized) density value that the flow carries from $(x,v)$ to $(\tilde{x},\tilde{v})$, is the same as $e^{-H(\tilde{x},\tilde{v})}$. Let $\mu_{*}$ be the pushforward of $\mu$ under the map $\varphi_{T}$. The property that Hamiltonian dynamics preserves volume in phase space implies that $\mu_{*}=\mu$, as the determinant of the Jacobian of the map $\varphi_{T}$ is $1$. Thus, the density $\pi$ remains invariant under $\varphi_{T}$:

\tilde{\pi}(\tilde{x},\tilde{v})=\frac{e^{-H(\tilde{x},\tilde{v})}}{\int e^{-H(y,w)}d\mu_{*}(y,w)}=\frac{e^{-H(\tilde{x},\tilde{v})}}{\int e^{-H(y,w)}d\mu(y,w)}=\pi(\tilde{x},\tilde{v}).

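To see the second part, note that under $\pi$ the position and velocity are independent, and the marginal density of $v$ drawn from $\pi$ is the same as that of $N(0,I_{d})$. Replacing the velocity by a fresh $\xi\sim N(0,I_{d})$ therefore leaves the joint density $\pi$ unchanged. Thus, $\pi$ is an invariant density of the idealized HMC algorithm.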
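The two ingredients of this proof, invariance of $e^{-H}$ along the flow and a unit Jacobian determinant, can be checked numerically in a simple case. The sketch below assumes the one-dimensional potential $f(x)=cx^{2}$ with an arbitrary illustrative value $c=1.5$, for which the flow is known in closed form; `flow_1d`, `neg_hamiltonian`, and `jacobian_det` are hypothetical names, and the Jacobian is estimated by central finite differences.

```python
import math


def flow_1d(x, v, t, c=1.5):
    """Exact Hamiltonian flow for the 1-D potential f(x) = c * x^2 (grad f = 2cx)."""
    w = math.sqrt(2.0 * c)  # oscillation frequency
    x_t = x * math.cos(w * t) + (v / w) * math.sin(w * t)
    v_t = -x * w * math.sin(w * t) + v * math.cos(w * t)
    return x_t, v_t


def neg_hamiltonian(x, v, c=1.5):
    """Logarithm of the unnormalized density exp(-f(x) - v^2 / 2)."""
    return -(c * x * x + 0.5 * v * v)


def jacobian_det(x, v, t, h=1e-6):
    """Determinant of the Jacobian of phi_t at (x, v), by central differences."""
    xp, vp = flow_1d(x + h, v, t)
    xm, vm = flow_1d(x - h, v, t)
    dx_dx, dv_dx = (xp - xm) / (2 * h), (vp - vm) / (2 * h)
    xp, vp = flow_1d(x, v + h, t)
    xm, vm = flow_1d(x, v - h, t)
    dx_dv, dv_dv = (xp - xm) / (2 * h), (vp - vm) / (2 * h)
    return dx_dx * dv_dv - dx_dv * dv_dx


x, v, T = 0.7, -1.1, 2.3
x_T, v_T = flow_1d(x, v, T)
# The unnormalized log-density is the same at (x, v) and at phi_T(x, v).
density_gap = abs(neg_hamiltonian(x_T, v_T) - neg_hamiltonian(x, v))
# The flow map has Jacobian determinant 1, so the base measure is preserved.
det = jacobian_det(x, v, T)
```

Here `density_gap` is zero and `det` is one, up to rounding and finite-difference error. Together these two facts are exactly what the change-of-variables argument above uses.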

6 Convergence: Running Time Bound for Strongly Convex and Smooth Potentials

For densities $\pi_{1}$ and $\pi_{2}$ with the same base measure, the Wasserstein distance $W_{2}$ is defined to be the infimum, over all joint distributions of the random variables $X$ and $Y$ with marginals $\pi_{1}$ and $\pi_{2}$, of the expectation of the squared Euclidean distance $\|X-Y\|^{2}$. (With this convention, $W_{2}$ is the square of what is usually called the 2-Wasserstein distance.)

Theorem 6.1

Let $f:\mathbb{R}^{d}\to\mathbb{R}$ be a twice-differentiable function which satisfies $mI\preceq\nabla^{2}f(x)\preceq MI$ . Let $\nu_{k}$ be the distribution of $X_{k}$ at step $k\in\mathbb{Z}^{\star}$ from Algorithm 1. Suppose that both $\nu_{0}$ and $\pi$ have mean and variance bounded by $O(1)$ . Then given any $\varepsilon>0$ , for $T=\Omega\left(\frac{\sqrt{m}}{M}\right)$ and $k=O\left((\frac{M}{m})^{2}\log\frac{1}{\varepsilon}\right)$ , we have that $W_{2}(\nu_{k},\pi)\leq\varepsilon$ .

The bound in this theorem can be improved to $O\left(\frac{M}{m}\log\frac{1}{\varepsilon}\right)$; see [CV19]. To prove Theorem 6.1 we use the coupling method. In addition to the idealized HMC Markov chain $X$, which is initialized at some arbitrary point $X_{0}$, we also consider another “copy” $Y=Y_{0},Y_{1},\ldots$ of the idealized HMC Markov chain defined in Algorithm 1. To initialize $Y$ we imagine that we sample a point $Y_{0}$ from the density $\pi$. Since we have already shown that $\pi$ is a stationary density of the idealized HMC Markov chain (Theorem 5.1), $Y_{k}$ has distribution $\pi$ at every step $k\in\mathbb{Z}^{\ast}$.

To show that the density of $X$ converges to $\pi$ in the Wasserstein distance, we design a coupling of the two Markov chains such that the distance between $X_{k}$ and $Y_{k}$ contracts at each step $k$. If $\pi$ has mean and variance bounded by $O(1)$, and we initialize, e.g., $X_{0}=0$, then we have that $W_{2}(\nu_{0},\pi)=O(1)$ as well since $Y_{0}\sim\pi$. Indeed,

W_{2}(\nu_{0},\pi)\leq\mathbb{E}[\|Y_{0}-X_{0}\|^{2}]=\mathbb{E}[\|Y_{0}\|^{2}]\leq\mathrm{Var}(Y_{0})+\|\mathbb{E}[Y_{0}]\|^{2}=O(1).

Thus, if we can find a coupling such that $\|X_{k}-Y_{k}\|\leq(1-\gamma)\|X_{k-1}-Y_{k-1}\|$ for each $k$ and some $0<\gamma<1$ , we would have that

W_{2}(\nu_{k},\pi)\leq\mathbb{E}[\|X_{k}-Y_{k}\|^{2}]\leq(1-\gamma)^{2k}\mathbb{E}[\|X_{0}-Y_{0}\|^{2}]=(1-\gamma)^{2k}\times O(1).

We define a coupling of $X$ and $Y$ as follows: At each step $i\geq 1$ of the Markov chain $X$ we sample an initial momentum $\xi_{i-1}\sim N(0,I_{d})$ , and set $(X_{i},V_{i})=\varphi_{T}(X_{i-1},\xi_{i-1})$ . And at each step $i\geq 1$ of the Markov chain $Y$ we sample momentum $\xi_{i-1}^{\prime}\sim N(0,I_{d})$ , and set $(Y_{i},V_{i})=\varphi_{T}(Y_{i-1},\xi_{i-1}^{\prime})$ . To couple the two Markov chains, we will give the same initial momentum $\xi_{i-1}^{\prime}\leftarrow\xi_{i-1}$ to the Markov chain $Y$ ; see Figure 1. This coupling preserves the marginal density of $Y$ since, in this coupling, $\xi_{i-1}^{\prime}$ and $\xi_{i-1}$ both have marginal density $N(0,I_{d})$ . Thus, we still have that the marginal density of $Y_{i}$ is $\pi$ at each step $i$ .

Figure 1: Coupling two copies $X$ (blue) and ${Y}$ (red) of idealized HMC by choosing the same momentum at every step.

Spherical harmonic oscillator.

As a simple example, we consider the spherical harmonic oscillator with potential function $f(x)\coloneqq\frac{1}{2}x^{\top}x$ . In this case the gradient is $\nabla f(x)=x$ at every point $x$ . Define $(x_{t},v_{t})\coloneqq\varphi_{t}(X_{i},\xi_{i})$ to be the position and momentum of the Hamiltonian flow which determines the update of the Markov chain $X$ at each step $i$ , and $(y_{t},u_{t})\coloneqq\varphi_{t}(Y_{i},\xi_{i})$ to be the position and momentum of the Hamiltonian flow which determines the update of the Markov chain $Y$ . Then we have

\frac{{d}v_{t}}{{d}t}-\frac{{d}u_{t}}{{d}t}=-\nabla f(x_{t})+\nabla f(y_{t})=y_{t}-x_{t}.

Thus, the difference between the forces on the particle at $x_{t}$ and the particle at $y_{t}$ points in the direction from $x_{t}$ to $y_{t}$. This means that, in the case of the spherical harmonic oscillator, after a sufficient amount of time $t$ the two particles will meet and we will have $x_{t}-y_{t}=0$.

We can solve for the trajectory of $x_{t}-y_{t}$ exactly. We have

\frac{{d}^{2}(x_{t}-y_{t})}{{d}t^{2}}=-(x_{t}-y_{t}).

(1)

Since the two Markov chains are coupled such that the particles $x$ and $y$ have the same initial momentum, $v_{0}=u_{0}$, the initial condition on the derivative for the ODE in Equation (1) is $\frac{d(x_{t}-y_{t})}{dt}=v_{0}-u_{0}=0$ at $t=0$. Therefore, the solution to this ODE is

x_{t}-y_{t}=\cos(t)\times(x_{0}-y_{0}).

(2)

Hence, after time $t=\frac{\pi}{2}$ we have $x_{t}-y_{t}=0$ .
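The coupling for the spherical harmonic oscillator is easy to simulate, since the flow for $f(x)=\frac{1}{2}x^{\top}x$ is a rotation. The following sketch gives both chains the same momentum and checks that their distance shrinks by the factor $|\cos(T)|$ from Equation (2); the helper names `harmonic_flow`, `coupled_step`, and `distance` are hypothetical.

```python
import math
import random


def harmonic_flow(x, v, t):
    """Exact flow phi_t for f(x) = x^T x / 2."""
    c, s = math.cos(t), math.sin(t)
    x_t = [c * xi + s * vi for xi, vi in zip(x, v)]
    v_t = [-s * xi + c * vi for xi, vi in zip(x, v)]
    return x_t, v_t


def coupled_step(x, y, T, rng):
    """One idealized HMC step for both chains using the SAME momentum."""
    momentum = [rng.gauss(0.0, 1.0) for _ in x]
    x_new, _ = harmonic_flow(x, momentum, T)
    y_new, _ = harmonic_flow(y, momentum, T)
    return x_new, y_new


def distance(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


rng = random.Random(0)
x, y = [5.0, -3.0, 2.0], [0.1, 0.4, -0.7]
d0 = distance(x, y)

# After time T = 1 the distance has shrunk by exactly |cos(1)|.
x_half, y_half = coupled_step(x, y, 1.0, rng)
ratio = distance(x_half, y_half) / d0

# After time T = pi / 2 the two chains coincide.
x_meet, y_meet = coupled_step(x, y, math.pi / 2, rng)
gap_at_quarter_period = distance(x_meet, y_meet)
```

The shared momentum cancels in the difference $x_{T}-y_{T}$, which is why the contraction factor does not depend on the random draw.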

General harmonic oscillator.

More generally, we can consider a harmonic oscillator $f(x)=\sum_{j=1}^{d}c_{j}x_{j}^{2}$ , where $m\leq c_{j}\leq M$ . In this case the gradient is $\nabla f(x)=2Cx$ , where $C$ is the diagonal matrix with $j$ -th diagonal entry $c_{j}$ .

In this case, we have

\frac{{d}v_{t}}{{d}t}-\frac{{d}u_{t}}{{d}t}=-\nabla f(x_{t})+\nabla f(y_{t})=2C(y_{t}-x_{t}).

Thus, since $c_{j}>0$ for all $j$, the difference $\frac{{d}v_{t}}{{d}t}-\frac{{d}u_{t}}{{d}t}$ between the force on the particle at $x_{t}$ and the force on the particle at $y_{t}$ has a component in the direction $y_{t}-x_{t}$. This means that, for small values of $t$, the two particles will move towards each other at a rate of roughly

\frac{1}{\|y_{t}-x_{t}\|}(y_{t}-x_{t})^{\top}(2C)(y_{t}-x_{t})t=\frac{1}{\|y_{t}-x_{t}\|}\|\sqrt{2C}(y_{t}-x_{t})\|^{2}t,

since the initial velocities $v_{0}$ and $u_{0}$ of the two particles are equal. However, unless $C$ is a multiple of the identity matrix, the difference between the forces also has a component orthogonal to $y_{t}-x_{t}$ . Thus, unlike in the case of the spherical harmonic oscillator, the distance between the two particles will not in general contract to zero for any value of $t$ .

As in the case of the spherical harmonic oscillator, we can compute the distance between the two particles at any value of $t$ by solving for the trajectory $x_{t}-y_{t}$ exactly. Here we have

\frac{{d}^{2}(x_{t}-y_{t})}{{d}t^{2}}=-2C(x_{t}-y_{t})

(3)

with initial conditions $\frac{d(x_{t}-y_{t})}{dt}=0$ at $t=0$. Since $C$ is diagonal, the ODE is separable along the coordinate directions, and we have, for all $j\in[d]$,

\frac{{d}^{2}(x_{t}[j]-y_{t}[j])}{{d}t^{2}}=-2c_{j}(x_{t}[j]-y_{t}[j])

(4)

with initial conditions $\frac{d(x_{t}[j]-y_{t}[j])}{dt}=0$ at $t=0$. Therefore, $x_{t}[j]-y_{t}[j]=\cos(\sqrt{2c_{j}}t)\times(x_{0}[j]-y_{0}[j])$. Hence, since $c_{j}\leq M$, the distance $|x_{t}[j]-y_{t}[j]|$ will contract up to at least time $T=\frac{\pi}{2}\times\frac{1}{\sqrt{2M}}$. In particular, for any $j$ such that $c_{j}=M$, after time $T=\frac{\pi}{2}\times\frac{1}{\sqrt{2M}}$ we have that $x_{T}[j]-y_{T}[j]=0$. Since the $c_{j}$ can take any values with $m\leq c_{j}\leq M$, there is in general no single value of $T$ such that $x_{T}[j]-y_{T}[j]=0$ for all $j$. Nevertheless, we can show that for $T=\frac{\pi}{2}\times\frac{1}{\sqrt{2M}}$,

\begin{align}
|x_{T}[j]-y_{T}[j]|&=\cos\left(\sqrt{2c_{j}}T\right)\times|x_{0}[j]-y_{0}[j]| \tag{5}\\
&\leq\cos\left(\sqrt{2m}\times\frac{\pi}{2}\times\frac{1}{\sqrt{2M}}\right)\times|x_{0}[j]-y_{0}[j]| \tag{6}\\
&\leq\left(1-\frac{1}{8}\left(\sqrt{2m}\times\frac{\pi}{2}\times\frac{1}{\sqrt{2M}}\right)^{2}\right)\times|x_{0}[j]-y_{0}[j]| \tag{7}\\
&\leq\left(1-\Omega\left(\frac{m}{M}\right)\right)\times|x_{0}[j]-y_{0}[j]|\qquad\forall j\in[d], \tag{8}
\end{align}

where the first inequality holds because $c_{j}\geq m$, $\cos$ is nonnegative and monotone decreasing on the interval $[0,\frac{\pi}{2}]$, and $\sqrt{2m}\,T\leq\sqrt{2c_{j}}\,T\leq\frac{\pi}{2}$ for $T=\frac{\pi}{2}\times\frac{1}{\sqrt{2M}}$ since $0<m\leq c_{j}\leq M$. The second inequality holds because $\cos(s)\leq 1-\frac{1}{8}s^{2}$ for all $s\in[0,\pi]$. Hence, we have that

\|x_{T}-y_{T}\|\leq\left(1-\Omega\left(\frac{m}{M}\right)\right)\times\|x_{0}-y_{0}\|.

(9)
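The per-coordinate contraction for the general harmonic oscillator can be checked numerically. The sketch below uses the closed-form solution $x_{t}[j]-y_{t}[j]=\cos(\sqrt{2c_{j}}t)(x_{0}[j]-y_{0}[j])$ derived above; the values of $m$, $M$, the $c_{j}$, and the initial difference are arbitrary illustrative choices, and `difference_after` and `norm` are hypothetical helpers.

```python
import math


def difference_after(diff0, c, T):
    """Coordinates of x_T - y_T for f(x) = sum_j c_j x_j^2 under the coupling."""
    return [math.cos(math.sqrt(2.0 * cj) * T) * dj for cj, dj in zip(c, diff0)]


def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))


m, M = 0.5, 4.0
c = [0.5, 1.0, 2.5, 4.0]        # curvatures with m <= c_j <= M
diff0 = [1.0, -2.0, 0.5, 3.0]   # x_0 - y_0

T = (math.pi / 2) / math.sqrt(2.0 * M)
diff_T = difference_after(diff0, c, T)

ratio = norm(diff_T) / norm(diff0)                       # actual contraction
cos_bound = math.cos(math.sqrt(2.0 * m) * T)             # right side of (6)
quad_bound = 1.0 - (math.sqrt(2.0 * m) * T) ** 2 / 8.0   # right side of (7)
```

In this example the coordinate with $c_{j}=M$ is driven to zero, the coordinate with $c_{j}=m$ contracts the least, and the overall ratio is below both bounds, in line with Equation (9).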

General strongly convex and smooth $f$ (sketch).

Recall that in the case of the spherical harmonic oscillator, $f(x)=mx^{\top}x$, the difference between the forces on the two particles is

-\nabla f(x_{t})+\nabla f(y_{t})=2m(y_{t}-x_{t})

(10)

and thus, the difference between the force acting on $x_{t}$ and the force acting on $y_{t}$ is a vector which points exactly in the direction of the vector from $x_{t}$ to $y_{t}$ .

In more general settings where $f$ is $m$ -strongly convex and $M$ -smooth, but not necessarily a quadratic/harmonic oscillator potential, strong convexity implies that the component of the vector $-\nabla f(x_{t})+\nabla f(y_{t})$ in the direction $(y_{t}-x_{t})$ still points in the direction $(y_{t}-x_{t})$ and still has magnitude at least $2m\|y_{t}-x_{t}\|$ . However, unless $m=M$ , the difference in the forces, $-\nabla f(x_{t})+\nabla f(y_{t})$ , may also have a component orthogonal to the vector $y_{t}-x_{t}$ . Since $f$ is $M$ -smooth, this orthogonal component has magnitude no larger than $2M\|y_{t}-x_{t}\|$ .

Thus, since the two Markov chains are coupled such that the two particles have the same initial velocity, that is, $v_{0}-u_{0}=0$ , we have, in the worst case,

\begin{align}
x_{t}-y_{t}&=x_{0}-y_{0}+(v_{0}-u_{0})\times t-mt^{2}(x_{0}-y_{0})+Mt^{2}z\|x_{0}-y_{0}\|+\textrm{Higher-order terms}(t) \tag{11}\\
&=x_{0}-y_{0}-mt^{2}(x_{0}-y_{0})+Mt^{2}z\|x_{0}-y_{0}\|+\textrm{Higher-order terms}(t), \tag{12}
\end{align}

where $z$ is a unit vector orthogonal to $(x_{0}-y_{0})$ . We would like to determine the value of $t$ which minimizes $\|x_{t}-y_{t}\|$ , and the extent to which the distance $\|x_{t}-y_{t}\|$ contracts at this value of $t$ . In this proof sketch we will ignore the higher-order terms. These higher-order terms can be bounded using comparison theorems for ordinary differential equations; see Section 4 of [MSss].

Ignoring the higher-order terms, we have (since $z$ is a unit vector orthogonal to $(x_{0}-y_{0})$ ),

\begin{align}
\|x_{t}-y_{t}\|^{2}&\leq(1-mt^{2})^{2}\|x_{0}-y_{0}\|^{2}+(Mt^{2})^{2}\|x_{0}-y_{0}\|^{2} \tag{13}\\
&=\left((1-mt^{2})^{2}+(Mt^{2})^{2}\right)\|x_{0}-y_{0}\|^{2} \tag{14}\\
&=\left(1-2mt^{2}+m^{2}t^{4}+M^{2}t^{4}\right)\|x_{0}-y_{0}\|^{2} \tag{15}\\
&\leq\left(1-2mt^{2}+2M^{2}t^{4}\right)\|x_{0}-y_{0}\|^{2}. \tag{16}
\end{align}

The RHS of (16) is minimized at $t=\frac{\sqrt{m}}{\sqrt{2}M}$. Thus, for $t=\frac{\sqrt{m}}{\sqrt{2}M}$ we have:

\begin{align}
\|x_{t}-y_{t}\|^{2}&\leq\left(1-2m\left(\frac{\sqrt{m}}{\sqrt{2}M}\right)^{2}+2M^{2}\left(\frac{\sqrt{m}}{\sqrt{2}M}\right)^{4}\right)\times\|x_{0}-y_{0}\|^{2} \tag{17}\\
&=\left(1-\frac{m^{2}}{2M^{2}}\right)\times\|x_{0}-y_{0}\|^{2}. \tag{18}
\end{align}

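Thus, since $\sqrt{1-a}\leq 1-\frac{a}{2}$ for $a\in[0,1]$, we have $\|x_{t}-y_{t}\|\leq(1-\gamma)\|x_{0}-y_{0}\|$ with $\gamma=\frac{m^{2}}{4M^{2}}$ for $T=\Theta\left(\frac{\sqrt{m}}{M}\right)$.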
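The optimization of the quartic bound $1-2mt^{2}+2M^{2}t^{4}$ can be sanity-checked numerically. The sketch below compares the closed-form minimizer $t=\frac{\sqrt{m}}{\sqrt{2}M}$ against a grid search; the values $m=1$, $M=3$ are arbitrary, and `contraction_bound` and `optimal_time` are hypothetical helper names.

```python
import math


def contraction_bound(t, m, M):
    """Bound on ||x_t - y_t||^2 / ||x_0 - y_0||^2, ignoring higher-order terms."""
    return 1.0 - 2.0 * m * t ** 2 + 2.0 * M ** 2 * t ** 4


def optimal_time(m, M):
    """Minimizer of the bound: set the derivative in t^2 to zero."""
    return math.sqrt(m) / (math.sqrt(2.0) * M)


m, M = 1.0, 3.0
t_star = optimal_time(m, M)
best = contraction_bound(t_star, m, M)  # equals 1 - m^2 / (2 M^2)

# Grid search over t in (0, 1] as an independent check of the minimizer.
grid_min = min(contraction_bound(i * 1e-3, m, M) for i in range(1, 1001))

# Contraction factor for the distance itself (square root of the bound).
gamma = 1.0 - math.sqrt(best)
```

The square root step matters for bookkeeping: the bound on the squared distance is $1-\frac{m^{2}}{2M^{2}}$, so the distance itself contracts by a factor $1-\gamma$ with $\gamma\geq\frac{m^{2}}{4M^{2}}$, which is still of order $(m/M)^{2}$.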

7 Discretizing HMC

To prove (non-asymptotic) running time bounds on HMC, we must approximate $x$ and $v$ in the idealized HMC with some numerical method. One can use a numerical method such as the Euler [GH10] or leapfrog integrators [HLW03]. The earliest theoretical analyses of HMC were the asymptotic “optimal scaling” results of [KP91], for the special case when the target distribution is a multivariate Gaussian. Specifically, they showed that the Metropolis-adjusted implementation of HMC with leapfrog integrator requires a numerical step size of $O^{*}(d^{-\frac{1}{4}})$ to maintain an $\Omega(1)$ Metropolis acceptance probability in the limit as the dimension $d\rightarrow\infty$ . They then showed that for this choice of numerical step size the number of numerical steps HMC requires to obtain samples from Gaussian targets with a small autocorrelation is $O^{*}(d^{\frac{1}{4}})$ in the large- $d$ limit. [PST12] extended their asymptotic analysis of the acceptance probability to more general classes of separable distributions.

The earliest non-asymptotic analysis of an HMC Markov chain was provided in [SRSH14] for an idealized version of HMC based on continuous Hamiltonian dynamics, in the special case of Gaussian target distributions. As mentioned earlier, [MS17] show that idealized HMC can sample from general $m$-strongly logconcave target distributions with $M$-Lipschitz gradient in $\tilde{O}(\kappa^{2})$ steps, where $\kappa=\frac{M}{m}$ (see also [BRSS17, BREZ20] for more work on idealized HMC). They also show that an unadjusted implementation of HMC with first-order discretization can sample with Wasserstein error $\varepsilon>0$ in $\tilde{O}(d^{\frac{1}{2}}\kappa^{6.5}\varepsilon^{-1})$ gradient evaluations. In addition, they show that a second-order discretization of HMC can sample from separable target distributions in $\tilde{O}(d^{\frac{1}{4}}\varepsilon^{-1}g(m,M,B))$ gradient evaluations, where $g$ is an unknown (non-polynomial) function of $m,M,B$, if the operator norms of the first four Fréchet derivatives of the restriction of the potential $f$ to the coordinate directions are bounded by $B$. [LV18] use the conductance method to show that an idealized version of the Riemannian variant of HMC (RHMC) has mixing time with total variation (TV) error $\varepsilon>0$ of roughly $\tilde{O}(\frac{1}{\psi^{2}T^{2}}R\log(\frac{1}{\varepsilon}))$, for any $0\leq T\leq d^{-\frac{1}{4}}$, where $R$ is a regularity parameter for the potential $f$ and $\psi$ is an isoperimetric constant for $\pi$. Metropolized variants of HMC have also been studied recently; see [CDWY20] and the references therein.

A second-order discretization.

Here we discuss the approach of [MV18]. They approximate a Hamiltonian trajectory with a second-order Euler integrator that iteratively computes second-order Taylor expansions $(\hat{x}_{\eta},\hat{v}_{\eta})$ of Hamilton’s equations, where

\hat{x}_{\eta}(x,v)=x+v\eta-\frac{1}{2}\eta^{2}\nabla{f}(x),\qquad\hat{v}_{\eta}(x,v)=v-\eta\nabla{f}(x)-\frac{1}{2}\eta^{2}\nabla^{2}f(x)v.

Here, $\eta>0$ is the parameter corresponding to the “step size”. If we approximate

\nabla^{2}f(x)v\approx\frac{\nabla f(\hat{x}_{\eta})-\nabla f(x)}{\eta},

we obtain the following numerical integrator:

\hat{x}_{\eta}(x,v)=x+v\eta-\frac{1}{2}\eta^{2}\nabla{f}(x),\qquad\hat{v}_{\eta}(x,v)=v-\frac{1}{2}\eta\left(\nabla f(x)+\nabla f(\hat{x}_{\eta})\right).

This leads to the following algorithm.

Input: First-order oracle for $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$, an initial point $X_{0}\in\mathbb{R}^{d}$, $T\in\mathbb{R}_{>0}$, $k\in\mathbb{N}$, $\eta>0$
for $i=1,\ldots,k$ do
    Sample $\xi\sim N(0,I_{d})$
    Set $q_{0}=X_{i-1}$ and $p_{0}=\xi$
    for $j=1,\ldots,\frac{T}{\eta}$ do
        Set $q_{j}=q_{j-1}+\eta p_{j-1}-\frac{1}{2}\eta^{2}\nabla f(q_{j-1})$ and $p_{j}=p_{j-1}-\frac{1}{2}\eta\left(\nabla f(q_{j-1})+\nabla f(q_{j})\right)$
    end for
    Set $X_{i}=q_{\frac{T}{\eta}}(X_{i-1},\xi)$
end for
Output $X_{k}$

Algorithm 2 Unadjusted Hamiltonian Monte Carlo
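A minimal Python sketch of this unadjusted scheme follows. It is an illustration under stated assumptions rather than the implementation of [MV18]: the names `second_order_step` and `unadjusted_hmc` are hypothetical, the number of inner steps is taken to be $T/\eta$ rounded to the nearest integer, and the velocity update uses the sum $\nabla f(q_{j-1})+\nabla f(q_{j})$, which is what substituting the finite-difference approximation of $\nabla^{2}f(x)v$ into the second-order Taylor expansion gives.

```python
import math
import random


def second_order_step(q, p, grad_f, eta):
    """One step of the second-order integrator with step size eta."""
    g_old = grad_f(q)
    # Position update: second-order Taylor expansion in eta.
    q_new = [qj + eta * pj - 0.5 * eta * eta * gj
             for qj, pj, gj in zip(q, p, g_old)]
    g_new = grad_f(q_new)
    # Velocity update: average of the gradients at the old and new positions.
    p_new = [pj - 0.5 * eta * (go + gn)
             for pj, go, gn in zip(p, g_old, g_new)]
    return q_new, p_new


def unadjusted_hmc(grad_f, x0, T, k, eta, rng):
    """Run k outer steps; each integrates for time T with no Metropolis filter."""
    n_inner = max(1, round(T / eta))  # assumption: T / eta rounded to an integer
    x = list(x0)
    for _ in range(k):
        q = x
        p = [rng.gauss(0.0, 1.0) for _ in x]  # fresh momentum xi ~ N(0, I_d)
        for _ in range(n_inner):
            q, p = second_order_step(q, p, grad_f, eta)
        x = q
    return x
```

For example, with the gradient `lambda z: list(z)` of $f(x)=\frac{1}{2}\|x\|^{2}$, integrating from $q=1$, $p=0$ for $100$ steps of size $\eta=0.01$ lands close to the exact flow value $(\cos 1,-\sin 1)$. As written, the sketch evaluates the gradient twice per inner step; the value `g_new` could be cached and reused as the next `g_old`, so that one new gradient evaluation per step suffices.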

It can be shown that under a mild “regularity” condition, this unadjusted HMC requires at most (roughly) $O\left(d^{\frac{1}{4}}\varepsilon^{-\frac{1}{2}}\right)$ gradient evaluations; see [MV18] for details.

Higher-order integrators.

More generally, one can replace the second-order Taylor expansion with a $k$th-order Taylor expansion for any $k\geq 1$, to obtain a $k$th-order numerical integrator for Hamiltonian trajectories. Unfortunately, the number of components in the Taylor expansion grows exponentially with $k$ because of the product rule, so it is difficult to compute the Taylor expansion exactly. However, it is possible to compute this expansion approximately using polynomial interpolation methods such as the “Collocation Method” of [LV17] that was used for the special case when $f$ is constant on a polytope [LV18]. While higher-order integrators can give theoretical bounds, they are generally unstable in practice as they are not symplectic.

Developing practical higher-order methods and identifying other interesting regularity conditions on the target density that lead to fast algorithms remain interesting future directions.

Acknowledgments

The author would like to thank Oren Mangoubi for his help with the proof of Theorem 6.1 which initially appeared in [MS17], and Yin-Tat Lee, Anay Mehrotra, and Andre Wibisono for useful comments. The author would also like to acknowledge the support of NSF CCF-1908347.

References

[AK91] David Applegate and Ravi Kannan. Sampling and integration of near log-concave functions. In Proceedings of the twenty-third annual ACM symposium on Theory of computing, pages 156–163. ACM, 1991.
[BBG14] MJ Betancourt, Simon Byrne, and Mark Girolami. Optimizing the integrator step size for Hamiltonian Monte Carlo. arXiv preprint arXiv:1411.6669, 2014.
[BG15] Michael Betancourt and Mark Girolami. Hamiltonian Monte Carlo for hierarchical models. Current trends in Bayesian methodology with applications, 79:30, 2015.
[BPR⁺13] Alexandros Beskos, Natesh Pillai, Gareth Roberts, Jesus-Maria Sanz-Serna, and Andrew Stuart. Optimal tuning of the hybrid Monte Carlo algorithm. Bernoulli, 19(5A):1501–1534, 2013.
[BREZ20] Nawaf Bou-Rabee, Andreas Eberle, and Raphael Zimmer. Coupling and convergence for Hamiltonian Monte Carlo. The Annals of Applied Probability, 30(3):1209–1250, 2020.
[BRSS17] Nawaf Bou-Rabee and Jesús María Sanz-Serna. Randomized Hamiltonian Monte Carlo. The Annals of Applied Probability, 27(4):2159–2194, 2017.
[CCBJ18] Xiang Cheng, Niladri S. Chatterji, Peter L. Bartlett, and Michael I. Jordan. Underdamped Langevin MCMC: A non-asymptotic analysis. In COLT, volume 75 of Proceedings of Machine Learning Research, pages 300–323. PMLR, 2018.
[CCHK89] EA Carter, Giovanni Ciccotti, James T Hynes, and Raymond Kapral. Constrained reaction coordinate dynamics for the simulation of rare events. Chemical Physics Letters, 156(5):472–477, 1989.
[CDWY20] Yuansi Chen, Raaz Dwivedi, Martin J. Wainwright, and Bin Yu. Fast mixing of Metropolized Hamiltonian Monte Carlo: Benefits of multi-step gradients. J. Mach. Learn. Res., 21:92:1–92:72, 2020.
[CFG14] Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, pages 1683–1691, 2014.
[CGH⁺16] Bob Carpenter, Andrew Gelman, Matt Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Michael A Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan: A probabilistic programming language. Journal of Statistical Software, 20:1–37, 2016.
[CGST11] K Andrew Cliffe, Mike B Giles, Robert Scheichl, and Aretha L Teckentrup. Multilevel Monte Carlo methods and applications to elliptic PDEs with random coefficients. Computing and Visualization in Science, 14(1):3, 2011.
[Cre88] M. Creutz. Global Monte Carlo algorithms for many-fermion systems. Phys. Rev. D, 38(4):1228, 1988.
[CV19] Zongchen Chen and Santosh S. Vempala. Optimal convergence rate of Hamiltonian Monte Carlo for strongly logconcave distributions. In Dimitris Achlioptas and László A. Végh, editors, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2019, September 20-22, 2019, Massachusetts Institute of Technology, Cambridge, MA, USA, volume 145 of LIPIcs, pages 64:1–64:12. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019.
[Dal17] Arnak S Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017.
[DCWY19] Raaz Dwivedi, Yuansi Chen, Martin J. Wainwright, and Bin Yu. Log-concave sampling: Metropolis-Hastings algorithms are fast. J. Mach. Learn. Res., 20:183:1–183:42, 2019.
[DFK91] Martin Dyer, Alan Frieze, and Ravi Kannan. A random polynomial-time algorithm for approximating the volume of convex bodies. Journal of the ACM (JACM), 38(1):1–17, 1991.
[DGR03] Jerome B Detemple, René Garcia, and Marcel Rindisbacher. A Monte Carlo method for optimal portfolios. The Journal of Finance, 58(1):401–446, 2003.
[DKPR87] Simon Duane, Anthony D Kennedy, Brian J Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics letters B, 195(2):216–222, 1987.
[DM17] Alain Durmus and Eric Moulines. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. The Annals of Applied Probability, 27(3):1551–1587, 2017.
[DM19] Alain Durmus and Eric Moulines. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Bernoulli, 25(4A):2854–2882, 2019.
[DMss] Alain Durmus and Eric Moulines. Non-asymptotic convergence analysis for the Unadjusted Langevin Algorithm. The Annals of Applied Probability, in press.
[DMS17] Alain Durmus, Eric Moulines, and Eero Saksman. On the convergence of Hamiltonian Monte Carlo. arXiv preprint arXiv:1705.00166, 2017.
[DPG⁺14] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in neural information processing systems, pages 2933–2941, 2014.
[DSDS08] Ana Cannas da Silva. Lectures on symplectic geometry. Springer, 2008.
[GBC16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
[GBW14] Maya R. Gupta, Samy Bengio, and Jason Weston. Training highly multiclass classifiers. J. Mach. Learn. Res., 15(1):1461–1492, January 2014.
[GGR97] Andrew Gelman, Walter R Gilks, and Gareth O Roberts. Weak convergence and optimal scaling of random walk Metropolis algorithms. The Annals of Applied Probability, 7(1):110–120, 1997.
[GH10] David F Griffiths and Desmond J Higham. Numerical methods for ordinary differential equations: initial value problems. Springer Science & Business Media, 2010.
[HLW03] Ernst Hairer, Christian Lubich, and Gerhard Wanner. Geometric numerical integration illustrated by the Störmer–Verlet method. Acta numerica, 12:399–450, 2003.
[KF09] Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning. The MIT Press, 2009.
[KP91] AD Kennedy and Brian Pendleton. Acceptances and autocorrelations in hybrid Monte Carlo. Nuclear Physics B-Proceedings Supplements, 20:118–121, 1991.
[LBBG19] Samuel Livingstone, Michael Betancourt, Simon Byrne, and Mark Girolami. On the geometric ergodicity of Hamiltonian Monte Carlo. Bernoulli, 25(4A):3109–3138, 2019.
[LS90] László Lovász and Miklós Simonovits. The mixing rate of Markov chains, an isoperimetric inequality, and computing the volume. In Foundations of Computer Science, 1990. Proceedings., 31st Annual Symposium on, pages 346–354. IEEE, 1990.
[LS92] László Lovász and Miklós Simonovits. On the randomized complexity of volume and diameter. In Foundations of Computer Science, 1992. Proceedings., 33rd Annual Symposium on, pages 482–492. IEEE, 1992.
[LS93] László Lovász and Miklós Simonovits. Random walks in a convex body and an improved volume algorithm. Random structures & algorithms, 4(4):359–412, 1993.
[LST21] Yin Tat Lee, Ruoqi Shen, and Kevin Tian. Structured logconcave sampling with a restricted Gaussian oracle. In Mikhail Belkin and Samory Kpotufe, editors, Conference on Learning Theory, COLT 2021, 15-19 August 2021, Boulder, Colorado, USA, volume 134 of Proceedings of Machine Learning Research, pages 2993–3050. PMLR, 2021.
[LV03] László Lovász and Santosh Vempala. Hit-and-run is fast and fun. preprint, Microsoft Research, 2003.
[LV06] László Lovász and Santosh Vempala. Fast algorithms for logconcave functions: Sampling, rounding, integration and optimization. In Foundations of Computer Science, 2006. FOCS’06. 47th Annual IEEE Symposium on, pages 57–68. IEEE, 2006.
[LV17] Yin Tat Lee and Santosh S Vempala. Geodesic walks in polytopes. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 927–940. ACM, 2017.
[LV18] Yin Tat Lee and Santosh S Vempala. Convergence rate of Riemannian Hamiltonian Monte Carlo and faster polytope volume computation. In STOC, 2018.
[MPS⁺12] Jonathan C Mattingly, Natesh S Pillai, Andrew M Stuart, et al. Diffusion limits of the random walk Metropolis algorithm in high dimensions. The Annals of Applied Probability, 22(3):881–930, 2012.
[MS17] Oren Mangoubi and Aaron Smith. Rapid mixing of Hamiltonian Monte Carlo on strongly log-concave distributions. arXiv preprint arXiv:1708.07114, 2017.
[MS18] Oren Mangoubi and Aaron Smith. Rapid mixing of geodesic walks on manifolds with positive curvature. The Annals of Applied Probability, 2018.
[MSss] Oren Mangoubi and Aaron Smith. Mixing of Hamiltonian Monte Carlo on strongly log-concave distributions: Continuous dynamics. Annals of Applied Probability, in press.
[MV18] Oren Mangoubi and Nisheeth K Vishnoi. Dimensionally tight running time bounds for second-order Hamiltonian Monte Carlo. In NeurIPS. Available at arXiv:1802.08898, 2018.
[Nea92] Radford M Neal. Bayesian training of backpropagation networks by the hybrid Monte Carlo method. Technical report, CRG-TR-92-1, Dept. of Computer Science, University of Toronto, 1992.
[Nea11] Radford M Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2:113–162, 2011.
[PST12] Natesh S Pillai, Andrew M Stuart, and Alexandre H Thiéry. Optimal scaling and diffusion limits for the Langevin algorithm in high dimensions. The Annals of Applied Probability, 22(6):2320–2356, 2012.
[RMZT02] Lula Rosso, Peter Mináry, Zhongwei Zhu, and Mark E Tuckerman. On the use of the adiabatic molecular dynamics technique in the calculation of free energy profiles. The Journal of chemical physics, 116(11):4389–4402, 2002.
[RR98] Gareth O Roberts and Jeffrey S Rosenthal. Optimal scaling of discrete approximations to Langevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(1):255–268, 1998.
[RRT17] Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In Conference on Learning Theory, pages 1674–1703, 2017.
[SM08] Ruslan Salakhutdinov and Andriy Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th ICML, pages 880–887. ACM, 2008.
[SRSH14] Christof Seiler, Simon Rubinstein-Salzedo, and Susan Holmes. Positive curvature and Hamiltonian Monte Carlo. In Advances in Neural Information Processing Systems, pages 586–594, 2014.
[War97] Bradley A Warner. Bayesian learning for neural networks (lecture notes in statistics vol. 118), Radford M. Neal. Journal of American Statistical Association, 92:791–791, 1997.
[WT11] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.
[ZRC13] Wenwei Zheng, Mary A Rohrdanz, and Cecilia Clementi. Rapid exploration of configuration space with diffusion-map-directed molecular dynamics. The journal of physical chemistry B, 117(42):12769–12776, 2013.

$\displaystyle\|x_{t}-y_{t}\|^{2}\leq(1-mt^{2})^{2}\|x_{0}-y_{0}\|^{2}+(Mt^{2})^{2}\|x_{0}-y_{0}\|^{2}$ (13)

$\displaystyle\phantom{\|x_{t}-y_{t}\|^{2}}=\left((1-mt^{2})^{2}+(Mt^{2})^{2}\right)\|x_{0}-y_{0}\|^{2}$ (14)

$\displaystyle\phantom{\|x_{t}-y_{t}\|^{2}}=\left(1-2mt^{2}+m^{2}t^{4}+M^{2}t^{4}\right)\|x_{0}-y_{0}\|^{2}$ (15)

$\displaystyle\phantom{\|x_{t}-y_{t}\|^{2}}\leq\left(1-2mt^{2}+2M^{2}t^{4}\right)\|x_{0}-y_{0}\|^{2}.$ (16)

For $t=\frac{\sqrt{m}}{\sqrt{2}M}$, (16) gives

$\displaystyle\|x_{t}-y_{t}\|^{2}\leq\left(1-2m\left(\frac{\sqrt{m}}{\sqrt{2}M}\right)^{2}+2M^{2}\left(\frac{\sqrt{m}}{\sqrt{2}M}\right)^{4}\right)\|x_{0}-y_{0}\|^{2}$ (17)

$\displaystyle\phantom{\|x_{t}-y_{t}\|^{2}}=\left(1-\frac{m^{2}}{2M^{2}}\right)\|x_{0}-y_{0}\|^{2}.$ (18)
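The arithmetic behind (18) can be checked numerically. The short Python sketch below evaluates the coefficient $1-2mt^{2}+2M^{2}t^{4}$ from (16) at $t=\frac{\sqrt{m}}{\sqrt{2}M}$ and compares it with the closed form $1-\frac{m^{2}}{2M^{2}}$. The helper name `contraction_factor` and the values $m=1$, $M=4$ are hypothetical choices made only for illustration.

```python
import math


def contraction_factor(m, M, t):
    """Coefficient on the right-hand side of (16): 1 - 2 m t^2 + 2 M^2 t^4."""
    return 1.0 - 2.0 * m * t**2 + 2.0 * M**2 * t**4


m, M = 1.0, 4.0  # hypothetical strong-convexity and smoothness constants, m <= M
t_star = math.sqrt(m) / (math.sqrt(2.0) * M)

factor = contraction_factor(m, M, t_star)
closed_form = 1.0 - m**2 / (2.0 * M**2)

# The two values agree up to floating-point rounding (both are about 0.96875).
print(factor, closed_form)
```

As a function of $t^{2}$, the coefficient in (16) is a convex quadratic minimized at $t^{2}=\frac{m}{2M^{2}}$, so this choice of $t$ gives the smallest value of the bound in (16).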
