This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Stein Self-Repulsive Dynamics: Benefits from Past Samples

Mao Ye
UT Austin
[email protected]
&Tongzheng Ren *
UT Austin
[email protected]
&Qiang Liu
UT Austin
[email protected]
Equal Contribution
Abstract

We propose a new Stein self-repulsive dynamics for obtaining diversified samples from intractable un-normalized distributions. Our idea is to introduce Stein variational gradient as a repulsive force to push the samples of Langevin dynamics away from the past trajectories. This simple idea allows us to significantly decrease the auto-correlation in Langevin dynamics and hence increase the effective sample size. Importantly, as we establish in our theoretical analysis, the asymptotic stationary distribution remains correct even with the addition of the repulsive force, thanks to the special properties of the Stein variational gradient. We perform extensive empirical studies of our new algorithm, showing that our method yields much higher sample efficiency and better uncertainty estimation than vanilla Langevin dynamics.

1 Introduction

Drawing samples from complex un-normalized distributions is one of the most basic problems in statistics and machine learning, with broad applications to enormous research fields that rely on probabilistic modeling. Over the past decades, large amounts of methods have been proposed for approximate sampling, including both Markov Chain Monte Carlo (MCMC) (e.g., Brooks et al., 2011) and variational inference (e.g., Wainwright et al., 2008; Blei et al., 2017).

MCMC works by simulating Markov chains whose stationary distributions match the distributions of interest. Despite nice asymptotic theoretical properties, MCMC is widely criticized for its slow convergence rate in practice. In difficult problems, the samples drawn from MCMC are often found to have high auto-correlation across time, meaning that the Markov chains explore very slowly in the configuration space. When this happens, the samples returned by MCMC only approximate a small local region, and under-estimate the probability of the regions un-explored by the chain.

Stein variational gradient descent (SVGD) (Liu and Wang, 2016) is a different type of approximate sampling methods designed to overcome the limitation of MCMC. Instead of drawing random samples sequentially, SVGD evolves a pre-defined number of particles (or sample points) in parallel with a special interacting particle system to match the distribution of interest by minimizing the KL divergence. In SVGD, the particles interact with each other to simultaneously move towards the high probability regions following the gradient direction, and also move away from each other due to a special repulsive force. As a result, SVGD allows us to obtain diversified samples that correctly represent the variation of the distribution of interest. SVGD has found applications in various challenging problems (e.g., Feng et al., 2017; Haarnoja et al., 2017; Pu et al., 2017; Liu et al., 2017a; Gong et al., 2019). See Han and Liu (e.g., 2018); Chen et al. (e.g., 2018); Liu et al. (e.g., 2019); Wang et al. (e.g., 2019a) for examples of extensions.

However, one problem of SVGD is that it theoretically requires to run an infinite number of chains in parallel in order to approximate the target distribution asymptotically (Liu, 2017). With a finite number of particles, the fixed point of SVGD does still provide a prioritized, partial approximation to the distribution in terms of the expectation of a special case of functions (Liu and Wang, 2018). Nevertheless, it is still desirable to develop a variant of “single-chain SVGD”, which only requires to run a single chain sequentially like MCMC to achieve the correct stationary distribution asymptotically in time, with no need to take the limit of infinite number of parallel particles.

In this work, we propose an example of single-chain SVGD by integrating the special repulsive mechanism of SVGD with gradient-based MCMC such as Langevin dynamics. Our idea is to use repulsive term of SVGD to enforce the samples in MCMC away from the past samples visited at previous iterations. Such a new self-repulsive dynamics allows us to decrease the auto-correlation in MCMC and hence increase the mixing rate, but still obtain the same stationary distribution thanks to the special property of the SVGD repulsive mechanism.

We provide thorough theoretical analysis of our new method, establishing its asymptotic convergence to the target distribution. Such result is highly non-trivial, as our new self-repulsive dynamic is a non-linear high-order Markov process. Empirically, we evaluate our methods on an array of challenging sampling tasks, showing that our method yields much better uncertainty estimation and larger effective sample size.

2 Background: Langevin dynamics & SVGD

This section gives a brief introduction on Langevin dynamics (Rossky et al., 1978) and Stein Variational Gradient Descent (SVGD) (Liu and Wang, 2016), which we integrate together to develop our new self-repulsive dynamics for more efficient sampling.

Langevin Dynamics  Langevin dynamics is a basic gradient based MCMC method. For a given target distribution supported on d\mathbb{R}^{d} with density function ρ(𝜽)exp(V(𝜽))\rho^{*}(\bm{\theta})\propto\exp({-V(\bm{\theta})}), where V:dV\colon\mathbb{R}^{d}\mapsto\mathbb{R} is the potential function, the (Euler-discrerized) Langevin dynamics simulates a Markov chain with the following rule:

𝜽k+1=𝜽kηV(𝜽k)+2η𝒆k,𝒆k𝒩(0,𝐈),\bm{\theta}_{k+1}=\bm{\theta}_{k}-\eta\nabla V(\bm{\theta}_{k})+\sqrt{2\eta}\bm{e}_{k},~{}~{}~{}~{}~{}~{}\bm{e}_{k}\sim\mathcal{N}(0,\mathbf{I}),

where kk denotes the number of iterations, {𝒆k}k=1\{\bm{e}_{k}\}_{k=1}^{\infty} are independent standard Gaussian noise, and η\eta is a step size parameter. It is well known that the limiting distribution of 𝜽k\bm{\theta}_{k} when kk\to\infty approximates the target distribution when η\eta is sufficiently small.

Because the updates in Langevin dynamics are local and incremental, new points generated by the dynamics can be highly correlated to the past sample, in which case we need to run Langevin dynamics sufficiently long in order to obtain diverse samples.

Stein Variatinal Gradient Descent (SVGD)  Different from Langevin dynamics, SVGD iteratively evolves a pre-defined number of particles in parallel. Starting from an initial set of particles {𝜽0i}i=1M\{\bm{\theta}_{0}^{i}\}_{i=1}^{M}, SVGD updates the MM particles in parallel by

𝜽k+1i=𝜽ki+η𝒈(𝜽ki;δ^kM),i=1,,M,\bm{\theta}_{k+1}^{i}=\bm{\theta}_{k}^{i}+\eta{\bm{g}}(\bm{\theta}_{k}^{i};~{}\hat{\delta}_{k}^{M}),~{}~{}~{}~{}\forall i=1,\ldots,M,

where 𝒈(𝜽ki;δ^kM){\bm{g}}(\bm{\theta}_{k}^{i};~{}\hat{\delta}_{k}^{M}) is a velocity field that depends the empirical distribution of the current set of particles δ^kM:=1Mj=1Mδ𝜽kj\hat{\delta}_{k}^{M}:=\frac{1}{M}\sum_{j=1}^{M}\delta_{\bm{\theta}_{k}^{j}} in the following way,

𝒈(𝜽ki;δ^kM)=𝔼𝜽δ^kM[K(𝜽,𝜽ki)V(𝜽)ConfiningTerm+𝜽K(𝜽,𝜽ki)RepulsiveTerm].\displaystyle{\bm{g}}(\bm{\theta}_{k}^{i};\hat{\delta}_{k}^{M})=\mathbb{E}_{\bm{\theta}\sim\hat{\delta}_{k}^{M}}\!\!\bigg{[}\underset{\mathrm{Confining\ Term}}{\text{$\underbrace{-K(\bm{\theta},\bm{\theta}_{k}^{i})\nabla V(\bm{\theta})}$}}+\underset{\mathrm{Repulsive\ Term}}{\underbrace{\nabla_{\bm{\theta}}K(\bm{\theta},\bm{\theta}_{k}^{i})}}\!\bigg{]}\!\!.

Here δ𝜽\delta_{\bm{\theta}} is the Dirac measure centered at 𝜽\bm{\theta}, and hence 𝔼𝜽δ^kM[]\mathbb{E}_{\bm{\theta}\sim\hat{\delta}_{k}^{M}}[\cdot] denotes averaging on the particles. The K(,)K(\cdot,\cdot) is a positive definite kernel, such as the RBF kernel, that can be specified by users.

Note that 𝒈(𝜽ki;δ^kM){\bm{g}}(\bm{\theta}_{k}^{i};~{}\hat{\delta}_{k}^{M}) consists of a confining term and repulsive term: the confining term pushes particles to move towards high density region, and the repulsive term prevents the particles from colliding with each other. It is the balance of these two terms that allows us to asymptotically approximate the target distribution ρ(𝜽)exp(V(𝜽))\rho^{*}(\bm{\theta})\propto\exp(-V(\bm{\theta})) at the fixed point, when the number of particles goes to infinite. We refer the readers to Liu and Wang (2016); Liu (2017); Liu and Wang (2018) for thorough theoretical justifications of SVGD. But a quick, informal way to justify the SVGD update is through the Stein’s identity, which shows that the velocity field 𝒈(𝜽;ρ){\bm{g}}(\bm{\theta};~{}\rho) equals zero when ρ\rho equals the true distribution ρ\rho^{*}, that is, 𝜽d\forall\bm{\theta}^{\prime}\in\mathbb{R}^{d},

𝒈(𝜽;ρ)=𝔼𝜽ρ[K(𝜽,𝜽)V(𝜽)+𝜽K(𝜽,𝜽)]=0.\displaystyle{\bm{g}}(\bm{\theta}^{\prime};\rho^{*})=\mathbb{E}_{\bm{\theta}\sim\rho^{*}}\left[-K(\bm{\theta},\bm{\theta}^{\prime})\nabla V(\bm{\theta})+\nabla_{\bm{\theta}}K(\bm{\theta},\bm{\theta}^{\prime})\right]=0. (1)

This equation shows that, the target distributions forms a fixed point of the update, and SVGD would converge if the particle distribution δ^kM\hat{\delta}_{k}^{M} gives a close approximation to the target distribution ρ\rho^{*}.

3 Stein Self-Repulsive Dynamics

In this work, we propose to integrate Langevin dynamics and SVGD to simultaneously decrease the auto-correlation of Langevin dynamics and eliminate the need for running parallel chains in SVGD. The idea is to use Stein repulsive force between the the current particle and the particles from previous iterations, hence forming a new self-avoiding dynamics with fast convergence speed.

Specifically, assume we run a single Markov chain like Langevin dynamics, where 𝜽k\bm{\theta}_{k} denotes the sample at the kk-th iteration. Denote by δ~kM\tilde{\delta}_{k}^{M} the empirical measure of MM samples from the past iterations:

δ~kM:=1Mj=1Mδ𝜽kjcη,cη=c/η,\small\tilde{\delta}_{k}^{M}:=\frac{1}{M}\sum_{j=1}^{M}\delta_{\bm{\theta}_{k-jc_{\eta}}},~{}~{}~{}~{}~{}~{}~{}~{}c_{\eta}=c/\eta,

where cηc_{\eta} is a thinning factor, which scales inversely with the step size η\eta, introduced to slim the sequence of past samples. Compared with the δ^kM\hat{\delta}_{k}^{M} in SVGD, which is averaged over MM parallel particles, δ~kM\tilde{\delta}_{k}^{M} is averaged across time over MM past samples. Given this, our Stein self-repulsive dynamics updates the sample via

𝜽k+1𝜽k+(ηV(𝜽k)+2η𝒆k)Langevin+ηα𝒈(𝜽k;δ~kM)Stein Repulsive,\displaystyle\bm{\theta}_{k+1}\leftarrow\bm{\theta}_{k}~{}+~{}\underbrace{(-\eta V(\bm{\theta}_{k})+\sqrt{2\eta}\bm{e}_{k})}_{\text{Langevin}}~{}+~{}\underbrace{\eta\alpha{\bm{g}}(\bm{\theta}_{k};~{}\tilde{\delta}_{k}^{M})}_{\text{Stein Repulsive}}, (2)

in which the particle is updated with the typical Langevin gradient, plus a Stein repulsive force against the particles from the previous iterations. Here α0\alpha\geq 0 is a parameter that controls the magnitude of the Stein repulsive term. In this way, the particles are pushed away from the past samples, and hence admits lower auto-correlation and faster convergence speed. Importantly, the addition of the repulsive force does not impact the asymptotic stationary distribution, thanks to Stein’s identity in (1). This is because if the self-repulsive dynamics has converged to the target distribution ρ\rho^{*}, such that θkρ\theta_{k}\sim\rho^{*} for all kk, the Stein self-repulsive term would equal to zero in expectation due to Stein’s identity and hence does not introduce additional bias over Langevin dynamics. Rigorous theoretical analysis of this idea is developed in Section 4.

Practical Algorithm  Because δ~kM\tilde{\delta}_{k}^{M} is averaged across the past samples, it is necessary to introduce a burn-in phase with the repulsive dynamics. Therefore, our overall procedure works as follows,

𝜽k+1={𝜽kηV(𝜽k)+2η𝒆k,k<Mcη,𝜽k+η[V(𝜽k)+α𝒈(𝜽k;δ~kM)]+2η𝒆k,kMcη.\displaystyle\bm{\theta}_{k+1}=\begin{cases}\bm{\theta}_{k}\!-\!\eta\nabla V(\bm{\theta}_{k})\!+\!\sqrt{2\eta}\bm{e}_{k},&k<Mc_{\eta},\\ \bm{\theta}_{k}\!+\!\eta\left[-\nabla V(\bm{\theta}_{k})\!+\!\alpha{\bm{g}}(\bm{\theta}_{k};\tilde{\delta}_{k}^{M})\right]\!+\!\sqrt{2\eta}\bm{e}_{k},&k\geq Mc_{\eta}.\end{cases} (3)

It includes two phases. The first phase is the same as the Langevin dynamics which collects the initial MM samples used in the second phase while serves as a warm start. The repulsive gradient update is introduced in the second phase to encourage the dynamics to visit the under-explored density region. We call this particular instance of our algorithm Self-Repulsive Langevin dynamics (SRLD), self-repulsive variants of more general dynamics is discussed in Section 5.

Remark  Note that the first phase is introduced to collect the initial MM samples. However, it’s not really necessary to generate the initial MM samples with Langevin dynamics. We can simply use some other initialization distribution and get MM initial samples from that distribution. In practice, we find using Langevin dynamics to collect the initial samples is natural and it can also be viewed as the burn-in phase before sampling, so we use (3) in all of the other experiments.

Remark  The general idea of introducing self-repulsive terms inside MCMC or other iterative algorithms is not new itself. For example, in molecular dynamics simulations, an algorithm called metadynamics (Laio and Parrinello, 2002) has been widely used, in which the particles are repelled away from the past samples in a way similar to our method, but with a typical repulsive function, such as jD(θk,θkjcη)\sum_{j}D(\theta_{k},\theta_{k-jc_{\eta}}), where D(,)D(\cdot,\cdot) can be any kind of dis-similarity. However, introducing an arbitrary repulsive force would alter the stationary distribution of the dynamics, introducing a harmful bias into the algorithm. Besides, the self-adjusting mechanism in Deng et al. (2020b) can also be viewed as a repulsive force using the multiplier in gradient. The key highlight of our approach, as reflected by our theoretical results in Section 4, is the unique property of the Stein repulsive term, that allows us to obtain the correct stationary distribution even with the addition of the repulsive force.

Remark  Recent works (Gallego and Insua, 2018; Zhang et al., 2018) also combine SVGD with Langevin dynamics, in which, however, the Langevin force is directly added to a set of particles that evolve in parallel with SVGD. Using our terminology, their system is

𝜽k+1i=𝜽ki+(ηV(𝜽ki)+2η𝒆ki)+ηα𝒈(𝜽ki;δ^kM),𝒆k𝒩(0,𝐈),i=1,,M.\displaystyle\bm{\theta}_{k+1}^{i}=\bm{\theta}_{k}^{i}+{(-\eta V(\bm{\theta}_{k}^{i})+\sqrt{2\eta}\bm{e}_{k}^{i})}+{\eta\alpha{\bm{g}}(\bm{\theta}_{k}^{i};~{}\hat{\delta}_{k}^{M})},\quad\bm{e}_{k}\sim\mathcal{N}(0,\mathbf{I}),\quad\forall i=1,\ldots,M.

This is significantly different from our method on both motivation and practical algorithm. Their algorithm still requires to simulate MM parallel chains of particles like SVGD, and was proposed to obtain easier theoretical analysis than the deterministic dynamics of SVGD. Our work is instead motivated by the practical need of decreasing the auto-correlation in Langevin dynamics, and avoiding the need of running multiple chains in SVGD, and hence must be based on self-repulsion against past samples along a single chain.

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 1: Illustrating the advantage of our Self-Repulsive Langevin dynamics. With a set of initial examples locating on the left part of the target distribution (yellow dots), Self-Repulsive Langevin dynamics is forced to explore the right part more frequently, yielding an accurate approximation when combined with the initial samples. Langevin dynamics, however, does not take the past samples into account and yields a poor overall approximation.

An Illustrative Example  We give an illustrative example to show the key advantage of our self-repulsive dynamics. Assume that we want to sample from a bi-variate Gaussian distribution shown in Figure 1. Unlike standard settings, we assume that we have already obtained some initial samples (yellow dots in Figure 1) before running the dynamics. The initial samples are assumed to concentrate on the left part of the target distribution as shown in Figure 1. In this extreme case, since the left part of the distribution is over-explored by the initial samples, it is desirable to have the subsequent new samples to concentrate more on the un-explored part of the distribution. However, standard Langevin dynamics does not take this into account, and hence yielding a biased overall representation of the true distribution (left panel). With our self-repulsive dynamics, the new samples are forced to explore the un-explored region more frequently, allowing us to obtain a much more accurate approximation when combining the new and initial samples.

4 Theoretical Analysis of Stein Self-Repulsive Dynamics

We provide theoretical analysis of the self-repulsive dynamics. We establish that our self-repulsive dynamics converges to the correct target distribution asymptotically, in the limit when particle size MM approaches to infinite and the step size η\eta approaches to 0. This is a highly non-trivial task, as the self-repulsive dynamics is a highly complex, non-linear and high order Markov stochastic process. We attack this problem by breaking the proof into the following three steps:

  1. (1)

    At the limit of MM\to\infty (called the mean field limit), we show that practical dynamics in (3) is closely approximated by a discrete-time mean-field dynamics characterized by (4).

  2. (2)

    By further taking the limit of η0+\eta\to 0^{+} (called the continuous-time limit), the dynamics in (4) converges to a continuous-time mean-field dynamics characterized by (5).

  3. (3)

    We show that the dynamics in (5) converges to the target distribution.

Remark  As we mentioned in Section 3, we introduce the first phase to collect the initial MM samples for the second phase, and our theoretical analysis follows this setting to make our theory as close to the practice. However, the theoretical analysis can be generalized to the setting of drawing MM initial samples from some initialization distribution with almost identical argument.

Notations  We use \left\|\cdot\right\| and ,\left\langle\cdot,\cdot\right\rangle to represent the 2\ell_{2} vector norm and inner product, respectively. The Lipschitz norm and bounded Lipschitz norm of a function ff are defined by fLip\left\|f\right\|_{\mathrm{Lip}} and fBL\left\|f\right\|_{\mathrm{BL}}. The KL divergence, Wasserstein-2 distance and Bounded Lipschitz distance between distribution ρ1\rho_{1}, ρ2\rho_{2} are denoted as 𝔻KL[ρ1ρ2]\mathbb{D}_{\mathrm{KL}}[\rho_{1}\|\rho_{2}], 𝕎2[ρ1,ρ2]\mathbb{W}_{2}[\rho_{1},\rho_{2}] and 𝔻BL[ρ1,ρ2]\mathbb{D}_{\mathrm{BL}}[\rho_{1},\rho_{2}], respectively.

4.1 Mean-Field and Continuous-Time Limits

To fix the notation, we denote by ρk:=Law(𝜽k)\rho_{k}:=\mathrm{Law}(\bm{\theta}_{k}) the distribution of 𝜽k\bm{\theta}_{k} at time kk of the practical self-repulsive dynamics (3), which we refer to the practical dynamics in the sequel, when the initial particle 𝜽0\bm{\theta}_{0} is drawn from an initial continuous distribution ρ0\rho_{0}. Note that given ρ0\rho_{0}, the subsequent ρk\rho_{k} can be recursively defined through dynamics (3). Due to the diffusion noise in Langevin dynamics, all ρk\rho_{k} are continuous distributions supported on d\mathbb{R}^{d}. We now introduce the limit dynamics when we take the mean-field limit (M+)(M\to+\infty) and then the continuous-time limit (η0+)(\eta\to 0^{+}).

Discrete-Time Mean-Field Dynamics (M+M\to+\infty) In the limit of MM\to\infty, our practical dynamics (3) approaches to the following limit dynamics, in which the delta measures on the particles are replaced by the actual continuous distributions of the particles,

𝜽~k+1={𝜽~kηV(𝜽~k)+2η𝒆k,kMcη,𝜽~k+η[V(𝜽~k)+α𝒈(𝜽~k,ρ~kM)]+2η𝒆k,kMcη.\displaystyle\tilde{\bm{\theta}}_{k+1}=\begin{cases}\tilde{\bm{\theta}}_{k}\!-\!\eta\nabla V(\tilde{\bm{\theta}}_{k})\!+\!\sqrt{2\eta}\bm{e}_{k},&k\leq Mc_{\eta},\\ \tilde{\bm{\theta}}_{k}\!+\!\eta\left[-\nabla V(\tilde{\bm{\theta}}_{k})\!+\!\alpha{\bm{g}}(\tilde{\bm{\theta}}_{k},\tilde{\rho}_{k}^{M})\right]\!+\!\sqrt{2\eta}\bm{e}_{k},&k\geq Mc_{\eta}.\end{cases} (4)

where ρ~kM=1Mj=1Mρ~kjcη\tilde{\rho}_{k}^{M}=\frac{1}{M}\sum_{j=1}^{M}\tilde{\rho}_{k-jc_{\eta}} and ρ~k:=Law(𝜽~k)\tilde{\rho}_{k}:=\mathrm{Law}(\tilde{\bm{\theta}}_{k}) is the (smooth) distribution of 𝜽~k\tilde{\bm{\theta}}_{k} at time-step kk when the dynamics is initialized with 𝜽~0ρ~0=ρ0\tilde{\bm{\theta}}_{0}\sim\tilde{\rho}_{0}=\rho_{0}. Compared with the practical dynamics in (3), the difference is that the empirical distribution δ~kM\tilde{\delta}_{k}^{M} is replaced by the smooth distribution ρ~kM\tilde{\rho}_{k}^{M}. Similar to the recursive definition of ρk\rho_{k} following dynamics (3), ρ~k\tilde{\rho}_{k} is also recursively defined through dynamics (4), starting from ρ~0=ρ0\tilde{\rho}_{0}=\rho_{0}.

As we show in Theorem 4.3, if the auto-correlation of 𝜽k\bm{\theta}_{k} decays fast enough and MM is sufficiently large, ρ~kM\tilde{\rho}_{k}^{M} is well approximated by the empirical distribution δ~kM\tilde{\delta}_{k}^{M} in (3), and further the two dynamics ((3) and (4)) converges to each other in the sense that 𝕎2[ρk,ρ~k]0\mathbb{W}_{2}[\rho_{k},\tilde{\rho}_{k}]\to 0 as MM\to\infty for any kk. Note that in taking the limit of MM\to\infty, we need to ensure that we run the dynamics for more than McηMc_{\eta} steps. Otherwise, SRLD degenerates to Langevin dynamics as we stop the chain before we finish collecting the MM samples.

Continuous-Time Mean-Field Dynamics (η0+\eta\to 0^{+})  In the limit of zero step size (η0+)(\eta\to 0^{+}), the discrete-time mean field dynamics in (4) can be shown to converge to the following continuous-time mean-field dynamics:

d𝜽¯t\displaystyle d\bar{\bm{\theta}}_{t} ={V(𝜽¯t)dt+dt,t[0,Mc),[V(𝜽¯t)+α𝒈(𝜽¯t,ρ¯tM)]dt+dt,tMc.\displaystyle=\begin{cases}-\nabla V(\bar{\bm{\theta}}_{t})dt+d\mathcal{B}_{t},&t\in[0,Mc),\\ \left[-\nabla V(\bar{\bm{\theta}}_{t})+\alpha{\bm{g}}(\bar{\bm{\theta}}_{t},~{}\bar{\rho}_{t}^{M})\right]dt+d\mathcal{B}_{t},&t\geq Mc.\end{cases} (5)

where ρ¯tM:=1Mj=1Mρ¯tjc()\bar{\rho}_{t}^{M}:=\frac{1}{M}\sum_{j=1}^{M}\bar{\rho}_{t-jc}(\cdot), t\mathcal{B}_{t} is the Brownian motion and ρ¯t=Law(𝜽¯t)\bar{\rho}_{t}=\mathrm{Law}\left(\bar{\bm{\theta}}_{t}\right) is the distribution of 𝜽¯t\bar{\bm{\theta}}_{t} at a continuous time point tt with 𝜽0\bm{\theta}_{0} initialized by 𝜽¯0ρ~0=ρ0\bar{\bm{\theta}}_{0}\sim\tilde{\rho}_{0}=\rho_{0}. We prove that (5) is closely approximated by (4) with small step size in the sense that 𝔻KL[ρ~kρ¯kη]0\mathbb{D}_{\mathrm{KL}}[\tilde{\rho}_{k}\|\bar{\rho}_{k\eta}]\to 0 as η0\eta\to 0 in Theorem 4.2, and importantly, the stationary distribution of (5) equals to the target distribution ρ(𝜽)exp(V(𝜽)).\rho^{*}(\bm{\theta})\propto\exp(-V(\bm{\theta})).

4.2 Assumptions

We first introduce the techinical assumptions used in our theoretical analysis.

Assumption 4.1 (RBF Kernel).

We use RBF kernel, i.e. K(𝛉1,𝛉2)=exp(𝛉1𝛉22/σ)K(\bm{\theta}_{1},\bm{\theta}_{2})=\exp({-\left\|\bm{\theta}_{1}-\bm{\theta}_{2}\right\|^{2}/\sigma}), for some fixed 0<σ<0<\sigma<\infty.

We only assume the RBF kernel for the simplicity of our analysis. However, it is straightforward to generalize our theoretical result to other positive definite kernels.

Assumption 4.2 (VV is dissipative and smooth).

Assume that 𝛉,V(𝛉)b1a1𝛉2\left\langle\bm{\theta},-\nabla V(\bm{\theta})\right\rangle\leq b_{1}-a_{1}\left\|\bm{\theta}\right\|^{2} and V(𝛉1)V(𝛉2)b1𝛉1𝛉2\left\|\nabla V(\bm{\theta}_{1})-\nabla V(\bm{\theta}_{2})\right\|\leq b_{1}\left\|\bm{\theta}_{1}-\bm{\theta}_{2}\right\|. We also assume that V(𝟎)b1\left\|\nabla V(\bm{0})\right\|\leq b_{1}. Here a1a_{1} and b1b_{1} are some finite positive constant.

Assumption 4.3 (Regularity Condition).

Assume 𝔼𝛉ρ0[𝛉2]>0\mathbb{E}_{\bm{\theta}\sim\rho_{0}}[\left\|\bm{\theta}\right\|^{2}]>0. Define ρkM=j=1Mρkjcη/M\rho_{k}^{M}=\sum_{j=1}^{M}\rho_{k-jc_{\eta}}/M, assume there exists a2,B<a_{2},B<\infty such that

supkMcη𝔼g(𝜽k;δ~kM)g(𝜽k;ρkM)2sup𝜽B𝔼g(𝜽;δ~kM)g(𝜽;ρkM)2a2.\sup_{k\geq Mc_{\eta}}\frac{\mathbb{E}\left\|g(\bm{\theta}_{k};\tilde{\delta}_{k}^{M})-g(\bm{\theta}_{k};\rho_{k}^{M})\right\|^{2}}{\sup_{\left\|\bm{\theta}\right\|\leq B}\mathbb{E}\left\|g(\bm{\theta};\tilde{\delta}_{k}^{M})-g(\bm{\theta};\rho_{k}^{M})\right\|^{2}}\leq a_{2}.
Assumption 4.4 (Strong-convexity).

Suppose that V(𝛉1)V(𝛉2),𝛉1𝛉2L𝛉1𝛉22\left\langle\nabla V(\bm{\theta}_{1})-\nabla V(\bm{\theta}_{2}),\bm{\theta}_{1}-\bm{\theta}_{2}\right\rangle\geq L\left\|\bm{\theta}_{1}-\bm{\theta}_{2}\right\|^{2} for a positive constant LL.

Remark Assumption 4.2 is standard in the existing Langevin dynamics analysis (see Dalalyan, 2017; Raginsky et al., 2017; Deng et al., 2020a). Assumption 4.3 is a weak condition as it assumes that the dynamics can not degenerate into one local mode and stop moving anymore. This assumption is expected to be true when we have diffusion terms like the Gaussian noises in our self-repulsive dynamics. Assumption 4.4 is a classical assumption on the existing Langevin dynamics analysis with convex potential Dalalyan (2017); Durmus et al. (2019). Although being a bit strong, this assumption broadly applies to posterior inference problem in the limit of big data, as the posterior distribution converges to Gaussian distributions for large training set as shown by Bernstein-von Mises theorem. It is technically possible to further generalize our results to the non-convex settings with a refined analysis, which we leave as future work. This work focuses on the classic convex setting for simplicity.

4.3 Main Theorems

All of the proofs in this section can be found in Appendix E. We first prove that the limiting distribution of the continuous-time mean field dynamics (5) is the target distribution. This is achieved by writing dynamics (5) into the following (non-linear) partial differential equation:

tρ¯t={(Vρ¯t)+Δρ¯tt[0,Mc)[(V+α𝒈(,ρ¯tM))ρ¯t]+Δρ¯t,tMc.\partial_{t}\bar{\rho}_{t}=\begin{cases}\nabla\cdot\left(-\nabla V\bar{\rho}_{t}\right)+\Delta\bar{\rho}_{t}&t\in[0,Mc)\\ \nabla\cdot\left[\left(-\nabla V+\alpha{\bm{g}}(\cdot,\bar{\rho}_{t}^{M})\right)\bar{\rho}_{t}\right]+\Delta\bar{\rho}_{t},&t\geq Mc.\end{cases}
Theorem 4.1 (Stationary Distribution).

Given some finite MM, cc and α\alpha, and suppose that the limit distribution of (5) exists. Then the limit distribution is unique and satisfies ρ(𝛉)exp(V(𝛉))\rho^{*}(\bm{\theta})\propto\exp({-V(\bm{\theta})}).

We then give the upper bound on the discretization error, which can be characterized by analyzing the KL divergence between ρ~k\tilde{\rho}_{k} and ρ¯kη\bar{\rho}_{k\eta}.

Theorem 4.2 (Time Discretization Error).

Given some sufficiently small step size η\eta and choose α<a2/(2b1+4/σ)\alpha<a_{2}/(2b_{1}+4/\sigma). Under Assumption 4.1, 4.2, 4.3 and cη=c/ηc_{\eta}=c/\eta. we have for some constant CC,

maxl{0,,k}𝔻KL[ρ¯lηρ~l]{𝒪(η+kη2)kMcη1𝒪(η+Mcη+α2MceCα2(kηMc)η2)kMcη.\displaystyle\underset{l\in\{0,...,k\}}{\max}\ \mathbb{D}_{\mathrm{KL}}\left[\bar{\rho}_{l\eta}\|\tilde{\rho}_{l}\right]\leq\begin{cases}\mathcal{O}\left(\eta+k\eta^{2}\right)&k\leq Mc_{\eta}-1\\ \mathcal{O}\left(\eta+Mc\eta+\alpha^{2}Mce^{C\alpha^{2}(k\eta-Mc)}\eta^{2}\right)&k\geq Mc_{\eta}.\end{cases}

With this theorem, we can know that if η\eta is small enough, then the discretization error is small and ρ~\tilde{\rho} approximates ρ¯\bar{\rho} closely. Next we give result on the mean field limit of MM\to\infty.

Theorem 4.3 (Mean-Field Limit).

Under Assumption 4.1, 4.2, 4.3, and 4.4, suppose that we choose α\alpha and η\eta such that (a12αb1/σ)+ηb1<0-(a_{1}-2\alpha b_{1}/\sigma)+\eta b_{1}<0; 2αησ(b1+1)<1\frac{2\alpha\eta}{\sigma}(b_{1}+1)<1; a2α(2b1+4σ)>0a_{2}-\alpha\left(2b_{1}+\frac{4}{\sigma}\right)>0; Then there exists a constant c2c_{2}, such that when L/ac2L/a\geq c_{2} and we have

𝕎22[ρk,ρ~k]={𝒪(α2/M+η2)Mcη,0kMcη1.\displaystyle\mathbb{W}_{2}^{2}[\rho_{k},\tilde{\rho}_{k}]=\begin{cases}\mathcal{O}\left({\alpha^{2}/{M}}+\eta^{2}\right)&\geq Mc_{\eta},\\ 0&k\leq Mc_{\eta}-1.\end{cases}

Thus, if MM is sufficiently large, ρk\rho_{k} can well approximate the ρ~k\tilde{\rho}_{k}. Combining all the above theorems, we have the following Corollary showing the convergence of the proposed practical algorithm to the target distribution.

Corollary 4.1 (Convergence to Target Distribution).

Under the assumptions of Theorem 4.1, 4.2 and 4.3, by choosing k,η,Mk,\eta,M such that kηk\eta\to\infty, exp(Cα2kη)η2=o(1)\exp(C\alpha^{2}k\eta)\eta^{2}=o(1) and kηMc=γ(1+o(1))\frac{k\eta}{Mc}=\gamma\left(1+o(1)\right) with γ>1\gamma>1, we have

limk,M,η0+𝔻BL[ρk,ρ]=0.\lim_{k,M\to\infty,\eta\to 0^{+}}\mathbb{D}_{\text{BL}}\left[\rho_{k},\rho^{*}\right]=0.

Remark  A careful choice of parameters is needed as our system is a complicated longitudinal particle system. Also notice that if γ1\gamma\leq 1, the repulsive dynamics reduces to Langevin dynamics, as only the samples from the first phase will be collected.

5 Extension to General Dynamics

Although we have focused on self-repulsive Langevin dynamics, our Stein self-repulsive idea can be broadly combined with general gradient-based MCMC. Following Ma et al. (2015), we consider the following general class of sampling dynamics for drawing samples from p(𝜽)exp(V(𝜽))p(\bm{\theta})\propto\exp(-V(\bm{\theta})):

d𝜽t=𝒇(𝜽)dt+2𝑫(𝜽)dt,\displaystyle d\bm{\theta}_{t}=-\bm{f}(\bm{\theta})dt+\sqrt{2\bm{D}(\bm{\theta})}d\mathcal{B}_{t},
with𝒇(𝜽)=[𝑫(𝜽)+𝑸(𝜽)]V(𝜽)𝚪(𝜽),Γi(𝜽)=j=1d𝜽j(Dij(𝜽)+Qij(𝜽)).\displaystyle\text{with}\quad\bm{f}(\bm{\theta})=[\bm{D}(\bm{\theta})+\bm{Q}(\bm{\theta})]\nabla V(\bm{\theta})-\bm{\Gamma}(\bm{\theta}),\quad\Gamma_{i}(\bm{\theta})=\sum_{j=1}^{d}\frac{\partial}{\partial\text{$\bm{\theta}$}_{j}}\left(D_{ij}(\bm{\theta})+Q_{ij}(\bm{\theta})\right).

where 𝑫\bm{D} is a positive semi-definite diffusion matrix that determines the strength of the Brownian motion and 𝑸\bm{Q} is a skew-symmetric curl matrix that can represent the traverse effect (e.g. in Neal et al., 2011; Ding et al., 2014). By adding the Stein repulsive force, we obtain the following general self-repulsive dynamics

d𝜽¯t\displaystyle d\bar{\bm{\theta}}_{t} ={𝒇(𝜽)dt+2𝑫(𝜽)dt,t[0,Mc)(𝒇(𝜽)+α𝒈(𝜽¯t;ρ¯tM))dt+dt,tMc\displaystyle=\begin{cases}-\bm{f}(\bm{\theta})dt+\sqrt{2\bm{D}(\bm{\theta})}d\mathcal{B}_{t},&t\in[0,Mc)\\ -\left(\bm{f}(\bm{\theta})+\alpha{\bm{g}}(\bar{\bm{\theta}}_{t};~{}\bar{\rho}_{t}^{M})\right)dt+d\mathcal{B}_{t},&t\geq Mc\end{cases} (6)

where ρ¯t:=Law(𝜽¯t)\bar{\rho}_{t}:=\mathrm{Law}(\bar{\bm{\theta}}_{t}) is again the distribution of 𝜽¯t\bar{\bm{\theta}}_{t} following (6) when initalized at 𝜽¯0ρ¯0\bar{\bm{\theta}}_{0}\sim\bar{\rho}_{0}. Similar to the case of Langevin dynamics, this process also converges to the correct target distribution, and can be simulated by practical dynamics similar to (3).

Theorem 5.1 (Stationary Distribution).

Given some finite MM, cc and α\alpha, and suppose that the limiting distribution of dynamics (6) exists. Then the limiting distribution is unique and equals the target distribution ρ(𝛉)exp(V(𝛉))\rho^{*}(\bm{\theta})\propto\exp({-V(\bm{\theta})}).

6 Experiments

In this section, we evaluate the proposed method in various challenging tasks. We demonstrate the effectiveness of SRLD in high dimensions by applying it to sample the posterior of Bayesian Neural Networks. To demonstrate the superiority of the SRLD in obtaining diversified samples, we apply SRLD on contextual bandits problem, which requires the sampler efficiently explores the distribution in order to give good uncertainty estimation.

We include discussion on the parameter tuning and additional experiment on sampling high dimensional Gaussian and Gaussian mixture in Appendix B. Our code is available at https://github.com/lushleaf/Stein-Repulsive-Dynamics.

6.1 Synthetic Experiment

We first show how the repulsive gradient helps explore the whole distribution using a synthetic distribution that is easy to visualize. Following Ma et al. (2015), we compare the sampling efficiency on the following correlated 2D distribution with density

ρ([θ1,θ2])θ14/10(4(θ2+1.2)θ12)2/2.\rho^{*}([\theta_{1},\theta_{2}])\propto-\theta_{1}^{4}/10-\left(4\left(\theta_{2}+1.2\right)-\theta_{1}^{2}\right)^{2}/2.

We compare the SRLD with vanilla Langevin dynamics, and evaluate the sample quality by Maximum Mean Discrepancy (MMD) (Gretton et al., 2012), Wasserstein-1 Distance and effective sample size (ESS). Notice that the finite sample quality of gradient based MCMC method is highly related to the step size. Compared with Langevin dynamics, we have an extra repulsive gradient and thus we implicitly have larger step size. To rule out this effect, we set different step sizes of the two dynamics so that the gradient of the two dynamics has the same magnitude.

In addition, to decrease the influence of random noise, the two dynamics are set to have the same initialization and use the same sequence of Gaussian noise. We collect the sample of every iteration. We repeat the experiment 20 times with different initialization and sequence of Gaussian noise.

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 2: Sample quality of SRLD and Langevin dynamics for sampling the correlated 2D distribution. The auto-correlation is the averaged auto-correlation of the two dimensions.
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 3: Sampling trajectory of the correlated 2D distribution.

Figure 2 summarizes the result with different metrics. We can see that SRLD has a significantly smaller MMD and Wasserstein-1 Distance as well as a larger ESS compared with the vanilla Langevin dynamics. Moreover, the introduced repulsive gradient creates a negative auto-correlation between samples. Figure 3 shows a typical trajectory of the two sampling dynamics. We can see that SRLD have a faster mixing rate than vanilla Langevin dynamics. Note that since we use the same sequence of Gaussian noise for both algorithms, the difference is mainly due to the use of repulsive gradient rather than the randomness.

6.2 Bayesian Neural Network

Bayesian Neural Network is one of the most important methods in Bayesian Deep Learning with wide application in practice. Here we test the performance of SRLD on sampling the posterior of Bayesian Neural Network on the UCI datasets (Dua and Graff, 2017). We assume the output is normal distributed, with a two-layer neural network with 50 hidden units and tanh\mathrm{tanh} activation to predict the mean of outputs. All of the datasets are randomly partitioned into 90%90\% for training and 10%10\% for testing. The results are averaged over 20 random trials. We refer readers to Appendix C for hyper-parameter tuning and other experiment details.

Dataset Ave Test RMSE Ave Test LL
SVGD LD SRLD SVGD LD SRLD
Boston 3.300±0.1423.300\pm 0.142 3.342±0.1873.342\pm 0.187 3.086±0.181¯\underline{{\bf 3.086\pm 0.181}} 4.276±0.217-4.276\pm 0.217 2.678±0.092-2.678\pm 0.092 2.500±0.054¯\underline{{\bf-2.500\pm 0.054}}
Concrete 4.994±0.1714.994\pm 0.171 4.908±0.1134.908\pm 0.113 4.886±0.108{\bf 4.886\pm 0.108} 5.500±0.398-5.500\pm 0.398 3.055±0.035-3.055\pm 0.035 3.034±0.031¯\underline{{\bf-3.034\pm 0.031}}
Energy 0.428±0.0160.428\pm 0.016 0.412±0.0160.412\pm 0.016 0.395±0.016¯\underline{{\bf 0.395\pm 0.016}} 0.781±0.094-0.781\pm 0.094 0.543±0.014-0.543\pm 0.014 0.476±0.036¯\underline{{\bf-0.476\pm 0.036}}
Naval 0.006±0.0000.006\pm 0.000 0.006±0.0020.006\pm 0.002 0.003±0.000¯\underline{{\bf 0.003\pm 0.000}} 3.056±0.0343.056\pm 0.034 4.041±0.0304.041\pm 0.030 4.186±0.015¯\underline{{\bf 4.186\pm 0.015}}
WineRed 0.655±0.0080.655\pm 0.008 0.649±0.0090.649\pm 0.009 0.639±0.009¯\underline{{\bf 0.639\pm 0.009}} 1.040±0.018-1.040\pm 0.018 1.004±0.019-1.004\pm 0.019 0.970±0.016¯\underline{{\bf-0.970\pm 0.016}}
WineWhite 0.655±0.008¯\underline{{\bf 0.655\pm 0.008}} 0.692±0.0030.692\pm 0.003 0.688±0.0030.688\pm 0.003 1.040±0.019¯\underline{{\bf-1.040\pm 0.019}} 1.047±0.004-1.047\pm 0.004 1.043±0.004-1.043\pm 0.004
Yacht 0.593±0.0710.593\pm 0.071 0.597±0.0510.597\pm 0.051 0.578±0.054{\bf 0.578\pm 0.054} 1.281±0.279-1.281\pm 0.279 1.187±0.307-1.187\pm 0.307 0.458±0.036¯\underline{{\bf-0.458\pm 0.036}}
Table 1: Averaged test RMSE and test log-likelihood on UCI datasets. Results are averaged over 20 trails. The boldface indicates the method has the best average performance and the underline marks the methods that perform the best with a significance level of 0.050.05.

Table 1 shows the average test RMSE and test log-likelihood and their standard deviation. The method that has the best average performance is marked as boldface. We observe that a large portion of the variance is due to the random partition of the dataset. Therefore, to show the statistical significance, we use the matched pair tt-test to test the statistical significance, mark the methods that perform the best with a significance level of 0.05 with underlines. Note that the results of SRLD/LD and SVGD is not very comparable, because SRLD/LD are single chain methods which averages across time, and SVGD is a multi-chain method that only use the results of the last iteration. We provide additional results in Appendix C that SRLD averaged on 20 particles (across time) can also achieve similar or better results as SVGD with 20 (parallel) particles.

6.3 Contextual Bandits

We consider the posterior sampling (a.k.a Thompson sampling) algorithm with Bayesian neural network as the function approximator, to demonstrate the uncertanty estimation provided by SRLD. We follow the experimental setting from Riquelme et al. (2018). The only difference is that we change the optimization of the objective (e.g. evidence lower bound (ELBO) in variational inference methods) into running MCMC samplers. We compare the SRLD with the Langevin dynamics on two benchmarks from (Riquelme et al., 2018), and include SVGD as a baseline. For more detailed introduction, setup, hyper-parameter tuning and experiment details; see Appendix D.

Dataset SVGD LD SRLD
Mushroom 20.7±2.020.7\pm 2.0 4.28±0.094.28\pm 0.09 3.80±0.16¯\underline{\mathbf{3.80\pm 0.16}}
Wheel 91.32±0.1791.32\pm 0.17 38.07±1.1138.07\pm 1.11 32.08±0.75¯\underline{\mathbf{32.08\pm 0.75}}
Table 2: Cumulative Regrets on two bandits problem (smaller is better). Results are averaged over 10 trails. Boldface indicates the methods with best performance and underline marks the best significant methods with significant level 0.050.05.

The cumulative regret is shown in Table 2. SVGD is known to have the under-estimated uncertainty for Bayesian neural network if particle number is limited (Wang et al., 2019b), and as a result, has the worst performance among the three methods. SRLD is slightly better than vanilla Langevin dynamics on the simple Mushroom bandits. On the much more harder Wheel bandits, SRLD is significantly better than the vanilla Langevin dynamics, which shows the improving uncertainty estimation of our methods within finite number of samples.

7 Conclusion

We propose a Stein self-repulsive dynamics which applies Stein variational gradient to push samples from MCMC dynamics away from its past trajectories. This allows us to significantly decrease the auto-correlation of MCMC, increasing the sample efficiency for better estimation. The advantages of our method are extensive studied both theoretical and empirical analysis in our work. In future work, we plan to investigate the combination of our Stein self-repulsive idea with more general MCMC procedures, and explore broader applications.

Broader Impact Statement

This work incorporates Stein repulsive force into Langevin dynamics to improve sample efficiency. It brings a positively improvement to the community such as reinforcement learning and Bayesian neural network that needs efficient sampler. Our work do not have any negative societal impacts that we can foresee in the future.

Acknowledgement

This paper is supported in part by NSF CAREER 1846421, SenSE 2037267 and EAGER 2041327.

References

  • An and Huang (1996) HZ An and FC Huang. The geometrical ergodicity of nonlinear autoregressive models. Statistica Sinica, pages 943–956, 1996.
  • Blei et al. (2017) David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
  • Blundell et al. (2015) Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1613–1622, Lille, France, 07–09 Jul 2015. PMLR. URL http://proceedings.mlr.press/v37/blundell15.html.
  • Bradley (2005) Richard Bradley. Basic properties of strong mixing conditions. a survey and some open questions. Probability Surveys, 2:107–144, 2005.
  • Bradley (2007) Richard C. Bradley. Introduction to strong mixing conditions. 2007.
  • Brooks et al. (2011) Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Handbook of Markov chain Monte Carlo. CRC press, 2011.
  • Bubeck et al. (2012) Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
  • Chapelle and Li (2011) Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in neural information processing systems, pages 2249–2257, 2011.
  • Chen et al. (2018) Changyou Chen, Ruiyi Zhang, Wenlin Wang, Bai Li, and Liqun Chen. A unified particle-optimization framework for scalable Bayesian sampling. In Proceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intelligence, UAI 2018, Monterey, California, USA, August 6-10, 2018, pages 746–755, 2018.
  • Dalalyan (2017) Arnak S Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017.
  • Deng et al. (2020a) Wei Deng, Qi Feng, Liyao Gao, Faming Liang, and Guang Lin. Non-convex learning via replica exchange stochastic gradient mcmc. In International Conference on Machine Learning, pages 2474–2483. PMLR, 2020a.
  • Deng et al. (2020b) Wei Deng, Guang Lin, and Faming Liang. A contour stochastic gradient langevin dynamics algorithm for simulations of multi-modal distributions, 2020b.
  • Ding et al. (2014) Nan Ding, Youhan Fang, Ryan Babbush, Changyou Chen, Robert D. Skeel, and Hartmut Neven. Bayesian sampling using stochastic gradient thermostats. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3203–3211, 2014.
  • Dua and Graff (2017) Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
  • Durmus et al. (2019) Alain Durmus, Szymon Majewski, and Blazej Miasojedow. Analysis of langevin monte carlo via convex optimization. Journal of Machine Learning Research, 20(73):1–46, 2019.
  • Feng et al. (2017) Yihao Feng, Dilin Wang, and Qiang Liu. Learning to draw samples with amortized stein variational gradient descent. uncertainty in artificial intelligence, 2017.
  • Gallego and Insua (2018) Victor Gallego and David Rios Insua. Stochastic gradient mcmc with repulsive forces. arXiv preprint arXiv:1812.00071, 2018.
  • Gong et al. (2019) Chengyue Gong, Jian Peng, and Qiang Liu. Quantile Stein variational gradient descent for batch Bayesian optimization. In International Conference on Machine Learning, pages 2347–2356, 2019.
  • Gretton et al. (2012) Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
  • Haarnoja et al. (2017) Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. international conference on machine learning, pages 1352–1361, 2017.
  • Han and Liu (2017) Jun Han and Qiang Liu. Stein variational adaptive importance sampling. Uncertainty in Artificial Intelligence, 2017.
  • Han and Liu (2018) Jun Han and Qiang Liu. Stein variational gradient descent without gradient. In Uncertainty in Artificial Intelligence, 2018.
  • Laio and Parrinello (2002) Alessandro Laio and Michele Parrinello. Escaping free-energy minima. Proceedings of the National Academy of Sciences, 99(20):12562–12566, 2002.
  • Liu et al. (2019) Chang Liu, Jingwei Zhuo, Pengyu Cheng, Ruiyi Zhang, and Jun Zhu. Understanding and accelerating particle-based variational inference. In International Conference on Machine Learning, pages 4082–4092, 2019.
  • Liu (2017) Qiang Liu. Stein variational gradient descent as gradient flow. In Advances in neural information processing systems, pages 3118–3126, 2017.
  • Liu and Wang (2016) Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose bayesian inference algorithm. In Advances In Neural Information Processing Systems, pages 2378–2386, 2016.
  • Liu and Wang (2018) Qiang Liu and Dilin Wang. Stein variational gradient descent as moment matching. In Advances in Neural Information Processing Systems, pages 8854–8863, 2018.
  • Liu et al. (2017a) Yang Liu, Qiang Liu, and Jian Peng. Stein variational policy gradient. uncertainty in artificial intelligence, 2017a.
  • Liu et al. (2017b) Yang Liu, Prajit Ramachandran, Qiang Liu, and Jian Peng. Stein variational policy gradient. Uncertainty in Artificial Intelligence, 2017b.
  • Ma et al. (2015) Yi-An Ma, Tianqi Chen, and Emily Fox. A complete recipe for stochastic gradient mcmc. In Advances in Neural Information Processing Systems, pages 2917–2925, 2015.
  • Neal et al. (2011) Radford M Neal et al. Mcmc using hamiltonian dynamics. Handbook of markov chain monte carlo, 2(11):2, 2011.
  • Pu et al. (2017) Yunchen Pu, Zhe Gan, Ricardo Henao, Chunyuan Li, Shaobo Han, and Lawrence Carin. Vae learning via stein variational gradient descent. neural information processing systems, pages 4236–4245, 2017.
  • Raginsky et al. (2017) Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient langevin dynamics: a nonasymptotic analysis. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, Amsterdam, The Netherlands, 7-10 July 2017, pages 1674–1703, 2017. URL http://proceedings.mlr.press/v65/raginsky17a.html.
  • Riquelme et al. (2018) Carlos Riquelme, George Tucker, and Jasper Snoek. Deep bayesian bandits showdown: An empirical comparison of bayesian deep networks for thompson sampling. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018. URL https://openreview.net/forum?id=SyYe6k-CW.
  • Rossky et al. (1978) Peter J Rossky, JD Doll, and HL Friedman. Brownian dynamics as smart monte carlo simulation. The Journal of Chemical Physics, 69(10):4628–4633, 1978.
  • Schlimmer (1981) Jeff Schlimmer. Mushroom records drawn from the audubon society field guide to north american mushrooms. GH Lincoff (Pres), New York, 1981.
  • Thompson (1933) William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
  • Wainwright et al. (2008) Martin J Wainwright, Michael I Jordan, et al. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305, 2008.
  • Wang et al. (2019a) Dilin Wang, Ziyang Tang, Chandrajit Bajaj, and Qiang Liu. Stein variational gradient descent with matrix-valued kernels. In Conference on Neural Information Processing Systems, 2019a.
  • Wang et al. (2019b) Ziyu Wang, Tongzheng Ren, Jun Zhu, and Bo Zhang. Function space particle optimization for bayesian neural networks. arXiv preprint arXiv:1902.09754, 2019b.
  • Zhang et al. (2018) Jianyi Zhang, Ruiyi Zhang, and Changyou Chen. Stochastic particle-optimization sampling and the non-asymptotic convergence theory. arXiv preprint arXiv:1809.01293, 2018.

Appendix A Discussion on Hyper-parameter Tuning

The key hyper-parameters of SRLD are 1. α\alpha, which balance the confining gradient and repulsive gradient; 2. MM the number of particles used; 3. σ\sigma the bandwidth of kernel; 4. η\eta the stepsize; 5. cηc_{\eta} the thinning factor. Among which, α\alpha, MM and σ\sigma are the hyper-parameter introduced by the proposed repulsive gradient and thus we mainly discuss these three hyper-parameter. The number of particles MM and bandwidth of kennel σ\sigma are introduced by the use the repulsive term in SVGD [Liu and Wang, 2016]. In practice, we find using a similar setting for tuning MM and σ\sigma as that in SVGD [Liu and Wang, 2016] gives good performance. In specific, in order to obtain good performance, MM does not needs be very large, and similar to SVGD, M=10M=10 already gives good enough particle approximation. A good choice of bandwidth σ\sigma is also important for the kernel. In SVGD, instead of tuning σ\sigma, they propose an adaptive way to adjust σ\sigma during the running of the dynamics. Specifically, they choose σ=med2/log(M)\sigma=\text{med}^{2}/\log(M), where med is the median of the pairwise distance between the particles 𝜽ki\bm{\theta}_{k}^{i}, i[M]i\in[M]. In this way, the bandwidth ensures that j=1MK(𝜽ki,𝜽kj)1\sum_{j=1}^{M}K(\bm{\theta}_{k}^{i},\bm{\theta}_{k}^{j})\approx 1. This adaptive way of choosing σ\sigma is also widely used in current approximation inference area, e.g., Liu et al. [2017b], Han and Liu [2017], Wang et al. [2019b, a]. We also find that applying this adaptive bandwidth is able to give good empirical performance and thus we also use this method in the implementation. Now we discuss how choose α\alpha. Notice that α\alpha serves to balance the confining gradient and repulsive gradient and based on this motivation, we recommend readers to find a proper α\alpha using the samples at burn-in phase by setting

αk=1McηV(𝜽k)k=1Mcηg(𝜽k,δ~kM).\alpha\approx\frac{\sum_{k=1}^{Mc_{\eta}}\left\|\nabla V(\bm{\theta}_{k})\right\|}{\sum_{k=1}^{Mc_{\eta}}\left\|g(\bm{\theta}_{k},\tilde{\delta}_{k}^{M})\right\|}.

In this way, α\alpha balances the two kind of gradients. And then we may further tune α\alpha by searching around this value. An empirical good choice of α\alpha is 10 for the data sets we tested and we use α=10\alpha=10 for all the experiments.

The step size is important for gradient based MCMC, as too large step size gives too large discretization error while a too small step size will cause the dynamics converges very slowly. In this paper, we mainly use validation set to tune the step size. The thinning factor is also a common parameter is MCMC methods and usually MCMC methods are not sensitive to this parameter. SRLD is not sensitive to this parameter and we simply set cη=100c_{\eta}=100 for all experiments.

Appendix B Additional Experiment Result on Synthetic Data

In this section, we show additional experiment on synthetic data. To further visualize the role of the proposed stein repulsive gradient, we also apply our method to sample a 2D mixture of Gaussian distribution (see section B.1). To further study how different α\alpha influences sampling high dimension distribution, we apply SRLD to sample high dimensional Gaussian (section B.2) and high dimensional mixture of Gaussian (section B.3).

B.1 Synthetic 2D Mixture of Gaussian Experiment

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 4: Sample quality and autocorrelation of the mixture distribution. The auto-correlation is the averaged auto-correlation of the two dimensions.
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 5: Sampling trajectory of the mixture of Gaussian.

We aim to show how the repulsive gradient helps the particle escape from the local high density region by sampling the 2D mixture of Gaussian distribution using SRLD and Langevin dynamics. The target density is set to be

ρ(𝜽)0.5exp(𝜽𝟏2/2)+0.5exp(𝜽+𝟏2/2),\displaystyle\rho^{*}(\bm{\theta})\propto 0.5\exp{\left(-\left\|\bm{\theta}-\bm{1}\right\|^{2}/2\right)}+0.5\exp{\left(-\left\|\bm{\theta}+\bm{1}\right\|^{2}/2\right)},

where 𝜽=[θ1,θ2]\bm{\theta}=[\theta_{1},\theta_{2}]^{\top} and 𝟏=[1,1]\mathbf{1}=[1,1]^{\top}. This target distribution have two mode at 𝟏-\bm{1} and 𝟏\bm{1}, and vanilla Langevin dynamics can stuck in one mode while keeps the another mode under-explored (as the gradient of energy function can dominate the update of samples). We use the same evaluation method, step sizes, initialization and Gaussian noise as the previous experiment. We collect one sample every 100 iterations and the experiment is repeated for 20 times. Figure 4 shows that SRLD consistently outperforms the Langevin dynamics on all of the evaluation metrics.

To provide more evidence on the effectiveness of SRLD on escaping from local high density region, we plot the sampling trajectory of SRLD and vanilla Langevin dynamics on the mixture of Gaussian mentioned in Section 6.1. We can find that, when both of the methods obtain 200 samples, SRLD have started to explore the second mode, while vanilla Langevin dynamics still stuck in the original mode. When both of the methods have 250 examples, the vanilla Langevin dynamics just start to explore the second mode, while our SRLD have already obtained several samples from the second mode, which shows our methods effectiveness on escaping the local mode.

B.2 Synthetic higher dimensional Gaussian Experiment

To show the performance of SRLD in higher dimensional case with different value of α\alpha, we additionally considering the problem on sampling from Gaussian distribution with d=100d=100 and covariance 𝚺=0.5I\bm{\Sigma}=0.5\textbf{I}. We run SRLD with α=100,50,20,10,0\alpha=100,50,20,10,0 and the case α=0\alpha=0 reduces to Langevin. We collect 1 sample every 10 iterations. The other experiment setting is the same as the toy examples in the main text. The results are summarized at Figure 6. In this experiment, we set one SRLD with an inappropriate α=100\alpha=100. For this chain, the repulsive gradient gives strong repulsive force and thus has the largest ESS and the fastest decay of autocorrelation. While the inappropriate value α\alpha induces too much extra approximation error and thus its performance is not as good as these with smaller α\alpha (see MMD and Wasserstein distance). This phenomenon matches our theoretical finding.

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 6: Sample quality and autocorrelation of the higher dimensional Gaussian distribution. The auto-correlation is the averaged auto-correlation of all dimensions.

B.3 Synthetic higher dimensional Mixture of Gaussian Experiment

We also consider sampling from the mixture of Gaussian with d=20d=20. The target density is set to be

ρ(𝜽)12exp(0.5𝜽2/d𝟏2)+12exp(0.5𝜽+2/d𝟏2),\rho^{*}(\bm{\theta})\propto\frac{1}{2}\exp\left(-0.5\left\|\bm{\theta}-\sqrt{2/d}\bm{1}\right\|^{2}\right)+\frac{1}{2}\exp\left(-0.5\left\|\bm{\theta}+\sqrt{2/d}\bm{1}\right\|^{2}\right),

where 𝜽=[θ1,,θ20]\bm{\theta}=[\theta_{1},...,\theta_{20}]^{\top} and 𝟏=[1,,1]\bm{1}=[1,...,1]^{\top}. And thus the mean of the two mixture component is with distance 222\sqrt{2}. We run SRLD with α=20,10,5,0\alpha=20,10,5,0 (and when α=0\alpha=0, it reduces to LD). The other experiment setting is the same as the low dimensional mixture Gaussian case. Figure [7] summarizes the result. As shown in the figure, when α\alpha becomes larger, the repulsive forces helps the sampler better explore the density region.

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 7: Sample quality and autocorrelation of the higher dimensional mixture distribution. The auto-correlation is the averaged auto-correlation of all dimensions.

Appendix C BNN on UCI Datasets: Experiment Settings and Additional Results

We first give detailed experiment settings. We set a Γ(1,0.1)\Gamma(1,0.1) prior for the inverse output variance. We set the mini-batch size to be 100. We run 50000 iterations for each methods, and for LD and SRLD, the first 40000 iteration is discarded as burn-in. We use a thinning factor of cη=c/η=100c_{\eta}=c/\eta=100 and in total we collect 100 samples from the posterior distribution. For each dataset, we generate 3 extra data splits for tuning the step size for each method. the number of past samples MM to be 10. In all experiments, we use RBF kernel with bandwidth set by the median trick as suggested in Liu and Wang [2016]. We use α=10\alpha=10 for all the data sets. For SVGD, we use the original implementation with 20 particles by Liu and Wang [2016].

We show some additional experiment result on posterior inference on UCI datasets. As mentioned in Section 6.2, the comparison between SVGD and SRLD is not direct as SVGD is a multiple-chain method with fewer particles and SRLD is a single chain method with more samples. To show more detailed comparison, we compare the SVGD with SRLD using the first 20, 40, 60, 80 and 100 samples, denoted as SRLD-nn where nn is the number of samples used. Table 3 shows the result of averaged test RMSE and table 4 shows the result of averaged test loglikelihood. For SRLD with different number of samples, the value is set to be boldface if it has better average performance than SVGD. If it is statistical significant with significant level 0.05 using a matched pair t-test, we add an underline on it.

Figure 8 and 9 give some visualized result on the comparison with Langevin dynamics and SRLD. To rule out the variance of different splitting on the dataset, the errorbar is calculated based on the difference between RMSE of SRLD and RMSE of Langevin dynamcis in 20 repeats (And similarily for test log-likelihood). And we only applied the error bar on Langevin dynamics.

Dataset Ave Test RMSE
SRLD-20 SRLD-40 SRLD-60 SRLD-80 SRLD-100 SVGD
Boston 3.236±0.174\bf{3.236\pm 0.174} 3.173±0.176\bf{3.173\pm 0.176} 3.130±0.173\bf{3.130\pm 0.173} 3.101±0.179¯\underline{\bf{3.101\pm 0.179}} 3.086±0.181¯\underline{{\bf 3.086\pm 0.181}} 3.300±0.1423.300\pm 0.142
Concrete 4.959±0.109{\bf 4.959\pm 0.109} 4.921±0.111{\bf 4.921\pm 0.111} 4.906±0.109{\bf 4.906\pm 0.109} 4.891±0.108{\bf 4.891\pm 0.108} 4.886±0.108{\bf 4.886\pm 0.108} 4.994±0.1714.994\pm 0.171
Energy 0.422±0.016{\bf 0.422\pm 0.016} 0.409±0.016{\bf 0.409\pm 0.016} 0.405±0.016¯\underline{{\bf 0.405\pm 0.016}} 0.399±0.016¯\underline{{\bf 0.399\pm 0.016}} 0.395±0.016¯\underline{{\bf 0.395\pm 0.016}} 0.428±0.0160.428\pm 0.016
Naval 0.005±0.001{\bf 0.005\pm 0.001} 0.004±0.000¯\underline{{\bf 0.004\pm 0.000}} 0.003±0.000¯\underline{{\bf 0.003\pm 0.000}} 0.003±0.000¯\underline{{\bf 0.003\pm 0.000}} 0.003±0.000¯\underline{{\bf 0.003\pm 0.000}} 0.006±0.0000.006\pm 0.000
WineRed 0.654±0.009{\bf 0.654\pm 0.009} 0.647±0.009¯\underline{{\bf\mathbf{0.647\pm 0.009}}} 0.644±0.009¯\underline{{\bf 0.644\pm 0.009}} 0.641±0.009¯\underline{{\bf 0.641\pm 0.009}} 0.639±0.009¯\underline{{\bf 0.639\pm 0.009}} 0.655±0.0080.655\pm 0.008
WineWhite 0.695±0.0030.695\pm 0.003 0.692±0.0030.692\pm 0.003 0.690±0.0030.690\pm 0.003 0.689±0.0020.689\pm 0.002 0.688±0.0030.688\pm 0.003 0.655±0.008¯\underline{{\bf 0.655\pm 0.008}}
Yacht 0.616±0.0550.616\pm 0.055 0.608±0.0520.608\pm 0.052 0.597±0.0510.597\pm 0.051 0.587±0.054{\bf 0.587\pm 0.054} 0.578±0.054{\bf 0.578\pm 0.054} 0.593±0.0710.593\pm 0.071
Table 3: Comparing SRLD with different number of samples with SVGD on test RMSE. The results are computed over 20 trials. For SRLD, the value is set to be boldface if it has better average performance than SVGD. The value if with underline if it is significantly better than SVGD with significant level 0.05 using a matched pair t-test.
Dataset Ave Test LL
SRLD-20 SRLD-40 SRLD-60 SRLD-80 SRLD-100 SVGD
Boston 2.642±.088¯\underline{{\bf-2.642\pm.088}} 2.582±0.084¯\underline{{\bf-2.582\pm 0.084}} 2.527±0.612¯\underline{{\bf-2.527\pm 0.612}} 2.516±0.062¯\underline{{\bf-2.516\pm 0.062}} 2.500±0.054¯\underline{{\bf-2.500\pm 0.054}} 4.276±0.217-4.276\pm 0.217
Concrete 3.084±0.036¯\underline{{\bf-3.084\pm 0.036}} 3.061±0.034¯\underline{{\bf-3.061\pm 0.034}} 3.050±0.033¯\underline{{\bf-3.050\pm 0.033}} 3.040±0.031¯\underline{{\bf-3.040\pm 0.031}} 3.034±0.031¯\underline{{\bf-3.034\pm 0.031}} 5.500±0.398-5.500\pm 0.398
Energy 0.580±0.053¯\underline{{\bf-0.580\pm 0.053}} 0.536±0.048¯\underline{{\bf-0.536\pm 0.048}} 0.522±0.046¯\underline{{\bf-0.522\pm 0.046}} 0.504±0.044¯\underline{{\bf-0.504\pm 0.044}} 0.476±0.036¯\underline{{\bf-0.476\pm 0.036}} 0.781±0.094-0.781\pm 0.094
Naval 4.033±0.230¯\underline{{\bf 4.033\pm 0.230}} 4.100±0.171¯\underline{{\bf 4.100\pm 0.171}} 4.140±0.015¯\underline{{\bf 4.140\pm 0.015}} 4.167±0.014¯\underline{{\bf 4.167\pm 0.014}} 4.186±0.015¯\underline{{\bf 4.186\pm 0.015}} 3.056±0.0343.056\pm 0.034
WineRed 1.008±0.019¯\underline{{\bf-1.008\pm 0.019}} 0.990±0.017¯\underline{{\bf-0.990\pm 0.017}} 0.982±0.016¯\underline{{\bf-0.982\pm 0.016}} 0.974±0.016¯\underline{{\bf-0.974\pm 0.016}} 0.970±0.016¯\underline{{\bf-0.970\pm 0.016}} 1.040±0.018-1.040\pm 0.018
WineWhite 1.053±0.004-1.053\pm 0.004 1.049±0.004-1.049\pm 0.004 1.047±0.004-1.047\pm 0.004 1.044±0.004-1.044\pm 0.004 1.043±0.004-1.043\pm 0.004 1.040±0.019¯\underline{{\bf-1.040\pm 0.019}}
Yacht 1.160±0.256¯\underline{{\bf-1.160\pm 0.256}} 0.650±0.173¯\underline{{\bf-0.650\pm 0.173}} 0.556±0.096¯\underline{{\bf-0.556\pm 0.096}} 0.465±0.037¯\underline{{\bf-0.465\pm 0.037}} 0.458±0.036¯\underline{{\bf-0.458\pm 0.036}} 1.281±0.279-1.281\pm 0.279
Table 4: Comparing SRLD with different number of samples with SVGD on test log-likelihood. The results are computed over 20 trials. For SRLD, the value is set to be boldface if it has better average performance than SVGD. The value if with underline if it is significantly better than SVGD with significant level 0.05 using a matched pair t-test.
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 8: Comparison between SRLD and Langevin dynamics on test RMSE. The results are computed based on 20 repeats. The error bar is calculated based on RMSE of SRLD - RMSE of Langevin dynamics in 20 repeats to rule out the variance of different data splitting
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 9: Comparison between SRLD and Langevin dynamics on test log-likelihood. The results are computed based on 20 repeats. The error bar is calculated based on log-likelihood of SRLD - log-likelihood of Langevin dynamics in 20 repeats to rule out the variance of data splitting.

Appendix D Contextual Bandit: Experiment Settings and More Background

Contextual bandit is a class of online learning problems that can be viewed as a simple reinforcement learning problem without transition. For a completely understanding of contextual bandit problems, we refer the readers to the Chapter 4 of [Bubeck et al., 2012]. Here we include the main idea for completeness. In contextual bandit problems, the agent needs to find out the best action given some observed context (a.k.a the optimal policy in reinforcement learning). Formally, we define 𝒮\mathcal{S} as the context set and KK as the number of action. Then we can concretely describe the contextual bandit problems as follows: for each time-step t=1,2,,Nt=1,2,\cdots,N, where NN is some pre-defined time horizon (and can be given to the agent), the environment provides a context st𝒮s_{t}\in\mathcal{S} to the agent, then the agent should choose one action at{1,2,,K}a_{t}\in\{1,2,\cdots,K\} based on context sts_{t}. The environment will return a (stochastic) reward r(st,at)r(s_{t},a_{t}) to the agent based on the context sts_{t} and the action ata_{t} that similar to the reinforcement learning setting. And notice that, the agent can adjust the strategy at each time-step, so that this kinds of problems are called “online” learning problem.

Solving the contextual bandit problems is equivalent to find some algorithms that can minimize the pseudo-regret [Bubeck et al., 2012], which is defined as:

R¯N𝒮=maxπ:𝒮{1,2,,K}𝔼[t=1Nr(st,g(st))t=1Nr(st,at)].\overline{R}_{N}^{\mathcal{S}}=\max_{\pi:\mathcal{S}\to\{1,2,\cdots,K\}}\mathbb{E}\left[\sum_{t=1}^{N}r(s_{t},g(s_{t}))-\sum_{t=1}^{N}r(s_{t},a_{t})\right]. (7)

where π\pi denotes the deterministic mapping from the context set 𝒮\mathcal{S} to actions {1,2,,K}\{1,2,\cdots,K\} (readers can view π\pi as a deterministic policy in reinforcement learning). Intuitively, this pseudo-regret measures the difference of cumulative reward between the action sequence ata_{t} and the best action sequence π(st)\pi(s_{t}). Thus, an algorithm that can minimize the pseudo-regret (7) can also find the best π\pi.

Posterior sampling [a.k.a. Thompson sampling; Thompson, 1933] is one of the classical yet successful algorithms that can achieve the state-of-the-art performance in practice [Chapelle and Li, 2011]. It works by first placing an user-specified prior μs,a0\mu^{0}_{s,a} on the reward r(s,a)r(s,a), and each turn make decision based on the posterior distribution and update it, i.e. update the posterior distribution μs,at\mu^{t}_{s,a} with the observation r(st1,at1)r(s_{t-1},a_{t-1}) at time t1t-1 where at1a_{t-1} is selected with the posterior distribution: each time, the action is selected with the following way:

at=argmaxa{1,2,,K}r^(st,a),r^(st,a)μs,at.a_{t}=\mathop{\arg\max}_{a\in\{1,2,\cdots,K\}}{\hat{r}(s_{t},a)},\quad\hat{r}(s_{t},a)\sim\mu^{t}_{s,a}.

i.e., greedy select the action based on the sampled reward from the posterior, thus called “Posterior Sampling”. Algorithm 1 summarizes the whole procedure of Posterior Sampling.

Algorithm 1 Posterior sampling for contextual bandits
  Input: Prior distribution μs,a0\mu^{0}_{s,a}, time horizon NN
  for time t=1,2,,Nt=1,2,\cdots,N do
     observe a new context st𝒮s_{t}\in\mathcal{S},
     sample the reward of each action r^(st,a)μs,at\hat{r}(s_{t},a)\sim\mu^{t}_{s,a}, a{1,2,,K}a\in\{1,2,\cdots,K\},
     select action at=argmaxa{1,2,,K}r^(st,a)a_{t}=\mathop{\arg\max}_{a\in\{1,2,\cdots,K\}}{\hat{r}(s_{t},a)} and get the reward r(st,at)r(s_{t},a_{t}),
     update the posterior μst,att+1\mu^{t+1}_{s_{t},a_{t}} with r(st,at)r(s_{t},a_{t}).
  end for

Notice that all of the reinforcement learning problems face the exploration-exploitation dilemma, so as the contextual bandit problem. Posterior sampling trade off the exploration and exploitation with the uncertainty provided by the posterior distribution. So if the posterior uncertainty is not estimated properly, posterior sampling will perform poorly. To see this, if we over-estimate the uncertainty, we can explore too-much sub-optimal actions, while if we under-estimate the uncertainty, we can fail to find the optimal actions. Thus, it is a good benchmark for evaluating the uncertainty provided by different inference methods.

Though in principle all of the MCMC methods return the samples follow the true posterior if we can run infinite MCMC steps, in practice we can only obtain finite samples as we only have finite time to run the MCMC sampler. In this case, the auto-correlation issue can lead to the under-estimate the uncertainty, which will cause the failure on all of the reinforcement learning problems that need exploration.

Here, we test the uncertainty provided by vanilla Langevin dynamics and Self-repulsive Langevin dynamics on two of the benchmark contextual bandit problems suggested by [Riquelme et al., 2018], called mushroom and wheel. One can read [Riquelme et al., 2018] to find the detail introduction of this two contextual bandit problems. For completeness, we include it as follows:

Mushroom Mushroom bandit utilizes the data from Mushroom dataset [Schlimmer, 1981], which includes different kinds of poisonous mushroom and safe mushroom with 22 attributes that can indicate whether the mushroom is poisonous or not. Blundell et al. [2015] first introduced the mushroom bandit by designing the following reward function: eating a safe mushroom will give a +5+5 reward, while eating a poisonous mushroom will return a reward +5+5 and 35-35 with equal chances. The agent can also choose not to eat the mushroom, which always yield a 0 reward. Same to [Riquelme et al., 2018], we use 50000 instances in this problem.

Wheel To highlight the need for exploration, [Riquelme et al., 2018] designs the wheel bandit, that can control the need of exploration with some “exploration parameter” δ(0,1)\delta\in(0,1). The context set 𝒮\mathcal{S} is the unit circle s21\|s\|_{2}\leq 1 in 2\mathbb{R}^{2}, and each turn the context sts_{t} is uniformly sampled from 𝒮\mathcal{S}. K=5K=5 possible actions are provided: the first action yields a constant reward r𝒩(μ1,σ2)r\sim\mathcal{N}(\mu_{1},\sigma^{2}); the reward corresponding to other actions is determined by the provided context ss:

  • For s𝒮s\in\mathcal{S} s.t. s2δ\|s\|_{2}\leq\delta, all of the four other actions return a suboptimal reward sampled from 𝒩(μ2,σ2)\mathcal{N}(\mu_{2},\sigma^{2}) for μ2<μ1\mu_{2}<\mu_{1}.

  • For s𝒮s\in\mathcal{S} s.t. s2>δ\|s\|_{2}>\delta, according to the quarter the context ss is in, one of the four actions becomes optimal. This optimal action gives a reward of 𝒩(μ3,σ2)\mathcal{N}(\mu_{3},\sigma^{2}) for μ3μ1\mu_{3}\gg\mu_{1}, and another three actions still yield the suboptimal reward 𝒩(μ2,σ2)\mathcal{N}(\mu_{2},\sigma^{2}).

Following the setting from [Riquelme et al., 2018], we set μ1=1.2\mu_{1}=1.2, μ2=1.0\mu_{2}=1.0, and μ3=50\mu_{3}=50.

When δ\delta approaches 11, the inner circle s2δ\|s\|_{2}\leq\delta will dominate the unit circle and the first action becomes the optimal for most of the context. Thus, inference methods with poorly estimated uncertainty will continuously choose the suboptimal action a1a_{1} for all of the contexts without exploration. This phenomenon have been confirmed in [Riquelme et al., 2018]. In our experiments, as we want to evaluate the quality of uncertainty provided by different methods, we set δ=0.95\delta=0.95, which is pretty hard for existing inference methods as shown in [Riquelme et al., 2018], and use 5000050000 contexts for evaluation.

Refer to caption
Figure 10: Visualization of the wheel bandit (δ=0.95\delta=0.95), taken from [Riquelme et al., 2018].

Experiment Setup Following [Riquelme et al., 2018], we use a feed-forward network with two hidden layer of 100 units and ReLU activation. We use the same step-size and thinning factor c/η=100c/\eta=100 for vanilla Langevin dynamics and SRLD, and set M=20M=20, α=10\alpha=10 on both of the mushroom and wheel bandits. The update schedule is similar to [Riquelme et al., 2018], and we just change the optimization step in stochastic variational inference methods into MCMC sampler step and replace the warm-up of stochastic variational inference methods in Riquelme et al. [2018] with the burn-in phase of the sampling. Similar to other methods in [Riquelme et al., 2018], we keep the initial learning rate as 10110^{-1} for fast burn-in and the step-size for sampling is tuned on the mushroom bandit and keep the same for both the mushroom and wheel bandit. As this is an online posterior inference problem, we only use the last 2020 samples to give the prediction. Notice that, in the original implementation of Riquelme et al. [2018], the authors only update a few steps with new observation after observing enough data, as the posterior will gradually converge to the true reward distribution and little update is needed after observing sufficient data. Similar to their implementation, after observing enough data, we only collect one new sample with the new observation each time. For SVGD, we use 20 particles to make the comparison fair, and also tune the step-size on the mushroom bandit.

Appendix E The Detailed analysis of SRLD

E.1 Some additional notation

We use \|\cdot\|_{\infty} to denote the \ell_{\infty} vector norm and define the \mathcal{L}_{\infty} norm of a function f:d1f:\mathbb{R}^{d}\to\mathbb{R}^{1} as f\left\|f\right\|_{\mathcal{L}_{\infty}}. 𝔻TV\mathbb{D}_{\mathrm{TV}} denote the Total Variation distance between distribution ρ1,ρ2\rho_{1},\rho_{2} respectively. Also, as KK is d×d1\mathbb{R}^{d}\times\mathbb{R}^{d}\to\mathbb{R}^{1}, we denote K,=sup𝜽1,𝜽2K(𝜽1,𝜽2)\left\|K\right\|_{\mathcal{L}_{\infty},\mathcal{L}_{\infty}}=\sup_{\bm{\theta}_{1},\bm{\theta}_{2}}K(\bm{\theta}_{1},\bm{\theta}_{2}). For simplicity, we may use K,\left\|K\right\|_{\infty,\infty} as K,\left\|K\right\|_{\mathcal{L}_{\infty},\mathcal{L}_{\infty}}. In the appendix, we also use ϕ[ρ](𝜽)𝒈(𝜽;ρ)\phi[\rho](\bm{\theta})\coloneqq\bm{g}(\bm{\theta};\rho), where 𝒈(𝜽;ρ)\bm{g}(\bm{\theta};\rho) is defined in the main text. For the clearance, we define πM,c/ηρkρkM\pi_{M,c/\eta}*\rho_{k}\coloneqq\rho_{k}^{M}, πM,c/ηρ~kρ~kM\pi_{M,c/\eta}*\tilde{\rho}_{k}\coloneqq\tilde{\rho}_{k}^{M} and πM,cρ¯tρ¯tM\pi_{M,c}*\bar{\rho}_{t}\coloneqq\bar{\rho}_{t}^{M}, where ρkM\rho_{k}^{M}, ρ~kM\tilde{\rho}_{k}^{M} and ρ¯tM\bar{\rho}_{t}^{M} are defined in main text.

E.2 Geometric Ergodicity of SRLD

Before we start the proof of main theorems, we give the following theorem on the geometric ergodicity of SRLD. It is noticeable that under this assumption, the practical dynamics follows an (Mc/η+1)(Mc/\eta+1)-order nonlinear autoregressive model when kMc/ηk\geq Mc/\eta:

𝜽k+1=𝝍(𝜽k,,𝜽kMc/η)+2η𝒆k,\bm{\theta}_{k+1}=\bm{\psi}\left(\bm{\theta}_{k},...,\bm{\theta}_{k-Mc/\eta}\right)+\sqrt{2\eta}\bm{e}_{k},

where

𝝍(𝜽k,,𝜽kMc/η)=𝜽k+η{V(𝜽k)+αϕ[1Mj=1Mδ𝜽kjc/η](𝜽k)}.\bm{\psi}\left(\bm{\theta}_{k},...,\bm{\theta}_{k-Mc/\eta}\right)=\bm{\theta}_{k}+\eta\left\{-\nabla V(\bm{\theta}_{k})+\alpha\phi[\frac{1}{M}\sum_{j=1}^{M}\delta_{\bm{\theta}_{k-jc/\eta}}](\bm{\theta}_{k})\right\}.

Further, if we stack the parameter by 𝚯k=[𝜽k,,𝜽kMc/η]\bm{\Theta}_{k}=\left[\bm{\theta}_{k},...,\bm{\theta}_{k-Mc/\eta}\right]^{\top} and define 𝚿(𝚯k)=[𝝍(𝚯k),𝚯k]\bm{\Psi}\left(\bm{\Theta}_{k}\right)=\left[\bm{\psi}^{\top}\left(\bm{\Theta}_{k}\right),\bm{\Theta}_{k}^{\top}\right]^{\top}, we have

𝚯k+1=𝚿(𝚯k)+2η𝑬k,\bm{\Theta}_{k+1}=\bm{\Psi}\left(\bm{\Theta}_{k}\right)+\sqrt{2\eta}\bm{E}_{k},

where 𝑬k=[𝒆k,𝟎,,𝟎].\bm{E}_{k}=\left[\bm{e}_{k}^{\top},\bm{0}^{\top},...,\bm{0}^{\top}\right]^{\top}. In this way, we formulate 𝚯k\bm{\Theta}_{k} as a time homogeneous Markov Chain. In the following analysis, we only analyze the second phase of SRLD given some initial stacked particles 𝚯Mc/η1\bm{\Theta}_{Mc/\eta-1}.

Theorem E.1 (Geometric Ergodicity).

Under Assumption 4.1 and Assumption 4.2, suppose we choose η\eta and α\alpha such that

max(12ηa1+η2b1+2αησb1,2αησ(b1+1))<1,\max\left(1-2\eta a_{1}+\eta^{2}b_{1}+\frac{2\alpha\eta}{\sigma}b_{1},\frac{2\alpha\eta}{\sigma}(b_{1}+1)\right)<1,

then the Markov Chain of 𝚯k\bm{\Theta}_{k} is stationary, geometrically ergodic, i.e., for any 𝚯0=𝚯Mc/η1\bm{\Theta}^{\prime}_{0}=\bm{\Theta}_{Mc/\eta-1}, we have

𝔻TV[Pk(,𝚯0),Π()]Q(𝚯0)erk,\mathbb{D}_{\mathrm{TV}}\left[P^{k}\left(\cdot,\bm{\Theta}_{0}\right),\Pi\left(\cdot\right)\right]\leq Q\left(\bm{\Theta}_{0}\right)e^{-rk},

where r=𝒪(η)r=\mathcal{O}(\eta) is some positive constant, Q(𝚯0)Q(\bm{\Theta}_{0}) is constant related to 𝚯0\bm{\Theta}_{0}, PkP^{k} is the kk-step Markov transition kernel and Π\Pi is the stationary distribution.

We defer the proof to Appendix E.5.1.

E.3 Moment Bound

Theorem E.2 (Moment Bound).

Under Assumption 4.2, suppose that we have 𝔼𝛉ρ0𝛉2<\mathbb{E}_{\bm{\theta}\sim\rho_{0}}\left\|\bm{\theta}\right\|^{2}<\infty; and a2αK(2b1+4σ)>0a_{2}-\alpha\left\|K\right\|_{\infty}\left(2b_{1}+\frac{4}{\sigma}\right)>0, we have

supk𝔼𝜽ρk𝜽2supk𝔼𝜽ρ~k𝜽2supt𝔼𝜽ρ¯t𝜽2\displaystyle\sup_{k}\mathbb{E}_{\bm{\theta}\sim\rho_{k}}\left\|\bm{\theta}\right\|^{2}\lor\sup_{k}\ \mathbb{E}_{\bm{\theta}\sim\tilde{\rho}_{k}}\left\|\bm{\theta}\right\|^{2}\lor\sup_{t}\mathbb{E}_{\bm{\theta}\sim\bar{\rho}_{t}}\left\|\bm{\theta}\right\|^{2}
\displaystyle\leq 𝔼𝜽ρ0𝜽2+b1+1+ηa2K,2ασαK,(2b1+2σ).\displaystyle\mathbb{E}_{\bm{\theta}\sim\rho_{0}}\left\|\bm{\theta}\right\|^{2}+\frac{b_{1}+1+\eta}{a_{2}-\left\|K\right\|_{\mathcal{L}_{\infty},\mathcal{L}_{\infty}}\frac{2\alpha}{\sigma}-\alpha\left\|K\right\|_{\mathcal{L}_{\infty},\mathcal{L}_{\infty}}\left(2b_{1}+\frac{2}{\sigma}\right)}.

And by Lemma E.1, we thus have

supk𝔼𝜽ρkV(𝜽)2supk𝔼𝜽ρ~kV(𝜽)2supt𝔼𝜽ρ¯tV(𝜽)2\displaystyle\sup_{k}\ \mathbb{E}_{\bm{\theta}\sim\rho_{k}}\left\|\nabla V(\bm{\theta})\right\|^{2}\lor\sup_{k}\ \mathbb{E}_{\bm{\theta}\sim\tilde{\rho}_{k}}\left\|\nabla V(\bm{\theta})\right\|^{2}\lor\sup_{t}\mathbb{E}_{\bm{\theta}\sim\bar{\rho}_{t}}\left\|\nabla V(\bm{\theta})\right\|^{2}
\displaystyle\leq b1𝔼𝜽ρ0𝜽2+b1(b1+1+η)a2K,2ασαK,(2b1+2σ)+1\displaystyle b_{1}\mathbb{E}_{\bm{\theta}\sim\rho_{0}}\left\|\bm{\theta}\right\|^{2}+\frac{b_{1}(b_{1}+1+\eta)}{a_{2}-\left\|K\right\|_{\mathcal{L}_{\infty},\mathcal{L}_{\infty}}\frac{2\alpha}{\sigma}-\alpha\left\|K\right\|_{\mathcal{L}_{\infty},\mathcal{L}_{\infty}}\left(2b_{1}+\frac{2}{\sigma}\right)}+1

The proof can be found at Appendix E.5.2.

E.4 Technical Lemma

Definition E.1 (α\alpha-mixing).

For any two σ\sigma-algebras 𝒜\mathcal{A} and \mathcal{B}, the α\alpha-mixing coefficient is defined by

α(𝒜,)=supA𝒜,B|(AB)(A)(B)|.\alpha(\mathcal{A},\mathcal{B})=\underset{A\in\mathcal{A},B\in\mathcal{B}}{\sup}\left|\mathbb{P}\left(A\cap B\right)-\mathbb{P}\left(A\right)\mathbb{P}\left(B\right)\right|.

Let (Xk,k1)\left(X_{k},k\geq 1\right) be a sequence of real random variable defined on (Ω,𝒜,)\left(\Omega,\mathcal{A},\mathbb{P}\right). This sequence is α\alpha-mixing if

α(n)supk1α(k,𝒢k+n)0,asn,\alpha(n)\coloneqq\underset{k\geq 1}{\sup}\ \alpha\left(\mathcal{M}_{k},\mathcal{G}_{k+n}\right)\to 0,\ \mathrm{as}\ n\to\infty,

where jσ(Xi,ij)\mathcal{M}_{j}\coloneqq\sigma\left(X_{i},i\leq j\right) and 𝒢jσ(Xi,ij)\mathcal{G}_{j}\coloneqq\sigma\left(X_{i},i\geq j\right) for j1j\geq 1. Alternatively, as shown by Theorem 4.4 of Bradley [2007]

α(n)14sup{Cov(f,g)fg;f(k),g(𝒢k+n)}.\alpha(n)\coloneqq\frac{1}{4}\sup\left\{\frac{\mathrm{Cov}\left(f,g\right)}{\left\|f\right\|_{\mathcal{L}_{\infty}}\left\|g\right\|_{\mathcal{L}_{\infty}}};\ f\in\mathcal{L}_{\infty}\left(\mathcal{M}_{k}\right),\ g\in\mathcal{L}_{\infty}\left(\mathcal{G}_{k+n}\right)\right\}.
Definition E.2 (β\beta-mixing).

For any two σ\sigma-algebras 𝒜\mathcal{A} and \mathcal{B}, the α\alpha-mixing coefficient is defined by

β(𝒜,)sup12i=1Ij=1J|(AiBj)(Ai)(Bj)|,\beta\left(\mathcal{A},\mathcal{B}\right)\coloneqq\sup\frac{1}{2}\sum_{i=1}^{I}\sum_{j=1}^{J}\left|\mathbb{P}\left(A_{i}\cap B_{j}\right)-\mathbb{P}\left(A_{i}\right)\mathbb{P}\left(B_{j}\right)\right|,

where the supremum is taken over all pairs of finite partitions {A1,,AI}\left\{A_{1},...,A_{I}\right\} and {B1,,BJ}\left\{B_{1},...,B_{J}\right\} of Ω\Omega such that Ai𝒜A_{i}\in\mathcal{A} and BjB_{j}\in\mathcal{B} for each ii, jj. Let (Xk,k1)\left(X_{k},k\geq 1\right) be a sequence of real random variable defined on (Ω,𝒜,)\left(\Omega,\mathcal{A},\mathbb{P}\right). This sequence is β\beta-mixing if

β(n)supk1β(k,𝒢k+n)0,asn.\beta(n)\coloneqq\underset{k\geq 1}{\sup}\ \beta\left(\mathcal{M}_{k},\mathcal{G}_{k+n}\right)\to 0,\ \mathrm{as}\ n\to\infty.
Proposition E.1 (β\beta-mixing implies α\alpha-mixing)).

For any two σ\sigma-algebras 𝒜\mathcal{A} and \mathcal{B},

α(𝒜,)12β(𝒜,).\alpha\left(\mathcal{A},\mathcal{B}\right)\leq\frac{1}{2}\beta\left(\mathcal{A},\mathcal{B}\right).

This proposition can be found in Equation 1.11 of Bradley [2005].

Proposition E.2.

A (strictly) stationary Markov Chain is geometric ergodicity if and only if β(n)0\beta(n)\to 0 at least exponentially fast as nn\to\infty.

This proposition is Theorem 3.7 of Bradley [2005].

Lemma E.1 (Regularity Conditions).

By Assumption 4.2, we have V(𝛉)b1(𝛉1+1)\left\|\nabla V(\bm{\theta})\right\|\leq b_{1}\left(\left\|\bm{\theta}_{1}\right\|+1\right) and 𝛉ηV(𝛉)(12ηa1+η2b1)𝛉2+η2b1+2ηb1.\left\|\bm{\theta}-\eta\nabla V(\bm{\theta})\right\|\leq\left(1-2\eta a_{1}+\eta^{2}b_{1}\right)\left\|\bm{\theta}\right\|^{2}+\eta^{2}b_{1}+2\eta b_{1}.

Lemma E.2 (Properties of RBF Kernel).

For RBF kernel with bandwidth σ\sigma, we have K,1\left\|K\right\|_{\infty,\infty}\leq 1 and

K(𝜽,𝜽1)K(𝜽,𝜽2)\displaystyle\left\|K(\bm{\theta}^{\prime},\bm{\theta}_{1})-K(\bm{\theta}^{\prime},\bm{\theta}_{2})\right\| \displaystyle\leq e()2/σLip𝜽1𝜽22\displaystyle\left\|e^{-(\cdot)^{2}/\sigma}\right\|_{\mathrm{Lip}}\left\|\bm{\theta}_{1}-\bm{\theta}_{2}\right\|_{2}
𝜽K(𝜽,𝜽1)𝜽K(𝜽,𝜽2)\displaystyle\left\|\nabla_{\bm{\theta}^{\prime}}K(\bm{\theta}^{\prime},\bm{\theta}_{1})-\nabla_{\bm{\theta}^{\prime}}K(\bm{\theta}^{\prime},\bm{\theta}_{2})\right\| \displaystyle\leq 2σe()2/σ()Lip𝜽1𝜽22.\displaystyle\left\|\frac{2}{\sigma}e^{-(\cdot)^{2}/\sigma}(\cdot)\right\|_{\mathrm{Lip}}\left\|\bm{\theta}_{1}-\bm{\theta}_{2}\right\|_{2}.
Lemma E.3 (Properties of Stein Operator).

For any distribution ρ\rho such that 𝔼𝛉ρV(𝛉)<,\mathbb{E}_{\bm{\theta}\sim\rho}\left\|\nabla V(\bm{\theta})\right\|<\infty, we have

ϕ[ρ]()Lip\displaystyle\left\|\phi[\rho](\cdot)\right\|_{\mathrm{Lip}} e()2/σLip𝔼𝜽ρV(𝜽)+2σe()2/σ()Lip,\displaystyle\leq\left\|e^{-(\cdot)^{2}/\sigma}\right\|_{\mathrm{Lip}}\mathbb{E}_{\bm{\theta}\sim\rho}\left\|\nabla V(\bm{\theta})\right\|+\left\|\frac{2}{\sigma}e^{-(\cdot)^{2}/\sigma}(\cdot)\right\|_{\mathrm{Lip}},
ϕ[ρ](𝜽)\displaystyle\left\|\phi[\rho](\bm{\theta})\right\| K𝔼𝜽ρ[V(𝜽)+2σ(𝜽+𝜽)]\displaystyle\leq\left\|K\right\|_{\infty}\mathbb{E}_{\bm{\theta}^{\prime}\sim\rho}\left[\left\|\nabla V(\bm{\theta}^{\prime})\right\|+\frac{2}{\sigma}\left(\left\|\bm{\theta}^{\prime}\right\|+\left\|\bm{\theta}\right\|\right)\right]
Kb1+𝔼𝜽ρ[(2σ+b1)𝜽]+𝜽.\displaystyle\leq\left\|K\right\|_{\infty}b_{1}+\mathbb{E}_{\bm{\theta}^{\prime}\sim\rho}\left[\left(\frac{2}{\sigma}+b_{1}\right)\left\|\bm{\theta}^{\prime}\right\|\right]+\left\|\bm{\theta}\right\|.
Lemma E.4 (Bounded Lipschitz of Stein Operator).

Given 𝛉\bm{\theta}^{\prime}, define ϕ¯𝛉(𝛉)ϕ[δ𝛉](𝛉)=K(𝛉,𝛉)V(𝛉)+1K(𝛉,𝛉)\bar{\phi}_{\bm{\theta}^{\prime}}(\bm{\theta})\coloneqq\phi[\delta_{\bm{\theta}^{\prime}}](\bm{\theta})=K(\bm{\theta}^{\prime},\bm{\theta})\nabla V(\bm{\theta}^{\prime})+\nabla_{1}K(\bm{\theta}^{\prime},\bm{\theta}). We also denote ϕ¯𝛉(𝛉)=[ϕ¯𝛉,1(𝛉),,ϕ¯𝛉,d(𝛉)]\bar{\phi}_{\bm{\theta}^{\prime}}(\bm{\theta})=[\bar{\phi}_{\bm{\theta}^{\prime},1}(\bm{\theta}),...,\bar{\phi}_{\bm{\theta}^{\prime},d}(\bm{\theta})]^{\top}. We have

i=1dϕ¯𝜽,i(𝜽)Lip2\displaystyle\sum_{i=1}^{d}\left\|\bar{\phi}_{\bm{\theta}^{\prime},i}(\bm{\theta})\right\|_{\mathrm{Lip}}^{2} \displaystyle\leq 2V(𝜽)2e2/σLip2+2d2σe𝜽2/σθ1Lip2\displaystyle 2\left\|\nabla V(\bm{\theta}^{\prime})\right\|^{2}\left\|e^{-\left\|\cdot\right\|^{2}/\sigma}\right\|_{\mathrm{Lip}}^{2}+2d\left\|\frac{2}{\sigma}e^{-\left\|\bm{\theta}\right\|^{2}/\sigma}\theta_{1}\right\|_{\mathrm{Lip}}^{2}
i=ddϕ¯𝜽,i(𝜽)2\displaystyle\sum_{i=d}^{d}\left\|\bar{\phi}_{\bm{\theta}^{\prime},i}(\bm{\theta})\right\|_{\mathcal{L}_{\infty}}^{2} \displaystyle\leq 2d2σe𝜽2/σθ12+2e2/σ2V(𝜽)2.\displaystyle 2d\left\|\frac{2}{\sigma}e^{-\left\|\bm{\theta}\right\|^{2}/\sigma}\theta_{1}\right\|_{\mathcal{L}_{\infty}}^{2}+2\left\|e^{-\left\|\cdot\right\|^{2}/\sigma}\right\|_{\mathcal{L}_{\infty}}^{2}\left\|\nabla V(\bm{\theta}^{\prime})\right\|^{2}.

E.5 Proof of Main Theorems

E.5.1 Proof of Theorem E.1

The proof of this theorem is by verifying the condition of Theorem 3.2 of An and Huang [1996]. Suppose 𝚯=[𝜽1,,𝜽MC+1]\bm{\Theta}=\left[\bm{\theta}_{1},...,\bm{\theta}_{MC+1}\right], where C=c/ηC=c/\eta, we have

𝝍(𝚯)\displaystyle\left\|\bm{\psi}\left(\bm{\Theta}\right)\right\| =𝜽1+η{V(𝜽1)+αϕ[1Mj=1Mδ𝜽1+jC](𝜽k)}\displaystyle=\left\|\bm{\theta}_{1}+\eta\left\{-\nabla V(\bm{\theta}_{1})+\alpha\phi[\frac{1}{M}\sum_{j=1}^{M}\delta_{\bm{\theta}_{1+jC}}](\bm{\theta}_{k})\right\}\right\|
=𝜽1ηV(𝜽1)+ηαMj=1M[e𝜽1+jC𝜽12/σ2σ(𝜽1𝜽1+jC)e𝜽1+jC𝜽12/σV(𝜽1+jC)]\displaystyle=\left\|\bm{\theta}_{1}-\eta\nabla V(\bm{\theta}_{1})+\frac{\eta\alpha}{M}\sum_{j=1}^{M}\left[e^{-\left\|\bm{\theta}_{1+jC}-\bm{\theta}_{1}\right\|^{2}/\sigma}\frac{2}{\sigma}\left(\bm{\theta}_{1}-\bm{\theta}_{1+jC}\right)-e^{-\left\|\bm{\theta}_{1+jC}-\bm{\theta}_{1}\right\|^{2}/\sigma}\nabla V(\bm{\theta}_{1+jC})\right]\right\|
𝜽1ηV(𝜽1)+2σηαMj=1Me𝜽1+jC𝜽12/σ𝜽1\displaystyle\leq\left\|\bm{\theta}_{1}-\eta\nabla V(\bm{\theta}_{1})+\frac{2}{\sigma}\frac{\eta\alpha}{M}\sum_{j=1}^{M}e^{-\left\|\bm{\theta}_{1+jC}-\bm{\theta}_{1}\right\|^{2}/\sigma}\bm{\theta}_{1}\right\|
+ηαMj=1Me𝜽1+jC𝜽12/σ2σ(V(𝜽1+jC)𝜽1+jC)\displaystyle+\left\|\frac{\eta\alpha}{M}\sum_{j=1}^{M}e^{-\left\|\bm{\theta}_{1+jC}-\bm{\theta}_{1}\right\|^{2}/\sigma}\frac{2}{\sigma}\left(-\nabla V(\bm{\theta}_{1+jC})-\bm{\theta}_{1+jC}\right)\right\|
𝜽1ηV(𝜽1)+2αησK,b1(1+𝜽1)\displaystyle\leq\left\|\bm{\theta}_{1}-\eta\nabla V(\bm{\theta}_{1})\right\|+\frac{2\alpha\eta}{\sigma}\left\|K\right\|_{\infty,\infty}b_{1}(1+\left\|\bm{\theta}_{1}\right\|)
+2αηMσj=1MK,b1(1+(1+1b1)𝜽1+jC)\displaystyle+\frac{2\alpha\eta}{M\sigma}\sum_{j=1}^{M}\left\|K\right\|_{\infty,\infty}b_{1}\left(1+(1+\frac{1}{b_{1}})\left\|\bm{\theta}_{1+jC}\right\|\right)
(1)b1(1+4αησK,)+η2b1+2ηb1\displaystyle\overset{(1)}{\leq}b_{1}(1+\frac{4\alpha\eta}{\sigma}\left\|K\right\|_{\infty,\infty})+\eta^{2}b_{1}+2\eta b_{1}
+(12ηa1+η2b1+2αησK,b1)𝜽1+2αησK,(b1+1)maxi[MC+1]{1}𝜽1+jC\displaystyle+\left(1-2\eta a_{1}+\eta^{2}b_{1}+\frac{2\alpha\eta}{\sigma}\left\|K\right\|_{\infty,\infty}b_{1}\right)\left\|\bm{\theta}_{1}\right\|+\frac{2\alpha\eta}{\sigma}\left\|K\right\|_{\infty,\infty}(b_{1}+1)\max_{i\in[MC+1]-\{1\}}\left\|\bm{\theta}_{1+jC}\right\|
b1(1+4αησK,)+η2b1+2ηb1\displaystyle\leq b_{1}(1+\frac{4\alpha\eta}{\sigma}\left\|K\right\|_{\infty,\infty})+\eta^{2}b_{1}+2\eta b_{1}
+max(12ηa1+η2b1+2αησK,b1,2αησK,(b1+1))maxi[MC+1]𝜽1+jC,\displaystyle+\max\left(1-2\eta a_{1}+\eta^{2}b_{1}+\frac{2\alpha\eta}{\sigma}\left\|K\right\|_{\infty,\infty}b_{1},\frac{2\alpha\eta}{\sigma}\left\|K\right\|_{\infty,\infty}(b_{1}+1)\right)\max_{i\in[MC+1]}\left\|\bm{\theta}_{1+jC}\right\|,

where (1)(1) is by Lemma E.1. Thus, given the step size η\eta, if we choose η\eta, α\alpha such that

max(12ηa1+η2b1+2αησK,b1,2αησK,(b1+1))<1,\max\left(1-2\eta a_{1}+\eta^{2}b_{1}+\frac{2\alpha\eta}{\sigma}\left\|K\right\|_{\infty,\infty}b_{1},\frac{2\alpha\eta}{\sigma}\left\|K\right\|_{\infty,\infty}(b_{1}+1)\right)<1,

then our dynamics is geometric ergodic.

E.5.2 Proof of Theorem E.2

Continuous-Time Mean Field Dynamics (5) Notice that as our dynamics has two phases and the first phase can be viewed as an special case of the second phase by setting α=0\alpha=0, here we only analysis the second phase. Define Ut=supst𝔼𝜽¯s2U_{t}=\underset{s\leq t}{\sup}\sqrt{\mathbb{E}\left\|\bar{\bm{\theta}}_{s}\right\|^{2}}, and thus

tUt2𝔼𝜽¯t,V(𝜽¯)+αϕ[πM,cρ¯t](𝜽¯t)0.\frac{\partial}{\partial t}U_{t}^{2}\leq\mathbb{E}\left\langle\bar{\bm{\theta}}_{t},-V(\bar{\bm{\theta}})+\alpha\phi[\pi_{M,c}*\bar{\rho}_{t}](\bar{\bm{\theta}}_{t})\right\rangle\vee 0.

Now we bound 𝔼𝜽¯t,V(𝜽¯)+αϕ[πM,cρ¯t](𝜽¯t)\mathbb{E}\left\langle\bar{\bm{\theta}}_{t},-V(\bar{\bm{\theta}})+\alpha\phi[\pi_{M,c}*\bar{\rho}_{t}](\bar{\bm{\theta}}_{t})\right\rangle:

𝔼𝜽¯t,V(𝜽¯t)+αϕ[πM,cρ¯t](𝜽¯t)\displaystyle\mathbb{E}\left\langle\bar{\bm{\theta}}_{t},-V(\bar{\bm{\theta}}_{t})+\alpha\phi[\pi_{M,c}*\bar{\rho}_{t}](\bar{\bm{\theta}}_{t})\right\rangle
\displaystyle\leq b1a2𝔼𝜽¯t2+α𝔼𝜽¯tϕ[πM,cρ¯t](𝜽¯t)\displaystyle b_{1}-a_{2}\mathbb{E}\left\|\bar{\bm{\theta}}_{t}\right\|^{2}+\alpha\mathbb{E}\left\|\bar{\bm{\theta}}_{t}\right\|\left\|\phi[\pi_{M,c}*\bar{\rho}_{t}](\bar{\bm{\theta}}_{t})\right\|
(1)\displaystyle\stackrel{{\scriptstyle(1)}}{{\leq}} b1a2𝔼𝜽¯t2+αK𝔼𝜽¯t𝔼𝜽πM,cρ¯t[V(𝜽)+2σ(𝜽+𝜽t)]\displaystyle b_{1}-a_{2}\mathbb{E}\left\|\bar{\bm{\theta}}_{t}\right\|^{2}+\alpha\left\|K\right\|_{\infty}\mathbb{E}\left\|\bar{\bm{\theta}}_{t}\right\|\mathbb{E}_{\bm{\theta}^{\prime}\sim\pi_{M,c}*\bar{\rho}_{t}}\left[\left\|\nabla V(\bm{\theta}^{\prime})\right\|+\frac{2}{\sigma}\left(\left\|\bm{\theta}^{\prime}\right\|+\left\|\bm{\theta}_{t}\right\|\right)\right]
\displaystyle\leq b1a2𝔼𝜽¯t2+αK𝔼𝜽¯t𝔼𝜽πM,cρ¯t[b1(𝜽+1)+2σ(𝜽+𝜽t)]\displaystyle b_{1}-a_{2}\mathbb{E}\left\|\bar{\bm{\theta}}_{t}\right\|^{2}+\alpha\left\|K\right\|_{\infty}\mathbb{E}\left\|\bar{\bm{\theta}}_{t}\right\|\mathbb{E}_{\bm{\theta}^{\prime}\sim\pi_{M,c}*\bar{\rho}_{t}}\left[b_{1}\left(\left\|\bm{\theta}^{\prime}\right\|+1\right)+\frac{2}{\sigma}\left(\left\|\bm{\theta}^{\prime}\right\|+\left\|\bm{\theta}_{t}\right\|\right)\right]
=\displaystyle= b1(a2K2ασ)𝔼𝜽¯t2+αK𝔼𝜽¯t𝔼𝜽πM,cρ¯t((b1+2σ)𝜽+b1)\displaystyle b_{1}-\left(a_{2}-\left\|K\right\|_{\infty}\frac{2\alpha}{\sigma}\right)\mathbb{E}\left\|\bar{\bm{\theta}}_{t}\right\|^{2}+\alpha\left\|K\right\|_{\infty}\mathbb{E}\left\|\bar{\bm{\theta}}_{t}\right\|\mathbb{E}_{\bm{\theta}^{\prime}\sim\pi_{M,c}*\bar{\rho}_{t}}\left(\left(b_{1}+\frac{2}{\sigma}\right)\left\|\bm{\theta}^{\prime}\right\|+b_{1}\right)
\displaystyle\leq b1(a2K2ασ)Ut2+αK𝔼𝜽¯t𝔼𝜽πM,cρ¯t((b1+2σ)𝜽+b1)\displaystyle b_{1}-\left(a_{2}-\left\|K\right\|_{\infty}\frac{2\alpha}{\sigma}\right)U_{t}^{2}+\alpha\left\|K\right\|_{\infty}\mathbb{E}\left\|\bar{\bm{\theta}}_{t}\right\|\mathbb{E}_{\bm{\theta}^{\prime}\sim\pi_{M,c}*\bar{\rho}_{t}}\left(\left(b_{1}+\frac{2}{\sigma}\right)\left\|\bm{\theta}^{\prime}\right\|+b_{1}\right)
\displaystyle\leq b1(a2K2ασ)Ut2+αK(b1+2σ)1Mj=1MUtUtjc+αKb1Ut\displaystyle b_{1}-\left(a_{2}-\left\|K\right\|_{\infty}\frac{2\alpha}{\sigma}\right)U_{t}^{2}+\alpha\left\|K\right\|_{\infty}\left(b_{1}+\frac{2}{\sigma}\right)\frac{1}{M}\sum_{j=1}^{M}U_{t}U_{t-jc}+\alpha\left\|K\right\|_{\infty}b_{1}U_{t}
\displaystyle\leq b1(a2K2ασ)Ut2+αK(b1+2σ)Ut2+αKb1(Ut2+1)\displaystyle b_{1}-\left(a_{2}-\left\|K\right\|_{\infty}\frac{2\alpha}{\sigma}\right)U_{t}^{2}+\alpha\left\|K\right\|_{\infty}\left(b_{1}+\frac{2}{\sigma}\right)U_{t}^{2}+\alpha\left\|K\right\|_{\infty}b_{1}(U_{t}^{2}+1)
\displaystyle\leq (b1+1)(a2K2ασαK(2b1+2σ))Ut2,\displaystyle\left(b_{1}+1\right)-\left(a_{2}-\left\|K\right\|_{\infty}\frac{2\alpha}{\sigma}-\alpha\left\|K\right\|_{\infty}\left(2b_{1}+\frac{2}{\sigma}\right)\right)U_{t}^{2},

where (1)(1) is by E.3. By the assumption that λa2K2ασαK(2b1+2σ)>0\lambda\coloneqq a_{2}-\left\|K\right\|_{\infty}\frac{2\alpha}{\sigma}-\alpha\left\|K\right\|_{\infty}\left(2b_{1}+\frac{2}{\sigma}\right)>0, we have

tUt2[(b1+1)λUt2]0.\frac{\partial}{\partial t}U_{t}^{2}\leq\left[\left(b_{1}+1\right)-\lambda U_{t}^{2}\right]\lor 0.

By Gronwall’s inequality, we have Ut2U02+b1+1λU_{t}^{2}\leq U_{0}^{2}+\frac{b_{1}+1}{\lambda}. (If tUt2=0\frac{\partial}{\partial t}U_{t}^{2}=0, then UtU_{t} fix and this bound still holds.) Notice that in the first phase, as α=0\alpha=0, we have λ<a2\lambda<a_{2} and thus this inequality also holds.

Discrete-Time Mean Field Dynamics (4) Similarly to the analysis of the continuous-time mean field dynamics (5), we only give proof of the second phase. Define Uk=supsk𝔼𝜽~s2U_{k}=\underset{s\leq k}{\sup}\sqrt{\mathbb{E}\left\|\tilde{\bm{\theta}}_{s}\right\|^{2}}, and thus

Uk2Uk12[2η𝔼𝜽~k1,V(𝜽~k)+αϕ[πM,c/ηρ~k](𝜽~k)+2η2]0.U_{k}^{2}-U_{k-1}^{2}\leq\left[2\eta\mathbb{E}\left\langle\tilde{\bm{\theta}}_{k-1},-\nabla V(\tilde{\bm{\theta}}_{k})+\alpha\phi[\pi_{M,c/\eta}*\tilde{\rho}_{k}](\tilde{\bm{\theta}}_{k})\right\rangle+2\eta^{2}\right]\lor 0.

By a similarly analysis, we have bound

𝔼𝜽~k1,V(𝜽~k)+αϕ[πM,c/ηρ~k](𝜽~k)\displaystyle\mathbb{E}\left\langle\tilde{\bm{\theta}}_{k-1},-\nabla V(\tilde{\bm{\theta}}_{k})+\alpha\phi[\pi_{M,c/\eta}*\tilde{\rho}_{k}](\tilde{\bm{\theta}}_{k})\right\rangle
\displaystyle\leq (b1+1)λUt2,\displaystyle\left(b_{1}+1\right)-\lambda U_{t}^{2},

where λ=a2K,2ασαK,(2b1+2σ)>0\lambda=a_{2}-\left\|K\right\|_{\infty,\infty}\frac{2\alpha}{\sigma}-\alpha\left\|K\right\|_{\infty,\infty}\left(2b_{1}+\frac{2}{\sigma}\right)>0. And thus we have

Uk2Uk12[2η[(b1+1)λUk12]+2η2]0.U_{k}^{2}-U_{k-1}^{2}\leq\left[2\eta\left[\left(b_{1}+1\right)-\lambda U_{k-1}^{2}\right]+2\eta^{2}\right]\lor 0.

It gives that

Uk2b1+1+ηλ+U02.U_{k}^{2}\leq\frac{b_{1}+1+\eta}{\lambda}+U_{0}^{2}.

Practical Dynamics (3) The analysis of Practical Dynamics (3) is almost identical to that of the discrete-time mean field dynamics (4) and thus is omitted here.

E.5.3 Proof of Theorem 4.1 and 5.1

Notice that the dynamics in Theorem 4.1 is special case of that in Theorem 5.1 and thus we only prove Theorem 5.1 here. After some algebra, we can show that the continuity equation of dynamics (6) is

tρt=([(D(𝜽)+Q(𝜽))V(𝜽)+αϕ[πM,cρt](𝜽t)]ρt+(D(𝜽)+Q(𝜽))ρt).\partial_{t}\rho_{t}=\nabla\cdot\left(\left[-\left(D(\bm{\theta})+Q(\bm{\theta})\right)\nabla V(\bm{\theta})+\alpha\phi[\pi_{M,c}*\rho_{t}](\bm{\theta}_{t})\right]\rho_{t}+\left(D(\bm{\theta})+Q(\bm{\theta})\right)\nabla\rho_{t}\right).

Notice that the limiting distribution satisfies

0\displaystyle 0 =a.e.([(D(𝜽)+Q(𝜽))V(𝜽)+αϕ[πM,cρ](𝜽t)]ρ+(D(𝜽)+Q(𝜽))ρ)\displaystyle\overset{a.e.}{=}\nabla\cdot\left(\left[-\left(D(\bm{\theta})+Q(\bm{\theta})\right)\nabla V(\bm{\theta})+\alpha\phi[\pi_{M,c}*\rho_{\infty}](\bm{\theta}_{t})\right]\rho_{\infty}+\left(D(\bm{\theta})+Q(\bm{\theta})\right)\nabla\rho_{\infty}\right)
=([(D(𝜽)+Q(𝜽))V(𝜽)+αϕ[ρ](𝜽t)]ρ+(D(𝜽)+Q(𝜽))ρ)\displaystyle=\nabla\cdot\left(\left[-\left(D(\bm{\theta})+Q(\bm{\theta})\right)\nabla V(\bm{\theta})+\alpha\phi[\rho_{\infty}](\bm{\theta}_{t})\right]\rho_{\infty}+\left(D(\bm{\theta})+Q(\bm{\theta})\right)\nabla\rho_{\infty}\right)
=([(D(𝜽)+Q(𝜽))V(𝜽)]ρ+(D(𝜽)+Q(𝜽))ρ)\displaystyle=\nabla\cdot\left(\left[-\left(D(\bm{\theta})+Q(\bm{\theta})\right)\nabla V(\bm{\theta})\right]\rho_{\infty}+\left(D(\bm{\theta})+Q(\bm{\theta})\right)\nabla\rho_{\infty}\right)
+α(K(ρV(𝜽)ρ)ρ).\displaystyle+\alpha\nabla\cdot\left(K*\left(\nabla\rho_{\infty}-\nabla V(\bm{\theta})\rho_{\infty}\right)\rho_{\infty}\right).

which implies that ρexp(V(𝜽))\rho_{\infty}\propto\exp(-V(\bm{\theta})) is the stationary distribution.

E.5.4 Proof of Theorem 4.2

In the later proof we use cdc_{d} to represent the quantity

𝔼𝜽ρ0𝜽2+b1+1+ηa2K,2ασαK,(2b1+2σ).\sqrt{\mathbb{E}_{\bm{\theta}\sim\rho_{0}}\left\|\bm{\theta}\right\|^{2}+\frac{b_{1}+1+\eta}{a_{2}-\left\|K\right\|_{\infty,\infty}\frac{2\alpha}{\sigma}-\alpha\left\|K\right\|_{\infty,\infty}\left(2b_{1}+\frac{2}{\sigma}\right)}}.

Recall that there are two dynamics: the continuous-time mean field dynamics (5) and the discretized version discrete-time mean field Dynamics (4). Notice that here we couple the discrete-time mean field dynamics with the continuous-time mean field system using the same initialization. Given any T=ηNT=\eta N, for any 0tT0\leq t\leq T, define t¯=tηη\underline{t}=\lfloor\frac{t}{\eta}\rfloor\eta. We introduce an another continuous-time interpolation dynamics:

𝜽^t\displaystyle\hat{\bm{\theta}}_{t} ={V(𝜽^t¯)+dt,t[0,Mc)V(𝜽^t¯)+αϕ[πM,cρ^t¯](𝜽^t¯)+dt,tMc,\displaystyle=\begin{cases}-\nabla V(\hat{\bm{\theta}}_{\underline{t}})+d\mathcal{B}_{t},&t\in[0,Mc)\\ -\nabla V(\hat{\bm{\theta}}_{\underline{t}})+\alpha\phi[\pi_{M,c}*\hat{\rho}_{\underline{t}}](\hat{\bm{\theta}}_{\underline{t}})+d\mathcal{B}_{t},&t\geq Mc,\end{cases}
ρ^t\displaystyle\hat{\rho}_{t} =Law(𝜽^t),\displaystyle=\mathrm{Law}(\hat{\bm{\theta}}_{t}),
𝜽^0\displaystyle\hat{\bm{\theta}}_{0} =𝜽¯0ρ¯0,\displaystyle=\bar{\bm{\theta}}_{0}\sim\bar{\rho}_{0},

Notice that here we couples this interpolation dynamics with the same Brownian motion as that of the dynamics of 𝜽¯t\bar{\bm{\theta}}_{t}. By the definition of 𝜽^t\hat{\bm{\theta}}_{t}, at any tkkηt_{k}\coloneqq k\eta for some integrate k[N]k\in[N], 𝜽^tk\hat{\bm{\theta}}_{t_{k}} and 𝜽~k\tilde{\bm{\theta}}_{k} has the same distribution. Define ρ¯t𝜽0=Law(𝜽¯t)\bar{\rho}_{t}^{\bm{\theta}_{0}}=\mathrm{Law}(\bar{\bm{\theta}}_{t}) conditioning on 𝜽¯0=𝜽0\bar{\bm{\theta}}_{0}=\bm{\theta}_{0} and ρ^t𝜽0=Law(𝜽^t)\hat{\rho}_{t}^{\bm{\theta}_{0}}=\mathrm{Law}(\hat{\bm{\theta}}_{t}) conditioning on 𝜽^0=𝜽0\hat{\bm{\theta}}_{0}=\bm{\theta}_{0}. Followed by the argument of proving Lemma 2 in Dalalyan [2017], if kMcηk\geq\frac{Mc}{\eta}, we have

𝔻KL[ρ¯tk𝜽0ρ^tk𝜽0]\displaystyle\mathbb{D}_{\mathrm{KL}}\left[\bar{\rho}_{t_{k}}^{\bm{\theta}_{0}}\|\hat{\rho}_{t_{k}}^{\bm{\theta}_{0}}\right]
=\displaystyle= 140tk𝔼V(𝜽^s¯)+αϕ[πM,cρ^s¯](𝜽^s¯)+V(𝜽^s)αϕ[πM,cρ¯s](𝜽^s)2𝑑s\displaystyle\frac{1}{4}\int_{0}^{t_{k}}\mathbb{E}\left\|-\nabla V(\hat{\bm{\theta}}_{\underline{s}})+\alpha\phi[\pi_{M,c}*\hat{\rho}_{\underline{s}}](\hat{\bm{\theta}}_{\underline{s}})+\nabla V(\hat{\bm{\theta}}_{s})-\alpha\phi[\pi_{M,c}*\bar{\rho}_{s}](\hat{\bm{\theta}}_{s})\right\|^{2}ds
=\displaystyle= 14j=0k1tjtj+1𝔼V(𝜽^tj)+αϕ[πM,cρ^tj](𝜽^tj)+V(𝜽^s)αϕ[πM,cρ¯s](𝜽^s)2𝑑s\displaystyle\frac{1}{4}\sum_{j=0}^{k-1}\int_{t_{j}}^{t_{j+1}}\mathbb{E}\left\|-\nabla V(\hat{\bm{\theta}}_{t_{j}})+\alpha\phi[\pi_{M,c}*\hat{\rho}_{t_{j}}](\hat{\bm{\theta}}_{t_{j}})+\nabla V(\hat{\bm{\theta}}_{s})-\alpha\phi[\pi_{M,c}*\bar{\rho}_{s}](\hat{\bm{\theta}}_{s})\right\|^{2}ds
\displaystyle\leq 34j=0k1tjtj+1𝔼V(𝜽^tj)V(𝜽^s)2𝑑s\displaystyle\frac{3}{4}\sum_{j=0}^{k-1}\int_{t_{j}}^{t_{j+1}}\mathbb{E}\left\|\nabla V(\hat{\bm{\theta}}_{t_{j}})-\nabla V(\hat{\bm{\theta}}_{s})\right\|^{2}ds
+\displaystyle+ 3α24j=0k1tjtj+1𝔼ϕ[πM,cρ^tj](𝜽^tj)ϕ[πM,cρ¯s](𝜽^tj)2𝑑s\displaystyle\frac{3\alpha^{2}}{4}\sum_{j=0}^{k-1}\int_{t_{j}}^{t_{j+1}}\mathbb{E}\left\|\phi[\pi_{M,c}*\hat{\rho}_{t_{j}}](\hat{\bm{\theta}}_{t_{j}})-\phi[\pi_{M,c}*\bar{\rho}_{s}](\hat{\bm{\theta}}_{t_{j}})\right\|^{2}ds
\displaystyle\leq 3α24j=0k1tjtj+1𝔼ϕ[πM,cρ¯s](𝜽^tj)ϕ[πM,cρ¯s](𝜽^s)2𝑑s\displaystyle\frac{3\alpha^{2}}{4}\sum_{j=0}^{k-1}\int_{t_{j}}^{t_{j+1}}\mathbb{E}\left\|\phi[\pi_{M,c}*\bar{\rho}_{s}](\hat{\bm{\theta}}_{t_{j}})-\phi[\pi_{M,c}*\bar{\rho}_{s}](\hat{\bm{\theta}}_{s})\right\|^{2}ds
=\displaystyle= I1+I2+I3.\displaystyle I_{1}+I_{2}+I_{3}.

We bound I1I_{1}, I2I_{2} and I3I_{3} separately.

Bounding I1I_{1} and I3I_{3} By the smoothness of V\nabla V, we have

V(𝜽^tj)V(𝜽^s)2b12𝜽^tj𝜽^s2.\left\|\nabla V(\hat{\bm{\theta}}_{t_{j}})-\nabla V(\hat{\bm{\theta}}_{s})\right\|^{2}\leq b_{1}^{2}\left\|\hat{\bm{\theta}}_{t_{j}}-\hat{\bm{\theta}}_{s}\right\|^{2}.

And by Lemma E.3 (Lipschitz of Stein Operator), we know that

ϕ[πM,cρ¯s](𝜽1)ϕ[πM,cρ¯s](𝜽2)\displaystyle\left\|\phi[\pi_{M,c}*\bar{\rho}_{s}](\bm{\theta}_{1})-\phi[\pi_{M,c}*\bar{\rho}_{s}](\bm{\theta}_{2})\right\|
\displaystyle\leq [e()2/σLip𝔼𝜽πM,cρ¯sV(𝜽)+2σe()2/σ()Lip]𝜽1𝜽22.\displaystyle\left[\left\|e^{-(\cdot)^{2}/\sigma}\right\|_{\mathrm{Lip}}\mathbb{E}_{\bm{\theta}\sim\pi_{M,c}*\bar{\rho}_{s}}\left\|\nabla V(\bm{\theta})\right\|+\left\|\frac{2}{\sigma}e^{-(\cdot)^{2}/\sigma}(\cdot)\right\|_{\mathrm{Lip}}\right]\left\|\bm{\theta}_{1}-\bm{\theta}_{2}\right\|_{2}.

And by the Assumption 4.2 and that ρ¯s\bar{\rho}_{s} as finite second moment, we have

ϕ[πM,cρ¯s](𝜽1)ϕ[πM,cρ¯s](𝜽2)\displaystyle\left\|\phi[\pi_{M,c}*\bar{\rho}_{s}](\bm{\theta}_{1})-\phi[\pi_{M,c}*\bar{\rho}_{s}](\bm{\theta}_{2})\right\|
\displaystyle\leq Ccd𝜽1𝜽22.\displaystyle Cc_{d}\left\|\bm{\theta}_{1}-\bm{\theta}_{2}\right\|_{2}.

Combine the two bounds, we have

I1+I33Ccd24j=0k1tjtj+1𝔼𝜽^tj𝜽^s2𝑑s.I_{1}+I_{3}\leq\frac{3Cc_{d}^{2}}{4}\sum_{j=0}^{k-1}\int_{t_{j}}^{t_{j+1}}\mathbb{E}\left\|\hat{\bm{\theta}}_{t_{j}}-\hat{\bm{\theta}}_{s}\right\|^{2}ds.

Notice that 𝜽^t=𝜽^t¯+[V(𝜽^t¯)+αϕ[πM,cρ^t¯](𝜽^t¯)](tt¯)+t¯t𝑑s\hat{\bm{\theta}}_{t}=\hat{\bm{\theta}}_{\underline{t}}+\left[-\nabla V(\hat{\bm{\theta}}_{\underline{t}})+\alpha\phi[\pi_{M,c}*\hat{\rho}_{\underline{t}}](\hat{\bm{\theta}}_{\underline{t}})\right](t-\underline{t})+\int_{\underline{t}}^{t}d\mathcal{B}_{s}. By Itô’s lemma, it implies that

I1+I3\displaystyle I_{1}+I_{3} 3Ccd24j=0k1tjtj+1𝔼𝜽^tj𝜽^s2𝑑s\displaystyle\leq\frac{3Cc_{d}^{2}}{4}\sum_{j=0}^{k-1}\int_{t_{j}}^{t_{j+1}}\mathbb{E}\left\|\hat{\bm{\theta}}_{t_{j}}-\hat{\bm{\theta}}_{s}\right\|^{2}ds
3Ccd24tjtj+1[𝔼V(𝜽^s¯)+αϕ[πM,cρ^s¯](𝜽^s¯)2(stj)2+2d(stj)]𝑑s\displaystyle\leq\frac{3Cc_{d}^{2}}{4}\int_{t_{j}}^{t_{j+1}}\left[\mathbb{E}\left\|-\nabla V(\hat{\bm{\theta}}_{\underline{s}})+\alpha\phi[\pi_{M,c}*\hat{\rho}_{\underline{s}}](\hat{\bm{\theta}}_{\underline{s}})\right\|^{2}(s-t_{j})^{2}+2d(s-t_{j})\right]ds
=Ccd2η3j=0k1𝔼V(𝜽^tj)+αϕ[πM,cρ^tj](𝜽^tj)2+Ccd2dkh2.\displaystyle=Cc_{d}^{2}\eta^{3}\sum_{j=0}^{k-1}\mathbb{E}\left\|-\nabla V(\hat{\bm{\theta}}_{t_{j}})+\alpha\phi[\pi_{M,c}*\hat{\rho}_{t_{j}}](\hat{\bm{\theta}}_{t_{j}})\right\|^{2}+Cc_{d}^{2}dkh^{2}.

By the assumption that 𝔼𝜽~tj\mathbb{E}\left\|\tilde{\bm{\theta}}_{t_{j}}\right\| is finite and 𝜽~tj=𝑑𝜽^tj\tilde{\bm{\theta}}_{t_{j}}\overset{d}{=}\hat{\bm{\theta}}_{t_{j}}, 𝔼𝜽^tj2\mathbb{E}\left\|\hat{\bm{\theta}}_{t_{j}}\right\|^{2} is also finite, we have

𝔼V(𝜽^t¯)+αϕ[πM,cρ^t¯](𝜽^t¯)2\displaystyle\mathbb{E}\left\|-\nabla V(\hat{\bm{\theta}}_{\underline{t}})+\alpha\phi[\pi_{M,c}*\hat{\rho}_{\underline{t}}](\hat{\bm{\theta}}_{\underline{t}})\right\|^{2}
\displaystyle\leq 2𝔼V(𝜽^t¯)2+2α2𝔼ϕ[πM,cρ^t¯](𝜽^t¯)2\displaystyle 2\mathbb{E}\left\|\nabla V(\hat{\bm{\theta}}_{\underline{t}})\right\|^{2}+2\alpha^{2}\mathbb{E}\left\|\phi[\pi_{M,c}*\hat{\rho}_{\underline{t}}](\hat{\bm{\theta}}_{\underline{t}})\right\|^{2}
\displaystyle\leq 4b12+4b12𝔼𝜽^t¯2+2α2𝔼((2σ+b1)𝔼𝜽πM,cρ^t¯𝜽+𝜽)2\displaystyle 4b_{1}^{2}+4b_{1}^{2}\mathbb{E}\left\|\hat{\bm{\theta}}_{\underline{t}}\right\|^{2}+2\alpha^{2}\mathbb{E}\left(\left(\frac{2}{\sigma}+b_{1}\right)\mathbb{E}_{\bm{\theta}^{\prime}\sim\pi_{M,c}*\hat{\rho}_{\underline{t}}}\left\|\bm{\theta}^{\prime}\right\|+\left\|\bm{\theta}\right\|\right)^{2}
\displaystyle\leq cd2C.\displaystyle c_{d}^{2}C.

Thus we conclude that

I1+I3Ccd2(cd2kη3+dkη2).I_{1}+I_{3}\leq Cc_{d}^{2}\left(c_{d}^{2}k\eta^{3}+dk\eta^{2}\right).

Bounding I2I_{2}

𝔼ϕ[πM,cρ^tj](𝜽^tj)ϕ[πM,cρ¯s](𝜽^tj)2\displaystyle\mathbb{E}\left\|\phi[\pi_{M,c}*\hat{\rho}_{t_{j}}](\hat{\bm{\theta}}_{t_{j}})-\phi[\pi_{M,c}*\bar{\rho}_{s}](\hat{\bm{\theta}}_{t_{j}})\right\|^{2}
=\displaystyle= 𝔼1Ml=1M[ϕ[ρ^tjcl](𝜽^tj)ϕ[ρ¯scl](𝜽^tj)]2\displaystyle\mathbb{E}\left\|\frac{1}{M}\sum_{l=1}^{M}\left[\phi[\hat{\rho}_{t_{j}-cl}](\hat{\bm{\theta}}_{t_{j}})-\phi[\bar{\rho}_{s-cl}](\hat{\bm{\theta}}_{t_{j}})\right]\right\|^{2}
\displaystyle\leq 1Ml=1M𝔼ϕ[ρ^tjcl](𝜽^tj)ϕ[ρ¯scl](𝜽^tj)2\displaystyle\frac{1}{M}\sum_{l=1}^{M}\mathbb{E}\left\|\phi[\hat{\rho}_{t_{j}-cl}](\hat{\bm{\theta}}_{t_{j}})-\phi[\bar{\rho}_{s-cl}](\hat{\bm{\theta}}_{t_{j}})\right\|^{2}
=\displaystyle= 1Ml=1M𝔼𝔼𝜽ρ^tjclϕ¯𝜽^tj(𝜽)𝔼𝜽ρ¯sclϕ¯𝜽^tj(𝜽)2\displaystyle\frac{1}{M}\sum_{l=1}^{M}\mathbb{E}\left\|\mathbb{E}_{\bm{\theta}\sim\hat{\rho}_{t_{j}-cl}}\bar{\phi}_{\hat{\bm{\theta}}_{t_{j}}}(\bm{\theta})-\mathbb{E}_{\bm{\theta}\sim\bar{\rho}_{s-cl}}\bar{\phi}_{\hat{\bm{\theta}}_{t_{j}}}(\bm{\theta})\right\|^{2}
=\displaystyle= 1Ml=1M𝔼𝜽^tji=1d|𝔼𝜽ρ^tjclϕ¯𝜽^tj,i(𝜽)𝔼𝜽ρ¯sclϕ¯𝜽^tj,i(𝜽)|2\displaystyle\frac{1}{M}\sum_{l=1}^{M}\mathbb{E}_{\hat{\bm{\theta}}_{t_{j}}}\sum_{i=1}^{d}\left|\mathbb{E}_{\bm{\theta}\sim\hat{\rho}_{t_{j}-cl}}\bar{\phi}_{\hat{\bm{\theta}}_{t_{j}},i}(\bm{\theta})-\mathbb{E}_{\bm{\theta}\sim\bar{\rho}_{s-cl}}\bar{\phi}_{\hat{\bm{\theta}}_{t_{j}},i}(\bm{\theta})\right|^{2}
\displaystyle\leq 1Ml=1M𝔼𝜽^tji=1d(ϕ¯𝜽^tj,i()ϕ¯𝜽^tj,i()Lip)2𝔻BL2[ρ^tjcl,ρ¯scl]\displaystyle\frac{1}{M}\sum_{l=1}^{M}\mathbb{E}_{\hat{\bm{\theta}}_{t_{j}}}\sum_{i=1}^{d}\left(\left\|\bar{\phi}_{\hat{\bm{\theta}}_{t_{j}},i}(\cdot)\right\|_{\mathcal{L}_{\infty}}\lor\left\|\bar{\phi}_{\hat{\bm{\theta}}_{t_{j}},i}(\cdot)\right\|_{\mathrm{Lip}}\right)^{2}\mathbb{D}_{\mathrm{BL}}^{2}\left[\hat{\rho}_{t_{j}-cl},\bar{\rho}_{s-cl}\right]

By Lemma E.4 and the Assumption 4.4 that VV is at most quadratic growth and that ρ^t¯\hat{\rho}_{\underline{t}} has finite second moment, we have

𝔼𝜽^tji=1d(ϕ¯𝜽^tj,i()ϕ¯𝜽^tj,i()Lip)2\displaystyle\mathbb{E}_{\hat{\bm{\theta}}_{t_{j}}}\sum_{i=1}^{d}\left(\left\|\bar{\phi}_{\hat{\bm{\theta}}_{t_{j}},i}(\cdot)\right\|_{\mathcal{L}_{\infty}}\lor\left\|\bar{\phi}_{\hat{\bm{\theta}}_{t_{j}},i}(\cdot)\right\|_{\mathrm{Lip}}\right)^{2}
=\displaystyle= 𝔼𝜽^tji=1d(ϕ¯𝜽^tj,i()2ϕ¯𝜽^tj,i()Lip2)\displaystyle\mathbb{E}_{\hat{\bm{\theta}}_{t_{j}}}\sum_{i=1}^{d}\left(\left\|\bar{\phi}_{\hat{\bm{\theta}}_{t_{j}},i}(\cdot)\right\|_{\mathcal{L}_{\infty}}^{2}\lor\left\|\bar{\phi}_{\hat{\bm{\theta}}_{t_{j}},i}(\cdot)\right\|_{\mathrm{Lip}}^{2}\right)
\displaystyle\leq [4d2σe𝜽2/σθ1BL2+4e2/σBL2𝔼𝜽^tjV(𝜽^tj)2]\displaystyle\left[4d\left\|\frac{2}{\sigma}e^{-\left\|\bm{\theta}\right\|^{2}/\sigma}\theta_{1}\right\|_{\mathrm{BL}}^{2}+4\left\|e^{-\left\|\cdot\right\|^{2}/\sigma}\right\|_{\mathrm{BL}}^{2}\mathbb{E}_{\hat{\bm{\theta}}_{t_{j}}}\left\|\nabla V(\hat{\bm{\theta}}_{t_{j}})\right\|^{2}\right]
\displaystyle\leq C(d+cd2).\displaystyle C(d+c_{d}^{2}).

Plug in the above estimation, we have

I2\displaystyle I_{2} =3α24j=0k1tjtj+1𝔼ϕ[πM,cρ^tj](𝜽^tj)ϕ[πM,cρ¯s](𝜽^tj)2𝑑s\displaystyle=\frac{3\alpha^{2}}{4}\sum_{j=0}^{k-1}\int_{t_{j}}^{t_{j+1}}\mathbb{E}\left\|\phi[\pi_{M,c}*\hat{\rho}_{t_{j}}](\hat{\bm{\theta}}_{t_{j}})-\phi[\pi_{M,c}*\bar{\rho}_{s}](\hat{\bm{\theta}}_{t_{j}})\right\|^{2}ds
α2C(d+cd2)j=0k1tjtj+11Ml=1M𝔻BL2[ρ^tjcl,ρ¯scl]ds\displaystyle\leq\alpha^{2}C(d+c_{d}^{2})\sum_{j=0}^{k-1}\int_{t_{j}}^{t_{j+1}}\frac{1}{M}\sum_{l=1}^{M}\mathbb{D}_{\mathrm{BL}}^{2}\left[\hat{\rho}_{t_{j}-cl},\bar{\rho}_{s-cl}\right]ds
α2C(d+cd2)j=0k11Ml=1Mtjtj+1𝔻KL[ρ^tjcl,ρ¯scl]𝑑s,\displaystyle\leq\alpha^{2}C(d+c_{d}^{2})\sum_{j=0}^{k-1}\frac{1}{M}\sum_{l=1}^{M}\int_{t_{j}}^{t_{j+1}}\mathbb{D}_{\mathrm{KL}}\left[\hat{\rho}_{t_{j}-cl},\bar{\rho}_{s-cl}\right]ds,

where the last inequality is due to the relation that 𝔻BL2definition𝔻TV2Pinskers𝔻KL\mathbb{D}_{\mathrm{BL}}^{2}\overset{\mathrm{definition}}{\leq}\mathbb{D}_{\mathrm{TV}}^{2}\overset{\mathrm{Pinsker^{\prime}s}}{\leq}\mathbb{D}_{\mathrm{KL}}.

Overall Bound Combine all the estimation, we have

𝔻KL[ρ¯tk𝜽0ρ^tk𝜽0]\displaystyle\mathbb{D}_{\mathrm{KL}}\left[\bar{\rho}_{t_{k}}^{\bm{\theta}_{0}}\|\hat{\rho}_{t_{k}}^{\bm{\theta}_{0}}\right] α2C(d+cd2)j=0k11Ml=1Mtjtj+1𝔻KL[ρ^tjcl,ρ¯scl]𝑑s+Ccd2(cd2kη3+dkη2)\displaystyle\leq\alpha^{2}C(d+c_{d}^{2})\sum_{j=0}^{k-1}\frac{1}{M}\sum_{l=1}^{M}\int_{t_{j}}^{t_{j+1}}\mathbb{D}_{\mathrm{KL}}\left[\hat{\rho}_{t_{j}-cl},\bar{\rho}_{s-cl}\right]ds+Cc_{d}^{2}\left(c_{d}^{2}k\eta^{3}+dk\eta^{2}\right)
=α2C(d+cd2)j=0k11Ml=1M0η𝔻KL[ρ^t(jηclη),ρ¯t(jηclη)+s]𝑑s+Ccd2(cd2kη3+dkη2)\displaystyle=\alpha^{2}C(d+c_{d}^{2})\sum_{j=0}^{k-1}\frac{1}{M}\sum_{l=1}^{M}\int_{0}^{\eta}\mathbb{D}_{\mathrm{KL}}\left[\hat{\rho}_{t_{\left(\frac{j\eta-cl}{\eta}\right)}},\bar{\rho}_{t_{\left(\frac{j\eta-cl}{\eta}\right)}+s}\right]ds+Cc_{d}^{2}\left(c_{d}^{2}k\eta^{3}+dk\eta^{2}\right)

Similar, if kMcη1k\leq\frac{Mc}{\eta}-1, we have

𝔻KL[ρ¯tk𝜽0ρ^tk𝜽0]\displaystyle\mathbb{D}_{\mathrm{KL}}\left[\bar{\rho}_{t_{k}}^{\bm{\theta}_{0}}\|\hat{\rho}_{t_{k}}^{\bm{\theta}_{0}}\right]
=\displaystyle= 140tk𝔼V(𝜽^s¯)V(𝜽^s)2𝑑s\displaystyle\frac{1}{4}\int_{0}^{t_{k}}\mathbb{E}\left\|\nabla V(\hat{\bm{\theta}}_{\underline{s}})-\nabla V(\hat{\bm{\theta}}_{s})\right\|^{2}ds
\displaystyle\leq b124j=0k1tjtj+1𝔼𝜽^tj𝜽^s2𝑑s\displaystyle\frac{b_{1}^{2}}{4}\sum_{j=0}^{k-1}\int_{t_{j}}^{t_{j+1}}\mathbb{E}\left\|\hat{\bm{\theta}}_{t_{j}}-\hat{\bm{\theta}}_{s}\right\|^{2}ds
\leq b12η312j=0k1𝔼V(𝜽^tj)2+dkb12η24\displaystyle\frac{b_{1}^{2}\eta^{3}}{12}\sum_{j=0}^{k-1}\mathbb{E}\left\|\nabla V(\hat{\bm{\theta}}_{t_{j}})\right\|^{2}+\frac{dkb_{1}^{2}\eta^{2}}{4}
\displaystyle\leq b12η3kcd212+dkb12η24.\displaystyle\frac{b_{1}^{2}\eta^{3}kc_{d}^{2}}{12}+\frac{dkb_{1}^{2}\eta^{2}}{4}.

Define

uk=sups[tk,tk+1]𝔻KL[ρ¯s¯𝜽0ρ^s𝜽0],u_{k}=\underset{s\in\left[t_{k},t_{k+1}\right]}{\sup}\ \mathbb{D}_{\mathrm{KL}}\left[\bar{\rho}_{\underline{s}}^{\bm{\theta}_{0}}\|\hat{\rho}_{s}^{\bm{\theta}_{0}}\right],

and Uk=maxl{0,,k}ulU_{k}=\underset{l\in\{0,...,k\}}{\max}\ u_{l}. We conclude that for kMcηk\geq\frac{Mc}{\eta}, for any kkk^{\prime}\leq k,

uk\displaystyle u_{k^{\prime}} α2C(d+cd2)j=0k11Ml=1M0h𝔻KL[ρ^t(jηclη),ρ¯t(jηclη)+s]𝑑s+Ccd2(cd2kη3+dkη2)\displaystyle\leq\alpha^{2}C(d+c_{d}^{2})\sum_{j=0}^{k-1}\frac{1}{M}\sum_{l=1}^{M}\int_{0}^{h}\mathbb{D}_{\mathrm{KL}}\left[\hat{\rho}_{t_{\left(\frac{j\eta-cl}{\eta}\right)}},\bar{\rho}_{t_{\left(\frac{j\eta-cl}{\eta}\right)}+s}\right]ds+Cc_{d}^{2}\left(c_{d}^{2}k\eta^{3}+dk\eta^{2}\right)
α2C(d+cd2)j=0k11Ml=1Mηu(jηclη)+Ccd2(cd2kη3+dkη2)\displaystyle\leq\alpha^{2}C(d+c_{d}^{2})\sum_{j=0}^{k-1}\frac{1}{M}\sum_{l=1}^{M}\eta u_{\left(\frac{j\eta-cl}{\eta}\right)}+Cc_{d}^{2}\left(c_{d}^{2}k\eta^{3}+dk\eta^{2}\right)
α2C(d+cd2)ηj=0k1Uj+Ccd2(cd2kη3+dkη2).\displaystyle\leq\alpha^{2}C(d+c_{d}^{2})\eta\sum_{j=0}^{k-1}U_{j}+Cc_{d}^{2}\left(c_{d}^{2}k\eta^{3}+dk\eta^{2}\right).

For k<Mcηk<\frac{Mc}{\eta}, which is a simpler case, we have

UkC(η3kcd2+dkη2)<CMc(ηcd2+d)η.U_{k}\leq C\left(\eta^{3}kc_{d}^{2}+dk\eta^{2}\right)<CMc\left(\eta c_{d}^{2}+d\right)\eta.

We bound the case when kMcηk\geq\frac{Mc}{\eta},

Ukα2C(d+cd2)ηj=0k1Uj+Ccd2(cd2kη3+dkη2).U_{k}\leq\alpha^{2}C(d+c_{d}^{2})\eta\sum_{j=0}^{k-1}U_{j}+Cc_{d}^{2}\left(c_{d}^{2}k\eta^{3}+dk\eta^{2}\right).

If we take η\eta sufficiently small, such that cd2kη3dkη2c_{d}^{2}k\eta^{3}\leq dk\eta^{2}, we have

Uk\displaystyle U_{k} α2C(d+cd2)ηj=0k1Uj+2Ccd2dkη2\displaystyle\leq\alpha^{2}C(d+c_{d}^{2})\eta\sum_{j=0}^{k-1}U_{j}+2Cc_{d}^{2}dk\eta^{2}
α2C(d+cd2)ηj=0k1(Uj+η).\displaystyle\leq\alpha^{2}C(d+c_{d}^{2})\eta\sum_{j=0}^{k-1}\left(U_{j}+\eta\right).

Define η=α2C(d+cd2)η\eta^{\prime}=\alpha^{2}C(d+c_{d}^{2})\eta and we can choose η\eta small enough such that η<1/2\eta^{\prime}<1/2 and η<1/2\eta<1/2. Without loss of generality, we also assume ηη\eta^{\prime}\geq\eta and thus we have

Ukηj=0k1(Uj+η).U_{k}\leq\eta^{\prime}\sum_{j=0}^{k-1}\left(U_{j}+\eta^{\prime}\right).

Also we assume UkηU_{k}\geq\eta^{\prime}, otherwise we conclude that Uk<ηU_{k}<\eta^{\prime}. We thus have Ukqj=0k1UjU_{k}\leq q\sum_{j=0}^{k-1}U_{j}, where q=2ηq=2\eta^{\prime}. Suppose that UMcη1=xCMc(ηcd2+d)ηU_{\frac{Mc}{\eta}-1}=x\leq CMc\left(\eta c_{d}^{2}+d\right)\eta and some algebra (which reduces to Pascal’s triangle) shows that

Ukxq(1+q)kMcη.U_{k}\leq xq(1+q)^{k-\frac{Mc}{\eta}}.

We conclude that Ukxq(1+q)k1U_{k}\leq xq(1+q)^{k-1}. Notice that q=2α2C(d+cd2)η.q=2\alpha^{2}C(d+c_{d}^{2})\eta. Thus for any kMc/ηk\geq Mc/\eta,

Uk\displaystyle U_{k} xq(1+q)kMcη\displaystyle\leq xq(1+q)^{k-\frac{Mc}{\eta}}
=xq(1+q)(kηMc)/η\displaystyle=xq(1+q)^{\left(k\eta-Mc\right)/\eta}
=xq(1+q)2α2C(d+cd2)(kηMc)/q\displaystyle=xq(1+q)^{2\alpha^{2}C(d+c_{d}^{2})\left(k\eta-Mc\right)/q}
x2α2C(d+cd2)e2α2C(d+cd2)(kηMc)η\displaystyle\leq x2\alpha^{2}C(d+c_{d}^{2})e^{2\alpha^{2}C(d+c_{d}^{2})\left(k\eta-Mc\right)}\eta
CMcα2(ηcd2+d)(d+cd2)e2α2C(d+cd2)(kηMc)η2,\displaystyle\leq CMc\alpha^{2}\left(\eta c_{d}^{2}+d\right)(d+c_{d}^{2})e^{2\alpha^{2}C(d+c_{d}^{2})\left(k\eta-Mc\right)}\eta^{2},

for sufficiently small η\eta. Combine the above two estimations, we have

Uk{C(η3kcd2+dkη2+η)kMc/η1CMcα2(ηcd2+d)(d+cd2)e2α2C(d+cd2)(kηMc)η2+CηkMc/η.U_{k}\leq\begin{cases}C\left(\eta^{3}kc_{d}^{2}+dk\eta^{2}+\eta\right)&k\leq Mc/\eta-1\\ CMc\alpha^{2}\left(\eta c_{d}^{2}+d\right)(d+c_{d}^{2})e^{2\alpha^{2}C(d+c_{d}^{2})\left(k\eta-Mc\right)}\eta^{2}+C\eta&k\geq Mc/\eta\end{cases}.

Notice that now we have Uk=maxl{0,,k}sups[0,η]𝔻KL[ρ¯lη+s𝜽0ρ~lη𝜽0]U_{k}=\underset{l\in\{0,...,k\}}{\max}\underset{s\in\left[0,\eta\right]}{\sup}\ \mathbb{D}_{\mathrm{KL}}\left[\bar{\rho}_{l\eta+s}^{\bm{\theta}_{0}}\|\tilde{\rho}_{l\eta}^{\bm{\theta}_{0}}\right], which is a function of 𝜽0\bm{\theta}_{0}. We then bound U¯k=maxl{0,,k}sups[0,η]𝔻KL[ρ¯lη+sρ~lη]\bar{U}_{k}=\underset{l\in\{0,...,k\}}{\max}\underset{s\in\left[0,\eta\right]}{\sup}\ \mathbb{D}_{\mathrm{KL}}\left[\bar{\rho}_{l\eta+s}\|\tilde{\rho}_{l\eta}\right]. Notice that the KL divergence has the following variational representation:

𝔻KL[ρ1ρ2]=supf[𝔼ρ1f𝔼ρ2ef],\mathbb{D}_{\mathrm{KL}}[\rho_{1}\|\rho_{2}]=\sup_{f}\left[\mathbb{E}_{\rho_{1}}f-\mathbb{E}_{\rho_{2}}e^{f}\right],

where the ff is chosen in the set that 𝔼ρ1f\mathbb{E}_{\rho_{1}}f and 𝔼ρ2ef\mathbb{E}_{\rho_{2}}e^{f} exist. And thus we have

𝔻KL[ρ¯lη+sρ~lη]\displaystyle\mathbb{D}_{\mathrm{KL}}[\bar{\rho}_{l\eta+s}\|\tilde{\rho}_{l\eta}] =supf[𝔼𝜽0ρ0(𝔼ρ¯lη+s𝜽0f𝔼ρ~lη𝜽0ef)]\displaystyle=\sup_{f}\left[\mathbb{E}_{\bm{\theta}_{0}\sim\rho_{0}}\left(\mathbb{E}_{\bar{\rho}_{l\eta+s}^{\bm{\theta}_{0}}}f-\mathbb{E}_{\tilde{\rho}_{l\eta}^{\bm{\theta}_{0}}}e^{f}\right)\right]
𝔼𝜽0ρ0supf[(𝔼ρ¯lη+s𝜽0f𝔼ρ~lη𝜽0ef)].\displaystyle\leq\mathbb{E}_{\bm{\theta}_{0}\sim\rho_{0}}\sup_{f}\left[\left(\mathbb{E}_{\bar{\rho}_{l\eta+s}^{\bm{\theta}_{0}}}f-\mathbb{E}_{\tilde{\rho}_{l\eta}^{\bm{\theta}_{0}}}e^{f}\right)\right].

And thus U¯kUk\bar{U}_{k}\leq U_{k}. Also the inequality that

U¯k=maxl{0,,k}sups[0,η]𝔻KL[ρ¯lη+sρ~lη]maxl{0,,k}𝔻KL[ρ¯lηρ~lη]\bar{U}_{k}=\underset{l\in\{0,...,k\}}{\max}\underset{s\in\left[0,\eta\right]}{\sup}\ \mathbb{D}_{\mathrm{KL}}\left[\bar{\rho}_{l\eta+s}\|\tilde{\rho}_{l\eta}\right]\geq\underset{l\in\{0,...,k\}}{\max}\ \mathbb{D}_{\mathrm{KL}}\left[\bar{\rho}_{l\eta}\|\tilde{\rho}_{l\eta}\right]

holds naturally by definition. We complete the proof.

E.5.5 Proof of Theorem 4.3

The constant h1h_{1} is defined as

h1=2σe𝜽2/σθ1BL2e2/σBL22σe()2/σ()Liph_{1}=\left\|\frac{2}{\sigma}e^{-\left\|\bm{\theta}\right\|^{2}/\sigma}\theta_{1}\right\|_{\mathrm{BL}}^{2}\lor\left\|e^{-\left\|\cdot\right\|^{2}/\sigma}\right\|_{\mathrm{BL}}^{2}\lor\left\|\frac{2}{\sigma}e^{-(\cdot)^{2}/\sigma}(\cdot)\right\|_{\mathrm{Lip}}

Now we start the proof. We couple the process of 𝜽k\bm{\theta}_{k} and 𝜽~k\tilde{\bm{\theta}}_{k} by the same gaussian noise 𝒆k\bm{e}_{k} in every iteration and same initialization 𝜽~0=𝜽0\tilde{\bm{\theta}}_{0}=\bm{\theta}_{0}. For kMc/η1k\leq Mc/\eta-1, 𝔼𝜽k𝜽~k2=0\mathbb{E}\left\|\bm{\theta}_{k}-\tilde{\bm{\theta}}_{k}\right\|^{2}=0 and for kMc/ηk\geq Mc/\eta we have the following inequality,

𝔼𝜽k+1𝜽~k+12𝔼𝜽k𝜽~k2\displaystyle\mathbb{E}\left\|\bm{\theta}_{k+1}-\tilde{\bm{\theta}}_{k+1}\right\|^{2}-\mathbb{E}\left\|\bm{\theta}_{k}-\tilde{\bm{\theta}}_{k}\right\|^{2}
=\displaystyle= 2η𝔼𝜽k𝜽~k,V(𝜽k)+V(𝜽~k)\displaystyle 2\eta\mathbb{E}\left\langle\bm{\theta}_{k}-\tilde{\bm{\theta}}_{k},-\nabla V(\bm{\theta}_{k})+\nabla V(\tilde{\bm{\theta}}_{k})\right\rangle
+\displaystyle+ 2ηα𝔼𝜽k𝜽~k,ϕ[1Mj=1Mδ𝜽kjc/η](𝜽k)ϕ[πM,c/ηρ~k](𝜽~k)\displaystyle 2\eta\alpha\mathbb{E}\left\langle\bm{\theta}_{k}-\tilde{\bm{\theta}}_{k},\phi[\frac{1}{M}\sum_{j=1}^{M}\delta_{\bm{\theta}_{k-jc/\eta}}](\bm{\theta}_{k})-\phi[\pi_{M,c/\eta}*\tilde{\rho}_{k}](\tilde{\bm{\theta}}_{k})\right\rangle
+\displaystyle+ η2𝔼V(𝜽k)+αϕ[1Mj=1Mδ𝜽kjc/η](𝜽k)+V(𝜽~k)αϕ[πM,c/ηρ~k](𝜽~k)2.\displaystyle\eta^{2}\mathbb{E}\left\|-\nabla V(\bm{\theta}_{k})+\alpha\phi[\frac{1}{M}\sum_{j=1}^{M}\delta_{\bm{\theta}_{k-jc/\eta}}](\bm{\theta}_{k})+\nabla V(\tilde{\bm{\theta}}_{k})-\alpha\phi[\pi_{M,c/\eta}*\tilde{\rho}_{k}](\tilde{\bm{\theta}}_{k})\right\|^{2}.

By the log-concavity, we have

𝔼𝜽k𝜽~k,V(𝜽k)+V(𝜽~k)\displaystyle\mathbb{E}\left\langle\bm{\theta}_{k}-\tilde{\bm{\theta}}_{k},-\nabla V(\bm{\theta}_{k})+\nabla V(\tilde{\bm{\theta}}_{k})\right\rangle
\displaystyle\leq L𝔼𝜽k𝜽~k2,\displaystyle-L\mathbb{E}\left\|\bm{\theta}_{k}-\tilde{\bm{\theta}}_{k}\right\|^{2},

for some positive constant LL. And also, as η\eta is small, the last term on the right side of the equation is small term. Thus our main target is to bound the second term. We decompose the second term on the left side of the equation by

𝔼𝜽k𝜽~k,ϕ[1Mj=1Mδ𝜽kjc](𝜽k)ϕ[πM,c/ηρ~k](𝜽~k)\displaystyle\mathbb{E}\left\langle\bm{\theta}_{k}-\tilde{\bm{\theta}}_{k},\phi[\frac{1}{M}\sum_{j=1}^{M}\delta_{\bm{\theta}_{k-jc}}](\bm{\theta}_{k})-\phi[\pi_{M,c/\eta}*\tilde{\rho}_{k}](\tilde{\bm{\theta}}_{k})\right\rangle
=\displaystyle= 𝔼𝜽k𝜽~k,ϕ[1Mj=1Mδ𝜽kjc/η](𝜽k)ϕ[πM,c/ηρk](𝜽k)\displaystyle\mathbb{E}\left\langle\bm{\theta}_{k}-\tilde{\bm{\theta}}_{k},\phi[\frac{1}{M}\sum_{j=1}^{M}\delta_{\bm{\theta}_{k-jc/\eta}}](\bm{\theta}_{k})-\phi[\pi_{M,c/\eta}*\rho_{k}](\bm{\theta}_{k})\right\rangle
+\displaystyle+ 𝔼𝜽k𝜽~k,ϕ[πM,c/ηρk](𝜽k)ϕ[πM,c/ηρ~k](𝜽k)\displaystyle\mathbb{E}\left\langle\bm{\theta}_{k}-\tilde{\bm{\theta}}_{k},\phi[\pi_{M,c/\eta}*\rho_{k}](\bm{\theta}_{k})-\phi[\pi_{M,c/\eta}*\tilde{\rho}_{k}](\bm{\theta}_{k})\right\rangle
+\displaystyle+ 𝔼𝜽k𝜽~k,ϕ[πM,c/ηρ~k](𝜽k)ϕ[πM,c/ηρ~k](𝜽~k)\displaystyle\mathbb{E}\left\langle\bm{\theta}_{k}-\tilde{\bm{\theta}}_{k},\phi[\pi_{M,c/\eta}*\tilde{\rho}_{k}](\bm{\theta}_{k})-\phi[\pi_{M,c/\eta}*\tilde{\rho}_{k}](\tilde{\bm{\theta}}_{k})\right\rangle
=\displaystyle= I1+I2+I3.\displaystyle I_{1}+I_{2}+I_{3}.

We bound I1I_{1}, I2I_{2} and I3I_{3} independently.

Bounding I1I_{1}

By Holder’s inequality,

I1\displaystyle I_{1} 𝔼[𝜽k𝜽~kϕ[1Mj=1Mδ𝜽kjc/η](𝜽k)ϕ[πM,c/ηρk](𝜽k)]\displaystyle\leq\mathbb{E}\left[\left\|\bm{\theta}_{k}-\tilde{\bm{\theta}}_{k}\right\|\left\|\phi[\frac{1}{M}\sum_{j=1}^{M}\delta_{\bm{\theta}_{k-jc/\eta}}](\bm{\theta}_{k})-\phi[\pi_{M,c/\eta}*\rho_{k}](\bm{\theta}_{k})\right\|\right]
𝔼𝜽k𝜽~k2𝔼ϕ[1Mj=1Mδ𝜽kjc/η](𝜽k)ϕ[πM,c/ηρk](𝜽k)2.\displaystyle\leq\sqrt{\mathbb{E}\left\|\bm{\theta}_{k}-\tilde{\bm{\theta}}_{k}\right\|^{2}}\sqrt{\mathbb{E}\left\|\phi[\frac{1}{M}\sum_{j=1}^{M}\delta_{\bm{\theta}_{k-jc/\eta}}](\bm{\theta}_{k})-\phi[\pi_{M,c/\eta}*\rho_{k}](\bm{\theta}_{k})\right\|^{2}}.

We bound the second term on the right side of the inequality. Define

a2=sup𝑘𝔼ϕ[1Mj=1Mδ𝜽kjc/η](𝜽k)ϕ[πM,c/ηρk](𝜽k)2sup𝜽B𝔼ϕ[1Mj=1Mδ𝜽kjc/η](𝜽)ϕ[πM,c/ηρk](𝜽)2a_{2}=\underset{k}{\sup}\ \frac{\mathbb{E}\left\|\phi[\frac{1}{M}\sum_{j=1}^{M}\delta_{\bm{\theta}_{k-jc/\eta}}](\bm{\theta}_{k})-\phi[\pi_{M,c/\eta}*\rho_{k}](\bm{\theta}_{k})\right\|^{2}}{\underset{\left\|\bm{\theta}\right\|\leq B}{\sup}\mathbb{E}\left\|\phi[\frac{1}{M}\sum_{j=1}^{M}\delta_{\bm{\theta}_{k-jc/\eta}}](\bm{\theta})-\phi[\pi_{M,c/\eta}*\rho_{k}](\bm{\theta})\right\|^{2}}

and by the regularity assumption we know that a2<a_{2}<\infty. Define ϕ[1Mj=1Mδ𝜽kjc/η](𝜽)ϕ[πM,c/ηρk](𝜽)=ϕ[1Mj=1Mδ𝜽kjc/η]\phi[\frac{1}{M}\sum_{j=1}^{M}\delta_{\bm{\theta}_{k-jc/\eta}}](\bm{\theta})-\phi[\pi_{M,c/\eta}*\rho_{k}](\bm{\theta})=\phi^{*}[\frac{1}{M}\sum_{j=1}^{M}\delta_{\bm{\theta}_{k-jc/\eta}}] and since the stein operator is linear functional of the distribution, we have

𝔼ϕ[1Mj=1Mδ𝜽kjc/η](𝜽)=0,\mathbb{E}\phi^{*}[\frac{1}{M}\sum_{j=1}^{M}\delta_{\bm{\theta}_{k-jc/\eta}}](\bm{\theta})=0,

given any 𝜽\bm{\theta}. By Theorem E.1 that 𝚯k\bm{\Theta}_{k} is geometric ergodicity and thus is β\beta-mixing with exponentially fast decay rate by Proposition E.2. And by Proposition E.1, we know that 𝚯k\bm{\Theta}_{k} is also α\alpha-mixing with exponentially fast decay rate. We have the following estimation

𝔼ϕ[1Mj=1Mδ𝜽kjc/η](𝜽k)ϕ[πM,c/ηρk](𝜽k)2\displaystyle\mathbb{E}\left\|\phi[\frac{1}{M}\sum_{j=1}^{M}\delta_{\bm{\theta}_{k-jc/\eta}}](\bm{\theta}_{k})-\phi[\pi_{M,c/\eta}*\rho_{k}](\bm{\theta}_{k})\right\|^{2}
\displaystyle\leq a2sup𝜽B𝔼ϕ[1Mj=1Mδ𝜽kjc/η](𝜽)2\displaystyle a_{2}\underset{\left\|\bm{\theta}\right\|\leq B}{\sup}\mathbb{E}\left\|\phi^{*}[\frac{1}{M}\sum_{j=1}^{M}\delta_{\bm{\theta}_{k-jc/\eta}}](\bm{\theta})\right\|^{2}
\displaystyle\leq a2M2sup𝜽B𝔼k=1Mϕ[δ𝜽tkc/η](𝜽)2\displaystyle\frac{a_{2}}{M^{2}}\underset{\left\|\bm{\theta}\right\|\leq B}{\sup}\mathbb{E}\sum_{k=1}^{M}\left\|\phi^{*}[\delta_{\bm{\theta}_{t-kc/\eta}}](\bm{\theta})\right\|^{2}
+\displaystyle+ a2M2sup𝜽B𝔼kjϕ[δ𝜽tkc/η](𝜽),ϕ[δ𝜽tjc/η](𝜽)\displaystyle\frac{a_{2}}{M^{2}}\underset{\left\|\bm{\theta}\right\|\leq B}{\sup}\mathbb{E}\sum_{k\neq j}\left\langle\phi^{*}[\delta_{\bm{\theta}_{t-kc/\eta}}](\bm{\theta}),\phi^{*}[\delta_{\bm{\theta}_{t-jc/\eta}}](\bm{\theta})\right\rangle
\displaystyle\leq Ca2M[erc(1erMc)1erc+1],\displaystyle\frac{Ca_{2}}{M}\left[\frac{e^{-rc}\left(1-e^{-rMc}\right)}{1-e^{rc}}+1\right],

for some positive constant rr that characterize the decay rate of α\alpha mixing. Notice that here η\eta is canceled because the decay rate of mixing is 𝒪(η)\mathcal{O}(\eta) (on the power of exponential) and c/η=𝒪(η1)c/\eta=\mathcal{O}(\eta^{-1}). Combine this two estimations, we have

I1𝔼𝜽k𝜽~k2a2CM[erc(1erMc)1erc+1].I_{1}\leq\sqrt{\mathbb{E}\left\|\bm{\theta}_{k}-\tilde{\bm{\theta}}_{k}\right\|^{2}}\sqrt{\frac{a_{2}C}{M}\left[\frac{e^{-rc}\left(1-e^{-rMc}\right)}{1-e^{rc}}+1\right]}.

Bounding I2I_{2} By Holder’s inequality, we have

I2\displaystyle I_{2} 𝔼𝜽k𝜽~k2𝔼ϕ[πM,c/ηρk](𝜽k)ϕ[πM,c/ηρ~k](𝜽k)2.\displaystyle\leq\sqrt{\mathbb{E}\left\|\bm{\theta}_{k}-\tilde{\bm{\theta}}_{k}\right\|^{2}}\sqrt{\mathbb{E}\left\|\phi[\pi_{M,c/\eta}*\rho_{k}](\bm{\theta}_{k})-\phi[\pi_{M,c/\eta}*\tilde{\rho}_{k}](\bm{\theta}_{k})\right\|^{2}}.

We bound the second term in the right side of the inequality.

𝔼ϕ[πM,c/ηρk](𝜽k)ϕ[πM,c/ηρ~k](𝜽k)2\displaystyle\mathbb{E}\left\|\phi[\pi_{M,c/\eta}*\rho_{k}](\bm{\theta}_{k})-\phi[\pi_{M,c/\eta}*\tilde{\rho}_{k}](\bm{\theta}_{k})\right\|^{2}
=\displaystyle= 𝔼1Mj=1M[ϕ[ρkjc/η](𝜽k)ϕ[ρ~kjc/η](𝜽k)]2\displaystyle\mathbb{E}\left\|\frac{1}{M}\sum_{j=1}^{M}\left[\phi[\rho_{k-jc/\eta}](\bm{\theta}_{k})-\phi[\tilde{\rho}_{k-jc/\eta}](\bm{\theta}_{k})\right]\right\|^{2}
\displaystyle\leq 1Mj=1M𝔼ϕ[ρkjc/η](𝜽k)ϕ[ρ~kjc/η](𝜽k)2\displaystyle\frac{1}{M}\sum_{j=1}^{M}\mathbb{E}\left\|\phi[\rho_{k-jc/\eta}](\bm{\theta}_{k})-\phi[\tilde{\rho}_{k-jc/\eta}](\bm{\theta}_{k})\right\|^{2}
=\displaystyle= 1Mj=1M𝔼𝜽k𝔼𝜽ρkjc/ηϕ¯𝜽k(𝜽)𝔼𝜽ρ~kjc/ηϕ¯𝜽k(𝜽)2\displaystyle\frac{1}{M}\sum_{j=1}^{M}\mathbb{E}_{\bm{\theta}_{k}}\left\|\mathbb{E}_{\bm{\theta}\sim\rho_{k-jc/\eta}}\bar{\phi}_{\bm{\theta}_{k}}(\bm{\theta})-\mathbb{E}_{\bm{\theta}\sim\tilde{\rho}_{k-jc/\eta}}\bar{\phi}_{\bm{\theta}_{k}}(\bm{\theta})\right\|^{2}
=\displaystyle= 1Mj=1M𝔼𝜽ki=1d|𝔼𝜽ρkjc/ηϕ¯𝜽k,i(𝜽)𝔼𝜽ρ~kjc/ηϕ¯𝜽k,i(𝜽)|2\displaystyle\frac{1}{M}\sum_{j=1}^{M}\mathbb{E}_{\bm{\theta}_{k}}\sum_{i=1}^{d}\left|\mathbb{E}_{\bm{\theta}\sim\rho_{k-jc/\eta}}\bar{\phi}_{\bm{\theta}_{k},i}(\bm{\theta})-\mathbb{E}_{\bm{\theta}\sim\tilde{\rho}_{k-jc/\eta}}\bar{\phi}_{\bm{\theta}_{k},i}(\bm{\theta})\right|^{2}
\displaystyle\leq 1Mj=1M𝔼𝜽ki=1d(ϕ¯𝜽k,i()ϕ¯𝜽k,i()Lip)2𝔻BL2[ρkjc/η,ρ~kjc/η].\displaystyle\frac{1}{M}\sum_{j=1}^{M}\mathbb{E}_{\bm{\theta}_{k}}\sum_{i=1}^{d}\left(\left\|\bar{\phi}_{\bm{\theta}_{k},i}(\cdot)\right\|_{\mathcal{L}_{\infty}}\lor\left\|\bar{\phi}_{\bm{\theta}_{k},i}(\cdot)\right\|_{\mathrm{Lip}}\right)^{2}\mathbb{D}_{\mathrm{BL}}^{2}\left[\rho_{k-jc/\eta},\tilde{\rho}_{k-jc/\eta}\right].

By Lemma E.4, we have

i=1d(ϕ¯𝜽k,i()ϕ¯𝜽k,i()Lip)2\displaystyle\sum_{i=1}^{d}\left(\left\|\bar{\phi}_{\bm{\theta}_{k},i}(\cdot)\right\|_{\mathcal{L}_{\infty}}\lor\left\|\bar{\phi}_{\bm{\theta}_{k},i}(\cdot)\right\|_{\mathrm{Lip}}\right)^{2}
=\displaystyle= i=1d(ϕ¯𝜽k,i()2ϕ¯𝜽k,i()Lip2)\displaystyle\sum_{i=1}^{d}\left(\left\|\bar{\phi}_{\bm{\theta}_{k},i}(\cdot)\right\|_{\mathcal{L}_{\infty}}^{2}\lor\left\|\bar{\phi}_{\bm{\theta}_{k},i}(\cdot)\right\|_{\mathrm{Lip}}^{2}\right)
\displaystyle\leq [4d2σe𝜽2/σθ1BL2+4e2/σBL2V(𝜽k)2].\displaystyle\left[4d\left\|\frac{2}{\sigma}e^{-\left\|\bm{\theta}\right\|^{2}/\sigma}\theta_{1}\right\|_{\mathrm{BL}}^{2}+4\left\|e^{-\left\|\cdot\right\|^{2}/\sigma}\right\|_{\mathrm{BL}}^{2}\left\|\nabla V(\bm{\theta}_{k})\right\|^{2}\right].

Plug in the above estimation and by the relation that 𝔻BL𝕎1𝕎2\mathbb{D}_{\mathrm{BL}}\leq\mathbb{W}_{1}\leq\mathbb{W}_{2}, we have

𝔼ϕ[πM,cρk](𝜽k)ϕ[πM,cρ~k](𝜽k)2\displaystyle\mathbb{E}\left\|\phi[\pi_{M,c}*\rho_{k}](\bm{\theta}_{k})-\phi[\pi_{M,c}*\tilde{\rho}_{k}](\bm{\theta}_{k})\right\|^{2}
\displaystyle\leq [4d2σe𝜽2/σθ1BL2+4e2/σBL2𝔼𝜽kV(𝜽k)2]1Mj=1M𝔻BL2[ρkcj,ρkcj~]\displaystyle\left[4d\left\|\frac{2}{\sigma}e^{-\left\|\bm{\theta}\right\|^{2}/\sigma}\theta_{1}\right\|_{\mathrm{BL}}^{2}+4\left\|e^{-\left\|\cdot\right\|^{2}/\sigma}\right\|_{\mathrm{BL}}^{2}\mathbb{E}_{\bm{\theta}_{k}}\left\|\nabla V(\bm{\theta}_{k})\right\|^{2}\right]\frac{1}{M}\sum_{j=1}^{M}\mathbb{D}_{\mathrm{BL}}^{2}\left[\rho_{k-cj},\tilde{\rho_{k-cj}}\right]
\displaystyle\leq [4d2σe𝜽2/σθ1BL2+4e2/σBL2𝔼𝜽kV(𝜽k)2]1Mj=1M𝕎22[ρkcj,ρkcj~].\displaystyle\left[4d\left\|\frac{2}{\sigma}e^{-\left\|\bm{\theta}\right\|^{2}/\sigma}\theta_{1}\right\|_{\mathrm{BL}}^{2}+4\left\|e^{-\left\|\cdot\right\|^{2}/\sigma}\right\|_{\mathrm{BL}}^{2}\mathbb{E}_{\bm{\theta}_{k}}\left\|\nabla V(\bm{\theta}_{k})\right\|^{2}\right]\frac{1}{M}\sum_{j=1}^{M}\mathbb{W}_{2}^{2}\left[\rho_{k-cj},\tilde{\rho_{k-cj}}\right].

And combined all the estimation and by the definition of Wasserstein-distance, we conclude that

I2\displaystyle I_{2} 4d2σe𝜽2/σθ1BL2+4e2/σBL2𝔼𝜽kV(𝜽k)21Mj=1M𝕎22[ρkcj,ρ~kcj]\displaystyle\leq\sqrt{4d\left\|\frac{2}{\sigma}e^{-\left\|\bm{\theta}\right\|^{2}/\sigma}\theta_{1}\right\|_{\mathrm{BL}}^{2}+4\left\|e^{-\left\|\cdot\right\|^{2}/\sigma}\right\|_{\mathrm{BL}}^{2}\mathbb{E}_{\bm{\theta}_{k}}\left\|\nabla V(\bm{\theta}_{k})\right\|^{2}}\sqrt{\frac{1}{M}\sum_{j=1}^{M}\mathbb{W}_{2}^{2}\left[\rho_{k-cj},\tilde{\rho}_{k-cj}\right]}
4d2σe𝜽2/σθ1BL2+4e2/σBL2𝔼𝜽kV(𝜽k)21Mj=1M𝔼𝜽kcj𝜽~kcj2.\displaystyle\leq\sqrt{4d\left\|\frac{2}{\sigma}e^{-\left\|\bm{\theta}\right\|^{2}/\sigma}\theta_{1}\right\|_{\mathrm{BL}}^{2}+4\left\|e^{-\left\|\cdot\right\|^{2}/\sigma}\right\|_{\mathrm{BL}}^{2}\mathbb{E}_{\bm{\theta}_{k}}\left\|\nabla V(\bm{\theta}_{k})\right\|^{2}}\sqrt{\frac{1}{M}\sum_{j=1}^{M}\mathbb{E}\left\|\bm{\theta}_{k-cj}-\tilde{\bm{\theta}}_{k-cj}\right\|^{2}}.

Bounding I3I_{3}

By Holder’s inequality,

I3\displaystyle I_{3} 𝔼𝜽k𝜽~k2𝔼ϕ[πM,c/ηρ~k](𝜽k)ϕ[πM,c/ηρ~k](𝜽~k)2.\displaystyle\leq\sqrt{\mathbb{E}\left\|\bm{\theta}_{k}-\tilde{\bm{\theta}}_{k}\right\|^{2}}\sqrt{\mathbb{E}\left\|\phi[\pi_{M,c/\eta}*\tilde{\rho}_{k}](\bm{\theta}_{k})-\phi[\pi_{M,c/\eta}*\tilde{\rho}_{k}](\tilde{\bm{\theta}}_{k})\right\|^{2}}.

We bound the last term on the right side of the inequality. By assumption and Lemma E.3, we have

𝔼ϕ[πM,c/ηρ~k](𝜽k)ϕ[πM,c/ηρ~k](𝜽~k)2\displaystyle\mathbb{E}\left\|\phi[\pi_{M,c/\eta}*\tilde{\rho}_{k}](\bm{\theta}_{k})-\phi[\pi_{M,c/\eta}*\tilde{\rho}_{k}](\tilde{\bm{\theta}}_{k})\right\|^{2}
\displaystyle\leq [e()2/σLip𝔼𝜽ρ~kV(𝜽)+2σe()2/σ()Lip]2𝔼𝜽k𝜽~k2.\displaystyle\left[\left\|e^{-(\cdot)^{2}/\sigma}\right\|_{\mathrm{Lip}}\mathbb{E}_{\bm{\theta}\sim\tilde{\rho}_{k}}\left\|\nabla V(\bm{\theta})\right\|+\left\|\frac{2}{\sigma}e^{-(\cdot)^{2}/\sigma}(\cdot)\right\|_{\mathrm{Lip}}\right]^{2}\mathbb{E}\left\|\bm{\theta}_{k}-\tilde{\bm{\theta}}_{k}\right\|^{2}.

And combine the estimation, we have

I3\displaystyle I_{3} [e()2/σLip𝔼𝜽ρ~kV(𝜽)+2σe()2/σ()Lip]𝔼𝜽k𝜽~k2.\displaystyle\leq\left[\left\|e^{-(\cdot)^{2}/\sigma}\right\|_{\mathrm{Lip}}\mathbb{E}_{\bm{\theta}\sim\tilde{\rho}_{k}}\left\|\nabla V(\bm{\theta})\right\|+\left\|\frac{2}{\sigma}e^{-(\cdot)^{2}/\sigma}(\cdot)\right\|_{\mathrm{Lip}}\right]\mathbb{E}\left\|\bm{\theta}_{k}-\tilde{\bm{\theta}}_{k}\right\|^{2}.

Overall Bound Combine all the results, we have the following bound: for kMck\geq Mc,

𝔼𝜽k+1𝜽~k+12𝔼𝜽k𝜽~k2\displaystyle\mathbb{E}\left\|\bm{\theta}_{k+1}-\tilde{\bm{\theta}}_{k+1}\right\|^{2}-\mathbb{E}\left\|\bm{\theta}_{k}-\tilde{\bm{\theta}}_{k}\right\|^{2}
\displaystyle\leq 2ηL𝔼𝜽k𝜽~k2\displaystyle-2\eta L\mathbb{E}\left\|\bm{\theta}_{k}-\tilde{\bm{\theta}}_{k}\right\|^{2}
+\displaystyle+ 2ηα𝔼𝜽k𝜽~k2c1M\displaystyle 2\eta\alpha\sqrt{\mathbb{E}\left\|\bm{\theta}_{k}-\tilde{\bm{\theta}}_{k}\right\|^{2}}\frac{c_{1}}{\sqrt{M}}
+\displaystyle+ 2ηαc21Mj=1M𝔼𝜽kjc/η𝜽~kjc/η2𝔼𝜽k𝜽~k2\displaystyle 2\text{$\eta\alpha$}c_{2}\sqrt{\frac{1}{M}\sum_{j=1}^{M}\mathbb{E}\left\|\bm{\theta}_{k-jc/\eta}-\tilde{\bm{\theta}}_{k-jc/\eta}\right\|^{2}\mathbb{E}\left\|\bm{\theta}_{k}-\tilde{\bm{\theta}}_{k}\right\|^{2}}
+\displaystyle+ 2ηαc3𝔼𝜽k𝜽~k2\displaystyle 2\eta\alpha c_{3}\mathbb{E}\left\|\bm{\theta}_{k}-\tilde{\bm{\theta}}_{k}\right\|^{2}
+\displaystyle+ η2c4,\displaystyle\eta^{2}c_{4},

where

c1=a2C[erc(1erMc)1erc+1],c_{1}=\sqrt{a_{2}C\left[\frac{e^{-rc}\left(1-e^{-rMc}\right)}{1-e^{rc}}+1\right]},
c2=4d2σe𝜽2/σθ1BL2+4e2/σBL2sup𝑘𝔼𝜽kV(𝜽k)2,c_{2}=\sqrt{4d\left\|\frac{2}{\sigma}e^{-\left\|\bm{\theta}\right\|^{2}/\sigma}\theta_{1}\right\|_{\mathrm{BL}}^{2}+4\left\|e^{-\left\|\cdot\right\|^{2}/\sigma}\right\|_{\mathrm{BL}}^{2}\underset{k}{\sup}\ \mathbb{E}_{\bm{\theta}_{k}}\left\|\nabla V(\bm{\theta}_{k})\right\|^{2}},
c3=[e()2/σLipsup𝑘𝔼𝜽ρ~kV(𝜽)+2σe()2/σ()Lip],c_{3}=\left[\left\|e^{-(\cdot)^{2}/\sigma}\right\|_{\mathrm{Lip}}\underset{k}{\sup}\ \mathbb{E}_{\bm{\theta}\sim\tilde{\rho}_{k}}\left\|\nabla V(\bm{\theta})\right\|+\left\|\frac{2}{\sigma}e^{-(\cdot)^{2}/\sigma}(\cdot)\right\|_{\mathrm{Lip}}\right],

and

c4=supkMc/η𝔼V(𝜽k)+αϕ[1Mj=1Mδ𝜽kjc/η](𝜽k)V(𝜽~k)αϕ[πM,c/ηρ~k](𝜽~k)2.c_{4}=\underset{k\geq Mc/\eta}{\sup}\ \mathbb{E}\left\|\nabla V(\bm{\theta}_{k})+\alpha\phi[\frac{1}{M}\sum_{j=1}^{M}\delta_{\bm{\theta}_{k-jc/\eta}}](\bm{\theta}_{k})-\nabla V(\tilde{\bm{\theta}}_{k})-\alpha\phi[\pi_{M,c/\eta}*\tilde{\rho}_{k}](\tilde{\bm{\theta}}_{k})\right\|^{2}.

Define uk=𝔼𝜽k𝜽~k2u_{k}=\sqrt{\mathbb{E}\left\|\bm{\theta}_{k}-\tilde{\bm{\theta}}_{k}\right\|^{2}} and Uk=supl[k]ulU_{k}=\underset{l\in[k]}{\sup}\ u_{l}, we have

Uk+12qUk2+2ηαc1MUk+η2c4,U_{k+1}^{2}\leq qU_{k}^{2}+\frac{2\eta\alpha c_{1}}{\sqrt{M}}U_{k}+\eta^{2}c_{4},

where q=(12η(Lαc2αc3))q=(1-2\eta(L-\alpha c_{2}-\alpha c_{3})). By the assumption that αL/(c2+c3)\alpha\leq L/(c_{2}+c_{3}), q<1q<1. Now we prove the bound of UkU_{k} by induction. We take the hypothesis that Uk2(2ηαc1M+(1q)η(c4+11q))2(1q)2U_{k}^{2}\leq\frac{\left(\frac{2\eta\alpha c_{1}}{\sqrt{M}}+(1-q)\eta\left(c_{4}+\frac{1}{1-q}\right)\right)^{2}}{\left(1-q\right)^{2}} and notice that the hypothesis holds for U0=0U_{0}=0. By the hypothesis, we have

Uk+12\displaystyle U_{k+1}^{2} q(2ηαc1M+(1q)η(c4+11q))2(1q)2+2ηαc1M(2ηαc1M+(1q)η(c4+11q))(1q)+η2(c4+11q)\displaystyle\leq q\frac{\left(\frac{2\eta\alpha c_{1}}{\sqrt{M}}+(1-q)\eta\left(c_{4}+\frac{1}{1-q}\right)\right)^{2}}{\left(1-q\right)^{2}}+\frac{2\eta\alpha c_{1}}{\sqrt{M}}\frac{\left(\frac{2\eta\alpha c_{1}}{\sqrt{M}}+(1-q)\eta\left(c_{4}+\frac{1}{1-q}\right)\right)}{\left(1-q\right)}+\eta^{2}\left(c_{4}+\frac{1}{1-q}\right)
=q(2ηαc1M+(1q)η(c4+11q))2(1q)2+11q(2ηαc1M)2+2ηαc1Mη(c4+11q)+η2(c4+11q)\displaystyle=q\frac{\left(\frac{2\eta\alpha c_{1}}{\sqrt{M}}+(1-q)\eta\left(c_{4}+\frac{1}{1-q}\right)\right)^{2}}{\left(1-q\right)^{2}}+\frac{1}{1-q}\left(\frac{2\eta\alpha c_{1}}{\sqrt{M}}\right)^{2}+\frac{2\eta\alpha c_{1}}{\sqrt{M}}\eta\left(c_{4}+\frac{1}{1-q}\right)+\eta^{2}\left(c_{4}+\frac{1}{1-q}\right)
=q(2ηαc1M+(1q)η(c4+11q))2(1q)2\displaystyle=q\frac{\left(\frac{2\eta\alpha c_{1}}{\sqrt{M}}+(1-q)\eta\left(c_{4}+\frac{1}{1-q}\right)\right)^{2}}{\left(1-q\right)^{2}}
+1q(1q)2[(2ηαc1M)2+(1q)2ηαc1Mη(c4+11q)+(1q)η2(c4+11q)]\displaystyle+\frac{1-q}{\left(1-q\right)^{2}}\left[\left(\frac{2\eta\alpha c_{1}}{\sqrt{M}}\right)^{2}+(1-q)\frac{2\eta\alpha c_{1}}{\sqrt{M}}\eta\left(c_{4}+\frac{1}{1-q}\right)+(1-q)\eta^{2}\left(c_{4}+\frac{1}{1-q}\right)\right]
q(2ηαc1M+(1q)η(c4+11q))2(1q)2\displaystyle\leq q\frac{\left(\frac{2\eta\alpha c_{1}}{\sqrt{M}}+(1-q)\eta\left(c_{4}+\frac{1}{1-q}\right)\right)^{2}}{\left(1-q\right)^{2}}
+1q(1q)2[(2ηαc1M)2+(1q)2ηαc1Mη(c4+11q)+(1q)2η2(c4+11q)2]\displaystyle+\frac{1-q}{\left(1-q\right)^{2}}\left[\left(\frac{2\eta\alpha c_{1}}{\sqrt{M}}\right)^{2}+(1-q)\frac{2\eta\alpha c_{1}}{\sqrt{M}}\eta\left(c_{4}+\frac{1}{1-q}\right)+(1-q)^{2}\eta^{2}\left(c_{4}+\frac{1}{1-q}\right)^{2}\right]
=q(2ηαc1M+(1q)η(c4+11q))2(1q)2+(1q)(2ηαc1M+(1q)η(c4+11q))2(1q)2\displaystyle=q\frac{\left(\frac{2\eta\alpha c_{1}}{\sqrt{M}}+(1-q)\eta\left(c_{4}+\frac{1}{1-q}\right)\right)^{2}}{\left(1-q\right)^{2}}+(1-q)\frac{\left(\frac{2\eta\alpha c_{1}}{\sqrt{M}}+(1-q)\eta\left(c_{4}+\frac{1}{1-q}\right)\right)^{2}}{\left(1-q\right)^{2}}
=(2ηαc1M+(1q)η(c4+11q))2(1q)2,\displaystyle=\frac{\left(\frac{2\eta\alpha c_{1}}{\sqrt{M}}+(1-q)\eta\left(c_{4}+\frac{1}{1-q}\right)\right)^{2}}{\left(1-q\right)^{2}},

where the last second inequality holds by (1q)(c4+11q)1(1-q)\left(c_{4}+\frac{1}{1-q}\right)\geq 1. Thus we complete the argument of induction and we have, for any kk,

Uk2\displaystyle U_{k}^{2} (2ηαc1M+(1q)η(c4+11q))2(1q)2\displaystyle\leq\frac{\left(\frac{2\eta\alpha c_{1}}{\sqrt{M}}+(1-q)\eta\left(c_{4}+\frac{1}{1-q}\right)\right)^{2}}{\left(1-q\right)^{2}}
24η2α2c12M+2(1q)2η2(c4+11q)2(1q)2\displaystyle\leq 2\frac{\frac{4\eta^{2}\alpha^{2}c_{1}^{2}}{M}+2(1-q)^{2}\eta^{2}\left(c_{4}+\frac{1}{1-q}\right)^{2}}{\left(1-q\right)^{2}}
=2α2c12(Lαc2αc3)21M+4η2(c4+2η(Lαc2αc3))2.\displaystyle=\frac{2\alpha^{2}c_{1}^{2}}{(L-\alpha c_{2}-\alpha c_{3})^{2}}\frac{1}{M}+4\eta^{2}\left(c_{4}+2\eta(L-\alpha c_{2}-\alpha c_{3})\right)^{2}.

And it implies that 𝕎22[ρk,ρ~k]ukUk2α2c12(Lαc2αc3)21M+4η2(c4+2η(Lαc2αc3))2.\mathbb{W}_{2}^{2}[\rho_{k},\tilde{\rho}_{k}]\leq u_{k}\leq U_{k}\leq\frac{2\alpha^{2}c_{1}^{2}}{(L-\alpha c_{2}-\alpha c_{3})^{2}}\frac{1}{M}+4\eta^{2}\left(c_{4}+2\eta(L-\alpha c_{2}-\alpha c_{3})\right)^{2}.

E.6 Proof of Technical Lemmas

E.6.1 Proof of Lemma E.1

For the first part:

V(𝜽)\displaystyle\left\|\nabla V(\bm{\theta})\right\|
\displaystyle\leq V(𝜽)V(𝟎)+V(𝟎)\displaystyle\left\|\nabla V(\bm{\theta})-\nabla V(\bm{0})\right\|+\left\|\nabla V(\bm{0})\right\|
\displaystyle\leq b1(𝜽1+1).\displaystyle b_{1}\left(\left\|\bm{\theta}_{1}\right\|+1\right).

For the second part:

𝜽ηV(𝜽)\displaystyle\left\|\bm{\theta}-\eta\nabla V(\bm{\theta})\right\|
=\displaystyle= 𝜽ηV(𝜽),𝜽ηV(𝜽)\displaystyle\left\langle\bm{\theta}-\eta\nabla V(\bm{\theta}),\bm{\theta}-\eta\nabla V(\bm{\theta})\right\rangle
=\displaystyle= 𝜽2+2η𝜽,V(𝜽)+η2V(𝜽)2\displaystyle\left\|\bm{\theta}\right\|^{2}+2\eta\left\langle\bm{\theta},-\nabla V(\bm{\theta})\right\rangle+\eta^{2}\left\|\nabla V(\bm{\theta})\right\|^{2}
\displaystyle\leq 𝜽2+2η(a1𝜽2+b1)+η2b1(1+𝜽2)\displaystyle\left\|\bm{\theta}\right\|^{2}+2\eta\left(-a_{1}\left\|\bm{\theta}\right\|^{2}+b_{1}\right)+\eta^{2}b_{1}(1+\left\|\bm{\theta}\right\|^{2})
=\displaystyle= (12ηa1+η2b1)𝜽2+η2b1+2ηb1.\displaystyle\left(1-2\eta a_{1}+\eta^{2}b_{1}\right)\left\|\bm{\theta}\right\|^{2}+\eta^{2}b_{1}+2\eta b_{1}.

E.6.2 Proof of Lemma E.2

It is obvious that K,1\left\|K\right\|_{\infty,\infty}\leq 1.

K(𝜽,𝜽1)K(𝜽,𝜽2)\displaystyle\left\|K(\bm{\theta}^{\prime},\bm{\theta}_{1})-K(\bm{\theta}^{\prime},\bm{\theta}_{2})\right\|
e𝜽𝜽12/σe𝜽𝜽22/σ\displaystyle\left\|e^{-\left\|\bm{\theta}^{\prime}-\bm{\theta}_{1}\right\|^{2}/\sigma}-e^{-\left\|\bm{\theta}^{\prime}-\bm{\theta}_{2}\right\|^{2}/\sigma}\right\|
\displaystyle\leq e()2/σLip𝜽1𝜽22.\displaystyle\left\|e^{-(\cdot)^{2}/\sigma}\right\|_{\mathrm{Lip}}\left\|\bm{\theta}_{1}-\bm{\theta}_{2}\right\|_{2}.

And

𝜽K(𝜽,𝜽1)𝜽K(𝜽,𝜽2)\displaystyle\left\|\nabla_{\bm{\theta}^{\prime}}K(\bm{\theta}^{\prime},\bm{\theta}_{1})-\nabla_{\bm{\theta}^{\prime}}K(\bm{\theta}^{\prime},\bm{\theta}_{2})\right\|
=\displaystyle= 2σe𝜽𝜽12/σ(𝜽𝜽1)2σe𝜽𝜽22/σ(𝜽𝜽2)\displaystyle\left\|\frac{2}{\sigma}e^{-\left\|\bm{\theta}^{\prime}-\bm{\theta}_{1}\right\|^{2}/\sigma}\left(\bm{\theta}^{\prime}-\bm{\theta}_{1}\right)-\frac{2}{\sigma}e^{-\left\|\bm{\theta}^{\prime}-\bm{\theta}_{2}\right\|^{2}/\sigma}\left(\bm{\theta}^{\prime}-\bm{\theta}_{2}\right)\right\|
\displaystyle\leq 2σe()2/σ()Lip𝜽1𝜽22.\displaystyle\left\|\frac{2}{\sigma}e^{-(\cdot)^{2}/\sigma}(\cdot)\right\|_{\mathrm{Lip}}\left\|\bm{\theta}_{1}-\bm{\theta}_{2}\right\|_{2}.

E.6.3 Proof of Lemma E.3

For any distribution ρ\rho such that 𝜽V(𝜽)ρ(𝜽)𝑑𝜽<\int\left\|\nabla_{\bm{\theta}}V(\bm{\theta})\right\|\rho(\bm{\theta})d\bm{\theta}<\infty,

ϕ[ρ](𝜽1)ϕ[ρ](𝜽2)\displaystyle\left\|\phi[\rho](\bm{\theta}_{1})-\phi[\rho](\bm{\theta}_{2})\right\|
=\displaystyle= 𝔼𝜽ρ{[K(𝜽,𝜽1)K(𝜽,𝜽2)]V(𝜽)+1K(𝜽,𝜽1)1K(𝜽,𝜽2)}\displaystyle\left\|\mathbb{E}_{\bm{\theta}\sim\rho}\left\{-\left[K(\bm{\theta},\bm{\theta}_{1})-K(\bm{\theta},\bm{\theta}_{2})\right]\nabla V(\bm{\theta})+\nabla_{1}K(\bm{\theta},\bm{\theta}_{1})-\nabla_{1}K(\bm{\theta},\bm{\theta}_{2})\right\}\right\|
\displaystyle\leq e()2/σLip𝔼𝜽ρV(𝜽)𝜽1𝜽22\displaystyle\left\|e^{-(\cdot)^{2}/\sigma}\right\|_{\mathrm{Lip}}\mathbb{E}_{\bm{\theta}\sim\rho}\left\|\nabla V(\bm{\theta})\right\|\left\|\bm{\theta}_{1}-\bm{\theta}_{2}\right\|_{2}
+\displaystyle+ 2σe()2/σ()Lip𝜽1𝜽22.\displaystyle\left\|\frac{2}{\sigma}e^{-(\cdot)^{2}/\sigma}(\cdot)\right\|_{\mathrm{Lip}}\left\|\bm{\theta}_{1}-\bm{\theta}_{2}\right\|_{2}.

For proving the second result, we notice that

ϕ[ρ](𝜽)\displaystyle\left\|\phi[\rho](\bm{\theta})\right\| =𝔼𝜽ρ[K(𝜽,𝜽)V(𝜽)+1K(𝜽,𝜽)]\displaystyle=\mathbb{E}_{\bm{\theta}^{\prime}\sim\rho}\left[K(\bm{\theta}^{\prime},\bm{\theta})\nabla V(\bm{\theta}^{\prime})+\nabla_{1}K(\bm{\theta}^{\prime},\bm{\theta})\right]
K𝔼𝜽ρ[V(𝜽)+2σ(𝜽+𝜽)]\displaystyle\leq\left\|K\right\|_{\infty}\mathbb{E}_{\bm{\theta}^{\prime}\sim\rho}\left[\left\|\nabla V(\bm{\theta}^{\prime})\right\|+\frac{2}{\sigma}\left(\left\|\bm{\theta}^{\prime}\right\|+\left\|\bm{\theta}\right\|\right)\right]
Kb1+𝔼𝜽ρ[(2σ+b1)𝜽+𝜽].\displaystyle\leq\left\|K\right\|_{\infty}b_{1}+\mathbb{E}_{\bm{\theta}^{\prime}\sim\rho}\left[\left(\frac{2}{\sigma}+b_{1}\right)\left\|\bm{\theta}^{\prime}\right\|+\left\|\bm{\theta}\right\|\right].

E.6.4 Proof of Lemma E.4

Given any 𝜽\bm{\theta}^{\prime},

i=1dϕ¯𝜽,i(𝜽)Lip2\displaystyle\sum_{i=1}^{d}\left\|\bar{\phi}_{\bm{\theta}^{\prime},i}(\bm{\theta})\right\|_{\mathrm{Lip}}^{2}
=\displaystyle= i=1d[sup𝜽1𝜽2|ϕ¯𝜽,i(𝜽1)ϕ¯𝜽,i(𝜽2)|𝜽1𝜽22]2\displaystyle\sum_{i=1}^{d}\left[\underset{\bm{\theta}_{1}\neq\bm{\theta}_{2}}{\sup}\frac{\left|\bar{\phi}_{\bm{\theta}^{\prime},i}(\bm{\theta}_{1})-\bar{\phi}_{\bm{\theta}^{\prime},i}(\bm{\theta}_{2})\right|}{\left\|\bm{\theta}_{1}-\bm{\theta}_{2}\right\|_{2}}\right]^{2}
=\displaystyle= i=1dsup𝜽1𝜽2|ϕ¯𝜽,i(𝜽1)ϕ¯𝜽,i(𝜽2)|2𝜽1𝜽222\displaystyle\sum_{i=1}^{d}\underset{\bm{\theta}_{1}\neq\bm{\theta}_{2}}{\sup}\frac{\left|\bar{\phi}_{\bm{\theta}^{\prime},i}(\bm{\theta}_{1})-\bar{\phi}_{\bm{\theta}^{\prime},i}(\bm{\theta}_{2})\right|^{2}}{\left\|\bm{\theta}_{1}-\bm{\theta}_{2}\right\|_{2}^{2}}
\displaystyle\leq 2i=1dsup𝜽1𝜽2|(e𝜽𝜽12/σe𝜽𝜽22/σ)𝜽iV(𝜽)|2𝜽1𝜽222\displaystyle 2\sum_{i=1}^{d}\underset{\bm{\theta}_{1}\neq\bm{\theta}_{2}}{\sup}\frac{\left|\left(e^{-\left\|\bm{\theta}^{\prime}-\bm{\theta}_{1}\right\|^{2}/\sigma}-e^{-\left\|\bm{\theta}^{\prime}-\bm{\theta}_{2}\right\|^{2}/\sigma}\right)\frac{\partial}{\partial\bm{\theta}_{i}^{\prime}}V(\bm{\theta}^{\prime})\right|^{2}}{\left\|\bm{\theta}_{1}-\bm{\theta}_{2}\right\|_{2}^{2}}
+\displaystyle+ 2i=1dsup𝜽1𝜽2|2σe𝜽𝜽12/σ(𝜽1,i𝜽i)2σe𝜽𝜽22/σ(𝜽2,i𝜽i)|2𝜽1𝜽222.\displaystyle 2\sum_{i=1}^{d}\underset{\bm{\theta}_{1}\neq\bm{\theta}_{2}}{\sup}\frac{\left|\frac{2}{\sigma}e^{-\left\|\bm{\theta}^{\prime}-\bm{\theta}_{1}\right\|^{2}/\sigma}(\bm{\theta}_{1,i}-\bm{\theta}_{i}^{\prime})-\frac{2}{\sigma}e^{-\left\|\bm{\theta}^{\prime}-\bm{\theta}_{2}\right\|^{2}/\sigma}(\bm{\theta}_{2,i}-\bm{\theta}_{i}^{\prime})\right|^{2}}{\left\|\bm{\theta}_{1}-\bm{\theta}_{2}\right\|_{2}^{2}}.

For the first term on the right side of the inequality,

i=1dsup𝜽1𝜽2|(e𝜽𝜽12/σe𝜽𝜽22/σ)θiV(𝜽)|2𝜽1𝜽222\displaystyle\sum_{i=1}^{d}\underset{\bm{\theta}_{1}\neq\bm{\theta}_{2}}{\sup}\frac{\left|\left(e^{-\left\|\bm{\theta}^{\prime}-\bm{\theta}_{1}\right\|^{2}/\sigma}-e^{-\left\|\bm{\theta}^{\prime}-\bm{\theta}_{2}\right\|^{2}/\sigma}\right)\frac{\partial}{\partial\theta_{i}^{\prime}}V(\bm{\theta}^{\prime})\right|^{2}}{\left\|\bm{\theta}_{1}-\bm{\theta}_{2}\right\|_{2}^{2}}
=\displaystyle= i=1d|θiV(𝜽)|2sup𝜽1𝜽2|(e𝜽𝜽12/σe𝜽𝜽22/σ)|2𝜽1𝜽222\displaystyle\sum_{i=1}^{d}\left|\frac{\partial}{\partial\theta_{i}}V(\bm{\theta}^{\prime})\right|^{2}\underset{\bm{\theta}_{1}\neq\bm{\theta}_{2}}{\sup}\frac{\left|\left(e^{-\left\|\bm{\theta}^{\prime}-\bm{\theta}_{1}\right\|^{2}/\sigma}-e^{-\left\|\bm{\theta}^{\prime}-\bm{\theta}_{2}\right\|^{2}/\sigma}\right)\right|^{2}}{\left\|\bm{\theta}_{1}-\bm{\theta}_{2}\right\|_{2}^{2}}
=\displaystyle= V(𝜽)2e2/σLip2.\displaystyle\left\|\nabla V(\bm{\theta}^{\prime})\right\|^{2}\left\|e^{-\left\|\cdot\right\|^{2}/\sigma}\right\|_{\mathrm{Lip}}^{2}.

To bound the second term, by the symmetry of each coordinates, we have

i=1dsup𝜽1𝜽2|2σe𝜽𝜽12/σ(θ1,iθi)2σe𝜽𝜽22/σ(𝜽1,i𝜽i)|2𝜽1𝜽222\displaystyle\sum_{i=1}^{d}\underset{\bm{\theta}_{1}\neq\bm{\theta}_{2}}{\sup}\frac{\left|\frac{2}{\sigma}e^{-\left\|\bm{\theta}^{\prime}-\bm{\theta}_{1}\right\|^{2}/\sigma}(\theta_{1,i}-\theta^{\prime}_{i})-\frac{2}{\sigma}e^{-\left\|\bm{\theta}^{\prime}-\bm{\theta}_{2}\right\|^{2}/\sigma}(\bm{\theta}_{1,i}-\bm{\theta}^{\prime}_{i})\right|^{2}}{\left\|\bm{\theta}_{1}-\bm{\theta}_{2}\right\|_{2}^{2}}
=\displaystyle= d2σe𝜽2/σθ1Lip2.\displaystyle d\left\|\frac{2}{\sigma}e^{-\left\|\bm{\theta}\right\|^{2}/\sigma}\theta_{1}\right\|_{\mathrm{Lip}}^{2}.

This finishes the first part of the lemma.

i=ddϕ¯𝜽,i(𝜽)2\displaystyle\sum_{i=d}^{d}\left\|\bar{\phi}_{\bm{\theta}^{\prime},i}(\bm{\theta})\right\|_{\mathcal{L}_{\infty}}^{2}
=\displaystyle= i=dde𝜽𝜽2/σ(2σ𝜽i2σ𝜽i𝜽iV(𝜽))2\displaystyle\sum_{i=d}^{d}\left\|e^{-\left\|\bm{\theta}^{\prime}-\bm{\theta}\right\|^{2}/\sigma}\left(\frac{2}{\sigma}\bm{\theta}_{i}-\frac{2}{\sigma}\bm{\theta}_{i}^{\prime}-\frac{\partial}{\partial\bm{\theta}_{i}^{\prime}}V(\bm{\theta}^{\prime})\right)\right\|_{\mathcal{L}_{\infty}}^{2}
\displaystyle\leq i=dd22σe𝜽𝜽2/σ(𝜽i𝜽i)2+i=dd2e𝜽𝜽2/σ𝜽iV(𝜽)2\displaystyle\sum_{i=d}^{d}2\left\|\frac{2}{\sigma}e^{-\left\|\bm{\theta}^{\prime}-\bm{\theta}\right\|^{2}/\sigma}\left(\bm{\theta}_{i}-\bm{\theta}_{i}^{\prime}\right)\right\|_{\mathcal{L}_{\infty}}^{2}+\sum_{i=d}^{d}2\left\|e^{-\left\|\bm{\theta}^{\prime}-\bm{\theta}\right\|^{2}/\sigma}\frac{\partial}{\partial\bm{\theta}_{i}^{\prime}}V(\bm{\theta}^{\prime})\right\|_{\mathcal{L}_{\infty}}^{2}
\displaystyle\leq i=dd22σe𝜽𝜽2/σ(𝜽i𝜽i)2+2e2/σ2V(𝜽)2\displaystyle\sum_{i=d}^{d}2\left\|\frac{2}{\sigma}e^{-\left\|\bm{\theta}^{\prime}-\bm{\theta}\right\|^{2}/\sigma}\left(\bm{\theta}_{i}-\bm{\theta}_{i}^{\prime}\right)\right\|_{\mathcal{L}_{\infty}}^{2}+2\left\|e^{-\left\|\cdot\right\|^{2}/\sigma}\right\|_{\mathcal{L}_{\infty}}^{2}\left\|\nabla V(\bm{\theta}^{\prime})\right\|^{2}
\displaystyle\leq 2d2σe𝜽2/σθ12+2e2/σ2V(𝜽)2.\displaystyle 2d\left\|\frac{2}{\sigma}e^{-\left\|\bm{\theta}\right\|^{2}/\sigma}\theta_{1}\right\|_{\mathcal{L}_{\infty}}^{2}+2\left\|e^{-\left\|\cdot\right\|^{2}/\sigma}\right\|_{\mathcal{L}_{\infty}}^{2}\left\|\nabla V(\bm{\theta}^{\prime})\right\|^{2}.