Stein Self-Repulsive Dynamics: Benefits from Past Samples
Abstract
We propose a new Stein self-repulsive dynamics for obtaining diversified samples from intractable un-normalized distributions. Our idea is to introduce Stein variational gradient as a repulsive force to push the samples of Langevin dynamics away from the past trajectories. This simple idea allows us to significantly decrease the auto-correlation in Langevin dynamics and hence increase the effective sample size. Importantly, as we establish in our theoretical analysis, the asymptotic stationary distribution remains correct even with the addition of the repulsive force, thanks to the special properties of the Stein variational gradient. We perform extensive empirical studies of our new algorithm, showing that our method yields much higher sample efficiency and better uncertainty estimation than vanilla Langevin dynamics.
1 Introduction
Drawing samples from complex un-normalized distributions is one of the most basic problems in statistics and machine learning, with broad applications across the many research fields that rely on probabilistic modeling. Over the past decades, a large number of methods have been proposed for approximate sampling, including both Markov Chain Monte Carlo (MCMC) (e.g., Brooks et al., 2011) and variational inference (e.g., Wainwright et al., 2008; Blei et al., 2017).
MCMC works by simulating Markov chains whose stationary distributions match the distributions of interest. Despite nice asymptotic theoretical properties, MCMC is widely criticized for its slow convergence rate in practice. In difficult problems, the samples drawn from MCMC are often found to have high auto-correlation across time, meaning that the Markov chains explore very slowly in the configuration space. When this happens, the samples returned by MCMC only approximate a small local region, and under-estimate the probability of the regions un-explored by the chain.
Stein variational gradient descent (SVGD) (Liu and Wang, 2016) is a different type of approximate sampling method designed to overcome this limitation of MCMC. Instead of drawing random samples sequentially, SVGD evolves a pre-defined number of particles (or sample points) in parallel with a special interacting particle system to match the distribution of interest by minimizing the KL divergence. In SVGD, the particles interact with each other to simultaneously move towards the high probability regions following the gradient direction, and also move away from each other due to a special repulsive force. As a result, SVGD allows us to obtain diversified samples that correctly represent the variation of the distribution of interest. SVGD has found applications in various challenging problems (e.g., Feng et al., 2017; Haarnoja et al., 2017; Pu et al., 2017; Liu et al., 2017a; Gong et al., 2019). See, e.g., Han and Liu (2018); Chen et al. (2018); Liu et al. (2019); Wang et al. (2019a) for examples of extensions.
However, one problem of SVGD is that it theoretically requires running an infinite number of chains in parallel in order to approximate the target distribution asymptotically (Liu, 2017). With a finite number of particles, the fixed point of SVGD still provides a prioritized, partial approximation to the distribution in terms of the expectations of a special class of functions (Liu and Wang, 2018). Nevertheless, it is still desirable to develop a variant of “single-chain SVGD”, which only requires running a single chain sequentially like MCMC to achieve the correct stationary distribution asymptotically in time, with no need to take the limit of an infinite number of parallel particles.
In this work, we propose an example of single-chain SVGD by integrating the special repulsive mechanism of SVGD with gradient-based MCMC such as Langevin dynamics. Our idea is to use the repulsive term of SVGD to push the samples of MCMC away from the samples visited at previous iterations. This new self-repulsive dynamics allows us to decrease the auto-correlation in MCMC and hence increase the mixing rate, while still obtaining the same stationary distribution thanks to the special property of the SVGD repulsive mechanism.
We provide a thorough theoretical analysis of our new method, establishing its asymptotic convergence to the target distribution. Such a result is highly non-trivial, as our new self-repulsive dynamics is a non-linear, high-order Markov process. Empirically, we evaluate our method on an array of challenging sampling tasks, showing that it yields much better uncertainty estimation and larger effective sample size.
2 Background: Langevin dynamics & SVGD
This section gives a brief introduction to Langevin dynamics (Rossky et al., 1978) and Stein Variational Gradient Descent (SVGD) (Liu and Wang, 2016), which we integrate to develop our new self-repulsive dynamics for more efficient sampling.
Langevin Dynamics Langevin dynamics is a basic gradient-based MCMC method. For a given target distribution supported on $\mathbb{R}^d$ with density function $p(\theta) \propto \exp(-U(\theta))$, where $U$ is the potential function, the (Euler-discretized) Langevin dynamics simulates a Markov chain with the following rule:

$$\theta_{t+1} = \theta_t - \epsilon \nabla U(\theta_t) + \sqrt{2\epsilon}\,\xi_t,$$

where $t$ denotes the number of iterations, $\xi_t$ are independent standard Gaussian noise vectors, and $\epsilon$ is a step size parameter. It is well known that the limiting distribution of $\theta_t$ as $t \to \infty$ approximates the target distribution $p$ when $\epsilon$ is sufficiently small.
Because the updates in Langevin dynamics are local and incremental, new points generated by the dynamics can be highly correlated with the past samples, in which case we need to run Langevin dynamics sufficiently long in order to obtain diverse samples.
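For concreteness, the following is a minimal Python/NumPy sketch of the Euler-discretized Langevin update described above; the function `grad_log_p`, the default step size, and the iteration count are illustrative placeholders rather than the exact settings used in our experiments.

```python
import numpy as np

def langevin_dynamics(grad_log_p, theta0, eps=1e-3, n_iters=10000):
    """Euler-discretized Langevin dynamics:
    theta <- theta + eps * grad log p(theta) + sqrt(2 * eps) * standard Gaussian noise."""
    theta = np.array(theta0, dtype=float)
    samples = []
    for _ in range(n_iters):
        noise = np.random.randn(*theta.shape)
        theta = theta + eps * grad_log_p(theta) + np.sqrt(2.0 * eps) * noise
        samples.append(theta.copy())
    return np.array(samples)
```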
Stein Variational Gradient Descent (SVGD) Different from Langevin dynamics, SVGD iteratively evolves a pre-defined number of particles in parallel. Starting from an initial set of particles $\{\theta_0^i\}_{i=1}^n$, SVGD updates the particles in parallel by

$$\theta_{t+1}^i = \theta_t^i + \epsilon\,\phi_{\hat\rho_t}(\theta_t^i), \qquad i = 1,\ldots,n,$$

where $\phi_{\hat\rho_t}$ is a velocity field that depends on the empirical distribution $\hat\rho_t$ of the current set of particles in the following way,

$$\phi_{\hat\rho_t}(\theta) = \mathbb{E}_{\theta'\sim \hat\rho_t}\big[-\nabla U(\theta')\,k(\theta',\theta) + \nabla_{\theta'} k(\theta',\theta)\big], \qquad \hat\rho_t = \frac{1}{n}\sum_{i=1}^n \delta_{\theta_t^i}.$$

Here $\delta_{\theta}$ is the Dirac measure centered at $\theta$, and hence the expectation under $\hat\rho_t$ denotes averaging over the particles. The kernel $k(\cdot,\cdot)$ is a positive definite kernel, such as the RBF kernel, that can be specified by users.
Note that $\phi_{\hat\rho_t}$ consists of a confining term and a repulsive term: the confining term $-\nabla U(\theta')\,k(\theta',\theta)$ pushes particles to move towards high density regions, and the repulsive term $\nabla_{\theta'} k(\theta',\theta)$ prevents the particles from colliding with each other. It is the balance of these two terms that allows us to asymptotically approximate the target distribution at the fixed point, when the number of particles goes to infinity. We refer the readers to Liu and Wang (2016); Liu (2017); Liu and Wang (2018) for thorough theoretical justifications of SVGD. A quick, informal way to justify the SVGD update is through Stein’s identity, which shows that the velocity field equals zero when the particle distribution equals the target distribution $p$, that is, for all $\theta$,
$$\phi_p(\theta) = \mathbb{E}_{\theta'\sim p}\big[-\nabla U(\theta')\,k(\theta',\theta) + \nabla_{\theta'} k(\theta',\theta)\big] = 0. \qquad (1)$$
This equation shows that the target distribution $p$ forms a fixed point of the update, and SVGD converges once the particle distribution gives a close approximation to the target distribution $p$.
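To make the velocity field concrete, here is a minimal sketch of one SVGD step with an RBF kernel; the bandwidth `h`, the step size, and the particle initialization are placeholders to be supplied by the user, and the parallel-particle structure of SVGD is kept explicit.

```python
import numpy as np

def rbf_kernel(X, Y, h):
    """k(x, y) = exp(-||x - y||^2 / (2 h^2)) and its gradient w.r.t. the first argument."""
    diff = X[:, None, :] - Y[None, :, :]                          # (n, m, d)
    K = np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * h ** 2))      # (n, m)
    grad_K = -diff / (h ** 2) * K[:, :, None]                     # d/dx k(x, y), shape (n, m, d)
    return K, grad_K

def svgd_velocity(particles, grad_log_p, h):
    """phi(theta) = E_{theta' ~ rho}[ grad log p(theta') k(theta', theta) + grad_{theta'} k(theta', theta) ]."""
    K, grad_K = rbf_kernel(particles, particles, h)               # kernel between all particle pairs
    grads = np.stack([grad_log_p(x) for x in particles])          # (n, d) score evaluations
    # Confining term: kernel-weighted average of scores; repulsive term: average of kernel gradients.
    return (K.T @ grads + grad_K.sum(axis=0)) / particles.shape[0]

def svgd_step(particles, grad_log_p, eps, h):
    return particles + eps * svgd_velocity(particles, grad_log_p, h)
```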
3 Stein Self-Repulsive Dynamics
In this work, we propose to integrate Langevin dynamics and SVGD to simultaneously decrease the auto-correlation of Langevin dynamics and eliminate the need for running parallel chains in SVGD. The idea is to apply the Stein repulsive force between the current particle and the particles from previous iterations, hence forming a new self-avoiding dynamics with faster convergence.
Specifically, assume we run a single Markov chain like Langevin dynamics, where $\theta_t$ denotes the sample at the $t$-th iteration. Denote by $\hat\rho_t$ the empirical measure of $M$ samples from the past iterations:

$$\hat\rho_t = \frac{1}{M}\sum_{m=1}^{M} \delta_{\theta_{t - mc}},$$

where $c$ is a thinning factor, which scales inversely with the step size $\epsilon$, introduced to thin the sequence of past samples. Compared with the empirical measure in SVGD, which is averaged over $n$ parallel particles, $\hat\rho_t$ is averaged across time over $M$ past samples. Given this, our Stein self-repulsive dynamics updates the sample via
$$\theta_{t+1} = \theta_t + \epsilon\big(-\nabla U(\theta_t) + \alpha\,\phi_{\hat\rho_t}(\theta_t)\big) + \sqrt{2\epsilon}\,\xi_t, \qquad (2)$$
in which the particle is updated with the typical Langevin gradient, plus a Stein repulsive force against the particles from the previous iterations. Here $\alpha \geq 0$ is a parameter that controls the magnitude of the Stein repulsive term. In this way, new samples are pushed away from the past samples, and the chain hence admits lower auto-correlation and faster convergence. Importantly, the addition of the repulsive force does not impact the asymptotic stationary distribution, thanks to Stein’s identity in (1). This is because if the self-repulsive dynamics has converged to the target distribution $p$, so that the past samples are distributed according to $p$, the Stein self-repulsive term equals zero in expectation due to Stein’s identity and hence does not introduce additional bias over Langevin dynamics. Rigorous theoretical analysis of this idea is developed in Section 4.
Practical Algorithm Because $\hat\rho_t$ is averaged across $M$ past samples, it is necessary to introduce a burn-in phase before the repulsive dynamics starts. Therefore, our overall procedure works as follows,
$$\theta_{t+1} = \begin{cases} \theta_t - \epsilon \nabla U(\theta_t) + \sqrt{2\epsilon}\,\xi_t, & t < Mc \quad \text{(Phase 1)}, \\ \theta_t + \epsilon\big(-\nabla U(\theta_t) + \alpha\,\phi_{\hat\rho_t}(\theta_t)\big) + \sqrt{2\epsilon}\,\xi_t, & t \geq Mc \quad \text{(Phase 2)}. \end{cases} \qquad (3)$$
It includes two phases. The first phase is the same as Langevin dynamics; it collects the initial samples used in the second phase while also serving as a warm start. The repulsive gradient update is introduced in the second phase to encourage the dynamics to visit under-explored density regions. We call this particular instance of our algorithm Self-Repulsive Langevin Dynamics (SRLD); self-repulsive variants of more general dynamics are discussed in Section 5.
Remark Note that the first phase is introduced to collect the initial samples. However, it is not strictly necessary to generate the initial samples with Langevin dynamics; we could instead draw the initial samples from some other initialization distribution. In practice, we find that using Langevin dynamics to collect the initial samples is natural, and it can also be viewed as the burn-in phase before sampling, so we use (3) in all of the other experiments.
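Putting the pieces together, the following is a minimal sketch of the two-phase procedure in (3); the repulsive weight `alpha`, the number of past samples `n_past`, the thinning factor `thin`, and the bandwidth `h` mirror the notation above, but the default values are illustrative placeholders, not the settings used in our experiments.

```python
import numpy as np

def stein_repulsive_force(theta, past_samples, grad_log_p, h):
    """SVGD velocity field phi evaluated at theta, averaged over the past samples."""
    diff = past_samples - theta                                   # (M, d)
    K = np.exp(-np.sum(diff ** 2, axis=1) / (2.0 * h ** 2))       # k(theta_past, theta)
    scores = np.stack([grad_log_p(x) for x in past_samples])      # (M, d)
    # Confining term + repulsive term (gradient of the kernel w.r.t. the past sample).
    return (scores * K[:, None] - diff / (h ** 2) * K[:, None]).mean(axis=0)

def srld(grad_log_p, theta0, eps=1e-3, alpha=1.0, n_past=10, thin=10, n_iters=20000, h=1.0):
    """Self-Repulsive Langevin Dynamics: Phase 1 is plain Langevin until n_past thinned
    samples are collected; Phase 2 adds the Stein repulsive force against those samples."""
    theta = np.array(theta0, dtype=float)
    kept, samples = [], []
    for t in range(n_iters):
        drift = grad_log_p(theta)
        if len(kept) >= n_past:
            past = np.stack(kept[-n_past:])                       # M most recent thinned samples
            drift = drift + alpha * stein_repulsive_force(theta, past, grad_log_p, h)
        theta = theta + eps * drift + np.sqrt(2.0 * eps) * np.random.randn(*theta.shape)
        if t % thin == 0:
            kept.append(theta.copy())
        samples.append(theta.copy())
    return np.array(samples)
```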
Remark The general idea of introducing self-repulsive terms inside MCMC or other iterative algorithms is not new by itself. For example, in molecular dynamics simulations, an algorithm called metadynamics (Laio and Parrinello, 2002) has been widely used, in which the particles are repelled away from the past samples in a way similar to our method, but with a generic repulsive function built from a dis-similarity measure between the current point and the past samples. However, introducing an arbitrary repulsive force alters the stationary distribution of the dynamics, introducing a harmful bias into the algorithm. Besides, the self-adjusting mechanism in Deng et al. (2020b) can also be viewed as a repulsive force implemented through a multiplier on the gradient. The key highlight of our approach, as reflected by our theoretical results in Section 4, is the unique property of the Stein repulsive term, which allows us to obtain the correct stationary distribution even with the addition of the repulsive force.
Remark Recent works (Gallego and Insua, 2018; Zhang et al., 2018) also combine SVGD with Langevin dynamics, in which, however, the Langevin force is directly added to a set of particles that evolve in parallel with SVGD. Using our terminology, their system is

$$\theta_{t+1}^i = \theta_t^i + \epsilon\big(-\nabla U(\theta_t^i) + \phi_{\hat\rho_t}(\theta_t^i)\big) + \sqrt{2\epsilon}\,\xi_t^i, \qquad i = 1,\ldots,n,$$

where $\hat\rho_t$ is the empirical measure of the $n$ parallel particles at the current iteration.
This is significantly different from our method in both motivation and practical algorithm. Their algorithm still requires simulating parallel chains of particles like SVGD, and was proposed to obtain an easier theoretical analysis than the deterministic dynamics of SVGD. Our work is instead motivated by the practical need of decreasing the auto-correlation in Langevin dynamics and avoiding the need to run multiple chains in SVGD, and hence must be based on self-repulsion against past samples along a single chain.
[Figure 1: Comparison of vanilla Langevin dynamics (left) and Stein self-repulsive dynamics (right) on a bivariate Gaussian, given initial samples (yellow dots) concentrated on the left part of the distribution.]
An Illustrative Example We give an illustrative example to show the key advantage of our self-repulsive dynamics. Assume that we want to sample from the bivariate Gaussian distribution shown in Figure 1. Unlike standard settings, we assume that we have already obtained some initial samples (yellow dots in Figure 1) before running the dynamics. The initial samples are assumed to concentrate on the left part of the target distribution as shown in Figure 1. In this extreme case, since the left part of the distribution is over-explored by the initial samples, it is desirable to have the subsequent new samples concentrate more on the un-explored part of the distribution. However, standard Langevin dynamics does not take this into account, and hence yields a biased overall representation of the true distribution (left panel). With our self-repulsive dynamics, the new samples are forced to visit the un-explored region more frequently, allowing us to obtain a much more accurate approximation when combining the new and initial samples.
4 Theoretical Analysis of Stein Self-Repulsive Dynamics
We provide a theoretical analysis of the self-repulsive dynamics. We establish that our self-repulsive dynamics converges to the correct target distribution asymptotically, in the limit when the number of past samples approaches infinity and the step size approaches zero. This is a highly non-trivial task, as the self-repulsive dynamics is a highly complex, non-linear, high-order Markov stochastic process. We attack this problem by breaking the proof into the following three steps:
- (1) We show that, as the number of past samples $M$ goes to infinity, the practical dynamics (3) converges to a discrete-time mean-field dynamics (4), in which the empirical measure of the past samples is replaced by their actual smooth distributions.
- (2) We show that, as the step size $\epsilon$ goes to zero, the discrete-time mean-field dynamics (4) converges to a continuous-time mean-field dynamics (5).
- (3) We show that the dynamics in (5) converges to the target distribution.
Remark As we mentioned in Section 3, we introduce the first phase to collect the initial samples for the second phase, and our theoretical analysis follows this setting to keep our theory as close to the practice as possible. However, the theoretical analysis can be generalized to the setting of drawing initial samples from some other initialization distribution with an almost identical argument.
Notations We use $\|\cdot\|$ and $\langle\cdot,\cdot\rangle$ to represent the vector norm and inner product, respectively. The Lipschitz norm and bounded Lipschitz norm of a function $f$ are defined by $\|f\|_{\mathrm{Lip}} = \sup_{x\neq y} |f(x)-f(y)|/\|x-y\|$ and $\|f\|_{\mathrm{BL}} = \max\{\|f\|_\infty, \|f\|_{\mathrm{Lip}}\}$. The KL divergence, Wasserstein-2 distance and bounded Lipschitz distance between distributions $\mu$ and $\nu$ are denoted by $\mathrm{KL}(\mu\,\|\,\nu)$, $W_2(\mu,\nu)$ and $D_{\mathrm{BL}}(\mu,\nu)$, respectively.
4.1 Mean-Field and Continuous-Time Limits
To fix the notation, we denote by $\rho_t$ the distribution of the sample $\theta_t$ at time $t$ of the practical self-repulsive dynamics (3), which we refer to as the practical dynamics in the sequel, when the initial particle is drawn from an initial continuous distribution $\rho_0$. Note that given $\rho_0$, the subsequent $\rho_t$ can be recursively defined through dynamics (3). Due to the diffusion noise in Langevin dynamics, all $\rho_t$ are continuous distributions supported on $\mathbb{R}^d$. We now introduce the limit dynamics when we take the mean-field limit $M \to \infty$ and then the continuous-time limit $\epsilon \to 0$.
Discrete-Time Mean-Field Dynamics ($M \to \infty$) In the limit of $M \to \infty$, our practical dynamics (3) approaches the following limit dynamics, in which the delta measures on the particles are replaced by the actual continuous distributions of the particles,
(4)
where $\bar\rho_t$ denotes the (smooth) distribution of the sample at time-step $t$ under this dynamics when it is initialized with $\rho_0$. Compared with the practical dynamics in (3), the difference is that the empirical distribution $\hat\rho_t$ is replaced by an average of the smooth distributions. Similar to the recursive definition of $\rho_t$ following dynamics (3), $\bar\rho_t$ is also recursively defined through dynamics (4), starting from $\rho_0$.
As we show in Theorem 4.3, if the auto-correlation of the chain decays fast enough and the number of past samples $M$ is sufficiently large, $\bar\rho_t$ is well approximated by the empirical distribution $\hat\rho_t$ in (3), and further the two dynamics ((3) and (4)) converge to each other, in the sense that the distance between $\rho_t$ and $\bar\rho_t$ vanishes as $M \to \infty$ for any fixed $t$. Note that in taking the limit of $M \to \infty$, we need to ensure that we run the dynamics for more than $Mc$ steps; otherwise, SRLD degenerates to Langevin dynamics, as we would stop the chain before finishing collecting the initial samples.
Continuous-Time Mean-Field Dynamics ($\epsilon \to 0$) In the limit of zero step size $\epsilon \to 0$, the discrete-time mean-field dynamics in (4) can be shown to converge to the following continuous-time mean-field dynamics:
(5)
where $W_s$ is the Brownian motion and $\nu_s$ is the distribution of the process at a continuous time point $s$ when initialized with $\rho_0$. We prove that (5) is closely approximated by (4) with a small step size, in the sense that the divergence between the corresponding distributions vanishes as $\epsilon \to 0$ (Theorem 4.2), and importantly, the stationary distribution of (5) equals the target distribution $p$.
4.2 Assumptions
We first introduce the technical assumptions used in our theoretical analysis.
Assumption 4.1 (RBF Kernel).
We use the RBF kernel, i.e., $k(\theta,\theta') = \exp(-\|\theta-\theta'\|^2/(2h^2))$, for some fixed bandwidth $h > 0$.
We assume the RBF kernel only for simplicity of our analysis. It is straightforward to generalize our theoretical results to other positive definite kernels.
Assumption 4.2 ( is dissipative and smooth).
Assume that and . We also assume that . Here and are some finite positive constant.
Assumption 4.3 (Regularity Condition).
Assume . Define , assume there exists such that
Assumption 4.4 (Strong-convexity).
Suppose that for a positive constant .
Remark Assumption 4.2 is standard in the existing Langevin dynamics analysis (see Dalalyan, 2017; Raginsky et al., 2017; Deng et al., 2020a). Assumption 4.3 is a weak condition, as it only assumes that the dynamics cannot degenerate into one local mode and stop moving. This assumption is expected to hold when we have diffusion terms like the Gaussian noise in our self-repulsive dynamics. Assumption 4.4 is a classical assumption in the existing analysis of Langevin dynamics with a convex potential (Dalalyan, 2017; Durmus et al., 2019). Although somewhat strong, this assumption broadly applies to posterior inference problems in the big-data limit, as the posterior distribution converges to a Gaussian distribution for large training sets by the Bernstein-von Mises theorem. It is technically possible to further generalize our results to non-convex settings with a refined analysis, which we leave as future work. This work focuses on the classic convex setting for simplicity.
4.3 Main Theorems
All of the proofs in this section can be found in Appendix E. We first prove that the limiting distribution of the continuous-time mean-field dynamics (5) is the target distribution. This is achieved by writing dynamics (5) as a (non-linear) partial differential equation.
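Since the precise form of (5) is not reproduced above, the following is only a sketch of the expected Fokker-Planck form, under the assumption that the drift is the Langevin term plus the Stein repulsive force, writing $\nu_s$ for the law of the process at continuous time $s$ and $\bar\nu_s$ for the time-averaged law seen by the repulsive term:

$$\partial_s \nu_s(\theta) \;=\; \nabla\cdot\Big(\nu_s(\theta)\,\big(\nabla U(\theta) - \alpha\,\phi_{\bar\nu_s}(\theta)\big)\Big) \;+\; \Delta \nu_s(\theta).$$

Setting $\nu_s = \bar\nu_s = p \propto e^{-U}$ makes the Langevin part vanish, since $\nabla\cdot(p\nabla U) + \Delta p = \nabla\cdot(p\nabla U + \nabla p) = 0$, and makes the repulsive part vanish by Stein's identity (1), which is consistent with Theorem 4.1 below.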
Theorem 4.1 (Stationary Distribution).
Given some finite $\alpha$, $M$ and $c$, and supposing that the limit distribution of (5) exists, the limit distribution is unique and equals the target distribution $p$.
We then give an upper bound on the discretization error, which can be characterized by analyzing the KL divergence between the laws of the discrete-time and continuous-time mean-field dynamics.
Theorem 4.2 (Time Discretization Error).
From this theorem, we know that if the step size $\epsilon$ is small enough, the discretization error is small and (4) approximates (5) closely. Next we give the result on the mean-field limit.
Theorem 4.3 (Mean-Field Limit).
Thus, if the number of past samples $M$ is sufficiently large, the practical dynamics can well approximate the mean-field dynamics. Combining all the above theorems, we have the following corollary showing the convergence of the proposed practical algorithm to the target distribution.
Corollary 4.1 (Convergence to Target Distribution).
Remark A careful choice of parameters is needed, as our system is a complicated longitudinal particle system. Also notice that if the total number of iterations is no larger than $Mc$, the repulsive dynamics reduces to Langevin dynamics, as only the samples from the first phase will be collected.
5 Extension to General Dynamics
Although we have focused on self-repulsive Langevin dynamics, our Stein self-repulsive idea can be broadly combined with general gradient-based MCMC. Following Ma et al. (2015), we consider the following general class of sampling dynamics for drawing samples from the target distribution $p$:
where $D(\theta)$ is a positive semi-definite diffusion matrix that determines the strength of the Brownian motion and $Q(\theta)$ is a skew-symmetric curl matrix that can represent the traverse effect (e.g. in Neal et al., 2011; Ding et al., 2014). By adding the Stein repulsive force, we obtain the following general self-repulsive dynamics
(6)
where $\nu_s$ is again the distribution of the process following (6) when initialized at $\rho_0$. Similar to the case of Langevin dynamics, this process also converges to the correct target distribution, and can be simulated by a practical dynamics similar to (3).
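The corresponding practical update can be sketched as below, under the assumption that the Stein repulsive force enters the drift additively as in the Langevin case (2); $D$ and $Q$ are taken to be constant matrices here, so the correction term $\Gamma$ of Ma et al. (2015) vanishes, and the interface is purely illustrative.

```python
import numpy as np

def general_srld_step(theta, grad_log_p, stein_force, D, Q, eps, alpha, rng=np.random):
    """One step of a general self-repulsive dynamics in the spirit of (6).

    Assumes constant PSD diffusion matrix D and skew-symmetric curl matrix Q,
    and that the Stein repulsive force is simply added to the drift."""
    drift = (D + Q) @ grad_log_p(theta) + alpha * stein_force(theta)
    noise = rng.multivariate_normal(np.zeros(theta.shape[0]), 2.0 * eps * D)
    return theta + eps * drift + noise
```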
Theorem 5.1 (Stationary Distribution).
Given some finite $\alpha$, $M$ and $c$, and supposing that the limiting distribution of dynamics (6) exists, the limiting distribution is unique and equals the target distribution $p$.
6 Experiments
In this section, we evaluate the proposed method on various challenging tasks. We demonstrate the effectiveness of SRLD in high dimensions by applying it to sample the posterior of Bayesian neural networks. To demonstrate the superiority of SRLD in obtaining diversified samples, we apply SRLD to contextual bandit problems, which require the sampler to efficiently explore the distribution in order to give good uncertainty estimation.
We include a discussion of parameter tuning and additional experiments on sampling high dimensional Gaussian and Gaussian mixture distributions in Appendix B. Our code is available at https://github.com/lushleaf/Stein-Repulsive-Dynamics.
6.1 Synthetic Experiment
We first show how the repulsive gradient helps explore the whole distribution using a synthetic distribution that is easy to visualize. Following Ma et al. (2015), we compare the sampling efficiency on a correlated 2D distribution.
We compare SRLD with vanilla Langevin dynamics, and evaluate the sample quality by maximum mean discrepancy (MMD) (Gretton et al., 2012), Wasserstein-1 distance and effective sample size (ESS). Notice that the finite-sample quality of gradient-based MCMC methods is highly related to the step size. Compared with Langevin dynamics, we have an extra repulsive gradient and thus implicitly a larger step size. To rule out this effect, we set different step sizes for the two dynamics so that their gradients have the same magnitude.
In addition, to decrease the influence of random noise, the two dynamics are set to have the same initialization and use the same sequence of Gaussian noise. We collect one sample every fixed number of iterations. We repeat the experiment 20 times with different initializations and sequences of Gaussian noise.
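As a reference for how the sample-quality metric can be computed, here is a minimal sketch of the (biased, V-statistic) MMD estimate between generated samples and ground-truth samples with an RBF kernel; the bandwidth value is an assumption.

```python
import numpy as np

def mmd_rbf(X, Y, h=1.0):
    """Biased MMD^2 estimate between sample sets X (n, d) and Y (m, d) with an RBF kernel."""
    def k(A, B):
        d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
        return np.exp(-d2 / (2.0 * h ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()
```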
[Figure 2: Sample quality of SRLD vs. vanilla Langevin dynamics on the 2D synthetic distribution, measured by MMD, Wasserstein-1 distance, ESS and sample auto-correlation.]

[Figure 3: Typical sampling trajectories of SRLD and vanilla Langevin dynamics.]
Figure 2 summarizes the results under different metrics. We can see that SRLD has significantly smaller MMD and Wasserstein-1 distance as well as a larger ESS compared with vanilla Langevin dynamics. Moreover, the introduced repulsive gradient creates a negative auto-correlation between samples. Figure 3 shows a typical trajectory of the two sampling dynamics. We can see that SRLD has a faster mixing rate than vanilla Langevin dynamics. Note that since we use the same sequence of Gaussian noise for both algorithms, the difference is mainly due to the repulsive gradient rather than randomness.
6.2 Bayesian Neural Network
Bayesian neural networks are among the most important models in Bayesian deep learning, with wide application in practice. Here we test the performance of SRLD on sampling the posterior of a Bayesian neural network on the UCI datasets (Dua and Graff, 2017). We assume the output is normally distributed, with a two-layer neural network with 50 hidden units predicting the mean of the outputs. All of the datasets are randomly partitioned into training and testing splits. The results are averaged over 20 random trials. We refer readers to Appendix C for hyper-parameter tuning and other experiment details.
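For concreteness, the unnormalized log-posterior that the samplers target can be sketched as below; the ReLU activation, the Gaussian weight prior, and the Gamma(1, 1) hyper-prior on the inverse output variance are simplifying assumptions for illustration rather than the exact model specification.

```python
import numpy as np

def bnn_log_posterior(params, X, y, prior_prec=1.0):
    """Unnormalized log-posterior of a two-layer (50-unit) regression BNN with
    Gaussian output noise of precision tau = exp(log_tau)."""
    W1, b1, W2, b2, log_tau = params
    h = np.maximum(0.0, X @ W1 + b1)          # hidden layer (ReLU assumed for the sketch)
    mean = (h @ W2 + b2).ravel()
    tau = np.exp(log_tau)
    log_lik = 0.5 * len(y) * (log_tau - np.log(2.0 * np.pi)) - 0.5 * tau * np.sum((y - mean) ** 2)
    log_prior = -0.5 * prior_prec * sum(np.sum(p ** 2) for p in (W1, b1, W2, b2))
    log_prior += log_tau - tau                # Gamma(1, 1) prior on tau (hypothetical), incl. Jacobian
    return log_lik + log_prior
```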
Dataset | Ave Test RMSE | Ave Test LL | ||||
---|---|---|---|---|---|---|
SVGD | LD | SRLD | SVGD | LD | SRLD | |
Boston | ||||||
Concrete | ||||||
Energy | ||||||
Naval | ||||||
WineRed | ||||||
WineWhite | ||||||
Yacht |
Table 1 shows the average test RMSE and test log-likelihood and their standard deviations. The method with the best average performance is marked in boldface. We observe that a large portion of the variance is due to the random partition of the dataset. Therefore, to show statistical significance, we use a matched pair $t$-test and underline the methods that perform best at a significance level of 0.05. Note that the results of SRLD/LD and SVGD are not directly comparable, because SRLD/LD are single-chain methods that average across time, whereas SVGD is a multi-chain method that only uses the particles of the last iteration. We provide additional results in Appendix C showing that SRLD averaged over 20 samples (across time) can also achieve similar or better results than SVGD with 20 (parallel) particles.
6.3 Contextual Bandits
We consider the posterior sampling (a.k.a. Thompson sampling) algorithm with a Bayesian neural network as the function approximator, to demonstrate the uncertainty estimation provided by SRLD. We follow the experimental setting of Riquelme et al. (2018). The only difference is that we change the optimization of the objective (e.g. the evidence lower bound (ELBO) in variational inference methods) into running MCMC samplers. We compare SRLD with Langevin dynamics on two benchmarks from Riquelme et al. (2018), and include SVGD as a baseline. For a more detailed introduction, setup, hyper-parameter tuning and experiment details, see Appendix D.
Dataset | SVGD | LD | SRLD |
---|---|---|---|
Mushroom | |||
Wheel |
The cumulative regret is shown in Table 2. SVGD is known to under-estimate the uncertainty for Bayesian neural networks when the particle number is limited (Wang et al., 2019b), and as a result, it has the worst performance among the three methods. SRLD is slightly better than vanilla Langevin dynamics on the simple Mushroom bandit. On the much harder Wheel bandit, SRLD is significantly better than vanilla Langevin dynamics, which shows the improved uncertainty estimation of our method with a finite number of samples.
7 Conclusion
We propose a Stein self-repulsive dynamics which applies the Stein variational gradient to push samples of an MCMC dynamics away from its past trajectories. This allows us to significantly decrease the auto-correlation of MCMC, increasing the sample efficiency for better estimation. The advantages of our method are extensively studied through both theoretical and empirical analysis. In future work, we plan to investigate the combination of our Stein self-repulsive idea with more general MCMC procedures, and explore broader applications.
Broader Impact Statement
This work incorporates a Stein repulsive force into Langevin dynamics to improve sample efficiency. It brings a positive improvement to communities, such as reinforcement learning and Bayesian deep learning, that need efficient samplers. We do not foresee any negative societal impacts of our work.
Acknowledgement
This paper is supported in part by NSF CAREER 1846421, SenSE 2037267 and EAGER 2041327.
References
- An and Huang (1996) HZ An and FC Huang. The geometrical ergodicity of nonlinear autoregressive models. Statistica Sinica, pages 943–956, 1996.
- Blei et al. (2017) David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
- Blundell et al. (2015) Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1613–1622, Lille, France, 07–09 Jul 2015. PMLR. URL http://proceedings.mlr.press/v37/blundell15.html.
- Bradley (2005) Richard Bradley. Basic properties of strong mixing conditions. a survey and some open questions. Probability Surveys, 2:107–144, 2005.
- Bradley (2007) Richard C. Bradley. Introduction to strong mixing conditions. 2007.
- Brooks et al. (2011) Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Handbook of Markov chain Monte Carlo. CRC press, 2011.
- Bubeck et al. (2012) Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
- Chapelle and Li (2011) Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in neural information processing systems, pages 2249–2257, 2011.
- Chen et al. (2018) Changyou Chen, Ruiyi Zhang, Wenlin Wang, Bai Li, and Liqun Chen. A unified particle-optimization framework for scalable Bayesian sampling. In Proceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intelligence, UAI 2018, Monterey, California, USA, August 6-10, 2018, pages 746–755, 2018.
- Dalalyan (2017) Arnak S Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017.
- Deng et al. (2020a) Wei Deng, Qi Feng, Liyao Gao, Faming Liang, and Guang Lin. Non-convex learning via replica exchange stochastic gradient mcmc. In International Conference on Machine Learning, pages 2474–2483. PMLR, 2020a.
- Deng et al. (2020b) Wei Deng, Guang Lin, and Faming Liang. A contour stochastic gradient langevin dynamics algorithm for simulations of multi-modal distributions, 2020b.
- Ding et al. (2014) Nan Ding, Youhan Fang, Ryan Babbush, Changyou Chen, Robert D. Skeel, and Hartmut Neven. Bayesian sampling using stochastic gradient thermostats. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3203–3211, 2014.
- Dua and Graff (2017) Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
- Durmus et al. (2019) Alain Durmus, Szymon Majewski, and Blazej Miasojedow. Analysis of langevin monte carlo via convex optimization. Journal of Machine Learning Research, 20(73):1–46, 2019.
- Feng et al. (2017) Yihao Feng, Dilin Wang, and Qiang Liu. Learning to draw samples with amortized stein variational gradient descent. uncertainty in artificial intelligence, 2017.
- Gallego and Insua (2018) Victor Gallego and David Rios Insua. Stochastic gradient mcmc with repulsive forces. arXiv preprint arXiv:1812.00071, 2018.
- Gong et al. (2019) Chengyue Gong, Jian Peng, and Qiang Liu. Quantile Stein variational gradient descent for batch Bayesian optimization. In International Conference on Machine Learning, pages 2347–2356, 2019.
- Gretton et al. (2012) Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
- Haarnoja et al. (2017) Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. international conference on machine learning, pages 1352–1361, 2017.
- Han and Liu (2017) Jun Han and Qiang Liu. Stein variational adaptive importance sampling. Uncertainty in Artificial Intelligence, 2017.
- Han and Liu (2018) Jun Han and Qiang Liu. Stein variational gradient descent without gradient. In Uncertainty in Artificial Intelligence, 2018.
- Laio and Parrinello (2002) Alessandro Laio and Michele Parrinello. Escaping free-energy minima. Proceedings of the National Academy of Sciences, 99(20):12562–12566, 2002.
- Liu et al. (2019) Chang Liu, Jingwei Zhuo, Pengyu Cheng, Ruiyi Zhang, and Jun Zhu. Understanding and accelerating particle-based variational inference. In International Conference on Machine Learning, pages 4082–4092, 2019.
- Liu (2017) Qiang Liu. Stein variational gradient descent as gradient flow. In Advances in neural information processing systems, pages 3118–3126, 2017.
- Liu and Wang (2016) Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose bayesian inference algorithm. In Advances In Neural Information Processing Systems, pages 2378–2386, 2016.
- Liu and Wang (2018) Qiang Liu and Dilin Wang. Stein variational gradient descent as moment matching. In Advances in Neural Information Processing Systems, pages 8854–8863, 2018.
- Liu et al. (2017a) Yang Liu, Qiang Liu, and Jian Peng. Stein variational policy gradient. uncertainty in artificial intelligence, 2017a.
- Liu et al. (2017b) Yang Liu, Prajit Ramachandran, Qiang Liu, and Jian Peng. Stein variational policy gradient. Uncertainty in Artificial Intelligence, 2017b.
- Ma et al. (2015) Yi-An Ma, Tianqi Chen, and Emily Fox. A complete recipe for stochastic gradient mcmc. In Advances in Neural Information Processing Systems, pages 2917–2925, 2015.
- Neal et al. (2011) Radford M Neal et al. Mcmc using hamiltonian dynamics. Handbook of markov chain monte carlo, 2(11):2, 2011.
- Pu et al. (2017) Yunchen Pu, Zhe Gan, Ricardo Henao, Chunyuan Li, Shaobo Han, and Lawrence Carin. Vae learning via stein variational gradient descent. neural information processing systems, pages 4236–4245, 2017.
- Raginsky et al. (2017) Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient langevin dynamics: a nonasymptotic analysis. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, Amsterdam, The Netherlands, 7-10 July 2017, pages 1674–1703, 2017. URL http://proceedings.mlr.press/v65/raginsky17a.html.
- Riquelme et al. (2018) Carlos Riquelme, George Tucker, and Jasper Snoek. Deep bayesian bandits showdown: An empirical comparison of bayesian deep networks for thompson sampling. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018. URL https://openreview.net/forum?id=SyYe6k-CW.
- Rossky et al. (1978) Peter J Rossky, JD Doll, and HL Friedman. Brownian dynamics as smart monte carlo simulation. The Journal of Chemical Physics, 69(10):4628–4633, 1978.
- Schlimmer (1981) Jeff Schlimmer. Mushroom records drawn from the audubon society field guide to north american mushrooms. GH Lincoff (Pres), New York, 1981.
- Thompson (1933) William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
- Wainwright et al. (2008) Martin J Wainwright, Michael I Jordan, et al. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305, 2008.
- Wang et al. (2019a) Dilin Wang, Ziyang Tang, Chandrajit Bajaj, and Qiang Liu. Stein variational gradient descent with matrix-valued kernels. In Conference on Neural Information Processing Systems, 2019a.
- Wang et al. (2019b) Ziyu Wang, Tongzheng Ren, Jun Zhu, and Bo Zhang. Function space particle optimization for bayesian neural networks. arXiv preprint arXiv:1902.09754, 2019b.
- Zhang et al. (2018) Jianyi Zhang, Ruiyi Zhang, and Changyou Chen. Stochastic particle-optimization sampling and the non-asymptotic convergence theory. arXiv preprint arXiv:1809.01293, 2018.
Appendix A Discussion on Hyper-parameter Tuning
The key hyper-parameters of SRLD are: 1. $\alpha$, which balances the confining gradient and the repulsive gradient; 2. the number of past samples used; 3. the bandwidth of the kernel; 4. the step size; 5. the thinning factor. Among these, $\alpha$, the number of past samples and the bandwidth are the hyper-parameters introduced by the proposed repulsive gradient, and thus we mainly discuss these three. The number of past samples and the kernel bandwidth are introduced by the use of the repulsive term of SVGD [Liu and Wang, 2016]. In practice, we find that a tuning strategy similar to that of SVGD [Liu and Wang, 2016] gives good performance. Specifically, the number of past samples does not need to be very large to obtain good performance; similar to SVGD, a moderate number already gives a good enough particle approximation. A good choice of bandwidth is also important for the kernel. In SVGD, instead of tuning the bandwidth, an adaptive rule adjusts it during the run of the dynamics: the bandwidth is set from the median of the pairwise distances between the particles, so that the kernel weights of each particle roughly sum to one. This adaptive way of choosing the bandwidth is widely used in the approximate inference literature, e.g., Liu et al. [2017b], Han and Liu [2017], Wang et al. [2019b,a]. We also find that applying this adaptive bandwidth gives good empirical performance, and thus we use it in our implementation. Now we discuss how to choose $\alpha$. Notice that $\alpha$ serves to balance the confining gradient and the repulsive gradient; based on this motivation, we recommend finding a proper $\alpha$ using the samples from the burn-in phase, by setting it so that the average magnitude of the repulsive gradient matches that of the confining gradient. In this way, $\alpha$ balances the two kinds of gradients, and one may further tune $\alpha$ by searching around this value. An empirically good value is 10 for the data sets we tested, and we use it in all experiments.
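The two recipes discussed above, the median-trick bandwidth and the gradient-balancing choice of $\alpha$, can be sketched as follows; the exact median-trick scaling and the `balance_alpha` helper are illustrative assumptions rather than the exact formulas used in our implementation.

```python
import numpy as np

def median_bandwidth(particles):
    """Median-trick RBF bandwidth (one common parameterization; the exact scaling varies)."""
    d2 = np.sum((particles[:, None, :] - particles[None, :, :]) ** 2, axis=-1)
    med2 = np.median(d2[np.triu_indices(len(particles), k=1)])
    return np.sqrt(med2 / (2.0 * np.log(len(particles) + 1)))

def stein_velocity(theta, past, grad_log_p, h):
    diff = past - theta
    K = np.exp(-np.sum(diff ** 2, axis=1) / (2.0 * h ** 2))
    scores = np.stack([grad_log_p(x) for x in past])
    return (scores * K[:, None] - diff / (h ** 2) * K[:, None]).mean(axis=0)

def balance_alpha(burnin, grad_log_p, n_past=10):
    """Hypothetical recipe: pick alpha so that the confining and repulsive gradients
    have comparable average magnitude over the burn-in samples (an (N, d) array)."""
    h = median_bandwidth(burnin)
    conf = np.mean([np.linalg.norm(grad_log_p(x)) for x in burnin])
    rep = np.mean([np.linalg.norm(stein_velocity(x, burnin[-n_past:], grad_log_p, h)) for x in burnin])
    return conf / max(rep, 1e-12)
```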
The step size is important for gradient-based MCMC, as a too-large step size gives a large discretization error, while a too-small step size makes the dynamics converge very slowly. In this paper, we mainly use a validation set to tune the step size. The thinning factor is also a common parameter in MCMC methods, and MCMC methods are usually not sensitive to it. SRLD is not sensitive to this parameter either, and we simply fix it for all experiments.
Appendix B Additional Experiment Result on Synthetic Data
In this section, we show additional experiments on synthetic data. To further visualize the role of the proposed Stein repulsive gradient, we also apply our method to sample a 2D mixture of Gaussians (Section B.1). To further study how different values of $\alpha$ influence sampling in high dimensions, we apply SRLD to sample a high dimensional Gaussian (Section B.2) and a high dimensional mixture of Gaussians (Section B.3).
B.1 Synthetic 2D Mixture of Gaussian Experiment
[Figures 4 and 5: Evaluation metrics and sampling trajectories for SRLD vs. vanilla Langevin dynamics on the 2D mixture of Gaussians.]
We aim to show how the repulsive gradient helps the particle escape from a local high-density region by sampling a 2D mixture of Gaussians using SRLD and Langevin dynamics. The target is an equal-weight mixture of two Gaussian components. This target distribution has two modes, and vanilla Langevin dynamics can get stuck in one mode while leaving the other mode under-explored (as the gradient of the energy function can dominate the update of the samples). We use the same evaluation method, step sizes, initialization and Gaussian noise as in the previous experiment. We collect one sample every 100 iterations and the experiment is repeated 20 times. Figure 4 shows that SRLD consistently outperforms Langevin dynamics on all of the evaluation metrics.
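For concreteness, a target of this kind can be set up as below; the specific means and covariances are hypothetical placeholders, since the exact values are not reproduced in this appendix.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical two-component Gaussian mixture in 2D (means/covariances are placeholders).
mu1, mu2 = np.array([-2.0, 0.0]), np.array([2.0, 0.0])
cov = np.eye(2)

def log_p(theta):
    p = 0.5 * multivariate_normal.pdf(theta, mu1, cov) + 0.5 * multivariate_normal.pdf(theta, mu2, cov)
    return np.log(p)

def grad_log_p(theta, h=1e-5):
    # Finite-difference gradient for the sketch; analytic gradients would be used in practice.
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta); e[i] = h
        g[i] = (log_p(theta + e) - log_p(theta - e)) / (2.0 * h)
    return g
```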
To provide more evidence of the effectiveness of SRLD in escaping from a local high-density region, we plot the sampling trajectories of SRLD and vanilla Langevin dynamics on this mixture of Gaussians. We find that, when both methods have obtained 200 samples, SRLD has already started to explore the second mode, while vanilla Langevin dynamics is still stuck in the original mode. When both methods have 250 samples, vanilla Langevin dynamics just starts to explore the second mode, while SRLD has already obtained several samples from the second mode, which demonstrates the effectiveness of our method in escaping the local mode.
B.2 Synthetic higher dimensional Gaussian Experiment
To show the performance of SRLD in higher dimensions with different values of $\alpha$, we additionally consider sampling from a Gaussian distribution with a non-trivial covariance structure. We run SRLD with several values of $\alpha$; the case $\alpha = 0$ reduces to Langevin dynamics. We collect one sample every 10 iterations. The other experiment settings are the same as for the toy examples in the main text. The results are summarized in Figure 6. In this experiment, we include one SRLD run with an inappropriately large $\alpha$. For this chain, the repulsive gradient gives a strong repulsive force and thus yields the largest ESS and the fastest decay of auto-correlation, but the inappropriate value induces too much extra approximation error, so its performance is not as good as the runs with smaller $\alpha$ (see MMD and Wasserstein distance). This phenomenon matches our theoretical findings.
[Figure 6: Sample quality metrics for SRLD with different values of $\alpha$ on the high dimensional Gaussian.]
B.3 Synthetic higher dimensional Mixture of Gaussian Experiment
We also consider sampling from a mixture of Gaussians in higher dimensions. The target density is an equal-weight mixture of two Gaussian components whose means are separated by a fixed distance. We run SRLD with several values of $\alpha$ (when $\alpha = 0$, it reduces to LD). The other experiment settings are the same as in the low dimensional mixture of Gaussians case. Figure 7 summarizes the results. As shown in the figure, when $\alpha$ becomes larger, the repulsive force helps the sampler better explore the density region.
[Figure 7: Sample quality metrics for SRLD with different values of $\alpha$ on the high dimensional mixture of Gaussians.]
Appendix C BNN on UCI Datasets: Experiment Settings and Additional Results
We first give the detailed experiment settings. We place a prior on the inverse output variance. We set the mini-batch size to 100. We run 50000 iterations for each method, and for LD and SRLD, the first 40000 iterations are discarded as burn-in. We use a thinning factor such that in total we collect 100 samples from the posterior distribution. For each dataset, we generate 3 extra data splits for tuning the step size of each method. We set the number of past samples to be 10. In all experiments, we use the RBF kernel with bandwidth set by the median trick as suggested in Liu and Wang [2016]. We use the same $\alpha$ for all the data sets. For SVGD, we use the original implementation with 20 particles by Liu and Wang [2016].
We now show additional experimental results on posterior inference on the UCI datasets. As mentioned in Section 6.2, the comparison between SVGD and SRLD is not direct, as SVGD is a multiple-chain method with fewer particles and SRLD is a single-chain method with more samples. For a more detailed comparison, we compare SVGD with SRLD using the first 20, 40, 60, 80 and 100 samples, denoted as SRLD-$N$ where $N$ is the number of samples used. Table 3 shows the averaged test RMSE and Table 4 shows the averaged test log-likelihood. For SRLD with a given number of samples, the value is set in boldface if it has better average performance than SVGD, and it is additionally underlined if the difference is statistically significant at level 0.05 under a matched pair t-test.
Figures 8 and 9 give visualized results of the comparison between Langevin dynamics and SRLD. To rule out the variance from different splits of the dataset, the error bar is calculated based on the difference between the RMSE of SRLD and the RMSE of Langevin dynamics over the 20 repeats (and similarly for test log-likelihood), and we only draw the error bar on Langevin dynamics.
Dataset | Ave Test RMSE | |||||
---|---|---|---|---|---|---|
SRLD-20 | SRLD-40 | SRLD-60 | SRLD-80 | SRLD-100 | SVGD | |
Boston | ||||||
Concrete | ||||||
Energy | ||||||
Naval | ||||||
WineRed | ||||||
WineWhite | ||||||
Yacht |
Dataset | Ave Test LL | |||||
---|---|---|---|---|---|---|
SRLD-20 | SRLD-40 | SRLD-60 | SRLD-80 | SRLD-100 | SVGD | |
Boston | ||||||
Concrete | ||||||
Energy | ||||||
Naval | ||||||
WineRed | ||||||
WineWhite | ||||||
Yacht |
[Figures 8 and 9: Test RMSE and test log-likelihood of SRLD versus Langevin dynamics on the UCI datasets, as a function of the number of collected samples.]
Appendix D Contextual Bandit: Experiment Settings and More Background
Contextual bandits are a class of online learning problems that can be viewed as simple reinforcement learning problems without transitions. For a complete treatment of contextual bandit problems, we refer the readers to Chapter 4 of Bubeck et al. [2012]. Here we include the main idea for completeness. In contextual bandit problems, the agent needs to find the best action given an observed context (a.k.a. the optimal policy in reinforcement learning). Formally, we define $\mathcal{C}$ as the context set and $K$ as the number of actions. Then we can describe the contextual bandit problem as follows: at each time-step $t = 1, \ldots, T$, where $T$ is a pre-defined time horizon (which can be given to the agent), the environment provides a context $c_t \in \mathcal{C}$ to the agent, and the agent chooses one action $a_t$ based on the context $c_t$. The environment then returns a (stochastic) reward $r_t$ to the agent based on the context and the action, similar to the reinforcement learning setting. Notice that the agent can adjust its strategy at each time-step, which is why this kind of problem is called “online” learning.
Solving a contextual bandit problem is equivalent to finding an algorithm that minimizes the pseudo-regret [Bubeck et al., 2012], which is defined as:
$$R_T = \max_{f:\,\mathcal{C}\to\{1,\ldots,K\}} \mathbb{E}\Big[\sum_{t=1}^{T} r(c_t, f(c_t)) - \sum_{t=1}^{T} r(c_t, a_t)\Big], \qquad (7)$$
where $f$ denotes a deterministic mapping from the context set to actions (readers can view $f$ as a deterministic policy in reinforcement learning). Intuitively, this pseudo-regret measures the difference in cumulative reward between the chosen action sequence $\{a_t\}$ and the best action sequence $\{f(c_t)\}$. Thus, an algorithm that minimizes the pseudo-regret (7) also finds the best $f$.
Posterior sampling [a.k.a. Thompson sampling; Thompson, 1933] is one of the classical yet successful algorithms, achieving state-of-the-art performance in practice [Chapelle and Li, 2011]. It works by first placing a user-specified prior on the reward function, and at each turn making a decision based on the posterior distribution and then updating it with the new observation $(c_t, a_t, r_t)$, where the action $a_t$ is selected greedily with respect to a reward function sampled from the current posterior:
i.e., it greedily selects the action based on a reward function sampled from the posterior, hence the name “posterior sampling”. Algorithm 1 summarizes the whole procedure.
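A minimal sketch of this posterior-sampling loop is given below, with the MCMC/SRLD posterior update abstracted behind `sample_reward_model`; the environment interface (`context`, `pull`, `n_actions`) is an assumption made for illustration.

```python
import numpy as np

def thompson_sampling(env, sample_reward_model, horizon):
    """Posterior (Thompson) sampling for contextual bandits.

    env: provides context() and pull(context, action) -> reward.
    sample_reward_model(history): returns one reward model r_hat(context, action)
        drawn from the current posterior (e.g., by running an MCMC/SRLD sampler).
    """
    history = []
    for _ in range(horizon):
        context = env.context()
        r_hat = sample_reward_model(history)          # one posterior sample of the reward function
        action = int(np.argmax([r_hat(context, a) for a in range(env.n_actions)]))
        reward = env.pull(context, action)
        history.append((context, action, reward))     # the posterior is updated from this history
    return history
```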
Notice that all reinforcement learning problems face the exploration-exploitation dilemma, and so does the contextual bandit problem. Posterior sampling trades off exploration and exploitation using the uncertainty provided by the posterior distribution. Hence, if the posterior uncertainty is not estimated properly, posterior sampling will perform poorly: if we over-estimate the uncertainty, we explore too many sub-optimal actions, while if we under-estimate the uncertainty, we may fail to find the optimal actions. Thus, it is a good benchmark for evaluating the uncertainty estimates provided by different inference methods.
Though in principle all MCMC methods return samples following the true posterior if we run infinitely many MCMC steps, in practice we can only obtain finitely many samples in finite time. In this case, the auto-correlation issue can lead to under-estimated uncertainty, which causes failures in reinforcement learning problems that require exploration.
Here, we test the uncertainty provided by vanilla Langevin dynamics and self-repulsive Langevin dynamics on two of the benchmark contextual bandit problems suggested by Riquelme et al. [2018], called mushroom and wheel. We refer to Riquelme et al. [2018] for a detailed introduction to these two problems; for completeness, we summarize them as follows:
Mushroom The mushroom bandit uses data from the Mushroom dataset [Schlimmer, 1981], which includes different kinds of poisonous and safe mushrooms with 22 attributes indicating whether a mushroom is poisonous or not. Blundell et al. [2015] first introduced the mushroom bandit by designing the following reward function: eating a safe mushroom gives a positive reward, while eating a poisonous mushroom returns either a large negative reward or the same positive reward with equal chances. The agent can also choose not to eat the mushroom, which always yields a fixed reward. As in Riquelme et al. [2018], we use 50000 instances in this problem.
Wheel To highlight the need for exploration, Riquelme et al. [2018] designed the wheel bandit, which controls the need for exploration with an “exploration parameter” $\delta$. The context set is the unit circle in $\mathbb{R}^2$, and at each turn the context is sampled uniformly from it. Five possible actions are provided: the first action yields a constant expected reward; the rewards of the other actions are determined by the provided context:
- For contexts inside the inner circle of radius $\delta$, all four other actions return a suboptimal reward.
- For contexts outside the inner circle, depending on the quarter the context is in, one of the four other actions becomes optimal. This optimal action gives a much larger reward, while the remaining three actions still yield the suboptimal reward.
Following the setting of Riquelme et al. [2018], we set the reward parameters to the default values used in their benchmark.
As $\delta$ approaches 1, the inner circle dominates the unit circle and the first action becomes optimal for most contexts. Thus, inference methods with poorly estimated uncertainty will continuously choose the suboptimal action for all contexts without exploration. This phenomenon has been confirmed in Riquelme et al. [2018]. In our experiments, as we want to evaluate the quality of uncertainty provided by different methods, we set $\delta$ to a large value, which is quite hard for existing inference methods as shown in Riquelme et al. [2018], and use a fixed number of contexts for evaluation. A sketch of this reward structure is given below.
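The following sketch mirrors the wheel-bandit reward description above; all numerical values and the quadrant-to-action mapping are hypothetical placeholders, since the exact parameters are not reproduced here.

```python
import numpy as np

# Hypothetical parameter values for the sketch (the actual delta, mu_1, mu_2, mu_3, sigma differ).
DELTA, MU_1, MU_2, MU_3, SIGMA = 0.95, 1.0, 1.2, 50.0, 0.01

def wheel_reward(context, action, rng=np.random):
    """Stochastic reward of the wheel bandit for a 2D context on the unit disk and 5 actions."""
    if action == 0:                                      # action 0: constant-mean reward mu_1
        return rng.normal(MU_1, SIGMA)
    if np.linalg.norm(context) <= DELTA:                 # inner circle: all other actions are suboptimal
        return rng.normal(MU_2, SIGMA)
    # Outer ring: exactly one of actions 1-4 is optimal, determined by the quadrant of the context.
    quadrant = 2 * (context[1] >= 0) + (context[0] < 0)  # maps quadrant to 0..3 (convention assumed)
    optimal_action = 1 + quadrant
    return rng.normal(MU_3 if action == optimal_action else MU_2, SIGMA)
```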
Experiment Setup Following Riquelme et al. [2018], we use a feed-forward network with two hidden layers of 100 units and ReLU activation. We use the same step size and thinning factor for vanilla Langevin dynamics and SRLD, and use the same repulsive parameters on both the mushroom and wheel bandits. The update schedule is similar to Riquelme et al. [2018]: we simply change the optimization step of the stochastic variational inference methods into an MCMC sampler step, and replace the warm-up of the stochastic variational inference methods in Riquelme et al. [2018] with the burn-in phase of the sampler. Similar to other methods in Riquelme et al. [2018], we keep a relatively large initial learning rate for fast burn-in, and the step size for sampling is tuned on the mushroom bandit and kept the same for both the mushroom and wheel bandits. As this is an online posterior inference problem, we only use the most recent samples to give the prediction. Notice that, in the original implementation of Riquelme et al. [2018], the authors only update a few steps with each new observation after observing enough data, as the posterior gradually converges to the true reward distribution and little update is needed after sufficient data has been observed. Similar to their implementation, after observing enough data, we only collect one new sample with each new observation. For SVGD, we use 20 particles to make the comparison fair, and also tune the step size on the mushroom bandit.
Appendix E The Detailed analysis of SRLD
E.1 Some additional notation
We use to denote the vector norm and define the norm of a function as . denote the Total Variation distance between distribution respectively. Also, as is , we denote . For simplicity, we may use as . In the appendix, we also use , where is defined in the main text. For the clearance, we define , and , where , and are defined in main text.
E.2 Geometric Ergodicity of SRLD
Before we start the proofs of the main theorems, we give the following theorem on the geometric ergodicity of SRLD. Notice that under our assumptions, the practical dynamics follows a high-order (order $Mc$) nonlinear autoregressive model in the second phase:
where
Further, if we stack the parameter by and define , we have
In this way, we formulate the stacked process as a time-homogeneous Markov chain. In the following analysis, we only analyze the second phase of SRLD, given some initial stacked particles.
Theorem E.1 (Geometric Ergodicity).
We defer the proof to Appendix E.5.1.
E.3 Moment Bound
Theorem E.2 (Moment Bound).
The proof can be found at Appendix E.5.2.
E.4 Technical Lemma
Definition E.1 (-mixing).
For any two -algebras and , the -mixing coefficient is defined by
Let be a sequence of real random variable defined on . This sequence is -mixing if
where and for . Alternatively, as shown by Theorem 4.4 of Bradley [2007]
Definition E.2 (-mixing).
For any two -algebras and , the -mixing coefficient is defined by
where the supremum is taken over all pairs of finite partitions and of such that and for each , . Let be a sequence of real random variable defined on . This sequence is -mixing if
Proposition E.1 ($\beta$-mixing implies $\alpha$-mixing).
For any two -algebras and ,
This proposition can be found in Equation 1.11 of Bradley [2005].
Proposition E.2.
A (strictly) stationary Markov chain is geometrically ergodic if and only if it is $\beta$-mixing with an at least exponentially fast decay rate.
This proposition is Theorem 3.7 of Bradley [2005].
Lemma E.1 (Regularity Conditions).
By Assumption 4.2, we have and
Lemma E.2 (Properties of RBF Kernel).
For RBF kernel with bandwidth , we have and
Lemma E.3 (Properties of Stein Operator).
For any distribution such that we have
Lemma E.4 (Bounded Lipschitz of Stein Operator).
Given , define . We also denote . We have
E.5 Proof of Main Theorems
E.5.1 Proof of Theorem E.1
E.5.2 Proof of Theorem E.2
Continuous-Time Mean Field Dynamics (5) Notice that since our dynamics has two phases and the first phase can be viewed as a special case of the second phase by setting $\alpha = 0$, here we only analyze the second phase. Define , and thus
Now we bound :
where is by E.3. By the assumption that , we have
By Gronwall’s inequality, we have . (If , then fix and this bound still holds.) Notice that in the first phase, as , we have and thus this inequality also holds.
E.5.3 Proof of Theorem 4.1 and 5.1
E.5.4 Proof of Theorem 4.2
In the later proof we use to represent the quantity
Recall that there are two dynamics: the continuous-time mean field dynamics (5) and its discretized version, the discrete-time mean field dynamics (4). Notice that here we couple the discrete-time mean field dynamics with the continuous-time mean field system using the same initialization. Given any , for any , define . We introduce another continuous-time interpolation dynamics:
Notice that here we couple this interpolation dynamics with the same Brownian motion as that of the dynamics of . By the definition of , at any for some integer , and have the same distribution. Define conditioning on and conditioning on . Following the argument of the proof of Lemma 2 in Dalalyan [2017], if , we have
We bound , and separately.
Bounding and By the smoothness of , we have
And by Lemma E.3 (Lipschitz of Stein Operator), we know that
And by Assumption 4.2 and the fact that the distribution has finite second moment, we have
Combining the two bounds, we have
Notice that . By Itô’s lemma, it implies that
By the assumption that is finite and , is also finite, we have
Thus we conclude that
Bounding
By Lemma E.4 and the Assumption 4.4 that is at most quadratic growth and that has finite second moment, we have
Plug in the above estimation, we have
where the last inequality is due to the relation that .
Overall Bound Combining all the estimates, we have
Similarly, if , we have
Define
and . We conclude that for , for any ,
For , which is a simpler case, we have
We bound the case when ,
If we take sufficiently small, such that , we have
Define and we can choose small enough such that and . Without loss of generality, we also assume and thus we have
Also we assume , otherwise we conclude that . We thus have , where . Suppose that and some algebra (which reduces to Pascal’s triangle) shows that
We conclude that . Notice that Thus for any ,
for sufficiently small . Combine the above two estimations, we have
Notice that now we have , which is a function of . We then bound . Notice that the KL divergence has the following variational representation:
where the is chosen in the set that and exist. And thus we have
And thus . Also the inequality that
holds naturally by definition. We complete the proof.
E.5.5 Proof of Theorem 4.3
The constant is defined as
Now we start the proof. We couple the two processes by using the same Gaussian noise in every iteration and the same initialization. For , and for we have the following inequality,
By the log-concavity, we have
for some positive constant. Also, as the step size is small, the last term on the right side of the inequality is a small term. Thus our main target is to bound the second term, which we decompose as
We bound , and independently.
Bounding
By Holder’s inequality,
We bound the second term on the right side of the inequality. Define
and by the regularity assumption we know that . Define and, since the Stein operator is a linear functional of the distribution, we have
given any . By Theorem E.1, the chain is geometrically ergodic and is thus $\beta$-mixing with an exponentially fast decay rate by Proposition E.2. By Proposition E.1, we know that it is also $\alpha$-mixing with an exponentially fast decay rate. We have the following estimate
for some positive constant that characterizes the decay rate of mixing. Notice that the dependence cancels here because the mixing coefficients decay exponentially fast. Combining these two estimates, we have
Bounding By Holder’s inequality, we have
We bound the second term in the right side of the inequality.
By Lemma E.4, we have
Plugging in the above estimate and using the relation that , we have
Combining all the estimates and using the definition of the Wasserstein distance, we conclude that
Bounding
By Holder’s inequality,
We bound the last term on the right side of the inequality. By assumption and Lemma E.3, we have
Combining the estimates, we have
Overall Bound Combining all the results, we have the following bound: for ,
where
and
Define and , we have
where . By the assumption that , . Now we prove the bound of by induction. We take the hypothesis that and notice that the hypothesis holds for . By the hypothesis, we have
where the second-to-last inequality holds by . Thus we complete the induction argument and we have, for any ,
And it implies that
E.6 Proof of Technical Lemmas
E.6.1 Proof of Lemma E.1
For the first part:
For the second part:
E.6.2 Proof of Lemma E.2
It is obvious that .
And
E.6.3 Proof of Lemma E.3
For any distribution such that ,
For proving the second result, we notice that
E.6.4 Proof of Lemma E.4
Given any ,
For the first term on the right side of the inequality,
To bound the second term, by the symmetry of the coordinates, we have
This finishes the first part of the lemma.