A Convergence Theory for SVGD in the Population Limit
under Talagrand’s Inequality T1
Abstract
Stein Variational Gradient Descent (SVGD) is an algorithm for sampling from a target density which is known up to a multiplicative constant. Although SVGD is a popular algorithm in practice, its theoretical study is limited to a few recent works. We study the convergence of SVGD in the population limit (i.e., with an infinite number of particles) to sample from a non-logconcave target distribution satisfying Talagrand’s inequality T1. We first establish the convergence of the algorithm. Then, we establish a dimension-dependent complexity bound in terms of the Kernelized Stein Discrepancy (KSD). Unlike existing works, we do not assume that the KSD is bounded along the trajectory of the algorithm. Our approach relies on interpreting SVGD as a gradient descent over a space of probability measures.
1 Introduction
Sampling from a given target distribution is a fundamental task of many Machine Learning procedures. In Bayesian Machine Learning, the target distribution is typically known up to a multiplicative factor and often takes the form
(1)  π(x) ∝ exp(−F(x)),
where F is an L-smooth nonconvex function defined on ℝ^d and satisfying ∫ exp(−F(x)) dx < ∞.
As sampling algorithms are intended to be applied to large scale problems, it has become increasingly important to understand their theoretical properties, such as their complexity, as a function of the dimension d of the problem and the desired accuracy ε. In this regard, most of the Machine Learning literature on sampling has concentrated on understanding the complexity (in terms of d and ε) of (variants of) the Langevin algorithm, see Durmus et al. (2019); Bernton (2018); Wibisono (2018); Cheng et al. (2018); Salim & Richtárik (2020); Hsieh et al. (2018); Dalalyan (2017); Durmus & Moulines (2017); Rolland et al. (2020); Vempala & Wibisono (2019); Zou et al. (2019); Şimşekli (2017); Shen & Lee (2019); Bubeck et al. (2018); Durmus et al. (2018); Ma et al. (2019); Foster et al. (2021); Li et al. (2021).
1.1 Stein Variational Gradient Descent (SVGD)
Stein Variational Gradient Descent (SVGD) (Liu & Wang, 2016; Liu, 2017) is an alternative to the Langevin algorithm that has been applied in several contexts in Machine Learning, including Reinforcement Learning (Liu et al., 2017), sequential decision making (Zhang et al., 2018, 2019), Generative Adversarial Networks (Tao et al., 2019), Variational Auto Encoders (Pu et al., 2017), and Federated Learning (Kassab & Simeone, 2020).
The literature on theoretical properties of SVGD is scarce compared to that of the Langevin algorithm, and limited to a few recent works (Korba et al., 2020; Lu et al., 2019; Duncan et al., 2019; Liu, 2017; Chewi et al., 2020; Gorham et al., 2020; Nüsken & Renger, 2021; Shi et al., 2021). In this paper, our goal is to provide a clean convergence theory for SVGD in the population limit, i.e., with an infinite number of particles.
1.2 Related works
The Machine Learning literature on the complexity of sampling from a non-logconcave target distribution has mainly focused on the Langevin algorithm. For instance, Vempala & Wibisono (2019) showed that the Langevin algorithm reaches accuracy ε in terms of the Kullback–Leibler divergence with an iteration complexity governed by the constant of the logarithmic-Sobolev inequality (LSI) assumed on the target distribution. (The example of Vempala & Wibisono (2019) is taken only for illustration purposes; many other results were obtained for the Langevin algorithm, even in nonconvex cases, see above.) In this work, we will assume Talagrand’s inequality T1 with constant λ, which is milder than LSI with constant λ, and we will prove a complexity result in terms of another discrepancy called the Kernelized Stein Discrepancy (KSD). Besides, a very recent work studies the Langevin algorithm for a non-logconcave target distribution without assuming LSI and provides guarantees in terms of the Fisher information (Balasubramanian et al., 2022).
Most existing results on SVGD deal with the continuous-time approximation of SVGD in the population limit, a Partial Differential Equation (PDE) representing SVGD with a vanishing step size and an infinite number of particles (Lu et al., 2019; Duncan et al., 2019; Liu, 2017; Nüsken & Renger, 2021; Chewi et al., 2020). In particular, Duncan et al. (2019) propose a Stein logarithmic Sobolev inequality that implies the linear convergence of this PDE. However, it is not yet understood when the Stein logarithmic Sobolev inequality holds. Besides, Chewi et al. (2020) showed that the Wasserstein gradient flow of the chi-squared divergence can be seen as an approximation of that PDE, and showed linear convergence of this gradient flow under the Poincaré inequality. Other results, such as those of Lu et al. (2019); Liu (2017); Nüsken & Renger (2021), include asymptotic convergence properties of the PDE, but no convergence rates. In this paper, we will prove convergence rates for SVGD in discrete time.
1.2.1 Comparison to Korba et al. (2020)
The closest work to ours is Korba et al. (2020). To our knowledge, Korba et al. (2020) showed the first complexity result for SVGD in discrete time. This result is proven in the population limit and in terms of the Kernelized Stein Discrepancy (KSD), similarly to our main complexity result.
However, their complexity result relies on the assumption that the KSD is uniformly bounded along the iterations of SVGD, an assumption that cannot be checked prior to running the algorithm. Moreover, their complexity bound does not express the dependence on the dimension explicitly, because the uniform bound on the KSD appears in the complexity bound itself. On the contrary, one of our contributions is to present a dimension-dependent complexity result under verifiable assumptions.
Besides, Korba et al. (2020) provide a bound on the distance between SVGD in the finite number of particles regime and SVGD in the population limit. This bound cannot be used to study the complexity or convergence rate of SVGD in the finite number of particles regime, see Korba et al. (2020, Proposition 7).
1.3 Contributions
We consider SVGD in the population limit, similarly to concurrent works such as Liu (2017); Korba et al. (2020); Gorham et al. (2020). Our paper intends to provide a clean analysis of SVGD, a problem stated in Liu (2017, Conclusion). To this end, we do not make any assumption on the trajectory of the algorithm. Instead, our key assumption is that the target distribution satisfies T1, the mildest of Talagrand’s inequalities, which holds under a mild assumption on the tails of the distribution; see Villani (2008, Theorem 22.10). Moreover, T1 is implied, for example, by the logarithmic Sobolev inequality (Villani, 2008, Theorem 22.17), with the same constant λ.
Although sampling algorithms are meant to be applied to high-dimensional problems, the dependence of the complexity of SVGD on the dimension d has not been studied in concurrent works; nor, to our knowledge, has the generic weak convergence of SVGD under verifiable assumptions. Assuming that the T1 inequality holds, we provide
• a generic weak convergence result for SVGD (in fact, our result is slightly stronger: convergence holds in the 1-Wasserstein distance),
• a complexity bound for SVGD in terms of the dimension d and the desired accuracy ε, under verifiable assumptions (i.e., assumptions that do not depend on the trajectory of the algorithm): a number of iterations explicit in d, ε, the smoothness constant L of F, and the constant λ of the T1 inequality suffices to obtain a sample with squared KSD at most ε.
Note that these results hold without assuming F to be convex. In particular, in the population limit, SVGD applied to non-logconcave target distributions satisfying T1 converges to the target distribution.
1.4 Paper structure
The remainder of the paper is organized as follows. In Section 2 we introduce the necessary mathematical and notational background on optimal transport, reproducing kernel Hilbert spaces and SVGD in order to be able to describe and explain our results. Section 3 is devoted to the development of our theory. Finally, in Section 4 we formulate three corollaries of our key result, capturing weak convergence and complexity estimates for SVGD. Technical proofs are postponed to the Appendix.
2 Background and Notation
2.1 Notation
For any Hilbert space H, we denote by ⟨·, ·⟩_H the inner product of H and by ∥·∥_H its norm.
We denote by C₀(ℝ^d) the set of continuous functions from ℝ^d to ℝ vanishing at infinity and by C¹(ℝ^d, Y) the set of continuously differentiable functions from ℝ^d to a Hilbert space Y. Given f ∈ C¹(ℝ^d, ℝ), its gradient is denoted by ∇f, and if g ∈ C¹(ℝ^d, ℝ^d), the Jacobian of g is denoted by Jg. For every x ∈ ℝ^d, Jg(x) can be seen as a d × d matrix. The trace of the Jacobian, also called the divergence, is denoted by div(g) := Tr(Jg).
For any d × d matrix M, ∥M∥_HS denotes the Hilbert–Schmidt norm of M and ∥M∥_op the operator norm of M viewed as a linear operator (where ℝ^d is endowed with the standard Euclidean inner product). Finally, δ_x is the Dirac measure at x ∈ ℝ^d.
2.2 Optimal transport
We denote by P₁(ℝ^d) the set of Borel probability measures μ over ℝ^d with finite first moment: ∫ ∥x∥ dμ(x) < ∞. Given μ ∈ P₁(ℝ^d), we denote by L¹(μ) the set of measurable maps f : ℝ^d → ℝ^d such that ∫ ∥f(x)∥ dμ(x) < ∞. Note that the identity map I of ℝ^d satisfies I ∈ L¹(μ) if μ ∈ P₁(ℝ^d). Moreover, denoting by T#μ the image (or pushforward) measure of μ by a map T, we have that if μ ∈ P₁(ℝ^d) and T ∈ L¹(μ), then T#μ ∈ P₁(ℝ^d), using the transfer lemma.
For every μ, ν ∈ P₁(ℝ^d), the 1-Wasserstein distance between μ and ν is defined by
(2)  W₁(μ, ν) = inf_{s ∈ Γ(μ,ν)} ∫ ∥x − y∥ ds(x, y),
where Γ(μ, ν) is the set of couplings between μ and ν, i.e., the set of nonnegative measures s over ℝ^d × ℝ^d such that P#s = μ (resp. Q#s = ν), where P (resp. Q) denotes the projection onto the first (resp. the second) component. The 1-Wasserstein distance is a metric over P₁(ℝ^d); the metric space (P₁(ℝ^d), W₁) is called the Wasserstein space.
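For intuition (an illustration, not part of the paper's analysis): on the real line, the 1-Wasserstein distance between two equal-size empirical measures has a closed form, since the optimal coupling matches sorted samples (monotone rearrangement). A minimal Python sketch with hypothetical Gaussian samples:

```python
import numpy as np

def w1_empirical_1d(xs, ys):
    """W1 between two equal-size empirical measures on the real line:
    the optimal coupling pairs the sorted samples."""
    return np.mean(np.abs(np.sort(xs) - np.sort(ys)))

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 20000)
b = rng.normal(2.0, 1.0, 20000)
# W1(N(0,1), N(2,1)) = 2: the optimal transport is a pure translation by 2.
print(w1_empirical_1d(a, b))
```

In higher dimensions no such sorting shortcut exists and W₁ must be estimated by solving a linear program over couplings.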
In this paper, we consider a target probability distribution π ∝ exp(−F), where the potential F satisfies the following.
Assumption 2.1.
The Hessian ∇²F of F is well defined and satisfies ∥∇²F(x)∥_op ≤ L for every x ∈ ℝ^d, i.e., F is L-smooth.
Moreover, under this assumption, F admits a stationary point.
Proposition 2.2.
Under Assumption 2.1, there exists x* ∈ ℝ^d for which ∇F(x*) = 0, i.e., F admits a stationary point.
To make explicit the dependence of our complexity bounds on the dimension, we will initialize the algorithm from a Gaussian distribution centered at a stationary point. Such a stationary point can be found by gradient descent on F, for instance.
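This initialization step can be sketched as follows; the potential F below is a hypothetical smooth nonconvex example, not one from the paper:

```python
import numpy as np

def gradient_descent(grad_F, x0, step=0.1, iters=5000):
    """Plain gradient descent on F; returns an approximate stationary point."""
    x = x0.astype(float)
    for _ in range(iters):
        x = x - step * grad_F(x)
    return x

# Hypothetical smooth nonconvex potential F(x) = ||x||^2 / 2 + cos(2 x_1),
# whose gradient is x - 2 sin(2 x_1) e_1.  F is L-smooth with L <= 5.
def grad_F(x):
    g = x.copy()
    g[0] -= 2.0 * np.sin(2.0 * x[0])
    return g

x_star = gradient_descent(grad_F, np.ones(3))
assert np.linalg.norm(grad_F(x_star)) < 1e-6   # an approximate stationary point
```

With the step size below 2/L, gradient descent on a smooth function converges to a stationary point even without convexity, which is all the initialization requires.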
The task of sampling from π can be formalized as an optimization problem. Indeed, define the Kullback–Leibler (KL) divergence as
(3)  KL(μ | π) := ∫ log( (dμ/dπ)(x) ) dμ(x)
if μ admits the density dμ/dπ with respect to π, and KL(μ | π) := +∞ else. Then KL(μ | π) ≥ 0, and KL(μ | π) = 0 if and only if μ = π. Therefore, assuming π ∈ P₁(ℝ^d), the optimization problem
(4)  min_{μ ∈ P₁(ℝ^d)} KL(μ | π)
admits a unique solution: the distribution π. We will see in Section 3 that SVGD can be seen as a gradient descent algorithm to solve (4).
Indeed, the Wasserstein space can be endowed with a differential structure. In particular, when it is well defined, the Wasserstein gradient of the functional KL(· | π) at μ, denoted ∇_W KL(μ | π), is an element of L²(μ) and satisfies ∇_W KL(μ | π) = ∇ log(dμ/dπ).
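For a concrete feel of the objective in (3)-(4), the KL divergence between one-dimensional Gaussians is available in closed form; a small sketch (the Gaussian parameters are hypothetical) illustrating that KL is nonnegative and vanishes exactly at the target:

```python
import numpy as np

def kl_gauss_1d(m1, s1, m2, s2):
    """Closed-form KL(N(m1, s1^2) || N(m2, s2^2))."""
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

assert kl_gauss_1d(0, 1, 0, 1) == 0.0          # KL = 0 iff the two measures coincide
assert kl_gauss_1d(1.0, 2.0, 0.0, 1.0) > 0.0   # KL >= 0 in general
```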
2.2.1 Functional inequalities
The analysis of sampling algorithms in the case where F is nonconvex often goes through functional inequalities.
Definition 2.3 (Logarithmic Sobolev Inequality (LSI)).
The distribution π satisfies the Logarithmic Sobolev Inequality with constant λ > 0 if, for all μ with KL(μ | π) < ∞,
KL(μ | π) ≤ (λ/2) I(μ | π),
where I(μ | π) := ∫ ∥∇ log(dμ/dπ)(x)∥² dμ(x) denotes the relative Fisher information.
LSI is a popular assumption in the analysis of the Langevin algorithm in the case when F is not convex, see e.g. Vempala & Wibisono (2019).
Definition 2.4 (Talagrand’s Inequality T_p).
Let p ≥ 1. The distribution π satisfies Talagrand’s inequality T_p with constant λ > 0 if, for all μ ∈ P_p(ℝ^d), we have W_p(μ, π) ≤ √(2λ KL(μ | π)).
We now claim that T1 is milder than LSI. Indeed, since W₁ ≤ W₂, T2 implies T1 with the same constant λ. Moreover, using Villani (2008, Theorem 22.17), LSI implies T2 with the same constant λ. In conclusion, LSI ⇒ T2 ⇒ T1, each time with the same constant λ.
Besides, if F is α-strongly convex, then π satisfies LSI with constant λ = 1/α. A bounded perturbation of F in the latter case would also satisfy LSI with a constant independent of the dimension (Villani, 2008, Remark 21.5).
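The chain LSI ⇒ T2 ⇒ T1 can be checked numerically in the Gaussian case, where both KL and W₂ have closed forms. The sketch below verifies T2 with constant λ = σ² for a hypothetical target N(0, σ²) over a grid of Gaussian measures μ = N(m, s²); since W₁ ≤ W₂, the same bound then controls W₁:

```python
import numpy as np

def kl_gauss(m, s, sigma):
    # KL(N(m, s^2) || N(0, sigma^2)), closed form
    return np.log(sigma / s) + (s**2 + m**2) / (2 * sigma**2) - 0.5

def w2_gauss(m, s, sigma):
    # W2(N(m, s^2), N(0, sigma^2)) -- closed form for 1D Gaussians
    return np.sqrt(m**2 + (s - sigma)**2)

sigma = 1.5            # hypothetical target N(0, sigma^2)
lam = sigma**2         # its T2 constant (strong convexity alpha = 1/sigma^2)
for m in np.linspace(-3, 3, 13):
    for s in np.linspace(0.2, 4.0, 13):
        # T2: W2(mu, pi) <= sqrt(2 * lam * KL(mu | pi))
        assert w2_gauss(m, s, sigma) <= np.sqrt(2 * lam * kl_gauss(m, s, sigma)) + 1e-12
```

The inequality reduces to log u ≤ u − 1 for u = s/σ, with equality exactly when s = σ, so the grid check passes with equality on the slice s = σ.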
Finally, to get the exponential convergence of SVGD in continuous time, another inequality called Stein-LSI was proposed in Duncan et al. (2019). Stein-LSI is an assumption on both the kernel and the target distribution, and it implies LSI. Obtaining reasonable sufficient conditions for Stein-LSI to hold is an open problem, but there are simple cases where it cannot hold (Duncan et al., 2019, Lemma 36). In particular, Stein-LSI never holds under the assumptions that we will make in this paper to study SVGD in discrete time, see Korba et al. (2020, Section 11.3).
Our key assumption on π is that it satisfies Talagrand’s inequality T1 (Villani, 2008, Definition 22.1).
Assumption 2.5.
The target distribution π satisfies T1 with some constant λ > 0.
We will use Assumption 2.5 to recursively control the KSD by the KL divergence along the iterations of the algorithm.
2.3 Reproducing Kernel Hilbert Space
We consider a kernel k : ℝ^d × ℝ^d → ℝ associated to a Reproducing Kernel Hilbert Space (RKHS) denoted by H. We denote by φ the feature map φ(x) = k(x, ·). The product space H × ⋯ × H (d times) is also a Hilbert space, denoted H^d. We make the following assumption on the kernel k.
Assumption 2.6.
There exists B > 0 such that the inequalities
∥k(x, ·)∥_H ≤ B
and
∥∇_x k(x, ·)∥_{H^d} ≤ B
hold for all x ∈ ℝ^d. Moreover, k is continuous.
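Boundedness assumptions of this kind hold, for instance, for the RBF kernel k(x, y) = exp(−∥x − y∥²/(2h²)): the reproducing property gives ∥k(x, ·)∥²_H = k(x, x) = 1 and ∥∇_x k(x, ·)∥²_{H^d} = Σ_i ∂²k/∂x_i∂y_i at y = x, which equals d/h². A numerical check (the bandwidth h and the point x are hypothetical choices):

```python
import numpy as np

d, h = 3, 0.8   # hypothetical dimension and bandwidth

def k(x, y):
    """RBF kernel; its RKHS norms are available in closed form."""
    return np.exp(-np.sum((x - y)**2) / (2 * h**2))

# Reproducing property: ||k(x,.)||_H^2 = k(x, x), and for a smooth kernel
# ||grad_x k(x,.)||_{H^d}^2 = sum_i d^2 k / dx_i dy_i evaluated at y = x.
x = np.array([0.3, -1.2, 0.5])
assert np.isclose(k(x, x), 1.0)                 # hence ||k(x,.)||_H = 1

eps, trace = 1e-4, 0.0
for i in range(d):
    e = np.zeros(d); e[i] = eps
    # central finite difference for the mixed derivative d^2 k / dx_i dy_i at y = x
    trace += (k(x + e, x + e) - k(x + e, x - e)
              - k(x - e, x + e) + k(x - e, x - e)) / (4 * eps**2)
assert np.isclose(trace, d / h**2, rtol=1e-3)   # hence ||grad_x k(x,.)|| = sqrt(d)/h
```

So for the RBF kernel one may take B = max(1, √d/h); note that the gradient bound grows with the dimension for a fixed bandwidth.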
2.4 Stein Variational Gradient Descent
2.4.1 The population limit
Stein Variational Gradient Descent (SVGD) is an algorithm to sample from π. SVGD proceeds by maintaining a set of N particles over ℝ^d, whose empirical distribution at iteration n aims to approximate π as n → ∞, see Liu & Wang (2016). The SVGD algorithm is presented as Algorithm 1.
Denoting by μ̂_n the empirical distribution of the particles (x_1^n, …, x_N^n), i.e.,
μ̂_n = (1/N) Σ_{i=1}^N δ_{x_i^n},
the SVGD update can be written
x_i^{n+1} = x_i^n + γ h_{μ̂_n}(x_i^n),  i = 1, …, N.
Therefore, SVGD performs the update
μ̂_{n+1} = (I + γ h_{μ̂_n})#μ̂_n
at the level of measures. We call population limit the regime where, formally, N = ∞. Mathematically, this corresponds to the assumption that μ_0 (which can intuitively be seen as the limit of μ̂_0 as N → ∞) has a density which belongs to P₁(ℝ^d). In this case, we shall see in our analysis that μ_n has a density for every n. To summarize, in the population limit, SVGD performs the same update:
(8)  μ_{n+1} = (I + γ h_{μ_n})#μ_n,
where
h_μ(·) = −E_{x∼μ}[k(x, ·)∇F(x) − ∇_x k(x, ·)],
or equivalently
h_μ(·) = E_{x∼μ}[k(x, ·)∇ log π(x) + ∇_x k(x, ·)],
and where μ_0 has a density.
Finally, note that the SVGD algorithm was originally derived in Liu & Wang (2016) from its population limit: the authors first introduced the SVGD update in the population limit, and then obtained the SVGD algorithm (Algorithm 1) from the population limit by approximating the expectations by empirical means.
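A minimal sketch of the finite-particle update with an RBF kernel, on a hypothetical one-dimensional Gaussian target (the bandwidth, step size, and particle count are arbitrary choices, not from the paper):

```python
import numpy as np

def svgd_step(X, grad_log_pi, h=0.5, gamma=0.1):
    """One SVGD update with an RBF kernel: each particle is driven by a
    kernel-weighted average of the scores plus a repulsive term."""
    diff = X[:, None, :] - X[None, :, :]                   # diff[i, j] = x_i - x_j
    K = np.exp(-np.sum(diff**2, axis=-1) / (2 * h**2))     # k(x_i, x_j), symmetric
    drive = K @ grad_log_pi(X)                             # sum_j k(x_j, x_i) grad log pi(x_j)
    repulse = np.sum(K[:, :, None] * diff, axis=1) / h**2  # sum_j grad_{x_j} k(x_j, x_i)
    return X + gamma * (drive + repulse) / X.shape[0]

# Hypothetical toy target: standard 1D Gaussian, so grad log pi(x) = -x.
rng = np.random.default_rng(1)
X = rng.normal(3.0, 0.5, size=(200, 1))                    # badly initialized particles
for _ in range(1000):
    X = svgd_step(X, lambda x: -x)
print(X.mean(), X.std())                                   # should drift toward 0 and 1
```

The repulsive term, coming from the kernel gradient, is what prevents the particles from collapsing onto the mode of π.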
Our point of view on SVGD in the population limit.
We now provide the intuition behind our results on SVGD.
In the population limit, SVGD can be seen as a Riemannian gradient descent, for the following two reasons.
First, in a Riemannian interpretation of the Wasserstein space (Villani, 2008), for every v ∈ L²(μ), the map μ ↦ (I + v)#μ can be seen as the exponential map at μ. In the population limit, SVGD (8) is thus obtained by applying the exponential map at μ_n to the tangent vector γ h_{μ_n}.
Second, h_μ can be seen as the negative gradient of the KL divergence at μ under a certain metric. Indeed, using integration by parts, ⟨h_μ, f⟩_{H^d} = −∫ ⟨∇ log(dμ/dπ)(x), f(x)⟩ dμ(x) for every f ∈ H^d, see e.g. Korba et al. (2020); Duncan et al. (2019). Therefore, h_μ can be seen as the negative Wasserstein gradient of KL(· | π) at μ under the inner product of H^d.
The Kernelized Stein Discrepancy (KSD) is a natural discrepancy between probability measures that was introduced prior to SVGD (Liu et al., 2016; Chwialkowski et al., 2016). If the RKHS is rich enough (Liu et al., 2016; Chwialkowski et al., 2016; Oates et al., 2019), an assumption that we shall always make in this paper, then KSD(μ | π) = 0 if and only if μ = π.
The KSD is intimately related to SVGD, and it naturally appears in the original derivation of SVGD (Liu & Wang, 2016). The KSD is defined as the square root of the Stein Fisher information (Duncan et al., 2019):
(9)  KSD(μ | π) := ∥h_μ∥_{H^d}.
In this paper, we study the complexity of SVGD in terms of the KSD. To better understand the topology induced by the KSD and to compare it to common topologies on the space of probability measures, we refer to Gorham & Mackey (2017).
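The squared KSD admits a simple V-statistic estimator from samples, obtained by applying the Stein operator to the kernel in both arguments (Liu et al., 2016). A sketch for the RBF kernel on a hypothetical Gaussian target; the signs of the derivative terms follow the closed-form gradients of the RBF kernel:

```python
import numpy as np

def ksd_squared(X, score, h=1.0):
    """V-statistic estimate of KSD^2 with an RBF kernel, using the Stein
    kernel of Liu et al. (2016); `score` returns grad log pi."""
    N, d = X.shape
    diff = X[:, None, :] - X[None, :, :]                  # diff[i, j] = x_i - x_j
    sq = np.sum(diff**2, axis=-1)
    K = np.exp(-sq / (2 * h**2))
    S = score(X)
    t1 = (S @ S.T) * K                                    # s(x)^T s(y) k(x, y)
    t2 = np.einsum('id,ijd->ij', S, diff) / h**2 * K      # s(x)^T grad_y k(x, y)
    t3 = np.einsum('jd,ijd->ij', S, -diff) / h**2 * K     # s(y)^T grad_x k(x, y)
    t4 = (d / h**2 - sq / h**4) * K                       # trace(grad_x grad_y k)
    return np.mean(t1 + t2 + t3 + t4)

rng = np.random.default_rng(2)
score = lambda x: -x                                      # standard Gaussian target
good = rng.normal(0.0, 1.0, (500, 2))                     # samples from the target
bad = rng.normal(2.0, 1.0, (500, 2))                      # samples from a shifted law
assert ksd_squared(good, score) < ksd_squared(bad, score)
```

The V-statistic keeps the diagonal pairs and is therefore nonnegative but slightly biased; dropping the diagonal yields the unbiased U-statistic variant.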
3 Analysis of SVGD
In this section, we analyze SVGD in the infinite number of particles regime. Recall that in this regime, SVGD is given by (8): μ_{n+1} = (I + γ h_{μ_n})#μ_n, where h_μ(·) = E_{x∼μ}[k(x, ·)∇ log π(x) + ∇_x k(x, ·)].
3.1 A fundamental inequality
We start by stating a fundamental inequality satisfied by the KL objective for any update of the form
(10)  μ_{n+1} = (I + γ g)#μ_n,
where g ∈ H^d.
Proposition 3.1.
Inequality (11) is a property of the KL functional, not of the SVGD algorithm. It plays the role of a Taylor inequality for the KL functional, whose Wasserstein gradient at μ is taken under the metric induced by H^d. Proposition 3.1 is a slight generalization of Korba et al. (2020, Proposition 5) and is not our main contribution; we therefore only sketch its proof in the Appendix.
3.2 Main result
Applying the Taylor inequality of Proposition 3.1 recursively along the SVGD iterates, we obtain the following descent property for SVGD, which is our main theoretical result. The proof of this result can be found in the Appendix.
Theorem 3.2 (Descent lemma).
If the step size is small enough, then, using Theorem 3.2, (KL(μ_n | π))_n is nonincreasing and μ_n has a density w.r.t. the Lebesgue measure for every n (since μ_0 does).
4 Convergence and Complexity
4.1 Convergence
We now show that Theorem 3.2 implies weak convergence and convergence in the 1-Wasserstein distance.
Corollary 4.1 (Weak convergence).
Proof.
Using Theorem 3.2 and iterating, (KL(μ_n | π))_n is nonincreasing, hence uniformly bounded, and the series Σ_n KSD²(μ_n | π) converges. Consequently, KSD(μ_n | π) → 0, i.e., ∥h_{μ_n}∥_{H^d} → 0.
Moreover, using Assumption 2.5 and (5), for every n, W₁(μ_n, π) is uniformly bounded. Therefore, using Dupuis & Ellis (2011, Lemma 1.4.3), (μ_n)_n is both tight and uniformly integrable. Consider a subsequence (μ_{ψ(n)})_n of (μ_n)_n converging weakly to some μ*. We shall prove that μ* = π.
First, using Assumptions 2.1 and 2.6, the integrand defining h is continuous. Moreover, as a subsequence, (μ_{ψ(n)})_n is also uniformly integrable and also converges weakly to μ*. Therefore, using Villani (2003, Theorem 7.12), (μ_{ψ(n)})_n converges to μ* in W₁, and h_{μ_{ψ(n)}} converges to h_{μ*} in H^d. Taking the norm, KSD(μ_{ψ(n)} | π) → KSD(μ* | π) along the subsequence. Recalling that KSD(μ_n | π) → 0, we obtain KSD(μ* | π) = 0, which implies μ* = π.
In conclusion, μ_n → π weakly. Moreover, the convergence also holds in W₁ because (μ_n)_n is uniformly integrable, see Villani (2003, Theorem 7.12). ∎
In summary, under T1 and some smoothness assumptions but without convexity of the potential, SVGD in the population limit converges to the target distribution.
One may be surprised to see that SVGD converges without any convexity assumption on F, but this is natural in view of the gradient descent interpretation of SVGD. Indeed, SVGD in the population limit is a gradient descent on the KL divergence, which is
• "smooth" if we restrict the descent directions to an RKHS (i.e., it satisfies the Taylor inequality of Proposition 3.1),
• coercive, i.e., its sublevel sets are tight (Dupuis & Ellis, 2011, Lemma 1.4.3),
• and has a single stationary point, which is its global minimizer (the KSD is the norm of the gradient of the KL in our interpretation, and the KSD is equal to zero only at the optimum).
One can show that, over ℝ^d, gradient descent applied to a smooth, coercive function with a single stationary point converges to the global minimizer. The situation here is similar.
4.2 Complexity
Next, we provide a convergence rate, in terms of the squared KSD, for the uniform mixture of the iterates. This result is obtained from our descent lemma (Theorem 3.2).
Corollary 4.2 (Convergence rate).
Note that this convergence rate is given in terms of the uniform mixture of the iterates μ_1, …, μ_n. Similar mixtures appear in the analysis of the Langevin algorithm (see e.g. Durmus et al. (2019)). Note also that the convergence rate in Corollary 4.2 is similar to the convergence rate of the squared gradient norm for gradient descent applied to a smooth function (Nesterov, 2013).
Proof.
Using Theorem 3.2, and by iterating, we get
Rearranging the terms, and using the convexity of the squared norm,
∎
From the last result, we can characterize the iteration complexity of SVGD.
Corollary 4.3 (Complexity).
To our knowledge, Corollary 4.3 provides the first dimension-dependent complexity result for SVGD. Its proof can be found in the Appendix. The dependence of the T1 constant λ on the dimension is the subject of active research in optimal transport theory (Villani, 2008, Remark 22.11) and is out of the scope of this paper. Yet, using Villani (2008, Theorem 22.10, Equation 22.16), λ can be taken as
Note that the output of the algorithm is a mixture of the iterates μ_1, …, μ_n. Besides, optimizing the complexity over the step size γ leads to involved calculations that do not change the overall complexity. To see this, note that the larger the step size γ, the smaller the complexity; but even with the largest admissible step size, the overall complexity would be the same.
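Concretely, sampling from the uniform mixture of the iterates amounts to first drawing an iterate index uniformly at random and then drawing from that iterate; a minimal sketch where each iterate is represented by a hypothetical particle cloud:

```python
import numpy as np

def sample_from_mixture(iterates, rng):
    """Draw one sample from the uniform mixture (1/n) sum_k mu_k, each
    iterate mu_k being stored as an (N, d) array of particles."""
    X = iterates[rng.integers(len(iterates))]  # pick an iterate uniformly at random
    return X[rng.integers(X.shape[0])]         # then a particle of that iterate

rng = np.random.default_rng(3)
# Hypothetical iterates: three particle clouds in dimension d = 2.
iterates = [rng.normal(0.0, 1.0, (100, 2)) for _ in range(3)]
sample = sample_from_mixture(iterates, rng)
assert sample.shape == (2,)
```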
5 Conclusion
We proved that, under the T1 inequality and some smoothness assumptions on the kernel and on the potential of the target distribution, but without convexity, SVGD in the population limit converges weakly and in 1-Wasserstein distance to the target distribution. Moreover, we showed that SVGD reaches accuracy ε in terms of the squared Kernelized Stein Discrepancy after a number of iterations that is explicit in ε and in the dimension.
A possible extension of our work is to study SVGD under functional inequalities other than T1, such as (Bolley & Villani, 2005, Corollary 2.6 (i)) (which is weaker than T1) or the Poincaré inequality (in the form of Villani (2008, Theorem 22.25 (iii))). We claim that our approach can be extended to study SVGD in these settings.
Finally, an important and difficult open problem in the analysis of SVGD is to characterize its complexity with a finite number of particles, i.e., with discrete measures. In this regime, we lose the interpretation of SVGD as a gradient descent in the space of probability measures, because the KL divergence w.r.t. the target distribution is infinite. However, we believe that our clean analysis in the population limit is a first step towards this open problem.
References
- Balasubramanian et al. (2022) Balasubramanian, K., Chewi, S., Erdogdu, M. A., Salim, A., and Zhang, M. Towards a theory of non-log-concave sampling: first-order stationarity guarantees for Langevin Monte Carlo. arXiv preprint arXiv:2202.05214, 2022.
- Bernton (2018) Bernton, E. Langevin Monte Carlo and JKO splitting. In Conference on Learning Theory (COLT), pp. 1777–1798, 2018.
- Bolley & Villani (2005) Bolley, F. and Villani, C. Weighted Csiszár–Kullback–Pinsker inequalities and applications to transportation inequalities. In Annales de la Faculté des sciences de Toulouse: Mathématiques, volume 14, pp. 331–352, 2005.
- Bubeck et al. (2018) Bubeck, S., Eldan, R., and Lehec, J. Sampling from a log-concave distribution with projected Langevin Monte Carlo. Discrete & Computational Geometry, 59(4):757–783, 2018.
- Carmeli et al. (2010) Carmeli, C., De Vito, E., Toigo, A., and Umanità, V. Vector valued reproducing kernel Hilbert spaces and universality. Analysis and Applications, 8(01):19–61, 2010.
- Cheng et al. (2018) Cheng, X., Chatterji, N. S., Bartlett, P. L., and Jordan, M. I. Underdamped Langevin MCMC: A non-asymptotic analysis. In Conference on Learning Theory (COLT), pp. 300–323, 2018.
- Chewi et al. (2020) Chewi, S., Gouic, T. L., Lu, C., Maunu, T., and Rigollet, P. SVGD as a kernelized Wasserstein gradient flow of the chi-squared divergence. In Advances in Neural Information Processing Systems (NeurIPS), pp. 2098–2109, 2020.
- Chwialkowski et al. (2016) Chwialkowski, K., Strathmann, H., and Gretton, A. A kernel test of goodness of fit. In International Conference on Machine Learning (ICML), pp. 2606–2615, 2016.
- Dalalyan (2017) Dalalyan, A. S. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017.
- Duncan et al. (2019) Duncan, A., Nüsken, N., and Szpruch, L. On the geometry of Stein variational gradient descent. arXiv preprint arXiv:1912.00894, 2019.
- Dupuis & Ellis (2011) Dupuis, P. and Ellis, R. S. A weak convergence approach to the theory of large deviations, volume 902. John Wiley & Sons, 2011.
- Durmus & Moulines (2017) Durmus, A. and Moulines, E. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. The Annals of Applied Probability, 27(3):1551–1587, 2017.
- Durmus et al. (2018) Durmus, A., Moulines, E., and Pereyra, M. Efficient Bayesian computation by proximal Markov Chain Monte Carlo: when Langevin meets Moreau. SIAM Journal on Imaging Sciences, 11(1):473–506, 2018.
- Durmus et al. (2019) Durmus, A., Majewski, S., and Miasojedow, B. Analysis of Langevin Monte Carlo via convex optimization. Journal of Machine Learning Research, 20(73):1–46, 2019.
- Foster et al. (2021) Foster, J., Lyons, T., and Oberhauser, H. The shifted ODE method for underdamped Langevin MCMC. arXiv preprint arXiv:2101.03446, 2021.
- Gorham & Mackey (2017) Gorham, J. and Mackey, L. Measuring sample quality with kernels. In International Conference on Machine Learning (ICML), pp. 1292–1301, 2017.
- Gorham et al. (2020) Gorham, J., Raj, A., and Mackey, L. Stochastic Stein discrepancies. Advances in Neural Information Processing Systems (NeurIPS), 33:17931–17942, 2020.
- Hsieh et al. (2018) Hsieh, Y.-P., Kavis, A., Rolland, P., and Cevher, V. Mirrored Langevin dynamics. In Advances in Neural Information Processing Systems (NeurIPS), pp. 2878–2887, 2018.
- Kassab & Simeone (2020) Kassab, R. and Simeone, O. Federated generalized Bayesian learning via distributed Stein variational gradient descent. arXiv preprint arXiv:2009.06419, 2020.
- Korba et al. (2020) Korba, A., Salim, A., Arbel, M., Luise, G., and Gretton, A. A non-asymptotic analysis for Stein variational gradient descent. In Advances in Neural Information Processing Systems (NeurIPS), pp. 4672–4682, 2020.
- Li et al. (2021) Li, R., Zha, H., and Tao, M. Sqrt(d) dimension dependence of Langevin Monte Carlo. arXiv preprint arXiv:2109.03839, 2021.
- Liu (2017) Liu, Q. Stein variational gradient descent as gradient flow. In Advances in Neural Information Processing Systems (NIPS), pp. 3115–3123, 2017.
- Liu & Wang (2016) Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems (NIPS), pp. 2378–2386, 2016.
- Liu et al. (2016) Liu, Q., Lee, J., and Jordan, M. A kernelized Stein discrepancy for goodness-of-fit tests. In International Conference on Machine Learning (ICML), pp. 276–284, 2016.
- Liu et al. (2017) Liu, Y., Ramachandran, P., Liu, Q., and Peng, J. Stein variational policy gradient. arXiv preprint arXiv:1704.02399, 2017.
- Lu et al. (2019) Lu, J., Lu, Y., and Nolen, J. Scaling limit of the Stein variational gradient descent: The mean field regime. SIAM Journal on Mathematical Analysis, 51(2):648–671, 2019.
- Ma et al. (2019) Ma, Y.-A., Chatterji, N., Cheng, X., Flammarion, N., Bartlett, P. L., and Jordan, M. I. Is there an analog of Nesterov acceleration for MCMC? arXiv preprint arXiv:1902.00996, 2019.
- Nesterov (2013) Nesterov, Y. E. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
- Nüsken & Renger (2021) Nüsken, N. and Renger, D. Stein variational gradient descent: many-particle and long-time asymptotics. arXiv preprint arXiv:2102.12956, 2021.
- Oates et al. (2019) Oates, C. J., Cockayne, J., Briol, F.-X., and Girolami, M. Convergence rates for a class of estimators based on Stein’s method. Bernoulli, 25(2):1141–1159, 2019.
- Pu et al. (2017) Pu, Y., Gan, Z., Henao, R., Li, C., Han, S., and Carin, L. VAE learning via Stein variational gradient descent. In Advances in Neural Information Processing Systems (NIPS), pp. 4236–4245, 2017.
- Rolland et al. (2020) Rolland, P., Eftekhari, A., Kavis, A., and Cevher, V. Double-loop unadjusted Langevin algorithm. In International Conference on Machine Learning (ICML), pp. 8169–8177, 2020.
- Salim & Richtárik (2020) Salim, A. and Richtárik, P. Primal dual interpretation of the proximal stochastic gradient Langevin algorithm. In Advances in Neural Information Processing Systems (NeurIPS), pp. 3786–3796, 2020.
- Shen & Lee (2019) Shen, R. and Lee, Y. T. The randomized midpoint method for log-concave sampling. In Advances in Neural Information Processing Systems (NeurIPS), pp. 2100–2111, 2019.
- Shi et al. (2021) Shi, J., Liu, C., and Mackey, L. Sampling with mirrored Stein operators. arXiv preprint arXiv:2106.12506, 2021.
- Şimşekli (2017) Şimşekli, U. Fractional Langevin Monte Carlo: Exploring Lévy driven stochastic differential equations for Markov Chain Monte Carlo. In International Conference on Machine Learning (ICML), pp. 3200–3209, 2017.
- Tao et al. (2019) Tao, C., Dai, S., Chen, L., Bai, K., Chen, J., Liu, C., Zhang, R., Bobashev, G., and Carin, L. Variational annealing of GANs: A Langevin perspective. In International Conference on Machine Learning (ICML), pp. 6176–6185, 2019.
- Vempala & Wibisono (2019) Vempala, S. and Wibisono, A. Rapid convergence of the unadjusted Langevin algorithm: Isoperimetry suffices. In Advances in Neural Information Processing Systems (NeurIPS), pp. 8092–8104, 2019.
- Villani (2003) Villani, C. Topics in optimal transportation. Number 58 in Graduate Studies in Mathematics. American Mathematical Society, 2003.
- Villani (2008) Villani, C. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
- Wibisono (2018) Wibisono, A. Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem. In Conference on Learning Theory (COLT), pp. 2093–3027, 2018.
- Zhang et al. (2018) Zhang, R., Li, C., Chen, C., and Carin, L. Learning structural weight uncertainty for sequential decision-making. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 1137–1146, 2018.
- Zhang et al. (2019) Zhang, R., Wen, Z., Chen, C., Fang, C., Yu, T., and Carin, L. Scalable Thompson sampling via optimal transport. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 87–96, 2019.
- Zou et al. (2019) Zou, D., Xu, P., and Gu, Q. Sampling from non-log-concave distributions via variance-reduced gradient Langevin dynamics. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 2936–2945, 2019.
Appendix
Appendix A Proof of Proposition 2.2
First, we prove that F is coercive, i.e., for every t ∈ ℝ, the sublevel set S_t := {x ∈ ℝ^d : F(x) ≤ t} is compact. Since F is continuous, S_t is closed. It remains to prove that S_t is bounded. Assume, by contradiction, that S_t is unbounded. Then, there exists a sequence (x_k)_k of points in S_t such that ∥x_k∥ → ∞ and such that the balls B(x_k, 1) are pairwise disjoint, where B(x, 1) denotes the unit ball centered at x.
Let k ∈ ℕ. Using the smoothness of F (Assumption 2.1), for every x ∈ B(x_k, 1),
F(x) ≤ F(x_k) + ⟨∇F(x_k), x − x_k⟩ + (L/2)∥x − x_k∥².
Denote by V the volume of the unit ball B(0, 1), i.e., its Lebesgue measure. The positive number V does not depend on k. Then
∫_{B(x_k, 1)} exp(−F(x)) dx ≥ V exp(−t − L/2),
where we used Jensen’s inequality for the uniform distribution over B(x_k, 1) (the linear term averages to zero), thanks to the convexity of the exponential. Finally, summing over the pairwise disjoint balls,
∫ exp(−F(x)) dx = ∞,
which means that exp(−F) is not integrable. This contradicts the definition of π, and therefore S_t is bounded.
Next, since every sublevel set S_t of F is compact, F is coercive, and hence admits a stationary point. Indeed, F is continuous over the compact set S_t (nonempty for t large enough), and therefore admits a minimizer x* over this set. Moreover, this point is a stationary point, i.e., ∇F(x*) = 0 (note that x* is actually a global minimizer of F).
Appendix B Proof of Proposition 3.1
Let T_t := I + tγg for t ∈ [0, 1], and let μ_t := (T_t)#μ_n. Note that μ_0 = μ_n and μ_1 = μ_{n+1}. First, for every x ∈ ℝ^d,
(16) |
and
(17) |
Hence,
(18) |
using our assumption on the step size γ. Inequality (18) proves that T_t is a diffeomorphism for every t ∈ [0, 1]. Moreover,
(19) |
Using the pushforward formula for densities, μ_t admits a density for every t ∈ [0, 1]. Moreover, this density is controlled for every t using (19). Besides, since T_t is a diffeomorphism, the density of μ_t is the product of a function of the density of μ_n with a bounded function. In particular, μ_{n+1} admits a density. By induction, μ_n admits a density for every n.
Using Villani (2003, Theorem 5.34), the velocity field governing the time evolution of (μ_t)_t is given by v_t = γg ∘ T_t⁻¹. Using a Taylor expansion,
(20) |
We now identify each term. First, and . Using the reproducing property, one can show that
(21) |
Appendix C Proof of Theorem 3.2
We start with a lemma.
Proof.
Using Assumption 2.6,
Using Assumption 2.1 and Proposition 2.2, F admits a stationary point x*, i.e., ∇F(x*) = 0. Therefore, using the triangle inequality for the metric W₁,
We obtain the first inequality using Assumption 2.5 (T1): W₁(μ, π) ≤ √(2λ KL(μ | π)).
To prove the second inequality, recall that, using Assumption 2.1 and Proposition 2.2, ∇F(x*) = 0. Therefore, using the triangle inequality for the metric W₁,
Therefore,
(23) |
∎
Besides, Proposition 3.1 can be applied to SVGD by setting g = h_{μ_n}. In this case, we obtain the following descent property when the step size is small enough.
Having established Proposition 3.1 and Lemmas C.1 and C.2, we are now ready to formulate and prove our main result, Theorem 3.2.
Proof.
We now prove by induction the first implication of Theorem 3.2: (12) ⇒ (14). First, if γ satisfies (12), then Lemma C.1 applies. Therefore, using Lemma C.2,
i.e., Inequality (14) holds with . Now, assume that the condition (12) implies Inequality (14) for every and let us prove it for . First, . Letting , this implies
Therefore, if satisfies (12), then . To see this, using Lemma C.1 we obtain
Therefore, using Lemma C.2, the condition (12) implies Inequality (14) at the next step:
Finally, it remains to recall the definition of the KSD. The proof of the second implication of Theorem 3.2, (13) ⇒ (14), is similar. ∎
Appendix D Proof of Corollary 4.3
Using Vempala & Wibisono (2019, Lemma 1), . Besides,
and using the transfer lemma and the Cauchy–Schwarz inequality,
Therefore,
and
Since ,
Let ε > 0. To output the mixture μ̂ such that KSD²(μ̂ | π) ≤ ε, it suffices to ensure that the right-hand side of the previous bound is at most ε. This yields the claimed number of iterations.