This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

A Convergence Theory for SVGD in the Population Limit
under Talagrand’s Inequality T1

Adil Salim    Lukang Sun    Peter Richtárik
Abstract

Stein Variational Gradient Descent (SVGD) is an algorithm for sampling from a target density which is known up to a multiplicative constant. Although SVGD is a popular algorithm in practice, its theoretical study is limited to a few recent works. We study the convergence of SVGD in the population limit, (i.e., with an infinite number of particles) to sample from a non-logconcave target distribution satisfying Talagrand’s inequality T1. We first establish the convergence of the algorithm. Then, we establish a dimension-dependent complexity bound in terms of the Kernelized Stein Discrepancy (KSD). Unlike existing works, we do not assume that the KSD is bounded along the trajectory of the algorithm. Our approach relies on interpreting SVGD as a gradient descent over a space of probability measures.

Stein Variational Gradient Descent, Complexity, Convergence, Optimization, Sampling

1 Introduction

Sampling from a given target distribution π\pi is a fundamental task of many Machine Learning procedures. In Bayesian Machine Learning, the target distribution π\pi is typically known up to a multiplicative factor and often takes the form

π(x)exp(F(x)),\pi(x)\propto\exp(-F(x)), (1)

where F:𝒳F:{\mathcal{X}}\to{\mathbb{R}} is a LL-smooth nonconvex function defined on 𝒳d{\mathcal{X}}\coloneqq{\mathbb{R}}^{d}, and satisfying

exp(F(x))𝑑x<.\int\exp(-F(x))dx<\infty.

As sampling algorithms are intended to be applied to large scale problems, it has become increasingly important to understand their theoretical properties, such as their complexity, as a function of the dimension of the problem dd, and the desired accuracy ε\varepsilon. In this regard, most of the Machine Learning literature on sampling has concentrated on understanding the complexity (in terms of dd and ε\varepsilon) of (variants of) the Langevin algorithm, see Durmus et al. (2019); Bernton (2018); Wibisono (2018); Cheng et al. (2018); Salim & Richtárik (2020); Hsieh et al. (2018); Dalalyan (2017); Durmus & Moulines (2017); Rolland et al. (2020); Vempala & Wibisono (2019); Zou et al. (2019); Şimşekli (2017); Shen & Lee (2019); Bubeck et al. (2018); Durmus et al. (2018); Ma et al. (2019); Foster et al. (2021); Li et al. (2021).

1.1 Stein Variational Gradient Descent (SVGD)

Stein Variational Gradient Descent (SVGD) (Liu & Wang, 2016; Liu, 2017) is an alternative to the Langevin algorithm that has been applied in several contexts in Machine Learning, including Reinforcement Learning (Liu et al., 2017), sequential decision making (Zhang et al., 2018, 2019), Generative Adversarial Networks (Tao et al., 2019), Variational Auto Encoders (Pu et al., 2017), and Federated Learning (Kassab & Simeone, 2020).

The literature on theoretical properties of SVGD is scarce compared to that of Langevin algorithm, and limited to a few recent works (Korba et al., 2020; Lu et al., 2019; Duncan et al., 2019; Liu, 2017; Chewi et al., 2020; Gorham et al., 2020; Nüsken & Renger, 2021; Shi et al., 2021). In this paper, our goal is to provide a clean convergence theory for SVGD in the population limit, i.e., with an infinite number of particles.

1.2 Related works

The Machine Learning literature on the complexity of sampling from a non-logconcave target distribution has mainly focused on the Langevin algorithm. For instance,111The example of Vempala & Wibisono (2019) is taken only for illustration purpose. Many other results were obtained for Langevin algorithm, even in nonconvex cases, see above. Vempala & Wibisono (2019) showed that Langevin algorithm reaches ε\varepsilon accuracy in terms of the Kullback-Leibler divergence after Ω~(L2dλ2ε)\tilde{\Omega}(\frac{L^{2}d}{\lambda^{2}\varepsilon}) iterations, assuming that the target distribution satisfies the logarithmic-Sobolev inequality (LSI) with constant λ\lambda. In this work, we will assume Talagrand’s inequality T1 with constant λ\lambda, which is milder than LSI with constant λ\lambda, and we will prove a complexity result in terms of another discrepancy called Kernelized Stein Discrepancy (KSD). Besides, a very recent work studies Langevin algorithm for a non-logconcave target distribution without assuming LSI and provides guarantees in terms of the Fisher information (Balasubramanian et al., 2022).

Most existing results on SVGD deal with the continuous time approximation of SVGD in the population limit, a Partial Differential Equation (PDE) representing SVGD with a vanishing step size and an infinite number of particles (Lu et al., 2019; Duncan et al., 2019; Liu, 2017; Nüsken & Renger, 2021; Chewi et al., 2020). In particular, Duncan et al. (2019) propose a Stein logarithmic Sobolev inequality that implies the linear convergence of this PDE. However, it is not yet understood when Stein logarithmic Sobolev inequality holds. Besides, Chewi et al. (2020) showed that the Wasserstein gradient flow of the chi-squared divergence can be seen as an approximation of that PDE, and showed linear convergence of the Wasserstein gradient flow of the chi-squared under Poincaré inequality. Other results, such as those of Lu et al. (2019); Liu (2017); Nüsken & Renger (2021), include asymptotic convergence properties of the PDE, but do not include convergence rates. In this paper, we will prove convergence rates for SVGD in discrete time.

1.2.1 Comparison to Korba et al. (2020)

The closest work to ours is Korba et al. (2020). To our knowledge, Korba et al. (2020) showed the first complexity result for SVGD in discrete time. This result is proven in the population limit and in terms of the Kernelized Stein Discrepancy (KSD), similarly to our main complexity result.

However, their complexity result relies on the assumption that the KSD is uniformly bounded along the iterations of SVGD, an assumption that cannot be checked prior to running the algorithm. Moreover, their complexity bound does not express the dependence in the dimension dd explicitly. This is because the uniform bound on the KSD appears in their complexity bound. On the contrary, one of our contributions is to present a dimension-dependent complexity result under verifiable assumptions.

Besides, Korba et al. (2020) provide a bound on the distance between SVGD in the finite number of particles regime and SVGD in the population limit. This bound cannot be used to study the complexity or convergence rate of SVGD in the finite number of particles regime, see Korba et al. (2020, Proposition 7).

1.3 Contributions

We consider SVGD in the population limit, similarly to concurrent works such as Liu (2017); Korba et al. (2020); Gorham et al. (2020). Our paper intends to provide a clean analysis of SVGD, a problem stated in Liu (2017, Conclusion). To this end, we do not make any assumptions on the trajectory of the algorithm. Instead, our key assumption is that the target distribution π\pi satisfies T1, the mildest of the Talagrand’s inequalities, which holds under a mild assumption on the tails of the distribution; see Villani (2008, Theorem 22.10). Moreover, T1 is implied, for example, by the logarithmic Sobolev inequality (Villani, 2008, Theorem 22.17), with the same constant λ\lambda.

Although sampling algorithms are meant to be applied on high-dimensional problems, the question of the dependence of the complexity of SVGD in dd has not been studied in concurrent works, nor has been studied the generic weak convergence of SVGD under verifiable assumptions, to our knowledge. Assuming that the T1 inequality holds, we provide

  • a generic weak convergence result for SVGD (actually our result is a bit stronger: convergence holds in 1-Wasserstein distance),

  • a complexity bound for SVGD in terms of the dimension dd and the desired accuracy ε\varepsilon, under verifiable assumptions (i.e., assumptions that do not depend on the trajectory of the algorithm): Ω~(Ld3/2λ1/2ε)\tilde{\Omega}\left(\frac{L{d}^{3/2}}{\lambda^{1/2}\varepsilon}\right) iterations suffice to obtain a sample μ\mu such that KSD2(μ|π)<ε\mathop{\mathrm{KSD}}\nolimits^{2}(\mu|\pi)<\varepsilon, where LL is the smoothness constant of FF and λ\lambda the constant in T1 inequality.

Note that these results hold without assuming FF convex. In particular, in the population limit, SVGD applied to non-logconcave target distributions satisfying T1 converges to the target distribution.

1.4 Paper structure

The remainder of the paper is organized as follows. In Section 2 we introduce the necessary mathematical and notational background on optimal transport, reproducing kernel Hilbert spaces and SVGD in order to be able to describe and explain our results. Section 3 is devoted to the development of our theory. Finally, in Section 4 we formulate three corollaries of our key result, capturing weak convergence and complexity estimates for SVGD. Technical proofs are postponed to the Appendix.

2 Background and Notation

2.1 Notation

For any Hilbert space HH, we denote by ,H\langle\cdot,\cdot\rangle_{H} the inner product of HH and by H\|\cdot\|_{H} its norm.

We denote by C0(𝒳)C_{0}({\mathcal{X}}) the set of continuous functions from 𝒳{\mathcal{X}} to {\mathbb{R}} vanishing at infinity and by C1(𝒳,𝒴)C^{1}({\mathcal{X}},{\mathcal{Y}}) the set of continuously differentiable functions from 𝒳{\mathcal{X}} to a Hilbert space 𝒴{\mathcal{Y}}. Given ϕC1(𝒳,)\phi\in C^{1}({\mathcal{X}},{\mathbb{R}}), its gradient is denoted by ϕ\nabla\phi, and if ϕC1(𝒳,𝒳)\phi\in C^{1}({\mathcal{X}},{\mathcal{X}}), the Jacobian of ϕ\phi is denoted by JϕJ\phi. For every x𝒳x\in{\mathcal{X}}, Jϕ(x)J\phi(x) can be seen as a d×dd\times d matrix. The trace of the Jacobian, also called divergence, is denoted by divϕ\mathop{\mathrm{div}}\nolimits\phi.

For any d×dd\times d matrix AA, AHS\|A\|_{\mathop{\mathrm{HS}}\nolimits} denotes the Hilbert Schmidt norm of AA and Aop\|A\|_{\mathop{\mathrm{op}}\nolimits} the operator norm of AA viewed as a linear operator A:𝒳𝒳A:{\mathcal{X}}\to{\mathcal{X}} (where 𝒳{\mathcal{X}} is endowed with the standard Euclidean inner product). Finally, δx\delta_{x} is the Dirac measure at x𝒳x\in{\mathcal{X}}.

2.2 Optimal transport

Consider p1p\geq 1. We denote by 𝒫p(𝒳){\mathcal{P}}_{p}({\mathcal{X}}) the set of Borel probability measures μ\mu over 𝒳{\mathcal{X}} with finite pthp^{\text{th}} moment: xp𝑑μ(x)<\int\|x\|^{p}d\mu(x)<\infty. We denote by Lp(μ)L^{p}(\mu) the set of measurable functions f:𝒳𝒳f:{\mathcal{X}}\to{\mathcal{X}} such that fp𝑑μ<\int\|f\|^{p}d\mu<\infty. Note that the identity map II of 𝒳{\mathcal{X}} satisfies ILp(μ)I\in L^{p}(\mu) if μ𝒫p(𝒳)\mu\in{\mathcal{P}}_{p}({\mathcal{X}}). Moreover, denoting the image (or pushforward) measure of μ\mu by a map TT as T#μT\#\mu, we have that if μ𝒫p(𝒳)\mu\in{\mathcal{P}}_{p}({\mathcal{X}}) and TLp(μ)T\in L^{p}(\mu) then T#μ𝒫p(𝒳)T\#\mu\in{\mathcal{P}}_{p}({\mathcal{X}}) using the transfer lemma.

For every μ,ν𝒫p(𝒳)\mu,\nu\in{\mathcal{P}}_{p}({\mathcal{X}}), the pp-Wasserstein distance between μ\mu and ν\nu is defined by

Wpp(μ,ν)=infs𝒮(μ,ν)xyp𝑑s(x,y),W_{p}^{p}(\mu,\nu)=\inf_{s\in{\mathcal{S}}(\mu,\nu)}\int\|x-y\|^{p}ds(x,y), (2)

where 𝒮(μ,ν){\mathcal{S}}(\mu,\nu) is the set of couplings between μ\mu and ν\nu, i.e., the set of nonnegative measures over 𝒳2{\mathcal{X}}^{2} such that P#s=μP\#s=\mu (resp. Q#s=νQ\#s=\nu) where P:(x,y)xP:(x,y)\mapsto x (resp. Q:(x,y)yQ:(x,y)\mapsto y) denotes the projection onto the first (resp. the second) component. The pp-Wasserstein distance is a metric over 𝒫p(𝒳){\mathcal{P}}_{p}({\mathcal{X}}). The metric space (𝒫2(𝒳),W2)({\mathcal{P}}_{2}({\mathcal{X}}),W_{2}) is called the Wasserstein space.

In this paper, we consider a target probability distribution π\pi proportional to exp(F)\exp(-F), where FF satisfies the following.

Assumption 2.1.

The Hessian HFH_{F} is well-defined and L0\exists L\geq 0 such that HFopL\|H_{F}\|_{\mathop{\mathrm{op}}\nolimits}\leq L.

Moreover, using exp(F(x))𝑑x<\int\exp(-F(x))dx<\infty, FF admits a stationary point.

Proposition 2.2.

Under Assumptions 2.1 , there exists x𝒳x_{\star}\in{\mathcal{X}} for which F(x)=0\nabla F(x_{\star})=0, i.e., FF admits a stationary point.

To specify the dependence in the dimension of our complexity bounds, we will initialize the algorithm from a Gaussian distribution centered at a stationary point. Such a stationary point can be found by gradient descent on FF for instance.

The task of sampling from π\pi can be formalized as an optimization problem. Indeed, define the Kullback-Leibler (KL\mathop{\mathrm{KL}}\nolimits) divergence as

KL(μ|π)log(dμdπ(x))𝑑μ(x),\mathop{\mathrm{KL}}\nolimits(\mu|\pi)\coloneqq\int\log\left(\frac{d\mu}{d\pi}(x)\right)d\mu(x), (3)

if μ\mu admits the density dμdπ\frac{d\mu}{d\pi} with respect to π\pi, and KL(μ|π)+\mathop{\mathrm{KL}}\nolimits(\mu|\pi)\coloneqq+\infty else. Then, KL(μ|π)0\mathop{\mathrm{KL}}\nolimits(\mu|\pi)\geq 0 and KL(μ|π)=0\mathop{\mathrm{KL}}\nolimits(\mu|\pi)=0 if and only if μ=π\mu=\pi. Therefore, assuming π𝒫2(𝒳)\pi\in{\mathcal{P}}_{2}({\mathcal{X}}), the optimization problem

minμ𝒫2(𝒳)(μ),\min_{\mu\in{\mathcal{P}}_{2}({\mathcal{X}})}{\mathcal{F}}(\mu), (4)

where

(μ)KL(μ|π),{\mathcal{F}}(\mu)\coloneqq\mathop{\mathrm{KL}}\nolimits(\mu|\pi),

admits a unique solution: the distribution π\pi. We will see in Section 3 that SVGD can be seen as a gradient descent algorithm to solve (4).

Indeed, the Wasserstein space can be endowed with a differential structure. In particular, when it is well defined, the Wasserstein gradient of the functional {\mathcal{F}} denoted by W(μ)\nabla_{W}{\mathcal{F}}(\mu) is an element of L2(μ)L^{2}(\mu) and satisfies W(μ)=log(dμdπ)\nabla_{W}{\mathcal{F}}(\mu)=\nabla\log\left(\frac{d\mu}{d\pi}\right).

2.2.1 Functional inequalities

The analysis of sampling algorithm in the case where FF is nonconvex often goes through functional inequalities.

Definition 2.3 (Logarithmic Sobolev Inequality (LSI)).

The distribution π\pi satisfies the Logarithmic Sobolev Inequality if there exists λ>0\lambda>0 such that for all μ𝒫2(𝒳)\mu\in{\mathcal{P}}_{2}({\mathcal{X}}),

(μ)2λW(μ)L2(μ)2.{\mathcal{F}}(\mu)\leq\frac{2}{\lambda}\|\nabla_{W}{\mathcal{F}}(\mu)\|_{L^{2}(\mu)}^{2}.

LSI is a popular assumption in the analysis of Langevin algorithm in the case when FF is not convex see e.g. Vempala & Wibisono (2019).

Definition 2.4 (Talagrand’s Inequality Tpp).

Let p1p\geq 1. The distribution π\pi satisfies the Talagrand’s Inequality Tpp if there exists λ>0\lambda>0 such that for all μ𝒫p(𝒳)\mu\in{\mathcal{P}}_{p}({\mathcal{X}}), we have Wp(μ,π)2(μ)λW_{p}(\mu,\pi)\leq\sqrt{\frac{2{\mathcal{F}}(\mu)}{\lambda}}.

We now claim that T1 is milder than LSI. Indeed, using W1(μ,π)W2(μ,π)W_{1}(\mu,\pi)\leq W_{2}(\mu,\pi), T2 implies T1 with the same constant λ\lambda. Moreover, using Villani (2008, Theorem 22.17), LSI implies T2 with the same constant λ\lambda. In conclusion, LSI \Rightarrow T2 \Rightarrow T1, with the same constant λ\lambda.

Besides, if FF is λ\lambda-strongly convex, then π\pi satisfies LSI with constant λ\lambda. A bounded perturbation of π\pi in the latter case would also satisfies LSI with a constant independent of the dimension (Villani, 2008, Remark 21.5).

Finally, to get the exponential convergence of SVGD in continuous time, another inequality called Stein-LSI was proposed in Duncan et al. (2019). Stein-LSI is an assumption on both the kernel and the target distribution, and it implies LSI. Obtaining reasonable sufficient conditions for Stein-LSI to hold is an open problem, but there are simple cases where it cannot hold (Duncan et al., 2019, Lemma 36). In particular, Stein-LSI never holds under the assumptions that we will make in this paper to study SVGD in discrete time, see Korba et al. (2020, Section 11.3).

Our key assumption on π\pi is that it satisfies the Talagrand’s inequality T1 (Villani, 2008, Definition 22.1).

Assumption 2.5.

The target distribution π\pi satisfies T1.

We will use 2.5 to recursively control the KSD by the KL divergence along the iterations of the algorithm.

The target distribution π\pi satisfies T1 if and only if there exist a𝒳a\in{\mathcal{X}} and β>0\beta>0 such that

exp(βxa2)𝑑π(x)<,\int\exp(\beta\|x-a\|^{2})d\pi(x)<\infty, (5)

see Villani (2008, Theorem 22.10). Therefore, 2.5 is essentially an assumption on the tails of π\pi. In particular, π𝒫2(𝒳)\pi\in{\mathcal{P}}_{2}({\mathcal{X}}).

2.3 Reproducing Kernel Hilbert Space

We consider a kernel kk associated to a Reproducing Kernel Hilbert Space (RKHS) denoted by 0{\mathcal{H}}_{0}. We denote by Φ:𝒳0\Phi:{\mathcal{X}}\to{\mathcal{H}}_{0} the so-called feature map Φ:xk(,x)\Phi:x\mapsto k(\cdot,x). The product space 0d{\mathcal{H}}_{0}^{d} is also a Hilbert space denoted 0d{\mathcal{H}}\coloneqq{\mathcal{H}}_{0}^{d}. We make the following assumption on the kernel kk.

Assumption 2.6.

There exists B>0B>0 such that the inequalities

Φ(x)0B,\|\Phi(x)\|_{{\mathcal{H}}_{0}}\leq B,

and

Φ(x)2=i=1diΦ(x)02B2\|\nabla\Phi(x)\|_{{\mathcal{H}}}^{2}=\sum_{i=1}^{d}\|\partial_{i}\Phi(x)\|^{2}_{{\mathcal{H}}_{0}}\leq B^{2}

hold for all x𝒳x\in{\mathcal{X}}. Moreover, Φ:𝒳\nabla\Phi:{\mathcal{X}}\to{\mathcal{H}} is continuous.

2.6 is satisfied by the Gaussian kernel for example, with BB independent of dd using a scaling argument. 2.6 states that Φ:𝒳0\Phi:{\mathcal{X}}\to{\mathcal{H}}_{0} is bounded, Lipschitz and C1C^{1}. This is satisfied by many classical kernels used in practice. Note that k(x,x)=Φ(x)02k(x,x)=\|\Phi(x)\|_{{\mathcal{H}}_{0}}^{2} and that div12k(x,a)=Φ(x),Φ(a)\mathop{\mathrm{div}}\nolimits_{1}\nabla_{2}k(x,a)=\langle\nabla\Phi(x),\nabla\Phi(a)\rangle_{{\mathcal{H}}} (in particular, div12k(x,x)=Φ(x)2\mathop{\mathrm{div}}\nolimits_{1}\nabla_{2}k(x,x)=\|\nabla\Phi(x)\|_{{\mathcal{H}}}^{2}). Hence, Φ\nabla\Phi is continuous iff xdiv12k(x,x)x\mapsto\mathop{\mathrm{div}}\nolimits_{1}\nabla_{2}k(x,x) and xdiv12k(x,a)x\mapsto\mathop{\mathrm{div}}\nolimits_{1}\nabla_{2}k(x,a) are continuous for every a𝒳a\in{\mathcal{X}}.

Under Assumption 2.6, L2(μ){\mathcal{H}}\subset L^{2}(\mu) for every probability distribution on 𝒳{\mathcal{X}}, and the inclusion map ιμ:L2(μ)\iota_{\mu}:{\mathcal{H}}\to L^{2}(\mu) is continuous. We denote by Pμ:L2(μ)P_{\mu}:L^{2}(\mu)\to{\mathcal{H}} its adjoint defined by the relation: for every fL2(μ)f\in L^{2}(\mu), gg\in{\mathcal{H}},

f,ιμgL2(μ)=Pμf,g.\langle f,\iota_{\mu}g\rangle_{L^{2}(\mu)}=\langle P_{\mu}f,g\rangle_{{\mathcal{H}}}. (6)

Then, PμP_{\mu} can be expressed as a convolution with kk (Carmeli et al., 2010, Proposition 3):

Pμf(x)=k(x,y)f(y)𝑑μ(y),P_{\mu}f(x)=\int k(x,y)f(y)d\mu(y), (7)

or Pμf=Φ(y)f(y)𝑑μ(y)P_{\mu}f=\int\Phi(y)f(y)d\mu(y) where the integral converges in norm.

2.4 Stein Variational Gradient Descent

2.4.1 The population limit

Stein Variational Gradient Descent (SVGD) is an algorithm to sample from πexp(F)\pi\propto\exp(-F). SVGD proceeds by maintaining a set of NN particles over d{\mathbb{R}}^{d}, whose empirical distribution μnN\mu_{n}^{N} at time nn aims to approximate π\pi as nn\to\infty, see Liu & Wang (2016). The SVGD algorithm is presented above.

Algorithm 1 Stein Variational Gradient Descent (Liu & Wang, 2016)
  Initialization: a set x01,,x0N𝒳x_{0}^{1},\ldots,x_{0}^{N}\in{\mathcal{X}} of NN particles, a kernel kk, a step size γ>0\gamma>0.
  for n=0,1,2,n=0,1,2,\ldots do
     for i=1,2,,Ni=1,2,\ldots,N do
        
xn+1i=xniγNj=1Nk(xni,xnj)F(xnj)2k(xni,xnj)\displaystyle x_{n+1}^{i}={\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}x_{n}^{i}}-\frac{\gamma}{N}\sum_{j=1}^{N}k({\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}x_{n}^{i}},x_{n}^{j})\nabla F(x_{n}^{j})-\nabla_{2}k({\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}x_{n}^{i}},x_{n}^{j})
     end for
  end for

Denoting by μnN\mu_{n}^{N} the empirical distribution of xn1,,xnNx_{n}^{1},\ldots,x_{n}^{N}, i.e.,

μnN1Ni=1Nδxni,\mu_{n}^{N}\coloneqq\frac{1}{N}\sum_{i=1}^{N}\delta_{x_{n}^{i}},

the SVGD update can be written

xn+1i=\displaystyle x_{n+1}^{i}= xniγk(xni,y)F(y)2k(xni,y)dμnN(y)\displaystyle{\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}x_{n}^{i}}-\gamma\int k({\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}x_{n}^{i}},y)\nabla F(y)-\nabla_{2}k({\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}x_{n}^{i}},y)d\mu_{n}^{N}(y)
=\displaystyle= (Iγk(,y)F(y)2k(,y)dμnN(y))(xni).\displaystyle\left(I-\gamma\int k(\cdot,y)\nabla F(y)-\nabla_{2}k(\cdot,y)d\mu_{n}^{N}(y)\right)({\color[rgb]{0,0,1}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,1}x_{n}^{i}}).

Therefore, SVGD performs the update

μn+1N=(IγΦ(y)F(y)Φ(y)dμnN(y))#μnN,\mu_{n+1}^{N}=\left(I-\gamma\int\Phi(y)\nabla F(y)-\nabla\Phi(y)d\mu_{n}^{N}(y)\right)\#\mu_{n}^{N},

at the level of measures. We call population limit the regime where, formally, N=N=\infty. Mathematically, this corresponds to the assumption that μ0\mu_{0} has a density (which can be seen as intuitively seen 𝔼limNμ0N{\mathbb{E}}\lim_{N\to\infty}\mu_{0}^{N}) which belongs to C0(𝒳)C_{0}({\mathcal{X}}). In this case, we shall see in our analysis that μn\mu_{n} has a density for every nn. To summarize, in the population limit, SVGD performs the same update:

μn+1=(Iγhμn)#μn,\mu_{n+1}=\left(I-\gamma h_{\mu_{n}}\right)\#\mu_{n}, (8)

where

hμ(x)k(x,y)F(y)yk(x,y)dμ(y)h_{\mu}(x)\coloneqq\int k(x,y)\nabla F(y)-\nabla_{y}k(x,y)d\mu(y)

or

hμΦ(y)F(y)Φ(y)dμ(y),h_{\mu}\coloneqq\int\Phi(y)\nabla F(y)-\nabla\Phi(y)d\mu(y),

and where μn\mu_{n} has a density.

Finally, note that the SVGD algorithm was originally derived in Liu & Wang (2016) from its population limit. The authors first introduced the SVGD update in the population limit, and then, the SVGD algorithm (Algorithm 1) is obtained from the population limit by approximating the expectations by empirical means.

Our point of view on SVGD in the population limit.

We now provide the intuition behind our results on SVGD.

In the population limit, SVGD can be seen as a Riemannian gradient descent, thanks to the following two reasons.

First, in a Riemannian interpretation of the Wasserstein space (Villani, 2008), for every μ𝒫2(𝒳)\mu\in{\mathcal{P}}_{2}({\mathcal{X}}), the map expμ:ϕ(I+ϕ)#μ\exp_{\mu}:\phi\mapsto(I+\phi)\#\mu can be seen as the exponential map at μ\mu. In the population limit, SVGD (8) can be rewritten as

μn+1=expμn(γhμn).\mu_{n+1}=\exp_{\mu_{n}}(-\gamma h_{\mu_{n}}).

Second, hμ-h_{\mu} can be seen as the negative gradient of {\mathcal{F}} at μ\mu under a certain metric. Indeed, using integration by parts, hμ=PμW(μ)h_{\mu}=P_{\mu}\nabla_{W}{\mathcal{F}}(\mu), see e.g. Korba et al. (2020); Duncan et al. (2019). Therefore, for every gg\in{\mathcal{H}}, hμ,g=W(μ),gL2(μ)\langle h_{\mu},g\rangle_{{\mathcal{H}}}=\langle\nabla_{W}{\mathcal{F}}(\mu),g\rangle_{L^{2}(\mu)}, hence hμh_{\mu} can be seen as a Wasserstein gradient of {\mathcal{F}} under the inner product of {\mathcal{H}}.

The Kernelized Stein Discrepancy (KSD) is a natural discrepancy between probability distributions that was introduced prior to SVGD (Liu et al., 2016; Chwialkowski et al., 2016) to compare probablity measures. Indeed, if the RKHS {\mathcal{H}} is rich enough (Liu et al., 2016; Chwialkowski et al., 2016; Oates et al., 2019), an assumption that we shall always make in this paper, then

KSD(μ|π)=0μ=π.\mathop{\mathrm{KSD}}\nolimits(\mu|\pi)=0\Longrightarrow\mu=\pi.

The KSD is intimately related to SVGD, and the KSD naturally appears in the original derivation of SVGD (Liu & Wang, 2016). The KSD is defined as the square root of the Stein Fisher Information (Duncan et al., 2019) ISteinI_{\mathop{\mathrm{Stein}}\nolimits}:

IStein(μ|π)hμ2,KSD(μ|π)hμ2.I_{\mathop{\mathrm{Stein}}\nolimits}(\mu|\pi)\coloneqq\|h_{\mu}\|_{{\mathcal{H}}}^{2},\quad\mathop{\mathrm{KSD}}\nolimits(\mu|\pi)\coloneqq\|h_{\mu}\|_{{\mathcal{H}}}^{2}. (9)

In this paper, we study the complexity of SVGD in terms of the KSD. To understand better the topology of the KSD and compare it to common topologies in the space of probability measures, we refer to Gorham & Mackey (2017).

3 Analysis of SVGD

In this section, we analyze SVGD in the infinite number of particles regime. Recall that in this regime, SVGD is given by μ0C0(𝒳)\mu_{0}\in C_{0}({\mathcal{X}}) and

μn+1=(Iγhμn)#μn,\mu_{n+1}=(I-\gamma h_{\mu_{n}})\#\mu_{n},

where

hμF(x)Φ(x)Φ(x)dμ(x).h_{\mu}\coloneqq\int\nabla F(x)\Phi(x)-\nabla\Phi(x)d\mu(x).

3.1 A fundamental inequality

We start by stating a fundamental inequality satisfied by {\mathcal{F}} for any update of the form

μn+1=(Iγg)#μn,\mu_{n+1}=\left(I-\gamma g\right)\#\mu_{n}, (10)

where gg\in{\mathcal{H}}.

Proposition 3.1.

Let Assumptions 2.1 and 2.6 hold true. Let α>1\alpha>1 and choose γ>0\gamma>0 such that γgα1αB\gamma\|g\|_{{\mathcal{H}}}\leq\frac{\alpha-1}{\alpha B}. Then,

(μn+1)(μn)γhμn,g+γ2K2g2,{\mathcal{F}}(\mu_{n+1})\leq{\mathcal{F}}(\mu_{n})-\gamma\langle h_{\mu_{n}},g\rangle_{{\mathcal{H}}}+\frac{\gamma^{2}K}{2}\|g\|_{{\mathcal{H}}}^{2}, (11)

where K=(α2+L)BK=(\alpha^{2}+L)B.

Inequality (11) is a property of the functional {\mathcal{F}}, and not a property of the SVGD algorithm. Inequality (11) plays the role of a Taylor inequality for the functional {\mathcal{F}}, where hμnh_{\mu_{n}} is the Wasserstein gradient of {\mathcal{F}} at μn\mu_{n} under the metric induced by {\mathcal{H}}. Proposition 3.1 is a slight generalization of Korba et al. (2020, Proposition 5), and is not our main contribution, therefore we only sketch its proof in the Appendix.

3.2 Main result

Applying recursively the Taylor inequality Proposition 3.1 with g=hμng=h_{\mu_{n}}, we obtain the following descent property for SVGD, which is our main theoretical result. The proof of this result can be found in the Appendix.

Theorem 3.2 (Descent lemma).

Let Assumptions 2.1, 2.5 and 2.6 hold true. Let α>1\alpha>1. If

γ(α1)×\displaystyle\gamma\leq(\alpha-1)\times (12)
(αB2(1+F(0)+Lx𝑑π(x)+L2(μ0)λ))1\displaystyle\left(\alpha B^{2}\left(1+\|\nabla F(0)\|+L\int\|x\|d\pi(x)+L\sqrt{\frac{2{\mathcal{F}}(\mu_{0})}{\lambda}}\right)\right)^{-1}

or

γ(α1)×\displaystyle\gamma\leq(\alpha-1)\times (13)
(αB2(1+2L2(μ0)λ+Lxxdμ0(x))))1\displaystyle\left(\alpha B^{2}\left(1+2L\sqrt{\frac{2{\mathcal{F}}(\mu_{0})}{\lambda}}+L\int\|x-x_{\star}\|d\mu_{0}(x))\right)\right)^{-1}

then

(μn+1)(μn)γ(1γB(α2+L)2)KSD2(μn|π).{\mathcal{F}}(\mu_{n+1})\leq{\mathcal{F}}(\mu_{n})-\gamma\left(1-\frac{\gamma B(\alpha^{2}+L)}{2}\right)\mathop{\mathrm{KSD}}\nolimits^{2}(\mu_{n}|\pi). (14)

If (μ0)<{\mathcal{F}}(\mu_{0})<\infty, then, using Theorem 3.2, ((μn))n({\mathcal{F}}(\mu_{n}))_{n} is nonincreasing and μn\mu_{n} has a density w.r.t. Lebesgue measure for every nn (since (μn)<{\mathcal{F}}(\mu_{n})<\infty).

In the language of the gradient descent algorithms, Theorem 3.2 is called a descent property. It can be seen as a discrete time analogue of dissipation properties obtained for the PDE modeling SVGD in continuous time in the population limit (Duncan et al., 2019; Korba et al., 2020).

Unlike Korba et al. (2020, Proposition 5) and Liu (2017, Theorem 3.3), we do not assume that supnKSD(μn|π)<\sup_{n}\mathop{\mathrm{KSD}}\nolimits(\mu_{n}|\pi)<\infty or that γKSD(μn|π)1\gamma\leq\mathop{\mathrm{KSD}}\nolimits(\mu_{n}|\pi)^{-1} to obtain our descent property. The step size γ\gamma is bounded by a constant. Iterating Theorem 3.2, we obtain convergence results as corollaries in the next section.

4 Convergence and Complexity

4.1 Convergence

We now show that Theorem 3.2 implies weak convergence and convergence in W1W_{1}.

Corollary 4.1 (Weak convergence).

Let Assumptions 2.1, 2.5 and 2.6 hold true. Let α>1\alpha>1. If γ<2B(α2+L)\gamma<\frac{2}{B(\alpha^{2}+L)}, and γ\gamma further satisfies either (12) or (13), then μnn+π\mu_{n}\rightarrow_{n\to+\infty}\pi weakly and W1(μn,π)0W_{1}(\mu_{n},\pi)\to 0.

Proof.

Using Theorem 3.2 and iterating,

(μn)(μ0)γ(1γB(α2+L)2)k=0n1KSD2(μk|π).{\mathcal{F}}(\mu_{n})\leq{\mathcal{F}}(\mu_{0})-\gamma\left(1-\frac{\gamma B(\alpha^{2}+L)}{2}\right)\sum_{k=0}^{n-1}\mathop{\mathrm{KSD}}\nolimits^{2}(\mu_{k}|\pi).

Therefore, (μn){\mathcal{F}}(\mu_{n}) is uniformly bounded. For every n1n\geq 1,

γ(1γB(α2+L)2)k=0n1KSD2(μk|π)(μ0).\gamma\left(1-\frac{\gamma B(\alpha^{2}+L)}{2}\right)\sum_{k=0}^{n-1}\mathop{\mathrm{KSD}}\nolimits^{2}(\mu_{k}|\pi)\leq{\mathcal{F}}(\mu_{0}).

Consequently, n=0+KSD2(μn|π)<\sum_{n=0}^{+\infty}\mathop{\mathrm{KSD}}\nolimits^{2}(\mu_{n}|\pi)<\infty. Therefore KSD(μn|π)n+0\mathop{\mathrm{KSD}}\nolimits(\mu_{n}|\pi)\rightarrow_{n\to+\infty}0.

Moreover, using 2.5 and (5), for every a𝒳a\in{\mathcal{X}}, exp(a,x)𝑑π(x)<\int\exp(\langle a,x\rangle)d\pi(x)<\infty. Therefore, using Dupuis & Ellis (2011, Lemma 1.4.3), (μn)(\mu_{n}) is both tight and uniformly integrable. Consider a subsequence of (μϕ(n))(\mu_{\phi(n)}) converging weakly to some μ\mu_{\star}. We shall prove that μ=π\mu_{\star}=\pi.

First, using 2.1 and 2.6, xF(x)Φ(x)Φ(x)x\mapsto\nabla F(x)\Phi(x)-\nabla\Phi(x)\in{\mathcal{H}} is continuous and

F(x)Φ(x)Φ(x)\displaystyle\left\|\nabla F(x)\Phi(x)-\nabla\Phi(x)\right\|_{{\mathcal{H}}}
F(x)Φ(x)+Φ(x)\displaystyle\left\|\nabla F(x)\Phi(x)\right\|_{{\mathcal{H}}}+\left\|\nabla\Phi(x)\right\|_{{\mathcal{H}}}
=\displaystyle= F(x)Φ(x)0+Φ(x)\displaystyle\left\|\nabla F(x)\right\|\left\|\Phi(x)\right\|_{{\mathcal{H}}_{0}}+\left\|\nabla\Phi(x)\right\|_{{\mathcal{H}}}
\displaystyle\leq B(F(x)+1)\displaystyle B\left(\left\|\nabla F(x)\right\|+1\right)
\displaystyle\leq B(F(0)+Lx+1).\displaystyle B\left(\left\|\nabla F(0)\right\|+L\|x\|+1\right).

Moreover, as a subsequence, (μϕ(n))(\mu_{\phi(n)}) is also uniformly integrable and also converges weakly to μ\mu_{\star}. Therefore, using Villani (2003, Theorem 7.12) with p=1p=1, 𝔼xμϕ(n)(F(x)Φ(x)Φ(x)){{\mathbb{E}}}_{x\sim\mu_{\phi(n)}}\left(\nabla F(x)\Phi(x)-\nabla\Phi(x)\right) converges to 𝔼xμ(F(x)Φ(x)Φ(x)){{\mathbb{E}}}_{x\sim\mu_{\star}}\left(\nabla F(x)\Phi(x)-\nabla\Phi(x)\right) in {\mathcal{H}}. In other words, hμϕ(n)h_{\mu_{\phi(n)}} converges to hμh_{\mu_{\star}} in {\mathcal{H}}. Taking the norm, KSD(μϕ(n)|π)KSD(μ|π)\mathop{\mathrm{KSD}}\nolimits(\mu_{\phi(n)}|\pi)\rightarrow\mathop{\mathrm{KSD}}\nolimits(\mu_{\star}|\pi) along the subsequence. Recalling that KSD(μn|π)0\mathop{\mathrm{KSD}}\nolimits(\mu_{n}|\pi)\rightarrow 0 we obtain KSD(μ|π)=0\mathop{\mathrm{KSD}}\nolimits(\mu_{\star}|\pi)=0, which implies μ=π\mu_{\star}=\pi.

In conclusion, μnn+π\mu_{n}\rightarrow_{n\to+\infty}\pi weakly. Moreover, the convergence also happens in W1W_{1} because (μn)(\mu_{n}) is uniformly integrable, see (Villani, 2003, Theorem 7.12). ∎

In summary, under T1 and some smoothness assumptions but without convexity of the potential, SVGD in the population limit converges to the target distribution.

One can be surprised to see that SVGD converges without convexity assumption on FF, but this is actually natural if one thinks about the gradient descent interpretation of SVGD. Indeed, SVGD in the population limit is a gradient descent on the KL divergence, which is

  • "smooth" if we restrict the descent directions to a RKHS (i.e., it satisfies a Taylor inequality Proposition 3.1),

  • coercive (i.e., sublevel sets are tight) Dupuis & Ellis (2011, Lemma 1.4.3),

  • and has a single stationary point which is its global minimizer (the KSD is the norm of the gradient of KL in our interpretation, and the KSD is equal to zero only at the optimum).

One can show that, over d{\mathbb{R}}^{d}, gradient descent applied to a smooth coercive function with a single stationary point converges to the global minimizer. The situation here is similar.

4.2 Complexity

Next, we provide a 𝒪(1/n){\mathcal{O}}(1/n) convergence rate for the empirical mean of the iterates μn\mu_{n} in terms of the squared KSD. This result is obtained from our descent lemma (Theorem 3.2).

Corollary 4.2 (Convergence rate).

Let Assumptions 2.1, 2.5 and 2.6 hold true. Let α>1\alpha>1. If γ<2B(α2+L)\gamma<\frac{2}{B(\alpha^{2}+L)}, and γ\gamma further satisfies either (12) or (13), then

IStein(μ¯n|π)2(μ0)nγ,I_{\mathop{\mathrm{Stein}}\nolimits}(\bar{\mu}_{n}|\pi)\leq\frac{2{\mathcal{F}}(\mu_{0})}{n\gamma}, (15)

where μ¯n=1nk=0n1μk\bar{\mu}_{n}=\frac{1}{n}\sum_{k=0}^{n-1}\mu_{k}.

Note that this convergence rate is given in terms of the uniform mixture of μ0,,μn1\mu_{0},\ldots,\mu_{n-1}. Similar mixtures appear in the analysis of Langevin algorithm (see e.g. Durmus et al. (2019)). Note also that the convergence rate in Corollary 4.2 is similar to the convergence rate of the squared norm of the gradient in the gradient descent algorithm applied to a smooth function (Nesterov, 2013).

Proof.

Using Theorem 3.2, (μn+1)(μn)γ2KSD2(μn|π),{\mathcal{F}}(\mu_{n+1})\leq{\mathcal{F}}(\mu_{n})-\frac{\gamma}{2}\mathop{\mathrm{KSD}}\nolimits^{2}(\mu_{n}|\pi), and by iterating, we get

0(μn)(μ0)γ2k=0n1hμk2.0\leq{\mathcal{F}}(\mu_{n})\leq{\mathcal{F}}(\mu_{0})-\frac{\gamma}{2}\sum_{k=0}^{n-1}\|h_{\mu_{k}}\|^{2}.

Rearranging the terms, and using the convexity of the squared norm,

hμ¯n2=1nk=0n1hμk21nk=0n1hμk22(μ0)nγ.\|h_{\bar{\mu}_{n}}\|^{2}=\left\|\frac{1}{n}\sum_{k=0}^{n-1}h_{\mu_{k}}\right\|^{2}\leq\frac{1}{n}\sum_{k=0}^{n-1}\|h_{\mu_{k}}\|^{2}\leq\frac{2{\mathcal{F}}(\mu_{0})}{n\gamma}.

From the last result, we can characterize the iteration complexity of SVGD.

Corollary 4.3 (Complexity).

Let Assumptions 2.1, 2.5 and 2.6 hold true. Let α>1\alpha>1. If γmin(2B(α2+L),α1αK)\gamma\leq\min(\frac{2}{B(\alpha^{2}+L)},\frac{\alpha-1}{\alpha K}), where

KB2(1+2L2λF(x)+d2log(L2π)+Ld),K\coloneqq B^{2}\left(1+2L\sqrt{\frac{2}{\lambda}}\sqrt{F(x_{\star})+\frac{d}{2}\log\left(\frac{L}{2\pi}\right)}+\sqrt{Ld}\right),

and if μ0=𝒩(x,1LI)\mu_{0}={\mathcal{N}}(x_{\star},\frac{1}{L}I), then

n=Ω~(Ld3/2λ1/2ε)n=\tilde{\Omega}\left(\frac{L{d}^{3/2}}{\lambda^{1/2}\varepsilon}\right)

iterations of SVGD suffice to output μμ¯n\mu\coloneqq\bar{\mu}_{n} such that IStein(μ|π)ε.I_{\mathop{\mathrm{Stein}}\nolimits}(\mu|\pi)\leq\varepsilon.

To our knowledge, Corollary 4.3 provides the first dimension-dependent complexity result for SVGD. Its proof can be found in the appendix. The dependence of the T1 constant λ\lambda in the dimension dd is subject to active research in optimal transport theory (Villani, 2008, Remark 22.11) and is out of the scope of this paper. Yet, using Villani (2008, Theorem 22.10, Equation 22.16), λ\lambda can be taken as

1/λ=mina𝒳,β>01β2(1+logexp(βxa2)𝑑π(x)).1/\lambda=\min_{a\in\mathcal{X},\beta>0}\frac{1}{\beta^{2}}\left(1+\log\int\exp(\beta\|x-a\|^{2})d\pi(x)\right).

Note that the output μ\mu of the algorithm is a mixture of the iterates: μ=μ¯n\mu=\bar{\mu}_{n}. Besides, optimizing the complexity over α\alpha leads to involved calculations that do not change the overall complexity. To see this, note that the larger the step size γ\gamma, the smaller the complexity. But, even if the step size γ=min(2BL,1/K)\gamma=\min(\frac{2}{BL},1/K) were allowed, the overall complexity would be the same.

5 Conclusion

We proved that under T1 inequality and some smoothness assumptions on the kernel and the potential of the target distribution but without convexity, SVGD in the population limit converges weakly and in 1-Wasserstein distance to the target distribution. Moreover, we showed that SVGD reaches ε\varepsilon accuracy in terms of the squared Kernelized Stein Discrepancy after Ω~(d3/2ε)\tilde{\Omega}\left(\frac{{d}^{3/2}}{\varepsilon}\right) iterations.

A possible extension of our work is to study SVGD under functional inequalities other than T1, such as (Bolley & Villani, 2005, Corollary 2.6 (i)) (which is weaker than T1) or the Poincaré inequality (in the form of Villani (2008, Theorem 22.25 (iii))). We claim that our approach can be extended to study SVGD in these settings.

Finally, an important and difficult open problem in the analysis of SVGD is to characterize its complexity with a finite number of particles, i.e. with discrete measures. In this regime, we lose the interpretation of SVGD as a gradient descent in the space of probability measures, because the KL divergence w.r.t. the target distribution is infinite. However, we believe that our clean analysis in the population limit makes a first step towards this open problem.

References

  • Balasubramanian et al. (2022) Balasubramanian, K., Chewi, S., Erdogdu, M. A., Salim, A., and Zhang, M. Towards a theory of non-log-concave sampling: first-order stationarity guarantees for langevin monte carlo. arXiv preprint arXiv:2202.05214, 2022.
  • Bernton (2018) Bernton, E. Langevin Monte Carlo and JKO splitting. In Conference on Learning Theory (COLT), pp.  1777–1798, 2018.
  • Bolley & Villani (2005) Bolley, F. and Villani, C. Weighted csiszár-kullback-pinsker inequalities and applications to transportation inequalities. In Annales de la Faculté des sciences de Toulouse: Mathématiques, volume 14, pp.  331–352, 2005.
  • Bubeck et al. (2018) Bubeck, S., Eldan, R., and Lehec, J. Sampling from a log-concave distribution with projected Langevin Monte Carlo. Discrete & Computational Geometry, 59(4):757–783, 2018.
  • Carmeli et al. (2010) Carmeli, C., De Vito, E., Toigo, A., and Umanitá, V. Vector valued reproducing kernel Hilbert spaces and universality. Analysis and Applications, 8(01):19–61, 2010.
  • Cheng et al. (2018) Cheng, X., Chatterji, N. S., Bartlett, P. L., and Jordan, M. I. Underdamped Langevin MCMC: A non-asymptotic analysis. In Conference on Learning Theory (COLT), pp.  300–323, 2018.
  • Chewi et al. (2020) Chewi, S., Gouic, T. L., Lu, C., Maunu, T., and Rigollet, P. SVGD as a kernelized Wasserstein gradient flow of the chi-squared divergence. In Advances in Neural Information Processing Systems (NeurIPS), pp.  2098–2109, 2020.
  • Chwialkowski et al. (2016) Chwialkowski, K., Strathmann, H., and Gretton, A. A kernel test of goodness of fit. In International Conference on Machine Learning (ICML), pp. 2606–2615, 2016.
  • Dalalyan (2017) Dalalyan, A. S. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017.
  • Duncan et al. (2019) Duncan, A., Nüsken, N., and Szpruch, L. On the geometry of Stein variational gradient descent. arXiv preprint arXiv:1912.00894, 2019.
  • Dupuis & Ellis (2011) Dupuis, P. and Ellis, R. S. A weak convergence approach to the theory of large deviations, volume 902. John Wiley & Sons, 2011.
  • Durmus & Moulines (2017) Durmus, A. and Moulines, E. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. The Annals of Applied Probability, 27(3):1551–1587, 2017.
  • Durmus et al. (2018) Durmus, A., Moulines, E., and Pereyra, M. Efficient Bayesian computation by proximal Markov Chain Monte Carlo: when Langevin meets Moreau. SIAM Journal on Imaging Sciences, 11(1):473–506, 2018.
  • Durmus et al. (2019) Durmus, A., Majewski, S., and Miasojedow, B. Analysis of Langevin Monte Carlo via convex optimization. Journal of Machine Learning Research, 20(73):1–46, 2019.
  • Foster et al. (2021) Foster, J., Lyons, T., and Oberhauser, H. The shifted ode method for underdamped langevin mcmc. arXiv preprint arXiv:2101.03446, 2021.
  • Gorham & Mackey (2017) Gorham, J. and Mackey, L. Measuring sample quality with kernels. In International Conference on Machine Learning (ICML), pp. 1292–1301, 2017.
  • Gorham et al. (2020) Gorham, J., Raj, A., and Mackey, L. Stochastic stein discrepancies. Advances in Neural Information Processing Systems (NeurIPS), 33:17931–17942, 2020.
  • Hsieh et al. (2018) Hsieh, Y.-P., Kavis, A., Rolland, P., and Cevher, V. Mirrored Langevin dynamics. In Advances in Neural Information Processing Systems (NeurIPS), pp.  2878–2887, 2018.
  • Kassab & Simeone (2020) Kassab, R. and Simeone, O. Federated generalized bayesian learning via distributed stein variational gradient descent. arXiv preprint arXiv:2009.06419, 2020.
  • Korba et al. (2020) Korba, A., Salim, A., Arbel, M., Luise, G., and Gretton, A. A non-asymptotic analysis for Stein variational gradient descent. In Advances in Neural Information Processing Systems (NeurIPS), pp.  4672–4682, 2020.
  • Li et al. (2021) Li, R., Zha, H., and Tao, M. Sqrt (d) dimension dependence of langevin monte carlo. arXiv preprint arXiv:2109.03839, 2021.
  • Liu (2017) Liu, Q. Stein variational gradient descent as gradient flow. In Advances in Neural Information Processing Systems (NIPS), pp.  3115–3123, 2017.
  • Liu & Wang (2016) Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems (NIPS), pp.  2378–2386, 2016.
  • Liu et al. (2016) Liu, Q., Lee, J., and Jordan, M. A kernelized Stein discrepancy for goodness-of-fit tests. In International Conference on Machine Learning (ICML), pp. 276–284, 2016.
  • Liu et al. (2017) Liu, Y., Ramachandran, P., Liu, Q., and Peng, J. Stein variational policy gradient. arXiv preprint arXiv:1704.02399, 2017.
  • Lu et al. (2019) Lu, J., Lu, Y., and Nolen, J. Scaling limit of the Stein variational gradient descent: The mean field regime. SIAM Journal on Mathematical Analysis, 51(2):648–671, 2019.
  • Ma et al. (2019) Ma, Y.-A., Chatterji, N., Cheng, X., Flammarion, N., Bartlett, P. L., and Jordan, M. I. Is there an analog of Nesterov acceleration for MCMC? arXiv preprint arXiv:1902.00996, 2019.
  • Nesterov (2013) Nesterov, Y. E. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
  • Nüsken & Renger (2021) Nüsken, N. and Renger, D. Stein variational gradient descent: many-particle and long-time asymptotics. arXiv preprint arXiv:2102.12956, 2021.
  • Oates et al. (2019) Oates, C. J., Cockayne, J., Briol, F.-X., and Girolami, M. Convergence rates for a class of estimators based on stein’s method. Bernoulli, 25(2):1141–1159, 2019.
  • Pu et al. (2017) Pu, Y., Gan, Z., Henao, R., Li, C., Han, S., and Carin, L. VAE learning via Stein variational gradient descent. In Advances in Neural Information Processing Systems (NIPS), pp.  4236–4245, 2017.
  • Rolland et al. (2020) Rolland, P., Eftekhari, A., Kavis, A., and Cevher, V. Double-loop unadjusted Langevin algorithm. In International Conference on Machine Learning (ICML), pp. 8169–8177, 2020.
  • Salim & Richtárik (2020) Salim, A. and Richtárik, P. Primal dual interpretation of the proximal stochastic gradient Langevin algorithm. In Advances in Neural Information Processing Systems (NeurIPS), pp.  3786–3796, 2020.
  • Shen & Lee (2019) Shen, R. and Lee, Y. T. The randomized midpoint method for log-concave sampling. In Advances in Neural Information Processing Systems (NeurIPS), pp.  2100–2111, 2019.
  • Shi et al. (2021) Shi, J., Liu, C., and Mackey, L. Sampling with mirrored stein operators. arXiv preprint arXiv:2106.12506, 2021.
  • Şimşekli (2017) Şimşekli, U. Fractional Langevin Monte Carlo: Exploring Lévy driven stochastic differential equations for Markov Chain Monte Carlo. In International Conference on Machine Learning (ICML), pp. 3200–3209, 2017.
  • Tao et al. (2019) Tao, C., Dai, S., Chen, L., Bai, K., Chen, J., Liu, C., Zhang, R., Bobashev, G., and Carin, L. Variational annealing of GANs: A Langevin perspective. In International Conference on Machine Learning (ICML), pp. 6176–6185, 2019.
  • Vempala & Wibisono (2019) Vempala, S. and Wibisono, A. Rapid convergence of the unadjusted Langevin algorithm: Isoperimetry suffices. In Advances in Neural Information Processing Systems (NeurIPS), pp.  8092–8104, 2019.
  • Villani (2003) Villani, C. Topics in optimal transportation. Number 58 in Graduate Studies in Mathematics. American Mathematical Society, 2003.
  • Villani (2008) Villani, C. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
  • Wibisono (2018) Wibisono, A. Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem. In Conference on Learning Theory (COLT), pp.  2093–3027, 2018.
  • Zhang et al. (2018) Zhang, R., Li, C., Chen, C., and Carin, L. Learning structural weight uncertainty for sequential decision-making. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp.  1137–1146, 2018.
  • Zhang et al. (2019) Zhang, R., Wen, Z., Chen, C., Fang, C., Yu, T., and Carin, L. Scalable Thompson sampling via optimal transport. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp.  87–96, 2019.
  • Zou et al. (2019) Zou, D., Xu, P., and Gu, Q. Sampling from non-log-concave distributions via variance-reduced gradient Langevin dynamics. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp.  2936–2945, 2019.

Appendix

Appendix A Proof of Proposition 2.2

First, we prove that FF is coercive, i.e., for every C>0C>0, the set S={x𝒳:F(x)C}S=\{x\in{\mathcal{X}}:F(x)\leq C\} is compact. Since FF is continuous, SS is closed. It remains to prove that SS is bounded. Assume, by contradiction, that SS is unbounded. Then, there exists a sequence (xn)(x_{n}) of points in 𝒳{\mathcal{X}} such that F(xn)CF(x_{n})\leq C, xn+\|x_{n}\|\to+\infty and B(xn)B(xm)=B(x_{n})\cap B(x_{m})=\emptyset for every nmn\neq m, where B(x)B(x) denotes the unit ball centered at xx.

Let n0n\geq 0. Using the smoothness of FF (2.1), for every xB(xn)x\in B(x_{n}),

F(x)F(xn)+F(xn),xxn+L2.F(x)\leq F(x_{n})+\langle\nabla F(x_{n}),x-x_{n}\rangle+\frac{L}{2}.

Denote by VV the volume of the unit ball centered at xx, i.e. its Lebesgue measure. The positive number VV does not depend on xx. Then

B(xn)exp(F(x))𝑑x\displaystyle\int_{B(x_{n})}\exp(-F(x))dx \displaystyle\geq B(xn)exp(F(xn)F(xn),xxnL2)𝑑x\displaystyle\int_{B(x_{n})}\exp\left(-F(x_{n})-\langle\nabla F(x_{n}),x-x_{n}\rangle-\frac{L}{2}\right)dx
=\displaystyle= Vexp(F(xn)L2)B(xn)exp(F(xn),xnx)dxV\displaystyle V\exp\left(-F(x_{n})-\frac{L}{2}\right)\int_{B(x_{n})}\exp\left(\langle\nabla F(x_{n}),x_{n}-x\rangle\right)\frac{dx}{V}
=\displaystyle= Vexp(F(xn)L2)B(0)exp(F(xn),u)duV\displaystyle V\exp\left(-F(x_{n})-\frac{L}{2}\right)\int_{B(0)}\exp\left(\langle\nabla F(x_{n}),u\rangle\right)\frac{du}{V}
\displaystyle\geq Vexp(F(xn)L2)exp(B(0)F(xn),uduV)\displaystyle V\exp\left(-F(x_{n})-\frac{L}{2}\right)\exp\left(\int_{B(0)}\langle\nabla F(x_{n}),u\rangle\frac{du}{V}\right)
=\displaystyle= Vexp(F(xn)L2)\displaystyle V\exp\left(-F(x_{n})-\frac{L}{2}\right)
\displaystyle\geq Vexp(CL2),\displaystyle V\exp\left(-C-\frac{L}{2}\right),

where we used Jensen’s inequality for the uniform distribution over B(0)B(0), thanks to the convexity of texp(t)t\mapsto\exp(t). Finally,

exp(F(x))𝑑xn=0B(xn)exp(F(x))𝑑xn=0Vexp(CL2)=+,\int\exp(-F(x))dx\geq\sum_{n=0}^{\infty}\int_{B(x_{n})}\exp(-F(x))dx\geq\sum_{n=0}^{\infty}V\exp\left(-C-\frac{L}{2}\right)=+\infty,

which means that exp(F)\exp(-F) is not integrable. This contradicts the definition of FF and therefore, SS is bounded.

Next, since the set SS is compact, FF is coercive, and hence FF admits a stationary point. Indeed, FF is continuous over the compact set {x𝒳:F(x)1}\{x\in{\mathcal{X}}:F(x)\leq 1\}, and therefore, FF admits a minimizer, xx_{\star}, over this set. Moreover, this point xx_{\star} is a stationary point i.e., F(x)=0\nabla F(x_{\star})=0 (note that the point xx_{\star} is actually a global minimizer of FF).

Appendix B Proof of Proposition 3.1

Let ϕt=Itg\phi_{t}=I-tg for t[0,γ]t\in[0,\gamma] and ρt=(ϕt)#μn\rho_{t}=(\phi_{t})\#\mu_{n}. Note that ρ0=μn\rho_{0}=\mu_{n} and ργ=μn+1\rho_{\gamma}=\mu_{n+1}. First, for every x𝒳x\in{\mathcal{X}},

g(x)2=i=1dk(x,.),gi02k(x,.)02g2B2g2,\|g(x)\|^{2}=\sum_{i=1}^{d}\langle k(x,.),g_{i}\rangle^{2}_{{\mathcal{H}}_{0}}\leq\|k(x,.)\|_{{\mathcal{H}}_{0}}^{2}\|g\|_{{\mathcal{H}}}^{2}\leq B^{2}\|g\|_{{\mathcal{H}}}^{2}, (16)

and

Jg(x)HS2\displaystyle\|Jg(x)\|_{\mathop{\mathrm{HS}}\nolimits}^{2} =i,j=1d|gi(x)xj|2\displaystyle=\sum_{i,j=1}^{d}\left|\frac{\partial g_{i}(x)}{\partial x_{j}}\right|^{2}
=i,j=1dxjk(x,.),gi02\displaystyle=\sum_{i,j=1}^{d}\langle\partial_{x_{j}}k(x,.),g_{i}\rangle_{{\mathcal{H}}_{0}}^{2}
i,j=1dxjk(x,.)02gi02\displaystyle\leq\sum_{i,j=1}^{d}\|\partial_{x_{j}}k(x,.)\|^{2}_{{\mathcal{H}}_{0}}\|g_{i}\|_{{\mathcal{H}}_{0}}^{2}
=k(x,.)2g2\displaystyle=\|\nabla k(x,.)\|^{2}_{{\mathcal{H}}}\|g\|^{2}_{{\mathcal{H}}}
B2g2.\displaystyle\leq B^{2}\|g\|^{2}_{{\mathcal{H}}}. (17)

Hence,

tJg(x)optJg(x)HSγBgα1α<1,\|tJg(x)\|_{\mathop{\mathrm{op}}\nolimits}\leq\|tJg(x)\|_{\mathop{\mathrm{HS}}\nolimits}\leq\gamma B\|g\|_{{\mathcal{H}}}\leq\frac{\alpha-1}{\alpha}<1, (18)

using our assumption on the step size γ\gamma. Inequality (18) proves that ϕt\phi_{t} is a diffeomorphism for every t[0,γ]t\in[0,\gamma]. Moreover,

(Jϕt(x))1opk=0tJg(x)opkk=0(α1α)k=α.\|(J\phi_{t}(x))^{-1}\|_{\mathop{\mathrm{op}}\nolimits}\leq\sum_{k=0}^{\infty}\|tJg(x)\|_{\mathop{\mathrm{op}}\nolimits}^{k}\leq\sum_{k=0}^{\infty}\left(\frac{\alpha-1}{\alpha}\right)^{k}=\alpha. (19)

Using the density of the pushforward formula,

ρt(x)=|det((Jϕt)1)μn|ϕt1.\rho_{t}(x)=\left|\det((J\phi_{t})^{-1})\mu_{n}\right|\circ\phi_{t}^{-1}.

Moreover, det(Jϕt(x))1αd\det(J\phi_{t}(x))^{-1}\leq\alpha^{d} for every x𝒳x\in{\mathcal{X}} using (19). Besides, if μnC0(𝒳)\mu_{n}\in C_{0}({\mathcal{X}}) then μnϕt1C0(𝒳)\mu_{n}\circ\phi_{t}^{-1}\in C_{0}({\mathcal{X}}) using that ϕt\phi_{t} is a diffeomorphism. Therefore, ρtC0(𝒳)\rho_{t}\in C_{0}({\mathcal{X}}) as the product of a C0(𝒳)C_{0}({\mathcal{X}}) function with a bounded function. In particular, μn+1C0(𝒳)\mu_{n+1}\in C_{0}({\mathcal{X}}). By induction, μkC0(𝒳)\mu_{k}\in C_{0}({\mathcal{X}}) for every kk.

Using Villani (2003, Theorem 5.34), the velocity field ruling the time evolution of ρt\rho_{t} is wtL2(ρt)w_{t}\in L^{2}(\rho_{t}) defined by wt(x)=g(ϕt1(x))w_{t}(x)=-g(\phi_{t}^{-1}(x)). Denote φ(t)=(ρt)\varphi(t)={\mathcal{F}}(\rho_{t}). Using a Taylor expansion,

φ(γ)=φ(0)+γφ(0)+0γ(γt)φ′′(t)𝑑t.\varphi(\gamma)=\varphi(0)+\gamma\varphi^{\prime}(0)+\int_{0}^{\gamma}(\gamma-t)\varphi^{\prime\prime}(t)dt. (20)

We now identify each term. First, φ(0)=(μn)\varphi(0)={\mathcal{F}}(\mu_{n}) and φ(γ)=(μn+1)\varphi(\gamma)={\mathcal{F}}(\mu_{n+1}). Using the reproducing property, one can show that

φ(0)=hμn,g.\varphi^{\prime}(0)=-\langle h_{\mu_{n}},g\rangle_{{\mathcal{H}}}. (21)

Moreover, one can show that φ′′(t)=ψ1(t)+ψ2(t)\varphi^{\prime\prime}(t)=\psi_{1}(t)+\psi_{2}(t), where

ψ1(t)=𝔼xρt[wt(x),HF(x)wt(x)]andψ2(t)=𝔼xρt[Jwt(x)HS2].\psi_{1}(t)={{\mathbb{E}}}_{x\sim\rho_{t}}\left[\langle w_{t}(x),H_{F}(x)w_{t}(x)\rangle\right]\quad\text{and}\quad\psi_{2}(t)={{\mathbb{E}}}_{x\sim\rho_{t}}\left[\|Jw_{t}(x)\|_{\mathop{\mathrm{HS}}\nolimits}^{2}\right]. (22)

Recall that wt=g(ϕt)1w_{t}=-g\circ(\phi_{t})^{-1}. The first term ψ1(t)\psi_{1}(t) is bounded using the transfer lemma, 2.1 and Inequality (16):

ψ1(t)=𝔼xμn[g(x),HV(ϕt(x))g(x)]LgL2(μn)2LB2g2.\psi_{1}(t)={{\mathbb{E}}}_{x\sim\mu_{n}}\left[\langle g(x),H_{V}(\phi_{t}(x))g(x)\rangle\right]\leq L\|g\|_{L^{2}(\mu_{n})}^{2}\leq LB^{2}\|g\|_{{\mathcal{H}}}^{2}.

For the second term ψ2(t)\psi_{2}(t), using the chain rule, Jwtϕt=Jg(Jϕt)1-Jw_{t}\circ\phi_{t}=Jg(J\phi_{t})^{-1}. Therefore,

Jwtϕt(x)HS2Jg(x)HS2(Jϕt)1(x)op2α2B2g2,\|Jw_{t}\circ\phi_{t}(x)\|_{\mathop{\mathrm{HS}}\nolimits}^{2}\leq\|Jg(x)\|_{\mathop{\mathrm{HS}}\nolimits}^{2}\|(J\phi_{t})^{-1}(x)\|_{\mathop{\mathrm{op}}\nolimits}^{2}\leq\alpha^{2}B^{2}\|g\|_{{\mathcal{H}}}^{2},

using (B) and (19). Combining each of the quantity in the Taylor expansion (20) gives the desired result.

Appendix C Proof of Theorem 3.2

We start with a Lemma.

Lemma C.1.

Let Assumptions 2.1, 2.5 and 2.6 hold true. Then, for every μ𝒫2(𝒳)\mu\in{\mathcal{P}}_{2}({\mathcal{X}}), we have

hμB(1+F(0)+Lx𝑑π(x))+BL2(μ)λ\|h_{\mu}\|_{{\mathcal{H}}}\leq B\left(1+\|\nabla F(0)\|+L\int\|x\|d\pi(x)\right)+BL\sqrt{\frac{2{\mathcal{F}}(\mu)}{\lambda}}

and

hμB(1+L2(μ0)λ+L2(μ)λ+Lxx𝑑μ0(x)).\|h_{\mu}\|_{{\mathcal{H}}}\leq B\left(1+L\sqrt{\frac{2{\mathcal{F}}(\mu_{0})}{\lambda}}+L\sqrt{\frac{2{\mathcal{F}}(\mu)}{\lambda}}+L\int\|x-x_{\star}\|d\mu_{0}(x)\right).
Proof.

Using 2.6

hμ\displaystyle\|h_{\mu}\|_{{\mathcal{H}}} =𝔼xμ(F(x)Φ(x)Φ(x))\displaystyle=\left\|{{\mathbb{E}}}_{x\sim\mu}\left(\nabla F(x)\Phi(x)-\nabla\Phi(x)\right)\right\|_{{\mathcal{H}}}
𝔼xμF(x)Φ(x)Φ(x)\displaystyle\leq{{\mathbb{E}}}_{x\sim\mu}\left\|\nabla F(x)\Phi(x)-\nabla\Phi(x)\right\|_{{\mathcal{H}}}
𝔼xμF(x)Φ(x)+𝔼xμΦ(x)\displaystyle\leq{{\mathbb{E}}}_{x\sim\mu}\left\|\nabla F(x)\Phi(x)\right\|_{{\mathcal{H}}}+{{\mathbb{E}}}_{x\sim\mu}\left\|\nabla\Phi(x)\right\|_{{\mathcal{H}}}
=𝔼xμF(x)Φ(x)+𝔼xμΦ(x)\displaystyle={{\mathbb{E}}}_{x\sim\mu}\left\|\nabla F(x)\right\|\left\|\Phi(x)\right\|_{{\mathcal{H}}}+{{\mathbb{E}}}_{x\sim\mu}\left\|\nabla\Phi(x)\right\|_{{\mathcal{H}}}
B(𝔼xμF(x)+1).\displaystyle\leq B\left({{\mathbb{E}}}_{x\sim\mu}\left\|\nabla F(x)\right\|+1\right).

Using 2.1 and Proposition 2.2, F(x)F(0)+Lx\|\nabla F(x)\|\leq\|\nabla F(0)\|+L\|x\|. Therefore, using the triangle inequality for the metric W1W_{1},

hμ\displaystyle\|h_{\mu}\|_{{\mathcal{H}}} B(1+F(0)+Lx𝑑μ(x))\displaystyle\leq B\left(1+\|\nabla F(0)\|+L\int\|x\|d\mu(x)\right)
=B(1+F(0)+LW1(μ,δ0))\displaystyle=B\left(1+\|\nabla F(0)\|+LW_{1}(\mu,\delta_{0})\right)
B(1+F(0)+LW1(π,δ0))+BLW1(μ,π).\displaystyle\leq B\left(1+\|\nabla F(0)\|+LW_{1}(\pi,\delta_{0})\right)+BLW_{1}(\mu,\pi).

We obtain the first inequality using 2.5: W1(μ,π)2(μ)λW_{1}(\mu,\pi)\leq\sqrt{\frac{2{\mathcal{F}}(\mu)}{\lambda}}.

To prove the second inequality, recall that hμB(𝔼xμF(x)+1).\|h_{\mu}\|_{{\mathcal{H}}}\leq B\left({{\mathbb{E}}}_{x\sim\mu}\left\|\nabla F(x)\right\|+1\right). Using 2.1 and Proposition 2.2, F(x)=F(x)F(x)Lxx\|\nabla F(x)\|=\|\nabla F(x)-\nabla F(x_{\star})\|\leq L\|x-x_{\star}\|. Therefore, using the triangle inequality for the metric W1W_{1},

xx𝑑μ(x)=W1(μ,δx)\displaystyle\int\|x-x_{\star}\|d\mu(x)=W_{1}(\mu,\delta_{x_{\star}}) W1(μ,π)+W1(π,μ0)+W1(μ0,δx)\displaystyle\leq W_{1}(\mu,\pi)+W_{1}(\pi,\mu_{0})+W_{1}(\mu_{0},\delta_{x_{\star}})
2(μ0)λ+2(μ)λ+W1(μ0,δx).\displaystyle\leq\sqrt{\frac{2{\mathcal{F}}(\mu_{0})}{\lambda}}+\sqrt{\frac{2{\mathcal{F}}(\mu)}{\lambda}}+W_{1}(\mu_{0},\delta_{x_{\star}}).

Therefore,

hμ\displaystyle\|h_{\mu}\|_{{\mathcal{H}}} B(1+Lxx𝑑μ(x))\displaystyle\leq B\left(1+L\int\|x-x_{\star}\|d\mu(x)\right)
B(1+L2(μ0)λ+L2(μ)λ+LW1(μ0,δx)).\displaystyle\leq B\left(1+L\sqrt{\frac{2{\mathcal{F}}(\mu_{0})}{\lambda}}+L\sqrt{\frac{2{\mathcal{F}}(\mu)}{\lambda}}+LW_{1}(\mu_{0},\delta_{x_{\star}})\right). (23)

Besides, Proposition 3.1 can be applied to SVGD by setting g=hμng=h_{\mu_{n}}\in{\mathcal{H}}. In this case, we obtain the following descent property if the step size is small enough.

Lemma C.2.

Let Assumptions 2.1 and 2.6 hold true. Let α>1\alpha>1 and choose γ>0\gamma>0 such that γhμnα1αB\gamma\|h_{\mu_{n}}\|_{{\mathcal{H}}}\leq\frac{\alpha-1}{\alpha B}. Then,

(μn+1)(μn)γ(1γK2)hμn2,{\mathcal{F}}(\mu_{n+1})\leq{\mathcal{F}}(\mu_{n})-\gamma\left(1-\frac{\gamma K}{2}\right)\|h_{\mu_{n}}\|_{{\mathcal{H}}}^{2}, (24)

where K=(α2+L)BK=(\alpha^{2}+L)B.

Contrary to Inequality (11), Inequality (24) is a property of the SVGD algorithm.

Having established Proposition 3.1 and Lemmas C.2 and C.1, we are now ready to formulate and prove our main Theorem 3.2.

Proof.

We now prove by induction the first implication of Theorem 3.2: (12) \Rightarrow (14). First, if γ>0\gamma>0 satisfies (12), then, using Lemma C.1, γhμ0α1αB\gamma\|h_{\mu_{0}}\|_{{\mathcal{H}}}\leq\frac{\alpha-1}{\alpha B}. Therefore, using Lemma C.2,

(μ1)(μ0)γ(1γK2)hμ02,{\mathcal{F}}(\mu_{1})\leq{\mathcal{F}}(\mu_{0})-\gamma\left(1-\frac{\gamma K}{2}\right)\|h_{\mu_{0}}\|_{{\mathcal{H}}}^{2},

i.e., Inequality (14) holds with n=0n=0. Now, assume that the condition (12) implies Inequality (14) for every n{0,,N1}n\in\{0,\ldots,N-1\} and let us prove it for n=Nn=N. First, (μN)(μ0){\mathcal{F}}(\mu_{N})\leq{\mathcal{F}}(\mu_{0}). Letting A:=B(1+F(0)+Lx𝑑π(x))A:=B\left(1+\|\nabla F(0)\|+L\int\|x\|d\pi(x)\right), this implies

A+BL2(μN)λA+BL2(μ0)λ.\displaystyle A+BL\sqrt{\frac{2{\mathcal{F}}(\mu_{N})}{\lambda}}\leq A+BL\sqrt{\frac{2{\mathcal{F}}(\mu_{0})}{\lambda}}.

Therefore, if γ>0\gamma>0 satisfies (12), then γhμNα1αB\gamma\|h_{\mu_{N}}\|_{{\mathcal{H}}}\leq\frac{\alpha-1}{\alpha B}. To see this, using Lemma C.1 we obtain

γhμN\displaystyle\gamma\|h_{\mu_{N}}\|_{{\mathcal{H}}} γ(A+BL2(μN)λ)γ(A+BL2(μ0)λ)α1αB.\displaystyle\leq\gamma\left(A+BL\sqrt{\frac{2{\mathcal{F}}(\mu_{N})}{\lambda}}\right)\leq\gamma\left(A+BL\sqrt{\frac{2{\mathcal{F}}(\mu_{0})}{\lambda}}\right)\leq\frac{\alpha-1}{\alpha B}.

Therefore, using Lemma C.2, the condition (12) implies Inequality (14) at step n=Nn=N:

(μN+1)(μN)γ(1γK2)hμN2.{\mathcal{F}}(\mu_{N+1})\leq{\mathcal{F}}(\mu_{N})-\gamma\left(1-\frac{\gamma K}{2}\right)\|h_{\mu_{N}}\|_{{\mathcal{H}}}^{2}.

Finally, it remains to recall that hμN2=KSD2(μN|π)\|h_{\mu_{N}}\|_{{\mathcal{H}}}^{2}=\mathop{\mathrm{KSD}}\nolimits^{2}(\mu_{N}|\pi). The proof of the second implication of Theorem 3.2, (13) \Rightarrow (14), is similar. ∎

Appendix D Proof of Corollary 4.3

Using Corollary 4.2, if

γmin((α1)(αB2(1+2L2(μ0)λ+Lxxdμ0(x))))1,2B(α2+L)),\gamma\leq\min\left((\alpha-1)\left(\alpha B^{2}\left(1+2L\sqrt{\frac{2{\mathcal{F}}(\mu_{0})}{\lambda}}+L\int\|x-x_{\star}\|d\mu_{0}(x))\right)\right)^{-1},\frac{2}{B(\alpha^{2}+L)}\right),

then,

KSD2(μ¯n|π)2(μ0)nγ.\mathop{\mathrm{KSD}}\nolimits^{2}(\bar{\mu}_{n}|\pi)\leq\frac{2{\mathcal{F}}(\mu_{0})}{n\gamma}.

Using Vempala & Wibisono (2019, Lemma 1), (μ0)F(x)+d2log(L2π){\mathcal{F}}(\mu_{0})\leq F(x_{\star})+\frac{d}{2}\log\left(\frac{L}{2\pi}\right). Besides,

xx𝑑μ0(x)=𝔼Xμ0Xx=1L𝔼Xμ0L(Xx),\int\|x-x_{\star}\|d\mu_{0}(x)={{\mathbb{E}}}_{X\sim\mu_{0}}\|X-x_{\star}\|=\frac{1}{\sqrt{L}}{{\mathbb{E}}}_{X\sim\mu_{0}}\|\sqrt{L}(X-x_{\star})\|,

and using the transfer lemma and Cauchy-Schwartz inequality,

xx𝑑μ0(x)=1L𝔼Y𝒩(0,I)Y1L(𝔼Y𝒩(0,I)Y2)1/2=dL.\int\|x-x_{\star}\|d\mu_{0}(x)=\frac{1}{\sqrt{L}}{{\mathbb{E}}}_{Y\sim{\mathcal{N}}(0,I)}\|Y\|\leq\frac{1}{\sqrt{L}}\left({{\mathbb{E}}}_{Y\sim{\mathcal{N}}(0,I)}\|Y\|^{2}\right)^{1/2}=\sqrt{\frac{d}{L}}.

Therefore,

(α1)(αB2(1+2L2(μ0)λ+Lxxdμ0(x))))1\displaystyle(\alpha-1)\left(\alpha B^{2}\left(1+2L\sqrt{\frac{2{\mathcal{F}}(\mu_{0})}{\lambda}}+L\int\|x-x_{\star}\|d\mu_{0}(x))\right)\right)^{-1}
\displaystyle\geq (α1)(αB2(1+2L2λF(x)+d2log(L2π)+Ld))1\displaystyle(\alpha-1)\left(\alpha B^{2}\left(1+2L\sqrt{\frac{2}{\lambda}}\sqrt{F(x_{\star})+\frac{d}{2}\log\left(\frac{L}{2\pi}\right)}+\sqrt{Ld}\right)\right)^{-1}
=\displaystyle= Ω~(1Ldλ+Ld),\displaystyle\tilde{\Omega}\left(\frac{1}{\frac{L\sqrt{d}}{\sqrt{\lambda}}+\sqrt{Ld}}\right),

and

γ1=𝒪~(Ldλ+Ld+L)=𝒪~(Ldλ).\gamma^{-1}=\tilde{{\mathcal{O}}}\left(\frac{L\sqrt{d}}{\sqrt{\lambda}}+\sqrt{Ld}+L\right)=\tilde{{\mathcal{O}}}\left(\frac{L\sqrt{d}}{\sqrt{\lambda}}\right).

Since (μ0)=𝒪~(d){\mathcal{F}}(\mu_{0})=\tilde{{\mathcal{O}}}(d),

(μ0)γ=𝒪~(Ld3/2λ).\frac{{\mathcal{F}}(\mu_{0})}{\gamma}=\tilde{{\mathcal{O}}}\left(\frac{L{d}^{3/2}}{\sqrt{\lambda}}\right).

Let ε>0\varepsilon>0. To output the mixture μ¯n\bar{\mu}_{n} such that KSD2(μ¯n|π)<ε\mathop{\mathrm{KSD}}\nolimits^{2}(\bar{\mu}_{n}|\pi)<\varepsilon, it suffices to ensure that 2(μ0)nγ<ε\frac{2{\mathcal{F}}(\mu_{0})}{n\gamma}<\varepsilon. Therefore, n=2(μ0)γε=Ω~(Ld3/2ελ)n=\frac{2{\mathcal{F}}(\mu_{0})}{\gamma\varepsilon}=\tilde{\Omega}\left(\frac{L{d}^{3/2}}{\varepsilon\sqrt{\lambda}}\right) iterations suffice.