
The equivalence between Stein variational gradient descent
and black-box variational inference

Casey Chu    Kentaro Minami    Kenji Fukumizu
Abstract

We formalize an equivalence between two popular methods for Bayesian inference: Stein variational gradient descent (SVGD) and black-box variational inference (BBVI). In particular, we show that BBVI corresponds precisely to SVGD when the kernel is the neural tangent kernel. Furthermore, we interpret SVGD and BBVI as kernel gradient flows; we do this by leveraging the recent perspective that views SVGD as a gradient flow in the space of probability distributions and showing that BBVI naturally motivates a Riemannian structure on that space. We observe that kernel gradient flow also describes dynamics found in the training of generative adversarial networks (GANs). This work thereby unifies several existing techniques in variational inference and generative modeling and identifies the kernel as a fundamental object governing the behavior of these algorithms, motivating deeper analysis of its properties.


1 Introduction

The goal of Bayesian inference is to compute the posterior P(x|z) over a variable of interest x. In principle, this posterior may be computed from the prior P(x) and the likelihood P(z|x) of observing data z, using the equation

p(x):=P(x|z)=\frac{P(z|x)P(x)}{\int P(z|x)P(x)\,dx}. (1)

We denote the posterior by p(x) for notational convenience. Unfortunately, the integral in the denominator is usually intractable, which motivates variational inference techniques that approximate the true posterior p(x) with an approximate posterior q(x), often by minimizing the KL divergence \mathrm{KL}(q(x)\,\|\,p(x)). In this paper, we consider two popular variational inference techniques, black-box variational inference (Ranganath et al., 2014) and Stein variational gradient descent (Liu & Wang, 2016), and show that they are equivalent when viewed as instances of kernel gradient flow.

2 Stein variational gradient descent

Stein variational gradient descent (Liu & Wang, 2016), or SVGD, is a technique for Bayesian inference that approximates the true posterior p(x) with a set of particles x_{1},\ldots,x_{n}.

In the continuous-time limit of small step size, each particle undergoes the update rule

\frac{dx_{i}}{dt}=\mathbb{E}_{y\sim q_{t}}[k(x_{i},y)\nabla_{y}\log p(y)+\nabla_{y}k(x_{i},y)], (2)

where q_{t} denotes the empirical distribution of particles at time t:

q_{t}=\frac{1}{n}\sum_{i=1}^{n}\delta_{x_{i}(t)}, (3)

and k(x,y) is a user-specified kernel function, such as the RBF kernel k(x,y)=e^{-||x-y||^{2}}.
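As a concrete illustration, here is a minimal NumPy sketch of a discretized version of the update (2) with the RBF kernel above; the step size, particle initialization, and standard-normal target are arbitrary choices for illustration.

```python
import numpy as np

def rbf_kernel(x, y):
    """RBF kernel k(x, y) = exp(-||x - y||^2) and its gradient with respect to y."""
    diff = x - y
    k = np.exp(-np.sum(diff ** 2))
    return k, 2.0 * diff * k   # grad_y exp(-||x - y||^2) = 2 (x - y) k(x, y)

def svgd_step(particles, score_p, step_size=0.1):
    """One discretized step of the SVGD dynamics (2) for an (n, d) array of particles.

    score_p returns the score grad_x log p(x) of the target posterior.
    """
    n = particles.shape[0]
    velocities = np.zeros_like(particles)
    for i, x_i in enumerate(particles):
        for y in particles:                      # expectation over the empirical q_t
            k, grad_y_k = rbf_kernel(x_i, y)
            velocities[i] += k * score_p(y) + grad_y_k
    return particles + step_size * velocities / n

# Toy usage: target p = N(0, I) in 2D, whose score is -x.
rng = np.random.default_rng(0)
particles = rng.normal(size=(50, 2)) + 3.0
for _ in range(200):
    particles = svgd_step(particles, score_p=lambda x: -x)
print(particles.mean(axis=0))                    # drifts toward the target mean 0
```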

In the mean-field limit as n\to\infty (Lu et al., 2019), an equivalent form of the dynamics (2) is obtained by an application of Stein’s identity (integration by parts on the second term):

\frac{dx}{dt}=\mathbb{E}_{y\sim q_{t}}[k(x,y)\nabla_{y}(\log p(y)-\log q_{t}(y))]. (4)
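Concretely, the integration-by-parts step behind this equivalence is, assuming q_{t} admits a smooth density and boundary terms vanish,

-\mathbb{E}_{y\sim q_{t}}[k(x,y)\nabla_{y}\log q_{t}(y)]=-\int k(x,y)\nabla_{y}q_{t}(y)\,dy=\int\nabla_{y}k(x,y)\,q_{t}(y)\,dy=\mathbb{E}_{y\sim q_{t}}[\nabla_{y}k(x,y)],

which is exactly the second term of (2).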

3 Black-box variational inference

Black-box variational inference (Ranganath et al., 2014), or BBVI, is another technique for Bayesian inference that approximates the true posterior p(x) with an approximate posterior q_{\phi}(x), where q_{\phi} belongs to a family of distributions parameterized by \phi. In BBVI, we maximize the evidence lower bound, or ELBO, objective

L(\phi):=\mathbb{E}_{x\sim q_{\phi}}\Big[\log\frac{P(z|x)P(x)}{q_{\phi}(x)}\Big] (5)

by gradient ascent on \phi. This procedure effectively minimizes the KL divergence between q_{\phi}(x) and the true posterior p(x)=P(x|z), since the KL divergence and the ELBO objective differ only by the log-evidence \log P(z), which is constant w.r.t. \phi:

\mathrm{KL}(q_{\phi}(x)\,\|\,p(x))=\log P(z)-L(\phi). (6)
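As a sanity check of (5) and (6), the following sketch estimates the ELBO by Monte Carlo for a toy conjugate model and compares it against the log-evidence \log P(z); the prior, likelihood, observation, and variational parameters below are illustrative assumptions.

```python
import numpy as np

def log_normal(x, mean, std):
    return -0.5 * np.log(2 * np.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

# Toy model: prior x ~ N(0, 1), likelihood z | x ~ N(x, 1), observed z = 1.0,
# and approximate posterior q_phi = N(m, s).
z_obs, m, s = 1.0, 0.3, 0.9
rng = np.random.default_rng(0)
xs = m + s * rng.standard_normal(100_000)                  # samples from q_phi

log_joint = log_normal(z_obs, xs, 1.0) + log_normal(xs, 0.0, 1.0)   # log P(z|x) + log P(x)
elbo = np.mean(log_joint - log_normal(xs, m, s))           # Monte Carlo estimate of (5)
log_evidence = log_normal(z_obs, 0.0, np.sqrt(2.0))        # P(z) = N(z; 0, 2) for this model

print(elbo, log_evidence)   # ELBO <= log P(z), with equality only when q_phi is the posterior
```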

Our claim is:

Claim 1.

The sequence of approximate posteriors generated by BBVI, when the reparameterization trick of Kingma & Welling (2014) is used, is governed by the SVGD dynamics (4), where the kernel k is the neural tangent kernel of Jacot et al. (2018).

To see this, we observe that the evolution of the parameters \phi under gradient ascent is governed by

\frac{d\phi}{dt}=\nabla_{\phi}L(\phi). (7)

Next, we specialize to the case where the family of approximate posteriors is parameterized via the reparameterization trick (Kingma & Welling, 2014). That is, suppose that there exists a fixed distribution \omega and a parameterized function f_{\phi} such that the following two sampling methods result in the same distribution over x:

x\sim q_{\phi}\iff\varepsilon\sim\omega\text{ and }x=f_{\phi}(\varepsilon). (8)

As an example, the family of normal distributions \mathcal{N}(\mu,\sigma) may be reparameterized as

x\sim\mathcal{N}(\mu,\sigma)\iff\varepsilon\sim\mathcal{N}(0,1)\text{ and }x=\mu+\sigma\varepsilon. (9)

In this setting, Roeder et al. (2017) and Feng et al. (2017) noted that

\nabla_{\phi}L(\phi)=\mathbb{E}_{w\sim\omega}[\nabla_{\phi}f_{\phi}(w)\cdot\nabla_{y}(\log p(y)-\log q_{\phi}(y))\big|_{y=f_{\phi}(w)}]. (10)
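The identity (10) translates directly into a simple gradient estimator. The sketch below runs the gradient ascent (7) for the scalar family q_{\phi}=\mathcal{N}(\mu,\sigma) reparameterized as in (9); the target posterior \mathcal{N}(2,1), step size, and batch size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = -1.0, 0.5                      # phi = (mu, sigma), with f_phi(eps) = mu + sigma * eps

def score_p(y):                            # grad_y log p(y) for the assumed target p = N(2, 1)
    return -(y - 2.0)

def score_q(y, mu, sigma):                 # grad_y log q_phi(y)
    return -(y - mu) / sigma ** 2

for _ in range(2000):
    w = rng.standard_normal(256)           # w ~ omega = N(0, 1)
    y = mu + sigma * w                     # y = f_phi(w)
    delta = score_p(y) - score_q(y, mu, sigma)
    grad_mu = np.mean(delta)               # df_phi/dmu = 1
    grad_sigma = np.mean(w * delta)        # df_phi/dsigma = eps
    mu, sigma = mu + 0.01 * grad_mu, sigma + 0.01 * grad_sigma

print(mu, sigma)                           # approaches the target's mean 2 and std 1
```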

Now, we consider the dynamics of a sample x=f_{\phi}(\varepsilon) under the parameter dynamics (7). By the chain rule, we have that

\frac{dx}{dt}=(\nabla_{\phi}f_{\phi}(\varepsilon))^{T}\frac{d\phi}{dt}. (11)

Let us introduce the neural tangent kernel of Jacot et al. (2018)

\Theta_{\phi}(\varepsilon,w):=(\nabla_{\phi}f_{\phi}(\varepsilon))^{T}\nabla_{\phi}f_{\phi}(w), (12)

and define

k_{\phi}(x,y):=\Theta_{\phi}(f^{-1}_{\phi}(x),f^{-1}_{\phi}(y)), (13)

making the additional assumption that \varepsilon\mapsto f_{\phi}(\varepsilon) is injective. Note that if x\in\mathbb{R}^{n}, then \Theta_{\phi}(\varepsilon,w) and k_{\phi}(x,y) are both n-by-n matrices that depend on \phi. Then, substituting (7) and (10) into (11), we find that the samples satisfy

\frac{dx}{dt}=(\nabla_{\phi}f_{\phi}(\varepsilon))^{T}\frac{d\phi}{dt} (14)
=\mathbb{E}_{w\sim\omega}[\Theta_{\phi}(\varepsilon,w)\,\nabla_{y}(\log p(y)-\log q_{\phi}(y))\big|_{y=f_{\phi}(w)}] (15)
=\mathbb{E}_{y\sim q_{\phi}}[k_{\phi}(x,y)\,\nabla_{y}(\log p(y)-\log q_{\phi}(y))]. (16)

Comparing (16) with the SVGD dynamics (4), we find an exact correspondence between SVGD and BBVI, in which the BBVI kernel is given by (13) and hence determined by the neural tangent kernel.

3.1 Example: a Gaussian variational family

As an example, consider the family of multivariate normal distributions \mathcal{N}(\mu,\Sigma), parameterized by an invertible matrix A and a vector \mu, with the relation \Sigma=AA^{T}. This variational family is reparameterizable with

x\sim\mathcal{N}(\mu,\Sigma)\iff\varepsilon\sim\mathcal{N}(0,I)\text{ and }x=\mu+A\varepsilon. (17)

In this setting, the kernel (13) becomes

k(x,y)=(1+(x-\mu)^{T}\Sigma^{-1}(y-\mu))I, (18)

where I is the identity matrix. In the continuous-time and many-particle limit, BBVI with the parameterization (17) produces the same sequence of approximate posteriors as SVGD with the kernel (18). Figure 1 compares the sequence of approximate posteriors generated by BBVI and SVGD with the theoretically equivalent kernel (18) in fitting a bimodal 2D distribution; we see that the agreement is quite close.

It is instructive to perform the computation of (18) explicitly. We use index notation with the Einstein summation convention, in which indices that appear twice are implicitly summed over. We have that f_{i}(\varepsilon)=\mu_{i}+A_{ik}\varepsilon_{k} and

\frac{\partial f_{i}(\varepsilon)}{\partial\mu_{\ell}}=\delta_{i\ell},\quad\frac{\partial f_{i}(\varepsilon)}{\partial A_{\ell m}}=\delta_{i\ell}\delta_{km}\varepsilon_{k}, (19)

so that the neural tangent kernel is

\Theta_{ij}(\varepsilon,w)=\frac{\partial f_{i}(\varepsilon)}{\partial\mu_{\ell}}\frac{\partial f_{j}(w)}{\partial\mu_{\ell}}+\frac{\partial f_{i}(\varepsilon)}{\partial A_{\ell m}}\frac{\partial f_{j}(w)}{\partial A_{\ell m}} (20)
=\delta_{i\ell}\delta_{j\ell}+\delta_{i\ell}\delta_{km}\varepsilon_{k}\,\delta_{j\ell}\delta_{om}w_{o} (21)
=\delta_{ij}+\delta_{ij}\varepsilon_{m}w_{m}, (22)

or \Theta(\varepsilon,w)=(1+\varepsilon\cdot w)I in vector notation. Then, using the definition (13) and substituting f^{-1}(x)=A^{-1}(x-\mu) and \Sigma=AA^{T}, we arrive at (18).
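This computation can also be checked numerically. The sketch below builds the Jacobian of f_{\phi}(\varepsilon)=\mu+A\varepsilon with respect to \phi=(\mu,A), forms the neural tangent kernel (12), and verifies that it agrees with (18); the particular \mu, A, \varepsilon, and w are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
mu = rng.standard_normal(d)
A = rng.standard_normal((d, d)) + 2 * np.eye(d)        # invertible with probability one
Sigma = A @ A.T

def jacobian_f(eps):
    """Jacobian of f(eps) = mu + A eps with respect to phi = (mu, A), shape (d, d + d*d)."""
    J_mu = np.eye(d)                                   # df_i/dmu_l = delta_{il}
    J_A = np.kron(np.eye(d), eps.reshape(1, -1))       # df_i/dA_{lm} = delta_{il} eps_m
    return np.hstack([J_mu, J_A])

eps, w = rng.standard_normal(d), rng.standard_normal(d)
Theta = jacobian_f(eps) @ jacobian_f(w).T              # neural tangent kernel (12)
x, y = mu + A @ eps, mu + A @ w
k_claimed = (1 + (x - mu) @ np.linalg.solve(Sigma, y - mu)) * np.eye(d)

print(np.allclose(Theta, (1 + eps @ w) * np.eye(d)))   # True, matching (22)
print(np.allclose(Theta, k_claimed))                   # True, matching (18)
```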

Figure 1: The sequence of approximate posteriors obtained by BBVI and SVGD with the theoretically equivalent kernel.

4 Motivating a Riemannian structure

In the previous section, we found that SVGD and BBVI both correspond to particle dynamics of the form

\frac{dx}{dt}=\mathbb{E}_{y\sim q_{\phi}}[k_{\phi}(x,y)\,\nabla_{y}(\log p(y)-\log q_{\phi}(y))]. (23)

One peculiar feature of the BBVI dynamics is that the kernel k_{\phi} depends on the current parameter \phi, rather than being constant as the approximate posterior q_{\phi} changes, as in the SVGD case.

In fact, we argue that this feature of BBVI is quite natural:

Claim 2.

The requirement in BBVI that the kernel depend on the current distribution naturally motivates a Riemannian structure on the space of probability distributions.

To develop this claim, let us first review Euclidean and Riemannian gradient flows. In Euclidean space, following the negative gradient of a function J:\mathbb{R}^{n}\to\mathbb{R} according to

\frac{dx}{dt}=-\nabla J(x) (24)

can lead to a minimizer of JJ. Analogously, on a Riemannian manifold MM, following the negative Riemannian gradient of a function J:MJ:M\to\mathbb{R} according to

\frac{dx}{dt}=-G(x)^{-1}\nabla J(x), (25)

can lead to a minimizer of J. Here, G is a positive-definite matrix-valued function called the Riemannian metric, which defines the local geometry at x and perturbs the Euclidean gradient \nabla J pointwise. Note that in the case that G(x) is the identity matrix for all x, Riemannian gradient flow reduces to the Euclidean gradient flow.
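For concreteness, here is a minimal sketch of a discretized Riemannian gradient flow (25); the objective J and the position-dependent metric G below are arbitrary illustrative choices, not ones prescribed in this paper.

```python
import numpy as np

def grad_J(x):                 # Euclidean gradient of J(x) = 0.5 ||x||^2
    return x

def G(x):                      # an arbitrary positive-definite, position-dependent metric
    return np.diag(1.0 + x ** 2)

x = np.array([2.0, -3.0])
for _ in range(500):
    x = x - 0.05 * np.linalg.solve(G(x), grad_J(x))    # discretization of dx/dt = -G(x)^{-1} grad J(x)
print(x)                       # approaches the minimizer 0 of J
```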

Next, we review Wasserstein gradient flows, which generalize gradient flows to the space of probability distributions (Ambrosio et al., 2008). Here, we consider the set of all probability distributions over a particular space formally as an “infinite-dimensional” manifold \mathcal{P}, and we consider a function J:\mathcal{P}\to\mathbb{R}. In variational inference, the most relevant such function is the KL divergence J(q):=\mathrm{KL}(q\,\|\,p), where we are interested in finding an approximate posterior that minimizes J. As before, a minimizer of J may be obtained by following the analogue of a gradient; the trajectory of the distribution q turns out to take the form of the PDE

\frac{\partial q}{\partial t}=\nabla\cdot(q\nabla\Psi_{q}). (26)

Here, \nabla\Psi_{q} serves as the correct analogue of the gradient of J evaluated at q, and it turns out that \Psi_{q}(x)=\log q(x)-\log p(x) for the variational inference case J(q):=\mathrm{KL}(q\,\|\,p). This function \Psi_{q} is known variously as the functional derivative, first variation, or von Mises influence function.
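To see where this expression comes from, perturb q in a direction \chi with \int\chi\,dx=0 and differentiate:

\frac{d}{d\epsilon}\,\mathrm{KL}(q+\epsilon\chi\,\|\,p)\Big|_{\epsilon=0}=\int\chi(x)\big(\log q(x)-\log p(x)+1\big)\,dx=\int\chi(x)\big(\log q(x)-\log p(x)\big)\,dx,

so \Psi_{q}(x)=\log q(x)-\log p(x) up to an additive constant, which does not affect the gradient \nabla\Psi_{q} appearing in (26).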

Now, we review the recent perspective that SVGD can be interpreted as a generalized Wasserstein gradient flow under the Stein geometry (Liu, 2017; Liu et al., 2019; Duncan et al., 2019). We follow the presentation of Duncan et al. (2019) and refer to it for a rigorous treatment. To set the stage, we take a non-parametric view of the dynamics (23), in which the dependence on \phi is interpreted as dependence on the distribution q itself:

\frac{dx}{dt}=\mathbb{E}_{y\sim q}[k(x,y)\,\nabla_{y}(\log p(y)-\log q(y))]. (27)

Substituting \Psi_{q}(y)=\log q(y)-\log p(y) and introducing the linear operator T_{q} defined by

(T_{q}\varphi)(x):=\mathbb{E}_{y\sim q}[k(x,y)\varphi(y)], (28)

we have

\frac{dx}{dt}=-(T_{q}\nabla\Psi_{q})(x). (29)

Under these dynamics, the probability distribution q evolves according to the PDE

\frac{\partial q}{\partial t}=\nabla\cdot(qT_{q}\nabla\Psi_{q}). (30)
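This follows from the continuity equation: particles transported by the velocity field v(x):=-(T_{q}\nabla\Psi_{q})(x) of (29) have a density that evolves as

\frac{\partial q}{\partial t}=-\nabla\cdot(qv)=\nabla\cdot(q\,T_{q}\nabla\Psi_{q}),

which is (30).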

Comparing (30) with (26), we see that (30) defines a modified gradient flow in which the gradient \nabla\Psi_{q} is perturbed by the operator T_{q}.

We now advocate for generalizing Wasserstein gradient flow in the same way that Riemannian gradient flow generalizes Euclidean gradient flow. The operator T_{q} perturbs the gradient in a way analogous to how the Riemannian metric perturbs the Euclidean gradient in (25), so T_{q} defines an analogue of a Riemannian metric on \mathcal{P}. However, there is no fundamental reason that T_{q} must have the restrictive form prescribed by (28). Indeed, because T_{q} is analogous to the Riemannian metric G(x), it is natural to let the kernel, whose action defines the operator T_{q}, depend on the current value of q. It is also natural to allow the kernel to output a matrix rather than a scalar so that T_{q} may mix all components of \varphi. Duncan et al. (2019) in fact speculate on these possibilities (Remarks 17 and 1).

With these considerations in mind, we propose replacing (28) with

(T_{q}\varphi)(x):=\mathbb{E}_{y\sim q}[k_{q}(x,y)\varphi(y)], (31)

where the kernel k_{q} now depends on q and outputs a matrix. This defines a gradient flow by (30) that we will refer to as kernel gradient flow. To further the analogy between Euclidean and Riemannian gradient flow on the one hand and Wasserstein and kernel gradient flow on the other, note that just as setting the Riemannian metric to the identity matrix for all x reduces Riemannian gradient flow to Euclidean gradient flow, setting T_{q} to the identity operator for all q reduces kernel gradient flow to Wasserstein gradient flow. The special “Euclidean” Riemannian metric obtained this way is the central object of the Otto calculus (Otto, 2001; Ambrosio et al., 2008).
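As an illustration of (31), the sketch below applies T_{q} by Monte Carlo with a distribution-dependent, matrix-valued kernel, namely the Gaussian-family kernel (18) with the mean and covariance of q estimated from samples; the test function, sampling distribution, and evaluation point are arbitrary.

```python
import numpy as np

def apply_T_q(samples, phi_fn, x):
    """Monte Carlo estimate of (T_q phi)(x) = E_{y~q}[ k_q(x, y) phi(y) ] as in (31)."""
    mu = samples.mean(axis=0)
    Sigma_inv = np.linalg.inv(np.cov(samples, rowvar=False))
    out = np.zeros_like(x)
    for y in samples:
        k_q = (1.0 + (x - mu) @ Sigma_inv @ (y - mu)) * np.eye(len(x))   # kernel (18), q-dependent
        out += k_q @ phi_fn(y)
    return out / len(samples)

rng = np.random.default_rng(0)
samples = rng.multivariate_normal([0.0, 0.0], [[2.0, 0.5], [0.5, 1.0]], size=1000)
print(apply_T_q(samples, phi_fn=lambda y: -y, x=np.array([1.0, 0.0])))
```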

Once T_{q} has the form (31), BBVI may naturally be regarded as an instance of kernel gradient flow, in which the kernel k_{q} is the neural tangent kernel which depends on the current distribution q. More abstractly, we see that the neural tangent kernel defines a Riemannian metric on the space of probability distributions. We summarize the perspective that this framework gives on variational inference:

Claim 3.

SVGD updates generate a kernel gradient flow of the loss function J(q):=\mathrm{KL}(q\,\|\,p), with a Riemannian metric determined by the user-specified kernel.

Claim 4.

BBVI updates generate a kernel gradient flow of the loss function J(q):=\mathrm{KL}(q\,\|\,p), with a Riemannian metric determined by the neural tangent kernel of f_{\phi}.

5 Beyond variational inference: GANs as kernel gradient flow

We now argue that the kernel gradient flow perspective we have developed describes not only SVGD and BBVI but also the training dynamics of generative adversarial networks (Goodfellow et al., 2014).

Generative adversarial networks, or GANs, are a technique for learning a generator distribution q_{\phi} that mimics an empirical data distribution p_{\mathrm{data}}. The generator distribution q_{\phi} is defined implicitly as the distribution obtained by sampling from a fixed distribution \omega, often a standard normal, and running the sample through a neural network f_{\phi} called the generator. The learning process is facilitated by another neural network D_{\theta} called the discriminator that takes a sample and outputs a real number, and is trained to distinguish between a real sample from p_{\mathrm{data}} and a fake sample from q_{\phi}. The generator and discriminator are trained simultaneously until the discriminator is unable to distinguish between real and fake samples, at which point the generator distribution q_{\phi} hopefully mimics the data distribution p_{\mathrm{data}}.

For many GAN variants, the rule to update the generator parameters \phi can be expressed in the continuous-time limit as

\frac{d\phi}{dt}=\nabla_{\phi}\mathbb{E}_{w\sim\omega}[D_{\theta}(f_{\phi}(w))], (32)

or by the chain rule,

\frac{d\phi}{dt}=\mathbb{E}_{w\sim\omega}[\nabla_{\phi}f_{\phi}(w)\cdot\nabla_{y}D_{\theta}(y)\big|_{y=f_{\phi}(w)}]. (33)
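As a minimal sketch, the update (33) can be discretized for an affine generator and a fixed quadratic discriminator held constant during training; both choices are illustrative assumptions rather than a full GAN.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2
mu, A = np.zeros(d), np.eye(d)             # generator f_phi(w) = mu + A w, with phi = (mu, A)
c = np.array([3.0, -1.0])                  # the fixed discriminator D(y) = -||y - c||^2 peaks at c

def grad_D(y):                             # grad_y D(y)
    return -2.0 * (y - c)

for _ in range(500):
    w = rng.standard_normal((256, d))      # w ~ omega
    y = mu + w @ A.T                       # y = f_phi(w)
    g = grad_D(y)
    grad_mu = g.mean(axis=0)                                  # E_w[ (df/dmu)^T grad_y D ]
    grad_A = (g[:, :, None] * w[:, None, :]).mean(axis=0)     # E_w[ grad_y D(y) w^T ]
    mu, A = mu + 0.01 * grad_mu, A + 0.01 * grad_A

print(mu)                                  # the generated samples drift toward c
```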

The discriminator parameters \theta are updated simultaneously to minimize a separate discriminator loss L_{\text{disc}}(\theta,\phi), but it is common for theoretical purposes to assume that the discriminator achieves optimality at every training step. Denoting this optimal discriminator as -\Psi_{\phi} (i.e. setting \Psi_{\phi}:=-D_{\theta^{*}} for \theta^{*}:=\operatorname*{arg\,min}_{\theta}L_{\text{disc}}(\theta,\phi)), we have

\frac{d\phi}{dt}=-\mathbb{E}_{w\sim\omega}[\nabla_{\phi}f_{\phi}(w)\cdot\nabla_{y}\Psi_{\phi}(y)\big|_{y=f_{\phi}(w)}]. (34)

This matches the BBVI update rule (10) with \log q_{\phi}(y)-\log p(y) replaced by \Psi_{\phi}(y). Hence, analogous to (16), the generated points x satisfy

\frac{dx}{dt}=-\mathbb{E}_{y\sim q_{\phi}}[k_{\phi}(x,y)\,\nabla_{y}\Psi_{\phi}(y)], (35)

where k_{\phi} is defined as in (13) by the neural tangent kernel of the generator f_{\phi}. Finally, it was observed that the optimal discriminator \Psi_{q} of the minimax GAN equals the functional derivative of the Jensen–Shannon divergence (Chu et al., 2019, 2020); hence we conclude:

Claim 5.

Minimax GAN updates generate a kernel gradient flow of the Jensen–Shannon divergence J(q):=\mathrm{D}_{\mathrm{JS}}(p_{\text{data}},q), with a Riemannian metric determined by the neural tangent kernel of the generator f_{\phi}.

Similarly, non-saturating and Wasserstein GAN updates generate kernel gradient flows on the directed divergence J(q):=\mathrm{KL}(\frac{1}{2}p_{\text{data}}+\frac{1}{2}q\,\|\,p_{\text{data}}) and the Wasserstein-1 distance J(q):=W_{1}(p_{\text{data}},q), respectively.

6 Conclusion

We have cast SVGD and BBVI, as well as the dynamics of GANs, into the same theoretical framework of kernel gradient flow, thus identifying an area ripe for further study.

References

  • Ambrosio et al. (2008) Ambrosio, L., Gigli, N., and Savaré, G. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2008.
  • Chu et al. (2019) Chu, C., Blanchet, J., and Glynn, P. Probability functional descent: A unifying perspective on GANs, variational inference, and reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 1213–1222, 2019.
  • Chu et al. (2020) Chu, C., Minami, K., and Fukumizu, K. Smoothness and stability in GANs. In International Conference on Learning Representations, 2020.
  • Duncan et al. (2019) Duncan, A., Nüsken, N., and Szpruch, L. On the geometry of Stein variational gradient descent. arXiv preprint arXiv:1912.00894, 2019.
  • Feng et al. (2017) Feng, Y., Wang, D., and Liu, Q. Learning to draw samples with amortized Stein variational gradient descent. In UAI, 2017.
  • Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
  • Jacot et al. (2018) Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pp. 8571–8580, 2018.
  • Kingma & Welling (2014) Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
  • Liu et al. (2019) Liu, C., Zhuo, J., Cheng, P., Zhang, R., and Zhu, J. Understanding and accelerating particle-based variational inference. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 4082–4092, 2019.
  • Liu (2017) Liu, Q. Stein variational gradient descent as gradient flow. In Advances in Neural Information Processing Systems, pp. 3115–3123, 2017.
  • Liu & Wang (2016) Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in neural information processing systems, pp. 2378–2386, 2016.
  • Lu et al. (2019) Lu, J., Lu, Y., and Nolen, J. Scaling limit of the Stein variational gradient descent: The mean field regime. SIAM Journal on Mathematical Analysis, 51(2):648–671, 2019.
  • Otto (2001) Otto, F. The geometry of dissipative evolution equations: the porous medium equation. Communications in Partial Differential Equations, 26(1-2):101–174, 2001.
  • Ranganath et al. (2014) Ranganath, R., Gerrish, S., and Blei, D. Black box variational inference. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, pp.  814–822, 2014.
  • Roeder et al. (2017) Roeder, G., Wu, Y., and Duvenaud, D. K. Sticking the landing: Simple, lower-variance gradient estimators for variational inference. In Advances in Neural Information Processing Systems, pp. 6925–6934, 2017.