
Augmented Message Passing Stein Variational Gradient Descent

Jiankui Zhou
School of Information Science and Technology
ShanghaiTech University
[email protected]
Yue Qiu
College of Mathematics and Statistics
Chongqing University
[email protected]
Funded by NSF China under No. 12101407
Abstract

Stein Variational Gradient Descent (SVGD) is a popular particle-based method for Bayesian inference. However, its convergence suffers from variance collapse, which reduces the accuracy and diversity of the estimate. In this paper, we study the isotropy of finite particle systems during convergence and show that finite-particle SVGD cannot spread across the entire sample space. Instead, all particles tend to cluster around the particle center within a certain range, and we provide an analytical bound for this cluster. To further improve the effectiveness of SVGD for high-dimensional problems, we propose the Augmented Message Passing SVGD (AUMP-SVGD) method, a two-stage optimization procedure that, unlike MP-SVGD, does not require sparsity of the target distribution. Our algorithm achieves satisfactory accuracy and overcomes the variance collapse problem on various benchmark problems.

1 Introduction

Stein variational gradient descent (SVGD), proposed by Liu and Wang (2016), is a non-parametric inference method. To approximate an intractable but differentiable target distribution, it constructs a set of particles that can be initialized from any initial distribution. These particles move along directions in the Reproducing Kernel Hilbert Space (RKHS) determined by the kernel function. SVGD drives the particles in the direction along which the KL divergence between the two distributions decreases most rapidly. SVGD is often more efficient than traditional Markov chain Monte Carlo (MCMC) methods because its particles are driven toward the target distribution by a deterministic dynamical process. These advantages make SVGD appealing, and it has attracted considerable research interest (Liu et al. (2022); Ba et al. (2021); Zhuo et al. (2018); Salim et al. (2022); Yan and Zhou (2021a)).

Although SVGD succeeds in many applications (Liu (2017); Yoon et al. (2018); Yan and Zhou (2021b)), it lacks theoretical convergence guarantees in the finite-particle regime. The convergence of SVGD is guaranteed only under the mean-field assumption, i.e., the particles converge to the true distribution when their number is infinite (Liu (2017); Salim et al. (2022)). The convergence of SVGD with finitely many particles is still an open problem. Furthermore, it has been observed that as the dimension of the problem increases, the variance estimated by SVGD may be much smaller than that of the target distribution. This phenomenon is known as variance collapse, and it limits the applicability of SVGD for the following reasons. First, an underestimated variance fails to explain the uncertainty of the model. Second, Bayesian inference is usually high-dimensional in practice, and this curse of dimensionality makes SVGD inapplicable in some scenarios. For example, training Bayesian neural networks (BNNs) (MacKay (1992)) requires inferring posterior distributions over millions of network weights (Krizhevsky et al. (2017)). More recently, structure prediction for long proteins requires inference over the position of every atom, which again results in a high-dimensional problem (Wu et al. (2022)).

The first contribution of this paper is to show that the particles of SVGD at convergence do not spread across the whole probability space but remain confined to a finite range. We give an analytic bound for this clustered region. This bounded distribution of particles is an indication of the curse of dimensionality. In addition, we provide an estimate of the error between the covariance of finite particles and the true covariance.

There have been many efforts to make SVGD applicable to high-dimensional problems. According to Zhuo et al. (2018), the magnitude of the repulsive force between particles is inversely proportional to the dimension of the problem. Reducing the dimension of the problem is therefore the key to addressing variance collapse; such methods include combining the Grassmann manifold with matrix decompositions to reduce the dimension of the target distribution (Chen and Ghattas (2020); Liu et al. (2022)). Another approach is to find the Markov blanket for each dimension of the target distribution so that the global kernel function can be replaced by local kernel functions. This improves the efficiency of SVGD, and the resulting method is called message passing SVGD (MP-SVGD) (Zhuo et al. (2018); Wang et al. (2018)). However, MP-SVGD needs to know the probabilistic graphical model structure in advance and is efficient only when the graph is sparse. Moreover, identifying the Markov blanket for high-dimensional problems is challenging.

The second contribution of this paper is that we overcome these shortcomings of MP-SVGD and extend it to high-dimensional problems. Building on the results of our variance analysis, we propose the so-called Augmented MP-SVGD (AUMP-SVGD). AUMP-SVGD decomposes the problem dimensions into three parts via a factorization of the KL divergence. Different from MP-SVGD, AUMP-SVGD adopts a two-stage update procedure to remove the dependence on sparse probabilistic graphical models. It therefore alleviates variance collapse and does not require prior knowledge of the graph structure. We show the superiority of AUMP-SVGD over state-of-the-art algorithms both theoretically and experimentally.

2 Preliminaries

2.1 SVGD

SVGD approximates an intractable target distribution p(\mathbf{x}), where \mathbf{x}=[x_{1},\ldots,x_{D}]^{T}\in\mathcal{X}\subseteq\mathbb{R}^{D}, with the best q^{*}(\mathbf{x}) by minimizing the Kullback-Leibler (KL) divergence (Liu and Wang (2016)). Here, D is the dimension of the target distribution and

q^{*}(\mathbf{x})=\mathop{\arg\min}\limits_{q(\mathbf{x})\in\mathcal{Q}}\mathrm{KL}(q(\mathbf{x})\|p(\mathbf{x})).

SVGD takes a group of particles \{\mathbf{x}_{i}\}_{i=1}^{m} from an initial distribution q_{0}(\mathbf{x}), and after a series of smooth transforms, these particles finally converge to the target distribution p(\mathbf{x}). Each smooth transformation can be expressed as \mathbf{T}(\mathbf{x})=\mathbf{x}+\epsilon\Phi(\mathbf{x}), where \epsilon is the step size and \Phi:\mathcal{X}\to\mathbb{R}^{D} is the transformation direction. Here, \mathcal{X} is the sample space that contains the particles. Let \mathcal{H} denote the RKHS induced by a positive definite kernel k\left(\cdot,\cdot\right) and denote \mathcal{H}^{D}=\mathcal{H}\times\cdots\times\mathcal{H}, where \times is the Cartesian product. The steepest descent direction is obtained by minimizing the KL divergence,

\min_{\|\phi\|_{\mathcal{H}^{D}}\leq 1}\nabla_{\epsilon}\mathrm{KL}\left(q_{[T]}\|p\right)\big|_{\epsilon=0}=-\max_{\|\phi\|_{\mathcal{H}^{D}}\leq 1}\mathbb{E}_{\mathbf{x}\sim q}\left[\mathcal{A}_{p}\phi(\mathbf{x})\right],

where q_{[T]} denotes the distribution of particles from q after applying the map \mathbf{T}(\mathbf{x}), and \mathcal{A}_{p} is the Stein operator given by \mathcal{A}_{p}\phi(\mathbf{x})=\nabla_{\mathbf{x}}\log p(\mathbf{x})\phi(\mathbf{x})^{T}+\nabla_{\mathbf{x}}\phi(\mathbf{x}). SVGD updates the particles \{\mathbf{x}_{i}\}_{i=1}^{m} drawn from the initial distribution q_{0}(\mathbf{x}) by

\mathbf{x}_{n}^{\ell+1}=\mathbf{x}_{n}^{\ell}+\epsilon_{\ell}\hat{\phi}_{\ell}^{*}\left(\mathbf{x}_{n}^{\ell}\right),\quad n=1,\ldots,m,\quad\ell=0,1,\ldots

The steepest descent direction is given by

\hat{\phi}_{\ell}^{*}\left(\mathbf{x}_{n}^{\ell}\right)=\frac{1}{m}\sum_{i=1}^{m}\left[k\left(\mathbf{x}_{i}^{\ell},\mathbf{x}_{n}^{\ell}\right)\nabla_{\mathbf{x}_{i}^{\ell}}\log p\left(\mathbf{x}_{i}^{\ell}\right)+\nabla_{\mathbf{x}_{i}^{\ell}}k\left(\mathbf{x}_{i}^{\ell},\mathbf{x}_{n}^{\ell}\right)\right]. (1)

The kernel function can be chosen as the RBF kernel k(\mathbf{x},\mathbf{y})=\exp(-\|\mathbf{x}-\mathbf{y}\|_{2}^{2}/(2h)) (Liu and Wang (2016)) or the IMQ kernel k(\mathbf{x},\mathbf{y})=1/\sqrt{1+\|\mathbf{x}-\mathbf{y}\|_{2}^{2}/(2h)} (Gorham and Mackey (2017)). Equation (1) can be divided into two parts: the driving force term k\left(\boldsymbol{x}^{\prime},\boldsymbol{x}\right)\nabla_{\boldsymbol{x}^{\prime}}\log p\left(\boldsymbol{x}^{\prime}\right) and the repulsive force term \nabla_{\boldsymbol{x}^{\prime}}k\left(\boldsymbol{x}^{\prime},\boldsymbol{x}\right). It has been demonstrated that SVGD suffers from the curse of dimensionality (Zhuo et al. (2018); Liu (2017)): there is a negative correlation between the problem dimension and the repulsive force of SVGD (Ba et al. (2021)), so the influence of the repulsive force is mainly governed by the dimension of the target distribution.
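To make the update concrete, the following is a minimal NumPy sketch of one SVGD iteration with the RBF kernel; the function names (svgd_step, grad_logp) and the median-heuristic bandwidth are our own illustrative choices, not part of the original algorithm specification.

```python
import numpy as np

def svgd_step(X, grad_logp, eps=1e-2):
    # One SVGD update following Equation (1): driving force plus repulsive force.
    m, D = X.shape
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    h = 0.5 * np.median(sq_dists) / np.log(m + 1)    # median-heuristic bandwidth (an assumption)
    K = np.exp(-sq_dists / (2.0 * h))                # RBF kernel matrix k(x_i, x_n)
    scores = grad_logp(X)                            # row i holds grad log p(x_i)
    driving = K @ scores                             # sum_i k(x_i, x_n) grad log p(x_i)
    repulsive = (K.sum(axis=1, keepdims=True) * X - K @ X) / h  # sum_i grad_{x_i} k(x_i, x_n)
    return X + eps * (driving + repulsive) / m

# Example: zero-mean Gaussian target N(0, I), for which grad log p(x) = -x.
X = np.random.randn(100, 5) + 10.0
for _ in range(500):
    X = svgd_step(X, lambda Y: -Y)
```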

MP-SVGD. The Message Passing SVGD (MP-SVGD) (Zhuo et al. (2018); Wang et al. (2018)) was proposed to reduce the dimension of the target distribution by identifying the Markov blanket for problems with a known graph structure. For a dimension index d, its Markov blanket \Gamma_{d}=\cup\{F:d\in F\}\backslash\{d\} contains the neighborhood nodes of d such that p\left(x_{d}\mid\mathbf{x}_{\neg d}\right)=p\left(x_{d}\mid\mathbf{x}_{\Gamma_{d}}\right), where the union is over the factor index sets F\subseteq\{1,\ldots,D\} and \neg d=\{1,\ldots,D\}\backslash\{d\}. However, MP-SVGD relies on sparse dependencies among the variables of the target distribution to obtain good results.
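As a small illustration of the Markov blanket construction above, the sketch below computes \Gamma_{d} from a list of factor index sets; the chain-graph factors are a hypothetical example.

```python
def markov_blanket(d, factor_sets):
    # Gamma_d = union of all factor index sets F that contain d, minus {d}.
    return set().union(*(set(F) for F in factor_sets if d in F)) - {d}

# Hypothetical chain graph x0 - x1 - x2 - x3 with pairwise factors:
factors = [(0, 1), (1, 2), (2, 3)]
print(markov_blanket(1, factors))   # {0, 2}
print(markov_blanket(0, factors))   # {1}
```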

2.2 Mixing for random variables

Since SVGD forms a dynamical system in which particles interact with each other, one can no longer treat the converged particles as independent and identically distributed. Therefore, we need the mathematical tool of “mixing”; cf. Bradley (2005) for more information.

Mixing. Let \left\{\mathbf{x}_{m},m\geq 1\right\} be a sequence of random variables over some probability space (\Omega,F,P), where \sigma(\mathbf{x}_{i},i\leq j) denotes the \sigma-algebra generated by \{\mathbf{x}_{i},i\leq j\}. For any two \sigma-fields \sigma\left(\mathbf{x}_{i},i\leq j\right) and \sigma\left(\mathbf{x}_{i},i\geq j+k\right) with any k\geq 1, let

\beta_{0}=1,\quad\beta_{k}=\sup_{j\geq 1}\beta\left(\sigma\left(\mathbf{x}_{i},i\leq j\right),\sigma\left(\mathbf{x}_{i},i\geq j+k\right)\right), (2)

where

\beta(\mathcal{A},\mathcal{B})=\frac{1}{2}\sup\left\{\sum_{i\in I}\sum_{j\in J}\left|\mathbb{P}\left(A_{i}\cap B_{j}\right)-\mathbb{P}\left(A_{i}\right)\mathbb{P}\left(B_{j}\right)\right|\right\}

and the supremum is taken over all finite partitions \left(A_{i}\right)_{i\in I} and \left(B_{j}\right)_{j\in J} of \Omega. A family \left\{\mathbf{x}_{m},m\geq 1\right\} of random variables is said to be absolutely regular (or \beta-mixing) if \lim_{m\to\infty}\beta_{m}=0, where the coefficients of absolute regularity \beta_{m} are defined in Equation (2) (Banna et al. (2016)). These coefficients quantify the strength of dependence between the \sigma-algebra generated by (\mathbf{x}_{i})_{1\leq i\leq k} and the one generated by (\mathbf{x}_{i})_{i\geq k+m} for all k\in\mathbb{N}^{*}. That \beta_{m} tends to zero as m goes to infinity implies that the \sigma-algebra generated by (\mathbf{x}_{i})_{i\geq k+m} becomes less and less dependent on \sigma\left(\mathbf{x}_{i},i\leq k\right) (Bradley (2005)).

3 Covariance Analysis under β-mixing

Ba et al. (2021) analyzed the convergence of SVGD when the covariance matrix of the Gaussian target is an identity matrix, under an assumption of near-orthogonality. On the basis of this work, we analyze the more general form of variance collapse. Moreover, the quantification of variance collapse for finite particles is, to the best of our knowledge, still an open problem. Chewi et al. (2020) show that SVGD can be viewed as a kernelized Wasserstein gradient flow of the chi-squared divergence, which suggests that extending our analysis in this section to related methods such as Wasserstein gradient flows or normalizing flows is possible.

3.1 Assumptions

A1 (Fixed points). Although the convergence of SVGD with finite particles is still an open problem, many experimental studies observe that all particles converge to fixed points (Ba et al. (2021)). In this paper, we also assume that for SVGD with a finite number of particles, the particles eventually converge to the target distribution and approach fixed points.

A2 (m-dependence of particles)

Definition 1

(Hoeffding and Robbins (1948)) If for some function f(n), the inequality s-r\geq f(n)\geq 0 implies that the two sets (x_{1},...,x_{r}) and (x_{s},...,x_{m}) are independent, then the sequence \left\{x_{i}\right\}_{i=1}^{m} is said to be f(n)-dependent.

We assume that the fixed points of SVGD satisfy the m-dependence assumption. In Ba et al. (2021), only weak correlation between SVGD particles is reported. We perform a numerical verification in Appendix B and leave a rigorous proof for future work. Under Assumptions A1-A2 and the assumption that the target distribution has zero mean, we analyze the variance collapse of SVGD quantitatively in what follows.

3.2 Concentration of particles

We first give an upper bound on the region in which the particles concentrate. The main tool used here is the Jensen gap (Gao et al. (2019)).

Proposition 1

Let Assumption A1 hold. For a zero-mean Gaussian target and the Gaussian RBF kernel k(\mathbf{x}_{i},\mathbf{x}_{j})=\exp\left(-\frac{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}}{2h}\right), we have

\|\mathbf{x}_{i}\|_{2}\leq(2M/c+1)\left(\mathbb{E}\|\mathbf{x}\|_{2}^{2}\right)^{1/2},

where M=\sup_{\mathbf{x}\neq 0}\frac{\left|k(\mathbf{x}_{i},\mathbf{x})-k(\mathbf{x}_{i},0)\right|}{2\|\mathbf{x}\|_{2}}, c=\frac{2}{h}e^{-\frac{c_{0}\mathrm{tr}(\Sigma_{m})}{h}}, 1<c_{0}\leq m is a positive constant, and \Sigma_{m} is the empirical covariance matrix of the particles.

The proof is given in Appendix A. For most sampling-based inference methods, such as MCMC and VI, samples can spread across the whole sample space, although some samples occur with extremely small probability. Proposition 1, however, shows that the particles of SVGD with finitely many particles are confined to a certain range, although this range may expand as the number of particles increases. With the RBF kernel, this upper bound is related to the trace of the covariance of the target distribution. With the IMQ kernel k(\mathbf{x},\mathbf{y})=1/\sqrt{1+\|\mathbf{x}-\mathbf{y}\|_{2}^{2}/(2h)}, this range is further reduced (Gorham and Mackey (2017)).
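The bound can be evaluated numerically from a set of converged particles. Below is a small sketch, assuming Assumption A1 holds for the input particles and replacing the supremum M by the conservative Lipschitz-constant surrogate e^{-1/2}/(2\sqrt{h}) of the RBF kernel; the choice c_{0}=2 and the function name are illustrative, not the paper's exact constants.

```python
import numpy as np

def proposition1_bound(X, h, c0=2.0):
    # X: (m, D) particles assumed to have converged to fixed points (Assumption A1).
    # Bound from Proposition 1: ||x_i||_2 <= (2M/c + 1) * (E||x||_2^2)^{1/2}, with
    #   c = (2/h) * exp(-c0 * tr(Sigma_m) / h),
    # and M replaced by exp(-1/2) / (2*sqrt(h)), a surrogate based on the Lipschitz
    # constant of the RBF kernel (an assumption, not the paper's exact constant).
    Sigma_m = np.cov(X, rowvar=False)
    M = np.exp(-0.5) / (2.0 * np.sqrt(h))
    c = (2.0 / h) * np.exp(-c0 * np.trace(Sigma_m) / h)
    second_moment = np.mean(np.sum(X ** 2, axis=1))   # sample proxy for E||x||_2^2
    bound = (2.0 * M / c + 1.0) * np.sqrt(second_moment)
    return np.max(np.linalg.norm(X, axis=1)), bound

X = np.random.randn(50, 5)            # stand-in for converged particles
print(proposition1_bound(X, h=1.0))   # (max particle norm, theoretical bound)
```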

3.3 Covariance estimation

For independent and identically distributed (i.i.d.) samples, the variance can be estimated using the Bernstein inequality or the Lieb inequality. However, these random matrix results typically require the particles to be i.i.d., which is no longer satisfied by SVGD due to its interacting update. To analyze the variance of particles from SVGD, we assume these particles are m-dependent. For a sequence of random variables, the m-dependence assumption implies that they satisfy the \beta-mixing condition (Bradley (2005)). We obtain the following results based on the Bernstein inequality for dependent random matrices (Banna et al. (2016)).

Proposition 2

Let Assumptions A1-A2 hold. For SVGD with the Gaussian RBF kernel k(\mathbf{x}_{i},\mathbf{x}_{j})=\exp\left(-\frac{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}}{2h}\right), denote \mathbb{X}_{i}=\mathbf{x}_{i}\mathbf{x}_{i}^{T}-\Sigma, where \Sigma is the covariance matrix of the target distribution. There exists \alpha>0 such that for any integer k\geq 1 the following inequality holds,

\beta_{k}=\sup_{j\geq 1}\beta\left(\sigma\left(\mathbf{x}_{i},i\leq j\right),\sigma\left(\mathbf{x}_{i},i\geq j+k\right)\right)\leq\mathrm{e}^{-\alpha(k-1)}.

Denote the empirical covariance matrix of these m particles by \Sigma_{m}; then

\mathbb{E}\|\Sigma_{m}-\Sigma\|\leq 30v\sqrt{\frac{\log D}{m}}+\frac{2K^{2}\mathrm{tr}(\Sigma)}{m}\left(4\alpha^{-1/2}\sqrt{\log D}+\gamma(\alpha,m)\log D\right),

where K=(2M/c+1) and M, c are identical to those in Proposition 1. Here, \alpha measures the correlation between particles, and D is the dimension of the target distribution. Moreover,

v^{2}=\sup_{N\subseteq\{1,\ldots,m\}}\frac{1}{|N|}\lambda_{\max}\left(\mathbb{E}\Big(\sum_{i\in N}\mathbb{X}_{i}\Big)^{2}\right).

Here, \lambda_{\max}\left(\mathbb{X}\right) denotes the eigenvalue of \mathbb{X} with the maximum magnitude, |N| is the cardinality of the set N, and

\gamma(\alpha,m)=\frac{\log m}{\log 2}\max\left(2,\frac{32\log m}{\alpha\log 2}\right).

Proposition 2 shows that the main factors affecting the upper bound on the covariance error are the inter-particle correlation, the number of particles, the dimension of the target distribution, and the trace of its covariance matrix. According to Ba et al. (2021), \mathrm{tr}(\Sigma_{m})\leq\mathrm{tr}(\Sigma) can be expected to hold for SVGD, so the constant c in Proposition 1 can be replaced by \frac{2}{h}e^{-\frac{c_{0}\mathrm{tr}(\Sigma)}{h}}.
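For reference, a short sketch that evaluates the right-hand side of the bound in Proposition 2; the constants v, K, and \alpha are problem-dependent and are passed in as assumed values here (the numbers in the example call are illustrative only).

```python
import numpy as np

def gamma(alpha, m):
    # gamma(alpha, m) = (log m / log 2) * max(2, 32 log m / (alpha log 2)).
    return (np.log(m) / np.log(2)) * max(2.0, 32.0 * np.log(m) / (alpha * np.log(2)))

def proposition2_bound(v, K, alpha, tr_sigma, m, D):
    # Right-hand side of Proposition 2: 30 v sqrt(log D / m)
    #   + (2 K^2 tr(Sigma) / m) * (4 alpha^{-1/2} sqrt(log D) + gamma(alpha, m) log D).
    term1 = 30.0 * v * np.sqrt(np.log(D) / m)
    term2 = (2.0 * K ** 2 * tr_sigma / m) * (
        4.0 * alpha ** (-0.5) * np.sqrt(np.log(D)) + gamma(alpha, m) * np.log(D)
    )
    return term1 + term2

print(proposition2_bound(v=1.0, K=2.0, alpha=1.0, tr_sigma=2.0, m=5000, D=2))
```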

4 Augmented MP-SVGD

Here, we propose the so-called Augmented Message Passing SVGD (AUMP-SVGD) to overcome the covariance underestimation of SVGD. Compared with MP-SVGD, AUMP-SVGD requires neither a known graph structure nor sparsity of the target distribution.

4.1 MP-SVGD

The update direction \Delta of SVGD is given by

\Delta^{\mathrm{SVGD}}(\boldsymbol{x})=\mathbb{E}_{\boldsymbol{x}^{\prime}\sim q}\Big[\underbrace{k\left(\boldsymbol{x}^{\prime},\boldsymbol{x}\right)\nabla_{\boldsymbol{x}^{\prime}}\log p\left(\boldsymbol{x}^{\prime}\right)}_{\text{driving force}}+\underbrace{\nabla_{\boldsymbol{x}^{\prime}}k\left(\boldsymbol{x}^{\prime},\boldsymbol{x}\right)}_{\text{repulsive force}}\Big].

The log-derivative term \mathbb{E}_{\boldsymbol{x}^{\prime}\sim q}\left[k(\boldsymbol{x}^{\prime},\boldsymbol{x})\nabla_{\boldsymbol{x}^{\prime}}\log p(\boldsymbol{x}^{\prime})\right] in the update rule corresponds to the driving force that guides particles toward the high-likelihood region. The (Stein) score function \nabla_{\boldsymbol{x}^{\prime}}\log p\left(\boldsymbol{x}^{\prime}\right) is a vector field describing the target distribution, and \nabla_{\boldsymbol{x}^{\prime}}k\left(\boldsymbol{x}^{\prime},\boldsymbol{x}\right) provides a repulsive force that prevents particles from aggregating. However, as the dimension of the target distribution increases, this repulsive force gradually decreases (Zhuo et al. (2018)), which causes SVGD to fall under the curse of dimensionality. Effectively reducing the dimension is therefore the guiding principle for making SVGD overcome this curse.

Our concern is the problem with a continuous graphical model, i.e., a target distribution of the form p(\mathbf{x})\propto\exp\left[\sum_{s\in\mathcal{S}}\psi\left(\mathbf{x}_{s}\right)\right], where \mathcal{S} is the family of index sets s\subseteq\{1,\ldots,D\} that specifies the Markov structure. For any index i, its Markov blanket is \mathcal{N}_{i}:=\cup\{s:s\in\mathcal{S},i\in s\}\backslash\{i\}. According to Wang et al. (2018), one can replace the global kernel function with a local kernel function that depends only on \mathcal{C}_{i}:=\mathcal{N}_{i}\cup\{i\} for any i. Under this transformation, the dimension is reduced from D to |\mathcal{C}_{i}|, where |\mathcal{C}_{i}| is the size of the set \mathcal{C}_{i}. Then, SVGD becomes

\mathbf{x}_{i}^{\ell}\leftarrow\mathbf{x}_{i}^{\ell}+\epsilon\phi^{*}\left(\mathbf{x}_{i}^{\ell}\right),\quad\forall i\in\{1,\ldots,D\},\ \ell\in\{1,\ldots,m\}, (3a)
\phi^{*}(\mathbf{x}_{i}^{\ell}):=\frac{1}{m}\sum_{j=1}^{m}s\left(x_{i}^{j}\right)k_{i}\left(\mathbf{x}_{\mathcal{C}_{i}}^{j},\mathbf{x}_{\mathcal{C}_{i}}^{\ell}\right)+\partial_{x_{i}^{j}}k_{i}\left(\mathbf{x}_{\mathcal{C}_{i}}^{j},\mathbf{x}_{\mathcal{C}_{i}}^{\ell}\right), (3b)

where s(x_{i}^{j})=\nabla_{x_{i}^{j}}\log p(\mathbf{x}^{j}) is the score with respect to the i-th coordinate. Such a method is known as message passing SVGD (MP-SVGD) (Zhuo et al. (2018); Liu (2017)). However, MP-SVGD needs to know the graph structure of the target distribution in advance to determine the Markov blanket, and the graph must be sparse in order to achieve dimension reduction.
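The sketch below illustrates one MP-SVGD sweep for a zero-mean Gaussian target \mathcal{N}(0,\Sigma), whose score is -\Sigma^{-1}\mathbf{x}; the blankets are assumed to be known, and the fixed bandwidth and function name are our own illustrative choices.

```python
import numpy as np

def mp_svgd_step(X, blankets, Sigma_inv, eps=1e-2, h=1.0):
    # One MP-SVGD sweep over coordinates, following Equation (3).
    # blankets[i] is the Markov blanket N_i of coordinate i; local kernels act on
    # x_{C_i} with C_i = N_i + {i}. The target is N(0, Sigma), score = -Sigma_inv @ x.
    m, D = X.shape
    scores = -X @ Sigma_inv.T                      # row n holds grad log p(x_n)
    X_new = X.copy()
    for i in range(D):
        C = sorted(set(blankets[i]) | {i})
        Xc = X[:, C]                               # particles restricted to C_i
        sq = np.sum((Xc[:, None, :] - Xc[None, :, :]) ** 2, axis=-1)
        K = np.exp(-sq / (2.0 * h))                # local RBF kernel on x_{C_i}
        xi = X[:, i]
        driving = K @ scores[:, i]                 # sum_j k_i(x^j, x^n) s(x_i^j)
        repulsive = (K.sum(axis=1) * xi - K @ xi) / h
        X_new[:, i] = xi + eps * (driving + repulsive) / m
    return X_new
```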

4.2 Augmented MP-SVGD

Inspired by MP-SVGD, we propose the augmented MP-SVGD, which is suitable for more complex graph structures. We keep the previous symbols but redefine them for clarity. We assume p(\mathbf{x}) can be factorized as p(\mathbf{x})\propto\prod_{F\in\mathcal{F}}\psi_{F}\left(\mathbf{x}_{F}\right), where F\subset\{1,\ldots,D\} is an index set. We partition \{1,\ldots,D\}\backslash\{d\} into two sets \Gamma_{d} and \mathbf{S_{d}} such that \Gamma_{d}\cap\mathbf{S_{d}}=\emptyset and \Gamma_{d}\cup\mathbf{S_{d}}=\{1,\ldots,D\}\backslash\{d\}. Let \mathbf{x}_{\mathbf{S_{d}}}=[x_{i},\ldots,x_{j}] with i,\ldots,j\in\mathbf{S_{d}}, and similarly \mathbf{x}_{\Gamma_{d}}=[x_{i},\ldots,x_{j}] with i,\ldots,j\in\Gamma_{d}.


Figure 1: Probabilistic Graphical Model

Consider the probabilistic graphical model illustrated in Figure 1; p(\mathbf{x}) can be represented as p(\mathbf{x})=p(x_{d}|\mathbf{x}_{\Gamma_{d}},\mathbf{x}_{\mathbf{S_{d}}})p(\mathbf{x}_{\Gamma_{d}}|\mathbf{x}_{\mathbf{S_{d}}})p(\mathbf{x}_{\mathbf{S_{d}}}). Our method relies on the key observation of Zhuo et al. (2018) that

\mathrm{KL}(q\|p)=\underbrace{\mathrm{KL}\left(q\left(\mathbf{x}_{\Gamma_{d}},\mathbf{x}_{\mathbf{S_{d}}}\right)\|p\left(\mathbf{x}_{\Gamma_{d}},\mathbf{x}_{\mathbf{S_{d}}}\right)\right)}_{\text{prKL}(q,p,d)}+\underbrace{\mathrm{KL}\left(q\left(x_{d}\mid\mathbf{x}_{\neg d}\right)q\left(\mathbf{x}_{\neg d}\right)\|p\left(x_{d}\mid\mathbf{x}_{\neg d}\right)p\left(\mathbf{x}_{\neg d}\right)\right)}_{\text{seKL}(q,p,d)}.

To minimize \mathrm{KL}(q\|p), we adopt a two-stage procedure in which \text{prKL}(q,p,d) and \text{seKL}(q,p,d) are optimized alternately. At the first stage, \text{prKL}(q,p,d) is further decomposed into

\text{prKL}(q,p,d)=\mathrm{KL}\left(q\left(\mathbf{x}_{\mathbf{S_{d}}}\right)\|p\left(\mathbf{x}_{\mathbf{S_{d}}}\right)\right)+\mathrm{KL}\left(q\left(\mathbf{x}_{\Gamma_{d}}\mid\mathbf{x}_{\mathbf{S_{d}}}\right)q\left(\mathbf{x}_{\mathbf{S_{d}}}\right)\|p\left(\mathbf{x}_{\Gamma_{d}}\mid\mathbf{x}_{\mathbf{S_{d}}}\right)p\left(\mathbf{x}_{\mathbf{S_{d}}}\right)\right). (4)

We fix \mathbf{x}_{\mathbf{S_{d}}} and apply a local kernel function to the second term of Equation (4) to minimize \text{prKL}(q,p,d),

\mathbf{x}_{\Gamma_{d}}=\mathop{\arg\min}\limits_{q\left(\mathbf{x}_{\Gamma_{d}}\mid\mathbf{x}_{\mathbf{S_{d}}}\right)}\mathrm{KL}\left(q\left(\mathbf{x}_{\Gamma_{d}}\mid\mathbf{x}_{\mathbf{S_{d}}}\right)q\left(\mathbf{x}_{\mathbf{S_{d}}}\right)\|p\left(\mathbf{x}_{\Gamma_{d}}\mid\mathbf{x}_{\mathbf{S_{d}}}\right)p\left(\mathbf{x}_{\mathbf{S_{d}}}\right)\right).

This optimization procedure is given by Proposition 3, whose proof is left to the appendix.

Proposition 3

Let T(\mathbf{x})=\left[x_{1},\ldots,T_{\Gamma_{d}}\left(\mathbf{x}_{\Gamma_{d}}\right),\ldots,x_{D}\right]^{T}, where T_{\Gamma_{d}}:\ \mathbf{x}_{\Gamma_{d}}\rightarrow\mathbf{x}_{\Gamma_{d}}+\epsilon\phi_{\Gamma_{d}}\left(\mathbf{x}_{\neg d}\right) and \phi_{\Gamma_{d}}\in\mathcal{H}_{\Gamma_{d}}. Here \mathcal{H}_{\Gamma_{d}} is the RKHS defined by the local kernel function k_{\Gamma_{d}}:\ \mathcal{X}_{\neg d}\times\mathcal{X}_{\neg d}\rightarrow\mathbb{R}. The optimal solution of the optimization problem

\mathop{\min}\limits_{\left\|\phi_{\Gamma_{d}}\right\|\leq 1}\left.\nabla_{\epsilon}\mathrm{KL}\left(q_{[T]}\left(\mathbf{x}_{\Gamma_{d}},\mathbf{x}_{\mathbf{S_{d}}}\right)\|p\left(\mathbf{x}_{\Gamma_{d}},\mathbf{x}_{\mathbf{S_{d}}}\right)\right)\right|_{\epsilon=0},

is given by \phi_{\Gamma_{d}}^{*}/\left\|\phi_{\Gamma_{d}}^{*}\right\|_{\mathcal{H}_{\Gamma_{d}}}, where

\phi_{\Gamma_{d}}^{*}\left(\mathbf{x}_{\neg d}\right)=\mathbb{E}_{\mathbf{y}_{\neg d}\sim q}\left[k_{\Gamma_{d}}\left(\mathbf{x}_{\neg d},\mathbf{y}_{\neg d}\right)\nabla_{\mathbf{y}_{\Gamma_{d}}}\log p\left(\mathbf{y}_{\Gamma_{d}}\mid\mathbf{y}_{\neg d}\right)+\nabla_{\mathbf{y}_{\Gamma_{d}}}k_{\Gamma_{d}}\left(\mathbf{x}_{\neg d},\mathbf{y}_{\neg d}\right)\right].

At the second stage, \mathbf{x}_{\Gamma_{d}} and \mathbf{x}_{\mathbf{S_{d}}} are fixed while only x_{d} is updated. We can further decompose \mathrm{seKL}(q,p,d) into three parts via the convexity of the KL divergence,

0\leq\mathrm{seKL}(q,p,d)\leq\mathrm{KL}\left[\frac{q(x_{d}\mid\mathbf{x}_{\Gamma_{d}})}{q(\mathbf{x}_{\neg d})}\|\frac{p(x_{d}\mid\mathbf{x}_{\Gamma_{d}})}{p(\mathbf{x}_{\neg d})}\right]+\mathrm{KL}\left[\frac{q(x_{d}\mid\mathbf{x}_{\mathbf{S_{d}}})}{q(\mathbf{x}_{\neg d})}\|\frac{p(x_{d}\mid\mathbf{x}_{\mathbf{S_{d}}})}{p(\mathbf{x}_{\neg d})}\right]+C,

where C=\mathrm{KL}\left[\frac{1}{q(\mathbf{x}_{\neg d})}\|\frac{1}{p(\mathbf{x}_{\neg d})}\right] is a positive constant because \mathbf{x}_{\neg d}=\mathbf{x}_{\Gamma_{d}}\cup\mathbf{x}_{\mathbf{S_{d}}} is fixed. Therefore,

x_{d}=\mathop{\arg\min}\limits_{q(x_{d}\mid\mathbf{x}_{\Gamma_{d}}),\,q(x_{d}\mid\mathbf{x}_{\mathbf{S_{d}}})}\mathrm{KL}\left[\frac{q(x_{d}\mid\mathbf{x}_{\Gamma_{d}})}{q(\mathbf{x}_{\neg d})}\|\frac{p(x_{d}\mid\mathbf{x}_{\Gamma_{d}})}{p(\mathbf{x}_{\neg d})}\right]+\mathrm{KL}\left[\frac{q(x_{d}\mid\mathbf{x}_{\mathbf{S_{d}}})}{q(\mathbf{x}_{\neg d})}\|\frac{p(x_{d}\mid\mathbf{x}_{\mathbf{S_{d}}})}{p(\mathbf{x}_{\neg d})}\right].

The optimization procedure is described by Proposition 4, whose proof is also left to the appendix.

Proposition 4

Let T(\mathbf{x})=\left[x_{1},\ldots,T_{d}\left(x_{d}\right),\ldots,x_{D}\right]^{T}, where T_{d}:x_{d}\rightarrow x_{d}+\epsilon\phi_{d}\left(\mathbf{x}_{C_{d}}\right) and \phi_{d}\in\mathcal{H}_{d}. Here \mathcal{H}_{d} is the RKHS defined by the local kernel k_{d}:\mathcal{X}_{S_{d}}\times\mathcal{X}_{S_{d}}\rightarrow\mathbb{R}, and C_{d}=S_{d}\cup\{d\} or C_{d}=\Gamma_{d}\cup\{d\}. The optimal solution of the following optimization problem

\mathop{\min}\limits_{\left\|\phi_{d}\right\|\leq 1}\left.\nabla_{\epsilon}\mathrm{KL}\left[\frac{q_{[T]}(x_{d}\mid\mathbf{x}_{\mathbf{C_{d}}})}{q(\mathbf{x}_{\neg d})}\|\frac{p(x_{d}\mid\mathbf{x}_{\mathbf{C_{d}}})}{p(\mathbf{x}_{\neg d})}\right]\right|_{\epsilon=0},

is given by \phi_{d}^{*}/\left\|\phi_{d}^{*}\right\|_{\mathcal{H}_{d}}, where

\phi_{d}^{*}\left(\mathbf{x}_{C_{d}}\right)=\mathbb{E}_{\mathbf{y}_{C_{d}}\sim q}\left[k_{d}\left(\mathbf{x}_{C_{d}},\mathbf{y}_{C_{d}}\right)\nabla_{y_{d}}\log p\left(y_{d}\mid\mathbf{y}_{C_{d}}\right)+\nabla_{y_{d}}k_{d}\left(\mathbf{x}_{C_{d}},\mathbf{y}_{C_{d}}\right)\right].

For d\in\left\{1,2,\ldots,D\right\}, x_{d} is updated through the above two-stage procedure, which reduces the dimension of the original problem from D to \min(|\mathbf{x}_{\mathbf{S_{d}}}|,|\mathbf{x}_{\Gamma_{d}}|). In this way, we are able to infer more complex target distributions on which traditional MP-SVGD fails; this comparison is illustrated in Experiment 2. Moreover, we do not need to know the true probabilistic graph structure of the target distribution in advance.
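The conditional scores \nabla_{y_{d}}\log p(y_{d}\mid\mathbf{y}_{C_{d}}) used above have a closed form for a Gaussian target such as the first example in Section 5.1. The following is a sketch of such a helper; it is limited to zero-mean Gaussian targets, and the function name is ours.

```python
import numpy as np

def gaussian_conditional_score(x, d, A, Sigma):
    # Score of the conditional p(x_d | x_A) for a zero-mean Gaussian target N(0, Sigma):
    # x_d | x_A ~ N(mu, s2) with mu = Sigma[d, A] Sigma[A, A]^{-1} x_A and
    # s2 = Sigma[d, d] - Sigma[d, A] Sigma[A, A]^{-1} Sigma[A, d].
    A = list(A)
    S_AA = Sigma[np.ix_(A, A)]
    S_dA = Sigma[d, A]
    w = np.linalg.solve(S_AA, S_dA)     # Sigma[A, A]^{-1} Sigma[A, d]
    mu = w @ x[A]
    s2 = Sigma[d, d] - S_dA @ w
    return -(x[d] - mu) / s2

# Example: tridiagonal covariance, conditioning x_0 on a chosen set {1, 2}.
Sigma = np.array([[1.0, 0.5, 0.0], [0.5, 1.0, 0.5], [0.0, 0.5, 1.0]])
x = np.array([0.3, -0.2, 0.1])
print(gaussian_conditional_score(x, d=0, A=[1, 2], Sigma=Sigma))
```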

The key first step of our AUMP-SVGD is to choose \Gamma_{d} (or equivalently \mathbf{S_{d}}). This problem can be formulated as follows: for m particles, each x\in\mathbb{R}^{D}, denote the ensemble matrix of these particles by X\in\mathbb{R}^{m\times D}. Let r=|\mathbf{x}_{\Gamma_{d}}| and let \mathcal{M}(r,m\times D) be the set of sub-matrices X_{r}\in\mathbb{R}^{m\times r} of X. Determining the set \Gamma_{d} corresponds to selecting the sub-matrix X_{r} from \mathcal{M}(r,m\times D) that minimizes \mathbb{E}\|\Sigma_{m}-\Sigma_{x}\|, where \Sigma_{m} is the empirical covariance matrix of the particles and \Sigma_{x} is the empirical covariance matrix of the sub-ensemble. The upper bound of \mathbb{E}\|\Sigma_{m}-\Sigma_{x}\| is already given by Proposition 2 and is related to \mathrm{tr}(\Sigma_{x}). Therefore, we choose the sub-matrix with the smallest \mathrm{tr}(\Sigma_{x}) so that \mathbb{E}\|\Sigma_{m}-\Sigma_{x}\| is minimized. This corresponds to selecting X_{r}\in\mathbb{R}^{m\times r} from X with minimal \operatorname{tr}\left(X_{r}X_{r}^{T}\right). Since \operatorname{tr}\left(X_{r}X_{r}^{T}\right)=\operatorname{tr}\left(X_{r}^{T}X_{r}\right), we just need to order the columns of X by their 2-norms and choose the r columns with the smallest norms. The computational complexity of this step is \mathcal{O}\left(Dm^{2}+D\log D\right). Finally, we give the complete form of our AUMP-SVGD in Algorithm 1; a small sketch of the column-selection rule is shown below.
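A minimal sketch of this selection rule, assuming the particle ensemble X is available as an m-by-D array; the function name and the way ties are broken are our own choices.

```python
import numpy as np

def choose_gamma_d(X, r, d):
    # X: (m, D) particle ensemble. Pick the r coordinates (excluding d) whose columns
    # of X have the smallest 2-norm, which minimizes tr(X_r X_r^T) = tr(X_r^T X_r).
    col_norms = np.linalg.norm(X, axis=0)
    col_norms[d] = np.inf                          # d itself never belongs to Gamma_d
    gamma_d = sorted(np.argsort(col_norms)[:r].tolist())
    s_d = [j for j in range(X.shape[1]) if j != d and j not in gamma_d]
    return gamma_d, s_d                            # Gamma_d and its complement S_d

X = np.random.randn(100, 8)
print(choose_gamma_d(X, r=3, d=0))
```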

Algorithm 1 Augmented Message Passing SVGD
Input: a set of initial particles \left\{\mathbf{x}^{(i)}\right\}_{i=1}^{m}
for iteration i do
    for d\in\{1,\ldots,D\} do
       Set \Gamma_{d} and S_{d}
       \mathbf{x}_{\Gamma_{d}}^{(i)}\leftarrow\mathbf{x}_{\Gamma_{d}}^{(i)}+\epsilon\hat{\phi}_{\Gamma_{d}}^{*}(\mathbf{x}_{\neg d}^{(i)})
       \hat{\phi}_{\Gamma_{d}}^{*}\left(\mathbf{x}_{\neg d}\right)=\mathbb{E}_{\mathbf{y}_{\neg d}\sim\hat{q}_{M}}\left[k_{\Gamma_{d}}\left(\mathbf{x}_{\neg d},\mathbf{y}_{\neg d}\right)\nabla_{\mathbf{y}_{\Gamma_{d}}}\log p\left(\mathbf{y}_{\Gamma_{d}}\mid\mathbf{y}_{\neg d}\right)+\nabla_{\mathbf{y}_{\Gamma_{d}}}k_{\Gamma_{d}}\left(\mathbf{x}_{\neg d},\mathbf{y}_{\neg d}\right)\right]
       Update x_{d} from \mathbf{x}_{S_{d}} with C_{d}=S_{d}\cup\{d\}:
         \mathbf{x}_{d}^{(i)}\leftarrow\mathbf{x}_{d}^{(i)}+\epsilon\hat{\phi}_{d}^{*}(\mathbf{x}_{C_{d}}^{(i)})
         \hat{\phi}_{d}^{*}\left(\mathbf{x}_{C_{d}}\right)=\mathbb{E}_{\mathbf{y}_{C_{d}}\sim\hat{q}_{M}}\left[\nabla_{y_{d}}k_{d}\left(\mathbf{x}_{C_{d}},\mathbf{y}_{C_{d}}\right)+k_{d}\left(\mathbf{x}_{C_{d}},\mathbf{y}_{C_{d}}\right)\nabla_{y_{d}}\log p\left(y_{d}\mid\mathbf{y}_{S_{d}}\right)\right]
       Update x_{d} from \mathbf{x}_{\Gamma_{d}} with C_{d}=\Gamma_{d}\cup\{d\}:
         \mathbf{x}_{d}^{(i)}\leftarrow\mathbf{x}_{d}^{(i)}+\epsilon\hat{\phi}_{d}^{*}(\mathbf{x}_{C_{d}}^{(i)})
         \hat{\phi}_{d}^{*}\left(\mathbf{x}_{C_{d}}\right)=\mathbb{E}_{\mathbf{y}_{C_{d}}\sim\hat{q}_{M}}\left[\nabla_{y_{d}}k_{d}\left(\mathbf{x}_{C_{d}},\mathbf{y}_{C_{d}}\right)+k_{d}\left(\mathbf{x}_{C_{d}},\mathbf{y}_{C_{d}}\right)\nabla_{y_{d}}\log p\left(y_{d}\mid\mathbf{y}_{\Gamma_{d}}\right)\right]
    end for
end for
Output: a set of particles \left\{\mathbf{x}^{(i)}\right\}_{i=1}^{m} as samples from p(\mathbf{x})

5 Experiments

We study the uncertainty quantification properties of AUMP-SVGD compared with existing methods such as SVGD, MP-SVGD, projected SVGD (PSVGD) (Chen and Ghattas (2020)), and sliced Stein variational gradient descent (S-SVGD) (Gong et al. (2020)) through extensive experiments. We conclude from the experiments that SVGD may underestimate uncertainty, S-SVGD may overestimate it, and AUMP-SVGD with properly partitioned \Gamma_{d} and \mathbf{S}_{d} produces the best estimate. In almost all scenarios, AUMP-SVGD outperforms PSVGD.

5.1 Gaussian Mixture Models

Multivariate Gaussian. The first example is a D-dimensional multivariate Gaussian p(\mathbf{x})=\mathcal{N}\left(0,I_{D}\right). For each method, 100 particles are initialized from q_{0}(\mathbf{x})=\mathcal{N}\left(10,I_{D}\right).

Spaceship Mixture. The target in the second experiment is a D-dimensional mixture of two correlated Gaussian distributions p(\mathbf{x})=0.5\mathcal{N}\left(x;\mu_{1},\Sigma_{1}\right)+0.5\mathcal{N}\left(x;\mu_{2},\Sigma_{2}\right). The means \mu_{1} and \mu_{2} of the two Gaussians have components equal to 1 in the first two coordinates and 0 otherwise. The covariance matrices admit a correlated block-diagonal structure. The mixture hence manifests as a “spaceship”-shaped marginal density in the first two dimensions (see Figure 2).

Figure 2: The result of object edges and estimated densities in the first two dimensions. Red dots are particles after 2000 iterations. Both examples are 50-dimensional

It can be seen from Figure 2 that for high-dimensional inference, particles from SVGD aggregate, reflecting the curse of dimensionality (Zhuo et al. (2018); Liu and Wang (2016)). AUMP-SVGD, however, estimates the true probability distribution well in these high-dimensional situations. We calculate the energy distance and the mean-square error (MSE) \mathbb{E}\|\Sigma_{m}-\Sigma\|_{2} between the samples from each inference algorithm and the real samples. The energy distance is given by D^{2}(F,G)=2\mathbb{E}\|X-Y\|-\mathbb{E}\left\|X-X^{\prime}\right\|-\mathbb{E}\left\|Y-Y^{\prime}\right\|, where F and G are the cumulative distribution functions (CDFs) of X and Y, respectively, and X^{\prime} and Y^{\prime} denote i.i.d. copies of X and Y (Rizzo and Székely (2016)). Ten experiments are performed and the averaged results are given in Figure 3.
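The energy distance above can be estimated directly from the two sample sets. A short sketch follows (a plain V-statistic estimator; the example data are synthetic stand-ins, not the paper's experiment).

```python
import numpy as np

def energy_distance(X, Y):
    # V-statistic estimate of D^2(F, G) = 2 E||X - Y|| - E||X - X'|| - E||Y - Y'||.
    d_xy = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1).mean()
    d_xx = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1).mean()
    d_yy = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1).mean()
    return 2.0 * d_xy - d_xx - d_yy

# Example: particles versus i.i.d. samples from a 50-dimensional standard Gaussian.
rng = np.random.default_rng(0)
particles = rng.normal(size=(100, 50))
reference = rng.normal(size=(1000, 50))
print(energy_distance(particles, reference))
```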

Figure 3: The energy distance and MSE between SVGD samples and the real target distribution.

Figure 3 shows that the error of SVGD gradually grows as the dimension increases. For the sparse problem, when the graph structure is already known, MP-SVGD achieves outcomes comparable to S-SVGD and PSVGD-2. In the above example, PSVGD achieves its best results when the problem dimension is reduced to 2. AUMP-SVGD with |\Gamma_{d}| equal to 1 or 3 outperforms the other methods. In Experiment 1, we demonstrate that AUMP-SVGD yields a variance estimate equivalent to that of MP-SVGD under the simplified graph structure. In Experiment 2, as the correlation between different dimensions of the target becomes stronger or the density of the graph increases, our algorithm performs better than the other SVGD variants. Furthermore, our approach exhibits superior variance estimation compared with SVGD, MP-SVGD, S-SVGD, and PSVGD. In practice, the covariance matrix of the target distribution may not be sparse, making it challenging to capture this structure; as a result, the effectiveness of MP-SVGD is significantly limited. AUMP-SVGD, however, still attains stable and superior results.

Non-sparse experiment. We set the dimension of the target distribution to 50 and systematically transform the sparse covariance matrix into a non-sparse one by increasing the correlations between different dimensions around the main diagonal. The results are presented in Figure 4.

Figure 4: MSE and energy distance. Each graph structure is averaged over ten experiments.

As the density of the probabilistic graph increases, the discrepancy between MP-SVGD and the actual target distribution gradually grows, and MP-SVGD eventually succumbs to the curse of dimensionality. This happens because the repulsive force between particles in MP-SVGD primarily depends on the size of the Markov blanket; with increasing density, MP-SVGD encounters the same high-dimensional challenges as SVGD. However, as illustrated in Figure 4, regardless of the density of the target distribution's graph structure, AUMP-SVGD remains unaffected by variance collapse. This is because the repulsive force between particles in AUMP-SVGD is governed by the artificially chosen blanket size, underscoring the advantage of our algorithm over MP-SVGD.

5.2 Conditioned Diffusion Process

The next example is a benchmark often used to test inference methods in high dimensions (Detommaso et al. (2018); Chen and Ghattas (2020); Liu et al. (2022)). We consider a stochastic process u:[0,1]\rightarrow\mathbb{R} governed by

du=\frac{5u\left(1-u^{2}\right)}{1+u^{2}}dt+dx,\quad u_{0}=0,

where t\in(0,1] and the forcing term x=\left(x_{t}\right)_{t\geq 0} follows a Brownian motion, so that x\sim\mathcal{N}(0,B) with B\left(t,t^{\prime}\right)=\min\left(t,t^{\prime}\right). The noisy data \mathbf{z}=\left(z_{t_{1}},\ldots,z_{t_{50}}\right)^{T}\in\mathbb{R}^{50} are observed at 50 equispaced time points t_{i}=0.02i, where z_{t_{i}}=u_{t_{i}}+\epsilon with \epsilon\sim\mathcal{N}\left(0,\sigma^{2}\right) and \sigma=0.1. The objective is to use \mathbf{z} to infer the forcing term x and thus the state of the solution u. The results are given in Figure 5, where the shaded interval depicts the mean plus/minus the standard deviation.
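A sketch of the forward model used to generate synthetic observations for this benchmark, based on an Euler-Maruyama discretization of the SDE above; the step size, random seed, and function name are our own choices.

```python
import numpy as np

def solve_conditioned_diffusion(dx, dt=0.01):
    # Euler-Maruyama discretization of du = 5u(1 - u^2)/(1 + u^2) dt + dx, u_0 = 0,
    # where dx holds the Brownian increments of the forcing term x over each step.
    u = np.zeros(len(dx) + 1)
    for i, step in enumerate(dx):
        drift = 5.0 * u[i] * (1.0 - u[i] ** 2) / (1.0 + u[i] ** 2)
        u[i + 1] = u[i] + drift * dt + step
    return u

# Synthetic data: 100 steps on (0, 1], observations at t_i = 0.02 i with sigma = 0.1.
rng = np.random.default_rng(1)
dt = 0.01
dx = rng.normal(0.0, np.sqrt(dt), size=100)      # Brownian-motion increments
u = solve_conditioned_diffusion(dx, dt)
z = u[2::2] + rng.normal(0.0, 0.1, size=50)      # noisy observations z_{t_i}
```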

Figure 5: Results at iteration 1000.

As depicted in Figure 5, in the 50-dimensional case both SVGD and S-SVGD exhibit noticeable deviations from the ground truth, and S-SVGD exhibits excessively large variances in numerous tests. Consequently, SVGD and S-SVGD prove inadequate for the conditioned diffusion model. In contrast, our AUMP-SVGD demonstrates satisfactory performance with |\Gamma_{d}| equal to 5 or 10.

5.3 Bayesian Logistic Regression

We investigate a Bayesian logistic regression model from Liu and Wang (2016) applied to the Covertype dataset from Asuncion and Newman (2007). We use 70% of the data for training and 30% for testing. We compare AUMP-SVGD with SVGD, S-SVGD, PSVGD, MP-SVGD, and Hamiltonian Monte Carlo (HMC), with the number of generated samples ranging from 100 to 500. Each experiment uses ten different random seeds, and the error of each value does not exceed 0.01. We verify the impact of the different sampling methods on the prediction accuracy; the results are given in Table 1.

Table 1: Prediction accuracy for different numbers of particles. The optimal values for each case are in bold.
# particles HMC SVGD S-SVGD MP-SVGD PSVGD_2 AUMP-SVGD-5 AUMP-SVGD-10
100 0.70 0.74 0.76 0.74 0.77 0.73 0.75
200 0.71 0.74 0.762 0.74 0.79 0.78 0.78
300 0.73 0.74 0.762 0.741 0.81 0.80 0.81
400 0.80 0.75 0.76 0.74 0.83 0.81 0.81
500 0.81 0.75 0.765 0.75 0.85 0.82 0.86

Table 1 demonstrates that our AUMP-SVGD has a prediction accuracy similar to that of PSVGD, and as the number of particles increases, our algorithm becomes more accurate than the other algorithms.

6 Conclusion and Future work

In this paper, we analyze an upper bound related to the variance collapse of SVGD when the number of particles is finite. We show that the distribution of particles is restricted to a specific region rather than the entire probability space. We also propose the AUMP-SVGD algorithm to overcome the dependency of MP-SVGD on a known and sparse graph structure, and we show the effectiveness of AUMP-SVGD through various experiments. For future work, we aim to further investigate the convergence of SVGD with finite particles and to tighten our bounds. We also plan to apply AUMP-SVGD to more complex real-world applications, such as posture estimation (Pacheco et al. (2014)).

References

  • Asuncion and Newman [2007] Arthur Asuncion and David Newman. UCI machine learning repository, 2007.
  • Ba et al. [2021] Jimmy Ba, Murat A Erdogdu, Marzyeh Ghassemi, Shengyang Sun, Taiji Suzuki, Denny Wu, and Tianzong Zhang. Understanding the variance collapse of SVGD in high dimensions. In International Conference on Learning Representations, 2021.
  • Banna et al. [2016] Marwa Banna, Florence Merlevède, and Pierre Youssef. Bernstein-type inequality for a class of dependent random matrices. Random Matrices: Theory and Applications, 5(2):1650006, 2016.
  • Bradley [2005] Richard C Bradley. Basic properties of strong mixing conditions. A survey and some open questions. Probability Surveys, 2:107–144, 2005.
  • Chen and Ghattas [2020] Peng Chen and Omar Ghattas. Projected Stein variational gradient descent. In Advances in Neural Information Processing Systems, volume 33, pages 1947–1958, 2020.
  • Chewi et al. [2020] Sinho Chewi, Thibaut Le Gouic, Chen Lu, Tyler Maunu, and Philippe Rigollet. SVGD as a kernelized Wasserstein gradient flow of the chi-squared divergence. In Advances in Neural Information Processing Systems, volume 33, pages 2098–2109, 2020.
  • Detommaso et al. [2018] Gianluca Detommaso, Tiangang Cui, Youssef Marzouk, Alessio Spantini, and Robert Scheichl. A Stein variational Newton method. In Advances in Neural Information Processing Systems, volume 31, 2018.
  • Gao et al. [2019] Xiang Gao, Meera Sitharam, and Adrian E Roitberg. Bounds on the Jensen gap, and implications for mean-concentrated distributions. The Australian Journal of Mathematical Analysis and Applications, 16:1–16, 2019.
  • Gong et al. [2020] Wenbo Gong, Yingzhen Li, and José Miguel Hernández-Lobato. Sliced kernelized Stein discrepancy. In International Conference on Learning Representations, 2020.
  • Gorham and Mackey [2017] Jackson Gorham and Lester Mackey. Measuring sample quality with kernels. In International Conference on Machine Learning, pages 1292–1301, 2017.
  • Hoeffding and Robbins [1948] Wassily Hoeffding and Herbert Robbins. The central limit theorem for dependent random variables. Duke Mathematical Journal, 15(3):773–780, 1948.
  • Krizhevsky et al. [2017] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, 2017.
  • Liu [2017] Qiang Liu. Stein variational gradient descent as gradient flow. In Advances in Neural Information Processing Systems, volume 30, 2017.
  • Liu and Wang [2016] Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, volume 29, 2016.
  • Liu et al. [2022] Xing Liu, Harrison Zhu, Jean-Francois Ton, George Wynne, and Andrew Duncan. Grassmann Stein variational gradient descent. In International Conference on Artificial Intelligence and Statistics, pages 2002–2021, 2022.
  • MacKay [1992] David JC MacKay. A practical Bayesian framework for backpropagation networks. Neural computation, 4(3):448–472, 1992.
  • Pacheco et al. [2014] Jason Pacheco, Silvia Zuffi, Michael Black, and Erik Sudderth. Preserving modes and messages via diverse particle selection. In International Conference on Machine Learning, pages 1152–1160, 2014.
  • Rizzo and Székely [2016] Maria L Rizzo and Gábor J Székely. Energy distance. Wiley Interdisciplinary Reviews: Computational Statistics, 8(1):27–38, 2016.
  • Salim et al. [2022] Adil Salim, Lukang Sun, and Peter Richtarik. A convergence theory for SVGD in the population limit under Talagrand’s inequality T1. In International Conference on Machine Learning, pages 19139–19152, 2022.
  • Wang et al. [2018] Dilin Wang, Zhe Zeng, and Qiang Liu. Stein variational message passing for continuous graphical models. In International Conference on Machine Learning, pages 5219–5227, 2018.
  • Wu et al. [2022] Kevin E Wu, Kevin K Yang, Rianne van den Berg, James Y Zou, Alex X Lu, and Ava P Amini. Protein structure generation via folding diffusion. arXiv preprint arXiv:2209.15611, 2022.
  • Yan and Zhou [2021a] Liang Yan and Tao Zhou. Stein variational gradient descent with local approximations. Computer Methods in Applied Mechanics and Engineering, 386, 2021a.
  • Yan and Zhou [2021b] Liang Yan and Tao Zhou. Gradient-free Stein variational gradient descent with kernel approximation. Applied Mathematics Letters, 121, 2021b.
  • Yoon et al. [2018] Jaesik Yoon, Taesup Kim, Ousmane Dia, Sungwoong Kim, Yoshua Bengio, and Sungjin Ahn. Bayesian model-agnostic meta-learning. In Advances in Neural Information Processing Systems, volume 31, 2018.
  • Zhuo et al. [2018] Jingwei Zhuo, Chang Liu, Jiaxin Shi, Jun Zhu, Ning Chen, and Bo Zhang. Message passing Stein variational gradient descent. In International Conference on Machine Learning, pages 6018–6027, 2018.

Appendix A

Proof of Proposition 1

According to the fixed points assumption (A1),

\Delta(\mathbf{x}_{i})=\frac{1}{N}\sum_{j=1}^{N}\nabla_{\mathbf{x}_{j}}\log p\left(\mathbf{x}_{j}\right)k\left(\mathbf{x}_{i},\mathbf{x}_{j}\right)+\nabla_{\mathbf{x}_{j}}k\left(\mathbf{x}_{i},\mathbf{x}_{j}\right)=0.

Then,

\mathbb{E}\left(\Delta(\mathbf{x}_{i})\right)=\mathbb{E}\left(\nabla_{\mathbf{x}}\log p\left(\mathbf{x}\right)k\left(\mathbf{x}_{i},\mathbf{x}\right)\right)+\mathbb{E}\left(\nabla_{\mathbf{x}}k\left(\mathbf{x}_{i},\mathbf{x}\right)\right) (5)
=\mathbb{E}\left(\nabla_{\mathbf{x}}\log p\left(\mathbf{x}\right)\right)\mathbb{E}\left(k\left(\mathbf{x}_{i},\mathbf{x}\right)\right)+\mathbb{E}\left(\nabla_{\mathbf{x}}k\left(\mathbf{x}_{i},\mathbf{x}\right)\right)
=\mathbb{E}\left(\nabla_{\mathbf{x}}k\left(\mathbf{x}_{i},\mathbf{x}\right)\right)=0.

In Equation (5), \nabla_{\mathbf{x}}\log p\left(\mathbf{x}\right) and k\left(\mathbf{x}_{i},\mathbf{x}\right) are independent, and

\mathbb{E}\left(\nabla_{\mathbf{x}}\log p\left(\mathbf{x}\right)\right)=\int\frac{p^{\prime}(\mathbf{x})}{p(\mathbf{x})}p(\mathbf{x})\,d\mathbf{x}=\int p^{\prime}(\mathbf{x})\,d\mathbf{x}=0.

Associated with the Jensen gap (Gao et al. [2019]), we get

\left|\mathbb{E}\left[e^{-\frac{\|\mathbf{x}_{i}-\mathbf{x}\|_{2}^{2}}{h}}\frac{2(\mathbf{x}_{i}-\mathbf{x})}{h}\right]-e^{-\frac{\|\mathbf{x}_{i}-\mathbb{E}\mathbf{x}\|_{2}^{2}}{h}}\frac{2(\mathbf{x}_{i}-\mathbb{E}\mathbf{x})}{h}\right|\leq 2M\mathbb{E}\left\|\mathbf{x}\right\|,

which in turn gives

\left|e^{-\frac{\left\|\mathbf{x}_{i}-\mathbb{E}\mathbf{x}\right\|_{2}^{2}}{h}}\frac{2(\mathbf{x}_{i}-\mathbb{E}\mathbf{x})}{h}\right|<2M\mathbb{E}\left\|\mathbf{x}\right\|

since \left\|\mathbf{x}_{i}-\mathbb{E}\mathbf{x}\right\|_{2}^{2}=\mathrm{tr}\left((\mathbf{x}_{i}-\mathbb{E}\mathbf{x})(\mathbf{x}_{i}-\mathbb{E}\mathbf{x})^{T}\right)\leq c_{0}\mathrm{tr}(\Sigma), where c_{0} is a positive number.

Proof of Proposition 2

According to Proposition 1:

\left\|\mathbf{x}\right\|_{2}^{2}\leq K^{2}\mathrm{tr}(\Sigma).

Applying the expectation version of the Bernstein inequality (Banna et al. [2016]) to the sum of the mean-zero random matrices \mathbf{x}_{i}\mathbf{x}_{i}^{T}-\Sigma, we obtain

\mathbb{E}\|\Sigma_{m}-\Sigma\|=\frac{1}{m}\mathbb{E}\Big\|\sum_{i=1}^{m}(\mathbf{x}_{i}\mathbf{x}_{i}^{T}-\Sigma)\Big\|
\leq 30v\sqrt{n\log D}+4Mc^{-1/2}\sqrt{\log D}+M\gamma(c,n)\log D,

and M is any number chosen such that

\|\mathbf{x}\mathbf{x}^{T}-\Sigma\|\leq M.

Bounding M is simple:

\|\mathbf{x}\mathbf{x}^{T}-\Sigma\|\leq\|\mathbf{x}\|_{2}^{2}+\|\Sigma\|\leq 2K^{2}\mathrm{tr}(\Sigma),

which completes the proof.

Proof of Proposition 3

Note that

\nabla_{\epsilon}\mathrm{KL}\left(q_{[T]}\left(\mathbf{x}_{\Gamma_{d}},\mathbf{x}_{\mathbf{S_{d}}}\right)\|p\left(\mathbf{x}_{\Gamma_{d}},\mathbf{x}_{\mathbf{S_{d}}}\right)\right)=\nabla_{\epsilon}\mathrm{KL}\left(q_{[T]}\left(\mathbf{x}_{\Gamma_{d}}\mid\mathbf{x}_{\mathbf{S_{d}}}\right)q\left(\mathbf{x}_{\mathbf{S_{d}}}\right)\|p\left(\mathbf{x}_{\Gamma_{d}}\mid\mathbf{x}_{\mathbf{S_{d}}}\right)p\left(\mathbf{x}_{\mathbf{S_{d}}}\right)\right).

Now we derive the optimality condition for Equation (3). Note that,

\mathrm{KL}\left(q_{[T]}\left(\mathbf{x}_{\Gamma_{d}},\mathbf{x}_{\mathbf{S_{d}}}\right)\|p\left(\mathbf{x}_{\Gamma_{d}},\mathbf{x}_{\mathbf{S_{d}}}\right)\right)=\mathbb{E}_{q\left(\mathbf{x}_{\neg d}\right)}\left[\mathrm{KL}\left(q_{[T]}\left(\mathbf{x}_{\Gamma_{d}}\mid\mathbf{x}_{\neg d}\right)\|p\left(\mathbf{x}_{\Gamma_{d}}\mid\mathbf{x}_{\neg d}\right)\right)\right].

Following the proof of Theorem 3.1 by Liu and Wang [2016], we have

\nabla_{\epsilon}\mathrm{KL}\left(q_{[T]}\left(\mathbf{x}_{\Gamma_{d}},\mathbf{x}_{\mathbf{S_{d}}}\right)\|p\left(\mathbf{x}_{\Gamma_{d}},\mathbf{x}_{\mathbf{S_{d}}}\right)\right)\big|_{\epsilon=0}=-\mathbb{E}_{q\left(\mathbf{y}_{\Gamma_{d}}\mid\mathbf{y}_{\neg d}\right)}\left[\phi_{\Gamma_{d}}\left(\mathbf{y}_{\neg d}\right)\nabla_{\mathbf{y}_{\Gamma_{d}}}\log p\left(\mathbf{y}_{\Gamma_{d}}\mid\mathbf{y}_{\neg d}\right)+\nabla_{\mathbf{y}_{\Gamma_{d}}}\phi_{\Gamma_{d}}\left(\mathbf{y}_{\neg d}\right)\right].

According to Liu and Wang [2016], the optimal solution is given by \phi_{\Gamma_{d}}^{*}/\|\phi_{\Gamma_{d}}^{*}\|_{\mathcal{H}_{\Gamma_{d}}}, where

\phi_{\Gamma_{d}}^{*}\left(\mathbf{x}_{\neg d}\right)=\mathbb{E}_{\mathbf{y}_{\neg d}\sim q}\left[k_{\Gamma_{d}}\left(\mathbf{x}_{\neg d},\mathbf{y}_{\neg d}\right)\nabla_{\mathbf{y}_{\Gamma_{d}}}\log p\left(\mathbf{y}_{\Gamma_{d}}\mid\mathbf{y}_{\neg d}\right)+\nabla_{\mathbf{y}_{\Gamma_{d}}}k_{\Gamma_{d}}\left(\mathbf{x}_{\neg d},\mathbf{y}_{\neg d}\right)\right].

Proof of Proposition 4

Similar to the previous proof, we first have

\nabla_{\epsilon}\mathrm{KL}\left[\frac{q_{[T]}(x_{d}\mid\mathbf{x}_{\mathbf{S_{d}}})}{q(\mathbf{x}_{\neg d})}\|\frac{p(x_{d}\mid\mathbf{x}_{\mathbf{S_{d}}})}{p(\mathbf{x}_{\neg d})}\right]=\nabla_{\epsilon}\mathrm{KL}\left(q_{[T]}\left(x_{d}\mid\mathbf{x}_{\mathbf{S_{d}}}\right)q\left(\mathbf{x}_{\mathbf{S_{d}}}\right)\|p\left(x_{d}\mid\mathbf{x}_{\mathbf{S_{d}}}\right)q\left(\mathbf{x}_{\mathbf{S_{d}}}\right)\right).

Then we derive the optimality condition \phi_{d}^{*} for Equation (4),

\mathrm{KL}\left(q_{\left[T\right]}\left(x_{d},\mathbf{x}_{\mathbf{S_{d}}}\right)\|p\left(x_{d},\mathbf{x}_{\mathbf{S_{d}}}\right)\right)=\mathbb{E}_{q\left(\mathbf{x}_{S_{d}}\right)}\left[\operatorname{KL}\left(q_{\left[T\right]}\left(x_{d}\mid\mathbf{x}_{S_{d}}\right)\|p\left(x_{d}\mid\mathbf{x}_{S_{d}}\right)\right)\right].

Following the proof of Theorem 3.1 in Liu and Wang [2016], we have

\nabla_{\epsilon}\mathrm{KL}\left(q\left(x_{d},\mathbf{x}_{\mathbf{S_{d}}}\right)\|p\left(x_{d},\mathbf{x}_{\mathbf{S_{d}}}\right)\right)\big|_{\epsilon=0}=-\mathbb{E}_{q\left(y_{d}\mid\mathbf{y}_{S_{d}}\right)}\left[\phi_{d}\left(\mathbf{y}_{C_{d}}\right)\nabla_{y_{d}}\log p\left(y_{d}\mid\mathbf{y}_{S_{d}}\right)+\nabla_{y_{d}}\phi_{d}\left(\mathbf{y}_{S_{d}}\right)\right].

According to Liu and Wang [2016], the optimal solution is given by \phi_{d}^{*}/\left\|\phi_{d}^{*}\right\|_{\mathcal{H}_{d}}, where

\phi_{d}^{*}\left(\mathbf{x}_{C_{d}}\right)=\mathbb{E}_{\mathbf{y}_{C_{d}}\sim q}\left[k_{S_{d}}\left(\mathbf{x}_{C_{d}},\mathbf{y}_{C_{d}}\right)\nabla_{\mathbf{y}_{S_{d}}}\log p\left(\mathbf{y}_{S_{d}}\mid\mathbf{y}_{C_{d}}\right)+\nabla_{\mathbf{y}_{S_{d}}}k_{S_{d}}\left(\mathbf{x}_{C_{d}},\mathbf{y}_{C_{d}}\right)\right],

which completes the proof.

Appendix B

Numerical verification of m-dependence

In this paper, we use the concept of m-dependence from mixing theory to estimate the variance of non-i.i.d. particles. Here we give a verification experiment for m-dependence.

[Uncaptioned image]

The above figure shows the magnitude of the final update dynamics between pairs of particles in SVGD. Here, the legend \text{SVGD}\_3000\_10 indicates 3000 particles for a 10-dimensional Gaussian model. We sort the particles according to \|\mathbf{x}\|_{2} and measure the update effect between two particles at intervals of 500, i.e.,

\frac{1}{N}\left[\nabla_{\mathbf{x}_{n}^{\ell}}\log p\left(\mathbf{x}_{n}^{\ell}\right)k\left(\mathbf{x}_{n}^{\ell},\mathbf{x}_{m}^{\ell}\right)+\nabla_{\mathbf{x}_{n}^{\ell}}k\left(\mathbf{x}_{n}^{\ell},\mathbf{x}_{m}^{\ell}\right)\right]. (6)

We can see from the above figure that as the interval between the particles increases, the force between them decreases.
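A sketch of how the quantity in Equation (6) can be computed for a pair of sorted particles, assuming a zero-mean Gaussian target (so that \nabla\log p(\mathbf{x})=-\mathbf{x}) and the RBF kernel; the bandwidth, random seed, and function name are illustrative.

```python
import numpy as np

def pairwise_force(x_n, x_m, h, N):
    # Magnitude of the interaction term in Equation (6) between particles x_n and x_m,
    # for a zero-mean Gaussian target where grad log p(x_n) = -x_n (an assumption).
    diff = x_n - x_m
    k = np.exp(-np.dot(diff, diff) / (2.0 * h))
    grad_k = -diff / h * k                          # grad_{x_n} k(x_n, x_m), RBF kernel
    score = -x_n                                    # grad log p(x_n) for N(0, I)
    return np.linalg.norm((score * k + grad_k) / N)

rng = np.random.default_rng(2)
X = rng.normal(size=(3000, 10))                     # stand-in for converged particles
order = np.argsort(np.linalg.norm(X, axis=1))       # sort by ||x||_2 as in the experiment
print(pairwise_force(X[order[0]], X[order[500]], h=1.0, N=3000))
```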

Numerical verification of Propositions 1-2

In the table below, we empirically investigate the target distribution \mathbf{N}\left(0,I_{\mathrm{dim}}\right) and demonstrate the soundness of our theoretical bounds.

Table 2: Upper bound of \|x\|_{2}^{2} in Proposition 1
dim 2 5 10 15 20 25
10 particles: max \|x\|_{2}^{2} 1.43 1.18 1.17 1.17 1.17 1.17
10 particles: theoretical bound 4.82 5.56 5.39 5.36 5.35 5.35
50 particles: max \|x\|_{2}^{2} 2.11 1.63 1.55 1.52 1.52 1.51
50 particles: theoretical bound 9.41 16.1 17.1 14.8 14.31 14.0

In the table below, the upper bound on the covariance error of SVGD is assessed for the target distribution \mathbf{N}\left(0,I_{\mathrm{dim}}\right).

Table 3: Upper bound of \mathbb{E}\left\|\Sigma_{m}-\Sigma\right\|_{2}^{2} in Proposition 2
Number of particles 1000 5000 10000 15000 20000
Dim-2: \mathbb{E}\|\Sigma_{m}-\Sigma\|_{2}^{2} 0.008 0.002 0.004 0.0063 0.003
Dim-2: theoretical bound 0.082 0.019 0.01 0.0078 0.005
Dim-5: \mathbb{E}\|\Sigma_{m}-\Sigma\|_{2}^{2} 0.3 0.11 0.06 0.06 0.04
Dim-5: theoretical bound 4.63 1.71 1.02 0.71 0.57