
Augmented Message Passing Stein Variational Gradient Descent

Jiankui Zhou
School of Information Science and Technology
ShanghaiTech University
[email protected]
Yue Qiu
College of Mathematics and Statistics
Chongqing University
[email protected]
Funded by NSF China under No. 12101407
Abstract

Stein Variational Gradient Descent (SVGD) is a popular particle-based method for Bayesian inference. However, its convergence suffers from variance collapse, which reduces the accuracy and diversity of the estimate. In this paper, we study the isotropy of finite particle systems during convergence and show that finite-particle SVGD cannot spread across the entire sample space. Instead, all particles tend to cluster around the particle center within a certain range, and we provide an analytical bound for this cluster. To further improve the effectiveness of SVGD for high-dimensional problems, we propose the Augmented Message Passing SVGD (AUMP-SVGD) method, a two-stage optimization procedure that, unlike MP-SVGD, does not require sparsity of the target distribution. Our algorithm achieves satisfactory accuracy and overcomes the variance collapse problem on various benchmark problems.

1 Introduction

Stein variational gradient descent (SVGD), proposed by Liu and Wang (2016), is a non-parametric inference method. To approximate an intractable but differentiable target distribution, it constructs a set of particles that can be initialized from any initial distribution. These particles move along directions in the Reproducing Kernel Hilbert Space (RKHS) determined by the kernel function. SVGD drives the particles in the direction along which the KL divergence between the two distributions decreases most rapidly. SVGD is often more efficient than traditional Markov chain Monte Carlo (MCMC) methods because its particles are driven toward the target distribution by a deterministic dynamical process. These advantages make SVGD appealing, and it has attracted considerable research interest (Liu et al. (2022); Ba et al. (2021); Zhuo et al. (2018); Salim et al. (2022); Yan and Zhou (2021a)).

Although SVGD succeeds in many applications (Liu (2017); Yoon et al. (2018); Yan and Zhou (2021b)), it lacks theoretical convergence guarantees in the finite-particle regime. The convergence of SVGD is guaranteed only under the mean-field assumption, i.e., the particles converge to the true distribution when their number is infinite (Liu (2017); Salim et al. (2022)). The convergence of SVGD with finitely many particles is still an open problem. Furthermore, it has been observed that as the dimension of the problem increases, the variance estimated by SVGD may be much smaller than that of the target distribution. This phenomenon is known as variance collapse, and it limits the applicability of SVGD for the following reasons. First, an underestimated variance fails to explain the uncertainty of the model. Second, Bayesian inference is usually high-dimensional in practice, and this curse of dimensionality makes SVGD inapplicable in some scenarios. For example, training Bayesian neural networks (BNNs) (MacKay (1992)) requires inferring posterior distributions over millions of network weights (Krizhevsky et al. (2017)). More recently, structure prediction for long proteins requires inference over the position of every atom, which again results in a high-dimensional problem (Wu et al. (2022)).

The first contribution of this paper is to show that the particles of SVGD at convergence do not spread across the whole probability space but remain confined to a finite range. We give an analytic bound for this clustered region. This bounded distribution of particles is an indication of the curse of dimensionality. In addition, we provide an estimate of the error between the covariance of finite particles and the true covariance.

There have been many efforts to make SVGD applicable to high-dimensional problems. According to Zhuo et al. (2018), the magnitude of the repulsive force between particles is inversely proportional to the dimension of the problem. Reducing the dimension of the problem is therefore the key to addressing variance collapse; such methods include combining the Grassmann manifold with matrix decompositions to reduce the dimension of the target distribution (Chen and Ghattas (2020); Liu et al. (2022)). Another approach is to find the Markov blanket for each dimension of the target distribution so that the global kernel function can be replaced by local kernel functions. This improves the efficiency of SVGD, and the resulting method is called message passing SVGD (MP-SVGD) (Zhuo et al. (2018); Wang et al. (2018)). However, MP-SVGD needs to know the probabilistic graphical model structure in advance and is efficient only when the graph is sparse. Moreover, identifying the Markov blanket for high-dimensional problems is challenging.

The second contribution of this paper is that we overcome these shortcomings of MP-SVGD and extend it to high-dimensional problems. Building on the results of our variance analysis, we propose the so-called Augmented MP-SVGD (AUMP-SVGD). AUMP-SVGD decomposes the problem dimensions into three parts via a factorization of the KL divergence. Different from MP-SVGD, AUMP-SVGD adopts a two-stage update procedure to remove the dependence on sparse probabilistic graphical models. It therefore alleviates variance collapse and does not require prior knowledge of the graph structure. We show the superiority of AUMP-SVGD over state-of-the-art algorithms both theoretically and experimentally.

2 Preliminaries

2.1 SVGD

SVGD approximates an intractable target distribution p(\mathbf{x}), where \mathbf{x}=[x_{1},\ldots,x_{D}]^{T}\in\mathcal{X}\subseteq\mathbb{R}^{D}, with the best q^{*}(\mathbf{x}) by minimizing the Kullback-Leibler (KL) divergence (Liu and Wang (2016)). Here, D is the dimension of the target distribution and

q^{*}(\mathbf{x})=\mathop{\arg\min}\limits_{q(\mathbf{x})\in\mathcal{Q}}\mathrm{KL}(q(\mathbf{x})\|p(\mathbf{x})).

SVGD takes a group of particles \{\mathbf{x}_{i}\}_{i=1}^{m} from an initial distribution q_{0}(\mathbf{x}), and after a series of smooth transforms, these particles finally converge to the target distribution p(\mathbf{x}). Each smooth transformation can be expressed as \mathbf{T}(\mathbf{x})=\mathbf{x}+\epsilon\Phi(\mathbf{x}), where \epsilon is the step size and \Phi:\mathcal{X}\to\mathbb{R}^{D} is the transformation direction. Here, \mathcal{X} is the sample space that contains the particles. Let \mathcal{H} denote the RKHS induced by a positive definite kernel k\left(\cdot,\cdot\right) and denote \mathcal{H}^{D}=\mathcal{H}\times\cdots\times\mathcal{H}, where \times is the Cartesian product. The steepest descent direction is obtained by minimizing the KL divergence,

\min_{\|\phi\|_{\mathcal{H}^{D}}\leq 1}\nabla_{\epsilon}\mathrm{KL}\left(q_{[T]}\|p\right)\big|_{\epsilon=0}=-\max_{\|\phi\|_{\mathcal{H}^{D}}\leq 1}\mathbb{E}_{\mathbf{x}\sim q}\left[\mathcal{A}_{p}\phi(\mathbf{x})\right],

where q_{[T]} denotes the distribution of particles from q after applying the map \mathbf{T}(\mathbf{x}), and \mathcal{A}_{p} is the Stein operator given by \mathcal{A}_{p}\phi(\mathbf{x})=\nabla_{\mathbf{x}}\log p(\mathbf{x})\phi(\mathbf{x})^{T}+\nabla_{\mathbf{x}}\phi(\mathbf{x}). SVGD updates the particles \{\mathbf{x}_{i}\}_{i=1}^{m} drawn from the initial distribution q_{0}(\mathbf{x}) by

\mathbf{x}_{n}^{\ell+1}=\mathbf{x}_{n}^{\ell}+\epsilon_{\ell}\hat{\phi}_{\ell}^{*}\left(\mathbf{x}_{n}^{\ell}\right),\quad n=1,\ldots,m,\quad\ell=0,1,\ldots

The steepest descent direction is given by

\hat{\phi}_{\ell}^{*}\left(\mathbf{x}_{n}^{\ell}\right)=\frac{1}{m}\sum_{i=1}^{m}\left[k\left(\mathbf{x}_{i}^{\ell},\mathbf{x}_{n}^{\ell}\right)\nabla_{\mathbf{x}_{i}^{\ell}}\log p\left(\mathbf{x}_{i}^{\ell}\right)+\nabla_{\mathbf{x}_{i}^{\ell}}k\left(\mathbf{x}_{i}^{\ell},\mathbf{x}_{n}^{\ell}\right)\right]. (1)

The kernel function can be chosen as the RBF kernel k(\mathbf{x},\mathbf{y})=\exp(-\|\mathbf{x}-\mathbf{y}\|_{2}^{2}/(2h)) (Liu and Wang (2016)) or the IMQ kernel k(\mathbf{x},\mathbf{y})=1/\sqrt{1+\|\mathbf{x}-\mathbf{y}\|_{2}^{2}/(2h)} (Gorham and Mackey (2017)). Equation (1) can be divided into two parts: the driving force term k\left(\boldsymbol{x}^{\prime},\boldsymbol{x}\right)\nabla_{\boldsymbol{x}^{\prime}}\log p\left(\boldsymbol{x}^{\prime}\right) and the repulsive force term \nabla_{\boldsymbol{x}^{\prime}}k\left(\boldsymbol{x}^{\prime},\boldsymbol{x}\right). It has been demonstrated that SVGD suffers from the curse of dimensionality (Zhuo et al. (2018); Liu (2017)): there is a negative correlation between the problem dimension and the repulsive force of SVGD (Ba et al. (2021)), so the influence of the repulsive force is mainly governed by the dimension of the target distribution.
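To make the update concrete, the following is a minimal NumPy sketch of one SVGD iteration with the RBF kernel; the function names (svgd_step, grad_logp) and the median-heuristic bandwidth are our own illustrative choices, not part of the original algorithm specification.

```python
import numpy as np

def svgd_step(X, grad_logp, eps=1e-2):
    # One SVGD update following Equation (1): driving force plus repulsive force.
    m, D = X.shape
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    h = 0.5 * np.median(sq_dists) / np.log(m + 1)    # median-heuristic bandwidth (an assumption)
    K = np.exp(-sq_dists / (2.0 * h))                # RBF kernel matrix k(x_i, x_n)
    scores = grad_logp(X)                            # row i holds grad log p(x_i)
    driving = K @ scores                             # sum_i k(x_i, x_n) grad log p(x_i)
    repulsive = (K.sum(axis=1, keepdims=True) * X - K @ X) / h  # sum_i grad_{x_i} k(x_i, x_n)
    return X + eps * (driving + repulsive) / m

# Example: zero-mean Gaussian target N(0, I), for which grad log p(x) = -x.
X = np.random.randn(100, 5) + 10.0
for _ in range(500):
    X = svgd_step(X, lambda Y: -Y)
```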

MP-SVGD. The Message Passing SVGD (MP-SVGD) (Zhuo et al. (2018); Wang et al. (2018)) was proposed to reduce the dimension of the target distribution by identifying the Markov blanket for problems with a known graph structure. For a dimension index d, its Markov blanket \Gamma_{d}=\cup\{F:d\in F\}\backslash\{d\} contains the neighborhood nodes of d such that p\left(x_{d}\mid\mathbf{x}_{\neg d}\right)=p\left(x_{d}\mid\mathbf{x}_{\Gamma_{d}}\right), where the union is over the factor index sets F\subseteq\{1,\ldots,D\} and \neg d=\{1,\ldots,D\}\backslash\{d\}. However, MP-SVGD relies on sparse dependencies among the variables of the target distribution to obtain good results.
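As a small illustration of the Markov blanket construction above, the sketch below computes \Gamma_{d} from a list of factor index sets; the chain-graph factors are a hypothetical example.

```python
def markov_blanket(d, factor_sets):
    # Gamma_d = union of all factor index sets F that contain d, minus {d}.
    return set().union(*(set(F) for F in factor_sets if d in F)) - {d}

# Hypothetical chain graph x0 - x1 - x2 - x3 with pairwise factors:
factors = [(0, 1), (1, 2), (2, 3)]
print(markov_blanket(1, factors))   # {0, 2}
print(markov_blanket(0, factors))   # {1}
```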

2.2 Mixing for random variables

Since SVGD forms a dynamical system in which particles interact with each other, one can no longer treat the converged particles as independent and identically distributed. Therefore, we need the mathematical tool of “mixing”; cf. Bradley (2005) for more information.

Mixing. Let \left\{\mathbf{x}_{m},m\geq 1\right\} be a sequence of random variables over some probability space (\Omega,F,P), where \sigma(\mathbf{x}_{i},i\leq j) denotes the \sigma-algebra generated by \{\mathbf{x}_{i},i\leq j\}. For any two \sigma-fields \sigma\left(\mathbf{x}_{i},i\leq j\right) and \sigma\left(\mathbf{x}_{i},i\geq j+k\right) with any k\geq 1, let

\beta_{0}=1,\quad\beta_{k}=\sup_{j\geq 1}\beta\left(\sigma\left(\mathbf{x}_{i},i\leq j\right),\sigma\left(\mathbf{x}_{i},i\geq j+k\right)\right), (2)

where

\beta(\mathcal{A},\mathcal{B})=\frac{1}{2}\sup\left\{\sum_{i\in I}\sum_{j\in J}\left|\mathbb{P}\left(A_{i}\cap B_{j}\right)-\mathbb{P}\left(A_{i}\right)\mathbb{P}\left(B_{j}\right)\right|\right\}

and the supremum is taken over all finite partitions \left(A_{i}\right)_{i\in I} and \left(B_{j}\right)_{j\in J} of \Omega. A family \left\{\mathbf{x}_{m},m\geq 1\right\} of random variables is said to be absolutely regular (or \beta-mixing) if \lim_{m\to\infty}\beta_{m}=0, where the coefficients of absolute regularity \beta_{m} are defined in Equation (2) (Banna et al. (2016)). These coefficients quantify the strength of dependence between the \sigma-algebra generated by (\mathbf{x}_{i})_{1\leq i\leq k} and the one generated by (\mathbf{x}_{i})_{i\geq k+m} for all k\in\mathbb{N}^{*}. That \beta_{m} tends to zero as m goes to infinity implies that the \sigma-algebra generated by (\mathbf{x}_{i})_{i\geq k+m} becomes less and less dependent on \sigma\left(\mathbf{x}_{i},i\leq k\right) (Bradley (2005)).

3 Covariance Analysis under β-mixing

Ba et al. (2021) analyzed the convergence of SVGD when the covariance matrix of the Gaussian target is an identity matrix, under an assumption of near-orthogonality. On the basis of this work, we analyze the more general form of variance collapse. Moreover, the quantification of variance collapse for finite particles is, to the best of our knowledge, still an open problem. Chewi et al. (2020) show that SVGD can be viewed as a kernelized Wasserstein gradient flow of the chi-squared divergence, which suggests that extending our analysis in this section to related methods such as Wasserstein gradient flows or normalizing flows is possible.

3.1 Assumptions

A1 (Fixed points). Although the convergence of SVGD with finite particles is still an open problem, many experimental studies observe that all particles converge to fixed points (Ba et al. (2021)). In this paper, we also assume that for SVGD with a finite number of particles, the particles eventually converge to the target distribution and approach fixed points.

A2 (m-dependence of particles)

Definition 1

(Hoeffding and Robbins (1948)) If for some function f(n), the inequality s-r\geq f(n)\geq 0 implies that the two sets (x_{1},...,x_{r}) and (x_{s},...,x_{m}) are independent, then the sequence \left\{x_{i}\right\}_{i=1}^{m} is said to be f(n)-dependent.

We assume that the fixed points of SVGD satisfy the m-dependence assumption. In Ba et al. (2021), only weak correlation between SVGD particles is reported. We perform a numerical verification in Appendix B and leave a rigorous proof for future work. Under Assumptions A1-A2 and the assumption that the target distribution has zero mean, we analyze the variance collapse of SVGD quantitatively in what follows.

3.2 Concentration of particles

We first give an upper bound on the region in which the particles concentrate. The main tool used here is the Jensen gap (Gao et al. (2019)).

Proposition 1

Let Assumption A1 hold. For a zero-mean Gaussian target and the Gaussian RBF kernel k(\mathbf{x}_{i},\mathbf{x}_{j})=\exp\left(-\frac{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}}{2h}\right), we have

\|\mathbf{x}_{i}\|_{2}\leq(2M/c+1)\left(\mathbb{E}\|\mathbf{x}\|_{2}^{2}\right)^{1/2},

where M=\sup_{\mathbf{x}\neq 0}\frac{\left|k(\mathbf{x}_{i},\mathbf{x})-k(\mathbf{x}_{i},0)\right|}{2\|\mathbf{x}\|_{2}}, c=\frac{2}{h}e^{-\frac{c_{0}\mathrm{tr}(\Sigma_{m})}{h}}, 1<c_{0}\leq m is a positive constant, and \Sigma_{m} is the empirical covariance matrix of the particles.

The proof is given in Appendix A. For most sampling-based inference methods, such as MCMC and VI, samples can spread across the whole sample space, although some samples occur with extremely small probability. Proposition 1, however, shows that the particles of SVGD with finitely many particles are confined to a certain range, although this range may expand as the number of particles increases. With the RBF kernel, this upper bound is related to the trace of the covariance of the target distribution. With the IMQ kernel k(\mathbf{x},\mathbf{y})=1/\sqrt{1+\|\mathbf{x}-\mathbf{y}\|_{2}^{2}/(2h)}, this range is further reduced (Gorham and Mackey (2017)).
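The bound can be evaluated numerically from a set of converged particles. Below is a small sketch, assuming Assumption A1 holds for the input particles and replacing the supremum M by the conservative Lipschitz-constant surrogate e^{-1/2}/(2\sqrt{h}) of the RBF kernel; the choice c_{0}=2 and the function name are illustrative, not the paper's exact constants.

```python
import numpy as np

def proposition1_bound(X, h, c0=2.0):
    # X: (m, D) particles assumed to have converged to fixed points (Assumption A1).
    # Bound from Proposition 1: ||x_i||_2 <= (2M/c + 1) * (E||x||_2^2)^{1/2}, with
    #   c = (2/h) * exp(-c0 * tr(Sigma_m) / h),
    # and M replaced by exp(-1/2) / (2*sqrt(h)), a surrogate based on the Lipschitz
    # constant of the RBF kernel (an assumption, not the paper's exact constant).
    Sigma_m = np.cov(X, rowvar=False)
    M = np.exp(-0.5) / (2.0 * np.sqrt(h))
    c = (2.0 / h) * np.exp(-c0 * np.trace(Sigma_m) / h)
    second_moment = np.mean(np.sum(X ** 2, axis=1))   # sample proxy for E||x||_2^2
    bound = (2.0 * M / c + 1.0) * np.sqrt(second_moment)
    return np.max(np.linalg.norm(X, axis=1)), bound

X = np.random.randn(50, 5)            # stand-in for converged particles
print(proposition1_bound(X, h=1.0))   # (max particle norm, theoretical bound)
```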

3.3 Covariance estimation

For independent and identically distributed (i.i.d.) samples, the variance can be estimated using the Bernstein inequality or the Lieb inequality. However, these random matrix results typically require the particles to be i.i.d., which is no longer satisfied by SVGD due to its interacting update. To analyze the variance of particles from SVGD, we assume these particles are m-dependent. For a sequence of random variables, the m-dependence assumption implies that they satisfy the \beta-mixing condition (Bradley (2005)). We obtain the following results based on the Bernstein inequality for dependent random matrices (Banna et al. (2016)).

Proposition 2

Let Assumptions A1-A2 hold. For SVGD with the Gaussian RBF kernel k(\mathbf{x}_{i},\mathbf{x}_{j})=\exp\left(-\frac{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}}{2h}\right), denote \mathbb{X}_{i}=\mathbf{x}_{i}\mathbf{x}_{i}^{T}-\Sigma, where \Sigma is the covariance matrix of the target distribution. There exists \alpha>0 such that for any integer k\geq 1 the following inequality holds,

\beta_{k}=\sup_{j\geq 1}\beta\left(\sigma\left(\mathbf{x}_{i},i\leq j\right),\sigma\left(\mathbf{x}_{i},i\geq j+k\right)\right)\leq\mathrm{e}^{-\alpha(k-1)}.

Denote the empirical covariance matrix of these m particles by \Sigma_{m}; then

\mathbb{E}\|\Sigma_{m}-\Sigma\|\leq 30v\sqrt{\frac{\log D}{m}}+\frac{2K^{2}\mathrm{tr}(\Sigma)}{m}\left(4\alpha^{-1/2}\sqrt{\log D}+\gamma(\alpha,m)\log D\right),

where K=(2M/c+1) and M, c are identical to those in Proposition 1. Here, \alpha measures the correlation between particles, and D is the dimension of the target distribution. Moreover,

v^{2}=\sup_{N\subseteq\{1,\ldots,m\}}\frac{1}{|N|}\lambda_{\max}\left(\mathbb{E}\Big(\sum_{i\in N}\mathbb{X}_{i}\Big)^{2}\right).

Here, \lambda_{\max}\left(\mathbb{X}\right) denotes the eigenvalue of \mathbb{X} with the maximum magnitude, |N| is the cardinality of the set N, and

\gamma(\alpha,m)=\frac{\log m}{\log 2}\max\left(2,\frac{32\log m}{\alpha\log 2}\right).

Proposition 2 shows that the main factors affecting the upper bound on the covariance error are the inter-particle correlation, the number of particles, the dimension of the target distribution, and the trace of its covariance matrix. According to Ba et al. (2021), \mathrm{tr}(\Sigma_{m})\leq\mathrm{tr}(\Sigma) can be expected to hold for SVGD, so the constant c in Proposition 1 can be replaced by \frac{2}{h}e^{-\frac{c_{0}\mathrm{tr}(\Sigma)}{h}}.
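For reference, a short sketch that evaluates the right-hand side of the bound in Proposition 2; the constants v, K, and \alpha are problem-dependent and are passed in as assumed values here (the numbers in the example call are illustrative only).

```python
import numpy as np

def gamma(alpha, m):
    # gamma(alpha, m) = (log m / log 2) * max(2, 32 log m / (alpha log 2)).
    return (np.log(m) / np.log(2)) * max(2.0, 32.0 * np.log(m) / (alpha * np.log(2)))

def proposition2_bound(v, K, alpha, tr_sigma, m, D):
    # Right-hand side of Proposition 2: 30 v sqrt(log D / m)
    #   + (2 K^2 tr(Sigma) / m) * (4 alpha^{-1/2} sqrt(log D) + gamma(alpha, m) log D).
    term1 = 30.0 * v * np.sqrt(np.log(D) / m)
    term2 = (2.0 * K ** 2 * tr_sigma / m) * (
        4.0 * alpha ** (-0.5) * np.sqrt(np.log(D)) + gamma(alpha, m) * np.log(D)
    )
    return term1 + term2

print(proposition2_bound(v=1.0, K=2.0, alpha=1.0, tr_sigma=2.0, m=5000, D=2))
```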

4 Augmented MP-SVGD

Here, we propose the so-called Augmented Message Passing SVGD (AUMP-SVGD) to overcome the covariance underestimation of SVGD. Compared with MP-SVGD, AUMP-SVGD requires neither a known graph structure nor sparsity of the target distribution.

4.1 MP-SVGD

The update direction \Delta of SVGD is given by

\Delta^{\mathrm{SVGD}}(\boldsymbol{x})=\mathbb{E}_{\boldsymbol{x}^{\prime}\sim q}\Big[\underbrace{k\left(\boldsymbol{x}^{\prime},\boldsymbol{x}\right)\nabla_{\boldsymbol{x}^{\prime}}\log p\left(\boldsymbol{x}^{\prime}\right)}_{\text{driving force}}+\underbrace{\nabla_{\boldsymbol{x}^{\prime}}k\left(\boldsymbol{x}^{\prime},\boldsymbol{x}\right)}_{\text{repulsive force}}\Big].

The log-derivative term \mathbb{E}_{\boldsymbol{x}^{\prime}\sim q}\left[k(\boldsymbol{x}^{\prime},\boldsymbol{x})\nabla_{\boldsymbol{x}^{\prime}}\log p(\boldsymbol{x}^{\prime})\right] in the update rule corresponds to the driving force that guides particles toward the high-likelihood region. The (Stein) score function \nabla_{\boldsymbol{x}^{\prime}}\log p\left(\boldsymbol{x}^{\prime}\right) is a vector field describing the target distribution, and \nabla_{\boldsymbol{x}^{\prime}}k\left(\boldsymbol{x}^{\prime},\boldsymbol{x}\right) provides a repulsive force that prevents particles from aggregating. However, as the dimension of the target distribution increases, this repulsive force gradually decreases (Zhuo et al. (2018)), which causes SVGD to fall under the curse of dimensionality. Effectively reducing the dimension is therefore the guiding principle for making SVGD overcome this curse.

Our concern is the problem with a continuous graphical model, i.e., a target distribution of the form p(\mathbf{x})\propto\exp\left[\sum_{s\in\mathcal{S}}\psi\left(\mathbf{x}_{s}\right)\right], where \mathcal{S} is the family of index sets s\subseteq\{1,\ldots,D\} that specifies the Markov structure. For any index i, its Markov blanket is \mathcal{N}_{i}:=\cup\{s:s\in\mathcal{S},i\in s\}\backslash\{i\}. According to Wang et al. (2018), one can replace the global kernel function with a local kernel function that depends only on \mathcal{C}_{i}:=\mathcal{N}_{i}\cup\{i\} for any i. Under this transformation, the dimension is reduced from D to |\mathcal{C}_{i}|, where |\mathcal{C}_{i}| is the size of the set \mathcal{C}_{i}. Then, SVGD becomes

\mathbf{x}_{i}^{\ell}\leftarrow\mathbf{x}_{i}^{\ell}+\epsilon\phi^{*}\left(\mathbf{x}_{i}^{\ell}\right),\quad\forall i\in\{1,\ldots,D\},\ \ell\in\{1,\ldots,m\}, (3a)
\phi^{*}(\mathbf{x}_{i}^{\ell}):=\frac{1}{m}\sum_{j=1}^{m}s\left(x_{i}^{j}\right)k_{i}\left(\mathbf{x}_{\mathcal{C}_{i}}^{j},\mathbf{x}_{\mathcal{C}_{i}}^{\ell}\right)+\partial_{x_{i}^{j}}k_{i}\left(\mathbf{x}_{\mathcal{C}_{i}}^{j},\mathbf{x}_{\mathcal{C}_{i}}^{\ell}\right), (3b)

where s(x_{i}^{j})=\nabla_{x_{i}^{j}}\log p(\mathbf{x}^{j}) is the score with respect to the i-th coordinate. Such a method is known as message passing SVGD (MP-SVGD) (Zhuo et al. (2018); Liu (2017)). However, MP-SVGD needs to know the graph structure of the target distribution in advance to determine the Markov blanket, and the graph must be sparse in order to achieve dimension reduction.
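The sketch below illustrates one MP-SVGD sweep for a zero-mean Gaussian target \mathcal{N}(0,\Sigma), whose score is -\Sigma^{-1}\mathbf{x}; the blankets are assumed to be known, and the fixed bandwidth and function name are our own illustrative choices.

```python
import numpy as np

def mp_svgd_step(X, blankets, Sigma_inv, eps=1e-2, h=1.0):
    # One MP-SVGD sweep over coordinates, following Equation (3).
    # blankets[i] is the Markov blanket N_i of coordinate i; local kernels act on
    # x_{C_i} with C_i = N_i + {i}. The target is N(0, Sigma), score = -Sigma_inv @ x.
    m, D = X.shape
    scores = -X @ Sigma_inv.T                      # row n holds grad log p(x_n)
    X_new = X.copy()
    for i in range(D):
        C = sorted(set(blankets[i]) | {i})
        Xc = X[:, C]                               # particles restricted to C_i
        sq = np.sum((Xc[:, None, :] - Xc[None, :, :]) ** 2, axis=-1)
        K = np.exp(-sq / (2.0 * h))                # local RBF kernel on x_{C_i}
        xi = X[:, i]
        driving = K @ scores[:, i]                 # sum_j k_i(x^j, x^n) s(x_i^j)
        repulsive = (K.sum(axis=1) * xi - K @ xi) / h
        X_new[:, i] = xi + eps * (driving + repulsive) / m
    return X_new
```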

4.2 Augmented MP-SVGD

Inspired by MP-SVGD, we propose the augmented MP-SVGD, which is suitable for more complex graph structures. We keep the previous symbols but redefine them for clarity. We assume p(\mathbf{x}) can be factorized as p(\mathbf{x})\propto\prod_{F\in\mathcal{F}}\psi_{F}\left(\mathbf{x}_{F}\right), where F\subset\{1,\ldots,D\} is an index set. We partition \{1,\ldots,D\}\backslash\{d\} into two sets \Gamma_{d} and \mathbf{S_{d}} such that \Gamma_{d}\cap\mathbf{S_{d}}=\emptyset and \Gamma_{d}\cup\mathbf{S_{d}}=\{1,\ldots,D\}\backslash\{d\}. Let \mathbf{x}_{\mathbf{S_{d}}}=[x_{i},\ldots,x_{j}] with i,\ldots,j\in\mathbf{S_{d}}, and similarly \mathbf{x}_{\Gamma_{d}}=[x_{i},\ldots,x_{j}] with i,\ldots,j\in\Gamma_{d}.


Figure 1: Probabilistic Graphical Model

Consider the probabilistic graphical model illustrated in Figure 1; p(\mathbf{x}) can be represented as p(\mathbf{x})=p(x_{d}|\mathbf{x}_{\Gamma_{d}},\mathbf{x}_{\mathbf{S_{d}}})p(\mathbf{x}_{\Gamma_{d}}|\mathbf{x}_{\mathbf{S_{d}}})p(\mathbf{x}_{\mathbf{S_{d}}}). Our method relies on the key observation of Zhuo et al. (2018) that

\mathrm{KL}(q\|p)=\underbrace{\mathrm{KL}\left(q\left(\mathbf{x}_{\Gamma_{d}},\mathbf{x}_{\mathbf{S_{d}}}\right)\|p\left(\mathbf{x}_{\Gamma_{d}},\mathbf{x}_{\mathbf{S_{d}}}\right)\right)}_{\text{prKL}(q,p,d)}+\underbrace{\mathrm{KL}\left(q\left(x_{d}\mid\mathbf{x}_{\neg d}\right)q\left(\mathbf{x}_{\neg d}\right)\|p\left(x_{d}\mid\mathbf{x}_{\neg d}\right)p\left(\mathbf{x}_{\neg d}\right)\right)}_{\text{seKL}(q,p,d)}.

To minimize \mathrm{KL}(q\|p), we adopt a two-stage procedure in which \text{prKL}(q,p,d) and \text{seKL}(q,p,d) are optimized alternately. At the first stage, \text{prKL}(q,p,d) is further decomposed into

\text{prKL}(q,p,d)=\mathrm{KL}\left(q\left(\mathbf{x}_{\mathbf{S_{d}}}\right)\|p\left(\mathbf{x}_{\mathbf{S_{d}}}\right)\right)+\mathrm{KL}\left(q\left(\mathbf{x}_{\Gamma_{d}}\mid\mathbf{x}_{\mathbf{S_{d}}}\right)q\left(\mathbf{x}_{\mathbf{S_{d}}}\right)\|p\left(\mathbf{x}_{\Gamma_{d}}\mid\mathbf{x}_{\mathbf{S_{d}}}\right)p\left(\mathbf{x}_{\mathbf{S_{d}}}\right)\right). (4)

We fix \mathbf{x}_{\mathbf{S_{d}}} and apply a local kernel function to the second term of Equation (4) to minimize \text{prKL}(q,p,d),

\mathbf{x}_{\Gamma_{d}}=\mathop{\arg\min}\limits_{q\left(\mathbf{x}_{\Gamma_{d}}\mid\mathbf{x}_{\mathbf{S_{d}}}\right)}\mathrm{KL}\left(q\left(\mathbf{x}_{\Gamma_{d}}\mid\mathbf{x}_{\mathbf{S_{d}}}\right)q\left(\mathbf{x}_{\mathbf{S_{d}}}\right)\|p\left(\mathbf{x}_{\Gamma_{d}}\mid\mathbf{x}_{\mathbf{S_{d}}}\right)p\left(\mathbf{x}_{\mathbf{S_{d}}}\right)\right).

This optimization procedure is given by Proposition 3, whose proof is left to the appendix.

Proposition 3

Let T(\mathbf{x})=\left[x_{1},\ldots,T_{\Gamma_{d}}\left(\mathbf{x}_{\Gamma_{d}}\right),\ldots,x_{D}\right]^{T}, where T_{\Gamma_{d}}:\ \mathbf{x}_{\Gamma_{d}}\rightarrow\mathbf{x}_{\Gamma_{d}}+\epsilon\phi_{\Gamma_{d}}\left(\mathbf{x}_{\neg d}\right) and \phi_{\Gamma_{d}}\in\mathcal{H}_{\Gamma_{d}}. Here \mathcal{H}_{\Gamma_{d}} is the RKHS defined by the local kernel function k_{\Gamma_{d}}:\ \mathcal{X}_{\neg d}\times\mathcal{X}_{\neg d}\rightarrow\mathbb{R}. The optimal solution of the optimization problem

\mathop{\min}\limits_{\left\|\phi_{\Gamma_{d}}\right\|\leq 1}\left.\nabla_{\epsilon}\mathrm{KL}\left(q_{[T]}\left(\mathbf{x}_{\Gamma_{d}},\mathbf{x}_{\mathbf{S_{d}}}\right)\|p\left(\mathbf{x}_{\Gamma_{d}},\mathbf{x}_{\mathbf{S_{d}}}\right)\right)\right|_{\epsilon=0},

is given by \phi_{\Gamma_{d}}^{*}/\left\|\phi_{\Gamma_{d}}^{*}\right\|_{\mathcal{H}_{\Gamma_{d}}}, where

\phi_{\Gamma_{d}}^{*}\left(\mathbf{x}_{\neg d}\right)=\mathbb{E}_{\mathbf{y}_{\neg d}\sim q}\left[k_{\Gamma_{d}}\left(\mathbf{x}_{\neg d},\mathbf{y}_{\neg d}\right)\nabla_{\mathbf{y}_{\Gamma_{d}}}\log p\left(\mathbf{y}_{\Gamma_{d}}\mid\mathbf{y}_{\neg d}\right)+\nabla_{\mathbf{y}_{\Gamma_{d}}}k_{\Gamma_{d}}\left(\mathbf{x}_{\neg d},\mathbf{y}_{\neg d}\right)\right].

At the second stage, \mathbf{x}_{\Gamma_{d}} and \mathbf{x}_{\mathbf{S_{d}}} are fixed while only x_{d} is updated. We can further decompose \mathrm{seKL}(q,p,d) into three parts via the convexity of the KL divergence,

0\leq\mathrm{seKL}(q,p,d)\leq\mathrm{KL}\left[\frac{q(x_{d}\mid\mathbf{x}_{\Gamma_{d}})}{q(\mathbf{x}_{\neg d})}\|\frac{p(x_{d}\mid\mathbf{x}_{\Gamma_{d}})}{p(\mathbf{x}_{\neg d})}\right]+\mathrm{KL}\left[\frac{q(x_{d}\mid\mathbf{x}_{\mathbf{S_{d}}})}{q(\mathbf{x}_{\neg d})}\|\frac{p(x_{d}\mid\mathbf{x}_{\mathbf{S_{d}}})}{p(\mathbf{x}_{\neg d})}\right]+C,

where C=\mathrm{KL}\left[\frac{1}{q(\mathbf{x}_{\neg d})}\|\frac{1}{p(\mathbf{x}_{\neg d})}\right] is a positive constant because \mathbf{x}_{\neg d}=\mathbf{x}_{\Gamma_{d}}\cup\mathbf{x}_{\mathbf{S_{d}}} is fixed. Therefore,

x_{d}=\mathop{\arg\min}\limits_{q(x_{d}\mid\mathbf{x}_{\Gamma_{d}}),\,q(x_{d}\mid\mathbf{x}_{\mathbf{S_{d}}})}\mathrm{KL}\left[\frac{q(x_{d}\mid\mathbf{x}_{\Gamma_{d}})}{q(\mathbf{x}_{\neg d})}\|\frac{p(x_{d}\mid\mathbf{x}_{\Gamma_{d}})}{p(\mathbf{x}_{\neg d})}\right]+\mathrm{KL}\left[\frac{q(x_{d}\mid\mathbf{x}_{\mathbf{S_{d}}})}{q(\mathbf{x}_{\neg d})}\|\frac{p(x_{d}\mid\mathbf{x}_{\mathbf{S_{d}}})}{p(\mathbf{x}_{\neg d})}\right].

The optimization procedure is described by Proposition 4, whose proof is also left to the appendix.

Proposition 4

Let T(\mathbf{x})=\left[x_{1},\ldots,T_{d}\left(x_{d}\right),\ldots,x_{D}\right]^{T}, where T_{d}:x_{d}\rightarrow x_{d}+\epsilon\phi_{d}\left(\mathbf{x}_{C_{d}}\right) and \phi_{d}\in\mathcal{H}_{d}. Here \mathcal{H}_{d} is the RKHS defined by the local kernel k_{d}:\mathcal{X}_{S_{d}}\times\mathcal{X}_{S_{d}}\rightarrow\mathbb{R}, and C_{d}=S_{d}\cup\{d\} or C_{d}=\Gamma_{d}\cup\{d\}. The optimal solution of the following optimization problem

\mathop{\min}\limits_{\left\|\phi_{d}\right\|\leq 1}\left.\nabla_{\epsilon}\mathrm{KL}\left[\frac{q_{[T]}(x_{d}\mid\mathbf{x}_{\mathbf{C_{d}}})}{q(\mathbf{x}_{\neg d})}\|\frac{p(x_{d}\mid\mathbf{x}_{\mathbf{C_{d}}})}{p(\mathbf{x}_{\neg d})}\right]\right|_{\epsilon=0},

is given by \phi_{d}^{*}/\left\|\phi_{d}^{*}\right\|_{\mathcal{H}_{d}}, where

\phi_{d}^{*}\left(\mathbf{x}_{C_{d}}\right)=\mathbb{E}_{\mathbf{y}_{C_{d}}\sim q}\left[k_{d}\left(\mathbf{x}_{C_{d}},\mathbf{y}_{C_{d}}\right)\nabla_{y_{d}}\log p\left(y_{d}\mid\mathbf{y}_{C_{d}}\right)+\nabla_{y_{d}}k_{d}\left(\mathbf{x}_{C_{d}},\mathbf{y}_{C_{d}}\right)\right].

For d\in\left\{1,2,\ldots,D\right\}, x_{d} is updated through the above two-stage procedure, which reduces the dimension of the original problem from D to \min(|\mathbf{x}_{\mathbf{S_{d}}}|,|\mathbf{x}_{\Gamma_{d}}|). In this way, we are able to infer more complex target distributions on which traditional MP-SVGD fails; this comparison is illustrated in Experiment 2. Moreover, we do not need to know the true probabilistic graph structure of the target distribution in advance.
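The conditional scores \nabla_{y_{d}}\log p(y_{d}\mid\mathbf{y}_{C_{d}}) used above have a closed form for a Gaussian target such as the first example in Section 5.1. The following is a sketch of such a helper; it is limited to zero-mean Gaussian targets, and the function name is ours.

```python
import numpy as np

def gaussian_conditional_score(x, d, A, Sigma):
    # Score of the conditional p(x_d | x_A) for a zero-mean Gaussian target N(0, Sigma):
    # x_d | x_A ~ N(mu, s2) with mu = Sigma[d, A] Sigma[A, A]^{-1} x_A and
    # s2 = Sigma[d, d] - Sigma[d, A] Sigma[A, A]^{-1} Sigma[A, d].
    A = list(A)
    S_AA = Sigma[np.ix_(A, A)]
    S_dA = Sigma[d, A]
    w = np.linalg.solve(S_AA, S_dA)     # Sigma[A, A]^{-1} Sigma[A, d]
    mu = w @ x[A]
    s2 = Sigma[d, d] - S_dA @ w
    return -(x[d] - mu) / s2

# Example: tridiagonal covariance, conditioning x_0 on a chosen set {1, 2}.
Sigma = np.array([[1.0, 0.5, 0.0], [0.5, 1.0, 0.5], [0.0, 0.5, 1.0]])
x = np.array([0.3, -0.2, 0.1])
print(gaussian_conditional_score(x, d=0, A=[1, 2], Sigma=Sigma))
```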

The key first step of our AUMP-SVGD is to choose \Gamma_{d} (or equivalently \mathbf{S_{d}}). This problem can be formulated as follows: for m particles, each x\in\mathbb{R}^{D}, denote the ensemble matrix of these particles by X\in\mathbb{R}^{m\times D}. Let r=|\mathbf{x}_{\Gamma_{d}}| and let \mathcal{M}(r,m\times D) be the set of sub-matrices X_{r}\in\mathbb{R}^{m\times r} of X. Determining the set \Gamma_{d} corresponds to selecting the sub-matrix X_{r} from \mathcal{M}(r,m\times D) that minimizes \mathbb{E}\|\Sigma_{m}-\Sigma_{x}\|, where \Sigma_{m} is the empirical covariance matrix of the particles and \Sigma_{x} is the empirical covariance matrix of the sub-ensemble. The upper bound of \mathbb{E}\|\Sigma_{m}-\Sigma_{x}\| is already given by Proposition 2 and is related to \mathrm{tr}(\Sigma_{x}). Therefore, we choose the sub-matrix with the smallest \mathrm{tr}(\Sigma_{x}) so that \mathbb{E}\|\Sigma_{m}-\Sigma_{x}\| is minimized. This corresponds to selecting X_{r}\in\mathbb{R}^{m\times r} from X with minimal \operatorname{tr}\left(X_{r}X_{r}^{T}\right). Since \operatorname{tr}\left(X_{r}X_{r}^{T}\right)=\operatorname{tr}\left(X_{r}^{T}X_{r}\right), we just need to order the columns of X by their 2-norms and choose the r columns with the smallest norms. The computational complexity of this step is \mathcal{O}\left(Dm^{2}+D\log D\right). Finally, we give the complete form of our AUMP-SVGD in Algorithm 1; a small sketch of the column-selection rule is shown below.
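A minimal sketch of this selection rule, assuming the particle ensemble X is available as an m-by-D array; the function name and the way ties are broken are our own choices.

```python
import numpy as np

def choose_gamma_d(X, r, d):
    # X: (m, D) particle ensemble. Pick the r coordinates (excluding d) whose columns
    # of X have the smallest 2-norm, which minimizes tr(X_r X_r^T) = tr(X_r^T X_r).
    col_norms = np.linalg.norm(X, axis=0)
    col_norms[d] = np.inf                          # d itself never belongs to Gamma_d
    gamma_d = sorted(np.argsort(col_norms)[:r].tolist())
    s_d = [j for j in range(X.shape[1]) if j != d and j not in gamma_d]
    return gamma_d, s_d                            # Gamma_d and its complement S_d

X = np.random.randn(100, 8)
print(choose_gamma_d(X, r=3, d=0))
```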

Algorithm 1 Augmented Message Passing SVGD
Input: a set of initial particles \left\{\mathbf{x}^{(i)}\right\}_{i=1}^{m}
for iteration i do
    for d\in\{1,\ldots,D\} do
       Set \Gamma_{d} and S_{d}
       \mathbf{x}_{\Gamma_{d}}^{(i)}\leftarrow\mathbf{x}_{\Gamma_{d}}^{(i)}+\epsilon\hat{\phi}_{\Gamma_{d}}^{*}(\mathbf{x}_{\neg d}^{(i)})
       \hat{\phi}_{\Gamma_{d}}^{*}\left(\mathbf{x}_{\neg d}\right)=\mathbb{E}_{\mathbf{y}_{\neg d}\sim\hat{q}_{M}}\left[k_{\Gamma_{d}}\left(\mathbf{x}_{\neg d},\mathbf{y}_{\neg d}\right)\nabla_{\mathbf{y}_{\Gamma_{d}}}\log p\left(\mathbf{y}_{\Gamma_{d}}\mid\mathbf{y}_{\neg d}\right)+\nabla_{\mathbf{y}_{\Gamma_{d}}}k_{\Gamma_{d}}\left(\mathbf{x}_{\neg d},\mathbf{y}_{\neg d}\right)\right]
       Update x_{d} from \mathbf{x}_{S_{d}} with C_{d}=S_{d}\cup\{d\}:
         \mathbf{x}_{d}^{(i)}\leftarrow\mathbf{x}_{d}^{(i)}+\epsilon\hat{\phi}_{d}^{*}(\mathbf{x}_{C_{d}}^{(i)})
         \hat{\phi}_{d}^{*}\left(\mathbf{x}_{C_{d}}\right)=\mathbb{E}_{\mathbf{y}_{C_{d}}\sim\hat{q}_{M}}\left[\nabla_{y_{d}}k_{d}\left(\mathbf{x}_{C_{d}},\mathbf{y}_{C_{d}}\right)+k_{d}\left(\mathbf{x}_{C_{d}},\mathbf{y}_{C_{d}}\right)\nabla_{y_{d}}\log p\left(y_{d}\mid\mathbf{y}_{S_{d}}\right)\right]
       Update x_{d} from \mathbf{x}_{\Gamma_{d}} with C_{d}=\Gamma_{d}\cup\{d\}:
         \mathbf{x}_{d}^{(i)}\leftarrow\mathbf{x}_{d}^{(i)}+\epsilon\hat{\phi}_{d}^{*}(\mathbf{x}_{C_{d}}^{(i)})
         \hat{\phi}_{d}^{*}\left(\mathbf{x}_{C_{d}}\right)=\mathbb{E}_{\mathbf{y}_{C_{d}}\sim\hat{q}_{M}}\left[\nabla_{y_{d}}k_{d}\left(\mathbf{x}_{C_{d}},\mathbf{y}_{C_{d}}\right)+k_{d}\left(\mathbf{x}_{C_{d}},\mathbf{y}_{C_{d}}\right)\nabla_{y_{d}}\log p\left(y_{d}\mid\mathbf{y}_{\Gamma_{d}}\right)\right]
    end for
end for
Output: a set of particles \left\{\mathbf{x}^{(i)}\right\}_{i=1}^{m} as samples from p(\mathbf{x})

5 Experiments

We study the uncertainty quantification properties of AUMP-SVGD compared with existing methods such as SVGD, MP-SVGD, projected SVGD (PSVGD) (Chen and Ghattas (2020)), and sliced Stein variational gradient descent (S-SVGD) (Gong et al. (2020)) through extensive experiments. We conclude from the experiments that SVGD may underestimate uncertainty, S-SVGD may overestimate it, and AUMP-SVGD with properly partitioned \Gamma_{d} and \mathbf{S}_{d} produces the best estimate. In almost all scenarios, AUMP-SVGD outperforms PSVGD.

5.1 Gaussian Mixture Models

Multivariate Gaussian. The first example is a D-dimensional multivariate Gaussian p(\mathbf{x})=\mathcal{N}\left(0,I_{D}\right). For each method, 100 particles are initialized from q_{0}(\mathbf{x})=\mathcal{N}\left(10,I_{D}\right).

Spaceship Mixture. The target in the second experiment is a D-dimensional mixture of two correlated Gaussian distributions p(\mathbf{x})=0.5\mathcal{N}\left(x;\mu_{1},\Sigma_{1}\right)+0.5\mathcal{N}\left(x;\mu_{2},\Sigma_{2}\right). The means \mu_{1} and \mu_{2} of the two Gaussians have components equal to 1 in the first two coordinates and 0 otherwise. The covariance matrices admit a correlated block-diagonal structure. The mixture hence manifests as a “spaceship”-shaped marginal density in the first two dimensions (see Figure 2).

Figure 2: The result of object edges and estimated densities in the first two dimensions. Red dots are particles after 2000 iterations. Both examples are 50-dimensional

It can be seen from Figure 2 that for high-dimensional inference, particles from SVGD aggregate, reflecting the curse of dimensionality (Zhuo et al. (2018); Liu and Wang (2016)). AUMP-SVGD, however, estimates the true probability distribution well in these high-dimensional situations. We calculate the energy distance and the mean-square error (MSE) \mathbb{E}\|\Sigma_{m}-\Sigma\|_{2} between the samples from each inference algorithm and the real samples. The energy distance is given by D^{2}(F,G)=2\mathbb{E}\|X-Y\|-\mathbb{E}\left\|X-X^{\prime}\right\|-\mathbb{E}\left\|Y-Y^{\prime}\right\|, where F and G are the cumulative distribution functions (CDFs) of X and Y, respectively, and X^{\prime} and Y^{\prime} denote i.i.d. copies of X and Y (Rizzo and Székely (2016)). Ten experiments are performed and the averaged results are given in Figure 3.
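The energy distance above can be estimated directly from the two sample sets. A short sketch follows (a plain V-statistic estimator; the example data are synthetic stand-ins, not the paper's experiment).

```python
import numpy as np

def energy_distance(X, Y):
    # V-statistic estimate of D^2(F, G) = 2 E||X - Y|| - E||X - X'|| - E||Y - Y'||.
    d_xy = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1).mean()
    d_xx = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1).mean()
    d_yy = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1).mean()
    return 2.0 * d_xy - d_xx - d_yy

# Example: particles versus i.i.d. samples from a 50-dimensional standard Gaussian.
rng = np.random.default_rng(0)
particles = rng.normal(size=(100, 50))
reference = rng.normal(size=(1000, 50))
print(energy_distance(particles, reference))
```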

Figure 3: The energy distance and MSE between SVGD samples and the real target distribution.

Figure 3 shows that the error of SVGD gradually grows as the dimension increases. For the sparse problem, when the graph structure is already known, MP-SVGD achieves outcomes comparable to S-SVGD and PSVGD-2. In the above example, PSVGD achieves its best results when the problem dimension is reduced to 2. AUMP-SVGD with |\Gamma_{d}| equal to 1 or 3 outperforms the other methods. In Experiment 1, we demonstrate that AUMP-SVGD yields a variance estimate equivalent to that of MP-SVGD under the simplified graph structure. In Experiment 2, as the correlation between different dimensions of the target becomes stronger or the density of the graph increases, our algorithm performs better than the other SVGD variants. Furthermore, our approach exhibits superior variance estimation compared with SVGD, MP-SVGD, S-SVGD, and PSVGD. In practice, the covariance matrix of the target distribution may not be sparse, making it challenging to capture this structure; as a result, the effectiveness of MP-SVGD is significantly limited. AUMP-SVGD, however, still attains stable and superior results.

Non-sparse experiment. We set the dimension of the target distribution to 50 and systematically transform the sparse covariance matrix into a non-sparse one by increasing the correlations between different dimensions around the main diagonal. The results are presented in Figure 4.

Figure 4: MSE and energy distance. Each graph structure is averaged over ten experiments.

As the density of the probabilistic graph increases, the discrepancy between MP-SVGD and the actual target distribution gradually grows, and MP-SVGD eventually succumbs to the curse of dimensionality. This happens because the repulsive force between particles in MP-SVGD primarily depends on the size of the Markov blanket; with increasing density, MP-SVGD encounters the same high-dimensional challenges as SVGD. However, as illustrated in Figure 4, regardless of the density of the target distribution's graph structure, AUMP-SVGD remains unaffected by variance collapse. This is because the repulsive force between particles in AUMP-SVGD is governed by the artificially chosen blanket size, underscoring the advantage of our algorithm over MP-SVGD.

5.2 Conditioned Diffusion Process

The next example is a benchmark often used to test inference methods in high dimensions (Detommaso et al. (2018); Chen and Ghattas (2020); Liu et al. (2022)). We consider a stochastic process u:[0,1]\rightarrow\mathbb{R} governed by

du=\frac{5u\left(1-u^{2}\right)}{1+u^{2}}dt+dx,\quad u_{0}=0,

where t\in(0,1] and the forcing term x=\left(x_{t}\right)_{t\geq 0} follows a Brownian motion, so that x\sim\mathcal{N}(0,B) with B\left(t,t^{\prime}\right)=\min\left(t,t^{\prime}\right). The noisy data \mathbf{z}=\left(z_{t_{1}},\ldots,z_{t_{50}}\right)^{T}\in\mathbb{R}^{50} are observed at 50 equispaced time points t_{i}=0.02i, where z_{t_{i}}=u_{t_{i}}+\epsilon with \epsilon\sim\mathcal{N}\left(0,\sigma^{2}\right) and \sigma=0.1. The objective is to use \mathbf{z} to infer the forcing term x and thus the state of the solution u. The results are given in Figure 5, where the shaded interval depicts the mean plus/minus the standard deviation.
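A sketch of the forward model used to generate synthetic observations for this benchmark, based on an Euler-Maruyama discretization of the SDE above; the step size, random seed, and function name are our own choices.

```python
import numpy as np

def solve_conditioned_diffusion(dx, dt=0.01):
    # Euler-Maruyama discretization of du = 5u(1 - u^2)/(1 + u^2) dt + dx, u_0 = 0,
    # where dx holds the Brownian increments of the forcing term x over each step.
    u = np.zeros(len(dx) + 1)
    for i, step in enumerate(dx):
        drift = 5.0 * u[i] * (1.0 - u[i] ** 2) / (1.0 + u[i] ** 2)
        u[i + 1] = u[i] + drift * dt + step
    return u

# Synthetic data: 100 steps on (0, 1], observations at t_i = 0.02 i with sigma = 0.1.
rng = np.random.default_rng(1)
dt = 0.01
dx = rng.normal(0.0, np.sqrt(dt), size=100)      # Brownian-motion increments
u = solve_conditioned_diffusion(dx, dt)
z = u[2::2] + rng.normal(0.0, 0.1, size=50)      # noisy observations z_{t_i}
```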

Figure 5: Results at iteration 1000.

As depicted in Figure 5, in the 50-dimensional case both SVGD and S-SVGD exhibit noticeable deviations from the ground truth, and S-SVGD exhibits excessively large variances in numerous tests. Consequently, SVGD and S-SVGD prove inadequate for the conditioned diffusion model. In contrast, our AUMP-SVGD demonstrates satisfactory performance with |\Gamma_{d}| equal to 5 or 10.

5.3 Bayesian Logistic Regression

We investigate a Bayesian logistic regression model from Liu and Wang (2016) applied to the Covertype dataset from Asuncion and Newman (2007). We use 70% of the data for training and 30% for testing. We compare AUMP-SVGD with SVGD, S-SVGD, PSVGD, MP-SVGD, and Hamiltonian Monte Carlo (HMC), with the number of generated samples ranging from 100 to 500. Each experiment uses ten different random seeds, and the error of each value does not exceed 0.01. We verify the impact of the different sampling methods on the prediction accuracy; the results are given in Table 1.

Table 1: Prediction accuracy for different numbers of particles. The optimal values for each case are in bold.
# particles HMC SVGD S-SVGD MP-SVGD PSVGD_2 AUMP-SVGD-5 AUMP-SVGD-10
100 0.70 0.74 0.76 0.74 0.77 0.73 0.75
200 0.71 0.74 0.762 0.74 0.79 0.78 0.78
300 0.73 0.74 0.762 0.741 0.81 0.80 0.81
400 0.80 0.75 0.76 0.74 0.83 0.81 0.81
500 0.81 0.75 0.765 0.75 0.85 0.82 0.86

Table 1 demonstrates that our AUMP-SVGD has a prediction accuracy similar to that of PSVGD, and as the number of particles increases, our algorithm becomes more accurate than the other algorithms.

6 Conclusion and Future work

In this paper, we analyze an upper bound related to the variance collapse of SVGD when the number of particles is finite. We show that the distribution of particles is restricted to a specific region rather than the entire probability space. We also propose the AUMP-SVGD algorithm to overcome the dependency of MP-SVGD on a known and sparse graph structure, and we show the effectiveness of AUMP-SVGD through various experiments. For future work, we aim to further investigate the convergence of SVGD with finite particles and to tighten our bounds. We also plan to apply AUMP-SVGD to more complex real-world applications, such as posture estimation (Pacheco et al. (2014)).

References

  • Asuncion and Newman [2007] Arthur Asuncion and David Newman. UCI machine learning repository, 2007.
  • Ba et al. [2021] Jimmy Ba, Murat A Erdogdu, Marzyeh Ghassemi, Shengyang Sun, Taiji Suzuki, Denny Wu, and Tianzong Zhang. Understanding the variance collapse of SVGD in high dimensions. In International Conference on Learning Representations, 2021.
  • Banna et al. [2016] Marwa Banna, Florence Merlevède, and Pierre Youssef. Bernstein-type inequality for a class of dependent random matrices. Random Matrices: Theory and Applications, 5(2):1650006, 2016.
  • Bradley [2005] Richard C Bradley. Basic properties of strong mixing conditions. A survey and some open questions. Probability Surveys, 2:107–144, 2005.
  • Chen and Ghattas [2020] Peng Chen and Omar Ghattas. Projected Stein variational gradient descent. In Advances in Neural Information Processing Systems, volume 33, pages 1947–1958, 2020.
  • Chewi et al. [2020] Sinho Chewi, Thibaut Le Gouic, Chen Lu, Tyler Maunu, and Philippe Rigollet. SVGD as a kernelized Wasserstein gradient flow of the chi-squared divergence. In Advances in Neural Information Processing Systems, volume 33, pages 2098–2109, 2020.
  • Detommaso et al. [2018] Gianluca Detommaso, Tiangang Cui, Youssef Marzouk, Alessio Spantini, and Robert Scheichl. A Stein variational Newton method. In Advances in Neural Information Processing Systems, volume 31, 2018.
  • Gao et al. [2019] Xiang Gao, Meera Sitharam, and Adrian E Roitberg. Bounds on the Jensen gap, and implications for mean-concentrated distributions. The Australian Journal of Mathematical Analysis and Applications, 16:1–16, 2019.
  • Gong et al. [2020] Wenbo Gong, Yingzhen Li, and José Miguel Hernández-Lobato. Sliced kernelized Stein discrepancy. In International Conference on Learning Representations, 2020.
  • Gorham and Mackey [2017] Jackson Gorham and Lester Mackey. Measuring sample quality with kernels. In International Conference on Machine Learning, pages 1292–1301, 2017.
  • Hoeffding and Robbins [1948] Wassily Hoeffding and Herbert Robbins. The central limit theorem for dependent random variables. Duke Mathematical Journal, 15(3):773–780, 1948.
  • Krizhevsky et al. [2017] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, 2017.
  • Liu [2017] Qiang Liu. Stein variational gradient descent as gradient flow. In Advances in Neural Information Processing Systems, volume 30, 2017.
  • Liu and Wang [2016] Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, volume 29, 2016.
  • Liu et al. [2022] Xing Liu, Harrison Zhu, Jean-Francois Ton, George Wynne, and Andrew Duncan. Grassmann Stein variational gradient descent. In International Conference on Artificial Intelligence and Statistics, pages 2002–2021, 2022.
  • MacKay [1992] David JC MacKay. A practical Bayesian framework for backpropagation networks. Neural computation, 4(3):448–472, 1992.
  • Pacheco et al. [2014] Jason Pacheco, Silvia Zuffi, Michael Black, and Erik Sudderth. Preserving modes and messages via diverse particle selection. In International Conference on Machine Learning, pages 1152–1160, 2014.
  • Rizzo and Székely [2016] Maria L Rizzo and Gábor J Székely. Energy distance. Wiley Interdisciplinary Reviews: Computational Statistics, 8(1):27–38, 2016.
  • Salim et al. [2022] Adil Salim, Lukang Sun, and Peter Richtarik. A convergence theory for SVGD in the population limit under Talagrand’s inequality T1. In International Conference on Machine Learning, pages 19139–19152, 2022.
  • Wang et al. [2018] Dilin Wang, Zhe Zeng, and Qiang Liu. Stein variational message passing for continuous graphical models. In International Conference on Machine Learning, pages 5219–5227, 2018.
  • Wu et al. [2022] Kevin E Wu, Kevin K Yang, Rianne van den Berg, James Y Zou, Alex X Lu, and Ava P Amini. Protein structure generation via folding diffusion. arXiv preprint arXiv:2209.15611, 2022.
  • Yan and Zhou [2021a] Liang Yan and Tao Zhou. Stein variational gradient descent with local approximations. Computer Methods in Applied Mechanics and Engineering, 386, 2021a.
  • Yan and Zhou [2021b] Liang Yan and Tao Zhou. Gradient-free Stein variational gradient descent with kernel approximation. Applied Mathematics Letters, 121, 2021b.
  • Yoon et al. [2018] Jaesik Yoon, Taesup Kim, Ousmane Dia, Sungwoong Kim, Yoshua Bengio, and Sungjin Ahn. Bayesian model-agnostic meta-learning. In Advances in Neural Information Processing Systems, volume 31, 2018.
  • Zhuo et al. [2018] Jingwei Zhuo, Chang Liu, Jiaxin Shi, Jun Zhu, Ning Chen, and Bo Zhang. Message passing Stein variational gradient descent. In International Conference on Machine Learning, pages 6018–6027, 2018.

Appendix A

Proof of Proposition 1

According to the fixed points assumption (A1),

\Delta(\mathbf{x}_{i})=\frac{1}{N}\sum_{j=1}^{N}\nabla_{\mathbf{x}_{j}}\log p\left(\mathbf{x}_{j}\right)k\left(\mathbf{x}_{i},\mathbf{x}_{j}\right)+\nabla_{\mathbf{x}_{j}}k\left(\mathbf{x}_{i},\mathbf{x}_{j}\right)=0.

Then,

\mathbb{E}\left(\Delta(\mathbf{x}_{i})\right)=\mathbb{E}\left(\nabla_{\mathbf{x}}\log p\left(\mathbf{x}\right)k\left(\mathbf{x}_{i},\mathbf{x}\right)\right)+\mathbb{E}\left(\nabla_{\mathbf{x}}k\left(\mathbf{x}_{i},\mathbf{x}\right)\right) (5)
=\mathbb{E}\left(\nabla_{\mathbf{x}}\log p\left(\mathbf{x}\right)\right)\mathbb{E}\left(k\left(\mathbf{x}_{i},\mathbf{x}\right)\right)+\mathbb{E}\left(\nabla_{\mathbf{x}}k\left(\mathbf{x}_{i},\mathbf{x}\right)\right)
=\mathbb{E}\left(\nabla_{\mathbf{x}}k\left(\mathbf{x}_{i},\mathbf{x}\right)\right)=0.

In Equation (5), \nabla_{\mathbf{x}}\log p\left(\mathbf{x}\right) and k\left(\mathbf{x}_{i},\mathbf{x}\right) are independent, and

\mathbb{E}\left(\nabla_{\mathbf{x}}\log p\left(\mathbf{x}\right)\right)=\int\frac{p^{\prime}(\mathbf{x})}{p(\mathbf{x})}p(\mathbf{x})\,d\mathbf{x}=\int p^{\prime}(\mathbf{x})\,d\mathbf{x}=0.

Associated with the Jensen gap (Gao et al. [2019]), we get

\left|\mathbb{E}\left[e^{-\frac{\|\mathbf{x}_{i}-\mathbf{x}\|_{2}^{2}}{h}}\frac{2(\mathbf{x}_{i}-\mathbf{x})}{h}\right]-e^{-\frac{\|\mathbf{x}_{i}-\mathbb{E}\mathbf{x}\|_{2}^{2}}{h}}\frac{2(\mathbf{x}_{i}-\mathbb{E}\mathbf{x})}{h}\right|\leq 2M\mathbb{E}\left\|\mathbf{x}\right\|,

which in turn gives

\left|e^{-\frac{\left\|\mathbf{x}_{i}-\mathbb{E}\mathbf{x}\right\|_{2}^{2}}{h}}\frac{2(\mathbf{x}_{i}-\mathbb{E}\mathbf{x})}{h}\right|<2M\mathbb{E}\left\|\mathbf{x}\right\|

since \left\|\mathbf{x}_{i}-\mathbb{E}\mathbf{x}\right\|_{2}^{2}=\mathrm{tr}\left((\mathbf{x}_{i}-\mathbb{E}\mathbf{x})(\mathbf{x}_{i}-\mathbb{E}\mathbf{x})^{T}\right)\leq c_{0}\mathrm{tr}(\Sigma), where c_{0} is a positive number.

Proof of Proposition 2

According to Proposition 1:

\left\|\mathbf{x}\right\|_{2}^{2}\leq K^{2}\mathrm{tr}(\Sigma).

Applying the expectation version of the Bernstein inequality (Banna et al. [2016]) to the sum of the mean-zero random matrices \mathbf{x}_{i}\mathbf{x}_{i}^{T}-\Sigma, we obtain

\mathbb{E}\|\Sigma_{m}-\Sigma\|=\frac{1}{m}\mathbb{E}\Big\|\sum_{i=1}^{m}(\mathbf{x}_{i}\mathbf{x}_{i}^{T}-\Sigma)\Big\|
\leq 30v\sqrt{n\log D}+4Mc^{-1/2}\sqrt{\log D}+M\gamma(c,n)\log D,

and M is any number chosen such that

\|\mathbf{x}\mathbf{x}^{T}-\Sigma\|\leq M.

Bounding M is simple:

\|\mathbf{x}\mathbf{x}^{T}-\Sigma\|\leq\|\mathbf{x}\|_{2}^{2}+\|\Sigma\|\leq 2K^{2}\mathrm{tr}(\Sigma),

which completes the proof.

Proof of Proposition 3

Note that

\nabla_{\epsilon}\mathrm{KL}\left(q_{[T]}\left(\mathbf{x}_{\Gamma_{d}},\mathbf{x}_{\mathbf{S_{d}}}\right)\|p\left(\mathbf{x}_{\Gamma_{d}},\mathbf{x}_{\mathbf{S_{d}}}\right)\right)=\nabla_{\epsilon}\mathrm{KL}\left(q_{[T]}\left(\mathbf{x}_{\Gamma_{d}}\mid\mathbf{x}_{\mathbf{S_{d}}}\right)q\left(\mathbf{x}_{\mathbf{S_{d}}}\right)\|p\left(\mathbf{x}_{\Gamma_{d}}\mid\mathbf{x}_{\mathbf{S_{d}}}\right)p\left(\mathbf{x}_{\mathbf{S_{d}}}\right)\right).

Now we derive the optimality condition for Equation (3). Note that,

\mathrm{KL}\left(q_{[T]}\left(\mathbf{x}_{\Gamma_{d}},\mathbf{x}_{\mathbf{S_{d}}}\right)\|p\left(\mathbf{x}_{\Gamma_{d}},\mathbf{x}_{\mathbf{S_{d}}}\right)\right)=\mathbb{E}_{q\left(\mathbf{x}_{\neg d}\right)}\left[\mathrm{KL}\left(q_{[T]}\left(\mathbf{x}_{\Gamma_{d}}\mid\mathbf{x}_{\neg d}\right)\|p\left(\mathbf{x}_{\Gamma_{d}}\mid\mathbf{x}_{\neg d}\right)\right)\right].

Following the proof of Theorem 3.1 by Liu and Wang [2016], we have

\nabla_{\epsilon}\mathrm{KL}\left(q_{[T]}\left(\mathbf{x}_{\Gamma_{d}},\mathbf{x}_{\mathbf{S_{d}}}\right)\|p\left(\mathbf{x}_{\Gamma_{d}},\mathbf{x}_{\mathbf{S_{d}}}\right)\right)\big|_{\epsilon=0}=-\mathbb{E}_{q\left(\mathbf{y}_{\Gamma_{d}}\mid\mathbf{y}_{\neg d}\right)}\left[\phi_{\Gamma_{d}}\left(\mathbf{y}_{\neg d}\right)\nabla_{\mathbf{y}_{\Gamma_{d}}}\log p\left(\mathbf{y}_{\Gamma_{d}}\mid\mathbf{y}_{\neg d}\right)+\nabla_{\mathbf{y}_{\Gamma_{d}}}\phi_{\Gamma_{d}}\left(\mathbf{y}_{\neg d}\right)\right].

According to Liu and Wang [2016], the optimal solution is given by \phi_{\Gamma_{d}}^{*}/\|\phi_{\Gamma_{d}}^{*}\|_{\mathcal{H}_{\Gamma_{d}}}, where

\phi_{\Gamma_{d}}^{*}\left(\mathbf{x}_{\neg d}\right)=\mathbb{E}_{\mathbf{y}_{\neg d}\sim q}\left[k_{\Gamma_{d}}\left(\mathbf{x}_{\neg d},\mathbf{y}_{\neg d}\right)\nabla_{\mathbf{y}_{\Gamma_{d}}}\log p\left(\mathbf{y}_{\Gamma_{d}}\mid\mathbf{y}_{\neg d}\right)+\nabla_{\mathbf{y}_{\Gamma_{d}}}k_{\Gamma_{d}}\left(\mathbf{x}_{\neg d},\mathbf{y}_{\neg d}\right)\right].

Proof of Proposition 4

Similar to the previous proof, we first have

\nabla_{\epsilon}\mathrm{KL}\left[\frac{q_{[T]}(x_{d}\mid\mathbf{x}_{\mathbf{S_{d}}})}{q(\mathbf{x}_{\neg d})}\|\frac{p(x_{d}\mid\mathbf{x}_{\mathbf{S_{d}}})}{p(\mathbf{x}_{\neg d})}\right]=\nabla_{\epsilon}\mathrm{KL}\left(q_{[T]}\left(x_{d}\mid\mathbf{x}_{\mathbf{S_{d}}}\right)q\left(\mathbf{x}_{\mathbf{S_{d}}}\right)\|p\left(x_{d}\mid\mathbf{x}_{\mathbf{S_{d}}}\right)q\left(\mathbf{x}_{\mathbf{S_{d}}}\right)\right).

Then we derive the optimality condition \phi_{d}^{*} for Equation (4),

\mathrm{KL}\left(q_{\left[T\right]}\left(x_{d},\mathbf{x}_{\mathbf{S_{d}}}\right)\|p\left(x_{d},\mathbf{x}_{\mathbf{S_{d}}}\right)\right)=\mathbb{E}_{q\left(\mathbf{x}_{S_{d}}\right)}\left[\operatorname{KL}\left(q_{\left[T\right]}\left(x_{d}\mid\mathbf{x}_{S_{d}}\right)\|p\left(x_{d}\mid\mathbf{x}_{S_{d}}\right)\right)\right].

Following the proof of Theorem 3.1 in Liu and Wang [2016], we have

\nabla_{\epsilon}\mathrm{KL}\left(q\left(x_{d},\mathbf{x}_{\mathbf{S_{d}}}\right)\|p\left(x_{d},\mathbf{x}_{\mathbf{S_{d}}}\right)\right)\big|_{\epsilon=0}=-\mathbb{E}_{q\left(y_{d}\mid\mathbf{y}_{S_{d}}\right)}\left[\phi_{d}\left(\mathbf{y}_{C_{d}}\right)\nabla_{y_{d}}\log p\left(y_{d}\mid\mathbf{y}_{S_{d}}\right)+\nabla_{y_{d}}\phi_{d}\left(\mathbf{y}_{S_{d}}\right)\right].

According to Liu and Wang [2016], the optimal solution is given by \phi_{d}^{*}/\left\|\phi_{d}^{*}\right\|_{\mathcal{H}_{d}}, where

\phi_{d}^{*}\left(\mathbf{x}_{C_{d}}\right)=\mathbb{E}_{\mathbf{y}_{C_{d}}\sim q}\left[k_{S_{d}}\left(\mathbf{x}_{C_{d}},\mathbf{y}_{C_{d}}\right)\nabla_{\mathbf{y}_{S_{d}}}\log p\left(\mathbf{y}_{S_{d}}\mid\mathbf{y}_{C_{d}}\right)+\nabla_{\mathbf{y}_{S_{d}}}k_{S_{d}}\left(\mathbf{x}_{C_{d}},\mathbf{y}_{C_{d}}\right)\right],

which completes the proof.

Appendix B

Numerical verification of m-dependence

In this paper, we use the concept of m-dependence from mixing theory to estimate the variance of non-i.i.d. particles. Here we give a verification experiment for m-dependence.

[Uncaptioned image]

The above figure shows the magnitude of the final update dynamics between pairs of particles in SVGD. Here, the legend \text{SVGD}\_3000\_10 indicates 3000 particles for a 10-dimensional Gaussian model. We sort the particles according to \|\mathbf{x}\|_{2} and measure the update effect between two particles at intervals of 500, i.e.,

\frac{1}{N}\left[\nabla_{\mathbf{x}_{n}^{\ell}}\log p\left(\mathbf{x}_{n}^{\ell}\right)k\left(\mathbf{x}_{n}^{\ell},\mathbf{x}_{m}^{\ell}\right)+\nabla_{\mathbf{x}_{n}^{\ell}}k\left(\mathbf{x}_{n}^{\ell},\mathbf{x}_{m}^{\ell}\right)\right]. (6)

We can see from the above figure that as the interval between the particles increases, the force between them decreases.
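A sketch of how the quantity in Equation (6) can be computed for a pair of sorted particles, assuming a zero-mean Gaussian target (so that \nabla\log p(\mathbf{x})=-\mathbf{x}) and the RBF kernel; the bandwidth, random seed, and function name are illustrative.

```python
import numpy as np

def pairwise_force(x_n, x_m, h, N):
    # Magnitude of the interaction term in Equation (6) between particles x_n and x_m,
    # for a zero-mean Gaussian target where grad log p(x_n) = -x_n (an assumption).
    diff = x_n - x_m
    k = np.exp(-np.dot(diff, diff) / (2.0 * h))
    grad_k = -diff / h * k                          # grad_{x_n} k(x_n, x_m), RBF kernel
    score = -x_n                                    # grad log p(x_n) for N(0, I)
    return np.linalg.norm((score * k + grad_k) / N)

rng = np.random.default_rng(2)
X = rng.normal(size=(3000, 10))                     # stand-in for converged particles
order = np.argsort(np.linalg.norm(X, axis=1))       # sort by ||x||_2 as in the experiment
print(pairwise_force(X[order[0]], X[order[500]], h=1.0, N=3000))
```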

Numerical verification of Propositions 1-2

In the table below, we empirically investigate the target distribution \mathbf{N}\left(0,I_{\mathrm{dim}}\right) and demonstrate the soundness of our theoretical bounds.

Table 2: Upper bound of \|x\|_{2}^{2} in Proposition 1
dim 2 5 10 15 20 25
10 particles: max \|x\|_{2}^{2} 1.43 1.18 1.17 1.17 1.17 1.17
10 particles: theoretical bound 4.82 5.56 5.39 5.36 5.35 5.35
50 particles: max \|x\|_{2}^{2} 2.11 1.63 1.55 1.52 1.52 1.51
50 particles: theoretical bound 9.41 16.1 17.1 14.8 14.31 14.0

In the table below, the upper bound on the covariance error of SVGD is assessed for the target distribution \mathbf{N}\left(0,I_{\mathrm{dim}}\right).

Table 3: Upper bound of \mathbb{E}\left\|\Sigma_{m}-\Sigma\right\|_{2}^{2} in Proposition 2
Number of particles 1000 5000 10000 15000 20000
Dim-2: \mathbb{E}\|\Sigma_{m}-\Sigma\|_{2}^{2} 0.008 0.002 0.004 0.0063 0.003
Dim-2: theoretical bound 0.082 0.019 0.01 0.0078 0.005
Dim-5: \mathbb{E}\|\Sigma_{m}-\Sigma\|_{2}^{2} 0.3 0.11 0.06 0.06 0.04
Dim-5: theoretical bound 4.63 1.71 1.02 0.71 0.57