Theoretical Error Analysis of Entropy Approximation
for Gaussian Mixtures

Takashi Furuya Education and Research Center for Mathematical and Data Science, Shimane University, Japan Email: [email protected] Hiroyuki Kusumoto Equal contribution. Graduate School of Mathematics, Nagoya University, Japan Email: [email protected]
Koichi Taniguchi Mathematical and Systems Engineering, Shizuoka University, Japan Naoya Kanno Graduate School of Biomedical Engineering, Tohoku University, Japan Kazuma Suetake AISIN SOFTWARE, Japan

Abstract

Gaussian mixture distributions are commonly employed to represent general probability distributions. Despite the importance of using Gaussian mixtures for uncertainty estimation, the entropy of a Gaussian mixture cannot be calculated analytically. In this paper, we study the approximate entropy represented as the sum of the entropies of unimodal Gaussian distributions with mixing coefficients. This approximation is easy to calculate analytically regardless of dimension, but there is a lack of theoretical guarantees. We theoretically analyze the approximation error between the true and the approximate entropy to reveal when this approximation works effectively. This error is essentially controlled by how far apart each Gaussian component of the Gaussian mixture is. To measure such separation, we introduce the ratios of the distances between the means to the sum of the variances of each Gaussian component of the Gaussian mixture, and we reveal that the error converges to zero as the ratios tend to infinity. In addition, the probabilistic estimate indicates that this convergence situation is more likely to occur in higher-dimensional spaces. Therefore, our results provide a guarantee that this approximation works well for high-dimensional problems, such as neural networks that involve a large number of parameters.

Keywords : Gaussian mixture, Entropy approximation, Approximation error estimate, Upper and lower bounds

1 Introduction

Entropy is a fundamental measure of uncertainty in information theory with many applications in machine learning, data compression, and image processing. In machine learning, for instance, entropy is a key component in the evidence lower bound (ELBO) in the variational inference (Hinton and Van Camp, 1993; Barber and Bishop, 1998; Bishop, 2006) and the variational autoencoder (Kingma and Welling, 2013). It also plays a crucial role in the data acquisition for Bayesian optimization (Frazier, 2018) and active learning (Settles, 2009). In the real-world scenario, unimodal Gaussian distribution is widely used and offers computational advantages because its entropy can be calculated analytically, which offers computational advantages. However, unimodal Gaussian distributions are limited in the approximation ability, and in particular, they hardly approximate multimodal distributions, which are often assumed, for instance, as posterior distributions of Bayesian neural networks (BNNs) (MacKay, 1992; Neal, 2012) (see, e.g., Fort et al. (2019)). On the other hand, Gaussian mixture distribution (the superposition of Gaussian distributions) can capture the multimodality, and in general, approximate any continuous distribution in some topology (see, e.g., Bacharoglou (2010)). Unfortunately, the entropy of Gaussian mixture lacks a closed form, thereby necessitating the use of entropy approximation methods.

There are numerous approximation methods for estimating the entropy of Gaussian mixture (see Section 2). One common method for approximating the entropy of a Gaussian mixture is to compute the weighted sum of the entropies of the individual unimodal Gaussian components with mixing coefficients (see (3.2)), and this approximate entropy is easy to calculate analytically. However, despite its empirical success, this approximation lacks theoretical guarantees. The purpose of this paper is to reveal under what conditions this approximation performs well and to provide some theoretical insights on this approximation. Our contributions are as follows:

•

(Main result: New error bounds) We provide new upper and lower bounds of the approximation error. These bounds show that the error is essentially controlled by the ratios $\alpha_{k,k^{\prime}}$ (which are given in Definition 4.1) of the distances between the means to the sum of the variances of each (Gaussian) component of the Gaussian mixture, and the error converges to zero as the ratios tend to infinity (Theorem 4.2). Consequently, we provide an “almost” necessary and sufficient condition for the approximate entropy to be valid (Remark 4.3).
•

(Probabilistic error bound) To confirm the effectiveness of the approximation for higher-dimension problems, we also provide the approximation error bound in the form of a probabilistic inequality (Corollary 4.7). In supplementary of Gal and Ghahramani (2016), it is mentioned (without mathematical proof) that this approximate entropy tends to be the true one in high-dimensional spaces when the means of the Gaussian mixture are randomly distributed. Our probabilistic inequality is a rigorous and generalized result of what they mention and shows the usefulness of this approximation in high-dimensional problems. Moreover, we numerically demonstrate this result in a simple case and show its superiority over several other methods (Section 5).
•

(Error bound for derivatives) For example in machine learning, not only the approximation of entropy but also its partial derivatives are required in backpropagation. Therefore, we also provide the upper bounds for the partial derivatives of the error with respect to parameters (Theorem 4.8), which ensure that the derivatives of the approximate entropy are also close enough to the true ones when the ratios are large enough.
•

(More detailed analysis in the special case) We conduct a more detailed analysis of the error bounds in a special case. More precisely, when all covariance matrices coincide, we provide an explicit formula on the entropy of Gaussian mixture with the integral dimension-reduced to its component number (Proposition 4.9). Then, by using this formula, we improve and simplify the upper and lower bounds of the error (Theorem 4.10) and the probabilistic inequality (Corollary 4.12). In this special case, we obtain a necessary and sufficient condition for this approximation to converge to the true entropy (Remark 4.11).

2 Related work

In numerical computation of the entropy of Gaussian mixtures, the approximation by Monte Carlo estimation is often used. However, it may require a large number of samples for high accuracy, leading to high computational costs. Furthermore, the Monte Carlo estimator gives a stochastic approximation, and hence, it does not guarantee deterministic bounds (confidence intervals may be used). There are numerous deterministic approximation methods (see, e.g., Hershey and Olsen (2007)). For example, approximation methods based on upper or lower bounds of the entropy are often used. That is, we try to obtain an approximation by estimating the upper or lower bounds of the entropy or we adopt the bounds as an approximate entropy (see, e.g., Bonilla et al. (2019); Hershey and Olsen (2007); Nielsen and Sun (2016); Zobay (2014)). A typical one is to use the lower bound of the entropy based on Jensen’s inequality (see Bonilla et al. (2019)). These approximations have the advantage of being analytically calculated in closed forms, whereas there is no theoretical guarantee that they work well in the context of machine learning such as variational inference. As another typical method, Huber et al. (2008) proposed the entropy approximation by a combination of the Taylor approximation with the splitting method of Gaussian mixture components. Recently, Dahlke and Pacheco (2023) provided new approximations using Taylor and Legendre series, and they theoretically and experimentally analyzed these approximations together with the approximation in Huber et al. (2008). Notably, Dahlke and Pacheco (2023) theoretically showed sufficient conditions for the approximations to be convergent. However, their performance remains unexplored in higher-dimensional cases. As an approximation that is easy to calculate regardless of dimensions, the sum of the entropies of unimodal Gaussian distributions is often used. For example, the approximation entropy represented as the “sum with mixing coefficients” of the entropies of unimodal Gaussian distributions (see Remark 4.5), which is equal to the true one when all Gaussian components coincide, is investigated in Melbourne et al. (2022). On the other hand, Gal and Ghahramani (2016) used the approximation represented as the sum of the entropies of “unimodal Gaussian distributions with mixture coefficients” in the context of variational inference from an intuition that this approximation tends to the true one in high-dimensional spaces when the means of the mixture are randomly distributed. In our work, we focus on the theoretical error estimation for this approximation and reveal that this approximation converges to the true one when all Gaussian components are far apart or in high-dimensional cases.

3 Entropy approximation for Gaussian mixtures

The entropy of probability distribution $q(w)$ is defined by

\displaystyle H[q]\coloneqq-\int_{\mathbb{R}^{m}}q(w)\log(q(w))\,dw.

(3.1)

Here, we choose a probability distribution $q(w)$ as the Gaussian mixture distribution, that is,

q(w)=\sum_{k=1}^{K}\pi_{k}\mathcal{N}(w\,|\,\mu_{k},\Sigma_{k}),\quad w\in\mathbb{R}^{m},

where $K\in\mathbb{N}$ is the number of mixture components, and $\pi_{k}\in(0,1]$ are mixing coefficients constrained by $\sum_{k=1}^{K}\pi_{k}=1$ . Here, $\mathcal{N}(\mu_{k},\Sigma_{k})=\mathcal{N}(w\,|\,\mu_{k},\Sigma_{k})$ is the Gaussian distribution with a mean $\mu_{k}\in\mathbb{R}^{m}$ and (positive definite) covariance matrix $\Sigma_{k}\in\mathbb{R}^{m\times m}$ , that is,

\mathcal{N}(w\,|\,\mu_{k},\Sigma_{k})=\frac{1}{\sqrt{(2\pi)^{m}|\Sigma_{k}|}}\exp\left(-\frac{1}{2}\left\|w-\mu_{k}\right\|_{\Sigma_{k}}^{2}\right),

where $|\Sigma_{k}|$ is the determinant of matrix $\Sigma_{k}$ , and $\|x\|_{\Sigma}^{2}\coloneqq x\cdot(\Sigma^{-1}x)$ for a vector $x\in\mathbb{R}^{m}$ and a positive definite matrix $\Sigma\in\mathbb{R}^{m\times m}$ . However, the entropy term $H[q]$ cannot be computed analytically when the distribution $q(w)$ is a Gaussian mixture.

We define the approximate entropy $\widetilde{H}[q]$ by the sum of the entropies of “unimodal Gaussian distributions with mixture coefficients”:

\begin{split}\widetilde{H}[q]\coloneqq&\,-\sum_{k=1}^{K}\int_{\mathbb{R}^{m}}\pi_{k}\mathcal{N}(w\,|\,\mu_{k},\Sigma_{k})\log\left(\pi_{k}\mathcal{N}(w\,|\,\mu_{k},\Sigma_{k})\right)dw\\ =&\,\frac{m}{2}+\frac{m}{2}\log 2\pi+\frac{1}{2}\sum_{k=1}^{K}\pi_{k}\log|\Sigma_{k}|-\sum_{k=1}^{K}\pi_{k}\log\pi_{k},\end{split}

(3.2)

which can be computed analytically. It is obvious that $H[q]=\widetilde{H}[q]$ holds for unimodal Gaussian (i.e., the case $K=1$ ). Moreover, it is shown that

0\leq\widetilde{H}[q]-H[q]\leq-2\sum_{k=1}^{K}\pi_{k}\log\pi_{k}\ (\leq 2\log K)

(see (LABEL:computation1) in Appendix and Remark 4.5). These bounds show that the error does not blow up with respect to the mean $\mu_{k}$ , the covariance $\Sigma_{k}$ , and the dimension $m$ if the number of mixture components $K$ is fixed. In addition, these bounds imply that $|\widetilde{H}[q]-H[q]|\to 0$ as the Gaussian mixture $q$ converges to a unimodal Gaussian (i.e., $\pi_{k}\to 1$ for some $k$ and $\pi_{k^{\prime}}\to 0$ for all $k^{\prime}\not=k$ ). It is natural to ask under what other conditions this approximation $H[q]\approx\widetilde{H}[q]$ can be justified. In Section 4, we will provide an “almost” necessary and sufficient condition for this approximation to be valid.

4 Theoretical error analysis of the entropy approximation

In this section, we analyze the approximation error $|H[q]-\widetilde{H}[q]|$ to theoretically justify the entropy approximation $\widetilde{H}[q]\approx H[q]$ .

We introduce the following notation to state our results.

Definition 4.1.

For two Gaussian distributions $\mathcal{N}(\mu_{k},\Sigma_{k})$ and $\mathcal{N}(\mu_{k^{\prime}},\Sigma_{k^{\prime}})$ , we define $\alpha_{\{k,k^{\prime}\}}$ by

\alpha_{\{k,k^{\prime}\}}\coloneqq\max\Big{\{}\alpha>0:\{x\in\mathbb{R}^{m}:\|x-\mu_{k}\|_{\Sigma_{k}}<\alpha\}\cap\{x\in\mathbb{R}^{m}:\|x-\mu_{k^{\prime}}\|_{\Sigma_{k^{\prime}}}<\alpha\}=\varnothing\Big{\}},

(4.1)

and $\alpha_{k,k^{\prime}}$ by

\alpha_{k,k^{\prime}}\coloneqq\frac{\|\mu_{k}-\mu_{k^{\prime}}\|_{\Sigma_{k}}}{1+\|\Sigma_{k}^{-\frac{1}{2}}\Sigma_{k^{\prime}}^{\frac{1}{2}}\|_{\rm op}},

(4.2)

where $k,k^{\prime}\in[K]\coloneqq\{1,\ldots,K\}$ and $\|\cdot\|_{\rm op}$ is the operator norm (i.e., the largest singular value).

Interpretation of $\alpha$ : We can interpret that $\alpha_{\{k,k^{\prime}\}}$ and $\alpha_{k,k^{\prime}}$ measure distances of two Gaussian distributions in some sense respectively. Here notice that $\alpha_{\{k,k^{\prime}\}}$ is symmetric (i.e., $\alpha_{\{k,k^{\prime}\}}$ is always equal to $\alpha_{\{k^{\prime},k\}}$ ), but $\alpha_{k,k^{\prime}}$ is not necessarily symmetric (i.e., $\alpha_{k,k^{\prime}}$ is not always equal to $\alpha_{k^{\prime},k}$ ).

Figure 1 shows the geometric interpretation of $\alpha_{k,k^{\prime}}$ in the isotropic case, that is, $\Sigma_{k}=\sigma^{2}_{k}I$ and $\Sigma_{k^{\prime}}=\sigma^{2}_{k^{\prime}}I$ . In this case, $\alpha_{k,k^{\prime}}$ has a symmetric form with respect to $k,k^{\prime}$ as

\alpha_{\{k,k^{\prime}\}}=\alpha_{k,k^{\prime}}=\alpha_{k^{\prime},k}=\frac{|\mu_{k}-\mu_{k^{\prime}}|}{\sigma_{k}+\sigma_{k^{\prime}}}.

Here, the volume of $\mathcal{N}(\mu_{k},\sigma_{k}^{2}I)$ on $B(\mu_{k},\alpha\sigma_{k})$ is equal to that of $\mathcal{N}(\mu_{k^{\prime}},\sigma_{k^{\prime}}^{2}I)$ on $B(\mu_{k^{\prime}},\alpha\sigma_{k^{\prime}})$ , where $B(\mu,\sigma)\coloneqq\{x\in\mathbb{R}^{m}:|x-\mu|<\sigma\}$ for $\mu\in\mathbb{R}^{m}$ and $\sigma>0$ . If all distances $|\mu_{k}-\mu_{k^{\prime}}|$ between the means go to infinity, or all variances $\sigma_{k}$ go to zero, etc., then $\alpha_{k,k^{\prime}}$ go to infinity for all pairs of $k,k^{\prime}$ . Furthermore, if all means $\mu_{k}$ are normally distributed (variances $\sigma_{k}$ are fixed), then an expected value of $\alpha_{k,k^{\prime}}^{2}$ is in proportion to the dimension $m$ . That intuitively means that the expected value of $\alpha_{k,k^{\prime}}$ becomes large as the dimension $m$ of the parameter increases.

Refer to caption — Figure 1: Illustration of $\alpha=\alpha_{\{k,k^{\prime}\}}$ ( $m=2$ , isotropic)

These intuitions also hold in anisotropic case. From definition (4.2), we have

\alpha_{k,k^{\prime}}+\alpha_{k,k^{\prime}}\|\Sigma_{k}^{-\frac{1}{2}}\Sigma_{k^{\prime}}^{\frac{1}{2}}\|_{\rm op}=|\Sigma_{k}^{-\frac{1}{2}}\mu_{k}-\Sigma_{k}^{-\frac{1}{2}}\mu_{k^{\prime}}|.

This shows that the circumsphere of $\{x\in\mathbb{R}^{m}:\|x-\Sigma_{k}^{-\frac{1}{2}}\mu_{k^{\prime}}\|_{\Sigma_{k}^{-1}\Sigma_{k}^{\prime}}<\alpha_{k,k^{\prime}}\}$ circumscribes $B(\Sigma_{k}^{-\frac{1}{2}}\mu_{k},\alpha)$ (see Figure 2 right). Moreover, by coordinate transformation $x\to\Sigma_{k}^{\frac{1}{2}}x$ , we can interpret that $\alpha_{k,k^{\prime}}$ is a distance of $\mathcal{N}(\mu_{k},\Sigma_{k})$ and $\mathcal{N}(\mu_{k^{\prime}},\Sigma_{k^{\prime}})$ from the perspective of $\Sigma_{k}$ (see Figure 2 left), and $\alpha_{k,k^{\prime}}$ gives a concrete form of $\alpha$ satisfying ${\{\|x-\mu_{k}\|_{\Sigma_{k}}<\alpha\}\cap\{\|x-\mu_{k^{\prime}}\|_{\Sigma_{k^{\prime}}}<\alpha\}=\varnothing}$ , that is, $\alpha_{k,k^{\prime}}\leq\alpha_{\{k,k^{\prime}\}}$ (Lemma A.4).

4.1 General covariance case

We study the error $|H[q]-\widetilde{H}[q]|$ for general covariance matrices $\Sigma_{k}$ . First, we give the following upper and lower bounds for the error.

Theorem 4.2.

Let $s\in[0,1)$ . Then

\begin{split}&\sum_{k=1}^{K}\sum_{k^{\prime}\neq k}\frac{\pi_{k}\pi_{k^{\prime}}}{1-\pi_{k}}c_{k,k^{\prime}}\log\Biggl{(}1+\frac{1-\pi_{k}}{\pi_{k}}\frac{|\Sigma_{k}|^{\frac{1}{2}}}{\displaystyle\max_{l}|\Sigma_{l}|^{\frac{1}{2}}}\exp\Biggl{(}-\frac{\Bigl{(}1+\|\Sigma_{k^{\prime}}^{-\frac{1}{2}}\Sigma_{k}^{\frac{1}{2}}\|_{\rm op}\Bigr{)}^{2}}{2}\alpha_{k^{\prime},k}^{2}\Biggr{)}\Biggr{)}\\ &\leq\left|H[q]-\widetilde{H}[q]\right|\leq\frac{2}{(1-s)^{\frac{m}{4}}}\sum_{k=1}^{K}\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}}\exp\left(-\frac{s\alpha_{k,k^{\prime}}^{2}}{4}\right),\end{split}

(4.3)

where the coefficient $c_{k,k^{\prime}}$ is defined by

c_{k,k^{\prime}}\coloneqq\frac{1}{\sqrt{(2\pi)^{m}}}\int_{\mathbb{R}^{m}_{k,k^{\prime}}}\exp\left(-\frac{|y|^{2}}{2}\right)dy\geq 0,

and the set $\mathbb{R}^{m}_{k,k^{\prime}}$ is defined by

\mathbb{R}^{m}_{k,k^{\prime}}\coloneqq\left\{y\in\mathbb{R}^{m}:\begin{array}[]{cc}y\cdot y\geq(\Sigma_{k}^{\frac{1}{2}}\Sigma_{k^{\prime}}^{-1}\Sigma_{k}^{\frac{1}{2}}y)\cdot y,\\ y\cdot(\Sigma_{k}^{\frac{1}{2}}\Sigma_{k^{\prime}}^{-1}(\mu_{k^{\prime}}-\mu_{k}))\geq 0\end{array}\right\}.

Moreover, the same upper bound holds for $\alpha_{\{k,k^{\prime}\}}$ instead of $\alpha_{k,k^{\prime}}$ :

\left|H[q]-\widetilde{H}[q]\right|\leq\frac{2}{(1-s)^{\frac{m}{4}}}\sum_{k=1}^{K}\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}}\exp\left(-\frac{s\alpha_{\{k,k^{\prime}\}}^{2}}{4}\right).

(4.4)

The proof is given by calculations with the triangle inequality, the Cauchy-Schwarz inequality, and a change of variables. For upper bound, we use the inequality $\log(1+x)\leq\sqrt{x}$ $(x\geq 0)$ , and we split the integral region to $|y|<\alpha_{k,k^{\prime}}$ and $|y|>\alpha_{k,k^{\prime}}$ in order to utilize the characteristic of $\alpha_{k,k^{\prime}}$ . While, for the lower bound, we use the concavity of function $x\mapsto\log(1+x)$ $(x>0)$ and the definition of $\mathbb{R}^{m}_{k,k^{\prime}}$ . See Appendix A.1 for the proof of Theorem 4.2.

Remark 4.3.

Theorem 4.2 implies the following facts:

(i)
According to the upper bound in (4.3), the approximation $H[q]\approx\widetilde{H}[q]$ is valid when
- (a)
  
  $\pi_{k}\pi_{k^{\prime}}\to 0$ or $\alpha_{k,k^{\prime}}\to+\infty$ for all pairs $k,k^{\prime}\in[K]$ with $k\neq k^{\prime}$
for $s\in(0,1)$ . In particular, the error exponentially decays to zero when all $\alpha_{k,k^{\prime}}$ go to zero.
(ii)
According to the upper bound (4.4), the approximation $H[q]\approx\widetilde{H}[q]$ is also valid when
- (b)
  
  $\pi_{k}\pi_{k^{\prime}}\to 0$ or $\alpha_{\{k,k^{\prime}\}}\to+\infty$ for all pairs $k,k^{\prime}\in[K]$ with $k\neq k^{\prime}$
for $s\in(0,1)$ . Moreover, the upper bound in (4.4) is better than that in (4.3) since $\alpha_{k,k^{\prime}}\leq\alpha_{\{k,k^{\prime}\}}$ always holds (see Lemma A.4).
(iii)

According to the lower bound in (4.3), if $c_{k,k^{\prime}}>0$ and $\|\Sigma_{k^{\prime}}^{-\frac{1}{2}}\Sigma_{k}^{\frac{1}{2}}\|_{\rm op}\leq C$ hold for all pairs $k,k^{\prime}$ and some constant $C>0$ independent of $k,k^{\prime}$ , then (a) is necessary for this approximation to be valid (where we note that $|\Sigma_{k}|^{\frac{1}{2}}/|\Sigma_{k^{\prime}}|^{\frac{1}{2}}=|\Sigma_{k^{\prime}}^{-\frac{1}{2}}\Sigma_{k}^{\frac{1}{2}}|\leq C^{m}$ when $\|\Sigma_{k^{\prime}}^{-\frac{1}{2}}\Sigma_{k}^{\frac{1}{2}}\|_{\rm op}\leq C$ , and that $\min_{k}|\Sigma_{k}|^{\frac{1}{2}}/\max_{k}|\Sigma_{k}|^{\frac{1}{2}}\geq C^{-m}$ holds). It is unclear whether all $c_{k,k^{\prime}}$ are positive, but we can show that either $c_{k,k^{\prime}}$ or $c_{k^{\prime},k}$ is always positive for any pair $k,k^{\prime}$ (see Remark A.6).
(iv)

From the above facts and the symmetry of $\alpha_{\{k,k^{\prime}\}}$ , we conclude that (b) is a necessary and sufficient condition for the approximation $H[q]\approx\widetilde{H}[q]$ to be valid if $\|\Sigma_{k^{\prime}}^{-\frac{1}{2}}\Sigma_{k}^{\frac{1}{2}}\|_{\rm op}\leq C$ hold for all pairs $k,k^{\prime}$ and some constant $C>0$ independent of $k,k^{\prime}$ .

Remark 4.4.

In Theorem 4.2, the parameter $s$ plays a role of adjusting the convergence speed as follows:

(i)

In the case of $s=0$ , the upper bound does not imply the convergence, and the following bound is obtained:

\left|H[q]-\widetilde{H}[q]\right|\leq 2\sum_{k=1}^{K}\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}}\leq\sum_{k=1}^{K}\sum_{k^{\prime}\neq k}(\pi_{k}+\pi_{k}^{\prime})=\sum_{k=1}^{K}\{(K-1)\pi_{k}+(1-\pi_{k})\}=2(K-1).

(ii)

Consider the upper bound (focusing on the term of $k,k^{\prime}$ pair in summation) of (4.3) as a function with respect to $s$ , it is minimal on $s=1-m/\alpha_{k,k^{\prime}}^{2}$ if $\alpha_{k,k^{\prime}}\geq\sqrt{m}$ , and monotonically increase in $s\in[0,1)$ if $\alpha_{k,k^{\prime}}<\sqrt{m}$ . Replacing $s$ on each $k,k^{\prime}$ summation in the proof of Theorem 4.2 with minimal points $s_{k,k^{\prime}}=1-m/\alpha_{k,k^{\prime}}^{2}$ , we obtain more precise upper bound

\left|H[q]-\widetilde{H}[q]\right|\leq 2\sum_{k=1}^{K}\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}}\left(\frac{\alpha_{k,k^{\prime}}^{2}}{m}\right)^{\frac{m}{4}}\exp\left(\frac{m-\alpha_{k,k^{\prime}}^{2}}{4}\right),

which converges to zero when dimension $m$ goes to infinity, if $\alpha_{k,k^{\prime}}>\sqrt{m}$ for all pairs $k,k^{\prime}$ .

Remark 4.5.

Melbourne et al. (2022) has explored the entropy approximation of mixtures represented as the “sum with mixing coefficients” of the entropies of unimodal Gaussian distributions (without mixing coefficients in the logarithmic term):

\widetilde{H}_{\rm Melbourne}[q]\coloneqq-\sum_{k=1}^{K}\pi_{k}\int_{\mathbb{R}^{m}}\mathcal{N}(w\,|\,\mu_{k},\Sigma_{k})\log\left(\mathcal{N}(w\,|\,\mu_{k},\Sigma_{k})\right)dw=\widetilde{H}[q]+\sum_{k=1}^{K}\pi_{k}\log\pi_{k}.

(4.5)

This approximate entropy is equal to the true entropy $H[q]$ when all Gaussian components coincide (i.e., $\mu_{k}=\mu_{k^{\prime}}$ and $\Sigma_{k}=\Sigma_{k^{\prime}}$ for all $k,k^{\prime}\in[K])($ see (Melbourne et al., 2022, Theorem I.1)), but not when all Gaussian components are far apart (i.e., $\alpha_{k,k^{\prime}}\to\infty$ for all $k,k^{\prime}\in[K]$ ). In contrast, while the approximation $\widetilde{H}[q]$ differs from $H[q]$ when all Gaussian components coincide, it converges to $H[q]$ when all Gaussian components are far apart (see Theorem 4.2). This is because $\widetilde{H}_{\rm Melbourne}[q]$ differs by $-\sum_{k=1}^{K}\pi_{k}\log\pi_{k}$ from $\widetilde{H}[q]$ (refer to (3.2)). Moreover, using also Wang and Madiman (2014, Lemma XI.2), the other upper bound

\left|H[q]-\widetilde{H}[q]\right|\leq\Bigl{|}H[q]-H_{\rm Melbourne}[q]\Bigr{|}+\left|H_{\rm Melbourne}[q]-\widetilde{H}[q]\right|\leq-2\sum_{k=1}^{K}\pi_{k}\log\pi_{k}\leq 2\log K

(4.6)

is obtained, where the last inequality is justified by constraint $\sum_{k=1}^{K}\pi_{k}=1$ .

Remark 4.6.

In Lemma A.1, we obtain another upper bound $K/2$ , which is slightly better than $2\log K$ in (4.6) when $2\leq K\leq 8$ .

Next, we provide the probabilistic inequality for the error as a corollary of Theorem 4.2.

Corollary 4.7.

Let $c>0$ . Take $\{\mu_{k}\}_{k}$ and $\{\Sigma_{k}\}_{k}$ such that

\frac{\Sigma_{k}^{-\frac{1}{2}}(\mu_{k}-\mu_{k^{\prime}})}{1+\|\Sigma_{k}^{-\frac{1}{2}}\Sigma_{k^{\prime}}^{\frac{1}{2}}\|_{\rm op}}\sim\mathcal{N}(0,c^{2}I)

(4.7)

for all pairs $k,k^{\prime}\in[K]$ ( $k\neq k^{\prime}$ ), that is, the left-hand side follows a Gaussian distribution with zero mean and an isotropic covariance matrix $c^{2}I$ . Then, for $\varepsilon>0$ and $s\in(0,1)$ ,

\begin{split}&P\left(\left|H[q]-\widetilde{H}[q]\right|\geq\varepsilon\right)\leq\frac{2(K-1)}{\varepsilon}\left(\sqrt{1-s}\left(1+\frac{sc^{2}}{2}\right)\right)^{-\frac{m}{2}}.\end{split}

(4.8)

The proof is a combination of Markov’s inequality and the upper bound in Theorem 4.2. We also use the moment generating function of $\alpha_{k,k^{\prime}}^{2}/c^{2}$ which follows the $\chi^{2}$ -distribution by the assumption (4.7). See Appendix A.2 for the details.

When (4.7) holds, an expected value of $\alpha^{2}_{k,k^{\prime}}$ is $c^{2}m$ . Hence, if $\alpha^{2}_{k,k^{\prime}}$ is regarded as $c^{2}m$ , then for $c>1$ there exists $s\in(0,1)$ such that the upper bound in (4.3) expectedly converges to zero as the dimension $m$ goes to infinity. Furthermore, Corollary 4.7 justifies Gal and Ghahramani (2016, Proposition 1 in Appendix A), which formally mentions that $\widetilde{H}[q]$ tends to $H[q]$ when means $\mu_{k}$ are normally distributed, all elements of covariance matrices $\Sigma_{k}$ do not depend on $m$ , and $m$ is large enough. In fact, the right-hand side of (4.8) converges to zero as $m\to\infty$ for some $s\in(0,1)$ if $c>1$ .

We also study the derivatives of the error $|H[q]-\widetilde{H}[q]|$ with respect to learning parameters $\theta=(\pi_{k},\mu_{k},\Sigma_{k})_{k=1}^{K}$ . For simplicity, we write

\Gamma_{k}\coloneqq\Sigma^{\frac{1}{2}}_{k}.

We give the following upper bounds for the derivatives of the error.

Theorem 4.8.

Let $k\in[K]$ , $p,q\in[m]$ , and $s\in(0,1)$ . Then

	$\displaystyle\mathrm{(i)}$	$\displaystyle\quad\left\|\frac{\partial}{\partial\mu_{k,p}}\left(H[q]-\widetilde{H}[q]\right)\right\|\leq\frac{2}{(1-s)^{\frac{m+2}{4}}}\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}}\left(\left\\|\Gamma_{k^{\prime}}^{-1}\right\\|_{1}+\left\\|\Gamma_{k}^{-1}\right\\|_{1}\right)\exp\left(-\frac{s\alpha_{k,k^{\prime}}^{2}}{4}\right),$
	$\displaystyle\mathrm{(ii)}$	$\displaystyle\quad\left\|\frac{\partial}{\partial\gamma_{k,pq}}\left(H[q]-\widetilde{H}[q]\right)\right\|$
		$\displaystyle\quad\leq\frac{6}{(1-s)^{\frac{m+4}{4}}}\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}}\left(2\|\Gamma_{k}\|^{-1}\|\Gamma_{k,pq}\|+\left\\|\Gamma_{k}^{-1}\right\\|_{1}+\left\\|\Gamma_{k^{\prime}}^{-1}\right\\|_{1}\right)\exp\left(-\frac{s\alpha_{k,k^{\prime}}^{2}}{4}\right)$
		$\displaystyle\quad\text{for}\ \gamma_{k,pq}\in\mathbb{R}\ \text{satisfying}\ \\|\Gamma_{k}^{-1}\\|_{1}<\infty,$
	$\displaystyle\mathrm{(iii)}$	$\displaystyle\quad\left\|\frac{\partial}{\partial\pi_{k}}\left(H[q]-\widetilde{H}[q]\right)\right\|\leq\frac{8}{(1-s)^{\frac{m}{4}}}\sum_{k^{\prime}\neq k}\sqrt{\frac{\pi_{k^{\prime}}}{\pi_{k}}}\exp\left(-\frac{s\alpha_{k,k^{\prime}}^{2}}{4}\right),$

where $\mu_{k,p}$ and $\gamma_{k,pq}$ is the $p$ -th and $(p,q)$ -th components of vector $\mu_{k}$ and matrix $\Gamma_{k}$ , respectively, and $\left\|\cdot\right\|_{1}$ is the entry-wise matrix $1$ -norm, and $|\Gamma_{k,pq}|$ is the determinant of the $(m-1)\times(m-1)$ matrix that results from deleting $p$ -th row and $q$ -th column of matrix $\Gamma_{k}$ . Moreover, the same upper bounds hold for $\alpha_{\{k,k^{\prime}\}}$ instead of $\alpha_{k,k^{\prime}}$ .

The proof is given by similar calculations and techniques to the proof of Theorem 4.2. For the details, see Appendix A.3.

We observe that even in the derivatives of the error, the upper bounds exponentially decay to zero as $\alpha_{k,k^{\prime}}$ go to infinity for all pairs $k,k^{\prime}\in[K]$ with $k\neq k^{\prime}$ . We can also show that if means $\mu_{k}$ are normally distributed with certain large standard deviation $c$ , then the probabilistic inequality like Corollary 4.7 that the bound converges to zero as $m$ goes to infinity is obtained.

4.2 Coincident covariance case

We study the error $|H[q]-\widetilde{H}[q]|$ for coincident covariance matrices, that is,

\Sigma_{k}=\Sigma\quad\mbox{for all}\ k\in[K],

(4.9)

where $\Sigma\in\mathbb{R}^{m\times m}$ is a positive definite matrix. In this case, $\alpha_{k,k^{\prime}}$ have the form as

\alpha_{\{k,k^{\prime}\}}=\alpha_{k,k^{\prime}}=\alpha_{k^{\prime},k}=\frac{\left\|\mu_{k}-\mu_{k^{\prime}}\right\|_{\Sigma}}{2}.

In this case, a more detailed analysis can be done. First, we show the following explicit form of the true entropy $H[q]$ .

Proposition 4.9.

Let $m\geq K\geq 2$ . Then

H[q]=\widetilde{H}[q]-\sum_{k=1}^{K}\frac{\pi_{k}}{(2\pi)^{\frac{K-1}{2}}}\int_{\mathbb{R}^{K-1}}\exp\left(-\frac{|v|^{2}}{2}\right)\log\left(1+\sum_{k^{\prime}\neq k}\frac{\pi_{k^{\prime}}}{\pi_{k}}\exp\left(\frac{|v|^{2}-\left|v-u_{k^{\prime},k}\right|^{2}}{2}\right)\right)dv,

(4.10)

where $u_{k^{\prime},k}\coloneqq[R_{k}\Sigma^{-1/2}(\mu_{k^{\prime}}-\mu_{k})]_{1:K-1}\in\mathbb{R}^{K-1}$ and $R_{k}\in\mathbb{R}^{m\times m}$ is some rotation matrix such that

R_{k}\Sigma^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\in\mathrm{span}\{e_{1},\cdots,e_{K-1}\},\,\,\,k^{\prime}\in[K].

(4.11)

Here, $\{e_{i}\}_{i=1}^{K-1}$ is the standard basis in $\mathbb{R}^{K-1}$ , and $u_{1:K-1}\coloneqq(u_{1},\ldots,u_{K-1})^{T}\in\mathbb{R}^{K-1}$ for $u=(u_{1},\ldots,u_{m})^{T}\in\mathbb{R}^{m}$ .

The proof is given by certain rotations and polar transformations. For the details, see Appendix A.4.

We note that the special case $K=2$ of (4.10) can be found in Zobay (2014, Appendix A). Using Proposition 4.9, we have the following upper and lower bounds for the error.

Theorem 4.10.

Let $m\geq K\geq 2$ and $s\in[0,1)$ . Then

	$\displaystyle\frac{1}{2}\sum_{k=1}^{K}\sum_{k^{\prime}\neq k}\frac{\pi_{k}\pi_{k^{\prime}}}{1-\pi_{k}}\log\left(1+\frac{1-\pi_{k}}{\pi_{k}}\exp(-2\alpha_{k^{\prime},k}^{2})\right)$		(4.12)
	$\displaystyle\leq\left\|H[q]-\widetilde{H}[q]\right\|\leq\frac{2}{(1-s)^{\frac{K-1}{4}}}\sum_{k=1}^{K}\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}}\exp\left(-\frac{s\alpha_{k,k^{\prime}}^{2}}{4}\right).$		(4.13)

The upper bound is proved by the argument in proof of the second upper bound in Theorem 4.2 with the explicit form (4.10) and $|u_{k,k^{\prime}}|=\left\|\mu_{k}-\mu_{k^{\prime}}\right\|_{\Sigma}$ . The lower bound is given by applying that in Theorem 4.2 in the case of $\Sigma_{k}=\Sigma$ for all $k\in[K]$ by remarking $c_{k,k^{\prime}}=1/2$ in that case.

Remark 4.11.

Theorem 4.10 implies the following facts:

(i)

When all covariance matrices coincide, the condition (a) or (b) in Remark 4.3 is a necessary and sufficient condition for the approximation $H[q]\approx\widetilde{H}[q]$ to be valid.
(ii)

The upper bound in (4.13) is sharper than that in (4.3) of Theorem 4.2 when $m\geq K$ .

Corollary 4.12.

Let $m\geq K\geq 2$ and $c>0$ . Take $\{\mu_{k}\}_{k}$ and $\Sigma$ such that

\frac{\Sigma^{-\frac{1}{2}}(\mu_{k}-\mu_{k^{\prime}})}{2}\sim\mathcal{N}(0,c^{2}I),

for all pairs $k,k^{\prime}\in[K]$ ( $k\neq k^{\prime}$ ). Then, for $\varepsilon>0$ and $s\in(0,1)$ ,

P\left(\left|H[q]-\widetilde{H}[q]\right|\geq\varepsilon\right)\leq\frac{2(K-1)}{\varepsilon(1-s)^{\frac{K-1}{4}}}\left(1+\frac{sc^{2}}{2}\right)^{-\frac{m}{2}}.

The proof is given by the same way as in the proof of Corollary 4.7, which uses the Markov’s inequality, the upper bound of Theorem 4.10, and the moment generating function for $\chi^{2}$ -distribution.

Note that, in Corollary 4.12, the assumption $c>1$ , which is required in Corollary 4.7, is no longer necessary for zero convergence.

5 Experiment

We numerically examined the approximation capabilities of the approximate entropy (3.2) compared with Huber et al. (2008), Bonilla et al. (2019), and the Monte Carlo integration. Generally, we cannot compute the entropy (3.1) in a closed form. Therefore, we restricted the setting of the experiment to the case for the coincident covariance matrices (Section 4.2), in particular $\Sigma=I$ , and the number of mixture components $K=2$ , where we obtained the more tractable formula for the entropy (B.4). In this setting, we investigated the relative error between the entropy and each approximation method (see Figure 3). The details for the experimental setting and exact formulas for each method are shown in Appendix B.

From the result in Figure 3, we can observe the following. First, the relative error of ours shows faster decay than others in higher dimensions $m$ . Therefore, the approximate entropy (3.2) has an advantage in higher dimensions. Second, the graph for ours scales in the $x$ -axis as $c$ scales, which is consistent with the expression of the upper bound in Corollary 4.12. Finally, ours is robust against varying mixing coefficients, which cannot be explained by Corollary 4.12. Note that we can hardly conduct a similar experiment for $K>2$ because we cannot prepare the tolerant ground truth of the entropy. For example, even the Monte Carlo integration is not suitable for the ground truth already in the case for $K=2$ due to its large relative error around $10^{-3}$ .

6 Limitations and future work

The limitations and future work are as follows:

•

When all covariance matrices coincide, a necessary and sufficient condition is obtained for the approximation $H[q]\approx\tilde{H}[q]$ to be valid (see (i) in Remark 4.11). However, in the general covariance case, it has not yet been obtained without the constraints for covariance matrices (Remark 4.3). Improving the lower bound (or upper bound) to find a necessary and sufficient condition is a future work.
•

There is an unsolved problem on the standard deviation $c$ in (4.7) of Corollary 4.7. According to this corollary, the approximation error almost surely converges to zero as $m\to\infty$ if we take $c>1$ (the discussion after Corollary 4.7). However, it is unsolved whether the condition $c>1$ is optimal or not for the convergence. According to Corollary 4.12, the condition $c>1$ can be removed in the particular case $\Sigma_{k}=\Sigma$ for all $k\in[K]$ .
•

The approximate entropy (3.2) is valid only when $\alpha_{k,k^{\prime}}$ are large enough. However, since there are situations where $\alpha_{k,k^{\prime}}$ are likely to be small, such as the low-dimensional latent space of a variational autoencoder (Kingma and Welling, 2013), it is worthwhile to propose an appropriate entropy approximation for small $\alpha_{k,k^{\prime}}$ situation. Although approximation $H_{\rm Melbourne}$ of (4.5) seems to be an appropriate one for small $\alpha_{k,k^{\prime}}$ situation, the criteria (e.g., some value of $\alpha_{k,k^{\prime}}$ ) for using either of $\widetilde{H}$ and $H_{\rm Melbourne}$ is unclear.
•

Further enrichment of experiments is important, such as the large component case, and comparison of the derivatives of entropy.
•

One important application of the entropy approximation is variational inference. In Appendix C, we include an overview of variational inference and an experiment on the toy task. However, these are not sufficient to determine the effectiveness of this approximation for variational inference. For instance, the variational inference maximizes the ELBO in (C.1), which includes the entropy term. Investigating how this approximate entropy dominates other terms will be interesting for future work.

Acknowledgments

TF was partially supported by JSPS KAKENHI Grant Number JP24K16949.

References

Bacharoglou [2010] Athanassia Bacharoglou. Approximation of probability distributions by convex mixtures of Gaussian measures. Proceedings of the American Mathematical Society, 138(7):2619–2628, 2010.
Barber and Bishop [1998] David Barber and Christopher M Bishop. Ensemble learning in Bayesian neural networks. Nato ASI Series F Computer and Systems Sciences, 168:215–238, 1998.
Bishop [2006] Christopher M. Bishop. Pattern recognition and machine learning. Springer, 2006.
Bonilla et al. [2019] Edwin V Bonilla, Karl Krauth, and Amir Dezfouli. Generic inference in latent Gaussian process models. J. Mach. Learn. Res., 20:117–1, 2019.
Dahlke and Pacheco [2023] Caleb Dahlke and Jason Pacheco. On convergence of polynomial approximations to the Gaussian mixture entropy. NeurIPS 2023, 2023.
Fort et al. [2019] Stanislav Fort, Huiyi Hu, and Balaji Lakshminarayanan. Deep ensembles: A loss landscape perspective. arXiv preprint arXiv:1912.02757, 2019.
Frazier [2018] Peter I Frazier. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.
Gal and Ghahramani [2016] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. Proceedings of The 33rd International Conference on Machine Learning, 48:1050–1059, 20–22 Jun 2016.
He et al. [2020] Bobby He, Balaji Lakshminarayanan, and Yee W Teh. Bayesian deep ensembles via the neural tangent kernel. Advances in Neural Information Processing Systems, 33:1010–1022, 2020.
Hershey and Olsen [2007] John R Hershey and Peder A Olsen. Approximating the Kullback Leibler divergence between Gaussian mixture models. In 2007 IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP’07, volume 4, pages IV–317. IEEE, 2007.
Hinton and Van Camp [1993] Geoffrey E Hinton and Drew Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the sixth annual conference on Computational learning theory, pages 5–13, 1993.
Huber et al. [2008] Marco F Huber, Tim Bailey, Hugh Durrant-Whyte, and Uwe D Hanebeck. On entropy approximation for Gaussian mixture random vectors. In 2008 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems, pages 181–188. IEEE, 2008.
Kingma and Welling [2013] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
Kingma et al. [2015] Durk P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. Advances in neural information processing systems, 28:2575–2583, 2015.
Lakshminarayanan et al. [2017] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in neural information processing systems, 30, 2017.
MacKay [1992] David JC MacKay. A practical Bayesian framework for backpropagation networks. Neural computation, 4(3):448–472, 1992.
Melbourne et al. [2022] James Melbourne, Saurav Talukdar, Shreyas Bhaban, Mokshay Madiman, and Murti V Salapaka. The differential entropy of mixtures: New bounds and applications. IEEE Transactions on Information Theory, 68(4):2123–2146, 2022.
Neal [2012] Radford M Neal. Bayesian learning for neural networks. Springer Science & Business Media, 118, 2012.
Nielsen and Sun [2016] Frank Nielsen and Ke Sun. Guaranteed bounds on information-theoretic measures of univariate mixtures using piecewise log-sum-exp inequalities. Entropy, 18(12):442, 2016.
Rezende et al. [2014] Danilo J Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. International conference on machine learning, pages 1278–1286, 2014.
Settles [2009] Burr Settles. Active learning literature survey. 2009.
Titsias and Lázaro-Gredilla [2014] Michalis Titsias and Miguel Lázaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate inference. In International conference on machine learning, pages 1971–1979. PMLR, 2014.
Wang and Madiman [2014] Liyao Wang and Mokshay Madiman. Beyond the entropy power inequality, via rearrangements. IEEE Transactions on Information Theory, 60(9):5116–5137, 2014.
Zobay [2014] Oliver Zobay. Variational Bayesian inference with Gaussian-mixture approximations. Electronic Journal of Statistics, 8(1):355–389, 2014.

Appendix

Appendix A Proofs in Section 4

We recall the definitions and notations used in this appendix. Let $q(w)$ be the Gaussian mixture distribution, that is,

q(w)=\sum_{k=1}^{K}\pi_{k}\mathcal{N}(w\,|\,\mu_{k},\Sigma_{k}),\quad w\in\mathbb{R}^{m},

where $m\in\mathbb{N}$ is the dimension, $K\in\mathbb{N}$ is the number of mixture components, $\pi_{k}\in(0,1]$ are mixing coefficients constrained by $\sum_{k=1}^{K}\pi_{k}=1$ , and $\mathcal{N}(w\,|\,\mu_{k},\Sigma_{k})$ is the Gaussian distribution with a mean $\mu_{k}\in\mathbb{R}^{m}$ and covariance matrix $\Sigma_{k}\in\mathbb{R}^{m\times m}$ , that is,

\mathcal{N}(w\,|\,\mu_{k},\Sigma_{k})=\frac{1}{\sqrt{(2\pi)^{m}|\Sigma_{k}|}}\exp\left(-\frac{1}{2}\left\|w-\mu_{k}\right\|_{\Sigma_{k}}^{2}\right).

Here, $|\Sigma_{k}|$ is the determinant of matrix $\Sigma_{k}$ , and $\|x\|_{\Sigma}^{2}\coloneqq x\cdot(\Sigma^{-1}x)$ for a vector $x\in\mathbb{R}^{m}$ and a positive definite matrix $\Sigma\in\mathbb{R}^{m\times m}$ . The entropy of $q(w)$ and its approximation are defined by

	$\displaystyle H[q]\coloneqq$	$\displaystyle-\int q(w)\log(q(w))\,dw,$
	$\displaystyle\widetilde{H}[q]\coloneqq$	$\displaystyle\,-\sum_{k=1}^{K}\pi_{k}\int\mathcal{N}(w\,\|\,\mu_{k},\Sigma_{k})\log\left(\pi_{k}\mathcal{N}(w\,\|\,\mu_{k},\Sigma_{k})\right)dw$
	$\displaystyle=$	$\displaystyle\,\frac{m}{2}+\frac{m}{2}\log 2\pi+\frac{1}{2}\sum_{k=1}^{K}\pi_{k}\log\|\Sigma_{k}\|-\sum_{k=1}^{K}\pi_{k}\log\pi_{k}.$

A.1 Proof of Theorem 4.2

Theorem 4.2 is a combination of Lemmas A.2, A.3, and A.5 stated below.

Lemma A.1.

\left|H[q]-\widetilde{H}[q]\right|\leq\frac{K}{2}.

Proof.

Making the change of variables as $y=\Sigma_{k}^{-1/2}(x-\mu_{k})$ , we write

	$\displaystyle\widetilde{H}[q]-H[q]$
	$\displaystyle=\sum_{k=1}^{K}\pi_{k}\int_{\mathbb{R}^{m}}\mathcal{N}(x\|\mu_{k},\Sigma_{k})\left\{\log\left(\sum_{k^{\prime}=1}^{K}\pi_{k^{\prime}}\mathcal{N}(x\|\mu_{k^{\prime}},\Sigma_{k})\right)-\log\left(\pi_{k}\mathcal{N}(x\|\mu_{k},\Sigma_{k})\right)\right\}dx\hskip 30.0pt$
	$\displaystyle=\sum_{k=1}^{K}\pi_{k}\int_{\mathbb{R}^{m}}\frac{1}{\sqrt{(2\pi)^{m}\|\Sigma_{k}\|}}\exp\left(-\frac{\left\\|x-\mu_{k}\right\\|^{2}_{\Sigma_{k}}}{2}\right)$
	$\displaystyle\hskip 60.0pt\times\log\left(1+\sum_{k^{\prime}\neq k}\frac{\pi_{k^{\prime}}\|\Sigma_{k}\|^{\frac{1}{2}}}{\pi_{k}\|\Sigma_{k^{\prime}}\|^{\frac{1}{2}}}\exp\left(\frac{\left\\|x-\mu_{k}\right\\|^{2}_{\Sigma_{k}}-\left\\|x-\mu_{k^{\prime}}\right\\|^{2}_{\Sigma_{k^{\prime}}}}{2}\right)\right)dx$

		$\displaystyle=\sum_{k=1}^{K}\pi_{k}\int_{\mathbb{R}^{m}}\frac{1}{\sqrt{(2\pi)^{m}}}\exp\left(-\frac{\|y\|^{2}}{2}\right)$
		$\displaystyle\hskip 60.0pt\times\log\left(1+\sum_{k^{\prime}\neq k}\frac{\pi_{k^{\prime}}\|\Sigma_{k}\|^{\frac{1}{2}}}{\pi_{k}\|\Sigma_{k^{\prime}}\|^{\frac{1}{2}}}\exp\left(\frac{\|y\|^{2}-\left\\|\Sigma_{k}^{\frac{1}{2}}\left(y-\Sigma_{k}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right)\right\\|^{2}_{\Sigma_{k^{\prime}}}}{2}\right)\right)dy.$		(A.1)

Using the inequality $\log(1+x)\leq\sqrt{x}$ $(x\geq 0)$ and the Cauchy-Schwarz inequality, we have

$\displaystyle\left\|H[q]-\widetilde{H}[q]\right\|$
$\displaystyle\leq\sum_{k=1}^{K}\pi_{k}\int_{\mathbb{R}^{m}}\frac{1}{\sqrt{(2\pi)^{m}}}\exp\left(-\frac{\|y\|^{2}}{2}\right)\sqrt{\sum_{k^{\prime}\neq k}\frac{\pi_{k^{\prime}}\|\Sigma_{k}\|^{\frac{1}{2}}}{\pi_{k}\|\Sigma_{k^{\prime}}\|^{\frac{1}{2}}}\exp\left(\frac{\|y\|^{2}-\left\\|\Sigma_{k}^{\frac{1}{2}}\left(y-\Sigma_{k}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right)\right\\|^{2}_{\Sigma_{k^{\prime}}}}{2}\right)}\ dy\hskip 65.0pt$		(A.3)
$\displaystyle=\sum_{k=1}^{K}\int_{\mathbb{R}^{m}}\frac{1}{(2\pi)^{\frac{m}{4}}}\exp\left(-\frac{\|y\|^{2}}{4}\right)\sqrt{\frac{1}{(2\pi)^{\frac{m}{2}}}\sum_{k^{\prime}\neq k}\pi_{k}\pi_{k^{\prime}}\frac{\|\Sigma_{k}\|^{\frac{1}{2}}}{\|\Sigma_{k^{\prime}}\|^{\frac{1}{2}}}\exp\left(\frac{-\left\\|\Sigma_{k}^{\frac{1}{2}}\left(y-\Sigma_{k}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right)\right\\|^{2}_{\Sigma_{k^{\prime}}}}{2}\right)}\ dy$	$\displaystyle\leq\sum_{k=1}^{K}\Biggl{(}\int_{\mathbb{R}^{m}}\frac{1}{(2\pi)^{\frac{m}{2}}}\exp\left(-\frac{\|y\|^{2}}{2}\right)dy\Biggr{)}^{\frac{1}{2}}$
$\displaystyle\hskip 60.0pt\times\underbrace{\left(\int_{\mathbb{R}^{m}}\frac{1}{(2\pi)^{\frac{m}{2}}}\sum_{k^{\prime}\neq k}\pi_{k}\pi_{k^{\prime}}\frac{\|\Sigma_{k}\|^{\frac{1}{2}}}{\|\Sigma_{k^{\prime}}\|^{\frac{1}{2}}}\exp\left(\frac{-\left\\|\Sigma_{k}^{\frac{1}{2}}\left(y-\Sigma_{k}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right)\right\\|^{2}_{\Sigma_{k^{\prime}}}}{2}\right)dy\right)^{\frac{1}{2}}}_{\displaystyle=\left(\left(\sum_{k^{\prime}\neq k}\pi_{k}\pi_{k^{\prime}}\right)\int_{\mathbb{R}^{m}}\frac{1}{(2\pi)^{\frac{m}{2}}}\exp\left(-\frac{\|z\|^{2}}{2}\right)dz\right)^{\frac{1}{2}}}$
$\displaystyle=\sum_{k=1}^{K}\left(\sum_{k^{\prime}\neq k}\pi_{k}\pi_{k^{\prime}}\right)^{\frac{1}{2}}=\sum_{k=1}^{K}\sqrt{\pi_{k}(1-\pi_{k})}\leq\sum_{k=1}^{K}\frac{\pi_{k}+(1-\pi_{k})}{2}\leq\frac{K}{2}.$		(A.4)

Thus, the proof of Lemma A.1 is finished. ∎

Lemma A.2.

Let $s\in(0,1)$ . Then

\left|H[q]-\widetilde{H}[q]\right|\leq\frac{2}{(1-s)^{\frac{m}{4}}}\sum_{k=1}^{K}\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}}\exp\left(-\frac{s\alpha_{k,k^{\prime}}^{2}}{4}\right).

(A.5)

Proof.

Using the inequality (A.3) and the inequality $\sqrt{\sum_{i}a_{i}}\leq\sum_{i}\sqrt{a_{i}}$ , we decompose

\begin{split}&\left|H[q]-\widetilde{H}[q]\right|\\ &\leq\sum_{k=1}^{K}\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}\frac{|\Sigma_{k}|^{\frac{1}{2}}}{|\Sigma_{k^{\prime}}|^{\frac{1}{2}}}}\int_{\mathbb{R}^{m}}\frac{1}{\sqrt{(2\pi)^{m}}}\exp\left(-\frac{|y|^{2}}{4}-\frac{\left\|\Sigma_{k}^{\frac{1}{2}}\left(y-\Sigma_{k}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right)\right\|^{2}_{\Sigma_{k^{\prime}}}}{4}\right)dy\\ &=\sum_{k=1}^{K}\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}\frac{|\Sigma_{k}|^{\frac{1}{2}}}{|\Sigma_{k^{\prime}}|^{\frac{1}{2}}}}\int_{|y|<\alpha_{k,k^{\prime}}}\frac{1}{\sqrt{(2\pi)^{m}}}\exp\left(-\frac{|y|^{2}}{4}-\frac{\left\|\Sigma_{k}^{\frac{1}{2}}\left(y-\Sigma_{k}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right)\right\|^{2}_{\Sigma_{k^{\prime}}}}{4}\right)dy\\ &\quad+\sum_{k=1}^{K}\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}\frac{|\Sigma_{k}|^{\frac{1}{2}}}{|\Sigma_{k^{\prime}}|^{\frac{1}{2}}}}\int_{|y|>\alpha_{k,k^{\prime}}}\frac{1}{\sqrt{(2\pi)^{m}}}\exp\left(-\frac{|y|^{2}}{4}-\frac{\left\|\Sigma_{k}^{\frac{1}{2}}\left(y-\Sigma_{k}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right)\right\|^{2}_{\Sigma_{k^{\prime}}}}{4}\right)dy\\ &\eqqcolon D^{i}+D^{o}.\end{split}

Firstly, we evaluate the term $D^{i}$ . By the definition of $\alpha_{k,k^{\prime}}$ , we have

\left|\Sigma_{k}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right|=\alpha_{k,k^{\prime}}\left(1+\left\|\Sigma_{k}^{-\frac{1}{2}}\Sigma_{k^{\prime}}^{\frac{1}{2}}\right\|_{\rm op}\right)>|y|+\alpha_{k,k^{\prime}}\left\|\Sigma_{k}^{-\frac{1}{2}}\Sigma_{k^{\prime}}^{\frac{1}{2}}\right\|_{\rm op}.

Then it follows from properties of $\|\cdot\|_{\Sigma_{k^{\prime}}},\|\cdot\|_{\rm op}$ , and triangle inequality that

	$\displaystyle\left\\|\Sigma_{k}^{\frac{1}{2}}\left(y-\Sigma_{k}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right)\right\\|_{\Sigma_{k^{\prime}}}$	$\displaystyle=\left\|\Sigma_{k^{\prime}}^{-\frac{1}{2}}\Sigma_{k}^{\frac{1}{2}}\left(y-\Sigma_{k}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right)\right\|$
		$\displaystyle\geq\frac{\left\|y-\Sigma_{k}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right\|}{\left\\|\left(\Sigma_{k^{\prime}}^{-\frac{1}{2}}\Sigma_{k}^{\frac{1}{2}}\right)^{-1}\right\\|_{\rm op}}\geq\frac{\left\|\Sigma_{k}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right\|-\|y\|}{\left\\|\Sigma_{k}^{-\frac{1}{2}}\Sigma_{k^{\prime}}^{\frac{1}{2}}\right\\|_{\rm op}}>\alpha_{k,k^{\prime}}.$		(A.6)

From inequality (A.1) and the Cauchy-Schwarz inequality, it follows that for $s\in(0,1)$ ,

	$\displaystyle D^{i}$	$\displaystyle=\sum_{k=1}^{K}\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}\frac{\|\Sigma_{k}\|^{\frac{1}{2}}}{\|\Sigma_{k^{\prime}}\|^{\frac{1}{2}}}}\int_{\|y\|<\alpha_{k,k^{\prime}}}\frac{1}{\sqrt{(2\pi)^{m}}}$
		$\displaystyle\qquad\times\exp\left(-\frac{\|y\|^{2}}{4}-\frac{s\left\\|\Sigma_{k}^{\frac{1}{2}}\left(y-\Sigma_{k}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right)\right\\|^{2}_{\Sigma_{k^{\prime}}}}{4}-\frac{(1-s)\left\\|\Sigma_{k}^{\frac{1}{2}}\left(y-\Sigma_{k}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right)\right\\|^{2}_{\Sigma_{k^{\prime}}}}{4}\right)dy$
		$\displaystyle\leq\sum_{k=1}^{K}\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}}\exp\left(-\frac{s\alpha^{2}_{k,k^{\prime}}}{4}\right)\int_{\|y\|<\alpha_{k,k^{\prime}}}\frac{1}{(2\pi)^{\frac{m}{4}}}$
		$\displaystyle\hskip 60.0pt\times\exp\left(-\frac{\|y\|^{2}}{4}\right)\sqrt{\frac{\|\Sigma_{k}\|^{\frac{1}{2}}}{\|\Sigma_{k^{\prime}}\|^{\frac{1}{2}}}}\frac{1}{(2\pi)^{\frac{m}{4}}}\exp\left(-\frac{(1-s)\left\\|\Sigma_{k}^{\frac{1}{2}}\left(y-\Sigma_{k}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right)\right\\|^{2}_{\Sigma_{k^{\prime}}}}{4}\right)dy\hskip 25.0pt$
		$\displaystyle\leq\sum_{k=1}^{K}\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}}\exp\left(-\frac{s\alpha^{2}_{k,k^{\prime}}}{4}\right)\left(\int_{\mathbb{R}^{m}}\frac{1}{(2\pi)^{\frac{m}{2}}}\exp\left(-\frac{\|y\|^{2}}{2}\right)dy\right)^{\frac{1}{2}}$
		$\displaystyle\hskip 60.0pt\times\left(\int_{\mathbb{R}^{m}}\frac{\|\Sigma_{k}\|^{\frac{1}{2}}}{\|\Sigma_{k^{\prime}}\|^{\frac{1}{2}}}\frac{1}{(2\pi)^{\frac{m}{2}}}\exp\left(-\frac{(1-s)\left\\|\Sigma_{k}^{\frac{1}{2}}\left(y-\Sigma_{k}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right)\right\\|^{2}_{\Sigma_{k^{\prime}}}}{2}\right)dy\right)^{\frac{1}{2}}$
		$\displaystyle=\sum_{k=1}^{K}\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}}\exp\left(-\frac{s\alpha^{2}_{k,k^{\prime}}}{4}\right)\left(\int_{\mathbb{R}^{m}}\frac{1}{(2\pi)^{\frac{m}{2}}}\exp\left(-\frac{(1-s)\|z\|^{2}}{2}\right)dz\right)^{\frac{1}{2}}$
		$\displaystyle=\sum_{k=1}^{K}\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}}\exp\left(-\frac{s\alpha^{2}_{k,k^{\prime}}}{4}\right)(1-s)^{-\frac{m}{4}}\underbrace{\left(\int_{\mathbb{R}^{m}}\frac{1}{(2\pi)^{\frac{m}{2}}\|(1-s)^{-1}I\|^{\frac{1}{2}}}\exp\left(-\frac{1}{2}\\|z\\|_{(1-s)^{-1}I}^{2}\right)dz\right)^{\frac{1}{2}}}_{\displaystyle=1}$
		$\displaystyle=\frac{1}{(1-s)^{\frac{m}{4}}}\sum_{k=1}^{K}\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}}\exp\left(-\frac{s\alpha^{2}_{k,k^{\prime}}}{4}\right),$

where we have used the change of variable as $z=\Sigma_{k^{\prime}}^{-\frac{1}{2}}\Sigma_{k}^{\frac{1}{2}}\left(y-\Sigma_{k}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right)$ .

Secondly, we evaluate the term $D^{o}$ . In the same way as above, we have

\begin{split}D^{o}&=\sum_{k=1}^{K}\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}\frac{|\Sigma_{k}|^{\frac{1}{2}}}{|\Sigma_{k^{\prime}}|^{\frac{1}{2}}}}\\ &\hskip 20.0pt\times\int_{|y|>\alpha_{k,k^{\prime}}}\frac{1}{\sqrt{(2\pi)^{m}}}\exp\left(-\frac{s|y|^{2}}{4}-\frac{(1-s)|y|^{2}}{4}-\frac{\left\|\Sigma_{k}^{\frac{1}{2}}\left(y-\Sigma_{k}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right)\right\|^{2}_{\Sigma_{k^{\prime}}}}{4}\right)dy\\ &\leq\sum_{k=1}^{K}\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}}\exp\left(-\frac{s\alpha^{2}_{k,k^{\prime}}}{4}\right)\int_{|y|>\alpha_{k,k^{\prime}}}\frac{1}{(2\pi)^{\frac{m}{4}}}\exp\left(-\frac{(1-s)|y|^{2}}{4}\right)\\ &\quad\hskip 85.35826pt\times\sqrt{\frac{|\Sigma_{k}|^{\frac{1}{2}}}{|\Sigma_{k^{\prime}}|^{\frac{1}{2}}}}\frac{1}{(2\pi)^{\frac{m}{4}}}\exp\left(-\frac{\left\|\Sigma_{k}^{\frac{1}{2}}\left(y-\Sigma_{k}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right)\right\|^{2}_{\Sigma_{k^{\prime}}}}{4}\right)dy\\ &\leq\sum_{k=1}^{K}\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}}\exp\left(-\frac{s\alpha^{2}_{k,k^{\prime}}}{4}\right)\left(\int_{\mathbb{R}^{m}}\frac{1}{(2\pi)^{\frac{m}{2}}}\exp\left(-\frac{(1-s)|y|^{2}}{2}\right)dy\right)^{\frac{1}{2}}\\ &\quad\hskip 56.9055pt\times\left(\int_{\mathbb{R}^{m}}\frac{|\Sigma_{k}|^{\frac{1}{2}}}{|\Sigma_{k^{\prime}}|^{\frac{1}{2}}}\frac{1}{(2\pi)^{\frac{m}{2}}}\exp\left(-\frac{\left\|\Sigma_{k}^{\frac{1}{2}}\left(y-\Sigma_{k}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right)\right\|^{2}_{\Sigma_{k^{\prime}}}}{2}\right)dy\right)^{\frac{1}{2}}\\ &=\sum_{k=1}^{K}\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}}\exp\left(-\frac{s\alpha^{2}_{k,k^{\prime}}}{4}\right)\left(\int_{\mathbb{R}^{m}}\frac{1}{(2\pi)^{\frac{m}{2}}}\exp\left(-\frac{(1-s)|y|^{2}}{2}\right)dy\right)^{\frac{1}{2}}\\ &=\frac{1}{(1-s)^{\frac{m}{4}}}\sum_{k=1}^{K}\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}}\exp\left(-\frac{s\alpha^{2}_{k,k^{\prime}}}{4}\right).\end{split}

Combining the estimates obtained now, we conclude (A.5). ∎

Lemma A.3.

Let $s\in(0,1)$ . Then

\left|H[q]-\widetilde{H}[q]\right|\leq\frac{2}{(1-s)^{\frac{m}{4}}}\sum_{k=1}^{K}\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}}\exp\left(-\frac{s\alpha_{\{k,k^{\prime}\}}^{2}}{4}\right).

Proof.

The proof is almost the same as Theorem A.2 except the evaluation (A.1). By the change of variables $y=\Sigma_{k}^{-\frac{1}{2}}(x-\mu_{k})$ , we have

	$\displaystyle\\|x-\mu_{k}\\|_{\Sigma_{k}}$	$\displaystyle=\left\|\Sigma_{k}^{-\frac{1}{2}}(x-\mu_{k})\right\|=\|y\|,$
	$\displaystyle\\|x-\mu_{k^{\prime}}\\|_{\Sigma_{k^{\prime}}}$	$\displaystyle=\left\|\Sigma_{k^{\prime}}^{-\frac{1}{2}}(x-\mu_{k^{\prime}})\right\|=\left\|\Sigma_{k^{\prime}}^{-\frac{1}{2}}\Sigma_{k}^{\frac{1}{2}}\left(y-\Sigma_{k}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right)\right\|=\left\\|\Sigma_{k}^{\frac{1}{2}}\left(y-\Sigma_{k}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right)\right\\|_{\Sigma_{k^{\prime}}}.$

From the definition of $\alpha_{\{k,k^{\prime}\}}$ ,

\{y\in\mathbb{R}^{m}:|y|<\alpha\}\cap\left\{y\in\mathbb{R}^{m}:\left\|\Sigma_{k}^{\frac{1}{2}}\left(y-\Sigma_{k}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right)\right\|_{\Sigma_{k^{\prime}}}<\alpha\right\}=\varnothing,

then if $|y|<\alpha_{\{k,k^{\prime}\}}$ , we obtain

\left\|\Sigma_{k}^{\frac{1}{2}}\left(y-\Sigma_{k}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right)\right\|_{\Sigma_{k^{\prime}}}\geq\alpha_{\{k,k^{\prime}\}}.

The proof complete by replacing $\alpha_{k,k^{\prime}}$ with $\alpha_{\{k,k^{\prime}\}}$ in the proof of Lemma A.2. ∎

Lemma A.4.

$\alpha_{\{k,k^{\prime}\}}\geq\alpha_{k,k^{\prime}}$ for any $k,k^{\prime}\in[K]$ .

Proof.

When $x$ satisfies

\|x-\Sigma_{k}^{-\frac{1}{2}}\mu_{k^{\prime}}\|_{\Sigma_{k}^{-1}\Sigma_{k^{\prime}}}<\alpha_{k,k^{\prime}},

because

\|x-\Sigma_{k}^{-\frac{1}{2}}\mu_{k^{\prime}}\|_{\Sigma_{k}^{-1}\Sigma_{k^{\prime}}}=\left|\Sigma_{k^{\prime}}^{-\frac{1}{2}}\Sigma_{k}^{\frac{1}{2}}\left(x-\Sigma_{k}^{-\frac{1}{2}}\mu_{k^{\prime}}\right)\right|\geq\frac{|x-\Sigma_{k}^{-\frac{1}{2}}\mu_{k^{\prime}}|}{\left\|\left(\Sigma_{k^{\prime}}^{-\frac{1}{2}}\Sigma_{k}^{\frac{1}{2}}\right)^{-1}\right\|_{\bf op}}=\frac{|x-\Sigma_{k}^{-\frac{1}{2}}\mu_{k^{\prime}}|}{\sigma},

then $|x-\Sigma_{k}^{-\frac{1}{2}}\mu_{k^{\prime}}|<\alpha_{k,k^{\prime}}\sigma$ , where $\sigma=\|\Sigma_{k}^{-\frac{1}{2}}\Sigma_{k^{\prime}}^{\frac{1}{2}}\|_{\rm op}$ . On the other hand, from the definition of $\alpha_{k,k^{\prime}}$ , we have

\alpha_{k,k^{\prime}}+\alpha_{k,k^{\prime}}\sigma=|\Sigma_{k}^{-\frac{1}{2}}\mu_{k}-\Sigma_{k}^{\frac{1}{2}}\mu_{k^{\prime}}|,

and thus $\{x\in\mathbb{R}^{m}:|x-\Sigma_{k}^{-\frac{1}{2}}\mu_{k}|<\alpha_{k,k^{\prime}}\}\cap\{x\in\mathbb{R}^{m}:|x-\Sigma_{k}^{-\frac{1}{2}}\mu_{k^{\prime}}|<\alpha_{k,k^{\prime}}\sigma\}=\varnothing$ . Therefore, we obtain

\{x\in\mathbb{R}^{m}:|x-\Sigma_{k}^{-\frac{1}{2}}\mu_{k}|<\alpha_{k,k^{\prime}}\}\cap\{x\in\mathbb{R}^{m}:\|x-\Sigma_{k}^{-\frac{1}{2}}\mu_{k^{\prime}}\|_{\Sigma_{k}^{-1}\Sigma_{k^{\prime}}}<\alpha_{k,k^{\prime}}\}=\varnothing.

Making the change of variables as $y=\Sigma_{k}^{\frac{1}{2}}x$ ,

\{y\in\mathbb{R}^{m}:\|y-\mu_{k^{\prime}}\|_{\Sigma_{k}}<\alpha_{k,k^{\prime}}\}\cap\{y\in\mathbb{R}^{m}:\|y-\mu_{k^{\prime}}\|_{\Sigma_{k^{\prime}}}<\alpha_{k,k^{\prime}}\}=\varnothing.

From definition (4.1), $\alpha_{\{k,k^{\prime}\}}\geq\alpha_{k,k^{\prime}}$ is obtained. ∎

Lemma A.5.

\begin{split}&\sum_{k=1}^{K}\sum_{k^{\prime}\neq k}\frac{\pi_{k}\pi_{k^{\prime}}}{1-\pi_{k}}c_{k,k^{\prime}}\log\left(1+\frac{1-\pi_{k}}{\pi_{k}}\frac{|\Sigma_{k}|^{\frac{1}{2}}}{\displaystyle\max_{l}|\Sigma_{l}|^{\frac{1}{2}}}\exp\left(-\frac{\Bigl{(}1+\|\Sigma_{k^{\prime}}^{-\frac{1}{2}}\Sigma_{k}^{\frac{1}{2}}\|_{\rm op}\Bigr{)}^{2}}{2}\alpha_{k^{\prime},k}^{2}\right)\right)\leq\left|H[q]-\widetilde{H}[q]\right|,\end{split}

where the coefficient $c_{k,k^{\prime}}$ is defined by

c_{k,k^{\prime}}\coloneqq\frac{1}{\sqrt{(2\pi)^{m}}}\int_{\mathbb{R}^{m}_{k,k^{\prime}}}\exp\left(-\frac{|y|^{2}}{2}\right)dy\geq 0,

and the set $\mathbb{R}^{m}_{k,k^{\prime}}$ is defined by

\mathbb{R}^{m}_{k,k^{\prime}}\coloneqq\left\{y\in\mathbb{R}^{m}:\begin{array}[]{cc}y\cdot y\geq(\Sigma_{k}^{\frac{1}{2}}\Sigma_{k^{\prime}}^{-1}\Sigma_{k}^{\frac{1}{2}}y)\cdot y,\\ y\cdot(\Sigma_{k}^{\frac{1}{2}}\Sigma_{k^{\prime}}^{-1}(\mu_{k^{\prime}}-\mu_{k}))\geq 0\end{array}\right\}.

(A.7)

Proof.

Using the equality in (LABEL:computation1), we write

\begin{split}&\left|H[q]-\widetilde{H}[q]\right|\\ &=\sum_{k=1}^{K}\pi_{k}\int_{\mathbb{R}^{m}}\frac{1}{\sqrt{(2\pi)^{m}}}\exp\left(-\frac{|y|^{2}}{2}\right)\\ &\hskip 40.0pt\times\log\left(1+\sum_{k^{\prime}\neq k}\frac{\pi_{k^{\prime}}|\Sigma_{k}|^{\frac{1}{2}}}{\pi_{k}|\Sigma_{k^{\prime}}|^{\frac{1}{2}}}\exp\left(\frac{|y|^{2}-\left\|\Sigma_{k}^{\frac{1}{2}}\left(y-\Sigma_{k}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right)\right\|^{2}_{\Sigma_{k^{\prime}}}}{2}\right)\right)dy\\ &\geq\sum_{k=1}^{K}\pi_{k}\int_{\mathbb{R}^{m}}\frac{1}{\sqrt{(2\pi)^{m}}}\exp\left(-\frac{|y|^{2}}{2}\right)\\ &\hskip 28.45274pt\times\log\left(1+\frac{1-\pi_{k}}{\pi_{k}}\frac{|\Sigma_{k}|^{\frac{1}{2}}}{\displaystyle\max_{l}|\Sigma_{l}|^{\frac{1}{2}}}\sum_{k^{\prime}\neq k}\frac{\pi_{k^{\prime}}}{1-\pi_{k}}\exp\left(\frac{|y|^{2}-\left\|\Sigma_{k}^{\frac{1}{2}}\left(y-\Sigma_{k}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right)\right\|^{2}_{\Sigma_{k^{\prime}}}}{2}\right)\right)dy.\end{split}

Since $\log(1+\lambda x)$ is a concave function of $x>0$ for $\lambda>0$ , we estimate that

	$\displaystyle\left\|H[q]-\widetilde{H}[q]\right\|$	$\displaystyle\geq\sum_{k=1}^{K}\pi_{k}\int_{\mathbb{R}^{m}}\frac{1}{\sqrt{(2\pi)^{m}}}\exp\left(-\frac{\|y\|^{2}}{2}\right)\sum_{k^{\prime}\neq k}\frac{\pi_{k^{\prime}}}{1-\pi_{k}}$
		$\displaystyle\hskip 28.45274pt\times\log\left(1+\frac{1-\pi_{k}}{\pi_{k}}\frac{\|\Sigma_{k}\|^{\frac{1}{2}}}{\displaystyle\max_{l}\|\Sigma_{l}\|^{\frac{1}{2}}}\exp\left(\frac{\|y\|^{2}-\left\\|\Sigma_{k}^{\frac{1}{2}}\left(y-\Sigma_{k}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right)\right\\|^{2}_{\Sigma_{k^{\prime}}}}{2}\right)\right)dy$
		$\displaystyle\geq\sum_{k=1}^{K}\pi_{k}\int_{\mathbb{R}^{m}_{k,k^{\prime}}}\frac{1}{\sqrt{(2\pi)^{m}}}\exp\left(-\frac{\|y\|^{2}}{2}\right)\sum_{k^{\prime}\neq k}\frac{\pi_{k^{\prime}}}{1-\pi_{k}}$
		$\displaystyle\hskip 28.45274pt\times\log\left(1+\frac{1-\pi_{k}}{\pi_{k}}\frac{\|\Sigma_{k}\|^{\frac{1}{2}}}{\displaystyle\max_{l}\|\Sigma_{l}\|^{\frac{1}{2}}}\exp\left(\frac{\|y\|^{2}-\left\\|\Sigma_{k}^{\frac{1}{2}}\left(y-\Sigma_{k}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right)\right\\|^{2}_{\Sigma_{k^{\prime}}}}{2}\right)\right)dy.$

Here, it follows from the two conditions in the definition (A.7) of $\mathbb{R}^{m}_{k,k^{\prime}}$ that

\begin{split}|y|^{2}-\left\|\Sigma_{k}^{\frac{1}{2}}\left(y-\Sigma_{k}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right)\right\|^{2}_{\Sigma_{k^{\prime}}}&\geq\left|\Sigma_{k^{\prime}}^{-\frac{1}{2}}\Sigma_{k}^{\frac{1}{2}}y\right|^{2}-\left|\Sigma_{k^{\prime}}^{-\frac{1}{2}}\Sigma_{k}^{\frac{1}{2}}y-\Sigma_{k^{\prime}}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right|^{2}\\ &=-\left|\Sigma_{k^{\prime}}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right|^{2}+2y\cdot\left(\Sigma_{k}^{\frac{1}{2}}\Sigma_{k^{\prime}}^{-1}(\mu_{k^{\prime}}-\mu_{k})\right)\\ &\geq-\left|\Sigma_{k^{\prime}}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right|^{2}\\ &=-\left(1+\|\Sigma_{k^{\prime}}^{-\frac{1}{2}}\Sigma_{k}^{\frac{1}{2}}\|_{\rm op}\right)^{2}\alpha_{k^{\prime},k}^{2},\end{split}

for $y\in\mathbb{R}^{m}_{k,k^{\prime}}$ , where we used the cosine formula in the second step. Combining the above estimates, we have

	$\displaystyle\left\|H[q]-\widetilde{H}[q]\right\|$	$\displaystyle\geq\sum_{k=1}^{K}\pi_{k}\int_{\mathbb{R}^{m}_{k,k^{\prime}}}\frac{1}{\sqrt{(2\pi)^{m}}}\exp\left(-\frac{\|y\|^{2}}{2}\right)\sum_{k^{\prime}\neq k}\frac{\pi_{k^{\prime}}}{1-\pi_{k}}$
		$\displaystyle\hskip 60.0pt\times\log\left(1+\frac{1-\pi_{k}}{\pi_{k}}\frac{\|\Sigma_{k}\|^{\frac{1}{2}}}{\displaystyle\max_{l}\|\Sigma_{l}\|^{\frac{1}{2}}}\exp\left(-\frac{\left(1+\\|\Sigma_{k^{\prime}}^{-\frac{1}{2}}\Sigma_{k}^{\frac{1}{2}}\\|_{\rm op}\right)^{2}}{2}\alpha_{k^{\prime},k}^{2}\right)\right)dy$
		$\displaystyle=\sum_{k=1}^{K}\sum_{k^{\prime}\neq k}\frac{\pi_{k}\pi_{k^{\prime}}}{1-\pi_{k}}c_{k,k^{\prime}}\log\left(1+\frac{1-\pi_{k}}{\pi_{k}}\frac{\|\Sigma_{k}\|^{\frac{1}{2}}}{\displaystyle\max_{l}\|\Sigma_{l}\|^{\frac{1}{2}}}\exp\left(-\frac{\Bigl{(}1+\\|\Sigma_{k^{\prime}}^{-\frac{1}{2}}\Sigma_{k}^{\frac{1}{2}}\\|_{\rm op}\Bigr{)}^{2}}{2}\alpha_{k^{\prime},k}^{2}\right)\right).$

The proof of Lemma A.5 is finished. ∎

Remark A.6.

Either $c_{k,k^{\prime}}$ or $c_{k^{\prime},k}$ is positive. Indeed,

•

if $\Sigma_{k}^{-1}-\Sigma_{k^{\prime}}^{-1}$ has at least one positive eigenvalue, then $c_{k,k^{\prime}}$ is positive;
•

if all eigenvalues of $\Sigma_{k}^{-1}-\Sigma_{k^{\prime}}^{-1}\neq O$ are non-positive, then $\Sigma_{k^{\prime}}^{-1}-\Sigma_{k}^{-1}$ has at least one positive eigenvalue;
•

if $\Sigma_{k}^{-1}-\Sigma_{k^{\prime}}^{-1}=O$ , then $c_{k,k^{\prime}}=1/2$ because $\mathbb{R}^{m}_{k,k^{\prime}}$ is a half-space of $\mathbb{R}^{m}$ ,

where $O$ is the zero matrix.

A.2 Proof of Corollary 4.7

We restate Corollary 4.7 as follows:

Lemma A.7.

Let $c>0$ . Assume $\{\mu_{k}\}_{k}$ and $\{\Sigma_{k}\}_{k}$ such that

\frac{\Sigma_{k}^{-\frac{1}{2}}(\mu_{k}-\mu_{k^{\prime}})}{1+\|\Sigma_{k}^{-\frac{1}{2}}\Sigma_{k^{\prime}}^{\frac{1}{2}}\|_{\rm op}}\sim\mathcal{N}(0,c^{2}I)

(A.8)

for all pairs $k,k^{\prime}\in[K]$ ( $k\neq k^{\prime}$ ). Then, for $\varepsilon>0$ and $s\in(0,1)$ ,

\begin{split}&P\left(\left|H[q]-\widetilde{H}[q]\right|\geq\varepsilon\right)\leq\frac{2(K-1)}{\varepsilon}\left(\sqrt{1-s}\left(1+\frac{sc^{2}}{2}\right)\right)^{-\frac{m}{2}}.\end{split}

(A.9)

Proof.

Using Lemma A.2 and Markov’s inequality, for $\varepsilon>0$ and $s\in(0,1)$ , we estimate

P\left(\left|H[q]-\widetilde{H}[q]\right|\geq\varepsilon\right)\leq\frac{E\left[\left|H[q]-\widetilde{H}[q]\right|\right]}{\varepsilon}\leq\frac{2}{\varepsilon(1-s)^{\frac{m}{4}}}\sum_{k=1}^{K}\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}}E\left[\exp\left(-\frac{s\alpha^{2}_{k,k^{\prime}}}{4}\right)\right].

By the assumption (A.8), $\alpha^{2}_{k,k^{\prime}}/c^{2}$ follows the $\chi^{2}$ -distribution with $m$ degrees of freedom, that is,

\frac{1}{c^{2}}\left|\frac{\Sigma_{k}^{-\frac{1}{2}}(\mu_{k}-\mu_{k^{\prime}})}{1+\|\Sigma_{k}^{-\frac{1}{2}}\Sigma_{k^{\prime}}^{\frac{1}{2}}\|_{\rm op}}\right|^{2}\sim\chi_{m}^{2}.

Therefore, we conclude from the moment-generating function for $\chi^{2}$ -distribution that

	$\displaystyle P\left(\left\|H[q]-\widetilde{H}[q]\right\|\geq\varepsilon\right)$	$\displaystyle\leq\frac{2}{\varepsilon(1-s)^{\frac{m}{4}}}\sum_{k=1}^{K}\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}}E\left[\exp\left(-\frac{sc^{2}}{4}\frac{\alpha^{2}_{k,k^{\prime}}}{c^{2}}\right)\right]$
		$\displaystyle=\frac{2}{\varepsilon(1-s)^{\frac{m}{4}}}\sum_{k=1}^{K}\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}}\left(1-2\left(-\frac{sc^{2}}{4}\right)\right)^{-\frac{m}{2}}$
		$\displaystyle\leq\frac{2(K-1)}{\varepsilon}\left(\sqrt{1-s}\left(1+\frac{sc^{2}}{2}\right)\right)^{-\frac{m}{2}}.$

∎

A.3 Proof of Theorem 4.8

The next lemma is the same as Theorem 4.8.

Lemma A.8.

Let $k\in[K]$ , $p,q\in[m]$ , and $s\in(0,1)$ . Then

	$\displaystyle\mathrm{(i)}$	$\displaystyle\quad\left\|\frac{\partial}{\partial\mu_{k,p}}\left(H[q]-\widetilde{H}[q]\right)\right\|\leq\frac{2}{(1-s)^{\frac{m+2}{4}}}\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}}\left(\left\\|\Gamma_{k^{\prime}}^{-1}\right\\|_{1}+\left\\|\Gamma_{k}^{-1}\right\\|_{1}\right)\exp\left(-\frac{s\alpha_{k,k^{\prime}}^{2}}{4}\right),$
	$\displaystyle\mathrm{(ii)}$	$\displaystyle\quad\left\|\frac{\partial}{\partial\gamma_{k,pq}}\left(H[q]-\widetilde{H}[q]\right)\right\|$
		$\displaystyle\quad\leq\frac{6}{(1-s)^{\frac{m+4}{4}}}\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}}\left(2\|\Gamma_{k}\|^{-1}\|\Gamma_{k,pq}\|+\left\\|\Gamma_{k}^{-1}\right\\|_{1}+\left\\|\Gamma_{k^{\prime}}^{-1}\right\\|_{1}\right)\exp\left(-\frac{s\alpha_{k,k^{\prime}}^{2}}{4}\right)$
		$\displaystyle\quad\text{for}\ \gamma_{k,pq}\in\mathbb{R}\ \text{satisfying}\ \\|\Gamma_{k}^{-1}\\|_{1}<\infty,$
	$\displaystyle\mathrm{(iii)}$	$\displaystyle\quad\left\|\frac{\partial}{\partial\pi_{k}}\left(H[q]-\widetilde{H}[q]\right)\right\|\leq\frac{8}{(1-s)^{\frac{m}{4}}}\sum_{k^{\prime}\neq k}\sqrt{\frac{\pi_{k^{\prime}}}{\pi_{k}}}\exp\left(-\frac{s\alpha_{k,k^{\prime}}^{2}}{4}\right),$

Proof.

In this proof, we denote $\Gamma_{k^{\prime}}^{-1}\Gamma_{k}\,y-\Gamma_{k^{\prime}}^{-1}(\mu_{k^{\prime}}-\mu_{k})$ by $\Theta(y\mid\mu_{k;k^{\prime}},\Gamma_{k;k^{\prime}})$ . From the equality (LABEL:computation1), we have

	$\displaystyle H[q]-\widetilde{H}[q]$
	$\displaystyle=\sum_{k=1}^{K}\pi_{k}\int_{\mathbb{R}^{m}}\frac{1}{\sqrt{(2\pi)^{m}}}\exp\left(-\frac{\|y\|^{2}}{2}\right)\ \log\left(1+\sum_{k^{\prime}\neq k}\frac{\pi_{k^{\prime}}\|\Gamma_{k}\|}{\pi_{k}\|\Gamma_{k^{\prime}}\|}\exp\left(\frac{\|y\|^{2}-\|\Theta(y\mid\mu_{k;k^{\prime}},\Gamma_{k;k^{\prime}})\|^{2}}{2}\right)\right)dy$
	$\displaystyle=\pi_{k}\int_{\mathbb{R}^{m}}\frac{1}{\sqrt{(2\pi)^{m}}}\exp\left(-\frac{\|y\|^{2}}{2}\right)\ \log\left(1+\sum_{k^{\prime}\neq k}\frac{\pi_{k^{\prime}}\|\Gamma_{k}\|}{\pi_{k}\|\Gamma_{k^{\prime}}\|}\exp\left(\frac{\|y\|^{2}-\|\Theta(y\mid\mu_{k;k^{\prime}},\Gamma_{k;k^{\prime}})\|^{2}}{2}\right)\right)dy$
	$\displaystyle\hskip 20.0pt+\sum_{\ell\neq k}\pi_{\ell}\int_{\mathbb{R}^{m}}\frac{1}{\sqrt{(2\pi)^{m}}}\exp\left(-\frac{\|y\|^{2}}{2}\right)$
	$\displaystyle\hskip 80.0pt\times\log\Biggl{(}1+\underbrace{\sum_{k^{\prime}\neq\ell}\frac{\pi_{k^{\prime}}\|\Gamma_{\ell}\|}{\pi_{\ell}\|\Gamma_{k^{\prime}}\|}\exp\left(\frac{\|y\|^{2}-\|\Theta(y\mid\mu_{\ell;k^{\prime}},\Gamma_{\ell;k^{\prime}})\|^{2}}{2}\right)}_{\hskip 80.0pt\displaystyle=\frac{\pi_{k}\|\Gamma_{\ell}\|}{\pi_{\ell}\|\Gamma_{k}\|}\exp\left(\frac{\|y\|^{2}-\|\Theta(y\mid\mu_{\ell;k},\Gamma_{\ell;k})\|^{2}}{2}\right)+\text{terms independent of $k$}}\Biggr{)}dy.$		(A.10)

(i) The derivatives of (A.10) with respect to $\mu_{k,p}$ are calculated as

		$\displaystyle\frac{\partial}{\partial\mu_{k,p}}\left(H[q]-\widetilde{H}[q]\right)$
		$\displaystyle=\pi_{k}\int_{\mathbb{R}^{m}}\frac{1}{\sqrt{(2\pi)^{m}}}\exp\left(-\frac{\|y\|^{2}}{2}\right)$		(A.11)
		$\displaystyle\hskip 40.0pt\times\frac{\displaystyle\sum_{k^{\prime}\neq k}\frac{\pi_{k^{\prime}}\|\Gamma_{k}\|}{\pi_{k}\|\Gamma_{k^{\prime}}\|}\exp\left(\frac{\|y\|^{2}-\|\Theta(y\mid\mu_{k;k^{\prime}},\Gamma_{k;k^{\prime}})\|^{2}}{2}\right)\frac{\partial}{\partial\mu_{k,p}}\left(\frac{-\|\Theta(y\mid\mu_{k;k^{\prime}},\Gamma_{k;k^{\prime}})\|^{2}}{2}\right)}{\displaystyle 1+\sum_{k^{\prime}\neq k}\frac{\pi_{k^{\prime}}\|\Gamma_{k}\|}{\pi_{k}\|\Gamma_{k^{\prime}}\|}\exp\left(\frac{\|y\|^{2}-\|\Theta(y\mid\mu_{k;k^{\prime}},\Gamma_{k;k^{\prime}})\|^{2}}{2}\right)}dy$
		$\displaystyle\hskip 20.0pt+\sum_{\ell\neq k}\pi_{\ell}\int_{\mathbb{R}^{m}}\frac{1}{\sqrt{(2\pi)^{m}}}\exp\left(-\frac{\|y\|^{2}}{2}\right)$		(A.12)
		$\displaystyle\hskip 60.0pt\times\frac{\displaystyle\frac{\pi_{k}\|\Gamma_{\ell}\|}{\pi_{\ell}\|\Gamma_{k}\|}\exp\left(\frac{\|y\|^{2}-\|\Theta(y\mid\mu_{\ell;k},\Gamma_{\ell;k})\|^{2}}{2}\right)\frac{\partial}{\partial\mu_{k,p}}\left(\frac{-\|\Theta(y\mid\mu_{\ell;k},\Gamma_{\ell;k})\|^{2}}{2}\right)}{\displaystyle 1+\sum_{k^{\prime}\neq\ell}\frac{\pi_{k^{\prime}}\|\Gamma_{\ell}\|}{\pi_{\ell}\|\Gamma_{k^{\prime}}\|}\exp\left(\frac{\|y\|^{2}-\|\Theta(y\mid\mu_{\ell;k^{\prime}},\Gamma_{\ell;k^{\prime}})\|^{2}}{2}\right)}dy.$		(A.13)

Since $\frac{x}{a+x}\leq x^{\frac{1}{2}}$ for $x>1$ when $a\geq 1$ , we estimate

\frac{\displaystyle\frac{\pi_{k^{\prime}}|\Gamma_{k}|}{\pi_{k}|\Gamma_{k^{\prime}}|}\exp\left(\frac{|y|^{2}-|\Theta(y\mid\mu_{k;k^{\prime}},\Gamma_{k;k^{\prime}})|^{2}}{2}\right)}{\displaystyle 1+\sum_{k^{\prime}\neq k}\frac{\pi_{k^{\prime}}|\Gamma_{k}|}{\pi_{k}|\Gamma_{k^{\prime}}|}\exp\left(\frac{|y|^{2}-|\Theta(y\mid\mu_{k;k^{\prime}},\Gamma_{k;k^{\prime}})|^{2}}{2}\right)}\leq\sqrt{\frac{\pi_{k^{\prime}}|\Gamma_{k}|}{\pi_{k}|\Gamma_{k^{\prime}}|}}\exp\left(\frac{|y|^{2}-|\Theta(y\mid\mu_{k;k^{\prime}},\Gamma_{k;k^{\prime}})|^{2}}{4}\right),

(A.15)

and

\frac{\displaystyle\frac{\pi_{k}|\Gamma_{\ell}|}{\pi_{\ell}|\Gamma_{k}|}\exp\left(\frac{|y|^{2}-|\Theta(y\mid\mu_{\ell;k},\Gamma_{\ell;k})|^{2}}{2}\right)}{\displaystyle 1+\sum_{k^{\prime}\neq\ell}\frac{\pi_{k^{\prime}}|\Gamma_{\ell}|}{\pi_{\ell}|\Gamma_{k^{\prime}}|}\exp\left(\frac{|y|^{2}-|\Theta(y\mid\mu_{\ell;k^{\prime}},\Gamma_{\ell;k^{\prime}})|^{2}}{2}\right)}\leq\sqrt{\frac{\pi_{k}|\Gamma_{\ell}|}{\pi_{\ell}|\Gamma_{k}|}}\exp\left(\frac{|y|^{2}-|\Theta(y\mid\mu_{\ell;k},\Gamma_{\ell;k})|^{2}}{4}\right).

(A.16)

We also calculate

\displaystyle\frac{\partial}{\partial\mu_{k,p}}\left(\frac{-|\Theta(y\mid\mu_{k;k^{\prime}},\Gamma_{k;k^{\prime}})|^{2}}{2}\right)

\displaystyle=\sum_{i=1}^{m}\left[\Theta(y\mid\mu_{k;k^{\prime}},\Gamma_{k;k^{\prime}})\right]_{i}\gamma_{k^{\prime},ip}^{-1},

(A.17)

and

\displaystyle\frac{\partial}{\partial\mu_{k,p}}\left(\frac{-|\Theta(y\mid\mu_{\ell;k},\Gamma_{\ell;k})|^{2}}{2}\right)

\displaystyle=\sum_{i=1}^{m}\left[\Theta(y\mid\mu_{\ell;k},\Gamma_{\ell;k})\right]_{i}\gamma_{k,ip}^{-1},

(A.18)

By using (LABEL:derivative-mu-1)–(A.18),

	$\displaystyle\left\|\frac{\partial}{\partial\mu_{k,p}}\left(H[q]-\widetilde{H}[q]\right)\right\|$
	$\displaystyle\leq\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}}\sum_{i=1}^{m}\left\|\gamma_{k^{\prime},ip}^{-1}\right\|\underbrace{\sqrt{\frac{\|\Gamma_{k}\|}{\|\Gamma_{k^{\prime}}\|}}\int_{\mathbb{R}^{m}}\frac{\left[\Theta(y\mid\mu_{k;k^{\prime}},\Gamma_{k;k^{\prime}})\right]_{i}}{\sqrt{(2\pi)^{m}}}\exp\left(\frac{-\|y\|^{2}-\|\Theta(y\mid\mu_{k;k^{\prime}},\Gamma_{k;k^{\prime}})\|^{2}}{4}\right)dy}_{\displaystyle\leq\frac{2}{(1-s)^{\frac{m+2}{4}}}\exp\left(-\frac{s\alpha_{k,k^{\prime}}^{2}}{4}\right)}\hskip 30.0pt$
	$\displaystyle\hskip 20.0pt+\sum_{\ell\neq k}\sqrt{\pi_{k}\pi_{\ell}}\sum_{i=1}^{m}\left\|\gamma_{k,ip}^{-1}\right\|\underbrace{\sqrt{\frac{\|\Gamma_{\ell}\|}{\|\Gamma_{k}\|}}\int_{\mathbb{R}^{m}}\frac{\left[\Theta(y\mid\mu_{\ell;k},\Gamma_{\ell;k})\right]_{i}}{\sqrt{(2\pi)^{m}}}\exp\left(\frac{-\|y\|^{2}-\|\Theta(y\mid\mu_{\ell;k},\Gamma_{\ell;k})\|^{2}}{4}\right)dy}_{\displaystyle\leq\frac{2}{(1-s)^{\frac{m+2}{4}}}\exp\left(-\frac{s\alpha_{\ell,k}^{2}}{4}\right)}$
	$\displaystyle\leq\frac{2}{(1-s)^{\frac{m+2}{4}}}\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}}\Biggl{\{}\underbrace{\left(\sum_{i=1}^{m}\left\|\gamma_{k^{\prime},ip}^{-1}\right\|\right)}_{\displaystyle\leq\left\\|\Gamma_{k^{\prime}}^{-1}\right\\|_{1}}\exp\left(-\frac{s\alpha_{k,k^{\prime}}^{2}}{4}\right)+\underbrace{\left(\sum_{i=1}^{m}\left\|\gamma_{k,ip}^{-1}\right\|\right)}_{\displaystyle\leq\left\\|\Gamma_{k}^{-1}\right\\|_{1}}\exp\left(-\frac{s\alpha_{k^{\prime},k}^{2}}{4}\right)\Biggr{\}},$

where the last inequality is given by the same arguments in the proof of Lemma A.2. Similar to Lemma A.3, this evaluation holds when $\alpha_{k,k^{\prime}}$ and $\alpha_{k^{\prime},k}$ are replaced with any $\alpha\leq\alpha_{\{k,k^{\prime}\}}$ , for example $\max(\alpha_{k,k^{\prime}},\alpha_{k^{\prime},k})$ .

(ii) The derivatives of (A.10) with respect to $\gamma_{k,pq}$ are calculated as follows:

	$\displaystyle\frac{\partial}{\partial\gamma_{k,pq}}\left(H[q]-\widetilde{H}[q]\right)$
	$\displaystyle=\pi_{k}\int_{\mathbb{R}^{m}}\frac{1}{\sqrt{(2\pi)^{m}}}\exp\left(-\frac{\|y\|^{2}}{2}\right)\frac{\displaystyle\sum_{k^{\prime}\neq k}\frac{\pi_{k^{\prime}}\|\Gamma_{k}\|}{\pi_{k}\|\Gamma_{k^{\prime}}\|}\exp\left(\frac{\|y\|^{2}-\|\Theta(y\mid\mu_{k;k^{\prime}},\Gamma_{k;k^{\prime}})\|^{2}}{2}\right)}{\displaystyle 1+\sum_{k^{\prime}\neq k}\frac{\pi_{k^{\prime}}\|\Gamma_{k}\|}{\pi_{k}\|\Gamma_{k^{\prime}}\|}\exp\left(\frac{\|y\|^{2}-\|\Theta(y\mid\mu_{k;k^{\prime}},\Gamma_{k;k^{\prime}})\|^{2}}{2}\right)}$
	$\displaystyle\hskip 60.0pt\times\Biggl{\{}\underbrace{\|\Gamma_{k}\|^{-1}\left(\frac{\partial}{\partial\gamma_{k,pq}}\|\Gamma_{k}\|\right)}_{\displaystyle=\|\Gamma_{k}\|^{-1}\|\Gamma_{k,pq}\|}+\frac{\partial}{\partial\gamma_{k,pq}}\left(\frac{-\|\Theta(y\mid\mu_{k;k^{\prime}},\Gamma_{k;k^{\prime}})\|^{2}}{2}\right)\Biggr{\}}dy$
	$\displaystyle\hskip 20.0pt+\sum_{\ell\neq k}\pi_{\ell}\int_{\mathbb{R}^{m}}\frac{1}{\sqrt{(2\pi)^{m}}}\exp\left(-\frac{\|y\|^{2}}{2}\right)\frac{\displaystyle\frac{\pi_{k}\|\Gamma_{\ell}\|}{\pi_{\ell}\|\Gamma_{k}\|}\exp\left(\frac{\|y\|^{2}-\|\Theta(y\mid\mu_{\ell;k},\Gamma_{\ell;k})\|^{2}}{2}\right)}{\displaystyle 1+\sum_{k^{\prime}\neq\ell}\frac{\pi_{k^{\prime}}\|\Gamma_{\ell}\|}{\pi_{\ell}\|\Gamma_{k^{\prime}}\|}\exp\left(\frac{\|y\|^{2}-\|\Theta(y\mid\mu_{\ell;k},\Gamma_{\ell;k})\|^{2}}{2}\right)}$
	$\displaystyle\hskip 80.0pt\times\Biggl{\{}\underbrace{\|\Gamma_{k}\|\frac{\partial}{\partial\gamma_{k,pq}}\|\Gamma_{k}\|^{-1}}_{\displaystyle=\|\Gamma_{k}\|^{-1}\|\Gamma_{k,pq}\|}+\frac{\partial}{\partial\gamma_{k,pq}}\left(\frac{-\|\Theta(y\mid\mu_{\ell;k},\Gamma_{\ell;k})\|^{2}}{2}\right)\Biggr{\}}dy.$		(A.19)

We calculate

\frac{\partial}{\partial\gamma_{k,pq}}\left(\frac{-|\Theta(y\mid\mu_{k;k^{\prime}},\Gamma_{k;k^{\prime}})|^{2}}{2}\right)=-\sum_{i=1}^{m}\left[\Theta(y\mid\mu_{k;k^{\prime}},\Gamma_{k;k^{\prime}})\right]_{i}\gamma_{k^{\prime},ip}^{-1}y_{q},

(A.20)

and

\frac{\partial}{\partial\gamma_{k,pq}}\left(\frac{-|\Theta(y\mid\mu_{\ell;k},\Gamma_{\ell;k})|^{2}}{2}\right)=-\sum_{i=1}^{m}\left[\Theta(y\mid\mu_{\ell;k},\Gamma_{\ell;k})\right]_{i}\left[\left(\frac{\partial}{\partial\gamma_{k,pq}}\Gamma_{k}^{-1}\right)\Gamma_{k}\left(\Theta(y\mid\mu_{\ell;k},\Gamma_{\ell;k})\right)\right]_{i},

(A.21)

where we denote by $[v]_{i}$ the $i$ -th component of vector $v$ , and $\gamma_{k^{\prime},ip}^{-1}$ and $\gamma_{k,ip}^{-1}$ the $(i,p)$ -th component of matrix $\Gamma_{k^{\prime}}^{-1}$ and $\Gamma_{k}^{-1}$ , respectively, and $y_{q}$ is the $q$ -th component of vector $y$ , and $\frac{\partial}{\partial\gamma_{k,pq}}\Gamma_{k}^{-1}$ is component-wise derivative of matrix $\Gamma_{k}^{-1}$ with respect to $\gamma_{k,pq}$ . We also calculate $\left(\frac{\partial}{\partial\gamma_{k,pq}}\Gamma_{k}^{-1}\right)\Gamma_{k}=\delta_{k,pq}\Gamma_{k}^{-1}$ , where $\delta_{k,pq}$ is the matrix such that $(p,q)$ -th component is one, and other are zero. We further estimate (A.21) by

\begin{split}\left|\frac{\partial}{\partial\gamma_{k,pq}}\left(\frac{-|\Theta(y\mid\mu_{\ell;k},\Gamma_{\ell;k})|^{2}}{2}\right)\right|&=\left|\left<\Theta(y\mid\mu_{\ell;k},\Gamma_{\ell;k}),\delta_{k,pq}\Gamma_{k}^{-1}\left(|\Theta(y\mid\mu_{\ell;k},\Gamma_{\ell;k})|\right)\right>\right|\\ &\leq\sum_{i=1}^{m}\sum_{j=1}^{m}\left|\left[\delta_{k,pq}\Gamma_{k}^{-1}\right]_{ij}\right|\left|\left[|\Theta(y\mid\mu_{\ell;k},\Gamma_{\ell;k})|\right]_{i}\left[|\Theta(y\mid\mu_{\ell;k},\Gamma_{\ell;k})|\right]_{j}\right|,\end{split}

(A.22)

where $\left[\delta_{k,pq}\Gamma_{k}^{-1}\right]_{ij}$ is $(i,j)$ -th component of matrix $\delta_{k,pq}\Gamma_{k}^{-1}$ . By using inequality of $\frac{x}{a+x}\leq x^{\frac{1}{2}}$ for $x>1$ ( $a\geq 1$ ), and the arguments (A.19)–(A.22), we have

	$\displaystyle\left\|\frac{\partial}{\partial\gamma_{k,pq}}\left(H[q]-\widetilde{H}[q]\right)\right\|$
	$\displaystyle\leq\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}}\|\Gamma_{k}\|^{-1}\|\Gamma_{k,pq}\|\underbrace{\sqrt{\frac{\|\Gamma_{k}\|}{\|\Gamma_{k^{\prime}}\|}}\int_{\mathbb{R}^{m}}\frac{1}{\sqrt{(2\pi)^{m}}}\exp\left(\frac{-\|y\|^{2}-\|\Theta(y\mid\mu_{k;k^{\prime}},\Gamma_{k;k^{\prime}})\|^{2}}{4}\right)dy}_{\displaystyle\leq\frac{2}{(1-s)^{\frac{m}{4}}}\exp\left(-\frac{s\alpha_{k,k^{\prime}}^{2}}{4}\right)}$
	$\displaystyle\hskip 20.0pt+\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}}\sum_{i=1}^{m}\left\|\gamma_{k^{\prime},ip}^{-1}\right\|\underbrace{\sqrt{\frac{\|\Gamma_{k}\|}{\|\Gamma_{k^{\prime}}\|}}\int_{\mathbb{R}^{m}}\frac{\left[\Theta(y\mid\mu_{k;k^{\prime}},\Gamma_{k;k^{\prime}})\right]_{i}}{\sqrt{(2\pi)^{m}}}\exp\left(\frac{-\|y\|^{2}-\|\Theta(y\mid\mu_{k;k^{\prime}},\Gamma_{k;k^{\prime}})\|^{2}}{4}\right)dy}_{\displaystyle\leq\frac{2}{(1-s)^{\frac{m+2}{4}}}\exp\left(-\frac{s\alpha_{k,k^{\prime}}^{2}}{4}\right)}$
	$\displaystyle\hskip 20.0pt+\sum_{\ell\neq k}\sqrt{\pi_{k}\pi_{\ell}}\sum_{i=1}^{m}\|\Gamma_{k}\|^{-1}\|\Gamma_{k,pq}\|\underbrace{\sqrt{\frac{\|\Gamma_{\ell}\|}{\|\Gamma_{k}\|}}\int_{\mathbb{R}^{m}}\frac{1}{\sqrt{(2\pi)^{m}}}\exp\left(\frac{-\|y\|^{2}-\|\Theta(y\mid\mu_{\ell;k},\Gamma_{\ell;k})\|^{2}}{4}\right)dy}_{\displaystyle\leq\frac{2}{(1-s)^{\frac{m}{4}}}\exp\left(-\frac{s\alpha_{\ell,k}^{2}}{4}\right)}$
	$\displaystyle\hskip 20.0pt+\sum_{\ell\neq k}\sqrt{\pi_{k}\pi_{\ell}}\sum_{i=1}^{m}\sum_{j=1}^{m}\left\|\left[\delta_{k,pq}\Gamma_{k}^{-1}\right]_{ij}\right\|$
	$\displaystyle\hskip 40.0pt\times\underbrace{\sqrt{\frac{\|\Gamma_{\ell}\|}{\|\Gamma_{k}\|}}\int_{\mathbb{R}^{m}}\frac{\left\|\left[\Theta(y\mid\mu_{\ell;k},\Gamma_{\ell;k})\right]_{i}\left[\|\Theta(y\mid\mu_{\ell;k},\Gamma_{\ell;k})\|\right]_{j}\right\|}{\sqrt{(2\pi)^{m}}}\exp\left(\frac{-\|y\|^{2}-\|\Theta(y\mid\mu_{\ell;k},\Gamma_{\ell;k})\|^{2}}{4}\right)dy}_{\displaystyle\leq\frac{6}{(1-s)^{\frac{m+4}{4}}}\exp\left(-\frac{s\alpha_{\ell,k}^{2}}{4}\right)}$
	$\displaystyle\leq\frac{6}{(1-s)^{\frac{m+4}{4}}}\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}}\Biggl{[}\Biggl{\{}\|\Gamma_{k}\|^{-1}\|\Gamma_{k,pq}\|+\underbrace{\left(\sum_{i=1}^{m}\left\|\gamma_{k^{\prime},ip}^{-1}\right\|\right)}_{\displaystyle\leq\left\\|\Gamma_{k^{\prime}}^{-1}\right\\|_{1}}\Biggr{\}}\exp\left(-\frac{s\alpha_{k,k^{\prime}}^{2}}{4}\right)\hskip 110.0pt$
	$\displaystyle\hskip 80.0pt+\Biggl{\{}\|\Gamma_{k}\|^{-1}\|\Gamma_{k,pq}\|+\underbrace{\left(\sum_{i=1}^{m}\sum_{j=1}^{m}\left\|\left[\delta_{k,pq}\Gamma_{k}^{-1}\right]_{ij}\right\|\right)}_{\displaystyle\leq\left\\|\Gamma_{k}^{-1}\right\\|_{1}}\Biggr{\}}\exp\left(-\frac{s\alpha_{k^{\prime},k}^{2}}{4}\right)\Biggr{]},$

where the last inequality is given by the same arguments in the proof of Theorem A.2.

(iii) The derivatives of (A.10) with respect to $\pi_{k}$ are calculated as follows:

\begin{split}&\frac{\partial}{\partial\pi_{k}}\left(H[q]-\widetilde{H}[q]\right)\\ &=\int_{\mathbb{R}^{m}}\frac{1}{\sqrt{(2\pi)^{m}}}\exp\left(-\frac{|y|^{2}}{2}\right)\log\left(1+\sum_{k^{\prime}\neq k}\frac{\pi_{k^{\prime}}|\Gamma_{k}|}{\pi_{k}|\Gamma_{k^{\prime}}|}\exp\left(\frac{|y|^{2}-|\Theta(y\mid\mu_{k;k^{\prime}},\Gamma_{k;k^{\prime}})|^{2}}{2}\right)\right)dy\\ &\hskip 20.0pt+\pi_{k}\int_{\mathbb{R}^{m}}\frac{1}{\sqrt{(2\pi)^{m}}}\exp\left(-\frac{|y|^{2}}{2}\right)\frac{\displaystyle\sum_{k^{\prime}\neq k}\frac{-\pi_{k^{\prime}}|\Gamma_{k}|}{\pi_{k}^{2}|\Gamma_{k^{\prime}}|}\exp\left(\frac{|y|^{2}-|\Theta(y\mid\mu_{k;k^{\prime}},\Gamma_{k;k^{\prime}})|^{2}}{2}\right)}{\displaystyle 1+\sum_{k^{\prime}\neq k}\frac{\pi_{k^{\prime}}|\Gamma_{k}|}{\pi_{k}|\Gamma_{k^{\prime}}|}\exp\left(\frac{|y|^{2}-|\Theta(y\mid\mu_{k;k^{\prime}},\Gamma_{k;k^{\prime}})|^{2}}{2}\right)}dy\\ &\hskip 20.0pt+\sum_{\ell\neq k}\pi_{\ell}\int_{\mathbb{R}^{m}}\frac{1}{\sqrt{(2\pi)^{m}}}\exp\left(-\frac{|y|^{2}}{2}\right)\frac{\displaystyle\frac{|\Gamma_{\ell}|}{\pi_{\ell}|\Gamma_{k}|}\exp\left(\frac{|y|^{2}-|\Theta(y\mid\mu_{\ell;k},\Gamma_{\ell;k})|^{2}}{2}\right)}{\displaystyle 1+\sum_{k^{\prime}\neq\ell}\frac{\pi_{k^{\prime}}|\Gamma_{\ell}|}{\pi_{\ell}|\Gamma_{k^{\prime}}|}\exp\left(\frac{|y|^{2}-|\Theta(y\mid\mu_{\ell;k},\Gamma_{\ell;k})|^{2}}{2}\right)}dy.\\ \end{split}

Using inequalities of $\log(1+x)\leq\sqrt{x}$ for $x\geq 0$ and $\frac{x}{a+x}\leq x^{\frac{1}{2}}$ for $x>1$ when $a\geq 1$ , we estimate

	$\displaystyle\left\|\frac{\partial}{\partial\pi_{k}}\left(H[q]-\widetilde{H}[q]\right)\right\|$	$\displaystyle\leq\sum_{k^{\prime}\neq k}\sqrt{\frac{\pi_{k^{\prime}}}{\pi_{k}}}\underbrace{\sqrt{\frac{\|\Gamma_{k}\|}{\|\Gamma_{k^{\prime}}\|}}\int_{\mathbb{R}^{m}}\frac{1}{\sqrt{(2\pi)^{m}}}\exp\left(\frac{-\|y\|^{2}-\|\Theta(y\mid\mu_{k;k^{\prime}},\Gamma_{k;k^{\prime}})\|^{2}}{4}\right)dy}_{\displaystyle\leq\frac{2}{(1-s)^{\frac{m}{4}}}\exp\left(-\frac{s\alpha_{k,k^{\prime}}^{2}}{4}\right)}$
		$\displaystyle\hskip 20.0pt+\sum_{k^{\prime}\neq k}\sqrt{\frac{\pi_{k^{\prime}}}{\pi_{k}}}\underbrace{\sqrt{\frac{\|\Gamma_{k}\|}{\|\Gamma_{k^{\prime}}\|}}\int_{\mathbb{R}^{m}}\frac{1}{\sqrt{(2\pi)^{m}}}\exp\left(\frac{-\|y\|^{2}-\|\Theta(y\mid\mu_{k;k^{\prime}},\Gamma_{k;k^{\prime}})\|^{2}}{4}\right)dy}_{\displaystyle\leq\frac{2}{(1-s)^{\frac{m}{4}}}\exp\left(-\frac{s\alpha_{k,k^{\prime}}^{2}}{4}\right)}$
		$\displaystyle\hskip 20.0pt+\sum_{\ell\neq k}\sqrt{\frac{\pi_{\ell}}{\pi_{k}}}\underbrace{\sqrt{\frac{\|\Gamma_{\ell}\|}{\|\Gamma_{k}\|}}\int_{\mathbb{R}^{m}}\frac{1}{\sqrt{(2\pi)^{m}}}\exp\left(\frac{-\|y\|^{2}-\|\Theta(y\mid\mu_{\ell;k},\Gamma_{\ell;k})\|^{2}}{4}\right)dy}_{\displaystyle\leq\frac{2}{(1-s)^{\frac{m}{4}}}\exp\left(-\frac{s\alpha_{\ell,k}^{2}}{4}\right)}$
		$\displaystyle\leq\frac{4}{(1-s)^{\frac{m}{4}}}\sum_{k^{\prime}\neq k}\sqrt{\frac{\pi_{k^{\prime}}}{\pi_{k}}}\Biggl{\{}\exp\left(-\frac{s\alpha_{k,k^{\prime}}^{2}}{4}\right)+\exp\left(-\frac{s\alpha_{k^{\prime},k}^{2}}{4}\right)\Biggr{\}},$

where the last inequality is given by the same arguments in the proof of Theorem A.2. The proof of Theorem 4.8 is finished. ∎

A.4 Proof of Proposition 4.9

Lemma A.9.

Let $m\geq K\geq 2$ . Then

H[q]=\widetilde{H}[q]-\sum_{k=1}^{K}\frac{\pi_{k}}{(2\pi)^{\frac{K-1}{2}}}\int_{\mathbb{R}^{K-1}}\exp\left(-\frac{|v|^{2}}{2}\right)\log\left(1+\sum_{k^{\prime}\neq k}\frac{\pi_{k^{\prime}}}{\pi_{k}}\exp\left(\frac{|v|^{2}-\left|v-u_{k^{\prime},k}\right|^{2}}{2}\right)\right)dv,

(A.23)

where $u_{k^{\prime},k}\coloneqq[R_{k}\Sigma^{-1/2}(\mu_{k^{\prime}}-\mu_{k})]_{1:K-1}\in\mathbb{R}^{K-1}$ and $R_{k}\in\mathbb{R}^{m\times m}$ is some rotation matrix such that

R_{k}\Sigma^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\in\mathrm{span}\{e_{1},\cdots,e_{K-1}\},\,\,\,k^{\prime}\in[K].

(A.24)

Here, $\{e_{i}\}_{i=1}^{K-1}$ is the standard basis in $\mathbb{R}^{K-1}$ , and $u_{1:K-1}\coloneqq(u_{1},\ldots,u_{K-1})^{T}\in\mathbb{R}^{K-1}$ for $u=(u_{1},\ldots,u_{m})^{T}\in\mathbb{R}^{m}$ .

Proof.

First, we observe that

H[q]=\widetilde{H}[q]-\underbrace{\sum_{k=1}^{K}\pi_{k}\int_{\mathbb{R}^{m}}\mathcal{N}(x|\mu_{k},\Sigma_{k})\left\{\log\left(\sum_{k^{\prime}=1}^{K}\pi_{k^{\prime}}\mathcal{N}(x|\mu_{k^{\prime}},\Sigma_{k})\right)-\log\left(\pi_{k}\mathcal{N}(x|\mu_{k},\Sigma_{k})\right)\right\}dx}_{\displaystyle\eqqcolon(\clubsuit)}.

Using (LABEL:computation1) with the assumption (4.9), we write

\begin{split}(\clubsuit)&=\sum_{k=1}^{K}\pi_{k}\int_{\mathbb{R}^{m}}\frac{1}{\sqrt{(2\pi)^{m}}}\exp\left(-\frac{|y|^{2}}{2}\right)\log\left(1+\sum_{k^{\prime}\neq k}\frac{\pi_{k^{\prime}}}{\pi_{k}}\exp\left(\frac{|y|^{2}-\left|\left(y-\Sigma^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right)\right|^{2}}{2}\right)\right)dy.\end{split}

We choose the rotation matrix $R_{k}\in\mathbb{R}^{m\times m}$ satisfying (A.24) for each $k\in[K]$ . Making the change of variables as $z=R_{k}y$ , we write

\begin{split}(\clubsuit)&=\sum_{k=1}^{K}\pi_{k}\int_{\mathbb{R}^{m}}\frac{1}{\sqrt{(2\pi)^{m}}}\exp\left(-\frac{|z|^{2}}{2}\right)\log\bBigg@{3.5}(1+\sum_{k^{\prime}\neq k}\frac{\pi_{k^{\prime}}}{\pi_{k}}\underbrace{\exp\left(\frac{|z|^{2}-\left|\left(R^{T}z-\Sigma^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right)\right|^{2}}{2}\right)}_{\displaystyle=\exp\left(\frac{|z|^{2}-\left|z-u_{k^{\prime},k}\right|^{2}}{2}\right)}\bBigg@{3.5})dy,\end{split}

where $u_{k^{\prime},k}=[R_{k}\Sigma^{-1/2}(\mu_{k^{\prime}}-\mu_{k})]_{1:K-1}\in\mathbb{R}^{K-1}$ , that is,

R_{k}\Sigma^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})=(u_{k^{\prime},k},0,\cdots,0)^{T}.

We change the variables as

z_{1}=v_{1},\,\,\ldots,\,\,z_{K-1}=v_{K-1},\,\,z_{K}=r\cos\theta_{K},\,\,z_{K+1}=r\sin\theta_{K}\cos\theta_{K+1},\,\,\ldots

z_{m-1}=r\sin\theta_{K}\cdots\sin\theta_{m-2}\cos\theta_{m-1},\,\,z_{m}=r\sin\theta_{K}\cdots\sin\theta_{m-2}\sin\theta_{m-1},

where $-\infty<v_{i}<\infty$ ( $i=1,\ldots,K-1$ ), $r>0$ , $0<\theta_{j}<\pi$ ( $j=K,\ldots,m-2$ ), and $0<\theta_{m-1}<2\pi$ . Because

|z|^{2}=|v|^{2}+r^{2},\,\,\,|z-u_{k^{\prime},k}|^{2}=|v-u_{k^{\prime},k}|^{2}+r^{2},

dz_{1}\cdots dz_{m}=r^{m-K}\prod_{i=K}^{m-1}(\sin\theta_{i})^{m-i-1}dv_{1}\cdots dv_{K-1}\,dr\,d\theta_{K}\cdots d\theta_{m-1},

we obtain

\begin{split}(\clubsuit)&=\sum_{k=1}^{K}\frac{\pi_{k}}{\sqrt{(2\pi)^{m}}}\int_{0}^{2\pi}d\theta_{m-1}\int_{0}^{\pi}d\theta_{m-2}\,\cdots\int_{0}^{\pi}d\theta_{K}\int_{0}^{\infty}dr\int_{-\infty}^{\infty}dv_{K-1}\,\cdots\int_{-\infty}^{\infty}dv_{1}\\ &\qquad\times\exp\left(-\frac{|v|^{2}+r^{2}}{2}\right)\log\Biggl{(}1+\sum_{k^{\prime}\neq k}\frac{\pi_{k^{\prime}}}{\pi_{k}}\exp\left(\frac{|v|^{2}-\left|v-u_{k^{\prime},k}\right|^{2}}{2}\right)\Biggr{)}r^{m-K}\prod_{i=K}^{m-1}(\sin\theta_{i})^{m-i-1}.\end{split}

(A.25)

By the equality

\begin{split}&\frac{1}{\sqrt{(2\pi)^{m}}}\int_{0}^{2\pi}d\theta_{m-1}\int_{0}^{\pi}d\theta_{m-2}\,\cdots\int_{0}^{\pi}d\theta_{K}\int_{0}^{\infty}dr\ \exp\left(-\frac{r^{2}}{2}\right)r^{m-K}\prod_{i=K}^{m-1}(\sin\theta_{i})^{m-i-1}=\frac{1}{(2\pi)^{\frac{K-1}{2}}},\end{split}

we conclude (A.23) from (A.25). The proof of Proposition 4.9 is finished. ∎

Appendix B Details of Section 5

We give a detailed explanation for the relative error experiment in Section 5. We restricted the setting of the experiment to the case for the coincident covariance matrices (4.9) and the number of mixture components $K=2$ . Furthermore, we assumed that $\Sigma=I$ , $\mu_{1}=0$ , and $\mu_{2}\sim\mathcal{N}(0,(2c)^{2}I)$ . In this setting, we varied the dimension $m$ of Gaussian distributions from 1 to 500 for certain parameters $(c,\pi_{k})$ , where we sampled $\mu_{2}$ 10 times for each dimension $m$ . As formulas for the entropy approximation, we employed $\widetilde{H}_{\rm ours}[q]$ , $\widetilde{H}_{{\rm Huber}(R)}[q]$ , and $\widetilde{H}_{\rm Bonilla}[q]$ for the method of ours, Huber et al. [2008], and Bonilla et al. [2019], respectively, as follows:

$\displaystyle\widetilde{H}_{\rm ours}[q]$	$\displaystyle=\frac{m}{2}+\frac{m}{2}\log 2\pi+\frac{1}{2}\sum_{k=1}^{K}\pi_{k}\log\|\Sigma_{k}\|-\sum_{k=1}^{K}\pi_{k}\log\pi_{k},$	(B.1)
$\displaystyle\begin{split}\widetilde{H}_{{\rm Huber}(R)}[q]&=-\sum_{k=1}^{K}\pi_{k}\int\mathcal{N}(w\|\mu_{k},\Sigma_{k})\sum_{i=0}^{R}\frac{1}{i!}\left((w-\mu_{k})\odot\nabla_{\widetilde{w}}\right)^{i}\log\left(\sum_{k^{\prime}=1}^{K}\pi_{k^{\prime}}\mathcal{N}(\widetilde{w}\|\mu_{k^{\prime}},\Sigma_{k^{\prime}})\right)\Biggl{\|}_{\widetilde{w}=\mu_{k}}dw,\end{split}$		(B.2)
$\displaystyle\widetilde{H}_{\rm Bonilla}[q]$	$\displaystyle=-\sum_{k=1}^{K}\pi_{k}\log\left(\sum_{k^{\prime}=1}^{K}\mathcal{N}(\mu_{k}\|\mu_{k^{\prime}},\Sigma_{k}+\Sigma_{k^{\prime}})\right),$	(B.3)

where (B.1) is the same as described in Section 3, (B.2) is based on the Taylor expansion [Huber et al., 2008, (4)], and (B.3) is based on the lower bound analysis [Bonilla et al., 2019, (14)]. In the case for the coincident and diagonal covariance matrices $\Sigma_{k}=\mathrm{diag}(\sigma_{1}^{2},\ldots,\sigma_{m}^{2})$ , (B.2) for $R=0$ or $2$ can be analytically computed as

	$\displaystyle\widetilde{H}_{\rm Huber(0)}[q]$	$\displaystyle=-\sum_{k=1}^{K}\pi_{k}\log\left(\sum_{k^{\prime}=1}^{K}\pi_{k^{\prime}}\mathcal{N}(\mu_{k}\|\mu_{k^{\prime}},\Sigma_{k^{\prime}})\right),$
	$\displaystyle\widetilde{H}_{\rm Huber(2)}[q]$	$\displaystyle=-\sum_{k=1}^{K}\pi_{k}\left[\log\left(\sum_{k^{\prime}=1}^{K}\pi_{k^{\prime}}\mathcal{N}(\mu_{k}\|\mu_{k^{\prime}},\Sigma_{k^{\prime}})\right)+\frac{1}{2}\sum_{i=1}^{m}\sigma^{2}_{i}C_{k,i}\right],$

where

	$\displaystyle C_{k,i}$	$\displaystyle\coloneqq\frac{g_{0,k}g_{2,k,i}-g_{1,k,i}^{2}}{g_{0,k}^{2}},\quad g_{0,k}\coloneqq\sum_{k^{\prime}=1}^{K}\pi_{k^{\prime}}\mathcal{N}(\mu_{k}\|\mu_{k^{\prime}},\Sigma_{k^{\prime}}),$
	$\displaystyle g_{1,k,i}$	$\displaystyle\coloneqq\sum_{k^{\prime}=1}^{K}\pi_{k^{\prime}}\frac{\mu_{k,i}-\mu_{k^{\prime},i}}{\sigma_{i}^{2}}\mathcal{N}(\mu_{k}\|\mu_{k^{\prime}},\Sigma_{k^{\prime}}),$
	$\displaystyle g_{2,k,i}$	$\displaystyle\coloneqq\sum_{k^{\prime}=1}^{K}\pi_{k^{\prime}}\left[\left(\frac{\mu_{k,i}-\mu_{k^{\prime},i}}{\sigma_{i}^{2}}\right)^{2}-\frac{1}{\sigma_{i}^{2}}\right]\mathcal{N}(\mu_{k}\|\mu_{k^{\prime}},\Sigma_{k^{\prime}}).$

In the following, we show the tractable formula of the true entropy, which is used in the experiment in Section 5. In the case for the coincident covariance matrices and $K=2$ , we can reduce the integral in (4.10) to a one-dimensional Gaussian integral as follows:

\displaystyle H[q]=\widetilde{H}[q]-\sum_{k\neq k\prime}\frac{\pi_{k}}{\sqrt{2\pi}}\int_{\mathbb{R}}\exp\left(-\frac{|v|^{2}}{2}\right)\log\left(1+\frac{\pi_{k^{\prime}}}{\pi_{k}}\exp\left(\frac{|v|^{2}-\left|v-u_{k^{\prime},k}\right|^{2}}{2}\right)\right)dv.

Furthermore, we can choose the rotation matrix $R_{k}$ in Proposition 4.9 such that

\displaystyle u_{k^{\prime},k}=\left[R_{k}\Sigma^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right]_{1:1}\geq 0,

that is,

\displaystyle u_{k^{\prime},k}=\left|u_{k^{\prime},k}\right|=\left|\Sigma^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right|=2|a|,\quad a\coloneqq\frac{\Sigma^{-\frac{1}{2}}(\mu_{1}-\mu_{2})}{2}.

Hence, we have

\displaystyle|v|^{2}-|v-u_{k^{\prime},k}|^{2}=-u_{k^{\prime},k}^{2}+2vu_{k^{\prime},k}=-4|a|^{2}+4v|a|.

Therefore, by making the change of variables as $v=\sqrt{2}\,t$ , we conclude that

	$\displaystyle H[q]$	$\displaystyle=\widetilde{H}[q]-\sum_{k=1}^{2}\frac{\pi_{k}}{\sqrt{2\pi}}\int_{\mathbb{R}}\exp\left(-\frac{\|v\|^{2}}{2}\right)\log\left(1+\frac{\pi_{k^{\prime}}}{\pi_{k}}\exp\left(-2\|a\|^{2}+2v\|a\|\right)\right)dv$
		$\displaystyle=\widetilde{H}[q]-\sum_{k=1}^{2}\frac{\pi_{k}}{\sqrt{\pi}}\int_{\mathbb{R}}\exp(-t^{2})\log\left(1+\frac{\pi_{k^{\prime}}}{\pi_{k}}\exp\left(-2\|a\|^{2}+2\sqrt{2}\|a\|t\right)\right)dt.$		(B.4)

Note that the integration of (B.4) can be efficiently executed using the Gauss-Hermite quadrature.

Appendix C Application example: Variational inference to BNNs with Gaussian mixtures

C.1 Overview: Variational inference with Gaussian mixtures

Let $f(\,\cdot\ ;w)$ be the base model that is a function parameterized by weights $w\in\mathbb{R}^{m}$ , e.g., the neural network. Let $p(w)$ and $p(y|f(x;w))$ be the prior distribution of the weights and the likelihood of the model, respectively. For supervised learning, let $D=\{(x_{n},y_{n})\}_{n=1}^{N}$ be a dataset where $x_{n}\in\mathbb{R}^{d_{x}}$ and $y_{n}\in\mathbb{R}^{d_{y}}$ are the input and output, respectively, and the input-output pair $(x_{n},y_{n})$ is independently identically distributed. The Bayesian posterior distribution $p(w|D)$ is formulated as

p(w|D)\propto p(w)\prod_{n=1}^{N}p(y_{n}|f(x_{n};w)).

The goal of variational inference is to minimize the Kullback-Leibler (KL) divergence between a variational family $q_{\theta}(w)$ and posterior distribution $p(w|D)$ given by

D_{\text{KL}}(q_{\theta}(w)\,||\,p(w|D))\coloneqq-\int q_{\theta}(w)\log\left(\frac{p(w|D)}{q_{\theta}(w)}\right)dw,

which is equivalent to maximizing the evidence lower bound (ELBO) given by

\displaystyle\mathcal{L}(\theta)\coloneqq L(\theta)+\int q_{\theta}(w)\log(p(w))\,dw+H[q_{\theta}],

(C.1)

(see, e.g., Barber and Bishop [1998], Bishop [2006], Hinton and Van Camp [1993]). The first term of (C.1) is the expected log-likelihood given by

L(\theta)\coloneqq\sum_{n=1}^{N}E_{q_{\theta}(w)}[\log p(y_{n}|f(x_{n};w))],

the second term is the cross-entropy between a variational family $q(w)$ and prior distribution $p(w)$ , and the third term is the entropy of $q(w)$ given by

\displaystyle H[q_{\theta}]\coloneqq-\int q_{\theta}(w)\log(q_{\theta}(w))\,dw.

Here, we choose a unimodal Gaussian distribution as a prior, that is, $p(w)=\mathcal{N}(w|\mu_{0},\Sigma_{0})$ , and we choose a Gaussian mixture distribution as a variational family, that is,

q_{\theta}(w)=\sum_{k=1}^{K}\pi_{k}\mathcal{N}(w|\mu_{k},\Sigma_{k}),\ \ \theta=(\pi_{k},\mu_{k},\Sigma_{k})_{k=1}^{K},

where $K\in\mathbb{N}$ is the number of mixture components, and $\pi_{k}\in(0,1]$ are mixing coefficients constrained by $\sum_{k=1}^{K}\pi_{k}=1$ . Here, $\mathcal{N}(w|\mu_{k},\Sigma_{k})$ is the Gaussian distribution with a mean $\mu_{k}\in\mathbb{R}^{m}$ and covariance matrix $\Sigma_{k}\in\mathbb{R}^{m\times m}$ , that is,

\mathcal{N}(w|\mu_{k},\Sigma_{k})=\frac{1}{\sqrt{(2\pi)^{m}|\Sigma_{k}|}}\exp\left(-\frac{1}{2}\left\|w-\mu_{k}\right\|_{\Sigma_{k}}^{2}\right),

In the following, we investigate the ingredients in (C.1). The expected log-likelihood $L(\theta)$ is analytically intractable due to the nonlinearity of the function $f(x;w)$ . To overcome this difficulty, we follow the stochastic gradient variational Bayes (SGVB) method [Kingma and Welling, 2013, Kingma et al., 2015, Rezende et al., 2014], which employs the reparametric trick and minibatch-based Monte Carlo sampling. Let $S\subset D$ be a minibatch set with minibatch size $M$ . By reparameterizing weights as $w=\Sigma_{k}^{1/2}\varepsilon+\mu_{k}$ , we can rewrite the expected log-likelihood $L(\theta)$ as

\begin{split}L(\theta)&=\sum_{n=1}^{N}\sum_{k=1}^{K}\pi_{k}\int\mathcal{N}(w|\mu_{k},\Sigma_{k})\log p(y_{n}|f(x_{n};w))\,dw\\ &=\sum_{n=1}^{N}\sum_{k=1}^{K}\pi_{k}\int\mathcal{N}(\varepsilon|0,I)\log p(y_{n}|f(x_{n};\Sigma_{k}^{\frac{1}{2}}\varepsilon+\mu_{k}))\,d\varepsilon.\end{split}

By minibatch-based Monte Carlo sampling, we obtain the following unbiased estimator $\widehat{L}^{\rm{SGVB}}(\theta)$ of the expected log-likelihood $L(\theta)$ as

\begin{split}L(\theta)&\approx\widehat{L}^{\rm{SGVB}}(\theta)\\ &\coloneqq\sum_{k=1}^{K}\pi_{k}\frac{N}{M}\sum_{i\in S}\log p(y_{i}|f(x_{i};\Sigma_{k}^{\frac{1}{2}}\varepsilon+\mu_{k})),\end{split}

(C.2)

where we employ noise sampling $\varepsilon\sim\mathcal{N}(0,I)$ once per mini-batch sampling (Kingma et al. [2015], Titsias and Lázaro-Gredilla [2014]). On the other hand, the cross-entropy between a Gaussian mixture and unimodal Gaussian distribution can be analytically computed as

\begin{split}\int q_{\theta}(w)\log(p(w))\,dw=-\sum_{k=1}^{K}\frac{\pi_{k}}{2}\Biggl{\{}m\log 2\pi+\log|\Sigma_{0}|+\mathrm{Tr}(\Sigma_{0}^{-1}\Sigma_{k})+\left\|\mu_{k}-\mu_{0}\right\|_{\Sigma_{0}}^{2}\Biggr{\}}.\end{split}

(C.3)

For the entropy term $H[q_{\theta}]$ we employ the approximate entropy (3.2), that is,

\begin{split}H[q]&\approx\widetilde{H}[q_{\theta}]\\ &=\,-\sum_{k=1}^{K}\pi_{k}\int\mathcal{N}(w|\mu_{k},\Sigma_{k})\log\left(\pi_{k}\mathcal{N}(w|\mu_{k},\Sigma_{k})\right)dw\\ &=\,\frac{m}{2}+\frac{m}{2}\log 2\pi+\frac{1}{2}\sum_{k=1}^{K}\pi_{k}\log|\Sigma_{k}|-\sum_{k=1}^{K}\pi_{k}\log\pi_{k}.\end{split}

(C.4)

In summary, we approximate the ELBO $\mathcal{L}(\theta)$ using (C.2), (C.3), and (C.4) as

\begin{split}\mathcal{L}(\theta)&\approx\widehat{\mathcal{L}}(\theta)\\ &\coloneqq\,\widehat{L}^{\rm{SGVB}}(\theta)+\int q_{\theta}(w)\log(p(w))\,dw+\widetilde{H}[q_{\theta}]\\ &=\,\sum_{k=1}^{K}\pi_{k}\left(\widehat{\mathcal{L}}(\mu_{k},\Sigma_{k})-\log\pi_{k}\right),\end{split}

(C.5)

where $\widehat{\mathcal{L}}(\mu_{k},\Sigma_{k})$ are ELBOs of the unimodal Gaussian distributions $\mathcal{N}(w|\mu_{k},\Sigma_{k})$ given by

\begin{split}\widehat{\mathcal{L}}(\mu_{k},\Sigma_{k})\coloneqq&\frac{N}{M}\sum_{i\in S}\log p(y_{i}|f(x_{i};\Sigma_{k}^{\frac{1}{2}}\varepsilon_{S}+\mu_{k}))-\frac{1}{2}\Biggl{\{}m\log 2\pi+\log|\Sigma_{0}|+\mathrm{Tr}(\Sigma_{0}^{-1}\Sigma_{k})+\left\|\mu_{k}-\mu_{0}\right\|_{\Sigma_{0}}^{2}\Biggr{\}}\\ &\quad+\frac{m}{2}(1+\log 2\pi)+\frac{1}{2}\log|\Sigma_{k}|.\end{split}

C.2 Experiment of BNN with Gaussian mixtures on toy task

We employed the approximate entropy (3.2) for variational inference to a BNN whose posterior was modeled by the Gaussian mixture, which we call BNN-GM in the following. We conducted the toy $1$ D regression task [He et al., 2020] to observe the uncertainty estimation capability of the BNN-GM. In particular, we observed that the BNN-GM could capture larger uncertainty than the deep ensemble [Lakshminarayanan et al., 2017]. The task was to learn a curve $y=x\sin(x)$ from a training dataset that consisted of $20$ points sampled from a noised curve $y=x\sin(x)+\varepsilon$ , $\varepsilon\sim\mathcal{N}(0,0.1^{2})$ . Refer to Appendix C.3 for the detail of implementations. We compared the BNN-GM with the deep ensemble of DNNs, the BNN with the single unimodal Gaussian, and the deep ensemble of BNNs with the single unimodal Gaussian, see Figure 4.

From the result in Figure 4, we can observe the following. First, every method can represent uncertainty on the area where train data do not exist. However, the BNN with a single unimodal Gaussian can represent smaller uncertainty than other methods (see around $x=0$ ). Second, as increasing the number of components, the BNN-GM can represent larger uncertainty than the deep ensemble of DNNs or BNNs. Therefore, there is a qualitative difference in uncertainty estimation between BNN-GMs and deep ensembles. Finally, the BNN-GM of 10 components has weak learners with small mixture coefficients. We suppose that this phenomenon is caused by the entropy regularization for the Gaussian mixture. Note that we do not claim the superiority of BNN-GMs to deep ensembles.

C.3 Details of experiment

We give a detailed explanation for the toy $1$ D regression experiment in Appendix C.2. The task is to learn a curve $y=x\sin(x)$ from a training dataset that consists of $20$ points sampled from the noised curve $y=x\sin(x)+\varepsilon$ , $\varepsilon\sim\mathcal{N}(0,0.1^{2})$ , see Figure 4. To obtain the regression model of the curve, we used the neural network model as the base model that had two hidden layers and $8$ hidden units in each layer with erf activation. Regarding the Bayes inference for the BNN-GM, we modeled the prior as $\mathcal{N}(0,\sigma_{w})$ and the variational family as the Gaussian mixture. Furthermore, we chose the likelihood function as the Gaussian distribution:

\displaystyle p(y|f(x;w))=\mathcal{N}(f(x;w),\sigma_{y}).

Then, we performed the SGVB method based on the proposed ELBO (C.5), where the batch size was equal to the dataset size. Hyperparameters were as follows: epochs $=100$ , learning rate $=0.05$ , $\sigma_{w}=10^{6}$ , $\sigma_{y}=10^{-2}$ .

	$\displaystyle\mathrm{(i)}$	$\displaystyle\quad\left\|\frac{\partial}{\partial\mu_{k,p}}\left(H[q]-\widetilde{H}[q]\right)\right\|\leq\frac{2}{(1-s)^{\frac{m+2}{4}}}\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}}\left(\left\\|\Gamma_{k^{\prime}}^{-1}\right\\|_{1}+\left\\|\Gamma_{k}^{-1}\right\\|_{1}\right)\exp\left(-\frac{s\alpha_{k,k^{\prime}}^{2}}{4}\right),$
	$\displaystyle\mathrm{(ii)}$	$\displaystyle\quad\left\|\frac{\partial}{\partial\gamma_{k,pq}}\left(H[q]-\widetilde{H}[q]\right)\right\|$
		$\displaystyle\quad\leq\frac{6}{(1-s)^{\frac{m+4}{4}}}\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}}\left(2\|\Gamma_{k}\|^{-1}\|\Gamma_{k,pq}\|+\left\\|\Gamma_{k}^{-1}\right\\|_{1}+\left\\|\Gamma_{k^{\prime}}^{-1}\right\\|_{1}\right)\exp\left(-\frac{s\alpha_{k,k^{\prime}}^{2}}{4}\right)$
		$\displaystyle\quad\text{for}\ \gamma_{k,pq}\in\mathbb{R}\ \text{satisfying}\ \\|\Gamma_{k}^{-1}\\|_{1}<\infty,$
	$\displaystyle\mathrm{(iii)}$	$\displaystyle\quad\left\|\frac{\partial}{\partial\pi_{k}}\left(H[q]-\widetilde{H}[q]\right)\right\|\leq\frac{8}{(1-s)^{\frac{m}{4}}}\sum_{k^{\prime}\neq k}\sqrt{\frac{\pi_{k^{\prime}}}{\pi_{k}}}\exp\left(-\frac{s\alpha_{k,k^{\prime}}^{2}}{4}\right),$

	$\displaystyle\widetilde{H}[q]-H[q]$
	$\displaystyle=\sum_{k=1}^{K}\pi_{k}\int_{\mathbb{R}^{m}}\mathcal{N}(x\|\mu_{k},\Sigma_{k})\left\{\log\left(\sum_{k^{\prime}=1}^{K}\pi_{k^{\prime}}\mathcal{N}(x\|\mu_{k^{\prime}},\Sigma_{k})\right)-\log\left(\pi_{k}\mathcal{N}(x\|\mu_{k},\Sigma_{k})\right)\right\}dx\hskip 30.0pt$
	$\displaystyle=\sum_{k=1}^{K}\pi_{k}\int_{\mathbb{R}^{m}}\frac{1}{\sqrt{(2\pi)^{m}\|\Sigma_{k}\|}}\exp\left(-\frac{\left\\|x-\mu_{k}\right\\|^{2}_{\Sigma_{k}}}{2}\right)$
	$\displaystyle\hskip 60.0pt\times\log\left(1+\sum_{k^{\prime}\neq k}\frac{\pi_{k^{\prime}}\|\Sigma_{k}\|^{\frac{1}{2}}}{\pi_{k}\|\Sigma_{k^{\prime}}\|^{\frac{1}{2}}}\exp\left(\frac{\left\\|x-\mu_{k}\right\\|^{2}_{\Sigma_{k}}-\left\\|x-\mu_{k^{\prime}}\right\\|^{2}_{\Sigma_{k^{\prime}}}}{2}\right)\right)dx$

$\displaystyle\left\|H[q]-\widetilde{H}[q]\right\|$
$\displaystyle\leq\sum_{k=1}^{K}\pi_{k}\int_{\mathbb{R}^{m}}\frac{1}{\sqrt{(2\pi)^{m}}}\exp\left(-\frac{\|y\|^{2}}{2}\right)\sqrt{\sum_{k^{\prime}\neq k}\frac{\pi_{k^{\prime}}\|\Sigma_{k}\|^{\frac{1}{2}}}{\pi_{k}\|\Sigma_{k^{\prime}}\|^{\frac{1}{2}}}\exp\left(\frac{\|y\|^{2}-\left\\|\Sigma_{k}^{\frac{1}{2}}\left(y-\Sigma_{k}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right)\right\\|^{2}_{\Sigma_{k^{\prime}}}}{2}\right)}\ dy\hskip 65.0pt$		(A.3)
$\displaystyle=\sum_{k=1}^{K}\int_{\mathbb{R}^{m}}\frac{1}{(2\pi)^{\frac{m}{4}}}\exp\left(-\frac{\|y\|^{2}}{4}\right)\sqrt{\frac{1}{(2\pi)^{\frac{m}{2}}}\sum_{k^{\prime}\neq k}\pi_{k}\pi_{k^{\prime}}\frac{\|\Sigma_{k}\|^{\frac{1}{2}}}{\|\Sigma_{k^{\prime}}\|^{\frac{1}{2}}}\exp\left(\frac{-\left\\|\Sigma_{k}^{\frac{1}{2}}\left(y-\Sigma_{k}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right)\right\\|^{2}_{\Sigma_{k^{\prime}}}}{2}\right)}\ dy$	$\displaystyle\leq\sum_{k=1}^{K}\Biggl{(}\int_{\mathbb{R}^{m}}\frac{1}{(2\pi)^{\frac{m}{2}}}\exp\left(-\frac{\|y\|^{2}}{2}\right)dy\Biggr{)}^{\frac{1}{2}}$
$\displaystyle\hskip 60.0pt\times\underbrace{\left(\int_{\mathbb{R}^{m}}\frac{1}{(2\pi)^{\frac{m}{2}}}\sum_{k^{\prime}\neq k}\pi_{k}\pi_{k^{\prime}}\frac{\|\Sigma_{k}\|^{\frac{1}{2}}}{\|\Sigma_{k^{\prime}}\|^{\frac{1}{2}}}\exp\left(\frac{-\left\\|\Sigma_{k}^{\frac{1}{2}}\left(y-\Sigma_{k}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right)\right\\|^{2}_{\Sigma_{k^{\prime}}}}{2}\right)dy\right)^{\frac{1}{2}}}_{\displaystyle=\left(\left(\sum_{k^{\prime}\neq k}\pi_{k}\pi_{k^{\prime}}\right)\int_{\mathbb{R}^{m}}\frac{1}{(2\pi)^{\frac{m}{2}}}\exp\left(-\frac{\|z\|^{2}}{2}\right)dz\right)^{\frac{1}{2}}}$
$\displaystyle=\sum_{k=1}^{K}\left(\sum_{k^{\prime}\neq k}\pi_{k}\pi_{k^{\prime}}\right)^{\frac{1}{2}}=\sum_{k=1}^{K}\sqrt{\pi_{k}(1-\pi_{k})}\leq\sum_{k=1}^{K}\frac{\pi_{k}+(1-\pi_{k})}{2}\leq\frac{K}{2}.$		(A.4)

	$\displaystyle\left\\|\Sigma_{k}^{\frac{1}{2}}\left(y-\Sigma_{k}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right)\right\\|_{\Sigma_{k^{\prime}}}$	$\displaystyle=\left\|\Sigma_{k^{\prime}}^{-\frac{1}{2}}\Sigma_{k}^{\frac{1}{2}}\left(y-\Sigma_{k}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right)\right\|$
		$\displaystyle\geq\frac{\left\|y-\Sigma_{k}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right\|}{\left\\|\left(\Sigma_{k^{\prime}}^{-\frac{1}{2}}\Sigma_{k}^{\frac{1}{2}}\right)^{-1}\right\\|_{\rm op}}\geq\frac{\left\|\Sigma_{k}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right\|-\|y\|}{\left\\|\Sigma_{k}^{-\frac{1}{2}}\Sigma_{k^{\prime}}^{\frac{1}{2}}\right\\|_{\rm op}}>\alpha_{k,k^{\prime}}.$		(A.6)

	$\displaystyle D^{i}$	$\displaystyle=\sum_{k=1}^{K}\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}\frac{\|\Sigma_{k}\|^{\frac{1}{2}}}{\|\Sigma_{k^{\prime}}\|^{\frac{1}{2}}}}\int_{\|y\|<\alpha_{k,k^{\prime}}}\frac{1}{\sqrt{(2\pi)^{m}}}$
		$\displaystyle\qquad\times\exp\left(-\frac{\|y\|^{2}}{4}-\frac{s\left\\|\Sigma_{k}^{\frac{1}{2}}\left(y-\Sigma_{k}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right)\right\\|^{2}_{\Sigma_{k^{\prime}}}}{4}-\frac{(1-s)\left\\|\Sigma_{k}^{\frac{1}{2}}\left(y-\Sigma_{k}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right)\right\\|^{2}_{\Sigma_{k^{\prime}}}}{4}\right)dy$
		$\displaystyle\leq\sum_{k=1}^{K}\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}}\exp\left(-\frac{s\alpha^{2}_{k,k^{\prime}}}{4}\right)\int_{\|y\|<\alpha_{k,k^{\prime}}}\frac{1}{(2\pi)^{\frac{m}{4}}}$
		$\displaystyle\hskip 60.0pt\times\exp\left(-\frac{\|y\|^{2}}{4}\right)\sqrt{\frac{\|\Sigma_{k}\|^{\frac{1}{2}}}{\|\Sigma_{k^{\prime}}\|^{\frac{1}{2}}}}\frac{1}{(2\pi)^{\frac{m}{4}}}\exp\left(-\frac{(1-s)\left\\|\Sigma_{k}^{\frac{1}{2}}\left(y-\Sigma_{k}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right)\right\\|^{2}_{\Sigma_{k^{\prime}}}}{4}\right)dy\hskip 25.0pt$
		$\displaystyle\leq\sum_{k=1}^{K}\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}}\exp\left(-\frac{s\alpha^{2}_{k,k^{\prime}}}{4}\right)\left(\int_{\mathbb{R}^{m}}\frac{1}{(2\pi)^{\frac{m}{2}}}\exp\left(-\frac{\|y\|^{2}}{2}\right)dy\right)^{\frac{1}{2}}$
		$\displaystyle\hskip 60.0pt\times\left(\int_{\mathbb{R}^{m}}\frac{\|\Sigma_{k}\|^{\frac{1}{2}}}{\|\Sigma_{k^{\prime}}\|^{\frac{1}{2}}}\frac{1}{(2\pi)^{\frac{m}{2}}}\exp\left(-\frac{(1-s)\left\\|\Sigma_{k}^{\frac{1}{2}}\left(y-\Sigma_{k}^{-\frac{1}{2}}(\mu_{k^{\prime}}-\mu_{k})\right)\right\\|^{2}_{\Sigma_{k^{\prime}}}}{2}\right)dy\right)^{\frac{1}{2}}$
		$\displaystyle=\sum_{k=1}^{K}\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}}\exp\left(-\frac{s\alpha^{2}_{k,k^{\prime}}}{4}\right)\left(\int_{\mathbb{R}^{m}}\frac{1}{(2\pi)^{\frac{m}{2}}}\exp\left(-\frac{(1-s)\|z\|^{2}}{2}\right)dz\right)^{\frac{1}{2}}$
		$\displaystyle=\sum_{k=1}^{K}\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}}\exp\left(-\frac{s\alpha^{2}_{k,k^{\prime}}}{4}\right)(1-s)^{-\frac{m}{4}}\underbrace{\left(\int_{\mathbb{R}^{m}}\frac{1}{(2\pi)^{\frac{m}{2}}\|(1-s)^{-1}I\|^{\frac{1}{2}}}\exp\left(-\frac{1}{2}\\|z\\|_{(1-s)^{-1}I}^{2}\right)dz\right)^{\frac{1}{2}}}_{\displaystyle=1}$
		$\displaystyle=\frac{1}{(1-s)^{\frac{m}{4}}}\sum_{k=1}^{K}\sum_{k^{\prime}\neq k}\sqrt{\pi_{k}\pi_{k^{\prime}}}\exp\left(-\frac{s\alpha^{2}_{k,k^{\prime}}}{4}\right),$

Theoretical Error Analysis of Entropy Approximation for Gaussian Mixtures

Abstract

1 Introduction

2 Related work

3 Entropy approximation for Gaussian mixtures

4 Theoretical error analysis of the entropy approximation

Definition 4.1.

4.1 General covariance case

Theorem 4.2.

Remark 4.3.

Remark 4.4.

Remark 4.5.

Remark 4.6.

Corollary 4.7.

Theorem 4.8.

4.2 Coincident covariance case

Proposition 4.9.

Theorem 4.10.

Remark 4.11.

Corollary 4.12.

5 Experiment

6 Limitations and future work

Acknowledgments

References

Appendix

Appendix A Proofs in Section 4

A.1 Proof of Theorem 4.2

Lemma A.1.

Proof.

Lemma A.2.

Proof.

Lemma A.3.

Proof.

Lemma A.4.

Proof.

Lemma A.5.

Proof.

Remark A.6.

A.2 Proof of Corollary 4.7

Lemma A.7.

Proof.

A.3 Proof of Theorem 4.8

Lemma A.8.

Proof.

A.4 Proof of Proposition 4.9

Lemma A.9.

Proof.

Appendix B Details of Section 5

Appendix C Application example: Variational inference to BNNs with Gaussian mixtures

C.1 Overview: Variational inference with Gaussian mixtures

C.2 Experiment of BNN with Gaussian mixtures on toy task

C.3 Details of experiment

Theoretical Error Analysis of Entropy Approximation
for Gaussian Mixtures