Cumulant-free closed-form formulas for some common (dis)similarities between densities of an exponential family
Abstract
It is well-known that the Bhattacharyya, Hellinger, Kullback-Leibler, α-divergences, and Jeffreys' divergences between densities belonging to a same exponential family have generic closed-form formulas relying on the strictly convex and real-analytic cumulant function characterizing the exponential family. In this work, we report (dis)similarity formulas which bypass the explicit use of the cumulant function and highlight the role of quasi-arithmetic means and their multivariate mean operator extensions. In practice, these cumulant-free formulas are handy when implementing these (dis)similarities using legacy Application Programming Interfaces (APIs) since our method requires only to partially factorize the densities canonically of the considered exponential family.
Keywords: Bhattacharyya coefficient; Bhattacharyya distance; Hellinger distance; Jensen-Shannon divergence; Kullback-Leibler divergence; α-divergences; Jeffreys divergence; Cauchy-Schwarz divergence; quasi-arithmetic means; inverse function theorem; strictly monotone operator; mean operator; information geometry; exponential family; mixture family.
1 Introduction
Let (𝒳, 𝒜, μ) be a measure space [6] with sample space 𝒳, σ-algebra of events 𝒜, and positive measure μ (e.g., the Lebesgue or counting measure). The Kullback-Leibler divergence [25] (KLD), Jeffreys' divergence [20] (JD), Bhattacharyya coefficient (BC), Bhattacharyya distance [5, 24] (BD) and Hellinger distance [19] (HD) between two probability measures P and Q dominated by μ with respective Radon-Nikodym densities p and q are statistical (dis)similarities defined respectively by:

D_KL[p:q] := ∫ p(x) log(p(x)/q(x)) dμ(x),   (Kullback-Leibler divergence) (1)
D_J[p,q] := D_KL[p:q] + D_KL[q:p],   (Jeffreys' divergence) (2)
ρ[p,q] := ∫ √(p(x) q(x)) dμ(x),   (Bhattacharyya coefficient) (3)
D_B[p,q] := −log ρ[p,q],   (Bhattacharyya distance) (4)
D_H[p,q] := √(1 − ρ[p,q]).   (Hellinger distance) (5)
KLD is an oriented distance (i.e., D_KL[p:q] ≠ D_KL[q:p] in general). JD is a common symmetrization of the KLD which is not a metric distance because JD fails the triangle inequality. BC is a similarity which ensures that 0 < ρ[p,q] ≤ 1. BD is a symmetric non-metric distance, and HD is a metric distance. (KLD and JD are homogeneous divergences.)
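To make these definitions concrete, here is a minimal numerical sketch (an illustration of ours, not part of the paper's method) that estimates the Bhattacharyya coefficient of Eq. 3 by simple quadrature for two univariate normal densities and checks it against the known closed form:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

# Riemann-sum quadrature on a wide grid (a crude but sufficient choice here).
x = np.linspace(-40.0, 40.0, 400001)
dx = x[1] - x[0]
p = normal_pdf(x, 0.0, 1.0)
q = normal_pdf(x, 1.0, 2.0)

rho = np.sum(np.sqrt(p * q)) * dx      # Bhattacharyya coefficient (Eq. 3)
bd = -np.log(rho)                      # Bhattacharyya distance (Eq. 4)
hd = np.sqrt(1.0 - rho)                # Hellinger distance (Eq. 5)

# Closed form of rho for two univariate normals:
rho_exact = np.sqrt(2 * 1.0 * 2.0 / (1.0**2 + 2.0**2)) \
            * np.exp(-((0.0 - 1.0) ** 2) / (4 * (1.0**2 + 2.0**2)))
assert abs(rho - rho_exact) < 1e-6
```

The rest of the paper shows how to obtain such closed forms without integrating at all.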
All these divergences require calculating definite integrals, which can be computed in theory using the Risch pseudo-algorithm [44] (which depends on an oracle). In practice, computer algebra systems like Maple® or Maxima implement different subsets of the theoretical Risch pseudo-algorithm and thus face different limitations.
When the densities p and q belong to a same parametric family 𝒫 = {p_λ : λ ∈ Λ} of densities, i.e., p = p_{λ1} and q = p_{λ2}, these (dis)similarities yield equivalent parameter (dis)similarities. For example, we get the parameter divergence D_KL(λ1:λ2) := D_KL[p_{λ1}:p_{λ2}]. We use notationally the brackets to indicate that the (dis)similarity parameters are densities, and the parentheses to indicate parameter (dis)similarities.
In particular, when 𝒫 is a natural exponential family [34, 3] (NEF)

ℰ = { p_θ(x) = exp(t(x)^⊤ θ − F(θ) + k(x)) : θ ∈ Θ },   (6)

with t(x) denoting the sufficient statistics, k(x) an auxiliary measure carrier term, and

F(θ) := log ∫ exp(t(x)^⊤ θ + k(x)) dμ(x)   (7)

the cumulant function (more precisely, F is the cumulant generating function of the sufficient statistic t(x) from which all moments can be recovered, the cumulant generating function being the logarithm of the moment generating function), also called log-normalizer, log-partition function, free energy, or log-Laplace transform. Parameter θ is called the natural parameter. The cumulant function F is a strictly convex and smooth function (a real-analytic function [3]) defined on the open convex natural parameter space Θ. Let D denote the dimension of the parameter space Θ (i.e., the order of the exponential family) and d the dimension of the sample space 𝒳. We further assume full regular exponential families [3] so that Θ is a non-empty open convex domain and t(x) is a minimal sufficient statistic [14].
Many common families of distributions are exponential families in disguise after reparameterization: p_λ(x) = p_{θ(λ)}(x). Those families are called exponential families (i.e., EFs, and not natural EFs, to emphasize that θ(λ) ≠ λ in general), and their densities are canonically factorized as follows:

p_λ(x) = exp( t(x)^⊤ θ(λ) − F(θ(λ)) + k(x) ).   (8)

We call parameter λ the source parameter (or the ordinary parameter, with λ ∈ Λ, the source/ordinary parameter space), and parameter θ(λ) ∈ Θ is the corresponding natural parameter. Notice that the canonical parameterization of Eq. 8 is not unique: For example, adding a constant term to t(x) can be compensated by subtracting this constant from F, or multiplying the sufficient statistic t(x) by a symmetric invertible matrix A can be compensated by multiplying θ(λ) by the inverse of A, so that (A t(x))^⊤ (A^{−1} θ(λ)) = t(x)^⊤ θ(λ).
Another useful parameterization of exponential families is the moment parameter [34, 3]: η = E_{p_θ}[t(x)] = ∇F(θ). The moment parameter space shall be denoted by H:

H := { E_{p_θ}[t(x)] : θ ∈ Θ }.   (9)

Let C be the closed convex hull of the range of the sufficient statistic t(x). When H = int(C), the family is said steep [14]. In the remainder, we consider full regular and steep exponential families.
To give one example, consider the family of univariate normal densities:

p_{μ,σ}(x) = (1/(σ√(2π))) exp(−(x−μ)²/(2σ²)).   (10)

This family is interpreted as an exponential family of order D = 2 with univariate parametric densities (d = 1), indexed by the source parameter λ = (μ, σ²) with Λ = ℝ × ℝ₊₊. The corresponding natural parameter is θ(λ) = (μ/σ², −1/(2σ²)), and the moment parameter is η(λ) = (E[x], E[x²]) = (μ, μ² + σ²) since t(x) = (x, x²). The cumulant function for the normal family is F(θ) = −θ₁²/(4θ₂) + (1/2) log(−π/θ₂).
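The parameter conversions for the normal family above can be sketched in a few lines of code (a sketch of ours; the function names are illustrative):

```python
# Univariate normal family: lambda = (mu, var), theta = (mu/var, -1/(2 var)),
# eta = (mu, mu^2 + var).
def source_to_natural(mu, var):
    return (mu / var, -1.0 / (2.0 * var))

def natural_to_source(t1, t2):
    var = -1.0 / (2.0 * t2)
    return (t1 * var, var)

def source_to_moment(mu, var):
    return (mu, mu * mu + var)

mu, var = 1.5, 4.0
t1, t2 = source_to_natural(mu, var)
assert natural_to_source(t1, t2) == (mu, var)     # round trip lambda -> theta -> lambda
assert source_to_moment(mu, var) == (1.5, 6.25)
```

Such conversion functions are all the cumulant-free method of this paper will require.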
When densities p_{θ1} and p_{θ2} both belong to the same exponential family, we have the following well-known closed-form expressions [33, 34] for the (dis)similarities introduced formerly:

D_KL[p_{θ1}:p_{θ2}] = B_F(θ2:θ1),   (Kullback-Leibler divergence) (11)
D_J[p_{θ1},p_{θ2}] = (θ2 − θ1)^⊤ (η2 − η1),   (Jeffreys' divergence) (12)
ρ[p_{θ1},p_{θ2}] = exp(−J_F(θ1,θ2)),   (Bhattacharyya coefficient) (13)
D_B[p_{θ1},p_{θ2}] = J_F(θ1,θ2),   (Bhattacharyya distance) (14)
D_H[p_{θ1},p_{θ2}] = √(1 − exp(−J_F(θ1,θ2))),   (Hellinger distance) (15)
where D* indicates the reverse divergence D*(θ1:θ2) := D(θ2:θ1) (parameter swapping; here, the star '*' is not to be confused with the Legendre-Fenchel transform used in information geometry [1]: for a Bregman divergence B_F, we have the reverse Bregman divergence B_F*(θ1:θ2) = B_F(θ2:θ1) = B_{F*}(∇F(θ1):∇F(θ2)), where F* denotes the convex conjugate of F obtained by the Legendre-Fenchel transform), and B_F and J_F are the Bregman divergence [8] and the Jensen divergence [33] induced by the functional generator F, respectively:

B_F(θ1:θ2) := F(θ1) − F(θ2) − (θ1 − θ2)^⊤ ∇F(θ2),   (16)
J_F(θ1,θ2) := (F(θ1) + F(θ2))/2 − F((θ1 + θ2)/2).   (17)
In information geometry, the Bregman divergences are the canonical divergences of dually flat spaces [28].
More generally, the Bhattacharyya distance/coefficient can be skewed with a parameter α ∈ (0,1) to yield the α-skewed Bhattacharyya distance/coefficient [33]:

D_{B,α}[p:q] := −log ρ_α[p:q]   (18)
= −log ∫ p(x)^α q(x)^{1−α} dμ(x),   (19)
ρ_α[p:q] := ∫ p(x)^α q(x)^{1−α} dμ(x).   (20)
The ordinary Bhattacharyya distance and coefficient are recovered for α = 1/2. In statistics, the maximum skewed Bhattacharyya distance is called the Chernoff information [10, 27], used in Bayesian hypothesis testing:

D_C[p:q] := max_{α ∈ (0,1)} D_{B,α}[p:q].   (21)
Notice that the Bhattacharyya skewed α-coefficient ρ_α of Eq. 20 also appears in the definition of the α-divergences [1] (with α ∈ ℝ \ {0,1}):

D_α[p:q] := (1/(α(1−α))) (1 − ρ_α[p:q]).   (22)
The α-divergences belong to the class of f-divergences [13], which are the invariant divergences of information geometry [1] (a statistical invariant divergence D is such that D[p_θ : p_{θ+dθ}] ≃ (1/2) dθ^⊤ I(θ) dθ, where I(θ) is the Fisher information matrix [1]).
When densities p_{θ1} and p_{θ2} both belong to the same exponential family ℰ, we get the following closed-form formula [33]:

ρ_α[p_{θ1}:p_{θ2}] = exp(−J_{F,α}(θ1:θ2)),   (23)

where J_{F,α} denotes the α-skewed Jensen divergence:

J_{F,α}(θ1:θ2) := α F(θ1) + (1−α) F(θ2) − F(α θ1 + (1−α) θ2).   (24)
All these closed-form formulas can be obtained from the calculation of the following generic integral [37]:

I_α(θ1,θ2) := ∫ p_{θ1}(x)^α p_{θ2}(x)^{1−α} dμ(x),   (25)

when θ1, θ2 ∈ Θ and α θ1 + (1−α) θ2 ∈ Θ. Indeed, provided that α θ1 + (1−α) θ2 ∈ Θ, we have

I_α(θ1,θ2) = exp(−J_{F,α}(θ1:θ2)).   (26)

The calculation of I_α in Eq. 26 is easily achieved by bypassing the computation of the antiderivative of the integrand in Eq. 25 (using the fact that ∫ p_θ(x) dμ(x) = 1 for any θ ∈ Θ), see [37].
In particular, we get the following special cases:

• When α ∈ (0,1), α θ1 + (1−α) θ2 ∈ Θ since the domain Θ is convex.
• When α ∉ [0,1] and α θ1 + (1−α) θ2 ∈ Θ, Eq. 26 still holds. This is always the case when Θ is an affine cone (e.g., Gaussian or Wishart family), see [26].
• When α → 0 or α → 1, we get the following limits of the α-skewed Bhattacharyya distances:
lim_{α→0} (1/α) D_{B,α}[p:q] = D_KL[q:p],   (28)
lim_{α→1} (1/(1−α)) D_{B,α}[p:q] = D_KL[p:q].   (29)
It follows that when the densities p_{θ1} and p_{θ2} both belong to the same exponential family, we obtain

D_KL[p_{θ1}:p_{θ2}] = lim_{α→1} (1/(1−α)) J_{F,α}(θ1:θ2) = B_F(θ2:θ1),   (30)
D_KL[p_{θ2}:p_{θ1}] = lim_{α→0} (1/α) J_{F,α}(θ1:θ2) = B_F(θ1:θ2).   (31)
In practice, we would like to get closed-form formulas for the (dis)similarities when the densities belong to the same exponential family using the source reparameterization λ:

D_KL[p_{λ1}:p_{λ2}] = B_F(θ(λ2):θ(λ1)),   (32)
D_J[p_{λ1},p_{λ2}] = (θ(λ2) − θ(λ1))^⊤ (η(λ2) − η(λ1)),   (33)
ρ[p_{λ1},p_{λ2}] = exp(−J_F(θ(λ1),θ(λ2))),   (34)
D_B[p_{λ1},p_{λ2}] = J_F(θ(λ1),θ(λ2)),   (35)
D_H[p_{λ1},p_{λ2}] = √(1 − exp(−J_F(θ(λ1),θ(λ2)))),   (36)
where θ(·) and η(·) are the D-variate functions for converting the source parameter λ to the natural parameter θ and the moment parameter η, respectively. The Chernoff information between two densities p_{λ1} and p_{λ2} of the same exponential family amounts to a Jensen-Chernoff divergence [27]:

D_C[p_{λ1}:p_{λ2}] = max_{α ∈ (0,1)} D_{B,α}[p_{λ1}:p_{λ2}]   (37)
= max_{α ∈ (0,1)} J_{F,α}(θ(λ1):θ(λ2)),   (38)

that is, the maximal value of a skew Jensen divergence for α ∈ (0,1).
Thus to have closed-form formulas, we need to make explicit both the conversion function θ(·) and the cumulant function F. This can be prone to human calculus mistakes (e.g., try reporting manually these formulas without calculus errors for the multivariate Gaussian family). Furthermore, the cumulant function may not be available in closed form [39].
In this work, we show how to easily bypass the explicit use of the cumulant function . Our method is based on a simple trick, and makes the programming of these (dis)similarities easy using off-the-shelf functions of application programming interfaces (APIs) (e.g., the density function, the entropy function or the moment function of a distribution family).
This paper is organized as follows: Section 2 explains the method for the Bhattacharyya coefficient and its related dissimilarities. Section 3 further carries on the principle of bypassing the explicit use of the cumulant function and its gradient for the calculation of the Kullback-Leibler divergence and its related Jeffreys' divergence. Section 4 summarizes the results. Throughout the paper, we present several examples to illustrate the methods. Appendix Closed-form formula using the Maxima computer algebra system displays some code written using the computer algebra system (CAS) Maxima to recover some formulas for some exponential families. Appendix A provides further examples.
2 Cumulant-free formula for the Bhattacharyya coefficient and distances derived thereof
2.1 A method based on a simple trick
The densities of an exponential family all have the same support [3] 𝒳. Consider any point ω in the support. Then observe that we can write the cumulant function of a natural exponential family as:

F(θ) = −log p_θ(ω) + t(ω)^⊤ θ + k(ω).   (39)

Since the generator of a Jensen or Bregman divergence is defined modulo an affine function (i.e., F(θ) and F(θ) + a^⊤ θ + b yield the same divergence for a ∈ ℝ^D and b ∈ ℝ), we consider the following equivalent generator (the term t(ω)^⊤ θ + k(ω) is affine) expressed using the density parameterized by θ:

F_ω(θ) := −log p_θ(ω).   (40)
Then the Bhattacharyya coefficient is expressed by this cumulant-free expression using the source parameterization λ:

ρ[p_{λ1},p_{λ2}] = √(p_{λ1}(ω) p_{λ2}(ω)) / p_{λ̄}(ω),   (41)

with

λ̄ := λ((θ(λ1) + θ(λ2))/2).   (42)

Similarly, the Bhattacharyya distance is written as

D_B[p_{λ1},p_{λ2}] = log( p_{λ̄}(ω) / √(p_{λ1}(ω) p_{λ2}(ω)) ).   (43)
Let l(x; λ) := log p_λ(x) be the log-density. Then we have

D_B[p_{λ1},p_{λ2}] = l(ω; λ̄) − (l(ω; λ1) + l(ω; λ2))/2.   (44)

This is a Jensen divergence for the strictly convex function F_ω(θ) = −l(ω; λ(θ)) (wrt. θ) since F_ω(θ) ≡ F(θ) (modulo an affine term).
Thus we do not need to explicitly use the expression of the cumulant function F in Eq. 41 and Eq. 43, but we need the following parameter conversion functions:

1. the ordinary-to-natural parameter conversion function θ(·), and
2. its reciprocal function λ(·) = θ^{−1}(·) (i.e., λ(θ(u)) = u), the natural-to-source parameter conversion function,

so that we can calculate the ordinary λ̄-parameter corresponding to the natural mid-parameter (θ(λ1) + θ(λ2))/2:

λ̄ = λ((θ(λ1) + θ(λ2))/2).   (45)
Notice that, in general, a linear interpolation in the natural parameter θ corresponds to a non-linear interpolation in the source parameterization λ when θ(·) is not affine.

Since λ̄ = λ((θ(λ1) + θ(λ2))/2), we can interpret this non-linear interpolation as a generalization of the quasi-arithmetic mean [33] (QAM):

λ̄ = M_θ(λ1, λ2),   (46)

where

M_f(a,b) := f^{−1}((f(a) + f(b))/2)   (47)
is the quasi-arithmetic mean induced by a strictly monotonic and smooth function f. (Notice that M_f = M_g when g(u) = c₁ f(u) + c₂ with c₁ ≠ 0 and c₂ ∈ ℝ. Thus we can assume that f is a strictly increasing and smooth function.)
We can extend the quasi-arithmetic mean to a weighted quasi-arithmetic mean as follows:

M_f(a,b;α) := f^{−1}(α f(a) + (1−α) f(b)),   α ∈ [0,1].   (48)

Weighted quasi-arithmetic means are strictly monotone [47] when α ∈ (0,1).
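The weighted quasi-arithmetic mean of Eq. 48 is a one-liner; the sketch below (generator/inverse pairs chosen by us for illustration) recovers two classical means:

```python
import math

def qam(f, finv, a, b, alpha=0.5):
    """Weighted quasi-arithmetic mean M_f(a, b; alpha) = f^{-1}(alpha f(a) + (1-alpha) f(b))."""
    return finv(alpha * f(a) + (1.0 - alpha) * f(b))

# Generator f(u) = log(u) induces the geometric mean:
g = qam(math.log, math.exp, 1.0, 4.0)
assert abs(g - 2.0) < 1e-12

# Generator f(u) = 1/u induces the harmonic mean:
h = qam(lambda u: 1.0 / u, lambda v: 1.0 / v, 1.0, 3.0)
assert abs(h - 1.5) < 1e-12
```

In the paper's setting, f is the ordinary-to-natural conversion function θ(·) of the exponential family at hand.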
Let us remark that extensions of weighted quasi-arithmetic means have been studied recently in information geometry to describe geodesic paths [16, 15] .
In 1D, a continuous bijective function on an interval (e.g., a parameter conversion function in our setting) is a strictly monotonic function, and thus defines a quasi-arithmetic mean.
To define quasi-arithmetic means with multivariate generators using Eq. 47, we need to properly define f^{−1}. When the function is separable, i.e., f(x) = (f₁(x₁), …, f_D(x_D)) with the f_i's strictly monotone and smooth functions, we can straightforwardly define the multivariate mean componentwise as M_f(a,b) = (M_{f₁}(a₁,b₁), …, M_{f_D}(a_D,b_D)).
In general, the inverse function theorem in multivariable calculus [11] states that there exists locally an inverse function f^{−1} for a continuously differentiable function f at point x if the determinant of the Jacobian matrix of f at x is non-zero. Moreover, f^{−1} is continuously differentiable and its Jacobian at f(x) is the matrix inverse of the Jacobian of f at x.
In some cases, we get the existence of a global inverse function. For example, when f = ∇F is the gradient of a Legendre-type strictly convex and smooth function F, the reciprocal function f^{−1} = ∇F* is well-defined and global, where F* denotes the convex conjugate of F (of Legendre type). In that case, ∇F is a strictly monotone operator (a mapping M is a strictly monotone operator iff (θ2 − θ1)^⊤(M(θ2) − M(θ1)) > 0 for all θ1 ≠ θ2; monotone operators are a generalization of the concept of univariate monotonic functions) since it is the gradient of a strictly convex function. We also refer to [2] for some work on operator means, and to [42] for multivariate quasi-arithmetic means of covariance matrices.
To summarize, we can compute the Bhattacharyya coefficient (and Bhattacharyya/Hellinger distances) using the parametric density function p_λ and a quasi-arithmetic mean M_θ on the source parameters λ1 and λ2 as:

ρ[p_{λ1},p_{λ2}] = √(p_{λ1}(ω) p_{λ2}(ω)) / p_{M_θ(λ1,λ2)}(ω),   (49)

using the notation M_θ(λ1,λ2) := λ((θ(λ1) + θ(λ2))/2), and

D_B[p_{λ1},p_{λ2}] = log( p_{M_θ(λ1,λ2)}(ω) / √(p_{λ1}(ω) p_{λ2}(ω)) ).   (50)

Similarly, we get the following cumulant-free expression for the Hellinger distance:

D_H[p_{λ1},p_{λ2}] = √( 1 − √(p_{λ1}(ω) p_{λ2}(ω)) / p_{M_θ(λ1,λ2)}(ω) ).   (51)
The Hellinger distance proves useful when using some generic algorithms which require handling metric distances, for example, the 2-approximation factor of Gonzalez [18] for k-center metric clustering.
These cumulant-free formulas are all the more convenient as, in legacy software APIs, we usually have access to the density function of the probability family. Thus if a parametric family of an API is an exponential family, we just need to implement the corresponding quasi-arithmetic mean M_θ.
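The cumulant-free formula of Eq. 49 can be sketched generically as below (names are ours), here instantiated on the family of exponential distributions, whose conversion θ(u) = −u induces the arithmetic mean:

```python
import math

def exp_pdf(x, rate):
    # density of the exponential distribution with rate parameter
    return rate * math.exp(-rate * x)

def bhattacharyya_coeff(pdf, mean_fn, l1, l2, w):
    """Eq. 49: rho = sqrt(p_{l1}(w) p_{l2}(w)) / p_{M(l1,l2)}(w)."""
    return math.sqrt(pdf(w, l1) * pdf(w, l2)) / pdf(w, mean_fn(l1, l2))

# theta(u) = -u  =>  quasi-arithmetic mean = arithmetic mean; choose omega = 0.
rho = bhattacharyya_coeff(exp_pdf, lambda a, b: 0.5 * (a + b), 1.0, 3.0, w=0.0)

# Closed form: sqrt(l1*l2) / ((l1+l2)/2) = sqrt(3)/2.
assert abs(rho - math.sqrt(3.0) / 2.0) < 1e-12
```

Only the density and the mean induced by θ(·) are needed; no integral and no cumulant function appear.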
More generally, the α-skewed Bhattacharyya distance [33] for α ∈ (0,1) is expressed using the following cumulant-free expression:

D_{B,α}[p_{λ1}:p_{λ2}] = log( p_{M_θ(λ1,λ2;α)}(ω) / (p_{λ1}(ω)^α p_{λ2}(ω)^{1−α}) ),   (52)

with

M_θ(λ1,λ2;α) := λ(α θ(λ1) + (1−α) θ(λ2)).   (53)
Notice that the geometric α-barycenter of two densities p_{λ1} and p_{λ2} of an exponential family ℰ is a scaled density of ℰ:

Let p̃_λ(x) := exp(t(x)^⊤ θ(λ) + k(x)) denote the unnormalized density of an exponential family (so that p_λ(x) = p̃_λ(x) e^{−F(θ(λ))}). We have the following invariant:

p̃_{λ1}(x)^α p̃_{λ2}(x)^{1−α} = p̃_{M_θ(λ1,λ2;α)}(x).   (54)

It follows that we have:

ρ_α[p_{λ1}:p_{λ2}] = ∫ p_{λ1}(x)^α p_{λ2}(x)^{1−α} dμ(x)   (55)
= e^{−(α F(θ(λ1)) + (1−α) F(θ(λ2)))} ∫ p̃_{M_θ(λ1,λ2;α)}(x) dμ(x)   (56)
= e^{F(α θ(λ1) + (1−α) θ(λ2)) − α F(θ(λ1)) − (1−α) F(θ(λ2))}.   (57)
The Cauchy-Schwarz divergence [21, 41] is defined by

D_CS[p:q] := −log( ∫ p(x) q(x) dμ(x) / √( ∫ p(x)² dμ(x) ∫ q(x)² dμ(x) ) ).   (58)

Thus the Cauchy-Schwarz divergence is a projective divergence: D_CS[λp:λ′q] = D_CS[p:q] for any λ, λ′ > 0. It can be shown that the Cauchy-Schwarz divergence between two densities p_{θ1} and p_{θ2} of an exponential family with a natural parameter space Θ being a cone [31] (e.g., Gaussian), with carrier term k(x) = 0, is:

D_CS[p_{θ1}:p_{θ2}] = (F(2θ1) + F(2θ2))/2 − F(θ1 + θ2).   (59)

See [31] for the formula extended to Hölder divergences which generalize the Cauchy-Schwarz divergence.
2.2 Some illustrating examples
Let us start with an example of a continuous exponential family whose induced mean is the arithmetic mean A(a,b) := (a+b)/2:
Example 1
Consider the family of exponential distributions with rate parameter λ > 0. The densities of this continuous EF write as p_λ(x) = λ exp(−λx) with support 𝒳 = [0, ∞). From the partial canonical factorization of densities following Eq. 6, we get t(x) = x and θ(u) = −u, so that M_θ = A is the arithmetic mean. Choose ω = 0 so that p_λ(ω) = λ. It follows that

ρ[p_{λ1},p_{λ2}] = √(λ1 λ2) / ((λ1 + λ2)/2) = G(λ1,λ2)/A(λ1,λ2).

Let G(a,b) := √(ab) denote the geometric mean for a, b > 0. Notice that since G(λ1,λ2) ≤ A(λ1,λ2), we have ρ[p_{λ1},p_{λ2}] ≤ 1. The Bhattacharyya distance between two exponential densities of rates λ1 and λ2 is

D_B[p_{λ1},p_{λ2}] = log( A(λ1,λ2)/G(λ1,λ2) ).

Since the logarithm function is strictly increasing and A ≥ G, we have D_B[p_{λ1},p_{λ2}] ≥ 0, and therefore we check that ρ[p_{λ1},p_{λ2}] ≤ 1.
Next, we consider a discrete exponential family which exhibits the geometric mean:
Example 2
The Poisson family of probability mass functions (PMFs) p_λ(x) = λ^x e^{−λ}/x! for x ∈ ℕ, where λ > 0 denotes the intensity parameter, is a discrete exponential family with t(x) = x, k(x) = −log x!, θ(u) = log u, and θ^{−1}(u) = exp(u). Thus the quasi-arithmetic mean associated with the Poisson family is the geometric mean G. Choosing ω = 0 (with p_λ(0) = e^{−λ}), it follows that the Bhattacharyya coefficient is

ρ[p_{λ1},p_{λ2}] = √(e^{−λ1} e^{−λ2}) / e^{−√(λ1λ2)} = exp( −(λ1+λ2)/2 + √(λ1λ2) ).

Hence, we recover the Bhattacharyya distance between two Poisson PMFs:

D_B[p_{λ1},p_{λ2}] = (λ1+λ2)/2 − √(λ1λ2) = (1/2)(√λ1 − √λ2)².

The negative binomial distribution with known number of failures also yields the same natural parameter function and geometric mean (but the density is different).
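The Poisson case above can be checked mechanically with the cumulant-free recipe of Eq. 49 (a sketch of ours; only the pmf at ω = 0 and the geometric mean are needed):

```python
import math

def poisson_pmf0(lam):
    # Poisson pmf evaluated at k = 0
    return math.exp(-lam)

l1, l2 = 2.0, 5.0
geo = math.sqrt(l1 * l2)   # quasi-arithmetic mean induced by theta(u) = log(u)

rho = math.sqrt(poisson_pmf0(l1) * poisson_pmf0(l2)) / poisson_pmf0(geo)
db = -math.log(rho)

# Closed form: D_B = (sqrt(l1) - sqrt(l2))^2 / 2
assert abs(db - 0.5 * (math.sqrt(l1) - math.sqrt(l2)) ** 2) < 1e-12
```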
To illustrate the use of the power means of order δ ≠ 0 (also called Hölder means), M_δ(a,b) := ((a^δ + b^δ)/2)^{1/δ}, let us consider the family of Weibull distributions.
Example 3
The Weibull distributions with a prescribed shape parameter k > 0 (e.g., exponential distributions when k = 1, Rayleigh distributions when k = 2) form an exponential family. The density of a Weibull distribution with scale λ > 0 and fixed shape parameter k is

p_λ(x) = (k/λ) (x/λ)^{k−1} e^{−(x/λ)^k},   x ∈ [0, ∞).

The ordinary↔natural parameter conversion functions are θ(u) = −u^{−k} and θ^{−1}(u) = (−u)^{−1/k}. It follows that M_θ is the power mean of order −k. We choose ω = 1 and get p_λ(1) = (k/λ^k) e^{−1/λ^k}. It follows the closed-form formula for the Bhattacharyya coefficient:

ρ[p_{λ1},p_{λ2}] = √(p_{λ1}(1) p_{λ2}(1)) / p_{M_θ(λ1,λ2)}(1)   (60)
= 2 (λ1 λ2)^{k/2} / (λ1^k + λ2^k).   (61)

For k = 2, the Weibull family yields the Rayleigh family. The Bhattacharyya coefficient is ρ[p_{σ1},p_{σ2}] = 2σ1σ2/(σ1² + σ2²) = (G(σ1,σ2)/Q(σ1,σ2))², where G and Q denote the geometric mean and the quadratic mean, respectively (with Q(a,b) := √((a² + b²)/2)).
Table 1 summarizes the quasi-arithmetic means associated with common univariate exponential families. Notice that a same quasi-arithmetic mean can be associated with many exponential families: For example, the Gaussian family with fixed variance and the exponential distribution family both have a linear conversion function θ(u), yielding the arithmetic mean.
Let us now consider multi-order exponential families. We start with the bi-order exponential family of Gamma distributions.
Example 4
The density of a Gamma distribution is

p_{α,β}(x) = β^α x^{α−1} e^{−βx} / Γ(α),   x ∈ (0, ∞),

for a shape parameter α > 0 and rate parameter β > 0 (i.e., a D = 2-order exponential family with λ = (α, β)). The natural parameter is θ(λ) = (α − 1, −β) and the inverse function is λ(θ) = (θ₁ + 1, −θ₂). It follows that the generalized quasi-arithmetic mean is the bivariate arithmetic mean: M_θ(λ1, λ2) = ((α1+α2)/2, (β1+β2)/2) =: (ᾱ, β̄). We choose ω = 1 so that p_{α,β}(1) = β^α e^{−β} / Γ(α).

We get the Bhattacharyya coefficient:

ρ[p_{λ1},p_{λ2}] = √(p_{λ1}(1) p_{λ2}(1)) / p_{ᾱ,β̄}(1) = ( Γ(ᾱ) / √(Γ(α1) Γ(α2)) ) · √(β1^{α1} β2^{α2}) / β̄^{ᾱ},

and the Bhattacharyya distance:

D_B[p_{λ1},p_{λ2}] = ᾱ log β̄ − (α1 log β1 + α2 log β2)/2 + (log Γ(α1) + log Γ(α2))/2 − log Γ(ᾱ).
Next, we consider the Dirichlet family, which exhibits a separable (quasi-)arithmetic mean:
Example 5
Consider the family of Dirichlet distributions with densities defined on the (d−1)-dimensional open standard simplex support

Δ_d := { (x₁, …, x_d) : x_i ∈ (0,1), ∑_{i=1}^d x_i = 1 }.   (62)

The family of Dirichlet distributions includes the family of Beta distributions when d = 2. The density of a Dirichlet distribution with parameter α = (α₁, …, α_d), α_i > 0, is defined by:

p_α(x) = ( Γ(∑_{i=1}^d α_i) / ∏_{i=1}^d Γ(α_i) ) ∏_{i=1}^d x_i^{α_i − 1}.   (63)

The Dirichlet distributions are used in Bayesian statistics as the conjugate priors of the multinomial family. The Dirichlet distributions form an exponential family with d-dimensional natural parameter θ(α) = (α₁ − 1, …, α_d − 1) (D = d) and vector of sufficient statistics t(x) = (log x₁, …, log x_d). The induced quasi-arithmetic mean is a separable multivariate arithmetic mean, i.e., the multivariate arithmetic mean M_θ(α, α′) = (α + α′)/2.
Let us choose any ω in the open simplex Δ_d, so that p_α(ω) = (1/B(α)) ∏_i ω_i^{α_i − 1}.

We get the following Bhattacharyya coefficient between two Dirichlet densities p_α and p_{α′}:

ρ[p_α, p_{α′}] = B((α + α′)/2) / √(B(α) B(α′)),   (64)

where

B(α) := ∏_{i=1}^d Γ(α_i) / Γ(∑_{i=1}^d α_i).   (65)

It follows that the Bhattacharyya distance between two Dirichlet densities p_α and p_{α′} is

D_B[p_α, p_{α′}] = (log B(α) + log B(α′))/2 − log B((α + α′)/2).   (66)

The skewed Bhattacharyya coefficient for a scalar β ∈ (0,1) is:

ρ_β[p_α : p_{α′}] = B(βα + (1−β)α′) / (B(α)^β B(α′)^{1−β}),   (67)

and the skewed Bhattacharyya distance:

D_{B,β}[p_α : p_{α′}] = −log ρ_β[p_α : p_{α′}]   (68)
= β log B(α) + (1−β) log B(α′) − log B(α_β),   (69)

with

α_β := βα + (1−β)α′.   (70)

(This is in accordance with Eq. 15–17 of [43].)
Finally, we consider the case of the multivariate Gaussian family:
Example 6
Consider the d-dimensional Gaussian family [29], also called the MultiVariate Normal (MVN) family. The parameter λ = (μ, Σ) of an MVN consists of a vector part μ ∈ ℝ^d and a positive-definite matrix part Σ ≻ 0. The Gaussian density is given by

p_{μ,Σ}(x) = (1/((2π)^{d/2} √(det Σ))) exp( −(1/2)(x − μ)^⊤ Σ^{−1} (x − μ) ),

where det denotes the matrix determinant.

Partially factorizing the density into the canonical form of an exponential family, we find that θ(λ) = (Σ^{−1}μ, (1/2)Σ^{−1}) with t(x) = (x, −xx^⊤). It follows that the multivariate weighted mean is M_θ((μ1,Σ1), (μ2,Σ2); α) = (μ̄, Σ̄) with

Σ̄ = (α Σ1^{−1} + (1−α) Σ2^{−1})^{−1}   (71)

the matrix harmonic barycenter, and

μ̄ = Σ̄ (α Σ1^{−1} μ1 + (1−α) Σ2^{−1} μ2).   (72)
We choose ω = 0 with p_{μ,Σ}(0) = (2π)^{−d/2} det(Σ)^{−1/2} e^{−(1/2) μ^⊤ Σ^{−1} μ}. It follows the closed-form formula for the Bhattacharyya coefficient between Gaussian densities (α = 1/2):

ρ[p_{μ1,Σ1}, p_{μ2,Σ2}] = √(p_{μ1,Σ1}(0) p_{μ2,Σ2}(0)) / p_{μ̄,Σ̄}(0).

Thus the Bhattacharyya distance is

D_B[p_{μ1,Σ1}, p_{μ2,Σ2}] = (1/8)(μ2 − μ1)^⊤ ((Σ1 + Σ2)/2)^{−1} (μ2 − μ1) + (1/2) log( det((Σ1 + Σ2)/2) / √(det Σ1 det Σ2) ),

and the Hellinger distance:

D_H[p_{μ1,Σ1}, p_{μ2,Σ2}] = √(1 − ρ[p_{μ1,Σ1}, p_{μ2,Σ2}]).
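The Gaussian closed form above is easy to implement with a linear-algebra library; below is a sketch of ours using NumPy (names are illustrative):

```python
import numpy as np

def gaussian_bhattacharyya(mu1, S1, mu2, S2):
    """Bhattacharyya distance between N(mu1, S1) and N(mu2, S2)."""
    S = 0.5 * (S1 + S2)                          # averaged covariance
    dmu = mu2 - mu1
    term_mu = 0.125 * dmu @ np.linalg.solve(S, dmu)
    term_S = 0.5 * np.log(np.linalg.det(S)
                          / np.sqrt(np.linalg.det(S1) * np.linalg.det(S2)))
    return term_mu + term_S

mu1, S1 = np.zeros(2), np.eye(2)
mu2, S2 = np.array([1.0, 0.0]), np.eye(2)

db = gaussian_bhattacharyya(mu1, S1, mu2, S2)
assert abs(db - 1.0 / 8.0) < 1e-12               # equal covariances: |mu2-mu1|^2 / 8

hd = np.sqrt(1.0 - np.exp(-db))                  # Hellinger distance from rho = exp(-D_B)
assert 0.0 < hd < 1.0
```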
The Cauchy-Schwarz divergence between two Gaussians [31] is:

D_CS[p_{μ1,Σ1}:p_{μ2,Σ2}] = (1/2)(μ2 − μ1)^⊤ (Σ1 + Σ2)^{−1} (μ2 − μ1) + (1/2) log( det(Σ1 + Σ2) / √(det Σ1 det Σ2) ) − (d/2) log 2.   (73)
3 Cumulant-free formula for the Kullback-Leibler divergence and related divergences
The Kullback-Leibler divergence (KLD) between two densities p and q, also called the relative entropy [12], amounts to a reverse Bregman divergence, D_KL[p_{θ1}:p_{θ2}] = B_F(θ2:θ1), when the densities belong to the same exponential family ℰ, where the Bregman generator F is the cumulant function of ℰ.
We present below two techniques to calculate the KLD by avoiding computing the integral:

• The first technique, described in §3.1, considers the KLD as a limit case of the skewed Bhattacharyya distance.
• The second technique relies on the availability of off-the-shelf formulas for the entropy and the moment of the sufficient statistic (§3.2), and is derived using the Legendre-Fenchel divergence.
3.1 Kullback-Leibler divergence as the limit case of a skewed Bhattacharrya distance
We can obtain closed-form formulas for the Kullback-Leibler divergence by considering the limit case of skewed Bhattacharyya distances:

D_KL[p_{λ1}:p_{λ2}] = lim_{α→0} (1/α) D_{B,α}[p_{λ2}:p_{λ1}]   (74)
= lim_{α→0} (1/α) log( p_{M_θ(λ2,λ1;α)}(ω) / (p_{λ2}(ω)^α p_{λ1}(ω)^{1−α}) )   (75)
= lim_{α→0} ( log(p_{λ1}(ω)/p_{λ2}(ω)) + (1/α) log(p_{M_θ(λ2,λ1;α)}(ω)/p_{λ1}(ω)) )   (76)
= log(p_{λ1}(ω)/p_{λ2}(ω)) + lim_{α→0} (1/α) log(p_{M_θ(λ2,λ1;α)}(ω)/p_{λ1}(ω)).   (77)
When we deal with uni-order exponential families (D = 1), we can use a first-order Taylor expansion of the quasi-arithmetic mean when α ≃ 0 (see [30]):

M_θ(λ2,λ1;α) = λ1 + α (θ(λ2) − θ(λ1))/θ′(λ1) + o(α),   (78)

where θ′ denotes the derivative of the ordinary-to-natural parameter conversion function.

It follows that we have:

D_KL[p_{λ1}:p_{λ2}] = log(p_{λ1}(ω)/p_{λ2}(ω)) + ((θ(λ2) − θ(λ1))/θ′(λ1)) ∂_λ log p_λ(ω)|_{λ=λ1}.   (79)
Notice that we need to calculate the limit case by case as it depends on the density expression of the exponential family. This limit can be computed symbolically using a computer algebra system (e.g., using Maxima666http://maxima.sourceforge.net/). The example below illustrates the technique for calculating the KLD between two Weibull densities with prescribed shape parameter.
Example 7
Consider the Weibull family with prescribed shape parameter k, which forms an exponential family (including the family of exponential distributions for k = 1 and the Rayleigh distributions for k = 2). The density of a Weibull distribution with scale λ > 0 and fixed shape parameter k is

p_λ(x) = (k/λ) (x/λ)^{k−1} e^{−(x/λ)^k},   x ∈ [0, ∞).
We have θ(u) = u^{−k} and θ^{−1}(u) = u^{−1/k}. We set ω = 1 so that p_λ(1) = k λ^{−k} e^{−λ^{−k}}. Let us program the formula of Eq. 77 using the computer algebra system Maxima:

/* KLD between Weibull distributions by calculating a limit */
declare(k, integer);
assume(lambda1>0);
assume(lambda2>0);
k:5;
omega:1;
t(u):=u**(-k);
tinv(u):=u**(-1/k);
tp(u):=k*u**(-k-1);
p(x,l):=(k/l)*((x/l)**(k-1))*exp(-(x/l)**k);
mean(a,b):=tinv(alpha*t(a)+(1-alpha)*t(b));
log(p(omega,l1)/p(omega,l2)) + (1.0/alpha)*log(p(omega,mean(l2,l1))/p(omega,l1));
limit(ratsimp(%), alpha, 0);
expand(%);
Figure 1 displays a snapshot of the result, which can be easily simplified manually as

D_KL[p_{λ1}:p_{λ2}] = k log(λ2/λ1) + (λ1/λ2)^k − 1.   (80)
In general, the KLD between two Weibull densities with arbitrary shapes [4] is

D_KL[p_{λ1,k1}:p_{λ2,k2}] = log(k1/λ1^{k1}) − log(k2/λ2^{k2}) + (k1 − k2)(log λ1 − γ/k1) + (λ1/λ2)^{k2} Γ(k2/k1 + 1) − 1,   (81)

where γ denotes the Euler-Mascheroni constant. Thus when k1 = k2 = k, we recover Eq. 80 since Γ(2) = 1. (However, the family of Weibull densities with varying shape parameter is not an exponential family since the sufficient statistics depend on the shape parameter.)
In practice, we may program the formula of Eq. 77 by defining:

D_α(λ1:λ2) := log(p_{λ1}(ω)/p_{λ2}(ω)) + (1/α) log(p_{M_θ(λ2,λ1;α)}(ω)/p_{λ1}(ω)),   (82)

and approximate the KLD by D_α(λ1:λ2) for a small value of α (say, α = 10^{−3}). Thus we need only the conversion functions θ(·) and θ^{−1}(·) for defining M_θ, and the density p_λ(ω). This approximation also works for multivariate extensions of the quasi-arithmetic means.
Let us give two examples using the first-order approximation of the univariate quasi-arithmetic mean:
Example 8
Consider the family of exponential distributions p_λ(x) = λ e^{−λx} with support 𝒳 = [0, ∞). Set θ(u) = −u, θ^{−1}(u) = −u, θ′(u) = −1, and ω = 0 so that p_λ(0) = λ. We have

D_KL[p_{λ1}:p_{λ2}] = log(λ1/λ2) + lim_{α→0} (1/α) log( (α λ2 + (1−α) λ1)/λ1 )   (83)
= log(λ1/λ2) + (λ2 − λ1)/λ1   (84)
= log(λ1/λ2) + λ2/λ1 − 1.   (85)
Example 9
Consider the Poisson family with θ(u) = log u and θ^{−1}(u) = e^u (M_θ is the geometric mean), and ω = 0 so that p_λ(0) = e^{−λ}. We get

D_KL[p_{λ1}:p_{λ2}] = log(e^{−λ1}/e^{−λ2}) + lim_{α→0} (1/α) log( e^{−λ2^α λ1^{1−α}} / e^{−λ1} )   (86)
= λ2 − λ1 + lim_{α→0} (1/α) ( λ1 − λ1 (λ2/λ1)^α )   (87)
= λ2 − λ1 + λ1 log(λ1/λ2).   (88)
3.2 Kullback-Leibler divergence formula relying on the differential entropy and moments
Consider the Kullback-Leibler divergence [25] (relative entropy) between two probability densities p_{λ1} and p_{λ2}. When the densities belong to the same exponential family, the KLD amounts to a Legendre-Fenchel divergence (the canonical expression of divergences using the dual coordinate systems in dually flat spaces of information geometry [1]):

D_KL[p_{θ1}:p_{θ2}] = A_{F,F*}(θ2 : η1),   (89)

where the Legendre-Fenchel divergence is defined for a pair of strictly convex and differentiable conjugate generators F and F* by

A_{F,F*}(θ : η′) := F(θ) + F*(η′) − θ^⊤ η′,   (90)

with η = ∇F(θ).
Since F is defined modulo an affine function, we can choose F(θ) = F_ω(θ) = −log p_θ(ω). Furthermore, for exponential families, we have

η(λ) = E_{p_λ}[t(x)] = ∇F(θ(λ)),   (91)

and the Shannon entropy [46]

h(p) := −∫ p(x) log p(x) dμ(x)   (92)

admits the following expression [36] when p_λ belongs to an exponential family ℰ:

h(p_λ) = F(θ(λ)) − θ(λ)^⊤ η(λ) − E_{p_λ}[k(x)].   (93)
Thus if we already have at our disposal (1) the expectation of the sufficient statistics η(λ) and (2) the entropy, we can easily recover the Kullback-Leibler divergence as the cross-entropy minus the entropy:

D_KL[p_{λ1}:p_{λ2}] = −h(p_{λ1}) − E_{p_{λ1}}[log p_{λ2}(x)] = −h(p_{λ1}) − ( θ(λ2)^⊤ η(λ1) − F(θ(λ2)) + E_{p_{λ1}}[k(x)] ),   (94)

where F(θ(λ2)) = −log p_{λ2}(ω) + t(ω)^⊤ θ(λ2) + k(ω) can be evaluated cumulant-free from the density at any ω ∈ 𝒳.
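As a minimal sketch of this entropy-plus-moment route (our own illustration, on the exponential distribution family where h(p_λ) = 1 − log λ and E_{p_{λ1}}[x] = 1/λ1):

```python
import math

def exp_entropy(rate):
    # differential entropy of the exponential distribution
    return 1.0 - math.log(rate)

def exp_cross_entropy(r1, r2):
    # -E_{p1}[log p2(x)] with log p2(x) = log(r2) - r2 * x and E_{p1}[x] = 1/r1
    return -(math.log(r2) - r2 / r1)

def kld(r1, r2):
    # KLD = cross-entropy minus entropy (Eq. 94 pattern)
    return exp_cross_entropy(r1, r2) - exp_entropy(r1)

r1, r2 = 2.0, 5.0
assert abs(kld(r1, r2) - (math.log(r1 / r2) + r2 / r1 - 1.0)) < 1e-12
```

Off-the-shelf APIs usually expose exactly these two ingredients (entropy and moments), which is the point of the formula.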
For densities p_{λ1} and p_{λ2} belonging to the same exponential family, the Jeffreys divergence is

D_J[p_{λ1},p_{λ2}] = D_KL[p_{λ1}:p_{λ2}] + D_KL[p_{λ2}:p_{λ1}] = (θ(λ2) − θ(λ1))^⊤ (η(λ2) − η(λ1)).   (95)

It follows that we can write Jeffreys' divergence using the following cumulant-free expression:

D_J[p_{λ1},p_{λ2}] = (θ(λ2) − θ(λ1))^⊤ ( E_{p_{λ2}}[t(x)] − E_{p_{λ1}}[t(x)] ).   (96)
Note that a strictly monotone operator M defines a symmetric dissimilarity:

D_M(θ1,θ2) := (θ2 − θ1)^⊤ (M(θ2) − M(θ1)) ≥ 0,   (97)

with equality if and only if θ1 = θ2. Since η(θ) = ∇F(θ) is a strictly monotone operator and D_J[p_{θ1},p_{θ2}] = (θ2 − θ1)^⊤ (η(θ2) − η(θ1)), we may reinterpret the Jeffreys divergence as a symmetric dissimilarity induced by a strictly monotone operator.
Let us report now some illustrating examples. We start with an example illustrating the use of a separable multivariate quasi-arithmetic mean.
Example 10 (continue Example 5)
Consider the Dirichlet exponential family. The differential entropy of a Dirichlet density p_α is

h(p_α) = log B(α) + (α₀ − d) ψ(α₀) − ∑_{j=1}^d (α_j − 1) ψ(α_j),

where ψ denotes the digamma function and α₀ := ∑_{j=1}^d α_j. We have η(α) = (ψ(α_i) − ψ(α₀))_{i=1}^d since E_{p_α}[log x_i] = ψ(α_i) − ψ(α₀). It follows that the Kullback-Leibler divergence between p_α and p_{α′} is:

D_KL[p_α : p_{α′}] = log( B(α′)/B(α) ) + ∑_{i=1}^d (α_i − α′_i)(ψ(α_i) − ψ(α₀)).
Next, we report an example illustrating a non-separable multivariate quasi-arithmetic mean.
Example 11 (continue Example 6)
Consider the d-dimensional multivariate Gaussian family. We have η(λ) = (E[x], −E[xx^⊤]) with E[x] = μ and E[xx^⊤] = Σ + μμ^⊤, and the usual differential entropy is h(p_{μ,Σ}) = (1/2) log det(2πe Σ). We recover the Kullback-Leibler divergence as

D_KL[p_{μ1,Σ1}:p_{μ2,Σ2}] = (1/2)( tr(Σ2^{−1}Σ1) + (μ2 − μ1)^⊤ Σ2^{−1} (μ2 − μ1) − d + log(det Σ2 / det Σ1) ),   (98)

and the Jeffreys divergence as

D_J[p_{μ1,Σ1},p_{μ2,Σ2}] = (1/2) tr( Σ2^{−1}Σ1 + Σ1^{−1}Σ2 − 2I ) + (1/2)(μ2 − μ1)^⊤ (Σ1^{−1} + Σ2^{−1}) (μ2 − μ1).   (99)
3.3 The Kullback-Leibler divergence expressed as a log density ratio
Let us express the Bregman divergence with the equivalent generator F_ω(θ) = −log p_θ(ω) for any prescribed ω ∈ 𝒳 instead of the cumulant function F of Eq. 7. We get

B_{F_ω}(θ1:θ2) = F_ω(θ1) − F_ω(θ2) − (θ1 − θ2)^⊤ ∇F_ω(θ2)   (100)
= log(p_{θ2}(ω)/p_{θ1}(ω)) − (θ1 − θ2)^⊤ ∇F_ω(θ2).   (101)

Let us remark that

∇F_ω(θ) = −∇_θ log p_θ(ω) = η(θ) − t(ω).   (102)

We have

B_{F_ω}(θ1:θ2) = log(p_{θ2}(ω)/p_{θ1}(ω)) + (θ1 − θ2)^⊤ (t(ω) − η(θ2))   (103)
= F(θ1) − F(θ2) − (θ1 − θ2)^⊤ ∇F(θ2)   (104)
= B_F(θ1:θ2),   (105)

where F is the cumulant function of Eq. 7. Alternatively, we can also directly find that B_{F_ω} = B_F from Eq. 102, since ∇F_ω(θ) and ∇F(θ) = η(θ) differ only by the constant t(ω).
It follows that:

D_KL[p_{λ1}:p_{λ2}] = B_F(θ(λ2):θ(λ1)) = log(p_{λ1}(ω)/p_{λ2}(ω)) + (θ(λ2) − θ(λ1))^⊤ (t(ω) − η(λ1)).   (106)

Thus when t(ω) − η(λ1) is Euclidean-orthogonal to θ(λ2) − θ(λ1) (i.e., (θ(λ2) − θ(λ1))^⊤ (t(ω) − η(λ1)) = 0), the Bregman divergence (and the corresponding KLD on swapped parameter order of the densities) is expressed as a log-density ratio quantity. Let

𝒳_{λ1,λ2} := { ω ∈ 𝒳 : (θ(λ2) − θ(λ1))^⊤ (t(ω) − η(λ1)) = 0 }.   (107)

Then D_KL[p_{λ1}:p_{λ2}] = log(p_{λ1}(ω)/p_{λ2}(ω)) for all ω ∈ 𝒳_{λ1,λ2}.
Lemma 1
The Kullback-Leibler divergence between two densities p_{λ1} and p_{λ2} belonging to a same exponential family is expressed as a log density ratio, D_KL[p_{λ1}:p_{λ2}] = log(p_{λ1}(ω)/p_{λ2}(ω)), whenever ω ∈ 𝒳_{λ1,λ2}.

Notice that a sufficient condition is to choose ω such that t(ω) = η(λ1).
Thus if we carefully choose according to the source parameters, we may express the Kullback-Leibler divergence as a simple log density ratio without requiring the formula for the differential entropy nor the moment.
Example 12
Consider the exponential family of exponential distributions with p_λ(x) = λ e^{−λx} for λ > 0. We have t(x) = −x and η(λ) = −1/λ, so that t(ω) = η(λ1) amounts to −ω = −1/λ1, i.e., ω = 1/λ1 (and p_{λ1}(ω) = λ1 e^{−1}). In that case, we have

D_KL[p_{λ1}:p_{λ2}] = log( p_{λ1}(1/λ1) / p_{λ2}(1/λ1) )   (108)
= log( λ1 e^{−1} / (λ2 e^{−λ2/λ1}) )   (109)
= log(λ1/λ2) + λ2/λ1 − 1.   (110)

This formula matches the expression of Eq. 85.
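Example 12 is easily checked in code (a sketch of ours): with ω = 1/λ1, a single log density ratio reproduces the closed-form KLD.

```python
import math

def exp_pdf(x, rate):
    return rate * math.exp(-rate * x)

l1, l2 = 2.0, 7.0
w = 1.0 / l1                      # chosen so that t(w) = eta(lambda1)
kl = math.log(exp_pdf(w, l1) / exp_pdf(w, l2))

kl_exact = math.log(l1 / l2) + l2 / l1 - 1.0
assert abs(kl - kl_exact) < 1e-12
```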
Similarly, we may rewrite the Bregman divergence as

B_F(θ(λ1):θ(λ2)) = log( p_{λ2}(ω)/p_{λ1}(ω) )   (111)

for any ω chosen such that t(ω) = η(λ2) (since B_F(θ(λ1):θ(λ2)) = D_KL[p_{λ2}:p_{λ1}]).
Example 13
Consider the exponential family of d-dimensional Gaussians with fixed covariance matrix Σ:

ℰ_Σ = { p_{μ,Σ}(x) = (1/((2π)^{d/2} √(det Σ))) e^{−(1/2)(x−μ)^⊤ Σ^{−1} (x−μ)} : μ ∈ ℝ^d }.   (112)

It is an exponential family of order D = d with sufficient statistic t(x) = x and η(μ) = μ. Let us choose ω such that t(ω) = η(μ1): we choose ω = μ1. It follows that we have

D_KL[p_{μ1,Σ}:p_{μ2,Σ}] = log( p_{μ1,Σ}(μ1) / p_{μ2,Σ}(μ1) )   (113)
= log( 1 / e^{−(1/2)(μ1−μ2)^⊤ Σ^{−1} (μ1−μ2)} )   (114)
= (1/2)(μ2 − μ1)^⊤ Σ^{−1} (μ2 − μ1).   (115)

This is half of the squared Mahalanobis distance obtained for the precision matrix Σ^{−1}. We recover the Kullback-Leibler divergence between multivariate Gaussians (Eq. 98) when Σ1 = Σ2 = Σ.
Example 14
Consider the exponential family of Rayleigh distributions:

ℰ = { p_σ(x) = (x/σ²) e^{−x²/(2σ²)} : σ > 0 },   𝒳 = [0, ∞),   (116)

with t(x) = x². Let us choose ω such that t(ω) = η(σ1) with η(σ) = E_{p_σ}[x²] = 2σ². We choose ω = √2 σ1 (i.e., ω² = 2σ1²). It follows that we have

D_KL[p_{σ1}:p_{σ2}] = log( p_{σ1}(√2 σ1) / p_{σ2}(√2 σ1) )   (117)
= log( (σ2²/σ1²) e^{−1 + σ1²/σ2²} )   (118)
= log(σ2²/σ1²) + σ1²/σ2² − 1.   (119)
The next example shows a limitation of this method, which we shall then overcome.
Example 15
Consider the exponential family of univariate Gaussian distributions:

ℰ = { p_{μ,σ}(x) = (1/(σ√(2π))) e^{−(x−μ)²/(2σ²)} : (μ,σ) ∈ ℝ × ℝ₊₊ },   (120)

with t(x) = (x, x²). Let us try to choose ω such that t(ω) = η(λ1). Here, η(λ1) = (μ1, μ1² + σ1²) and t(ω) = (ω, ω²). We would need both ω = μ1 and ω² = μ1² + σ1², which is impossible since σ1 > 0. Thus we have t(ω) ≠ η(λ1) for any ω ∈ ℝ.

Let H denote the dual moment parameter space. When the exponential family is regular, Θ is an open convex set, and so is H (F is of Legendre type). The problem we faced with the last example is that t(𝒳) ∩ H = ∅. However, since Eq. 106 holds for any ω, let us choose s values ω1, …, ω_s (i.e., s log-density ratios), and average Eq. 106 for these values. We have

D_KL[p_{λ1}:p_{λ2}] = (1/s) ∑_{i=1}^s log(p_{λ1}(ω_i)/p_{λ2}(ω_i)) + (θ(λ2) − θ(λ1))^⊤ ( (1/s) ∑_{i=1}^s t(ω_i) − η(λ1) ).   (121)

We can now choose the ω_i's such that (1/s) ∑_{i=1}^s t(ω_i) = η(λ1). We need to choose s so that this system of equations is solvable.
Example 16
Let us continue Example 15. Consider s = 2. We need to find ω1 and ω2 such that we can solve the following system of equations:

(ω1 + ω2)/2 = μ1,   (ω1² + ω2²)/2 = μ1² + σ1².   (122)

The solution of this system of equations is ω1 = μ1 + σ1 and ω2 = μ1 − σ1. Thus it follows that we have:

D_KL[p_{μ1,σ1}:p_{μ2,σ2}] = (1/2)( log(p_{μ1,σ1}(μ1+σ1)/p_{μ2,σ2}(μ1+σ1)) + log(p_{μ1,σ1}(μ1−σ1)/p_{μ2,σ2}(μ1−σ1)) ).   (123)
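Eq. 123 can be verified numerically with a short sketch (ours) against the textbook Gaussian KLD formula:

```python
import math

def norm_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

mu1, s1, mu2, s2 = 0.5, 1.0, -1.0, 2.0

# Average the log density ratio at the two points mu1 +/- sigma1 (Eq. 123).
pts = (mu1 + s1, mu1 - s1)
kl = 0.5 * sum(math.log(norm_pdf(w, mu1, s1) / norm_pdf(w, mu2, s2)) for w in pts)

# Ordinary closed form for univariate normals.
kl_exact = math.log(s2 / s1) + (s1**2 + (mu1 - mu2) ** 2) / (2 * s2**2) - 0.5
assert abs(kl - kl_exact) < 1e-12
```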
We conclude with the following theorem extending Lemma 1:
Theorem 1
The Kullback-Leibler divergence between two densities p_{λ1} and p_{λ2} belonging to a full regular exponential family ℰ of order D can be expressed as the averaged sum of logarithms of density ratios:

D_KL[p_{λ1}:p_{λ2}] = (1/s) ∑_{i=1}^s log( p_{λ1}(ω_i) / p_{λ2}(ω_i) ),

where ω1, …, ω_s are s ≤ D + 1 distinct samples of 𝒳 chosen such that (1/s) ∑_{i=1}^s t(ω_i) = η(λ1).

The bound s ≤ D + 1 follows from Carathéodory's theorem [9].
Notice that the Monte Carlo stochastic approximation of the Kullback-Leibler divergence:

D̂ᵐ_KL[p_{λ1}:p_{λ2}] := (1/m) ∑_{i=1}^m log( p_{λ1}(x_i) / p_{λ2}(x_i) ),   (124)

for m independent and identically distributed (iid) variates x₁, …, x_m from p_{λ1}, is also an average sum of log density ratios which yields a consistent estimator, i.e.,

lim_{m→∞} D̂ᵐ_KL[p_{λ1}:p_{λ2}] = D_KL[p_{λ1}:p_{λ2}].   (125)
In practice, to avoid potential negative values of D̂ᵐ_KL, we estimate the extended Kullback-Leibler divergence:

D_{KL⁺}[p:q] := ∫ ( p(x) log(p(x)/q(x)) + q(x) − p(x) ) dμ(x),   (126)

with D̂ᵐ_{KL⁺}[p_{λ1}:p_{λ2}] = (1/m) ∑_{i=1}^m ( log(p_{λ1}(x_i)/p_{λ2}(x_i)) + p_{λ2}(x_i)/p_{λ1}(x_i) − 1 ).
Example 17
We continue Example 6 of the d-dimensional multivariate Gaussian family. First, let us consider the subfamily of zero-centered Gaussian densities. The sufficient statistic is a matrix: t(x) = xx^⊤ with η(Σ1) = Σ1. We find the vectors ω_i's from the eigendecomposition of Σ1:

Σ1 = ∑_{i=1}^d λ_i e_i e_i^⊤,   (127)

where the λ_i's are the eigenvalues and the e_i's the corresponding unit eigenvectors. Let ω_i = √(d λ_i) e_i for i ∈ {1, …, d}. We have

(1/d) ∑_{i=1}^d t(ω_i) = (1/d) ∑_{i=1}^d d λ_i e_i e_i^⊤ = Σ1 = η(Σ1).   (128)

It follows that we can express the KLD between two zero-centered Gaussians p_{Σ1} and p_{Σ2} as the following averaged sum of log density ratios:

D_KL[p_{Σ1}:p_{Σ2}] = (1/d) ∑_{i=1}^d log( p_{Σ1}(√(d λ_i) e_i) / p_{Σ2}(√(d λ_i) e_i) ),   (129)

where the λ_i's are the eigenvalues of Σ1 with the e_i's the corresponding eigenvectors. Notice that the order of the family is D = d(d+1)/2 but we used only d vectors ω_i.
The closed-form formula for the Kullback-Leibler divergence between two zero-centered Gaussians p_{Σ1} and p_{Σ2} is

D_KL[p_{Σ1}:p_{Σ2}] = (1/2)( tr(Σ2^{−1}Σ1) − d + log(det Σ2 / det Σ1) ).   (130)
Now consider the full family of multivariate normal densities. We shall use s = 2d vectors ω_i as follows:

ω_i = μ1 + √d σ_i,   ω_{d+i} = μ1 − √d σ_i,   i ∈ {1, …, d},   (133)

where σ_i denotes the vector extracted from the i-th column of a square root matrix S of the covariance matrix Σ1 (i.e., Σ1 = S S^⊤). It can be checked that (1/2d) ∑_{i=1}^{2d} ω_i = μ1 and (1/2d) ∑_{i=1}^{2d} ω_i ω_i^⊤ = μ1 μ1^⊤ + Σ1, so that (1/2d) ∑_i t(ω_i) = η(μ1,Σ1). These points are called the "sigma points" in the unscented transform [23, 17]. Thus it follows that

D_KL[p_{μ1,Σ1}:p_{μ2,Σ2}] = (1/2d) ∑_{i=1}^d ( log(p_{μ1,Σ1}(μ1+√d σ_i)/p_{μ2,Σ2}(μ1+√d σ_i)) + log(p_{μ1,Σ1}(μ1−√d σ_i)/p_{μ2,Σ2}(μ1−√d σ_i)) ).
This formula matches the ordinary formula for the Kullback-Leibler divergence between the two Gaussian densities p_{μ1,Σ1} and p_{μ2,Σ2}:

D_KL[p_{μ1,Σ1}:p_{μ2,Σ2}] = (1/2)( tr(Σ2^{−1}Σ1) + (μ2 − μ1)^⊤ Σ2^{−1} (μ2 − μ1) − d + log(det Σ2 / det Σ1) ).   (134)
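The sigma-point expression can be verified with a short NumPy sketch (ours); any square root S with Σ1 = SS^⊤ works, and we use a Cholesky factor:

```python
import numpy as np

def gauss_logpdf(x, mu, S):
    d = len(mu)
    dx = x - mu
    _, logdet = np.linalg.slogdet(S)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + dx @ np.linalg.solve(S, dx))

def kld_sigma_points(mu1, S1, mu2, S2):
    """KLD between Gaussians as the average of 2d log density ratios at sigma points."""
    d = len(mu1)
    L = np.linalg.cholesky(S1)            # S1 = L @ L.T; columns of L as roots
    total = 0.0
    for i in range(d):
        for sign in (+1.0, -1.0):
            w = mu1 + sign * np.sqrt(d) * L[:, i]
            total += gauss_logpdf(w, mu1, S1) - gauss_logpdf(w, mu2, S2)
    return total / (2 * d)

mu1, S1 = np.array([0.0, 1.0]), np.array([[2.0, 0.5], [0.5, 1.0]])
mu2, S2 = np.array([1.0, -1.0]), np.array([[1.0, 0.2], [0.2, 3.0]])

kl = kld_sigma_points(mu1, S1, mu2, S2)
dmu = mu2 - mu1
kl_exact = 0.5 * (np.trace(np.linalg.solve(S2, S1)) + dmu @ np.linalg.solve(S2, dmu)
                  - 2 + np.log(np.linalg.det(S2) / np.linalg.det(S1)))
assert abs(kl - kl_exact) < 1e-10
```

The identity is exact (not approximate) because the log density ratio is quadratic in x and the sigma points match the first two moments of p_{μ1,Σ1}.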
Example 18
We continue Example 4 of the exponential family of Gamma distributions, which has order D = 2. The sufficient statistic vector is t(x) = (x, log x), and we have E[x] = α/β and E[log x] = ψ(α) − log β, where ψ is the digamma function. We want to express D_KL[p_{α1,β1}:p_{α2,β2}] as an averaged sum of log ratios of densities. To find the values of ω1 and ω2, we need to solve the following system of equations:

(ω1 + ω2)/2 = α1/β1,   (log ω1 + log ω2)/2 = ψ(α1) − log β1,   (135)

with ω1 > 0 and ω2 > 0. We find the following two solutions:

ω_{1,2} = α1/β1 ± √( (α1² − e^{2ψ(α1)}) / β1² ).   (136)

We have α1² ≥ e^{2ψ(α1)} since ψ(α) ≤ log α.

It follows that

D_KL[p_{α1,β1}:p_{α2,β2}] = (1/2)( log(p_{α1,β1}(ω1)/p_{α2,β2}(ω1)) + log(p_{α1,β1}(ω2)/p_{α2,β2}(ω2)) ).   (137)

This expression of the KLD is the same as the ordinary expression of the KLD:

D_KL[p_{α1,β1}:p_{α2,β2}] = (α1 − α2) ψ(α1) − log Γ(α1) + log Γ(α2) + α2 log(β1/β2) + α1 (β2 − β1)/β1.   (138)
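The Gamma two-point formula can also be verified in code. In the sketch below (ours), the digamma function is approximated by a central difference of log Γ for self-containedness; use scipy.special.digamma when available:

```python
import math

def digamma(x, h=1e-5):
    # quick central-difference approximation of the digamma function (illustrative)
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def gamma_pdf(x, a, b):
    return (b**a) * (x ** (a - 1)) * math.exp(-b * x) / math.exp(math.lgamma(a))

a1, b1, a2, b2 = 3.0, 2.0, 4.0, 1.5

m = a1 / b1                                   # E[x]
g2 = math.exp(2 * digamma(a1)) / b1**2        # exp(2 E[log x])
r = math.sqrt(m * m - g2)
w1, w2 = m + r, m - r                         # the two omega points of Eq. 136

kl = 0.5 * (math.log(gamma_pdf(w1, a1, b1) / gamma_pdf(w1, a2, b2))
            + math.log(gamma_pdf(w2, a1, b1) / gamma_pdf(w2, a2, b2)))

kl_exact = ((a1 - a2) * digamma(a1) - math.lgamma(a1) + math.lgamma(a2)
            + a2 * math.log(b1 / b2) + a1 * (b2 - b1) / b1)
assert abs(kl - kl_exact) < 1e-6
```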
Example 19
Consider the exponential family of Beta distributions:

ℰ = { p_{α,β}(x) = x^{α−1}(1−x)^{β−1} / B(α,β) : α > 0, β > 0 },   𝒳 = (0,1),   (139)

where

B(α,β) := Γ(α)Γ(β)/Γ(α+β).   (140)

The sufficient statistic vector is t(x) = (log x, log(1−x)), with E[log x] = ψ(α) − ψ(α+β) and E[log(1−x)] = ψ(β) − ψ(α+β). We need to solve the following system of equations for (ω1, ω2) ∈ 𝒳²:

(log ω1 + log ω2)/2 = ψ(α1) − ψ(α1+β1),   (log(1−ω1) + log(1−ω2))/2 = ψ(β1) − ψ(α1+β1).   (141)

This system is equivalent to

ω1 ω2 = g₁²,   (1−ω1)(1−ω2) = g₂².   (142)

We find

ω1 = ( (1 + g₁² − g₂²) + √((1 + g₁² − g₂²)² − 4g₁²) ) / 2,   (143)
ω2 = ( (1 + g₁² − g₂²) − √((1 + g₁² − g₂²)² − 4g₁²) ) / 2,   (144)

where

g₁ := e^{ψ(α1) − ψ(α1+β1)},   g₂ := e^{ψ(β1) − ψ(α1+β1)}.   (145)

It follows that we have

D_KL[p_{α1,β1}:p_{α2,β2}] = (1/2)( log(p_{α1,β1}(ω1)/p_{α2,β2}(ω1)) + log(p_{α1,β1}(ω2)/p_{α2,β2}(ω2)) ).   (146)

This formula is equivalent to the ordinary formula for the KLD between two Beta densities p_{α1,β1} and p_{α2,β2}:

D_KL[p_{α1,β1}:p_{α2,β2}] = log( B(α2,β2)/B(α1,β1) ) + (α1 − α2) ψ(α1) + (β1 − β2) ψ(β1) + (α2 − α1 + β2 − β1) ψ(α1+β1).
Notice that the ω_i's are chosen according to the source parameter of the first argument of the KLD. Thus we may express the Voronoi bisector:

Bi(p_{λ1}, p_{λ2}) := { λ ∈ Λ : D_KL[p_λ : p_{λ1}] = D_KL[p_λ : p_{λ2}] },   (148)

as follows:

Bi(p_{λ1}, p_{λ2}) = { λ : ∑_{i=1}^s log( p_{λ1}(ω_i(λ)) / p_{λ2}(ω_i(λ)) ) = 0 }.   (149)

In particular, when s = 1, the Voronoi bisector is expressed as:

Bi(p_{λ1}, p_{λ2}) = { λ : p_{λ1}(ω(λ)) = p_{λ2}(ω(λ)) },   (150)

where ω(λ) emphasizes the fact that ω is a function of λ. The statistical Voronoi diagrams of a finite set of densities belonging to an exponential family have been studied as equivalent Bregman Voronoi diagrams in [7].
3.4 The Jensen-Shannon divergence
The Jensen-Shannon divergence [29] (JSD) is another symmetrization of the Kullback-Leibler divergence which can be given many information-theoretic interpretations [32] and which is further guaranteed to be always bounded by (KLD and JD are unbounded):
(151) | |||||
(152) |
In general, the JSD does not admit a closed-form formula [40]. However, in the particular case where the mixture belongs to the same parametric family of densities, we can calculate the Jensen-Shannon divergence using the entropy function, as shown in Eq. 152. For example, consider a mixture family in information geometry [1], that is, a statistical mixture with prescribed components which are linearly independent (so that all mixtures of the family are identifiable by their corresponding parameters). Let . In that particular case (e.g., a mixture family with prescribed Gaussian components), we get
(153) |
Thus the JSD for a mixture family can be expressed using the entropy as:
(154) |
Although we do not have a closed-form formula for the entropy of a mixture (except in a few cases, e.g., when the supports of the distributions are pairwise disjoint [32]), we can use any approximation method for calculating the entropy of a mixture to approximate or bound [40] the Jensen-Shannon divergence .
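For instance, the JSD between two continuous densities can be approximated by numerically integrating its defining integral. The following Python sketch (using SciPy; the helper names are ours) evaluates the JSD between two univariate Gaussian densities and checks the log 2 bound:

```python
import numpy as np
from scipy.integrate import quad

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def jsd_gauss(mu1, s1, mu2, s2):
    """Numerical JSD between two univariate Gaussian densities:
    (1/2) KL(p : m) + (1/2) KL(q : m) with m = (p + q)/2."""
    def integrand(x):
        p, q = gauss_pdf(x, mu1, s1), gauss_pdf(x, mu2, s2)
        m = 0.5 * (p + q)
        out = 0.0
        if p > 0:
            out += 0.5 * p * np.log(p / m)
        if q > 0:
            out += 0.5 * q * np.log(q / m)
        return out
    return quad(integrand, -np.inf, np.inf)[0]

val = jsd_gauss(0.0, 1.0, 5.0, 2.0)    # always falls in [0, log 2]
```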
4 Conclusion
We have described several methods to easily recover closed-form formulas for some common (dis)similarities between densities belonging to the same exponential family, formulas which are usually expressed using the cumulant function of the exponential family (e.g., the Kullback-Leibler divergence amounts to a reverse Bregman divergence and the Bhattacharyya distance amounts to a Jensen divergence). Our trick consists in observing that the generators of the Bregman or Jensen divergences are defined modulo an affine term, so that we may choose for any falling inside the support . It follows that the Bhattacharyya coefficient can be calculated with the following cumulant-free expression:
(155) |
where is a generalized quasi-arithmetic mean induced by the ordinary-to-natural parameter conversion function . Thus our method requires only a partial canonical factorization of the densities of an exponential family to get . The formulas for the Bhattacharyya distance, Hellinger distance, and -divergences follow straightforwardly:
(156) | |||||
(157) | |||||
(158) |
In practice, it is easy to program those formulas using legacy software APIs which offer many parametric densities in their libraries: First, we check that the distribution is an exponential family. Then we set to be any point of the support , partially factorize the distribution in order to retrieve and its reciprocal function , and, equipped with these functions, we implement the corresponding generalized weighted quasi-arithmetic mean function to calculate .
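As an illustrative sketch of this recipe (in Python; the choice of family, of the point, and all names are ours), consider the exponential family of exponential distributions: the ordinary-to-natural conversion is linear, so the induced quasi-arithmetic mean is the arithmetic mean of the rates, and the cumulant-free expression recovers the closed-form Bhattacharyya coefficient for any choice of the evaluation point:

```python
import numpy as np
from scipy.integrate import quad

def exp_pdf(x, lam):
    return lam * np.exp(-lam * x)

def bc_cumulant_free(l1, l2, omega=1.0):
    """Bhattacharyya coefficient rho = sqrt(p1(w) p2(w)) / p_M(w); since
    theta(lam) = -lam is linear, M is the arithmetic mean of the rates."""
    m = 0.5 * (l1 + l2)
    return np.sqrt(exp_pdf(omega, l1) * exp_pdf(omega, l2)) / exp_pdf(omega, m)

l1, l2 = 1.0, 3.0
rho = bc_cumulant_free(l1, l2)
rho_cf = 2 * np.sqrt(l1 * l2) / (l1 + l2)          # known closed form
rho_num = quad(lambda x: np.sqrt(exp_pdf(x, l1) * exp_pdf(x, l2)), 0, np.inf)[0]
```

The value returned by `bc_cumulant_free` does not depend on the chosen point `omega`, which illustrates the invariance discussed above.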
To calculate the Kullback-Leibler divergence (and Jeffreys’ divergence) without the explicit use of the cumulant function, we reported two methods: The first method consists in expressing the KLD as a limit of -skew Bhattacharyya distances, which can be written as:
(159) |
This limit can be calculated symbolically using a computer algebra system, or approximated for a small value of by
(160) |
When dealing with a uni-order exponential family, we can use a first-order approximation of the weighted quasi-arithmetic mean to express the KLD as the following limit:
(161) |
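A numerical illustration of this first method (a Python sketch; parameter values and names are ours) for the exponential family of exponential distributions, where the weighted quasi-arithmetic mean reduces to a weighted arithmetic mean of the rates:

```python
import numpy as np

def log_exp_pdf(x, lam):
    """Log-density of the exponential distribution with rate lam."""
    return np.log(lam) - lam * x

def kl_skew_approx(l1, l2, alpha=1e-5, omega=1.0):
    """(1/alpha) times the alpha-skew Bhattacharyya distance, written
    cumulant-free; the weighted quasi-arithmetic mean is the weighted
    arithmetic mean of the rates since theta(lam) = -lam is linear."""
    m = (1 - alpha) * l1 + alpha * l2
    b_alpha = (log_exp_pdf(omega, m)
               - (1 - alpha) * log_exp_pdf(omega, l1)
               - alpha * log_exp_pdf(omega, l2))
    return b_alpha / alpha

l1, l2 = 1.0, 2.0
kl_exact = np.log(l1 / l2) + l2 / l1 - 1    # ordinary closed-form KLD
```

Decreasing `alpha` tightens the approximation, at the expense of numerical cancellation in the log differences.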
Notice that we can also estimate , and related dissimilarities (e.g., when the cumulant function is intractable) using density ratio estimation techniques [48].
The second approach consists in using the entropy and moment formulas which are often available when dealing with parametric distributions. When the parametric distributions form an exponential family, the KLD is equivalent to a Legendre-Fenchel divergence, and we write this Legendre-Fenchel divergence as:
(162) |
It follows that the Jeffreys’ divergence is expressed as
(163) |
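A minimal Python sketch of this second method for univariate Gaussians (the entropy and cross-entropy expressions below are the standard moment-based ones): the KLD is recovered as the cross-entropy minus the entropy, and it agrees with the ordinary closed-form formula:

```python
import numpy as np

def kl_gauss_entropy_moments(mu1, s1, mu2, s2):
    """KLD as cross-entropy minus entropy; both terms only require the
    entropy formula and the first two moments of the first density."""
    h = 0.5 * np.log(2 * np.pi * np.e * s1 ** 2)              # entropy of p1
    cross = (0.5 * np.log(2 * np.pi * s2 ** 2)
             + (s1 ** 2 + (mu1 - mu2) ** 2) / (2 * s2 ** 2))  # cross-entropy
    return cross - h

def kl_gauss_closed_form(mu1, s1, mu2, s2):
    return np.log(s2 / s1) + (s1 ** 2 + (mu1 - mu2) ** 2) / (2 * s2 ** 2) - 0.5
```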
Finally, we proved in §3.3 that the Kullback-Leibler divergence between two densities and of an exponential family of order can be expressed as
where are distinct samples of chosen such that . We illustrated how to find the ’s for the univariate Gaussian family and the multivariate zero-centered Gaussian family.
To conclude this work, let us emphasize that we have revealed a new kind of invariance when providing closed-form formulas for common (dis)similarities between densities of an exponential family without explicitly using the cumulant function of that exponential family: For the Bhattacharyya/Hellinger/-divergences, the can be chosen as an arbitrary point of the support . For the Kullback-Leibler divergence, by carefully choosing a set of ’s, we may express the Kullback-Leibler divergence as a weighted sum of log density ratios.
References
- [1] Shun-ichi Amari. Information Geometry and Its Applications. Applied Mathematical Sciences. Springer Japan, 2016.
- [2] Tsuyoshi Ando and Fumio Hiai. Operator log-convex functions and operator means. Mathematische Annalen, 350(3):611–630, 2011.
- [3] Ole Barndorff-Nielsen. Information and exponential families. John Wiley & Sons, 2014.
- [4] Christian Bauckhage. Computing the Kullback–Leibler divergence between two Weibull distributions. arXiv preprint arXiv:1310.3713, 2013.
- [5] Anil Bhattacharyya. On a measure of divergence between two multinomial populations. Sankhyā: The Indian Journal of Statistics, pages 401–406, 1946.
- [6] Patrick Billingsley. Probability and Measure. Wiley, 3 edition, April 1995.
- [7] Jean-Daniel Boissonnat, Frank Nielsen, and Richard Nock. Bregman Voronoi diagrams. Discrete & Computational Geometry, 44(2):281–307, 2010.
- [8] Lev M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200–217, 1967.
- [9] Constantin Carathéodory. Über den Variabilitätsbereich der Koeffizienten von Potenzreihen, die gegebene Werte nicht annehmen. Mathematische Annalen, 64(1):95–115, 1907.
- [10] Herman Chernoff et al. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, 23(4):493–507, 1952.
- [11] Francis Clarke. On the inverse function theorem. Pacific Journal of Mathematics, 64(1):97–102, 1976.
- [12] Thomas M. Cover and Joy A. Thomas. Elements of information theory. John Wiley & Sons, 2012.
- [13] Imre Csiszár. Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2:229–318, 1967.
- [14] Joan Del Castillo. The singly truncated normal distribution: a non-steep exponential family. Annals of the Institute of Statistical Mathematics, 46(1):57–66, 1994.
- [15] Shinto Eguchi and Osamu Komori. Path connectedness on a space of probability density functions. In International Conference on Geometric Science of Information, pages 615–624. Springer, 2015.
- [16] Jun Ichi Fujii. Path of quasi-means as a geodesic. Linear Algebra and its Applications, 434(2):542–558, 2011.
- [17] Jacob Goldberger, Hayit K Greenspan, and Jeremie Dreyfuss. Simplifying mixture models using the unscented transform. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(8):1496–1502, 2008.
- [18] Teofilo F Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293–306, 1985.
- [19] Ernst Hellinger. Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen. Journal für die reine und angewandte Mathematik (Crelles Journal), 136:210–271, 1909.
- [20] Harold Jeffreys. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences, 186(1007):453–461, 1946.
- [21] Robert Jenssen, Jose C Principe, Deniz Erdogmus, and Torbjørn Eltoft. The Cauchy–Schwarz divergence and Parzen windowing: Connections to graph theory and Mercer kernels. Journal of the Franklin Institute, 343(6):614–629, 2006.
- [22] Shihao Ji, Balaji Krishnapuram, and Lawrence Carin. Variational Bayes for continuous hidden Markov models and its application to active learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):522–532, 2006.
- [23] Simon Julier, Jeffrey Uhlmann, and Hugh F Durrant-Whyte. A new method for the nonlinear transformation of means and covariances in filters and estimators. IEEE Transactions on Automatic Control, 45(3):477–482, 2000.
- [24] Thomas Kailath. The divergence and Bhattacharyya distance measures in signal selection. IEEE Transactions on Communication Technology, 15(1):52–60, 1967.
- [25] Solomon Kullback and Richard A Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951.
- [26] Frank Nielsen. Closed-form information-theoretic divergences for statistical mixtures. In Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012), pages 1723–1726. IEEE, 2012.
- [27] Frank Nielsen. An information-geometric characterization of Chernoff information. IEEE Signal Processing Letters, 20(3):269–272, 2013.
- [28] Frank Nielsen. An elementary introduction to information geometry. arXiv preprint arXiv:1808.08271, 2018.
- [29] Frank Nielsen. On the Jensen–Shannon symmetrization of distances relying on abstract means. Entropy, 21(5):485, 2019.
- [30] Frank Nielsen. A generalization of the -divergences based on comparable and distinct weighted means. CoRR, abs/2001.09660, 2020.
- [31] Frank Nielsen. A note on Onicescu’s informational energy and correlation coefficient in exponential families. arXiv preprint arXiv:2003.13199, 2020.
- [32] Frank Nielsen. On a generalization of the Jensen–Shannon divergence and the Jensen–Shannon centroid. Entropy, 22(2):221, 2020.
- [33] Frank Nielsen and Sylvain Boltz. The Burbea-Rao and Bhattacharyya centroids. IEEE Transactions on Information Theory, 57(8):5455–5466, 2011.
- [34] Frank Nielsen and Vincent Garcia. Statistical exponential families: A digest with flash cards. Technical report, arXiv:0911.4863, 2009.
- [35] Frank Nielsen and Gaëtan Hadjeres. On power chi expansions of -divergences. arXiv preprint arXiv:1903.05818, 2019.
- [36] Frank Nielsen and Richard Nock. Entropies and cross-entropies of exponential families. In IEEE International Conference on Image Processing, pages 3621–3624. IEEE, 2010.
- [37] Frank Nielsen and Richard Nock. A closed-form expression for the Sharma–Mittal entropy of exponential families. Journal of Physics A: Mathematical and Theoretical, 45(3):032003, 2011.
- [38] Frank Nielsen and Richard Nock. On the chi square and higher-order chi distances for approximating -divergences. IEEE Signal Processing Letters, 21(1):10–13, 2013.
- [39] Frank Nielsen and Richard Nock. Patch matching with polynomial exponential families and projective divergences. In International Conference on Similarity Search and Applications, pages 109–116. Springer, 2016.
- [40] Frank Nielsen and Ke Sun. Guaranteed bounds on information-theoretic measures of univariate mixtures using piecewise log-sum-exp inequalities. Entropy, 18(12):442, 2016.
- [41] Frank Nielsen, Ke Sun, and Stéphane Marchand-Maillet. On Hölder projective divergences. Entropy, 19(3):122, 2017.
- [42] Emilio Porcu, Jorge Mateu, and George Christakos. Quasi-arithmetic means of covariance functions with potential applications to space–time data. Journal of Multivariate Analysis, 100(8):1830–1844, 2009.
- [43] Thomas W Rauber, Tim Braun, and Karsten Berns. Probabilistic distance measures of the Dirichlet and Beta distributions. Pattern Recognition, 41(2):637–645, 2008.
- [44] Robert H Risch. The solution of the problem of integration in finite terms. Bulletin of the American Mathematical Society, 76(3):605–608, 1970.
- [45] Christophe Saint-Jean and Frank Nielsen. Hartigan’s method for -mle: Mixture modeling with Wishart distributions and its application to motion retrieval. In Geometric theory of information, pages 301–330. Springer, 2014.
- [46] Claude E Shannon. A mathematical theory of communication. Bell System Technical Journal, 27(3):379–423, 1948.
- [47] Jana Špirková. Weighted operators based on dissimilarity function. Information Sciences, 281:172–181, 2014.
- [48] Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density ratio estimation in machine learning. Cambridge University Press, 2012.
- [49] John Wishart. The generalised product moment distribution in samples from a normal multivariate population. Biometrika, pages 32–52, 1928.
Closed-form formulas using the Maxima computer algebra system
Since the statistical (dis)similarities rely on integral calculations, we may use symbolic calculations to check the results. For example, below are some code snippets written in Maxima (http://maxima.sourceforge.net/). The code snippets below calculate symbolically the Bhattacharyya coefficient for several exponential families.
/* Quasi-arithmetic mean associated with the univariate Gaussian family */
ptheta(lambda):=[lambda[1]/lambda[2],-1/(2*lambda[2])];
plambda(theta):=[-theta[1]/(2*theta[2]),-1/(2*theta[2])];
ptheta(plambda([t0,t1]));
l1: [p1,p2];
l2: [q1,q2];
plambda(0.5*ptheta(l1)+0.5*ptheta(l2));
ratsimp(%);
/* end */

/* Quasi-arithmetic mean associated with the inverse Gaussian family */
ptheta(lambda):=[-lambda[2]/(2*lambda[1]**2),-lambda[2]/2];
plambda(theta):=[sqrt(theta[2]/theta[1]),-2*theta[2]];
ptheta(plambda([t0,t1]));
l1: [p1,p2];
l2: [q1,q2];
plambda(0.5*ptheta(l1)+0.5*ptheta(l2));
ratsimp(%);
/* end */

/* Exponential family of exponential distributions */
assume(lambda1>0);
assume(lambda2>0);
p(x,lambda) := lambda*exp(-lambda*x);
integrate(sqrt(p(x,lambda1)*p(x,lambda2)),x,0,inf);
ratsimp(%);
/* end */

/* Exponential family of zero-centered Gaussian densities */
assume(sigma>0);
p(x,sigma) := (1/(sqrt(2*%pi)*sigma))*exp(-x**2/(2*sigma**2));
assume(sigma1>0);
assume(sigma2>0);
integrate(sqrt(p(x,sigma1)*p(x,sigma2)),x,-inf,inf);
ratsimp(%);
/* end */

/* Exponential family of centered Laplacian distributions */
assume(lambda1>0);
assume(lambda2>0);
p(x,lambda) := (1/(2*lambda))*exp(-abs(x)/lambda);
integrate(sqrt(p(x,lambda1)*p(x,lambda2)),x,-inf,inf);
ratsimp(%);
/* end */

/* Exponential family of Weibull densities with prescribed shape parameter k */
declare(k, integer);
assume(k>=1);
assume(lambda1>0);
assume(lambda2>0);
p(x,lambda) := (k/lambda)*(x/lambda)**(k-1)*exp(-(x/lambda)**k);
integrate(sqrt(p(x,lambda1)*p(x,lambda2)),x,0,inf);
expand(ratsimp(%));
/* end */

/* KLD between Weibull distributions by symbolic computing of the limit */
declare(k, integer);
assume(lambda1>0);
assume(lambda2>0);
k:3;
omega:1;
t(u):=u**(-k);
tp(u):=k*u**(-k-1);
p(x,l):=(k/l)*((x/l)**(k-1))*exp(-(x/l)**k);
mean(l1,l2):=l1+alpha*(t(l1)-t(l2))/tp(l1);
log(p(omega,l1)/p(omega,l2)) + (1.0/alpha)*log(p(omega,mean(l1,l2))/p(omega,l1));
limit(ratsimp(%), alpha, 0);
expand(%);
/* end */
Appendix A Further illustrating examples
The Laplacian exponential family illustrates the use of the harmonic mean for :
Example 20
Consider the family of zero-centered Laplacian densities [34] for . We have for , so that , where denotes the harmonic mean. Applying the cumulant-free formula Eq. 49, we get
Note that the arithmetic mean dominates the geometric mean (i.e., ) so that . It follows that the Bhattacharyya distance is
This yields the same formula as for the exponential distribution.
Example 21
The univariate Gaussian distribution with source parameters has density . We have and . We have
(164) |
Our next example illustrates an unusual non-separable mean derived from the inverse Gaussian family:
Example 22
Consider the family of inverse Gaussian densities:
with , and source parameters . The inverse Gaussian densities form an exponential family with natural parameters . The inverse conversion function is . Thus the generalized quasi-arithmetic mean induced by the multivariate is
The calculation can be checked using a computer algebra system (detailed in Appendix Closed-form formula using the Maxima computer algebra system). We choose .
Next, we consider the case of the exponential family of central Wishart densities [45].
Example 23
The density of a Wishart distribution [49] with positive-definite scale matrix and number of degrees of freedom (matrix generalization of the chi-squared distribution) is:
(165) |
where denotes the multivariate Gamma function:
(166) |
The sample space is the set of positive-definite matrices. The natural parameter consists of a scalar part and a matrix part :
(167) |
The inverse natural-to-ordinary conversion function is
(168) |
It follows that the quasi-arithmetic mean is the following separable scalar-vector quasi-arithmetic mean (scalar arithmetic mean with a harmonic matrix mean):
(169) |
We choose (the identity matrix) so that
(170) |
The sufficient statistic is . It follows that (see [22]), where is the multivariate digamma function (the derivative of the logarithm of the multivariate gamma function). The differential entropy is:
It follows that the Kullback-Leibler divergence between two central Wishart densities is:
(171) |
This last example exhibits an unusual mean when dealing with the Bernoulli family:
Example 24
Consider the Bernoulli family with density where and denotes the probability of success. The Bernoulli family is a discrete exponential family. We have and . The associated quasi-arithmetic mean is
(172) |
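This quasi-arithmetic mean can be sketched in Python (function names are ours); plugging it into the cumulant-free expression recovers the Bhattacharyya coefficient between two Bernoulli distributions, independently of the choice of the evaluation point in the sample space {0, 1}:

```python
import numpy as np

def theta(lam):                # ordinary -> natural: the logit function
    return np.log(lam / (1 - lam))

def lam_of_theta(t):           # natural -> ordinary: the logistic function
    return 1.0 / (1.0 + np.exp(-t))

def qa_mean(l1, l2):           # quasi-arithmetic mean induced by the logit
    return lam_of_theta(0.5 * (theta(l1) + theta(l2)))

def bernoulli_pmf(x, l):
    return l if x == 1 else 1 - l

def bc_cumulant_free(l1, l2, omega=0):
    """rho = sqrt(p1(w) p2(w)) / p_M(w) with M the logit-induced mean."""
    m = qa_mean(l1, l2)
    return (np.sqrt(bernoulli_pmf(omega, l1) * bernoulli_pmf(omega, l2))
            / bernoulli_pmf(omega, m))

l1, l2 = 0.2, 0.7
rho = bc_cumulant_free(l1, l2)
rho_direct = np.sqrt(l1 * l2) + np.sqrt((1 - l1) * (1 - l2))  # sum over {0, 1}
```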