
Cumulant-free closed-form formulas for some common (dis)similarities between densities of an exponential family

Frank Nielsen Richard Nock
Abstract

It is well-known that the Bhattacharyya, Hellinger, Kullback-Leibler, \alpha-divergences, and Jeffreys' divergences between densities belonging to the same exponential family have generic closed-form formulas relying on the strictly convex and real-analytic cumulant function characterizing the exponential family. In this work, we report (dis)similarity formulas which bypass the explicit use of the cumulant function and highlight the role of quasi-arithmetic means and their multivariate mean operator extensions. In practice, these cumulant-free formulas are handy when implementing these (dis)similarities using legacy Application Programming Interfaces (APIs) since our method only requires a partial canonical factorization of the densities of the considered exponential family.

Keywords: Bhattacharyya coefficient; Bhattacharyya distance; Hellinger distance; Jensen-Shannon divergence; Kullback-Leibler divergence; α\alpha-divergences; Jeffreys divergence; Cauchy-Schwarz divergence; quasi-arithmetic means; inverse function theorem; strictly monotone operator; mean operator; information geometry; exponential family; mixture family.

1 Introduction

Let (\mathcal{X},\mathcal{F},\mu) be a measure space [6] with sample space \mathcal{X}, \sigma-algebra of events \mathcal{F}, and positive measure \mu (e.g., the Lebesgue or counting measure). The Kullback-Leibler divergence [25] (KLD), Jeffreys' divergence [20] (JD), Bhattacharyya coefficient (BC), Bhattacharyya distance [5, 24] (BD) and Hellinger distance [19] (HD) between two probability measures P and Q dominated by \mu with respective Radon-Nikodym densities p=\frac{\mathrm{d}P}{\mathrm{d}\mu} and q=\frac{\mathrm{d}Q}{\mathrm{d}\mu} are statistical (dis)similarities defined respectively by:

DKL[p:q]\displaystyle D_{\mathrm{KL}}[p:q] :=p(x)logp(x)q(x)dμ(x),\displaystyle:=\int p(x)\log\frac{p(x)}{q(x)}\mathrm{d}\mu(x), (Kullback-Leibler divergence) (1)
DJ[p,q]\displaystyle D_{J}[p,q] :=DKL[p:q]+DKL[q:p],\displaystyle:=D_{\mathrm{KL}}[p:q]+D_{\mathrm{KL}}[q:p], (Jeffreys’ divergence) (2)
ρ[p,q]\displaystyle\rho[p,q] :=𝒳p(x)q(x)dμ(x),\displaystyle:=\int_{\mathcal{X}}\sqrt{p(x)\ q(x)}\mathrm{d}\mu(x), (Bhattacharyya coefficient) (3)
DB[p,q]\displaystyle D_{B}[p,q] :=log(ρ[p,q]),\displaystyle:=-\log(\rho[p,q]), (Bhattacharyya distance) (4)
DH[p,q]\displaystyle D_{H}[p,q] :=1ρ[p,q].\displaystyle:=\sqrt{1-\rho[p,q]}. (Hellinger distance) (5)

KLD is an oriented distance (i.e., DKL[p:q]DKL[q:p]D_{\mathrm{KL}}[p:q]\not=D_{\mathrm{KL}}[q:p]). JD is a common symmetrization of the KLD which is not a metric distance because JD fails the triangle inequality. BC is a similarity which ensures that ρ[p,q](0,1]\rho[p,q]\in(0,1]. BD is a symmetric non-metric distance, and HD is a metric distance. (KLD and JD are homogeneous divergences.)

All these divergences require computing definite integrals which can in theory be calculated using the Risch pseudo-algorithm [44], which depends on an oracle. In practice, computer algebra systems like Maple® or Maxima implement different subsets of the theoretical Risch pseudo-algorithm and thus face different limitations.

When the densities p and q belong to the same parametric family \mathcal{E}=\{p_{\theta}\ :\ \theta\in\Theta\} of densities, i.e., p=p_{\theta_{1}} and q=p_{\theta_{2}}, these (dis)similarities yield equivalent parameter (dis)similarities. For example, we get the parameter divergence P(\theta_{1}:\theta_{2})=D_{\mathrm{KL}}^{\mathcal{E}}(\theta_{1}:\theta_{2}):=D_{\mathrm{KL}}(p_{\theta_{1}}:p_{\theta_{2}}). Notation-wise, we use brackets to indicate that the (dis)similarity arguments are densities, and parentheses to indicate parameter (dis)similarities.

In particular, when \mathcal{E} is a natural exponential family [34, 3] (NEF)

={pθ(x)=exp(θt(x)F(θ)+k(x)):θΘ},\mathcal{E}=\left\{p_{\theta}(x)=\exp\left(\theta^{\top}t(x)-F(\theta)+k(x)\right)\ :\ \theta\in\Theta\right\}, (6)

with t(x)t(x) denoting the sufficient statistics, k(x)k(x) an auxiliary measure carrier term, and

F(θ):=log(𝒳exp(θt(x))dμ(x)),F(\theta):=\log\left(\int_{\mathcal{X}}\exp(\theta^{\top}t(x))\mathrm{d}\mu(x)\right), (7)

the cumulant function111More precisely, \kappa_{\theta}(u)=F(\theta+u)-F(\theta) is the cumulant generating function of the sufficient statistic t(x) from which all moments can be recovered (the cumulant generating function being the logarithm of the moment generating function). also called the log-normalizer, log-partition function, free energy, or log-Laplace transform. Parameter \theta is called the natural parameter. The cumulant function is a strictly convex and smooth function (a real-analytic function [3]) defined on the open convex natural parameter space \Theta. Let D denote the dimension of the parameter space \Theta (i.e., the order of the exponential family) and d the dimension of the sample space \mathcal{X}. We further assume full regular exponential families [3] so that \Theta is a non-empty open convex domain and t(x) is a minimal sufficient statistic [14].

Many common families of distributions \{p_{\lambda}(x)\ :\ \lambda\in\Lambda\} are exponential families in disguise after reparameterization: p_{\lambda}(x)=p_{\theta(\lambda)}(x). Those families are called exponential families (i.e., EFs, and not natural EFs, to emphasize that \theta(u)\not=u), and their densities are canonically factorized as follows:

p(x;λ)=exp(θ(λ)t(x)F(θ(λ))+k(x)).p(x;\lambda)=\exp\left(\theta(\lambda)^{\top}t(x)-F(\theta(\lambda))+k(x)\right). (8)

We call parameter \lambda the source parameter (or the ordinary parameter, with \lambda\in\Lambda, the source/ordinary parameter space) and parameter \theta(\lambda)\in\Theta is the corresponding natural parameter. Notice that the canonical parameterization of Eq. 8 is not unique: For example, adding a constant term c\in\mathbb{R} to F(\theta) can be compensated by subtracting this constant from k(x), or multiplying the sufficient statistic t(x) by a symmetric invertible matrix A can be compensated by multiplying \theta(\lambda) by the inverse of A so that (A^{-1}\theta(\lambda))^{\top}(At(x))=\theta(\lambda)^{\top}A^{-1}At(x)=\theta(\lambda)^{\top}t(x).

Another useful parameterization of exponential families is the moment parameter [34, 3]: η=Epθ[t(x)]=F(θ)\eta=E_{p_{\theta}}[t(x)]=\nabla F(\theta). The moment parameter space shall be denoted by HH:

H={Epθ[t(x)]:θΘ}.H=\left\{E_{p_{\theta}}[t(x)]\ :\ \theta\in\Theta\right\}. (9)

Let C=\overline{\mathrm{CH}}\{t(x)\ :\ x\in\mathcal{X}\} denote the closed convex hull of the range of the sufficient statistic. When H=\mathrm{int}(C), the family is said to be steep [14]. In the remainder, we consider full regular and steep exponential families.

To give one example, consider the family of univariate normal densities:

𝒩:={p(x;λ)=1σ2πexp(12(xμσ)2),λ=(μ,σ2)×++}.\mathcal{N}:=\left\{p(x;\lambda)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right),\ \lambda=(\mu,\sigma^{2})\in\mathbb{R}\times\mathbb{R}_{++}\right\}. (10)

Family 𝒩\mathcal{N} is interpreted as an exponential family of order D=2D=2 with univariate parametric densities (d=1d=1), indexed by the source  parameter λ=(μ,σ2)Λ\lambda=(\mu,\sigma^{2})\in\Lambda with Λ=×++\Lambda=\mathbb{R}\times\mathbb{R}_{++}. The corresponding natural parameter is θ(λ)=(μσ2,12σ2)\theta(\lambda)=\left(\frac{\mu}{\sigma^{2}},-\frac{1}{2\sigma^{2}}\right), and the moment parameter is η(λ)=(Epλ[x],Epλ[x2])=(μ,μ2+σ2)\eta(\lambda)=(E_{p_{\lambda}}[x],E_{p_{\lambda}}[x^{2}])=(\mu,\mu^{2}+\sigma^{2}) since t(x)=(x,x2)t(x)=(x,x^{2}). The cumulant function for the normal family is F(θ)=θ124θ2+12log(πθ2)F(\theta)=-\frac{\theta_{1}^{2}}{4\theta_{2}}+\frac{1}{2}\log\left(-\frac{\pi}{\theta_{2}}\right).

When densities p=pθ1p=p_{\theta_{1}} and q=pθ2q=p_{\theta_{2}} both belong to the same exponential family, we have the following well-known closed-form expressions [33, 34] for the (dis)similarities introduced formerly:

DKL[pθ1:pθ2]\displaystyle D_{\mathrm{KL}}[p_{\theta_{1}}:p_{\theta_{2}}] =BF(θ1:θ2),\displaystyle={B_{F}}^{*}(\theta_{1}:\theta_{2}), Kullback-Leibler divergence (11)
DJ[pθ1:pθ2]\displaystyle D_{J}[p_{\theta_{1}}:p_{\theta_{2}}] =(θ2θ1)(η2η1),\displaystyle=(\theta_{2}-\theta_{1})^{\top}(\eta_{2}-\eta_{1}), Jeffreys’ divergence (12)
ρ[pθ1:pθ2]\displaystyle\rho[p_{\theta_{1}}:p_{\theta_{2}}] =exp(JF(θ1:θ2)),\displaystyle=\exp(-J_{F}(\theta_{1}:\theta_{2})), Bhattacharyya coefficient (13)
DB[pθ1:pθ2]\displaystyle D_{B}[p_{\theta_{1}}:p_{\theta_{2}}] =JF(θ1:θ2)\displaystyle=J_{F}(\theta_{1}:\theta_{2}) Bhattacharyya distance (14)
DH[pθ1:pθ2]\displaystyle D_{H}[p_{\theta_{1}}:p_{\theta_{2}}] =1exp(JF(θ1:θ2)),\displaystyle=\sqrt{1-\exp(-J_{F}(\theta_{1}:\theta_{2}))}, Hellinger distance (15)

where DD^{*} indicates the reverse divergence D(θ1:θ2):=D(θ2:θ1)D^{*}(\theta_{1}:\theta_{2}):=D(\theta_{2}:\theta_{1}) (parameter swapping222Here, the star ‘*’ is not to be confused with the Legendre-Fenchel transform used in information geometry [1]. For a Bregman divergence BF(θ1:θ2)B_{F}(\theta_{1}:\theta_{2}), we have the reverse Bregman divergence BF(θ1:θ2)=BF(θ2:θ1)=BF(F(θ1):F(θ2)){B_{F}}^{*}(\theta_{1}:\theta_{2})=B_{F}(\theta_{2}:\theta_{1})=B_{F^{*}}(\nabla F(\theta_{1}):\nabla F(\theta_{2})), where FF^{*} denotes convex conjugate of FF obtained by the Legendre-Fenchel transform.), and BFB_{F} and JFJ_{F} are the Bregman divergence [8] and the Jensen divergence [33] induced by the functional generator FF, respectively:

BF(θ1:θ2)\displaystyle B_{F}(\theta_{1}:\theta_{2}) :=\displaystyle:= F(θ1)F(θ2)(θ1θ2)F(θ2)\displaystyle F(\theta_{1})-F(\theta_{2})-(\theta_{1}-\theta_{2})^{\top}\nabla F(\theta_{2}) (16)
JF(θ1:θ2)\displaystyle J_{F}(\theta_{1}:\theta_{2}) :=\displaystyle:= F(θ1)+F(θ2)2F(θ1+θ22).\displaystyle\frac{F(\theta_{1})+F(\theta_{2})}{2}-F\left(\frac{\theta_{1}+\theta_{2}}{2}\right). (17)

In information geometry, the Bregman divergences are the canonical divergences of dually flat spaces [28].

More generally, the Bhattacharyya distance/coefficient can be skewed with a parameter \alpha\in(0,1) to yield the \alpha-skewed Bhattacharyya distance/coefficient [33]:

D_{B,\alpha}[p:q] := -\log(\rho_{\alpha}[p:q]), (18)
\rho_{\alpha}[p:q] := \int_{\mathcal{X}}p(x)^{\alpha}q(x)^{1-\alpha}\mathrm{d}\mu(x) (19)
 = \int_{\mathcal{X}}q(x)\left(\frac{p(x)}{q(x)}\right)^{\alpha}\mathrm{d}\mu(x). (20)

The ordinary Bhattacharyya distance and coefficient are recovered for \alpha=\frac{1}{2}. In statistics, the maximal skewed Bhattacharyya distance is called the Chernoff information [10, 27], used in Bayesian hypothesis testing:

DC[p:q]:=maxα(0,1)DB,α[p:q].D_{C}[p:q]:=\max_{\alpha\in(0,1)}D_{B,\alpha}[p:q]. (21)

Notice that the Bhattacharyya skewed α\alpha-coefficient of Eq. 20 also appears in the definition of the α\alpha-divergences [1] (with α\alpha\in\mathbb{R}):

Dα[p:q]:={1α(1α)(1ρα[p:q]),α\{0,1}D1[p:q]=DKL[p:q],α=1D0[p:q]=DKL[q:p],α=0D_{\alpha}[p:q]:=\left\{\begin{array}[]{ll}\frac{1}{\alpha(1-\alpha)}\left(1-\rho_{\alpha}[p:q]\right),&\alpha\in\mathbb{R}\backslash\{0,1\}\\ D_{1}[p:q]=D_{\mathrm{KL}}[p:q],&\alpha=1\\ D_{0}[p:q]=D_{\mathrm{KL}}[q:p],&\alpha=0\end{array}\right. (22)

The α\alpha-divergences belong to the class of ff-divergences [13] which are the invariant divergences333A statistical invariant divergence DD is such that D[pθ:pθ+dθ]=i,jgij(θ)dθidθjD[p_{\theta}:p_{\theta+\mathrm{d}\theta}]=\sum_{i,j}g_{ij}(\theta)\mathrm{d}\theta_{i}\mathrm{d}\theta_{j}, where I(θ)=[gij(θ)]I(\theta)=[g_{ij}(\theta)] is the Fisher information matrix [1]. in information geometry [1].

When densities p=pθ1p=p_{\theta_{1}} and q=pθ2q=p_{\theta_{2}} both belong to the same exponential family \mathcal{E}, we get the following closed-form formula [33]:

DB,α[pθ1:pθ2]\displaystyle D_{B,\alpha}[p_{\theta_{1}}:p_{\theta_{2}}] =\displaystyle= JF,α(θ1:θ2),\displaystyle J_{F,\alpha}(\theta_{1}:\theta_{2}), (23)

where JF,αJ_{F,\alpha} denotes the α\alpha-skewed Jensen divergence:

JF,α(θ1:θ2):=αF(θ1)+(1α)F(θ2)F(αθ1+(1α)θ2).J_{F,\alpha}(\theta_{1}:\theta_{2}):=\alpha F(\theta_{1})+(1-\alpha)F(\theta_{2})-F(\alpha\theta_{1}+(1-\alpha)\theta_{2}). (24)

All these closed-form formulas can be obtained from the calculation of the following generic integral [37]:

Iα,β[p:q]:=p(x)αq(x)βdμ(x),I_{\alpha,\beta}[p:q]:=\int p(x)^{\alpha}q(x)^{\beta}\mathrm{d}\mu(x), (25)

when p=pθ1p=p_{\theta_{1}} and q=pθ2q=p_{\theta_{2}}. Indeed, provided that αθ1+βθ2Θ\alpha\theta_{1}+\beta\theta_{2}\in\Theta, we have

Iα,β[pθ1:pθ2]=exp((αF(θ1)+βF(θ2)F(αθ1+βθ2)))Epαθ1+βθ2[exp((α+β1)k(x))].I_{\alpha,\beta}[p_{\theta_{1}}:p_{\theta_{2}}]=\exp\left(-(\alpha F(\theta_{1})+\beta F(\theta_{2})-F(\alpha\theta_{1}+\beta\theta_{2}))\right)E_{p_{\alpha\theta_{1}+\beta\theta_{2}}}\left[\exp((\alpha+\beta-1)k(x))\right]. (26)

The calculation of Iα,βI_{\alpha,\beta} in Eq. 26 is easily achieved by bypassing the computation of the antiderivative of the integrand in Eq. 25 (using the fact that pθ(x)dμ(x)=1\int p_{\theta}(x)\mathrm{d}\mu(x)=1 for any θΘ\theta\in\Theta), see [37].

In particular, we get the following special cases:

  • When \alpha+\beta=1 with \alpha\in[0,1], I_{\alpha,1-\alpha}[p_{\theta_{1}}:p_{\theta_{2}}]=\exp(-J_{F,\alpha}(\theta_{1}:\theta_{2})) since \alpha\theta_{1}+(1-\alpha)\theta_{2}\in\Theta (domain \Theta is convex).

  • When k(x)=0 and \alpha\theta_{1}+\beta\theta_{2}\in\Theta, I_{\alpha,\beta}[p_{\theta_{1}}:p_{\theta_{2}}]=\exp\left(-(\alpha F(\theta_{1})+\beta F(\theta_{2})-F(\alpha\theta_{1}+\beta\theta_{2}))\right). This is always the case when \alpha,\beta>0 and \Theta is a convex cone (e.g., the Gaussian or Wishart families), see [26].

  • When α+β=1\alpha+\beta=1 with arbitrary α\alpha,

    Iα,1α[pθ1:pθ2]=exp((αF(θ1)+(1α)F(θ2)F(αθ1+(1α)θ2))).I_{\alpha,1-\alpha}[p_{\theta_{1}}:p_{\theta_{2}}]=\exp\left(-(\alpha F(\theta_{1})+(1-\alpha)F(\theta_{2})-F(\alpha\theta_{1}+(1-\alpha)\theta_{2}))\right). (27)

    This setting is useful for getting truncated series of f-divergences when the exponential family has an affine natural parameter space \Theta, see [38, 35].
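As a sanity check of Eq. 27, the closed form can be compared against a direct numerical integration of Eq. 25. The following Python sketch (assuming NumPy/SciPy are available) does so for the family of exponential distributions factorized as p_{\lambda}(x)=\exp(-\lambda x+\log\lambda), for which t(x)=-x, \theta(\lambda)=\lambda, k(x)=0 and F(\theta)=-\log\theta:

# Numerical check of Eq. 27 for exponential distributions p_lambda(x) = lambda*exp(-lambda*x):
# here t(x) = -x, theta(lambda) = lambda, F(theta) = -log(theta) and k(x) = 0.
import numpy as np
from scipy.integrate import quad

def F(theta):                      # cumulant function of the exponential distribution family
    return -np.log(theta)

def density(x, lam):               # off-the-shelf density
    return lam * np.exp(-lam * x)

lam1, lam2, alpha = 0.7, 2.5, 0.3

# Left-hand side of Eq. 27: the definite integral of Eq. 25 with beta = 1 - alpha.
lhs, _ = quad(lambda x: density(x, lam1)**alpha * density(x, lam2)**(1 - alpha), 0.0, np.inf)

# Right-hand side of Eq. 27, evaluated with the explicit cumulant function F.
rhs = np.exp(-(alpha * F(lam1) + (1 - alpha) * F(lam2) - F(alpha * lam1 + (1 - alpha) * lam2)))

print(lhs, rhs)    # both values agree up to quadrature error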

When \alpha\rightarrow 1 or \alpha\rightarrow 0, we get the following limits of the \alpha-skewed Bhattacharyya distances:

limα11α(1α)DB,α[p:q]\displaystyle\lim_{\alpha\rightarrow 1}\frac{1}{\alpha(1-\alpha)}D_{B,\alpha}[p:q] =\displaystyle= DKL[p:q]\displaystyle D_{\mathrm{KL}}[p:q] (28)
limα01α(1α)DB,α[p:q]\displaystyle\lim_{\alpha\rightarrow 0}\frac{1}{\alpha(1-\alpha)}D_{B,\alpha}[p:q] =\displaystyle= DKL[q:p].\displaystyle D_{\mathrm{KL}}[q:p]. (29)

It follows that when the densities p=pθ1p=p_{\theta_{1}} and q=pθ2q=p_{\theta_{2}} both belong to the same exponential family, we obtain

\lim_{\alpha\rightarrow 0}\frac{1}{\alpha(1-\alpha)}J_{F,\alpha}(\theta_{1}:\theta_{2}) = B_{F}(\theta_{1}:\theta_{2}), (30)
\lim_{\alpha\rightarrow 1}\frac{1}{\alpha(1-\alpha)}J_{F,\alpha}(\theta_{1}:\theta_{2}) = B_{F}(\theta_{2}:\theta_{1}). (31)

In practice, we would like to obtain closed-form formulas for these (dis)similarities when the densities belong to the same exponential family, using the source parameterization \lambda\in\Lambda:

D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}] = B_{F}(\theta(\lambda_{2}):\theta(\lambda_{1})), (32)
D_{J}[p_{\lambda_{1}}:p_{\lambda_{2}}] = (\theta(\lambda_{2})-\theta(\lambda_{1}))^{\top}(\eta(\lambda_{2})-\eta(\lambda_{1})), (33)
\rho[p_{\lambda_{1}}:p_{\lambda_{2}}] = \exp(-J_{F}(\theta(\lambda_{1}):\theta(\lambda_{2}))), (34)
D_{B}[p_{\lambda_{1}}:p_{\lambda_{2}}] = J_{F}(\theta(\lambda_{1}):\theta(\lambda_{2})), (35)
D_{H}[p_{\lambda_{1}}:p_{\lambda_{2}}] = \sqrt{1-\exp(-J_{F}(\theta(\lambda_{1}):\theta(\lambda_{2})))}, (36)

where θ()\theta(\cdot) and η()\eta(\cdot) are the DD-variate functions for converting the source parameter λ\lambda to the natural parameter θ\theta and the moment parameter η\eta, respectively. The Chernoff information between two densities pλ1p_{\lambda_{1}} and pλ2p_{\lambda_{2}} of the same exponential family amounts to a Jensen-Chernoff divergence [27]:

JC(λ1:λ2):=DC[pλ1:pλ2]\displaystyle\mathrm{JC}({\lambda_{1}}:{\lambda_{2}}):=D_{C}[p_{\lambda_{1}}:p_{\lambda_{2}}] =\displaystyle= maxα(0,1)DB,α[pλ1:pλ2],\displaystyle\max_{\alpha\in(0,1)}D_{B,\alpha}[p_{\lambda_{1}}:p_{\lambda_{2}}], (37)
=\displaystyle= maxα(0,1)JF,α(θ(λ1):θ(λ2)),\displaystyle\max_{\alpha\in(0,1)}J_{F,\alpha}(\theta(\lambda_{1}):\theta(\lambda_{2})), (38)

that is the maximal value of a skew Jensen divergence for α(0,1)\alpha\in(0,1).
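Since \rho_{\alpha} can always be evaluated by numerical integration, the Chernoff information of Eq. 21 can be approximated by a one-dimensional maximization over \alpha. The following Python sketch (SciPy assumed; the helper name is ours) does so for two univariate Gaussian densities:

import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar
from scipy.stats import norm

# Two univariate Gaussian densities (an exponential family of order 2).
p = norm(loc=0.0, scale=1.0).pdf
q = norm(loc=2.0, scale=3.0).pdf

def skew_bhattacharyya_distance(alpha):
    # D_{B,alpha}[p:q] = -log int p^alpha q^(1-alpha) dmu  (Eq. 18-19), by numerical integration.
    rho_alpha, _ = quad(lambda x: p(x)**alpha * q(x)**(1 - alpha), -np.inf, np.inf)
    return -np.log(rho_alpha)

# Chernoff information (Eq. 21): maximize over alpha in (0,1), i.e., minimize the negated distance.
res = minimize_scalar(lambda a: -skew_bhattacharyya_distance(a), bounds=(1e-6, 1 - 1e-6), method="bounded")
print("optimal alpha:", res.x, "Chernoff information:", -res.fun)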

Thus to obtain closed-form formulas, we need to make explicit both the \theta(\cdot) conversion function and the cumulant function F. This can be prone to human calculation mistakes (e.g., try reporting these formulas manually without errors for the multivariate Gaussian family). Furthermore, the cumulant function F may not be available in closed form [39].

In this work, we show how to easily bypass the explicit use of the cumulant function FF. Our method is based on a simple trick, and makes the programming of these (dis)similarities easy using off-the-shelf functions of application programming interfaces (APIs) (e.g., the density function, the entropy function or the moment function of a distribution family).

This paper is organized as follows: Section 2 explains the method for the Bhattacharyya coefficient and its related dissimilarities. Section 3 further carries on the principle of bypassing the explicit use of the cumulant function and its gradient for the calculation of the Kullback-Leibler divergence and its related Jeffreys' divergence. Section 4 summarizes the results. Throughout the paper, we present several examples to illustrate the methods. Appendix Closed-form formula using the Maxima computer algebra system displays some code written using the computer algebra system (CAS) Maxima to recover some formulas for some exponential families. Appendix A provides further examples.

2 Cumulant-free formula for the Bhattacharyya coefficient and distances derived thereof

2.1 A method based on a simple trick

The densities of an exponential family all have the same support [3] \mathcal{X}. Consider any point \omega\in\mathcal{X} in the support. Then observe that we can write the cumulant function of a natural exponential family as:

F(θ)=logpθ(ω)+t(ω)θ+k(ω).F(\theta)=-\log p_{\theta}(\omega)+t(\omega)^{\top}\theta+k(\omega). (39)

Since the generator FF of a Jensen or Bregman divergence is defined modulo an affine function aθ+ba^{\top}\theta+b (i.e., JF=JGJ_{F}=J_{G} and BF=BGB_{F}=B_{G} for G(θ)=F(θ)+aθ+bG(\theta)=F(\theta)+a^{\top}\theta+b for aDa\in\mathbb{R}^{D} and bb\in\mathbb{R}), we consider the following equivalent generator (term +t(ω)θ+k(ω)+t(\omega)^{\top}\theta+k(\omega) is affine) expressed using the density parameterized by λΛ\lambda\in\Lambda:

Fλ(λ)=F(θ(λ))=log(pλ(ω)).F_{\lambda}(\lambda)=F(\theta(\lambda))=-\log(p_{\lambda}(\omega)). (40)

Then the Bhattacharyya coefficient is expressed by this cumulant-free expression using the source parameterization λ\lambda:

ω𝒳,ρ[pλ1,pλ2]=pλ1(ω)pλ2(ω)pλ¯(ω)=pλ1(ω)pλ¯(ω)pλ2(ω)pλ¯(ω),\displaystyle\forall\omega\in\mathcal{X},\quad\rho[p_{\lambda_{1}},p_{\lambda_{2}}]=\frac{\sqrt{p_{\lambda_{1}}(\omega)p_{\lambda_{2}}(\omega)}}{p_{\bar{\lambda}}(\omega)}=\sqrt{\frac{p_{\lambda_{1}}(\omega)}{p_{\bar{\lambda}}(\omega)}}\sqrt{\frac{p_{\lambda_{2}}(\omega)}{p_{\bar{\lambda}}(\omega)}}, (41)

with

λ¯=λ(θ(λ1)+θ(λ2)2).\bar{\lambda}=\lambda\left(\frac{\theta(\lambda_{1})+\theta(\lambda_{2})}{2}\right). (42)

Similarly, the Bhattacharyya distance is written as

ω𝒳,DB[pλ1,pλ2]=log(p(ω;λ¯)p(ω,λ1)p(ω,λ2)).\displaystyle\forall\omega\in\mathcal{X},\quad D_{B}[p_{\lambda_{1}},p_{\lambda_{2}}]=\log\left(\frac{p\left(\omega;\bar{\lambda}\right)}{\sqrt{p(\omega,\lambda_{1})p(\omega,\lambda_{2})}}\right). (43)

Let l(x;λ):=logp(x;λ)l(x;\lambda):=\log p(x;\lambda) be the log-density. Then we have

ω𝒳,DB[pλ1,pλ2]=l(ω;λ¯)l(ω,λ1)+l(ω,λ2)2.\displaystyle\forall\omega\in\mathcal{X},\quad D_{B}[p_{\lambda_{1}},p_{\lambda_{2}}]=l\left(\omega;\bar{\lambda}\right)-\frac{l(\omega,\lambda_{1})+l(\omega,\lambda_{2})}{2}. (44)

This is a Jensen divergence for the strictly convex function l(x;θ)-l(x;\theta) (wrt. θ\theta) since l(x;θ)F(θ)-l(x;\theta)\equiv F(\theta) (modulo an affine term).
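The right-hand sides of Eq. 43 and Eq. 44 do not depend on the chosen \omega. Here is a minimal numerical sketch of this invariance for the Poisson family (whose conversion function \theta(u)=\log u yields the geometric mean, see Example 2 below), using an off-the-shelf log-PMF:

import numpy as np
from scipy.stats import poisson

lam1, lam2 = 2.0, 7.5
lam_bar = np.sqrt(lam1 * lam2)   # quasi-arithmetic mean M_theta for theta(u) = log(u): the geometric mean

# Eq. 44: D_B = l(omega; lam_bar) - (l(omega; lam1) + l(omega; lam2)) / 2, for any omega in the support.
for omega in [0, 1, 5, 12]:
    l = lambda lam: poisson.logpmf(omega, lam)
    print(omega, l(lam_bar) - 0.5 * (l(lam1) + l(lam2)))
# Every omega prints the same value, namely (lam1 + lam2)/2 - sqrt(lam1*lam2) (cf. Example 2 below).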

Thus we do not need to explicitly use the expression of the cumulant function FF in Eq. 41 and Eq. 43 but we need the following parameter λθ\lambda\Leftrightarrow\theta conversion functions:

  1. \theta(\lambda), the ordinary-to-natural parameter conversion function, and

  2. its reciprocal function \lambda(\theta) (i.e., \lambda(\cdot)=\theta^{-1}(\cdot)), the natural-to-source parameter conversion function,

so that we can calculate the ordinary λ\lambda-parameter λ¯\bar{\lambda} corresponding to the natural mid-parameter θ(λ1)+θ(λ2)2\frac{\theta(\lambda_{1})+\theta(\lambda_{2})}{2}:

λ¯=λ(θ(λ1)+θ(λ2)2).\bar{\lambda}=\lambda\left(\frac{\theta(\lambda_{1})+\theta(\lambda_{2})}{2}\right). (45)

Notice that in general, a linear interpolation in the natural parameter θ\theta corresponds to a non-linear interpolation in the source parameterization λ\lambda when θ(u)u\theta(u)\not=u.

Since λ()=θ1()\lambda(\cdot)=\theta^{-1}(\cdot), we can interpret the non-linear interpolation λ¯\bar{\lambda} as a generalization of quasi-arithmetic mean [33] (QAM):

λ¯=θ1(θ(λ1)+θ(λ2)2)=:Mθ(λ1,λ2),\bar{\lambda}=\theta^{-1}\left(\frac{\theta(\lambda_{1})+\theta(\lambda_{2})}{2}\right)=:M_{\theta}(\lambda_{1},\lambda_{2}), (46)

where

Mf(a,b):=f1(12f(a)+12f(b)),M_{f}(a,b):=f^{-1}\left(\frac{1}{2}f(a)+\frac{1}{2}f(b)\right), (47)

is the quasi-arithmetic mean induced by a strictly monotonic and smooth function f.444Notice that M_{f}(a,b)=M_{g}(a,b) when g(u)=c_{1}f(u)+c_{2} with c_{1}\in\mathbb{R}\setminus\{0\} and c_{2}\in\mathbb{R}. Thus we can assume that f is a strictly increasing and smooth function. Function f is said to be in standard form if f(1)=0, f^{\prime}(u)>0 and f^{\prime}(1)=1.

We can extend the quasi-arithmetic mean to a weighted quasi-arithmetic mean as follows:

Mf,α(a,b):=f1(αf(a)+(1α)f(b)),α[0,1].M_{f,\alpha}(a,b):=f^{-1}\left(\alpha f(a)+(1-\alpha)f(b)\right),\quad\alpha\in[0,1]. (48)

Weighted quasi-arithmetic means are strictly monotone [47] when the Range(f)\mathrm{Range}(f)\subset\mathbb{R}.
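For scalar generators, the weighted quasi-arithmetic mean of Eq. 48 is a one-liner to implement. The following Python sketch (function names are ours) recovers the arithmetic, geometric and harmonic means from f(u)=u, f(u)=\log u and f(u)=1/u:

import numpy as np

def quasi_arithmetic_mean(f, finv, a, b, alpha=0.5):
    # Weighted quasi-arithmetic mean M_{f,alpha}(a, b) = f^{-1}(alpha*f(a) + (1-alpha)*f(b)) (Eq. 48).
    return finv(alpha * f(a) + (1 - alpha) * f(b))

a, b = 2.0, 8.0
print(quasi_arithmetic_mean(lambda u: u,       lambda u: u,        a, b))  # arithmetic mean: 5.0
print(quasi_arithmetic_mean(np.log,            np.exp,             a, b))  # geometric mean: 4.0
print(quasi_arithmetic_mean(lambda u: 1.0 / u, lambda u: 1.0 / u,  a, b))  # harmonic mean: 3.2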

Let us remark that extensions of weighted quasi-arithmetic means have been studied recently in information geometry to describe geodesic paths [16, 15] γab={Mf,α(a,b):α[0,1]}\gamma_{ab}=\{M_{f,\alpha}(a,b)\ :\ \alpha\in[0,1]\}.

In 1D, a bijective function on an interval (e.g., a parameter conversion function in our setting) is a strictly monotonic function, and thus ff defines a quasi-arithmetic mean.

To define quasi-arithmetic means with multivariate generators using Eq. 47, we need to properly define f^{-1}. When the function f is separable, i.e., f(x)=\sum_{i=1}^{D}f_{i}(x_{i}) with the f_{i}'s strictly monotone and smooth functions, we can straightforwardly define the multivariate mean componentwise as M_{f}(x,x^{\prime}):=(M_{f_{1}}(x_{1},x_{1}^{\prime}),\ldots,M_{f_{D}}(x_{D},x_{D}^{\prime})).

In general, the inverse function theorem in multivariable calculus [11] states that there exists locally an inverse function f1f^{-1} for a continuously differentiable function ff at point xx if the determinant of the Jacobian Jf(x)J_{f}(x) of ff at xx is non-zero. Moreover, f1f^{-1} is continuously differentiable and we have Jf1(y)=[Jf(x)]1J_{f^{-1}}(y)=[J_{f}(x)]^{-1} (matrix inverse) at y=f(x)y=f(x).

In some cases, we get the existence of a global inverse function. For example, when f=Hf=\nabla H is the gradient of Legendre-type strictly convex and smooth function HH, the reciprocal function f1f^{-1} is well-defined f1=Hf^{-1}=\nabla H^{*} and global, where HH^{*} denotes the convex conjugate of HH (Legendre-type). In that case, f=Hf=\nabla H is a strictly monotone operator555A mapping M:ddM:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} is a strictly monotone operator iff (pq)(M(p)M(q))>0(p-q)^{\top}(M(p)-M(q))>0 for all p,qdp,q\in\mathbb{R}^{d} with pqp\not=q. Monotone operators are a generalization of the concept of univariate monotonic functions. since it is the gradient of a strictly convex function. We also refer to [2] for some work on operator means, and to [42] for multivariate quasi-arithmetic means of covariance matrices.

To summarize, we can compute the Bhattacharyya coefficient (and Bhattacharyya/Hellinger distances) using the parametric density function pλ(x)p_{\lambda}(x) and a quasi-arithmetic mean λ¯=Mθ(λ1,λ2)\bar{\lambda}=M_{\theta}(\lambda_{1},\lambda_{2}) on the source parameters λ1\lambda_{1} and λ2\lambda_{2} as:

ω𝒳,ρ[pλ1,pλ2]=pλ1(ω)pλ2(ω)p(ω;Mθ(λ1,λ2))\displaystyle\forall\omega\in\mathcal{X},\quad\rho[p_{\lambda_{1}},p_{\lambda_{2}}]=\frac{\sqrt{p_{\lambda_{1}}(\omega)p_{\lambda_{2}}(\omega)}}{p(\omega;M_{\theta}(\lambda_{1},\lambda_{2}))} (49)

using the notation p(x;θ):=pθ(x)p(x;\theta):=p_{\theta}(x), and

ω𝒳,DB[pλ1,pλ2]=log(p(ω;Mθ(λ1,λ2))p(ω,λ1)p(ω,λ2)).\forall\omega\in\mathcal{X},\quad D_{B}[p_{\lambda_{1}},p_{\lambda_{2}}]=\log\left(\frac{p\left(\omega;M_{\theta}(\lambda_{1},\lambda_{2})\right)}{\sqrt{p(\omega,\lambda_{1})p(\omega,\lambda_{2})}}\right). (50)

Similarly, we get the following cumulant-free expression for the Hellinger distance:

\forall\omega\in\mathcal{X},\quad D_{H}[p_{\lambda_{1}},p_{\lambda_{2}}]=\sqrt{1-\frac{\sqrt{p_{\lambda_{1}}(\omega)p_{\lambda_{2}}(\omega)}}{p(\omega;M_{\theta}(\lambda_{1},\lambda_{2}))}}. (51)

The Hellinger distance proves useful when using generic algorithms which require metric distances: for example, the greedy 2-approximation algorithm of Gonzalez [18] for k-center clustering in metric spaces.

These cumulant-free formulas are all the more convenient since, in legacy software APIs, we usually have access to the density function of the probability family. Thus if a parametric family of an API is an exponential family \mathcal{E}, we just need to implement the corresponding quasi-arithmetic mean M_{\theta}^{\mathcal{E}}.
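To illustrate this API-based recipe, here is a Python sketch (helper names are ours, SciPy assumed) for the univariate normal family: the density is taken off-the-shelf from scipy.stats.norm, and only the conversion functions \theta(\lambda) and \lambda(\theta) of the factorization of Eq. 10 are implemented. The returned value does not depend on the chosen \omega and agrees with the Gaussian closed form of Example 6 specialized to d=1:

import numpy as np
from scipy.stats import norm

# Source parameter lambda = (mu, sigma^2) of the univariate normal family (Eq. 10).
def theta(lam):                       # ordinary-to-natural conversion theta(lambda)
    mu, var = lam
    return np.array([mu / var, -0.5 / var])

def theta_inv(th):                    # natural-to-ordinary conversion lambda(theta)
    var = -0.5 / th[1]
    return (th[0] * var, var)

def density(x, lam):                  # off-the-shelf density from the API
    mu, var = lam
    return norm.pdf(x, loc=mu, scale=np.sqrt(var))

def bhattacharyya_coeff(lam1, lam2, omega=0.0):
    # Cumulant-free formula of Eq. 49 with lam_bar = M_theta(lam1, lam2).
    lam_bar = theta_inv(0.5 * (theta(lam1) + theta(lam2)))
    return np.sqrt(density(omega, lam1) * density(omega, lam2)) / density(omega, lam_bar)

lam1, lam2 = (0.0, 1.0), (1.0, 4.0)   # (mu, sigma^2)
print(bhattacharyya_coeff(lam1, lam2, omega=0.0), bhattacharyya_coeff(lam1, lam2, omega=3.5))
# Both calls return the same value, matching sqrt(2*s1*s2/(s1^2+s2^2)) * exp(-(mu1-mu2)^2/(4*(s1^2+s2^2))).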

More generally, the α\alpha-skewed Bhattacharyya distance [33] for α(0,1)\alpha\in(0,1) is expressed using the following cumulant-free expression:

\forall\omega\in\mathcal{X},\quad D_{B,\alpha}[p_{\lambda_{1}}:p_{\lambda_{2}}]=\log\left(\frac{p\left(\omega;M_{\theta,\alpha}(\lambda_{1},\lambda_{2})\right)}{p(\omega,\lambda_{1})^{\alpha}p(\omega,\lambda_{2})^{1-\alpha}}\right) (52)

with

Mθ,α(λ1,λ2)=λ(αθ(λ1)+(1α)θ(λ2))=:λ¯α.M_{\theta,\alpha}(\lambda_{1},\lambda_{2})=\lambda(\alpha\theta(\lambda_{1})+(1-\alpha)\theta(\lambda_{2}))=:\bar{\lambda}_{\alpha}. (53)

Notice that the geometric \alpha-barycenter of two densities p_{\theta_{1}} and p_{\theta_{2}} of an exponential family \mathcal{E} is a density of \mathcal{E} up to a scaling factor:

p(x;αθ1+(1α)θ2)=(p(ω;αθ1+(1α)θ2)pα(ω;θ1)p1α(ω;θ2))pα(x;θ1)p1α(x;θ2),ω𝒳.p(x;\alpha\theta_{1}+(1-\alpha)\theta_{2})=\left(\frac{p(\omega;\alpha\theta_{1}+(1-\alpha)\theta_{2})}{p^{\alpha}(\omega;\theta_{1})p^{1-\alpha}(\omega;\theta_{2})}\right)p^{\alpha}(x;\theta_{1})p^{1-\alpha}(x;\theta_{2}),\quad\forall\omega\in\mathcal{X}.

Let p~θ(x)=exp(t(x)θ+k(x))\tilde{p}_{\theta}(x)=\exp(t(x)^{\top}\theta+k(x)) denote the unnormalized density of an exponential family (so that pθ(x)=p~θ(x)exp(F(θ))p_{\theta}(x)=\tilde{p}_{\theta}(x)\exp(-F(\theta))). We have the following invariant:

\forall\omega\in\mathcal{X},\quad\frac{\tilde{p}(\omega;M_{\theta,\alpha}(\lambda_{1},\lambda_{2}))}{\tilde{p}(\omega,\lambda_{1})^{\alpha}\tilde{p}(\omega,\lambda_{2})^{1-\alpha}}=1. (54)

It follows that we have:

\rho_{\alpha}[p_{\theta_{1}}:p_{\theta_{2}}] = \frac{p(\omega,\lambda_{1})^{\alpha}p(\omega,\lambda_{2})^{1-\alpha}}{p(\omega;\alpha\theta_{1}+(1-\alpha)\theta_{2})}, (55)
=\displaystyle= exp(F(αθ1+(1α)θ2))exp(αF(θ1))exp((1α)F(θ2)),\displaystyle\frac{\exp(F(\alpha\theta_{1}+(1-\alpha)\theta_{2}))}{\exp(\alpha F(\theta_{1}))\exp((1-\alpha)F(\theta_{2}))}, (56)
=\displaystyle= exp(JF,α(θ1:θ2)).\displaystyle\exp(-J_{F,\alpha}(\theta_{1}:\theta_{2})). (57)

The Cauchy-Schwarz divergence [21, 41] is defined by

DCS(p,q):=log(𝒳p(x)q(x)dμ(x)(𝒳p(x)2dμ(x))(𝒳q(x)2dμ(x)))0.D_{\mathrm{CS}}(p,q):=-\log\left(\frac{\int_{\mathcal{X}}p(x)q(x)\mathrm{d}\mu(x)}{\sqrt{\left(\int_{\mathcal{X}}p(x)^{2}\mathrm{d}\mu(x)\right)\left(\int_{\mathcal{X}}q(x)^{2}\mathrm{d}\mu(x)\right)}}\right)\geq 0. (58)

Thus the Cauchy-Schwarz divergence is a projective divergence: D_{\mathrm{CS}}(p,q)=D_{\mathrm{CS}}(\lambda p,\lambda q) for any \lambda>0. It can be shown that the Cauchy-Schwarz divergence between two densities p_{\theta_{1}} and p_{\theta_{2}} of an exponential family whose natural parameter space is a cone [31] (e.g., the Gaussian family) is:

DCS(pθ1,pθ2)=JF(2θ1:2θ2)+log(Ep2θ1[exp(k(x))]Ep2θ2[exp(k(x))]Epθ1+θ2[exp(k(x))]).D_{\mathrm{CS}}(p_{\theta_{1}},p_{\theta_{2}})=J_{F}(2\theta_{1}:2\theta_{2})+\log\left(\frac{\sqrt{E_{p_{2\theta_{1}}}\left[\exp(k(x))\right]E_{p_{2\theta_{2}}}\left[\exp(k(x))\right]}}{E_{p_{\theta_{1}+\theta_{2}}}\left[\exp(k(x))\right]}\right). (59)

See [31] for the formula extended to Hölder divergences which generalize the Cauchy-Schwarz divergence.

2.2 Some illustrating examples

Let us start with an example of a continuous exponential family which relies on the arithmetic mean A(a,b)=a+b2A(a,b)=\frac{a+b}{2}:

Example 1

Consider the family of exponential distributions with rate parameter \lambda>0. The densities of this continuous EF write as p_{\lambda}(x)=\lambda\exp(-\lambda x) with support \mathcal{X}=[0,\infty). From the partial canonical factorization of the densities following Eq. 6, we get that \theta(u)=u and \theta^{-1}(u)=u, so that M_{\theta}(\lambda_{1},\lambda_{2})=A(\lambda_{1},\lambda_{2})=\frac{\lambda_{1}+\lambda_{2}}{2} is the arithmetic mean. Choose \omega=0 so that p_{\lambda}(\omega)=\lambda. It follows that

ρ[pλ1,pλ2]\displaystyle\rho[p_{\lambda_{1}},p_{\lambda_{2}}] =\displaystyle= p(ω,λ1)p(ω,λ2)p(ω;λ¯),\displaystyle\frac{\sqrt{p(\omega,\lambda_{1})p(\omega,\lambda_{2})}}{p\left(\omega;\bar{\lambda}\right)},
=\displaystyle= λ1λ2λ1+λ22,\displaystyle\frac{\sqrt{\lambda_{1}\lambda_{2}}}{\frac{\lambda_{1}+\lambda_{2}}{2}},
=\displaystyle= 2λ1λ2λ1+λ2.\displaystyle\frac{2\sqrt{\lambda_{1}\lambda_{2}}}{\lambda_{1}+\lambda_{2}}.

Let G(a,b)=\sqrt{ab} denote the geometric mean for a,b>0. Notice that since G(\lambda_{1},\lambda_{2})\leq A(\lambda_{1},\lambda_{2}), we have \rho[p_{\lambda_{1}},p_{\lambda_{2}}]\in(0,1]. The Bhattacharyya distance between two exponential densities of rates \lambda_{1} and \lambda_{2} is

D_{\mathrm{Bhat}}[p_{\lambda_{1}},p_{\lambda_{2}}]=-\log\rho[p_{\lambda_{1}},p_{\lambda_{2}}]=\log\frac{\lambda_{1}+\lambda_{2}}{2}-\frac{1}{2}\log\lambda_{1}\lambda_{2}\geq 0.

Since the logarithm is a monotonically increasing function, we have \log A(\lambda_{1},\lambda_{2})\geq\log G(\lambda_{1},\lambda_{2}), and therefore we check that D_{\mathrm{Bhat}}[p_{\lambda_{1}},p_{\lambda_{2}}]\geq 0.

Next, we consider a discrete exponential family which exhibits the geometric mean:

Example 2

The Poisson family of probability mass functions (PMFs) p_{\lambda}(x)=\frac{\lambda^{x}\exp(-\lambda)}{x!}, where \lambda>0 denotes the intensity parameter and x\in\mathcal{X}=\{0,1,\ldots\}, is a discrete exponential family with t(x)=x (\omega=0, t(\omega)=0 and p_{\lambda}(\omega)=\exp(-\lambda)), \theta(u)=\log u and \lambda(u)=\theta^{-1}(u)=\exp(u). Thus the quasi-arithmetic mean associated with the Poisson family is M_{\theta}(\lambda_{1},\lambda_{2})=G(\lambda_{1},\lambda_{2})=\sqrt{\lambda_{1}\lambda_{2}}, the geometric mean. It follows that the Bhattacharyya coefficient is

\rho[p_{\lambda_{1}},p_{\lambda_{2}}] = \frac{\sqrt{p(\omega,\lambda_{1})p(\omega,\lambda_{2})}}{p\left(\omega;\bar{\lambda}\right)},
 = \frac{\sqrt{\exp(-\lambda_{1})\exp(-\lambda_{2})}}{\exp(-\sqrt{\lambda_{1}\lambda_{2}})}=\exp\left(\sqrt{\lambda_{1}\lambda_{2}}-\frac{\lambda_{1}+\lambda_{2}}{2}\right)

Hence, we recover the Bhattacharyya distance between two Poisson PMFs:

DB[pλ1,pλ2]=λ1+λ22λ1λ20.D_{B}[p_{\lambda_{1}},p_{\lambda_{2}}]=\frac{\lambda_{1}+\lambda_{2}}{2}-\sqrt{\lambda_{1}\lambda_{2}}\geq 0.

The negative binomial distributions with a known number of failures also yield the same conversion function \theta(u)=\log u and hence the same geometric mean (but the density p(\omega,\lambda) is different).

To illustrate the use of the power means P_{r}(a,b)=\left(\frac{a^{r}+b^{r}}{2}\right)^{\frac{1}{r}} of order r\in\mathbb{R} (also called Hölder means) for r\not=0 (with \lim_{r\rightarrow 0}P_{r}(a,b)=G(a,b)), let us consider the family of Weibull distributions.

Example 3

The Weibull distributions with a prescribed shape parameter k++k\in\mathbb{R}_{++} (e.g., exponential distributions when k=1k=1, Rayleigh distributions when k=2k=2) form an exponential family. The density of a Weibull distribution with scale λ\lambda and fixed shape parameter kk is

pλ(x)=kλ(xλ)k1e(x/λ)k,x𝒳=++p_{\lambda}(x)=\frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1}e^{-(x/\lambda)^{k}},\quad x\in\mathcal{X}=\mathbb{R}_{++}

The ordinary\leftrightarrownatural parameter conversion functions are \theta(u)=\frac{1}{u^{k}} and \theta^{-1}(u)=\frac{1}{u^{\frac{1}{k}}}. It follows that \bar{\lambda}=M_{\theta}(\lambda_{1},\lambda_{2})=P_{-k}(\lambda_{1},\lambda_{2})=\left(\frac{1}{2}\lambda_{1}^{-k}+\frac{1}{2}\lambda_{2}^{-k}\right)^{-\frac{1}{k}} is the power mean of order -k. We choose \omega=1 and get p_{\lambda}(\omega)=\frac{k}{\lambda^{k}}e^{-\frac{1}{\lambda^{k}}}. It follows the closed-form formula for the Bhattacharyya coefficient (for any fixed shape k>0):

\rho[p_{\lambda_{1}},p_{\lambda_{2}}] = \frac{\sqrt{p(\omega,\lambda_{1})p(\omega,\lambda_{2})}}{p\left(\omega;\bar{\lambda}\right)}, (60)
 = \frac{2\sqrt{\lambda_{1}^{k}\lambda_{2}^{k}}}{\lambda_{1}^{k}+\lambda_{2}^{k}}. (61)

For k=2, the Weibull family yields the Rayleigh family with \lambda=\sqrt{2}\sigma. The Bhattacharyya coefficient is \rho[p_{\lambda_{1}},p_{\lambda_{2}}]=\frac{G(\lambda_{1},\lambda_{2})^{2}}{Q(\lambda_{1},\lambda_{2})^{2}}, where G and Q denote the geometric mean and the quadratic mean, respectively (with Q\geq G).

Table 1 summarizes the quasi-arithmetic means associated with some common univariate exponential families. Notice that the same quasi-arithmetic mean can be associated with several exponential families: For example, the Gaussian family with fixed variance and the exponential distribution family both have \theta(u)=u, yielding the arithmetic mean.

\begin{array}{llll}
\mbox{Exponential family }\mathcal{E} & \theta(u) & \theta^{-1}(u) & \bar{\lambda}=M_{\theta}(\lambda_{1},\lambda_{2})\\ \hline
\mbox{Exponential} & u & u & A(\lambda_{1},\lambda_{2})=\frac{\lambda_{1}+\lambda_{2}}{2}\\
\mbox{Poisson} & \log u & \exp(u) & G(\lambda_{1},\lambda_{2})=\sqrt{\lambda_{1}\lambda_{2}}\\
\mbox{Laplace (fixed $\mu$)} & \frac{1}{u} & \frac{1}{u} & H(\lambda_{1},\lambda_{2})=\frac{2\lambda_{1}\lambda_{2}}{\lambda_{1}+\lambda_{2}}\\
\mbox{Weibull (fixed shape $k>0$)} & \frac{1}{u^{k}} & \frac{1}{u^{\frac{1}{k}}} & P_{-k}(\lambda_{1},\lambda_{2})=\left(\frac{1}{2}\lambda_{1}^{-k}+\frac{1}{2}\lambda_{2}^{-k}\right)^{-\frac{1}{k}}\\
\mbox{Bernoulli} & \log\frac{u}{1-u} & \frac{\exp(u)}{1+\exp(u)} & \mathrm{Ber}(\lambda_{1},\lambda_{2})=\frac{a_{12}}{1+a_{12}}\\
 & \mbox{(logit function)} & \mbox{(logistic function)} & \mbox{with }a_{12}=\sqrt{\frac{\lambda_{1}\lambda_{2}}{(1-\lambda_{1})(1-\lambda_{2})}}\\ \hline
\end{array}
Table 1: Quasi-arithmetic means Mθ(λ1,λ2)=θ1(θ(λ1)+θ(λ2)2)M_{\theta}(\lambda_{1},\lambda_{2})=\theta^{-1}\left(\frac{\theta(\lambda_{1})+\theta(\lambda_{2})}{2}\right) associated with some common discrete and continuous exponential families of order D=1D=1.
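For instance, the Bernoulli row of Table 1 can be exercised directly with an off-the-shelf PMF. The following Python sketch (SciPy assumed, helper name ours) computes \bar{\lambda} with the logit/logistic pair and recovers the classical Bernoulli Bhattacharyya coefficient \sqrt{\lambda_{1}\lambda_{2}}+\sqrt{(1-\lambda_{1})(1-\lambda_{2})} obtained by summing over the two-point sample space:

import numpy as np
from scipy.stats import bernoulli
from scipy.special import logit, expit   # logit = theta(u), expit = logistic function = theta^{-1}(u)

def bhattacharyya_coeff_bernoulli(lam1, lam2, omega=0):
    lam_bar = expit(0.5 * (logit(lam1) + logit(lam2)))           # Ber(lam1, lam2) of Table 1
    p = lambda lam: bernoulli.pmf(omega, lam)                    # off-the-shelf PMF
    return np.sqrt(p(lam1) * p(lam2)) / p(lam_bar)               # cumulant-free formula of Eq. 49

lam1, lam2 = 0.2, 0.6
print(bhattacharyya_coeff_bernoulli(lam1, lam2))                 # same value for omega = 0 or omega = 1
print(np.sqrt(lam1 * lam2) + np.sqrt((1 - lam1) * (1 - lam2)))   # direct two-point sum, for comparison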

Let us now consider multi-order exponential families. We start with the bi-order exponential family of Gamma distributions.

Example 4

The density of a Gamma distribution is

p(x;α,β)=βαΓ(α)xα1eβx,x𝒳=(0,),p(x;\alpha,\beta)=\frac{\beta^{\alpha}}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x},\quad x\in\mathcal{X}=(0,\infty),

for a shape parameter α>0\alpha>0 and rate parameter β>0\beta>0 (i.e., a 22-order exponential family with λ=(λ1,λ2)=(α,β)\lambda=(\lambda_{1},\lambda_{2})=(\alpha,\beta)). The natural parameter is θ(λ)=(λ11,λ2)\theta(\lambda)=(\lambda_{1}-1,-\lambda_{2}) and the inverse function is λ(θ)=(θ1+1,θ2)\lambda(\theta)=(\theta_{1}+1,-\theta_{2}). It follows that the generalized quasi-arithmetic mean is the bivariate arithmetic mean: Mθ(λ1,λ2)=A(λ1,λ2)=(α1+α22,β1+β22)=(α¯,β¯)M_{\theta}(\lambda_{1},\lambda_{2})=A(\lambda_{1},\lambda_{2})=(\frac{\alpha_{1}+\alpha_{2}}{2},\frac{\beta_{1}+\beta_{2}}{2})=(\bar{\alpha},\bar{\beta}). We choose ω=1\omega=1 so that p(ω;α,β)=βαΓ(α)eβp(\omega;\alpha,\beta)=\frac{\beta^{\alpha}}{\Gamma(\alpha)}e^{-\beta}.

We get the Bhattacharyya coefficient:

\rho[p_{\alpha_{1},\beta_{1}}:p_{\alpha_{2},\beta_{2}}] = \frac{\sqrt{p(\omega,\lambda_{1})p(\omega,\lambda_{2})}}{p\left(\omega;\bar{\lambda}\right)},
 = \sqrt{\frac{\beta_{1}^{\alpha_{1}}\beta_{2}^{\alpha_{2}}}{\Gamma(\alpha_{1})\Gamma(\alpha_{2})}}\,\Gamma(\bar{\alpha})\,\bar{\beta}^{-\bar{\alpha}},

and the Bhattacharyya distance:

D_{B}[p_{\alpha_{1},\beta_{1}}:p_{\alpha_{2},\beta_{2}}]=\bar{\alpha}\log\bar{\beta}-\frac{\alpha_{1}\log\beta_{1}+\alpha_{2}\log\beta_{2}}{2}+\log\frac{\sqrt{\Gamma(\alpha_{1})\Gamma(\alpha_{2})}}{\Gamma(\bar{\alpha})}.

Next, we consider the Dirichlet family, which exhibits a separable (quasi-)arithmetic mean:

Example 5

Consider the family of Dirichlet distributions with densities defined on the (d1)(d-1)-dimensional open standard simplex support

𝒳=Δd:={(x1,,xd):xi(0,1),i=1dxi=1}.\mathcal{X}=\Delta_{d}:=\left\{(x_{1},\ldots,x_{d})\ :\ x_{i}\in(0,1),\ \sum_{i=1}^{d}x_{i}=1\right\}. (62)

The family of Dirichlet distributions includes the family of Beta distributions when d=2. The density of a Dirichlet distribution is defined by:

pα(x)=Γ(i=1dαi)i=1dΓ(αi)i=1dxiαi1.p_{\alpha}(x)=\frac{\Gamma\left(\sum_{i=1}^{d}\alpha_{i}\right)}{\prod_{i=1}^{d}\Gamma\left(\alpha_{i}\right)}\prod_{i=1}^{d}x_{i}^{\alpha_{i}-1}. (63)

The Dirichlet distributions are used in Bayesian statistics as the conjugate priors of the multinomial family. The Dirichlet distributions form an exponential family with d-dimensional natural parameter \theta=(\alpha_{1}-1,\ldots,\alpha_{d}-1) (D=d) and vector of sufficient statistics t(x)=(\log x_{1},\ldots,\log x_{d}). The induced quasi-arithmetic mean is a separable multivariate arithmetic mean, i.e., the multivariate arithmetic mean \bar{\lambda}=M_{\theta}(\alpha_{1},\alpha_{2})=A(\alpha_{1},\alpha_{2})=\frac{\alpha_{1}+\alpha_{2}}{2}.

Let us choose ω=(1d,,1d)\omega=\left(\frac{1}{d},\ldots,\frac{1}{d}\right) so that p(ω;α)=Γ(i=1dαi)i=1dΓ(αi)1d(i=1dαi)dp(\omega;\alpha)=\frac{\Gamma\left(\sum_{i=1}^{d}\alpha_{i}\right)}{\prod_{i=1}^{d}\Gamma\left(\alpha_{i}\right)}\frac{1}{d^{(\sum_{i=1}^{d}\alpha_{i})-d}}.

We get the following Bhattacharyya coefficient between two Dirichlet densities p_{\alpha_{1}} and p_{\alpha_{2}}:

ρ[pα1:pα2]=B(α1+α22)B(α1)B(α2),\rho[p_{\alpha_{1}}:p_{\alpha_{2}}]=\frac{B\left(\frac{\alpha_{1}+\alpha_{2}}{2}\right)}{\sqrt{B(\alpha_{1})B(\alpha_{2})}}, (64)

where

B(α):=i=1dΓ(αi)Γ(i=1dαi).B(\alpha):=\frac{\prod_{i=1}^{d}\Gamma\left(\alpha_{i}\right)}{\Gamma\left(\sum_{i=1}^{d}\alpha_{i}\right)}. (65)

It follows that the Bhattacharyya distance between two Dirichlet densities p_{\alpha_{1}} and p_{\alpha_{2}} is

DB[pα1:pα2]=log(B(α1)B(α2)B(α1+α22)).D_{B}[p_{\alpha_{1}}:p_{\alpha_{2}}]=\log\left(\frac{\sqrt{B(\alpha_{1})B(\alpha_{2})}}{B\left(\frac{\alpha_{1}+\alpha_{2}}{2}\right)}\right). (66)

The \alpha-skewed Bhattacharyya coefficient for a scalar \alpha\in(0,1) is:

ρα[pα1:pα2]=B(αα1+(1α)α2)Bα(α1)B1α(α2),\rho_{\alpha}[p_{\alpha_{1}}:p_{\alpha_{2}}]=\frac{B\left(\alpha\alpha_{1}+(1-\alpha)\alpha_{2}\right)}{B^{\alpha}(\alpha_{1})B^{1-\alpha}(\alpha_{2})}, (67)

and the \alpha-skewed Bhattacharyya distance:

DB,α[pα1:pα2]\displaystyle D_{B,\alpha}[p_{\alpha_{1}}:p_{\alpha_{2}}] =\displaystyle= log(Bα(α1)B1α(α2)B(αα1+(1α)α2)),\displaystyle\log\left(\frac{B^{\alpha}(\alpha_{1})B^{1-\alpha}(\alpha_{2})}{B\left(\alpha\alpha_{1}+(1-\alpha)\alpha_{2}\right)}\right), (68)
=\displaystyle= αlogB(α1)+(1α)logB(α2)logB(αα1+(1α)α2),\displaystyle\alpha\log B(\alpha_{1})+(1-\alpha)\log B(\alpha_{2})-\log B\left(\alpha\alpha_{1}+(1-\alpha)\alpha_{2}\right), (69)

with

logB(α)=i=1dlogΓ(αi)logΓ(i=1dαi).\log B(\alpha)=\sum_{i=1}^{d}\log\Gamma(\alpha_{i})-\log\Gamma\left(\sum_{i=1}^{d}\alpha_{i}\right). (70)

(This is in accordance with Eq. 15–17 of [43].)
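These Dirichlet formulas only involve the log-Beta function of Eq. 70, so they can be implemented in a few lines (here a Python sketch using scipy.special.gammaln; helper names are ours):

import numpy as np
from scipy.special import gammaln

def log_beta(alpha):
    # log B(alpha) of Eq. 70.
    alpha = np.asarray(alpha, dtype=float)
    return np.sum(gammaln(alpha)) - gammaln(np.sum(alpha))

def dirichlet_skew_bhattacharyya(alpha1, alpha2, a=0.5):
    # Eq. 69: alpha-skewed Bhattacharyya distance between two Dirichlet densities.
    alpha1, alpha2 = np.asarray(alpha1, float), np.asarray(alpha2, float)
    return a * log_beta(alpha1) + (1 - a) * log_beta(alpha2) - log_beta(a * alpha1 + (1 - a) * alpha2)

print(dirichlet_skew_bhattacharyya([1.0, 2.0, 3.0], [4.0, 1.0, 2.0], a=0.5))   # ordinary Bhattacharyya distance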

Finally, we consider the case of the multivariate Gaussian family:

Example 6

Consider the dd-dimensional Gaussian family [29] also called MultiVariate Normal (MVN) family. The parameter λ=(λv,λM)\lambda=(\lambda_{v},\lambda_{M}) of a MVN consists of a vector part λv=μ\lambda_{v}=\mu and a d×dd\times d matrix part λM=Σ\lambda_{M}=\Sigma. The Gaussian density is given by

pλ(x;λ)=1(2π)d2|λM|exp(12(xλv)λM1(xλv)),p_{\lambda}(x;\lambda)=\frac{1}{(2\pi)^{\frac{d}{2}}\sqrt{|\lambda_{M}|}}\exp\left(-\frac{1}{2}(x-\lambda_{v})^{\top}\lambda_{M}^{-1}(x-\lambda_{v})\right),

where |||\cdot| denotes the matrix determinant.

Partially factorizing the density into the canonical form of exponential family, we find that θ(λ)=(λM1λv,12λM1)\theta(\lambda)=\left(\lambda_{M}^{-1}\lambda_{v},\frac{1}{2}\lambda_{M}^{-1}\right) and λ(θ)=(12θM1θv,12θM1)\lambda(\theta)=(\frac{1}{2}\theta_{M}^{-1}\theta_{v},\frac{1}{2}\theta_{M}^{-1}). It follows that the multivariate weighted mean is Mθα(λ1,λ2)=λ¯α=(μα,Σα)M_{\theta}^{\alpha}(\lambda_{1},\lambda_{2})=\bar{\lambda}_{\alpha}=(\mu_{\alpha},\Sigma_{\alpha}) with

Σα=(αΣ11+(1α)Σ21)1,\Sigma_{\alpha}=\left(\alpha\Sigma_{1}^{-1}+(1-\alpha)\Sigma_{2}^{-1}\right)^{-1}, (71)

the matrix harmonic barycenter and

μα=Σα(αΣ11μ1+(1α)Σ21μ2).\mu_{\alpha}=\Sigma_{\alpha}\left(\alpha\Sigma_{1}^{-1}\mu_{1}+(1-\alpha)\Sigma_{2}^{-1}\mu_{2}\right). (72)

We choose ω=0\omega=0 with pλ(0;λ)=1(2π)d2|λM|exp(12λvλM1λv)p_{\lambda}(0;\lambda)=\frac{1}{(2\pi)^{\frac{d}{2}}\sqrt{|\lambda_{M}|}}\exp\left(-\frac{1}{2}\lambda_{v}^{\top}\lambda_{M}^{-1}\lambda_{v}\right). Let Δμ=μ2μ1\Delta_{\mu}=\mu_{2}-\mu_{1}. It follows the following closed-form formula for the Bhattacharyya coefficient between Gaussian densities:

ρ[pμ1,Σ1,pμ2,Σ2]=|Σ1+Σ22|12|Σ1|14|Σ2|14exp(18Δμ(Σ1+Σ22)1Δμ).\rho[p_{\mu_{1},\Sigma_{1}},p_{\mu_{2},\Sigma_{2}}]=\left|\frac{\Sigma_{1}+\Sigma_{2}}{2}\right|^{-\frac{1}{2}}|\Sigma_{1}|^{\frac{1}{4}}|\Sigma_{2}|^{\frac{1}{4}}\exp\left(-\frac{1}{8}\Delta_{\mu}^{\top}\left(\frac{\Sigma_{1}+\Sigma_{2}}{2}\right)^{-1}\Delta_{\mu}\right).

Thus the \alpha-skewed Bhattacharyya distance is

DB,α[pμ1,Σ1,pμ2,Σ2]=12(αμ1Σ11μ1+(1α)μ2Σ21μ2μαΣα1μα+log|Σ1|α|Σ2|1α|Σα|),D_{B,\alpha}[p_{\mu_{1},\Sigma_{1}},p_{\mu_{2},\Sigma_{2}}]=\frac{1}{2}\left(\alpha\mu_{1}^{\top}\Sigma_{1}^{-1}\mu_{1}+(1-\alpha)\mu_{2}^{\top}\Sigma_{2}^{-1}\mu_{2}-\mu_{\alpha}^{\top}\Sigma_{\alpha}^{-1}\mu_{\alpha}+\log\frac{|\Sigma_{1}|^{\alpha}|\Sigma_{2}|^{1-\alpha}}{|\Sigma_{\alpha}|}\right),

and the Hellinger distance:

D_{H}[p_{\mu_{1},\Sigma_{1}},p_{\mu_{2},\Sigma_{2}}]=\sqrt{1-\frac{|\Sigma_{1}|^{\frac{1}{4}}|\Sigma_{2}|^{\frac{1}{4}}}{\left|\frac{\Sigma_{1}+\Sigma_{2}}{2}\right|^{\frac{1}{2}}}\exp\left(-\frac{1}{8}\Delta_{\mu}^{\top}\left(\frac{\Sigma_{1}+\Sigma_{2}}{2}\right)^{-1}\Delta_{\mu}\right)}.

The Cauchy-Schwarz divergence between two Gaussians [31] is:

D_{\mathrm{CS}}(p_{\mu_{1},\Sigma_{1}},p_{\mu_{2},\Sigma_{2}}) = \frac{1}{2}\log\left(\frac{1}{2^{d}}\frac{\sqrt{|\Sigma_{1}|\,|\Sigma_{2}|}}{|(\Sigma_{1}^{-1}+\Sigma_{2}^{-1})^{-1}|}\right) (73)
+12μ1Σ11μ1+12μ2Σ21μ2\displaystyle+\frac{1}{2}\mu_{1}^{\top}\Sigma_{1}^{-1}\mu_{1}+\frac{1}{2}\mu_{2}^{\top}\Sigma_{2}^{-1}\mu_{2}
12(Σ11μ1+Σ21μ2)(Σ11+Σ21)1(Σ11μ1+Σ21μ2).\displaystyle-\frac{1}{2}(\Sigma_{1}^{-1}\mu_{1}+\Sigma_{2}^{-1}\mu_{2})^{\top}(\Sigma_{1}^{-1}+\Sigma_{2}^{-1})^{-1}(\Sigma_{1}^{-1}\mu_{1}+\Sigma_{2}^{-1}\mu_{2}).
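The cumulant-free recipe is easily checked numerically for multivariate Gaussians: the following NumPy sketch (helper names ours) computes \bar{\lambda} from Eq. 71-72 with \alpha=1/2, evaluates Eq. 49 at \omega=0, and compares the result with the closed-form coefficient of Example 6:

import numpy as np

def mvn_pdf(x, mu, Sigma):
    d = len(mu)
    diff = x - mu
    quad_form = diff @ np.linalg.solve(Sigma, diff)
    return np.exp(-0.5 * quad_form) / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

def bhattacharyya_coeff_mvn(mu1, S1, mu2, S2, omega=None):
    # lambda_bar of Eq. 71-72 with alpha = 1/2 (matrix harmonic mean of the covariances).
    S_bar = np.linalg.inv(0.5 * (np.linalg.inv(S1) + np.linalg.inv(S2)))
    mu_bar = S_bar @ (0.5 * (np.linalg.solve(S1, mu1) + np.linalg.solve(S2, mu2)))
    omega = np.zeros_like(mu1) if omega is None else omega
    return np.sqrt(mvn_pdf(omega, mu1, S1) * mvn_pdf(omega, mu2, S2)) / mvn_pdf(omega, mu_bar, S_bar)

mu1, S1 = np.array([0.0, 0.0]), np.array([[1.0, 0.2], [0.2, 2.0]])
mu2, S2 = np.array([1.0, -1.0]), np.array([[1.5, -0.3], [-0.3, 1.0]])

# Closed-form coefficient of Example 6, for comparison.
S_a = 0.5 * (S1 + S2)
dmu = mu2 - mu1
rho_closed = (np.linalg.det(S_a) ** -0.5 * np.linalg.det(S1) ** 0.25 * np.linalg.det(S2) ** 0.25
              * np.exp(-0.125 * dmu @ np.linalg.solve(S_a, dmu)))
print(bhattacharyya_coeff_mvn(mu1, S1, mu2, S2), rho_closed)   # both values coincide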

3 Cumulant-free formula for the Kullback-Leibler divergence and related divergences

The Kullback-Leibler divergence (KLD) DKL[p:q]:=p(x)logp(x)q(x)dμ(x)D_{\mathrm{KL}}[p:q]:=\int p(x)\log\frac{p(x)}{q(x)}\mathrm{d}\mu(x) between two densities pp and qq also called the relative entropy [12] amounts to a reverse Bregman divergence, DKL[pθ1:pθ2]=BF(θ1:θ2)=BF(θ2:θ1)D_{\mathrm{KL}}[p_{\theta_{1}}:p_{\theta_{2}}]={B_{F}}^{*}(\theta_{1}:\theta_{2})=B_{F}(\theta_{2}:\theta_{1}), when the densities belong to the same exponential family \mathcal{E}, where the Bregman generator FF is the cumulant function of \mathcal{E}.

We present below two techniques to calculate the KLD without computing the integral:

  • The first technique, described in §3.1, considers the KLD as a limit case of the \alpha-skewed Bhattacharyya distance.

  • The second technique relies on the availability of off-the-shelf formulas for the entropy and the moments of the sufficient statistics (§3.2), and is derived using the Legendre-Fenchel divergence.

3.1 Kullback-Leibler divergence as the limit case of a skewed Bhattacharyya distance

We can obtain a closed-form formula for the Kullback-Leibler divergence by considering the limit case of the \alpha-skewed Bhattacharyya distance:

D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}] = \lim_{\alpha\rightarrow 0}\frac{1}{\alpha}D_{B,\alpha}[p_{\lambda_{2}}:p_{\lambda_{1}}], (74)
 = \lim_{\alpha\rightarrow 0}-\frac{1}{\alpha}\log\rho_{\alpha}[p_{\lambda_{2}}:p_{\lambda_{1}}], (75)
 = \lim_{\alpha\rightarrow 0}\frac{1}{\alpha}\log\frac{p\left(\omega;M_{\theta,\alpha}(\lambda_{2},\lambda_{1})\right)}{p(\omega,\lambda_{2})^{\alpha}p(\omega,\lambda_{1})^{1-\alpha}}, (76)
 = \log\left(\frac{p(\omega;\lambda_{1})}{p(\omega;\lambda_{2})}\right)+\lim_{\alpha\rightarrow 0}\frac{1}{\alpha}\log\left(\frac{p\left(\omega;M_{\theta,\alpha}(\lambda_{2},\lambda_{1})\right)}{p(\omega,\lambda_{1})}\right). (77)

When we deal with uni-order exponential families (D=1D=1), we can use a first-order Taylor expansion of the quasi-arithmetic means when α0\alpha\simeq 0 (see [30]):

M_{\theta,\alpha}(b,a)=_{\alpha\simeq 0}a+\alpha\frac{\theta(b)-\theta(a)}{\theta^{\prime}(a)}+o\left(\alpha(\theta(b)-\theta(a))\right), (78)

where \theta^{\prime}(\cdot) denotes the derivative of the ordinary-to-natural parameter conversion function.

It follows that we have:

DKL[pλ1:pλ2]=log(p(ω;λ1)p(ω;λ2))+limα01αlog(p(ω;λ1+αθ(λ2)θ(λ1)θ(λ1))p(ω;λ1)),ω𝒳D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}]=\log\left(\frac{p(\omega;\lambda_{1})}{p(\omega;\lambda_{2})}\right)+\lim_{\alpha\rightarrow 0}\frac{1}{\alpha}\log\left(\frac{p\left(\omega;\lambda_{1}+\alpha\frac{\theta(\lambda_{2})-\theta(\lambda_{1})}{\theta^{\prime}(\lambda_{1})}\right)}{p(\omega;\lambda_{1})}\right),\forall\omega\in\mathcal{X} (79)

Notice that we need to calculate case by case the limit as it depends on the density expression p(x;λ)p(x;\lambda) of the exponential family. This limit can be computed symbolically using a computer algebra system (e.g., using Maxima666http://maxima.sourceforge.net/). The example below illustrates the technique for calculating the KLD between two Weibull densities with prescribed shape parameter.

Example 7

Consider the Weibull family with prescribed shape parameter k that forms an exponential family (including the family of exponential distributions for k=1 and the Rayleigh distributions for k=2). The density of a Weibull distribution with scale \lambda>0 and fixed shape parameter k is

pλ(x)=kλ(xλ)k1e(x/λ)k,x𝒳=++p_{\lambda}(x)=\frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1}e^{-(x/\lambda)^{k}},\quad x\in\mathcal{X}=\mathbb{R}_{++}

We have \theta(u)=u^{-k} and \theta^{\prime}(u)=-ku^{-k-1}. We set \omega=1 so that p_{\lambda}(\omega)=\frac{k}{\lambda^{k}}\exp(-\lambda^{-k}). Let us program the formula of Eq. 77 using the computer algebra system Maxima:

/* KLD between Weibull densities with fixed shape k, computed as a limit (Eq. 77) */
assume(l1>0);
assume(l2>0);
k:5;
omega:1;
t(u):=u**(-k);          /* ordinary-to-natural parameter conversion theta(u) */
tinv(u):=u**(-1/k);     /* natural-to-ordinary parameter conversion theta^(-1)(u) */
p(x,l):=(k/l)*((x/l)**(k-1))*exp(-(x/l)**k);   /* Weibull density with scale l and shape k */
mean(a,b):=tinv(alpha*t(a)+(1-alpha)*t(b));    /* weighted quasi-arithmetic mean M_{theta,alpha}(a,b) */
log(p(omega,l1)/p(omega,l2)) + (1/alpha)*log(p(omega,mean(l2,l1))/p(omega,l1));
limit(ratsimp(%), alpha, 0);
expand(%);
Figure 1: Snapshot of the Maxima GUI displaying the result of the symbolic calculation of the KLD between two Weibull densities with prescribed shape parameter.

Figure 1 displays a snapshot of the result which can be easily simplified manually as

DKL[pλ1:pλ2]=klogλ2λ1+(λ1λ2)k1.D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}]=k\log\frac{\lambda_{2}}{\lambda_{1}}+\left(\frac{\lambda_{1}}{\lambda_{2}}\right)^{k}-1. (80)

In general, the KLD between two Weibull densities with arbitrary shapes [4] is

DKL[pk1,λ1:pk2,λ2]=logk1λ1k1logk2λ2k2+(k1k2)(logλ1γk1)+(λ1λ2)k2Γ(k2k1+1)1,D_{\mathrm{KL}}[p_{k_{1},\lambda_{1}}:p_{k_{2},\lambda_{2}}]=\log\frac{k_{1}}{\lambda_{1}^{k_{1}}}-\log\frac{k_{2}}{\lambda_{2}^{k_{2}}}+\left(k_{1}-k_{2}\right)\left(\log\lambda_{1}-\frac{\gamma}{k_{1}}\right)+\left(\frac{\lambda_{1}}{\lambda_{2}}\right)^{k_{2}}\Gamma\left(\frac{k_{2}}{k_{1}}+1\right)-1, (81)

where \gamma denotes the Euler-Mascheroni constant. Thus when k_{1}=k_{2}=k, we recover Eq. 80 since \Gamma(2)=1. (However, the family of Weibull densities with varying shape parameter is not an exponential family since the sufficient statistics depend on the shape parameter.)

In practice, we may program the formula of Eq. 77 by defining:

DKL,α[pλ1:pλ2]=log(p(ω;λ1)p(ω;λ2))+1αlog(p(ω;Mθ,α(λ2,λ1))p(ω;λ1)),D_{\mathrm{KL},\alpha}[p_{\lambda_{1}}:p_{\lambda_{2}}]=\log\left(\frac{p(\omega;\lambda_{1})}{p(\omega;\lambda_{2})}\right)+\frac{1}{\alpha}\log\left(\frac{p\left(\omega;M_{\theta,\alpha}(\lambda_{2},\lambda_{1})\right)}{p(\omega;\lambda_{1})}\right), (82)

and approximate the KLD by DKL,αD_{\mathrm{KL},\alpha} for a small value of α\alpha (say, α=103\alpha=10^{-3}). Thus we need only θ()\theta(\cdot) and θ1()\theta^{-1}(\cdot) for defining Mθ(,)M_{\theta}(\cdot,\cdot), and the density p(x;θ)p(x;\theta). This approximation also works for multivariate extensions of the quasi-arithmetic means.
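A minimal Python sketch of this numerical recipe (helper names ours), shown for the exponential distribution family of Example 8 below, for which \theta(u)=u so that M_{\theta,\alpha} is a weighted arithmetic mean:

import numpy as np

def density(x, lam):                      # exponential distribution density
    return lam * np.exp(-lam * x)

def kl_approx(lam1, lam2, alpha=1e-3, omega=0.0):
    # Eq. 82: D_{KL,alpha} approximates the KLD for a small value of alpha.
    theta = lambda u: u                   # ordinary-to-natural conversion (identity for this family)
    theta_inv = lambda u: u
    m = theta_inv(alpha * theta(lam2) + (1 - alpha) * theta(lam1))   # M_{theta,alpha}(lam2, lam1)
    return (np.log(density(omega, lam1) / density(omega, lam2))
            + np.log(density(omega, m) / density(omega, lam1)) / alpha)

lam1, lam2 = 1.0, 3.0
print(kl_approx(lam1, lam2))                           # approximately log(lam1/lam2) + lam2/lam1 - 1
print(np.log(lam1 / lam2) + lam2 / lam1 - 1.0)         # exact KLD, cf. Example 8 below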

Let us give two examples using the first-order approximation of the univariate quasi-arithmetic mean:

Example 8

Consider the family {p(x;λ)=λexp(λx)}\{p(x;\lambda)=\lambda\exp(-\lambda x)\} of exponential distributions with support 𝒳=[0,)\mathcal{X}=[0,\infty). Set ω=0\omega=0, p(ω;λ)=λp(\omega;\lambda)=\lambda, θ=λ\theta=\lambda, θ(u)=u\theta(u)=u and θ(u)=1\theta^{\prime}(u)=1. We have

D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}] = \log\left(\frac{\lambda_{1}}{\lambda_{2}}\right)+\lim_{\alpha\rightarrow 0}\frac{1}{\alpha}\log\frac{\lambda_{1}+\alpha(\lambda_{2}-\lambda_{1})}{\lambda_{1}}, (83)
 = \log\left(\frac{\lambda_{1}}{\lambda_{2}}\right)+\lim_{\alpha\rightarrow 0}\frac{1}{\alpha}\log\left(1+\alpha\left(\frac{\lambda_{2}}{\lambda_{1}}-1\right)\right), (84)
 = \log\left(\frac{\lambda_{1}}{\lambda_{2}}\right)+\frac{\lambda_{2}}{\lambda_{1}}-1. (85)
Example 9

Consider the Poisson family with ω=0\omega=0, pλ(ω)=exp(λ)p_{\lambda}(\omega)=\exp(-\lambda), θ(u)=logu\theta(u)=\log u (MθM_{\theta} is the geometric mean) and θ(u)=1u\theta^{\prime}(u)=\frac{1}{u}. We get

D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}] = \log\frac{p(\omega;\lambda_{1})}{p(\omega;\lambda_{2})}+\lim_{\alpha\rightarrow 0}\frac{1}{\alpha}\log\frac{p\left(\omega;\lambda_{1}+\alpha\frac{\theta(\lambda_{2})-\theta(\lambda_{1})}{\theta^{\prime}(\lambda_{1})}\right)}{p(\omega;\lambda_{1})}, (86)
=\displaystyle= λ2λ1+limα01αlogexp(αλ1logλ1λ2),\displaystyle\lambda_{2}-\lambda_{1}+\lim_{\alpha\rightarrow 0}\frac{1}{\alpha}\log\exp(\alpha\lambda_{1}\log\frac{\lambda_{1}}{\lambda_{2}}), (87)
=\displaystyle= λ2λ1+λ1logλ1λ2.\displaystyle\lambda_{2}-\lambda_{1}+\lambda_{1}\log\frac{\lambda_{1}}{\lambda_{2}}. (88)

3.2 Kullback-Leibler divergence formula relying on the differential entropy and moments

Consider the Kullback-Leibler divergence [25] (relative entropy) D_{\mathrm{KL}}[p:q]=\int p(x)\log\frac{p(x)}{q(x)}\mathrm{d}\mu(x) between two probability densities p and q. When the densities belong to the same exponential family, the KL divergence amounts to a Legendre-Fenchel divergence (the canonical expression of divergences using the dual coordinate systems in dually flat spaces of information geometry [1]):

D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}]=A_{F}(\theta(\lambda_{2}):\eta(\lambda_{1})), (89)

where the Legendre-Fenchel divergence is defined for a pair of strictly convex and differentiable conjugate generators F(θ)F(\theta) and F(η)=supθθηF(θ)F^{*}(\eta)=\sup_{\theta}\theta^{\top}\eta-F(\theta) by

AF(θ:η):=F(θ)+F(η)θη,A_{F}(\theta:\eta^{\prime}):=F(\theta)+F^{*}(\eta^{\prime})-\theta^{\top}\eta^{\prime}, (90)

with \eta^{\prime}=\nabla F(\theta^{\prime}) denoting the moment parameter corresponding to a natural parameter \theta^{\prime}.

Since FF is defined modulo some affine function, we can choose F(θ(λ))=logp(ω;θ(λ))F(\theta(\lambda))=-\log p(\omega;\theta(\lambda)). Furthermore, for exponential families, we have

η(λ)=Epλ[t(x)],\eta(\lambda)=E_{p_{\lambda}}[t(x)], (91)

and the Shannon entropy [46]

h(p)=𝒳p(x)logp(x)dμ(x),h(p)=-\int_{\mathcal{X}}p(x)\log p(x)\mathrm{d}\mu(x), (92)

admits the following expression [36] when p=pθp=p_{\theta} belongs to an exponential family \mathcal{E}:

h(pλ)=F(η(λ))Epλ[k(x)].h(p_{\lambda})=-F^{*}(\eta(\lambda))-E_{p_{\lambda}}[k(x)]. (93)

Thus if we already have at our disposal (1) the expectation of the sufficient statistic, and (2) the entropy, we can easily recover the Kullback-Leibler divergence as follows, choosing a point \omega\in\mathcal{X} such that \theta(\lambda_{2})^{\top}t(\omega)+k(\omega)=0 (for instance, a point with t(\omega)=0 and k(\omega)=0 when such a point exists):

D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}]=-\log p(\omega;\lambda_{2})-h(p_{\lambda_{1}})-E_{p_{\lambda_{1}}}[k(x)]-\theta(\lambda_{2})^{\top}E_{p_{\lambda_{1}}}[t(x)]. (94)

For densities pλ1p_{\lambda_{1}} and pλ2p_{\lambda_{2}} belonging to the same exponential family, the Jeffreys divergence is

D_{J}[p_{\lambda_{1}}:p_{\lambda_{2}}]=D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}]+D_{\mathrm{KL}}[p_{\lambda_{2}}:p_{\lambda_{1}}]=(\theta_{2}-\theta_{1})^{\top}(\eta_{2}-\eta_{1}). (95)

It follows that we can write Jeffreys’ divergence using the following cumulant-free expression:

DJ[pλ1:pλ2]=(θ(λ2)θ(λ1))(Epλ2[t(x)]Epλ1[t(x)]).D_{J}[p_{\lambda_{1}}:p_{\lambda_{2}}]=\left(\theta(\lambda_{2})-\theta(\lambda_{1})\right)^{\top}(E_{p_{\lambda_{2}}}[t(x)]-E_{p_{\lambda_{1}}}[t(x)]). (96)

Note that a strictly monotone operator OO defines a symmetric dissimilarity:

D_{O}(\theta_{1},\theta_{2}):=(\theta_{2}-\theta_{1})^{\top}(O(\theta_{2})-O(\theta_{1}))\geq 0, (97)

with equality if and only if \theta_{1}=\theta_{2}. Since \nabla F is a strictly monotone operator and E_{p_{\lambda}}[t(x)]=\nabla F(\theta), we may reinterpret Jeffreys' divergence as a symmetric dissimilarity induced by a strictly monotone operator.
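As a quick sanity check of Eq. 96 (a sketch with the arbitrary rates \lambda_{1}=2 and \lambda_{2}=5), consider the Poisson family for which \theta(\lambda)=\log\lambda and E_{p_{\lambda}}[t(x)]=E[x]=\lambda; the cumulant-free expression can be compared in Maxima against the sum of the two Poisson Kullback-Leibler divergences of Example 9:

/* Jeffreys divergence for the Poisson family via Eq. 96 */
theta(l) := log(l);
eta(l) := l;                                   /* E[t(x)] = E[x] = lambda */
DJ(l1,l2) := (theta(l2)-theta(l1))*(eta(l2)-eta(l1));
float(DJ(2,5));                                /* approximately 2.749 */
float((5-2+2*log(2/5)) + (2-5+5*log(5/2)));    /* approximately 2.749 */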

Let us report now some illustrating examples. We start with an example illustrating the use of a separable multivariate quasi-arithmetic mean.

Example 10 (continue Example 5)

Consider the Dirichlet exponential family. The differential entropy of a Dirichlet density pαp_{\alpha} is

h(pα)=logB(α)+(iαid)ψ(iαi)j=1d(αj1)ψ(αj),h(p_{\alpha})=\log\mathrm{B}(\alpha)+\left(\sum_{i}\alpha_{i}-d\right)\psi\left(\sum_{i}\alpha_{i}\right)-\sum_{j=1}^{d}\left(\alpha_{j}-1\right)\psi\left(\alpha_{j}\right),

where \psi(\cdot) denotes the digamma function. We have E[t(x)]=E[(\log x_{1},\ldots,\log x_{d})]=(\psi(\alpha_{1})-\psi(\alpha_{\sum}),\ldots,\psi(\alpha_{d})-\psi(\alpha_{\sum})) where \alpha_{\sum}=\sum_{i=1}^{d}\alpha_{i}. It follows that the Kullback-Leibler divergence between p_{\alpha} and p_{\alpha^{\prime}} is:

D_{\mathrm{KL}}[p_{\alpha}:p_{\alpha^{\prime}}]=\log\Gamma\left(\alpha_{\sum}\right)-\sum_{i=1}^{d}\log\Gamma\left(\alpha_{i}\right)-\log\Gamma\left(\alpha_{\sum}^{\prime}\right)+\sum_{i=1}^{d}\log\Gamma\left(\alpha_{i}^{\prime}\right)+\sum_{i=1}^{d}\left(\alpha_{i}-\alpha_{i}^{\prime}\right)\left(\psi\left(\alpha_{i}\right)-\psi\left(\alpha_{\sum}\right)\right).

Next, we report an example illustrating a non-separable multivariate quasi-arithmetic mean.

Example 11 (continue Example 6)

Consider the d-dimensional multivariate Gaussian family. Since k(x)=0 (so that E[k(x)]=0), \eta(\lambda)=\nabla_{\theta}F(\theta(\lambda))=(\mu,\mu\mu^{\top}+\Sigma) (since E[x]=\mu and E[xx^{\top}]=\mu\mu^{\top}+\Sigma), and the differential entropy is h(p_{\mu,\Sigma})=\frac{1}{2}\log|2\pi e\Sigma|=\frac{d}{2}(1+\log 2\pi)+\frac{1}{2}\log|\Sigma| (using |aM|=|a|^{d}|M|), we recover the Kullback-Leibler divergence as

DKL[pμ1,Σ1:pμ2,Σ2]=12(tr(Σ21Σ1)+ΔμΣ21Δμ+log|Σ2||Σ1|d),D_{\mathrm{KL}}[p_{\mu_{1},\Sigma_{1}}:p_{\mu_{2},\Sigma_{2}}]=\frac{1}{2}\left(\mathrm{tr}(\Sigma_{2}^{-1}\Sigma_{1})+\Delta_{\mu}^{\top}\Sigma_{2}^{-1}\Delta_{\mu}+\log\frac{|\Sigma_{2}|}{|\Sigma_{1}|}-d\right), (98)

and the Jeffreys divergence as

DJ[pμ1,Σ1,pμ2,Σ2]=Δμ(Σ11+Σ212)Δμ+12tr(Σ21Σ1+Σ11Σ2)d.D_{J}[p_{\mu_{1},\Sigma_{1}},p_{\mu_{2},\Sigma_{2}}]=\Delta_{\mu}^{\top}\left(\frac{\Sigma_{1}^{-1}+\Sigma_{2}^{-1}}{2}\right)\Delta_{\mu}+\frac{1}{2}\mathrm{tr}\left(\Sigma_{2}^{-1}\Sigma_{1}+\Sigma_{1}^{-1}\Sigma_{2}\right)-d. (99)

3.3 The Kullback-Leibler divergence expressed as a log density ratio

Let us express the Bregman divergence with the equivalent generator Fl(θ):=log(pθ(ω))F_{l}(\theta):=-\log(p_{\theta}(\omega)) for any prescribed ω𝒳\omega\in\mathcal{X} instead of the cumulant function of Eq. 7. We get

DKL[pλ1:pλ2]\displaystyle D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}] =\displaystyle= BFl(θ(λ2):θ(λ1)),\displaystyle B_{F_{l}}(\theta(\lambda_{2}):\theta(\lambda_{1})), (100)
=\displaystyle= log(pλ1(ω)pλ2(ω))(θ(λ2)θ(λ1))θFl(θ(λ1)).\displaystyle\log\left(\frac{p_{\lambda_{1}}(\omega)}{p_{\lambda_{2}}(\omega)}\right)-(\theta(\lambda_{2})-\theta(\lambda_{1}))^{\top}\nabla_{\theta}F_{l}(\theta(\lambda_{1})). (101)

Let us remark that

Fl(θ)=θt(ω)+F(θ)k(ω).F_{l}(\theta)=-\theta^{\top}t(\omega)+F(\theta)-k(\omega). (102)

We have

Fl(θ(λ1))\displaystyle\nabla F_{l}(\theta(\lambda_{1})) =\displaystyle= θpλ1(ω)pλ1(ω),\displaystyle-\frac{\nabla_{\theta}p_{\lambda_{1}}(\omega)}{p_{\lambda_{1}}(\omega)}, (103)
=\displaystyle= (t(ω)F(θ(λ1)))pλ1(ω)pλ1(ω),\displaystyle-\frac{(t(\omega)-\nabla F(\theta(\lambda_{1})))p_{\lambda_{1}}(\omega)}{p_{\lambda_{1}}(\omega)}, (104)
=\displaystyle= (t(ω)F(θ(λ1))),\displaystyle-(t(\omega)-\nabla F(\theta(\lambda_{1}))), (105)

where F(θ)F(\theta) is the cumulant function of Eq. 7. Alternatively, we can also directly find that Fl(θ(λ1))=(t(ω)F(θ(λ1)))\nabla F_{l}(\theta(\lambda_{1}))=-(t(\omega)-\nabla F(\theta(\lambda_{1}))) from Eq. 102.

It follows that:

DKL[pλ1:pλ2]=log(pλ1(ω)pλ2(ω))+(θ(λ2)θ(λ1))(t(ω)F(θ(λ1))),ω𝒳.D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}]=\log\left(\frac{p_{\lambda_{1}}(\omega)}{p_{\lambda_{2}}(\omega)}\right)+(\theta(\lambda_{2})-\theta(\lambda_{1}))^{\top}(t(\omega)-\nabla F(\theta(\lambda_{1}))),\quad\forall\omega\in\mathcal{X}. (106)

Thus when t(ω)F(θ(λ1))t(\omega)-\nabla F(\theta(\lambda_{1})) is Euclidean orthogonal to θ(λ2)θ(λ1)\theta(\lambda_{2})-\theta(\lambda_{1}) (i.e., (θ(λ2)θ(λ1))(t(ω)F(θ(λ1)))=0(\theta(\lambda_{2})-\theta(\lambda_{1}))^{\top}(t(\omega)-\nabla F(\theta(\lambda_{1})))=0), the Bregman divergence (and the corresponding KLD on swapped parameter order of the densities) is expressed as a log-density ratio quantity. Let

𝒳(λ1:λ2):={ω𝒳:(θ(λ2)θ(λ1))(t(ω)F(θ(λ1)))=0}.\mathcal{X}_{\perp}^{\mathcal{E}}(\lambda_{1}:\lambda_{2}):=\left\{\omega\in\mathcal{X}\ :\ (\theta(\lambda_{2})-\theta(\lambda_{1}))^{\top}(t(\omega)-\nabla F(\theta(\lambda_{1})))=0\right\}. (107)

Then DKL[pλ1:pλ2]=log(pλ1(ω)pλ2(ω))D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}]=\log\left(\frac{p_{\lambda_{1}}(\omega)}{p_{\lambda_{2}}(\omega)}\right) for all ω𝒳(λ1:λ2)\omega\in\mathcal{X}_{\perp}^{\mathcal{E}}(\lambda_{1}:\lambda_{2}).

Lemma 1

The Kullback-Leibler divergence between two densities pλ1p_{\lambda_{1}} and pλ2p_{\lambda_{2}} belonging to a same exponential family \mathcal{E} is expressed as a log density ratio, DKL[pλ1:pλ2]=log(pλ1(ω)pλ2(ω))D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}]=\log\left(\frac{p_{\lambda_{1}}(\omega)}{p_{\lambda_{2}}(\omega)}\right), whenever ω𝒳(λ1:λ2)\omega\in\mathcal{X}_{\perp}^{\mathcal{E}}(\lambda_{1}:\lambda_{2}).

Notice that a sufficient condition is to choose ω\omega such that t(ω)=η1=F(θ1)t(\omega)=\eta_{1}=\nabla F(\theta_{1}).

Thus if we carefully choose ω𝒳(λ1:λ2)\omega\in\mathcal{X}_{\perp}^{\mathcal{E}}(\lambda_{1}:\lambda_{2}) according to the source parameters, we may express the Kullback-Leibler divergence as a simple log density ratio without requiring the formula for the differential entropy nor the moment.

Example 12

Consider the exponential family ={pλ(x)=λexp(λx),λ>0}\mathcal{E}=\{p_{\lambda}(x)=\lambda\exp(-\lambda x),\ \lambda>0\} of exponential distributions with F(θ)=1θ\nabla F(\theta)=\frac{1}{\theta} for θ=λ\theta=\lambda. We have (θ(λ2)θ(λ1))(t(ω)F(θ(λ1)))=0(\theta(\lambda_{2})-\theta(\lambda_{1}))^{\top}(t(\omega)-\nabla F(\theta(\lambda_{1})))=0 that amounts to (λ2λ1)(ω1λ1)=0(\lambda_{2}-\lambda_{1})(\omega-\frac{1}{\lambda_{1}})=0, i.e., ω=1λ1\omega=\frac{1}{\lambda_{1}} (and 𝒳(λ1:λ2)={1λ1}\mathcal{X}_{\perp}^{\mathcal{E}}(\lambda_{1}:\lambda_{2})=\left\{\frac{1}{\lambda_{1}}\right\}). In that case, we have

DKL[pλ1:pλ2]\displaystyle D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}] =\displaystyle= log(pλ1(1λ1)pλ2(1λ1)),\displaystyle\log\left(\frac{p_{\lambda_{1}}\left(\frac{1}{\lambda_{1}}\right)}{p_{\lambda_{2}}\left(\frac{1}{\lambda_{1}}\right)}\right), (108)
=\displaystyle= log(λ1exp(λ11λ1)λ2exp(λ21λ1)),\displaystyle\log\left(\frac{\lambda_{1}\exp\left(-\lambda_{1}\frac{1}{\lambda_{1}}\right)}{\lambda_{2}\exp\left(-\lambda_{2}\frac{1}{\lambda_{1}}\right)}\right), (109)
=\displaystyle= log(λ1λ2)+λ2λ11.\displaystyle\log\left(\frac{\lambda_{1}}{\lambda_{2}}\right)+\frac{\lambda_{2}}{\lambda_{1}}-1. (110)

This formula matches the expression of Eq. 85.

Similarly, we may rewrite the Bregman divergence BF(θ1:θ2)B_{F}(\theta_{1}:\theta_{2}) as

BF(θ1:θ2)=BFa(θ1:θ2)=Fa(θ1)Fa(θ2),B_{F}(\theta_{1}:\theta_{2})=B_{F_{a}}(\theta_{1}:\theta_{2})=F_{a}(\theta_{1})-F_{a}(\theta_{2}), (111)

for F_{a}(\theta)=F(\theta)-a^{\top}\theta (and \nabla F_{a}(\theta)=\nabla F(\theta)-a), with a=\nabla F(\theta_{2}) so that \nabla F_{a}(\theta_{2})=0.

Example 13

Consider the exponential family of dd-dimensional Gaussians with fixed covariance matrix Σ\Sigma:

={pλ(x)=1(2π)d2|Σ|exp(12(xλ)Σ1(xλ)),:λd}.\mathcal{E}=\left\{p_{\lambda}(x)=\frac{1}{(2\pi)^{\frac{d}{2}}\sqrt{|\Sigma|}}\exp\left(-\frac{1}{2}(x-\lambda)^{\top}\Sigma^{-1}(x-\lambda)\right),\ :\ \lambda\in\mathbb{R}^{d}\right\}. (112)

It is an exponential family of order D=dD=d with sufficient statistic t(x)=xt(x)=x. Let us choose ω\omega such that (θ(λ2)θ(λ1))(t(ω)F(θ(λ1)))=0(\theta(\lambda_{2})-\theta(\lambda_{1}))^{\top}(t(\omega)-\nabla F(\theta(\lambda_{1})))=0. For example, we choose t(ω)=F(θ1)=μ1t(\omega)=\nabla F(\theta_{1})=\mu_{1}. It follows that we have

D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}]=\log\left(\frac{p_{\lambda_{1}}\left(\mu_{1}\right)}{p_{\lambda_{2}}\left(\mu_{1}\right)}\right), (113)
=\displaystyle= log((2π)d2|Σ|(2π)d2|Σ|exp(12(μ1μ2)Σ1(μ1μ2))),\displaystyle\log\left(\frac{(2\pi)^{\frac{d}{2}}\sqrt{|\Sigma|}}{(2\pi)^{\frac{d}{2}}\sqrt{|\Sigma|}}\exp\left(\frac{1}{2}(\mu_{1}-\mu_{2})^{\top}\Sigma^{-1}(\mu_{1}-\mu_{2})\right)\right), (114)
=\displaystyle= 12(μ1μ2)Σ1(μ1μ2).\displaystyle\frac{1}{2}(\mu_{1}-\mu_{2})^{\top}\Sigma^{-1}(\mu_{1}-\mu_{2}). (115)

This is half of the squared Mahalanobis distance obtained for the precision matrix Σ1\Sigma^{-1}. We recover the Kullback-Leibler divergence between multivariate Gaussians (Eq. 98) when Σ1=Σ2=Σ\Sigma_{1}=\Sigma_{2}=\Sigma.

Example 14

Consider the exponential family of Rayleigh distributions:

={pλ(x)=xλ2exp(x22λ2)},\mathcal{E}=\left\{p_{\lambda}(x)=\frac{x}{\lambda^{2}}\exp\left(-\frac{x^{2}}{2\lambda^{2}}\right)\right\}, (116)

for 𝒳=[0,)\mathcal{X}=[0,\infty). Let us choose ω\omega such that (θ(λ2)θ(λ1))(t(ω)F(θ(λ1)))=0(\theta(\lambda_{2})-\theta(\lambda_{1}))^{\top}(t(\omega)-\nabla F(\theta(\lambda_{1})))=0 with θ=12λ2\theta=-\frac{1}{2\lambda^{2}}, t(x)=x2t(x)=x^{2} and F(θ(λ))=2λ2\nabla F(\theta(\lambda))=2\lambda^{2}. We choose ω2=2λ12\omega^{2}=2\lambda^{2}_{1} (i.e., ω=λ12\omega=\lambda_{1}\sqrt{2}). It follows that we have

D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}]=\log\left(\frac{p_{\lambda_{1}}\left(\lambda_{1}\sqrt{2}\right)}{p_{\lambda_{2}}\left(\lambda_{1}\sqrt{2}\right)}\right), (117)
=\log\left(\frac{\frac{\sqrt{2}\lambda_{1}}{\lambda_{1}^{2}}\exp\left(-\frac{2\lambda_{1}^{2}}{2\lambda_{1}^{2}}\right)}{\frac{\sqrt{2}\lambda_{1}}{\lambda_{2}^{2}}\exp\left(-\frac{2\lambda_{1}^{2}}{2\lambda_{2}^{2}}\right)}\right), (118)
=2\log\left(\frac{\lambda_{2}}{\lambda_{1}}\right)+\frac{\lambda_{1}^{2}}{\lambda_{2}^{2}}-1. (119)

The next example shows a limitation of the method, which we shall then overcome.

Example 15

Consider the exponential family of univariate Gaussian distributions:

={pλ(x)=12πλ2exp((xλ1)22λ2)},\mathcal{E}=\left\{p_{\lambda}(x)=\frac{1}{\sqrt{2\pi\lambda_{2}}}\exp\left(-\frac{(x-\lambda_{1})^{2}}{2\lambda_{2}}\right)\right\}, (120)

for \mathcal{X}=(-\infty,\infty). Let us choose \omega such that (\theta(\lambda_{2})-\theta(\lambda_{1}))^{\top}(t(\omega)-\nabla F(\theta(\lambda_{1})))=0. Here, \nabla F(\theta(\lambda_{1}))=(\mu_{1},\mu_{1}^{2}+\sigma_{1}^{2}) and t(\omega)=(\omega,\omega^{2}). Thus there is no single \omega\in\mathcal{X} with t(\omega)=\nabla F(\theta(\lambda_{1})), since that would require \omega=\mu_{1} and \omega^{2}=\mu_{1}^{2}+\sigma_{1}^{2}>\mu_{1}^{2}.

Let H=\{\nabla F(\theta)\ :\ \theta\in\Theta\} denote the dual moment parameter space. When the exponential family is regular, \Theta is an open convex set, and so is H (F is of Legendre type). The problem we faced in the last example is that t(\omega)\in\partial H. However, since Eq. 106 holds for any \omega\in\mathcal{X}, let us choose s values for \omega (i.e., \omega_{1},\ldots,\omega_{s}), and average Eq. 106 over these s values. We have

DKL[pλ1:pλ2]\displaystyle D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}] =\displaystyle= 1si=1slog(pλ1(ωi)pλ2(ωi))+(θ(λ2)θ(λ1))(1si=1st(ωi)F(θ(λ1))).\displaystyle\frac{1}{s}\sum_{i=1}^{s}\log\left(\frac{p_{\lambda_{1}}(\omega_{i})}{p_{\lambda_{2}}(\omega_{i})}\right)+(\theta(\lambda_{2})-\theta(\lambda_{1}))^{\top}\left(\frac{1}{s}\sum_{i=1}^{s}t(\omega_{i})-\nabla F(\theta(\lambda_{1}))\right). (121)

We can now choose the \omega_{i}'s such that \frac{1}{s}\sum_{i=1}^{s}t(\omega_{i})\in H for s>1. We need to choose s so that the system of equations \frac{1}{s}\sum_{i=1}^{s}t(\omega_{i})=\nabla F(\theta(\lambda_{1}))=E_{p_{\lambda_{1}}}[t(x)] is solvable.

Example 16

Let us continue Example 15. Consider s=2s=2. We need to find ω1\omega_{1} and ω2\omega_{2} such that we can solve the following system of equations:

{μ1=ω1+ω22,μ12+σ12=ω12+ω222.\left\{\begin{array}[]{lcl}\mu_{1}&=&\frac{\omega_{1}+\omega_{2}}{2},\\ \mu_{1}^{2}+\sigma_{1}^{2}&=&\frac{\omega_{1}^{2}+\omega_{2}^{2}}{2}.\end{array}\right. (122)

The solution of this system of equations is ω1=μ1σ1\omega_{1}=\mu_{1}-\sigma_{1} and ω2=μ1+σ1\omega_{2}=\mu_{1}+\sigma_{1}. Thus it follows that we have:

DKL[pμ1,σ1:pμ2,σ2]=12(log(pμ1,σ1(μ1σ1)pμ2,σ2(μ1σ1))+log(pμ1,σ1(μ1+σ1)pμ2,σ2(μ1+σ1))).D_{\mathrm{KL}}[p_{\mu_{1},\sigma_{1}}:p_{\mu_{2},\sigma_{2}}]=\frac{1}{2}\left(\log\left(\frac{p_{\mu_{1},\sigma_{1}}(\mu_{1}-\sigma_{1})}{p_{\mu_{2},\sigma_{2}}(\mu_{1}-\sigma_{1})}\right)+\log\left(\frac{p_{\mu_{1},\sigma_{1}}(\mu_{1}+\sigma_{1})}{p_{\mu_{2},\sigma_{2}}(\mu_{1}+\sigma_{1})}\right)\right). (123)
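The following Maxima lines (a sketch with the arbitrary parameters \mu_{1}=0, \sigma_{1}=1, \mu_{2}=1, \sigma_{2}=2) check Eq. 123 against the classical closed-form formula for the KLD between univariate normal densities:

/* Eq. 123: KLD between univariate normals as the average of two log density ratios */
p(x,m,s) := exp(-(x-m)**2/(2*s**2))/sqrt(2*%pi*s**2);
klpair(m1,s1,m2,s2) := (1/2)*(log(p(m1-s1,m1,s1)/p(m1-s1,m2,s2)) + log(p(m1+s1,m1,s1)/p(m1+s1,m2,s2)));
klclosed(m1,s1,m2,s2) := log(s2/s1) + (s1**2+(m1-m2)**2)/(2*s2**2) - 1/2;
float(klpair(0,1,1,2));      /* approximately 0.443 */
float(klclosed(0,1,1,2));    /* approximately 0.443 */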

We conclude with the following theorem extending Lemma 1:

Theorem 1

The Kullback-Leibler divergence between two densities pλ1p_{\lambda_{1}} and pλ2p_{\lambda_{2}} belonging to a full regular exponential family \mathcal{E} of order DD can be expressed as the averaged sum of logarithms of density ratios:

DKL[pλ1:pλ2]=1si=1slog(pλ1(ωi)pλ2(ωi)),D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}]=\frac{1}{s}\sum_{i=1}^{s}\log\left(\frac{p_{\lambda_{1}}(\omega_{i})}{p_{\lambda_{2}}(\omega_{i})}\right),

where ω1,,ωs\omega_{1},\ldots,\omega_{s} are sD+1s\leq D+1 distinct samples of 𝒳\mathcal{X} chosen such that 1si=1st(ωi)=Epλ1[t(x)]\frac{1}{s}\sum_{i=1}^{s}t(\omega_{i})=E_{p_{\lambda_{1}}}[t(x)].

The bound sD+1s\leq D+1 follows from Carathéodory’s theorem [9].

Notice that the Monte Carlo stochastic approximation of the Kullback-Leibler divergence:

D~KL(m)[pλ1:pλ2]=1mi=1mlogpλ1(xi)pλ2(xi),\widetilde{D}_{\mathrm{KL}}^{(m)}[p_{\lambda_{1}}:p_{\lambda_{2}}]=\frac{1}{m}\sum_{i=1}^{m}\log\frac{p_{\lambda_{1}}(x_{i})}{p_{\lambda_{2}}(x_{i})}, (124)

for mm independent and identically distributed (iid) variates x1,,xmx_{1},\ldots,x_{m} from pλ1p_{\lambda_{1}} is also an average sum of log density ratios which yields a consistent estimator, i.e.,

limmD~KL(m)[pλ1:pλ2]=DKL[pλ1:pλ2].\lim_{m\rightarrow\infty}\widetilde{D}_{\mathrm{KL}}^{(m)}[p_{\lambda_{1}}:p_{\lambda_{2}}]=D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}]. (125)

In practice, to avoid potential negative values of \widetilde{D}_{\mathrm{KL}}^{(m)}[p_{\lambda_{1}}:p_{\lambda_{2}}], we estimate the extended Kullback-Leibler divergence:

\widetilde{D}_{\mathrm{KL}_{+}}^{(m)}[p_{\lambda_{1}}:p_{\lambda_{2}}]=\frac{1}{m}\sum_{i=1}^{m}\left(\log\frac{p_{\lambda_{1}}(x_{i})}{p_{\lambda_{2}}(x_{i})}+\frac{p_{\lambda_{2}}(x_{i})}{p_{\lambda_{1}}(x_{i})}-1\right)\geq 0, (126)

with limmD~KL+(m)[pλ1:pλ2]=DKL[pλ1:pλ2]\lim_{m\rightarrow\infty}\widetilde{D}_{\mathrm{KL}_{+}}^{(m)}[p_{\lambda_{1}}:p_{\lambda_{2}}]=D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}].
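For completeness, here is a small Maxima sketch of this Monte Carlo estimator for two exponential densities, drawing the variates from p_{\lambda_{1}} by inversion sampling (the rates \lambda_{1}=1, \lambda_{2}=3 and the sample size m=10000 are arbitrary choices):

/* Monte Carlo estimate of the extended KLD between two exponential densities */
p(x,l) := l*exp(-l*x);
l1 : 1.0;  l2 : 3.0;  m : 10000;
s : makelist(-log(1-random(1.0))/l1, i, 1, m);     /* inversion sampling from p_{lambda1} */
float((1/m)*sum(log(p(s[i],l1)/p(s[i],l2)) + p(s[i],l2)/p(s[i],l1) - 1, i, 1, m));
/* fluctuates around the exact value log(1/3)+3-1, approximately 0.901 */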

Example 17

We continue Example 6 of the dd-dimensional multivariate Gaussian family. First, let us consider the subfamily of zero-centered Gaussian densities. The sufficient statistic t(x)t(x) is a d×dd\times d matrix: t(x)=xxt(x)=xx^{\top}. We find the dd column vectors ωi\omega_{i}’s from the singular value decomposition of Σ1\Sigma_{1}:

Σ1=i=1dλieiei,\Sigma_{1}=\sum_{i=1}^{d}\lambda_{i}e_{i}e_{i}^{\top}, (127)

where the λi\lambda_{i}’s are the eigenvalues and the eie_{i}’s the corresponding eigenvectors. Let ωi=dλiei\omega_{i}=\sqrt{d\lambda_{i}}e_{i} for i{1,,d}i\in\{1,\ldots,d\}. We have

1di=1dt(ωi)=Σ1=EpΣ1[xx].\frac{1}{d}\sum_{i=1}^{d}t(\omega_{i})=\Sigma_{1}=E_{p_{\Sigma_{1}}}[xx^{\top}]. (128)

It follows that we can express the KLD between two zero-centered Gaussians pΣ1p_{\Sigma_{1}} and pΣ2p_{\Sigma_{2}} as the following weighted sum of log density ratios:

DKL[pΣ1:pΣ2]=1di=1dlog(pΣ1(dλiei)pΣ2(dλiei)),D_{\mathrm{KL}}[p_{\Sigma_{1}}:p_{\Sigma_{2}}]=\frac{1}{d}\sum_{i=1}^{d}\log\left(\frac{p_{\Sigma_{1}}(\sqrt{d\lambda_{i}}e_{i})}{p_{\Sigma_{2}}(\sqrt{d\lambda_{i}}e_{i})}\right), (129)

where the λi\lambda_{i}’s are the eigenvalues of Σ1\Sigma_{1} with the eie_{i}’s the corresponding eigenvectors. Notice that the order of the family is D=d(d+1)2D=\frac{d(d+1)}{2} but we used s=dDs=d\leq D vectors ω1,,ωd\omega_{1},\ldots,\omega_{d}.

The closed-form formula for the Kullback-Leibler divergence between two zero-centered Gaussians pΣ1p_{\Sigma_{1}} and pΣ2p_{\Sigma_{2}} is

D_{\mathrm{KL}}[p_{\Sigma_{1}}:p_{\Sigma_{2}}]=\frac{1}{2}\left(\mathrm{tr}\left(\Sigma_{2}^{-1}\Sigma_{1}\right)+\log\left(\frac{|\Sigma_{2}|}{|\Sigma_{1}|}\right)-d\right). (130)

Now consider the full family of multivariate normal densities. We shall use 2d2d vectors ωi\omega_{i} as follows:

{ωi=μ1dλiei,i{1,,d}ωi=μ1+dλiei,i{d+1,,2d}\displaystyle\left\{\begin{array}[]{ll}\omega_{i}=\mu_{1}-\sqrt{d\lambda_{i}}e_{i},&i\in\{1,\ldots,d\}\\ \omega_{i}=\mu_{1}+\sqrt{d\lambda_{i}}e_{i},&i\in\{d+1,\ldots,2d\}\end{array}\right. (133)

where the \lambda_{i}'s are the eigenvalues of \Sigma_{1} and the e_{i}'s the corresponding unit eigenvectors. It can be checked that \frac{1}{2d}\sum_{i=1}^{2d}\omega_{i}=E_{p_{\lambda_{1}}}[x]=\mu_{1} and \frac{1}{2d}\sum_{i=1}^{2d}\omega_{i}\omega_{i}^{\top}=\mu_{1}\mu_{1}^{\top}+\Sigma_{1}. These points are called the "sigma points" in the unscented transform [23, 17]. Moreover, writing \sqrt{\Sigma_{1}}:=E\,\mathrm{diag}(\sqrt{\lambda_{1}},\ldots,\sqrt{\lambda_{d}}) for the square-root factor built from the eigendecomposition \Sigma_{1}=E\,\mathrm{diag}(\lambda_{1},\ldots,\lambda_{d})\,E^{\top} (so that \sqrt{\Sigma_{1}}\sqrt{\Sigma_{1}}^{\top}=\Sigma_{1}), we have \sqrt{\lambda_{i}}e_{i}=[\sqrt{\Sigma_{1}}]_{\cdot,i}, the i-th column of \sqrt{\Sigma_{1}}. Thus it follows that

DKL[pμ1,Σ1:pμ2,Σ2]\displaystyle D_{\mathrm{KL}}[p_{\mu_{1},\Sigma_{1}}:p_{\mu_{2},\Sigma_{2}}] =\displaystyle= 12di=1d(log(pμ1,Σ1(μ1dλiei)pμ2,Σ2(μ1dλiei))+log(pμ1,Σ1(μ1+dλiei)pμ2,Σ2(μ1+dλiei))),\displaystyle\frac{1}{2d}\sum_{i=1}^{d}\left(\log\left(\frac{p_{\mu_{1},\Sigma_{1}}\left(\mu_{1}-\sqrt{d\lambda_{i}}e_{i}\right)}{p_{\mu_{2},\Sigma_{2}}\left(\mu_{1}-\sqrt{d\lambda_{i}}e_{i}\right)}\right)+\log\left(\frac{p_{\mu_{1},\Sigma_{1}}\left(\mu_{1}+\sqrt{d\lambda_{i}}e_{i}\right)}{p_{\mu_{2},\Sigma_{2}}\left(\mu_{1}+\sqrt{d\lambda_{i}}e_{i}\right)}\right)\right),
=\displaystyle= 12di=1d(log(pμ1,Σ1(μ1[dΣ1],i)pμ2,Σ2(μ1[dΣ1],i))+log(pμ1,Σ1(μ1+[dΣ1],i)pμ2,Σ2(μ1+[dΣ1],i))),\displaystyle\frac{1}{2d}\sum_{i=1}^{d}\left(\log\left(\frac{p_{\mu_{1},\Sigma_{1}}\left(\mu_{1}-[\sqrt{d\Sigma_{1}}]_{\cdot,i}\right)}{p_{\mu_{2},\Sigma_{2}}\left(\mu_{1}-[\sqrt{d\Sigma_{1}}]_{\cdot,i}\right)}\right)+\log\left(\frac{p_{\mu_{1},\Sigma_{1}}\left(\mu_{1}+[\sqrt{d\Sigma_{1}}]_{\cdot,i}\right)}{p_{\mu_{2},\Sigma_{2}}\left(\mu_{1}+[\sqrt{d\Sigma_{1}}]_{\cdot,i}\right)}\right)\right),

where [\sqrt{d\Sigma_{1}}]_{\cdot,i}=\sqrt{d\lambda_{i}}\,e_{i} denotes the vector extracted from the i-th column of the square root matrix of d\Sigma_{1}.

This formula matches the ordinary formula for the Kullback-Leibler divergence between the two Gaussian densities pμ1,Σ1p_{\mu_{1},\Sigma_{1}} and pμ2,Σ2p_{\mu_{2},\Sigma_{2}}:

D_{\mathrm{KL}}[p_{\mu_{1},\Sigma_{1}}:p_{\mu_{2},\Sigma_{2}}]=\frac{1}{2}\left((\mu_{2}-\mu_{1})^{\top}\Sigma_{2}^{-1}(\mu_{2}-\mu_{1})+\mathrm{tr}\left(\Sigma_{2}^{-1}\Sigma_{1}\right)+\log\left(\frac{|\Sigma_{2}|}{|\Sigma_{1}|}\right)-d\right). (134)
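The following Maxima sketch checks Eq. 129 in dimension d=2, with \Sigma_{1} taken diagonal so that its eigenvectors are simply the standard basis vectors, and an arbitrary \Sigma_{2} (all numerical values are illustrative choices):

/* Eq. 129 for zero-centered bivariate normals (d=2), with Sigma1 diagonal */
d : 2;
S1 : matrix([4,0],[0,1]);
S2 : matrix([2,1],[1,3]);
p(x,S) := exp(-(transpose(x) . invert(S) . x)/2)/sqrt((2*%pi)**d*determinant(S));
w1 : sqrt(d*4)*matrix([1],[0]);    /* sqrt(d*lambda_1) e_1 */
w2 : sqrt(d*1)*matrix([0],[1]);    /* sqrt(d*lambda_2) e_2 */
float((1/d)*(log(p(w1,S1)/p(w1,S2)) + log(p(w2,S1)/p(w2,S2))));              /* approximately 0.512 */
MM : invert(S2) . S1;
float((1/2)*(MM[1,1]+MM[2,2] + log(determinant(S2)/determinant(S1)) - d));   /* approximately 0.512 */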
Example 18

We continue the Example 4 of the exponential family of gamma distributions which has order D=2D=2. The sufficient statistic vector is t(x)=(x,logx)t(x)=(x,\log x), and we have E[x]=αβE[x]=\frac{\alpha}{\beta} and E[logx]=ψ(α)logβE[\log x]=\psi(\alpha)-\log\beta, where ψ()\psi(\cdot) is the digamma function. We want to express DKL[pα1,β1:pα2,β2]D_{\mathrm{KL}}[p_{\alpha_{1},\beta_{1}}:p_{\alpha_{2},\beta_{2}}] as an average sum of log ratio of densities. To find the values of ω1\omega_{1} and ω2\omega_{2}, we need to solve the following system of equations:

{ω1+ω22=a,logω1+logω22=b,\left\{\begin{array}[]{lll}\frac{\omega_{1}+\omega_{2}}{2}&=&a,\\ \frac{\log\omega_{1}+\log\omega_{2}}{2}&=&b\end{array}\right., (135)

with a=α1β1a=\frac{\alpha_{1}}{\beta_{1}} and b=ψ(α1)logβ1b=\psi(\alpha_{1})-\log\beta_{1}. We find the following two solutions:

ω1=aa2exp(2b),ω2=a+a2exp(2b).\omega_{1}=a-\sqrt{a^{2}-\exp(2b)},\quad\omega_{2}=a+\sqrt{a^{2}-\exp(2b)}. (136)

We have a^{2}-\exp(2b)\geq 0 since \exp(E[\log x])\leq E[x] by Jensen's inequality (the logarithm is concave), i.e., \exp(2b)\leq a^{2}.

It follows that

D_{\mathrm{KL}}[p_{\alpha_{1},\beta_{1}}:p_{\alpha_{2},\beta_{2}}]=\frac{1}{2}\left(\log\frac{p_{\alpha_{1},\beta_{1}}(\omega_{1})}{p_{\alpha_{2},\beta_{2}}(\omega_{1})}+\log\frac{p_{\alpha_{1},\beta_{1}}(\omega_{2})}{p_{\alpha_{2},\beta_{2}}(\omega_{2})}\right). (137)

This expression of the KLD is the same as the ordinary expression of the KLD:

DKL[pα1,β1:pα2,β2]=(α1α2)ψ(α1)logΓ(α1)+logΓ(α2)+α2logβ1β2+α1β2β1β1.D_{\mathrm{KL}}[p_{\alpha_{1},\beta_{1}}:p_{\alpha_{2},\beta_{2}}]=(\alpha_{1}-\alpha_{2})\psi(\alpha_{1})-\log\Gamma(\alpha_{1})+\log\Gamma(\alpha_{2})+\alpha_{2}\log\frac{\beta_{1}}{\beta_{2}}+\alpha_{1}\frac{\beta_{2}-\beta_{1}}{\beta_{1}}. (138)
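These two roots are easily computed numerically; the following Maxima sketch (with the arbitrary parameters \alpha_{1}=2, \beta_{1}=3, \alpha_{2}=5, \beta_{2}=4, the value \alpha_{1}=2 being chosen so that \psi(\alpha_{1})=1-\gamma is known exactly to Maxima) checks the two-point formula of Eq. 137 against Eq. 138:

/* Two-point log-ratio formula for the gamma family (Eq. 137) versus Eq. 138 */
a1 : 2;  b1 : 3;  a2 : 5;  b2 : 4;
p(x,al,be) := be**al * x**(al-1) * exp(-be*x)/gamma(al);
a : a1/b1;
b : psi[0](a1) - log(b1);
w1 : a - sqrt(a**2 - exp(2*b));
w2 : a + sqrt(a**2 - exp(2*b));
float((1/2)*(log(p(w1,a1,b1)/p(w1,a2,b2)) + log(p(w2,a1,b1)/p(w2,a2,b2))));   /* approximately 1.138 */
float((a1-a2)*psi[0](a1) - log(gamma(a1)) + log(gamma(a2)) + a2*log(b1/b2) + a1*(b2-b1)/b1);   /* approximately 1.138 */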
Example 19

Consider the exponential family of Beta distributions:

={pα,β(x)=xα1(1x)β1B(α,β):α>0,β>0,x𝒳=(0,1)},\mathcal{E}=\left\{p_{\alpha,\beta}(x)=\frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}\ :\alpha>0,\beta>0,x\in\mathcal{X}=(0,1)\right\}, (139)

where

B(α,β)=Γ(α)Γ(β)Γ(α+β)B(\alpha,\beta)=\frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)} (140)

The sufficient statistic vector is t(x)=(logx,log(1x))t(x)=(\log x,\log(1-x)). We have Epα1,β1[log(x)]=ψ(α1)ψ(α1+β1)=:AE_{p_{\alpha_{1},\beta_{1}}}[\log(x)]=\psi(\alpha_{1})-\psi(\alpha_{1}+\beta_{1})=:A and Epα1,β1[log(1x)]=ψ(β1)ψ(α1+β1)=:BE_{p_{\alpha_{1},\beta_{1}}}[\log(1-x)]=\psi(\beta_{1})-\psi(\alpha_{1}+\beta_{1})=:B. We need to solve the following system of equations for s=2s=2:

{logω1+logω22=A,log(1ω1)+log(1ω2)2=B\left\{\begin{array}[]{lcl}\frac{\log\omega_{1}+\log\omega_{2}}{2}&=&A,\\ \frac{\log(1-\omega_{1})+\log(1-\omega_{2})}{2}&=&B\end{array}\right. (141)

This system is equivalent to

\left\{\begin{array}{lcl}\omega_{1}\omega_{2}&=&\exp(2A):=a,\\ (1-\omega_{1})(1-\omega_{2})&=&\exp(2B):=b\end{array}\right. (142)

We find

ω1\displaystyle\omega_{1} =\displaystyle= b+a+1Δ2,\displaystyle\frac{-b+a+1-\sqrt{\Delta}}{2}, (143)
ω2\displaystyle\omega_{2} =\displaystyle= b+a+1+Δ2,\displaystyle\frac{-b+a+1+\sqrt{\Delta}}{2}, (144)

where

Δ=b22(a+1)b+a22a+1.\Delta=b^{2}-2(a+1)b+a^{2}-2a+1. (145)

It follows that we have

DKL[pα1,β1:pα2,β2]=12(logpα1,β1(ω1)pα2,β2(ω1)+logpα1,β1(ω2)pα2,β2(ω2)).D_{\mathrm{KL}}[p_{\alpha_{1},\beta_{1}}:p_{\alpha_{2},\beta_{2}}]=\frac{1}{2}\left(\log\frac{p_{\alpha_{1},\beta_{1}}(\omega_{1})}{p_{\alpha_{2},\beta_{2}}(\omega_{1})}+\log\frac{p_{\alpha_{1},\beta_{1}}(\omega_{2})}{p_{\alpha_{2},\beta_{2}}(\omega_{2})}\right). (146)

This formula is equivalent to the ordinary formula for the KLD between two beta densities pα1,β1p_{\alpha_{1},\beta_{1}} and pα2,β2p_{\alpha_{2},\beta_{2}}:

DKL[pα1,β1:pα2,β2]=\displaystyle D_{\mathrm{KL}}[p_{\alpha_{1},\beta_{1}}:p_{\alpha_{2},\beta_{2}}]=
log(B(α2,β2)B(α1,β1))+(α1α2)ψ(α1)+(β1β2)ψ(β1)+(α2α1+β2β1)ψ(α1+β1).\displaystyle\log\left(\frac{\mathrm{B}\left(\alpha_{2},\beta_{2}\right)}{\mathrm{B}(\alpha_{1},\beta_{1})}\right)+\left(\alpha_{1}-\alpha_{2}\right)\psi(\alpha_{1})+\left(\beta_{1}-\beta_{2}\right)\psi(\beta_{1})+\left(\alpha_{2}-\alpha_{1}+\beta_{2}-\beta_{1}\right)\psi(\alpha_{1}+\beta_{1}).

Notice that the ωi\omega_{i}’s are chosen according to λ1\lambda_{1}. Thus we may express the Voronoi bisector:

Bi(pλ1,pλ2):={λ:DKL[pλ:pλ1]=DKL[pλ:pλ2]}\mathrm{Bi}(p_{\lambda_{1}},p_{\lambda_{2}}):=\left\{\lambda\ :\ D_{\mathrm{KL}}[p_{\lambda}:p_{\lambda_{1}}]=D_{\mathrm{KL}}[p_{\lambda}:p_{\lambda_{2}}]\right\} (148)

as follows:

Bi(pλ1,pλ2)={λ:1si=1slog(pλ(ωi)pλ1(ωi))=1si=1slog(pλ(ωi)pλ2(ωi))}.\mathrm{Bi}(p_{\lambda_{1}},p_{\lambda_{2}})=\left\{\lambda\ :\ \frac{1}{s}\sum_{i=1}^{s}\log\left(\frac{p_{\lambda}(\omega_{i})}{p_{\lambda_{1}}(\omega_{i})}\right)=\frac{1}{s}\sum_{i=1}^{s}\log\left(\frac{p_{\lambda}(\omega_{i})}{p_{\lambda_{2}}(\omega_{i})}\right)\right\}. (149)

In particular, when s=1s=1, the Voronoi bisector is expressed as:

Bi(pλ1,pλ2)={λ:pλ1(ω(λ))=pλ2(ω(λ))},\mathrm{Bi}(p_{\lambda_{1}},p_{\lambda_{2}})=\left\{\lambda\ :\ p_{\lambda_{1}}(\omega(\lambda))=p_{\lambda_{2}}(\omega(\lambda))\right\}, (150)

where \omega(\lambda) emphasizes the fact that \omega is a function of \lambda. The statistical Voronoi diagrams of a finite set of densities belonging to an exponential family have been studied as equivalent Bregman Voronoi diagrams in [7].

3.4 The Jensen-Shannon divergence

The Jensen-Shannon divergence [29] (JSD) is another symmetrization of the Kullback-Leibler divergence which can be given many information-theoretic interpretations [32] and which is further guaranteed to be always bounded by log2\log 2 (KLD and JD are unbounded):

DJS[p,q]\displaystyle D_{\mathrm{JS}}[p,q] :=\displaystyle:= 12(DKL[p:p+q2]+DKL[q:p+q2]),\displaystyle\frac{1}{2}\left(D_{\mathrm{KL}}\left[p:\frac{p+q}{2}\right]+D_{\mathrm{KL}}\left[q:\frac{p+q}{2}\right]\right), (151)
=\displaystyle= h(p+q2)h(p)+h(q)2.\displaystyle h\left(\frac{p+q}{2}\right)-\frac{h(p)+h(q)}{2}. (152)

In general, the JSD between densities of a parametric family does not admit a closed-form formula [40]. However, in the particular case where the mixture \frac{p_{\theta_{1}}+p_{\theta_{2}}}{2} belongs to the same parametric family of densities, we can calculate the Jensen-Shannon divergence using the entropy function as shown in Eq. 152. For example, consider a mixture family in information geometry [1]: a statistical mixture with k prescribed components p_{1}(x),\ldots,p_{k}(x) which are linearly independent (so that all mixtures of the family are identifiable by their corresponding parameters). Let m_{\lambda}(x)=\sum_{i=1}^{k}w_{i}p_{i}(x), where \lambda denotes the vector of mixture weights (w_{i}>0, \sum_{i}w_{i}=1). In that particular case (e.g., a mixture family with k prescribed Gaussian components), we get

mλ1+mλ22=mλ1+λ22.\frac{m_{\lambda_{1}}+m_{\lambda_{2}}}{2}=m_{\frac{\lambda_{1}+\lambda_{2}}{2}}. (153)

Thus the JSD for a mixture family can be expressed using the entropy as:

D_{\mathrm{JS}}[m_{\lambda_{1}},m_{\lambda_{2}}]=h\left(m_{\frac{\lambda_{1}+\lambda_{2}}{2}}\right)-\frac{h(m_{\lambda_{1}})+h(m_{\lambda_{2}})}{2}. (154)

Although we do not have a closed-form formula for the entropy of a mixture in general (except in a few cases, e.g., when the supports of the component distributions are pairwise disjoint [32]), we can use any approximation or bounding method for the entropy of a mixture to approximate or bound [40] the Jensen-Shannon divergence D_{\mathrm{JS}}.
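As an illustration, the following Maxima sketch evaluates Eq. 154 for a mixture family with two prescribed unit-variance Gaussian components centered at 0 and 4 (arbitrary choices), estimating the mixture entropies by numerical quadrature:

/* JSD between two mixtures of the same two prescribed Gaussian components */
phi(x,m) := exp(-(x-m)**2/2)/sqrt(2*%pi);
mix(x,w) := w*phi(x,0) + (1-w)*phi(x,4);
hent(w) := -first(quad_qags(mix(x,w)*log(mix(x,w)), x, -20, 30));
jsd(w1,w2) := hent((w1+w2)/2) - (hent(w1)+hent(w2))/2;
jsd(0.3, 0.8);    /* nonnegative and bounded above by log(2) */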

4 Conclusion

We have described several methods to easily recover closed-form formulas for some common (dis)similarities between densities belonging to a same exponential family \{p(x;\lambda)\}_{\lambda\in\Lambda}, which are usually expressed using the cumulant function F of the exponential family (e.g., the Kullback-Leibler divergence amounts to a reverse Bregman divergence and the Bhattacharyya distance amounts to a Jensen divergence). Our trick consists in observing that the generators F of the Bregman or Jensen divergences are defined modulo an affine term, so that we may choose F(\theta(\lambda))=-\log p(\omega;\lambda) for any \omega falling inside the support \mathcal{X}. It follows that the Bhattacharyya coefficient can be calculated with the following cumulant-free expression:

\rho[p_{\lambda_{1}},p_{\lambda_{2}}]=\frac{\sqrt{p(\omega;\lambda_{1})\,p(\omega;\lambda_{2})}}{p(\omega;\bar{\lambda})},\quad\forall\omega\in\mathcal{X}, (155)

where \bar{\lambda}=M_{\theta}(\lambda_{1},\lambda_{2}) is a generalized quasi-arithmetic mean induced by the ordinary-to-natural parameter conversion function \theta(\lambda). Thus our method requires only a partial canonical factorization of the densities of an exponential family to get \theta(\lambda). The formulas for the Bhattacharyya distance, the Hellinger distance, and the \alpha-divergences follow straightforwardly:

D_{B}[p_{\lambda_{1}},p_{\lambda_{2}}]=\log\left(\frac{p(\omega;\bar{\lambda})}{\sqrt{p(\omega;\lambda_{1})\,p(\omega;\lambda_{2})}}\right), (156)
D_{H}[p_{\lambda_{1}},p_{\lambda_{2}}]=\sqrt{1-\frac{\sqrt{p(\omega;\lambda_{1})\,p(\omega;\lambda_{2})}}{p(\omega;\bar{\lambda})}}, (157)
D_{\alpha}[p_{\lambda_{1}}:p_{\lambda_{2}}]=\frac{1}{\alpha(1-\alpha)}\left(1-\frac{p(\omega;\lambda_{1})^{\alpha}\,p(\omega;\lambda_{2})^{1-\alpha}}{p\left(\omega;M_{\theta,\alpha}(\lambda_{1},\lambda_{2})\right)}\right),\quad\alpha\in\mathbb{R}\backslash\{0,1\}. (158)

In practice, it is easy to program these formulas using legacy software APIs which offer many parametric densities in their libraries: First, we check that the distribution is an exponential family. Then we set \omega to be any point of the support \mathcal{X}, partially factorize the density in order to retrieve \theta(\lambda) and its reciprocal function \lambda(\theta), and, equipped with these functions, we implement the corresponding generalized weighted quasi-arithmetic mean M_{\theta,\alpha}(a,b)=\theta^{-1}(\alpha\theta(a)+(1-\alpha)\theta(b)) to calculate \bar{\lambda}=M_{\theta}(\lambda_{1},\lambda_{2})=M_{\theta,\frac{1}{2}}(\lambda_{1},\lambda_{2}).
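For instance, here is a minimal Maxima sketch of this recipe for the family of exponential distributions (with \omega=0, an arbitrary point of the support), which recovers the Bhattacharyya coefficient obtained by direct integration in the appendix:

/* Cumulant-free Bhattacharyya coefficient for the exponential distributions */
theta(l) := l;
thetainv(t) := t;
M(a,b,alpha) := thetainv(alpha*theta(a)+(1-alpha)*theta(b));
p(x,l) := l*exp(-l*x);
omega : 0;
rho(l1,l2) := sqrt(p(omega,l1)*p(omega,l2))/p(omega,M(l1,l2,1/2));
ratsimp(rho(l1,l2));    /* simplifies to 2*sqrt(l1*l2)/(l1+l2) */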

To calculate the Kullback-Leibler divergence (and Jeffreys’ divergence) without the explicit use of the cumulant function, we reported two methods: The first method consists in expressing the KLD as a limit of α\alpha-skew Bhattacharyya distance which writes as:

DKL[pλ1:pλ2]=log(p(ω;λ1)p(ω;λ2))+limα01αlog(p(ω;Mθ,α(λ2,λ1))p(ω,λ1)).D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}]=\log\left(\frac{p(\omega;\lambda_{1})}{p(\omega;\lambda_{2})}\right)+\lim_{\alpha\rightarrow 0}\frac{1}{\alpha}\log\left(\frac{p\left(\omega;M_{\theta,\alpha}(\lambda_{2},\lambda_{1})\right)}{p(\omega,\lambda_{1})}\right). (159)

This limit can be calculated symbolically using a computer algebra system, or approximated for a small value of α\alpha by

DKL,α[pλ1:pλ2]=log(p(ω;λ1)p(ω;λ2))+1αlog(p(ω;Mθ,α(λ2,λ1))p(ω;λ1)).D_{\mathrm{KL},\alpha}[p_{\lambda_{1}}:p_{\lambda_{2}}]=\log\left(\frac{p(\omega;\lambda_{1})}{p(\omega;\lambda_{2})}\right)+\frac{1}{\alpha}\log\left(\frac{p\left(\omega;M_{\theta,\alpha}(\lambda_{2},\lambda_{1})\right)}{p(\omega;\lambda_{1})}\right). (160)

When dealing with uni-order exponential family, we can use a first-order approximation of the weighted quasi-arithmetic mean to express the KLD as the following limit:

DKL[pλ1:pλ2]=log(p(ω;λ1)p(ω;λ2))+limα01αlog(p(ω;λ1+αθ(λ2)θ(λ1)θ(λ1))p(ω;λ1)).D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}]=\log\left(\frac{p(\omega;\lambda_{1})}{p(\omega;\lambda_{2})}\right)+\lim_{\alpha\rightarrow 0}\frac{1}{\alpha}\log\left(\frac{p\left(\omega;\lambda_{1}+\alpha\frac{\theta(\lambda_{2})-\theta(\lambda_{1})}{\theta^{\prime}(\lambda_{1})}\right)}{p(\omega;\lambda_{1})}\right). (161)

Notice that we can also estimate DKL,αD_{\mathrm{KL},\alpha}, ρα\rho_{\alpha} and related dissimilarities (e.g., when the cumulant function is intractable) using density ratio estimation techniques [48].

The second approach consists in using the entropy and moment formulas which are often available when dealing with parametric distributions. When the parametric distributions form an exponential family, the KLD is equivalent to a Legendre-Fenchel divergence, and, choosing \omega\in\mathcal{X} such that \theta(\lambda_{2})^{\top}t(\omega)+k(\omega)=0 as in \S 3.2, we write this Legendre-Fenchel divergence as:

DKL[pλ1:pλ2]=logp(ω;λ2)h(pλ1)Epλ1[k(x)]θ(λ2)Epλ1[t(x)].D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}]=-\log p(\omega;\lambda_{2})-h(p_{\lambda_{1}})-E_{p_{\lambda_{1}}}[k(x)]-\theta(\lambda_{2})^{\top}E_{p_{\lambda_{1}}}[t(x)]. (162)

It follows that the Jeffreys’ divergence is expressed as

DJ[pλ1,pλ2]=(θ(λ2)θ(λ1))(Epλ2[t(x)]Epλ1[t(x)]).D_{J}[p_{\lambda_{1}},p_{\lambda_{2}}]=\left(\theta(\lambda_{2})-\theta(\lambda_{1})\right)^{\top}(E_{p_{\lambda_{2}}}[t(x)]-E_{p_{\lambda_{1}}}[t(x)]). (163)

Finally, we proved in §3.3 that the Kullback-Leibler divergence between two densities pλ1p_{\lambda_{1}} and pλ2p_{\lambda_{2}} of an exponential family \mathcal{E} of order DD can be expressed as

DKL[pλ1:pλ2]=1si=1slog(pλ1(ωi)pλ2(ωi)),D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}]=\frac{1}{s}\sum_{i=1}^{s}\log\left(\frac{p_{\lambda_{1}}(\omega_{i})}{p_{\lambda_{2}}(\omega_{i})}\right),

where ω1,,ωs\omega_{1},\ldots,\omega_{s} are sD+1s\leq D+1 distinct samples of 𝒳\mathcal{X} chosen such that 1si=1st(ωi)=Epλ1[t(x)]\frac{1}{s}\sum_{i=1}^{s}t(\omega_{i})=E_{p_{\lambda_{1}}}[t(x)]. We illustrated how to find the ωi\omega_{i}’s for the univariate Gaussian family and the multivariate zero-centered Gaussian family.

To conclude this work, let us emphasize that we have revealed a new kind of invariance when providing closed-form formulas for common (dis)similarities between densities of an exponential family without explicitly using the cumulant function of that exponential family: For the Bhattacharyya/Hellinger/\alpha-divergences, the point \omega can be chosen arbitrarily in the support \mathcal{X}. For the Kullback-Leibler divergence, by carefully choosing a set of \omega's, we may express the Kullback-Leibler divergence as an averaged sum of log density ratios.

References

  • [1] Shun-ichi Amari. Information Geometry and Its Applications. Applied Mathematical Sciences. Springer Japan, 2016.
  • [2] Tsuyoshi Ando and Fumio Hiai. Operator log-convex functions and operator means. Mathematische Annalen, 350(3):611–630, 2011.
  • [3] Ole Barndorff-Nielsen. Information and exponential families. John Wiley & Sons, 2014.
  • [4] Christian Bauckhage. Computing the Kullback–Leibler divergence between two Weibull distributions. arXiv preprint arXiv:1310.3713, 2013.
  • [5] Anil Bhattacharyya. On a measure of divergence between two multinomial populations. Sankhyā: the indian journal of statistics, pages 401–406, 1946.
  • [6] Patrick Billingsley. Probability and Measure. Wiley, 3 edition, April 1995.
  • [7] Jean-Daniel Boissonnat, Frank Nielsen, and Richard Nock. Bregman Voronoi diagrams. Discrete & Computational Geometry, 44(2):281–307, 2010.
  • [8] Lev M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR computational mathematics and mathematical physics, 7(3):200–217, 1967.
  • [9] Constantin Carathéodory. Über den Variabilitätsbereich der Koeffizienten von Potenzreihen, die gegebene Werte nicht annehmen. Mathematische Annalen, 64(1):95–115, 1907.
  • [10] Herman Chernoff et al. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, 23(4):493–507, 1952.
  • [11] Francis Clarke. On the inverse function theorem. Pacific Journal of Mathematics, 64(1):97–102, 1976.
  • [12] Thomas M. Cover and Joy A. Thomas. Elements of information theory. John Wiley & Sons, 2012.
  • [13] Imre Csiszár. Information-type measures of difference of probability distributions and indirect observation. studia scientiarum Mathematicarum Hungarica, 2:229–318, 1967.
  • [14] Joan Del Castillo. The singly truncated normal distribution: a non-steep exponential family. Annals of the Institute of Statistical Mathematics, 46(1):57–66, 1994.
  • [15] Shinto Eguchi and Osamu Komori. Path connectedness on a space of probability density functions. In International Conference on Geometric Science of Information, pages 615–624. Springer, 2015.
  • [16] Jun Ichi Fujii. Path of quasi-means as a geodesic. Linear algebra and its applications, 434(2):542–558, 2011.
  • [17] Jacob Goldberger, Hayit K Greenspan, and Jeremie Dreyfuss. Simplifying mixture models using the unscented transform. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(8):1496–1502, 2008.
  • [18] Teofilo F Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293–306, 1985.
  • [19] Ernst Hellinger. Neue begründung der theorie quadratischer formen von unendlichvielen veränderlichen. Journal für die reine und angewandte Mathematik (Crelles Journal), 136:210–271, 1909.
  • [20] Harold Jeffreys. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences, 186(1007):453–461, 1946.
  • [21] Robert Jenssen, Jose C Principe, Deniz Erdogmus, and Torbjørn Eltoft. The Cauchy–Schwarz divergence and Parzen windowing: Connections to graph theory and Mercer kernels. Journal of the Franklin Institute, 343(6):614–629, 2006.
  • [22] Shihao Ji, Balaji Krishnapuram, and Lawrence Carin. Variational Bayes for continuous hidden Markov models and its application to active learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):522–532, 2006.
  • [23] Simon Julier, Jeffrey Uhlmann, and Hugh F Durrant-Whyte. A new method for the nonlinear transformation of means and covariances in filters and estimators. IEEE Transactions on automatic control, 45(3):477–482, 2000.
  • [24] Thomas Kailath. The divergence and Bhattacharyya distance measures in signal selection. IEEE transactions on communication technology, 15(1):52–60, 1967.
  • [25] Solomon Kullback and Richard A Leibler. On information and sufficiency. The annals of mathematical statistics, 22(1):79–86, 1951.
  • [26] Frank Nielsen. Closed-form information-theoretic divergences for statistical mixtures. In Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012), pages 1723–1726. IEEE, 2012.
  • [27] Frank Nielsen. An information-geometric characterization of Chernoff information. IEEE Signal Processing Letters, 20(3):269–272, 2013.
  • [28] Frank Nielsen. An elementary introduction to information geometry. arXiv preprint arXiv:1808.08271, 2018.
  • [29] Frank Nielsen. On the Jensen–Shannon symmetrization of distances relying on abstract means. Entropy, 21(5):485, 2019.
  • [30] Frank Nielsen. A generalization of the α\alpha-divergences based on comparable and distinct weighted means. CoRR, abs/2001.09660, 2020.
  • [31] Frank Nielsen. A note on Onicescu’s informational energy and correlation coefficient in exponential families. arXiv preprint arXiv:2003.13199, 2020.
  • [32] Frank Nielsen. On a generalization of the Jensen–Shannon divergence and the Jensen–Shannon centroid. Entropy, 22(2):221, 2020.
  • [33] Frank Nielsen and Sylvain Boltz. The Burbea-Rao and Bhattacharyya centroids. IEEE Transactions on Information Theory, 57(8):5455–5466, 2011.
  • [34] Frank Nielsen and Vincent Garcia. Statistical exponential families: A digest with flash cards. Technical report, arXiv:0911.4863, 2009.
  • [35] Frank Nielsen and Gaëtan Hadjeres. On power chi expansions of ff-divergences. arXiv preprint arXiv:1903.05818, 2019.
  • [36] Frank Nielsen and Richard Nock. Entropies and cross-entropies of exponential families. In IEEE International Conference on Image Processing, pages 3621–3624. IEEE, 2010.
  • [37] Frank Nielsen and Richard Nock. A closed-form expression for the Sharma–Mittal entropy of exponential families. Journal of Physics A: Mathematical and Theoretical, 45(3):032003, 2011.
  • [38] Frank Nielsen and Richard Nock. On the chi square and higher-order chi distances for approximating ff-divergences. IEEE Signal Processing Letters, 21(1):10–13, 2013.
  • [39] Frank Nielsen and Richard Nock. Patch matching with polynomial exponential families and projective divergences. In International Conference on Similarity Search and Applications, pages 109–116. Springer, 2016.
  • [40] Frank Nielsen and Ke Sun. Guaranteed bounds on information-theoretic measures of univariate mixtures using piecewise log-sum-exp inequalities. Entropy, 18(12):442, 2016.
  • [41] Frank Nielsen, Ke Sun, and Stéphane Marchand-Maillet. On Hölder projective divergences. Entropy, 19(3):122, 2017.
  • [42] Emilio Porcu, Jorge Mateu, and George Christakos. Quasi-arithmetic means of covariance functions with potential applications to space–time data. Journal of Multivariate Analysis, 100(8):1830–1844, 2009.
  • [43] Thomas W Rauber, Tim Braun, and Karsten Berns. Probabilistic distance measures of the Dirichlet and Beta distributions. Pattern Recognition, 41(2):637–645, 2008.
  • [44] Robert H Risch. The solution of the problem of integration in finite terms. Bulletin of the American Mathematical Society, 76(3):605–608, 1970.
  • [45] Christophe Saint-Jean and Frank Nielsen. Hartigan’s method for kk-mle: Mixture modeling with Wishart distributions and its application to motion retrieval. In Geometric theory of information, pages 301–330. Springer, 2014.
  • [46] Claude E Shannon. A mathematical theory of communication. Bell system technical journal, 27(3):379–423, 1948.
  • [47] Jana Špirková. Weighted operators based on dissimilarity function. Information Sciences, 281:172–181, 2014.
  • [48] Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density ratio estimation in machine learning. Cambridge University Press, 2012.
  • [49] John Wishart. The generalised product moment distribution in samples from a normal multivariate population. Biometrika, pages 32–52, 1928.

Closed-form formula using the Maxima computer algebra system

Since the statistical (dis)similarities rely on definite integral calculations, we may use symbolic computations to check the results. For example, below are some code snippets written in Maxima (http://maxima.sourceforge.net/). The code snippets below calculate symbolically the quasi-arithmetic means and the Bhattacharyya coefficients for several exponential families.

/* Quasi-arithmetic mean associated with the univariate Gaussian family */
ptheta(lambda):=[lambda[1]/lambda[2],-1/(2*lambda[2])];
plambda(theta):=[-theta[1]/(2*theta[2]),-1/(2*theta[2])];
ptheta(plambda([t0,t1]));
l1: [p1,p2];
l2: [q1,q2];
plambda(0.5*ptheta(l1)+0.5*ptheta(l2));
ratsimp(%);
/* end */

/* Quasi-arithmetic mean associated with the inverse Gaussian family */
ptheta(lambda):=[-lambda[2]/(2*lambda[1]**2),-lambda[2]/2];
plambda(theta):=[sqrt(theta[2]/theta[1]),-2*theta[2]];
ptheta(plambda([t0,t1]));
l1: [p1,p2];
l2: [q1,q2];
plambda(0.5*ptheta(l1)+0.5*ptheta(l2));
ratsimp(%);
/* end */

/* Exponential family of exponential distributions */
assume(lambda1>0);
assume(lambda2>0);
p(x,lambda) := lambda*exp(-lambda*x);

integrate(sqrt(p(x,lambda1)*p(x,lambda2)),x,0,inf);
ratsimp(%);
/* end */

/* Exponential family of zero-centered Gaussian densities */
assume(sigma>0);
p(x,sigma) := (1/(sigma*sqrt(2*%pi)))*exp(-x**2/(2*sigma**2));
assume(sigma1>0);
assume(sigma2>0);

integrate(sqrt(p(x,sigma1)*p(x,sigma2)),x,-inf,inf);
ratsimp(%);
/* end */

/* Exponential family of centered-Laplacian distributions */
assume(lambda1>0);
assume(lambda2>0);
p(x,lambda) := (1/(2*lambda))*exp(-abs(x)/lambda);

integrate(sqrt(p(x,lambda1)*p(x,lambda2)),x,-inf,inf);
ratsimp(%);
/* end*/

/* Exponential family of Weibull densities with prescribed shape parameter k */
declare( k , integer);
assume(k>=1);
assume(lambda1>0);
assume(lambda2>0);
p(x,lambda) := (k/lambda)*(x/lambda)**(k-1)*exp(-(x/lambda)**k);

integrate(sqrt(p(x,lambda1)*p(x,lambda2)),x,0,inf);
expand(ratsimp(%));
/* end */

/* KLD between Weibull distributions by symbolically computing the limit */
declare( k , integer);
assume(lambda1>0);
assume(lambda2>0);
k:3;
omega:1;
t(u):=u**(-k);
tp(u):=k*u**(-k-1);
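/* Here t(u)=u^(-k) plays the role of -theta(u) for the natural parameter theta(lambda)=-lambda^(-k), */
/* and tp(u)=k*u^(-k-1)=theta'(u); hence (t(l1)-t(l2))/tp(l1) in mean() equals (theta(l2)-theta(l1))/theta'(l1) as in Eq. 161. */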
p(x,l):=(k/l)*((x/l)**(k-1))*exp(-(x/l)**k);
mean(l1,l2):=l1+alpha*(t(l1)-t(l2))/tp(l1);
log(p(omega,l1)/p(omega,l2)) + (1.0/alpha)*log(p(omega,mean(l1,l2))/p(omega,l1));
limit (ratsimp(%), alpha, 0);
expand(%);
/* end */

Appendix A Further illustrating examples

The Laplacian exponential family illustrates the use of the harmonic mean H(a,b)=2aba+bH(a,b)=\frac{2ab}{a+b} for a,b>0a,b>0:

Example 20

Consider the family of zero-centered Laplacian densities [34] pλ(x)=12λexp(|x|λ)p_{\lambda}(x)=\frac{1}{2\lambda}\exp(-\frac{|x|}{\lambda}) for xx\in\mathbb{R}. We have pλ(ω)=12λp_{\lambda}(\omega)=\frac{1}{2\lambda} for ω=0\omega=0, θ(u)=1u=θ1(u)\theta(u)=-\frac{1}{u}=\theta^{-1}(u) so that λ¯=Mθ(λ1,λ2)=H(λ1,λ2)=2λ1λ2λ1+λ2\bar{\lambda}=M_{\theta}(\lambda_{1},\lambda_{2})=H(\lambda_{1},\lambda_{2})=\frac{2\lambda_{1}\lambda_{2}}{\lambda_{1}+\lambda_{2}}, where HH denotes the harmonic mean. Applying the cumulant-free formula Eq. 49, we get

ρ[pλ1,pλ2]\displaystyle\rho[p_{\lambda_{1}},p_{\lambda_{2}}] =\displaystyle= p(ω,λ1)p(ω,λ2)p(ω;λ¯),\displaystyle\frac{\sqrt{p(\omega,\lambda_{1})p(\omega,\lambda_{2})}}{p\left(\omega;\bar{\lambda}\right)},
=\frac{\sqrt{\frac{1}{2\lambda_{1}}\,\frac{1}{2\lambda_{2}}}}{\frac{1}{2}\,\frac{\lambda_{1}+\lambda_{2}}{2\lambda_{1}\lambda_{2}}},
=\displaystyle= 2λ1λ2λ1+λ2.\displaystyle\frac{2\sqrt{\lambda_{1}\lambda_{2}}}{\lambda_{1}+\lambda_{2}}.

Note that the arithmetic mean A(\lambda_{1},\lambda_{2})=\frac{\lambda_{1}+\lambda_{2}}{2} dominates the geometric mean G(\lambda_{1},\lambda_{2})=\sqrt{\lambda_{1}\lambda_{2}} (i.e., A\geq G) so that \rho[p_{\lambda_{1}},p_{\lambda_{2}}]=\frac{G(\lambda_{1},\lambda_{2})}{A(\lambda_{1},\lambda_{2})}\in(0,1]. It follows that the Bhattacharyya distance is

D_{B}[p_{\lambda_{1}},p_{\lambda_{2}}]=\log\frac{\lambda_{1}+\lambda_{2}}{2}-\log\sqrt{\lambda_{1}\lambda_{2}}.

This yields the same formula as for the family of exponential distributions.

Example 21

The univariate Gaussian distribution with source parameters \lambda=(\mu,\sigma^{2}) has density p_{\lambda}(x)=\frac{1}{\sqrt{2\pi\lambda_{2}}}\exp\left(-\frac{(x-\lambda_{1})^{2}}{2\lambda_{2}}\right). We have \theta(u)=\left(\frac{u_{1}}{u_{2}},-\frac{1}{2u_{2}}\right) and \theta^{-1}(u)=\left(-\frac{u_{1}}{2u_{2}},-\frac{1}{2u_{2}}\right), so that

Mθ(λ1,λ2)=(μ1σ22+μ2σ12σ12+σ22,2σ12σ22σ12+σ22).M_{\theta}(\lambda_{1},\lambda_{2})=\left(\frac{\mu_{1}\sigma_{2}^{2}+\mu_{2}\sigma_{1}^{2}}{\sigma_{1}^{2}+\sigma_{2}^{2}},\frac{2\sigma_{1}^{2}\sigma_{2}^{2}}{\sigma_{1}^{2}+\sigma_{2}^{2}}\right). (164)

Our next example illustrates an unusual non-separable mean derived from the inverse Gaussian family:

Example 22

Consider the family of inverse Gaussian densities:

p(x;μ,λ)=λ2πx3exp(λ(xμ)22μ2x),p(x;\mu,\lambda)=\sqrt{\frac{\lambda}{2\pi x^{3}}}\exp\left(-\frac{\lambda(x-\mu)^{2}}{2\mu^{2}x}\right),

with 𝒳=(0,)\mathcal{X}=(0,\infty), and source parameters (μ>0,λ>0)(\mu>0,\lambda>0). The inverse Gaussian densities form an exponential family with natural parameters θ(λ)=(λ22λ12,λ22)\theta(\lambda)=\left(-\frac{\lambda_{2}}{2\lambda_{1}^{2}},-\frac{\lambda_{2}}{2}\right). The inverse conversion function is θ1(θ1,θ2)=(θ2θ1,2θ2)\theta^{-1}(\theta_{1},\theta_{2})=\left(\frac{\theta_{2}}{\theta_{1}},-2\theta_{2}\right). Thus the generalized quasi-arithmetic mean induced by the multivariate θ()\theta(\cdot) is

Mθ(λ,λ)=((λ2+λ2)λ12(λ1)2λ2(λ1)2+λ2λ12,λ2+λ22).M_{\theta}(\lambda,\lambda^{\prime})=\left(\sqrt{\frac{(\lambda_{2}+\lambda_{2}^{\prime})\lambda_{1}^{2}(\lambda_{1}^{\prime})^{2}}{\lambda_{2}(\lambda_{1}^{\prime})^{2}+\lambda_{2}^{\prime}\lambda_{1}^{2}}},\frac{\lambda_{2}+\lambda_{2}^{\prime}}{2}\right).

The calculation can be checked using a computer algebra system (detailed in Appendix Closed-form formula using the Maxima computer algebra system). We choose ω=1\omega=1.

Next, we consider the case of the exponential family of central Wishart densities [45].

Example 23

The density of a Wishart distribution [49] with positive-definite scale matrix S0S\succ 0 and number of degrees of freedom ndn\geq d (matrix generalization of the chi-squared distribution) is:

p(x;λ=(λs=n,λM=S))=|X|nd12e12tr(S1X)2nd2|S|n2Γd(n2),p(x;\lambda=(\lambda_{s}=n,\lambda_{M}=S))=\frac{|X|^{\frac{n-d-1}{2}}e^{-\frac{1}{2}\operatorname{tr}\left(S^{-1}X\right)}}{2^{\frac{nd}{2}}|S|^{\frac{n}{2}}\Gamma_{d}\left(\frac{n}{2}\right)}, (165)

where Γd\Gamma_{d} denotes the multivariate Gamma function:

Γd(x):=πd(d1)/4j=1dΓ(x+(1j)/2).\Gamma_{d}(x):=\pi^{d(d-1)/4}\prod_{j=1}^{d}\Gamma(x+(1-j)/2). (166)

The sample space x𝒳x\in\mathcal{X} is the set of d×dd\times d positive-definite matrices. The natural parameter consists of a scalar part θs\theta_{s} and a matrix part θM\theta_{M}:

θ(λ)=(θs,θM)=(λsd12,12λM1).\theta(\lambda)=\left(\theta_{s},\theta_{M}\right)=\left(\frac{\lambda_{s}-d-1}{2},-\frac{1}{2}\lambda_{M}^{-1}\right). (167)

The inverse conversion natural-to-ordinary function is

\lambda(\theta)=\left(2\theta_{s}+d+1,\,-\frac{1}{2}\theta_{M}^{-1}\right). (168)

It follows that the quasi-arithmetic mean is the following separable scalar-matrix quasi-arithmetic mean (an arithmetic mean on the scalar part and a matrix harmonic mean on the matrix part):

Mθ((n1,S1),(n2,S2))=(n1+n22,(S11+S212)1).M_{\theta}((n_{1},S_{1}),(n_{2},S_{2}))=\left(\frac{n_{1}+n_{2}}{2},\left(\frac{S_{1}^{-1}+S_{2}^{-1}}{2}\right)^{-1}\right). (169)

We choose ω=I\omega=I (the identity matrix) so that

p(I;S,n)=exp(12tr(S1))2nd2|S|n2Γd(n2).p(I;S,n)=\frac{\exp\left(-\frac{1}{2}\mathrm{tr}(S^{-1})\right)}{2^{\frac{nd}{2}}|S|^{\frac{n}{2}}\Gamma_{d}\left(\frac{n}{2}\right)}. (170)

The sufficient statistics is t(x)=(log|X|,12X)t(x)=(\log|X|,-\frac{1}{2}X). It follows that η=F(θ)=E[t(x)]=(log|S12|+i=1dψ(n+1i2),12nS)\eta=\nabla F(\theta)=E[t(x)]=(-\log|\frac{S^{-1}}{2}|+\sum_{i=1}^{d}\psi(\frac{n+1-i}{2}),-\frac{1}{2}nS) (see [22]), where ψd\psi_{d} is the multivariate digamma function (the derivative of the logarithm of the multivariate gamma function). The differential entropy is:

h(pn,S)=d+12ln|S|+12d(d+1)ln2+lnΓd(n2)nd12ψd(n2)+nd2.h(p_{n,S})=\frac{d+1}{2}\ln|S|+\frac{1}{2}d(d+1)\ln 2+\ln\Gamma_{d}\left(\frac{n}{2}\right)-\frac{n-d-1}{2}\psi_{d}\left(\frac{n}{2}\right)+\frac{nd}{2}.

It follows that the Kullback-Leibler divergence between two central Wishart densities is:

D_{\mathrm{KL}}\left(p_{n_{1},S_{1}}:p_{n_{2},S_{2}}\right)=-\log\left(\frac{\Gamma_{d}\left(\frac{n_{1}}{2}\right)}{\Gamma_{d}\left(\frac{n_{2}}{2}\right)}\right)+\left(\frac{n_{1}-n_{2}}{2}\right)\psi_{d}\left(\frac{n_{1}}{2}\right)-\frac{n_{2}}{2}\log\frac{\left|S_{1}\right|}{\left|S_{2}\right|}+\frac{n_{1}}{2}\left(\operatorname{tr}\left(S_{2}^{-1}S_{1}\right)-d\right). (171)

Our last example exhibits an unusual mean when dealing with the Bernoulli family:

Example 24

Consider the Bernoulli family with density p_{\lambda}(x)=\lambda^{x}(1-\lambda)^{1-x} where \mathcal{X}=\{0,1\} and \lambda denotes the probability of success. The Bernoulli family is a discrete exponential family. We have \theta(u)=\log\frac{u}{1-u} and \theta^{-1}(u)=\frac{e^{u}}{1+e^{u}}. The associated quasi-arithmetic mean is

M_{\theta}(\lambda_{1},\lambda_{2})=\frac{\sqrt{\lambda_{1}\lambda_{2}}}{\sqrt{\lambda_{1}\lambda_{2}}+\sqrt{(1-\lambda_{1})(1-\lambda_{2})}}. (172)
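The cumulant-free Bhattacharyya coefficient can be checked on this discrete family with a short Maxima sketch (the success probabilities 0.2 and 0.7 are arbitrary choices), taking \omega=0 so that p(0;\lambda)=1-\lambda and comparing with the direct sum over the support \{0,1\}:

/* Bernoulli family: cumulant-free Bhattacharyya coefficient with omega=0 */
theta(u) := log(u/(1-u));
thetainv(u) := exp(u)/(1+exp(u));
M(a,b) := thetainv((theta(a)+theta(b))/2);
p(x,l) := l**x * (1-l)**(1-x);
rho(l1,l2) := sqrt(p(0,l1)*p(0,l2))/p(0,M(l1,l2));
float(rho(0.2, 0.7));                    /* approximately 0.864 */
float(sqrt(0.2*0.7) + sqrt(0.8*0.3));    /* approximately 0.864 (direct sum over {0,1}) */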