
Cumulant-free closed-form formulas for some common (dis)similarities between densities of an exponential family

Frank Nielsen Richard Nock
Abstract

It is well-known that the Bhattacharyya, Hellinger, Kullback-Leibler, \alpha-divergences, and Jeffreys' divergences between densities belonging to the same exponential family have generic closed-form formulas relying on the strictly convex and real-analytic cumulant function characterizing the exponential family. In this work, we report (dis)similarity formulas which bypass the explicit use of the cumulant function and highlight the role of quasi-arithmetic means and their multivariate mean operator extensions. In practice, these cumulant-free formulas are handy when implementing these (dis)similarities using legacy Application Programming Interfaces (APIs) since our method only requires a partial canonical factorization of the densities of the considered exponential family.

Keywords: Bhattacharyya coefficient; Bhattacharyya distance; Hellinger distance; Jensen-Shannon divergence; Kullback-Leibler divergence; α\alpha-divergences; Jeffreys divergence; Cauchy-Schwarz divergence; quasi-arithmetic means; inverse function theorem; strictly monotone operator; mean operator; information geometry; exponential family; mixture family.

1 Introduction

Let (\mathcal{X},\mathcal{F},\mu) be a measure space [6] with sample space \mathcal{X}, \sigma-algebra of events \mathcal{F}, and positive measure \mu (e.g., the Lebesgue or counting measure). The Kullback-Leibler divergence [25] (KLD), Jeffreys' divergence [20] (JD), Bhattacharyya coefficient (BC), Bhattacharyya distance [5, 24] (BD) and Hellinger distance [19] (HD) between two probability measures P and Q dominated by \mu with respective Radon-Nikodym densities p=\frac{\mathrm{d}P}{\mathrm{d}\mu} and q=\frac{\mathrm{d}Q}{\mathrm{d}\mu} are statistical (dis)similarities defined respectively by:

DKL[p:q]\displaystyle D_{\mathrm{KL}}[p:q] :=p(x)logp(x)q(x)dμ(x),\displaystyle:=\int p(x)\log\frac{p(x)}{q(x)}\mathrm{d}\mu(x), (Kullback-Leibler divergence) (1)
DJ[p,q]\displaystyle D_{J}[p,q] :=DKL[p:q]+DKL[q:p],\displaystyle:=D_{\mathrm{KL}}[p:q]+D_{\mathrm{KL}}[q:p], (Jeffreys’ divergence) (2)
ρ[p,q]\displaystyle\rho[p,q] :=𝒳p(x)q(x)dμ(x),\displaystyle:=\int_{\mathcal{X}}\sqrt{p(x)\ q(x)}\mathrm{d}\mu(x), (Bhattacharyya coefficient) (3)
DB[p,q]\displaystyle D_{B}[p,q] :=log(ρ[p,q]),\displaystyle:=-\log(\rho[p,q]), (Bhattacharyya distance) (4)
DH[p,q]\displaystyle D_{H}[p,q] :=1ρ[p,q].\displaystyle:=\sqrt{1-\rho[p,q]}. (Hellinger distance) (5)

KLD is an oriented distance (i.e., DKL[p:q]DKL[q:p]D_{\mathrm{KL}}[p:q]\not=D_{\mathrm{KL}}[q:p]). JD is a common symmetrization of the KLD which is not a metric distance because JD fails the triangle inequality. BC is a similarity which ensures that ρ[p,q](0,1]\rho[p,q]\in(0,1]. BD is a symmetric non-metric distance, and HD is a metric distance. (KLD and JD are homogeneous divergences.)

All these divergences require computing definite integrals which can in theory be calculated using the Risch pseudo-algorithm [44], which depends on an oracle. In practice, computer algebra systems like Maple® or Maxima implement different subsets of the theoretical Risch pseudo-algorithm and thus face different limitations.

When the densities p and q belong to the same parametric family \mathcal{E}=\{p_{\theta}\ :\ \theta\in\Theta\} of densities, i.e., p=p_{\theta_{1}} and q=p_{\theta_{2}}, these (dis)similarities yield equivalent parameter (dis)similarities. For example, we get the parameter divergence P(\theta_{1}:\theta_{2})=D_{\mathrm{KL}}^{\mathcal{E}}(\theta_{1}:\theta_{2}):=D_{\mathrm{KL}}(p_{\theta_{1}}:p_{\theta_{2}}). Notation-wise, we use brackets to indicate that the (dis)similarity arguments are densities, and parentheses to indicate parameter (dis)similarities.

In particular, when \mathcal{E} is a natural exponential family [34, 3] (NEF)

={pθ(x)=exp(θt(x)F(θ)+k(x)):θΘ},\mathcal{E}=\left\{p_{\theta}(x)=\exp\left(\theta^{\top}t(x)-F(\theta)+k(x)\right)\ :\ \theta\in\Theta\right\}, (6)

with t(x)t(x) denoting the sufficient statistics, k(x)k(x) an auxiliary measure carrier term, and

F(θ):=log(𝒳exp(θt(x))dμ(x)),F(\theta):=\log\left(\int_{\mathcal{X}}\exp(\theta^{\top}t(x))\mathrm{d}\mu(x)\right), (7)

the cumulant function111More precisely, \kappa_{\theta}(u)=F(\theta+u)-F(\theta) is the cumulant generating function of the sufficient statistic t(x) from which all moments can be recovered (the cumulant generating function being the logarithm of the moment generating function). also called the log-normalizer, log-partition function, free energy, or log-Laplace transform. Parameter \theta is called the natural parameter. The cumulant function is a strictly convex and smooth function (a real-analytic function [3]) defined on the open convex natural parameter space \Theta. Let D denote the dimension of the parameter space \Theta (i.e., the order of the exponential family) and d the dimension of the sample space \mathcal{X}. We further assume full regular exponential families [3] so that \Theta is a non-empty open convex domain and t(x) is a minimal sufficient statistic [14].

Many common families of distributions \{p_{\lambda}(x)\ :\ \lambda\in\Lambda\} are exponential families in disguise after reparameterization: p_{\lambda}(x)=p_{\theta(\lambda)}(x). Those families are called exponential families (i.e., EFs, and not natural EFs, to emphasize that \theta(u)\not=u), and their densities are canonically factorized as follows:

p(x;λ)=exp(θ(λ)t(x)F(θ(λ))+k(x)).p(x;\lambda)=\exp\left(\theta(\lambda)^{\top}t(x)-F(\theta(\lambda))+k(x)\right). (8)

We call parameter \lambda the source parameter (or the ordinary parameter, with \lambda\in\Lambda, the source/ordinary parameter space) and parameter \theta(\lambda)\in\Theta is the corresponding natural parameter. Notice that the canonical parameterization of Eq. 8 is not unique: For example, adding a constant term c\in\mathbb{R} to F(\theta) can be compensated by subtracting this constant from k(x), or multiplying the sufficient statistic t(x) by a symmetric invertible matrix A can be compensated by multiplying \theta(\lambda) by the inverse of A so that (A^{-1}\theta(\lambda))^{\top}(At(x))=\theta(\lambda)^{\top}A^{-1}At(x)=\theta(\lambda)^{\top}t(x).

Another useful parameterization of exponential families is the moment parameter [34, 3]: η=Epθ[t(x)]=F(θ)\eta=E_{p_{\theta}}[t(x)]=\nabla F(\theta). The moment parameter space shall be denoted by HH:

H={Epθ[t(x)]:θΘ}.H=\left\{E_{p_{\theta}}[t(x)]\ :\ \theta\in\Theta\right\}. (9)

Let C=\overline{\mathrm{CH}}\{t(x)\ :\ x\in\mathcal{X}\} denote the closed convex hull of the range of the sufficient statistic. When H=\mathrm{int}(C), the family is said to be steep [14]. In the remainder, we consider full regular and steep exponential families.

To give one example, consider the family of univariate normal densities:

𝒩:={p(x;λ)=1σ2πexp(12(xμσ)2),λ=(μ,σ2)×++}.\mathcal{N}:=\left\{p(x;\lambda)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right),\ \lambda=(\mu,\sigma^{2})\in\mathbb{R}\times\mathbb{R}_{++}\right\}. (10)

Family 𝒩\mathcal{N} is interpreted as an exponential family of order D=2D=2 with univariate parametric densities (d=1d=1), indexed by the source  parameter λ=(μ,σ2)Λ\lambda=(\mu,\sigma^{2})\in\Lambda with Λ=×++\Lambda=\mathbb{R}\times\mathbb{R}_{++}. The corresponding natural parameter is θ(λ)=(μσ2,12σ2)\theta(\lambda)=\left(\frac{\mu}{\sigma^{2}},-\frac{1}{2\sigma^{2}}\right), and the moment parameter is η(λ)=(Epλ[x],Epλ[x2])=(μ,μ2+σ2)\eta(\lambda)=(E_{p_{\lambda}}[x],E_{p_{\lambda}}[x^{2}])=(\mu,\mu^{2}+\sigma^{2}) since t(x)=(x,x2)t(x)=(x,x^{2}). The cumulant function for the normal family is F(θ)=θ124θ2+12log(πθ2)F(\theta)=-\frac{\theta_{1}^{2}}{4\theta_{2}}+\frac{1}{2}\log\left(-\frac{\pi}{\theta_{2}}\right).

When densities p=pθ1p=p_{\theta_{1}} and q=pθ2q=p_{\theta_{2}} both belong to the same exponential family, we have the following well-known closed-form expressions [33, 34] for the (dis)similarities introduced formerly:

DKL[pθ1:pθ2]\displaystyle D_{\mathrm{KL}}[p_{\theta_{1}}:p_{\theta_{2}}] =BF(θ1:θ2),\displaystyle={B_{F}}^{*}(\theta_{1}:\theta_{2}), Kullback-Leibler divergence (11)
DJ[pθ1:pθ2]\displaystyle D_{J}[p_{\theta_{1}}:p_{\theta_{2}}] =(θ2θ1)(η2η1),\displaystyle=(\theta_{2}-\theta_{1})^{\top}(\eta_{2}-\eta_{1}), Jeffreys’ divergence (12)
ρ[pθ1:pθ2]\displaystyle\rho[p_{\theta_{1}}:p_{\theta_{2}}] =exp(JF(θ1:θ2)),\displaystyle=\exp(-J_{F}(\theta_{1}:\theta_{2})), Bhattacharyya coefficient (13)
DB[pθ1:pθ2]\displaystyle D_{B}[p_{\theta_{1}}:p_{\theta_{2}}] =JF(θ1:θ2)\displaystyle=J_{F}(\theta_{1}:\theta_{2}) Bhattacharyya distance (14)
DH[pθ1:pθ2]\displaystyle D_{H}[p_{\theta_{1}}:p_{\theta_{2}}] =1exp(JF(θ1:θ2)),\displaystyle=\sqrt{1-\exp(-J_{F}(\theta_{1}:\theta_{2}))}, Hellinger distance (15)

where DD^{*} indicates the reverse divergence D(θ1:θ2):=D(θ2:θ1)D^{*}(\theta_{1}:\theta_{2}):=D(\theta_{2}:\theta_{1}) (parameter swapping222Here, the star ‘*’ is not to be confused with the Legendre-Fenchel transform used in information geometry [1]. For a Bregman divergence BF(θ1:θ2)B_{F}(\theta_{1}:\theta_{2}), we have the reverse Bregman divergence BF(θ1:θ2)=BF(θ2:θ1)=BF(F(θ1):F(θ2)){B_{F}}^{*}(\theta_{1}:\theta_{2})=B_{F}(\theta_{2}:\theta_{1})=B_{F^{*}}(\nabla F(\theta_{1}):\nabla F(\theta_{2})), where FF^{*} denotes convex conjugate of FF obtained by the Legendre-Fenchel transform.), and BFB_{F} and JFJ_{F} are the Bregman divergence [8] and the Jensen divergence [33] induced by the functional generator FF, respectively:

BF(θ1:θ2)\displaystyle B_{F}(\theta_{1}:\theta_{2}) :=\displaystyle:= F(θ1)F(θ2)(θ1θ2)F(θ2)\displaystyle F(\theta_{1})-F(\theta_{2})-(\theta_{1}-\theta_{2})^{\top}\nabla F(\theta_{2}) (16)
JF(θ1:θ2)\displaystyle J_{F}(\theta_{1}:\theta_{2}) :=\displaystyle:= F(θ1)+F(θ2)2F(θ1+θ22).\displaystyle\frac{F(\theta_{1})+F(\theta_{2})}{2}-F\left(\frac{\theta_{1}+\theta_{2}}{2}\right). (17)

In information geometry, the Bregman divergences are the canonical divergences of dually flat spaces [28].

More generally, the Bhattacharyya distance/coefficient can be skewed with a parameter \alpha\in(0,1) to yield the \alpha-skewed Bhattacharyya distance/coefficient [33]:

D_{B,\alpha}[p:q] := -\log(\rho_{\alpha}[p:q]), (18)
\rho_{\alpha}[p:q] := \int_{\mathcal{X}}p(x)^{\alpha}q(x)^{1-\alpha}\mathrm{d}\mu(x) (19)
 = \int_{\mathcal{X}}q(x)\left(\frac{p(x)}{q(x)}\right)^{\alpha}\mathrm{d}\mu(x). (20)

The ordinary Bhattacharyya distance and coefficient are recovered for \alpha=\frac{1}{2}. In statistics, the maximal skewed Bhattacharyya distance is called the Chernoff information [10, 27], used in Bayesian hypothesis testing:

DC[p:q]:=maxα(0,1)DB,α[p:q].D_{C}[p:q]:=\max_{\alpha\in(0,1)}D_{B,\alpha}[p:q]. (21)

Notice that the Bhattacharyya skewed α\alpha-coefficient of Eq. 20 also appears in the definition of the α\alpha-divergences [1] (with α\alpha\in\mathbb{R}):

Dα[p:q]:={1α(1α)(1ρα[p:q]),α\{0,1}D1[p:q]=DKL[p:q],α=1D0[p:q]=DKL[q:p],α=0D_{\alpha}[p:q]:=\left\{\begin{array}[]{ll}\frac{1}{\alpha(1-\alpha)}\left(1-\rho_{\alpha}[p:q]\right),&\alpha\in\mathbb{R}\backslash\{0,1\}\\ D_{1}[p:q]=D_{\mathrm{KL}}[p:q],&\alpha=1\\ D_{0}[p:q]=D_{\mathrm{KL}}[q:p],&\alpha=0\end{array}\right. (22)

The α\alpha-divergences belong to the class of ff-divergences [13] which are the invariant divergences333A statistical invariant divergence DD is such that D[pθ:pθ+dθ]=i,jgij(θ)dθidθjD[p_{\theta}:p_{\theta+\mathrm{d}\theta}]=\sum_{i,j}g_{ij}(\theta)\mathrm{d}\theta_{i}\mathrm{d}\theta_{j}, where I(θ)=[gij(θ)]I(\theta)=[g_{ij}(\theta)] is the Fisher information matrix [1]. in information geometry [1].

When densities p=pθ1p=p_{\theta_{1}} and q=pθ2q=p_{\theta_{2}} both belong to the same exponential family \mathcal{E}, we get the following closed-form formula [33]:

DB,α[pθ1:pθ2]\displaystyle D_{B,\alpha}[p_{\theta_{1}}:p_{\theta_{2}}] =\displaystyle= JF,α(θ1:θ2),\displaystyle J_{F,\alpha}(\theta_{1}:\theta_{2}), (23)

where JF,αJ_{F,\alpha} denotes the α\alpha-skewed Jensen divergence:

JF,α(θ1:θ2):=αF(θ1)+(1α)F(θ2)F(αθ1+(1α)θ2).J_{F,\alpha}(\theta_{1}:\theta_{2}):=\alpha F(\theta_{1})+(1-\alpha)F(\theta_{2})-F(\alpha\theta_{1}+(1-\alpha)\theta_{2}). (24)

All these closed-form formulas can be obtained from the calculation of the following generic integral [37]:

Iα,β[p:q]:=p(x)αq(x)βdμ(x),I_{\alpha,\beta}[p:q]:=\int p(x)^{\alpha}q(x)^{\beta}\mathrm{d}\mu(x), (25)

when p=pθ1p=p_{\theta_{1}} and q=pθ2q=p_{\theta_{2}}. Indeed, provided that αθ1+βθ2Θ\alpha\theta_{1}+\beta\theta_{2}\in\Theta, we have

Iα,β[pθ1:pθ2]=exp((αF(θ1)+βF(θ2)F(αθ1+βθ2)))Epαθ1+βθ2[exp((α+β1)k(x))].I_{\alpha,\beta}[p_{\theta_{1}}:p_{\theta_{2}}]=\exp\left(-(\alpha F(\theta_{1})+\beta F(\theta_{2})-F(\alpha\theta_{1}+\beta\theta_{2}))\right)E_{p_{\alpha\theta_{1}+\beta\theta_{2}}}\left[\exp((\alpha+\beta-1)k(x))\right]. (26)

The calculation of Iα,βI_{\alpha,\beta} in Eq. 26 is easily achieved by bypassing the computation of the antiderivative of the integrand in Eq. 25 (using the fact that pθ(x)dμ(x)=1\int p_{\theta}(x)\mathrm{d}\mu(x)=1 for any θΘ\theta\in\Theta), see [37].

In particular, we get the following special cases:

  • When \alpha+\beta=1 with \alpha\in[0,1], I_{\alpha,1-\alpha}[p_{\theta_{1}}:p_{\theta_{2}}]=\exp(-J_{F,\alpha}(\theta_{1}:\theta_{2})) since \alpha\theta_{1}+(1-\alpha)\theta_{2}\in\Theta (domain \Theta is convex).

  • When k(x)=0 and \alpha\theta_{1}+\beta\theta_{2}\in\Theta, I_{\alpha,\beta}[p_{\theta_{1}}:p_{\theta_{2}}]=\exp\left(-(\alpha F(\theta_{1})+\beta F(\theta_{2})-F(\alpha\theta_{1}+\beta\theta_{2}))\right). This is always the case when \alpha,\beta>0 and \Theta is a convex cone (e.g., the Gaussian or Wishart families), see [26].

  • When α+β=1\alpha+\beta=1 with arbitrary α\alpha,

    Iα,1α[pθ1:pθ2]=exp((αF(θ1)+(1α)F(θ2)F(αθ1+(1α)θ2))).I_{\alpha,1-\alpha}[p_{\theta_{1}}:p_{\theta_{2}}]=\exp\left(-(\alpha F(\theta_{1})+(1-\alpha)F(\theta_{2})-F(\alpha\theta_{1}+(1-\alpha)\theta_{2}))\right). (27)

    This setting is useful for getting truncated series of f-divergences when the exponential family has an affine natural parameter space \Theta, see [38, 35].
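As a sanity check of Eq. 27, the closed form can be compared against a direct numerical integration of Eq. 25. The following Python sketch (assuming NumPy/SciPy are available) does so for the family of exponential distributions factorized as p_{\lambda}(x)=\exp(-\lambda x+\log\lambda), for which t(x)=-x, \theta(\lambda)=\lambda, k(x)=0 and F(\theta)=-\log\theta:

# Numerical check of Eq. 27 for exponential distributions p_lambda(x) = lambda*exp(-lambda*x):
# here t(x) = -x, theta(lambda) = lambda, F(theta) = -log(theta) and k(x) = 0.
import numpy as np
from scipy.integrate import quad

def F(theta):                      # cumulant function of the exponential distribution family
    return -np.log(theta)

def density(x, lam):               # off-the-shelf density
    return lam * np.exp(-lam * x)

lam1, lam2, alpha = 0.7, 2.5, 0.3

# Left-hand side of Eq. 27: the definite integral of Eq. 25 with beta = 1 - alpha.
lhs, _ = quad(lambda x: density(x, lam1)**alpha * density(x, lam2)**(1 - alpha), 0.0, np.inf)

# Right-hand side of Eq. 27, evaluated with the explicit cumulant function F.
rhs = np.exp(-(alpha * F(lam1) + (1 - alpha) * F(lam2) - F(alpha * lam1 + (1 - alpha) * lam2)))

print(lhs, rhs)    # both values agree up to quadrature error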

When \alpha\rightarrow 1 or \alpha\rightarrow 0, we get the following limits of the \alpha-skewed Bhattacharyya distances:

limα11α(1α)DB,α[p:q]\displaystyle\lim_{\alpha\rightarrow 1}\frac{1}{\alpha(1-\alpha)}D_{B,\alpha}[p:q] =\displaystyle= DKL[p:q]\displaystyle D_{\mathrm{KL}}[p:q] (28)
limα01α(1α)DB,α[p:q]\displaystyle\lim_{\alpha\rightarrow 0}\frac{1}{\alpha(1-\alpha)}D_{B,\alpha}[p:q] =\displaystyle= DKL[q:p].\displaystyle D_{\mathrm{KL}}[q:p]. (29)

It follows that when the densities p=pθ1p=p_{\theta_{1}} and q=pθ2q=p_{\theta_{2}} both belong to the same exponential family, we obtain

\lim_{\alpha\rightarrow 0}\frac{1}{\alpha(1-\alpha)}J_{F,\alpha}(\theta_{1}:\theta_{2}) = B_{F}(\theta_{1}:\theta_{2}), (30)
\lim_{\alpha\rightarrow 1}\frac{1}{\alpha(1-\alpha)}J_{F,\alpha}(\theta_{1}:\theta_{2}) = B_{F}(\theta_{2}:\theta_{1}). (31)

In practice, we would like to obtain closed-form formulas for these (dis)similarities when the densities belong to the same exponential family, using the source parameterization \lambda\in\Lambda:

D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}] = B_{F}(\theta(\lambda_{2}):\theta(\lambda_{1})), (32)
D_{J}[p_{\lambda_{1}}:p_{\lambda_{2}}] = (\theta(\lambda_{2})-\theta(\lambda_{1}))^{\top}(\eta(\lambda_{2})-\eta(\lambda_{1})), (33)
\rho[p_{\lambda_{1}}:p_{\lambda_{2}}] = \exp(-J_{F}(\theta(\lambda_{1}):\theta(\lambda_{2}))), (34)
D_{B}[p_{\lambda_{1}}:p_{\lambda_{2}}] = J_{F}(\theta(\lambda_{1}):\theta(\lambda_{2})), (35)
D_{H}[p_{\lambda_{1}}:p_{\lambda_{2}}] = \sqrt{1-\exp(-J_{F}(\theta(\lambda_{1}):\theta(\lambda_{2})))}, (36)

where θ()\theta(\cdot) and η()\eta(\cdot) are the DD-variate functions for converting the source parameter λ\lambda to the natural parameter θ\theta and the moment parameter η\eta, respectively. The Chernoff information between two densities pλ1p_{\lambda_{1}} and pλ2p_{\lambda_{2}} of the same exponential family amounts to a Jensen-Chernoff divergence [27]:

JC(λ1:λ2):=DC[pλ1:pλ2]\displaystyle\mathrm{JC}({\lambda_{1}}:{\lambda_{2}}):=D_{C}[p_{\lambda_{1}}:p_{\lambda_{2}}] =\displaystyle= maxα(0,1)DB,α[pλ1:pλ2],\displaystyle\max_{\alpha\in(0,1)}D_{B,\alpha}[p_{\lambda_{1}}:p_{\lambda_{2}}], (37)
=\displaystyle= maxα(0,1)JF,α(θ(λ1):θ(λ2)),\displaystyle\max_{\alpha\in(0,1)}J_{F,\alpha}(\theta(\lambda_{1}):\theta(\lambda_{2})), (38)

that is the maximal value of a skew Jensen divergence for α(0,1)\alpha\in(0,1).
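Since \rho_{\alpha} can always be evaluated by numerical integration, the Chernoff information of Eq. 21 can be approximated by a one-dimensional maximization over \alpha. The following Python sketch (SciPy assumed; the helper name is ours) does so for two univariate Gaussian densities:

import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar
from scipy.stats import norm

# Two univariate Gaussian densities (an exponential family of order 2).
p = norm(loc=0.0, scale=1.0).pdf
q = norm(loc=2.0, scale=3.0).pdf

def skew_bhattacharyya_distance(alpha):
    # D_{B,alpha}[p:q] = -log int p^alpha q^(1-alpha) dmu  (Eq. 18-19), by numerical integration.
    rho_alpha, _ = quad(lambda x: p(x)**alpha * q(x)**(1 - alpha), -np.inf, np.inf)
    return -np.log(rho_alpha)

# Chernoff information (Eq. 21): maximize over alpha in (0,1), i.e., minimize the negated distance.
res = minimize_scalar(lambda a: -skew_bhattacharyya_distance(a), bounds=(1e-6, 1 - 1e-6), method="bounded")
print("optimal alpha:", res.x, "Chernoff information:", -res.fun)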

Thus to obtain closed-form formulas, we need to make explicit both the \theta(\cdot) conversion function and the cumulant function F. This can be prone to human calculation mistakes (e.g., try reporting these formulas manually without errors for the multivariate Gaussian family). Furthermore, the cumulant function F may not be available in closed form [39].

In this work, we show how to easily bypass the explicit use of the cumulant function FF. Our method is based on a simple trick, and makes the programming of these (dis)similarities easy using off-the-shelf functions of application programming interfaces (APIs) (e.g., the density function, the entropy function or the moment function of a distribution family).

This paper is organized as follows: Section 2 explains the method for the Bhattacharyya coefficient and its related dissimilarities. Section 3 further carries on the principle of bypassing the explicit use of the cumulant function and its gradient for the calculation of the Kullback-Leibler divergence and its related Jeffreys' divergence. Section 4 summarizes the results. Throughout the paper, we present several examples to illustrate the methods. Appendix Closed-form formula using the Maxima computer algebra system displays some code written using the computer algebra system (CAS) Maxima to recover some formulas for some exponential families. Appendix A provides further examples.

2 Cumulant-free formula for the Bhattacharyya coefficient and distances derived thereof

2.1 A method based on a simple trick

The densities of an exponential family all have the same support [3] \mathcal{X}. Consider any point \omega\in\mathcal{X} in the support. Then observe that we can write the cumulant function of a natural exponential family as:

F(θ)=logpθ(ω)+t(ω)θ+k(ω).F(\theta)=-\log p_{\theta}(\omega)+t(\omega)^{\top}\theta+k(\omega). (39)

Since the generator FF of a Jensen or Bregman divergence is defined modulo an affine function aθ+ba^{\top}\theta+b (i.e., JF=JGJ_{F}=J_{G} and BF=BGB_{F}=B_{G} for G(θ)=F(θ)+aθ+bG(\theta)=F(\theta)+a^{\top}\theta+b for aDa\in\mathbb{R}^{D} and bb\in\mathbb{R}), we consider the following equivalent generator (term +t(ω)θ+k(ω)+t(\omega)^{\top}\theta+k(\omega) is affine) expressed using the density parameterized by λΛ\lambda\in\Lambda:

Fλ(λ)=F(θ(λ))=log(pλ(ω)).F_{\lambda}(\lambda)=F(\theta(\lambda))=-\log(p_{\lambda}(\omega)). (40)

Then the Bhattacharyya coefficient is expressed by this cumulant-free expression using the source parameterization λ\lambda:

ω𝒳,ρ[pλ1,pλ2]=pλ1(ω)pλ2(ω)pλ¯(ω)=pλ1(ω)pλ¯(ω)pλ2(ω)pλ¯(ω),\displaystyle\forall\omega\in\mathcal{X},\quad\rho[p_{\lambda_{1}},p_{\lambda_{2}}]=\frac{\sqrt{p_{\lambda_{1}}(\omega)p_{\lambda_{2}}(\omega)}}{p_{\bar{\lambda}}(\omega)}=\sqrt{\frac{p_{\lambda_{1}}(\omega)}{p_{\bar{\lambda}}(\omega)}}\sqrt{\frac{p_{\lambda_{2}}(\omega)}{p_{\bar{\lambda}}(\omega)}}, (41)

with

λ¯=λ(θ(λ1)+θ(λ2)2).\bar{\lambda}=\lambda\left(\frac{\theta(\lambda_{1})+\theta(\lambda_{2})}{2}\right). (42)

Similarly, the Bhattacharyya distance is written as

ω𝒳,DB[pλ1,pλ2]=log(p(ω;λ¯)p(ω,λ1)p(ω,λ2)).\displaystyle\forall\omega\in\mathcal{X},\quad D_{B}[p_{\lambda_{1}},p_{\lambda_{2}}]=\log\left(\frac{p\left(\omega;\bar{\lambda}\right)}{\sqrt{p(\omega,\lambda_{1})p(\omega,\lambda_{2})}}\right). (43)

Let l(x;λ):=logp(x;λ)l(x;\lambda):=\log p(x;\lambda) be the log-density. Then we have

ω𝒳,DB[pλ1,pλ2]=l(ω;λ¯)l(ω,λ1)+l(ω,λ2)2.\displaystyle\forall\omega\in\mathcal{X},\quad D_{B}[p_{\lambda_{1}},p_{\lambda_{2}}]=l\left(\omega;\bar{\lambda}\right)-\frac{l(\omega,\lambda_{1})+l(\omega,\lambda_{2})}{2}. (44)

This is a Jensen divergence for the strictly convex function l(x;θ)-l(x;\theta) (wrt. θ\theta) since l(x;θ)F(θ)-l(x;\theta)\equiv F(\theta) (modulo an affine term).
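The right-hand sides of Eq. 43 and Eq. 44 do not depend on the chosen \omega. Here is a minimal numerical sketch of this invariance for the Poisson family (whose conversion function \theta(u)=\log u yields the geometric mean, see Example 2 below), using an off-the-shelf log-PMF:

import numpy as np
from scipy.stats import poisson

lam1, lam2 = 2.0, 7.5
lam_bar = np.sqrt(lam1 * lam2)   # quasi-arithmetic mean M_theta for theta(u) = log(u): the geometric mean

# Eq. 44: D_B = l(omega; lam_bar) - (l(omega; lam1) + l(omega; lam2)) / 2, for any omega in the support.
for omega in [0, 1, 5, 12]:
    l = lambda lam: poisson.logpmf(omega, lam)
    print(omega, l(lam_bar) - 0.5 * (l(lam1) + l(lam2)))
# Every omega prints the same value, namely (lam1 + lam2)/2 - sqrt(lam1*lam2) (cf. Example 2 below).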

Thus we do not need to explicitly use the expression of the cumulant function FF in Eq. 41 and Eq. 43 but we need the following parameter λθ\lambda\Leftrightarrow\theta conversion functions:

  1. \theta(\lambda), the ordinary-to-natural parameter conversion function, and

  2. its reciprocal function \lambda(\theta) (i.e., \lambda(\cdot)=\theta^{-1}(\cdot)), the natural-to-source parameter conversion function,

so that we can calculate the ordinary λ\lambda-parameter λ¯\bar{\lambda} corresponding to the natural mid-parameter θ(λ1)+θ(λ2)2\frac{\theta(\lambda_{1})+\theta(\lambda_{2})}{2}:

λ¯=λ(θ(λ1)+θ(λ2)2).\bar{\lambda}=\lambda\left(\frac{\theta(\lambda_{1})+\theta(\lambda_{2})}{2}\right). (45)

Notice that in general, a linear interpolation in the natural parameter θ\theta corresponds to a non-linear interpolation in the source parameterization λ\lambda when θ(u)u\theta(u)\not=u.

Since λ()=θ1()\lambda(\cdot)=\theta^{-1}(\cdot), we can interpret the non-linear interpolation λ¯\bar{\lambda} as a generalization of quasi-arithmetic mean [33] (QAM):

λ¯=θ1(θ(λ1)+θ(λ2)2)=:Mθ(λ1,λ2),\bar{\lambda}=\theta^{-1}\left(\frac{\theta(\lambda_{1})+\theta(\lambda_{2})}{2}\right)=:M_{\theta}(\lambda_{1},\lambda_{2}), (46)

where

Mf(a,b):=f1(12f(a)+12f(b)),M_{f}(a,b):=f^{-1}\left(\frac{1}{2}f(a)+\frac{1}{2}f(b)\right), (47)

is the quasi-arithmetic mean induced by a strictly monotonic and smooth function f.444Notice that M_{f}(a,b)=M_{g}(a,b) when g(u)=c_{1}f(u)+c_{2} with c_{1}\in\mathbb{R}\setminus\{0\} and c_{2}\in\mathbb{R}. Thus we can assume that f is a strictly increasing and smooth function. Function f is said to be in standard form if f(1)=0, f^{\prime}(u)>0 and f^{\prime}(1)=1.

We can extend the quasi-arithmetic mean to a weighted quasi-arithmetic mean as follows:

Mf,α(a,b):=f1(αf(a)+(1α)f(b)),α[0,1].M_{f,\alpha}(a,b):=f^{-1}\left(\alpha f(a)+(1-\alpha)f(b)\right),\quad\alpha\in[0,1]. (48)

Weighted quasi-arithmetic means are strictly monotone [47] when the Range(f)\mathrm{Range}(f)\subset\mathbb{R}.
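For scalar generators, the weighted quasi-arithmetic mean of Eq. 48 is a one-liner to implement. The following Python sketch (function names are ours) recovers the arithmetic, geometric and harmonic means from f(u)=u, f(u)=\log u and f(u)=1/u:

import numpy as np

def quasi_arithmetic_mean(f, finv, a, b, alpha=0.5):
    # Weighted quasi-arithmetic mean M_{f,alpha}(a, b) = f^{-1}(alpha*f(a) + (1-alpha)*f(b)) (Eq. 48).
    return finv(alpha * f(a) + (1 - alpha) * f(b))

a, b = 2.0, 8.0
print(quasi_arithmetic_mean(lambda u: u,       lambda u: u,        a, b))  # arithmetic mean: 5.0
print(quasi_arithmetic_mean(np.log,            np.exp,             a, b))  # geometric mean: 4.0
print(quasi_arithmetic_mean(lambda u: 1.0 / u, lambda u: 1.0 / u,  a, b))  # harmonic mean: 3.2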

Let us remark that extensions of weighted quasi-arithmetic means have been studied recently in information geometry to describe geodesic paths [16, 15] γab={Mf,α(a,b):α[0,1]}\gamma_{ab}=\{M_{f,\alpha}(a,b)\ :\ \alpha\in[0,1]\}.

In 1D, a bijective function on an interval (e.g., a parameter conversion function in our setting) is a strictly monotonic function, and thus ff defines a quasi-arithmetic mean.

To define quasi-arithmetic means with multivariate generators using Eq. 47, we need to properly define f^{-1}. When the function f is separable, i.e., f(x)=\sum_{i=1}^{D}f_{i}(x_{i}) with the f_{i}'s strictly monotone and smooth functions, we can straightforwardly define the multivariate mean componentwise as M_{f}(x,x^{\prime}):=(M_{f_{1}}(x_{1},x_{1}^{\prime}),\ldots,M_{f_{D}}(x_{D},x_{D}^{\prime})).

In general, the inverse function theorem in multivariable calculus [11] states that there exists locally an inverse function f1f^{-1} for a continuously differentiable function ff at point xx if the determinant of the Jacobian Jf(x)J_{f}(x) of ff at xx is non-zero. Moreover, f1f^{-1} is continuously differentiable and we have Jf1(y)=[Jf(x)]1J_{f^{-1}}(y)=[J_{f}(x)]^{-1} (matrix inverse) at y=f(x)y=f(x).

In some cases, we get the existence of a global inverse function. For example, when f=Hf=\nabla H is the gradient of Legendre-type strictly convex and smooth function HH, the reciprocal function f1f^{-1} is well-defined f1=Hf^{-1}=\nabla H^{*} and global, where HH^{*} denotes the convex conjugate of HH (Legendre-type). In that case, f=Hf=\nabla H is a strictly monotone operator555A mapping M:ddM:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} is a strictly monotone operator iff (pq)(M(p)M(q))>0(p-q)^{\top}(M(p)-M(q))>0 for all p,qdp,q\in\mathbb{R}^{d} with pqp\not=q. Monotone operators are a generalization of the concept of univariate monotonic functions. since it is the gradient of a strictly convex function. We also refer to [2] for some work on operator means, and to [42] for multivariate quasi-arithmetic means of covariance matrices.

To summarize, we can compute the Bhattacharyya coefficient (and Bhattacharyya/Hellinger distances) using the parametric density function pλ(x)p_{\lambda}(x) and a quasi-arithmetic mean λ¯=Mθ(λ1,λ2)\bar{\lambda}=M_{\theta}(\lambda_{1},\lambda_{2}) on the source parameters λ1\lambda_{1} and λ2\lambda_{2} as:

ω𝒳,ρ[pλ1,pλ2]=pλ1(ω)pλ2(ω)p(ω;Mθ(λ1,λ2))\displaystyle\forall\omega\in\mathcal{X},\quad\rho[p_{\lambda_{1}},p_{\lambda_{2}}]=\frac{\sqrt{p_{\lambda_{1}}(\omega)p_{\lambda_{2}}(\omega)}}{p(\omega;M_{\theta}(\lambda_{1},\lambda_{2}))} (49)

using the notation p(x;θ):=pθ(x)p(x;\theta):=p_{\theta}(x), and

ω𝒳,DB[pλ1,pλ2]=log(p(ω;Mθ(λ1,λ2))p(ω,λ1)p(ω,λ2)).\forall\omega\in\mathcal{X},\quad D_{B}[p_{\lambda_{1}},p_{\lambda_{2}}]=\log\left(\frac{p\left(\omega;M_{\theta}(\lambda_{1},\lambda_{2})\right)}{\sqrt{p(\omega,\lambda_{1})p(\omega,\lambda_{2})}}\right). (50)

Similarly, we get the following cumulant-free expression for the Hellinger distance:

\forall\omega\in\mathcal{X},\quad D_{H}[p_{\lambda_{1}},p_{\lambda_{2}}]=\sqrt{1-\frac{\sqrt{p_{\lambda_{1}}(\omega)p_{\lambda_{2}}(\omega)}}{p(\omega;M_{\theta}(\lambda_{1},\lambda_{2}))}}. (51)

The Hellinger distance proves useful when using generic algorithms which require metric distances: for example, the greedy 2-approximation algorithm of Gonzalez [18] for k-center clustering in metric spaces.

These cumulant-free formulas are all the more convenient since, in legacy software APIs, we usually have access to the density function of the probability family. Thus if a parametric family of an API is an exponential family \mathcal{E}, we just need to implement the corresponding quasi-arithmetic mean M_{\theta}^{\mathcal{E}}.
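To illustrate this API-based recipe, here is a Python sketch (helper names are ours, SciPy assumed) for the univariate normal family: the density is taken off-the-shelf from scipy.stats.norm, and only the conversion functions \theta(\lambda) and \lambda(\theta) of the factorization of Eq. 10 are implemented. The returned value does not depend on the chosen \omega and agrees with the Gaussian closed form of Example 6 specialized to d=1:

import numpy as np
from scipy.stats import norm

# Source parameter lambda = (mu, sigma^2) of the univariate normal family (Eq. 10).
def theta(lam):                       # ordinary-to-natural conversion theta(lambda)
    mu, var = lam
    return np.array([mu / var, -0.5 / var])

def theta_inv(th):                    # natural-to-ordinary conversion lambda(theta)
    var = -0.5 / th[1]
    return (th[0] * var, var)

def density(x, lam):                  # off-the-shelf density from the API
    mu, var = lam
    return norm.pdf(x, loc=mu, scale=np.sqrt(var))

def bhattacharyya_coeff(lam1, lam2, omega=0.0):
    # Cumulant-free formula of Eq. 49 with lam_bar = M_theta(lam1, lam2).
    lam_bar = theta_inv(0.5 * (theta(lam1) + theta(lam2)))
    return np.sqrt(density(omega, lam1) * density(omega, lam2)) / density(omega, lam_bar)

lam1, lam2 = (0.0, 1.0), (1.0, 4.0)   # (mu, sigma^2)
print(bhattacharyya_coeff(lam1, lam2, omega=0.0), bhattacharyya_coeff(lam1, lam2, omega=3.5))
# Both calls return the same value, matching sqrt(2*s1*s2/(s1^2+s2^2)) * exp(-(mu1-mu2)^2/(4*(s1^2+s2^2))).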

More generally, the α\alpha-skewed Bhattacharyya distance [33] for α(0,1)\alpha\in(0,1) is expressed using the following cumulant-free expression:

\forall\omega\in\mathcal{X},\quad D_{B,\alpha}[p_{\lambda_{1}}:p_{\lambda_{2}}]=\log\left(\frac{p\left(\omega;M_{\theta,\alpha}(\lambda_{1},\lambda_{2})\right)}{p(\omega,\lambda_{1})^{\alpha}p(\omega,\lambda_{2})^{1-\alpha}}\right) (52)

with

Mθ,α(λ1,λ2)=λ(αθ(λ1)+(1α)θ(λ2))=:λ¯α.M_{\theta,\alpha}(\lambda_{1},\lambda_{2})=\lambda(\alpha\theta(\lambda_{1})+(1-\alpha)\theta(\lambda_{2}))=:\bar{\lambda}_{\alpha}. (53)

Notice that the geometric \alpha-barycenter of two densities p_{\theta_{1}} and p_{\theta_{2}} of an exponential family \mathcal{E} is a density of \mathcal{E} up to a scaling factor:

p(x;αθ1+(1α)θ2)=(p(ω;αθ1+(1α)θ2)pα(ω;θ1)p1α(ω;θ2))pα(x;θ1)p1α(x;θ2),ω𝒳.p(x;\alpha\theta_{1}+(1-\alpha)\theta_{2})=\left(\frac{p(\omega;\alpha\theta_{1}+(1-\alpha)\theta_{2})}{p^{\alpha}(\omega;\theta_{1})p^{1-\alpha}(\omega;\theta_{2})}\right)p^{\alpha}(x;\theta_{1})p^{1-\alpha}(x;\theta_{2}),\quad\forall\omega\in\mathcal{X}.

Let p~θ(x)=exp(t(x)θ+k(x))\tilde{p}_{\theta}(x)=\exp(t(x)^{\top}\theta+k(x)) denote the unnormalized density of an exponential family (so that pθ(x)=p~θ(x)exp(F(θ))p_{\theta}(x)=\tilde{p}_{\theta}(x)\exp(-F(\theta))). We have the following invariant:

\forall\omega\in\mathcal{X},\quad\frac{\tilde{p}(\omega;M_{\theta,\alpha}(\lambda_{1},\lambda_{2}))}{\tilde{p}(\omega,\lambda_{1})^{\alpha}\tilde{p}(\omega,\lambda_{2})^{1-\alpha}}=1. (54)

It follows that we have:

\rho_{\alpha}[p_{\theta_{1}}:p_{\theta_{2}}] = \frac{p(\omega,\lambda_{1})^{\alpha}p(\omega,\lambda_{2})^{1-\alpha}}{p(\omega;\alpha\theta_{1}+(1-\alpha)\theta_{2})}, (55)
=\displaystyle= exp(F(αθ1+(1α)θ2))exp(αF(θ1))exp((1α)F(θ2)),\displaystyle\frac{\exp(F(\alpha\theta_{1}+(1-\alpha)\theta_{2}))}{\exp(\alpha F(\theta_{1}))\exp((1-\alpha)F(\theta_{2}))}, (56)
=\displaystyle= exp(JF,α(θ1:θ2)).\displaystyle\exp(-J_{F,\alpha}(\theta_{1}:\theta_{2})). (57)

The Cauchy-Schwarz divergence [21, 41] is defined by

DCS(p,q):=log(𝒳p(x)q(x)dμ(x)(𝒳p(x)2dμ(x))(𝒳q(x)2dμ(x)))0.D_{\mathrm{CS}}(p,q):=-\log\left(\frac{\int_{\mathcal{X}}p(x)q(x)\mathrm{d}\mu(x)}{\sqrt{\left(\int_{\mathcal{X}}p(x)^{2}\mathrm{d}\mu(x)\right)\left(\int_{\mathcal{X}}q(x)^{2}\mathrm{d}\mu(x)\right)}}\right)\geq 0. (58)

Thus the Cauchy-Schwarz divergence is a projective divergence: D_{\mathrm{CS}}(p,q)=D_{\mathrm{CS}}(\lambda p,\lambda q) for any \lambda>0. It can be shown that the Cauchy-Schwarz divergence between two densities p_{\theta_{1}} and p_{\theta_{2}} of an exponential family whose natural parameter space is a cone [31] (e.g., the Gaussian family) is:

DCS(pθ1,pθ2)=JF(2θ1:2θ2)+log(Ep2θ1[exp(k(x))]Ep2θ2[exp(k(x))]Epθ1+θ2[exp(k(x))]).D_{\mathrm{CS}}(p_{\theta_{1}},p_{\theta_{2}})=J_{F}(2\theta_{1}:2\theta_{2})+\log\left(\frac{\sqrt{E_{p_{2\theta_{1}}}\left[\exp(k(x))\right]E_{p_{2\theta_{2}}}\left[\exp(k(x))\right]}}{E_{p_{\theta_{1}+\theta_{2}}}\left[\exp(k(x))\right]}\right). (59)

See [31] for the formula extended to Hölder divergences which generalize the Cauchy-Schwarz divergence.

2.2 Some illustrating examples

Let us start with an example of a continuous exponential family which relies on the arithmetic mean A(a,b)=a+b2A(a,b)=\frac{a+b}{2}:

Example 1

Consider the family of exponential distributions with rate parameter \lambda>0. The densities of this continuous EF write as p_{\lambda}(x)=\lambda\exp(-\lambda x) with support \mathcal{X}=[0,\infty). From the partial canonical factorization of the densities following Eq. 6, we get that \theta(u)=u and \theta^{-1}(u)=u, so that M_{\theta}(\lambda_{1},\lambda_{2})=A(\lambda_{1},\lambda_{2})=\frac{\lambda_{1}+\lambda_{2}}{2} is the arithmetic mean. Choose \omega=0 so that p_{\lambda}(\omega)=\lambda. It follows that

ρ[pλ1,pλ2]\displaystyle\rho[p_{\lambda_{1}},p_{\lambda_{2}}] =\displaystyle= p(ω,λ1)p(ω,λ2)p(ω;λ¯),\displaystyle\frac{\sqrt{p(\omega,\lambda_{1})p(\omega,\lambda_{2})}}{p\left(\omega;\bar{\lambda}\right)},
=\displaystyle= λ1λ2λ1+λ22,\displaystyle\frac{\sqrt{\lambda_{1}\lambda_{2}}}{\frac{\lambda_{1}+\lambda_{2}}{2}},
=\displaystyle= 2λ1λ2λ1+λ2.\displaystyle\frac{2\sqrt{\lambda_{1}\lambda_{2}}}{\lambda_{1}+\lambda_{2}}.

Let G(a,b)=\sqrt{ab} denote the geometric mean for a,b>0. Notice that since G(\lambda_{1},\lambda_{2})\leq A(\lambda_{1},\lambda_{2}), we have \rho[p_{\lambda_{1}},p_{\lambda_{2}}]\in(0,1]. The Bhattacharyya distance between two exponential densities of rates \lambda_{1} and \lambda_{2} is

D_{\mathrm{Bhat}}[p_{\lambda_{1}},p_{\lambda_{2}}]=-\log\rho[p_{\lambda_{1}},p_{\lambda_{2}}]=\log\frac{\lambda_{1}+\lambda_{2}}{2}-\frac{1}{2}\log\lambda_{1}\lambda_{2}\geq 0.

Since the logarithm is a monotonically increasing function, we have \log A(\lambda_{1},\lambda_{2})\geq\log G(\lambda_{1},\lambda_{2}), and therefore we check that D_{\mathrm{Bhat}}[p_{\lambda_{1}},p_{\lambda_{2}}]\geq 0.

Next, we consider a discrete exponential family which exhibits the geometric mean:

Example 2

The Poisson family of probability mass functions (PMFs) p_{\lambda}(x)=\frac{\lambda^{x}\exp(-\lambda)}{x!}, where \lambda>0 denotes the intensity parameter and x\in\mathcal{X}=\{0,1,\ldots\}, is a discrete exponential family with t(x)=x (\omega=0, t(\omega)=0 and p_{\lambda}(\omega)=\exp(-\lambda)), \theta(u)=\log u and \lambda(u)=\theta^{-1}(u)=\exp(u). Thus the quasi-arithmetic mean associated with the Poisson family is M_{\theta}(\lambda_{1},\lambda_{2})=G(\lambda_{1},\lambda_{2})=\sqrt{\lambda_{1}\lambda_{2}}, the geometric mean. It follows that the Bhattacharyya coefficient is

\rho[p_{\lambda_{1}},p_{\lambda_{2}}] = \frac{\sqrt{p(\omega,\lambda_{1})p(\omega,\lambda_{2})}}{p\left(\omega;\bar{\lambda}\right)},
 = \frac{\sqrt{\exp(-\lambda_{1})\exp(-\lambda_{2})}}{\exp(-\sqrt{\lambda_{1}\lambda_{2}})}=\exp\left(\sqrt{\lambda_{1}\lambda_{2}}-\frac{\lambda_{1}+\lambda_{2}}{2}\right)

Hence, we recover the Bhattacharyya distance between two Poisson PMFs:

DB[pλ1,pλ2]=λ1+λ22λ1λ20.D_{B}[p_{\lambda_{1}},p_{\lambda_{2}}]=\frac{\lambda_{1}+\lambda_{2}}{2}-\sqrt{\lambda_{1}\lambda_{2}}\geq 0.

The negative binomial distributions with a known number of failures also yield the same conversion function \theta(u)=\log u and hence the same geometric mean (but the density p(\omega,\lambda) is different).

To illustrate the use of the power means P_{r}(a,b)=\left(\frac{a^{r}+b^{r}}{2}\right)^{\frac{1}{r}} of order r\in\mathbb{R} (also called Hölder means) for r\not=0 (with \lim_{r\rightarrow 0}P_{r}(a,b)=G(a,b)), let us consider the family of Weibull distributions.

Example 3

The Weibull distributions with a prescribed shape parameter k++k\in\mathbb{R}_{++} (e.g., exponential distributions when k=1k=1, Rayleigh distributions when k=2k=2) form an exponential family. The density of a Weibull distribution with scale λ\lambda and fixed shape parameter kk is

pλ(x)=kλ(xλ)k1e(x/λ)k,x𝒳=++p_{\lambda}(x)=\frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1}e^{-(x/\lambda)^{k}},\quad x\in\mathcal{X}=\mathbb{R}_{++}

The ordinary\leftrightarrownatural parameter conversion functions are \theta(u)=\frac{1}{u^{k}} and \theta^{-1}(u)=\frac{1}{u^{\frac{1}{k}}}. It follows that \bar{\lambda}=M_{\theta}(\lambda_{1},\lambda_{2})=P_{-k}(\lambda_{1},\lambda_{2})=\left(\frac{1}{2}\lambda_{1}^{-k}+\frac{1}{2}\lambda_{2}^{-k}\right)^{-\frac{1}{k}} is the power mean of order -k. We choose \omega=1 and get p_{\lambda}(\omega)=\frac{k}{\lambda^{k}}e^{-\frac{1}{\lambda^{k}}}. It follows the closed-form formula for the Bhattacharyya coefficient (for any fixed shape k>0):

\rho[p_{\lambda_{1}},p_{\lambda_{2}}] = \frac{\sqrt{p(\omega,\lambda_{1})p(\omega,\lambda_{2})}}{p\left(\omega;\bar{\lambda}\right)}, (60)
 = \frac{2\sqrt{\lambda_{1}^{k}\lambda_{2}^{k}}}{\lambda_{1}^{k}+\lambda_{2}^{k}}. (61)

For k=2, the Weibull family yields the Rayleigh family with \lambda=\sqrt{2}\sigma. The Bhattacharyya coefficient is \rho[p_{\lambda_{1}},p_{\lambda_{2}}]=\frac{G(\lambda_{1},\lambda_{2})^{2}}{Q(\lambda_{1},\lambda_{2})^{2}}, where G and Q denote the geometric mean and the quadratic mean, respectively (with Q\geq G).

Table 1 summarizes the quasi-arithmetic means associated with some common univariate exponential families. Notice that the same quasi-arithmetic mean can be associated with several exponential families: For example, the Gaussian family with fixed variance and the exponential distribution family both have \theta(u)=u, yielding the arithmetic mean.

\begin{array}{llll}
\mbox{Exponential family }\mathcal{E} & \theta(u) & \theta^{-1}(u) & \bar{\lambda}=M_{\theta}(\lambda_{1},\lambda_{2})\\ \hline
\mbox{Exponential} & u & u & A(\lambda_{1},\lambda_{2})=\frac{\lambda_{1}+\lambda_{2}}{2}\\
\mbox{Poisson} & \log u & \exp(u) & G(\lambda_{1},\lambda_{2})=\sqrt{\lambda_{1}\lambda_{2}}\\
\mbox{Laplace (fixed $\mu$)} & \frac{1}{u} & \frac{1}{u} & H(\lambda_{1},\lambda_{2})=\frac{2\lambda_{1}\lambda_{2}}{\lambda_{1}+\lambda_{2}}\\
\mbox{Weibull (fixed shape $k>0$)} & \frac{1}{u^{k}} & \frac{1}{u^{\frac{1}{k}}} & P_{-k}(\lambda_{1},\lambda_{2})=\left(\frac{1}{2}\lambda_{1}^{-k}+\frac{1}{2}\lambda_{2}^{-k}\right)^{-\frac{1}{k}}\\
\mbox{Bernoulli} & \log\frac{u}{1-u} & \frac{\exp(u)}{1+\exp(u)} & \mathrm{Ber}(\lambda_{1},\lambda_{2})=\frac{a_{12}}{1+a_{12}}\\
 & \mbox{(logit function)} & \mbox{(logistic function)} & \mbox{with }a_{12}=\sqrt{\frac{\lambda_{1}\lambda_{2}}{(1-\lambda_{1})(1-\lambda_{2})}}\\ \hline
\end{array}
Table 1: Quasi-arithmetic means Mθ(λ1,λ2)=θ1(θ(λ1)+θ(λ2)2)M_{\theta}(\lambda_{1},\lambda_{2})=\theta^{-1}\left(\frac{\theta(\lambda_{1})+\theta(\lambda_{2})}{2}\right) associated with some common discrete and continuous exponential families of order D=1D=1.
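For instance, the Bernoulli row of Table 1 can be exercised directly with an off-the-shelf PMF. The following Python sketch (SciPy assumed, helper name ours) computes \bar{\lambda} with the logit/logistic pair and recovers the classical Bernoulli Bhattacharyya coefficient \sqrt{\lambda_{1}\lambda_{2}}+\sqrt{(1-\lambda_{1})(1-\lambda_{2})} obtained by summing over the two-point sample space:

import numpy as np
from scipy.stats import bernoulli
from scipy.special import logit, expit   # logit = theta(u), expit = logistic function = theta^{-1}(u)

def bhattacharyya_coeff_bernoulli(lam1, lam2, omega=0):
    lam_bar = expit(0.5 * (logit(lam1) + logit(lam2)))           # Ber(lam1, lam2) of Table 1
    p = lambda lam: bernoulli.pmf(omega, lam)                    # off-the-shelf PMF
    return np.sqrt(p(lam1) * p(lam2)) / p(lam_bar)               # cumulant-free formula of Eq. 49

lam1, lam2 = 0.2, 0.6
print(bhattacharyya_coeff_bernoulli(lam1, lam2))                 # same value for omega = 0 or omega = 1
print(np.sqrt(lam1 * lam2) + np.sqrt((1 - lam1) * (1 - lam2)))   # direct two-point sum, for comparison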

Let us now consider multi-order exponential families. We start with the bi-order exponential family of Gamma distributions.

Example 4

The density of a Gamma distribution is

p(x;α,β)=βαΓ(α)xα1eβx,x𝒳=(0,),p(x;\alpha,\beta)=\frac{\beta^{\alpha}}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x},\quad x\in\mathcal{X}=(0,\infty),

for a shape parameter α>0\alpha>0 and rate parameter β>0\beta>0 (i.e., a 22-order exponential family with λ=(λ1,λ2)=(α,β)\lambda=(\lambda_{1},\lambda_{2})=(\alpha,\beta)). The natural parameter is θ(λ)=(λ11,λ2)\theta(\lambda)=(\lambda_{1}-1,-\lambda_{2}) and the inverse function is λ(θ)=(θ1+1,θ2)\lambda(\theta)=(\theta_{1}+1,-\theta_{2}). It follows that the generalized quasi-arithmetic mean is the bivariate arithmetic mean: Mθ(λ1,λ2)=A(λ1,λ2)=(α1+α22,β1+β22)=(α¯,β¯)M_{\theta}(\lambda_{1},\lambda_{2})=A(\lambda_{1},\lambda_{2})=(\frac{\alpha_{1}+\alpha_{2}}{2},\frac{\beta_{1}+\beta_{2}}{2})=(\bar{\alpha},\bar{\beta}). We choose ω=1\omega=1 so that p(ω;α,β)=βαΓ(α)eβp(\omega;\alpha,\beta)=\frac{\beta^{\alpha}}{\Gamma(\alpha)}e^{-\beta}.

We get the Bhattacharyya coefficient:

\rho[p_{\alpha_{1},\beta_{1}}:p_{\alpha_{2},\beta_{2}}] = \frac{\sqrt{p(\omega,\lambda_{1})p(\omega,\lambda_{2})}}{p\left(\omega;\bar{\lambda}\right)},
 = \sqrt{\frac{\beta_{1}^{\alpha_{1}}\beta_{2}^{\alpha_{2}}}{\Gamma(\alpha_{1})\Gamma(\alpha_{2})}}\,\Gamma(\bar{\alpha})\,\bar{\beta}^{-\bar{\alpha}},

and the Bhattacharyya distance:

D_{B}[p_{\alpha_{1},\beta_{1}}:p_{\alpha_{2},\beta_{2}}]=\bar{\alpha}\log\bar{\beta}-\frac{\alpha_{1}\log\beta_{1}+\alpha_{2}\log\beta_{2}}{2}+\log\frac{\sqrt{\Gamma(\alpha_{1})\Gamma(\alpha_{2})}}{\Gamma(\bar{\alpha})}.

Next, we consider the Dirichlet family, which exhibits a separable (quasi-)arithmetic mean:

Example 5

Consider the family of Dirichlet distributions with densities defined on the (d1)(d-1)-dimensional open standard simplex support

𝒳=Δd:={(x1,,xd):xi(0,1),i=1dxi=1}.\mathcal{X}=\Delta_{d}:=\left\{(x_{1},\ldots,x_{d})\ :\ x_{i}\in(0,1),\ \sum_{i=1}^{d}x_{i}=1\right\}. (62)

The family of Dirichlet distributions includes the family of Beta distributions when d=2. The density of a Dirichlet distribution is defined by:

pα(x)=Γ(i=1dαi)i=1dΓ(αi)i=1dxiαi1.p_{\alpha}(x)=\frac{\Gamma\left(\sum_{i=1}^{d}\alpha_{i}\right)}{\prod_{i=1}^{d}\Gamma\left(\alpha_{i}\right)}\prod_{i=1}^{d}x_{i}^{\alpha_{i}-1}. (63)

The Dirichlet distributions are used in Bayesian statistics as the conjugate priors of the multinomial family. The Dirichlet distributions form an exponential family with d-dimensional natural parameter \theta=(\alpha_{1}-1,\ldots,\alpha_{d}-1) (D=d) and vector of sufficient statistics t(x)=(\log x_{1},\ldots,\log x_{d}). The induced quasi-arithmetic mean is a separable multivariate arithmetic mean, i.e., the multivariate arithmetic mean \bar{\lambda}=M_{\theta}(\alpha_{1},\alpha_{2})=A(\alpha_{1},\alpha_{2})=\frac{\alpha_{1}+\alpha_{2}}{2}.

Let us choose ω=(1d,,1d)\omega=\left(\frac{1}{d},\ldots,\frac{1}{d}\right) so that p(ω;α)=Γ(i=1dαi)i=1dΓ(αi)1d(i=1dαi)dp(\omega;\alpha)=\frac{\Gamma\left(\sum_{i=1}^{d}\alpha_{i}\right)}{\prod_{i=1}^{d}\Gamma\left(\alpha_{i}\right)}\frac{1}{d^{(\sum_{i=1}^{d}\alpha_{i})-d}}.

We get the following Bhattacharyya coefficient between two Dirichlet densities p_{\alpha_{1}} and p_{\alpha_{2}}:

ρ[pα1:pα2]=B(α1+α22)B(α1)B(α2),\rho[p_{\alpha_{1}}:p_{\alpha_{2}}]=\frac{B\left(\frac{\alpha_{1}+\alpha_{2}}{2}\right)}{\sqrt{B(\alpha_{1})B(\alpha_{2})}}, (64)

where

B(α):=i=1dΓ(αi)Γ(i=1dαi).B(\alpha):=\frac{\prod_{i=1}^{d}\Gamma\left(\alpha_{i}\right)}{\Gamma\left(\sum_{i=1}^{d}\alpha_{i}\right)}. (65)

It follows that the Bhattacharyya distance between two Dirichlet densities p_{\alpha_{1}} and p_{\alpha_{2}} is

DB[pα1:pα2]=log(B(α1)B(α2)B(α1+α22)).D_{B}[p_{\alpha_{1}}:p_{\alpha_{2}}]=\log\left(\frac{\sqrt{B(\alpha_{1})B(\alpha_{2})}}{B\left(\frac{\alpha_{1}+\alpha_{2}}{2}\right)}\right). (66)

The \alpha-skewed Bhattacharyya coefficient for a scalar \alpha\in(0,1) is:

ρα[pα1:pα2]=B(αα1+(1α)α2)Bα(α1)B1α(α2),\rho_{\alpha}[p_{\alpha_{1}}:p_{\alpha_{2}}]=\frac{B\left(\alpha\alpha_{1}+(1-\alpha)\alpha_{2}\right)}{B^{\alpha}(\alpha_{1})B^{1-\alpha}(\alpha_{2})}, (67)

and the \alpha-skewed Bhattacharyya distance:

DB,α[pα1:pα2]\displaystyle D_{B,\alpha}[p_{\alpha_{1}}:p_{\alpha_{2}}] =\displaystyle= log(Bα(α1)B1α(α2)B(αα1+(1α)α2)),\displaystyle\log\left(\frac{B^{\alpha}(\alpha_{1})B^{1-\alpha}(\alpha_{2})}{B\left(\alpha\alpha_{1}+(1-\alpha)\alpha_{2}\right)}\right), (68)
=\displaystyle= αlogB(α1)+(1α)logB(α2)logB(αα1+(1α)α2),\displaystyle\alpha\log B(\alpha_{1})+(1-\alpha)\log B(\alpha_{2})-\log B\left(\alpha\alpha_{1}+(1-\alpha)\alpha_{2}\right), (69)

with

logB(α)=i=1dlogΓ(αi)logΓ(i=1dαi).\log B(\alpha)=\sum_{i=1}^{d}\log\Gamma(\alpha_{i})-\log\Gamma\left(\sum_{i=1}^{d}\alpha_{i}\right). (70)

(This is in accordance with Eq. 15–17 of [43].)
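These Dirichlet formulas only involve the log-Beta function of Eq. 70, so they can be implemented in a few lines (here a Python sketch using scipy.special.gammaln; helper names are ours):

import numpy as np
from scipy.special import gammaln

def log_beta(alpha):
    # log B(alpha) of Eq. 70.
    alpha = np.asarray(alpha, dtype=float)
    return np.sum(gammaln(alpha)) - gammaln(np.sum(alpha))

def dirichlet_skew_bhattacharyya(alpha1, alpha2, a=0.5):
    # Eq. 69: alpha-skewed Bhattacharyya distance between two Dirichlet densities.
    alpha1, alpha2 = np.asarray(alpha1, float), np.asarray(alpha2, float)
    return a * log_beta(alpha1) + (1 - a) * log_beta(alpha2) - log_beta(a * alpha1 + (1 - a) * alpha2)

print(dirichlet_skew_bhattacharyya([1.0, 2.0, 3.0], [4.0, 1.0, 2.0], a=0.5))   # ordinary Bhattacharyya distance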

Finally, we consider the case of the multivariate Gaussian family:

Example 6

Consider the dd-dimensional Gaussian family [29] also called MultiVariate Normal (MVN) family. The parameter λ=(λv,λM)\lambda=(\lambda_{v},\lambda_{M}) of a MVN consists of a vector part λv=μ\lambda_{v}=\mu and a d×dd\times d matrix part λM=Σ\lambda_{M}=\Sigma. The Gaussian density is given by

pλ(x;λ)=1(2π)d2|λM|exp(12(xλv)λM1(xλv)),p_{\lambda}(x;\lambda)=\frac{1}{(2\pi)^{\frac{d}{2}}\sqrt{|\lambda_{M}|}}\exp\left(-\frac{1}{2}(x-\lambda_{v})^{\top}\lambda_{M}^{-1}(x-\lambda_{v})\right),

where |||\cdot| denotes the matrix determinant.

Partially factorizing the density into the canonical form of exponential family, we find that θ(λ)=(λM1λv,12λM1)\theta(\lambda)=\left(\lambda_{M}^{-1}\lambda_{v},\frac{1}{2}\lambda_{M}^{-1}\right) and λ(θ)=(12θM1θv,12θM1)\lambda(\theta)=(\frac{1}{2}\theta_{M}^{-1}\theta_{v},\frac{1}{2}\theta_{M}^{-1}). It follows that the multivariate weighted mean is Mθα(λ1,λ2)=λ¯α=(μα,Σα)M_{\theta}^{\alpha}(\lambda_{1},\lambda_{2})=\bar{\lambda}_{\alpha}=(\mu_{\alpha},\Sigma_{\alpha}) with

Σα=(αΣ11+(1α)Σ21)1,\Sigma_{\alpha}=\left(\alpha\Sigma_{1}^{-1}+(1-\alpha)\Sigma_{2}^{-1}\right)^{-1}, (71)

the matrix harmonic barycenter and

μα=Σα(αΣ11μ1+(1α)Σ21μ2).\mu_{\alpha}=\Sigma_{\alpha}\left(\alpha\Sigma_{1}^{-1}\mu_{1}+(1-\alpha)\Sigma_{2}^{-1}\mu_{2}\right). (72)

We choose ω=0\omega=0 with pλ(0;λ)=1(2π)d2|λM|exp(12λvλM1λv)p_{\lambda}(0;\lambda)=\frac{1}{(2\pi)^{\frac{d}{2}}\sqrt{|\lambda_{M}|}}\exp\left(-\frac{1}{2}\lambda_{v}^{\top}\lambda_{M}^{-1}\lambda_{v}\right). Let Δμ=μ2μ1\Delta_{\mu}=\mu_{2}-\mu_{1}. It follows the following closed-form formula for the Bhattacharyya coefficient between Gaussian densities:

ρ[pμ1,Σ1,pμ2,Σ2]=|Σ1+Σ22|12|Σ1|14|Σ2|14exp(18Δμ(Σ1+Σ22)1Δμ).\rho[p_{\mu_{1},\Sigma_{1}},p_{\mu_{2},\Sigma_{2}}]=\left|\frac{\Sigma_{1}+\Sigma_{2}}{2}\right|^{-\frac{1}{2}}|\Sigma_{1}|^{\frac{1}{4}}|\Sigma_{2}|^{\frac{1}{4}}\exp\left(-\frac{1}{8}\Delta_{\mu}^{\top}\left(\frac{\Sigma_{1}+\Sigma_{2}}{2}\right)^{-1}\Delta_{\mu}\right).

Thus the \alpha-skewed Bhattacharyya distance is

DB,α[pμ1,Σ1,pμ2,Σ2]=12(αμ1Σ11μ1+(1α)μ2Σ21μ2μαΣα1μα+log|Σ1|α|Σ2|1α|Σα|),D_{B,\alpha}[p_{\mu_{1},\Sigma_{1}},p_{\mu_{2},\Sigma_{2}}]=\frac{1}{2}\left(\alpha\mu_{1}^{\top}\Sigma_{1}^{-1}\mu_{1}+(1-\alpha)\mu_{2}^{\top}\Sigma_{2}^{-1}\mu_{2}-\mu_{\alpha}^{\top}\Sigma_{\alpha}^{-1}\mu_{\alpha}+\log\frac{|\Sigma_{1}|^{\alpha}|\Sigma_{2}|^{1-\alpha}}{|\Sigma_{\alpha}|}\right),

and the Hellinger distance:

D_{H}[p_{\mu_{1},\Sigma_{1}},p_{\mu_{2},\Sigma_{2}}]=\sqrt{1-\frac{|\Sigma_{1}|^{\frac{1}{4}}|\Sigma_{2}|^{\frac{1}{4}}}{\left|\frac{\Sigma_{1}+\Sigma_{2}}{2}\right|^{\frac{1}{2}}}\exp\left(-\frac{1}{8}\Delta_{\mu}^{\top}\left(\frac{\Sigma_{1}+\Sigma_{2}}{2}\right)^{-1}\Delta_{\mu}\right)}.

The Cauchy-Schwarz divergence between two Gaussians [31] is:

D_{\mathrm{CS}}(p_{\mu_{1},\Sigma_{1}},p_{\mu_{2},\Sigma_{2}}) = \frac{1}{2}\log\left(\frac{1}{2^{d}}\frac{\sqrt{|\Sigma_{1}|\,|\Sigma_{2}|}}{|(\Sigma_{1}^{-1}+\Sigma_{2}^{-1})^{-1}|}\right) (73)
+12μ1Σ11μ1+12μ2Σ21μ2\displaystyle+\frac{1}{2}\mu_{1}^{\top}\Sigma_{1}^{-1}\mu_{1}+\frac{1}{2}\mu_{2}^{\top}\Sigma_{2}^{-1}\mu_{2}
12(Σ11μ1+Σ21μ2)(Σ11+Σ21)1(Σ11μ1+Σ21μ2).\displaystyle-\frac{1}{2}(\Sigma_{1}^{-1}\mu_{1}+\Sigma_{2}^{-1}\mu_{2})^{\top}(\Sigma_{1}^{-1}+\Sigma_{2}^{-1})^{-1}(\Sigma_{1}^{-1}\mu_{1}+\Sigma_{2}^{-1}\mu_{2}).
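The cumulant-free recipe is easily checked numerically for multivariate Gaussians: the following NumPy sketch (helper names ours) computes \bar{\lambda} from Eq. 71-72 with \alpha=1/2, evaluates Eq. 49 at \omega=0, and compares the result with the closed-form coefficient of Example 6:

import numpy as np

def mvn_pdf(x, mu, Sigma):
    d = len(mu)
    diff = x - mu
    quad_form = diff @ np.linalg.solve(Sigma, diff)
    return np.exp(-0.5 * quad_form) / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

def bhattacharyya_coeff_mvn(mu1, S1, mu2, S2, omega=None):
    # lambda_bar of Eq. 71-72 with alpha = 1/2 (matrix harmonic mean of the covariances).
    S_bar = np.linalg.inv(0.5 * (np.linalg.inv(S1) + np.linalg.inv(S2)))
    mu_bar = S_bar @ (0.5 * (np.linalg.solve(S1, mu1) + np.linalg.solve(S2, mu2)))
    omega = np.zeros_like(mu1) if omega is None else omega
    return np.sqrt(mvn_pdf(omega, mu1, S1) * mvn_pdf(omega, mu2, S2)) / mvn_pdf(omega, mu_bar, S_bar)

mu1, S1 = np.array([0.0, 0.0]), np.array([[1.0, 0.2], [0.2, 2.0]])
mu2, S2 = np.array([1.0, -1.0]), np.array([[1.5, -0.3], [-0.3, 1.0]])

# Closed-form coefficient of Example 6, for comparison.
S_a = 0.5 * (S1 + S2)
dmu = mu2 - mu1
rho_closed = (np.linalg.det(S_a) ** -0.5 * np.linalg.det(S1) ** 0.25 * np.linalg.det(S2) ** 0.25
              * np.exp(-0.125 * dmu @ np.linalg.solve(S_a, dmu)))
print(bhattacharyya_coeff_mvn(mu1, S1, mu2, S2), rho_closed)   # both values coincide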

3 Cumulant-free formula for the Kullback-Leibler divergence and related divergences

The Kullback-Leibler divergence (KLD) DKL[p:q]:=p(x)logp(x)q(x)dμ(x)D_{\mathrm{KL}}[p:q]:=\int p(x)\log\frac{p(x)}{q(x)}\mathrm{d}\mu(x) between two densities pp and qq also called the relative entropy [12] amounts to a reverse Bregman divergence, DKL[pθ1:pθ2]=BF(θ1:θ2)=BF(θ2:θ1)D_{\mathrm{KL}}[p_{\theta_{1}}:p_{\theta_{2}}]={B_{F}}^{*}(\theta_{1}:\theta_{2})=B_{F}(\theta_{2}:\theta_{1}), when the densities belong to the same exponential family \mathcal{E}, where the Bregman generator FF is the cumulant function of \mathcal{E}.

We present below two techniques to calculate the KLD without computing the integral:

  • The first technique, described in §3.1, considers the KLD as a limit case of the \alpha-skewed Bhattacharyya distance.

  • The second technique relies on the availability of off-the-shelf formulas for the entropy and the moments of the sufficient statistics (§3.2), and is derived using the Legendre-Fenchel divergence.

3.1 Kullback-Leibler divergence as the limit case of a skewed Bhattacharyya distance

We can obtain a closed-form formula for the Kullback-Leibler divergence by considering the limit case of the \alpha-skewed Bhattacharyya distance:

D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}] = \lim_{\alpha\rightarrow 0}\frac{1}{\alpha}D_{B,\alpha}[p_{\lambda_{2}}:p_{\lambda_{1}}], (74)
 = \lim_{\alpha\rightarrow 0}-\frac{1}{\alpha}\log\rho_{\alpha}[p_{\lambda_{2}}:p_{\lambda_{1}}], (75)
 = \lim_{\alpha\rightarrow 0}\frac{1}{\alpha}\log\frac{p\left(\omega;M_{\theta,\alpha}(\lambda_{2},\lambda_{1})\right)}{p(\omega,\lambda_{2})^{\alpha}p(\omega,\lambda_{1})^{1-\alpha}}, (76)
 = \log\left(\frac{p(\omega;\lambda_{1})}{p(\omega;\lambda_{2})}\right)+\lim_{\alpha\rightarrow 0}\frac{1}{\alpha}\log\left(\frac{p\left(\omega;M_{\theta,\alpha}(\lambda_{2},\lambda_{1})\right)}{p(\omega,\lambda_{1})}\right). (77)

When we deal with uni-order exponential families (D=1D=1), we can use a first-order Taylor expansion of the quasi-arithmetic means when α0\alpha\simeq 0 (see [30]):

M_{\theta,\alpha}(b,a)=_{\alpha\simeq 0}a+\alpha\frac{\theta(b)-\theta(a)}{\theta^{\prime}(a)}+o\left(\alpha(\theta(b)-\theta(a))\right), (78)

where \theta^{\prime}(\cdot) denotes the derivative of the ordinary-to-natural parameter conversion function.

It follows that we have:

DKL[pλ1:pλ2]=log(p(ω;λ1)p(ω;λ2))+limα01αlog(p(ω;λ1+αθ(λ2)θ(λ1)θ(λ1))p(ω;λ1)),ω𝒳D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}]=\log\left(\frac{p(\omega;\lambda_{1})}{p(\omega;\lambda_{2})}\right)+\lim_{\alpha\rightarrow 0}\frac{1}{\alpha}\log\left(\frac{p\left(\omega;\lambda_{1}+\alpha\frac{\theta(\lambda_{2})-\theta(\lambda_{1})}{\theta^{\prime}(\lambda_{1})}\right)}{p(\omega;\lambda_{1})}\right),\forall\omega\in\mathcal{X} (79)

Notice that we need to calculate case by case the limit as it depends on the density expression p(x;λ)p(x;\lambda) of the exponential family. This limit can be computed symbolically using a computer algebra system (e.g., using Maxima666http://maxima.sourceforge.net/). The example below illustrates the technique for calculating the KLD between two Weibull densities with prescribed shape parameter.

Example 7

Consider the Weibull family with prescribed shape parameter k that forms an exponential family (including the family of exponential distributions for k=1 and the Rayleigh distributions for k=2). The density of a Weibull distribution with scale \lambda>0 and fixed shape parameter k is

pλ(x)=kλ(xλ)k1e(x/λ)k,x𝒳=++p_{\lambda}(x)=\frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1}e^{-(x/\lambda)^{k}},\quad x\in\mathcal{X}=\mathbb{R}_{++}

We have \theta(u)=u^{-k} and \theta^{\prime}(u)=-ku^{-k-1}. We set \omega=1 so that p_{\lambda}(\omega)=\frac{k}{\lambda^{k}}\exp(-\lambda^{-k}). Let us program the formula of Eq. 77 using the computer algebra system Maxima:

/* KLD between Weibull densities with fixed shape k, computed as a limit (Eq. 77) */
assume(l1>0);
assume(l2>0);
k:5;
omega:1;
t(u):=u**(-k);          /* ordinary-to-natural parameter conversion theta(u) */
tinv(u):=u**(-1/k);     /* natural-to-ordinary parameter conversion theta^(-1)(u) */
p(x,l):=(k/l)*((x/l)**(k-1))*exp(-(x/l)**k);   /* Weibull density with scale l and shape k */
mean(a,b):=tinv(alpha*t(a)+(1-alpha)*t(b));    /* weighted quasi-arithmetic mean M_{theta,alpha}(a,b) */
log(p(omega,l1)/p(omega,l2)) + (1/alpha)*log(p(omega,mean(l2,l1))/p(omega,l1));
limit(ratsimp(%), alpha, 0);
expand(%);
Figure 1: Snapshot of the Maxima GUI displaying the result of the symbolic calculation of the KLD between two Weibull densities with prescribed shape parameter.

Figure 1 displays a snapshot of the result which can be easily simplified manually as

DKL[pλ1:pλ2]=klogλ2λ1+(λ1λ2)k1.D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}]=k\log\frac{\lambda_{2}}{\lambda_{1}}+\left(\frac{\lambda_{1}}{\lambda_{2}}\right)^{k}-1. (80)

In general, the KLD between two Weibull densities with arbitrary shapes [4] is

DKL[pk1,λ1:pk2,λ2]=logk1λ1k1logk2λ2k2+(k1k2)(logλ1γk1)+(λ1λ2)k2Γ(k2k1+1)1,D_{\mathrm{KL}}[p_{k_{1},\lambda_{1}}:p_{k_{2},\lambda_{2}}]=\log\frac{k_{1}}{\lambda_{1}^{k_{1}}}-\log\frac{k_{2}}{\lambda_{2}^{k_{2}}}+\left(k_{1}-k_{2}\right)\left(\log\lambda_{1}-\frac{\gamma}{k_{1}}\right)+\left(\frac{\lambda_{1}}{\lambda_{2}}\right)^{k_{2}}\Gamma\left(\frac{k_{2}}{k_{1}}+1\right)-1, (81)

where \gamma denotes the Euler-Mascheroni constant. Thus when k_{1}=k_{2}=k, we recover Eq. 80 since \Gamma(2)=1. (However, the family of Weibull densities with varying shape parameter is not an exponential family since the sufficient statistics depend on the shape parameter.)

In practice, we may program the formula of Eq. 77 by defining:

DKL,α[pλ1:pλ2]=log(p(ω;λ1)p(ω;λ2))+1αlog(p(ω;Mθ,α(λ2,λ1))p(ω;λ1)),D_{\mathrm{KL},\alpha}[p_{\lambda_{1}}:p_{\lambda_{2}}]=\log\left(\frac{p(\omega;\lambda_{1})}{p(\omega;\lambda_{2})}\right)+\frac{1}{\alpha}\log\left(\frac{p\left(\omega;M_{\theta,\alpha}(\lambda_{2},\lambda_{1})\right)}{p(\omega;\lambda_{1})}\right), (82)

and approximate the KLD by DKL,αD_{\mathrm{KL},\alpha} for a small value of α\alpha (say, α=103\alpha=10^{-3}). Thus we need only θ()\theta(\cdot) and θ1()\theta^{-1}(\cdot) for defining Mθ(,)M_{\theta}(\cdot,\cdot), and the density p(x;θ)p(x;\theta). This approximation also works for multivariate extensions of the quasi-arithmetic means.
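A minimal Python sketch of this numerical recipe (helper names ours), shown for the exponential distribution family of Example 8 below, for which \theta(u)=u so that M_{\theta,\alpha} is a weighted arithmetic mean:

import numpy as np

def density(x, lam):                      # exponential distribution density
    return lam * np.exp(-lam * x)

def kl_approx(lam1, lam2, alpha=1e-3, omega=0.0):
    # Eq. 82: D_{KL,alpha} approximates the KLD for a small value of alpha.
    theta = lambda u: u                   # ordinary-to-natural conversion (identity for this family)
    theta_inv = lambda u: u
    m = theta_inv(alpha * theta(lam2) + (1 - alpha) * theta(lam1))   # M_{theta,alpha}(lam2, lam1)
    return (np.log(density(omega, lam1) / density(omega, lam2))
            + np.log(density(omega, m) / density(omega, lam1)) / alpha)

lam1, lam2 = 1.0, 3.0
print(kl_approx(lam1, lam2))                           # approximately log(lam1/lam2) + lam2/lam1 - 1
print(np.log(lam1 / lam2) + lam2 / lam1 - 1.0)         # exact KLD, cf. Example 8 below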

Let us give two examples using the first-order approximation of the univariate quasi-arithmetic mean:

Example 8

Consider the family {p(x;λ)=λexp(λx)}\{p(x;\lambda)=\lambda\exp(-\lambda x)\} of exponential distributions with support 𝒳=[0,)\mathcal{X}=[0,\infty). Set ω=0\omega=0, p(ω;λ)=λp(\omega;\lambda)=\lambda, θ=λ\theta=\lambda, θ(u)=u\theta(u)=u and θ(u)=1\theta^{\prime}(u)=1. We have

D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}] = \log\left(\frac{\lambda_{1}}{\lambda_{2}}\right)+\lim_{\alpha\rightarrow 0}\frac{1}{\alpha}\log\frac{\lambda_{1}+\alpha(\lambda_{2}-\lambda_{1})}{\lambda_{1}}, (83)
 = \log\left(\frac{\lambda_{1}}{\lambda_{2}}\right)+\lim_{\alpha\rightarrow 0}\frac{1}{\alpha}\log\left(1+\alpha\left(\frac{\lambda_{2}}{\lambda_{1}}-1\right)\right), (84)
 = \log\left(\frac{\lambda_{1}}{\lambda_{2}}\right)+\frac{\lambda_{2}}{\lambda_{1}}-1. (85)
Example 9

Consider the Poisson family with ω=0\omega=0, pλ(ω)=exp(λ)p_{\lambda}(\omega)=\exp(-\lambda), θ(u)=logu\theta(u)=\log u (MθM_{\theta} is the geometric mean) and θ(u)=1u\theta^{\prime}(u)=\frac{1}{u}. We get

D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}] = \log\frac{p(\omega;\lambda_{1})}{p(\omega;\lambda_{2})}+\lim_{\alpha\rightarrow 0}\frac{1}{\alpha}\log\frac{p\left(\omega;\lambda_{1}+\alpha\frac{\theta(\lambda_{2})-\theta(\lambda_{1})}{\theta^{\prime}(\lambda_{1})}\right)}{p(\omega;\lambda_{1})}, (86)
=\displaystyle= λ2λ1+limα01αlogexp(αλ1logλ1λ2),\displaystyle\lambda_{2}-\lambda_{1}+\lim_{\alpha\rightarrow 0}\frac{1}{\alpha}\log\exp(\alpha\lambda_{1}\log\frac{\lambda_{1}}{\lambda_{2}}), (87)
=\displaystyle= λ2λ1+λ1logλ1λ2.\displaystyle\lambda_{2}-\lambda_{1}+\lambda_{1}\log\frac{\lambda_{1}}{\lambda_{2}}. (88)

3.2 Kullback-Leibler divergence formula relying on the differential entropy and moments

Consider the Kullback-Leibler divergence [25] (relative entropy) D_{\mathrm{KL}}[p:q]=\int p(x)\log\frac{p(x)}{q(x)}\mathrm{d}\mu(x) between two probability densities p and q. When the densities belong to the same exponential family, the KL divergence amounts to a Legendre-Fenchel divergence (the canonical expression of divergences using the dual coordinate systems in dually flat spaces of information geometry [1]):

D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}]=A_{F}(\theta(\lambda_{2}):\eta(\lambda_{1})), (89)

where the Legendre-Fenchel divergence is defined for a pair of strictly convex and differentiable conjugate generators F(θ)F(\theta) and F(η)=supθθηF(θ)F^{*}(\eta)=\sup_{\theta}\theta^{\top}\eta-F(\theta) by

AF(θ:η):=F(θ)+F(η)θη,A_{F}(\theta:\eta^{\prime}):=F(\theta)+F^{*}(\eta^{\prime})-\theta^{\top}\eta^{\prime}, (90)

with \eta^{\prime}=\nabla F(\theta^{\prime}) denoting the moment parameter corresponding to a natural parameter \theta^{\prime}.

Since FF is defined modulo some affine function, we can choose F(θ(λ))=logp(ω;θ(λ))F(\theta(\lambda))=-\log p(\omega;\theta(\lambda)). Furthermore, for exponential families, we have

η(λ)=Epλ[t(x)],\eta(\lambda)=E_{p_{\lambda}}[t(x)], (91)

and the Shannon entropy [46]

h(p)=𝒳p(x)logp(x)dμ(x),h(p)=-\int_{\mathcal{X}}p(x)\log p(x)\mathrm{d}\mu(x), (92)

admits the following expression [36] when p=pθp=p_{\theta} belongs to an exponential family \mathcal{E}:

h(pλ)=F(η(λ))Epλ[k(x)].h(p_{\lambda})=-F^{*}(\eta(\lambda))-E_{p_{\lambda}}[k(x)]. (93)

Thus if we already have at our disposal (1) the expectation of the sufficient statistic, and (2) the entropy, we can easily recover the Kullback-Leibler divergence as follows, choosing a point \omega\in\mathcal{X} such that \theta(\lambda_{2})^{\top}t(\omega)+k(\omega)=0 (for instance, a point with t(\omega)=0 and k(\omega)=0 when such a point exists):

D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}]=-\log p(\omega;\lambda_{2})-h(p_{\lambda_{1}})-E_{p_{\lambda_{1}}}[k(x)]-\theta(\lambda_{2})^{\top}E_{p_{\lambda_{1}}}[t(x)]. (94)

For densities pλ1p_{\lambda_{1}} and pλ2p_{\lambda_{2}} belonging to the same exponential family, the Jeffreys divergence is

D_{J}[p_{\lambda_{1}}:p_{\lambda_{2}}]=D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}]+D_{\mathrm{KL}}[p_{\lambda_{2}}:p_{\lambda_{1}}]=(\theta_{2}-\theta_{1})^{\top}(\eta_{2}-\eta_{1}). (95)

It follows that we can write Jeffreys’ divergence using the following cumulant-free expression:

DJ[pλ1:pλ2]=(θ(λ2)θ(λ1))(Epλ2[t(x)]Epλ1[t(x)]).D_{J}[p_{\lambda_{1}}:p_{\lambda_{2}}]=\left(\theta(\lambda_{2})-\theta(\lambda_{1})\right)^{\top}(E_{p_{\lambda_{2}}}[t(x)]-E_{p_{\lambda_{1}}}[t(x)]). (96)

Note that a strictly monotone operator OO defines a symmetric dissimilarity:

D_{O}(\theta_{1},\theta_{2}):=(\theta_{2}-\theta_{1})^{\top}(O(\theta_{2})-O(\theta_{1}))\geq 0, (97)

with equality if and only if \theta_{1}=\theta_{2}. Since \nabla F is a strictly monotone operator and E_{p_{\lambda}}[t(x)]=\nabla F(\theta), we may reinterpret Jeffreys' divergence as a symmetric dissimilarity induced by a strictly monotone operator.
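As a quick sanity check of Eq. 96 (a sketch with the arbitrary rates \lambda_{1}=2 and \lambda_{2}=5), consider the Poisson family for which \theta(\lambda)=\log\lambda and E_{p_{\lambda}}[t(x)]=E[x]=\lambda; the cumulant-free expression can be compared in Maxima against the sum of the two Poisson Kullback-Leibler divergences of Example 9:

/* Jeffreys divergence for the Poisson family via Eq. 96 */
theta(l) := log(l);
eta(l) := l;                                   /* E[t(x)] = E[x] = lambda */
DJ(l1,l2) := (theta(l2)-theta(l1))*(eta(l2)-eta(l1));
float(DJ(2,5));                                /* approximately 2.749 */
float((5-2+2*log(2/5)) + (2-5+5*log(5/2)));    /* approximately 2.749 */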

Let us report now some illustrating examples. We start with an example illustrating the use of a separable multivariate quasi-arithmetic mean.

Example 10 (continue Example 5)

Consider the Dirichlet exponential family. The differential entropy of a Dirichlet density pαp_{\alpha} is

h(pα)=logB(α)+(iαid)ψ(iαi)j=1d(αj1)ψ(αj),h(p_{\alpha})=\log\mathrm{B}(\alpha)+\left(\sum_{i}\alpha_{i}-d\right)\psi\left(\sum_{i}\alpha_{i}\right)-\sum_{j=1}^{d}\left(\alpha_{j}-1\right)\psi\left(\alpha_{j}\right),

where \psi(\cdot) denotes the digamma function. We have E[t(x)]=E[(\log x_{1},\ldots,\log x_{d})]=(\psi(\alpha_{1})-\psi(\alpha_{\sum}),\ldots,\psi(\alpha_{d})-\psi(\alpha_{\sum})) where \alpha_{\sum}=\sum_{i=1}^{d}\alpha_{i}. It follows that the Kullback-Leibler divergence between p_{\alpha} and p_{\alpha^{\prime}} is:

D_{\mathrm{KL}}[p_{\alpha}:p_{\alpha^{\prime}}]=\log\Gamma\left(\alpha_{\sum}\right)-\sum_{i=1}^{d}\log\Gamma\left(\alpha_{i}\right)-\log\Gamma\left(\alpha_{\sum}^{\prime}\right)+\sum_{i=1}^{d}\log\Gamma\left(\alpha_{i}^{\prime}\right)+\sum_{i=1}^{d}\left(\alpha_{i}-\alpha_{i}^{\prime}\right)\left(\psi\left(\alpha_{i}\right)-\psi\left(\alpha_{\sum}\right)\right).

Next, we report an example illustrating a non-separable multivariate quasi-arithmetic mean.

Example 11 (continue Example 6)

Consider the d-dimensional multivariate Gaussian family. Since k(x)=0 (so that E[k(x)]=0), \eta(\lambda)=\nabla_{\theta}F(\theta(\lambda))=(\mu,\mu\mu^{\top}+\Sigma) (since E[x]=\mu and E[xx^{\top}]=\mu\mu^{\top}+\Sigma), and the differential entropy is h(p_{\mu,\Sigma})=\frac{1}{2}\log|2\pi e\Sigma|=\frac{d}{2}(1+\log 2\pi)+\frac{1}{2}\log|\Sigma| (using |aM|=|a|^{d}|M|), we recover the Kullback-Leibler divergence as

DKL[pμ1,Σ1:pμ2,Σ2]=12(tr(Σ21Σ1)+ΔμΣ21Δμ+log|Σ2||Σ1|d),D_{\mathrm{KL}}[p_{\mu_{1},\Sigma_{1}}:p_{\mu_{2},\Sigma_{2}}]=\frac{1}{2}\left(\mathrm{tr}(\Sigma_{2}^{-1}\Sigma_{1})+\Delta_{\mu}^{\top}\Sigma_{2}^{-1}\Delta_{\mu}+\log\frac{|\Sigma_{2}|}{|\Sigma_{1}|}-d\right), (98)

and the Jeffreys divergence as

DJ[pμ1,Σ1,pμ2,Σ2]=Δμ(Σ11+Σ212)Δμ+12tr(Σ21Σ1+Σ11Σ2)d.D_{J}[p_{\mu_{1},\Sigma_{1}},p_{\mu_{2},\Sigma_{2}}]=\Delta_{\mu}^{\top}\left(\frac{\Sigma_{1}^{-1}+\Sigma_{2}^{-1}}{2}\right)\Delta_{\mu}+\frac{1}{2}\mathrm{tr}\left(\Sigma_{2}^{-1}\Sigma_{1}+\Sigma_{1}^{-1}\Sigma_{2}\right)-d. (99)

3.3 The Kullback-Leibler divergence expressed as a log density ratio

Let us express the Bregman divergence with the equivalent generator Fl(θ):=log(pθ(ω))F_{l}(\theta):=-\log(p_{\theta}(\omega)) for any prescribed ω𝒳\omega\in\mathcal{X} instead of the cumulant function of Eq. 7. We get

DKL[pλ1:pλ2]\displaystyle D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}] =\displaystyle= BFl(θ(λ2):θ(λ1)),\displaystyle B_{F_{l}}(\theta(\lambda_{2}):\theta(\lambda_{1})), (100)
=\displaystyle= log(pλ1(ω)pλ2(ω))(θ(λ2)θ(λ1))θFl(θ(λ1)).\displaystyle\log\left(\frac{p_{\lambda_{1}}(\omega)}{p_{\lambda_{2}}(\omega)}\right)-(\theta(\lambda_{2})-\theta(\lambda_{1}))^{\top}\nabla_{\theta}F_{l}(\theta(\lambda_{1})). (101)

Let us remark that

Fl(θ)=θt(ω)+F(θ)k(ω).F_{l}(\theta)=-\theta^{\top}t(\omega)+F(\theta)-k(\omega). (102)

We have

Fl(θ(λ1))\displaystyle\nabla F_{l}(\theta(\lambda_{1})) =\displaystyle= θpλ1(ω)pλ1(ω),\displaystyle-\frac{\nabla_{\theta}p_{\lambda_{1}}(\omega)}{p_{\lambda_{1}}(\omega)}, (103)
=\displaystyle= (t(ω)F(θ(λ1)))pλ1(ω)pλ1(ω),\displaystyle-\frac{(t(\omega)-\nabla F(\theta(\lambda_{1})))p_{\lambda_{1}}(\omega)}{p_{\lambda_{1}}(\omega)}, (104)
=\displaystyle= (t(ω)F(θ(λ1))),\displaystyle-(t(\omega)-\nabla F(\theta(\lambda_{1}))), (105)

where F(θ)F(\theta) is the cumulant function of Eq. 7. Alternatively, we can also directly find that Fl(θ(λ1))=(t(ω)F(θ(λ1)))\nabla F_{l}(\theta(\lambda_{1}))=-(t(\omega)-\nabla F(\theta(\lambda_{1}))) from Eq. 102.

It follows that:

DKL[pλ1:pλ2]=log(pλ1(ω)pλ2(ω))+(θ(λ2)θ(λ1))(t(ω)F(θ(λ1))),ω𝒳.D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}]=\log\left(\frac{p_{\lambda_{1}}(\omega)}{p_{\lambda_{2}}(\omega)}\right)+(\theta(\lambda_{2})-\theta(\lambda_{1}))^{\top}(t(\omega)-\nabla F(\theta(\lambda_{1}))),\quad\forall\omega\in\mathcal{X}. (106)

Thus when t(ω)F(θ(λ1))t(\omega)-\nabla F(\theta(\lambda_{1})) is Euclidean orthogonal to θ(λ2)θ(λ1)\theta(\lambda_{2})-\theta(\lambda_{1}) (i.e., (θ(λ2)θ(λ1))(t(ω)F(θ(λ1)))=0(\theta(\lambda_{2})-\theta(\lambda_{1}))^{\top}(t(\omega)-\nabla F(\theta(\lambda_{1})))=0), the Bregman divergence (and the corresponding KLD on swapped parameter order of the densities) is expressed as a log-density ratio quantity. Let

𝒳(λ1:λ2):={ω𝒳:(θ(λ2)θ(λ1))(t(ω)F(θ(λ1)))=0}.\mathcal{X}_{\perp}^{\mathcal{E}}(\lambda_{1}:\lambda_{2}):=\left\{\omega\in\mathcal{X}\ :\ (\theta(\lambda_{2})-\theta(\lambda_{1}))^{\top}(t(\omega)-\nabla F(\theta(\lambda_{1})))=0\right\}. (107)

Then DKL[pλ1:pλ2]=log(pλ1(ω)pλ2(ω))D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}]=\log\left(\frac{p_{\lambda_{1}}(\omega)}{p_{\lambda_{2}}(\omega)}\right) for all ω𝒳(λ1:λ2)\omega\in\mathcal{X}_{\perp}^{\mathcal{E}}(\lambda_{1}:\lambda_{2}).

Lemma 1

The Kullback-Leibler divergence between two densities pλ1p_{\lambda_{1}} and pλ2p_{\lambda_{2}} belonging to a same exponential family \mathcal{E} is expressed as a log density ratio, DKL[pλ1:pλ2]=log(pλ1(ω)pλ2(ω))D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}]=\log\left(\frac{p_{\lambda_{1}}(\omega)}{p_{\lambda_{2}}(\omega)}\right), whenever ω𝒳(λ1:λ2)\omega\in\mathcal{X}_{\perp}^{\mathcal{E}}(\lambda_{1}:\lambda_{2}).

Notice that a sufficient condition is to choose ω\omega such that t(ω)=η1=F(θ1)t(\omega)=\eta_{1}=\nabla F(\theta_{1}).

Thus if we carefully choose ω𝒳(λ1:λ2)\omega\in\mathcal{X}_{\perp}^{\mathcal{E}}(\lambda_{1}:\lambda_{2}) according to the source parameters, we may express the Kullback-Leibler divergence as a simple log density ratio without requiring the formula for the differential entropy nor the moment.

Example 12

Consider the exponential family ={pλ(x)=λexp(λx),λ>0}\mathcal{E}=\{p_{\lambda}(x)=\lambda\exp(-\lambda x),\ \lambda>0\} of exponential distributions with F(θ)=1θ\nabla F(\theta)=\frac{1}{\theta} for θ=λ\theta=\lambda. We have (θ(λ2)θ(λ1))(t(ω)F(θ(λ1)))=0(\theta(\lambda_{2})-\theta(\lambda_{1}))^{\top}(t(\omega)-\nabla F(\theta(\lambda_{1})))=0 that amounts to (λ2λ1)(ω1λ1)=0(\lambda_{2}-\lambda_{1})(\omega-\frac{1}{\lambda_{1}})=0, i.e., ω=1λ1\omega=\frac{1}{\lambda_{1}} (and 𝒳(λ1:λ2)={1λ1}\mathcal{X}_{\perp}^{\mathcal{E}}(\lambda_{1}:\lambda_{2})=\left\{\frac{1}{\lambda_{1}}\right\}). In that case, we have

DKL[pλ1:pλ2]\displaystyle D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}] =\displaystyle= log(pλ1(1λ1)pλ2(1λ1)),\displaystyle\log\left(\frac{p_{\lambda_{1}}\left(\frac{1}{\lambda_{1}}\right)}{p_{\lambda_{2}}\left(\frac{1}{\lambda_{1}}\right)}\right), (108)
=\displaystyle= log(λ1exp(λ11λ1)λ2exp(λ21λ1)),\displaystyle\log\left(\frac{\lambda_{1}\exp\left(-\lambda_{1}\frac{1}{\lambda_{1}}\right)}{\lambda_{2}\exp\left(-\lambda_{2}\frac{1}{\lambda_{1}}\right)}\right), (109)
=\displaystyle= log(λ1λ2)+λ2λ11.\displaystyle\log\left(\frac{\lambda_{1}}{\lambda_{2}}\right)+\frac{\lambda_{2}}{\lambda_{1}}-1. (110)

This formula matches the expression of Eq. 85.

Similarly, we may rewrite the Bregman divergence BF(θ1:θ2)B_{F}(\theta_{1}:\theta_{2}) as

BF(θ1:θ2)=BFa(θ1:θ2)=Fa(θ1)Fa(θ2),B_{F}(\theta_{1}:\theta_{2})=B_{F_{a}}(\theta_{1}:\theta_{2})=F_{a}(\theta_{1})-F_{a}(\theta_{2}), (111)

for F_{a}(\theta)=F(\theta)-a^{\top}\theta (and \nabla F_{a}(\theta)=\nabla F(\theta)-a), with a=\nabla F(\theta_{2}) so that \nabla F_{a}(\theta_{2})=0.

Example 13

Consider the exponential family of dd-dimensional Gaussians with fixed covariance matrix Σ\Sigma:

={pλ(x)=1(2π)d2|Σ|exp(12(xλ)Σ1(xλ)),:λd}.\mathcal{E}=\left\{p_{\lambda}(x)=\frac{1}{(2\pi)^{\frac{d}{2}}\sqrt{|\Sigma|}}\exp\left(-\frac{1}{2}(x-\lambda)^{\top}\Sigma^{-1}(x-\lambda)\right),\ :\ \lambda\in\mathbb{R}^{d}\right\}. (112)

It is an exponential family of order D=dD=d with sufficient statistic t(x)=xt(x)=x. Let us choose ω\omega such that (θ(λ2)θ(λ1))(t(ω)F(θ(λ1)))=0(\theta(\lambda_{2})-\theta(\lambda_{1}))^{\top}(t(\omega)-\nabla F(\theta(\lambda_{1})))=0. For example, we choose t(ω)=F(θ1)=μ1t(\omega)=\nabla F(\theta_{1})=\mu_{1}. It follows that we have

D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}]=\log\left(\frac{p_{\lambda_{1}}\left(\mu_{1}\right)}{p_{\lambda_{2}}\left(\mu_{1}\right)}\right), (113)
=\displaystyle= log((2π)d2|Σ|(2π)d2|Σ|exp(12(μ1μ2)Σ1(μ1μ2))),\displaystyle\log\left(\frac{(2\pi)^{\frac{d}{2}}\sqrt{|\Sigma|}}{(2\pi)^{\frac{d}{2}}\sqrt{|\Sigma|}}\exp\left(\frac{1}{2}(\mu_{1}-\mu_{2})^{\top}\Sigma^{-1}(\mu_{1}-\mu_{2})\right)\right), (114)
=\displaystyle= 12(μ1μ2)Σ1(μ1μ2).\displaystyle\frac{1}{2}(\mu_{1}-\mu_{2})^{\top}\Sigma^{-1}(\mu_{1}-\mu_{2}). (115)

This is half of the squared Mahalanobis distance obtained for the precision matrix Σ1\Sigma^{-1}. We recover the Kullback-Leibler divergence between multivariate Gaussians (Eq. 98) when Σ1=Σ2=Σ\Sigma_{1}=\Sigma_{2}=\Sigma.

Example 14

Consider the exponential family of Rayleigh distributions:

={pλ(x)=xλ2exp(x22λ2)},\mathcal{E}=\left\{p_{\lambda}(x)=\frac{x}{\lambda^{2}}\exp\left(-\frac{x^{2}}{2\lambda^{2}}\right)\right\}, (116)

for 𝒳=[0,)\mathcal{X}=[0,\infty). Let us choose ω\omega such that (θ(λ2)θ(λ1))(t(ω)F(θ(λ1)))=0(\theta(\lambda_{2})-\theta(\lambda_{1}))^{\top}(t(\omega)-\nabla F(\theta(\lambda_{1})))=0 with θ=12λ2\theta=-\frac{1}{2\lambda^{2}}, t(x)=x2t(x)=x^{2} and F(θ(λ))=2λ2\nabla F(\theta(\lambda))=2\lambda^{2}. We choose ω2=2λ12\omega^{2}=2\lambda^{2}_{1} (i.e., ω=λ12\omega=\lambda_{1}\sqrt{2}). It follows that we have

D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}]=\log\left(\frac{p_{\lambda_{1}}\left(\lambda_{1}\sqrt{2}\right)}{p_{\lambda_{2}}\left(\lambda_{1}\sqrt{2}\right)}\right), (117)
=\log\left(\frac{\frac{\sqrt{2}\lambda_{1}}{\lambda_{1}^{2}}\exp\left(-\frac{2\lambda_{1}^{2}}{2\lambda_{1}^{2}}\right)}{\frac{\sqrt{2}\lambda_{1}}{\lambda_{2}^{2}}\exp\left(-\frac{2\lambda_{1}^{2}}{2\lambda_{2}^{2}}\right)}\right), (118)
=2\log\left(\frac{\lambda_{2}}{\lambda_{1}}\right)+\frac{\lambda_{1}^{2}}{\lambda_{2}^{2}}-1. (119)

The next example shows a limitation of the method, which we shall then overcome.

Example 15

Consider the exponential family of univariate Gaussian distributions:

={pλ(x)=12πλ2exp((xλ1)22λ2)},\mathcal{E}=\left\{p_{\lambda}(x)=\frac{1}{\sqrt{2\pi\lambda_{2}}}\exp\left(-\frac{(x-\lambda_{1})^{2}}{2\lambda_{2}}\right)\right\}, (120)

for \mathcal{X}=(-\infty,\infty). Let us choose \omega such that (\theta(\lambda_{2})-\theta(\lambda_{1}))^{\top}(t(\omega)-\nabla F(\theta(\lambda_{1})))=0. Here, \nabla F(\theta(\lambda_{1}))=(\mu_{1},\mu_{1}^{2}+\sigma_{1}^{2}) and t(\omega)=(\omega,\omega^{2}). Thus there is no single \omega\in\mathcal{X} with t(\omega)=\nabla F(\theta(\lambda_{1})), since that would require \omega=\mu_{1} and \omega^{2}=\mu_{1}^{2}+\sigma_{1}^{2}>\mu_{1}^{2}.

Let H=\{\nabla F(\theta)\ :\ \theta\in\Theta\} denote the dual moment parameter space. When the exponential family is regular, \Theta is an open convex set, and so is H (F is of Legendre type). The problem we faced in the last example is that t(\omega)\in\partial H. However, since Eq. 106 holds for any \omega\in\mathcal{X}, let us choose s values for \omega (i.e., \omega_{1},\ldots,\omega_{s}), and average Eq. 106 over these s values. We have

DKL[pλ1:pλ2]\displaystyle D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}] =\displaystyle= 1si=1slog(pλ1(ωi)pλ2(ωi))+(θ(λ2)θ(λ1))(1si=1st(ωi)F(θ(λ1))).\displaystyle\frac{1}{s}\sum_{i=1}^{s}\log\left(\frac{p_{\lambda_{1}}(\omega_{i})}{p_{\lambda_{2}}(\omega_{i})}\right)+(\theta(\lambda_{2})-\theta(\lambda_{1}))^{\top}\left(\frac{1}{s}\sum_{i=1}^{s}t(\omega_{i})-\nabla F(\theta(\lambda_{1}))\right). (121)

We can now choose the \omega_{i}'s such that \frac{1}{s}\sum_{i=1}^{s}t(\omega_{i})\in H for s>1. We need to choose s so that the system of equations \frac{1}{s}\sum_{i=1}^{s}t(\omega_{i})=\nabla F(\theta(\lambda_{1}))=E_{p_{\lambda_{1}}}[t(x)] is solvable.

Example 16

Let us continue Example 15. Consider s=2s=2. We need to find ω1\omega_{1} and ω2\omega_{2} such that we can solve the following system of equations:

{μ1=ω1+ω22,μ12+σ12=ω12+ω222.\left\{\begin{array}[]{lcl}\mu_{1}&=&\frac{\omega_{1}+\omega_{2}}{2},\\ \mu_{1}^{2}+\sigma_{1}^{2}&=&\frac{\omega_{1}^{2}+\omega_{2}^{2}}{2}.\end{array}\right. (122)

The solution of this system of equations is ω1=μ1σ1\omega_{1}=\mu_{1}-\sigma_{1} and ω2=μ1+σ1\omega_{2}=\mu_{1}+\sigma_{1}. Thus it follows that we have:

DKL[pμ1,σ1:pμ2,σ2]=12(log(pμ1,σ1(μ1σ1)pμ2,σ2(μ1σ1))+log(pμ1,σ1(μ1+σ1)pμ2,σ2(μ1+σ1))).D_{\mathrm{KL}}[p_{\mu_{1},\sigma_{1}}:p_{\mu_{2},\sigma_{2}}]=\frac{1}{2}\left(\log\left(\frac{p_{\mu_{1},\sigma_{1}}(\mu_{1}-\sigma_{1})}{p_{\mu_{2},\sigma_{2}}(\mu_{1}-\sigma_{1})}\right)+\log\left(\frac{p_{\mu_{1},\sigma_{1}}(\mu_{1}+\sigma_{1})}{p_{\mu_{2},\sigma_{2}}(\mu_{1}+\sigma_{1})}\right)\right). (123)
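The following Maxima lines (a sketch with the arbitrary parameters \mu_{1}=0, \sigma_{1}=1, \mu_{2}=1, \sigma_{2}=2) check Eq. 123 against the classical closed-form formula for the KLD between univariate normal densities:

/* Eq. 123: KLD between univariate normals as the average of two log density ratios */
p(x,m,s) := exp(-(x-m)**2/(2*s**2))/sqrt(2*%pi*s**2);
klpair(m1,s1,m2,s2) := (1/2)*(log(p(m1-s1,m1,s1)/p(m1-s1,m2,s2)) + log(p(m1+s1,m1,s1)/p(m1+s1,m2,s2)));
klclosed(m1,s1,m2,s2) := log(s2/s1) + (s1**2+(m1-m2)**2)/(2*s2**2) - 1/2;
float(klpair(0,1,1,2));      /* approximately 0.443 */
float(klclosed(0,1,1,2));    /* approximately 0.443 */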

We conclude with the following theorem extending Lemma 1:

Theorem 1

The Kullback-Leibler divergence between two densities pλ1p_{\lambda_{1}} and pλ2p_{\lambda_{2}} belonging to a full regular exponential family \mathcal{E} of order DD can be expressed as the averaged sum of logarithms of density ratios:

DKL[pλ1:pλ2]=1si=1slog(pλ1(ωi)pλ2(ωi)),D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}]=\frac{1}{s}\sum_{i=1}^{s}\log\left(\frac{p_{\lambda_{1}}(\omega_{i})}{p_{\lambda_{2}}(\omega_{i})}\right),

where ω1,,ωs\omega_{1},\ldots,\omega_{s} are sD+1s\leq D+1 distinct samples of 𝒳\mathcal{X} chosen such that 1si=1st(ωi)=Epλ1[t(x)]\frac{1}{s}\sum_{i=1}^{s}t(\omega_{i})=E_{p_{\lambda_{1}}}[t(x)].

The bound sD+1s\leq D+1 follows from Carathéodory’s theorem [9].

Notice that the Monte Carlo stochastic approximation of the Kullback-Leibler divergence:

D~KL(m)[pλ1:pλ2]=1mi=1mlogpλ1(xi)pλ2(xi),\widetilde{D}_{\mathrm{KL}}^{(m)}[p_{\lambda_{1}}:p_{\lambda_{2}}]=\frac{1}{m}\sum_{i=1}^{m}\log\frac{p_{\lambda_{1}}(x_{i})}{p_{\lambda_{2}}(x_{i})}, (124)

for mm independent and identically distributed (iid) variates x1,,xmx_{1},\ldots,x_{m} from pλ1p_{\lambda_{1}} is also an average sum of log density ratios which yields a consistent estimator, i.e.,

limmD~KL(m)[pλ1:pλ2]=DKL[pλ1:pλ2].\lim_{m\rightarrow\infty}\widetilde{D}_{\mathrm{KL}}^{(m)}[p_{\lambda_{1}}:p_{\lambda_{2}}]=D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}]. (125)

In practice, to avoid potential negative values of \widetilde{D}_{\mathrm{KL}}^{(m)}[p_{\lambda_{1}}:p_{\lambda_{2}}], we estimate the extended Kullback-Leibler divergence:

\widetilde{D}_{\mathrm{KL}_{+}}^{(m)}[p_{\lambda_{1}}:p_{\lambda_{2}}]=\frac{1}{m}\sum_{i=1}^{m}\left(\log\frac{p_{\lambda_{1}}(x_{i})}{p_{\lambda_{2}}(x_{i})}+\frac{p_{\lambda_{2}}(x_{i})}{p_{\lambda_{1}}(x_{i})}-1\right)\geq 0, (126)

with limmD~KL+(m)[pλ1:pλ2]=DKL[pλ1:pλ2]\lim_{m\rightarrow\infty}\widetilde{D}_{\mathrm{KL}_{+}}^{(m)}[p_{\lambda_{1}}:p_{\lambda_{2}}]=D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}].
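For completeness, here is a small Maxima sketch of this Monte Carlo estimator for two exponential densities, drawing the variates from p_{\lambda_{1}} by inversion sampling (the rates \lambda_{1}=1, \lambda_{2}=3 and the sample size m=10000 are arbitrary choices):

/* Monte Carlo estimate of the extended KLD between two exponential densities */
p(x,l) := l*exp(-l*x);
l1 : 1.0;  l2 : 3.0;  m : 10000;
s : makelist(-log(1-random(1.0))/l1, i, 1, m);     /* inversion sampling from p_{lambda1} */
float((1/m)*sum(log(p(s[i],l1)/p(s[i],l2)) + p(s[i],l2)/p(s[i],l1) - 1, i, 1, m));
/* fluctuates around the exact value log(1/3)+3-1, approximately 0.901 */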

Example 17

We continue Example 6 of the dd-dimensional multivariate Gaussian family. First, let us consider the subfamily of zero-centered Gaussian densities. The sufficient statistic t(x)t(x) is a d×dd\times d matrix: t(x)=xxt(x)=xx^{\top}. We find the dd column vectors ωi\omega_{i}’s from the singular value decomposition of Σ1\Sigma_{1}:

Σ1=i=1dλieiei,\Sigma_{1}=\sum_{i=1}^{d}\lambda_{i}e_{i}e_{i}^{\top}, (127)

where the λi\lambda_{i}’s are the eigenvalues and the eie_{i}’s the corresponding eigenvectors. Let ωi=dλiei\omega_{i}=\sqrt{d\lambda_{i}}e_{i} for i{1,,d}i\in\{1,\ldots,d\}. We have

1di=1dt(ωi)=Σ1=EpΣ1[xx].\frac{1}{d}\sum_{i=1}^{d}t(\omega_{i})=\Sigma_{1}=E_{p_{\Sigma_{1}}}[xx^{\top}]. (128)

It follows that we can express the KLD between two zero-centered Gaussians pΣ1p_{\Sigma_{1}} and pΣ2p_{\Sigma_{2}} as the following weighted sum of log density ratios:

DKL[pΣ1:pΣ2]=1di=1dlog(pΣ1(dλiei)pΣ2(dλiei)),D_{\mathrm{KL}}[p_{\Sigma_{1}}:p_{\Sigma_{2}}]=\frac{1}{d}\sum_{i=1}^{d}\log\left(\frac{p_{\Sigma_{1}}(\sqrt{d\lambda_{i}}e_{i})}{p_{\Sigma_{2}}(\sqrt{d\lambda_{i}}e_{i})}\right), (129)

where the λi\lambda_{i}’s are the eigenvalues of Σ1\Sigma_{1} with the eie_{i}’s the corresponding eigenvectors. Notice that the order of the family is D=d(d+1)2D=\frac{d(d+1)}{2} but we used s=dDs=d\leq D vectors ω1,,ωd\omega_{1},\ldots,\omega_{d}.

The closed-form formula for the Kullback-Leibler divergence between two zero-centered Gaussians pΣ1p_{\Sigma_{1}} and pΣ2p_{\Sigma_{2}} is

D_{\mathrm{KL}}[p_{\Sigma_{1}}:p_{\Sigma_{2}}]=\frac{1}{2}\left(\mathrm{tr}\left(\Sigma_{2}^{-1}\Sigma_{1}\right)+\log\left(\frac{|\Sigma_{2}|}{|\Sigma_{1}|}\right)-d\right). (130)

Now consider the full family of multivariate normal densities. We shall use 2d2d vectors ωi\omega_{i} as follows:

{ωi=μ1dλiei,i{1,,d}ωi=μ1+dλiei,i{d+1,,2d}\displaystyle\left\{\begin{array}[]{ll}\omega_{i}=\mu_{1}-\sqrt{d\lambda_{i}}e_{i},&i\in\{1,\ldots,d\}\\ \omega_{i}=\mu_{1}+\sqrt{d\lambda_{i}}e_{i},&i\in\{d+1,\ldots,2d\}\end{array}\right. (133)

where the \lambda_{i}'s are the eigenvalues of \Sigma_{1} and the e_{i}'s the corresponding unit eigenvectors. It can be checked that \frac{1}{2d}\sum_{i=1}^{2d}\omega_{i}=E_{p_{\lambda_{1}}}[x]=\mu_{1} and \frac{1}{2d}\sum_{i=1}^{2d}\omega_{i}\omega_{i}^{\top}=\mu_{1}\mu_{1}^{\top}+\Sigma_{1}. These points are called the "sigma points" in the unscented transform [23, 17]. Moreover, writing \sqrt{\Sigma_{1}}:=E\,\mathrm{diag}(\sqrt{\lambda_{1}},\ldots,\sqrt{\lambda_{d}}) for the square-root factor built from the eigendecomposition \Sigma_{1}=E\,\mathrm{diag}(\lambda_{1},\ldots,\lambda_{d})\,E^{\top} (so that \sqrt{\Sigma_{1}}\sqrt{\Sigma_{1}}^{\top}=\Sigma_{1}), we have \sqrt{\lambda_{i}}e_{i}=[\sqrt{\Sigma_{1}}]_{\cdot,i}, the i-th column of \sqrt{\Sigma_{1}}. Thus it follows that

DKL[pμ1,Σ1:pμ2,Σ2]\displaystyle D_{\mathrm{KL}}[p_{\mu_{1},\Sigma_{1}}:p_{\mu_{2},\Sigma_{2}}] =\displaystyle= 12di=1d(log(pμ1,Σ1(μ1dλiei)pμ2,Σ2(μ1dλiei))+log(pμ1,Σ1(μ1+dλiei)pμ2,Σ2(μ1+dλiei))),\displaystyle\frac{1}{2d}\sum_{i=1}^{d}\left(\log\left(\frac{p_{\mu_{1},\Sigma_{1}}\left(\mu_{1}-\sqrt{d\lambda_{i}}e_{i}\right)}{p_{\mu_{2},\Sigma_{2}}\left(\mu_{1}-\sqrt{d\lambda_{i}}e_{i}\right)}\right)+\log\left(\frac{p_{\mu_{1},\Sigma_{1}}\left(\mu_{1}+\sqrt{d\lambda_{i}}e_{i}\right)}{p_{\mu_{2},\Sigma_{2}}\left(\mu_{1}+\sqrt{d\lambda_{i}}e_{i}\right)}\right)\right),
=\displaystyle= 12di=1d(log(pμ1,Σ1(μ1[dΣ1],i)pμ2,Σ2(μ1[dΣ1],i))+log(pμ1,Σ1(μ1+[dΣ1],i)pμ2,Σ2(μ1+[dΣ1],i))),\displaystyle\frac{1}{2d}\sum_{i=1}^{d}\left(\log\left(\frac{p_{\mu_{1},\Sigma_{1}}\left(\mu_{1}-[\sqrt{d\Sigma_{1}}]_{\cdot,i}\right)}{p_{\mu_{2},\Sigma_{2}}\left(\mu_{1}-[\sqrt{d\Sigma_{1}}]_{\cdot,i}\right)}\right)+\log\left(\frac{p_{\mu_{1},\Sigma_{1}}\left(\mu_{1}+[\sqrt{d\Sigma_{1}}]_{\cdot,i}\right)}{p_{\mu_{2},\Sigma_{2}}\left(\mu_{1}+[\sqrt{d\Sigma_{1}}]_{\cdot,i}\right)}\right)\right),

where [\sqrt{d\Sigma_{1}}]_{\cdot,i}=\sqrt{d\lambda_{i}}\,e_{i} denotes the vector extracted from the i-th column of the square root matrix of d\Sigma_{1}.

This formula matches the ordinary formula for the Kullback-Leibler divergence between the two Gaussian densities pμ1,Σ1p_{\mu_{1},\Sigma_{1}} and pμ2,Σ2p_{\mu_{2},\Sigma_{2}}:

D_{\mathrm{KL}}[p_{\mu_{1},\Sigma_{1}}:p_{\mu_{2},\Sigma_{2}}]=\frac{1}{2}\left((\mu_{2}-\mu_{1})^{\top}\Sigma_{2}^{-1}(\mu_{2}-\mu_{1})+\mathrm{tr}\left(\Sigma_{2}^{-1}\Sigma_{1}\right)+\log\left(\frac{|\Sigma_{2}|}{|\Sigma_{1}|}\right)-d\right). (134)
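The following Maxima sketch checks Eq. 129 in dimension d=2, with \Sigma_{1} taken diagonal so that its eigenvectors are simply the standard basis vectors, and an arbitrary \Sigma_{2} (all numerical values are illustrative choices):

/* Eq. 129 for zero-centered bivariate normals (d=2), with Sigma1 diagonal */
d : 2;
S1 : matrix([4,0],[0,1]);
S2 : matrix([2,1],[1,3]);
p(x,S) := exp(-(transpose(x) . invert(S) . x)/2)/sqrt((2*%pi)**d*determinant(S));
w1 : sqrt(d*4)*matrix([1],[0]);    /* sqrt(d*lambda_1) e_1 */
w2 : sqrt(d*1)*matrix([0],[1]);    /* sqrt(d*lambda_2) e_2 */
float((1/d)*(log(p(w1,S1)/p(w1,S2)) + log(p(w2,S1)/p(w2,S2))));              /* approximately 0.512 */
MM : invert(S2) . S1;
float((1/2)*(MM[1,1]+MM[2,2] + log(determinant(S2)/determinant(S1)) - d));   /* approximately 0.512 */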
Example 18

We continue the Example 4 of the exponential family of gamma distributions which has order D=2D=2. The sufficient statistic vector is t(x)=(x,logx)t(x)=(x,\log x), and we have E[x]=αβE[x]=\frac{\alpha}{\beta} and E[logx]=ψ(α)logβE[\log x]=\psi(\alpha)-\log\beta, where ψ()\psi(\cdot) is the digamma function. We want to express DKL[pα1,β1:pα2,β2]D_{\mathrm{KL}}[p_{\alpha_{1},\beta_{1}}:p_{\alpha_{2},\beta_{2}}] as an average sum of log ratio of densities. To find the values of ω1\omega_{1} and ω2\omega_{2}, we need to solve the following system of equations:

{ω1+ω22=a,logω1+logω22=b,\left\{\begin{array}[]{lll}\frac{\omega_{1}+\omega_{2}}{2}&=&a,\\ \frac{\log\omega_{1}+\log\omega_{2}}{2}&=&b\end{array}\right., (135)

with a=α1β1a=\frac{\alpha_{1}}{\beta_{1}} and b=ψ(α1)logβ1b=\psi(\alpha_{1})-\log\beta_{1}. We find the following two solutions:

ω1=aa2exp(2b),ω2=a+a2exp(2b).\omega_{1}=a-\sqrt{a^{2}-\exp(2b)},\quad\omega_{2}=a+\sqrt{a^{2}-\exp(2b)}. (136)

We have a^{2}-\exp(2b)\geq 0 since \exp(E[\log x])\leq E[x] by Jensen's inequality (the logarithm is concave), i.e., \exp(2b)\leq a^{2}.

It follows that

D_{\mathrm{KL}}[p_{\alpha_{1},\beta_{1}}:p_{\alpha_{2},\beta_{2}}]=\frac{1}{2}\left(\log\frac{p_{\alpha_{1},\beta_{1}}(\omega_{1})}{p_{\alpha_{2},\beta_{2}}(\omega_{1})}+\log\frac{p_{\alpha_{1},\beta_{1}}(\omega_{2})}{p_{\alpha_{2},\beta_{2}}(\omega_{2})}\right). (137)

This expression of the KLD is the same as the ordinary expression of the KLD:

DKL[pα1,β1:pα2,β2]=(α1α2)ψ(α1)logΓ(α1)+logΓ(α2)+α2logβ1β2+α1β2β1β1.D_{\mathrm{KL}}[p_{\alpha_{1},\beta_{1}}:p_{\alpha_{2},\beta_{2}}]=(\alpha_{1}-\alpha_{2})\psi(\alpha_{1})-\log\Gamma(\alpha_{1})+\log\Gamma(\alpha_{2})+\alpha_{2}\log\frac{\beta_{1}}{\beta_{2}}+\alpha_{1}\frac{\beta_{2}-\beta_{1}}{\beta_{1}}. (138)
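These two roots are easily computed numerically; the following Maxima sketch (with the arbitrary parameters \alpha_{1}=2, \beta_{1}=3, \alpha_{2}=5, \beta_{2}=4, the value \alpha_{1}=2 being chosen so that \psi(\alpha_{1})=1-\gamma is known exactly to Maxima) checks the two-point formula of Eq. 137 against Eq. 138:

/* Two-point log-ratio formula for the gamma family (Eq. 137) versus Eq. 138 */
a1 : 2;  b1 : 3;  a2 : 5;  b2 : 4;
p(x,al,be) := be**al * x**(al-1) * exp(-be*x)/gamma(al);
a : a1/b1;
b : psi[0](a1) - log(b1);
w1 : a - sqrt(a**2 - exp(2*b));
w2 : a + sqrt(a**2 - exp(2*b));
float((1/2)*(log(p(w1,a1,b1)/p(w1,a2,b2)) + log(p(w2,a1,b1)/p(w2,a2,b2))));   /* approximately 1.138 */
float((a1-a2)*psi[0](a1) - log(gamma(a1)) + log(gamma(a2)) + a2*log(b1/b2) + a1*(b2-b1)/b1);   /* approximately 1.138 */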
Example 19

Consider the exponential family of Beta distributions:

={pα,β(x)=xα1(1x)β1B(α,β):α>0,β>0,x𝒳=(0,1)},\mathcal{E}=\left\{p_{\alpha,\beta}(x)=\frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}\ :\alpha>0,\beta>0,x\in\mathcal{X}=(0,1)\right\}, (139)

where

B(α,β)=Γ(α)Γ(β)Γ(α+β)B(\alpha,\beta)=\frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)} (140)

The sufficient statistic vector is t(x)=(logx,log(1x))t(x)=(\log x,\log(1-x)). We have Epα1,β1[log(x)]=ψ(α1)ψ(α1+β1)=:AE_{p_{\alpha_{1},\beta_{1}}}[\log(x)]=\psi(\alpha_{1})-\psi(\alpha_{1}+\beta_{1})=:A and Epα1,β1[log(1x)]=ψ(β1)ψ(α1+β1)=:BE_{p_{\alpha_{1},\beta_{1}}}[\log(1-x)]=\psi(\beta_{1})-\psi(\alpha_{1}+\beta_{1})=:B. We need to solve the following system of equations for s=2s=2:

{logω1+logω22=A,log(1ω1)+log(1ω2)2=B\left\{\begin{array}[]{lcl}\frac{\log\omega_{1}+\log\omega_{2}}{2}&=&A,\\ \frac{\log(1-\omega_{1})+\log(1-\omega_{2})}{2}&=&B\end{array}\right. (141)

This system is equivalent to

\left\{\begin{array}{lcl}\omega_{1}\omega_{2}&=&\exp(2A):=a,\\ (1-\omega_{1})(1-\omega_{2})&=&\exp(2B):=b\end{array}\right. (142)

We find

ω1\displaystyle\omega_{1} =\displaystyle= b+a+1Δ2,\displaystyle\frac{-b+a+1-\sqrt{\Delta}}{2}, (143)
ω2\displaystyle\omega_{2} =\displaystyle= b+a+1+Δ2,\displaystyle\frac{-b+a+1+\sqrt{\Delta}}{2}, (144)

where

Δ=b22(a+1)b+a22a+1.\Delta=b^{2}-2(a+1)b+a^{2}-2a+1. (145)

It follows that we have

DKL[pα1,β1:pα2,β2]=12(logpα1,β1(ω1)pα2,β2(ω1)+logpα1,β1(ω2)pα2,β2(ω2)).D_{\mathrm{KL}}[p_{\alpha_{1},\beta_{1}}:p_{\alpha_{2},\beta_{2}}]=\frac{1}{2}\left(\log\frac{p_{\alpha_{1},\beta_{1}}(\omega_{1})}{p_{\alpha_{2},\beta_{2}}(\omega_{1})}+\log\frac{p_{\alpha_{1},\beta_{1}}(\omega_{2})}{p_{\alpha_{2},\beta_{2}}(\omega_{2})}\right). (146)

This formula is equivalent to the ordinary formula for the KLD between two beta densities pα1,β1p_{\alpha_{1},\beta_{1}} and pα2,β2p_{\alpha_{2},\beta_{2}}:

DKL[pα1,β1:pα2,β2]=\displaystyle D_{\mathrm{KL}}[p_{\alpha_{1},\beta_{1}}:p_{\alpha_{2},\beta_{2}}]=
log(B(α2,β2)B(α1,β1))+(α1α2)ψ(α1)+(β1β2)ψ(β1)+(α2α1+β2β1)ψ(α1+β1).\displaystyle\log\left(\frac{\mathrm{B}\left(\alpha_{2},\beta_{2}\right)}{\mathrm{B}(\alpha_{1},\beta_{1})}\right)+\left(\alpha_{1}-\alpha_{2}\right)\psi(\alpha_{1})+\left(\beta_{1}-\beta_{2}\right)\psi(\beta_{1})+\left(\alpha_{2}-\alpha_{1}+\beta_{2}-\beta_{1}\right)\psi(\alpha_{1}+\beta_{1}).

Notice that the ωi\omega_{i}’s are chosen according to λ1\lambda_{1}. Thus we may express the Voronoi bisector:

Bi(pλ1,pλ2):={λ:DKL[pλ:pλ1]=DKL[pλ:pλ2]}\mathrm{Bi}(p_{\lambda_{1}},p_{\lambda_{2}}):=\left\{\lambda\ :\ D_{\mathrm{KL}}[p_{\lambda}:p_{\lambda_{1}}]=D_{\mathrm{KL}}[p_{\lambda}:p_{\lambda_{2}}]\right\} (148)

as follows:

Bi(pλ1,pλ2)={λ:1si=1slog(pλ(ωi)pλ1(ωi))=1si=1slog(pλ(ωi)pλ2(ωi))}.\mathrm{Bi}(p_{\lambda_{1}},p_{\lambda_{2}})=\left\{\lambda\ :\ \frac{1}{s}\sum_{i=1}^{s}\log\left(\frac{p_{\lambda}(\omega_{i})}{p_{\lambda_{1}}(\omega_{i})}\right)=\frac{1}{s}\sum_{i=1}^{s}\log\left(\frac{p_{\lambda}(\omega_{i})}{p_{\lambda_{2}}(\omega_{i})}\right)\right\}. (149)

In particular, when s=1s=1, the Voronoi bisector is expressed as:

Bi(pλ1,pλ2)={λ:pλ1(ω(λ))=pλ2(ω(λ))},\mathrm{Bi}(p_{\lambda_{1}},p_{\lambda_{2}})=\left\{\lambda\ :\ p_{\lambda_{1}}(\omega(\lambda))=p_{\lambda_{2}}(\omega(\lambda))\right\}, (150)

where \omega(\lambda) emphasizes the fact that \omega is a function of \lambda. The statistical Voronoi diagrams of a finite set of densities belonging to an exponential family have been studied as equivalent Bregman Voronoi diagrams in [7].

3.4 The Jensen-Shannon divergence

The Jensen-Shannon divergence [29] (JSD) is another symmetrization of the Kullback-Leibler divergence which can be given many information-theoretic interpretations [32] and which is further guaranteed to be always bounded by log2\log 2 (KLD and JD are unbounded):

DJS[p,q]\displaystyle D_{\mathrm{JS}}[p,q] :=\displaystyle:= 12(DKL[p:p+q2]+DKL[q:p+q2]),\displaystyle\frac{1}{2}\left(D_{\mathrm{KL}}\left[p:\frac{p+q}{2}\right]+D_{\mathrm{KL}}\left[q:\frac{p+q}{2}\right]\right), (151)
=\displaystyle= h(p+q2)h(p)+h(q)2.\displaystyle h\left(\frac{p+q}{2}\right)-\frac{h(p)+h(q)}{2}. (152)

In general, the JSD between densities of a parametric family does not admit a closed-form formula [40]. However, in the particular case where the mixture \frac{p_{\theta_{1}}+p_{\theta_{2}}}{2} belongs to the same parametric family of densities, we can calculate the Jensen-Shannon divergence using the entropy function as shown in Eq. 152. For example, consider a mixture family in information geometry [1]: a statistical mixture with k prescribed components p_{1}(x),\ldots,p_{k}(x) which are linearly independent (so that all mixtures of the family are identifiable by their corresponding parameters). Let m_{\lambda}(x)=\sum_{i=1}^{k}w_{i}p_{i}(x), where \lambda denotes the vector of mixture weights (w_{i}>0, \sum_{i}w_{i}=1). In that particular case (e.g., a mixture family with k prescribed Gaussian components), we get

mλ1+mλ22=mλ1+λ22.\frac{m_{\lambda_{1}}+m_{\lambda_{2}}}{2}=m_{\frac{\lambda_{1}+\lambda_{2}}{2}}. (153)

Thus the JSD for a mixture family can be expressed using the entropy as:

D_{\mathrm{JS}}[m_{\lambda_{1}},m_{\lambda_{2}}]=h\left(m_{\frac{\lambda_{1}+\lambda_{2}}{2}}\right)-\frac{h(m_{\lambda_{1}})+h(m_{\lambda_{2}})}{2}. (154)

Although we do not have a closed-form formula for the entropy of a mixture in general (except in a few cases, e.g., when the supports of the component distributions are pairwise disjoint [32]), we can use any approximation or bounding method for the entropy of a mixture to approximate or bound [40] the Jensen-Shannon divergence D_{\mathrm{JS}}.
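As an illustration, the following Maxima sketch evaluates Eq. 154 for a mixture family with two prescribed unit-variance Gaussian components centered at 0 and 4 (arbitrary choices), estimating the mixture entropies by numerical quadrature:

/* JSD between two mixtures of the same two prescribed Gaussian components */
phi(x,m) := exp(-(x-m)**2/2)/sqrt(2*%pi);
mix(x,w) := w*phi(x,0) + (1-w)*phi(x,4);
hent(w) := -first(quad_qags(mix(x,w)*log(mix(x,w)), x, -20, 30));
jsd(w1,w2) := hent((w1+w2)/2) - (hent(w1)+hent(w2))/2;
jsd(0.3, 0.8);    /* nonnegative and bounded above by log(2) */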

4 Conclusion

We have described several methods to easily recover closed-form formulas for some common (dis)similarities between densities belonging to a same exponential family \{p(x;\lambda)\}_{\lambda\in\Lambda}, which are usually expressed using the cumulant function F of the exponential family (e.g., the Kullback-Leibler divergence amounts to a reverse Bregman divergence and the Bhattacharyya distance amounts to a Jensen divergence). Our trick consists in observing that the generators F of the Bregman or Jensen divergences are defined modulo an affine term, so that we may choose F(\theta(\lambda))=-\log p(\omega;\lambda) for any \omega falling inside the support \mathcal{X}. It follows that the Bhattacharyya coefficient can be calculated with the following cumulant-free expression:

\rho[p_{\lambda_{1}},p_{\lambda_{2}}]=\frac{\sqrt{p(\omega;\lambda_{1})\,p(\omega;\lambda_{2})}}{p(\omega;\bar{\lambda})},\quad\forall\omega\in\mathcal{X}, (155)

where \bar{\lambda}=M_{\theta}(\lambda_{1},\lambda_{2}) is a generalized quasi-arithmetic mean induced by the ordinary-to-natural parameter conversion function \theta(\lambda). Thus our method requires only a partial canonical factorization of the densities of an exponential family to get \theta(\lambda). The formulas for the Bhattacharyya distance, the Hellinger distance, and the \alpha-divergences follow straightforwardly:

D_{B}[p_{\lambda_{1}},p_{\lambda_{2}}]=\log\left(\frac{p(\omega;\bar{\lambda})}{\sqrt{p(\omega;\lambda_{1})\,p(\omega;\lambda_{2})}}\right), (156)
D_{H}[p_{\lambda_{1}},p_{\lambda_{2}}]=\sqrt{1-\frac{\sqrt{p(\omega;\lambda_{1})\,p(\omega;\lambda_{2})}}{p(\omega;\bar{\lambda})}}, (157)
D_{\alpha}[p_{\lambda_{1}}:p_{\lambda_{2}}]=\frac{1}{\alpha(1-\alpha)}\left(1-\frac{p(\omega;\lambda_{1})^{\alpha}\,p(\omega;\lambda_{2})^{1-\alpha}}{p\left(\omega;M_{\theta,\alpha}(\lambda_{1},\lambda_{2})\right)}\right),\quad\alpha\in\mathbb{R}\backslash\{0,1\}. (158)

In practice, it is easy to program these formulas using legacy software APIs which offer many parametric densities in their libraries: First, we check that the distribution is an exponential family. Then we set \omega to be any point of the support \mathcal{X}, partially factorize the density in order to retrieve \theta(\lambda) and its reciprocal function \lambda(\theta), and, equipped with these functions, we implement the corresponding generalized weighted quasi-arithmetic mean M_{\theta,\alpha}(a,b)=\theta^{-1}(\alpha\theta(a)+(1-\alpha)\theta(b)) to calculate \bar{\lambda}=M_{\theta}(\lambda_{1},\lambda_{2})=M_{\theta,\frac{1}{2}}(\lambda_{1},\lambda_{2}).
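For instance, here is a minimal Maxima sketch of this recipe for the family of exponential distributions (with \omega=0, an arbitrary point of the support), which recovers the Bhattacharyya coefficient obtained by direct integration in the appendix:

/* Cumulant-free Bhattacharyya coefficient for the exponential distributions */
theta(l) := l;
thetainv(t) := t;
M(a,b,alpha) := thetainv(alpha*theta(a)+(1-alpha)*theta(b));
p(x,l) := l*exp(-l*x);
omega : 0;
rho(l1,l2) := sqrt(p(omega,l1)*p(omega,l2))/p(omega,M(l1,l2,1/2));
ratsimp(rho(l1,l2));    /* simplifies to 2*sqrt(l1*l2)/(l1+l2) */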

To calculate the Kullback-Leibler divergence (and Jeffreys’ divergence) without the explicit use of the cumulant function, we reported two methods: The first method consists in expressing the KLD as a limit of α\alpha-skew Bhattacharyya distance which writes as:

DKL[pλ1:pλ2]=log(p(ω;λ1)p(ω;λ2))+limα01αlog(p(ω;Mθ,α(λ2,λ1))p(ω,λ1)).D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}]=\log\left(\frac{p(\omega;\lambda_{1})}{p(\omega;\lambda_{2})}\right)+\lim_{\alpha\rightarrow 0}\frac{1}{\alpha}\log\left(\frac{p\left(\omega;M_{\theta,\alpha}(\lambda_{2},\lambda_{1})\right)}{p(\omega,\lambda_{1})}\right). (159)

This limit can be calculated symbolically using a computer algebra system, or approximated for a small value of α\alpha by

DKL,α[pλ1:pλ2]=log(p(ω;λ1)p(ω;λ2))+1αlog(p(ω;Mθ,α(λ2,λ1))p(ω;λ1)).D_{\mathrm{KL},\alpha}[p_{\lambda_{1}}:p_{\lambda_{2}}]=\log\left(\frac{p(\omega;\lambda_{1})}{p(\omega;\lambda_{2})}\right)+\frac{1}{\alpha}\log\left(\frac{p\left(\omega;M_{\theta,\alpha}(\lambda_{2},\lambda_{1})\right)}{p(\omega;\lambda_{1})}\right). (160)

When dealing with uni-order exponential family, we can use a first-order approximation of the weighted quasi-arithmetic mean to express the KLD as the following limit:

DKL[pλ1:pλ2]=log(p(ω;λ1)p(ω;λ2))+limα01αlog(p(ω;λ1+αθ(λ2)θ(λ1)θ(λ1))p(ω;λ1)).D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}]=\log\left(\frac{p(\omega;\lambda_{1})}{p(\omega;\lambda_{2})}\right)+\lim_{\alpha\rightarrow 0}\frac{1}{\alpha}\log\left(\frac{p\left(\omega;\lambda_{1}+\alpha\frac{\theta(\lambda_{2})-\theta(\lambda_{1})}{\theta^{\prime}(\lambda_{1})}\right)}{p(\omega;\lambda_{1})}\right). (161)

Notice that we can also estimate DKL,αD_{\mathrm{KL},\alpha}, ρα\rho_{\alpha} and related dissimilarities (e.g., when the cumulant function is intractable) using density ratio estimation techniques [48].

The second approach consists in using the entropy and moment formulas which are often available when dealing with parametric distributions. When the parametric distributions form an exponential family, the KLD is equivalent to a Legendre-Fenchel divergence, and, choosing \omega\in\mathcal{X} such that \theta(\lambda_{2})^{\top}t(\omega)+k(\omega)=0 as in \S 3.2, we write this Legendre-Fenchel divergence as:

DKL[pλ1:pλ2]=logp(ω;λ2)h(pλ1)Epλ1[k(x)]θ(λ2)Epλ1[t(x)].D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}]=-\log p(\omega;\lambda_{2})-h(p_{\lambda_{1}})-E_{p_{\lambda_{1}}}[k(x)]-\theta(\lambda_{2})^{\top}E_{p_{\lambda_{1}}}[t(x)]. (162)

It follows that the Jeffreys’ divergence is expressed as

DJ[pλ1,pλ2]=(θ(λ2)θ(λ1))(Epλ2[t(x)]Epλ1[t(x)]).D_{J}[p_{\lambda_{1}},p_{\lambda_{2}}]=\left(\theta(\lambda_{2})-\theta(\lambda_{1})\right)^{\top}(E_{p_{\lambda_{2}}}[t(x)]-E_{p_{\lambda_{1}}}[t(x)]). (163)

Finally, we proved in §3.3 that the Kullback-Leibler divergence between two densities pλ1p_{\lambda_{1}} and pλ2p_{\lambda_{2}} of an exponential family \mathcal{E} of order DD can be expressed as

DKL[pλ1:pλ2]=1si=1slog(pλ1(ωi)pλ2(ωi)),D_{\mathrm{KL}}[p_{\lambda_{1}}:p_{\lambda_{2}}]=\frac{1}{s}\sum_{i=1}^{s}\log\left(\frac{p_{\lambda_{1}}(\omega_{i})}{p_{\lambda_{2}}(\omega_{i})}\right),

where ω1,,ωs\omega_{1},\ldots,\omega_{s} are sD+1s\leq D+1 distinct samples of 𝒳\mathcal{X} chosen such that 1si=1st(ωi)=Epλ1[t(x)]\frac{1}{s}\sum_{i=1}^{s}t(\omega_{i})=E_{p_{\lambda_{1}}}[t(x)]. We illustrated how to find the ωi\omega_{i}’s for the univariate Gaussian family and the multivariate zero-centered Gaussian family.

To conclude this work, let us emphasize that we have revealed a new kind of invariance when providing closed-form formulas for common (dis)similarities between densities of an exponential family without explicitly using the cumulant function of that exponential family: For the Bhattacharyya/Hellinger/\alpha-divergences, the point \omega can be chosen arbitrarily in the support \mathcal{X}. For the Kullback-Leibler divergence, by carefully choosing a set of \omega's, we may express the Kullback-Leibler divergence as an averaged sum of log density ratios.

References

  • [1] Shun-ichi Amari. Information Geometry and Its Applications. Applied Mathematical Sciences. Springer Japan, 2016.
  • [2] Tsuyoshi Ando and Fumio Hiai. Operator log-convex functions and operator means. Mathematische Annalen, 350(3):611–630, 2011.
  • [3] Ole Barndorff-Nielsen. Information and exponential families. John Wiley & Sons, 2014.
  • [4] Christian Bauckhage. Computing the Kullback–Leibler divergence between two Weibull distributions. arXiv preprint arXiv:1310.3713, 2013.
  • [5] Anil Bhattacharyya. On a measure of divergence between two multinomial populations. Sankhyā: the indian journal of statistics, pages 401–406, 1946.
  • [6] Patrick Billingsley. Probability and Measure. Wiley, 3 edition, April 1995.
  • [7] Jean-Daniel Boissonnat, Frank Nielsen, and Richard Nock. Bregman Voronoi diagrams. Discrete & Computational Geometry, 44(2):281–307, 2010.
  • [8] Lev M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR computational mathematics and mathematical physics, 7(3):200–217, 1967.
  • [9] Constantin Carathéodory. Über den Variabilitätsbereich der Koeffizienten von Potenzreihen, die gegebene Werte nicht annehmen. Mathematische Annalen, 64(1):95–115, 1907.
  • [10] Herman Chernoff et al. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, 23(4):493–507, 1952.
  • [11] Francis Clarke. On the inverse function theorem. Pacific Journal of Mathematics, 64(1):97–102, 1976.
  • [12] Thomas M. Cover and Joy A. Thomas. Elements of information theory. John Wiley & Sons, 2012.
  • [13] Imre Csiszár. Information-type measures of difference of probability distributions and indirect observation. studia scientiarum Mathematicarum Hungarica, 2:229–318, 1967.
  • [14] Joan Del Castillo. The singly truncated normal distribution: a non-steep exponential family. Annals of the Institute of Statistical Mathematics, 46(1):57–66, 1994.
  • [15] Shinto Eguchi and Osamu Komori. Path connectedness on a space of probability density functions. In International Conference on Geometric Science of Information, pages 615–624. Springer, 2015.
  • [16] Jun Ichi Fujii. Path of quasi-means as a geodesic. Linear algebra and its applications, 434(2):542–558, 2011.
  • [17] Jacob Goldberger, Hayit K Greenspan, and Jeremie Dreyfuss. Simplifying mixture models using the unscented transform. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(8):1496–1502, 2008.
  • [18] Teofilo F Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293–306, 1985.
  • [19] Ernst Hellinger. Neue begründung der theorie quadratischer formen von unendlichvielen veränderlichen. Journal für die reine und angewandte Mathematik (Crelles Journal), 136:210–271, 1909.
  • [20] Harold Jeffreys. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences, 186(1007):453–461, 1946.
  • [21] Robert Jenssen, Jose C Principe, Deniz Erdogmus, and Torbjørn Eltoft. The Cauchy–Schwarz divergence and Parzen windowing: Connections to graph theory and Mercer kernels. Journal of the Franklin Institute, 343(6):614–629, 2006.
  • [22] Shihao Ji, Balaji Krishnapuram, and Lawrence Carin. Variational Bayes for continuous hidden Markov models and its application to active learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):522–532, 2006.
  • [23] Simon Julier, Jeffrey Uhlmann, and Hugh F Durrant-Whyte. A new method for the nonlinear transformation of means and covariances in filters and estimators. IEEE Transactions on automatic control, 45(3):477–482, 2000.
  • [24] Thomas Kailath. The divergence and Bhattacharyya distance measures in signal selection. IEEE transactions on communication technology, 15(1):52–60, 1967.
  • [25] Solomon Kullback and Richard A Leibler. On information and sufficiency. The annals of mathematical statistics, 22(1):79–86, 1951.
  • [26] Frank Nielsen. Closed-form information-theoretic divergences for statistical mixtures. In Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012), pages 1723–1726. IEEE, 2012.
  • [27] Frank Nielsen. An information-geometric characterization of Chernoff information. IEEE Signal Processing Letters, 20(3):269–272, 2013.
  • [28] Frank Nielsen. An elementary introduction to information geometry. arXiv preprint arXiv:1808.08271, 2018.
  • [29] Frank Nielsen. On the Jensen–Shannon symmetrization of distances relying on abstract means. Entropy, 21(5):485, 2019.
  • [30] Frank Nielsen. A generalization of the α\alpha-divergences based on comparable and distinct weighted means. CoRR, abs/2001.09660, 2020.
  • [31] Frank Nielsen. A note on Onicescu’s informational energy and correlation coefficient in exponential families. arXiv preprint arXiv:2003.13199, 2020.
  • [32] Frank Nielsen. On a generalization of the Jensen–Shannon divergence and the Jensen–Shannon centroid. Entropy, 22(2):221, 2020.
  • [33] Frank Nielsen and Sylvain Boltz. The Burbea-Rao and Bhattacharyya centroids. IEEE Transactions on Information Theory, 57(8):5455–5466, 2011.
  • [34] Frank Nielsen and Vincent Garcia. Statistical exponential families: A digest with flash cards. Technical report, arXiv:0911.4863, 2009.
  • [35] Frank Nielsen and Gaëtan Hadjeres. On power chi expansions of ff-divergences. arXiv preprint arXiv:1903.05818, 2019.
  • [36] Frank Nielsen and Richard Nock. Entropies and cross-entropies of exponential families. In IEEE International Conference on Image Processing, pages 3621–3624. IEEE, 2010.
  • [37] Frank Nielsen and Richard Nock. A closed-form expression for the Sharma–Mittal entropy of exponential families. Journal of Physics A: Mathematical and Theoretical, 45(3):032003, 2011.
  • [38] Frank Nielsen and Richard Nock. On the chi square and higher-order chi distances for approximating ff-divergences. IEEE Signal Processing Letters, 21(1):10–13, 2013.
  • [39] Frank Nielsen and Richard Nock. Patch matching with polynomial exponential families and projective divergences. In International Conference on Similarity Search and Applications, pages 109–116. Springer, 2016.
  • [40] Frank Nielsen and Ke Sun. Guaranteed bounds on information-theoretic measures of univariate mixtures using piecewise log-sum-exp inequalities. Entropy, 18(12):442, 2016.
  • [41] Frank Nielsen, Ke Sun, and Stéphane Marchand-Maillet. On Hölder projective divergences. Entropy, 19(3):122, 2017.
  • [42] Emilio Porcu, Jorge Mateu, and George Christakos. Quasi-arithmetic means of covariance functions with potential applications to space–time data. Journal of Multivariate Analysis, 100(8):1830–1844, 2009.
  • [43] Thomas W Rauber, Tim Braun, and Karsten Berns. Probabilistic distance measures of the Dirichlet and Beta distributions. Pattern Recognition, 41(2):637–645, 2008.
  • [44] Robert H Risch. The solution of the problem of integration in finite terms. Bulletin of the American Mathematical Society, 76(3):605–608, 1970.
  • [45] Christophe Saint-Jean and Frank Nielsen. Hartigan’s method for kk-mle: Mixture modeling with Wishart distributions and its application to motion retrieval. In Geometric theory of information, pages 301–330. Springer, 2014.
  • [46] Claude E Shannon. A mathematical theory of communication. Bell system technical journal, 27(3):379–423, 1948.
  • [47] Jana Špirková. Weighted operators based on dissimilarity function. Information Sciences, 281:172–181, 2014.
  • [48] Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density ratio estimation in machine learning. Cambridge University Press, 2012.
  • [49] John Wishart. The generalised product moment distribution in samples from a normal multivariate population. Biometrika, pages 32–52, 1928.

Closed-form formula using the Maxima computer algebra system

Since the statistical (dis)similarities rely on definite integral calculations, we may use symbolic computations to check the results. For example, below are some code snippets written in Maxima (http://maxima.sourceforge.net/). The code snippets below calculate symbolically the quasi-arithmetic means and the Bhattacharyya coefficients for several exponential families.

/* Quasi-arithmetic mean associated with the univariate Gaussian family */
ptheta(lambda):=[lambda[1]/lambda[2],-1/(2*lambda[2])];
plambda(theta):=[-theta[1]/(2*theta[2]),-1/(2*theta[2])];
ptheta(plambda([t0,t1]));
l1: [p1,p2];
l2: [q1,q2];
plambda(0.5*ptheta(l1)+0.5*ptheta(l2));
ratsimp(%);
/* end */

/* Quasi-arithmetic mean associated with the inverse Gaussian family */
ptheta(lambda):=[-lambda[2]/(2*lambda[1]**2),-lambda[2]/2];
plambda(theta):=[sqrt(theta[2]/theta[1]),-2*theta[2]];
ptheta(plambda([t0,t1]));
l1: [p1,p2];
l2: [q1,q2];
plambda(0.5*ptheta(l1)+0.5*ptheta(l2));
ratsimp(%);
/* end */

/* Exponential family of exponential distributions */
assume(lambda1>0);
assume(lambda2>0);
p(x,lambda) := lambda*exp(-lambda*x);

integrate(sqrt(p(x,lambda1)*p(x,lambda2)),x,0,inf);
ratsimp(%);
/* end */

/* Exponential family of zero-centered Gaussian densities */
assume(sigma>0);
p(x,sigma) := (1/(sigma*sqrt(2*%pi)))*exp(-x**2/(2*sigma**2));
assume(sigma1>0);
assume(sigma2>0);

integrate(sqrt(p(x,sigma1)*p(x,sigma2)),x,-inf,inf);
ratsimp(%);
/* end */

/* Exponential family of centered-Laplacian distributions */
assume(lambda1>0);
assume(lambda2>0);
p(x,lambda) := (1/(2*lambda))*exp(-abs(x)/lambda);

integrate(sqrt(p(x,lambda1)*p(x,lambda2)),x,-inf,inf);
ratsimp(%);
/* end*/

/* Exponential family of Weibull densities with prescribed shape parameter k */
declare( k , integer);
assume(k>=1);
assume(lambda1>0);
assume(lambda2>0);
p(x,lambda) := (k/lambda)*(x/lambda)**(k-1)*exp(-(x/lambda)**k);

integrate(sqrt(p(x,lambda1)*p(x,lambda2)),x,0,inf);
expand(ratsimp(%));
/* end */

/* KLD between Weibull distributions by symbolically computing the limit */
declare( k , integer);
assume(lambda1>0);
assume(lambda2>0);
k:3;
omega:1;
t(u):=u**(-k);
tp(u):=k*u**(-k-1);
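/* Here t(u)=u^(-k) plays the role of -theta(u) for the natural parameter theta(lambda)=-lambda^(-k), */
/* and tp(u)=k*u^(-k-1)=theta'(u); hence (t(l1)-t(l2))/tp(l1) in mean() equals (theta(l2)-theta(l1))/theta'(l1) as in Eq. 161. */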
p(x,l):=(k/l)*((x/l)**(k-1))*exp(-(x/l)**k);
mean(l1,l2):=l1+alpha*(t(l1)-t(l2))/tp(l1);
log(p(omega,l1)/p(omega,l2)) + (1.0/alpha)*log(p(omega,mean(l1,l2))/p(omega,l1));
limit (ratsimp(%), alpha, 0);
expand(%);
/* end */

Appendix A Further illustrating examples

The Laplacian exponential family illustrates the use of the harmonic mean H(a,b)=2aba+bH(a,b)=\frac{2ab}{a+b} for a,b>0a,b>0:

Example 20

Consider the family of zero-centered Laplacian densities [34] pλ(x)=12λexp(|x|λ)p_{\lambda}(x)=\frac{1}{2\lambda}\exp(-\frac{|x|}{\lambda}) for xx\in\mathbb{R}. We have pλ(ω)=12λp_{\lambda}(\omega)=\frac{1}{2\lambda} for ω=0\omega=0, θ(u)=1u=θ1(u)\theta(u)=-\frac{1}{u}=\theta^{-1}(u) so that λ¯=Mθ(λ1,λ2)=H(λ1,λ2)=2λ1λ2λ1+λ2\bar{\lambda}=M_{\theta}(\lambda_{1},\lambda_{2})=H(\lambda_{1},\lambda_{2})=\frac{2\lambda_{1}\lambda_{2}}{\lambda_{1}+\lambda_{2}}, where HH denotes the harmonic mean. Applying the cumulant-free formula Eq. 49, we get

ρ[pλ1,pλ2]\displaystyle\rho[p_{\lambda_{1}},p_{\lambda_{2}}] =\displaystyle= p(ω,λ1)p(ω,λ2)p(ω;λ¯),\displaystyle\frac{\sqrt{p(\omega,\lambda_{1})p(\omega,\lambda_{2})}}{p\left(\omega;\bar{\lambda}\right)},
=\frac{\sqrt{\frac{1}{2\lambda_{1}}\,\frac{1}{2\lambda_{2}}}}{\frac{1}{2}\,\frac{\lambda_{1}+\lambda_{2}}{2\lambda_{1}\lambda_{2}}},
=\displaystyle= 2λ1λ2λ1+λ2.\displaystyle\frac{2\sqrt{\lambda_{1}\lambda_{2}}}{\lambda_{1}+\lambda_{2}}.

Note that the arithmetic mean A(\lambda_{1},\lambda_{2})=\frac{\lambda_{1}+\lambda_{2}}{2} dominates the geometric mean G(\lambda_{1},\lambda_{2})=\sqrt{\lambda_{1}\lambda_{2}} (i.e., A\geq G) so that \rho[p_{\lambda_{1}},p_{\lambda_{2}}]=\frac{G(\lambda_{1},\lambda_{2})}{A(\lambda_{1},\lambda_{2})}\in(0,1]. It follows that the Bhattacharyya distance is

D_{B}[p_{\lambda_{1}},p_{\lambda_{2}}]=\log\frac{\lambda_{1}+\lambda_{2}}{2}-\log\sqrt{\lambda_{1}\lambda_{2}}.

This yields the same formula as for the family of exponential distributions.

Example 21

The univariate Gaussian distribution with source parameters \lambda=(\mu,\sigma^{2}) has density p_{\lambda}(x)=\frac{1}{\sqrt{2\pi\lambda_{2}}}\exp\left(-\frac{(x-\lambda_{1})^{2}}{2\lambda_{2}}\right). We have \theta(u)=\left(\frac{u_{1}}{u_{2}},-\frac{1}{2u_{2}}\right) and \theta^{-1}(u)=\left(-\frac{u_{1}}{2u_{2}},-\frac{1}{2u_{2}}\right), so that

Mθ(λ1,λ2)=(μ1σ22+μ2σ12σ12+σ22,2σ12σ22σ12+σ22).M_{\theta}(\lambda_{1},\lambda_{2})=\left(\frac{\mu_{1}\sigma_{2}^{2}+\mu_{2}\sigma_{1}^{2}}{\sigma_{1}^{2}+\sigma_{2}^{2}},\frac{2\sigma_{1}^{2}\sigma_{2}^{2}}{\sigma_{1}^{2}+\sigma_{2}^{2}}\right). (164)

Our next example illustrates an unusual non-separable mean derived from the inverse Gaussian family:

Example 22

Consider the family of inverse Gaussian densities:

p(x;μ,λ)=λ2πx3exp(λ(xμ)22μ2x),p(x;\mu,\lambda)=\sqrt{\frac{\lambda}{2\pi x^{3}}}\exp\left(-\frac{\lambda(x-\mu)^{2}}{2\mu^{2}x}\right),

with 𝒳=(0,)\mathcal{X}=(0,\infty), and source parameters (μ>0,λ>0)(\mu>0,\lambda>0). The inverse Gaussian densities form an exponential family with natural parameters θ(λ)=(λ22λ12,λ22)\theta(\lambda)=\left(-\frac{\lambda_{2}}{2\lambda_{1}^{2}},-\frac{\lambda_{2}}{2}\right). The inverse conversion function is θ1(θ1,θ2)=(θ2θ1,2θ2)\theta^{-1}(\theta_{1},\theta_{2})=\left(\frac{\theta_{2}}{\theta_{1}},-2\theta_{2}\right). Thus the generalized quasi-arithmetic mean induced by the multivariate θ()\theta(\cdot) is

Mθ(λ,λ)=((λ2+λ2)λ12(λ1)2λ2(λ1)2+λ2λ12,λ2+λ22).M_{\theta}(\lambda,\lambda^{\prime})=\left(\sqrt{\frac{(\lambda_{2}+\lambda_{2}^{\prime})\lambda_{1}^{2}(\lambda_{1}^{\prime})^{2}}{\lambda_{2}(\lambda_{1}^{\prime})^{2}+\lambda_{2}^{\prime}\lambda_{1}^{2}}},\frac{\lambda_{2}+\lambda_{2}^{\prime}}{2}\right).

The calculation can be checked using a computer algebra system (detailed in Appendix Closed-form formula using the Maxima computer algebra system). We choose ω=1\omega=1.

Next, we consider the case of the exponential family of central Wishart densities [45].

Example 23

The density of a Wishart distribution [49] with positive-definite scale matrix S0S\succ 0 and number of degrees of freedom ndn\geq d (matrix generalization of the chi-squared distribution) is:

p(x;λ=(λs=n,λM=S))=|X|nd12e12tr(S1X)2nd2|S|n2Γd(n2),p(x;\lambda=(\lambda_{s}=n,\lambda_{M}=S))=\frac{|X|^{\frac{n-d-1}{2}}e^{-\frac{1}{2}\operatorname{tr}\left(S^{-1}X\right)}}{2^{\frac{nd}{2}}|S|^{\frac{n}{2}}\Gamma_{d}\left(\frac{n}{2}\right)}, (165)

where Γd\Gamma_{d} denotes the multivariate Gamma function:

Γd(x):=πd(d1)/4j=1dΓ(x+(1j)/2).\Gamma_{d}(x):=\pi^{d(d-1)/4}\prod_{j=1}^{d}\Gamma(x+(1-j)/2). (166)

The sample space x𝒳x\in\mathcal{X} is the set of d×dd\times d positive-definite matrices. The natural parameter consists of a scalar part θs\theta_{s} and a matrix part θM\theta_{M}:

θ(λ)=(θs,θM)=(λsd12,12λM1).\theta(\lambda)=\left(\theta_{s},\theta_{M}\right)=\left(\frac{\lambda_{s}-d-1}{2},-\frac{1}{2}\lambda_{M}^{-1}\right). (167)

The inverse conversion natural-to-ordinary function is

\lambda(\theta)=\left(2\theta_{s}+d+1,\,-\frac{1}{2}\theta_{M}^{-1}\right). (168)

It follows that the quasi-arithmetic mean is the following separable scalar-matrix quasi-arithmetic mean (an arithmetic mean on the scalar part and a matrix harmonic mean on the matrix part):

Mθ((n1,S1),(n2,S2))=(n1+n22,(S11+S212)1).M_{\theta}((n_{1},S_{1}),(n_{2},S_{2}))=\left(\frac{n_{1}+n_{2}}{2},\left(\frac{S_{1}^{-1}+S_{2}^{-1}}{2}\right)^{-1}\right). (169)

We choose ω=I\omega=I (the identity matrix) so that

p(I;S,n)=exp(12tr(S1))2nd2|S|n2Γd(n2).p(I;S,n)=\frac{\exp\left(-\frac{1}{2}\mathrm{tr}(S^{-1})\right)}{2^{\frac{nd}{2}}|S|^{\frac{n}{2}}\Gamma_{d}\left(\frac{n}{2}\right)}. (170)

The sufficient statistics is t(x)=(log|X|,12X)t(x)=(\log|X|,-\frac{1}{2}X). It follows that η=F(θ)=E[t(x)]=(log|S12|+i=1dψ(n+1i2),12nS)\eta=\nabla F(\theta)=E[t(x)]=(-\log|\frac{S^{-1}}{2}|+\sum_{i=1}^{d}\psi(\frac{n+1-i}{2}),-\frac{1}{2}nS) (see [22]), where ψd\psi_{d} is the multivariate digamma function (the derivative of the logarithm of the multivariate gamma function). The differential entropy is:

h(pn,S)=d+12ln|S|+12d(d+1)ln2+lnΓd(n2)nd12ψd(n2)+nd2.h(p_{n,S})=\frac{d+1}{2}\ln|S|+\frac{1}{2}d(d+1)\ln 2+\ln\Gamma_{d}\left(\frac{n}{2}\right)-\frac{n-d-1}{2}\psi_{d}\left(\frac{n}{2}\right)+\frac{nd}{2}.

It follows that the Kullback-Leibler divergence between two central Wishart densities is:

D_{\mathrm{KL}}\left(p_{n_{1},S_{1}}:p_{n_{2},S_{2}}\right)=-\log\left(\frac{\Gamma_{d}\left(\frac{n_{1}}{2}\right)}{\Gamma_{d}\left(\frac{n_{2}}{2}\right)}\right)+\left(\frac{n_{1}-n_{2}}{2}\right)\psi_{d}\left(\frac{n_{1}}{2}\right)-\frac{n_{2}}{2}\log\frac{\left|S_{1}\right|}{\left|S_{2}\right|}+\frac{n_{1}}{2}\left(\operatorname{tr}\left(S_{2}^{-1}S_{1}\right)-d\right). (171)

Our last example exhibits an unusual mean when dealing with the Bernoulli family:

Example 24

Consider the Bernoulli family with density p_{\lambda}(x)=\lambda^{x}(1-\lambda)^{1-x} where \mathcal{X}=\{0,1\} and \lambda denotes the probability of success. The Bernoulli family is a discrete exponential family. We have \theta(u)=\log\frac{u}{1-u} and \theta^{-1}(u)=\frac{e^{u}}{1+e^{u}}. The associated quasi-arithmetic mean is

M_{\theta}(\lambda_{1},\lambda_{2})=\frac{\sqrt{\lambda_{1}\lambda_{2}}}{\sqrt{\lambda_{1}\lambda_{2}}+\sqrt{(1-\lambda_{1})(1-\lambda_{2})}}. (172)
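The cumulant-free Bhattacharyya coefficient can be checked on this discrete family with a short Maxima sketch (the success probabilities 0.2 and 0.7 are arbitrary choices), taking \omega=0 so that p(0;\lambda)=1-\lambda and comparing with the direct sum over the support \{0,1\}:

/* Bernoulli family: cumulant-free Bhattacharyya coefficient with omega=0 */
theta(u) := log(u/(1-u));
thetainv(u) := exp(u)/(1+exp(u));
M(a,b) := thetainv((theta(a)+theta(b))/2);
p(x,l) := l**x * (1-l)**(1-x);
rho(l1,l2) := sqrt(p(0,l1)*p(0,l2))/p(0,M(l1,l2));
float(rho(0.2, 0.7));                    /* approximately 0.864 */
float(sqrt(0.2*0.7) + sqrt(0.8*0.3));    /* approximately 0.864 (direct sum over {0,1}) */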