
Ionas Erb
Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; email: [email protected]

Nihat Ay
Max-Planck Institute for Mathematics in the Sciences, Leipzig, Germany;
Department of Mathematics and Computer Science, Leipzig University, Leipzig, Germany;
Santa Fe Institute, Santa Fe, NM, USA; email: [email protected]

The Information-Geometric Perspective of Compositional Data Analysis

Ionas Erb and Nihat Ay
Abstract

Information geometry uses the formal tools of differential geometry to describe the space of probability distributions as a Riemannian manifold with an additional dual structure. The formal equivalence of compositional data with discrete probability distributions makes it possible to apply the same description to the sample space of Compositional Data Analysis (CoDA). The latter has been formally described as a Euclidean space with an orthonormal basis featuring components that are suitable combinations of the original parts. In contrast to the Euclidean metric, the information-geometric description singles out the Fisher information metric as the only one keeping the manifold's geometric structure invariant under equivalent representations of the underlying random variables. Well-known concepts that are valid in Euclidean coordinates, e.g., the Pythagorean theorem, are generalized by information geometry to corresponding notions that hold for more general coordinates. In briefly reviewing Euclidean CoDA and, in more detail, the information-geometric approach, we show how the latter justifies the use of distance measures and divergences that so far have received little attention in CoDA as they do not fit the Euclidean geometry favoured by current thinking. We also show how Shannon entropy and relative entropy can describe amalgamations in a simple way, while Aitchison distance requires the use of geometric means to obtain more succinct relationships. We proceed to prove the information monotonicity property for Aitchison distance. We close with some thoughts about new directions in CoDA where the rich structure that is provided by information geometry could be exploited.

1 Introduction

Information geometry and Compositional Data Analysis (CoDA) are fields that have ignored each other so far. Independently, both have found powerful descriptions that led to a deeper understanding of the geometric relationships between their respective objects of interest: probability distributions and compositional data. Although both of these live on the same mathematical space (the simplex), and some of the mathematical structures are identically described, surprisingly, both fields have come to focus on quite different geometric aspects. On the one hand, the tools of differential geometry have revealed the underlying duality of the manifold of probability distributions, with the Fisher information metric playing a central role. On the other hand, the classical log-ratio approach led to the modern description of the compositional sample space as Euclidean and affine. We think it is time that CoDA starts to profit from the rich structures that information geometry has to offer. This paper intends to build some bridges from information geometry to CoDA. In the first section, we will give a brief description of the Euclidean CoDA perspective. The second, and main, part of the paper describes in some detail the approach centred around the Fisher metric, with a description of the dual coordinates, exponential families and of how dual divergence functions generalize the notion of Euclidean distance. To ease understanding, throughout this section we link those concepts to the ones familiar in CoDA. In the third part of the paper, we show how information-based measures can lead to simpler expressions when amalgamations of parts are involved, and an important monotonicity result that holds for relative entropy is derived for Aitchison distance. We conclude with a short discussion about where we could go from here.

2 The Euclidean CoDA perspective

Compositional data analysis is now unthinkable without the log-ratio approach pioneered by Aitchison AitchisonBook . It has led both to a variety of data-analytic developments (for the most recent review, see green ) and to more formal mathematical descriptions (see sampleSpace ). Following these, compositions can be described as equivalence classes whose representatives are points in a Euclidean space. We will give a brief recount here for completeness of exposition. Compositional data are defined as vectors of strictly positive numbers describing the parts of a whole for which the relevant information is only relative. As such, the absolute size of a $D$-part composition $\boldsymbol{x}\in\mathbb{R}^{D}_{+}$ is irrelevant, and all the information is conveyed by the ratios of its components. This can further be formalized by considering equivalent compositions $\boldsymbol{x}$, $\boldsymbol{y}$ such that $\boldsymbol{y}=c\boldsymbol{x}$ for a positive constant $c$. A composition is then an equivalence class of such proportional vectors mathCoDA . A closed composition is the simplicial representative $\mathcal{C}\boldsymbol{x}:=\boldsymbol{x}/\sum_{i}x_{i}$, where the symbol $\mathcal{C}$ denotes the closure operation (i.e., the division by the sum over the components). Closed compositions are elements of the simplex

\mathcal{S}^{D}:=\left\{(x_{1},\dots,x_{D})^{T}\in\mathbb{R}^{D}: x_{i}>0,\ i=1,\dots,D,\ \sum_{i=1}^{D}x_{i}=1\right\},   (1)

where $T$ denotes transposition. The simplex $\mathcal{S}^{D}$ can be equipped with a Euclidean structure by the vector space operations of perturbation and powering (playing the role of vector addition and scalar multiplication in real vector spaces), defined by

๐’™โŠ•๐’š\displaystyle\boldsymbol{x}\oplus\boldsymbol{y} :=\displaystyle:= ๐’žโ€‹(x1โ€‹y1,โ€ฆ,xDโ€‹yD)T,\displaystyle\mathcal{C}(x_{1}y_{1},\dots,x_{D}y_{D})^{T}, (2)
ฮฑโŠ™๐’™\displaystyle\alpha\odot\boldsymbol{x} :=\displaystyle:= ๐’žโ€‹(x1ฮฑ,โ€ฆ,xDฮฑ)T.\displaystyle\mathcal{C}(x_{1}^{\alpha},\dots,x_{D}^{\alpha})^{T}. (3)

An inverse perturbation is given by $\ominus\boldsymbol{x}:=(-1)\odot\boldsymbol{x}$.
As a vector space, $\mathcal{S}^{D}$ also carries the structure of an affine space, and we can study affine subspaces, which are referred to as linear manifolds in simplicialGeometry . In order to do so, we require a set of vectors $\boldsymbol{x}_{1},\dots,\boldsymbol{x}_{m}$, which we assume to be linearly independent, and an origin point $\boldsymbol{x}_{0}$. Here, independence means the following: Let $\boldsymbol{n}=\mathcal{C}(1,\dots,1)$ be the neutral element. A set of $m$ compositions $\boldsymbol{x}_{1},\dots,\boldsymbol{x}_{m}\in\mathcal{S}^{D}$ is called perturbation-independent if $\boldsymbol{n}=\bigoplus_{i=1}^{m}(\alpha_{i}\odot\boldsymbol{x}_{i})$ implies $\alpha_{1}=\dots=\alpha_{m}=0$. With this, an affine subspace is given as the set of compositions $\boldsymbol{y}$ such that

๐’š=๐’™0โŠ•โจi=1m(ฮฑiโŠ™๐’™i)\boldsymbol{y}=\boldsymbol{x}_{0}\oplus\bigoplus_{i=1}^{m}(\alpha_{i}\odot\boldsymbol{x}_{i}) (4)

for any real constants $\alpha_{i}$, $i=1,\dots,m$. Due to the linear independence of the vectors $\boldsymbol{x}_{1},\dots,\boldsymbol{x}_{m}$, this is an $m$-dimensional space.
It is convenient to define the inner product for our Euclidean space via the so-called centred log-ratio transformation. Its definition logratioTrans and inverse operation are

\boldsymbol{v} = \mathrm{clr}(\boldsymbol{x}) := \left(\log\frac{x_{1}}{g(\boldsymbol{x})},\dots,\log\frac{x_{D}}{g(\boldsymbol{x})}\right)^{T},   (5)
\boldsymbol{x} = \mathrm{clr}^{-1}(\boldsymbol{v}) = \mathcal{C}\exp(\boldsymbol{v}),   (6)

where $g$ denotes the geometric mean $g(\boldsymbol{x})=\left(\prod_{i=1}^{D}x_{i}\right)^{\frac{1}{D}}$. Note that the sum over the components $\mathrm{clr}_{i}$ of clr-transformed vectors is 0. An inner product can then be defined by

โŸจ๐’™,๐’šโŸฉA:=โˆ‘i=1Dclriโ€‹(๐’™)โ€‹clriโ€‹(๐’š).\left<\boldsymbol{x},\boldsymbol{y}\right>_{A}:=\sum_{i=1}^{D}\mathrm{clr}_{i}(\boldsymbol{x})\mathrm{clr}_{i}(\boldsymbol{y}). (7)

The corresponding (squared) norm and distance are

โˆฅ๐’™โˆฅA2=โŸจ๐’™,๐’™โŸฉA,dA2โ€‹(๐’™,๐’š)=โˆฅ๐’™โŠ–๐’šโˆฅA2.\left\lVert\boldsymbol{x}\right\rVert^{2}_{A}=\left<\boldsymbol{x},\boldsymbol{x}\right>_{A},~{}~{}~{}~{}~{}d^{2}_{A}(\boldsymbol{x},\boldsymbol{y})=\left\lVert\boldsymbol{x}\ominus\boldsymbol{y}\right\rVert^{2}_{A}. (8)

This distance is known as the Aitchison distance (denoted by the subscript $A$). Note that the clr-transformation is an isometry $\mathcal{S}^{D}\rightarrow\mathcal{T}^{D}\subset\mathbb{R}^{D}$ between $(D-1)$-dimensional Euclidean spaces, where

\mathcal{T}^{D}:=\left\{\boldsymbol{v}\in\mathbb{R}^{D}:\sum_{i=1}^{D}v_{i}=0\right\}.   (9)
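
For illustration, the following minimal numerical sketch (Python with NumPy; the helper functions are ours and not part of any CoDA package) implements closure, perturbation, powering, the clr transform and the Aitchison distance, and checks that $d^{2}_{A}(\boldsymbol{x},\boldsymbol{y})=\lVert\boldsymbol{x}\ominus\boldsymbol{y}\rVert^{2}_{A}$:

import numpy as np

def closure(x):
    """Closure operation C: divide by the sum over the components."""
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def perturb(x, y):
    """Perturbation x (+) y of Eq. (2)."""
    return closure(x * y)

def power(alpha, x):
    """Powering alpha (.) x of Eq. (3)."""
    return closure(x ** alpha)

def clr(x):
    """Centred log-ratio transform of Eq. (5); the result sums to zero."""
    lx = np.log(x)
    return lx - lx.mean()

def aitchison_dist2(x, y):
    """Squared Aitchison distance of Eq. (8), computed in clr coordinates."""
    return np.sum((clr(x) - clr(y)) ** 2)

x = closure([1.0, 2.0, 3.0])
y = closure([2.0, 1.0, 1.0])
# d_A^2(x, y) equals the squared Aitchison norm of x (-) y
print(np.isclose(aitchison_dist2(x, y),
                 np.sum(clr(perturb(x, power(-1.0, y))) ** 2)))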

We can thus obtain orthonormal bases in $\mathcal{S}^{D}$ from orthonormal bases in $\mathcal{T}^{D}$. Such orthonormal basis vectors are defined by the columns $\boldsymbol{v}_{i}$, $i=1,\dots,D-1$, of $\mathbf{V}$, a matrix of order $D\times(D-1)$ obeying

\mathbf{V}^{T}\mathbf{V} = \mathbf{I}_{D-1},   (10)
\mathbf{V}\mathbf{V}^{T} = \mathbf{I}_{D}-\frac{1}{D}\mathbf{1}\mathbf{1}^{T},   (11)

with ๐ˆD\mathbf{I}_{D} the Dร—DD\times D identity matrix and ๐Ÿ\mathbf{1} a DD-dimensional vector where each component is 1. The first equation ensures orthonormality, the second equation makes sure components sum to zero. Now the vectors

\boldsymbol{e}_{i}=\mathrm{clr}^{-1}(\boldsymbol{v}_{i})=\mathcal{C}\exp(\boldsymbol{v}_{i})   (12)

constitute an orthonormal basis in $\mathcal{S}^{D}$. Euclidean coordinates $\boldsymbol{z}$ of $\boldsymbol{x}\in\mathcal{S}^{D}$ with respect to the basis $\boldsymbol{e}_{i}$, $i=1,\dots,D-1$, then follow from the so-called isometric log-ratio transformation ilr . Its definition and inverse operation are

\boldsymbol{z}=\mathrm{ilr}(\boldsymbol{x}) := \mathbf{V}^{T}\log(\boldsymbol{x}),   (13)
\boldsymbol{x}=\mathrm{ilr}^{-1}(\boldsymbol{z}) = \mathcal{C}\exp(\mathbf{V}\boldsymbol{z}).   (14)

The second equation shows the composition as generated from the coordinates $\boldsymbol{z}$. This can also be written in the usual component form using the basis vectors of Eq. (12):

\boldsymbol{x}=\bigoplus_{i=1}^{D-1}(z_{i}\odot\boldsymbol{e}_{i}).   (15)

Equations (10) and (11) characterize any orthonormal basis of $\mathcal{S}^{D}$. The matrix $\mathbf{V}$ is known as a contrast matrix, and many choices for it are possible. Balance coordinates balances are popular for their relative simplicity and interpretability, while pivot coordinates pivot are sometimes preferred.
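
As an illustration (a sketch, not the only possible construction), a contrast matrix satisfying Eqs. (10) and (11) can be obtained by orthogonalizing any basis of the zero-sum subspace; the helper names below are ours:

import numpy as np

def closure(x):
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def contrast_matrix(D):
    """One possible D x (D-1) contrast matrix V: QR-orthogonalize columns
    that already sum to zero, so that Eqs. (10) and (11) hold."""
    A = np.eye(D)[:, :D - 1] - 1.0 / D   # columns e_i - (1/D) 1 span T^D
    Q, _ = np.linalg.qr(A)
    return Q

def ilr(x, V):
    """Isometric log-ratio transform, Eq. (13)."""
    return V.T @ np.log(x)

def ilr_inv(z, V):
    """Inverse ilr transform, Eq. (14)."""
    return closure(np.exp(V @ z))

D = 4
V = contrast_matrix(D)
print(np.allclose(V.T @ V, np.eye(D - 1)))                    # Eq. (10)
print(np.allclose(V @ V.T, np.eye(D) - np.ones((D, D)) / D))  # Eq. (11)
x = closure([1.0, 2.0, 3.0, 4.0])
print(np.allclose(ilr_inv(ilr(x, V), V), x))                  # Eq. (14) inverts Eq. (13)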

3 The information-geometric perspective

Information geometry started out as a study of the geometry of statistical estimation. The set of probability distributions that constitute a statistical model is seen as a manifold whose invariant geometric structures Chentsov are studied in relation to the statistical estimation using this model. A Riemannian metric and a family of affine connections are naturally introduced on such a manifold Rao . It turns out that a fundamental duality underlies these structures AmariNagaoka . While the first ideas about the geometry of statistical estimation are from the first half of the 20th century and go back to a variety of authors, a first attempt at a unified exposition of the topic by Amari lectureNotes was published around a similar time as Aitchison's book.

Here we try to highlight some of the main concepts, borrowing from chapter two in Nihat , but mainly following the treatment in Amari . The latter also showcases the many applications information geometry has found during the last decades. While the best known book on the topic may be AmariNagaoka , the most formal and complete exposition can currently be found in Nihat .
The ideas that require more advanced notions from differential geometry will not be touched upon in our short outline.

3.1 Dual coordinates and Fisher metric in finite information geometry

To exploit the equivalence of probability distributions with compositional data, we only consider the manifold of discrete distributions $\mathcal{S}^{D}$. To emphasize the equivalence, we will denote the probabilities by the same symbol as our compositions, i.e., $\boldsymbol{x}\in\mathcal{S}^{D}$ is a vector of $D$ probabilities. To complete the probabilistic picture, we need a random variable $R$ which can take values $r\in\{1,\dots,D\}$ with the respective probabilities $x_{r}=\mathbb{P}\{R=r\}$, the coordinates of $\boldsymbol{x}$. The distribution of this random variable can now be written as

pR=๐’™=(x1,โ€ฆ,xD)Tโˆˆ๐’ฎD.p_{R}=\boldsymbol{x}=(x_{1},\dots,x_{D})^{T}\in\mathcal{S}^{D}. (16)

From the information-geometric perspective, there are two natural ways to parametrize the set $\mathcal{S}^{D}$ of all (strictly positive) distributions of $R$. The first possibility is quite obvious. The first $D-1$ probabilities in Eq. (16), that is $x_{1},\dots,x_{D-1}$, are free to be specified and can be considered parameters, which we denote by $\boldsymbol{\eta}=(x_{1},\dots,x_{D-1})^{T}$. In these coordinates, probability distributions are written as

p(r;\boldsymbol{\eta})=\begin{cases}\eta_{r}, & \text{if } r\leq D-1\\ 1-\sum_{i=1}^{D-1}\eta_{i}, & \text{if } r=D,\end{cases}\qquad r=1,\dots,D.   (17)

Alternatively, our distribution can be parametrized using what is known in CoDA as the alr-transformation logratioTrans :

\theta^{i}=\log\frac{x_{i}}{x_{D}},\qquad i=1,\dots,D-1.   (18)

With this, we can write our distribution in the form

p(r;\boldsymbol{\theta})=\exp\left(\sum_{k=1}^{D-1}\theta^{k}\mathbbm{1}_{k}(r)-\psi(\boldsymbol{\theta})\right),\qquad r=1,\dots,D,   (19)

where $\mathbbm{1}_{k}(r)=1$ if $r=k$, and $\mathbbm{1}_{k}(r)=0$ otherwise. The function $\psi$ ensures normalization, that is

\psi(\boldsymbol{\theta})=\log\left(1+\sum_{i=1}^{D-1}e^{\theta^{i}}\right)=-\log x_{D}.   (20)
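
A small numerical sketch (our own helper code, Python/NumPy) of the parametrization in Eqs. (18)-(20): starting from a composition $\boldsymbol{x}$, the $\boldsymbol{\theta}$ coordinates and the normalizer $\psi$ recover the original probabilities, and $\psi(\boldsymbol{\theta})=-\log x_{D}$:

import numpy as np

x = np.array([0.1, 0.2, 0.3, 0.4])           # a point in S^D with D = 4
theta = np.log(x[:-1] / x[-1])               # alr coordinates, Eq. (18)
psi = np.log1p(np.exp(theta).sum())          # normalizer, Eq. (20)

# Eq. (19): p(r; theta) = exp(theta_r - psi) for r < D and exp(-psi) for r = D
p = np.exp(np.append(theta, 0.0) - psi)
print(np.allclose(p, x))                     # the parametrization recovers x
print(np.isclose(psi, -np.log(x[-1])))       # psi(theta) = -log x_D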

The parametrization of Eq. (19) in terms of $\boldsymbol{\theta}\in\mathbb{R}^{D-1}$ can be used in order to define a linear structure on $\mathcal{S}^{D}$. The addition of two distributions $p(\cdot;\boldsymbol{\theta}_{1})$ and $p(\cdot;\boldsymbol{\theta}_{2})$ can simply be defined by taking their product and then normalizing. With this vector addition, denoted by $\oplus$, we obviously have

p(r;\boldsymbol{\theta}_{1})\oplus p(r;\boldsymbol{\theta}_{2}) = p(r;\boldsymbol{\theta}_{1}+\boldsymbol{\theta}_{2}).   (21)

Thus, the operation $\oplus$ is consistent with the usual addition in the parameter space $\mathbb{R}^{D-1}$. The multiplication of a distribution $p(\cdot;\boldsymbol{\theta})$ with a scalar $\alpha\in\mathbb{R}$ can be defined correspondingly by raising it to the power $\alpha$ and then normalizing. This defines a scalar multiplication $\odot$, and we have

\alpha\odot p(r;\boldsymbol{\theta}) = p(r;\alpha\cdot\boldsymbol{\theta}).   (22)

Obviously, the scalar multiplication $\odot$ is consistent with the usual multiplication in the parameter space. Note that the vector space structure defined here, which is well known in information geometry, coincides with the structure defined by equations (2) and (3). Given the linear structure, we can consider affine subspaces of $\mathcal{S}^{D}$. These are well-known and fundamental families in statistics, statistical physics, and information geometry, the so-called exponential families. We basically obtain them from the representation of Eq. (19) if we replace the functions $\mathbbm{1}_{k}$ by $d$ ($d\leq D-1$) functions $X_{k}:\{1,\dots,D\}\to\mathbb{R}$, and shift the whole family by some reference measure $p_{0}$:

p(r;\boldsymbol{\theta})=p_{0}(r)\,\exp\left(\sum_{k=1}^{d}\theta^{k}X_{k}(r)-\psi(\boldsymbol{\theta})\right),\qquad r=1,\dots,D.   (23)

Here, $\psi(\boldsymbol{\theta})$ ensures normalization, but it does not reduce to the simple structure of Eq. (20) in general. In statistical physics, the function $\psi$ is known as the free energy (in other contexts it is also known as the cumulant-generating function). Note that, given the same linear structure on $\mathcal{S}^{D}$, the exponential families coincide with the linear manifolds of Eq. (4), which were introduced into the field of Compositional Data Analysis more recently.

In what follows we restrict attention to the parametrizations of equations (17) and (19) of the full simplex $\mathcal{S}^{D}$ as one instance of the general structure that underlies information geometry. The function $\psi$, given by Eq. (20), is a convex function, and we can get back the $\boldsymbol{\eta}$ coordinates from it via a Legendre transformation AmariNagaoka

\boldsymbol{\eta}=\nabla\psi(\boldsymbol{\theta}),   (24)

where $\nabla$ denotes the vector of partial derivatives $(\partial\psi/\partial\theta^{i})_{i=1}^{D-1}$. The Legendre dual of $\psi(\boldsymbol{\theta})$ is another convex function defined by

\phi(\boldsymbol{\eta})=\max_{\boldsymbol{\theta}}\left\{\boldsymbol{\theta}\cdot\boldsymbol{\eta}-\psi(\boldsymbol{\theta})\right\},   (25)

which is given by the negative Shannon entropy

\phi(\boldsymbol{\eta})=\sum_{i=1}^{D-1}\eta_{i}\log\eta_{i}+\left(1-\sum_{i=1}^{D-1}\eta_{i}\right)\log\left(1-\sum_{i=1}^{D-1}\eta_{i}\right).   (26)

In analogy to Eq. (24), from $\phi(\boldsymbol{\eta})$ we can get back the $\boldsymbol{\theta}$ coordinates:

\boldsymbol{\theta}=\nabla\phi(\boldsymbol{\eta}).   (27)
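
The Legendre relations (24)-(27) are easy to check numerically; the sketch below (Python/NumPy, with a hand-rolled finite-difference gradient rather than a library routine) verifies that the gradient of $\psi$ returns $\boldsymbol{\eta}$ and the gradient of $\phi$ returns $\boldsymbol{\theta}$:

import numpy as np

def psi(theta):
    """Free energy of Eq. (20)."""
    return np.log1p(np.exp(theta).sum())

def phi(eta):
    """Negative Shannon entropy of Eq. (26)."""
    xD = 1.0 - eta.sum()
    return np.sum(eta * np.log(eta)) + xD * np.log(xD)

def num_grad(f, v, h=1e-6):
    """Central finite-difference gradient."""
    g = np.zeros_like(v)
    for i in range(v.size):
        e = np.zeros_like(v)
        e[i] = h
        g[i] = (f(v + e) - f(v - e)) / (2 * h)
    return g

x = np.array([0.1, 0.2, 0.3, 0.4])
theta, eta = np.log(x[:-1] / x[-1]), x[:-1]
print(np.allclose(num_grad(psi, theta), eta, atol=1e-6))    # Eq. (24)
print(np.allclose(num_grad(phi, eta), theta, atol=1e-4))    # Eq. (27)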

There is thus a fundamental duality mediated by the Legendre transformation which links the two types of parameters $\boldsymbol{\eta}$ and $\boldsymbol{\theta}$ as well as the convex functions $\psi$ and $\phi$. Legendre transformations are well known to play a fundamental role in phenomenological thermodynamics. Their importance for information geometry was established by Amari and Nagaoka AmariNagaoka .
In what follows, we use the parameters $\boldsymbol{\theta}$. Like $\boldsymbol{\eta}$, they define a (local) coordinate system of the manifold $\mathcal{S}^{D}$. From each point $\boldsymbol{\theta}$, $D-1$ coordinate curves $\gamma_{i}:t\mapsto\gamma_{i}(t)$ in $\mathcal{S}^{D}$, $\gamma_{i}(0)=p_{\boldsymbol{\theta}}$, emerge when holding $D-2$ of the $\theta^{i}$ constant. Consider their velocities

\mathrm{e}_{i}:=\left.\frac{d}{dt}\gamma_{i}(t)\right|_{t=0},\qquad i=1,\dots,D-1.   (28)

At the point $\boldsymbol{\theta}$ itself, the $D-1$ vectors $\mathrm{e}_{i}$ pointing in the direction of each coordinate curve form a basis of the so-called tangent space $T_{\boldsymbol{\theta}}\mathcal{S}^{D}=\mathcal{T}^{D}$ at this point, see Figure 1a). Similarly, we can define vectors $\mathrm{e}^{i}$ with respect to the coordinates $\boldsymbol{\eta}$, which also span the tangent space.

Figure 1: a) A manifold (in black) and its tangent space at the point $p_{\boldsymbol{\theta}}$ (in red), where two coordinate curves (dashed lines) cross. The velocities of the coordinate curves are the basis vectors $\mathrm{e}_{1}$ and $\mathrm{e}_{2}$. b) The exponential map and its inverse via the tangent space anchored at a reference point $\boldsymbol{n}$. The points $\boldsymbol{x}$ and $\boldsymbol{y}$ are mapped via $\mathrm{exp}_{\boldsymbol{n}}^{-1}$ to the tangent space. The difference vector between the two mapped points is $\boldsymbol{v}=\mathrm{vec}(\boldsymbol{x},\boldsymbol{y})$.

For a Riemannian manifold, a metric tensor $\mathbf{G}$ is defined. This metric is obtained via an inner product on the tangent space:

g_{ij}=\left<\mathrm{e}_{i},\mathrm{e}_{j}\right>,   (29)

which depends on the coordinates chosen. The coordinate system is Euclidean if $g_{ij}=\delta_{ij}$ (this is the case for the parametrization achieved by Eq. (13)). For the Riemannian metric of exponential families, the basis vectors can be identified with the so-called score function $\partial\log p(r;\boldsymbol{\theta})/\partial\theta^{i}$ (the score function plays an important role in maximum-likelihood estimation). The resulting Riemannian metric is known as the Fisher information matrix:

g_{ij}(\boldsymbol{\theta}) = E_{\boldsymbol{\theta}}\left[\frac{\partial}{\partial\theta^{i}}\log p(r;\boldsymbol{\theta})\,\frac{\partial}{\partial\theta^{j}}\log p(r;\boldsymbol{\theta})\right]   (30)
= E_{\boldsymbol{\theta}}\left[(\mathbbm{1}_{i}-E(\mathbbm{1}_{i}))(\mathbbm{1}_{j}-E(\mathbbm{1}_{j}))\right]   (31)
= \begin{cases}-\eta_{i}\eta_{j}, & \text{if } i\neq j\\ \eta_{i}(1-\eta_{i}), & \text{if } i=j,\end{cases}   (34)

with $E$ denoting the expectation value, and

\eta_{i}=\frac{e^{\theta^{i}}}{1+\sum_{k=1}^{D-1}e^{\theta^{k}}},\qquad i=1,\dots,D-1.

Note that this is the covariance matrix of the random vector $(\mathbbm{1}_{1},\dots,\mathbbm{1}_{D-1})$, as expressed by Eq. (31). Strictly convex functions have positive definite Hessian matrices; here, their elements are given by the Fisher metric itself:

g_{ij}(\boldsymbol{\theta}) = \frac{\partial}{\partial\theta^{i}}\frac{\partial}{\partial\theta^{j}}\psi(\boldsymbol{\theta}),   (35)
g^{ij}(\boldsymbol{\eta}) = \frac{\partial}{\partial\eta_{i}}\frac{\partial}{\partial\eta_{j}}\phi(\boldsymbol{\eta}).   (36)

The second matrix is the inverse of the first, which follows from their Legendre duality. This means the second matrix is an inverse covariance, an important object in the theory of graphical models graphMod . Although the Fisher metric is not Euclidean, i.e., $\left<\mathrm{e}_{i},\mathrm{e}_{j}\right>\neq\delta_{ij}$, we do have a generalization of this when mixing the dual coordinates: $\left<\mathrm{e}_{i},\mathrm{e}^{j}\right>=\delta_{ij}$.
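
The following sketch (closed forms plus a hand-rolled numerical Hessian; all helpers are ours) checks Eqs. (34)-(36) at one example point: the covariance form, the Hessian of $\psi$, and the fact that the Hessian of $\phi$ is the inverse matrix:

import numpy as np

def psi(theta):
    return np.log1p(np.exp(theta).sum())

def num_hess(f, v, h=1e-4):
    """Central finite-difference Hessian."""
    n = v.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.zeros(n), np.zeros(n)
            ei[i], ej[j] = h, h
            H[i, j] = (f(v + ei + ej) - f(v + ei - ej)
                       - f(v - ei + ej) + f(v - ei - ej)) / (4 * h * h)
    return H

x = np.array([0.1, 0.2, 0.3, 0.4])
eta, theta = x[:-1], np.log(x[:-1] / x[-1])

G = np.diag(eta) - np.outer(eta, eta)                    # Eq. (34), covariance of the indicators
print(np.allclose(num_hess(psi, theta), G, atol=1e-5))   # Eq. (35)
H_phi = np.diag(1.0 / eta) + 1.0 / (1.0 - eta.sum())     # Hessian of phi in closed form
print(np.allclose(H_phi, np.linalg.inv(G)))              # Eq. (36): the inverse metric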

Note that for a general exponential family of the form of Eq. (23), the Fisher information matrix can be expressed as a covariance matrix, which generalizes Eq. (31):

g_{ij}(\boldsymbol{\theta}) = E_{\boldsymbol{\theta}}\left[\frac{\partial}{\partial\theta^{i}}\log p(r;\boldsymbol{\theta})\,\frac{\partial}{\partial\theta^{j}}\log p(r;\boldsymbol{\theta})\right]   (37)
= E_{\boldsymbol{\theta}}\left[(X_{i}-E(X_{i}))(X_{j}-E(X_{j}))\right].   (38)

3.2 Distance measures and divergences

Our affine structure on $\mathcal{S}^{D}$ can be reformulated additively via an exponential map $\mathcal{S}^{D}\times\mathcal{T}^{D}\to\mathcal{S}^{D}$, $(\boldsymbol{x},\boldsymbol{v})\mapsto\boldsymbol{y}$:

\boldsymbol{y}=\exp_{\boldsymbol{x}}(\boldsymbol{v}) := \boldsymbol{x}\oplus e^{\boldsymbol{v}},   (39)
\boldsymbol{v}=\mathrm{vec}(\boldsymbol{x},\boldsymbol{y}) := \exp^{-1}_{\boldsymbol{x}}(\boldsymbol{y})=\mathrm{clr}(\boldsymbol{y})-\mathrm{clr}(\boldsymbol{x}),   (40)

using the notation introduced in Section 2, and shown here together with its inverse. This map is used in AyErb , where ordinary linear differential equations are considered, with the time derivative defined for $\mathrm{vec}(\boldsymbol{x}(t_{0}),\boldsymbol{x}(t))$ letting $t\to t_{0}$. These turn out to be replicator equations with special properties that are known from population dynamics. Note that, e.g., with the center of the simplex $\boldsymbol{n}$ as defined before, we have $\mathrm{vec}(\boldsymbol{n},\boldsymbol{x})=\mathrm{clr}(\boldsymbol{x})$. With the exponential map, we can interpret $\mathrm{vec}(\boldsymbol{x},\boldsymbol{y})$ as the difference vector between two compositions, see Figure 1b). The set of all such difference vectors for a given point can be interpreted as the gradient field of a convex function (a.k.a. a potential). In order to highlight the generality of this concept in information geometry, we consider a general convex function $U$ with respect to some parameters $\boldsymbol{\eta}$ from a convex domain. In the following, we will denote the compositions at which the parameters are evaluated by subscripts. The linearization of $U$ at $\boldsymbol{\eta}_{\boldsymbol{y}}$ is given by

\overline{U}(\boldsymbol{\eta}_{\boldsymbol{x}}) := U(\boldsymbol{\eta}_{\boldsymbol{y}})+\nabla U(\boldsymbol{\eta}_{\boldsymbol{y}})\cdot(\boldsymbol{\eta}_{\boldsymbol{x}}-\boldsymbol{\eta}_{\boldsymbol{y}}).

The graph of this linearization is a hyperplane of dimension $D-1$ touching the graph of $U$ at the point $(\boldsymbol{\eta}_{\boldsymbol{y}},U(\boldsymbol{\eta}_{\boldsymbol{y}}))$, see Figure 2. The difference between $U$ and its linearization $\overline{U}$ at $\boldsymbol{\eta}_{\boldsymbol{y}}$, evaluated at $\boldsymbol{\eta}_{\boldsymbol{x}}$, defines a so-called Bregman divergence, a class of divergences that plays an important role in information geometry. More precisely,

D_{U}(\boldsymbol{x}||\boldsymbol{y}) := U(\boldsymbol{\eta}_{\boldsymbol{x}})-\overline{U}(\boldsymbol{\eta}_{\boldsymbol{x}}) = U(\boldsymbol{\eta}_{\boldsymbol{x}})-U(\boldsymbol{\eta}_{\boldsymbol{y}})-\nabla U(\boldsymbol{\eta}_{\boldsymbol{y}})\cdot(\boldsymbol{\eta}_{\boldsymbol{x}}-\boldsymbol{\eta}_{\boldsymbol{y}}).   (41)

Figure 2: The graph of the potential $U$, of its linearization $\overline{U}$, and a visualization of the corresponding Bregman divergence $D_{U}$. The gradient vector $\nabla U$ points in the direction of greatest change of the function $U$ in the space of the coordinates $\boldsymbol{\eta}$.

Divergences are similar to distance functions but they are not necessarily symmetric and need not fulfill the triangle inequality. As an example, let us consider the potential naturally associated with the structure of equations (39) and (40), the squared Aitchison norm

A(\boldsymbol{z}_{\boldsymbol{x}})=\sum_{i=1}^{D}\mathrm{clr}_{i}(\boldsymbol{x})^{2}=\sum_{i=1}^{D-1}\mathrm{ilr}_{i}(\boldsymbol{x})^{2},   (42)

with $\mathrm{ilr}_{i}$ the $i$-th ilr coordinate $z_{i}$, see Eq. (13), and $\boldsymbol{z}_{\boldsymbol{x}}$ the vector of coordinates $z_{i}$. We then have

D_{A}(\boldsymbol{x}||\boldsymbol{y})=\sum_{i=1}^{D-1}(\mathrm{ilr}_{i}(\boldsymbol{x})-\mathrm{ilr}_{i}(\boldsymbol{y}))^{2}=d^{2}_{A}(\boldsymbol{x},\boldsymbol{y}),   (43)

coinciding with the squared Aitchison distance. This is the special case of a Euclidean divergence, which is also a (squared) distance function.
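
A generic Bregman divergence is easy to code; the sketch below (our own helpers, Python/NumPy) evaluates Eq. (41) with the squared norm in clr coordinates as the potential and confirms that it returns the squared Aitchison distance of Eq. (43):

import numpy as np

def closure(x):
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def clr(x):
    lx = np.log(x)
    return lx - lx.mean()

def bregman(U, grad_U, px, py):
    """Bregman divergence D_U(x||y) of Eq. (41), in coordinates p."""
    return U(px) - U(py) - grad_U(py) @ (px - py)

# potential A of Eq. (42): the squared norm, written here in clr coordinates
U = lambda v: np.sum(v ** 2)
grad_U = lambda v: 2.0 * v

x = closure([1.0, 2.0, 3.0, 4.0])
y = closure([4.0, 3.0, 2.0, 1.0])
print(np.isclose(bregman(U, grad_U, clr(x), clr(y)),
                 np.sum((clr(x) - clr(y)) ** 2)))   # Eq. (43)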

Let us now come to the divergences that arise when replacing $U(\boldsymbol{\eta})$ in Eq. (41) by our dually convex functions $\psi(\boldsymbol{\theta})$ and $\phi(\boldsymbol{\eta})$. They turn out to be the relative entropies (a.k.a. Kullback-Leibler divergences)

D_{\phi}(\boldsymbol{x}||\boldsymbol{y}) = \sum_{i=1}^{D}x_{i}\log\frac{x_{i}}{y_{i}},   (44)
D_{\psi}(\boldsymbol{x}||\boldsymbol{y}) = \sum_{i=1}^{D}y_{i}\log\frac{y_{i}}{x_{i}}.   (45)

Thus the symmetry we had in the Euclidean case finds its generalization for our dual case in

D_{\psi}(\boldsymbol{x}||\boldsymbol{y})=D_{\phi}(\boldsymbol{y}||\boldsymbol{x}).   (46)

Moreover, one can show that we can "complete the square" via

D_{\psi}(\boldsymbol{x}||\boldsymbol{y})=\psi(\boldsymbol{\theta}_{\boldsymbol{x}})+\phi(\boldsymbol{\eta}_{\boldsymbol{y}})-\boldsymbol{\theta}_{\boldsymbol{x}}\cdot\boldsymbol{\eta}_{\boldsymbol{y}}.   (47)
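
As a numerical check (a sketch with our own helpers), the identity of Eq. (47) can be verified directly: with $D_{\psi}(\boldsymbol{x}||\boldsymbol{y})=\sum_{i}y_{i}\log(y_{i}/x_{i})$, the right-hand side built from $\psi$, $\phi$ and the dual coordinates gives the same number:

import numpy as np

def closure(x):
    x = np.asarray(x, dtype=float)
    return x / x.sum()

x = closure([1.0, 2.0, 3.0, 4.0])
y = closure([4.0, 3.0, 2.0, 1.0])

D_psi = np.sum(y * np.log(y / x))            # Eq. (45)
theta_x = np.log(x[:-1] / x[-1])             # theta coordinates of x
eta_y = y[:-1]                               # eta coordinates of y
psi_x = np.log1p(np.exp(theta_x).sum())      # psi(theta_x), Eq. (20)
phi_y = np.sum(y * np.log(y))                # phi(eta_y), Eq. (26)

print(np.isclose(D_psi, psi_x + phi_y - theta_x @ eta_y))   # Eq. (47)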

Of course, symmetrizations of relative entropy exist, with the Jensen-Shannon divergence perhaps the most prominent example. Also, a symmetric "compositional" version of relative entropy has been proposed because of its additional properties that are often regarded as indispensable in CoDA symmDiv (as an example, the translation invariance of distance measures, known under the name of perturbation invariance in CoDA, has its information-geometric analog in the invariance of the inner product of tangent vectors $\mathbf{u}$, $\mathbf{v}$ under parallel transport $P$ and its dual $P^{*}$: $\left<\mathbf{u},\mathbf{v}\right>=\left<P(\mathbf{u}),P^{*}(\mathbf{v})\right>$). While such measures have some interesting properties, they do not make use of the duality of our parametrizations and are therefore less suitable for our approach. Indeed, although a dual divergence is not a distance measure in the strict sense, it can quantify the distance between points along a curve in a similar way, and generalizations of well-known relationships from Euclidean geometry are available. Geodesic lines connecting two compositions can be constructed via convex combinations of the parameters $\boldsymbol{\theta}$, and the corresponding dual geodesics from convex combinations of the dual parameters $\boldsymbol{\eta}$. Such geodesics are orthogonal to each other when the inner product of their tangent vectors, with respect to the Fisher metric, vanishes at the point of intersection. In this case, a generalized Pythagorean theorem holds for the corresponding dual divergence, e.g.,

D_{\psi}(\boldsymbol{x}||\boldsymbol{y})=D_{\psi}(\boldsymbol{x}||\boldsymbol{z})+D_{\psi}(\boldsymbol{z}||\boldsymbol{y}),   (48)

where the geodesics $t\boldsymbol{\eta}_{\boldsymbol{x}}+(1-t)\boldsymbol{\eta}_{\boldsymbol{z}}$ and $t\boldsymbol{\theta}_{\boldsymbol{z}}+(1-t)\boldsymbol{\theta}_{\boldsymbol{y}}$ intersect orthogonally.

3.3 Distances obtained from the Fisher metric compared with those used in CoDA

We have seen that the potential of Eq. (42) led to a divergence that is also a Euclidean distance. Let us now consider a generalization of our dual divergences that includes as a special case a Euclidean distance that is related to the Fisher metric. The so-called $\alpha$-divergence (closely related to the Rényi Renyi and Tsallis Tsallis entropies) is defined as

D_{\alpha}(\boldsymbol{x}||\boldsymbol{y})=\frac{4}{1-\alpha^{2}}\left(1-\sum_{i=1}^{D}y_{i}^{\frac{1+\alpha}{2}}x_{i}^{\frac{1-\alpha}{2}}\right).   (49)

In the limit, the cases $\alpha=\pm 1$ correspond to $D_{\psi}$ and $D_{\phi}$. The case $\alpha=0$ (where the divergence is self-dual, i.e., symmetric) corresponds to the Euclidean distance between the points $(\sqrt{x_{1}},\dots,\sqrt{x_{D}})$ and $(\sqrt{y_{1}},\dots,\sqrt{y_{D}})$:

d^{2}_{H}(\boldsymbol{x},\boldsymbol{y}) = \sum_{i=1}^{D}(\sqrt{x_{i}}-\sqrt{y_{i}})^{2}
= \sum_{i=1}^{D}(x_{i}-2\sqrt{x_{i}y_{i}}+y_{i})
= 2\left(1-\sum_{i=1}^{D}\sqrt{x_{i}y_{i}}\right)
= \frac{1}{2}\,D_{0}(\boldsymbol{x}||\boldsymbol{y}).   (51)

This is the so-called Hellinger distance. It is closely related to the Riemannian distance (the Riemannian distance between two points on a manifold is the minimum of the lengths of all piecewise smooth paths joining the two points) between two compositions with respect to the Fisher metric. This so-called Fisher distance can be expressed explicitly Nihat as

d^{2}_{F}(\boldsymbol{x},\boldsymbol{y})=4\,\arccos^{2}\left(\sum_{i=1}^{D}\sqrt{x_{i}y_{i}}\right),   (52)

and its relation to the Hellinger distance is given by

d^{2}_{H}(\boldsymbol{x},\boldsymbol{y})=2\left(1-\cos\frac{d_{F}(\boldsymbol{x},\boldsymbol{y})}{2}\right).   (53)
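
The relations between the $\alpha$-divergence, the Hellinger distance and the Fisher distance can be checked numerically; the sketch below (our own helpers, Python/NumPy) verifies Eqs. (51) and (53) and illustrates the behaviour near $\alpha=-1$:

import numpy as np

def closure(x):
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def alpha_div(x, y, a):
    """Alpha-divergence of Eq. (49)."""
    return 4.0 / (1.0 - a * a) * (1.0 - np.sum(y ** ((1 + a) / 2) * x ** ((1 - a) / 2)))

x = closure([1.0, 2.0, 3.0, 4.0])
y = closure([4.0, 3.0, 2.0, 1.0])

dH2 = np.sum((np.sqrt(x) - np.sqrt(y)) ** 2)          # squared Hellinger distance
dF = 2.0 * np.arccos(np.sum(np.sqrt(x * y)))          # Fisher distance, Eq. (52)

print(np.isclose(dH2, 0.5 * alpha_div(x, y, 0.0)))        # Eq. (51)
print(np.isclose(dH2, 2.0 * (1.0 - np.cos(dF / 2.0))))    # Eq. (53)
# near alpha = -1 the alpha-divergence approaches D_phi(x||y), the relative entropy
print(np.isclose(alpha_div(x, y, -0.999), np.sum(x * np.log(x / y)), atol=1e-2))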

These distances can be better understood when noting the role played by the angle between the two points on the sphere, i.e., when comparing Eq. (52) with the cosine of the angle $\varphi$ between the rays going from the origin through the transformed compositions (see Fig. 3), also referred to as the Bhattacharyya coefficient Bcoeff :

\cos\varphi=\sum_{i=1}^{D}\sqrt{x_{i}y_{i}}.   (54)

Figure 3: Hellinger and Fisher distance between compositions $\boldsymbol{x}$ and $\boldsymbol{y}$. Shown are the 2-simplex and the positive orthant of a sphere with radius 2. The gray points indicate the square-root-transformed compositions. They lie on a sphere with radius one, and their Euclidean distance is the Hellinger distance. The lines from the origin going through these points cross the sphere with radius two. The length of the arc between the resulting points is the Fisher distance, while their Euclidean distance is twice the Hellinger distance.

Let us now come back to the Aitchison distance and discuss in which structural aspects it differs from divergences obtained from the Fisher metric. In fact, in data analysis, parametrized classes of distance measures are common green3 . In the same way as in Eq. (49), they are mediated by the Box-Cox transformation, which has the limit

\lim_{\beta\to 0}\frac{x^{\beta}-1}{\beta}=\log(x).   (55)

This has been applied in CoDA to obtain log-ratio analysis as a Correspondence Analysis of power-transformed data green1 . There, we have the following family of distance measures:

d^{2}_{\beta}(\boldsymbol{x},\boldsymbol{y})=\frac{1}{\beta^{2}}\sum_{i=1}^{D}\omega_{i}\left(\mathcal{C}(\boldsymbol{x}^{\beta})_{i}-\mathcal{C}(\boldsymbol{y}^{\beta})_{i}\right)^{2},   (56)

where the $\omega_{i}$ are suitable weights. For the case $\beta=1$, this is the (square of the) symmetric $\chi^{2}$-distance used in Correspondence Analysis, while $\beta=0$ gives the Aitchison distance (when $\omega_{i}=D^{2}$ for all $i$); this can be seen as the high-temperature limit in statistical physics, and a proof is given in the Appendix. Although the case $\beta=1/2$ has a direct relationship with it, the Hellinger distance does not form part of this family because of the closure operation that makes us stay inside the simplex. Similarly, the Aitchison distance cannot be obtained from the $\alpha$-divergences of Eq. (49). The $\alpha$-divergences are included in a more general class of divergences known under the name of $f$-divergences. They have the form

D_{f}(\boldsymbol{x},\boldsymbol{y})=\sum_{i=1}^{D}x_{i}f\left(\frac{y_{i}}{x_{i}}\right),   (57)

where $f$ is a convex function. It is a well-established result in information geometry that $f$-divergences are the only decomposable divergences (a divergence is decomposable if it can be written as a sum of terms that only depend on individual components) that behave monotonically under coarse-graining of information Amari , i.e., when compositional parts are amalgamated into higher-level parts. This important invariance property is called information monotonicity. The Aitchison distance is not decomposable, as each summand uses information from all compositional parts via their geometric mean. Nevertheless, in the next section we will show that it fulfills information monotonicity.
It is interesting to note that the two Euclidean distances (Hellinger and Aitchison) are each related to different isometries of the simplex. In the case of the Hellinger distance, compositions are isometrically mapped into the positive orthant of the sphere (note, however, that this mapping does not yield the spherical representative of the composition in the sense of the definition of an equivalence class). This isometry also holds for the Fisher metric itself Nihat (which makes it possible to view the Fisher metric as a Euclidean metric on the $\mathbb{R}^{D}$ in which the sphere is embedded). In the case of the Aitchison distance, the isometry in question is of central interest in CoDA: it is the clr transformation, i.e., the map between $\mathcal{S}^{D}$ and $\mathcal{T}^{D}$. Here, however, there is no corresponding isometry of the Fisher metric. Although a Euclidean metric may appear convenient, it does not have the same desirable properties as the Fisher metric. In fact, a central result in information geometry states that the Fisher metric is the only metric Chentsov2 that stays invariant under coarse graining of information.
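
For illustration, the limit behind Eq. (56) can be checked numerically (a sketch with our own helpers): for a small $\beta$ and weights $\omega_{i}=D^{2}$, the Box-Cox family essentially reproduces the Aitchison distance:

import numpy as np

def closure(x):
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def clr(x):
    lx = np.log(x)
    return lx - lx.mean()

def d_beta2(x, y, beta, w):
    """Squared distance of Eq. (56) with weights w."""
    return np.sum(w * (closure(x ** beta) - closure(y ** beta)) ** 2) / beta ** 2

D = 4
x = closure([1.0, 2.0, 3.0, 4.0])
y = closure([4.0, 3.0, 2.0, 1.0])
dA2 = np.sum((clr(x) - clr(y)) ** 2)
print(d_beta2(x, y, 1e-4, np.full(D, float(D * D))), dA2)   # nearly equal for small beta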

4 Information monotonicity of Aitchison distance

4.1 Amalgamations lead to coarse grained information

Let us denote by $\mathcal{A}$ a subset of $\mathcal{D}:=\{1,\dots,D\}$, and let $\boldsymbol{x}_{\mathcal{A}}$ denote the corresponding subcomposition, i.e., the vector from which the parts with indices not belonging to $\mathcal{A}$ have been removed. Let $H(\boldsymbol{x})$ denote the Shannon entropy of the composition $\boldsymbol{x}$, i.e., the negative of the potential, $-\phi(\boldsymbol{\eta}_{\boldsymbol{x}})$. Consider its decomposition

H(\boldsymbol{x})=(1-s(\boldsymbol{x}_{\mathcal{A}}))H(\mathcal{C}\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}})+s(\boldsymbol{x}_{\mathcal{A}})H(\mathcal{C}\boldsymbol{x}_{\mathcal{A}})+H(s(\boldsymbol{x}_{\mathcal{A}}),1-s(\boldsymbol{x}_{\mathcal{A}})),   (58)

where $s(\boldsymbol{x}_{\mathcal{A}})=\sum_{i\in\mathcal{A}}x_{i}$. We see that this is a convex combination of the entropies of the two subcompositions plus a binary entropy, where all terms involve the amalgamation $s(\boldsymbol{x}_{\mathcal{A}})$. Probabilistically speaking, this particular amalgamation corresponds to the probability that any of the events in $\mathcal{A}$ occurs, i.e., of the events that are left out to obtain the subcomposition $\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}}$. Generally, amalgamation is nothing else but a coarse graining of the events and their probabilities. The corresponding coarse graining of information is described by this alternative decomposition of the Shannon entropy:

H(\boldsymbol{x})=H(\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}},s(\boldsymbol{x}_{\mathcal{A}}))+s(\boldsymbol{x}_{\mathcal{A}})H(\mathcal{C}\boldsymbol{x}_{\mathcal{A}}).   (59)

As the second summand is greater than or equal to zero, this also shows that information cannot grow under coarse graining.
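
Both decompositions are straightforward to verify numerically; the following sketch (our own helpers, with an arbitrarily chosen subset $\mathcal{A}$) checks Eqs. (58) and (59) for one example:

import numpy as np

def closure(x):
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def H(p):
    """Shannon entropy."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

x = closure([1.0, 2.0, 3.0, 4.0, 5.0])
A = np.array([0, 1])                 # indices of the amalgamated parts (our choice)
xA, xRest = x[A], np.delete(x, A)
s = xA.sum()                         # the amalgamation s(x_A)

rhs58 = (1 - s) * H(closure(xRest)) + s * H(closure(xA)) + H([s, 1 - s])
rhs59 = H(np.append(xRest, s)) + s * H(closure(xA))
print(np.isclose(H(x), rhs58))       # Eq. (58)
print(np.isclose(H(x), rhs59))       # Eq. (59)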

4.2 The notion of monotonicity for divergences and distance measures

These considerations lead us to an important result that concerns the divergence associated with the potential $\phi(\boldsymbol{\eta}_{\boldsymbol{x}})$, i.e., the relative entropy. This result is the information monotonicity under coarse graining, where the notion of monotonicity is somewhat related to the notion of subcompositional dominance. The latter refers to the property that a measure of distance does not increase when evaluating it on a subset of parts only. This is often seen as a desirable property of distances in CoDA (and is not fulfilled by distances like Hellinger and Bhattacharyya, see clustering for a discussion of distance measures with respect to compositions). A similar, but perhaps more natural, requirement that has not received attention yet in the CoDA community is that a distance between compositions should not increase when comparing it with the one obtained after amalgamating over a subset of parts. (Subcompositional coherence, i.e., the fundamental requirement that quantities remain identical on a renormalized subcomposition, is not an issue for amalgamation: after amalgamation there is no need for renormalization.) As we have seen in the previous subsection, we cannot gain information when amalgamating parts, so we should lose resolution when comparing the amalgamated compositions. This is also related to the notion of a sufficient statistic, see Amari . The information monotonicity property of relative entropy can be expressed as

D_{\phi}(\boldsymbol{x}||\boldsymbol{y})\geq D_{\phi}\left((\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}},s(\boldsymbol{x}_{\mathcal{A}}))\,||\,(\boldsymbol{y}_{\mathcal{D}\backslash\mathcal{A}},s(\boldsymbol{y}_{\mathcal{A}}))\right).   (60)

This result can be shown for the more general case of $f$-divergences and continuous distributions using Jensen's inequality AmariNagaoka .
Note that in balances , when discussing amalgamations of parts, the notion of "monotonicity" is used differently. There, the authors argue against amalgamations, referring to the observation that Aitchison distances between amalgamated compositions and the amalgamated center of the simplex show a non-monotonic behaviour along an ilr-coordinate axis defined before amalgamation. We will show below that information monotonicity does hold for the Aitchison distance. We see the lack of distance monotonicity discussed in balances rather as an argument against the use of a Euclidean coordinate system here.
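
The inequality of Eq. (60) is easy to probe empirically; the sketch below (our own helpers, Python/NumPy) amalgamates a fixed subset of parts for random pairs of compositions and checks that the relative entropy never increases:

import numpy as np

rng = np.random.default_rng(0)

def closure(x):
    return x / x.sum()

def kl(p, q):
    """Relative entropy D_phi(p||q) of Eq. (44)."""
    return np.sum(p * np.log(p / q))

def amalgamate(x, A):
    """Replace the parts indexed by A with their amalgamation s(x_A)."""
    return np.append(np.delete(x, A), x[A].sum())

D, A = 6, np.array([0, 1, 2])
for _ in range(1000):
    x, y = closure(rng.random(D)), closure(rng.random(D))
    assert kl(x, y) >= kl(amalgamate(x, A), amalgamate(y, A)) - 1e-12
print("Eq. (60) held in all random trials")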

4.3 Monotonicity of Aitchison distance

A symmetrized version of relative entropy has recently been used in the context of data-driven amalgamation amalgams , where it was shown to be better preserved between samples than the Aitchison distance. While the information-theoretic meaning and the mathematical properties reflected in the decompositions shown in Section 4.1 make Shannon entropy an ideal measure of information, alternative indices that can sometimes be evaluated more easily on real-world data (e.g., making use of sums of squares) have also been considered. In our context, it is interesting that $A(\boldsymbol{z}_{\boldsymbol{x}})$, the potential of Eq. (42), has been proposed as an alternative measure of information within an attempt to reformulate information theory from a CoDA point of view E&PGinfo . More recently, it has also been proposed as an inequality index (when divided by the number of parts $D$) sampleSpace . There, the following decomposition was shown:

||\boldsymbol{x}||_{A}^{2}=||(\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}},\boldsymbol{x}_{\mathcal{A}})||_{A}^{2}=||\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}}||_{A}^{2}+||\boldsymbol{x}_{\mathcal{A}}||_{A}^{2}+\frac{a(D-a)}{D}\left(\log\frac{g(\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}})}{g(\boldsymbol{x}_{\mathcal{A}})}\right)^{2},   (61)

where we denote the set size of $\mathcal{A}$ by $a$. If we now want to decompose with respect to a composition that was partly amalgamated, we find a corresponding relationship that is more complicated (it is interesting to note that the two interaction terms have the form of squares of the balance and pivot coordinates mentioned in Section 2):

||\boldsymbol{x}||_{A}^{2}=||(\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}},s(\boldsymbol{x}_{\mathcal{A}}))||_{A}^{2}+||\boldsymbol{x}_{\mathcal{A}}||_{A}^{2}+\frac{a(D-a)}{D}\left(\log\frac{g(\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}})}{g(\boldsymbol{x}_{\mathcal{A}})}\right)^{2}-\frac{D-a}{D-a+1}\left(\log\frac{g(\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}})}{s(\boldsymbol{x}_{\mathcal{A}})}\right)^{2}.   (62)

Clearly, if we replace the amalgamation $s(\boldsymbol{x}_{\mathcal{A}})$ by the geometric mean $g(\boldsymbol{x}_{\mathcal{A}})$, we get a simpler equality:

||\boldsymbol{x}||_{A}^{2}=||(\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}},g(\boldsymbol{x}_{\mathcal{A}}))||_{A}^{2}+||\boldsymbol{x}_{\mathcal{A}}||_{A}^{2}+\left(\frac{a(D-a)}{D}-\frac{D-a}{D-a+1}\right)\left(\log\frac{g(\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}})}{g(\boldsymbol{x}_{\mathcal{A}})}\right)^{2}.   (63)

Aggregating by geometric means or by amalgamations has been a subject of debate in the CoDA community sampleSpace ; amalgGreen . As we can see, measures like the Aitchison norm lend themselves much better to taking geometric means than to amalgamations. There is, however, no straightforward probabilistic interpretation of geometric means (the product over parts specifies the probability that all events in the subset occur, but this is then re-scaled by the exponent to the probability of a single event), and the more elegant formal expressions that result often suffer from reduced interpretability.
To the best of our knowledge, the information monotonicity property in its general form has not yet been considered for the Aitchison distance. We here exploit the various decompositions stated above to prove it. The results are summarized in the following propositions; all proofs can be found in the Appendix.

Proposition 1

Let $\mathcal{D}$ and $\mathcal{A}\subset\mathcal{D}$ be two index sets with sizes $D$ and $a$, respectively. Further, let $\boldsymbol{x}$ and $\boldsymbol{y}$ be the simplicial representatives of two compositions in $\mathcal{S}^{D}$. Let the amalgamation of $\boldsymbol{x}$ over the subset $\mathcal{A}$ of parts be denoted by $s(\boldsymbol{x}_{\mathcal{A}})=\sum_{i\in\mathcal{A}}x_{i}$. Then the following decomposition of the Aitchison distance holds:

||\boldsymbol{x}\ominus\boldsymbol{y}||_{A}^{2}=||(\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}},s(\boldsymbol{x}_{\mathcal{A}}))\ominus(\boldsymbol{y}_{\mathcal{D}\backslash\mathcal{A}},s(\boldsymbol{y}_{\mathcal{A}}))||_{A}^{2}+||\boldsymbol{x}_{\mathcal{A}}\ominus\boldsymbol{y}_{\mathcal{A}}||_{A}^{2}
+\frac{a(D-a)}{D}\left(\log\frac{g(\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}})}{g(\boldsymbol{x}_{\mathcal{A}})}-\log\frac{g(\boldsymbol{y}_{\mathcal{D}\backslash\mathcal{A}})}{g(\boldsymbol{y}_{\mathcal{A}})}\right)^{2}
-\frac{D-a}{D-a+1}\left(\log\frac{g(\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}})}{s(\boldsymbol{x}_{\mathcal{A}})}-\log\frac{g(\boldsymbol{y}_{\mathcal{D}\backslash\mathcal{A}})}{s(\boldsymbol{y}_{\mathcal{A}})}\right)^{2}.   (64)
Corollary 1

When aggregating a subset of parts in the form of their geometric mean, we have the following decomposition of the Aitchison distance:

||\boldsymbol{x}\ominus\boldsymbol{y}||_{A}^{2}=||(\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}},g(\boldsymbol{x}_{\mathcal{A}}))\ominus(\boldsymbol{y}_{\mathcal{D}\backslash\mathcal{A}},g(\boldsymbol{y}_{\mathcal{A}}))||_{A}^{2}+||\boldsymbol{x}_{\mathcal{A}}\ominus\boldsymbol{y}_{\mathcal{A}}||_{A}^{2}
+\left(\frac{a(D-a)}{D}-\frac{D-a}{D-a+1}\right)\left(\log\frac{g(\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}})}{g(\boldsymbol{x}_{\mathcal{A}})}-\log\frac{g(\boldsymbol{y}_{\mathcal{D}\backslash\mathcal{A}})}{g(\boldsymbol{y}_{\mathcal{A}})}\right)^{2}.   (65)

From this decomposition, we get the following monotonicity result:

Corollary 2

With parts aggregated by geometric means, the following inequality holds

||\boldsymbol{x}\ominus\boldsymbol{y}||_{A}^{2}\geq||(\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}},g(\boldsymbol{x}_{\mathcal{A}}))\ominus(\boldsymbol{y}_{\mathcal{D}\backslash\mathcal{A}},g(\boldsymbol{y}_{\mathcal{A}}))||_{A}^{2}+||\boldsymbol{x}_{\mathcal{A}}\ominus\boldsymbol{y}_{\mathcal{A}}||_{A}^{2}.

As we can see, for geometric-mean summaries, the sum of the interaction terms (i.e., of the terms not involving norms) remains greater than or equal to zero. This is no longer true for the amalgamation of parts, and it is less straightforward to show the corresponding inequality:

Proposition 2

Aitchison distance fulfills the information monotonicity

||\boldsymbol{x}\ominus\boldsymbol{y}||_{A}^{2}\geq||(\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}},s(\boldsymbol{x}_{\mathcal{A}}))\ominus(\boldsymbol{y}_{\mathcal{D}\backslash\mathcal{A}},s(\boldsymbol{y}_{\mathcal{A}}))||_{A}^{2}.
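
The decomposition of Proposition 1 and the inequality of Proposition 2 can be probed with random compositions; the sketch below (our own helpers, one fixed subset $\mathcal{A}$) checks Eq. (64) exactly and the monotonicity inequality up to numerical tolerance:

import numpy as np

rng = np.random.default_rng(1)

def closure(x):
    return x / x.sum()

def clr(x):
    lx = np.log(x)
    return lx - lx.mean()

def dA2(x, y):
    """Squared Aitchison distance."""
    return np.sum((clr(x) - clr(y)) ** 2)

def g(x):
    """Geometric mean."""
    return np.exp(np.mean(np.log(x)))

D, A = 6, np.array([0, 1, 2])
a = len(A)
for _ in range(1000):
    x, y = closure(rng.random(D)), closure(rng.random(D))
    xA, xR, yA, yR = x[A], np.delete(x, A), y[A], np.delete(y, A)
    sx, sy = xA.sum(), yA.sum()
    amal = dA2(np.append(xR, sx), np.append(yR, sy))
    rhs = (amal + dA2(xA, yA)
           + a * (D - a) / D * (np.log(g(xR) / g(xA)) - np.log(g(yR) / g(yA))) ** 2
           - (D - a) / (D - a + 1) * (np.log(g(xR) / sx) - np.log(g(yR) / sy)) ** 2)
    assert np.isclose(dA2(x, y), rhs)         # Eq. (64)
    assert dA2(x, y) >= amal - 1e-12          # Proposition 2
print("Eq. (64) and Proposition 2 held in all random trials")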

5 Discussion and outlook

In our little outline of finite information geometry, we could but scratch the surface of the formal apparatus that is at our disposal. We are certain it can serve to advance the field of Compositional Data Analysis in various ways. Differential geometry provides a universally valid framework for the problems occurring with constrained data. Considering the simplex a differentiable manifold enables a general approach from which specific problems like the compositional differential calculus compDiffCalc follow naturally. Clearly, there is (and has to be) overlap in methodology between the information-geometric perspective and the CoDA approach. An example is the use of the exponential map anchored at the center of the simplex discussed in Section 3.2, which allows us to identify the simplex with a linear space that is central to the Euclidean CoDA approach. Another example is the fundamental role that exponential families play in information geometry and which have been studied in the CoDA context in the so-called Bayes spaces BayesSpaces . But we also think that some of the current limitations of CoDA can be overcome using the additional structures that information geometry can provide. The ease with which amalgamations of parts can be handled by Kullback-Leibler divergence might partly resolve the debate surrounding this issue in the CoDA community. Further, maximum-entropy projections, where Kullback-Leibler divergences are the central tool, seem an especially promising avenue to pursue in the context of data that are only partially available or subject to constraints. Also, our description has focused on the equivalence of compositions with discrete probability distributions, but information geometry can of course be used to describe the distributions themselves. These are no longer finite but continuous and contain a constraint that introduces dependencies among their random variables, calling for the use of more general versions of the concepts presented here.

Appendix

Proof that Eq. (56) tends to the Aitchison distance for $\beta\rightarrow 0$

d^{2}_{\beta}(\boldsymbol{x},\boldsymbol{y})=\frac{1}{\beta^{2}}\sum_{i=1}^{D}\omega_{i}\left(\mathcal{C}(\boldsymbol{x}^{\beta})_{i}-\mathcal{C}(\boldsymbol{y}^{\beta})_{i}\right)^{2}=\sum_{i=1}^{D}\omega_{i}\left(\frac{x^{\beta}_{i}}{\beta\sum_{k=1}^{D}x^{\beta}_{k}}-\frac{y^{\beta}_{i}}{\beta\sum_{k=1}^{D}y^{\beta}_{k}}\right)^{2}.

The terms inside the bracket can be written as

\frac{x^{\beta}_{i}}{\beta\sum_{k=1}^{D}x^{\beta}_{k}}-\frac{y^{\beta}_{i}}{\beta\sum_{k=1}^{D}y^{\beta}_{k}}=\frac{x^{\beta}_{i}-\frac{1}{D}\sum_{k=1}^{D}x^{\beta}_{k}}{\beta\sum_{k=1}^{D}x^{\beta}_{k}}-\frac{y^{\beta}_{i}-\frac{1}{D}\sum_{k=1}^{D}y^{\beta}_{k}}{\beta\sum_{k=1}^{D}y^{\beta}_{k}}   (66)

when subtracting $1/(\beta D)$ and adding it back. Similarly,

=\frac{x^{\beta}_{i}-1-\frac{1}{D}\sum_{k=1}^{D}(x^{\beta}_{k}-1)}{\beta\sum_{k=1}^{D}x^{\beta}_{k}}-\frac{y^{\beta}_{i}-1-\frac{1}{D}\sum_{k=1}^{D}(y^{\beta}_{k}-1)}{\beta\sum_{k=1}^{D}y^{\beta}_{k}}.   (67)

In this, we recognize the Box-Cox transform in the numerators, and can make use of the limit in Eq. (55). The sums in the denominators clearly tend to $D$ for $\beta\rightarrow 0$. We can thus evaluate the limit as a quotient of finite limits and conclude

\lim_{\beta\to 0}d^{2}_{\beta}(\boldsymbol{x},\boldsymbol{y})=\sum_{i=1}^{D}\frac{\omega_{i}}{D^{2}}\left(\log\frac{x_{i}}{g(\boldsymbol{x})}-\log\frac{y_{i}}{g(\boldsymbol{y})}\right)^{2}.   (68)

Proof of Proposition 1

We start by defining $\boldsymbol{z}=(x_{i}/y_{i})_{i\in\mathcal{D}}$. We can now use the decomposition

||\boldsymbol{z}||_{A}^{2}=||\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}}||_{A}^{2}+||\boldsymbol{z}_{\mathcal{A}}||_{A}^{2}+\frac{a(D-a)}{D}\left(\log\frac{g(\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}})}{g(\boldsymbol{z}_{\mathcal{A}})}\right)^{2},   (69)

which can be derived using equalities like

gโ€‹(๐’™)=(gโ€‹(๐’™๐’œ)aโ€‹gโ€‹(๐’™๐’Ÿ\๐’œ)Dโˆ’a)1D=gโ€‹(๐’™๐’œ)โ€‹gโ€‹(๐’™๐’œ)aDโˆ’1โ€‹gโ€‹(๐’™๐’Ÿ\๐’œ)Dโˆ’aD,g(\boldsymbol{x})=\left(g(\boldsymbol{x}_{\mathcal{A}})^{a}g(\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}})^{D-a}\right)^{\frac{1}{D}}=g(\boldsymbol{x}_{\mathcal{A}})g(\boldsymbol{x}_{\mathcal{A}})^{\frac{a}{D}-1}g(\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}})^{\frac{D-a}{D}},

which can be used to expand

โˆ‘iโˆˆ๐’œ(logโกzigโ€‹(๐’›))2=โˆ‘iโˆˆ๐’œ(logโกzigโ€‹(๐’›๐’œ)+Dโˆ’aDโ€‹logโกgโ€‹(๐’›๐’œ)gโ€‹(๐’›๐’Ÿ\๐’œ))2\sum_{i\in\mathcal{A}}\left(\log\frac{z_{i}}{g(\boldsymbol{z})}\right)^{2}=\sum_{i\in\mathcal{A}}\left(\log\frac{z_{i}}{g(\boldsymbol{z}_{\mathcal{A}})}+\frac{D-a}{D}\log\frac{g(\boldsymbol{z}_{\mathcal{A}})}{g(\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}})}\right)^{2}

and, expanding the square, the cross terms vanish after summation because $\sum_{i\in\mathcal{A}}\log(z_{i}/g(\boldsymbol{z}_{\mathcal{A}}))=0$.
Next, we observe that, for an arbitrary $s_{\boldsymbol{z}}$ that we join as an additional component to the vector $\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}}$, we have

โ€–(๐’›๐’Ÿ\๐’œ,s๐’›)โ€–A2=โ€–๐’›๐’Ÿ\๐’œโ€–A2+Dโˆ’aDโˆ’a+1โ€‹(logโกgโ€‹(๐’›๐’Ÿ\๐’œ)s๐’›)2||(\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}},s_{\boldsymbol{z}})||_{A}^{2}=||\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}}||_{A}^{2}+\frac{D-a}{D-a+1}\left(\log\frac{g(\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}})}{s_{\boldsymbol{z}}}\right)^{2} (70)

because, as before, we have

โˆ‘iโˆˆ๐’Ÿ\๐’œ(logโกzigโ€‹((๐’›๐’Ÿ\๐’œ,s๐’›)))2+(logโกs๐’›gโ€‹((๐’›๐’Ÿ\๐’œ,s๐’›)))2=โˆ‘iโˆˆ๐’Ÿ\๐’œ(logโกzigโ€‹(๐’›๐’Ÿ\๐’œ)+1Dโˆ’a+1โ€‹logโกgโ€‹(๐’›๐’Ÿ\๐’œ)s๐’›)2+(Dโˆ’aDโˆ’a+1โ€‹logโกs๐’›gโ€‹(๐’›๐’Ÿ\๐’œ))2.\sum_{i\in\mathcal{D}\backslash\mathcal{A}}\left(\log\frac{z_{i}}{g((\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}},s_{\boldsymbol{z}}))}\right)^{2}+\left(\log\frac{s_{\boldsymbol{z}}}{g((\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}},s_{\boldsymbol{z}}))}\right)^{2}=\\ \sum_{i\in\mathcal{D}\backslash\mathcal{A}}\left(\log\frac{z_{i}}{g(\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}})}+\frac{1}{D-a+1}\log\frac{g(\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}})}{s_{\boldsymbol{z}}}\right)^{2}+\left(\frac{D-a}{D-a+1}\log\frac{s_{\boldsymbol{z}}}{g(\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}})}\right)^{2}. (71)

We can now choose for $s_{\boldsymbol{z}}$ the expression $s_{\boldsymbol{z}}=(\sum_{i\in\mathcal{A}}x_{i})/(\sum_{i\in\mathcal{A}}y_{i})$. With this choice, we solve Eq. (70) for $||\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}}||_{A}^{2}$ and substitute the resulting expression into Eq. (69), which proves the proposition:

โ€–๐’›โ€–A2=โ€–(๐’›๐’Ÿ\๐’œ,s)โ€–A2+โ€–๐’›๐’œโ€–A2+aโ€‹(Dโˆ’a)Dโ€‹(logโกgโ€‹(๐’›๐’Ÿ\๐’œ)gโ€‹(๐’›๐’œ))2โˆ’Dโˆ’aDโˆ’a+1โ€‹(logโกgโ€‹(๐’›๐’Ÿ\๐’œ)(โˆ‘iโˆˆ๐’œxi)/(โˆ‘iโˆˆ๐’œyi))2.||\boldsymbol{z}||_{A}^{2}=||(\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}},s)||_{A}^{2}+||\boldsymbol{z}_{\mathcal{A}}||_{A}^{2}\\ +\frac{a(D-a)}{D}\left(\log\frac{g(\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}})}{g(\boldsymbol{z}_{\mathcal{A}})}\right)^{2}-\frac{D-a}{D-a+1}\left(\log\frac{g(\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}})}{(\sum_{i\in\mathcal{A}}x_{i})/(\sum_{i\in\mathcal{A}}y_{i})}\right)^{2}. (72)

Proof of Corollary 1

This follows from inserting $g(\boldsymbol{z}_{\mathcal{A}})$ for $s_{\boldsymbol{z}}$ in Eq. (70). This yields an expression for $||\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}}||_{A}^{2}$, which is then inserted into Eq. (69).
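With this choice, Eqs. (69) and (70) combine into the explicit decomposition below (we spell it out here since its last term is the one referred to in the proof of Corollary 2):

||\boldsymbol{z}||_{A}^{2}=||(\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}},g(\boldsymbol{z}_{\mathcal{A}}))||_{A}^{2}+||\boldsymbol{z}_{\mathcal{A}}||_{A}^{2}+\left(\frac{a(D-a)}{D}-\frac{D-a}{D-a+1}\right)\left(\log\frac{g(\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}})}{g(\boldsymbol{z}_{\mathcal{A}})}\right)^{2}.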

Proof of Corollary 2

To prove the corollary, we have to show that the last term in the decomposition of Corollary 1 is greater than or equal to zero. We thus need to show

(D-a)\,\frac{a(D-a+1)-D}{D(D-a+1)}\geq 0. (73)

Since $D>a>1$, and since the quadratic equation $a^{2}-a(D+1)+D=0$ has the solutions $a=1$ and $a=D$, the expression in Eq. (73) does not change sign between these values. It is positive there because its first derivative with respect to $a$ at $a=1$ equals $(D-1)^{2}/D^{2}$, which is greater than zero.
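Alternatively, the sign can be read off from a factorization of the numerator in Eq. (73), a small worked step added here for convenience:

a(D-a+1)-D=-\left(a^{2}-a(D+1)+D\right)=-(a-1)(a-D)=(a-1)(D-a),

so that for $1<a<D$ every factor appearing in Eq. (73) is positive.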

Proof of Proposition 2

Let ๐’›\boldsymbol{z} denote the vector with components xi/yix_{i}/y_{i}, i=1,โ€ฆ,Di=1,\dots,D. To prove the proposition, we have to bound the terms after the first plus sign in Proposition 1 from below, i.e.,

Rโ€‹(๐’™,๐’š):=โˆ‘iโˆˆ๐’œ(logโกzigโ€‹(๐’›๐’œ))2+aโ€‹(Dโˆ’a)Dโ€‹(logโกgโ€‹(๐’›๐’Ÿ\๐’œ)gโ€‹(๐’›๐’œ))2โˆ’Dโˆ’aDโˆ’a+1โ€‹(logโกgโ€‹(๐’›๐’Ÿ\๐’œ)(โˆ‘iโˆˆ๐’œxi)/(โˆ‘iโˆˆ๐’œyi))2โ‰ฅ0.R(\boldsymbol{x},\boldsymbol{y}):=\sum_{i\in\mathcal{A}}\left(\log\frac{z_{i}}{g(\boldsymbol{z}_{\mathcal{A}})}\right)^{2}+\frac{a(D-a)}{D}\left(\log\frac{g(\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}})}{g(\boldsymbol{z}_{\mathcal{A}})}\right)^{2}\\ -\frac{D-a}{D-a+1}\left(\log\frac{g(\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}})}{(\sum_{i\in\mathcal{A}}x_{i})/(\sum_{i\in\mathcal{A}}y_{i})}\right)^{2}\geq 0. (74)

Let us start with the second summand. We rewrite it as

\left(\frac{a(D-a)}{D}-\frac{D-a}{D-a+1}\right)\left(\log\frac{g(\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}})}{g(\boldsymbol{z}_{\mathcal{A}})}\right)^{2}+\frac{D-a}{D-a+1}\left(\log\frac{g(\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}})}{g(\boldsymbol{z}_{\mathcal{A}})}\right)^{2}.

Since we have shown Eq. (73), the summand on the left is greater than or equal to zero. Thus we have

Rโ€‹(๐’™,๐’š)โ‰ฅโˆ‘iโˆˆ๐’œ(logโกzigโ€‹(๐’›๐’œ))2+Dโˆ’aDโˆ’a+1โ€‹((logโกgโ€‹(๐’›๐’Ÿ\๐’œ)gโ€‹(๐’›๐’œ))2โˆ’(logโกgโ€‹(๐’›๐’Ÿ\๐’œ)(โˆ‘iโˆˆ๐’œxi)/(โˆ‘iโˆˆ๐’œyi))2)โ‰ฅโˆ‘iโˆˆ๐’œ(logโกzigโ€‹(๐’›๐’œ))2โˆ’(logโก(โˆ‘iโˆˆ๐’œxi)/(โˆ‘iโˆˆ๐’œyi)gโ€‹(๐’›๐’œ))2,R(\boldsymbol{x},\boldsymbol{y})\geq\sum_{i\in\mathcal{A}}\left(\log\frac{z_{i}}{g(\boldsymbol{z}_{\mathcal{A}})}\right)^{2}\\ +\frac{D-a}{D-a+1}\left(\left(\log\frac{g(\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}})}{g(\boldsymbol{z}_{\mathcal{A}})}\right)^{2}-\left(\log\frac{g(\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}})}{(\sum_{i\in\mathcal{A}}x_{i})/(\sum_{i\in\mathcal{A}}y_{i})}\right)^{2}\right)\\ \geq\sum_{i\in\mathcal{A}}\left(\log\frac{z_{i}}{g(\boldsymbol{z}_{\mathcal{A}})}\right)^{2}-\left(\log\frac{(\sum_{i\in\mathcal{A}}x_{i})/(\sum_{i\in\mathcal{A}}y_{i})}{g(\boldsymbol{z}_{\mathcal{A}})}\right)^{2}, (75)

where the second inequality follows because the expression in the large brackets (which comes with a prefactor smaller than one) has a structure that can be bounded via

|(A-B)^{2}-(A-C)^{2}|\leq(B-C)^{2},

with $\log g(\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}})$ playing the role of $A$. Finally, the last term in Eq. (75) can be bounded from above as follows. Since $x_{i}\leq y_{i}\cdot\max_{k}x_{k}/y_{k}$, we also have

โˆ‘iโˆˆ๐’œxiโ‰คmaxkโˆˆ๐’œโกxkykโ€‹โˆ‘iโˆˆ๐’œyi,\sum_{i\in\mathcal{A}}x_{i}\leq\max_{k\in\mathcal{A}}\frac{x_{k}}{y_{k}}\sum_{i\in\mathcal{A}}y_{i}, (76)

so the ratio of the sums is bounded by the maximum over the component-wise ratios. Without loss of generality, let us assume that $g(\boldsymbol{x}_{\mathcal{A}})\geq g(\boldsymbol{y}_{\mathcal{A}})$, i.e., $g(\boldsymbol{z}_{\mathcal{A}})\geq 1$. The bound on the sum ratio implied by Eq. (76) is then sufficient for proving the proposition:

R(๐’™,๐’š)โ‰ฅmaxiโˆˆ๐’œ(logxi/yigโ€‹(๐’›๐’œ))2โˆ’(log(โˆ‘iโˆˆ๐’œxi)/(โˆ‘iโˆˆ๐’œyi)gโ€‹(๐’›๐’œ))2โ‰ฅ0.R(\boldsymbol{x},\boldsymbol{y})\geq\max_{i\in\mathcal{A}}\left(\log\frac{x_{i}/y_{i}}{g(\boldsymbol{z}_{\mathcal{A}})}\right)^{2}-\left(\log\frac{(\sum_{i\in\mathcal{A}}x_{i})/(\sum_{i\in\mathcal{A}}y_{i})}{g(\boldsymbol{z}_{\mathcal{A}})}\right)^{2}\geq 0. (77)

References

  • (1) Aitchison, J: The statistical analysis of compositional data. Chapman and Hall (1986)
  • (2) Greenacre, M: Compositional Data Analysis. Annual Review of Statistics and Its Application 8(1), 271–299 (2021)
  • (3) Egozcue, JJ and Pawlowsky-Glahn, V: Compositional data: the sample space and its structure. TEST 28(3), 599–638 (2019)
  • (4) Barceló-Vidal, C, Martín-Fernández, JA: The Mathematics of Compositional Analysis. Austrian Journal of Statistics 45(4), 57–71 (2016)
  • (5) Aitchison, J: The Statistical Analysis of Compositional Data. J Royal Stat Soc B 44(2), 139–160 (1982)
  • (6) Egozcue, JJ, Barceló-Vidal, C, Martín-Fernández, JA, Jarauta-Bragulat, E, Díaz-Barrero, JL, Mateu-Figueras, G, Pawlowsky-Glahn, V, Buccianti, A: Elements of simplicial linear algebra and geometry. In: Pawlowsky-Glahn, V and Buccianti, A (eds.) Compositional data analysis: Theory and applications, pp. 141–157. Wiley (2011)
  • (7) Egozcue, JJ, Pawlowsky-Glahn, V, Mateu-Figueras, G and Barceló-Vidal, C: Isometric Logratio Transformations for Compositional Data Analysis. Mathematical Geology 35(3), 279–300 (2003)
  • (8) Egozcue, JJ and Pawlowsky-Glahn, V: Groups of Parts and Their Balances in Compositional Data Analysis. Mathematical Geology 37(7), 795–828 (2005)
  • (9) Hron, K, Filzmoser, P, de Caritat, P, Fišerová, E, Gardlo, A: Weighted Pivot Coordinates for Compositional Data and Their Application to Geochemical Mapping. Mathematical Geosciences 49, 797–814 (2017)
  • (10) Chentsov, N: Statistical Decision Rules and Optimal Inference (vol. 53), Nauka (1972) (in Russian); English translation in: Math. Monograph. (vol. 53), Am. Math. Soc. (1982)
  • (11) Rao, CR: Information and the accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 37, 81–89 (1945)
  • (12) Amari, S and Nagaoka, H: Methods of Information Geometry. Translations of Mathematical Monographs (vol. 191), American Mathematical Society (2000)
  • (13) Amari, S: Differential-Geometric Methods in Statistics. Lecture Notes in Statistics (vol. 28), Springer (1985)
  • (14) Ay, N, Jost, J, Le, HV, Schwachhöfer, L: Information Geometry. A Series of Modern Surveys in Mathematics (vol. 64), Springer (2017)
  • (15) Amari, S: Information Geometry and Its Applications. Applied Mathematical Sciences (vol. 194), Springer (2016)
  • (16) Whittaker, J: Graphical models in applied multivariate statistics. Wiley (1990)
  • (17) Ay, N and Erb, I: On a notion of linear replicator equations. J Dyn. Diff. Eqs., 17 (2), 427-451 (2005)
  • (18) Martín-Fernández, JA, Bren, M, Barceló-Vidal, C and Pawlowsky-Glahn, V: A Measure of Difference for Compositional Data based on measures of divergence. In: Lippard, SJ, Naess, A and Sinding-Larsen, R (eds.) Proceedings of the Fifth Annual Conference of the International Association for Mathematical Geology, vol. 1, pp. 211–215. Trondheim, Norway (1999)
  • (19) Rényi, A: On measures of entropy and information. In: Proceedings of the 4th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 547–561. University of California Press, Berkeley (1961)
  • (20) Tsallis, C: Possible generalization of Boltzmann–Gibbs statistics. J. Stat. Phys. 52, 479–487 (1988)
  • (21) Greenacre, M: Power transformations in correspondence analysis. Computational Statistics & Data Analysis, 53(8), 3107-3116 (2009)
  • (22) Greenacre, M: Log-Ratio Analysis Is a Limiting Case of Correspondence Analysis. Math Geosci 42, 129 (2010)
  • (23) Bhattacharyya, A: On a measure of divergence between two statistical populations defined by their probability distributions. Bulletin of the Calcutta Mathematical Society 35, 99–109 (1943)
  • (24) Chentsov, N: Algebraic foundation of mathematical statistics. Math. Oper.forsch. Stat., Ser. Stat. 9, 267–276 (1978)
  • (25) Palarea-Albaladejo, J, Martín-Fernández, JA and Soto, JA: Dealing with Distances and Transformations for Fuzzy C-Means Clustering of Compositional Data. Journal of Classification 29(2), 144–169 (2012)
  • (26) Quinn, TP, Erb, I: Amalgams: data-driven amalgamation for the dimensionality reduction of compositional data. NAR Genomics and Bioinformatics 2(4), lqaa076 (2021)
  • (27) Egozcue, JJ and Pawlowsky-Glahn, V: Evidence functions: a compositional approach to information. SORT 42 (2), 101-124 (2018)
  • (28) Greenacre, M: Amalgamations are valid in compositional data analysis, can be used in agglomerative clustering, and their logratios have an inverse transformation. Appl. Comp. Geosc. 5, 100017 (2020)
  • (29) Barceló-Vidal, C, Martín-Fernández, JA and Mateu-Figueras, G: Compositional Differential Calculus on the Simplex. In: Pawlowsky-Glahn, V and Buccianti, A (eds.) Compositional data analysis: Theory and applications, pp. 176–190. Wiley (2011)
  • (30) Egozcue, JJ, Pawlowsky-Glahn, V, Tolosana-Delgado, R, Ortego, MI and van den Boogaart, KG: Bayes spaces: use of improper distributions and exponential families. Revista de la Real Academia de Ciencias Exactas, Físicas y Naturales. Serie A. Matemáticas 107(2), 475–486 (2013)