
Ionas Erb
Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; email: [email protected]

Nihat Ay
Max-Planck Institute for Mathematics in the Sciences, Leipzig, Germany;
Department of Mathematics and Computer Science, Leipzig University, Leipzig, Germany;
Santa Fe Institute, Santa Fe, NM, USA; email: [email protected]

The Information-Geometric Perspective of Compositional Data Analysis

Ionas Erb and Nihat Ay
Abstract

Information geometry uses the formal tools of differential geometry to describe the space of probability distributions as a Riemannian manifold with an additional dual structure. The formal equivalence of compositional data with discrete probability distributions makes it possible to apply the same description to the sample space of Compositional Data Analysis (CoDA). The latter has been formally described as a Euclidean space with an orthonormal basis featuring components that are suitable combinations of the original parts. In contrast to the Euclidean metric, the information-geometric description singles out the Fisher information metric as the only one keeping the manifold's geometric structure invariant under equivalent representations of the underlying random variables. Well-known concepts that are valid in Euclidean coordinates, e.g., the Pythagorean theorem, are generalized by information geometry to corresponding notions that hold for more general coordinates. In briefly reviewing Euclidean CoDA and, in more detail, the information-geometric approach, we show how the latter justifies the use of distance measures and divergences that so far have received little attention in CoDA as they do not fit the Euclidean geometry favoured by current thinking. We also show how Shannon entropy and relative entropy can describe amalgamations in a simple way, while Aitchison distance requires the use of geometric means to obtain more succinct relationships. We proceed to prove the information monotonicity property for Aitchison distance. We close with some thoughts about new directions in CoDA where the rich structure that is provided by information geometry could be exploited.

1 Introduction

Information geometry and Compositional Data Analysis (CoDA) are fields that have ignored each other so far. Independently, both have found powerful descriptions that led to a deeper understanding of the geometric relationships between their respective objects of interest: probability distributions and compositional data. Although both of these live on the same mathematical space (the simplex), and some of the mathematical structures are identically described, surprisingly, both fields have come to focus on quite different geometric aspects. On the one hand, the tools of differential geometry have revealed the underlying duality of the manifold of probability distributions, with the Fisher information metric playing a central role. On the other hand, the classical log-ratio approach led to the modern description of the compositional sample space as Euclidean and affine. We think it is time that CoDA starts to profit from the rich structures that information geometry has to offer. This paper intends to build some bridges from information geometry to CoDA. In the first section, we will give a brief description of the Euclidean CoDA perspective. The second, and main, part of the paper describes in some detail the approach centred around the Fisher metric, with a description of the dual coordinates, exponential families and of how dual divergence functions generalize the notion of Euclidean distance. To ease understanding, throughout this section we link those concepts to the ones familiar in CoDA. In the third part of the paper, we show how information-based measures can lead to simpler expressions when amalgamations of parts are involved, and an important monotonicity result that holds for relative entropy is derived for Aitchison distance. We conclude with a short discussion about where we could go from here.

2 The Euclidean CoDA perspective

Compositional data analysis is now unthinkable without the log-ratio approach pioneered by Aitchison AitchisonBook . It has led both to a variety of data-analytic developments (for the most recent review, see green ) and to more formal mathematical descriptions (see sampleSpace ). Following these, compositions can be described as equivalence classes whose representatives are points in a Euclidean space. We will give a brief recount here for completeness of exposition. Compositional data are defined as vectors of strictly positive numbers describing the parts of a whole for which the relevant information is only relative. As such, the absolute size of a $D$-part composition $\boldsymbol{x}\in\mathbb{R}^{D}_{+}$ is irrelevant, and all the information is conveyed by the ratios of its components. This can further be formalized by considering equivalent compositions $\boldsymbol{x}$, $\boldsymbol{y}$ such that $\boldsymbol{y}=c\boldsymbol{x}$ for a positive constant $c$. A composition is then an equivalence class of such proportional vectors mathCoDA . A closed composition is the simplicial representative $\mathcal{C}\boldsymbol{x}:=\boldsymbol{x}/\sum_{i}x_{i}$, where the symbol $\mathcal{C}$ denotes the closure operation (i.e., the division by the sum over the components). Closed compositions are elements of the simplex

\mathcal{S}^{D}:=\left\{(x_{1},\dots,x_{D})^{T}\in\mathbb{R}^{D}: x_{i}>0,\ i=1,\dots,D,\ \sum_{i=1}^{D}x_{i}=1\right\},   (1)

where $T$ denotes transposition. The simplex $\mathcal{S}^{D}$ can be equipped with a Euclidean structure by the vector space operations of perturbation and powering (playing the role of vector addition and scalar multiplication in real vector spaces), defined by

๐’™โŠ•๐’š\displaystyle\boldsymbol{x}\oplus\boldsymbol{y} :=\displaystyle:= ๐’žโ€‹(x1โ€‹y1,โ€ฆ,xDโ€‹yD)T,\displaystyle\mathcal{C}(x_{1}y_{1},\dots,x_{D}y_{D})^{T}, (2)
ฮฑโŠ™๐’™\displaystyle\alpha\odot\boldsymbol{x} :=\displaystyle:= ๐’žโ€‹(x1ฮฑ,โ€ฆ,xDฮฑ)T.\displaystyle\mathcal{C}(x_{1}^{\alpha},\dots,x_{D}^{\alpha})^{T}. (3)

An inverse perturbation is given by $\ominus\boldsymbol{x}:=(-1)\odot\boldsymbol{x}$.
As a vector space, $\mathcal{S}^{D}$ also carries the structure of an affine space, and we can study affine subspaces, which are referred to as linear manifolds in simplicialGeometry . In order to do so, we require a set of vectors $\boldsymbol{x}_{1},\dots,\boldsymbol{x}_{m}$, which we assume to be linearly independent, and an origin point $\boldsymbol{x}_{0}$. Here, independence means the following: Let $\boldsymbol{n}=\mathcal{C}(1,\dots,1)$ be the neutral element. A set of $m$ compositions $\boldsymbol{x}_{1},\dots,\boldsymbol{x}_{m}\in\mathcal{S}^{D}$ is called perturbation-independent if $\boldsymbol{n}=\bigoplus_{i=1}^{m}(\alpha_{i}\odot\boldsymbol{x}_{i})$ implies $\alpha_{1}=\dots=\alpha_{m}=0$. With this, an affine subspace is given as the set of compositions $\boldsymbol{y}$ such that

๐’š=๐’™0โŠ•โจi=1m(ฮฑiโŠ™๐’™i)\boldsymbol{y}=\boldsymbol{x}_{0}\oplus\bigoplus_{i=1}^{m}(\alpha_{i}\odot\boldsymbol{x}_{i}) (4)

for any real constants $\alpha_{i}$, $i=1,\dots,m$. Due to the linear independence of the vectors $\boldsymbol{x}_{1},\dots,\boldsymbol{x}_{m}$, this is an $m$-dimensional space.
It is convenient to define the inner product for our Euclidean space via the so-called centred log-ratio transformation. Its definition logratioTrans and inverse operation are

\boldsymbol{v} = \mathrm{clr}(\boldsymbol{x}) := \left(\log\frac{x_{1}}{g(\boldsymbol{x})},\dots,\log\frac{x_{D}}{g(\boldsymbol{x})}\right)^{T},   (5)
\boldsymbol{x} = \mathrm{clr}^{-1}(\boldsymbol{v}) = \mathcal{C}\exp(\boldsymbol{v}),   (6)

where $g$ denotes the geometric mean $g(\boldsymbol{x})=\left(\prod_{i=1}^{D}x_{i}\right)^{\frac{1}{D}}$. Note that the sum over the components $\mathrm{clr}_{i}$ of clr-transformed vectors is 0. An inner product can then be defined by

โŸจ๐’™,๐’šโŸฉA:=โˆ‘i=1Dclriโ€‹(๐’™)โ€‹clriโ€‹(๐’š).\left<\boldsymbol{x},\boldsymbol{y}\right>_{A}:=\sum_{i=1}^{D}\mathrm{clr}_{i}(\boldsymbol{x})\mathrm{clr}_{i}(\boldsymbol{y}). (7)

The corresponding (squared) norm and distance are

โˆฅ๐’™โˆฅA2=โŸจ๐’™,๐’™โŸฉA,dA2โ€‹(๐’™,๐’š)=โˆฅ๐’™โŠ–๐’šโˆฅA2.\left\lVert\boldsymbol{x}\right\rVert^{2}_{A}=\left<\boldsymbol{x},\boldsymbol{x}\right>_{A},~{}~{}~{}~{}~{}d^{2}_{A}(\boldsymbol{x},\boldsymbol{y})=\left\lVert\boldsymbol{x}\ominus\boldsymbol{y}\right\rVert^{2}_{A}. (8)

This distance is known as the Aitchison distance (denoted by the subscript $A$). Note that the clr-transformation is an isometry $\mathcal{S}^{D}\rightarrow\mathcal{T}^{D}\subset\mathbb{R}^{D}$ between $(D-1)$-dimensional Euclidean spaces, where

\mathcal{T}^{D}:=\left\{\boldsymbol{v}\in\mathbb{R}^{D}:\sum_{i=1}^{D}v_{i}=0\right\}.   (9)
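
For illustration, the following minimal numerical sketch (Python with NumPy; the helper functions are ours and not part of any CoDA package) implements closure, perturbation, powering, the clr transform and the Aitchison distance, and checks that $d^{2}_{A}(\boldsymbol{x},\boldsymbol{y})=\lVert\boldsymbol{x}\ominus\boldsymbol{y}\rVert^{2}_{A}$:

import numpy as np

def closure(x):
    """Closure operation C: divide by the sum over the components."""
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def perturb(x, y):
    """Perturbation x (+) y of Eq. (2)."""
    return closure(x * y)

def power(alpha, x):
    """Powering alpha (.) x of Eq. (3)."""
    return closure(x ** alpha)

def clr(x):
    """Centred log-ratio transform of Eq. (5); the result sums to zero."""
    lx = np.log(x)
    return lx - lx.mean()

def aitchison_dist2(x, y):
    """Squared Aitchison distance of Eq. (8), computed in clr coordinates."""
    return np.sum((clr(x) - clr(y)) ** 2)

x = closure([1.0, 2.0, 3.0])
y = closure([2.0, 1.0, 1.0])
# d_A^2(x, y) equals the squared Aitchison norm of x (-) y
print(np.isclose(aitchison_dist2(x, y),
                 np.sum(clr(perturb(x, power(-1.0, y))) ** 2)))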

We can thus obtain orthonormal bases in $\mathcal{S}^{D}$ from orthonormal bases in $\mathcal{T}^{D}$. Such orthonormal basis vectors are defined by the columns $\boldsymbol{v}_{i}$, $i=1,\dots,D-1$, of $\mathbf{V}$, a matrix of order $D\times(D-1)$ obeying

\mathbf{V}^{T}\mathbf{V} = \mathbf{I}_{D-1},   (10)
\mathbf{V}\mathbf{V}^{T} = \mathbf{I}_{D}-\frac{1}{D}\mathbf{1}\mathbf{1}^{T},   (11)

with ๐ˆD\mathbf{I}_{D} the Dร—DD\times D identity matrix and ๐Ÿ\mathbf{1} a DD-dimensional vector where each component is 1. The first equation ensures orthonormality, the second equation makes sure components sum to zero. Now the vectors

\boldsymbol{e}_{i}=\mathrm{clr}^{-1}(\boldsymbol{v}_{i})=\mathcal{C}\exp(\boldsymbol{v}_{i})   (12)

constitute an orthonormal basis in $\mathcal{S}^{D}$. Euclidean coordinates $\boldsymbol{z}$ of $\boldsymbol{x}\in\mathcal{S}^{D}$ with respect to the basis $\boldsymbol{e}_{i}$, $i=1,\dots,D-1$, then follow from the so-called isometric log-ratio transformation ilr . Its definition and inverse operation are

\boldsymbol{z}=\mathrm{ilr}(\boldsymbol{x}) := \mathbf{V}^{T}\log(\boldsymbol{x}),   (13)
\boldsymbol{x}=\mathrm{ilr}^{-1}(\boldsymbol{z}) = \mathcal{C}\exp(\mathbf{V}\boldsymbol{z}).   (14)

The second equation shows the composition as generated from the coordinates $\boldsymbol{z}$. This can also be written in the usual component form using the basis vectors of Eq. (12):

\boldsymbol{x}=\bigoplus_{i=1}^{D-1}(z_{i}\odot\boldsymbol{e}_{i}).   (15)

Equations (10) and (11) characterize any orthonormal basis of $\mathcal{S}^{D}$. The matrix $\mathbf{V}$ is known as a contrast matrix, and many choices for it are possible. Balance coordinates balances are popular for their relative simplicity and interpretability, while pivot coordinates pivot are sometimes preferred.
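
As an illustration (a sketch, not the only possible construction), a contrast matrix satisfying Eqs. (10) and (11) can be obtained by orthogonalizing any basis of the zero-sum subspace; the helper names below are ours:

import numpy as np

def closure(x):
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def contrast_matrix(D):
    """One possible D x (D-1) contrast matrix V: QR-orthogonalize columns
    that already sum to zero, so that Eqs. (10) and (11) hold."""
    A = np.eye(D)[:, :D - 1] - 1.0 / D   # columns e_i - (1/D) 1 span T^D
    Q, _ = np.linalg.qr(A)
    return Q

def ilr(x, V):
    """Isometric log-ratio transform, Eq. (13)."""
    return V.T @ np.log(x)

def ilr_inv(z, V):
    """Inverse ilr transform, Eq. (14)."""
    return closure(np.exp(V @ z))

D = 4
V = contrast_matrix(D)
print(np.allclose(V.T @ V, np.eye(D - 1)))                    # Eq. (10)
print(np.allclose(V @ V.T, np.eye(D) - np.ones((D, D)) / D))  # Eq. (11)
x = closure([1.0, 2.0, 3.0, 4.0])
print(np.allclose(ilr_inv(ilr(x, V), V), x))                  # Eq. (14) inverts Eq. (13)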

3 The information-geometric perspective

Information geometry started out as a study of the geometry of statistical estimation. The set of probability distributions that constitute a statistical model is seen as a manifold whose invariant geometric structures Chentsov are studied in relation to the statistical estimation using this model. A Riemannian metric and a family of affine connections are naturally introduced on such a manifold Rao . It turns out that a fundamental duality underlies these structures AmariNagaoka . While the first ideas about the geometry of statistical estimation are from the first half of the 20th century and go back to a variety of authors, a first attempt at a unified exposition of the topic by Amari lectureNotes was published around a similar time as Aitchison's book.

Here we try to highlight some of the main concepts, borrowing from chapter two in Nihat , but mainly following the treatment in Amari . The latter also showcases the many applications information geometry has found during the last decades. While the best known book on the topic may be AmariNagaoka , the most formal and complete exposition can currently be found in Nihat .
The ideas that require more advanced notions from differential geometry will not be touched upon in our short outline.

3.1 Dual coordinates and Fisher metric in finite information geometry

To exploit the equivalence of probability distributions with compositional data, we only consider the manifold of discrete distributions $\mathcal{S}^{D}$. To emphasize the equivalence, we will denote the probabilities by the same symbol as our compositions, i.e., $\boldsymbol{x}\in\mathcal{S}^{D}$ is a vector of $D$ probabilities. To complete the probabilistic picture, we need a random variable $R$ which can take values $r\in\{1,\dots,D\}$ with the respective probabilities $x_{r}=\mathbb{P}\{R=r\}$, the coordinates of $\boldsymbol{x}$. The distribution of this random variable can now be written as

pR=๐’™=(x1,โ€ฆ,xD)Tโˆˆ๐’ฎD.p_{R}=\boldsymbol{x}=(x_{1},\dots,x_{D})^{T}\in\mathcal{S}^{D}. (16)

From the information-geometric perspective, there are two natural ways to parametrize the set $\mathcal{S}^{D}$ of all (strictly positive) distributions of $R$. The first possibility is quite obvious. The first $D-1$ probabilities in Eq. (16), that is $x_{1},\dots,x_{D-1}$, are free to be specified and can be considered parameters, which we denote by $\boldsymbol{\eta}=(x_{1},\dots,x_{D-1})^{T}$. In these coordinates, probability distributions are written as

p(r;\boldsymbol{\eta})=\begin{cases}\eta_{r}, & \text{if } r\leq D-1\\ 1-\sum_{i=1}^{D-1}\eta_{i}, & \text{if } r=D,\end{cases}\qquad r=1,\dots,D.   (17)

Alternatively, our distribution can be parametrized using what is known in CoDA as the alr-transformation logratioTrans :

\theta^{i}=\log\frac{x_{i}}{x_{D}},\qquad i=1,\dots,D-1.   (18)

With this, we can write our distribution in the form

p(r;\boldsymbol{\theta})=\exp\left(\sum_{k=1}^{D-1}\theta^{k}\mathbbm{1}_{k}(r)-\psi(\boldsymbol{\theta})\right),\qquad r=1,\dots,D,   (19)

where $\mathbbm{1}_{k}(r)=1$ if $r=k$, and $\mathbbm{1}_{k}(r)=0$ otherwise. The function $\psi$ ensures normalization, that is

\psi(\boldsymbol{\theta})=\log\left(1+\sum_{i=1}^{D-1}e^{\theta^{i}}\right)=-\log x_{D}.   (20)
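
A small numerical sketch (our own helper code, Python/NumPy) of the parametrization in Eqs. (18)-(20): starting from a composition $\boldsymbol{x}$, the $\boldsymbol{\theta}$ coordinates and the normalizer $\psi$ recover the original probabilities, and $\psi(\boldsymbol{\theta})=-\log x_{D}$:

import numpy as np

x = np.array([0.1, 0.2, 0.3, 0.4])           # a point in S^D with D = 4
theta = np.log(x[:-1] / x[-1])               # alr coordinates, Eq. (18)
psi = np.log1p(np.exp(theta).sum())          # normalizer, Eq. (20)

# Eq. (19): p(r; theta) = exp(theta_r - psi) for r < D and exp(-psi) for r = D
p = np.exp(np.append(theta, 0.0) - psi)
print(np.allclose(p, x))                     # the parametrization recovers x
print(np.isclose(psi, -np.log(x[-1])))       # psi(theta) = -log x_D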

The parametrization of Eq. (19) in terms of $\boldsymbol{\theta}\in\mathbb{R}^{D-1}$ can be used in order to define a linear structure on $\mathcal{S}^{D}$. The addition of two distributions $p(\cdot;\boldsymbol{\theta}_{1})$ and $p(\cdot;\boldsymbol{\theta}_{2})$ can simply be defined by taking their product and then normalizing. With this vector addition, denoted by $\oplus$, we obviously have

p(r;\boldsymbol{\theta}_{1})\oplus p(r;\boldsymbol{\theta}_{2}) = p(r;\boldsymbol{\theta}_{1}+\boldsymbol{\theta}_{2}).   (21)

Thus, the operation $\oplus$ is consistent with the usual addition in the parameter space $\mathbb{R}^{D-1}$. The multiplication of a distribution $p(\cdot;\boldsymbol{\theta})$ with a scalar $\alpha\in\mathbb{R}$ can be defined correspondingly by raising it to the power $\alpha$ and then normalizing. This defines a scalar multiplication $\odot$, and we have

\alpha\odot p(r;\boldsymbol{\theta}) = p(r;\alpha\cdot\boldsymbol{\theta}).   (22)

Obviously, the scalar multiplication $\odot$ is consistent with the usual multiplication in the parameter space. Note that the vector space structure defined here, which is well known in information geometry, coincides with the structure defined by equations (2) and (3). Given the linear structure, we can consider affine subspaces of $\mathcal{S}^{D}$. These are well-known and fundamental families in statistics, statistical physics, and information geometry, the so-called exponential families. We basically obtain them from the representation of Eq. (19) if we replace the functions $\mathbbm{1}_{k}$ by $d$ ($d\leq D-1$) functions $X_{k}:\{1,\dots,D\}\to\mathbb{R}$, and shift the whole family by some reference measure $p_{0}$:

p(r;\boldsymbol{\theta})=p_{0}(r)\,\exp\left(\sum_{k=1}^{d}\theta^{k}X_{k}(r)-\psi(\boldsymbol{\theta})\right),\qquad r=1,\dots,D.   (23)

Here, $\psi(\boldsymbol{\theta})$ ensures normalization, but it does not reduce to the simple structure of Eq. (20) in general. In statistical physics, the function $\psi$ is known as the free energy (in other contexts it is also known as the cumulant-generating function). Note that, given the same linear structure on $\mathcal{S}^{D}$, the exponential families coincide with the linear manifolds of Eq. (4), which were introduced into the field of Compositional Data Analysis more recently.

In what follows we restrict attention to the parametrizations of equations (17) and (19) of the full simplex $\mathcal{S}^{D}$ as one instance of the general structure that underlies information geometry. The function $\psi$, given by Eq. (20), is a convex function, and we can get back the $\boldsymbol{\eta}$ coordinates from it via a Legendre transformation AmariNagaoka

\boldsymbol{\eta}=\nabla\psi(\boldsymbol{\theta}),   (24)

where $\nabla$ denotes the vector of partial derivatives $(\partial\psi/\partial\theta^{i})_{i=1}^{D-1}$. The Legendre dual of $\psi(\boldsymbol{\theta})$ is another convex function defined by

\phi(\boldsymbol{\eta})=\max_{\boldsymbol{\theta}}\left\{\boldsymbol{\theta}\cdot\boldsymbol{\eta}-\psi(\boldsymbol{\theta})\right\},   (25)

which is given by the negative Shannon entropy

\phi(\boldsymbol{\eta})=\sum_{i=1}^{D-1}\eta_{i}\log\eta_{i}+\left(1-\sum_{i=1}^{D-1}\eta_{i}\right)\log\left(1-\sum_{i=1}^{D-1}\eta_{i}\right).   (26)

In analogy to Eq. (24), from $\phi(\boldsymbol{\eta})$ we can get back the $\boldsymbol{\theta}$ coordinates:

\boldsymbol{\theta}=\nabla\phi(\boldsymbol{\eta}).   (27)
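
The Legendre relations (24)-(27) are easy to check numerically; the sketch below (Python/NumPy, with a hand-rolled finite-difference gradient rather than a library routine) verifies that the gradient of $\psi$ returns $\boldsymbol{\eta}$ and the gradient of $\phi$ returns $\boldsymbol{\theta}$:

import numpy as np

def psi(theta):
    """Free energy of Eq. (20)."""
    return np.log1p(np.exp(theta).sum())

def phi(eta):
    """Negative Shannon entropy of Eq. (26)."""
    xD = 1.0 - eta.sum()
    return np.sum(eta * np.log(eta)) + xD * np.log(xD)

def num_grad(f, v, h=1e-6):
    """Central finite-difference gradient."""
    g = np.zeros_like(v)
    for i in range(v.size):
        e = np.zeros_like(v)
        e[i] = h
        g[i] = (f(v + e) - f(v - e)) / (2 * h)
    return g

x = np.array([0.1, 0.2, 0.3, 0.4])
theta, eta = np.log(x[:-1] / x[-1]), x[:-1]
print(np.allclose(num_grad(psi, theta), eta, atol=1e-6))    # Eq. (24)
print(np.allclose(num_grad(phi, eta), theta, atol=1e-4))    # Eq. (27)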

There is thus a fundamental duality mediated by the Legendre transformation which links the two types of parameters $\boldsymbol{\eta}$ and $\boldsymbol{\theta}$ as well as the convex functions $\psi$ and $\phi$. Legendre transformations are well known to play a fundamental role in phenomenological thermodynamics. Their importance for information geometry was established by Amari and Nagaoka AmariNagaoka .
In what follows, we use the parameters $\boldsymbol{\theta}$. Like $\boldsymbol{\eta}$, they define a (local) coordinate system of the manifold $\mathcal{S}^{D}$. From each point $\boldsymbol{\theta}$, $D-1$ coordinate curves $\gamma_{i}:t\mapsto\gamma_{i}(t)$ in $\mathcal{S}^{D}$, $\gamma_{i}(0)=p_{\boldsymbol{\theta}}$, emerge when holding $D-2$ of the $\theta^{i}$ constant. Consider their velocities

\mathrm{e}_{i}:=\left.\frac{d}{dt}\gamma_{i}(t)\right|_{t=0},\qquad i=1,\dots,D-1.   (28)

At the point $\boldsymbol{\theta}$ itself, the $D-1$ vectors $\mathrm{e}_{i}$ pointing in the direction of each coordinate curve form a basis of the so-called tangent space $T_{\boldsymbol{\theta}}\mathcal{S}^{D}=\mathcal{T}^{D}$ at this point, see Figure 1a). Similarly, we can define vectors $\mathrm{e}^{i}$ with respect to the coordinates $\boldsymbol{\eta}$, which also span the tangent space.

Figure 1: a) A manifold (in black) and its tangent space at the point $p_{\boldsymbol{\theta}}$ (in red), where two coordinate curves (dashed lines) cross. The velocities of the coordinate curves are the basis vectors $\mathrm{e}_{1}$ and $\mathrm{e}_{2}$. b) The exponential map and its inverse via the tangent space anchored at a reference point $\boldsymbol{n}$. The points $\boldsymbol{x}$ and $\boldsymbol{y}$ are mapped via $\mathrm{exp}_{\boldsymbol{n}}^{-1}$ to the tangent space. The difference vector between the two mapped points is $\boldsymbol{v}=\mathrm{vec}(\boldsymbol{x},\boldsymbol{y})$.

For a Riemannian manifold, a metric tensor $\mathbf{G}$ is defined. This metric is obtained via an inner product on the tangent space:

g_{ij}=\left<\mathrm{e}_{i},\mathrm{e}_{j}\right>,   (29)

which depends on the coordinates chosen. The coordinate system is Euclidean if $g_{ij}=\delta_{ij}$ (this is the case for the parametrization achieved by Eq. (13)). For the Riemannian metric of exponential families, the basis vectors can be identified with the so-called score function $\partial\log p(r;\boldsymbol{\theta})/\partial\theta^{i}$ (the score function plays an important role in maximum-likelihood estimation). The resulting Riemannian metric is known as the Fisher information matrix:

g_{ij}(\boldsymbol{\theta}) = E_{\boldsymbol{\theta}}\left[\frac{\partial}{\partial\theta^{i}}\log p(r;\boldsymbol{\theta})\,\frac{\partial}{\partial\theta^{j}}\log p(r;\boldsymbol{\theta})\right]   (30)
= E_{\boldsymbol{\theta}}\left[(\mathbbm{1}_{i}-E(\mathbbm{1}_{i}))(\mathbbm{1}_{j}-E(\mathbbm{1}_{j}))\right]   (31)
= \begin{cases}-\eta_{i}\eta_{j}, & \text{if } i\neq j\\ \eta_{i}(1-\eta_{i}), & \text{if } i=j,\end{cases}   (34)

with $E$ denoting the expectation value, and

\eta_{i}=\frac{e^{\theta^{i}}}{1+\sum_{k=1}^{D-1}e^{\theta^{k}}},\qquad i=1,\dots,D-1.

Note that this is the covariance matrix of the random vector $(\mathbbm{1}_{1},\dots,\mathbbm{1}_{D-1})$, as expressed by Eq. (31). Strictly convex functions have positive definite Hessian matrices; here, their elements are given by the Fisher metric itself:

g_{ij}(\boldsymbol{\theta}) = \frac{\partial}{\partial\theta^{i}}\frac{\partial}{\partial\theta^{j}}\psi(\boldsymbol{\theta}),   (35)
g^{ij}(\boldsymbol{\eta}) = \frac{\partial}{\partial\eta_{i}}\frac{\partial}{\partial\eta_{j}}\phi(\boldsymbol{\eta}).   (36)

The second matrix is the inverse of the first, which follows from their Legendre duality. This means the second matrix is an inverse covariance, an important object in the theory of graphical models graphMod . Although the Fisher metric is not Euclidean, i.e., $\left<\mathrm{e}_{i},\mathrm{e}_{j}\right>\neq\delta_{ij}$, we do have a generalization of this when mixing the dual coordinates: $\left<\mathrm{e}_{i},\mathrm{e}^{j}\right>=\delta_{ij}$.
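
The following sketch (closed forms plus a hand-rolled numerical Hessian; all helpers are ours) checks Eqs. (34)-(36) at one example point: the covariance form, the Hessian of $\psi$, and the fact that the Hessian of $\phi$ is the inverse matrix:

import numpy as np

def psi(theta):
    return np.log1p(np.exp(theta).sum())

def num_hess(f, v, h=1e-4):
    """Central finite-difference Hessian."""
    n = v.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.zeros(n), np.zeros(n)
            ei[i], ej[j] = h, h
            H[i, j] = (f(v + ei + ej) - f(v + ei - ej)
                       - f(v - ei + ej) + f(v - ei - ej)) / (4 * h * h)
    return H

x = np.array([0.1, 0.2, 0.3, 0.4])
eta, theta = x[:-1], np.log(x[:-1] / x[-1])

G = np.diag(eta) - np.outer(eta, eta)                    # Eq. (34), covariance of the indicators
print(np.allclose(num_hess(psi, theta), G, atol=1e-5))   # Eq. (35)
H_phi = np.diag(1.0 / eta) + 1.0 / (1.0 - eta.sum())     # Hessian of phi in closed form
print(np.allclose(H_phi, np.linalg.inv(G)))              # Eq. (36): the inverse metric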

Note that for a general exponential family of the form of Eq. (23), the Fisher information matrix can be expressed as a covariance matrix, which generalizes Eq. (31):

g_{ij}(\boldsymbol{\theta}) = E_{\boldsymbol{\theta}}\left[\frac{\partial}{\partial\theta^{i}}\log p(r;\boldsymbol{\theta})\,\frac{\partial}{\partial\theta^{j}}\log p(r;\boldsymbol{\theta})\right]   (37)
= E_{\boldsymbol{\theta}}\left[(X_{i}-E(X_{i}))(X_{j}-E(X_{j}))\right].   (38)

3.2 Distance measures and divergences

Our affine structure on $\mathcal{S}^{D}$ can be reformulated additively via an exponential map $\mathcal{S}^{D}\times\mathcal{T}^{D}\to\mathcal{S}^{D}$, $(\boldsymbol{x},\boldsymbol{v})\mapsto\boldsymbol{y}$:

\boldsymbol{y}=\exp_{\boldsymbol{x}}(\boldsymbol{v}) := \boldsymbol{x}\oplus e^{\boldsymbol{v}},   (39)
\boldsymbol{v}=\mathrm{vec}(\boldsymbol{x},\boldsymbol{y}) := \exp^{-1}_{\boldsymbol{x}}(\boldsymbol{y})=\mathrm{clr}(\boldsymbol{y})-\mathrm{clr}(\boldsymbol{x}),   (40)

using the notation introduced in Section 2, and shown here together with its inverse. This map is used in AyErb , where ordinary linear differential equations are considered, with the time derivative defined for $\mathrm{vec}(\boldsymbol{x}(t_{0}),\boldsymbol{x}(t))$ letting $t\to t_{0}$. These turn out to be replicator equations with special properties that are known from population dynamics. Note that, e.g., with the center of the simplex $\boldsymbol{n}$ as defined before, we have $\mathrm{vec}(\boldsymbol{n},\boldsymbol{x})=\mathrm{clr}(\boldsymbol{x})$. With the exponential map, we can interpret $\mathrm{vec}(\boldsymbol{x},\boldsymbol{y})$ as the difference vector between two compositions, see Figure 1b). The set of all such difference vectors for a given point can be interpreted as the gradient field of a convex function (a.k.a. a potential). In order to highlight the generality of this concept in information geometry, we consider a general convex function $U$ with respect to some parameters $\boldsymbol{\eta}$ from a convex domain. In the following, we will denote the compositions at which the parameters are evaluated by subscripts. The linearization of $U$ at $\boldsymbol{\eta}_{\boldsymbol{y}}$ is given by

\overline{U}(\boldsymbol{\eta}_{\boldsymbol{x}}) := U(\boldsymbol{\eta}_{\boldsymbol{y}})+\nabla U(\boldsymbol{\eta}_{\boldsymbol{y}})\cdot(\boldsymbol{\eta}_{\boldsymbol{x}}-\boldsymbol{\eta}_{\boldsymbol{y}}).

The graph of this linearization is a hyperplane of dimension $D-1$ touching the graph of $U$ at the point $(\boldsymbol{\eta}_{\boldsymbol{y}},U(\boldsymbol{\eta}_{\boldsymbol{y}}))$, see Figure 2. The difference between $U$ and its linearization $\overline{U}$ at $\boldsymbol{\eta}_{\boldsymbol{y}}$, evaluated at $\boldsymbol{\eta}_{\boldsymbol{x}}$, defines a so-called Bregman divergence, a class of divergences that plays an important role in information geometry. More precisely,

D_{U}(\boldsymbol{x}||\boldsymbol{y}) := U(\boldsymbol{\eta}_{\boldsymbol{x}})-\overline{U}(\boldsymbol{\eta}_{\boldsymbol{x}}) = U(\boldsymbol{\eta}_{\boldsymbol{x}})-U(\boldsymbol{\eta}_{\boldsymbol{y}})-\nabla U(\boldsymbol{\eta}_{\boldsymbol{y}})\cdot(\boldsymbol{\eta}_{\boldsymbol{x}}-\boldsymbol{\eta}_{\boldsymbol{y}}).   (41)

Figure 2: The graph of the potential $U$, of its linearization $\overline{U}$, and a visualization of the corresponding Bregman divergence $D_{U}$. The gradient vector $\nabla U$ points in the direction of greatest change of the function $U$ in the space of the coordinates $\boldsymbol{\eta}$.

Divergences are similar to distance functions but they are not necessarily symmetric and need not fulfill the triangle inequality. As an example, let us consider the potential naturally associated with the structure of equations (39) and (40), the squared Aitchison norm

A(\boldsymbol{z}_{\boldsymbol{x}})=\sum_{i=1}^{D}\mathrm{clr}_{i}(\boldsymbol{x})^{2}=\sum_{i=1}^{D-1}\mathrm{ilr}_{i}(\boldsymbol{x})^{2},   (42)

with $\mathrm{ilr}_{i}$ the $i$-th ilr coordinate $z_{i}$, see Eq. (13), and $\boldsymbol{z}_{\boldsymbol{x}}$ the vector of coordinates $z_{i}$. We then have

D_{A}(\boldsymbol{x}||\boldsymbol{y})=\sum_{i=1}^{D-1}(\mathrm{ilr}_{i}(\boldsymbol{x})-\mathrm{ilr}_{i}(\boldsymbol{y}))^{2}=d^{2}_{A}(\boldsymbol{x},\boldsymbol{y}),   (43)

coinciding with the squared Aitchison distance. This is the special case of a Euclidean divergence, which is also a (squared) distance function.
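
A generic Bregman divergence is easy to code; the sketch below (our own helpers, Python/NumPy) evaluates Eq. (41) with the squared norm in clr coordinates as the potential and confirms that it returns the squared Aitchison distance of Eq. (43):

import numpy as np

def closure(x):
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def clr(x):
    lx = np.log(x)
    return lx - lx.mean()

def bregman(U, grad_U, px, py):
    """Bregman divergence D_U(x||y) of Eq. (41), in coordinates p."""
    return U(px) - U(py) - grad_U(py) @ (px - py)

# potential A of Eq. (42): the squared norm, written here in clr coordinates
U = lambda v: np.sum(v ** 2)
grad_U = lambda v: 2.0 * v

x = closure([1.0, 2.0, 3.0, 4.0])
y = closure([4.0, 3.0, 2.0, 1.0])
print(np.isclose(bregman(U, grad_U, clr(x), clr(y)),
                 np.sum((clr(x) - clr(y)) ** 2)))   # Eq. (43)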

Let us now come to the divergences that arise when replacing $U(\boldsymbol{\eta})$ in Eq. (41) by our dually convex functions $\psi(\boldsymbol{\theta})$ and $\phi(\boldsymbol{\eta})$. They turn out to be the relative entropies (a.k.a. Kullback-Leibler divergences)

D_{\phi}(\boldsymbol{x}||\boldsymbol{y}) = \sum_{i=1}^{D}x_{i}\log\frac{x_{i}}{y_{i}},   (44)
D_{\psi}(\boldsymbol{x}||\boldsymbol{y}) = \sum_{i=1}^{D}y_{i}\log\frac{y_{i}}{x_{i}}.   (45)

Thus the symmetry we had in the Euclidean case finds its generalization for our dual case in

D_{\psi}(\boldsymbol{x}||\boldsymbol{y})=D_{\phi}(\boldsymbol{y}||\boldsymbol{x}).   (46)

Moreover, one can show that we can "complete the square" via

D_{\psi}(\boldsymbol{x}||\boldsymbol{y})=\psi(\boldsymbol{\theta}_{\boldsymbol{x}})+\phi(\boldsymbol{\eta}_{\boldsymbol{y}})-\boldsymbol{\theta}_{\boldsymbol{x}}\cdot\boldsymbol{\eta}_{\boldsymbol{y}}.   (47)
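
As a numerical check (a sketch with our own helpers), the identity of Eq. (47) can be verified directly: with $D_{\psi}(\boldsymbol{x}||\boldsymbol{y})=\sum_{i}y_{i}\log(y_{i}/x_{i})$, the right-hand side built from $\psi$, $\phi$ and the dual coordinates gives the same number:

import numpy as np

def closure(x):
    x = np.asarray(x, dtype=float)
    return x / x.sum()

x = closure([1.0, 2.0, 3.0, 4.0])
y = closure([4.0, 3.0, 2.0, 1.0])

D_psi = np.sum(y * np.log(y / x))            # Eq. (45)
theta_x = np.log(x[:-1] / x[-1])             # theta coordinates of x
eta_y = y[:-1]                               # eta coordinates of y
psi_x = np.log1p(np.exp(theta_x).sum())      # psi(theta_x), Eq. (20)
phi_y = np.sum(y * np.log(y))                # phi(eta_y), Eq. (26)

print(np.isclose(D_psi, psi_x + phi_y - theta_x @ eta_y))   # Eq. (47)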

Of course, symmetrizations of relative entropy exist, with the Jensen-Shannon divergence perhaps the most prominent example. Also, a symmetric "compositional" version of relative entropy has been proposed because of its additional properties that are often regarded as indispensable in CoDA symmDiv (as an example, the translation invariance of distance measures, known under the name of perturbation invariance in CoDA, has its information-geometric analog in the invariance of the inner product of tangent vectors $\mathbf{u}$, $\mathbf{v}$ under parallel transport $P$ and its dual $P^{*}$: $\left<\mathbf{u},\mathbf{v}\right>=\left<P(\mathbf{u}),P^{*}(\mathbf{v})\right>$). While such measures have some interesting properties, they do not make use of the duality of our parametrizations and are therefore less suitable for our approach. Indeed, although a dual divergence is not a distance measure in the strict sense, it can quantify the distance between points along a curve in a similar way, and generalizations of well-known relationships from Euclidean geometry are available. Geodesic lines connecting two compositions can be constructed via convex combinations of the parameters $\boldsymbol{\theta}$, and the corresponding dual geodesics from convex combinations of the dual parameters $\boldsymbol{\eta}$. Such geodesics are orthogonal to each other when the inner product of their tangent vectors, with respect to the Fisher metric, vanishes at the point of intersection. In this case, a generalized Pythagorean theorem holds for the corresponding dual divergence, e.g.,

D_{\psi}(\boldsymbol{x}||\boldsymbol{y})=D_{\psi}(\boldsymbol{x}||\boldsymbol{z})+D_{\psi}(\boldsymbol{z}||\boldsymbol{y}),   (48)

where the geodesics $t\boldsymbol{\eta}_{\boldsymbol{x}}+(1-t)\boldsymbol{\eta}_{\boldsymbol{z}}$ and $t\boldsymbol{\theta}_{\boldsymbol{z}}+(1-t)\boldsymbol{\theta}_{\boldsymbol{y}}$ intersect orthogonally.

3.3 Distances obtained from the Fisher metric compared with those used in CoDA

We have seen that the potential of Eq. (42) led to a divergence that is also a Euclidean distance. Let us now consider a generalization of our dual divergences that includes as a special case a Euclidean distance that is related to the Fisher metric. The so-called $\alpha$-divergence (closely related to the Rényi Renyi and Tsallis Tsallis entropies) is defined as

D_{\alpha}(\boldsymbol{x}||\boldsymbol{y})=\frac{4}{1-\alpha^{2}}\left(1-\sum_{i=1}^{D}y_{i}^{\frac{1+\alpha}{2}}x_{i}^{\frac{1-\alpha}{2}}\right).   (49)

In the limit, the cases $\alpha=\pm 1$ correspond to $D_{\psi}$ and $D_{\phi}$. The case $\alpha=0$ (where the divergence is self-dual, i.e., symmetric) corresponds to the Euclidean distance between the points $(\sqrt{x_{1}},\dots,\sqrt{x_{D}})$ and $(\sqrt{y_{1}},\dots,\sqrt{y_{D}})$:

d^{2}_{H}(\boldsymbol{x},\boldsymbol{y}) = \sum_{i=1}^{D}(\sqrt{x_{i}}-\sqrt{y_{i}})^{2}
= \sum_{i=1}^{D}(x_{i}-2\sqrt{x_{i}y_{i}}+y_{i})
= 2\left(1-\sum_{i=1}^{D}\sqrt{x_{i}y_{i}}\right)
= \frac{1}{2}\,D_{0}(\boldsymbol{x}||\boldsymbol{y}).   (51)

This is the so-called Hellinger distance. It is closely related to the Riemannian distance (the Riemannian distance between two points on a manifold is the minimum of the lengths of all piecewise smooth paths joining the two points) between two compositions with respect to the Fisher metric. This so-called Fisher distance can be expressed explicitly Nihat as

d^{2}_{F}(\boldsymbol{x},\boldsymbol{y})=4\,\arccos^{2}\left(\sum_{i=1}^{D}\sqrt{x_{i}y_{i}}\right),   (52)

and its relation to the Hellinger distance is given by

d^{2}_{H}(\boldsymbol{x},\boldsymbol{y})=2\left(1-\cos\frac{d_{F}(\boldsymbol{x},\boldsymbol{y})}{2}\right).   (53)
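
The relations between the $\alpha$-divergence, the Hellinger distance and the Fisher distance can be checked numerically; the sketch below (our own helpers, Python/NumPy) verifies Eqs. (51) and (53) and illustrates the behaviour near $\alpha=-1$:

import numpy as np

def closure(x):
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def alpha_div(x, y, a):
    """Alpha-divergence of Eq. (49)."""
    return 4.0 / (1.0 - a * a) * (1.0 - np.sum(y ** ((1 + a) / 2) * x ** ((1 - a) / 2)))

x = closure([1.0, 2.0, 3.0, 4.0])
y = closure([4.0, 3.0, 2.0, 1.0])

dH2 = np.sum((np.sqrt(x) - np.sqrt(y)) ** 2)          # squared Hellinger distance
dF = 2.0 * np.arccos(np.sum(np.sqrt(x * y)))          # Fisher distance, Eq. (52)

print(np.isclose(dH2, 0.5 * alpha_div(x, y, 0.0)))        # Eq. (51)
print(np.isclose(dH2, 2.0 * (1.0 - np.cos(dF / 2.0))))    # Eq. (53)
# near alpha = -1 the alpha-divergence approaches D_phi(x||y), the relative entropy
print(np.isclose(alpha_div(x, y, -0.999), np.sum(x * np.log(x / y)), atol=1e-2))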

These distances can be better understood when noting the role played by the angle between the two points on the sphere, i.e., when comparing Eq. (52) with the cosine of the angle $\varphi$ between the rays going from the origin through the transformed compositions (see Fig. 3), also referred to as the Bhattacharyya coefficient Bcoeff :

\cos\varphi=\sum_{i=1}^{D}\sqrt{x_{i}y_{i}}.   (54)

Figure 3: Hellinger and Fisher distance between compositions $\boldsymbol{x}$ and $\boldsymbol{y}$. Shown are the 2-simplex and the positive orthant of a sphere with radius 2. The gray points indicate the square-root-transformed compositions. They lie on a sphere with radius one, and their Euclidean distance is the Hellinger distance. The lines from the origin going through these points cross the sphere with radius two. The length of the arc between the resulting points is the Fisher distance, while their Euclidean distance is twice the Hellinger distance.

Let us now come back to the Aitchison distance and discuss in which structural aspects it differs from divergences obtained from the Fisher metric. In fact, in data analysis, parametrized classes of distance measures are common green3 . In the same way as in Eq. (49), they are mediated by the Box-Cox transformation, which has the limit

\lim_{\beta\to 0}\frac{x^{\beta}-1}{\beta}=\log(x).   (55)

This has been applied in CoDA to obtain log-ratio analysis as a Correspondence Analysis of power-transformed data green1 . There, we have the following family of distance measures:

d^{2}_{\beta}(\boldsymbol{x},\boldsymbol{y})=\frac{1}{\beta^{2}}\sum_{i=1}^{D}\omega_{i}\left(\mathcal{C}(\boldsymbol{x}^{\beta})_{i}-\mathcal{C}(\boldsymbol{y}^{\beta})_{i}\right)^{2},   (56)

where the $\omega_{i}$ are suitable weights. For the case $\beta=1$, this is the (square of the) symmetric $\chi^{2}$-distance used in Correspondence Analysis, while $\beta=0$ gives the Aitchison distance (when $\omega_{i}=D^{2}$ for all $i$); this can be seen as the high-temperature limit in statistical physics, and a proof is given in the Appendix. Although the case $\beta=1/2$ has a direct relationship with it, the Hellinger distance does not form part of this family because of the closure operation that makes us stay inside the simplex. Similarly, the Aitchison distance cannot be obtained from the $\alpha$-divergences of Eq. (49). The $\alpha$-divergences are included in a more general class of divergences known under the name of $f$-divergences. They have the form

D_{f}(\boldsymbol{x},\boldsymbol{y})=\sum_{i=1}^{D}x_{i}f\left(\frac{y_{i}}{x_{i}}\right),   (57)

where $f$ is a convex function. It is a well-established result in information geometry that $f$-divergences are the only decomposable divergences (a divergence is decomposable if it can be written as a sum of terms that only depend on individual components) that behave monotonically under coarse-graining of information Amari , i.e., when compositional parts are amalgamated into higher-level parts. This important invariance property is called information monotonicity. The Aitchison distance is not decomposable, as each summand uses information from all compositional parts via their geometric mean. Nevertheless, in the next section we will show that it fulfills information monotonicity.
It is interesting to note that the two Euclidean distances (Hellinger and Aitchison) are each related to different isometries of the simplex. In the case of the Hellinger distance, compositions are isometrically mapped into the positive orthant of the sphere (note, however, that this mapping does not yield the spherical representative of the composition in the sense of the definition of an equivalence class). This isometry also holds for the Fisher metric itself Nihat (which makes it possible to view the Fisher metric as a Euclidean metric on the $\mathbb{R}^{D}$ in which the sphere is embedded). In the case of the Aitchison distance, the isometry in question is of central interest in CoDA: it is the clr transformation, i.e., the map between $\mathcal{S}^{D}$ and $\mathcal{T}^{D}$. Here, however, there is no corresponding isometry of the Fisher metric. Although a Euclidean metric may appear convenient, it does not have the same desirable properties as the Fisher metric. In fact, a central result in information geometry states that the Fisher metric is the only metric Chentsov2 that stays invariant under coarse graining of information.
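
For illustration, the limit behind Eq. (56) can be checked numerically (a sketch with our own helpers): for a small $\beta$ and weights $\omega_{i}=D^{2}$, the Box-Cox family essentially reproduces the Aitchison distance:

import numpy as np

def closure(x):
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def clr(x):
    lx = np.log(x)
    return lx - lx.mean()

def d_beta2(x, y, beta, w):
    """Squared distance of Eq. (56) with weights w."""
    return np.sum(w * (closure(x ** beta) - closure(y ** beta)) ** 2) / beta ** 2

D = 4
x = closure([1.0, 2.0, 3.0, 4.0])
y = closure([4.0, 3.0, 2.0, 1.0])
dA2 = np.sum((clr(x) - clr(y)) ** 2)
print(d_beta2(x, y, 1e-4, np.full(D, float(D * D))), dA2)   # nearly equal for small beta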

4 Information monotonicity of Aitchison distance

4.1 Amalgamations lead to coarse grained information

Let us denote by $\mathcal{A}$ a subset of $\mathcal{D}:=\{1,\dots,D\}$, and let $\boldsymbol{x}_{\mathcal{A}}$ denote the corresponding subcomposition, i.e., the vector from which the parts with indices not belonging to $\mathcal{A}$ have been removed. Let $H(\boldsymbol{x})$ denote the Shannon entropy of the composition $\boldsymbol{x}$, i.e., the negative of the potential, $-\phi(\boldsymbol{\eta}_{\boldsymbol{x}})$. Consider its decomposition

H(\boldsymbol{x})=(1-s(\boldsymbol{x}_{\mathcal{A}}))H(\mathcal{C}\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}})+s(\boldsymbol{x}_{\mathcal{A}})H(\mathcal{C}\boldsymbol{x}_{\mathcal{A}})+H(s(\boldsymbol{x}_{\mathcal{A}}),1-s(\boldsymbol{x}_{\mathcal{A}})),   (58)

where $s(\boldsymbol{x}_{\mathcal{A}})=\sum_{i\in\mathcal{A}}x_{i}$. We see that this is a convex combination of the entropies of the two subcompositions plus a binary entropy, where all terms involve the amalgamation $s(\boldsymbol{x}_{\mathcal{A}})$. Probabilistically speaking, this particular amalgamation corresponds to the probability that any of the events in $\mathcal{A}$ occurs, i.e., of the events that are left out to obtain the subcomposition $\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}}$. Generally, amalgamation is nothing else but a coarse graining of the events and their probabilities. The corresponding coarse graining of information is described by this alternative decomposition of the Shannon entropy:

H(\boldsymbol{x})=H(\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}},s(\boldsymbol{x}_{\mathcal{A}}))+s(\boldsymbol{x}_{\mathcal{A}})H(\mathcal{C}\boldsymbol{x}_{\mathcal{A}}).   (59)

As the second summand is greater than or equal to zero, this also shows that information cannot grow under coarse graining.
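
Both decompositions are straightforward to verify numerically; the following sketch (our own helpers, with an arbitrarily chosen subset $\mathcal{A}$) checks Eqs. (58) and (59) for one example:

import numpy as np

def closure(x):
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def H(p):
    """Shannon entropy."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

x = closure([1.0, 2.0, 3.0, 4.0, 5.0])
A = np.array([0, 1])                 # indices of the amalgamated parts (our choice)
xA, xRest = x[A], np.delete(x, A)
s = xA.sum()                         # the amalgamation s(x_A)

rhs58 = (1 - s) * H(closure(xRest)) + s * H(closure(xA)) + H([s, 1 - s])
rhs59 = H(np.append(xRest, s)) + s * H(closure(xA))
print(np.isclose(H(x), rhs58))       # Eq. (58)
print(np.isclose(H(x), rhs59))       # Eq. (59)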

4.2 The notion of monotonicity for divergences and distance measures

These considerations lead us to an important result that concerns the divergence associated with the potential $\phi(\boldsymbol{\eta}_{\boldsymbol{x}})$, i.e., the relative entropy. This result is the information monotonicity under coarse graining, where the notion of monotonicity is somewhat related to the notion of subcompositional dominance. The latter refers to the property that a measure of distance does not increase when evaluating it on a subset of parts only. This is often seen as a desirable property of distances in CoDA (and is not fulfilled by distances like Hellinger and Bhattacharyya, see clustering for a discussion of distance measures with respect to compositions). A similar, but perhaps more natural, requirement that has not received attention yet in the CoDA community is that a distance between compositions should not increase when comparing it with the one obtained after amalgamating over a subset of parts. (Subcompositional coherence, i.e., the fundamental requirement that quantities remain identical on a renormalized subcomposition, is not an issue for amalgamation: after amalgamation there is no need for renormalization.) As we have seen in the previous subsection, we cannot gain information when amalgamating parts, so we should lose resolution when comparing the amalgamated compositions. This is also related to the notion of a sufficient statistic, see Amari . The information monotonicity property of relative entropy can be expressed as

D_{\phi}(\boldsymbol{x}||\boldsymbol{y})\geq D_{\phi}\left((\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}},s(\boldsymbol{x}_{\mathcal{A}}))\,||\,(\boldsymbol{y}_{\mathcal{D}\backslash\mathcal{A}},s(\boldsymbol{y}_{\mathcal{A}}))\right).   (60)

This result can be shown for the more general case of $f$-divergences and continuous distributions using Jensen's inequality AmariNagaoka .
Note that in balances , when discussing amalgamations of parts, the notion of "monotonicity" is used differently. There, the authors argue against amalgamations, referring to the observation that Aitchison distances between amalgamated compositions and the amalgamated center of the simplex show a non-monotonic behaviour along an ilr-coordinate axis defined before amalgamation. We will show below that information monotonicity does hold for the Aitchison distance. We see the lack of distance monotonicity discussed in balances rather as an argument against the use of a Euclidean coordinate system here.
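
The inequality of Eq. (60) is easy to probe empirically; the sketch below (our own helpers, Python/NumPy) amalgamates a fixed subset of parts for random pairs of compositions and checks that the relative entropy never increases:

import numpy as np

rng = np.random.default_rng(0)

def closure(x):
    return x / x.sum()

def kl(p, q):
    """Relative entropy D_phi(p||q) of Eq. (44)."""
    return np.sum(p * np.log(p / q))

def amalgamate(x, A):
    """Replace the parts indexed by A with their amalgamation s(x_A)."""
    return np.append(np.delete(x, A), x[A].sum())

D, A = 6, np.array([0, 1, 2])
for _ in range(1000):
    x, y = closure(rng.random(D)), closure(rng.random(D))
    assert kl(x, y) >= kl(amalgamate(x, A), amalgamate(y, A)) - 1e-12
print("Eq. (60) held in all random trials")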

4.3 Monotonicity of Aitchison distance

A symmetrized version of relative entropy has recently been used in the context of data-driven amalgamation amalgams , where it was shown to be better preserved between samples than the Aitchison distance. While the information-theoretic meaning and the mathematical properties reflected in the decompositions shown in Section 4.1 make Shannon entropy an ideal measure of information, alternative indices that can sometimes be evaluated more easily on real-world data (e.g., making use of sums of squares) have also been considered. In our context, it is interesting that $A(\boldsymbol{z}_{\boldsymbol{x}})$, the potential of Eq. (42), has been proposed as an alternative measure of information within an attempt to reformulate information theory from a CoDA point of view E&PGinfo . More recently, it has also been proposed as an inequality index (when divided by the number of parts $D$) sampleSpace . There, the following decomposition was shown:

||\boldsymbol{x}||_{A}^{2}=||(\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}},\boldsymbol{x}_{\mathcal{A}})||_{A}^{2}=||\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}}||_{A}^{2}+||\boldsymbol{x}_{\mathcal{A}}||_{A}^{2}+\frac{a(D-a)}{D}\left(\log\frac{g(\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}})}{g(\boldsymbol{x}_{\mathcal{A}})}\right)^{2},   (61)

where we denote the set size of $\mathcal{A}$ by $a$. If we now want to decompose with respect to a composition that was partly amalgamated, we find a corresponding relationship that is more complicated (it is interesting to note that the two interaction terms have the form of squares of the balance and pivot coordinates mentioned in Section 2):

||\boldsymbol{x}||_{A}^{2}=||(\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}},s(\boldsymbol{x}_{\mathcal{A}}))||_{A}^{2}+||\boldsymbol{x}_{\mathcal{A}}||_{A}^{2}+\frac{a(D-a)}{D}\left(\log\frac{g(\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}})}{g(\boldsymbol{x}_{\mathcal{A}})}\right)^{2}-\frac{D-a}{D-a+1}\left(\log\frac{g(\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}})}{s(\boldsymbol{x}_{\mathcal{A}})}\right)^{2}.   (62)

Clearly, if we replace the amalgamation $s(\boldsymbol{x}_{\mathcal{A}})$ by the geometric mean $g(\boldsymbol{x}_{\mathcal{A}})$, we get a simpler equality:

||\boldsymbol{x}||_{A}^{2}=||(\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}},g(\boldsymbol{x}_{\mathcal{A}}))||_{A}^{2}+||\boldsymbol{x}_{\mathcal{A}}||_{A}^{2}+\left(\frac{a(D-a)}{D}-\frac{D-a}{D-a+1}\right)\left(\log\frac{g(\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}})}{g(\boldsymbol{x}_{\mathcal{A}})}\right)^{2}.   (63)

Aggregating by geometric means or by amalgamations has been a subject of debate in the CoDA community sampleSpace ; amalgGreen . As we can see, measures like the Aitchison norm lend themselves much better to taking geometric means than to amalgamations. There is, however, no straightforward probabilistic interpretation of geometric means (the product over parts specifies the probability that all events in the subset occur, but this is then re-scaled by the exponent to the probability of a single event), and the more elegant formal expressions that result often suffer from reduced interpretability.
To the best of our knowledge, the information monotonicity property in its general form has not yet been considered for the Aitchison distance. We here exploit the various decompositions stated above to prove it. The results are summarized in the following propositions; all proofs can be found in the Appendix.

Proposition 1

Let $\mathcal{D}$ and $\mathcal{A}\subset\mathcal{D}$ be two index sets with sizes $D$ and $a$, respectively. Further, let $\boldsymbol{x}$ and $\boldsymbol{y}$ be the simplicial representatives of two compositions in $\mathcal{S}^{D}$. Let the amalgamation of $\boldsymbol{x}$ over the subset $\mathcal{A}$ of parts be denoted by $s(\boldsymbol{x}_{\mathcal{A}})=\sum_{i\in\mathcal{A}}x_{i}$. Then the following decomposition of the Aitchison distance holds:

||\boldsymbol{x}\ominus\boldsymbol{y}||_{A}^{2}=||(\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}},s(\boldsymbol{x}_{\mathcal{A}}))\ominus(\boldsymbol{y}_{\mathcal{D}\backslash\mathcal{A}},s(\boldsymbol{y}_{\mathcal{A}}))||_{A}^{2}+||\boldsymbol{x}_{\mathcal{A}}\ominus\boldsymbol{y}_{\mathcal{A}}||_{A}^{2}
+\frac{a(D-a)}{D}\left(\log\frac{g(\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}})}{g(\boldsymbol{x}_{\mathcal{A}})}-\log\frac{g(\boldsymbol{y}_{\mathcal{D}\backslash\mathcal{A}})}{g(\boldsymbol{y}_{\mathcal{A}})}\right)^{2}
-\frac{D-a}{D-a+1}\left(\log\frac{g(\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}})}{s(\boldsymbol{x}_{\mathcal{A}})}-\log\frac{g(\boldsymbol{y}_{\mathcal{D}\backslash\mathcal{A}})}{s(\boldsymbol{y}_{\mathcal{A}})}\right)^{2}.   (64)
Corollary 1

When aggregating a subset of parts in the form of their geometric mean, we have the following decomposition of the Aitchison distance:

||\boldsymbol{x}\ominus\boldsymbol{y}||_{A}^{2}=||(\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}},g(\boldsymbol{x}_{\mathcal{A}}))\ominus(\boldsymbol{y}_{\mathcal{D}\backslash\mathcal{A}},g(\boldsymbol{y}_{\mathcal{A}}))||_{A}^{2}+||\boldsymbol{x}_{\mathcal{A}}\ominus\boldsymbol{y}_{\mathcal{A}}||_{A}^{2}
+\left(\frac{a(D-a)}{D}-\frac{D-a}{D-a+1}\right)\left(\log\frac{g(\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}})}{g(\boldsymbol{x}_{\mathcal{A}})}-\log\frac{g(\boldsymbol{y}_{\mathcal{D}\backslash\mathcal{A}})}{g(\boldsymbol{y}_{\mathcal{A}})}\right)^{2}.   (65)

From this decomposition, we get the following monotonicity result:

Corollary 2

With parts aggregated by geometric means, the following inequality holds

||\boldsymbol{x}\ominus\boldsymbol{y}||_{A}^{2}\geq||(\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}},g(\boldsymbol{x}_{\mathcal{A}}))\ominus(\boldsymbol{y}_{\mathcal{D}\backslash\mathcal{A}},g(\boldsymbol{y}_{\mathcal{A}}))||_{A}^{2}+||\boldsymbol{x}_{\mathcal{A}}\ominus\boldsymbol{y}_{\mathcal{A}}||_{A}^{2}.

As we can see, for geometric-mean summaries, the sum of the interaction terms (i.e., of the terms not involving norms) remains greater than or equal to zero. This is no longer true for the amalgamation of parts, and it is less straightforward to show the corresponding inequality:

Proposition 2

Aitchison distance fulfills the information monotonicity

||\boldsymbol{x}\ominus\boldsymbol{y}||_{A}^{2}\geq||(\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}},s(\boldsymbol{x}_{\mathcal{A}}))\ominus(\boldsymbol{y}_{\mathcal{D}\backslash\mathcal{A}},s(\boldsymbol{y}_{\mathcal{A}}))||_{A}^{2}.
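
The decomposition of Proposition 1 and the inequality of Proposition 2 can be probed with random compositions; the sketch below (our own helpers, one fixed subset $\mathcal{A}$) checks Eq. (64) exactly and the monotonicity inequality up to numerical tolerance:

import numpy as np

rng = np.random.default_rng(1)

def closure(x):
    return x / x.sum()

def clr(x):
    lx = np.log(x)
    return lx - lx.mean()

def dA2(x, y):
    """Squared Aitchison distance."""
    return np.sum((clr(x) - clr(y)) ** 2)

def g(x):
    """Geometric mean."""
    return np.exp(np.mean(np.log(x)))

D, A = 6, np.array([0, 1, 2])
a = len(A)
for _ in range(1000):
    x, y = closure(rng.random(D)), closure(rng.random(D))
    xA, xR, yA, yR = x[A], np.delete(x, A), y[A], np.delete(y, A)
    sx, sy = xA.sum(), yA.sum()
    amal = dA2(np.append(xR, sx), np.append(yR, sy))
    rhs = (amal + dA2(xA, yA)
           + a * (D - a) / D * (np.log(g(xR) / g(xA)) - np.log(g(yR) / g(yA))) ** 2
           - (D - a) / (D - a + 1) * (np.log(g(xR) / sx) - np.log(g(yR) / sy)) ** 2)
    assert np.isclose(dA2(x, y), rhs)         # Eq. (64)
    assert dA2(x, y) >= amal - 1e-12          # Proposition 2
print("Eq. (64) and Proposition 2 held in all random trials")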

5 Discussion and outlook

In our little outline of finite information geometry, we could but scratch the surface of the formal apparatus that is at our disposal. We are certain it can serve to advance the field of Compositional Data Analysis in various ways. Differential geometry provides a universally valid framework for the problems occurring with constrained data. Considering the simplex a differentiable manifold enables a general approach from which specific problems like the compositional differential calculus compDiffCalc follow naturally. Clearly, there is (and has to be) overlap in methodology between the information-geometric perspective and the CoDA approach. An example is the use of the exponential map anchored at the center of the simplex discussed in Section 3.2, which allows us to identify the simplex with a linear space that is central to the Euclidean CoDA approach. Another example is the fundamental role that exponential families play in information geometry and which have been studied in the CoDA context in the so-called Bayes spaces BayesSpaces . But we also think that some of the current limitations of CoDA can be overcome using the additional structures that information geometry can provide. The ease with which amalgamations of parts can be handled by Kullback-Leibler divergence might partly resolve the debate surrounding this issue in the CoDA community. Further, maximum-entropy projections, where Kullback-Leibler divergences are the central tool, seem an especially promising avenue to pursue in the context of data that are only partially available or subject to constraints. Also, our description has focused on the equivalence of compositions with discrete probability distributions, but information geometry can of course be used to describe the distributions themselves. These are no longer finite but continuous and contain a constraint that introduces dependencies among their random variables, calling for the use of more general versions of the concepts presented here.

Appendix

Proof that Eq. (56) tends to the Aitchison distance for $\beta\rightarrow 0$

d^{2}_{\beta}(\boldsymbol{x},\boldsymbol{y})=\frac{1}{\beta^{2}}\sum_{i=1}^{D}\omega_{i}\left(\mathcal{C}(\boldsymbol{x}^{\beta})_{i}-\mathcal{C}(\boldsymbol{y}^{\beta})_{i}\right)^{2}=\sum_{i=1}^{D}\omega_{i}\left(\frac{x^{\beta}_{i}}{\beta\sum_{k=1}^{D}x^{\beta}_{k}}-\frac{y^{\beta}_{i}}{\beta\sum_{k=1}^{D}y^{\beta}_{k}}\right)^{2}.

The terms inside the bracket can be written as

\frac{x^{\beta}_{i}}{\beta\sum_{k=1}^{D}x^{\beta}_{k}}-\frac{y^{\beta}_{i}}{\beta\sum_{k=1}^{D}y^{\beta}_{k}}=\frac{x^{\beta}_{i}-\frac{1}{D}\sum_{k=1}^{D}x^{\beta}_{k}}{\beta\sum_{k=1}^{D}x^{\beta}_{k}}-\frac{y^{\beta}_{i}-\frac{1}{D}\sum_{k=1}^{D}y^{\beta}_{k}}{\beta\sum_{k=1}^{D}y^{\beta}_{k}}   (66)

when subtracting $1/(\beta D)$ and adding it back. Similarly,

=\frac{x^{\beta}_{i}-1-\frac{1}{D}\sum_{k=1}^{D}(x^{\beta}_{k}-1)}{\beta\sum_{k=1}^{D}x^{\beta}_{k}}-\frac{y^{\beta}_{i}-1-\frac{1}{D}\sum_{k=1}^{D}(y^{\beta}_{k}-1)}{\beta\sum_{k=1}^{D}y^{\beta}_{k}}.   (67)

In this, we recognize the Box-Cox transform in the numerators, and can make use of the limit in Eq. (55). The sums in the denominators clearly tend to $D$ for $\beta\rightarrow 0$. We can thus evaluate the limit as a quotient of finite limits and conclude

\lim_{\beta\to 0}d^{2}_{\beta}(\boldsymbol{x},\boldsymbol{y})=\sum_{i=1}^{D}\frac{\omega_{i}}{D^{2}}\left(\log\frac{x_{i}}{g(\boldsymbol{x})}-\log\frac{y_{i}}{g(\boldsymbol{y})}\right)^{2}.   (68)

Proof of Proposition 1

We start by defining $\boldsymbol{z}=(x_{i}/y_{i})_{i\in\mathcal{D}}$. We can now use the decomposition

||\boldsymbol{z}||_{A}^{2}=||\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}}||_{A}^{2}+||\boldsymbol{z}_{\mathcal{A}}||_{A}^{2}+\frac{a(D-a)}{D}\left(\log\frac{g(\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}})}{g(\boldsymbol{z}_{\mathcal{A}})}\right)^{2},   (69)

which can be derived using equalities like

gโ€‹(๐’™)=(gโ€‹(๐’™๐’œ)aโ€‹gโ€‹(๐’™๐’Ÿ\๐’œ)Dโˆ’a)1D=gโ€‹(๐’™๐’œ)โ€‹gโ€‹(๐’™๐’œ)aDโˆ’1โ€‹gโ€‹(๐’™๐’Ÿ\๐’œ)Dโˆ’aD,g(\boldsymbol{x})=\left(g(\boldsymbol{x}_{\mathcal{A}})^{a}g(\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}})^{D-a}\right)^{\frac{1}{D}}=g(\boldsymbol{x}_{\mathcal{A}})g(\boldsymbol{x}_{\mathcal{A}})^{\frac{a}{D}-1}g(\boldsymbol{x}_{\mathcal{D}\backslash\mathcal{A}})^{\frac{D-a}{D}},

which can be used to expand

โˆ‘iโˆˆ๐’œ(logโกzigโ€‹(๐’›))2=โˆ‘iโˆˆ๐’œ(logโกzigโ€‹(๐’›๐’œ)+Dโˆ’aDโ€‹logโกgโ€‹(๐’›๐’œ)gโ€‹(๐’›๐’Ÿ\๐’œ))2\sum_{i\in\mathcal{A}}\left(\log\frac{z_{i}}{g(\boldsymbol{z})}\right)^{2}=\sum_{i\in\mathcal{A}}\left(\log\frac{z_{i}}{g(\boldsymbol{z}_{\mathcal{A}})}+\frac{D-a}{D}\log\frac{g(\boldsymbol{z}_{\mathcal{A}})}{g(\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}})}\right)^{2}

and, expanding the square, the cross terms vanish after summation because $\sum_{i\in\mathcal{A}}\log(z_{i}/g(\boldsymbol{z}_{\mathcal{A}}))=0$.
Next, we observe that, for an arbitrary $s_{\boldsymbol{z}}$ that we join as an additional component to the vector $\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}}$, we have

โ€–(๐’›๐’Ÿ\๐’œ,s๐’›)โ€–A2=โ€–๐’›๐’Ÿ\๐’œโ€–A2+Dโˆ’aDโˆ’a+1โ€‹(logโกgโ€‹(๐’›๐’Ÿ\๐’œ)s๐’›)2||(\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}},s_{\boldsymbol{z}})||_{A}^{2}=||\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}}||_{A}^{2}+\frac{D-a}{D-a+1}\left(\log\frac{g(\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}})}{s_{\boldsymbol{z}}}\right)^{2} (70)

because, as before, we have

โˆ‘iโˆˆ๐’Ÿ\๐’œ(logโกzigโ€‹((๐’›๐’Ÿ\๐’œ,s๐’›)))2+(logโกs๐’›gโ€‹((๐’›๐’Ÿ\๐’œ,s๐’›)))2=โˆ‘iโˆˆ๐’Ÿ\๐’œ(logโกzigโ€‹(๐’›๐’Ÿ\๐’œ)+1Dโˆ’a+1โ€‹logโกgโ€‹(๐’›๐’Ÿ\๐’œ)s๐’›)2+(Dโˆ’aDโˆ’a+1โ€‹logโกs๐’›gโ€‹(๐’›๐’Ÿ\๐’œ))2.\sum_{i\in\mathcal{D}\backslash\mathcal{A}}\left(\log\frac{z_{i}}{g((\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}},s_{\boldsymbol{z}}))}\right)^{2}+\left(\log\frac{s_{\boldsymbol{z}}}{g((\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}},s_{\boldsymbol{z}}))}\right)^{2}=\\ \sum_{i\in\mathcal{D}\backslash\mathcal{A}}\left(\log\frac{z_{i}}{g(\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}})}+\frac{1}{D-a+1}\log\frac{g(\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}})}{s_{\boldsymbol{z}}}\right)^{2}+\left(\frac{D-a}{D-a+1}\log\frac{s_{\boldsymbol{z}}}{g(\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}})}\right)^{2}. (71)

We can now choose for $s_{\boldsymbol{z}}$ the expression $s_{\boldsymbol{z}}=(\sum_{i\in\mathcal{A}}x_{i})/(\sum_{i\in\mathcal{A}}y_{i})$. With this choice, we solve Eq. (70) for $||\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}}||_{A}^{2}$ and substitute the resulting expression into Eq. (69), which proves the proposition:

โ€–๐’›โ€–A2=โ€–(๐’›๐’Ÿ\๐’œ,s)โ€–A2+โ€–๐’›๐’œโ€–A2+aโ€‹(Dโˆ’a)Dโ€‹(logโกgโ€‹(๐’›๐’Ÿ\๐’œ)gโ€‹(๐’›๐’œ))2โˆ’Dโˆ’aDโˆ’a+1โ€‹(logโกgโ€‹(๐’›๐’Ÿ\๐’œ)(โˆ‘iโˆˆ๐’œxi)/(โˆ‘iโˆˆ๐’œyi))2.||\boldsymbol{z}||_{A}^{2}=||(\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}},s)||_{A}^{2}+||\boldsymbol{z}_{\mathcal{A}}||_{A}^{2}\\ +\frac{a(D-a)}{D}\left(\log\frac{g(\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}})}{g(\boldsymbol{z}_{\mathcal{A}})}\right)^{2}-\frac{D-a}{D-a+1}\left(\log\frac{g(\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}})}{(\sum_{i\in\mathcal{A}}x_{i})/(\sum_{i\in\mathcal{A}}y_{i})}\right)^{2}. (72)

Proof of Corollary 1

This follows from inserting $g(\boldsymbol{z}_{\mathcal{A}})$ for $s_{\boldsymbol{z}}$ in Eq. (70). This yields an expression for $||\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}}||_{A}^{2}$, which is then inserted into Eq. (69).
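With this choice, Eqs. (69) and (70) combine into the explicit decomposition below (we spell it out here since its last term is the one referred to in the proof of Corollary 2):

||\boldsymbol{z}||_{A}^{2}=||(\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}},g(\boldsymbol{z}_{\mathcal{A}}))||_{A}^{2}+||\boldsymbol{z}_{\mathcal{A}}||_{A}^{2}+\left(\frac{a(D-a)}{D}-\frac{D-a}{D-a+1}\right)\left(\log\frac{g(\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}})}{g(\boldsymbol{z}_{\mathcal{A}})}\right)^{2}.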

Proof of Corollary 2

To prove the corollary, we have to show that the last term in the decomposition of Corollary 1 is greater than or equal to zero. We thus need to show

(D-a)\,\frac{a(D-a+1)-D}{D(D-a+1)}\geq 0. (73)

Since $D>a>1$, and since the quadratic equation $a^{2}-a(D+1)+D=0$ has the solutions $a=1$ and $a=D$, the expression in Eq. (73) does not change sign between these values. It is positive there because its first derivative with respect to $a$ at $a=1$ equals $(D-1)^{2}/D^{2}$, which is greater than zero.
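Alternatively, the sign can be read off from a factorization of the numerator in Eq. (73), a small worked step added here for convenience:

a(D-a+1)-D=-\left(a^{2}-a(D+1)+D\right)=-(a-1)(a-D)=(a-1)(D-a),

so that for $1<a<D$ every factor appearing in Eq. (73) is positive.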

Proof of Proposition 2

Let ๐’›\boldsymbol{z} denote the vector with components xi/yix_{i}/y_{i}, i=1,โ€ฆ,Di=1,\dots,D. To prove the proposition, we have to bound the terms after the first plus sign in Proposition 1 from below, i.e.,

Rโ€‹(๐’™,๐’š):=โˆ‘iโˆˆ๐’œ(logโกzigโ€‹(๐’›๐’œ))2+aโ€‹(Dโˆ’a)Dโ€‹(logโกgโ€‹(๐’›๐’Ÿ\๐’œ)gโ€‹(๐’›๐’œ))2โˆ’Dโˆ’aDโˆ’a+1โ€‹(logโกgโ€‹(๐’›๐’Ÿ\๐’œ)(โˆ‘iโˆˆ๐’œxi)/(โˆ‘iโˆˆ๐’œyi))2โ‰ฅ0.R(\boldsymbol{x},\boldsymbol{y}):=\sum_{i\in\mathcal{A}}\left(\log\frac{z_{i}}{g(\boldsymbol{z}_{\mathcal{A}})}\right)^{2}+\frac{a(D-a)}{D}\left(\log\frac{g(\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}})}{g(\boldsymbol{z}_{\mathcal{A}})}\right)^{2}\\ -\frac{D-a}{D-a+1}\left(\log\frac{g(\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}})}{(\sum_{i\in\mathcal{A}}x_{i})/(\sum_{i\in\mathcal{A}}y_{i})}\right)^{2}\geq 0. (74)

Let us start with the second summand. We rewrite it as

\left(\frac{a(D-a)}{D}-\frac{D-a}{D-a+1}\right)\left(\log\frac{g(\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}})}{g(\boldsymbol{z}_{\mathcal{A}})}\right)^{2}+\frac{D-a}{D-a+1}\left(\log\frac{g(\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}})}{g(\boldsymbol{z}_{\mathcal{A}})}\right)^{2}.

Since we have shown Eq. (73), the summand on the left is greater than or equal to zero. Thus we have

Rโ€‹(๐’™,๐’š)โ‰ฅโˆ‘iโˆˆ๐’œ(logโกzigโ€‹(๐’›๐’œ))2+Dโˆ’aDโˆ’a+1โ€‹((logโกgโ€‹(๐’›๐’Ÿ\๐’œ)gโ€‹(๐’›๐’œ))2โˆ’(logโกgโ€‹(๐’›๐’Ÿ\๐’œ)(โˆ‘iโˆˆ๐’œxi)/(โˆ‘iโˆˆ๐’œyi))2)โ‰ฅโˆ‘iโˆˆ๐’œ(logโกzigโ€‹(๐’›๐’œ))2โˆ’(logโก(โˆ‘iโˆˆ๐’œxi)/(โˆ‘iโˆˆ๐’œyi)gโ€‹(๐’›๐’œ))2,R(\boldsymbol{x},\boldsymbol{y})\geq\sum_{i\in\mathcal{A}}\left(\log\frac{z_{i}}{g(\boldsymbol{z}_{\mathcal{A}})}\right)^{2}\\ +\frac{D-a}{D-a+1}\left(\left(\log\frac{g(\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}})}{g(\boldsymbol{z}_{\mathcal{A}})}\right)^{2}-\left(\log\frac{g(\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}})}{(\sum_{i\in\mathcal{A}}x_{i})/(\sum_{i\in\mathcal{A}}y_{i})}\right)^{2}\right)\\ \geq\sum_{i\in\mathcal{A}}\left(\log\frac{z_{i}}{g(\boldsymbol{z}_{\mathcal{A}})}\right)^{2}-\left(\log\frac{(\sum_{i\in\mathcal{A}}x_{i})/(\sum_{i\in\mathcal{A}}y_{i})}{g(\boldsymbol{z}_{\mathcal{A}})}\right)^{2}, (75)

where the second inequality follows because the expression in the large brackets (which comes with a prefactor smaller than one) has a structure that can be bounded via

|(A-B)^{2}-(A-C)^{2}|\leq(B-C)^{2},

with $\log g(\boldsymbol{z}_{\mathcal{D}\backslash\mathcal{A}})$ playing the role of $A$. Finally, the last term in Eq. (75) can be bounded from above as follows. Since $x_{i}\leq y_{i}\cdot\max_{k}x_{k}/y_{k}$, we also have

โˆ‘iโˆˆ๐’œxiโ‰คmaxkโˆˆ๐’œโกxkykโ€‹โˆ‘iโˆˆ๐’œyi,\sum_{i\in\mathcal{A}}x_{i}\leq\max_{k\in\mathcal{A}}\frac{x_{k}}{y_{k}}\sum_{i\in\mathcal{A}}y_{i}, (76)

so the ratio of the sums is bounded by the maximum over the component-wise ratios. Without loss of generality, let us assume that $g(\boldsymbol{x}_{\mathcal{A}})\geq g(\boldsymbol{y}_{\mathcal{A}})$, i.e., $g(\boldsymbol{z}_{\mathcal{A}})\geq 1$. The bound on the sum ratio implied by Eq. (76) is then sufficient for proving the proposition:

R(๐’™,๐’š)โ‰ฅmaxiโˆˆ๐’œ(logxi/yigโ€‹(๐’›๐’œ))2โˆ’(log(โˆ‘iโˆˆ๐’œxi)/(โˆ‘iโˆˆ๐’œyi)gโ€‹(๐’›๐’œ))2โ‰ฅ0.R(\boldsymbol{x},\boldsymbol{y})\geq\max_{i\in\mathcal{A}}\left(\log\frac{x_{i}/y_{i}}{g(\boldsymbol{z}_{\mathcal{A}})}\right)^{2}-\left(\log\frac{(\sum_{i\in\mathcal{A}}x_{i})/(\sum_{i\in\mathcal{A}}y_{i})}{g(\boldsymbol{z}_{\mathcal{A}})}\right)^{2}\geq 0. (77)

References

  • (1) Aitchison, J: The statistical analysis of compositional data. Chapman and Hall (1986)
  • (2) Greenacre, M: Compositional Data Analysis. Annual Review of Statistics and Its Application 8(1), 271–299 (2021)
  • (3) Egozcue, JJ and Pawlowsky-Glahn, V: Compositional data: the sample space and its structure. TEST 28(3), 599–638 (2019)
  • (4) Barceló-Vidal, C, Martín-Fernández, JA: The Mathematics of Compositional Analysis. Austrian Journal of Statistics 45(4), 57–71 (2016)
  • (5) Aitchison, J: The Statistical Analysis of Compositional Data. J Royal Stat Soc B 44(2), 139–160 (1982)
  • (6) Egozcue, JJ, Barceló-Vidal, C, Martín-Fernández, JA, Jarauta-Bragulat, E, Díaz-Barrero, JL, Mateu-Figueras, G, Pawlowsky-Glahn, V, Buccianti, A: Elements of simplicial linear algebra and geometry. In: Pawlowsky-Glahn, V and Buccianti, A (eds.) Compositional data analysis: Theory and applications, pp. 141–157. Wiley (2011)
  • (7) Egozcue, JJ, Pawlowsky-Glahn, V, Mateu-Figueras, G and Barceló-Vidal, C: Isometric Logratio Transformations for Compositional Data Analysis. Mathematical Geology 35(3), 279–300 (2003)
  • (8) Egozcue, JJ and Pawlowsky-Glahn, V: Groups of Parts and Their Balances in Compositional Data Analysis. Mathematical Geology 37(7), 795–828 (2005)
  • (9) Hron, K, Filzmoser, P, de Caritat, P, Fišerová, E, Gardlo, A: Weighted Pivot Coordinates for Compositional Data and Their Application to Geochemical Mapping. Mathematical Geosciences 49, 797–814 (2017)
  • (10) Chentsov, N: Statistical Decision Rules and Optimal Inference (vol. 53), Nauka (1972) (in Russian); English translation in: Math. Monograph. (vol. 53), Am. Math. Soc. (1982)
  • (11) Rao, CR: Information and the accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 37, 81–89 (1945)
  • (12) Amari, S and Nagaoka, H: Methods of Information Geometry. Translations of Mathematical Monographs (vol. 191), American Mathematical Society (2000)
  • (13) Amari, S: Differential-Geometric Methods in Statistics. Lecture Notes in Statistics (vol. 28), Springer (1985)
  • (14) Ay, N, Jost, J, Le, HV, Schwachhöfer, L: Information Geometry. A Series of Modern Surveys in Mathematics (vol. 64), Springer (2017)
  • (15) Amari, S: Information Geometry and Its Applications. Applied Mathematical Sciences (vol. 194), Springer (2016)
  • (16) Whittaker, J: Graphical models in applied multivariate statistics. Wiley (1990)
  • (17) Ay, N and Erb, I: On a notion of linear replicator equations. J Dyn. Diff. Eqs., 17 (2), 427-451 (2005)
  • (18) Martín-Fernández, JA, Bren, M, Barceló-Vidal, C and Pawlowsky-Glahn, V: A Measure of Difference for Compositional Data based on measures of divergence. In: Lippard, SJ, Naess, A and Sinding-Larsen, R (eds.) Proceedings of the Fifth Annual Conference of the International Association for Mathematical Geology, vol. 1, pp. 211–215. Trondheim, Norway (1999)
  • (19) Rényi, A: On measures of entropy and information. In: Proceedings of the 4th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 547–561. University of California Press, Berkeley (1961)
  • (20) Tsallis, C: Possible generalization of Boltzmann–Gibbs statistics. J. Stat. Phys. 52, 479–487 (1988)
  • (21) Greenacre, M: Power transformations in correspondence analysis. Computational Statistics & Data Analysis, 53(8), 3107-3116 (2009)
  • (22) Greenacre, M: Log-Ratio Analysis Is a Limiting Case of Correspondence Analysis. Math Geosci 42, 129 (2010)
  • (23) Bhattacharyya, A: On a measure of divergence between two statistical populations defined by their probability distributions. Bulletin of the Calcutta Mathematical Society 35, 99–109 (1943)
  • (24) Chentsov, N: Algebraic foundation of mathematical statistics. Math. Oper.forsch. Stat., Ser. Stat. 9, 267–276 (1978)
  • (25) Palarea-Albaladejo, J, Martín-Fernández, JA and Soto, JA: Dealing with Distances and Transformations for Fuzzy C-Means Clustering of Compositional Data. Journal of Classification 29(2), 144–169 (2012)
  • (26) Quinn, TP, Erb, I: Amalgams: data-driven amalgamation for the dimensionality reduction of compositional data. NAR Genomics and Bioinformatics 2(4), lqaa076 (2021)
  • (27) Egozcue, JJ and Pawlowsky-Glahn, V: Evidence functions: a compositional approach to information. SORT 42 (2), 101-124 (2018)
  • (28) Greenacre, M: Amalgamations are valid in compositional data analysis, can be used in agglomerative clustering, and their logratios have an inverse transformation. Appl. Comp. Geosc. 5, 100017 (2020)
  • (29) Barceló-Vidal, C, Martín-Fernández, JA and Mateu-Figueras, G: Compositional Differential Calculus on the Simplex. In: Pawlowsky-Glahn, V and Buccianti, A (eds.) Compositional data analysis: Theory and applications, pp. 176–190. Wiley (2011)
  • (30) Egozcue, JJ, Pawlowsky-Glahn, V, Tolosana-Delgado, R, Ortego, MI and van den Boogaart, KG: Bayes spaces: use of improper distributions and exponential families. Revista de la Real Academia de Ciencias Exactas, Físicas y Naturales. Serie A. Matemáticas 107(2), 475–486 (2013)