On the Normalizing Constant of
the Continuous Categorical Distribution
Abstract
Probability distributions supported on the simplex enjoy a wide range of applications across statistics and machine learning. Recently, a novel family of such distributions has been discovered: the continuous categorical. This family enjoys remarkable mathematical simplicity; its density function resembles that of the Dirichlet distribution, but with a normalizing constant that can be written in closed form using elementary functions only. In spite of this simplicity, our understanding of the normalizing constant remains far from complete. In this work, we characterize the numerical behavior of the normalizing constant and present theoretical and methodological advances that can, in turn, help to enable broader applications of the continuous categorical distribution. Our code is available at https://github.com/cunningham-lab/cb_and_cc/.
1 Introduction
The continuous categorical (CC) distribution is defined by the following density function (Gordon-Rodriguez et al., 2020a):
$$p(x; \lambda) \propto \prod_{i=1}^{K} \lambda_i^{x_i} \tag{1}$$
Here, $x$ denotes a simplex-valued random variable, and $\lambda$ denotes a simplex-valued parameter, in other words:¹ ¹Note that the $(K-1)$-simplex is also commonly defined as $\{x \in \mathbb{R}^{K} : x_i \geq 0,\ \sum_{i=1}^{K} x_i = 1\}$. The two definitions are equivalent, however $\Delta_{K-1}$ as defined in Eq. 2 is a subset of $\mathbb{R}^{K-1}$ with positive Lebesgue measure, whereas the alternative is a subset of $\mathbb{R}^{K}$ with zero Lebesgue measure. For this reason, using $\Delta_{K-1}$ will facilitate our later arguments involving integrals on the simplex.

$$x, \lambda \in \Delta_{K-1} = \left\{u \in \mathbb{R}^{K-1} : u_i > 0,\ \sum_{i=1}^{K-1} u_i < 1\right\} \tag{2}$$
where we additionally define the $K$th coordinates as the remainder:

$$x_K = 1 - \sum_{i=1}^{K-1} x_i \tag{3}$$

$$\lambda_K = 1 - \sum_{i=1}^{K-1} \lambda_i \tag{4}$$
It is natural to contrast the CC with the similar-looking Dirichlet distribution:
$$p(x; \alpha) \propto \prod_{i=1}^{K} x_i^{\alpha_i - 1} \tag{5}$$

where again $x \in \Delta_{K-1}$, but now $\alpha \in \mathbb{R}_{>0}^{K}$ denotes a positive unconstrained parameter vector with one more dimension than $x$.
While the densities in Eq. 1 and Eq. 5 look similar, they hide very different normalizing constants. In the Dirichlet case, it is well known that (Dirichlet, 1839):
$$\int_{\Delta_{K-1}} \prod_{i=1}^{K} x_i^{\alpha_i - 1}\, dx = \frac{\prod_{i=1}^{K} \Gamma(\alpha_i)}{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)} \tag{6}$$
where $\Gamma$ denotes the gamma function and $dx$ is the Lebesgue measure. On the other hand, the normalizing constant of the CC admits the following closed form (Gordon-Rodriguez et al., 2020a):
$$\int_{\Delta_{K-1}} \prod_{i=1}^{K} \lambda_i^{x_i}\, dx = \sum_{i=1}^{K} \frac{\lambda_i}{\prod_{j \neq i} \left(\log \lambda_i - \log \lambda_j\right)} \tag{7}$$
which contains elementary operations only. In spite of its mathematical simplicity, this normalizing constant can be numerically hard to compute, particularly in high dimensions. Moreover, Eq. 7 breaks down under equality of parameters, i.e., whenever $\lambda_i = \lambda_j$ for some $i \neq j$, because the denominator evaluates to zero. These issues will be the primary focus of our exposition, in particular:
- In Section 2.1, we characterize the numerical behavior of our normalizing constant. We demonstrate that vectorized computation can suffer from catastrophic cancellation, the severity of which depends on the proximity between parameter values.
- In Section 2.2, we rederive the normalizing constant as an inverse Laplace transform, which in turn can be evaluated using numerical inversion algorithms. We show that this alternative computation strategy exhibits good numerical behavior in the regime where catastrophic cancellation is most severe.
- In Section 2.3, we propose an orthogonal computational approach based on a recursive property of the normalizing constant.
- In Section 3, we derive a formula for the normalizing constant that remains valid when two or more parameter values are equal to one another.
We conclude this section with some remarks. First, note that in the 1-dimensional case, the CC distribution reduces to the continuous Bernoulli distribution (Loaiza-Ganem and Cunningham, 2019), which arose in the context of generative models of images (Kingma and Welling, 2014; Bond-Taylor et al., 2021) and provided the original inspiration for the CC family. More generally, the CC is closely related to the categorical cross-entropy loss commonly used in machine learning (Gordon-Rodriguez et al., 2020b).
We also note that the CC can be rewritten using the exponential family canonical form:
$$p(x; \eta) = C(\eta) \exp\left(\sum_{i=1}^{K-1} \eta_i x_i\right) \tag{8}$$
where $\eta \in \mathbb{R}^{K-1}$, with $\eta_i = \log(\lambda_i / \lambda_K)$, is the natural parameter, which conveniently becomes unconstrained real-valued. Note that, like with $x$ and $\lambda$, we will drop the $K$th coordinate to denote $\eta = (\eta_1, \dots, \eta_{K-1})$, since $\eta_K = \log(\lambda_K / \lambda_K) = 0$ is fixed.² ²In principle, we could let $\eta_K$ vary together with $\lambda$; Eqs. 8 and 9 would still hold, since the additional term in the density would compensate the change in $C(\eta)$. However, such a model would be overparameterized as it would be invariant to a parallel shift across all the $\eta_i$. For mathematical conciseness, we keep $\eta_K$ fixed at 0 and work with $\eta \in \mathbb{R}^{K-1}$. In this notation, the normalizing constant becomes:
$$C(\eta)^{-1} = \sum_{i=1}^{K} \frac{e^{\eta_i}}{\prod_{j \neq i} \left(\eta_i - \eta_j\right)}, \qquad \eta_K = 0 \tag{9}$$
which, again, is undefined whenever $\eta_i = \eta_j$ for some $i \neq j$.
2 Numerical computation of the normalizing constant
Eq. 9 can be vectorized efficiently as follows:
Algorithm 1. Input: a parameter vector $\eta \in \mathbb{R}^{K-1}$. Output: the normalizing constant $C(\eta)$.
Step 1: Append $\eta_K = 0$ to form $(\eta_1, \dots, \eta_K)$.
Step 2: Compute the matrix of differences $D_{ij} = \eta_i - \eta_j$, setting the diagonal entries $D_{ii}$ to 1.
Step 3: Compute the summation $S = \sum_{i=1}^{K} e^{\eta_i} \big/ \prod_{j=1}^{K} D_{ij}$.
Step 4: Return $C(\eta) = 1/S$.
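The following is a minimal NumPy sketch of Algorithm 1 (the function name and code organization are ours; only Eq. 9 itself comes from the text):

```python
import numpy as np

def cc_norm_const_naive(eta):
    """Algorithm 1 (sketch): evaluate C(eta) by direct summation of Eq. 9.

    `eta` holds the K-1 free natural parameters; eta_K = 0 is appended.
    Assumes all entries are distinct; otherwise a division by zero occurs.
    """
    eta = np.append(np.asarray(eta, dtype=np.float64), 0.0)  # Step 1: append eta_K = 0
    diffs = eta[:, None] - eta[None, :]                      # Step 2: differences eta_i - eta_j
    np.fill_diagonal(diffs, 1.0)                             # dummy 1s so row products skip j = i
    inv_c = np.sum(np.exp(eta) / np.prod(diffs, axis=1))     # Step 3: the summation of Eq. 9
    return 1.0 / inv_c                                       # Step 4: C(eta) is the reciprocal
```

For instance, `cc_norm_const_naive([0.5, 1.0])` returns the reciprocal of the $K = 3$ summation worked out in Eq. 10 below.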
2.1 Catastrophic cancellation
Algorithm 1 is easy to code up and adds little computational overhead to most models. However, the summation in Step 3 involves positive and negative numbers, which can result in catastrophic cancellation (Goldberg, 1991). We stress that the log-sum-exp trick, while useful for preventing overflow, cannot address catastrophic cancellation (see Section 2.4). For example, consider the case $K = 3$ with, say, $\eta = (0.5, 1)$. In single-precision floating-point, the summation in Eq. 9 evaluates to:
$$\frac{e^{0.5}}{(0.5 - 1)(0.5 - 0)} + \frac{e^{1}}{(1 - 0.5)(1 - 0)} + \frac{e^{0}}{(0 - 0.5)(0 - 1)} \approx -6.5949 + 5.4366 + 2 = 0.8417 \tag{10}$$
Note that the output is an order of magnitude smaller than (at least one of) the summands, and as a result we have lost one digit in precision. As we increase the dimension of the CC, the cancellation becomes more severe. For example, in the case $K = 5$ with, say, $\eta = (0.25, 0.5, 0.75, 1)$, the same summation becomes:
$$-54.78508 + 105.51816 - 90.32533 + 28.99501 + 10.66667 \approx 0.0694 \tag{11}$$
We have now lost 3 digits, since the leading summand is 3 orders of magnitude greater than the output. If we continue increasing $K$ in the same way, the loss grows by roughly one to two digits per added dimension, so that single-precision arithmetic, which carries roughly 7 significant digits, soon fails to produce even a single significant digit, and double-precision arithmetic, with roughly 16 significant digits, only postpones the same failure by a handful of dimensions.
We can summarize the problem as follows: when $C(\eta)^{-1}$ is of a much lower order of magnitude than the leading term in the summation of Eq. 9, numerical computation fails due to catastrophic cancellation. To complicate things further, the relationship between the two orders of magnitude (that of $C(\eta)^{-1}$ and that of the leading summand) is nontrivial. This relationship depends on the relative size of the exponential terms $e^{\eta_i}$ and the products of the differences $\prod_{j \neq i}(\eta_i - \eta_j)$, and is not straightforward to analyze. However, if two elements of $\eta$ are particularly close to one another, meaning that $\eta_i - \eta_j$ is close to 0 for some $i$ and $j$, it follows that the corresponding summands:
$$\frac{e^{\eta_i}}{\prod_{k \neq i}\left(\eta_i - \eta_k\right)} \quad \text{and} \quad \frac{e^{\eta_j}}{\prod_{k \neq j}\left(\eta_j - \eta_k\right)} \tag{12}$$
are (a) large in magnitude (due to the near-zero $\eta_i - \eta_j$ term in the denominator), (b) of opposing sign (due to the sign flip between $\eta_i - \eta_j$ and $\eta_j - \eta_i$), and (c) similar in absolute value (since the remaining terms in the two products are approximately equal). Thus, it is likely that the terms in Eq. 12 are of leading order and catastrophically cancel each other out. Note that, as the dimensionality $K$ increases, it becomes more likely that some pair of the components of $\eta$ are close to one another, and therefore computation becomes harder.
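The following self-contained snippet (ours; the parameter scales are illustrative) makes this effect visible by evaluating the summation of Eq. 9 at two precisions. With clustered parameters, single precision typically retains no correct digits while double precision retains only a few; with well-separated parameters, the two results agree:

```python
import numpy as np

def inv_c_sum(eta, dtype):
    """The summation of Eq. 9, carried out in the given floating-point precision."""
    eta = np.append(np.asarray(eta, dtype=dtype), dtype(0.0))
    diffs = eta[:, None] - eta[None, :]
    np.fill_diagonal(diffs, 1.0)
    return np.sum(np.exp(eta) / np.prod(diffs, axis=1))

rng = np.random.default_rng(0)
eta_close = 0.1 * rng.standard_normal(8)  # entries close together: severe cancellation
eta_far = 10.0 * rng.standard_normal(8)   # entries well separated: benign

for label, eta in [("close", eta_close), ("far", eta_far)]:
    s32, s64 = inv_c_sum(eta, np.float32), inv_c_sum(eta, np.float64)
    # Disagreement between the two precisions reveals the digits lost.
    print(f"{label}: float32 = {s32:.6e}, float64 = {s64:.6e}")
```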

Another helpful intuition can be obtained by reasoning from the integral:
$$C(\eta)^{-1} = \int_{\Delta_{K-1}} \exp\left(\sum_{i=1}^{K-1} \eta_i x_i\right) dx \tag{13}$$
As $K$ increases, the Lebesgue measure of the simplex decays like $1/(K-1)!$.³ ³As can be seen, for example, by taking Eq. 6 with $\alpha_i = 1$ for all $i$. Therefore, assuming the components of $\eta$ are $O(1)$, we have that the integrand is $O(1)$ also, and therefore $C(\eta)^{-1} = O(1/(K-1)!)$. Under this assumption, we also have that $e^{\eta_i} = O(1)$, and therefore the summands in Eq. 9 cannot decay factorially fast (they may, but need not, decay at most exponentially, due to the product of $K - 1$ terms of constant order in the denominator). Thus, such a regime implies catastrophic cancellation is inevitable for large enough $K$. On the other hand, when all the components of $\eta$ are far from one another, we are spared of such cancellation and Algorithm 1 succeeds, including in high dimensions. We demonstrate these behaviors empirically in the following experiments (see Figures 1 and 2).
2.1.1 Experiments
To evaluate the effectiveness of Algorithm 1 for computing $C(\eta)$, we first implemented an oracle capable of correctly computing $C(\eta)$ to within 4 significant figures (at a potentially large computational cost). Our oracle is an implementation of Eq. 9 with arbitrary-precision floating-point, using the mpmath library (Johansson et al., 2013). In particular, for a given $\eta$, we ensure the level of precision is set appropriately by computing Eq. 9 repeatedly at increasingly high precision, until the outputs converge (to 4 significant figures). Equipped with this oracle, we then drew parameter vectors $\eta$ with entries from a normal prior for a variety of dimensions $K$, to analyze the behavior of $C(\eta)$.
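A sketch of how such an oracle can be implemented with mpmath follows (the convergence loop and all names are ours):

```python
import mpmath as mp

def oracle_inv_c(eta, sig_figs=4, start_dps=30, max_dps=4000):
    """Evaluate the summation of Eq. 9 in arbitrary precision, doubling the
    number of decimal places until two consecutive evaluations agree to
    `sig_figs` significant figures."""
    def eval_sum(dps):
        with mp.workdps(dps):
            e = [mp.mpf(float(x)) for x in eta] + [mp.mpf(0)]  # append eta_K = 0
            return mp.fsum(
                mp.exp(ei) / mp.fprod(ei - ej for j, ej in enumerate(e) if j != i)
                for i, ei in enumerate(e))
    dps, prev = start_dps, None
    while dps <= max_dps:
        curr = eval_sum(dps)
        # Converged once the relative difference drops below the target precision.
        if prev is not None and abs(curr - prev) <= abs(curr) * mp.mpf(10) ** -sig_figs:
            return curr
        prev, dps = curr, 2 * dps
    raise RuntimeError("precision did not converge; increase max_dps")
```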
First, we fixed the scale of the prior and compared the magnitude of $C(\eta)^{-1}$ to the magnitude of the largest summand in Eq. 9. The results are plotted in Figure 1, where we observe that $C(\eta)^{-1}$ decays rapidly in $K$, whereas the same is not true of its (largest) summands. Thus, under this prior, Algorithm 1 is unsuccessful except in low-dimensional settings.
Next, we let the spread of $\eta$ vary by drawing $\eta_i \sim \mathcal{N}(0, \sigma^2)$, where $\sigma$ ranges over several orders of magnitude. We then plotted, for each $\sigma$, the highest value of $K$ such that the output of Algorithm 1 agreed with the oracle to 3 significant figures. We repeated the procedure using single- and double-precision floating-point (orange and pink lines in Figure 2, respectively), as well as two Laplace inversion methods that will be discussed in Section 2.2. As $\sigma$ increases, the parameter values tend to move away from one another, making computation easier and allowing for much higher dimensions. On the other hand, when $\sigma$ decreases, the parameter values come closer, bringing us into the unstable regions of Eq. 9, and Algorithm 1 fails for all but just a few dimensions.

2.2 Inverse Laplace transform
In this section, we show that $C(\eta)^{-1}$ can be written as the inverse Laplace transform of a function that does not suffer from catastrophic cancellation. In particular, said function can be passed to Laplace inversion methods (Davies and Martin, 1979) in order to evaluate $C(\eta)^{-1}$ in the regime where Algorithm 1 fails.
Proposition: For a function $f : [0, \infty) \to \mathbb{R}$, let $\mathcal{L}\{f\}(s) = \int_0^\infty e^{-st} f(t)\, dt$ denote its Laplace transform. Define the function $g : [0, \infty) \to \mathbb{R}$ by:
$$g(t) = \int_{\Delta_{K-1}(t)} \exp\left(\sum_{i=1}^{K-1} \eta_i x_i\right) dx, \qquad \Delta_{K-1}(t) = \left\{x \in \mathbb{R}^{K-1} : x_i > 0,\ \sum_{i=1}^{K-1} x_i < t\right\} \tag{14}$$
Then the Laplace transform of $g$ is equal to:

$$\mathcal{L}\{g\}(s) = \prod_{i=1}^{K} \frac{1}{s - \eta_i}, \qquad \eta_K = 0 \tag{15}$$
Remark: The function $g$ includes the normalizing constant of the CC as the special case $g(1) = C(\eta)^{-1}$. More generally, we have that:

$$g(t) = t^{K-1}\, C(t\eta)^{-1} \tag{16}$$

as can be seen from applying the change of variables $x \mapsto tx$ in the integral of Eq. 14.
Proof: Let $f_i(t) = e^{\eta_i t}$ for $i = 1, \dots, K$ (recall $\eta_K = 0$), so that:

$$g(t) = \int_{\Delta_{K-1}(t)} \prod_{i=1}^{K} f_i(x_i)\, dx, \qquad x_K = t - \sum_{i=1}^{K-1} x_i \tag{17}$$
Next we apply the transformation from Wolpert and Wolf (1995), defined as $y_i = \sum_{j=1}^{i} x_j$ for $i = 1, \dots, K-1$, or equivalently, $x_i = y_i - y_{i-1}$ (with $y_0 = 0$). We can then write our integral as a convolution:

$$g(t) = \left(f_1 * f_2 * \cdots * f_K\right)(t) \tag{18}$$
Next, since the Laplace transform of a convolution equals the product of the Laplace transforms, we have that:

$$\mathcal{L}\{g\}(s) = \prod_{i=1}^{K} \mathcal{L}\{f_i\}(s) \tag{19}$$
but the univariate case is simply:

$$\mathcal{L}\{f_i\}(s) = \int_0^\infty e^{-st}\, e^{\eta_i t}\, dt = \frac{1}{s - \eta_i}, \qquad s > \eta_i \tag{20}$$
and the result follows. $\square$
Corollary: The normalizing constant of the continuous categorical distribution can be written as the following inverse Laplace transform:
$$C(\eta)^{-1} = \mathcal{L}^{-1}\left\{\prod_{i=1}^{K} \frac{1}{s - \eta_i}\right\}(1) \tag{21}$$
Proof: Take the inverse Laplace transform in Eq. 15 to find:
$$g(t) = \mathcal{L}^{-1}\left\{\prod_{i=1}^{K} \frac{1}{s - \eta_i}\right\}(t) \tag{22}$$
Taking $t = 1$ gives the desired result. $\square$
Unlike Eq. 9, the product in Eq. 15 does not suffer from catastrophic cancellation, nor does it diverge whenever $\eta_i = \eta_j$ for some $i \neq j$. The corresponding Laplace inversion, i.e., Eq. 21, provides an alternative method to compute our normalizing constant $C(\eta)$. Numerous numerical algorithms exist for inverting the Laplace transform; see Cohen (2007) for a survey. We note, however, that inverting the Laplace transform is generally a hard problem (Epstein and Schotland, 2008).
Empirically, we found modest success in computing Eq. 21 numerically. We tested three inversion algorithms, due to Talbot (1979), Stehfest (1970), and De Hoog et al. (1982). The experimental setup was identical to that of Section 2.1.1, and the results are incorporated into Figure 2. We found De Hoog's method to be the most effective on our problem, whereas Talbot's always failed and is omitted from the figure. Importantly, De Hoog's method showed some success in the regime where Algorithm 1 failed, meaning it could be used as a complementary technique. However, no inversion method was able to scale beyond moderate values of $K$.
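For reference, mpmath ships implementations of these inversion schemes through its `invertlaplace` routine; the following wrapper (ours, with illustrative precision settings) evaluates Eq. 21:

```python
import mpmath as mp

def inv_c_via_laplace(eta, method="dehoog", dps=30):
    """Evaluate C(eta)^{-1} = L^{-1}{ prod_i 1/(s - eta_i) }(1), cf. Eq. 21.

    The transform itself (Eq. 15) is a pure product, hence free of
    catastrophic cancellation; the numerical difficulty moves into the
    inversion step.
    """
    with mp.workdps(dps):
        e = [mp.mpf(float(x)) for x in eta] + [mp.mpf(0)]  # append eta_K = 0
        F = lambda s: 1 / mp.fprod(s - ei for ei in e)     # the product of Eq. 15
        return mp.invertlaplace(F, 1, method=method)       # invert and evaluate at t = 1
```

The `method` argument selects among `'talbot'`, `'stehfest'`, and `'dehoog'`.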
2.3 Inductive approach
In this section, we provide an alternative algorithm to compute our normalizing constant. This algorithm will be based on the following recursive property of $C(\eta)$, which was implicitly used as part of the proof of Eq. 9 (Gordon-Rodriguez et al., 2020a).
Proposition: Define the subvector notation $\eta_{1:k} = (\eta_1, \dots, \eta_k)$, and make the dimension explicit by writing $C(\eta_{1:k})$ for the normalizing constant of the $(k+1)$-dimensional CC with natural parameter $\eta_{1:k}$. We also use the notation $\eta_{1:k} - c = (\eta_1 - c, \dots, \eta_k - c)$ for a scalar $c$. Then:

$$C(\eta_{1:K-1})^{-1} = \frac{e^{\eta_{K-1}}\, C(\eta_{1:K-2} - \eta_{K-1})^{-1} - C(\eta_{1:K-2})^{-1}}{\eta_{K-1}} \tag{23}$$
Proof: We start from the integral definition of the left hand side:

$$C(\eta_{1:K-1})^{-1} = \int_{\Delta_{K-2}} \exp\left(\sum_{i=1}^{K-2} \eta_i x_i\right) \left[\int_0^{1 - \sum_{i=1}^{K-2} x_i} e^{\eta_{K-1} x_{K-1}}\, dx_{K-1}\right] dx_1 \cdots dx_{K-2} \tag{24}$$
For the innermost integral, we have:

$$\int_0^{r} e^{\eta_{K-1} x_{K-1}}\, dx_{K-1} = \frac{e^{\eta_{K-1} r} - 1}{\eta_{K-1}} = \frac{e^{\eta_{K-1}}\, e^{-\eta_{K-1} \sum_{i=1}^{K-2} x_i} - 1}{\eta_{K-1}}, \qquad r = 1 - \sum_{i=1}^{K-2} x_i \tag{25}$$
and the result follows by linearity of the integral, noting that $\exp\left(\sum_{i=1}^{K-2} \eta_i x_i\right) e^{-\eta_{K-1} \sum_{i=1}^{K-2} x_i} = \exp\left(\sum_{i=1}^{K-2} (\eta_i - \eta_{K-1})\, x_i\right)$. $\square$
Remark: The base case $K = 2$ corresponds to the univariate continuous Bernoulli distribution, which admits a straightforward Taylor expansion that is useful around the unstable region $\eta_1 \approx 0$ (Loaiza-Ganem and Cunningham, 2019):

$$C(\eta_1)^{-1} = \frac{e^{\eta_1} - 1}{\eta_1} = \sum_{n=0}^{\infty} \frac{\eta_1^n}{(n+1)!} = 1 + \frac{\eta_1}{2} + \frac{\eta_1^2}{6} + \cdots \tag{26}$$
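In code, the base case can be stabilized near $\eta_1 = 0$ either by truncating the series in Eq. 26 or with a library primitive such as `expm1`; a small sketch (the switching threshold is an illustrative choice of ours):

```python
import numpy as np

def cb_inv_norm_const(eta1, taylor_radius=1e-4):
    """Base case: C(eta_1)^{-1} = (e^{eta_1} - 1) / eta_1, cf. Eq. 26."""
    if abs(eta1) < taylor_radius:
        # Truncated Taylor series of Eq. 26: 1 + eta/2 + eta^2/6 + ...
        return 1.0 + eta1 / 2.0 + eta1 ** 2 / 6.0
    return np.expm1(eta1) / eta1  # expm1 avoids the cancellation in e^x - 1
```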
In words, Eq. 23 is stating that we can compute $C^{-1}$ for the full parameter vector $\eta_{1:K-1}$ by first computing it for the lower-dimensional parameter vectors $\eta_{1:K-2} - \eta_{K-1}$ and $\eta_{1:K-2}$.
Substituting these back into Eq. 23, we have that:

$$C(\eta_{1:K-1})^{-1} = \frac{e^{\eta_{K-1}}}{\eta_{K-1}} \cdot \frac{e^{\eta_{K-2} - \eta_{K-1}}\, C(\eta_{1:K-3} - \eta_{K-2})^{-1} - C(\eta_{1:K-3} - \eta_{K-1})^{-1}}{\eta_{K-2} - \eta_{K-1}} - \frac{1}{\eta_{K-1}} \cdot \frac{e^{\eta_{K-2}}\, C(\eta_{1:K-3} - \eta_{K-2})^{-1} - C(\eta_{1:K-3})^{-1}}{\eta_{K-2}}$$
Note that we are now left with not 4, but 3 new parameter vectors to recurse on: $\eta_{1:K-3} - \eta_{K-2}$, $\eta_{1:K-3} - \eta_{K-1}$, and $\eta_{1:K-3}$. Repeating the argument $K - 2$ times and working backwards we obtain Algorithm 2 for computing the normalizing constant.
Algorithm 2. Input: a parameter vector $\eta \in \mathbb{R}^{K-1}$. Output: the normalizing constant $C(\eta)$. Starting from the base case of Eq. 26, apply the recursion of Eq. 23 to the shifted subvectors $\eta_{1:k} - \eta_j$, increasing $k$ from 1 to $K - 1$ and memoizing each intermediate value; return the reciprocal of the final $C(\eta_{1:K-1})^{-1}$.
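A memoized sketch of Algorithm 2 follows (ours; `N(k, j)` stands for $C(\eta_{1:k} - \eta_j)^{-1}$, with $j = K$ denoting no shift, since $\eta_K = 0$):

```python
import numpy as np
from functools import lru_cache

def cc_norm_const_recursive(eta):
    """Algorithm 2 (sketch): evaluate C(eta) via the recursion of Eq. 23."""
    eta = np.append(np.asarray(eta, dtype=np.float64), 0.0)
    K = len(eta)  # eta now has K components, with eta[K-1] = eta_K = 0

    @lru_cache(maxsize=None)
    def N(k, j):
        # C^{-1} of the k-component shifted vector eta_{1:k} - eta_j (j is 1-based).
        if k == 0:
            return 1.0  # empty parameter vector
        a = eta[k - 1] - eta[j - 1]  # last component of the shifted vector
        if k == 1:
            return np.expm1(a) / a if a != 0.0 else 1.0  # continuous Bernoulli base case
        # Eq. 23 applied to the shifted vector; fails if a == 0, just like Eq. 9.
        return (np.exp(a) * N(k - 1, k) - N(k - 1, j)) / a

    return 1.0 / N(K - 1, K)
```

Memoization keeps the cost at $O(K^2)$ subproblems, mirroring the triangular structure of the recursion.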
We find the numerical behavior of Algorithm 2 to be similar to that of Algorithm 1, suffering from the same cancellation issues in high dimensions. Nevertheless, we hope this alternative scheme may help to inspire further numerical improvements. For example, since the floating-point behavior of Algorithm 2 depends on the ordering of the elements of $\eta$, it may be possible to design a reordering scheme that improves the overall precision of the algorithm; such reorderings have been explored in the context of sampling algorithms for the CC (Gordon-Rodriguez et al., 2020a). Other possibilities include Kahan summation (Kahan, 1965) or the compensated Horner algorithm (Langlois and Louvet, 2007); we leave their study to future work.
2.4 Overflow
We conclude this section with a small remark on numerical overflow in the context of Algorithm 1. In high dimensions and with large parameter values, overflow can occur for the terms in the summand, due to a large exponential term $e^{\eta_i}$ or a small denominator $\prod_{j \neq i}(\eta_i - \eta_j)$. This can be addressed by re-writing the normalizing constant in a form that allows us to take advantage of the log-sum-exp trick:

$$C(\eta)^{-1} = \sum_{i=1}^{K} s_i \exp\left(\eta_i - \sum_{j \neq i} \log \left|\eta_i - \eta_j\right|\right), \qquad s_i = \prod_{j \neq i} \operatorname{sign}\left(\eta_i - \eta_j\right),$$

where the positive and the negative terms can each be aggregated with a stable log-sum-exp before taking the final difference.
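A sketch of this computation (ours), using SciPy's signed log-sum-exp:

```python
import numpy as np
from scipy.special import logsumexp

def cc_log_norm_const(eta):
    """Overflow-safe log C(eta): each summand of Eq. 9 is represented by its
    log-magnitude and its sign, then aggregated with a signed log-sum-exp.
    This guards against overflow only, not against catastrophic cancellation."""
    eta = np.append(np.asarray(eta, dtype=np.float64), 0.0)
    diffs = eta[:, None] - eta[None, :]
    np.fill_diagonal(diffs, 1.0)  # dummy 1s: log|1| = 0 and sign(1) = +1
    log_mag = eta - np.sum(np.log(np.abs(diffs)), axis=1)  # log of |summand_i|
    signs = np.prod(np.sign(diffs), axis=1)                # sign of summand_i
    log_inv_c, sign = logsumexp(log_mag, b=signs, return_sign=True)
    assert sign > 0  # C(eta)^{-1} is a positive quantity
    return -log_inv_c  # log C(eta) = -log C(eta)^{-1}
```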
3 Normalizing constant with repeated parameters
Whenever we have an equality between any pair of parameters, Eq. 9 is undefined and, indeed, its proof by Gordon-Rodriguez et al. (2020a) breaks down. In this section, we derive a counterpart to Eq. 9 for the case when 2 or more elements of $\eta$ are equal to one another. Note that, for mathematical convenience, we shall now denote $\eta = (\eta_1, \dots, \eta_K)$, where the $K$th component is now included in the vector $\eta$. As discussed in Section 1, this component can be taken as fixed at 0, or it can be treated as an additional free parameter, resulting in an overparameterized model (our results will remain correct nevertheless).
3.1 A simple example
First, we illustrate the main idea of our argument using an example with $K = 3$ and $\eta_1 = \eta_2$.⁴ ⁴Note that the case $\eta_1 = \eta_2$ and the general case $\eta_i = \eta_j$, $i \neq j$, are mathematically equivalent (the latter being merely more algebraically cumbersome), since we can permute the elements of $\eta$ and shift by a constant without loss of generality. By definition, the normalizing constant is then:

$$C(\eta)^{-1} = \int_{\Delta_2} \exp\left(\eta_1 x_1 + \eta_1 x_2 + \eta_3 x_3\right) dx = \int_{\Delta_2} \exp\left(\eta_1 (x_1 + x_2) + \eta_3 (1 - x_1 - x_2)\right) dx_1\, dx_2 \tag{27}$$
We apply the change of variables $u = x_1 + x_2$, $v = x_2$, to obtain:

$$C(\eta)^{-1} = \int_0^1 \int_0^u e^{\eta_1 u + \eta_3 (1 - u)}\, dv\, du = e^{\eta_3} \int_0^1 u\, e^{(\eta_1 - \eta_3) u}\, du \tag{28}$$
Note that the last integral corresponds to the expectation of a univariate CC random variable, an idea we now generalize.
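This reduction is easy to check numerically; a brief mpmath sketch (the parameter values are an illustrative choice of ours):

```python
import mpmath as mp

a, b = mp.mpf("0.7"), mp.mpf("-0.3")  # eta = (a, a, b), illustrative values

# Left hand side of Eq. 28: direct two-dimensional integral over the simplex.
direct = mp.quad(
    lambda x1: mp.quad(
        lambda x2: mp.exp(a * (x1 + x2) + b * (1 - x1 - x2)), [0, 1 - x1]),
    [0, 1])

# Right hand side of Eq. 28: the one-dimensional reduction.
reduced = mp.exp(b) * mp.quad(lambda u: u * mp.exp((a - b) * u), [0, 1])

print(direct, reduced)  # the two values agree to working precision
```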
3.2 General formula
We start by proving the following lemma, which relates $C(\eta)$, where some elements of $\eta$ are repeated (potentially many times), to $C(\tilde\eta)$, where $\tilde\eta$ collapses the repeated elements of $\eta$ onto a single coordinate. We again use subvector notation $\eta_{i:j} = (\eta_i, \dots, \eta_j)$.
Lemma: Let $\eta \in \mathbb{R}^{K}$ with $\eta_i = \eta_1$ for $i = 1, \dots, m$, and let $\tilde\eta = (\eta_1, \eta_{m+1}, \dots, \eta_K) \in \mathbb{R}^{K-m+1}$. Then, for any integrable function $f$:

$$\mathbb{E}_{x \sim \mathcal{CC}(\eta)}\left[f(x_{m+1:K})\right] = \frac{C(\eta)}{C(\tilde\eta)\,(m-1)!}\; \mathbb{E}_{y \sim \mathcal{CC}(\tilde\eta)}\left[y_1^{m-1}\, f(y_{2:K-m+1})\right] \tag{29}$$
Remark: We are not assuming that the last $K - m$ coordinates of $\eta$ are all different, we are simply assuming that the first $m$ are identical. Note also that the positions of the $m$ repeated coordinates could be arbitrary and need not be the first ones; we can always use this lemma to collapse repeated parameter values provided that the function $f$ does not depend on the corresponding coordinates (we can simply relabel the coordinates by a suitable permutation without loss of generality).
Proof: By definition of the left hand side:

$$\mathbb{E}_{x \sim \mathcal{CC}(\eta)}\left[f(x_{m+1:K})\right] = C(\eta) \int_{\Delta_{K-1}} f(x_{m+1:K}) \exp\left(\eta_1 \sum_{i=1}^{m} x_i + \sum_{i=m+1}^{K} \eta_i x_i\right) dx \tag{30}$$
Consider the following change of variables (note this is just like Section 3.1, but with more bookkeeping):

$$y_1 = \sum_{i=1}^{m} x_i, \qquad y_i = x_i \quad \text{for } i = 2, \dots, K-1 \tag{31}$$
This change of variables amounts to an invertible linear transformation with the property that the absolute value of the determinant of its Jacobian is 1, so that we have:

$$\mathbb{E}\left[f(x_{m+1:K})\right] = C(\eta) \int f(y_{m+1:K})\, e^{\eta_1 y_1 + \sum_{i=m+1}^{K} \eta_i y_i} \left[\int_{\left\{y_i > 0\,:\ \sum_{i=2}^{m} y_i < y_1\right\}} dy_2 \cdots dy_m\right] dy_1\, dy_{m+1} \cdots dy_{K-1} \tag{32}$$
Note that the change of variables is such that the integrand does not depend on $y_2, \dots, y_m$. Therefore:

$$\mathbb{E}\left[f(x_{m+1:K})\right] = C(\eta) \int_{\Delta_{K-m}} f(y_{m+1:K})\, e^{\eta_1 y_1 + \sum_{i=m+1}^{K} \eta_i y_i}\; V(y_1)\, dy_1\, dy_{m+1} \cdots dy_{K-1} \tag{33}$$
where:

$$V(y_1) = \int_{\left\{y_i > 0\,:\ \sum_{i=2}^{m} y_i < y_1\right\}} dy_2 \cdots dy_m \tag{34}$$
But this is simply the Lebesgue measure of a simplex inscribed in the hypercube $[0, y_1]^{m-1}$, so that $V(y_1) = y_1^{m-1} / (m-1)!$ (this can also be seen by changing variables to $z_i = y_i / y_1$, or by applying Eq. 16). Multiplying and dividing by $C(\tilde\eta)$ gives the desired result. $\square$
We can now derive a formula for the normalizing constant for an arbitrary parameter vector $\eta$.
Corollary: Let $\eta \in \mathbb{R}^{K}$ contain $J$ unique elements $\tilde\eta_1, \dots, \tilde\eta_J$. Assume, without loss of generality, that $\eta = (\tilde\eta_1, \dots, \tilde\eta_1,\ \tilde\eta_2, \dots, \tilde\eta_2,\ \dots,\ \tilde\eta_J, \dots, \tilde\eta_J)$, where each coordinate $\tilde\eta_j$ is repeated $m_j$ times, with $\sum_{j=1}^{J} m_j = K$. Then:

$$C(\eta)^{-1} = \frac{C(\tilde\eta)^{-1}}{\prod_{j=1}^{J} (m_j - 1)!}\; \mathbb{E}_{y \sim \mathcal{CC}(\tilde\eta)}\left[\prod_{j=1}^{J} y_j^{m_j - 1}\right], \qquad \tilde\eta = (\tilde\eta_1, \dots, \tilde\eta_J) \tag{35}$$
Proof: The result follows by applying the lemma $J$ times. First, we apply the lemma with $f \equiv 1$:

$$C(\eta)^{-1} = \frac{C(\eta^{(1)})^{-1}}{(m_1 - 1)!}\; \mathbb{E}_{y \sim \mathcal{CC}(\eta^{(1)})}\left[y_1^{m_1 - 1}\right] \tag{36}$$
where $\eta^{(1)} = (\tilde\eta_1, \eta_{m_1+1:K})$, i.e., we have collapsed the first parameter value onto a single coordinate. Next, we apply the lemma a second time on the new expectation term to collapse the $\tilde\eta_2$ values, this time using $f(y) = y_1^{m_1 - 1}$, which does not depend on the collapsed coordinates:

$$C(\eta)^{-1} = \frac{C(\eta^{(2)})^{-1}}{(m_1 - 1)!\,(m_2 - 1)!}\; \mathbb{E}_{y \sim \mathcal{CC}(\eta^{(2)})}\left[y_1^{m_1 - 1}\, y_2^{m_2 - 1}\right] \tag{37}$$
where $\eta^{(2)} = (\tilde\eta_1, \tilde\eta_2, \eta_{m_1 + m_2 + 1:K})$. Repeating $J$ times yields the desired result. $\square$
Note that Eq. 35 can be computed using known results. The term $C(\tilde\eta)^{-1}$ can be evaluated with Eq. 9, since all the parameter values of $\tilde\eta$ are distinct. The expectation term can be computed by differentiating the moment generating function of $\mathcal{CC}(\tilde\eta)$, as discussed by Gordon-Rodriguez et al. (2020a). Note that in real data applications exact equality between parameter values may never occur, and it is unclear how close the elements of $\eta$ should be in order to warrant applying Eq. 35. Nevertheless, Eq. 35 is important for theoretical completeness.
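The following sketch (ours) evaluates Eq. 35 by computing the expectation term as a mixed partial derivative of $C^{-1}$ at $\tilde\eta$, which is one way of differentiating the moment generating function:

```python
import mpmath as mp

def inv_c_repeated(eta_distinct, mults, dps=40):
    """Eq. 35 (sketch): C(eta)^{-1} where eta contains eta_distinct[j] with
    multiplicity mults[j]. Uses the identity
        C(eta)^{-1} = (mixed partial of C^{-1} at eta_distinct) / prod_j (m_j - 1)!,
    taking a partial derivative of order m_j - 1 in the j-th coordinate."""
    with mp.workdps(dps):
        def inv_c(*e):
            # Eq. 9 for a full K-component parameter vector (cf. Section 3).
            return mp.fsum(
                mp.exp(ei) / mp.fprod(ei - ej for j, ej in enumerate(e) if j != i)
                for i, ei in enumerate(e))
        point = tuple(mp.mpf(float(x)) for x in eta_distinct)
        orders = tuple(m - 1 for m in mults)
        deriv = mp.diff(inv_c, point, orders)  # mixed partial derivative
        return deriv / mp.fprod(mp.factorial(m - 1) for m in mults)
```

For instance, `inv_c_repeated([0.7, -0.3], [2, 1])` matches the $K = 3$, $\eta_1 = \eta_2$ example of Section 3.1.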
4 Discussion
The normalizing constant of the CC distribution is essential to applications, being a necessary prerequisite for evaluating likelihoods, optimizing models, and simulating samples alike. Computing this normalizing constant is nontrivial, and doing so in high dimensions remains an open problem. Our work represents a significant step toward this goal, improving our understanding of the numerical properties of different computation techniques, as well as advancing the underlying theory and algorithms. In addition, we hope our results will help to inspire further advances and to develop increasingly robust numerical techniques that will ultimately enable the use of the CC distribution with arbitrary parameter values on high-dimensional problems.
References
- Bond-Taylor et al. (2021) Sam Bond-Taylor, Adam Leach, Yang Long, and Chris G Willcocks. Deep generative modelling: A comparative review of VAEs, GANs, normalizing flows, energy-based and autoregressive models. arXiv preprint arXiv:2103.04922, 2021.
- Cohen (2007) Alan M Cohen. Numerical Methods for Laplace Transform Inversion, volume 5. Springer Science & Business Media, 2007.
- Davies and Martin (1979) Brian Davies and Brian Martin. Numerical inversion of the Laplace transform: a survey and comparison of methods. Journal of Computational Physics, 33(1):1–32, 1979.
- De Hoog et al. (1982) Frank R De Hoog, J H Knight, and A N Stokes. An improved method for numerical inversion of Laplace transforms. SIAM Journal on Scientific and Statistical Computing, 3(3):357–366, 1982.
- Dirichlet (1839) Peter Gustav Lejeune Dirichlet. Sur une nouvelle méthode pour la détermination des intégrales multiples. Journal de Mathématiques, Ser I, 4, pages 164–168, 1839.
- Epstein and Schotland (2008) Charles L Epstein and John Schotland. The bad truth about Laplace's transform. SIAM Review, 50(3):504–520, 2008.
- Goldberg (1991) David Goldberg. What every computer scientist should know about floating-point arithmetic. ACM Computing Surveys (CSUR), 23(1):5–48, 1991.
- Gordon-Rodriguez et al. (2020a) Elliott Gordon-Rodriguez, Gabriel Loaiza-Ganem, and John Cunningham. The continuous categorical: a novel simplex-valued exponential family. In International Conference on Machine Learning, pages 3637–3647. PMLR, 2020a.
- Gordon-Rodriguez et al. (2020b) Elliott Gordon-Rodriguez, Gabriel Loaiza-Ganem, Geoff Pleiss, and John Patrick Cunningham. Uses and abuses of the cross-entropy loss: Case studies in modern deep learning. In Proceedings on "I Can't Believe It's Not Better!" at NeurIPS Workshops, volume 137 of Proceedings of Machine Learning Research, pages 1–10. PMLR, 2020b.
- Johansson et al. (2013) Fredrik Johansson et al. mpmath: a Python library for arbitrary-precision floating-point arithmetic (version 0.18), December 2013. http://mpmath.org/.
- Kahan (1965) William Kahan. Pracniques: further remarks on reducing truncation errors. Communications of the ACM, 8(1):40, 1965.
- Kingma and Welling (2014) Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.
- Langlois and Louvet (2007) Philippe Langlois and Nicolas Louvet. How to ensure a faithful polynomial evaluation with the compensated Horner algorithm. In 18th IEEE Symposium on Computer Arithmetic (ARITH'07), pages 141–149. IEEE, 2007.
- Loaiza-Ganem and Cunningham (2019) Gabriel Loaiza-Ganem and John P Cunningham. The continuous Bernoulli: fixing a pervasive error in variational autoencoders. In Advances in Neural Information Processing Systems, pages 13266–13276, 2019.
- Stehfest (1970) Harald Stehfest. Algorithm 368: Numerical inversion of Laplace transforms [D5]. Communications of the ACM, 13(1):47–49, 1970.
- Talbot (1979) Alan Talbot. The accurate numerical inversion of Laplace transforms. IMA Journal of Applied Mathematics, 23(1):97–120, 1979.
- Wolpert and Wolf (1995) David H Wolpert and David R Wolf. Estimating functions of probability distributions from a finite set of samples. Physical Review E, 52(6):6841, 1995.