
On the Normalizing Constant of
the Continuous Categorical Distribution

Elliott Gordon-Rodriguez*
Columbia University
[email protected]

Gabriel Loaiza-Ganem*
Layer6 AI
[email protected]

Andres Potapczynski
New York University
[email protected]

John P. Cunningham
Columbia University
[email protected]

*Equal contribution.
Abstract

Probability distributions supported on the simplex enjoy a wide range of applications across statistics and machine learning. Recently, a novel family of such distributions has been discovered: the continuous categorical. This family enjoys remarkable mathematical simplicity; its density function resembles that of the Dirichlet distribution, but with a normalizing constant that can be written in closed form using elementary functions only. In spite of this mathematical simplicity, our understanding of the normalizing constant remains far from complete. In this work, we characterize the numerical behavior of the normalizing constant and we present theoretical and methodological advances that can, in turn, help to enable broader applications of the continuous categorical distribution. Our code is available at https://github.com/cunningham-lab/cb_and_cc/.

1 Introduction

The continuous categorical (CC) distribution is defined by the following density function (Gordon-Rodriguez et al., 2020a):

๐•ฉโˆผ๐’žโ€‹๐’žโ€‹(๐€)โ‡”pโ€‹(๐•ฉ;๐€)โˆโˆi=1Kฮปixi.\displaystyle\mathbb{x}\sim\mathcal{CC}(\boldsymbol{\lambda})\iff p(\mathbb{x};\boldsymbol{\lambda})\propto\prod_{i=1}^{K}\lambda_{i}^{x_{i}}. (1)

Here, ๐•ฉ\mathbb{x} denotes a simplex-valued random variable, and ๐€\boldsymbol{\lambda} denotes a simplex-valued parameter, in other words:111Note that the KK-simplex is also commonly defined as ฮ”K={๐•ฉโˆˆโ„+K:โˆ‘i=1K=1}\Delta^{K}=\{\mathbb{x}\in\mathbb{R}^{K}_{+}:\sum_{i=1}^{K}=1\}. The two definitions are equivalent, however ๐•ŠK\mathbb{S}^{K} is a subset of โ„Kโˆ’1\mathbb{R}^{K-1} with positive Lebesgue measure, whereas ฮ”K\Delta^{K} is a subset of โ„K\mathbb{R}^{K} with zero Lebesgue measure. For this reason, using ๐•ŠK\mathbb{S}^{K} will facilitate our later arguments involving integrals on the simplex.

๐•ฉ,๐€โˆˆ๐•ŠK:={๐’™โˆˆโ„+Kโˆ’1:โˆ‘i=1Kโˆ’1xiโ‰ค1},\displaystyle\mathbb{x},\boldsymbol{\lambda}\in\mathbb{S}^{K}:=\left\{\boldsymbol{x}\in\mathbb{R}_{+}^{K-1}:\sum_{i=1}^{K-1}x_{i}\leq 1\right\}, (2)

where we additionally define the $K$th coordinates as the remainders:

x_K = 1 - \sum_{i=1}^{K-1} x_i,    (3)

\lambda_K = 1 - \sum_{i=1}^{K-1} \lambda_i.    (4)

It is natural to contrast the CC with the similar-looking Dirichlet distribution:

\mathbb{x} \sim \mathrm{Dirichlet}(\boldsymbol{\alpha}) \iff p(\mathbb{x}; \boldsymbol{\alpha}) \propto \prod_{i=1}^{K} x_i^{\alpha_i - 1},    (5)

where again ๐•ฉโˆˆ๐•ŠK\mathbb{x}\in\mathbb{S}^{K}, but ๐œถโˆˆโ„+K\boldsymbol{\alpha}\in\mathbb{R}_{+}^{K} now denotes a positive unconstrained parameter vector with one more dimension than ๐•ฉ\mathbb{x}.

While the densities in Eq. 1 and Eq. 5 look similar, they hide very different normalizing constants. In the Dirichlet case, it is well known that (Dirichlet, 1839):

โˆซ๐•ŠKโˆi=1Kxiฮฑiโˆ’1โ€‹dโ€‹ฮผโ€‹(๐•ฉ)=โˆi=1Kฮ“โ€‹(ฮฑi)ฮ“โ€‹(โˆ‘i=1Kฮฑi),\displaystyle\int_{\mathbb{S}^{K}}\prod_{i=1}^{K}x_{i}^{\alpha_{i}-1}d\mu(\mathbb{x})=\frac{\prod_{i=1}^{K}\Gamma(\alpha_{i})}{\Gamma(\sum_{i=1}^{K}\alpha_{i})}, (6)

where $\Gamma(\alpha) = \int_0^\infty t^{\alpha - 1} e^{-t} \, dt$ denotes the gamma function and $\mu(\cdot)$ is the Lebesgue measure. On the other hand, the normalizing constant of the CC admits the following closed form (Gordon-Rodriguez et al., 2020a):

โˆซ๐•ŠKโˆi=1Kฮปixiโ€‹dโ€‹ฮผโ€‹(๐•ฉ)=โˆ‘k=1Kฮปkโˆiโ‰ klogโกฮปkฮปi,\displaystyle\int_{\mathbb{S}^{K}}\prod_{i=1}^{K}\lambda_{i}^{x_{i}}d\mu(\mathbb{x})=\sum_{k=1}^{K}\frac{\lambda_{k}}{\prod_{i\neq k}\log{\frac{\lambda_{k}}{\lambda_{i}}}}, (7)

which contains elementary operations only. In spite of this mathematical simplicity, the normalizing constant can be numerically hard to compute, particularly in high dimensions. Moreover, Eq. 7 breaks down under equality of parameters, i.e., whenever $\lambda_i = \lambda_k$ for some $i \neq k$, because the denominator evaluates to zero. These issues will be the primary focus of our exposition, in particular:

  • In Section 2.1, we characterize the numerical behavior of the normalizing constant. We demonstrate that vectorized computation can suffer from catastrophic cancellation, the severity of which depends on the proximity between parameter values.

  • In Section 2.2, we rederive the normalizing constant as an inverse Laplace transform, which in turn can be evaluated using numerical inversion algorithms. We show that this alternative computation strategy exhibits good numerical behavior in the regime where catastrophic cancellation is most severe.

  • In Section 2.3, we propose an orthogonal computational approach based on a recursive property of the normalizing constant.

  • In Section 3.2, we generalize Eq. 7 to arbitrary parameter values, i.e., including equality of parameters $\lambda_i = \lambda_k$ for some $i \neq k$. The resulting formula depends on an expectation that can be computed using automatic differentiation.

We conclude this section with some remarks. First, note that in the 1-dimensional case, the CC distribution reduces to the continuous Bernoulli distribution (Loaiza-Ganem and Cunningham, 2019), which arose in the context of generative models of images (Kingma and Welling, 2014; Bond-Taylor et al., 2021) and provided the original inspiration for the CC family. More generally, the CC is closely related to the categorical cross-entropy loss commonly used in machine learning (Gordon-Rodriguez et al., 2020b).

We also note that the CC can be rewritten using the exponential family canonical form:

๐•ฉโˆผ๐’žโ€‹๐’žโ€‹(๐œผ)โ‡”pโ€‹(๐•ฉ;๐œผ)=1Cโ€‹(๐œผ)โ€‹e๐œผโŠคโ€‹๐•ฉ,\displaystyle\mathbb{x}\sim\mathcal{CC}(\boldsymbol{\eta})\iff p(\mathbb{x};\boldsymbol{\eta})=\frac{1}{C(\boldsymbol{\eta})}e^{\boldsymbol{\eta}^{\top}\mathbb{x}}, (8)

where $\eta_i = \log(\lambda_i / \lambda_K)$ is the natural parameter, which is conveniently unconstrained and real-valued. Note that, as with $\mathbb{x}$ and $\boldsymbol{\lambda}$, we drop the $K$th coordinate and write $\boldsymbol{\eta} = (\eta_1, \dots, \eta_{K-1})$, since $\eta_K = \log(1) = 0$ is fixed. (Footnote 2: In principle, we could let $\eta_K$ vary together with $\eta_1, \dots, \eta_{K-1}$; Eqs. 8 and 9 would still hold, since the additional term $e^{\eta_K x_K}$ in the density would compensate for the change in $C(\boldsymbol{\eta})$. However, such a model would be overparameterized, as it would be invariant to a parallel shift across all the $\eta_i$. For mathematical conciseness, we keep $\eta_K$ fixed at 0 and work with $\boldsymbol{\eta} \in \mathbb{R}^{K-1}$.) In this notation, the normalizing constant becomes:

Cโ€‹(๐œผ)=โˆ‘j=1Keฮทkโˆiโ‰ k(ฮทkโˆ’ฮทi),\displaystyle C(\boldsymbol{\eta})=\sum_{j=1}^{K}\frac{e^{\eta_{k}}}{\prod_{i\neq k}\left(\eta_{k}-\eta_{i}\right)}, (9)

which, again, is undefined whenever $\eta_i = \eta_k$ for some $i \neq k$.

2 Numerical computation of the normalizing constant

Eq. 9 can be vectorized efficiently as follows:

Algorithm 1 Vectorized computation of the normalizing constant

Input: a parameter vector $\boldsymbol{\eta}$ (with $\eta_K = 0$ appended)
Output: the normalizing constant $C(\boldsymbol{\eta})$

1: Compute the $K \times K$ matrix of differences $M = [\eta_k - \eta_i]_{i,k=1}^K$ and add to it the identity matrix $I$, so that the diagonal entries equal 1 rather than 0. A tensor with an additional batch dimension can be used if necessary.
2: Take the componentwise product of the rows of $M + I$, i.e., $\prod_{i \neq k} (\eta_k - \eta_i)$ for each $k$, working in log space as necessary.
3: Divide the vector $[e^{\eta_k}]_{k=1}^K$ componentwise by the resulting products, and sum up the terms.
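As a concrete illustration, Algorithm 1 can be sketched in a few lines of NumPy (a minimal version for exposition; the released code linked in the abstract is the reference implementation, and the function name here is ours):

```python
import numpy as np

def cc_norm_const(eta):
    """Naive vectorized evaluation of Eq. 9 (Algorithm 1).

    eta: array-like of shape (K-1,); the K-th natural parameter
    eta_K = 0 is appended internally.
    """
    eta = np.append(np.asarray(eta, dtype=np.float64), 0.0)
    K = eta.size
    # Step 1: matrix of differences M[i, k] = eta_k - eta_i, plus the
    # identity so the diagonal contributes a factor of 1 to each product.
    M = eta[None, :] - eta[:, None] + np.eye(K)
    # Step 2: column-wise products give prod_{i != k} (eta_k - eta_i).
    denom = M.prod(axis=0)
    # Step 3: divide e^{eta_k} by the products and sum.
    return (np.exp(eta) / denom).sum()

print(cc_norm_const([1.0, 2.0, 3.0, 4.0]))  # approx 0.363217, cf. Eq. 10
```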

2.1 Catastrophic cancellation

Algorithm 1 is easy to code up and adds little computational overhead to most models. However, the summation in Step 3 involves positive and negative numbers, which can result in catastrophic cancellation (Goldberg, 1991). We stress that the log-sum-exp trick, while useful for preventing overflow, cannot address catastrophic cancellation (see Section 2.4). For example, consider the case $K = 5$ with $\boldsymbol{\eta} = (1, 2, 3, 4)$. In single-precision floating-point, the summation in Eq. 9 evaluates to:

\sum_{k=1}^{K} \frac{e^{\eta_k}}{\prod_{i \neq k} \left( \eta_k - \eta_i \right)} = -0.45304695 + 1.847264 - 3.3475895 + 2.2749228 + 0.04166667 = 0.363217.    (10)

Note that the output is an order of magnitude smaller than (at least one of) the summands, and as a result we have lost one digit of precision. As we increase the dimension of the CC, the cancellation becomes more severe. For example, in the case $K = 10$ with $\boldsymbol{\eta} = (1, 2, 3, 4, 5, 6, 7, 8, 9)$, the same summation becomes:

\sum_{k=1}^{K} \frac{e^{\eta_k}}{\prod_{i \neq k} \left( \eta_k - \eta_i \right)} = -6.7417699 \times 10^{-5} - 7.3304126 \times 10^{-4} + 4.6494300 \times 10^{-3} - 1.8957691 \times 10^{-2} + 5.1532346 \times 10^{-2} + 9.3386299 \times 10^{-2} + 1.0879297 \times 10^{-1} - 7.3932491 \times 10^{-2} + 2.2329926 \times 10^{-2} - 2.7557318 \times 10^{-6} = 3.5982 \times 10^{-4}.    (11)

We have now lost 3 digits, since the leading summand is 3 orders of magnitude greater than the output. If we continue increasing $K$ in the same way, by $K = 20$ we will have lost 6 digits, and by $K = 25$ we are no longer able to compute $C(\boldsymbol{\eta})$ to even a single significant digit. If we were to use double-precision floating-point instead, by $K = 40$ we will have lost 13 digits, and by $K = 50$ we can no longer compute $C(\boldsymbol{\eta})$ to a single significant digit.
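The digit loss just described is easy to reproduce; the following sketch (our own illustration, not part of the paper's experiments) evaluates the same sums of Eq. 9 in single and double precision for $\boldsymbol{\eta} = (1, \dots, K-1)$:

```python
import numpy as np

def naive_sum(eta, dtype):
    """Evaluate Eq. 9 directly in the given floating-point precision."""
    eta = np.append(np.asarray(eta, dtype=dtype), dtype(0.0))
    K = eta.size
    M = eta[None, :] - eta[:, None] + np.eye(K, dtype=dtype)
    return (np.exp(eta) / M.prod(axis=0)).sum()

for K in (5, 10, 20, 25):
    eta = np.arange(1, K, dtype=np.float64)  # eta = (1, ..., K-1)
    print(K, naive_sum(eta, np.float32), naive_sum(eta, np.float64))
    # single precision degrades as K grows; double precision holds on longer
```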

We can summarize the problem as follows: when $C(\boldsymbol{\eta})$ is of a much lower order of magnitude than the leading term in the summation of Eq. 9, numerical computation fails due to catastrophic cancellation. To complicate things further, the relationship between the two orders of magnitude (that of $C(\boldsymbol{\eta})$ and that of the leading summand) is nontrivial. It depends on the relative size of the exponential terms $e^{\eta_k}$ and the products of the differences $\eta_k - \eta_i$, and is not straightforward to analyze. However, if two elements of $\boldsymbol{\eta}$ are particularly close to one another, meaning that $|\eta_{j_1} - \eta_{j_2}|$ is close to 0 for some $j_1$ and $j_2$, it follows that the corresponding summands:

\frac{e^{\eta_{j_1}}}{\prod_{i \neq j_1} \left( \eta_{j_1} - \eta_i \right)}, \quad \text{and} \quad \frac{e^{\eta_{j_2}}}{\prod_{i \neq j_2} \left( \eta_{j_2} - \eta_i \right)},    (12)

are (a) large in magnitude (due to the near-zero term $\eta_{j_1} - \eta_{j_2}$ in the denominators), (b) of opposing sign (due to the sign flip between $\eta_{j_1} - \eta_{j_2}$ and $\eta_{j_2} - \eta_{j_1}$), and (c) similar in absolute value (since the remaining terms in the two products are approximately equal). Thus, it is likely that the terms in Eq. 12 are of leading order and catastrophically cancel each other out. Note that, as the dimensionality $K$ increases, it becomes more likely that some pair of components lie close to one another, and therefore computation becomes harder.

Figure 1: Scaling behavior of $C(\boldsymbol{\eta})$ relative to its summands (from Eq. 9), as the dimension $K$ varies. Each point represents a random draw of $\eta_i \overset{iid}{\sim} N(0,1)$, for which we compute $C(\boldsymbol{\eta})$. Note that catastrophic cancellation depends on the difference in order of magnitude between $C(\boldsymbol{\eta})$ and its summands; in the green region this difference is at most 8 orders of magnitude, so that single-precision floating-point is sufficient. In the yellow region, it is between 8 and 16 orders of magnitude, so that single-precision fails due to catastrophic cancellation, but double-precision succeeds. In the red region, both fail.

Another helpful intuition can be obtained by reasoning from the integral:

Cโ€‹(๐œผ)=โˆซ๐•ŠKe๐œผโŠคโ€‹๐•ฉโ€‹๐‘‘ฮผโ€‹(๐•ฉ).\displaystyle C(\boldsymbol{\eta})=\int_{\mathbb{S}^{K}}e^{\boldsymbol{\eta}^{\top}\mathbb{x}}d\mu(\mathbb{x}). (13)

As $K$ increases, the Lebesgue measure of the simplex decays factorially, like $1/(K-1)!$. (Footnote 3: As can be seen, for example, by taking Eq. 6 with $\alpha_i = 1$ for all $i$.) Therefore, assuming the components of $\boldsymbol{\eta}$ are $O(1)$, we have that $e^{\boldsymbol{\eta}^\top \mathbb{x}} = O(1)$ also, and therefore $C(\boldsymbol{\eta}) = O(1/(K-1)!)$. Under this assumption, we also have that $\eta_k - \eta_i = O(1)$, so the summands in Eq. 9 cannot decay factorially fast (they can decay at most exponentially, due to the product of $K-1$ terms of constant order in the denominator). Thus, in such a regime, catastrophic cancellation is inevitable for large enough $K$. On the other hand, when all the components of $\boldsymbol{\eta}$ are far from one another, we are spared such cancellation and Algorithm 1 succeeds, even in high dimensions. We demonstrate these behaviors empirically in the following experiments (see Figures 1 and 2).

2.1.1 Experiments

To evaluate the effectiveness of Algorithm 1 for computing $C(\boldsymbol{\eta})$, we first implemented an oracle capable of correctly computing $C(\boldsymbol{\eta})$ to within 4 significant figures (at a potentially large computational cost). Our oracle is an implementation of Eq. 9 in arbitrary-precision floating-point, using the mpmath library (Johansson et al., 2013). In particular, for a given $\boldsymbol{\eta}$, we ensure the level of precision is set appropriately by computing Eq. 9 repeatedly at increasingly high precision, until the outputs converge (to 4 significant figures). Equipped with this oracle, we then drew $\boldsymbol{\eta}$ vectors from a normal prior for a variety of dimensions $K$, in order to analyze the behavior of $C(\boldsymbol{\eta})$.
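The oracle can be sketched as follows (a minimal version using mpmath's adjustable precision; the function name and the precision-doubling schedule are our choices, not necessarily those of the released code):

```python
from mpmath import mp, mpf, exp, nstr

def cc_norm_oracle(eta, sig_figs=4, start_dps=30, max_dps=4000):
    """Evaluate Eq. 9 in arbitrary precision, doubling the working
    precision until two successive evaluations agree to sig_figs
    significant figures."""
    eta = [mpf(e) for e in eta] + [mpf(0)]  # append eta_K = 0
    prev, dps = None, start_dps
    while dps <= max_dps:
        mp.dps = dps
        total = mpf(0)
        for k, ek in enumerate(eta):
            denom = mpf(1)
            for i, ei in enumerate(eta):
                if i != k:
                    denom *= ek - ei
            total += exp(ek) / denom
        cur = nstr(total, sig_figs)
        if cur == prev:
            return total
        prev, dps = cur, 2 * dps
    raise RuntimeError("precision limit reached without convergence")

print(cc_norm_oracle([1, 2, 3, 4]))  # approx 0.3632, cf. Eq. 10
```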

First, we took $\eta_i \overset{iid}{\sim} N(0,1)$ and compared the magnitude of $C(\boldsymbol{\eta})$ to the magnitude of the largest summand in Eq. 9. The results are plotted in Figure 1, where we observe that $C(\boldsymbol{\eta})$ decays rapidly in $K$, whereas the same is not true of its (largest) summands. Thus, under this prior, Algorithm 1 is unsuccessful except in low-dimensional settings.

Next, we let the spread of $\boldsymbol{\eta}$ vary by drawing $\eta_i \overset{iid}{\sim} N(0, \sigma^2)$, where $\sigma$ ranges between $10^{-2}$ and $10^2$. We then plotted, for each $\sigma$, the highest value of $K$ such that the output of Algorithm 1 agreed with the oracle to 3 significant figures. We repeated the procedure using single- and double-precision floating-point (orange and pink lines in Figure 2, respectively), as well as two Laplace inversion methods that will be discussed in Section 2.2. As $\sigma$ increases, the parameter values tend to move away from one another, making computation easier and allowing for much higher dimensions. Conversely, when $\sigma$ decreases, the parameter values come closer together, bringing us into the unstable region of Eq. 9, and Algorithm 1 fails in all but a few dimensions.

Figure 2: Scaling behavior of the numerical accuracy for computing $C(\boldsymbol{\eta})$, as a function of the spread of the parameter values. For each $\sigma$ (x-axis), we draw $\eta_i \overset{iid}{\sim} N(0, \sigma^2)$ for $i = 1, \dots, 40$, and we compute $C(\eta_1, \dots, \eta_{K-1})$ for each $K = 3, \dots, 40$. We then show on the y-axis the highest $K$ such that the numerical value of $C(\eta_1, \dots, \eta_{K-1})$ was equal to that obtained by the oracle (to 3 significant figures). The orange and pink lines show the result for Algorithm 1 using single- and double-precision, respectively. The purple and brown lines show the result for computing $C(\boldsymbol{\eta})$ using the inverse Laplace transform, i.e., Eq. 21, as discussed in Section 2.2. At each level of $\sigma$, we show error bars over 10 random draws of $\boldsymbol{\eta}$.

2.2 Inverse Laplace transform

In this section, we show that $C(\boldsymbol{\eta})$ can be written as the inverse Laplace transform of a function that does not suffer from catastrophic cancellation. In particular, said function can be passed to Laplace inversion algorithms (Davies and Martin, 1979) in order to evaluate $C(\boldsymbol{\eta})$ in the regime where Algorithm 1 fails.

Proposition: For a function $f : \mathbb{R}_+ \to \mathbb{R}_+$, let $\mathcal{L}[f](s) = \int_0^\infty f(t) e^{-st} \, dt$ denote its Laplace transform. Define the function $c : \mathbb{R}_+ \to \mathbb{R}_+$ by:

c(t) = \int_{\{\mathbb{x} : \sum_{i=1}^{K-1} x_i \leq t\}} \prod_{i=1}^K e^{\eta_i x_i} \, d\mu(\mathbb{x}).    (14)

Then the Laplace transform of $c$ is equal to:

\mathcal{L}[c](s) = \prod_{i=1}^K \frac{1}{s - \eta_i}.    (15)

Remark: The function $c$ includes the normalizing constant of the CC as the special case $C(\boldsymbol{\eta}) = c(1)$. More generally, we have that:

c(t) = t^{K-1} C(t \boldsymbol{\eta}),    (16)

as can be seen by letting $\tilde{x}_i = x_i / t$ in the integral of Eq. 14.

Proof: Let $f_\eta(x) = e^{\eta x}$, so that:

c(t) = \int_{\{\mathbb{x} : \sum_{i=1}^{K-1} x_i \leq t\}} \prod_{i=1}^K f_{\eta_i}(x_i) \, d\mu(\mathbb{x}) = \int_0^t \int_0^{t - x_1} \cdots \int_0^{t - x_1 - \cdots - x_{K-2}} \prod_{i=1}^K f_{\eta_i}(x_i) \, dx_{K-1} \cdots dx_2 \, dx_1.    (17)

Next we apply the transformation from Wolpert and Wolf (1995), defined by $w_1 = t$ and $w_k = w_{k-1} - x_{k-1}$, or equivalently, $w_k = t - \sum_{i=1}^{k-1} x_i$. We can then write our integral as a convolution:

c(t) = \int_0^{w_1} \int_0^{w_2} \cdots \int_0^{w_{K-1}} \prod_{i=1}^K f_{\eta_i}(x_i) \, dx_{K-1} \cdots dx_2 \, dx_1
     = \int_0^{w_1} \int_0^{w_2} \cdots \int_0^{w_{K-2}} \prod_{i=1}^{K-2} f_{\eta_i}(x_i) \int_0^{w_{K-1}} f_{\eta_{K-1}}(x_{K-1}) \, f_{\eta_K}(w_{K-1} - x_{K-1}) \, dx_{K-1} \cdots dx_2 \, dx_1
     = \int_0^{w_1} \int_0^{w_2} \cdots \int_0^{w_{K-2}} \prod_{i=1}^{K-2} f_{\eta_i}(x_i) \, (f_{\eta_{K-1}} \ast f_{\eta_K})(w_{K-1}) \, dx_{K-2} \cdots dx_2 \, dx_1
     = \int_0^{w_1} \cdots \int_0^{w_{K-3}} \prod_{i=1}^{K-3} f_{\eta_i}(x_i) \int_0^{w_{K-2}} f_{\eta_{K-2}}(x_{K-2}) \, (f_{\eta_{K-1}} \ast f_{\eta_K})(w_{K-2} - x_{K-2}) \, dx_{K-2} \cdots dx_1
     = \int_0^{w_1} \int_0^{w_2} \cdots \int_0^{w_{K-3}} \prod_{i=1}^{K-3} f_{\eta_i}(x_i) \, (f_{\eta_{K-2}} \ast f_{\eta_{K-1}} \ast f_{\eta_K})(w_{K-2}) \, dx_{K-3} \cdots dx_2 \, dx_1
     = \cdots
     = \left( \circledast_{i=1}^K f_{\eta_i} \right)(t).    (18)

Next, since the Laplace transform of a convolution equals the product of the Laplace transforms, we have that:

\mathcal{L}[c](s) = \prod_{i=1}^K \mathcal{L}[f_{\eta_i}](s),    (19)

but each univariate transform is simply:

\mathcal{L}[f_{\eta_i}](s) = \int_0^\infty e^{(\eta_i - s)t} \, dt = \frac{1}{s - \eta_i}, \quad s > \eta_i,    (20)

and the result follows. ∎

Corollary: The normalizing constant of the continuous categorical distribution can be written as the following inverse Laplace transform:

Cโ€‹(๐œผ)=โ„’โˆ’1โ€‹[โˆi=1K1sโˆ’ฮทi]โ€‹(1).\displaystyle C(\boldsymbol{\eta})=\mathcal{L}^{-1}\left[\prod_{i=1}^{K}\frac{1}{s-\eta_{i}}\right](1). (21)

Proof: Take the inverse Laplace transform in Eq. 15 to find:

c(t) = \mathcal{L}^{-1}\left[ \prod_{i=1}^K \frac{1}{s - \eta_i} \right](t).    (22)

Taking $t = 1$ gives the desired result. ∎

Unlike Eq. 9, the product in Eq. 15 does not suffer from catastrophic cancellation, nor does it blow up whenever $\eta_{j_1} \approx \eta_{j_2}$ for some $j_1 \neq j_2$. The corresponding Laplace inversion, i.e., Eq. 21, therefore provides an alternative method to compute our normalizing constant $C(\boldsymbol{\eta})$. Numerous numerical algorithms exist for inverting the Laplace transform; see (Cohen, 2007) for a survey. We note, however, that inverting the Laplace transform is in general a hard problem (Epstein and Schotland, 2008).

Empirically, we found modest success in computing Eq. 21 numerically. We tested three inversion algorithms, due to Talbot (Talbot, 1979), Stehfest (Stehfest, 1970), and De Hoog (De Hoog et al., 1982). The experimental setup was identical to that of Section 2.1.1, and the results are incorporated into Figure 2. We found De Hoog's method to be the most effective on our problem, whereas Talbot's always failed and is omitted from the figure. Importantly, De Hoog's method showed some success in the regime where Algorithm 1 failed, meaning it could be used as a complementary technique. However, no inversion method was able to scale beyond $K = 30$.
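For reference, the mpmath library exposes such inversion routines through its invertlaplace function (with methods 'talbot', 'stehfest', and 'dehoog'); a minimal sketch of Eq. 21, with illustrative parameter values chosen in the small-spread regime, could read:

```python
from mpmath import mp, mpf, fprod, invertlaplace

mp.dps = 30

# Illustrative parameters with small spread (the regime where Algorithm 1
# struggles); eta_K = 0 is appended as usual.
eta = [mpf('0.1'), mpf('0.2'), mpf('-0.1'), mpf(0)]

# The Laplace-domain function of Eq. 15: F(s) = prod_i 1 / (s - eta_i).
F = lambda s: 1 / fprod(s - e for e in eta)

# Eq. 21: C(eta) is the inverse Laplace transform of F evaluated at t = 1.
C = invertlaplace(F, 1, method='dehoog')
print(C)  # should match a direct evaluation of Eq. 9 (approx 0.17543 here)
```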

2.3 Inductive approach

In this section, we provide an alternative algorithm for computing our normalizing constant. This algorithm is based on the following recursive property of $C(\boldsymbol{\eta})$, which was used implicitly in the proof of Eq. 9 (Gordon-Rodriguez et al., 2020a).

Proposition: Define the subvector notation $\boldsymbol{\eta}_{:k} = (\eta_1, \dots, \eta_{k-1})$, and make the dependence on $K$ explicit by writing $C_K(\boldsymbol{\eta}) = C(\eta_1, \dots, \eta_{K-1})$. We also use the notation $\boldsymbol{\eta} - \eta_k = (\eta_1 - \eta_k, \dots, \eta_{K-1} - \eta_k)$. Then:

CKโ€‹(๐œผ)=eฮทKโˆ’1โ€‹CKโˆ’1โ€‹(๐œผ:(Kโˆ’1)โˆ’ฮทKโˆ’1)โˆ’CKโˆ’1โ€‹(๐œผ:(Kโˆ’1))ฮทKโˆ’1.\displaystyle C_{K}(\boldsymbol{\eta})=\frac{e^{\eta_{K-1}}C_{K-1}\left(\boldsymbol{\eta}_{:(K-1)}-\eta_{K-1}\right)-C_{K-1}\left(\boldsymbol{\eta}_{:(K-1)}\right)}{\eta_{K-1}}. (23)

Proof: We start from the integral definition of the left-hand side:

CKโ€‹(๐œผ)\displaystyle C_{K}(\text{\boldmath$\eta$}) =โˆซ๐•ŠKโˆ’1e๐œผโŠคโ€‹๐•ฉโ€‹๐‘‘ฮผ\displaystyle=\int_{{\mathbb{S}^{K-1}}}e^{\boldsymbol{\eta}^{\top}\mathbb{x}}d\mu
=โˆซ01โˆซ01โˆ’x1โ‹ฏโ€‹โˆซ01โˆ’x1โˆ’โ‹ฏโˆ’xKโˆ’2eโˆ‘i=1Kโˆ’1ฮทiโ€‹xiโ€‹๐‘‘xKโˆ’1โ€‹โ‹ฏโ€‹๐‘‘x2โ€‹๐‘‘x1.\displaystyle=\int_{0}^{1}\int_{0}^{1-x_{1}}\cdots\int_{0}^{1-x_{1}-\cdots-x_{K-2}}e^{\sum_{i=1}^{K-1}\eta_{i}x_{i}}dx_{K-1}\cdots dx_{2}dx_{1}. (24)

For the innermost integral, we have:

\int_0^{1 - x_1 - \cdots - x_{K-2}} e^{\sum_{i=1}^{K-1} \eta_i x_i} \, dx_{K-1} = e^{\sum_{i=1}^{K-2} \eta_i x_i} \int_0^{1 - x_1 - \cdots - x_{K-2}} e^{\eta_{K-1} x_{K-1}} \, dx_{K-1}
    = e^{\sum_{i=1}^{K-2} \eta_i x_i} \left( \frac{e^{\eta_{K-1}(1 - x_1 - \cdots - x_{K-2})} - 1}{\eta_{K-1}} \right)
    = \frac{e^{\eta_{K-1}} e^{\sum_{i=1}^{K-2} (\eta_i - \eta_{K-1}) x_i} - e^{\sum_{i=1}^{K-2} \eta_i x_i}}{\eta_{K-1}}
    = \frac{e^{\eta_{K-1}} e^{(\boldsymbol{\eta}_{:(K-1)} - \eta_{K-1})^\top \mathbb{x}_{:(K-1)}} - e^{\boldsymbol{\eta}_{:(K-1)}^\top \mathbb{x}_{:(K-1)}}}{\eta_{K-1}},    (25)

and the result follows by linearity of the integral. ∎

Remark: The base case $K = 2$ corresponds to the univariate continuous Bernoulli distribution, which admits a straightforward Taylor expansion that is useful around the unstable region $\eta_1 \approx 0$ (Loaiza-Ganem and Cunningham, 2019):

C_2(\boldsymbol{\eta}_{:2}) = \frac{e^{\eta_1} - 1}{\eta_1} = \frac{\left(1 + \eta_1 + \frac{1}{2!}\eta_1^2 + \cdots\right) - 1}{\eta_1} = 1 + \frac{1}{2!}\eta_1 + \cdots.    (26)
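Numerically, the base case is best evaluated with the standard expm1 primitive, which computes $e^x - 1$ without cancellation for small $x$ and recovers the Taylor limit $C_2(0) = 1$ automatically (a small sketch of our own, not from the paper):

```python
import math

def cb_norm_const(eta1):
    """C_2(eta_1) = (e^{eta_1} - 1) / eta_1, the continuous Bernoulli
    normalizing constant, computed stably around eta_1 = 0."""
    if eta1 == 0.0:
        return 1.0  # limiting value, from the Taylor expansion in Eq. 26
    return math.expm1(eta1) / eta1

print(cb_norm_const(1.0))    # e - 1 = 1.71828...
print(cb_norm_const(1e-12))  # approx 1.0; the naive ratio would lose digits
```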

In words, Eq. 23 states that we can compute $C(\boldsymbol{\eta})$ for the full parameter vector $\boldsymbol{\eta} = (\eta_1, \dots, \eta_{K-1})$ by first computing it for the lower-dimensional parameter vectors:

\boldsymbol{\eta}_{:(K-1)} = (\eta_1, \dots, \eta_{K-2}),
\boldsymbol{\eta}_{:(K-1)} - \eta_{K-1} = (\eta_1 - \eta_{K-1}, \dots, \eta_{K-2} - \eta_{K-1}).

Applying Eq. 23 again to each of these, we have that:

C_{K-1}(\boldsymbol{\eta}_{:(K-1)}) = \frac{e^{\eta_{K-2}} C_{K-2}(\boldsymbol{\eta}_{:(K-2)} - \eta_{K-2}) - C_{K-2}(\boldsymbol{\eta}_{:(K-2)})}{\eta_{K-2}},
C_{K-1}(\boldsymbol{\eta}_{:(K-1)} - \eta_{K-1}) = \frac{e^{\eta_{K-2} - \eta_{K-1}} C_{K-2}(\boldsymbol{\eta}_{:(K-2)} - \eta_{K-2}) - C_{K-2}(\boldsymbol{\eta}_{:(K-2)} - \eta_{K-1})}{\eta_{K-2} - \eta_{K-1}}.

Note that we are now left with not 4, but only 3 new parameter vectors to recurse on, namely $\boldsymbol{\eta}_{:(K-2)}$, $\boldsymbol{\eta}_{:(K-2)} - \eta_{K-1}$, and $\boldsymbol{\eta}_{:(K-2)} - \eta_{K-2}$, since the term $C_{K-2}(\boldsymbol{\eta}_{:(K-2)} - \eta_{K-2})$ appears twice. Repeating the argument $K - 2$ times and working backwards, we obtain Algorithm 2 for computing the normalizing constant.

Algorithm 2 Inductive computation of the normalizing constant

Input: a parameter vector $\boldsymbol{\eta}$ (with $\eta_K = 0$ fixed)
Output: the normalizing constant $C(\boldsymbol{\eta})$

1: Initialize $\mathbb{c} = (1, \dots, 1) \in \mathbb{R}^K$.
2: for $k = 1, 2, \dots, K-1$ do
3:     for $i = 1, \dots, K-k$ do
4:         Set $\xi_i = \eta_k - \eta_{k+i}$
5:         Set $\tilde{c}_i = \frac{e^{\xi_i} c_1 - c_{i+1}}{\xi_i}$
6:     end for
7:     Set $\mathbb{c} = \tilde{\mathbb{c}}$ (now a vector of length $K-k$)
8: end for
9: return $c_1$
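A direct transcription of Algorithm 2 follows (a sketch under our indexing conventions, assuming distinct parameter values so that no $\xi_i$ vanishes):

```python
import numpy as np

def cc_norm_inductive(eta):
    """Algorithm 2: inductive computation of C(eta) via the recursion
    of Eq. 23.

    eta: array-like of shape (K-1,); eta_K = 0 is appended internally.
    Assumes all K parameter values are distinct (no zero denominators).
    """
    eta = np.append(np.asarray(eta, dtype=np.float64), 0.0)
    K = eta.size
    c = np.ones(K)
    for k in range(K - 1):            # outer pass, k = 1, ..., K-1 above
        new = np.empty(K - 1 - k)
        for i in range(1, K - k):     # inner index, i = 1, ..., K-k
            xi = eta[k] - eta[k + i]
            new[i - 1] = (np.exp(xi) * c[0] - c[i]) / xi
        c = new
    return c[0]

print(cc_norm_inductive([1.0, 2.0, 3.0, 4.0]))  # approx 0.363217, cf. Eq. 10
```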

We find the numerical behavior of Algorithm 2 to be similar to that of Algorithm 1, suffering from the same cancellation issues in high dimensions. Nevertheless, we hope this alternative scheme may help to inspire further numerical improvements. For example, since the floating-point behavior of Algorithm 2 depends on the ordering of the elements of $\boldsymbol{\eta}$, it may be possible to design a reordering scheme that improves the overall precision of the algorithm; such reorderings have been explored in the context of sampling algorithms for the CC (Gordon-Rodriguez et al., 2020a). Other possibilities include Kahan summation (Kahan, 1965) and the compensated Horner algorithm (Langlois and Louvet, 2007); we leave their study to future work.

2.4 Overflow

We conclude this section with a small remark on numerical overflow in the context of Algorithm 1. In high dimensions and with large $\eta$ values, the individual summands of Eq. 9 can overflow, due to a large $e^{\eta_k}$ term or a small denominator. This can be addressed by rewriting the normalizing constant in a form that allows us to take advantage of the log-sum-exp trick:

logโกCโ€‹(๐œผ)\displaystyle\log C(\text{\boldmath$\eta$}) =logโก(โˆ‘k=1Keฮทkโˆiโ‰ k(ฮทkโˆ’ฮทi))\displaystyle=\log\left(\sum_{k=1}^{K}\frac{{e^{\eta_{k}}}}{\prod_{i\neq k}\left(\eta_{k}-\eta_{i}\right)}\right)
=logโก(โˆ‘k=1Keฮทksignโ€‹(โˆiโ‰ k(ฮทkโˆ’ฮทi))โ€‹โˆiโ‰ k|ฮทkโˆ’ฮทi|)\displaystyle=\log\left(\sum_{k=1}^{K}\frac{{e^{\eta_{k}}}}{\mathrm{sign}\left(\prod_{i\neq k}\left(\eta_{k}-\eta_{i}\right)\right)\prod_{i\neq k}|\eta_{k}-\eta_{i}|}\right)
=logโก(โˆ‘k=1Ksignโ€‹(โˆiโ‰ k(ฮทkโˆ’ฮทi))โ€‹expโก(ฮทkโˆ’โˆ‘iโ‰ klogโก|ฮทkโˆ’ฮทi|)).\displaystyle=\log\left(\sum_{k=1}^{K}\mathrm{sign}\left(\prod_{i\neq k}\left(\eta_{k}-\eta_{i}\right)\right)\exp\left(\eta_{k}-\sum_{i\neq k}\log\left|\eta_{k}-\eta_{i}\right|\right)\right).
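This sign-aware log-sum-exp rewriting can be sketched as follows (our own sketch; scipy.special.logsumexp's b and return_sign arguments provide equivalent functionality):

```python
import numpy as np

def cc_log_norm_const(eta):
    """log C(eta) via the sign-aware log-sum-exp rewriting of Eq. 9.

    This avoids overflow in e^{eta_k} and in the large products, but it
    does NOT cure catastrophic cancellation: the final signed sum can
    still lose precision when parameter values are close together.
    """
    eta = np.append(np.asarray(eta, dtype=np.float64), 0.0)
    diffs = eta[:, None] - eta[None, :]      # diffs[k, i] = eta_k - eta_i
    np.fill_diagonal(diffs, 1.0)             # sign 1, log 1 = 0: no effect
    signs = np.prod(np.sign(diffs), axis=1)  # sign of prod_{i != k}(...)
    log_abs = np.sum(np.log(np.abs(diffs)), axis=1)
    a = eta - log_abs                        # exponents eta_k - sum log|...|
    m = a.max()                              # log-sum-exp shift
    return m + np.log(np.sum(signs * np.exp(a - m)))

print(np.exp(cc_log_norm_const([1.0, 2.0, 3.0, 4.0])))  # approx 0.363217
```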

3 Normalizing constant with repeated parameters

Whenever we have an equality between any pair of parameters, Eq. 9 is undefined and, indeed, its proof by Gordon-Rodriguez et al. (2020a) breaks down. In this section, we derive a counterpart to Eq. 9 for the case when two or more elements of $\boldsymbol{\eta}$ are equal to one another. Note that, for mathematical convenience, we shall now write $\boldsymbol{\eta} \in \mathbb{R}^K$, where the $K$th component $\eta_K$ is included in the vector $\boldsymbol{\eta}$. As discussed in Section 1, this component can be taken as fixed at 0, or it can be treated as an additional free parameter, resulting in an overparameterized model (our results remain correct either way).

3.1 A simple example

First, we illustrate the main idea of our argument using an example with $K = 3$ and $\eta_1 = \eta_2 \neq \eta_3 = 0$. (Footnote 4: Note that the case $\eta_3 \neq 0$ and the case $\eta_1 \neq \eta_2 = \eta_3$ are mathematically equivalent (albeit more algebraically cumbersome), since we can permute the elements of $\boldsymbol{\eta}$ and shift by a constant without loss of generality.) By definition, the normalizing constant is then:

C3โ€‹(๐œผ)=โˆซ01โˆซ01โˆ’x1eฮท1โ€‹x1+ฮท2โ€‹x2โ€‹๐‘‘x2โ€‹๐‘‘x1=โˆซ01โˆซ01โˆ’x1eฮท1โ€‹(x1+x2)โ€‹๐‘‘x2โ€‹๐‘‘x1.\displaystyle C_{3}(\boldsymbol{\eta})=\int_{0}^{1}\int_{0}^{1-x_{1}}e^{\eta_{1}x_{1}+\eta_{2}x_{2}}dx_{2}dx_{1}=\int_{0}^{1}\int_{0}^{1-x_{1}}e^{\eta_{1}(x_{1}+x_{2})}dx_{2}dx_{1}. (27)

We apply the change of variables $u=x_{1}+x_{2}$, $v=x_{1}$ to obtain:

C3โ€‹(๐œผ)=โˆซ01โˆซ0ueฮท1โ€‹uโ€‹๐‘‘vโ€‹๐‘‘u=โˆซ01uโ€‹eฮท1โ€‹uโ€‹๐‘‘u.\displaystyle C_{3}(\boldsymbol{\eta})=\int_{0}^{1}\int_{0}^{u}e^{\eta_{1}u}dvdu=\int_{0}^{1}ue^{\eta_{1}u}du. (28)

Note that the last integral corresponds to the (unnormalized) expectation of a univariate CC random variable, an idea we now generalize.
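As a quick numerical sanity check of Eqs. 27 and 28 (a sketch; the value $a=1.7$ is an arbitrary choice):

```python
import numpy as np
from scipy.integrate import quad, dblquad

a = 1.7  # eta_1 = eta_2 = a, eta_3 = 0; an arbitrary test value

# Eq. 27: integrate e^{a(x1 + x2)} over the triangle 0 <= x2 <= 1 - x1
lhs, _ = dblquad(lambda x2, x1: np.exp(a * (x1 + x2)),
                 0.0, 1.0, lambda x1: 0.0, lambda x1: 1.0 - x1)

# Eq. 28: the same quantity after the change of variables
rhs, _ = quad(lambda u: u * np.exp(a * u), 0.0, 1.0)

# closed form of the 1D integral: (e^a (a - 1) + 1) / a^2
closed = (np.exp(a) * (a - 1.0) + 1.0) / a**2
```

All three quantities agree to numerical precision, confirming the change of variables.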

3.2 General formula

We start by proving the following lemma, which relates $C(\boldsymbol{\eta})$, where some elements of $\boldsymbol{\eta}$ are repeated (potentially many times), to $C(\boldsymbol{\eta}^{\prime})$, where $\boldsymbol{\eta}^{\prime}$ collapses the repeated elements of $\boldsymbol{\eta}$ onto a single coordinate. We again use subvector notation $\mathbb{x}_{k:}=(x_{k},x_{k+1},\dots,x_{K-1})$.

Lemma: Let $\mathbb{x}\sim\mathcal{CC}_{K}(\boldsymbol{\eta})$ with $\eta_{1}=\eta_{2}=\dots=\eta_{k}$, for $1\leq k\leq K-1$, and let $f:\mathbb{R}^{K-k-1}\to\mathbb{R}$. Then:

CKโ€‹(๐œผ)โ€‹๐”ผ๐•ฉโˆผ๐’žโ€‹๐’žKโ€‹(๐œผ)โ€‹[fโ€‹(๐•ฉ(k+1):)]=CKโˆ’k+1โ€‹(๐œผk:)โ€‹๐”ผ๐•ฆโˆผ๐’žโ€‹๐’žKโˆ’k+1โ€‹(๐œผk:)โ€‹[u1kโˆ’1(kโˆ’1)!โ€‹fโ€‹(๐•ฆ2:)].{C_{K}(\boldsymbol{\eta})}\mathbb{E}_{\mathbb{x}\sim\mathcal{CC}_{K}(\boldsymbol{\eta})}[f(\mathbb{x}_{(k+1):})]={C_{K-k+1}(\boldsymbol{\eta}_{k:})}\mathbb{E}_{\mathbb{u}\sim\mathcal{CC}_{K-k+1}(\boldsymbol{\eta}_{k:})}\left[\dfrac{u_{1}^{k-1}}{(k-1)!}f(\mathbb{u}_{2:})\right]. (29)

Remark: We are not assuming that the last $K-k$ coordinates of $\boldsymbol{\eta}$ are all different; we are simply assuming that the first $k$ are identical. Note also that the repeated coordinates need not be the first $k$ ones: we can always use this lemma to collapse repeated parameter values provided that the function $f$ does not depend on the corresponding coordinates, since we can relabel the coordinates by a suitable permutation without loss of generality.

Proof: By definition of the left-hand side:

\begin{align*}
C_{K}(\boldsymbol{\eta})\,\mathbb{E}_{\mathbb{x}\sim\mathcal{CC}_{K}(\boldsymbol{\eta})}\left[f(\mathbb{x}_{(k+1):})\right]&=\int_{0}^{1}\int_{0}^{1-x_{1}}\cdots\int_{0}^{1-x_{1}-\cdots-x_{K-2}}f(\mathbb{x}_{(k+1):})e^{\boldsymbol{\eta}^{\top}\mathbb{x}}\,dx_{K-1}\cdots dx_{2}\,dx_{1}\\
&=\int_{0}^{1}\int_{0}^{1-x_{1}}\cdots\int_{0}^{1-x_{1}-\cdots-x_{K-2}}f(\mathbb{x}_{(k+1):})e^{\eta_{k}(x_{1}+\cdots+x_{k})+\boldsymbol{\eta}_{(k+1):}^{\top}\mathbb{x}_{(k+1):}}\,dx_{K-1}\cdots dx_{2}\,dx_{1}. \tag{30}
\end{align*}

Consider the following change of variables (this is just like Section 3.1, but with more bookkeeping):

\begin{align*}
\begin{cases}
u_{1}=x_{1}+x_{2}+\cdots+x_{k}\\
u_{2}=x_{k+1}\\
u_{3}=x_{k+2}\\
\hspace{1em}\vdots\\
u_{K-k}=x_{K-1}\\
v_{1}=x_{1}\\
v_{2}=x_{2}\\
\hspace{1em}\vdots\\
v_{k-1}=x_{k-1}
\end{cases} \tag{31}
\end{align*}

This change of variables amounts to an invertible linear transformation whose Jacobian determinant has absolute value 1, so that we have:

CKโ€‹(๐œผ)\displaystyle C_{K}(\boldsymbol{\eta}) ๐”ผ๐•ฉโˆผ๐’žโ€‹๐’žKโ€‹(๐œผ)โ€‹[fโ€‹(๐•ฉ(k+1):)]=โˆซ01โˆซ01โˆ’u1โ‹ฏโ€‹โˆซ01โˆ’u1โˆ’u2โˆ’โ‹ฏโˆ’uKโˆ’kโˆซ0u1โˆซ0u1โˆ’v1โ‹ฏ\displaystyle\mathbb{E}_{\mathbb{x}\sim\mathcal{CC}_{K}(\boldsymbol{\eta})}[f(\mathbb{x}_{(k+1):})]=\int_{0}^{1}\int_{0}^{1-u_{1}}\cdots\int_{0}^{1-u_{1}-u_{2}-\cdots-u_{K-k}}\int_{0}^{u_{1}}\int_{0}^{u_{1}-v_{1}}\cdots (32)
โ€ฆโ€‹โˆซ0u1โˆ’v1โˆ’v2โˆ’โ‹ฏโˆ’vkโˆ’2fโ€‹(๐•ฆ2:)โ€‹eฮทkโ€‹u1+๐œผ(k+1):โŠคโ€‹๐•ฆ2:โ€‹๐‘‘vkโˆ’1โ€‹โ‹ฏโ€‹๐‘‘v2โ€‹๐‘‘v1โ€‹๐‘‘uKโˆ’kโ€‹โ‹ฏโ€‹๐‘‘u2โ€‹๐‘‘u1.\displaystyle\dots\int_{0}^{u_{1}-v_{1}-v_{2}-\dots-v_{k-2}}f(\mathbb{u}_{2:})e^{\eta_{k}u_{1}+\boldsymbol{\eta}_{(k+1):}^{\top}\mathbb{u}_{2:}}dv_{k-1}\cdots dv_{2}dv_{1}du_{K-k}\cdots du_{2}du_{1}.

Note that the change of variables is such that the integrand does not depend on $v_{1},v_{2},\dots,v_{k-1}$. Therefore:

CKโ€‹(๐œผ)\displaystyle C_{K}(\boldsymbol{\eta}) ๐”ผ๐•ฉโˆผ๐’žโ€‹๐’žKโ€‹(๐œผ)โ€‹[fโ€‹(๐•ฉ(k+1):)]\displaystyle\mathbb{E}_{\mathbb{x}\sim\mathcal{CC}_{K}(\boldsymbol{\eta})}[f(\mathbb{x}_{(k+1):})] (33)
=โˆซ01โˆซ01โˆ’u1โ‹ฏโ€‹โˆซ01โˆ’u1โˆ’u2โˆ’โ‹ฏโˆ’uKโˆ’kgโ€‹(u1)โ€‹fโ€‹(๐•ฆ2:)โ€‹e๐œผk:โŠคโ€‹๐•ฆโ€‹๐‘‘uKโˆ’kโ€‹โ‹ฏโ€‹๐‘‘u2โ€‹๐‘‘u1,\displaystyle=\int_{0}^{1}\int_{0}^{1-u_{1}}\cdots\int_{0}^{1-u_{1}-u_{2}-\cdots-u_{K-k}}g(u_{1})f(\mathbb{u}_{2:})e^{\boldsymbol{\eta}_{k:}^{\top}\mathbb{u}}du_{K-k}\cdots du_{2}du_{1},

where:

\begin{align*}
g(u_{1})=\int_{0}^{u_{1}}\int_{0}^{u_{1}-v_{1}}\cdots\int_{0}^{u_{1}-v_{1}-v_{2}-\cdots-v_{k-2}}dv_{k-1}\cdots dv_{2}\,dv_{1}. \tag{34}
\end{align*}

But this is simply the Lebesgue measure of a simplex inscribed in the hypercube $[0,u_{1}]^{k-1}$, so that $g(u_{1})=u_{1}^{k-1}\mu(\mathbb{S}^{k})=u_{1}^{k-1}/(k-1)!$ (this can also be seen by changing variables to $\tilde{v}_{i}=v_{i}/u_{1}$, or by applying Eq. 16). Multiplying and dividing by $C_{K-k+1}(\boldsymbol{\eta}_{k:})$ gives the desired result. ∎
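The identity $g(u_{1})=u_{1}^{k-1}/(k-1)!$ can be checked numerically for a small case, say $k=3$ (a sketch using SciPy quadrature; the helper names are ours):

```python
import numpy as np
from math import factorial
from scipy.integrate import dblquad

def g_numeric(u1):
    # Eq. 34 with k = 3: two inner variables, v1 and v2,
    # integrating the constant 1 over the scaled simplex
    val, _ = dblquad(lambda v2, v1: 1.0,
                     0.0, u1, lambda v1: 0.0, lambda v1: u1 - v1)
    return val

def g_closed(u1, k=3):
    # the claimed closed form u1^{k-1} / (k-1)!
    return u1 ** (k - 1) / factorial(k - 1)
```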

We can now derive a formula for the normalizing constant for an arbitrary parameter vector $\boldsymbol{\eta}$.

Corollary: Let $\boldsymbol{\eta}\in\mathbb{R}^{K}$ contain $D\leq K$ unique elements. Assume, without loss of generality, that $\boldsymbol{\eta}=(\eta_{1},\dots,\eta_{1},\eta_{2},\dots,\eta_{2},\dots,\eta_{D},\dots,\eta_{D})$, where the $i$th unique value is repeated $1\leq r_{i}\leq K$ times, with $\sum_{i=1}^{D}r_{i}=K$. Then:

CKโ€‹(๐œผ)=CDโ€‹(ฮท1,ฮท2,โ€ฆ,ฮทD)โ€‹๐”ผ๐•ฆโˆผ๐’žโ€‹๐’žDโ€‹(ฮท1,ฮท2,โ€ฆ,ฮทD)โ€‹[โˆi=1Duiriโˆ’1(riโˆ’1)!]C_{K}(\boldsymbol{\eta})={C_{D}(\eta_{1},\eta_{2},\dots,\eta_{D})}{\mathbb{E}_{\mathbb{u}\sim\mathcal{CC}_{D}(\eta_{1},\eta_{2},\dots,\eta_{D})}\left[\displaystyle\prod_{i=1}^{D}\dfrac{u_{i}^{r_{i}-1}}{(r_{i}-1)!}\right]} (35)

Proof: The result follows by applying the lemma $D$ times. First, we apply the lemma with $f(\cdot)=1$:

CKโ€‹(๐œผ)=CKโˆ’r1+1โ€‹(๐œผโ€ฒ)โ€‹๐”ผ๐•ฆโˆผ๐’žโ€‹๐’žKโˆ’r1+1โ€‹(๐œผโ€ฒ)โ€‹[u1r1โˆ’1(r1โˆ’1)!],\displaystyle C_{K}(\boldsymbol{\eta})={C_{K-r_{1}+1}(\boldsymbol{\eta}^{\prime})}{\mathbb{E}_{\mathbb{u}\sim\mathcal{CC}_{K-r_{1}+1}(\boldsymbol{\eta}^{\prime})}\left[\dfrac{u_{1}^{r_{1}-1}}{(r_{1}-1)!}\right]}, (36)

where ๐œผโ€ฒ=๐œผr1:=(ฮท1,ฮท2,โ€ฆ,ฮท2,โ€ฆ,ฮทD,โ€ฆ,ฮทD)\boldsymbol{\eta}^{\prime}=\boldsymbol{\eta}_{r_{1}:}=(\eta_{1},\eta_{2},\dots,\eta_{2},\dots,\eta_{D},\dots,\eta_{D}), i.e., we have collapsed the first parameter value onto a single coordinate. Next, we apply the lemma a second time on the new expectation term to collapse the ฮท2\eta_{2} values, this time using fโ€‹(u1)=u1r1โˆ’1/(r1โˆ’1)!f(u_{1})=u_{1}^{r_{1}-1}/(r_{1}-1)!, which does not depend on the ฮท2\eta_{2} coordinates:

\begin{align*}
C_{K-r_{1}+1}(\boldsymbol{\eta}^{\prime})\,\mathbb{E}_{\mathbb{u}\sim\mathcal{CC}_{K-r_{1}+1}(\boldsymbol{\eta}^{\prime})}&\left[\frac{u_{1}^{r_{1}-1}}{(r_{1}-1)!}\right] \tag{37}\\
&=C_{K-r_{1}-r_{2}+2}(\boldsymbol{\eta}^{\prime\prime})\,\mathbb{E}_{\mathbb{u}\sim\mathcal{CC}_{K-r_{1}-r_{2}+2}(\boldsymbol{\eta}^{\prime\prime})}\left[\frac{u_{1}^{r_{1}-1}}{(r_{1}-1)!}\frac{u_{2}^{r_{2}-1}}{(r_{2}-1)!}\right],
\end{align*}

where ๐œผโ€ฒโ€ฒ=(ฮท1,ฮท2,ฮท3,โ€ฆ,ฮท3,โ€ฆ,ฮทD,โ€ฆ,ฮทD)\boldsymbol{\eta}^{\prime\prime}=(\eta_{1},\eta_{2},\eta_{3},\dots,\eta_{3},\dots,\eta_{D},\dots,\eta_{D}). Repeating DD times yields the desired result. โˆŽ

Note that Eq. 35 can be computed using known results. The term $C_{D}(\eta_{1},\dots,\eta_{D})$ can be evaluated with Eq. 9, since all the parameter values are distinct. The expectation term can be computed by differentiating the moment generating function of $\mathbb{u}\sim\mathcal{CC}_{D}(\eta_{1},\eta_{2},\dots,\eta_{D})$, as discussed by Gordon-Rodriguez et al. (2020a). Note that in real data applications exact equality between parameter values may never occur, and it is unclear how close the elements of $\boldsymbol{\eta}$ should be in order to warrant applying Eq. 35. Nevertheless, Eq. 35 is important for theoretical completeness.
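To illustrate, consider the simple example of Section 3.1 ($K=3$, $\boldsymbol{\eta}=(a,a,0)$, so $D=2$ and $r=(2,1)$): Eq. 35 reduces to a closed form, which can be compared against Eq. 9 evaluated at a slightly perturbed parameter vector (a sketch; the helper name `C_eq9` is our own, and the perturbation sizes are arbitrary):

```python
import numpy as np

a = 1.3  # eta = (a, a, 0): K = 3, D = 2, repetitions r = (2, 1)

# Eq. 35 specialized to this case reduces to the closed form of
# Section 3.1: C_3(a, a, 0) = int_0^1 u e^{au} du
exact = (np.exp(a) * (a - 1.0) + 1.0) / a**2

def C_eq9(eta):
    """Distinct-parameter formula (Eq. 9); valid only for distinct entries."""
    eta = np.asarray(eta, dtype=float)
    K = len(eta)
    return sum(np.exp(eta[k]) / np.prod([eta[k] - eta[i]
               for i in range(K) if i != k]) for k in range(K))

# Eq. 9 at a mildly perturbed parameter vector approaches the exact value,
approx = C_eq9([a, a + 1e-6, 0.0])
# but can lose most of its significant digits to cancellation as the
# perturbation shrinks further:
unstable = C_eq9([a, a + 1e-13, 0.0])
```

This illustrates the trade-off noted above: perturbing away from exact equality makes Eq. 9 applicable, but pushes it into a numerically unstable regime.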

4 Discussion

The normalizing constant of the CC distribution is essential to applications, being a necessary prerequisite for evaluating likelihoods, optimizing models, and simulating samples alike. Computing this normalizing constant is nontrivial, and doing so in high dimensions remains an open problem. Our work represents a significant step toward this goal, improving our understanding of the numerical properties of different computation techniques, as well as advancing the underlying theory and algorithms. We hope our results will inspire further advances and increasingly robust numerical techniques that ultimately enable the use of the CC distribution with arbitrary parameter values on high-dimensional problems.

References

  • Bond-Taylor et al. (2021) Sam Bond-Taylor, Adam Leach, Yang Long, and Chris G Willcocks. Deep generative modelling: A comparative review of VAEs, GANs, normalizing flows, energy-based and autoregressive models. arXiv preprint arXiv:2103.04922, 2021.
  • Cohen (2007) Alan M Cohen. Numerical methods for Laplace transform inversion, volume 5. Springer Science & Business Media, 2007.
  • Davies and Martin (1979) Brian Davies and Brian Martin. Numerical inversion of the Laplace transform: a survey and comparison of methods. Journal of Computational Physics, 33(1):1–32, 1979.
  • De Hoog et al. (1982) Frank R De Hoog, JH Knight, and AN Stokes. An improved method for numerical inversion of Laplace transforms. SIAM Journal on Scientific and Statistical Computing, 3(3):357–366, 1982.
  • Dirichlet (1839) Peter Gustav Lejeune Dirichlet. Sur une nouvelle méthode pour la détermination des intégrales multiples. Journal de Mathématiques, Ser I, 4, pages 164–168, 1839.
  • Epstein and Schotland (2008) Charles L Epstein and John Schotland. The bad truth about Laplace's transform. SIAM Review, 50(3):504–520, 2008.
  • Goldberg (1991) David Goldberg. What every computer scientist should know about floating-point arithmetic. ACM Computing Surveys (CSUR), 23(1):5–48, 1991.
  • Gordon-Rodriguez et al. (2020a) Elliott Gordon-Rodriguez, Gabriel Loaiza-Ganem, and John Cunningham. The continuous categorical: a novel simplex-valued exponential family. In International Conference on Machine Learning, pages 3637–3647. PMLR, 2020a.
  • Gordon-Rodriguez et al. (2020b) Elliott Gordon-Rodriguez, Gabriel Loaiza-Ganem, Geoff Pleiss, and John Patrick Cunningham. Uses and abuses of the cross-entropy loss: Case studies in modern deep learning. In Proceedings on "I Can't Believe It's Not Better!" at NeurIPS Workshops, volume 137 of Proceedings of Machine Learning Research, pages 1–10. PMLR, 2020b.
  • Johansson et al. (2013) Fredrik Johansson et al. mpmath: a Python library for arbitrary-precision floating-point arithmetic (version 0.18), December 2013. http://mpmath.org/.
  • Kahan (1965) William Kahan. Pracniques: further remarks on reducing truncation errors. Communications of the ACM, 8(1):40, 1965.
  • Kingma and Welling (2014) Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.
  • Langlois and Louvet (2007) Philippe Langlois and Nicolas Louvet. How to ensure a faithful polynomial evaluation with the compensated Horner algorithm. In 18th IEEE Symposium on Computer Arithmetic (ARITH'07), pages 141–149. IEEE, 2007.
  • Loaiza-Ganem and Cunningham (2019) Gabriel Loaiza-Ganem and John P Cunningham. The continuous Bernoulli: fixing a pervasive error in variational autoencoders. In Advances in Neural Information Processing Systems, pages 13266–13276, 2019.
  • Stehfest (1970) Harald Stehfest. Algorithm 368: Numerical inversion of Laplace transforms [D5]. Communications of the ACM, 13(1):47–49, 1970.
  • Talbot (1979) Alan Talbot. The accurate numerical inversion of Laplace transforms. IMA Journal of Applied Mathematics, 23(1):97–120, 1979.
  • Wolpert and Wolf (1995) David H Wolpert and David R Wolf. Estimating functions of probability distributions from a finite set of samples. Physical Review E, 52(6):6841, 1995.