On the Normalizing Constant of
the Continuous Categorical Distribution
Abstract
Probability distributions supported on the simplex enjoy a wide range of applications across statistics and machine learning. Recently, a novel family of such distributions has been discovered: the continuous categorical. This family enjoys remarkable mathematical simplicity; its density function resembles that of the Dirichlet distribution, but with a normalizing constant that can be written in closed form using elementary functions only. In spite of this simplicity, our understanding of the normalizing constant remains far from complete. In this work, we characterize the numerical behavior of the normalizing constant and present theoretical and methodological advances that can, in turn, help to enable broader applications of the continuous categorical distribution. Our code is available at https://github.com/cunningham-lab/cb_and_cc/.
1 Introduction
The continuous categorical (CC) distribution is defined by the following density function (Gordon-Rodriguez et al., 2020a):
$$p(x; \lambda) \propto \prod_{i=1}^{K} \lambda_i^{x_i} \tag{1}$$
Here, $x$ denotes a simplex-valued random variable, and $\lambda$ denotes a simplex-valued parameter, in other words:¹ ¹Note that the $(K-1)$-simplex is also commonly defined as $\{x \in \mathbb{R}^{K} : x_i \geq 0,\ \sum_{i=1}^{K} x_i = 1\}$. The two definitions are equivalent, however $\Delta_{K-1}$ as defined in Eq. 2 is a subset of $\mathbb{R}^{K-1}$ with positive Lebesgue measure, whereas the alternative is a subset of $\mathbb{R}^{K}$ with zero Lebesgue measure. For this reason, using $\Delta_{K-1}$ will facilitate our later arguments involving integrals on the simplex.

$$x, \lambda \in \Delta_{K-1} = \left\{u \in \mathbb{R}^{K-1} : u_i > 0,\ \sum_{i=1}^{K-1} u_i < 1\right\} \tag{2}$$
where we additionally define the $K$th coordinates as the remainder:

$$x_K = 1 - \sum_{i=1}^{K-1} x_i \tag{3}$$

$$\lambda_K = 1 - \sum_{i=1}^{K-1} \lambda_i \tag{4}$$
It is natural to contrast the CC with the similar-looking Dirichlet distribution:
$$p(x; \alpha) \propto \prod_{i=1}^{K} x_i^{\alpha_i - 1} \tag{5}$$

where again $x \in \Delta_{K-1}$, but now $\alpha \in \mathbb{R}_{>0}^{K}$ denotes a positive unconstrained parameter vector with one more dimension than $x$.
While the densities in Eq. 1 and Eq. 5 look similar, they hide very different normalizing constants. In the Dirichlet case, it is well known that (Dirichlet, 1839):
$$\int_{\Delta_{K-1}} \prod_{i=1}^{K} x_i^{\alpha_i - 1}\, dx = \frac{\prod_{i=1}^{K} \Gamma(\alpha_i)}{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)} \tag{6}$$
where $\Gamma$ denotes the gamma function and $dx$ is the Lebesgue measure. On the other hand, the normalizing constant of the CC admits the following closed form (Gordon-Rodriguez et al., 2020a):
$$\int_{\Delta_{K-1}} \prod_{i=1}^{K} \lambda_i^{x_i}\, dx = \sum_{i=1}^{K} \frac{\lambda_i}{\prod_{j \neq i} \left(\log \lambda_i - \log \lambda_j\right)} \tag{7}$$
which contains elementary operations only. In spite of its mathematical simplicity, this normalizing constant can be numerically hard to compute, particularly in high dimensions. Moreover, Eq. 7 breaks down under equality of parameters, i.e., whenever $\lambda_i = \lambda_j$ for some $i \neq j$, because the denominator evaluates to zero. These issues will be the primary focus of our exposition, in particular:
- In Section 2.1, we characterize the numerical behavior of our normalizing constant. We demonstrate that vectorized computation can suffer from catastrophic cancellation, the severity of which depends on the proximity between parameter values.
- In Section 2.2, we rederive the normalizing constant as an inverse Laplace transform, which in turn can be evaluated using numerical inversion algorithms. We show that this alternative computation strategy exhibits good numerical behavior in the regime where catastrophic cancellation is most severe.
- In Section 2.3, we propose an orthogonal computational approach based on a recursive property of the normalizing constant.
- In Section 3, we derive a formula for the normalizing constant that remains valid when two or more parameter values are equal to one another.
We conclude this section with some remarks. First, note that in the 1-dimensional case, the CC distribution reduces to the continuous Bernoulli distribution (Loaiza-Ganem and Cunningham, 2019), which arose in the context of generative models of images (Kingma and Welling, 2014; Bond-Taylor et al., 2021) and provided the original inspiration for the CC family. More generally, the CC is closely related to the categorical cross-entropy loss commonly used in machine learning (Gordon-Rodriguez et al., 2020b).
We also note that the CC can be rewritten using the exponential family canonical form:
$$p(x; \eta) = C(\eta) \exp\left(\sum_{i=1}^{K-1} \eta_i x_i\right) \tag{8}$$
where $\eta \in \mathbb{R}^{K-1}$, with $\eta_i = \log(\lambda_i / \lambda_K)$, is the natural parameter, which conveniently becomes unconstrained real-valued. Note that, like with $x$ and $\lambda$, we will drop the $K$th coordinate to denote $\eta = (\eta_1, \dots, \eta_{K-1})$, since $\eta_K = \log(\lambda_K / \lambda_K) = 0$ is fixed.² ²In principle, we could let $\eta_K$ vary together with $\lambda$; Eqs. 8 and 9 would still hold, since the additional term in the density would compensate the change in $C(\eta)$. However, such a model would be overparameterized as it would be invariant to a parallel shift across all the $\eta_i$. For mathematical conciseness, we keep $\eta_K$ fixed at 0 and work with $\eta \in \mathbb{R}^{K-1}$. In this notation, the normalizing constant becomes:
$$C(\eta)^{-1} = \sum_{i=1}^{K} \frac{e^{\eta_i}}{\prod_{j \neq i} \left(\eta_i - \eta_j\right)}, \qquad \eta_K = 0 \tag{9}$$
which, again, is undefined whenever $\eta_i = \eta_j$ for some $i \neq j$.
2 Numerical computation of the normalizing constant
Eq. 9 can be vectorized efficiently as follows:
Algorithm 1. Input: a parameter vector $\eta \in \mathbb{R}^{K-1}$. Output: the normalizing constant $C(\eta)$.
Step 1: Append $\eta_K = 0$ to form $(\eta_1, \dots, \eta_K)$.
Step 2: Compute the matrix of differences $D_{ij} = \eta_i - \eta_j$, setting the diagonal entries $D_{ii}$ to 1.
Step 3: Compute the summation $S = \sum_{i=1}^{K} e^{\eta_i} \big/ \prod_{j=1}^{K} D_{ij}$.
Step 4: Return $C(\eta) = 1/S$.
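The following is a minimal NumPy sketch of Algorithm 1 (the function name and code organization are ours; only Eq. 9 itself comes from the text):

```python
import numpy as np

def cc_norm_const_naive(eta):
    """Algorithm 1 (sketch): evaluate C(eta) by direct summation of Eq. 9.

    `eta` holds the K-1 free natural parameters; eta_K = 0 is appended.
    Assumes all entries are distinct; otherwise a division by zero occurs.
    """
    eta = np.append(np.asarray(eta, dtype=np.float64), 0.0)  # Step 1: append eta_K = 0
    diffs = eta[:, None] - eta[None, :]                      # Step 2: differences eta_i - eta_j
    np.fill_diagonal(diffs, 1.0)                             # dummy 1s so row products skip j = i
    inv_c = np.sum(np.exp(eta) / np.prod(diffs, axis=1))     # Step 3: the summation of Eq. 9
    return 1.0 / inv_c                                       # Step 4: C(eta) is the reciprocal
```

For instance, `cc_norm_const_naive([0.5, 1.0])` returns the reciprocal of the $K = 3$ summation worked out in Eq. 10 below.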
2.1 Catastrophic cancellation
Algorithm 1 is easy to code up and adds little computational overhead to most models. However, the summation in Step 3 involves positive and negative numbers, which can result in catastrophic cancellation (Goldberg, 1991). We stress that the log-sum-exp trick, while useful for preventing overflow, cannot address catastrophic cancellation (see Section 2.4). For example, consider the case $K = 3$ with, say, $\eta = (0.5, 1)$. In single-precision floating-point, the summation in Eq. 9 evaluates to:
$$\frac{e^{0.5}}{(0.5 - 1)(0.5 - 0)} + \frac{e^{1}}{(1 - 0.5)(1 - 0)} + \frac{e^{0}}{(0 - 0.5)(0 - 1)} \approx -6.5949 + 5.4366 + 2 = 0.8417 \tag{10}$$
Note that the output is an order of magnitude smaller than (at least one of) the summands, and as a result we have lost one digit in precision. As we increase the dimension of the CC, the cancellation becomes more severe. For example, in the case $K = 5$ with, say, $\eta = (0.25, 0.5, 0.75, 1)$, the same summation becomes:
$$-54.78508 + 105.51816 - 90.32533 + 28.99501 + 10.66667 \approx 0.0694 \tag{11}$$
We have now lost 3 digits, since the leading summand is 3 orders of magnitude greater than the output. If we continue increasing $K$ in the same way, the loss grows by roughly one to two digits per added dimension, so that single-precision arithmetic, which carries roughly 7 significant digits, soon fails to produce even a single significant digit, and double-precision arithmetic, with roughly 16 significant digits, only postpones the same failure by a handful of dimensions.
We can summarize the problem as follows: when $C(\eta)^{-1}$ is of a much lower order of magnitude than the leading term in the summation of Eq. 9, numerical computation fails due to catastrophic cancellation. To complicate things further, the relationship between the two orders of magnitude (that of $C(\eta)^{-1}$ and that of the leading summand) is nontrivial. This relationship depends on the relative size of the exponential terms $e^{\eta_i}$ and the products of the differences $\prod_{j \neq i}(\eta_i - \eta_j)$, and is not straightforward to analyze. However, if two elements of $\eta$ are particularly close to one another, meaning that $\eta_i - \eta_j$ is close to 0 for some $i$ and $j$, it follows that the corresponding summands:
$$\frac{e^{\eta_i}}{\prod_{k \neq i}\left(\eta_i - \eta_k\right)} \quad \text{and} \quad \frac{e^{\eta_j}}{\prod_{k \neq j}\left(\eta_j - \eta_k\right)} \tag{12}$$
are (a) large in magnitude (due to the near-zero $\eta_i - \eta_j$ term in the denominator), (b) of opposing sign (due to the sign flip between $\eta_i - \eta_j$ and $\eta_j - \eta_i$), and (c) similar in absolute value (since the remaining terms in the two products are approximately equal). Thus, it is likely that the terms in Eq. 12 are of leading order and catastrophically cancel each other out. Note that, as the dimensionality $K$ increases, it becomes more likely that some pair of the components of $\eta$ are close to one another, and therefore computation becomes harder.
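The following self-contained snippet (ours; the parameter scales are illustrative) makes this effect visible by evaluating the summation of Eq. 9 at two precisions. With clustered parameters, single precision typically retains no correct digits while double precision retains only a few; with well-separated parameters, the two results agree:

```python
import numpy as np

def inv_c_sum(eta, dtype):
    """The summation of Eq. 9, carried out in the given floating-point precision."""
    eta = np.append(np.asarray(eta, dtype=dtype), dtype(0.0))
    diffs = eta[:, None] - eta[None, :]
    np.fill_diagonal(diffs, 1.0)
    return np.sum(np.exp(eta) / np.prod(diffs, axis=1))

rng = np.random.default_rng(0)
eta_close = 0.1 * rng.standard_normal(8)  # entries close together: severe cancellation
eta_far = 10.0 * rng.standard_normal(8)   # entries well separated: benign

for label, eta in [("close", eta_close), ("far", eta_far)]:
    s32, s64 = inv_c_sum(eta, np.float32), inv_c_sum(eta, np.float64)
    # Disagreement between the two precisions reveals the digits lost.
    print(f"{label}: float32 = {s32:.6e}, float64 = {s64:.6e}")
```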

Another helpful intuition can be obtained by reasoning from the integral:
$$C(\eta)^{-1} = \int_{\Delta_{K-1}} \exp\left(\sum_{i=1}^{K-1} \eta_i x_i\right) dx \tag{13}$$
As $K$ increases, the Lebesgue measure of the simplex decays like $1/(K-1)!$.³ ³As can be seen, for example, by taking Eq. 6 with $\alpha_i = 1$ for all $i$. Therefore, assuming the components of $\eta$ are $O(1)$, we have that the integrand is $O(1)$ also, and therefore $C(\eta)^{-1} = O(1/(K-1)!)$. Under this assumption, we also have that $e^{\eta_i} = O(1)$, and therefore the summands in Eq. 9 cannot decay factorially fast (they may, but need not, decay at most exponentially, due to the product of $K - 1$ terms of constant order in the denominator). Thus, such a regime implies catastrophic cancellation is inevitable for large enough $K$. On the other hand, when all the components of $\eta$ are far from one another, we are spared of such cancellation and Algorithm 1 succeeds, including in high dimensions. We demonstrate these behaviors empirically in the following experiments (see Figures 1 and 2).
2.1.1 Experiments
To evaluate the effectiveness of Algorithm 1 for computing $C(\eta)$, we first implemented an oracle capable of correctly computing $C(\eta)$ to within 4 significant figures (at a potentially large computational cost). Our oracle is an implementation of Eq. 9 with arbitrary-precision floating-point, using the mpmath library (Johansson et al., 2013). In particular, for a given $\eta$, we ensure the level of precision is set appropriately by computing Eq. 9 repeatedly at increasingly high precision, until the outputs converge (to 4 significant figures). Equipped with this oracle, we then drew parameter vectors $\eta$ with entries from a normal prior for a variety of dimensions $K$, to analyze the behavior of $C(\eta)$.
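A sketch of how such an oracle can be implemented with mpmath follows (the convergence loop and all names are ours):

```python
import mpmath as mp

def oracle_inv_c(eta, sig_figs=4, start_dps=30, max_dps=4000):
    """Evaluate the summation of Eq. 9 in arbitrary precision, doubling the
    number of decimal places until two consecutive evaluations agree to
    `sig_figs` significant figures."""
    def eval_sum(dps):
        with mp.workdps(dps):
            e = [mp.mpf(float(x)) for x in eta] + [mp.mpf(0)]  # append eta_K = 0
            return mp.fsum(
                mp.exp(ei) / mp.fprod(ei - ej for j, ej in enumerate(e) if j != i)
                for i, ei in enumerate(e))
    dps, prev = start_dps, None
    while dps <= max_dps:
        curr = eval_sum(dps)
        # Converged once the relative difference drops below the target precision.
        if prev is not None and abs(curr - prev) <= abs(curr) * mp.mpf(10) ** -sig_figs:
            return curr
        prev, dps = curr, 2 * dps
    raise RuntimeError("precision did not converge; increase max_dps")
```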
First, we fixed the scale of the prior and compared the magnitude of $C(\eta)^{-1}$ to the magnitude of the largest summand in Eq. 9. The results are plotted in Figure 1, where we observe that $C(\eta)^{-1}$ decays rapidly in $K$, whereas the same is not true of its (largest) summands. Thus, under this prior, Algorithm 1 is unsuccessful except in low-dimensional settings.
Next, we let the spread of $\eta$ vary by drawing $\eta_i \sim \mathcal{N}(0, \sigma^2)$, where $\sigma$ ranges over several orders of magnitude. We then plotted, for each $\sigma$, the highest value of $K$ such that the output of Algorithm 1 agreed with the oracle to 3 significant figures. We repeated the procedure using single- and double-precision floating-point (orange and pink lines in Figure 2, respectively), as well as two Laplace inversion methods that will be discussed in Section 2.2. As $\sigma$ increases, the parameter values tend to move away from one another, making computation easier and allowing for much higher dimensions. On the other hand, when $\sigma$ decreases, the parameter values come closer, bringing us into the unstable regions of Eq. 9, and Algorithm 1 fails for all but just a few dimensions.

2.2 Inverse Laplace transform
In this section, we show that $C(\eta)^{-1}$ can be written as the inverse Laplace transform of a function that does not suffer from catastrophic cancellation. In particular, said function can be passed to Laplace inversion methods (Davies and Martin, 1979) in order to evaluate $C(\eta)^{-1}$ in the regime where Algorithm 1 fails.
Proposition: For a function $f : [0, \infty) \to \mathbb{R}$, let $\mathcal{L}\{f\}(s) = \int_0^\infty e^{-st} f(t)\, dt$ denote its Laplace transform. Define the function $g : [0, \infty) \to \mathbb{R}$ by:
$$g(t) = \int_{\Delta_{K-1}(t)} \exp\left(\sum_{i=1}^{K-1} \eta_i x_i\right) dx, \qquad \Delta_{K-1}(t) = \left\{x \in \mathbb{R}^{K-1} : x_i > 0,\ \sum_{i=1}^{K-1} x_i < t\right\} \tag{14}$$
Then the Laplace transform of $g$ is equal to:

$$\mathcal{L}\{g\}(s) = \prod_{i=1}^{K} \frac{1}{s - \eta_i}, \qquad \eta_K = 0 \tag{15}$$
Remark: The function $g$ includes the normalizing constant of the CC as the special case $g(1) = C(\eta)^{-1}$. More generally, we have that:

$$g(t) = t^{K-1}\, C(t\eta)^{-1} \tag{16}$$

as can be seen from applying the change of variables $x \mapsto tx$ in the integral of Eq. 14.
Proof: Let $f_i(t) = e^{\eta_i t}$ for $i = 1, \dots, K$ (recall $\eta_K = 0$), so that:

$$g(t) = \int_{\Delta_{K-1}(t)} \prod_{i=1}^{K} f_i(x_i)\, dx, \qquad x_K = t - \sum_{i=1}^{K-1} x_i \tag{17}$$
Next we apply the transformation from Wolpert and Wolf (1995), defined as $y_i = \sum_{j=1}^{i} x_j$ for $i = 1, \dots, K-1$, or equivalently, $x_i = y_i - y_{i-1}$ (with $y_0 = 0$). We can then write our integral as a convolution:

$$g(t) = \left(f_1 * f_2 * \cdots * f_K\right)(t) \tag{18}$$
Next, since the Laplace transform of a convolution equals the product of the Laplace transforms, we have that:

$$\mathcal{L}\{g\}(s) = \prod_{i=1}^{K} \mathcal{L}\{f_i\}(s) \tag{19}$$
but the univariate case is simply:

$$\mathcal{L}\{f_i\}(s) = \int_0^\infty e^{-st}\, e^{\eta_i t}\, dt = \frac{1}{s - \eta_i}, \qquad s > \eta_i \tag{20}$$
and the result follows. $\square$
Corollary: The normalizing constant of the continuous categorical distribution can be written as the following inverse Laplace transform:
$$C(\eta)^{-1} = \mathcal{L}^{-1}\left\{\prod_{i=1}^{K} \frac{1}{s - \eta_i}\right\}(1) \tag{21}$$
Proof: Take the inverse Laplace transform in Eq. 15 to find:
$$g(t) = \mathcal{L}^{-1}\left\{\prod_{i=1}^{K} \frac{1}{s - \eta_i}\right\}(t) \tag{22}$$
Taking $t = 1$ gives the desired result. $\square$
Unlike Eq. 9, the product in Eq. 15 does not suffer from catastrophic cancellation, nor does it diverge whenever $\eta_i = \eta_j$ for some $i \neq j$. The corresponding Laplace inversion, i.e., Eq. 21, provides an alternative method to compute our normalizing constant $C(\eta)$. Numerous numerical algorithms exist for inverting the Laplace transform; see Cohen (2007) for a survey. We note, however, that inverting the Laplace transform is generally a hard problem (Epstein and Schotland, 2008).
Empirically, we found modest success in computing Eq. 21 numerically. We tested three inversion algorithms, due to Talbot (1979), Stehfest (1970), and De Hoog et al. (1982). The experimental setup was identical to that of Section 2.1.1, and the results are incorporated into Figure 2. We found De Hoog's method to be the most effective on our problem, whereas Talbot's always failed and is omitted from the figure. Importantly, De Hoog's method showed some success in the regime where Algorithm 1 failed, meaning it could be used as a complementary technique. However, no inversion method was able to scale beyond moderate values of $K$.
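For reference, mpmath ships implementations of these inversion schemes through its `invertlaplace` routine; the following wrapper (ours, with illustrative precision settings) evaluates Eq. 21:

```python
import mpmath as mp

def inv_c_via_laplace(eta, method="dehoog", dps=30):
    """Evaluate C(eta)^{-1} = L^{-1}{ prod_i 1/(s - eta_i) }(1), cf. Eq. 21.

    The transform itself (Eq. 15) is a pure product, hence free of
    catastrophic cancellation; the numerical difficulty moves into the
    inversion step.
    """
    with mp.workdps(dps):
        e = [mp.mpf(float(x)) for x in eta] + [mp.mpf(0)]  # append eta_K = 0
        F = lambda s: 1 / mp.fprod(s - ei for ei in e)     # the product of Eq. 15
        return mp.invertlaplace(F, 1, method=method)       # invert and evaluate at t = 1
```

The `method` argument selects among `'talbot'`, `'stehfest'`, and `'dehoog'`.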
2.3 Inductive approach
In this section, we provide an alternative algorithm to compute our normalizing constant. This algorithm will be based on the following recursive property of $C(\eta)$, which was implicitly used as part of the proof of Eq. 9 (Gordon-Rodriguez et al., 2020a).
Proposition: Define the subvector notation $\eta_{1:k} = (\eta_1, \dots, \eta_k)$, and make the dimension explicit by writing $C(\eta_{1:k})$ for the normalizing constant of the $(k+1)$-dimensional CC with natural parameter $\eta_{1:k}$. We also use the notation $\eta_{1:k} - c = (\eta_1 - c, \dots, \eta_k - c)$ for a scalar $c$. Then:

$$C(\eta_{1:K-1})^{-1} = \frac{e^{\eta_{K-1}}\, C(\eta_{1:K-2} - \eta_{K-1})^{-1} - C(\eta_{1:K-2})^{-1}}{\eta_{K-1}} \tag{23}$$
Proof: We start from the integral definition of the left hand side:

$$C(\eta_{1:K-1})^{-1} = \int_{\Delta_{K-2}} \exp\left(\sum_{i=1}^{K-2} \eta_i x_i\right) \left[\int_0^{1 - \sum_{i=1}^{K-2} x_i} e^{\eta_{K-1} x_{K-1}}\, dx_{K-1}\right] dx_1 \cdots dx_{K-2} \tag{24}$$
For the innermost integral, we have:

$$\int_0^{r} e^{\eta_{K-1} x_{K-1}}\, dx_{K-1} = \frac{e^{\eta_{K-1} r} - 1}{\eta_{K-1}} = \frac{e^{\eta_{K-1}}\, e^{-\eta_{K-1} \sum_{i=1}^{K-2} x_i} - 1}{\eta_{K-1}}, \qquad r = 1 - \sum_{i=1}^{K-2} x_i \tag{25}$$
and the result follows by linearity of the integral, noting that $\exp\left(\sum_{i=1}^{K-2} \eta_i x_i\right) e^{-\eta_{K-1} \sum_{i=1}^{K-2} x_i} = \exp\left(\sum_{i=1}^{K-2} (\eta_i - \eta_{K-1})\, x_i\right)$. $\square$
Remark: The base case $K = 2$ corresponds to the univariate continuous Bernoulli distribution, which admits a straightforward Taylor expansion that is useful around the unstable region $\eta_1 \approx 0$ (Loaiza-Ganem and Cunningham, 2019):

$$C(\eta_1)^{-1} = \frac{e^{\eta_1} - 1}{\eta_1} = \sum_{n=0}^{\infty} \frac{\eta_1^n}{(n+1)!} = 1 + \frac{\eta_1}{2} + \frac{\eta_1^2}{6} + \cdots \tag{26}$$
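In code, the base case can be stabilized near $\eta_1 = 0$ either by truncating the series in Eq. 26 or with a library primitive such as `expm1`; a small sketch (the switching threshold is an illustrative choice of ours):

```python
import numpy as np

def cb_inv_norm_const(eta1, taylor_radius=1e-4):
    """Base case: C(eta_1)^{-1} = (e^{eta_1} - 1) / eta_1, cf. Eq. 26."""
    if abs(eta1) < taylor_radius:
        # Truncated Taylor series of Eq. 26: 1 + eta/2 + eta^2/6 + ...
        return 1.0 + eta1 / 2.0 + eta1 ** 2 / 6.0
    return np.expm1(eta1) / eta1  # expm1 avoids the cancellation in e^x - 1
```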
In words, Eq. 23 is stating that we can compute $C^{-1}$ for the full parameter vector $\eta_{1:K-1}$ by first computing it for the lower-dimensional parameter vectors $\eta_{1:K-2} - \eta_{K-1}$ and $\eta_{1:K-2}$.
Substituting these back into Eq. 23, we have that:

$$C(\eta_{1:K-1})^{-1} = \frac{e^{\eta_{K-1}}}{\eta_{K-1}} \cdot \frac{e^{\eta_{K-2} - \eta_{K-1}}\, C(\eta_{1:K-3} - \eta_{K-2})^{-1} - C(\eta_{1:K-3} - \eta_{K-1})^{-1}}{\eta_{K-2} - \eta_{K-1}} - \frac{1}{\eta_{K-1}} \cdot \frac{e^{\eta_{K-2}}\, C(\eta_{1:K-3} - \eta_{K-2})^{-1} - C(\eta_{1:K-3})^{-1}}{\eta_{K-2}}$$
Note that we are now left with not 4, but 3 new parameter vectors to recurse on: $\eta_{1:K-3} - \eta_{K-2}$, $\eta_{1:K-3} - \eta_{K-1}$, and $\eta_{1:K-3}$. Repeating the argument $K - 2$ times and working backwards we obtain Algorithm 2 for computing the normalizing constant.
Algorithm 2. Input: a parameter vector $\eta \in \mathbb{R}^{K-1}$. Output: the normalizing constant $C(\eta)$. Starting from the base case of Eq. 26, apply the recursion of Eq. 23 to the shifted subvectors $\eta_{1:k} - \eta_j$, increasing $k$ from 1 to $K - 1$ and memoizing each intermediate value; return the reciprocal of the final $C(\eta_{1:K-1})^{-1}$.
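A memoized sketch of Algorithm 2 follows (ours; `N(k, j)` stands for $C(\eta_{1:k} - \eta_j)^{-1}$, with $j = K$ denoting no shift, since $\eta_K = 0$):

```python
import numpy as np
from functools import lru_cache

def cc_norm_const_recursive(eta):
    """Algorithm 2 (sketch): evaluate C(eta) via the recursion of Eq. 23."""
    eta = np.append(np.asarray(eta, dtype=np.float64), 0.0)
    K = len(eta)  # eta now has K components, with eta[K-1] = eta_K = 0

    @lru_cache(maxsize=None)
    def N(k, j):
        # C^{-1} of the k-component shifted vector eta_{1:k} - eta_j (j is 1-based).
        if k == 0:
            return 1.0  # empty parameter vector
        a = eta[k - 1] - eta[j - 1]  # last component of the shifted vector
        if k == 1:
            return np.expm1(a) / a if a != 0.0 else 1.0  # continuous Bernoulli base case
        # Eq. 23 applied to the shifted vector; fails if a == 0, just like Eq. 9.
        return (np.exp(a) * N(k - 1, k) - N(k - 1, j)) / a

    return 1.0 / N(K - 1, K)
```

Memoization keeps the cost at $O(K^2)$ subproblems, mirroring the triangular structure of the recursion.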
We find the numerical behavior of Algorithm 2 to be similar to that of Algorithm 1, suffering from the same cancellation issues in high dimensions. Nevertheless, we hope this alternative scheme may help to inspire further numerical improvements. For example, since the floating-point behavior of Algorithm 2 depends on the ordering of the elements of $\eta$, it may be possible to design a reordering scheme that improves the overall precision of the algorithm; such reorderings have been explored in the context of sampling algorithms for the CC (Gordon-Rodriguez et al., 2020a). Other possibilities include Kahan summation (Kahan, 1965) or the compensated Horner algorithm (Langlois and Louvet, 2007); we leave their study to future work.
2.4 Overflow
We conclude this section with a small remark on numerical overflow in the context of Algorithm 1. In high dimensions and with large parameter values, overflow can occur for the terms in the summand, due to a large exponential term $e^{\eta_i}$ or a small denominator $\prod_{j \neq i}(\eta_i - \eta_j)$. This can be addressed by re-writing the normalizing constant in a form that allows us to take advantage of the log-sum-exp trick:

$$C(\eta)^{-1} = \sum_{i=1}^{K} s_i \exp\left(\eta_i - \sum_{j \neq i} \log \left|\eta_i - \eta_j\right|\right), \qquad s_i = \prod_{j \neq i} \operatorname{sign}\left(\eta_i - \eta_j\right),$$

where the positive and the negative terms can each be aggregated with a stable log-sum-exp before taking the final difference.
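A sketch of this computation (ours), using SciPy's signed log-sum-exp:

```python
import numpy as np
from scipy.special import logsumexp

def cc_log_norm_const(eta):
    """Overflow-safe log C(eta): each summand of Eq. 9 is represented by its
    log-magnitude and its sign, then aggregated with a signed log-sum-exp.
    This guards against overflow only, not against catastrophic cancellation."""
    eta = np.append(np.asarray(eta, dtype=np.float64), 0.0)
    diffs = eta[:, None] - eta[None, :]
    np.fill_diagonal(diffs, 1.0)  # dummy 1s: log|1| = 0 and sign(1) = +1
    log_mag = eta - np.sum(np.log(np.abs(diffs)), axis=1)  # log of |summand_i|
    signs = np.prod(np.sign(diffs), axis=1)                # sign of summand_i
    log_inv_c, sign = logsumexp(log_mag, b=signs, return_sign=True)
    assert sign > 0  # C(eta)^{-1} is a positive quantity
    return -log_inv_c  # log C(eta) = -log C(eta)^{-1}
```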
3 Normalizing constant with repeated parameters
Whenever we have an equality between any pair of parameters, Eq. 9 is undefined and, indeed, its proof by Gordon-Rodriguez et al. (2020a) breaks down. In this section, we derive a counterpart to Eq. 9 for the case when 2 or more elements of $\eta$ are equal to one another. Note that, for mathematical convenience, we shall now denote $\eta = (\eta_1, \dots, \eta_K)$, where the $K$th component is now included in the vector $\eta$. As discussed in Section 1, this component can be taken as fixed at 0, or it can be treated as an additional free parameter, resulting in an overparameterized model (our results will remain correct nevertheless).
3.1 A simple example
First, we illustrate the main idea of our argument using an example with $K = 3$ and $\eta_1 = \eta_2$.⁴ ⁴Note that the case $\eta_1 = \eta_2$ and the general case $\eta_i = \eta_j$, $i \neq j$, are mathematically equivalent (the latter being merely more algebraically cumbersome), since we can permute the elements of $\eta$ and shift by a constant without loss of generality. By definition, the normalizing constant is then:

$$C(\eta)^{-1} = \int_{\Delta_2} \exp\left(\eta_1 x_1 + \eta_1 x_2 + \eta_3 x_3\right) dx = \int_{\Delta_2} \exp\left(\eta_1 (x_1 + x_2) + \eta_3 (1 - x_1 - x_2)\right) dx_1\, dx_2 \tag{27}$$
We apply the change of variables $u = x_1 + x_2$, $v = x_2$, to obtain:

$$C(\eta)^{-1} = \int_0^1 \int_0^u e^{\eta_1 u + \eta_3 (1 - u)}\, dv\, du = e^{\eta_3} \int_0^1 u\, e^{(\eta_1 - \eta_3) u}\, du \tag{28}$$
Note that the last integral corresponds to the expectation of a univariate CC random variable, an idea we now generalize.
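This reduction is easy to check numerically; a brief mpmath sketch (the parameter values are an illustrative choice of ours):

```python
import mpmath as mp

a, b = mp.mpf("0.7"), mp.mpf("-0.3")  # eta = (a, a, b), illustrative values

# Left hand side of Eq. 28: direct two-dimensional integral over the simplex.
direct = mp.quad(
    lambda x1: mp.quad(
        lambda x2: mp.exp(a * (x1 + x2) + b * (1 - x1 - x2)), [0, 1 - x1]),
    [0, 1])

# Right hand side of Eq. 28: the one-dimensional reduction.
reduced = mp.exp(b) * mp.quad(lambda u: u * mp.exp((a - b) * u), [0, 1])

print(direct, reduced)  # the two values agree to working precision
```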
3.2 General formula
We start by proving the following lemma, which relates $C(\eta)$, where some elements of $\eta$ are repeated (potentially many times), to $C(\tilde\eta)$, where $\tilde\eta$ collapses the repeated elements of $\eta$ onto a single coordinate. We again use subvector notation $\eta_{i:j} = (\eta_i, \dots, \eta_j)$.
Lemma: Let $\eta \in \mathbb{R}^{K}$ with $\eta_i = \eta_1$ for $i = 1, \dots, m$, and let $\tilde\eta = (\eta_1, \eta_{m+1}, \dots, \eta_K) \in \mathbb{R}^{K-m+1}$. Then, for any integrable function $f$:

$$\mathbb{E}_{x \sim \mathcal{CC}(\eta)}\left[f(x_{m+1:K})\right] = \frac{C(\eta)}{C(\tilde\eta)\,(m-1)!}\; \mathbb{E}_{y \sim \mathcal{CC}(\tilde\eta)}\left[y_1^{m-1}\, f(y_{2:K-m+1})\right] \tag{29}$$
Remark: We are not assuming that the last $K - m$ coordinates of $\eta$ are all different, we are simply assuming that the first $m$ are identical. Note also that the positions of the $m$ repeated coordinates could be arbitrary and need not be the first ones; we can always use this lemma to collapse repeated parameter values provided that the function $f$ does not depend on the corresponding coordinates (we can simply relabel the coordinates by a suitable permutation without loss of generality).
Proof: By definition of the left hand side:

$$\mathbb{E}_{x \sim \mathcal{CC}(\eta)}\left[f(x_{m+1:K})\right] = C(\eta) \int_{\Delta_{K-1}} f(x_{m+1:K}) \exp\left(\eta_1 \sum_{i=1}^{m} x_i + \sum_{i=m+1}^{K} \eta_i x_i\right) dx \tag{30}$$
Consider the following change of variables (note this is just like Section 3.1, but with more bookkeeping):

$$y_1 = \sum_{i=1}^{m} x_i, \qquad y_i = x_i \quad \text{for } i = 2, \dots, K-1 \tag{31}$$
This change of variables amounts to an invertible linear transformation with the property that the absolute value of the determinant of its Jacobian is 1, so that we have:

$$\mathbb{E}\left[f(x_{m+1:K})\right] = C(\eta) \int f(y_{m+1:K})\, e^{\eta_1 y_1 + \sum_{i=m+1}^{K} \eta_i y_i} \left[\int_{\left\{y_i > 0\,:\ \sum_{i=2}^{m} y_i < y_1\right\}} dy_2 \cdots dy_m\right] dy_1\, dy_{m+1} \cdots dy_{K-1} \tag{32}$$
Note that the change of variables is such that the integrand does not depend on $y_2, \dots, y_m$. Therefore:

$$\mathbb{E}\left[f(x_{m+1:K})\right] = C(\eta) \int_{\Delta_{K-m}} f(y_{m+1:K})\, e^{\eta_1 y_1 + \sum_{i=m+1}^{K} \eta_i y_i}\; V(y_1)\, dy_1\, dy_{m+1} \cdots dy_{K-1} \tag{33}$$
where:

$$V(y_1) = \int_{\left\{y_i > 0\,:\ \sum_{i=2}^{m} y_i < y_1\right\}} dy_2 \cdots dy_m \tag{34}$$
But this is simply the Lebesgue measure of a simplex inscribed in the hypercube $[0, y_1]^{m-1}$, so that $V(y_1) = y_1^{m-1} / (m-1)!$ (this can also be seen by changing variables to $z_i = y_i / y_1$, or by applying Eq. 16). Multiplying and dividing by $C(\tilde\eta)$ gives the desired result. $\square$
We can now derive a formula for the normalizing constant for an arbitrary parameter vector $\eta$.
Corollary: Let $\eta \in \mathbb{R}^{K}$ contain $J$ unique elements $\tilde\eta_1, \dots, \tilde\eta_J$. Assume, without loss of generality, that $\eta = (\tilde\eta_1, \dots, \tilde\eta_1,\ \tilde\eta_2, \dots, \tilde\eta_2,\ \dots,\ \tilde\eta_J, \dots, \tilde\eta_J)$, where each coordinate $\tilde\eta_j$ is repeated $m_j$ times, with $\sum_{j=1}^{J} m_j = K$. Then:

$$C(\eta)^{-1} = \frac{C(\tilde\eta)^{-1}}{\prod_{j=1}^{J} (m_j - 1)!}\; \mathbb{E}_{y \sim \mathcal{CC}(\tilde\eta)}\left[\prod_{j=1}^{J} y_j^{m_j - 1}\right], \qquad \tilde\eta = (\tilde\eta_1, \dots, \tilde\eta_J) \tag{35}$$
Proof: The result follows by applying the lemma $J$ times. First, we apply the lemma with $f \equiv 1$:

$$C(\eta)^{-1} = \frac{C(\eta^{(1)})^{-1}}{(m_1 - 1)!}\; \mathbb{E}_{y \sim \mathcal{CC}(\eta^{(1)})}\left[y_1^{m_1 - 1}\right] \tag{36}$$
where $\eta^{(1)} = (\tilde\eta_1, \eta_{m_1+1:K})$, i.e., we have collapsed the first parameter value onto a single coordinate. Next, we apply the lemma a second time on the new expectation term to collapse the $\tilde\eta_2$ values, this time using $f(y) = y_1^{m_1 - 1}$, which does not depend on the collapsed coordinates:

$$C(\eta)^{-1} = \frac{C(\eta^{(2)})^{-1}}{(m_1 - 1)!\,(m_2 - 1)!}\; \mathbb{E}_{y \sim \mathcal{CC}(\eta^{(2)})}\left[y_1^{m_1 - 1}\, y_2^{m_2 - 1}\right] \tag{37}$$
where $\eta^{(2)} = (\tilde\eta_1, \tilde\eta_2, \eta_{m_1 + m_2 + 1:K})$. Repeating $J$ times yields the desired result. $\square$
Note that Eq. 35 can be computed using known results. The term $C(\tilde\eta)^{-1}$ can be evaluated with Eq. 9, since all the parameter values of $\tilde\eta$ are distinct. The expectation term can be computed by differentiating the moment generating function of $\mathcal{CC}(\tilde\eta)$, as discussed by Gordon-Rodriguez et al. (2020a). Note that in real data applications exact equality between parameter values may never occur, and it is unclear how close the elements of $\eta$ should be in order to warrant applying Eq. 35. Nevertheless, Eq. 35 is important for theoretical completeness.
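The following sketch (ours) evaluates Eq. 35 by computing the expectation term as a mixed partial derivative of $C^{-1}$ at $\tilde\eta$, which is one way of differentiating the moment generating function:

```python
import mpmath as mp

def inv_c_repeated(eta_distinct, mults, dps=40):
    """Eq. 35 (sketch): C(eta)^{-1} where eta contains eta_distinct[j] with
    multiplicity mults[j]. Uses the identity
        C(eta)^{-1} = (mixed partial of C^{-1} at eta_distinct) / prod_j (m_j - 1)!,
    taking a partial derivative of order m_j - 1 in the j-th coordinate."""
    with mp.workdps(dps):
        def inv_c(*e):
            # Eq. 9 for a full K-component parameter vector (cf. Section 3).
            return mp.fsum(
                mp.exp(ei) / mp.fprod(ei - ej for j, ej in enumerate(e) if j != i)
                for i, ei in enumerate(e))
        point = tuple(mp.mpf(float(x)) for x in eta_distinct)
        orders = tuple(m - 1 for m in mults)
        deriv = mp.diff(inv_c, point, orders)  # mixed partial derivative
        return deriv / mp.fprod(mp.factorial(m - 1) for m in mults)
```

For instance, `inv_c_repeated([0.7, -0.3], [2, 1])` matches the $K = 3$, $\eta_1 = \eta_2$ example of Section 3.1.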
4 Discussion
The normalizing constant of the CC distribution is essential to applications, being a necessary prerequisite for evaluating likelihoods, optimizing models, and simulating samples alike. Computing this normalizing constant is nontrivial, and doing so in high dimensions remains an open problem. Our work represents a significant step toward this goal, improving our understanding of the numerical properties of different computation techniques, as well as advancing the underlying theory and algorithms. In addition, we hope our results will help to inspire further advances and to develop increasingly robust numerical techniques that will ultimately enable the use of the CC distribution with arbitrary parameter values on high-dimensional problems.
References
- Bond-Taylor et al. (2021) Sam Bond-Taylor, Adam Leach, Yang Long, and Chris G Willcocks. Deep generative modelling: A comparative review of VAEs, GANs, normalizing flows, energy-based and autoregressive models. arXiv preprint arXiv:2103.04922, 2021.
- Cohen (2007) Alan M Cohen. Numerical Methods for Laplace Transform Inversion, volume 5. Springer Science & Business Media, 2007.
- Davies and Martin (1979) Brian Davies and Brian Martin. Numerical inversion of the Laplace transform: a survey and comparison of methods. Journal of Computational Physics, 33(1):1–32, 1979.
- De Hoog et al. (1982) Frank R De Hoog, J H Knight, and A N Stokes. An improved method for numerical inversion of Laplace transforms. SIAM Journal on Scientific and Statistical Computing, 3(3):357–366, 1982.
- Dirichlet (1839) Peter Gustav Lejeune Dirichlet. Sur une nouvelle méthode pour la détermination des intégrales multiples. Journal de Mathématiques, Ser I, 4, pages 164–168, 1839.
- Epstein and Schotland (2008) Charles L Epstein and John Schotland. The bad truth about Laplace's transform. SIAM Review, 50(3):504–520, 2008.
- Goldberg (1991) David Goldberg. What every computer scientist should know about floating-point arithmetic. ACM Computing Surveys (CSUR), 23(1):5–48, 1991.
- Gordon-Rodriguez et al. (2020a) Elliott Gordon-Rodriguez, Gabriel Loaiza-Ganem, and John Cunningham. The continuous categorical: a novel simplex-valued exponential family. In International Conference on Machine Learning, pages 3637–3647. PMLR, 2020a.
- Gordon-Rodriguez et al. (2020b) Elliott Gordon-Rodriguez, Gabriel Loaiza-Ganem, Geoff Pleiss, and John Patrick Cunningham. Uses and abuses of the cross-entropy loss: Case studies in modern deep learning. In Proceedings on "I Can't Believe It's Not Better!" at NeurIPS Workshops, volume 137 of Proceedings of Machine Learning Research, pages 1–10. PMLR, 2020b.
- Johansson et al. (2013) Fredrik Johansson et al. mpmath: a Python library for arbitrary-precision floating-point arithmetic (version 0.18), December 2013. http://mpmath.org/.
- Kahan (1965) William Kahan. Pracniques: further remarks on reducing truncation errors. Communications of the ACM, 8(1):40, 1965.
- Kingma and Welling (2014) Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.
- Langlois and Louvet (2007) Philippe Langlois and Nicolas Louvet. How to ensure a faithful polynomial evaluation with the compensated Horner algorithm. In 18th IEEE Symposium on Computer Arithmetic (ARITH'07), pages 141–149. IEEE, 2007.
- Loaiza-Ganem and Cunningham (2019) Gabriel Loaiza-Ganem and John P Cunningham. The continuous Bernoulli: fixing a pervasive error in variational autoencoders. In Advances in Neural Information Processing Systems, pages 13266–13276, 2019.
- Stehfest (1970) Harald Stehfest. Algorithm 368: Numerical inversion of Laplace transforms [D5]. Communications of the ACM, 13(1):47–49, 1970.
- Talbot (1979) Alan Talbot. The accurate numerical inversion of Laplace transforms. IMA Journal of Applied Mathematics, 23(1):97–120, 1979.
- Wolpert and Wolf (1995) David H Wolpert and David R Wolf. Estimating functions of probability distributions from a finite set of samples. Physical Review E, 52(6):6841, 1995.