Non-asymptotic approximations for Pearson’s chi-square statistic and its application to confidence intervals for strictly convex functions of the probability weights of discrete distributions
Abstract
In this paper, we develop a non-asymptotic local normal approximation for multinomial probabilities. First, we use it to find non-asymptotic total variation bounds between the measures induced by uniformly jittered multinomials and the multivariate normals with the same means and covariances. From the total variation bounds, we also derive a comparison of the cumulative distribution functions and quantile coupling inequalities between Pearson’s chi-square statistic (written as the normalized quadratic form of a multinomial vector) and its multivariate normal analogue. We apply our results to find confidence intervals for the negative entropy of discrete distributions. Our method can be applied more generally to find confidence intervals for strictly convex functions of the weights of discrete distributions.
keywords: categorical data, convex optimization, entropy, Gaussian approximation, information theory, local limit theorem, multinomial distribution, multivariate normal distribution, quantile coupling
MSC [2020]: Primary: 62E17, 62F25, 62F30; Secondary: 60E15, 60F99, 62E20, 62H10, 62H12
1 Introduction
A fundamental question in machine learning and statistics is this: Given a set of i.i.d. observations, what can we infer about the distribution that generated the observations? In general, we cannot infer the exact distribution – it is an intractable inverse problem. However, in some cases, if we know that the generating distribution belongs to a parametric family of distributions, then it is often possible to find a confidence set on the parameters for a given significance level. If the confidence set has a shape that makes it amenable to optimization (being convex, for example), then we can use it to infer confidence bounds on functions of the parameters by minimizing/maximizing the functions over the confidence set. In this paper, we are interested in the particular case where the parametric family is a given (fixed) discrete distribution with known values . The parameters are the probability weights associated with each value . In other words, given a sequence of i.i.d. observations , we assume that
(1.1) |
where the categorical values are fixed and known, but the categorical probabilities are unknown parameters. We want confidence bounds on the quantity
(1.2) |
where the objective function can depend on the categorical values and is assumed to be strictly convex. If we sample observations from the distribution (1.1), then note that the probability weights are also parameters of the multinomial distribution associated with the sample count vector where
(1.3) |
Therefore, in order to build confidence bounds for , an idea is to find a confidence set for the probability parameters of the sample count vector’s multinomial distribution and then minimize/maximize the objective function over the probability parameters in that set. This problem is described in detail and solved in Section 3.
Motivated by the above machine learning problem, we derive in this paper closely related theoretical results that may be of independent interest. They include a non-asymptotic local normal approximation for multinomial probabilities, a non-asymptotic total variation bound between the measures induced by uniformly jittered multinomials and the corresponding multivariate normals with the same means and covariances, and a non-asymptotic comparison of the cumulative distribution functions (c.d.f.s) between Pearson’s chi-square statistic (written as the normalized quadratic form of a multinomial vector) and its multivariate normal analogue, which then leads to quantile coupling bounds as a corollary. (For a good introduction to quantile coupling bounds, see, e.g., Mason & Zhou, (2012).)
To the best of our knowledge, the first two results are new. The local approximation result complements the asymptotic version obtained in Lemma 2.1 of Siotani & Fujikoshi, (1984), via a Stieltjes integral representation of probabilities due to Esseen, (1945), and later rederived independently in Theorem 2.1 of Ouimet, (2021b) in a log-ratio form via elementary Taylor expansions and Stirling’s formula. The total variation bound complements the asymptotic version obtained in the proof of Theorem 1 in Carter, (2002) and its improvement in Lemma 3.1 of Ouimet, (2021b), where the inductive part of Carter’s proof was removed by using the multinomial/normal local approximation directly instead of a multi-scale comparison between binomials/normals. Unfortunately, the comparison of the c.d.f.s (and thus the quantile coupling result), being of order , turns out to be (slightly) sub-optimal, although the derivation is easy to follow and provides a completely different route from the usual one (so there should still be some interest in the argument). Earlier asymptotic results in that direction can be found in Siotani & Fujikoshi, (1984) and Read, (1984). Some lapses were later found in the asymptotic expansions of those two papers and corrected in Ulyanov & Zubov, (2009) (see also Prokhorov & Ulyanov, (2013)). Slightly better bounds, of order , can be obtained via a multivariate Berry-Esseen theorem; see the Kolmogorov distance bounds of Mann, (1997a, b) mentioned on p. 734 of Gaunt et al., (2017) and references therein. Gaunt et al., (2017) themselves derive, using Stein’s method, distributional bounds of order over smooth test functions between Pearson’s chi-square statistic and its asymptotic chi-square distribution; see also Gaunt, (2021) and Gaunt & Reinert, (2022) for similar results with respect to the Friedman and power divergence statistics. Since indicator functions are not smooth, the rate would be optimal in our context.
Here is the outline of the paper. The definitions and theoretical results are presented in Section 2 and proved in Appendix A. The solution to the problem described in the introduction is given in Section 3, and the related proofs are also deferred to Appendix A. Technical moment estimates are gathered in Appendix B.
2 Definitions and theoretical results
For , define the (unit) -dimensional simplex and its interior by
(2.1) |
where denotes the norm on . Given a positive integer and a set of probability weights , the probability mass function is defined by
(2.2) |
where and . With this notation, the multinomial distribution has categories, with respective probabilities . The covariance matrix for the first components of a multinomial vector is well-known to be , where , see, e.g., (Severini, 2005, p. 377). From Theorem 1 in Tanabe & Sagae, (1992), we also know that . Hence, the density function of the multivariate centered Gaussian random vector that has the same covariance matrix is defined as follows:
(2.3) |
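For illustration, here is a minimal R sketch (not part of the paper's accompanying code; all numerical values below are chosen arbitrarily) that checks numerically that the covariance of the first d − 1 counts of a multinomial vector matches the stated formula.

```r
# Illustrative sketch: the covariance of the first d-1 counts of a Multinomial(n, p)
# vector equals n * (diag(q) - q q^T), where q holds the first d-1 probabilities.
set.seed(1)
n <- 200
p <- c(0.2, 0.3, 0.1, 0.4)
d <- length(p)
q <- p[1:(d - 1)]
Sigma <- diag(q) - tcrossprod(q)              # single-trial covariance of the first d-1 cells
X <- rmultinom(50000, size = n, prob = p)     # d x 50000 matrix of sample counts
emp <- cov(t(X[1:(d - 1), , drop = FALSE]))   # empirical covariance of the first d-1 counts
max(abs(emp - n * Sigma))                     # small, up to Monte Carlo error
```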
Throughout this section, let be given, and let
(2.4) |
Ultimately, our main goal (Theorem 2.7) will be to compare the c.d.f. of Pearson’s chi-square statistic, , with its Gaussian analogue , namely compare
(2.5) |
where ,
(2.6) |
To see that the statistic is just a rewriting of Pearson’s chi-square statistic, use the fact that
(2.7) |
(see, e.g., (Tanabe & Sagae, 1992, Eqn. 21)) together with to write
(2.8) |
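The identity can also be checked numerically. The following R sketch (illustrative values; not the paper's code) computes both the classical form of Pearson's chi-square statistic over all cells and the normalized quadratic form of the first d − 1 centred counts, and confirms that they agree.

```r
# Illustrative sketch: classical Pearson statistic vs. the normalized quadratic form
# of the first d-1 centred counts, with covariance matrix n * (diag(q) - q q^T).
set.seed(2)
n <- 200
p <- c(0.2, 0.3, 0.1, 0.4)
d <- length(p)
q <- p[1:(d - 1)]
Sigma <- diag(q) - tcrossprod(q)
N <- as.vector(rmultinom(1, size = n, prob = p))
pearson <- sum((N - n * p)^2 / (n * p))                 # classical form, all d cells
v <- N[1:(d - 1)] - n * q
quad <- as.numeric(t(v) %*% solve(n * Sigma) %*% v)     # normalized quadratic form
all.equal(pearson, quad)                                # TRUE
```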
Remark 2.1.
Notice that
(2.9) |
because and .
Remark 2.2.
Let be given. Since we know that is distributed, then
(2.10) |
where the two functions
(2.11) |
denote the regularized lower incomplete gamma function and Euler’s gamma function, respectively.
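For reference, this identity is directly available in R, where pgamma() with unit rate is the regularized lower incomplete gamma function; the values of x and k below are illustrative.

```r
# Illustrative sketch: the chi-square(k) c.d.f. equals the regularized lower
# incomplete gamma function evaluated at (k/2, x/2).
x <- 7.3
k <- 4
pchisq(x, df = k)
pgamma(x / 2, shape = k / 2)   # gamma(k/2, x/2) / Gamma(k/2), same value
```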
Remark 2.3.
Throughout the paper, the notation means that , where is a universal constant. Whenever might depend on a parameter, we add a subscript: for example, . Similarly, means that , and subscripts indicate which parameters the convergence rate can depend on.
The first step (Proposition 2.4) to achieve our goal (2.5) is to prove a non-asymptotic version of the local limit theorem found in (Siotani & Fujikoshi, 1984, Lemma 2.1) and (Ouimet, 2021b, Theorem 2.1). Specifically, we develop a non-asymptotic expansion for the log-ratio of the multinomial probability mass function (2.2) over the corresponding multivariate normal density (2.3). Our approximation is uniform for in the bulk of the distribution and uniform for in a compact set away from the boundary of the simplex , namely for
(2.12) | |||
(2.13) |
where is a fixed parameter.
Proposition 2.4 (Non-asymptotic local expansion).
Let be given. Then, uniformly for , and , we have
(2.14) |
where the error satisfies
(2.15) |
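As a sanity check, the local expansion can be probed numerically. The R sketch below (illustrative values; not the paper's code) compares the log of a multinomial probability at a lattice point near the mean with the log of the corresponding centred multivariate normal density, and the log-ratio is indeed small.

```r
# Illustrative sketch: log P(N = x) for N ~ Multinomial(n, p) vs. the log-density of
# the centred normal with covariance n * (diag(q) - q q^T), at a point x near n * p.
n <- 5000
p <- c(0.25, 0.25, 0.2, 0.3)
d <- length(p)
q <- p[1:(d - 1)]
Sigma <- diag(q) - tcrossprod(q)
x <- round(n * p)
x[d] <- n - sum(x[1:(d - 1)])                                  # make the counts sum to n
log_pmf <- dmultinom(x, size = n, prob = p, log = TRUE)
v <- x[1:(d - 1)] - n * q
log_phi <- -0.5 * ((d - 1) * log(2 * pi * n) + log(det(Sigma)) +
                     as.numeric(t(v) %*% solve(n * Sigma) %*% v))
log_pmf - log_phi                                              # small log-ratio
```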
The following corollary may be of independent interest. It extends the local limit theorem from Proposition 2.4 to one-to-one transformations of multinomial vectors.
Corollary 2.5.
Let be given. Let be a one-to-one mapping from an open subset of onto a subset of . Assume further that has continuous partial derivatives on and its Jacobian determinant is non-zero for all . Then, uniformly for such that , and , we have
(2.16) | ||||
where the error satisfies
(2.17) |
Using the local expansion from Proposition 2.4, the second step (Proposition 2.6) to achieve our goal (2.5) consists in finding an upper bound on the total variation between the probability measure associated with the multivariate normal in (2.3) and the probability measure of a multinomial vector jittered by a uniform random variable on . The proposition is a non-asymptotic version of (Ouimet, 2021b, Lemma 3.1).
Proposition 2.6 (Total variation).
Let be given, and let and , where and are assumed independent. Define and let be the law of . Let be the law of the multivariate normal distribution , where recall . Then, for all , we have
(2.18) |
where denotes the total variation norm.
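For intuition, the total variation distance can be computed numerically in the simplest case d = 2, where the jittered count is a binomial count plus an independent uniform perturbation. The R sketch below uses illustrative values and a simple Riemann sum; it is not the paper's code.

```r
# Illustrative sketch (d = 2): the jittered count N_1 + U, with U ~ Uniform(-1/2, 1/2],
# has the piecewise constant density y -> dbinom(round(y), n, p); its total variation
# distance to the normal with the same mean and variance is computed on a fine grid.
n <- 400
p <- 0.3
g <- function(y) dbinom(round(y), size = n, prob = p)                 # density of N_1 + U
f <- function(y) dnorm(y, mean = n * p, sd = sqrt(n * p * (1 - p)))   # matching normal
y <- seq(-0.5, n + 0.5, by = 0.001)      # normal mass outside this range is negligible
tv <- 0.5 * sum(abs(g(y) - f(y))) * 0.001
tv   # decays on the order of 1/sqrt(n), consistent with the non-asymptotic bound
```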
The final step (Theorem 2.7) to achieve our goal (2.5) consists in controlling the difference between and using the total variation bound from Proposition 2.6.
Theorem 2.7 (C.d.f. comparison).
Let and be given. Then, for all , we have
(2.19) |
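The c.d.f. comparison can be visualized by simulation. The R sketch below (illustrative values; not the paper's code) compares the empirical c.d.f. of Pearson's statistic with the chi-square c.d.f. of its Gaussian analogue from Remark 2.2.

```r
# Illustrative sketch: empirical c.d.f. of Pearson's statistic from simulated
# multinomial samples vs. the chi-square(d - 1) c.d.f. of the Gaussian analogue.
set.seed(3)
n <- 300
p <- c(0.2, 0.3, 0.1, 0.4)
d <- length(p)
X <- rmultinom(20000, size = n, prob = p)
stat <- colSums((X - n * p)^2 / (n * p))            # Pearson statistic for each sample
x_grid <- seq(0.5, 12, by = 0.5)
max(abs(ecdf(stat)(x_grid) - pchisq(x_grid, df = d - 1)))   # uniformly small difference
```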
As a consequence of Remark 2.2, Theorem 2.7 and the mean value theorem, we deduce the following stochastic upper bound on .
Corollary 2.8 (Quantile coupling).
Let be the cumulative distribution function (c.d.f.) of , and let denote the generalized inverse distribution function (or quantile function) associated with . Also, let be the c.d.f. of , and define the random variable . For , let
(2.20) |
Then, with the event
(2.21) |
we have, for large enough,
(2.22) |
where solves .
3 Application to confidence intervals
3.1 Description of the problem
In this section, we apply the method mentioned in Section 1 and use the normal approximation results we obtained in Section 2 to clarify the optimization step, given the multi-dimensionality of the problem and the roughness of the confidence set for the multinomial.
Formally, consider a discrete random variable having the following probability mass function
(3.1) |
where is a positive integer, the values are known, but the probability weights are unknown. We consider a (fixed and given) strictly convex objective function , which may depend on the values of and , and our goal is to estimate and find a confidence interval for
(3.2) |
Given a significance level , a sequence of i.i.d. observations following the discrete distribution in (3.1), and the associated sample count random variables
(3.3) |
we will show how to construct a lower bound and an upper bound that satisfy
(3.4) |
where is an explicit threshold, , and is a small explicit quantity that goes to as .
When there are only two category values (), this problem is easy. Indeed, using a binary search, a computer can find an exact -confidence interval for , say , using the fact that . Then we find (resp., ) simply by minimizing (resp., maximizing) the objective function over , i.e.,
(3.5) |
(For details on binomial inversion, see, e.g., Langford, (2005). Binomial tails can be computed accurately using Loader’s method (Loader, 2002), as used for dbinom() in R, or using Stirling’s approximation with several terms for very large , or exactly using arbitrary-precision arithmetic such as the BigInteger and BigDecimal classes in Java.)
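The following R sketch illustrates the binomial case: the exact (Clopper–Pearson type) confidence interval is obtained by inverting the binomial tails with uniroot(), and the confidence bounds for a strictly convex objective, here the negative entropy, are then obtained by optimizing over that interval. The counts and levels are illustrative, and the edge cases k = 0 and k = n would need separate handling.

```r
# Illustrative sketch of the d = 2 case: exact interval for p by inverting binomial
# tails, then bounds for the convex objective f(p) = p*log(p) + (1-p)*log(1-p).
alpha <- 0.05
n <- 200
k <- 137                                                     # observed number of successes
p_lo <- uniroot(function(p) pbinom(k - 1, n, p, lower.tail = FALSE) - alpha / 2,
                interval = c(1e-12, 1 - 1e-12))$root
p_hi <- uniroot(function(p) pbinom(k, n, p) - alpha / 2,
                interval = c(1e-12, 1 - 1e-12))$root
f <- function(p) p * log(p) + (1 - p) * log(1 - p)
L <- if (p_lo <= 0.5 && 0.5 <= p_hi) f(0.5) else min(f(p_lo), f(p_hi))  # min of a convex f
U <- max(f(p_lo), f(p_hi))                                              # max at an endpoint
c(p_lo, p_hi, L, U)
```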
In higher dimensions however, the problem becomes much more difficult. Given a significance level and an (observed) vector of sample counts
(3.6) |
we have to optimize the objective function over vectors of probabilities in the following confidence set for :
(3.7) |
Remark 3.1.
Assuming that the (observed) sample counts are all positive (i.e., each of the categories has at least one observation), note that the sum in (3.7) is equal to for any on the boundary of the simplex since in that case. This is why the confidence set is restricted to the interior of the simplex in (3.7). This remark will play an important role in showing that the optima which define the confidence bounds in Theorem 3.3 are unique, see (A.67) and below.
The shape of the set is unclear and the boundaries need not be smooth, which makes the optimization of the objective function much more complicated than in the binomial case (). To overcome this obstacle, our strategy will be to find a smooth superset that contains the exact confidence set for and then compute and by optimizing over the smooth superset, as in (3.5). The small quantity in (3.4) is the “price” we have to pay in confidence to transfer the optimization from the rough (but exact) confidence set to the smooth superset. The smooth superset for will be constructed using the fact that the law of the sample count vector , i.e., the distribution, is close in total variation, when jittered by a , to the multivariate normal distribution with the same mean and covariance . This point was proved in Proposition 2.6. Other elements of the proof of Theorem 2.7 will be used for the comparison of the confidence sets.
3.2 Solution to the problem
As mentioned previously, it is not obvious (a priori) how we can optimize the objective function over the confidence set and find an explicit solution given the multi-dimensional, discrete and rough nature of the confidence set. However, a natural idea to simplify the problem is to replace the confidence set by its Gaussian analogue, namely
(3.8) | ||||
where recall , was defined in (2.3), was defined in (2.6), and was defined in (2.2).
First, we need to find a Gaussian superset as in (3.8) that contains the exact confidence set , where is a positive quantity that goes to zero as . The proposition below does just that. The proof relies on the total variation bound from Proposition 2.6 and on other elements of the proof of Theorem 2.7.
Proposition 3.2.
Finally, by Proposition 3.2, we can find the lower bound and the upper bound by optimizing the objective function over the Gaussian superset . This solves the problem mentioned in the introduction and described in Section 3.1. The proof shows that the Gaussian superset is strictly convex and it also establishes the existence and uniqueness of the minimizers/maximizers.
Theorem 3.3 (Existence and uniqueness of the solution to the problem).
Assume the objective function to be strictly convex. Let , , and let be given such that . We have
(3.10) |
with the choices
(3.11) |
Furthermore, the set is strictly convex, so that the minimizers/maximizers in (3.11) are well defined (i.e., unique).
The bound on the confidence level in (3.10) cannot be close to exact unless the number of observations is extremely high. Therefore, for practical purposes, we simply let . This is illustrated below for two examples:
(3.12) | |||
(3.13) |
where is a positive definite matrix that may depend on the known categorical values and .
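To make the procedure concrete, the following brute-force R sketch computes approximate confidence bounds for the negative-entropy objective with d = 3, using the practical threshold in which the correction term is dropped (as suggested above). A grid search over the simplex replaces the convex optimization of Theorem 3.3, so this is only an illustration with arbitrary numerical values, not the paper's R code.

```r
# Illustrative brute-force sketch of the confidence bounds for the negative entropy.
set.seed(4)
alpha <- 0.05
p_true <- c(0.5, 0.3, 0.2)
d <- 3
n <- 500
N <- as.vector(rmultinom(1, size = n, prob = p_true))        # observed sample counts
thr <- qchisq(1 - alpha, df = d - 1)                         # chi-square quantile threshold
f <- function(p) sum(p * log(p))                             # negative entropy (strictly convex)
step <- 0.005
lower <- Inf
upper <- -Inf
for (p1 in seq(step, 1 - 2 * step, by = step)) {
  for (p2 in seq(step, 1 - p1 - step, by = step)) {
    p <- c(p1, p2, 1 - p1 - p2)
    if (sum((N - n * p)^2 / (n * p)) <= thr) {               # membership in the Gaussian set
      val <- f(p)
      lower <- min(lower, val)
      upper <- max(upper, val)
    }
  }
}
c(lower, upper)        # approximate [L, U]; f(p_true) should typically lie between them
```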
In Figures 3.1 and 3.2, the confidence bounds are shown to converge to the value of the objective functions at (i.e., ) and the empirical significance levels are shown to stay below the nominal level . The R code that generated the figures is available at https://www.dropbox.com/s/tfwnnba642740xo/multinomial_method.R?dl=0.
[Figures 3.1 and 3.2: convergence of the confidence bounds to the true value of the objective functions, and empirical significance levels staying below the nominal level, for the two examples.]
Appendix A Proofs
Proof of Proposition 2.4.
By taking the logarithm on the left-hand side of (2.2), we have
(A.1) |
From Lindelöf’s estimate of the remainder terms in the expansion of the factorials, found for example on page 67 of Remmert, (1998), we know that
(A.2) |
By applying (A.2) in (A.1) and reorganizing the terms, we get
(A.3) | ||||
Since , the above is
(A.4) | ||||
Now, for (recall our assumption ), Lagrange error bounds for the following Taylor expansions imply
(A.5) | ||||
By applying these estimates in (A.4) with , together with the error bound from (A.2), we obtain
(A.6) | ||||
where the rest satisfies
(A.7) |
The cancellations in (A.6) come from (recall Remark 2.1) and from for by (2.7). This ends the proof. ∎
Proof of Proposition 2.6.
By the comparison of the total variation norm with the Hellinger distance on page 726 of Carter, (2002), we already know that
(A.8) |
By applying a union bound, followed by the fact that for all and then Hoeffding’s inequality, we get
(A.9) |
and similarly, for ,
(A.10) |
To control the expectation in (A.8), note that if denotes the density function associated with (i.e., it is equal to whenever is closest to ), then
(A.11) |
By the non-asymptotic expansion in Proposition 2.4, we have
(A.12) | ||||
By Lemma B.1, and our assumptions and , we can bound the expectations in (A.12) as follows:
(A.13) |
For the term in (A), note that
(A.14) |
Using the expression in (2.7), the independence of the random variables , and our assumptions and , we obtain
(A.15) |
To bound the term in (A), we first derive a rough bound for the log-ratio in (2.14) on the set . Note that the three coefficients we found in (A.5), and hence those in (A.7), can be replaced by on the event by rerunning the proof of Proposition 2.4, since the bound on just above (A.5) in that case becomes
(A.16) |
Taking the above into account, and using our assumptions and , we have
(A.17) |
We also have the following rough bound from (A),
(A.18) |
again using our assumptions and . Putting the rough bounds (A) and (A) together yields
(A.19) |
Proof of Theorem 2.7.
The case can easily be treated separately using the Berry-Esseen theorem (in fact the bound in (2.19) would be of the form in that case). Alternatively, we can prove our statement as we do below, for all and any , only assuming further that when .
By the triangle inequality, we have
(A.22) | ||||
First, as in (A), we have
(A.23) |
Second, note that by the definition of ,
(A.24) |
Third, to bound , we can use the total variation bound in Proposition 2.6. We have
(A.25) |
To bound , note that is a convex and connected set. Hence, if we ignore the restriction to , the symmetric difference
(A.26) | ||||
is contained in a -dimensional shell of thickness at most in any given direction that is parallel to one of the axes. Figure A.3 illustrates the situation when .
[Figure A.3: illustration of the thin shell containing the symmetric difference.]
When , we have
(A.27) | |||
(A.28) | |||
(A.29) |
so that
(A.30) |
(using and ) where
(A.31) |
Using Equation (A.30), Remark 2.2 and the mean value theorem (only when ), we deduce
(A.32) | ||||
(A.33) |
Now, notice that the maximum above is attained at for . More generally, the function maximizes on at for all , and
(A.34) |
by a standard Stirling lower bound (see, e.g., Lemma 1 of Minc & Sathre, (1964/65), or Theorem 2.2 of Batır, (2017) for a more precise bound). Using this in (A) shows that, for any given (we assumed that when ),
(A.35) |
Proof of Corollary 2.8.
By Theorem 2.7, we know that, for and ,
(A.38) |
By the mean value theorem, we have
(A.39) |
for an appropriate point , where recall
(A.40) |
It is easily verified that is decreasing for (use Remark 2.2 together with the fact that is the mode of the Gamma distribution with shape parameter and scale parameter ), so that, by choosing large enough and subsequently , we have
(A.41) |
From (A.41) and (A.38), we deduce
(A.42) |
Therefore, by applying on both sides, we get, for large enough,
(A.43) |
where solves and is the event that we defined in (2.21). This ends the proof. ∎
Proof of Proposition 3.2.
Throughout this proof, let . As in the proof of Theorem 2.7, the case can easily be treated separately using the Berry-Esseen theorem (in fact would be of the form in that case). Alternatively, we can prove our statement as we do below, for all and , only assuming further that when .
By the triangle inequality, we have
(A.44) | ||||
The terms , , , and are already bounded in the proof of Theorem 2.7. It remains to bound .
If , then an argument like (A) shows that and imply
(A.45) |
and and imply
(A.46) |
Therefore,
(A.47) |
By the triangle inequality, we can then split the above probability into three distinct terms:
(A.50) | ||||
(A.51) |
For the first term in (A), we can use the total variation bound in Proposition 2.6:
(A.52) |
Bounding the second term in (A) is the same as except that there are two ellipses instead of one, namely and . Therefore,
(A.53) |
For the third term in (A), we can use Equation (A.30), Remark 2.2 and the mean value theorem (only when ) to obtain
(A.54) | ||||
(A.55) |
Now, notice that the maximum above is attained at for . More generally, the function maximizes on at for all , and
(A.56) |
by a standard Stirling lower bound (see, e.g., Lemma 1 of Minc & Sathre, (1964/65), or Theorem 2.2 of Batır, (2017) for a more precise bound). Using this in (A) shows that, for any given (we assumed that when ),
(A.60) |
Proof of Theorem 3.3.
Equation (3.10) is a direct consequence of Proposition 3.2. It remains to show that the set is strictly convex.
Using the calculation in (2), we can write
(A.64) |
so that the Hessian matrix of is positive definite on . Indeed,
(A.65) |
Therefore, is strictly convex, and sublevel sets of the form
(A.66) |
are strictly convex for any . Since is invertible ( is clearly increasing), write the inverse as . The domain over which we optimize the objective function in (3.11) is the intersection of two strictly convex sets, namely
(A.67) |
which proves that is strictly convex. As pointed out in Remark 3.1, the fact that the first term in the intersection is plays a crucial role for to be strictly convex and thus for the uniqueness of the minimizer/maximizer of the objective function. ∎
Appendix B Moment estimates
In this section, we derive some moment estimates for the multinomial distribution. The lemma below is used in the proof of Proposition 2.6.
Lemma B.1.
Let and be given. If according to (2.2), then, for all ,
(B.1) | |||
(B.2) |
where . Moreover, for any Borel set and all , we have
(B.3) | |||
(B.4) |
Proof.
The result (B.1) is well-known and the result (B.2) is shown for example in Equation (A.29) of Ouimet, (2021b) or Equation (82) in Ouimet, (2021a). The result (B.3) is a direct consequence of (B.1) together with the Cauchy-Schwarz inequality and the fact that . Finally, to obtain (B.4), we use the fact that (see Lemma A.1 in Ouimet, (2021b) or Equation (79) in Ouimet, (2021a)) followed by an application of Hölder’s inequality and the bound (which follows from (B.2) and the assumption ):
(B.5) |
This ends the proof. ∎
R code
The R code that generated Figures 3.1 and 3.2 is available at
https://www.dropbox.com/s/tfwnnba642740xo/multinomial_method.R?dl=0.
Funding
F. Ouimet was supported by a CRM-Simons postdoctoral fellowship from the Centre de recherches mathématiques (Montréal, Canada) and the Simons Foundation.
References
- Batır, (2017) Batır, N. 2017. Bounds for the gamma function. Results Math., 72(1-2), 865–874. MR3684463.
- Carter, (2002) Carter, A. V. 2002. Deficiency distance between multinomial and multivariate normal experiments. Dedicated to the memory of Lucien Le Cam. Ann. Statist., 30(3), 708–730. MR1922539.
- Esseen, (1945) Esseen, C.-G. 1945. Fourier analysis of distribution functions. A mathematical study of the Laplace-Gaussian law. Acta Math., 77, 1–125. MR14626.
- Gaunt, (2021) Gaunt, R. E. 2021. Bounds for the chi-square approximation of the power divergence family of statistics. To appear in Journal of Applied Probability, 1–26. arXiv:2107.00535.
- Gaunt & Reinert, (2022) Gaunt, R. E., & Reinert, G. 2022. Bounds for the chi-square approximation of Friedman’s statistic by Stein’s method. Preprint, 1–38. arXiv:2111.00949.
- Gaunt et al., (2017) Gaunt, R. E., Pickett, A. M., & Reinert, G. 2017. Chi-square approximation by Stein’s method with application to Pearson’s statistic. Ann. Appl. Probab., 27(2), 720–756. MR3655852.
- Langford, (2005) Langford, J. 2005. Tutorial on practical prediction theory for classification. J. Mach. Learn. Res., 6, 273–306. MR2249822.
- Loader, (2002) Loader, C. 2002. Fast and accurate computation of binomial probabilities. Technical report, 11 pp. https://www.r-project.org/doc/reports/CLoader-dbinom-2002.pdf.
- Mann, (1997a) Mann, B. 1997a. Convergence rate for of a multinomial. Unpublished manuscript.
- Mann, (1997b) Mann, B. 1997b. Stein’s method for of a multinomial. Unpublished manuscript.
- Mason & Zhou, (2012) Mason, D. M., & Zhou, H. H. 2012. Quantile coupling inequalities and their applications. Probab. Surv., 9, 439–479. MR3007210.
- Minc & Sathre, (1964/65) Minc, H., & Sathre, L. 1964/65. Some inequalities involving . Proc. Edinburgh Math. Soc. (2), 14, 41–46. MR162751.
- Ouimet, (2021a) Ouimet, F. 2021a. General formulas for the central and non-central moments of the multinomial distribution. Stats, 4(1), 18–27. doi:10.3390/stats4010002.
- Ouimet, (2021b) Ouimet, F. 2021b. A precise local limit theorem for the multinomial distribution and some applications. J. Statist. Plann. Inference, 215, 218–233. MR4249129.
- Prokhorov & Ulyanov, (2013) Prokhorov, Y. V., & Ulyanov, V. V. 2013. Some approximation problems in statistics and probability. Pages 235–249 of: Limit theorems in probability, statistics and number theory. Springer Proc. Math. Stat., vol. 42. Springer, Heidelberg. MR3079145.
- Read, (1984) Read, T. R. C. 1984. Closer asymptotic approximations for the distributions of the power divergence goodness-of-fit statistics. Ann. Inst. Statist. Math., 36(1), 59–69. MR752006.
- Remmert, (1998) Remmert, R. 1998. Classical Topics in Complex Function Theory. Graduate Texts in Mathematics, vol. 172. Springer-Verlag, New York. MR1483074.
- Severini, (2005) Severini, T. A. 2005. Elements of Distribution Theory. Cambridge Series in Statistical and Probabilistic Mathematics, vol. 17. Cambridge University Press, Cambridge. MR2168237.
- Siotani & Fujikoshi, (1984) Siotani, M., & Fujikoshi, Y. 1984. Asymptotic approximations for the distributions of multinomial goodness-of-fit statistics. Hiroshima Math. J., 14(1), 115–124. MR750392.
- Tanabe & Sagae, (1992) Tanabe, K., & Sagae, M. 1992. An exact Cholesky decomposition and the generalized inverse of the variance-covariance matrix of the multinomial distribution, with applications. J. Roy. Statist. Soc. Ser. B, 54(1), 211–219. MR1157720.
- Ulyanov & Zubov, (2009) Ulyanov, V. V., & Zubov, V. N. 2009. Refinement on the convergence of one family of goodness-of-fit statistics to chi-squared distribution. Hiroshima Math. J., 39(1), 133–161. MR2499200.