Tight Lower Bounds for α-Divergences Under Moment Constraints and Relations Between Different α
Abstract
The α-divergences include the well-known Kullback-Leibler divergence, Hellinger distance and χ²-divergence. In this paper, we derive differential and integral relations between the α-divergences that generalize the relation between the Kullback-Leibler divergence and the χ²-divergence. We also show tight lower bounds for the α-divergences under given means and variances. In particular, we show a necessary and sufficient condition on the order such that the binary divergences, which are divergences between probability measures on the same 2-point set, always attain these lower bounds. The Kullback-Leibler divergence, the Hellinger distance, and the χ²-divergence satisfy this condition.
Keywords: Alpha-divergence, Kullback-Leibler divergence, Hellinger distance, Chi-squared divergence, Relative entropy, Rényi divergence.
I Introduction
The Kullback–Leibler divergence [12] (also known as the relative entropy) and the Hellinger distance [10] are divergence measures which play a key role in information theory, statistics, machine learning, physics, signal processing, and other theoretical and applied branches of mathematics. They both belong to an important class of divergence measures, defined by means of convex functions f and named f-divergences [5, 6, 7]. A most notable subclass of the f-divergences is the family of α-divergences [1, 4]. By choosing different values of α, we obtain a large number of well-known divergences as special cases, including the Kullback-Leibler divergence, the Hellinger distance, and the χ²-divergence [17].
In this paper, we study relations between the α-divergences of different orders, and derive tight lower bounds for the α-divergences under given means and variances. The relation between the Kullback-Leibler divergence and the χ²-divergence was shown in [14, 2, 16], and we generalize this relation to α-divergences of general orders. Regarding the lower bounds under given means and variances, there are some works for the χ²-divergence and the Hellinger distance [3, 8, 11]. Recently, for the Kullback-Leibler divergence and the Hellinger distance, we showed that the tight lower bounds are all attained by their binary divergences, that is, divergences between probability measures on the same 2-point set [16, 15]. Our motivation is to study necessary and sufficient conditions on the order such that the binary α-divergences always attain the lower bounds. Furthermore, we show tight lower bounds under given means and variances for the Rényi divergences [18], which are closely related to the α-divergences.
The paper is organized as follows. Section 2 presents notation and definitions, Section 3 states the main results, and Section 4 gives the proofs of the main results. Finally, Section 5 concludes the paper; lemmas that are necessary for the proofs of the main results are proved in the Appendices.
II Preliminaries
This section provides definitions of divergence measures which are used in this paper.
Definition 1.
Definition 2.
[4] The basic asymmetric alpha-divergence is the f-divergence with f(t) = (t^α − αt + α − 1)/(α(α − 1)), which for α ∈ ℝ∖{0, 1} takes the form
D^{(α)}(P‖Q) = (1/(α(α − 1))) ( ∫ p^α q^{1−α} dμ − 1 ), (2)
and is extended by continuity to α ∈ {0, 1} through D^{(1)}(P‖Q) = D_KL(P‖Q) and D^{(0)}(P‖Q) = D_KL(Q‖P), where D_KL denotes the Kullback-Leibler divergence (relative entropy).
In the special cases α = 2, 1/2, −1, we obtain from (2) the well-known Pearson chi-square, Hellinger and Neyman chi-square distances, given respectively by
D^{(2)}(P‖Q) = (1/2) ∫ (p − q)²/q dμ,  D^{(1/2)}(P‖Q) = 2 ∫ (√p − √q)² dμ,  D^{(−1)}(P‖Q) = (1/2) ∫ (q − p)²/p dμ.
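As a numerical sanity check of these special cases, the following minimal sketch (Python; the helper names alpha_divergence and kl_divergence are illustrative, not part of the paper) evaluates (2) for two discrete distributions and compares an order close to 1 with the Kullback-Leibler divergence:

    import numpy as np

    def alpha_divergence(p, q, alpha):
        # Alpha-divergence D^(alpha)(P||Q) for discrete distributions,
        # with the normalization of Eq. (2); requires alpha not in {0, 1}.
        p, q = np.asarray(p, float), np.asarray(q, float)
        return (np.sum(p**alpha * q**(1 - alpha)) - 1.0) / (alpha * (alpha - 1.0))

    def kl_divergence(p, q):
        p, q = np.asarray(p, float), np.asarray(q, float)
        return np.sum(p * np.log(p / q))

    p = np.array([0.2, 0.5, 0.3])
    q = np.array([0.4, 0.4, 0.2])

    # alpha -> 1 recovers the Kullback-Leibler divergence D_KL(P||Q).
    print(alpha_divergence(p, q, 1.0 + 1e-6), kl_divergence(p, q))
    # alpha = 2 gives half the Pearson chi-square divergence.
    print(alpha_divergence(p, q, 2.0), 0.5 * np.sum((p - q)**2 / q))
    # alpha = 1/2 gives 2 * sum (sqrt(p) - sqrt(q))^2.
    print(alpha_divergence(p, q, 0.5), 2.0 * np.sum((np.sqrt(p) - np.sqrt(q))**2))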
The α-divergences satisfy the following duality.
Duality:
D^{(α)}(P‖Q) = D^{(1−α)}(Q‖P). (3)
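Continuing the same illustrative sketch (reusing alpha_divergence, p, and q from above), the duality (3) can be checked numerically for several orders:

    # Duality check: D^(alpha)(P||Q) should coincide with D^(1-alpha)(Q||P).
    for alpha in (-1.0, 0.5, 2.0, 3.0):
        lhs = alpha_divergence(p, q, alpha)
        rhs = alpha_divergence(q, p, 1.0 - alpha)
        print(alpha, lhs, rhs, np.isclose(lhs, rhs))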
The Rényi divergences [18] are closely related to the α-divergences.
Definition 3.
The Rényi divergence for the simple orders α ∈ (0, 1) ∪ (1, ∞) is defined as
D_α(P‖Q) = (1/(α − 1)) log ∫ p^α q^{1−α} dμ.
For the extended orders α ∈ {0, 1, ∞}, the Rényi divergence is defined by continuity as the corresponding limit of the simple orders; in particular, D_1(P‖Q) = D_KL(P‖Q).
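Assuming the normalization in (2), the Rényi divergence of a simple order is a monotone function of the α-divergence of the same order, namely D_α(P‖Q) = (α − 1)^{-1} log(1 + α(α − 1) D^{(α)}(P‖Q)). The lines below, continuing the previous sketch (the helper name renyi_divergence is illustrative), verify this numerically:

    def renyi_divergence(p, q, alpha):
        # Rényi divergence of a simple order alpha (alpha > 0, alpha != 1).
        p, q = np.asarray(p, float), np.asarray(q, float)
        return np.log(np.sum(p**alpha * q**(1 - alpha))) / (alpha - 1.0)

    for alpha in (0.5, 2.0, 3.0):
        via_alpha_div = np.log1p(alpha * (alpha - 1.0) * alpha_divergence(p, q, alpha)) / (alpha - 1.0)
        print(alpha, renyi_divergence(p, q, alpha), via_alpha_div)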
Definition 4.
Let us define a set of pairs of probability measures defined on a common n-point set, where the n support points are arbitrary real numbers. If n ≤ m, the set for n points is a subset of the set for m points.
Definition 5.
Let P and Q be probability measures defined on a measurable space (ℝ, ℬ), where ℝ is the real line and ℬ is the Borel σ-algebra of subsets of ℝ. Let us define the set of pairs of probability measures under given means and variances, i.e.,
(4)
where the given means are arbitrary real numbers and the given variances are positive.
Definition 6.
The binary α-divergence is defined for pairs of probability measures on the same 2-point set with the given means and variances,
(5)
where the support points and the point masses are determined by the moment constraints.
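For illustration only (this is not the specific construction in (5) and (14)-(17), which fixes the support points jointly for both measures), the following sketch shows how a 2-point measure can be chosen to match any prescribed mean and variance; the helper binary_measure and the weight parameter w are hypothetical names:

    import numpy as np

    def binary_measure(mean, var, w=0.5):
        # A 2-point probability measure with the prescribed mean and variance:
        # mass w at a = mean - sqrt(var*(1-w)/w) and mass 1-w at
        # b = mean + sqrt(var*w/(1-w)); any w in (0, 1) works.
        a = mean - np.sqrt(var * (1 - w) / w)
        b = mean + np.sqrt(var * w / (1 - w))
        return np.array([a, b]), np.array([w, 1 - w])

    xs, probs = binary_measure(mean=1.0, var=2.0, w=0.3)
    print(np.dot(probs, xs))               # -> 1.0 (mean)
    print(np.dot(probs, (xs - 1.0) ** 2))  # -> 2.0 (variance)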
Definition 7.
The binary Rényi divergence for the extended orders is defined analogously for probability measures on the same 2-point set with the given means and variances,
where 𝟙{·} denotes the indicator function.
III Main results
Theorem 1.
Let and for . Then, for all ,
(6)
(7) |
Proof.
See Section IV. ∎
Corollary 1.
For all ,
(8)
(9) |
The relation (8) for a particular choice of the orders yields the relation between the Kullback-Leibler divergence and the χ²-divergence.
(10)
Proof.
By the Taylor expansion of a differentiable function such that , for sufficiently small , we obtain
(11)
Since the α-divergences belong to the class of f-divergences, it follows that
(12)
Replacing by , multiplying and integrating both sides of (6), we obtain (8). The condition is due to (12). The equality (9) follows in a similar way. ∎
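For reference, one explicit form of the relation between the Kullback-Leibler divergence and the χ²-divergence shown in [16] is D_KL(P‖Q) = ∫_0^1 s^{-1} χ²(P ‖ (1 − s)P + sQ) ds. The following minimal sketch (discrete case; function names are illustrative) verifies it numerically:

    import numpy as np
    from scipy.integrate import quad

    def chi2(p, q):
        # Pearson chi-squared divergence between discrete distributions.
        return np.sum((p - q) ** 2 / q)

    def kl(p, q):
        return np.sum(p * np.log(p / q))

    p = np.array([0.2, 0.5, 0.3])
    q = np.array([0.4, 0.4, 0.2])

    # D_KL(P||Q) = int_0^1 chi2(P || (1-s)P + sQ) / s ds.
    integral, _ = quad(lambda s: chi2(p, (1 - s) * p + s * q) / s, 0.0, 1.0)
    print(kl(p, q), integral)  # the two values agree up to quadrature error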
Theorem 2.
Let .
-
(a)
If , the binary α-divergence always attains a lower bound under given means and variances if and only if .
(13) where
(14) (15) (16) (17)
-
(b)
-
(c)
If , then,
(20)
Proof.
See Section IV. ∎
For the Hellinger distance and the χ²-divergence, their binary divergences are simplified as
Corollary 2.
Under the corresponding conditions on the order, the binary Rényi divergence always attains a lower bound under given means and variances.
Proof.
For , the result follows by Definition 3 and Theorem 2. Since the binary Rényi divergence is equal to for and , we obtain the result. For , we have at . Letting , we obtain
(21)
(22) |
By the Cauchy-Schwarz inequality, it follows that
(23) |
By combining this inequality with (21), we obtain
(24) |
From (14) and (15), we have and for . By combining (24) with Definition 3, the result follows. The case can be justified in a similar way. ∎
IV Proofs of main results
IV-A Proof of Theorem 1
IV-B Proof of Theorem 2
Lemma 1.
For , let be a set of pairs of probability measures such that for all . Let . If and , the global minimum points satisfy one of the following conditions.
-
(1)
, and .
-
(2)
.
Proof.
See Appendix A. ∎
Lemma 2.
If and , the binary α-divergence is monotonically decreasing with respect to both and .
Proof.
See Appendix B. ∎
Lemma 3.
Proof.
See [15, Lemma 1]. ∎
Lemma 4.
Let , and let . If , there exists such that .
Proof.
See Appendix C. ∎
We prove Theorem 2 in four steps.
Proof for .
We first prove (13) for pairs of finite discrete probability measures.
Let , and suppose .
By Lemma 1 as and Lemma 3, there exist sequences of vectors , and probability measures and , which are defined on , such that
(29) |
where denotes for variables . Without any loss of generality, one can assume that for and for . Let and , where and . By the variance constraints, we have and . Hence, and hold for , and due to , it follows that
(30)
(31)
(32) |
Let and be probability measures defined on , and let , for . From (30)-(32), it follows that
(33)
(34) |
where we set . Since the variances of and are non-negative, we have , and . If is not a global minimum in , there exists a sequence in such that , , which gives a smaller value than . This contradicts that is a global minimum in . Hence, by Lemma 1, it follows that , and
(35) |
where with means and variances . By Lemma 2, since is monotonically decreasing with respect to and , we have . This contradicts the assumption of , thus we obtain .
Next, we prove (13) for pairs of probability measures in . For sufficiently small , there exists such that
(36) |
Since , we have
(37) |
In the interval , one can approximate probability measures by finite discrete probability measures as follows.
(38)
(39)
(40) |
From (37) and (40), we have . From (36), (38), and (39), it follows that differences of means and variances between and are . By applying , it follows that
(41) |
where , and we use the differentiability of with respect to the means and variances. Since one can choose an arbitrarily small , we obtain . ∎
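The discretization step in (38)-(40) can be mimicked numerically. The sketch below is purely illustrative (it is not the truncation used in the proof; the helper discretize is a hypothetical name) and only checks that binning a continuous distribution on a bounded interval approximately preserves its mean and variance:

    import numpy as np
    from scipy import stats

    def discretize(dist, lo, hi, n_bins):
        # Approximate a continuous distribution by a finite discrete measure
        # supported on the bin midpoints of [lo, hi], renormalizing the mass.
        edges = np.linspace(lo, hi, n_bins + 1)
        mids = 0.5 * (edges[:-1] + edges[1:])
        mass = np.diff(dist.cdf(edges))
        return mids, mass / mass.sum()

    dist = stats.norm(loc=1.0, scale=2.0)
    xs, w = discretize(dist, lo=-20.0, hi=22.0, n_bins=2000)
    mean = np.dot(w, xs)
    print(mean, np.dot(w, (xs - mean) ** 2))  # close to 1.0 and 4.0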
Proof for and .
Let , and . For all , probability measures and have the same means and variances. Hence, by the tight lower bounds for , the relation (8) for , and setting in (8), we obtain
The inequality for follows from the duality (3). Next, we prove the bound for . By the Hammersley–Chapman–Robbins bound [3], we obtain
(see [16, Eq. (180)]). Since and hold due to (14) and (15), it follows that
By the duality (3), we obtain inequalities for . The relation (8) for and the duality (3) yield inequalities for (see [16, Theorem 2]). By combining these results, we obtain lower bounds for and . ∎
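In particular, the Hammersley–Chapman–Robbins bound implies χ²(P‖Q) ≥ (E_P[X] − E_Q[X])² / Var_Q(X). The following illustrative sketch (not part of the proof) checks this inequality on randomly drawn discrete distributions with a common support:

    import numpy as np

    rng = np.random.default_rng(0)
    xs = np.array([-1.0, 0.0, 1.5, 3.0])  # common finite support

    for _ in range(5):
        p = rng.dirichlet(np.ones(len(xs)))
        q = rng.dirichlet(np.ones(len(xs)))
        chi2 = np.sum((p - q) ** 2 / q)
        mean_p, mean_q = np.dot(p, xs), np.dot(q, xs)
        var_q = np.dot(q, (xs - mean_q) ** 2)
        hcr = (mean_p - mean_q) ** 2 / var_q  # Hammersley-Chapman-Robbins bound
        print(chi2 >= hcr - 1e-12, chi2, hcr)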
Proof of the necessary condition.
Due to the duality (3), it is enough to show the claim for . We give an example of such that . Let , which is defined on . We obtain due to . Let be
where is a small positive real number such that . As , and have means and variances . Since , it follows that . By Lemma 4, there exists such that . ∎
Proof of Item (c).
Since the proof is similar to [16, Theorem 2], we only outline the proof. We construct a sequence of probability measures with zero mean and the respective variances for which the divergence tends to zero (without loss of generality, one can assume that the common mean is zero). We start by assuming . Let
and define a sequence of quaternary real-valued random variables with probability mass functions
It can be verified that, for all , has zero mean and variance .
Furthermore, let
with
If , for , we choose an arbitrary with mean and variance . Then,
Next, suppose ; then construct and as before with variances and , respectively. If and denote the random variables and scaled by a factor of , then their variances are , respectively, and the conclusion follows as we let . ∎
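A simplified symmetric three-point variant of this idea (not the quaternary construction above; the variances v1 and v2 below are assumed example values) already exhibits numerically that the Kullback-Leibler and χ²-divergences can be made arbitrarily small under fixed means and variances:

    import numpy as np

    def three_point(variance, n):
        # Zero-mean measure on {-n, 0, n} with the given variance (requires n^2 > variance).
        tail = variance / (2.0 * n ** 2)
        return np.array([tail, 1.0 - 2.0 * tail, tail])

    v1, v2 = 1.0, 4.0
    for n in (10, 100, 1000):
        p, q = three_point(v1, n), three_point(v2, n)
        kl = np.sum(p * np.log(p / q))
        chi2 = np.sum((p - q) ** 2 / q)
        print(n, kl, chi2)  # both tend to 0 as n grows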
V Conclusion
In this paper, we derived relations between asymmetric α-divergences of different orders. These relations are generalizations of the integral relation between the Kullback-Leibler divergence and the χ²-divergence. We showed a necessary and sufficient condition on the order such that the binary α-divergences always attain the lower bounds under given means and variances; the Kullback-Leibler divergence, the Hellinger distance, and the χ²-divergence satisfy this condition. It is intuitively natural that the discrepancy between probability measures under moment constraints is smaller for localized measures than for broad probability measures. From this point of view, the α-divergences satisfying this condition have preferable properties. The tight lower bounds for the Kullback-Leibler divergence and the Hellinger distance have recently begun to be applied in physics [9, 19]. In the future, we hope that the range of applications of the relations between the α-divergences and of the tight lower bounds will expand, and that they will help advance many fields and deepen the understanding of the properties of divergences.
References
- [1] S.-i. Amari. Information geometry and its applications, volume 194. Springer, 2016.
- [2] K. M. Audenaert. Quantum skew divergence. Journal of Mathematical Physics, 55(11):112202, 2014.
- [3] D. G. Chapman and H. Robbins. Minimum variance estimation without regularity assumptions. The Annals of Mathematical Statistics, 22(4):581–586, 1951.
- [4] A. Cichocki and S.-i. Amari. Families of alpha- beta- and gamma-divergences: Flexible and robust measures of similarities. Entropy, 12(6):1532–1568, 2010.
- [5] I. Csiszár. Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2:229–318, 1967.
- [6] I. Csiszár. On topological properties of f-divergences. Studia Scientiarum Mathematicarum Hungarica, 2:329–339, 1967.
- [7] I. Csiszár. A class of measures of informativity of observation channels. Periodica Mathematica Hungarica, 2(1-4):191–213, 1972.
- [8] M. Dashti and A. M. Stuart. The Bayesian approach to inverse problems. arXiv preprint arXiv:1302.6989, 2013.
- [9] Y. Hasegawa. Irreversibility, Loschmidt echo, and thermodynamic uncertainty relation. Physical Review Letters, 127(24):240602, 2021.
- [10] E. Hellinger. Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen. Journal für die reine und angewandte Mathematik (Crelles Journal), 1909(136):210–271, 1909.
- [11] M. A. Katsoulakis, L. Rey-Bellet, and J. Wang. Scalable information inequalities for uncertainty quantification. Journal of Computational Physics, 336:513–545, 2017.
- [12] S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951.
- [13] F. Liese and I. Vajda. On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52(10):4394–4412, 2006.
- [14] J. Melbourne, M. Madiman, and M. V. Salapaka. Relationships between certain f-divergences. In 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1068–1073, 2019.
- [15] T. Nishiyama. A tight lower bound for the Hellinger distance with given means and variances. arXiv preprint arXiv:2010.13548, 2020.
- [16] T. Nishiyama and I. Sason. On relations between the relative entropy and χ²-divergence, generalizations and applications. Entropy, 22(5):563, 2020.
- [17] K. Pearson. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 50(302):157–175, 1900.
- [18] A. Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. The Regents of the University of California, 1961.
- [19] T. Van Vu, Y. Hasegawa, et al. Unified approach to classical speed limit and thermodynamic uncertainty relation. Physical Review E, 102(6):062132, 2020.
Appendix A Proof of Lemma 1
Proof.
Let ( are trivial), and let , . Consider the following minimization problem.
(42)
subject to (43)
(44)
(45) |
where , , and (43) and (44) correspond to (4). Since the feasible set is compact, there exists a global minimum. We first consider and , then we have for . Hence, the global minimum point must be a stationary point, or be on the boundary at , , or . By rearranging the order of appropriately, let
Since holds for the global minimum point, we have . If , let be
where and denote sufficiently small real numbers. The moment constraints for probability measures and include and (), respectively. These variables are independent, and it can be easily verified that the determinants for each probability measure constraint are non-zero by for . Hence, one can choose such that and . By and ,
(46) |
This contradicts that is a global minimum, and hence we obtain . In a similar way, we also obtain and . Hence, the global minimum point is an interior point or on the boundary at . Supposing , it must be a stationary point of the following Lagrangian.
(47) |
where and . Since for , and , it follows that are linearly independent. The stationary conditions are
(48)
(49)
(50) |
where ′ denotes the derivative with respect to .
From (48) and (49), it follows that
(51) |
Substituting (48) and (50) into (49), we have
(52) |
Since and are positive from (48) and (49), one can define the function . From (51) and (52), we obtain
(53) |
Since and are at most quadratic functions with respect to , equation (52) is a polynomial equation of bounded degree. If (52) does not hold identically, we have . If , by the relation (53) and the mean value theorem, there exists such that . By the definition of , it follows that is a solution of (52). This contradicts the bound on the degree of (52). If equation (52) holds identically, the equality (51) is an identity. Since the degrees of and are bounded, and due to , it follows that they are constant. The equality (48) then yields for all and a constant . This implies for all , which contradicts . Hence, we obtain .
Next, we briefly treat the cases and . The notation is the same as above, and and hold due to . The Lagrangian is
(54) |
Supposing , the stationary conditions are
(55)
(56)
(57) |
From (55) and (56), it follows that is not a constant. Since has degree at most 1 from (57) and for , it follows that . The result for and also follows by swapping and . ∎
Appendix B Proof of Lemma 2
We first prove the following lemma.
Lemma 5.
Let . If ,
If ,
Proof.
Letting , for and , we have
By combining this relation with , the results follow. ∎
Proof of Lemma 2.
Let and , and . From (14)-(17), by replacing by , we obtain . Hence, the binary α-divergences can be written as . Since , we show that is monotonically increasing with respect to and .
Appendix C Proof of Lemma 4
Proof.
Let . In a similar way to the proof of Lemma 2, we obtain (B). For , let
where we write explicitly, as in the proof of Lemma 2. By , there exists such that and for all . By combining these inequalities with , it follows that (B) is positive for all . By and , it follows that
(68) |
where . The case for can be justified in a similar way.
∎