Characterizing the Functional Density Power Divergence Class
Abstract
Divergence measures have a long association with statistical inference, machine learning and information theory. The density power divergence and related measures have produced many useful (and popular) statistical procedures which provide a good balance between model efficiency on the one hand and outlier stability or robustness on the other. The logarithmic density power divergence, a particular logarithmic transform of the density power divergence, has also been very successful in producing efficient and stable inference procedures; in addition, it has led to significant demonstrated applications in information theory. The success of the minimum divergence procedures based on the density power divergence and the logarithmic density power divergence (which also go by the names $\beta$-divergence and $\gamma$-divergence, respectively) makes it imperative and meaningful to look for other, similar divergences which may be obtained as transforms of the density power divergence in the same spirit. With this motivation we search for such transforms of the density power divergence, referred to herein as the functional density power divergence class. The present article characterizes this functional density power divergence class, and thus identifies the available divergence measures within this construct that may be explored further for possible applications in statistical inference, machine learning and information theory.
1 Introduction
Divergence measures have natural and appealing applications in many scientific disciplines including statistics, machine learning and information theory. The method based on likelihood, the canonical approach to inference in statistical data analysis, is itself a minimum divergence method; the maximum likelihood estimator minimizes the likelihood disparity [15], a version of the Kullback-Leibler divergence. Among the different formats of minimum divergence inference, the approach based on the minimization of density-based divergences is of particular importance, as in this case the resulting procedures combine a high degree of model efficiency with strong robustness properties.
The central element of the present research is the collection of density-based minimum divergence procedures built on the density power divergence (DPD) of [1]. The popularity and utility of these procedures make it important to study other similar divergences in search of competitive or better statistical (and other) properties. Indeed, one such divergence that has left its unmistakable mark on the area of robust statistical inference is the logarithmic density power divergence (LDPD); see, e.g., [11], [7], [2], [6]. The applicability of this class of divergences in mathematical information theory has been explored in [13], [14], [8], [9].
Both the ordinary DPD and the LDPD belong to the functional density power divergence class that we will define in the next section. These two families of divergences have also been referred to as the BHHJ and the JHHB classes, as the type 1 and type 0 classes, or as the $\beta$-divergence and the $\gamma$-divergence classes; more details about their applications may be found in [11], [7], [4], [2], among others. However, while the DPD belongs to the class of Bregman divergences, the LDPD does not. The DPD is also a single-integral, non-kernel divergence [10]; the LDPD is not a single-integral divergence, although it is a non-kernel one. Non-kernel divergences have also been called decomposable in the literature [3]. The divergences within the DPD family have been shown to possess strong robustness properties in statistical applications, and the LDPD family is also useful in this respect.
Our basic aim in this work is to characterize the class of functional density power divergences. Essentially, each functional density power divergence corresponds to a function with the non-negative real line as its domain. The DPD corresponds to the identity function, while the LDPD corresponds to the log function. Within the class of functional density power divergences, we will characterize the class of functions which generate legitimate divergences. In turn, this will provide a characterization of the functional density power divergence class.
2 The DPD and the LDPD
Suppose $(\mathbb{R}, \mathcal{A}, \mu)$ is a measure space on the real line, where $\mu$ is the dominating measure. Introduce the notation
$$\mathcal{G} \;=\; \Big\{\, g \;:\; g \geq 0, \ \int g \, d\mu = 1, \ \int g^{1+\alpha}\, d\mu < \infty \ \text{for all } \alpha > 0 \,\Big\}. \tag{1}$$
A divergence $d$ defined on $\mathcal{G}$ is a non-negative function on $\mathcal{G} \times \mathcal{G}$ with the property
$$d(g,f) = 0 \ \text{ if and only if } \ g = f \ \text{almost surely}. \tag{2}$$
For the sake of brevity we will drop the dominating measure from the notation; it will be understood that all the integrations and almost sure statements are with respect to this dominating measure. We also suppress the dummy variable of integration from the expression of the integrals in the rest of the paper.
One of the most popular examples of such families of divergences is the density power divergence (DPD) of [1], defined as
$$\mathrm{DPD}_{\alpha}(g,f) \;=\; \int\Big[f^{1+\alpha} - \Big(1+\frac{1}{\alpha}\Big)\, g\, f^{\alpha} + \frac{1}{\alpha}\, g^{1+\alpha}\Big] \tag{3}$$
for all $g, f \in \mathcal{G}$, where $\alpha$ is a non-negative real number acting as a tuning parameter.
For $\alpha = 0$, the definition is to be understood in a limiting sense as $\alpha \to 0$, and the form of the divergence then turns out to be
$$\mathrm{DPD}_{0}(g,f) \;=\; \int g \log\Big(\frac{g}{f}\Big), \tag{4}$$
which is actually the likelihood disparity, see [15]; it is also a version of the Kullback-Leibler divergence. For $\alpha = 1$, the divergence in (3) reduces to the squared $L_{2}$ distance. It is straightforward to check that the definition in (3) satisfies the condition in (2).
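To see how (4) arises from (3), the following limiting calculation may be helpful (a heuristic sketch, assuming enough regularity to pass to the limit under the integral sign): rewriting the integrand,
$$\mathrm{DPD}_{\alpha}(g,f) \;=\; \int\big[f^{1+\alpha} - g f^{\alpha}\big] \;+\; \frac{1}{\alpha}\int g\,\big[g^{\alpha} - f^{\alpha}\big].$$
As $\alpha \to 0$, the first integral tends to $\int(f-g) = 0$, while $\frac{g^{\alpha} - f^{\alpha}}{\alpha} \to \log g - \log f$ pointwise, so that $\mathrm{DPD}_{\alpha}(g,f) \to \int g \log(g/f)$, which is (4).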
Another common example of a related divergence class is the logarithmic density power divergence (LDPD) family of [11], defined as
$$\mathrm{LDPD}_{\alpha}(g,f) \;=\; \log\Big(\int f^{1+\alpha}\Big) - \frac{1+\alpha}{\alpha}\,\log\Big(\int g f^{\alpha}\Big) + \frac{1}{\alpha}\,\log\Big(\int g^{1+\alpha}\Big) \tag{5}$$
for all $g, f \in \mathcal{G}$ and $\alpha > 0$. Its structural similarity with the DPD family is immediately apparent: it is obtained by applying the log function, rather than the identity function, to each component integral of the DPD. This family is also known to produce highly robust estimators with good efficiency. In fact, [7] and [6] argue that the minimum divergence estimators based on the LDPD are more successful than the minimum DPD estimators in limiting the bias of the estimator under heavy contamination. However, also see [12] for some counter views; the latter work has, in fact, proposed a new class of divergences which provides a smooth bridge between the DPD and the LDPD families.
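As a quick numerical illustration (not part of the original development), the two divergences in (3) and (5) can be approximated by numerical integration. The function names and the choice of two normal densities below are our own, purely for illustration.

```python
import numpy as np
from scipy.stats import norm

def dpd(g, f, alpha, x):
    """Density power divergence (3), approximated by a Riemann sum on the grid x."""
    dx = x[1] - x[0]
    integrand = (f(x) ** (1 + alpha)
                 - (1 + 1 / alpha) * g(x) * f(x) ** alpha
                 + (1 / alpha) * g(x) ** (1 + alpha))
    return integrand.sum() * dx

def ldpd(g, f, alpha, x):
    """Logarithmic density power divergence (5), via the same three integrals."""
    dx = x[1] - x[0]
    a = (f(x) ** (1 + alpha)).sum() * dx
    b = (g(x) * f(x) ** alpha).sum() * dx
    c = (g(x) ** (1 + alpha)).sum() * dx
    return np.log(a) - ((1 + alpha) / alpha) * np.log(b) + (1 / alpha) * np.log(c)

# Illustrative pair of densities: N(0,1) plays the role of g, N(1,1) that of f.
x = np.linspace(-10.0, 10.0, 20001)
g = norm(loc=0.0, scale=1.0).pdf
f = norm(loc=1.0, scale=1.0).pdf

for alpha in (0.1, 0.5, 1.0):
    print(alpha, dpd(g, f, alpha, x), ldpd(g, f, alpha, x))
# Both quantities are positive here and vanish when f is replaced by g.
```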
3 The Functional Density Power Divergence
Further exploration of the divergences within the DPD family leads to the observation that this class may be extended to a more general family of divergences, called the functional density power divergence (FDPD) family, having the form
$$\mathrm{FDPD}_{\alpha}^{(\phi)}(g,f) \;=\; \phi\Big(\int f^{1+\alpha}\Big) - \frac{1+\alpha}{\alpha}\,\phi\Big(\int g f^{\alpha}\Big) + \frac{1}{\alpha}\,\phi\Big(\int g^{1+\alpha}\Big) \tag{6}$$
for all $g, f \in \mathcal{G}$, where $\phi$ is a pre-assigned function on the non-negative real line, $\alpha$ is a non-negative real number, and $\mathrm{FDPD}_{\alpha}^{(\phi)}$ is a divergence in the sense of (2). Note that the expression given in (6) need not define a divergence for every choice of $\phi$, as it does not always satisfy the condition stated in Section 2. Indeed, it may not even be well-defined for all pairs of densities, since one or more of the terms in (6) may fail to be finite. In the following we will identify the class of functions $\phi$ for which the quantity defined in (6) is actually a divergence, thus providing a characterization of the FDPD class.
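To make the correspondence with the earlier families explicit (a small verification, spelled out here for convenience): separating the three terms of the integrand in (3) gives
$$\int\Big[f^{1+\alpha} - \Big(1+\frac{1}{\alpha}\Big) g f^{\alpha} + \frac{1}{\alpha}\, g^{1+\alpha}\Big] \;=\; \int f^{1+\alpha} - \frac{1+\alpha}{\alpha}\int g f^{\alpha} + \frac{1}{\alpha}\int g^{1+\alpha},$$
which is exactly (6) with $\phi$ taken to be the identity function; choosing $\phi = \log$ instead reproduces the LDPD in (5).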
Within the FDPD class, the case $\alpha = 0$ again has to be understood in a limiting sense, and this limiting divergence exists under some constraints on the function $\phi$. For example, if we assume that $\phi$ is continuously differentiable in an interval around $1$, then the divergence for $\alpha = 0$ can be defined as
$$\mathrm{FDPD}_{0}^{(\phi)}(g,f) \;=\; \phi'(1)\int g \log\Big(\frac{g}{f}\Big), \tag{7}$$
where $\phi'$ is the derivative of $\phi$. Obviously we require $\phi'(1)$ to be positive for the above to be a divergence. Note that the divergence in (7) is simply the likelihood disparity multiplied by a scaling constant, and therefore the divergences at $\alpha = 0$ are effectively equivalent to the likelihood disparity for inferential purposes. For the DPD and the LDPD, in fact, the scaling constant equals unity. The characterization of the FDPD is therefore not an interesting problem for $\alpha = 0$; hence we will not concern ourselves with this case in the following.
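A heuristic sketch of how (7) arises (under the stated assumption that $\phi$ is continuously differentiable near $1$, and assuming enough regularity to justify the expansion): each of the three integrals appearing in (6) converges to $1$ as $\alpha \to 0$, so the first-order expansion $\phi(u) \approx \phi(1) + \phi'(1)(u-1)$ gives
$$\mathrm{FDPD}_{\alpha}^{(\phi)}(g,f) \;\approx\; \phi(1)\Big(1 - \frac{1+\alpha}{\alpha} + \frac{1}{\alpha}\Big) + \phi'(1)\,\mathrm{DPD}_{\alpha}(g,f) \;=\; \phi'(1)\,\mathrm{DPD}_{\alpha}(g,f),$$
since the coefficients of $\phi(1)$ sum to zero; letting $\alpha \to 0$ and using (4) then yields (7).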
Remark 1.
Suppose $\phi$ is a strictly increasing and convex function on the non-negative real line. Then it is straightforward to check that the expression defined in (6) does indeed satisfy the divergence conditions of Section 2, and therefore defines a legitimate divergence belonging to the FDPD class. Note that $\phi'(1)$, whenever it exists, is necessarily positive in this case. The identity function, which generates the DPD family, belongs to this class of functions.
Remark 2.
That the class of functions described in the previous remark does not completely characterize the FDPDs can be seen by choosing $\phi(x) = \log x$ for all $x > 0$, with the convention that $\log 0 = -\infty$. In this case $\phi$ is a concave function, but the corresponding FDPD still satisfies the divergence conditions and gives rise to the logarithmic density power divergence (LDPD) family already introduced in Section 2. In this case, too, $\phi'(1) = 1$.
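A direct way to verify the claim of Remark 2: for $\alpha > 0$, the inequality $\mathrm{LDPD}_{\alpha}(g,f) \geq 0$ is equivalent, after multiplying by $\frac{\alpha}{1+\alpha}$ and exponentiating, to
$$\int g f^{\alpha} \;\leq\; \Big(\int f^{1+\alpha}\Big)^{\frac{\alpha}{1+\alpha}}\Big(\int g^{1+\alpha}\Big)^{\frac{1}{1+\alpha}},$$
which is Hölder's inequality applied to $f^{\alpha}$ and $g$ with exponents $\frac{1+\alpha}{\alpha}$ and $1+\alpha$; equality holds precisely when $g = f$ almost everywhere, which gives the "if and only if" part of (2).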
We expect that the members of the FDPD family possess useful robustness properties and other information-theoretic utilities, which makes it interesting to examine these divergences further. It is therefore natural to ask whether we can characterize all the functions $\phi$ which give rise to a divergence in (6), and thus obtain a complete description of the FDPD family. As already indicated, the main objective of this article is to discuss this characterization.
4 Characterization of the FDPD family
In this section we will assume that the dominating measure is actually the Lebesgue measure on the real line and therefore the FDPD is a family of divergences on the space of probability density functions.
Our first result states a general sufficient condition on the function $\phi$ which guarantees that $\mathrm{FDPD}_{\alpha}^{(\phi)}$ is a valid divergence for every fixed $\alpha > 0$.
Proposition 1.
Suppose $\phi$ is a function such that the function $\psi$ defined as $\psi(x) = \phi(e^{x})$ is convex and strictly increasing on its domain. Moreover, assume that $\phi(0) \leq \lim_{x \downarrow 0}\phi(x)$. Then $\mathrm{FDPD}_{\alpha}^{(\phi)}$ is a valid divergence for each fixed $\alpha > 0$, according to the definition in (6).
Proof.
We start by observing that, for any $g, f \in \mathcal{G}$, the quantities $\int f^{1+\alpha}$ and $\int g^{1+\alpha}$ are finite and non-zero, and hence, by Hölder's inequality, $\int g f^{\alpha}$ is finite as well. Therefore the expression in (6) is well-defined, since $\psi$ is real-valued on its domain and hence $\phi$ is real-valued on $(0,\infty)$. If $\int g f^{\alpha} = 0$, then, because $\phi(0) \leq \lim_{x \downarrow 0}\phi(x) = \inf_{x>0}\phi(x)$ and $\psi$ is strictly increasing, the expression in (6) is strictly positive; so we may assume $\int g f^{\alpha} > 0$ in what follows. Now, in order to show that $\mathrm{FDPD}_{\alpha}^{(\phi)}$ is a valid divergence, we need to establish that $\mathrm{FDPD}_{\alpha}^{(\phi)}(g,f)$ is non-negative for all choices of $g, f \in \mathcal{G}$, and that it is exactly zero if and only if $g = f$ almost everywhere. Using the convexity of the function $\psi$ we can conclude that
$$\frac{\alpha}{1+\alpha}\,\psi\Big(\log\int f^{1+\alpha}\Big) + \frac{1}{1+\alpha}\,\psi\Big(\log\int g^{1+\alpha}\Big) \;\geq\; \psi\Big(\frac{\alpha}{1+\alpha}\log\int f^{1+\alpha} + \frac{1}{1+\alpha}\log\int g^{1+\alpha}\Big). \tag{8}$$
On the other hand, using Hölder's inequality on the functions $f^{\alpha}$ and $g$ with the dual indices $\frac{1+\alpha}{\alpha}$ and $1+\alpha$, we obtain
$$\int g f^{\alpha} \;\leq\; \Big(\int f^{1+\alpha}\Big)^{\frac{\alpha}{1+\alpha}}\Big(\int g^{1+\alpha}\Big)^{\frac{1}{1+\alpha}}, \tag{9}$$
which is equivalent to
$$\log\int g f^{\alpha} \;\leq\; \frac{\alpha}{1+\alpha}\log\int f^{1+\alpha} + \frac{1}{1+\alpha}\log\int g^{1+\alpha}. \tag{10}$$
Expressions (8) and (10), along with the strict monotonicity of $\psi$ and the identity $\psi(\log x) = \phi(x)$, imply that
$$\frac{\alpha}{1+\alpha}\,\phi\Big(\int f^{1+\alpha}\Big) + \frac{1}{1+\alpha}\,\phi\Big(\int g^{1+\alpha}\Big) \;\geq\; \psi\Big(\frac{\alpha}{1+\alpha}\log\int f^{1+\alpha} + \frac{1}{1+\alpha}\log\int g^{1+\alpha}\Big) \tag{11}$$
$$\;\geq\; \psi\Big(\log\int g f^{\alpha}\Big) \;=\; \phi\Big(\int g f^{\alpha}\Big), \tag{12}$$
which, upon multiplying through by $\frac{1+\alpha}{\alpha}$ and rearranging, is equivalent to the statement that $\mathrm{FDPD}_{\alpha}^{(\phi)}(g,f) \geq 0$. For the equality $\mathrm{FDPD}_{\alpha}^{(\phi)}(g,f) = 0$ to hold, we must have equality in (11) and (12). By the strict monotonicity of $\psi$, equality in (12) implies equality in (10), and hence in Hölder's inequality (9); this happens only if $f^{1+\alpha}$ and $g^{1+\alpha}$ are proportional almost everywhere, which, both $g$ and $f$ being probability densities, is equivalent to $g = f$ almost everywhere. On the other hand, if $g = f$, then clearly $\mathrm{FDPD}_{\alpha}^{(\phi)}(g,f) = 0$ by (6). This completes our proof. ∎
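As an illustration of the extra generality afforded by Proposition 1 over Remark 1 (the following family is our own illustrative choice): take $\phi(x) = x^{\lambda}$ for some $\lambda > 0$, with $\phi(0) = 0$. Then $\psi(x) = \phi(e^{x}) = e^{\lambda x}$ is strictly increasing and convex on $\mathbb{R}$, and $\phi(0) = \lim_{x \downarrow 0}\phi(x)$, so every such $\phi$ generates a valid FDPD for each fixed $\alpha > 0$. For $\lambda \geq 1$ the function $\phi$ is itself convex and increasing and is already covered by Remark 1; for $0 < \lambda < 1$ it is concave, like the logarithm of Remark 2, yet still produces a legitimate divergence.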
Now we shall show that the condition on $\phi$ stated in Proposition 1 is indeed a necessary condition for generating a divergence, for any fixed $\alpha > 0$.
Proposition 2.
Fix $\alpha > 0$. Suppose $\phi$ is a function such that $\mathrm{FDPD}_{\alpha}^{(\phi)}$ is a valid divergence. Then the function $\psi$ defined as $\psi(x) = \phi(e^{x})$ is convex and strictly increasing on its domain, with $\phi(0) \leq \lim_{x \downarrow 0}\phi(x)$.
Proof.
We shall use the idea of computing the divergence between two appropriately chosen probability density functions and extracting the required properties of the function $\phi$ from the resulting inequalities. Fix any real number and consider the family of probability densities given by
(13)
where $\mathbf{1}_{A}$ denotes the indicator function of the set $A$. These are valid probability densities, being non-negative and integrating to one. Easy computations show that
(14)
and for any
(15)
Therefore, the property that the divergence $\mathrm{FDPD}_{\alpha}^{(\phi)}$ is non-negative, and is equal to zero if and only if its two arguments coincide almost everywhere, yields that
(16)
and
(17)
where . The assertion that the expressions on the left-hand sides of (16) and (17) are well-defined is also part of the implication. Now fix any . If , plug in and in Equation (17). Notice that will guarantee that . Therefore we get
(18)
for all , which on simplification yields
(19)
Similar manipulation with (16) leads us to the following observation.
(20)
We shall now proceed with some appropriate choices for . If we take in (20), we obtain that $\psi$ is strictly increasing on . To prove that $\psi$ is indeed strictly increasing on , take and for some . In this case,
and hence
Since this holds for all , we have the required strict monotonicity of $\psi$ on , which proves that $\psi$ is strictly increasing on . Observe that strict monotonicity of $\psi$ on implies . All that remains to show now is the convexity of the function $\psi$.
Fix any and take in (19). Since,
we can conclude that
(21)
where , for all , which exists since $\psi$ is monotone. Similar manipulation with (20) yields the inequality in (21) for . Monotonicity of $\psi$ also guarantees that is finite on . Fix and get sequences and . Define , for all . Clearly, and
(22)
Taking limits in (22), we can conclude that
implying that $\psi$ is indeed $\frac{1}{1+\alpha}$-convex; see Definition 1 in the Appendix. The function $\psi$, being finite and non-decreasing, is bounded on any finite interval. Applying Lemma 1 and Proposition 3, we can conclude that $\psi$ is convex and continuous on the relevant part of its domain. Lemma 3 then allows us to extend this conclusion; hence $\psi$ is convex on its domain. Monotonicity of $\psi$ guarantees that it is indeed convex on the whole of its domain. This completes the proof.
∎
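To illustrate the necessity part with a concrete example (our own illustrative choice, not one considered above): take $\phi(x) = -1/x$ on $(0,\infty)$. Here $\psi(x) = \phi(e^{x}) = -e^{-x}$ is strictly increasing but concave, so Proposition 2 implies that this $\phi$ cannot generate a divergence. Indeed, with $\alpha = 1$ and the exponential densities $g_{s}(x) = s\,e^{-sx}\,\mathbf{1}_{(0,\infty)}(x)$, $s > 0$, one has $\int g_{s}^{2} = s/2$ and $\int g_{s} g_{t} = st/(s+t)$, so that
$$\mathrm{FDPD}_{1}^{(\phi)}(g_{s}, g_{t}) \;=\; -\frac{2}{t} + 2\,\frac{s+t}{st} - \frac{2}{s} \;=\; 0 \quad \text{for all } s, t > 0,$$
violating the requirement in (2) that the divergence vanish only when the two densities coincide.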
Remark 3.
The proof of Proposition 2 does not assume continuity of $\phi$ (or, equivalently, of $\psi$) a priori. Instead, we have proved that $\psi$ is convex on its domain, which implies that it must also be continuous there.
Remark 4.
Remark 5.
There exist other directions for proving the necessity part of Proposition 2 under some smoothness conditions on the function $\phi$. One such direction may be provided by the method of [10]. Since our proof requires no smoothness assumptions whatsoever, our characterization is complete. In fact, it appears that the approach of [10] might be refined by the approach in the present paper, rather than the other way around (as was also suggested by one of the reviewers). We hope to explore this in future work.
However, from a practical point of view, and for establishing large sample consistency or carrying out influence function calculations, we would probably need some differentiability conditions on $\phi$.
Remark 6.
One purpose of characterizing this class of divergences is to identify new estimators, obtained as minimizers of a divergence between an empirical estimate (see Remark 7) of the true density and the model density $f_{\theta}$, as $\theta$ varies over a suitable parameter space $\Theta$. A natural follow-up of the present work will be to study the properties of the minimum FDPD estimators from an overall standpoint, and to explore whether a general proof of asymptotic normality is possible under the presently existing conditions on the function $\phi$, or under minimal additional conditions (apart from standard model conditions).
Remark 7.
It may be noted that all minimum FDPD estimators are non-kernel divergence estimators in the sense of [10], although not all minimum FDPD estimators are M-estimators. While the present paper focuses entirely on the characterization issue, eventually one would also like to know how useful the inference procedures resulting from the minimization of divergences within the FDPD class are (as already observed in the previous remark). In that respect, the non-kernel divergence property lends a practical edge to the estimators and other inference procedures based on this family, in comparison with divergences which require an active use of a non-parametric smoothing technique in their construction.
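As a rough sketch of how a minimum FDPD estimator might be computed in practice (this follows the usual plug-in recipe for non-kernel divergences; the code, the normal location-scale model, and all names below are our own illustrative choices): drop the term in (6) that does not involve the model, replace $\int g f_{\theta}^{\alpha}$ by its sample analogue $\frac{1}{n}\sum_{i} f_{\theta}^{\alpha}(X_i)$, and minimize the resulting objective over $\theta$.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def fdpd_objective(theta, data, alpha, phi):
    """Plug-in FDPD objective for the N(mu, sigma^2) model.

    Uses the closed form  int f_theta^(1+alpha) = (2*pi*sigma^2)^(-alpha/2) / sqrt(1+alpha)
    and the sample mean of f_theta^alpha(X_i) in place of int g f_theta^alpha.
    The g-only term of (6) is dropped, as it does not depend on theta.
    """
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)                      # keep sigma positive
    int_f_1pa = (2 * np.pi * sigma**2) ** (-alpha / 2) / np.sqrt(1 + alpha)
    mean_f_a = np.mean(norm.pdf(data, loc=mu, scale=sigma) ** alpha)
    return phi(int_f_1pa) - ((1 + alpha) / alpha) * phi(mean_f_a)

def minimum_fdpd_estimate(data, alpha, phi):
    start = np.array([np.median(data), np.log(data.std())])
    res = minimize(fdpd_objective, start, args=(data, alpha, phi), method="Nelder-Mead")
    mu_hat, log_sigma_hat = res.x
    return mu_hat, np.exp(log_sigma_hat)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 95), rng.normal(10, 1, 5)])  # 5% outliers

print(minimum_fdpd_estimate(x, 0.5, phi=lambda u: u))   # DPD  (phi = identity)
print(minimum_fdpd_estimate(x, 0.5, phi=np.log))        # LDPD (phi = log)
```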
Proposition 1 and Proposition 2 together provide a complete characterization of the FDPD family and of the class of functions that generate it. We trust that this characterization describes the class within which one can search for suitable minimum divergence procedures exhibiting a good balance between model efficiency and robustness.
5 Acknowledgements
We are grateful to two anonymous referees and the Associate Editor, whose suggestions have led to an improved version of the paper. In particular, their comments allowed us to prove Proposition 2 without smoothness conditions (or even the assumption of continuity) on $\phi$. Also, a comment about possible $\alpha$-specific choices of the function $\phi$ allowed us to make our result more general.
6 Appendix
The proof of Proposition 2 depends on some additional results involving $t$-convex functions and general convex functions. These results, which are used as tools in our main pursuit, are presented separately here in the Appendix, so as not to lose focus from our main characterization problem.
Definition 1.
Let $t \in (0,1)$ and let $I \subseteq \mathbb{R}$ be an interval. A function $h : I \to \mathbb{R}$ is said to be $t$-convex if
$$h\big(t x + (1-t) y\big) \;\leq\; t\, h(x) + (1-t)\, h(y) \quad \text{for all } x, y \in I.$$
Obviously any convex function is also $t$-convex, though the converse is not true in general. Traditionally, $\tfrac{1}{2}$-convex functions are called midpoint convex. Under some further assumptions on the function, such as Lebesgue measurability or boundedness on a set of positive Lebesgue measure, one can prove that midpoint convex functions are indeed convex; see [5, Section I.3] for an extensive account of results of this kind. Here we shall prove a similar result in Proposition 3 for $t$-convex functions, for any $t \in (0,1)$. Lemma 1 and Lemma 2 are instrumental in proving Proposition 3.
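For orientation, the standard examples of midpoint convex functions that fail to be convex are additive, non-linear functions constructed from a Hamel basis of $\mathbb{R}$ over $\mathbb{Q}$: for $t = \tfrac{1}{2}$ the defining inequality reads $h\big(\tfrac{x+y}{2}\big) \leq \tfrac{1}{2}h(x) + \tfrac{1}{2}h(y)$, which such functions satisfy with equality, while they are necessarily non-measurable and unbounded on every interval; this is precisely the pathology excluded by the boundedness assumptions used below.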
Lemma 1.
Suppose that $h$ is $t$-convex. Then $h$ is continuous at a point $x_0$ if and only if $h$ is bounded on an interval around $x_0$.
Proof.
The proof of the "only if" part is trivial from the definition of continuity. The proof of the "if" part is inspired by the proof of the theorem in [5, p. 12]. Suppose that $h$ is bounded on an interval around $x_0$. This condition can equivalently be written as
Applying $t$-convexity of the function $h$ we can write the following:
where the last inequality follows from the observation that converges to if converges to . Since the quantity is finite, we can conclude that it is at most and hence equal to . Applying $t$-convexity again,
implying that the corresponding quantity is at least and hence equal to . In other words, both quantities are equal to $h(x_0)$, and therefore $h$ is continuous at $x_0$.
∎
Lemma 2.
Suppose that $h$ is $t$-convex and continuous. Then $h$ is convex.
Proof.
We shall prove the statement by contradiction. Suppose that $h$ is not convex. Then we can find points and a weight such that
Define as follows.
Since is continuous, so is and . Let be the infimum of the set , which is non-empty due to continuity of . Continuity of also guarantees that ; hence since . Get such that and define
Note that with and We can, therefore, write the following series of inequalities.
where the left-most inequality follows from the fact that and hence . This gives us a contradiction. ∎
Proposition 3.
Suppose that $h$ is $t$-convex for some $t \in (0,1)$. Moreover, suppose that for every point $x$, $h$ is bounded on an interval around $x$. Then $h$ is convex.
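A brief argument for Proposition 3, using only Lemma 1 and Lemma 2 as stated above: for any point $x$ in the domain, $h$ is bounded on an interval around $x$ by hypothesis, so Lemma 1 gives continuity of $h$ at $x$; since $x$ is arbitrary, $h$ is continuous everywhere, and Lemma 2 then shows that $h$ is convex.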
Lemma 3.
Let $h$ be non-decreasing. Suppose that the left-hand limit function $h^{-}$, defined as $h^{-}(x) = \lim_{y \uparrow x} h(y)$ for all $x$, is continuous. Then $h$ is also continuous; in particular, $h = h^{-}$.
Proof.
It is enough to show that $h = h^{-}$. Take any $x$. Monotonicity of $h$ implies that $h^{-}(x) \leq h(x)$. On the other hand, $h^{-}(y) \geq h(x)$ for all $y > x$, and hence $\lim_{y \downarrow x} h^{-}(y) \geq h(x)$. Since $h^{-}$ is continuous, we can let $y \downarrow x$ to conclude that $h^{-}(x) \geq h(x)$. This shows $h = h^{-}$, and therefore $h$ is continuous. ∎
References
- [1] Ayanendranath Basu, Ian R Harris, Nils L Hjort, and MC Jones. Robust and efficient estimation by minimising a density power divergence. Biometrika, 85(3):549–559, 1998.
- [2] Ayanendranath Basu, Hiroyuki Shioya, and Chanseok Park. Statistical inference: The Minimum Distance Approach. Chapman and Hall/CRC, 2019.
- [3] Michel Broniatowski, Aida Toma, and Igor Vajda. Decomposable pseudodistances and applications in statistical estimation. Journal of Statistical Planning and Inference, 142(9):2574–2585, 2012.
- [4] Andrzej Cichocki and Shun-ichi Amari. Families of alpha-beta-and gamma-divergences: Flexible and robust measures of similarities. Entropy, 12(6):1532–1568, 2010.
- [5] William F. Donoghue Jr. Distributions and Fourier Transforms. Academic Press, 1969.
- [6] Hironori Fujisawa. Normalized estimating equation for robust parameter estimation. Electronic Journal of Statistics, 7:1587–1606, 2013.
- [7] Hironori Fujisawa and Shinto Eguchi. Robust parameter estimation with a small bias against heavy contamination. Journal of Multivariate Analysis, 99(9):2053–2081, 2008.
- [8] Abhik Ghosh and Ayanendranath Basu. A generalized relative $(\alpha,\beta)$-entropy: Geometric properties and applications to robust statistical inference. Entropy, 20(5), 2018.
- [9] Abhik Ghosh and Ayanendranath Basu. A scale invariant generalization of Rényi entropy and related optimizations under Tsallis' nonextensive framework. IEEE Transactions on Information Theory, 67(4):2141–2161, 2021.
- [10] Soham Jana and Ayanendranath Basu. A characterization of all single-integral, non-kernel divergence estimators. IEEE Transactions on Information Theory, 65(12):7976–7984, 2019.
- [11] MC Jones, Nils Lid Hjort, Ian R Harris, and Ayanendranath Basu. A comparison of related density-based minimum divergence estimators. Biometrika, 88(3):865–873, 2001.
- [12] Arun Kumar Kuchibhotla, Somabha Mukherjee, and Ayanendranath Basu. Statistical inference based on bridge divergences. Annals of the Institute of Statistical Mathematics, 71(3):627–656, 2019.
- [13] M Ashok Kumar and Rajesh Sundaresan. Minimization problems based on relative $\alpha$-entropy I: Forward projection. IEEE Transactions on Information Theory, 61(9):5063–5080, 2015.
- [14] M Ashok Kumar and Rajesh Sundaresan. Minimization problems based on relative $\alpha$-entropy II: Reverse projection. IEEE Transactions on Information Theory, 61(9):5081–5095, 2015.
- [15] Bruce G Lindsay. Efficiency versus robustness: The case for minimum Hellinger distance and related methods. The Annals of Statistics, 22(2):1081–1114, 1994.