
Characterizing the Functional Density Power Divergence Class

Ray, S.1, Pal, S.2, Kar, S. K.3 and Basu, A.4
1Stanford University, 2Iowa State University,
3University of North Carolina, Chapel Hill,
4Indian Statistical Institute
Abstract

Divergence measures have a long association with statistical inference, machine learning and information theory. The density power divergence and related measures have produced many useful (and popular) statistical procedures, which provide a good balance between model efficiency on the one hand and outlier stability or robustness on the other. The logarithmic density power divergence, a particular logarithmic transform of the density power divergence, has also been very successful in producing efficient and stable inference procedures; in addition, it has led to significant demonstrated applications in information theory. The success of the minimum divergence procedures based on the density power divergence and the logarithmic density power divergence (which also go by the names $\beta$-divergence and $\gamma$-divergence, respectively) makes it imperative and meaningful to look for other, similar divergences which may be obtained as transforms of the density power divergence in the same spirit. With this motivation we search for such transforms of the density power divergence, referred to herein as the functional density power divergence class. The present article characterizes this functional density power divergence class, and thus identifies the available divergence measures within this construct that may be explored further for possible applications in statistical inference, machine learning and information theory.

1 Introduction

Divergence measures have natural and appealing applications in many scientific disciplines including statistics, machine learning and information theory. The method based on likelihood, the canonical approach to inference in statistical data analysis, is itself a minimum divergence method; the maximum likelihood estimator minimizes the likelihood disparity [15], a version of the Kullback-Leibler divergence. Among the different formats of minimum divergence inference, the approach based on the minimization of density-based divergences is of particular importance, as in this case the resulting procedures combine a high degree of model efficiency with strong robustness properties.

The central element in the present research is the collection of density-based minimum divergence procedures based on the density power divergence (DPD) of [1]. The popularity and utility of these procedures make it important to study other similar divergences in search of competitive or better statistical (and other) properties. Indeed, one such divergence that is known to us, and one that has also left its unmistakable mark on the area of robust statistical inference, is the logarithmic density power divergence (LDPD); see, e.g., [11], [7], [2], [6]. The applicability of this class of divergences in mathematical information theory has been explored in [13], [14], [8], [9].

Both the ordinary DPD and the LDPD belong to the functional density power divergence class that we will define in the next section. These two families of divergences have also been referred to as the BHHJ and the JHHB classes, or the type 1 and type 0 classes, or the $\beta$-divergence and the $\gamma$-divergence classes; more details about their applications may be found in [11], [7], [4], [2], among others. However, while the DPD belongs to the class of Bregman divergences, the LDPD does not. The DPD is also a single-integral, non-kernel divergence [10]; the LDPD is not a single-integral divergence, although it is a non-kernel one. The non-kernel divergences have also been called decomposable in the literature [3]. The divergences within the DPD family have been shown to possess strong robustness properties in statistical applications. The LDPD family is also useful in this respect.

Our basic aim in this work is to characterize the class of functional density power divergences. Essentially, each functional density power divergence corresponds to a function with the non-negative real line as its domain. The DPD corresponds to the identity function, while the LDPD corresponds to the log function. Within the class of functional density power divergences, we will characterize the class of functions which generate legitimate divergences. In turn, this will provide a characterization of the functional density power divergence class.

2 The DPD and the LDPD

Suppose $(\mathbb{R},\mathcal{B}_{\mathbb{R}},\mu)$ is a measure space on the real line. Introduce the notation

$$\mathcal{L}_{\mu}:=\left\{f:\mathbb{R}\longrightarrow\mathbb{R}\;:\;f\geq 0,\;f\;\text{measurable},\;\int_{\mathbb{R}}f\,d\mu=1\right\}. \qquad (1)$$

A divergence defined on $\mathcal{L}\subseteq\mathcal{L}_{\mu}$ is a non-negative function $D:\mathcal{L}\times\mathcal{L}\longrightarrow[0,\infty]$ with the property

$$D(g,f)=0\iff g=f\;\text{a.e.}\,[\mu]. \qquad (2)$$

For the sake of brevity we will drop the dominating measure $\mu$ from the notation; it will be understood that all the integrations and almost sure statements are with respect to this dominating measure. We also suppress the dummy variable of integration from the expressions of the integrals in the rest of the paper.

One of the most popular examples of such families of divergences is the density power divergence (DPD) of [1] defined as

$$\text{DPD}_{\alpha}(g,f):=\int_{\mathbb{R}}f^{\alpha+1}-\Big(1+\dfrac{1}{\alpha}\Big)\int_{\mathbb{R}}f^{\alpha}g+\dfrac{1}{\alpha}\int_{\mathbb{R}}g^{\alpha+1}, \qquad (3)$$

for all $g,f\in\mathcal{L}_{\mu,\alpha}$, where $\alpha$ is a non-negative real number and

$$\mathcal{L}_{\mu,\alpha}:=\left\{f\in\mathcal{L}_{\mu}\;:\;\int_{\mathbb{R}}f^{1+\alpha}\,d\mu<\infty\right\}.$$

For $\alpha=0$, the definition is to be understood in a limiting sense as $\alpha\downarrow 0$, and the form of the divergence then turns out to be

$$\text{DPD}_{\alpha=0}(g,f):=\int_{\mathbb{R}}g\log\dfrac{g}{f},\;\;\text{for all}\;\;g,f\in\mathcal{L}_{\mu}=\mathcal{L}_{\mu,0}, \qquad (4)$$

which is actually the likelihood disparity, see [15]; it is also a version of the Kullback-Leibler divergence. For $\alpha=1$, the divergence in (3) reduces to the squared $L_{2}$ distance. It is straightforward to check that the definition in (3) satisfies the condition in (2).
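As a quick numerical illustration of these two special cases (an informal sketch of ours, not part of the original development), the following code evaluates $\text{DPD}_{\alpha}(g,f)$ on a grid for two normal densities; for a small $\alpha$ the value is close to the likelihood disparity in (4), and for $\alpha=1$ it matches the squared $L_{2}$ distance. The grid, the densities and all helper names are illustrative choices only.

```python
# Informal numerical sketch (ours): DPD_alpha of (3) for two normal densities,
# checked against its alpha -> 0 limit (the likelihood disparity (4)) and its
# alpha = 1 special case (the squared L2 distance).
import numpy as np

def integrate(h, x):
    # simple Riemann-sum approximation of the integral of h over the grid x
    return np.sum(h) * (x[1] - x[0])

def dpd(g, f, alpha, x):
    """Density power divergence DPD_alpha(g, f) of equation (3), for alpha > 0."""
    return (integrate(f ** (alpha + 1), x)
            - (1 + 1 / alpha) * integrate(f ** alpha * g, x)
            + (1 / alpha) * integrate(g ** (alpha + 1), x))

x = np.linspace(-10, 10, 20001)
normal = lambda t, m, s: np.exp(-0.5 * ((t - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
g, f = normal(x, 0.0, 1.0), normal(x, 1.0, 1.5)

kl = integrate(g * np.log(g / f), x)        # likelihood disparity, equation (4)
l2_sq = integrate((g - f) ** 2, x)          # squared L2 distance

print(dpd(g, f, 1e-3, x), kl)      # small alpha: close to the KL limit
print(dpd(g, f, 1.0, x), l2_sq)    # alpha = 1: equals the squared L2 distance
```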

Another common example of a related divergence class is the logarithmic density power divergence (LDPD) family of [11] defined as

$$\text{LDPD}_{\alpha}(g,f):=\log\Big(\int_{\mathbb{R}}f^{\alpha+1}\Big)-\Big(1+\dfrac{1}{\alpha}\Big)\log\Big(\int_{\mathbb{R}}f^{\alpha}g\Big)+\dfrac{1}{\alpha}\log\Big(\int_{\mathbb{R}}g^{\alpha+1}\Big), \qquad (5)$$

for all $g,f\in\mathcal{L}_{\mu,\alpha}$, $\alpha\geq 0$. Its structural similarity with the DPD family is immediately apparent: it is obtained by applying the log function, in place of the identity, to each integral in (3). This family is also known to produce highly robust estimators with good efficiency. [7] and [6], in fact, argue that the minimum divergence estimators based on the LDPD are more successful than the minimum DPD estimators in limiting the bias of the estimator under heavy contamination. However, also see [12] for some counter views. The latter work has, in fact, proposed a new class of divergences which provides a smooth bridge between the DPD and the LDPD families.

3 The Functional Density Power Divergence

Further exploration of the divergences within the DPD family leads to the observation that this class of divergences may be extended to a more general family, called the functional density power divergence (FDPD) family, having the form

$$\text{FDPD}_{\varphi,\alpha}(g,f):=\varphi\Big(\int_{\mathbb{R}}f^{\alpha+1}\Big)-\Big(1+\dfrac{1}{\alpha}\Big)\varphi\Big(\int_{\mathbb{R}}f^{\alpha}g\Big)+\dfrac{1}{\alpha}\varphi\Big(\int_{\mathbb{R}}g^{\alpha+1}\Big), \qquad (6)$$

for all $g,f\in\mathcal{L}_{\mu,\alpha}$, where $\varphi:[0,\infty)\longrightarrow[-\infty,\infty]$ is a pre-assigned function, $\alpha$ is a non-negative real number and $\text{FDPD}_{\varphi,\alpha}$ is a divergence in the sense of (2). Note that the expression given in (6) need not define a divergence for every $\varphi$, as it does not always satisfy the condition stated in Section 2. Indeed, it may not even be well-defined for all pairs of densities $f,g\in\mathcal{L}_{\mu,\alpha}$, since $\varphi$ may take the value $\infty$. In the following we will identify the class of functions $\varphi$ for which the quantity defined in (6) is actually a divergence, thus providing a characterization of the FDPD class.
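The following minimal sketch (ours) makes the role of $\varphi$ in (6) concrete: with $\varphi$ equal to the identity it reproduces the DPD in (3), and with $\varphi=\log$ it reproduces the LDPD in (5). The densities and the grid are arbitrary illustrative choices.

```python
# A minimal sketch (ours) of the FDPD functional in (6) with a pluggable phi.
# phi = identity recovers the DPD in (3); phi = log recovers the LDPD in (5).
import numpy as np

def integrate(h, x):
    # simple Riemann-sum approximation of the integral of h over the grid x
    return np.sum(h) * (x[1] - x[0])

def fdpd(g, f, alpha, phi, x):
    """FDPD_{phi,alpha}(g, f) as in equation (6), for alpha > 0."""
    a = phi(integrate(f ** (alpha + 1), x))
    b = phi(integrate(f ** alpha * g, x))
    c = phi(integrate(g ** (alpha + 1), x))
    return a - (1 + 1 / alpha) * b + (1 / alpha) * c

x = np.linspace(-10, 10, 20001)
normal = lambda t, m, s: np.exp(-0.5 * ((t - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
g, f = normal(x, 0.0, 1.0), normal(x, 0.5, 1.2)

print(fdpd(g, f, 0.5, lambda t: t, x))    # DPD_{0.5}(g, f)
print(fdpd(g, f, 0.5, np.log, x))         # LDPD_{0.5}(g, f)
```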

Within the FDPD class, the case $\alpha=0$ has to again be understood in a limiting sense, and this limiting divergence exists under some constraints on the function $\varphi$. For example, if we assume that $\varphi$ is continuously differentiable in an interval around $1$, then the divergence for $\alpha=0$ can be defined as

$$\text{FDPD}_{\varphi,\alpha=0}(g,f):=\varphi^{\prime}(1)\int_{\mathbb{R}}g\log\dfrac{g}{f},\;\;\text{for all}\;\;g,f\in\mathcal{L}_{\mu}, \qquad (7)$$

where $\varphi^{\prime}$ is the derivative of $\varphi$. Obviously we require $\varphi^{\prime}(1)$ to be positive for the above to be a divergence. Note that the divergence in (7) is simply the likelihood disparity with a different scaling constant, and therefore the divergences $\text{FDPD}_{\varphi,\alpha=0}$ are effectively equivalent to the likelihood disparity for inferential purposes. For the DPD and the LDPD, in fact, the scaling constant $\varphi^{\prime}(1)$ equals unity. The characterization of the FDPD is therefore not an interesting problem for $\alpha=0$; hence we will not concern ourselves with the $\alpha=0$ case in the following.
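To indicate heuristically where the scaling constant $\varphi^{\prime}(1)$ in (7) comes from (a sketch under the stated differentiability assumption on $\varphi$, added here for the reader's convenience), note that as $\alpha\downarrow 0$ each of the three integrals in (6) tends to $1$; since the coefficients $1$, $-(1+1/\alpha)$ and $1/\alpha$ sum to zero, a first order expansion of $\varphi$ around $1$ gives

$$\text{FDPD}_{\varphi,\alpha}(g,f)\approx\varphi^{\prime}(1)\left[\int_{\mathbb{R}}f^{\alpha+1}-\Big(1+\dfrac{1}{\alpha}\Big)\int_{\mathbb{R}}f^{\alpha}g+\dfrac{1}{\alpha}\int_{\mathbb{R}}g^{\alpha+1}\right]=\varphi^{\prime}(1)\,\text{DPD}_{\alpha}(g,f)\longrightarrow\varphi^{\prime}(1)\int_{\mathbb{R}}g\log\dfrac{g}{f}\;\;\text{as }\alpha\downarrow 0.$$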

Remark 1.

Suppose $\varphi$ is a strictly increasing and convex function on the non-negative real line. Then it is straightforward to check that the expression defined in (6) does indeed satisfy the divergence conditions in Section 2 and therefore defines a legitimate divergence which belongs to the FDPD class. Note that $\varphi^{\prime}(1)$ is necessarily positive in this case. The identity function, which generates the DPD family, belongs to this class of $\varphi$ functions.

Remark 2.

That the class of functions described in the previous remark does not completely characterize the FDPDs can be seen by choosing $\varphi(x)=\log x$ for all $x\geq 0$, with the convention that $\log 0:=-\infty$. In this case $\varphi$ is a concave function, but the corresponding FDPD also satisfies the divergence conditions and gives rise to the logarithmic density power divergence (LDPD) family already introduced in Section 2. In this case $\varphi^{\prime}(1)=1$.

We expect that the members of the FDPD family will possess useful robustness properties and other information-theoretic utilities, which makes it interesting to examine these divergences further. It is therefore natural to ask whether we can characterize all the functions $\varphi$ which give rise to a divergence in (6), and thus obtain a complete description of the FDPD family. As already indicated, the main objective of this article is to discuss this characterization.

4 Characterization of the FDPD family

In this section we will assume that the dominating measure $\mu$ is actually the Lebesgue measure on the real line, and therefore the FDPD is a family of divergences on the space of probability density functions.

Our first result states a general sufficient condition on the function $\varphi$ which guarantees that $\text{FDPD}_{\varphi,\alpha}$ is a valid divergence for all $\alpha>0$.

Proposition 1.

Suppose $\varphi:[0,\infty)\longrightarrow[-\infty,\infty]$ is a function such that the function $\psi:[-\infty,\infty)\longrightarrow[-\infty,\infty]$ defined as $\psi(x):=\varphi(e^{x})$, for all $x\in[-\infty,\infty)$, is convex and strictly increasing on its domain. Moreover, assume that $\psi(\mathbb{R})\subseteq\mathbb{R}$. Then $\text{FDPD}_{\varphi,\alpha}$ is a valid divergence for each fixed $\alpha>0$, according to the definition in (6).

Proof.

We start by observing that for any $f,g\in\mathcal{L}_{\mu,\alpha}$, the quantities $\int_{\mathbb{R}}f^{1+\alpha}$ and $\int_{\mathbb{R}}g^{1+\alpha}$ are finite and non-zero. Therefore the expression in (6) is well-defined, since $\varphi(x)=\psi(\log x)\in\mathbb{R}$ for all $x\in(0,\infty)$. Now, in order to show that $\text{FDPD}_{\varphi,\alpha}$ is a valid divergence, we need to establish that $\text{FDPD}_{\varphi,\alpha}(g,f)$ is non-negative for all choices of $g,f\in\mathcal{L}_{\mu,\alpha}$, and that it is exactly zero if and only if $g=f$ a.e. $[\mu]$. For $\alpha>0$, using the convexity of the function $\psi$ we can conclude that

$$\alpha\varphi\Big(\int_{\mathbb{R}}f^{\alpha+1}\Big)+\varphi\Big(\int_{\mathbb{R}}g^{\alpha+1}\Big)=\alpha\psi\Big(\log\int_{\mathbb{R}}f^{\alpha+1}\Big)+\psi\Big(\log\int_{\mathbb{R}}g^{\alpha+1}\Big)$$
$$\geq(1+\alpha)\psi\Big(\dfrac{\alpha}{1+\alpha}\log\int_{\mathbb{R}}f^{\alpha+1}+\dfrac{1}{1+\alpha}\log\int_{\mathbb{R}}g^{\alpha+1}\Big). \qquad (8)$$

On the other hand, for $\alpha>0$, using Hölder’s inequality on the functions $f^{\alpha}$ and $g$ with dual indices $\Big(\dfrac{1+\alpha}{\alpha},\,1+\alpha\Big)$ we obtain

$$\left(\int_{\mathbb{R}}f^{\alpha+1}\right)^{\frac{\alpha}{1+\alpha}}\left(\int_{\mathbb{R}}g^{\alpha+1}\right)^{\frac{1}{1+\alpha}}\geq\int_{\mathbb{R}}f^{\alpha}g, \qquad (9)$$

which is equivalent to

$$\dfrac{\alpha}{1+\alpha}\log\int_{\mathbb{R}}f^{\alpha+1}+\dfrac{1}{1+\alpha}\log\int_{\mathbb{R}}g^{\alpha+1}\geq\log\int_{\mathbb{R}}f^{\alpha}g. \qquad (10)$$

Expressions (8) and (10), along with the strict monotonicity of $\psi$, imply that

$$\alpha\varphi\Big(\int_{\mathbb{R}}f^{\alpha+1}\Big)+\varphi\Big(\int_{\mathbb{R}}g^{\alpha+1}\Big)\geq(1+\alpha)\psi\Big(\dfrac{\alpha}{1+\alpha}\log\int_{\mathbb{R}}f^{\alpha+1}+\dfrac{1}{1+\alpha}\log\int_{\mathbb{R}}g^{\alpha+1}\Big)$$
$$\geq(1+\alpha)\psi\Big(\log\int_{\mathbb{R}}f^{\alpha}g\Big) \qquad (11)$$
$$=(1+\alpha)\varphi\Big(\int_{\mathbb{R}}f^{\alpha}g\Big), \qquad (12)$$

which is equivalent to the statement that $\text{FDPD}_{\varphi,\alpha}(g,f)\geq 0$. For the equality $\text{FDPD}_{\varphi,\alpha}(g,f)=0$ to hold, we must have equality in (8) and (11). By strict monotonicity of $\psi$, the equality in (11) implies equality in (9); by the equality condition of Hölder’s inequality, this forces $f^{1+\alpha}$ and $g^{1+\alpha}$ to be proportional a.e. $[\mu]$, and since $f$ and $g$ both integrate to one, this is equivalent to $f=g$ a.e. $[\mu]$. On the other hand, if $f=g$ a.e. $[\mu]$, then clearly $\text{FDPD}_{\varphi,\alpha}(g,f)=0$ by (6). This completes our proof. ∎
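As an informal numerical sanity check of Proposition 1 (ours, not part of the formal argument), the sketch below evaluates (6) over randomly generated pairs of discretized densities for a few choices of $\varphi$ whose associated $\psi(x)=\varphi(e^{x})$ is convex and strictly increasing; all computed values are non-negative, as the proposition asserts. The random mixture construction and the parameter grid are arbitrary illustrative choices.

```python
# Informal numerical check of Proposition 1 (ours): for phi with psi(x) = phi(e^x)
# convex and strictly increasing, FDPD_{phi,alpha}(g, f) in (6) stays non-negative.
import numpy as np

def integrate(h, x):
    return np.sum(h) * (x[1] - x[0])

def fdpd(g, f, alpha, phi, x):
    a = phi(integrate(f ** (alpha + 1), x))
    b = phi(integrate(f ** alpha * g, x))
    c = phi(integrate(g ** (alpha + 1), x))
    return a - (1 + 1 / alpha) * b + (1 / alpha) * c

def random_density(x, rng):
    # random two-component normal mixture, renormalized on the grid
    m1, m2 = rng.uniform(-3, 3, size=2)
    s1, s2 = rng.uniform(0.5, 2.0, size=2)
    w = rng.uniform(0.2, 0.8)
    d = (w * np.exp(-0.5 * ((x - m1) / s1) ** 2)
         + (1 - w) * np.exp(-0.5 * ((x - m2) / s2) ** 2))
    return d / integrate(d, x)

x = np.linspace(-12, 12, 8001)
rng = np.random.default_rng(0)
# phi = identity (DPD), phi = log (LDPD), phi(t) = t^2: all have psi convex, increasing
phis = [lambda t: t, np.log, lambda t: t ** 2]

worst = min(
    fdpd(random_density(x, rng), random_density(x, rng), alpha, phi, x)
    for _ in range(200) for alpha in (0.25, 0.5, 1.0) for phi in phis
)
print(worst >= 0, worst)   # expected: True (up to discretization error)
```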

Now we shall show that the condition on $\varphi$ stated in Proposition 1 is indeed a necessary condition for generating a divergence family, for any fixed $\alpha>0$.

Proposition 2.

Fix $\alpha\in(0,\infty)$. Suppose $\varphi:[0,\infty)\longrightarrow[-\infty,\infty]$ is a function such that $\text{FDPD}_{\varphi,\alpha}$ is a valid divergence. Then the function $\psi:[-\infty,\infty)\longrightarrow[-\infty,\infty]$ defined as $\psi(x):=\varphi(e^{x})$, for all $x\in[-\infty,\infty)$, is convex and strictly increasing on its domain, with $\psi(\mathbb{R})\subseteq\mathbb{R}$.

Proof.

We shall use the idea of computing the divergence between two appropriate probability density functions and extracting the property of the function $\varphi$ from it. Fix any real $\gamma>-1/(1+\alpha)$ and consider the family of probability densities given by

$$f_{\theta}(x):=(\gamma+1)\,\theta^{-\gamma-1}x^{\gamma}\,\mathbbm{1}_{(0,\theta)}(x),\;\;\forall\;x\in\mathbb{R},\;\theta>0, \qquad (13)$$

where $\mathbbm{1}_{A}$ denotes the indicator function of the set $A$. These are valid probability densities since $\gamma>-1$. Easy computations show that

$$\int_{\mathbb{R}}f_{\theta}^{1+\alpha}=\dfrac{(\gamma+1)^{1+\alpha}}{1+\gamma(1+\alpha)}\,\theta^{-\alpha},\;\;\forall\;\theta>0, \qquad (14)$$

and for any $\theta,\tau>0$,

$$\int_{\mathbb{R}}f_{\theta}^{\alpha}f_{\tau}=\begin{cases}\dfrac{(\gamma+1)^{1+\alpha}}{1+\gamma(1+\alpha)}\,\theta^{-\gamma\alpha-\alpha}\tau^{\gamma\alpha},&\text{if }\theta>\tau,\\[2ex]\dfrac{(\gamma+1)^{1+\alpha}}{1+\gamma(1+\alpha)}\,\theta^{1+\gamma-\alpha}\tau^{-\gamma-1},&\text{if }\theta\leq\tau.\end{cases} \qquad (15)$$

Therefore, the property that $\text{FDPD}_{\varphi,\alpha}(g,f)\geq 0$, with equality if and only if $g=f$ a.e. $[\mu]$, yields that

$$\varphi(C\theta^{-\alpha})-\Big(1+\dfrac{1}{\alpha}\Big)\varphi(C\theta^{-\gamma\alpha-\alpha}\tau^{\gamma\alpha})+\dfrac{1}{\alpha}\varphi(C\tau^{-\alpha})>0,\;\;\text{if}\;\theta>\tau>0, \qquad (16)$$

and

$$\varphi(C\theta^{-\alpha})-\Big(1+\dfrac{1}{\alpha}\Big)\varphi(C\theta^{1+\gamma-\alpha}\tau^{-\gamma-1})+\dfrac{1}{\alpha}\varphi(C\tau^{-\alpha})>0,\;\;\text{if}\;\tau>\theta>0, \qquad (17)$$

where $C:=(\gamma+1)^{1+\alpha}/(1+\gamma(1+\alpha))$. The assertion that the expressions on the left hand sides of (16) and (17) are well-defined is also part of the implication. Now fix any $x,y\in\mathbb{R}$. If $x>y$, plug in $\theta=C^{1/\alpha}\exp(-x/\alpha)$ and $\tau=C^{1/\alpha}\exp(-y/\alpha)$ in Equation (17). Notice that $x>y$ guarantees that $\theta<\tau$. Therefore we get

$$\varphi(e^{x})+\dfrac{1}{\alpha}\varphi(e^{y})>\Big(1+\dfrac{1}{\alpha}\Big)\varphi\Bigg(\exp\Bigg(\Big(1-\dfrac{\gamma+1}{\alpha}\Big)x+\dfrac{\gamma+1}{\alpha}y\Bigg)\Bigg), \qquad (18)$$

for all $x>y$, $x,y\in\mathbb{R}$, which on simplification yields

$$\dfrac{\alpha}{1+\alpha}\psi(x)+\dfrac{1}{1+\alpha}\psi(y)>\psi\Bigg(\Big(1-\dfrac{\gamma+1}{\alpha}\Big)x+\dfrac{\gamma+1}{\alpha}y\Bigg),\;\;\forall\;x>y. \qquad (19)$$

Similar manipulation with (16) leads us to the following observation.

$$\dfrac{\alpha}{1+\alpha}\psi(x)+\dfrac{1}{1+\alpha}\psi(y)>\psi\Big((1+\gamma)x-\gamma y\Big),\;\;\forall\;x<y. \qquad (20)$$

We shall now proceed with some appropriate choices for $\gamma$. If we take $\gamma=0$ in (20), we obtain that $\psi$ is strictly increasing on $\mathbb{R}$. To prove that $\psi$ is indeed strictly increasing on $[-\infty,\infty)$, take $f=\theta^{-1}\mathbbm{1}_{(0,\theta)}$ and $g=\theta^{-1}\mathbbm{1}_{(\theta,2\theta)}$ for some $\theta>0$. In this case,

$$\int_{\mathbb{R}}f^{\alpha}g=0,\;\;\int_{\mathbb{R}}f^{\alpha+1}=\int_{\mathbb{R}}g^{\alpha+1}=\theta^{-\alpha},$$

and hence

$$0<\text{FDPD}_{\varphi,\alpha}(f,g)=\left(1+\dfrac{1}{\alpha}\right)\left(\varphi(\theta^{-\alpha})-\varphi(0)\right).$$

Since this holds for all $\theta>0$, we have the required strict inequality $\varphi(\theta^{-\alpha})>\varphi(0)$ for every $\theta>0$; combined with the strict monotonicity of $\psi$ on $\mathbb{R}$ established above, this proves that $\psi$ is strictly increasing on $[-\infty,\infty)$. Observe that strict monotonicity of $\psi$ on $\mathbb{R}$ implies $\psi(\mathbb{R})\subseteq\mathbb{R}$. All that remains to show now is the convexity of the function $\psi$.

Fix any $x>y$, $x,y\in\mathbb{R}$, and take $\gamma=\gamma_{n}:=-(1+\alpha)^{-1}+1/n$ in (19). Since

$$\Big(1-\dfrac{\gamma_{n}+1}{\alpha}\Big)x+\dfrac{\gamma_{n}+1}{\alpha}y=\dfrac{\alpha}{1+\alpha}x+\dfrac{1}{1+\alpha}y-\dfrac{x-y}{\alpha n}\uparrow\dfrac{\alpha}{1+\alpha}x+\dfrac{1}{1+\alpha}y,\;\text{ as }n\to\infty,$$

we can conclude that

$$\dfrac{\alpha}{1+\alpha}\psi(x)+\dfrac{1}{1+\alpha}\psi(y)\geq\psi\Bigg(\Big(\dfrac{\alpha}{1+\alpha}x+\dfrac{1}{1+\alpha}y\Big)-\Bigg),\;\;\forall\;x>y, \qquad (21)$$

where $\psi(u-):=\lim_{v\uparrow u}\psi(v)$ for all $u\in\mathbb{R}$, which exists since $\psi$ is monotone. Similar manipulation with (20) yields the inequality in (21) for $x<y$. Monotonicity of $\psi$ also guarantees that $\psi(\cdot-)$ is finite on $\mathbb{R}$. Fix $x,y\in\mathbb{R}$ and take sequences $x_{n}\uparrow x$ and $y_{n}\uparrow y$. Define $z_{n}:=\alpha(1+\alpha)^{-1}x_{n}+(1+\alpha)^{-1}y_{n}-1/n$ for all $n\geq 1$. Clearly, $z_{n}\uparrow z:=\alpha(1+\alpha)^{-1}x+(1+\alpha)^{-1}y$ and

$$\dfrac{\alpha}{1+\alpha}\psi(x_{n})+\dfrac{1}{1+\alpha}\psi(y_{n})\geq\psi\Bigg(\Big(\dfrac{\alpha}{1+\alpha}x_{n}+\dfrac{1}{1+\alpha}y_{n}\Big)-\Bigg)\geq\psi(z_{n}),\;\;\forall\;n\geq 1. \qquad (22)$$

Taking $n\to\infty$ in (22), we can conclude that

$$\dfrac{\alpha}{1+\alpha}\psi(x-)+\dfrac{1}{1+\alpha}\psi(y-)\geq\psi(z-),$$

implying that $\psi(\cdot-)$ is indeed $\alpha/(1+\alpha)$-convex; see Definition 1 in the Appendix. The function $\psi(\cdot-)$, being finite and non-decreasing on $\mathbb{R}$, is bounded on any finite interval. Applying Lemma 1 and Proposition 3, we can conclude that $\psi(\cdot-)$ is convex and continuous on $\mathbb{R}$. Lemma 3 then yields that $\psi(\cdot-)=\psi$; hence $\psi$ is convex on $\mathbb{R}$. Monotonicity of $\psi$ on $[-\infty,\infty)$ guarantees that it is indeed convex on $[-\infty,\infty)$. This completes the proof. ∎
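To make the necessity direction concrete, the following sketch (ours, not part of the proof) uses the closed forms (14) and (15) for the densities $f_{\theta}$ in (13). For $\varphi(t)=-1/t$, which is strictly increasing on $(0,\infty)$ but whose associated $\psi(x)=-e^{-x}$ is concave, the expression in (6) takes a negative value for a suitable pair of these densities, so this $\varphi$ does not generate a divergence. The specific values of $\alpha$, $\gamma$, $\theta$ and $\tau$ below are illustrative choices.

```python
# Illustrative check (ours): phi(t) = -1/t is strictly increasing on (0, infty),
# but psi(x) = -exp(-x) is concave, so by Proposition 2 the expression (6) cannot
# be a divergence.  We exhibit a negative value using the densities f_theta of (13),
# via the closed-form integrals (14) and (15).
alpha, gamma = 1.0, -0.49          # gamma > -1/(1+alpha) = -0.5
theta, tau = 2.0, 1.0              # theta > tau, so the first case of (15) applies

C = (gamma + 1) ** (1 + alpha) / (1 + gamma * (1 + alpha))
int_f_1a = C * theta ** (-alpha)                                           # (14) for f_theta
int_g_1a = C * tau ** (-alpha)                                             # (14) for f_tau
int_fa_g = C * theta ** (-gamma * alpha - alpha) * tau ** (gamma * alpha)  # (15), theta > tau

phi = lambda t: -1.0 / t
value = phi(int_f_1a) - (1 + 1 / alpha) * phi(int_fa_g) + (1 / alpha) * phi(int_g_1a)
print(value)   # negative (about -0.01), so (6) fails the divergence requirement
```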

Remark 3.

The proof of Proposition 2 does not assume continuity of $\varphi$ (or, equivalently, of $\psi$) a priori. Instead, we have proved that $\psi$ is convex on $\mathbb{R}$, implying that it also has to be continuous on $\mathbb{R}$.

Remark 4.

Note that if $\varphi$ is convex and strictly increasing, then $\psi$ defined as in Proposition 1 and Proposition 2 is strictly convex and strictly increasing. On the other hand, if $\varphi=\log$, then $\psi$ is the identity function and is therefore convex and strictly increasing.

Remark 5.

There exist other possible routes to proving the necessity part of Proposition 2 under some smoothness conditions on the function $\varphi$. One such direction may be provided by the method of [10]. Since our proof requires no smoothness assumption whatsoever, the characterization given here is complete. In fact, it appears that the approach in [10] might be refined by the approach of the present paper, rather than the other way around (this was also suggested by one of the reviewers). We hope to explore this in our future work.

However, from a practical point of view, and for large sample consistency or influence function calculations, we would probably need some differentiability conditions on $\varphi$.

Remark 6.

One purpose of characterizing this class of divergences is to identify new estimators, obtained by minimizing a divergence between an empirical estimate (see Remark 7) of the true density $g$ and the model density $f_{\theta}$ over the parameter $\theta$ in a suitable parameter space $\Theta$. A natural follow-up of the present work will be to look at the properties of the minimum FDPD estimators from an overall standpoint and explore whether a general proof of asymptotic normality is possible under the presently existing conditions on the function $\varphi$, or under minimal additional conditions (apart from standard model conditions).

Remark 7.

It may be noted that all minimum FDPD estimators are non-kernel divergence estimators in the sense of [10], although not all minimum FDPD estimators are M-estimators. While the present paper is focused entirely on the characterization issue, eventually one would also like to know how useful the inference procedures resulting from the minimization of divergences within the FDPD class are (as already observed in the previous remark). In that respect, the non-kernel divergence property will lend a practical edge to the estimators and other inference procedures based on this family in comparison with divergences which require an active use of a non-parametric smoothing technique in their construction.
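As a brief indication of the practical edge mentioned above, the sketch below (ours, and only one natural way of setting up the estimation, not prescribed by the paper) fits a normal model by minimizing the parameter-dependent part of (6), with the cross term $\int f_{\theta}^{\alpha}g$ replaced by the sample average of $f_{\theta}^{\alpha}(X_{i})$; no nonparametric smoothing of the data is involved. For $\varphi$ equal to the identity this is the familiar minimum DPD objective of [1]; the general-$\varphi$ version shown here is our own extrapolation of that construction. The contaminated sample is included so that the behaviour of the resulting estimators can be compared informally with the sample mean and standard deviation.

```python
# A hedged sketch (ours) of minimum FDPD estimation for a normal model, using the
# non-kernel form: the cross term int f_theta^alpha g in (6) is replaced by the
# sample average of f_theta^alpha(X_i), so no density smoothing is needed.  The
# model, the contamination scheme and the grid search are illustrative choices.
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def fdpd_objective(data, mu, sigma, alpha, phi):
    # int f_theta^{1+alpha} has a closed form for the normal model
    int_f_1a = (2 * np.pi * sigma ** 2) ** (-alpha / 2) / np.sqrt(1 + alpha)
    cross = np.mean(normal_pdf(data, mu, sigma) ** alpha)   # estimates int f^alpha g
    return phi(int_f_1a) - (1 + 1 / alpha) * phi(cross)     # last term of (6) is theta-free

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0, 1, 90), np.full(10, 10.0)])  # 10% gross outliers

alpha = 0.5
grid_mu, grid_sigma = np.linspace(-2, 2, 201), np.linspace(0.5, 3, 126)
for name, phi in [("DPD (phi = identity)", lambda t: t), ("LDPD (phi = log)", np.log)]:
    vals = [(fdpd_objective(data, m, s, alpha, phi), m, s)
            for m in grid_mu for s in grid_sigma]
    _, mu_hat, sigma_hat = min(vals)
    print(name, round(mu_hat, 2), round(sigma_hat, 2))
print("sample mean / sd:", round(data.mean(), 2), round(data.std(ddof=1), 2))
```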

Proposition 1 and Proposition 2 together provide the complete characterization of the FDPD family and of the class of $\varphi$ functions generating it. We trust that this characterization describes the class within which one can search for suitable minimum divergence procedures exhibiting a good balance between model efficiency and robustness.

5 Acknowledgements

We are grateful to two anonymous referees and the Associate Editor, whose suggestions have led to an improved version of the paper. In particular, they have allowed us to prove Proposition 2 without smoothness conditions (or even the assumption of continuity) on $\varphi$. Also, a comment about possible $\alpha$-specific $\varphi$ functions allowed us to make our result more general.

6 Appendix

The proof of Proposition 2 depends on some additional results involving $\lambda$-convex functions and general convex functions. These results, which are used as tools in our main pursuit, are presented here in the Appendix, separately, so as not to lose focus from our main characterization problem.

Definition 1.

Let $\lambda\in(0,1)$ and $-\infty\leq a<b\leq\infty$. A function $f:(a,b)\to\mathbb{R}$ is said to be $\lambda$-convex if

$$f(\lambda x+(1-\lambda)y)\leq\lambda f(x)+(1-\lambda)f(y),\;\;\forall\;x,y\in(a,b).$$

Obviously any convex function is also $\lambda$-convex, though the converse is not generally true. Traditionally, $1/2$-convex functions are called midpoint convex. Under some further assumptions on $f$, like Lebesgue measurability or boundedness on a set with positive Lebesgue measure, one can prove that midpoint convex functions are indeed convex; see [5, Section I.3] for an extensive account of results of this kind. Here we shall prove a similar result, Proposition 3, for $\lambda$-convex functions for any $\lambda\in(0,1)$. Lemma 1 and Lemma 2 are instrumental in proving Proposition 3.

Lemma 1.

Suppose that $f:(a,b)\to\mathbb{R}$ is $\lambda$-convex. Then $f$ is continuous at $x\in(a,b)$ if and only if $f$ is bounded on an interval around $x$.

Proof.

The proof of the "only if" part is trivial from the definition of continuity. The proof of the "if" part is inspired by the proof of the theorem in [5, p. 12]. Suppose that $f$ is bounded on an interval around $x$. This condition can equivalently be written as

$$-\infty<\liminf_{y\to x}f(y)\leq f(x)\leq\limsup_{y\to x}f(y)<\infty.$$

Applying $\lambda$-convexity of the function $f$, we can write the following:

$$\limsup_{y\to x}f(y)=\limsup_{y\to x}f\left(\lambda\,\dfrac{y-(1-\lambda)x}{\lambda}+(1-\lambda)x\right)\leq\limsup_{y\to x}\left[\lambda f\left(\dfrac{y-(1-\lambda)x}{\lambda}\right)+(1-\lambda)f(x)\right]$$
$$=\lambda\limsup_{y\to x}f\left(\dfrac{y-(1-\lambda)x}{\lambda}\right)+(1-\lambda)f(x)\leq\lambda\limsup_{y\to x}f(y)+(1-\lambda)f(x),$$

where the last inequality follows from the observation that $(y-(1-\lambda)x)/\lambda$ converges to $x$ as $y$ converges to $x$. Since $\limsup_{y\to x}f(y)$ is finite, we can conclude that it is at most $f(x)$, and hence equal to $f(x)$. Applying $\lambda$-convexity again,

$$f(x)=\liminf_{y\to x}f\left(\lambda y+(1-\lambda)\dfrac{x-\lambda y}{1-\lambda}\right)\leq\liminf_{y\to x}\left[\lambda f(y)+(1-\lambda)f\left(\dfrac{x-\lambda y}{1-\lambda}\right)\right]$$
$$\leq\lambda\liminf_{y\to x}f(y)+(1-\lambda)\limsup_{y\to x}f\left(\dfrac{x-\lambda y}{1-\lambda}\right)\leq\lambda\liminf_{y\to x}f(y)+(1-\lambda)\limsup_{y\to x}f(y)=\lambda\liminf_{y\to x}f(y)+(1-\lambda)f(x),$$

implying that $\liminf_{y\to x}f(y)$ is at least $f(x)$, and hence equal to $f(x)$. In other words, both $\limsup_{y\to x}f(y)$ and $\liminf_{y\to x}f(y)$ are equal to $f(x)$, and therefore $f$ is continuous at $x$. ∎

Lemma 2.

Suppose that $f:(a,b)\to\mathbb{R}$ is $\lambda$-convex and continuous. Then $f$ is convex.

Proof.

We shall prove the statement by contradiction. Suppose that $f$ is not convex. Then we can find $x,y\in(a,b)$ with $x<y$ and $\beta\in(0,1)$ such that

$$f(\beta x+(1-\beta)y)>\beta f(x)+(1-\beta)f(y).$$

Define $h:[0,1]\to\mathbb{R}$ as follows:

$$h(t):=f(tx+(1-t)y)-tf(x)-(1-t)f(y),\;\;\forall\;t\in[0,1].$$

Since $f$ is continuous, so is $h$, and $M:=\sup_{t\in[0,1]}h(t)\geq h(\beta)>0$. Let $\beta_{0}$ be the infimum of the set $\left\{t\in[0,1]:h(t)=M\right\}$, which is non-empty due to the continuity of $h$. Continuity of $h$ also guarantees that $h(\beta_{0})=M$; hence $\beta_{0}\in(0,1)$, since $h(0)=h(1)=0$. Choose $\delta>0$ such that $(\beta_{0}-\delta,\beta_{0}+\delta)\subseteq(0,1)$ and define

$$\beta_{1}:=\beta_{0}-\delta(1-\lambda),\;\;\beta_{2}:=\beta_{0}+\delta\lambda,\;\;u:=\beta_{1}x+(1-\beta_{1})y,\;\;v:=\beta_{2}x+(1-\beta_{2})y.$$

Note that $0<\beta_{1}<\beta_{0}<\beta_{2}<1$, with $\lambda\beta_{1}+(1-\lambda)\beta_{2}=\beta_{0}$ and $\lambda u+(1-\lambda)v=\beta_{0}x+(1-\beta_{0})y$. We can, therefore, write the following series of inequalities:

$$M>\lambda h(\beta_{1})+(1-\lambda)h(\beta_{2})=\lambda\left[f(u)-\beta_{1}f(x)-(1-\beta_{1})f(y)\right]+(1-\lambda)\left[f(v)-\beta_{2}f(x)-(1-\beta_{2})f(y)\right]$$
$$=\lambda f(u)+(1-\lambda)f(v)-\beta_{0}f(x)-(1-\beta_{0})f(y)\geq f(\lambda u+(1-\lambda)v)-\beta_{0}f(x)-(1-\beta_{0})f(y)$$
$$=f(\beta_{0}x+(1-\beta_{0})y)-\beta_{0}f(x)-(1-\beta_{0})f(y)=h(\beta_{0})=M,$$

where the left-most inequality follows from the fact that $\beta_{1}<\beta_{0}$ and hence $h(\beta_{1})<M$. This gives us a contradiction. ∎

The following proposition now follows readily from Lemma 1 and Lemma 2.

Proposition 3.

Suppose that $f:(a,b)\to\mathbb{R}$ is $\lambda$-convex and, moreover, that for every $x\in(a,b)$, $f$ is bounded on an interval around $x$. Then $f$ is convex.

Lemma 3.

Let $f:(a,b)\to\mathbb{R}$ be non-decreasing. Suppose the left-hand limit function $f(\cdot-)$, defined as $f(x-)=\lim_{y\uparrow x}f(y)$ for all $x\in(a,b)$, is continuous. Then $f$ is also continuous; in particular, $f(\cdot-)=f$.

Proof.

It is enough to show that $f(\cdot-)=f$. Take any $x\in(a,b)$. Monotonicity of $f$ implies that $f(x-)\leq f(x)$. On the other hand, $f(y)\geq f(x)$ for all $a<x<y<b$, and hence $f(y-)\geq f(x)$. Since $f(\cdot-)$ is continuous, we can take $y\downarrow x$ to conclude that $f(x-)\geq f(x)$. This shows $f(x-)=f(x)$. ∎

References

  • [1] Ayanendranath Basu, Ian R Harris, Nils L Hjort, and MC Jones. Robust and efficient estimation by minimising a density power divergence. Biometrika, 85(3):549–559, 1998.
  • [2] Ayanendranath Basu, Hiroyuki Shioya, and Chanseok Park. Statistical inference: The Minimum Distance Approach. Chapman and Hall/CRC, 2019.
  • [3] Michel Broniatowski, Aida Toma, and Igor Vajda. Decomposable pseudodistances and applications in statistical estimation. Journal of Statistical Planning and Inference, 142(9):2574–2585, 2012.
  • [4] Andrzej Cichocki and Shun-ichi Amari. Families of alpha-, beta- and gamma-divergences: Flexible and robust measures of similarities. Entropy, 12(6):1532–1568, 2010.
  • [5] William F. Donoghue Jr. Distributions and Fourier Transforms. Academic Press, 1969.
  • [6] Hironori Fujisawa. Normalized estimating equation for robust parameter estimation. Electronic Journal of Statistics, 7:1587–1606, 2013.
  • [7] Hironori Fujisawa and Shinto Eguchi. Robust parameter estimation with a small bias against heavy contamination. Journal of Multivariate Analysis, 99(9):2053–2081, 2008.
  • [8] Abhik Ghosh and Ayanendranath Basu. A generalized relative $(\alpha,\beta)$-entropy: Geometric properties and applications to robust statistical inference. Entropy, 20(5), 2018.
  • [9] Abhik Ghosh and Ayanendranath Basu. A scale invariant generalization of Rényi entropy and related optimizations under Tsallis’ nonextensive framework. IEEE Transactions on Information Theory, 67(4):2141–2161, 2021.
  • [10] Soham Jana and Ayanendranath Basu. A characterization of all single-integral, non-kernel divergence estimators. IEEE Transactions on Information Theory, 65(12):7976–7984, 2019.
  • [11] MC Jones, Nils Lid Hjort, Ian R Harris, and Ayanendranath Basu. A comparison of related density-based minimum divergence estimators. Biometrika, 88(3):865–873, 2001.
  • [12] Arun Kumar Kuchibhotla, Somabha Mukherjee, and Ayanendranath Basu. Statistical inference based on bridge divergences. Annals of the Institute of Statistical Mathematics, 71(3):627–656, 2019.
  • [13] M Ashok Kumar and Rajesh Sundaresan. Minimization problems based on relative $\alpha$-entropy I: Forward projection. IEEE Transactions on Information Theory, 61(9):5063–5080, 2015.
  • [14] M Ashok Kumar and Rajesh Sundaresan. Minimization problems based on relative $\alpha$-entropy II: Reverse projection. IEEE Transactions on Information Theory, 61(9):5081–5095, 2015.
  • [15] Bruce G Lindsay. Efficiency versus robustness: the case for minimum Hellinger distance and related methods. The Annals of Statistics, 22(2):1081–1114, 1994.