
Tight Lower Bounds for $\alpha$-Divergences Under Moment Constraints and Relations Between Different $\alpha$

Tomohiro Nishiyama
Email: [email protected]
Abstract

The $\alpha$-divergences include the well-known Kullback-Leibler divergence, Hellinger distance, and $\chi^{2}$-divergence. In this paper, we derive differential and integral relations between the $\alpha$-divergences that generalize the relation between the Kullback-Leibler divergence and the $\chi^{2}$-divergence. We also show tight lower bounds for the $\alpha$-divergences under given means and variances. In particular, we show a necessary and sufficient condition under which the binary divergences, which are divergences between probability measures on the same $2$-point set, always attain the lower bounds. The Kullback-Leibler divergence, the Hellinger distance, and the $\chi^{2}$-divergence satisfy this condition.

Keywords: Alpha-divergence, Kullback-Leibler divergence, Hellinger distance, Chi-squared divergence, Relative entropy, Rényi divergence.

I Introduction

The Kullback-Leibler divergence [12] (also known as the relative entropy) and the Hellinger distance [10] are divergence measures that play a key role in information theory, statistics, machine learning, physics, signal processing, and other theoretical and applied branches of mathematics. Both belong to an important class of divergence measures, defined by means of convex functions $f$ and named $f$-divergences [5, 6, 7]. The most notable class of $f$-divergences is the $\alpha$-divergence [1, 4]. By choosing different $\alpha$, we obtain many well-known divergences as special cases, including the Kullback-Leibler divergence, the Hellinger distance, and the $\chi^{2}$-divergence [17].

In this paper, we study relations between the $\alpha$-divergences for different $\alpha$, and derive tight lower bounds for the $\alpha$-divergences under given means and variances. The relation between the Kullback-Leibler divergence and the $\chi^{2}$-divergence was shown in [14, 2, 16], and we generalize this relation to general $\alpha$ and $\alpha+1$. Regarding lower bounds under given means and variances, there are some works for the $\chi^{2}$-divergence and the Hellinger distance [3, 8, 11]. Recently, for the Kullback-Leibler divergence and the Hellinger distance, we showed that the tight lower bounds are attained by their binary divergences, that is, divergences between probability measures on the same $2$-point set [16, 15]. Our motivation is to find a necessary and sufficient condition on $\alpha$ such that the binary $\alpha$-divergences always attain the lower bounds. Furthermore, we show tight lower bounds under given means and variances for the Rényi divergences [18], which are closely related to the $\alpha$-divergences.

The paper is organized as follows. Section II presents notation and definitions, Section III states the main results, and Section IV gives their proofs. Section V concludes the paper. Lemmas needed for the proofs of the main results are proved in the appendices.

II Preliminaries

This section provides definitions of divergence measures which are used in this paper.

Definition 1.

[13, p. 4398] Let $P$ and $Q$ be probability measures, let $\mu$ be a dominating measure of $P$ and $Q$ (i.e., $P,Q\ll\mu$), and let $p:=\frac{\mathrm{d}P}{\mathrm{d}\mu}$ and $q:=\frac{\mathrm{d}Q}{\mathrm{d}\mu}$ be the densities of $P$ and $Q$ with respect to $\mu$. The $f$-divergence from $P$ to $Q$ is given by

$$D_{f}(P\|Q):=\int q\,f\Bigl(\frac{p}{q}\Bigr)\,\mathrm{d}\mu, \tag{1}$$

where

$$f(0):=\lim_{t\to 0^{+}}f(t),\quad 0f\Bigl(\frac{0}{0}\Bigr):=0,\quad 0f\Bigl(\frac{a}{0}\Bigr):=\lim_{t\to 0^{+}}tf\Bigl(\frac{a}{t}\Bigr)=a\lim_{u\to\infty}\frac{f(u)}{u},\quad a>0.$$

Note that the right side of (1) does not depend on the choice of the dominating measure $\mu$.

Definition 2.

[4] The basic asymmetric $\alpha$-divergence is the $f$-divergence with

$$f(t):=\begin{cases}\dfrac{t^{\alpha}-t}{\alpha(\alpha-1)},&\alpha\neq 0,1,\\ -\log t,&\alpha=0,\\ t\log t,&\alpha=1,\end{cases}$$

for $t>0$, that is,

$$D_{A}^{(\alpha)}(P\|Q):=\begin{cases}\dfrac{1}{\alpha(\alpha-1)}\Bigl(\displaystyle\int p^{\alpha}q^{1-\alpha}\,\mathrm{d}\mu-1\Bigr),&\alpha\neq 0,1,\\ \displaystyle\int q\log\frac{q}{p}\,\mathrm{d}\mu=:D(Q\|P),&\alpha=0,\\ \displaystyle\int p\log\frac{p}{q}\,\mathrm{d}\mu=:D(P\|Q),&\alpha=1,\end{cases} \tag{2}$$

where $D(P\|Q)$ denotes the Kullback-Leibler divergence (relative entropy).

In the special cases $\alpha=2,\;0.5,\;-1$, we obtain from (2) the well-known Pearson chi-square, Hellinger, and Neyman chi-square distances, given respectively by

$$D_{A}^{(2)}(P\|Q)=\frac{1}{2}\chi^{2}_{P}(P\|Q)=\frac{1}{2}\int\frac{(p-q)^{2}}{q}\,\mathrm{d}\mu,$$
$$D_{A}^{(1/2)}(P\|Q)=4H^{2}(P,Q)=2\int(\sqrt{p}-\sqrt{q})^{2}\,\mathrm{d}\mu,$$
$$D_{A}^{(-1)}(P\|Q)=\frac{1}{2}\chi^{2}_{N}(P\|Q)=\frac{1}{2}\int\frac{(p-q)^{2}}{p}\,\mathrm{d}\mu.$$

The $\alpha$-divergences satisfy the following duality:

$$D_{A}^{(\alpha)}(P\|Q)=D_{A}^{(1-\alpha)}(Q\|P). \tag{3}$$
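As an illustration of Definition 2 and the duality (3), the following minimal sketch evaluates the $\alpha$-divergence for finite discrete distributions, where the integral in (2) becomes a sum. The function name `alpha_divergence` and the example distributions `p` and `q` are assumptions made for this illustration only; the sketch further assumes that both probability mass functions are strictly positive, so the boundary conventions of Definition 1 are not needed.

```python
import math

def alpha_divergence(p, q, alpha):
    """Asymmetric alpha-divergence (2) between strictly positive pmfs p and q."""
    if alpha == 0:   # Kullback-Leibler divergence D(Q||P)
        return sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))
    if alpha == 1:   # Kullback-Leibler divergence D(P||Q)
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    total = sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q))
    return (total - 1.0) / (alpha * (alpha - 1.0))

# Hypothetical example distributions on a common 3-point set.
p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]

# Classical distances written out directly from their definitions.
pearson = sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))
neyman = sum((pi - qi) ** 2 / pi for pi, qi in zip(p, q))
hellinger_sq = 0.5 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))

# Each printed pair should agree up to rounding.
print(alpha_divergence(p, q, 2.0), 0.5 * pearson)        # alpha = 2
print(alpha_divergence(p, q, 0.5), 4.0 * hellinger_sq)   # alpha = 1/2
print(alpha_divergence(p, q, -1.0), 0.5 * neyman)        # alpha = -1
print(alpha_divergence(p, q, 0.3), alpha_divergence(q, p, 0.7))  # duality (3)
```

Each printed pair compares the general formula with the corresponding special case, so agreement up to floating-point rounding is the expected behavior.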

The Rényi divergences [18] are closely related to the $\alpha$-divergences.

Definition 3.

The Rényi divergence for the simple orders $\alpha\in(0,1)\cup(1,\infty)$ is defined as

$$D_{R}^{(\alpha)}(P\|Q):=\frac{1}{\alpha-1}\log\int p^{\alpha}q^{1-\alpha}\,\mathrm{d}\mu=\frac{1}{\alpha-1}\log\Bigl(1+\alpha(\alpha-1)D_{A}^{(\alpha)}(P\|Q)\Bigr).$$

For the extended orders, the Rényi divergence is defined as

$$D_{R}^{(\alpha)}(P\|Q):=\begin{cases}-\log Q(p>0),&\alpha=0,\\ D(P\|Q)=D_{A}^{(1)}(P\|Q),&\alpha=1,\\ \log\operatorname*{ess\,sup}\dfrac{p(Z)}{q(Z)},&\alpha=\infty,\end{cases}$$

where $Z\sim\mu$.
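The identity in Definition 3 between the Rényi divergence and the $\alpha$-divergence can be checked numerically for discrete distributions. The sketch below is an illustration under stated assumptions: the helper names (`power_sum`, `renyi_divergence`, `renyi_from_alpha`) and the example distributions are hypothetical, both probability mass functions are assumed strictly positive, and only finite simple orders are handled.

```python
import math

def power_sum(p, q, alpha):
    """Discrete analogue of the integral of p^alpha q^(1-alpha)."""
    return sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q))

def renyi_divergence(p, q, alpha):
    """Renyi divergence for finite simple orders alpha in (0,1) or (1,inf)."""
    if alpha <= 0 or alpha == 1:
        raise ValueError("only simple orders are handled in this sketch")
    return math.log(power_sum(p, q, alpha)) / (alpha - 1.0)

def alpha_divergence(p, q, alpha):
    """Alpha-divergence (2) for alpha not in {0, 1}."""
    return (power_sum(p, q, alpha) - 1.0) / (alpha * (alpha - 1.0))

def renyi_from_alpha(p, q, alpha):
    """Renyi divergence recovered from the alpha-divergence (Definition 3)."""
    d_a = alpha_divergence(p, q, alpha)
    return math.log(1.0 + alpha * (alpha - 1.0) * d_a) / (alpha - 1.0)

# Hypothetical example distributions.
p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]
kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

for alpha in (0.5, 2.0, 3.0):
    print(alpha, renyi_divergence(p, q, alpha), renyi_from_alpha(p, q, alpha))

# Orders close to 1 should approach the Kullback-Leibler divergence.
print(renyi_divergence(p, q, 1.0 + 1e-6), kl)
```

Because the map from $D_{A}^{(\alpha)}$ to $D_{R}^{(\alpha)}$ is increasing for the simple orders, lower bounds on one translate into lower bounds on the other.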

Definition 4.

Let $\mathcal{P}_{n}$ denote the set of pairs of probability measures $(P,Q)$ defined on an $n$-point set $\{u_{1},u_{2},\cdots,u_{n}\}$, where $\{u_{i}\}_{1\leq i\leq n}$ are arbitrary real numbers.

If $m<n$, $\mathcal{P}_{m}$ is a subset of $\mathcal{P}_{n}$.

Definition 5.

Let $P$ and $Q$ be probability measures defined on a measurable space $(\mathbb{R},\mathscr{B})$, where $\mathbb{R}$ is the real line and $\mathscr{B}$ is the Borel $\sigma$-algebra of subsets of $\mathbb{R}$. Let $\mathcal{P}[m_{P},\sigma_{P},m_{Q},\sigma_{Q}]$ be the set of pairs of probability measures $(P,Q)$ with given means and variances, i.e.,

$$\mathbb{E}[X]=:m_{P},\;\mathbb{E}[Y]=:m_{Q},\quad\mathrm{Var}(X)=:\sigma_{P}^{2},\;\mathrm{Var}(Y)=:\sigma_{Q}^{2}, \tag{4}$$

where $X\sim P$ and $Y\sim Q$.

Definition 6.

The binary $\alpha$-divergence is defined for $(R,S)\in\mathcal{P}_{2}$ by

$$d_{A}^{(\alpha)}(r\|s):=\begin{cases}\dfrac{1}{\alpha(\alpha-1)}\Bigl(r^{\alpha}s^{1-\alpha}+(1-r)^{\alpha}(1-s)^{1-\alpha}-1\Bigr),&\alpha\neq 0,1,\\ s\log\dfrac{s}{r}+(1-s)\log\dfrac{1-s}{1-r},&\alpha=0,\\ r\log\dfrac{r}{s}+(1-r)\log\dfrac{1-r}{1-s},&\alpha=1,\end{cases} \tag{5}$$

where $R(u_{1})=r$ and $S(u_{1})=s$.

Definition 7.

The binary Rényi divergence for orders $\alpha\in[0,\infty]$ is defined for $(R,S)\in\mathcal{P}_{2}$ by

$$d_{R}^{(\alpha)}(r\|s):=\begin{cases}\dfrac{1}{\alpha-1}\log\Bigl(r^{\alpha}s^{1-\alpha}+(1-r)^{\alpha}(1-s)^{1-\alpha}\Bigr),&\alpha\neq 0,1,\\ -\log\Bigl(s1\{r>0\}+(1-s)1\{1-r>0\}\Bigr),&\alpha=0,\\ r\log\dfrac{r}{s}+(1-r)\log\dfrac{1-r}{1-s},&\alpha=1,\\ \log\max\Bigl(\dfrac{r}{s},\dfrac{1-r}{1-s}\Bigr),&\alpha=\infty,\end{cases}$$

where $1\{\mbox{relation}\}$ denotes the indicator function.

III Main results

Theorem 1.

Let $Q_{t}:=(Q-P)t+P$ and $P_{t}:=(P-Q)t+Q=Q_{1-t}$ for $t\in[0,1]$. Then, for all $t\in(0,1]$,

$$D_{A}^{(\alpha+1)}(P\|Q_{t})=\frac{t^{2-\alpha}}{\alpha+1}\frac{\mathrm{d}}{\mathrm{d}t}\Bigl(t^{\alpha-1}D_{A}^{(\alpha)}(P\|Q_{t})\Bigr),\quad\alpha\neq-1, \tag{6}$$
$$D_{A}^{(\alpha)}(P_{t}\|Q)=\frac{t^{2+\alpha}}{-\alpha+1}\frac{\mathrm{d}}{\mathrm{d}t}\Bigl(t^{-\alpha-1}D_{A}^{(\alpha+1)}(P_{t}\|Q)\Bigr),\quad\alpha\neq 1. \tag{7}$$
Proof.

See Section IV. ∎

Corollary 1.

For all $t\in(0,1]$,

$$D_{A}^{(\alpha)}(P\|Q_{t})=(\alpha+1)t^{1-\alpha}\int_{0}^{t}s^{\alpha-2}D_{A}^{(\alpha+1)}(P\|Q_{s})\,\mathrm{d}s,\quad\alpha>-1, \tag{8}$$
$$D_{A}^{(\alpha+1)}(P_{t}\|Q)=(-\alpha+1)t^{1+\alpha}\int_{0}^{t}s^{-\alpha-2}D_{A}^{(\alpha)}(P_{s}\|Q)\,\mathrm{d}s,\quad\alpha<1. \tag{9}$$

The relation (8) for $\alpha=1$ yields the relation between the Kullback-Leibler divergence and the $\chi^{2}$-divergence:

$$D(P\|Q_{t})=\int_{0}^{t}s^{-1}\chi_{P}^{2}(P\|Q_{s})\,\mathrm{d}s. \tag{10}$$
Proof.

By the Taylor expansion of a twice differentiable function $f(\cdot)$ such that $f(1)=0$, and using $\frac{p}{q_{s}}-1=-\frac{s(q-p)}{q_{s}}$ with $q_{s}\to p$ as $s\to 0$, for sufficiently small $s$ we obtain

$$\int q_{s}f\Bigl(\frac{p}{q_{s}}\Bigr)\mathrm{d}\mu=\int q_{s}\Bigl(f^{\prime}(1)\Bigl(\frac{p}{q_{s}}-1\Bigr)+\frac{f^{\prime\prime}(1)}{2}\Bigl(\frac{p}{q_{s}}-1\Bigr)^{2}+O(s^{3})\Bigr)\mathrm{d}\mu=\frac{f^{\prime\prime}(1)s^{2}}{2}\chi^{2}_{N}(P\|Q)+O(s^{3}). \tag{11}$$

Since the $\alpha$-divergences are $f$-divergences, it follows that

$$t^{\alpha-1}D_{A}^{(\alpha)}(P\|Q_{t})=O(t^{\alpha+1}). \tag{12}$$

Replacing $t$ by $s$ in (6), multiplying both sides by $(\alpha+1)s^{\alpha-2}$, and integrating over $[0,t]$, we obtain (8). The condition $\alpha>-1$ ensures, by (12), that the boundary term at $s=0$ vanishes. The equality (9) follows in a similar way. ∎
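The integral relation (8) can be checked numerically for discrete distributions. The following sketch is an illustration only: it approximates the integral with a plain midpoint rule, and the function names (`alpha_divergence`, `mix`, `rhs_of_8`), the number of grid points `n`, and the example distributions are assumptions of this sketch rather than part of the paper. Both distributions are assumed strictly positive.

```python
import math

def alpha_divergence(p, q, alpha):
    """Asymmetric alpha-divergence (2) between strictly positive pmfs."""
    if alpha == 0:
        return sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))
    if alpha == 1:
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    total = sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q))
    return (total - 1.0) / (alpha * (alpha - 1.0))

def mix(p, q, t):
    """Q_t = (Q - P) t + P, evaluated pointwise."""
    return [pi + t * (qi - pi) for pi, qi in zip(p, q)]

def rhs_of_8(p, q, alpha, t, n=20000):
    """Midpoint-rule approximation of the right side of (8)."""
    h = t / n
    total = 0.0
    for k in range(n):
        s = (k + 0.5) * h  # midpoint of the k-th subinterval of [0, t]
        total += s ** (alpha - 2) * alpha_divergence(p, mix(p, q, s), alpha + 1)
    return (alpha + 1) * t ** (1 - alpha) * total * h

# Hypothetical example distributions.
p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]

for alpha, t in ((1.0, 1.0), (0.5, 0.7), (2.0, 1.0)):
    lhs = alpha_divergence(p, mix(p, q, t), alpha)
    print(alpha, t, lhs, rhs_of_8(p, q, alpha, t))
```

For $\alpha=1$ and $t=1$ the left side is $D(P\|Q)$, so the first printed line is a numerical instance of (10).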

Theorem 2.

Let $(P,Q)\in\mathcal{P}[m_{P},\sigma_{P},m_{Q},\sigma_{Q}]$.

  (a) If $m_{P}\neq m_{Q}$, the binary $\alpha$-divergence always attains a lower bound under given means and variances if and only if $\alpha\in[-1,2]$:

    $$D_{A}^{(\alpha)}(P\|Q)\geq d_{A}^{(\alpha)}(r\|s), \tag{13}$$

    where

    $$r:=\frac{1}{2}+\frac{\sigma_{Q}^{2}-\sigma_{P}^{2}+a^{2}}{4av}\in[0,1], \tag{14}$$
    $$s:=\frac{1}{2}+\frac{\sigma_{Q}^{2}-\sigma_{P}^{2}-a^{2}}{4av}\in[0,1], \tag{15}$$
    $$a:=m_{P}-m_{Q}, \tag{16}$$
    $$v:=\frac{1}{2|a|}\sqrt{(\sigma_{Q}^{2}-\sigma_{P}^{2})^{2}+2a^{2}(\sigma_{P}^{2}+\sigma_{Q}^{2})+a^{4}}. \tag{17}$$

  (b) The lower bound on the right side of (13) is attained for $(R,S)\in\mathcal{P}_{2}\cap\mathcal{P}[m_{P},\sigma_{P},m_{Q},\sigma_{Q}]$ defined on $\{u_{1},u_{2}\}$, with

    $$R(u_{1})=r,\quad S(u_{1})=s, \tag{18}$$

    where $r$ and $s$ are given in (14) and (15), respectively, and

    $$u_{1}:=m_{P}+\sqrt{\frac{(1-r)\sigma_{P}^{2}}{r}},\quad u_{2}:=m_{P}-\sqrt{\frac{r\sigma_{P}^{2}}{1-r}}. \tag{19}$$

  (c) If $m_{P}=m_{Q}$, then

    $$\inf_{(P,Q)\in\mathcal{P}[m_{P},\sigma_{P},m_{Q},\sigma_{Q}]}D_{A}^{(\alpha)}(P\|Q)=0. \tag{20}$$
Proof.

See Section IV. ∎

For the $\chi^{2}$-divergence and the Hellinger distance, the binary divergences simplify to

$$d_{A}^{(2)}(r\|s)=\frac{a^{2}}{2\sigma_{Q}^{2}},$$
$$d_{A}^{(1/2)}(r\|s)=4\Bigl(1-\sqrt{\frac{(\sigma_{P}+\sigma_{Q})^{2}}{a^{2}+(\sigma_{P}+\sigma_{Q})^{2}}}\Bigr).$$

See Subsection IV-B and [15, Lemma 2], respectively.
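To make Theorem 2 concrete, the sketch below builds the binary pair $(R,S)$ from (14)-(19) for the moments of a hypothetical 3-point example, checks that $(R,S)$ reproduces the prescribed means and variances, and compares $D_{A}^{(\alpha)}(P\|Q)$ with the binary bound for a few $\alpha\in[-1,2]$. All function names and the example data are assumptions of this illustration; it assumes $m_{P}\neq m_{Q}$, $\sigma_{P}>0$, $\sigma_{Q}>0$, and strictly positive probability mass functions.

```python
import math

def alpha_divergence(p, q, alpha):
    """Asymmetric alpha-divergence (2) between strictly positive pmfs."""
    if alpha == 0:
        return sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))
    if alpha == 1:
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    total = sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q))
    return (total - 1.0) / (alpha * (alpha - 1.0))

def moments(pmf, support):
    """Mean and variance of a discrete distribution."""
    mean = sum(w * u for w, u in zip(pmf, support))
    var = sum(w * (u - mean) ** 2 for w, u in zip(pmf, support))
    return mean, var

def binary_pair(m_p, var_p, m_q, var_q):
    """(r, s, u1, u2) from (14)-(19); assumes m_p != m_q and var_p > 0."""
    a = m_p - m_q
    delta = var_q - var_p
    v = math.sqrt(delta ** 2 + 2 * a ** 2 * (var_p + var_q) + a ** 4) / (2 * abs(a))
    r = 0.5 + (delta + a ** 2) / (4 * a * v)
    s = 0.5 + (delta - a ** 2) / (4 * a * v)
    u1 = m_p + math.sqrt((1 - r) * var_p / r)
    u2 = m_p - math.sqrt(r * var_p / (1 - r))
    return r, s, u1, u2

# Hypothetical 3-point example.
support = [-1.0, 0.0, 2.0]
p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]
m_p, var_p = moments(p, support)
m_q, var_q = moments(q, support)

r, s, u1, u2 = binary_pair(m_p, var_p, m_q, var_q)
R, S = [r, 1 - r], [s, 1 - s]
print(moments(R, [u1, u2]), (m_p, var_p))  # moments of R match those of P
print(moments(S, [u1, u2]), (m_q, var_q))  # moments of S match those of Q

for alpha in (-1.0, 0.0, 0.5, 1.0, 2.0):
    # First value: D_A(P||Q); second value: binary lower bound d_A(r||s).
    print(alpha, alpha_divergence(p, q, alpha), alpha_divergence(R, S, alpha))
```

For $\alpha=2$ and $\alpha=1/2$ the second printed value can also be compared against the two closed forms stated above.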

Corollary 2.

If $m_{P}\neq m_{Q}$ and $\alpha\in[0,2]$, the binary Rényi divergence always attains a lower bound under given means and variances.

Proof.

For $\alpha\in(0,2]$, the result follows from Definition 3 and Theorem 2. For $\alpha=0$ and $\sigma_{P}>0$, the binary Rényi divergence is equal to $0$, which gives the result. For $\alpha=\sigma_{P}=0$, we have $P(u)=1$ at $u=m_{P}$. Letting $q=Q(u)$, we obtain

$$\sum_{u_{i}\neq m_{P}}Q(u_{i})u_{i}=m_{Q}-qm_{P}, \tag{21}$$
$$\sum_{u_{i}\neq m_{P}}Q(u_{i})u_{i}^{2}=\sigma_{Q}^{2}+m_{Q}^{2}-qm_{P}^{2}. \tag{22}$$

By the Cauchy-Schwarz inequality, it follows that

$$\Bigl(\sum_{u_{i}\neq m_{P}}Q(u_{i})u_{i}\Bigr)^{2}=\Bigl(\sum_{u_{i}\neq m_{P}}\sqrt{Q(u_{i})}\bigl(\sqrt{Q(u_{i})}u_{i}\bigr)\Bigr)^{2}\leq\Bigl(\sum_{u_{i}\neq m_{P}}Q(u_{i})\Bigr)\Bigl(\sum_{u_{i}\neq m_{P}}Q(u_{i})u_{i}^{2}\Bigr)=(1-q)(\sigma_{Q}^{2}+m_{Q}^{2}-qm_{P}^{2}). \tag{23}$$

By combining this inequality with (21), we obtain

$$q\leq\frac{\sigma_{Q}^{2}}{\sigma_{Q}^{2}+a^{2}}. \tag{24}$$

From (14) and (15), we have $r=1$ and $s=\frac{\sigma_{Q}^{2}}{\sigma_{Q}^{2}+a^{2}}$ for $a>0$. By combining (24) with Definition 3, the result follows. The case $a<0$ can be handled in a similar way. ∎

IV Proofs of main results

IV-A Proof of Theorem 1

Proof.

Let $q_{t}:=(q-p)t+p$ for all $t\in(0,1]$. For $\alpha\neq 0,\;\pm 1$, we obtain

$$\begin{aligned}\frac{\mathrm{d}}{\mathrm{d}t}\Bigl(\Bigl(\int p^{\alpha}q_{t}^{1-\alpha}\,\mathrm{d}\mu-1\Bigr)t^{\alpha-1}\Bigr)&=(\alpha-1)t^{\alpha-2}\Bigl(\int\Bigl(-p^{\alpha}q_{t}^{-\alpha}(q-p)t+p^{\alpha}q_{t}^{1-\alpha}\Bigr)\mathrm{d}\mu-1\Bigr)\\ &=(\alpha-1)t^{\alpha-2}\Bigl(\int p^{\alpha}q_{t}^{-\alpha}\bigl(-(q-p)t+q_{t}\bigr)\,\mathrm{d}\mu-1\Bigr)\\ &=(\alpha-1)t^{\alpha-2}\Bigl(\int p^{1+\alpha}q_{t}^{-\alpha}\,\mathrm{d}\mu-1\Bigr)=(\alpha+1)\alpha(\alpha-1)t^{\alpha-2}D_{A}^{(\alpha+1)}(P\|Q_{t}).\end{aligned} \tag{25}$$

Dividing both sides of (25) by $(\alpha+1)\alpha(\alpha-1)t^{\alpha-2}$, we obtain (6). For $\alpha=1$, we obtain

$$\frac{t}{2}\frac{\mathrm{d}}{\mathrm{d}t}\int p\log\frac{p}{q_{t}}\,\mathrm{d}\mu=\frac{1}{2}\int\frac{p(p-q_{t})}{q_{t}}\,\mathrm{d}\mu=\frac{1}{2}\int\frac{(q_{t}-p)^{2}}{q_{t}}\,\mathrm{d}\mu=D_{A}^{(2)}(P\|Q_{t}). \tag{26}$$

For $\alpha=0$, we obtain

$$t^{2}\frac{\mathrm{d}}{\mathrm{d}t}\int t^{-1}q_{t}\log\frac{q_{t}}{p}\,\mathrm{d}\mu=t^{2}\int\Bigl(-t^{-2}p\log\frac{q_{t}}{p}+t^{-1}(q-p)\Bigr)\mathrm{d}\mu=\int p\log\frac{p}{q_{t}}\,\mathrm{d}\mu=D_{A}^{(1)}(P\|Q_{t}). \tag{27}$$

By the duality (3), and swapping $P$ and $Q$ in (6), it follows that

$$D_{A}^{(-\alpha)}(P_{t}\|Q)=\frac{t^{2-\alpha}}{\alpha+1}\frac{\mathrm{d}}{\mathrm{d}t}\Bigl(t^{\alpha-1}D_{A}^{(1-\alpha)}(P_{t}\|Q)\Bigr). \tag{28}$$

Replacing $\alpha$ by $-\alpha$ yields (7). ∎
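The differential relations (6) and (7) can also be sanity-checked numerically by replacing the derivative with a central finite difference. The sketch below is only an illustration under stated assumptions: the helper names (`alpha_divergence`, `mix`, `check_6`, `check_7`), the step size `h`, the evaluation point `t`, and the example distributions are all hypothetical, and both distributions are assumed strictly positive.

```python
import math

def alpha_divergence(p, q, alpha):
    """Asymmetric alpha-divergence (2) between strictly positive pmfs."""
    if alpha == 0:
        return sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))
    if alpha == 1:
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    total = sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q))
    return (total - 1.0) / (alpha * (alpha - 1.0))

def mix(p, q, t):
    """(Q - P) t + P; note that P_t = mix(q, p, t)."""
    return [pi + t * (qi - pi) for pi, qi in zip(p, q)]

def check_6(p, q, alpha, t, h=1e-5):
    """Return (left side, right side) of (6), derivative by central difference."""
    g = lambda s: s ** (alpha - 1) * alpha_divergence(p, mix(p, q, s), alpha)
    derivative = (g(t + h) - g(t - h)) / (2 * h)
    lhs = alpha_divergence(p, mix(p, q, t), alpha + 1)
    return lhs, t ** (2 - alpha) / (alpha + 1) * derivative

def check_7(p, q, alpha, t, h=1e-5):
    """Return (left side, right side) of (7), derivative by central difference."""
    g = lambda s: s ** (-alpha - 1) * alpha_divergence(mix(q, p, s), q, alpha + 1)
    derivative = (g(t + h) - g(t - h)) / (2 * h)
    lhs = alpha_divergence(mix(q, p, t), q, alpha)
    return lhs, t ** (2 + alpha) / (1 - alpha) * derivative

# Hypothetical example distributions and evaluation point.
p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]
t = 0.6

for alpha in (-0.5, 0.0, 0.5, 1.0, 2.0):
    print("(6)", alpha, check_6(p, q, alpha, t))
for alpha in (-1.0, 0.0, 0.5, 2.0):
    print("(7)", alpha, check_7(p, q, alpha, t))
```

The cases $\alpha=0$ and $\alpha=1$ exercise the logarithmic branches treated separately in (26) and (27).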

IV-B Proof of Theorem 2

Lemma 1.

For $R>0$, let $\mathcal{P}_{n,R}\subseteq\mathcal{P}_{n}$ be the set of pairs of probability measures such that $|u_{i}|\leq R$ for all $i=1,2,\cdots,n$. Let $(P,Q)\in\mathcal{P}_{n,R}\cap\mathcal{P}[m_{P},\sigma_{P},m_{Q},\sigma_{Q}]$. If $\alpha\in(0,1)$ and $m_{P}\neq m_{Q}$, the global minimum points $(P^{*},Q^{*},\bm{u}^{*})=\mathrm{argmin}_{(P,Q)\in\mathcal{P}_{n,R}\cap\mathcal{P}[m_{P},\sigma_{P},m_{Q},\sigma_{Q}]}D_{A}^{(\alpha)}(P\|Q)$ satisfy one of the following conditions.

  (1) $(P^{*},Q^{*})\in\mathcal{P}_{2}$, and $\max_{i}|u^{*}_{i}|<R$.

  (2) $\max_{i}|u^{*}_{i}|=R$.

Proof.

See Appendix A. ∎

Lemma 2.

If $\alpha\in(0,1)$ and $m_{P}\neq m_{Q}$, the binary $\alpha$-divergence is monotonically decreasing with respect to both $\sigma_{P}$ and $\sigma_{Q}$.

Proof.

See Appendix B. ∎

Lemma 3.

If $m_{P}\neq m_{Q}$, the set $\mathcal{P}_{2}\cap\mathcal{P}[m_{P},\sigma_{P},m_{Q},\sigma_{Q}]$ consists of a single element $(R,S)$, which is given by (18) and (19).

Proof.

See [15, Lemma 1]. ∎

Lemma 4.

Let $(R,S)\in\mathcal{P}_{2}\cap\mathcal{P}[m_{P},\sigma_{P},m_{Q},\sigma_{Q}]$, and let $(R^{\prime},S^{\prime})\in\mathcal{P}_{2}\cap\mathcal{P}[m_{P},\sigma_{P},m_{Q},0]$. If $\alpha<-1$, there exists $\sigma_{Q}$ such that $d_{A}^{(\alpha)}(r\|s)>d_{A}^{(\alpha)}(r^{\prime}\|s^{\prime})$.

Proof.

See Appendix C. ∎

We prove Theorem 2 in four steps.

Proof for $0<\alpha<1$.

We first prove (13) for pairs of finite discrete probability measures.
Let ${D_{A}^{(\alpha)}}^{*}:=\inf_{(P,Q)\in\mathcal{P}_{n}\cap\mathcal{P}[m_{P},\sigma_{P},m_{Q},\sigma_{Q}]}D_{A}^{(\alpha)}(P\|Q)$, and suppose ${D_{A}^{(\alpha)}}^{*}<d_{A}^{(\alpha)}(r\|s)$. By Lemma 1 as $R\rightarrow\infty$ and Lemma 3, there exist sequences of vectors $\{\bm{u}_{j}\}$ and of probability measures $\{P_{j}\}$ and $\{Q_{j}\}$ defined on $\{\bm{u}_{j}\}$ such that

$$D_{A}^{(\alpha)}(P_{\infty}\|Q_{\infty})={D_{A}^{(\alpha)}}^{*}, \tag{29}$$

where $Z_{\infty}$ denotes $\lim_{j\rightarrow\infty}Z_{j}$ for $Z\in\{P,Q,u_{i}\}$. Without loss of generality, one can assume that $|u_{i,\infty}|<\infty$ for $1\leq i\leq I$ and $|u_{i,\infty}|=\infty$ for $I<i\leq n$. Let $\sum_{i>I}p_{i,\infty}u^{2}_{i,\infty}=C^{2}$ and $\sum_{i>I}q_{i,\infty}u^{2}_{i,\infty}=D^{2}$, where $p_{i,j}=P_{j}(u_{i,j})$ and $q_{i,j}=Q_{j}(u_{i,j})$. By the variance constraints, we have $C^{2}\leq m_{P}^{2}+\sigma_{P}^{2}$ and $D^{2}\leq m_{Q}^{2}+\sigma_{Q}^{2}$. Hence, $p_{i,\infty}=O(u_{i,\infty}^{-2})$ and $q_{i,\infty}=O(u_{i,\infty}^{-2})$ hold for $i>I$, and due to $0<\alpha<1$, it follows that

$$\sum_{i>I}p_{i,\infty}=\sum_{i>I}p_{i,\infty}u_{i,\infty}=0, \tag{30}$$
$$\sum_{i>I}q_{i,\infty}=\sum_{i>I}q_{i,\infty}u_{i,\infty}=0, \tag{31}$$
$$\sum_{i>I}p_{i,\infty}^{\alpha}q_{i,\infty}^{1-\alpha}=0. \tag{32}$$

Let $P^{\prime}$ and $Q^{\prime}$ be probability measures defined on $\{u_{1,\infty},u_{2,\infty},\cdots,u_{I,\infty}\}$, and let $P^{\prime}(u_{i,\infty})=p_{i,\infty}$, $Q^{\prime}(u_{i,\infty})=q_{i,\infty}$ for $1\leq i\leq I$. From (30)-(32), it follows that

$$(P^{\prime},Q^{\prime})\in\mathcal{P}_{I,R^{\prime}}\cap\mathcal{P}\Bigl[m_{P},\sqrt{\sigma_{P}^{2}-C^{2}},m_{Q},\sqrt{\sigma_{Q}^{2}-D^{2}}\Bigr], \tag{33}$$
$${D_{A}^{(\alpha)}}^{*}=D_{A}^{(\alpha)}(P^{\prime}\|Q^{\prime}), \tag{34}$$

where we set $R^{\prime}>\max_{i\leq I}|u_{i,\infty}|$. Since the variances of $P^{\prime}$ and $Q^{\prime}$ are non-negative, we have $0\leq C^{2}\leq\sigma_{P}^{2}$ and $0\leq D^{2}\leq\sigma_{Q}^{2}$. If ${D_{A}^{(\alpha)}}^{*}$ is not a global minimum in $\mathcal{P}_{I,R^{\prime}}\cap\mathcal{P}\Bigl[m_{P},\sqrt{\sigma_{P}^{2}-C^{2}},m_{Q},\sqrt{\sigma_{Q}^{2}-D^{2}}\Bigr]$, there exists a sequence in $\mathcal{P}_{n}\cap\mathcal{P}[m_{P},\sigma_{P},m_{Q},\sigma_{Q}]$ such that $\sum_{i>I}p_{i,\infty}u^{2}_{i,\infty}=C^{2}$ and $\sum_{i>I}q_{i,\infty}u^{2}_{i,\infty}=D^{2}$, which gives a smaller value than ${D_{A}^{(\alpha)}}^{*}$. This contradicts the fact that ${D_{A}^{(\alpha)}}^{*}$ is the global minimum in $\mathcal{P}_{n}\cap\mathcal{P}[m_{P},\sigma_{P},m_{Q},\sigma_{Q}]$. Hence, by Lemma 1, it follows that $(P^{\prime},Q^{\prime})\in\mathcal{P}_{2}$, and

$${D_{A}^{(\alpha)}}^{*}=D_{A}^{(\alpha)}(P^{\prime}\|Q^{\prime})=d_{A}^{(\alpha)}(r^{\prime}\|s^{\prime}), \tag{35}$$

where $(r^{\prime},s^{\prime})$ is the binary pair with means $(m_{P},m_{Q})$ and variances $(\sigma_{P}^{2}-C^{2},\sigma_{Q}^{2}-D^{2})$. By Lemma 2, since the binary $\alpha$-divergence is monotonically decreasing with respect to $\sigma_{P}$ and $\sigma_{Q}$, we have ${D_{A}^{(\alpha)}}^{*}=d_{A}^{(\alpha)}(r^{\prime}\|s^{\prime})\geq d_{A}^{(\alpha)}(r\|s)$. This contradicts the assumption ${D_{A}^{(\alpha)}}^{*}<d_{A}^{(\alpha)}(r\|s)$, and thus we obtain ${D_{A}^{(\alpha)}}^{*}=d_{A}^{(\alpha)}(r\|s)$.

Next, we prove (13) for pairs of probability measures in $\mathcal{P}[m_{P},\sigma_{P},m_{Q},\sigma_{Q}]$. For sufficiently small $\epsilon$, there exists $R$ such that

$$\Bigl|\int_{|x|>R}px^{k}\,\mathrm{d}\mu(x)\Bigr|<\epsilon,\quad\Bigl|\int_{|x|>R}qx^{k}\,\mathrm{d}\mu(x)\Bigr|<\epsilon,\quad\mbox{for }k=0,1,2. \tag{36}$$

Since $p^{\alpha}q^{1-\alpha}\leq p+q$, we have

$$\int_{|x|>R}p^{\alpha}q^{1-\alpha}\,\mathrm{d}\mu(x)<2\epsilon. \tag{37}$$

In the interval $[-R,R]$, one can approximate the probability measures by finite discrete probability measures $(P_{d},Q_{d})\in\mathcal{P}_{n,R}$ as follows:

$$\Bigl|\int_{|x|\leq R}px^{k}\,\mathrm{d}\mu(x)-\sum_{i}p_{i}u_{i}^{k}\Bigr|<\epsilon,\quad\mbox{for }k=0,1,2, \tag{38}$$
$$\Bigl|\int_{|x|\leq R}qx^{k}\,\mathrm{d}\mu(x)-\sum_{i}q_{i}u_{i}^{k}\Bigr|<\epsilon,\quad\mbox{for }k=0,1,2, \tag{39}$$
$$\Bigl|\frac{1}{\alpha(\alpha-1)}\int_{|x|\leq R}p^{\alpha}q^{1-\alpha}\,\mathrm{d}\mu(x)-D_{A}^{(\alpha)}(P_{d}\|Q_{d})\Bigr|<\epsilon. \tag{40}$$

From (37) and (40), we have $D_{A}^{(\alpha)}(P\|Q)=D_{A}^{(\alpha)}(P_{d}\|Q_{d})+O(\epsilon)$. From (36), (38), and (39), it follows that the differences of the means and variances between $(P,Q)$ and $(P_{d},Q_{d})$ are $O(\epsilon)$. By applying $D_{A}^{(\alpha)}(P_{d}\|Q_{d})\geq d_{A}^{(\alpha)}(r_{d}\|s_{d})$, it follows that

$$D_{A}^{(\alpha)}(P\|Q)=D_{A}^{(\alpha)}(P_{d}\|Q_{d})+O(\epsilon)\geq d_{A}^{(\alpha)}(r_{d}\|s_{d})+O(\epsilon)=d_{A}^{(\alpha)}(r\|s)+O(\epsilon), \tag{41}$$

where $(r_{d},s_{d})$ denotes the binary pair in $\mathcal{P}_{2}\cap\mathcal{P}[m_{P_{d}},\sigma_{P_{d}},m_{Q_{d}},\sigma_{Q_{d}}]$, and we use the differentiability of $d_{A}^{(\alpha)}(r\|s)$ with respect to the means and variances. Since $\epsilon$ can be chosen arbitrarily small, we obtain $D_{A}^{(\alpha)}(P\|Q)\geq d_{A}^{(\alpha)}(r\|s)$. ∎

Proof for $-1\leq\alpha\leq 0$ and $1\leq\alpha\leq 2$.

Let $(P,Q)\in\mathcal{P}[m_{P},\sigma_{P},m_{Q},\sigma_{Q}]$, and $(R,S)\in\mathcal{P}_{2}\cap\mathcal{P}[m_{P},\sigma_{P},m_{Q},\sigma_{Q}]$. For all $t\in[0,1]$, the probability measures $Q_{t}=(Q-P)t+P$ and $S_{t}=(S-R)t+R$ have the same means and variances. Hence, by the tight lower bounds for $0<\alpha+1<1$ and the relation (8) with $t=1$ for $-1<\alpha<0$, we obtain

$$D_{A}^{(\alpha)}(P\|Q)=(\alpha+1)\int_{0}^{1}t^{\alpha-2}D_{A}^{(\alpha+1)}(P\|Q_{t})\,\mathrm{d}t\geq(\alpha+1)\int_{0}^{1}t^{\alpha-2}d_{A}^{(\alpha+1)}(r\|s_{t})\,\mathrm{d}t=d_{A}^{(\alpha)}(r\|s).$$

The inequality for $1<\alpha<2$ follows from the duality (3). Next, we prove the cases $\alpha=-1,2$. By the Hammersley-Chapman-Robbins bound [3], we obtain

$$D_{A}^{(2)}(P\|Q)\geq\frac{a^{2}}{2\sigma_{Q}^{2}}$$

(see [16, (180)]). Since $s(1-s)=\frac{\sigma_{Q}^{2}}{4v^{2}}$ and $r-s=\frac{a}{2v}$ hold due to (14) and (15), it follows that

$$d_{A}^{(2)}(r\|s)=\frac{1}{2}\Bigl(\frac{(r-s)^{2}}{s}+\frac{(1-r-(1-s))^{2}}{1-s}\Bigr)=\frac{(r-s)^{2}}{2s(1-s)}=\frac{a^{2}}{2\sigma_{Q}^{2}}.$$

By the duality (3), we obtain the inequality for $\alpha=-1$. The relation (8) for $\alpha=1$ and the duality (3) yield the inequalities for $\alpha=0,1$ (see [16, Theorem 2]). By combining these results, we obtain the lower bounds for $-1\leq\alpha\leq 0$ and $1\leq\alpha\leq 2$. ∎

Proof of the necessary condition.

Due to the duality (3), it is enough to consider $\alpha<-1$. We give an example of $(P,Q)\in\mathcal{P}_{3}\cap\mathcal{P}[m_{P},\sigma_{P},m_{Q},\sigma_{Q}]$ such that $d_{A}^{(\alpha)}(r\|s)>D_{A}^{(\alpha)}(P\|Q)$. Let $(R^{\prime},S^{\prime})\in\mathcal{P}_{2}\cap\mathcal{P}[m_{P},\sigma_{P},m_{Q},0]$, which is defined on $(u^{\prime}_{1},u^{\prime}_{2})$. We obtain $s^{\prime}=1$ due to $\sigma_{Q}=0$. Let $(P,Q)\in\mathcal{P}_{3}$ be

$$(P(u^{\prime}_{1}),P(u^{\prime}_{2}),P(u^{\prime}_{3}))=\Bigl(r^{\prime}-\frac{1}{{u^{\prime}_{3}}^{2+\delta}},\;1-r^{\prime},\;\frac{1}{{u^{\prime}_{3}}^{2+\delta}}\Bigr),$$
$$(Q(u^{\prime}_{1}),Q(u^{\prime}_{2}),Q(u^{\prime}_{3}))=\Bigl(1-\frac{\sigma_{Q}^{2}}{{u^{\prime}_{3}}^{2}},\;0,\;\frac{\sigma_{Q}^{2}}{{u^{\prime}_{3}}^{2}}\Bigr),$$

where $\delta$ is a small positive real number such that $2+\alpha\delta>0$. As $u_{3}^{\prime}\rightarrow\infty$, $P$ and $Q$ have means $(m_{P},m_{Q})$ and variances $(\sigma_{P}^{2},\sigma_{Q}^{2})$. Since $P(u^{\prime}_{3})^{\alpha}Q(u^{\prime}_{3})^{1-\alpha}=O({u^{\prime}_{3}}^{-(2+\alpha\delta)})$, it follows that $D_{A}^{(\alpha)}(P\|Q)\rightarrow d_{A}^{(\alpha)}(r^{\prime}\|s^{\prime})$. By Lemma 4, there exists $\sigma_{Q}$ such that $d_{A}^{(\alpha)}(r\|s)>D_{A}^{(\alpha)}(P\|Q)$. ∎

Proof of Item (c).

Since the proof is similar to that of [16, Theorem 2], we only outline it. We construct a sequence of pairs of probability measures $\{(P_{j},Q_{j})\}$ with zero mean and respective variances $(\sigma_{P}^{2},\sigma_{Q}^{2})$ for which $D_{A}^{(\alpha)}(P_{j}\|Q_{j})\rightarrow 0$ as $j\rightarrow\infty$ (without loss of generality, one can assume that the equal means are equal to zero). We start by assuming $\min\{\sigma_{P}^{2},\sigma_{Q}^{2}\}\geq 1$. Let

$$\mu_{j}:=\sqrt{1+j(\sigma_{Q}^{2}-1)},$$

and define a sequence of quaternary real-valued random variables with probability mass functions

$$Q_{j}(u):=\begin{cases}\dfrac{1}{2}-\dfrac{1}{2j},&u=\pm 1,\\ \dfrac{1}{2j},&u=\pm\mu_{j}.\end{cases}$$

It can be verified that, for all $j\in\mathbb{N}$, $Q_{j}$ has zero mean and variance $\sigma_{Q}^{2}$.

Furthermore, let

$$P_{j}(u):=\begin{cases}\dfrac{1}{2}-\dfrac{\xi}{2j},&u=\pm 1,\\ \dfrac{\xi}{2j},&u=\pm\mu_{j},\end{cases}$$

with

$$\xi:=\frac{\sigma_{P}^{2}-1}{\sigma_{Q}^{2}-1}.$$

If $\xi>1$, for $j=1,\cdots,\lceil\xi\rceil$, we choose $P_{j}$ arbitrarily with mean $0$ and variance $\sigma_{P}^{2}$. Then,

$$D_{A}^{(\alpha)}(P_{j}\|Q_{j})=d_{A}^{(\alpha)}\Bigl(\frac{\xi}{j}\Big\|\frac{1}{j}\Bigr)\rightarrow 0.$$

Next, suppose $\min\{\sigma_{P}^{2},\sigma_{Q}^{2}\}=:\sigma^{2}<1$, and construct $P^{\prime}_{j}$ and $Q^{\prime}_{j}$ as before with variances $\frac{2\sigma_{P}^{2}}{\sigma^{2}}>1$ and $\frac{2\sigma_{Q}^{2}}{\sigma^{2}}>1$, respectively. If $P_{j}$ and $Q_{j}$ denote the random variables $P^{\prime}_{j}$ and $Q^{\prime}_{j}$ scaled by a factor of $\frac{\sigma}{\sqrt{2}}$, then their variances are $\sigma_{P}^{2}$ and $\sigma_{Q}^{2}$, respectively, and $D_{A}^{(\alpha)}(P_{j}\|Q_{j})=D_{A}^{(\alpha)}(P^{\prime}_{j}\|Q^{\prime}_{j})\rightarrow 0$ as $j\rightarrow\infty$. ∎
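The construction in the proof of Item (c) can be reproduced numerically. The sketch below is an illustration under stated assumptions: it covers only the case $\sigma_{P}^{2}>1$ and $\sigma_{Q}^{2}>1$ with $j>\xi$, so that all masses are strictly positive, and the names `item_c_pair` and `moments` as well as the chosen variances are hypothetical.

```python
import math

def alpha_divergence(p, q, alpha):
    """Asymmetric alpha-divergence (2) between strictly positive pmfs."""
    if alpha == 0:
        return sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))
    if alpha == 1:
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    total = sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q))
    return (total - 1.0) / (alpha * (alpha - 1.0))

def moments(pmf, support):
    """Mean and variance of a discrete distribution."""
    mean = sum(w * u for w, u in zip(pmf, support))
    var = sum(w * (u - mean) ** 2 for w, u in zip(pmf, support))
    return mean, var

def item_c_pair(var_p, var_q, j):
    """Quaternary pair (P_j, Q_j); assumes var_p > 1, var_q > 1 and j > xi."""
    xi = (var_p - 1.0) / (var_q - 1.0)
    mu_j = math.sqrt(1.0 + j * (var_q - 1.0))
    support = [-mu_j, -1.0, 1.0, mu_j]
    q = [1 / (2 * j), 0.5 - 1 / (2 * j), 0.5 - 1 / (2 * j), 1 / (2 * j)]
    p = [xi / (2 * j), 0.5 - xi / (2 * j), 0.5 - xi / (2 * j), xi / (2 * j)]
    return p, q, support

# Hypothetical target variances with equal (zero) means.
var_p, var_q = 3.0, 2.0

for j in (10, 100, 1000):
    p, q, support = item_c_pair(var_p, var_q, j)
    values = [alpha_divergence(p, q, alpha) for alpha in (0.5, 1.0, 2.0)]
    # Moments stay fixed while the divergences shrink as j grows.
    print(j, moments(p, support), moments(q, support), values)
```

The printed divergences decay roughly like $1/j$, which illustrates why no positive lower bound exists when $m_{P}=m_{Q}$.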

V Conclusion

In this paper, we derived relations between the asymmetric $\alpha$-divergences of orders $\alpha$ and $\alpha+1$. These relations generalize the integral relation between the Kullback-Leibler divergence and the $\chi^{2}$-divergence. We showed that $\alpha\in[-1,2]$ is the necessary and sufficient condition for the binary $\alpha$-divergences to always attain the lower bounds under given means and variances. The Kullback-Leibler divergence, the Hellinger distance, and the $\chi^{2}$-divergence satisfy this condition. It is intuitively natural that the discrepancy between probability measures under moment constraints is smaller for localized measures than for broad ones. From this point of view, the $\alpha$-divergences for $\alpha\in[-1,2]$ have preferable properties. The tight lower bounds for the Kullback-Leibler divergence and the Hellinger distance have recently begun to be applied in physics [9, 19]. We hope that the range of applications of the relations between the $\alpha$-divergences and of the tight lower bounds will expand, and that they will contribute to progress in many fields and to a deeper understanding of the properties of divergences.

References

  • [1] S.-i. Amari. Information geometry and its applications, volume 194. Springer, 2016.
  • [2] K. M. Audenaert. Quantum skew divergence. Journal of Mathematical Physics, 55(11):112202, 2014.
  • [3] D. G. Chapman, H. Robbins, et al. Minimum variance estimation without regularity assumptions. The Annals of Mathematical Statistics, 22(4):581–586, 1951.
  • [4] A. Cichocki and S.-i. Amari. Families of alpha-beta-and gamma-divergences: Flexible and robust measures of similarities. Entropy, 12(6):1532–1568, 2010.
  • [5] I. Csiszár. Information-type measures of difference of probability distributions and indirect observation. studia scientiarum Mathematicarum Hungarica, 2:229–318, 1967.
  • [6] I. Csiszár. On topological properties of f-divergences. Studia Math. Hungar., 2:329–339, 1967.
  • [7] I. Csiszár. A class of measures of informativity of observation channels. Periodica Mathematica Hungarica, 2(1-4):191–213, 1972.
  • [8] M. Dashti and A. M. Stuart. The bayesian approach to inverse problems. arXiv preprint arXiv:1302.6989, 2013.
  • [9] Y. Hasegawa. Irreversibility, loschmidt echo, and thermodynamic uncertainty relation. Physical Review Letters, 127(24):240602, 2021.
  • [10] E. Hellinger. Neue begründung der theorie quadratischer formen von unendlichvielen veränderlichen. Journal für die reine und angewandte Mathematik (Crelles Journal), 1909(136):210–271, 1909.
  • [11] M. A. Katsoulakis, L. Rey-Bellet, and J. Wang. Scalable information inequalities for uncertainty quantification. Journal of Computational Physics, 336:513–545, 2017.
  • [12] S. Kullback and R. A. Leibler. On information and sufficiency. The annals of mathematical statistics, 22(1):79–86, 1951.
  • [13] F. Liese and I. Vajda. On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52(10):4394–4412, 2006.
  • [14] J. Melbourne, M. Madiman, and M. V. Salapaka. Relationships between certain $f$-divergences. In 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1068–1073, 2019.
  • [15] T. Nishiyama. A tight lower bound for the Hellinger distance with given means and variances. arXiv preprint arXiv:2010.13548, 2020.
  • [16] T. Nishiyama and I. Sason. On relations between the relative entropy and $\chi^{2}$-divergence, generalizations and applications. Entropy, 22(5):563, 2020.
  • [17] K. Pearson. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 50(302):157–175, 1900.
  • [18] A. Rényi et al. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. The Regents of the University of California, 1961.
  • [19] T. Van Vu, Y. Hasegawa, et al. Unified approach to classical speed limit and thermodynamic uncertainty relation. Physical Review E, 102(6):062132, 2020.

Appendix A Proof of Lemma 1

Proof.

Let $n\geq 3$ (the cases $n=1,2$ are trivial), and let $p_{i}:=P(u_{i})$ and $q_{i}:=Q(u_{i})$. Consider the following minimization problem.

\begin{align}
\text{minimize}\quad & -\sum_{i}p_{i}^{\alpha}q_{i}^{1-\alpha}, \tag{42}\\
\text{subject to}\quad & g_{k}(\bm{p},\bm{u}):=\sum_{i}p_{i}u_{i}^{k-1}-A_{k}=0, \tag{43}\\
& g_{k+3}(\bm{q},\bm{u}):=\sum_{i}q_{i}u_{i}^{k-1}-B_{k}=0,\quad\text{for } k=1,2,3, \tag{44}\\
& 0\leq p_{i}\leq 1,\quad 0\leq q_{i}\leq 1,\quad|u_{i}|\leq R,\quad\text{for } 1\leq i\leq n, \tag{45}
\end{align}

where $\bm{A}:=(1,m_{P},\sigma_{P}^{2}+m_{P}^{2})^{\mathrm{T}}$, $\bm{B}:=(1,m_{Q},\sigma_{Q}^{2}+m_{Q}^{2})^{\mathrm{T}}$, and (43) and (44) correspond to (4). Since the feasible set is compact, a global minimum exists. We first consider the case $\sigma_{P}>0$ and $\sigma_{Q}>0$, in which $p_{i},q_{i}<1$ for $1\leq i\leq n$. Hence, the global minimum point must be a stationary point or lie on the boundary, where $\max_{i}|u^{*}_{i}|=R$, $p^{*}_{i}=0$, or $q^{*}_{i}=0$. By rearranging the order of $\{p^{*}_{i}\},\{q^{*}_{i}\}$ appropriately, let

\begin{align*}
p^{*}_{i}>0,\;q^{*}_{i}&>0,\quad\text{for } i\leq I,\\
p^{*}_{j}>0,\;q^{*}_{j}&=0,\quad\text{for } I<j\leq I+J,\\
p^{*}_{k}=0,\;q^{*}_{k}&>0,\quad\text{for } I+J<k\leq I+J+K=n.
\end{align*}

Since $-\sum_{i\leq I}p_{i}^{\alpha}q_{i}^{1-\alpha}<0$ holds at the global minimum point, we have $I\geq 1$. If $J>0$, let $(P^{\prime},Q^{\prime},\bm{u^{\prime}})$ be

\begin{align*}
& p^{\prime}_{i}=p^{*}_{i}+\mathrm{d}p_{i},\quad q^{\prime}_{i}=q^{*}_{i}+\mathrm{d}q_{i},\quad\text{for } 1\leq i\leq I,\\
& p^{\prime}_{I+1}=p^{*}_{I+1}+\mathrm{d}p_{I+1},\quad q^{\prime}_{I+1}=\epsilon>0,\\
& p^{\prime}_{j}=p^{*}_{j}+\mathrm{d}p_{j},\quad q^{\prime}_{j}=0,\quad\text{for } I+1<j\leq I+J,\\
& p^{\prime}_{k}=0,\quad q^{\prime}_{k}=q^{*}_{k}+\mathrm{d}q_{k},\quad\text{for } I+J<k\leq I+J+K=n,\\
& u^{\prime}_{i}=u^{*}_{i}+\mathrm{d}u_{i},\quad\text{for } i\leq n,
\end{align*}

where $\epsilon$ and $\mathrm{d}\cdot$ denote sufficiently small real numbers. The moment constraints for the probability measures $P^{\prime}$ and $Q^{\prime}$ involve $\{\mathrm{d}p_{1},\mathrm{d}p_{I+1},\mathrm{d}u_{I+1}\}$ and $\{\mathrm{d}q_{1},\mathrm{d}q_{l},\mathrm{d}u_{l}\}$ ($l\neq I+1$), respectively. These variables are independent, and it can be easily verified that the determinants for the constraints of each probability measure are non-zero because $u^{*}_{i}\neq u^{*}_{j}$ for $i\neq j$. Hence, one can choose $\mathrm{d}\cdot$ such that $(P^{\prime},Q^{\prime})\in\mathcal{P}[m_{P},\sigma_{P},m_{Q},\sigma_{Q}]$ and $\mathrm{d}\cdot=O(\epsilon)$. By $0<\alpha<1$ and $p^{*}_{I+1}>0$,

\begin{align}
-\sum_{i\leq I}{p^{*}_{i}}^{\alpha}{q^{*}_{i}}^{1-\alpha}-\Bigl(-\sum_{i\leq I}{p^{\prime}_{i}}^{\alpha}{q^{\prime}_{i}}^{1-\alpha}-{p^{\prime}_{I+1}}^{\alpha}{q^{\prime}_{I+1}}^{1-\alpha}\Bigr)=\epsilon^{1-\alpha}{p^{*}_{I+1}}^{\alpha}+O(\epsilon)>0. \tag{46}
\end{align}

This contradicts the assumption that $(\bm{p}^{*},\bm{q}^{*},\bm{u}^{*})$ is a global minimum; hence $J=0$. In a similar way, we also obtain $K=0$, and therefore $I=n$. Hence, the global minimum point is an interior point or lies on the boundary at $\max_{i}|u^{*}_{i}|=R$. Supposing $\max_{i}|u^{*}_{i}|<R$, it must be a stationary point of the following Lagrangian.

\begin{align}
L(\bm{p},\bm{q},\bm{u},\bm{\lambda}):=-\sum_{i\leq n}p_{i}^{\alpha}q_{i}^{1-\alpha}+\sum_{i\leq n}p_{i}\phi_{\bm{\lambda}}(u_{i})+\sum_{i\leq n}q_{i}\psi_{\bm{\lambda}}(u_{i})-\sum_{k=1}^{3}\lambda_{k}A_{k}-\sum_{k=1}^{3}\lambda_{k+3}B_{k}, \tag{47}
\end{align}

where $\phi_{\bm{\lambda}}(u):=\sum_{k=1}^{3}\lambda_{k}u^{k-1}$ and $\psi_{\bm{\lambda}}(u):=\sum_{k=1}^{3}\lambda_{k+3}u^{k-1}$. Since $u_{i}\neq u_{j}$ for $i\neq j$, and $n\geq 3$, it follows that $\{\nabla g_{k}\}_{k\leq 6}$ are linearly independent. The stationary conditions are

\begin{align}
\frac{\partial L}{\partial p_{i}}(\bm{p}^{*},\bm{q}^{*},\bm{u}^{*},\bm{\lambda}^{*})&=-\alpha\Bigl(\frac{p^{*}_{i}}{q^{*}_{i}}\Bigr)^{\alpha-1}+\phi_{\bm{\lambda}^{*}}(u^{*}_{i})=0, \tag{48}\\
\frac{\partial L}{\partial q_{i}}(\bm{p}^{*},\bm{q}^{*},\bm{u}^{*},\bm{\lambda}^{*})&=-(1-\alpha)\Bigl(\frac{p^{*}_{i}}{q^{*}_{i}}\Bigr)^{\alpha}+\psi_{\bm{\lambda}^{*}}(u^{*}_{i})=0, \tag{49}\\
\frac{\partial L}{\partial u_{i}}(\bm{p}^{*},\bm{q}^{*},\bm{u}^{*},\bm{\lambda}^{*})&=p^{*}_{i}\phi^{\prime}_{\bm{\lambda}^{*}}(u^{*}_{i})+q^{*}_{i}\psi^{\prime}_{\bm{\lambda}^{*}}(u^{*}_{i})=0, \tag{50}
\end{align}

where the prime ${}^{\prime}$ denotes the derivative with respect to $u$.

From (48) and (49), it follows that

\begin{align}
\psi_{\bm{\lambda}^{*}}(u^{*}_{i})=(1-\alpha)\Bigl(\frac{\phi_{\bm{\lambda}^{*}}(u^{*}_{i})}{\alpha}\Bigr)^{\frac{\alpha}{\alpha-1}}. \tag{51}
\end{align}

Substituting (48) and (49) into (50), that is, using $p^{*}_{i}/q^{*}_{i}=\alpha\psi_{\bm{\lambda}^{*}}(u^{*}_{i})/\bigl((1-\alpha)\phi_{\bm{\lambda}^{*}}(u^{*}_{i})\bigr)$, we have

\begin{align}
\alpha\phi^{\prime}_{\bm{\lambda}^{*}}(u^{*}_{i})\psi_{\bm{\lambda}^{*}}(u^{*}_{i})+(1-\alpha)\phi_{\bm{\lambda}^{*}}(u^{*}_{i})\psi^{\prime}_{\bm{\lambda}^{*}}(u^{*}_{i})=0. \tag{52}
\end{align}

Since $\phi_{\bm{\lambda}^{*}}(u^{*}_{i})$ and $\psi_{\bm{\lambda}^{*}}(u^{*}_{i})$ are positive by (48) and (49), one can define the function $\beta(u):=\alpha\log\frac{\phi_{\bm{\lambda}^{*}}(u)}{\alpha}+(1-\alpha)\log\frac{\psi_{\bm{\lambda}^{*}}(u)}{1-\alpha}$. From (51) and (52), we obtain

\begin{align}
\beta(u^{*}_{i})=\beta^{\prime}(u^{*}_{i})=0. \tag{53}
\end{align}
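As a quick numerical sanity check of the algebra leading to (51) and (53), the following Python sketch picks a ratio $t=p^{*}_{i}/q^{*}_{i}$, computes the values of $\phi$ and $\psi$ that (48) and (49) force at that point, and evaluates the residual of (51) together with the value of $\beta$. The function names and the variable `t` are illustrative and are not part of the paper.

```python
import math


def multipliers_from_ratio(t, alpha):
    """Values of phi and psi implied by (48) and (49) when p/q = t > 0."""
    phi = alpha * t ** (alpha - 1.0)
    psi = (1.0 - alpha) * t ** alpha
    return phi, psi


def beta(phi, psi, alpha):
    """beta = alpha*log(phi/alpha) + (1 - alpha)*log(psi/(1 - alpha))."""
    return alpha * math.log(phi / alpha) + (1.0 - alpha) * math.log(psi / (1.0 - alpha))


def check_relations(t, alpha):
    """Return (residual of (51), value of beta) at a point with p/q = t."""
    phi, psi = multipliers_from_ratio(t, alpha)
    rhs_51 = (1.0 - alpha) * (phi / alpha) ** (alpha / (alpha - 1.0))
    return psi - rhs_51, beta(phi, psi, alpha)


# Both numbers should vanish up to rounding for any t > 0 and 0 < alpha < 1.
residual_51, beta_value = check_relations(t=2.5, alpha=0.3)
```

Both returned quantities are zero up to floating-point rounding, which reflects that (51) and $\beta(u^{*}_{i})=0$ are purely algebraic consequences of (48) and (49).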

Since $\phi_{\bm{\lambda}^{*}}(u)$ and $\psi_{\bm{\lambda}^{*}}(u)$ are at most quadratic in $u$, the left-hand side of (52) is a polynomial in $u$ of degree at most $3$. If it is not identically $0$, it has at most three roots, so $n\leq 3$. If $n=3$, by the relation (53) and the mean value theorem, there exists $u_{l}\neq u^{*}_{1},u^{*}_{2},u^{*}_{3}$ such that $\beta^{\prime}(u_{l})=0$. By the definition of $\beta(\cdot)$, it follows that $u_{l}$ is a solution of (52). Then (52) would have four distinct roots, which contradicts the fact that it has degree at most $3$. If the left-hand side of (52) is identically $0$, the equality (51) holds as an identity in $u$. Since the degrees of $\phi_{\bm{\lambda}^{*}}(u)$ and $\psi_{\bm{\lambda}^{*}}(u)$ each belong to $\{0,\;1,\;2\}$, and $0<\alpha<1$, it follows that both are constant. The equality (48) then yields $q_{i}=Cp_{i}$ for all $i$ and a constant $C$. This implies $p_{i}=q_{i}$ for all $i$, which contradicts $m_{P}\neq m_{Q}$. Hence, we obtain $n\leq 2$.

Next, we briefly treat the case $\sigma_{P}>0$ and $\sigma_{Q}=0$. The notation is the same as above, and $I=1$ and $J=n-1$ hold because $\sigma_{Q}=0$. The Lagrangian is

\begin{align}
L(\bm{p},\bm{u},\bm{\lambda}):=-p_{1}^{\alpha}+\sum_{i\leq n}p_{i}\phi_{\bm{\lambda}}(u_{i})-\sum_{k=1}^{3}\lambda_{k}A_{k}. \tag{54}
\end{align}

Supposing $\max_{i}|u^{*}_{i}|<R$, the stationary conditions are

\begin{align}
\frac{\partial L}{\partial p_{1}}(\bm{p}^{*},\bm{u}^{*},\bm{\lambda}^{*})&=-\alpha{p^{*}_{1}}^{\alpha-1}+\phi_{\bm{\lambda}^{*}}(u^{*}_{1})=0, \tag{55}\\
\frac{\partial L}{\partial p_{i}}(\bm{p}^{*},\bm{u}^{*},\bm{\lambda}^{*})&=\phi_{\bm{\lambda}^{*}}(u^{*}_{i})=0,\quad\text{for } 2\leq i\leq n, \tag{56}\\
\frac{\partial L}{\partial u_{i}}(\bm{p}^{*},\bm{u}^{*},\bm{\lambda}^{*})&=p^{*}_{i}\phi^{\prime}_{\bm{\lambda}^{*}}(u^{*}_{i})=0,\quad\text{for } 1\leq i\leq n. \tag{57}
\end{align}

From (55) and (56), it follows that $\phi_{\bm{\lambda}^{*}}$ is not constant. Since $\phi^{\prime}_{\bm{\lambda}^{*}}(u)$ has degree at most $1$, it follows from (57) and $p^{*}_{i}>0$ for $2\leq i\leq n$ that $n\leq 2$. The result for $\sigma_{P}=0$ and $\sigma_{Q}>0$ also follows by swapping $P$ and $Q$. ∎

Appendix B Proof of Lemma 2

We first prove the following lemma.

Lemma 5.

Let $0<\alpha<1$. If $0<x\leq 1$,

\begin{align*}
\frac{(1-\alpha x)(1+x)^{\alpha}}{(1+\alpha x)(1-x)^{\alpha}}>1.
\end{align*}

If $-1\leq x<0$,

\begin{align*}
\frac{(1-\alpha x)(1+x)^{\alpha}}{(1+\alpha x)(1-x)^{\alpha}}<1.
\end{align*}
Proof.

Letting $F(x):=\log\frac{(1-\alpha x)(1+x)^{\alpha}}{(1+\alpha x)(1-x)^{\alpha}}$, for $0<|x|\leq 1$ and $0<\alpha<1$, we have

\begin{align*}
F^{\prime}(x)=\alpha\Bigl(\frac{1}{1+x}+\frac{1}{1-x}-\frac{1}{1-\alpha x}-\frac{1}{1+\alpha x}\Bigr)=\frac{2\alpha(1-\alpha^{2})x^{2}}{(1-x^{2})(1-\alpha^{2}x^{2})}>0.
\end{align*}

By combining this relation with $F(0)=0$, the results follow. ∎
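Lemma 5 is easy to probe numerically. The sketch below, with illustrative function names that are not part of the paper, evaluates the ratio on a grid of $x$ and $\alpha$ and also provides the closed form of $F^{\prime}(x)$ so that it can be compared with a finite difference. The grid stays strictly inside $|x|<1$, since the ratio diverges or vanishes at $x=\pm 1$.

```python
import math


def lemma5_ratio(x, alpha):
    """(1 - alpha*x)(1 + x)^alpha / ((1 + alpha*x)(1 - x)^alpha), for |x| < 1."""
    numerator = (1.0 - alpha * x) * (1.0 + x) ** alpha
    denominator = (1.0 + alpha * x) * (1.0 - x) ** alpha
    return numerator / denominator


def F_prime(x, alpha):
    """Closed form of F'(x) from the proof of Lemma 5."""
    return (2.0 * alpha * (1.0 - alpha**2) * x**2
            / ((1.0 - x**2) * (1.0 - alpha**2 * x**2)))


alphas = [0.1, 0.3, 0.5, 0.7, 0.9]
xs = [k / 20.0 for k in range(1, 20)]  # 0.05, 0.10, ..., 0.95

# True if the ratio exceeds 1 for x > 0 and is below 1 for x < 0 on the grid.
lemma5_holds = all(
    lemma5_ratio(x, al) > 1.0 and lemma5_ratio(-x, al) < 1.0
    for al in alphas
    for x in xs
)
```

A grid check of this kind is only an illustration; the proof above is what establishes the inequality for all admissible $x$ and $\alpha$.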

Proof of Lemma 2.

Let $V_{P}:=\sigma_{P}^{2}$ and $V_{Q}:=\sigma_{Q}^{2}$, and let $f^{(\alpha)}(V_{P},V_{Q},a):=r^{\alpha}s^{1-\alpha}$. From (14)-(17), replacing $a$ by $-a$ maps $(r,s)$ to $(1-r,1-s)$. Hence, the binary $\alpha$-divergences can be written as $d_{A}^{(\alpha)}(r\|s)=\frac{1}{\alpha(\alpha-1)}\bigl(f^{(\alpha)}(V_{P},V_{Q},a)+f^{(\alpha)}(V_{P},V_{Q},-a)-1\bigr)$. Since $0<\alpha<1$, we show that $f^{(\alpha)}(V_{P},V_{Q},a)+f^{(\alpha)}(V_{P},V_{Q},-a)$ is monotonically increasing with respect to $V_{P}$ and $V_{Q}$.

From (14)-(17), we have

\begin{align}
\frac{\partial v}{\partial V_{Q}}&=\frac{V_{Q}-V_{P}+a^{2}}{4a^{2}v}, \tag{58}\\
\frac{\partial r}{\partial V_{Q}}&=-\frac{(V_{Q}-V_{P}+a^{2})^{2}}{16a^{3}v^{3}}+\frac{1}{4av}=\frac{V_{P}}{4av^{3}}, \tag{59}\\
\frac{\partial s}{\partial V_{Q}}&=-\frac{(V_{Q}-V_{P}+a^{2})(V_{Q}-V_{P}-a^{2})}{16a^{3}v^{3}}+\frac{1}{4av}=\frac{V_{P}+V_{Q}+a^{2}}{8av^{3}}. \tag{60}
\end{align}

By combining (59) and (60), it follows that

\begin{align}
\frac{\partial f^{(\alpha)}}{\partial V_{Q}}=\frac{1}{8av^{3}}\Bigl(\frac{r}{s}\Bigr)^{\alpha}\Bigl((1-\alpha)(V_{P}+V_{Q}+a^{2})+2\alpha V_{P}\frac{s}{r}\Bigr). \tag{61}
\end{align}

By $r-s=\frac{a}{2v}$ and $s(1-s)=\frac{V_{Q}}{4v^{2}}$, we obtain

\begin{align}
\frac{r}{s}=1+\frac{a}{2vs}=1+\frac{2va(1-s)}{V_{Q}}=\frac{V_{P}+V_{Q}+a^{2}+2va}{2V_{Q}}. \tag{62}
\end{align}

Replacing $(V_{P},V_{Q},a)$ by $(V_{Q},V_{P},-a)$ maps $(r,s)$ to $(s,r)$. From (62), we therefore have

\begin{align}
\frac{s}{r}=\frac{V_{P}+V_{Q}+a^{2}-2va}{2V_{P}}. \tag{63}
\end{align}

Substituting (62) and (63) into (61), we obtain

\begin{align}
\frac{\partial f^{(\alpha)}}{\partial V_{Q}}(V_{P},V_{Q},a)&=\frac{1}{8av^{3}}\Bigl(\frac{V_{P}+V_{Q}+a^{2}+2va}{2V_{Q}}\Bigr)^{\alpha}\Bigl((V_{P}+V_{Q}+a^{2})-2\alpha va\Bigr) \tag{64}\\
&=\frac{1}{8av^{3}}\Bigl(\frac{V_{P}+V_{Q}+a^{2}}{2V_{Q}}\Bigr)^{\alpha}(V_{P}+V_{Q}+a^{2})(1+x)^{\alpha}(1-\alpha x), \tag{65}
\end{align}

where $x:=\frac{2va}{V_{P}+V_{Q}+a^{2}}$ and $|x|\leq 1$. Hence, we obtain

\begin{align}
&\frac{\partial\Bigl(f^{(\alpha)}(V_{P},V_{Q},a)+f^{(\alpha)}(V_{P},V_{Q},-a)\Bigr)}{\partial V_{Q}}\notag\\
&=\frac{1}{8av^{3}}\Bigl(\frac{V_{P}+V_{Q}+a^{2}}{2V_{Q}}\Bigr)^{\alpha}(V_{P}+V_{Q}+a^{2})\Bigl((1+x)^{\alpha}(1-\alpha x)-(1-x)^{\alpha}(1+\alpha x)\Bigr). \tag{66}
\end{align}

If $a>0$, by Lemma 5, (66) and $0<x\leq 1$, it follows that

\begin{align}
\frac{\partial\Bigl(f^{(\alpha)}(V_{P},V_{Q},a)+f^{(\alpha)}(V_{P},V_{Q},-a)\Bigr)}{\partial V_{Q}}>0. \tag{67}
\end{align}

If $a<0$, we again obtain (67) from $-1\leq x<0$ and Lemma 5, since both the prefactor $\frac{1}{8av^{3}}$ and the last factor in (66) are then negative. By (14), (15), and the definition of $f^{(\alpha)}(V_{P},V_{Q},a)$, we have $f^{(\alpha)}(V_{P},V_{Q},a)=f^{(1-\alpha)}(V_{Q},V_{P},-a)$. Hence, by combining this relation with (67) for $1-\alpha$, we have $\frac{\partial\bigl(f^{(\alpha)}(V_{P},V_{Q},a)+f^{(\alpha)}(V_{P},V_{Q},-a)\bigr)}{\partial V_{P}}>0$. ∎
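The monotonicity shown above can be illustrated numerically. Equations (14)-(17) are not reproduced in this appendix, so the closed forms for $v$, $r$ and $s$ in the sketch below are an assumption: they were reconstructed so as to satisfy $r-s=\frac{a}{2v}$, $s(1-s)=\frac{V_{Q}}{4v^{2}}$, $r(1-r)=\frac{V_{P}}{4v^{2}}$ and (58), and should be compared against (14)-(17) before being relied upon. The function names are likewise illustrative.

```python
import math


def binary_pair(V_P, V_Q, a):
    """Two-point parameters (r, s, v) for variances V_P, V_Q and a != 0.

    ASSUMPTION: these closed forms are reconstructed from the relations
    r - s = a/(2v), s(1-s) = V_Q/(4 v^2), r(1-r) = V_P/(4 v^2);
    they are not copied from equations (14)-(17) of the paper.
    """
    v = math.sqrt(V_Q + (a * a + V_P - V_Q) ** 2 / (4.0 * a * a))
    r = 0.5 + (a * a - V_P + V_Q) / (4.0 * a * v)
    s = 0.5 + (V_Q - V_P - a * a) / (4.0 * a * v)
    return r, s, v


def affinity_sum(V_P, V_Q, a, alpha):
    """f(V_P, V_Q, a) + f(V_P, V_Q, -a), using (r, s) -> (1 - r, 1 - s) for -a."""
    r, s, _ = binary_pair(V_P, V_Q, a)
    return (r**alpha * s ** (1.0 - alpha)
            + (1.0 - r) ** alpha * (1.0 - s) ** (1.0 - alpha))


# Evaluate the sum along increasing V_Q for fixed V_P, a and alpha.
V_Q_grid = [0.2 * k for k in range(1, 15)]
values = [affinity_sum(1.0, V_Q, 1.0, 0.5) for V_Q in V_Q_grid]
is_increasing = all(lo < hi for lo, hi in zip(values, values[1:]))
```

Under the stated assumption, `is_increasing` should come out true for any $0<\alpha<1$, which corresponds to the binary $\alpha$-divergence decreasing as either variance grows, because $\alpha(\alpha-1)<0$ in that range.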

Appendix C Proof of Lemma 4

Proof.

Let $\alpha<-1$. In a similar way to the proof of Lemma 2, we obtain (66). For $a>0$, let

\begin{align*}
x(V_{Q})=\frac{2va}{V_{P}+V_{Q}+a^{2}},
\end{align*}

where we make explicit the dependence on $V_{Q}$ of the quantity $x$ defined in the proof of Lemma 2. Since $x(0)=1$ and $\alpha<-1$, there exists $V_{Q}=\sigma_{Q}^{2}>0$ such that $x(z)>0$ and $1+\alpha x(z)<0$ for all $z\in[0,V_{Q}]$. By combining these inequalities with $|x(V_{Q})|\leq 1$, it follows that (66) is positive for all $z\in[0,V_{Q}]$. By $d_{A}^{(\alpha)}(r\|s)=\frac{1}{\alpha(\alpha-1)}\bigl(f^{(\alpha)}(V_{P},V_{Q},a)+f^{(\alpha)}(V_{P},V_{Q},-a)-1\bigr)$ and $\alpha(\alpha-1)>0$, it follows that

\begin{align}
d_{A}^{(\alpha)}(r\|s)>d_{A}^{(\alpha)}(r^{\prime}\|s^{\prime}), \tag{68}
\end{align}

where $(R^{\prime},S^{\prime})\in\mathcal{P}_{2}\cap\mathcal{P}[m_{P},\sigma_{P},m_{Q},0]$. The case $a<0$ can be treated in a similar way.
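The behaviour used in this proof can be seen in a small numerical experiment for $\alpha=-2$. As before, equations (14)-(17) are not available in this appendix, so the closed forms for $v$, $r$ and $s$ below are an assumption reconstructed from $r-s=\frac{a}{2v}$, $s(1-s)=\frac{V_{Q}}{4v^{2}}$ and (58); the function names and the chosen parameter values are illustrative only.

```python
import math


def binary_pair(V_P, V_Q, a):
    """Two-point parameters (r, s, v).

    ASSUMPTION: reconstructed closed forms, not copied from (14)-(17).
    """
    v = math.sqrt(V_Q + (a * a + V_P - V_Q) ** 2 / (4.0 * a * a))
    r = 0.5 + (a * a - V_P + V_Q) / (4.0 * a * v)
    s = 0.5 + (V_Q - V_P - a * a) / (4.0 * a * v)
    return r, s, v


def x_of(V_P, V_Q, a):
    """x(V_Q) = 2 v a / (V_P + V_Q + a^2)."""
    _, _, v = binary_pair(V_P, V_Q, a)
    return 2.0 * v * a / (V_P + V_Q + a * a)


def binary_alpha_divergence(V_P, V_Q, a, alpha):
    """(f(a) + f(-a) - 1) / (alpha (alpha - 1)) with f = r^alpha s^(1-alpha)."""
    r, s, _ = binary_pair(V_P, V_Q, a)
    total = (r**alpha * s ** (1.0 - alpha)
             + (1.0 - r) ** alpha * (1.0 - s) ** (1.0 - alpha))
    return (total - 1.0) / (alpha * (alpha - 1.0))


alpha, V_P, a = -2.0, 1.0, 1.0
V_Q_grid = [0.02 * k for k in range(1, 6)]  # small positive V_Q; V_Q = 0 is excluded

# Sign condition used in the proof, and the divergence along the grid.
sign_condition = [1.0 + alpha * x_of(V_P, V_Q, a) < 0.0 for V_Q in V_Q_grid]
divergences = [binary_alpha_divergence(V_P, V_Q, a, alpha) for V_Q in V_Q_grid]
```

Under the stated assumption, the sign condition holds on the whole grid and the divergence grows with $V_{Q}$ there, in contrast to the case $0<\alpha<1$ treated in Appendix B.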