
Minimization Problems on Strictly Convex Divergences

Tomohiro Nishiyama
Email: htam0ybboh@gmail.com
Abstract

The divergence minimization problem plays an important role in various fields. In this note, we focus on differentiable and strictly convex divergences. For some minimization problems, we show the minimizer conditions and the uniqueness of the minimizer without assuming a specific form of the divergence. Furthermore, we show geometric properties related to the minimization problems.

Keywords: convex, minimization problem, projection, centroid, Bregman divergence, f-divergence, Rényi-divergence, Lagrange multiplier.

I Introduction

Divergences are quantities that measure the discrepancy between probability measures. For two probability measures P and Q, a divergence satisfies the following properties.

D(P\|Q)\geq 0, with equality if and only if P=Q.

In particular, the f-divergence [15, 7], the Bregman divergence [3] and the Rényi divergence [16, 14] are often used in various fields such as machine learning, image processing, statistical physics and finance.

In order to find the probability measure closest (in the sense of a divergence) to a target probability measure subject to some constraints, it is necessary to solve divergence minimization problems, and there are many works on this topic [8, 4, 2]. Divergence minimization problems are also deeply related to geometric properties of divergences, such as the projection from a probability measure onto a set.

The main purpose of this note is to study minimization problems and geometric properties of divergences that are differentiable and strictly convex in the first or the second argument [13]. For example, the squared Euclidean distance is differentiable and strictly convex. The most important result is that the minimizer conditions and the uniqueness of the minimizer can be derived from these assumptions alone, without specifying the form of the divergence, provided that solutions satisfying the minimizer conditions exist. The minimizer conditions are consistent with the results of the method of Lagrange multipliers.

First, we introduce divergence lines, balls, inner products and orthogonal subspaces, which generalize line segments, spheres, inner products and planes perpendicular to lines in Euclidean space, respectively. Furthermore, we show the three-point inequality as a basic geometric property and show some properties of the divergence inner product.

Next, we discuss the minimization problem of the weighted average of divergences from several probability measures, which is important in clustering algorithms such as k-means clustering [10]. We show that the minimizer of the weighted average of divergences is the generalized centroid, as in the Euclidean case.

Finally, we discuss minimization problems between a probability measure P and a set \mathcal{S}. These are interpreted as a projection from P onto \mathcal{S}. In Euclidean space, the minimum distance from a point P\in\mathbb{R}^{3} to a plane is attained at the foot of the perpendicular, and the minimum distance to a sphere is attained at the intersection of the sphere and the line connecting P and the center of the sphere. We show that the minimizer of the divergence between a probability measure and a divergence ball or an orthogonal subspace has similar properties to the Euclidean case.

II Preliminaries

This section provides the definitions and notation used in this note. Let \mathcal{P} denote the set of probability measures. For P,Q\in\mathcal{P}, P=Q means P=Q a.s. Let \mu be a dominating measure of \mathcal{P} (P\ll\mu) and let p:=\frac{dP}{d\mu} be the density of P.

Divergences are defined as functions that satisfy the following properties.

Let D:\mathcal{P}\times\mathcal{P}\rightarrow[0,\infty). For any P,Q\in\mathcal{P},

D(P\|Q)\geq 0,
D(P\|Q)=0\iff P=Q.
Definition 1 (Strictly convex divergence).

Let P,Q,R\in\mathcal{P} with Q\neq R. Let t\in(0,1) and let D be a divergence. The divergence D(P\|Q) is strictly convex in the second argument if

(1-t)D(P\|Q)+tD(P\|R)>D(P\|(1-t)Q+tR). (1)
Definition 2 (Differentiable divergence).

Let P,Q\in\mathcal{P} and let D be a divergence. The divergence D(P\|Q) is differentiable with respect to the second argument if D(P\|Q) is a functional of q=\frac{dQ}{d\mu} and the functional derivative with respect to q exists. Let D[q]:=D(P\|Q) denote this functional of q. The functional derivative of D(P\|Q) with respect to q is defined by

\int\frac{\delta D(P\|Q)}{\delta q(z)}\phi(z)\mathrm{d}\mu(z):=\left.\frac{d}{d\epsilon}D[q+\epsilon\phi]\right|_{\epsilon=0}, (2)

where \phi is an arbitrary function.

Strict convexity and differentiability in the first argument are defined in the same way as for the second argument. We show some examples of the functional derivative.

Example 1 (Squared Euclidean distance).

Let D_{\mathrm{E}}(P\|Q):=\frac{1}{2}\int(q-p)^{2}\mathrm{d}\mu. The functional derivative is

\frac{\delta D_{\mathrm{E}}(P\|Q)}{\delta q(z)}=q(z)-p(z). (3)
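As a quick illustration of Definition 2 and Example 1, the following sketch (ours, not part of the original derivation) checks the functional derivative numerically; it assumes a finite grid whose spacing dz plays the role of the dominating measure \mu, and the densities and the test function \phi are arbitrary choices for the check.

# Minimal numerical sketch (assumption: finite grid, spacing dz acts as mu).
# It compares the two sides of definition (2) for the squared Euclidean
# distance of Example 1, whose functional derivative is q - p.
import numpy as np

dz = 0.01
z = np.arange(0.0, 1.0, dz)
p = np.exp(-(z - 0.3)**2 / 0.02); p /= p.sum() * dz   # density of P
q = np.exp(-(z - 0.6)**2 / 0.05); q /= q.sum() * dz   # density of Q
phi = np.sin(2 * np.pi * z)                           # arbitrary test function

def D_E(q_):
    return 0.5 * np.sum((q_ - p)**2) * dz

eps = 1e-6
lhs = (D_E(q + eps * phi) - D_E(q - eps * phi)) / (2 * eps)  # d/d(eps) at 0
rhs = np.sum((q - p) * phi) * dz                             # int (q-p) phi dmu
print(lhs, rhs)  # the two values agree up to discretization error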
Example 2 (Bregman divergence).

Let f:\mathbb{R}\rightarrow\mathbb{R} be a differentiable and strictly convex function. The Bregman divergence is defined by

D_{\mathrm{B}}(P\|Q):=\int f(p)\mathrm{d}\mu-\int f(q)\mathrm{d}\mu-\int f^{\prime}(q)(p-q)\mathrm{d}\mu, (4)

where f^{\prime}(x) denotes the derivative of f. The Bregman divergence is strictly convex in the first argument. The functional derivative is

\frac{\delta D_{\mathrm{B}}(P\|Q)}{\delta p(z)}=f^{\prime}(p(z))-f^{\prime}(q(z)). (5)
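As a sanity check of Example 2 (a sketch of ours, with arbitrary discrete densities on a grid): the Bregman divergence (4) with f(x)=x\log x reduces to the KL-divergence, since f^{\prime}(x)=\log x+1 and \int(p-q)\mathrm{d}\mu=0.

# Hedged sketch: Bregman divergence (4) with f(x) = x log x equals KL.
import numpy as np

dz = 0.01
z = np.arange(dz, 1.0, dz)
p = z**2;            p /= p.sum() * dz
q = np.ones_like(z); q /= q.sum() * dz

f  = lambda x: x * np.log(x)
df = lambda x: np.log(x) + 1.0

D_B  = np.sum(f(p) - f(q) - df(q) * (p - q)) * dz
D_KL = np.sum(p * np.log(p / q)) * dz
print(D_B, D_KL)  # both values coincide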
Example 3 (f-divergence).

Let f:\mathbb{R}\rightarrow\mathbb{R} be a strictly convex function with f(1)=0. The f-divergence is defined by

D_{f}(P\|Q):=\int qf\biggl(\frac{p}{q}\biggr)\mathrm{d}\mu. (6)

The f-divergence is strictly convex in the first and the second argument. If f is differentiable, the functional derivatives are

\frac{\delta D_{f}(P\|Q)}{\delta q(z)}=\tilde{f}^{\prime}\biggl(\frac{q(z)}{p(z)}\biggr), (7)

where \tilde{f}(x):=xf\bigl(\frac{1}{x}\bigr), and

\frac{\delta D_{f}(P\|Q)}{\delta p(z)}=f^{\prime}\biggl(\frac{p(z)}{q(z)}\biggr). (8)
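For the KL case f(x)=x\log x one has \tilde{f}(x)=-\log x, so (7) gives \delta D_{f}/\delta q=-p/q. The following sketch (our own illustration, with arbitrary grid densities) checks this against a finite-difference approximation of the unconstrained derivative.

# Hedged numerical check of (7) for f(x) = x log x (so f~'(x) = -1/x).
import numpy as np

dz = 0.01
z = np.arange(dz, 1.0, dz)
p = np.exp(-(z - 0.4)**2 / 0.03); p /= p.sum() * dz
q = np.exp(-(z - 0.7)**2 / 0.04); q /= q.sum() * dz
phi = np.cos(2 * np.pi * z)          # arbitrary test function

def D_f(q_):                         # f-divergence with f(x) = x log x
    return np.sum(q_ * (p / q_) * np.log(p / q_)) * dz

eps = 1e-6
lhs = (D_f(q + eps * phi) - D_f(q - eps * phi)) / (2 * eps)
rhs = np.sum(-(p / q) * phi) * dz    # int f~'(q/p) phi dmu
print(lhs, rhs)                      # agree up to discretization error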
Example 4 (Rényi-divergence).

For 0<\alpha<\infty, the Rényi divergence is defined by

D_{\alpha}(P\|Q):=\frac{1}{\alpha-1}\log\int p^{\alpha}q^{1-\alpha}\mathrm{d}\mu\mbox{ for }\alpha\neq 1, (9)
D_{1}(P\|Q):=\int p\log\frac{p}{q}\mathrm{d}\mu.

The Rényi divergence is strictly convex in the second argument for 0<\alpha<\infty (see [16]). The functional derivative is

\frac{\delta D_{\alpha}(P\|Q)}{\delta q(z)}=-\frac{1}{\int p^{\alpha}q^{1-\alpha}\mathrm{d}\mu}\biggl(\frac{p(z)}{q(z)}\biggr)^{\alpha}. (10)
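As a small numerical illustration (ours, with two grid-discretized Gaussian densities chosen arbitrarily), D_{\alpha} in (9) approaches D_{1} as \alpha\rightarrow 1:

# Hedged sketch: Renyi divergence (9) on a grid; alpha -> 1 recovers D_1 = KL.
import numpy as np

dz = 0.005
z = np.arange(-5.0, 5.0, dz)
p = np.exp(-(z - 0.0)**2 / 2); p /= p.sum() * dz
q = np.exp(-(z - 1.0)**2 / 2); q /= q.sum() * dz

def renyi(alpha):
    if alpha == 1.0:
        return np.sum(p * np.log(p / q)) * dz
    return np.log(np.sum(p**alpha * q**(1 - alpha)) * dz) / (alpha - 1)

for a in (0.5, 0.9, 0.99, 1.0, 1.01, 2.0):
    print(a, renyi(a))   # values near alpha = 1 approach the KL-divergence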

If a divergence D(P\|Q) is differentiable or strictly convex in the first argument, then by putting \hat{D}(P\|Q)=D(Q\|P), the divergence \hat{D} is differentiable or strictly convex in the second argument. Hence, in the following, we only consider divergences that are differentiable or strictly convex in the second argument.

Definition 3 (Divergence line).

Let D be a differentiable divergence and let P,Q\in\mathcal{P}. The “divergence line” \mathcal{L}(P:Q) is defined by

\mathcal{L}(P:Q):=\{R\in\mathcal{P}\,|\,\mbox{for some }\alpha\in[0,1],\ \exists C(\alpha)\in\mathbb{R},\ (1-\alpha)\frac{\delta D(P\|R)}{\delta r(z)}+\alpha\frac{\delta D(Q\|R)}{\delta r(z)}=C(\alpha)\}. (11)

We also define the set of probability measures on the divergence line at \alpha by

\mathcal{L}_{\alpha}(P,Q):=\{R\in\mathcal{P}\,|\,\exists C\in\mathbb{R},\ (1-\alpha)\frac{\delta D(P\|R)}{\delta r(z)}+\alpha\frac{\delta D(Q\|R)}{\delta r(z)}=C\}. (12)

We show some examples of the divergence lines.

Example 5 (Squared Euclidean distance).
\mathcal{L}(P:Q)=\{R\in\mathcal{P}\,|\,\mbox{for some }\alpha\in[0,1],\ R=(1-\alpha)P+\alpha Q\}. (13)

These are mixture distributions, which form a line segment in Euclidean space.

Example 6 (Kullback-Leibler divergence).

The Kullback-Leibler divergence (KL-divergence or relative entropy) [9] D_{\mathrm{KL}}(P\|Q) is both a Bregman divergence and an f-divergence.

D(P\|Q)=D_{\mathrm{KL}}(P\|Q):=\int p\log\frac{p}{q}\mathrm{d}\mu. (14)

The divergence line is

\mathcal{L}(P:Q)=\{R\in\mathcal{P}\,|\,\mbox{for some }\alpha\in[0,1],\ R=(1-\alpha)P+\alpha Q\}. (15)

For the reverse KL-divergence D(P\|Q)=D_{\mathrm{KL}}(Q\|P)=\int q\log\frac{q}{p}\mathrm{d}\mu,

\mathcal{L}(P:Q)=\{R\in\mathcal{P}\,|\,\mbox{for some }\alpha\in[0,1],\ r:=\frac{dR}{d\mu}=\frac{1}{\int p^{(1-\alpha)}q^{\alpha}\mathrm{d}\mu}p^{(1-\alpha)}q^{\alpha}\}. (16)

These correspond to the m-geodesic and the e-geodesic in information geometry [1].
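The following short sketch (our own illustration with arbitrary discrete densities) computes points on the two divergence lines of Example 6, the mixture path (15) and the normalized geometric-mean path (16):

# Hedged sketch of the m-geodesic (15) and the e-geodesic (16).
import numpy as np

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])

def m_geodesic(alpha):               # mixture path, divergence line (15)
    return (1 - alpha) * p + alpha * q

def e_geodesic(alpha):               # normalized geometric mean, line (16)
    r = p**(1 - alpha) * q**alpha
    return r / r.sum()

for a in (0.0, 0.5, 1.0):
    print(a, m_geodesic(a), e_geodesic(a))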

Definition 4 (Divergence inner product).

Let D be a differentiable divergence and let P,Q,R\in\mathcal{P}. We define the “divergence inner product” by

\langle PQ\|RQ\rangle:=\int(q(z)-r(z))\frac{\delta D(P\|Q)}{\delta q(z)}\mathrm{d}\mu(z). (17)

For the squared Euclidean distance D_{\mathrm{E}}(P\|Q)=\frac{1}{2}\int(q-p)^{2}\mathrm{d}\mu, we have \langle PQ\|RQ\rangle=\int(p-q)(r-q)\mathrm{d}\mu, which is the inner product of the functions p-q and r-q.

Definition 5 (Orthogonal subspace).

Let P,Q\in\mathcal{P}. We define the orthogonal subspace at Q by

\mathcal{O}(P:Q):=\{R\in\mathcal{P}\,|\,\langle PQ\|RQ\rangle=0\}. (18)

Since the divergence inner product is linear with respect to R, the orthogonal subspace is a convex set.

Definition 6 (Divergence ball).

Let P\in\mathcal{P} and let D be a divergence. We define the divergence ball by

\mathcal{B}_{\kappa}(P):=\{Q\in\mathcal{P}\,|\,D(P\|Q)\leq\kappa\} (19)

and the surface of the divergence ball by

\partial\mathcal{B}_{\kappa}(P):=\{Q\in\mathcal{P}\,|\,D(P\|Q)=\kappa\}. (20)

If the divergence is convex, the divergence ball is a convex set from the definition.

III Main results

In this section, we focus on differentiable and strictly convex divergences and show some of their properties. We first show the three-point inequality and some properties of the divergence inner product. Next, we discuss some minimization problems and show the minimizer conditions and the uniqueness of the minimizer, provided that solutions satisfying the minimizer conditions exist.

We prove the following lemma that we use in various proofs.

Lemma 1.

Let D be a strictly convex divergence and let P\neq Q\in\mathcal{P}. Let \lambda\in[0,1] and Q_{\lambda}:=P+(Q-P)\lambda. Then, D(P\|Q_{\lambda}) is strictly convex with respect to \lambda.

Proof.

When \lambda_{1}\neq\lambda_{2}\in[0,1], Q_{\lambda_{1}}\neq Q_{\lambda_{2}} holds since P\neq Q. From the assumed strict convexity of the divergence, for t\in(0,1),

tD(P\|Q_{\lambda_{1}})+(1-t)D(P\|Q_{\lambda_{2}})>D(P\|tQ_{\lambda_{1}}+(1-t)Q_{\lambda_{2}}). (21)

From the definition of Q_{\lambda}, we have tQ_{\lambda_{1}}+(1-t)Q_{\lambda_{2}}=P+(Q-P)(t\lambda_{1}+(1-t)\lambda_{2})=Q_{t\lambda_{1}+(1-t)\lambda_{2}}. By combining this equality and (21), the result follows. ∎
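A quick numerical check of Lemma 1 (a sketch of ours, using the KL-divergence and arbitrary discrete densities): the second differences of D(P\|Q_{\lambda}) along the segment are positive, as strict convexity in \lambda requires.

# Hedged sketch: D_KL(P || Q_lambda) with Q_lambda = P + (Q - P) lambda
# is convex in lambda, as Lemma 1 asserts.
import numpy as np

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.3, 0.5])

def F(lam):
    r = p + (q - p) * lam
    return np.sum(p * np.log(p / r))

vals = np.array([F(l) for l in np.linspace(0.0, 1.0, 11)])
print(np.all(np.diff(vals, 2) > 0))   # positive second differences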

III-A Three-point inequality

In this subsection, we show some geometric properties of differentiable and strictly convex divergences.

Theorem 1 (Three-point inequality).

Let D be a differentiable and strictly convex divergence. Let P,Q,R\in\mathcal{P}.

Then,

D(P\|R)\geq D(P\|Q)-\langle PQ\|RQ\rangle, (22)

where the equality holds if and only if Q=R.

Proof.

Let R_{\lambda}:=Q+(R-Q)\lambda and F(\lambda):=D(P\|R_{\lambda}) with Q\neq R. From the assumption and Lemma 1, F(\lambda) is strictly convex. From the definition of the functional derivative with \phi(z)=r(z)-q(z),

F^{\prime}(\lambda)=\left.\frac{d}{d\epsilon}D(P\|R_{\lambda+\epsilon})\right|_{\epsilon=0}=\int\frac{\delta D(P\|R_{\lambda})}{\delta r_{\lambda}(z)}(r(z)-q(z))\mathrm{d}\mu(z), (23)

where we use r_{\lambda}(z)+\epsilon(r(z)-q(z))=r_{\lambda+\epsilon}(z). From the strict convexity of F(\lambda), for \lambda>0, we have

F(\lambda)>F(0)+F^{\prime}(0)(\lambda-0). (24)

Substituting \lambda=1 into (24) and using F(1)=D(P\|R), F(0)=D(P\|Q) and F^{\prime}(0)=\int\frac{\delta D(P\|Q)}{\delta q(z)}(r(z)-q(z))\mathrm{d}\mu=-\langle PQ\|RQ\rangle, we have

D(P\|R)>D(P\|Q)-\langle PQ\|RQ\rangle. (25)

For Q=R, equality holds in (22) since \langle PQ\|QQ\rangle=0. Hence, we have the result. ∎

For the Bregman divergence, the following three-point identity holds [11]:

D_{\mathrm{B}}(Q\|P)+D_{\mathrm{B}}(R\|Q)=D_{\mathrm{B}}(R\|P)+\int(r-q)(f^{\prime}(p)-f^{\prime}(q))\mathrm{d}\mu. (26)

By using D_{\mathrm{B}}(R\|Q)\geq 0 and the result of Example 2, we obtain the same inequality as (22) by putting D(P\|Q)=D_{\mathrm{B}}(Q\|P).
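A numerical illustration of Theorem 1 (a sketch of ours): for the KL-divergence, \delta D_{\mathrm{KL}}(P\|Q)/\delta q=-p/q, so \langle PQ\|RQ\rangle=\int(r-q)\frac{p}{q}\mathrm{d}\mu, and (22) can be checked directly on arbitrary discrete densities.

# Hedged sketch: check the three-point inequality (22) for D = D_KL.
import numpy as np

rng = np.random.default_rng(0)
for _ in range(5):
    p, q, r = (rng.dirichlet(np.ones(4)) for _ in range(3))
    D_PR = np.sum(p * np.log(p / r))
    D_PQ = np.sum(p * np.log(p / q))
    inner = np.sum((r - q) * p / q)       # <PQ||RQ> for the KL-divergence
    assert D_PR >= D_PQ - inner           # inequality (22)
print("three-point inequality holds on the sampled triples")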

Figure 1 illustrates the three-point inequality.

Figure 1: Three-point inequality.
Proposition 1.

Let D be a differentiable and strictly convex divergence. Let P,Q,S\in\mathcal{P} and R\in\mathcal{L}_{\alpha}(P,Q) for \alpha\in[0,1].

Then,

(1-\alpha)\langle PR\|SR\rangle+\alpha\langle QR\|SR\rangle=0. (27)
Proof.

From the assumption, R satisfies

(1-\alpha)\frac{\delta D(P\|R)}{\delta r(z)}+\alpha\frac{\delta D(Q\|R)}{\delta r(z)}=C. (28)

By multiplying both sides by r(z)-s(z), integrating with respect to z and using \int s\mathrm{d}\mu=\int r\mathrm{d}\mu=1, the result follows. ∎

Corollary 1.

Let D be a differentiable and strictly convex divergence and P\neq Q\in\mathcal{P}.

Then,

\langle PQ\|PQ\rangle>D(P\|Q). (29)
Proof.

By substituting P=R into (22), the result follows. ∎

Proposition 2.

Let D be a differentiable and strictly convex divergence. Let P\in\mathcal{P}, Q,P_{\ast}\in\mathcal{S}\subset\mathcal{P} and P_{\ast}=\mathop{\rm arg~{}min}\limits_{Q\in\mathcal{S}}D(P\|Q).

For all Q\in\mathcal{S},

\langle PP_{\ast}\|QP_{\ast}\rangle\leq 0. (30)
Proof.

Let Q_{\lambda}:=P_{\ast}+(Q-P_{\ast})\lambda and F(\lambda):=D(P\|Q_{\lambda}). From the assumption, F(0) is the minimum value for \lambda\in[0,1]. Hence, F^{\prime}(0)\geq 0 and we have

F^{\prime}(0)=\int\frac{\delta D(P\|P_{\ast})}{\delta p_{\ast}(z)}(q(z)-p_{\ast}(z))\mathrm{d}\mu(z)=-\langle PP_{\ast}\|QP_{\ast}\rangle\geq 0. (31)

From this inequality, the result follows. ∎

This is the same approach used to show the Pythagorean inequality for the KL-divergence [6].

Corollary 2.

Let P,Q\in\mathcal{P} and let R\in\mathcal{L}_{\alpha}(P,Q) for \alpha\in(0,1). Then,

\mathcal{O}(P:R)=\mathcal{O}(Q:R). (32)
Proof.

The result follows from Proposition 1. ∎

III-B Centroids

We consider the minimization problem of the weighted average of differentiable and strictly convex divergences. This problem is important for clustering algorithms, and we show that the minimizer is a generalized centroid and that it is unique, provided that a solution satisfying the generalized centroid condition exists.

Theorem 2 (Minimization of the weighted average of divergences).

Let D be a differentiable and strictly convex divergence. Let P_{i}\in\mathcal{P} (i=1,2,\cdots,N) and let \alpha_{i}\in\mathbb{R} (i=1,2,\cdots,N) be parameters that satisfy \sum_{i}\alpha_{i}=1 and \alpha_{i}\geq 0. Suppose that there exists P_{\ast}\in\mathcal{P} that satisfies

\sum_{i=1}^{N}\alpha_{i}\frac{\delta D(P_{i}\|P_{\ast})}{\delta p_{\ast}(z)}=C, (33)

where C\in\mathbb{R} is the Lagrange multiplier.

Then, the minimizer of the weighted average of divergences

\mathop{\rm arg~{}min}\limits_{R\in\mathcal{P}}\sum_{i=1}^{N}\alpha_{i}D(P_{i}\|R)=P_{\ast} (34)

is unique and P_{\ast} is the unique solution of (33).

Proof.

We first prove (34). Let R\neq P_{\ast} be an arbitrary probability measure. Let R_{\lambda}:=P_{\ast}+(R-P_{\ast})\lambda and F(\lambda)=\sum_{i=1}^{N}\alpha_{i}D(P_{i}\|R_{\lambda}). By differentiating F(\lambda) with respect to \lambda and substituting \lambda=0, it follows that

F^{\prime}(0)=\sum_{i=1}^{N}\alpha_{i}\int\frac{\delta D(P_{i}\|P_{\ast})}{\delta p_{\ast}(z)}(r(z)-p_{\ast}(z))\mathrm{d}\mu(z)=0, (35)

where we use (33), \int r\mathrm{d}\mu=\int p_{\ast}\mathrm{d}\mu=1 and the definition of the functional derivative. From Lemma 1 and \alpha_{i}\in[0,1], F(\lambda) is strictly convex with respect to \lambda. Hence, we have F(1)>F(0)+F^{\prime}(0)(1-0)=F(0). Since R is an arbitrary probability measure, from F(1)=\sum_{i=1}^{N}\alpha_{i}D(P_{i}\|R) and F(0)=\sum_{i=1}^{N}\alpha_{i}D(P_{i}\|P_{\ast}), it follows that P_{\ast} is the unique minimizer. If there exists another probability measure \tilde{P}_{\ast} that is a solution of (33), then \tilde{P}_{\ast} is also a minimizer of (34). This contradicts the fact that P_{\ast} is the unique minimizer. Hence, the result follows. ∎

The equality (33) is the generalized centroid condition.
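As an illustration of Theorem 2 (a hedged sketch of ours), take D=D_{\mathrm{KL}}. Then \delta D_{\mathrm{KL}}(P_{i}\|P_{\ast})/\delta p_{\ast}=-p_{i}/p_{\ast}, and the centroid condition (33) is solved by the mixture P_{\ast}=\sum_{i}\alpha_{i}P_{i} (with C=-1), which the code below compares against random competitors; the specific distributions and weights are arbitrary choices.

# Hedged sketch: for D = D_KL, the mixture solves (33) and minimizes (34).
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.1, 0.8]])       # the P_i as rows
alpha = np.array([0.5, 0.3, 0.2])

def objective(r):
    return sum(a * np.sum(p * np.log(p / r)) for a, p in zip(alpha, P))

p_star = alpha @ P                    # candidate centroid: the mixture
for _ in range(5):                    # compare with random distributions
    r = rng.dirichlet(np.ones(3))
    assert objective(p_star) <= objective(r)
print(p_star, objective(p_star))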

Proposition 3.

Let P\neq Q\in\mathcal{P} and suppose that P_{\ast}\in\mathcal{L}_{\alpha}(P,Q) for \alpha\in[0,1]. Then, \mathcal{L}_{\alpha}(P,Q)=\{P_{\ast}\}, \mathcal{L}_{0}(P,Q)=\{P\} and \mathcal{L}_{1}(P,Q)=\{Q\}.

Proof.

By applying Theorem 2 with N=2 and putting P_{1}=P, P_{2}=Q, \alpha_{1}=1-\alpha and \alpha_{2}=\alpha, it follows that P_{\ast} is the unique solution of

(1-\alpha)\frac{\delta D(P\|P_{\ast})}{\delta p_{\ast}(z)}+\alpha\frac{\delta D(Q\|P_{\ast})}{\delta p_{\ast}(z)}=C, (36)

where C\in\mathbb{R} is the Lagrange multiplier. From the definition of \mathcal{L}_{\alpha}(P,Q), the result \mathcal{L}_{\alpha}(P,Q)=\{P_{\ast}\} follows.

Next, we show that \mathcal{L}_{0}(P,Q)=\{P\}. From Theorem 2, it follows that \mathop{\rm arg~{}min}\limits_{R\in\mathcal{P}}D(P\|R)=P_{\ast}. Since D(P\|R)\geq 0 and P_{\ast} is unique, we have P_{\ast}=P. We obtain the result \mathcal{L}_{1}(P,Q)=\{Q\} in the same way. ∎

We can show the same theorem for real-valued vectors in \mathbb{R}^{d}.

Proposition 4.

Let p_{i}\in\mathbb{R}^{d} (i=1,2,\cdots,N) and let \alpha_{i}\in\mathbb{R} (i=1,2,\cdots,N) be parameters that satisfy \sum_{i}\alpha_{i}=1 and \alpha_{i}\geq 0. Let D:\mathbb{R}^{d}\times\mathbb{R}^{d}\rightarrow[0,\infty) be a differentiable and strictly convex divergence. Suppose that there exists p_{\ast}\in\mathbb{R}^{d} that satisfies

\sum_{i=1}^{N}\alpha_{i}\frac{\partial D(p_{i}\|p_{\ast})}{\partial p_{\ast,\nu}}=0, (37)

where \nu\in\{1,2,\cdots,d\} and \{p_{\ast,\nu}\} are the components of the vector p_{\ast}.

Then, the minimizer of the weighted average of divergences

\mathop{\rm arg~{}min}\limits_{r\in\mathbb{R}^{d}}\sum_{i=1}^{N}\alpha_{i}D(p_{i}\|r)=p_{\ast} (38)

is unique and p_{\ast} is the unique solution of (37).

The proof is the same as that of Theorem 2.

For the Bregman divergence, D_{\mathrm{B}}(p\|q)=f^{*}(q^{*})-f^{*}(p^{*})-\sum_{\nu}\frac{\partial f^{*}(p^{*})}{\partial p^{*}_{\nu}}(q^{*}_{\nu}-p^{*}_{\nu}) holds [12], where f:\mathbb{R}^{d}\rightarrow\mathbb{R} is a differentiable and strictly convex function, f^{*} denotes the Legendre convex conjugate [5] and p^{*}_{\nu}=\frac{\partial f(p)}{\partial p_{\nu}}. Since f^{*}(q^{*})-f^{*}(p^{*})-\sum_{\nu}\frac{\partial f^{*}(p^{*})}{\partial p^{*}_{\nu}}(q^{*}_{\nu}-p^{*}_{\nu})=f(p)-f(q)-\sum_{\nu}\frac{\partial f(q)}{\partial q_{\nu}}(p_{\nu}-q_{\nu})>0 for p\neq q, f^{*} is also strictly convex. Hence, D_{\mathrm{B}}(p\|q) is a differentiable and strictly convex divergence with respect to q^{*}. By combining \frac{\partial f^{*}(q^{*})}{\partial q^{*}_{\nu}}=q_{\nu} and (37), it follows that

\sum_{i=1}^{N}\alpha_{i}\frac{\partial D_{\mathrm{B}}(p_{i}\|p_{\ast})}{\partial p^{*}_{\ast,\nu}}=\sum_{i=1}^{N}\alpha_{i}(p_{\ast,\nu}-p_{i,\nu})=0. (39)

Hence, p_{\ast}=\sum_{i=1}^{N}\alpha_{i}p_{i} is the unique minimizer.
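A short numerical sketch of this fact (ours, assuming f(x)=\sum_{j}x_{j}\log x_{j} on the positive orthant as one concrete choice): the weighted arithmetic mean minimizes the weighted average of Bregman divergences, independently of f.

# Hedged sketch in R^d: the weighted mean minimizes sum_i alpha_i D_B(p_i||r).
import numpy as np

rng = np.random.default_rng(1)
pts = rng.uniform(0.1, 1.0, size=(4, 3))         # points p_i in R^3_{>0}
alpha = np.array([0.4, 0.3, 0.2, 0.1])

def bregman(p, q):                                # D_B for f(x) = sum x log x
    return np.sum(p * np.log(p / q) - p + q)

def objective(r):
    return sum(a * bregman(p, r) for a, p in zip(alpha, pts))

p_star = alpha @ pts                              # weighted arithmetic mean
for _ in range(5):                                # nearby positive competitors
    r = p_star * np.exp(rng.normal(scale=0.1, size=3))
    assert objective(p_star) <= objective(r)
print(p_star, objective(p_star))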

III-C Projection from a probability measure to a set

In this subsection, we discuss minimization problems of the divergence between a probability measure and a set, such as an orthogonal subspace, a divergence ball or a set subject to linear moment-like constraints. We show the minimizer conditions and the uniqueness of the minimizer, provided that solutions satisfying the minimizer conditions exist.

Corollary 3 (Minimization of the divergence between a probability measure and an orthogonal subspace).

Let D be a differentiable and strictly convex divergence and let P,Q\in\mathcal{P}.

Then, the minimizer of the divergence between the probability measure P and the orthogonal subspace \mathcal{O}(P:Q)

\mathop{\rm arg~{}min}\limits_{R\in\mathcal{O}(P:Q)}D(P\|R)=Q (40)

is unique.

Proof.

Since \langle PQ\|RQ\rangle=0 for R\in\mathcal{O}(P:Q), the result follows from Theorem 1. ∎

Theorem 3 (Minimization of the divergence between a probability measure and a divergence ball).

Let D be a differentiable and strictly convex divergence. Let P,Q\in\mathcal{P} and suppose that P_{\ast}\in\partial\mathcal{B}_{\kappa}(P)\cap\mathcal{L}(P:Q).

Then, the minimizer of the divergence between the probability measure Q and the divergence ball \mathcal{B}_{\kappa}(P)

\mathop{\rm arg~{}min}\limits_{R\in\mathcal{B}_{\kappa}(P)}D(Q\|R)=P_{\ast} (41)

is unique.

Proof.

Since the case \kappa=0 is trivial, we consider the case \kappa>0. Consider an arbitrary probability measure R\in\mathcal{B}_{\kappa}(P). From the assumption, there exists \alpha\in[0,1] such that P_{\ast}\in\mathcal{L}_{\alpha}(P,Q), and by Proposition 3, \mathcal{L}_{\alpha}(P,Q)=\{P_{\ast}\}. Since \kappa>0 implies P_{\ast}\neq P, Proposition 3 also gives \alpha>0, so \alpha\in(0,1]. Let R_{\lambda}:=P_{\ast}+(R-P_{\ast})\lambda and F(\lambda)=(1-\alpha)D(P\|R_{\lambda})+\alpha D(Q\|R_{\lambda}).

By differentiating F(\lambda) with respect to \lambda and substituting \lambda=0, it follows that

F^{\prime}(0)=\int\biggl((1-\alpha)\frac{\delta D(P\|P_{\ast})}{\delta p_{\ast}(z)}+\alpha\frac{\delta D(Q\|P_{\ast})}{\delta p_{\ast}(z)}\biggr)(r(z)-p_{\ast}(z))\mathrm{d}\mu(z)=0, (42)

where we use \mathcal{L}_{\alpha}(P,Q)=\{P_{\ast}\}, \int r\mathrm{d}\mu=\int p_{\ast}\mathrm{d}\mu=1 and the definition of the functional derivative. From Lemma 1 and \alpha\in(0,1], F(\lambda) is strictly convex with respect to \lambda. Hence, we have F(1)>F(0)+F^{\prime}(0)(1-0)=F(0). From F(1)=(1-\alpha)D(P\|R)+\alpha D(Q\|R) and F(0)=(1-\alpha)D(P\|P_{\ast})+\alpha D(Q\|P_{\ast}), it follows that

(1-\alpha)D(P\|R)+\alpha D(Q\|R)>(1-\alpha)D(P\|P_{\ast})+\alpha D(Q\|P_{\ast}). (43)

From D(P\|R)\leq\kappa, D(P\|P_{\ast})=\kappa and \alpha>0, it follows that

D(Q\|R)>D(Q\|P_{\ast}). (44)

Since R is an arbitrary probability measure in \mathcal{B}_{\kappa}(P), the result follows. ∎

The next corollary follows from Theorem 3.

Corollary 4.

Let P,Q\in\mathcal{P} and P_{\ast}\in\partial\mathcal{B}_{\kappa_{1}}(P)\cap\partial\mathcal{B}_{\kappa_{2}}(Q)\cap\mathcal{L}(P:Q). Then, \mathcal{B}_{\kappa_{1}}(P)\cap\mathcal{B}_{\kappa_{2}}(Q)=\{P_{\ast}\}.

Figure 2 summarizes Corollaries 2, 3 and 4.

Figure 2: Relation among two divergence balls, a divergence line and an orthogonal subspace.
Theorem 4 (Minimization of the divergence subject to linear moment-like constraint).

Let D be a differentiable and strictly convex divergence and let P\in\mathcal{P}. Let T_{i}(Z) (i=1,2,\cdots,K) be functions of a random variable (vector) Z and let \mathcal{M}:=\{Q\in\mathcal{P}\,|\,\mathrm{E}[T_{i}(Z)]=m_{i}\ (i=1,2,\cdots,K),\ Z\sim Q\}, where \mathrm{E}[\cdot] denotes the expected value and m_{i}\in\mathbb{R} (i=1,2,\cdots,K) are constants.

Suppose that there exists P_{\ast}\in\mathcal{M} that satisfies

\frac{\delta D(P\|P_{\ast})}{\delta p_{\ast}(z)}+\sum_{i=1}^{K}\beta_{i}T_{i}(z)=C, (45)

where \beta_{i}\in\mathbb{R} and C\in\mathbb{R} are the Lagrange multipliers.

Then, the minimizer of the divergence between the probability measure P and the set \mathcal{M}

\mathop{\rm arg~{}min}\limits_{R\in\mathcal{M}}D(P\|R)=P_{\ast} (46)

is unique and P_{\ast} is the unique solution of (45).

Proof.

We use the same technique as in Theorem 2. Consider an arbitrary R\in\mathcal{M}. Let R_{\lambda}:=P_{\ast}+(R-P_{\ast})\lambda and F(\lambda):=D(P\|R_{\lambda})+\sum_{i}\beta_{i}\int T_{i}(z)r_{\lambda}(z)\mathrm{d}\mu(z).

By differentiating F(\lambda) with respect to \lambda and substituting \lambda=0, it follows that

F^{\prime}(0)=\int\biggl(\frac{\delta D(P\|P_{\ast})}{\delta p_{\ast}(z)}+\sum_{i}\beta_{i}T_{i}(z)\biggr)(r(z)-p_{\ast}(z))\mathrm{d}\mu(z)=0, (47)

where we use (45), \int r\mathrm{d}\mu=\int p_{\ast}\mathrm{d}\mu=1 and the definition of the functional derivative. From Lemma 1 and the linearity of \sum_{i}\beta_{i}\int T_{i}(z)r_{\lambda}(z)\mathrm{d}\mu(z) with respect to \lambda, F(\lambda) is strictly convex with respect to \lambda. Hence, we have F(1)>F(0)+F^{\prime}(0)(1-0)=F(0). From F(1)=D(P\|R)+\sum_{i}\beta_{i}\int T_{i}(z)r(z)\mathrm{d}\mu(z) and F(0)=D(P\|P_{\ast})+\sum_{i}\beta_{i}\int T_{i}(z)p_{\ast}(z)\mathrm{d}\mu(z), it follows that

D(P\|R)+\sum_{i}\beta_{i}\int T_{i}(z)r(z)\mathrm{d}\mu(z)>D(P\|P_{\ast})+\sum_{i}\beta_{i}\int T_{i}(z)p_{\ast}(z)\mathrm{d}\mu(z). (48)

From \int T_{i}(z)r(z)\mathrm{d}\mu(z)=\int T_{i}(z)p_{\ast}(z)\mathrm{d}\mu(z)=m_{i}, it follows that

D(P\|R)>D(P\|P_{\ast}). (49)

Since R is an arbitrary probability measure in \mathcal{M}, it follows that P_{\ast} is the unique minimizer. If there exists another probability measure \tilde{P}_{\ast} that is a solution of (45), then \tilde{P}_{\ast} is also a minimizer of (46). This contradicts the fact that P_{\ast} is the unique minimizer. Hence, the result follows. ∎
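As a concrete illustration of Theorem 4 (a hedged sketch of ours), take the reverse KL-divergence D(P\|Q)=D_{\mathrm{KL}}(Q\|P), which is differentiable and strictly convex in the second argument. Its functional derivative is \log(p_{\ast}/p)+1, so condition (45) gives the exponential tilting p_{\ast}\propto p\,e^{-\sum_{i}\beta_{i}T_{i}}; for a single constraint \mathrm{E}[Z]=m the multiplier can be found by bisection. The grid, base density and constraint value below are arbitrary choices for the illustration.

# Hedged sketch: projection of a standard Gaussian onto {E[Z] = m} for the
# reverse KL-divergence; condition (45) yields p_* proportional to p exp(-beta z).
import numpy as np

dz = 0.01
z = np.arange(-6.0, 6.0, dz)
p = np.exp(-z**2 / 2); p /= p.sum() * dz        # base density of P
T, m = z, 1.0                                   # single constraint E[Z] = 1

def tilt(beta):
    w = p * np.exp(-beta * T)
    return w / (w.sum() * dz)

def moment(beta):
    return np.sum(T * tilt(beta)) * dz

lo, hi = -20.0, 20.0                            # moment(beta) decreases in beta
for _ in range(80):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if moment(mid) > m else (lo, mid)

p_star = tilt(0.5 * (lo + hi))
print(np.sum(T * p_star) * dz)                  # approximately m
print(np.sum(p_star * np.log(p_star / p)) * dz) # projected divergence value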

IV Summary

We have discussed minimization problems and geometric properties of differentiable and strictly convex divergences. We have derived the three-point inequality and introduced divergence lines, inner products, balls and orthogonal subspaces, which generalize line segments, inner products, spheres and planes perpendicular to a line in Euclidean space.

Furthermore, we have shown the minimizer conditions and the uniqueness of the minimizer, provided that solutions satisfying the minimizer conditions exist, in the following cases:

1) Minimization of the weighted average of divergences from multiple probability measures.

2) Minimization of the divergence between a probability measure and an orthogonal subspace, a divergence ball, or a set subject to linear moment-like constraints.

References

  • [1] Shun-ichi Amari and Andrzej Cichocki. Information geometry of divergence functions. Bulletin of the Polish Academy of Sciences: Technical Sciences, 58(1):183–195, 2010.
  • [2] Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6(Oct):1705–1749, 2005.
  • [3] Lev M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200–217, 1967.
  • [4] Thomas Breuer and Imre Csiszár. Measuring distribution model risk. Mathematical Finance, 26(2):395–411, 2016.
  • [5] Yair Censor and Stavros Andrea Zenios. Parallel Optimization: Theory, Algorithms, and Applications. Oxford University Press, 1997.
  • [6] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
  • [7] Imre Csiszár. Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2:229–318, 1967.
  • [8] Imre Csiszár and Frantisek Matus. Information projections revisited. IEEE Transactions on Information Theory, 49(6):1474–1490, 2003.
  • [9] Solomon Kullback and Richard A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951.
  • [10] Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
  • [11] Frank Nielsen, Jean-Daniel Boissonnat, and Richard Nock. On Bregman Voronoi diagrams. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 746–755. Society for Industrial and Applied Mathematics, 2007.
  • [12] Frank Nielsen and Sylvain Boltz. The Burbea-Rao and Bhattacharyya centroids. IEEE Transactions on Information Theory, 57(8):5455–5466, 2011.
  • [13] Tomohiro Nishiyama. Monotonically decreasing sequence of divergences. arXiv preprint arXiv:1910.00402, 2019.
  • [14] Alfréd Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. The Regents of the University of California, 1961.
  • [15] Igal Sason and Sergio Verdú. f-divergence inequalities. IEEE Transactions on Information Theory, 62(11):5973–6006, 2016.
  • [16] Tim van Erven and Peter Harremos. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820, 2014.