
Minimization Problems on Strictly Convex Divergences

Tomohiro Nishiyama
Email: htam0ybboh@gmail.com
Abstract

The divergence minimization problem plays an important role in various fields. In this note, we focus on differentiable and strictly convex divergences. For some minimization problems, we show the minimizer conditions and the uniqueness of the minimizer without assuming a specific form of the divergence. Furthermore, we show geometric properties related to the minimization problems.

Keywords: convex, minimization problem, projection, centroid, Bregman divergence, f-divergence, Rényi-divergence, Lagrange multiplier.

I Introduction

Divergences are quantities that measure the discrepancy between probability measures. For two probability measures P and Q, a divergence satisfies the following properties.

D(P\|Q)\geq 0, with equality if and only if P=Q.

In particular, the f-divergence [15, 7], the Bregman divergence [3] and the Rényi divergence [16, 14] are often used in various fields such as machine learning, image processing, statistical physics and finance.

In order to find the probability measure closest (in the sense of a divergence) to a target probability measure subject to some constraints, it is necessary to solve divergence minimization problems, and there are many works on this topic [8, 4, 2]. Divergence minimization problems are also deeply related to geometric properties of divergences, such as the projection from a probability measure onto a set.

The main purpose of this note is to study minimization problems and geometric properties of divergences that are differentiable and strictly convex in the first or the second argument [13]. For example, the squared Euclidean distance is differentiable and strictly convex. The most important result is that the minimizer conditions and the uniqueness of the minimizer can be derived from these assumptions alone, without specifying the form of the divergence, provided that solutions satisfying the minimizer conditions exist. The minimizer conditions are consistent with the results of the method of Lagrange multipliers.

First, we introduce divergence lines, balls, inner products and orthogonal subspaces, which generalize line segments, spheres, inner products and planes perpendicular to lines in Euclidean space, respectively. Furthermore, we show the three-point inequality as a basic geometric property and show some properties of the divergence inner product.

Next, we discuss the minimization problem of the weighted average of divergences from several probability measures, which is important in clustering algorithms such as k-means clustering [10]. We show that the minimizer of the weighted average of divergences is the generalized centroid, as in the Euclidean case.

Finally, we discuss minimization problems between a probability measure P and a set \mathcal{S}. These are interpreted as a projection from P onto \mathcal{S}. In Euclidean space, the minimum distance from a point P\in\mathbb{R}^{3} to a plane is attained at the foot of the perpendicular, and the minimum distance to a sphere is attained at the intersection of the sphere and the line connecting P and the center of the sphere. We show that the minimizer of the divergence between a probability measure and a divergence ball or an orthogonal subspace has similar properties to the Euclidean case.

II Preliminaries

This section provides the definitions and notation used in this note. Let \mathcal{P} denote the set of probability measures. For P,Q\in\mathcal{P}, P=Q means P=Q a.s. Let \mu be a dominating measure of \mathcal{P} (P\ll\mu) and let p:=\frac{dP}{d\mu} be the density of P.

Divergences are defined as functions that satisfy the following properties.

Let D:\mathcal{P}\times\mathcal{P}\rightarrow[0,\infty). For any P,Q\in\mathcal{P},

D(P\|Q)\geq 0,
D(P\|Q)=0\iff P=Q.
Definition 1 (Strictly convex divergence).

Let P,Q,R\in\mathcal{P} with Q\neq R. Let t\in(0,1) and let D be a divergence. The divergence D(P\|Q) is strictly convex in the second argument if

(1-t)D(P\|Q)+tD(P\|R)>D(P\|(1-t)Q+tR). (1)
Definition 2 (Differentiable divergence).

Let P,Q\in\mathcal{P} and let D be a divergence. The divergence D(P\|Q) is differentiable with respect to the second argument if D(P\|Q) is a functional of q=\frac{dQ}{d\mu} and the functional derivative with respect to q exists. Let D[q]:=D(P\|Q) denote this functional of q. The functional derivative of D(P\|Q) with respect to q is defined by

\int\frac{\delta D(P\|Q)}{\delta q(z)}\phi(z)\mathrm{d}\mu(z):=\left.\frac{d}{d\epsilon}D[q+\epsilon\phi]\right|_{\epsilon=0}, (2)

where \phi is an arbitrary function.

Strict convexity and differentiability in the first argument are defined in the same way as for the second argument. We show some examples of the functional derivative.

Example 1 (Squared Euclidean distance).

Let D_{\mathrm{E}}(P\|Q):=\frac{1}{2}\int(q-p)^{2}\mathrm{d}\mu. The functional derivative is

\frac{\delta D_{\mathrm{E}}(P\|Q)}{\delta q(z)}=q(z)-p(z). (3)
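As a quick illustration of Definition 2 and Example 1, the following sketch (ours, not part of the original derivation) checks the functional derivative numerically; it assumes a finite grid whose spacing dz plays the role of the dominating measure \mu, and the densities and the test function \phi are arbitrary choices for the check.

# Minimal numerical sketch (assumption: finite grid, spacing dz acts as mu).
# It compares the two sides of definition (2) for the squared Euclidean
# distance of Example 1, whose functional derivative is q - p.
import numpy as np

dz = 0.01
z = np.arange(0.0, 1.0, dz)
p = np.exp(-(z - 0.3)**2 / 0.02); p /= p.sum() * dz   # density of P
q = np.exp(-(z - 0.6)**2 / 0.05); q /= q.sum() * dz   # density of Q
phi = np.sin(2 * np.pi * z)                           # arbitrary test function

def D_E(q_):
    return 0.5 * np.sum((q_ - p)**2) * dz

eps = 1e-6
lhs = (D_E(q + eps * phi) - D_E(q - eps * phi)) / (2 * eps)  # d/d(eps) at 0
rhs = np.sum((q - p) * phi) * dz                             # int (q-p) phi dmu
print(lhs, rhs)  # the two values agree up to discretization error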
Example 2 (Bregman divergence).

Let f:\mathbb{R}\rightarrow\mathbb{R} be a differentiable and strictly convex function. The Bregman divergence is defined by

D_{\mathrm{B}}(P\|Q):=\int f(p)\mathrm{d}\mu-\int f(q)\mathrm{d}\mu-\int f^{\prime}(q)(p-q)\mathrm{d}\mu, (4)

where f^{\prime}(x) denotes the derivative of f. The Bregman divergence is strictly convex in the first argument. The functional derivative is

\frac{\delta D_{\mathrm{B}}(P\|Q)}{\delta p(z)}=f^{\prime}(p(z))-f^{\prime}(q(z)). (5)
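As a sanity check of Example 2 (a sketch of ours, with arbitrary discrete densities on a grid): the Bregman divergence (4) with f(x)=x\log x reduces to the KL-divergence, since f^{\prime}(x)=\log x+1 and \int(p-q)\mathrm{d}\mu=0.

# Hedged sketch: Bregman divergence (4) with f(x) = x log x equals KL.
import numpy as np

dz = 0.01
z = np.arange(dz, 1.0, dz)
p = z**2;            p /= p.sum() * dz
q = np.ones_like(z); q /= q.sum() * dz

f  = lambda x: x * np.log(x)
df = lambda x: np.log(x) + 1.0

D_B  = np.sum(f(p) - f(q) - df(q) * (p - q)) * dz
D_KL = np.sum(p * np.log(p / q)) * dz
print(D_B, D_KL)  # both values coincide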
Example 3 (f-divergence).

Let f:\mathbb{R}\rightarrow\mathbb{R} be a strictly convex function with f(1)=0. The f-divergence is defined by

D_{f}(P\|Q):=\int qf\biggl(\frac{p}{q}\biggr)\mathrm{d}\mu. (6)

The f-divergence is strictly convex in the first and the second argument. If f is differentiable, the functional derivatives are

\frac{\delta D_{f}(P\|Q)}{\delta q(z)}=\tilde{f}^{\prime}\biggl(\frac{q(z)}{p(z)}\biggr), (7)

where \tilde{f}(x):=xf\bigl(\frac{1}{x}\bigr), and

\frac{\delta D_{f}(P\|Q)}{\delta p(z)}=f^{\prime}\biggl(\frac{p(z)}{q(z)}\biggr). (8)
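For the KL case f(x)=x\log x one has \tilde{f}(x)=-\log x, so (7) gives \delta D_{f}/\delta q=-p/q. The following sketch (our own illustration, with arbitrary grid densities) checks this against a finite-difference approximation of the unconstrained derivative.

# Hedged numerical check of (7) for f(x) = x log x (so f~'(x) = -1/x).
import numpy as np

dz = 0.01
z = np.arange(dz, 1.0, dz)
p = np.exp(-(z - 0.4)**2 / 0.03); p /= p.sum() * dz
q = np.exp(-(z - 0.7)**2 / 0.04); q /= q.sum() * dz
phi = np.cos(2 * np.pi * z)          # arbitrary test function

def D_f(q_):                         # f-divergence with f(x) = x log x
    return np.sum(q_ * (p / q_) * np.log(p / q_)) * dz

eps = 1e-6
lhs = (D_f(q + eps * phi) - D_f(q - eps * phi)) / (2 * eps)
rhs = np.sum(-(p / q) * phi) * dz    # int f~'(q/p) phi dmu
print(lhs, rhs)                      # agree up to discretization error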
Example 4 (Rényi-divergence).

For 0<\alpha<\infty, the Rényi divergence is defined by

D_{\alpha}(P\|Q):=\frac{1}{\alpha-1}\log\int p^{\alpha}q^{1-\alpha}\mathrm{d}\mu\mbox{ for }\alpha\neq 1, (9)
D_{1}(P\|Q):=\int p\log\frac{p}{q}\mathrm{d}\mu.

The Rényi divergence is strictly convex in the second argument for 0<\alpha<\infty (see [16]). The functional derivative is

\frac{\delta D_{\alpha}(P\|Q)}{\delta q(z)}=-\frac{1}{\int p^{\alpha}q^{1-\alpha}\mathrm{d}\mu}\biggl(\frac{p(z)}{q(z)}\biggr)^{\alpha}. (10)
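As a small numerical illustration (ours, with two grid-discretized Gaussian densities chosen arbitrarily), D_{\alpha} in (9) approaches D_{1} as \alpha\rightarrow 1:

# Hedged sketch: Renyi divergence (9) on a grid; alpha -> 1 recovers D_1 = KL.
import numpy as np

dz = 0.005
z = np.arange(-5.0, 5.0, dz)
p = np.exp(-(z - 0.0)**2 / 2); p /= p.sum() * dz
q = np.exp(-(z - 1.0)**2 / 2); q /= q.sum() * dz

def renyi(alpha):
    if alpha == 1.0:
        return np.sum(p * np.log(p / q)) * dz
    return np.log(np.sum(p**alpha * q**(1 - alpha)) * dz) / (alpha - 1)

for a in (0.5, 0.9, 0.99, 1.0, 1.01, 2.0):
    print(a, renyi(a))   # values near alpha = 1 approach the KL-divergence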

If a divergence D(P\|Q) is differentiable or strictly convex in the first argument, then by putting \hat{D}(P\|Q)=D(Q\|P), the divergence \hat{D} is differentiable or strictly convex in the second argument. Hence, in the following, we only consider divergences that are differentiable or strictly convex in the second argument.

Definition 3 (Divergence line).

Let D be a differentiable divergence and let P,Q\in\mathcal{P}. The “divergence line” \mathcal{L}(P:Q) is defined by

\mathcal{L}(P:Q):=\{R\in\mathcal{P}\,|\,\mbox{for some }\alpha\in[0,1],\ \exists C(\alpha)\in\mathbb{R},\ (1-\alpha)\frac{\delta D(P\|R)}{\delta r(z)}+\alpha\frac{\delta D(Q\|R)}{\delta r(z)}=C(\alpha)\}. (11)

We also define the set of probability measures on the divergence line at \alpha by

\mathcal{L}_{\alpha}(P,Q):=\{R\in\mathcal{P}\,|\,\exists C\in\mathbb{R},\ (1-\alpha)\frac{\delta D(P\|R)}{\delta r(z)}+\alpha\frac{\delta D(Q\|R)}{\delta r(z)}=C\}. (12)

We show some examples of the divergence lines.

Example 5 (Squared Euclidean distance).
\mathcal{L}(P:Q)=\{R\in\mathcal{P}\,|\,\mbox{for some }\alpha\in[0,1],\ R=(1-\alpha)P+\alpha Q\}. (13)

These are mixture distributions, which form a line segment in Euclidean space.

Example 6 (Kullback-Leibler divergence).

The Kullback-Leibler divergence (KL-divergence or relative entropy) [9] D_{\mathrm{KL}}(P\|Q) is both a Bregman divergence and an f-divergence.

D(P\|Q)=D_{\mathrm{KL}}(P\|Q):=\int p\log\frac{p}{q}\mathrm{d}\mu. (14)

The divergence line is

\mathcal{L}(P:Q)=\{R\in\mathcal{P}\,|\,\mbox{for some }\alpha\in[0,1],\ R=(1-\alpha)P+\alpha Q\}. (15)

For the reverse KL-divergence D(P\|Q)=D_{\mathrm{KL}}(Q\|P)=\int q\log\frac{q}{p}\mathrm{d}\mu,

\mathcal{L}(P:Q)=\{R\in\mathcal{P}\,|\,\mbox{for some }\alpha\in[0,1],\ r:=\frac{dR}{d\mu}=\frac{1}{\int p^{(1-\alpha)}q^{\alpha}\mathrm{d}\mu}p^{(1-\alpha)}q^{\alpha}\}. (16)

These correspond to the m-geodesic and the e-geodesic in information geometry [1].
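The following short sketch (our own illustration with arbitrary discrete densities) computes points on the two divergence lines of Example 6, the mixture path (15) and the normalized geometric-mean path (16):

# Hedged sketch of the m-geodesic (15) and the e-geodesic (16).
import numpy as np

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])

def m_geodesic(alpha):               # mixture path, divergence line (15)
    return (1 - alpha) * p + alpha * q

def e_geodesic(alpha):               # normalized geometric mean, line (16)
    r = p**(1 - alpha) * q**alpha
    return r / r.sum()

for a in (0.0, 0.5, 1.0):
    print(a, m_geodesic(a), e_geodesic(a))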

Definition 4 (Divergence inner product).

Let D be a differentiable divergence and let P,Q,R\in\mathcal{P}. We define the “divergence inner product” by

\langle PQ\|RQ\rangle:=\int(q(z)-r(z))\frac{\delta D(P\|Q)}{\delta q(z)}\mathrm{d}\mu(z). (17)

For the squared Euclidean distance D_{\mathrm{E}}(P\|Q)=\frac{1}{2}\int(q-p)^{2}\mathrm{d}\mu, we have \langle PQ\|RQ\rangle=\int(p-q)(r-q)\mathrm{d}\mu, which is the inner product of the functions p-q and r-q.

Definition 5 (Orthogonal subspace).

Let P,Q\in\mathcal{P}. We define the orthogonal subspace at Q by

\mathcal{O}(P:Q):=\{R\in\mathcal{P}\,|\,\langle PQ\|RQ\rangle=0\}. (18)

Since the divergence inner product is linear with respect to R, the orthogonal subspace is a convex set.

Definition 6 (Divergence ball).

Let P\in\mathcal{P} and let D be a divergence. We define the divergence ball by

\mathcal{B}_{\kappa}(P):=\{Q\in\mathcal{P}\,|\,D(P\|Q)\leq\kappa\} (19)

and the surface of the divergence ball by

\partial\mathcal{B}_{\kappa}(P):=\{Q\in\mathcal{P}\,|\,D(P\|Q)=\kappa\}. (20)

If the divergence is convex, the divergence ball is a convex set from the definition.

III Main results

In this section, we focus on differentiable and strictly convex divergences and show some of their properties. We first show the three-point inequality and some properties of the divergence inner product. Next, we discuss some minimization problems and show the minimizer conditions and the uniqueness of the minimizer, provided that solutions satisfying the minimizer conditions exist.

We prove the following lemma that we use in various proofs.

Lemma 1.

Let D be a strictly convex divergence and let P\neq Q\in\mathcal{P}. Let \lambda\in[0,1] and Q_{\lambda}:=P+(Q-P)\lambda. Then, D(P\|Q_{\lambda}) is strictly convex with respect to \lambda.

Proof.

When \lambda_{1}\neq\lambda_{2}\in[0,1], Q_{\lambda_{1}}\neq Q_{\lambda_{2}} holds since P\neq Q. From the assumed strict convexity of the divergence, for t\in(0,1),

tD(P\|Q_{\lambda_{1}})+(1-t)D(P\|Q_{\lambda_{2}})>D(P\|tQ_{\lambda_{1}}+(1-t)Q_{\lambda_{2}}). (21)

From the definition of Q_{\lambda}, we have tQ_{\lambda_{1}}+(1-t)Q_{\lambda_{2}}=P+(Q-P)(t\lambda_{1}+(1-t)\lambda_{2})=Q_{t\lambda_{1}+(1-t)\lambda_{2}}. By combining this equality and (21), the result follows. ∎
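A quick numerical check of Lemma 1 (a sketch of ours, using the KL-divergence and arbitrary discrete densities): the second differences of D(P\|Q_{\lambda}) along the segment are positive, as strict convexity in \lambda requires.

# Hedged sketch: D_KL(P || Q_lambda) with Q_lambda = P + (Q - P) lambda
# is convex in lambda, as Lemma 1 asserts.
import numpy as np

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.3, 0.5])

def F(lam):
    r = p + (q - p) * lam
    return np.sum(p * np.log(p / r))

vals = np.array([F(l) for l in np.linspace(0.0, 1.0, 11)])
print(np.all(np.diff(vals, 2) > 0))   # positive second differences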

III-A Three-point inequality

In this subsection, we show some geometric properties of differentiable and strictly convex divergences.

Theorem 1 (Three-point inequality).

Let D be a differentiable and strictly convex divergence. Let P,Q,R\in\mathcal{P}.

Then,

D(P\|R)\geq D(P\|Q)-\langle PQ\|RQ\rangle, (22)

where the equality holds if and only if Q=R.

Proof.

Let R_{\lambda}:=Q+(R-Q)\lambda and F(\lambda):=D(P\|R_{\lambda}) with Q\neq R. From the assumption and Lemma 1, F(\lambda) is strictly convex. From the definition of the functional derivative with \phi(z)=r(z)-q(z),

F^{\prime}(\lambda)=\left.\frac{d}{d\epsilon}D(P\|R_{\lambda+\epsilon})\right|_{\epsilon=0}=\int\frac{\delta D(P\|R_{\lambda})}{\delta r_{\lambda}(z)}(r(z)-q(z))\mathrm{d}\mu(z), (23)

where we use r_{\lambda}(z)+\epsilon(r(z)-q(z))=r_{\lambda+\epsilon}(z). From the strict convexity of F(\lambda), for \lambda>0, we have

F(\lambda)>F(0)+F^{\prime}(0)(\lambda-0). (24)

Substituting \lambda=1 into (24) and using F(1)=D(P\|R), F(0)=D(P\|Q) and F^{\prime}(0)=\int\frac{\delta D(P\|Q)}{\delta q(z)}(r(z)-q(z))\mathrm{d}\mu=-\langle PQ\|RQ\rangle, we have

D(P\|R)>D(P\|Q)-\langle PQ\|RQ\rangle. (25)

For Q=R, equality holds in (22) since \langle PQ\|QQ\rangle=0. Hence, we have the result. ∎

For the Bregman divergence, the following three-point identity holds [11]:

D_{\mathrm{B}}(Q\|P)+D_{\mathrm{B}}(R\|Q)=D_{\mathrm{B}}(R\|P)+\int(r-q)(f^{\prime}(p)-f^{\prime}(q))\mathrm{d}\mu. (26)

By using D_{\mathrm{B}}(R\|Q)\geq 0 and the result of Example 2, we obtain the same inequality as (22) by putting D(P\|Q)=D_{\mathrm{B}}(Q\|P).
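A numerical illustration of Theorem 1 (a sketch of ours): for the KL-divergence, \delta D_{\mathrm{KL}}(P\|Q)/\delta q=-p/q, so \langle PQ\|RQ\rangle=\int(r-q)\frac{p}{q}\mathrm{d}\mu, and (22) can be checked directly on arbitrary discrete densities.

# Hedged sketch: check the three-point inequality (22) for D = D_KL.
import numpy as np

rng = np.random.default_rng(0)
for _ in range(5):
    p, q, r = (rng.dirichlet(np.ones(4)) for _ in range(3))
    D_PR = np.sum(p * np.log(p / r))
    D_PQ = np.sum(p * np.log(p / q))
    inner = np.sum((r - q) * p / q)       # <PQ||RQ> for the KL-divergence
    assert D_PR >= D_PQ - inner           # inequality (22)
print("three-point inequality holds on the sampled triples")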

Figure 1 illustrates the three-point inequality.

Figure 1: Three-point inequality.
Proposition 1.

Let D be a differentiable and strictly convex divergence. Let P,Q,S\in\mathcal{P} and R\in\mathcal{L}_{\alpha}(P,Q) for \alpha\in[0,1].

Then,

(1-\alpha)\langle PR\|SR\rangle+\alpha\langle QR\|SR\rangle=0. (27)
Proof.

From the assumption, R satisfies

(1-\alpha)\frac{\delta D(P\|R)}{\delta r(z)}+\alpha\frac{\delta D(Q\|R)}{\delta r(z)}=C. (28)

By multiplying both sides by r(z)-s(z), integrating with respect to z and using \int s\mathrm{d}\mu=\int r\mathrm{d}\mu=1, the result follows. ∎

Corollary 1.

Let D be a differentiable and strictly convex divergence and P\neq Q\in\mathcal{P}.

Then,

\langle PQ\|PQ\rangle>D(P\|Q). (29)
Proof.

By substituting P=R into (22), the result follows. ∎

Proposition 2.

Let D be a differentiable and strictly convex divergence. Let P\in\mathcal{P}, Q,P_{\ast}\in\mathcal{S}\subset\mathcal{P} and P_{\ast}=\mathop{\rm arg~{}min}\limits_{Q\in\mathcal{S}}D(P\|Q).

For all Q\in\mathcal{S},

\langle PP_{\ast}\|QP_{\ast}\rangle\leq 0. (30)
Proof.

Let Q_{\lambda}:=P_{\ast}+(Q-P_{\ast})\lambda and F(\lambda):=D(P\|Q_{\lambda}). From the assumption, F(0) is the minimum value for \lambda\in[0,1]. Hence, F^{\prime}(0)\geq 0 and we have

F^{\prime}(0)=\int\frac{\delta D(P\|P_{\ast})}{\delta p_{\ast}(z)}(q(z)-p_{\ast}(z))\mathrm{d}\mu(z)=-\langle PP_{\ast}\|QP_{\ast}\rangle\geq 0. (31)

From this inequality, the result follows. ∎

This is the same approach used to show the Pythagorean inequality for the KL-divergence [6].

Corollary 2.

Let P,Q\in\mathcal{P} and let R\in\mathcal{L}_{\alpha}(P,Q) for \alpha\in(0,1). Then,

\mathcal{O}(P:R)=\mathcal{O}(Q:R). (32)
Proof.

The result follows from Proposition 1. ∎

III-B Centroids

We consider the minimization problem of the weighted average of differentiable and strictly convex divergences. This problem is important for clustering algorithms, and we show that the minimizer is a generalized centroid and that it is unique, provided that a solution satisfying the generalized centroid condition exists.

Theorem 2 (Minimization of the weighted average of divergences).

Let D be a differentiable and strictly convex divergence. Let P_{i}\in\mathcal{P} (i=1,2,\cdots,N) and let \alpha_{i}\in\mathbb{R} (i=1,2,\cdots,N) be parameters that satisfy \sum_{i}\alpha_{i}=1 and \alpha_{i}\geq 0. Suppose that there exists P_{\ast}\in\mathcal{P} that satisfies

\sum_{i=1}^{N}\alpha_{i}\frac{\delta D(P_{i}\|P_{\ast})}{\delta p_{\ast}(z)}=C, (33)

where C\in\mathbb{R} is the Lagrange multiplier.

Then, the minimizer of the weighted average of divergences

\mathop{\rm arg~{}min}\limits_{R\in\mathcal{P}}\sum_{i=1}^{N}\alpha_{i}D(P_{i}\|R)=P_{\ast} (34)

is unique and P_{\ast} is the unique solution of (33).

Proof.

We first prove (34). Let R\neq P_{\ast} be an arbitrary probability measure. Let R_{\lambda}:=P_{\ast}+(R-P_{\ast})\lambda and F(\lambda)=\sum_{i=1}^{N}\alpha_{i}D(P_{i}\|R_{\lambda}). By differentiating F(\lambda) with respect to \lambda and substituting \lambda=0, it follows that

F^{\prime}(0)=\sum_{i=1}^{N}\alpha_{i}\int\frac{\delta D(P_{i}\|P_{\ast})}{\delta p_{\ast}(z)}(r(z)-p_{\ast}(z))\mathrm{d}\mu(z)=0, (35)

where we use (33), \int r\mathrm{d}\mu=\int p_{\ast}\mathrm{d}\mu=1 and the definition of the functional derivative. From Lemma 1 and \alpha_{i}\in[0,1], F(\lambda) is strictly convex with respect to \lambda. Hence, we have F(1)>F(0)+F^{\prime}(0)(1-0)=F(0). Since R is an arbitrary probability measure, from F(1)=\sum_{i=1}^{N}\alpha_{i}D(P_{i}\|R) and F(0)=\sum_{i=1}^{N}\alpha_{i}D(P_{i}\|P_{\ast}), it follows that P_{\ast} is the unique minimizer. If there exists another probability measure \tilde{P}_{\ast} that is a solution of (33), then \tilde{P}_{\ast} is also a minimizer of (34). This contradicts the fact that P_{\ast} is the unique minimizer. Hence, the result follows. ∎

The equality (33) is the generalized centroid condition.
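As an illustration of Theorem 2 (a hedged sketch of ours), take D=D_{\mathrm{KL}}. Then \delta D_{\mathrm{KL}}(P_{i}\|P_{\ast})/\delta p_{\ast}=-p_{i}/p_{\ast}, and the centroid condition (33) is solved by the mixture P_{\ast}=\sum_{i}\alpha_{i}P_{i} (with C=-1), which the code below compares against random competitors; the specific distributions and weights are arbitrary choices.

# Hedged sketch: for D = D_KL, the mixture solves (33) and minimizes (34).
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.1, 0.8]])       # the P_i as rows
alpha = np.array([0.5, 0.3, 0.2])

def objective(r):
    return sum(a * np.sum(p * np.log(p / r)) for a, p in zip(alpha, P))

p_star = alpha @ P                    # candidate centroid: the mixture
for _ in range(5):                    # compare with random distributions
    r = rng.dirichlet(np.ones(3))
    assert objective(p_star) <= objective(r)
print(p_star, objective(p_star))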

Proposition 3.

Let P\neq Q\in\mathcal{P} and suppose that P_{\ast}\in\mathcal{L}_{\alpha}(P,Q) for \alpha\in[0,1]. Then, \mathcal{L}_{\alpha}(P,Q)=\{P_{\ast}\}, \mathcal{L}_{0}(P,Q)=\{P\} and \mathcal{L}_{1}(P,Q)=\{Q\}.

Proof.

By applying Theorem 2 with N=2 and putting P_{1}=P, P_{2}=Q, \alpha_{1}=1-\alpha and \alpha_{2}=\alpha, it follows that P_{\ast} is the unique solution of

(1-\alpha)\frac{\delta D(P\|P_{\ast})}{\delta p_{\ast}(z)}+\alpha\frac{\delta D(Q\|P_{\ast})}{\delta p_{\ast}(z)}=C, (36)

where C\in\mathbb{R} is the Lagrange multiplier. From the definition of \mathcal{L}_{\alpha}(P,Q), the result \mathcal{L}_{\alpha}(P,Q)=\{P_{\ast}\} follows.

Next, we show that \mathcal{L}_{0}(P,Q)=\{P\}. From Theorem 2, it follows that \mathop{\rm arg~{}min}\limits_{R\in\mathcal{P}}D(P\|R)=P_{\ast}. Since D(P\|R)\geq 0 and P_{\ast} is unique, we have P_{\ast}=P. We obtain the result \mathcal{L}_{1}(P,Q)=\{Q\} in the same way. ∎

We can show the same theorem for real-valued vectors in \mathbb{R}^{d}.

Proposition 4.

Let p_{i}\in\mathbb{R}^{d} (i=1,2,\cdots,N) and let \alpha_{i}\in\mathbb{R} (i=1,2,\cdots,N) be parameters that satisfy \sum_{i}\alpha_{i}=1 and \alpha_{i}\geq 0. Let D:\mathbb{R}^{d}\times\mathbb{R}^{d}\rightarrow[0,\infty) be a differentiable and strictly convex divergence. Suppose that there exists p_{\ast}\in\mathbb{R}^{d} that satisfies

\sum_{i=1}^{N}\alpha_{i}\frac{\partial D(p_{i}\|p_{\ast})}{\partial p_{\ast,\nu}}=0, (37)

where \nu\in\{1,2,\cdots,d\} and \{p_{\ast,\nu}\} are the components of the vector p_{\ast}.

Then, the minimizer of the weighted average of divergences

\mathop{\rm arg~{}min}\limits_{r\in\mathbb{R}^{d}}\sum_{i=1}^{N}\alpha_{i}D(p_{i}\|r)=p_{\ast} (38)

is unique and p_{\ast} is the unique solution of (37).

The proof is the same as that of Theorem 2.

For the Bregman divergence, D_{\mathrm{B}}(p\|q)=f^{*}(q^{*})-f^{*}(p^{*})-\sum_{\nu}\frac{\partial f^{*}(p^{*})}{\partial p^{*}_{\nu}}(q^{*}_{\nu}-p^{*}_{\nu}) holds [12], where f:\mathbb{R}^{d}\rightarrow\mathbb{R} is a differentiable and strictly convex function, f^{*} denotes the Legendre convex conjugate [5] and p^{*}_{\nu}=\frac{\partial f(p)}{\partial p_{\nu}}. Since f^{*}(q^{*})-f^{*}(p^{*})-\sum_{\nu}\frac{\partial f^{*}(p^{*})}{\partial p^{*}_{\nu}}(q^{*}_{\nu}-p^{*}_{\nu})=f(p)-f(q)-\sum_{\nu}\frac{\partial f(q)}{\partial q_{\nu}}(p_{\nu}-q_{\nu})>0 for p\neq q, f^{*} is also strictly convex. Hence, D_{\mathrm{B}}(p\|q) is a differentiable and strictly convex divergence with respect to q^{*}. By combining \frac{\partial f^{*}(q^{*})}{\partial q^{*}_{\nu}}=q_{\nu} and (37), it follows that

\sum_{i=1}^{N}\alpha_{i}\frac{\partial D_{\mathrm{B}}(p_{i}\|p_{\ast})}{\partial p^{*}_{\ast,\nu}}=\sum_{i=1}^{N}\alpha_{i}(p_{\ast,\nu}-p_{i,\nu})=0. (39)

Hence, p_{\ast}=\sum_{i=1}^{N}\alpha_{i}p_{i} is the unique minimizer.
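A short numerical sketch of this fact (ours, assuming f(x)=\sum_{j}x_{j}\log x_{j} on the positive orthant as one concrete choice): the weighted arithmetic mean minimizes the weighted average of Bregman divergences, independently of f.

# Hedged sketch in R^d: the weighted mean minimizes sum_i alpha_i D_B(p_i||r).
import numpy as np

rng = np.random.default_rng(1)
pts = rng.uniform(0.1, 1.0, size=(4, 3))         # points p_i in R^3_{>0}
alpha = np.array([0.4, 0.3, 0.2, 0.1])

def bregman(p, q):                                # D_B for f(x) = sum x log x
    return np.sum(p * np.log(p / q) - p + q)

def objective(r):
    return sum(a * bregman(p, r) for a, p in zip(alpha, pts))

p_star = alpha @ pts                              # weighted arithmetic mean
for _ in range(5):                                # nearby positive competitors
    r = p_star * np.exp(rng.normal(scale=0.1, size=3))
    assert objective(p_star) <= objective(r)
print(p_star, objective(p_star))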

III-C Projection from a probability measure to a set

In this subsection, we discuss minimization problems of the divergence between a probability measure and a set, such as an orthogonal subspace, a divergence ball or a set subject to linear moment-like constraints. We show the minimizer conditions and the uniqueness of the minimizer, provided that solutions satisfying the minimizer conditions exist.

Corollary 3 (Minimization of the divergence between a probability measure and an orthogonal subspace).

Let D be a differentiable and strictly convex divergence and let P,Q\in\mathcal{P}.

Then, the minimizer of the divergence between the probability measure P and the orthogonal subspace \mathcal{O}(P:Q)

\mathop{\rm arg~{}min}\limits_{R\in\mathcal{O}(P:Q)}D(P\|R)=Q (40)

is unique.

Proof.

Since \langle PQ\|RQ\rangle=0 for R\in\mathcal{O}(P:Q), the result follows from Theorem 1. ∎

Theorem 3 (Minimization of the divergence between a probability measure and a divergence ball).

Let D be a differentiable and strictly convex divergence. Let P,Q\in\mathcal{P} and suppose that P_{\ast}\in\partial\mathcal{B}_{\kappa}(P)\cap\mathcal{L}(P:Q).

Then, the minimizer of the divergence between the probability measure Q and the divergence ball \mathcal{B}_{\kappa}(P)

\mathop{\rm arg~{}min}\limits_{R\in\mathcal{B}_{\kappa}(P)}D(Q\|R)=P_{\ast} (41)

is unique.

Proof.

Since the case \kappa=0 is trivial, we consider the case \kappa>0. Consider an arbitrary probability measure R\in\mathcal{B}_{\kappa}(P). From the assumption, there exists \alpha\in[0,1] such that P_{\ast}\in\mathcal{L}_{\alpha}(P,Q), and by Proposition 3, \mathcal{L}_{\alpha}(P,Q)=\{P_{\ast}\}. Since \kappa>0 implies P_{\ast}\neq P, Proposition 3 also gives \alpha>0, so \alpha\in(0,1]. Let R_{\lambda}:=P_{\ast}+(R-P_{\ast})\lambda and F(\lambda)=(1-\alpha)D(P\|R_{\lambda})+\alpha D(Q\|R_{\lambda}).

By differentiating F(\lambda) with respect to \lambda and substituting \lambda=0, it follows that

F^{\prime}(0)=\int\biggl((1-\alpha)\frac{\delta D(P\|P_{\ast})}{\delta p_{\ast}(z)}+\alpha\frac{\delta D(Q\|P_{\ast})}{\delta p_{\ast}(z)}\biggr)(r(z)-p_{\ast}(z))\mathrm{d}\mu(z)=0, (42)

where we use \mathcal{L}_{\alpha}(P,Q)=\{P_{\ast}\}, \int r\mathrm{d}\mu=\int p_{\ast}\mathrm{d}\mu=1 and the definition of the functional derivative. From Lemma 1 and \alpha\in(0,1], F(\lambda) is strictly convex with respect to \lambda. Hence, we have F(1)>F(0)+F^{\prime}(0)(1-0)=F(0). From F(1)=(1-\alpha)D(P\|R)+\alpha D(Q\|R) and F(0)=(1-\alpha)D(P\|P_{\ast})+\alpha D(Q\|P_{\ast}), it follows that

(1-\alpha)D(P\|R)+\alpha D(Q\|R)>(1-\alpha)D(P\|P_{\ast})+\alpha D(Q\|P_{\ast}). (43)

From D(P\|R)\leq\kappa, D(P\|P_{\ast})=\kappa and \alpha>0, it follows that

D(Q\|R)>D(Q\|P_{\ast}). (44)

Since R is an arbitrary probability measure in \mathcal{B}_{\kappa}(P), the result follows. ∎

The next corollary follows from Theorem 3.

Corollary 4.

Let P,Q\in\mathcal{P} and P_{\ast}\in\partial\mathcal{B}_{\kappa_{1}}(P)\cap\partial\mathcal{B}_{\kappa_{2}}(Q)\cap\mathcal{L}(P:Q). Then, \mathcal{B}_{\kappa_{1}}(P)\cap\mathcal{B}_{\kappa_{2}}(Q)=\{P_{\ast}\}.

Figure 2 summarizes Corollaries 2, 3 and 4.

Figure 2: Relation among two divergence balls, a divergence line and an orthogonal subspace.
Theorem 4 (Minimization of the divergence subject to linear moment-like constraint).

Let D be a differentiable and strictly convex divergence and let P\in\mathcal{P}. Let T_{i}(Z) (i=1,2,\cdots,K) be functions of a random variable (vector) Z and let \mathcal{M}:=\{Q\in\mathcal{P}\,|\,\mathrm{E}[T_{i}(Z)]=m_{i}\ (i=1,2,\cdots,K),\ Z\sim Q\}, where \mathrm{E}[\cdot] denotes the expected value and m_{i}\in\mathbb{R} (i=1,2,\cdots,K) are constants.

Suppose that there exists P_{\ast}\in\mathcal{M} that satisfies

\frac{\delta D(P\|P_{\ast})}{\delta p_{\ast}(z)}+\sum_{i=1}^{K}\beta_{i}T_{i}(z)=C, (45)

where \beta_{i}\in\mathbb{R} and C\in\mathbb{R} are the Lagrange multipliers.

Then, the minimizer of the divergence between the probability measure P and the set \mathcal{M}

\mathop{\rm arg~{}min}\limits_{R\in\mathcal{M}}D(P\|R)=P_{\ast} (46)

is unique and P_{\ast} is the unique solution of (45).

Proof.

We use the same technique as in Theorem 2. Consider an arbitrary R\in\mathcal{M}. Let R_{\lambda}:=P_{\ast}+(R-P_{\ast})\lambda and F(\lambda):=D(P\|R_{\lambda})+\sum_{i}\beta_{i}\int T_{i}(z)r_{\lambda}(z)\mathrm{d}\mu(z).

By differentiating F(\lambda) with respect to \lambda and substituting \lambda=0, it follows that

F^{\prime}(0)=\int\biggl(\frac{\delta D(P\|P_{\ast})}{\delta p_{\ast}(z)}+\sum_{i}\beta_{i}T_{i}(z)\biggr)(r(z)-p_{\ast}(z))\mathrm{d}\mu(z)=0, (47)

where we use (45), \int r\mathrm{d}\mu=\int p_{\ast}\mathrm{d}\mu=1 and the definition of the functional derivative. From Lemma 1 and the linearity of \sum_{i}\beta_{i}\int T_{i}(z)r_{\lambda}(z)\mathrm{d}\mu(z) with respect to \lambda, F(\lambda) is strictly convex with respect to \lambda. Hence, we have F(1)>F(0)+F^{\prime}(0)(1-0)=F(0). From F(1)=D(P\|R)+\sum_{i}\beta_{i}\int T_{i}(z)r(z)\mathrm{d}\mu(z) and F(0)=D(P\|P_{\ast})+\sum_{i}\beta_{i}\int T_{i}(z)p_{\ast}(z)\mathrm{d}\mu(z), it follows that

D(P\|R)+\sum_{i}\beta_{i}\int T_{i}(z)r(z)\mathrm{d}\mu(z)>D(P\|P_{\ast})+\sum_{i}\beta_{i}\int T_{i}(z)p_{\ast}(z)\mathrm{d}\mu(z). (48)

From \int T_{i}(z)r(z)\mathrm{d}\mu(z)=\int T_{i}(z)p_{\ast}(z)\mathrm{d}\mu(z)=m_{i}, it follows that

D(P\|R)>D(P\|P_{\ast}). (49)

Since R is an arbitrary probability measure in \mathcal{M}, it follows that P_{\ast} is the unique minimizer. If there exists another probability measure \tilde{P}_{\ast} that is a solution of (45), then \tilde{P}_{\ast} is also a minimizer of (46). This contradicts the fact that P_{\ast} is the unique minimizer. Hence, the result follows. ∎
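As a concrete illustration of Theorem 4 (a hedged sketch of ours), take the reverse KL-divergence D(P\|Q)=D_{\mathrm{KL}}(Q\|P), which is differentiable and strictly convex in the second argument. Its functional derivative is \log(p_{\ast}/p)+1, so condition (45) gives the exponential tilting p_{\ast}\propto p\,e^{-\sum_{i}\beta_{i}T_{i}}; for a single constraint \mathrm{E}[Z]=m the multiplier can be found by bisection. The grid, base density and constraint value below are arbitrary choices for the illustration.

# Hedged sketch: projection of a standard Gaussian onto {E[Z] = m} for the
# reverse KL-divergence; condition (45) yields p_* proportional to p exp(-beta z).
import numpy as np

dz = 0.01
z = np.arange(-6.0, 6.0, dz)
p = np.exp(-z**2 / 2); p /= p.sum() * dz        # base density of P
T, m = z, 1.0                                   # single constraint E[Z] = 1

def tilt(beta):
    w = p * np.exp(-beta * T)
    return w / (w.sum() * dz)

def moment(beta):
    return np.sum(T * tilt(beta)) * dz

lo, hi = -20.0, 20.0                            # moment(beta) decreases in beta
for _ in range(80):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if moment(mid) > m else (lo, mid)

p_star = tilt(0.5 * (lo + hi))
print(np.sum(T * p_star) * dz)                  # approximately m
print(np.sum(p_star * np.log(p_star / p)) * dz) # projected divergence value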

IV Summary

We have discussed minimization problems and geometric properties of differentiable and strictly convex divergences. We have derived the three-point inequality and introduced divergence lines, inner products, balls and orthogonal subspaces, which generalize line segments, inner products, spheres and planes perpendicular to a line in Euclidean space.

Furthermore, we have shown the minimizer conditions and the uniqueness of the minimizer, provided that solutions satisfying the minimizer conditions exist, in the following cases:

1) Minimization of the weighted average of divergences from multiple probability measures.

2) Minimization of the divergence between a probability measure and an orthogonal subspace, a divergence ball, or a set subject to linear moment-like constraints.

References

  • [1] Shun-ichi Amari and Andrzej Cichocki. Information geometry of divergence functions. Bulletin of the Polish Academy of Sciences: Technical Sciences, 58(1):183–195, 2010.
  • [2] Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6(Oct):1705–1749, 2005.
  • [3] Lev M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200–217, 1967.
  • [4] Thomas Breuer and Imre Csiszár. Measuring distribution model risk. Mathematical Finance, 26(2):395–411, 2016.
  • [5] Yair Censor and Stavros Andrea Zenios. Parallel Optimization: Theory, Algorithms, and Applications. Oxford University Press, 1997.
  • [6] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
  • [7] Imre Csiszár. Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2:229–318, 1967.
  • [8] Imre Csiszár and Frantisek Matus. Information projections revisited. IEEE Transactions on Information Theory, 49(6):1474–1490, 2003.
  • [9] Solomon Kullback and Richard A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951.
  • [10] Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
  • [11] Frank Nielsen, Jean-Daniel Boissonnat, and Richard Nock. On Bregman Voronoi diagrams. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 746–755. Society for Industrial and Applied Mathematics, 2007.
  • [12] Frank Nielsen and Sylvain Boltz. The Burbea-Rao and Bhattacharyya centroids. IEEE Transactions on Information Theory, 57(8):5455–5466, 2011.
  • [13] Tomohiro Nishiyama. Monotonically decreasing sequence of divergences. arXiv preprint arXiv:1910.00402, 2019.
  • [14] Alfréd Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. The Regents of the University of California, 1961.
  • [15] Igal Sason and Sergio Verdú. f-divergence inequalities. IEEE Transactions on Information Theory, 62(11):5973–6006, 2016.
  • [16] Tim van Erven and Peter Harremos. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820, 2014.