
Computing Augustin Information via
Hybrid Geodesically Convex Optimization

Guan-Ren Wang (Graduate Institute of Networking and Multimedia, National Taiwan University); Chung-En Tsai (Department of Computer Science and Information Engineering, National Taiwan University); Hao-Chung Cheng (Department of Electrical Engineering and Graduate Institute of Communication Engineering, National Taiwan University; Department of Mathematics, National Taiwan University; Hon Hai (Foxconn) Quantum Computing Centre; Center for Quantum Science and Engineering, National Taiwan University); Yen-Huan Li (Graduate Institute of Networking and Multimedia, National Taiwan University; Department of Computer Science and Information Engineering, National Taiwan University; Department of Mathematics, National Taiwan University; Center for Quantum Science and Engineering, National Taiwan University)
Abstract

We propose a Riemannian gradient descent with the Poincaré metric to compute the order-$\alpha$ Augustin information, a widely used quantity for characterizing exponential error behaviors in information theory. We prove that the algorithm converges to the optimum at a rate of $\mathcal{O}(1/T)$. As far as we know, this is the first algorithm with a non-asymptotic optimization error guarantee for all positive orders. Numerical experimental results demonstrate the empirical efficiency of the algorithm. Our result is based on a novel hybrid analysis of Riemannian gradient descent for functions that are geodesically convex in one Riemannian metric and geodesically smooth in another.

1 Introduction

Characterizing operational quantities of interest in terms of a single-letter information-theoretic quantity is fundamental to information theory. However, certain single-letter quantities involve an optimization that is not easy to solve. This work aims to propose an optimization framework for computing an essential family of information-theoretic quantities, termed the Augustin information [4, 5, 17, 43, 32, 13, 44, 11], with the first non-asymptotic optimization error guarantee.

Consider two probability distributions $P$ and $Q$ sharing the same finite alphabet $\mathcal{Y}$. The order-$\alpha$ Rényi divergence ($\alpha\in(0,\infty)\setminus\{1\}$) is defined as [39, 42]

D_{\alpha}(P\|Q):=\frac{1}{\alpha-1}\log\sum_{y\in\mathcal{Y}}P^{\alpha}(y)Q^{1-\alpha}(y).
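For concreteness, the definition can be evaluated directly. Below is a minimal NumPy sketch; the function name renyi_divergence and the example vectors are ours, not from the paper.

```python
import numpy as np

def renyi_divergence(P, Q, alpha):
    """Order-alpha Renyi divergence between probability vectors P and Q (alpha > 0, alpha != 1)."""
    return np.log(np.sum(P ** alpha * Q ** (1.0 - alpha))) / (alpha - 1.0)

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.4, 0.4, 0.2])
print(renyi_divergence(P, Q, alpha=3.0))
# As alpha -> 1, the value approaches the Kullback-Leibler divergence (see the limit below).
print(renyi_divergence(P, Q, alpha=1.0 + 1e-6), np.sum(P * np.log(P / Q)))
```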

As $\alpha$ tends to $1$, we have

\lim_{\alpha\to 1}D_{\alpha}(P\|Q)=D(P\|Q),

where $D(P\|Q):=\mathbb{E}_{Y\sim P}\left[\log\frac{P(Y)}{Q(Y)}\right]$ is the Kullback–Leibler (KL) divergence. The KL divergence defines Shannon’s mutual information for a prior distribution $P_{X}$ on the input alphabet $\mathcal{X}$ and a conditional distribution $P_{Y|X}$ as

I(P_{X},P_{Y|X}):=\min_{Q_{Y}}\mathbb{E}_{X}D(P_{Y\mid X}(\cdot\mid X)\|Q_{Y})=D(P_{Y\mid X}\|P_{Y}|P_{X}),

where the minimization is over all probability distributions on $\mathcal{Y}$ and $P_{Y}=\mathbb{E}_{X}[P_{Y|X}]$. Augustin generalized Shannon’s mutual information to the order-$\alpha$ Augustin information [4, 5], defined as

I_{\alpha}(P_{X},P_{Y|X}):=\min_{Q_{Y}}\mathbb{E}_{X}D_{\alpha}(P_{Y\mid X}(\cdot\mid X)\|Q_{Y}).    (1)

The Augustin information yields a wealth of applications in information theory, including:

  (i) characterizing the cut-off rate of channel coding [17],
  (ii) determining the error exponent of optimal constant composition codes for rates below Shannon’s capacity [22, 4, 7, 5, 18, 33, 19, 12],
  (iii) determining the strong converse exponent of channel coding for rates above Shannon’s capacity [3, 35, 20, 30, 31],
  (iv) determining the error exponent of variable-length data compression with side information [10, 14],
  (v) demonstrating an achievable secrecy exponent for privacy amplification and wiretap channel coding via regular random binning [29, 40].

In spite of the operational significance of the Augustin information, computing it is not an easy task. The order-$\alpha$ Augustin information does not have a closed-form expression in general unless $\alpha=1$. The optimization problem (1) is indeed convex [42, Theorem 12], so we may consider solving it via gradient descent. However, existing non-asymptotic analyses of gradient descent assume either Lipschitzness or smoothness of the objective function [9, 34, 26] (the term “smooth” used in this paper does not mean infinite differentiability; we define smoothness in Section 3.1). These two conditions, which we refer to as smoothness-type conditions, are violated by the optimization problem (1) [48].

In this paper, we adopt a geodesically convex optimization approach [49, 8, 45, 1]. This approach utilizes generalizations of convexity and smoothness-type conditions for Riemannian manifolds, namely geodesic convexity, geodesic Lipschitzness, and geodesic smoothness [49, 38, 41, 47, 50, 23, 24, 15, 2]. Since the lack of standard smoothness-type conditions is the main difficulty in applying gradient descent-type methods, we may find an appropriate Riemannian metric with respect to which the objective function in the optimization problem (1) is geodesically smooth.

Indeed, we have found that the objective function is geodesically smooth with respect to a Riemannian metric called the Poincaré metric. However, numerical experiments suggest that the objective function is geodesically non-convex under this metric. This poses another challenge. Existing analyses for Riemannian gradient descent, a direct extension of vanilla gradient descent, require both geodesic convexity and geodesic smoothness under the same Riemannian metric [49, 8].

Our main theoretical contribution lies in analyzing Riemannian gradient descent under a hybrid scenario, where the objective function is geodesically convex with respect to one Riemannian metric and geodesically smooth with respect to another. We prove that under this hybrid scenario, Riemannian gradient descent converges at a rate of $\mathcal{O}(1/T)$, identical to the rate of vanilla gradient descent under the Euclidean setup.

Given that the objective function for computing the Augustin information is convex under the standard Euclidean structure and geodesically smooth under the Poincaré metric, our proposed hybrid framework offers a viable approach to solve this problem. In particular, we prove that Riemannian gradient descent with the Poincaré metric converges at a rate of $\mathcal{O}(1/T)$ for computing the Augustin information. This marks the first algorithm for computing the Augustin information that has a non-asymptotic optimization error bound for all $\alpha>0$.

Notations. We denote the all-ones vector by $\mathbbm{1}$. We denote the set of vectors with non-negative entries and strictly positive entries by $\mathbb{R}^{N}_{+}$ and $\mathbb{R}^{N}_{++}$, respectively. The $i$-th entry of $x$ is denoted by $x^{(i)}$. For any function $f:\mathbb{R}\to\mathbb{R}$ and vector $x$, we define $f(x)$ as the vector whose $i$-th entry equals $f(x^{(i)})$. Hence, for example, the $i$-th entry of $\exp(x)$ is $\exp x^{(i)}$. The probability simplex in $\mathbb{R}^{N}_{+}$ is denoted by $\Delta^{N}_{+}$, and the intersection of $\mathbb{R}^{N}_{++}$ and $\Delta^{N}_{+}$ is denoted by $\Delta^{N}_{++}$. The notations $\odot$ and $\oslash$ denote entry-wise product and entry-wise division, respectively. For $x,y\in\mathbb{R}^{N}$, the partial order $x\leq y$ means $y-x\in\mathbb{R}^{N}_{+}$. For a set $S\subseteq\mathbb{R}^{N}$, the boundary of $S$ is denoted by $\partial S$. The set $\{1,2,\ldots,N\}$ is denoted by $[N]$.

2 Related Work

2.1 Computing the Augustin Information

Nakiboğlu [32] showed that the minimizer of the optimization problem (1) satisfies a fixed-point equation and proved that the associated fixed-point iteration converges to the order-$\alpha$ Augustin information for $\alpha\in(0,1)$. Cheng and Nakiboğlu [11, Lemma 6] provided a descent lemma for a specific update rule, which implies asymptotic convergence for the optimization problem (1) for all $\alpha>0$. Li and Cevher [28] provided a line-search gradient-based algorithm for optimization problems on the probability simplex, which can also be used to solve the optimization problem (1) for all $\alpha>0$. You et al. [48] proposed a gradient-based method for minimizing quantum Rényi divergences, which includes the optimization problem (1) as a special case for all $\alpha>0$. These works only guarantee asymptotic convergence of the algorithms. A convergence-rate analysis was missing before the present work.

2.2 Hybrid Analysis

Antonakopoulos et al. [2] considered (Euclidean) convex and geodesically Lipschitz objective functions, analyzing the regret rates of the follow-the-regularized-leader algorithm and online mirror descent. Weber and Sra [46] considered geodesically convex and (Euclidean) smooth objective functions, analyzing the optimization error of the convex-concave procedure. The former did not consider the hybrid case of geodesic convexity and geodesic smoothness under different Riemannian metrics. The latter only analyzed a special case of the hybrid scenario we proposed. Neither of them analyzed Riemannian gradient descent.

3 Preliminaries

This section introduces relevant concepts in geodesically convex optimization.

3.1 Smooth Convex Optimization

If an optimization problem is convex, we may consider using (projected) gradient descent to approximate a global minimizer. Existing non-asymptotic error bounds of gradient descent require the objective function to be either Lipschitz or smooth.

Definition 3.1.

We say a function $f:\mathbb{R}^{N}\to\mathbb{R}$ is $L$-Lipschitz for some $L>0$ if for any $x,y\in\mathbb{R}^{N}$, the function satisfies

|f(y)-f(x)|\leq L\|y-x\|_{2}.

We say a function $f:\mathbb{R}^{N}\to\mathbb{R}$ is $L$-smooth for some $L>0$ if for any $x,y\in\mathbb{R}^{N}$, the function satisfies

f(y)\leq f(x)+\langle\nabla f(x),y-x\rangle+\frac{L}{2}\|y-x\|_{2}^{2}.

Given a convex optimization problem, gradient descent converges at a rate of $\mathcal{O}(1/\sqrt{T})$ in the Lipschitz case and $\mathcal{O}(1/T)$ in the smooth case, where $T$ denotes the number of iterations. However, You et al. [48, Proposition III.1] showed that the objective function in the optimization problem (1) is neither Lipschitz nor smooth.

3.2 Basics of Riemannian Geometry

An $N$-dimensional topological manifold $\mathcal{M}$ is a topological space that is Hausdorff, second countable, and locally homeomorphic to $\mathbb{R}^{N}$. The tangent space at a point $x\in\mathcal{M}$, denoted as $T_{x}\mathcal{M}$, is the set of all vectors tangent to the manifold at $x$. A Riemannian metric $\mathfrak{g}$ on $\mathcal{M}$ defines an inner product on each tangent space. The inner product on $T_{x}\mathcal{M}$ is denoted as $\langle\cdot,\cdot\rangle_{x}$. A Riemannian manifold $(\mathcal{M},\mathfrak{g})$ is a topological manifold $\mathcal{M}$ equipped with a Riemannian metric $\mathfrak{g}$. Given a Riemannian metric, the induced norm of a tangent vector $v\in T_{x}\mathcal{M}$ is given by $\|v\|_{x}:=\sqrt{\langle v,v\rangle_{x}}$. The length of a curve $\gamma:[0,1]\to\mathcal{M}$ is defined as $L(\gamma):=\int_{0}^{1}\|\gamma^{\prime}(t)\|_{\gamma(t)}\,\mathrm{d}t$, and the induced distance between two points $x,y\in\mathcal{M}$ is given by $d(x,y)=\inf_{\gamma}L(\gamma)$, where the infimum is taken over all curves $\gamma$ connecting $x$ and $y$. A geodesic is a curve connecting $x$ and $y$ with constant speed and satisfying $L(\gamma)=d(x,y)$. For each $x\in\mathcal{M}$ and $v\in T_{x}\mathcal{M}$, there is a unique geodesic $\gamma_{v}:[0,1]\to\mathcal{M}$ such that $\gamma_{v}(0)=x$ and $\gamma_{v}^{\prime}(0)=v$ [27, Corollary 4.28]. The exponential map at a point $x\in\mathcal{M}$ is the map $\exp_{x}:T_{x}\mathcal{M}\to\mathcal{M}$ such that $\exp_{x}(v)=\gamma_{v}(1)$. (In general, the exponential map may only be locally defined, while the exponential map considered in this paper is defined on the whole tangent space.) The logarithmic map $\log_{x}:\mathcal{M}\to T_{x}\mathcal{M}$ at a point $x\in\mathcal{M}$ is the inverse of $\exp_{x}$. Given a differentiable function $f:\mathcal{M}\to\mathbb{R}$, the Riemannian gradient of $f$ at $x\in\mathcal{M}$ is the unique tangent vector $\mathrm{grad}f(x)$ that satisfies $\langle\mathrm{grad}f(x),v\rangle_{x}=\mathrm{d}f(x)[v]$ for all $v\in T_{x}\mathcal{M}$.

3.3 Geodesically Convex Optimization

The notions of convexity and smoothness can be extended for functions defined on Riemannian manifolds.

Definition 3.2 (Geodesic convexity, geodesic smoothness [49]).

Let $f:\mathcal{M}\to\mathbb{R}$ be a function defined on a Riemannian manifold $(\mathcal{M},\mathfrak{g})$. We say the function $f$ is geodesically convex (g-convex) on $(\mathcal{M},\mathfrak{g})$ if for every $x,y\in\mathcal{M}$, the function satisfies

f(y)\geq f(x)+\langle\mathrm{grad}f(x),\log_{x}y\rangle_{x}.

We say the function $f$ is geodesically $L$-smooth (g-$L$-smooth) on $(\mathcal{M},\mathfrak{g})$ if for every $x,y\in\mathcal{M}$, the function satisfies

f(y)\leq f(x)+\langle\mathrm{grad}f(x),\log_{x}y\rangle_{x}+\frac{L}{2}d^{2}(x,y),

for some $L\geq 0$, where $d$ is the induced distance.

With the notions of the exponential map and Riemannian gradient, vanilla gradient descent can be extended to the so-called Riemannian gradient descent (RGD) [49, 8, 15], which iterates as follows

x_{t+1}\leftarrow\exp_{x_{t}}(-\eta\,\mathrm{grad}f(x_{t})),    (2)

where $\eta$ denotes the step size.
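In code, the update (2) only needs the exponential map and the Riemannian gradient of the chosen metric. Below is a minimal sketch with these passed in as callables; the names exp_map, rgrad, and riemannian_gradient_descent are ours.

```python
def riemannian_gradient_descent(x0, exp_map, rgrad, eta, num_iters):
    """Iterate x_{t+1} = exp_{x_t}(-eta * grad f(x_t)).

    exp_map(x, v): exponential map at x applied to a tangent vector v.
    rgrad(x):      Riemannian gradient of the objective at x.
    eta:           step size.
    """
    x = x0
    for _ in range(num_iters):
        x = exp_map(x, -eta * rgrad(x))
    return x
```

Section 5 instantiates this template with the Poincaré metric on the positive orthant.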

For Riemannian gradient descent, there exists a non-asymptotic optimization error bound analogous to that in gradient descent [49].

Theorem 3.3 ([49, Theorem 13]).

If the function to be minimized is g-convex and g-$L$-smooth on a Riemannian manifold $(\mathcal{M},\mathfrak{g})$, then RGD with step size $\eta=1/L$ converges at a rate of $\mathcal{O}(1/T)$.

4 Hybrid Analysis of Geodesically Smooth Convex Optimization

Consider the optimization problem

\min_{x\in\mathcal{M}}f(x),

where $\mathcal{M}$ is a Riemannian manifold equipped with Riemannian metrics $\mathfrak{g}$ and $\mathfrak{h}$, and $f\in\mathcal{C}^{1}(\mathcal{M})$ is a function lower bounded on $\mathcal{M}$. In this paper, we consider a hybrid geodesic optimization framework in which $f$ is g-convex and g-smooth under different Riemannian metrics.

Assumption 4.1.

The function $f$ is g-convex with respect to the Riemannian metric $\mathfrak{g}$ and g-$L$-smooth with respect to the Riemannian metric $\mathfrak{h}$, i.e.,

f(y)\geq f(x)+\langle\mathrm{grad}_{\mathfrak{g}}f(x),\log_{\mathfrak{g}(x)}y\rangle_{\mathfrak{g}(x)},
f(y)\leq f(x)+\langle\mathrm{grad}_{\mathfrak{h}}f(x),\log_{\mathfrak{h}(x)}y\rangle_{\mathfrak{h}(x)}+\frac{L}{2}d^{2}_{\mathfrak{h}}(x,y),

where the subscripts $\mathfrak{g}$ and $\mathfrak{h}$ on $\mathrm{grad}$, $\log$, $\langle\cdot,\cdot\rangle$, and $d(\cdot,\cdot)$ indicate the metric associated with each geometric quantity.

In this section, we prove that RGD with $\mathfrak{h}$ still converges at a rate of $\mathcal{O}(1/T)$ under Assumption 4.1.

Lemma 4.2 ([8, Corollary 4.8]).

Let $f$ be g-$L$-smooth with respect to $\mathfrak{h}$ and let $x_{+}=\exp_{\mathfrak{h}(x)}\left(-\frac{1}{L}\mathrm{grad}_{\mathfrak{h}}f(x)\right)$ for some $x\in\mathcal{M}$. Then,

f(x_{+})-f(x)\leq-\frac{1}{2L}\|\mathrm{grad}_{\mathfrak{h}}f(x)\|_{\mathfrak{h}(x)}^{2},

where $\exp_{\mathfrak{h}(x)}(\cdot)$, $\mathrm{grad}_{\mathfrak{h}}f(\cdot)$, and $\|\cdot\|_{\mathfrak{h}(x)}$ are induced by $\mathfrak{h}$.

With Lemma 4.2, we can expect that RGD generates a convergent sequence for which the Riemannian gradient at the limit point vanishes.

Lemma 4.3 ([8, Corollary 4.9]).

Let $f$ be g-$L$-smooth with respect to $\mathfrak{h}$. Let $\{x_{t}\}$ be the iterates generated by RGD with the Riemannian metric $\mathfrak{h}$, initial iterate $x_{1}\in\mathcal{M}$, and step size $\eta=1/L$. Then, we have

\lim_{t\to\infty}\|\mathrm{grad}_{\mathfrak{h}}f(x_{t})\|_{\mathfrak{h}(x_{t})}=0.

The distance between two consecutive RGD iterates, $x$ and $x_{+}=\exp_{\mathfrak{h}(x)}\left(-\frac{1}{L}\mathrm{grad}_{\mathfrak{h}}f(x)\right)$, goes to zero. This is due to the fact that $d_{\mathfrak{h}}(x,y)=\|\log_{\mathfrak{h}(x)}y\|_{\mathfrak{h}(x)}$ for any $x,y\in\mathcal{M}$, and the following equation:

\log_{\mathfrak{h}(x)}x_{+}=-\frac{1}{L}\mathrm{grad}_{\mathfrak{h}}f(x).

Since the Riemannian gradients at the iterates vanish, as stated in Lemma 4.3, the following corollary applies.

Corollary 4.4.

The limit point $x_{\infty}:=\lim_{t\to\infty}x_{t}$ of the RGD iterates in Lemma 4.3 exists.

Then, if $x_{\infty}\in\mathcal{M}$, the Riemannian gradient (with respect to the Riemannian metric $\mathfrak{h}$) at $x_{\infty}$ is zero. This means the iterates converge to a minimizer of $f$.

Lemma 4.5.

Suppose that Assumption 4.1 holds. Let $\{x_{t}\}$ be the iterates generated by RGD with the Riemannian metric $\mathfrak{h}$, initial iterate $x_{1}\in\mathcal{M}$, and step size $\eta=1/L$. If the limit point $x_{\infty}\in\mathcal{M}$, then $x_{\infty}$ is a minimizer of $f$ on $\mathcal{M}$.

The following theorem presents the main result of this section, a non-asymptotic error bound for RGD under the hybrid geodesic optimization framework.

Theorem 4.6.

Suppose that Assumption 4.1 holds. Let $\{x_{t}\}$ be the iterates generated by RGD with the Riemannian metric $\mathfrak{h}$, initial iterate $x_{1}\in\mathcal{M}$, and step size $\eta=1/L$. For any $T\in\mathbb{N}$, we have

f(x_{T+1})-f(x_{\infty})\leq\frac{2L}{T}\sup_{t\in\mathbb{N}}\|\log_{\mathfrak{g}(x_{t})}x_{\infty}\|^{2}_{\mathfrak{h}(x_{t})},

where $x_{\infty}:=\lim_{t\to\infty}x_{t}$.

We sketch the proof below.

Proof.

Let $\delta_{t}:=f(x_{t})-f(x_{\infty})$. By Lemma 4.2, we have

\delta_{t+1}-\delta_{t}\leq-\frac{1}{2L}\|\mathrm{grad}_{\mathfrak{h}}f(x_{t})\|_{\mathfrak{h}(x_{t})}^{2}.

By the g-convexity of $f$ and the Cauchy–Schwarz inequality, we write

\delta_{t}\leq\langle-\mathrm{grad}_{\mathfrak{g}}f(x_{t}),\log_{\mathfrak{g}(x_{t})}x_{\infty}\rangle_{\mathfrak{g}(x_{t})}
=-\mathrm{d}f(x_{t})[\log_{\mathfrak{g}(x_{t})}x_{\infty}]
=\langle-\mathrm{grad}_{\mathfrak{h}}f(x_{t}),\log_{\mathfrak{g}(x_{t})}x_{\infty}\rangle_{\mathfrak{h}(x_{t})}
\leq\|\mathrm{grad}_{\mathfrak{h}}f(x_{t})\|_{\mathfrak{h}(x_{t})}\|\log_{\mathfrak{g}(x_{t})}x_{\infty}\|_{\mathfrak{h}(x_{t})}.

Combining the two inequalities above, we obtain

\delta_{t+1}-\delta_{t}\leq-\frac{1}{2L}\frac{\delta_{t}^{2}}{\|\log_{\mathfrak{g}(x_{t})}x_{\infty}\|^{2}_{\mathfrak{h}(x_{t})}}.

Then, we can follow the standard analysis of gradient descent [9, Section 3.2]. The rest of the proof can be found in Appendix B. ∎

Remark 4.7.

Note that $\sup_{t\in\mathbb{N}}\|\log_{\mathfrak{g}(x_{t})}x_{\infty}\|_{\mathfrak{h}(x_{t})}$ is bounded since $\lim_{t\to\infty}x_{t}=x_{\infty}$.

5 Application: Computing Augustin Information

In this section, we apply Theorem 4.6 to compute the order-$\alpha$ Augustin information (1). We begin by introducing a Riemannian metric called the Poincaré metric. We then show that the objective function of the optimization problem (1) is g-$|1-\alpha|$-smooth with respect to this Riemannian metric. Finally, we apply the main result, Theorem 4.6, with a minor adjustment regarding a boundary issue.

5.1 Poincaré Metric

Consider the manifold $\mathcal{M}=\mathbb{R}^{N}_{++}$. Let $x\in\mathcal{M}$ and $u,v\in T_{x}\mathcal{M}$. The Poincaré metric (not the exact Poincaré metric of the standard differential geometry literature, see, e.g., [27], but rather a variation of it; we use this term for convenience, as Antonakopoulos et al. [2] did) is given by

\langle u,v\rangle_{x}=\langle u\oslash x,v\oslash x\rangle.
Proposition 5.1 ([6, 37, 36]).

Given $x,y\in\mathbb{R}^{N}_{++}$ and $v\in\mathbb{R}^{N}$, we have the following:

  • Riemannian distance: $d^{2}(x,y)=\|\log(y\oslash x)\|^{2}_{2}$.

  • Geodesic: $\gamma(t)=x^{1-t}\odot y^{t}$, which connects $x$ and $y$.

  • Exponential map: $\exp_{x}(v)=x\odot\exp(v\oslash x)$.

  • Riemannian gradient: $\mathrm{grad}f(x)=x^{2}\odot\nabla f(x)$.
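The formulas in Proposition 5.1 translate directly into code. Below is a minimal NumPy sketch of the four operations (the function names are ours); the iteration in Section 5.2 is built from these.

```python
import numpy as np

def poincare_distance_sq(x, y):
    """Squared Riemannian distance: d^2(x, y) = ||log(y / x)||_2^2."""
    return np.sum(np.log(y / x) ** 2)

def poincare_geodesic(x, y, t):
    """Geodesic connecting x and y: gamma(t) = x^(1-t) * y^t."""
    return x ** (1.0 - t) * y ** t

def poincare_exp(x, v):
    """Exponential map: exp_x(v) = x * exp(v / x)."""
    return x * np.exp(v / x)

def poincare_grad(x, euclidean_grad):
    """Riemannian gradient: grad f(x) = x^2 * nabla f(x)."""
    return x ** 2 * euclidean_grad
```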

5.2 Geodesic Smoothness of the Objective Function

Since we consider the finite alphabet $\mathcal{Y}$ case, we can identify $P_{Y|X}$ and $Q_{Y}$ as vectors in $\Delta^{N}_{+}$. Denote $Q_{Y}$ by $x$ and view $P_{Y|X}$ as a random variable $p$ in $\Delta^{N}_{+}$. Then, the order-$\alpha$ Augustin information (1) for $p$ can be written as

\min_{x\in\Delta^{N}_{+}}f_{\alpha}(x),\quad f_{\alpha}(x):=\mathbb{E}_{p}D_{\alpha}(p\|x),    (3)

where $\alpha\in\mathbb{R}_{+}\setminus\{1\}$ and $D_{\alpha}(y\|x):=\frac{1}{\alpha-1}\log\langle y^{\alpha},x^{1-\alpha}\rangle$ is the order-$\alpha$ Rényi divergence between two probability vectors $x,y\in\Delta^{N}_{+}$.
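For later reference (both the iteration rule (4) below and Appendix E use it), a direct calculation, which we record here for convenience, gives the Euclidean gradient of $f_{\alpha}$:

\nabla f_{\alpha}(x)=\mathbb{E}_{p}\nabla_{x}D_{\alpha}(p\|x)=-\mathbb{E}_{p}\left[\frac{p^{\alpha}\odot x^{-\alpha}}{\langle p^{\alpha},x^{1-\alpha}\rangle}\right],\qquad x\odot\bigl(-\nabla f_{\alpha}(x)\bigr)=\mathbb{E}_{p}\left[\frac{p^{\alpha}\odot x^{1-\alpha}}{\langle p^{\alpha},x^{1-\alpha}\rangle}\right]\in\Delta^{N}_{+}.

In particular, $x\odot(-\nabla f_{\alpha}(x))$ is an expectation of probability vectors and hence itself a probability vector.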

The constraint set in the optimization problem (3) is the probability simplex $\Delta^{N}_{+}$, while the Poincaré metric is defined over the whole positive orthant $\mathbb{R}^{N}_{++}$. We address this inconsistency by the following lemma. Define

g_{\alpha}(x):=\langle\mathbbm{1},x\rangle+f_{\alpha}(x).
Lemma 5.2.

Let $f_{\alpha}$ and $g_{\alpha}$ be defined as above. We have

\operatorname*{\arg\min}_{x\in\mathbb{R}^{N}_{+}}g_{\alpha}(x)=\operatorname*{\arg\min}_{y\in\Delta^{N}_{+}}f_{\alpha}(y).
Lemma 5.3.

The function $f_{\alpha}(x)$ is geodesically $|1-\alpha|$-smooth in $\mathbb{R}^{N}_{++}$ with respect to the Poincaré metric; the function $g_{\alpha}(x)$ is geodesically $(|1-\alpha|+1)$-smooth on the set $\{x\in\mathbb{R}^{N}_{++}:x\leq\mathbbm{1}\}$ with respect to the Poincaré metric.

With Lemma 5.3, we may consider minimizing $g_{\alpha}$ on $\mathbb{R}^{N}_{+}$ via RGD with the Poincaré metric and step size $\eta=1/(|1-\alpha|+1)$. Starting with $x_{1}\in\mathbb{R}^{N}_{+}$, the iteration rule proceeds as

x_{t+1}=x_{t}\odot\exp\left(-\frac{1}{|1-\alpha|+1}x_{t}\odot\nabla g_{\alpha}(x_{t})\right).    (4)
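To make the update concrete, here is a minimal NumPy sketch of iteration (4). The function name augustin_rgd, the array layout (rows of P are the conditional distributions $P_{Y\mid X=x}$, prior holds the weights defining $\mathbb{E}_{p}$), and the uniform initialization are our own choices rather than part of the paper; the Euclidean gradient of $g_{\alpha}$ is $\mathbbm{1}+\nabla f_{\alpha}$ with $\nabla f_{\alpha}$ as given above.

```python
import numpy as np

def augustin_rgd(P, prior, alpha, num_iters=500):
    """RGD with the Poincare metric for the order-alpha Augustin information,
    following iteration (4): x_{t+1} = x_t * exp(-eta * x_t * grad g_alpha(x_t))."""
    _, N = P.shape
    eta = 1.0 / (abs(1.0 - alpha) + 1.0)
    x = np.full(N, 1.0 / N)                               # start inside the simplex

    for _ in range(num_iters):
        inner = (P ** alpha) @ (x ** (1.0 - alpha))       # <p^alpha, x^(1-alpha)> per row
        grad_f = -((prior / inner) @ (P ** alpha)) * x ** (-alpha)
        grad_g = 1.0 + grad_f                             # Euclidean gradient of g_alpha
        x = x * np.exp(-eta * x * grad_g)                 # Poincare-metric RGD step (4)

    x_bar = x / x.sum()                                   # normalized iterate
    value = prior @ (np.log((P ** alpha) @ (x_bar ** (1.0 - alpha))) / (alpha - 1.0))
    return x_bar, value
```

The returned value is $f_{\alpha}$ evaluated at the normalized iterate $\overline{x}_{T+1}$ of Proposition 5.6.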

Note that we additionally require that $x\leq\mathbbm{1}$ for the g-smoothness of $g_{\alpha}$. We justify this requirement by the following lemma.

Lemma 5.4.

The iterates $\{x_{t}\}$ generated by the iteration rule (4) satisfy $x_{t}\leq\mathbbm{1}$.

Since $f_{\alpha}$ is convex in the Euclidean sense [42, Theorem 12] and the function $\langle\mathbbm{1},x\rangle$ is convex in the Euclidean sense, the objective function $g_{\alpha}$ is also convex in the Euclidean sense. It is desirable to apply Theorem 4.6 to $g_{\alpha}$ with $\mathfrak{g}$ as the standard Euclidean metric and $\mathfrak{h}$ as the Poincaré metric. However, the Poincaré metric is only defined on the interior of the constraint set $\mathbb{R}^{N}_{+}$, whereas RGD for the function $g_{\alpha}(x)$ may yield iterates that violate the assumption that $\lim_{t\to\infty}x_{t}\in\mathcal{M}$ of Lemma 4.5. We show that even if the limit point falls on the boundary of $\mathbb{R}^{N}_{++}$, the limit point is still a minimizer when considering the Poincaré metric on $\mathbb{R}^{N}_{++}$.

Lemma 5.5.

Let $f\in\mathcal{C}^{1}(\mathbb{R}^{N}_{+})$ be Euclidean convex and g-$L$-smooth with respect to the Poincaré metric. Let $\{x_{t}\}$ be the iterates generated by RGD with the Poincaré metric, initial iterate $x_{1}\in\mathbb{R}^{N}_{++}$, and step size $\eta=1/L$. Then, the limit point $x_{\infty}:=\lim_{t\to\infty}x_{t}\in\mathbb{R}^{N}_{+}$ is a minimizer of $f$ on $\mathbb{R}^{N}_{+}$.

Consequently, we still have a non-asymptotic error bound for the optimization problem $\min_{x\in\mathbb{R}^{N}_{+}}g_{\alpha}(x)$.

Since the iteration rule (4) is for the optimization problem $\min_{x\in\mathbb{R}^{N}_{+}}g_{\alpha}(x)$, these iterates may fall outside the constraint set $\Delta^{N}_{+}$ of the original problem. We show that the sequence of normalized iterates still converges at a rate of $\mathcal{O}(1/T)$.

Proposition 5.6.

Let $\{x_{t}\}$ be the iterates generated by the iteration rule (4) with $x_{1}\in\Delta^{N}_{+}$. For any $T\in\mathbb{N}$, we have

f_{\alpha}(\overline{x}_{T+1})-f_{\alpha}(x^{\star})\leq\frac{2(|1-\alpha|+1)}{T}\sup_{t\in\mathbb{N}}\|x^{\star}\oslash x_{t}-\mathbbm{1}\|_{2}^{2},

where $x^{\star}\in\operatorname*{\arg\min}_{x\in\Delta^{N}_{+}}f_{\alpha}(x)$ and $\overline{x}:=\frac{x}{\|x\|_{1}}$ for any vector $x\in\mathbb{R}^{N}_{+}$.

Note that the term $\sup_{t\in\mathbb{N}}\|x^{\star}\oslash x_{t}-\mathbbm{1}\|^{2}_{2}$ is bounded since $\lim_{t\to\infty}x_{t}=x^{\star}$ (Remark 4.7).

5.3 Numerical Results

We conducted numerical experiments on the optimization problem (1) by RGD, using the explicit iteration rule (4). (The code is available on GitHub at https://github.com/CMGRWang/Computing-Augustin-Information-via-Hybrid-Geodesically-Convex-Optimization.) The experimental setting is as follows: the support cardinality of $p$ is $2^{14}$, and the parameter dimension $N$ is $2^{4}$. The random variable $p$ is uniformly generated from the probability simplex $\Delta^{N}_{+}$. We implemented all methods in Python on a machine equipped with an Intel(R) Core(TM) i7-9700 CPU running at 3.00 GHz and 16.0 GB memory. The elapsed time represents the actual running time of each algorithm.
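Assuming the augustin_rgd sketch from Section 5.2, a setup of the same shape can be generated as follows; sampling from Dirichlet$(1,\ldots,1)$ gives rows distributed uniformly on the probability simplex, and the uniform weights for the expectation are our assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 2 ** 14, 2 ** 4                          # support cardinality of p and dimension N
P = rng.dirichlet(np.ones(N), size=M)           # rows drawn uniformly from the simplex
prior = np.full(M, 1.0 / M)                     # uniform weights for the expectation over p
x_bar, value = augustin_rgd(P, prior, alpha=3.0, num_iters=300)
print(value)                                    # approximate order-3 Augustin information
```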

Figure 1: Convergence speeds for computing the order-$3$ Augustin information.
Figure 2: Slower convergence speeds for the methods proposed by Li and Cevher [28] and by You et al. [48].

Figure 1 demonstrates the results of computing the order-$3$ Augustin information using RGD, the fixed-point iteration in Nakiboğlu’s paper [32], the method proposed by Li and Cevher [28], and the method proposed by You et al. [48]. We tuned the parameters for the latter two methods to achieve faster convergence speeds. We set $\bar{\alpha}=0.4$, $r=0.7$, $\tau=0.5$ for the method proposed by Li and Cevher [28], and set $\delta_{1}=1$, $\delta=10^{-5}$, $\beta=0.99$, $\gamma=1.25$, $c=10$ for the method proposed by You et al. [48]. We approximate the optimal value based on the result obtained from the method proposed by Li and Cevher [28] after $20$ iterations.

The numerical results validate our theoretical analysis, showing that RGD converges to the optimum. The fixed-point iteration proposed by Nakiboğlu [32] diverges, as it is only guaranteed to converge for $\alpha<1$. We observe that RGD is slower than the methods proposed by Li and Cevher [28] and by You et al. [48]. However, the latter two methods guarantee only asymptotic convergence, and tuning parameters is required to achieve faster convergence. We found that slightly adjusting parameters can result in slow convergence speed for these two algorithms. Figure 2 shows that if we change $\bar{\alpha}$ from $0.4$ to $0.04$ and $c$ from $10$ to $100$, both methods become slower than RGD in terms of elapsed time. Directly comparing numerical results would be unfair, given that these two methods lack convergence rate guarantees and require tuning to get better results.

6 Conclusion

We have presented an algorithm for computing the order-$\alpha$ Augustin information with a non-asymptotic optimization error guarantee. We have shown that the objective function of the corresponding optimization problem is geodesically $|1-\alpha|$-smooth with respect to the Poincaré metric. With this observation, we have proposed a hybrid geodesic optimization framework for this optimization problem and demonstrated that Riemannian gradient descent converges at a rate of $\mathcal{O}(1/T)$ under the hybrid framework.

A potential future direction could be generalizing this result for computing quantum Rényi divergences. Our framework can also be applied to other problems. For instance, the objective functions in the Kelly criterion in mathematical finance [25] and in computing the John Ellipsoid [16] are both g-smooth with respect to the Poincaré metric. Another interesting direction is to develop accelerated or stochastic gradient descent-like algorithms in this hybrid scenario.

Acknowledgments

G.-R. Wang, C.-E. Tsai, and Y.-H. Li are supported by the Young Scholar Fellowship (Einstein Program) of the National Science and Technology Council of Taiwan under grant number NSTC 112-2636-E-002-003, by the 2030 Cross-Generation Young Scholars Program (Excellent Young Scholars) of the National Science and Technology Council of Taiwan under grant number NSTC 112-2628-E-002-019-MY3, by the research project “Pioneering Research in Forefront Quantum Computing, Learning and Engineering” of National Taiwan University under grant numbers NTU-CC-112L893406 and NTU-CC-113L891606, and by the Academic Research-Career Development Project (Laurel Research Project) of National Taiwan University under grant numbers NTU-CDP-112L7786 and NTU-CDP-113L7763.

H.-C. Cheng is supported by the Young Scholar Fellowship (Einstein Program) of the National Science and Technology Council, Taiwan (R.O.C.) under Grants No. NSTC 112-2636-E-002-009, No. NSTC 112-2119-M-007-006, No. NSTC 112-2119-M-001-006, No. NSTC 112-2124-M-002-003, by the Yushan Young Scholar Program of the Ministry of Education (MOE), Taiwan (R.O.C.) under Grants No. NTU-112V1904-4, and by the research project “Pioneering Research in Forefront Quantum Computing, Learning and Engineering” of National Taiwan University under Grant No. NTC-CC-113L891605. H.-C. Cheng acknowledges support from the “Center for Advanced Computing and Imaging in Biomedicine (NTU-113L900702)” by the MOE in Taiwan.

References

  • Absil et al. [2008] P-A Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton Univ. Press, 2008.
  • Antonakopoulos et al. [2020] Kimon Antonakopoulos, Elena Veronica Belmega, and Panayotis Mertikopoulos. Online and stochastic optimization beyond Lipschitz continuity: A Riemannian approach. In Int. Conf. Learning Representations, 2020.
  • Arimoto [1973] S. Arimoto. On the converse to the coding theorem for discrete memoryless channels (corresp.). IEEE Trans. Inf. Theory, 19(3):357–359, May 1973. ISSN 0018-9448. doi: 10.1109/TIT.1973.1055007.
  • Augustin [1969] U. Augustin. Error estimates for low rate codes. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 14(1):61–88, March 1969. ISSN 0178-8051. doi: 10.1007/BF00534118.
  • Augustin [1978] Udo Augustin. Noisy channels. Habilitation Thesis, Universität Erlangen-Nürnberg, 1978.
  • Bhatia et al. [2019] Rajendra Bhatia, Tanvi Jain, and Yongdo Lim. On the Bures–Wasserstein distance between positive definite matrices. Expo. Math., 37(2):165–191, 2019.
  • Blahut [1974] Richard Blahut. Hypothesis testing and information theory. IEEE Trans. Inf. Theory, 20(4):405–417, July 1974.
  • Boumal [2023] Nicolas Boumal. An Introduction to Optimization on Smooth Manifolds. Cambridge Univ. Press, 2023.
  • Bubeck [2015] Sébastien Bubeck. Convex optimization: Algorithms and complexity. Found. Trends Mach. Learn., 8(3-4):231–357, November 2015.
  • Chen et al. [2017] Jun Chen, Da-ke He, Ashish Jagmohan, and Luis A Lastras-Montaño. On the reliability function of variable-rate Slepian-Wolf coding. Entropy, 19(8):389, July 2017.
  • Cheng and Nakiboğlu [2021] Hao-Chung Cheng and Barış Nakiboğlu. On the existence of the Augustin mean. In 2021 IEEE Information Theory Workshop (ITW), pages 1–6, 2021.
  • Cheng et al. [2019] Hao-Chung Cheng, Min-Hsiu Hsieh, and Marco Tomamichel. Quantum sphere-packing bounds with polynomial prefactors. IEEE Trans. Inf. Theory, 65(5):2872–2898, January 2019. doi: 10.1109/tit.2019.2891347.
  • Cheng et al. [2022a] Hao-Chung Cheng, Li Gao, and Min-Hsiu Hsieh. Properties of noncommutative Rényi and Augustin information. Commun. Math. Phys., 390(2):501–544, March 2022a.
  • Cheng et al. [2022b] Hao-Chung Cheng, Eric P Hanson, Nilanjana Datta, and Min-Hsiu Hsieh. Duality between source coding with quantum side information and classical-quantum channel coding. IEEE Trans. Inf. Theory, 68(11):7315–7345, June 2022b.
  • Chewi et al. [2020] Sinho Chewi, Tyler Maunu, Philippe Rigollet, and Austin J Stromme. Gradient descent algorithms for Bures-Wasserstein barycenters. In Conf. Learning Theory, pages 1276–1304, 2020.
  • Cohen et al. [2019] Michael B Cohen, Ben Cousins, Yin Tat Lee, and Xin Yang. A near-optimal algorithm for approximating the John Ellipsoid. In Conf. Learning Theory, pages 849–873, 2019.
  • Csiszár [1995] Imre Csiszár. Generalized cutoff rates and Rényi’s information measures. IEEE Trans. Inf. Theory, 41(1):26–34, January 1995. doi: 10.1109/18.370121.
  • Csiszár and Körner [2011] Imre Csiszár and János Körner. Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge Univ. Press, 2011.
  • Dalai and Winter [2017] Marco Dalai and Andreas Winter. Constant compositions in the sphere packing bound for classical-quantum channels. IEEE Trans. Inf. Theory, 63(9):5603–5617, July 2017.
  • Dueck and Korner [1979] G. Dueck and J. Korner. Reliability function of a discrete memoryless channel at rates above capacity (corresp.). IEEE Trans. Inf. Theory, 25(1):82–85, January 1979. ISSN 0018-9448. doi: 10.1109/TIT.1979.1056003.
  • Han et al. [2021] Andi Han, Bamdev Mishra, Pratik Kumar Jawanpuria, and Junbin Gao. On Riemannian optimization over positive definite matrices with the Bures-Wasserstein geometry. Adv. Neural Information Processing Systems, 34:8940–8953, December 2021.
  • Haroutunian [1968] EA Haroutunian. Bounds for the exponent of the probability of error for a semicontinuous memoryless channel. Problemy Peredachi Informatsii, 4(4):37–48, 1968.
  • Hosseini and Sra [2015] Reshad Hosseini and Suvrit Sra. Matrix manifold optimization for Gaussian mixtures. Adv. Neural Information Processing Systems, 28, 2015.
  • Hosseini and Sra [2020] Reshad Hosseini and Suvrit Sra. An alternative to EM for Gaussian mixture models: batch and stochastic Riemannian optimization. Math. Program., 181(1):187–223, May 2020.
  • Kelly [1956] John L Kelly. A new interpretation of information rate. Bell Syst. Tech. J., 35(4):917–926, July 1956.
  • Lan [2020] Guanghui Lan. First-order and Stochastic Optimization Methods for Machine Learning. Springer, 2020.
  • Lee [2018] John M Lee. Introduction to Riemannian Manifolds. Springer, 2018.
  • Li and Cevher [2019] Yen-Huan Li and Volkan Cevher. Convergence of the exponentiated gradient method with Armijo line search. J. Optim. Theory Appl., 181(2):588–607, May 2019.
  • Mojahedian et al. [2019] Mohammad Mahdi Mojahedian, Salman Beigi, Amin Gohari, Mohammad Hossein Yassaee, and Mohammad Reza Aref. A correlation measure based on vector-valued $L_{p}$-norms. IEEE Trans. Inf. Theory, 65(12):7985–8004, August 2019.
  • Mosonyi and Ogawa [2017] M. Mosonyi and T. Ogawa. Strong converse exponent for classical-quantum channel coding. Commun. Math. Phys., 355(1):373–426, October 2017. ISSN 1432-0916. doi: 10.1007/s00220-017-2928-4.
  • Mosonyi and Ogawa [2021] M. Mosonyi and T. Ogawa. Divergence radii and the strong converse exponent of classical-quantum channel coding with constant compositions. IEEE Trans. Inf. Theory, 67(3):1668–1698, December 2021.
  • Nakiboğlu [2019] Barış Nakiboğlu. The Augustin capacity and center. Probl. Inf. Transm., 55:299–342, October 2019.
  • Nakiboğlu [2020] Barış Nakiboğlu. A simple derivation of the refined sphere packing bound under certain symmetry hypotheses. Turk. J. Math., 44(3):919–948, 2020.
  • Nesterov [2018] Yurii Nesterov. Lectures on Convex Optimization. Springer, 2018.
  • Omura [1975] J. K. Omura. A lower bounding method for channel and source coding probabilities. Inf. Control., 27(2):148–177, February 1975. ISSN 0019-9958. doi: 10.1016/S0019-9958(75)90120-5.
  • Pennec [2020] Xavier Pennec. Manifold-valued image processing with SPD matrices. In Riemannian Geometric Statistics in Medical Image Analysis, pages 75–134. Elsevier, 2020.
  • Pennec et al. [2006] Xavier Pennec, Pierre Fillard, and Nicholas Ayache. A Riemannian framework for tensor computing. Int. J. Comput. Vis., 66:41–66, January 2006.
  • Rapcsak [1991] Tamas Rapcsak. Geodesic convexity in nonlinear optimization. J. Optim. Theory Appl., 69(1):169–183, April 1991.
  • Rényi [1961] Alfréd Rényi. On measures of entropy and information. In Proc. 4th Berkeley Symp. Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, volume 4, pages 547–562, 1961.
  • Shen et al. [2023] Yu-Chen Shen, Li Gao, and Hao-Chung Cheng. Privacy amplification against quantum side information via regular random binning. In 2023 59th Annu. Allerton Conference on Communication, Control, and Computing (Allerton), pages 1–8, 2023.
  • Udriste [1994] Constantin Udriste. Convex functions and optimization methods on Riemannian manifolds. Springer Science & Business Media, 1994.
  • van Erven and Harremoes [2014] Tim van Erven and Peter Harremoes. Rényi divergence and Kullback-Leibler divergence. IEEE Trans. Inf. Theory, 60(7):3797–3820, June 2014. doi: 10.1109/tit.2014.2320500.
  • Verdu [2015] Sergio Verdu. $\alpha$-mutual information. In 2015 Information Theory and Applications Workshop (ITA), pages 1–6, 2015.
  • Verdú [2021] Sergio Verdú. Error exponents and $\alpha$-mutual information. Entropy, 23(2):199, August 2021.
  • Vishnoi [2018] Nisheeth K Vishnoi. Geodesic convex optimization: Differentiation on manifolds, geodesics, and convexity. 2018. arXiv:1806.06373.
  • Weber and Sra [2023] Melanie Weber and Suvrit Sra. Global optimality for Euclidean CCCP under Riemannian convexity. In Int. Conf. Machine Learning, pages 36790–36803, 2023.
  • Wiesel [2012] Ami Wiesel. Geodesic convexity and covariance estimation. IEEE Trans. Signal Process., 60(12):6182–6189, September 2012.
  • You et al. [2022] Jun-Kai You, Hao-Chung Cheng, and Yen-Huan Li. Minimizing quantum Rényi divergences via mirror descent with Polyak step size. In 2022 IEEE Int. Symp. Information Theory (ISIT), pages 252–257, 2022.
  • Zhang and Sra [2016] Hongyi Zhang and Suvrit Sra. First-order methods for geodesically convex optimization. In Conf. Learning Theory, pages 1617–1638, 2016.
  • Zhang et al. [2013] Teng Zhang, Ami Wiesel, and Maria Sabrina Greco. Multivariate generalized Gaussian distribution: Convexity and graphical models. IEEE Trans. Signal Process., 61(16):4141–4148, June 2013.

Appendices

Appendix A Proof of Lemma 4.5

Since $x_{\infty}\in\mathcal{M}$, by Lemma 4.3, we have

\|\mathrm{grad}_{\mathfrak{h}}f(x_{\infty})\|_{\mathfrak{h}(x_{\infty})}=0.

Note that $\mathrm{grad}_{\mathfrak{h}}f(x)$ is the (Riesz) representation of the differential form $\mathrm{d}f(x)$ on the tangent space $T_{x}\mathcal{M}$ with respect to the metric $\mathfrak{h}$, and $\mathrm{grad}_{\mathfrak{g}}f(x)$ is the representation of the same differential form $\mathrm{d}f(x)$ under another metric $\mathfrak{g}$. Therefore, for every $v\in T_{x}\mathcal{M}$, we have

\langle\mathrm{grad}_{\mathfrak{h}}f(x),v\rangle_{\mathfrak{h}(x)}=\mathrm{d}f(x)[v]=\langle\mathrm{grad}_{\mathfrak{g}}f(x),v\rangle_{\mathfrak{g}(x)},

which implies

\|\mathrm{grad}_{\mathfrak{g}}f(x_{\infty})\|_{\mathfrak{g}(x_{\infty})}=0.

Since $f$ is g-convex with respect to $\mathfrak{g}$, having zero Riemannian gradient with respect to $\mathfrak{g}$ at $x$ is equivalent to $x$ being a global minimizer of $f$ on $\mathcal{M}$.

Appendix B Proof of Theorem 4.6

We have shown that

\delta_{t+1}-\delta_{t}\leq-\frac{1}{2L}\frac{\delta_{t}^{2}}{\|\log_{\mathfrak{g}(x_{t})}x_{\infty}\|^{2}_{\mathfrak{h}(x_{t})}}.

Let $w_{t}:=\frac{1}{\|\log_{\mathfrak{g}(x_{t})}x_{\infty}\|^{2}_{\mathfrak{h}(x_{t})}}$ and divide both sides by $\delta_{t}\delta_{t+1}$. Since the sequence $\{\delta_{t}\}$ is non-increasing, we write

\frac{1}{\delta_{t}}-\frac{1}{\delta_{t+1}}\leq-\frac{w_{t}}{2L}\frac{\delta_{t}}{\delta_{t+1}}\leq-\frac{w_{t}}{2L}.

By a telescopic sum, we get

-\frac{1}{\delta_{T+1}}\leq\frac{1}{\delta_{1}}-\frac{1}{\delta_{T+1}}\leq-\frac{\sum_{t=1}^{T}w_{t}}{2L}.

Therefore, we have

f(x_{T+1})-f(x_{\infty})\leq\frac{2L}{\sum_{t=1}^{T}w_{t}}\leq\frac{2L}{T}\sup_{t\in\mathbb{N}}\|\log_{\mathfrak{g}(x_{t})}x_{\infty}\|^{2}_{\mathfrak{h}(x_{t})}.

Appendix C Proof of Lemma 5.2

Given any vector $x\in\mathbb{R}^{N}_{+}$, we can write it as $x=\lambda y$ where $\lambda=\sum_{i\in[N]}x^{(i)}$ and $y\in\Delta^{N}_{+}$. Since $\langle\mathbbm{1},\lambda y\rangle=\lambda$ and $D_{\alpha}(p\|\lambda y)=D_{\alpha}(p\|y)-\log\lambda$ for every $\lambda>0$, we have

g_{\alpha}(x)=g_{\alpha}(\lambda y)=\lambda-\log\lambda+f_{\alpha}(y).

Observe that

\frac{\partial}{\partial\lambda}g_{\alpha}=1-\frac{1}{\lambda}

and $g_{\alpha}(\lambda y)$ is convex in $\lambda$. This means that for every $y\in\Delta^{N}_{+}$, the function $g_{\alpha}(\lambda y)$ achieves its minimum at $\lambda=1$. Therefore, letting $y^{\star}$ be a minimizer of $f_{\alpha}$ on $\Delta^{N}_{+}$, for every $\lambda>0$ and $y\in\Delta^{N}_{+}$ we have

f_{\alpha}(y^{\star})\leq f_{\alpha}(y)
\Leftrightarrow 1+f_{\alpha}(y^{\star})\leq 1+f_{\alpha}(y)
\Leftrightarrow g_{\alpha}(y^{\star})\leq g_{\alpha}(y)\leq g_{\alpha}(\lambda y).

This shows that $y^{\star}$ is a minimizer of $f_{\alpha}(y)$ on $\Delta^{N}_{+}$ if and only if $x^{\star}=y^{\star}$ is a minimizer of $g_{\alpha}(x)$.

Appendix D Proof of Lemma 5.3

We will utilize the following equivalent definition of g-convexity and g-smoothness.

Definition D.1 ([49, 21]).

For a function $f:\mathcal{M}\to\mathbb{R}$ defined on $(\mathcal{M},\mathfrak{g})$, we say that $f$ is g-convex on $(\mathcal{M},\mathfrak{g})$ if for every geodesic $\gamma:[0,1]\to\mathcal{M}$ connecting any two points $x,y\in\mathcal{M}$, the function $f(\gamma(t))$ is convex in $t$ in the Euclidean sense. We say that $f$ is g-$L$-smooth on $(\mathcal{M},\mathfrak{g})$ if for every geodesic $\gamma:[0,1]\to\mathcal{M}$ connecting any two points $x,y\in\mathcal{M}$, the function $f(\gamma(t))$ is $Ld^{2}(x,y)$-smooth in $t$ in the Euclidean sense.

Let $\gamma(t)$ be the geodesic connecting $x$ and $y$ for $x,y\in\mathbb{R}^{N}_{++}$. If $f_{p}(\gamma(t))$ is $Ld^{2}(x,y)$-smooth in the Euclidean sense, then the function $f_{\alpha}(\gamma(t))=\mathbb{E}_{p}f_{p}(\gamma(t))$ is also $Ld^{2}(x,y)$-smooth in the Euclidean sense. Therefore, it suffices to prove the g-smoothness for the function $f_{p}(x):=-\frac{1}{1-\alpha}\log\langle p^{\alpha},x^{1-\alpha}\rangle$ for any $p\in\mathbb{R}^{N}_{+}$.

We now prove that

f_{p}(\gamma(t))=-\frac{1}{1-\alpha}\log\langle p^{\alpha},\gamma(t)^{1-\alpha}\rangle

is $|1-\alpha|d^{2}(x,y)$-smooth in $t$ in the Euclidean sense. Note that

\gamma^{\prime}(t)=\gamma(t)\odot\log(y\oslash x).

We have

\frac{\mathrm{d}}{\mathrm{d}t}f_{p}(\gamma(t))=-\frac{\langle p^{\alpha},\gamma(t)^{1-\alpha}\odot\log(y\oslash x)\rangle}{\langle p^{\alpha},\gamma(t)^{1-\alpha}\rangle}.

The second-order derivative is

\frac{\mathrm{d}^{2}}{\mathrm{d}t^{2}}f_{p}(\gamma(t))=(1-\alpha)\Bigg[\underbrace{\left(\frac{\langle p^{\alpha}\odot\gamma(t)^{1-\alpha},\log(y\oslash x)\rangle}{\langle p^{\alpha},\gamma(t)^{1-\alpha}\rangle}\right)^{2}}_{\text{(1)}}-\underbrace{\frac{\langle p^{\alpha}\odot\gamma(t)^{1-\alpha},(\log(y\oslash x))^{2}\rangle}{\langle p^{\alpha},\gamma(t)^{1-\alpha}\rangle}}_{\text{(2)}}\Bigg].

Note that $p,(\log(y\oslash x))^{2}\in\mathbb{R}^{N}_{+}$ and $\gamma(t)\in\mathbb{R}^{N}_{++}$, so the quantities (1) and (2) are both non-negative. Our strategy is to show that both quantities (1) and (2) are upper bounded by $d^{2}(x,y)$, which then implies

\frac{\mathrm{d}^{2}}{\mathrm{d}t^{2}}f_{p}(\gamma(t))\in[-|1-\alpha|d^{2}(x,y),|1-\alpha|d^{2}(x,y)].

This gives the desired result by the fact that a function is $L$-smooth if and only if its Hessian is upper bounded by $LI$, where $I$ denotes the identity matrix.

D.1 Upper Bound of the quantity (1)

The numerator of the quantity (1) is

\langle p^{\alpha}\odot\gamma(t)^{1-\alpha},\log(y\oslash x)\rangle^{2}
\leq\left\|p^{\alpha}\odot\gamma(t)^{1-\alpha}\right\|^{2}_{2}\cdot\left\|\log(y\oslash x)\right\|^{2}_{2}
\leq\langle p^{\alpha},\gamma(t)^{1-\alpha}\rangle^{2}\left\|\log(y\oslash x)\right\|^{2}_{2}
=\langle p^{\alpha},\gamma(t)^{1-\alpha}\rangle^{2}d^{2}(x,y).

The first inequality is due to the Cauchy–Schwarz inequality, and the second inequality holds because $\sum_{i}(x^{(i)})^{2}\leq(\sum_{i}x^{(i)})^{2}$ when $x^{(i)}\geq 0$ for all $i\in[N]$. Thus, the quantity (1) is upper bounded by $d^{2}(x,y)$.

D.2 Upper Bound of the quantity (2)

The numerator of the quantity (2) is

\langle p^{\alpha}\odot\gamma(t)^{1-\alpha},(\log(y\oslash x))^{2}\rangle
\leq\left\|p^{\alpha}\odot\gamma(t)^{1-\alpha}\right\|_{2}\cdot\left\|(\log(y\oslash x))^{2}\right\|_{2}
\leq\langle p^{\alpha},\gamma(t)^{1-\alpha}\rangle\left\|\log(y\oslash x)\right\|_{2}^{2}
=\langle p^{\alpha},\gamma(t)^{1-\alpha}\rangle d^{2}(x,y).

The first inequality is due to the Cauchy–Schwarz inequality, and the second inequality holds because $\sum_{i}(x^{(i)})^{2}\leq(\sum_{i}x^{(i)})^{2}$ when $x^{(i)}\geq 0$ for all $i\in[N]$. Thus, the quantity (2) is upper bounded by $d^{2}(x,y)$.

D.3 Geodesic Smoothness of gαg_{\alpha}

We only need to show that the function $h:x\mapsto\langle\mathbbm{1},x\rangle$ is g-$1$-smooth on the set $\{x\in\mathbb{R}^{N}_{++}:x\leq\mathbbm{1}\}$. By direct calculation, we have

\frac{\mathrm{d}^{2}}{\mathrm{d}t^{2}}h(\gamma(t))=\sum_{i\in[N]}(\gamma(t))^{(i)}\left(\log\frac{x^{(i)}}{y^{(i)}}\right)^{2}\leq\sum_{i\in[N]}\left(\log\frac{x^{(i)}}{y^{(i)}}\right)^{2}=d^{2}(x,y),

where the inequality is due to the concavity and monotonicity of $\log$, which implies

(\gamma(t))^{(i)}=(x^{(i)})^{1-t}(y^{(i)})^{t}\leq(1-t)x^{(i)}+ty^{(i)}\leq 1.

Appendix E Proof of Lemma 5.4

Let $x\leq\mathbbm{1}$. Then the next iterate is

x_{+}=x\odot\exp\left(-\frac{1}{|1-\alpha|+1}x\odot\nabla g_{\alpha}(x)\right).

A direct calculation shows that $x\odot(-\nabla f_{\alpha}(x))\in\Delta^{N}_{+}$, so

-x^{(i)}\nabla^{(i)}f_{\alpha}(x)\leq 1,

where $\nabla^{(i)}f(x)$ is the $i$-th component of $\nabla f(x)$. By the inequality $\log(a)\leq b(a^{1/b}-1)$ for any $a,b>0$, we have

\log x^{(i)}\leq\frac{1}{|1-\alpha|+1}\left((x^{(i)})^{|1-\alpha|+1}-1\right)
\leq\frac{1}{|1-\alpha|+1}\left(x^{(i)}-1\right)
\leq\frac{1}{|1-\alpha|+1}\left(x^{(i)}+x^{(i)}\nabla^{(i)}f_{\alpha}(x)\right)
=\frac{1}{|1-\alpha|+1}\left(x^{(i)}\nabla^{(i)}g_{\alpha}(x)\right).

Therefore,

x^{(i)}\leq\exp\left(\frac{1}{|1-\alpha|+1}\left(x^{(i)}\nabla^{(i)}g_{\alpha}(x)\right)\right).

Hence,

x^{(i)}_{+}=x^{(i)}\exp\left(-\frac{1}{|1-\alpha|+1}\left(x^{(i)}\nabla^{(i)}g_{\alpha}(x)\right)\right)\leq 1.

This gives the desired result.

Appendix F Proof of Lemma 5.5

If the limit $x_{\infty}$ lies in $\mathbb{R}^{N}_{++}$, then it is a minimizer of $f(x)$ by Theorem 4.6. Assume that the limit point $x_{\infty}\in\partial\mathbb{R}^{N}_{++}$ and $x_{\infty}\notin\operatorname*{\arg\min}_{x\in\mathbb{R}^{N}_{+}}f(x)$. Then, by the first-order optimality condition, there exists some $x\in\mathbb{R}^{N}_{+}$ such that

\langle\nabla f(x_{\infty}),x-x_{\infty}\rangle=\sum_{i\in[N]}\nabla^{(i)}f(x_{\infty})(x^{(i)}-x^{(i)}_{\infty})<0,    (5)

where $\nabla^{(i)}f(x)$ is the $i$-th component of the vector $\nabla f(x)$.

By Lemma 4.3, we have

\|\mathrm{grad}f(x_{\infty})\|_{x_{\infty}}=\|x_{\infty}\odot\nabla f(x_{\infty})\|_{2}=0,

where the Riemannian gradient is with respect to the Poincaré metric. Therefore,

x^{(i)}_{\infty}\neq 0\Rightarrow\nabla^{(i)}f(x_{\infty})=0.

Combining this and the inequality (5), we get

\sum_{i\in[N],\,x^{(i)}_{\infty}=0}\nabla^{(i)}f(x_{\infty})x^{(i)}<0,

which means there exists some $j\in[N]$ such that

\nabla^{(j)}f(x_{\infty})<0\text{ and }x^{(j)}_{\infty}=0.

By the continuity of $\nabla f(x)$, there exists an open neighborhood $U\subset\mathbb{R}^{N}$ of $x_{\infty}$ such that for all $x\in U$

\nabla^{(j)}f(x)<0.

Since $x_{\infty}$ is the limit point of the sequence $\{x_{t}\}$, there exists $T>0$ such that $x_{t}\in U$ for all $t>T$, that is,

\nabla^{(j)}f(x_{t})<0.

However, this implies

-\frac{1}{L}x_{t}^{(j)}\nabla^{(j)}f(x_{t})>0\quad\forall t>T,

since $x_{t}\in\mathbb{R}^{N}_{++}$ and $\nabla^{(j)}f(x_{t})<0$. Therefore, by the iteration rule of RGD, we have

x_{t+1}^{(j)}=x_{t}^{(j)}\exp\left(-\frac{1}{L}x_{t}^{(j)}\nabla^{(j)}f(x_{t})\right)>x_{t}^{(j)}.

This shows that $\{x_{t}^{(j)}\}$ is increasing for $t>T$, which means this sequence cannot converge to $0$. This contradicts our assumption.

Appendix G Proof of Proposition 5.6

Since $g_{\alpha}$ is convex and g-smooth with respect to the Poincaré metric, by Theorem 4.6 and Lemma 5.5, we have

g_{\alpha}(x_{T+1})-g_{\alpha}(x^{\star})\leq\frac{2(|1-\alpha|+1)}{T}\sup_{t\in\mathbb{N}}\|x^{\star}\oslash x_{t}-\mathbbm{1}\|_{2}^{2}.

Note that for any $\lambda>0$ and a fixed $x\in\Delta^{N}_{+}$, the function $g_{\alpha}(\lambda x)$ is minimized when $\lambda=1$, as shown in Appendix C. Therefore, $x^{\star}\in\Delta^{N}_{+}$, and we have

f_{\alpha}(\overline{x}_{T+1})-f_{\alpha}(x^{\star})=g_{\alpha}(\overline{x}_{T+1})-g_{\alpha}(x^{\star})\leq g_{\alpha}(x_{T+1})-g_{\alpha}(x^{\star}).