Computing Augustin Information via
Hybrid Geodesically Convex Optimization
Abstract
We propose a Riemannian gradient descent with the Poincaré metric to compute the order-$\alpha$ Augustin information, a widely used quantity for characterizing exponential error behaviors in information theory. We prove that the algorithm converges to the optimum at a rate of $\mathcal{O}(1/T)$, where $T$ denotes the number of iterations. As far as we know, this is the first algorithm with a non-asymptotic optimization error guarantee for all positive orders $\alpha$. Numerical experiments demonstrate the empirical efficiency of the algorithm. Our result is based on a novel hybrid analysis of Riemannian gradient descent for functions that are geodesically convex with respect to one Riemannian metric and geodesically smooth with respect to another.
1 Introduction
Characterizing operational quantities of interest in terms of a single-letter information-theoretic quantity is fundamental to information theory. However, certain single-letter quantities involve an optimization that is not easy to solve. This work aims to propose an optimization framework for computing an essential family of information-theoretic quantities, termed the Augustin information [4, 5, 17, 43, 32, 13, 44, 11], with the first non-asymptotic optimization error guarantee.
Consider two probability distributions $P$ and $Q$ sharing the same finite alphabet $\mathcal{X}$. The order-$\alpha$ Rényi divergence ($\alpha \in (0, 1) \cup (1, \infty)$) is defined as [39, 42]
$$D_\alpha(P \,\|\, Q) := \frac{1}{\alpha - 1} \log \sum_{x \in \mathcal{X}} P(x)^\alpha\, Q(x)^{1 - \alpha}.$$
As $\alpha$ tends to $1$, we have
$$\lim_{\alpha \to 1} D_\alpha(P \,\|\, Q) = D(P \,\|\, Q),$$
where $D(P \,\|\, Q) := \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}$ is the Kullback–Leibler (KL) divergence. The KL divergence defines Shannon’s mutual information for a prior distribution $P_X$ on the input alphabet $\mathcal{X}$ and a conditional distribution $P_{Y|X}$ as
$$I(P_X, P_{Y|X}) := \min_{Q_Y} \sum_{x \in \mathcal{X}} P_X(x)\, D(P_{Y|X=x} \,\|\, Q_Y),$$
where the minimization is over all probability distributions $Q_Y$ on the output alphabet $\mathcal{Y}$. Augustin generalized Shannon’s mutual information to the order-$\alpha$ Augustin information [4, 5], defined as
$$I_\alpha(P_X, P_{Y|X}) := \min_{Q_Y} \sum_{x \in \mathcal{X}} P_X(x)\, D_\alpha(P_{Y|X=x} \,\|\, Q_Y). \tag{1}$$
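For concreteness, here is a minimal Python sketch that evaluates the order-$\alpha$ Rényi divergence and the objective of problem (1) for a finite alphabet; the function names and the toy candidate distribution $q$ are ours, chosen only for illustration.

```python
import numpy as np

def renyi_divergence(p, q, alpha):
    """Order-alpha Renyi divergence between probability vectors p and q (alpha != 1)."""
    return np.log(np.sum(p**alpha * q**(1.0 - alpha))) / (alpha - 1.0)

def augustin_objective(w, P, q, alpha):
    """Objective of problem (1): sum_x w[x] * D_alpha(P[x] || q),
    where w is the prior P_X and P[x] is the conditional distribution P_{Y|X=x}."""
    return sum(w_x * renyi_divergence(p_x, q, alpha) for w_x, p_x in zip(w, P))

# Toy example with 2 inputs and 3 outputs; q is a candidate output distribution.
w = np.array([0.4, 0.6])
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])
q = np.ones(3) / 3
print(augustin_objective(w, P, q, alpha=0.5))
```

The order-$\alpha$ Augustin information is the minimum of this objective over $q$; the rest of the paper concerns computing that minimum with a convergence guarantee.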
The Augustin information yields a wealth of applications in information theory, including:
- (i) characterizing the cut-off rate of channel coding [17],
- (ii)
- (iii)
- (iv)
- (v)
In spite of the operational significance of the Augustin information, computing it is not an easy task. The order-$\alpha$ Augustin information does not have a closed-form expression in general unless $\alpha = 1$. The optimization problem (1) is indeed convex [42, Theorem 12], so we may consider solving it via gradient descent. However, existing non-asymptotic analyses of gradient descent assume either Lipschitzness or smoothness of the objective function [9, 34, 26]. (The term “smooth” used in this paper does not mean infinite differentiability; we define smoothness in Section 3.1.) These two conditions, which we refer to as smoothness-type conditions, are violated by the optimization problem (1) [48].
In this paper, we adopt a geodesically convex optimization approach [49, 8, 45, 1]. This approach utilizes generalizations of convexity and smoothness-type conditions for Riemannian manifolds, namely geodesic convexity, geodesic Lipschitzness, and geodesic smoothness [49, 38, 41, 47, 50, 23, 24, 15, 2]. Since the lack of standard smoothness-type conditions is the main difficulty in applying gradient descent-type methods, we may find an appropriate Riemannian metric with respect to which the objective function in the optimization problem (1) is geodesically smooth.
Indeed, we have found that the objective function is geodesically smooth with respect to a Riemannian metric called the Poincaré metric. However, numerical experiments suggest that the objective function is geodesically non-convex under this metric. This poses another challenge. Existing analyses for Riemannian gradient descent, a direct extension of vanilla gradient descent, require both geodesic convexity and geodesic smoothness under the same Riemannian metric [49, 8].
Our main theoretical contribution lies in analyzing Riemannian gradient descent under a hybrid scenario, where the objective function is geodesically convex with respect to one Riemannian metric and geodesically smooth with respect to another. We prove that under this hybrid scenario, Riemannian gradient descent converges at a rate of $\mathcal{O}(1/T)$, identical to the rate of vanilla gradient descent for smooth convex optimization under the Euclidean setup.
Given that the objective function for computing the Augustin information is convex under the standard Euclidean structure and geodesically smooth under the Poincaré metric, our proposed hybrid framework offers a viable approach to solve this problem. In particular, we prove that Riemannian gradient descent with the Poincaré metric converges at a rate of $\mathcal{O}(1/T)$ for computing the Augustin information. This marks the first algorithm for computing the Augustin information that has a non-asymptotic optimization error bound for all $\alpha > 0$.
Notations. We denote the all-ones vector by $\mathbf{1}$. We denote the sets of vectors with non-negative entries and with strictly positive entries by $\mathbb{R}^d_{+}$ and $\mathbb{R}^d_{++}$, respectively. The $i$-th entry of a vector $x$ is denoted by $x_i$. For any function $f: \mathbb{R} \to \mathbb{R}$ and vector $x$, we define $f(x)$ as the vector whose $i$-th entry equals $f(x_i)$; hence, for example, the $i$-th entry of $\log x$ is $\log x_i$. The probability simplex in $\mathbb{R}^d$ is denoted by $\Delta$, and the intersection of $\Delta$ and $\mathbb{R}^d_{++}$ is denoted by $\Delta_{++}$. The notations $\odot$ and $\oslash$ denote entry-wise product and entry-wise division, respectively. For $x, y \in \mathbb{R}^d$, the partial order $x \leq y$ means $x_i \leq y_i$ for all $i$. For a set $\mathcal{S}$, the boundary of $\mathcal{S}$ is denoted by $\partial\mathcal{S}$. The set $\{1, \ldots, T\}$ is denoted by $[T]$.
2 Related Work
2.1 Computing the Augustin Information
Nakiboğlu [32] showed that the minimizer of the optimization problem (1) satisfies a fixed-point equation and proved that the associated fixed-point iteration converges to the order-$\alpha$ Augustin information for $\alpha \in (0, 1)$. Cheng and Nakiboğlu [11, Lemma 6] provided a descent lemma for a specific update rule, which implies asymptotic convergence for the optimization problem (1) for all $\alpha > 0$. Li and Cevher [28] provided a line-search gradient-based algorithm to solve optimization problems on the probability simplex, which can also be used to solve the optimization problem (1) for all $\alpha > 0$. You et al. [48] proposed a gradient-based method for minimizing quantum Rényi divergences, which includes the optimization problem (1) as a special case, for all $\alpha > 0$. These works only guarantee asymptotic convergence of the algorithms; an analysis of the convergence rate was missing before the present work.
2.2 Hybrid Analysis
Antonakopoulos et al. [2] considered (Euclidean) convex and geodesically Lipschitz objective functions, analyzing the regret rates of the follow-the-regularized-leader algorithm and online mirror descent. Weber and Sra [46] considered geodesically convex and (Euclidean) smooth objective functions, analyzing the optimization error of the convex-concave procedure. The former did not consider the hybrid case of geodesic convexity and geodesic smoothness under different Riemannian metrics. The latter only analyzed a special case of the hybrid scenario we proposed. Neither of them analyzed Riemannian gradient descent.
3 Preliminaries
This section introduces relevant concepts in geodesically convex optimization.
3.1 Smooth Convex Optimization
If an optimization problem is convex, we may consider using (projected) gradient descent to approximate a global minimizer. Existing non-asymptotic error bounds of gradient descent require the objective function to be either Lipschitz or smooth.
Definition 3.1.
We say a function $f$ is $L$-Lipschitz for some $L > 0$ if for any $x$ and $y$ in its domain, the function satisfies
$$|f(x) - f(y)| \leq L\, \|x - y\|.$$
We say a function $f$ is $L$-smooth for some $L > 0$ if for any $x$ and $y$ in its domain, the function satisfies
$$\|\nabla f(x) - \nabla f(y)\| \leq L\, \|x - y\|.$$
Given a convex optimization problem, gradient descent converges at a rate of $\mathcal{O}(1/\sqrt{T})$ for the Lipschitz case and $\mathcal{O}(1/T)$ for the smooth case, where $T$ denotes the number of iterations. However, You et al. [48, Proposition III.1] showed that the objective function in the optimization problem (1) is neither Lipschitz nor smooth.
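The following is not the argument of [48, Proposition III.1], only a quick numerical illustration of why Lipschitzness fails: the Euclidean gradient of $q \mapsto D_\alpha(p \,\|\, q)$ blows up as $q$ approaches the boundary of the simplex.

```python
import numpy as np

def grad_renyi_in_q(p, q, alpha):
    """Euclidean gradient of q -> D_alpha(p || q) = log(sum_i p_i^a q_i^(1-a)) / (a - 1)."""
    s = np.sum(p**alpha * q**(1.0 - alpha))
    return -(p**alpha) * q**(-alpha) / s

p, alpha = np.array([0.5, 0.5]), 0.5
for eps in [1e-1, 1e-3, 1e-5, 1e-7]:
    q = np.array([eps, 1.0 - eps])  # approaches the boundary of the simplex
    print(f"eps = {eps:.0e}, gradient norm = {np.linalg.norm(grad_renyi_in_q(p, q, alpha)):.3e}")
```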
3.2 Basics of Riemannian Geometry
A $d$-dimensional topological manifold $\mathcal{M}$ is a topological space that is Hausdorff, second countable, and locally homeomorphic to $\mathbb{R}^d$. The tangent space at a point $x \in \mathcal{M}$, denoted by $T_x\mathcal{M}$, is the set of all vectors tangent to the manifold at $x$. A Riemannian metric $g$ on $\mathcal{M}$ defines an inner product on each tangent space; the inner product on $T_x\mathcal{M}$ is denoted by $\langle \cdot, \cdot \rangle_x$. A Riemannian manifold $(\mathcal{M}, g)$ is a topological manifold $\mathcal{M}$ equipped with a Riemannian metric $g$. Given a Riemannian metric, the induced norm of a tangent vector $u \in T_x\mathcal{M}$ is given by $\|u\|_x := \sqrt{\langle u, u \rangle_x}$. The length of a curve $\gamma: [0, 1] \to \mathcal{M}$ is defined as $L(\gamma) := \int_0^1 \|\gamma'(t)\|_{\gamma(t)}\, \mathrm{d}t$, and the induced distance between two points $x, y \in \mathcal{M}$ is given by $d(x, y) := \inf_\gamma L(\gamma)$, where the infimum is taken over all curves connecting $x$ and $y$. A geodesic is a curve $\gamma$ connecting $x$ and $y$ with constant speed and satisfying $L(\gamma) = d(x, y)$. For each $x \in \mathcal{M}$ and $u \in T_x\mathcal{M}$, there is a unique geodesic $\gamma_u$ such that $\gamma_u(0) = x$ and $\gamma_u'(0) = u$ [27, Corollary 4.28]. The exponential map at a point $x$ is the map $\exp_x: T_x\mathcal{M} \to \mathcal{M}$ such that $\exp_x(u) = \gamma_u(1)$. (In general, the exponential map may only be locally defined, while the exponential maps considered in this paper are defined on the whole tangent space.) The logarithmic map $\log_x$ at a point $x$ is the inverse of $\exp_x$. Given a differentiable function $f: \mathcal{M} \to \mathbb{R}$, the Riemannian gradient $\operatorname{grad} f(x)$ of $f$ at $x$ is the unique tangent vector that satisfies $\langle \operatorname{grad} f(x), u \rangle_x = \mathrm{D}f(x)[u]$ for all $u \in T_x\mathcal{M}$.
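For reference, when a metric is given in coordinates by a positive definite matrix field $G(x)$, i.e., $\langle u, v \rangle_x = u^\top G(x)\, v$, the defining property of the Riemannian gradient gives the standard coordinate expression
$$\operatorname{grad} f(x) = G(x)^{-1}\, \nabla f(x),$$
where $\nabla f(x)$ denotes the Euclidean gradient. This expression is convenient when instantiating Riemannian gradient descent with a concrete metric, as in Section 5.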
3.3 Geodesically Convex Optimization
The notions of convexity and smoothness can be extended to functions defined on Riemannian manifolds.
Definition 3.2 (Geodesic convexity, geodesic smoothness [49]).
Let $f$ be a function defined on a Riemannian manifold $(\mathcal{M}, g)$. We say the function $f$ is geodesically convex (g-convex) on $\mathcal{M}$ if for every $x, y \in \mathcal{M}$, the function satisfies
$$f(y) \geq f(x) + \langle \operatorname{grad} f(x), \log_x(y) \rangle_x.$$
We say the function $f$ is geodesically $L$-smooth (g-$L$-smooth) on $\mathcal{M}$ if for every $x, y \in \mathcal{M}$, the function satisfies
$$f(y) \leq f(x) + \langle \operatorname{grad} f(x), \log_x(y) \rangle_x + \frac{L}{2}\, d(x, y)^2$$
for some $L > 0$, where $d$ is the induced distance.
With the notions of the exponential map and Riemannian gradient, vanilla gradient descent can be extended to the so-called Riemannian gradient descent (RGD) [49, 8, 15], which iterates as follows
$$x_{t+1} = \exp_{x_t}\!\big(-\eta\, \operatorname{grad} f(x_t)\big), \tag{2}$$
where $\eta > 0$ denotes the step size.
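As an illustration of the update rule (2), here is a generic RGD loop in Python; the manifold enters only through its exponential map and Riemannian gradient, and the function names are ours. The sanity check uses the Euclidean case, where $\exp_x(u) = x + u$ and the Riemannian gradient is the ordinary gradient.

```python
import numpy as np

def rgd(x0, exp_map, riemannian_grad, step_size, num_iters):
    """Riemannian gradient descent: x_{t+1} = exp_{x_t}(-eta * grad f(x_t))."""
    x = x0
    for _ in range(num_iters):
        x = exp_map(x, -step_size * riemannian_grad(x))
    return x

# Sanity check on the Euclidean "manifold": minimize f(x) = ||x - c||^2 / 2.
c = np.array([1.0, -2.0, 3.0])
x_final = rgd(
    x0=np.zeros(3),
    exp_map=lambda x, u: x + u,
    riemannian_grad=lambda x: x - c,
    step_size=0.5,
    num_iters=100,
)
print(x_final)  # close to c
```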
For Riemannian gradient descent, there exists a non-asymptotic optimization error bound analogous to that in gradient descent [49].
Theorem 3.3 ([49, Theorem 13]).
If the function to be minimized is g-convex and g-$L$-smooth on a Riemannian manifold $(\mathcal{M}, g)$, then RGD with step size $1/L$ converges at a rate of $\mathcal{O}(1/T)$.
4 Hybrid Analysis of Geodesically Smooth Convex Optimization
Consider the optimization problem
$$\min_{x \in \mathcal{M}} f(x),$$
where $\mathcal{M}$ is a Riemannian manifold equipped with two Riemannian metrics $g_1$ and $g_2$, and $f$ is a function lower bounded on $\mathcal{M}$. In this paper, we consider a hybrid geodesic optimization framework in which $f$ is g-convex and g-smooth under different Riemannian metrics.
Assumption 4.1.
The function $f$ is g-convex with respect to the Riemannian metric $g_1$ and g-$L$-smooth with respect to the Riemannian metric $g_2$, i.e., for all $x, y \in \mathcal{M}$,
$$f(y) \geq f(x) + \langle \operatorname{grad}_1 f(x), \log^1_x(y) \rangle_{1, x} \quad\text{and}\quad f(y) \leq f(x) + \langle \operatorname{grad}_2 f(x), \log^2_x(y) \rangle_{2, x} + \frac{L}{2}\, d_2(x, y)^2,$$
where the labels $1$ and $2$ on $\operatorname{grad}$, $\log$, $\langle \cdot, \cdot \rangle$, and $d$ indicate the specific metric associated with each geometric quantity.
In this section, we prove that RGD with the metric $g_2$ still converges at a rate of $\mathcal{O}(1/T)$ under Assumption 4.1.
Lemma 4.2 ([8, Corollary 4.8]).
Let be g--smooth with respect to and let for some . Then,
where , , and are induced by .
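For orientation, a standard instance of such a descent guarantee, under the assumption that the step is $u = -\eta\, \operatorname{grad} f(x)$ with $\eta \leq 1/L$ (the exact form of the statement in [8] may differ), reads
$$f\big(\exp_x(-\eta\, \operatorname{grad} f(x))\big) \;\leq\; f(x) - \Big(\eta - \frac{L \eta^2}{2}\Big) \|\operatorname{grad} f(x)\|_x^2 \;\leq\; f(x) - \frac{\eta}{2}\, \|\operatorname{grad} f(x)\|_x^2,$$
which follows by plugging $u$ into the g-$L$-smoothness inequality; in particular, the objective value decreases unless the Riemannian gradient vanishes.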
With Lemma 4.2, we can expect that RGD generates a convergent sequence for which the Riemannian gradient at the limit point vanishes.
Lemma 4.3 ([8, Corollary 4.9]).
Let $f$ be g-$L$-smooth with respect to $g_2$. Let $(x_t)_{t \in \mathbb{N}}$ be the iterates generated by RGD with the Riemannian metric $g_2$, initial iterate $x_0 \in \mathcal{M}$, and step size $1/L$. Then, we have
$$\lim_{t \to \infty} \|\operatorname{grad}_2 f(x_t)\|_{2, x_t} = 0.$$
The distance between two consecutive RGD iterates, $x_t$ and $x_{t+1}$, therefore also goes to zero. This is due to the fact that $d_2(x, \exp_x(u)) \leq \|u\|_{2, x}$ for any $u \in T_x\mathcal{M}$, and the following relation:
$$d_2(x_t, x_{t+1}) = d_2\big(x_t, \exp_{x_t}(-\tfrac{1}{L}\, \operatorname{grad}_2 f(x_t))\big) \leq \tfrac{1}{L}\, \|\operatorname{grad}_2 f(x_t)\|_{2, x_t} \to 0.$$
Since the Riemannian gradients at the iterates vanish, as stated in Lemma 4.3, the following corollary holds.
Corollary 4.4.
The limit point of the RGD iterates in Lemma 4.3 exists.
Then, if the limit point $x_\infty$ lies in $\mathcal{M}$, the Riemannian gradient (with respect to the Riemannian metric $g_2$) at $x_\infty$ is zero. This means the iterates converge to a minimizer of $f$.
Lemma 4.5.
Suppose that Assumption 4.1 holds. Let $(x_t)_{t \in \mathbb{N}}$ be the iterates generated by RGD with the Riemannian metric $g_2$, initial iterate $x_0 \in \mathcal{M}$, and step size $1/L$. If the limit point $x_\infty$ lies in $\mathcal{M}$, then $x_\infty$ is a minimizer of $f$ on $\mathcal{M}$.
The following theorem presents the main result of this section, a non-asymptotic error bound for RGD under the hybrid geodesic optimization framework.
Theorem 4.6.
Suppose that Assumption 4.1 holds. Let $(x_t)_{t \in \mathbb{N}}$ be the iterates generated by RGD with the Riemannian metric $g_2$, initial iterate $x_0 \in \mathcal{M}$, and step size $1/L$. For any $T \in \mathbb{N}$, we have
$$f(x_T) - \min_{x \in \mathcal{M}} f(x) = \mathcal{O}\!\left(\frac{L D^2}{T}\right),$$
where $D$ denotes the maximum distance between the iterates $x_0, \ldots, x_T$ and a minimizer of $f$.
We sketch the proof below.
Proof.
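The full argument is given in Appendix B. Schematically, with constants suppressed and under the assumption that combining Lemma 4.2 with the g-convexity of $f$ with respect to $g_1$ yields a per-iteration decrease proportional to the squared suboptimality gap,
$$\delta_t - \delta_{t+1} \;\gtrsim\; \frac{\delta_t^2}{L D^2}, \qquad \delta_t := f(x_t) - \min_{x \in \mathcal{M}} f(x),$$
dividing by $\delta_t \delta_{t+1}$ and using that $(\delta_t)_{t \in \mathbb{N}}$ is non-increasing gives $\frac{1}{\delta_{t+1}} - \frac{1}{\delta_t} \gtrsim \frac{1}{L D^2}$, and telescoping over $t = 0, \ldots, T-1$ yields $\delta_T \lesssim \frac{L D^2}{T}$.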
Remark 4.7.
Note that $D$ is bounded, since the sequence of iterates converges (Corollary 4.4) and is therefore bounded.
5 Application: Computing Augustin Information
In this section, we apply Theorem 4.6 to compute the order-$\alpha$ Augustin information (1). We begin by introducing a Riemannian metric called the Poincaré metric. We then show that the objective function of the optimization problem (1) is g-$L$-smooth with respect to this Riemannian metric. Finally, we apply the main result, Theorem 4.6, with a minor adjustment regarding a boundary issue.
5.1 Poincaré Metric
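A minimal computational sketch of the objects this section needs, under the assumption that the Poincaré metric on $\mathbb{R}^d_{++}$ takes the per-coordinate form $\langle u, v \rangle_x = \sum_i u_i v_i / x_i^2$ (the form consistent with the multiplicative update rule (4) in Section 5.2; the paper's exact normalization may differ), is given below.

```python
import numpy as np

def poincare_inner(x, u, v):
    """Assumed Poincare inner product on the positive orthant: <u, v>_x = sum_i u_i v_i / x_i^2."""
    return np.sum(u * v / x**2)

def exp_map(x, u):
    """Exponential map of the per-coordinate metric: exp_x(u) = x * exp(u / x)."""
    return x * np.exp(u / x)

def log_map(x, y):
    """Logarithmic map, the inverse of exp_x: log_x(y) = x * log(y / x)."""
    return x * np.log(y / x)

def riemannian_grad(x, euclidean_grad):
    """Riemannian gradient: grad f(x) = G(x)^{-1} * (Euclidean gradient), with G(x) = diag(1 / x^2)."""
    return x**2 * euclidean_grad

# Consistency check: the logarithmic map inverts the exponential map.
x = np.array([0.2, 0.5, 1.3])
u = np.array([0.1, -0.3, 0.4])
print(np.allclose(log_map(x, exp_map(x, u)), u))  # True
```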
5.2 Geodesic Smoothness of the Objective Function
Since we consider the finite alphabet case, we can identify $P_X$ and the conditional distributions $P_{Y|X=x}$ as vectors in the probability simplex. Denote $P_X$ by $w$ and view $P_{Y|X}$ as a random probability vector $P$ that equals $P_{Y|X=x}$ with probability $w_x$. Then, the order-$\alpha$ Augustin information (1) for $\alpha \in (0, 1) \cup (1, \infty)$ can be written as
$$\min_{q \in \Delta} f(q), \qquad f(q) := \mathbb{E}\big[D_\alpha(P \,\|\, q)\big], \tag{3}$$
where the expectation is taken with respect to $w$ and $D_\alpha(p \,\|\, q)$ is the order-$\alpha$ Rényi divergence between two probability vectors $p$ and $q$.
The constraint set in the optimization problem (3) is the probability simplex , while the Poincaré metric is defined over the whole positive orthant . We address this inconsistency by the following lemma. Define
Lemma 5.2.
Let and be defined as above, we have
Lemma 5.3.
The function is geodesically -smooth in with respect to the Poincaré metric; the function is geodesically -smooth on the set with respect to the Poincaré metric.
With Lemma 5.3, we may consider minimizing the extended objective on $\mathbb{R}^d_{++}$ via RGD with the Poincaré metric and step size $1/L$, where $L$ is the geodesic smoothness constant from Lemma 5.3. Starting with $x_0 \in \mathbb{R}^d_{++}$, the iteration rule proceeds as
$$x_{t+1} = x_t \odot \exp\!\left(-\tfrac{1}{L}\, x_t \odot \nabla \bar{f}(x_t)\right), \tag{4}$$
where $\bar{f}$ denotes the extended objective defined above and $\nabla \bar{f}$ its Euclidean gradient.
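Here is a hedged end-to-end sketch of the iteration (4) in Python. It assumes the per-coordinate Poincaré metric from Section 5.1, a lifted objective of the form $\bar{f}(x) = \mathbb{E}[D_\alpha(P \,\|\, x)] + \mathbf{1}^\top x$ (one natural choice consistent with the scaling argument in Appendix C; the paper's exact definition and its smoothness constant $L$ may differ), and a heuristic step size in place of $1/L$.

```python
import numpy as np

def lifted_augustin_grad(w, P, x, alpha):
    """Euclidean gradient of the assumed lifted objective
    x -> sum_i w_i * D_alpha(P_i || x) + <1, x> on the positive orthant."""
    grad = np.ones_like(x)                     # gradient of the linear term <1, x>
    for w_i, p_i in zip(w, P):
        s = np.sum(p_i**alpha * x**(1.0 - alpha))
        grad -= w_i * (p_i**alpha) * x**(-alpha) / s
    return grad

def poincare_rgd(w, P, alpha, num_iters=500, step=0.5):
    """RGD with the assumed Poincare metric: x_{t+1} = x_t * exp(-step * x_t * grad(x_t))."""
    x = np.ones(P.shape[1]) / P.shape[1]       # start at the uniform distribution
    for _ in range(num_iters):
        x = x * np.exp(-step * x * lifted_augustin_grad(w, P, x, alpha))
    return x / np.sum(x)                       # normalize back to the simplex (Proposition 5.6)

w = np.array([0.4, 0.6])
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])
q_hat = poincare_rgd(w, P, alpha=0.5)
print(q_hat, q_hat.sum())
```

The multiplicative form of the update keeps the iterates strictly positive, which is why RGD with this metric requires no explicit projection onto the positive orthant.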
Note that we additionally require the iterates to stay in the set specified in Lemma 5.3 for the g-smoothness of the objective to apply. We justify this requirement by the following lemma.
Lemma 5.4.
The iterates generated by the iteration rule (4) satisfy the requirement stated above.
Since the function $q \mapsto D_\alpha(p \,\|\, q)$ is convex in the Euclidean sense, the objective function $\bar{f}$ is also convex in the Euclidean sense. It is therefore natural to apply Theorem 4.6 with $g_1$ as the standard Euclidean metric and $g_2$ as the Poincaré metric. However, the Poincaré metric is only defined on the interior of the non-negative orthant, whereas RGD may yield iterates whose limit point violates the assumption $x_\infty \in \mathcal{M}$ of Lemma 4.5. We show that even if the limit point falls on the boundary of the non-negative orthant, the limit point is still a minimizer when considering the Poincaré metric on $\mathbb{R}^d_{++}$.
Lemma 5.5.
Let the function to be minimized be convex in the Euclidean sense and g-$L$-smooth with respect to the Poincaré metric. Let $(x_t)_{t \in \mathbb{N}}$ be the iterates generated by RGD with the Poincaré metric, initial iterate $x_0 \in \mathbb{R}^d_{++}$, and step size $1/L$. Then, the limit point of the iterates is a minimizer of the function on the closed non-negative orthant $\mathbb{R}^d_{+}$.
Consequently, we still have a non-asymptotic error bound for the extended optimization problem over $\mathbb{R}^d_{++}$.
Since the iteration rule (4) is for the extended optimization problem, its iterates may fall outside the constraint set $\Delta$ of the original problem (3). We show that the sequence of normalized iterates still converges at a rate of $\mathcal{O}(1/T)$.
Proposition 5.6.
Let $(x_t)_{t \in \mathbb{N}}$ be the iterates generated by the iteration rule (4) with initial iterate $x_0 \in \mathbb{R}^d_{++}$. For any $T \in \mathbb{N}$, we have
$$f(\hat{x}_T) - \min_{q \in \Delta} f(q) = \mathcal{O}\!\left(\frac{L D^2}{T}\right),$$
where $D$ is the distance term of Theorem 4.6 and $\hat{x} := x / (\mathbf{1}^\top x)$ for any vector $x \in \mathbb{R}^d_{++}$.
Note that the term $D$ is bounded (Remark 4.7).
5.3 Numerical Results
We conducted numerical experiments on the optimization problem (1) by RGD, using the explicit iteration rule (4). (The code is available on GitHub at the following repository: https://github.com/CMGRWang/Computing-Augustin-Information-via-Hybrid-Geodesically-Convex-Optimization.) The experimental setting is as follows: the support cardinality of is , and the parameter dimension is . The random variable $P$ is uniformly generated from the probability simplex $\Delta$. We implemented all methods in Python on a machine equipped with an Intel(R) Core(TM) i7-9700 CPU running at 3.00 GHz and 16.0 GB memory. The elapsed time represents the actual running time of each algorithm.


Figure 1 shows the results of computing the order-$\alpha$ Augustin information using RGD, the fixed-point iteration in Nakiboğlu’s paper [32], the method proposed by Li and Cevher [28], and the method proposed by You et al. [48]. We tuned the parameters of the latter two methods to achieve faster convergence. We set , , for the method proposed by Li and Cevher [28], and set , , , , for the method proposed by You et al. [48]. We approximate the optimal value by the result obtained from the method proposed by Li and Cevher [28] after iterations.
The numerical results validate our theoretical analysis, showing that RGD converges to the optimum. The fixed-point iteration proposed by Nakiboğlu [32] diverges, as it is only guaranteed to converge for $\alpha \in (0, 1)$. We observe that RGD is slower than the methods proposed by Li and Cevher [28] and by You et al. [48]. However, the latter two methods guarantee only asymptotic convergence, and parameter tuning is required to achieve faster convergence. We found that slightly adjusting the parameters can result in slow convergence for these two algorithms. Figure 2 shows that if we change from to and from to , both methods become slower than RGD in terms of elapsed time. A direct comparison of the numerical results would therefore be unfair, given that these two methods lack convergence rate guarantees and require tuning to obtain better results.
6 Conclusion
We have presented an algorithm for computing the order-$\alpha$ Augustin information with a non-asymptotic optimization error guarantee. We have shown that the objective function of the corresponding optimization problem is geodesically smooth with respect to the Poincaré metric. With this observation, we propose a hybrid geodesic optimization framework for this optimization problem, and demonstrate that Riemannian gradient descent converges at a rate of $\mathcal{O}(1/T)$ under the hybrid framework.
A potential future direction could be generalizing this result for computing quantum Rényi divergences. Our framework can also be applied to other problems. For instance, the objective functions in the Kelly criterion in mathematical finance [25] and in computing the John Ellipsoid [16] are both g-smooth with respect to the Poincaré metric. Another interesting direction is to develop accelerated or stochastic gradient descent-like algorithms in this hybrid scenario.
Acknowledgments
G.-R. Wang, C.-E. Tsai, and Y.-H. Li are supported by the Young Scholar Fellowship (Einstein Program) of the National Science and Technology Council of Taiwan under grant number NSTC 112-2636-E-002-003, by the 2030 Cross-Generation Young Scholars Program (Excellent Young Scholars) of the National Science and Technology Council of Taiwan under grant number NSTC 112-2628-E-002-019-MY3, by the research project “Pioneering Research in Forefront Quantum Computing, Learning and Engineering” of National Taiwan University under grant numbers NTU-CC-112L893406 and NTU-CC-113L891606, and by the Academic Research-Career Development Project (Laurel Research Project) of National Taiwan University under grant numbers NTU-CDP-112L7786 and NTU-CDP-113L7763.
H.-C. Cheng is supported by the Young Scholar Fellowship (Einstein Program) of the National Science and Technology Council, Taiwan (R.O.C.) under Grants No. NSTC 112-2636-E-002-009, No. NSTC 112-2119-M-007-006, No. NSTC 112-2119-M-001-006, No. NSTC 112-2124-M-002-003, by the Yushan Young Scholar Program of the Ministry of Education (MOE), Taiwan (R.O.C.) under Grants No. NTU-112V1904-4, and by the research project “Pioneering Research in Forefront Quantum Computing, Learning and Engineering” of National Taiwan University under Grant No. NTC-CC-113L891605. H.-C. Cheng acknowledges support from the “Center for Advanced Computing and Imaging in Biomedicine (NTU-113L900702)” by the MOE in Taiwan.
References
- Absil et al. [2008] P-A Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton Univ. Press, 2008.
- Antonakopoulos et al. [2020] Kimon Antonakopoulos, Elena Veronica Belmega, and Panayotis Mertikopoulos. Online and stochastic optimization beyond Lipschitz continuity: A Riemannian approach. In Int. Conf. Learning Representations, 2020.
- Arimoto [1973] S. Arimoto. On the converse to the coding theorem for discrete memoryless channels (corresp.). IEEE Trans. Inf. Theory, 19(3):357–359, May 1973. ISSN 0018-9448. doi: 10.1109/TIT.1973.1055007.
- Augustin [1969] U. Augustin. Error estimates for low rate codes. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 14(1):61–88, March 1969. ISSN 0178-8051. doi: 10.1007/BF00534118.
- Augustin [1978] Udo Augustin. Noisy channels. Habilitation Thesis, Universität Erlangen-Nürnberg, 1978.
- Bhatia et al. [2019] Rajendra Bhatia, Tanvi Jain, and Yongdo Lim. On the Bures–Wasserstein distance between positive definite matrices. Expo. Math., 37(2):165–191, 2019.
- Blahut [1974] Richard Blahut. Hypothesis testing and information theory. IEEE Trans. Inf. Theory, 20(4):405–417, July 1974.
- Boumal [2023] Nicolas Boumal. An Introduction to Optimization on Smooth Manifolds. Cambridge Univ. Press, 2023.
- Bubeck [2015] Sébastien Bubeck. Convex optimization: Algorithms and complexity. Found. Trends Mach. Learn., 8(3-4):231–357, November 2015.
- Chen et al. [2017] Jun Chen, Da-ke He, Ashish Jagmohan, and Luis A Lastras-Montaño. On the reliability function of variable-rate Slepian-Wolf coding. Entropy, 19(8):389, July 2017.
- Cheng and Nakiboğlu [2021] Hao-Chung Cheng and Barış Nakiboğlu. On the existence of the Augustin mean. In 2021 IEEE Information Theory Workshop (ITW), pages 1–6, 2021.
- Cheng et al. [2019] Hao-Chung Cheng, Min-Hsiu Hsieh, and Marco Tomamichel. Quantum sphere-packing bounds with polynomial prefactors. IEEE Trans. Inf. Theory, 65(5):2872–2898, January 2019. doi: 10.1109/tit.2019.2891347.
- Cheng et al. [2022a] Hao-Chung Cheng, Li Gao, and Min-Hsiu Hsieh. Properties of noncommutative Rényi and Augustin information. Commun. Math. Phys., 390(2):501–544, March 2022a.
- Cheng et al. [2022b] Hao-Chung Cheng, Eric P Hanson, Nilanjana Datta, and Min-Hsiu Hsieh. Duality between source coding with quantum side information and classical-quantum channel coding. IEEE Trans. Inf. Theory, 68(11):7315–7345, June 2022b.
- Chewi et al. [2020] Sinho Chewi, Tyler Maunu, Philippe Rigollet, and Austin J Stromme. Gradient descent algorithms for Bures-Wasserstein barycenters. In Conf. Learning Theory, pages 1276–1304, 2020.
- Cohen et al. [2019] Michael B Cohen, Ben Cousins, Yin Tat Lee, and Xin Yang. A near-optimal algorithm for approximating the John Ellipsoid. In Conf. Learning Theory, pages 849–873, 2019.
- Csiszár [1995] Imre Csiszár. Generalized cutoff rates and Rényi’s information measures. IEEE Trans. Inf. Theory, 41(1):26–34, January 1995. doi: 10.1109/18.370121.
- Csiszár and Körner [2011] Imre Csiszár and János Körner. Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge Univ. Press, 2011.
- Dalai and Winter [2017] Marco Dalai and Andreas Winter. Constant compositions in the sphere packing bound for classical-quantum channels. IEEE Trans. Inf. Theory, 63(9):5603–5617, July 2017.
- Dueck and Korner [1979] G. Dueck and J. Korner. Reliability function of a discrete memoryless channel at rates above capacity (corresp.). IEEE Trans. Inf. Theory, 25(1):82–85, January 1979. ISSN 0018-9448. doi: 10.1109/TIT.1979.1056003.
- Han et al. [2021] Andi Han, Bamdev Mishra, Pratik Kumar Jawanpuria, and Junbin Gao. On Riemannian optimization over positive definite matrices with the Bures-Wasserstein geometry. Adv. Neural Information Processing Systems, 34:8940–8953, December 2021.
- Haroutunian [1968] EA Haroutunian. Bounds for the exponent of the probability of error for a semicontinuous memoryless channel. Problemy Peredachi Informatsii, 4(4):37–48, 1968.
- Hosseini and Sra [2015] Reshad Hosseini and Suvrit Sra. Matrix manifold optimization for Gaussian mixtures. Adv. Neural Information Processing Systems, 28, 2015.
- Hosseini and Sra [2020] Reshad Hosseini and Suvrit Sra. An alternative to EM for Gaussian mixture models: batch and stochastic Riemannian optimization. Math. Program., 181(1):187–223, May 2020.
- Kelly [1956] John L Kelly. A new interpretation of information rate. Bell Syst. Tech. J., 35(4):917–926, July 1956.
- Lan [2020] Guanghui Lan. First-order and Stochastic Optimization Methods for Machine Learning. Springer, 2020.
- Lee [2018] John M Lee. Introduction to Riemannian Manifolds. Springer, 2018.
- Li and Cevher [2019] Yen-Huan Li and Volkan Cevher. Convergence of the exponentiated gradient method with Armijo line search. J. Optim. Theory Appl., 181(2):588–607, May 2019.
- Mojahedian et al. [2019] Mohammad Mahdi Mojahedian, Salman Beigi, Amin Gohari, Mohammad Hossein Yassaee, and Mohammad Reza Aref. A correlation measure based on vector-valued $L_p$-norms. IEEE Trans. Inf. Theory, 65(12):7985–8004, August 2019.
- Mosonyi and Ogawa [2017] M. Mosonyi and T. Ogawa. Strong converse exponent for classical-quantum channel coding. Commun. Math. Phys., 355(1):373–426, October 2017. ISSN 1432-0916. doi: 10.1007/s00220-017-2928-4.
- Mosonyi and Ogawa [2021] M. Mosonyi and T. Ogawa. Divergence radii and the strong converse exponent of classical-quantum channel coding with constant compositions. IEEE Trans. Inf. Theory, 67(3):1668–1698, December 2021.
- Nakiboğlu [2019] Barış Nakiboğlu. The Augustin capacity and center. Probl. Inf. Transm., 55:299–342, October 2019.
- Nakiboğlu [2020] Barış Nakiboğlu. A simple derivation of the refined sphere packing bound under certain symmetry hypotheses. Turk. J. Math., 44(3):919–948, 2020.
- Nesterov [2018] Yurii Nesterov. Lectures on Convex Optimization. Springer, 2018.
- Omura [1975] J. K. Omura. A lower bounding method for channel and source coding probabilities. Inf. Control., 27(2):148–177, February 1975. ISSN 0019-9958. doi: 10.1016/S0019-9958(75)90120-5.
- Pennec [2020] Xavier Pennec. Manifold-valued image processing with SPD matrices. In Riemannian Geometric Statistics in Medical Image Analysis, pages 75–134. Elsevier, 2020.
- Pennec et al. [2006] Xavier Pennec, Pierre Fillard, and Nicholas Ayache. A Riemannian framework for tensor computing. Int. J. Comput. Vis., 66:41–66, January 2006.
- Rapcsak [1991] Tamas Rapcsak. Geodesic convexity in nonlinear optimization. J. Optim. Theory Appl., 69(1):169–183, April 1991.
- Rényi [1961] Alfréd Rényi. On measures of entropy and information. In Proc. 4th Berkeley Symp. Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, volume 4, pages 547–562, 1961.
- Shen et al. [2023] Yu-Chen Shen, Li Gao, and Hao-Chung Cheng. Privacy amplification against quantum side information via regular random binning. In 2023 59th Annu. Allerton Conference on Communication, Control, and Computing (Allerton), pages 1–8, 2023.
- Udriste [1994] Constantin Udriste. Convex functions and optimization methods on Riemannian manifolds. Springer Science & Business Media, 1994.
- van Erven and Harremoes [2014] Tim van Erven and Peter Harremoes. Rényi divergence and Kullback-Leibler divergence. IEEE Trans. Inf. Theory, 60(7):3797–3820, June 2014. doi: 10.1109/tit.2014.2320500.
- Verdu [2015] Sergio Verdu. $\alpha$-mutual information. In 2015 Information Theory and Applications Workshop (ITA), pages 1–6, 2015.
- Verdú [2021] Sergio Verdú. Error exponents and $\alpha$-mutual information. Entropy, 23(2):199, August 2021.
- Vishnoi [2018] Nisheeth K Vishnoi. Geodesic convex optimization: Differentiation on manifolds, geodesics, and convexity. 2018. arXiv:1806.06373.
- Weber and Sra [2023] Melanie Weber and Suvrit Sra. Global optimality for Euclidean CCCP under Riemannian convexity. In Int. Conf. Machine Learning, pages 36790–36803, 2023.
- Wiesel [2012] Ami Wiesel. Geodesic convexity and covariance estimation. IEEE Trans. Signal Process., 60(12):6182–6189, September 2012.
- You et al. [2022] Jun-Kai You, Hao-Chung Cheng, and Yen-Huan Li. Minimizing quantum Rényi divergences via mirror descent with Polyak step size. In 2022 IEEE Int. Symp. Information Theory (ISIT), pages 252–257, 2022.
- Zhang and Sra [2016] Hongyi Zhang and Suvrit Sra. First-order methods for geodesically convex optimization. In Conf. Learning Theory, pages 1617–1638, 2016.
- Zhang et al. [2013] Teng Zhang, Ami Wiesel, and Maria Sabrina Greco. Multivariate generalized Gaussian distribution: Convexity and graphical models. IEEE Trans. Signal Process., 61(16):4141–4148, June 2013.
Appendices
Appendix A Proof of Lemma 4.5
Since $x_\infty \in \mathcal{M}$, by Lemma 4.3 and the continuity of $\operatorname{grad}_2 f$, we have $\operatorname{grad}_2 f(x_\infty) = 0$.
Note that $\operatorname{grad}_2 f(x_\infty)$ is the (Riesz) representation of the differential $\mathrm{D}f(x_\infty)$ on the tangent space $T_{x_\infty}\mathcal{M}$ with respect to the metric $g_2$, and $\operatorname{grad}_1 f(x_\infty)$ is the representation of the same differential under the other metric $g_1$. Therefore, for every $u \in T_{x_\infty}\mathcal{M}$, we have
$$\langle \operatorname{grad}_1 f(x_\infty), u \rangle_{1, x_\infty} = \mathrm{D}f(x_\infty)[u] = \langle \operatorname{grad}_2 f(x_\infty), u \rangle_{2, x_\infty} = 0,$$
which implies $\operatorname{grad}_1 f(x_\infty) = 0$.
Since $f$ is g-convex with respect to $g_1$, having zero Riemannian gradient with respect to $g_1$ at $x_\infty$ is equivalent to $x_\infty$ being a global minimizer of $f$ on $\mathcal{M}$.
Appendix B Proof of Theorem 4.6
We have shown that
Let and divide both sides by . Since the sequence is non-increasing, we write
By a telescopic sum, we get
Therefore, we have
Appendix C Proof of Lemma 5.2
Given any vector , we can write it as where and . We have
Observe that
and is convex in . This means for every , the function achieves its minimum at . Therefore, let be a minimizer of on , for every and , we have
This shows that is a minimizer of on if and only if is a minimizer of .
Appendix D Proof of Lemma 5.3
We will utilize the following equivalent definition of g-convexity and g-smoothness.
Definition D.1 ([49, 21]).
For a function $f$ defined on $\mathcal{M}$, we say that $f$ is g-convex on $(\mathcal{M}, g)$ if for every geodesic $\gamma$ connecting any two points $x, y \in \mathcal{M}$, the function $t \mapsto f(\gamma(t))$ is convex in $t$ in the Euclidean sense. We say that $f$ is g-$L$-smooth on $(\mathcal{M}, g)$ if for every unit-speed geodesic $\gamma$ connecting any two points $x, y \in \mathcal{M}$, the function $t \mapsto f(\gamma(t))$ is $L$-smooth in $t$ in the Euclidean sense.
Let $\gamma$ be the geodesic connecting $x$ and $y$ for any $x, y \in \mathbb{R}^d_{++}$. If $t \mapsto D_\alpha(p \,\|\, \gamma(t))$ is $L$-smooth in the Euclidean sense for every $p$, then the function $t \mapsto \mathbb{E}[D_\alpha(P \,\|\, \gamma(t))]$ is also $L$-smooth in the Euclidean sense. Therefore, it suffices to prove the g-smoothness of the function $q \mapsto D_\alpha(p \,\|\, q)$ for any fixed $p$.
We now prove that
is -smooth in the Euclidean sense. Note that
We have
The second-order derivative is
Note that and , so the quantities (1) and (2) are both non-negative. Our strategy is to show that both quantities (1) and (2) are upper bounded by , which then implies
This gives the desired result by the fact that a twice-differentiable function is $L$-smooth if and only if its Hessian is upper bounded by $L I$ in the positive semidefinite order, where $I$ denotes the identity matrix.
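As an informal numerical check of this geodesic smoothness claim (not a substitute for the proof), one can estimate the second derivative of $t \mapsto D_\alpha(p \,\|\, x \odot e^{t u})$ along unit-speed geodesics by finite differences; the geodesic form $x \odot e^{t u}$ with $\|u\|_2 = 1$ reflects our assumption on the Poincaré metric from Section 5.1.

```python
import numpy as np

rng = np.random.default_rng(0)

def renyi(p, q, alpha):
    return np.log(np.sum(p**alpha * q**(1.0 - alpha))) / (alpha - 1.0)

def second_derivative_along_geodesic(p, x, u, alpha, h=1e-3):
    """Finite-difference estimate of d^2/dt^2 D_alpha(p || x * exp(t * u)) at t = 0."""
    phi = lambda t: renyi(p, x * np.exp(t * u), alpha)
    return (phi(h) - 2.0 * phi(0.0) + phi(-h)) / h**2

alpha, d, worst = 0.5, 5, 0.0
for _ in range(1000):
    p = rng.dirichlet(np.ones(d))              # a point in the simplex
    x = rng.uniform(0.1, 1.0, size=d)          # a point in the positive orthant
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)                     # unit speed w.r.t. the assumed Poincare metric
    worst = max(worst, abs(second_derivative_along_geodesic(p, x, u, alpha)))
print(worst)  # remains bounded by a small constant, consistent with geodesic smoothness
```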
D.1 Upper Bound of the quantity (1)
The numerator of the quantity (1) is
The first inequality is due to the Cauchy-Schwarz inequality, and the second inequality holds because when for all . Thus, the quantity (1) is upper bounded by .
D.2 Upper Bound of the quantity (2)
The numerator of the quantity (2) is
The first inequality is due to the Cauchy-Schwarz inequality, and the second inequality is because when for all . Thus, the quantity (2) is upper bounded by .
D.3 Geodesic Smoothness of
We only need to show that is g-1-smooth. By direct calculation, we have
where the inequality is due to the concavity and monotonicity of , which implies
Appendix E Proof of Lemma 5.4
Let , then the next iterate is
A direct calculation shows that , so
where is the -th component of . By the inequality for any , we have
Therefore,
Hence,
This gives the desired result.
Appendix F Proof of Lemma 5.5
If the limit lies in , then it is a minimizer of by Theorem 4.6. Assume that the limit point and . Then, by the first-order optimality condition, there exists some such that
(5) |
where is the -th component of the vector .
By Lemma 4.3, we have
where the Riemannian gradient is with respect to the Poincaré metric. Therefore,
Combining this and the inequality (5), we get
which means there exists some such that
By the continuity of , there exists an open neighborhood of such that for all
Since is the limit point of the sequence , there exists such that for all , that is,
However, this implies
since and . Therefore, by the iteration rule of RGD, we have
This shows that is increasing after , which means this sequence cannot converge to . This contradicts our assumption.