From Soft-Minoration to Information-Constrained Optimal Transport and Spiked Tensor Models
Abstract
Let $Q$ be a given distribution on $\mathbb{R}^d$. For $X\sim P$, we may interpret $\mathbb{E}\ln\int e^{\langle X,y\rangle}\,dQ(y)$ as a soft-max of $\langle X,y\rangle$. We explore lower bounds on this quantity in terms of the minimum mutual information $I(Y;\bar{Y})$ over couplings $P_{Y\bar{Y}}$ of $Q$ with itself under which $\langle Y,\bar{Y}\rangle$ is bounded in a certain sense. This may be viewed as a soft version of Sudakov's minoration, which lower bounds the expected supremum of a stochastic process in terms of packing numbers. Our method is based on convex geometry (thrifty approximation of convex bodies), and works for general non-Gaussian $P$. When $P$ is Gaussian, a limiting case recovers a recent inequality of Bai-Wu-Ozgur on information-constrained optimal transport, previously established using Gaussian-specific techniques. We also use soft-minoration to obtain asymptotically (in the tensor order) tight bounds on the free energy in the Sherrington-Kirkpatrick model with spins uniformly distributed on a type class, implying asymptotically tight bounds for the type II error exponent in spiked tensor detection.
I Introduction
Given $Q$ on $\mathbb{R}^d$, define $F:=\mathbb{E}\ln\int e^{\langle X,y\rangle}\,dQ(y)$, where $X\sim P$ and $\langle\cdot,\cdot\rangle$ denotes the inner product. We may interpret $F$ as a soft-max of $\langle X,y\rangle$. Indeed, if $Q$ is the uniform distribution on a compact set $K$, then $F\le\mathbb{E}\sup_{y\in K}\langle X,y\rangle$, and the inequality typically becomes tight when the dimension is large. If $P$ is the standard Gaussian distribution, Sudakov's minoration [23][11] gives
$\mathbb{E}\sup_{y\in K}\langle X,y\rangle \ \ge\ c\,\epsilon\sqrt{\ln M(K,\epsilon)}$ (1)
where $c>0$ is a universal constant and $M(K,\epsilon)$ denotes the $\epsilon$-packing number of $K$ under the Euclidean distance. Generalizations of Sudakov's minoration to other log-concave measures have also been considered [11][10][16]. In this paper we explore inequalities of the following form, which may be called "soft minoration":
$\mathbb{E}\ln\int e^{\langle X,y\rangle}\,dQ(y) \ \gtrsim\ \inf_{P_{Y\bar{Y}}} I(Y;\bar{Y})$ (2)
where the infimum is over couplings $P_{Y\bar{Y}}$ under which $\langle Y,\bar{Y}\rangle$ is "small" and both $Y$ and $\bar{Y}$ have the law $Q$.
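To make the soft-max interpretation concrete, here is a minimal numerical sketch (ours, not from the paper; the choice of $Q$ as the uniform distribution on a finite random point set and all parameters are arbitrary) comparing the soft-max with the expected maximum:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, trials = 200, 500, 200      # dimension, support size of Q, Monte Carlo trials

# Q: uniform on m points of norm sqrt(d) (an arbitrary illustrative choice)
Y = rng.standard_normal((m, d))
Y *= np.sqrt(d) / np.linalg.norm(Y, axis=1, keepdims=True)

soft, hard = [], []
for _ in range(trials):
    x = rng.standard_normal(d)     # X ~ N(0, I_d)
    s = Y @ x                      # inner products <X, y> over the support of Q
    top = np.max(s)
    soft.append(top + np.log(np.mean(np.exp(s - top))))  # ln ∫ e^<X,y> dQ(y), computed stably
    hard.append(top)

print(f"E soft-max ≈ {np.mean(soft):.2f}  <=  E max ≈ {np.mean(hard):.2f}")
```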
One motivation for (2) is network information theory. Cover's problem asks for the minimum relay rate needed to achieve the maximum capacity of a relay channel [7] (see also [26]). Measure concentration and reverse hypercontractivity techniques yield nontrivial bounds but are not sufficient for solving Cover's problem [27][14][15]. The solution in the Gaussian setting is infinity, as shown by [26] using a rearrangement inequality on the sphere (see also a solution for binary symmetric channels using a similar idea [2]). Bai, Wu, and Ozgur [1] provided a simplified proof for the Gaussian setting by proving a bound on information-constrained optimal transport: if $P$ is the standard normal distribution on $\mathbb{R}^d$, $Q$ has well-defined differential entropy, and $P_{XY}\in\Pi(P,Q)$, then
$\mathbb{E}\langle X,Y\rangle \ \le\ \sqrt{d\,\mathbb{E}\|Y\|^2}\;w\!\left(\tfrac{1}{d}\big(I(X;Y)+D(Q\|P)\big)\right)$ (3)
where $D(\cdot\|\cdot)$ denotes relative entropy (all information is in nats throughout the paper), the function $w(t):=\sqrt{1-e^{-2t}}$, and $\Pi(\cdot,\cdot)$ denotes the set of couplings between two distributions. (3) generalizes Talagrand's inequality by replacing optimal transport with entropy-regularized optimal transport [1]. Previous proofs of (3) relied on Gaussian-specific arguments. In contrast, [9] used the traditional auxiliary random variable approach, yielding the same capacity region outer bound for the Gaussian setting as [26][1], and also showed that compress-and-forward solves Cover's problem for discrete memoryless channels under a full-rankness condition. Concurrently, [12][13] used convex geometry to lower bound the soft-max with packing numbers of the support set (for $Q$ uniform on a set), showing the optimality of compress-and-forward for all discrete memoryless channels under the conditions originally stated in [7] (without the full-rankness assumption).
In [13] the argument was restricted to $Q$ of the form of a uniform distribution on a set, which is sufficient for the solution of Cover's problem because, when applied to the relay channel, $Q$ is the restriction of the channel output distribution to the intersection of a type class and a relay decoding set. In this paper we consider general $Q$, and the packing number of a set is replaced by the mutual information in (2); this is accomplished by combining the approach of [13] with a tensorization argument. Another difference from [13] is that [13] used a reduction of general distributions to the case of the Rademacher distribution; in contrast, this paper uses Barvinok's thrifty approximation of convex bodies [3], which yields explicit information-theoretic bounds for general $Q$.
We show as a consequence of (2) that, for any $\tau>1$,
(4)
which implies (3) as $\tau\to\infty$. As noted in [8][1], information-constrained optimal transport is useful in machine learning due to the availability of fast algorithms. In many such applications $Q$ is the empirical distribution of samples, in which case $D(Q\|P)=\infty$ and (3) is useless. In contrast, a bound with finite $\tau$ may still be nontrivial. Moreover, our approach easily extends to general symmetric distributions $P$, in which case $w$ is replaced by another universal function and the infimum in (4) is given a more general definition.
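As a concrete illustration of the machine-learning setting just described, the following sketch (ours; a generic Sinkhorn iteration in the spirit of [8], not the construction analyzed in this paper) computes the entropy-regularized transport plan between two empirical samples and reports its cost along with the mutual information of the resulting coupling:

```python
import numpy as np

rng = np.random.default_rng(1)

def sinkhorn(C, eps, iters=1000):
    """Entropy-regularized OT with uniform marginals: scaling iterations on K = exp(-C/eps)."""
    m, k = C.shape
    a, b = np.full(m, 1.0 / m), np.full(k, 1.0 / k)
    K = np.exp(-C / eps)
    u, v = np.ones(m), np.ones(k)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]   # the coupling pi

# two empirical distributions: 100 samples each from shifted Gaussians in R^2
X = rng.standard_normal((100, 2))
Y = rng.standard_normal((100, 2)) + np.array([2.0, 0.0])
C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # squared Euclidean cost

for eps in (1.0, 0.1):
    pi = sinkhorn(C, eps)
    cost = np.sum(pi * C)
    ind = np.outer(pi.sum(axis=1), pi.sum(axis=0))   # product of the marginals
    mask = pi > 0
    mi = np.sum(pi[mask] * np.log(pi[mask] / ind[mask]))  # I(X;Y) of the coupling, in nats
    print(f"eps={eps}: cost ≈ {cost:.3f}, I(X;Y) ≈ {mi:.3f} nats")
```

Smaller regularization $\epsilon$ drives the plan toward the unregularized optimal transport plan, at the price of a larger mutual information of the coupling.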
In statistical physics, a central quantity is the (expected) free energy $\mathbb{E}\ln\int e^{H(\sigma)}\,dQ(\sigma)$, where $\sigma$ is called the spin or configuration, and the expectation is with respect to the randomness (also known as the disorder) of the Hamiltonian $H$ [25][17]. Clearly (2) provides a lower bound on the free energy, once we embed the configurations in a suitable Euclidean space so that the Hamiltonian is an inner product. The free energy in the Sherrington-Kirkpatrick (SK) model characterizes the information-theoretic threshold for spiked tensor detection [6]. The statistical physics literature has focused on the cases of Rademacher and spherical spins, and existing exact formulae for the free energy generally rely on these structures and are usually hard to evaluate [24][22][6][18]. We prove a simple dimension-free bound in the format of (2) and show its tightness when the tensor order is large and the prior is uniform on a type class, which in turn provides asymptotically tight type II error exponent bounds in spiked tensor detection.
II Preliminaries
The simplest proof of Sudakov's minoration (1) in the Gaussian case is through Gaussian comparison (see e.g. [5][11]). However, this approach is Gaussian-specific, and a longstanding goal in this research area is to extend the results to general log-concave measures (see [16]). For our proofs in Sections III and IV, it suffices to use the following result of Pajor [19], which can be viewed as a generalization of (1) to the case of general $P$ and packing numbers under the Minkowski functional distance. The proof is a simple application of the Alexandrov-Fenchel inequality, and a review of related convex geometry concepts can be found in [13].
Lemma 1.
[19] Suppose that $L$ is a symmetric convex body in $\mathbb{R}^d$, and let $\nu_L$ be the associated cone volume measure. Let $K\subset\mathbb{R}^d$ be compact, and let $X\sim\nu_L$. Let $L^\circ$ be the polar of $L$. For any $\epsilon>0$, define $M(K,\epsilon)$ as the $\epsilon$-packing number of $K$ under the Minkowski functional of $L^\circ$ (which is a norm in the case of a symmetric convex body). Then
(5)
Remark 1.
The bound in Lemma 1 depends on the ambient dimension $d$; when the natural ambient dimension is too large, a dimension-reduction argument may be applied before invoking the lemma (see Section V).
Another key ingredient for the proofs in Section III is to relate the mutual information in (2) to the packing number, which is achieved by approximating the set $K$ in (1) by a sparser set and then applying Markov's inequality and the union bound. More specifically, we use the following "thrifty approximation of convex bodies" by Barvinok [3]. We state here a simplified asymptotic version.
Lemma 2.
[3] For any $\tau>1$, let $\alpha$ and $\beta$ be the solutions to
(6)
(7)
where $h$ denotes the binary entropy function. Then for any symmetric convex body $K\subset\mathbb{R}^d$, there exists a symmetric polytope $P$ satisfying $P\subseteq K\subseteq\tau P$ and having at most $e^{(\beta+o(1))d}$ vertices.
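The tradeoff in Lemma 2 between the approximation factor and the number of vertices can be visualized in the simplest case: the unit disk in $\mathbb{R}^2$ approximated by a regular $N$-gon. The toy computation below (ours; it is not Barvinok's high-dimensional construction) verifies the sandwich factor numerically:

```python
import numpy as np

# The regular N-gon P inscribed in the unit disk B satisfies P ⊂ B ⊂ tau·P,
# where tau is the reciprocal of the distance from the origin to an edge of P.
for N in (4, 8, 16, 64):
    theta = 2 * np.pi * np.arange(N) / N
    V = np.c_[np.cos(theta), np.sin(theta)]       # vertices of P, on the unit circle
    mid = (V + np.roll(V, -1, axis=0)) / 2        # edge midpoints: closest boundary points of P
    tau = 1.0 / np.linalg.norm(mid, axis=1).min()
    print(f"N = {N:3d} vertices: tau = {tau:.4f}   (exact 1/cos(pi/N) = {1/np.cos(np.pi/N):.4f})")
```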
III General $P$
In this section we derive a bound in the form of (2) which, among other things, generalizes (3) to arbitrary $P$ satisfying a symmetry condition (see Corollary 4). To be precise, the constant in (3) is replaced by a worse constant for general $P$; we will explain in the next section how the constant is improved to the sharp value for Gaussian $P$.
Theorem 1.
Suppose that $P$ and $Q$ are distributions on $\mathbb{R}^d$, $X\sim P$, and $Y$ and $\bar{Y}$ both have distribution $Q$, where $(Y,\bar{Y})\sim P_{Y\bar{Y}}$. Then for any $\tau>1$ and $\epsilon>0$,
(8) |
where the infimum is over all $P_{Y\bar{Y}}$ satisfying the inner-product constraint for all realizations of $(Y,\bar{Y})$. (8) also holds if the pointwise constraint is replaced by the corresponding constraint in expectation.
Proof.
It suffices to prove the case where $P$ and $Q$ are supported on finite sets and all the probability masses are rational numbers. The general case can then be established by an approximation argument, using the fact that the mutual information can be arbitrarily well approximated with finite partitions of the space [21]. For any $n$ which divides the denominators of these rational numbers, let $\hat{P}_n$ (resp. $\hat{Q}_n$) be the equiprobable distribution on the $P$-type class (resp. $Q$-type class) of length-$n$ sequences. Define
(9) |
where the notation is as defined above. Then by the method of types and large deviation analysis we have
(10)
(11) |
where the notation in (10) is as defined above, and (11) follows since, by the Donsker-Varadhan variational formula, $\ln\mathbb{E}_P[e^f]=\sup_{Q'}\{\mathbb{E}_{Q'}[f]-D(Q'\|P)\}$ for any bounded measurable $f$, and therefore
(12) |
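The Donsker-Varadhan supremum above is attained by the exponentially tilted distribution $Q^*(z)\propto P(z)e^{f(z)}$. The following sketch (a generic sanity check on a finite alphabet, ours and not part of the proof) verifies the formula numerically:

```python
import numpy as np

rng = np.random.default_rng(2)
k = 6
P = rng.random(k); P /= P.sum()       # base distribution on a k-letter alphabet
f = rng.standard_normal(k)            # an arbitrary function on the alphabet

lhs = np.log(np.sum(P * np.exp(f)))   # ln E_P[e^f]

# The supremum is attained by the tilted distribution Q*(z) ∝ P(z) e^{f(z)}.
Q = P * np.exp(f); Q /= Q.sum()
rhs = np.sum(Q * f) - np.sum(Q * np.log(Q / P))   # E_{Q*}[f] - D(Q* || P)

print(f"ln E_P[e^f] = {lhs:.6f}   sup_Q {{E_Q[f] - D(Q||P)}} = {rhs:.6f}")
```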
By Lemma 2, we can choose a finite set within the convex hull of the support of $\hat{Q}_n$ such that
(13)
(14)
Let $\nu$ be the equiprobable distribution on this set. Define a subset of the type class as
(15) |
Then by Markov’s inequality we have
(16) |
and moreover,
(17)
(18) |
where we used (16), and the fact that the integral is constant on the type class by permutation invariance. Now let $M$ be the packing number under the Minkowski functional, which is upper bounded by the corresponding Euclidean packing number by (14). Therefore by Pajor's Lemma 1,
(19) |
For any given center, the set of points of the type class within the prescribed distance is
(20) |
whose cardinality is, by large deviation analysis,
(21) |
where the supremum is over the same set as the infimum in (8). Note that the packing number can be lower bounded by the total number of points divided by the cardinality of the set in (20); using (16) and (21) we have
(22)
∎
Next, we consider a limiting case of Theorem 1 as $\tau\to\infty$.
Definition 1.
Fix a distribution $Q$ on $\mathbb{R}^d$. For any $R\ge 0$ define
$\phi_Q(R):=\sup_{P_{Y\bar{Y}}}\mathbb{E}\langle Y,\bar{Y}\rangle$ (23)
where the supremum is over all couplings $P_{Y\bar{Y}}\in\Pi(Q,Q)$ satisfying $I(Y;\bar{Y})\le R$.
Remark 3.
Since $\langle Y,\bar{Y}\rangle\le\frac{1}{2}(\|Y\|^2+\|\bar{Y}\|^2)$, we see that $\phi_Q(R)\le\mathbb{E}\|Y\|^2$. Moreover, if $Q$ is standard Gaussian then the supremum is achieved when $(Y,\bar{Y})$ is jointly Gaussian, by Talagrand's inequality (a special case of (3) when $Q=P$), and therefore $\phi_Q(R)=d\,w(R/d)$.
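Talagrand's inequality $W_2^2(Q,P)\le 2D(Q\|P)$ for standard Gaussian $P$ (in nats) can be sanity-checked in closed form when $Q$ is itself Gaussian. The sketch below (ours; the family $Q=N(\mu,\sigma^2 I_d)$ is an arbitrary test case) evaluates both sides using the standard formulas:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 5
for _ in range(3):
    mu = rng.standard_normal(d)
    sig = rng.uniform(0.3, 3.0)
    # Q = N(mu, sig^2 I_d) vs P = N(0, I_d): closed forms, in nats
    w2sq = mu @ mu + d * (sig - 1.0) ** 2                          # W_2^2(Q, P)
    kl = 0.5 * (mu @ mu + d * (sig**2 - 1.0 - 2.0 * np.log(sig)))  # D(Q || P)
    print(f"W2^2 = {w2sq:8.3f}  <=  2 D(Q||P) = {2*kl:8.3f}")
```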
Corollary 3.
Remark 4.
From (10)-(11) we see that the bounds in Theorem 1 and Corollary 3 can be sharpened by replacing the mutual information with the right side of (12). It might appear that this is a strict improvement, but actually it is an equivalent version, since the converse implication is also true. Indeed, the fact that these bounds hold for all $P$ and $Q$ implies the sharpened versions with the right side of (12), using a tensorization argument similar to (10)-(11).
An equivalent form of (3) is the following:
Corollary 4.
Let $P$ and $Q$ be as in Corollary 3. Denote by $\phi_Q^{-1}$ the inverse function of $\phi_Q$. For any $t$ in the range of $\phi_Q$, we have
(29) |
where the supremum is over couplings $P_{Y\bar{Y}}\in\Pi(Q,Q)$ satisfying the constraint of Definition 1.
IV The Gaussian Case
For Gaussian $P$, we can improve the estimates in Section III by replacing Lemma 2 with the following sharp estimate, which follows from sphere covering bounds (e.g. [4]).
Lemma 5.
Fix $\epsilon>0$. For any positive integer $d$ there exists a set $V$ on the unit ball of $\mathbb{R}^d$ satisfying $|V|\le(1+2/\epsilon)^d$ such that every point of the unit ball is within Euclidean distance $\epsilon$ of $V$.
Now define two functions by the following pair of equations; explicitly,
(35)
(36)
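A greedy construction underlies covering estimates such as Lemma 5: a maximal $\epsilon$-separated subset of (a dense sample of) the sphere is automatically an $\epsilon$-net, and its size is controlled by the volumetric bound $(1+2/\epsilon)^d$. The following low-dimensional sketch (ours; parameters arbitrary) illustrates this:

```python
import numpy as np

rng = np.random.default_rng(4)
d, eps = 3, 0.5

# Greedy packing: keep a candidate iff it is > eps away from every point kept so far.
# A maximal eps-separated subset is automatically an eps-net of the sampled points.
cands = rng.standard_normal((20000, d))
cands /= np.linalg.norm(cands, axis=1, keepdims=True)   # candidates on the unit sphere

net = []
for x in cands:
    if all(np.linalg.norm(x - y) > eps for y in net):
        net.append(x)

print(f"net size = {len(net)}   volumetric bound (1 + 2/eps)^d = {(1 + 2/eps)**d:.0f}")
```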
Theorem 2.
Proof.
The proof is similar to the general non-Gaussian case, and we shall only mention a few differences in the argument. It suffices to consider $Q$ supported on a finite set with all probability masses equal to rational numbers. Let $n$ divide the denominators of these rational numbers. Define the type class and the associated distributions as in the proof of Theorem 1. Then we have
(37)
(38) |
Define the normalized quantities as before. Note that, by Jensen's inequality,
(39) |
Choose the approximating set similarly to before, but using Lemma 5 instead, and replace the body in (13) accordingly. Define a random variable distributed on this set according to the cone volume measure, and let a random rotation of $\mathbb{R}^d$, independent of it, follow the uniform distribution on the orthogonal group. Averaging over the rotation, there exists some (deterministic) rotation attaining the average, which we can assume without loss of generality to be the identity, so that
(40) |
The rest of the proof is similar to Theorem 1, where the relevant set is now a centered sphere. The improved estimate on the left side of (8) is seen by refining (20) for points in a ball. ∎
Remark 5.
The bounds claimed in Theorem 2 are asymptotically tight (as $d\to\infty$) when $Q$ is uniform on a ball.
V Spiked Tensor Model
In this section we explore bounds in the form of (2) when $Q$ is supported on rank-1 tensors, and the implications for the spiked tensor detection problem in statistics [20][6]. The set of order-$p$ tensors on $\mathbb{R}^n$ is again a vector space, and can be given an inner product compatible with the Frobenius norm. The dimension of the space of order-$p$ tensors is $n^p$, and that of symmetric order-$p$ tensors is $\binom{n+p-1}{p}$; both are too large for directly applying a dimension-dependent minoration such as Lemma 1 to obtain tight bounds. As mentioned in Remark 1, a dimension-reduction argument may be applied. In this section, we shall just focus on the case of Gaussian disorder, where we can apply a Gaussian comparison argument, which reduces the problem to the random energy model (REM). The result we will use is (see [17, p. 150])
$\lim_{N\to\infty}\frac{1}{N}\,\mathbb{E}\ln\sum_{i=1}^{2^N}e^{\beta\sqrt{N}Z_i}=\begin{cases}\beta^2/2+\ln 2,&\beta\le\sqrt{2\ln 2}\\ \beta\sqrt{2\ln 2},&\beta>\sqrt{2\ln 2}\end{cases}$ (43)
where $Z_1,\dots,Z_{2^N}$ are independent standard Gaussian random variables.
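The REM limit (43) is easy to probe by simulation. The sketch below (ours; finite-size effects are visible at moderate $N$) compares a Monte Carlo estimate of the free energy with the limiting formula on both sides of the freezing transition $\beta_c=\sqrt{2\ln 2}$:

```python
import numpy as np

rng = np.random.default_rng(5)

def rem_mc(N, beta, trials=20):
    """Monte Carlo estimate of (1/N) E ln sum_{i<=2^N} exp(beta sqrt(N) Z_i)."""
    out = []
    for _ in range(trials):
        e = beta * np.sqrt(N) * rng.standard_normal(2**N)
        m = np.max(e)
        out.append((m + np.log(np.sum(np.exp(e - m)))) / N)  # stable log-sum-exp
    return np.mean(out)

def rem_limit(beta):
    bc = np.sqrt(2 * np.log(2))   # freezing transition
    return beta**2 / 2 + np.log(2) if beta <= bc else beta * bc

for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: MC (N=18) ≈ {rem_mc(18, beta):.3f}, limit = {rem_limit(beta):.3f}")
```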
We will lower bound the soft-max (free energy) when the spin vector follows the equiprobable distribution on a type class; once this setting is understood, the free energy for other permutation-invariant priors (such as i.i.d. coordinates) can be estimated using the standard method of types and large deviation analysis.
Theorem 3.
Let $\mu$ be a distribution on $\mathbb{R}$ with finite support, and let the spin vector $\sigma$ be equiprobably distributed on the $\mu$-type class in $\mathbb{R}^n$ (with rounding if necessary). Define
(44) |
for any order-$p$ tensor, and let
(45) |
Then
(48) |
where the random tensor in (48) is an order-$p$ tensor with i.i.d. standard Gaussian entries.
Proof.
Set the normalization as in the statement of the theorem; it does not depend on the configuration. Define
(49) |
where the ball in (49) denotes the centered ball of the appropriate radius in the space of tensors under the Frobenius norm. Let $\sigma$ and $\bar{\sigma}$ be two sequences in the type class, and define the corresponding rank-1 tensors accordingly. Let $T$ be the joint type of $(\sigma,\bar{\sigma})$. Then
(50) |
where $\mathbb{E}_T$ denotes the expectation under the type $T$, viewed as a distribution on $\mathbb{R}^2$. Then by the large deviations analysis,
(51) |
We now generate a random measure supported on the support of the prior, as follows. Select uniformly at random a point in the support, and then select uniformly at random the next point among all points at least a prescribed distance away from the previously selected points, and so on, until no more points can be selected. Let the random measure be the equiprobable distribution on these selected points; it is random because of the randomness of the point selection process. By symmetry of the type class, its expectation is the equiprobable distribution on the whole support, so by Jensen's inequality,
(52) |
Since the support size of the random measure admits a deterministic lower bound, using (43) and Slepian's comparison [5] we can lower bound the right side of (52) in terms of the free energy of the REM with parameters given by
(53)
(54) |
and the theorem follows by taking $n\to\infty$. ∎
While the statistical physics literature mostly focuses on spins equiprobable on a Boolean cube, a general prior $\mu$ is relevant for statistical applications such as spiked tensor detection [20][6]. Let the noise $W$ be an order-$p$ tensor with i.i.d. standard Gaussian entries. Consider a hypothesis testing problem with observation $Y$:
• $H_0$: $Y=W$;
• $H_1$: $Y=\lambda X+W$,
where $X$ is the normalized rank-1 tensor built from the spin vector as in Theorem 3, and $\lambda>0$ is the signal-to-noise ratio. Denote by $P_0$ and $P_1$ the distributions of $Y$ under $H_0$ and $H_1$, respectively. From the Gaussian density formula it is easy to see that
$\frac{dP_1}{dP_0}(Y)=\mathbb{E}_X\left[e^{\lambda\langle X,Y\rangle-\frac{\lambda^2}{2}\|X\|_F^2}\right]$ (55)
Using concentration, it can be shown that the critical $\lambda$ for detecting a rank-1 spike with nontrivial probability coincides with the largest $\lambda$ for which $D(P_1\|P_0)$ is sublinear in $n$. Previously, [20] computed this critical $\lambda$ by bounding $D(P_1\|P_0)$ with the Rényi divergence $D_2(P_1\|P_0)$. However, when $\lambda$ is above the critical value, $D_2(P_1\|P_0)$ grows super-linearly in $n$ (see [20, Section 2.4]) and hence does not give a useful bound for $D(P_1\|P_0)$, and hence for the free energy. In contrast, we show that Theorem 3 is asymptotically tight for large $p$:
Corollary 6.
In Theorem 3, suppose that $\mu$ has unit variance. Then
(58) |
where $H(\mu)$ is the entropy of $\mu$. In particular, if $\lambda$ is fixed and the type I error in spiked tensor detection is bounded away from 0 and 1, then the optimal type II error exponent converges, as $p\to\infty$, to the value determined by (58).
Proof.
For any fixed argument, the bound in Theorem 3 converges as $p\to\infty$; taking this limit proves the lower-bound part of (58). To see the upper-bound part, we follow [20] and consider the maximum likelihood statistic
(59) |
where the maximum is over spin vectors whose empirical distribution is typical. Consider the event that this statistic exceeds a threshold chosen so that, by the union bound calculation in [20, Proposition 4.1], the event has vanishing probability under the null. Then
(60)
(61) |
Note that, under the alternative, the statistic follows a Gaussian distribution with a shifted mean. In the nondegenerate case, we have
(62) |
By the data processing inequality,
(63)
(64) |
where $d(\cdot\|\cdot)$ denotes the binary divergence function. Then from (55) we have shown the upper-bound part of (58) in the nondegenerate case; the remaining case is trivial by monotonicity. ∎
Remark 6.
Results related to Corollary 6 have appeared in the literature: as mentioned, [20] performed 2-Rényi divergence calculations to show that the critical $\lambda$ converges, as $p\to\infty$, to the value predicted by the REM. The 2-Rényi divergence is equivalent to the expected partition function of a 2-replica system. For spins equiprobable on the hypercube, a classical replica symmetry calculation (see e.g. [17]) shows that the free energy of the $p$-spin model converges to that of the REM as $p\to\infty$.
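To make the detection problem concrete, the following toy simulation (ours; it uses a Rademacher prior, tensor order $p=3$, brute-force maximization, and an arbitrary signal-to-noise ratio, none of which matches the asymptotic regime of Corollary 6) contrasts the maximum-likelihood scan in the spirit of (59) under the null and under a planted spike:

```python
import itertools
import numpy as np

rng = np.random.default_rng(6)
n, lam = 10, 6.0        # n small enough for brute force over all 2^n spin vectors

def mle_stat(Y):
    """max over sigma in {-1,+1}^n of <Y, u⊗u⊗u> with u = sigma / sqrt(n)."""
    best = -np.inf
    for s in itertools.product([-1.0, 1.0], repeat=n):
        u = np.array(s) / np.sqrt(n)
        best = max(best, np.einsum('ijk,i,j,k->', Y, u, u, u))
    return best

W = rng.standard_normal((n, n, n))                 # noise tensor
u = rng.choice([-1.0, 1.0], size=n) / np.sqrt(n)   # planted spin, normalized
X = np.einsum('i,j,k->ijk', u, u, u)               # rank-1 spike, unit Frobenius norm

print("H0 statistic:", round(mle_stat(W), 3))             # each scan entry is N(0,1) under H0
print("H1 statistic:", round(mle_stat(lam * X + W), 3))   # roughly lam + O(1)
```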
VI Acknowledgement
The author would like to thank Qiang Wu for discussions on the SK model and tensor detection.
References
- [1] Yikun Bai, Xiugang Wu, and Ayfer Ozgur. Information constrained optimal transport: From Talagrand, to Marton, to Cover. arXiv:2008.10249.
- [2] L. P. Barnes, X. Wu, and A. Ozgur. A solution to Cover's problem for the binary symmetric relay channel: Geometry of sets on the Hamming sphere. In 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 844–851, Oct 2017.
- [3] Alexander Barvinok. Thrifty approximations of convex bodies by polytopes. International Mathematics Research Notices, 2014(16):4341–4356, 2014.
- [4] Károly Böröczky and Gergely Wintsche. Covering the sphere by equal spherical balls. In Discrete and computational geometry, pages 235–251. Springer, 2003.
- [5] Sourav Chatterjee. An error bound in the Sudakov-Fernique inequality. arXiv preprint math/0510424, 2005.
- [6] Wei-Kuo Chen. Phase transition in the spiked random tensor with Rademacher prior. The Annals of Statistics, 47(5):2734–2756, 2019.
- [7] T. M. Cover. The capacity of the relay channel. Open Problems in Communication and Computation, edited by T. M. Cover and B. Gopinath, New York: Springer-Verlag, pages 72–73, 1987.
- [8] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. Advances in neural information processing systems, 26, 2013.
- [9] Abbas El Gamal, Amin Gohari, and Chandra Nair. A strengthened cutset upper bound on the capacity of the relay channel and applications. IEEE Transactions on Information Theory, 2022.
- [10] Rafał Latała. Sudakov-type minoration for log-concave vectors. Studia Mathematica, 223(3):251–274, 2014.
- [11] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: isoperimetry and processes. Springer Science & Business Media, Berlin Heidelberg, 1991.
- [12] Jingbo Liu. Soft minoration: Solution to Cover's problem in the original discrete memoryless setting. In 2021 IEEE International Symposium on Information Theory (ISIT), pages 1648–1652. IEEE, 2021.
- [13] Jingbo Liu. Minoration via mixed volumes and Cover's problem for general channels. Probability Theory and Related Fields, 183(1):315–357, 2022.
- [14] Jingbo Liu and Ayfer Ozgur. New converses for the relay channel via reverse hypercontractivity. In 2019 IEEE International Symposium on Information Theory (ISIT), pages 2878–2882, 2019.
- [15] Jingbo Liu and Ayfer Özgür. Capacity upper bounds for the relay channel via reverse hypercontractivity. IEEE Transactions on Information Theory, 66(9):5448–5455, 2020.
- [16] Shahar Mendelson, Emanuel Milman, and Grigoris Paouris. Generalized dual Sudakov minoration via dimension-reduction program. Studia Mathematica, 244:159–202, 2019.
- [17] Marc Mezard and Andrea Montanari. Information, physics, and computation. Oxford University Press, 2009.
- [18] Jean-Christophe Mourrat. The Parisi formula is a Hamilton–Jacobi equation in Wasserstein space. Canadian Journal of Mathematics, 74(3):607–629, 2022.
- [19] Alain Pajor. Sous-espaces des espaces de Banach. PhD thesis, L'Université Pierre et Marie Curie, https://perso.math.u-pem.fr/pajor.alain/recherche/docs/these.pdf, 8 Nov 1984.
- [20] Amelia Perry, Alexander S. Wein, Afonso S. Bandeira, and Ankur Moitra. Optimality and sub-optimality of PCA I: Spiked random matrix models. The Annals of Statistics, 46(5):2416–2451, 2018.
- [21] M. S. Pinsker. Information and Information Stability of Random Variables and Processes. San Francisco: Holden-Day, translated and edited by Amiel Feinstein, 1964.
- [22] Eliran Subag. The geometry of the Gibbs measure of pure spherical spin glasses. Inventiones Mathematicae, 210(1):135–209, 2017.
- [23] Vladimir N. Sudakov. Gaussian measures, Cauchy measures and ε-entropy. In Soviet Math. Dokl., volume 10, pages 310–313, 1969.
- [24] Michel Talagrand. Free energy of the spherical mean field model. Probability theory and related fields, 134(3):339–382, 2006.
- [25] Michel Talagrand. Mean field models for spin glasses: Volume I: Basic examples, volume 54. Springer Science & Business Media, 2010.
- [26] X. Wu, L. P. Barnes, and A. Ozgur. The capacity of the relay channel: Solution to Cover’s problem in the Gaussian case. IEEE Transactions on Information Theory, 65(1):255–275, Jan. 2019.
- [27] Xiugang Wu, Ayfer Özgür, and Liang-Liang Xie. Improving on the cut-set bound via geometric analysis of typical sets. IEEE Transactions on Information Theory, 63(4):2254–2277, 2017.