Equivalence of Approximate Message Passing and
Low-Degree Polynomials in Rank-One Matrix
Estimation
Abstract
We consider the problem of estimating an unknown parameter vector , given noisy observations of the rank-one matrix , where has independent Gaussian entries. When information is available about the distribution of the entries of , spectral methods are known to be strictly sub-optimal. Past work characterized the asymptotics of the accuracy achieved by the optimal estimator. However, no polynomial-time estimator is known that achieves this accuracy.
It has been conjectured that this statistical-computational gap is fundamental, and moreover that the optimal accuracy achievable by polynomial-time estimators coincides with the accuracy achieved by certain approximate message passing (AMP) algorithms. We provide evidence towards this conjecture by proving that no estimator in the (broader) class of constant-degree polynomials can surpass AMP.
1 Introduction
Statistical-computational gaps are a ubiquitous phenomenon in high-dimensional statistics. Consider estimation of a high-dimensional parameter vector from noisy observations . In many models, we can characterize the optimal accuracy achieved by arbitrary estimators. However, when we analyze classes of estimators that can be implemented via polynomial-time algorithms, a significantly smaller accuracy is obtained. A short list of examples includes sparse regression [CM22, GZ22], sparse principal component analysis [JL09, AW09, KNV15, BR13], graph clustering and community detection [DKMZ11, BHK+19], tensor principal component analysis [RM14, HSS15], and tensor decomposition [MSS16, Wei22].
This state of affairs has led to the conjecture that these gaps are fundamental in nature: there simply exists no polynomial-time algorithm achieving the statistically optimal accuracy. Proving such a conjecture is extremely challenging since standard complexity-theoretic assumptions (e.g., P ≠ NP) are ill-suited to establish complexity lower bounds in average-case settings where the input to the algorithm is random. A possible approach to overcome this challenge is to establish ‘average-case’ reductions between statistical estimation problems. We refer to [BBH18, BB20] for pointers to this rich line of work.
A second approach is to prove reductions between classes of polynomial-time algorithms, thus trimming the space of possible algorithmic choices. This paper contributes to this line of work by establishing—in a specific context—that approximate message passing (AMP) algorithms and low-degree (Low-Deg) polynomial estimators achieve asymptotically the same accuracy.
Examples of reductions between algorithm classes in statistical estimation include [HKP+17, BBH+21, CMW20, BMR21, CM22]. The motivation for studying the relation between AMP and Low-Deg comes from the distinctive position occupied by these two classes in the theoretical landscape. AMP are iterative algorithms motivated by ideas in information theory (iterative decoding [Gal62, RU08]) and statistical physics (cavity method and TAP equations [TAP77, MPV87]). Their structure is relatively constrained: they operate by alternating a matrix-vector multiplication (with a matrix constructed from the data ), and a non-linear operation on the same vectors (typically independent of the data matrix). This structure has opened the way to sharp characterization of their behavior in the high-dimensional limit, known as ‘state evolution’ [DMM09, BM11, Bol14].
Low-Deg was originally motivated by a connection to Sum-of-Squares (SoS) semidefinite programming relaxations and captures a different (algebraic) notion of complexity [HS17, HKP+17, SW22]. In Low-Deg procedures, each unknown parameter is estimated via a fixed polynomial in the entries of the data matrix , of constant or moderately growing degree. As such, these estimators are relatively unconstrained, and indeed capture (via polynomial approximation) a broad variety of natural approaches. Developing a sharp analysis of such a broad class of estimators can be very challenging.
In the symmetric rank-one estimation problem, we observe a noisy version of the rank-one matrix , with an unknown vector:
(1)
Here is a random matrix independent of , drawn from the Gaussian Orthogonal Ensemble which we define here by and independent with and for . Given a single observation of the matrix , we would like to estimate .
Because of its simplicity, the rank-one estimation problem has been a useful sandbox to develop new mathematical ideas in high-dimensional statistics. A significant amount of work has been devoted to the analysis of spectral estimators, which typically take the form where is the top eigenvector of and is a scaling factor [HR04, BBP05, BS06, BGN11]. However, spectral methods are known to be suboptimal if additional information is available about the entries of . In this paper, we model this information by assuming , for a probability distribution on (which does not depend on ). The resulting Bayes error coincides (up to terms negligible as ) with the minimax optimal error in the class of vectors with empirical distribution converging to . Hence this model captures the minimax behavior in a well-defined class of signal parameters.
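For concreteness, the following sketch generates a synthetic instance and computes a spectral estimate. The normalization (observation matrix equal to (λ/n)·θθᵀ plus GOE noise with off-diagonal variance 1/n), the choice of prior, and all names here are illustrative assumptions of ours, since the displayed model and scaling are not reproduced in this extract.

```python
import numpy as np

def sample_instance(n, lam, vals, probs, rng):
    """Draw (theta, A) from the assumed spiked model
    A = (lam / n) * theta theta^T + W, with W symmetric,
    W_ij ~ N(0, 1/n) for i < j and W_ii ~ N(0, 2/n)."""
    theta = rng.choice(vals, size=n, p=probs)
    G = rng.normal(size=(n, n)) / np.sqrt(n)
    W = (G + G.T) / np.sqrt(2)                      # off-diagonal variance 1/n
    W[np.diag_indices(n)] = rng.normal(scale=np.sqrt(2.0 / n), size=n)
    return theta, (lam / n) * np.outer(theta, theta) + W

rng = np.random.default_rng(0)
n, lam = 2000, 2.0
theta, A = sample_instance(n, lam, vals=[-1.0, 1.0], probs=[0.5, 0.5], rng=rng)

# Spectral estimate: rescaled top eigenvector (the scaling and the unresolved
# sign are illustrative; the exact spectral estimator is not reproduced here).
evals, evecs = np.linalg.eigh(A)
v = np.sqrt(n) * evecs[:, -1]
overlap = abs(v @ theta) / (np.linalg.norm(v) * np.linalg.norm(theta))
print(f"top eigenvalue: {evals[-1]:.3f}   normalized overlap: {overlap:.3f}")
```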
In the high-dimensional limit (with fixed), the Bayes optimal accuracy for estimating under model (1) (and the above assumptions on ) is known to converge to a well-defined limit that was characterized in a sequence of papers, see e.g. [LM19, BM19, CX22]. The asymptotic accuracy of the optimal AMP algorithm (called Bayes AMP) is also known [MV21].
It is useful to pause and recall these results in some detail. Define the function (here and below denotes the set of probability distributions over ) by letting
(2)
(3) |
Note that can be interpreted as the mutual information between a scalar random variable and the scalar noisy observation . It can be expressed as a one-dimensional integral with respect to :
(4) |
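The scalar quantities entering this definition are easy to evaluate numerically. The sketch below computes the posterior mean and the scalar minimum mean squared error mmse(γ) for the (assumed) Gaussian channel Y = √γ·Θ + Z with a discrete prior; the mutual information can then be recovered, for instance, by integrating mmse via the I-MMSE relation I′(γ) = mmse(γ)/2. The parametrization is an assumption on our part, as Eqs. (2)–(4) are not reproduced here.

```python
import numpy as np

def scalar_posterior_mean(y, gamma, vals, probs):
    """Posterior mean E[Theta | sqrt(gamma)*Theta + Z = y] for a discrete
    prior on `vals` with weights `probs` (Z standard normal)."""
    vals = np.asarray(vals); probs = np.asarray(probs)
    # likelihoods of y under each support point, up to a common factor
    w = probs * np.exp(-0.5 * (y[:, None] - np.sqrt(gamma) * vals[None, :]) ** 2)
    return (w @ vals) / w.sum(axis=1)

def scalar_mmse(gamma, vals, probs, num_mc=200_000, seed=0):
    """Monte Carlo estimate of mmse(gamma) = E[(Theta - E[Theta|Y])^2]."""
    rng = np.random.default_rng(seed)
    theta = rng.choice(vals, size=num_mc, p=list(probs))
    y = np.sqrt(gamma) * theta + rng.normal(size=num_mc)
    return np.mean((theta - scalar_posterior_mean(y, gamma, vals, probs)) ** 2)

# Example: a symmetric two-point prior; mmse decreases from Var(Theta) towards 0.
vals, probs = [-1.0, 1.0], [0.5, 0.5]
for gamma in [0.0, 0.5, 1.0, 2.0, 5.0]:
    print(f"gamma = {gamma:4.1f}   mmse ~ {scalar_mmse(gamma, vals, probs):.4f}")
```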
The next statement adapts results from [LM19] concerning the behavior of the Bayes optimal error (see Appendix A, which details the derivation from [LM19]). A formal definition of Bayes AMP will be given in the next section, alongside a formal definition of Low-Deg algorithms.
Theorem 1.1.
Assume is independent of , has non-vanishing first moment , and has finite moments of all orders. Define
(5)
(6) |
Bayes AMP has time complexity , for any diverging sequence , and achieves mean squared error (MSE)
(7) |
Further, assume that is differentiable at . Then the minimum MSE of any estimator is
(8) |
Remark 1.1 (Differentiability at ).
The function is concave because it is a minimum of linear functions, and therefore differentiable at all except for countably many values of . Hence the differentiability assumption amounts to requiring that is non-exceptional.
Also note that replacing the prior by its shift where amounts to changing to . Hence the assumption that is non-exceptional is equivalent to the assumption that the mean of is non-exceptional.
We also note that our results will not require such genericity assumptions, which we only introduce to offer a simple comparison to the Bayes optimal estimator.
Remark 1.2 (Nonzero mean assumption).
The assumption can be removed from Theorem 1.1 provided the mean squared error metric is replaced by a different metric that is invariant under change of relative sign in and . In that case, Bayes AMP needs to be modified, e.g. via a spectral initialization [MV21]. Here we focus on the case because this is the most relevant for the results that follow.
In this paper we compare Bayes AMP run for a constant number of iterations and Low-Deg estimators with degree . Assuming and is sub-Gaussian, we establish the following (see Theorem 2.3):
• For any fixed and , there exists a constant and a degree- estimator that approximates the MSE achieved by Bayes AMP after iterations within an additive error of at most .
• For any constant , no degree- estimator can surpass the asymptotic MSE of Bayes AMP. Namely, for any fixed and any degree- estimator , we have .
Here and throughout, the notation signifies a quantity that vanishes in the limit where with all other parameters (such as ) held fixed. The first claim above (‘upper bound’) is proved by a straightforward polynomial approximation of Bayes AMP. To obtain the second claim (‘lower bound’), we develop a new proof technique. Given a Low-Deg estimator for coordinate , we express as a sum of terms that are associated to rooted multi-graphs on vertex set with at most edges. We group these terms into a constant number of homomorphism classes. We next prove that among these classes, only those corresponding to trees yield a non-negligible contribution as . Finally we show that the latter contribution can be encoded into an AMP algorithm and use existing optimality theory for AMP algorithms [CMW20, MW22] to deduce the result.
This new proof strategy elucidates the relation between AMP and Low-Deg algorithms: roughly speaking, AMP algorithms correspond to ‘tree-structured’ low-degree polynomials, a subspace of all Low-Deg estimators. AMP can effectively evaluate tree-structured polynomials via a dynamic-programming style recursion with complexity (with the tree depth) instead of the naive .
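The dynamic-programming claim can be illustrated on the simplest tree, a rooted path with t edges: summing products of matrix entries over all labelings of the path amounts to t matrix-vector products, i.e., order n²·t work instead of the naive order n^t. The toy sketch below ignores the non-reversing constraint and any normalization (so the two routines compute the same quantity, up to floating-point error); it is meant only to convey the mechanism.

```python
import numpy as np

def path_polynomial_naive(A, t):
    """Sum over ALL labelings phi of a rooted path with t edges,
    with phi(root) = 0 (Python index), of prod_e A[phi(u), phi(v)].
    Cost grows as n**t."""
    n = A.shape[0]
    def rec(i, depth):
        if depth == t:
            return 1.0
        return sum(A[i, j] * rec(j, depth + 1) for j in range(n))
    return rec(0, 0)

def path_polynomial_matvec(A, t):
    """Same quantity via t matrix-vector products: (A^t 1)_0.  Cost O(t n^2),
    i.e. the dynamic-programming recursion alluded to in the text (without the
    non-reversing corrections, which AMP handles via its memory terms)."""
    v = np.ones(A.shape[0])
    for _ in range(t):
        v = A @ v
    return v[0]

rng = np.random.default_rng(1)
n = 30
A = rng.normal(size=(n, n)); A = (A + A.T) / np.sqrt(2 * n)
for t in range(1, 4):
    print(t, path_polynomial_naive(A, t), path_polynomial_matvec(A, t))
```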
The rest of the paper is organized as follows. Section 2 provides the necessary background, formally states our results, and discusses limitations, implications, and future directions. Sections 3 and 4 prove the main theorem (Theorem 2.3), respectively establishing the upper and lower bounds on the optimal estimation error achieved by Low-Deg. The proofs of several technical lemmas are deferred to the appendices.
2 Main results
2.1 Background: AMP
The class of AMP algorithms that we will consider in this paper (more general settings have been studied and analyzed in the literature [JM13, BMN20, GB21, Fan22]) proceeds iteratively by updating a state according to the iteration:
(9) |
Throughout, we will assume the uninformative initialization , . In the above is a function which operates on the matrix row-wise. Namely for with rows , we have . After iterations, we estimate the signal via , for some function (again, applied row-wise). We will consider and the sequence of functions fixed, while .
The sequence of matrices can be taken to be non-random (independent of ) and will be specified shortly. The high-dimensional asymptotics of the above iteration is characterized by the following finite-dimensional recursion over variables , , known as ‘state evolution,’ which is initialized at :
(10)
(11) |
In terms of this sequence, we define via
(12) |
where denotes the weak differential of .
The next theorem characterizes the high-dimensional asymptotics of the above AMP algorithm. It summarizes results from [BM11, BLM15] (see e.g. [MW22] for the application to rank-one estimation).
Theorem 2.1.
Assume is independent of and has finite moments of all orders. Further assume that the functions are independent of and either: for each , and are -Lipschitz, for some ; or for each , and are degree- polynomials, for some . Then
(13) |
where expectation is with respect to .
Using the above result, it follows that the optimal function is given by the posterior expectation denoiser. Namely, define the one-dimensional recursion (with ):
(14) |
Consider in the recursion of Eqs. (10), (11), so and , and take with defined in Eq. (14). We have
(15)
(16) |
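The fixed point of this recursion, and hence the predicted Bayes AMP error, can be computed numerically for a discrete prior. The sketch below assumes the standard form of the Bayes-AMP state evolution for this model, γ_{t+1} = λ²·(E[Θ²] − mmse(γ_t)) started from γ_0 = 0; this is our reading of Eq. (14) and may differ from the exact parametrization in the (not reproduced) display.

```python
import numpy as np

def mmse_discrete(gamma, vals, probs, grid=np.linspace(-12.0, 12.0, 4001)):
    """mmse(gamma) for the scalar channel Y = sqrt(gamma)*Theta + Z, with a
    discrete prior and Z ~ N(0,1), computed by numerical integration over y."""
    vals = np.asarray(vals); probs = np.asarray(probs)
    phi = np.exp(-0.5 * (grid[:, None] - np.sqrt(gamma) * vals[None, :]) ** 2) / np.sqrt(2 * np.pi)
    num = phi @ (probs * vals)                 # posterior-mean numerator, times p_Y
    den = phi @ probs                          # marginal density p_Y
    dy = grid[1] - grid[0]
    second_moment_of_post_mean = float(np.sum(num ** 2 / den) * dy)
    return float(probs @ vals ** 2 - second_moment_of_post_mean)

def bayes_amp_state_evolution(lam, vals, probs, num_iter=100):
    """Iterate the assumed Bayes-AMP state evolution
        gamma_{t+1} = lam**2 * (E[Theta^2] - mmse(gamma_t)),   gamma_0 = 0,
    and return the trajectory of gamma_t."""
    second = float(np.dot(probs, np.asarray(vals) ** 2))
    gamma, traj = 0.0, [0.0]
    for _ in range(num_iter):
        gamma = lam ** 2 * (second - mmse_discrete(gamma, vals, probs))
        traj.append(gamma)
    return traj

vals, probs = [0.0, 1.0, -1.0], [0.5, 0.3, 0.2]   # illustrative three-point prior, mean 0.1
traj = bayes_amp_state_evolution(lam=3.0, vals=vals, probs=probs)
print("fixed point gamma* ~", round(traj[-1], 4),
      "   predicted AMP MSE ~", round(mmse_discrete(traj[-1], vals, probs), 4))
```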
The next result establishes that this is indeed the optimal MSE achieved by AMP algorithms.
Theorem 2.2 ([MW22]).
Assume is independent of and has finite second moment. Then, for any AMP algorithm satisfying the assumptions of Theorem 2.1, we have for any fixed ,
(17) |
Further there exists a sequence of AMP algorithms approaching the lower bound arbitrarily closely.

As an illustration, Figure 1 presents the asymptotic accuracy achieved by Bayes AMP and Bayes optimal estimation, as characterized by Theorem 1.1. In this figure we consider the following parametrized family of three-point priors:
(18) |
We set and sweep (which measures the signal-to-noise ratio). We compare the predicted accuracy for Bayes AMP with numerical simulations for and iterations, averaged over realizations.
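A simulation in the spirit of Figure 1 can be reproduced with a short script. The sketch below implements an Onsager-corrected Bayes AMP iteration under the assumed normalization A = (λ/n)·θθᵀ + W and an assumed initialization from the prior mean; the prior values and weights are illustrative placeholders for Eq. (18). The resulting empirical MSE can be compared with the state-evolution prediction from the sketch above.

```python
import numpy as np

def posterior_mean_and_deriv(x, mu, sigma2, vals, probs):
    """E[Theta | mu*Theta + sqrt(sigma2)*Z = x] and its derivative in x, for a
    discrete prior; uses f'(x) = (mu / sigma2) * Var[Theta | x]."""
    vals = np.asarray(vals); probs = np.asarray(probs)
    logw = -0.5 * (x[:, None] - mu * vals[None, :]) ** 2 / sigma2
    logw -= logw.max(axis=1, keepdims=True)          # numerical stability
    w = probs * np.exp(logw)
    w /= w.sum(axis=1, keepdims=True)
    m1, m2 = w @ vals, w @ vals ** 2
    return m1, (mu / sigma2) * (m2 - m1 ** 2)

def bayes_amp(A, lam, vals, probs, num_iter=10):
    """Onsager-corrected Bayes AMP for A = (lam/n)*theta*theta^T + W (assumed
    normalization), initialized from the prior mean (requires a nonzero mean).
    The state-evolution parameters (mu_t, sigma_t^2) are tracked alongside."""
    n = A.shape[0]
    mean = float(np.dot(probs, vals))
    f_prev = mean * np.ones(n)                        # f_0
    x = A @ f_prev                                    # x^1
    mu, sigma2 = lam * mean ** 2, mean ** 2           # SE parameters of x^1
    for _ in range(num_iter):
        f, fprime = posterior_mean_and_deriv(x, mu, sigma2, vals, probs)
        b = fprime.mean()                             # Onsager coefficient
        x, f_prev = A @ f - b * f_prev, f
        q = float(np.mean(f ** 2))                    # proxy for E[f_t(...)^2]
        mu, sigma2 = lam * q, q                       # SE update for the new iterate
    return posterior_mean_and_deriv(x, mu, sigma2, vals, probs)[0]

rng = np.random.default_rng(2)
n, lam = 2000, 3.0
vals, probs = [0.0, 1.0, -1.0], [0.5, 0.3, 0.2]       # illustrative three-point prior
theta = rng.choice(vals, size=n, p=probs)
G = rng.normal(size=(n, n)) / np.sqrt(n)
W = (G + G.T) / np.sqrt(2)
W[np.diag_indices(n)] = rng.normal(scale=np.sqrt(2.0 / n), size=n)
A = (lam / n) * np.outer(theta, theta) + W
theta_hat = bayes_amp(A, lam, vals, probs)
print("empirical per-coordinate MSE:", round(float(np.mean((theta_hat - theta) ** 2)), 4))
```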
2.2 Background: Low-Deg
We say that , is a Low-Deg estimator of degree if its coordinates are polynomials of maximum degree in the matrix entries . The coefficients of these polynomials may depend on but not on . Let denote the space of all polynomials in the variables of degree at most . We use to denote the set of estimators whose coordinates are degree polynomials:
(19) |
With an abuse of notation we will often refer to an estimator (e.g. a Low-Deg estimator ), but really mean a sequence of estimators indexed by the dimension .
2.3 Statement of main results
Recall that a random variable is called sub-Gaussian if where the sub-Gaussian norm is defined as
(20) |
For example, any distribution with bounded support is sub-Gaussian.
Theorem 2.3.
Assume is independent of and sub-Gaussian, with . For any , there exists and a (family of) estimators such that
(21) |
Further, for any constant ,
(22) |
The lower bound (22) only requires to have all moments finite, rather than being sub-Gaussian. The upper bound (21) likely also holds under weaker conditions on than sub-Gaussianity, but we have not attempted to explore this here. We conclude this summary of our results with a few remarks and directions for future work.
Sharp characterization for Low-Deg.
As a by-product of our results, we obtain a sharp characterization of the optimal accuracy of Low-Deg estimators with constant degree. To the best of our knowledge, this is the first example of such a characterization. Prior work on low-degree estimation [SW22] gave bounds on the signal-to-noise ratio (tight up to log factors) under which the low-degree estimation accuracy is either asymptotically perfect or trivial. In our work we are in a setting where the Low-Deg MSE converges to a non-trivial constant, and we pin down the exact constant.
Similarly, our results are sharper than existing computational lower bounds based on the ‘overlap gap property’ [GJS21], which rule out a different (incomparable) class of algorithms including certain Markov chains.
Optimality of AMP.
Our results support the conjecture that AMP is asymptotically optimal among computationally efficient procedures for certain high-dimensional estimation problems. We are also aware of some problems for which AMP is strictly sub-optimal and has to be modified to capture higher-order correlations between variables. One such example is provided by the spiked tensor model [RM14, HSS15, WEM19], namely the generalization of the above model to higher order tensors, where observations take the form for some .
We believe that a generalization of the proof techniques developed in this paper can help distinguish in a principled way which problems can be optimally solved using AMP. These ideas may also provide guidance to modify AMP for problems in which it is sub-optimal. The key properties of the rank-one estimation problem that cause AMP to be optimal among Low-Deg estimators are established in Lemmas 4.7 and 4.8; notably these include the block diagonal structure of a certain correlation matrix . As a result of these properties, the best Low-Deg estimator is ‘tree-structured’ and can thus be computed using an AMP algorithm.
Zero-mean prior.
Our result on the equivalence of AMP and Low-Deg actually applies to the case as well. However, it must be emphasized that the statement concerns AMP with uninformative initialization and iterations, and Low-Deg estimators with degree. When , the MSE of both these algorithms converges to as : it is no better than random guessing.
On the other hand, initializing AMP with a spectral initialization (for a suitable scalar ) yields the accuracy stated in Theorem 1.1, even for . Since the leading eigenvector can be approximated by power iteration, we believe it is possible to show that the same accuracy is achievable by AMP when run for iterations (and a random initialization), or by Low-Deg for degree. However generalizing the lower bound of Eq. (22) to logarithmic degree goes beyond the analysis developed here.
Higher degree.
We expect the optimality of Bayes AMP to hold within a significantly broader class of low-degree estimators, namely -degree estimators for any . Again, this extension is beyond our current proof technique. Heuristically, polynomials of degree can be thought of as a proxy for algorithms of runtime (which is the runtime required to naively evaluate such a polynomial); see e.g. [DKWB19].
3 Proof of Theorem 2.3: Upper bound
Recall from Theorem 2.2 that the fixed points of the state evolution recursion (14) coincide with the stationary points of . Further, it is easy to see from the definition in Eq. (14) that is a non-decreasing function with . (This follows from the fact that the minimum mean square error is a non-increasing function of the signal-to-noise ratio or, equivalently, from Jensen's inequality.) As a consequence, is a non-decreasing sequence with . Let be such that .
Consider the special case of the AMP algorithm of Eq. (9) with , with , and specialize the state evolution recursion of Eqs. (10), (11) to this case. Namely we define recursively for via
(23)
(24) |
We claim that for the following holds: for any , we can construct polynomials of degree (independent of ) such that
(25) |
Once this is established, the desired upper bound (21) follows by taking and . Consider indeed the AMP estimator where is defined by the AMP iteration (9) with , (that is, we choose ). This estimator is a polynomial of degree and, by Eq. (13) (specialized to ), we have
(26)
(27)
(28) |
We are left with the task of proving the claim (25), which we will do by induction on . The claim holds for , and we then assume as an induction hypothesis that it holds for a certain . We will denote by the ideal nonlinearity. Now let be a sequence converging to and select , according to the induction hypothesis with . We will denote by the corresponding state evolution quantities. In particular, we can assume without loss of generality . By the state evolution recursion, we have (expectation here is with respect to )
It is a general fact about Bayes posterior expectation that is continuous with for some constant (see Lemma B.1 in Appendix B). Further as . By dominated convergence, we have and therefore we can choose so that for all , .
Next consider term . Denoting by the sub-Gaussian norm of (see Eq. (20)), we have that is sub-Gaussian with . We let denote this upper bound. Since is continuous with , by weighted approximation theory [Lor05], we can choose a polynomial such that
(29) |
Using this polynomial approximation, we get
(30)
(31) |
This completes the proof of the induction claim for the first inequality in Eq. (25). The second inequality is treated analogously.
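To illustrate the weighted polynomial approximation step used in this proof, the sketch below fits polynomials of increasing degree to a posterior-mean denoiser in mean square under the law of the observation; the channel parametrization and the prior are illustrative assumptions of ours.

```python
import numpy as np

def posterior_mean(y, gamma, vals, probs):
    """E[Theta | sqrt(gamma)*Theta + Z = y] for a discrete prior."""
    vals = np.asarray(vals); probs = np.asarray(probs)
    w = probs * np.exp(-0.5 * (y[:, None] - np.sqrt(gamma) * vals[None, :]) ** 2)
    return (w @ vals) / w.sum(axis=1)

# Weighted-L2 polynomial approximation of the denoiser, in the spirit of the
# approximation step (29): fit polynomials of increasing degree to f under the
# law of Y = sqrt(gamma)*Theta + Z and report the resulting mean-square error.
rng = np.random.default_rng(3)
gamma, vals, probs = 2.0, [0.0, 1.0, -1.0], [0.5, 0.3, 0.2]
theta = rng.choice(vals, size=100_000, p=probs)
y = np.sqrt(gamma) * theta + rng.normal(size=theta.size)
f = posterior_mean(y, gamma, vals, probs)
for deg in [1, 3, 5, 7, 9]:
    coeffs = np.polynomial.polynomial.polyfit(y, f, deg)   # least squares under the law of Y
    err = np.mean((np.polynomial.polynomial.polyval(y, coeffs) - f) ** 2)
    print(f"degree {deg}: E[(P(Y) - f(Y))^2] ~ {err:.2e}")
```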
4 Proof of Theorem 2.3: Lower bound
Recall that denotes the family of polynomials of degree at most in the variables . The key step of our proof is to show that the asymptotic estimation accuracy of Low-Deg estimators is not reduced if we replace polynomials by a restricted family of tree-structured polynomials, which we will denote by . This reduction is presented in Sections 4.1 and 4.2. We will then establish a connection between tree-structured polynomials and AMP algorithms, and rely on known optimality theory for AMP estimators, cf. Section 4.3.
4.1 Reduction to tree-structured polynomials
Let denote the set of rooted trees up to root-preserving isomorphism, with at most edges. (Trees must be connected, with no self-loops or multi-edges allowed. The tree with one vertex and no edges is included.) We denote the root vertex of by . For , define a labeling rooted at (or, simply, a labeling) of to be a function such that . (Vertex 1 is distinguished because we will be considering estimation of .)
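The index set of tree-structured polynomials is finite for every fixed degree bound. As a concrete illustration, the sketch below enumerates rooted trees with at most D edges up to root-preserving isomorphism; the encoding by sorted tuples of child subtrees is our own device, not the paper's notation.

```python
from functools import lru_cache
from itertools import product

def partitions(n, max_part=None):
    """Partitions of the integer n into non-increasing positive parts."""
    if max_part is None:
        max_part = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, max_part), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest

@lru_cache(maxsize=None)
def rooted_trees_exact(num_edges):
    """Rooted trees with exactly `num_edges` edges, up to root-preserving
    isomorphism.  A tree is encoded as the sorted tuple of its root's
    children subtrees (recursively), so isomorphic trees coincide."""
    if num_edges == 0:
        return frozenset({()})
    out = set()
    for parts in partitions(num_edges):
        # each part p is a child subtree with p - 1 edges plus its edge to the root
        for children in product(*(rooted_trees_exact(p - 1) for p in parts)):
            out.add(tuple(sorted(children)))
    return frozenset(out)

def num_rooted_trees_up_to(max_edges):
    return sum(len(rooted_trees_exact(e)) for e in range(max_edges + 1))

for D in range(6):
    print(f"number of rooted trees with at most {D} edges: {num_rooted_trees_up_to(D)}")
```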
Given a graph , we will denote by the graph distance between two vertices .
Definition 4.1.
A labeling of is said to be non-reversing if, for every pair of distinct vertices with the same label (i.e., ), one of the following holds:
• , or
• and have the same distance from the root (i.e., ).
(The latter holds if and only if are both children of a common vertex .) We denote by the set of all non-reversing labelings of .
For each , define the associated polynomial
(32) |
Definition 4.2.
We define to be the set of all polynomials of the form
(33) |
for coefficients .
Note that any is a polynomial of degree at most , that is
(34) |
We are now in a position to state our result on the reduction to tree-structured polynomials.
Proposition 4.3.
Assume to have finite moments of all orders. Let be a measurable function, and assume all moments of to be finite. For any fixed there exists a fixed (-independent) choice of coefficients such that the associated polynomial defined by (33) satisfies
(35) |
(In particular, the above limits exist.)
4.2 Proof of Proposition 4.3
For the proof, we will consider a slightly different model whereby we observe , with and . In other words, we use Gaussians of variance instead of on the diagonal. The original model is related to this one by where is a diagonal matrix with independent of . It is easy to see that it is sufficient to prove Proposition 4.3 under this modified model. Indeed, the left-hand side of Eq. (35) does not depend on the distribution of the diagonal entries of (since tree-structured polynomials do not depend on the diagonal entries). For the right-hand side, if is an arbitrary degree- estimator then also has degree and satisfies and by Jensen’s inequality, , whence we conclude
(36) |
This proves the claim. In what follows, we will drop the tilde from , and assume the new normalization.
It is useful to generalize the setting introduced previously. Let denote the set of all rooted (multi-)graphs, up to root-preserving isomorphism, with at most total edges, with the additional constraint that no vertices are isolated except possibly the root. Self-loops and multi-edges are allowed, and edges are counted with their multiplicity. For instance, a triple-edge contributes 3 towards the edge count . The graph need not be connected. For we write for the set of vertices and for the multiset of edges.
We use for the root of a graph , and define labelings (rooted at 1) exactly as for trees. Instead of non-reversing labelings, it will be convenient to work with embeddings, that is, labelings that are injective (every pair of distinct vertices has ). We denote the set of embeddings of by .
A labeling of induces a multi-graph whose vertices are elements of . This is the graph with vertex set and edge set . If and for , distinct edges in the multigraph , then and are considered distinct edges in the induced multi-graph.
We call this induced graph the image of and write whenever is the image of . We will treat as an element of where , namely where counts the multiplicity of edge in . Formally, .
For , let denote the -th orthonormal Hermite polynomial. Recall that these are defined (uniquely, up to sign) by the following two conditions: is a degree- polynomial; when . We refer to [Sze39] for background.
For , define the multivariate Hermite polynomial
(37) |
These polynomials are orthonormal: when . Viewing as a graph, let denote the set of non-empty (i.e., containing at least one edge) connected components of . As above, each is an element of where denotes the multiplicity of edge . It will be important to “center” each component in the following sense. Define
(38) |
where (in the case ) the empty product is equal to 1 by convention. Here and throughout, expectation is over distributed according to the rank-one estimation model as defined in Eq. (1). For , define
(39) |
Define the symmetric subspace as
(40) |
In words, is the -span of . It is also easy to see that is the linear subspace of which is invariant under permutations of rows/columns of .
Note that the task of estimating under the model (1) is invariant under permutations of . As a consequence of the Hunt–Stein theorem [EG21], the optimal estimator of must be equivariant under permutations of . The following lemma shows that the same is true if we restrict our attention to Low-Deg estimators. Namely, instead of the infimum over all degree- polynomials in (35), it suffices to study only the symmetric subspace.
Lemma 4.4.
Under the assumptions of Proposition 4.3, for any ,
(41) |
Our next step will be to write down an explicit formula (given in Lemma 4.6 below) for the right-hand side of (41). Define the vector and matrix by
(42) |
Note that both and have constant dimension (depending on but not ), but their entries depend on . Also note that is a Gram matrix and therefore positive semidefinite. We will show that is strictly positive definite (and thus invertible), and in fact strongly so in the sense that its minimum eigenvalue is lower bounded by a positive constant independent of . (Here and throughout, asymptotic notation such as and may hide factors depending on .)
Lemma 4.5.
Under the assumptions of Proposition 4.3,
We can now obtain an explicit formula for the optimal estimation error in terms of .
Lemma 4.6.
Determining the asymptotic error achieved by Low-Deg estimators requires understanding the asymptotic behavior of and . This is achieved in the following lemmas. Notably, is nearly block diagonal. (For this block diagonal structure to appear, it is crucial that we center each connected component in (38).)
Lemma 4.7.
For each we have the following.
• There is a constant (depending on ) such that .
• If then .
Lemma 4.8.
For each we have the following.
• There is a constant (depending on ) such that .
• If and then .
Write in block form, where the first block is indexed by rooted trees and the second by other graphs :
(45) |
Let denote the corresponding limiting vectors/matrices from Lemmas 4.7 and 4.8, and note that and .
We now work towards constructing the (sequence of) tree-structured polynomials that verify Eq. (35). We begin by defining a related polynomial , which is tree-structured in the basis instead of :
(46) |
To see that this definition is well-posed, note that since is strictly positive definite by Lemma 4.5, is also strictly positive definite and thus invertible. For intuition, note the similarity between and the optimizer in Lemma 4.6.
The next lemma characterizes the asymptotic estimation error achieved by the polynomial .
Lemma 4.9.
We have
(47) |
and in particular,
(48) |
The crucial equality above is an immediate consequence of the structure of established in Lemmas 4.7 and 4.8, namely and .
As a direct consequence of Lemma 4.9, we can now prove the following analogue of Proposition 4.3, where the basis is used instead of , which was left out by Lemma 4.9.
Proposition 4.10.
Assume to have finite moments of all orders. Let be a measurable function, and assume all moments of to be finite. For any fixed there exists a fixed (-independent) choice of coefficients such that the associated polynomial defined by satisfies
(49) |
(In particular, the above limits exist.)
Proof of Proposition 4.10.
Proposition 4.10 differs from our goal, namely Proposition 4.3, in that it offers an estimator that is a linear combination of the polynomials instead of . Recalling that we are restricting to trees and that the first two Hermite polynomials are simply and , the difference lies in the fact that involves a sum over embeddings, while involves a sum over the larger class of non-reversing labelings; also is centered by its expectation, provided is not edgeless (see Eq. (38) and note that a tree has only one connected component).
The next lemma allows us to, for each , rewrite as a linear combination of the with negligible error.
Lemma 4.11.
For any fixed there exist -independent coefficients such that
(50) |
We can finally conclude the proof of Proposition 4.3.
Proof of Proposition 4.3.
As in the proof of Proposition 4.10, it suffices to define so that the limit on the left-hand side of (35) is equal to . Defining as in (46), we can use Lemma 4.11 to expand
In other words, we define . Since , , and , we have
Now compute
(51) |
where the last step follows from Lemma 4.9 together with the remark that , so .
4.3 From tree-structured polynomials to AMP
In this section we will show that the Bayes-AMP algorithm achieves the same asymptotic accuracy as the best tree-structured constant-degree polynomial , matching Eq. (35). This will complete the proof of the lower bound in Theorem 2.3.
The key step is to construct a message passing (MP) algorithm that evaluates any given tree-structured polynomial. We recall the definition of a class of MP algorithms introduced in [BLM15] (we make some simplifications with respect to the original setting). After iterations, the state of an algorithm in this class is an array of messages indexed by ordered pairs in . Each message is a vector , where is a fixed integer (independent of and ). Here, we use the arrow to emphasize ordering, and entries are set to by convention. Updates make use of functions according to
(53) |
We finally define vectors indexed by :
(54)
(55) |
We claim that this class of recursions can be used to evaluate the tree-structured polynomials , for any fixed . To see this, we let so that we can index the entries of or by elements of . For instance, we will have:
(56) |
where is an enumeration of .
Given , we define to be the graph with vertex set (where is not an element of ), edge set where was the root of . We set the root of to be . In words, is obtained by attaching an edge to the root of , and moving the root to the other endpoint of the new edge.
If the root of has degree , we let denote the connected subgraphs of that are obtained by removing the root and the edges incident to the root. Notice that each is a tree which we root at the unique vertex such that . We write . We then define the special mapping , by letting, for any ,
(57) |
In order to clarify the connection between this MP algorithm and tree-structured polynomials, we define the following modification of the tree-structured polynomials of Eq. (32). For each , and each pair of distinct indices , we define
(58) |
Here is the set of labelings, i.e., maps that are non-reversing (in the same sense as Definition 4.1) and such that the following hold:
• . (The labeling is rooted at , not at .)
• For any such that , we have .
Notice also the different normalization with respect to Eq. (32). However, by counting the choices at each vertex we have , and therefore
(59) |
We also define the radius of a rooted graph , .
Proposition 4.12.
Proof.
The proof is straightforward, and amounts to checking that the first claim holds by induction on . ∎
We next show that, for a broad class of choices of the functions , the high-dimensional asymptotics of the algorithm defined by Eqs. (53), (54), (55) is determined by a generalization of the state evolution recursion of Theorem 2.1.
Lemma 4.13.
Assume that has finite moments of all orders and, for each , is a polynomial independent of . Define the sequence of vectors and positive semidefinite matrices via the state evolution equations (10), (11).
Then for any polynomial , the following limits hold for :
(60)
(61) |
We are now in a position to conclude the proof of the lower bound (22) on the optimal error of Low-Deg estimators in Theorem 2.3. The quantity we want to lower bound takes the form
which by Proposition 4.3 takes the form
for some -independent choice of coefficients . On the other hand, by Proposition 4.12, letting be the iterates defined by Eqs. (53), (54), (55), with given by Eq. (57), we have, for any ,
(62)
(63) |
where in step we used Lemma 4.13. (Here , are defined recursively by Eqs. (10), (11). Recall that can be indexed by elements of .)
Analogously, we obtain
(64) |
Hence we obtain that the optimal error achieved by Low-Deg estimators is given by
(65)
(66) |
However, by Theorem 2.1 there exists an AMP algorithm of the form (9) that achieves exactly the same error. Simply take , as defined in Eq. (57), and . The desired lower bound follows by applying the optimality result of Theorem 2.2.
Acknowledgments
This project was initiated when the authors were visiting the Simons Institute for the Theory of Computing during the program on Computational Complexity of Statistical Inference in Fall 2021: we are grateful to the Simons Institute for its support.
AM was supported by the NSF through award DMS-2031883, the Simons Foundation through Award 814639 for the Collaboration on the Theoretical Foundations of Deep Learning, the NSF grant CCF-2006489 and the ONR grant N00014-18-1-2729, and a grant from Eric and Wendy Schmidt at the Institute for Advanced Studies. Part of this work was carried out while Andrea Montanari was on partial leave from Stanford and a Chief Scientist at Ndata Inc dba Project N. The present research is unrelated to AM’s activity while on leave.
Part of this work was done while ASW was with the Simons Institute for the Theory of Computing, supported by a Simons-Berkeley Research Fellowship. Part of this work was done while ASW was with the Algorithms and Randomness Center at Georgia Tech, supported by NSF grants CCF-2007443 and CCF-2106444.
Appendix A Proof of Theorem 1.1
Let . We assume, without loss of generality, . Let be the version of centered at , namely the law of when (in particular ). We finally let be a vector with i.i.d. coordinates with distribution , and (for )
(67) |
We will take . The normalized mutual information between and is given, after simple manipulations, by
By writing , , where , we get
(68) |
where
Let be the Bayes posterior over :
(69) |
The above implies
(70)
(71)
(72) |
Integrating by parts, we get
(73) |
We finally note that are jointly distributed according to . Note that the pair is exchangeable, with marginals , . Writing the above expectations in terms of this measure, we get
(74)
(75) |
By concentration inequalities for sums of independent random variables, there exist constants such that
(76)
(77)
(78) |
The same inequalities obviously hold for .
Using these bounds in Eqs. (74), (75), we get
(79)
(80) |
Summing these, we finally obtain
(81)
where we note that the term is uniform in .
Now recalling the definition of Eq. (2), we have, by [LM19, Theorem 1],
(82) |
Note that
(83) |
Differentiating with respect to , we deduce that
1. is differentiable at if and only if is differentiable at . Both are in turn equivalent to being uniquely minimized at .
2. If the last point holds, then (with )
(84)
Comparing with Eq. (81), we are left with the task of proving that when is differentiable at .
Taking the second derivative of Eq. (68) yields
where by construction. A direct calculation yields
(85) |
Hence proceeding as for the first derivative, we get
(86)
(87) |
where, once more, the term is uniform in .
Consider now the function , defined on an interval , with (with ). We conclude that
1.
2. We have . Hence, by the previous point, at a point if and only if .
3.
Points 1 and 3 imply (by the lemma below) that for any differentiability point , and hence by point 3, we conclude . The proof is then completed by the following analysis fact.
Lemma A.1.
Let , be a sequence of differentiable functions converging pointwise to . Assume where are differentiable, is convex and the are equicontinuous on (i.e. there exists a non-decreasing function such that implies ). If is differentiable at , then .
Proof.
By convexity we have, for
(89) |
Further, by the intermediate value theorem and equicontinuity of ,
(90) |
Summing the last two displays, we get
(91) |
Taking the limit and using the convergence of , we get
(92) |
Finally taking the limit , and using the differentiability of at :
(93) |
∎
Appendix B Lemma for the upper bound in Theorem 2.3
Lemma B.1.
Let for . Then, for any there exists a constant such that, for all
(94) |
Proof.
If is the law of , then we have . Hence, if the claim holds for , it holds for as well. Hence, without loss of generality, we can assume .
Note that where is the law of . Therefore, without loss of generality we can assume . A simple calculation (‘Tweedie’s formula’) yields
(95)
(96) |
Notice that is convex, and therefore
(97)
(98) |
We therefore proceed to lower and upper bound . First notice that
where we used the fact that . Further, by Jensen’s inequality
Therefore we obtain, for ,
Since is continuous and therefore bounded on , this proves .
In order to prove , we use the lower bound in Eq. (97) and proceed analogously. ∎
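The identity referred to above as Tweedie's formula can be checked numerically. The sketch below assumes the scalar channel Y = √γ·Θ + Z (the lemma's exact channel is not reproduced in this extract) and compares the posterior mean computed directly with the one obtained from the score of the marginal density.

```python
import numpy as np

def posterior_mean(y, gamma, vals, probs):
    """Direct computation of E[Theta | sqrt(gamma)*Theta + Z = y]."""
    vals = np.asarray(vals); probs = np.asarray(probs)
    w = probs * np.exp(-0.5 * (y[:, None] - np.sqrt(gamma) * vals[None, :]) ** 2)
    return (w @ vals) / w.sum(axis=1)

def marginal_density(y, gamma, vals, probs):
    """Density of Y = sqrt(gamma)*Theta + Z at the points y."""
    vals = np.asarray(vals); probs = np.asarray(probs)
    phi = np.exp(-0.5 * (y[:, None] - np.sqrt(gamma) * vals[None, :]) ** 2) / np.sqrt(2 * np.pi)
    return phi @ probs

# Tweedie's formula for this channel: E[Theta | Y=y] = (y + d/dy log p_Y(y)) / sqrt(gamma).
gamma, vals, probs = 1.5, [0.0, 1.0, -1.0], [0.5, 0.3, 0.2]
y = np.linspace(-3.0, 3.0, 7)
eps = 1e-5
score = (np.log(marginal_density(y + eps, gamma, vals, probs))
         - np.log(marginal_density(y - eps, gamma, vals, probs))) / (2 * eps)
tweedie = (y + score) / np.sqrt(gamma)
direct = posterior_mean(y, gamma, vals, probs)
print("max |Tweedie - direct|:", np.max(np.abs(tweedie - direct)))   # finite-difference error only
```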
Appendix C Proof of Lemmas for Proposition 4.3
C.1 Notation
We use the convention . We often work with where and . For , we use the notation
We use to mean for all , and to mean for all with strict inequality for some . Note that can be identified with a multigraph whose vertices are elements of , where is the number of copies of edge . We use to denote the set of vertices spanned by the edges of , with vertex 1 (the root) always included by convention. We use to denote the set of non-empty (i.e., containing at least one edge) connected components of .
Asymptotic notation such as and pertains to the limit and may hide factors depending on . We use the symbol to denote a constant (which may be positive, zero, or negative) depending on (but not ) and e.g., to denote a constant that may additionally depend on . These constants may change from line to line.
C.2 Hermite polynomials
We will need the following well known property of Hermite polynomials (see e.g. page 254 of [MOS13]).
Proposition C.1.
For any and ,
As a result, for any , and writing for the signal matrix, we have:
In particular, since (due to orthonormality of Hermite polynomials and the fact ),
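These Hermite facts are easy to verify numerically. The sketch below checks the orthonormality of h_k = He_k/√(k!) under the standard Gaussian, together with the classical translation identity E[He_k(x + G)] = x^k for G ~ N(0,1); the latter is the standard identity of this type, stated here as an assumption since the precise content of Proposition C.1 is not reproduced in this extract.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval
from math import factorial, sqrt

def He(k, x):
    """Probabilists' Hermite polynomial He_k evaluated at x."""
    c = np.zeros(k + 1); c[k] = 1.0
    return hermeval(x, c)

rng = np.random.default_rng(4)
G = rng.normal(size=1_000_000)

# Orthonormality of h_k = He_k / sqrt(k!) under the standard Gaussian.
for j in range(4):
    row = [np.mean(He(j, G) * He(k, G)) / sqrt(factorial(j) * factorial(k)) for k in range(4)]
    print(f"E[h_{j} h_k], k=0..3:", np.round(row, 3))

# Classical translation identity E_G[He_k(x + G)] = x^k.
x = 0.7
for k in range(4):
    print(f"k={k}:  E[He_k(x+G)] ~ {np.mean(He(k, x + G)):+.4f}   x^k = {x ** k:+.4f}")
```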
C.3 Proof of Lemma 4.4: Symmetry
As mentioned in the main text, this is a special case of the classical Hunt–Stein theorem [EG21], the only difference being that we are restricting the class of estimators to degree- polynomials. We present a proof for completeness.
Let denote the group of permutations on that fix . This group acts on the space of symmetric matrices by permuting both the rows and columns (by the same permutation). We also have an induced action of on given by . A polynomial can be symmetrized over to produce as follows:
We claim that the symmetric subspace is precisely the image of the above symmetrization operation, that is, . This can be seen as follows. For the inclusion , note that any satisfies . For the reverse inclusion , start with an arbitrary . Any such admits an expansion as a linear combination of Hermite polynomials , and therefore admits an expansion in (cf. the definition (38) which can be inverted recursively). Finally, note that is a scalar multiple of for a certain . This allows to be written as a linear combination of .
To complete the proof of Lemma 4.4, it is sufficient to show that for any low-degree estimator , the mean squared error does not increase under symmetrization:
(99) |
First note that because and have the same distribution for any fixed . Next, using Jensen’s inequality and the symmetry for all , we have
Now Eq. (99) follows by combining these claims.
C.4 Proof of Lemma 4.5:
Consider an arbitrary polynomial with coefficients normalized so that . This induces an expansion . Since , it suffices to show . Using Jensen’s inequality and orthonormality of Hermite polynomials (here, we use subscripts on expectation to denote which variable is being integrated over),
where with Hermite expansion . It now suffices to show . We will compute explicitly in terms of . We have, using the Hermite facts from Section C.2,
where is the set of such that , and includes at least one edge from every non-empty component of
This means
(100) |
Due to the symmetry of , we have whenever are images of embeddings of the same . Therefore, admits an expansion where is defined by Eq. (39).
The coefficients and are related as follows. Let denote the number of root-preserving automorphisms of , i.e., the number of embeddings that induce a single element . For each there are distinct elements , and each has the same coefficient . Without loss of generality we can set for some coefficients . Therefore
Therefore we can conclude that . This means
It now suffices to show .
Our next step will be to relate the coefficients to the coefficients . Using (100) along with the above relation between and , we have for any and the image of some ,
(101) |
Similarly to above, we have whenever is the image of some . In (101), the number of terms in the sum that correspond to a single is , recalling that implies . We also have the bounds , , , and . This means (101) can be written as for some coefficients (not depending on ) that satisfy
where the last bound holds since and therefore cannot have more components than .
We have now shown . We can also see directly that for all , and whenever and . This means
which we can rewrite in vector form as
where . We order so that the number of edges is non-decreasing along this ordering, and therefore is upper triangular with ones on the diagonal. This implies , and in particular is invertible. By Cramer’s rule, letting denote the -th minor,
where the last bound follows from the fact that the matrix dimensions are independent of , and . Therefore and
This concludes the proof.
C.5 Proof of Lemma 4.6: Explicit solution
Throughout this proof, we will omit the subscript from and .
For an arbitrary , write the expansion . Recalling the definitions and , we can rewrite the optimization problem as
Since is positive definite by Lemma 4.5, the map is strictly concave and is thus uniquely maximized at its unique stationary point .
C.6 Proof of Lemma 4.7: Limit of
Throughout this proof, we will omit the subscript from and its entries.
Our goal is to compute, for each ,
where in the last step, is the image of under an arbitrary embedding and we have used the fact that depends only on (not ). We have
(102) |
For the image of an embedding of , we have by definition,
If (the edgeless graph), we can see is a constant and we are done. If has a non-empty connected component that does not contain the root then we have due to independence between components, and again we are done.
Finally, consider the case in which has a single non-empty component and this component contains the root. In this case, simplifying the above using the Hermite facts from Section C.2, and recalling that ,
Using orthogonality of the Hermite polynomials and recalling , all the terms with are zero in expectation, i.e.,
Putting it all together, we get
Recall from above that unless has a single connected component and this component contains the root; we therefore restrict to of this type in what follows. Due to connectivity, any such has , implying as desired. Now suppose furthermore that , i.e., contains a cycle (where we consider a self-loop or double-edge to be a cycle). In this case, , implying as desired.
C.7 Proof of Lemma 4.8: Limit of
Throughout this proof, we will omit the subscript from .
We will need to reason about the different possible intersection patterns between two rooted graphs . To this end, define a pattern to be a rooted colored multigraph where no vertices are isolated except possibly the root, every edge is colored either red or blue, the edge-induced subgraph of red edges (with the root always included) is isomorphic (as a rooted graph) to , and the edge-induced subgraph of blue edges is isomorphic to . Let denote the set of such patterns, up to isomorphism of rooted colored graphs. The number of such patterns is a constant depending on . For , let denote the set of embeddings of , where an embedding is defined to be a pair with such that the induced colored graph on vertex set is isomorphic to . We let denote the number of vertices in the pattern (including the root).
We can write as
where in the last line, is the image of an arbitrary embedding , and we have used the fact that depends only on (not ). Recall from (102) that
and we similarly have
Now compute using the Hermite facts from Section C.2, and recalling that ,
If has a non-empty (i.e., containing at least one edge) connected component that is vertex-disjoint from all non-empty connected components of or vice-versa, then we have from the first line above (due to independence between components). Otherwise we say that has “no isolated components” and denote by the set of patterns with this property. For , using orthogonality of the Hermite polynomials, the expansion of the expression above has a unique leading (in ) term where are as large as possible: namely, and where denotes entrywise minimum between elements of . Therefore there exists independent of such that
where the symmetric difference is defined as .
Putting it all together, and recalling whenever (and possibly redefining )
where
where , and always includes the root by convention (see Section C.1). To complete the proof, it remains to show for all , and furthermore in the case .
For any , , and for an embedding of , each non-empty connected component of spans at most vertices. Of these vertices, none belong to and due to the “no isolated components” constraint imposed by , at most of them (i.e., all but one) can belong to . Every vertex in belongs to some non-empty component of (since the root belongs to ), so we have . Similarly, , implying as desired.
In order to have equality in the above argument, every non-empty connected component of must be a simple (i.e., without self-loops or multi-edges) tree that spans only one vertex in , and similarly must be a simple tree that spans only one vertex in . It remains to show that this is impossible when . This means is a simple rooted tree. We can assume has at least one edge, or else the “no isolated components” property must fail (since has no non-empty components). We will consider a few different cases for .
First consider the case where the root is isolated in . This means has a non-empty connected component containing the root . Since is a simple rooted tree and due to “no isolated components,” this component also contains a vertex belonging to some non-empty component of . However, now we have a non-empty component of that spans at least two vertices in , namely and , and so by the discussion above, it is impossible to have equality .
Next consider the case where the root is not isolated in , but has multiple connected components. Let be a non-empty component of that does not contain the root. Due to “no isolated components,” must span some (non-root) vertex . Since is a simple rooted tree, there is a unique path in from to . Recall that by convention, so both endpoints of this path belong to . Also, this path must contain an edge in , or else would be connected to the root. This means that along the path, we can find a non-empty component of that spans two different vertices in , and so it is impossible to have equality .
The last case to consider is where contains a cycle. This cycle must contain at least one edge in , since has no cycles. Recall from above that in order to have , every non-empty component of must be a simple tree, which means the cycle also contains at least one edge in . Now we know the cycle in has at least one edge in and at least one edge in , but this means has a non-empty connected component that spans at least two vertices in , and so it is impossible to have .
C.8 Limit of
Lemma C.2.
Proof.
From Lemma 4.5 we have . As a direct consequence of Lemma 4.8,
This means, using for matrix operator norm,
implying that is symmetric positive definite and thus invertible. The result is now immediate from Lemmas 4.7 and 4.8 because has fixed dimension and the entries of are differentiable functions of the entries of (in a neighborhood of ). ∎
C.9 Proof of Lemma 4.9: Accuracy of
First we claim that . Recall the block form for and in Eq. (45), and that per Lemma 4.8 and therefore
Now as we have
and
completing the proof.
C.10 Proof of Lemma 4.11: Change-of-basis
Fix . If then and the proof is complete, so suppose . Recall that our goal is to take
and approximate it in the basis , where
where we use the shorthand
Since is a tree with at least one edge,
(104) |
for an arbitrary .
We first focus on the non-random term , which will be easy to expand in the basis because . Let
noting that this limit exists by combining (102) with the fact
We can now write
where satisfies and so (being non-random) .
The error term will be the sum of terms , with independent of . It suffices to show for each individually, as this implies by the triangle inequality
(105) |
We next handle the more substantial term in (104), namely
(106)
For the first term,
Since (in fact, embeddings make up a fraction of all labelings, not just non-reversing ones) and (similarly to the proof of Lemma C.3 below), we have .
Returning to the remaining term in (106), we partition into two sets and , defined as follows. If the multigraph has no triple-edges or higher (i.e., for all ) and has no cycles (namely, the simple graph defined by has no cycles) then we let ; otherwise . The remaining term to handle is
(107) |
Lemma C.3.
.
Proof.
We define the set of patterns between two rooted trees as in Section C.7, but now restricted to labelings in . Namely, the edge-induced subgraph of red edges in should not be isomorphic to but rather isomorphic to the image of some (and similarly for the blue edges and ). We write to denote an embedding of this pattern into vertex set . Similarly to Section C.7, compute
where is the image of an arbitrary embedding of
where denotes the number of edges for which is odd, which depends only on (not ). Now since is a tree, and so the above becomes
(108) |
and so it remains to show that the exponent is strictly negative for any .
Fix . Let denote the number of single-edges in (ignoring the edge colors), let denote the number of double-edges, let denote the number of -edges for odd, and let denote the number of -edges for even. By definition we have
Also, since (and therefore ) is connected,
where is the indicator that (viewed as a simple graph by replacing each multi-edge by a single-edge) contains a cycle. Now the exponent in (108) is
By the definition of , we must either have or , which means the exponent is , completing the proof. ∎
It remains to handle the first term in Eq. (107). For , define the “skeleton” to be the subgraph of obtained from as follows: delete all multi-edges (i.e., whenever , set ) and then take the connected component of the root (vertex 1). Using the definition of and recalling that is not an embedding, note that is always a simple tree with strictly fewer edges than . The remaining term to handle is
(109) |
where
Lemma C.4.
.
Proof.
Define as in the proof of Lemma C.3 except with in place of . Our goal is to bound
For , letting (throughout this proof we adopt the shorthand since the base graph is fixed),
This means
(110)
We will next claim that certain terms in the sum in (110) are zero. First note that we must have is even for every edge , or else the corresponding term in (110) is zero due to the factors. Therefore, for , the value of (110) is where, recall, denotes the number of odd-edges in (i.e., the number of edges for which is odd).
We will improve the bound in certain cases. Let denote the property that has no triple-edges or higher (i.e., for all ) and no cycles (when is viewed as a simple graph by replacing multi-edges by single-edges). We claim that (110) is whenever holds. To prove this, first identify the “bridges,” i.e., edges for which and one endpoint of belongs to . At least one bridge must exist by the definition of . Using the definition of , shares no vertices with except possibly one endpoint of each bridge. As a result, some bridge must have or else the corresponding term in (110) is zero; to see this, note that if every bridge has then the factor has mean zero and is independent from the other factors in (110). This gives the desired improvement to .
It remains to handle the first term in (109). For each , is isomorphic (as a rooted multigraph) to some connected multigraph that is a rooted tree but with both single- and double-edges allowed, and at least one double-edge required; let denote the set of such multigraphs and for , write when is isomorphic to . The remaining term to handle is
Note that depends only on (not ). Also write for the rooted tree isomorphic to , which again only depends on . The above becomes
(111) |
The number of double-edges in is . We have
Now (111) becomes
where because (similarly to the proof of Lemma C.3).
Appendix D Proof of Lemma 4.13
Let , be a collection of i.i.d. copies of , and define
(112) |
We will use the sequence of random matrices only as a device for algorithm analysis and not in the actual estimation procedure.
We extend the definition of tree-structured polynomials (cf. Eq. (32)) to such sequences of random matrices via
(113) |
Here . When the argument is a single matrix , then is defined by applying Eq. (113) to the sequence of matrices given by for all (thus recovering Eq. (32)).
The random variable does depend on . However its distribution is independent of . We will therefore often omit the superscript . (In the case of , the random variable itself does not depend on .)
We define a subset of the family of non-reversing labelings of .
Definition D.1.
A labeling of is said to be strongly non-reversing if it is non-reversing and for any two edges , , with , we have (as unordered pairs). We denote by the set of all strongly non-reversing labelings of .
A pair of labelings , is said to be jointly strongly non-reversing if each of them is strongly non-reversing and implies . We denote by the set of such pairs.
We also define
As before is obtained from the above definition by for all .
As we next see, moving from non-reversing to strongly non-reversing embeddings has a negligible effect.
Lemma D.2.
Assume to have finite moments of all orders, and for a constant . Then, for any fixed , there exist constants , such that
(114) |
(115) |
Proof.
We will prove Eq. (114) since (115) follows by the same argument. Considering the first bound, note that
(116) |
where in the last line denotes the equivalence classes of under rooted graph homomorphisms. Now the expectation in the last line only depends on . Taking expectation conditional on , we get
Here is the image of embedding . Next taking expectation with respect to and using that all moments are finite
(117) |
where is the number of edges with multiplicity in .
Using this bound in Eq. (116) and noting that (where is the number of vertices in ), we get
where is the number of edges of counted with their multiplicity. In the last step we used the fact that and .
Next notice that, denoting by the number of edges in , this time without counting multiplicity, we have
where is the number of self-loops in the projection of onto simple graphs (i.e. the graph obtained by replacing every multi-edge in by a single edge). For , we have , whence the claim follows.
The proof for the second equation in Eq. (114) is very similar. Using the shorthand , we begin by writing
The only difference with respect to the previous case lies in the fact that is a collection of graphs with edges labeled by . (It is the set of equivalence classes of under graph homomorphisms.)
Proceeding as before,
Therefore
Recall that , and, as before, . Further , whence
The last sum is upper bounded by by the same argument as above. ∎
Consider now a term corresponding to of the expectation . By construction, this does not involve moments for . Therefore the expectations coincide when considering or . In other words, we have the identities
(118) |
We have therefore proved the following consequence of Lemma D.2.
Corollary D.3.
Under the assumptions of Lemma D.2, there exists such that
(119) |
We now consider a message passing iteration of the same form as Eq. (53), but with replaced by at iteration . Namely, we define
(120) |
and:
(121)
(122) |
Analogously to the case of the original iteration, we can write these as sums over polynomials , with coefficients that have a limit as . Therefore, we have a further consequence of Lemma D.2 and the previous corollary.
Corollary D.4.
Assume to have finite moments of all orders, and to be a polynomial with coefficients bounded by and maximum degree .
Then there exists such that, for all ,
(123)
(124) |
Proof.
We expand the polynomial as
Further, can be expressed as a sum over and therefore the same holds for
where, in the last row, we grouped terms such that and (up to combinatorial factors which are independent of ), the coefficient is given by the product of coefficients .
Hence
and the claim follows by noting that the sums over and involve a number of terms that is constant in , that the coefficients have a limit as , and then applying Corollary D.3. ∎
Lemma D.5.
Assume to have finite moments of all orders, and let , be fixed. Then there exists a constant independent of such that
(125)
As a final step towards the proof of Lemma 4.13, we prove the analogous statement for the modified iteration .
Lemma D.6.
Assume to have finite moments of all orders, and to be a fixed polynomial. Define the sequence of vectors and positive semidefinite matrices via the state evolution equations (10), (11).
Then the following hold (here denotes limit in probability):
(126) | |||
(127) |
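The state evolution parameters referenced in Lemma D.6 can in practice be evaluated by a simple Monte Carlo recursion. The sketch below is a scalar version, assuming a prior sampler sample_prior, a nonlinearity f, and a signal-to-noise parameter snr; these names are illustrative, and the actual recursion (10)–(11) is vector- and matrix-valued.

```python
import numpy as np

def state_evolution(f, sample_prior, snr, n_iter, n_mc=200_000, seed=0):
    """Monte Carlo evaluation of a scalar state-evolution recursion of the
    kind invoked in Lemma D.6: the iterate is modelled as mu_t * X + sigma_t * G
    with X drawn from the prior and G standard Gaussian, and the parameters
    are updated through expectations of f.  Scalar illustration only; the
    recursion (10)-(11) in the text is vector/matrix-valued."""
    rng = np.random.default_rng(seed)
    X = sample_prior(n_mc)
    G = rng.standard_normal(n_mc)
    mu, sigma2 = 1.0, 1.0                       # an arbitrary initialization
    history = []
    for _ in range(n_iter):
        F = f(mu * X + np.sqrt(sigma2) * G)
        mu = snr * np.mean(F * X)               # overlap with the signal
        sigma2 = np.mean(F ** 2)                # effective noise variance
        history.append((mu, sigma2))
    return history

# Example with a Rademacher prior and a tanh denoiser (hypothetical choices):
# history = state_evolution(np.tanh, lambda n: np.sign(np.random.randn(n)), 2.0, 10)
```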
Proof.
The proof is essentially the same as the proof of Proposition 4 in [BLM15]. We will focus on the claim (127), since Eq. (126) is completely analogous.
We proceed by induction over , and will denote by the -algebra generated by and . It is convenient to define and to rewrite Eq. (120) as
(128)
Fixing , by the induction hypothesis
(129) | |||
(130) | |||
(131) |
Construct for different in the same probability space. By Lyapunov’s central limit theorem, we have that, in probability
(132)
(Here denotes weak convergence of probability measures.) Hence, for any bounded Lipschitz function , we have
(133)
Since the right hand side is non-random, and using dominated convergence, we have
(134)
Applying Lemma D.5, the claim also holds for that is only polynomially bounded, thus proving the first equation in Eq. (127) for iteration .
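The passage from (132) to (134) is the standard fact that weak convergence of the empirical distribution, together with boundedness, yields convergence of empirical averages of bounded Lipschitz test functions. A small numerical illustration under hypothetical distributions (a Gaussian matrix with variance-1/n entries and a sign-valued previous iterate):

```python
import numpy as np

rng = np.random.default_rng(1)
psi = lambda x: np.tanh(x) ** 2          # a bounded Lipschitz test function

# Conditionally on the previous iterates, each coordinate of A @ m is a sum of
# many independent terms, so its empirical distribution is close to Gaussian
# (Lyapunov CLT); empirical averages of psi then approach the Gaussian mean.
for n in (100, 1_000, 10_000):
    A = rng.standard_normal((n, n)) / np.sqrt(n)   # entries of variance 1/n
    m = np.sign(rng.standard_normal(n))            # a +/-1 vector, so ||m||^2 / n = 1
    z = A @ m                                      # coordinates approximately N(0, 1)
    empirical = psi(z).mean()
    gaussian = psi(rng.standard_normal(500_000)).mean()   # Monte Carlo E[psi(G)]
    print(n, round(float(empirical), 4), round(float(gaussian), 4))
```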
Next consider the second limit in Eq. (127), and begin by considering the case in which is a fixed polynomial. We claim that, for any fixed ,
(135)
Indeed, by linearity it is sufficient to prove that this is the case for . This case in turn can be analyzed by expanding in a sum over trees, as in the proof of Lemma D.2. (See Proposition 4 in [BLM15].)
References
- [AW09] Arash A Amini and Martin J Wainwright. High-dimensional analysis of semidefinite relaxations for sparse principal components. The Annals of Statistics, pages 2877–2921, 2009.
- [BB20] Matthew Brennan and Guy Bresler. Reducibility and statistical-computational gaps from secret leakage. In Conference on Learning Theory, pages 648–847. PMLR, 2020.
- [BBH18] Matthew Brennan, Guy Bresler, and Wasim Huleihel. Reducibility and computational lower bounds for problems with planted sparse structure. In Conference On Learning Theory, pages 48–166. PMLR, 2018.
- [BBH+21] Matthew S Brennan, Guy Bresler, Sam Hopkins, Jerry Li, and Tselil Schramm. Statistical query algorithms and low degree tests are almost equivalent. In Conference on Learning Theory, pages 774–774. PMLR, 2021.
- [BBP05] Jinho Baik, Gérard Ben Arous, and Sandrine Péché. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Annals of Probability, pages 1643–1697, 2005.
- [BGN11] Florent Benaych-Georges and Raj Rao Nadakuditi. The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices. Advances in Mathematics, 227(1):494–521, 2011.
- [BHK+19] Boaz Barak, Samuel Hopkins, Jonathan Kelner, Pravesh K Kothari, Ankur Moitra, and Aaron Potechin. A nearly tight sum-of-squares lower bound for the planted clique problem. SIAM Journal on Computing, 48(2):687–735, 2019.
- [BLM15] Mohsen Bayati, Marc Lelarge, and Andrea Montanari. Universality in polytope phase transitions and message passing algorithms. The Annals of Applied Probability, 25(2):753–822, 2015.
- [BM11] Mohsen Bayati and Andrea Montanari. The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Trans. on Inform. Theory, 57:764–785, 2011.
- [BM19] Jean Barbier and Nicolas Macris. The adaptive interpolation method: a simple scheme to prove replica formulas in Bayesian inference. Probability Theory and Related Fields, 174(3):1133–1185, 2019.
- [BMN20] Raphael Berthier, Andrea Montanari, and Phan-Minh Nguyen. State evolution for approximate message passing with non-separable functions. Information and Inference: A Journal of the IMA, 9(1):33–79, 2020.
- [BMR21] Jess Banks, Sidhanth Mohanty, and Prasad Raghavendra. Local statistics, semidefinite programming, and community detection. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1298–1316. SIAM, 2021.
- [Bol14] Erwin Bolthausen. An iterative construction of solutions of the TAP equations for the Sherrington–Kirkpatrick model. Communications in Mathematical Physics, 325(1):333–366, 2014.
- [BR13] Quentin Berthet and Philippe Rigollet. Complexity theoretic lower bounds for sparse principal component detection. In Conference on learning theory, pages 1046–1066. PMLR, 2013.
- [BS06] Jinho Baik and Jack W Silverstein. Eigenvalues of large sample covariance matrices of spiked population models. Journal of Multivariate Analysis, 97(6):1382–1408, 2006.
- [CM22] Michael Celentano and Andrea Montanari. Fundamental barriers to high-dimensional regression with convex penalties. The Annals of Statistics, 50(1):170–196, 2022.
- [CMW20] Michael Celentano, Andrea Montanari, and Yuchen Wu. The estimation error of general first order methods. In Conference on Learning Theory, pages 1078–1141. PMLR, 2020.
- [CX22] Hong-Bin Chen and Jiaming Xia. Hamilton–Jacobi equations for inference of matrix tensor products. In Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, volume 58, pages 755–793. Institut Henri Poincaré, 2022.
- [DKMZ11] Aurelien Decelle, Florent Krzakala, Cristopher Moore, and Lenka Zdeborová. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Physical Review E, 84(6):066106, 2011.
- [DKWB19] Yunzi Ding, Dmitriy Kunisky, Alexander S Wein, and Afonso S Bandeira. Subexponential-time algorithms for sparse PCA. arXiv preprint arXiv:1907.11635, 2019.
- [DMM09] David L. Donoho, Arian Maleki, and Andrea Montanari. Message Passing Algorithms for Compressed Sensing. Proceedings of the National Academy of Sciences, 106:18914–18919, 2009.
- [EG21] Morris L Eaton and Edward I George. Charles Stein and invariance: Beginning with the Hunt–Stein theorem. The Annals of Statistics, 49(4):1815–1822, 2021.
- [Fan22] Zhou Fan. Approximate message passing algorithms for rotationally invariant matrices. The Annals of Statistics, 50(1):197–224, 2022.
- [Gal62] Robert Gallager. Low-density parity-check codes. IRE Transactions on Information Theory, 8(1):21–28, 1962.
- [GB21] Cédric Gerbelot and Raphaël Berthier. Graph-based approximate message passing iterations. arXiv:2109.11905, 2021.
- [GJS21] David Gamarnik, Aukosh Jagannath, and Subhabrata Sen. The overlap gap property in principal submatrix recovery. Probability Theory and Related Fields, 181(4):757–814, 2021.
- [GZ22] David Gamarnik and Ilias Zadik. Sparse high-dimensional linear regression. Estimating squared error and a phase transition. The Annals of Statistics, 50(2):880–903, 2022.
- [HKP+17] Samuel B Hopkins, Pravesh K Kothari, Aaron Potechin, Prasad Raghavendra, Tselil Schramm, and David Steurer. The power of sum-of-squares for detecting hidden structures. In 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 720–731. IEEE, 2017.
- [HR04] David C Hoyle and Magnus Rattray. Principal-component-analysis eigenvalue spectra from data with symmetry-breaking structure. Physical Review E, 69(2):026124, 2004.
- [HS17] Samuel B Hopkins and David Steurer. Efficient Bayesian estimation from few samples: community detection and related problems. In 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 379–390. IEEE, 2017.
- [HSS15] Samuel B Hopkins, Jonathan Shi, and David Steurer. Tensor principal component analysis via sum-of-square proofs. In Conference on Learning Theory, pages 956–1006. PMLR, 2015.
- [JL09] Iain M Johnstone and Arthur Yu Lu. On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 104(486), 2009.
- [JM13] Adel Javanmard and Andrea Montanari. State evolution for general approximate message passing algorithms, with applications to spatial coupling. Information and Inference: A Journal of the IMA, 2(2):115–144, 2013.
- [KNV15] Robert Krauthgamer, Boaz Nadler, and Dan Vilenchik. Do semidefinite relaxations solve sparse PCA up to the information limit? The Annals of Statistics, 43(3):1300–1322, 2015.
- [LM19] Marc Lelarge and Léo Miolane. Fundamental limits of symmetric low-rank matrix estimation. Probability Theory and Related Fields, 173(3):859–929, 2019.
- [Lor05] G G Lorentz. Approximation of Functions, volume 322. American Mathematical Soc., 2005.
- [MOS13] Wilhelm Magnus, Fritz Oberhettinger, and Raj Pal Soni. Formulas and theorems for the special functions of mathematical physics, volume 52. Springer Science & Business Media, 2013.
- [MPV87] Marc Mézard, Giorgio Parisi, and Miguel A. Virasoro. Spin Glass Theory and Beyond. World Scientific, 1987.
- [MSS16] Tengyu Ma, Jonathan Shi, and David Steurer. Polynomial-time tensor decompositions with sum-of-squares. In 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 438–446. IEEE, 2016.
- [MV21] Andrea Montanari and Ramji Venkataramanan. Estimation of low-rank matrices via approximate message passing. The Annals of Statistics, 49(1):321–345, 2021.
- [MW22] Andrea Montanari and Yuchen Wu. Statistically optimal first order algorithms: A proof via orthogonalization. arXiv:2201.05101, 2022.
- [RM14] Emile Richard and Andrea Montanari. A statistical model for tensor PCA. Advances in neural information processing systems, 27, 2014.
- [RU08] T. J. Richardson and R. Urbanke. Modern Coding Theory. Cambridge University Press, Cambridge, 2008.
- [SW22] Tselil Schramm and Alexander S Wein. Computational barriers to estimation from low-degree polynomials. The Annals of Statistics, 50(3):1833–1858, 2022.
- [Sze39] Gabor Szegö. Orthogonal polynomials, volume 23. American Mathematical Soc., 1939.
- [TAP77] David J. Thouless, Philip W. Anderson, and Richard G. Palmer. Solution of ‘solvable model of a spin glass’. Philosophical Magazine, 35(3):593–601, 1977.
- [Wei22] Alexander S Wein. Average-case complexity of tensor decomposition for low-degree polynomials. arXiv preprint arXiv:2211.05274, 2022.
- [WEM19] Alexander S Wein, Ahmed El Alaoui, and Cristopher Moore. The Kikuchi hierarchy and tensor PCA. In 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 1446–1468. IEEE, 2019.