
Equivalence of Approximate Message Passing and
Low-Degree Polynomials in Rank-One Matrix Estimation

Andrea Montanari, Department of Electrical Engineering and Department of Statistics, Stanford University; School of Mathematics, Institute for Advanced Study, Princeton.    Alexander S. Wein, Department of Mathematics, University of California, Davis.
Abstract

We consider the problem of estimating an unknown parameter vector ${\boldsymbol{\theta}}\in{\mathbb{R}}^{n}$, given noisy observations ${\boldsymbol{Y}}={\boldsymbol{\theta}}{\boldsymbol{\theta}}^{{\sf T}}/\sqrt{n}+{\boldsymbol{Z}}$ of the rank-one matrix ${\boldsymbol{\theta}}{\boldsymbol{\theta}}^{{\sf T}}$, where ${\boldsymbol{Z}}$ has independent Gaussian entries. When information is available about the distribution of the entries of ${\boldsymbol{\theta}}$, spectral methods are known to be strictly sub-optimal. Past work characterized the asymptotics of the accuracy achieved by the optimal estimator. However, no polynomial-time estimator is known that achieves this accuracy.

It has been conjectured that this statistical-computational gap is fundamental, and moreover that the optimal accuracy achievable by polynomial-time estimators coincides with the accuracy achieved by certain approximate message passing (AMP) algorithms. We provide evidence towards this conjecture by proving that no estimator in the (broader) class of constant-degree polynomials can surpass AMP.

1 Introduction

Statistical-computational gaps are a ubiquitous phenomenon in high-dimensional statistics. Consider estimation of a high-dimensional parameter vector ${\boldsymbol{\theta}}=(\theta_{1},\theta_{2},\dots,\theta_{n})$ from noisy observations ${\boldsymbol{Y}}\sim{\rm P}_{{\boldsymbol{\theta}}}$. In many models, we can characterize the optimal accuracy achieved by arbitrary estimators. However, when we analyze classes of estimators that can be implemented via polynomial-time algorithms, a significantly smaller accuracy is obtained. A short list of examples includes sparse regression [CM22, GZ22], sparse principal component analysis [JL09, AW09, KNV15, BR13], graph clustering and community detection [DKMZ11, BHK+19], tensor principal component analysis [RM14, HSS15], and tensor decomposition [MSS16, Wei22].

This state of affairs has led to the conjecture that these gaps are fundamental in nature: there simply exists no polynomial-time algorithm achieving the statistically optimal accuracy. Proving such a conjecture is extremely challenging since standard complexity-theoretic assumptions (e.g. P$\neq$NP) are ill-suited to establish complexity lower bounds in average-case settings where the input to the algorithm is random. A possible approach to overcome this challenge is to establish ‘average-case’ reductions between statistical estimation problems. We refer to [BBH18, BB20] for pointers to this rich line of work.

A second approach is to prove reductions between classes of polynomial-time algorithms, thus trimming the space of possible algorithmic choices. This paper contributes to this line of work by establishing—in a specific context—that approximate message passing (AMP) algorithms and low-degree (Low-Deg) polynomial estimators achieve asymptotically the same accuracy.

Examples of reductions between algorithm classes in statistical estimation include [HKP+17, BBH+21, CMW20, BMR21, CM22]. The motivation for studying the relation between AMP and Low-Deg comes from the distinctive position occupied by these two classes in the theoretical landscape. AMP algorithms are iterative algorithms motivated by ideas in information theory (iterative decoding [Gal62, RU08]) and statistical physics (cavity method and TAP equations [TAP77, MPV87]). Their structure is relatively constrained: they operate by alternating a matrix-vector multiplication (with a matrix constructed from the data ${\boldsymbol{Y}}$), and a non-linear operation on the same vectors (typically independent of the data matrix). This structure has opened the way to sharp characterization of their behavior in the high-dimensional limit, known as ‘state evolution’ [DMM09, BM11, Bol14].

Low-Deg was originally motivated by a connection to Sum-of-Squares (SoS) semidefinite programming relaxations and captures a different (algebraic) notion of complexity [HS17, HKP+17, SW22]. In Low-Deg procedures, each unknown parameter $\theta_{i}$ is estimated via a fixed polynomial in the entries of the data matrix ${\boldsymbol{Y}}$, of constant or moderately growing degree. As such, these estimators are relatively unconstrained, and indeed capture (via polynomial approximation) a broad variety of natural approaches. Developing a sharp analysis of such a broad class of estimators can be very challenging.

In the symmetric rank-one estimation problem, we observe a noisy version ${\boldsymbol{Y}}$ of the rank-one matrix ${\boldsymbol{\theta}}{\boldsymbol{\theta}}^{{\sf T}}$, with ${\boldsymbol{\theta}}\in{\mathbb{R}}^{n}$ an unknown vector:

$${\boldsymbol{Y}}=\frac{1}{\sqrt{n}}{\boldsymbol{\theta}}{\boldsymbol{\theta}}^{{\sf T}}+{\boldsymbol{Z}}\,. \tag{1}$$

Here ${\boldsymbol{Z}}$ is a random matrix independent of ${\boldsymbol{\theta}}$, drawn from the Gaussian Orthogonal Ensemble ${\boldsymbol{Z}}\sim{\sf GOE}(n)$, which we define here by ${\boldsymbol{Z}}={\boldsymbol{Z}}^{{\sf T}}$ and $(Z_{ij})_{i\leq j}$ independent with $Z_{ii}\sim{\mathcal{N}}(0,2)$ and $Z_{ij}\sim{\mathcal{N}}(0,1)$ for $i<j$. Given a single observation of the matrix ${\boldsymbol{Y}}$, we would like to estimate ${\boldsymbol{\theta}}$.
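To fix ideas, here is a minimal sampling sketch for model (1) under the GOE convention above (our own illustration, not part of the paper; `sample_instance` and `prior_sampler` are hypothetical names):

```python
import numpy as np

def sample_instance(n, prior_sampler, rng=None):
    """Draw (theta, Y) from Y = theta theta^T / sqrt(n) + Z, with Z ~ GOE(n):
    symmetric, Z_ii ~ N(0, 2) on the diagonal and Z_ij ~ N(0, 1) for i < j."""
    rng = np.random.default_rng(rng)
    theta = prior_sampler(n, rng)                 # i.i.d. coordinates from pi_Theta
    G = rng.standard_normal((n, n))
    Z = (G + G.T) / np.sqrt(2)                    # off-diagonal variance 1, diagonal variance 2
    Y = np.outer(theta, theta) / np.sqrt(n) + Z
    return theta, Y

# Example: two-point prior pi_Theta = (delta_0 + delta_2)/2, which has nonzero mean
theta, Y = sample_instance(1000, lambda n, rng: rng.choice([0.0, 2.0], size=n))
```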

Because of its simplicity, the rank-one estimation problem has been a useful sandbox to develop new mathematical ideas in high-dimensional statistics. A significant amount of work has been devoted to the analysis of spectral estimators, which typically take the form $\hat{\boldsymbol{\theta}}({\boldsymbol{Y}})=c_{n}\,{\boldsymbol{v}}_{1}({\boldsymbol{Y}})$, where ${\boldsymbol{v}}_{1}({\boldsymbol{Y}})$ is the top eigenvector of ${\boldsymbol{Y}}$ and $c_{n}$ is a scaling factor [HR04, BBP05, BS06, BGN11]. However, spectral methods are known to be suboptimal if additional information is available about the entries $\theta_{i}$ of ${\boldsymbol{\theta}}$. In this paper, we model this information by assuming $(\theta_{i})_{1\leq i\leq n}\sim_{iid}\pi_{\Theta}$, for $\pi_{\Theta}$ a probability distribution on ${\mathbb{R}}$ (which does not depend on $n$). The resulting Bayes error coincides (up to terms negligible as $n\to\infty$) with the minimax optimal error in the class of vectors with empirical distribution converging to $\pi_{\Theta}$. Hence this model captures the minimax behavior in a well-defined class of signal parameters.

In the high-dimensional limit $n\to\infty$ (with $\pi_{\Theta}$ fixed), the Bayes optimal accuracy for estimating ${\boldsymbol{\theta}}$ under model (1) (and the above assumptions on ${\boldsymbol{\theta}}$) is known to converge to a well-defined limit that was characterized in a sequence of papers, see e.g. [LM19, BM19, CX22]. The asymptotic accuracy of the optimal AMP algorithm (called Bayes AMP) is also known [MV21].

It is useful to pause and recall these results in some detail. Define the function $\Psi:{\mathbb{R}}_{\geq 0}\times{\mathbb{R}}\times\mathscrsfs{P}({\mathbb{R}})\to{\mathbb{R}}$ (here and below $\mathscrsfs{P}({\mathbb{R}})$ denotes the set of probability distributions over ${\mathbb{R}}$) by letting

$$\Psi(q;b,\pi_{\Theta}):=\frac{1}{4}q^{2}-\frac{1}{2}\big(\operatorname{\mathbb{E}}[\Theta^{2}]+b\big)q+{\rm I}(q;\pi_{\Theta})\,, \tag{2}$$
$${\rm I}(q;\pi_{\Theta}):=\operatorname{\mathbb{E}}\log\frac{{\rm d}p_{Y_{\rm eff}|\Theta}}{{\rm d}p_{Y_{\rm eff}}}\,,\qquad Y_{\rm eff}=\sqrt{q}\,\Theta+G\,,\qquad(\Theta,G)\sim\pi_{\Theta}\otimes\mathcal{N}(0,1)\,. \tag{3}$$

Note that ${\rm I}(q;\pi_{\Theta})$ can be interpreted as the mutual information between a scalar random variable $\Theta\sim\pi_{\Theta}$ and the scalar noisy observation $Y_{\rm eff}$. It can be expressed as a one-dimensional integral with respect to $\pi_{\Theta}$:

$${\rm I}(q;\pi_{\Theta})=-\operatorname{\mathbb{E}}\log\Big\{\int e^{-(Y_{\rm eff}-\sqrt{q}\theta)^{2}/2}\,\pi_{\Theta}({\rm d}\theta)\Big\}-\frac{1}{2}\,. \tag{4}$$
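For instance (a worked example we add for concreteness, not taken from the text), for the Rademacher prior $\pi_{\Theta}=\frac{1}{2}(\delta_{-1}+\delta_{+1})$, Eq. (4) reduces after a short computation to

$${\rm I}(q;\pi_{\Theta})=q-\operatorname{\mathbb{E}}_{G\sim\mathcal{N}(0,1)}\log\cosh\big(q+\sqrt{q}\,G\big)\,,$$

which increases from ${\rm I}(0;\pi_{\Theta})=0$ and saturates at $\log 2$, the entropy of $\Theta$, as $q\to\infty$.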

The next statement adapts results from [LM19] concerning the behavior of the Bayes optimal error (see Appendix A, which details the derivation from [LM19]). A formal definition of Bayes AMP will be given in the next section, alongside a formal definition of Low-Deg algorithms.

Theorem 1.1.

Assume $\pi_{\Theta}$ is independent of $n$, has non-vanishing first moment $\operatorname{\mathbb{E}}[\Theta]\neq 0$, and has finite moments of all orders. Define

$$q_{\rm Bayes}(\pi_{\Theta}):=\operatorname*{argmin}_{q\geq 0}\,\Psi(q;0,\pi_{\Theta})\,, \tag{5}$$
$$q_{\rm AMP}(\pi_{\Theta}):=\inf\big\{q\geq 0\;:\;\Psi^{\prime}(q;0,\pi_{\Theta})=0,\;\Psi^{\prime\prime}(q;0,\pi_{\Theta})\geq 0\big\}\,. \tag{6}$$

Bayes AMP has time complexity $O(c(n)\,n^{2})$, for any diverging sequence $c(n)$, and achieves mean squared error (MSE)

$$\lim_{n\to\infty}\frac{1}{n}\operatorname{\mathbb{E}}\big\{\|\hat{\boldsymbol{\theta}}^{\rm AMP}({\boldsymbol{Y}})-{\boldsymbol{\theta}}\|^{2}_{2}\big\}=\operatorname{\mathbb{E}}[\Theta^{2}]-q_{\rm AMP}(\pi_{\Theta})\,. \tag{7}$$

Further, assume that $b\mapsto\Psi_{*}(b;\pi_{\Theta}):=\min_{q\geq 0}\Psi(q;b,\pi_{\Theta})$ is differentiable at $b=0$. Then the minimum MSE of any estimator is

$$\lim_{n\to\infty}\inf_{\hat{\boldsymbol{\theta}}(\,\cdot\,)}\frac{1}{n}\operatorname{\mathbb{E}}\big\{\|\hat{\boldsymbol{\theta}}({\boldsymbol{Y}})-{\boldsymbol{\theta}}\|^{2}_{2}\big\}=\operatorname{\mathbb{E}}[\Theta^{2}]-q_{\rm Bayes}(\pi_{\Theta})\,. \tag{8}$$
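To make (5) concrete, the following crude numerical sketch (our own, with hypothetical helper names `I_mc` and `q_bayes`) estimates $q_{\rm Bayes}$ for a finite-support prior by grid-minimizing $\Psi(q;0,\pi_{\Theta})$, with ${\rm I}(q;\pi_{\Theta})$ evaluated by Monte Carlo via Eq. (4).

```python
import numpy as np

rng = np.random.default_rng(0)

def I_mc(q, vals, probs, n_mc=100_000):
    """Monte Carlo estimate of I(q; pi_Theta) from Eq. (4), for the finite-support prior
    pi_Theta = sum_k probs[k] * delta_{vals[k]}."""
    vals, probs = np.asarray(vals, float), np.asarray(probs, float)
    theta = rng.choice(vals, size=n_mc, p=probs)
    y_eff = np.sqrt(q) * theta + rng.standard_normal(n_mc)    # Y_eff = sqrt(q) Theta + G
    inner = np.exp(-0.5 * (y_eff[:, None] - np.sqrt(q) * vals[None, :]) ** 2) @ probs
    return -np.mean(np.log(inner)) - 0.5

def q_bayes(vals, probs, q_max=4.0, n_grid=80):
    """Grid search for q_Bayes = argmin_{q >= 0} Psi(q; 0, pi_Theta), Eqs. (2) and (5)."""
    m2 = float(np.dot(probs, np.asarray(vals, float) ** 2))
    grid = np.linspace(1e-6, q_max, n_grid)
    psi = [0.25 * q**2 - 0.5 * m2 * q + I_mc(q, vals, probs) for q in grid]
    return grid[int(np.argmin(psi))]

# Nonzero-mean two-point prior, consistent with the assumptions of Theorem 1.1
print("q_Bayes ~", q_bayes([0.0, 2.0], [0.5, 0.5]))
```

For $q_{\rm AMP}(\pi_{\Theta})$ one instead iterates the state evolution recursion (14) below from $q_{0}=0$; see the sketch following Eq. (16).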
Remark 1.1 (Differentiability at $b=0$).

The function $b\mapsto\Psi_{*}(b;\pi_{\Theta})$ is concave because it is a minimum of linear functions, and therefore differentiable at all but countably many values of $b$. Hence the differentiability assumption amounts to requiring that $b=0$ is non-exceptional.

Also note that replacing the prior $\pi_{\Theta}$ by its shift $\pi_{\Theta^{\prime}}$, where $\Theta^{\prime}=\Theta+a$, amounts to changing $b$ to $b^{\prime}=b+2\operatorname{\mathbb{E}}[\Theta]a+a^{2}$. Hence the assumption that $b=0$ is non-exceptional is equivalent to the assumption that the mean of $\pi_{\Theta}$ is non-exceptional.

We also note that our results will not require such genericity assumptions, which we only introduce to offer a simple comparison to the Bayes optimal estimator.

Remark 1.2 (Nonzero mean assumption).

The assumption $\operatorname{\mathbb{E}}[\Theta]\neq 0$ can be removed from Theorem 1.1 provided the mean squared error metric is replaced by a different metric that is invariant under a change of relative sign in ${\boldsymbol{\theta}}$ and $\hat{\boldsymbol{\theta}}$. In that case, Bayes AMP needs to be modified, e.g. via a spectral initialization [MV21]. Here we focus on the case $\operatorname{\mathbb{E}}[\Theta]\neq 0$ because this is the most relevant for the results that follow.

In this paper we compare Bayes AMP run for a constant number $t=O(1)$ of iterations and Low-Deg estimators with degree $D=O(1)$. Assuming $\operatorname{\mathbb{E}}[\Theta]\neq 0$ and $\pi_{\Theta}$ sub-Gaussian, we establish the following (see Theorem 2.3):

  • For any fixed $t$ and $\varepsilon>0$, there exists a constant $D=D(t,\varepsilon)$ and a degree-$D$ estimator that approximates the MSE achieved by Bayes AMP after $t$ iterations within an additive error of at most $\varepsilon$.

  • For any constant $D=O(1)$, no degree-$D$ estimator can surpass the asymptotic MSE of Bayes AMP. Namely, for any fixed $D$ and any degree-$D$ estimator $\hat{\boldsymbol{\theta}}$, we have $\operatorname{\mathbb{E}}\big\{\|\hat{\boldsymbol{\theta}}({\boldsymbol{Y}})-{\boldsymbol{\theta}}\|^{2}_{2}\big\}/n\geq\operatorname{\mathbb{E}}[\Theta^{2}]-q_{\rm AMP}(\pi_{\Theta})-o_{n}(1)$.

Here and throughout, the notation $o_{n}(1)$ signifies a quantity that vanishes in the limit where $n\to\infty$ with all other parameters (such as $D$) held fixed. The first claim above (‘upper bound’) is proved by a straightforward polynomial approximation of Bayes AMP. To obtain the second claim (‘lower bound’), we develop a new proof technique. Given a Low-Deg estimator for coordinate $i$, we express $\hat{\theta}_{i}({\boldsymbol{Y}})$ as a sum of terms that are associated to rooted multi-graphs on vertex set $[n]:=\{1,2,\ldots,n\}$ with at most $D$ edges. We group these terms into a constant number of homomorphism classes. We next prove that among these classes, only those corresponding to trees yield a non-negligible contribution as $n\to\infty$. Finally we show that the latter contribution can be encoded into an AMP algorithm and use existing optimality theory for AMP algorithms [CMW20, MW22] to deduce the result.

This new proof strategy elucidates the relation between AMP and Low-Deg algorithms: roughly speaking, AMP algorithms correspond to ‘tree-structured’ low-degree polynomials, a subspace of all Low-Deg estimators. AMP can effectively evaluate tree-structured polynomials via a dynamic-programming style recursion with complexity $O(n^{2}\cdot\mathsf{depth})$ (with $\mathsf{depth}$ the tree depth) instead of the naive $O(n^{D+1})$.
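The following toy sketch (our own illustration, not the paper's construction; it sums over all labelings of a rooted path, ignoring the non-reversing restriction used in $\mathscrsfs{F}_{T}$ of Section 4.1) shows the dynamic-programming idea: the same sum is computed either by brute-force enumeration in $O(n^{\mathsf{depth}+1})$ time or by $\mathsf{depth}$ matrix-vector products in $O(n^{2}\cdot\mathsf{depth})$ time.

```python
import numpy as np

def path_poly_naive(Y, i, depth):
    """Sum over all labelings phi of a rooted path with `depth` edges, phi(root) = i,
    of prod_{(u,v) in E} Y[phi(u), phi(v)] -- brute force, O(n^(depth+1))."""
    n = Y.shape[0]
    if depth == 0:
        return 1.0
    return sum(Y[i, j] * path_poly_naive(Y, j, depth - 1) for j in range(n))

def path_poly_dp(Y, i, depth):
    """Same quantity via `depth` matrix-vector products -- O(n^2 * depth)."""
    v = np.ones(Y.shape[0])
    for _ in range(depth):
        v = Y @ v
    return v[i]

rng = np.random.default_rng(0)
Y = rng.standard_normal((30, 30)); Y = (Y + Y.T) / 2
print(path_poly_naive(Y, 0, 3), path_poly_dp(Y, 0, 3))   # agree up to floating-point error
```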

The rest of the paper is organized as follows. Section 2 provides the necessary background, formally states our results, and discusses limitations, implications, and future directions. Sections 3 and 4 prove the main theorem (Theorem 2.3), respectively establishing the upper and lower bounds on the optimal estimation error achieved by Low-Deg. The proofs of several technical lemmas are deferred to the appendices.

2 Main results

2.1 Background: AMP

The class of AMP algorithms that we will consider in this paper (more general settings have been studied and analyzed in the literature [JM13, BMN20, GB21, Fan22]) proceeds iteratively by updating a state ${\boldsymbol{x}}^{t}\in{\mathbb{R}}^{n\times{\sf d}}$ according to the iteration:

$${\boldsymbol{x}}^{t+1}=\frac{1}{\sqrt{n}}{\boldsymbol{Y}}\,F_{t}({\boldsymbol{x}}^{t})-F_{t-1}({\boldsymbol{x}}^{t-1}){\sf B}^{{\sf T}}_{t}\,,\qquad t\geq 0\,. \tag{9}$$

Throughout, we will assume the uninformative initialization ${\boldsymbol{x}}^{0}={\boldsymbol{0}}$, ${\sf B}_{0}=0$. In the above, $F_{t}:{\mathbb{R}}^{{\sf d}}\to{\mathbb{R}}^{{\sf d}}$ is a function which operates on the matrix ${\boldsymbol{x}}^{t}\in{\mathbb{R}}^{n\times{\sf d}}$ row-wise. Namely, for ${\boldsymbol{x}}\in{\mathbb{R}}^{n\times{\sf d}}$ with rows ${\boldsymbol{x}}_{1}^{\sf T},\dots,{\boldsymbol{x}}_{n}^{{\sf T}}$, we have $F_{t}({\boldsymbol{x}}):=(F_{t}({\boldsymbol{x}}_{1}),\dots,F_{t}({\boldsymbol{x}}_{n}))^{{\sf T}}$. After $t$ iterations, we estimate the signal ${\boldsymbol{\theta}}$ via $\hat{\boldsymbol{\theta}}^{t}:=g_{t}({\boldsymbol{x}}^{t})$, for some function $g_{t}:{\mathbb{R}}^{{\sf d}}\to{\mathbb{R}}$ (again, applied row-wise). We will consider ${\sf d}$ and the sequence of functions $F_{t}$ fixed, while $n\to\infty$.

The sequence of matrices ${\sf B}_{t}\in{\mathbb{R}}^{{\sf d}\times{\sf d}}$ can be taken to be non-random (independent of ${\boldsymbol{Y}}$) and will be specified shortly. The high-dimensional asymptotics of the above iteration is characterized by the following finite-dimensional recursion over variables ${\boldsymbol{\mu}}_{t}\in{\mathbb{R}}^{{\sf d}}$, ${\boldsymbol{\Sigma}}_{t}\in{\mathbb{R}}^{{\sf d}\times{\sf d}}$, known as ‘state evolution,’ which is initialized at $({\boldsymbol{\mu}}_{0},{\boldsymbol{\Sigma}}_{0})=({\boldsymbol{0}},{\boldsymbol{0}})$:

$${\boldsymbol{\mu}}_{t+1}=\operatorname{\mathbb{E}}\big\{\Theta F_{t}({\boldsymbol{\mu}}_{t}\Theta+{\boldsymbol{G}}_{t})\big\}\,,\qquad(\Theta,{\boldsymbol{G}}_{t})\sim\pi_{\Theta}\otimes\mathcal{N}(0,{\boldsymbol{\Sigma}}_{t})\,, \tag{10}$$
$${\boldsymbol{\Sigma}}_{t+1}=\operatorname{\mathbb{E}}\big\{F_{t}({\boldsymbol{\mu}}_{t}\Theta+{\boldsymbol{G}}_{t})F_{t}({\boldsymbol{\mu}}_{t}\Theta+{\boldsymbol{G}}_{t})^{{\sf T}}\big\}\,. \tag{11}$$

In terms of this sequence, we define ${\sf B}_{t}$ via

$${\sf B}_{t}=\operatorname{\mathbb{E}}\{{\rm D}F_{t}({\boldsymbol{\mu}}_{t}\Theta+{\boldsymbol{G}}_{t})\}\,, \tag{12}$$

where ${\rm D}F_{t}=(\partial_{i}F_{t,j})_{i,j\in[{\sf d}]}$ denotes the weak differential of $F_{t}$.
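As a concrete (and unofficial) illustration of the iteration (9) in the scalar case ${\sf d}=1$, the sketch below runs Bayes AMP with the posterior-mean denoiser for a finite-support prior. The paper takes ${\sf B}_{t}$ and the denoiser parameters from state evolution (Eqs. (10)–(12) and (14)); here, as is common in practice, both are replaced by empirical surrogates, and all helper names are ours.

```python
import numpy as np

def bayes_denoiser(x, q, vals, probs):
    """Posterior mean E[Theta | q*Theta + sqrt(q)*G = x] (entrywise) for the prior
    sum_k probs[k]*delta_{vals[k]}, and the average of its derivative (the posterior
    variance), used below as an empirical Onsager coefficient."""
    vals, probs = np.asarray(vals, float), np.asarray(probs, float)
    logw = np.log(probs)[None, :] + x[:, None] * vals[None, :] - 0.5 * q * vals[None, :] ** 2
    w = np.exp(logw - logw.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    mean = w @ vals
    var = w @ vals**2 - mean**2
    return mean, float(var.mean())

def bayes_amp(Y, vals, probs, iters=20):
    """d = 1 instance of iteration (9): x^{t+1} = Y f_t(x^t)/sqrt(n) - b_t f_{t-1}(x^{t-1})."""
    n = Y.shape[0]
    x, f_prev, b, q = np.zeros(n), np.zeros(n), 0.0, 0.0
    for _ in range(iters):
        f, _ = bayes_denoiser(x, q, vals, probs)
        x_new = Y @ f / np.sqrt(n) - b * f_prev
        q = float(f @ f) / n                      # empirical proxy for q_{t+1} in Eq. (14)
        _, b = bayes_denoiser(x_new, q, vals, probs)
        f_prev, x = f, x_new
    theta_hat, _ = bayes_denoiser(x, q, vals, probs)
    return theta_hat
```

Combined with the sampling sketch given after Eq. (1), `np.mean((bayes_amp(Y, vals, probs) - theta) ** 2)` should approach the limit in Eq. (7) as $n$ and the iteration count grow.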

The next theorem characterizes the high-dimensional asymptotics of the above AMP algorithm. It summarizes results from [BM11, BLM15] (see e.g. [MW22] for the application to rank-one estimation).

Theorem 2.1.

Assume $\pi_{\Theta}$ is independent of $n$ and has finite moments of all orders. Further assume that the functions $\{F_{t},g_{t}\}_{t\geq 0}$ are independent of $n$ and either: $(i)$ for each $t$, $F_{t}$ and $g_{t}$ are $L_{t}$-Lipschitz, for some $L_{t}<\infty$; or $(ii)$ for each $t$, $F_{t}$ and $g_{t}$ are degree-$B_{t}$ polynomials, for some $B_{t}<\infty$. Then

$$\lim_{n\to\infty}\frac{1}{n}\operatorname{\mathbb{E}}\big\{\|\hat{\boldsymbol{\theta}}^{t}({\boldsymbol{Y}})-{\boldsymbol{\theta}}\|_{2}^{2}\big\}=\operatorname{\mathbb{E}}\big\{\big[\Theta-g_{t}({\boldsymbol{\mu}}_{t}\Theta+{\boldsymbol{G}}_{t})\big]^{2}\big\}\,, \tag{13}$$

where the expectation is with respect to $(\Theta,{\boldsymbol{G}}_{t})\sim\pi_{\Theta}\otimes\mathcal{N}(0,{\boldsymbol{\Sigma}}_{t})$.

Using the above result, it follows that the optimal function $F_{t}$ is given by the posterior expectation denoiser. Namely, define the one-dimensional recursion (with $G\sim{\mathcal{N}}(0,1)$):

$$q_{t+1}=\operatorname{\mathbb{E}}\big\{\operatorname{\mathbb{E}}[\Theta\,|\,q_{t}\Theta+\sqrt{q_{t}}\,G]^{2}\big\}=:{\rm SE}(q_{t};\pi_{\Theta})\,,\qquad q_{0}=0\,. \tag{14}$$

Consider ${\sf d}=1$ in the recursion of Eqs. (10), (11), so $\mu_{t}\in{\mathbb{R}}$ and $\Sigma_{t}=:\sigma^{2}_{t}\in{\mathbb{R}}$, and take $g_{t}(x)=F_{t}(x):=\operatorname{\mathbb{E}}[\Theta\,|\,q_{t}\Theta+\sqrt{q_{t}}\,G=x]$ with $q_{t}$ defined in Eq. (14). We have

$$\mu_{t}=\sigma^{2}_{t}=q_{t}\,, \tag{15}$$
$$\operatorname{\mathbb{E}}\big\{\big[\Theta-g_{t}(\mu_{t}\Theta+\sigma_{t}G)\big]^{2}\big\}=\operatorname{\mathbb{E}}[\Theta^{2}]-q_{t+1}\,. \tag{16}$$
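For a finite-support prior, the recursion (14) can be iterated numerically; the sketch below (our own, using Gauss–Hermite quadrature for the Gaussian expectation) runs it from $q_{0}=0$ until it stabilizes at the fixed point $q_{\rm AMP}$.

```python
import numpy as np

# Gauss-Hermite nodes/weights for E_G[.] with G ~ N(0,1) (probabilists' convention)
_nodes, _weights = np.polynomial.hermite_e.hermegauss(60)
_weights = _weights / _weights.sum()

def se_step(q, vals, probs):
    """One step of Eq. (14): q_{t+1} = E{ E[Theta | q Theta + sqrt(q) G]^2 }."""
    vals, probs = np.asarray(vals, float), np.asarray(probs, float)
    out = 0.0
    for theta, p in zip(vals, probs):
        x = q * theta + np.sqrt(q) * _nodes              # effective observation
        logw = np.log(probs)[None, :] + x[:, None] * vals[None, :] - 0.5 * q * vals[None, :] ** 2
        w = np.exp(logw - logw.max(axis=1, keepdims=True))
        post_mean = (w @ vals) / w.sum(axis=1)
        out += p * float(np.dot(_weights, post_mean**2))
    return out

vals, probs = [0.0, 2.0], [0.5, 0.5]       # nonzero-mean prior, E[Theta^2] = 2
q = 0.0
for _ in range(100):
    q = se_step(q, vals, probs)
print("q_AMP ~", q, "   asymptotic Bayes-AMP MSE ~", 2.0 - q)
```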

The next result establishes that this is indeed the optimal MSE achieved by AMP algorithms.

Theorem 2.2 ([MW22]).

Assume $\pi_{\Theta}$ is independent of $n$ and has finite second moment. Then, for any AMP algorithm satisfying the assumptions of Theorem 2.1, we have for any fixed $t\geq 0$,

$$\lim_{n\to\infty}\frac{1}{n}\operatorname{\mathbb{E}}\big\{\|\hat{\boldsymbol{\theta}}^{t}({\boldsymbol{Y}})-{\boldsymbol{\theta}}\|_{2}^{2}\big\}\geq\operatorname{\mathbb{E}}[\Theta^{2}]-q_{t+1}\,. \tag{17}$$

Further there exists a sequence of AMP algorithms approaching the lower bound arbitrarily closely.

Finally, the fixed points of the iteration (14) (i.e., the points $q$ such that $q={\rm SE}(q;\pi_{\Theta})$) coincide with the stationary points of $\Psi(q;b=0,\pi_{\Theta})$ defined in Eq. (2) (i.e., the points $q$ such that $\partial_{q}\Psi(q;0,\pi_{\Theta})=0$).

Figure 1: Estimation accuracy in the symmetric rank-one estimation problem (1), for a one-parameter family of prior distributions $\pi_{\Theta}^{s}$ defined in Eq. (18). Solid black curve: asymptotic MSE achieved by Bayes AMP. Blue curve: Bayes optimal MSE. Dashed curves: asymptotic MSE for Bayes AMP with (from bottom to top) $t\in\{5,10,20\}$ iterations. Circles: average MSE achieved by AMP in simulation for $t\in\{5,10,20\}$. The black and blue curves coincide outside the interval $[s^{*}_{\rm Bayes},s^{*}_{\rm AMP}]$. Here the MSE is normalized by $1/s$, so that the horizontal black dashed line represents the trivial MSE achieved by the constant estimator $\hat{\theta}_{i}({\boldsymbol{Y}})=\operatorname*{\mathbb{E}}[\Theta]$.

As an illustration, Figure 1 presents the asymptotic accuracy achieved by Bayes AMP and Bayes optimal estimation, as characterized by Theorem 1.1. In this figure we consider the following parametrized family of three-point priors:

$$\pi_{\Theta}^{s}=\frac{1-\varepsilon}{2}\,(\delta_{-\sqrt{s}}+\delta_{+\sqrt{s}})+\varepsilon\,\delta_{\sqrt{s/\varepsilon}}\,. \tag{18}$$

We set $\varepsilon=0.01$ and sweep $s$ (which measures the signal-to-noise ratio). We compare the predicted accuracy for Bayes AMP with numerical simulations for $n=2\cdot 10^{4}$ and $t\in\{5,10,20\}$ iterations, averaged over $50$ realizations.

2.2 Background: Low-Deg

We say that $\hat{\boldsymbol{\theta}}:{\mathbb{R}}^{n\times n}\to{\mathbb{R}}^{n}$, ${\boldsymbol{Y}}\mapsto\hat{\boldsymbol{\theta}}({\boldsymbol{Y}})$, is a Low-Deg estimator of degree $D$ if its coordinates are polynomials of maximum degree $D$ in the matrix entries $(Y_{ij})_{1\leq i\leq j\leq n}$. The coefficients of these polynomials may depend on $n$ but not on ${\boldsymbol{Y}}$. Let $\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}$ denote the space of all polynomials in the variables $(Y_{ij})_{1\leq i\leq j\leq n}$ of degree at most $D$. We use ${\sf LD}(D;n)$ to denote the set of estimators whose coordinates are degree-$D$ polynomials:

$${\sf LD}(D;n):=\big\{\hat{\boldsymbol{\theta}}:{\mathbb{R}}^{n\times n}\to{\mathbb{R}}^{n}\;:\;\forall i,\;\hat{\theta}_{i}\in\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}\big\}\,. \tag{19}$$

With an abuse of notation we will often refer to an estimator $\hat{\boldsymbol{\theta}}$ (e.g. a Low-Deg estimator $\hat{\boldsymbol{\theta}}\in{\sf LD}(D;n)$), but really mean a sequence of estimators $\hat{\boldsymbol{\theta}}_{n}$ indexed by the dimension $n$.

2.3 Statement of main results

Recall that a random variable $X$ is called sub-Gaussian if $\|X\|_{\psi_{2}}<\infty$, where the sub-Gaussian norm is defined as

$$\|X\|_{\psi_{2}}:=\inf\{t>0\,:\,\operatorname*{\mathbb{E}}\exp(X^{2}/t^{2})\leq 2\}\,. \tag{20}$$

For example, any distribution with bounded support is sub-Gaussian.
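Indeed (a one-line check we add for completeness): if $|X|\leq M$ almost surely, then

$$\operatorname*{\mathbb{E}}\exp(X^{2}/t^{2})\leq\exp(M^{2}/t^{2})\leq 2\qquad\text{whenever }t\geq M/\sqrt{\log 2}\,,$$

so $\|X\|_{\psi_{2}}\leq M/\sqrt{\log 2}<\infty$.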

Theorem 2.3.

Assume $\pi_{\Theta}$ is independent of $n$ and sub-Gaussian, with $\operatorname{\mathbb{E}}[\Theta]\neq 0$. For any $\varepsilon>0$, there exists $D(\varepsilon)<\infty$ and a (family of) estimators $\hat{\boldsymbol{\theta}}_{\varepsilon}=\hat{\boldsymbol{\theta}}_{\varepsilon,n}\in{\sf LD}(D(\varepsilon);n)$ such that

$$\lim_{n\to\infty}\frac{1}{n}\operatorname{\mathbb{E}}\big\{\|\hat{\boldsymbol{\theta}}_{\varepsilon}({\boldsymbol{Y}})-{\boldsymbol{\theta}}\|_{2}^{2}\big\}\leq\operatorname{\mathbb{E}}[\Theta^{2}]-q_{\rm AMP}(\pi_{\Theta})+\varepsilon\,. \tag{21}$$

Further, for any constant $D$,

$$\lim_{n\to\infty}\inf_{\hat{\boldsymbol{\theta}}\in{\sf LD}(D;n)}\frac{1}{n}\operatorname{\mathbb{E}}\big\{\|\hat{\boldsymbol{\theta}}({\boldsymbol{Y}})-{\boldsymbol{\theta}}\|_{2}^{2}\big\}\geq\operatorname{\mathbb{E}}[\Theta^{2}]-q_{\rm AMP}(\pi_{\Theta})\,. \tag{22}$$

The lower bound (22) only requires $\pi_{\Theta}$ to have all moments finite, rather than sub-Gaussian tails. The upper bound (21) likely also holds under weaker conditions on $\pi_{\Theta}$ than sub-Gaussianity, but we have not attempted to explore this here. We conclude this summary of our results with a few remarks and directions for future work.

Sharp characterization for Low-Deg.

As a by-product of our results, we obtain a sharp characterization of the optimal accuracy of Low-Deg estimators with constant degree. To the best of our knowledge, this is the first example of such a characterization. Prior work on low-degree estimation [SW22] gave bounds on the signal-to-noise ratio (tight up to log factors) under which the low-degree estimation accuracy is either asymptotically perfect or trivial. In our work we are in a setting where the Low-Deg MSE converges to a non-trivial constant, and we pin down the exact constant.

Similarly, our results are sharper than existing computational lower bounds based on the ‘overlap gap property’ [GJS21], which rule out a different (incomparable) class of algorithms including certain Markov chains.

Optimality of AMP.

Our results support the conjecture that AMP is asymptotically optimal among computationally efficient procedures for certain high-dimensional estimation problems. We are also aware of some problems for which AMP is strictly sub-optimal and has to be modified to capture higher-order correlations between variables. One such example is provided by the spiked tensor model [RM14, HSS15, WEM19], namely the generalization of the above model to higher-order tensors, where observations take the form ${\boldsymbol{Y}}=n^{-(k-1)/2}\,{\boldsymbol{\theta}}^{\otimes k}+{\boldsymbol{Z}}$ for some $k\geq 3$.

We believe that a generalization of the proof techniques developed in this paper can help distinguish in a principled way which problems can be optimally solved using AMP. These ideas may also provide guidance to modify AMP for problems in which it is sub-optimal. The key properties of the rank-one estimation problem that cause AMP to be optimal among Low-Deg estimators are established in Lemmas 4.7 and 4.8; notably, these include the block-diagonal structure of a certain correlation matrix ${\boldsymbol{M}}_{\infty}$. As a result of these properties, the best Low-Deg estimator is ‘tree-structured’ and can thus be computed using an AMP algorithm.

Zero-mean prior.

Our result on the equivalence of AMP and Low-Deg actually applies to the case $\operatorname{\mathbb{E}}[\Theta]=0$ as well. However, it must be emphasized that the statement concerns AMP with uninformative initialization and $O(1)$ iterations, and Low-Deg estimators with $O(1)$ degree. When $\operatorname{\mathbb{E}}[\Theta]=0$, the MSE of both of these algorithms converges to $\operatorname{\mathbb{E}}[\Theta^{2}]$ as $n\to\infty$: it is no better than random guessing.

On the other hand, initializing AMP with a spectral initialization $c_{n}\,{\boldsymbol{v}}_{1}({\boldsymbol{Y}})$ (for a suitable scalar $c_{n}$) yields the accuracy stated in Theorem 1.1, even for $\operatorname{\mathbb{E}}[\Theta]=0$. Since the leading eigenvector can be approximated by power iteration, we believe it is possible to show that the same accuracy is achievable by AMP when run for $O(\log n)$ iterations (and a random initialization), or by Low-Deg estimators of degree $O(\log n)$. However, generalizing the lower bound of Eq. (22) to logarithmic degree goes beyond the analysis developed here.

Higher degree.

We expect the optimality of Bayes AMP to hold within a significantly broader class of low-degree estimators, namely $O(n^{c})$-degree estimators for any $c\in(0,1)$. Again, this extension is beyond our current proof technique. Heuristically, polynomials of degree $O(n^{c})$ can be thought of as a proxy for algorithms of runtime $\exp(n^{c+o_{n}(1)})$ (which is the runtime required to naively evaluate such a polynomial); see e.g. [DKWB19].

3 Proof of Theorem 2.3: Upper bound

Recall from Theorem 2.2 that the fixed points of the state evolution recursion (14) coincide with the stationary points of $\Psi(q;0,\pi_{\Theta})$. Further, it is easy to see from the definition in Eq. (14) that $q\mapsto{\rm SE}(q;\pi_{\Theta})$ is a non-decreasing function with ${\rm SE}(0;\pi_{\Theta})=\operatorname{\mathbb{E}}[\Theta]^{2}>0$. (This follows from the fact that the minimum mean square error is a non-increasing function of the signal-to-noise ratio or, equivalently, from Jensen's inequality.) As a consequence, $(q_{t})_{t\geq 0}$ is a non-decreasing sequence with $\lim_{t\to\infty}q_{t}=q_{\rm AMP}$. Let $t_{*}$ be such that $q_{t_{*}+1}\geq q_{\rm AMP}-\varepsilon/2$.

Consider the special case of the AMP algorithm of Eq. (9) with ${\sf d}=1$, with $F_{t}=f_{t}:{\mathbb{R}}\to{\mathbb{R}}$, and specialize the state evolution recursion of Eqs. (10), (11) to this case. Namely, we define recursively $\mu_{s},\sigma_{s}^{2}$ for $s\geq 0$ via

$$\mu_{t+1}=\operatorname{\mathbb{E}}\big\{\Theta f_{t}(\mu_{t}\Theta+\sigma_{t}\,G)\big\}\,,\qquad(\Theta,G)\sim\pi_{\Theta}\otimes\mathcal{N}(0,1)\,, \tag{23}$$
$$\sigma^{2}_{t+1}=\operatorname{\mathbb{E}}\big\{f_{t}(\mu_{t}\Theta+\sigma_{t}\,G)^{2}\big\}\,. \tag{24}$$

We claim that, for every $t\geq 0$ and any $\varepsilon_{0}>0$, we can construct polynomials $(f_{s})_{0\leq s\leq t}$ of degrees $(D_{s}(\varepsilon_{0}))_{0\leq s\leq t}$ (independent of $n$) such that

$$\big|\mu_{t}-q_{t}\big|\leq\varepsilon_{0}\,,\qquad\big|\sigma^{2}_{t}-q_{t}\big|\leq\varepsilon_{0}\,. \tag{25}$$

Once this is established, the desired upper bound (21) follows by taking $t=t_{*}+1$ and $\varepsilon_{0}=\varepsilon/8$. Consider indeed the AMP estimator $\hat{\boldsymbol{\theta}}^{t}({\boldsymbol{Y}}):=f_{t}({\boldsymbol{x}}^{t})$, where ${\boldsymbol{x}}^{t}$ is defined by the AMP iteration (9) with ${\sf d}=1$, $F_{t}=f_{t}$ (that is, we choose $g_{t}=f_{t}$). This estimator is a polynomial of degree $D\leq(D_{1}+1)(D_{2}+1)\cdots(D_{t_{*}}+1)$ and, by Eq. (13) (specialized to ${\sf d}=1$), we have

$$\lim_{n\to\infty}\frac{1}{n}\operatorname{\mathbb{E}}\big\{\|\hat{\boldsymbol{\theta}}^{t_{*}}({\boldsymbol{Y}})-{\boldsymbol{\theta}}\|_{2}^{2}\big\}=\operatorname{\mathbb{E}}\big\{\big[\Theta-f_{t_{*}}(\mu_{t_{*}}\Theta+\sigma_{t_{*}}G)\big]^{2}\big\} \tag{26}$$
$$=\operatorname{\mathbb{E}}[\Theta^{2}]-2\mu_{t_{*}+1}+\sigma_{t_{*}+1}^{2} \tag{27}$$
$$\leq\operatorname{\mathbb{E}}[\Theta^{2}]-q_{t_{*}+1}+\frac{\varepsilon}{2}\leq\operatorname{\mathbb{E}}[\Theta^{2}]-q_{\rm AMP}+\varepsilon\,. \tag{28}$$

We are left with the task of proving the claim (25), which we will do by induction on $t$. The claim holds for $t=0$, and we then assume as an induction hypothesis that it holds for a certain $t$. We will denote by $f^{\rm Bayes}_{t}(x):=\operatorname{\mathbb{E}}[\Theta\,|\,q_{t}\Theta+\sqrt{q_{t}}\,G=x]$ the ideal nonlinearity. Now let $\varepsilon_{1,k}\downarrow 0$ be a sequence converging to $0$, and select $f_{s}=f_{k,s}$, $s\leq t-1$, according to the induction hypothesis with $\varepsilon_{0}=\varepsilon_{1,k}$. We will denote by $\mu_{k,t},\sigma^{2}_{k,t}$ the corresponding state evolution quantities. In particular, we can assume without loss of generality that $\mu_{k,t},\sigma^{2}_{k,t}\leq 2q_{t}$. By the state evolution recursion, we have (the expectation here is with respect to $(\Theta,G)\sim\pi_{\Theta}\otimes{\mathcal{N}}(0,1)$)

$$\big|\mu_{k,t+1}-q_{t+1}\big|=\Big|\operatorname{\mathbb{E}}\Big\{\Theta\big[f_{k,t}(\mu_{k,t}\Theta+\sigma_{k,t}G)-f^{\rm Bayes}_{t}(q_{t}\Theta+\sqrt{q_{t}}\,G)\big]\Big\}\Big|$$
$$\leq\Big|\operatorname{\mathbb{E}}\Big\{\Theta\big[f^{\rm Bayes}_{t}(\mu_{k,t}\Theta+\sigma_{k,t}G)-f^{\rm Bayes}_{t}(q_{t}\Theta+\sqrt{q_{t}}\,G)\big]\Big\}\Big|$$
$$\qquad\qquad+\Big|\operatorname{\mathbb{E}}\Big\{\Theta\big[f_{t}(\mu_{k,t}\Theta+\sigma_{k,t}G)-f^{\rm Bayes}_{t}(\mu_{k,t}\Theta+\sigma_{k,t}G)\big]\Big\}\Big|$$
$$=:B_{1}(k)+B_{2}(k)\,.$$

It is a general fact about Bayes posterior expectations that $f_{t}^{\rm Bayes}$ is continuous with $|f_{t}^{\rm Bayes}(x)|\leq C_{t}(1+|x|)$ for some constant $C_{t}$ (see Lemma B.1 in Appendix B). Further, $\mu_{k,t}\Theta+\sigma_{k,t}G\stackrel{a.s.}{\longrightarrow}q_{t}\Theta+\sqrt{q_{t}}\,G$ as $k\to\infty$. By dominated convergence, we have $\lim_{k\to\infty}B_{1}(k)=0$ and therefore we can choose $k_{0}$ so that, for all $k\geq k_{0}$, $B_{1}(k)\leq\varepsilon_{0}/2$.

Next consider the term $B_{2}(k)$. Denoting by $\tau:=\|\Theta\|_{\psi_{2}}$ the sub-Gaussian norm of $\pi_{\Theta}$ (see Eq. (20)), we have that $Z_{k,t}:=\mu_{k,t}\Theta+\sigma_{k,t}G$ is sub-Gaussian with $\|Z_{k,t}\|_{\psi_{2}}^{2}\leq\mu_{k,t}^{2}\tau^{2}+\sigma_{k,t}^{2}\leq 4(q_{t}^{2}\tau^{2}+q_{t})$. We let $\tau^{2}_{t}:=4(q_{t}^{2}\tau^{2}+q_{t})$ denote this upper bound. Since $f^{\rm Bayes}_{t}$ is continuous with $|f^{\rm Bayes}_{t}(x)|\leq C_{t}(1+|x|)$, by weighted approximation theory [Lor05] we can choose a polynomial $f_{t}$ such that

$$\sup_{u\in{\mathbb{R}}}\big|f_{t}(u)-f^{\rm Bayes}_{t}(u)\big|\exp\Big(-\frac{u^{2}}{4\tau^{2}_{t}}\Big)\leq\frac{\varepsilon_{0}}{4\operatorname{\mathbb{E}}[\Theta^{2}]^{1/2}}\,. \tag{29}$$

Using this polynomial approximation, we get

$$B_{2}(k)\leq\operatorname{\mathbb{E}}\big[\Theta^{2}\big]^{1/2}\operatorname{\mathbb{E}}\big\{[f_{t}(Z_{k,t})-f^{\rm Bayes}_{t}(Z_{k,t})]^{2}\big\}^{1/2} \tag{30}$$
$$\leq\frac{1}{4}\varepsilon_{0}\operatorname{\mathbb{E}}\big[\exp\big(Z_{k,t}^{2}/2\tau_{t}^{2}\big)\big]^{1/2}\leq\frac{1}{2}\varepsilon_{0}\,. \tag{31}$$

This completes the proof of the induction claim for the first inequality in Eq. (25). The second inequality is treated analogously.

4 Proof of Theorem 2.3: Lower bound

Recall that $\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}$ denotes the family of polynomials of degree at most $D$ in the variables $(Y_{ij})_{1\leq i\leq j\leq n}$. The key of our proof is to show that the asymptotic estimation accuracy of Low-Deg estimators is not reduced if we replace polynomials by a restricted family of tree-structured polynomials, which we will denote by $\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}^{\mathcal{T}}$. This reduction is presented in Sections 4.1 and 4.2. We will then establish a connection between tree-structured polynomials and AMP algorithms, and rely on known optimality theory for AMP estimators, cf. Section 4.3.

4.1 Reduction to tree-structured polynomials

Let $\mathcal{T}_{\leq D}$ denote the set of rooted trees, up to root-preserving isomorphism, with at most $D$ edges. (Trees must be connected, with no self-loops or multi-edges allowed. The tree with one vertex and no edges is included.) We denote the root vertex of $T\in\mathcal{T}_{\leq D}$ by $\circ$. For $T\in\mathcal{T}_{\leq D}$, define a labeling rooted at $1$ (or, simply, a labeling) of $T$ to be a function $\phi:V(T)\to[n]$ such that $\phi(\circ)=1$. (Vertex $1$ is distinguished because we will be considering estimation of $\theta_{1}$.)

Given a graph $G=(V,E)$, we will denote by ${\sf d}_{G}(u,v)$ the graph distance between two vertices $u,v\in V$.

Definition 4.1.

A labeling $\phi$ of $T\in\mathcal{T}_{\leq D}$ is said to be non-reversing if, for every pair of distinct vertices $u,v\in V(T)$ with the same label (i.e., $\phi(u)=\phi(v)$), one of the following holds:

  • ${\sf d}_{T}(u,v)>2$, or

  • ${\sf d}_{T}(u,v)=2$ and $u,v$ have the same distance from the root (i.e., ${\sf d}_{T}(u,\circ)={\sf d}_{T}(v,\circ)$).

(The latter holds if and only if $u,v$ are both children of a common vertex $w\in V(T)$.) We denote by $\mathsf{nr}(T)$ the set of all non-reversing labelings of $T$.

For each $T\in\mathcal{T}_{\leq D}$, define the associated polynomial

$$\mathscrsfs{F}_{T}({\boldsymbol{Y}})=\frac{1}{\sqrt{|\mathsf{nr}(T)|}}\sum_{\phi\in\mathsf{nr}(T)}\,\prod_{(u,v)\in E(T)}Y_{\phi(u),\phi(v)}\,. \tag{32}$$
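For small trees, Eq. (32) and Definition 4.1 can be checked by brute force. The sketch below (our own reference implementation, with hypothetical helper names, feasible only for tiny $n$ and $T$) enumerates labelings with $\phi(\circ)=1$ (index $0$ in the code), keeps the non-reversing ones, and evaluates $\mathscrsfs{F}_{T}({\boldsymbol{Y}})$.

```python
import numpy as np
from itertools import product

def F_T(Y, parent):
    """Naive evaluation of the tree polynomial F_T(Y) of Eq. (32). The rooted tree T is
    encoded by a parent array: parent[0] = -1 for the root, and vertex v > 0 is joined
    to parent[v]. Labelings are constrained to phi(root) = 0 (i.e., we target theta_1)."""
    n, V = Y.shape[0], len(parent)
    depth = [0] * V
    for v in range(1, V):
        depth[v] = depth[parent[v]] + 1

    def dist(u, v):                      # tree distance via the lowest common ancestor
        a, b = u, v
        while depth[a] > depth[b]: a = parent[a]
        while depth[b] > depth[a]: b = parent[b]
        while a != b: a, b = parent[a], parent[b]
        return depth[u] + depth[v] - 2 * depth[a]

    def non_reversing(phi):              # Definition 4.1
        for u in range(V):
            for v in range(u + 1, V):
                if phi[u] == phi[v]:
                    d = dist(u, v)
                    if not (d > 2 or (d == 2 and depth[u] == depth[v])):
                        return False
        return True

    total, count = 0.0, 0
    for rest in product(range(n), repeat=V - 1):
        phi = (0,) + rest
        if not non_reversing(phi):
            continue
        count += 1
        val = 1.0
        for v in range(1, V):
            val *= Y[phi[v], phi[parent[v]]]
        total += val
    return total / np.sqrt(count)

# Example: the rooted path with two edges (root - child - grandchild)
rng = np.random.default_rng(0)
Y = rng.standard_normal((8, 8)); Y = (Y + Y.T) / 2
print(F_T(Y, parent=[-1, 0, 1]))
```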
Definition 4.2.

We define $\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}^{\mathcal{T}}$ to be the set of all polynomials of the form

$$p({\boldsymbol{Y}})=\sum_{T\in\mathcal{T}_{\leq D}}\hat{p}_{T}\mathscrsfs{F}_{T}({\boldsymbol{Y}}) \tag{33}$$

for coefficients $\hat{p}_{T}\in\mathbb{R}$.

Note that any $p\in\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}^{\mathcal{T}}$ is a polynomial of degree at most $D$, that is,

$$\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}^{\mathcal{T}}\subseteq\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}\,. \tag{34}$$

We are now in position to state our result about reduction to tree-structured polynomials.

Proposition 4.3.

Assume $\pi_{\Theta}$ to have finite moments of all orders. Let $\psi:{\mathbb{R}}\to{\mathbb{R}}$ be a measurable function, and assume all moments of $\psi(\theta_{1})$ to be finite. For any fixed $\pi_{\Theta},\psi,D$ there exists a fixed ($n$-independent) choice of coefficients $(\hat{p}_{T})_{T\in\mathcal{T}_{\leq D}}$ such that the associated polynomial $p=p_{n}\in\mathbb{R}[{\boldsymbol{Y}}]^{\mathcal{T}}_{\leq D}$ defined by (33) satisfies

$$\lim_{n\to\infty}\operatorname*{\mathbb{E}}[(p({\boldsymbol{Y}})-\psi(\theta_{1}))^{2}]=\lim_{n\to\infty}\inf_{q\in\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}}\operatorname*{\mathbb{E}}[(q({\boldsymbol{Y}})-\psi(\theta_{1}))^{2}]\,. \tag{35}$$

(In particular, the above limits exist.)

4.2 Proof of Proposition 4.3

The direction “$\geq$” in Eq. (35) is immediate from Eq. (34), so it remains to prove “$\leq$.”

For the proof, we will consider a slightly different model whereby we observe $\tilde{\boldsymbol{Y}}={\boldsymbol{\theta}}{\boldsymbol{\theta}}^{{\sf T}}/\sqrt{n}+\tilde{\boldsymbol{Z}}$, with $\tilde{\boldsymbol{Z}}=\tilde{\boldsymbol{Z}}^{{\sf T}}$ and $(\tilde{Z}_{ij})_{1\leq i\leq j\leq n}\sim_{iid}\mathcal{N}(0,1)$. In other words, we use Gaussians of variance $1$ instead of $2$ on the diagonal. The original model is related to this one by ${\boldsymbol{Y}}=\tilde{\boldsymbol{Y}}+{\boldsymbol{W}}$, where ${\boldsymbol{W}}$ is a diagonal matrix with $(W_{ii})_{1\leq i\leq n}\sim_{iid}\mathcal{N}(0,1)$ independent of ${\boldsymbol{\theta}},\tilde{\boldsymbol{Z}}$. It is easy to see that it is sufficient to prove Proposition 4.3 under this modified model. Indeed, the left-hand side of Eq. (35) does not depend on the distribution of the diagonal entries of ${\boldsymbol{Y}}$ (since tree-structured polynomials do not depend on the diagonal entries). For the right-hand side, if $q({\boldsymbol{Y}})$ is an arbitrary degree-$D$ estimator, then $\tilde{q}(\tilde{\boldsymbol{Y}})=\operatorname{\mathbb{E}}_{{\boldsymbol{W}}}q(\tilde{\boldsymbol{Y}}+{\boldsymbol{W}})$ also has degree $D$ and satisfies $\operatorname*{\mathbb{E}}[q({\boldsymbol{Y}})\cdot\psi(\theta_{1})]=\operatorname*{\mathbb{E}}[\tilde{q}(\tilde{\boldsymbol{Y}})\cdot\psi(\theta_{1})]$ and, by Jensen's inequality, $\operatorname*{\mathbb{E}}[q({\boldsymbol{Y}})^{2}]\geq\operatorname*{\mathbb{E}}[\tilde{q}(\tilde{\boldsymbol{Y}})^{2}]$, whence we conclude

$$\inf_{\tilde{q}\in\mathbb{R}[\tilde{\boldsymbol{Y}}]_{\leq D}}\operatorname*{\mathbb{E}}[(\tilde{q}(\tilde{\boldsymbol{Y}})-\psi(\theta_{1}))^{2}]\leq\inf_{q\in\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}}\operatorname*{\mathbb{E}}[(q({\boldsymbol{Y}})-\psi(\theta_{1}))^{2}]\,. \tag{36}$$

This proves the claim. In what follows, we will drop the tilde from $\tilde{\boldsymbol{Y}}$, $\tilde{\boldsymbol{Z}}$ and assume the new normalization.

It is useful to generalize the setting introduced previously. Let $\mathcal{G}_{\leq D}$ denote the set of all rooted (multi-)graphs, up to root-preserving isomorphism, with at most $D$ total edges, with the additional constraint that no vertices are isolated except possibly the root. Self-loops and multi-edges are allowed, and edges are counted with their multiplicity. For instance, a triple-edge contributes $3$ towards the edge count $D$. The graph need not be connected. For $G\in\mathcal{G}_{\leq D}$ we write $V(G)$ for the set of vertices and $E(G)$ for the multiset of edges.

We use $\circ$ for the root of a graph $G\in\mathcal{G}_{\leq D}$, and define labelings (rooted at $1$) exactly as for trees. Instead of non-reversing labelings, it will be convenient to work with embeddings, that is, labelings that are injective (every pair of distinct vertices $u,v\in V(G)$ has $\phi(u)\neq\phi(v)$). We denote the set of embeddings of $G$ by $\mathsf{emb}(G)$.

A labeling $\phi$ of $G\in\mathcal{G}_{\leq D}$ induces a multi-graph whose vertices are elements of $[n]$. This is the graph with vertex set $\{\phi(v)\,:\,v\in V(G)\}$ and edge set $\{(\phi(u),\phi(v))\,:\,(u,v)\in E(G)\}$. If $\phi(u)=\phi(u^{\prime})$ and $\phi(v)=\phi(v^{\prime})$ for $(u,v)$, $(u^{\prime},v^{\prime})\in E(G)$ distinct edges in the multigraph $G$, then $(\phi(u),\phi(v))$ and $(\phi(u^{\prime}),\phi(v^{\prime}))$ are considered distinct edges in the induced multi-graph.

We call this induced graph the image of $\phi$ and write ${\boldsymbol{\alpha}}=\mathsf{im}(\phi;G)$ whenever ${\boldsymbol{\alpha}}$ is the image of $\phi$. We will treat ${\boldsymbol{\alpha}}$ as an element of $\mathbb{N}^{\overline{E}_{n}}$ where ${\overline{E}_{n}}=\{(i,j)\,:\,1\leq i\leq j\leq n\}$, namely ${\boldsymbol{\alpha}}=(\alpha_{ij})_{1\leq i\leq j\leq n}$ where $\alpha_{ij}\in\mathbb{N}=\{0,1,2,\ldots\}$ counts the multiplicity of edge $(i,j)$ in ${\boldsymbol{\alpha}}$. Formally, $\alpha_{ij}:=|\{(u,v)\in E(G)\,:\,\phi(\{u,v\})=\{i,j\}\}|$.

For $k\in\mathbb{N}$, let $h_{k}:\mathbb{R}\to\mathbb{R}$ denote the $k$-th orthonormal Hermite polynomial. Recall that these are defined (uniquely, up to sign) by the following two conditions: $(i)$ $h_{k}$ is a degree-$k$ polynomial; $(ii)$ $\operatorname{\mathbb{E}}[h_{j}(Z)h_{k}(Z)]=\mathbbm{1}_{j=k}$ when $Z\sim{\mathcal{N}}(0,1)$. We refer to [Sze39] for background.

For ${\boldsymbol{\alpha}}\in\mathbb{N}^{\overline{E}_{n}}$, define the multivariate Hermite polynomial

$$h_{\boldsymbol{\alpha}}({\boldsymbol{Y}})=\prod_{1\leq i\leq j\leq n}h_{\alpha_{ij}}(Y_{ij})\,. \tag{37}$$
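A quick way to generate the orthonormal $h_{k}$ numerically (our own sketch, using the standard probabilists' three-term recurrence, consistent with conditions $(i)$–$(ii)$ above):

```python
import numpy as np
from math import factorial

def hermite_orthonormal(k_max, x):
    """Orthonormal Hermite polynomials h_0,...,h_{k_max} evaluated at x:
    probabilists' recurrence He_{k+1} = x*He_k - k*He_{k-1}, normalized by sqrt(k!)."""
    x = np.asarray(x, float)
    He = [np.ones_like(x), x.astype(float)]
    for k in range(1, k_max):
        He.append(x * He[k] - k * He[k - 1])
    return [He[k] / np.sqrt(factorial(k)) for k in range(k_max + 1)]

# Orthonormality check: E[h_j(Z) h_k(Z)] ~ 1{j=k} for Z ~ N(0,1)
Z = np.random.default_rng(0).standard_normal(400_000)
h = hermite_orthonormal(3, Z)
print(np.round([[np.mean(h[j] * h[k]) for k in range(4)] for j in range(4)], 2))
```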

These polynomials are orthonormal: $\operatorname{\mathbb{E}}[h_{\boldsymbol{\alpha}}({\boldsymbol{Z}})h_{\boldsymbol{\beta}}({\boldsymbol{Z}})]=\mathbbm{1}_{{\boldsymbol{\alpha}}={\boldsymbol{\beta}}}$ when $(Z_{ij})_{i\leq j}\sim_{iid}{\mathcal{N}}(0,1)$. Viewing ${\boldsymbol{\alpha}}$ as a graph, let $C({\boldsymbol{\alpha}})$ denote the set of non-empty (i.e., containing at least one edge) connected components of ${\boldsymbol{\alpha}}$. As above, each ${\boldsymbol{\gamma}}\in C({\boldsymbol{\alpha}})$ is an element of $\mathbb{N}^{\overline{E}_{n}}$, where $\gamma_{ij}$ denotes the multiplicity of edge $(i,j)$. It will be important to “center” each component in the following sense. Define

$$\mathscrsfs{H}_{\boldsymbol{\alpha}}({\boldsymbol{Y}})=\prod_{{\boldsymbol{\gamma}}\in C({\boldsymbol{\alpha}})}\big(h_{\boldsymbol{\gamma}}({\boldsymbol{Y}})-\operatorname{\mathbb{E}}h_{\boldsymbol{\gamma}}({\boldsymbol{Y}})\big)\,, \tag{38}$$

where (in the case ${\boldsymbol{\alpha}}={\boldsymbol{0}}$) the empty product is equal to $1$ by convention. Here and throughout, the expectation is over ${\boldsymbol{Y}}$ distributed according to the rank-one estimation model as defined in Eq. (1). For $G\in\mathcal{G}_{\leq D}$, define

$$\mathscrsfs{H}_{G}({\boldsymbol{Y}})=\frac{1}{\sqrt{|\mathsf{emb}(G)|}}\sum_{\phi\in\mathsf{emb}(G)}\mathscrsfs{H}_{\mathsf{im}(\phi;G)}({\boldsymbol{Y}})\,. \tag{39}$$

Define the symmetric subspace $\mathbb{R}[{\boldsymbol{Y}}]^{\mathsf{sym}}_{\leq D}\subseteq\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}$ as

$$\mathbb{R}[{\boldsymbol{Y}}]^{\mathsf{sym}}_{\leq D}:=\Big\{f({\boldsymbol{Y}})=\sum_{G\in\mathcal{G}_{\leq D}}a_{G}\mathscrsfs{H}_{G}({\boldsymbol{Y}})\;:\;a_{G}\in\mathbb{R}\;\;\forall G\in\mathcal{G}_{\leq D}\Big\}\,. \tag{40}$$

In words, $\mathbb{R}[{\boldsymbol{Y}}]^{\mathsf{sym}}_{\leq D}$ is the $\mathbb{R}$-span of $(\mathscrsfs{H}_{G})_{G\in\mathcal{G}_{\leq D}}$. It is also easy to see that $\mathbb{R}[{\boldsymbol{Y}}]^{\mathsf{sym}}_{\leq D}$ is the linear subspace of $\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}$ which is invariant under permutations of the rows/columns $\{2,\dots,n\}$ of ${\boldsymbol{Y}}$.

Note that the task of estimating $\psi(\theta_{1})$ under the model (1) is invariant under permutations of $\{2,\dots,n\}$. As a consequence of the Hunt–Stein theorem [EG21], the optimal estimator of $\psi(\theta_{1})$ must be equivariant under permutations of $\{2,\dots,n\}$. The following lemma shows that the same is true if we restrict our attention to Low-Deg estimators. Namely, instead of the infimum over all degree-$D$ polynomials in (35), it suffices to study only the symmetric subspace.

Lemma 4.4.

Under the assumptions of Proposition 4.3, for any $n$,

$$\inf_{q\in\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}}\operatorname*{\mathbb{E}}[(q({\boldsymbol{Y}})-\psi(\theta_{1}))^{2}]=\inf_{q\in\mathbb{R}[{\boldsymbol{Y}}]^{\mathsf{sym}}_{\leq D}}\operatorname*{\mathbb{E}}[(q({\boldsymbol{Y}})-\psi(\theta_{1}))^{2}]\,. \tag{41}$$

Our next step will be to write down an explicit formula (given in Lemma 4.6 below) for the right-hand side of (41). Define the vector ${\boldsymbol{c}}_{n}=(c_{n,A})_{A\in\mathcal{G}_{\leq D}}$ and the matrix ${\boldsymbol{M}}_{n}=(M_{n,AB})_{A,B\in\mathcal{G}_{\leq D}}$ by

$$c_{n,A}=\operatorname{\mathbb{E}}[\mathscrsfs{H}_{A}({\boldsymbol{Y}})\cdot\psi(\theta_{1})]\,,\qquad M_{n,AB}=\operatorname{\mathbb{E}}[\mathscrsfs{H}_{A}({\boldsymbol{Y}})\mathscrsfs{H}_{B}({\boldsymbol{Y}})]\,. \tag{42}$$

Note that both ${\boldsymbol{c}}_{n}$ and ${\boldsymbol{M}}_{n}$ have constant dimension (depending on $D$ but not on $n$), but their entries depend on $n$. Also note that ${\boldsymbol{M}}_{n}$ is a Gram matrix and therefore positive semidefinite. We will show that ${\boldsymbol{M}}_{n}$ is strictly positive definite (and thus invertible), and in fact strongly so in the sense that its minimum eigenvalue is lower bounded by a positive constant independent of $n$. (Here and throughout, asymptotic notation such as $O(\,\cdot\,)$ and $\Omega(\,\cdot\,)$ may hide factors depending on $\pi_{\Theta},\psi,D$.)

Lemma 4.5.

Under the assumptions of Proposition 4.3,

$$\lambda_{\mathrm{min}}({\boldsymbol{M}}_{n})=\Omega(1)\,.$$

We can now obtain an explicit formula for the optimal estimation error in terms of ${\boldsymbol{M}}_{n},{\boldsymbol{c}}_{n}$.

Lemma 4.6.

Define the vector ${\boldsymbol{c}}_{n}$ and the matrix ${\boldsymbol{M}}_{n}$ as per Eq. (42). Then, under the assumptions of Proposition 4.3, for any $n$,

$$\inf_{q\in\mathbb{R}[{\boldsymbol{Y}}]^{\mathsf{sym}}_{\leq D}}\operatorname*{\mathbb{E}}[(q({\boldsymbol{Y}})-\psi(\theta_{1}))^{2}]=\operatorname*{\mathbb{E}}[\psi(\theta_{1})^{2}]-\langle{\boldsymbol{c}}_{n},{\boldsymbol{M}}_{n}^{-1}{\boldsymbol{c}}_{n}\rangle\,. \tag{43}$$

Furthermore, the infimum is attained, and the minimizer $q^{*}$, which is unique, takes the form

$$q^{*}({\boldsymbol{Y}})=\sum_{A\in\mathcal{G}_{\leq D}}\hat{q}_{A}\mathscrsfs{H}_{A}({\boldsymbol{Y}})\,,\qquad\hat{\boldsymbol{q}}=(\hat{q}_{A})_{A\in\mathcal{G}_{\leq D}}={\boldsymbol{M}}_{n}^{-1}{\boldsymbol{c}}_{n}\,. \tag{44}$$
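To see where (43)–(44) come from (a sketch of the standard least-squares computation, which we add for readability and which does not replace the proof given in the appendix): expanding $q({\boldsymbol{Y}})=\sum_{A}\hat{q}_{A}\mathscrsfs{H}_{A}({\boldsymbol{Y}})$ gives

$$\operatorname*{\mathbb{E}}\big[(q({\boldsymbol{Y}})-\psi(\theta_{1}))^{2}\big]=\operatorname*{\mathbb{E}}[\psi(\theta_{1})^{2}]-2\langle\hat{\boldsymbol{q}},{\boldsymbol{c}}_{n}\rangle+\langle\hat{\boldsymbol{q}},{\boldsymbol{M}}_{n}\hat{\boldsymbol{q}}\rangle\,,$$

a strictly convex quadratic in $\hat{\boldsymbol{q}}$ (by Lemma 4.5), whose unique stationary point solves ${\boldsymbol{M}}_{n}\hat{\boldsymbol{q}}={\boldsymbol{c}}_{n}$ and attains the value $\operatorname*{\mathbb{E}}[\psi(\theta_{1})^{2}]-\langle{\boldsymbol{c}}_{n},{\boldsymbol{M}}_{n}^{-1}{\boldsymbol{c}}_{n}\rangle$.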

Determining the asymptotic error achieved by Low-Deg estimators requires understanding the asymptotic behavior of ${\boldsymbol{c}}_{n}$ and ${\boldsymbol{M}}_{n}$. This is achieved in the following lemmas. Notably, ${\boldsymbol{M}}_{n}$ is nearly block diagonal. (For this block-diagonal structure to appear, it is crucial that we center each connected component in (38).)

Lemma 4.7.

For each $A\in\mathcal{G}_{\leq D}$ we have the following.

  • There is a constant $c_{\infty,A}\in\mathbb{R}$ (depending on $\pi_{\Theta},\psi,A$) such that $c_{n,A}=c_{\infty,A}+O(n^{-1/2})$.

  • If $A\notin\mathcal{T}_{\leq D}$ then $c_{\infty,A}=0$.

Lemma 4.8.

For each $A,B\in\mathcal{G}_{\leq D}$ we have the following.

  • There is a constant $M_{\infty,AB}\in\mathbb{R}$ (depending on $\pi_{\Theta},A,B$) such that $M_{n,AB}=M_{\infty,AB}+O(n^{-1/2})$.

  • If $A\in\mathcal{T}_{\leq D}$ and $B\in\mathcal{G}_{\leq D}\setminus\mathcal{T}_{\leq D}$ then $M_{\infty,AB}=0$.

Write ${\boldsymbol{c}}_{n},{\boldsymbol{M}}_{n}$ in block form, where the first block is indexed by the rooted trees $\mathcal{T}_{\leq D}$ and the second by the other graphs $\mathcal{G}_{\leq D}\setminus\mathcal{T}_{\leq D}$:

$${\boldsymbol{c}}_{n}=\begin{bmatrix}{\boldsymbol{d}}_{n}\\ {\boldsymbol{e}}_{n}\end{bmatrix}\,,\qquad{\boldsymbol{M}}_{n}=\begin{bmatrix}{\boldsymbol{P}}_{n}&{\boldsymbol{R}}_{n}\\ {\boldsymbol{R}}_{n}^{\sf T}&{\boldsymbol{Q}}_{n}\end{bmatrix}\,. \tag{45}$$

Let ${\boldsymbol{d}}_{\infty},{\boldsymbol{e}}_{\infty},{\boldsymbol{P}}_{\infty},{\boldsymbol{Q}}_{\infty},{\boldsymbol{R}}_{\infty}$ denote the corresponding limiting vectors/matrices from Lemmas 4.7 and 4.8, and note that ${\boldsymbol{e}}_{\infty}={\boldsymbol{0}}$ and ${\boldsymbol{R}}_{\infty}={\boldsymbol{0}}$.

We now work towards constructing the (sequence of) tree-structured polynomials $p=p_{n}$ that verify Eq. (35). We begin by defining a related polynomial $r=r_{n}$, which is tree-structured in the basis $\{\mathscrsfs{H}_{T}\}$ instead of $\{\mathscrsfs{F}_{T}\}$:

$$r({\boldsymbol{Y}})=\sum_{T\in\mathcal{T}_{\leq D}}\hat{r}_{T}\mathscrsfs{H}_{T}({\boldsymbol{Y}})\qquad\text{where}\qquad\hat{\boldsymbol{r}}={\boldsymbol{P}}_{\infty}^{-1}{\boldsymbol{d}}_{\infty}\,. \tag{46}$$

To see that this definition is well-posed, note that since ${\boldsymbol{M}}_{\infty}$ is strictly positive definite by Lemma 4.5, ${\boldsymbol{P}}_{\infty}$ is also strictly positive definite and thus invertible. For intuition, note the similarity between $\hat{\boldsymbol{r}}$ and the optimizer $\hat{\boldsymbol{q}}$ in Lemma 4.6.

The next lemma characterizes the asymptotic estimation error achieved by the polynomial $r({\boldsymbol{Y}})$.

Lemma 4.9.

We have

$$\lim_{n\to\infty}\operatorname*{\mathbb{E}}[r({\boldsymbol{Y}})\cdot\psi(\theta_{1})]=\lim_{n\to\infty}\operatorname*{\mathbb{E}}[r({\boldsymbol{Y}})^{2}]=\langle{\boldsymbol{d}}_{\infty},{\boldsymbol{P}}_{\infty}^{-1}{\boldsymbol{d}}_{\infty}\rangle=\langle{\boldsymbol{c}}_{\infty},{\boldsymbol{M}}_{\infty}^{-1}{\boldsymbol{c}}_{\infty}\rangle\,, \tag{47}$$

and in particular,

$$\lim_{n\to\infty}\operatorname*{\mathbb{E}}[(r({\boldsymbol{Y}})-\psi(\theta_{1}))^{2}]=\operatorname*{\mathbb{E}}[\psi(\theta_{1})^{2}]-\langle{\boldsymbol{c}}_{\infty},{\boldsymbol{M}}_{\infty}^{-1}{\boldsymbol{c}}_{\infty}\rangle\,. \tag{48}$$

The crucial equality $\langle{\boldsymbol{d}}_{\infty},{\boldsymbol{P}}_{\infty}^{-1}{\boldsymbol{d}}_{\infty}\rangle=\langle{\boldsymbol{c}}_{\infty},{\boldsymbol{M}}_{\infty}^{-1}{\boldsymbol{c}}_{\infty}\rangle$ above is an immediate consequence of the structure of ${\boldsymbol{c}}_{\infty},{\boldsymbol{M}}_{\infty}$ established in Lemmas 4.7 and 4.8, namely ${\boldsymbol{e}}_{\infty}={\boldsymbol{0}}$ and ${\boldsymbol{R}}_{\infty}={\boldsymbol{0}}$.
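Spelled out (a small step we add for readability): since ${\boldsymbol{R}}_{\infty}={\boldsymbol{0}}$, the matrix ${\boldsymbol{M}}_{\infty}$ is block diagonal with blocks ${\boldsymbol{P}}_{\infty},{\boldsymbol{Q}}_{\infty}$, so

$$\langle{\boldsymbol{c}}_{\infty},{\boldsymbol{M}}_{\infty}^{-1}{\boldsymbol{c}}_{\infty}\rangle=\Big\langle\begin{bmatrix}{\boldsymbol{d}}_{\infty}\\ {\boldsymbol{0}}\end{bmatrix},\begin{bmatrix}{\boldsymbol{P}}_{\infty}^{-1}&{\boldsymbol{0}}\\ {\boldsymbol{0}}&{\boldsymbol{Q}}_{\infty}^{-1}\end{bmatrix}\begin{bmatrix}{\boldsymbol{d}}_{\infty}\\ {\boldsymbol{0}}\end{bmatrix}\Big\rangle=\langle{\boldsymbol{d}}_{\infty},{\boldsymbol{P}}_{\infty}^{-1}{\boldsymbol{d}}_{\infty}\rangle\,.$$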

As a direct consequence of Lemma 4.9, we can now prove the following analogue of Proposition 4.3, in which the basis $\{\mathscrsfs{H}_{T}\}$ is used instead of $\{\mathscrsfs{F}_{T}\}$.

Proposition 4.10.

Assume $\pi_{\Theta}$ to have finite moments of all orders. Let $\psi:{\mathbb{R}}\to{\mathbb{R}}$ be a measurable function, and assume all moments of $\psi(\theta_{1})$ to be finite. For any fixed $\pi_{\Theta},\psi,D$ there exists a fixed ($n$-independent) choice of coefficients $(\hat{r}_{T})_{T\in\mathcal{T}_{\leq D}}$ such that the associated polynomial $r=r_{n}$ defined by $r({\boldsymbol{Y}})=\sum_{T\in\mathcal{T}_{\leq D}}\hat{r}_{T}\mathscrsfs{H}_{T}({\boldsymbol{Y}})$ satisfies

$$\lim_{n\to\infty}\operatorname*{\mathbb{E}}[(r({\boldsymbol{Y}})-\psi(\theta_{1}))^{2}]=\lim_{n\to\infty}\inf_{q\in\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}}\operatorname*{\mathbb{E}}[(q({\boldsymbol{Y}})-\psi(\theta_{1}))^{2}]\,. \tag{49}$$

(In particular, the above limits exist.)

Proof of Proposition 4.10.

It follows immediately from Lemmas 4.5, 4.7, and 4.8 that

$$\lim_{n\to\infty}\langle{\boldsymbol{c}}_{n},{\boldsymbol{M}}_{n}^{-1}{\boldsymbol{c}}_{n}\rangle=\langle{\boldsymbol{c}}_{\infty},{\boldsymbol{M}}_{\infty}^{-1}{\boldsymbol{c}}_{\infty}\rangle\,.$$

(See Lemma C.2 in the appendix for a formal proof.) Combining this fact with Lemmas 4.4 and 4.6, the limit on the right-hand side of (49) exists and is equal to $\operatorname*{\mathbb{E}}[\psi(\theta_{1})^{2}]-\langle{\boldsymbol{c}}_{\infty},{\boldsymbol{M}}_{\infty}^{-1}{\boldsymbol{c}}_{\infty}\rangle$.

Defining $\hat{\boldsymbol{r}}$ as in (46), we have by Lemma 4.9 that the limit on the left-hand side of (49) is also equal to $\operatorname*{\mathbb{E}}[\psi(\theta_{1})^{2}]-\langle{\boldsymbol{c}}_{\infty},{\boldsymbol{M}}_{\infty}^{-1}{\boldsymbol{c}}_{\infty}\rangle$. This completes the proof. ∎

Proposition 4.10 differs from our goal, namely Proposition 4.3, in that it offers an estimator that is a linear combination of the polynomials $\{\mathscrsfs{H}_{T}\}_{T\in\mathcal{T}_{\leq D}}$ instead of $\{\mathscrsfs{F}_{T}\}_{T\in\mathcal{T}_{\leq D}}$. Recalling that we are restricting to trees and that the first two Hermite polynomials are simply $h_{0}(z)=1$ and $h_{1}(z)=z$, the difference lies in the fact that $\mathscrsfs{H}_{T}({\boldsymbol{Y}})$ involves a sum over embeddings, while $\mathscrsfs{F}_{T}({\boldsymbol{Y}})$ involves a sum over the larger class of non-reversing labelings; also $\mathscrsfs{H}_{T}({\boldsymbol{Y}})$ is centered by its expectation, provided $T$ is not edgeless (see Eq. (38), and note that a tree has only one connected component).

The next lemma allows us, for each $A\in\mathcal{T}_{\leq D}$, to rewrite $\mathscrsfs{H}_{A}({\boldsymbol{Y}})$ as a linear combination of the $\{\mathscrsfs{F}_{T}({\boldsymbol{Y}})\}_{T\in\mathcal{T}_{\leq D}}$ with negligible error.

Lemma 4.11.

For any fixed $A\in\mathcal{T}_{\leq D}$ there exist $n$-independent coefficients $(m_{AB})_{B\in\mathcal{T}_{\leq D}}$ such that

$$\mathscrsfs{H}_{A}({\boldsymbol{Y}})=\sum_{B\in\mathcal{T}_{\leq D}}m_{AB}\mathscrsfs{F}_{B}({\boldsymbol{Y}})+\mathscrsfs{E}_{A}({\boldsymbol{Y}})\,,\qquad\operatorname*{\mathbb{E}}[\mathscrsfs{E}_{A}({\boldsymbol{Y}})^{2}]=o_{n}(1)\,. \tag{50}$$

We can finally conclude the proof of Proposition 4.3.

Proof of Proposition 4.3.

As in the proof of Proposition 4.10, it suffices to define 𝒑^\hat{\boldsymbol{p}} so that the limit on the left-hand side of (35) is equal to 𝔼[ψ(θ1)2]𝒄,𝑴1𝒄\operatorname*{\mathbb{E}}[\psi(\theta_{1})^{2}]-\langle{\boldsymbol{c}}_{\infty},{\boldsymbol{M}}_{\infty}^{-1}{\boldsymbol{c}}_{\infty}\rangle. Defining 𝒓^\hat{\boldsymbol{r}} as in (46), we can use Lemma 4.11 to expand

r(𝒀)\displaystyle r({\boldsymbol{Y}}) =A𝒯Dr^AHA(𝒀)\displaystyle=\sum_{A\in\mathcal{T}_{\leq D}}\hat{r}_{A}\mathscrsfs{H}_{A}({\boldsymbol{Y}})
=A𝒯Dr^A[B𝒯DmABFB(𝒀)+EA(𝒀)]\displaystyle=\sum_{A\in\mathcal{T}_{\leq D}}\hat{r}_{A}\left[\sum_{B\in\mathcal{T}_{\leq D}}m_{AB}\mathscrsfs{F}_{B}({\boldsymbol{Y}})+\mathscrsfs{E}_{A}({\boldsymbol{Y}})\right]
=A,B𝒯Dr^AmABFB(𝒀)+A𝒯Dr^AEA(𝒀)\displaystyle=\sum_{A,B\in\mathcal{T}_{\leq D}}\hat{r}_{A}m_{AB}\mathscrsfs{F}_{B}({\boldsymbol{Y}})+\sum_{A\in\mathcal{T}_{\leq D}}\hat{r}_{A}\mathscrsfs{E}_{A}({\boldsymbol{Y}})
=:p(𝒀)+Δ(𝒀).\displaystyle=:p({\boldsymbol{Y}})+\Delta({\boldsymbol{Y}})\,.

In other words, we define p^B=A𝒯Dr^AmAB\hat{p}_{B}=\sum_{A\in\mathcal{T}_{\leq D}}\hat{r}_{A}m_{AB}. Since |𝒯D|=O(1)|\mathcal{T}_{\leq D}|=O(1), r^A=O(1)\hat{r}_{A}=O(1), and 𝔼[EA(𝒀)2]=on(1)\operatorname*{\mathbb{E}}[\mathscrsfs{E}_{A}({\boldsymbol{Y}})^{2}]=o_{n}(1), we have

𝔼[Δ(𝒀)2]1/2A𝒯D|r^A|𝔼[EA(𝒀)2]1/2=on(1).\operatorname*{\mathbb{E}}[\Delta({\boldsymbol{Y}})^{2}]^{1/2}\leq\sum_{A\in\mathcal{T}_{\leq D}}|\hat{r}_{A}|\cdot\operatorname{\mathbb{E}}\left[\mathscrsfs{E}_{A}({\boldsymbol{Y}})^{2}\right]^{1/2}=o_{n}(1)\,.

Now compute

𝔼[p(𝒀)ψ(θ1)]\displaystyle\operatorname*{\mathbb{E}}[p({\boldsymbol{Y}})\cdot\psi(\theta_{1})] =𝔼[r(𝒀)ψ(θ1)]𝔼[Δ(𝒀)ψ(θ1)]\displaystyle=\operatorname*{\mathbb{E}}[r({\boldsymbol{Y}})\cdot\psi(\theta_{1})]-\operatorname*{\mathbb{E}}[\Delta({\boldsymbol{Y}})\cdot\psi(\theta_{1})]
=𝒄,𝑴1𝒄+on(1),\displaystyle=\langle{\boldsymbol{c}}_{\infty},{\boldsymbol{M}}_{\infty}^{-1}{\boldsymbol{c}}_{\infty}\rangle+o_{n}(1)\,, (51)

where the last step follows from Lemma 4.9 together with the remark that 𝔼[ψ(θ1)2]=O(1)\operatorname*{\mathbb{E}}[\psi(\theta_{1})^{2}]=O(1), so |𝔼[Δ(𝒀)ψ(θ1)]|𝔼[Δ(𝒀)2]1/2𝔼[ψ(θ1)2]1/2=on(1)|\mathbb{E}[\Delta({\boldsymbol{Y}})\cdot\psi(\theta_{1})]|\leq\operatorname*{\mathbb{E}}[\Delta({\boldsymbol{Y}})^{2}]^{1/2}\operatorname*{\mathbb{E}}[\psi(\theta_{1})^{2}]^{1/2}=o_{n}(1).

Similarly, since 𝔼[HA(𝒀)2]=O(1)\operatorname*{\mathbb{E}}[\mathscrsfs{H}_{A}({\boldsymbol{Y}})^{2}]=O(1) by Lemma 4.8, and r^A=O(1)\hat{r}_{A}=O(1), we have

𝔼[r(𝒀)2]1/2A𝒯D|r^A|𝔼[HA(𝒀)2]1/2=O(1),\operatorname*{\mathbb{E}}[r({\boldsymbol{Y}})^{2}]^{1/2}\leq\sum_{A\in\mathcal{T}_{\leq D}}|\hat{r}_{A}|\cdot\operatorname{\mathbb{E}}\left[\mathscrsfs{H}_{A}({\boldsymbol{Y}})^{2}\right]^{1/2}=O(1)\,,

and therefore

𝔼[p(𝒀)2]\displaystyle\operatorname*{\mathbb{E}}[p({\boldsymbol{Y}})^{2}] =𝔼[(r(𝒀)Δ(𝒀))2]\displaystyle=\operatorname*{\mathbb{E}}[(r({\boldsymbol{Y}})-\Delta({\boldsymbol{Y}}))^{2}]
=𝔼[r(𝒀)2]2𝔼[r(𝒀)Δ(𝒀)]+𝔼[Δ(𝒀)2]\displaystyle=\operatorname*{\mathbb{E}}[r({\boldsymbol{Y}})^{2}]-2\operatorname*{\mathbb{E}}[r({\boldsymbol{Y}})\Delta({\boldsymbol{Y}})]+\operatorname*{\mathbb{E}}[\Delta({\boldsymbol{Y}})^{2}]
=𝒄,𝑴1𝒄+on(1).\displaystyle=\langle{\boldsymbol{c}}_{\infty},{\boldsymbol{M}}_{\infty}^{-1}{\boldsymbol{c}}_{\infty}\rangle+o_{n}(1)\,. (52)

Combining Eqs. (51) and (52) we now conclude

limn𝔼[(p(𝒀)ψ(θ1))2]=𝔼[ψ(θ1)2]𝒄,𝑴1𝒄,\lim_{n\to\infty}\operatorname*{\mathbb{E}}[(p({\boldsymbol{Y}})-\psi(\theta_{1}))^{2}]=\operatorname*{\mathbb{E}}[\psi(\theta_{1})^{2}]-\langle{\boldsymbol{c}}_{\infty},{\boldsymbol{M}}_{\infty}^{-1}{\boldsymbol{c}}_{\infty}\rangle\,,

completing the proof. ∎

4.3 From tree-structured polynomials to AMP

In this section we will show that the Bayes-AMP algorithm achieves the same asymptotic accuracy as the best tree-structured constant-degree polynomial p[𝒀]D𝒯p\in\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}^{\mathcal{T}}, matching Eq. (35). This will complete the proof of the lower bound in Theorem 2.3.

The key step is to construct a message passing (MP) algorithm that evaluates any given tree-structured polynomial. We recall the definition of a class of MP algorithms introduced in [BLM15] (we make some simplifications with respect to the original setting). After tt iterations, the state of an algorithm in this class is an array of messages (𝒔ijt)i,j[n]({\boldsymbol{s}}_{i\to j}^{t})_{i,j\in[n]} indexed by ordered pairs in [n][n]. Each message is a vector 𝒔ijt𝖽{\boldsymbol{s}}_{i\to j}^{t}\in{\mathbb{R}}^{{\sf d}}, where 𝖽{\sf d} is a fixed integer (independent of nn and tt). Here, we use the arrow to emphasize ordering, and entries iii\to i are set to 0 by convention. Updates make use of functions Ft:𝖽𝖽F_{t}:{\mathbb{R}}^{{\sf d}}\to{\mathbb{R}}^{{\sf d}} according to

𝒔ijt+1\displaystyle{\boldsymbol{s}}_{i\to j}^{t+1} =1nk[n]{i,j}YikFt(𝒔kit),𝒔ij0=𝟎ij.\displaystyle=\frac{1}{\sqrt{n}}\sum_{k\in[n]\setminus\{i,j\}}Y_{ik}F_{t}({\boldsymbol{s}}^{t}_{k\to i})\,,\;\;\;\;\;\,\;\;{\boldsymbol{s}}_{i\to j}^{0}={\boldsymbol{0}}\;\;\forall i\neq j\,. (53)

We finally define vectors 𝒔it,𝒔^it𝖽{\boldsymbol{s}}_{i}^{t},\hat{\boldsymbol{s}}_{i}^{t}\in{\mathbb{R}}^{{\sf d}} indexed by i[n]i\in[n]:

𝒔it+1\displaystyle{\boldsymbol{s}}_{i}^{t+1} =1nk[n]{i}YikFt(𝒔kit),\displaystyle=\frac{1}{\sqrt{n}}\sum_{k\in[n]\setminus\{i\}}Y_{ik}F_{t}({\boldsymbol{s}}^{t}_{k\to i})\,, (54)
𝒔^it+1\displaystyle\hat{\boldsymbol{s}}_{i}^{t+1} =Ft(𝒔it+1).\displaystyle=F_{t}({\boldsymbol{s}}^{t+1}_{i})\,. (55)
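To make the recursion concrete, here is a minimal NumPy sketch of Eqs. (53)–(55); for simplicity the same update function is used at every iteration (as in the choice Ft=FF_{t}=F^{*} below), and the function F, the dimension d and the array layout are illustrative assumptions rather than part of the formal definition.

import numpy as np

def mp_iterates(Y, F, d, T):
    """Run the recursion (53)-(55) for T >= 1 steps with F_t = F.

    Y : (n, n) symmetric data matrix.
    F : vectorized map R^d -> R^d, applied along the last axis of its input.
    Returns the messages s_{i->j}^T, the node values s_i^T and hat{s}_i^T = F(s_i^T).
    """
    n = Y.shape[0]
    S = np.zeros((n, n, d))                                        # s_{i->j}^0 = 0
    idx = np.arange(n)
    for _ in range(T):
        FS = F(S)                                                  # F(s_{k->i}^t), shape (n, n, d)
        full = np.einsum('ik,kid->id', Y, FS) / np.sqrt(n)         # sum over all k
        self_term = np.einsum('ii,iid->id', Y, FS) / np.sqrt(n)    # the k = i term
        s_node = full - self_term                                  # s_i^{t+1}, Eq. (54)
        # Eq. (53): additionally remove the k = j term from the message i -> j
        S = s_node[:, None, :] - np.einsum('ij,jid->ijd', Y, FS) / np.sqrt(n)
        S[idx, idx, :] = 0.0                                       # keep the convention s_{i->i} = 0
    return S, s_node, F(s_node)                                    # Eq. (55)

In this form each iteration costs O(n^2 d) operations beyond the evaluation of F, since the sum over k is computed once per node and the k∈{i,j} terms are then subtracted.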

We claim that this class of recursions can be used to evaluate the tree-structured polynomials {FT}T𝒯D\{\mathscrsfs{F}_{T}\}_{T\in\mathcal{T}_{\leq D}}, for any fixed DD. To see this, we let 𝖽=|𝒯D|{\sf d}=|\mathcal{T}_{\leq D}| so that we can index the entries of 𝒔ijt{\boldsymbol{s}}_{i\to j}^{t} or 𝒔it,𝒔^it{\boldsymbol{s}}_{i}^{t},\hat{\boldsymbol{s}}_{i}^{t} by elements of 𝒯D\mathcal{T}_{\leq D}. For instance, we will have:

𝒔ijt=(sijt(T(1)),sijt(T(2)),,sijt(T(𝖽))),\displaystyle{\boldsymbol{s}}^{t}_{i\to j}=\big{(}s_{i\to j}^{t}(T_{(1)}),s_{i\to j}^{t}(T_{(2)}),\dots,s_{i\to j}^{t}(T_{({\sf d})})\big{)}\,, (56)

where 𝒯D=(T(1),T(2),,T(𝖽))\mathcal{T}_{\leq D}=(T_{(1)},T_{(2)},\dots,T_{({\sf d})}) is an enumeration of 𝒯D\mathcal{T}_{\leq D}.

Given T𝒯DT\in\mathcal{T}_{\leq D}, we define T+T_{+} to be the graph with vertex set V(T){v+}V(T)\cup\{v_{+}\} (where v+v_{+} is not an element of V(T)V(T)), and edge set E(T){(v0,v+)}E(T)\cup\{(v_{0},v_{+})\}, where v0v_{0} is the root of TT. We set the root of T+T_{+} to be =v+\circ=v_{+}. In words, T+T_{+} is obtained by attaching an edge to the root of TT, and moving the root to the other endpoint of the new edge.

If the root of T𝒯DT\in\mathcal{T}_{\leq D} has degree kk, we let T1,,TkT_{1},\dots,T_{k} denote the connected subgraphs of TT that are obtained by removing the root and the edges incident to the root. Notice that each TjT_{j} is a tree which we root at the unique vertex vjv_{j} such that (,vj)E(T)(\circ,v_{j})\in E(T). We write 𝒟(T)={T1,T2,,Tk}{\mathcal{D}}(T)=\{T_{1},T_{2},\dots,T_{k}\}. We then define the special mapping F:|𝒯D||𝒯D|F^{*}:{\mathbb{R}}^{|\mathcal{T}_{\leq D}|}\to{\mathbb{R}}^{|\mathcal{T}_{\leq D}|}, by letting, for any T𝒯DT\in\mathcal{T}_{\leq D},

F(𝒔)(T):=T𝒟(T)s(T).\displaystyle F^{*}({\boldsymbol{s}})(T):=\prod_{T^{\prime}\in{\mathcal{D}}(T)}s(T^{\prime})\,. (57)
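As an illustration, here is a minimal sketch of the map F^{*} of Eq. (57), under the assumption that the trees in \mathcal{T}_{\leq D} are enumerated by integer ids 0,\dots,{\sf d}-1 and that a (hypothetical) table children[T] lists the ids of the rooted subtrees in {\mathcal{D}}(T):

import numpy as np

def F_star(s, children):
    """Evaluate Eq. (57): F*(s)(T) = product of s(T') over T' in D(T).

    s        : array of shape (..., d), entries indexed by tree ids.
    children : list of lists; children[T] holds the ids of the rooted subtrees
               obtained by deleting the root of T (empty if T is edgeless).
    """
    out = np.ones(s.shape)
    for T, subtrees in enumerate(children):
        for Tp in subtrees:
            out[..., T] = out[..., T] * s[..., Tp]
    return out

Plugging F = lambda S: F_star(S, children) into the sketch of Eqs. (53)–(55) above yields the iteration analyzed in Proposition 4.12.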

In order to clarify the connection between this MP algorithm and tree-structured polynomials, we define the following modification of the tree-structured polynomials of Eq. (32). For each T𝒯DT\in\mathcal{T}_{\leq D}, and each pair of distinct indices i,j[n]i,j\in[n], we define

FT,ij(𝒀)=1n|E(T)|/2ϕ𝗇𝗋(T;ij)(u,v)E(T)Yϕ(u),ϕ(v).\mathscrsfs{F}_{T,i\to j}({\boldsymbol{Y}})=\frac{1}{n^{|E(T)|/2}}\sum_{\phi\in\mathsf{nr}(T;i\to j)}\,\prod_{(u,v)\in E(T)}Y_{\phi(u),\phi(v)}\,. (58)

Here 𝗇𝗋(T;ij)\mathsf{nr}(T;i\to j) is the set of labelings, i.e., maps ϕ:V(T)[n]\phi:V(T)\to[n], that are non-reversing (in the same sense as Definition 4.1) and such that the following hold:

  • ϕ()=i\phi(\circ)=i. (The labeling is rooted at ii, not at 11.)

  • For any vV(T)v\in V(T) such that (,v)E(T)(\circ,v)\in E(T), we have ϕ(v)j\phi(v)\neq j.

Notice also the different normalization with respect to Eq. (32). However, by counting the choices at each vertex we have |𝗇𝗋(T;ij)|=(n2)|E(T)||\mathsf{nr}(T;i\to j)|=(n-2)^{|E(T)|}, |𝗇𝗋(T)|=(n1)(n2)|E(T)|1|\mathsf{nr}(T)|=(n-1)(n-2)^{|E(T)|-1} and therefore

|𝗇𝗋(T;ij)|n|E(T)|=1+on(1),|𝗇𝗋(T)|n|E(T)|=1+on(1).\frac{|\mathsf{nr}(T;i\to j)|}{n^{|E(T)|}}=1+o_{n}(1)\,,\;\;\;\frac{|\mathsf{nr}(T)|}{n^{|E(T)|}}=1+o_{n}(1)\,. (59)

We also define the radius of a rooted graph GG, 𝗋𝖺𝖽(G):=max{𝖽G(,v):vV(G)}{\mathsf{rad}}(G):=\max\{{\sf d}_{G}(\circ,v):\;v\in V(G)\}.

Proposition 4.12.

Let 𝐬ijt{\boldsymbol{s}}^{t}_{i\to j}, 𝐬it{\boldsymbol{s}}_{i}^{t}, 𝐬^it\hat{\boldsymbol{s}}_{i}^{t} be the iterates defined by Eqs. (53), (54), (55) with Ft=FF_{t}=F^{*} given by Eq. (57). Then, for any T𝒯DT\in\mathcal{T}_{\leq D}, and any t>𝗋𝖺𝖽(T)t>{\mathsf{rad}}(T), recalling the definition of T+T_{+} given above, we have the following (here the on(1)o_{n}(1) terms are uniform in 𝐘{\boldsymbol{Y}}):

𝒔ijt(T)\displaystyle{\boldsymbol{s}}^{t}_{i\to j}(T) =FT+,ij(𝒀),\displaystyle=\mathscrsfs{F}_{T_{+},i\to j}({\boldsymbol{Y}})\,,
𝒔1t(T)\displaystyle{\boldsymbol{s}}_{1}^{t}(T) =|𝗇𝗋(T+)|n|E(T+)|/2FT+(𝒀)=(1+on(1))FT+(𝒀),\displaystyle=\frac{|\mathsf{nr}(T_{+})|}{n^{|E(T_{+})|/2}}\mathscrsfs{F}_{T_{+}}({\boldsymbol{Y}})=(1+o_{n}(1))\cdot\mathscrsfs{F}_{T_{+}}({\boldsymbol{Y}})\,,
𝒔^1t(T)\displaystyle\hat{\boldsymbol{s}}_{1}^{t}(T) =|𝗇𝗋(T)|n|E(T)|/2FT(𝒀)=(1+on(1))FT(𝒀).\displaystyle=\frac{|\mathsf{nr}(T)|}{n^{|E(T)|/2}}\mathscrsfs{F}_{T}({\boldsymbol{Y}})=(1+o_{n}(1))\cdot\mathscrsfs{F}_{T}({\boldsymbol{Y}})\,.
Proof.

The proof is straightforward, and amounts to checking that the first claim holds by induction on 𝗋𝖺𝖽(T){\mathsf{rad}}(T). ∎

We next show that, for a broad class of choices of the functions FtF_{t}, the high-dimensional asymptotics of the algorithm defined by Eqs. (53), (54), (55) is determined by a generalization of the state evolution recursion of Theorem 2.1.

Lemma 4.13.

Assume that πΘ\pi_{\Theta} has finite moments of all orders and, for each t0t\geq 0, Ft:𝖽𝖽F_{t}:{\mathbb{R}}^{{\sf d}}\to{\mathbb{R}}^{{\sf d}} is a polynomial independent of nn. Define the sequence of vectors 𝛍t𝖽{\boldsymbol{\mu}}_{t}\in{\mathbb{R}}^{{\sf d}} and positive semidefinite matrices 𝚺t𝖽×𝖽{\boldsymbol{\Sigma}}_{t}\in{\mathbb{R}}^{{\sf d}\times{\sf d}} via the state evolution equations (10), (11).

Then for any polynomial ψ:𝖽×\psi:{\mathbb{R}}^{{\sf d}}\times{\mathbb{R}}\to{\mathbb{R}}, the following limits hold for (Θ,𝐆t)πΘ𝒩(0,𝚺t)(\Theta,{\boldsymbol{G}}_{t})\sim\pi_{\Theta}\otimes{\mathcal{N}}(0,{\boldsymbol{\Sigma}}_{t}):

limn𝔼[ψ(𝒔12t,θ1)]=limn𝔼[ψ(𝒔1t,θ1)]=𝔼ψ(𝝁tΘ+𝑮t,Θ),\displaystyle\lim_{n\to\infty}\operatorname{\mathbb{E}}[\psi({\boldsymbol{s}}^{t}_{1\to 2},\theta_{1})]=\lim_{n\to\infty}\operatorname{\mathbb{E}}[\psi({\boldsymbol{s}}^{t}_{1},\theta_{1})]=\operatorname{\mathbb{E}}\psi\big{(}{\boldsymbol{\mu}}_{t}\Theta+{\boldsymbol{G}}_{t},\Theta\big{)}\,, (60)
limn𝔼[ψ(𝒔^1t,θ1)]=𝔼ψ(Ft(𝝁tΘ+𝑮t),Θ).\displaystyle\lim_{n\to\infty}\operatorname{\mathbb{E}}[\psi(\hat{\boldsymbol{s}}^{t}_{1},\theta_{1})]=\operatorname{\mathbb{E}}\psi\big{(}F_{t}({\boldsymbol{\mu}}_{t}\Theta+{\boldsymbol{G}}_{t}),\Theta\big{)}\,. (61)

The proof of this lemma is based on results from [BLM15] and will be presented in Appendix D.

We are now in position to conclude the proof of the lower bound (22) on the optimal error of Low-Deg estimators in Theorem 2.3. The quantity we want to lower bound takes the form

𝖬𝖲𝖤D:=\displaystyle{\sf MSE}_{\leq D}:= limninf𝜽^𝖫𝖣(D;n)1n𝔼{𝜽^(𝒀)𝜽22}\displaystyle\lim_{n\to\infty}\inf_{\hat{\boldsymbol{\theta}}\in{\sf LD}(D;n)}\frac{1}{n}\operatorname{\mathbb{E}}\big{\{}\|\hat{\boldsymbol{\theta}}({\boldsymbol{Y}})-{\boldsymbol{\theta}}\|_{2}^{2}\big{\}}
=\displaystyle= limninf𝜽^𝖫𝖣(D;n)1ni=1n𝔼{(θ^i(𝒀)θi)2}\displaystyle\lim_{n\to\infty}\inf_{\hat{\boldsymbol{\theta}}\in{\sf LD}(D;n)}\frac{1}{n}\sum_{i=1}^{n}\operatorname{\mathbb{E}}\left\{(\hat{\theta}_{i}({\boldsymbol{Y}})-\theta_{i})^{2}\right\}
=\displaystyle= limninfq[𝒀]D𝔼{(q(𝒀)θ1)2}\displaystyle\lim_{n\to\infty}\inf_{q\in\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}}\operatorname{\mathbb{E}}\left\{(q({\boldsymbol{Y}})-\theta_{1})^{2}\right\}
which by Proposition 4.3 takes the form
=\displaystyle= limn𝔼{(θ1T𝒯Dp^TFT(𝒀))2},\displaystyle\lim_{n\to\infty}\operatorname{\mathbb{E}}\Big{\{}\Big{(}\theta_{1}-\sum_{T\in\mathcal{T}_{\leq D}}\hat{p}_{T}\mathscrsfs{F}_{T}({\boldsymbol{Y}})\Big{)}^{2}\Big{\}}\,,

for some nn-independent choice of coefficients p^T\hat{p}_{T}. On the other hand, by Proposition 4.12, letting 𝒔^1t\hat{\boldsymbol{s}}_{1}^{t} be the iterates defined by Eqs. (53), (54), (55), with Ft=FF_{t}=F^{*} given by Eq. (57), we have, for any t>Dt>D,

limn𝔼[θ1FT(𝒀)]\displaystyle\lim_{n\to\infty}\operatorname{\mathbb{E}}\big{[}\theta_{1}\cdot\mathscrsfs{F}_{T}({\boldsymbol{Y}})\big{]} =limn𝔼[θ1s^1t(T)]\displaystyle=\lim_{n\to\infty}\operatorname{\mathbb{E}}\big{[}\theta_{1}\cdot\hat{s}^{t}_{1}(T)\big{]} (62)
=(a)𝔼[ΘF(𝝁tΘ+𝑮t)(T)]=𝝁t+1(T),\displaystyle\stackrel{{\scriptstyle(a)}}{{=}}\operatorname{\mathbb{E}}[\Theta F^{*}({\boldsymbol{\mu}}_{t}\Theta+{\boldsymbol{G}}_{t})(T)]={\boldsymbol{\mu}}_{t+1}(T)\,, (63)

where in step (a)(a) we used Lemma 4.13. (Here 𝝁t{\boldsymbol{\mu}}_{t}, 𝚺t{\boldsymbol{\Sigma}}_{t} are defined recursively by Eqs. (10), (11). Recall that 𝝁t|𝒯D|{\boldsymbol{\mu}}_{t}\in{\mathbb{R}}^{|\mathcal{T}_{\leq D}|} can be indexed by elements TT of 𝒯D\mathcal{T}_{\leq D}.)

Analogously, we obtain

limn(𝔼[FT1(𝒀)FT2(𝒀)])T1,T2𝒯D\displaystyle\lim_{n\to\infty}\Big{(}\operatorname{\mathbb{E}}\big{[}\mathscrsfs{F}_{T_{1}}({\boldsymbol{Y}})\mathscrsfs{F}_{T_{2}}({\boldsymbol{Y}})\big{]}\Big{)}_{T_{1},T_{2}\in\mathcal{T}_{\leq D}} =𝔼[F(𝝁tΘ+𝑮t)F(𝝁tΘ+𝑮t)𝖳]=𝚺t+1.\displaystyle=\operatorname{\mathbb{E}}\big{[}F^{*}({\boldsymbol{\mu}}_{t}\Theta+{\boldsymbol{G}}_{t})F^{*}({\boldsymbol{\mu}}_{t}\Theta+{\boldsymbol{G}}_{t})^{{\sf T}}\big{]}={\boldsymbol{\Sigma}}_{t+1}\,. (64)

Hence we obtain that the optimal error achieved by Low-Deg estimators is given by

𝖬𝖲𝖤D\displaystyle{\sf MSE}_{\leq D} =limn𝔼{(θ1T𝒯Dp^TFT(𝒀))2}\displaystyle=\lim_{n\to\infty}\operatorname{\mathbb{E}}\Big{\{}\Big{(}\theta_{1}-\sum_{T\in\mathcal{T}_{\leq D}}\hat{p}_{T}\mathscrsfs{F}_{T}({\boldsymbol{Y}})\Big{)}^{2}\Big{\}} (65)
=𝔼[Θ2]2𝒑^,𝝁t+1+𝒑^,𝚺t+1𝒑^.\displaystyle=\operatorname{\mathbb{E}}[\Theta^{2}]-2\langle\hat{\boldsymbol{p}},{\boldsymbol{\mu}}_{t+1}\rangle+\langle\hat{\boldsymbol{p}},{\boldsymbol{\Sigma}}_{t+1}\hat{\boldsymbol{p}}\rangle\,. (66)

However, by Theorem 2.1 there exists an AMP algorithm of the form (9) that achieves exactly the same error. Simply take 𝖽=|𝒯D|{\sf d}=|\mathcal{T}_{\leq D}|, Ft=FF_{t}=F^{*} as defined in Eq. (57), and gt(𝒙t)=𝒑^,F(𝒙t)g_{t}({\boldsymbol{x}}^{t})=\langle\hat{\boldsymbol{p}},F^{*}({\boldsymbol{x}}^{t})\rangle. The desired lower bound follows by applying the optimality result of Theorem 2.2.
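For concreteness, the following Monte Carlo sketch mirrors the computation just carried out: it iterates the recursion of Eqs. (63)–(64) and then evaluates the quadratic (66). The sampler for \pi_{\Theta}, the map F (e.g. the F^{*} sketch above) and the Monte Carlo sample size are illustrative assumptions; when {\boldsymbol{\Sigma}}_{t+1} is invertible, the quadratic in (66) is minimized at \hat{\boldsymbol{p}}={\boldsymbol{\Sigma}}_{t+1}^{-1}{\boldsymbol{\mu}}_{t+1}, which is the choice made here.

import numpy as np

def state_evolution_mse(sample_theta, F, d, t_max, n_mc=200_000, rng=None):
    """Monte Carlo sketch of mu_{t+1}, Sigma_{t+1} as in Eqs. (63)-(64),
    followed by the quadratic in Eq. (66) at its minimizing coefficient vector.

    sample_theta : function m -> array of m i.i.d. draws from pi_Theta.
    F            : vectorized map R^d -> R^d, applied along the last axis.
    """
    rng = np.random.default_rng() if rng is None else rng
    mu, Sigma = np.zeros(d), np.zeros((d, d))
    for _ in range(t_max):
        theta = sample_theta(n_mc)                                   # Theta ~ pi_Theta
        G = rng.multivariate_normal(np.zeros(d), Sigma, size=n_mc)   # G_t ~ N(0, Sigma_t)
        V = F(mu[None, :] * theta[:, None] + G)                      # F(mu_t Theta + G_t)
        mu = (theta[:, None] * V).mean(axis=0)                       # Eq. (63)
        Sigma = (V[:, :, None] * V[:, None, :]).mean(axis=0)         # Eq. (64)
    # Eq. (66): E[Theta^2] - 2 <p, mu> + <p, Sigma p>, evaluated at p = Sigma^+ mu
    p_hat = np.linalg.lstsq(Sigma, mu, rcond=None)[0]
    second_moment = (sample_theta(n_mc) ** 2).mean()
    return second_moment - 2 * p_hat @ mu + p_hat @ Sigma @ p_hat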

Acknowledgments

This project was initiated when the authors were visiting the Simons Institute for the Theory of Computing during the program on Computational Complexity of Statistical Inference in Fall 2021: we are grateful to the Simons Institute for its support.

AM was supported by the NSF through award DMS-2031883, the Simons Foundation through Award 814639 for the Collaboration on the Theoretical Foundations of Deep Learning, the NSF grant CCF-2006489 and the ONR grant N00014-18-1-2729, and a grant from Eric and Wendy Schmidt at the Institute for Advanced Studies. Part of this work was carried out while Andrea Montanari was on partial leave from Stanford and a Chief Scientist at Ndata Inc dba Project N. The present research is unrelated to AM’s activity while on leave.

Part of this work was done while ASW was with the Simons Institute for the Theory of Computing, supported by a Simons-Berkeley Research Fellowship. Part of this work was done while ASW was with the Algorithms and Randomness Center at Georgia Tech, supported by NSF grants CCF-2007443 and CCF-2106444.

Appendix A Proof of Theorem 1.1

Let t0:=𝔼[Θ]t_{0}:=\operatorname{\mathbb{E}}[\Theta]. We assume, without loss of generality, t0>0t_{0}>0. Let πt\pi_{t} be the version of πΘ\pi_{\Theta} centered at t0t\geq 0, namely the law of Θt0+t\Theta-t_{0}+t when ΘπΘ\Theta\sim\pi_{\Theta} (in particular πΘ=πt0\pi_{\Theta}=\pi_{t_{0}}). We finally let 𝜽t{\boldsymbol{\theta}}_{t} be a vector with i.i.d. coordinates with distribution πt\pi_{t}, and (for 𝒁𝖦𝖮𝖤(n){\boldsymbol{Z}}\sim{\sf GOE}(n))

𝒀t=1n𝜽t𝜽t𝖳+𝒁.\displaystyle{\boldsymbol{Y}}_{t}=\frac{1}{\sqrt{n}}{\boldsymbol{\theta}}_{t}{\boldsymbol{\theta}}_{t}^{{\sf T}}+{\boldsymbol{Z}}\,. (67)

We will take 𝜽t=𝜽0+t𝟏{\boldsymbol{\theta}}_{t}={\boldsymbol{\theta}}_{0}+t{\boldsymbol{1}}. The normalized mutual information between 𝒀t{\boldsymbol{Y}}_{t} and 𝜽t{\boldsymbol{\theta}}_{t} is given, after simple manipulations, by

ϕn(t):=\displaystyle\phi_{n}(t):= 1nI(𝒀t;𝜽t)\displaystyle\frac{1}{n}I({\boldsymbol{Y}}_{t};{\boldsymbol{\theta}}_{t})
=\displaystyle= 1n𝔼𝜽t,𝒁log{exp(14n𝜽¯t𝜽¯t𝖳𝜽t𝜽t𝖳F2+12n1/2𝜽¯t,𝒁𝜽¯t)πtn(d𝜽¯t)}.\displaystyle-\frac{1}{n}\operatorname{\mathbb{E}}_{{\boldsymbol{\theta}}_{t},{\boldsymbol{Z}}}\log\left\{\int\exp\Big{(}-\frac{1}{4n}\big{\|}\bar{\boldsymbol{\theta}}_{t}\bar{\boldsymbol{\theta}}_{t}^{{\sf T}}-{\boldsymbol{\theta}}_{t}{\boldsymbol{\theta}}_{t}^{{\sf T}}\big{\|}^{2}_{F}+\frac{1}{2n^{1/2}}\langle\bar{\boldsymbol{\theta}}_{t},{\boldsymbol{Z}}\bar{\boldsymbol{\theta}}_{t}\rangle\Big{)}\,\pi_{t}^{\otimes n}({\rm d}\bar{\boldsymbol{\theta}}_{t})\right\}\,.

By writing 𝜽t=𝜽0+t𝟏{\boldsymbol{\theta}}_{t}={\boldsymbol{\theta}}_{0}+t{\boldsymbol{1}}, 𝜽¯t=𝜽¯0+t𝟏\bar{\boldsymbol{\theta}}_{t}=\bar{\boldsymbol{\theta}}_{0}+t{\boldsymbol{1}}, where 𝜽0π0n{\boldsymbol{\theta}}_{0}\sim\pi_{0}^{\otimes n}, we get

ϕn(t)\displaystyle\phi_{n}(t) :=1n𝔼𝜽0,𝒁log{en(𝜽¯0,𝜽0;t)π0n(d𝜽¯0)},\displaystyle:=-\frac{1}{n}\operatorname{\mathbb{E}}_{{\boldsymbol{\theta}}_{0},{\boldsymbol{Z}}}\log\left\{e^{-{\cal H}_{n}(\bar{\boldsymbol{\theta}}_{0},{\boldsymbol{\theta}}_{0};t)}\pi_{0}^{\otimes n}({\rm d}\bar{\boldsymbol{\theta}}_{0})\right\}\,, (68)

where

nt(𝜽¯0,𝜽0;t)=1n{𝜽¯t𝜽t2𝜽¯t,𝟏𝜽¯t𝜽t,𝜽¯t𝜽¯t𝜽t 1}1n1/2𝒁,𝟏𝜽¯t𝖳.\displaystyle\frac{\partial{\cal H}_{n}}{\partial t}(\bar{\boldsymbol{\theta}}_{0},{\boldsymbol{\theta}}_{0};t)=\frac{1}{n}\Big{\{}\|\bar{\boldsymbol{\theta}}_{t}-{\boldsymbol{\theta}}_{t}\|^{2}\langle\bar{\boldsymbol{\theta}}_{t},{\boldsymbol{1}}\rangle-\langle\bar{\boldsymbol{\theta}}_{t}-{\boldsymbol{\theta}}_{t},\bar{\boldsymbol{\theta}}_{t}\rangle\langle\bar{\boldsymbol{\theta}}_{t}-{\boldsymbol{\theta}}_{t}\,{\boldsymbol{1}}\rangle\Big{\}}-\frac{1}{n^{1/2}}\langle{\boldsymbol{Z}},{\boldsymbol{1}}\bar{\boldsymbol{\theta}}_{t}^{{\sf T}}\rangle\,.

Let μ𝒀t(d𝜽¯t)\mu_{{\boldsymbol{Y}}_{t}}({\rm d}\bar{\boldsymbol{\theta}}_{t}) be the Bayes posterior over 𝜽¯t\bar{\boldsymbol{\theta}}_{t}:

μ𝒀t(d𝜽¯t)exp(14𝒀t1n1/2𝜽¯t𝜽¯t𝖳F2)πtn(d𝜽¯t).\displaystyle\mu_{{\boldsymbol{Y}}_{t}}({\rm d}\bar{\boldsymbol{\theta}}_{t})\propto\exp\Big{(}-\frac{1}{4}\Big{\|}{\boldsymbol{Y}}_{t}-\frac{1}{n^{1/2}}\bar{\boldsymbol{\theta}}_{t}\bar{\boldsymbol{\theta}}_{t}^{{\sf T}}\Big{\|}_{F}^{2}\Big{)}\,\pi^{\otimes n}_{t}({\rm d}\bar{\boldsymbol{\theta}}_{t})\,. (69)

The above implies

ϕn(t)\displaystyle\phi^{\prime}_{n}(t) =An(t)+Bn(t),\displaystyle=A_{n}(t)+B_{n}(t)\,, (70)
An(t)\displaystyle A_{n}(t) :=1n2𝔼𝜽t,𝒁μ𝒀t(𝜽¯t𝜽t2𝜽¯t,𝟏𝜽¯t𝜽t,𝜽¯t𝜽¯t𝜽t 1),\displaystyle:=\frac{1}{n^{2}}\operatorname{\mathbb{E}}_{{\boldsymbol{\theta}}_{t},{\boldsymbol{Z}}}\mu_{{\boldsymbol{Y}}_{t}}\Big{(}\|\bar{\boldsymbol{\theta}}_{t}-{\boldsymbol{\theta}}_{t}\|^{2}\langle\bar{\boldsymbol{\theta}}_{t},{\boldsymbol{1}}\rangle-\langle\bar{\boldsymbol{\theta}}_{t}-{\boldsymbol{\theta}}_{t},\bar{\boldsymbol{\theta}}_{t}\rangle\langle\bar{\boldsymbol{\theta}}_{t}-{\boldsymbol{\theta}}_{t}\,{\boldsymbol{1}}\rangle\Big{)}\,, (71)
Bn(t)\displaystyle B_{n}(t) :=1n3/2𝔼𝜽t,𝒁𝒁,μ𝒀t(𝟏𝜽¯t𝖳).\displaystyle:=-\frac{1}{n^{3/2}}\operatorname{\mathbb{E}}_{{\boldsymbol{\theta}}_{t},{\boldsymbol{Z}}}\langle{\boldsymbol{Z}},\mu_{{\boldsymbol{Y}}_{t}}({\boldsymbol{1}}\bar{\boldsymbol{\theta}}_{t}^{{\sf T}})\rangle\,. (72)

Integrating 𝒁{\boldsymbol{Z}} by parts, we get

Bn(t)\displaystyle B_{n}(t) :=1n2𝔼𝜽t,𝒁{μ𝒀t(𝟏,𝜽¯t𝜽¯t2)μ𝒀t(𝜽¯t),μ𝒀t(𝟏,𝜽¯t𝜽¯t)}.\displaystyle:=-\frac{1}{n^{2}}\operatorname{\mathbb{E}}_{{\boldsymbol{\theta}}_{t},{\boldsymbol{Z}}}\Big{\{}\mu_{{\boldsymbol{Y}}_{t}}\big{(}\langle{\boldsymbol{1}},\bar{\boldsymbol{\theta}}_{t}\rangle\|\bar{\boldsymbol{\theta}}_{t}\|^{2}\big{)}-\langle\mu_{{\boldsymbol{Y}}_{t}}(\bar{\boldsymbol{\theta}}_{t}),\mu_{{\boldsymbol{Y}}_{t}}(\langle{\boldsymbol{1}},\bar{\boldsymbol{\theta}}_{t}\rangle\bar{\boldsymbol{\theta}}_{t})\rangle\,\Big{\}}\,. (73)

We finally note that the joint distribution of 𝜽t,𝜽¯t{\boldsymbol{\theta}}_{t},\bar{\boldsymbol{\theta}}_{t} is characterized by the identity 𝔼F(𝜽t,𝜽¯t)=𝔼μ𝒀t2(F(𝜽t,𝜽¯t)){\mathbb{E}}F({\boldsymbol{\theta}}_{t},\bar{\boldsymbol{\theta}}_{t})=\operatorname{\mathbb{E}}\mu_{{\boldsymbol{Y}}_{t}}^{\otimes 2}\big{(}F({\boldsymbol{\theta}}_{t},\bar{\boldsymbol{\theta}}_{t})\big{)}, valid for any test function FF. In particular, the pair 𝜽t,𝜽¯t{\boldsymbol{\theta}}_{t},\bar{\boldsymbol{\theta}}_{t} is exchangeable, with marginals 𝜽tπtn{\boldsymbol{\theta}}_{t}\sim\pi_{t}^{\otimes n}, 𝜽¯tπtn\bar{\boldsymbol{\theta}}_{t}\sim\pi_{t}^{\otimes n}. Writing the above expectations in terms of this measure, we get

An(t)\displaystyle A_{n}(t) =1n2𝔼(𝜽¯t𝜽t2𝜽¯t,𝟏𝜽¯t𝜽t,𝜽¯t𝜽¯t𝜽t,𝟏),\displaystyle=\frac{1}{n^{2}}{\mathbb{E}}\Big{(}\|\bar{\boldsymbol{\theta}}_{t}-{\boldsymbol{\theta}}_{t}\|^{2}\langle\bar{\boldsymbol{\theta}}_{t},{\boldsymbol{1}}\rangle-\langle\bar{\boldsymbol{\theta}}_{t}-{\boldsymbol{\theta}}_{t},\bar{\boldsymbol{\theta}}_{t}\rangle\langle\bar{\boldsymbol{\theta}}_{t}-{\boldsymbol{\theta}}_{t},{\boldsymbol{1}}\rangle\Big{)}\,, (74)
Bn(t)\displaystyle B_{n}(t) =1n2𝔼(𝟏,𝜽¯t𝜽¯t2𝟏,𝜽¯t𝜽t,𝜽¯t).\displaystyle=-\frac{1}{n^{2}}{\mathbb{E}}\Big{(}\langle{\boldsymbol{1}},\bar{\boldsymbol{\theta}}_{t}\rangle\|\bar{\boldsymbol{\theta}}_{t}\|^{2}-\langle{\boldsymbol{1}},\bar{\boldsymbol{\theta}}_{t}\rangle\langle{\boldsymbol{\theta}}_{t},\bar{\boldsymbol{\theta}}_{t}\rangle\,\Big{)}\,. (75)

By concentration inequalities for sums of independent random variables, there exist constants CkC_{k} such that

𝔼(𝜽t22k)Cknk,\displaystyle{\mathbb{E}}\big{(}\|{\boldsymbol{\theta}}_{t}\|_{2}^{2k}\big{)}\leq C_{k}n^{k}\,, (76)
𝔼(|𝜽t,𝟏nt|k)Cknk/2,\displaystyle{\mathbb{E}}\big{(}|\langle{\boldsymbol{\theta}}_{t},{\boldsymbol{1}}\rangle-nt|^{k}\big{)}\leq C_{k}n^{k/2}\,, (77)
𝔼(|𝜽t22𝔼[𝜽t22]|k)Cknk.\displaystyle{\mathbb{E}}\big{(}|\|{\boldsymbol{\theta}}_{t}\|_{2}^{2}-{\mathbb{E}}[\|{\boldsymbol{\theta}}_{t}\|_{2}^{2}]|^{k}\big{)}\leq C_{k}n^{k}\,. (78)

The same inequalities obviously hold for 𝜽¯t\bar{\boldsymbol{\theta}}_{t}.

Using these bounds in Eqs. (74), (75), we get

An(t)\displaystyle A_{n}(t) =tn𝔼{𝜽¯t𝜽t2}+O(n1/2),\displaystyle=\frac{t}{n}{\mathbb{E}}\big{\{}\|\bar{\boldsymbol{\theta}}_{t}-{\boldsymbol{\theta}}_{t}\|^{2}\big{\}}+O(n^{-1/2})\,, (79)
Bn(t)\displaystyle B_{n}(t) =tn𝔼{𝜽t22}+tn2𝔼{𝜽^t(𝒀t)22}+O(n1/2).\displaystyle=-\frac{t}{n}{\mathbb{E}}\{\|{\boldsymbol{\theta}}_{t}\|^{2}_{2}\}+\frac{t}{n^{2}}\operatorname{\mathbb{E}}\big{\{}\big{\|}\hat{\boldsymbol{\theta}}_{t}({\boldsymbol{Y}}_{t})\big{\|}^{2}_{2}\big{\}}+O(n^{-1/2})\,. (80)

Summing these, we finally obtain

ϕn(t)\displaystyle\phi^{\prime}_{n}(t) =tn𝔼{𝜽t𝜽^t(𝒀t)22}+O(n1/2)\displaystyle=\frac{t}{n}\operatorname{\mathbb{E}}\big{\{}\big{\|}{\boldsymbol{\theta}}_{t}-\hat{\boldsymbol{\theta}}_{t}({\boldsymbol{Y}}_{t})\big{\|}_{2}^{2}\big{\}}+O(n^{-1/2}) (81)
=:t𝖬𝖲𝖤(t;n)+O(n1/2),\displaystyle=:t\,{\sf MSE}(t;n)+O(n^{-1/2})\,,

where we note that the O(n1/2)O(n^{-1/2}) term is uniform in tt.

Now recalling the definition of Eq. (2), we have, by [LM19, Theorem 1],

ϕ(t):=limnϕn(t)=infq0Ψ(q;0,πt).\phi_{\infty}(t):=\lim_{n\to\infty}\phi_{n}(t)=\inf_{q\geq 0}\Psi(q;0,\pi_{t})\,. (82)

Note that

Ψ(q;0,πt)=Ψ(q;t2,π0)+12t2Var(Θ)+14t4.\displaystyle\Psi(q;0,\pi_{t})=\Psi(q;t^{2},\pi_{0})+\frac{1}{2}t^{2}\mathop{\mathrm{Var}}(\Theta)+\frac{1}{4}t^{4}\,. (83)

Differentiating with respect to tt, we deduce that

  1. 1.

    ϕ(t)\phi_{\infty}(t) is differentiable at tt if and only if bΨ(q;b,πt)b\mapsto\Psi(q;b,\pi_{t}) is differentiable at b=0b=0. Both are in turn equivalent to qΨ(q;0,πt)q\mapsto\Psi(q;0,\pi_{t}) being uniquely minimized at qBayes(πΘ)q_{\mbox{\tiny\rm Bayes}}(\pi_{\Theta}).

  2. 2.

    If the last point holds, then (with Θt=Θt0+tπt\Theta_{t}=\Theta-t_{0}+t\sim\pi_{t})

    ϕ(t)=t(𝔼[Θt2]qBayes(πΘ)).\displaystyle\phi_{\infty}^{\prime}(t)=t\cdot\big{(}\operatorname{\mathbb{E}}[\Theta^{2}_{t}]-q_{\mbox{\tiny\rm Bayes}}(\pi_{\Theta})\big{)}\,. (84)

Comparing with Eq. (81), we are left with the task of proving that limnϕn(t0)=ϕ(t0)\lim_{n\to\infty}\phi_{n}^{\prime}(t_{0})=\phi_{\infty}^{\prime}(t_{0}) when ϕ\phi_{\infty} is differentiable at t0t_{0}.

Taking the second derivative of Eq. (68) yields

ϕn′′(t)\displaystyle\phi^{\prime\prime}_{n}(t) =1n𝔼𝜽0,𝒁Varμ𝒀t(tn(𝜽¯0,𝜽0;t))+1n𝔼𝜽0,𝒁μ𝒀t(t2n(𝜽¯0,𝜽0;t))\displaystyle=-\frac{1}{n}\operatorname{\mathbb{E}}_{{\boldsymbol{\theta}}_{0},{\boldsymbol{Z}}}\mathop{\mathrm{Var}}_{\mu_{{\boldsymbol{Y}}_{t}}}\big{(}\partial_{t}{\cal H}_{n}(\bar{\boldsymbol{\theta}}_{0},{\boldsymbol{\theta}}_{0};t)\big{)}+\frac{1}{n}\operatorname{\mathbb{E}}_{{\boldsymbol{\theta}}_{0},{\boldsymbol{Z}}}\mu_{{\boldsymbol{Y}}_{t}}\big{(}\partial^{2}_{t}{\cal H}_{n}(\bar{\boldsymbol{\theta}}_{0},{\boldsymbol{\theta}}_{0};t)\big{)}
=:Dn(t)+En(t),\displaystyle=:-D_{n}(t)+E_{n}(t)\,,

where Dn(t)0D_{n}(t)\geq 0 by construction. A direct calculation yields

2nt2(𝜽¯0,𝜽0;t)=12𝜽¯t𝜽t2+12n𝜽¯t𝜽t,𝟏212n1/2𝟏,𝒁𝟏.\displaystyle\frac{\partial^{2}{\cal H}_{n}}{\partial t^{2}}(\bar{\boldsymbol{\theta}}_{0},{\boldsymbol{\theta}}_{0};t)=\frac{1}{2}\|\bar{\boldsymbol{\theta}}_{t}-{\boldsymbol{\theta}}_{t}\|^{2}+\frac{1}{2n}\langle\bar{\boldsymbol{\theta}}_{t}-{\boldsymbol{\theta}}_{t},{\boldsymbol{1}}\rangle^{2}-\frac{1}{2n^{1/2}}\langle{\boldsymbol{1}},{\boldsymbol{Z}}{\boldsymbol{1}}\rangle\,. (85)

Hence proceeding as for the first derivative, we get

En(t)\displaystyle E_{n}(t) =12n𝔼𝜽0,𝒁μ𝒀t(𝜽¯t𝜽t2)+12n𝔼𝜽0,𝒁μ𝒀t(𝜽¯t𝜽t,𝟏2)\displaystyle=\frac{1}{2n}\operatorname{\mathbb{E}}_{{\boldsymbol{\theta}}_{0},{\boldsymbol{Z}}}\mu_{{\boldsymbol{Y}}_{t}}\big{(}\|\bar{\boldsymbol{\theta}}_{t}-{\boldsymbol{\theta}}_{t}\|^{2}\big{)}+\frac{1}{2n}\operatorname{\mathbb{E}}_{{\boldsymbol{\theta}}_{0},{\boldsymbol{Z}}}\mu_{{\boldsymbol{Y}}_{t}}\big{(}\langle\bar{\boldsymbol{\theta}}_{t}-{\boldsymbol{\theta}}_{t},{\boldsymbol{1}}\rangle^{2}\big{)} (86)
=𝖬𝖲𝖤(t;n)+O(n1/2),\displaystyle={\sf MSE}(t;n)+O(n^{-1/2})\,, (87)

where, once more, the O(n1/2)O(n^{-1/2}) term is uniform in tt.

Consider now the function fn(t)=t1/2ϕn(t)f_{n}(t)=-t^{-1/2}\,\phi_{n}(t), defined on an interval [t1,t2][t_{1},t_{2}], with 0<t1<t0<t20<t_{1}<t_{0}<t_{2} (with n{}n\in{\mathbb{N}}\cup\{\infty\}). We conclude that

  1. 1.

    By Eq. (82), limnfn(t)=f(t)\lim_{n\to\infty}f_{n}(t)=f_{\infty}(t) for any t[t1,t2]t\in[t_{1},t_{2}]. (In fact the convergence is uniform since ϕn(t)\phi^{\prime}_{n}(t) is uniformly bounded by Eq. (81), but we will not use this fact.)

  2. 2.

    We have fn(t)=ϕn(t)/(2t3/2)ϕn(t)/t1/2f^{\prime}_{n}(t)=\phi_{n}(t)/(2t^{3/2})-\phi^{\prime}_{n}(t)/t^{1/2}. Hence, by the previous point, limnfn(t)=f(t)\lim_{n\to\infty}f^{\prime}_{n}(t)=f^{\prime}_{\infty}(t) at a point tt if and only if limnϕn(t)=ϕ(t)\lim_{n\to\infty}\phi^{\prime}_{n}(t)=\phi^{\prime}_{\infty}(t).

  3. 3.

    We have fn′′(t)=3ϕn(t)/(4t5/2)+ϕn(t)/t3/2ϕn′′(t)/t1/2f^{\prime\prime}_{n}(t)=-3\phi_{n}(t)/(4t^{5/2})+\phi^{\prime}_{n}(t)/t^{3/2}-\phi^{\prime\prime}_{n}(t)/t^{1/2}. Substituting Eqs. (81) and (87), together with the decomposition ϕn′′=Dn+En\phi^{\prime\prime}_{n}=-D_{n}+E_{n} above, we obtain

    fn′′(t)=34t5/2ϕn(t)+1t1/2Dn(t)+O(n1/2).\displaystyle f^{\prime\prime}_{n}(t)=-\frac{3}{4t^{5/2}}\phi_{n}(t)+\frac{1}{t^{1/2}}D_{n}(t)+O(n^{-1/2})\,. (88)

    In particular, the first term is uniformly bounded (by boundedness of the normalized mutual information on [t1,t2][t_{1},t_{2}]), the second is non-negative, and the O(n1/2)O(n^{-1/2}) remainder is uniform in tt.

Points 1 and 3 imply (by the lemma below) that fn(t)f(t)f^{\prime}_{n}(t)\to f_{\infty}^{\prime}(t) at any point of differentiability tt, and hence by point 2, we conclude ϕn(t)ϕ(t)\phi^{\prime}_{n}(t)\to\phi_{\infty}^{\prime}(t). The proof is then completed by the following elementary analysis fact.

Lemma A.1.

Let {fn:n1}\{f_{n}:n\geq 1\}, fn:[t1,t2]f_{n}:[t_{1},t_{2}]\to{\mathbb{R}} be a sequence of differentiable functions converging pointwise to ff_{\infty}. Assume fn=gn+hnf_{n}=g_{n}+h_{n} where gn,hng_{n},h_{n} are differentiable, gng_{n} is convex and the {hn}\{h_{n}^{\prime}\} are equicontinuous on [t1,t2][t_{1},t_{2}] (i.e. there exists a non-decreasing function δ(ε)0\delta(\varepsilon)\downarrow 0 such that |tt|ε|t-t^{\prime}|\leq\varepsilon implies |hn(t)hn(t)|δ(ε)|h_{n}^{\prime}(t)-h^{\prime}_{n}(t^{\prime})|\leq\delta(\varepsilon)). If ff_{\infty} is differentiable at t0t_{0}, then limnfn(t0)=f(t0)\lim_{n\to\infty}f^{\prime}_{n}(t_{0})=f^{\prime}_{\infty}(t_{0}).

Proof.

By convexity we have, for ε>0\varepsilon>0

1ε[gn(t0+ε)gn(t0)]gn(t0).\displaystyle\frac{1}{\varepsilon}\big{[}g_{n}(t_{0}+\varepsilon)-g_{n}(t_{0})\big{]}\geq g^{\prime}_{n}(t_{0})\,. (89)

Further, by the mean value theorem and equicontinuity of hnh^{\prime}_{n},

1ε[hn(t0+ε)hn(t0)]hn(tn(ε))hn(t0)δ(ε).\displaystyle\frac{1}{\varepsilon}\big{[}h_{n}(t_{0}+\varepsilon)-h_{n}(t_{0})\big{]}\geq h^{\prime}_{n}(t_{n}(\varepsilon))\geq h^{\prime}_{n}(t_{0})-\delta(\varepsilon)\,. (90)

Summing the last two displays, we get

fn(t0)1ε[fn(t0+ε)fn(t0)]+δ(ε).\displaystyle f_{n}^{\prime}(t_{0})\leq\frac{1}{\varepsilon}\big{[}f_{n}(t_{0}+\varepsilon)-f_{n}(t_{0})\big{]}+\delta(\varepsilon)\,. (91)

Taking the limit nn\to\infty and using the convergence of fnf_{n}, we get

lim supnfn(t0)1ε[f(t0+ε)f(t0)]+δ(ε).\displaystyle\limsup_{n\to\infty}f_{n}^{\prime}(t_{0})\leq\frac{1}{\varepsilon}\big{[}f_{\infty}(t_{0}+\varepsilon)-f_{\infty}(t_{0})\big{]}+\delta(\varepsilon)\,. (92)

Finally taking the limit ε0\varepsilon\downarrow 0, and using the differentiability of ff_{\infty} at t0t_{0}:

lim supnfn(t0)f(t0).\displaystyle\limsup_{n\to\infty}f_{n}^{\prime}(t_{0})\leq f^{\prime}_{\infty}(t_{0})\,. (93)

The matching lower bound \liminf_{n\to\infty}f^{\prime}_{n}(t_{0})\geq f^{\prime}_{\infty}(t_{0}) follows by applying the same argument to the left increments \big{[}f_{n}(t_{0})-f_{n}(t_{0}-\varepsilon)\big{]}/\varepsilon, and this completes the proof. ∎

Appendix B Lemma for the upper bound in Theorem 2.3

Lemma B.1.

Let fBayes(x;πΘ,q):=𝔼[Θ|qΘ+qG=x]f^{\mbox{\tiny\rm Bayes}}(x;\pi_{\Theta},q):=\operatorname{\mathbb{E}}[\Theta|q\Theta+\sqrt{q}G=x] for (Θ,G)πΘ𝒩(0,1)(\Theta,G)\sim\pi_{\Theta}\otimes{\mathcal{N}}(0,1). Then, for any q>0q>0 there exists a constant C=C(q,πΘ)C=C(q,\pi_{\Theta}) such that, for all xx

|fBayes(x;πΘ,q)|C(1+|x|).\displaystyle|f^{\mbox{\tiny\rm Bayes}}(x;\pi_{\Theta},q)|\leq C(1+|x|)\,. (94)
Proof.

If πΘa\pi^{a}_{\Theta} is the law of Θ+a\Theta+a, then we have fBayes(x;πΘ,q)=fBayes(x+qa;πΘa,q)−af^{\mbox{\tiny\rm Bayes}}(x;\pi_{\Theta},q)=f^{\mbox{\tiny\rm Bayes}}(x+qa;\pi^{a}_{\Theta},q)-a. Hence, if the claim holds for πΘa\pi^{a}_{\Theta}, it holds for πΘ\pi_{\Theta} as well. We can therefore assume, without loss of generality, that 𝔼[Θ]=0\operatorname*{\mathbb{E}}[\Theta]=0.

Note that fBayes(x;πΘ,q)=fBayes(x;πΘ,q)f^{\mbox{\tiny\rm Bayes}}(-x;\pi_{\Theta},q)=-f^{\mbox{\tiny\rm Bayes}}(x;\pi_{-\Theta},q) where πΘ\pi_{-\Theta} is the law of Θ-\Theta. Therefore, without loss of generality we can assume x>0x>0. A simple calculation (‘Tweedie’s formula’) yields

fBayes(x;πΘ,q)\displaystyle f^{\mbox{\tiny\rm Bayes}}(x;\pi_{\Theta},q) =ddxϕ(x;πΘ,q),\displaystyle=\frac{{\rm d}\phantom{x}}{{\rm d}x}\phi(x;\pi_{\Theta},q)\,, (95)
ϕ(x;πΘ,q)\displaystyle\phi(x;\pi_{\Theta},q) :=log{eθxqθ2/2πΘ(dθ)}.\displaystyle:=\log\Big{\{}\int e^{\theta x-q\theta^{2}/2}\pi_{\Theta}({\rm d}\theta)\Big{\}}\,. (96)
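As a sanity check on Eqs. (95)–(96), the following sketch compares the posterior-mean definition of f^{\mbox{\tiny\rm Bayes}} with a finite-difference derivative of \phi for a discrete prior; the two-point prior and the specific numbers are illustrative assumptions.

import numpy as np

def f_bayes_discrete(x, atoms, weights, q):
    """Posterior mean E[Theta | q*Theta + sqrt(q)*G = x] for a discrete prior;
    the posterior weights are proportional to w_k * exp(theta_k*x - q*theta_k^2/2)."""
    logw = np.log(weights) + atoms * x - 0.5 * q * atoms ** 2
    w = np.exp(logw - logw.max())
    return float((w * atoms).sum() / w.sum())

atoms, weights, q, x, eps = np.array([-1.0, 2.0]), np.array([0.7, 0.3]), 0.5, 0.9, 1e-5
phi = lambda u: np.log((weights * np.exp(atoms * u - 0.5 * q * atoms ** 2)).sum())
# Tweedie's formula: the two printed numbers agree up to O(eps^2) discretization error
print(f_bayes_discrete(x, atoms, weights, q), (phi(x + eps) - phi(x - eps)) / (2 * eps))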

Notice that xϕ(x;πΘ,q)x\mapsto\phi(x;\pi_{\Theta},q) is convex, and therefore

Δq(2x,x)fBayes(x;πΘ,q)Δq(x,2x),\displaystyle\Delta_{q}(-2x,-x)\leq f^{\mbox{\tiny\rm Bayes}}(x;\pi_{\Theta},q)\leq\Delta_{q}(x,2x)\,, (97)
Δq(x1,x2):=1x2x1{ϕ(x2;πΘ,q)ϕ(x1;πΘ,q)}.\displaystyle\Delta_{q}(x_{1},x_{2}):=\frac{1}{x_{2}-x_{1}}\big{\{}\phi(x_{2};\pi_{\Theta},q)-\phi(x_{1};\pi_{\Theta},q)\big{\}}\,. (98)

We therefore proceed to lower and upper bound ϕ\phi. First notice that

ϕ(x;πΘ,q)\displaystyle\phi(x;\pi_{\Theta},q) =x22q+log{e(xqθ)2/2qπΘ(dθ)}\displaystyle=\frac{x^{2}}{2q}+\log\Big{\{}\int e^{-(x-q\theta)^{2}/2q}\pi_{\Theta}({\rm d}\theta)\Big{\}}
x22q,\displaystyle\leq\frac{x^{2}}{2q}\,,

where we used the fact that exp((xqθ)2/2q)1\exp(-(x-q\theta)^{2}/2q)\leq 1. Further, by Jensen’s inequality

ϕ(x;πΘ,q)x𝔼[Θ]12𝔼(Θ2)q=12𝔼(Θ2)q.\displaystyle\phi(x;\pi_{\Theta},q)\geq x\operatorname{\mathbb{E}}[\Theta]-\frac{1}{2}\operatorname{\mathbb{E}}(\Theta^{2})q=-\frac{1}{2}\operatorname{\mathbb{E}}(\Theta^{2})q\,.

Therefore we obtain, for x>1x>1,

fBayes(x;πΘ,q)Δq(x,2x)\displaystyle f^{\mbox{\tiny\rm Bayes}}(x;\pi_{\Theta},q)\leq\Delta_{q}(x,2x) 2xq+12x𝔼(Θ2)q\displaystyle\leq\frac{2x}{q}+\frac{1}{2x}\operatorname*{\mathbb{E}}(\Theta^{2})q
2xq+12𝔼(Θ2)q.\displaystyle\leq\frac{2x}{q}+\frac{1}{2}\operatorname*{\mathbb{E}}(\Theta^{2})q\,.

Since xfBayes(x;πΘ,q)x\mapsto f^{\mbox{\tiny\rm Bayes}}(x;\pi_{\Theta},q) is continuous and therefore bounded on [0,1][0,1], this proves fBayes(x;πΘ,q)C(1+|x|)f^{\mbox{\tiny\rm Bayes}}(x;\pi_{\Theta},q)\leq C(1+|x|).

In order to prove fBayes(x;πΘ,q)C(1+|x|)f^{\mbox{\tiny\rm Bayes}}(x;\pi_{\Theta},q)\geq-C(1+|x|), we use the lower bound in Eq. (97) and proceed analogously. ∎

Appendix C Proof of Lemmas for Proposition 4.3

C.1 Notation

We use the convention ={0,1,2,}\mathbb{N}=\{0,1,2,\ldots\}. We often work with 𝜶E¯n{\boldsymbol{\alpha}}\in\mathbb{N}^{{\overline{E}_{n}}} where E¯n={(i,j):  1ijn}{\overline{E}_{n}}=\{(i,j)\,:\,\,1\leq i\leq j\leq n\} and |E¯n|=N:=n(n+1)/2|{\overline{E}_{n}}|=N:=n(n+1)/2. For 𝜶,𝜷E¯n{\boldsymbol{\alpha}},{\boldsymbol{\beta}}\in\mathbb{N}^{\overline{E}_{n}}, we use the notation

|𝜶|:=(i,j)E¯nαij,𝜶!:=(i,j)E¯nαij!,|{\boldsymbol{\alpha}}|:=\sum_{(i,j)\in{\overline{E}_{n}}}\alpha_{ij},\qquad{\boldsymbol{\alpha}}!:=\prod_{(i,j)\in{\overline{E}_{n}}}\alpha_{ij}!,
(𝜶𝜷):=(i,j)E¯n(αijβij),𝒀𝜶:=(i,j)E¯nYijαij,\binom{{\boldsymbol{\alpha}}}{{\boldsymbol{\beta}}}:=\prod_{(i,j)\in{\overline{E}_{n}}}\binom{\alpha_{ij}}{\beta_{ij}},\qquad{\boldsymbol{Y}}^{\boldsymbol{\alpha}}:=\prod_{(i,j)\in{\overline{E}_{n}}}Y_{ij}^{\alpha_{ij}},
(𝜶𝜷)ij=|αijβij|,(𝜶𝜷)ij=min{αij,βij},(𝜶𝜷)ij:=max{0,αijβij}.({\boldsymbol{\alpha}}\bigtriangleup{\boldsymbol{\beta}})_{ij}=|\alpha_{ij}-\beta_{ij}|,\qquad({\boldsymbol{\alpha}}\wedge{\boldsymbol{\beta}})_{ij}=\min\{\alpha_{ij},\beta_{ij}\},\qquad({\boldsymbol{\alpha}}\setminus{\boldsymbol{\beta}})_{ij}:=\max\{0,\alpha_{ij}-\beta_{ij}\}\,.

We use 𝜶𝜷{\boldsymbol{\alpha}}\leq{\boldsymbol{\beta}} to mean αijβij\alpha_{ij}\leq\beta_{ij} for all i,ji,j, and 𝜶𝜷{\boldsymbol{\alpha}}\lneq{\boldsymbol{\beta}} to mean αijβij\alpha_{ij}\leq\beta_{ij} for all i,ji,j with strict inequality for some i,ji,j. Note that 𝜶E¯n{\boldsymbol{\alpha}}\in\mathbb{N}^{\overline{E}_{n}} can be identified with a multigraph whose vertices are elements of [n][n], where αij\alpha_{ij} is the number of copies of edge (i,j)(i,j). We use V(𝜶)[n]V({\boldsymbol{\alpha}})\subseteq[n] to denote the set of vertices spanned by the edges of 𝜶{\boldsymbol{\alpha}}, with vertex 1 (the root) always included by convention. We use C(𝜶)C({\boldsymbol{\alpha}}) to denote the set of non-empty (i.e., containing at least one edge) connected components of 𝜶{\boldsymbol{\alpha}}.

Asymptotic notation such as O()O(\;\cdot\;) and Ω()\Omega(\;\cdot\;) pertains to the limit nn\to\infty and may hide factors depending on πΘ,ψ,D\pi_{\Theta},\psi,D. We use the symbol 𝖼𝗈𝗇𝗌𝗍\mathsf{const} to denote a constant (which may be positive, zero, or negative) depending on πΘ,ψ,D\pi_{\Theta},\psi,D (but not nn) and e.g., 𝖼𝗈𝗇𝗌𝗍(A,B)\mathsf{const}(A,B) to denote a constant that may additionally depend on A,BA,B. These constants may change from line to line.

C.2 Hermite polynomials

We will need the following well known property of Hermite polynomials (see e.g. page 254 of [MOS13]).

Proposition C.1.

For any μ,z\mu,z\in\mathbb{R} and kk\in\mathbb{N},

hk(μ+z)==0k!k!(k)μkh(z).h_{k}(\mu+z)=\sum_{\ell=0}^{k}\sqrt{\frac{\ell!}{k!}}\binom{k}{\ell}\mu^{k-\ell}h_{\ell}(z)\,.
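A quick numerical check of Proposition C.1 (a sketch; here h_k denotes the orthonormal Hermite polynomial \mathrm{He}_k/\sqrt{k!}, consistent with h_0(z)=1, h_1(z)=z and the orthonormality used below):

import numpy as np
from math import comb, factorial, sqrt
from numpy.polynomial.hermite_e import hermeval   # probabilists' Hermite He_k

def h(k, z):
    """Orthonormal Hermite polynomial h_k(z) = He_k(z) / sqrt(k!)."""
    return hermeval(z, [0.0] * k + [1.0]) / sqrt(factorial(k))

mu, z, k = 0.7, -1.3, 5
lhs = h(k, mu + z)
rhs = sum(sqrt(factorial(l) / factorial(k)) * comb(k, l) * mu ** (k - l) * h(l, z)
          for l in range(k + 1))
print(lhs, rhs)   # the two values agree up to floating point error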

As a result, for any 𝜸E¯n{\boldsymbol{\gamma}}\in\mathbb{N}^{\overline{E}_{n}}, and writing 𝑿=𝜽𝜽𝖳/n{\boldsymbol{X}}={\boldsymbol{\theta}}{\boldsymbol{\theta}}^{{\sf T}}/\sqrt{n} for the signal matrix, we have:

h𝜸(𝒀)\displaystyle h_{{\boldsymbol{\gamma}}}({\boldsymbol{Y}}) =(i,j)E¯nhγij(Xij+Zij)\displaystyle=\prod_{(i,j)\in{\overline{E}_{n}}}h_{\gamma_{ij}}(X_{ij}+Z_{ij})
=(i,j)E¯nβij=0γijβij!γij!(γijβij)Xijγijβijhβij(Zij)\displaystyle=\prod_{(i,j)\in{\overline{E}_{n}}}\sum_{\beta_{ij}=0}^{\gamma_{ij}}\sqrt{\frac{\beta_{ij}!}{\gamma_{ij}!}}\binom{\gamma_{ij}}{\beta_{ij}}X_{ij}^{\gamma_{ij}-\beta_{ij}}h_{\beta_{ij}}(Z_{ij})
=𝟎𝜷𝜸(i,j)E¯nβij!γij!(γijβij)Xijγijβijhβij(Zij)\displaystyle=\sum_{{\boldsymbol{0}}\leq{\boldsymbol{\beta}}\leq{\boldsymbol{\gamma}}}\prod_{(i,j)\in{\overline{E}_{n}}}\sqrt{\frac{\beta_{ij}!}{\gamma_{ij}!}}\binom{\gamma_{ij}}{\beta_{ij}}X_{ij}^{\gamma_{ij}-\beta_{ij}}h_{\beta_{ij}}(Z_{ij})
=𝟎𝜷𝜸𝜷!𝜸!(𝜸𝜷)𝑿𝜸𝜷h𝜷(𝒁).\displaystyle=\sum_{{\boldsymbol{0}}\leq{\boldsymbol{\beta}}\leq{\boldsymbol{\gamma}}}\sqrt{\frac{{\boldsymbol{\beta}}!}{{\boldsymbol{\gamma}}!}}\binom{{\boldsymbol{\gamma}}}{{\boldsymbol{\beta}}}{\boldsymbol{X}}^{{\boldsymbol{\gamma}}-{\boldsymbol{\beta}}}h_{\boldsymbol{\beta}}({\boldsymbol{Z}})\,.

In particular, since 𝔼𝒁[h𝜷(𝒁)]=𝟙𝜷=𝟎\operatorname*{\mathbb{E}}_{{\boldsymbol{Z}}}[h_{\boldsymbol{\beta}}({\boldsymbol{Z}})]=\mathbbm{1}_{{\boldsymbol{\beta}}={\boldsymbol{0}}} (due to orthonormality of Hermite polynomials and the fact h0(z)=1h_{0}(z)=1),

𝔼h𝜸(𝒀)=1𝜸!𝔼𝑿𝜸.\operatorname*{\mathbb{E}}h_{\boldsymbol{\gamma}}({\boldsymbol{Y}})=\frac{1}{\sqrt{{\boldsymbol{\gamma}}!}}\operatorname*{\mathbb{E}}{\boldsymbol{X}}^{\boldsymbol{\gamma}}\,.

C.3 Proof of Lemma 4.4: Symmetry

As mentioned in the main text, this is a special case of the classical Hunt–Stein theorem [EG21], the only difference being that we are restricting the class of estimators to degree-DD polynomials. We present a proof for completeness.

Let S1S_{-1} denote the group of permutations on [n][n] that fix {1}\{1\}. This group acts on the space of n×nn\times n symmetric matrices by permuting both the rows and columns (by the same permutation). We also have an induced action of S1S_{-1} on [𝒀]D\mathbb{R}[{\boldsymbol{Y}}]_{\leq D} given by (σq)(𝒀)=q(σ1𝒀)(\sigma\cdot q)({\boldsymbol{Y}})=q(\sigma^{-1}\cdot{\boldsymbol{Y}}). A polynomial qq can be symmetrized over S1S_{-1} to produce q𝗌𝗒𝗆q^{\mathsf{sym}} as follows:

q𝗌𝗒𝗆:=1|S1|σS1σq.q^{\mathsf{sym}}:=\frac{1}{|S_{-1}|}\sum_{\sigma\in S_{-1}}\sigma\cdot q\,.
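For intuition, a minimal sketch of the symmetrization over S_{-1}, averaging over a few random permutations that fix the distinguished coordinate rather than over the whole group (an illustrative shortcut; index 0 below plays the role of the index 1 in the text):

import numpy as np

def symmetrize(q, Y, n_samples=1000, rng=None):
    """Approximate q^sym(Y) by averaging q over random permutations fixing index 0,
    each acting simultaneously on the rows and columns of Y."""
    rng = np.random.default_rng() if rng is None else rng
    n = Y.shape[0]
    vals = []
    for _ in range(n_samples):
        perm = np.concatenate(([0], 1 + rng.permutation(n - 1)))   # sigma fixes the root index
        vals.append(q(Y[np.ix_(perm, perm)]))
    return float(np.mean(vals))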

We claim that the symmetric subspace [𝒀]D𝗌𝗒𝗆=span{HA:𝒢D}\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}^{\mathsf{sym}}=\mathrm{span}_{\mathbb{R}}\{\mathscrsfs{H}_{A}\,:\,\mathcal{G}_{\leq D}\} is precisely the image of the above symmetrization operation, that is, [𝒀]D𝗌𝗒𝗆={q𝗌𝗒𝗆:q[𝒀]D}\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}^{\mathsf{sym}}=\{q^{\mathsf{sym}}\,:\,q\in\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}\}. This can be seen as follows. For the inclusion \subseteq, note that any q[𝒀]D𝗌𝗒𝗆q\in\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}^{\mathsf{sym}} satisfies q=q𝗌𝗒𝗆q=q^{\mathsf{sym}}. For the reverse inclusion \supseteq, start with an arbitrary q[𝒀]Dq\in\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}. Any such qq admits an expansion as a linear combination of Hermite polynomials (h𝜶)|𝜶|D(h_{\boldsymbol{\alpha}})_{|{\boldsymbol{\alpha}}|\leq D}, and therefore admits an expansion in (H𝜶)|𝜶|D(\mathscrsfs{H}_{\boldsymbol{\alpha}})_{|{\boldsymbol{\alpha}}|\leq D} (cf. the definition (38) which can be inverted recursively). Finally, note that H𝜶𝗌𝗒𝗆\mathscrsfs{H}_{\boldsymbol{\alpha}}^{\mathsf{sym}} is a scalar multiple of HG\mathscrsfs{H}_{G} for a certain G𝒢DG\in\mathcal{G}_{\leq D}. This allows q𝗌𝗒𝗆q^{\mathsf{sym}} to be written as a linear combination of (HG)G𝒢D(\mathscrsfs{H}_{G})_{G\in\mathcal{G}_{\leq D}}.

To complete the proof of Lemma 4.4, it is sufficient to show that for any low degree estimator q[𝒀]Dq\in\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}, mean squared error does not increase under symmetrization:

𝔼[(q𝗌𝗒𝗆(𝒀)ψ(θ1))2]𝔼[(q(𝒀)ψ(θ1))2].\operatorname*{\mathbb{E}}[(q^{\mathsf{sym}}({\boldsymbol{Y}})-\psi(\theta_{1}))^{2}]\leq\operatorname*{\mathbb{E}}[(q({\boldsymbol{Y}})-\psi(\theta_{1}))^{2}]\,. (99)

First note that 𝔼[q𝗌𝗒𝗆(𝒀)ψ(θ1)]=𝔼[q(𝒀)ψ(θ1)]\operatorname*{\mathbb{E}}[q^{\mathsf{sym}}({\boldsymbol{Y}})\cdot\psi(\theta_{1})]=\operatorname*{\mathbb{E}}[q({\boldsymbol{Y}})\cdot\psi(\theta_{1})] because (𝒀,θ1)({\boldsymbol{Y}},\theta_{1}) and (σ𝒀,θ1)(\sigma\cdot{\boldsymbol{Y}},\theta_{1}) have the same distribution for any fixed σS1\sigma\in S_{-1}. Next, using Jensen’s inequality and the symmetry 𝔼[q2]=𝔼[(σq)2]\operatorname*{\mathbb{E}}[q^{2}]=\operatorname*{\mathbb{E}}[(\sigma\cdot q)^{2}] for all σS1\sigma\in S_{-1}, we have

𝔼[q𝗌𝗒𝗆(𝒀)2]\displaystyle\operatorname*{\mathbb{E}}[q^{\mathsf{sym}}({\boldsymbol{Y}})^{2}] =𝔼{(1|S1|σS1(σq)(𝒀))2}\displaystyle=\operatorname*{\mathbb{E}}\Big{\{}\Big{(}\frac{1}{|S_{-1}|}\sum_{\sigma\in S_{-1}}(\sigma\cdot q)({\boldsymbol{Y}})\Big{)}^{2}\Big{\}}
1|S1|σS1𝔼[(σq)(𝒀)2]=𝔼[q(𝒀)2].\displaystyle\leq\frac{1}{|S_{-1}|}\sum_{\sigma\in S_{-1}}\operatorname*{\mathbb{E}}[(\sigma\cdot q)({\boldsymbol{Y}})^{2}]=\operatorname*{\mathbb{E}}[q({\boldsymbol{Y}})^{2}]\,.

Now Eq. (99) follows by combining these claims.

C.4 Proof of Lemma 4.5: λmin(𝑴n)\lambda_{\mathrm{min}}({\boldsymbol{M}}_{n})

Consider an arbitrary polynomial f=G𝒢DbG(f)HGf=\sum_{G\in\mathcal{G}_{\leq D}}b_{G}(f)\mathscrsfs{H}_{G} with coefficients 𝒃(f)=(bG(f)){\boldsymbol{b}}(f)=(b_{G}(f)) normalized so that 𝒃(f)2:=G𝒢DbG(f)2=1\|{\boldsymbol{b}}(f)\|^{2}:=\sum_{G\in\mathcal{G}_{\leq D}}b_{G}(f)^{2}=1. This induces an expansion f=𝜶E¯n,|𝜶|Df^𝜶H𝜶f=\sum_{{\boldsymbol{\alpha}}\in\mathbb{N}^{\overline{E}_{n}},|{\boldsymbol{\alpha}}|\leq D}\hat{f}_{\boldsymbol{\alpha}}\mathscrsfs{H}_{\boldsymbol{\alpha}}. Since 𝔼[f(𝒀)2]=𝒃,𝑴n𝒃\operatorname*{\mathbb{E}}[f({\boldsymbol{Y}})^{2}]=\langle{\boldsymbol{b}},{\boldsymbol{M}}_{n}{\boldsymbol{b}}\rangle, it suffices to show 𝔼[f(𝒀)2]=Ω(1)\operatorname*{\mathbb{E}}[f({\boldsymbol{Y}})^{2}]=\Omega(1). Using Jensen’s inequality and orthonormality of Hermite polynomials (here, we use subscripts on expectation to denote which variable is being integrated over),

𝔼[f(𝒀)2]=𝔼Z𝔼X[f(𝑿+𝒁)2]𝔼Z[(𝔼Xf(𝑿+𝒁))2]=:g^2\displaystyle\operatorname*{\mathbb{E}}[f({\boldsymbol{Y}})^{2}]=\operatorname*{\mathbb{E}}_{Z}\operatorname*{\mathbb{E}}_{X}[f({\boldsymbol{X}}+{\boldsymbol{Z}})^{2}]\geq\operatorname*{\mathbb{E}}_{Z}[(\operatorname*{\mathbb{E}}_{X}f({\boldsymbol{X}}+{\boldsymbol{Z}}))^{2}]=:\|\hat{g}\|^{2}

where g(𝒁):=𝔼𝑿f(𝑿+𝒁)g({\boldsymbol{Z}}):=\operatorname*{\mathbb{E}}_{\boldsymbol{X}}f({\boldsymbol{X}}+{\boldsymbol{Z}}) with Hermite expansion g=𝜶E¯n,|𝜶|Dg^𝜶H𝜶g=\sum_{{\boldsymbol{\alpha}}\in\mathbb{N}^{\overline{E}_{n}},|{\boldsymbol{\alpha}}|\leq D}\hat{g}_{\boldsymbol{\alpha}}\mathscrsfs{H}_{\boldsymbol{\alpha}}. It now suffices to show g^2=Ω(1)\|\hat{g}\|^{2}=\Omega(1). We will compute g^\hat{g} explicitly in terms of f^\hat{f}. We have, using the Hermite facts from Section C.2,

g(𝒁)\displaystyle g({\boldsymbol{Z}}) =𝔼Xf(𝑿+𝒁)\displaystyle=\operatorname*{\mathbb{E}}_{X}f({\boldsymbol{X}}+{\boldsymbol{Z}})
=𝔼X𝜶E¯nf^𝜶H𝜶(𝑿+𝒁)\displaystyle=\operatorname*{\mathbb{E}}_{X}\sum_{{\boldsymbol{\alpha}}\in\mathbb{N}^{\overline{E}_{n}}}\hat{f}_{\boldsymbol{\alpha}}\mathscrsfs{H}_{\boldsymbol{\alpha}}({\boldsymbol{X}}+{\boldsymbol{Z}})
=𝜶E¯nf^𝜶𝔼XγC(𝜶)(h𝜸(𝑿+𝒁)𝔼Yh𝜸(𝒀))\displaystyle=\sum_{{\boldsymbol{\alpha}}\in\mathbb{N}^{\overline{E}_{n}}}\hat{f}_{\boldsymbol{\alpha}}\operatorname*{\mathbb{E}}_{X}\prod_{\gamma\in C({\boldsymbol{\alpha}})}\left(h_{\boldsymbol{\gamma}}({\boldsymbol{X}}+{\boldsymbol{Z}})-\operatorname*{\mathbb{E}}_{Y}h_{\boldsymbol{\gamma}}({\boldsymbol{Y}})\right)
=𝜶E¯nf^𝜶𝜸C(𝜶)𝔼X[h𝜸(𝑿+𝒁)𝔼Yh𝜸(𝒀)]using independence between components\displaystyle=\sum_{{\boldsymbol{\alpha}}\in\mathbb{N}^{\overline{E}_{n}}}\hat{f}_{\boldsymbol{\alpha}}\prod_{{\boldsymbol{\gamma}}\in C({\boldsymbol{\alpha}})}\operatorname*{\mathbb{E}}_{X}[h_{\boldsymbol{\gamma}}({\boldsymbol{X}}+{\boldsymbol{Z}})-\operatorname*{\mathbb{E}}_{Y}h_{\boldsymbol{\gamma}}({\boldsymbol{Y}})]\qquad\text{using independence between components}
=𝜶E¯nf^𝜶𝜸C(𝜶)𝔼X[0𝜷𝜸𝜷!𝜸!(𝜸𝜷)X𝜸𝜷h𝜷(𝒁)1𝜸!𝑿𝜸]\displaystyle=\sum_{{\boldsymbol{\alpha}}\in\mathbb{N}^{\overline{E}_{n}}}\hat{f}_{\boldsymbol{\alpha}}\prod_{{\boldsymbol{\gamma}}\in C({\boldsymbol{\alpha}})}\operatorname*{\mathbb{E}}_{X}\left[\sum_{0\leq{\boldsymbol{\beta}}\leq{\boldsymbol{\gamma}}}\sqrt{\frac{{\boldsymbol{\beta}}!}{{\boldsymbol{\gamma}}!}}\binom{{\boldsymbol{\gamma}}}{{\boldsymbol{\beta}}}X^{{\boldsymbol{\gamma}}-{\boldsymbol{\beta}}}h_{\boldsymbol{\beta}}({\boldsymbol{Z}})-\frac{1}{\sqrt{{\boldsymbol{\gamma}}!}}{\boldsymbol{X}}^{\boldsymbol{\gamma}}\right]
=𝜶E¯nf^𝜶𝜸C(𝜶)0𝜷𝜸𝜷!𝜸!(𝜸𝜷)𝔼[𝑿𝜸𝜷]h𝜷(𝒁)\displaystyle=\sum_{{\boldsymbol{\alpha}}\in\mathbb{N}^{\overline{E}_{n}}}\hat{f}_{\boldsymbol{\alpha}}\prod_{{\boldsymbol{\gamma}}\in C({\boldsymbol{\alpha}})}\sum_{0\lneq{\boldsymbol{\beta}}\leq{\boldsymbol{\gamma}}}\sqrt{\frac{{\boldsymbol{\beta}}!}{{\boldsymbol{\gamma}}!}}\binom{{\boldsymbol{\gamma}}}{{\boldsymbol{\beta}}}\operatorname*{\mathbb{E}}[{\boldsymbol{X}}^{{\boldsymbol{\gamma}}-{\boldsymbol{\beta}}}]h_{\boldsymbol{\beta}}({\boldsymbol{Z}})
=𝜶E¯nf^𝜶𝜷Γ(𝜶)𝜷!𝜶!(𝜶𝜷)𝔼[𝑿𝜶𝜷]h𝜷(𝒁)\displaystyle=\sum_{{\boldsymbol{\alpha}}\in\mathbb{N}^{\overline{E}_{n}}}\hat{f}_{\boldsymbol{\alpha}}\sum_{{\boldsymbol{\beta}}\in\Gamma({\boldsymbol{\alpha}})}\sqrt{\frac{{\boldsymbol{\beta}}!}{{\boldsymbol{\alpha}}!}}\binom{{\boldsymbol{\alpha}}}{{\boldsymbol{\beta}}}\operatorname*{\mathbb{E}}[{\boldsymbol{X}}^{{\boldsymbol{\alpha}}-{\boldsymbol{\beta}}}]h_{\boldsymbol{\beta}}({\boldsymbol{Z}})
where Γ(𝜶)\Gamma({\boldsymbol{\alpha}}) is the set of 𝜷E¯n{\boldsymbol{\beta}}\in\mathbb{N}^{\overline{E}_{n}} such that 0𝜷𝜶0\leq{\boldsymbol{\beta}}\leq{\boldsymbol{\alpha}}, and 𝜷{\boldsymbol{\beta}} includes at least one edge from every non-empty component of 𝜶{\boldsymbol{\alpha}}
=𝜷E¯nh𝜷(𝒁)𝜶:𝜷Γ(𝜶)f^𝜶𝜷!𝜶!(𝜶𝜷)𝔼[𝑿𝜶𝜷].\displaystyle=\sum_{{\boldsymbol{\beta}}\in\mathbb{N}^{\overline{E}_{n}}}h_{\boldsymbol{\beta}}({\boldsymbol{Z}})\sum_{{\boldsymbol{\alpha}}\,:\,{\boldsymbol{\beta}}\in\Gamma({\boldsymbol{\alpha}})}\hat{f}_{\boldsymbol{\alpha}}\sqrt{\frac{{\boldsymbol{\beta}}!}{{\boldsymbol{\alpha}}!}}\binom{{\boldsymbol{\alpha}}}{{\boldsymbol{\beta}}}\operatorname*{\mathbb{E}}[{\boldsymbol{X}}^{{\boldsymbol{\alpha}}-{\boldsymbol{\beta}}}]\,.

This means

g^𝜷=𝜶:𝜷Γ(𝜶)f^𝜶𝜷!𝜶!(𝜶𝜷)𝔼[𝑿𝜶𝜷].\hat{g}_{\boldsymbol{\beta}}=\sum_{{\boldsymbol{\alpha}}\,:\,{\boldsymbol{\beta}}\in\Gamma({\boldsymbol{\alpha}})}\hat{f}_{\boldsymbol{\alpha}}\sqrt{\frac{{\boldsymbol{\beta}}!}{{\boldsymbol{\alpha}}!}}\binom{{\boldsymbol{\alpha}}}{{\boldsymbol{\beta}}}\operatorname*{\mathbb{E}}[{\boldsymbol{X}}^{{\boldsymbol{\alpha}}-{\boldsymbol{\beta}}}]\,. (100)

Due to the symmetry of ff, we have g^𝜷=g^𝜷\hat{g}_{\boldsymbol{\beta}}=\hat{g}_{{\boldsymbol{\beta}}^{\prime}} whenever 𝜷,𝜷{\boldsymbol{\beta}},{\boldsymbol{\beta}}^{\prime} are images of embeddings of the same G𝒢DG\in\mathcal{G}_{\leq D}. Therefore, gg admits an expansion g=G𝒢DbG(g)HGg=\sum_{G\in\mathcal{G}_{\leq D}}b_{G}(g)\mathscrsfs{H}_{G} where HG\mathscrsfs{H}_{G} is defined by Eq. (39).

The coefficients bG(g)b_{G}(g) and g^𝜶\hat{g}_{{\boldsymbol{\alpha}}} are related as follows. Let |𝖠𝗎𝗍(G)||\mathsf{Aut}(G)| denote the number of root-preserving automorphisms of GG, i.e., the number of embeddings ϕ𝖾𝗆𝖻(G)\phi\in\mathsf{emb}(G) that induce a single element 𝜶E¯n{\boldsymbol{\alpha}}\in\mathbb{N}^{\overline{E}_{n}}. For each G𝒢DG\in\mathcal{G}_{\leq D} there are |𝖾𝗆𝖻(G)||𝖠𝗎𝗍(G)|\frac{|\mathsf{emb}(G)|}{|\mathsf{Aut}(G)|} distinct elements 𝜶E¯n{\boldsymbol{\alpha}}\in\mathbb{N}^{\overline{E}_{n}}, and each has the same coefficient g^𝜶\hat{g}_{\boldsymbol{\alpha}}. Without loss of generality we can set g^𝜶=:|𝖠𝗎𝗍(G)|/|𝖾𝗆𝖻(G)|g~G\hat{g}_{\boldsymbol{\alpha}}=:|\mathsf{Aut}(G)|/\sqrt{|\mathsf{emb}(G)|}\,\tilde{g}_{G} for some coefficients g~G\tilde{g}_{G}. Therefore

g\displaystyle g =𝜶E¯ng^𝜶H𝜶\displaystyle=\sum_{{\boldsymbol{\alpha}}\in\mathbb{N}^{{\overline{E}_{n}}}}\hat{g}_{\boldsymbol{\alpha}}\mathscrsfs{H}_{{\boldsymbol{\alpha}}}
=G𝒢D𝜶E¯ng^𝜶H𝜶1|𝖠𝗎𝗍(G)|ϕ𝖾𝗆𝖻(G)𝟙𝜶=𝗂𝗆(ϕ;G)\displaystyle=\sum_{G\in\mathcal{G}_{\leq D}}\sum_{{\boldsymbol{\alpha}}\in\mathbb{N}^{{\overline{E}_{n}}}}\hat{g}_{\boldsymbol{\alpha}}\mathscrsfs{H}_{{\boldsymbol{\alpha}}}\cdot\frac{1}{|\mathsf{Aut}(G)|}\sum_{\phi\in\mathsf{emb}(G)}\mathbbm{1}_{{\boldsymbol{\alpha}}=\mathsf{im}(\phi;G)}
=G𝒢Dϕ𝖾𝗆𝖻(G)𝜶E¯n𝟙𝜶=𝗂𝗆(ϕ;G)1|𝖾𝗆𝖻(G)|g~GH𝜶\displaystyle=\sum_{G\in\mathcal{G}_{\leq D}}\sum_{\phi\in\mathsf{emb}(G)}\sum_{{\boldsymbol{\alpha}}\in\mathbb{N}^{{\overline{E}_{n}}}}\mathbbm{1}_{{\boldsymbol{\alpha}}=\mathsf{im}(\phi;G)}\frac{1}{\sqrt{|\mathsf{emb}(G)|}}\tilde{g}_{G}\mathscrsfs{H}_{{\boldsymbol{\alpha}}}
=G𝒢Dg~GHG.\displaystyle=\sum_{G\in\mathcal{G}_{\leq D}}\tilde{g}_{G}\mathscrsfs{H}_{G}\,.

Therefore we can conclude that bG(g)=g~G=|𝖾𝗆𝖻(G)||𝖠𝗎𝗍(G)|g^𝜶b_{G}(g)=\tilde{g}_{G}=\frac{\sqrt{|\mathsf{emb}(G)|}}{|\mathsf{Aut}(G)|}\hat{g}_{\boldsymbol{\alpha}}. This means

g^2=𝜶E¯n,|𝜶|Dg^𝜶2\displaystyle\|\hat{g}\|^{2}=\sum_{{\boldsymbol{\alpha}}\in\mathbb{N}^{\overline{E}_{n}},|{\boldsymbol{\alpha}}|\leq D}\hat{g}_{\boldsymbol{\alpha}}^{2} =G𝒢D|𝖾𝗆𝖻(G)||𝖠𝗎𝗍(G)|(|𝖠𝗎𝗍(G)||𝖾𝗆𝖻(G)|bG(g))2\displaystyle=\sum_{G\in\mathcal{G}_{\leq D}}\frac{|\mathsf{emb}(G)|}{|\mathsf{Aut}(G)|}\left(\frac{|\mathsf{Aut}(G)|}{\sqrt{|\mathsf{emb}(G)|}}b_{G}(g)\right)^{2}
=G𝒢D|𝖠𝗎𝗍(G)|bG(g)2𝒃(g)2.\displaystyle=\sum_{G\in\mathcal{G}_{\leq D}}|\mathsf{Aut}(G)|\cdot b_{G}(g)^{2}\geq\|{\boldsymbol{b}}(g)\|^{2}\,.

It now suffices to show 𝒃(g)2=Ω(1)\|{\boldsymbol{b}}(g)\|^{2}=\Omega(1).

Our next step will be to relate the coefficients 𝒃(g){\boldsymbol{b}}(g) to the coefficients 𝒃(f){\boldsymbol{b}}(f). Using (100) along with the above relation between g~\tilde{g} and g^\hat{g}, we have for any B𝒢DB\in\mathcal{G}_{\leq D} and 𝜷{\boldsymbol{\beta}} the image of some ϕ𝖾𝗆𝖻(B)\phi\in\mathsf{emb}(B),

bB(g)=|𝖾𝗆𝖻(B)||𝖠𝗎𝗍(B)|g^𝜷=|𝖾𝗆𝖻(B)||𝖠𝗎𝗍(B)|𝜶:𝜷Γ(𝜶)f^𝜶𝜷!𝜶!(𝜶𝜷)𝔼[𝑿𝜶𝜷].b_{B}(g)=\frac{\sqrt{|\mathsf{emb}(B)|}}{|\mathsf{Aut}(B)|}\hat{g}_{\boldsymbol{\beta}}=\frac{\sqrt{|\mathsf{emb}(B)|}}{|\mathsf{Aut}(B)|}\sum_{{\boldsymbol{\alpha}}\,:\,{\boldsymbol{\beta}}\in\Gamma({\boldsymbol{\alpha}})}\hat{f}_{\boldsymbol{\alpha}}\sqrt{\frac{{\boldsymbol{\beta}}!}{{\boldsymbol{\alpha}}!}}\binom{{\boldsymbol{\alpha}}}{{\boldsymbol{\beta}}}\operatorname*{\mathbb{E}}[{\boldsymbol{X}}^{{\boldsymbol{\alpha}}-{\boldsymbol{\beta}}}]\,. (101)

Similarly to above, we have f^𝜶=|𝖠𝗎𝗍(A)||𝖾𝗆𝖻(A)|bA(f)\hat{f}_{\boldsymbol{\alpha}}=\frac{|\mathsf{Aut}(A)|}{\sqrt{|\mathsf{emb}(A)|}}b_{A}(f) whenever 𝜶{\boldsymbol{\alpha}} is the image of some ϕ𝖾𝗆𝖻(A)\phi\in\mathsf{emb}(A). In (101), the number of terms 𝜶E¯n{\boldsymbol{\alpha}}\in\mathbb{N}^{{\overline{E}_{n}}} in the sum that correspond to a single A𝒢DA\in\mathcal{G}_{\leq D} is O(n|V(A)||V(B)|)O(n^{|V(A)|-|V(B)|}), recalling that 𝜷Γ(𝜶){\boldsymbol{\beta}}\in\Gamma({\boldsymbol{\alpha}}) implies 𝜷𝜶{\boldsymbol{\beta}}\leq{\boldsymbol{\alpha}}. We also have the bounds 𝜷!𝜶!(𝜶𝜷)=O(1)\sqrt{\frac{{\boldsymbol{\beta}}!}{{\boldsymbol{\alpha}}!}}\binom{{\boldsymbol{\alpha}}}{{\boldsymbol{\beta}}}=O(1), 𝔼[𝑿𝜶𝜷]=O(n12(|E(A)||E(B)|))\operatorname*{\mathbb{E}}[{\boldsymbol{X}}^{{\boldsymbol{\alpha}}-{\boldsymbol{\beta}}}]=O(n^{-\frac{1}{2}(|E(A)|-|E(B)|)}), |𝖠𝗎𝗍(A)|=Θ(1)|\mathsf{Aut}(A)|=\Theta(1), and |𝖾𝗆𝖻(A)|=Θ(n|V(A)|1)|\mathsf{emb}(A)|=\Theta(n^{|V(A)|-1}). This means (101) can be written as bB(g)=AζBAbA(f)b_{B}(g)=\sum_{A}\zeta_{BA}b_{A}(f) for some coefficients ζBA\zeta_{BA} (not depending on f,gf,g) that satisfy

|ζBA|\displaystyle|\zeta_{BA}| |𝖾𝗆𝖻(B)||𝖾𝗆𝖻(A)|O(n|V(A)||V(B)|n12(|E(A)||E(B)|))\displaystyle\leq\sqrt{\frac{|\mathsf{emb}(B)|}{|\mathsf{emb}(A)|}}\cdot O(n^{|V(A)|-|V(B)|}\cdot n^{-\frac{1}{2}(|E(A)|-|E(B)|)})
=O(n12(|V(A)||V(B)||E(A)|+|E(B)|))\displaystyle=O(n^{\frac{1}{2}(|V(A)|-|V(B)|-|E(A)|+|E(B)|)})
=O(1),\displaystyle=O(1)\,,

where the last bound holds since 𝜷Γ(𝜶){\boldsymbol{\beta}}\in\Gamma({\boldsymbol{\alpha}}) and therefore 𝜶{\boldsymbol{\alpha}} cannot have more components than 𝜷{\boldsymbol{\beta}}.

We have now shown |ζBA|=O(1)|\zeta_{BA}|=O(1). We can also see directly that ζAA=1\zeta_{AA}=1 for all AA, and ζBA=0\zeta_{BA}=0 whenever BAB\neq A and |E(B)||E(A)||E(B)|\geq|E(A)|. This means

bB(g)=bB(f)+A:|E(A)|>|E(B)|ζBAbA(f)b_{B}(g)=b_{B}(f)+\sum_{A\,:\,|E(A)|>|E(B)|}\zeta_{BA}b_{A}(f)

which we can rewrite in vector form as

𝒃(g)=𝜻𝒃(f),{\boldsymbol{b}}(g)={\boldsymbol{\zeta}}\,{\boldsymbol{b}}(f)\,,

where 𝜻=(ζA,B)A,B𝒢D{\boldsymbol{\zeta}}=(\zeta_{A,B})_{A,B\in\mathcal{G}_{\leq D}}. We order 𝒢D\mathcal{G}_{\leq D} so that the number of edges is non-decreasing along this ordering, and therefore 𝜻{\boldsymbol{\zeta}} is upper triangular with ones on the diagonal. This implies det(𝜻)=1\det({\boldsymbol{\zeta}})=1, and in particular 𝜻{\boldsymbol{\zeta}} is invertible. By Cramer’s rule, letting 𝜻(B,A){\boldsymbol{\zeta}}_{(B,A)} denote the (B,A)(B,A)-th minor,

(𝜻1)A,B=(1)i+jdet(𝜻(B,A))det(𝜻)=(1)i+jdet(𝜻(B,A))=O(1),\displaystyle({\boldsymbol{\zeta}}^{-1})_{A,B}=(-1)^{i+j}\frac{\det({\boldsymbol{\zeta}}_{(B,A)})}{\det({\boldsymbol{\zeta}})}=(-1)^{i+j}\det({\boldsymbol{\zeta}}_{(B,A)})=O(1)\,,

where the last bound follows from the fact that the matrix dimensions are independent of nn, and maxA,B𝒢D|ζA,B|=O(1)\max_{A,B\in\mathcal{G}_{\leq D}}|\zeta_{A,B}|=O(1). Therefore 𝜻1op=O(1)\|{\boldsymbol{\zeta}}^{-1}\|_{\mbox{\rm\tiny op}}=O(1) and

𝒃(g)2𝜻1op1𝒃(f)2𝜻1op1=Ω(1).\displaystyle\|{\boldsymbol{b}}(g)\|_{2}\geq\|{\boldsymbol{\zeta}}^{-1}\|_{\mbox{\rm\tiny op}}^{-1}\,\|{\boldsymbol{b}}(f)\|_{2}\geq\|{\boldsymbol{\zeta}}^{-1}\|_{\mbox{\rm\tiny op}}^{-1}=\Omega(1)\,.

This concludes the proof.

C.5 Proof of Lemma 4.6: Explicit solution

Throughout this proof, we will omit the subscript nn from 𝒄{\boldsymbol{c}} and 𝑴{\boldsymbol{M}}.

For an arbitrary q[𝒀]D𝗌𝗒𝗆q\in\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}^{\mathsf{sym}}, write the expansion q=A𝒢Dq^AHAq=\sum_{A\in\mathcal{G}_{\leq D}}\hat{q}_{A}\mathscrsfs{H}_{A}. Recalling the definitions cA=𝔼[HA(𝒀)ψ(θ1)]c_{A}=\operatorname*{\mathbb{E}}[\mathscrsfs{H}_{A}({\boldsymbol{Y}})\cdot\psi(\theta_{1})] and MAB=𝔼[HA(𝒀)HB(𝒀)]M_{AB}=\operatorname*{\mathbb{E}}[\mathscrsfs{H}_{A}({\boldsymbol{Y}})\mathscrsfs{H}_{B}({\boldsymbol{Y}})], we can rewrite the optimization problem as

infq[𝒀]D𝗌𝗒𝗆𝔼[(q(𝒀)ψ(θ1))2]\displaystyle\inf_{q\in\mathbb{R}[{\boldsymbol{Y}}]^{\mathsf{sym}}_{\leq D}}\operatorname*{\mathbb{E}}[(q({\boldsymbol{Y}})-\psi(\theta_{1}))^{2}] =𝔼[ψ(θ1)2]supq[𝒀]D𝗌𝗒𝗆𝔼[2q(𝒀)ψ(θ1)q(𝒀)2]\displaystyle=\operatorname*{\mathbb{E}}[\psi(\theta_{1})^{2}]-\sup_{q\in\mathbb{R}[{\boldsymbol{Y}}]^{\mathsf{sym}}_{\leq D}}\operatorname*{\mathbb{E}}[2\,q({\boldsymbol{Y}})\cdot\psi(\theta_{1})-q({\boldsymbol{Y}})^{2}]
=𝔼[ψ(θ1)2]sup𝒒^(2𝒒^,𝒄𝒒^,𝑴𝒒^).\displaystyle=\operatorname*{\mathbb{E}}[\psi(\theta_{1})^{2}]-\sup_{\hat{\boldsymbol{q}}}\left(2\langle\hat{\boldsymbol{q}},{\boldsymbol{c}}\rangle-\langle\hat{\boldsymbol{q}},{\boldsymbol{M}}\hat{\boldsymbol{q}}\rangle\right).

Since 𝑴{\boldsymbol{M}} is positive definite by Lemma 4.5, the map 𝒒^2𝒒^,𝒄𝒒^,𝑴𝒒^\hat{\boldsymbol{q}}\mapsto 2\langle\hat{\boldsymbol{q}},{\boldsymbol{c}}\rangle-\langle\hat{\boldsymbol{q}},{\boldsymbol{M}}\hat{\boldsymbol{q}}\rangle is strictly concave and is thus uniquely maximized at its unique stationary point 𝒒^=𝑴1𝒄\hat{\boldsymbol{q}}={\boldsymbol{M}}^{-1}{\boldsymbol{c}}.
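For later reference, substituting the maximizer back into the objective gives the explicit value (a one-line check):

\sup_{\hat{\boldsymbol{q}}}\left(2\langle\hat{\boldsymbol{q}},{\boldsymbol{c}}\rangle-\langle\hat{\boldsymbol{q}},{\boldsymbol{M}}\hat{\boldsymbol{q}}\rangle\right)=2\langle{\boldsymbol{M}}^{-1}{\boldsymbol{c}},{\boldsymbol{c}}\rangle-\langle{\boldsymbol{M}}^{-1}{\boldsymbol{c}},{\boldsymbol{M}}{\boldsymbol{M}}^{-1}{\boldsymbol{c}}\rangle=\langle{\boldsymbol{c}},{\boldsymbol{M}}^{-1}{\boldsymbol{c}}\rangle\,,

so that \inf_{q\in\mathbb{R}[{\boldsymbol{Y}}]^{\mathsf{sym}}_{\leq D}}\operatorname*{\mathbb{E}}[(q({\boldsymbol{Y}})-\psi(\theta_{1}))^{2}]=\operatorname*{\mathbb{E}}[\psi(\theta_{1})^{2}]-\langle{\boldsymbol{c}},{\boldsymbol{M}}^{-1}{\boldsymbol{c}}\rangle, whose n\to\infty limit is the quantity \operatorname*{\mathbb{E}}[\psi(\theta_{1})^{2}]-\langle{\boldsymbol{c}}_{\infty},{\boldsymbol{M}}_{\infty}^{-1}{\boldsymbol{c}}_{\infty}\rangle appearing in the proofs of Propositions 4.3 and 4.10.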

C.6 Proof of Lemma 4.7: Limit of 𝒄n{\boldsymbol{c}}_{n}

Throughout this proof, we will omit the subscript nn from 𝒄{\boldsymbol{c}} and its entries.

Our goal is to compute, for each A𝒢DA\in\mathcal{G}_{\leq D},

cA=𝔼[HA(𝒀)ψ(θ1)]=1|𝖾𝗆𝖻(A)|ϕ𝖾𝗆𝖻(A)𝔼[H𝗂𝗆(ϕ;A)(𝒀)ψ(θ1)]=|𝖾𝗆𝖻(A)|𝔼[H𝜶(Y)ψ(θ1)],c_{A}=\operatorname*{\mathbb{E}}[\mathscrsfs{H}_{A}({\boldsymbol{Y}})\cdot\psi(\theta_{1})]=\frac{1}{\sqrt{|\mathsf{emb}(A)|}}\sum_{\phi\in\mathsf{emb}(A)}\operatorname*{\mathbb{E}}[\mathscrsfs{H}_{\mathsf{im}(\phi;A)}({\boldsymbol{Y}})\cdot\psi(\theta_{1})]=\sqrt{|\mathsf{emb}(A)|}\operatorname*{\mathbb{E}}[\mathscrsfs{H}_{\boldsymbol{\alpha}}(Y)\cdot\psi(\theta_{1})]\,,

where in the last step, 𝜶{\boldsymbol{\alpha}} is the image of AA under an arbitrary embedding ϕ\phi and we have used the fact that 𝔼[H𝜶(Y)ψ(θ1)]\operatorname*{\mathbb{E}}[\mathscrsfs{H}_{\boldsymbol{\alpha}}(Y)\cdot\psi(\theta_{1})] depends only on AA (not 𝜶{\boldsymbol{\alpha}}). We have

|𝖾𝗆𝖻(A)|=(n1|V(A)|1)(|V(A)|1)!=n|V(A)|1(1+O(n1)).|\mathsf{emb}(A)|=\binom{n-1}{|V(A)|-1}(|V(A)|-1)!=n^{|V(A)|-1}(1+O(n^{-1}))\,. (102)
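
As an illustration of (102), if A is the rooted path with two edges (so |V(A)|=3), then

|\mathsf{emb}(A)|=\binom{n-1}{2}\cdot 2!=(n-1)(n-2)=n^{2}\big(1+O(n^{-1})\big)\,.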

For 𝜶{\boldsymbol{\alpha}} the image of an embedding of AA, we have by definition,

𝔼[H𝜶(𝒀)ψ(θ1)]=𝔼{ψ(θ1)𝜸C(𝜶)(h𝜸(𝒀)𝔼hγ(𝒀))}.\operatorname*{\mathbb{E}}[\mathscrsfs{H}_{\boldsymbol{\alpha}}({\boldsymbol{Y}})\cdot\psi(\theta_{1})]=\operatorname*{\mathbb{E}}\Big{\{}\psi(\theta_{1})\prod_{{\boldsymbol{\gamma}}\in C({\boldsymbol{\alpha}})}(h_{\boldsymbol{\gamma}}({\boldsymbol{Y}})-\operatorname*{\mathbb{E}}h_{\gamma}({\boldsymbol{Y}}))\Big{\}}\,.

If A=A=\emptyset (the edgeless graph), we can see cA=𝔼[ψ(θ1)]c_{A}=\operatorname*{\mathbb{E}}[\psi(\theta_{1})] is a constant and we are done. If AA has a non-empty connected component that does not contain the root then we have 𝔼[H𝜶(𝒀)ψ(θ1)]=0\operatorname*{\mathbb{E}}[\mathscrsfs{H}_{\boldsymbol{\alpha}}({\boldsymbol{Y}})\cdot\psi(\theta_{1})]=0 due to independence between components, and again we are done.

Finally, consider the case in which AA has a single non-empty component and this component contains the root. In this case, simplifying the above using the Hermite facts from Section C.2, and recalling that 𝑿:=𝜽𝜽𝖳/n{\boldsymbol{X}}:={\boldsymbol{\theta}}{\boldsymbol{\theta}}^{{\sf T}}/\sqrt{n},

𝔼[H𝜶(Y)ψ(θ1)]=𝔼{ψ(θ1)(h𝜶(𝒀)𝔼h𝜶(𝒀))}\displaystyle\operatorname*{\mathbb{E}}[\mathscrsfs{H}_{\boldsymbol{\alpha}}(Y)\cdot\psi(\theta_{1})]=\operatorname*{\mathbb{E}}\big{\{}\psi(\theta_{1})(h_{\boldsymbol{\alpha}}({\boldsymbol{Y}})-\operatorname*{\mathbb{E}}h_{\boldsymbol{\alpha}}({\boldsymbol{Y}}))\big{\}}
=𝔼{ψ(θ1)(𝟎𝜶𝜶𝜶!𝜶!(𝜶𝜶)X𝜶𝜶h𝜶(𝒁)1𝜶!𝔼[𝑿𝜶])}\displaystyle\qquad=\operatorname*{\mathbb{E}}\left\{\psi(\theta_{1})\left(\sum_{{\boldsymbol{0}}\leq{\boldsymbol{\alpha}}^{\prime}\leq{\boldsymbol{\alpha}}}\sqrt{\frac{{\boldsymbol{\alpha}}^{\prime}!}{{\boldsymbol{\alpha}}!}}\binom{{\boldsymbol{\alpha}}}{{\boldsymbol{\alpha}}^{\prime}}X^{{\boldsymbol{\alpha}}-{\boldsymbol{\alpha}}^{\prime}}h_{{\boldsymbol{\alpha}}^{\prime}}({\boldsymbol{Z}})-\frac{1}{\sqrt{{\boldsymbol{\alpha}}!}}\operatorname*{\mathbb{E}}[{\boldsymbol{X}}^{\boldsymbol{\alpha}}]\right)\right\}
=𝔼{ψ(θ1)(𝟎𝜶𝜶𝜶!𝜶!(𝜶𝜶)n12|𝜶𝜶|(𝜽𝜽𝖳)𝜶𝜶h𝜶(𝒁)1𝜶!n12|𝜶|𝔼[(𝜽𝜽𝖳)𝜶])}.\displaystyle\qquad=\operatorname*{\mathbb{E}}\left\{\psi(\theta_{1})\left(\sum_{{\boldsymbol{0}}\leq{\boldsymbol{\alpha}}^{\prime}\leq{\boldsymbol{\alpha}}}\sqrt{\frac{{\boldsymbol{\alpha}}^{\prime}!}{{\boldsymbol{\alpha}}!}}\binom{{\boldsymbol{\alpha}}}{{\boldsymbol{\alpha}}^{\prime}}n^{-\frac{1}{2}|{\boldsymbol{\alpha}}-{\boldsymbol{\alpha}}^{\prime}|}({\boldsymbol{\theta}}{\boldsymbol{\theta}}^{\sf T})^{{\boldsymbol{\alpha}}-{\boldsymbol{\alpha}}^{\prime}}h_{{\boldsymbol{\alpha}}^{\prime}}({\boldsymbol{Z}})-\frac{1}{\sqrt{{\boldsymbol{\alpha}}!}}n^{-\frac{1}{2}|{\boldsymbol{\alpha}}|}\operatorname*{\mathbb{E}}[({\boldsymbol{\theta}}{\boldsymbol{\theta}}^{\sf T})^{\boldsymbol{\alpha}}]\right)\right\}.

Using orthogonality of the Hermite polynomials and recalling h0(z)=1h_{0}(z)=1, all the terms with 𝜶𝟎{\boldsymbol{\alpha}}^{\prime}\neq{\boldsymbol{0}} are zero in expectation, i.e.,

𝔼[H𝜶(Y)ψ(θ1)]\displaystyle\operatorname*{\mathbb{E}}[\mathscrsfs{H}_{\boldsymbol{\alpha}}(Y)\cdot\psi(\theta_{1})] =𝔼{ψ(θ1)n12|𝜶|𝜶!((𝜽𝜽𝖳)𝜶𝔼[(𝜽𝜽𝖳)𝜶])}\displaystyle=\operatorname*{\mathbb{E}}\left\{\psi(\theta_{1})\frac{n^{-\frac{1}{2}|{\boldsymbol{\alpha}}|}}{\sqrt{{\boldsymbol{\alpha}}!}}\left(({\boldsymbol{\theta}}{\boldsymbol{\theta}}^{{\sf T}})^{\boldsymbol{\alpha}}-\operatorname*{\mathbb{E}}[({\boldsymbol{\theta}}{\boldsymbol{\theta}}^{{\sf T}})^{\boldsymbol{\alpha}}]\right)\right\}
=𝖼𝗈𝗇𝗌𝗍(A)n12|𝜶|=𝖼𝗈𝗇𝗌𝗍(A)n12|E(A)|.\displaystyle=\mathsf{const}(A)\cdot n^{-\frac{1}{2}|{\boldsymbol{\alpha}}|}=\mathsf{const}(A)\cdot n^{-\frac{1}{2}|E(A)|}\,.

Putting it all together, we get

cA=𝖼𝗈𝗇𝗌𝗍(A)n12(|V(A)|1|E(A)|)(1+O(n1)).c_{A}=\mathsf{const}(A)\cdot n^{\frac{1}{2}(|V(A)|-1-|E(A)|)}\cdot\big{(}1+O(n^{-1})\big{)}\,.

Recall from above that cA=0c_{A}=0 unless AA has a single connected component and this component contains the root; we therefore restrict to AA of this type in what follows. Due to connectivity, any such AA has |V(A)||E(A)|+1|V(A)|\leq|E(A)|+1, implying cA=𝖼𝗈𝗇𝗌𝗍(A)+O(n1)c_{A}=\mathsf{const}(A)+O(n^{-1}) as desired. Now suppose furthermore that A𝒢D𝒯DA\in\mathcal{G}_{\leq D}\setminus\mathcal{T}_{\leq D}, i.e., AA contains a cycle (where we consider a self-loop or double-edge to be a cycle). In this case, |V(A)|<|E(A)|+1|V(A)|<|E(A)|+1, implying cA=O(n1/2)c_{A}=O(n^{-1/2}) as desired.
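
As a concrete check of the two cases, let A be the single edge at the root (|V(A)|=2, |E(A)|=1) and let A′ be the double edge between the root and one other vertex (|V(A′)|=2, |E(A′)|=2, which counts as containing a cycle). The exponent above gives

c_{A}=\mathsf{const}(A)+O(n^{-1})\,,\qquad c_{A^{\prime}}=\mathsf{const}(A^{\prime})\cdot n^{-1/2}\big(1+O(n^{-1})\big)=O(n^{-1/2})\,,

matching the two conclusions just stated.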

C.7 Proof of Lemma 4.8: Limit of 𝑴n{\boldsymbol{M}}_{n}

Throughout this proof, we will omit the subscript nn from 𝑴{\boldsymbol{M}}.

We will need to reason about the different possible intersection patterns between two rooted graphs A,B𝒢DA,B\in\mathcal{G}_{\leq D}. To this end, define a pattern to be a rooted colored multigraph where no vertices are isolated except possibly the root, every edge is colored either red or blue, the edge-induced subgraph of red edges (with the root always included) is isomorphic (as a rooted graph) to AA, and the edge-induced subgraph of blue edges is isomorphic to BB. Let 𝗉𝖺𝗍(A,B)\mathsf{pat}(A,B) denote the set of such patterns, up to isomorphism of rooted colored graphs. The number of such patterns is a constant depending on A,BA,B. For Π𝗉𝖺𝗍(A,B)\Pi\in\mathsf{pat}(A,B), let 𝖾𝗆𝖻(Π)\mathsf{emb}(\Pi) denote the set of embeddings of Π\Pi, where an embedding is defined to be a pair (ϕA,ϕB)(\phi_{A},\phi_{B}) with ϕA𝖾𝗆𝖻(A),ϕB𝖾𝗆𝖻(B)\phi_{A}\in\mathsf{emb}(A),\phi_{B}\in\mathsf{emb}(B) such that the induced colored graph on vertex set [n][n] is isomorphic to Π\Pi. We let |V(Π)||V(\Pi)| denote the number of vertices in the pattern (including the root).

We can write MAB=𝔼[HA(𝒀)HB(𝒀)]M_{AB}=\operatorname*{\mathbb{E}}[\mathscrsfs{H}_{A}({\boldsymbol{Y}})\mathscrsfs{H}_{B}({\boldsymbol{Y}})] as

MAB\displaystyle M_{AB} =1|𝖾𝗆𝖻(A)||𝖾𝗆𝖻(B)|ϕA𝖾𝗆𝖻(A),ϕB𝖾𝗆𝖻(B)𝔼[H𝗂𝗆(ϕA;A)(𝒀)H𝗂𝗆(ϕB;B)(𝒀)]\displaystyle=\frac{1}{\sqrt{|\mathsf{emb}(A)|\cdot|\mathsf{emb}(B)|}}\sum_{\phi_{A}\in\mathsf{emb}(A),\phi_{B}\in\mathsf{emb}(B)}\operatorname*{\mathbb{E}}[\mathscrsfs{H}_{\mathsf{im}(\phi_{A};A)}({\boldsymbol{Y}})\mathscrsfs{H}_{\mathsf{im}(\phi_{B};B)}({\boldsymbol{Y}})]
=1|𝖾𝗆𝖻(A)||𝖾𝗆𝖻(B)|Π𝗉𝖺𝗍(A,B)(ϕA,ϕB)𝖾𝗆𝖻(Π)𝔼[H𝗂𝗆(ϕA;A)(𝒀)H𝗂𝗆(ϕB;B)(𝒀)]\displaystyle=\frac{1}{\sqrt{|\mathsf{emb}(A)|\cdot|\mathsf{emb}(B)|}}\sum_{\Pi\in\mathsf{pat}(A,B)}\sum_{(\phi_{A},\phi_{B})\in\mathsf{emb}(\Pi)}\operatorname*{\mathbb{E}}[\mathscrsfs{H}_{\mathsf{im}(\phi_{A};A)}({\boldsymbol{Y}})\mathscrsfs{H}_{\mathsf{im}(\phi_{B};B)}({\boldsymbol{Y}})]
=1|𝖾𝗆𝖻(A)||𝖾𝗆𝖻(B)|Π𝗉𝖺𝗍(A,B)|𝖾𝗆𝖻(Π)|𝔼[H𝜶(Y)H𝜷(𝒀)],\displaystyle=\frac{1}{\sqrt{|\mathsf{emb}(A)|\cdot|\mathsf{emb}(B)|}}\sum_{\Pi\in\mathsf{pat}(A,B)}|\mathsf{emb}(\Pi)|\cdot\operatorname*{\mathbb{E}}[\mathscrsfs{H}_{\boldsymbol{\alpha}}(Y)\mathscrsfs{H}_{\boldsymbol{\beta}}({\boldsymbol{Y}})]\,,

where in the last line, (𝜶,𝜷)=(𝗂𝗆(ϕA;A),𝗂𝗆(ϕB;B))({\boldsymbol{\alpha}},{\boldsymbol{\beta}})=(\mathsf{im}(\phi_{A};A),\mathsf{im}(\phi_{B};B)) is the shadow of an arbitrary embedding (ϕA,ϕB)𝖾𝗆𝖻(Π)(\phi_{A},\phi_{B})\in\mathsf{emb}(\Pi), and we have used the fact that 𝔼[H𝜶(𝒀)H𝜷(𝒀)]\operatorname*{\mathbb{E}}[\mathscrsfs{H}_{\boldsymbol{\alpha}}({\boldsymbol{Y}})\mathscrsfs{H}_{\boldsymbol{\beta}}({\boldsymbol{Y}})] depends only on Π\Pi (not 𝜶,𝜷{\boldsymbol{\alpha}},{\boldsymbol{\beta}}). Recall from (102) that

|𝖾𝗆𝖻(A)|=n|V(A)|1(1+O(n1)),|\mathsf{emb}(A)|=n^{|V(A)|-1}(1+O(n^{-1}))\,,

and we similarly have

|𝖾𝗆𝖻(Π)|=𝖼𝗈𝗇𝗌𝗍(Π)(n1|V(Π)|1)=𝖼𝗈𝗇𝗌𝗍(Π)n|V(Π)|1(1+O(n1)).|\mathsf{emb}(\Pi)|=\mathsf{const}(\Pi)\cdot\binom{n-1}{|V(\Pi)|-1}=\mathsf{const}(\Pi)\cdot n^{|V(\Pi)|-1}(1+O(n^{-1}))\,.

Now compute using the Hermite facts from Section C.2, and recalling that 𝑿:=𝜽𝜽𝖳/n{\boldsymbol{X}}:={\boldsymbol{\theta}}{\boldsymbol{\theta}}^{{\sf T}}/\sqrt{n},

𝔼[H𝜶(𝒀)\displaystyle\operatorname*{\mathbb{E}}[\mathscrsfs{H}_{\boldsymbol{\alpha}}({\boldsymbol{Y}}) H𝜷(𝒀)]=𝔼(𝜸C(𝜶)(h𝜸(𝒀)𝔼h𝜸(𝒀)))(𝜹C(𝜷)(h𝜹(𝒀)𝔼h𝜹(𝒀)))\displaystyle\mathscrsfs{H}_{\boldsymbol{\beta}}({\boldsymbol{Y}})]=\operatorname*{\mathbb{E}}\left(\prod_{{\boldsymbol{\gamma}}\in C({\boldsymbol{\alpha}})}\left(h_{\boldsymbol{\gamma}}({\boldsymbol{Y}})-\operatorname*{\mathbb{E}}h_{\boldsymbol{\gamma}}({\boldsymbol{Y}})\right)\right)\left(\prod_{{\boldsymbol{\delta}}\in C({\boldsymbol{\beta}})}\left(h_{\boldsymbol{\delta}}({\boldsymbol{Y}})-\operatorname*{\mathbb{E}}h_{\boldsymbol{\delta}}({\boldsymbol{Y}})\right)\right)
=𝔼(𝜸C(𝜶)(𝟎𝜸𝜸𝜸!𝜸!(𝜸𝜸)𝑿𝜸𝜸h𝜸(𝒁)1𝜸!𝔼[𝑿𝜸]))\displaystyle=\operatorname*{\mathbb{E}}\left(\prod_{{\boldsymbol{\gamma}}\in C({\boldsymbol{\alpha}})}\left(\sum_{{\boldsymbol{0}}\leq{\boldsymbol{\gamma}}^{\prime}\leq{\boldsymbol{\gamma}}}\sqrt{\frac{{\boldsymbol{\gamma}}^{\prime}!}{{\boldsymbol{\gamma}}!}}\binom{{\boldsymbol{\gamma}}}{{\boldsymbol{\gamma}}^{\prime}}{\boldsymbol{X}}^{{\boldsymbol{\gamma}}-{\boldsymbol{\gamma}}^{\prime}}h_{{\boldsymbol{\gamma}}^{\prime}}({\boldsymbol{Z}})-\frac{1}{\sqrt{{\boldsymbol{\gamma}}!}}\operatorname*{\mathbb{E}}[{\boldsymbol{X}}^{\boldsymbol{\gamma}}]\right)\right)\cdot
(𝜹C(𝜷)(𝟎𝜹𝜹𝜹!𝜹!(𝜹𝜹)𝑿𝜹𝜹h𝜹(𝒁)1𝜹!𝔼[𝑿𝜹]))\displaystyle\qquad\qquad\left(\prod_{{\boldsymbol{\delta}}\in C({\boldsymbol{\beta}})}\left(\sum_{{\boldsymbol{0}}\leq{\boldsymbol{\delta}}^{\prime}\leq{\boldsymbol{\delta}}}\sqrt{\frac{{\boldsymbol{\delta}}^{\prime}!}{{\boldsymbol{\delta}}!}}\binom{{\boldsymbol{\delta}}}{{\boldsymbol{\delta}}^{\prime}}{\boldsymbol{X}}^{{\boldsymbol{\delta}}-{\boldsymbol{\delta}}^{\prime}}h_{{\boldsymbol{\delta}}^{\prime}}({\boldsymbol{Z}})-\frac{1}{\sqrt{{\boldsymbol{\delta}}!}}\operatorname*{\mathbb{E}}[{\boldsymbol{X}}^{\boldsymbol{\delta}}]\right)\right)
=𝔼(𝜸C(𝜶)(𝟎𝜸𝜸𝜸!𝜸!(𝜸𝜸)n12|𝜸𝜸|(𝜽𝜽𝖳)𝜸𝜸h𝜸(𝒁)1𝜸!n12|𝜸|𝔼[(𝜽𝜽𝖳)𝜸]))\displaystyle=\operatorname*{\mathbb{E}}\left(\prod_{{\boldsymbol{\gamma}}\in C({\boldsymbol{\alpha}})}\left(\sum_{{\boldsymbol{0}}\leq{\boldsymbol{\gamma}}^{\prime}\leq{\boldsymbol{\gamma}}}\sqrt{\frac{{\boldsymbol{\gamma}}^{\prime}!}{{\boldsymbol{\gamma}}!}}\binom{{\boldsymbol{\gamma}}}{{\boldsymbol{\gamma}}^{\prime}}n^{-\frac{1}{2}|{\boldsymbol{\gamma}}-{\boldsymbol{\gamma}}^{\prime}|}({\boldsymbol{\theta}}{\boldsymbol{\theta}}^{\sf T})^{{\boldsymbol{\gamma}}-{\boldsymbol{\gamma}}^{\prime}}h_{{\boldsymbol{\gamma}}^{\prime}}({\boldsymbol{Z}})-\frac{1}{\sqrt{{\boldsymbol{\gamma}}!}}n^{-\frac{1}{2}|{\boldsymbol{\gamma}}|}\operatorname*{\mathbb{E}}[({\boldsymbol{\theta}}{\boldsymbol{\theta}}^{\sf T})^{\boldsymbol{\gamma}}]\right)\right)\cdot
(𝜹C(𝜷)(𝟎𝜹𝜹𝜹!𝜹!(𝜹𝜹)n12|𝜹𝜹|(𝜽𝜽𝖳)𝜹𝜹h𝜹(𝒁)1𝜹!n12|𝜹|𝔼[(𝜽𝜽𝖳)𝜹])).\displaystyle\qquad\qquad\left(\prod_{{\boldsymbol{\delta}}\in C({\boldsymbol{\beta}})}\left(\sum_{{\boldsymbol{0}}\leq{\boldsymbol{\delta}}^{\prime}\leq{\boldsymbol{\delta}}}\sqrt{\frac{{\boldsymbol{\delta}}^{\prime}!}{{\boldsymbol{\delta}}!}}\binom{{\boldsymbol{\delta}}}{{\boldsymbol{\delta}}^{\prime}}n^{-\frac{1}{2}|{\boldsymbol{\delta}}-{\boldsymbol{\delta}}^{\prime}|}({\boldsymbol{\theta}}{\boldsymbol{\theta}}^{\sf T})^{{\boldsymbol{\delta}}-{\boldsymbol{\delta}}^{\prime}}h_{{\boldsymbol{\delta}}^{\prime}}({\boldsymbol{Z}})-\frac{1}{\sqrt{{\boldsymbol{\delta}}!}}n^{-\frac{1}{2}|{\boldsymbol{\delta}}|}\operatorname*{\mathbb{E}}[({\boldsymbol{\theta}}{\boldsymbol{\theta}}^{\sf T})^{\boldsymbol{\delta}}]\right)\right).

If 𝜶{\boldsymbol{\alpha}} has a non-empty (i.e., containing at least one edge) connected component that is vertex-disjoint from all non-empty connected components of 𝜷{\boldsymbol{\beta}} or vice-versa, then we have 𝔼[H𝜶(𝒀)H𝜷(𝒀)]=0\operatorname*{\mathbb{E}}[\mathscrsfs{H}_{\boldsymbol{\alpha}}({\boldsymbol{Y}})\mathscrsfs{H}_{\boldsymbol{\beta}}({\boldsymbol{Y}})]=0 from the first line above (due to independence between components). Otherwise we say that Π\Pi has “no isolated components” and denote by 𝗉𝖺𝗍(A,B)\mathsf{pat}^{*}(A,B) the set of patterns with this property. For Π𝗉𝖺𝗍(A,B)\Pi\in\mathsf{pat}^{*}(A,B), using orthogonality of the Hermite polynomials, the expansion of the expression above has a unique leading (in nn) term where 𝜸,𝜹{\boldsymbol{\gamma}}^{\prime},{\boldsymbol{\delta}}^{\prime} are as large as possible: namely, 𝜸=𝜸𝜷{\boldsymbol{\gamma}}^{\prime}={\boldsymbol{\gamma}}\wedge{\boldsymbol{\beta}} and 𝜹=𝜹𝜶{\boldsymbol{\delta}}^{\prime}={\boldsymbol{\delta}}\wedge{\boldsymbol{\alpha}} where \wedge denotes entrywise minimum between elements of E¯n\mathbb{N}^{\overline{E}_{n}}. Therefore there exists 𝖼𝗈𝗇𝗌𝗍(Π)\mathsf{const}(\Pi)\in\mathbb{R} independent of nn such that

𝔼[H𝜶(𝒀)H𝜷(𝒀)]\displaystyle\operatorname*{\mathbb{E}}[\mathscrsfs{H}_{\boldsymbol{\alpha}}({\boldsymbol{Y}})\mathscrsfs{H}_{\boldsymbol{\beta}}({\boldsymbol{Y}})] =(𝖼𝗈𝗇𝗌𝗍(Π)+O(n1/2))n12|𝜶(𝜶𝜷)|12|𝜷(𝜶𝜷)|\displaystyle=\big{(}\mathsf{const}(\Pi)+O(n^{-1/2})\big{)}\cdot n^{-\frac{1}{2}|{\boldsymbol{\alpha}}-({\boldsymbol{\alpha}}\wedge{\boldsymbol{\beta}})|-\frac{1}{2}|{\boldsymbol{\beta}}-({\boldsymbol{\alpha}}\wedge{\boldsymbol{\beta}})|}
=(𝖼𝗈𝗇𝗌𝗍(Π)+O(n1/2))n12|𝜶𝜷|,\displaystyle=\big{(}\mathsf{const}(\Pi)+O(n^{-1/2})\big{)}\cdot n^{-\frac{1}{2}|{\boldsymbol{\alpha}}\bigtriangleup{\boldsymbol{\beta}}|}\,,

where the symmetric difference is defined as (𝜶𝜷)ij:=|αijβij|({\boldsymbol{\alpha}}\bigtriangleup{\boldsymbol{\beta}})_{ij}:=|\alpha_{ij}-\beta_{ij}|.

Putting it all together, recalling that 𝔼[H𝜶(𝒀)H𝜷(𝒀)]=0\operatorname*{\mathbb{E}}[\mathscrsfs{H}_{\boldsymbol{\alpha}}({\boldsymbol{Y}})\mathscrsfs{H}_{\boldsymbol{\beta}}({\boldsymbol{Y}})]=0 whenever Π𝗉𝖺𝗍(A,B)𝗉𝖺𝗍(A,B)\Pi\in\mathsf{pat}(A,B)\setminus\mathsf{pat}^{*}(A,B), and possibly redefining 𝖼𝗈𝗇𝗌𝗍(Π)\mathsf{const}(\Pi), we obtain

MAB\displaystyle M_{AB} =n12(|V(A)|1)12(|V(B)|1)Π𝗉𝖺𝗍(A,B)(𝖼𝗈𝗇𝗌𝗍(Π)+O(n1/2))n(|V(Π)|1)12|αβ|\displaystyle=n^{-\frac{1}{2}(|V(A)|-1)-\frac{1}{2}(|V(B)|-1)}\sum_{\Pi\in\mathsf{pat}^{*}(A,B)}\big{(}\mathsf{const}(\Pi)+O(n^{-1/2})\big{)}\cdot n^{(|V(\Pi)|-1)-\frac{1}{2}|\alpha\bigtriangleup\beta|}
=(𝖼𝗈𝗇𝗌𝗍(A,B)+O(n1/2))nφ(A,B)\displaystyle=\big{(}\mathsf{const}(A,B)+O(n^{-1/2})\big{)}\cdot n^{\varphi(A,B)}

where

φ(A,B):=\displaystyle\varphi(A,B):= 12(|V(A)|+|V(B)|)+maxΠ𝗉𝖺𝗍(A,B)(|V(Π)|12|𝜶𝜷|)\displaystyle-\frac{1}{2}(|V(A)|+|V(B)|)+\max_{\Pi\in\mathsf{pat}^{*}(A,B)}\Big{(}|V(\Pi)|-\frac{1}{2}|{\boldsymbol{\alpha}}\bigtriangleup{\boldsymbol{\beta}}|\Big{)}
=\displaystyle= 12maxΠ𝗉𝖺𝗍(A,B)(|V(𝜶)V(𝜷)|+|V(𝜷)V(𝜶)||𝜶𝜷||𝜷𝜶|),\displaystyle\;\frac{1}{2}\max_{\Pi\in\mathsf{pat}^{*}(A,B)}\Big{(}|V({\boldsymbol{\alpha}})\setminus V({\boldsymbol{\beta}})|+|V({\boldsymbol{\beta}})\setminus V({\boldsymbol{\alpha}})|-|{\boldsymbol{\alpha}}\setminus{\boldsymbol{\beta}}|-|{\boldsymbol{\beta}}\setminus{\boldsymbol{\alpha}}|\Big{)}\,,

where (𝜶𝜷)ij:=max{0,αijβij}({\boldsymbol{\alpha}}\setminus{\boldsymbol{\beta}})_{ij}:=\max\{0,\alpha_{ij}-\beta_{ij}\}, and V(𝜶)V({\boldsymbol{\alpha}}) always includes the root by convention (see Section C.1). To complete the proof, it remains to show φ(A,B)0\varphi(A,B)\leq 0 for all A,BA,B, and furthermore φ(A,B)1/2\varphi(A,B)\leq-1/2 in the case A𝒯D,B𝒢D𝒯DA\in\mathcal{T}_{\leq D},B\in\mathcal{G}_{\leq D}\setminus\mathcal{T}_{\leq D}.
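
As a simple illustration of this exponent, take A=B to be the single edge at the root and let Π be the pattern in which the red and the blue edge coincide, so that 𝜶=𝜷 and |V(Π)|=2. For this pattern,

-\frac{1}{2}\big(|V(A)|+|V(B)|\big)+|V(\Pi)|-\frac{1}{2}|{\boldsymbol{\alpha}}\bigtriangleup{\boldsymbol{\beta}}|=-\frac{1}{2}(2+2)+2-0=0\,,

so φ(A,A)≥0; once the bound φ(A,B)≤0 below is established, this gives φ(A,A)=0, consistent with the diagonal entries of 𝑴∞ being strictly positive.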

For any A,B𝒢DA,B\in\mathcal{G}_{\leq D}, Π𝗉𝖺𝗍(A,B)\Pi\in\mathsf{pat}^{*}(A,B), and (𝜶,𝜷)=(𝗂𝗆(ϕ;A),𝗂𝗆(ϕ;B))({\boldsymbol{\alpha}},{\boldsymbol{\beta}})=(\mathsf{im}(\phi;A),\mathsf{im}(\phi;B)) for ϕ\phi an embedding of Π\Pi, each non-empty connected component 𝜸{\boldsymbol{\gamma}} of 𝜶𝜷{\boldsymbol{\alpha}}\setminus{\boldsymbol{\beta}} spans at most |𝜸|+1|{\boldsymbol{\gamma}}|+1 vertices. Of these vertices, none belong to V(𝜷)V(𝜶)V({\boldsymbol{\beta}})\setminus V({\boldsymbol{\alpha}}) and due to the “no isolated components” constraint imposed by 𝗉𝖺𝗍(A,B)\mathsf{pat}^{*}(A,B), at most |𝜸||{\boldsymbol{\gamma}}| of them (i.e., all but one) can belong to V(𝜶)V(𝜷)V({\boldsymbol{\alpha}})\setminus V({\boldsymbol{\beta}}). Every vertex in V(𝜶)V(𝜷)V({\boldsymbol{\alpha}})\setminus V({\boldsymbol{\beta}}) belongs to some non-empty component of 𝜶𝜷{\boldsymbol{\alpha}}\setminus{\boldsymbol{\beta}} (since the root belongs to V(𝜶)V(𝜷)V({\boldsymbol{\alpha}})\cap V({\boldsymbol{\beta}})), so we have |V(𝜶)V(𝜷)||𝜶𝜷||V({\boldsymbol{\alpha}})\setminus V({\boldsymbol{\beta}})|\leq|{\boldsymbol{\alpha}}\setminus{\boldsymbol{\beta}}|. Similarly, |V(𝜷)V(𝜶)||𝜷𝜶||V({\boldsymbol{\beta}})\setminus V({\boldsymbol{\alpha}})|\leq|{\boldsymbol{\beta}}\setminus{\boldsymbol{\alpha}}|, implying φ(A,B)0\varphi(A,B)\leq 0 as desired.

In order to have equality φ(A,B)=0\varphi(A,B)=0 in the above argument, every non-empty connected component of 𝜶𝜷{\boldsymbol{\alpha}}\setminus{\boldsymbol{\beta}} must be a simple (i.e., without self-loops or multi-edges) tree that spans only one vertex in V(𝜷)V({\boldsymbol{\beta}}), and similarly every non-empty connected component of 𝜷𝜶{\boldsymbol{\beta}}\setminus{\boldsymbol{\alpha}} must be a simple tree that spans only one vertex in V(𝜶)V({\boldsymbol{\alpha}}). It remains to show that this is impossible when A𝒯D,B𝒢D𝒯DA\in\mathcal{T}_{\leq D},B\in\mathcal{G}_{\leq D}\setminus\mathcal{T}_{\leq D}. In this case AA is a simple rooted tree. We can assume AA has at least one edge, or else the “no isolated components” property must fail (since AA has no non-empty components). We will consider a few different cases for BB.

First consider the case where the root is isolated in BB. This means 𝜶𝜷{\boldsymbol{\alpha}}\setminus{\boldsymbol{\beta}} has a non-empty connected component containing the root \circ. Since AA is a simple rooted tree and due to “no isolated components,” this component also contains a vertex uu belonging to some non-empty component of 𝜷{\boldsymbol{\beta}}. However, now we have a non-empty component of 𝜶𝜷{\boldsymbol{\alpha}}\setminus{\boldsymbol{\beta}} that spans at least two vertices in V(𝜷)V({\boldsymbol{\beta}}), namely \circ and uu, and so by the discussion above, it is impossible to have equality φ(A,B)=0\varphi(A,B)=0.

Next consider the case where the root is not isolated in BB, but BB has multiple connected components. Let 𝜸{\boldsymbol{\gamma}} be a non-empty component of 𝜷{\boldsymbol{\beta}} that does not contain the root. Due to “no isolated components,” 𝜸{\boldsymbol{\gamma}} must span some (non-root) vertex uV(𝜶)u\in V({\boldsymbol{\alpha}}). Since 𝜶{\boldsymbol{\alpha}} is a simple rooted tree, there is a unique path in 𝜶{\boldsymbol{\alpha}} from uu to \circ. Recall that V(𝜷)\circ\in V({\boldsymbol{\beta}}) by convention, so both endpoints of this path belong to V(𝜷)V({\boldsymbol{\beta}}). Also, this path must contain an edge in 𝜶𝜷{\boldsymbol{\alpha}}\setminus{\boldsymbol{\beta}}, or else 𝜸{\boldsymbol{\gamma}} would be connected to the root. This means that along the path, we can find a non-empty component of 𝜶𝜷{\boldsymbol{\alpha}}\setminus{\boldsymbol{\beta}} that spans two different vertices in V(𝜷)V({\boldsymbol{\beta}}), and so it is impossible to have equality φ(A,B)=0\varphi(A,B)=0.

The last case to consider is where BB contains a cycle. This cycle must contain at least one edge in 𝜷𝜶{\boldsymbol{\beta}}\setminus{\boldsymbol{\alpha}}, since 𝜶{\boldsymbol{\alpha}} has no cycles. Recall from above that in order to have φ(A,B)=0\varphi(A,B)=0, every non-empty component of 𝜷𝜶{\boldsymbol{\beta}}\setminus{\boldsymbol{\alpha}} must be a simple tree, which means the cycle also contains at least one edge in 𝜶{\boldsymbol{\alpha}}. Now we know the cycle in 𝜷{\boldsymbol{\beta}} has at least one edge in 𝜷𝜶{\boldsymbol{\beta}}\setminus{\boldsymbol{\alpha}} and at least one edge in 𝜶{\boldsymbol{\alpha}}, but this means 𝜷𝜶{\boldsymbol{\beta}}\setminus{\boldsymbol{\alpha}} has a non-empty connected component that spans at least two vertices in V(𝜶)V({\boldsymbol{\alpha}}), and so it is impossible to have φ(A,B)=0\varphi(A,B)=0.

C.8 Limit of 𝒄n,𝑴n1𝒄n\langle{\boldsymbol{c}}_{n},{\boldsymbol{M}}_{n}^{-1}{\boldsymbol{c}}_{n}\rangle

Lemma C.2.

Under the assumptions of Proposition 4.3, and with 𝐌{\boldsymbol{M}}_{\infty}, 𝐜{\boldsymbol{c}}_{\infty} as in the statement of Lemmas 4.7 and 4.8, we have

limn𝒄n,𝑴n1𝒄n=𝒄,𝑴1𝒄.\displaystyle\lim_{n\to\infty}\langle{\boldsymbol{c}}_{n},{\boldsymbol{M}}_{n}^{-1}{\boldsymbol{c}}_{n}\rangle=\langle{\boldsymbol{c}}_{\infty},{\boldsymbol{M}}_{\infty}^{-1}{\boldsymbol{c}}_{\infty}\rangle\,. (103)
Proof.

From Lemma 4.5 we have λmin(𝑴n)=Ω(1)\lambda_{\mathrm{min}}({\boldsymbol{M}}_{n})=\Omega(1). As a direct consequence of Lemma 4.8,

𝑴n𝑴F=O(n1/2).\|{\boldsymbol{M}}_{n}-{\boldsymbol{M}}_{\infty}\|_{F}=O(n^{-1/2})\,.

This means, using op\|\;\cdot\;\|_{\mbox{\rm\tiny op}} for matrix operator norm,

λmin(𝑴)λmin(𝑴n)𝑴n𝑴opλmin(𝑴n)𝑴n𝑴F𝖼𝗈𝗇𝗌𝗍>0,\lambda_{\mathrm{min}}({\boldsymbol{M}}_{\infty})\geq\lambda_{\mathrm{min}}({\boldsymbol{M}}_{n})-\|{\boldsymbol{M}}_{n}-{\boldsymbol{M}}_{\infty}\|_{\mbox{\rm\tiny op}}\geq\lambda_{\mathrm{min}}({\boldsymbol{M}}_{n})-\|{\boldsymbol{M}}_{n}-{\boldsymbol{M}}_{\infty}\|_{F}\geq\mathsf{const}>0\,,

implying that 𝑴{\boldsymbol{M}}_{\infty} is symmetric positive definite and thus invertible. The result is now immediate from Lemmas 4.7 and 4.8 because 𝑴n{\boldsymbol{M}}_{n} has fixed dimension and the entries of 𝑴n1=adj(𝑴n)/det(𝑴n){\boldsymbol{M}}^{-1}_{n}=\mathrm{adj}({\boldsymbol{M}}_{n})/\mathrm{det}({\boldsymbol{M}}_{n}) are differentiable functions of the entries of 𝑴n{\boldsymbol{M}}_{n} (in a neighborhood of 𝑴{\boldsymbol{M}}_{\infty}). ∎
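
For completeness, the last step can be made quantitative via the resolvent identity

{\boldsymbol{M}}_{n}^{-1}-{\boldsymbol{M}}_{\infty}^{-1}={\boldsymbol{M}}_{n}^{-1}({\boldsymbol{M}}_{\infty}-{\boldsymbol{M}}_{n}){\boldsymbol{M}}_{\infty}^{-1}\,,\qquad\|{\boldsymbol{M}}_{n}^{-1}-{\boldsymbol{M}}_{\infty}^{-1}\|_{\mbox{\rm\tiny op}}\leq\|{\boldsymbol{M}}_{n}^{-1}\|_{\mbox{\rm\tiny op}}\,\|{\boldsymbol{M}}_{\infty}^{-1}\|_{\mbox{\rm\tiny op}}\,\|{\boldsymbol{M}}_{n}-{\boldsymbol{M}}_{\infty}\|_{\mbox{\rm\tiny op}}=O(n^{-1/2})\,,

which, combined with the entrywise convergence of 𝒄n from Lemma 4.7, yields (103).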

C.9 Proof of Lemma 4.9: Accuracy of rr

First, we claim that 𝒅,𝑷1𝒅=𝒄,𝑴1𝒄\langle{\boldsymbol{d}}_{\infty},{\boldsymbol{P}}_{\infty}^{-1}{\boldsymbol{d}}_{\infty}\rangle=\langle{\boldsymbol{c}}_{\infty},{\boldsymbol{M}}_{\infty}^{-1}{\boldsymbol{c}}_{\infty}\rangle. Recall the block form of 𝑴n{\boldsymbol{M}}_{n} and 𝒄n{\boldsymbol{c}}_{n} in Eq. (45), and that 𝑹=0{\boldsymbol{R}}_{\infty}=0 by Lemma 4.8; therefore

𝑴=[𝑷𝟎𝟎𝑸]and𝑴1=[𝑷100𝑸1].{\boldsymbol{M}}_{\infty}=\left[\begin{array}[]{cc}{\boldsymbol{P}}_{\infty}&{\boldsymbol{0}}\\ {\boldsymbol{0}}&{\boldsymbol{Q}}_{\infty}\end{array}\right]\qquad\text{and}\qquad{\boldsymbol{M}}_{\infty}^{-1}=\left[\begin{array}[]{cc}{\boldsymbol{P}}_{\infty}^{-1}&0\\ 0&{\boldsymbol{Q}}_{\infty}^{-1}\end{array}\right].

The claim follows because 𝒆=0{\boldsymbol{e}}_{\infty}=0 (see Eq. (45) and Lemma 4.7).

Now as nn\to\infty we have

𝔼[r(𝒀)ψ(θ1)]=𝒓^,𝒅n=𝒅n,𝑷1𝒅𝒅,𝑷1𝒅,\operatorname*{\mathbb{E}}[r({\boldsymbol{Y}})\cdot\psi(\theta_{1})]=\langle\hat{\boldsymbol{r}},{\boldsymbol{d}}_{n}\rangle=\langle{\boldsymbol{d}}_{n},{\boldsymbol{P}}_{\infty}^{-1}{\boldsymbol{d}}_{\infty}\rangle\,\to\,\langle{\boldsymbol{d}}_{\infty},{\boldsymbol{P}}_{\infty}^{-1}{\boldsymbol{d}}_{\infty}\rangle\,,

and

𝔼[r(𝒀)2]=𝒓^,𝑷n𝒓^=𝒅,𝑷1𝑷n𝑷1𝒅𝒅,𝑷1𝒅,\operatorname*{\mathbb{E}}[r({\boldsymbol{Y}})^{2}]=\langle\hat{\boldsymbol{r}},{\boldsymbol{P}}_{n}\hat{\boldsymbol{r}}\rangle=\langle{\boldsymbol{d}}_{\infty},{\boldsymbol{P}}_{\infty}^{-1}{\boldsymbol{P}}_{n}{\boldsymbol{P}}^{-1}_{\infty}{\boldsymbol{d}}_{\infty}\rangle\,\to\,\langle{\boldsymbol{d}}_{\infty},{\boldsymbol{P}}_{\infty}^{-1}{\boldsymbol{d}}_{\infty}\rangle\,,

completing the proof.

C.10 Proof of Lemma 4.11: Change-of-basis

Fix A𝒯DA\in\mathcal{T}_{\leq D}. If A=A=\emptyset then HA=FA=1\mathscrsfs{H}_{A}=\mathscrsfs{F}_{A}=1 and the proof is complete, so suppose AA\neq\emptyset. Recall that our goal is to take

HA(𝒀):=1|𝖾𝗆𝖻(A)|ϕ𝖾𝗆𝖻(A)H𝗂𝗆(ϕ;A)(𝒀)\mathscrsfs{H}_{A}({\boldsymbol{Y}}):=\frac{1}{\sqrt{|\mathsf{emb}(A)|}}\sum_{\phi\in\mathsf{emb}(A)}\mathscrsfs{H}_{\mathsf{im}(\phi;A)}({\boldsymbol{Y}})

and approximate it in the basis {FB:B𝒯D}\{\mathscrsfs{F}_{B}\,:\,B\in\mathcal{T}_{\leq D}\}, where

FB(𝒀):=1|𝗇𝗋(B)|ϕ𝗇𝗋(B)𝒀ϕ,\mathscrsfs{F}_{B}({\boldsymbol{Y}}):=\frac{1}{\sqrt{|\mathsf{nr}(B)|}}\sum_{\phi\in\mathsf{nr}(B)}{\boldsymbol{Y}}^{\phi}\,,

where we use the shorthand

𝒀ϕ:=𝒀𝗂𝗆(ϕ;B)=(i,j)E(B)Yϕ(i),ϕ(j).{\boldsymbol{Y}}^{\phi}:={\boldsymbol{Y}}^{\mathsf{im}(\phi;B)}=\prod_{(i,j)\in E(B)}Y_{\phi(i),\phi(j)}\,.

Since AA is a tree with at least one edge,

HA(𝒀)\displaystyle\mathscrsfs{H}_{A}({\boldsymbol{Y}}) =1|𝖾𝗆𝖻(A)|ϕ𝖾𝗆𝖻(A)(h𝗂𝗆(ϕ;A)(𝒀)𝔼h𝗂𝗆(ϕ)(𝒀))\displaystyle=\frac{1}{\sqrt{|\mathsf{emb}(A)|}}\sum_{\phi\in\mathsf{emb}(A)}(h_{\mathsf{im}(\phi;A)}({\boldsymbol{Y}})-\operatorname{\mathbb{E}}h_{\mathsf{im}(\phi)}({\boldsymbol{Y}}))
=1|𝖾𝗆𝖻(A)|ϕ𝖾𝗆𝖻(A)𝒀ϕ|𝖾𝗆𝖻(A)|𝔼𝒀ϕ,\displaystyle=\frac{1}{\sqrt{|\mathsf{emb}(A)|}}\sum_{\phi\in\mathsf{emb}(A)}{\boldsymbol{Y}}^{\phi}-\sqrt{|\mathsf{emb}(A)|}\operatorname*{\mathbb{E}}{\boldsymbol{Y}}^{\phi^{*}}\,, (104)

for an arbitrary ϕ𝖾𝗆𝖻(A)\phi^{*}\in\mathsf{emb}(A).

We first focus on the non-random term |𝖾𝗆𝖻(A)|𝔼𝒀ϕ\sqrt{|\mathsf{emb}(A)|}\operatorname*{\mathbb{E}}{\boldsymbol{Y}}^{\phi^{*}}, which will be easy to expand in the basis {FB}\{\mathscrsfs{F}_{B}\} because F=1\mathscrsfs{F}_{\emptyset}=1. Let

mA=limn|𝖾𝗆𝖻(A)|𝔼𝒀ϕ,m_{A}=\lim_{n\to\infty}\sqrt{|\mathsf{emb}(A)|}\operatorname*{\mathbb{E}}{\boldsymbol{Y}}^{\phi^{*}}\,,

noting that this limit exists by combining (102) with the fact

𝔼𝒀ϕ=𝖼𝗈𝗇𝗌𝗍(A)n12|E(A)|=𝖼𝗈𝗇𝗌𝗍(A)n12(|V(A)|1).\operatorname*{\mathbb{E}}{\boldsymbol{Y}}^{\phi^{*}}=\mathsf{const}(A)\cdot n^{-\frac{1}{2}|E(A)|}=\mathsf{const}(A)\cdot n^{-\frac{1}{2}(|V(A)|-1)}\,.

We can now write

|𝖾𝗆𝖻(A)|𝔼𝒀ϕ=mAF+EA,0\sqrt{|\mathsf{emb}(A)|}\operatorname*{\mathbb{E}}{\boldsymbol{Y}}^{\phi^{*}}=m_{A}\mathscrsfs{F}_{\emptyset}+\mathscrsfs{E}_{A,0}

where EA,0:=|𝖾𝗆𝖻(A)|𝔼𝒀ϕmA\mathscrsfs{E}_{A,0}:=\sqrt{|\mathsf{emb}(A)|}\operatorname*{\mathbb{E}}{\boldsymbol{Y}}^{\phi^{*}}-m_{A} satisfies EA,0=o(1)\mathscrsfs{E}_{A,0}=o(1) and so (being non-random) 𝔼[EA,02]=o(1)\operatorname*{\mathbb{E}}[\mathscrsfs{E}_{A,0}^{2}]=o(1).

The error term EA\mathscrsfs{E}_{A} will be the sum of KK terms EA=i=1KEA,i\mathscrsfs{E}_{A}=\sum_{i=1}^{K}\mathscrsfs{E}_{A,i}, with KK independent of nn. It suffices to show 𝔼[EA,i2]=o(1)\operatorname*{\mathbb{E}}[\mathscrsfs{E}_{A,i}^{2}]=o(1) for each term individually, since the triangle inequality then gives

𝔼[EA2]1/2i=1K𝔼[EA,i2]1/2=o(1).\operatorname*{\mathbb{E}}[\mathscrsfs{E}_{A}^{2}]^{1/2}\leq\sum_{i=1}^{K}\operatorname*{\mathbb{E}}[\mathscrsfs{E}_{A,i}^{2}]^{1/2}=o(1)\,. (105)

We next handle the more substantial term in (104), namely

GA(𝒀)\displaystyle G_{A}({\boldsymbol{Y}}) :=1|𝖾𝗆𝖻(A)|ϕ𝖾𝗆𝖻(A)𝒀ϕ\displaystyle:=\frac{1}{\sqrt{|\mathsf{emb}(A)|}}\sum_{\phi\in\mathsf{emb}(A)}{\boldsymbol{Y}}^{\phi} (106)
=1|𝖾𝗆𝖻(A)|ϕ𝗇𝗋(A)𝒀ϕ1|𝖾𝗆𝖻(A)|ϕ𝗇𝗋(A)𝖾𝗆𝖻(A)𝒀ϕ.\displaystyle=\frac{1}{\sqrt{|\mathsf{emb}(A)|}}\sum_{\phi\in\mathsf{nr}(A)}{\boldsymbol{Y}}^{\phi}-\frac{1}{\sqrt{|\mathsf{emb}(A)|}}\sum_{\phi\in\mathsf{nr}(A)\setminus\mathsf{emb}(A)}{\boldsymbol{Y}}^{\phi}\,.

For the first term,

1|𝖾𝗆𝖻(A)|ϕ𝗇𝗋(A)𝒀ϕ\displaystyle\frac{1}{\sqrt{|\mathsf{emb}(A)|}}\sum_{\phi\in\mathsf{nr}(A)}{\boldsymbol{Y}}^{\phi} =|𝗇𝗋(A)||𝖾𝗆𝖻(A)|FA(𝒀)\displaystyle=\sqrt{\frac{|\mathsf{nr}(A)|}{|\mathsf{emb}(A)|}}\mathscrsfs{F}_{A}({\boldsymbol{Y}})
=FA(𝒀)+(|𝗇𝗋(A)||𝖾𝗆𝖻(A)|1)FA(𝒀)=:FA(𝒀)+EA,1(𝒀).\displaystyle=\mathscrsfs{F}_{A}({\boldsymbol{Y}})+\left(\sqrt{\frac{|\mathsf{nr}(A)|}{|\mathsf{emb}(A)|}}-1\right)\mathscrsfs{F}_{A}({\boldsymbol{Y}})=:\mathscrsfs{F}_{A}({\boldsymbol{Y}})+\mathscrsfs{E}_{A,1}({\boldsymbol{Y}})\,.

Since limn(|𝗇𝗋(A)|/|𝖾𝗆𝖻(A)|)=1\lim_{n\to\infty}(|\mathsf{nr}(A)|/|\mathsf{emb}(A)|)=1 (in fact, embeddings make up a 1o(1)1-o(1) fraction of all labelings, not just non-reversing ones) and 𝔼[FA2]=O(1)\operatorname*{\mathbb{E}}[\mathscrsfs{F}_{A}^{2}]=O(1) (similarly to the proof of Lemma C.3 below), we have 𝔼[EA,12]=o(1)\operatorname*{\mathbb{E}}[\mathscrsfs{E}_{A,1}^{2}]=o(1).
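
For completeness, this limit follows from a crude union bound: a labeling ϕ of A with ϕ(∘)=1 that is not an embedding must assign the same label to two of the |V(A)| vertices, so

0\leq|\mathsf{nr}(A)|-|\mathsf{emb}(A)|\leq\binom{|V(A)|}{2}\,n^{|V(A)|-2}=O\big(n^{-1}\,|\mathsf{emb}(A)|\big)\,.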

Returning to the remaining term in (106), we partition 𝗇𝗋(A)𝖾𝗆𝖻(A)\mathsf{nr}(A)\setminus\mathsf{emb}(A) into two sets L1=L1(A)L_{1}=L_{1}(A) and L2=L2(A)L_{2}=L_{2}(A), defined as follows. If the multigraph 𝜶:=𝗂𝗆(ϕ){\boldsymbol{\alpha}}:=\mathsf{im}(\phi) has no triple-edges or higher (i.e., αij2\alpha_{ij}\leq 2 for all iji\leq j) and has no cycles (namely, the simple graph 𝜷{\boldsymbol{\beta}} defined by βij=𝟙αij1\beta_{ij}=\mathbbm{1}_{\alpha_{ij}\geq 1} has no cycles) then we let ϕL1\phi\in L_{1}; otherwise ϕL2\phi\in L_{2}. The remaining term to handle is

1|𝖾𝗆𝖻(A)|ϕ𝗇𝗋(A)𝖾𝗆𝖻(A)𝒀ϕ\displaystyle\frac{1}{\sqrt{|\mathsf{emb}(A)|}}\sum_{\phi\in\mathsf{nr}(A)\setminus\mathsf{emb}(A)}{\boldsymbol{Y}}^{\phi} =1|𝖾𝗆𝖻(A)|ϕL1𝒀ϕ+1|𝖾𝗆𝖻(A)|ϕL2𝒀ϕ\displaystyle=\frac{1}{\sqrt{|\mathsf{emb}(A)|}}\sum_{\phi\in L_{1}}{\boldsymbol{Y}}^{\phi}+\frac{1}{\sqrt{|\mathsf{emb}(A)|}}\sum_{\phi\in L_{2}}{\boldsymbol{Y}}^{\phi}
=:1|𝖾𝗆𝖻(A)|ϕL1𝒀ϕ+EA,2(𝒀).\displaystyle=:\frac{1}{\sqrt{|\mathsf{emb}(A)|}}\sum_{\phi\in L_{1}}{\boldsymbol{Y}}^{\phi}+\mathscrsfs{E}_{A,2}({\boldsymbol{Y}})\,. (107)
Lemma C.3.

𝔼[EA,22]=o(1)\operatorname*{\mathbb{E}}[\mathscrsfs{E}_{A,2}^{2}]=o(1).

Proof.

We define the set of patterns Π𝗉𝖺𝗍2(A,B)\Pi\in\mathsf{pat}_{2}(A,B) between two rooted trees A,B𝒯DA,B\in\mathcal{T}_{\leq D} as in Section C.7, but now restricted to labelings in L2L_{2}. Namely, the edge-induced subgraph of red edges of Π𝗉𝖺𝗍2(A,B)\Pi\in\mathsf{pat}_{2}(A,B) should be isomorphic not to AA but rather to the image of some ϕL2(A)\phi\in L_{2}(A) (and similarly for the blue edges and BB). We write (ϕ1,ϕ2)𝖾𝗆𝖻(Π)(\phi_{1},\phi_{2})\in\mathsf{emb}(\Pi) to denote an embedding of this pattern into the vertex set [n][n]. Similarly to Section C.7, we compute

𝔼[EA,22]\displaystyle\operatorname*{\mathbb{E}}[\mathscrsfs{E}_{A,2}^{2}] =1|𝖾𝗆𝖻(A)|ϕ1,ϕ2L2(A)𝔼[𝒀ϕ1𝒀ϕ2]\displaystyle=\frac{1}{|\mathsf{emb}(A)|}\sum_{\phi_{1},\phi_{2}\in L_{2}(A)}\operatorname*{\mathbb{E}}[{\boldsymbol{Y}}^{\phi_{1}}{\boldsymbol{Y}}^{\phi_{2}}]
=1|𝖾𝗆𝖻(A)|Π𝗉𝖺𝗍2(A,A)(ϕ1,ϕ2)𝖾𝗆𝖻(Π)𝔼[𝒀ϕ1𝒀ϕ2]\displaystyle=\frac{1}{|\mathsf{emb}(A)|}\sum_{\Pi\in\mathsf{pat}_{2}(A,A)}\,\sum_{(\phi_{1},\phi_{2})\in\mathsf{emb}(\Pi)}\operatorname*{\mathbb{E}}[{\boldsymbol{Y}}^{\phi_{1}}{\boldsymbol{Y}}^{\phi_{2}}]
=1|𝖾𝗆𝖻(A)|Π𝗉𝖺𝗍2(A,A)|𝖾𝗆𝖻(Π)|𝔼[𝒀𝜶+𝜷]\displaystyle=\frac{1}{|\mathsf{emb}(A)|}\sum_{\Pi\in\mathsf{pat}_{2}(A,A)}|\mathsf{emb}(\Pi)|\cdot\operatorname*{\mathbb{E}}[{\boldsymbol{Y}}^{{\boldsymbol{\alpha}}+{\boldsymbol{\beta}}}]
where (𝜶,𝜷)=(𝗂𝗆(ϕ1;A),𝗂𝗆(ϕ2;A))({\boldsymbol{\alpha}},{\boldsymbol{\beta}})=(\mathsf{im}(\phi_{1};A),\mathsf{im}(\phi_{2};A)) is the image of an arbitrary embedding of Π\Pi
=Θ(n1|V(A)|)Π𝗉𝖺𝗍2(A,A)O(n|V(Π)|1n12𝗈𝖽𝖽(Π))\displaystyle=\Theta(n^{1-|V(A)|})\sum_{\Pi\in\mathsf{pat}_{2}(A,A)}O(n^{|V(\Pi)|-1}\cdot n^{-\frac{1}{2}\mathsf{odd}(\Pi)})
where 𝗈𝖽𝖽(Π)\mathsf{odd}(\Pi) denotes the number of edges iji\leq j for which αij+βij\alpha_{ij}+\beta_{ij} is odd, which depends only on Π\Pi (not 𝜶,𝜷{\boldsymbol{\alpha}},{\boldsymbol{\beta}}). Now since AA is a tree, |E(Π)|=2|E(A)|=2(|V(A)|1)|E(\Pi)|=2|E(A)|=2(|V(A)|-1) and so the above becomes
=Π𝗉𝖺𝗍2(A,A)O(n|V(Π)|12|E(Π)|112𝗈𝖽𝖽(Π))\displaystyle=\sum_{\Pi\in\mathsf{pat}_{2}(A,A)}O(n^{|V(\Pi)|-\frac{1}{2}|E(\Pi)|-1-\frac{1}{2}\mathsf{odd}(\Pi)}) (108)

and so it remains to show that the exponent is strictly negative for any Π𝗉𝖺𝗍2(A,A)\Pi\in\mathsf{pat}_{2}(A,A).

Fix Π𝗉𝖺𝗍2(A,A)\Pi\in\mathsf{pat}_{2}(A,A). Let ss denote the number of single-edges in Π\Pi (ignoring the edge colors), let dd denote the number of double-edges, let oo denote the number of kk-edges for k3k\geq 3 odd, and let ee denote the number of kk-edges for k4k\geq 4 even. By definition we have

𝗈𝖽𝖽(Π)=s+o,\mathsf{odd}(\Pi)=s+o\,,
|E(Π)|s+2d+3o+4e.|E(\Pi)|\geq s+2d+3o+4e\,.

Also, since AA (and therefore Π\Pi) is connected,

|V(Π)|s+d+o+e+1𝟙cycle|V(\Pi)|\leq s+d+o+e+1-\mathbbm{1}_{\mathrm{cycle}}

where 𝟙cycle\mathbbm{1}_{\mathrm{cycle}} is the indicator that Π\Pi (viewed as a simple graph by replacing each multi-edge by a single-edge) contains a cycle. Now the exponent in (108) is

|V(Π)|12|E(Π)|112𝗈𝖽𝖽(Π)\displaystyle|V(\Pi)|-\frac{1}{2}|E(\Pi)|-1-\frac{1}{2}\mathsf{odd}(\Pi) (s+d+o+e+1𝟙cycle)12(s+2d+3o+4e)112(s+o)\displaystyle\leq(s+d+o+e+1-\mathbbm{1}_{\mathrm{cycle}})-\frac{1}{2}(s+2d+3o+4e)-1-\frac{1}{2}(s+o)
=(o+e+𝟙cycle).\displaystyle=-(o+e+\mathbbm{1}_{\mathrm{cycle}})\,.

By the definition of L2L_{2}, we must either have o+e1o+e\geq 1 or 𝟙cycle=1\mathbbm{1}_{\mathrm{cycle}}=1, which means the exponent is 1\leq-1, completing the proof. ∎

It remains to handle the first term in Eq. (107). For ϕL1(A)\phi\in L_{1}(A), define the “skeleton” 𝗌𝗄(ϕ)E¯n\mathsf{sk}(\phi)\in\mathbb{N}^{\overline{E}_{n}} to be the subgraph of 𝗂𝗆(ϕ)\mathsf{im}(\phi) obtained from 𝜶:=𝗂𝗆(ϕ;A){\boldsymbol{\alpha}}:=\mathsf{im}(\phi;A) as follows: delete all multi-edges (i.e., whenever αij2\alpha_{ij}\geq 2, set αij=0\alpha_{ij}=0) and then take the connected component of the root (vertex 1). Using the definition of L1(A)L_{1}(A) and recalling that ϕL1(A)\phi\in L_{1}(A) is not an embedding, note that 𝗌𝗄(ϕ)\mathsf{sk}(\phi) is always a simple tree with strictly fewer edges than AA. The remaining term to handle is

1|𝖾𝗆𝖻(A)|ϕL1𝒀ϕ\displaystyle\frac{1}{\sqrt{|\mathsf{emb}(A)|}}\sum_{\phi\in L_{1}}{\boldsymbol{Y}}^{\phi} =1|𝖾𝗆𝖻(A)|ϕL1Cϕ𝒀𝗌𝗄(ϕ)+1|𝖾𝗆𝖻(A)|ϕL1(𝒀ϕCϕ𝒀𝗌𝗄(ϕ))\displaystyle=\frac{1}{\sqrt{|\mathsf{emb}(A)|}}\sum_{\phi\in L_{1}}C_{\phi}{\boldsymbol{Y}}^{\mathsf{sk}(\phi)}+\frac{1}{\sqrt{|\mathsf{emb}(A)|}}\sum_{\phi\in L_{1}}\left({\boldsymbol{Y}}^{\phi}-C_{\phi}{\boldsymbol{Y}}^{\mathsf{sk}(\phi)}\right)
=:1|𝖾𝗆𝖻(A)|ϕL1Cϕ𝒀𝗌𝗄(ϕ)+EA,3(Y)\displaystyle=:\frac{1}{\sqrt{|\mathsf{emb}(A)|}}\sum_{\phi\in L_{1}}C_{\phi}{\boldsymbol{Y}}^{\mathsf{sk}(\phi)}+\mathscrsfs{E}_{A,3}(Y) (109)

where

Cϕ:=𝔼[𝒀𝗂𝗆(ϕ)𝗌𝗄(ϕ)].C_{\phi}:=\operatorname*{\mathbb{E}}[{\boldsymbol{Y}}^{\mathsf{im}(\phi)-\mathsf{sk}(\phi)}]\,.
Lemma C.4.

𝔼[EA,32]=o(1)\operatorname*{\mathbb{E}}[\mathscrsfs{E}_{A,3}^{2}]=o(1).

Proof.

Define 𝗉𝖺𝗍1(A,B)\mathsf{pat}_{1}(A,B) as in the proof of Lemma C.3 except with L1L_{1} in place of L2L_{2}. Our goal is to bound

𝔼[EA,32]\displaystyle\operatorname*{\mathbb{E}}[\mathscrsfs{E}_{A,3}^{2}] =1|𝖾𝗆𝖻(A)|ϕ1,ϕ2L1(A)𝔼(𝒀ϕ1Cϕ1𝒀𝗌𝗄(ϕ1))(𝒀ϕ2Cϕ2𝒀𝗌𝗄(ϕ2))\displaystyle=\frac{1}{|\mathsf{emb}(A)|}\sum_{\phi_{1},\phi_{2}\in L_{1}(A)}\operatorname*{\mathbb{E}}\left({\boldsymbol{Y}}^{\phi_{1}}-C_{\phi_{1}}{\boldsymbol{Y}}^{\mathsf{sk}(\phi_{1})}\right)\left({\boldsymbol{Y}}^{\phi_{2}}-C_{\phi_{2}}{\boldsymbol{Y}}^{\mathsf{sk}(\phi_{2})}\right)
=1|𝖾𝗆𝖻(A)|Π𝗉𝖺𝗍1(A,A)(ϕ1,ϕ2)𝖾𝗆𝖻(Π)𝔼(𝒀ϕ1Cϕ1𝒀𝗌𝗄(ϕ1))(𝒀ϕ2Cϕ2𝒀𝗌𝗄(ϕ2)).\displaystyle=\frac{1}{|\mathsf{emb}(A)|}\sum_{\Pi\in\mathsf{pat}_{1}(A,A)}\;\sum_{(\phi_{1},\phi_{2})\in\mathsf{emb}(\Pi)}\operatorname*{\mathbb{E}}\left({\boldsymbol{Y}}^{\phi_{1}}-C_{\phi_{1}}{\boldsymbol{Y}}^{\mathsf{sk}(\phi_{1})}\right)\left({\boldsymbol{Y}}^{\phi_{2}}-C_{\phi_{2}}{\boldsymbol{Y}}^{\mathsf{sk}(\phi_{2})}\right).

For ϕL1(A)\phi\in L_{1}(A), letting 𝗌𝗄¯(ϕ)=𝗂𝗆(ϕ)𝗌𝗄(ϕ)\overline{\mathsf{sk}}(\phi)=\mathsf{im}(\phi)-\mathsf{sk}(\phi) (throughout this proof we adopt the shorthand 𝗂𝗆(ϕ)=𝗂𝗆(ϕ;A)\mathsf{im}(\phi)=\mathsf{im}(\phi;A) since the base graph AA is fixed),

𝒀ϕCϕ𝒀𝗌𝗄(ϕ)\displaystyle{\boldsymbol{Y}}^{\phi}-C_{\phi}{\boldsymbol{Y}}^{\mathsf{sk}(\phi)} =𝒀𝗌𝗄(ϕ)(𝒀𝗌𝗄¯(ϕ)𝔼𝒀𝗂𝗆(ϕ)𝗌𝗄(ϕ))\displaystyle={\boldsymbol{Y}}^{\mathsf{sk}(\phi)}\left({\boldsymbol{Y}}^{\overline{\mathsf{sk}}(\phi)}-\operatorname*{\mathbb{E}}{\boldsymbol{Y}}^{\mathsf{im}(\phi)-\mathsf{sk}(\phi)}\right)
=(0𝜶𝗌𝗄(ϕ)𝑿𝗌𝗄(ϕ)𝜶𝒁𝜶)0𝜷𝗌𝗄¯(ϕ)(𝑿𝗌𝗄¯(ϕ)𝜷𝒁𝜷𝔼[𝑿𝗌𝗄¯(ϕ)𝜷𝒁𝜷]).\displaystyle=\left(\sum_{0\leq{\boldsymbol{\alpha}}\leq\mathsf{sk}(\phi)}{\boldsymbol{X}}^{\mathsf{sk}(\phi)-{\boldsymbol{\alpha}}}{\boldsymbol{Z}}^{\boldsymbol{\alpha}}\right)\sum_{0\leq{\boldsymbol{\beta}}\leq\overline{\mathsf{sk}}(\phi)}\left({\boldsymbol{X}}^{\overline{\mathsf{sk}}(\phi)-{\boldsymbol{\beta}}}{\boldsymbol{Z}}^{\boldsymbol{\beta}}-\operatorname*{\mathbb{E}}[{\boldsymbol{X}}^{\overline{\mathsf{sk}}(\phi)-{\boldsymbol{\beta}}}{\boldsymbol{Z}}^{\boldsymbol{\beta}}]\right).

This means

𝔼(𝒀ϕ1Cϕ1𝒀𝗌𝗄(ϕ1))(𝒀ϕ2Cϕ2𝒀𝗌𝗄(ϕ2))\displaystyle\operatorname*{\mathbb{E}}\left({\boldsymbol{Y}}^{\phi_{1}}-C_{\phi_{1}}{\boldsymbol{Y}}^{\mathsf{sk}(\phi_{1})}\right)\left({\boldsymbol{Y}}^{\phi_{2}}-C_{\phi_{2}}{\boldsymbol{Y}}^{\mathsf{sk}(\phi_{2})}\right) (110)
=𝟎𝜶𝗌𝗄(ϕ1)𝟎𝜷𝗌𝗄¯(ϕ1)𝟎𝜸𝗌𝗄(ϕ2)𝟎𝜹𝗌𝗄¯(ϕ2)𝑿𝗌𝗄(ϕ1)𝜶𝒁𝜶𝑿𝗌𝗄(ϕ2)𝜸𝒁𝜸(𝑿𝗌𝗄¯(ϕ1)𝜷𝒁𝜷𝔼[𝑿𝗌𝗄¯(ϕ1)𝜷𝒁𝜷])(𝑿𝗌𝗄¯(ϕ2)𝜹𝒁𝜹𝔼[𝑿𝗌𝗄¯(ϕ2)𝜹𝒁𝜹]).\displaystyle=\sum_{\begin{subarray}{c}{\boldsymbol{0}}\leq{\boldsymbol{\alpha}}\leq\mathsf{sk}(\phi_{1})\\ {\boldsymbol{0}}\leq{\boldsymbol{\beta}}\leq\overline{\mathsf{sk}}(\phi_{1})\\ {\boldsymbol{0}}\leq{\boldsymbol{\gamma}}\leq\mathsf{sk}(\phi_{2})\\ {\boldsymbol{0}}\leq{\boldsymbol{\delta}}\leq\overline{\mathsf{sk}}(\phi_{2})\end{subarray}}{\boldsymbol{X}}^{\mathsf{sk}(\phi_{1})-{\boldsymbol{\alpha}}}{\boldsymbol{Z}}^{\boldsymbol{\alpha}}{\boldsymbol{X}}^{\mathsf{sk}(\phi_{2})-{\boldsymbol{\gamma}}}{\boldsymbol{Z}}^{\boldsymbol{\gamma}}\left({\boldsymbol{X}}^{\overline{\mathsf{sk}}(\phi_{1})-{\boldsymbol{\beta}}}{\boldsymbol{Z}}^{\boldsymbol{\beta}}-\operatorname*{\mathbb{E}}[{\boldsymbol{X}}^{\overline{\mathsf{sk}}(\phi_{1})-{\boldsymbol{\beta}}}{\boldsymbol{Z}}^{\boldsymbol{\beta}}]\right)\left({\boldsymbol{X}}^{\overline{\mathsf{sk}}(\phi_{2})-{\boldsymbol{\delta}}}{\boldsymbol{Z}}^{\boldsymbol{\delta}}-\operatorname*{\mathbb{E}}[{\boldsymbol{X}}^{\overline{\mathsf{sk}}(\phi_{2})-{\boldsymbol{\delta}}}{\boldsymbol{Z}}^{\boldsymbol{\delta}}]\right).

We will next claim that certain terms in the sum in (110) are zero. First note that we must have αij+βij+γij+δij\alpha_{ij}+\beta_{ij}+\gamma_{ij}+\delta_{ij} is even for every edge iji\leq j, or else the corresponding term in (110) is zero due to the ZZ factors. Therefore, for (ϕ1,ϕ2)𝖾𝗆𝖻(Π)(\phi_{1},\phi_{2})\in\mathsf{emb}(\Pi), the value of (110) is O(n12𝗈𝖽𝖽(Π))O(n^{-\frac{1}{2}\mathsf{odd}(\Pi)}) where, recall, 𝗈𝖽𝖽(Π)\mathsf{odd}(\Pi) denotes the number of odd-edges in Π\Pi (i.e., the number of edges iji\leq j for which 𝗂𝗆(ϕ1)ij+𝗂𝗆(ϕ2)ij\mathsf{im}(\phi_{1})_{ij}+\mathsf{im}(\phi_{2})_{ij} is odd).

We will improve the bound O(n12𝗈𝖽𝖽(Π))O(n^{-\frac{1}{2}\mathsf{odd}(\Pi)}) in certain cases. Let P(Π)P(\Pi) denote the property that Π\Pi has no triple-edges or higher (i.e., 𝗂𝗆(ϕ1)ij+𝗂𝗆(ϕ2)ij2\mathsf{im}(\phi_{1})_{ij}+\mathsf{im}(\phi_{2})_{ij}\leq 2 for all iji\leq j) and no cycles (when Π\Pi is viewed as a simple graph by replacing multi-edges by single-edges). We claim that (110) is O(n12𝗈𝖽𝖽(Π)1)O(n^{-\frac{1}{2}\mathsf{odd}(\Pi)-1}) whenever P(Π)P(\Pi) holds. To prove this, first identify the “bridges,” i.e., edges (i,j)(i,j) for which 𝗂𝗆(ϕ1)ij=2\mathsf{im}(\phi_{1})_{ij}=2 and one endpoint of (i,j)(i,j) belongs to V(𝗌𝗄(ϕ1))V(\mathsf{sk}(\phi_{1})). At least one bridge must exist by the definition of L1L_{1}. Using the definition of P(Π)P(\Pi), 𝗌𝗄¯(ϕ1)\overline{\mathsf{sk}}(\phi_{1}) shares no vertices with 𝗂𝗆(ϕ2)\mathsf{im}(\phi_{2}) except possibly one endpoint of each bridge. As a result, some bridge (i,j)(i,j) must have βij=0\beta_{ij}=0 or else the corresponding term in (110) is zero; to see this, note that if every bridge has βij=2\beta_{ij}=2 then the factor (𝑿𝗌𝗄¯(ϕ1)𝜷𝒁𝜷𝔼[𝑿𝗌𝗄¯(ϕ1)𝜷𝒁𝜷])({\boldsymbol{X}}^{\overline{\mathsf{sk}}(\phi_{1})-{\boldsymbol{\beta}}}{\boldsymbol{Z}}^{\boldsymbol{\beta}}-\operatorname*{\mathbb{E}}[{\boldsymbol{X}}^{\overline{\mathsf{sk}}(\phi_{1})-{\boldsymbol{\beta}}}{\boldsymbol{Z}}^{\boldsymbol{\beta}}]) has mean zero and is independent from the other factors in (110). This gives the desired improvement to O(n12𝗈𝖽𝖽(Π)1)O(n^{-\frac{1}{2}\mathsf{odd}(\Pi)-1}).

We now have

𝔼[EA,32]\displaystyle\operatorname*{\mathbb{E}}[\mathscrsfs{E}_{A,3}^{2}] =1|𝖾𝗆𝖻(A)|Π𝗉𝖺𝗍1(A,A)(ϕ1,ϕ2)𝖾𝗆𝖻(Π)O(n12𝗈𝖽𝖽(Π)𝟙P(Π))\displaystyle=\frac{1}{|\mathsf{emb}(A)|}\sum_{\Pi\in\mathsf{pat}_{1}(A,A)}\;\sum_{(\phi_{1},\phi_{2})\in\mathsf{emb}(\Pi)}O(n^{-\frac{1}{2}\mathsf{odd}(\Pi)-\mathbbm{1}_{P(\Pi)}})
=Π𝗉𝖺𝗍1(A,A)O(n(|V(A)|1)+(|V(Π)|1)12𝗈𝖽𝖽(Π)𝟙P(Π))\displaystyle=\sum_{\Pi\in\mathsf{pat}_{1}(A,A)}O(n^{-(|V(A)|-1)+(|V(\Pi)|-1)-\frac{1}{2}\mathsf{odd}(\Pi)-\mathbbm{1}_{P(\Pi)}})
=Π𝗉𝖺𝗍1(A,A)O(n|V(A)|+|V(Π)|12𝗈𝖽𝖽(Π)𝟙P(Π))\displaystyle=\sum_{\Pi\in\mathsf{pat}_{1}(A,A)}O(n^{-|V(A)|+|V(\Pi)|-\frac{1}{2}\mathsf{odd}(\Pi)-\mathbbm{1}_{P(\Pi)}})
and now repeating the argument from the proof of Lemma C.3
=Π𝗉𝖺𝗍1(A,A)O(n|V(Π)|12|E(Π)|112𝗈𝖽𝖽(Π)𝟙P(Π))\displaystyle=\sum_{\Pi\in\mathsf{pat}_{1}(A,A)}O(n^{|V(\Pi)|-\frac{1}{2}|E(\Pi)|-1-\frac{1}{2}\mathsf{odd}(\Pi)-\mathbbm{1}_{P(\Pi)}})
Π𝗉𝖺𝗍1(A,A)O(n(o+e+𝟙cycle+𝟙P(Π)))\displaystyle\leq\sum_{\Pi\in\mathsf{pat}_{1}(A,A)}O(n^{-(o+e+\mathbbm{1}_{\mathrm{cycle}}+\mathbbm{1}_{P(\Pi)})})
O(n1),\displaystyle\leq O(n^{-1})\,,

completing the proof. ∎

It remains to handle the first term in (109). For each ϕL1\phi\in L_{1}, 𝗂𝗆(ϕ)\mathsf{im}(\phi) is isomorphic (as a rooted multigraph) to a connected rooted multigraph that becomes a rooted tree once multi-edges are collapsed, contains only single and double edges, and contains at least one double edge; let 𝗉𝖺𝗍1(A)\mathsf{pat}_{1}(A) denote the set of such multigraphs, and for Π𝗉𝖺𝗍1(A)\Pi\in\mathsf{pat}_{1}(A) write ϕ𝖾𝗆𝖻(Π)\phi\in\mathsf{emb}(\Pi) when 𝗂𝗆(ϕ)\mathsf{im}(\phi) is isomorphic to Π\Pi. The remaining term to handle is

1|𝖾𝗆𝖻(A)|ϕL1Cϕ𝒀𝗌𝗄(ϕ)\displaystyle\frac{1}{\sqrt{|\mathsf{emb}(A)|}}\sum_{\phi\in L_{1}}C_{\phi}{\boldsymbol{Y}}^{\mathsf{sk}(\phi)} =1|𝖾𝗆𝖻(A)|Π𝗉𝖺𝗍1(A)ϕ𝖾𝗆𝖻(Π)Cϕ𝒀𝗌𝗄(ϕ).\displaystyle=\frac{1}{\sqrt{|\mathsf{emb}(A)|}}\sum_{\Pi\in\mathsf{pat}_{1}(A)}\,\sum_{\phi\in\mathsf{emb}(\Pi)}C_{\phi}{\boldsymbol{Y}}^{\mathsf{sk}(\phi)}.
Note that CΠ:=CϕC_{\Pi}:=C_{\phi} depends only on Π\Pi (not ϕ\phi). Also write 𝖲𝗄(Π)\mathsf{Sk}(\Pi) for the rooted tree isomorphic to 𝗌𝗄(ϕ)\mathsf{sk}(\phi), which again only depends on Π\Pi. The above becomes
=1|𝖾𝗆𝖻(A)|Π𝗉𝖺𝗍1(A)CΠϕ𝖾𝗆𝖻(𝖲𝗄(Π))𝒀ϕ(1+o(1))n|V(Π)||V(𝖲𝗄(Π))|\displaystyle=\frac{1}{\sqrt{|\mathsf{emb}(A)|}}\sum_{\Pi\in\mathsf{pat}_{1}(A)}C_{\Pi}\sum_{\phi\in\mathsf{emb}(\mathsf{Sk}(\Pi))}{\boldsymbol{Y}}^{\phi}\,(1+o(1))\,n^{|V(\Pi)|-|V(\mathsf{Sk}(\Pi))|}
=1+o(1)|𝖾𝗆𝖻(A)|Π𝗉𝖺𝗍1(A)CΠn|V(Π)||V(𝖲𝗄(Π))||𝖾𝗆𝖻(𝖲𝗄(Π))|G𝖲𝗄(Π)(𝒀).\displaystyle=\frac{1+o(1)}{\sqrt{|\mathsf{emb}(A)|}}\sum_{\Pi\in\mathsf{pat}_{1}(A)}C_{\Pi}\,n^{|V(\Pi)|-|V(\mathsf{Sk}(\Pi))|}\sqrt{|\mathsf{emb}(\mathsf{Sk}(\Pi))|}\,G_{\mathsf{Sk}(\Pi)}({\boldsymbol{Y}})\,. (111)

The number of double-edges in Π\Pi is d(Π)=|V(A)||V(Π)|d(\Pi)=|V(A)|-|V(\Pi)|. We have

|𝖾𝗆𝖻(A)|=(1+o(1))n|V(A)|1,|\mathsf{emb}(A)|=(1+o(1))n^{|V(A)|-1},
CΠ=(𝖼𝗈𝗇𝗌𝗍(Π)+o(1))n12(|V(Π)||V(𝖲𝗄(Π))|d(Π))=(𝖼𝗈𝗇𝗌𝗍(Π)+o(1))n12(2|V(Π)||V(𝖲𝗄(Π))||V(A)|).C_{\Pi}=(\mathsf{const}(\Pi)+o(1))n^{-\frac{1}{2}(|V(\Pi)|-|V(\mathsf{Sk}(\Pi))|-d(\Pi))}=(\mathsf{const}(\Pi)+o(1))n^{-\frac{1}{2}(2|V(\Pi)|-|V(\mathsf{Sk}(\Pi))|-|V(A)|)}.

Now (111) becomes

Π𝗉𝖺𝗍1(A)(𝖼𝗈𝗇𝗌𝗍(Π)+o(1))n12(|V(A)|1)12(2|V(Π)||V(𝖲𝗄(Π))||V(A)|)+|V(Π)||V(𝖲𝗄(Π))|+12(|V(𝖲𝗄(Π))|1)G𝖲𝗄(Π)(𝒀)\displaystyle\sum_{\Pi\in\mathsf{pat}_{1}(A)}(\mathsf{const}(\Pi)+o(1))n^{-\frac{1}{2}(|V(A)|-1)-\frac{1}{2}(2|V(\Pi)|-|V(\mathsf{Sk}(\Pi))|-|V(A)|)+|V(\Pi)|-|V(\mathsf{Sk}(\Pi))|+\frac{1}{2}(|V(\mathsf{Sk}(\Pi))|-1)}G_{\mathsf{Sk}(\Pi)}({\boldsymbol{Y}})
=Π𝗉𝖺𝗍1(A)(𝖼𝗈𝗇𝗌𝗍(Π)+o(1))G𝖲𝗄(Π)(𝒀)\displaystyle\quad=\sum_{\Pi\in\mathsf{pat}_{1}(A)}(\mathsf{const}(\Pi)+o(1))\,G_{\mathsf{Sk}(\Pi)}({\boldsymbol{Y}})
=B𝒯D:|E(B)|<|E(A)|(𝖼𝗈𝗇𝗌𝗍(A,B)+o(1))GB(𝒀)\displaystyle\quad=\sum_{B\in\mathcal{T}_{\leq D}\,:\,|E(B)|<|E(A)|}(\mathsf{const}(A,B)+o(1))\,G_{B}({\boldsymbol{Y}})
=:B𝒯D:|E(B)|<|E(A)|𝖼𝗈𝗇𝗌𝗍(A,B)GB(𝒀)+EA,4\displaystyle\quad=:\sum_{B\in\mathcal{T}_{\leq D}\,:\,|E(B)|<|E(A)|}\mathsf{const}(A,B)\,G_{B}({\boldsymbol{Y}})+\mathscrsfs{E}_{A,4}

where 𝔼[EA,42]=o(1)\operatorname*{\mathbb{E}}[\mathscrsfs{E}_{A,4}^{2}]=o(1) because 𝔼[GB2]=O(1)\operatorname*{\mathbb{E}}[G_{B}^{2}]=O(1) (similarly to the proof of Lemma C.3).

Proof of Lemma 4.11.

Summarizing the above, we have shown how to write

HA=𝖼𝗈𝗇𝗌𝗍(A)FEA,0+GA\mathscrsfs{H}_{A}=\mathsf{const}(A)\mathscrsfs{F}_{\emptyset}-\mathscrsfs{E}_{A,0}+G_{A}

where

GA=FA+EA,1EA,2EA,3EA,4B𝒯D:|E(B)|<|E(A)|𝖼𝗈𝗇𝗌𝗍(A,B)GB.G_{A}=\mathscrsfs{F}_{A}+\mathscrsfs{E}_{A,1}-\mathscrsfs{E}_{A,2}-\mathscrsfs{E}_{A,3}-\mathscrsfs{E}_{A,4}-\sum_{B\in\mathcal{T}_{\leq D}\,:\,|E(B)|<|E(A)|}\mathsf{const}(A,B)\,G_{B}\,.

Using induction on |E(A)||E(A)| we can apply the same procedure to expand GBG_{B} in the basis {FC:C𝒯D,|E(C)|<|E(B)|}\{\mathscrsfs{F}_{C}:C\in\mathcal{T}_{\leq D},\,|E(C)|<|E(B)|\} plus an error term. Recalling (105), this now gives the desired expansion for HA\mathscrsfs{H}_{A}. ∎
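
As an illustration of the resulting expansion, let A be the single edge at the root. The only B∈𝒯_{≤D} with fewer edges is ∅, so the procedure above specializes to

\mathscrsfs{H}_{A}=\mathscrsfs{F}_{A}+c_{0}\,\mathscrsfs{F}_{\emptyset}+\mathscrsfs{E}_{A}\,,\qquad\operatorname*{\mathbb{E}}[\mathscrsfs{E}_{A}^{2}]=o(1)\,,

for a deterministic constant c₀=c₀(A) independent of n.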

Appendix D Proof of Lemma 4.13

Let (𝒁(t))t({\boldsymbol{Z}}(t))_{t\in{\mathbb{Z}}}, 𝒁(t)n×n{\boldsymbol{Z}}(t)\in{\mathbb{R}}^{n\times n} be a collection of i.i.d. copies of 𝒁{\boldsymbol{Z}}, and define

𝒀(t)=1n𝜽𝜽𝖳+𝒁(t).\displaystyle{\boldsymbol{Y}}(t)=\frac{1}{\sqrt{n}}{\boldsymbol{\theta}}{\boldsymbol{\theta}}^{{\sf T}}+{\boldsymbol{Z}}(t)\,. (112)

We will use the sequence of random matrices 𝒀:={𝒀(t)}{\boldsymbol{Y}}_{*}:=\{{\boldsymbol{Y}}(t)\} only as a device for the analysis of the algorithm, and not in the actual estimation procedure.

We extend the definition of tree structured polynomials (cf. Eq. (32)) to such sequences of random matrices via

FTt(𝒀)=1|𝗇𝗋(T)|ϕ𝗇𝗋(T)(i,j)E(T)Yϕ(i),ϕ(j)(t𝖽T((i,j),)).\mathscrsfs{F}^{t}_{T}({\boldsymbol{Y}}_{*})=\frac{1}{\sqrt{|\mathsf{nr}(T)|}}\sum_{\phi\in\mathsf{nr}(T)}\,\prod_{(i,j)\in E(T)}Y_{\phi(i),\phi(j)}(t-{\sf d}_{T}((i,j),\circ))\,. (113)

Here 𝖽T((i,j),):=max(𝖽T(i,),𝖽T(j,)){\sf d}_{T}((i,j),\circ):=\max({\sf d}_{T}(i,\circ),{\sf d}_{T}(j,\circ)). When the argument is a single matrix 𝒀{\boldsymbol{Y}}, FTt(𝒀)\mathscrsfs{F}^{t}_{T}({\boldsymbol{Y}}) is defined by applying Eq. (113) to the constant sequence of matrices 𝒀(s)=𝒀{\boldsymbol{Y}}(s)={\boldsymbol{Y}} for all ss\in{\mathbb{Z}} (thus recovering Eq. (32)).
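
For example, if T is the rooted path ∘−v_1−v_2, the edge (∘,v_1) is at distance 1 from the root and (v_1,v_2) at distance 2, so Eq. (113) reads

\mathscrsfs{F}^{t}_{T}({\boldsymbol{Y}}_{*})=\frac{1}{\sqrt{|\mathsf{nr}(T)|}}\sum_{\phi\in\mathsf{nr}(T)}Y_{\phi(\circ),\phi(v_{1})}(t-1)\,Y_{\phi(v_{1}),\phi(v_{2})}(t-2)\,,

each factor being evaluated at the time index determined by the depth of the corresponding edge.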

The random variable FTt(𝒀)\mathscrsfs{F}^{t}_{T}({\boldsymbol{Y}}_{*}) does depend on tt; however, its distribution does not. We will therefore often omit the superscript tt. (In the case of FTt(𝒀)\mathscrsfs{F}^{t}_{T}({\boldsymbol{Y}}), the random variable itself does not depend on tt.)

We define a subset of the family of non-reversing labelings of T𝒯DT\in\mathcal{T}_{\leq D}.

Definition D.1.

A labeling ϕ\phi of T𝒯DT\in\mathcal{T}_{\leq D} is said to be strongly non-reversing if it is non-reversing and for any two edges (i,j)(i,j), (k,l)E(T)(k,l)\in E(T), with 𝖽T((i,j),)𝖽T((k,l),){\sf d}_{T}((i,j),\circ)\neq{\sf d}_{T}((k,l),\circ), we have (ϕ(i),ϕ(j))(ϕ(k),ϕ(l))(\phi(i),\phi(j))\neq(\phi(k),\phi(l)) (as unordered pairs). We denote by 𝗇𝗋~(T)\tilde{\sf nr}(T) the set of all strongly non-reversing labelings of TT.

A pair of labelings ϕ1𝗇𝗋(T1)\phi_{1}\in\mathsf{nr}(T_{1}), ϕ2𝗇𝗋(T2)\phi_{2}\in\mathsf{nr}(T_{2}) is said to be jointly strongly non-reversing if each of them is strongly non-reversing and, for any two edges (i,j)E(T1)(i,j)\in E(T_{1}), (k,l)E(T2)(k,l)\in E(T_{2}), 𝖽T1((i,j),)𝖽T2((k,l),){\sf d}_{T_{1}}((i,j),\circ)\neq{\sf d}_{T_{2}}((k,l),\circ) implies (ϕ1(i),ϕ1(j))(ϕ2(k),ϕ2(l))(\phi_{1}(i),\phi_{1}(j))\neq(\phi_{2}(k),\phi_{2}(l)). We denote by 𝗇𝗋~(T1,T2)\tilde{\sf nr}(T_{1},T_{2}) the set of such pairs.
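
To illustrate Definition D.1, suppose T is the rooted path with four edges ∘−v_1−v_2−v_3−v_4 and consider the labeling ϕ(∘)=1, ϕ(v_1)=2, ϕ(v_2)=3, ϕ(v_3)=1, ϕ(v_4)=2. No two consecutive edges are mapped to the same unordered pair of labels, but the first and the fourth edge (at depths 1 and 4 from the root) are both mapped to {1,2}; hence ϕ is not strongly non-reversing.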

We define the modified polynomials F~Tt(𝐘)\tilde{\mathscrsfs{F}}^{t}_{T}({\boldsymbol{Y}}), F~Tt(𝐘)\tilde{\mathscrsfs{F}}^{t}_{T}({\boldsymbol{Y}}_{*}) by restricting the sum to 𝗇𝗋~(T)\tilde{\sf nr}(T) in Eqs. (32), (113).

We also define

FT1,T2t(𝒀):=1|𝗇𝗋(T1)||𝗇𝗋(T2)|\displaystyle\mathscrsfs{F}^{t}_{T_{1},T_{2}}({\boldsymbol{Y}}_{*}):=\frac{1}{\sqrt{|\mathsf{nr}(T_{1})|\cdot|\mathsf{nr}(T_{2})|}}\;\cdot
(ϕ1,ϕ2)𝗇𝗋~(T1,T2)(i,j)E(T1)𝒀ϕ1(i),ϕ1(j)(t𝖽T1((i,j),))(i,j)E(T2)𝒀ϕ2(i),ϕ2(j)(t𝖽T2((i,j),)).\displaystyle\sum_{(\phi_{1},\phi_{2})\in\tilde{\sf nr}(T_{1},T_{2})}\,\prod_{(i,j)\in E(T_{1})}{\boldsymbol{Y}}_{\phi_{1}(i),\phi_{1}(j)}(t-{\sf d}_{T_{1}}((i,j),\circ))\prod_{(i,j)\in E(T_{2})}{\boldsymbol{Y}}_{\phi_{2}(i),\phi_{2}(j)}(t-{\sf d}_{T_{2}}((i,j),\circ))\,.

As before, FT1,T2t(𝐘)\mathscrsfs{F}^{t}_{T_{1},T_{2}}({\boldsymbol{Y}}) is obtained from the above definition by setting 𝐘(s)=𝐘{\boldsymbol{Y}}(s)={\boldsymbol{Y}} for all ss\in{\mathbb{Z}}.

As we next see, moving from non-reversing to strongly non-reversing labelings has a negligible effect.

Lemma D.2.

Assume πΘ\pi_{\Theta} to have finite moments of all orders, and |ψ(θ)|B(1+|θ|)B|\psi(\theta)|\leq B(1+|\theta|)^{B} for a constant BB. Then, for any fixed T,T1,T2𝒯DT,T_{1},T_{2}\in\mathcal{T}_{\leq D}, there exist constants C=C(B,T)C_{*}=C_{*}(B,T), C#=C#(B,T1,T2)C_{\#}=C_{\#}(B,T_{1},T_{2}) such that

|𝔼[ψ(θ1)FT(𝒀)]𝔼[ψ(θ1)F~T(𝒀)]|Cn,|𝔼[FT1(𝒀)FT2(𝒀)]𝔼[F~T1,T2(Y)]|C#n,\left|\operatorname{\mathbb{E}}[\psi(\theta_{1})\mathscrsfs{F}_{T}({\boldsymbol{Y}})]-\operatorname{\mathbb{E}}[\psi(\theta_{1})\tilde{\mathscrsfs{F}}_{T}({\boldsymbol{Y}})]\right|\leq\frac{C_{*}}{\sqrt{n}}\,,\;\;\;\;\left|\operatorname{\mathbb{E}}[\mathscrsfs{F}_{T_{1}}({\boldsymbol{Y}})\mathscrsfs{F}_{T_{2}}({\boldsymbol{Y}})]-\operatorname{\mathbb{E}}[\tilde{\mathscrsfs{F}}_{T_{1},T_{2}}(Y)]\right|\leq\frac{C_{\#}}{\sqrt{n}}\,, (114)
|𝔼[ψ(θ1)FTt(𝒀)]𝔼[ψ(θ1)F~Tt(𝒀)]|Cn,|𝔼[FT1t(𝒀)FT2t(𝒀)]𝔼[F~T1,T2t(𝒀)]|C#n.\left|\operatorname{\mathbb{E}}[\psi(\theta_{1})\mathscrsfs{F}^{t}_{T}({\boldsymbol{Y}}_{*})]-\operatorname{\mathbb{E}}[\psi(\theta_{1})\tilde{\mathscrsfs{F}}^{t}_{T}({\boldsymbol{Y}}_{*})]\right|\leq\frac{C_{*}}{\sqrt{n}},\;\;\;\;\left|\operatorname{\mathbb{E}}[\mathscrsfs{F}^{t}_{T_{1}}({\boldsymbol{Y}}_{*})\mathscrsfs{F}^{t}_{T_{2}}({\boldsymbol{Y}}_{*})]-\operatorname{\mathbb{E}}[\tilde{\mathscrsfs{F}}^{t}_{T_{1},T_{2}}({\boldsymbol{Y}}_{*})]\right|\leq\frac{C_{\#}}{\sqrt{n}}\,. (115)
Proof.

We will prove Eq. (114) since (115) follows by the same argument. Considering the first bound, note that

|𝔼[ψ(θ1)FT(𝒀)]𝔼[ψ(θ1)F~T(𝒀)]|\displaystyle\left|\operatorname{\mathbb{E}}[\psi(\theta_{1})\mathscrsfs{F}_{T}({\boldsymbol{Y}})]-\operatorname{\mathbb{E}}[\psi(\theta_{1})\tilde{\mathscrsfs{F}}_{T}({\boldsymbol{Y}})]\right|
1|𝗇𝗋(T)|ϕ𝗇𝗋(T)𝗇𝗋~(T)|𝔼[ψ(θ1)(i,j)E(T)Yϕ1(i),ϕ1(j)]|\displaystyle\leq\frac{1}{\sqrt{|\mathsf{nr}(T)|}}\sum_{\phi\in\mathsf{nr}(T)\setminus\tilde{\sf nr}(T)}\left|\operatorname{\mathbb{E}}\Big{[}\psi(\theta_{1})\prod_{(i,j)\in E(T)}Y_{\phi_{1}(i),\phi_{1}(j)}\Big{]}\right|
1|𝗇𝗋(T)|Π𝗉𝖺𝗍′′(T)ϕ𝖾𝗆𝖻(Π)|𝔼[ψ(θ1)(i,j)E(T)Yϕ1(i),ϕ1(j)]|,\displaystyle\leq\frac{1}{\sqrt{|\mathsf{nr}(T)|}}\sum_{\Pi\in\mathsf{pat}^{\prime\prime}(T)}\sum_{\phi\in\mathsf{emb}(\Pi)}\left|\operatorname{\mathbb{E}}\Big{[}\psi(\theta_{1})\prod_{(i,j)\in E(T)}Y_{\phi_{1}(i),\phi_{1}(j)}\Big{]}\right|\,, (116)

where in the last line 𝗉𝖺𝗍′′(T)\mathsf{pat}^{\prime\prime}(T) denotes the equivalence classes of 𝗇𝗋(T)𝗇𝗋~(T)\mathsf{nr}(T)\setminus\tilde{\sf nr}(T) under rooted graph homomorphisms. Now the expectation in the last line only depends on Π\Pi. Taking expectation conditional on θ\theta, we get

|𝔼[ψ(θ1)(i,j)E(T)Yϕ1(i),ϕ1(j)θ]||ψ(θ1)|1ijnαij(ϕ)=1|θiθj|n1ijnαij(ϕ)22αij(ϕ)1(|θiθj|n+Cαij).\displaystyle\Big{|}\operatorname{\mathbb{E}}\Big{[}\psi(\theta_{1})\prod_{(i,j)\in E(T)}Y_{\phi_{1}(i),\phi_{1}(j)}\;\Big{|}\;\theta\Big{]}\Big{|}\leq|\psi(\theta_{1})|\prod_{\begin{subarray}{c}1\leq i\leq j\leq n\\ \alpha_{ij}(\phi)=1\end{subarray}}\frac{|\theta_{i}\theta_{j}|}{\sqrt{n}}\prod_{\begin{subarray}{c}1\leq i\leq j\leq n\\ \alpha_{ij}(\phi)\geq 2\end{subarray}}2^{\alpha_{ij}(\phi)-1}\left(\frac{|\theta_{i}\theta_{j}|}{\sqrt{n}}+C\sqrt{\alpha_{ij}}\right)\,.

Here α(ϕ)\alpha(\phi) is the image of the labeling ϕ\phi. Next, taking the expectation with respect to θ\theta and using that all moments are finite,

|𝔼[ψ(θ1)(i,j)E(T)Yϕ1(i),ϕ1(j)]|C(ψ,Π)ne1(Π)/2,\displaystyle\Big{|}\operatorname{\mathbb{E}}\Big{[}\psi(\theta_{1})\prod_{(i,j)\in E(T)}Y_{\phi_{1}(i),\phi_{1}(j)}\Big{]}\Big{|}\leq C(\psi,\Pi)\,n^{-e_{1}(\Pi)/2}\,, (117)

where e1(Π)e_{1}(\Pi) is the number of edges with multiplicity 11 in Π\Pi.

Using this bound in Eq. (116) and noting that |𝖾𝗆𝖻(Π)|nv(Π)1|\mathsf{emb}(\Pi)|\leq n^{v(\Pi)-1} (where v(Π)v(\Pi) is the number of vertices in Π\Pi), we get

|𝔼[ψ(θ1)FT(𝒀)]𝔼[ψ(θ1)F~T(𝒀)]|\displaystyle\left|\operatorname{\mathbb{E}}[\psi(\theta_{1})\mathscrsfs{F}_{T}({\boldsymbol{Y}})]-\operatorname{\mathbb{E}}[\psi(\theta_{1})\tilde{\mathscrsfs{F}}_{T}({\boldsymbol{Y}})]\right|
C1|𝗇𝗋(T)|Π𝗉𝖺𝗍′′(T)nv(Π)1e1(Π)/2\displaystyle\leq C\frac{1}{\sqrt{|\mathsf{nr}(T)|}}\sum_{\Pi\in\mathsf{pat}^{\prime\prime}(T)}n^{v(\Pi)-1-e_{1}(\Pi)/2}
CΠ𝗉𝖺𝗍′′(T)nv(Π)1e1(Π)/2m(Π)/2,\displaystyle\leq C^{\prime}\sum_{\Pi\in\mathsf{pat}^{\prime\prime}(T)}n^{v(\Pi)-1-e_{1}(\Pi)/2-m(\Pi)/2}\,,

where m(Π)m(\Pi) is the number of edges of Π\Pi counted with their multiplicity. In the last step we used the fact that m(Π)=V(T)1m(\Pi)=V(T)-1 and |𝗇𝗋(T)|C0nV(T)1|\mathsf{nr}(T)|\geq C_{0}n^{V(T)-1}.

Next notice that, denoting by e2(Π)e_{\geq 2}(\Pi) the number of edges of Π\Pi with multiplicity at least two (each such edge counted once), we have

12e1(Π)+12m(Π)+1v(Π)\displaystyle\frac{1}{2}e_{1}(\Pi)+\frac{1}{2}m(\Pi)+1-v(\Pi) e1(Π)+e2(Π)+1v(Π)\displaystyle\geq e_{1}(\Pi)+e_{\geq 2}(\Pi)+1-v(\Pi)
𝗅𝗈𝗈𝗉(Π).\displaystyle\geq{\sf loop}(\Pi)\,.

where 𝗅𝗈𝗈𝗉(Π){\sf loop}(\Pi) is the number of self-loops in the projection of Π\Pi onto simple graphs (i.e. the graph obtained by replacing every multi-edge in Π\Pi by a single edge). For Π𝗉𝖺𝗍′′(T)\Pi\in\mathsf{pat}^{\prime\prime}(T), we have 𝗅𝗈𝗈𝗉(Π)1{\sf loop}(\Pi)\geq 1, whence the claim follows.

The proof for the second equation in Eq. (114) is very similar. Using the shorthand 𝗇𝗋𝗇𝗋~(T1,T2):=𝗇𝗋(T1)×𝗇𝗋(T2)𝗇𝗋~(T1,T2)\mathsf{nr}\setminus\tilde{\sf nr}(T_{1},T_{2}):=\mathsf{nr}(T_{1})\times\mathsf{nr}(T_{2})\setminus\tilde{\sf nr}(T_{1},T_{2}), we begin by writing

|𝔼[FT1(𝒀)FT2(𝒀)]𝔼[F~T1,T2(𝒀)]|\displaystyle\left|\operatorname{\mathbb{E}}[\mathscrsfs{F}_{T_{1}}({\boldsymbol{Y}})\mathscrsfs{F}_{T_{2}}({\boldsymbol{Y}})]-\operatorname{\mathbb{E}}[\tilde{\mathscrsfs{F}}_{T_{1},T_{2}}({\boldsymbol{Y}})]\right|
1|𝗇𝗋(T1)||𝗇𝗋(T2)|(ϕ1,ϕ2)𝗇𝗋𝗇𝗋~(T1,T2)|𝔼{(i,j)E(T1)Yϕ1(i),ϕ1(j)(i,j)E(T2)Yϕ2(i),ϕ2(j)}|\displaystyle\leq\frac{1}{\sqrt{|\mathsf{nr}(T_{1})|\cdot|\mathsf{nr}(T_{2})|}}\sum_{(\phi_{1},\phi_{2})\in\mathsf{nr}\setminus\tilde{\sf nr}(T_{1},T_{2})}\Big{|}\operatorname{\mathbb{E}}\Big{\{}\prod_{(i,j)\in E(T_{1})}Y_{\phi_{1}(i),\phi_{1}(j)}\prod_{(i,j)\in E(T_{2})}Y_{\phi_{2}(i),\phi_{2}(j)}\Big{\}}\Big{|}
1|𝗇𝗋(T1)||𝗇𝗋(T2)|Π𝗉𝖺𝗍′′(T1,T2)(ϕ1,ϕ2)𝖾𝗆𝖻(Π)|𝔼[(i,j)E(T1)Yϕ1(i),ϕ1(j)(i,j)E(T2)Yϕ2(i),ϕ2(j)]|.\displaystyle\leq\frac{1}{\sqrt{|\mathsf{nr}(T_{1})|\cdot|\mathsf{nr}(T_{2})|}}\sum_{\Pi\in\mathsf{pat}^{\prime\prime}(T_{1},T_{2})}\sum_{(\phi_{1},\phi_{2})\in\mathsf{emb}(\Pi)}\left|\operatorname{\mathbb{E}}\Big{[}\prod_{(i,j)\in E(T_{1})}Y_{\phi_{1}(i),\phi_{1}(j)}\prod_{(i,j)\in E(T_{2})}Y_{\phi_{2}(i),\phi_{2}(j)}\Big{]}\right|\,.

The only difference with respect to the previous case lies in the fact that 𝗉𝖺𝗍′′(T1,T2)\mathsf{pat}^{\prime\prime}(T_{1},T_{2}) is a collection of graphs with edges labeled by {1,2}\{1,2\}. (It is the set of equivalence classes of 𝗇𝗋𝗇𝗋~(T1,T2)\mathsf{nr}\setminus\tilde{\sf nr}(T_{1},T_{2}) under graph homomorphisms.)

Proceeding as before,

|𝔼[(i,j)E(T1)Yϕ1(i),ϕ1(j)(i,j)E(T2)Yϕ2(i),ϕ2(j)]|C(Π)ne1(Π).\displaystyle\left|\operatorname{\mathbb{E}}\Big{[}\prod_{(i,j)\in E(T_{1})}Y_{\phi_{1}(i),\phi_{1}(j)}\prod_{(i,j)\in E(T_{2})}Y_{\phi_{2}(i),\phi_{2}(j)}\Big{]}\right|\leq C(\Pi)\,n^{-e_{1}(\Pi)}\,.

Therefore

|𝔼[FT1(𝒀)FT2(𝒀)]𝔼[F~T1,T2(𝒀)]|1|𝗇𝗋(T1)||𝗇𝗋(T2)|Π𝗉𝖺𝗍′′(T1,T2)(ϕ1,ϕ2)𝖾𝗆𝖻(Π)C(Π)ne1(Π).\displaystyle\left|\operatorname{\mathbb{E}}[\mathscrsfs{F}_{T_{1}}({\boldsymbol{Y}})\mathscrsfs{F}_{T_{2}}({\boldsymbol{Y}})]-\operatorname{\mathbb{E}}[\tilde{\mathscrsfs{F}}_{T_{1},T_{2}}({\boldsymbol{Y}})]\right|\leq\frac{1}{\sqrt{|\mathsf{nr}(T_{1})|\cdot|\mathsf{nr}(T_{2})|}}\sum_{\Pi\in\mathsf{pat}^{\prime\prime}(T_{1},T_{2})}\sum_{(\phi_{1},\phi_{2})\in\mathsf{emb}(\Pi)}C(\Pi)\,n^{-e_{1}(\Pi)}\,.

Recall that |𝗇𝗋(T1)|C0nV(T1)1|\mathsf{nr}(T_{1})|\geq C_{0}n^{V(T_{1})-1}, |𝗇𝗋(T2)|C0nV(T2)1|\mathsf{nr}(T_{2})|\geq C_{0}n^{V(T_{2})-1} and, as before, V(T1)+V(T2)2=m(Π)V(T_{1})+V(T_{2})-2=m(\Pi). Further |𝖾𝗆𝖻(Π)|C1nv(Π)1|\mathsf{emb}(\Pi)|\leq C_{1}\,n^{v(\Pi)-1}, whence

|𝔼[FT1(𝒀)FT2(𝒀)]𝔼[F~T1,T2(𝒀)]|C2Π𝗉𝖺𝗍′′(T1,T2)nv(Π)1e1(Π)/2m(Π)/2.\displaystyle\left|\operatorname{\mathbb{E}}[\mathscrsfs{F}_{T_{1}}({\boldsymbol{Y}})\mathscrsfs{F}_{T_{2}}({\boldsymbol{Y}})]-\operatorname{\mathbb{E}}[\tilde{\mathscrsfs{F}}_{T_{1},T_{2}}({\boldsymbol{Y}})]\right|\leq C_{2}\,\sum_{\Pi\in\mathsf{pat}^{\prime\prime}(T_{1},T_{2})}n^{v(\Pi)-1-e_{1}(\Pi)/2-m(\Pi)/2}\,.

The last sum is upper bounded by Cn1/2C\,n^{-1/2} by the same argument as above. ∎

Consider now a term corresponding to ϕ𝗇𝗋~(T)\phi\in\tilde{\sf nr}(T) in the expectation 𝔼[ψ(θ1)F~T(𝒀)|θ]\operatorname{\mathbb{E}}[\psi(\theta_{1})\tilde{\mathscrsfs{F}}_{T}({\boldsymbol{Y}}_{*})\,|\,\theta]. By construction, this does not involve moments 𝔼[Yija(t1)Yijb(t2)|θ]\operatorname{\mathbb{E}}[Y^{a}_{ij}(t_{1})Y^{b}_{ij}(t_{2})\cdots\,|\,\theta] for t1t2t_{1}\neq t_{2}. Therefore the expectations coincide when considering F~T(𝒀)\tilde{\mathscrsfs{F}}_{T}({\boldsymbol{Y}}_{*}) or F~T(𝒀)\tilde{\mathscrsfs{F}}_{T}({\boldsymbol{Y}}). In other words, we have the identities

𝔼[ψ(θ1)F~T(𝒀)]=𝔼[ψ(θ1)F~T(𝒀)],𝔼[F~T1,T2(𝒀)]=𝔼[F~T1,T2(𝒀)].\displaystyle\operatorname{\mathbb{E}}[\psi(\theta_{1})\tilde{\mathscrsfs{F}}_{T}({\boldsymbol{Y}}_{*})]=\operatorname{\mathbb{E}}[\psi(\theta_{1})\tilde{\mathscrsfs{F}}_{T}({\boldsymbol{Y}})]\,,\;\;\;\;\;\operatorname{\mathbb{E}}[\tilde{\mathscrsfs{F}}_{T_{1},T_{2}}({\boldsymbol{Y}}_{*})]=\operatorname{\mathbb{E}}[\tilde{\mathscrsfs{F}}_{T_{1},T_{2}}({\boldsymbol{Y}})]\,. (118)

We have therefore proved the following consequence of Lemma D.2.

Corollary D.3.

Under the assumptions of Lemma D.2, there exists C0=C0(T,B)C_{0}=C_{0}(T,B) such that

|𝔼[ψ(θ1)FT(𝒀)]𝔼[ψ(θ1)FT(𝒀)]|C0n.\displaystyle\left|\operatorname{\mathbb{E}}[\psi(\theta_{1})\mathscrsfs{F}_{T}({\boldsymbol{Y}})]-\operatorname{\mathbb{E}}[\psi(\theta_{1})\mathscrsfs{F}_{T}({\boldsymbol{Y}}_{*})]\right|\leq\frac{C_{0}}{\sqrt{n}}\,\,. (119)

We now consider a message passing iteration of the same form as Eq. (53), but with 𝒀{\boldsymbol{Y}} replaced by 𝒀(t){\boldsymbol{Y}}(t) at iteration tt. Namely, we define

𝒓ijt+1\displaystyle{\boldsymbol{r}}_{i\to j}^{t+1} =1nk[n]{i,j}Yik(t)Ft(𝒓kit),𝒓ij0=𝔼[Θ]ij,\displaystyle=\frac{1}{\sqrt{n}}\sum_{k\in[n]\setminus\{i,j\}}Y_{ik}(t)F_{t}({\boldsymbol{r}}^{t}_{k\to i})\,,\;\;\;\;\;\,\;\;{\boldsymbol{r}}_{i\to j}^{0}=\operatorname{\mathbb{E}}[\Theta]\;\;\forall i\neq j\,, (120)

and:

𝒓it+1\displaystyle{\boldsymbol{r}}_{i}^{t+1} =1nk[n]{i}Yik(t)Ft(𝒓kit),\displaystyle=\frac{1}{\sqrt{n}}\sum_{k\in[n]\setminus\{i\}}Y_{ik}(t)F_{t}({\boldsymbol{r}}^{t}_{k\to i})\,, (121)
𝒓^it+1\displaystyle\hat{{\boldsymbol{r}}}_{i}^{t+1} =Ft(𝒓it+1).\displaystyle=F_{t}({\boldsymbol{r}}^{t+1}_{i})\,. (122)
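For intuition, here is a minimal numerical sketch of the iteration (120)–(122) in the scalar case 𝖽 = 1. The two-point prior for Θ and the choice F_t = tanh are illustrative stand-ins (not taken from the paper); the essential feature being mimicked is that the noise part of the data matrix is redrawn at every iteration.

```python
import numpy as np

# Minimal sketch of the modified iteration (120)-(122) in the scalar case d = 1.
# Assumptions (not from the paper): a two-point prior for Theta and F_t = tanh.
# The key feature is that the data matrix Y(t) uses fresh noise at each iteration.
rng = np.random.default_rng(0)
n, T = 400, 5
theta = rng.choice([1.0, -1.0], size=n, p=[0.7, 0.3])   # theta_i ~ pi_Theta (assumed)
mu0 = 0.7 - 0.3                                          # E[Theta] for this prior
F = np.tanh                                              # stand-in for F_t

def fresh_Y(theta, rng):
    """Y(t) = theta theta^T / sqrt(n) + Z(t), with Z(t) a fresh symmetric Gaussian matrix."""
    n = theta.size
    A = rng.normal(size=(n, n))
    Z = (A + A.T) / np.sqrt(2)
    return np.outer(theta, theta) / np.sqrt(n) + Z

# messages r_{i->j}^t stored as an n x n array (entry [i, j]); diagonal unused
r_msg = np.full((n, n), mu0)                             # r^0_{i->j} = E[Theta]
for t in range(T):
    Y = fresh_Y(theta, rng)
    M = Y * F(r_msg).T                                   # M[i, k] = Y_{ik}(t) F_t(r^t_{k->i})
    base = M.sum(axis=1) - np.diag(M)                    # sum over k != i
    r_vec = base / np.sqrt(n)                            # r_i^{t+1}, Eq. (121)
    r_msg = (base[:, None] - M) / np.sqrt(n)             # r_{i->j}^{t+1}, Eq. (120): also drop k = j
r_hat = F(r_vec)                                         # \hat r_i^{t+1} = F_t(r_i^{t+1}), Eq. (122)
```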

Analogously to the case of the original iteration, we can write these as sums over polynomials FTt+1(𝒀)\mathscrsfs{F}^{t+1}_{T}({\boldsymbol{Y}}_{*}), with coefficients that have a limit as nn\to\infty. Therefore, we have a further consequence of Lemma D.2 and the previous corollary.

Corollary D.4.

Assume πΘ\pi_{\Theta} to have finite moments of all orders, and let ψ:𝖽+1\psi:{\mathbb{R}}^{{\sf d}+1}\to{\mathbb{R}} be a polynomial with coefficients bounded by BB and maximum degree BB.

Then there exists C0=C0(T,B)C_{0}=C_{0}(T,B) such that, for all tTt\leq T,

|𝔼[ψ(θ1,𝒔1t)]𝔼[ψ(θ1,𝒓1t)]|C0n,\displaystyle\left|\operatorname{\mathbb{E}}[\psi(\theta_{1},{\boldsymbol{s}}_{1}^{t})]-\operatorname{\mathbb{E}}[\psi(\theta_{1},{\boldsymbol{r}}_{1}^{t})]\right|\leq\frac{C_{0}}{\sqrt{n}}\,, (123)
|𝔼[ψ(θ1,𝒔12t)]𝔼[ψ(θ1,𝒓12t)]|C0n.\displaystyle\left|\operatorname{\mathbb{E}}[\psi(\theta_{1},{\boldsymbol{s}}_{1\to 2}^{t})]-\operatorname{\mathbb{E}}[\psi(\theta_{1},{\boldsymbol{r}}_{1\to 2}^{t})]\right|\leq\frac{C_{0}}{\sqrt{n}}\,. (124)
Proof.

We expand the polynomial ψ\psi as

ψ(θ,𝒙)=𝒎𝖽ψ𝒎(θ)𝒙𝒎,𝒙𝒎:=i=1𝖽ximi.\displaystyle\psi(\theta,{\boldsymbol{x}})=\sum_{{\boldsymbol{m}}\in{\mathbb{N}}^{{\sf d}}}\psi_{{\boldsymbol{m}}}(\theta){\boldsymbol{x}}^{{\boldsymbol{m}}}\,,\;\;\;\;\;{\boldsymbol{x}}^{{\boldsymbol{m}}}:=\prod_{i=1}^{{\sf d}}x_{i}^{m_{i}}\,.

Further, 𝒔1t{\boldsymbol{s}}^{t}_{1} can be expressed as a sum over T𝒯DT\in\mathcal{T}_{\leq D} and therefore the same holds for (𝒔1t)𝒎({\boldsymbol{s}}^{t}_{1})^{{\boldsymbol{m}}}

(𝒔1t)𝒎\displaystyle({\boldsymbol{s}}^{t}_{1})^{{\boldsymbol{m}}} =(T,k)𝖽,km𝖽,kmcT,k𝖽,kmFT,k(𝒀),\displaystyle=\sum_{(T_{\ell,k})_{\ell\leq{\sf d},k\leq m_{\ell}}}\prod_{\ell\leq{\sf d},k\leq m_{\ell}}c_{T_{\ell,k}}\prod_{\ell\leq{\sf d},k\leq m_{\ell}}\mathscrsfs{F}_{T_{\ell,k}}({\boldsymbol{Y}})\,,
=Tc¯T(𝒎)FT(𝒀),\displaystyle=\sum_{T}\overline{c}_{T}({\boldsymbol{m}})\mathscrsfs{F}_{T}({\boldsymbol{Y}})\,,

where, in the last line, we grouped terms such that ,kT,k=T\cup_{\ell,k}T_{\ell,k}=T; up to combinatorial factors that are independent of nn, the coefficient c¯T(𝒎)\overline{c}_{T}({\boldsymbol{m}}) is given by the product of the coefficients cT,kc_{T_{\ell,k}}.

Hence

|𝔼[ψ(θ1,𝒔1t)]𝔼[ψ(θ1,𝒓1t)]|𝒎Tc¯T(𝒎)|𝔼[ψ𝒎(θ1)FT(𝒀)]𝔼[ψ𝒎(θ1)FTt(𝒀)]|,\displaystyle\Big{|}\operatorname{\mathbb{E}}[\psi(\theta_{1},{\boldsymbol{s}}_{1}^{t})]-\operatorname{\mathbb{E}}[\psi(\theta_{1},{\boldsymbol{r}}_{1}^{t})]\Big{|}\leq\sum_{{\boldsymbol{m}}}\sum_{T}\overline{c}_{T}({\boldsymbol{m}})\Big{|}\operatorname{\mathbb{E}}[\psi_{{\boldsymbol{m}}}(\theta_{1})\mathscrsfs{F}_{T}({\boldsymbol{Y}})]-\operatorname{\mathbb{E}}[\psi_{{\boldsymbol{m}}}(\theta_{1})\mathscrsfs{F}^{t}_{T}({\boldsymbol{Y}}_{*})]\Big{|}\,,

and the claim follows by applying Corollary D.3, after noting that the sums over 𝒎{\boldsymbol{m}} and TT involve a number of terms that is constant in nn, and that the coefficients c¯T(𝒎)\overline{c}_{T}({\boldsymbol{m}}) have a limit as nn\to\infty. ∎

Lemma D.5.

Assume πΘ\pi_{\Theta} to have finite moments of all orders, and let m0m_{0}, 𝐦𝖽{\boldsymbol{m}}\in{\mathbb{N}}^{\sf d} be fixed. Then there exists a constant C=C(t;𝐦)C=C(t;{\boldsymbol{m}}) independent of nn such that

|𝔼[(𝒓it)𝒎]|C,|𝔼[(𝒓ijt)𝒎]|C.\displaystyle\big{|}\operatorname{\mathbb{E}}[({\boldsymbol{r}}_{i}^{t})^{{\boldsymbol{m}}}]\big{|}\leq C\,,\;\;\;\;\big{|}\operatorname{\mathbb{E}}[({\boldsymbol{r}}_{i\to j}^{t})^{{\boldsymbol{m}}}]\big{|}\leq C\,. (125)

As a final step towards the proof of Lemma 4.13, we prove the analogous statement for the modified iteration (r^it)(\hat{r}^{t}_{i}).

Lemma D.6.

Assume πΘ\pi_{\Theta} to have finite moments of all orders, and ψ\psi to be a fixed polynomial. Define the sequence of vectors 𝛍t𝖽{\boldsymbol{\mu}}_{t}\in{\mathbb{R}}^{{\sf d}} and positive semidefinite matrices 𝚺t𝖽×𝖽{\boldsymbol{\Sigma}}_{t}\in{\mathbb{R}}^{{\sf d}\times{\sf d}} via the state evolution equations (10), (11).

Then the following hold (here plim\operatorname*{p-lim} denotes limit in probability):

limn𝔼[ψ(𝒓1t,θ1)]=𝔼ψ(𝝁tΘ+𝑮t,Θ),plimn1ni=1nψ(𝒓it,θi)=𝔼ψ(𝝁tΘ+𝑮t,Θ),\displaystyle\lim_{n\to\infty}\operatorname{\mathbb{E}}[\psi({\boldsymbol{r}}^{t}_{1},\theta_{1})]=\operatorname{\mathbb{E}}\psi\big{(}{\boldsymbol{\mu}}_{t}\Theta+{\boldsymbol{G}}_{t},\Theta\big{)}\,,\;\;\;\;\operatorname*{p-lim}_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\psi({\boldsymbol{r}}^{t}_{i},\theta_{i})=\operatorname{\mathbb{E}}\psi\big{(}{\boldsymbol{\mu}}_{t}\Theta+{\boldsymbol{G}}_{t},\Theta\big{)}\,, (126)
limn𝔼[ψ(𝒓12t,θ1)]=𝔼ψ(𝝁tΘ+𝑮t,Θ),plimn1ni=2nψ(𝒓i1t,θi)=𝔼ψ(𝝁tΘ+𝑮t,Θ).\displaystyle\lim_{n\to\infty}\operatorname{\mathbb{E}}[\psi({\boldsymbol{r}}^{t}_{1\to 2},\theta_{1})]=\operatorname{\mathbb{E}}\psi\big{(}{\boldsymbol{\mu}}_{t}\Theta+{\boldsymbol{G}}_{t},\Theta\big{)}\,,\;\;\;\;\operatorname*{p-lim}_{n\to\infty}\frac{1}{n}\sum_{i=2}^{n}\psi({\boldsymbol{r}}^{t}_{i\to 1},\theta_{i})=\operatorname{\mathbb{E}}\psi\big{(}{\boldsymbol{\mu}}_{t}\Theta+{\boldsymbol{G}}_{t},\Theta\big{)}\,. (127)
Proof.

The proof is essentially the same as the proof of Proposition 4 in [BLM15]. We will focus on the claim (127), since Eq. (126) is completely analogous.

We proceed by induction over tt, and will denote by t{\mathcal{F}}_{t} the σ\sigma-algebra generated by θ\theta and Z(1),Z(t)Z(1),\dots,Z(t). It is convenient to define W(s)=Z(s)/nW(s)=Z(s)/\sqrt{n} and to rewrite Eq. (120) as

𝒓ijt+1=(1nk[n]{i,j}θkFt(𝒓kit))θi+k[n]{i,j}Wik(t)Ft(𝒓kit).\displaystyle{\boldsymbol{r}}_{i\to j}^{t+1}=\Big{(}\frac{1}{n}\sum_{k\in[n]\setminus\{i,j\}}\theta_{k}F_{t}({\boldsymbol{r}}^{t}_{k\to i})\Big{)}\theta_{i}+\sum_{k\in[n]\setminus\{i,j\}}W_{ik}(t)F_{t}({\boldsymbol{r}}^{t}_{k\to i})\,. (128)

Fixing i,ji,j, by the induction hypothesis

plimn1nk[n]{i,j}θkFt(𝒓kit)=𝔼{ΘFt(𝝁tΘ+𝑮t)}=𝝁t+1,\displaystyle\operatorname*{p-lim}_{n\to\infty}\frac{1}{n}\sum_{k\in[n]\setminus\{i,j\}}\theta_{k}F_{t}({\boldsymbol{r}}^{t}_{k\to i})=\operatorname{\mathbb{E}}\big{\{}\Theta F_{t}({\boldsymbol{\mu}}_{t}\Theta+{\boldsymbol{G}}_{t})\big{\}}={\boldsymbol{\mu}}_{t+1}\,, (129)
plimn1nk[n]{i,j}Ft(𝒓kit)Ft(𝒓kit)𝖳=𝔼{Ft(𝝁tΘ+𝑮t)Ft(𝝁tΘ+𝑮t)𝖳}=𝚺t+1,\displaystyle\operatorname*{p-lim}_{n\to\infty}\frac{1}{n}\sum_{k\in[n]\setminus\{i,j\}}F_{t}({\boldsymbol{r}}^{t}_{k\to i})F_{t}({\boldsymbol{r}}^{t}_{k\to i})^{{\sf T}}=\operatorname{\mathbb{E}}\big{\{}F_{t}({\boldsymbol{\mu}}_{t}\Theta+{\boldsymbol{G}}_{t})F_{t}({\boldsymbol{\mu}}_{t}\Theta+{\boldsymbol{G}}_{t})^{{\sf T}}\big{\}}={\boldsymbol{\Sigma}}_{t+1}\,, (130)
plimn1nk[n]{i,j}Ft(𝒓kit)4=Ct<.\displaystyle\operatorname*{p-lim}_{n\to\infty}\frac{1}{n}\sum_{k\in[n]\setminus\{i,j\}}\|F_{t}({\boldsymbol{r}}^{t}_{k\to i})\|^{4}=C_{t}<\infty\,. (131)
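As an aside, the right-hand sides of Eqs. (129)–(130) are exactly the state evolution updates for (𝝁_{t+1}, 𝚺_{t+1}). The following is a minimal Monte Carlo sketch of this recursion in the scalar case 𝖽 = 1, with the same assumed prior and nonlinearity as in the simulation sketch after Eq. (122) (these choices are placeholders, not taken from the paper); the constant initialization 𝒓^0 = 𝔼[Θ] is handled by feeding that constant into F_0 at the first step.

```python
import numpy as np

def state_evolution(F, sample_theta, E_theta, T, n_mc=200_000, seed=1):
    """Monte Carlo sketch of the scalar (d = 1) state evolution recursion
        mu_{t+1}    = E[ Theta * F_t(mu_t Theta + G_t) ]      (cf. Eq. (129))
        Sigma_{t+1} = E[ F_t(mu_t Theta + G_t)^2 ]            (cf. Eq. (130))
    with G_t ~ N(0, Sigma_t) independent of Theta ~ pi_Theta."""
    rng = np.random.default_rng(seed)
    mu, sigma, history = None, None, []
    for t in range(T):
        theta = sample_theta(n_mc, rng)
        if t == 0:
            arg = np.full(n_mc, E_theta)          # r^0_{i->j} = E[Theta] is deterministic
        else:
            arg = mu * theta + rng.normal(scale=np.sqrt(sigma), size=n_mc)
        Ft = F(arg)
        mu, sigma = np.mean(theta * Ft), np.mean(Ft ** 2)
        history.append((mu, sigma))               # records (mu_{t+1}, Sigma_{t+1})
    return history

# assumed two-point prior matching the simulation sketch above
sample_theta = lambda m, rng: rng.choice([1.0, -1.0], size=m, p=[0.7, 0.3])
history = state_evolution(np.tanh, sample_theta, E_theta=0.4, T=5)
```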

Construct (θi)i1(\theta_{i})_{i\geq 1} for different nn in the same probability space. By Lyapunov’s central limit theorem, we have that, in probability

𝖫𝖺𝗐(𝒓ijt+1|t)𝒩(𝝁t+1θi,𝚺t+1).\displaystyle{\sf Law}({\boldsymbol{r}}_{i\to j}^{t+1}|{\mathcal{F}}_{t})\Rightarrow{\mathcal{N}}({\boldsymbol{\mu}}_{t+1}\theta_{i},{\boldsymbol{\Sigma}}_{t+1})\,. (132)

(Here \Rightarrow denotes weak convergence of probability measures.) Hence, for any bounded Lipschitz function ψ¯\overline{\psi}, we have

limn𝔼[ψ¯(𝒓ijt+1,θi)|t]=𝔼ψ¯(𝝁t+1Θ+𝑮t+1,Θ).\displaystyle\lim_{n\to\infty}\operatorname{\mathbb{E}}[\overline{\psi}({\boldsymbol{r}}^{t+1}_{i\to j},\theta_{i})|{\mathcal{F}}_{t}]=\operatorname{\mathbb{E}}\overline{\psi}\big{(}{\boldsymbol{\mu}}_{t+1}\Theta+{\boldsymbol{G}}_{t+1},\Theta\big{)}\,. (133)

Since the right-hand side is non-random, by dominated convergence we also have

limn𝔼[ψ¯(𝒓ijt+1,θi)]=𝔼ψ¯(𝝁t+1Θ+𝑮t+1,Θ).\displaystyle\lim_{n\to\infty}\operatorname{\mathbb{E}}[\overline{\psi}({\boldsymbol{r}}^{t+1}_{i\to j},\theta_{i})]=\operatorname{\mathbb{E}}\overline{\psi}\big{(}{\boldsymbol{\mu}}_{t+1}\Theta+{\boldsymbol{G}}_{t+1},\Theta\big{)}\,. (134)

By Lemma D.5, the claim also holds for ψ\psi that is only polynomially bounded, thus proving the first equation in Eq. (127) for iteration t+1t+1.

Next consider the second limit in Eq. (127), again with ψ\psi a fixed polynomial. We claim that, for any fixed i1i2{2,,n}i_{1}\neq i_{2}\in\{2,\dots,n\},

limn|𝔼[ψ(θi1,𝒓i11t+1)ψ(θi2,𝒓i21t+1)]𝔼[ψ(θi1,𝒓i11t+1)]𝔼[ψ(θi2,𝒓i21t+1)]|=0.\displaystyle\lim_{n\to\infty}\big{|}\operatorname{\mathbb{E}}[\psi(\theta_{i_{1}},{\boldsymbol{r}}_{i_{1}\to 1}^{t+1})\psi(\theta_{i_{2}},{\boldsymbol{r}}_{i_{2}\to 1}^{t+1})]-\operatorname{\mathbb{E}}[\psi(\theta_{i_{1}},{\boldsymbol{r}}_{i_{1}\to 1}^{t+1})]\operatorname{\mathbb{E}}[\psi(\theta_{i_{2}},{\boldsymbol{r}}_{i_{2}\to 1}^{t+1})]\big{|}=0\,. (135)

Indeed, by linearity it is sufficient to prove that this is the case for ψ(θ,𝒙)=ψ𝒎(θ)𝒙𝒎\psi(\theta,{\boldsymbol{x}})=\psi_{{\boldsymbol{m}}}(\theta){\boldsymbol{x}}^{{\boldsymbol{m}}}. This case in turn can be analyzed by expanding (𝒓i1t)𝒎({\boldsymbol{r}}_{i\to 1}^{t})^{{\boldsymbol{m}}} as a sum over trees, as in the proof of Lemma D.2. (See Proposition 4 in [BLM15].)

By expanding the sum in the variance and using Eq. (135) we thus get

limnVar(1ni=2nψ(𝒓i1t+1,θi))=0.\displaystyle\lim_{n\to\infty}{\rm Var}\left(\frac{1}{n}\sum_{i=2}^{n}\psi({\boldsymbol{r}}^{t+1}_{i\to 1},\theta_{i})\right)=0\,.

Further, we proved above that

limn𝔼(1ni=2nψ(𝒓i1t+1,θi))=𝔼ψ(𝝁t+1Θ+𝑮t+1,Θ).\displaystyle\lim_{n\to\infty}\operatorname{\mathbb{E}}\left(\frac{1}{n}\sum_{i=2}^{n}\psi({\boldsymbol{r}}^{t+1}_{i\to 1},\theta_{i})\right)=\operatorname{\mathbb{E}}\psi\big{(}{\boldsymbol{\mu}}_{t+1}\Theta+{\boldsymbol{G}}_{t+1},\Theta\big{)}\,.

Hence, by Chebyshev’s inequality, the following holds for any polynomial ψ\psi:

plimn1ni=2nψ(𝒓i1t+1,θi)=𝔼ψ(𝝁t+1Θ+𝑮t+1,Θ).\displaystyle\operatorname*{p-lim}_{n\to\infty}\frac{1}{n}\sum_{i=2}^{n}\psi({\boldsymbol{r}}^{t+1}_{i\to 1},\theta_{i})=\operatorname{\mathbb{E}}\psi\big{(}{\boldsymbol{\mu}}_{t+1}\Theta+{\boldsymbol{G}}_{t+1},\Theta\big{)}\,.

This completes the proof of Lemma D.6. ∎
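As an informal sanity check on Lemma D.6 (not part of the argument), one can compare the empirical average n^{-1}∑_i ψ(𝒓_i^t, θ_i) produced by the simulation sketch after Eq. (122) with the Gaussian prediction 𝔼ψ(𝝁_tΘ + 𝑮_t, Θ) computed from the state-evolution sketch above, here for the hypothetical test polynomial ψ(x, θ) = xθ.

```python
import numpy as np

# Sanity-check sketch for Lemma D.6 with the test polynomial psi(x, theta) = x * theta.
# Reuses `r_vec`, `theta` from the iteration sketch and `history` from the
# state-evolution sketch above; both use T = 5 iterations and the same assumed prior.
psi = lambda x, th: x * th

empirical = np.mean(psi(r_vec, theta))                        # (1/n) sum_i psi(r_i^T, theta_i)

mu_T, sigma_T = history[-1]                                   # (mu_T, Sigma_T)
rng = np.random.default_rng(2)
theta_mc = rng.choice([1.0, -1.0], size=200_000, p=[0.7, 0.3])
G_mc = rng.normal(scale=np.sqrt(sigma_T), size=200_000)
prediction = np.mean(psi(mu_T * theta_mc + G_mc, theta_mc))   # E psi(mu_T Theta + G_T, Theta)

print(empirical, prediction)    # the two numbers should be close for large n
```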

Now Lemma 4.13 is an immediate consequence of Corollary D.4 and Lemma D.6.

References

  • [AW09] Arash A Amini and Martin J Wainwright. High-dimensional analysis of semidefinite relaxations for sparse principal components. The Annals of Statistics, pages 2877–2921, 2009.
  • [BB20] Matthew Brennan and Guy Bresler. Reducibility and statistical-computational gaps from secret leakage. In Conference on Learning Theory, pages 648–847. PMLR, 2020.
  • [BBH18] Matthew Brennan, Guy Bresler, and Wasim Huleihel. Reducibility and computational lower bounds for problems with planted sparse structure. In Conference On Learning Theory, pages 48–166. PMLR, 2018.
  • [BBH+21] Matthew S Brennan, Guy Bresler, Sam Hopkins, Jerry Li, and Tselil Schramm. Statistical query algorithms and low degree tests are almost equivalent. In Conference on Learning Theory, pages 774–774. PMLR, 2021.
  • [BBP05] Jinho Baik, Gérard Ben Arous, and Sandrine Péché. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Annals of Probability, pages 1643–1697, 2005.
  • [BGN11] Florent Benaych-Georges and Raj Rao Nadakuditi. The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices. Advances in Mathematics, 227(1):494–521, 2011.
  • [BHK+19] Boaz Barak, Samuel Hopkins, Jonathan Kelner, Pravesh K Kothari, Ankur Moitra, and Aaron Potechin. A nearly tight sum-of-squares lower bound for the planted clique problem. SIAM Journal on Computing, 48(2):687–735, 2019.
  • [BLM15] Mohsen Bayati, Marc Lelarge, and Andrea Montanari. Universality in polytope phase transitions and message passing algorithms. The Annals of Applied Probability, 25(2):753–822, 2015.
  • [BM11] Mohsen Bayati and Andrea Montanari. The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Trans. on Inform. Theory, 57:764–785, 2011.
  • [BM19] Jean Barbier and Nicolas Macris. The adaptive interpolation method: a simple scheme to prove replica formulas in Bayesian inference. Probability Theory and Related Fields, 174(3):1133–1185, 2019.
  • [BMN20] Raphael Berthier, Andrea Montanari, and Phan-Minh Nguyen. State evolution for approximate message passing with non-separable functions. Information and Inference: A Journal of the IMA, 9(1):33–79, 2020.
  • [BMR21] Jess Banks, Sidhanth Mohanty, and Prasad Raghavendra. Local statistics, semidefinite programming, and community detection. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1298–1316. SIAM, 2021.
  • [Bol14] Erwin Bolthausen. An iterative construction of solutions of the TAP equations for the Sherrington–Kirkpatrick model. Communications in Mathematical Physics, 325(1):333–366, 2014.
  • [BR13] Quentin Berthet and Philippe Rigollet. Complexity theoretic lower bounds for sparse principal component detection. In Conference on learning theory, pages 1046–1066. PMLR, 2013.
  • [BS06] Jinho Baik and Jack W Silverstein. Eigenvalues of large sample covariance matrices of spiked population models. Journal of Multivariate Analysis, 97(6):1382–1408, 2006.
  • [CM22] Michael Celentano and Andrea Montanari. Fundamental barriers to high-dimensional regression with convex penalties. The Annals of Statistics, 50(1):170–196, 2022.
  • [CMW20] Michael Celentano, Andrea Montanari, and Yuchen Wu. The estimation error of general first order methods. In Conference on Learning Theory, pages 1078–1141. PMLR, 2020.
  • [CX22] Hong-Bin Chen and Jiaming Xia. Hamilton–Jacobi equations for inference of matrix tensor products. In Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, volume 58, pages 755–793. Institut Henri Poincaré, 2022.
  • [DKMZ11] Aurelien Decelle, Florent Krzakala, Cristopher Moore, and Lenka Zdeborová. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Physical Review E, 84(6):066106, 2011.
  • [DKWB19] Yunzi Ding, Dmitriy Kunisky, Alexander S Wein, and Afonso S Bandeira. Subexponential-time algorithms for sparse PCA. arXiv preprint arXiv:1907.11635, 2019.
  • [DMM09] David L. Donoho, Arian Maleki, and Andrea Montanari. Message Passing Algorithms for Compressed Sensing. Proceedings of the National Academy of Sciences, 106:18914–18919, 2009.
  • [EG21] Morris L Eaton and Edward I George. Charles Stein and invariance: Beginning with the Hunt–Stein theorem. The Annals of Statistics, 49(4):1815–1822, 2021.
  • [Fan22] Zhou Fan. Approximate message passing algorithms for rotationally invariant matrices. The Annals of Statistics, 50(1):197–224, 2022.
  • [Gal62] Robert Gallager. Low-density parity-check codes. IRE Transactions on Information Theory, 8(1):21–28, 1962.
  • [GB21] Cédric Gerbelot and Raphaël Berthier. Graph-based approximate message passing iterations. arXiv:2109.11905, 2021.
  • [GJS21] David Gamarnik, Aukosh Jagannath, and Subhabrata Sen. The overlap gap property in principal submatrix recovery. Probability Theory and Related Fields, 181(4):757–814, 2021.
  • [GZ22] David Gamarnik and Ilias Zadik. Sparse high-dimensional linear regression. Estimating squared error and a phase transition. The Annals of Statistics, 50(2):880–903, 2022.
  • [HKP+17] Samuel B Hopkins, Pravesh K Kothari, Aaron Potechin, Prasad Raghavendra, Tselil Schramm, and David Steurer. The power of sum-of-squares for detecting hidden structures. In 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 720–731. IEEE, 2017.
  • [HR04] David C Hoyle and Magnus Rattray. Principal-component-analysis eigenvalue spectra from data with symmetry-breaking structure. Physical Review E, 69(2):026124, 2004.
  • [HS17] Samuel B Hopkins and David Steurer. Efficient Bayesian estimation from few samples: community detection and related problems. In 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 379–390. IEEE, 2017.
  • [HSS15] Samuel B Hopkins, Jonathan Shi, and David Steurer. Tensor principal component analysis via sum-of-square proofs. In Conference on Learning Theory, pages 956–1006. PMLR, 2015.
  • [JL09] Iain M Johnstone and Arthur Yu Lu. On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 104(486), 2009.
  • [JM13] Adel Javanmard and Andrea Montanari. State evolution for general approximate message passing algorithms, with applications to spatial coupling. Information and Inference: A Journal of the IMA, 2(2):115–144, 2013.
  • [KNV15] Robert Krauthgamer, Boaz Nadler, and Dan Vilenchik. Do semidefinite relaxations solve sparse PCA up to the information limit? The Annals of Statistics, 43(3):1300–1322, 2015.
  • [LM19] Marc Lelarge and Léo Miolane. Fundamental limits of symmetric low-rank matrix estimation. Probability Theory and Related Fields, 173(3):859–929, 2019.
  • [Lor05] G. G. Lorentz. Approximation of Functions, volume 322. American Mathematical Soc., 2005.
  • [MOS13] Wilhelm Magnus, Fritz Oberhettinger, and Raj Pal Soni. Formulas and theorems for the special functions of mathematical physics, volume 52. Springer Science & Business Media, 2013.
  • [MPV87] Marc Mézard, Giorgio Parisi, and Miguel A. Virasoro. Spin Glass Theory and Beyond. World Scientific, 1987.
  • [MSS16] Tengyu Ma, Jonathan Shi, and David Steurer. Polynomial-time tensor decompositions with sum-of-squares. In 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 438–446. IEEE, 2016.
  • [MV21] Andrea Montanari and Ramji Venkataramanan. Estimation of low-rank matrices via approximate message passing. The Annals of Statistics, 49(1):321–345, 2021.
  • [MW22] Andrea Montanari and Yuchen Wu. Statistically optimal first order algorithms: A proof via orthogonalization. arXiv:2201.05101, 2022.
  • [RM14] Emile Richard and Andrea Montanari. A statistical model for tensor PCA. Advances in Neural Information Processing Systems, 27, 2014.
  • [RU08] T. J. Richardson and R. Urbanke. Modern Coding Theory. Cambridge University Press, Cambridge, 2008.
  • [SW22] Tselil Schramm and Alexander S Wein. Computational barriers to estimation from low-degree polynomials. The Annals of Statistics, 50(3):1833–1858, 2022.
  • [Sze39] Gabor Szegö. Orthogonal polynomials, volume 23. American Mathematical Soc., 1939.
  • [TAP77] David J. Thouless, Philip W. Anderson, and Richard G. Palmer. Solution of ‘solvable model of a spin glass’. Philosophical Magazine, 35(3):593–601, 1977.
  • [Wei22] Alexander S Wein. Average-case complexity of tensor decomposition for low-degree polynomials. arXiv preprint arXiv:2211.05274, 2022.
  • [WEM19] Alexander S Wein, Ahmed El Alaoui, and Cristopher Moore. The Kikuchi hierarchy and tensor PCA. In 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 1446–1468. IEEE, 2019.