
Equivalence of Approximate Message Passing and
Low-Degree Polynomials in Rank-One Matrix Estimation

Andrea Montanari, Department of Electrical Engineering and Department of Statistics, Stanford University; School of Mathematics, Institute for Advanced Study, Princeton.    Alexander S. Wein, Department of Mathematics, University of California, Davis.
Abstract

We consider the problem of estimating an unknown parameter vector ${\boldsymbol{\theta}}\in{\mathbb{R}}^{n}$, given noisy observations ${\boldsymbol{Y}}={\boldsymbol{\theta}}{\boldsymbol{\theta}}^{{\sf T}}/\sqrt{n}+{\boldsymbol{Z}}$ of the rank-one matrix ${\boldsymbol{\theta}}{\boldsymbol{\theta}}^{{\sf T}}$, where ${\boldsymbol{Z}}$ has independent Gaussian entries. When information is available about the distribution of the entries of ${\boldsymbol{\theta}}$, spectral methods are known to be strictly sub-optimal. Past work characterized the asymptotics of the accuracy achieved by the optimal estimator. However, no polynomial-time estimator is known that achieves this accuracy.

It has been conjectured that this statistical-computational gap is fundamental, and moreover that the optimal accuracy achievable by polynomial-time estimators coincides with the accuracy achieved by certain approximate message passing (AMP) algorithms. We provide evidence towards this conjecture by proving that no estimator in the (broader) class of constant-degree polynomials can surpass AMP.

1 Introduction

Statistical-computational gaps are a ubiquitous phenomenon in high-dimensional statistics. Consider estimation of a high-dimensional parameter vector ${\boldsymbol{\theta}}=(\theta_{1},\theta_{2},\dots,\theta_{n})$ from noisy observations ${\boldsymbol{Y}}\sim{\rm P}_{{\boldsymbol{\theta}}}$. In many models, we can characterize the optimal accuracy achieved by arbitrary estimators. However, when we analyze classes of estimators that can be implemented via polynomial-time algorithms, a significantly smaller accuracy is obtained. A short list of examples includes sparse regression [CM22, GZ22], sparse principal component analysis [JL09, AW09, KNV15, BR13], graph clustering and community detection [DKMZ11, BHK+19], tensor principal component analysis [RM14, HSS15], and tensor decomposition [MSS16, Wei22].

This state of affairs has led to the conjecture that these gaps are fundamental in nature: there simply exists no polynomial-time algorithm achieving the statistically optimal accuracy. Proving such a conjecture is extremely challenging since standard complexity-theoretic assumptions (e.g. P$\neq$NP) are ill-suited to establish complexity lower bounds in average-case settings where the input to the algorithm is random. A possible approach to overcome this challenge is to establish ‘average-case’ reductions between statistical estimation problems. We refer to [BBH18, BB20] for pointers to this rich line of work.

A second approach is to prove reductions between classes of polynomial-time algorithms, thus trimming the space of possible algorithmic choices. This paper contributes to this line of work by establishing—in a specific context—that approximate message passing (AMP) algorithms and low-degree (Low-Deg) polynomial estimators achieve asymptotically the same accuracy.

Examples of reductions between algorithm classes in statistical estimation include [HKP+17, BBH+21, CMW20, BMR21, CM22]. The motivation for studying the relation between AMP and Low-Deg comes from the distinctive position occupied by these two classes in the theoretical landscape. AMP algorithms are iterative algorithms motivated by ideas in information theory (iterative decoding [Gal62, RU08]) and statistical physics (cavity method and TAP equations [TAP77, MPV87]). Their structure is relatively constrained: they operate by alternating a matrix-vector multiplication (with a matrix constructed from the data ${\boldsymbol{Y}}$), and a non-linear operation on the same vectors (typically independent of the data matrix). This structure has opened the way to sharp characterization of their behavior in the high-dimensional limit, known as ‘state evolution’ [DMM09, BM11, Bol14].

Low-Deg was originally motivated by a connection to Sum-of-Squares (SoS) semidefinite programming relaxations and captures a different (algebraic) notion of complexity [HS17, HKP+17, SW22]. In Low-Deg procedures, each unknown parameter $\theta_{i}$ is estimated via a fixed polynomial in the entries of the data matrix ${\boldsymbol{Y}}$, of constant or moderately growing degree. As such, these estimators are relatively unconstrained, and indeed capture (via polynomial approximation) a broad variety of natural approaches. Developing a sharp analysis of such a broad class of estimators can be very challenging.

In the symmetric rank-one estimation problem, we observe a noisy version ${\boldsymbol{Y}}$ of the rank-one matrix ${\boldsymbol{\theta}}{\boldsymbol{\theta}}^{{\sf T}}$, with ${\boldsymbol{\theta}}\in{\mathbb{R}}^{n}$ an unknown vector:

$${\boldsymbol{Y}}=\frac{1}{\sqrt{n}}{\boldsymbol{\theta}}{\boldsymbol{\theta}}^{{\sf T}}+{\boldsymbol{Z}}\,. \tag{1}$$

Here ${\boldsymbol{Z}}$ is a random matrix independent of ${\boldsymbol{\theta}}$, drawn from the Gaussian Orthogonal Ensemble ${\boldsymbol{Z}}\sim{\sf GOE}(n)$, which we define here by ${\boldsymbol{Z}}={\boldsymbol{Z}}^{{\sf T}}$ and $(Z_{ij})_{i\leq j}$ independent with $Z_{ii}\sim{\mathcal{N}}(0,2)$ and $Z_{ij}\sim{\mathcal{N}}(0,1)$ for $i<j$. Given a single observation of the matrix ${\boldsymbol{Y}}$, we would like to estimate ${\boldsymbol{\theta}}$.
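To fix ideas, here is a minimal sampling sketch for model (1) under the GOE convention above (our own illustration, not part of the paper; `sample_instance` and `prior_sampler` are hypothetical names):

```python
import numpy as np

def sample_instance(n, prior_sampler, rng=None):
    """Draw (theta, Y) from Y = theta theta^T / sqrt(n) + Z, with Z ~ GOE(n):
    symmetric, Z_ii ~ N(0, 2) on the diagonal and Z_ij ~ N(0, 1) for i < j."""
    rng = np.random.default_rng(rng)
    theta = prior_sampler(n, rng)                 # i.i.d. coordinates from pi_Theta
    G = rng.standard_normal((n, n))
    Z = (G + G.T) / np.sqrt(2)                    # off-diagonal variance 1, diagonal variance 2
    Y = np.outer(theta, theta) / np.sqrt(n) + Z
    return theta, Y

# Example: two-point prior pi_Theta = (delta_0 + delta_2)/2, which has nonzero mean
theta, Y = sample_instance(1000, lambda n, rng: rng.choice([0.0, 2.0], size=n))
```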

Because of its simplicity, the rank-one estimation problem has been a useful sandbox to develop new mathematical ideas in high-dimensional statistics. A significant amount of work has been devoted to the analysis of spectral estimators, which typically take the form $\hat{\boldsymbol{\theta}}({\boldsymbol{Y}})=c_{n}\,{\boldsymbol{v}}_{1}({\boldsymbol{Y}})$, where ${\boldsymbol{v}}_{1}({\boldsymbol{Y}})$ is the top eigenvector of ${\boldsymbol{Y}}$ and $c_{n}$ is a scaling factor [HR04, BBP05, BS06, BGN11]. However, spectral methods are known to be suboptimal if additional information is available about the entries $\theta_{i}$ of ${\boldsymbol{\theta}}$. In this paper, we model this information by assuming $(\theta_{i})_{1\leq i\leq n}\sim_{iid}\pi_{\Theta}$, for $\pi_{\Theta}$ a probability distribution on ${\mathbb{R}}$ (which does not depend on $n$). The resulting Bayes error coincides (up to terms negligible as $n\to\infty$) with the minimax optimal error in the class of vectors with empirical distribution converging to $\pi_{\Theta}$. Hence this model captures the minimax behavior in a well-defined class of signal parameters.

In the high-dimensional limit $n\to\infty$ (with $\pi_{\Theta}$ fixed), the Bayes optimal accuracy for estimating ${\boldsymbol{\theta}}$ under model (1) (and the above assumptions on ${\boldsymbol{\theta}}$) is known to converge to a well-defined limit that was characterized in a sequence of papers, see e.g. [LM19, BM19, CX22]. The asymptotic accuracy of the optimal AMP algorithm (called Bayes AMP) is also known [MV21].

It is useful to pause and recall these results in some detail. Define the function $\Psi:{\mathbb{R}}_{\geq 0}\times{\mathbb{R}}\times\mathscrsfs{P}({\mathbb{R}})\to{\mathbb{R}}$ (here and below $\mathscrsfs{P}({\mathbb{R}})$ denotes the set of probability distributions over ${\mathbb{R}}$) by letting

$$\Psi(q;b,\pi_{\Theta}):=\frac{1}{4}q^{2}-\frac{1}{2}\big(\operatorname{\mathbb{E}}[\Theta^{2}]+b\big)q+{\rm I}(q;\pi_{\Theta})\,, \tag{2}$$
$${\rm I}(q;\pi_{\Theta}):=\operatorname{\mathbb{E}}\log\frac{{\rm d}p_{Y_{\rm eff}|\Theta}}{{\rm d}p_{Y_{\rm eff}}}\,,\qquad Y_{\rm eff}=\sqrt{q}\,\Theta+G\,,\qquad(\Theta,G)\sim\pi_{\Theta}\otimes\mathcal{N}(0,1)\,. \tag{3}$$

Note that ${\rm I}(q;\pi_{\Theta})$ can be interpreted as the mutual information between a scalar random variable $\Theta\sim\pi_{\Theta}$ and the scalar noisy observation $Y_{\rm eff}$. It can be expressed as a one-dimensional integral with respect to $\pi_{\Theta}$:

$${\rm I}(q;\pi_{\Theta})=-\operatorname{\mathbb{E}}\log\Big\{\int e^{-(Y_{\rm eff}-\sqrt{q}\theta)^{2}/2}\,\pi_{\Theta}({\rm d}\theta)\Big\}-\frac{1}{2}\,. \tag{4}$$
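For instance (a worked example we add for concreteness, not taken from the text), for the Rademacher prior $\pi_{\Theta}=\frac{1}{2}(\delta_{-1}+\delta_{+1})$, Eq. (4) reduces after a short computation to

$${\rm I}(q;\pi_{\Theta})=q-\operatorname{\mathbb{E}}_{G\sim\mathcal{N}(0,1)}\log\cosh\big(q+\sqrt{q}\,G\big)\,,$$

which increases from ${\rm I}(0;\pi_{\Theta})=0$ and saturates at $\log 2$, the entropy of $\Theta$, as $q\to\infty$.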

The next statement adapts results from [LM19] concerning the behavior of the Bayes optimal error (see Appendix A, which details the derivation from [LM19]). A formal definition of Bayes AMP will be given in the next section, alongside a formal definition of Low-Deg algorithms.

Theorem 1.1.

Assume $\pi_{\Theta}$ is independent of $n$, has non-vanishing first moment $\operatorname{\mathbb{E}}[\Theta]\neq 0$, and has finite moments of all orders. Define

$$q_{\rm Bayes}(\pi_{\Theta}):=\operatorname*{argmin}_{q\geq 0}\,\Psi(q;0,\pi_{\Theta})\,, \tag{5}$$
$$q_{\rm AMP}(\pi_{\Theta}):=\inf\big\{q\geq 0\;:\;\Psi^{\prime}(q;0,\pi_{\Theta})=0,\;\Psi^{\prime\prime}(q;0,\pi_{\Theta})\geq 0\big\}\,. \tag{6}$$

Bayes AMP has time complexity $O(c(n)\,n^{2})$, for any diverging sequence $c(n)$, and achieves mean squared error (MSE)

$$\lim_{n\to\infty}\frac{1}{n}\operatorname{\mathbb{E}}\big\{\|\hat{\boldsymbol{\theta}}^{\rm AMP}({\boldsymbol{Y}})-{\boldsymbol{\theta}}\|^{2}_{2}\big\}=\operatorname{\mathbb{E}}[\Theta^{2}]-q_{\rm AMP}(\pi_{\Theta})\,. \tag{7}$$

Further, assume that $b\mapsto\Psi_{*}(b;\pi_{\Theta}):=\min_{q\geq 0}\Psi(q;b,\pi_{\Theta})$ is differentiable at $b=0$. Then the minimum MSE of any estimator is

$$\lim_{n\to\infty}\inf_{\hat{\boldsymbol{\theta}}(\,\cdot\,)}\frac{1}{n}\operatorname{\mathbb{E}}\big\{\|\hat{\boldsymbol{\theta}}({\boldsymbol{Y}})-{\boldsymbol{\theta}}\|^{2}_{2}\big\}=\operatorname{\mathbb{E}}[\Theta^{2}]-q_{\rm Bayes}(\pi_{\Theta})\,. \tag{8}$$
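To make (5) concrete, the following crude numerical sketch (our own, with hypothetical helper names `I_mc` and `q_bayes`) estimates $q_{\rm Bayes}$ for a finite-support prior by grid-minimizing $\Psi(q;0,\pi_{\Theta})$, with ${\rm I}(q;\pi_{\Theta})$ evaluated by Monte Carlo via Eq. (4).

```python
import numpy as np

rng = np.random.default_rng(0)

def I_mc(q, vals, probs, n_mc=100_000):
    """Monte Carlo estimate of I(q; pi_Theta) from Eq. (4), for the finite-support prior
    pi_Theta = sum_k probs[k] * delta_{vals[k]}."""
    vals, probs = np.asarray(vals, float), np.asarray(probs, float)
    theta = rng.choice(vals, size=n_mc, p=probs)
    y_eff = np.sqrt(q) * theta + rng.standard_normal(n_mc)    # Y_eff = sqrt(q) Theta + G
    inner = np.exp(-0.5 * (y_eff[:, None] - np.sqrt(q) * vals[None, :]) ** 2) @ probs
    return -np.mean(np.log(inner)) - 0.5

def q_bayes(vals, probs, q_max=4.0, n_grid=80):
    """Grid search for q_Bayes = argmin_{q >= 0} Psi(q; 0, pi_Theta), Eqs. (2) and (5)."""
    m2 = float(np.dot(probs, np.asarray(vals, float) ** 2))
    grid = np.linspace(1e-6, q_max, n_grid)
    psi = [0.25 * q**2 - 0.5 * m2 * q + I_mc(q, vals, probs) for q in grid]
    return grid[int(np.argmin(psi))]

# Nonzero-mean two-point prior, consistent with the assumptions of Theorem 1.1
print("q_Bayes ~", q_bayes([0.0, 2.0], [0.5, 0.5]))
```

For $q_{\rm AMP}(\pi_{\Theta})$ one instead iterates the state evolution recursion (14) below from $q_{0}=0$; see the sketch following Eq. (16).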
Remark 1.1 (Differentiability at $b=0$).

The function $b\mapsto\Psi_{*}(b;\pi_{\Theta})$ is concave because it is a minimum of linear functions, and therefore differentiable at all but countably many values of $b$. Hence the differentiability assumption amounts to requiring that $b=0$ is non-exceptional.

Also note that replacing the prior $\pi_{\Theta}$ by its shift $\pi_{\Theta^{\prime}}$, where $\Theta^{\prime}=\Theta+a$, amounts to changing $b$ to $b^{\prime}=b+2\operatorname{\mathbb{E}}[\Theta]a+a^{2}$. Hence the assumption that $b=0$ is non-exceptional is equivalent to the assumption that the mean of $\pi_{\Theta}$ is non-exceptional.

We also note that our results will not require such genericity assumptions, which we only introduce to offer a simple comparison to the Bayes optimal estimator.

Remark 1.2 (Nonzero mean assumption).

The assumption $\operatorname{\mathbb{E}}[\Theta]\neq 0$ can be removed from Theorem 1.1 provided the mean squared error metric is replaced by a different metric that is invariant under a change of relative sign in ${\boldsymbol{\theta}}$ and $\hat{\boldsymbol{\theta}}$. In that case, Bayes AMP needs to be modified, e.g. via a spectral initialization [MV21]. Here we focus on the case $\operatorname{\mathbb{E}}[\Theta]\neq 0$ because this is the most relevant for the results that follow.

In this paper we compare Bayes AMP run for a constant number $t=O(1)$ of iterations and Low-Deg estimators with degree $D=O(1)$. Assuming $\operatorname{\mathbb{E}}[\Theta]\neq 0$ and $\pi_{\Theta}$ sub-Gaussian, we establish the following (see Theorem 2.3):

  • For any fixed $t$ and $\varepsilon>0$, there exists a constant $D=D(t,\varepsilon)$ and a degree-$D$ estimator that approximates the MSE achieved by Bayes AMP after $t$ iterations within an additive error of at most $\varepsilon$.

  • For any constant $D=O(1)$, no degree-$D$ estimator can surpass the asymptotic MSE of Bayes AMP. Namely, for any fixed $D$ and any degree-$D$ estimator $\hat{\boldsymbol{\theta}}$, we have $\operatorname{\mathbb{E}}\big\{\|\hat{\boldsymbol{\theta}}({\boldsymbol{Y}})-{\boldsymbol{\theta}}\|^{2}_{2}\big\}/n\geq\operatorname{\mathbb{E}}[\Theta^{2}]-q_{\rm AMP}(\pi_{\Theta})-o_{n}(1)$.

Here and throughout, the notation $o_{n}(1)$ signifies a quantity that vanishes in the limit where $n\to\infty$ with all other parameters (such as $D$) held fixed. The first claim above (‘upper bound’) is proved by a straightforward polynomial approximation of Bayes AMP. To obtain the second claim (‘lower bound’), we develop a new proof technique. Given a Low-Deg estimator for coordinate $i$, we express $\hat{\theta}_{i}({\boldsymbol{Y}})$ as a sum of terms that are associated to rooted multi-graphs on vertex set $[n]:=\{1,2,\ldots,n\}$ with at most $D$ edges. We group these terms into a constant number of homomorphism classes. We next prove that among these classes, only those corresponding to trees yield a non-negligible contribution as $n\to\infty$. Finally we show that the latter contribution can be encoded into an AMP algorithm and use existing optimality theory for AMP algorithms [CMW20, MW22] to deduce the result.

This new proof strategy elucidates the relation between AMP and Low-Deg algorithms: roughly speaking, AMP algorithms correspond to ‘tree-structured’ low-degree polynomials, a subspace of all Low-Deg estimators. AMP can effectively evaluate tree-structured polynomials via a dynamic-programming style recursion with complexity $O(n^{2}\cdot\mathsf{depth})$ (with $\mathsf{depth}$ the tree depth) instead of the naive $O(n^{D+1})$.
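The following toy sketch (our own illustration, not the paper's construction; it sums over all labelings of a rooted path, ignoring the non-reversing restriction used in $\mathscrsfs{F}_{T}$ of Section 4.1) shows the dynamic-programming idea: the same sum is computed either by brute-force enumeration in $O(n^{\mathsf{depth}+1})$ time or by $\mathsf{depth}$ matrix-vector products in $O(n^{2}\cdot\mathsf{depth})$ time.

```python
import numpy as np

def path_poly_naive(Y, i, depth):
    """Sum over all labelings phi of a rooted path with `depth` edges, phi(root) = i,
    of prod_{(u,v) in E} Y[phi(u), phi(v)] -- brute force, O(n^(depth+1))."""
    n = Y.shape[0]
    if depth == 0:
        return 1.0
    return sum(Y[i, j] * path_poly_naive(Y, j, depth - 1) for j in range(n))

def path_poly_dp(Y, i, depth):
    """Same quantity via `depth` matrix-vector products -- O(n^2 * depth)."""
    v = np.ones(Y.shape[0])
    for _ in range(depth):
        v = Y @ v
    return v[i]

rng = np.random.default_rng(0)
Y = rng.standard_normal((30, 30)); Y = (Y + Y.T) / 2
print(path_poly_naive(Y, 0, 3), path_poly_dp(Y, 0, 3))   # agree up to floating-point error
```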

The rest of the paper is organized as follows. Section 2 provides the necessary background, formally states our results, and discusses limitations, implications, and future directions. Sections 3 and 4 prove the main theorem (Theorem 2.3), respectively establishing the upper and lower bounds on the optimal estimation error achieved by Low-Deg. The proofs of several technical lemmas are deferred to the appendices.

2 Main results

2.1 Background: AMP

The class of AMP algorithms that we will consider in this paper (more general settings have been studied and analyzed in the literature [JM13, BMN20, GB21, Fan22]) proceeds iteratively by updating a state ${\boldsymbol{x}}^{t}\in{\mathbb{R}}^{n\times{\sf d}}$ according to the iteration:

$${\boldsymbol{x}}^{t+1}=\frac{1}{\sqrt{n}}{\boldsymbol{Y}}\,F_{t}({\boldsymbol{x}}^{t})-F_{t-1}({\boldsymbol{x}}^{t-1}){\sf B}^{{\sf T}}_{t}\,,\qquad t\geq 0\,. \tag{9}$$

Throughout, we will assume the uninformative initialization ${\boldsymbol{x}}^{0}={\boldsymbol{0}}$, ${\sf B}_{0}=0$. In the above, $F_{t}:{\mathbb{R}}^{{\sf d}}\to{\mathbb{R}}^{{\sf d}}$ is a function which operates on the matrix ${\boldsymbol{x}}^{t}\in{\mathbb{R}}^{n\times{\sf d}}$ row-wise. Namely, for ${\boldsymbol{x}}\in{\mathbb{R}}^{n\times{\sf d}}$ with rows ${\boldsymbol{x}}_{1}^{\sf T},\dots,{\boldsymbol{x}}_{n}^{{\sf T}}$, we have $F_{t}({\boldsymbol{x}}):=(F_{t}({\boldsymbol{x}}_{1}),\dots,F_{t}({\boldsymbol{x}}_{n}))^{{\sf T}}$. After $t$ iterations, we estimate the signal ${\boldsymbol{\theta}}$ via $\hat{\boldsymbol{\theta}}^{t}:=g_{t}({\boldsymbol{x}}^{t})$, for some function $g_{t}:{\mathbb{R}}^{{\sf d}}\to{\mathbb{R}}$ (again, applied row-wise). We will consider ${\sf d}$ and the sequence of functions $F_{t}$ fixed, while $n\to\infty$.

The sequence of matrices ${\sf B}_{t}\in{\mathbb{R}}^{{\sf d}\times{\sf d}}$ can be taken to be non-random (independent of ${\boldsymbol{Y}}$) and will be specified shortly. The high-dimensional asymptotics of the above iteration is characterized by the following finite-dimensional recursion over variables ${\boldsymbol{\mu}}_{t}\in{\mathbb{R}}^{{\sf d}}$, ${\boldsymbol{\Sigma}}_{t}\in{\mathbb{R}}^{{\sf d}\times{\sf d}}$, known as ‘state evolution,’ which is initialized at $({\boldsymbol{\mu}}_{0},{\boldsymbol{\Sigma}}_{0})=({\boldsymbol{0}},{\boldsymbol{0}})$:

$${\boldsymbol{\mu}}_{t+1}=\operatorname{\mathbb{E}}\big\{\Theta F_{t}({\boldsymbol{\mu}}_{t}\Theta+{\boldsymbol{G}}_{t})\big\}\,,\qquad(\Theta,{\boldsymbol{G}}_{t})\sim\pi_{\Theta}\otimes\mathcal{N}(0,{\boldsymbol{\Sigma}}_{t})\,, \tag{10}$$
$${\boldsymbol{\Sigma}}_{t+1}=\operatorname{\mathbb{E}}\big\{F_{t}({\boldsymbol{\mu}}_{t}\Theta+{\boldsymbol{G}}_{t})F_{t}({\boldsymbol{\mu}}_{t}\Theta+{\boldsymbol{G}}_{t})^{{\sf T}}\big\}\,. \tag{11}$$

In terms of this sequence, we define ${\sf B}_{t}$ via

$${\sf B}_{t}=\operatorname{\mathbb{E}}\{{\rm D}F_{t}({\boldsymbol{\mu}}_{t}\Theta+{\boldsymbol{G}}_{t})\}\,, \tag{12}$$

where ${\rm D}F_{t}=(\partial_{i}F_{t,j})_{i,j\in[{\sf d}]}$ denotes the weak differential of $F_{t}$.
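As a concrete (and unofficial) illustration of the iteration (9) in the scalar case ${\sf d}=1$, the sketch below runs Bayes AMP with the posterior-mean denoiser for a finite-support prior. The paper takes ${\sf B}_{t}$ and the denoiser parameters from state evolution (Eqs. (10)–(12) and (14)); here, as is common in practice, both are replaced by empirical surrogates, and all helper names are ours.

```python
import numpy as np

def bayes_denoiser(x, q, vals, probs):
    """Posterior mean E[Theta | q*Theta + sqrt(q)*G = x] (entrywise) for the prior
    sum_k probs[k]*delta_{vals[k]}, and the average of its derivative (the posterior
    variance), used below as an empirical Onsager coefficient."""
    vals, probs = np.asarray(vals, float), np.asarray(probs, float)
    logw = np.log(probs)[None, :] + x[:, None] * vals[None, :] - 0.5 * q * vals[None, :] ** 2
    w = np.exp(logw - logw.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    mean = w @ vals
    var = w @ vals**2 - mean**2
    return mean, float(var.mean())

def bayes_amp(Y, vals, probs, iters=20):
    """d = 1 instance of iteration (9): x^{t+1} = Y f_t(x^t)/sqrt(n) - b_t f_{t-1}(x^{t-1})."""
    n = Y.shape[0]
    x, f_prev, b, q = np.zeros(n), np.zeros(n), 0.0, 0.0
    for _ in range(iters):
        f, _ = bayes_denoiser(x, q, vals, probs)
        x_new = Y @ f / np.sqrt(n) - b * f_prev
        q = float(f @ f) / n                      # empirical proxy for q_{t+1} in Eq. (14)
        _, b = bayes_denoiser(x_new, q, vals, probs)
        f_prev, x = f, x_new
    theta_hat, _ = bayes_denoiser(x, q, vals, probs)
    return theta_hat
```

Combined with the sampling sketch given after Eq. (1), `np.mean((bayes_amp(Y, vals, probs) - theta) ** 2)` should approach the limit in Eq. (7) as $n$ and the iteration count grow.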

The next theorem characterizes the high-dimensional asymptotics of the above AMP algorithm. It summarizes results from [BM11, BLM15] (see e.g. [MW22] for the application to rank-one estimation).

Theorem 2.1.

Assume $\pi_{\Theta}$ is independent of $n$ and has finite moments of all orders. Further assume that the functions $\{F_{t},g_{t}\}_{t\geq 0}$ are independent of $n$ and either: $(i)$ for each $t$, $F_{t}$ and $g_{t}$ are $L_{t}$-Lipschitz, for some $L_{t}<\infty$; or $(ii)$ for each $t$, $F_{t}$ and $g_{t}$ are degree-$B_{t}$ polynomials, for some $B_{t}<\infty$. Then

$$\lim_{n\to\infty}\frac{1}{n}\operatorname{\mathbb{E}}\big\{\|\hat{\boldsymbol{\theta}}^{t}({\boldsymbol{Y}})-{\boldsymbol{\theta}}\|_{2}^{2}\big\}=\operatorname{\mathbb{E}}\big\{\big[\Theta-g_{t}({\boldsymbol{\mu}}_{t}\Theta+{\boldsymbol{G}}_{t})\big]^{2}\big\}\,, \tag{13}$$

where the expectation is with respect to $(\Theta,{\boldsymbol{G}}_{t})\sim\pi_{\Theta}\otimes\mathcal{N}(0,{\boldsymbol{\Sigma}}_{t})$.

Using the above result, it follows that the optimal function $F_{t}$ is given by the posterior expectation denoiser. Namely, define the one-dimensional recursion (with $G\sim{\mathcal{N}}(0,1)$):

$$q_{t+1}=\operatorname{\mathbb{E}}\big\{\operatorname{\mathbb{E}}[\Theta\,|\,q_{t}\Theta+\sqrt{q_{t}}\,G]^{2}\big\}=:{\rm SE}(q_{t};\pi_{\Theta})\,,\qquad q_{0}=0\,. \tag{14}$$

Consider ${\sf d}=1$ in the recursion of Eqs. (10), (11), so $\mu_{t}\in{\mathbb{R}}$ and $\Sigma_{t}=:\sigma^{2}_{t}\in{\mathbb{R}}$, and take $g_{t}(x)=F_{t}(x):=\operatorname{\mathbb{E}}[\Theta\,|\,q_{t}\Theta+\sqrt{q_{t}}\,G=x]$ with $q_{t}$ defined in Eq. (14). We have

$$\mu_{t}=\sigma^{2}_{t}=q_{t}\,, \tag{15}$$
$$\operatorname{\mathbb{E}}\big\{\big[\Theta-g_{t}(\mu_{t}\Theta+\sigma_{t}G)\big]^{2}\big\}=\operatorname{\mathbb{E}}[\Theta^{2}]-q_{t+1}\,. \tag{16}$$
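For a finite-support prior, the recursion (14) can be iterated numerically; the sketch below (our own, using Gauss–Hermite quadrature for the Gaussian expectation) runs it from $q_{0}=0$ until it stabilizes at the fixed point $q_{\rm AMP}$.

```python
import numpy as np

# Gauss-Hermite nodes/weights for E_G[.] with G ~ N(0,1) (probabilists' convention)
_nodes, _weights = np.polynomial.hermite_e.hermegauss(60)
_weights = _weights / _weights.sum()

def se_step(q, vals, probs):
    """One step of Eq. (14): q_{t+1} = E{ E[Theta | q Theta + sqrt(q) G]^2 }."""
    vals, probs = np.asarray(vals, float), np.asarray(probs, float)
    out = 0.0
    for theta, p in zip(vals, probs):
        x = q * theta + np.sqrt(q) * _nodes              # effective observation
        logw = np.log(probs)[None, :] + x[:, None] * vals[None, :] - 0.5 * q * vals[None, :] ** 2
        w = np.exp(logw - logw.max(axis=1, keepdims=True))
        post_mean = (w @ vals) / w.sum(axis=1)
        out += p * float(np.dot(_weights, post_mean**2))
    return out

vals, probs = [0.0, 2.0], [0.5, 0.5]       # nonzero-mean prior, E[Theta^2] = 2
q = 0.0
for _ in range(100):
    q = se_step(q, vals, probs)
print("q_AMP ~", q, "   asymptotic Bayes-AMP MSE ~", 2.0 - q)
```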

The next result establishes that this is indeed the optimal MSE achieved by AMP algorithms.

Theorem 2.2 ([MW22]).

Assume $\pi_{\Theta}$ is independent of $n$ and has finite second moment. Then, for any AMP algorithm satisfying the assumptions of Theorem 2.1, we have for any fixed $t\geq 0$,

$$\lim_{n\to\infty}\frac{1}{n}\operatorname{\mathbb{E}}\big\{\|\hat{\boldsymbol{\theta}}^{t}({\boldsymbol{Y}})-{\boldsymbol{\theta}}\|_{2}^{2}\big\}\geq\operatorname{\mathbb{E}}[\Theta^{2}]-q_{t+1}\,. \tag{17}$$

Further there exists a sequence of AMP algorithms approaching the lower bound arbitrarily closely.

Finally, the fixed points of the iteration (14) (i.e., the points $q$ such that $q={\rm SE}(q;\pi_{\Theta})$) coincide with the stationary points of $\Psi(q;b=0,\pi_{\Theta})$ defined in Eq. (2) (i.e., the points $q$ such that $\partial_{q}\Psi(q;0,\pi_{\Theta})=0$).

Figure 1: Estimation accuracy in the symmetric rank-one estimation problem (1), for a one-parameter family of prior distributions $\pi_{\Theta}^{s}$ defined in Eq. (18). Solid black curve: asymptotic MSE achieved by Bayes AMP. Blue curve: Bayes optimal MSE. Dashed curves: asymptotic MSE for Bayes AMP with (from bottom to top) $t\in\{5,10,20\}$ iterations. Circles: average MSE achieved by AMP in simulation for $t\in\{5,10,20\}$. The black and blue curves coincide outside the interval $[s^{*}_{\rm Bayes},s^{*}_{\rm AMP}]$. Here the MSE is normalized by $1/s$, so that the horizontal black dashed line represents the trivial MSE achieved by the constant estimator $\hat{\theta}_{i}({\boldsymbol{Y}})=\operatorname*{\mathbb{E}}[\Theta]$.

As an illustration, Figure 1 presents the asymptotic accuracy achieved by Bayes AMP and Bayes optimal estimation, as characterized by Theorem 1.1. In this figure we consider the following parametrized family of three-point priors:

$$\pi_{\Theta}^{s}=\frac{1-\varepsilon}{2}\,(\delta_{-\sqrt{s}}+\delta_{+\sqrt{s}})+\varepsilon\,\delta_{\sqrt{s/\varepsilon}}\,. \tag{18}$$

We set $\varepsilon=0.01$ and sweep $s$ (which measures the signal-to-noise ratio). We compare the predicted accuracy for Bayes AMP with numerical simulations for $n=2\cdot 10^{4}$ and $t\in\{5,10,20\}$ iterations, averaged over $50$ realizations.

2.2 Background: Low-Deg

We say that $\hat{\boldsymbol{\theta}}:{\mathbb{R}}^{n\times n}\to{\mathbb{R}}^{n}$, ${\boldsymbol{Y}}\mapsto\hat{\boldsymbol{\theta}}({\boldsymbol{Y}})$, is a Low-Deg estimator of degree $D$ if its coordinates are polynomials of maximum degree $D$ in the matrix entries $(Y_{ij})_{1\leq i\leq j\leq n}$. The coefficients of these polynomials may depend on $n$ but not on ${\boldsymbol{Y}}$. Let $\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}$ denote the space of all polynomials in the variables $(Y_{ij})_{1\leq i\leq j\leq n}$ of degree at most $D$. We use ${\sf LD}(D;n)$ to denote the set of estimators whose coordinates are degree-$D$ polynomials:

$${\sf LD}(D;n):=\big\{\hat{\boldsymbol{\theta}}:{\mathbb{R}}^{n\times n}\to{\mathbb{R}}^{n}\;:\;\forall i,\;\hat{\theta}_{i}\in\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}\big\}\,. \tag{19}$$

With an abuse of notation we will often refer to an estimator $\hat{\boldsymbol{\theta}}$ (e.g. a Low-Deg estimator $\hat{\boldsymbol{\theta}}\in{\sf LD}(D;n)$), but really mean a sequence of estimators $\hat{\boldsymbol{\theta}}_{n}$ indexed by the dimension $n$.

2.3 Statement of main results

Recall that a random variable $X$ is called sub-Gaussian if $\|X\|_{\psi_{2}}<\infty$, where the sub-Gaussian norm is defined as

$$\|X\|_{\psi_{2}}:=\inf\{t>0\,:\,\operatorname*{\mathbb{E}}\exp(X^{2}/t^{2})\leq 2\}\,. \tag{20}$$

For example, any distribution with bounded support is sub-Gaussian.
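Indeed (a one-line check we add for completeness): if $|X|\leq M$ almost surely, then

$$\operatorname*{\mathbb{E}}\exp(X^{2}/t^{2})\leq\exp(M^{2}/t^{2})\leq 2\qquad\text{whenever }t\geq M/\sqrt{\log 2}\,,$$

so $\|X\|_{\psi_{2}}\leq M/\sqrt{\log 2}<\infty$.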

Theorem 2.3.

Assume $\pi_{\Theta}$ is independent of $n$ and sub-Gaussian, with $\operatorname{\mathbb{E}}[\Theta]\neq 0$. For any $\varepsilon>0$, there exists $D(\varepsilon)<\infty$ and a (family of) estimators $\hat{\boldsymbol{\theta}}_{\varepsilon}=\hat{\boldsymbol{\theta}}_{\varepsilon,n}\in{\sf LD}(D(\varepsilon);n)$ such that

$$\lim_{n\to\infty}\frac{1}{n}\operatorname{\mathbb{E}}\big\{\|\hat{\boldsymbol{\theta}}_{\varepsilon}({\boldsymbol{Y}})-{\boldsymbol{\theta}}\|_{2}^{2}\big\}\leq\operatorname{\mathbb{E}}[\Theta^{2}]-q_{\rm AMP}(\pi_{\Theta})+\varepsilon\,. \tag{21}$$

Further, for any constant $D$,

$$\lim_{n\to\infty}\inf_{\hat{\boldsymbol{\theta}}\in{\sf LD}(D;n)}\frac{1}{n}\operatorname{\mathbb{E}}\big\{\|\hat{\boldsymbol{\theta}}({\boldsymbol{Y}})-{\boldsymbol{\theta}}\|_{2}^{2}\big\}\geq\operatorname{\mathbb{E}}[\Theta^{2}]-q_{\rm AMP}(\pi_{\Theta})\,. \tag{22}$$

The lower bound (22) only requires $\pi_{\Theta}$ to have all moments finite, rather than sub-Gaussian tails. The upper bound (21) likely also holds under weaker conditions on $\pi_{\Theta}$ than sub-Gaussianity, but we have not attempted to explore this here. We conclude this summary of our results with a few remarks and directions for future work.

Sharp characterization for Low-Deg.

As a by-product of our results, we obtain a sharp characterization of the optimal accuracy of Low-Deg estimators with constant degree. To the best of our knowledge, this is the first example of such a characterization. Prior work on low-degree estimation [SW22] gave bounds on the signal-to-noise ratio (tight up to log factors) under which the low-degree estimation accuracy is either asymptotically perfect or trivial. In our work we are in a setting where the Low-Deg MSE converges to a non-trivial constant, and we pin down the exact constant.

Similarly, our results are sharper than existing computational lower bounds based on the ‘overlap gap property’ [GJS21], which rule out a different (incomparable) class of algorithms including certain Markov chains.

Optimality of AMP.

Our results support the conjecture that AMP is asymptotically optimal among computationally efficient procedures for certain high-dimensional estimation problems. We are also aware of some problems for which AMP is strictly sub-optimal and has to be modified to capture higher-order correlations between variables. One such example is provided by the spiked tensor model [RM14, HSS15, WEM19], namely the generalization of the above model to higher-order tensors, where observations take the form ${\boldsymbol{Y}}=n^{-(k-1)/2}\,{\boldsymbol{\theta}}^{\otimes k}+{\boldsymbol{Z}}$ for some $k\geq 3$.

We believe that a generalization of the proof techniques developed in this paper can help distinguish in a principled way which problems can be optimally solved using AMP. These ideas may also provide guidance to modify AMP for problems in which it is sub-optimal. The key properties of the rank-one estimation problem that cause AMP to be optimal among Low-Deg estimators are established in Lemmas 4.7 and 4.8; notably, these include the block-diagonal structure of a certain correlation matrix ${\boldsymbol{M}}_{\infty}$. As a result of these properties, the best Low-Deg estimator is ‘tree-structured’ and can thus be computed using an AMP algorithm.

Zero-mean prior.

Our result on the equivalence of AMP and Low-Deg actually applies to the case $\operatorname{\mathbb{E}}[\Theta]=0$ as well. However, it must be emphasized that the statement concerns AMP with uninformative initialization and $O(1)$ iterations, and Low-Deg estimators with $O(1)$ degree. When $\operatorname{\mathbb{E}}[\Theta]=0$, the MSE of both of these algorithms converges to $\operatorname{\mathbb{E}}[\Theta^{2}]$ as $n\to\infty$: it is no better than random guessing.

On the other hand, initializing AMP with a spectral initialization $c_{n}\,{\boldsymbol{v}}_{1}({\boldsymbol{Y}})$ (for a suitable scalar $c_{n}$) yields the accuracy stated in Theorem 1.1, even for $\operatorname{\mathbb{E}}[\Theta]=0$. Since the leading eigenvector can be approximated by power iteration, we believe it is possible to show that the same accuracy is achievable by AMP when run for $O(\log n)$ iterations (and a random initialization), or by Low-Deg estimators of degree $O(\log n)$. However, generalizing the lower bound of Eq. (22) to logarithmic degree goes beyond the analysis developed here.

Higher degree.

We expect the optimality of Bayes AMP to hold within a significantly broader class of low-degree estimators, namely $O(n^{c})$-degree estimators for any $c\in(0,1)$. Again, this extension is beyond our current proof technique. Heuristically, polynomials of degree $O(n^{c})$ can be thought of as a proxy for algorithms of runtime $\exp(n^{c+o_{n}(1)})$ (which is the runtime required to naively evaluate such a polynomial); see e.g. [DKWB19].

3 Proof of Theorem 2.3: Upper bound

Recall from Theorem 2.2 that the fixed points of the state evolution recursion (14) coincide with the stationary points of $\Psi(q;0,\pi_{\Theta})$. Further, it is easy to see from the definition in Eq. (14) that $q\mapsto{\rm SE}(q;\pi_{\Theta})$ is a non-decreasing function with ${\rm SE}(0;\pi_{\Theta})=\operatorname{\mathbb{E}}[\Theta]^{2}>0$. (This follows from the fact that the minimum mean square error is a non-increasing function of the signal-to-noise ratio or, equivalently, from Jensen's inequality.) As a consequence, $(q_{t})_{t\geq 0}$ is a non-decreasing sequence with $\lim_{t\to\infty}q_{t}=q_{\rm AMP}$. Let $t_{*}$ be such that $q_{t_{*}+1}\geq q_{\rm AMP}-\varepsilon/2$.

Consider the special case of the AMP algorithm of Eq. (9) with ${\sf d}=1$, with $F_{t}=f_{t}:{\mathbb{R}}\to{\mathbb{R}}$, and specialize the state evolution recursion of Eqs. (10), (11) to this case. Namely, we define recursively $\mu_{s},\sigma_{s}^{2}$ for $s\geq 0$ via

$$\mu_{t+1}=\operatorname{\mathbb{E}}\big\{\Theta f_{t}(\mu_{t}\Theta+\sigma_{t}\,G)\big\}\,,\qquad(\Theta,G)\sim\pi_{\Theta}\otimes\mathcal{N}(0,1)\,, \tag{23}$$
$$\sigma^{2}_{t+1}=\operatorname{\mathbb{E}}\big\{f_{t}(\mu_{t}\Theta+\sigma_{t}\,G)^{2}\big\}\,. \tag{24}$$

We claim that, for every $t\geq 0$ and any $\varepsilon_{0}>0$, we can construct polynomials $(f_{s})_{0\leq s\leq t}$ of degrees $(D_{s}(\varepsilon_{0}))_{0\leq s\leq t}$ (independent of $n$) such that

$$\big|\mu_{t}-q_{t}\big|\leq\varepsilon_{0}\,,\qquad\big|\sigma^{2}_{t}-q_{t}\big|\leq\varepsilon_{0}\,. \tag{25}$$

Once this is established, the desired upper bound (21) follows by taking $t=t_{*}+1$ and $\varepsilon_{0}=\varepsilon/8$. Consider indeed the AMP estimator $\hat{\boldsymbol{\theta}}^{t}({\boldsymbol{Y}}):=f_{t}({\boldsymbol{x}}^{t})$, where ${\boldsymbol{x}}^{t}$ is defined by the AMP iteration (9) with ${\sf d}=1$, $F_{t}=f_{t}$ (that is, we choose $g_{t}=f_{t}$). This estimator is a polynomial of degree $D\leq(D_{1}+1)(D_{2}+1)\cdots(D_{t_{*}}+1)$ and, by Eq. (13) (specialized to ${\sf d}=1$), we have

$$\lim_{n\to\infty}\frac{1}{n}\operatorname{\mathbb{E}}\big\{\|\hat{\boldsymbol{\theta}}^{t_{*}}({\boldsymbol{Y}})-{\boldsymbol{\theta}}\|_{2}^{2}\big\}=\operatorname{\mathbb{E}}\big\{\big[\Theta-f_{t_{*}}(\mu_{t_{*}}\Theta+\sigma_{t_{*}}G)\big]^{2}\big\} \tag{26}$$
$$=\operatorname{\mathbb{E}}[\Theta^{2}]-2\mu_{t_{*}+1}+\sigma_{t_{*}+1}^{2} \tag{27}$$
$$\leq\operatorname{\mathbb{E}}[\Theta^{2}]-q_{t_{*}+1}+\frac{\varepsilon}{2}\leq\operatorname{\mathbb{E}}[\Theta^{2}]-q_{\rm AMP}+\varepsilon\,. \tag{28}$$

We are left with the task of proving the claim (25), which we will do by induction on $t$. The claim holds for $t=0$, and we then assume as an induction hypothesis that it holds for a certain $t$. We will denote by $f^{\rm Bayes}_{t}(x):=\operatorname{\mathbb{E}}[\Theta\,|\,q_{t}\Theta+\sqrt{q_{t}}\,G=x]$ the ideal nonlinearity. Now let $\varepsilon_{1,k}\downarrow 0$ be a sequence converging to $0$, and select $f_{s}=f_{k,s}$, $s\leq t-1$, according to the induction hypothesis with $\varepsilon_{0}=\varepsilon_{1,k}$. We will denote by $\mu_{k,t},\sigma^{2}_{k,t}$ the corresponding state evolution quantities. In particular, we can assume without loss of generality that $\mu_{k,t},\sigma^{2}_{k,t}\leq 2q_{t}$. By the state evolution recursion, we have (the expectation here is with respect to $(\Theta,G)\sim\pi_{\Theta}\otimes{\mathcal{N}}(0,1)$)

$$\big|\mu_{k,t+1}-q_{t+1}\big|=\Big|\operatorname{\mathbb{E}}\Big\{\Theta\big[f_{k,t}(\mu_{k,t}\Theta+\sigma_{k,t}G)-f^{\rm Bayes}_{t}(q_{t}\Theta+\sqrt{q_{t}}\,G)\big]\Big\}\Big|$$
$$\leq\Big|\operatorname{\mathbb{E}}\Big\{\Theta\big[f^{\rm Bayes}_{t}(\mu_{k,t}\Theta+\sigma_{k,t}G)-f^{\rm Bayes}_{t}(q_{t}\Theta+\sqrt{q_{t}}\,G)\big]\Big\}\Big|$$
$$\qquad\qquad+\Big|\operatorname{\mathbb{E}}\Big\{\Theta\big[f_{t}(\mu_{k,t}\Theta+\sigma_{k,t}G)-f^{\rm Bayes}_{t}(\mu_{k,t}\Theta+\sigma_{k,t}G)\big]\Big\}\Big|$$
$$=:B_{1}(k)+B_{2}(k)\,.$$

It is a general fact about Bayes posterior expectations that $f_{t}^{\rm Bayes}$ is continuous with $|f_{t}^{\rm Bayes}(x)|\leq C_{t}(1+|x|)$ for some constant $C_{t}$ (see Lemma B.1 in Appendix B). Further, $\mu_{k,t}\Theta+\sigma_{k,t}G\stackrel{a.s.}{\longrightarrow}q_{t}\Theta+\sqrt{q_{t}}\,G$ as $k\to\infty$. By dominated convergence, we have $\lim_{k\to\infty}B_{1}(k)=0$ and therefore we can choose $k_{0}$ so that, for all $k\geq k_{0}$, $B_{1}(k)\leq\varepsilon_{0}/2$.

Next consider the term $B_{2}(k)$. Denoting by $\tau:=\|\Theta\|_{\psi_{2}}$ the sub-Gaussian norm of $\pi_{\Theta}$ (see Eq. (20)), we have that $Z_{k,t}:=\mu_{k,t}\Theta+\sigma_{k,t}G$ is sub-Gaussian with $\|Z_{k,t}\|_{\psi_{2}}^{2}\leq\mu_{k,t}^{2}\tau^{2}+\sigma_{k,t}^{2}\leq 4(q_{t}^{2}\tau^{2}+q_{t})$. We let $\tau^{2}_{t}:=4(q_{t}^{2}\tau^{2}+q_{t})$ denote this upper bound. Since $f^{\rm Bayes}_{t}$ is continuous with $|f^{\rm Bayes}_{t}(x)|\leq C_{t}(1+|x|)$, by weighted approximation theory [Lor05] we can choose a polynomial $f_{t}$ such that

$$\sup_{u\in{\mathbb{R}}}\big|f_{t}(u)-f^{\rm Bayes}_{t}(u)\big|\exp\Big(-\frac{u^{2}}{4\tau^{2}_{t}}\Big)\leq\frac{\varepsilon_{0}}{4\operatorname{\mathbb{E}}[\Theta^{2}]^{1/2}}\,. \tag{29}$$

Using this polynomial approximation, we get

$$B_{2}(k)\leq\operatorname{\mathbb{E}}\big[\Theta^{2}\big]^{1/2}\operatorname{\mathbb{E}}\big\{[f_{t}(Z_{k,t})-f^{\rm Bayes}_{t}(Z_{k,t})]^{2}\big\}^{1/2} \tag{30}$$
$$\leq\frac{1}{4}\varepsilon_{0}\operatorname{\mathbb{E}}\big[\exp\big(Z_{k,t}^{2}/2\tau_{t}^{2}\big)\big]^{1/2}\leq\frac{1}{2}\varepsilon_{0}\,. \tag{31}$$

This completes the proof of the induction claim for the first inequality in Eq. (25). The second inequality is treated analogously.

4 Proof of Theorem 2.3: Lower bound

Recall that $\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}$ denotes the family of polynomials of degree at most $D$ in the variables $(Y_{ij})_{1\leq i\leq j\leq n}$. The key of our proof is to show that the asymptotic estimation accuracy of Low-Deg estimators is not reduced if we replace polynomials by a restricted family of tree-structured polynomials, which we will denote by $\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}^{\mathcal{T}}$. This reduction is presented in Sections 4.1 and 4.2. We will then establish a connection between tree-structured polynomials and AMP algorithms, and rely on known optimality theory for AMP estimators, cf. Section 4.3.

4.1 Reduction to tree-structured polynomials

Let $\mathcal{T}_{\leq D}$ denote the set of rooted trees, up to root-preserving isomorphism, with at most $D$ edges. (Trees must be connected, with no self-loops or multi-edges allowed. The tree with one vertex and no edges is included.) We denote the root vertex of $T\in\mathcal{T}_{\leq D}$ by $\circ$. For $T\in\mathcal{T}_{\leq D}$, define a labeling rooted at $1$ (or, simply, a labeling) of $T$ to be a function $\phi:V(T)\to[n]$ such that $\phi(\circ)=1$. (Vertex $1$ is distinguished because we will be considering estimation of $\theta_{1}$.)

Given a graph $G=(V,E)$, we will denote by ${\sf d}_{G}(u,v)$ the graph distance between two vertices $u,v\in V$.

Definition 4.1.

A labeling $\phi$ of $T\in\mathcal{T}_{\leq D}$ is said to be non-reversing if, for every pair of distinct vertices $u,v\in V(T)$ with the same label (i.e., $\phi(u)=\phi(v)$), one of the following holds:

  • ${\sf d}_{T}(u,v)>2$, or

  • ${\sf d}_{T}(u,v)=2$ and $u,v$ have the same distance from the root (i.e., ${\sf d}_{T}(u,\circ)={\sf d}_{T}(v,\circ)$).

(The latter holds if and only if $u,v$ are both children of a common vertex $w\in V(T)$.) We denote by $\mathsf{nr}(T)$ the set of all non-reversing labelings of $T$.

For each $T\in\mathcal{T}_{\leq D}$, define the associated polynomial

$$\mathscrsfs{F}_{T}({\boldsymbol{Y}})=\frac{1}{\sqrt{|\mathsf{nr}(T)|}}\sum_{\phi\in\mathsf{nr}(T)}\,\prod_{(u,v)\in E(T)}Y_{\phi(u),\phi(v)}\,. \tag{32}$$
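For small trees, Eq. (32) and Definition 4.1 can be checked by brute force. The sketch below (our own reference implementation, with hypothetical helper names, feasible only for tiny $n$ and $T$) enumerates labelings with $\phi(\circ)=1$ (index $0$ in the code), keeps the non-reversing ones, and evaluates $\mathscrsfs{F}_{T}({\boldsymbol{Y}})$.

```python
import numpy as np
from itertools import product

def F_T(Y, parent):
    """Naive evaluation of the tree polynomial F_T(Y) of Eq. (32). The rooted tree T is
    encoded by a parent array: parent[0] = -1 for the root, and vertex v > 0 is joined
    to parent[v]. Labelings are constrained to phi(root) = 0 (i.e., we target theta_1)."""
    n, V = Y.shape[0], len(parent)
    depth = [0] * V
    for v in range(1, V):
        depth[v] = depth[parent[v]] + 1

    def dist(u, v):                      # tree distance via the lowest common ancestor
        a, b = u, v
        while depth[a] > depth[b]: a = parent[a]
        while depth[b] > depth[a]: b = parent[b]
        while a != b: a, b = parent[a], parent[b]
        return depth[u] + depth[v] - 2 * depth[a]

    def non_reversing(phi):              # Definition 4.1
        for u in range(V):
            for v in range(u + 1, V):
                if phi[u] == phi[v]:
                    d = dist(u, v)
                    if not (d > 2 or (d == 2 and depth[u] == depth[v])):
                        return False
        return True

    total, count = 0.0, 0
    for rest in product(range(n), repeat=V - 1):
        phi = (0,) + rest
        if not non_reversing(phi):
            continue
        count += 1
        val = 1.0
        for v in range(1, V):
            val *= Y[phi[v], phi[parent[v]]]
        total += val
    return total / np.sqrt(count)

# Example: the rooted path with two edges (root - child - grandchild)
rng = np.random.default_rng(0)
Y = rng.standard_normal((8, 8)); Y = (Y + Y.T) / 2
print(F_T(Y, parent=[-1, 0, 1]))
```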
Definition 4.2.

We define $\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}^{\mathcal{T}}$ to be the set of all polynomials of the form

$$p({\boldsymbol{Y}})=\sum_{T\in\mathcal{T}_{\leq D}}\hat{p}_{T}\mathscrsfs{F}_{T}({\boldsymbol{Y}}) \tag{33}$$

for coefficients $\hat{p}_{T}\in\mathbb{R}$.

Note that any $p\in\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}^{\mathcal{T}}$ is a polynomial of degree at most $D$, that is,

$$\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}^{\mathcal{T}}\subseteq\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}\,. \tag{34}$$

We are now in position to state our result about reduction to tree-structured polynomials.

Proposition 4.3.

Assume $\pi_{\Theta}$ to have finite moments of all orders. Let $\psi:{\mathbb{R}}\to{\mathbb{R}}$ be a measurable function, and assume all moments of $\psi(\theta_{1})$ to be finite. For any fixed $\pi_{\Theta},\psi,D$ there exists a fixed ($n$-independent) choice of coefficients $(\hat{p}_{T})_{T\in\mathcal{T}_{\leq D}}$ such that the associated polynomial $p=p_{n}\in\mathbb{R}[{\boldsymbol{Y}}]^{\mathcal{T}}_{\leq D}$ defined by (33) satisfies

$$\lim_{n\to\infty}\operatorname*{\mathbb{E}}[(p({\boldsymbol{Y}})-\psi(\theta_{1}))^{2}]=\lim_{n\to\infty}\inf_{q\in\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}}\operatorname*{\mathbb{E}}[(q({\boldsymbol{Y}})-\psi(\theta_{1}))^{2}]\,. \tag{35}$$

(In particular, the above limits exist.)

4.2 Proof of Proposition 4.3

The direction “$\geq$” in Eq. (35) is immediate from Eq. (34), so it remains to prove “$\leq$.”

For the proof, we will consider a slightly different model whereby we observe $\tilde{\boldsymbol{Y}}={\boldsymbol{\theta}}{\boldsymbol{\theta}}^{{\sf T}}/\sqrt{n}+\tilde{\boldsymbol{Z}}$, with $\tilde{\boldsymbol{Z}}=\tilde{\boldsymbol{Z}}^{{\sf T}}$ and $(\tilde{Z}_{ij})_{1\leq i\leq j\leq n}\sim_{iid}\mathcal{N}(0,1)$. In other words, we use Gaussians of variance $1$ instead of $2$ on the diagonal. The original model is related to this one by ${\boldsymbol{Y}}=\tilde{\boldsymbol{Y}}+{\boldsymbol{W}}$, where ${\boldsymbol{W}}$ is a diagonal matrix with $(W_{ii})_{1\leq i\leq n}\sim_{iid}\mathcal{N}(0,1)$ independent of ${\boldsymbol{\theta}},\tilde{\boldsymbol{Z}}$. It is easy to see that it is sufficient to prove Proposition 4.3 under this modified model. Indeed, the left-hand side of Eq. (35) does not depend on the distribution of the diagonal entries of ${\boldsymbol{Y}}$ (since tree-structured polynomials do not depend on the diagonal entries). For the right-hand side, if $q({\boldsymbol{Y}})$ is an arbitrary degree-$D$ estimator, then $\tilde{q}(\tilde{\boldsymbol{Y}})=\operatorname{\mathbb{E}}_{{\boldsymbol{W}}}q(\tilde{\boldsymbol{Y}}+{\boldsymbol{W}})$ also has degree $D$ and satisfies $\operatorname*{\mathbb{E}}[q({\boldsymbol{Y}})\cdot\psi(\theta_{1})]=\operatorname*{\mathbb{E}}[\tilde{q}(\tilde{\boldsymbol{Y}})\cdot\psi(\theta_{1})]$ and, by Jensen's inequality, $\operatorname*{\mathbb{E}}[q({\boldsymbol{Y}})^{2}]\geq\operatorname*{\mathbb{E}}[\tilde{q}(\tilde{\boldsymbol{Y}})^{2}]$, whence we conclude

$$\inf_{\tilde{q}\in\mathbb{R}[\tilde{\boldsymbol{Y}}]_{\leq D}}\operatorname*{\mathbb{E}}[(\tilde{q}(\tilde{\boldsymbol{Y}})-\psi(\theta_{1}))^{2}]\leq\inf_{q\in\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}}\operatorname*{\mathbb{E}}[(q({\boldsymbol{Y}})-\psi(\theta_{1}))^{2}]\,. \tag{36}$$

This proves the claim. In what follows, we will drop the tilde from $\tilde{\boldsymbol{Y}}$, $\tilde{\boldsymbol{Z}}$ and assume the new normalization.

It is useful to generalize the setting introduced previously. Let $\mathcal{G}_{\leq D}$ denote the set of all rooted (multi-)graphs, up to root-preserving isomorphism, with at most $D$ total edges, with the additional constraint that no vertices are isolated except possibly the root. Self-loops and multi-edges are allowed, and edges are counted with their multiplicity. For instance, a triple-edge contributes $3$ towards the edge count $D$. The graph need not be connected. For $G\in\mathcal{G}_{\leq D}$ we write $V(G)$ for the set of vertices and $E(G)$ for the multiset of edges.

We use $\circ$ for the root of a graph $G\in\mathcal{G}_{\leq D}$, and define labelings (rooted at $1$) exactly as for trees. Instead of non-reversing labelings, it will be convenient to work with embeddings, that is, labelings that are injective (every pair of distinct vertices $u,v\in V(G)$ has $\phi(u)\neq\phi(v)$). We denote the set of embeddings of $G$ by $\mathsf{emb}(G)$.

A labeling $\phi$ of $G\in\mathcal{G}_{\leq D}$ induces a multi-graph whose vertices are elements of $[n]$. This is the graph with vertex set $\{\phi(v)\,:\,v\in V(G)\}$ and edge set $\{(\phi(u),\phi(v))\,:\,(u,v)\in E(G)\}$. If $\phi(u)=\phi(u^{\prime})$ and $\phi(v)=\phi(v^{\prime})$ for $(u,v)$, $(u^{\prime},v^{\prime})\in E(G)$ distinct edges in the multigraph $G$, then $(\phi(u),\phi(v))$ and $(\phi(u^{\prime}),\phi(v^{\prime}))$ are considered distinct edges in the induced multi-graph.

We call this induced graph the image of $\phi$ and write ${\boldsymbol{\alpha}}=\mathsf{im}(\phi;G)$ whenever ${\boldsymbol{\alpha}}$ is the image of $\phi$. We will treat ${\boldsymbol{\alpha}}$ as an element of $\mathbb{N}^{\overline{E}_{n}}$ where ${\overline{E}_{n}}=\{(i,j)\,:\,1\leq i\leq j\leq n\}$, namely ${\boldsymbol{\alpha}}=(\alpha_{ij})_{1\leq i\leq j\leq n}$ where $\alpha_{ij}\in\mathbb{N}=\{0,1,2,\ldots\}$ counts the multiplicity of edge $(i,j)$ in ${\boldsymbol{\alpha}}$. Formally, $\alpha_{ij}:=|\{(u,v)\in E(G)\,:\,\phi(\{u,v\})=\{i,j\}\}|$.

For $k\in\mathbb{N}$, let $h_{k}:\mathbb{R}\to\mathbb{R}$ denote the $k$-th orthonormal Hermite polynomial. Recall that these are defined (uniquely, up to sign) by the following two conditions: $(i)$ $h_{k}$ is a degree-$k$ polynomial; $(ii)$ $\operatorname{\mathbb{E}}[h_{j}(Z)h_{k}(Z)]=\mathbbm{1}_{j=k}$ when $Z\sim{\mathcal{N}}(0,1)$. We refer to [Sze39] for background.

For ${\boldsymbol{\alpha}}\in\mathbb{N}^{\overline{E}_{n}}$, define the multivariate Hermite polynomial

$$h_{\boldsymbol{\alpha}}({\boldsymbol{Y}})=\prod_{1\leq i\leq j\leq n}h_{\alpha_{ij}}(Y_{ij})\,. \tag{37}$$
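A quick way to generate the orthonormal $h_{k}$ numerically (our own sketch, using the standard probabilists' three-term recurrence, consistent with conditions $(i)$–$(ii)$ above):

```python
import numpy as np
from math import factorial

def hermite_orthonormal(k_max, x):
    """Orthonormal Hermite polynomials h_0,...,h_{k_max} evaluated at x:
    probabilists' recurrence He_{k+1} = x*He_k - k*He_{k-1}, normalized by sqrt(k!)."""
    x = np.asarray(x, float)
    He = [np.ones_like(x), x.astype(float)]
    for k in range(1, k_max):
        He.append(x * He[k] - k * He[k - 1])
    return [He[k] / np.sqrt(factorial(k)) for k in range(k_max + 1)]

# Orthonormality check: E[h_j(Z) h_k(Z)] ~ 1{j=k} for Z ~ N(0,1)
Z = np.random.default_rng(0).standard_normal(400_000)
h = hermite_orthonormal(3, Z)
print(np.round([[np.mean(h[j] * h[k]) for k in range(4)] for j in range(4)], 2))
```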

These polynomials are orthonormal: $\operatorname{\mathbb{E}}[h_{\boldsymbol{\alpha}}({\boldsymbol{Z}})h_{\boldsymbol{\beta}}({\boldsymbol{Z}})]=\mathbbm{1}_{{\boldsymbol{\alpha}}={\boldsymbol{\beta}}}$ when $(Z_{ij})_{i\leq j}\sim_{iid}{\mathcal{N}}(0,1)$. Viewing ${\boldsymbol{\alpha}}$ as a graph, let $C({\boldsymbol{\alpha}})$ denote the set of non-empty (i.e., containing at least one edge) connected components of ${\boldsymbol{\alpha}}$. As above, each ${\boldsymbol{\gamma}}\in C({\boldsymbol{\alpha}})$ is an element of $\mathbb{N}^{\overline{E}_{n}}$, where $\gamma_{ij}$ denotes the multiplicity of edge $(i,j)$. It will be important to “center” each component in the following sense. Define

$$\mathscrsfs{H}_{\boldsymbol{\alpha}}({\boldsymbol{Y}})=\prod_{{\boldsymbol{\gamma}}\in C({\boldsymbol{\alpha}})}\big(h_{\boldsymbol{\gamma}}({\boldsymbol{Y}})-\operatorname{\mathbb{E}}h_{\boldsymbol{\gamma}}({\boldsymbol{Y}})\big)\,, \tag{38}$$

where (in the case ${\boldsymbol{\alpha}}={\boldsymbol{0}}$) the empty product is equal to $1$ by convention. Here and throughout, the expectation is over ${\boldsymbol{Y}}$ distributed according to the rank-one estimation model as defined in Eq. (1). For $G\in\mathcal{G}_{\leq D}$, define

$$\mathscrsfs{H}_{G}({\boldsymbol{Y}})=\frac{1}{\sqrt{|\mathsf{emb}(G)|}}\sum_{\phi\in\mathsf{emb}(G)}\mathscrsfs{H}_{\mathsf{im}(\phi;G)}({\boldsymbol{Y}})\,. \tag{39}$$

Define the symmetric subspace $\mathbb{R}[{\boldsymbol{Y}}]^{\mathsf{sym}}_{\leq D}\subseteq\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}$ as

$$\mathbb{R}[{\boldsymbol{Y}}]^{\mathsf{sym}}_{\leq D}:=\Big\{f({\boldsymbol{Y}})=\sum_{G\in\mathcal{G}_{\leq D}}a_{G}\mathscrsfs{H}_{G}({\boldsymbol{Y}})\;:\;a_{G}\in\mathbb{R}\;\;\forall G\in\mathcal{G}_{\leq D}\Big\}\,. \tag{40}$$

In words, $\mathbb{R}[{\boldsymbol{Y}}]^{\mathsf{sym}}_{\leq D}$ is the $\mathbb{R}$-span of $(\mathscrsfs{H}_{G})_{G\in\mathcal{G}_{\leq D}}$. It is also easy to see that $\mathbb{R}[{\boldsymbol{Y}}]^{\mathsf{sym}}_{\leq D}$ is the linear subspace of $\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}$ which is invariant under permutations of the rows/columns $\{2,\dots,n\}$ of ${\boldsymbol{Y}}$.

Note that the task of estimating $\psi(\theta_{1})$ under the model (1) is invariant under permutations of $\{2,\dots,n\}$. As a consequence of the Hunt–Stein theorem [EG21], the optimal estimator of $\psi(\theta_{1})$ must be equivariant under permutations of $\{2,\dots,n\}$. The following lemma shows that the same is true if we restrict our attention to Low-Deg estimators. Namely, instead of the infimum over all degree-$D$ polynomials in (35), it suffices to study only the symmetric subspace.

Lemma 4.4.

Under the assumptions of Proposition 4.3, for any $n$,

$$\inf_{q\in\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}}\operatorname*{\mathbb{E}}[(q({\boldsymbol{Y}})-\psi(\theta_{1}))^{2}]=\inf_{q\in\mathbb{R}[{\boldsymbol{Y}}]^{\mathsf{sym}}_{\leq D}}\operatorname*{\mathbb{E}}[(q({\boldsymbol{Y}})-\psi(\theta_{1}))^{2}]\,. \tag{41}$$

Our next step will be to write down an explicit formula (given in Lemma 4.6 below) for the right-hand side of (41). Define the vector ${\boldsymbol{c}}_{n}=(c_{n,A})_{A\in\mathcal{G}_{\leq D}}$ and the matrix ${\boldsymbol{M}}_{n}=(M_{n,AB})_{A,B\in\mathcal{G}_{\leq D}}$ by

$$c_{n,A}=\operatorname{\mathbb{E}}[\mathscrsfs{H}_{A}({\boldsymbol{Y}})\cdot\psi(\theta_{1})]\,,\qquad M_{n,AB}=\operatorname{\mathbb{E}}[\mathscrsfs{H}_{A}({\boldsymbol{Y}})\mathscrsfs{H}_{B}({\boldsymbol{Y}})]\,. \tag{42}$$

Note that both ${\boldsymbol{c}}_{n}$ and ${\boldsymbol{M}}_{n}$ have constant dimension (depending on $D$ but not on $n$), but their entries depend on $n$. Also note that ${\boldsymbol{M}}_{n}$ is a Gram matrix and therefore positive semidefinite. We will show that ${\boldsymbol{M}}_{n}$ is strictly positive definite (and thus invertible), and in fact strongly so in the sense that its minimum eigenvalue is lower bounded by a positive constant independent of $n$. (Here and throughout, asymptotic notation such as $O(\,\cdot\,)$ and $\Omega(\,\cdot\,)$ may hide factors depending on $\pi_{\Theta},\psi,D$.)

Lemma 4.5.

Under the assumptions of Proposition 4.3,

$$\lambda_{\mathrm{min}}({\boldsymbol{M}}_{n})=\Omega(1)\,.$$

We can now obtain an explicit formula for the optimal estimation error in terms of ${\boldsymbol{M}}_{n},{\boldsymbol{c}}_{n}$.

Lemma 4.6.

Define the vector ${\boldsymbol{c}}_{n}$ and the matrix ${\boldsymbol{M}}_{n}$ as per Eq. (42). Then, under the assumptions of Proposition 4.3, for any $n$,

$$\inf_{q\in\mathbb{R}[{\boldsymbol{Y}}]^{\mathsf{sym}}_{\leq D}}\operatorname*{\mathbb{E}}[(q({\boldsymbol{Y}})-\psi(\theta_{1}))^{2}]=\operatorname*{\mathbb{E}}[\psi(\theta_{1})^{2}]-\langle{\boldsymbol{c}}_{n},{\boldsymbol{M}}_{n}^{-1}{\boldsymbol{c}}_{n}\rangle\,. \tag{43}$$

Furthermore, the infimum is attained, and the minimizer $q^{*}$, which is unique, takes the form

$$q^{*}({\boldsymbol{Y}})=\sum_{A\in\mathcal{G}_{\leq D}}\hat{q}_{A}\mathscrsfs{H}_{A}({\boldsymbol{Y}})\,,\qquad\hat{\boldsymbol{q}}=(\hat{q}_{A})_{A\in\mathcal{G}_{\leq D}}={\boldsymbol{M}}_{n}^{-1}{\boldsymbol{c}}_{n}\,. \tag{44}$$
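To see where (43)–(44) come from (a sketch of the standard least-squares computation, which we add for readability and which does not replace the proof given in the appendix): expanding $q({\boldsymbol{Y}})=\sum_{A}\hat{q}_{A}\mathscrsfs{H}_{A}({\boldsymbol{Y}})$ gives

$$\operatorname*{\mathbb{E}}\big[(q({\boldsymbol{Y}})-\psi(\theta_{1}))^{2}\big]=\operatorname*{\mathbb{E}}[\psi(\theta_{1})^{2}]-2\langle\hat{\boldsymbol{q}},{\boldsymbol{c}}_{n}\rangle+\langle\hat{\boldsymbol{q}},{\boldsymbol{M}}_{n}\hat{\boldsymbol{q}}\rangle\,,$$

a strictly convex quadratic in $\hat{\boldsymbol{q}}$ (by Lemma 4.5), whose unique stationary point solves ${\boldsymbol{M}}_{n}\hat{\boldsymbol{q}}={\boldsymbol{c}}_{n}$ and attains the value $\operatorname*{\mathbb{E}}[\psi(\theta_{1})^{2}]-\langle{\boldsymbol{c}}_{n},{\boldsymbol{M}}_{n}^{-1}{\boldsymbol{c}}_{n}\rangle$.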

Determining the asymptotic error achieved by Low-Deg estimators requires understanding the asymptotic behavior of ${\boldsymbol{c}}_{n}$ and ${\boldsymbol{M}}_{n}$. This is achieved in the following lemmas. Notably, ${\boldsymbol{M}}_{n}$ is nearly block diagonal. (For this block-diagonal structure to appear, it is crucial that we center each connected component in (38).)

Lemma 4.7.

For each $A\in\mathcal{G}_{\leq D}$ we have the following.

  • There is a constant $c_{\infty,A}\in\mathbb{R}$ (depending on $\pi_{\Theta},\psi,A$) such that $c_{n,A}=c_{\infty,A}+O(n^{-1/2})$.

  • If $A\notin\mathcal{T}_{\leq D}$ then $c_{\infty,A}=0$.

Lemma 4.8.

For each $A,B\in\mathcal{G}_{\leq D}$ we have the following.

  • There is a constant $M_{\infty,AB}\in\mathbb{R}$ (depending on $\pi_{\Theta},A,B$) such that $M_{n,AB}=M_{\infty,AB}+O(n^{-1/2})$.

  • If $A\in\mathcal{T}_{\leq D}$ and $B\in\mathcal{G}_{\leq D}\setminus\mathcal{T}_{\leq D}$ then $M_{\infty,AB}=0$.

Write ${\boldsymbol{c}}_{n},{\boldsymbol{M}}_{n}$ in block form, where the first block is indexed by the rooted trees $\mathcal{T}_{\leq D}$ and the second by the other graphs $\mathcal{G}_{\leq D}\setminus\mathcal{T}_{\leq D}$:

$${\boldsymbol{c}}_{n}=\begin{bmatrix}{\boldsymbol{d}}_{n}\\ {\boldsymbol{e}}_{n}\end{bmatrix}\,,\qquad{\boldsymbol{M}}_{n}=\begin{bmatrix}{\boldsymbol{P}}_{n}&{\boldsymbol{R}}_{n}\\ {\boldsymbol{R}}_{n}^{\sf T}&{\boldsymbol{Q}}_{n}\end{bmatrix}\,. \tag{45}$$

Let ${\boldsymbol{d}}_{\infty},{\boldsymbol{e}}_{\infty},{\boldsymbol{P}}_{\infty},{\boldsymbol{Q}}_{\infty},{\boldsymbol{R}}_{\infty}$ denote the corresponding limiting vectors/matrices from Lemmas 4.7 and 4.8, and note that ${\boldsymbol{e}}_{\infty}={\boldsymbol{0}}$ and ${\boldsymbol{R}}_{\infty}={\boldsymbol{0}}$.

We now work towards constructing the (sequence of) tree-structured polynomials $p=p_{n}$ that verify Eq. (35). We begin by defining a related polynomial $r=r_{n}$, which is tree-structured in the basis $\{\mathscrsfs{H}_{T}\}$ instead of $\{\mathscrsfs{F}_{T}\}$:

$$r({\boldsymbol{Y}})=\sum_{T\in\mathcal{T}_{\leq D}}\hat{r}_{T}\mathscrsfs{H}_{T}({\boldsymbol{Y}})\qquad\text{where}\qquad\hat{\boldsymbol{r}}={\boldsymbol{P}}_{\infty}^{-1}{\boldsymbol{d}}_{\infty}\,. \tag{46}$$

To see that this definition is well-posed, note that since ${\boldsymbol{M}}_{\infty}$ is strictly positive definite by Lemma 4.5, ${\boldsymbol{P}}_{\infty}$ is also strictly positive definite and thus invertible. For intuition, note the similarity between $\hat{\boldsymbol{r}}$ and the optimizer $\hat{\boldsymbol{q}}$ in Lemma 4.6.

The next lemma characterizes the asymptotic estimation error achieved by the polynomial $r({\boldsymbol{Y}})$.

Lemma 4.9.

We have

$$\lim_{n\to\infty}\operatorname*{\mathbb{E}}[r({\boldsymbol{Y}})\cdot\psi(\theta_{1})]=\lim_{n\to\infty}\operatorname*{\mathbb{E}}[r({\boldsymbol{Y}})^{2}]=\langle{\boldsymbol{d}}_{\infty},{\boldsymbol{P}}_{\infty}^{-1}{\boldsymbol{d}}_{\infty}\rangle=\langle{\boldsymbol{c}}_{\infty},{\boldsymbol{M}}_{\infty}^{-1}{\boldsymbol{c}}_{\infty}\rangle\,, \tag{47}$$

and in particular,

$$\lim_{n\to\infty}\operatorname*{\mathbb{E}}[(r({\boldsymbol{Y}})-\psi(\theta_{1}))^{2}]=\operatorname*{\mathbb{E}}[\psi(\theta_{1})^{2}]-\langle{\boldsymbol{c}}_{\infty},{\boldsymbol{M}}_{\infty}^{-1}{\boldsymbol{c}}_{\infty}\rangle\,. \tag{48}$$

The crucial equality $\langle{\boldsymbol{d}}_{\infty},{\boldsymbol{P}}_{\infty}^{-1}{\boldsymbol{d}}_{\infty}\rangle=\langle{\boldsymbol{c}}_{\infty},{\boldsymbol{M}}_{\infty}^{-1}{\boldsymbol{c}}_{\infty}\rangle$ above is an immediate consequence of the structure of ${\boldsymbol{c}}_{\infty},{\boldsymbol{M}}_{\infty}$ established in Lemmas 4.7 and 4.8, namely ${\boldsymbol{e}}_{\infty}={\boldsymbol{0}}$ and ${\boldsymbol{R}}_{\infty}={\boldsymbol{0}}$.
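Spelled out (a small step we add for readability): since ${\boldsymbol{R}}_{\infty}={\boldsymbol{0}}$, the matrix ${\boldsymbol{M}}_{\infty}$ is block diagonal with blocks ${\boldsymbol{P}}_{\infty},{\boldsymbol{Q}}_{\infty}$, so

$$\langle{\boldsymbol{c}}_{\infty},{\boldsymbol{M}}_{\infty}^{-1}{\boldsymbol{c}}_{\infty}\rangle=\Big\langle\begin{bmatrix}{\boldsymbol{d}}_{\infty}\\ {\boldsymbol{0}}\end{bmatrix},\begin{bmatrix}{\boldsymbol{P}}_{\infty}^{-1}&{\boldsymbol{0}}\\ {\boldsymbol{0}}&{\boldsymbol{Q}}_{\infty}^{-1}\end{bmatrix}\begin{bmatrix}{\boldsymbol{d}}_{\infty}\\ {\boldsymbol{0}}\end{bmatrix}\Big\rangle=\langle{\boldsymbol{d}}_{\infty},{\boldsymbol{P}}_{\infty}^{-1}{\boldsymbol{d}}_{\infty}\rangle\,.$$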

As a direct consequence of Lemma 4.9, we can now prove the following analogue of Proposition 4.3, in which the basis $\{\mathscrsfs{H}_{T}\}$ is used instead of $\{\mathscrsfs{F}_{T}\}$.

Proposition 4.10.

Assume $\pi_{\Theta}$ to have finite moments of all orders. Let $\psi:{\mathbb{R}}\to{\mathbb{R}}$ be a measurable function, and assume all moments of $\psi(\theta_{1})$ to be finite. For any fixed $\pi_{\Theta},\psi,D$ there exists a fixed ($n$-independent) choice of coefficients $(\hat{r}_{T})_{T\in\mathcal{T}_{\leq D}}$ such that the associated polynomial $r=r_{n}$ defined by $r({\boldsymbol{Y}})=\sum_{T\in\mathcal{T}_{\leq D}}\hat{r}_{T}\mathscrsfs{H}_{T}({\boldsymbol{Y}})$ satisfies

$$\lim_{n\to\infty}\operatorname*{\mathbb{E}}[(r({\boldsymbol{Y}})-\psi(\theta_{1}))^{2}]=\lim_{n\to\infty}\inf_{q\in\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}}\operatorname*{\mathbb{E}}[(q({\boldsymbol{Y}})-\psi(\theta_{1}))^{2}]\,. \tag{49}$$

(In particular, the above limits exist.)

Proof of Proposition 4.10.

It follows immediately from Lemmas 4.5, 4.7, and 4.8 that

$$\lim_{n\to\infty}\langle{\boldsymbol{c}}_{n},{\boldsymbol{M}}_{n}^{-1}{\boldsymbol{c}}_{n}\rangle=\langle{\boldsymbol{c}}_{\infty},{\boldsymbol{M}}_{\infty}^{-1}{\boldsymbol{c}}_{\infty}\rangle\,.$$

(See Lemma C.2 in the appendix for a formal proof.) Combining this fact with Lemmas 4.4 and 4.6, the limit on the right-hand side of (49) exists and is equal to $\operatorname*{\mathbb{E}}[\psi(\theta_{1})^{2}]-\langle{\boldsymbol{c}}_{\infty},{\boldsymbol{M}}_{\infty}^{-1}{\boldsymbol{c}}_{\infty}\rangle$.

Defining $\hat{\boldsymbol{r}}$ as in (46), we have by Lemma 4.9 that the limit on the left-hand side of (49) is also equal to $\operatorname*{\mathbb{E}}[\psi(\theta_{1})^{2}]-\langle{\boldsymbol{c}}_{\infty},{\boldsymbol{M}}_{\infty}^{-1}{\boldsymbol{c}}_{\infty}\rangle$. This completes the proof. ∎

Proposition 4.10 differs from our goal, namely Proposition 4.3, in that it offers an estimator that is a linear combination of the polynomials $\{\mathscrsfs{H}_{T}\}_{T\in\mathcal{T}_{\leq D}}$ instead of $\{\mathscrsfs{F}_{T}\}_{T\in\mathcal{T}_{\leq D}}$. Recalling that we are restricting to trees and that the first two Hermite polynomials are simply $h_{0}(z)=1$ and $h_{1}(z)=z$, the difference lies in the fact that $\mathscrsfs{H}_{T}({\boldsymbol{Y}})$ involves a sum over embeddings, while $\mathscrsfs{F}_{T}({\boldsymbol{Y}})$ involves a sum over the larger class of non-reversing labelings; also $\mathscrsfs{H}_{T}({\boldsymbol{Y}})$ is centered by its expectation, provided $T$ is not edgeless (see Eq. (38), and note that a tree has only one connected component).

The next lemma allows us, for each $A\in\mathcal{T}_{\leq D}$, to rewrite $\mathscrsfs{H}_{A}({\boldsymbol{Y}})$ as a linear combination of the $\{\mathscrsfs{F}_{T}({\boldsymbol{Y}})\}_{T\in\mathcal{T}_{\leq D}}$ with negligible error.

Lemma 4.11.

For any fixed $A\in\mathcal{T}_{\leq D}$ there exist $n$-independent coefficients $(m_{AB})_{B\in\mathcal{T}_{\leq D}}$ such that

$$\mathscrsfs{H}_{A}({\boldsymbol{Y}})=\sum_{B\in\mathcal{T}_{\leq D}}m_{AB}\mathscrsfs{F}_{B}({\boldsymbol{Y}})+\mathscrsfs{E}_{A}({\boldsymbol{Y}})\,,\qquad\operatorname*{\mathbb{E}}[\mathscrsfs{E}_{A}({\boldsymbol{Y}})^{2}]=o_{n}(1)\,. \tag{50}$$

We can finally conclude the proof of Proposition 4.3.

Proof of Proposition 4.3.

As in the proof of Proposition 4.10, it suffices to define 𝒑^\hat{\boldsymbol{p}} so that the limit on the left-hand side of (35) is equal to 𝔼[ψ(θ1)2]𝒄,𝑴1𝒄\operatorname*{\mathbb{E}}[\psi(\theta_{1})^{2}]-\langle{\boldsymbol{c}}_{\infty},{\boldsymbol{M}}_{\infty}^{-1}{\boldsymbol{c}}_{\infty}\rangle. Defining 𝒓^\hat{\boldsymbol{r}} as in (46), we can use Lemma 4.11 to expand

r(𝒀)\displaystyle r({\boldsymbol{Y}}) =A𝒯Dr^AHA(𝒀)\displaystyle=\sum_{A\in\mathcal{T}_{\leq D}}\hat{r}_{A}\mathscrsfs{H}_{A}({\boldsymbol{Y}})
=A𝒯Dr^A[B𝒯DmABFB(𝒀)+EA(𝒀)]\displaystyle=\sum_{A\in\mathcal{T}_{\leq D}}\hat{r}_{A}\left[\sum_{B\in\mathcal{T}_{\leq D}}m_{AB}\mathscrsfs{F}_{B}({\boldsymbol{Y}})+\mathscrsfs{E}_{A}({\boldsymbol{Y}})\right]
=A,B𝒯Dr^AmABFB(𝒀)+A𝒯Dr^AEA(𝒀)\displaystyle=\sum_{A,B\in\mathcal{T}_{\leq D}}\hat{r}_{A}m_{AB}\mathscrsfs{F}_{B}({\boldsymbol{Y}})+\sum_{A\in\mathcal{T}_{\leq D}}\hat{r}_{A}\mathscrsfs{E}_{A}({\boldsymbol{Y}})
=:p(𝒀)+Δ(𝒀).\displaystyle=:p({\boldsymbol{Y}})+\Delta({\boldsymbol{Y}})\,.

In other words, we define p^B=A𝒯Dr^AmAB\hat{p}_{B}=\sum_{A\in\mathcal{T}_{\leq D}}\hat{r}_{A}m_{AB}. Since |𝒯D|=O(1)|\mathcal{T}_{\leq D}|=O(1), r^A=O(1)\hat{r}_{A}=O(1), and 𝔼[EA(𝒀)2]=on(1)\operatorname*{\mathbb{E}}[\mathscrsfs{E}_{A}({\boldsymbol{Y}})^{2}]=o_{n}(1), we have

𝔼[Δ(𝒀)2]1/2A𝒯D|r^A|𝔼[EA(𝒀)2]1/2=on(1).\operatorname*{\mathbb{E}}[\Delta({\boldsymbol{Y}})^{2}]^{1/2}\leq\sum_{A\in\mathcal{T}_{\leq D}}|\hat{r}_{A}|\cdot\operatorname{\mathbb{E}}\left[\mathscrsfs{E}_{A}({\boldsymbol{Y}})^{2}\right]^{1/2}=o_{n}(1)\,.

Now compute

𝔼[p(𝒀)ψ(θ1)]\displaystyle\operatorname*{\mathbb{E}}[p({\boldsymbol{Y}})\cdot\psi(\theta_{1})] =𝔼[r(𝒀)ψ(θ1)]𝔼[Δ(𝒀)ψ(θ1)]\displaystyle=\operatorname*{\mathbb{E}}[r({\boldsymbol{Y}})\cdot\psi(\theta_{1})]-\operatorname*{\mathbb{E}}[\Delta({\boldsymbol{Y}})\cdot\psi(\theta_{1})]
=𝒄,𝑴1𝒄+on(1),\displaystyle=\langle{\boldsymbol{c}}_{\infty},{\boldsymbol{M}}_{\infty}^{-1}{\boldsymbol{c}}_{\infty}\rangle+o_{n}(1)\,, (51)

where the last step follows from Lemma 4.9 together with the remark that 𝔼[ψ(θ1)2]=O(1)\operatorname*{\mathbb{E}}[\psi(\theta_{1})^{2}]=O(1), so |𝔼[Δ(𝒀)ψ(θ1)]|𝔼[Δ(𝒀)2]1/2𝔼[ψ(θ1)2]1/2=on(1)|\mathbb{E}[\Delta({\boldsymbol{Y}})\cdot\psi(\theta_{1})]|\leq\operatorname*{\mathbb{E}}[\Delta({\boldsymbol{Y}})^{2}]^{1/2}\operatorname*{\mathbb{E}}[\psi(\theta_{1})^{2}]^{1/2}=o_{n}(1).

Similarly, since 𝔼[HA(𝒀)2]=O(1)\operatorname*{\mathbb{E}}[\mathscrsfs{H}_{A}({\boldsymbol{Y}})^{2}]=O(1) by Lemma 4.8, and r^A=O(1)\hat{r}_{A}=O(1), we have

𝔼[r(𝒀)2]1/2A𝒯D|r^A|𝔼[HA(𝒀)2]1/2=O(1),\operatorname*{\mathbb{E}}[r({\boldsymbol{Y}})^{2}]^{1/2}\leq\sum_{A\in\mathcal{T}_{\leq D}}|\hat{r}_{A}|\cdot\operatorname{\mathbb{E}}\left[\mathscrsfs{H}_{A}({\boldsymbol{Y}})^{2}\right]^{1/2}=O(1)\,,

and therefore

𝔼[p(𝒀)2]\displaystyle\operatorname*{\mathbb{E}}[p({\boldsymbol{Y}})^{2}] =𝔼[(r(𝒀)Δ(𝒀))2]\displaystyle=\operatorname*{\mathbb{E}}[(r({\boldsymbol{Y}})-\Delta({\boldsymbol{Y}}))^{2}]
=𝔼[r(𝒀)2]2𝔼[r(𝒀)Δ(𝒀)]+𝔼[Δ(𝒀)2]\displaystyle=\operatorname*{\mathbb{E}}[r({\boldsymbol{Y}})^{2}]-2\operatorname*{\mathbb{E}}[r({\boldsymbol{Y}})\Delta({\boldsymbol{Y}})]+\operatorname*{\mathbb{E}}[\Delta({\boldsymbol{Y}})^{2}]
=𝒄,𝑴1𝒄+on(1).\displaystyle=\langle{\boldsymbol{c}}_{\infty},{\boldsymbol{M}}_{\infty}^{-1}{\boldsymbol{c}}_{\infty}\rangle+o_{n}(1)\,. (52)

Combining Eqs. (51) and (52) we now conclude

limn𝔼[(p(𝒀)ψ(θ1))2]=𝔼[ψ(θ1)2]𝒄,𝑴1𝒄,\lim_{n\to\infty}\operatorname*{\mathbb{E}}[(p({\boldsymbol{Y}})-\psi(\theta_{1}))^{2}]=\operatorname*{\mathbb{E}}[\psi(\theta_{1})^{2}]-\langle{\boldsymbol{c}}_{\infty},{\boldsymbol{M}}_{\infty}^{-1}{\boldsymbol{c}}_{\infty}\rangle\,,

completing the proof. ∎

4.3 From tree-structured polynomials to AMP

In this section we will show that the Bayes-AMP algorithm achieves the same asymptotic accuracy as the best tree-structured constant-degree polynomial p[𝒀]D𝒯p\in\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}^{\mathcal{T}}, matching Eq. (35). This will complete the proof of the lower bound in Theorem 2.3.

The key step is to construct a message passing (MP) algorithm that evaluates any given tree-structured polynomial. We recall the definition of a class of MP algorithms introduced in [BLM15] (we make some simplifications with respect to the original setting). After tt iterations, the state of an algorithm in this class is an array of messages (𝒔ijt)i,j[n]({\boldsymbol{s}}_{i\to j}^{t})_{i,j\in[n]} indexed by ordered pairs in [n][n]. Each message is a vector 𝒔ijt𝖽{\boldsymbol{s}}_{i\to j}^{t}\in{\mathbb{R}}^{{\sf d}}, where 𝖽{\sf d} is a fixed integer (independent of nn and tt). Here, we use the arrow to emphasize ordering, and entries iii\to i are set to 0 by convention. Updates make use of functions Ft:𝖽𝖽F_{t}:{\mathbb{R}}^{{\sf d}}\to{\mathbb{R}}^{{\sf d}} according to

𝒔ijt+1\displaystyle{\boldsymbol{s}}_{i\to j}^{t+1} =1nk[n]{i,j}YikFt(𝒔kit),𝒔ij0=𝟎ij.\displaystyle=\frac{1}{\sqrt{n}}\sum_{k\in[n]\setminus\{i,j\}}Y_{ik}F_{t}({\boldsymbol{s}}^{t}_{k\to i})\,,\;\;\;\;\;\,\;\;{\boldsymbol{s}}_{i\to j}^{0}={\boldsymbol{0}}\;\;\forall i\neq j\,. (53)

We finally define vectors 𝒔it,𝒔^it𝖽{\boldsymbol{s}}_{i}^{t},\hat{\boldsymbol{s}}_{i}^{t}\in{\mathbb{R}}^{{\sf d}} indexed by i[n]i\in[n]:

𝒔it+1\displaystyle{\boldsymbol{s}}_{i}^{t+1} =1nk[n]{i}YikFt(𝒔kit),\displaystyle=\frac{1}{\sqrt{n}}\sum_{k\in[n]\setminus\{i\}}Y_{ik}F_{t}({\boldsymbol{s}}^{t}_{k\to i})\,, (54)
𝒔^it+1\displaystyle\hat{\boldsymbol{s}}_{i}^{t+1} =Ft(𝒔it+1).\displaystyle=F_{t}({\boldsymbol{s}}^{t+1}_{i})\,. (55)
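To make the recursion concrete, here is a minimal NumPy sketch of Eqs. (53)–(55); for simplicity the same update function is used at every iteration (as in the choice Ft=FF_{t}=F^{*} below), and the function F, the dimension d and the array layout are illustrative assumptions rather than part of the formal definition.

import numpy as np

def mp_iterates(Y, F, d, T):
    """Run the recursion (53)-(55) for T >= 1 steps with F_t = F.

    Y : (n, n) symmetric data matrix.
    F : vectorized map R^d -> R^d, applied along the last axis of its input.
    Returns the messages s_{i->j}^T, the node values s_i^T and hat{s}_i^T = F(s_i^T).
    """
    n = Y.shape[0]
    S = np.zeros((n, n, d))                                        # s_{i->j}^0 = 0
    idx = np.arange(n)
    for _ in range(T):
        FS = F(S)                                                  # F(s_{k->i}^t), shape (n, n, d)
        full = np.einsum('ik,kid->id', Y, FS) / np.sqrt(n)         # sum over all k
        self_term = np.einsum('ii,iid->id', Y, FS) / np.sqrt(n)    # the k = i term
        s_node = full - self_term                                  # s_i^{t+1}, Eq. (54)
        # Eq. (53): additionally remove the k = j term from the message i -> j
        S = s_node[:, None, :] - np.einsum('ij,jid->ijd', Y, FS) / np.sqrt(n)
        S[idx, idx, :] = 0.0                                       # keep the convention s_{i->i} = 0
    return S, s_node, F(s_node)                                    # Eq. (55)

In this form each iteration costs O(n^2 d) operations beyond the evaluation of F, since the sum over k is computed once per node and the k∈{i,j} terms are then subtracted.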

We claim that this class of recursions can be used to evaluate the tree-structured polynomials {FT}T𝒯D\{\mathscrsfs{F}_{T}\}_{T\in\mathcal{T}_{\leq D}}, for any fixed DD. To see this, we let 𝖽=|𝒯D|{\sf d}=|\mathcal{T}_{\leq D}| so that we can index the entries of 𝒔ijt{\boldsymbol{s}}_{i\to j}^{t} or 𝒔it,𝒔^it{\boldsymbol{s}}_{i}^{t},\hat{\boldsymbol{s}}_{i}^{t} by elements of 𝒯D\mathcal{T}_{\leq D}. For instance, we will have:

𝒔ijt=(sijt(T(1)),sijt(T(2)),,sijt(T(𝖽))),\displaystyle{\boldsymbol{s}}^{t}_{i\to j}=\big{(}s_{i\to j}^{t}(T_{(1)}),s_{i\to j}^{t}(T_{(2)}),\dots,s_{i\to j}^{t}(T_{({\sf d})})\big{)}\,, (56)

where 𝒯D=(T(1),T(2),,T(𝖽))\mathcal{T}_{\leq D}=(T_{(1)},T_{(2)},\dots,T_{({\sf d})}) is an enumeration of 𝒯D\mathcal{T}_{\leq D}.

Given T𝒯DT\in\mathcal{T}_{\leq D}, we define T+T_{+} to be the graph with vertex set V(T){v+}V(T)\cup\{v_{+}\} (where v+v_{+} is not an element of V(T)V(T)), and edge set E(T){(v0,v+)}E(T)\cup\{(v_{0},v_{+})\}, where v0v_{0} is the root of TT. We set the root of T+T_{+} to be =v+\circ=v_{+}. In words, T+T_{+} is obtained by attaching an edge to the root of TT, and moving the root to the other endpoint of the new edge.

If the root of T𝒯DT\in\mathcal{T}_{\leq D} has degree kk, we let T1,,TkT_{1},\dots,T_{k} denote the connected subgraphs of TT that are obtained by removing the root and the edges incident to the root. Notice that each TjT_{j} is a tree which we root at the unique vertex vjv_{j} such that (,vj)E(T)(\circ,v_{j})\in E(T). We write 𝒟(T)={T1,T2,,Tk}{\mathcal{D}}(T)=\{T_{1},T_{2},\dots,T_{k}\}. We then define the special mapping F:|𝒯D||𝒯D|F^{*}:{\mathbb{R}}^{|\mathcal{T}_{\leq D}|}\to{\mathbb{R}}^{|\mathcal{T}_{\leq D}|}, by letting, for any T𝒯DT\in\mathcal{T}_{\leq D},

F(𝒔)(T):=T𝒟(T)s(T).\displaystyle F^{*}({\boldsymbol{s}})(T):=\prod_{T^{\prime}\in{\mathcal{D}}(T)}s(T^{\prime})\,. (57)
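As an illustration, here is a minimal sketch of the map F^{*} of Eq. (57), under the assumption that the trees in \mathcal{T}_{\leq D} are enumerated by integer ids 0,\dots,{\sf d}-1 and that a (hypothetical) table children[T] lists the ids of the rooted subtrees in {\mathcal{D}}(T):

import numpy as np

def F_star(s, children):
    """Evaluate Eq. (57): F*(s)(T) = product of s(T') over T' in D(T).

    s        : array of shape (..., d), entries indexed by tree ids.
    children : list of lists; children[T] holds the ids of the rooted subtrees
               obtained by deleting the root of T (empty if T is edgeless).
    """
    out = np.ones(s.shape)
    for T, subtrees in enumerate(children):
        for Tp in subtrees:
            out[..., T] = out[..., T] * s[..., Tp]
    return out

Plugging F = lambda S: F_star(S, children) into the sketch of Eqs. (53)–(55) above yields the iteration analyzed in Proposition 4.12.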

In order to clarify the connection between this MP algorithm and tree-structured polynomials, we define the following modification of the tree-structured polynomials of Eq. (32). For each T𝒯DT\in\mathcal{T}_{\leq D}, and each pair of distinct indices i,j[n]i,j\in[n], we define

FT,ij(𝒀)=1n|E(T)|/2ϕ𝗇𝗋(T;ij)(u,v)E(T)Yϕ(u),ϕ(v).\mathscrsfs{F}_{T,i\to j}({\boldsymbol{Y}})=\frac{1}{n^{|E(T)|/2}}\sum_{\phi\in\mathsf{nr}(T;i\to j)}\,\prod_{(u,v)\in E(T)}Y_{\phi(u),\phi(v)}\,. (58)

Here 𝗇𝗋(T;ij)\mathsf{nr}(T;i\to j) is the set of labelings, i.e., maps ϕ:V(T)[n]\phi:V(T)\to[n], that are non-reversing (in the same sense as Definition 4.1) and such that the following hold:

  • ϕ()=i\phi(\circ)=i. (The labeling is rooted at ii, not at 11.)

  • For any vV(T)v\in V(T) such that (,v)E(T)(\circ,v)\in E(T), we have ϕ(v)j\phi(v)\neq j.

Notice also the different normalization with respect to Eq. (32). However, by counting the choices at each vertex we have |𝗇𝗋(T;ij)|=(n2)|E(T)||\mathsf{nr}(T;i\to j)|=(n-2)^{|E(T)|}, |𝗇𝗋(T)|=(n1)(n2)|E(T)|1|\mathsf{nr}(T)|=(n-1)(n-2)^{|E(T)|-1} and therefore

|𝗇𝗋(T;ij)|n|E(T)|=1+on(1),|𝗇𝗋(T)|n|E(T)|=1+on(1).\frac{|\mathsf{nr}(T;i\to j)|}{n^{|E(T)|}}=1+o_{n}(1)\,,\;\;\;\frac{|\mathsf{nr}(T)|}{n^{|E(T)|}}=1+o_{n}(1)\,. (59)

We also define the radius of a rooted graph GG, 𝗋𝖺𝖽(G):=max{𝖽G(,v):vV(G)}{\mathsf{rad}}(G):=\max\{{\sf d}_{G}(\circ,v):\;v\in V(G)\}.

Proposition 4.12.

Let 𝐬ijt{\boldsymbol{s}}^{t}_{i\to j}, 𝐬it{\boldsymbol{s}}_{i}^{t}, 𝐬^it\hat{\boldsymbol{s}}_{i}^{t} be the iterates defined by Eqs. (53), (54), (55) with Ft=FF_{t}=F^{*} given by Eq. (57). Then, for any T𝒯DT\in\mathcal{T}_{\leq D}, and any t>𝗋𝖺𝖽(T)t>{\mathsf{rad}}(T), recalling the definition of T+T_{+} given above, we have the following (here the on(1)o_{n}(1) terms are uniform in 𝐘{\boldsymbol{Y}}):

𝒔ijt(T)\displaystyle{\boldsymbol{s}}^{t}_{i\to j}(T) =FT+,ij(𝒀),\displaystyle=\mathscrsfs{F}_{T_{+},i\to j}({\boldsymbol{Y}})\,,
𝒔1t(T)\displaystyle{\boldsymbol{s}}_{1}^{t}(T) =|𝗇𝗋(T+)|n|E(T+)|/2FT+(𝒀)=(1+on(1))FT+(𝒀),\displaystyle=\frac{|\mathsf{nr}(T_{+})|}{n^{|E(T_{+})|/2}}\mathscrsfs{F}_{T_{+}}({\boldsymbol{Y}})=(1+o_{n}(1))\cdot\mathscrsfs{F}_{T_{+}}({\boldsymbol{Y}})\,,
𝒔^1t(T)\displaystyle\hat{\boldsymbol{s}}_{1}^{t}(T) =|𝗇𝗋(T)|n|E(T)|/2FT(𝒀)=(1+on(1))FT(𝒀).\displaystyle=\frac{|\mathsf{nr}(T)|}{n^{|E(T)|/2}}\mathscrsfs{F}_{T}({\boldsymbol{Y}})=(1+o_{n}(1))\cdot\mathscrsfs{F}_{T}({\boldsymbol{Y}})\,.
Proof.

The proof is straightforward, and amounts to checking that the first claim holds by induction on 𝗋𝖺𝖽(T){\mathsf{rad}}(T). ∎

We next show that, for a broad class of choices of the functions FtF_{t}, the high-dimensional asymptotics of the algorithm defined by Eqs. (53), (54), (55) is determined by a generalization of the state evolution recursion of Theorem 2.1.

Lemma 4.13.

Assume that πΘ\pi_{\Theta} has finite moments of all orders and, for each t0t\geq 0, Ft:𝖽𝖽F_{t}:{\mathbb{R}}^{{\sf d}}\to{\mathbb{R}}^{{\sf d}} is a polynomial independent of nn. Define the sequence of vectors 𝛍t𝖽{\boldsymbol{\mu}}_{t}\in{\mathbb{R}}^{{\sf d}} and positive semidefinite matrices 𝚺t𝖽×𝖽{\boldsymbol{\Sigma}}_{t}\in{\mathbb{R}}^{{\sf d}\times{\sf d}} via the state evolution equations (10), (11).

Then for any polynomial ψ:𝖽×\psi:{\mathbb{R}}^{{\sf d}}\times{\mathbb{R}}\to{\mathbb{R}}, the following limits hold for (Θ,𝐆t)πΘ𝒩(0,𝚺t)(\Theta,{\boldsymbol{G}}_{t})\sim\pi_{\Theta}\otimes{\mathcal{N}}(0,{\boldsymbol{\Sigma}}_{t}):

limn𝔼[ψ(𝒔12t,θ1)]=limn𝔼[ψ(𝒔1t,θ1)]=𝔼ψ(𝝁tΘ+𝑮t,Θ),\displaystyle\lim_{n\to\infty}\operatorname{\mathbb{E}}[\psi({\boldsymbol{s}}^{t}_{1\to 2},\theta_{1})]=\lim_{n\to\infty}\operatorname{\mathbb{E}}[\psi({\boldsymbol{s}}^{t}_{1},\theta_{1})]=\operatorname{\mathbb{E}}\psi\big{(}{\boldsymbol{\mu}}_{t}\Theta+{\boldsymbol{G}}_{t},\Theta\big{)}\,, (60)
limn𝔼[ψ(𝒔^1t,θ1)]=𝔼ψ(Ft(𝝁tΘ+𝑮t),Θ).\displaystyle\lim_{n\to\infty}\operatorname{\mathbb{E}}[\psi(\hat{\boldsymbol{s}}^{t}_{1},\theta_{1})]=\operatorname{\mathbb{E}}\psi\big{(}F_{t}({\boldsymbol{\mu}}_{t}\Theta+{\boldsymbol{G}}_{t}),\Theta\big{)}\,. (61)

The proof of this lemma is based on results from [BLM15] and will be presented in Appendix D.

We are now in position to conclude the proof of the lower bound (22) on the optimal error of Low-Deg estimators in Theorem 2.3. The quantity we want to lower bound takes the form

𝖬𝖲𝖤D:=\displaystyle{\sf MSE}_{\leq D}:= limninf𝜽^𝖫𝖣(D;n)1n𝔼{𝜽^(𝒀)𝜽22}\displaystyle\lim_{n\to\infty}\inf_{\hat{\boldsymbol{\theta}}\in{\sf LD}(D;n)}\frac{1}{n}\operatorname{\mathbb{E}}\big{\{}\|\hat{\boldsymbol{\theta}}({\boldsymbol{Y}})-{\boldsymbol{\theta}}\|_{2}^{2}\big{\}}
=\displaystyle= limninf𝜽^𝖫𝖣(D;n)1ni=1n𝔼{(θ^i(𝒀)θi)2}\displaystyle\lim_{n\to\infty}\inf_{\hat{\boldsymbol{\theta}}\in{\sf LD}(D;n)}\frac{1}{n}\sum_{i=1}^{n}\operatorname{\mathbb{E}}\left\{(\hat{\theta}_{i}({\boldsymbol{Y}})-\theta_{i})^{2}\right\}
=\displaystyle= limninfq[𝒀]D𝔼{(q(𝒀)θ1)2}\displaystyle\lim_{n\to\infty}\inf_{q\in\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}}\operatorname{\mathbb{E}}\left\{(q({\boldsymbol{Y}})-\theta_{1})^{2}\right\}
which by Proposition 4.3 takes the form
=\displaystyle= limn𝔼{(θ1T𝒯Dp^TFT(𝒀))2},\displaystyle\lim_{n\to\infty}\operatorname{\mathbb{E}}\Big{\{}\Big{(}\theta_{1}-\sum_{T\in\mathcal{T}_{\leq D}}\hat{p}_{T}\mathscrsfs{F}_{T}({\boldsymbol{Y}})\Big{)}^{2}\Big{\}}\,,

for some nn-independent choice of coefficients p^T\hat{p}_{T}. On the other hand, by Proposition 4.12, letting 𝒔^1t\hat{\boldsymbol{s}}_{1}^{t} be the iterates defined by Eqs. (53), (54), (55), with Ft=FF_{t}=F^{*} given by Eq. (57), we have, for any t>Dt>D,

limn𝔼[θ1FT(𝒀)]\displaystyle\lim_{n\to\infty}\operatorname{\mathbb{E}}\big{[}\theta_{1}\cdot\mathscrsfs{F}_{T}({\boldsymbol{Y}})\big{]} =limn𝔼[θ1s^1t(T)]\displaystyle=\lim_{n\to\infty}\operatorname{\mathbb{E}}\big{[}\theta_{1}\cdot\hat{s}^{t}_{1}(T)\big{]} (62)
=(a)𝔼[ΘF(𝝁tΘ+𝑮t)(T)]=𝝁t+1(T),\displaystyle\stackrel{{\scriptstyle(a)}}{{=}}\operatorname{\mathbb{E}}[\Theta F^{*}({\boldsymbol{\mu}}_{t}\Theta+{\boldsymbol{G}}_{t})(T)]={\boldsymbol{\mu}}_{t+1}(T)\,, (63)

where in step (a)(a) we used Lemma 4.13. (Here 𝝁t{\boldsymbol{\mu}}_{t}, 𝚺t{\boldsymbol{\Sigma}}_{t} are defined recursively by Eqs. (10), (11). Recall that 𝝁t|𝒯D|{\boldsymbol{\mu}}_{t}\in{\mathbb{R}}^{|\mathcal{T}_{\leq D}|} can be indexed by elements TT of 𝒯D\mathcal{T}_{\leq D}.)

Analogously, we obtain

limn(𝔼[FT1(𝒀)FT2(𝒀)])T1,T2𝒯D\displaystyle\lim_{n\to\infty}\Big{(}\operatorname{\mathbb{E}}\big{[}\mathscrsfs{F}_{T_{1}}({\boldsymbol{Y}})\mathscrsfs{F}_{T_{2}}({\boldsymbol{Y}})\big{]}\Big{)}_{T_{1},T_{2}\in\mathcal{T}_{\leq D}} =𝔼[F(𝝁tΘ+𝑮t)F(𝝁tΘ+𝑮t)𝖳]=𝚺t+1.\displaystyle=\operatorname{\mathbb{E}}\big{[}F^{*}({\boldsymbol{\mu}}_{t}\Theta+{\boldsymbol{G}}_{t})F^{*}({\boldsymbol{\mu}}_{t}\Theta+{\boldsymbol{G}}_{t})^{{\sf T}}\big{]}={\boldsymbol{\Sigma}}_{t+1}\,. (64)

Hence we obtain that the optimal error achieved by Low-Deg estimators is given by

𝖬𝖲𝖤D\displaystyle{\sf MSE}_{\leq D} =limn𝔼{(θ1T𝒯Dp^TFT(𝒀))2}\displaystyle=\lim_{n\to\infty}\operatorname{\mathbb{E}}\Big{\{}\Big{(}\theta_{1}-\sum_{T\in\mathcal{T}_{\leq D}}\hat{p}_{T}\mathscrsfs{F}_{T}({\boldsymbol{Y}})\Big{)}^{2}\Big{\}} (65)
=𝔼[Θ2]2𝒑^,𝝁t+1+𝒑^,𝚺t+1𝒑^.\displaystyle=\operatorname{\mathbb{E}}[\Theta^{2}]-2\langle\hat{\boldsymbol{p}},{\boldsymbol{\mu}}_{t+1}\rangle+\langle\hat{\boldsymbol{p}},{\boldsymbol{\Sigma}}_{t+1}\hat{\boldsymbol{p}}\rangle\,. (66)

However, by Theorem 2.1 there exists an AMP algorithm of the form (9) that achieves exactly the same error. Simply take 𝖽=|𝒯D|{\sf d}=|\mathcal{T}_{\leq D}|, Ft=FF_{t}=F^{*} as defined in Eq. (57), and gt(𝒙t)=𝒑^,F(𝒙t)g_{t}({\boldsymbol{x}}^{t})=\langle\hat{\boldsymbol{p}},F^{*}({\boldsymbol{x}}^{t})\rangle. The desired lower bound follows by applying the optimality result of Theorem 2.2.
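For concreteness, the following Monte Carlo sketch mirrors the computation just carried out: it iterates the recursion of Eqs. (63)–(64) and then evaluates the quadratic (66). The sampler for \pi_{\Theta}, the map F (e.g. the F^{*} sketch above) and the Monte Carlo sample size are illustrative assumptions; when {\boldsymbol{\Sigma}}_{t+1} is invertible, the quadratic in (66) is minimized at \hat{\boldsymbol{p}}={\boldsymbol{\Sigma}}_{t+1}^{-1}{\boldsymbol{\mu}}_{t+1}, which is the choice made here.

import numpy as np

def state_evolution_mse(sample_theta, F, d, t_max, n_mc=200_000, rng=None):
    """Monte Carlo sketch of mu_{t+1}, Sigma_{t+1} as in Eqs. (63)-(64),
    followed by the quadratic in Eq. (66) at its minimizing coefficient vector.

    sample_theta : function m -> array of m i.i.d. draws from pi_Theta.
    F            : vectorized map R^d -> R^d, applied along the last axis.
    """
    rng = np.random.default_rng() if rng is None else rng
    mu, Sigma = np.zeros(d), np.zeros((d, d))
    for _ in range(t_max):
        theta = sample_theta(n_mc)                                   # Theta ~ pi_Theta
        G = rng.multivariate_normal(np.zeros(d), Sigma, size=n_mc)   # G_t ~ N(0, Sigma_t)
        V = F(mu[None, :] * theta[:, None] + G)                      # F(mu_t Theta + G_t)
        mu = (theta[:, None] * V).mean(axis=0)                       # Eq. (63)
        Sigma = (V[:, :, None] * V[:, None, :]).mean(axis=0)         # Eq. (64)
    # Eq. (66): E[Theta^2] - 2 <p, mu> + <p, Sigma p>, evaluated at p = Sigma^+ mu
    p_hat = np.linalg.lstsq(Sigma, mu, rcond=None)[0]
    second_moment = (sample_theta(n_mc) ** 2).mean()
    return second_moment - 2 * p_hat @ mu + p_hat @ Sigma @ p_hat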

Acknowledgments

This project was initiated when the authors were visiting the Simons Institute for the Theory of Computing during the program on Computational Complexity of Statistical Inference in Fall 2021: we are grateful to the Simons Institute for its support.

AM was supported by the NSF through award DMS-2031883, the Simons Foundation through Award 814639 for the Collaboration on the Theoretical Foundations of Deep Learning, the NSF grant CCF-2006489 and the ONR grant N00014-18-1-2729, and a grant from Eric and Wendy Schmidt at the Institute for Advanced Studies. Part of this work was carried out while Andrea Montanari was on partial leave from Stanford and a Chief Scientist at Ndata Inc dba Project N. The present research is unrelated to AM’s activity while on leave.

Part of this work was done while ASW was with the Simons Institute for the Theory of Computing, supported by a Simons-Berkeley Research Fellowship. Part of this work was done while ASW was with the Algorithms and Randomness Center at Georgia Tech, supported by NSF grants CCF-2007443 and CCF-2106444.

Appendix A Proof of Theorem 1.1

Let t0:=𝔼[Θ]t_{0}:=\operatorname{\mathbb{E}}[\Theta]. We assume, without loss of generality, t0>0t_{0}>0. Let πt\pi_{t} be the version of πΘ\pi_{\Theta} centered at t0t\geq 0, namely the law of Θt0+t\Theta-t_{0}+t when ΘπΘ\Theta\sim\pi_{\Theta} (in particular πΘ=πt0\pi_{\Theta}=\pi_{t_{0}}). We finally let 𝜽t{\boldsymbol{\theta}}_{t} be a vector with i.i.d. coordinates with distribution πt\pi_{t}, and (for 𝒁𝖦𝖮𝖤(n){\boldsymbol{Z}}\sim{\sf GOE}(n))

𝒀t=1n𝜽t𝜽t𝖳+𝒁.\displaystyle{\boldsymbol{Y}}_{t}=\frac{1}{\sqrt{n}}{\boldsymbol{\theta}}_{t}{\boldsymbol{\theta}}_{t}^{{\sf T}}+{\boldsymbol{Z}}\,. (67)

We will take 𝜽t=𝜽0+t𝟏{\boldsymbol{\theta}}_{t}={\boldsymbol{\theta}}_{0}+t{\boldsymbol{1}}. The normalized mutual information between 𝒀t{\boldsymbol{Y}}_{t} and 𝜽t{\boldsymbol{\theta}}_{t} is given, after simple manipulations, by

ϕn(t):=\displaystyle\phi_{n}(t):= 1nI(𝒀t;𝜽t)\displaystyle\frac{1}{n}I({\boldsymbol{Y}}_{t};{\boldsymbol{\theta}}_{t})
=\displaystyle= 1n𝔼𝜽t,𝒁log{exp(14n𝜽¯t𝜽¯t𝖳𝜽t𝜽t𝖳F2+12n1/2𝜽¯t,𝒁𝜽¯t)πtn(d𝜽¯t)}.\displaystyle-\frac{1}{n}\operatorname{\mathbb{E}}_{{\boldsymbol{\theta}}_{t},{\boldsymbol{Z}}}\log\left\{\int\exp\Big{(}-\frac{1}{4n}\big{\|}\bar{\boldsymbol{\theta}}_{t}\bar{\boldsymbol{\theta}}_{t}^{{\sf T}}-{\boldsymbol{\theta}}_{t}{\boldsymbol{\theta}}_{t}^{{\sf T}}\big{\|}^{2}_{F}+\frac{1}{2n^{1/2}}\langle\bar{\boldsymbol{\theta}}_{t},{\boldsymbol{Z}}\bar{\boldsymbol{\theta}}_{t}\rangle\Big{)}\,\pi_{t}^{\otimes n}({\rm d}\bar{\boldsymbol{\theta}}_{t})\right\}\,.

By writing 𝜽t=𝜽0+t𝟏{\boldsymbol{\theta}}_{t}={\boldsymbol{\theta}}_{0}+t{\boldsymbol{1}}, 𝜽¯t=𝜽¯0+t𝟏\bar{\boldsymbol{\theta}}_{t}=\bar{\boldsymbol{\theta}}_{0}+t{\boldsymbol{1}}, where 𝜽0π0n{\boldsymbol{\theta}}_{0}\sim\pi_{0}^{\otimes n}, we get

ϕn(t)\displaystyle\phi_{n}(t) :=1n𝔼𝜽0,𝒁log{en(𝜽¯0,𝜽0;t)π0n(d𝜽¯0)},\displaystyle:=-\frac{1}{n}\operatorname{\mathbb{E}}_{{\boldsymbol{\theta}}_{0},{\boldsymbol{Z}}}\log\left\{e^{-{\cal H}_{n}(\bar{\boldsymbol{\theta}}_{0},{\boldsymbol{\theta}}_{0};t)}\pi_{0}^{\otimes n}({\rm d}\bar{\boldsymbol{\theta}}_{0})\right\}\,, (68)

where

nt(𝜽¯0,𝜽0;t)=1n{𝜽¯t𝜽t2𝜽¯t,𝟏𝜽¯t𝜽t,𝜽¯t𝜽¯t𝜽t 1}1n1/2𝒁,𝟏𝜽¯t𝖳.\displaystyle\frac{\partial{\cal H}_{n}}{\partial t}(\bar{\boldsymbol{\theta}}_{0},{\boldsymbol{\theta}}_{0};t)=\frac{1}{n}\Big{\{}\|\bar{\boldsymbol{\theta}}_{t}-{\boldsymbol{\theta}}_{t}\|^{2}\langle\bar{\boldsymbol{\theta}}_{t},{\boldsymbol{1}}\rangle-\langle\bar{\boldsymbol{\theta}}_{t}-{\boldsymbol{\theta}}_{t},\bar{\boldsymbol{\theta}}_{t}\rangle\langle\bar{\boldsymbol{\theta}}_{t}-{\boldsymbol{\theta}}_{t}\,{\boldsymbol{1}}\rangle\Big{\}}-\frac{1}{n^{1/2}}\langle{\boldsymbol{Z}},{\boldsymbol{1}}\bar{\boldsymbol{\theta}}_{t}^{{\sf T}}\rangle\,.

Let μ𝒀t(d𝜽¯t)\mu_{{\boldsymbol{Y}}_{t}}({\rm d}\bar{\boldsymbol{\theta}}_{t}) be the Bayes posterior over 𝜽¯t\bar{\boldsymbol{\theta}}_{t}:

μ𝒀t(d𝜽¯t)exp(14𝒀t1n1/2𝜽¯t𝜽¯t𝖳F2)πtn(d𝜽¯t).\displaystyle\mu_{{\boldsymbol{Y}}_{t}}({\rm d}\bar{\boldsymbol{\theta}}_{t})\propto\exp\Big{(}-\frac{1}{4}\Big{\|}{\boldsymbol{Y}}_{t}-\frac{1}{n^{1/2}}\bar{\boldsymbol{\theta}}_{t}\bar{\boldsymbol{\theta}}_{t}^{{\sf T}}\Big{\|}_{F}^{2}\Big{)}\,\pi^{\otimes n}_{t}({\rm d}\bar{\boldsymbol{\theta}}_{t})\,. (69)

The above implies

ϕn(t)\displaystyle\phi^{\prime}_{n}(t) =An(t)+Bn(t),\displaystyle=A_{n}(t)+B_{n}(t)\,, (70)
An(t)\displaystyle A_{n}(t) :=1n2𝔼𝜽t,𝒁μ𝒀t(𝜽¯t𝜽t2𝜽¯t,𝟏𝜽¯t𝜽t,𝜽¯t𝜽¯t𝜽t 1),\displaystyle:=\frac{1}{n^{2}}\operatorname{\mathbb{E}}_{{\boldsymbol{\theta}}_{t},{\boldsymbol{Z}}}\mu_{{\boldsymbol{Y}}_{t}}\Big{(}\|\bar{\boldsymbol{\theta}}_{t}-{\boldsymbol{\theta}}_{t}\|^{2}\langle\bar{\boldsymbol{\theta}}_{t},{\boldsymbol{1}}\rangle-\langle\bar{\boldsymbol{\theta}}_{t}-{\boldsymbol{\theta}}_{t},\bar{\boldsymbol{\theta}}_{t}\rangle\langle\bar{\boldsymbol{\theta}}_{t}-{\boldsymbol{\theta}}_{t}\,{\boldsymbol{1}}\rangle\Big{)}\,, (71)
Bn(t)\displaystyle B_{n}(t) :=1n3/2𝔼𝜽t,𝒁𝒁,μ𝒀t(𝟏𝜽¯t𝖳).\displaystyle:=-\frac{1}{n^{3/2}}\operatorname{\mathbb{E}}_{{\boldsymbol{\theta}}_{t},{\boldsymbol{Z}}}\langle{\boldsymbol{Z}},\mu_{{\boldsymbol{Y}}_{t}}({\boldsymbol{1}}\bar{\boldsymbol{\theta}}_{t}^{{\sf T}})\rangle\,. (72)

Integrating 𝒁{\boldsymbol{Z}} by parts, we get

Bn(t)\displaystyle B_{n}(t) :=1n2𝔼𝜽t,𝒁{μ𝒀t(𝟏,𝜽¯t𝜽¯t2)μ𝒀t(𝜽¯t),μ𝒀t(𝟏,𝜽¯t𝜽¯t)}.\displaystyle:=-\frac{1}{n^{2}}\operatorname{\mathbb{E}}_{{\boldsymbol{\theta}}_{t},{\boldsymbol{Z}}}\Big{\{}\mu_{{\boldsymbol{Y}}_{t}}\big{(}\langle{\boldsymbol{1}},\bar{\boldsymbol{\theta}}_{t}\rangle\|\bar{\boldsymbol{\theta}}_{t}\|^{2}\big{)}-\langle\mu_{{\boldsymbol{Y}}_{t}}(\bar{\boldsymbol{\theta}}_{t}),\mu_{{\boldsymbol{Y}}_{t}}(\langle{\boldsymbol{1}},\bar{\boldsymbol{\theta}}_{t}\rangle\bar{\boldsymbol{\theta}}_{t})\rangle\,\Big{\}}\,. (73)

We finally note that the joint distribution of 𝜽t,𝜽¯t{\boldsymbol{\theta}}_{t},\bar{\boldsymbol{\theta}}_{t} is characterized by the identity 𝔼F(𝜽t,𝜽¯t)=𝔼μ𝒀t2(F(𝜽t,𝜽¯t)){\mathbb{E}}F({\boldsymbol{\theta}}_{t},\bar{\boldsymbol{\theta}}_{t})=\operatorname{\mathbb{E}}\mu_{{\boldsymbol{Y}}_{t}}^{\otimes 2}\big{(}F({\boldsymbol{\theta}}_{t},\bar{\boldsymbol{\theta}}_{t})\big{)}, valid for any test function FF. In particular, the pair 𝜽t,𝜽¯t{\boldsymbol{\theta}}_{t},\bar{\boldsymbol{\theta}}_{t} is exchangeable, with marginals 𝜽tπtn{\boldsymbol{\theta}}_{t}\sim\pi_{t}^{\otimes n}, 𝜽¯tπtn\bar{\boldsymbol{\theta}}_{t}\sim\pi_{t}^{\otimes n}. Writing the above expectations in terms of this measure, we get

An(t)\displaystyle A_{n}(t) =1n2𝔼(𝜽¯t𝜽t2𝜽¯t,𝟏𝜽¯t𝜽t,𝜽¯t𝜽¯t𝜽t,𝟏),\displaystyle=\frac{1}{n^{2}}{\mathbb{E}}\Big{(}\|\bar{\boldsymbol{\theta}}_{t}-{\boldsymbol{\theta}}_{t}\|^{2}\langle\bar{\boldsymbol{\theta}}_{t},{\boldsymbol{1}}\rangle-\langle\bar{\boldsymbol{\theta}}_{t}-{\boldsymbol{\theta}}_{t},\bar{\boldsymbol{\theta}}_{t}\rangle\langle\bar{\boldsymbol{\theta}}_{t}-{\boldsymbol{\theta}}_{t},{\boldsymbol{1}}\rangle\Big{)}\,, (74)
Bn(t)\displaystyle B_{n}(t) =1n2𝔼(𝟏,𝜽¯t𝜽¯t2𝟏,𝜽¯t𝜽t,𝜽¯t).\displaystyle=-\frac{1}{n^{2}}{\mathbb{E}}\Big{(}\langle{\boldsymbol{1}},\bar{\boldsymbol{\theta}}_{t}\rangle\|\bar{\boldsymbol{\theta}}_{t}\|^{2}-\langle{\boldsymbol{1}},\bar{\boldsymbol{\theta}}_{t}\rangle\langle{\boldsymbol{\theta}}_{t},\bar{\boldsymbol{\theta}}_{t}\rangle\,\Big{)}\,. (75)

By concentration inequalities for sums of independent random variables, there exist constants CkC_{k} such that

𝔼(𝜽t22k)Cknk,\displaystyle{\mathbb{E}}\big{(}\|{\boldsymbol{\theta}}_{t}\|_{2}^{2k}\big{)}\leq C_{k}n^{k}\,, (76)
𝔼(|𝜽t,𝟏nt|k)Cknk/2,\displaystyle{\mathbb{E}}\big{(}|\langle{\boldsymbol{\theta}}_{t},{\boldsymbol{1}}\rangle-nt|^{k}\big{)}\leq C_{k}n^{k/2}\,, (77)
𝔼(|𝜽t22𝔼[𝜽t22]|k)Cknk.\displaystyle{\mathbb{E}}\big{(}|\|{\boldsymbol{\theta}}_{t}\|_{2}^{2}-{\mathbb{E}}[\|{\boldsymbol{\theta}}_{t}\|_{2}^{2}]|^{k}\big{)}\leq C_{k}n^{k}\,. (78)

The same inequalities obviously hold for 𝜽¯t\bar{\boldsymbol{\theta}}_{t}.

Using these bounds in Eqs. (74), (75), we get

An(t)\displaystyle A_{n}(t) =tn𝔼{𝜽¯t𝜽t2}+O(n1/2),\displaystyle=\frac{t}{n}{\mathbb{E}}\big{\{}\|\bar{\boldsymbol{\theta}}_{t}-{\boldsymbol{\theta}}_{t}\|^{2}\big{\}}+O(n^{-1/2})\,, (79)
Bn(t)\displaystyle B_{n}(t) =tn𝔼{𝜽t22}+tn2𝔼{𝜽^t(𝒀t)22}+O(n1/2).\displaystyle=-\frac{t}{n}{\mathbb{E}}\{\|{\boldsymbol{\theta}}_{t}\|^{2}_{2}\}+\frac{t}{n^{2}}\operatorname{\mathbb{E}}\big{\{}\big{\|}\hat{\boldsymbol{\theta}}_{t}({\boldsymbol{Y}}_{t})\big{\|}^{2}_{2}\big{\}}+O(n^{-1/2})\,. (80)

Summing these, we finally obtain

ϕn(t)\displaystyle\phi^{\prime}_{n}(t) =tn𝔼{𝜽t𝜽^t(𝒀t)22}+O(n1/2)\displaystyle=\frac{t}{n}\operatorname{\mathbb{E}}\big{\{}\big{\|}{\boldsymbol{\theta}}_{t}-\hat{\boldsymbol{\theta}}_{t}({\boldsymbol{Y}}_{t})\big{\|}_{2}^{2}\big{\}}+O(n^{-1/2}) (81)
=:t𝖬𝖲𝖤(t;n)+O(n1/2),\displaystyle=:t\,{\sf MSE}(t;n)+O(n^{-1/2})\,,

where we note that the O(n1/2)O(n^{-1/2}) term is uniform in tt.

Now recalling the definition of Eq. (2), we have, by [LM19, Theorem 1],

ϕ(t):=limnϕn(t)=infq0Ψ(q;0,πt).\phi_{\infty}(t):=\lim_{n\to\infty}\phi_{n}(t)=\inf_{q\geq 0}\Psi(q;0,\pi_{t})\,. (82)

Note that

Ψ(q;0,πt)=Ψ(q;t2,π0)+12t2Var(Θ)+14t4.\displaystyle\Psi(q;0,\pi_{t})=\Psi(q;t^{2},\pi_{0})+\frac{1}{2}t^{2}\mathop{\mathrm{Var}}(\Theta)+\frac{1}{4}t^{4}\,. (83)

Differentiating with respect to tt, we deduce that

  1. 1.

    ϕ(t)\phi_{\infty}(t) is differentiable at tt if and only if bΨ(q;b,πt)b\mapsto\Psi(q;b,\pi_{t}) is differentiable at b=0b=0. Both are in turn equivalent to qΨ(q;0,πt)q\mapsto\Psi(q;0,\pi_{t}) being uniquely minimized at qBayes(πΘ)q_{\mbox{\tiny\rm Bayes}}(\pi_{\Theta}).

  2. 2.

    If the last point holds, then (with Θt=Θt0+tπt\Theta_{t}=\Theta-t_{0}+t\sim\pi_{t})

    ϕ(t)=t(𝔼[Θt2]qBayes(πΘ)).\displaystyle\phi_{\infty}^{\prime}(t)=t\cdot\big{(}\operatorname{\mathbb{E}}[\Theta^{2}_{t}]-q_{\mbox{\tiny\rm Bayes}}(\pi_{\Theta})\big{)}\,. (84)

Comparing with Eq. (81), we are left with the task of proving that limnϕn(t0)=ϕ(t0)\lim_{n\to\infty}\phi_{n}^{\prime}(t_{0})=\phi_{\infty}^{\prime}(t_{0}) when ϕ\phi_{\infty} is differentiable at t0t_{0}.

Taking the second derivative of Eq. (68) yields

ϕn′′(t)\displaystyle\phi^{\prime\prime}_{n}(t) =1n𝔼𝜽0,𝒁Varμ𝒀t(tn(𝜽¯0,𝜽0;t))+1n𝔼𝜽0,𝒁μ𝒀t(t2n(𝜽¯0,𝜽0;t))\displaystyle=-\frac{1}{n}\operatorname{\mathbb{E}}_{{\boldsymbol{\theta}}_{0},{\boldsymbol{Z}}}\mathop{\mathrm{Var}}_{\mu_{{\boldsymbol{Y}}_{t}}}\big{(}\partial_{t}{\cal H}_{n}(\bar{\boldsymbol{\theta}}_{0},{\boldsymbol{\theta}}_{0};t)\big{)}+\frac{1}{n}\operatorname{\mathbb{E}}_{{\boldsymbol{\theta}}_{0},{\boldsymbol{Z}}}\mu_{{\boldsymbol{Y}}_{t}}\big{(}\partial^{2}_{t}{\cal H}_{n}(\bar{\boldsymbol{\theta}}_{0},{\boldsymbol{\theta}}_{0};t)\big{)}
=:Dn(t)+En(t),\displaystyle=:-D_{n}(t)+E_{n}(t)\,,

where Dn(t)0D_{n}(t)\geq 0 by construction. A direct calculation yields

2nt2(𝜽¯0,𝜽0;t)=12𝜽¯t𝜽t2+12n𝜽¯t𝜽t,𝟏212n1/2𝟏,𝒁𝟏.\displaystyle\frac{\partial^{2}{\cal H}_{n}}{\partial t^{2}}(\bar{\boldsymbol{\theta}}_{0},{\boldsymbol{\theta}}_{0};t)=\frac{1}{2}\|\bar{\boldsymbol{\theta}}_{t}-{\boldsymbol{\theta}}_{t}\|^{2}+\frac{1}{2n}\langle\bar{\boldsymbol{\theta}}_{t}-{\boldsymbol{\theta}}_{t},{\boldsymbol{1}}\rangle^{2}-\frac{1}{2n^{1/2}}\langle{\boldsymbol{1}},{\boldsymbol{Z}}{\boldsymbol{1}}\rangle\,. (85)

Hence proceeding as for the first derivative, we get

En(t)\displaystyle E_{n}(t) =12n𝔼𝜽0,𝒁μ𝒀t(𝜽¯t𝜽t2)+12n𝔼𝜽0,𝒁μ𝒀t(𝜽¯t𝜽t,𝟏2)\displaystyle=\frac{1}{2n}\operatorname{\mathbb{E}}_{{\boldsymbol{\theta}}_{0},{\boldsymbol{Z}}}\mu_{{\boldsymbol{Y}}_{t}}\big{(}\|\bar{\boldsymbol{\theta}}_{t}-{\boldsymbol{\theta}}_{t}\|^{2}\big{)}+\frac{1}{2n}\operatorname{\mathbb{E}}_{{\boldsymbol{\theta}}_{0},{\boldsymbol{Z}}}\mu_{{\boldsymbol{Y}}_{t}}\big{(}\langle\bar{\boldsymbol{\theta}}_{t}-{\boldsymbol{\theta}}_{t},{\boldsymbol{1}}\rangle^{2}\big{)} (86)
=𝖬𝖲𝖤(t;n)+O(n1/2),\displaystyle={\sf MSE}(t;n)+O(n^{-1/2})\,, (87)

where, once more, the O(n1/2)O(n^{-1/2}) term is uniform in tt.

Consider now the function fn(t)=t1/2ϕn(t)f_{n}(t)=-t^{-1/2}\,\phi_{n}(t), defined on an interval [t1,t2][t_{1},t_{2}], with 0<t1<t0<t20<t_{1}<t_{0}<t_{2} (with n{}n\in{\mathbb{N}}\cup\{\infty\}). We conclude that

  1. 1.

    By Eq. (82), limnfn(t)=f(t)\lim_{n\to\infty}f_{n}(t)=f_{\infty}(t) for any t[t1,t2]t\in[t_{1},t_{2}]. (In fact the convergence is uniform since ϕn(t)\phi^{\prime}_{n}(t) is uniformly bounded by Eq. (81), but we will not use this fact.)

  2. 2.

    We have fn(t)=ϕn(t)/(2t3/2)ϕn(t)/t1/2f^{\prime}_{n}(t)=\phi_{n}(t)/(2t^{3/2})-\phi^{\prime}_{n}(t)/t^{1/2}. Hence, by the previous point, limnfn(t)=f(t)\lim_{n\to\infty}f^{\prime}_{n}(t)=f^{\prime}_{\infty}(t) at a point tt if and only if limnϕn(t)=ϕ(t)\lim_{n\to\infty}\phi^{\prime}_{n}(t)=\phi^{\prime}_{\infty}(t).

  3. 3.

    We have fn′′(t)=3ϕn(t)/(4t5/2)+ϕn(t)/t3/2ϕn′′(t)/t1/2f^{\prime\prime}_{n}(t)=-3\phi_{n}(t)/(4t^{5/2})+\phi^{\prime}_{n}(t)/t^{3/2}-\phi^{\prime\prime}_{n}(t)/t^{1/2}. Substituting Eqs. (81) and (87), together with the decomposition ϕn′′=Dn+En\phi^{\prime\prime}_{n}=-D_{n}+E_{n} above, we obtain

    fn′′(t)=34t5/2ϕn(t)+1t1/2Dn(t)+O(n1/2).\displaystyle f^{\prime\prime}_{n}(t)=-\frac{3}{4t^{5/2}}\phi_{n}(t)+\frac{1}{t^{1/2}}D_{n}(t)+O(n^{-1/2})\,. (88)

    In particular, the first term is uniformly bounded (by boundedness of the normalized mutual information on [t1,t2][t_{1},t_{2}]), the second is non-negative, and the O(n1/2)O(n^{-1/2}) remainder is uniform in tt.

Points 1 and 3 imply (by the lemma below) that fn(t)f(t)f^{\prime}_{n}(t)\to f_{\infty}^{\prime}(t) at any point of differentiability tt, and hence by point 2, we conclude ϕn(t)ϕ(t)\phi^{\prime}_{n}(t)\to\phi_{\infty}^{\prime}(t). The proof is then completed by the following elementary analysis fact.

Lemma A.1.

Let {fn:n1}\{f_{n}:n\geq 1\}, fn:[t1,t2]f_{n}:[t_{1},t_{2}]\to{\mathbb{R}} be a sequence of differentiable functions converging pointwise to ff_{\infty}. Assume fn=gn+hnf_{n}=g_{n}+h_{n} where gn,hng_{n},h_{n} are differentiable, gng_{n} is convex and the {hn}\{h_{n}^{\prime}\} are equicontinuous on [t1,t2][t_{1},t_{2}] (i.e. there exists a non-decreasing function δ(ε)0\delta(\varepsilon)\downarrow 0 such that |tt|ε|t-t^{\prime}|\leq\varepsilon implies |hn(t)hn(t)|δ(ε)|h_{n}^{\prime}(t)-h^{\prime}_{n}(t^{\prime})|\leq\delta(\varepsilon)). If ff_{\infty} is differentiable at t0t_{0}, then limnfn(t0)=f(t0)\lim_{n\to\infty}f^{\prime}_{n}(t_{0})=f^{\prime}_{\infty}(t_{0}).

Proof.

By convexity we have, for ε>0\varepsilon>0

1ε[gn(t0+ε)gn(t0)]gn(t0).\displaystyle\frac{1}{\varepsilon}\big{[}g_{n}(t_{0}+\varepsilon)-g_{n}(t_{0})\big{]}\geq g^{\prime}_{n}(t_{0})\,. (89)

Further, by the mean value theorem and equicontinuity of hnh^{\prime}_{n},

1ε[hn(t0+ε)hn(t0)]hn(tn(ε))hn(t0)δ(ε).\displaystyle\frac{1}{\varepsilon}\big{[}h_{n}(t_{0}+\varepsilon)-h_{n}(t_{0})\big{]}\geq h^{\prime}_{n}(t_{n}(\varepsilon))\geq h^{\prime}_{n}(t_{0})-\delta(\varepsilon)\,. (90)

Summing the last two displays, we get

fn(t0)1ε[fn(t0+ε)fn(t0)]+δ(ε).\displaystyle f_{n}^{\prime}(t_{0})\leq\frac{1}{\varepsilon}\big{[}f_{n}(t_{0}+\varepsilon)-f_{n}(t_{0})\big{]}+\delta(\varepsilon)\,. (91)

Taking the limit nn\to\infty and using the convergence of fnf_{n}, we get

lim supnfn(t0)1ε[f(t0+ε)f(t0)]+δ(ε).\displaystyle\limsup_{n\to\infty}f_{n}^{\prime}(t_{0})\leq\frac{1}{\varepsilon}\big{[}f_{\infty}(t_{0}+\varepsilon)-f_{\infty}(t_{0})\big{]}+\delta(\varepsilon)\,. (92)

Finally taking the limit ε0\varepsilon\downarrow 0, and using the differentiability of ff_{\infty} at t0t_{0}:

lim supnfn(t0)f(t0).\displaystyle\limsup_{n\to\infty}f_{n}^{\prime}(t_{0})\leq f^{\prime}_{\infty}(t_{0})\,. (93)

The matching lower bound \liminf_{n\to\infty}f^{\prime}_{n}(t_{0})\geq f^{\prime}_{\infty}(t_{0}) follows by applying the same argument to the left increments \big{[}f_{n}(t_{0})-f_{n}(t_{0}-\varepsilon)\big{]}/\varepsilon, and this completes the proof. ∎

Appendix B Lemma for the upper bound in Theorem 2.3

Lemma B.1.

Let fBayes(x;πΘ,q):=𝔼[Θ|qΘ+qG=x]f^{\mbox{\tiny\rm Bayes}}(x;\pi_{\Theta},q):=\operatorname{\mathbb{E}}[\Theta|q\Theta+\sqrt{q}G=x] for (Θ,G)πΘ𝒩(0,1)(\Theta,G)\sim\pi_{\Theta}\otimes{\mathcal{N}}(0,1). Then, for any q>0q>0 there exists a constant C=C(q,πΘ)C=C(q,\pi_{\Theta}) such that, for all xx

|fBayes(x;πΘ,q)|C(1+|x|).\displaystyle|f^{\mbox{\tiny\rm Bayes}}(x;\pi_{\Theta},q)|\leq C(1+|x|)\,. (94)
Proof.

If πΘa\pi^{a}_{\Theta} is the law of Θ+a\Theta+a, then we have fBayes(x;πΘ,q)=fBayes(x+qa;πΘa,q)−af^{\mbox{\tiny\rm Bayes}}(x;\pi_{\Theta},q)=f^{\mbox{\tiny\rm Bayes}}(x+qa;\pi^{a}_{\Theta},q)-a. Hence, if the claim holds for πΘa\pi^{a}_{\Theta}, it holds for πΘ\pi_{\Theta} as well. We can therefore assume, without loss of generality, that 𝔼[Θ]=0\operatorname*{\mathbb{E}}[\Theta]=0.

Note that fBayes(x;πΘ,q)=fBayes(x;πΘ,q)f^{\mbox{\tiny\rm Bayes}}(-x;\pi_{\Theta},q)=-f^{\mbox{\tiny\rm Bayes}}(x;\pi_{-\Theta},q) where πΘ\pi_{-\Theta} is the law of Θ-\Theta. Therefore, without loss of generality we can assume x>0x>0. A simple calculation (‘Tweedie’s formula’) yields

fBayes(x;πΘ,q)\displaystyle f^{\mbox{\tiny\rm Bayes}}(x;\pi_{\Theta},q) =ddxϕ(x;πΘ,q),\displaystyle=\frac{{\rm d}\phantom{x}}{{\rm d}x}\phi(x;\pi_{\Theta},q)\,, (95)
ϕ(x;πΘ,q)\displaystyle\phi(x;\pi_{\Theta},q) :=log{eθxqθ2/2πΘ(dθ)}.\displaystyle:=\log\Big{\{}\int e^{\theta x-q\theta^{2}/2}\pi_{\Theta}({\rm d}\theta)\Big{\}}\,. (96)
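As a sanity check on Eqs. (95)–(96), the following sketch compares the posterior-mean definition of f^{\mbox{\tiny\rm Bayes}} with a finite-difference derivative of \phi for a discrete prior; the two-point prior and the specific numbers are illustrative assumptions.

import numpy as np

def f_bayes_discrete(x, atoms, weights, q):
    """Posterior mean E[Theta | q*Theta + sqrt(q)*G = x] for a discrete prior;
    the posterior weights are proportional to w_k * exp(theta_k*x - q*theta_k^2/2)."""
    logw = np.log(weights) + atoms * x - 0.5 * q * atoms ** 2
    w = np.exp(logw - logw.max())
    return float((w * atoms).sum() / w.sum())

atoms, weights, q, x, eps = np.array([-1.0, 2.0]), np.array([0.7, 0.3]), 0.5, 0.9, 1e-5
phi = lambda u: np.log((weights * np.exp(atoms * u - 0.5 * q * atoms ** 2)).sum())
# Tweedie's formula: the two printed numbers agree up to O(eps^2) discretization error
print(f_bayes_discrete(x, atoms, weights, q), (phi(x + eps) - phi(x - eps)) / (2 * eps))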

Notice that xϕ(x;πΘ,q)x\mapsto\phi(x;\pi_{\Theta},q) is convex, and therefore

Δq(2x,x)fBayes(x;πΘ,q)Δq(x,2x),\displaystyle\Delta_{q}(-2x,-x)\leq f^{\mbox{\tiny\rm Bayes}}(x;\pi_{\Theta},q)\leq\Delta_{q}(x,2x)\,, (97)
Δq(x1,x2):=1x2x1{ϕ(x2;πΘ,q)ϕ(x1;πΘ,q)}.\displaystyle\Delta_{q}(x_{1},x_{2}):=\frac{1}{x_{2}-x_{1}}\big{\{}\phi(x_{2};\pi_{\Theta},q)-\phi(x_{1};\pi_{\Theta},q)\big{\}}\,. (98)

We therefore proceed to lower and upper bound ϕ\phi. First notice that

ϕ(x;πΘ,q)\displaystyle\phi(x;\pi_{\Theta},q) =x22q+log{e(xqθ)2/2qπΘ(dθ)}\displaystyle=\frac{x^{2}}{2q}+\log\Big{\{}\int e^{-(x-q\theta)^{2}/2q}\pi_{\Theta}({\rm d}\theta)\Big{\}}
x22q,\displaystyle\leq\frac{x^{2}}{2q}\,,

where we used the fact that exp((xqθ)2/2q)1\exp(-(x-q\theta)^{2}/2q)\leq 1. Further, by Jensen’s inequality

ϕ(x;πΘ,q)x𝔼[Θ]12𝔼(Θ2)q=12𝔼(Θ2)q.\displaystyle\phi(x;\pi_{\Theta},q)\geq x\operatorname{\mathbb{E}}[\Theta]-\frac{1}{2}\operatorname{\mathbb{E}}(\Theta^{2})q=-\frac{1}{2}\operatorname{\mathbb{E}}(\Theta^{2})q\,.

Therefore we obtain, for x>1x>1,

fBayes(x;πΘ,q)Δq(x,2x)\displaystyle f^{\mbox{\tiny\rm Bayes}}(x;\pi_{\Theta},q)\leq\Delta_{q}(x,2x) 2xq+12x𝔼(Θ2)q\displaystyle\leq\frac{2x}{q}+\frac{1}{2x}\operatorname*{\mathbb{E}}(\Theta^{2})q
2xq+12𝔼(Θ2)q.\displaystyle\leq\frac{2x}{q}+\frac{1}{2}\operatorname*{\mathbb{E}}(\Theta^{2})q\,.

Since xfBayes(x;πΘ,q)x\mapsto f^{\mbox{\tiny\rm Bayes}}(x;\pi_{\Theta},q) is continuous and therefore bounded on [0,1][0,1], this proves fBayes(x;πΘ,q)C(1+|x|)f^{\mbox{\tiny\rm Bayes}}(x;\pi_{\Theta},q)\leq C(1+|x|).

In order to prove fBayes(x;πΘ,q)C(1+|x|)f^{\mbox{\tiny\rm Bayes}}(x;\pi_{\Theta},q)\geq-C(1+|x|), we use the lower bound in Eq. (97) and proceed analogously. ∎

Appendix C Proof of Lemmas for Proposition 4.3

C.1 Notation

We use the convention ={0,1,2,}\mathbb{N}=\{0,1,2,\ldots\}. We often work with 𝜶E¯n{\boldsymbol{\alpha}}\in\mathbb{N}^{{\overline{E}_{n}}} where E¯n={(i,j):  1ijn}{\overline{E}_{n}}=\{(i,j)\,:\,\,1\leq i\leq j\leq n\} and |E¯n|=N:=n(n+1)/2|{\overline{E}_{n}}|=N:=n(n+1)/2. For 𝜶,𝜷E¯n{\boldsymbol{\alpha}},{\boldsymbol{\beta}}\in\mathbb{N}^{\overline{E}_{n}}, we use the notation

|𝜶|:=(i,j)E¯nαij,𝜶!:=(i,j)E¯nαij!,|{\boldsymbol{\alpha}}|:=\sum_{(i,j)\in{\overline{E}_{n}}}\alpha_{ij},\qquad{\boldsymbol{\alpha}}!:=\prod_{(i,j)\in{\overline{E}_{n}}}\alpha_{ij}!,
(𝜶𝜷):=(i,j)E¯n(αijβij),𝒀𝜶:=(i,j)E¯nYijαij,\binom{{\boldsymbol{\alpha}}}{{\boldsymbol{\beta}}}:=\prod_{(i,j)\in{\overline{E}_{n}}}\binom{\alpha_{ij}}{\beta_{ij}},\qquad{\boldsymbol{Y}}^{\boldsymbol{\alpha}}:=\prod_{(i,j)\in{\overline{E}_{n}}}Y_{ij}^{\alpha_{ij}},
(𝜶𝜷)ij=|αijβij|,(𝜶𝜷)ij=min{αij,βij},(𝜶𝜷)ij:=max{0,αijβij}.({\boldsymbol{\alpha}}\bigtriangleup{\boldsymbol{\beta}})_{ij}=|\alpha_{ij}-\beta_{ij}|,\qquad({\boldsymbol{\alpha}}\wedge{\boldsymbol{\beta}})_{ij}=\min\{\alpha_{ij},\beta_{ij}\},\qquad({\boldsymbol{\alpha}}\setminus{\boldsymbol{\beta}})_{ij}:=\max\{0,\alpha_{ij}-\beta_{ij}\}\,.

We use 𝜶𝜷{\boldsymbol{\alpha}}\leq{\boldsymbol{\beta}} to mean αijβij\alpha_{ij}\leq\beta_{ij} for all i,ji,j, and 𝜶𝜷{\boldsymbol{\alpha}}\lneq{\boldsymbol{\beta}} to mean αijβij\alpha_{ij}\leq\beta_{ij} for all i,ji,j with strict inequality for some i,ji,j. Note that 𝜶E¯n{\boldsymbol{\alpha}}\in\mathbb{N}^{\overline{E}_{n}} can be identified with a multigraph whose vertices are elements of [n][n], where αij\alpha_{ij} is the number of copies of edge (i,j)(i,j). We use V(𝜶)[n]V({\boldsymbol{\alpha}})\subseteq[n] to denote the set of vertices spanned by the edges of 𝜶{\boldsymbol{\alpha}}, with vertex 1 (the root) always included by convention. We use C(𝜶)C({\boldsymbol{\alpha}}) to denote the set of non-empty (i.e., containing at least one edge) connected components of 𝜶{\boldsymbol{\alpha}}.

Asymptotic notation such as O()O(\;\cdot\;) and Ω()\Omega(\;\cdot\;) pertains to the limit nn\to\infty and may hide factors depending on πΘ,ψ,D\pi_{\Theta},\psi,D. We use the symbol 𝖼𝗈𝗇𝗌𝗍\mathsf{const} to denote a constant (which may be positive, zero, or negative) depending on πΘ,ψ,D\pi_{\Theta},\psi,D (but not nn) and e.g., 𝖼𝗈𝗇𝗌𝗍(A,B)\mathsf{const}(A,B) to denote a constant that may additionally depend on A,BA,B. These constants may change from line to line.

C.2 Hermite polynomials

We will need the following well known property of Hermite polynomials (see e.g. page 254 of [MOS13]).

Proposition C.1.

For any μ,z\mu,z\in\mathbb{R} and kk\in\mathbb{N},

hk(μ+z)==0k!k!(k)μkh(z).h_{k}(\mu+z)=\sum_{\ell=0}^{k}\sqrt{\frac{\ell!}{k!}}\binom{k}{\ell}\mu^{k-\ell}h_{\ell}(z)\,.
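A quick numerical check of Proposition C.1 (a sketch; here h_k denotes the orthonormal Hermite polynomial \mathrm{He}_k/\sqrt{k!}, consistent with h_0(z)=1, h_1(z)=z and the orthonormality used below):

import numpy as np
from math import comb, factorial, sqrt
from numpy.polynomial.hermite_e import hermeval   # probabilists' Hermite He_k

def h(k, z):
    """Orthonormal Hermite polynomial h_k(z) = He_k(z) / sqrt(k!)."""
    return hermeval(z, [0.0] * k + [1.0]) / sqrt(factorial(k))

mu, z, k = 0.7, -1.3, 5
lhs = h(k, mu + z)
rhs = sum(sqrt(factorial(l) / factorial(k)) * comb(k, l) * mu ** (k - l) * h(l, z)
          for l in range(k + 1))
print(lhs, rhs)   # the two values agree up to floating point error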

As a result, for any 𝜸E¯n{\boldsymbol{\gamma}}\in\mathbb{N}^{\overline{E}_{n}}, and writing 𝑿=𝜽𝜽𝖳/n{\boldsymbol{X}}={\boldsymbol{\theta}}{\boldsymbol{\theta}}^{{\sf T}}/\sqrt{n} for the signal matrix, we have:

h𝜸(𝒀)\displaystyle h_{{\boldsymbol{\gamma}}}({\boldsymbol{Y}}) =(i,j)E¯nhγij(Xij+Zij)\displaystyle=\prod_{(i,j)\in{\overline{E}_{n}}}h_{\gamma_{ij}}(X_{ij}+Z_{ij})
=(i,j)E¯nβij=0γijβij!γij!(γijβij)Xijγijβijhβij(Zij)\displaystyle=\prod_{(i,j)\in{\overline{E}_{n}}}\sum_{\beta_{ij}=0}^{\gamma_{ij}}\sqrt{\frac{\beta_{ij}!}{\gamma_{ij}!}}\binom{\gamma_{ij}}{\beta_{ij}}X_{ij}^{\gamma_{ij}-\beta_{ij}}h_{\beta_{ij}}(Z_{ij})
=𝟎𝜷𝜸(i,j)E¯nβij!γij!(γijβij)Xijγijβijhβij(Zij)\displaystyle=\sum_{{\boldsymbol{0}}\leq{\boldsymbol{\beta}}\leq{\boldsymbol{\gamma}}}\prod_{(i,j)\in{\overline{E}_{n}}}\sqrt{\frac{\beta_{ij}!}{\gamma_{ij}!}}\binom{\gamma_{ij}}{\beta_{ij}}X_{ij}^{\gamma_{ij}-\beta_{ij}}h_{\beta_{ij}}(Z_{ij})
=𝟎𝜷𝜸𝜷!𝜸!(𝜸𝜷)𝑿𝜸𝜷h𝜷(𝒁).\displaystyle=\sum_{{\boldsymbol{0}}\leq{\boldsymbol{\beta}}\leq{\boldsymbol{\gamma}}}\sqrt{\frac{{\boldsymbol{\beta}}!}{{\boldsymbol{\gamma}}!}}\binom{{\boldsymbol{\gamma}}}{{\boldsymbol{\beta}}}{\boldsymbol{X}}^{{\boldsymbol{\gamma}}-{\boldsymbol{\beta}}}h_{\boldsymbol{\beta}}({\boldsymbol{Z}})\,.

In particular, since 𝔼𝒁[h𝜷(𝒁)]=𝟙𝜷=𝟎\operatorname*{\mathbb{E}}_{{\boldsymbol{Z}}}[h_{\boldsymbol{\beta}}({\boldsymbol{Z}})]=\mathbbm{1}_{{\boldsymbol{\beta}}={\boldsymbol{0}}} (due to orthonormality of Hermite polynomials and the fact h0(z)=1h_{0}(z)=1),

𝔼h𝜸(𝒀)=1𝜸!𝔼𝑿𝜸.\operatorname*{\mathbb{E}}h_{\boldsymbol{\gamma}}({\boldsymbol{Y}})=\frac{1}{\sqrt{{\boldsymbol{\gamma}}!}}\operatorname*{\mathbb{E}}{\boldsymbol{X}}^{\boldsymbol{\gamma}}\,.

C.3 Proof of Lemma 4.4: Symmetry

As mentioned in the main text, this is a special case of the classical Hunt–Stein theorem [EG21], the only difference being that we are restricting the class of estimators to degree-DD polynomials. We present a proof for completeness.

Let S1S_{-1} denote the group of permutations on [n][n] that fix {1}\{1\}. This group acts on the space of n×nn\times n symmetric matrices by permuting both the rows and columns (by the same permutation). We also have an induced action of S1S_{-1} on [𝒀]D\mathbb{R}[{\boldsymbol{Y}}]_{\leq D} given by (σq)(𝒀)=q(σ1𝒀)(\sigma\cdot q)({\boldsymbol{Y}})=q(\sigma^{-1}\cdot{\boldsymbol{Y}}). A polynomial qq can be symmetrized over S1S_{-1} to produce q𝗌𝗒𝗆q^{\mathsf{sym}} as follows:

q𝗌𝗒𝗆:=1|S1|σS1σq.q^{\mathsf{sym}}:=\frac{1}{|S_{-1}|}\sum_{\sigma\in S_{-1}}\sigma\cdot q\,.
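For intuition, a minimal sketch of the symmetrization over S_{-1}, averaging over a few random permutations that fix the distinguished coordinate rather than over the whole group (an illustrative shortcut; index 0 below plays the role of the index 1 in the text):

import numpy as np

def symmetrize(q, Y, n_samples=1000, rng=None):
    """Approximate q^sym(Y) by averaging q over random permutations fixing index 0,
    each acting simultaneously on the rows and columns of Y."""
    rng = np.random.default_rng() if rng is None else rng
    n = Y.shape[0]
    vals = []
    for _ in range(n_samples):
        perm = np.concatenate(([0], 1 + rng.permutation(n - 1)))   # sigma fixes the root index
        vals.append(q(Y[np.ix_(perm, perm)]))
    return float(np.mean(vals))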

We claim that the symmetric subspace [𝒀]D𝗌𝗒𝗆=span{HA:𝒢D}\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}^{\mathsf{sym}}=\mathrm{span}_{\mathbb{R}}\{\mathscrsfs{H}_{A}\,:\,\mathcal{G}_{\leq D}\} is precisely the image of the above symmetrization operation, that is, [𝒀]D𝗌𝗒𝗆={q𝗌𝗒𝗆:q[𝒀]D}\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}^{\mathsf{sym}}=\{q^{\mathsf{sym}}\,:\,q\in\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}\}. This can be seen as follows. For the inclusion \subseteq, note that any q[𝒀]D𝗌𝗒𝗆q\in\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}^{\mathsf{sym}} satisfies q=q𝗌𝗒𝗆q=q^{\mathsf{sym}}. For the reverse inclusion \supseteq, start with an arbitrary q[𝒀]Dq\in\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}. Any such qq admits an expansion as a linear combination of Hermite polynomials (h𝜶)|𝜶|D(h_{\boldsymbol{\alpha}})_{|{\boldsymbol{\alpha}}|\leq D}, and therefore admits an expansion in (H𝜶)|𝜶|D(\mathscrsfs{H}_{\boldsymbol{\alpha}})_{|{\boldsymbol{\alpha}}|\leq D} (cf. the definition (38) which can be inverted recursively). Finally, note that H𝜶𝗌𝗒𝗆\mathscrsfs{H}_{\boldsymbol{\alpha}}^{\mathsf{sym}} is a scalar multiple of HG\mathscrsfs{H}_{G} for a certain G𝒢DG\in\mathcal{G}_{\leq D}. This allows q𝗌𝗒𝗆q^{\mathsf{sym}} to be written as a linear combination of (HG)G𝒢D(\mathscrsfs{H}_{G})_{G\in\mathcal{G}_{\leq D}}.

To complete the proof of Lemma 4.4, it is sufficient to show that for any low degree estimator q[𝒀]Dq\in\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}, mean squared error does not increase under symmetrization:

𝔼[(q𝗌𝗒𝗆(𝒀)ψ(θ1))2]𝔼[(q(𝒀)ψ(θ1))2].\operatorname*{\mathbb{E}}[(q^{\mathsf{sym}}({\boldsymbol{Y}})-\psi(\theta_{1}))^{2}]\leq\operatorname*{\mathbb{E}}[(q({\boldsymbol{Y}})-\psi(\theta_{1}))^{2}]\,. (99)

First note that 𝔼[q𝗌𝗒𝗆(𝒀)ψ(θ1)]=𝔼[q(𝒀)ψ(θ1)]\operatorname*{\mathbb{E}}[q^{\mathsf{sym}}({\boldsymbol{Y}})\cdot\psi(\theta_{1})]=\operatorname*{\mathbb{E}}[q({\boldsymbol{Y}})\cdot\psi(\theta_{1})] because (𝒀,θ1)({\boldsymbol{Y}},\theta_{1}) and (σ𝒀,θ1)(\sigma\cdot{\boldsymbol{Y}},\theta_{1}) have the same distribution for any fixed σS1\sigma\in S_{-1}. Next, using Jensen’s inequality and the symmetry 𝔼[q2]=𝔼[(σq)2]\operatorname*{\mathbb{E}}[q^{2}]=\operatorname*{\mathbb{E}}[(\sigma\cdot q)^{2}] for all σS1\sigma\in S_{-1}, we have

𝔼[q𝗌𝗒𝗆(𝒀)2]\displaystyle\operatorname*{\mathbb{E}}[q^{\mathsf{sym}}({\boldsymbol{Y}})^{2}] =𝔼{(1|S1|σS1(σq)(𝒀))2}\displaystyle=\operatorname*{\mathbb{E}}\Big{\{}\Big{(}\frac{1}{|S_{-1}|}\sum_{\sigma\in S_{-1}}(\sigma\cdot q)({\boldsymbol{Y}})\Big{)}^{2}\Big{\}}
1|S1|σS1𝔼[(σq)(𝒀)2]=𝔼[q(𝒀)2].\displaystyle\leq\frac{1}{|S_{-1}|}\sum_{\sigma\in S_{-1}}\operatorname*{\mathbb{E}}[(\sigma\cdot q)({\boldsymbol{Y}})^{2}]=\operatorname*{\mathbb{E}}[q({\boldsymbol{Y}})^{2}]\,.

Now Eq. (99) follows by combining these claims.

C.4 Proof of Lemma 4.5: λmin(𝑴n)\lambda_{\mathrm{min}}({\boldsymbol{M}}_{n})

Consider an arbitrary polynomial f=G𝒢DbG(f)HGf=\sum_{G\in\mathcal{G}_{\leq D}}b_{G}(f)\mathscrsfs{H}_{G} with coefficients 𝒃(f)=(bG(f)){\boldsymbol{b}}(f)=(b_{G}(f)) normalized so that 𝒃(f)2:=G𝒢DbG(f)2=1\|{\boldsymbol{b}}(f)\|^{2}:=\sum_{G\in\mathcal{G}_{\leq D}}b_{G}(f)^{2}=1. This induces an expansion f=𝜶E¯n,|𝜶|Df^𝜶H𝜶f=\sum_{{\boldsymbol{\alpha}}\in\mathbb{N}^{\overline{E}_{n}},|{\boldsymbol{\alpha}}|\leq D}\hat{f}_{\boldsymbol{\alpha}}\mathscrsfs{H}_{\boldsymbol{\alpha}}. Since 𝔼[f(𝒀)2]=𝒃,𝑴n𝒃\operatorname*{\mathbb{E}}[f({\boldsymbol{Y}})^{2}]=\langle{\boldsymbol{b}},{\boldsymbol{M}}_{n}{\boldsymbol{b}}\rangle, it suffices to show 𝔼[f(𝒀)2]=Ω(1)\operatorname*{\mathbb{E}}[f({\boldsymbol{Y}})^{2}]=\Omega(1). Using Jensen’s inequality and orthonormality of Hermite polynomials (here, we use subscripts on expectation to denote which variable is being integrated over),

𝔼[f(𝒀)2]=𝔼Z𝔼X[f(𝑿+𝒁)2]𝔼Z[(𝔼Xf(𝑿+𝒁))2]=:g^2\displaystyle\operatorname*{\mathbb{E}}[f({\boldsymbol{Y}})^{2}]=\operatorname*{\mathbb{E}}_{Z}\operatorname*{\mathbb{E}}_{X}[f({\boldsymbol{X}}+{\boldsymbol{Z}})^{2}]\geq\operatorname*{\mathbb{E}}_{Z}[(\operatorname*{\mathbb{E}}_{X}f({\boldsymbol{X}}+{\boldsymbol{Z}}))^{2}]=:\|\hat{g}\|^{2}

where g(𝒁):=𝔼𝑿f(𝑿+𝒁)g({\boldsymbol{Z}}):=\operatorname*{\mathbb{E}}_{\boldsymbol{X}}f({\boldsymbol{X}}+{\boldsymbol{Z}}) with Hermite expansion g=𝜶E¯n,|𝜶|Dg^𝜶H𝜶g=\sum_{{\boldsymbol{\alpha}}\in\mathbb{N}^{\overline{E}_{n}},|{\boldsymbol{\alpha}}|\leq D}\hat{g}_{\boldsymbol{\alpha}}\mathscrsfs{H}_{\boldsymbol{\alpha}}. It now suffices to show g^2=Ω(1)\|\hat{g}\|^{2}=\Omega(1). We will compute g^\hat{g} explicitly in terms of f^\hat{f}. We have, using the Hermite facts from Section C.2,

g(𝒁)\displaystyle g({\boldsymbol{Z}}) =𝔼Xf(𝑿+𝒁)\displaystyle=\operatorname*{\mathbb{E}}_{X}f({\boldsymbol{X}}+{\boldsymbol{Z}})
=𝔼X𝜶E¯nf^𝜶H𝜶(𝑿+𝒁)\displaystyle=\operatorname*{\mathbb{E}}_{X}\sum_{{\boldsymbol{\alpha}}\in\mathbb{N}^{\overline{E}_{n}}}\hat{f}_{\boldsymbol{\alpha}}\mathscrsfs{H}_{\boldsymbol{\alpha}}({\boldsymbol{X}}+{\boldsymbol{Z}})
=𝜶E¯nf^𝜶𝔼XγC(𝜶)(h𝜸(𝑿+𝒁)𝔼Yh𝜸(𝒀))\displaystyle=\sum_{{\boldsymbol{\alpha}}\in\mathbb{N}^{\overline{E}_{n}}}\hat{f}_{\boldsymbol{\alpha}}\operatorname*{\mathbb{E}}_{X}\prod_{\gamma\in C({\boldsymbol{\alpha}})}\left(h_{\boldsymbol{\gamma}}({\boldsymbol{X}}+{\boldsymbol{Z}})-\operatorname*{\mathbb{E}}_{Y}h_{\boldsymbol{\gamma}}({\boldsymbol{Y}})\right)
=𝜶E¯nf^𝜶𝜸C(𝜶)𝔼X[h𝜸(𝑿+𝒁)𝔼Yh𝜸(𝒀)]using independence between components\displaystyle=\sum_{{\boldsymbol{\alpha}}\in\mathbb{N}^{\overline{E}_{n}}}\hat{f}_{\boldsymbol{\alpha}}\prod_{{\boldsymbol{\gamma}}\in C({\boldsymbol{\alpha}})}\operatorname*{\mathbb{E}}_{X}[h_{\boldsymbol{\gamma}}({\boldsymbol{X}}+{\boldsymbol{Z}})-\operatorname*{\mathbb{E}}_{Y}h_{\boldsymbol{\gamma}}({\boldsymbol{Y}})]\qquad\text{using independence between components}
=𝜶E¯nf^𝜶𝜸C(𝜶)𝔼X[0𝜷𝜸𝜷!𝜸!(𝜸𝜷)X𝜸𝜷h𝜷(𝒁)1𝜸!𝑿𝜸]\displaystyle=\sum_{{\boldsymbol{\alpha}}\in\mathbb{N}^{\overline{E}_{n}}}\hat{f}_{\boldsymbol{\alpha}}\prod_{{\boldsymbol{\gamma}}\in C({\boldsymbol{\alpha}})}\operatorname*{\mathbb{E}}_{X}\left[\sum_{0\leq{\boldsymbol{\beta}}\leq{\boldsymbol{\gamma}}}\sqrt{\frac{{\boldsymbol{\beta}}!}{{\boldsymbol{\gamma}}!}}\binom{{\boldsymbol{\gamma}}}{{\boldsymbol{\beta}}}X^{{\boldsymbol{\gamma}}-{\boldsymbol{\beta}}}h_{\boldsymbol{\beta}}({\boldsymbol{Z}})-\frac{1}{\sqrt{{\boldsymbol{\gamma}}!}}{\boldsymbol{X}}^{\boldsymbol{\gamma}}\right]
=𝜶E¯nf^𝜶𝜸C(𝜶)0𝜷𝜸𝜷!𝜸!(𝜸𝜷)𝔼[𝑿𝜸𝜷]h𝜷(𝒁)\displaystyle=\sum_{{\boldsymbol{\alpha}}\in\mathbb{N}^{\overline{E}_{n}}}\hat{f}_{\boldsymbol{\alpha}}\prod_{{\boldsymbol{\gamma}}\in C({\boldsymbol{\alpha}})}\sum_{0\lneq{\boldsymbol{\beta}}\leq{\boldsymbol{\gamma}}}\sqrt{\frac{{\boldsymbol{\beta}}!}{{\boldsymbol{\gamma}}!}}\binom{{\boldsymbol{\gamma}}}{{\boldsymbol{\beta}}}\operatorname*{\mathbb{E}}[{\boldsymbol{X}}^{{\boldsymbol{\gamma}}-{\boldsymbol{\beta}}}]h_{\boldsymbol{\beta}}({\boldsymbol{Z}})
=𝜶E¯nf^𝜶𝜷Γ(𝜶)𝜷!𝜶!(𝜶𝜷)𝔼[𝑿𝜶𝜷]h𝜷(𝒁)\displaystyle=\sum_{{\boldsymbol{\alpha}}\in\mathbb{N}^{\overline{E}_{n}}}\hat{f}_{\boldsymbol{\alpha}}\sum_{{\boldsymbol{\beta}}\in\Gamma({\boldsymbol{\alpha}})}\sqrt{\frac{{\boldsymbol{\beta}}!}{{\boldsymbol{\alpha}}!}}\binom{{\boldsymbol{\alpha}}}{{\boldsymbol{\beta}}}\operatorname*{\mathbb{E}}[{\boldsymbol{X}}^{{\boldsymbol{\alpha}}-{\boldsymbol{\beta}}}]h_{\boldsymbol{\beta}}({\boldsymbol{Z}})
where Γ(𝜶)\Gamma({\boldsymbol{\alpha}}) is the set of 𝜷E¯n{\boldsymbol{\beta}}\in\mathbb{N}^{\overline{E}_{n}} such that 0𝜷𝜶0\leq{\boldsymbol{\beta}}\leq{\boldsymbol{\alpha}}, and 𝜷{\boldsymbol{\beta}} includes at least one edge from every non-empty component of 𝜶{\boldsymbol{\alpha}}
=𝜷E¯nh𝜷(𝒁)𝜶:𝜷Γ(𝜶)f^𝜶𝜷!𝜶!(𝜶𝜷)𝔼[𝑿𝜶𝜷].\displaystyle=\sum_{{\boldsymbol{\beta}}\in\mathbb{N}^{\overline{E}_{n}}}h_{\boldsymbol{\beta}}({\boldsymbol{Z}})\sum_{{\boldsymbol{\alpha}}\,:\,{\boldsymbol{\beta}}\in\Gamma({\boldsymbol{\alpha}})}\hat{f}_{\boldsymbol{\alpha}}\sqrt{\frac{{\boldsymbol{\beta}}!}{{\boldsymbol{\alpha}}!}}\binom{{\boldsymbol{\alpha}}}{{\boldsymbol{\beta}}}\operatorname*{\mathbb{E}}[{\boldsymbol{X}}^{{\boldsymbol{\alpha}}-{\boldsymbol{\beta}}}]\,.

This means

g^𝜷=𝜶:𝜷Γ(𝜶)f^𝜶𝜷!𝜶!(𝜶𝜷)𝔼[𝑿𝜶𝜷].\hat{g}_{\boldsymbol{\beta}}=\sum_{{\boldsymbol{\alpha}}\,:\,{\boldsymbol{\beta}}\in\Gamma({\boldsymbol{\alpha}})}\hat{f}_{\boldsymbol{\alpha}}\sqrt{\frac{{\boldsymbol{\beta}}!}{{\boldsymbol{\alpha}}!}}\binom{{\boldsymbol{\alpha}}}{{\boldsymbol{\beta}}}\operatorname*{\mathbb{E}}[{\boldsymbol{X}}^{{\boldsymbol{\alpha}}-{\boldsymbol{\beta}}}]\,. (100)

Due to the symmetry of ff, we have g^𝜷=g^𝜷\hat{g}_{\boldsymbol{\beta}}=\hat{g}_{{\boldsymbol{\beta}}^{\prime}} whenever 𝜷,𝜷{\boldsymbol{\beta}},{\boldsymbol{\beta}}^{\prime} are images of embeddings of the same G𝒢DG\in\mathcal{G}_{\leq D}. Therefore, gg admits an expansion g=G𝒢DbG(g)HGg=\sum_{G\in\mathcal{G}_{\leq D}}b_{G}(g)\mathscrsfs{H}_{G} where HG\mathscrsfs{H}_{G} is defined by Eq. (39).

The coefficients bG(g)b_{G}(g) and g^𝜶\hat{g}_{{\boldsymbol{\alpha}}} are related as follows. Let |𝖠𝗎𝗍(G)||\mathsf{Aut}(G)| denote the number of root-preserving automorphisms of GG, i.e., the number of embeddings ϕ𝖾𝗆𝖻(G)\phi\in\mathsf{emb}(G) that induce a single element 𝜶E¯n{\boldsymbol{\alpha}}\in\mathbb{N}^{\overline{E}_{n}}. For each G𝒢DG\in\mathcal{G}_{\leq D} there are |𝖾𝗆𝖻(G)||𝖠𝗎𝗍(G)|\frac{|\mathsf{emb}(G)|}{|\mathsf{Aut}(G)|} distinct elements 𝜶E¯n{\boldsymbol{\alpha}}\in\mathbb{N}^{\overline{E}_{n}}, and each has the same coefficient g^𝜶\hat{g}_{\boldsymbol{\alpha}}. Without loss of generality we can set g^𝜶=:|𝖠𝗎𝗍(G)|/|𝖾𝗆𝖻(G)|g~G\hat{g}_{\boldsymbol{\alpha}}=:|\mathsf{Aut}(G)|/\sqrt{|\mathsf{emb}(G)|}\,\tilde{g}_{G} for some coefficients g~G\tilde{g}_{G}. Therefore

g\displaystyle g =𝜶E¯ng^𝜶H𝜶\displaystyle=\sum_{{\boldsymbol{\alpha}}\in\mathbb{N}^{{\overline{E}_{n}}}}\hat{g}_{\boldsymbol{\alpha}}\mathscrsfs{H}_{{\boldsymbol{\alpha}}}
=G𝒢D𝜶E¯ng^𝜶H𝜶1|𝖠𝗎𝗍(G)|ϕ𝖾𝗆𝖻(G)𝟙𝜶=𝗂𝗆(ϕ;G)\displaystyle=\sum_{G\in\mathcal{G}_{\leq D}}\sum_{{\boldsymbol{\alpha}}\in\mathbb{N}^{{\overline{E}_{n}}}}\hat{g}_{\boldsymbol{\alpha}}\mathscrsfs{H}_{{\boldsymbol{\alpha}}}\cdot\frac{1}{|\mathsf{Aut}(G)|}\sum_{\phi\in\mathsf{emb}(G)}\mathbbm{1}_{{\boldsymbol{\alpha}}=\mathsf{im}(\phi;G)}
=G𝒢Dϕ𝖾𝗆𝖻(G)𝜶E¯n𝟙𝜶=𝗂𝗆(ϕ;G)1|𝖾𝗆𝖻(G)|g~GH𝜶\displaystyle=\sum_{G\in\mathcal{G}_{\leq D}}\sum_{\phi\in\mathsf{emb}(G)}\sum_{{\boldsymbol{\alpha}}\in\mathbb{N}^{{\overline{E}_{n}}}}\mathbbm{1}_{{\boldsymbol{\alpha}}=\mathsf{im}(\phi;G)}\frac{1}{\sqrt{|\mathsf{emb}(G)|}}\tilde{g}_{G}\mathscrsfs{H}_{{\boldsymbol{\alpha}}}
=G𝒢Dg~GHG.\displaystyle=\sum_{G\in\mathcal{G}_{\leq D}}\tilde{g}_{G}\mathscrsfs{H}_{G}\,.

Therefore we can conclude that bG(g)=g~G=|𝖾𝗆𝖻(G)||𝖠𝗎𝗍(G)|g^𝜶b_{G}(g)=\tilde{g}_{G}=\frac{\sqrt{|\mathsf{emb}(G)|}}{|\mathsf{Aut}(G)|}\hat{g}_{\boldsymbol{\alpha}}. This means

g^2=𝜶E¯n,|𝜶|Dg^𝜶2\displaystyle\|\hat{g}\|^{2}=\sum_{{\boldsymbol{\alpha}}\in\mathbb{N}^{\overline{E}_{n}},|{\boldsymbol{\alpha}}|\leq D}\hat{g}_{\boldsymbol{\alpha}}^{2} =G𝒢D|𝖾𝗆𝖻(G)||𝖠𝗎𝗍(G)|(|𝖠𝗎𝗍(G)||𝖾𝗆𝖻(G)|bG(g))2\displaystyle=\sum_{G\in\mathcal{G}_{\leq D}}\frac{|\mathsf{emb}(G)|}{|\mathsf{Aut}(G)|}\left(\frac{|\mathsf{Aut}(G)|}{\sqrt{|\mathsf{emb}(G)|}}b_{G}(g)\right)^{2}
=G𝒢D|𝖠𝗎𝗍(G)|bG(g)2𝒃(g)2.\displaystyle=\sum_{G\in\mathcal{G}_{\leq D}}|\mathsf{Aut}(G)|\cdot b_{G}(g)^{2}\geq\|{\boldsymbol{b}}(g)\|^{2}\,.

It now suffices to show 𝒃(g)2=Ω(1)\|{\boldsymbol{b}}(g)\|^{2}=\Omega(1).

Our next step will be to relate the coefficients 𝒃(g){\boldsymbol{b}}(g) to the coefficients 𝒃(f){\boldsymbol{b}}(f). Using (100) along with the above relation between g~\tilde{g} and g^\hat{g}, we have for any B𝒢DB\in\mathcal{G}_{\leq D} and 𝜷{\boldsymbol{\beta}} the image of some ϕ𝖾𝗆𝖻(B)\phi\in\mathsf{emb}(B),

bB(g)=|𝖾𝗆𝖻(B)||𝖠𝗎𝗍(B)|g^𝜷=|𝖾𝗆𝖻(B)||𝖠𝗎𝗍(B)|𝜶:𝜷Γ(𝜶)f^𝜶𝜷!𝜶!(𝜶𝜷)𝔼[𝑿𝜶𝜷].b_{B}(g)=\frac{\sqrt{|\mathsf{emb}(B)|}}{|\mathsf{Aut}(B)|}\hat{g}_{\boldsymbol{\beta}}=\frac{\sqrt{|\mathsf{emb}(B)|}}{|\mathsf{Aut}(B)|}\sum_{{\boldsymbol{\alpha}}\,:\,{\boldsymbol{\beta}}\in\Gamma({\boldsymbol{\alpha}})}\hat{f}_{\boldsymbol{\alpha}}\sqrt{\frac{{\boldsymbol{\beta}}!}{{\boldsymbol{\alpha}}!}}\binom{{\boldsymbol{\alpha}}}{{\boldsymbol{\beta}}}\operatorname*{\mathbb{E}}[{\boldsymbol{X}}^{{\boldsymbol{\alpha}}-{\boldsymbol{\beta}}}]\,. (101)

Similarly to above, we have f^𝜶=|𝖠𝗎𝗍(A)||𝖾𝗆𝖻(A)|bA(f)\hat{f}_{\boldsymbol{\alpha}}=\frac{|\mathsf{Aut}(A)|}{\sqrt{|\mathsf{emb}(A)|}}b_{A}(f) whenever 𝜶{\boldsymbol{\alpha}} is the image of some ϕ𝖾𝗆𝖻(A)\phi\in\mathsf{emb}(A). In (101), the number of terms 𝜶E¯n{\boldsymbol{\alpha}}\in\mathbb{N}^{{\overline{E}_{n}}} in the sum that correspond to a single A𝒢DA\in\mathcal{G}_{\leq D} is O(n|V(A)||V(B)|)O(n^{|V(A)|-|V(B)|}), recalling that 𝜷Γ(𝜶){\boldsymbol{\beta}}\in\Gamma({\boldsymbol{\alpha}}) implies 𝜷𝜶{\boldsymbol{\beta}}\leq{\boldsymbol{\alpha}}. We also have the bounds 𝜷!𝜶!(𝜶𝜷)=O(1)\sqrt{\frac{{\boldsymbol{\beta}}!}{{\boldsymbol{\alpha}}!}}\binom{{\boldsymbol{\alpha}}}{{\boldsymbol{\beta}}}=O(1), 𝔼[𝑿𝜶𝜷]=O(n12(|E(A)||E(B)|))\operatorname*{\mathbb{E}}[{\boldsymbol{X}}^{{\boldsymbol{\alpha}}-{\boldsymbol{\beta}}}]=O(n^{-\frac{1}{2}(|E(A)|-|E(B)|)}), |𝖠𝗎𝗍(A)|=Θ(1)|\mathsf{Aut}(A)|=\Theta(1), and |𝖾𝗆𝖻(A)|=Θ(n|V(A)|1)|\mathsf{emb}(A)|=\Theta(n^{|V(A)|-1}). This means (101) can be written as bB(g)=AζBAbA(f)b_{B}(g)=\sum_{A}\zeta_{BA}b_{A}(f) for some coefficients ζBA\zeta_{BA} (not depending on f,gf,g) that satisfy

|ζBA|\displaystyle|\zeta_{BA}| |𝖾𝗆𝖻(B)||𝖾𝗆𝖻(A)|O(n|V(A)||V(B)|n12(|E(A)||E(B)|))\displaystyle\leq\sqrt{\frac{|\mathsf{emb}(B)|}{|\mathsf{emb}(A)|}}\cdot O(n^{|V(A)|-|V(B)|}\cdot n^{-\frac{1}{2}(|E(A)|-|E(B)|)})
=O(n12(|V(A)||V(B)||E(A)|+|E(B)|))\displaystyle=O(n^{\frac{1}{2}(|V(A)|-|V(B)|-|E(A)|+|E(B)|)})
=O(1),\displaystyle=O(1)\,,

where the last bound holds since 𝜷Γ(𝜶){\boldsymbol{\beta}}\in\Gamma({\boldsymbol{\alpha}}) and therefore 𝜶{\boldsymbol{\alpha}} cannot have more components than 𝜷{\boldsymbol{\beta}}.

We have now shown |ζBA|=O(1)|\zeta_{BA}|=O(1). We can also see directly that ζAA=1\zeta_{AA}=1 for all AA, and ζBA=0\zeta_{BA}=0 whenever BAB\neq A and |E(B)||E(A)||E(B)|\geq|E(A)|. This means

bB(g)=bB(f)+A:|E(A)|>|E(B)|ζBAbA(f)b_{B}(g)=b_{B}(f)+\sum_{A\,:\,|E(A)|>|E(B)|}\zeta_{BA}b_{A}(f)

which we can rewrite in vector form as

𝒃(g)=𝜻𝒃(f),{\boldsymbol{b}}(g)={\boldsymbol{\zeta}}\,{\boldsymbol{b}}(f)\,,

where 𝜻=(ζA,B)A,B𝒢D{\boldsymbol{\zeta}}=(\zeta_{A,B})_{A,B\in\mathcal{G}_{\leq D}}. We order 𝒢D\mathcal{G}_{\leq D} so that the number of edges is non-decreasing along this ordering, and therefore 𝜻{\boldsymbol{\zeta}} is upper triangular with ones on the diagonal. This implies det(𝜻)=1\det({\boldsymbol{\zeta}})=1, and in particular 𝜻{\boldsymbol{\zeta}} is invertible. By Cramer’s rule, letting 𝜻(B,A){\boldsymbol{\zeta}}_{(B,A)} denote the (B,A)(B,A)-th minor,

(𝜻1)A,B=(1)i+jdet(𝜻(B,A))det(𝜻)=(1)i+jdet(𝜻(B,A))=O(1),\displaystyle({\boldsymbol{\zeta}}^{-1})_{A,B}=(-1)^{i+j}\frac{\det({\boldsymbol{\zeta}}_{(B,A)})}{\det({\boldsymbol{\zeta}})}=(-1)^{i+j}\det({\boldsymbol{\zeta}}_{(B,A)})=O(1)\,,

where the last bound follows from the fact that the matrix dimensions are independent of nn, and maxA,B𝒢D|ζA,B|=O(1)\max_{A,B\in\mathcal{G}_{\leq D}}|\zeta_{A,B}|=O(1). Therefore 𝜻1op=O(1)\|{\boldsymbol{\zeta}}^{-1}\|_{\mbox{\rm\tiny op}}=O(1) and

𝒃(g)2𝜻1op1𝒃(f)2𝜻1op1=Ω(1).\displaystyle\|{\boldsymbol{b}}(g)\|_{2}\geq\|{\boldsymbol{\zeta}}^{-1}\|_{\mbox{\rm\tiny op}}^{-1}\,\|{\boldsymbol{b}}(f)\|_{2}\geq\|{\boldsymbol{\zeta}}^{-1}\|_{\mbox{\rm\tiny op}}^{-1}=\Omega(1)\,.

This concludes the proof.

C.5 Proof of Lemma 4.6: Explicit solution

Throughout this proof, we will omit the subscript nn from 𝒄{\boldsymbol{c}} and 𝑴{\boldsymbol{M}}.

For an arbitrary q[𝒀]D𝗌𝗒𝗆q\in\mathbb{R}[{\boldsymbol{Y}}]_{\leq D}^{\mathsf{sym}}, write the expansion q=A𝒢Dq^AHAq=\sum_{A\in\mathcal{G}_{\leq D}}\hat{q}_{A}\mathscrsfs{H}_{A}. Recalling the definitions cA=𝔼[HA(𝒀)ψ(θ1)]c_{A}=\operatorname*{\mathbb{E}}[\mathscrsfs{H}_{A}({\boldsymbol{Y}})\cdot\psi(\theta_{1})] and MAB=𝔼[HA(𝒀)HB(𝒀)]M_{AB}=\operatorname*{\mathbb{E}}[\mathscrsfs{H}_{A}({\boldsymbol{Y}})\mathscrsfs{H}_{B}({\boldsymbol{Y}})], we can rewrite the optimization problem as

infq[𝒀]D𝗌𝗒𝗆𝔼[(q(𝒀)ψ(θ1))2]\displaystyle\inf_{q\in\mathbb{R}[{\boldsymbol{Y}}]^{\mathsf{sym}}_{\leq D}}\operatorname*{\mathbb{E}}[(q({\boldsymbol{Y}})-\psi(\theta_{1}))^{2}] =𝔼[ψ(θ1)2]supq[𝒀]D𝗌𝗒𝗆𝔼[2q(𝒀)ψ(θ1)q(𝒀)2]\displaystyle=\operatorname*{\mathbb{E}}[\psi(\theta_{1})^{2}]-\sup_{q\in\mathbb{R}[{\boldsymbol{Y}}]^{\mathsf{sym}}_{\leq D}}\operatorname*{\mathbb{E}}[2\,q({\boldsymbol{Y}})\cdot\psi(\theta_{1})-q({\boldsymbol{Y}})^{2}]
=𝔼[ψ(θ1)2]sup𝒒^(2𝒒^,𝒄𝒒^,𝑴𝒒^).\displaystyle=\operatorname*{\mathbb{E}}[\psi(\theta_{1})^{2}]-\sup_{\hat{\boldsymbol{q}}}\left(2\langle\hat{\boldsymbol{q}},{\boldsymbol{c}}\rangle-\langle\hat{\boldsymbol{q}},{\boldsymbol{M}}\hat{\boldsymbol{q}}\rangle\right).

Since 𝑴{\boldsymbol{M}} is positive definite by Lemma 4.5, the map 𝒒^2𝒒^,𝒄𝒒^,𝑴𝒒^\hat{\boldsymbol{q}}\mapsto 2\langle\hat{\boldsymbol{q}},{\boldsymbol{c}}\rangle-\langle\hat{\boldsymbol{q}},{\boldsymbol{M}}\hat{\boldsymbol{q}}\rangle is strictly concave and is thus uniquely maximized at its unique stationary point 𝒒^=𝑴1𝒄\hat{\boldsymbol{q}}={\boldsymbol{M}}^{-1}{\boldsymbol{c}}.
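For later reference, substituting the maximizer back into the objective gives the explicit value (a one-line check):

\sup_{\hat{\boldsymbol{q}}}\left(2\langle\hat{\boldsymbol{q}},{\boldsymbol{c}}\rangle-\langle\hat{\boldsymbol{q}},{\boldsymbol{M}}\hat{\boldsymbol{q}}\rangle\right)=2\langle{\boldsymbol{M}}^{-1}{\boldsymbol{c}},{\boldsymbol{c}}\rangle-\langle{\boldsymbol{M}}^{-1}{\boldsymbol{c}},{\boldsymbol{M}}{\boldsymbol{M}}^{-1}{\boldsymbol{c}}\rangle=\langle{\boldsymbol{c}},{\boldsymbol{M}}^{-1}{\boldsymbol{c}}\rangle\,,

so that \inf_{q\in\mathbb{R}[{\boldsymbol{Y}}]^{\mathsf{sym}}_{\leq D}}\operatorname*{\mathbb{E}}[(q({\boldsymbol{Y}})-\psi(\theta_{1}))^{2}]=\operatorname*{\mathbb{E}}[\psi(\theta_{1})^{2}]-\langle{\boldsymbol{c}},{\boldsymbol{M}}^{-1}{\boldsymbol{c}}\rangle, whose n\to\infty limit is the quantity \operatorname*{\mathbb{E}}[\psi(\theta_{1})^{2}]-\langle{\boldsymbol{c}}_{\infty},{\boldsymbol{M}}_{\infty}^{-1}{\boldsymbol{c}}_{\infty}\rangle appearing in the proofs of Propositions 4.3 and 4.10.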

C.6 Proof of Lemma 4.7: Limit of 𝒄n{\boldsymbol{c}}_{n}

Throughout this proof, we will omit the subscript nn from 𝒄{\boldsymbol{c}} and its entries.

Our goal is to compute, for each A𝒢DA\in\mathcal{G}_{\leq D},

cA=𝔼[HA(𝒀)ψ(θ1)]=1|𝖾𝗆𝖻(A)|ϕ𝖾𝗆𝖻(A)𝔼[H𝗂𝗆(ϕ;A)(𝒀)ψ(θ1)]=|𝖾𝗆𝖻(A)|𝔼[H𝜶(Y)ψ(θ1)],c_{A}=\operatorname*{\mathbb{E}}[\mathscrsfs{H}_{A}({\boldsymbol{Y}})\cdot\psi(\theta_{1})]=\frac{1}{\sqrt{|\mathsf{emb}(A)|}}\sum_{\phi\in\mathsf{emb}(A)}\operatorname*{\mathbb{E}}[\mathscrsfs{H}_{\mathsf{im}(\phi;A)}({\boldsymbol{Y}})\cdot\psi(\theta_{1})]=\sqrt{|\mathsf{emb}(A)|}\operatorname*{\mathbb{E}}[\mathscrsfs{H}_{\boldsymbol{\alpha}}(Y)\cdot\psi(\theta_{1})]\,,

where in the last step, 𝜶{\boldsymbol{\alpha}} is the image of AA under an arbitrary embedding ϕ\phi and we have used the fact that 𝔼[H𝜶(Y)ψ(θ1)]\operatorname*{\mathbb{E}}[\mathscrsfs{H}_{\boldsymbol{\alpha}}(Y)\cdot\psi(\theta_{1})] depends only on AA (not 𝜶{\boldsymbol{\alpha}}). We have

|𝖾𝗆𝖻(A)|=(n1|V(A)|1)(|V(A)|1)!=n|V(A)|1(1+O(n1)).|\mathsf{emb}(A)|=\binom{n-1}{|V(A)|-1}(|V(A)|-1)!=n^{|V(A)|-1}(1+O(n^{-1}))\,. (102)
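
As an illustration of (102), if A is the rooted path with two edges (so |V(A)|=3), then

|\mathsf{emb}(A)|=\binom{n-1}{2}\cdot 2!=(n-1)(n-2)=n^{2}\big(1+O(n^{-1})\big)\,.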

For 𝜶{\boldsymbol{\alpha}} the image of an embedding of AA, we have by definition,

𝔼[H𝜶(𝒀)ψ(θ1)]=𝔼{ψ(θ1)𝜸C(𝜶)(h𝜸(𝒀)𝔼hγ(𝒀))}.\operatorname*{\mathbb{E}}[\mathscrsfs{H}_{\boldsymbol{\alpha}}({\boldsymbol{Y}})\cdot\psi(\theta_{1})]=\operatorname*{\mathbb{E}}\Big{\{}\psi(\theta_{1})\prod_{{\boldsymbol{\gamma}}\in C({\boldsymbol{\alpha}})}(h_{\boldsymbol{\gamma}}({\boldsymbol{Y}})-\operatorname*{\mathbb{E}}h_{\gamma}({\boldsymbol{Y}}))\Big{\}}\,.

If A=A=\emptyset (the edgeless graph), we can see cA=𝔼[ψ(θ1)]c_{A}=\operatorname*{\mathbb{E}}[\psi(\theta_{1})] is a constant and we are done. If AA has a non-empty connected component that does not contain the root then we have 𝔼[H𝜶(𝒀)ψ(θ1)]=0\operatorname*{\mathbb{E}}[\mathscrsfs{H}_{\boldsymbol{\alpha}}({\boldsymbol{Y}})\cdot\psi(\theta_{1})]=0 due to independence between components, and again we are done.

Finally, consider the case in which AA has a single non-empty component and this component contains the root. In this case, simplifying the above using the Hermite facts from Section C.2, and recalling that 𝑿:=𝜽𝜽𝖳/n{\boldsymbol{X}}:={\boldsymbol{\theta}}{\boldsymbol{\theta}}^{{\sf T}}/\sqrt{n},

𝔼[H𝜶(Y)ψ(θ1)]=𝔼{ψ(θ1)(h𝜶(𝒀)𝔼h𝜶(𝒀))}\displaystyle\operatorname*{\mathbb{E}}[\mathscrsfs{H}_{\boldsymbol{\alpha}}(Y)\cdot\psi(\theta_{1})]=\operatorname*{\mathbb{E}}\big{\{}\psi(\theta_{1})(h_{\boldsymbol{\alpha}}({\boldsymbol{Y}})-\operatorname*{\mathbb{E}}h_{\boldsymbol{\alpha}}({\boldsymbol{Y}}))\big{\}}
=𝔼{ψ(θ1)(𝟎𝜶𝜶𝜶!𝜶!(𝜶𝜶)X𝜶𝜶h𝜶(𝒁)1𝜶!𝔼[𝑿𝜶])}\displaystyle\qquad=\operatorname*{\mathbb{E}}\left\{\psi(\theta_{1})\left(\sum_{{\boldsymbol{0}}\leq{\boldsymbol{\alpha}}^{\prime}\leq{\boldsymbol{\alpha}}}\sqrt{\frac{{\boldsymbol{\alpha}}^{\prime}!}{{\boldsymbol{\alpha}}!}}\binom{{\boldsymbol{\alpha}}}{{\boldsymbol{\alpha}}^{\prime}}X^{{\boldsymbol{\alpha}}-{\boldsymbol{\alpha}}^{\prime}}h_{{\boldsymbol{\alpha}}^{\prime}}({\boldsymbol{Z}})-\frac{1}{\sqrt{{\boldsymbol{\alpha}}!}}\operatorname*{\mathbb{E}}[{\boldsymbol{X}}^{\boldsymbol{\alpha}}]\right)\right\}
=𝔼{ψ(θ1)(𝟎𝜶𝜶𝜶!𝜶!(𝜶𝜶)n12|𝜶𝜶|(𝜽𝜽𝖳)𝜶𝜶h𝜶(𝒁)1𝜶!n12|𝜶|𝔼[(𝜽𝜽𝖳)𝜶])}.\displaystyle\qquad=\operatorname*{\mathbb{E}}\left\{\psi(\theta_{1})\left(\sum_{{\boldsymbol{0}}\leq{\boldsymbol{\alpha}}^{\prime}\leq{\boldsymbol{\alpha}}}\sqrt{\frac{{\boldsymbol{\alpha}}^{\prime}!}{{\boldsymbol{\alpha}}!}}\binom{{\boldsymbol{\alpha}}}{{\boldsymbol{\alpha}}^{\prime}}n^{-\frac{1}{2}|{\boldsymbol{\alpha}}-{\boldsymbol{\alpha}}^{\prime}|}({\boldsymbol{\theta}}{\boldsymbol{\theta}}^{\sf T})^{{\boldsymbol{\alpha}}-{\boldsymbol{\alpha}}^{\prime}}h_{{\boldsymbol{\alpha}}^{\prime}}({\boldsymbol{Z}})-\frac{1}{\sqrt{{\boldsymbol{\alpha}}!}}n^{-\frac{1}{2}|{\boldsymbol{\alpha}}|}\operatorname*{\mathbb{E}}[({\boldsymbol{\theta}}{\boldsymbol{\theta}}^{\sf T})^{\boldsymbol{\alpha}}]\right)\right\}.

Using orthogonality of the Hermite polynomials and recalling h0(z)=1h_{0}(z)=1, all the terms with 𝜶𝟎{\boldsymbol{\alpha}}^{\prime}\neq{\boldsymbol{0}} are zero in expectation, i.e.,

𝔼[H𝜶(Y)ψ(θ1)]\displaystyle\operatorname*{\mathbb{E}}[\mathscrsfs{H}_{\boldsymbol{\alpha}}(Y)\cdot\psi(\theta_{1})] =𝔼{ψ(θ1)n12|𝜶|𝜶!((𝜽𝜽𝖳)𝜶𝔼[(𝜽𝜽𝖳)𝜶])}\displaystyle=\operatorname*{\mathbb{E}}\left\{\psi(\theta_{1})\frac{n^{-\frac{1}{2}|{\boldsymbol{\alpha}}|}}{\sqrt{{\boldsymbol{\alpha}}!}}\left(({\boldsymbol{\theta}}{\boldsymbol{\theta}}^{{\sf T}})^{\boldsymbol{\alpha}}-\operatorname*{\mathbb{E}}[({\boldsymbol{\theta}}{\boldsymbol{\theta}}^{{\sf T}})^{\boldsymbol{\alpha}}]\right)\right\}
=𝖼𝗈𝗇𝗌𝗍(A)n12|𝜶|=𝖼𝗈𝗇𝗌𝗍(A)n12|E(A)|.\displaystyle=\mathsf{const}(A)\cdot n^{-\frac{1}{2}|{\boldsymbol{\alpha}}|}=\mathsf{const}(A)\cdot n^{-\frac{1}{2}|E(A)|}\,.

Putting it all together, we get

cA=𝖼𝗈𝗇𝗌𝗍(A)n12(|V(A)|1|E(A)|)(1+O(n1)).c_{A}=\mathsf{const}(A)\cdot n^{\frac{1}{2}(|V(A)|-1-|E(A)|)}\cdot\big{(}1+O(n^{-1})\big{)}\,.

Recall from above that cA=0c_{A}=0 unless AA has a single connected component and this component contains the root; we therefore restrict to AA of this type in what follows. Due to connectivity, any such AA has |V(A)||E(A)|+1|V(A)|\leq|E(A)|+1, implying cA=𝖼𝗈𝗇𝗌𝗍(A)+O(n1)c_{A}=\mathsf{const}(A)+O(n^{-1}) as desired. Now suppose furthermore that A𝒢D𝒯DA\in\mathcal{G}_{\leq D}\setminus\mathcal{T}_{\leq D}, i.e., AA contains a cycle (where we consider a self-loop or double-edge to be a cycle). In this case, |V(A)|<|E(A)|+1|V(A)|<|E(A)|+1, implying cA=O(n1/2)c_{A}=O(n^{-1/2}) as desired.
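
As a concrete check of the two cases, let A be the single edge at the root (|V(A)|=2, |E(A)|=1) and let A′ be the double edge between the root and one other vertex (|V(A′)|=2, |E(A′)|=2, which counts as containing a cycle). The exponent above gives

c_{A}=\mathsf{const}(A)+O(n^{-1})\,,\qquad c_{A^{\prime}}=\mathsf{const}(A^{\prime})\cdot n^{-1/2}\big(1+O(n^{-1})\big)=O(n^{-1/2})\,,

matching the two conclusions just stated.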

C.7 Proof of Lemma 4.8: Limit of 𝑴n{\boldsymbol{M}}_{n}

Throughout this proof, we will omit the subscript nn from 𝑴{\boldsymbol{M}}.

We will need to reason about the different possible intersection patterns between two rooted graphs A,B𝒢DA,B\in\mathcal{G}_{\leq D}. To this end, define a pattern to be a rooted colored multigraph where no vertices are isolated except possibly the root, every edge is colored either red or blue, the edge-induced subgraph of red edges (with the root always included) is isomorphic (as a rooted graph) to AA, and the edge-induced subgraph of blue edges is isomorphic to BB. Let 𝗉𝖺𝗍(A,B)\mathsf{pat}(A,B) denote the set of such patterns, up to isomorphism of rooted colored graphs. The number of such patterns is a constant depending on A,BA,B. For Π𝗉𝖺𝗍(A,B)\Pi\in\mathsf{pat}(A,B), let 𝖾𝗆𝖻(Π)\mathsf{emb}(\Pi) denote the set of embeddings of Π\Pi, where an embedding is defined to be a pair (ϕA,ϕB)(\phi_{A},\phi_{B}) with ϕA𝖾𝗆𝖻(A),ϕB𝖾𝗆𝖻(B)\phi_{A}\in\mathsf{emb}(A),\phi_{B}\in\mathsf{emb}(B) such that the induced colored graph on vertex set [n][n] is isomorphic to Π\Pi. We let |V(Π)||V(\Pi)| denote the number of vertices in the pattern (including the root).

We can write MAB=𝔼[HA(𝒀)HB(𝒀)]M_{AB}=\operatorname*{\mathbb{E}}[\mathscrsfs{H}_{A}({\boldsymbol{Y}})\mathscrsfs{H}_{B}({\boldsymbol{Y}})] as

MAB\displaystyle M_{AB} =1|𝖾𝗆𝖻(A)||𝖾𝗆𝖻(B)|ϕA𝖾𝗆𝖻(A),ϕB𝖾𝗆𝖻(B)𝔼[H𝗂𝗆(ϕA;A)(𝒀)H𝗂𝗆(ϕB;B)(𝒀)]\displaystyle=\frac{1}{\sqrt{|\mathsf{emb}(A)|\cdot|\mathsf{emb}(B)|}}\sum_{\phi_{A}\in\mathsf{emb}(A),\phi_{B}\in\mathsf{emb}(B)}\operatorname*{\mathbb{E}}[\mathscrsfs{H}_{\mathsf{im}(\phi_{A};A)}({\boldsymbol{Y}})\mathscrsfs{H}_{\mathsf{im}(\phi_{B};B)}({\boldsymbol{Y}})]
=1|𝖾𝗆𝖻(A)||𝖾𝗆𝖻(B)|Π𝗉𝖺𝗍(A,B)(ϕA,ϕB)𝖾𝗆𝖻(Π)𝔼[H𝗂𝗆(ϕA;A)(𝒀)H𝗂𝗆(ϕB;B)(𝒀)]\displaystyle=\frac{1}{\sqrt{|\mathsf{emb}(A)|\cdot|\mathsf{emb}(B)|}}\sum_{\Pi\in\mathsf{pat}(A,B)}\sum_{(\phi_{A},\phi_{B})\in\mathsf{emb}(\Pi)}\operatorname*{\mathbb{E}}[\mathscrsfs{H}_{\mathsf{im}(\phi_{A};A)}({\boldsymbol{Y}})\mathscrsfs{H}_{\mathsf{im}(\phi_{B};B)}({\boldsymbol{Y}})]
=1|𝖾𝗆𝖻(A)||𝖾𝗆𝖻(B)|Π𝗉𝖺𝗍(A,B)|𝖾𝗆𝖻(Π)|𝔼[H𝜶(Y)H𝜷(𝒀)],\displaystyle=\frac{1}{\sqrt{|\mathsf{emb}(A)|\cdot|\mathsf{emb}(B)|}}\sum_{\Pi\in\mathsf{pat}(A,B)}|\mathsf{emb}(\Pi)|\cdot\operatorname*{\mathbb{E}}[\mathscrsfs{H}_{\boldsymbol{\alpha}}(Y)\mathscrsfs{H}_{\boldsymbol{\beta}}({\boldsymbol{Y}})]\,,

where in the last line, (𝜶,𝜷)=(𝗂𝗆(ϕA;A),𝗂𝗆(ϕB;B))({\boldsymbol{\alpha}},{\boldsymbol{\beta}})=(\mathsf{im}(\phi_{A};A),\mathsf{im}(\phi_{B};B)) is the shadow of an arbitrary embedding (ϕA,ϕB)𝖾𝗆𝖻(Π)(\phi_{A},\phi_{B})\in\mathsf{emb}(\Pi), and we have used the fact that 𝔼[H𝜶(𝒀)H𝜷(𝒀)]\operatorname*{\mathbb{E}}[\mathscrsfs{H}_{\boldsymbol{\alpha}}({\boldsymbol{Y}})\mathscrsfs{H}_{\boldsymbol{\beta}}({\boldsymbol{Y}})] depends only on Π\Pi (not 𝜶,𝜷{\boldsymbol{\alpha}},{\boldsymbol{\beta}}). Recall from (102) that

|𝖾𝗆𝖻(A)|=n|V(A)|1(1+O(n1)),|\mathsf{emb}(A)|=n^{|V(A)|-1}(1+O(n^{-1}))\,,

and we similarly have

|𝖾𝗆𝖻(Π)|=𝖼𝗈𝗇𝗌𝗍(Π)(n1|V(Π)|1)=𝖼𝗈𝗇𝗌𝗍(Π)n|V(Π)|1(1+O(n1)).|\mathsf{emb}(\Pi)|=\mathsf{const}(\Pi)\cdot\binom{n-1}{|V(\Pi)|-1}=\mathsf{const}(\Pi)\cdot n^{|V(\Pi)|-1}(1+O(n^{-1}))\,.

Now compute using the Hermite facts from Section C.2, and recalling that 𝑿:=𝜽𝜽𝖳/n{\boldsymbol{X}}:={\boldsymbol{\theta}}{\boldsymbol{\theta}}^{{\sf T}}/\sqrt{n},

𝔼[H𝜶(𝒀)\displaystyle\operatorname*{\mathbb{E}}[\mathscrsfs{H}_{\boldsymbol{\alpha}}({\boldsymbol{Y}}) H𝜷(𝒀)]=𝔼(𝜸C(𝜶)(h𝜸(𝒀)𝔼h𝜸(𝒀)))(𝜹C(𝜷)(h𝜹(𝒀)𝔼h𝜹(𝒀)))\displaystyle\mathscrsfs{H}_{\boldsymbol{\beta}}({\boldsymbol{Y}})]=\operatorname*{\mathbb{E}}\left(\prod_{{\boldsymbol{\gamma}}\in C({\boldsymbol{\alpha}})}\left(h_{\boldsymbol{\gamma}}({\boldsymbol{Y}})-\operatorname*{\mathbb{E}}h_{\boldsymbol{\gamma}}({\boldsymbol{Y}})\right)\right)\left(\prod_{{\boldsymbol{\delta}}\in C({\boldsymbol{\beta}})}\left(h_{\boldsymbol{\delta}}({\boldsymbol{Y}})-\operatorname*{\mathbb{E}}h_{\boldsymbol{\delta}}({\boldsymbol{Y}})\right)\right)
=𝔼(𝜸C(𝜶)(𝟎𝜸𝜸𝜸!𝜸!(𝜸𝜸)𝑿𝜸𝜸h𝜸(𝒁)1𝜸!𝔼[𝑿𝜸]))\displaystyle=\operatorname*{\mathbb{E}}\left(\prod_{{\boldsymbol{\gamma}}\in C({\boldsymbol{\alpha}})}\left(\sum_{{\boldsymbol{0}}\leq{\boldsymbol{\gamma}}^{\prime}\leq{\boldsymbol{\gamma}}}\sqrt{\frac{{\boldsymbol{\gamma}}^{\prime}!}{{\boldsymbol{\gamma}}!}}\binom{{\boldsymbol{\gamma}}}{{\boldsymbol{\gamma}}^{\prime}}{\boldsymbol{X}}^{{\boldsymbol{\gamma}}-{\boldsymbol{\gamma}}^{\prime}}h_{{\boldsymbol{\gamma}}^{\prime}}({\boldsymbol{Z}})-\frac{1}{\sqrt{{\boldsymbol{\gamma}}!}}\operatorname*{\mathbb{E}}[{\boldsymbol{X}}^{\boldsymbol{\gamma}}]\right)\right)\cdot
(𝜹C(𝜷)(𝟎𝜹𝜹𝜹!𝜹!(𝜹𝜹)𝑿𝜹𝜹h𝜹(𝒁)1𝜹!𝔼[𝑿𝜹]))\displaystyle\qquad\qquad\left(\prod_{{\boldsymbol{\delta}}\in C({\boldsymbol{\beta}})}\left(\sum_{{\boldsymbol{0}}\leq{\boldsymbol{\delta}}^{\prime}\leq{\boldsymbol{\delta}}}\sqrt{\frac{{\boldsymbol{\delta}}^{\prime}!}{{\boldsymbol{\delta}}!}}\binom{{\boldsymbol{\delta}}}{{\boldsymbol{\delta}}^{\prime}}{\boldsymbol{X}}^{{\boldsymbol{\delta}}-{\boldsymbol{\delta}}^{\prime}}h_{{\boldsymbol{\delta}}^{\prime}}({\boldsymbol{Z}})-\frac{1}{\sqrt{{\boldsymbol{\delta}}!}}\operatorname*{\mathbb{E}}[{\boldsymbol{X}}^{\boldsymbol{\delta}}]\right)\right)
=𝔼(𝜸C(𝜶)(𝟎𝜸𝜸𝜸!𝜸!(𝜸𝜸)n12|𝜸𝜸|(𝜽𝜽𝖳)𝜸𝜸h𝜸(𝒁)1𝜸!n12|𝜸|𝔼[(𝜽𝜽𝖳)𝜸]))\displaystyle=\operatorname*{\mathbb{E}}\left(\prod_{{\boldsymbol{\gamma}}\in C({\boldsymbol{\alpha}})}\left(\sum_{{\boldsymbol{0}}\leq{\boldsymbol{\gamma}}^{\prime}\leq{\boldsymbol{\gamma}}}\sqrt{\frac{{\boldsymbol{\gamma}}^{\prime}!}{{\boldsymbol{\gamma}}!}}\binom{{\boldsymbol{\gamma}}}{{\boldsymbol{\gamma}}^{\prime}}n^{-\frac{1}{2}|{\boldsymbol{\gamma}}-{\boldsymbol{\gamma}}^{\prime}|}({\boldsymbol{\theta}}{\boldsymbol{\theta}}^{\sf T})^{{\boldsymbol{\gamma}}-{\boldsymbol{\gamma}}^{\prime}}h_{{\boldsymbol{\gamma}}^{\prime}}({\boldsymbol{Z}})-\frac{1}{\sqrt{{\boldsymbol{\gamma}}!}}n^{-\frac{1}{2}|{\boldsymbol{\gamma}}|}\operatorname*{\mathbb{E}}[({\boldsymbol{\theta}}{\boldsymbol{\theta}}^{\sf T})^{\boldsymbol{\gamma}}]\right)\right)\cdot
(𝜹C(𝜷)(𝟎𝜹𝜹𝜹!𝜹!(𝜹𝜹)n12|𝜹𝜹|(𝜽𝜽𝖳)𝜹𝜹h𝜹(𝒁)1𝜹!n12|𝜹|𝔼[(𝜽𝜽𝖳)𝜹])).\displaystyle\qquad\qquad\left(\prod_{{\boldsymbol{\delta}}\in C({\boldsymbol{\beta}})}\left(\sum_{{\boldsymbol{0}}\leq{\boldsymbol{\delta}}^{\prime}\leq{\boldsymbol{\delta}}}\sqrt{\frac{{\boldsymbol{\delta}}^{\prime}!}{{\boldsymbol{\delta}}!}}\binom{{\boldsymbol{\delta}}}{{\boldsymbol{\delta}}^{\prime}}n^{-\frac{1}{2}|{\boldsymbol{\delta}}-{\boldsymbol{\delta}}^{\prime}|}({\boldsymbol{\theta}}{\boldsymbol{\theta}}^{\sf T})^{{\boldsymbol{\delta}}-{\boldsymbol{\delta}}^{\prime}}h_{{\boldsymbol{\delta}}^{\prime}}({\boldsymbol{Z}})-\frac{1}{\sqrt{{\boldsymbol{\delta}}!}}n^{-\frac{1}{2}|{\boldsymbol{\delta}}|}\operatorname*{\mathbb{E}}[({\boldsymbol{\theta}}{\boldsymbol{\theta}}^{\sf T})^{\boldsymbol{\delta}}]\right)\right).

If 𝜶{\boldsymbol{\alpha}} has a non-empty (i.e., containing at least one edge) connected component that is vertex-disjoint from all non-empty connected components of 𝜷{\boldsymbol{\beta}} or vice-versa, then we have 𝔼[H𝜶(𝒀)H𝜷(𝒀)]=0\operatorname*{\mathbb{E}}[\mathscrsfs{H}_{\boldsymbol{\alpha}}({\boldsymbol{Y}})\mathscrsfs{H}_{\boldsymbol{\beta}}({\boldsymbol{Y}})]=0 from the first line above (due to independence between components). Otherwise we say that Π\Pi has “no isolated components” and denote by 𝗉𝖺𝗍(A,B)\mathsf{pat}^{*}(A,B) the set of patterns with this property. For Π𝗉𝖺𝗍(A,B)\Pi\in\mathsf{pat}^{*}(A,B), using orthogonality of the Hermite polynomials, the expansion of the expression above has a unique leading (in nn) term where 𝜸,𝜹{\boldsymbol{\gamma}}^{\prime},{\boldsymbol{\delta}}^{\prime} are as large as possible: namely, 𝜸=𝜸𝜷{\boldsymbol{\gamma}}^{\prime}={\boldsymbol{\gamma}}\wedge{\boldsymbol{\beta}} and 𝜹=𝜹𝜶{\boldsymbol{\delta}}^{\prime}={\boldsymbol{\delta}}\wedge{\boldsymbol{\alpha}} where \wedge denotes entrywise minimum between elements of E¯n\mathbb{N}^{\overline{E}_{n}}. Therefore there exists 𝖼𝗈𝗇𝗌𝗍(Π)\mathsf{const}(\Pi)\in\mathbb{R} independent of nn such that

𝔼[H𝜶(𝒀)H𝜷(𝒀)]\displaystyle\operatorname*{\mathbb{E}}[\mathscrsfs{H}_{\boldsymbol{\alpha}}({\boldsymbol{Y}})\mathscrsfs{H}_{\boldsymbol{\beta}}({\boldsymbol{Y}})] =(𝖼𝗈𝗇𝗌𝗍(Π)+O(n1/2))n12|𝜶(𝜶𝜷)|12|𝜷(𝜶𝜷)|\displaystyle=\big{(}\mathsf{const}(\Pi)+O(n^{-1/2})\big{)}\cdot n^{-\frac{1}{2}|{\boldsymbol{\alpha}}-({\boldsymbol{\alpha}}\wedge{\boldsymbol{\beta}})|-\frac{1}{2}|{\boldsymbol{\beta}}-({\boldsymbol{\alpha}}\wedge{\boldsymbol{\beta}})|}
=(𝖼𝗈𝗇𝗌𝗍(Π)+O(n1/2))n12|𝜶𝜷|,\displaystyle=\big{(}\mathsf{const}(\Pi)+O(n^{-1/2})\big{)}\cdot n^{-\frac{1}{2}|{\boldsymbol{\alpha}}\bigtriangleup{\boldsymbol{\beta}}|}\,,

where the symmetric difference is defined as (𝜶𝜷)ij:=|αijβij|({\boldsymbol{\alpha}}\bigtriangleup{\boldsymbol{\beta}})_{ij}:=|\alpha_{ij}-\beta_{ij}|.

Putting it all together, recalling that 𝔼[H𝜶(𝒀)H𝜷(𝒀)]=0\operatorname*{\mathbb{E}}[\mathscrsfs{H}_{\boldsymbol{\alpha}}({\boldsymbol{Y}})\mathscrsfs{H}_{\boldsymbol{\beta}}({\boldsymbol{Y}})]=0 whenever Π𝗉𝖺𝗍(A,B)𝗉𝖺𝗍(A,B)\Pi\in\mathsf{pat}(A,B)\setminus\mathsf{pat}^{*}(A,B), and possibly redefining 𝖼𝗈𝗇𝗌𝗍(Π)\mathsf{const}(\Pi), we obtain

MAB\displaystyle M_{AB} =n12(|V(A)|1)12(|V(B)|1)Π𝗉𝖺𝗍(A,B)(𝖼𝗈𝗇𝗌𝗍(Π)+O(n1/2))n(|V(Π)|1)12|αβ|\displaystyle=n^{-\frac{1}{2}(|V(A)|-1)-\frac{1}{2}(|V(B)|-1)}\sum_{\Pi\in\mathsf{pat}^{*}(A,B)}\big{(}\mathsf{const}(\Pi)+O(n^{-1/2})\big{)}\cdot n^{(|V(\Pi)|-1)-\frac{1}{2}|\alpha\bigtriangleup\beta|}
=(𝖼𝗈𝗇𝗌𝗍(A,B)+O(n1/2))nφ(A,B)\displaystyle=\big{(}\mathsf{const}(A,B)+O(n^{-1/2})\big{)}\cdot n^{\varphi(A,B)}

where

φ(A,B):=\displaystyle\varphi(A,B):= 12(|V(A)|+|V(B)|)+maxΠ𝗉𝖺𝗍(A,B)(|V(Π)|12|𝜶𝜷|)\displaystyle-\frac{1}{2}(|V(A)|+|V(B)|)+\max_{\Pi\in\mathsf{pat}^{*}(A,B)}\Big{(}|V(\Pi)|-\frac{1}{2}|{\boldsymbol{\alpha}}\bigtriangleup{\boldsymbol{\beta}}|\Big{)}
=\displaystyle= 12maxΠ𝗉𝖺𝗍(A,B)(|V(𝜶)V(𝜷)|+|V(𝜷)V(𝜶)||𝜶𝜷||𝜷𝜶|),\displaystyle\;\frac{1}{2}\max_{\Pi\in\mathsf{pat}^{*}(A,B)}\Big{(}|V({\boldsymbol{\alpha}})\setminus V({\boldsymbol{\beta}})|+|V({\boldsymbol{\beta}})\setminus V({\boldsymbol{\alpha}})|-|{\boldsymbol{\alpha}}\setminus{\boldsymbol{\beta}}|-|{\boldsymbol{\beta}}\setminus{\boldsymbol{\alpha}}|\Big{)}\,,

where (𝜶𝜷)ij:=max{0,αijβij}({\boldsymbol{\alpha}}\setminus{\boldsymbol{\beta}})_{ij}:=\max\{0,\alpha_{ij}-\beta_{ij}\}, and V(𝜶)V({\boldsymbol{\alpha}}) always includes the root by convention (see Section C.1). To complete the proof, it remains to show φ(A,B)0\varphi(A,B)\leq 0 for all A,BA,B, and furthermore φ(A,B)1/2\varphi(A,B)\leq-1/2 in the case A𝒯D,B𝒢D𝒯DA\in\mathcal{T}_{\leq D},B\in\mathcal{G}_{\leq D}\setminus\mathcal{T}_{\leq D}.
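
As a simple illustration of this exponent, take A=B to be the single edge at the root and let Π be the pattern in which the red and the blue edge coincide, so that 𝜶=𝜷 and |V(Π)|=2. For this pattern,

-\frac{1}{2}\big(|V(A)|+|V(B)|\big)+|V(\Pi)|-\frac{1}{2}|{\boldsymbol{\alpha}}\bigtriangleup{\boldsymbol{\beta}}|=-\frac{1}{2}(2+2)+2-0=0\,,

so φ(A,A)≥0; once the bound φ(A,B)≤0 below is established, this gives φ(A,A)=0, consistent with the diagonal entries of 𝑴∞ being strictly positive.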

For any A,B𝒢DA,B\in\mathcal{G}_{\leq D}, Π𝗉𝖺𝗍(A,B)\Pi\in\mathsf{pat}^{*}(A,B), and (𝜶,𝜷)=(𝗂𝗆(ϕ;A),𝗂𝗆(ϕ;B))({\boldsymbol{\alpha}},{\boldsymbol{\beta}})=(\mathsf{im}(\phi;A),\mathsf{im}(\phi;B)) for ϕ\phi an embedding of Π\Pi, each non-empty connected component 𝜸{\boldsymbol{\gamma}} of 𝜶𝜷{\boldsymbol{\alpha}}\setminus{\boldsymbol{\beta}} spans at most |𝜸|+1|{\boldsymbol{\gamma}}|+1 vertices. Of these vertices, none belong to V(𝜷)V(𝜶)V({\boldsymbol{\beta}})\setminus V({\boldsymbol{\alpha}}) and due to the “no isolated components” constraint imposed by 𝗉𝖺𝗍(A,B)\mathsf{pat}^{*}(A,B), at most |𝜸||{\boldsymbol{\gamma}}| of them (i.e., all but one) can belong to V(𝜶)V(𝜷)V({\boldsymbol{\alpha}})\setminus V({\boldsymbol{\beta}}). Every vertex in V(𝜶)V(𝜷)V({\boldsymbol{\alpha}})\setminus V({\boldsymbol{\beta}}) belongs to some non-empty component of 𝜶𝜷{\boldsymbol{\alpha}}\setminus{\boldsymbol{\beta}} (since the root belongs to V(𝜶)V(𝜷)V({\boldsymbol{\alpha}})\cap V({\boldsymbol{\beta}})), so we have |V(𝜶)V(𝜷)||𝜶𝜷||V({\boldsymbol{\alpha}})\setminus V({\boldsymbol{\beta}})|\leq|{\boldsymbol{\alpha}}\setminus{\boldsymbol{\beta}}|. Similarly, |V(𝜷)V(𝜶)||𝜷𝜶||V({\boldsymbol{\beta}})\setminus V({\boldsymbol{\alpha}})|\leq|{\boldsymbol{\beta}}\setminus{\boldsymbol{\alpha}}|, implying φ(A,B)0\varphi(A,B)\leq 0 as desired.

In order to have equality φ(A,B)=0\varphi(A,B)=0 in the above argument, every non-empty connected component of 𝜶𝜷{\boldsymbol{\alpha}}\setminus{\boldsymbol{\beta}} must be a simple (i.e., without self-loops or multi-edges) tree that spans only one vertex in V(𝜷)V({\boldsymbol{\beta}}), and similarly every non-empty connected component of 𝜷𝜶{\boldsymbol{\beta}}\setminus{\boldsymbol{\alpha}} must be a simple tree that spans only one vertex in V(𝜶)V({\boldsymbol{\alpha}}). It remains to show that this is impossible when A𝒯D,B𝒢D𝒯DA\in\mathcal{T}_{\leq D},B\in\mathcal{G}_{\leq D}\setminus\mathcal{T}_{\leq D}. In this case AA is a simple rooted tree. We can assume AA has at least one edge, or else the “no isolated components” property must fail (since AA has no non-empty components). We will consider a few different cases for BB.

First consider the case where the root is isolated in BB. This means 𝜶𝜷{\boldsymbol{\alpha}}\setminus{\boldsymbol{\beta}} has a non-empty connected component containing the root \circ. Since AA is a simple rooted tree and due to “no isolated components,” this component also contains a vertex uu belonging to some non-empty component of 𝜷{\boldsymbol{\beta}}. However, now we have a non-empty component of 𝜶𝜷{\boldsymbol{\alpha}}\setminus{\boldsymbol{\beta}} that spans at least two vertices in V(𝜷)V({\boldsymbol{\beta}}), namely \circ and uu, and so by the discussion above, it is impossible to have equality φ(A,B)=0\varphi(A,B)=0.

Next consider the case where the root is not isolated in BB, but BB has multiple connected components. Let 𝜸{\boldsymbol{\gamma}} be a non-empty component of 𝜷{\boldsymbol{\beta}} that does not contain the root. Due to “no isolated components,” 𝜸{\boldsymbol{\gamma}} must span some (non-root) vertex uV(𝜶)u\in V({\boldsymbol{\alpha}}). Since 𝜶{\boldsymbol{\alpha}} is a simple rooted tree, there is a unique path in 𝜶{\boldsymbol{\alpha}} from uu to \circ. Recall that V(𝜷)\circ\in V({\boldsymbol{\beta}}) by convention, so both endpoints of this path belong to V(𝜷)V({\boldsymbol{\beta}}). Also, this path must contain an edge in 𝜶𝜷{\boldsymbol{\alpha}}\setminus{\boldsymbol{\beta}}, or else 𝜸{\boldsymbol{\gamma}} would be connected to the root. This means that along the path, we can find a non-empty component of 𝜶𝜷{\boldsymbol{\alpha}}\setminus{\boldsymbol{\beta}} that spans two different vertices in V(𝜷)V({\boldsymbol{\beta}}), and so it is impossible to have equality φ(A,B)=0\varphi(A,B)=0.

The last case to consider is where BB contains a cycle. This cycle must contain at least one edge in 𝜷𝜶{\boldsymbol{\beta}}\setminus{\boldsymbol{\alpha}}, since 𝜶{\boldsymbol{\alpha}} has no cycles. Recall from above that in order to have φ(A,B)=0\varphi(A,B)=0, every non-empty component of 𝜷𝜶{\boldsymbol{\beta}}\setminus{\boldsymbol{\alpha}} must be a simple tree, which means the cycle also contains at least one edge in 𝜶{\boldsymbol{\alpha}}. Now we know the cycle in 𝜷{\boldsymbol{\beta}} has at least one edge in 𝜷𝜶{\boldsymbol{\beta}}\setminus{\boldsymbol{\alpha}} and at least one edge in 𝜶{\boldsymbol{\alpha}}, but this means 𝜷𝜶{\boldsymbol{\beta}}\setminus{\boldsymbol{\alpha}} has a non-empty connected component that spans at least two vertices in V(𝜶)V({\boldsymbol{\alpha}}), and so it is impossible to have φ(A,B)=0\varphi(A,B)=0.

C.8 Limit of 𝒄n,𝑴n1𝒄n\langle{\boldsymbol{c}}_{n},{\boldsymbol{M}}_{n}^{-1}{\boldsymbol{c}}_{n}\rangle

Lemma C.2.

Under the assumptions of Proposition 4.3, and with 𝐌{\boldsymbol{M}}_{\infty}, 𝐜{\boldsymbol{c}}_{\infty} as in the statement of Lemmas 4.7 and 4.8, we have

limn𝒄n,𝑴n1𝒄n=𝒄,𝑴1𝒄.\displaystyle\lim_{n\to\infty}\langle{\boldsymbol{c}}_{n},{\boldsymbol{M}}_{n}^{-1}{\boldsymbol{c}}_{n}\rangle=\langle{\boldsymbol{c}}_{\infty},{\boldsymbol{M}}_{\infty}^{-1}{\boldsymbol{c}}_{\infty}\rangle\,. (103)
Proof.

From Lemma 4.5 we have λmin(𝑴n)=Ω(1)\lambda_{\mathrm{min}}({\boldsymbol{M}}_{n})=\Omega(1). As a direct consequence of Lemma 4.8,

𝑴n𝑴F=O(n1/2).\|{\boldsymbol{M}}_{n}-{\boldsymbol{M}}_{\infty}\|_{F}=O(n^{-1/2})\,.

This means, using op\|\;\cdot\;\|_{\mbox{\rm\tiny op}} for matrix operator norm,

λmin(𝑴)λmin(𝑴n)𝑴n𝑴opλmin(𝑴n)𝑴n𝑴F𝖼𝗈𝗇𝗌𝗍>0,\lambda_{\mathrm{min}}({\boldsymbol{M}}_{\infty})\geq\lambda_{\mathrm{min}}({\boldsymbol{M}}_{n})-\|{\boldsymbol{M}}_{n}-{\boldsymbol{M}}_{\infty}\|_{\mbox{\rm\tiny op}}\geq\lambda_{\mathrm{min}}({\boldsymbol{M}}_{n})-\|{\boldsymbol{M}}_{n}-{\boldsymbol{M}}_{\infty}\|_{F}\geq\mathsf{const}>0\,,

implying that 𝑴{\boldsymbol{M}}_{\infty} is symmetric positive definite and thus invertible. The result is now immediate from Lemmas 4.7 and 4.8 because 𝑴n{\boldsymbol{M}}_{n} has fixed dimension and the entries of 𝑴n1=adj(𝑴n)/det(𝑴n){\boldsymbol{M}}^{-1}_{n}=\mathrm{adj}({\boldsymbol{M}}_{n})/\mathrm{det}({\boldsymbol{M}}_{n}) are differentiable functions of the entries of 𝑴n{\boldsymbol{M}}_{n} (in a neighborhood of 𝑴{\boldsymbol{M}}_{\infty}). ∎
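
For completeness, the last step can be made quantitative via the resolvent identity

{\boldsymbol{M}}_{n}^{-1}-{\boldsymbol{M}}_{\infty}^{-1}={\boldsymbol{M}}_{n}^{-1}({\boldsymbol{M}}_{\infty}-{\boldsymbol{M}}_{n}){\boldsymbol{M}}_{\infty}^{-1}\,,\qquad\|{\boldsymbol{M}}_{n}^{-1}-{\boldsymbol{M}}_{\infty}^{-1}\|_{\mbox{\rm\tiny op}}\leq\|{\boldsymbol{M}}_{n}^{-1}\|_{\mbox{\rm\tiny op}}\,\|{\boldsymbol{M}}_{\infty}^{-1}\|_{\mbox{\rm\tiny op}}\,\|{\boldsymbol{M}}_{n}-{\boldsymbol{M}}_{\infty}\|_{\mbox{\rm\tiny op}}=O(n^{-1/2})\,,

which, combined with the entrywise convergence of 𝒄n from Lemma 4.7, yields (103).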

C.9 Proof of Lemma 4.9: Accuracy of rr

First, we claim that 𝒅,𝑷1𝒅=𝒄,𝑴1𝒄\langle{\boldsymbol{d}}_{\infty},{\boldsymbol{P}}_{\infty}^{-1}{\boldsymbol{d}}_{\infty}\rangle=\langle{\boldsymbol{c}}_{\infty},{\boldsymbol{M}}_{\infty}^{-1}{\boldsymbol{c}}_{\infty}\rangle. Recall the block form of 𝑴n{\boldsymbol{M}}_{n} and 𝒄n{\boldsymbol{c}}_{n} in Eq. (45), and that 𝑹=0{\boldsymbol{R}}_{\infty}=0 by Lemma 4.8; therefore

𝑴=[𝑷𝟎𝟎𝑸]and𝑴1=[𝑷100𝑸1].{\boldsymbol{M}}_{\infty}=\left[\begin{array}[]{cc}{\boldsymbol{P}}_{\infty}&{\boldsymbol{0}}\\ {\boldsymbol{0}}&{\boldsymbol{Q}}_{\infty}\end{array}\right]\qquad\text{and}\qquad{\boldsymbol{M}}_{\infty}^{-1}=\left[\begin{array}[]{cc}{\boldsymbol{P}}_{\infty}^{-1}&0\\ 0&{\boldsymbol{Q}}_{\infty}^{-1}\end{array}\right].

The claim follows because 𝒆=0{\boldsymbol{e}}_{\infty}=0 (see Eq. (45) and Lemma 4.7).

Now as nn\to\infty we have

𝔼[r(𝒀)ψ(θ1)]=𝒓^,𝒅n=𝒅n,𝑷1𝒅𝒅,𝑷1𝒅,\operatorname*{\mathbb{E}}[r({\boldsymbol{Y}})\cdot\psi(\theta_{1})]=\langle\hat{\boldsymbol{r}},{\boldsymbol{d}}_{n}\rangle=\langle{\boldsymbol{d}}_{n},{\boldsymbol{P}}_{\infty}^{-1}{\boldsymbol{d}}_{\infty}\rangle\,\to\,\langle{\boldsymbol{d}}_{\infty},{\boldsymbol{P}}_{\infty}^{-1}{\boldsymbol{d}}_{\infty}\rangle\,,

and

𝔼[r(𝒀)2]=𝒓^,𝑷n𝒓^=𝒅,𝑷1𝑷n𝑷1𝒅𝒅,𝑷1𝒅,\operatorname*{\mathbb{E}}[r({\boldsymbol{Y}})^{2}]=\langle\hat{\boldsymbol{r}},{\boldsymbol{P}}_{n}\hat{\boldsymbol{r}}\rangle=\langle{\boldsymbol{d}}_{\infty},{\boldsymbol{P}}_{\infty}^{-1}{\boldsymbol{P}}_{n}{\boldsymbol{P}}^{-1}_{\infty}{\boldsymbol{d}}_{\infty}\rangle\,\to\,\langle{\boldsymbol{d}}_{\infty},{\boldsymbol{P}}_{\infty}^{-1}{\boldsymbol{d}}_{\infty}\rangle\,,

completing the proof.

C.10 Proof of Lemma 4.11: Change-of-basis

Fix A𝒯DA\in\mathcal{T}_{\leq D}. If A=A=\emptyset then HA=FA=1\mathscrsfs{H}_{A}=\mathscrsfs{F}_{A}=1 and the proof is complete, so suppose AA\neq\emptyset. Recall that our goal is to take

HA(𝒀):=1|𝖾𝗆𝖻(A)|ϕ𝖾𝗆𝖻(A)H𝗂𝗆(ϕ;A)(𝒀)\mathscrsfs{H}_{A}({\boldsymbol{Y}}):=\frac{1}{\sqrt{|\mathsf{emb}(A)|}}\sum_{\phi\in\mathsf{emb}(A)}\mathscrsfs{H}_{\mathsf{im}(\phi;A)}({\boldsymbol{Y}})

and approximate it in the basis {FB:B𝒯D}\{\mathscrsfs{F}_{B}\,:\,B\in\mathcal{T}_{\leq D}\}, where

FB(𝒀):=1|𝗇𝗋(B)|ϕ𝗇𝗋(B)𝒀ϕ,\mathscrsfs{F}_{B}({\boldsymbol{Y}}):=\frac{1}{\sqrt{|\mathsf{nr}(B)|}}\sum_{\phi\in\mathsf{nr}(B)}{\boldsymbol{Y}}^{\phi}\,,

where we use the shorthand

𝒀ϕ:=𝒀𝗂𝗆(ϕ;B)=(i,j)E(B)Yϕ(i),ϕ(j).{\boldsymbol{Y}}^{\phi}:={\boldsymbol{Y}}^{\mathsf{im}(\phi;B)}=\prod_{(i,j)\in E(B)}Y_{\phi(i),\phi(j)}\,.

Since AA is a tree with at least one edge,

HA(𝒀)\displaystyle\mathscrsfs{H}_{A}({\boldsymbol{Y}}) =1|𝖾𝗆𝖻(A)|ϕ𝖾𝗆𝖻(A)(h𝗂𝗆(ϕ;A)(𝒀)𝔼h𝗂𝗆(ϕ)(𝒀))\displaystyle=\frac{1}{\sqrt{|\mathsf{emb}(A)|}}\sum_{\phi\in\mathsf{emb}(A)}(h_{\mathsf{im}(\phi;A)}({\boldsymbol{Y}})-\operatorname{\mathbb{E}}h_{\mathsf{im}(\phi)}({\boldsymbol{Y}}))
=1|𝖾𝗆𝖻(A)|ϕ𝖾𝗆𝖻(A)𝒀ϕ|𝖾𝗆𝖻(A)|𝔼𝒀ϕ,\displaystyle=\frac{1}{\sqrt{|\mathsf{emb}(A)|}}\sum_{\phi\in\mathsf{emb}(A)}{\boldsymbol{Y}}^{\phi}-\sqrt{|\mathsf{emb}(A)|}\operatorname*{\mathbb{E}}{\boldsymbol{Y}}^{\phi^{*}}\,, (104)

for an arbitrary ϕ𝖾𝗆𝖻(A)\phi^{*}\in\mathsf{emb}(A).

We first focus on the non-random term |𝖾𝗆𝖻(A)|𝔼𝒀ϕ\sqrt{|\mathsf{emb}(A)|}\operatorname*{\mathbb{E}}{\boldsymbol{Y}}^{\phi^{*}}, which will be easy to expand in the basis {FB}\{\mathscrsfs{F}_{B}\} because F=1\mathscrsfs{F}_{\emptyset}=1. Let

mA=limn|𝖾𝗆𝖻(A)|𝔼𝒀ϕ,m_{A}=\lim_{n\to\infty}\sqrt{|\mathsf{emb}(A)|}\operatorname*{\mathbb{E}}{\boldsymbol{Y}}^{\phi^{*}}\,,

noting that this limit exists by combining (102) with the fact

𝔼𝒀ϕ=𝖼𝗈𝗇𝗌𝗍(A)n12|E(A)|=𝖼𝗈𝗇𝗌𝗍(A)n12(|V(A)|1).\operatorname*{\mathbb{E}}{\boldsymbol{Y}}^{\phi^{*}}=\mathsf{const}(A)\cdot n^{-\frac{1}{2}|E(A)|}=\mathsf{const}(A)\cdot n^{-\frac{1}{2}(|V(A)|-1)}\,.

We can now write

|𝖾𝗆𝖻(A)|𝔼𝒀ϕ=mAF+EA,0\sqrt{|\mathsf{emb}(A)|}\operatorname*{\mathbb{E}}{\boldsymbol{Y}}^{\phi^{*}}=m_{A}\mathscrsfs{F}_{\emptyset}+\mathscrsfs{E}_{A,0}

where EA,0:=|𝖾𝗆𝖻(A)|𝔼𝒀ϕmA\mathscrsfs{E}_{A,0}:=\sqrt{|\mathsf{emb}(A)|}\operatorname*{\mathbb{E}}{\boldsymbol{Y}}^{\phi^{*}}-m_{A} satisfies EA,0=o(1)\mathscrsfs{E}_{A,0}=o(1) and so (being non-random) 𝔼[EA,02]=o(1)\operatorname*{\mathbb{E}}[\mathscrsfs{E}_{A,0}^{2}]=o(1).

The error term EA\mathscrsfs{E}_{A} will be the sum of KK terms EA=i=1KEA,i\mathscrsfs{E}_{A}=\sum_{i=1}^{K}\mathscrsfs{E}_{A,i}, with KK independent of nn. It suffices to show 𝔼[EA,i2]=o(1)\operatorname*{\mathbb{E}}[\mathscrsfs{E}_{A,i}^{2}]=o(1) for each term individually, since the triangle inequality then gives

𝔼[EA2]1/2i=1K𝔼[EA,i2]1/2=o(1).\operatorname*{\mathbb{E}}[\mathscrsfs{E}_{A}^{2}]^{1/2}\leq\sum_{i=1}^{K}\operatorname*{\mathbb{E}}[\mathscrsfs{E}_{A,i}^{2}]^{1/2}=o(1)\,. (105)

We next handle the more substantial term in (104), namely

GA(𝒀)\displaystyle G_{A}({\boldsymbol{Y}}) :=1|𝖾𝗆𝖻(A)|ϕ𝖾𝗆𝖻(A)𝒀ϕ\displaystyle:=\frac{1}{\sqrt{|\mathsf{emb}(A)|}}\sum_{\phi\in\mathsf{emb}(A)}{\boldsymbol{Y}}^{\phi} (106)
=1|𝖾𝗆𝖻(A)|ϕ𝗇𝗋(A)𝒀ϕ1|𝖾𝗆𝖻(A)|ϕ𝗇𝗋(A)𝖾𝗆𝖻(A)𝒀ϕ.\displaystyle=\frac{1}{\sqrt{|\mathsf{emb}(A)|}}\sum_{\phi\in\mathsf{nr}(A)}{\boldsymbol{Y}}^{\phi}-\frac{1}{\sqrt{|\mathsf{emb}(A)|}}\sum_{\phi\in\mathsf{nr}(A)\setminus\mathsf{emb}(A)}{\boldsymbol{Y}}^{\phi}\,.

For the first term,

1|𝖾𝗆𝖻(A)|ϕ𝗇𝗋(A)𝒀ϕ\displaystyle\frac{1}{\sqrt{|\mathsf{emb}(A)|}}\sum_{\phi\in\mathsf{nr}(A)}{\boldsymbol{Y}}^{\phi} =|𝗇𝗋(A)||𝖾𝗆𝖻(A)|FA(𝒀)\displaystyle=\sqrt{\frac{|\mathsf{nr}(A)|}{|\mathsf{emb}(A)|}}\mathscrsfs{F}_{A}({\boldsymbol{Y}})
=FA(𝒀)+(|𝗇𝗋(A)||𝖾𝗆𝖻(A)|1)FA(𝒀)=:FA(𝒀)+EA,1(𝒀).\displaystyle=\mathscrsfs{F}_{A}({\boldsymbol{Y}})+\left(\sqrt{\frac{|\mathsf{nr}(A)|}{|\mathsf{emb}(A)|}}-1\right)\mathscrsfs{F}_{A}({\boldsymbol{Y}})=:\mathscrsfs{F}_{A}({\boldsymbol{Y}})+\mathscrsfs{E}_{A,1}({\boldsymbol{Y}})\,.

Since limn(|𝗇𝗋(A)|/|𝖾𝗆𝖻(A)|)=1\lim_{n\to\infty}(|\mathsf{nr}(A)|/|\mathsf{emb}(A)|)=1 (in fact, embeddings make up a 1o(1)1-o(1) fraction of all labelings, not just non-reversing ones) and 𝔼[FA2]=O(1)\operatorname*{\mathbb{E}}[\mathscrsfs{F}_{A}^{2}]=O(1) (similarly to the proof of Lemma C.3 below), we have 𝔼[EA,12]=o(1)\operatorname*{\mathbb{E}}[\mathscrsfs{E}_{A,1}^{2}]=o(1).
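
For completeness, this limit follows from a crude union bound: a labeling ϕ of A with ϕ(∘)=1 that is not an embedding must assign the same label to two of the |V(A)| vertices, so

0\leq|\mathsf{nr}(A)|-|\mathsf{emb}(A)|\leq\binom{|V(A)|}{2}\,n^{|V(A)|-2}=O\big(n^{-1}\,|\mathsf{emb}(A)|\big)\,.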

Returning to the remaining term in (106), we partition 𝗇𝗋(A)𝖾𝗆𝖻(A)\mathsf{nr}(A)\setminus\mathsf{emb}(A) into two sets L1=L1(A)L_{1}=L_{1}(A) and L2=L2(A)L_{2}=L_{2}(A), defined as follows. If the multigraph 𝜶:=𝗂𝗆(ϕ){\boldsymbol{\alpha}}:=\mathsf{im}(\phi) has no triple-edges or higher (i.e., αij2\alpha_{ij}\leq 2 for all iji\leq j) and has no cycles (namely, the simple graph 𝜷{\boldsymbol{\beta}} defined by βij=𝟙αij1\beta_{ij}=\mathbbm{1}_{\alpha_{ij}\geq 1} has no cycles) then we let ϕL1\phi\in L_{1}; otherwise ϕL2\phi\in L_{2}. The remaining term to handle is

1|𝖾𝗆𝖻(A)|ϕ𝗇𝗋(A)𝖾𝗆𝖻(A)𝒀ϕ\displaystyle\frac{1}{\sqrt{|\mathsf{emb}(A)|}}\sum_{\phi\in\mathsf{nr}(A)\setminus\mathsf{emb}(A)}{\boldsymbol{Y}}^{\phi} =1|𝖾𝗆𝖻(A)|ϕL1𝒀ϕ+1|𝖾𝗆𝖻(A)|ϕL2𝒀ϕ\displaystyle=\frac{1}{\sqrt{|\mathsf{emb}(A)|}}\sum_{\phi\in L_{1}}{\boldsymbol{Y}}^{\phi}+\frac{1}{\sqrt{|\mathsf{emb}(A)|}}\sum_{\phi\in L_{2}}{\boldsymbol{Y}}^{\phi}
=:1|𝖾𝗆𝖻(A)|ϕL1𝒀ϕ+EA,2(𝒀).\displaystyle=:\frac{1}{\sqrt{|\mathsf{emb}(A)|}}\sum_{\phi\in L_{1}}{\boldsymbol{Y}}^{\phi}+\mathscrsfs{E}_{A,2}({\boldsymbol{Y}})\,. (107)
Lemma C.3.

𝔼[EA,22]=o(1)\operatorname*{\mathbb{E}}[\mathscrsfs{E}_{A,2}^{2}]=o(1).

Proof.

We define the set of patterns Π𝗉𝖺𝗍2(A,B)\Pi\in\mathsf{pat}_{2}(A,B) between two rooted trees A,B𝒯DA,B\in\mathcal{T}_{\leq D} as in Section C.7, but now restricted to labelings in L2L_{2}. Namely, the edge-induced subgraph of red edges of Π𝗉𝖺𝗍2(A,B)\Pi\in\mathsf{pat}_{2}(A,B) should be isomorphic not to AA but rather to the image of some ϕL2(A)\phi\in L_{2}(A) (and similarly for the blue edges and BB). We write (ϕ1,ϕ2)𝖾𝗆𝖻(Π)(\phi_{1},\phi_{2})\in\mathsf{emb}(\Pi) to denote an embedding of this pattern into the vertex set [n][n]. Similarly to Section C.7, we compute

𝔼[EA,22]\displaystyle\operatorname*{\mathbb{E}}[\mathscrsfs{E}_{A,2}^{2}] =1|𝖾𝗆𝖻(A)|ϕ1,ϕ2L2(A)𝔼[𝒀ϕ1𝒀ϕ2]\displaystyle=\frac{1}{|\mathsf{emb}(A)|}\sum_{\phi_{1},\phi_{2}\in L_{2}(A)}\operatorname*{\mathbb{E}}[{\boldsymbol{Y}}^{\phi_{1}}{\boldsymbol{Y}}^{\phi_{2}}]
=1|𝖾𝗆𝖻(A)|Π𝗉𝖺𝗍2(A,A)(ϕ1,ϕ2)𝖾𝗆𝖻(Π)𝔼[𝒀ϕ1𝒀ϕ2]\displaystyle=\frac{1}{|\mathsf{emb}(A)|}\sum_{\Pi\in\mathsf{pat}_{2}(A,A)}\,\sum_{(\phi_{1},\phi_{2})\in\mathsf{emb}(\Pi)}\operatorname*{\mathbb{E}}[{\boldsymbol{Y}}^{\phi_{1}}{\boldsymbol{Y}}^{\phi_{2}}]
=1|𝖾𝗆𝖻(A)|Π𝗉𝖺𝗍2(A,A)|𝖾𝗆𝖻(Π)|𝔼[𝒀𝜶+𝜷]\displaystyle=\frac{1}{|\mathsf{emb}(A)|}\sum_{\Pi\in\mathsf{pat}_{2}(A,A)}|\mathsf{emb}(\Pi)|\cdot\operatorname*{\mathbb{E}}[{\boldsymbol{Y}}^{{\boldsymbol{\alpha}}+{\boldsymbol{\beta}}}]
where (𝜶,𝜷)=(𝗂𝗆(ϕ1;A),𝗂𝗆(ϕ2;A))({\boldsymbol{\alpha}},{\boldsymbol{\beta}})=(\mathsf{im}(\phi_{1};A),\mathsf{im}(\phi_{2};A)) is the image of an arbitrary embedding of Π\Pi
=Θ(n1|V(A)|)Π𝗉𝖺𝗍2(A,A)O(n|V(Π)|1n12𝗈𝖽𝖽(Π))\displaystyle=\Theta(n^{1-|V(A)|})\sum_{\Pi\in\mathsf{pat}_{2}(A,A)}O(n^{|V(\Pi)|-1}\cdot n^{-\frac{1}{2}\mathsf{odd}(\Pi)})
where 𝗈𝖽𝖽(Π)\mathsf{odd}(\Pi) denotes the number of edges iji\leq j for which αij+βij\alpha_{ij}+\beta_{ij} is odd, which depends only on Π\Pi (not 𝜶,𝜷{\boldsymbol{\alpha}},{\boldsymbol{\beta}}). Now since AA is a tree, |E(Π)|=2|E(A)|=2(|V(A)|1)|E(\Pi)|=2|E(A)|=2(|V(A)|-1) and so the above becomes
=Π𝗉𝖺𝗍2(A,A)O(n|V(Π)|12|E(Π)|112𝗈𝖽𝖽(Π))\displaystyle=\sum_{\Pi\in\mathsf{pat}_{2}(A,A)}O(n^{|V(\Pi)|-\frac{1}{2}|E(\Pi)|-1-\frac{1}{2}\mathsf{odd}(\Pi)}) (108)

and so it remains to show that the exponent is strictly negative for any Π𝗉𝖺𝗍2(A,A)\Pi\in\mathsf{pat}_{2}(A,A).

Fix Π𝗉𝖺𝗍2(A,A)\Pi\in\mathsf{pat}_{2}(A,A). Let ss denote the number of single-edges in Π\Pi (ignoring the edge colors), let dd denote the number of double-edges, let oo denote the number of kk-edges for k3k\geq 3 odd, and let ee denote the number of kk-edges for k4k\geq 4 even. By definition we have

𝗈𝖽𝖽(Π)=s+o,\mathsf{odd}(\Pi)=s+o\,,
|E(Π)|s+2d+3o+4e.|E(\Pi)|\geq s+2d+3o+4e\,.

Also, since AA (and therefore Π\Pi) is connected,

|V(Π)|s+d+o+e+1𝟙cycle|V(\Pi)|\leq s+d+o+e+1-\mathbbm{1}_{\mathrm{cycle}}

where 𝟙cycle\mathbbm{1}_{\mathrm{cycle}} is the indicator that Π\Pi (viewed as a simple graph by replacing each multi-edge by a single-edge) contains a cycle. Now the exponent in (108) is

|V(Π)|12|E(Π)|112𝗈𝖽𝖽(Π)\displaystyle|V(\Pi)|-\frac{1}{2}|E(\Pi)|-1-\frac{1}{2}\mathsf{odd}(\Pi) (s+d+o+e+1𝟙cycle)12(s+2d+3o+4e)112(s+o)\displaystyle\leq(s+d+o+e+1-\mathbbm{1}_{\mathrm{cycle}})-\frac{1}{2}(s+2d+3o+4e)-1-\frac{1}{2}(s+o)
=(o+e+𝟙cycle).\displaystyle=-(o+e+\mathbbm{1}_{\mathrm{cycle}})\,.

By the definition of L2L_{2}, we must either have o+e1o+e\geq 1 or 𝟙cycle=1\mathbbm{1}_{\mathrm{cycle}}=1, which means the exponent is 1\leq-1, completing the proof. ∎

It remains to handle the first term in Eq. (107). For ϕL1(A)\phi\in L_{1}(A), define the “skeleton” 𝗌𝗄(ϕ)E¯n\mathsf{sk}(\phi)\in\mathbb{N}^{\overline{E}_{n}} to be the subgraph of 𝗂𝗆(ϕ)\mathsf{im}(\phi) obtained from 𝜶:=𝗂𝗆(ϕ;A){\boldsymbol{\alpha}}:=\mathsf{im}(\phi;A) as follows: delete all multi-edges (i.e., whenever αij2\alpha_{ij}\geq 2, set αij=0\alpha_{ij}=0) and then take the connected component of the root (vertex 1). Using the definition of L1(A)L_{1}(A) and recalling that ϕL1(A)\phi\in L_{1}(A) is not an embedding, note that 𝗌𝗄(ϕ)\mathsf{sk}(\phi) is always a simple tree with strictly fewer edges than AA. The remaining term to handle is

1|𝖾𝗆𝖻(A)|ϕL1𝒀ϕ\displaystyle\frac{1}{\sqrt{|\mathsf{emb}(A)|}}\sum_{\phi\in L_{1}}{\boldsymbol{Y}}^{\phi} =1|𝖾𝗆𝖻(A)|ϕL1Cϕ𝒀𝗌𝗄(ϕ)+1|𝖾𝗆𝖻(A)|ϕL1(𝒀ϕCϕ𝒀𝗌𝗄(ϕ))\displaystyle=\frac{1}{\sqrt{|\mathsf{emb}(A)|}}\sum_{\phi\in L_{1}}C_{\phi}{\boldsymbol{Y}}^{\mathsf{sk}(\phi)}+\frac{1}{\sqrt{|\mathsf{emb}(A)|}}\sum_{\phi\in L_{1}}\left({\boldsymbol{Y}}^{\phi}-C_{\phi}{\boldsymbol{Y}}^{\mathsf{sk}(\phi)}\right)
=:1|𝖾𝗆𝖻(A)|ϕL1Cϕ𝒀𝗌𝗄(ϕ)+EA,3(Y)\displaystyle=:\frac{1}{\sqrt{|\mathsf{emb}(A)|}}\sum_{\phi\in L_{1}}C_{\phi}{\boldsymbol{Y}}^{\mathsf{sk}(\phi)}+\mathscrsfs{E}_{A,3}(Y) (109)

where

Cϕ:=𝔼[𝒀𝗂𝗆(ϕ)𝗌𝗄(ϕ)].C_{\phi}:=\operatorname*{\mathbb{E}}[{\boldsymbol{Y}}^{\mathsf{im}(\phi)-\mathsf{sk}(\phi)}]\,.
Lemma C.4.

𝔼[EA,32]=o(1)\operatorname*{\mathbb{E}}[\mathscrsfs{E}_{A,3}^{2}]=o(1).

Proof.

Define 𝗉𝖺𝗍1(A,B)\mathsf{pat}_{1}(A,B) as in the proof of Lemma C.3 except with L1L_{1} in place of L2L_{2}. Our goal is to bound

𝔼[EA,32]\displaystyle\operatorname*{\mathbb{E}}[\mathscrsfs{E}_{A,3}^{2}] =1|𝖾𝗆𝖻(A)|ϕ1,ϕ2L1(A)𝔼(𝒀ϕ1Cϕ1𝒀𝗌𝗄(ϕ1))(𝒀ϕ2Cϕ2𝒀𝗌𝗄(ϕ2))\displaystyle=\frac{1}{|\mathsf{emb}(A)|}\sum_{\phi_{1},\phi_{2}\in L_{1}(A)}\operatorname*{\mathbb{E}}\left({\boldsymbol{Y}}^{\phi_{1}}-C_{\phi_{1}}{\boldsymbol{Y}}^{\mathsf{sk}(\phi_{1})}\right)\left({\boldsymbol{Y}}^{\phi_{2}}-C_{\phi_{2}}{\boldsymbol{Y}}^{\mathsf{sk}(\phi_{2})}\right)
=1|𝖾𝗆𝖻(A)|Π𝗉𝖺𝗍1(A,A)(ϕ1,ϕ2)𝖾𝗆𝖻(Π)𝔼(𝒀ϕ1Cϕ1𝒀𝗌𝗄(ϕ1))(𝒀ϕ2Cϕ2𝒀𝗌𝗄(ϕ2)).\displaystyle=\frac{1}{|\mathsf{emb}(A)|}\sum_{\Pi\in\mathsf{pat}_{1}(A,A)}\;\sum_{(\phi_{1},\phi_{2})\in\mathsf{emb}(\Pi)}\operatorname*{\mathbb{E}}\left({\boldsymbol{Y}}^{\phi_{1}}-C_{\phi_{1}}{\boldsymbol{Y}}^{\mathsf{sk}(\phi_{1})}\right)\left({\boldsymbol{Y}}^{\phi_{2}}-C_{\phi_{2}}{\boldsymbol{Y}}^{\mathsf{sk}(\phi_{2})}\right).

For ϕL1(A)\phi\in L_{1}(A), letting 𝗌𝗄¯(ϕ)=𝗂𝗆(ϕ)𝗌𝗄(ϕ)\overline{\mathsf{sk}}(\phi)=\mathsf{im}(\phi)-\mathsf{sk}(\phi) (throughout this proof we adopt the shorthand 𝗂𝗆(ϕ)=𝗂𝗆(ϕ;A)\mathsf{im}(\phi)=\mathsf{im}(\phi;A) since the base graph AA is fixed),

𝒀ϕCϕ𝒀𝗌𝗄(ϕ)\displaystyle{\boldsymbol{Y}}^{\phi}-C_{\phi}{\boldsymbol{Y}}^{\mathsf{sk}(\phi)} =𝒀𝗌𝗄(ϕ)(𝒀𝗌𝗄¯(ϕ)𝔼𝒀𝗂𝗆(ϕ)𝗌𝗄(ϕ))\displaystyle={\boldsymbol{Y}}^{\mathsf{sk}(\phi)}\left({\boldsymbol{Y}}^{\overline{\mathsf{sk}}(\phi)}-\operatorname*{\mathbb{E}}{\boldsymbol{Y}}^{\mathsf{im}(\phi)-\mathsf{sk}(\phi)}\right)
=(0𝜶𝗌𝗄(ϕ)𝑿𝗌𝗄(ϕ)𝜶𝒁𝜶)0𝜷𝗌𝗄¯(ϕ)(𝑿𝗌𝗄¯(ϕ)𝜷𝒁𝜷𝔼[𝑿𝗌𝗄¯(ϕ)𝜷𝒁𝜷]).\displaystyle=\left(\sum_{0\leq{\boldsymbol{\alpha}}\leq\mathsf{sk}(\phi)}{\boldsymbol{X}}^{\mathsf{sk}(\phi)-{\boldsymbol{\alpha}}}{\boldsymbol{Z}}^{\boldsymbol{\alpha}}\right)\sum_{0\leq{\boldsymbol{\beta}}\leq\overline{\mathsf{sk}}(\phi)}\left({\boldsymbol{X}}^{\overline{\mathsf{sk}}(\phi)-{\boldsymbol{\beta}}}{\boldsymbol{Z}}^{\boldsymbol{\beta}}-\operatorname*{\mathbb{E}}[{\boldsymbol{X}}^{\overline{\mathsf{sk}}(\phi)-{\boldsymbol{\beta}}}{\boldsymbol{Z}}^{\boldsymbol{\beta}}]\right).

This means

𝔼(𝒀ϕ1Cϕ1𝒀𝗌𝗄(ϕ1))(𝒀ϕ2Cϕ2𝒀𝗌𝗄(ϕ2))\displaystyle\operatorname*{\mathbb{E}}\left({\boldsymbol{Y}}^{\phi_{1}}-C_{\phi_{1}}{\boldsymbol{Y}}^{\mathsf{sk}(\phi_{1})}\right)\left({\boldsymbol{Y}}^{\phi_{2}}-C_{\phi_{2}}{\boldsymbol{Y}}^{\mathsf{sk}(\phi_{2})}\right) (110)
=𝟎𝜶𝗌𝗄(ϕ1)𝟎𝜷𝗌𝗄¯(ϕ1)𝟎𝜸𝗌𝗄(ϕ2)𝟎𝜹𝗌𝗄¯(ϕ2)𝑿𝗌𝗄(ϕ1)𝜶𝒁𝜶𝑿𝗌𝗄(ϕ2)𝜸𝒁𝜸(𝑿𝗌𝗄¯(ϕ1)𝜷𝒁𝜷𝔼[𝑿𝗌𝗄¯(ϕ1)𝜷𝒁𝜷])(𝑿𝗌𝗄¯(ϕ2)𝜹𝒁𝜹𝔼[𝑿𝗌𝗄¯(ϕ2)𝜹𝒁𝜹]).\displaystyle=\sum_{\begin{subarray}{c}{\boldsymbol{0}}\leq{\boldsymbol{\alpha}}\leq\mathsf{sk}(\phi_{1})\\ {\boldsymbol{0}}\leq{\boldsymbol{\beta}}\leq\overline{\mathsf{sk}}(\phi_{1})\\ {\boldsymbol{0}}\leq{\boldsymbol{\gamma}}\leq\mathsf{sk}(\phi_{2})\\ {\boldsymbol{0}}\leq{\boldsymbol{\delta}}\leq\overline{\mathsf{sk}}(\phi_{2})\end{subarray}}{\boldsymbol{X}}^{\mathsf{sk}(\phi_{1})-{\boldsymbol{\alpha}}}{\boldsymbol{Z}}^{\boldsymbol{\alpha}}{\boldsymbol{X}}^{\mathsf{sk}(\phi_{2})-{\boldsymbol{\gamma}}}{\boldsymbol{Z}}^{\boldsymbol{\gamma}}\left({\boldsymbol{X}}^{\overline{\mathsf{sk}}(\phi_{1})-{\boldsymbol{\beta}}}{\boldsymbol{Z}}^{\boldsymbol{\beta}}-\operatorname*{\mathbb{E}}[{\boldsymbol{X}}^{\overline{\mathsf{sk}}(\phi_{1})-{\boldsymbol{\beta}}}{\boldsymbol{Z}}^{\boldsymbol{\beta}}]\right)\left({\boldsymbol{X}}^{\overline{\mathsf{sk}}(\phi_{2})-{\boldsymbol{\delta}}}{\boldsymbol{Z}}^{\boldsymbol{\delta}}-\operatorname*{\mathbb{E}}[{\boldsymbol{X}}^{\overline{\mathsf{sk}}(\phi_{2})-{\boldsymbol{\delta}}}{\boldsymbol{Z}}^{\boldsymbol{\delta}}]\right).

We will next claim that certain terms in the sum in (110) are zero. First note that we must have αij+βij+γij+δij\alpha_{ij}+\beta_{ij}+\gamma_{ij}+\delta_{ij} is even for every edge iji\leq j, or else the corresponding term in (110) is zero due to the ZZ factors. Therefore, for (ϕ1,ϕ2)𝖾𝗆𝖻(Π)(\phi_{1},\phi_{2})\in\mathsf{emb}(\Pi), the value of (110) is O(n12𝗈𝖽𝖽(Π))O(n^{-\frac{1}{2}\mathsf{odd}(\Pi)}) where, recall, 𝗈𝖽𝖽(Π)\mathsf{odd}(\Pi) denotes the number of odd-edges in Π\Pi (i.e., the number of edges iji\leq j for which 𝗂𝗆(ϕ1)ij+𝗂𝗆(ϕ2)ij\mathsf{im}(\phi_{1})_{ij}+\mathsf{im}(\phi_{2})_{ij} is odd).

We will improve the bound O(n12𝗈𝖽𝖽(Π))O(n^{-\frac{1}{2}\mathsf{odd}(\Pi)}) in certain cases. Let P(Π)P(\Pi) denote the property that Π\Pi has no triple-edges or higher (i.e., 𝗂𝗆(ϕ1)ij+𝗂𝗆(ϕ2)ij2\mathsf{im}(\phi_{1})_{ij}+\mathsf{im}(\phi_{2})_{ij}\leq 2 for all iji\leq j) and no cycles (when Π\Pi is viewed as a simple graph by replacing multi-edges by single-edges). We claim that (110) is O(n12𝗈𝖽𝖽(Π)1)O(n^{-\frac{1}{2}\mathsf{odd}(\Pi)-1}) whenever P(Π)P(\Pi) holds. To prove this, first identify the “bridges,” i.e., edges (i,j)(i,j) for which 𝗂𝗆(ϕ1)ij=2\mathsf{im}(\phi_{1})_{ij}=2 and one endpoint of (i,j)(i,j) belongs to V(𝗌𝗄(ϕ1))V(\mathsf{sk}(\phi_{1})). At least one bridge must exist by the definition of L1L_{1}. Using the definition of P(Π)P(\Pi), 𝗌𝗄¯(ϕ1)\overline{\mathsf{sk}}(\phi_{1}) shares no vertices with 𝗂𝗆(ϕ2)\mathsf{im}(\phi_{2}) except possibly one endpoint of each bridge. As a result, some bridge (i,j)(i,j) must have βij=0\beta_{ij}=0 or else the corresponding term in (110) is zero; to see this, note that if every bridge has βij=2\beta_{ij}=2 then the factor (𝑿𝗌𝗄¯(ϕ1)𝜷𝒁𝜷𝔼[𝑿𝗌𝗄¯(ϕ1)𝜷𝒁𝜷])({\boldsymbol{X}}^{\overline{\mathsf{sk}}(\phi_{1})-{\boldsymbol{\beta}}}{\boldsymbol{Z}}^{\boldsymbol{\beta}}-\operatorname*{\mathbb{E}}[{\boldsymbol{X}}^{\overline{\mathsf{sk}}(\phi_{1})-{\boldsymbol{\beta}}}{\boldsymbol{Z}}^{\boldsymbol{\beta}}]) has mean zero and is independent from the other factors in (110). This gives the desired improvement to O(n12𝗈𝖽𝖽(Π)1)O(n^{-\frac{1}{2}\mathsf{odd}(\Pi)-1}).

We now have

𝔼[EA,32]\displaystyle\operatorname*{\mathbb{E}}[\mathscrsfs{E}_{A,3}^{2}] =1|𝖾𝗆𝖻(A)|Π𝗉𝖺𝗍1(A,A)(ϕ1,ϕ2)𝖾𝗆𝖻(Π)O(n12𝗈𝖽𝖽(Π)𝟙P(Π))\displaystyle=\frac{1}{|\mathsf{emb}(A)|}\sum_{\Pi\in\mathsf{pat}_{1}(A,A)}\;\sum_{(\phi_{1},\phi_{2})\in\mathsf{emb}(\Pi)}O(n^{-\frac{1}{2}\mathsf{odd}(\Pi)-\mathbbm{1}_{P(\Pi)}})
=Π𝗉𝖺𝗍1(A,A)O(n(|V(A)|1)+(|V(Π)|1)12𝗈𝖽𝖽(Π)𝟙P(Π))\displaystyle=\sum_{\Pi\in\mathsf{pat}_{1}(A,A)}O(n^{-(|V(A)|-1)+(|V(\Pi)|-1)-\frac{1}{2}\mathsf{odd}(\Pi)-\mathbbm{1}_{P(\Pi)}})
=Π𝗉𝖺𝗍1(A,A)O(n|V(A)|+|V(Π)|12𝗈𝖽𝖽(Π)𝟙P(Π))\displaystyle=\sum_{\Pi\in\mathsf{pat}_{1}(A,A)}O(n^{-|V(A)|+|V(\Pi)|-\frac{1}{2}\mathsf{odd}(\Pi)-\mathbbm{1}_{P(\Pi)}})
and now repeating the argument from the proof of Lemma C.3
=Π𝗉𝖺𝗍1(A,A)O(n|V(Π)|12|E(Π)|112𝗈𝖽𝖽(Π)𝟙P(Π))\displaystyle=\sum_{\Pi\in\mathsf{pat}_{1}(A,A)}O(n^{|V(\Pi)|-\frac{1}{2}|E(\Pi)|-1-\frac{1}{2}\mathsf{odd}(\Pi)-\mathbbm{1}_{P(\Pi)}})
Π𝗉𝖺𝗍1(A,A)O(n(o+e+𝟙cycle+𝟙P(Π)))\displaystyle\leq\sum_{\Pi\in\mathsf{pat}_{1}(A,A)}O(n^{-(o+e+\mathbbm{1}_{\mathrm{cycle}}+\mathbbm{1}_{P(\Pi)})})
O(n1),\displaystyle\leq O(n^{-1})\,,

completing the proof. ∎

It remains to handle the first term in (109). For each ϕL1\phi\in L_{1}, 𝗂𝗆(ϕ)\mathsf{im}(\phi) is isomorphic (as a rooted multigraph) to a connected rooted multigraph that becomes a rooted tree once multi-edges are collapsed, contains only single and double edges, and contains at least one double edge; let 𝗉𝖺𝗍1(A)\mathsf{pat}_{1}(A) denote the set of such multigraphs, and for Π𝗉𝖺𝗍1(A)\Pi\in\mathsf{pat}_{1}(A) write ϕ𝖾𝗆𝖻(Π)\phi\in\mathsf{emb}(\Pi) when 𝗂𝗆(ϕ)\mathsf{im}(\phi) is isomorphic to Π\Pi. The remaining term to handle is

1|𝖾𝗆𝖻(A)|ϕL1Cϕ𝒀𝗌𝗄(ϕ)\displaystyle\frac{1}{\sqrt{|\mathsf{emb}(A)|}}\sum_{\phi\in L_{1}}C_{\phi}{\boldsymbol{Y}}^{\mathsf{sk}(\phi)} =1|𝖾𝗆𝖻(A)|Π𝗉𝖺𝗍1(A)ϕ𝖾𝗆𝖻(Π)Cϕ𝒀𝗌𝗄(ϕ).\displaystyle=\frac{1}{\sqrt{|\mathsf{emb}(A)|}}\sum_{\Pi\in\mathsf{pat}_{1}(A)}\,\sum_{\phi\in\mathsf{emb}(\Pi)}C_{\phi}{\boldsymbol{Y}}^{\mathsf{sk}(\phi)}.
Note that CΠ:=CϕC_{\Pi}:=C_{\phi} depends only on Π\Pi (not ϕ\phi). Also write 𝖲𝗄(Π)\mathsf{Sk}(\Pi) for the rooted tree isomorphic to 𝗌𝗄(ϕ)\mathsf{sk}(\phi), which again only depends on Π\Pi. The above becomes
=1|𝖾𝗆𝖻(A)|Π𝗉𝖺𝗍1(A)CΠϕ𝖾𝗆𝖻(𝖲𝗄(Π))𝒀ϕ(1+o(1))n|V(Π)||V(𝖲𝗄(Π))|\displaystyle=\frac{1}{\sqrt{|\mathsf{emb}(A)|}}\sum_{\Pi\in\mathsf{pat}_{1}(A)}C_{\Pi}\sum_{\phi\in\mathsf{emb}(\mathsf{Sk}(\Pi))}{\boldsymbol{Y}}^{\phi}\,(1+o(1))\,n^{|V(\Pi)|-|V(\mathsf{Sk}(\Pi))|}
=1+o(1)|𝖾𝗆𝖻(A)|Π𝗉𝖺𝗍1(A)CΠn|V(Π)||V(𝖲𝗄(Π))||𝖾𝗆𝖻(𝖲𝗄(Π))|G𝖲𝗄(Π)(𝒀).\displaystyle=\frac{1+o(1)}{\sqrt{|\mathsf{emb}(A)|}}\sum_{\Pi\in\mathsf{pat}_{1}(A)}C_{\Pi}\,n^{|V(\Pi)|-|V(\mathsf{Sk}(\Pi))|}\sqrt{|\mathsf{emb}(\mathsf{Sk}(\Pi))|}\,G_{\mathsf{Sk}(\Pi)}({\boldsymbol{Y}})\,. (111)

The number of double-edges in Π\Pi is d(Π)=|V(A)||V(Π)|d(\Pi)=|V(A)|-|V(\Pi)|. We have

|𝖾𝗆𝖻(A)|=(1+o(1))n|V(A)|1,|\mathsf{emb}(A)|=(1+o(1))n^{|V(A)|-1},
CΠ=(𝖼𝗈𝗇𝗌𝗍(Π)+o(1))n12(|V(Π)||V(𝖲𝗄(Π))|d(Π))=(𝖼𝗈𝗇𝗌𝗍(Π)+o(1))n12(2|V(Π)||V(𝖲𝗄(Π))||V(A)|).C_{\Pi}=(\mathsf{const}(\Pi)+o(1))n^{-\frac{1}{2}(|V(\Pi)|-|V(\mathsf{Sk}(\Pi))|-d(\Pi))}=(\mathsf{const}(\Pi)+o(1))n^{-\frac{1}{2}(2|V(\Pi)|-|V(\mathsf{Sk}(\Pi))|-|V(A)|)}.

Now (111) becomes

Π𝗉𝖺𝗍1(A)(𝖼𝗈𝗇𝗌𝗍(Π)+o(1))n12(|V(A)|1)12(2|V(Π)||V(𝖲𝗄(Π))||V(A)|)+|V(Π)||V(𝖲𝗄(Π))|+12(|V(𝖲𝗄(Π))|1)G𝖲𝗄(Π)(𝒀)\displaystyle\sum_{\Pi\in\mathsf{pat}_{1}(A)}(\mathsf{const}(\Pi)+o(1))n^{-\frac{1}{2}(|V(A)|-1)-\frac{1}{2}(2|V(\Pi)|-|V(\mathsf{Sk}(\Pi))|-|V(A)|)+|V(\Pi)|-|V(\mathsf{Sk}(\Pi))|+\frac{1}{2}(|V(\mathsf{Sk}(\Pi))|-1)}G_{\mathsf{Sk}(\Pi)}({\boldsymbol{Y}})
=Π𝗉𝖺𝗍1(A)(𝖼𝗈𝗇𝗌𝗍(Π)+o(1))G𝖲𝗄(Π)(𝒀)\displaystyle\quad=\sum_{\Pi\in\mathsf{pat}_{1}(A)}(\mathsf{const}(\Pi)+o(1))\,G_{\mathsf{Sk}(\Pi)}({\boldsymbol{Y}})
=B𝒯D:|E(B)|<|E(A)|(𝖼𝗈𝗇𝗌𝗍(A,B)+o(1))GB(𝒀)\displaystyle\quad=\sum_{B\in\mathcal{T}_{\leq D}\,:\,|E(B)|<|E(A)|}(\mathsf{const}(A,B)+o(1))\,G_{B}({\boldsymbol{Y}})
=:B𝒯D:|E(B)|<|E(A)|𝖼𝗈𝗇𝗌𝗍(A,B)GB(𝒀)+EA,4\displaystyle\quad=:\sum_{B\in\mathcal{T}_{\leq D}\,:\,|E(B)|<|E(A)|}\mathsf{const}(A,B)\,G_{B}({\boldsymbol{Y}})+\mathscrsfs{E}_{A,4}

where 𝔼[EA,42]=o(1)\operatorname*{\mathbb{E}}[\mathscrsfs{E}_{A,4}^{2}]=o(1) because 𝔼[GB2]=O(1)\operatorname*{\mathbb{E}}[G_{B}^{2}]=O(1) (similarly to the proof of Lemma C.3).

Proof of Lemma 4.11.

Summarizing the above, we have shown how to write

HA=𝖼𝗈𝗇𝗌𝗍(A)FEA,0+GA\mathscrsfs{H}_{A}=\mathsf{const}(A)\mathscrsfs{F}_{\emptyset}-\mathscrsfs{E}_{A,0}+G_{A}

where

GA=FA+EA,1EA,2EA,3EA,4B𝒯D:|E(B)|<|E(A)|𝖼𝗈𝗇𝗌𝗍(A,B)GB.G_{A}=\mathscrsfs{F}_{A}+\mathscrsfs{E}_{A,1}-\mathscrsfs{E}_{A,2}-\mathscrsfs{E}_{A,3}-\mathscrsfs{E}_{A,4}-\sum_{B\in\mathcal{T}_{\leq D}\,:\,|E(B)|<|E(A)|}\mathsf{const}(A,B)\,G_{B}\,.

Using induction on |E(A)||E(A)| we can apply the same procedure to expand GBG_{B} in the basis {FC:C𝒯D,|E(C)|<|E(B)|}\{\mathscrsfs{F}_{C}:C\in\mathcal{T}_{\leq D},\,|E(C)|<|E(B)|\} plus an error term. Recalling (105), this now gives the desired expansion for HA\mathscrsfs{H}_{A}. ∎
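
As an illustration of the resulting expansion, let A be the single edge at the root. The only B∈𝒯_{≤D} with fewer edges is ∅, so the procedure above specializes to

\mathscrsfs{H}_{A}=\mathscrsfs{F}_{A}+c_{0}\,\mathscrsfs{F}_{\emptyset}+\mathscrsfs{E}_{A}\,,\qquad\operatorname*{\mathbb{E}}[\mathscrsfs{E}_{A}^{2}]=o(1)\,,

for a deterministic constant c₀=c₀(A) independent of n.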

Appendix D Proof of Lemma 4.13

Let (𝒁(t))t({\boldsymbol{Z}}(t))_{t\in{\mathbb{Z}}}, 𝒁(t)n×n{\boldsymbol{Z}}(t)\in{\mathbb{R}}^{n\times n} be a collection of i.i.d. copies of 𝒁{\boldsymbol{Z}}, and define

𝒀(t)=1n𝜽𝜽𝖳+𝒁(t).\displaystyle{\boldsymbol{Y}}(t)=\frac{1}{\sqrt{n}}{\boldsymbol{\theta}}{\boldsymbol{\theta}}^{{\sf T}}+{\boldsymbol{Z}}(t)\,. (112)

We will use the sequence of random matrices 𝒀:={𝒀(t)}{\boldsymbol{Y}}_{*}:=\{{\boldsymbol{Y}}(t)\} only as a device for the analysis of the algorithm, and not in the actual estimation procedure.

We extend the definition of tree structured polynomials (cf. Eq. (32)) to such sequences of random matrices via

FTt(𝒀)=1|𝗇𝗋(T)|ϕ𝗇𝗋(T)(i,j)E(T)Yϕ(i),ϕ(j)(t𝖽T((i,j),)).\mathscrsfs{F}^{t}_{T}({\boldsymbol{Y}}_{*})=\frac{1}{\sqrt{|\mathsf{nr}(T)|}}\sum_{\phi\in\mathsf{nr}(T)}\,\prod_{(i,j)\in E(T)}Y_{\phi(i),\phi(j)}(t-{\sf d}_{T}((i,j),\circ))\,. (113)

Here 𝖽T((i,j),):=max(𝖽T(i,),𝖽T(j,)){\sf d}_{T}((i,j),\circ):=\max({\sf d}_{T}(i,\circ),{\sf d}_{T}(j,\circ)). When the argument is a single matrix 𝒀{\boldsymbol{Y}}, FTt(𝒀)\mathscrsfs{F}^{t}_{T}({\boldsymbol{Y}}) is defined by applying Eq. (113) to the constant sequence of matrices 𝒀(s)=𝒀{\boldsymbol{Y}}(s)={\boldsymbol{Y}} for all ss\in{\mathbb{Z}} (thus recovering Eq. (32)).
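
For example, if T is the rooted path ∘−v_1−v_2, the edge (∘,v_1) is at distance 1 from the root and (v_1,v_2) at distance 2, so Eq. (113) reads

\mathscrsfs{F}^{t}_{T}({\boldsymbol{Y}}_{*})=\frac{1}{\sqrt{|\mathsf{nr}(T)|}}\sum_{\phi\in\mathsf{nr}(T)}Y_{\phi(\circ),\phi(v_{1})}(t-1)\,Y_{\phi(v_{1}),\phi(v_{2})}(t-2)\,,

each factor being evaluated at the time index determined by the depth of the corresponding edge.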

The random variable FTt(𝒀)\mathscrsfs{F}^{t}_{T}({\boldsymbol{Y}}_{*}) does depend on tt; however, its distribution does not. We will therefore often omit the superscript tt. (In the case of FTt(𝒀)\mathscrsfs{F}^{t}_{T}({\boldsymbol{Y}}), the random variable itself does not depend on tt.)

We define a subset of the family of non-reversing labelings of T𝒯DT\in\mathcal{T}_{\leq D}.

Definition D.1.

A labeling ϕ\phi of T𝒯DT\in\mathcal{T}_{\leq D} is said to be strongly non-reversing if it is non-reversing and for any two edges (i,j)(i,j), (k,l)E(T)(k,l)\in E(T), with 𝖽T((i,j),)𝖽T((k,l),){\sf d}_{T}((i,j),\circ)\neq{\sf d}_{T}((k,l),\circ), we have (ϕ(i),ϕ(j))(ϕ(k),ϕ(l))(\phi(i),\phi(j))\neq(\phi(k),\phi(l)) (as unordered pairs). We denote by 𝗇𝗋~(T)\tilde{\sf nr}(T) the set of all strongly non-reversing labelings of TT.

A pair of labelings ϕ1𝗇𝗋(T1)\phi_{1}\in\mathsf{nr}(T_{1}), ϕ2𝗇𝗋(T2)\phi_{2}\in\mathsf{nr}(T_{2}) is said to be jointly strongly non-reversing if each of them is strongly non-reversing and, for any two edges (i,j)E(T1)(i,j)\in E(T_{1}), (k,l)E(T2)(k,l)\in E(T_{2}), 𝖽T1((i,j),)𝖽T2((k,l),){\sf d}_{T_{1}}((i,j),\circ)\neq{\sf d}_{T_{2}}((k,l),\circ) implies (ϕ1(i),ϕ1(j))(ϕ2(k),ϕ2(l))(\phi_{1}(i),\phi_{1}(j))\neq(\phi_{2}(k),\phi_{2}(l)). We denote by 𝗇𝗋~(T1,T2)\tilde{\sf nr}(T_{1},T_{2}) the set of such pairs.
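
To illustrate Definition D.1, suppose T is the rooted path with four edges ∘−v_1−v_2−v_3−v_4 and consider the labeling ϕ(∘)=1, ϕ(v_1)=2, ϕ(v_2)=3, ϕ(v_3)=1, ϕ(v_4)=2. No two consecutive edges are mapped to the same unordered pair of labels, but the first and the fourth edge (at depths 1 and 4 from the root) are both mapped to {1,2}; hence ϕ is not strongly non-reversing.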

We define the modified polynomials F~Tt(𝐘)\tilde{\mathscrsfs{F}}^{t}_{T}({\boldsymbol{Y}}), F~Tt(𝐘)\tilde{\mathscrsfs{F}}^{t}_{T}({\boldsymbol{Y}}_{*}) by restricting the sum to 𝗇𝗋~(T)\tilde{\sf nr}(T) in Eqs. (32), (113).

We also define

FT1,T2t(𝒀):=1|𝗇𝗋(T1)||𝗇𝗋(T2)|\displaystyle\mathscrsfs{F}^{t}_{T_{1},T_{2}}({\boldsymbol{Y}}_{*}):=\frac{1}{\sqrt{|\mathsf{nr}(T_{1})|\cdot|\mathsf{nr}(T_{2})|}}\;\cdot
(ϕ1,ϕ2)𝗇𝗋~(T1,T2)(i,j)E(T1)𝒀ϕ1(i),ϕ1(j)(t𝖽T1((i,j),))(i,j)E(T2)𝒀ϕ2(i),ϕ2(j)(t𝖽T2((i,j),)).\displaystyle\sum_{(\phi_{1},\phi_{2})\in\tilde{\sf nr}(T_{1},T_{2})}\,\prod_{(i,j)\in E(T_{1})}{\boldsymbol{Y}}_{\phi_{1}(i),\phi_{1}(j)}(t-{\sf d}_{T_{1}}((i,j),\circ))\prod_{(i,j)\in E(T_{2})}{\boldsymbol{Y}}_{\phi_{2}(i),\phi_{2}(j)}(t-{\sf d}_{T_{2}}((i,j),\circ))\,.

As before, FT1,T2t(𝐘)\mathscrsfs{F}^{t}_{T_{1},T_{2}}({\boldsymbol{Y}}) is obtained from the above definition by setting 𝐘(s)=𝐘{\boldsymbol{Y}}(s)={\boldsymbol{Y}} for all ss\in{\mathbb{Z}}.

As we next see, moving from non-reversing to strongly non-reversing labelings has a negligible effect.

Lemma D.2.

Assume πΘ\pi_{\Theta} to have finite moments of all orders, and |ψ(θ)|B(1+|θ|)B|\psi(\theta)|\leq B(1+|\theta|)^{B} for a constant BB. Then, for any fixed T,T1,T2𝒯DT,T_{1},T_{2}\in\mathcal{T}_{\leq D}, there exist constants C=C(B,T)C_{*}=C_{*}(B,T), C#=C#(B,T1,T2)C_{\#}=C_{\#}(B,T_{1},T_{2}) such that

|𝔼[ψ(θ1)FT(𝒀)]𝔼[ψ(θ1)F~T(𝒀)]|Cn,|𝔼[FT1(𝒀)FT2(𝒀)]𝔼[F~T1,T2(Y)]|C#n,\left|\operatorname{\mathbb{E}}[\psi(\theta_{1})\mathscrsfs{F}_{T}({\boldsymbol{Y}})]-\operatorname{\mathbb{E}}[\psi(\theta_{1})\tilde{\mathscrsfs{F}}_{T}({\boldsymbol{Y}})]\right|\leq\frac{C_{*}}{\sqrt{n}}\,,\;\;\;\;\left|\operatorname{\mathbb{E}}[\mathscrsfs{F}_{T_{1}}({\boldsymbol{Y}})\mathscrsfs{F}_{T_{2}}({\boldsymbol{Y}})]-\operatorname{\mathbb{E}}[\tilde{\mathscrsfs{F}}_{T_{1},T_{2}}(Y)]\right|\leq\frac{C_{\#}}{\sqrt{n}}\,, (114)
|𝔼[ψ(θ1)FTt(𝒀)]𝔼[ψ(θ1)F~Tt(𝒀)]|Cn,|𝔼[FT1t(𝒀)FT2t(𝒀)]𝔼[F~T1,T2t(𝒀)]|C#n.\left|\operatorname{\mathbb{E}}[\psi(\theta_{1})\mathscrsfs{F}^{t}_{T}({\boldsymbol{Y}}_{*})]-\operatorname{\mathbb{E}}[\psi(\theta_{1})\tilde{\mathscrsfs{F}}^{t}_{T}({\boldsymbol{Y}}_{*})]\right|\leq\frac{C_{*}}{\sqrt{n}},\;\;\;\;\left|\operatorname{\mathbb{E}}[\mathscrsfs{F}^{t}_{T_{1}}({\boldsymbol{Y}}_{*})\mathscrsfs{F}^{t}_{T_{2}}({\boldsymbol{Y}}_{*})]-\operatorname{\mathbb{E}}[\tilde{\mathscrsfs{F}}^{t}_{T_{1},T_{2}}({\boldsymbol{Y}}_{*})]\right|\leq\frac{C_{\#}}{\sqrt{n}}\,. (115)
Proof.

We will prove Eq. (114) since (115) follows by the same argument. Considering the first bound, note that

|𝔼[ψ(θ1)FT(𝒀)]𝔼[ψ(θ1)F~T(𝒀)]|\displaystyle\left|\operatorname{\mathbb{E}}[\psi(\theta_{1})\mathscrsfs{F}_{T}({\boldsymbol{Y}})]-\operatorname{\mathbb{E}}[\psi(\theta_{1})\tilde{\mathscrsfs{F}}_{T}({\boldsymbol{Y}})]\right|
1|𝗇𝗋(T)|ϕ𝗇𝗋(T)𝗇𝗋~(T)|𝔼[ψ(θ1)(i,j)E(T)Yϕ1(i),ϕ1(j)]|\displaystyle\leq\frac{1}{\sqrt{|\mathsf{nr}(T)|}}\sum_{\phi\in\mathsf{nr}(T)\setminus\tilde{\sf nr}(T)}\left|\operatorname{\mathbb{E}}\Big{[}\psi(\theta_{1})\prod_{(i,j)\in E(T)}Y_{\phi_{1}(i),\phi_{1}(j)}\Big{]}\right|
1|𝗇𝗋(T)|Π𝗉𝖺𝗍′′(T)ϕ𝖾𝗆𝖻(Π)|𝔼[ψ(θ1)(i,j)E(T)Yϕ1(i),ϕ1(j)]|,\displaystyle\leq\frac{1}{\sqrt{|\mathsf{nr}(T)|}}\sum_{\Pi\in\mathsf{pat}^{\prime\prime}(T)}\sum_{\phi\in\mathsf{emb}(\Pi)}\left|\operatorname{\mathbb{E}}\Big{[}\psi(\theta_{1})\prod_{(i,j)\in E(T)}Y_{\phi_{1}(i),\phi_{1}(j)}\Big{]}\right|\,, (116)

where in the last line 𝗉𝖺𝗍′′(T)\mathsf{pat}^{\prime\prime}(T) denotes the equivalence classes of 𝗇𝗋(T)𝗇𝗋~(T)\mathsf{nr}(T)\setminus\tilde{\sf nr}(T) under rooted graph homomorphisms. Now the expectation in the last line only depends on Π\Pi. Taking expectation conditional on θ\theta, we get

|𝔼[ψ(θ1)(i,j)E(T)Yϕ1(i),ϕ1(j)θ]||ψ(θ1)|1ijnαij(ϕ)=1|θiθj|n1ijnαij(ϕ)22αij(ϕ)1(|θiθj|n+Cαij).\displaystyle\Big{|}\operatorname{\mathbb{E}}\Big{[}\psi(\theta_{1})\prod_{(i,j)\in E(T)}Y_{\phi_{1}(i),\phi_{1}(j)}\;\Big{|}\;\theta\Big{]}\Big{|}\leq|\psi(\theta_{1})|\prod_{\begin{subarray}{c}1\leq i\leq j\leq n\\ \alpha_{ij}(\phi)=1\end{subarray}}\frac{|\theta_{i}\theta_{j}|}{\sqrt{n}}\prod_{\begin{subarray}{c}1\leq i\leq j\leq n\\ \alpha_{ij}(\phi)\geq 2\end{subarray}}2^{\alpha_{ij}(\phi)-1}\left(\frac{|\theta_{i}\theta_{j}|}{\sqrt{n}}+C\sqrt{\alpha_{ij}}\right)\,.

Here α(ϕ)\alpha(\phi) is the image of the labeling ϕ\phi. Next, taking the expectation with respect to θ\theta and using that all moments are finite,

|𝔼[ψ(θ1)(i,j)E(T)Yϕ1(i),ϕ1(j)]|C(ψ,Π)ne1(Π)/2,\displaystyle\Big{|}\operatorname{\mathbb{E}}\Big{[}\psi(\theta_{1})\prod_{(i,j)\in E(T)}Y_{\phi_{1}(i),\phi_{1}(j)}\Big{]}\Big{|}\leq C(\psi,\Pi)\,n^{-e_{1}(\Pi)/2}\,, (117)

where e1(Π)e_{1}(\Pi) is the number of edges with multiplicity 11 in Π\Pi.

Using this bound in Eq. (116) and noting that |𝖾𝗆𝖻(Π)|nv(Π)1|\mathsf{emb}(\Pi)|\leq n^{v(\Pi)-1} (where v(Π)v(\Pi) is the number of vertices in Π\Pi), we get

|𝔼[ψ(θ1)FT(𝒀)]𝔼[ψ(θ1)F~T(𝒀)]|\displaystyle\left|\operatorname{\mathbb{E}}[\psi(\theta_{1})\mathscrsfs{F}_{T}({\boldsymbol{Y}})]-\operatorname{\mathbb{E}}[\psi(\theta_{1})\tilde{\mathscrsfs{F}}_{T}({\boldsymbol{Y}})]\right|
C1|𝗇𝗋(T)|Π𝗉𝖺𝗍′′(T)nv(Π)1e1(Π)/2\displaystyle\leq C\frac{1}{\sqrt{|\mathsf{nr}(T)|}}\sum_{\Pi\in\mathsf{pat}^{\prime\prime}(T)}n^{v(\Pi)-1-e_{1}(\Pi)/2}
CΠ𝗉𝖺𝗍′′(T)nv(Π)1e1(Π)/2m(Π)/2,\displaystyle\leq C^{\prime}\sum_{\Pi\in\mathsf{pat}^{\prime\prime}(T)}n^{v(\Pi)-1-e_{1}(\Pi)/2-m(\Pi)/2}\,,

where m(Π)m(\Pi) is the number of edges of Π\Pi counted with their multiplicity. In the last step we used the fact that m(Π)=V(T)1m(\Pi)=V(T)-1 and |𝗇𝗋(T)|C0nV(T)1|\mathsf{nr}(T)|\geq C_{0}n^{V(T)-1}.

Next notice that, denoting by e2(Π)e_{\geq 2}(\Pi) the number of edges of Π\Pi with multiplicity at least two (each such edge counted once), we have

12e1(Π)+12m(Π)+1v(Π)\displaystyle\frac{1}{2}e_{1}(\Pi)+\frac{1}{2}m(\Pi)+1-v(\Pi) e1(Π)+e2(Π)+1v(Π)\displaystyle\geq e_{1}(\Pi)+e_{\geq 2}(\Pi)+1-v(\Pi)
𝗅𝗈𝗈𝗉(Π).\displaystyle\geq{\sf loop}(\Pi)\,.

where 𝗅𝗈𝗈𝗉(Π){\sf loop}(\Pi) is the number of self-loops in the projection of Π\Pi onto simple graphs (i.e. the graph obtained by replacing every multi-edge in Π\Pi by a single edge). For Π𝗉𝖺𝗍′′(T)\Pi\in\mathsf{pat}^{\prime\prime}(T), we have 𝗅𝗈𝗈𝗉(Π)1{\sf loop}(\Pi)\geq 1, whence the claim follows.

The proof for the second equation in Eq. (114) is very similar. Using the shorthand 𝗇𝗋𝗇𝗋~(T1,T2):=𝗇𝗋(T1)×𝗇𝗋(T2)𝗇𝗋~(T1,T2)\mathsf{nr}\setminus\tilde{\sf nr}(T_{1},T_{2}):=\mathsf{nr}(T_{1})\times\mathsf{nr}(T_{2})\setminus\tilde{\sf nr}(T_{1},T_{2}), we begin by writing

|𝔼[FT1(𝒀)FT2(𝒀)]𝔼[F~T1,T2(𝒀)]|\displaystyle\left|\operatorname{\mathbb{E}}[\mathscrsfs{F}_{T_{1}}({\boldsymbol{Y}})\mathscrsfs{F}_{T_{2}}({\boldsymbol{Y}})]-\operatorname{\mathbb{E}}[\tilde{\mathscrsfs{F}}_{T_{1},T_{2}}({\boldsymbol{Y}})]\right|
1|𝗇𝗋(T1)||𝗇𝗋(T2)|(ϕ1,ϕ2)𝗇𝗋𝗇𝗋~(T1,T2)|𝔼{(i,j)E(T1)Yϕ1(i),ϕ1(j)(i,j)E(T2)Yϕ2(i),ϕ2(j)}|\displaystyle\leq\frac{1}{\sqrt{|\mathsf{nr}(T_{1})|\cdot|\mathsf{nr}(T_{2})|}}\sum_{(\phi_{1},\phi_{2})\in\mathsf{nr}\setminus\tilde{\sf nr}(T_{1},T_{2})}\Big{|}\operatorname{\mathbb{E}}\Big{\{}\prod_{(i,j)\in E(T_{1})}Y_{\phi_{1}(i),\phi_{1}(j)}\prod_{(i,j)\in E(T_{2})}Y_{\phi_{2}(i),\phi_{2}(j)}\Big{\}}\Big{|}
1|𝗇𝗋(T1)||𝗇𝗋(T2)|Π𝗉𝖺𝗍′′(T1,T2)(ϕ1,ϕ2)𝖾𝗆𝖻(Π)|𝔼[(i,j)E(T1)Yϕ1(i),ϕ1(j)(i,j)E(T2)Yϕ2(i),ϕ2(j)]|.\displaystyle\leq\frac{1}{\sqrt{|\mathsf{nr}(T_{1})|\cdot|\mathsf{nr}(T_{2})|}}\sum_{\Pi\in\mathsf{pat}^{\prime\prime}(T_{1},T_{2})}\sum_{(\phi_{1},\phi_{2})\in\mathsf{emb}(\Pi)}\left|\operatorname{\mathbb{E}}\Big{[}\prod_{(i,j)\in E(T_{1})}Y_{\phi_{1}(i),\phi_{1}(j)}\prod_{(i,j)\in E(T_{2})}Y_{\phi_{2}(i),\phi_{2}(j)}\Big{]}\right|\,.

The only difference with respect to the previous case lies in the fact that 𝗉𝖺𝗍′′(T1,T2)\mathsf{pat}^{\prime\prime}(T_{1},T_{2}) is a collection of graphs with edges labeled by {1,2}\{1,2\}. (It is the set of equivalence classes of 𝗇𝗋𝗇𝗋~(T1,T2)\mathsf{nr}\setminus\tilde{\sf nr}(T_{1},T_{2}) under graph homomorphisms.)

Proceeding as before,

|𝔼[(i,j)E(T1)Yϕ1(i),ϕ1(j)(i,j)E(T2)Yϕ2(i),ϕ2(j)]|C(Π)ne1(Π).\displaystyle\left|\operatorname{\mathbb{E}}\Big{[}\prod_{(i,j)\in E(T_{1})}Y_{\phi_{1}(i),\phi_{1}(j)}\prod_{(i,j)\in E(T_{2})}Y_{\phi_{2}(i),\phi_{2}(j)}\Big{]}\right|\leq C(\Pi)\,n^{-e_{1}(\Pi)}\,.

Therefore

|𝔼[FT1(𝒀)FT2(𝒀)]𝔼[F~T1,T2(𝒀)]|1|𝗇𝗋(T1)||𝗇𝗋(T2)|Π𝗉𝖺𝗍′′(T1,T2)(ϕ1,ϕ2)𝖾𝗆𝖻(Π)C(Π)ne1(Π).\displaystyle\left|\operatorname{\mathbb{E}}[\mathscrsfs{F}_{T_{1}}({\boldsymbol{Y}})\mathscrsfs{F}_{T_{2}}({\boldsymbol{Y}})]-\operatorname{\mathbb{E}}[\tilde{\mathscrsfs{F}}_{T_{1},T_{2}}({\boldsymbol{Y}})]\right|\leq\frac{1}{\sqrt{|\mathsf{nr}(T_{1})|\cdot|\mathsf{nr}(T_{2})|}}\sum_{\Pi\in\mathsf{pat}^{\prime\prime}(T_{1},T_{2})}\sum_{(\phi_{1},\phi_{2})\in\mathsf{emb}(\Pi)}C(\Pi)\,n^{-e_{1}(\Pi)}\,.

Recall that |𝗇𝗋(T1)|C0nV(T1)1|\mathsf{nr}(T_{1})|\geq C_{0}n^{V(T_{1})-1}, |𝗇𝗋(T2)|C0nV(T2)1|\mathsf{nr}(T_{2})|\geq C_{0}n^{V(T_{2})-1} and, as before, V(T1)+V(T2)2=m(Π)V(T_{1})+V(T_{2})-2=m(\Pi). Further |𝖾𝗆𝖻(Π)|C1nv(Π)1|\mathsf{emb}(\Pi)|\leq C_{1}\,n^{v(\Pi)-1}, whence

|𝔼[FT1(𝒀)FT2(𝒀)]𝔼[F~T1,T2(𝒀)]|C2Π𝗉𝖺𝗍′′(T1,T2)nv(Π)1e1(Π)/2m(Π)/2.\displaystyle\left|\operatorname{\mathbb{E}}[\mathscrsfs{F}_{T_{1}}({\boldsymbol{Y}})\mathscrsfs{F}_{T_{2}}({\boldsymbol{Y}})]-\operatorname{\mathbb{E}}[\tilde{\mathscrsfs{F}}_{T_{1},T_{2}}({\boldsymbol{Y}})]\right|\leq C_{2}\,\sum_{\Pi\in\mathsf{pat}^{\prime\prime}(T_{1},T_{2})}n^{v(\Pi)-1-e_{1}(\Pi)/2-m(\Pi)/2}\,.

The last sum is upper bounded by Cn1/2C\,n^{-1/2} by the same argument as above. ∎

Consider now a term corresponding to ϕ𝗇𝗋~(T)\phi\in\tilde{\sf nr}(T) in the expectation 𝔼[ψ(θ1)F~T(𝒀)|θ]\operatorname{\mathbb{E}}[\psi(\theta_{1})\tilde{\mathscrsfs{F}}_{T}({\boldsymbol{Y}}_{*})\,|\,\theta]. By construction, this does not involve moments 𝔼[Yija(t1)Yijb(t2)|θ]\operatorname{\mathbb{E}}[Y^{a}_{ij}(t_{1})Y^{b}_{ij}(t_{2})\cdots\,|\,\theta] for t1t2t_{1}\neq t_{2}. Therefore the expectations coincide when considering F~T(𝒀)\tilde{\mathscrsfs{F}}_{T}({\boldsymbol{Y}}_{*}) or F~T(𝒀)\tilde{\mathscrsfs{F}}_{T}({\boldsymbol{Y}}). In other words, we have the identities

𝔼[ψ(θ1)F~T(𝒀)]=𝔼[ψ(θ1)F~T(𝒀)],𝔼[F~T1,T2(𝒀)]=𝔼[F~T1,T2(𝒀)].\displaystyle\operatorname{\mathbb{E}}[\psi(\theta_{1})\tilde{\mathscrsfs{F}}_{T}({\boldsymbol{Y}}_{*})]=\operatorname{\mathbb{E}}[\psi(\theta_{1})\tilde{\mathscrsfs{F}}_{T}({\boldsymbol{Y}})]\,,\;\;\;\;\;\operatorname{\mathbb{E}}[\tilde{\mathscrsfs{F}}_{T_{1},T_{2}}({\boldsymbol{Y}}_{*})]=\operatorname{\mathbb{E}}[\tilde{\mathscrsfs{F}}_{T_{1},T_{2}}({\boldsymbol{Y}})]\,. (118)

We have therefore proved the following consequence of Lemma D.2.

Corollary D.3.

Under the assumptions of Lemma D.2, there exists C0=C0(T,B)C_{0}=C_{0}(T,B) such that

|𝔼[ψ(θ1)FT(𝒀)]𝔼[ψ(θ1)FT(𝒀)]|C0n.\displaystyle\left|\operatorname{\mathbb{E}}[\psi(\theta_{1})\mathscrsfs{F}_{T}({\boldsymbol{Y}})]-\operatorname{\mathbb{E}}[\psi(\theta_{1})\mathscrsfs{F}_{T}({\boldsymbol{Y}}_{*})]\right|\leq\frac{C_{0}}{\sqrt{n}}\,\,. (119)

We now consider a message passing iteration of the same form as Eq. (53), but with 𝒀{\boldsymbol{Y}} replaced by 𝒀(t){\boldsymbol{Y}}(t) at iteration tt. Namely, we define

𝒓ijt+1\displaystyle{\boldsymbol{r}}_{i\to j}^{t+1} =1nk[n]{i,j}Yik(t)Ft(𝒓kit),𝒓ij0=𝔼[Θ]ij,\displaystyle=\frac{1}{\sqrt{n}}\sum_{k\in[n]\setminus\{i,j\}}Y_{ik}(t)F_{t}({\boldsymbol{r}}^{t}_{k\to i})\,,\;\;\;\;\;\,\;\;{\boldsymbol{r}}_{i\to j}^{0}=\operatorname{\mathbb{E}}[\Theta]\;\;\forall i\neq j\,, (120)

and:

𝒓it+1\displaystyle{\boldsymbol{r}}_{i}^{t+1} =1nk[n]{i}Yik(t)Ft(𝒓kit),\displaystyle=\frac{1}{\sqrt{n}}\sum_{k\in[n]\setminus\{i\}}Y_{ik}(t)F_{t}({\boldsymbol{r}}^{t}_{k\to i})\,, (121)
𝒓^it+1\displaystyle\hat{{\boldsymbol{r}}}_{i}^{t+1} =Ft(𝒓it+1).\displaystyle=F_{t}({\boldsymbol{r}}^{t+1}_{i})\,. (122)
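For intuition, here is a minimal numerical sketch of the iteration (120)–(122) in the scalar case 𝖽 = 1. The two-point prior for Θ and the choice F_t = tanh are illustrative stand-ins (not taken from the paper); the essential feature being mimicked is that the noise part of the data matrix is redrawn at every iteration.

```python
import numpy as np

# Minimal sketch of the modified iteration (120)-(122) in the scalar case d = 1.
# Assumptions (not from the paper): a two-point prior for Theta and F_t = tanh.
# The key feature is that the data matrix Y(t) uses fresh noise at each iteration.
rng = np.random.default_rng(0)
n, T = 400, 5
theta = rng.choice([1.0, -1.0], size=n, p=[0.7, 0.3])   # theta_i ~ pi_Theta (assumed)
mu0 = 0.7 - 0.3                                          # E[Theta] for this prior
F = np.tanh                                              # stand-in for F_t

def fresh_Y(theta, rng):
    """Y(t) = theta theta^T / sqrt(n) + Z(t), with Z(t) a fresh symmetric Gaussian matrix."""
    n = theta.size
    A = rng.normal(size=(n, n))
    Z = (A + A.T) / np.sqrt(2)
    return np.outer(theta, theta) / np.sqrt(n) + Z

# messages r_{i->j}^t stored as an n x n array (entry [i, j]); diagonal unused
r_msg = np.full((n, n), mu0)                             # r^0_{i->j} = E[Theta]
for t in range(T):
    Y = fresh_Y(theta, rng)
    M = Y * F(r_msg).T                                   # M[i, k] = Y_{ik}(t) F_t(r^t_{k->i})
    base = M.sum(axis=1) - np.diag(M)                    # sum over k != i
    r_vec = base / np.sqrt(n)                            # r_i^{t+1}, Eq. (121)
    r_msg = (base[:, None] - M) / np.sqrt(n)             # r_{i->j}^{t+1}, Eq. (120): also drop k = j
r_hat = F(r_vec)                                         # \hat r_i^{t+1} = F_t(r_i^{t+1}), Eq. (122)
```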

Analogously to the case of the original iteration, we can write these as sums over polynomials FTt+1(𝒀)\mathscrsfs{F}^{t+1}_{T}({\boldsymbol{Y}}_{*}), with coefficients that have a limit as nn\to\infty. Therefore, we have a further consequence of Lemma D.2 and the previous corollary.

Corollary D.4.

Assume πΘ\pi_{\Theta} to have finite moments of all orders, and let ψ:𝖽+1\psi:{\mathbb{R}}^{{\sf d}+1}\to{\mathbb{R}} be a polynomial with coefficients bounded by BB and maximum degree BB.

Then there exists C0=C0(T,B)C_{0}=C_{0}(T,B) such that, for all tTt\leq T,

|𝔼[ψ(θ1,𝒔1t)]𝔼[ψ(θ1,𝒓1t)]|C0n,\displaystyle\left|\operatorname{\mathbb{E}}[\psi(\theta_{1},{\boldsymbol{s}}_{1}^{t})]-\operatorname{\mathbb{E}}[\psi(\theta_{1},{\boldsymbol{r}}_{1}^{t})]\right|\leq\frac{C_{0}}{\sqrt{n}}\,, (123)
|𝔼[ψ(θ1,𝒔12t)]𝔼[ψ(θ1,𝒓12t)]|C0n.\displaystyle\left|\operatorname{\mathbb{E}}[\psi(\theta_{1},{\boldsymbol{s}}_{1\to 2}^{t})]-\operatorname{\mathbb{E}}[\psi(\theta_{1},{\boldsymbol{r}}_{1\to 2}^{t})]\right|\leq\frac{C_{0}}{\sqrt{n}}\,. (124)
Proof.

We expand the polynomial ψ\psi as

ψ(θ,𝒙)=𝒎𝖽ψ𝒎(θ)𝒙𝒎,𝒙𝒎:=i=1𝖽ximi.\displaystyle\psi(\theta,{\boldsymbol{x}})=\sum_{{\boldsymbol{m}}\in{\mathbb{N}}^{{\sf d}}}\psi_{{\boldsymbol{m}}}(\theta){\boldsymbol{x}}^{{\boldsymbol{m}}}\,,\;\;\;\;\;{\boldsymbol{x}}^{{\boldsymbol{m}}}:=\prod_{i=1}^{{\sf d}}x_{i}^{m_{i}}\,.

Further, 𝒔1t{\boldsymbol{s}}^{t}_{1} can be expressed as a sum over T𝒯DT\in\mathcal{T}_{\leq D} and therefore the same holds for (𝒔1t)𝒎({\boldsymbol{s}}^{t}_{1})^{{\boldsymbol{m}}}

(𝒔1t)𝒎\displaystyle({\boldsymbol{s}}^{t}_{1})^{{\boldsymbol{m}}} =(T,k)𝖽,km𝖽,kmcT,k𝖽,kmFT,k(𝒀),\displaystyle=\sum_{(T_{\ell,k})_{\ell\leq{\sf d},k\leq m_{\ell}}}\prod_{\ell\leq{\sf d},k\leq m_{\ell}}c_{T_{\ell,k}}\prod_{\ell\leq{\sf d},k\leq m_{\ell}}\mathscrsfs{F}_{T_{\ell,k}}({\boldsymbol{Y}})\,,
=Tc¯T(𝒎)FT(𝒀),\displaystyle=\sum_{T}\overline{c}_{T}({\boldsymbol{m}})\mathscrsfs{F}_{T}({\boldsymbol{Y}})\,,

where, in the last line, we grouped terms such that ,kT,k=T\cup_{\ell,k}T_{\ell,k}=T; up to combinatorial factors that are independent of nn, the coefficient c¯T(𝒎)\overline{c}_{T}({\boldsymbol{m}}) is given by the product of the coefficients cT,kc_{T_{\ell,k}}.

Hence

|𝔼[ψ(θ1,𝒔1t)]𝔼[ψ(θ1,𝒓1t)]|𝒎Tc¯T(𝒎)|𝔼[ψ𝒎(θ1)FT(𝒀)]𝔼[ψ𝒎(θ1)FTt(𝒀)]|,\displaystyle\Big{|}\operatorname{\mathbb{E}}[\psi(\theta_{1},{\boldsymbol{s}}_{1}^{t})]-\operatorname{\mathbb{E}}[\psi(\theta_{1},{\boldsymbol{r}}_{1}^{t})]\Big{|}\leq\sum_{{\boldsymbol{m}}}\sum_{T}\overline{c}_{T}({\boldsymbol{m}})\Big{|}\operatorname{\mathbb{E}}[\psi_{{\boldsymbol{m}}}(\theta_{1})\mathscrsfs{F}_{T}({\boldsymbol{Y}})]-\operatorname{\mathbb{E}}[\psi_{{\boldsymbol{m}}}(\theta_{1})\mathscrsfs{F}^{t}_{T}({\boldsymbol{Y}}_{*})]\Big{|}\,,

and the claim follows by applying Corollary D.3, after noting that the sums over 𝒎{\boldsymbol{m}} and TT involve a number of terms that is constant in nn, and that the coefficients c¯T(𝒎)\overline{c}_{T}({\boldsymbol{m}}) have a limit as nn\to\infty. ∎

Lemma D.5.

Assume πΘ\pi_{\Theta} to have finite moments of all orders, and let m0m_{0}, 𝐦𝖽{\boldsymbol{m}}\in{\mathbb{N}}^{\sf d} be fixed. Then there exists a constant C=C(t;𝐦)C=C(t;{\boldsymbol{m}}) independent of nn such that

|𝔼[(𝒓it)𝒎]|C,|𝔼[(𝒓ijt)𝒎]|C.\displaystyle\big{|}\operatorname{\mathbb{E}}[({\boldsymbol{r}}_{i}^{t})^{{\boldsymbol{m}}}]\big{|}\leq C\,,\;\;\;\;\big{|}\operatorname{\mathbb{E}}[({\boldsymbol{r}}_{i\to j}^{t})^{{\boldsymbol{m}}}]\big{|}\leq C\,. (125)

As a final step towards the proof of Lemma 4.13, we prove the analogous statement for the modified iteration (r^it)(\hat{r}^{t}_{i}).

Lemma D.6.

Assume πΘ\pi_{\Theta} to have finite moments of all orders, and ψ\psi to be a fixed polynomial. Define the sequence of vectors 𝛍t𝖽{\boldsymbol{\mu}}_{t}\in{\mathbb{R}}^{{\sf d}} and positive semidefinite matrices 𝚺t𝖽×𝖽{\boldsymbol{\Sigma}}_{t}\in{\mathbb{R}}^{{\sf d}\times{\sf d}} via the state evolution equations (10), (11).

Then the following hold (here plim\operatorname*{p-lim} denotes limit in probability):

limn𝔼[ψ(𝒓1t,θ1)]=𝔼ψ(𝝁tΘ+𝑮t,Θ),plimn1ni=1nψ(𝒓it,θi)=𝔼ψ(𝝁tΘ+𝑮t,Θ),\displaystyle\lim_{n\to\infty}\operatorname{\mathbb{E}}[\psi({\boldsymbol{r}}^{t}_{1},\theta_{1})]=\operatorname{\mathbb{E}}\psi\big{(}{\boldsymbol{\mu}}_{t}\Theta+{\boldsymbol{G}}_{t},\Theta\big{)}\,,\;\;\;\;\operatorname*{p-lim}_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\psi({\boldsymbol{r}}^{t}_{i},\theta_{i})=\operatorname{\mathbb{E}}\psi\big{(}{\boldsymbol{\mu}}_{t}\Theta+{\boldsymbol{G}}_{t},\Theta\big{)}\,, (126)
limn𝔼[ψ(𝒓12t,θ1)]=𝔼ψ(𝝁tΘ+𝑮t,Θ),plimn1ni=2nψ(𝒓i1t,θi)=𝔼ψ(𝝁tΘ+𝑮t,Θ).\displaystyle\lim_{n\to\infty}\operatorname{\mathbb{E}}[\psi({\boldsymbol{r}}^{t}_{1\to 2},\theta_{1})]=\operatorname{\mathbb{E}}\psi\big{(}{\boldsymbol{\mu}}_{t}\Theta+{\boldsymbol{G}}_{t},\Theta\big{)}\,,\;\;\;\;\operatorname*{p-lim}_{n\to\infty}\frac{1}{n}\sum_{i=2}^{n}\psi({\boldsymbol{r}}^{t}_{i\to 1},\theta_{i})=\operatorname{\mathbb{E}}\psi\big{(}{\boldsymbol{\mu}}_{t}\Theta+{\boldsymbol{G}}_{t},\Theta\big{)}\,. (127)
Proof.

The proof is essentially the same as the proof of Proposition 4 in [BLM15]. We will focus on the claim (127), since Eq. (126) is completely analogous.

We proceed by induction over tt, and will denote by t{\mathcal{F}}_{t} the σ\sigma-algebra generated by θ\theta and Z(1),Z(t)Z(1),\dots,Z(t). It is convenient to define W(s)=Z(s)/nW(s)=Z(s)/\sqrt{n} and to rewrite Eq. (120) as

𝒓ijt+1=(1nk[n]{i,j}θkFt(𝒓kit))θi+k[n]{i,j}Wik(t)Ft(𝒓kit).\displaystyle{\boldsymbol{r}}_{i\to j}^{t+1}=\Big{(}\frac{1}{n}\sum_{k\in[n]\setminus\{i,j\}}\theta_{k}F_{t}({\boldsymbol{r}}^{t}_{k\to i})\Big{)}\theta_{i}+\sum_{k\in[n]\setminus\{i,j\}}W_{ik}(t)F_{t}({\boldsymbol{r}}^{t}_{k\to i})\,. (128)

Fixing i,ji,j, by the induction hypothesis

plimn1nk[n]{i,j}θkFt(𝒓kit)=𝔼{ΘFt(𝝁tΘ+𝑮t)}=𝝁t+1,\displaystyle\operatorname*{p-lim}_{n\to\infty}\frac{1}{n}\sum_{k\in[n]\setminus\{i,j\}}\theta_{k}F_{t}({\boldsymbol{r}}^{t}_{k\to i})=\operatorname{\mathbb{E}}\big{\{}\Theta F_{t}({\boldsymbol{\mu}}_{t}\Theta+{\boldsymbol{G}}_{t})\big{\}}={\boldsymbol{\mu}}_{t+1}\,, (129)
plimn1nk[n]{i,j}Ft(𝒓kit)Ft(𝒓kit)𝖳=𝔼{Ft(𝝁tΘ+𝑮t)Ft(𝝁tΘ+𝑮t)𝖳}=𝚺t+1,\displaystyle\operatorname*{p-lim}_{n\to\infty}\frac{1}{n}\sum_{k\in[n]\setminus\{i,j\}}F_{t}({\boldsymbol{r}}^{t}_{k\to i})F_{t}({\boldsymbol{r}}^{t}_{k\to i})^{{\sf T}}=\operatorname{\mathbb{E}}\big{\{}F_{t}({\boldsymbol{\mu}}_{t}\Theta+{\boldsymbol{G}}_{t})F_{t}({\boldsymbol{\mu}}_{t}\Theta+{\boldsymbol{G}}_{t})^{{\sf T}}\big{\}}={\boldsymbol{\Sigma}}_{t+1}\,, (130)
plimn1nk[n]{i,j}Ft(𝒓kit)4=Ct<.\displaystyle\operatorname*{p-lim}_{n\to\infty}\frac{1}{n}\sum_{k\in[n]\setminus\{i,j\}}\|F_{t}({\boldsymbol{r}}^{t}_{k\to i})\|^{4}=C_{t}<\infty\,. (131)
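As an aside, the right-hand sides of Eqs. (129)–(130) are exactly the state evolution updates for (𝝁_{t+1}, 𝚺_{t+1}). The following is a minimal Monte Carlo sketch of this recursion in the scalar case 𝖽 = 1, with the same assumed prior and nonlinearity as in the simulation sketch after Eq. (122) (these choices are placeholders, not taken from the paper); the constant initialization 𝒓^0 = 𝔼[Θ] is handled by feeding that constant into F_0 at the first step.

```python
import numpy as np

def state_evolution(F, sample_theta, E_theta, T, n_mc=200_000, seed=1):
    """Monte Carlo sketch of the scalar (d = 1) state evolution recursion
        mu_{t+1}    = E[ Theta * F_t(mu_t Theta + G_t) ]      (cf. Eq. (129))
        Sigma_{t+1} = E[ F_t(mu_t Theta + G_t)^2 ]            (cf. Eq. (130))
    with G_t ~ N(0, Sigma_t) independent of Theta ~ pi_Theta."""
    rng = np.random.default_rng(seed)
    mu, sigma, history = None, None, []
    for t in range(T):
        theta = sample_theta(n_mc, rng)
        if t == 0:
            arg = np.full(n_mc, E_theta)          # r^0_{i->j} = E[Theta] is deterministic
        else:
            arg = mu * theta + rng.normal(scale=np.sqrt(sigma), size=n_mc)
        Ft = F(arg)
        mu, sigma = np.mean(theta * Ft), np.mean(Ft ** 2)
        history.append((mu, sigma))               # records (mu_{t+1}, Sigma_{t+1})
    return history

# assumed two-point prior matching the simulation sketch above
sample_theta = lambda m, rng: rng.choice([1.0, -1.0], size=m, p=[0.7, 0.3])
history = state_evolution(np.tanh, sample_theta, E_theta=0.4, T=5)
```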

Construct (θi)i1(\theta_{i})_{i\geq 1} for different nn in the same probability space. By Lyapunov’s central limit theorem, we have that, in probability

𝖫𝖺𝗐(𝒓ijt+1|t)𝒩(𝝁t+1θi,𝚺t+1).\displaystyle{\sf Law}({\boldsymbol{r}}_{i\to j}^{t+1}|{\mathcal{F}}_{t})\Rightarrow{\mathcal{N}}({\boldsymbol{\mu}}_{t+1}\theta_{i},{\boldsymbol{\Sigma}}_{t+1})\,. (132)

(Here \Rightarrow denotes weak convergence of probability measures.) Hence, for any bounded Lipschitz function ψ¯\overline{\psi}, we have

limn𝔼[ψ¯(𝒓ijt+1,θi)|t]=𝔼ψ¯(𝝁t+1Θ+𝑮t+1,Θ).\displaystyle\lim_{n\to\infty}\operatorname{\mathbb{E}}[\overline{\psi}({\boldsymbol{r}}^{t+1}_{i\to j},\theta_{i})|{\mathcal{F}}_{t}]=\operatorname{\mathbb{E}}\overline{\psi}\big{(}{\boldsymbol{\mu}}_{t+1}\Theta+{\boldsymbol{G}}_{t+1},\Theta\big{)}\,. (133)

Since the right-hand side is non-random, by dominated convergence we also have

limn𝔼[ψ¯(𝒓ijt+1,θi)]=𝔼ψ¯(𝝁t+1Θ+𝑮t+1,Θ).\displaystyle\lim_{n\to\infty}\operatorname{\mathbb{E}}[\overline{\psi}({\boldsymbol{r}}^{t+1}_{i\to j},\theta_{i})]=\operatorname{\mathbb{E}}\overline{\psi}\big{(}{\boldsymbol{\mu}}_{t+1}\Theta+{\boldsymbol{G}}_{t+1},\Theta\big{)}\,. (134)

By Lemma D.5, the claim also holds for ψ\psi that is only polynomially bounded, thus proving the first equation in Eq. (127) for iteration t+1t+1.

Next consider the second limit in Eq. (127), again with ψ\psi a fixed polynomial. We claim that, for any fixed i1i2{2,,n}i_{1}\neq i_{2}\in\{2,\dots,n\},

limn|𝔼[ψ(θi1,𝒓i11t+1)ψ(θi2,𝒓i21t+1)]𝔼[ψ(θi1,𝒓i11t+1)]𝔼[ψ(θi2,𝒓i21t+1)]|=0.\displaystyle\lim_{n\to\infty}\big{|}\operatorname{\mathbb{E}}[\psi(\theta_{i_{1}},{\boldsymbol{r}}_{i_{1}\to 1}^{t+1})\psi(\theta_{i_{2}},{\boldsymbol{r}}_{i_{2}\to 1}^{t+1})]-\operatorname{\mathbb{E}}[\psi(\theta_{i_{1}},{\boldsymbol{r}}_{i_{1}\to 1}^{t+1})]\operatorname{\mathbb{E}}[\psi(\theta_{i_{2}},{\boldsymbol{r}}_{i_{2}\to 1}^{t+1})]\big{|}=0\,. (135)

Indeed, by linearity it is sufficient to prove that this is the case for ψ(θ,𝒙)=ψ𝒎(θ)𝒙𝒎\psi(\theta,{\boldsymbol{x}})=\psi_{{\boldsymbol{m}}}(\theta){\boldsymbol{x}}^{{\boldsymbol{m}}}. This case in turn can be analyzed by expanding (𝒓i1t)𝒎({\boldsymbol{r}}_{i\to 1}^{t})^{{\boldsymbol{m}}} as a sum over trees, as in the proof of Lemma D.2. (See Proposition 4 in [BLM15].)

By expanding the sum in the variance and using Eq. (135) we thus get

limnVar(1ni=2nψ(𝒓i1t+1,θi))=0.\displaystyle\lim_{n\to\infty}{\rm Var}\left(\frac{1}{n}\sum_{i=2}^{n}\psi({\boldsymbol{r}}^{t+1}_{i\to 1},\theta_{i})\right)=0\,.

Further, we proved above that

limn𝔼(1ni=2nψ(𝒓i1t+1,θi))=𝔼ψ(𝝁t+1Θ+𝑮t+1,Θ).\displaystyle\lim_{n\to\infty}\operatorname{\mathbb{E}}\left(\frac{1}{n}\sum_{i=2}^{n}\psi({\boldsymbol{r}}^{t+1}_{i\to 1},\theta_{i})\right)=\operatorname{\mathbb{E}}\psi\big{(}{\boldsymbol{\mu}}_{t+1}\Theta+{\boldsymbol{G}}_{t+1},\Theta\big{)}\,.

Hence, by Chebyshev’s inequality, the following holds for any polynomial ψ\psi:

plimn1ni=2nψ(𝒓i1t+1,θi)=𝔼ψ(𝝁t+1Θ+𝑮t+1,Θ).\displaystyle\operatorname*{p-lim}_{n\to\infty}\frac{1}{n}\sum_{i=2}^{n}\psi({\boldsymbol{r}}^{t+1}_{i\to 1},\theta_{i})=\operatorname{\mathbb{E}}\psi\big{(}{\boldsymbol{\mu}}_{t+1}\Theta+{\boldsymbol{G}}_{t+1},\Theta\big{)}\,.

This completes the proof of Lemma D.6. ∎
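As an informal sanity check on Lemma D.6 (not part of the argument), one can compare the empirical average n^{-1}∑_i ψ(𝒓_i^t, θ_i) produced by the simulation sketch after Eq. (122) with the Gaussian prediction 𝔼ψ(𝝁_tΘ + 𝑮_t, Θ) computed from the state-evolution sketch above, here for the hypothetical test polynomial ψ(x, θ) = xθ.

```python
import numpy as np

# Sanity-check sketch for Lemma D.6 with the test polynomial psi(x, theta) = x * theta.
# Reuses `r_vec`, `theta` from the iteration sketch and `history` from the
# state-evolution sketch above; both use T = 5 iterations and the same assumed prior.
psi = lambda x, th: x * th

empirical = np.mean(psi(r_vec, theta))                        # (1/n) sum_i psi(r_i^T, theta_i)

mu_T, sigma_T = history[-1]                                   # (mu_T, Sigma_T)
rng = np.random.default_rng(2)
theta_mc = rng.choice([1.0, -1.0], size=200_000, p=[0.7, 0.3])
G_mc = rng.normal(scale=np.sqrt(sigma_T), size=200_000)
prediction = np.mean(psi(mu_T * theta_mc + G_mc, theta_mc))   # E psi(mu_T Theta + G_T, Theta)

print(empirical, prediction)    # the two numbers should be close for large n
```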

Now Lemma 4.13 is an immediate consequence of Corollary D.4 and Lemma D.6.

References

  • [AW09] Arash A Amini and Martin J Wainwright. High-dimensional analysis of semidefinite relaxations for sparse principal components. The Annals of Statistics, pages 2877–2921, 2009.
  • [BB20] Matthew Brennan and Guy Bresler. Reducibility and statistical-computational gaps from secret leakage. In Conference on Learning Theory, pages 648–847. PMLR, 2020.
  • [BBH18] Matthew Brennan, Guy Bresler, and Wasim Huleihel. Reducibility and computational lower bounds for problems with planted sparse structure. In Conference On Learning Theory, pages 48–166. PMLR, 2018.
  • [BBH+21] Matthew S Brennan, Guy Bresler, Sam Hopkins, Jerry Li, and Tselil Schramm. Statistical query algorithms and low degree tests are almost equivalent. In Conference on Learning Theory, pages 774–774. PMLR, 2021.
  • [BBP05] Jinho Baik, Gérard Ben Arous, and Sandrine Péché. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Annals of Probability, pages 1643–1697, 2005.
  • [BGN11] Florent Benaych-Georges and Raj Rao Nadakuditi. The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices. Advances in Mathematics, 227(1):494–521, 2011.
  • [BHK+19] Boaz Barak, Samuel Hopkins, Jonathan Kelner, Pravesh K Kothari, Ankur Moitra, and Aaron Potechin. A nearly tight sum-of-squares lower bound for the planted clique problem. SIAM Journal on Computing, 48(2):687–735, 2019.
  • [BLM15] Mohsen Bayati, Marc Lelarge, and Andrea Montanari. Universality in polytope phase transitions and message passing algorithms. The Annals of Applied Probability, 25(2):753–822, 2015.
  • [BM11] Mohsen Bayati and Andrea Montanari. The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Trans. on Inform. Theory, 57:764–785, 2011.
  • [BM19] Jean Barbier and Nicolas Macris. The adaptive interpolation method: a simple scheme to prove replica formulas in Bayesian inference. Probability Theory and Related Fields, 174(3):1133–1185, 2019.
  • [BMN20] Raphael Berthier, Andrea Montanari, and Phan-Minh Nguyen. State evolution for approximate message passing with non-separable functions. Information and Inference: A Journal of the IMA, 9(1):33–79, 2020.
  • [BMR21] Jess Banks, Sidhanth Mohanty, and Prasad Raghavendra. Local statistics, semidefinite programming, and community detection. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1298–1316. SIAM, 2021.
  • [Bol14] Erwin Bolthausen. An iterative construction of solutions of the TAP equations for the Sherrington–Kirkpatrick model. Communications in Mathematical Physics, 325(1):333–366, 2014.
  • [BR13] Quentin Berthet and Philippe Rigollet. Complexity theoretic lower bounds for sparse principal component detection. In Conference on learning theory, pages 1046–1066. PMLR, 2013.
  • [BS06] Jinho Baik and Jack W Silverstein. Eigenvalues of large sample covariance matrices of spiked population models. Journal of Multivariate Analysis, 97(6):1382–1408, 2006.
  • [CM22] Michael Celentano and Andrea Montanari. Fundamental barriers to high-dimensional regression with convex penalties. The Annals of Statistics, 50(1):170–196, 2022.
  • [CMW20] Michael Celentano, Andrea Montanari, and Yuchen Wu. The estimation error of general first order methods. In Conference on Learning Theory, pages 1078–1141. PMLR, 2020.
  • [CX22] Hong-Bin Chen and Jiaming Xia. Hamilton–Jacobi equations for inference of matrix tensor products. In Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, volume 58, pages 755–793. Institut Henri Poincaré, 2022.
  • [DKMZ11] Aurelien Decelle, Florent Krzakala, Cristopher Moore, and Lenka Zdeborová. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Physical Review E, 84(6):066106, 2011.
  • [DKWB19] Yunzi Ding, Dmitriy Kunisky, Alexander S Wein, and Afonso S Bandeira. Subexponential-time algorithms for sparse PCA. arXiv preprint arXiv:1907.11635, 2019.
  • [DMM09] David L. Donoho, Arian Maleki, and Andrea Montanari. Message Passing Algorithms for Compressed Sensing. Proceedings of the National Academy of Sciences, 106:18914–18919, 2009.
  • [EG21] Morris L Eaton and Edward I George. Charles Stein and invariance: Beginning with the Hunt–Stein theorem. The Annals of Statistics, 49(4):1815–1822, 2021.
  • [Fan22] Zhou Fan. Approximate message passing algorithms for rotationally invariant matrices. The Annals of Statistics, 50(1):197–224, 2022.
  • [Gal62] Robert Gallager. Low-density parity-check codes. IRE Transactions on Information Theory, 8(1):21–28, 1962.
  • [GB21] Cédric Gerbelot and Raphaël Berthier. Graph-based approximate message passing iterations. arXiv:2109.11905, 2021.
  • [GJS21] David Gamarnik, Aukosh Jagannath, and Subhabrata Sen. The overlap gap property in principal submatrix recovery. Probability Theory and Related Fields, 181(4):757–814, 2021.
  • [GZ22] David Gamarnik and Ilias Zadik. Sparse high-dimensional linear regression. Estimating squared error and a phase transition. The Annals of Statistics, 50(2):880–903, 2022.
  • [HKP+17] Samuel B Hopkins, Pravesh K Kothari, Aaron Potechin, Prasad Raghavendra, Tselil Schramm, and David Steurer. The power of sum-of-squares for detecting hidden structures. In 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 720–731. IEEE, 2017.
  • [HR04] David C Hoyle and Magnus Rattray. Principal-component-analysis eigenvalue spectra from data with symmetry-breaking structure. Physical Review E, 69(2):026124, 2004.
  • [HS17] Samuel B Hopkins and David Steurer. Efficient Bayesian estimation from few samples: community detection and related problems. In 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 379–390. IEEE, 2017.
  • [HSS15] Samuel B Hopkins, Jonathan Shi, and David Steurer. Tensor principal component analysis via sum-of-square proofs. In Conference on Learning Theory, pages 956–1006. PMLR, 2015.
  • [JL09] Iain M Johnstone and Arthur Yu Lu. On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 104(486), 2009.
  • [JM13] Adel Javanmard and Andrea Montanari. State evolution for general approximate message passing algorithms, with applications to spatial coupling. Information and Inference: A Journal of the IMA, 2(2):115–144, 2013.
  • [KNV15] Robert Krauthgamer, Boaz Nadler, and Dan Vilenchik. Do semidefinite relaxations solve sparse PCA up to the information limit? The Annals of Statistics, 43(3):1300–1322, 2015.
  • [LM19] Marc Lelarge and Léo Miolane. Fundamental limits of symmetric low-rank matrix estimation. Probability Theory and Related Fields, 173(3):859–929, 2019.
  • [Lor05] G. G. Lorentz. Approximation of Functions, volume 322. American Mathematical Soc., 2005.
  • [MOS13] Wilhelm Magnus, Fritz Oberhettinger, and Raj Pal Soni. Formulas and theorems for the special functions of mathematical physics, volume 52. Springer Science & Business Media, 2013.
  • [MPV87] Marc Mézard, Giorgio Parisi, and Miguel A. Virasoro. Spin Glass Theory and Beyond. World Scientific, 1987.
  • [MSS16] Tengyu Ma, Jonathan Shi, and David Steurer. Polynomial-time tensor decompositions with sum-of-squares. In 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 438–446. IEEE, 2016.
  • [MV21] Andrea Montanari and Ramji Venkataramanan. Estimation of low-rank matrices via approximate message passing. The Annals of Statistics, 49(1):321–345, 2021.
  • [MW22] Andrea Montanari and Yuchen Wu. Statistically optimal first order algorithms: A proof via orthogonalization. arXiv:2201.05101, 2022.
  • [RM14] Emile Richard and Andrea Montanari. A statistical model for tensor PCA. Advances in Neural Information Processing Systems, 27, 2014.
  • [RU08] T. J. Richardson and R. Urbanke. Modern Coding Theory. Cambridge University Press, Cambridge, 2008.
  • [SW22] Tselil Schramm and Alexander S Wein. Computational barriers to estimation from low-degree polynomials. The Annals of Statistics, 50(3):1833–1858, 2022.
  • [Sze39] Gabor Szegö. Orthogonal polynomials, volume 23. American Mathematical Soc., 1939.
  • [TAP77] David J. Thouless, Philip W. Anderson, and Richard G. Palmer. Solution of ‘solvable model of a spin glass’. Philosophical Magazine, 35(3):593–601, 1977.
  • [Wei22] Alexander S Wein. Average-case complexity of tensor decomposition for low-degree polynomials. arXiv preprint arXiv:2211.05274, 2022.
  • [WEM19] Alexander S Wein, Ahmed El Alaoui, and Cristopher Moore. The Kikuchi hierarchy and tensor PCA. In 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 1446–1468. IEEE, 2019.