
Average-Case Complexity of Tensor Decomposition
for Low-Degree Polynomials

Alexander S. Wein, Department of Mathematics, University of California, Davis. Email: [email protected]. Part of this work was done while with the Algorithms and Randomness Center at Georgia Tech, supported by NSF grants CCF-2007443 and CCF-2106444.
Abstract

Suppose we are given an n-dimensional order-3 symmetric tensor T\in(\mathbb{R}^{n})^{\otimes 3} that is the sum of r random rank-1 terms. The problem of recovering the rank-1 components is possible in principle when r\lesssim n^{2} but polynomial-time algorithms are only known in the regime r\ll n^{3/2}. Similar “statistical-computational gaps” occur in many high-dimensional inference tasks, and in recent years there has been a flurry of work on explaining the apparent computational hardness in these problems by proving lower bounds against restricted (yet powerful) models of computation such as statistical queries (SQ), sum-of-squares (SoS), and low-degree polynomials (LDP). However, no such prior work exists for tensor decomposition, largely because its hardness does not appear to be explained by a “planted versus null” testing problem.

We consider a model for random order-3 tensor decomposition where one component is slightly larger in norm than the rest (to break symmetry), and the components are drawn uniformly from the hypercube. We resolve the computational complexity in the LDP model: O(\log n)-degree polynomial functions of the tensor entries can accurately estimate the largest component when r\ll n^{3/2} but fail to do so when r\gg n^{3/2}. This provides rigorous evidence suggesting that the best known algorithms for tensor decomposition cannot be improved, at least by known approaches. A natural extension of the result holds for tensors of any fixed order k\geq 3, in which case the LDP threshold is r\sim n^{k/2}.

1 Introduction

Tensor decomposition is a basic computational primitive underlying many algorithms for a variety of data analysis tasks, including phylogenetic reconstruction [MR05], topic modeling [AFH+12], community detection [AGHK13, HS17, AAA17, JLLX21], learning mixtures of Gaussians [HK13, GHK15, BCMV14, ABG+14], independent component analysis [GVX14], dictionary learning [BKS15, MSS16], and multi-reference alignment [PWB+19]. For further discussion, we refer the reader to [Moi14, RSG17, SDF+17, BM20].

In this work we will consider order-k tensors of the form T\in(\mathbb{R}^{n})^{\otimes k} for an integer k\geq 3, that is, T is an n\times\cdots\times n (k times) array of real numbers with entries denoted by T_{i_{1},\ldots,i_{k}} with i_{1},\ldots,i_{k}\in[n]:=\{1,2,\ldots,n\}. A vector u=(u_{i})_{i\in[n]}\in\mathbb{R}^{n} can be used to form a rank-1 tensor u^{\otimes k} defined by (u^{\otimes k})_{i_{1},\ldots,i_{k}}=u_{i_{1}}u_{i_{2}}\cdots u_{i_{k}}, akin to the rank-1 matrix uu^{\top}.
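To make the rank-1 construction concrete, here is a minimal numerical sketch (purely illustrative, not part of the formal development; it assumes numpy, and the dimension n=5 is an arbitrary placeholder):

import numpy as np

n = 5
u = np.random.choice([-1.0, 1.0], size=n)     # an example vector u in R^n
# rank-1 order-3 tensor: (u ⊗ u ⊗ u)_{i1,i2,i3} = u_{i1} u_{i2} u_{i3}
U3 = np.einsum('i,j,k->ijk', u, u, u)
assert U3.shape == (n, n, n)
assert np.isclose(U3[0, 1, 2], u[0] * u[1] * u[2])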

In the tensor decomposition problem, we observe a rank-r tensor of the form

T=\sum_{j=1}^{r}a_{j}^{\otimes k} (1)

with unknown components a_{j}\in\mathbb{R}^{n}. The goal is to recover the components a_{1},\ldots,a_{r} up to the inherent symmetries in the problem, i.e., we cannot recover the ordering of the list a_{1},\ldots,a_{r} and if k is even we cannot distinguish a_{j} from -a_{j}. Our regime of interest will be k fixed and n large, with r possibly growing with n.

A large number of algorithmic results for tensor decomposition now exist under various assumptions on k, n, r and \{a_{j}\} [LRA93, DCC07, GVX14, BCMV14, AGJ15, AGJ14, BKS15, GM15, HSSS16, MSS16, GM17, SS17, HSS19, KP20, DdL+22]. We will focus on the order-3 case (k=3) to simplify the discussion. If a_{1},\ldots,a_{r} are linearly independent, the decomposition problem can be solved by a classical method based on simultaneous diagonalization [LRA93] (this method is sometimes called Jennrich’s algorithm, although this may be a misnomer; see [Kol21]). Most recent work focuses on the more difficult overcomplete regime where r>n. For random order-3 tensors, where the components a_{j} are drawn uniformly from the unit sphere, the state-of-the-art polynomial-time algorithms succeed when r\ll n^{3/2} (where \ll hides a polylogarithmic factor (\log n)^{O(1)}); this was achieved first by a method based on the sum-of-squares hierarchy [MSS16] and then later by a faster spectral method [DdL+22]. Under the weaker condition r\lesssim n^{2} (where \lesssim hides a constant factor), the decomposition is unique [BCO14] and so the problem is solvable in principle. However, no polynomial-time algorithm is known in the regime r\gtrsim n^{3/2}.

For random tensors of any fixed order k\geq 4, the situation is similar. The decomposition is unique when r\lesssim n^{k-1} [BCO14], but poly-time algorithms are only known for substantially smaller rank. For k=4, poly-time decomposition is known when r\ll n^{2} [DCC07, MSS16], and it is expected that r\ll n^{k/2} is achievable for general k; however, to our knowledge, the best poly-time algorithms in the literature for k\geq 5 require r\lesssim n^{\lfloor(k-1)/2\rfloor} [LRA93, BCMV14]. These algorithms for k\geq 4 work not just for random tensors but also under much weaker assumptions on the components.

Statistical-computational gaps.

The motivation for this work is to understand whether the “statistical-computational gap” mentioned above is inherent; that is, we investigate whether a poly-time algorithm exists in the apparent “possible but hard” regime n^{k/2}\lesssim r\lesssim n^{k-1}. For context, gaps between statistical and computational thresholds are a nearly ubiquitous phenomenon throughout high-dimensional statistics, appearing in settings such as planted clique, sparse PCA, community detection, tensor PCA, and more; see e.g. [WX21, GMZ22] for exposition. In these average-case problems where the input is random, classical computational complexity theory has little to say: results on NP-hardness (including those for tensor decomposition [Hås89, HL13]) typically show hardness for worst-case instances, which does not imply hardness in the average case. Still, a number of frameworks have emerged to justify why the “hard” regime is actually hard, and to predict the location of the hard regime in new problems. One approach is based on average-case reductions which formally transfer suspected hardness of one average-case problem (usually planted clique) to another (e.g. [BR13, HWX15, BBH18, BB20]). Another approach is to prove unconditional lower bounds against particular algorithms or classes of algorithms (e.g. [Jer92, DKMZ11, LKZ15]), and sometimes this analysis is based on intricate geometric properties of the solution space (e.g. [AC08, GS17, GZ17]; see [Gam21] for a survey). Perhaps some of the most powerful and well-established classes of algorithms to prove lower bounds against are statistical query (SQ) algorithms (e.g. [FGR+17, DKS17]), the sum-of-squares (SoS) hierarchy (e.g. [BHK+19, KMOW17]; see [RSS18] for a survey), and low-degree polynomials (LDP) (e.g. [HS17, HKP+17, Hop18]; see [KWB22] for a survey). In recent years, all of the above frameworks have been widely successful at providing many different perspectives on what makes statistical problems easy versus hard, and there has also been progress on understanding formal connections between different frameworks [HKP+17, GJW20, BBH+21, BEH+22]. For many conjectured statistical-computational gaps, we now have sharp lower bounds in one (or more) of these frameworks, suggesting that the best known algorithms cannot be improved (at least by certain known approaches).

Despite all these successes, the tensor decomposition problem is a rare example where (prior to this work) essentially no progress has been made on any kind of average-case hardness, even though the problem itself and the suspected threshold at r\sim n^{k/2} have maintained a high profile in the community since the work of [GM15] in 2015. The lecture notes [Kun22] highlight “hardness of tensor decomposition” as one of 15 prominent open problems related to sum-of-squares (Open Problem 7.1). In Section 1.3.2 we explain why tensor decomposition is more difficult to analyze than the other statistical problems that were understood previously, and in Section 1.4.1 we explain how to overcome these challenges.

The LDP framework.

Our main result (Theorem 1.4) determines the computational complexity of tensor decomposition in the low-degree polynomial (LDP) framework, establishing an easy-hard phase transition at the threshold r\sim n^{k/2}. This means we consider a restricted class of algorithms, namely those that can be expressed as O(\log n)-degree multivariate polynomials in the tensor entries. The study of this class of algorithms in the context of high-dimensional statistics first emerged from a line of work on the sum-of-squares (SoS) hierarchy [BHK+19, HS17, HKP+17, Hop18] and is now a popular tool for predicting and explaining statistical-computational gaps; see [KWB22] for a survey. Low-degree polynomials capture various algorithmic paradigms such as spectral methods and approximate message passing (see e.g. Section 4 of [KWB22], Appendix A of [GJW20], and Theorem 1.4 of [MW22]), and so LDP lower bounds allow us to rule out a large class of known approaches all at once. For the types of problems that typically arise in high-dimensional statistics (informally speaking), the LDP framework has a great track record for consistently matching the performance of the best known poly-time algorithms. As a result, LDP lower bounds can be taken as evidence for inherent hardness of certain types of average-case problems. While there are some settings where LDPs are outperformed by another algorithm, these other algorithms tend to be “brittle” algebraic methods with almost no noise tolerance, and so the LDP framework is arguably still saying something meaningful in these settings; see [HW21, Kun21, KM21a, ZSWB22, DK22] for discussion.

Existing LDP lower bounds apply to a wide variety of statistical tasks, which can be classified as hypothesis testing (e.g. [HS17, HKP+17, Hop18]), estimation (e.g. [SW22, KM21a]), and optimization (e.g. [GJW20, Wei22, BH22]). The techniques used to prove these lower bounds are quite different in the three cases, and the current work introduces a new technique for the case of estimation.

1.1 Largest Component Recovery

In order to formulate the random tensor decomposition problem in the LDP framework, we will define a variant of the problem where one component has slightly larger norm than the rest, and the goal is to recover this particular component.

Problem 1.1 (Largest component recovery).

Let k\geq 3 be odd and \delta>0. In the largest component recovery problem, we observe

T=(1+\delta)a_{1}^{\otimes k}+\sum_{j=2}^{r}a_{j}^{\otimes k}

where each a_{j} (for 1\leq j\leq r) is drawn uniformly and independently from the hypercube \{\pm 1\}^{n}. The goal is to recover a_{1} with high probability.

One can think of \delta as a small constant, although our results will apply more generally. The purpose of increasing the norm of the first component is to give the algorithm a concrete goal, namely to recover a_{1}; without this, the algorithm has no way to break symmetry among the components. We have restricted to odd k here because otherwise there is no hope to disambiguate between a_{1} and -a_{1}; however, we give similar results for even k in Section 2. The assumption that a_{j} is drawn from the hypercube helps to make some of the proofs cleaner, but our approach can likely handle other natural distributions for a_{j} such as \mathcal{N}(0,I_{n}) (and we do not expect this to change the threshold).
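For concreteness, here is a small sampling sketch of Problem 1.1 for k=3 (again illustrative only; numpy is assumed, and the values of n, r, delta are arbitrary placeholders rather than a meaningful parameter regime):

import numpy as np

n, r, delta = 30, 50, 0.1
# columns of A are the components a_1, ..., a_r, i.i.d. uniform on {±1}^n
A = np.random.choice([-1.0, 1.0], size=(n, r))
weights = np.ones(r)
weights[0] = 1.0 + delta          # the first component is slightly heavier
# T = (1 + delta) a_1^{⊗3} + sum_{j >= 2} a_j^{⊗3}
T = np.einsum('j,ij,kj,lj->ikl', weights, A, A, A)
# the task: estimate the column A[:, 0] given only T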

Connection to decomposition.

While Problem 1.1 does not have formal implications for the canonical random tensor decomposition model (i.e., (1) with a_{j} uniform on the sphere), it does have implications for a mild generalization of this problem.

Problem 1.2 (Semirandom tensor decomposition).

Let k\geq 3 be odd and \delta>0. In the semirandom tensor decomposition problem, we observe

T=\sum_{j=1}^{r}\lambda_{j}a_{j}^{\otimes k}

where \lambda_{j}\in[1,1+\delta] are arbitrary and a_{j} are drawn uniformly and independently from the hypercube \{\pm 1\}^{n}. The goal is to recover a_{1},\ldots,a_{r} (up to re-ordering) as well as the corresponding \lambda_{j}’s, with high probability.

Note that any algorithm for Problem 1.2 can be used to solve Problem 1.1 (with the same parameters k,n,r,\delta): simply run the algorithm to recover all the (\lambda_{j},a_{j}) pairs and then output the a_{j} with the largest corresponding \lambda_{j}. When r\gg n^{k/2}, our main result (Theorem 1.4) will give rigorous evidence for hardness of Problem 1.2 via a two-step argument: we will show LDP-hardness of Problem 1.1, justifying the conjecture that there is no poly-time algorithm for Problem 1.1; this conjecture, if true, formally implies that there is no poly-time algorithm for Problem 1.2.

We see Problem 1.2 as a natural variant of random tensor decomposition. In particular, we expect that existing algorithms [MSS16, DdL+22] can be readily adapted to solve this variant when r\ll n^{k/2}, at least for k\in\{3,4\}. As such, understanding the complexity of Problem 1.2 (via the connection to Problem 1.1 described above) can be viewed as the end-goal of this work. From this point onward, we will study Problem 1.1. In Section 1.5 we discuss possible extensions of our result that would more directly address the canonical model for random tensor decomposition.

Remark 1.3.

In line with the discussion in the Introduction, our results should really only be considered evidence for hardness of Problems 1.1 and 1.2 in the presence of noise. Indeed, if \delta is not an integer, there is a trivial algorithm for Problem 1.1 by examining the non-integer part of T (the author thanks Bruce Hajek for pointing this out). This algorithm is easily defeated by adding Gaussian noise of variance \sigma^{2}\gg 1 to each entry of T. This is a low level of noise, as we believe our low-degree upper bound tolerates \sigma\ll n^{k/4}, akin to tensor PCA.

1.2 Main Results

We now show how to cast Problem 1.1 in the LDP framework and state our main results. The class of algorithms we consider are (multivariate) polynomials f of degree (at most) D with real coefficients, whose input variables are the n^{k} entries of T\in(\mathbb{R}^{n})^{\otimes k} and whose output is a real scalar value; we write \mathbb{R}[T]_{\leq D} for the space of such polynomials. We will be interested in whether such a polynomial can accurately estimate a_{11}:=(a_{1})_{1}, the first entry of the vector a_{1}. Following [SW22], define the degree-D minimum mean squared error

\mathsf{MMSE}_{\leq D}:=\inf_{f\in\mathbb{R}[T]_{\leq D}}\mathbb{E}[(f(T)-a_{11})^{2}]

where the expectation is over \{a_{j}\} and T sampled as in Problem 1.1. Note that the above “scalar MMSE” is directly related to the “vector MMSE”: by linearity of expectation and symmetry among the coordinates,

\inf_{f_{1},\ldots,f_{n}\in\mathbb{R}[T]_{\leq D}}\mathbb{E}\sum_{i=1}^{n}(f_{i}(T)-(a_{1})_{i})^{2}=n\cdot\mathsf{MMSE}_{\leq D}. (2)

It therefore suffices to study the scalar MMSE, with \mathsf{MMSE}_{\leq D}\to 0 indicating near-perfect estimation and \mathsf{MMSE}_{\leq D}\to 1 indicating near-trivial estimation (no better than the constant function f(T)\equiv 0).
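For completeness, the short calculation behind (2): the objective on the left-hand side decouples across coordinates, and each coordinate problem has the same optimal value by symmetry,

\inf_{f_{1},\ldots,f_{n}\in\mathbb{R}[T]_{\leq D}}\mathbb{E}\sum_{i=1}^{n}(f_{i}(T)-(a_{1})_{i})^{2}=\sum_{i=1}^{n}\inf_{f_{i}\in\mathbb{R}[T]_{\leq D}}\mathbb{E}[(f_{i}(T)-(a_{1})_{i})^{2}]=n\cdot\mathsf{MMSE}_{\leq D},

where the last equality uses that the joint distribution of (T,(a_{1})_{i}) is, up to relabeling the coordinates of T, the same for every i, so each inner infimum equals \mathsf{MMSE}_{\leq D}.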

Theorem 1.4.

Fix any k\geq 3 odd and \delta>0. As in Problem 1.1, let

T=(1+\delta)a_{1}^{\otimes k}+\sum_{j=2}^{r}a_{j}^{\otimes k}

where each a_{j} (for 1\leq j\leq r) is drawn uniformly and independently from the hypercube \{\pm 1\}^{n}. Let r=r_{n} grow with n as r=\Theta(n^{\alpha}) for an arbitrary fixed \alpha>0.

  • (“Easy”) If \alpha<k/2 then \lim_{n\to\infty}\mathsf{MMSE}_{\leq C\log n}=0 for a constant C=C(k,\delta)>0.

  • (“Hard”) If \alpha>k/2 then \lim_{n\to\infty}\mathsf{MMSE}_{\leq n^{c}}=1 for a constant c=c(k,\alpha)>0.

More precise and more general statements of the lower bound (“hard” regime) and upper bound (“easy” regime) can be found in Section 2 (Theorems 2.1 and 2.2), where the case of k even is also handled. The lower bound shows failure of LDP algorithms when r\gg n^{k/2}. Since no one has managed to find a poly-time algorithm in the LDP-hard regime for many other problems of a similar nature (including planted clique [Hop18], community detection [HS17], tensor PCA [HKP+17, KWB22], and more), this suggests that there is no poly-time algorithm for Problem 1.1 (and therefore also for Problem 1.2) when r\gg n^{k/2}, at least without a new algorithmic breakthrough. This is the first average-case hardness result of any type for tensor decomposition, and it gives a concrete reason to believe that the existing algorithms for k=3 (which succeed when r\ll n^{3/2} [MSS16, DdL+22]) are essentially optimal. Following the heuristic correspondence between polynomial degree and runtime in Hypothesis 2.1.5 of [Hop18] (see also [KWB22, DKWB19]), namely that degree D corresponds to runtime \exp(\tilde{\Theta}(D)) where \tilde{\Theta} hides factors of \log n, the lower bound suggests that runtime \exp(n^{\Omega(1)}) is required in the hard regime.

The upper bound shows that a degree-O(\log n) polynomial successfully estimates a_{1} in the regime r\ll n^{k/2}. Such a polynomial has n^{O(\log n)} terms and can thus be evaluated in time n^{O(\log n)} (quasi-polynomial). This is in stark contrast to the \exp(n^{\Omega(1)}) runtime that we expect to need in the hard regime. We are confident that for k\in\{3,4\} and r\ll n^{k/2}, Problems 1.1 and 1.2 can in fact be solved in polynomial time by adapting existing algorithms [MSS16, DdL+22], but we have not attempted this here as our main focus is on establishing the phase transition for polynomials. We consider the lower bound to be our main contribution, and the upper bound serves to make the lower bound more meaningful.

1.3 Discussion

1.3.1 Why the LDP framework?

Since there are many different frameworks for exploring statistical-computational gaps, we briefly discuss the possibility of applying some of the others to tensor decomposition, and explain why the LDP framework seems especially suited to this task.

Statistical query (SQ) model.

The SQ model (e.g. [FGR+17, DKS17]) is applicable only to settings where the observation consists of i.i.d. samples from some unknown distribution. There does not seem to be a clear way to cast tensor decomposition in this form.

Sum-of-squares (SoS) hierarchy.

SoS is a powerful algorithmic tool, and SoS lower bounds (e.g. [BHK+19, KMOW17, RSS18]) are sometimes viewed as a gold standard in average-case hardness. However, it is important to note that SoS lower bounds show hardness of certification problems, whereas Problems 1.1 and 1.2 are recovery problems. Hardness of certification does not necessarily imply hardness of recovery, as discussed in [BKW20, BMR21, BBK+21]. Therefore, while there are some natural certification problems related to tensor decomposition that would be nice to have SoS lower bounds for (e.g. Open Problem 7.2 in [Kun22]), it is not clear how to use an SoS lower bound to directly address the decomposition problem.

Optimization landscape.

Prior work [AGJ15, GM17] has studied the landscape of a certain non-convex optimization problem related to tensor decomposition, leading to some algorithmic results. In other settings, recent work has studied structural properties of optimization problems as a means to prove failure of certain algorithms (e.g., local search, Markov chains, and gradient descent [GZ17, BGJ20, GJS21, BWZ20]) to recover a planted structure. If results of this type were obtained for tensor decomposition, they would complement ours by ruling out different types of algorithms. One caveat is that recent work on tensor PCA [RM14, BGJ20, MKUZ19, WEM19, BCRT20] and planted clique [GZ19, CMZ22] has revealed that the choice of which function to optimize can be very important. Thus, “hardness” of one particular canonical optimization landscape need not indicate inherent hardness of the recovery problem.

Reductions.

Reductions from planted clique (e.g. [BR13, HWX15, BBH18, BB20]) are among the strongest forms of average-case hardness that exist for statistical problems, and it would be great to have reduction-based evidence for hardness of tensor decomposition. This would likely require new ideas beyond those in the current literature, as Problem 1.1 is somewhat different in structure from planted clique and the problems it has been reduced to thus far. These differences are explained in Section 1.3.2 below.


In summary, there are many complementary approaches that could be explored in future work in order to give new perspectives on hardness of tensor decomposition. In this work we have chosen to pursue the LDP framework, some strengths of which include its stellar track record of predicting statistical-computational gaps in the past, and its flexibility to directly address the decomposition problem (Problem 1.2) rather than a related certification or optimization problem.

1.3.2 Difficulties of tensor decomposition

Although the suspected threshold r\sim n^{k/2} for random tensor decomposition has been around for some time now [GM15, MSS16], and despite the many recent successes in computational lower bounds for other statistical problems, there are (prior to this work) effectively no results that point to whether r\sim n^{k/2} is a fundamental barrier or not. Here we will discuss what makes tensor decomposition more difficult to analyze than the other statistical-computational gaps that have received recent attention in the literature.

As a point of comparison, we briefly introduce the tensor principal component analysis (tensor PCA) problem [RM14]: here we observe an order-k tensor Y=\lambda x^{\otimes k}+Z where \lambda>0 is a signal-to-noise ratio, x\in\mathbb{R}^{n} is drawn uniformly from the unit sphere, and Z\in(\mathbb{R}^{n})^{\otimes k} is a tensor with i.i.d. \mathcal{N}(0,1) entries; the goal is to (approximately) recover the vector x (up to sign, if k is even). Tensor PCA also has a statistical-computational gap, and it is now well-understood: the best known poly-time algorithms require \lambda\gtrsim n^{k/4}, and there are matching lower bounds in both the SoS and LDP frameworks [HSS15, HKP+17, PR20, KWB22]. While tensor decomposition and tensor PCA may look superficially similar, tensor decomposition is much harder to analyze for the reason described below.

With a few exceptions ([SW22, KM21b, KM21a]), essentially all existing lower bounds for recovering a planted signal in the SQ, SoS, or LDP frameworks leverage hardness of testing between a “planted” distribution and a “null” distribution, where the null distribution has independent entries. In the case of tensor PCA, in order to show hardness of recovering x when \lambda\ll n^{k/4}, the SoS and LDP lower bounds crucially leverage the fact that it is hard to even detect the planted structure when \lambda\ll n^{k/4}, that is, it is hard to distinguish between a spiked tensor Y and an i.i.d. Gaussian tensor Z. Now for the decomposition problem, we are hoping to show hardness of decomposition whenever r\gg n^{k/2}, but the problem of distinguishing between a random rank-r tensor and the appropriate Gaussian tensor (namely the one matching moments of order 1 and 2 with the rank-r tensor; one could choose a different null distribution, but the analysis may be difficult; for instance, see Section 1.5) is actually easy when r\ll n^{k}; this can be achieved by thresholding a particular degree-4 polynomial in the tensor entries, where each monomial forms a double cover of 2k elements of [n]. This “detection-recovery gap” creates difficulty because the standard tools for proving lower bounds cannot show hardness of recovery unless detection is also hard. The tools with this limitation include the SDA (statistical dimension on average) approach for SQ [FGR+17], the pseudo-calibration approach for SoS [BHK+19], and the low-degree likelihood ratio [HS17, HKP+17, Hop18, KWB22] (as well as its conditional variants [BEH+22, CGH+22]) for LDP. Reductions from planted clique also typically require hardness of detection [BBH18].

The method of [SW22] overcomes this challenge in some settings and gives LDP lower bounds for recovery problems even in regimes where detection is easy. This method applies to problems of the form (roughly speaking) “signal plus noise” where the noise has independent entries. For instance, this gives LDP-hardness of recovery for variants of tensor PCA with detection-recovery gaps [LZ22]. However, tensor decomposition does not take this “signal plus noise” form: Problem 1.1 has a “signal” term (1+\delta)a_{1}^{\otimes k} but the remaining “noise” term is a rank-(r-1) tensor, which does not have independent entries. As a result, we will need to introduce a new method in order to prove a lower bound for tensor decomposition. The proof strategy is outlined in Section 1.4.1 below.

1.4 Proof Techniques

1.4.1 Lower bound

Instead of mean squared error it will be convenient to work with an equivalent quantity, the degree-D maximum correlation

\mathsf{Corr}_{\leq D}:=\sup_{f\in\mathbb{R}[T]_{\leq D}}\frac{\mathbb{E}[f(T)\cdot a_{11}]}{\sqrt{\mathbb{E}[f(T)^{2}]}}.

This is directly related to \mathsf{MMSE}_{\leq D} via the identity \mathsf{MMSE}_{\leq D}=\mathbb{E}[a_{11}^{2}]-\mathsf{Corr}_{\leq D}^{2}=1-\mathsf{Corr}_{\leq D}^{2} (see [SW22, Fact 1.1]), so our objective is to show \mathsf{Corr}_{\leq D}=o(1) when r\gg n^{k/2} and D=n^{\Omega(1)}.

A first attempt is to compute \mathsf{Corr}_{\leq D} explicitly. Expand an arbitrary degree-D polynomial in the monomial basis: f(T)=\sum_{0\leq|S|\leq D}\hat{f}_{S}T^{S} where S ranges over multi-sets of tensor entries and T^{S}:=\prod_{I\in S}T_{I}. The numerator of \mathsf{Corr}_{\leq D} is a linear functional of the coefficient vector \hat{f}=(\hat{f}_{S})_{|S|\leq D}:

\mathbb{E}[f(T)\cdot a_{11}]=\sum_{S}\hat{f}_{S}\,\mathbb{E}[T^{S}\cdot a_{11}]=c^{\top}\hat{f}\qquad\text{where }c_{S}:=\mathbb{E}[T^{S}\cdot a_{11}].

For the denominator, \mathbb{E}[f(T)^{2}] is a quadratic form:

\mathbb{E}[f(T)^{2}]=\sum_{S,S^{\prime}}\hat{f}_{S}\hat{f}_{S^{\prime}}\,\mathbb{E}[T^{S}T^{S^{\prime}}]=\hat{f}^{\top}P\hat{f}\qquad\text{where }P_{S,S^{\prime}}:=\mathbb{E}[T^{S}T^{S^{\prime}}].

This means we have the explicit formula

\mathsf{Corr}_{\leq D}=\sup_{\hat{f}}\frac{c^{\top}\hat{f}}{\sqrt{\hat{f}^{\top}P\hat{f}}}=\sqrt{c^{\top}P^{-1}c}.

The difficulty here is that, while we can write down an explicit formula for the vector c and matrix P, it does not seem tractable to write down an explicit formula for the inverse matrix P^{-1}. We will instead find an upper bound on \mathsf{Corr}_{\leq D} that is manageable to work with.
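As an aside, the closed form \sqrt{c^{\top}P^{-1}c} in the display above follows from the standard quadratic-form optimization, assuming the positive semidefinite matrix P is invertible: substituting h=P^{1/2}\hat{f} and applying Cauchy–Schwarz,

\sup_{\hat{f}}\frac{c^{\top}\hat{f}}{\sqrt{\hat{f}^{\top}P\hat{f}}}=\sup_{h}\frac{(P^{-1/2}c)^{\top}h}{\|h\|}=\|P^{-1/2}c\|=\sqrt{c^{\top}P^{-1}c},

with the supremum attained at h\propto P^{-1/2}c, i.e., \hat{f}\propto P^{-1}c.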

Our difficulties stem from the fact that we do not have an orthogonal basis of polynomials to work with. After all, if \{T^{S}\}_{|S|\leq D} were orthogonal polynomials (i.e., if P were a diagonal matrix) then we would have no problem inverting P. Since the distribution of T is complicated (in particular, it is not a product measure), it seems difficult to construct an explicit basis of orthogonal polynomials. Instead, we will think of f(T) as a function of the underlying i.i.d. Rademacher random variables A=(a_{ij}):=(a_{j})_{i} and work with an orthogonal basis of polynomials in those variables. For any f(T) there is an associated polynomial g(A) such that f(T)=g(A), and we can expand these as

\sum_{|S|\leq D}\hat{f}_{S}T^{S}=f(T)=g(A)=\sum_{|U|\leq kD}\hat{g}_{U}A^{U},

where U\subseteq[n]\times[r], A^{U}:=\prod_{(i,j)\in U}a_{ij}, and \hat{g}=(\hat{g}_{U})_{|U|\leq kD} is some vector of coefficients. Here \{A^{U}\} is the standard Fourier basis for functions on the hypercube, which crucially does have the desired orthogonality property: \mathbb{E}[A^{U}A^{U^{\prime}}]=\mathbbm{1}_{U=U^{\prime}}. As a result, we can express the denominator of \mathsf{Corr}_{\leq D} as simply

\mathbb{E}[f(T)^{2}]=\mathbb{E}[g(A)^{2}]=\|\hat{g}\|^{2}.

To exploit this, we will need to understand the relation between \hat{f} and the corresponding \hat{g}. We will write down an explicit linear transformation, i.e., a matrix M such that \hat{g}=M\hat{f}. The crux of our new strategy is that to obtain an upper bound on \mathsf{Corr}_{\leq D} it suffices to construct a left-inverse for M, i.e., a matrix M^{+} such that M^{+}M=I. To see why this suffices,

\mathsf{Corr}_{\leq D}=\sup_{\hat{f}}\frac{c^{\top}\hat{f}}{\|M\hat{f}\|}=\sup_{\hat{f}}\frac{c^{\top}M^{+}M\hat{f}}{\|M\hat{f}\|}\leq\sup_{h}\frac{c^{\top}M^{+}h}{\|h\|}=\|c^{\top}M^{+}\|,

where in the inequality step, h plays the role of \hat{g}=M\hat{f} except we have relaxed the problem to allow h to be any vector, not necessarily in the image of M.

In light of the above, our remaining task is to construct an explicit left-inverse M^{+}. There are many possible choices, some of which will yield better bounds on \mathsf{Corr}_{\leq D} than others. We will need one that yields a good bound but is also analytically tractable to write down explicitly. We now explain the intuition behind our construction for M^{+}. Note that the left-inverse property is equivalent to M^{+}\hat{g}=\hat{f} whenever \hat{g}=M\hat{f}. That is, M^{+} is a procedure to recover the (coefficients of the) polynomial f when given the (coefficients of the) polynomial g that satisfies f(T)=g(A). Luckily there is a methodical process for this task which starts from the highest-degree terms and works downward. At each stage, there is always a particular monomial in A whose coefficient in g allows us to immediately deduce some particular coefficient in f. To illustrate, if k=3 and D=2 then the coefficient of the monomial

a_{27}a_{37}a_{47}a_{28}a_{38}a_{58} (3)

in g(A) allows us to immediately deduce the coefficient of the monomial T_{234}T_{235} in f(T), since this is the only monomial in T (of degree at most 2) whose expansion in terms of A contains the term (3). Once we have deduced the highest-degree coefficients of f, we can subtract off the corresponding terms in g and repeat the process recursively. Our choice of M^{+} is inspired by this process; the full details can be found in Section 3.4.
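To spell out this example: writing \lambda_{1}=1+\delta and \lambda_{j}=1 for j\geq 2, so that in Problem 1.1 we have T_{234}=\sum_{j}\lambda_{j}a_{2j}a_{3j}a_{4j}, the expansion of the candidate monomial in terms of A is

T_{234}T_{235}=\Big(\sum_{j=1}^{r}\lambda_{j}a_{2j}a_{3j}a_{4j}\Big)\Big(\sum_{j^{\prime}=1}^{r}\lambda_{j^{\prime}}a_{2j^{\prime}}a_{3j^{\prime}}a_{5j^{\prime}}\Big)=\sum_{j,j^{\prime}}\lambda_{j}\lambda_{j^{\prime}}\,a_{2j}a_{3j}a_{4j}\,a_{2j^{\prime}}a_{3j^{\prime}}a_{5j^{\prime}},

and the term (j,j^{\prime})=(7,8) is exactly (3). Conversely, (3) has the two non-empty columns \{2,3,4\} and \{2,3,5\}, and (as stated above) no other monomial in T of degree at most 2 produces it, so the coefficient of (3) in g equals \lambda_{7}\lambda_{8}\hat{f}_{S} for S=\{T_{234},T_{235}\}, from which \hat{f}_{S} can be read off.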

Recall now that the final step is to bound \|c^{\top}M^{+}\| for our choice of M^{+}. Due to the recursive structure of M^{+}, this boils down to analyzing certain recursively-defined values v_{S} (see Section 3.6). As above, S is a multi-set of tensor entries, which can be thought of as a hypergraph on vertex set [n] (with a hyperedge for each tensor entry). The value v_{S} is computed by summing over all possible ways to reduce S to a smaller hypergraph by replacing some collection of hyperedges by a new hyperedge equal to their symmetric difference. Analyzing this recurrence is the most technically involved part of the proof, since there are subtle cancellations that need to be understood in order to get the right bound.

1.4.2 Upper bound

To construct a polynomial that accurately estimates the largest component, we take inspiration from a line of algorithmic work that builds spectral methods from tensor networks [HSSS16, MW19, DdL+22, OTR22]. A tensor network is a diagram that specifies how to “multiply” together a collection of tensors to form another tensor; see [MW19] for details. Some existing algorithms for order-3 tensor decomposition [HSSS16, DdL+22] (including the fastest known method that reaches the threshold r\sim n^{3/2} [DdL+22]) are based on the idea of building a tensor network from multiple copies of the input tensor, flattening the result to a matrix, and running a spectral method on this matrix. Morally speaking, these are all steps that one should be able to capture using low-degree polynomials, but the algorithm of [DdL+22] is actually a multi-stage process that seems challenging to write as a single polynomial. Also, since our goal is to estimate the largest component, we do not need to inject randomness to break symmetry among the components as in [DdL+22].

To construct our polynomial, we multiply O(\log n) copies of T together in a tensor network whose shape is an expander graph. The output is a single vector, which we take as our estimate for a_{1}. There is one key trick we use to greatly simplify the analysis: we deviate slightly from the standard notion of a tensor network by disallowing repeated “edge labels.” The precise definition of the polynomial appears in Section 4.2.
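As a toy illustration of the kind of low-degree building block involved (not the actual estimator, which contracts the copies of T according to an expander-shaped network and is defined precisely in Section 4.2), a single tensor-network contraction of an order-3 tensor against two vectors looks as follows; numpy is assumed and the function name is a placeholder:

import numpy as np

def contract_step(T, u, v):
    # w_i = sum_{j,k} T[i, j, k] * u[j] * v[k];
    # each entry of w has degree 1 in the entries of T, so chaining
    # O(log n) such steps (one fresh copy of T per step) produces
    # entries of degree O(log n) in T
    return np.einsum('ijk,j,k->i', T, u, v)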

1.5 Future Directions

We collect here a list of possible extensions of our main result and related open problems.

  • Distribution of a_{j}: We have taken a_{j} drawn uniformly from the hypercube in the interest of keeping the proofs as clean as possible. We expect our approach could be adapted to other distributions for a_{j} such as \mathcal{N}(0,I_{n}).

  • Canonical model: While we feel that the semirandom tensor decomposition model (Problem 1.2) is quite natural, there may be interest in showing hardness for the more canonical model where every \lambda_{j} is equal to 1. One way to approach this could be to study the task of hypothesis testing between a random rank-r tensor and a random rank-(r+1) tensor. Hardness (for poly-time algorithms) of this testing problem implies hardness of decomposition in the canonical model. It may be possible to use our proof technique to show LDP-hardness of this testing problem. Another alternative approach could be to consider a variant of Problem 1.1 where symmetry between the tensor components is broken by giving the algorithm a small amount of side-information rather than increasing the norm of one component.

  • Tensors with algebraic structure: Hopefully the proof techniques introduced here will open the door to other hardness results for related problems. For instance, orbit recovery problems—a class of statistical problems involving group actions [BBK+17, FSWW20]—give rise to variants of tensor decomposition with algebraic structure. One example is multi-reference alignment (MRA), where the tensor components are cyclic shifts of one another [PWB+19]. A statistical-computational gap in heterogeneous MRA was conjectured in [BBLS18]; the positive side was resolved in [Wei18] using [MSS16], and the negative side should now be approachable using our techniques. Tensors with continuous group symmetry are even less well understood [MW19, LM21].

  • Smoothed order-3 tensor decomposition: For k=3, the algorithms achieving the optimal condition r\ll n^{3/2} seem to crucially exploit randomness in the components [MSS16, DdL+22]. In contrast, other algorithms succeed in the smoothed analysis model (which imposes minimal assumptions on the components) but require r\lesssim n [GVX14, BCMV14]. It remains unclear whether better algorithms for the smoothed model exist, or whether there is an inherent gap between the random and smoothed settings. For k=4 there is no such gap, with both random and smoothed algorithms achieving r\sim n^{2} [DCC07, MSS16].

  • Other LDP estimation lower bounds: Proving LDP lower bounds for estimation problems remains a difficult task in many settings, with relatively few tools available compared to detection (hypothesis testing) problems. A few open problems are to resolve the LDP recovery threshold for sparse linear regression and group testing; these problems have detection-recovery gaps and the LDP detection thresholds were resolved in [BEH+22, CGH+22].

2 General Setting and Results

It will be convenient to generalize Problem 1.1 in a few different ways. First, we allow all the components to have potentially different norms. This way, there is nothing distinguished about the first component, eliminating some tedious casework. To this end, we will consider tensors of the form

\sum_{j=1}^{r}\lambda_{j}a_{j}^{\otimes k} (4)

where each component a_{j} is drawn uniformly and independently from the hypercube \{\pm 1\}^{n}, and \lambda_{j}\in\mathbb{R} are deterministic nonzero scalars that are known to the algorithm. Without loss of generality we assume 1=\lambda_{1}\geq|\lambda_{2}|\geq|\lambda_{3}|\geq\cdots\geq|\lambda_{r}|=:\lambda_{\min}>0. The goal is to recover the largest component a_{1}. The case \lambda_{2}=\lambda_{3}=\cdots=\lambda_{r}=(1+\delta)^{-1} is equivalent to Problem 1.1 (rescale the tensor in Problem 1.1 by (1+\delta)^{-1}).

Second, we will potentially allow the algorithm access not just to an order-k tensor but also to tensors of lower order. For I\subseteq[n] write

T_{I}:=\sum_{j=1}^{r}\lambda_{j}a_{j}^{I} (5)

with \lambda_{j} and a_{j} as above, and where for a vector v\in\mathbb{R}^{n}, v^{I}:=\prod_{i\in I}v_{i}. Note that since the a_{j} have \{\pm 1\}-valued entries, each entry of the tensor (4) takes the form T_{I} for some 0\leq|I|\leq k.

Let \Omega be a collection of subsets of [n]. We consider the task of estimating a_{11}:=(a_{1})_{1} using a degree-D polynomial f:\mathbb{R}^{\Omega}\to\mathbb{R} whose input variables are \{T_{I}\}_{I\in\Omega}. Accordingly, we define

\mathsf{MMSE}_{\leq D}^{\Omega}:=\inf_{\begin{subarray}{c}f:\mathbb{R}^{\Omega}\to\mathbb{R}\\ \deg(f)\leq D\end{subarray}}\mathbb{E}[(f(T)-a_{11})^{2}].

To make our lower bound for order-k tensors as strong as possible, we will choose \Omega=\{I\subseteq[n]\,:\,0<|I|\leq k\}. This is equivalent to giving the algorithm access to all the tensors

\sum_{j=1}^{r}\lambda_{j}a_{j},\qquad\sum_{j=1}^{r}\lambda_{j}a_{j}^{\otimes 2},\qquad\ldots,\qquad\sum_{j=1}^{r}\lambda_{j}a_{j}^{\otimes k},

(which is a situation that often arises when using the method of moments, e.g. [HK13, PWB+19]). Note that a lower bound in this setting implies a lower bound in the original setting where only the order-k tensor is revealed.

To make our upper bound as strong as possible, we will give the algorithm access to minimal information. For k odd, we will take \Omega=\{I\subseteq[n]\,:\,|I|=k\}, that is, our polynomial only needs access to the “off-diagonal” entries of the order-k tensor (T_{i_{1},\ldots,i_{k}} where i_{1},\ldots,i_{k} are all distinct). For k even, the order-k tensor alone does not suffice to disambiguate between a_{1} and -a_{1}, so we additionally reveal the (off-diagonal part of the) order-(k-1) tensor, that is, we take \Omega=\{I\subseteq[n]\,:\,k-1\leq|I|\leq k\}. (An alternative formulation for k even could be to reveal only the order-k tensor and ask the algorithm to estimate (a_{1})_{1}\cdot(a_{1})_{2}; we expect this could be analyzed using our methods.)

The following are non-asymptotic statements of our main results. Together, these immediately imply Theorem 1.4.

Theorem 2.1 (Lower bound).

Consider the setting described above with any parameters k\geq 3, n\geq 1, D\geq 0, and set \Omega=\{I\subseteq[n]\,:\,0<|I|\leq k\}. If

r\geq 19k^{k}D^{k+4}\lambda_{\min}^{-2}n^{k/2} (6)

then

\mathsf{MMSE}_{\leq D}^{\Omega}\geq 1-n^{-1/2}.

In particular, if k\geq 3 is fixed, \lambda_{\min}\geq\delta for fixed \delta>0, and r=r_{n} grows as r=\Theta(n^{\alpha}) for fixed \alpha>k/2, then \lim_{n\to\infty}\mathsf{MMSE}_{\leq n^{c}}^{\Omega}=1 for a constant c=c(k,\alpha)>0.

Theorem 2.2 (Upper bound).

Consider the setting described above with any k\geq 3. Let k^{\prime}\in\{k-1,k\} be odd, and set \Omega=\{I\subseteq[n]\,:\,k^{\prime}\leq|I|\leq k\}. Suppose n\geq n_{0} for some constant n_{0}=n_{0}(k). Let D be the smallest odd integer such that

D\geq\frac{k\log n}{1-|\lambda_{2}|}.

If |\lambda_{2}|\leq 1-n^{-1/52} and

r\leq\frac{1}{2}k^{-k/2}D^{-27k}n^{k/2} (7)

then

\mathsf{MMSE}_{\leq D}^{\Omega}\leq 10k^{k-1}D^{52k}\frac{r}{n^{k-1-\mathbbm{1}_{k\text{ even}}}}\leq 10k^{k-1}D^{52k}\frac{r}{n^{k/2}}. (8)

In particular, if k\geq 3 is fixed, |\lambda_{2}|\leq 1-\delta for fixed \delta>0, and r=r_{n} grows as r=\Theta(n^{\alpha}) for fixed 0<\alpha<k/2, then \lim_{n\to\infty}\mathsf{MMSE}_{\leq C\log n}^{\Omega}=0 for a constant C=C(k,\delta)>0.

Note that (7) is the bottleneck, requiring r\ll n^{k/2}. The intermediate step in (8) is a sharper bound on the MMSE that we use for the remark below, but it does not dictate the threshold for r.

Remark 2.3 (Exact recovery).

We remark that if \mathsf{MMSE}_{\leq D}=o(1/n) then exact recovery of a_{1} is possible with probability 1-o(1) by thresholding the values of certain degree-D polynomials f_{1},\ldots,f_{n}. To see this: combining (2) with Markov’s inequality, we have f_{1},\ldots,f_{n} such that \sum_{i=1}^{n}(f_{i}(T)-(a_{1})_{i})^{2}<1 with probability 1-o(1). Furthermore, thresholding f_{1},\ldots,f_{n} is guaranteed to exactly recover a_{1} whenever \sum_{i=1}^{n}(f_{i}(T)-(a_{1})_{i})^{2}<1, since this forces |f_{i}(T)-(a_{1})_{i}|<1 for every i, and hence \mathrm{sign}(f_{i}(T))=(a_{1})_{i} because (a_{1})_{i}\in\{\pm 1\}.

As a result, when |\lambda_{2}|\leq 1-\Omega(1) and r\leq n^{k/2}/(\log n)^{O(1)}, our upper bound (Theorem 2.2) gives exact recovery for any fixed k\geq 5. We expect a similar result to hold for k\in\{3,4\} but this may require modifying the construction of the polynomial in the proof of Theorem 2.2.

3 Proof of Theorem 2.1: Lower Bound

3.1 Setup

Throughout, define \Omega:=\{I\subseteq[n]\,:\,0<|I|\leq k\}. Our goal will be to give an upper bound on

\mathsf{Corr}_{\leq D}:=\sup_{\begin{subarray}{c}f:\mathbb{R}^{\Omega}\to\mathbb{R}\\ \deg(f)\leq D\end{subarray}}\frac{\mathbb{E}[f(T)\cdot a_{11}]}{\sqrt{\mathbb{E}[f(T)^{2}]}}.

This will imply the desired result due to the following direct relation between \mathsf{Corr}_{\leq D} and the associated MMSE.

Fact 3.1 ([SW22, Fact 1.1]).

\mathsf{MMSE}_{\leq D}^{\Omega}=\mathbb{E}[a_{11}^{2}]-\mathsf{Corr}_{\leq D}^{2}=1-\mathsf{Corr}_{\leq D}^{2}.
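For completeness, a sketch of the standard argument behind this fact: for any fixed f, optimizing the mean squared error over rescalings c\cdot f gives

\inf_{c\in\mathbb{R}}\mathbb{E}[(c\,f(T)-a_{11})^{2}]=\mathbb{E}[a_{11}^{2}]-\frac{\mathbb{E}[f(T)\cdot a_{11}]^{2}}{\mathbb{E}[f(T)^{2}]},

attained at c=\mathbb{E}[f(T)\cdot a_{11}]/\mathbb{E}[f(T)^{2}]. Since the class of degree-D polynomials is closed under rescaling, taking the infimum over f and using \mathbb{E}[a_{11}^{2}]=1 yields \mathsf{MMSE}_{\leq D}^{\Omega}=1-\mathsf{Corr}_{\leq D}^{2}.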

Any degree-D polynomial f:\mathbb{R}^{\Omega}\to\mathbb{R} admits an expansion of the form

f(T)=\sum_{0\leq|S|\leq D}\hat{f}_{S}T^{S}

for some coefficients \hat{f}_{S}\in\mathbb{R}, where S takes values S\in\mathbb{N}^{\Omega}:=\{0,1,2,\ldots\}^{\Omega} with |S|:=\sum_{I\in\Omega}S_{I} and T^{S}:=\prod_{I\in\Omega}T_{I}^{S_{I}}.

At the same time, T can be thought of as a function of A, the n\times r matrix with columns a_{j}. This means we can also expand f as

f(T)=g(A)=\sum_{U}\hat{g}_{U}A^{U}

where U ranges over subsets U\subseteq[n]\times[r] of cardinality |U|\leq kD. This expansion will be useful because, since A has i.i.d. Rademacher entries, \{A^{U}\} forms an orthonormal basis in the sense \mathbb{E}[A^{U}A^{U^{\prime}}]=\mathbbm{1}_{U=U^{\prime}}. As a consequence,

\mathbb{E}[f(T)^{2}]=\mathbb{E}[g(A)^{2}]=\|\hat{g}\|^{2}:=\sum_{U}\hat{g}_{U}^{2}. (9)

Any vector of coefficients \hat{f}=(\hat{f}_{S})_{|S|\leq D} induces a unique choice of \hat{g}=(\hat{g}_{U})_{|U|\leq kD} such that

\sum_{|S|\leq D}\hat{f}_{S}T^{S}=f(T)=g(A)=\sum_{|U|\leq kD}\hat{g}_{U}A^{U}. (10)

It will be important to understand this mapping from \hat{f} to \hat{g}.

3.2 Proof Overview

We will show that the mapping from \hat{f} to \hat{g} in (10) is a linear transformation that takes the form \hat{g}=M\hat{f} for an explicit matrix M=(M_{US}). A key step in the proof will be to construct an explicit left inverse for M, that is, a matrix M^{+} satisfying M^{+}M=I. In other words, M^{+} is a matrix that recovers the coefficients \hat{f} from the coefficients \hat{g}: M^{+}\hat{g}=M^{+}M\hat{f}=\hat{f}.

The numerator of \mathsf{Corr}_{\leq D} can be expressed as

\mathbb{E}[f\cdot a_{11}]=\sum_{|S|\leq D}\hat{f}_{S}\,\mathbb{E}[T^{S}\cdot a_{11}]=c^{\top}\hat{f}

where the vector c=(c_{S})_{|S|\leq D} is defined by

c_{S}=\mathbb{E}[T^{S}\cdot a_{11}]. (11)

Using (9), the denominator can be expressed as \sqrt{\mathbb{E}[f(T)^{2}]}=\|\hat{g}\|=\|M\hat{f}\|. This means we can write

\mathsf{Corr}_{\leq D}=\sup_{\hat{f}}\frac{c^{\top}\hat{f}}{\|M\hat{f}\|}=\sup_{\hat{f}}\frac{c^{\top}M^{+}M\hat{f}}{\|M\hat{f}\|}\leq\sup_{h}\frac{c^{\top}M^{+}h}{\|h\|}=\|c^{\top}M^{+}\|. (12)

In the crucial inequality above, h plays the role of \hat{g}=M\hat{f} except we have relaxed the problem to allow h to be any vector, not necessarily in the image of M. After this simplification, the optimizer for h is h^{*}=(c^{\top}M^{+})^{\top} (or any scalar multiple thereof), yielding the explicit expression \|c^{\top}M^{+}\| for the optimum. So long as we can construct a left inverse M^{+}, this gives an upper bound on \mathsf{Corr}_{\leq D}. While many choices for M^{+} are possible, we will need to find one that (i) is simple enough to work with explicitly, and (ii) results in a good bound on \mathsf{Corr}_{\leq D} at the end of the day.

3.3 Computing M

The first step in writing down an explicit expression for M will be to express T^{S} in the basis \{A^{U}\}.

Definition 3.2 (List notation for S).

We will identify S\in\mathbb{N}^{\Omega} with a multi-set, namely the multi-set containing S_{I} copies of each I\in\Omega. With some abuse of notation, write S as an ordered list S=(I_{1},\ldots,I_{|S|}) containing the elements of the associated multi-set sorted according to some arbitrary but fixed ordering on \Omega. For a labeling \ell=(\ell_{1},\ldots,\ell_{|S|})\in[r]^{|S|}, define S(\ell)\subseteq[n]\times[r] to be the subset containing all (i,j) pairs with the following property: the element i occurs in an odd number of the sets \{I_{d}\,:\,\ell_{d}=j\}. (Informally, S(\ell) is produced by placing each I_{d} into column \ell_{d} of an n\times r grid and then XOR’ing the contents of each column.)
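As a small example with k=3: take S=(I_{1},I_{2}) with I_{1}=\{1,2,3\} and I_{2}=\{1,2,4\}. For \ell=(5,6) the two sets land in different columns, so S(\ell) has column 5 equal to \{1,2,3\} and column 6 equal to \{1,2,4\}. For \ell=(5,5) both sets land in column 5 and are XOR’ed, so S(\ell) has the single non-empty column \{3,4\}: the elements 1 and 2 each appear twice and cancel.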

With this notation we can write

T^{S}=\prod_{I\in\Omega}\left(\sum_{j=1}^{r}\lambda_{j}a_{j}^{I}\right)^{S_{I}}=\prod_{d=1}^{|S|}\left(\sum_{\ell_{d}=1}^{r}\lambda_{\ell_{d}}a_{\ell_{d}}^{I_{d}}\right)=\sum_{\ell\in[r]^{|S|}}\lambda^{\ell}A^{S(\ell)} (13)

where \lambda^{\ell}:=\prod_{d}\lambda_{\ell_{d}}.

Next we consider an arbitrary vector of coefficients \hat{f} and write down an expression for the corresponding \hat{g} in (10). We have

f(T) =\sum_{|S|\leq D}\hat{f}_{S}T^{S}
=\sum_{|S|\leq D}\hat{f}_{S}\sum_{\ell\in[r]^{|S|}}\lambda^{\ell}A^{S(\ell)}
=\sum_{|U|\leq kD}A^{U}\underbrace{\sum_{|S|\leq D}\hat{f}_{S}\sum_{\ell\in[r]^{|S|}}\lambda^{\ell}\,\mathbbm{1}_{S(\ell)=U}}_{\hat{g}_{U}}.

In other words, \hat{g}=M\hat{f} where

M_{US}:=\sum_{\ell\in[r]^{|S|}}\lambda^{\ell}\,\mathbbm{1}_{S(\ell)=U}. (14)

3.4 Constructing the Left Inverse

We now construct a left inverse M^{+} for M, which, recall, is needed to apply the bound (12). The intuition behind this construction is described in Section 1.4.1.

Definition 3.3 (Generic U).

We call U\subseteq[n]\times[r] generic provided that every column U_{j}\subseteq[n] satisfies |U_{j}|\leq k, and at most D columns satisfy |U_{j}|>0.

These are the “generic” terms appearing in the expansion (13): U is generic if and only if there exist S\in\mathbb{N}^{\Omega} with |S|\leq D and \ell\in[r]^{|S|} with distinct entries \ell_{1},\ldots,\ell_{|S|} such that S(\ell)=U. (Note that (6) implies D\leq r, so it is possible for \ell to have distinct entries.) Furthermore, if U is generic then there is a unique corresponding S, namely \mathsf{cols}(U) defined as follows.

Definition 3.4.

For a generic U, define \mathsf{cols}(U)\in\mathbb{N}^{\Omega} by letting \mathsf{cols}(U)_{I} be the number of columns j for which U_{j}=I.

When viewing S as a multi-set as per Definition 3.2, \mathsf{cols}(U) is simply the multi-set of columns U_{j}. Recalling the definition |S|:=\sum_{I\in\Omega}S_{I}, note that |\mathsf{cols}(U)| denotes the number of non-empty columns of U.
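For example (with any k\geq 3 and D\geq 3), if the only non-empty columns of U are U_{2}=U_{5}=\{1,3,4\} and U_{9}=\{2,6\}, then \mathsf{cols}(U) is the multi-set \{\{1,3,4\},\{1,3,4\},\{2,6\}\}, i.e., the element of \mathbb{N}^{\Omega} with \mathsf{cols}(U)_{\{1,3,4\}}=2 and \mathsf{cols}(U)_{\{2,6\}}=1; in particular |\mathsf{cols}(U)|=3.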

Our left inverse M^{+} will satisfy M^{+}_{SU}=0 whenever U is not generic; in other words, our procedure for recovering \hat{f} from \hat{g} only uses the values \hat{g}_{U} for which U is generic. Write M in block form

M=\left[\begin{array}{c}G\\ N\end{array}\right]

where G (“generic”) is indexed by generic U’s and N (“not”) is indexed by the rest. It suffices to construct a left inverse G^{+} for G and then set

M^{+}=\left[\begin{array}{cc}G^{+}&0\end{array}\right]. (15)

Note that G=G(D) has the recursive structure

G(D)=\left[\begin{array}{cc}G(D-1)&R(D)\\ 0&Q(D)\end{array}\right]

where the first block of columns is indexed by |S|\leq D-1 and the second is indexed by |S|=D, and the first block of rows is indexed by |\mathsf{cols}(U)|\leq D-1 and the second by |\mathsf{cols}(U)|=D. Crucially, the lower-left block is 0: recall (14) and note that S(\ell)=U is only possible when |S|\geq|\mathsf{cols}(U)|. Given any left inverses G(D-1)^{+},Q(D)^{+} for G(D-1),Q(D) respectively, one can verify that the following matrix is a valid left inverse for G(D):

G(D)^{+}:=\left[\begin{array}{cc}G(D-1)^{+}&-G(D-1)^{+}R(D)Q(D)^{+}\\ 0&Q(D)^{+}\end{array}\right]. (16)

The left inverse G(D-1)^{+} can be constructed by applying (16) recursively. The matrix Q(D) has only one nonzero entry per row and so we will be able to construct Q(D)^{+} by hand.
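For completeness, the block computation verifying that (16) is a left inverse of G(D):

G(D)^{+}G(D)=\left[\begin{array}{cc}G(D-1)^{+}G(D-1)&G(D-1)^{+}R(D)-G(D-1)^{+}R(D)Q(D)^{+}Q(D)\\ 0&Q(D)^{+}Q(D)\end{array}\right]=\left[\begin{array}{cc}I&0\\ 0&I\end{array}\right],

using G(D-1)^{+}G(D-1)=I and Q(D)^{+}Q(D)=I in the last step.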

Lemma 3.5.

The following is a valid left inverse for Q=Q(D): for U generic and |S|=|\mathsf{cols}(U)|=D, define Q^{+}=(Q^{+}_{SU}) by

Q^{+}_{SU}:=\frac{\mathbbm{1}_{\mathsf{cols}(U)=S}}{\lambda^{U}r^{\underline{|S|}}} (17)

where

\lambda^{U}:=\prod_{j\,:\,|U_{j}|>0}\lambda_{j}

and

r^{\underline{d}}:=\underbrace{r(r-1)(r-2)\cdots(r-d+1)}_{d\text{ factors}}. (18)
Proof.

We need to verify Q^{+}Q=I. For |S|=|S^{\prime}|=D,

(Q^{+}Q)_{SS^{\prime}}=\sum_{U\,:\,|\mathsf{cols}(U)|=D}Q^{+}_{SU}Q_{US^{\prime}}=\sum_{U\,:\,|\mathsf{cols}(U)|=D}\frac{\mathbbm{1}_{\mathsf{cols}(U)=S}}{\lambda^{U}r^{\underline{|S|}}}\sum_{\ell\in[r]^{|S^{\prime}|}}\lambda^{\ell}\,\mathbbm{1}_{S^{\prime}(\ell)=U}=\mathbbm{1}_{S=S^{\prime}},

where the last step is justified as follows: since |S^{\prime}|=|\mathsf{cols}(U)|=D, the indicator \mathbbm{1}_{S^{\prime}(\ell)=U} can only be satisfied when \ell has distinct entries. There are r^{\underline{|S^{\prime}|}} such \ell’s, and for each there is exactly one term in the first sum satisfying \mathbbm{1}_{S^{\prime}(\ell)=U}, namely U=S^{\prime}(\ell), and this implies \lambda^{U}=\lambda^{\ell} and S^{\prime}=\mathsf{cols}(U). This means the indicator \mathbbm{1}_{\mathsf{cols}(U)=S} becomes \mathbbm{1}_{S=S^{\prime}}, and the other factors cancel. This completes the proof. ∎

This completes the description of the left inverse M^{+}.

3.5 Recurrence for w

Ultimately we are interested in an expression not for M^{+} but for the vector w^{\top}:=c^{\top}M^{+} appearing in (12). We will use the calculations from the previous section to write down a self-contained recursive formula for the entries of w. Note that from (15), w_{U}=0 whenever U is not generic, so we will focus on computing w_{\mathrm{gen}}=(w_{U}\,:\,U\text{ generic}). Using (16),

w_{\mathrm{gen}}^{\top}=c^{\top}G^{+}=\left[c^{\top}\left[\begin{array}{c}G(D-1)^{+}\\ 0\end{array}\right]\quad c^{\top}\left[\begin{array}{c}-G(D-1)^{+}R(D)Q(D)^{+}\\ Q(D)^{+}\end{array}\right]\right]=:\left[x^{\top}\quad y^{\top}\right],

where x is indexed by |\mathsf{cols}(U)|\leq D-1 and y is indexed by |\mathsf{cols}(U)|=D. The expression for x reveals that w_{U} does not depend on D (so long as D\geq|\mathsf{cols}(U)|). By comparing the expressions for x and y we can write

y^{\top}=c^{\top}\left[\begin{array}{c}0\\ Q(D)^{+}\end{array}\right]-x^{\top}R(D)Q(D)^{+}.

For any generic U, the case D=|\mathsf{cols}(U)| of the above gives

w_{U}=\underbrace{\sum_{S\,:\,|S|=|\mathsf{cols}(U)|}c_{S}Q(D)^{+}_{SU}}_{\text{(I)}}-\underbrace{\sum_{\begin{subarray}{c}\text{generic }U^{\prime}\\ |\mathsf{cols}(U^{\prime})|<|\mathsf{cols}(U)|\end{subarray}}\;\sum_{S\,:\,|S|=|\mathsf{cols}(U)|}w_{U^{\prime}}R(D)_{U^{\prime}S}Q(D)^{+}_{SU}}_{\text{(II)}}.

We treat the two terms (I) and (II) separately. Using the definition (17) for Q^{+},

(I) =\sum_{S\,:\,|S|=|\mathsf{cols}(U)|}c_{S}\,\frac{\mathbbm{1}_{\mathsf{cols}(U)=S}}{\lambda^{U}r^{\underline{|S|}}}
=\frac{c_{S}}{\lambda^{U}r^{\underline{|S|}}}\qquad\text{where }S=\mathsf{cols}(U).

Now for the second term (II), suppressing the constraints on U^{\prime} for ease of notation,

(II) =\sum_{U^{\prime}}\;\sum_{S\,:\,|S|=|\mathsf{cols}(U)|}w_{U^{\prime}}R(D)_{U^{\prime}S}Q(D)^{+}_{SU}
=\sum_{U^{\prime}}\;\sum_{S\,:\,|S|=|\mathsf{cols}(U)|}w_{U^{\prime}}\left(\sum_{\ell\in[r]^{|S|}}\lambda^{\ell}\,\mathbbm{1}_{S(\ell)=U^{\prime}}\right)\frac{\mathbbm{1}_{\mathsf{cols}(U)=S}}{\lambda^{U}r^{\underline{|S|}}}
=\sum_{U^{\prime}}w_{U^{\prime}}\left(\sum_{\ell\in[r]^{|S|}}\lambda^{\ell}\,\mathbbm{1}_{S(\ell)=U^{\prime}}\right)\frac{1}{\lambda^{U}r^{\underline{|S|}}}\qquad\text{where }S=\mathsf{cols}(U).

Putting it together, we have now shown

\mathsf{Corr}_{\leq D}^{2}\leq\sum_{\begin{subarray}{c}\text{generic }U\\ |\mathsf{cols}(U)|\leq D\end{subarray}}w_{U}^{2} (19)

where, for generic U, the w_{U} are defined by the recurrence

w_{U}=\frac{1}{\lambda^{U}r^{\underline{|S|}}}\left(c_{S}-\sum_{\begin{subarray}{c}\text{generic }U^{\prime}\\ |\mathsf{cols}(U^{\prime})|<|\mathsf{cols}(U)|\end{subarray}}w_{U^{\prime}}\sum_{\ell\in[r]^{|S|}}\lambda^{\ell}\,\mathbbm{1}_{S(\ell)=U^{\prime}}\right)\qquad\text{where }S=\mathsf{cols}(U). (20)

3.6 Recurrence for vv

It will be convenient to rewrite the recurrence (20) in terms of a different quantity indexed by SΩS\in\mathbb{N}^{\Omega} instead of U[n]×[r]U\subseteq[n]\times[r], namely

vS:=λUr|S|¯wUfor U satisfying 𝖼𝗈𝗅𝗌(U)=S.v_{S}:=\lambda^{U}r^{\underline{|S|}}\,w_{U}\qquad\text{for $U$ satisfying }\mathsf{cols}(U)=S. (21)

It can be seen from (20) that vSv_{S} is well-defined in the sense that it does not depend on the choice of UU in (21). In particular,

vS=cSgeneric U|𝖼𝗈𝗅𝗌(U)|<|S|wU[r]|S|λ 1S()=U.v_{S}=c_{S}-\sum_{\begin{subarray}{c}\text{generic }U^{\prime}\\ |\mathsf{cols}(U^{\prime})|<|S|\end{subarray}}w_{U^{\prime}}\sum_{\ell\in[r]^{|S|}}\lambda^{\ell}\,\mathbbm{1}_{S(\ell)=U^{\prime}}.

Using (21) we can turn this into a self-contained recurrence for vv (not involving ww):

vS\displaystyle v_{S} =cSgeneric U|𝖼𝗈𝗅𝗌(U)|<|S|vSλUr|S|¯[r]|S|λ 1S()=Uwhere S=𝖼𝗈𝗅𝗌(U)\displaystyle=c_{S}-\sum_{\begin{subarray}{c}\text{generic }U^{\prime}\\ |\mathsf{cols}(U^{\prime})|<|S|\end{subarray}}\frac{v_{S^{\prime}}}{\lambda^{U^{\prime}}r^{\underline{|S^{\prime}|}}}\sum_{\ell\in[r]^{|S|}}\lambda^{\ell}\,\mathbbm{1}_{S(\ell)=U^{\prime}}\qquad\text{where }S^{\prime}=\mathsf{cols}(U^{\prime})
=cSSΩ|S|<|S|vS[r]|S|λλS()r|S|¯ 1𝖼𝗈𝗅𝗌(S())=S\displaystyle=c_{S}-\sum_{\begin{subarray}{c}S^{\prime}\in\mathbb{N}^{\Omega}\\ |S^{\prime}|<|S|\end{subarray}}v_{S^{\prime}}\sum_{\ell\in[r]^{|S|}}\frac{\lambda^{\ell}}{\lambda^{S(\ell)}r^{\underline{|S^{\prime}|}}}\,\mathbbm{1}_{\mathsf{cols}(S(\ell))=S^{\prime}} (22)

where the predicate 𝖼𝗈𝗅𝗌(S())=S\mathsf{cols}(S(\ell))=S^{\prime} in particular requires S()S(\ell) to be generic. For any SΩS\in\mathbb{N}^{\Omega} there are at most r|S|¯r^{\underline{|S|}} corresponding generic UU’s for which S=𝖼𝗈𝗅𝗌(U)S=\mathsf{cols}(U). For any such UU, we have from (21) that |wU||vS|/(λmin|S|r|S|¯)|w_{U}|\leq|v_{S}|/(\lambda_{\min}^{|S|}\,r^{\underline{|S|}}). This means (19) gives

𝖢𝗈𝗋𝗋D2SΩ|S|Dr|S|¯(vSλmin|S|r|S|¯)2=SΩ|S|DvS2λmin2|S|r|S|¯\mathsf{Corr}_{\leq D}^{2}\leq\sum_{\begin{subarray}{c}S\in\mathbb{N}^{\Omega}\\ |S|\leq D\end{subarray}}r^{\underline{|S|}}\left(\frac{v_{S}}{\lambda_{\min}^{|S|}\,r^{\underline{|S|}}}\right)^{2}=\sum_{\begin{subarray}{c}S\in\mathbb{N}^{\Omega}\\ |S|\leq D\end{subarray}}\frac{v_{S}^{2}}{\lambda_{\min}^{2|S|}\,r^{\underline{|S|}}} (23)

where vSv_{S} is defined by the recurrence (22).

3.7 Grouping by Patterns

The core challenge remaining is to analyze the recurrence (22) and establish an upper bound on |vS||v_{S}|. Naively bounding the terms in (22) will not suffice, as there are subtle cancellations that occur. To understand the nature of these cancellations, we will rewrite (22) in a different form. First we will group the terms in (22) by their type, defined as follows. We use \oplus to denote the XOR (symmetric difference) operation on sets.

Definition 3.6.

Fix SΩS\in\mathbb{N}^{\Omega}, viewed as a list S=(I1,,I|S|)S=(I_{1},\ldots,I_{|S|}) as per Definition 3.2. Define a pattern π=(π1,,π|S|)\pi=(\pi_{1},\ldots,\pi_{|S|}) to be an element of ([r]{})|S|([r]\cup\{\star\})^{|S|} satisfying the following rules:

  • (i) “No singletons”: if πd=j[r]\pi_{d}=j\in[r] then there must exist ddd^{\prime}\neq d such that πd=j\pi_{d^{\prime}}=j.

  • (ii) “Not all stars”: there must exist dd such that πd\pi_{d}\neq\star.

  • (iii) “Every column valid”: for every j[r]j\in[r], we have |d:πd=jId|k|\oplus_{d\,:\,\pi_{d}=j}I_{d}|\leq k.

Let Π(S)\Pi(S) denote the set of all such patterns. We let S(π)j:=d:πd=jIdS(\pi)_{j}:=\oplus_{d\,:\,\pi_{d}=j}I_{d} and define S(π)[n]×[r]S(\pi)\subseteq[n]\times[r] to have jjth column S(π)jS(\pi)_{j} for all j[r]j\in[r].

Note that if π\pi has no stars, it is simply a labeling [r]|S|\ell\in[r]^{|S|}, in which case the definitions S(π)S(\pi) and S()S(\ell) coincide. Intuitively, a pattern π\pi describes a class of possible labelings \ell, where the stars are “wildcards” that may represent any element of [r][r] (subject to some restrictions). More formally, we now define which \ell’s belong to a pattern π\pi.

Definition 3.7.

Fix SΩS\in\mathbb{N}^{\Omega}, viewed as a list S=(I1,,I|S|)S=(I_{1},\ldots,I_{|S|}). For [r]|S|\ell\in[r]^{|S|} and πΠ(S)\pi\in\Pi(S), write π\pi\vdash\ell if the following conditions hold:

  • (i) For each dd, if πd=j[r]\pi_{d}=j\in[r] then d=j\ell_{d}=j.

  • (ii) The “starred” columns :={d:πd=}\ell_{\star}:=\{\ell_{d}\,:\,\pi_{d}=\star\} are distinct.

  • (iii) For every jj\in\ell_{\star} we have S(π)j=S(\pi)_{j}=\emptyset.

Let (S)\mathcal{L}(S) denote the set of labelings [r]|S|\ell\in[r]^{|S|} for which there exists SΩS^{\prime}\in\mathbb{N}^{\Omega} such that |S|<|S||S^{\prime}|<|S| and 𝖼𝗈𝗅𝗌(S())=S\mathsf{cols}(S(\ell))=S^{\prime}; these are the \ell’s that contribute to the sum in (22). For any (S)\ell\in\mathcal{L}(S), there is at least one (and possibly more than one) pattern πΠ(S)\pi\in\Pi(S) for which π\pi\vdash\ell. The following inclusion-exclusion formula will allow us to sum over πΠ(S)\pi\in\Pi(S) in a way that counts every (S)\ell\in\mathcal{L}(S) exactly once.

Lemma 3.8.

Fix SΩS\in\mathbb{N}^{\Omega}, viewed as a list S=(I1,,I|S|)S=(I_{1},\ldots,I_{|S|}). Fix a function ϕ:[r]|S|\phi:[r]^{|S|}\to\mathbb{R}. For πΠ(S)\pi\in\Pi(S) and j[r]j\in[r], define

mπ(j)=|{d:πd=j and Id=S(π)j}|m_{\pi}(j)=|\{d\,:\,\pi_{d}=j\text{ and }I_{d}=S(\pi)_{j}\}|

and

mπ=j[r](1mπ(j)).m_{\pi}=\prod_{j\in[r]}(1-m_{\pi}(j)).

Then we have

(S)ϕ()=πΠ(S)mπ[r]|S|:πϕ().\sum_{\ell\in\mathcal{L}(S)}\phi(\ell)=\sum_{\pi\in\Pi(S)}m_{\pi}\sum_{\ell\in[r]^{|S|}\,:\,\pi\vdash\ell}\phi(\ell). (24)
Proof.

First note that π\pi\vdash\ell implies (S)\ell\in\mathcal{L}(S) and so there are no “extra” terms ϕ()\phi(\ell) on the right-hand side that are not present on the left-hand side. For any fixed (S)\ell\in\mathcal{L}(S), the term ϕ()\phi(\ell) appears exactly once on the left-hand side of (24). Our goal is to show that it also appears exactly once on the right-hand side, that is, it suffices to prove

πΠ(S):πmπ=1.\sum_{\pi\in\Pi(S)\,:\,\pi\vdash\ell}m_{\pi}=1.

For a fixed \ell, we need to enumerate the possible patterns π\pi for which π\pi\vdash\ell. The rules for these patterns are as follows:

  • (Case 1) For any jj, if there is exactly one dd for which d=j\ell_{d}=j then πd=\pi_{d}=\star. In this case, mπ(j)=0m_{\pi}(j)=0.

  • (Case 2) For any jj, if S()j=S(\ell)_{j}=\emptyset then there are no stars among {πd:d=j}\{\pi_{d}\,:\,\ell_{d}=j\}. In this case, mπ(j)=0m_{\pi}(j)=0.

  • (Case 3) For any jj not in Case 1, if S()jS(\ell)_{j}\neq\emptyset then among {πd:d=j}\{\pi_{d}\,:\,\ell_{d}=j\} there are either no stars or exactly one star of the form πd=\pi_{d}=\star where d=j\ell_{d}=j and Id=S()jI_{d}=S(\ell)_{j}. If there are no stars then mπ(j)=m(j):=|{d:d=j and Id=S()j}|m_{\pi}(j)=m_{\ell}(j):=|\{d\,:\,\ell_{d}=j\text{ and }I_{d}=S(\ell)_{j}\}|. There are m(j)m_{\ell}(j) ways to have one star, and each yields mπ(j)=0m_{\pi}(j)=0.

Now we have

π:πmπ=π:πj[r](1mπ(j))=j in Case 3[(1m(j))+m(j)(10)]=1\sum_{\pi\,:\,\pi\vdash\ell}m_{\pi}=\sum_{\pi\,:\,\pi\vdash\ell}\prod_{j\in[r]}(1-m_{\pi}(j))=\prod_{j\text{ in Case 3}}[(1-m_{\ell}(j))+m_{\ell}(j)(1-0)]=1

as desired. ∎

Using Lemma 3.8, we can rewrite (22) as

vS\displaystyle v_{S} =cSS:|S|<|S|vS[r]|S|λλS()r|S|¯ 1𝖼𝗈𝗅𝗌(S())=S\displaystyle=c_{S}-\sum_{S^{\prime}\,:\,|S^{\prime}|<|S|}v_{S^{\prime}}\sum_{\ell\in[r]^{|S|}}\frac{\lambda^{\ell}}{\lambda^{S(\ell)}r^{\underline{|S^{\prime}|}}}\,\mathbbm{1}_{\mathsf{cols}(S(\ell))=S^{\prime}}
=cSS:|S|<|S|vSπΠ(S)mπ:πλλS()r|S|¯ 1𝖼𝗈𝗅𝗌(S())=S.\displaystyle=c_{S}-\sum_{S^{\prime}\,:\,|S^{\prime}|<|S|}v_{S^{\prime}}\sum_{\pi\in\Pi(S)}m_{\pi}\sum_{\ell\,:\,\pi\vdash\ell}\frac{\lambda^{\ell}}{\lambda^{S(\ell)}r^{\underline{|S^{\prime}|}}}\,\mathbbm{1}_{\mathsf{cols}(S(\ell))=S^{\prime}}.

The number of labelings \ell such that π\pi\vdash\ell is (rπ)sπ¯(r_{\pi})^{\underline{s_{\pi}}} where rπr_{\pi} is the number of columns j[r]j\in[r] for which S(π)j=S(\pi)_{j}=\emptyset, and sπs_{\pi} is the number of stars in π\pi (i.e., the number of indices dd such that πd=\pi_{d}=\star). Note that the ratio λλS()\frac{\lambda^{\ell}}{\lambda^{S(\ell)}} depends only on π\pi (not on \ell), and so we define λ(π):=λλS()\lambda^{(\pi)}:=\frac{\lambda^{\ell}}{\lambda^{S(\ell)}}. Also, the predicate 𝖼𝗈𝗅𝗌(S())=S\mathsf{cols}(S(\ell))=S^{\prime} depends only on π\pi (not on \ell), and we write this predicate as S𝜋SS\overset{\pi}{\longrightarrow}S^{\prime}. With this notation, the recurrence for vSv_{S} becomes

vS=cSS:|S|<|S|vSπΠ(S)mπ(rπ)sπ¯λ(π)r|S|¯ 1S𝜋S.v_{S}=c_{S}-\sum_{S^{\prime}\,:\,|S^{\prime}|<|S|}v_{S^{\prime}}\sum_{\pi\in\Pi(S)}m_{\pi}\,(r_{\pi})^{\,\underline{s_{\pi}}}\,\frac{\lambda^{(\pi)}}{r^{\underline{|S^{\prime}|}}}\,\mathbbm{1}_{S\overset{\pi}{\longrightarrow}S^{\prime}}. (25)

3.8 Unravelling the Recurrence

Next we will rewrite (25) in closed form (without recursion) by expanding the recursion tree as a sum over “paths.”

We first unpack the definition (11) for cSc_{S}. Using (13) and recalling that AA has i.i.d. Rademacher entries,

cS=𝔼[TSa11]=[r]|S|λ𝔼[AS()a11]=[r]|S|λ 1S()={(1,1)}.c_{S}=\mathbb{E}[T^{S}\cdot a_{11}]=\sum_{\ell\in[r]^{|S|}}\lambda^{\ell}\,\mathbb{E}[A^{S(\ell)}\cdot a_{11}]=\sum_{\ell\in[r]^{|S|}}\lambda^{\ell}\,\mathbbm{1}_{S(\ell)=\{(1,1)\}}. (26)

By expanding the recursion tree of (25) we can write vSv_{S} as a sum over paths which we denote by

S=S0π0S1π1πp1Spπp.S=S^{0}\overset{\pi^{0}}{\longrightarrow}S^{1}\overset{\pi^{1}}{\longrightarrow}\cdots\overset{\pi^{p-1}}{\longrightarrow}S^{p}\overset{\pi^{p}}{\longrightarrow}\,\perp. (27)

Formally, a path consists of S0S^{0} and π0,,πp\pi^{0},\ldots,\pi^{p} (which then determine S1,,SpS^{1},\ldots,S^{p}) with StΩS^{t}\in\mathbb{N}^{\Omega} for all tt, πtΠ(St)\pi^{t}\in\Pi(S^{t}) for tp1t\leq p-1, and πp[r]|Sp|\pi^{p}\in[r]^{|S^{p}|} (“no stars at the final step”) such that the following properties hold:

  • For all 0tp10\leq t\leq p-1, we require |St+1|<|St||S^{t+1}|<|S^{t}| and the predicate StπtSt+1S^{t}\overset{\pi^{t}}{\longrightarrow}S^{t+1} holds.

  • For the final step SpπpS^{p}\overset{\pi^{p}}{\longrightarrow}\,\perp we require Sp(πp)={(1,1)}S^{p}(\pi^{p})=\{(1,1)\}.

With this notation, (25) can be written as

vS=p0S=S0π0S1π1πp1Spπp(1)p(t=0p1mπt(rπt)sπt¯λ(πt)r|St+1|¯)λπp,v_{S}=\sum_{p\geq 0}\;\sum_{S=S^{0}\overset{\pi^{0}}{\longrightarrow}S^{1}\overset{\pi^{1}}{\longrightarrow}\,\cdots\,\overset{\pi^{p-1}}{\longrightarrow}S^{p}\overset{\pi^{p}}{\longrightarrow}\,\perp}(-1)^{p}\left(\prod_{t=0}^{p-1}m_{\pi^{t}}\,(r_{\pi^{t}})^{\,\underline{s_{\pi^{t}}}}\,\frac{\lambda^{(\pi^{t})}}{r^{\underline{|S^{t+1}|}}}\right)\lambda^{\pi^{p}}, (28)

where the special rule for the final step comes from (26).

3.9 Excluding Bad Paths

The next step is the crux of the argument: we will show that only certain types of paths contribute to (28), due to cancellations among the remaining paths.

Definition 3.9 (Event).

For a path of the form (27), we say an event occurs at timestep t{0,1,,p}t\in\{0,1,\ldots,p\} on column j[r]j\in[r] if there exists dd for which πdt=j\pi^{t}_{d}=j.

Note that Definition 3.6 requires every timestep tt to have an event on at least one column jj.

Definition 3.10 (Deletion event).

An event at timestep tt on column jj is called a deletion event if St(πt)j=S^{t}(\pi^{t})_{j}=\emptyset.

Definition 3.11 (Good/bad paths).

A path of the form (27) is called bad if there exists a column j[r]j\in[r] such that the last event (i.e., with the largest tt) on that column is a deletion event. If a path is not bad, it is called good.

Lemma 3.12.

The total contribution from bad paths to (28) is 0. That is,

vS=p0S=S0π0S1π1πp1Spπpgood(1)p(t=0p1mπt(rπt)sπt¯λ(πt)r|St+1|¯)λπp.v_{S}=\sum_{p\geq 0}\;\sum_{\begin{subarray}{c}S=S^{0}\overset{\pi^{0}}{\longrightarrow}S^{1}\overset{\pi^{1}}{\longrightarrow}\,\cdots\,\overset{\pi^{p-1}}{\longrightarrow}S^{p}\overset{\pi^{p}}{\longrightarrow}\,\perp\\ \text{good}\end{subarray}}(-1)^{p}\left(\prod_{t=0}^{p-1}m_{\pi^{t}}\,(r_{\pi^{t}})^{\,\underline{s_{\pi^{t}}}}\,\frac{\lambda^{(\pi^{t})}}{r^{\underline{|S^{t+1}|}}}\right)\lambda^{\pi^{p}}. (29)
Proof.

We will show that the bad paths can be paired up so that within each pair, the two paths contribute the same term to (28) but with opposite signs. The pairing is described by the following procedure, an involution that maps each bad path to its partner (and vice versa).

  1. (1) Given a bad path as input, let jj^{*} denote the largest column index for which the last event is a deletion event. Let tt^{*} denote the timestep on which this deletion event occurs.

  2. (2a) If there exists another event at timestep tt^{*} (on some column jjj\neq j^{*}), “promote” the (t,j)(t^{*},j^{*}) deletion event to its own timestep. That is, replace

    StπtbySt𝜏S𝜎\cdots\;S^{t^{*}}\overset{\pi^{t^{*}}}{\longrightarrow}\cdots\qquad\text{by}\qquad\cdots\;S^{t^{*}}\overset{\tau}{\longrightarrow}S^{\prime}\overset{\sigma}{\longrightarrow}\cdots

    where τ,σ\tau,\sigma are defined as follows. For all dd such that πdt=j\pi^{t^{*}}_{d}=j^{*}, set τd=j\tau_{d}=j^{*}; for all other dd, set τd=\tau_{d}=\star. Now SS^{\prime} (viewed as a list) is produced from StS^{t^{*}} by removing the elements IdI_{d} for which πdt=j\pi^{t^{*}}_{d}=j^{*}; similarly, σ\sigma is produced from πt\pi^{t^{*}} by removing the elements πdt\pi^{t^{*}}_{d} that are equal to jj^{*}.

  3. (2b) If instead there is no other event at timestep tt^{*} (which cannot happen if t=pt^{*}=p due to the non-deletion event on column 1), “merge” timestep tt^{*} with the subsequent timestep. That is, replace

    StπtSt+1πt+1bySt𝜏\cdots\;S^{t^{*}}\overset{\pi^{t^{*}}}{\longrightarrow}S^{t^{*}+1}\overset{\pi^{t^{*}+1}}{\longrightarrow}\cdots\qquad\text{by}\qquad\cdots\;S^{t^{*}}\overset{\tau}{\longrightarrow}\cdots

    where τ\tau is defined as follows. For all dd such that πdt=j\pi^{t^{*}}_{d}=j^{*}, set τd=j\tau_{d}=j^{*}. The number of remaining entries of τ\tau is exactly |St+1||S^{t^{*}+1}|, and we designate these entries as “unassigned”; for each 1i|St+1|1\leq i\leq|S^{t^{*}+1}|, set the iith unassigned entry of τ\tau to be πit+1\pi^{t^{*}+1}_{i}. Note that since the (t,j)(t^{*},j^{*}) event was the last event in its column, the merge operation will not cause it to “collide” with another event.

A few claims remain to be checked before the proof is complete. First note that the above procedure maps any bad path to a different bad path (its “partner”), and applying the procedure again on the partner recovers the original path. For instance, if applying the procedure to the original path resulted in a “promote” operation on column jj^{*}, applying the procedure on the partner will undo this using a “merge” operation on the same column jj^{*}.

We furthermore claim that for any bad path and its partner, both paths have the same value for the factor (t=0p1)λπp\left(\prod_{t=0}^{p-1}\cdots\right)\lambda^{\pi^{p}} in (28). However, the lengths of the two paths differ by one, causing the two corresponding terms in (28) to cancel due to the (1)p(-1)^{p} factor. To prove the claim, compare the cases StπtSt+1S^{t}\overset{\pi^{t}}{\longrightarrow}S^{t+1} and St𝜏S𝜎St+1S^{t}\overset{\tau}{\longrightarrow}S^{\prime}\overset{\sigma}{\longrightarrow}S^{t+1} from (2a) above, where for now we assume St+1S^{t+1}\neq\perp. For the mm factors, note that mπ(j)=0m_{\pi}(j)=0 whenever π\pi has either a deletion event on column jj or no event on column jj, and so mπt=mτmσm_{\pi^{t}}=m_{\tau}m_{\sigma}. For the rs¯r^{\underline{s}} factors, note that rπt=rσr_{\pi^{t}}=r_{\sigma} and sπt=sσs_{\pi^{t}}=s_{\sigma}; also, rτ=rr_{\tau}=r and sτ=|S|s_{\tau}=|S^{\prime}|, so (rτ)sτ¯(r_{\tau})^{\underline{s_{\tau}}} cancels with the existing factor of 1/r|S|¯1/r^{\underline{|S^{\prime}|}} in (28). Finally, for the λ\lambda factors we have λ(πt)=λ(τ)λ(σ)\lambda^{(\pi^{t})}=\lambda^{(\tau)}\lambda^{(\sigma)}. The other case StπtS^{t}\overset{\pi^{t}}{\longrightarrow}\,\perp versus St𝜏S𝜎S^{t}\overset{\tau}{\longrightarrow}S^{\prime}\overset{\sigma}{\longrightarrow}\,\perp is treated similarly, where now we have mτ=1m_{\tau}=1, (rτ)sτ¯=r|S|¯(r_{\tau})^{\underline{s_{\tau}}}=r^{\underline{|S^{\prime}|}}, and λπt=λ(τ)λσ\lambda^{\pi^{t}}=\lambda^{(\tau)}\lambda^{\sigma}. This completes the proof. ∎

3.10 Bounding |vS||v_{S}|

Now that we have identified the crucial cancellations between pairs of bad paths, the rest of the proof will follow by bounding the terms in (29) in a straightforward way. We start by collecting some bounds on the individual pieces of (29).

Lemma 3.13.

For any step S𝜋SS\overset{\pi}{\longrightarrow}S^{\prime}, |mπ|2|S||S||m_{\pi}|\leq 2^{|S|-|S^{\prime}|}.

Proof.

Note that

|mπ(j)||{d:πd=j}|=:|π1(j)|.|m_{\pi}(j)|\leq|\{d\,:\,\pi_{d}=j\}|=:|\pi^{-1}(j)|.

The number of stars plus the number of distinct columns in π\pi must be at least |S||S^{\prime}|. This leaves at most |S||S||S|-|S^{\prime}| entries of π\pi that repeat a previous column, i.e.,

j[r]:|π1(j)|2(|π1(j)|1)|S||S|.\sum_{j\in[r]\,:\,|\pi^{-1}(j)|\geq 2}(|\pi^{-1}(j)|-1)\leq|S|-|S^{\prime}|. (30)

This means

|mπ|=j[r]|mπ(j)1|\displaystyle|m_{\pi}|=\prod_{j\in[r]}|m_{\pi}(j)-1| j[r]:|mπ(j)|2(mπ(j)1)j[r]:|π1(j)|2(|π1(j)|1)\displaystyle\leq\prod_{j\in[r]\,:\,|m_{\pi}(j)|\geq 2}(m_{\pi}(j)-1)\leq\prod_{j\in[r]\,:\,|\pi^{-1}(j)|\geq 2}(|\pi^{-1}(j)|-1)
j[r]:|π1(j)|22(|π1(j)|1)=2j[r]:|π1(j)|2(|π1(j)|1)\displaystyle\leq\prod_{j\in[r]\,:\,|\pi^{-1}(j)|\geq 2}2^{(|\pi^{-1}(j)|-1)}=2^{\sum_{j\in[r]\,:\,|\pi^{-1}(j)|\geq 2}(|\pi^{-1}(j)|-1)}
2|S||S|\displaystyle\leq 2^{|S|-|S^{\prime}|}

where the final step uses (30). ∎

Lemma 3.14.

For any step S𝜋SS\overset{\pi}{\longrightarrow}S^{\prime}, |λ(π)|1|\lambda^{(\pi)}|\leq 1.

Proof.

Recall λ(π):=λλS()\lambda^{(\pi)}:=\frac{\lambda^{\ell}}{\lambda^{S(\ell)}} for any \ell such that π\pi\vdash\ell. Recall that λS()\lambda^{S(\ell)} is the product of λj\lambda_{j} over the non-empty columns jj of S()S(\ell). For any such non-empty column jj, there must exist dd for which d=j\ell_{d}=j. Thus, every factor of λj\lambda_{j} in the denominator of λλS()\frac{\lambda^{\ell}}{\lambda^{S(\ell)}} is cancelled by the numerator, and the result now follows because |λj|1|\lambda_{j}|\leq 1. ∎

We now state the main conclusion of this section.

Lemma 3.15.

For any SΩS\in\mathbb{N}^{\Omega} with |S|1|S|\geq 1, we have |vS|(3|S|2)|S||v_{S}|\leq(3|S|^{2})^{|S|}.

Note that for |S|=0|S|=0, it can be verified directly that v=0v_{\emptyset}=0.

Proof.

Proceed by induction on |S||S|. We will bound the sum of absolute values of terms in (29); it will no longer be important to exploit cancellations between positive and negative terms. First consider paths of the form Sπ0S\overset{\pi^{0}}{\longrightarrow}\,\perp. There is at most one such path that is good, namely π0=(1,1,,1)\pi^{0}=(1,1,\ldots,1); the value of this term is |λπ0|1|\lambda^{\pi^{0}}|\leq 1.

All remaining paths take the form Sπ0S1S\overset{\pi^{0}}{\longrightarrow}S^{1}\cdots where |S1|=i|S^{1}|=i for some 1i|S|11\leq i\leq|S|-1. In order to produce S1S^{1}, π0\pi^{0} must have exactly ii entries dd that are either stars (i.e., πd0=\pi^{0}_{d}=\star) or first in a “combination” event (i.e., for some j[r]j\in[r] with S(π0)jS(\pi^{0})_{j}\neq\emptyset, dd is the lowest index such that πd0=j\pi^{0}_{d}=j). There are (|S|i)\binom{|S|}{i} choices for which ii elements of π0\pi^{0} play these roles; by default they will all be stars, and will be converted to “combinations” if another entry of π0\pi^{0} decides to join them on the same column.

Now there are |S|i|S|-i entries of π0\pi^{0} remaining, which have a few options. One option is to participate in a combination event by joining one of the ii previously designated entries on the same column. The other option is to participate in a deletion event. Since we are only counting good paths, this can only happen on a column on which a later event will occur. Regardless of the remainder of the path S1π1S^{1}\overset{\pi^{1}}{\longrightarrow}\cdots\,\perp, there are at most |S1|=i|S^{1}|=i such columns available. Since each of the |S|i|S|-i remaining entries of π0\pi^{0} has at most 2i2i choices, this gives a total of at most (2i)|S|i(2i)^{|S|-i} choices.

We also need to decide which columns the combination events occur on. If π0\pi^{0} has sπ0s_{\pi^{0}} stars and isπ0i-s_{\pi^{0}} combination events, there are risπ0¯r^{\underline{i-s_{\pi^{0}}}} choices for the columns. Note that this exactly cancels the factor (rπ0)sπ0¯/r|S1|¯(r_{\pi^{0}})^{\underline{s_{\pi^{0}}}}/r^{\underline{|S^{1}|}} in (29) because rπ0=r(isπ0)r_{\pi^{0}}=r-(i-s_{\pi^{0}}) and |S1|=i|S^{1}|=i.

Recall that we aim to show |vS|b(|S|)|v_{S}|\leq b(|S|) where b(d):=(3d2)db(d):=(3d^{2})^{d}. Plugging the above calculations (along with Lemmas 3.13 and 3.14) into (29) and using the induction hypothesis to handle the remainder of the path S1π1S^{1}\overset{\pi^{1}}{\longrightarrow}\cdots\,\perp, we have

|vS|\displaystyle|v_{S}| 1+i=1|S|1b(i)(|S|i)(2i)|S|i 2|S|i\displaystyle\leq 1+\sum_{i=1}^{|S|-1}b(i)\binom{|S|}{i}(2i)^{|S|-i}\,2^{|S|-i}
=1+i=1|S|1(3i2)i(|S|i)(4i)|S|i.\displaystyle=1+\sum_{i=1}^{|S|-1}(3i^{2})^{i}\binom{|S|}{i}(4i)^{|S|-i}.
At this point we can verify the case |S|=1|S|=1 directly. Assuming |S|2|S|\geq 2, we continue:
1+i=1|S|1(|S|i)(3(|S|1)2)i(4(|S|1))|S|i\displaystyle\leq 1+\sum_{i=1}^{|S|-1}\binom{|S|}{i}(3(|S|-1)^{2})^{i}(4(|S|-1))^{|S|-i}
=1+i=0|S|(|S|i)(3(|S|1)2)i(4(|S|1))|S|i(4(|S|1))|S|(3(|S|1)2)|S|\displaystyle=1+\sum_{i=0}^{|S|}\binom{|S|}{i}(3(|S|-1)^{2})^{i}(4(|S|-1))^{|S|-i}-(4(|S|-1))^{|S|}-(3(|S|-1)^{2})^{|S|}
i=0|S|(|S|i)(3(|S|1)2)i(4(|S|1))|S|i\displaystyle\leq\sum_{i=0}^{|S|}\binom{|S|}{i}(3(|S|-1)^{2})^{i}(4(|S|-1))^{|S|-i}
=[3(|S|1)2+4(|S|1)]|S|\displaystyle=[3(|S|-1)^{2}+4(|S|-1)]^{|S|}
=(3|S|26|S|+3+4|S|4)|S|\displaystyle=(3|S|^{2}-6|S|+3+4|S|-4)^{|S|}
(3|S|2)|S|,\displaystyle\leq(3|S|^{2})^{|S|},

completing the proof. ∎
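
As a purely numerical illustration (not needed for the argument), the chain of inequalities above amounts to the closed-form bound 1 + \sum_{i=1}^{d-1}(3i^2)^i\binom{d}{i}(4i)^{d-i} \leq (3d^2)^d for d = |S| \geq 2. The short Python script below, with helper names lhs and rhs of our choosing, spot-checks this inequality for small values of d.

```python
# Sanity check (ours) of the numerical inequality behind the inductive step of
# Lemma 3.15: 1 + sum_{i=1}^{d-1} (3 i^2)^i * C(d, i) * (4 i)^{d-i} <= (3 d^2)^d.
from math import comb

def lhs(d):
    return 1 + sum((3 * i**2)**i * comb(d, i) * (4 * i)**(d - i) for i in range(1, d))

def rhs(d):
    return (3 * d**2)**d

for d in range(2, 15):
    assert lhs(d) <= rhs(d), f"inequality fails at d={d}"
print("inequality verified for d = 2, ..., 14")
```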

3.11 Putting it Together

We now combine (23) with Lemma 3.15 to complete the proof of Theorem 2.1. We first note that vS0v_{S}\neq 0 only when the elements of SS (viewed as a multi-set) together with {1}\{1\} form an even cover of [n][n].

Lemma 3.16.

Let P(S)P(S) denote the property that S()={(1,1)}S(\ell)=\{(1,1)\} for =(1,1,,1)\ell=(1,1,\ldots,1). If P(S)P(S) fails to hold then vS=0v_{S}=0.

Proof.

From (26), note that if cS0c_{S}\neq 0 then S()={(1,1)}S(\ell)=\{(1,1)\} for some \ell, which implies S()={(1,1)}S(\ell)=\{(1,1)\} for =(1,1,,1)\ell=(1,1,\ldots,1). Thus cS=0c_{S}=0 whenever P(S)P(S) fails.

Also note that if P(S)P(S) fails and 𝖼𝗈𝗅𝗌(S())=S\mathsf{cols}(S(\ell))=S^{\prime} for some \ell, then P(S)P(S^{\prime}) fails. The result now follows from (22) using induction on |S||S|. ∎

Lemma 3.17.

For any d1d\geq 1, the number of multi-sets SΩS\in\mathbb{N}^{\Omega} with |S|=d|S|=d such that P(S)P(S) holds is at most n(kd1)/2((kd+3)/2)kdn^{(kd-1)/2}((kd+3)/2)^{kd}.

Proof.

Since P(S)P(S) holds, SS together with {1}\{1\} forms an even cover of [n][n]. Therefore the number of elements of [n]{1}[n]\setminus\{1\} covered by SS is at most (kd1)/2(kd-1)/2. The number of ways to choose this many elements is at most n(kd1)/2n^{(kd-1)/2}. Once these are chosen, SS has at most (kd1)/2+1=(kd+1)/2(kd-1)/2+1=(kd+1)/2 elements to draw from, so the number of possibilities for SS is at most ((kd+1)/2+1)kd=((kd+3)/2)kd((kd+1)/2+1)^{kd}=((kd+3)/2)^{kd}. ∎

Proof of Theorem 2.1.

Starting from (23) and using Lemmas 3.15, 3.16, and 3.17,

𝖢𝗈𝗋𝗋D2\displaystyle\mathsf{Corr}_{\leq D}^{2} |S|DvS2λmin2|S|r|S|¯\displaystyle\leq\sum_{|S|\leq D}\frac{v_{S}^{2}}{\lambda_{\min}^{2|S|}\,r^{\underline{|S|}}}
d=1Dn(kd1)/2((kd+3)/2)kd(3d2)2dλmin2d(rd+1)d\displaystyle\leq\sum_{d=1}^{D}n^{(kd-1)/2}((kd+3)/2)^{kd}\frac{(3d^{2})^{2d}}{\lambda_{\min}^{2d}(r-d+1)^{d}}
=n1/2d=1D(nk/2((kd+3)/2)k(3d2)2λmin2(rd+1))d\displaystyle=n^{-1/2}\sum_{d=1}^{D}\left(\frac{n^{k/2}((kd+3)/2)^{k}(3d^{2})^{2}}{\lambda_{\min}^{2}(r-d+1)}\right)^{d}
n1/2d=1D(9D4(k(D+1)/2)knk/2λmin2(rD))d\displaystyle\leq n^{-1/2}\sum_{d=1}^{D}\left(9D^{4}(k(D+1)/2)^{k}\cdot\frac{n^{k/2}}{\lambda_{\min}^{2}(r-D)}\right)^{d}
n1/2d=1D(9D4(kD)k19nk/218λmin2r)d\displaystyle\leq n^{-1/2}\sum_{d=1}^{D}\left(9D^{4}(kD)^{k}\cdot\frac{19n^{k/2}}{18\lambda_{\min}^{2}r}\right)^{d}
since (6) implies Dr/19D\leq r/19
=n1/2d=1D(192kkDk+4nk/2λmin2r)d\displaystyle=n^{-1/2}\sum_{d=1}^{D}\left(\frac{19}{2}\,k^{k}D^{k+4}\cdot\frac{n^{k/2}}{\lambda_{\min}^{2}r}\right)^{d}
n1/2d=1D(12)d\displaystyle\leq n^{-1/2}\sum_{d=1}^{D}\left(\frac{1}{2}\right)^{d}
where we have used the assumption (6)
n1/2.\displaystyle\leq n^{-1/2}.

Using Fact 3.1, this completes the proof. ∎

4 Proof of Theorem 2.2: Upper Bound

4.1 Expander Graphs

We begin by collecting some standard properties of expander graphs.

Proposition 4.1.

Fix an integer k3k\geq 3. For all even NN exceeding a constant N0=N0(k)N_{0}=N_{0}(k), there exists a kk-regular NN-vertex (simple) graph with the following properties:

  • the minimum cut value is kk (achieved by a single vertex), and

  • for any SV(G)S\subseteq V(G) with 0<|S|N/20<|S|\leq N/2, |S|c|S||\partial S|\geq c|S|,

where S\partial S is the set of edges with exactly one endpoint in SS, and c0.08c\geq 0.08 is an absolute constant.

Proof.

It follows from classical results that a uniformly random kk-regular NN-vertex graph has these properties with high probability, i.e., probability 1o(1)1-o(1) as NN\to\infty with kk fixed. Letting GG be such a graph, it is well-known that GG is kk-connected with high probability [Bol98, Section 7.6], which proves the first statement about the minimum cut.

For the second statement, let k=μ1μ2μNk=\mu_{1}\geq\mu_{2}\geq\cdots\geq\mu_{N} denote the eigenvalues of the adjacency matrix of GG. Friedman’s second eigenvalue theorem [Fri08] states that for any fixed ϵ>0\epsilon>0, μ22k1+ϵ\mu_{2}\leq 2\sqrt{k-1}+\epsilon with high probability. Cheeger’s inequality [Dod84, AM85] (see Theorem 2.4 of [HLW06]) tells us that for any SV(G)S\subseteq V(G) with 0<|S|N/20<|S|\leq N/2, |S||S|12(kμ2)\frac{|\partial S|}{|S|}\geq\frac{1}{2}(k-\mu_{2}). Combining these gives

|S||S|2(kμ2)|S|2(k2k1ϵ)|\partial S|\geq\frac{|S|}{2}(k-\mu_{2})\geq\frac{|S|}{2}(k-2\sqrt{k-1}-\epsilon)

which concludes the proof for any choice of cc satisfying 0<c<12(k2k1)0<c<\frac{1}{2}(k-2\sqrt{k-1}). The expression 12(k2k1)\frac{1}{2}(k-2\sqrt{k-1}) is minimized (over k3k\geq 3) when k=3k=3. ∎
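
As an aside, the spectral facts invoked in this proof are easy to observe numerically. The following Python sketch (parameters of our choosing, using the networkx library) samples a random kk-regular graph, computes μ2\mu_{2}, and reports the edge expansion (kμ2)/2(k-\mu_{2})/2 guaranteed by the Cheeger-type bound; it is an illustration only, not part of the proof.

```python
# Illustration (our parameters) of the spectral argument in Proposition 4.1:
# for a random k-regular graph, mu_2 is close to 2*sqrt(k-1), and Cheeger's
# inequality then guarantees |dS|/|S| >= (k - mu_2)/2 for all |S| <= N/2.
import numpy as np
import networkx as nx

k, N = 3, 500                                   # N even; k = 3 is the worst case
G = nx.random_regular_graph(k, N, seed=0)
eigs = np.sort(np.linalg.eigvalsh(nx.to_numpy_array(G)))[::-1]
mu2 = eigs[1]
print(f"mu_2 = {mu2:.3f}   (Friedman bound ~ 2*sqrt(k-1) = {2*np.sqrt(k-1):.3f})")
print(f"guaranteed edge expansion (k - mu_2)/2 = {(k - mu2)/2:.3f}")
```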

4.2 Constructing the Polynomial

Let N=D1N=D-1 where DD is defined as in Theorem 2.2, choosing n0n_{0} large enough so that NN0N\geq N_{0}. From this point onward, let GG denote the kk-regular NN-vertex graph guaranteed by Proposition 4.1. We construct a new graph HH as follows. Starting with GG, add two additional vertices called \circ and uu, and add the edge (,u)(\circ,u). Recall that kk^{\prime} is the odd element of {k1,k}\{k-1,k\}. Choose p:=(k1)/2p:=(k^{\prime}-1)/2 arbitrary edges (i1,j1),,(ip,jp)(i_{1},j_{1}),\ldots,(i_{p},j_{p}) of GG with no endpoints in common. Delete these pp edges and add the edges (u,i1),(u,j1),,(u,ip),(u,jp)(u,i_{1}),(u,j_{1}),\ldots,(u,i_{p}),(u,j_{p}). This completes the description of HH. Note that HH is kk-regular aside from the degree-1 vertex \circ and the degree-kk^{\prime} vertex uu.

Definition 4.2.

Define an edge-labeling of HH to be a function ϕ:E(H)[n]\phi:E(H)\to[n] that is injective (no two edges get the same label) with ϕ(,u)=1\phi(\circ,u)=1. Let Φ\Phi denote the set of all edge-labelings of HH.

For an edge-labeling ϕ\phi and a vertex vV(H){}v\in V(H)\setminus\{\circ\}, define Tv(ϕ)T_{v}(\phi) to be the following entry of the input tensor: let e1,,eme_{1},\ldots,e_{m} be the edges incident to vv (where m{k,k}m\in\{k,k^{\prime}\}) and then let Tv(ϕ):=Tϕ(e1),,ϕ(em)T_{v}(\phi):=T_{\phi(e_{1}),\ldots,\phi(e_{m})}. Our polynomial estimator is defined as follows:

f(T)=1|Φ|ϕΦvV(H){}Tv(ϕ).f(T)=\frac{1}{|\Phi|}\sum_{\phi\in\Phi}\,\prod_{v\in V(H)\setminus\{\circ\}}T_{v}(\phi). (31)
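
To make (31) concrete, the following Python sketch implements a toy version of the estimator for k=3k=3. It is illustrative only: we substitute the complete graph K_4 for the expander GG, replace the exact average over Φ\Phi with Monte Carlo sampling of injective edge-labelings, and choose λ_1 = 1 with the remaining λ_j = 0.4 of our own choosing; the helper names planted_tensor, build_H, and estimate_f are ours. At this toy scale (and with sampling error) the output only roughly tracks a_{11}.

```python
# A toy sketch (ours) of the estimator (31) for k = 3. The expander G is
# replaced by the 3-regular graph K_4, and the exact average over all
# edge-labelings Phi is replaced by Monte Carlo sampling, so the output only
# roughly tracks a_{11} at this small scale.
import random
import numpy as np
import networkx as nx

def planted_tensor(n, r, lam, rng):
    """T = sum_j lam[j] * a_j (x) a_j (x) a_j with each a_j uniform in {-1,+1}^n."""
    A = rng.choice([-1.0, 1.0], size=(n, r))
    T = np.zeros((n, n, n))
    for j in range(r):
        a = A[:, j]
        T += lam[j] * np.einsum('i,j,k->ijk', a, a, a)
    return T, A

def build_H():
    """Graph H for k = 3: start from G = K_4, delete one edge (i1, j1), then
    add a vertex u adjacent to i1, j1, and to a pendant vertex 'o'."""
    H = nx.complete_graph(4)
    i1, j1 = next(iter(H.edges()))
    H.remove_edge(i1, j1)
    H.add_edges_from([('u', i1), ('u', j1), ('o', 'u')])
    return H

def estimate_f(T, H, samples=100000, seed=0):
    """Monte Carlo version of (31): average the product of tensor entries over
    random injective edge-labelings phi, with phi(o,u) fixed to index 0."""
    rng = random.Random(seed)
    n = T.shape[0]
    other_edges = [e for e in H.edges() if set(e) != {'o', 'u'}]
    total = 0.0
    for _ in range(samples):
        labels = rng.sample(range(1, n), len(other_edges))  # distinct, never 0
        phi = {frozenset(e): lab for e, lab in zip(other_edges, labels)}
        phi[frozenset(('o', 'u'))] = 0
        prod = 1.0
        for v in H.nodes():
            if v == 'o':
                continue
            idx = tuple(phi[frozenset((v, w))] for w in H.neighbors(v))
            prod *= T[idx]
        total += prod
    return total / samples

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, r = 60, 8
    lam = np.array([1.0] + [0.4] * (r - 1))   # lambda_1 = 1, the rest smaller
    T, A = planted_tensor(n, r, lam, rng)
    print("estimate of a_11:", round(estimate_f(T, build_H()), 3), " truth:", A[0, 0])
```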

4.3 Vertex Labelings

Definition 4.3.

Define a vertex-labeling of HH to be a function ψ:V(H){}[r]\psi:V(H)\setminus\{\circ\}\to[r]. Let Ψ\Psi denote the set of all vertex-labelings of HH.

For ϕΦ\phi\in\Phi, ψΨ\psi\in\Psi, and vV(H){}v\in V(H)\setminus\{\circ\}, define Tv(ϕ,ψ)T_{v}(\phi,\psi) as follows: letting e1,,eme_{1},\ldots,e_{m} be the edges incident to vv, and j:=ψ(v)j:=\psi(v), let Tv(ϕ,ψ):=λji=1m(aj)ϕ(ei)T_{v}(\phi,\psi):=\lambda_{j}\prod_{i=1}^{m}(a_{j})_{\phi(e_{i})}. Recalling (5), we can expand (31) as

f(T)\displaystyle f(T) =1|Φ|ϕΦψΨvV(H){}Tv(ϕ,ψ)\displaystyle=\frac{1}{|\Phi|}\sum_{\phi\in\Phi}\,\sum_{\psi\in\Psi}\,\prod_{v\in V(H)\setminus\{\circ\}}T_{v}(\phi,\psi)
=ψΨ1|Φ|ϕΦvV(H){}Tv(ϕ,ψ).\displaystyle=\sum_{\psi\in\Psi}\frac{1}{|\Phi|}\sum_{\phi\in\Phi}\,\prod_{v\in V(H)\setminus\{\circ\}}T_{v}(\phi,\psi).

We will break the above sum into different terms depending on ψ\psi. Define the partition Ψ=Ψ1Ψ2Ψ3\Psi=\Psi_{1}\sqcup\Psi_{2}\sqcup\Psi_{3} as follows:

  • Ψ1={ψ1}\Psi_{1}=\{\psi_{1}\} where ψ1\psi_{1} denotes the all-ones labeling: ψ1(v)=1\psi_{1}(v)=1 for all vv,

  • Ψ2={ψ2,,ψr}\Psi_{2}=\{\psi_{2},\ldots,\psi_{r}\} where ψj\psi_{j} denotes the all-jj’s labeling: ψj(v)=j\psi_{j}(v)=j for all vv,

  • Ψ3=Ψ(Ψ1Ψ2)\Psi_{3}=\Psi\setminus(\Psi_{1}\cup\Psi_{2}).

We can now write f=f1+f2+f3f=f_{1}+f_{2}+f_{3} where for i{1,2,3}i\in\{1,2,3\},

fi:=ψΨi1|Φ|ϕΦvV(H){}Tv(ϕ,ψ).f_{i}:=\sum_{\psi\in\Psi_{i}}\frac{1}{|\Phi|}\sum_{\phi\in\Phi}\,\prod_{v\in V(H)\setminus\{\circ\}}T_{v}(\phi,\psi).

4.4 Signal Term

We first handle the terms f1f_{1} and f2f_{2}.

Lemma 4.4.

f1=a11f_{1}=a_{11}.

Proof.

We have

f1=1|Φ|ϕΦvV(H){}Tv(ϕ,ψ1).f_{1}=\frac{1}{|\Phi|}\sum_{\phi\in\Phi}\,\prod_{v\in V(H)\setminus\{\circ\}}T_{v}(\phi,\psi_{1}).

Recalling that aja_{j} has {±1}\{\pm 1\}-valued entries and λ1=1\lambda_{1}=1, note that for any ϕΦ\phi\in\Phi,

vV(H){}Tv(ϕ,ψ1)=a11,\prod_{v\in V(H)\setminus\{\circ\}}T_{v}(\phi,\psi_{1})=a_{11},

because each edge eE(H)e\in E(H) contributes a factor of (a1)ϕ(e)2=1(a_{1})_{\phi(e)}^{2}=1 except the edge (,u)(\circ,u), which is required by Definition 4.2 to have ϕ(,u)=1\phi(\circ,u)=1 and thus contributes a factor of a11a_{11}. The result follows. ∎

Lemma 4.5.

𝔼[f22](r1)|λ2|2(N+1)\mathbb{E}[f_{2}^{2}]\leq(r-1)\cdot|\lambda_{2}|^{2(N+1)}.

Proof.

Similarly to the proof of Lemma 4.4,

f2=j=2rλj|V(H)|1(aj)1.f_{2}=\sum_{j=2}^{r}\lambda_{j}^{|V(H)|-1}(a_{j})_{1}.

Note that |V(H)|1=N+1|V(H)|-1=N+1 where, recall, N=|V(G)|N=|V(G)|. Now compute

𝔼[f22]\displaystyle\mathbb{E}[f_{2}^{2}] =j=2rj=2rλjN+1λjN+1𝔼[(aj)1(aj)1]\displaystyle=\sum_{j=2}^{r}\,\sum_{j^{\prime}=2}^{r}\lambda_{j}^{N+1}\lambda_{j^{\prime}}^{N+1}\,\mathbb{E}[(a_{j})_{1}(a_{j^{\prime}})_{1}]
=j=2rλj2(N+1)\displaystyle=\sum_{j=2}^{r}\lambda_{j}^{2(N+1)}
(r1)|λ2|2(N+1),\displaystyle\leq(r-1)\cdot|\lambda_{2}|^{2(N+1)},

completing the proof. ∎
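
As a quick sanity check (ours, and not needed for the proof), the identity 𝔼[f_2^2] = \sum_{j\geq 2}\lambda_j^{2(N+1)} computed above is easy to confirm by simulation, since the entries (a_j)_1 are i.i.d. Rademacher; the parameter values below are our own.

```python
# Simulation (ours) of the identity used in Lemma 4.5:
# E[f_2^2] = sum_{j>=2} lambda_j^{2(N+1)} when the (a_j)_1 are i.i.d. Rademacher.
import numpy as np

rng = np.random.default_rng(0)
r, N = 20, 6
lam = np.full(r, 0.7)
lam[0] = 1.0
eps = rng.choice([-1.0, 1.0], size=(200000, r - 1))   # (a_j)_1 for j = 2, ..., r
f2 = eps @ lam[1:] ** (N + 1)
print("empirical E[f_2^2]:", round(float(np.mean(f2 ** 2)), 4),
      "  exact:", round(float(np.sum(lam[1:] ** (2 * (N + 1)))), 4))
```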

4.5 Noise Term

We now handle the term f3f_{3}. This section is devoted to proving the following.

Lemma 4.6.

Under the assumptions of Theorem 2.2, 𝔼[f32]4kk1D52krnk1𝟙k even\mathbb{E}[f_{3}^{2}]\leq 4k^{k-1}D^{52k}\frac{r}{n^{k-1-\mathbbm{1}_{k\text{ even}}}}.

To compute 𝔼[f32]\mathbb{E}[f_{3}^{2}], it will help to introduce an auxiliary graph H~\tilde{H} defined as follows. Start with two disjoint copies of HH, called HH and HH^{\prime}. Delete the vertices \circ and \circ^{\prime} and connect the two leftover half-edges to form the edge (u,u)(u,u^{\prime}). This completes the description of H~\tilde{H}.

The vertices of H~\tilde{H} can be partitioned as V(H~)={u,u}VVV(\tilde{H})=\{u,u^{\prime}\}\sqcup V\sqcup V^{\prime} where VV comes from the copy of GG in HH, and VV^{\prime} from HH^{\prime}. Similarly, the edges of H~\tilde{H} can be partitioned as E(H~)={(u,u)}EEE(\tilde{H})=\{(u,u^{\prime})\}\sqcup E\sqcup E^{\prime} where EE comes from HH, and EE^{\prime} from HH^{\prime}.

Definition 4.7.

Define an edge-labeling of H~\tilde{H} to be a function ϕ:E(H~)[n]\phi:E(\tilde{H})\to[n] such that ϕ(u,u)=1\phi(u,u^{\prime})=1, no other edge has the label 1, no two edges in EE (defined above) have the same label, and no two edges in EE^{\prime} have the same label. Let Φ~\tilde{\Phi} denote the set of all edge-labelings of H~\tilde{H}.

Definition 4.8.

Define a vertex-labeling of H~\tilde{H} to be a function ψ:V(H~)[r]\psi:V(\tilde{H})\to[r] such that ψ\psi takes at least two different values within V{u}V\cup\{u\} (defined above), and ψ\psi takes at least two different values within V{u}V^{\prime}\cup\{u^{\prime}\}. Let Ψ~\tilde{\Psi} denote the set of all vertex-labelings of H~\tilde{H}.

For ϕΦ~\phi\in\tilde{\Phi}, ψΨ~\psi\in\tilde{\Psi}, and vV(H~)v\in V(\tilde{H}), define Tv(ϕ,ψ)T_{v}(\phi,\psi) similarly to above: letting e1,,eme_{1},\ldots,e_{m} be the edges incident to vv, and j:=ψ(v)j:=\psi(v), let Tv(ϕ,ψ):=λji=1m(aj)ϕ(ei)T_{v}(\phi,\psi):=\lambda_{j}\prod_{i=1}^{m}(a_{j})_{\phi(e_{i})}.

With the above definitions in hand, and recalling that Ψ3\Psi_{3} is the set of vertex-labelings of HH that take at least two different values, we can write

f32=1|Φ|2ϕΦ~ψΨ~vV(H~)Tv(ϕ,ψ).f_{3}^{2}=\frac{1}{|\Phi|^{2}}\sum_{\phi\in\tilde{\Phi}}\,\sum_{\psi\in\tilde{\Psi}}\,\prod_{v\in V(\tilde{H})}T_{v}(\phi,\psi). (32)

Only certain “valid” pairs (ϕ,ψ)(\phi,\psi) yield a term with nonzero expectation.

Definition 4.9.

For ϕΦ~\phi\in\tilde{\Phi} and ψΨ~\psi\in\tilde{\Psi}, we say (ϕ,ψ)(\phi,\psi) is valid if the following holds: for each i[n]i\in[n] and j[r]j\in[r], there is an even number of edges eE(H~)e\in E(\tilde{H}) with the property that ϕ(e)=i\phi(e)=i and exactly one endpoint of ee has vertex-label jj. In fact, this even number must be either 0 or 2 because, by Definition 4.7, only two edges can share the same label ii.

Note that valid pairs (ϕ,ψ)(\phi,\psi) are precisely those for which the corresponding term in (32) has an even number of factors of each (aj)i(a_{j})_{i}. We can now write

𝔼[f32]\displaystyle\mathbb{E}[f_{3}^{2}] =1|Φ|2(ϕ,ψ) valid𝔼vV(H~)Tv(ϕ,ψ)\displaystyle=\frac{1}{|\Phi|^{2}}\sum_{(\phi,\psi)\text{ valid}}\mathbb{E}\prod_{v\in V(\tilde{H})}T_{v}(\phi,\psi)
=1|Φ|2(ϕ,ψ) validvV(H~)λψ(v).\displaystyle=\frac{1}{|\Phi|^{2}}\sum_{(\phi,\psi)\text{ valid}}\,\prod_{v\in V(\tilde{H})}\lambda_{\psi(v)}. (33)
Definition 4.10.

For ψΨ~\psi\in\tilde{\Psi}, a region is the preimage under ψ\psi of some j[r]j\in[r]. In other words, a region consists of all vertices of H~\tilde{H} that have a particular label.

For a valid (ϕ,ψ)(\phi,\psi) pair, let RR denote the number of non-empty regions. By Definition 4.8 we must have R2R\geq 2. Since (u,u)(u,u^{\prime}) is the only edge with label 11 (Definition 4.7) and (ϕ,ψ)(\phi,\psi) is valid, uu and uu^{\prime} must belong to the same region; call this region 1, and number the other non-empty regions 2,,R2,\ldots,R. For 1iR1\leq i\leq R, let sis_{i} denote the number of vertices in VV that belong to region ii and let sis^{\prime}_{i} denote the number of vertices in VV^{\prime} that belong to region ii. Let s¯i=min{si,Nsi}\bar{s}_{i}=\min\{s_{i},N-s_{i}\} and s¯i=min{si,Nsi}\bar{s}^{\prime}_{i}=\min\{s^{\prime}_{i},N-s^{\prime}_{i}\} where, recall, N=|V|=|V|N=|V|=|V^{\prime}|. For 1iR1\leq i\leq R, let i\ell_{i} denote the number of edges in EE that “cross” region ii (i.e., have exactly one endpoint in region ii). Since the edges of H~\tilde{H} crossing region ii must be paired up with each pair having the same edge-label (Definition 4.9), and edge-labels cannot repeat within EE or EE^{\prime} (Definition 4.7), i\ell_{i} must also be equal to the number of edges in EE^{\prime} that cross region ii. The total number of cross-edges (i.e., edges of H~\tilde{H} whose endpoints have different vertex-labels) is =i=1Ri\ell=\sum_{i=1}^{R}\ell_{i}. Note that (u,u)(u,u^{\prime}) is never a cross-edge since both its endpoints belong to region 1. As a consequence of the above discussion, every non-empty region must include at least one vertex from both V{u}V\cup\{u\} and V{u}V^{\prime}\cup\{u^{\prime}\}.

Lemma 4.11.

For any valid (ϕ,ψ)(\phi,\psi) pair and any i{1,2}i\in\{1,2\},

imax{k1,cs¯i,cs¯i}\ell_{i}\geq\max\{k^{\prime}-1,c\bar{s}_{i},c\bar{s}^{\prime}_{i}\}

where c>0c>0 is the constant from Proposition 4.1.

Proof.

Recall that uu belongs to region 1 by convention. Let SVS\subseteq V denote the vertices in VV that belong to region ii. The case S=S=\emptyset is possible only if i=1i=1, in which case we have i=k1\ell_{i}=k^{\prime}-1. The case S=VS=V is possible only if i=R=2i=R=2 (since there must be at least 2 regions, each containing a vertex from both V{u}V\cup\{u\} and V{u}V^{\prime}\cup\{u^{\prime}\}), in which case again i=k1\ell_{i}=k^{\prime}-1. This leaves the case 0<|S|<N0<|S|<N. Proposition 4.1 (applied to either SS or VSV\setminus S, whichever is smaller) tells us that the number of original edges of GG (recall some edges were deleted to form HH) crossing SS is at least the maximum of kk and cmin{|S|,N|S|}=cs¯ic\cdot\min\{|S|,N-|S|\}=c\bar{s}_{i}. For each edge (v1,v2)(v_{1},v_{2}) that was deleted from GG to form HH, if (v1,v2)(v_{1},v_{2}) crosses SS then one of the two new edges (u,v1)(u,v_{1}) or (u,v2)(u,v_{2}) must cross region ii. We conclude that at least max{k,cs¯i}\max\{k,c\bar{s}_{i}\} edges in EE cross region ii. The same argument applied to VV^{\prime} gives the bound max{k,cs¯i}\max\{k,c\bar{s}^{\prime}_{i}\}. ∎

Lemma 4.12.

For any valid (ϕ,ψ)(\phi,\psi) pair and any 3iR3\leq i\leq R,

imax{k,cs¯i,cs¯i}\ell_{i}\geq\max\{k,c\bar{s}_{i},c\bar{s}^{\prime}_{i}\}

where c>0c>0 is the constant from Proposition 4.1.

Proof.

The proof is the same as that of Lemma 4.11 except now the cases S=S=\emptyset and S=VS=V are impossible. ∎

Proof of Lemma 4.6.

Combining Lemmas 4.11 and 4.12, we have for any valid (ϕ,ψ)(\phi,\psi), the total number of cross-edges is

=i=1Ri\displaystyle\ell=\sum_{i=1}^{R}\ell_{i} i=1Rmax{k,cs¯i,cs¯i}2(1+𝟙k even)\displaystyle\geq\sum_{i=1}^{R}\max\{k,c\bar{s}_{i},c\bar{s}^{\prime}_{i}\}-2(1+\mathbbm{1}_{k\text{ even}})
=i=1R(k+Δi)2(1+𝟙k even)\displaystyle=\sum_{i=1}^{R}(k+\Delta_{i})-2(1+\mathbbm{1}_{k\text{ even}})
=Rk2(1+𝟙k even)+Δ\displaystyle=Rk-2(1+\mathbbm{1}_{k\text{ even}})+\Delta (34)

where

Δi:=max{0,cs¯ik,cs¯ik}\Delta_{i}:=\max\{0,c\bar{s}_{i}-k,c\bar{s}^{\prime}_{i}-k\} (35)

and

Δ:=i=1RΔi.\Delta:=\sum_{i=1}^{R}\Delta_{i}.

We now work towards bounding (33). Since every non-empty region must include at least one vertex from both V{u}V\cup\{u\} and V{u}V^{\prime}\cup\{u^{\prime}\}, the possible values for RR are 2RN+12\leq R\leq N+1. The number of ways to choose the values s1,,sRs_{1},\ldots,s_{R} and s1,,sRs_{1}^{\prime},\ldots,s_{R}^{\prime} is at most N2RN^{2R} because 0s1,s1N10\leq s_{1},s_{1}^{\prime}\leq N-1 and for i2i\geq 2, 1si,siN1\leq s_{i},s_{i}^{\prime}\leq N. Once these values are chosen, the number of ways to partition V(H~)V(\tilde{H}) into RR non-empty regions of the prescribed sizes is at most

i=1R(Nsi)(Nsi)i=1RNs¯i+s¯iN2Rk/c+2Δ/c,\prod_{i=1}^{R}\binom{N}{s_{i}}\binom{N}{s_{i}^{\prime}}\leq\prod_{i=1}^{R}N^{\bar{s}_{i}+\bar{s}^{\prime}_{i}}\leq N^{2Rk/c+2\Delta/c},

where the last step uses (35) to conclude s¯i,s¯i(k+Δi)/c\bar{s}_{i},\bar{s}_{i}^{\prime}\leq(k+\Delta_{i})/c.

Now that the regions are chosen, we next count the number of ways to assign vertex-labels ψ\psi that respect these regions. At the same time, we will also bound the term vV(H~)λψ(v)\prod_{v\in V(\tilde{H})}\lambda_{\psi(v)} appearing in (33). Recall that all vertices in a given region must have the same vertex-label. We consider two cases. First suppose every region contains at most N+1N+1 vertices (half the total number in H~\tilde{H}). There are at most rRr^{R} ways to assign the vertex-labels and, since at most half the vertices have label 1, we have vV(H~)λψ(v)|λ2|N+1\prod_{v\in V(\tilde{H})}\lambda_{\psi(v)}\leq|\lambda_{2}|^{N+1}. Now consider the other case where some “large” region has more than N+1N+1 vertices. If we choose to assign vertex-label 1 to the large region, then there are at most rR1r^{R-1} ways to assign the remaining labels and vV(H~)λψ(v)1\prod_{v\in V(\tilde{H})}\lambda_{\psi(v)}\leq 1; otherwise, there are at most rRr^{R} ways to assign the labels and vV(H~)λψ(v)|λ2|N+1\prod_{v\in V(\tilde{H})}\lambda_{\psi(v)}\leq|\lambda_{2}|^{N+1}.

Now we count the number of ways to assign edge-labels ϕ\phi. Recall that the edge (u,u)(u,u^{\prime}) is required to have edge-label 11, and no other edge can have edge-label 11. Recall that there are /2\ell/2 cross-edges in EE and /2\ell/2 cross-edges in EE^{\prime}. These need to be paired up, with each cross-edge in EE having the same edge-label as some cross-edge in EE^{\prime}. There are (/2)!(\ell/2)! ways to choose the pairing and then (n1)/2¯(n-1)^{\underline{\ell/2}} ways to assign edge-labels to the cross-edges, recalling the falling factorial notation (18). There are |E|/2|E|-\ell/2 edges in EE remaining, which can have any edge-labels subject to not repeating within EE, so there are (n/21)|E|/2¯(n-\ell/2-1)^{\underline{|E|-\ell/2}} ways to label these edges and the same number of ways to label the rest of EE^{\prime}. (Here we have assumed /2n1\ell/2\leq n-1, which will indeed be the case: /2|E|\ell/2\leq|E| by definition, and we will see |E|n/2|E|\leq n/2 below.)

Note that |E|=kN2+k1212k(N+1)=12kD|E|=\frac{kN}{2}+\frac{k^{\prime}-1}{2}\leq\frac{1}{2}k(N+1)=\frac{1}{2}kD, and by the assumptions of Theorem 2.2,

Dkn1/52logn+2.D\leq kn^{1/52}\log n+2. (36)

Thus, for sufficiently large n0n_{0} we have |E|n/2|E|\leq n/2.

Putting it all together, (33) becomes

𝔼[f32]\displaystyle\mathbb{E}[f_{3}^{2}] =1|Φ|2(ϕ,ψ) validvV(H~)λψ(v)\displaystyle=\frac{1}{|\Phi|^{2}}\sum_{(\phi,\psi)\text{ valid}}\,\prod_{v\in V(\tilde{H})}\lambda_{\psi(v)}
1|Φ|2R=2N+1N2R+2Rk/c+2Δ/c(rR1+rR|λ2|N+1)sup(/2)!(n1)/2¯[(n/21)|E|/2¯]2.\displaystyle\leq\frac{1}{|\Phi|^{2}}\sum_{R=2}^{N+1}N^{2R+2Rk/c+2\Delta/c}(r^{R-1}+r^{R}|\lambda_{2}|^{N+1})\sup_{\ell}\,(\ell/2)!(n-1)^{\underline{\ell/2}}\left[(n-\ell/2-1)^{\underline{|E|-\ell/2}}\right]^{2}.

We will bound pieces of this expression separately. First, since |Φ|=(n1)|E|¯|\Phi|=(n-1)^{\underline{|E|}}, we have

1|Φ|2(/2)!(n1)/2¯[(n/21)|E|/2¯]2\displaystyle\frac{1}{|\Phi|^{2}}(\ell/2)!(n-1)^{\underline{\ell/2}}\left[(n-\ell/2-1)^{\underline{|E|-\ell/2}}\right]^{2} =(/2)!(n1)/2¯[(n/21)|E|/2¯(n1)|E|¯]2\displaystyle=(\ell/2)!(n-1)^{\underline{\ell/2}}\left[\frac{(n-\ell/2-1)^{\underline{|E|-\ell/2}}}{(n-1)^{\underline{|E|}}}\right]^{2}
=(/2)!(n1)/2¯[1(n1)/2¯]2\displaystyle=(\ell/2)!(n-1)^{\underline{\ell/2}}\left[\frac{1}{(n-1)^{\underline{\ell/2}}}\right]^{2}
(/2)/2(n/2)/2\displaystyle\leq(\ell/2)^{\ell/2}(n-\ell/2)^{-\ell/2}
(/2n/2)/2\displaystyle\leq\left(\frac{\ell/2}{n-\ell/2}\right)^{\ell/2}
and recalling /2|E|n/2\ell/2\leq|E|\leq n/2 from above,
(/2n/2)/2.\displaystyle\leq\left(\frac{\ell/2}{n/2}\right)^{\ell/2}.
Recalling from (34) that Rk2(1+𝟙k even)+Δ\ell\geq Rk-2(1+\mathbbm{1}_{k\text{ even}})+\Delta, this becomes
(n)12(Rk2(1+𝟙k even)+Δ)\displaystyle\leq\left(\frac{\ell}{n}\right)^{\frac{1}{2}(Rk-2(1+\mathbbm{1}_{k\text{ even}})+\Delta)}
and recalling 2|E|kD\ell\leq 2|E|\leq kD,
(kDn)12(Rk2(1+𝟙k even)+Δ).\displaystyle\leq\left(\frac{kD}{n}\right)^{\frac{1}{2}(Rk-2(1+\mathbbm{1}_{k\text{ even}})+\Delta)}.

We now show r|λ2|N+11r|\lambda_{2}|^{N+1}\leq 1. Using 1|λ2|log(1/|λ2|)1-|\lambda_{2}|\leq\log(1/|\lambda_{2}|) and the definition of DD (see Theorem 2.2),

|λ2|N+1=|λ2|D|λ2|klogn1|λ2|=exp(klogn1|λ2|log1|λ2|)exp(klogn)=nk.|\lambda_{2}|^{N+1}=|\lambda_{2}|^{D}\leq|\lambda_{2}|^{\frac{k\log n}{1-|\lambda_{2}|}}=\exp\left(-\frac{k\log n}{1-|\lambda_{2}|}\log\frac{1}{|\lambda_{2}|}\right)\leq\exp\left(-k\log n\right)=n^{-k}. (37)

Since (7) implies rnk/2r\leq n^{k/2}, this gives r|λ2|N+11r|\lambda_{2}|^{N+1}\leq 1 as desired, implying

rR1+rR|λ2|N+1=rR1(1+r|λ2|N+1)2rR1.r^{R-1}+r^{R}|\lambda_{2}|^{N+1}=r^{R-1}(1+r|\lambda_{2}|^{N+1})\leq 2r^{R-1}.
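
The bound (37) used here follows from 1x1/log(1/x)1-x\leq\log(1/x); a quick numerical spot-check (ours), with k=3k=3 and a small grid of values for |λ_2| and nn of our choosing, is below.

```python
# Numerical spot-check (ours) of (37): with D >= k*log(n)/(1 - |lambda_2|),
# we have |lambda_2|^D <= n^{-k}.
import math

k = 3
for lam2 in [0.1, 0.5, 0.9, 0.99]:
    for n in [10, 100, 1000]:
        D = k * math.log(n) / (1 - lam2)
        assert lam2 ** D <= n ** (-k) * (1 + 1e-9)
print("(37) verified on the grid of test values")
```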

Combining the above,

𝔼[f32]\displaystyle\mathbb{E}[f_{3}^{2}] 2R=2N+1supΔN2R+2Rk/c+2Δ/crR1(k(N+1)n)12(Rk2(1+𝟙k even)+Δ)\displaystyle\leq 2\sum_{R=2}^{N+1}\sup_{\Delta}N^{2R+2Rk/c+2\Delta/c}r^{R-1}\left(\frac{k(N+1)}{n}\right)^{\frac{1}{2}(Rk-2(1+\mathbbm{1}_{k\text{ even}})+\Delta)}
=2r(nk(N+1))1+𝟙k evenR=2N+1(kk/2N2+2k/c(N+1)k/2rnk/2)RsupΔ(N2/ck(N+1)n)Δ.\displaystyle=\frac{2}{r}\left(\frac{n}{k(N+1)}\right)^{1+\mathbbm{1}_{k\text{ even}}}\sum_{R=2}^{N+1}\left(k^{k/2}N^{2+2k/c}(N+1)^{k/2}\frac{r}{n^{k/2}}\right)^{R}\sup_{\Delta}\left(N^{2/c}\sqrt{\frac{k(N+1)}{n}}\right)^{\Delta}.

Recall c0.08c\geq 0.08 (Proposition 4.1), which gives 1/c12.51/c\leq 12.5. Using (36),

N2/ck(N+1)nD2/c+1/2knD25.5kn(kn1/52logn+2)25.5kn1N^{2/c}\sqrt{\frac{k(N+1)}{n}}\leq D^{2/c+1/2}\sqrt{\frac{k}{n}}\leq D^{25.5}\sqrt{\frac{k}{n}}\leq\left(kn^{1/52}\log n+2\right)^{25.5}\sqrt{\frac{k}{n}}\leq 1
for sufficiently large n0n_{0}. Since Δ0\Delta\geq 0, the supremum above is achieved at Δ=0\Delta=0 and the bound on 𝔼[f32]\mathbb{E}[f_{3}^{2}] becomes
=2r(nk(N+1))1+𝟙k evenR=2N+1(kk/2N2+2k/c(N+1)k/2rnk/2)R\displaystyle=\frac{2}{r}\left(\frac{n}{k(N+1)}\right)^{1+\mathbbm{1}_{k\text{ even}}}\sum_{R=2}^{N+1}\left(k^{k/2}N^{2+2k/c}(N+1)^{k/2}\frac{r}{n^{k/2}}\right)^{R}
2r(nkD)1+𝟙k evenR=2N+1(kk/2D2+2k/c+k/2rnk/2)R\displaystyle\leq\frac{2}{r}\left(\frac{n}{kD}\right)^{1+\mathbbm{1}_{k\text{ even}}}\sum_{R=2}^{N+1}\left(k^{k/2}D^{2+2k/c+k/2}\frac{r}{n^{k/2}}\right)^{R}
2r(nkD)1+𝟙k evenR=2(kk/2D25.5k+2rnk/2)R\displaystyle\leq\frac{2}{r}\left(\frac{n}{kD}\right)^{1+\mathbbm{1}_{k\text{ even}}}\sum_{R=2}^{\infty}\left(k^{k/2}D^{25.5k+2}\frac{r}{n^{k/2}}\right)^{R}
and now (7) implies kk/2D25.5k+2rnk/2kk/2D27krnk/21/2k^{k/2}D^{25.5k+2}\frac{r}{n^{k/2}}\leq k^{k/2}D^{27k}\frac{r}{n^{k/2}}\leq 1/2, which gives
4r(nkD)1+𝟙k even(kk/2D25.5k+2rnk/2)2\displaystyle\leq\frac{4}{r}\left(\frac{n}{kD}\right)^{1+\mathbbm{1}_{k\text{ even}}}\left(k^{k/2}D^{25.5k+2}\frac{r}{n^{k/2}}\right)^{2}
=4kk1𝟙k evenD51k+3𝟙k evenrnk1𝟙k even\displaystyle=4k^{k-1-\mathbbm{1}_{k\text{ even}}}D^{51k+3-\mathbbm{1}_{k\text{ even}}}\frac{r}{n^{k-1-\mathbbm{1}_{k\text{ even}}}}
4kk1D52krnk1𝟙k even,\displaystyle\leq 4k^{k-1}D^{52k}\frac{r}{n^{k-1-\mathbbm{1}_{k\text{ even}}}},

completing the proof. ∎

4.6 Putting it Together

Proof of Theorem 2.2.

Using Lemma 4.4,

𝔼[(fa11)2]\displaystyle\mathbb{E}[(f-a_{11})^{2}] =𝔼[(f1+f2+f3a11)2]\displaystyle=\mathbb{E}[(f_{1}+f_{2}+f_{3}-a_{11})^{2}]
=𝔼[(f2+f3)2]\displaystyle=\mathbb{E}[(f_{2}+f_{3})^{2}]
2(𝔼[f22]+𝔼[f32]).\displaystyle\leq 2(\mathbb{E}[f_{2}^{2}]+\mathbb{E}[f_{3}^{2}]).

Recall that Lemma 4.5 gives 𝔼[f22]r|λ2|2(N+1)=r|λ2|2D\mathbb{E}[f_{2}^{2}]\leq r|\lambda_{2}|^{2(N+1)}=r|\lambda_{2}|^{2D}, and from (37) we have |λ2|2Dn2k|\lambda_{2}|^{2D}\leq n^{-2k}. Also, (7) implies rnk/2r\leq n^{k/2}, which gives 𝔼[f22]nk\mathbb{E}[f_{2}^{2}]\leq n^{-k}. Combining this with Lemma 4.6 yields

𝔼[(fa11)2]2(nk+4kk1D52krnk1𝟙k even)10kk1D52krnk1𝟙k even,\mathbb{E}[(f-a_{11})^{2}]\leq 2(n^{-k}+4k^{k-1}D^{52k}\frac{r}{n^{k-1-\mathbbm{1}_{k\text{ even}}}})\leq 10k^{k-1}D^{52k}\frac{r}{n^{k-1-\mathbbm{1}_{k\text{ even}}}},

completing the proof. ∎

Acknowledgments

The author is indebted to Jonathan Niles-Weed, Tselil Schramm, and Jerry Li for numerous detailed discussions about this problem. The author also thanks Jingqiu Ding, Bruce Hajek, Tim Kunisky, Cris Moore, Aaron Potechin, Bobby Shi, and anonymous reviewers, for helpful discussions and comments.

References

  • [AAA17] Esraa Al-Sharoa, Mahmood Al-Khassaweneh, and Selin Aviyente. A tensor based framework for community detection in dynamic networks. In International conference on acoustics, speech and signal processing (ICASSP), pages 2312–2316. IEEE, 2017.
  • [ABG+14] Joseph Anderson, Mikhail Belkin, Navin Goyal, Luis Rademacher, and James Voss. The more, the merrier: the blessing of dimensionality for learning large gaussian mixtures. In Conference on Learning Theory, pages 1135–1164. PMLR, 2014.
  • [AC08] Dimitris Achlioptas and Amin Coja-Oghlan. Algorithmic barriers from phase transitions. In 2008 49th Annual IEEE Symposium on Foundations of Computer Science, pages 793–802. IEEE, 2008.
  • [AFH+12] Anima Anandkumar, Dean P Foster, Daniel J Hsu, Sham M Kakade, and Yi-Kai Liu. A spectral algorithm for latent dirichlet allocation. Advances in neural information processing systems, 25, 2012.
  • [AGHK13] Animashree Anandkumar, Rong Ge, Daniel Hsu, and Sham Kakade. A tensor spectral approach to learning mixed membership community models. In Conference on Learning Theory, pages 867–881. PMLR, 2013.
  • [AGJ14] Anima Anandkumar, Rong Ge, and Majid Janzamin. Analyzing tensor power method dynamics: Applications to learning overcomplete latent variable models. arXiv preprint arXiv:1411.1488, 2014.
  • [AGJ15] Animashree Anandkumar, Rong Ge, and Majid Janzamin. Learning overcomplete latent variable models through tensor methods. In Conference on Learning Theory, pages 36–112. PMLR, 2015.
  • [AM85] Noga Alon and Vitali D Milman. λ1\lambda_{1}, isoperimetric inequalities for graphs, and superconcentrators. Journal of Combinatorial Theory, Series B, 38(1):73–88, 1985.
  • [BB20] Matthew Brennan and Guy Bresler. Reducibility and statistical-computational gaps from secret leakage. In Conference on Learning Theory, pages 648–847. PMLR, 2020.
  • [BBH18] Matthew Brennan, Guy Bresler, and Wasim Huleihel. Reducibility and computational lower bounds for problems with planted sparse structure. In Conference On Learning Theory, pages 48–166. PMLR, 2018.
  • [BBH+21] Matthew S Brennan, Guy Bresler, Sam Hopkins, Jerry Li, and Tselil Schramm. Statistical query algorithms and low degree tests are almost equivalent. In Conference on Learning Theory, pages 774–774. PMLR, 2021.
  • [BBK+17] Afonso S Bandeira, Ben Blum-Smith, Joe Kileel, Amelia Perry, Jonathan Weed, and Alexander S Wein. Estimation under group actions: recovering orbits from invariants. arXiv preprint arXiv:1712.10163, 2017.
  • [BBK+21] Afonso S Bandeira, Jess Banks, Dmitriy Kunisky, Christopher Moore, and Alexander S Wein. Spectral planting and the hardness of refuting cuts, colorability, and communities in random graphs. In Conference on Learning Theory, pages 410–473. PMLR, 2021.
  • [BBLS18] Nicolas Boumal, Tamir Bendory, Roy R Lederman, and Amit Singer. Heterogeneous multireference alignment: A single pass approach. In 52nd Annual Conference on Information Sciences and Systems (CISS), pages 1–6. IEEE, 2018.
  • [BCMV14] Aditya Bhaskara, Moses Charikar, Ankur Moitra, and Aravindan Vijayaraghavan. Smoothed analysis of tensor decompositions. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 594–603, 2014.
  • [BCO14] Cristiano Bocci, Luca Chiantini, and Giorgio Ottaviani. Refined methods for the identifiability of tensors. Annali di Matematica Pura ed Applicata (1923-), 193(6):1691–1702, 2014.
  • [BCRT20] Giulio Biroli, Chiara Cammarota, and Federico Ricci-Tersenghi. How to iron out rough landscapes and get optimal performances: averaged gradient descent and its application to tensor PCA. Journal of Physics A: Mathematical and Theoretical, 53(17):174003, 2020.
  • [BEH+22] Afonso S Bandeira, Ahmed El Alaoui, Samuel B Hopkins, Tselil Schramm, Alexander S Wein, and Ilias Zadik. The Franz-Parisi criterion and computational trade-offs in high dimensional statistics. arXiv preprint arXiv:2205.09727, 2022.
  • [BGJ20] Gerard Ben Arous, Reza Gheissari, and Aukosh Jagannath. Algorithmic thresholds for tensor PCA. The Annals of Probability, 48(4):2052–2087, 2020.
  • [BH22] Guy Bresler and Brice Huang. The algorithmic phase transition of random k-SAT for low degree polynomials. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pages 298–309. IEEE, 2022.
  • [BHK+19] Boaz Barak, Samuel Hopkins, Jonathan Kelner, Pravesh K Kothari, Ankur Moitra, and Aaron Potechin. A nearly tight sum-of-squares lower bound for the planted clique problem. SIAM Journal on Computing, 48(2):687–735, 2019.
  • [BKS15] Boaz Barak, Jonathan A Kelner, and David Steurer. Dictionary learning and tensor decomposition via the sum-of-squares method. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 143–151, 2015.
  • [BKW20] Afonso S Bandeira, Dmitriy Kunisky, and Alexander S Wein. Computational hardness of certifying bounds on constrained PCA problems. In 11th Innovations in Theoretical Computer Science Conference (ITCS 2020), volume 151, page 78. Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2020.
  • [BM20] Davide Bacciu and Danilo P Mandic. Tensor decompositions in deep learning. arXiv preprint arXiv:2002.11835, 2020.
  • [BMR21] Jess Banks, Sidhanth Mohanty, and Prasad Raghavendra. Local statistics, semidefinite programming, and community detection. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1298–1316. SIAM, 2021.
  • [Bol98] Béla Bollobás. Random graphs. In Modern graph theory, pages 215–252. Springer, 1998.
  • [BR13] Quentin Berthet and Philippe Rigollet. Complexity theoretic lower bounds for sparse principal component detection. In Conference on learning theory, pages 1046–1066. PMLR, 2013.
  • [BWZ20] Gérard Ben Arous, Alexander S Wein, and Ilias Zadik. Free energy wells and overlap gap property in sparse PCA. In Conference on Learning Theory, pages 479–482. PMLR, 2020.
  • [CGH+22] Amin Coja-Oghlan, Oliver Gebhard, Max Hahn-Klimroth, Alexander S Wein, and Ilias Zadik. Statistical and computational phase transitions in group testing. In Conference on Learning Theory, pages 4764–4781. PMLR, 2022.
  • [CMZ22] Zongchen Chen, Elchanan Mossel, and Ilias Zadik. Almost-linear planted cliques elude the metropolis process. arXiv preprint arXiv:2204.01911, 2022.
  • [DCC07] Lieven De Lathauwer, Joséphine Castaing, and Jean-François Cardoso. Fourth-order cumulant-based blind identification of underdetermined mixtures. IEEE Transactions on Signal Processing, 55(6):2965–2973, 2007.
  • [DdL+22] Jingqiu Ding, Tommaso d’Orsi, Chih-Hung Liu, David Steurer, and Stefan Tiegel. Fast algorithm for overcomplete order-3 tensor decomposition. In Conference on Learning Theory, pages 3741–3799. PMLR, 2022.
  • [DK22] Ilias Diakonikolas and Daniel Kane. Non-gaussian component analysis via lattice basis reduction. In Conference on Learning Theory, pages 4535–4547. PMLR, 2022.
  • [DKMZ11] Aurelien Decelle, Florent Krzakala, Cristopher Moore, and Lenka Zdeborová. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Physical Review E, 84(6):066106, 2011.
  • [DKS17] Ilias Diakonikolas, Daniel M Kane, and Alistair Stewart. Statistical query lower bounds for robust estimation of high-dimensional gaussians and gaussian mixtures. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 73–84. IEEE, 2017.
  • [DKWB19] Yunzi Ding, Dmitriy Kunisky, Alexander S Wein, and Afonso S Bandeira. Subexponential-time algorithms for sparse PCA. arXiv preprint arXiv:1907.11635, 2019.
  • [Dod84] Jozef Dodziuk. Difference equations, isoperimetric inequality and transience of certain random walks. Transactions of the American Mathematical Society, 284(2):787–794, 1984.
  • [FGR+17] Vitaly Feldman, Elena Grigorescu, Lev Reyzin, Santosh S Vempala, and Ying Xiao. Statistical algorithms and a lower bound for detecting planted cliques. Journal of the ACM (JACM), 64(2):1–37, 2017.
  • [Fri08] Joel Friedman. A proof of Alon’s second eigenvalue conjecture and related problems. American Mathematical Soc., 2008.
  • [FSWW20] Zhou Fan, Yi Sun, Tianhao Wang, and Yihong Wu. Likelihood landscape and maximum likelihood estimation for the discrete orbit recovery model. Communications on Pure and Applied Mathematics, 2020.
  • [Gam21] David Gamarnik. The overlap gap property: A topological barrier to optimizing over random structures. Proceedings of the National Academy of Sciences, 118(41):e2108492118, 2021.
  • [GHK15] Rong Ge, Qingqing Huang, and Sham M Kakade. Learning mixtures of gaussians in high dimensions. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 761–770, 2015.
  • [GJS21] David Gamarnik, Aukosh Jagannath, and Subhabrata Sen. The overlap gap property in principal submatrix recovery. Probability Theory and Related Fields, 181(4):757–814, 2021.
  • [GJW20] David Gamarnik, Aukosh Jagannath, and Alexander S Wein. Low-degree hardness of random optimization problems. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 131–140. IEEE, 2020.
  • [GM15] Rong Ge and Tengyu Ma. Decomposing overcomplete 3rd order tensors using sum-of-squares algorithms. Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 829–849, 2015.
  • [GM17] Rong Ge and Tengyu Ma. On the optimization landscape of tensor decompositions. Advances in Neural Information Processing Systems, 30, 2017.
  • [GMZ22] David Gamarnik, Cristopher Moore, and Lenka Zdeborová. Disordered systems insights on computational hardness. arXiv preprint arXiv:2210.08312, 2022.
  • [GS17] David Gamarnik and Madhu Sudan. Limits of local algorithms over sparse random graphs. The Annals of Probability, pages 2353–2376, 2017.
  • [GVX14] Navin Goyal, Santosh Vempala, and Ying Xiao. Fourier PCA and robust tensor decomposition. In Proceedings of the forty-sixth annual ACM symposium on Theory of Computing, pages 584–593, 2014.
  • [GZ17] David Gamarnik and Ilias Zadik. Sparse high-dimensional linear regression. Algorithmic barriers and a local search algorithm. arXiv preprint arXiv:1711.04952, 2017.
  • [GZ19] David Gamarnik and Ilias Zadik. The landscape of the planted clique problem: Dense subgraphs and the overlap gap property. arXiv preprint arXiv:1904.07174, 2019.
  • [Hås89] Johan Håstad. Tensor rank is NP-complete. In International Colloquium on Automata, Languages, and Programming, pages 451–460. Springer, 1989.
  • [HK13] Daniel Hsu and Sham M Kakade. Learning mixtures of spherical gaussians: moment methods and spectral decompositions. In Proceedings of the 4th conference on Innovations in Theoretical Computer Science, pages 11–20, 2013.
  • [HKP+17] Samuel B Hopkins, Pravesh K Kothari, Aaron Potechin, Prasad Raghavendra, Tselil Schramm, and David Steurer. The power of sum-of-squares for detecting hidden structures. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 720–731. IEEE, 2017.
  • [HL13] Christopher J Hillar and Lek-Heng Lim. Most tensor problems are NP-hard. Journal of the ACM (JACM), 60(6):1–39, 2013.
  • [HLW06] Shlomo Hoory, Nathan Linial, and Avi Wigderson. Expander graphs and their applications. Bulletin of the American Mathematical Society, 43(4):439–561, 2006.
  • [Hop18] Samuel Hopkins. Statistical Inference and the Sum of Squares Method. PhD thesis, Cornell University, 2018.
  • [HS17] Samuel B Hopkins and David Steurer. Efficient bayesian estimation from few samples: community detection and related problems. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 379–390. IEEE, 2017.
  • [HSS15] Samuel B Hopkins, Jonathan Shi, and David Steurer. Tensor principal component analysis via sum-of-squares proofs. In Conference on Learning Theory, pages 956–1006. PMLR, 2015.
  • [HSS19] Samuel B Hopkins, Tselil Schramm, and Jonathan Shi. A robust spectral algorithm for overcomplete tensor decomposition. In Conference on Learning Theory, pages 1683–1722. PMLR, 2019.
  • [HSSS16] Samuel B Hopkins, Tselil Schramm, Jonathan Shi, and David Steurer. Fast spectral algorithms from sum-of-squares proofs: tensor decomposition and planted sparse vectors. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 178–191, 2016.
  • [HW21] Justin Holmgren and Alexander S Wein. Counterexamples to the low-degree conjecture. In 12th Innovations in Theoretical Computer Science Conference (ITCS 2021), volume 185, 2021.
  • [HWX15] Bruce Hajek, Yihong Wu, and Jiaming Xu. Computational lower bounds for community detection on random graphs. In Conference on Learning Theory, pages 899–928. PMLR, 2015.
  • [Jer92] Mark Jerrum. Large cliques elude the Metropolis process. Random Structures & Algorithms, 3(4):347–359, 1992.
  • [JLLX21] Bing-Yi Jing, Ting Li, Zhongyuan Lyu, and Dong Xia. Community detection on mixture multilayer networks via regularized tensor decomposition. The Annals of Statistics, 49(6):3181–3205, 2021.
  • [KM21a] Frederic Koehler and Elchanan Mossel. Reconstruction on trees and low-degree polynomials. arXiv preprint arXiv:2109.06915, 2021.
  • [KM21b] Pravesh K Kothari and Peter Manohar. A stress-free sum-of-squares lower bound for coloring. arXiv preprint arXiv:2105.07517, 2021.
  • [KMOW17] Pravesh K Kothari, Ryuhei Mori, Ryan O’Donnell, and David Witmer. Sum of squares lower bounds for refuting any CSP. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 132–145, 2017.
  • [Kol21] Tamara G. Kolda. Will the real Jennrich’s algorithm please stand up? Available online at www.mathsci.ai/post/jennrich, December 2021. Accessed: 10-22-2022.
  • [KP20] Bohdan Kivva and Aaron Potechin. Exact nuclear norm, completion and decomposition for random overcomplete tensors via degree-4 SOS. arXiv preprint arXiv:2011.09416, 2020.
  • [Kun21] Dmitriy Kunisky. Hypothesis testing with low-degree polynomials in the Morris class of exponential families. In Conference on Learning Theory, pages 2822–2848. PMLR, 2021.
  • [Kun22] Dmitriy Kunisky. Lecture notes on sum-of-squares optimization. Available online at www.kunisky.com/static/teaching/2022spring-sos/sos-notes.pdf, 2022. Accessed: 09-29-2022.
  • [KWB22] Dmitriy Kunisky, Alexander S Wein, and Afonso S Bandeira. Notes on computational hardness of hypothesis testing: Predictions using the low-degree likelihood ratio. In ISAAC Congress (International Society for Analysis, its Applications and Computation), pages 1–50. Springer, 2022.
  • [LKZ15] Thibault Lesieur, Florent Krzakala, and Lenka Zdeborová. MMSE of probabilistic low-rank matrix estimation: Universality with respect to the output channel. In 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 680–687. IEEE, 2015.
  • [LM21] Allen Liu and Ankur Moitra. Algorithms from invariants: Smoothed analysis of orbit recovery over SO(3)SO(3). arXiv preprint arXiv:2106.02680, 2021.
  • [LRA93] Sue E Leurgans, Robert T Ross, and Rebecca B Abel. A decomposition for three-way arrays. SIAM Journal on Matrix Analysis and Applications, 14(4):1064–1083, 1993.
  • [LZ22] Yuetian Luo and Anru R Zhang. Tensor clustering with planted structures: Statistical optimality and computational limits. The Annals of Statistics, 50(1):584–613, 2022.
  • [MKUZ19] Stefano Sarao Mannelli, Florent Krzakala, Pierfrancesco Urbani, and Lenka Zdeborová. Passed & spurious: Descent algorithms and local minima in spiked matrix-tensor models. In International Conference on Machine Learning, pages 4333–4342. PMLR, 2019.
  • [Moi14] Ankur Moitra. Algorithmic aspects of machine learning. Lecture notes, 2014.
  • [MR05] Elchanan Mossel and Sébastien Roch. Learning nonsingular phylogenies and hidden Markov models. In Proceedings of the thirty-seventh annual ACM symposium on Theory of Computing, pages 366–375, 2005.
  • [MSS16] Tengyu Ma, Jonathan Shi, and David Steurer. Polynomial-time tensor decompositions with sum-of-squares. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 438–446. IEEE, 2016.
  • [MW19] Ankur Moitra and Alexander S Wein. Spectral methods from tensor networks. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 926–937, 2019.
  • [MW22] Andrea Montanari and Alexander S Wein. Equivalence of approximate message passing and low-degree polynomials in rank-one matrix estimation. arXiv preprint arXiv:2212.06996, 2022.
  • [OTR22] Mohamed Ouerfelli, Mohamed Tamaazousti, and Vincent Rivasseau. Random tensor theory for tensor decomposition. In Proceedings of the AAAI Conference on Artificial Intelligence, 2022.
  • [PR20] Aaron Potechin and Goutham Rajendran. Machinery for proving sum-of-squares lower bounds on certification problems. arXiv preprint arXiv:2011.04253, 2020.
  • [PWB+19] Amelia Perry, Jonathan Weed, Afonso S Bandeira, Philippe Rigollet, and Amit Singer. The sample complexity of multireference alignment. SIAM Journal on Mathematics of Data Science, 1(3):497–517, 2019.
  • [RM14] Emile Richard and Andrea Montanari. A statistical model for tensor PCA. Advances in Neural Information Processing Systems, 27, 2014.
  • [RSG17] Stephan Rabanser, Oleksandr Shchur, and Stephan Günnemann. Introduction to tensor decompositions and their applications in machine learning. arXiv preprint arXiv:1711.10781, 2017.
  • [RSS18] Prasad Raghavendra, Tselil Schramm, and David Steurer. High dimensional estimation via sum-of-squares proofs. In Proceedings of the International Congress of Mathematicians: Rio de Janeiro 2018, pages 3389–3423. World Scientific, 2018.
  • [SDF+17] Nicholas D Sidiropoulos, Lieven De Lathauwer, Xiao Fu, Kejun Huang, Evangelos E Papalexakis, and Christos Faloutsos. Tensor decomposition for signal processing and machine learning. IEEE Transactions on Signal Processing, 65(13):3551–3582, 2017.
  • [SS17] Tselil Schramm and David Steurer. Fast and robust tensor decomposition with applications to dictionary learning. In Conference on Learning Theory, pages 1760–1793. PMLR, 2017.
  • [SW22] Tselil Schramm and Alexander S Wein. Computational barriers to estimation from low-degree polynomials. The Annals of Statistics, 50(3):1833–1858, 2022.
  • [Wei18] Alexander Spence Wein. Statistical estimation in the presence of group actions. PhD thesis, Massachusetts Institute of Technology, 2018.
  • [Wei22] Alexander S Wein. Optimal low-degree hardness of maximum independent set. Mathematical Statistics and Learning, 4(3):221–251, 2022.
  • [WEM19] Alexander S Wein, Ahmed El Alaoui, and Cristopher Moore. The Kikuchi hierarchy and tensor PCA. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 1446–1468. IEEE, 2019.
  • [WX21] Yihong Wu and Jiaming Xu. Statistical problems with planted structures: Information-theoretical and computational limits. Information-Theoretic Methods in Data Science, 383, 2021.
  • [ZSWB22] Ilias Zadik, Min Jae Song, Alexander S Wein, and Joan Bruna. Lattice-based methods surpass sum-of-squares in clustering. In Conference on Learning Theory, pages 1247–1248. PMLR, 2022.