Smoothing the Landscape Boosts the Signal for SGD
Optimal Sample Complexity for Learning Single Index Models
Abstract
We focus on the task of learning a single index model $\sigma(w^\star \cdot x)$ with respect to the isotropic Gaussian distribution in $d$ dimensions. Prior work has shown that the sample complexity of learning $w^\star$ is governed by the information exponent $k^\star$ of the link function $\sigma$, which is defined as the index of the first nonzero Hermite coefficient of $\sigma$. Ben Arous et al. [1] showed that $n \gtrsim d^{k^\star - 1}$ samples suffice for learning $w^\star$ and that this is tight for online SGD. However, the CSQ lower bound for gradient-based methods only shows that $n \gtrsim d^{k^\star/2}$ samples are necessary. In this work, we close the gap between the upper and lower bounds by showing that online SGD on a smoothed loss learns $w^\star$ with $n \gtrsim d^{k^\star/2}$ samples. We also draw connections to statistical analyses of tensor PCA and to the implicit regularization effects of minibatch SGD on empirical losses.
1 Introduction
Gradient descent-based algorithms are popular for deriving computational and statistical guarantees for a number of high-dimensional statistical learning problems [2, 3, 1, 4, 5, 6]. Despite the fact that the empirical loss is nonconvex and in the worst case computationally intractable to optimize, for a number of statistical learning tasks gradient-based methods still converge to good solutions with polynomial runtime and sample complexity. Analyses in these settings typically study properties of the empirical loss landscape [7], and in particular the number of samples needed for the signal of the gradient arising from the population loss to overpower the noise in some uniform sense. In this sense, the sample complexity of learning with gradient descent is determined by the geometry of the empirical loss landscape.
One setting in which the empirical loss landscape showcases rich behavior is that of learning a single-index model. Single index models are target functions of the form $f^\star(x) = \sigma(w^\star \cdot x)$, where $w^\star \in S^{d-1}$ is the unknown relevant direction and $\sigma$ is the known link function. When the covariates are drawn from the standard $d$-dimensional Gaussian distribution, the shape of the loss landscape is governed by the information exponent $k^\star$ of the link function $\sigma$, which characterizes the curvature of the loss landscape around the origin. Ben Arous et al. [1] show that online stochastic gradient descent on the empirical loss can recover $w^\star$ with $n \gtrsim d^{k^\star - 1}$ samples; furthermore, they present a lower bound showing that for a class of online SGD algorithms, $n \gtrsim d^{k^\star - 1}$ samples are indeed necessary.
However, gradient descent can be suboptimal for various statistical learning problems, as it only relies on local information in the loss landscape and is thus prone to getting stuck in local minima. For learning a single index model, the Correlational Statistical Query (CSQ) lower bound only requires $n \gtrsim d^{k^\star/2}$ samples to recover $w^\star$ [6, 4], which is far fewer than the number of samples required by online SGD. This gap between gradient-based methods and the CSQ lower bound is also present in the Tensor PCA problem [8]; for recovering a rank-1 $k$-tensor in $d$ dimensions, both gradient descent and the power method require $n \gtrsim d^{k-1}$ samples, whereas more sophisticated spectral algorithms can match the computational lower bound of $n \gtrsim d^{k/2}$ samples.
In light of the lower bound from [1], it seems hopeless for a gradient-based algorithm to match the CSQ lower bound for learning single-index models. [1] considers the regime in which SGD is simply a discretization of gradient flow, in which case the poor properties of the loss landscape with insufficient samples imply a lower bound. However, recent work has shown that SGD is not just a discretization of gradient flow, but rather that it has an additional implicit regularization effect. Specifically, [9, 10, 11] show that over short periods of time, SGD converges to a quasi-stationary distribution $N(\theta, \lambda S)$ where $\theta$ is an initial reference point, $S$ is a matrix depending on the Hessian and the noise covariance, and $\lambda = \frac{\eta}{B}$ measures the strength of the noise, where $\eta$ is the learning rate and $B$ is the batch size. The resulting long-term dynamics therefore follow the smoothed gradient $\mathbb{E}_{v \sim N(0, \lambda S)}[\nabla \hat{L}(\theta + v)]$, which has the effect of regularizing the trace of the Hessian.
This implicit regularization effect of minibatch SGD has been shown to drastically improve generalization and reduce the number of samples necessary for supervised learning tasks [12, 13, 14]. However, the connection between the smoothed landscape and the resulting sample complexity is poorly understood. Towards closing this gap, we consider directly smoothing the loss landscape in order to efficiently learn single index models. Our main result, Theorem 1, shows that online SGD on the smoothed loss learns $w^\star$ in $n = \tilde{O}(d^{k^\star/2})$ samples, which matches the CSQ lower bound. This improves over the $d^{k^\star - 1}$ lower bound for online SGD on the unsmoothed loss from Ben Arous et al. [1]. Key to our analysis is the observation that smoothing the loss landscape boosts the signal-to-noise ratio in a region around the initialization, which allows the iterates to avoid the poor local minima of the unsmoothed empirical loss. Our analysis is inspired by the implicit regularization effect of minibatch SGD, along with the partial trace algorithm for Tensor PCA, which achieves the optimal sample complexity among computationally efficient algorithms.
The outline of our paper is as follows. In Section 3 we formalize the specific statistical learning setup, define the information exponent $k^\star$, and describe our algorithm. Section 4 contains our main theorem, and Section 5 presents a heuristic derivation of how smoothing the loss landscape increases the signal-to-noise ratio. We present empirical verification in Section 6, and in Section 7 we detail connections to tensor PCA and minibatch SGD.
2 Related Work
There is a rich literature on learning single index models. Kakade et al. [15] showed that gradient descent can learn single index models when the link function is Lipschitz and monotonic, and designed an alternative algorithm to handle the case when the link function is unknown. Soltanolkotabi [16] focused on learning single index models where the link function is $\mathrm{ReLU}(x) := \max(0, x)$, which has information exponent $k^\star = 1$. The phase retrieval problem is a special case of the single index model in which the link function is $\sigma(x) = x^2$ or $\sigma(x) = |x|$; this corresponds to $k^\star = 2$, and solving phase retrieval via gradient descent has been well studied [17, 18, 19]. Dudeja and Hsu [20] constructed an algorithm which explicitly uses the harmonic structure of Hermite polynomials to identify the information exponent. Ben Arous et al. [1] provided matching upper and lower bounds that show that $n = \tilde{\Theta}(d^{k^\star - 1})$ samples are necessary and sufficient for online SGD to recover $w^\star$.
Going beyond gradient-based algorithms, Chen and Meka [21] provide an algorithm that can learn polynomials of few relevant dimensions with $n = \tilde{O}(d)$ samples, including single index models with polynomial link functions. Their estimator is based on the structure of the filtered PCA matrix $\mathbb{E}_{x,y}\big[\mathbf{1}_{\{|y| \ge \tau\}}\, x x^\top\big]$, which relies on the heavy tails of polynomials. In particular, this upper bound does not apply to bounded link functions. Furthermore, while their result achieves the information-theoretically optimal $d$ dependence, it is not a CSQ algorithm, whereas our Algorithm 1 achieves the optimal sample complexity over the class of CSQ algorithms (which contains gradient descent).
Recent work has also studied the ability of neural networks to learn single- or multi-index models [5, 6, 22, 23, 4]. Bietti et al. [5] showed that two-layer neural networks are able to adapt to unknown link functions with $n = \tilde{O}(d^{k^\star})$ samples. Damian et al. [6] consider multi-index models with polynomial link functions and, under a nondegeneracy assumption which corresponds to the $k^\star = 2$ case, show that SGD on a two-layer neural network learns with $n = \tilde{O}(d^2)$ samples. Abbe et al. [23, 4] provide a generalization of the information exponent called the leap. They prove that in some settings, SGD can learn low-dimensional target functions with $n = \tilde{O}(d^{\mathrm{Leap}-1})$ samples. However, they conjecture that the optimal rate is $\tilde{\Theta}(d^{\mathrm{Leap}/2})$ and that this can be achieved by ERM rather than online SGD.
The problem of learning single index models with information exponent $k^\star$ is strongly related to the order-$k^\star$ Tensor PCA problem (see Section 7.1), which was introduced by Richard and Montanari [8]. They conjectured the existence of a computational-statistical gap for Tensor PCA, as the information-theoretic threshold for the problem is $n \asymp d$, but all known computationally efficient algorithms require $n \gtrsim d^{k/2}$. Furthermore, simple iterative estimators including the tensor power method, gradient descent, and AMP are suboptimal and require $n \gtrsim d^{k-1}$ samples. Hopkins et al. [24] introduced the partial trace estimator, which succeeds with $n \gtrsim d^{\lceil k/2 \rceil}$ samples. Anandkumar et al. [25] extended this result to show that gradient descent on a smoothed landscape could achieve the $n \gtrsim d^{k/2}$ sample complexity when $k = 3$, and Biroli et al. [26] heuristically extended this result to larger $k$. The success of smoothing the landscape for Tensor PCA is one of the inspirations for Algorithm 1.
3 Setting
3.1 Data distribution and target function
Our goal is to efficiently learn single index models of the form $f^\star(x) = \sigma(w^\star \cdot x)$, where $w^\star \in S^{d-1}$, the unit sphere in $\mathbb{R}^d$. We assume that $\sigma$ is normalized so that $\mathbb{E}_{x \sim N(0,1)}[\sigma(x)^2] = 1$. We will also assume that $\sigma$ is differentiable and that $\sigma'$ has polynomial tails:
Assumption 1.
There exist constants $C_1, C_2 > 0$ such that $|\sigma'(x)| \le C_1 (1 + |x|)^{C_2}$ for all $x \in \mathbb{R}$.
Our goal is to recover $w^\star$ given $n$ samples $(x_1, y_1), \ldots, (x_n, y_n)$ sampled i.i.d. from $x \sim N(0, I_d)$, $y = \sigma(w^\star \cdot x)$.
For simplicity of exposition, we assume that $\sigma$ is known and we take our model class to be $\{f_w(x) := \sigma(w \cdot x) : w \in S^{d-1}\}$.
3.2 Algorithm: online SGD on a smoothed landscape
As $w \in S^{d-1}$, we will let $\nabla$ denote the spherical gradient with respect to $w$. That is, for a function $g : \mathbb{R}^d \to \mathbb{R}$, let $\nabla g(w) := (I - ww^\top)\,\partial g(w)$, where $\partial g$ is the standard Euclidean gradient.
To compute the loss on a sample $(x, y)$, we use the correlation loss:
$$\ell(w; x, y) := 1 - f_w(x)\, y.$$
Furthermore, when the sample is omitted we refer to the population loss:
$$L(w) := \mathbb{E}_{x \sim N(0, I_d),\; y = \sigma(w^\star \cdot x)}\big[\ell(w; x, y)\big].$$
Our primary contribution is that SGD on a smoothed loss achieves the optimal sample complexity for this problem. First, we define the smoothing operator $\mathcal{L}_\lambda$:
Definition 1.
Let $w \in S^{d-1}$. We define the smoothing operator $\mathcal{L}_\lambda$ by
$$(\mathcal{L}_\lambda g)(w) := \mathbb{E}_{z \sim \mu_w}\left[g\!\left(\frac{w + \lambda z}{\|w + \lambda z\|}\right)\right],$$
where $\mu_w$ is the uniform distribution over $S^{d-1}$ conditioned on being perpendicular to $w$.
This choice of smoothing is natural for spherical gradient descent and can be directly related to the Riemannian exponential map on $S^{d-1}$: it is equivalent to the intrinsic definition $(\mathcal{L}_\lambda g)(w) = \mathbb{E}_{z \sim \mu_w}[g(\exp_w(\theta z))]$, where $\theta = \arctan(\lambda)$, $z$ ranges over the unit sphere in the tangent space at $w$, and $\exp_w$ is the Riemannian exponential map. We will often abuse notation and write $\mathcal{L}_\lambda g(w)$ rather than $(\mathcal{L}_\lambda g)(w)$. The smoothed empirical loss and population loss are defined by:
$$\ell_\lambda(w; x, y) := (\mathcal{L}_\lambda \ell)(w; x, y) \qquad \text{and} \qquad L_\lambda(w) := (\mathcal{L}_\lambda L)(w).$$
Our algorithm, Algorithm 1, is online SGD on the smoothed loss $\ell_\lambda$: starting from a uniformly random $w_0 \in S^{d-1}$, at each step $t$ we draw a fresh sample $(x_t, y_t)$, take a spherical gradient step on $\ell_{\lambda_t}(\cdot\,; x_t, y_t)$ with learning rate $\eta_t$, and project back to the sphere.
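Since Algorithm 1 is standard projected online SGD, the following is a minimal NumPy sketch of it. The Monte Carlo approximation of the smoothing expectation (the analysis uses the exact expectation over $z \sim \mu_w$), the evaluation of the gradient at the perturbed point (smoothing and differentiation only approximately commute; see Lemma 9), and the constant hyperparameters are all illustrative assumptions rather than the tuned schedule of Theorem 1.

```python
import numpy as np

def project_tangent(g, w):
    """Project a Euclidean gradient g onto the tangent space of the sphere at w."""
    return g - (g @ w) * w

def smoothed_grad(w, x, y, sigma_prime, lam, rng, n_mc=32):
    """Monte Carlo estimate of the spherical gradient of the smoothed
    correlation loss ell(w; x, y) = 1 - y * sigma(w . x)."""
    d = w.shape[0]
    total = np.zeros(d)
    for _ in range(n_mc):
        z = project_tangent(rng.standard_normal(d), w)
        z /= np.linalg.norm(z)                        # z ~ Unif(S^{d-1} intersect w-perp)
        w_pert = (w + lam * z) / np.sqrt(1 + lam**2)  # ||w + lam*z|| = sqrt(1 + lam^2)
        g = -y * sigma_prime(w_pert @ x) * x          # Euclidean gradient of the loss
        total += project_tangent(g, w_pert)           # heuristic: gradient at the perturbed point
    return total / n_mc

def online_sgd(sample_stream, d, sigma_prime, lam, eta, n_steps, seed=0):
    """Online SGD on the smoothed loss: each sample is used exactly once."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)                            # uniform initialization on the sphere
    for _ in range(n_steps):
        x, y = next(sample_stream)                    # fresh sample at every step
        w = w - eta * smoothed_grad(w, x, y, sigma_prime, lam, rng)
        w /= np.linalg.norm(w)                        # retract back to S^{d-1}
    return w
```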
3.3 Hermite polynomials and information exponent
The sample complexity of Algorithm 1 depends on the Hermite coefficients of the link function $\sigma$:
Definition 2 (Hermite Polynomials).
The $k$th Hermite polynomial $h_k$ is the degree-$k$, monic polynomial defined by
$$h_k(x) := (-1)^k\,\frac{\mu^{(k)}(x)}{\mu(x)},$$
where $\mu(x) = \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}$ is the PDF of a standard Gaussian.
The first few Hermite polynomials are $h_0(x) = 1$, $h_1(x) = x$, $h_2(x) = x^2 - 1$, and $h_3(x) = x^3 - 3x$. For further discussion of the Hermite polynomials and their properties, refer to Section A.2. The Hermite polynomials form an orthogonal basis of $L^2(\mu)$, so any function in $L^2(\mu)$ admits a Hermite expansion. We let $\{c_k\}_{k \ge 0}$ denote the Hermite coefficients of the link function $\sigma$:
Definition 3 (Hermite Expansion of $\sigma$).
Let $\{c_k\}_{k \ge 0}$ be the Hermite coefficients of $\sigma$, i.e.
$$\sigma(x) = \sum_{k \ge 0} \frac{c_k}{k!}\,h_k(x) \qquad \text{where} \qquad c_k := \mathbb{E}_{x \sim N(0,1)}\big[\sigma(x)\,h_k(x)\big].$$
The critical quantity of interest is the information exponent $k^\star$ of $\sigma$:
Definition 4 (Information Exponent).
The information exponent $k^\star = k^\star(\sigma)$ is the first index $k \ge 1$ such that $c_k \neq 0$.
Example 1.
Below are some example link functions and their information exponents:
- $\sigma(x) = x$ and $\sigma(x) = \mathrm{ReLU}(x) := \max(0, x)$ have information exponents $k^\star = 1$.
- $\sigma(x) = x^2$ and $\sigma(x) = |x|$ have information exponents $k^\star = 2$.
- $\sigma(x) = x^3 - 3x$ has information exponent $k^\star = 3$. More generally, $\sigma(x) = h_k(x)$ has information exponent $k^\star = k$.
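With the normalization $\mathbb{E}[h_j h_k] = k!\,\delta_{jk}$ implied by Definition 2, the coefficients $c_k = \mathbb{E}_{x \sim N(0,1)}[\sigma(x) h_k(x)]$ can be computed numerically by Gauss–Hermite quadrature, giving a quick way to read off the information exponent of a candidate link function. A sketch (the cutoff `tol` for declaring a coefficient nonzero is an arbitrary heuristic choice):

```python
import numpy as np
from numpy.polynomial import hermite_e as He  # probabilists' (monic) Hermite basis

def hermite_coeffs(sigma, k_max=10, n_quad=100):
    """c_k = E_{x ~ N(0,1)}[sigma(x) h_k(x)] via Gauss-Hermite quadrature.
    hermegauss integrates against the weight exp(-x^2/2), so dividing by
    sqrt(2*pi) turns the quadrature sum into a standard Gaussian expectation."""
    xs, ws = He.hermegauss(n_quad)
    coeffs = []
    for k in range(k_max + 1):
        hk = He.hermeval(xs, [0] * k + [1])   # h_k evaluated at the quadrature nodes
        coeffs.append(np.sum(ws * sigma(xs) * hk) / np.sqrt(2 * np.pi))
    return np.array(coeffs)

def information_exponent(sigma, tol=1e-8):
    c = hermite_coeffs(sigma)
    return next(k for k in range(1, len(c)) if abs(c[k]) > tol)

print(information_exponent(lambda x: np.maximum(x, 0)))  # ReLU: 1
print(information_exponent(np.abs))                      # |x|: 2
print(information_exponent(lambda x: x**3 - 3 * x))      # h_3: 3
```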
Throughout our main results we focus on the case $k^\star \ge 3$, as when $k^\star \le 2$, online SGD without smoothing already achieves the optimal sample complexity of $\Theta(d)$ samples (up to log factors) [1].
4 Main Results
Our main result is a sample complexity guarantee for Algorithm 1:
Theorem 1.
Assume that $\sigma$ satisfies Assumption 1 and that $k^\star \ge 3$, and let $\lambda \le O(d^{1/4})$. For $t \le T_1$ set $\lambda_t = \lambda$ and $\eta_t = \eta$, and for $t > T_1$ set $\lambda_t = 0$ and decay the learning rate as $\eta_t \propto 1/(t - T_1)$. Then for suitable choices of $\eta$ and $T_1$, if
$$T = \tilde{O}\!\left(d^{k^\star - 1}\,(1 + \lambda^2)^{-(k^\star - 2)} + \frac{d}{\epsilon}\right),$$
with high probability the final iterate $w_T$ of Algorithm 1 satisfies $L(w_T) \le \epsilon$.
Theorem 1 uses large smoothing (up to $\lambda = \Theta(d^{1/4})$) to rapidly escape the regime in which $\alpha_t := w_t \cdot w^\star \approx d^{-1/2}$. This first stage continues until $\alpha_t = \Theta(1)$, which takes $\tilde{O}(d^{k^\star/2})$ steps when $\lambda = \Theta(d^{1/4})$. The second stage, in which $\lambda_t = 0$ and the learning rate decays linearly, lasts for an additional $\tilde{O}(d/\epsilon)$ steps where $\epsilon$ is the target accuracy. Because Algorithm 1 uses each sample exactly once, this gives the sample complexity
$$n = \tilde{O}\!\left(d^{k^\star - 1}\,(1 + \lambda^2)^{-(k^\star - 2)} + \frac{d}{\epsilon}\right)$$
to reach population loss $\epsilon$. Setting $\lambda = 0$ is equivalent to zero smoothing and gives a sample complexity of $\tilde{O}(d^{k^\star - 1})$, which matches the results of Ben Arous et al. [1]. On the other hand, setting $\lambda$ to the maximal allowable value of $\Theta(d^{1/4})$ gives:
$$n = \tilde{O}\!\left(d^{k^\star/2} + \frac{d}{\epsilon}\right),$$
which matches the sum of the CSQ lower bound, which is $d^{k^\star/2}$, and the information-theoretic lower bound, which is $d/\epsilon$, up to poly-logarithmic factors.
To complement Theorem 1, we replicate the CSQ lower bound in [6] for the specific function class $\mathcal{F}_k := \{x \mapsto h_k(w \cdot x)/\sqrt{k!} : w \in S^{d-1}\}$, where $h_k$ is the $k$th Hermite polynomial. Statistical query learners are a family of learners that can query values $\mathbb{E}_{x,y}[q(x, y)]$ and receive outputs $v$ with $|v - \mathbb{E}_{x,y}[q(x, y)]| \le \tau$, where $\tau$ denotes the query tolerance [27, 28]. An important class of statistical query learners is that of correlational/inner product statistical queries (CSQ), of the form $q(x, y) = y\,h(x)$. This includes a wide class of algorithms, including gradient descent with the square loss and the correlation loss.
Theorem 2 (CSQ Lower Bound).
Consider the function class $\mathcal{F}_k$. Any CSQ algorithm using $q$ queries requires a tolerance of at most
$$\tau \lesssim \left(\frac{\log(q d)}{d}\right)^{k/4}$$
to output an $f \in \mathcal{F}_k$ with population loss less than $\frac{1}{2}$.
Using the standard heuristic $\tau \approx n^{-1/2}$, which comes from concentration, this implies that $n \gtrsim d^{k/2}$ samples are necessary to learn $\mathcal{F}_k$ unless the algorithm makes exponentially many queries. In the context of gradient descent, this is equivalent to requiring either exponentially many parameters or exponentially many steps of gradient descent.
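To spell the heuristic out: a correlational query $q(x, y) = y\,h(x)$ answered from $n$ i.i.d. samples concentrates as
$$\left|\frac{1}{n}\sum_{i=1}^n y_i\,h(x_i) - \mathbb{E}_{x,y}\big[y\,h(x)\big]\right| \lesssim \frac{1}{\sqrt{n}} \quad \text{with high probability},$$
so a tolerance-$\tau$ query can be simulated with roughly $n \approx \tau^{-2}$ samples. Combined with the tolerance bound of Theorem 2, any CSQ learner making polynomially many queries needs $n \gtrsim \tau^{-2} \gtrsim d^{k/2}$ samples, up to logarithmic factors. (This is a back-of-the-envelope translation between tolerance and sample size, not a formal reduction.)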
5 Proof Sketch
In this section we highlight the key ideas of the proof of Theorem 1. The full proof is deferred to Appendix B. The proof sketch is broken into three parts. First, we conduct a general analysis of online SGD to show how the signal-to-noise ratio (SNR) affects the sample complexity. Next, we compute the SNR for the unsmoothed objective ($\lambda = 0$) to heuristically re-derive the $d^{k^\star - 1}$ sample complexity of Ben Arous et al. [1]. Finally, we show how smoothing boosts the SNR and leads to an improved sample complexity of $\tilde{O}(d^{k^\star/2})$ when $\lambda = \Theta(d^{1/4})$.
5.1 Online SGD Analysis
To begin, we will analyze a single step of online SGD. We define $\alpha_t := w_t \cdot w^\star$ so that $\alpha_t$ measures our current progress. Furthermore, let $\ell_t(w) := \ell_\lambda(w; x_t, y_t)$ denote the smoothed loss on the fresh sample $(x_t, y_t)$. Recall that the online SGD update is:
$$w_{t+1} = \frac{w_t - \eta\,\nabla \ell_t(w_t)}{\|w_t - \eta\,\nabla \ell_t(w_t)\|}.$$
Using the fact that $\nabla \ell_t(w_t) \perp w_t$ and $\|w_t\| = 1$, we can Taylor expand the update for $\alpha_{t+1}$:
$$\alpha_{t+1} \approx \alpha_t - \eta\,\nabla \ell_t(w_t) \cdot w^\star - \frac{\eta^2}{2}\,\alpha_t\,\|\nabla \ell_t(w_t)\|^2 + \cdots.$$
As in Ben Arous et al. [1], we decompose this update into a drift term and a martingale term. Let $\mathcal{F}_t$ be the natural filtration generated by the samples up to time $t$. We focus on the drift term, as the martingale term can be handled with standard concentration arguments. Taking expectations with respect to the fresh batch $(x_t, y_t)$ gives:
$$\mathbb{E}[\alpha_{t+1} \mid \mathcal{F}_t] \approx \alpha_t - \eta\,\nabla L_\lambda(w_t) \cdot w^\star - \frac{\eta^2}{2}\,\alpha_t\,\mathbb{E}\big[\|\nabla \ell_t(w_t)\|^2 \,\big|\, \mathcal{F}_t\big],$$
so to guarantee a positive drift, we need to set $\eta \lesssim \frac{-\nabla L_\lambda(w_t) \cdot w^\star}{\alpha_t\, \mathbb{E}[\|\nabla \ell_t(w_t)\|^2]}$, which gives us the value of $\eta$ used in Theorem 1 for $t \le T_1$. However, to simplify the proof sketch we can assume knowledge of $\alpha_t$ and $\lambda$ and optimize over $\eta$ to get a maximum drift of
$$\frac{\big(\nabla L_\lambda(w_t) \cdot w^\star\big)^2}{2\,\alpha_t\,\mathbb{E}\big[\|\nabla \ell_t(w_t)\|^2\big]}.$$
The numerator measures the correlation of the population gradient with $w^\star$, while the denominator measures the norm of the noisy gradient. Their ratio thus has a natural interpretation as the signal-to-noise ratio (SNR). Note that the SNR is a local property, i.e. the SNR can vary for different $w$. When the SNR can be written as a function of $\alpha$, it directly dictates the rate of optimization through the ODE approximation $\frac{d\alpha}{dt} = \mathrm{SNR}(\alpha)$. As online SGD uses each sample exactly once, the sample complexity for online SGD can be approximated by the time it takes this ODE to reach $\alpha \approx 1$ from $\alpha_0 \approx d^{-1/2}$. The remainder of the proof sketch will therefore focus on analyzing the SNR of the minibatch gradient $\nabla \ell_t(w_t)$.
5.2 Computing the Rate with Zero Smoothing
When $\lambda = 0$, the signal and noise terms can easily be calculated. The key property we need is:
Property 1 (Orthogonality Property).
Let $w, w' \in S^{d-1}$ and let $x \sim N(0, I_d)$. Then:
$$\mathbb{E}_x\big[h_j(w \cdot x)\,h_k(w' \cdot x)\big] = \delta_{jk}\,k!\,(w \cdot w')^k.$$
Using Property 1 and the Hermite expansion of $\sigma$ (Definition 3), we can directly compute the population loss and gradient. Letting $P_w^\perp := I - ww^\top$ denote the projection onto the subspace orthogonal to $w$, we have:
$$L(w) = 1 - \sum_{k \ge k^\star} \frac{c_k^2}{k!}\,\alpha^k \qquad \text{and} \qquad -\nabla L(w) = \sum_{k \ge k^\star} \frac{c_k^2}{(k-1)!}\,\alpha^{k-1}\,P_w^\perp w^\star.$$
As $\alpha \ll 1$ throughout most of the trajectory, the gradient is dominated by the first nonzero Hermite coefficient, so up to constants $-\nabla L(w) \approx \alpha^{k^\star - 1}\,P_w^\perp w^\star$. Similarly, a standard concentration argument shows that because $\nabla \ell(w; x, y)$ is a random vector in $d$ dimensions where each coordinate is $O(1)$, $\mathbb{E}\|\nabla \ell(w; x, y)\|^2 \asymp d$. Therefore the SNR is equal to $\frac{\alpha^{2(k^\star - 1)}}{d\,\alpha}$, so with an optimal learning rate schedule,
$$\mathbb{E}[\alpha_{t+1} \mid \mathcal{F}_t] \approx \alpha_t + c\,\frac{\alpha_t^{2k^\star - 3}}{d}.$$
This can be approximated by the ODE $\frac{d\alpha}{dt} = c\,\frac{\alpha^{2k^\star - 3}}{d}$. Solving this ODE with the initial condition $\alpha_0 \asymp d^{-1/2}$ gives that the escape time is proportional to $d\,\alpha_0^{-(2k^\star - 4)} = d^{k^\star - 1}$, which heuristically re-derives the result of Ben Arous et al. [1].
5.3 How Smoothing Boosts the SNR
Smoothing improves the sample complexity of online SGD by boosting the SNR of the stochastic gradient $\nabla \ell_\lambda(w; x, y)$. Recall that the population loss was approximately equal to $L(w) \approx 1 - \frac{c_{k^\star}^2}{k^\star!}\,\alpha^{k^\star}$, where $c_{k^\star}$ is the first nonzero Hermite coefficient of $\sigma$. Isolating the dominant term $\alpha^{k^\star}$ and applying the smoothing operator $\mathcal{L}_\lambda$, we get:
$$(\mathcal{L}_\lambda\,\alpha^{k^\star})(w) = \mathbb{E}_{z \sim \mu_w}\left[\left(\frac{(w + \lambda z) \cdot w^\star}{\|w + \lambda z\|}\right)^{k^\star}\right].$$
Because $z \perp w$ and $\|z\| = \|w\| = 1$, we have that $\|w + \lambda z\| = \sqrt{1 + \lambda^2}$. Therefore,
$$(\mathcal{L}_\lambda\,\alpha^{k^\star})(w) = (1 + \lambda^2)^{-k^\star/2}\sum_{j=0}^{k^\star}\binom{k^\star}{j}\,\alpha^{k^\star - j}\,\lambda^j\,\mathbb{E}_{z \sim \mu_w}\big[(z \cdot w^\star)^j\big].$$
Now because $\mu_w$ is symmetric, the terms where $j$ is odd disappear. Furthermore, for a random unit vector $z \perp w$, $\mathbb{E}[(z \cdot w^\star)^{2j}] \approx d^{-j}$. Therefore, reindexing and ignoring all constants, we have that
$$(\mathcal{L}_\lambda\,\alpha^{k^\star})(w) \approx (1 + \lambda^2)^{-k^\star/2}\sum_{0 \le j \le k^\star/2}\alpha^{k^\star - 2j}\left(\frac{\lambda^2}{d}\right)^{j}.$$
Differentiating gives that
$$\frac{d}{d\alpha}\,(\mathcal{L}_\lambda\,\alpha^{k^\star}) \approx (1 + \lambda^2)^{-k^\star/2}\sum_{0 \le j < k^\star/2}\alpha^{k^\star - 2j - 1}\left(\frac{\lambda^2}{d}\right)^{j}.$$
As this is a geometric series, it is either dominated by the first or the last term, depending on whether $\alpha \gtrsim \lambda/\sqrt{d}$ or $\alpha \lesssim \lambda/\sqrt{d}$. Furthermore, the last term is either $\big(\frac{\lambda^2}{d}\big)^{\frac{k^\star - 1}{2}}$ if $k^\star$ is odd or $\alpha\,\big(\frac{\lambda^2}{d}\big)^{\frac{k^\star - 2}{2}}$ if $k^\star$ is even. Therefore the signal term is:
$$\big|\nabla L_\lambda(w) \cdot w^\star\big| \asymp (1 + \lambda^2)^{-k^\star/2}\left[\alpha^{k^\star - 1} \vee \left(\frac{\lambda^2}{d}\right)^{\frac{k^\star - 1}{2}}\right] \quad (k^\star \text{ odd}),$$
with the second term replaced by $\alpha\,(\lambda^2/d)^{\frac{k^\star - 2}{2}}$ when $k^\star$ is even.
In addition, we can show that when $\lambda \gtrsim 1$, the noise term satisfies $\mathbb{E}\|\nabla \ell_\lambda(w; x, y)\|^2 \lesssim d\,(1 + \lambda^2)^{-k^\star}$. Note that in the high signal regime ($\alpha \gtrsim \lambda/\sqrt{d}$), both the signal and the noise are smaller by matching powers of $(1 + \lambda^2)$, which cancel when computing the SNR. However, when $\alpha \lesssim \lambda/\sqrt{d}$, the smoothing shrinks the noise faster than it shrinks the signal, resulting in an overall larger SNR. Explicitly, for $k^\star$ odd,
$$\mathrm{SNR}(\alpha) \asymp \begin{cases} \dfrac{\alpha^{2k^\star - 3}}{d} & \alpha \gtrsim \lambda/\sqrt{d}, \\[2mm] \dfrac{(1 + \lambda^2)^{k^\star - 1}}{d^{k^\star}\,\alpha} & \alpha \lesssim \lambda/\sqrt{d}. \end{cases}$$
For $\lambda \lesssim 1$, smoothing does not affect the SNR. However, when $\lambda \gg 1$, smoothing greatly increases the SNR in the low-correlation regime (see Figure 1).
Solving the ODE $\frac{d\alpha}{dt} = \mathrm{SNR}(\alpha)$ gives that it takes $T_1 \approx d^{k^\star - 1}(1 + \lambda^2)^{-(k^\star - 2)}$ steps to converge to $\alpha = \Theta(1)$ from $\alpha_0 \asymp d^{-1/2}$. Once $\alpha = \Theta(1)$, the problem is locally strongly convex, so we can decay the learning rate and use the classical analysis of SGD on strongly convex functions to show that with an additional $\tilde{O}(d/\epsilon)$ steps, $L(w_T) \le \epsilon$, from which Theorem 1 follows.
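As a numerical sanity check on this heuristic, the sketch below integrates the ODE using the two-regime SNR above (odd $k^\star$, all constants set to one). The dimensions, the choice $k^\star = 5$, and the adaptive step rule are illustrative assumptions; this checks the predicted scaling only, not the formal proof.

```python
import numpy as np

def escape_time(d, k, lam=0.0, growth=1e-3):
    """Integrate d(alpha)/dt = SNR(alpha) from alpha = d^{-1/2} to alpha = 1/2,
    using the two-regime heuristic SNR (odd k, all constants set to one):
      alpha^{2k-3} / d                   for alpha >= lam / sqrt(d)
      (1 + lam^2)^{k-1} / (d^k * alpha)  for alpha <  lam / sqrt(d)
    """
    alpha, t = 1.0 / np.sqrt(d), 0.0
    while alpha < 0.5:
        if alpha >= lam / np.sqrt(d):
            snr = alpha ** (2 * k - 3) / d
        else:
            snr = (1 + lam**2) ** (k - 1) / (d**k * alpha)
        dt = growth * alpha / snr   # step chosen so alpha grows by 0.1% per step
        alpha += snr * dt
        t += dt
    return t

k = 5
for d in [200, 400, 800]:
    print(d, escape_time(d, k),                 # no smoothing: scales like d^{k-1}
             escape_time(d, k, lam=d ** 0.25))  # lam = d^{1/4}: scales like d^{k/2}
```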
6 Experiments
We ran a minibatch variant of Algorithm 1 with batch size $B$ for a range of dimensions $d$ and information exponents $k^\star$, taking $\sigma = h_{k^\star}/\sqrt{k^\star!}$, the normalized $k^\star$th Hermite polynomial, and setting the learning rate and smoothing parameter as in Theorem 1.
We computed the number of samples $n$ required for Algorithm 1 to reach a fixed correlation threshold from a random initialization, and we report the min, mean, and max over random seeds. For each $k^\star$ we fit a power law of the form $n = c\,d^{a}$ in order to measure how the sample complexity scales with $d$. For all values of $k^\star$, we find that $a \approx k^\star/2$, which matches Theorem 1. The results can be found in Figure 2 and additional experimental details can be found in Appendix E.
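The power-law fit is ordinary least squares in log–log space. A minimal sketch, assuming the measured sample complexities are available as arrays (the numbers below are placeholders, not experimental results):

```python
import numpy as np

# Placeholder data: dimensions swept in the experiment and the measured
# mean sample complexity n(d) for one value of k*; replace with real logs.
ds = np.array([256, 512, 1024, 2048])
ns = np.array([2.1e5, 8.3e5, 3.4e6, 1.3e7])

# Fit n = c * d^a  <=>  log n = a log d + log c.
a, log_c = np.polyfit(np.log(ds), np.log(ns), 1)
print(f"fitted exponent a = {a:.2f}")   # Theorem 1 predicts a ~ k*/2
```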

7 Discussion
7.1 Tensor PCA
We next outline connections to the Tensor PCA problem. Introduced in [8], the goal of Tensor PCA is to recover the hidden direction $w^\star \in S^{d-1}$ from the noisy $k$-tensor given by
$$T_n := (w^\star)^{\otimes k} + \frac{1}{\sqrt{n}}\,Z$$
(this normalization is equivalent to the original normalization $T = \beta\,(w^\star)^{\otimes k} + Z$ by setting $\beta = \sqrt{n}$), where $Z$ is a Gaussian noise tensor with each entry drawn i.i.d. from $N(0, 1)$.
The Tensor PCA problem has garnered significant interest as it exhibits a statistical-computational gap. $w^\star$ is information theoretically recoverable when $n \gtrsim d$. However, the best polynomial-time algorithms require $n \gtrsim d^{k/2}$; this lower bound has been shown to be tight for various notions of hardness such as CSQ or SoS lower bounds [29, 24, 30, 31, 32, 33, 34]. Tensor PCA also exhibits a gap between spectral methods and iterative algorithms. Algorithms that work in the regime $n \asymp d^{k/2}$ rely on unfolding or contracting the tensor $T_n$, or on semidefinite programming relaxations [29, 24]. On the other hand, iterative algorithms including gradient descent, power method, and AMP require a much larger sample complexity of $n \gtrsim d^{k-1}$ [35]. The suboptimality of iterative algorithms is believed to be due to bad properties of the landscape of the Tensor PCA objective in the region around the initialization. Specifically, [36, 37] argue that there are exponentially many local minima near the equator when $n \ll d^{k-1}$. To overcome this, prior works have considered "smoothed" versions of gradient descent, and show that smoothing recovers the computationally optimal sample complexity $n \asymp d^{k/2}$ in the case $k = 3$ [25] and heuristically for larger $k$ [26].
7.1.1 The Partial Trace Algorithm
The smoothing algorithms above are inspired by the following partial trace algorithm for Tensor PCA [24], which can be viewed as Algorithm 1 in the limit $\lambda \to \infty$ [25]. Let $T_n = (w^\star)^{\otimes k} + \frac{1}{\sqrt{n}} Z$. Then we will consider iteratively contracting pairs of indices of $T_n$ with the identity until all that remains is a vector (if $k$ is odd) or a matrix (if $k$ is even). Explicitly, we define the partial trace tensor by
$$M := T_n\big(\underbrace{I_d, \ldots, I_d}_{\lfloor (k-1)/2 \rfloor \text{ times}}\big) \in \begin{cases} \mathbb{R}^d & k \text{ odd,} \\ \mathbb{R}^{d \times d} & k \text{ even.} \end{cases}$$
When $k$ is odd, we can directly return $M/\|M\|$ as our estimate for $w^\star$, and when $k$ is even we return the top eigenvector of $M$. A standard concentration argument shows that this succeeds when $n \gtrsim d^{\lceil k/2 \rceil}$. Furthermore, this can be strengthened to $n \gtrsim d^{k/2}$ by using the partial trace vector as a warm start for gradient descent or tensor power method when $k$ is odd [25, 26].
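A minimal sketch of the partial trace estimator under the normalization $T_n = (w^\star)^{\otimes k} + Z/\sqrt{n}$ above, shown for $k = 3$; the dimension, sample size, and seed are arbitrary illustrative choices. `np.einsum('ii...->...')` performs one index contraction against the identity:

```python
import numpy as np

def partial_trace_estimate(T):
    """Contract pairs of indices of a k-tensor with the identity until a
    vector (k odd) or matrix (k even) remains, then extract the estimate."""
    while T.ndim > 2:
        T = np.einsum('ii...->...', T)          # contract the first two indices
    if T.ndim == 1:                             # k odd: normalize the vector
        return T / np.linalg.norm(T)
    eigvals, eigvecs = np.linalg.eigh(T)        # k even: top eigenvector
    return eigvecs[:, np.argmax(eigvals)]

# Example: k = 3 spiked tensor in d dimensions.
d, n = 50, 20 * 50**2                            # n ~ d^2 = d^{ceil(k/2)}
rng = np.random.default_rng(0)
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)
T = np.einsum('i,j,k->ijk', w_star, w_star, w_star)
T += rng.standard_normal((d, d, d)) / np.sqrt(n)
w_hat = partial_trace_estimate(T)
print(abs(w_hat @ w_star))                       # close to 1 when n >> d^2
```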
7.1.2 The Connection Between Single Index Models and Tensor PCA
For both tensor PCA and learning single index models, gradient descent succeeds when the sample complexity is $n \gtrsim d^{k-1}$ (respectively $d^{k^\star - 1}$) [35, 1]. On the other hand, the smoothing algorithms for Tensor PCA [26, 25] succeed with the computationally optimal sample complexity of $n \asymp d^{k/2}$. Our Theorem 1 shows that this smoothing analysis can indeed be transferred to the single-index model setting.
In fact, one can make a direct connection between learning single-index models with Gaussian covariates and Tensor PCA. Consider learning a single-index model when $\sigma = h_k/\sqrt{k!}$, the normalized $k$th Hermite polynomial. Then minimizing the correlation loss is equivalent to maximizing the loss function:
$$\frac{1}{n}\sum_{i=1}^n y_i\,f_w(x_i) = \big\langle \hat{T},\; w^{\otimes k}\big\rangle \qquad \text{where} \qquad \hat{T} := \frac{1}{n\sqrt{k!}}\sum_{i=1}^n y_i\,\mathrm{He}_k(x_i).$$
Here $\mathrm{He}_k(x)$ denotes the $k$th Hermite tensor (see Section A.2 for background on Hermite polynomials and Hermite tensors). In addition, by the orthogonality of the Hermite tensors, $\mathbb{E}[\hat{T}] = (w^\star)^{\otimes k}$, so we can decompose $\hat{T} = (w^\star)^{\otimes k} + Z$ where, by standard concentration, each entry of $Z$ is of order $n^{-1/2}$. We can therefore directly apply algorithms for Tensor PCA to this problem. We remark that this connection is a heuristic, as the structures of the noise in Tensor PCA and in our single index model setting are different.
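To make the reduction concrete for $k = 3$: $\mathrm{He}_3(x)_{abc} = x_a x_b x_c - (x_a\delta_{bc} + x_b\delta_{ac} + x_c\delta_{ab})$, so the empirical tensor $\hat{T}$ above can be formed directly and handed to any Tensor PCA routine, such as the partial trace contraction. A sketch under the same normalization (sizes are illustrative):

```python
import numpy as np

def he3(x):
    """Third Hermite tensor: He_3(x)_{abc} = x_a x_b x_c - (x_a d_bc + x_b d_ac + x_c d_ab)."""
    I = np.eye(x.shape[0])
    return (np.einsum('a,b,c->abc', x, x, x)
            - np.einsum('a,bc->abc', x, I)
            - np.einsum('b,ac->abc', x, I)
            - np.einsum('c,ab->abc', x, I))

d, n = 30, 40 * 30**2
rng = np.random.default_rng(0)
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)

X = rng.standard_normal((n, d))
s = X @ w_star
y = (s**3 - 3 * s) / np.sqrt(6.0)           # sigma = h_3 / sqrt(3!)

T_hat = np.zeros((d, d, d))
for xi, yi in zip(X, y):
    T_hat += yi * he3(xi)
T_hat /= n * np.sqrt(6.0)                   # E[T_hat] = (w*)^{tensor 3}

v = np.einsum('iij->j', T_hat)              # partial trace down to a vector
print(abs(v @ w_star) / np.linalg.norm(v))  # close to 1 for n >> d^2
```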
7.2 Empirical Risk Minimization on the Smoothed Landscape
Our main sample complexity guarantee, Theorem 1, is based on a tight analysis of online SGD (Algorithm 1) in which each sample is used exactly once. One might expect that if the algorithm were allowed to reuse samples, as is standard practice in deep learning, the algorithm could succeed with fewer samples. In particular, Abbe et al. [4] conjectured that gradient descent on the empirical loss would succeed with $n = \tilde{\Theta}(d^{k^\star/2})$ samples.
Our smoothing algorithm, Algorithm 1, can be directly translated to the ERM setting to learn with $n \asymp d^{k^\star/2}$ samples. Taylor expanding the smoothed empirical loss in the large-$\lambda$ limit shows that the dominant term is the correlation of $w$ with the partial trace of the empirical Hermite tensor $\hat{T}$ defined in Section 7.1.2.
As $\lambda \to \infty$, gradient descent on this smoothed loss will converge to the direction of the partial trace vector, which is equivalent to the partial trace estimator for odd $k$ (see Section 7.1). If $k$ is even, this first term is zero in expectation and gradient descent will converge to the top eigenvector of the partial trace matrix, which corresponds to the partial trace estimator for even $k$. Mirroring the calculation for the partial trace estimator, this succeeds with $n \gtrsim d^{\lceil k/2 \rceil}$ samples. When $k$ is odd, this can be further improved to $n \gtrsim d^{k/2}$ by using this estimator as a warm start from which to run gradient descent with $\lambda = 0$, as in Anandkumar et al. [25], Biroli et al. [26].
7.3 Connection to Minibatch SGD
A recent line of work has studied the implicit regularization effect of stochastic gradient descent [9, 11, 10]. The key idea is that over short timescales, the iterates converge to a quasi-stationary distribution $N(\theta, \lambda S)$, where $S$ depends on the Hessian and the noise covariance at $\theta$ and $\lambda$ is proportional to the ratio of the learning rate and batch size. As a result, over longer periods of time SGD follows the smoothed gradient of the empirical loss:
$$\mathbb{E}_{v \sim N(0, \lambda S)}\big[\nabla \hat{L}(\theta + v)\big].$$
We therefore conjecture that minibatch SGD is also able to achieve the optimal sample complexity without explicit smoothing if the learning rate and batch size are properly tuned.
8 Acknowledgements
AD acknowledges support from an NSF Graduate Research Fellowship. EN acknowledges support from a National Defense Science & Engineering Graduate Fellowship. RG is supported by NSF Award DMS-2031849, CCF-1845171 (CAREER), CCF-1934964 (Tripods) and a Sloan Research Fellowship. AD, EN, and JDL acknowledge support of the ARO under MURI Award W911NF-11-1-0304, the Sloan Research Fellowship, NSF CCF 2002272, NSF IIS 2107304, NSF CIF 2212262, ONR Young Investigator Award, and NSF CAREER Award 2144994.
References
- Ben Arous et al. [2021] Gerard Ben Arous, Reza Gheissari, and Aukosh Jagannath. Online stochastic gradient descent on non-convex losses from high-dimensional inference. The Journal of Machine Learning Research, 22(1):4788–4838, 2021.
- Ge et al. [2016] Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. Advances in neural information processing systems, 29, 2016.
- Ma [2020] Tengyu Ma. Why do local methods solve nonconvex problems?, 2020.
- Abbe et al. [2023] Emmanuel Abbe, Enric Boix-Adserà, and Theodor Misiakiewicz. Sgd learning on neural networks: leap complexity and saddle-to-saddle dynamics. arXiv, 2023. URL https://arxiv.org/abs/2302.11055.
- Bietti et al. [2022] Alberto Bietti, Joan Bruna, Clayton Sanford, and Min Jae Song. Learning single-index models with shallow neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
- Damian et al. [2022] Alexandru Damian, Jason Lee, and Mahdi Soltanolkotabi. Neural networks can learn representations with gradient descent. In Conference on Learning Theory, pages 5413–5452. PMLR, 2022.
- Mei et al. [2018] Song Mei, Yu Bai, and Andrea Montanari. The landscape of empirical risk for nonconvex losses. The Annals of Statistics, 46:2747–2774, 2018.
- Richard and Montanari [2014] Emile Richard and Andrea Montanari. A statistical model for tensor pca. In Advances in Neural Information Processing Systems, pages 2897–2905, 2014.
- Blanc et al. [2020] Guy Blanc, Neha Gupta, Gregory Valiant, and Paul Valiant. Implicit regularization for deep neural networks driven by an ornstein-uhlenbeck like process. In Conference on Learning Theory, pages 483–513, 2020.
- Damian et al. [2021] Alex Damian, Tengyu Ma, and Jason D. Lee. Label noise SGD provably prefers flat global minimizers. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021.
- Li et al. [2022] Zhiyuan Li, Tianhao Wang, and Sanjeev Arora. What happens after SGD reaches zero loss? –a mathematical framework. In International Conference on Learning Representations, 2022.
- Shallue et al. [2018] Christopher J Shallue, Jaehoon Lee, Joseph Antognini, Jascha Sohl-Dickstein, Roy Frostig, and George E Dahl. Measuring the effects of data parallelism on neural network training. arXiv preprint arXiv:1811.03600, 2018.
- Szegedy et al. [2016] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
- Wen et al. [2019] Yeming Wen, Kevin Luk, Maxime Gazeau, Guodong Zhang, Harris Chan, and Jimmy Ba. Interplay between optimization and generalization of stochastic gradient descent with covariance noise. arXiv preprint arXiv:1902.08234, 2019.
- Kakade et al. [2011] Sham M Kakade, Varun Kanade, Ohad Shamir, and Adam Kalai. Efficient learning of generalized linear and single index models with isotonic regression. Advances in Neural Information Processing Systems, 24, 2011.
- Soltanolkotabi [2017] Mahdi Soltanolkotabi. Learning relus via gradient descent. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
- Candès et al. [2015] Emmanuel J. Candès, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985–2007, 2015. doi: 10.1109/TIT.2015.2399924.
- Chen et al. [2019] Yuxin Chen, Yuejie Chi, Jianqing Fan, and Cong Ma. Gradient descent with random initialization: fast global convergence for nonconvex phase retrieval. Mathematical Programming, 176(1):5–37, 2019.
- Sun et al. [2018] Ju Sun, Qing Qu, and John Wright. A geometric analysis of phase retrieval. Foundations of Computational Mathematics, 18(5):1131–1198, 2018.
- Dudeja and Hsu [2018] Rishabh Dudeja and Daniel Hsu. Learning single-index models in gaussian space. In Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet, editors, Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 1887–1930. PMLR, 06–09 Jul 2018. URL https://proceedings.mlr.press/v75/dudeja18a.html.
- Chen and Meka [2020] Sitan Chen and Raghu Meka. Learning polynomials in few relevant dimensions. In Jacob Abernethy and Shivani Agarwal, editors, Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 1161–1227. PMLR, 09–12 Jul 2020. URL https://proceedings.mlr.press/v125/chen20a.html.
- Ba et al. [2022] Jimmy Ba, Murat A. Erdogdu, Taiji Suzuki, Zhichao Wang, Denny Wu, and Greg Yang. High-dimensional asymptotics of feature learning: How one gradient step improves the representation. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
- Abbe et al. [2022] Emmanuel Abbe, Enric Boix Adsera, and Theodor Misiakiewicz. The merged-staircase property: a necessary and nearly sufficient condition for sgd learning of sparse functions on two-layer neural networks. In Conference on Learning Theory, pages 4782–4887. PMLR, 2022.
- Hopkins et al. [2016] Samuel B. Hopkins, Tselil Schramm, Jonathan Shi, and David Steurer. Fast spectral algorithms from sum-of-squares proofs: Tensor decomposition and planted sparse vectors. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, STOC ’16, page 178–191, New York, NY, USA, 2016. Association for Computing Machinery. ISBN 9781450341325. doi: 10.1145/2897518.2897529. URL https://doi.org/10.1145/2897518.2897529.
- Anandkumar et al. [2017] Anima Anandkumar, Yuan Deng, Rong Ge, and Hossein Mobahi. Homotopy analysis for tensor pca. In Satyen Kale and Ohad Shamir, editors, Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 79–104. PMLR, 07–10 Jul 2017. URL https://proceedings.mlr.press/v65/anandkumar17a.html.
- Biroli et al. [2020] Giulio Biroli, Chiara Cammarota, and Federico Ricci-Tersenghi. How to iron out rough landscapes and get optimal performances: averaged gradient descent and its application to tensor pca. Journal of Physics A: Mathematical and Theoretical, 53(17):174003, apr 2020. doi: 10.1088/1751-8121/ab7b1f. URL https://dx.doi.org/10.1088/1751-8121/ab7b1f.
- Goel et al. [2020] Surbhi Goel, Aravind Gollakota, Zhihan Jin, Sushrut Karmalkar, and Adam Klivans. Superpolynomial lower bounds for learning one-layer neural networks using gradient descent. arXiv preprint arXiv:2006.12011, 2020.
- Diakonikolas et al. [2020] Ilias Diakonikolas, Daniel M Kane, Vasilis Kontonis, and Nikos Zarifis. Algorithms and sq lower bounds for pac learning one-hidden-layer relu networks. In Conference on Learning Theory, pages 1514–1539, 2020.
- Hopkins et al. [2015] Samuel B. Hopkins, Jonathan Shi, and David Steurer. Tensor principal component analysis via sum-of-square proofs. In Peter Grünwald, Elad Hazan, and Satyen Kale, editors, Proceedings of The 28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pages 956–1006, Paris, France, 03–06 Jul 2015. PMLR. URL https://proceedings.mlr.press/v40/Hopkins15.html.
- Kunisky et al. [2019] Dmitriy Kunisky, Alexander S. Wein, and Afonso S. Bandeira. Notes on computational hardness of hypothesis testing: Predictions using the low-degree likelihood ratio, 2019.
- Bandeira et al. [2022] Afonso S Bandeira, Ahmed El Alaoui, Samuel Hopkins, Tselil Schramm, Alexander S Wein, and Ilias Zadik. The franz-parisi criterion and computational trade-offs in high dimensional statistics. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 33831–33844. Curran Associates, Inc., 2022.
- Brennan et al. [2021] Matthew S Brennan, Guy Bresler, Sam Hopkins, Jerry Li, and Tselil Schramm. Statistical query algorithms and low degree tests are almost equivalent. In Mikhail Belkin and Samory Kpotufe, editors, Proceedings of Thirty Fourth Conference on Learning Theory, volume 134 of Proceedings of Machine Learning Research, pages 774–774. PMLR, 15–19 Aug 2021. URL https://proceedings.mlr.press/v134/brennan21a.html.
- Dudeja and Hsu [2021] Rishabh Dudeja and Daniel Hsu. Statistical query lower bounds for tensor pca. Journal of Machine Learning Research, 22(83):1–51, 2021. URL http://jmlr.org/papers/v22/20-837.html.
- Dudeja and Hsu [2022] Rishabh Dudeja and Daniel Hsu. Statistical-computational trade-offs in tensor pca and related problems via communication complexity, 2022.
- Ben Arous et al. [2020] Gérard Ben Arous, Reza Gheissari, and Aukosh Jagannath. Algorithmic thresholds for tensor PCA. The Annals of Probability, 48(4):2052–2087, 2020. doi: 10.1214/19-AOP1415. URL https://doi.org/10.1214/19-AOP1415.
- Ros et al. [2019] Valentina Ros, Gerard Ben Arous, Giulio Biroli, and Chiara Cammarota. Complex energy landscapes in spiked-tensor and simple glassy models: Ruggedness, arrangements of local minima, and phase transitions. Phys. Rev. X, 9:011003, Jan 2019. doi: 10.1103/PhysRevX.9.011003. URL https://link.aps.org/doi/10.1103/PhysRevX.9.011003.
- Ben Arous et al. [2019] Gérard Ben Arous, Song Mei, Andrea Montanari, and Mihai Nica. The landscape of the spiked tensor model. Communications on Pure and Applied Mathematics, 72(11):2282–2330, 2019. doi: https://doi.org/10.1002/cpa.21861. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/cpa.21861.
- Szörényi [2009] Balázs Szörényi. Characterizing statistical query learning: Simplified notions and proofs. In ALT, 2009.
- Pinelis [1994] Iosif Pinelis. Optimum Bounds for the Distributions of Martingales in Banach Spaces. The Annals of Probability, 22(4):1679 – 1706, 1994. doi: 10.1214/aop/1176988477. URL https://doi.org/10.1214/aop/1176988477.
- Bradbury et al. [2018] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.
- Biewald [2020] Lukas Biewald. Experiment tracking with weights and biases, 2020. URL https://www.wandb.com/. Software available from wandb.com.
Appendix A Background and Notation
A.1 Tensor Notation
Throughout this section let $T \in (\mathbb{R}^d)^{\otimes k}$ be a $k$-tensor.
Definition 5 (Tensor Action).
For a tensor $A \in (\mathbb{R}^d)^{\otimes j}$ with $j \le k$, we define the action of $T$ on $A$ by
$$T(A)_{i_{j+1}, \ldots, i_k} := \sum_{i_1, \ldots, i_j} T_{i_1, \ldots, i_k}\, A_{i_1, \ldots, i_j}.$$
We will also use $\langle T, A \rangle$ to denote $T(A)$ when $T, A$ are both $k$-tensors. Note that this corresponds to the standard dot product after flattening $T$ and $A$.
Definition 6 (Permutation/Transposition).
Given a $k$-tensor $T$ and a permutation $\pi \in S_k$, we use $T^\pi$ to denote the result of permuting the axes of $T$ by the permutation $\pi$, i.e.
$$(T^\pi)_{i_1, \ldots, i_k} := T_{i_{\pi(1)}, \ldots, i_{\pi(k)}}.$$
Definition 7 (Symmetrization).
We define $\mathrm{Sym}_k := \frac{1}{k!}\sum_{\pi \in S_k} \pi$,
where $S_k$ is the symmetric group on $k$ elements. Note that $\mathrm{Sym}_k$ acts on tensors by
$$\mathrm{Sym}_k(T) = \frac{1}{k!}\sum_{\pi \in S_k} T^\pi,$$
i.e. $\mathrm{Sym}_k(T)$ is the symmetrized version of $T$.
We will also overload notation and use $\mathrm{Sym}$ to denote the symmetrization operator, i.e. if $T$ is a $k$-tensor, $\mathrm{Sym}(T) := \mathrm{Sym}_k(T)$.
Definition 8 (Symmetric Tensor Product).
For a $j$-tensor $A$ and a $k$-tensor $B$ we define the symmetric tensor product of $A$ and $B$ by
$$A \,\widetilde{\otimes}\, B := \mathrm{Sym}(A \otimes B).$$
Lemma 1.
For any tensor $T$, $\|\mathrm{Sym}(T)\|_F \le \|T\|_F$.
Proof.
$$\|\mathrm{Sym}(T)\|_F = \Big\|\frac{1}{k!}\sum_{\pi \in S_k} T^\pi\Big\|_F \le \frac{1}{k!}\sum_{\pi \in S_k}\|T^\pi\|_F = \|T\|_F,$$
because permuting the indices of $T$ does not change the Frobenius norm. ∎
We will use the following two lemmas for tensor moments of the Gaussian distribution and the uniform distribution over the sphere:
Definition 9.
For an integer $k \ge 1$, define the quantity $d_k$ as
$$d_k := d(d+2)\cdots(d + 2(k-1)).$$
Note that $d_k \asymp d^k$.
Lemma 2 (Tensorized Moments).
$$\mathbb{E}_{x \sim N(0, I_d)}\big[x^{\otimes 2k}\big] = (2k-1)!!\,\mathrm{Sym}\big(I_d^{\otimes k}\big) \qquad \text{and} \qquad \mathbb{E}_{z \sim \mathrm{Unif}(S^{d-1})}\big[z^{\otimes 2k}\big] = \frac{(2k-1)!!}{d_k}\,\mathrm{Sym}\big(I_d^{\otimes k}\big).$$
Proof.
For the Gaussian moment, see [6]. The spherical moment follows from the decomposition $x = rz$ where $r := \|x\|$ is independent of $z := x/\|x\|$ and $\mathbb{E}[r^{2k}] = d_k$. ∎
Lemma 3.
Proof.
A.2 Hermite Polynomials and Hermite Tensors
We provide a brief review of the properties of Hermite polynomials and Hermite tensors.
Definition 10.
We define the $k$th Hermite tensor $\mathrm{He}_k : \mathbb{R}^d \to (\mathbb{R}^d)^{\otimes k}$ by
$$\mathrm{He}_k(x) := (-1)^k\,\frac{\nabla^k \mu_d(x)}{\mu_d(x)},$$
where $\mu_d(x) = (2\pi)^{-d/2}\,e^{-\|x\|^2/2}$ is the PDF of a standard Gaussian in $d$ dimensions. Note that when $d = 1$, this definition reduces to the standard univariate Hermite polynomials.
We begin with the classical properties of the scalar Hermite polynomials:
Lemma 4 (Properties of Hermite Polynomials).
For all $j, k \ge 0$:
- Orthogonality: $\mathbb{E}_{x \sim N(0,1)}[h_j(x)\,h_k(x)] = k!\,\delta_{jk}$.
- Derivatives: $h_k'(x) = k\,h_{k-1}(x)$.
- Correlations: If $(x, y)$ are correlated standard Gaussians with correlation $\rho$, then $\mathbb{E}[h_j(x)\,h_k(y)] = k!\,\rho^k\,\delta_{jk}$.
- Hermite Expansion: If $f \in L^2(\mu)$ where $\mu$ is the PDF of a standard Gaussian, then
$$f(x) = \sum_{k \ge 0} \frac{\hat{f}_k}{k!}\,h_k(x) \qquad \text{where} \qquad \hat{f}_k := \mathbb{E}_{x \sim N(0,1)}\big[f(x)\,h_k(x)\big].$$
These properties also have tensor analogues:
Lemma 5 (Hermite Polynomials in Higher Dimensions).
- Relationship to Univariate Hermite Polynomials: If $w \in S^{d-1}$, then $\langle \mathrm{He}_k(x),\, w^{\otimes k}\rangle = h_k(w \cdot x)$.
- Orthogonality: For any $j$-tensor $A$ and $k$-tensor $B$:
$$\mathbb{E}_{x \sim N(0, I_d)}\big[\langle \mathrm{He}_j(x), A\rangle\,\langle \mathrm{He}_k(x), B\rangle\big] = \delta_{jk}\,k!\,\big\langle \mathrm{Sym}(A),\, \mathrm{Sym}(B)\big\rangle.$$
- Hermite Expansions: If $f \in L^2(N(0, I_d))$, then
$$f(x) = \sum_{k \ge 0}\frac{1}{k!}\,\big\langle C_k,\, \mathrm{He}_k(x)\big\rangle \qquad \text{where} \qquad C_k := \mathbb{E}_{x \sim N(0, I_d)}\big[f(x)\,\mathrm{He}_k(x)\big].$$
Appendix B Proof of Theorem 1
The proof of Theorem 1 is divided into four parts. First, Section B.1 introduces some notation that will be used throughout the proof. Next, Section B.2 computes matching upper and lower bounds for the gradient of the smoothed population loss. Similarly, Section B.3 concentrates the empirical gradient of the smoothed loss. Finally, Section B.4 combines the bounds in Section B.2 and Section B.3 with a standard online SGD analysis to arrive at the final rate.
B.1 Additional Notation
Throughout the proof we will assume that $w \in S^{d-1}$, so that $\nabla$ denotes the spherical gradient with respect to $w$. In particular, $\nabla g(w) \perp w$ for any differentiable $g$. We will also use $\partial$ to denote the standard Euclidean gradient so that we can write expressions such as $\nabla g(w) = (I - ww^\top)\,\partial g(w)$.
We will use the following assumption on without reference throughout the proof:
Assumption 2.
for a sufficiently large constant .
We note that this is satisfied for the optimal choice of .
We will use $\tilde{O}(\cdot)$ to hide polylogarithmic dependencies. Explicitly, $f = \tilde{O}(g)$ if there exist constants $C, c$ such that $f \le C\,g\,\log^c(d)$. We will also use the following shorthand for denoting high probability events:
Definition 11.
We say an event $E$ happens with high probability if for every $q > 0$ there exists a constant $C_q$ such that for all $d$, $\mathbb{P}[E] \ge 1 - C_q\,d^{-q}$.
Note that high probability events are closed under polynomially sized union bounds. As an example, if then with high probability because
for sufficiently large . In general, Lemma 24 shows that if is mean zero and has polynomial tails, i.e. there exists such that , then with high probability.
B.2 Computing the Smoothed Population Gradient
Recall that
In addition, because we assumed that $\mathbb{E}_{x \sim N(0,1)}[\sigma(x)^2] = 1$, we have Parseval's identity: $\sum_{k \ge 0} \frac{c_k^2}{k!} = 1$.
This Hermite decomposition immediately implies a closed form for the population loss:
Lemma 6 (Population Loss).
Let $\alpha = w \cdot w^\star$. Then,
$$L_\lambda(w) = 1 - \sum_{k \ge k^\star} \frac{c_k^2}{k!}\,\big(\mathcal{L}_\lambda\,\alpha^k\big)(w).$$
Lemma 6 implies that to understand the smoothed population loss $L_\lambda$, it suffices to understand $\mathcal{L}_\lambda(\alpha^k)$ for $k \ge k^\star$. First, we will show that the set of single index models is closed under the smoothing operator $\mathcal{L}_\lambda$:
Lemma 7.
Let and let . Then
where
Proof.
Expanding the definition of gives:
We now claim that when , where which would complete the proof. To see this, note that we can decompose into . Under this decomposition we have the polyspherical decomposition where . Then
∎
Of central interest are the quantities $\mathcal{L}_\lambda(\alpha^k)$, as these terms show up when smoothing the population loss (see Lemma 6). We begin by defining the quantity which will provide matching upper and lower bounds on $\mathcal{L}_\lambda(\alpha^k)$ when $\alpha \ge 0$:
Definition 12.
We define by
Lemma 8.
For all and , there exist constants such that
Proof.
Using Lemma 7 we have that
Now note that when is odd, so we can re-index this sum to get
Note that every term in this sum is non-negative. Now we can ignore constants depending on and use that to get
Now when , this is a decreasing geometric series which is dominated by the first term so . Next, when we have by 2 that so is bounded away from . Therefore the geometric series is dominated by the last term which is
which completes the proof. ∎
Next, in order to understand the population gradient, we need to understand how the smoothing operator commutes with differentiation. We note that these do not directly commute because the smoothing distribution depends on so this term must be differentiated as well. However, smoothing and differentiation almost commute, which is described in the following lemma:
Lemma 9.
Define the dimension-dependent univariate smoothing operator by:
Then,
Proof.
Now we are ready to analyze the population gradient:
Lemma 10.
where for ,
Proof.
Recall that
Because is the index of the first nonzero Hermite coefficient, we can start this sum at . Smoothing and differentiating gives:
We will break this into the term and the tail. First when we can use Lemma 9 and Lemma 8 to get:
The first term is equal up to constants to while the second term is equal up to constants to . However, we have that
Therefore the term in is equal up to constants to .
Next, we handle the tail. By Lemma 9 this is equal to
Now recall that from Lemma 8, is always non-negative so we can use to bound this tail in absolute value by
Now by Corollary 3, this is bounded for by
For the first term, we have
The second term is trivially bounded by
which completes the proof. ∎
B.3 Concentrating the Empirical Gradient
We cannot directly apply Lemma 7 to as . Instead, we will use the properties of the Hermite tensors to directly smooth .
Lemma 11.
where
Proof.
Lemma 12.
For any with ,
Proof.
Recall that by Lemma 11 we have
where
Differentiating this with respect to gives
Now note that by Lemma 5:
Therefore it suffices to compute the Frobenius norm of . We first explicitly differentiate :
Taking Frobenius norms gives
Now we can use Lemma 1 to pull out and get:
Now note that the terms in each sum are orthogonal as at least one will need to be contracted with a . Therefore this is equivalent to:
Next, note that for any tensor , . When , the only permutations that don’t give are the ones which pair up all of the s of which there are . Therefore, by Lemma 3,
Plugging this in gives:
Now note that
which completes the proof. ∎
Corollary 1.
For any with ,
Proof.
The following lemma shows that inherits polynomial tails from :
Lemma 13.
There exists an absolute constant such that for any with and any ,
Proof.
Finally, we can use Corollary 1 and Lemma 13 to bound the norms of the gradient:
Lemma 14.
Let be a fresh sample and let . Then there exists a constant such that for any with , any and all ,
Proof.
Corollary 2.
Let be as in Lemma 14. Then for all ,
Proof.
B.4 Analyzing the Dynamics
Throughout this section we will assume $k^\star \ge 3$. The proof of the dynamics is split into three stages.
In the first stage, we analyze the regime $\alpha_t \lesssim \lambda/\sqrt{d}$. In this regime, the signal is dominated by the smoothing.
In the second stage, we analyze the regime $\alpha_t \gtrsim \lambda/\sqrt{d}$. This analysis is similar to the analysis in Ben Arous et al. [1] and could be equivalently carried out with $\lambda = 0$.
Finally, in the third stage, we decay the learning rate linearly to achieve the optimal rate.
All three stages will use the following progress lemma:
Lemma 15.
Let and let . Let be a fresh batch and define
Then if ,
where and for all ,
Furthermore, if the can be replaced with .
Proof.
Because and ,
where . Note that by Lemma 14, has moments bounded by . Therefore by Lemma 23 with and ,
Plugging in the bound on from Corollary 2 gives
In addition, by Lemma 14,
Similarly, by Lemma 23 with and , Lemma 14, and Corollary 2,
∎
We can now analyze the first stage in which . This stage is dominated by the signal from the smoothing.
Lemma 16 (Stage 1).
Assume that and . Set
for a sufficiently large constant . Then with high probability, there exists such that .
Proof.
Let be the hitting time for . For , let be the event that
We will prove by induction that for any , the event: happens with high probability. The base case of is trivial so let and assume the result for all . Note that so by Lemma 15 and the fact that ,
Now note that so let us condition on the event . Then by the induction hypothesis, with high probability we have for all . Plugging in the value of gives:
Similarly, because is a martingale we have by Lemma 22 and Lemma 24 that with high probability,
where we used that . Therefore conditioned on we have with high probability that for all :
Now we split into two cases depending on the parity of . First, if is odd we have that with high probability, for all :
Now let . Then we have that with high probability,
which implies that with high probability. Next, if is even we have that with high probability
As above, by Lemma 27 the first event implies that so we must have with high probability. ∎
Next, we consider what happens when . The analysis in this stage is similar to the online SGD analysis in [1].
Lemma 17 (Stage 2).
Assume that . Set as in Lemma 16. Then with high probability, .
Proof.
The proof is almost identical to Lemma 16. We again have from Lemma 15
First, from martingale concentration we have that
where we used that . Therefore with high probability,
Therefore while , for sufficiently large we have
Therefore by Lemma 27, we have that there exists such that . Next, let . Then applying Lemma 15 to and using gives that if
then
Therefore,
With high probability, the martingale term is bounded by as before as long as , so for we have that . Setting and choosing appropriately yields , which completes the proof. ∎
Finally, the third stage guarantees not only a hitting time but a last iterate guarantee. It also achieves the optimal sample complexity in terms of the target accuracy :
Lemma 18 (Stage 3).
Assume that . Set and
for a sufficiently large constant . Then for any , we have that with high probability,
Proof.
Let . By Lemma 15, while :
where the moments of are each bounded by . We will prove by induction that with probability at least , we have for all :
The base case is clear so assume the result for all . Then from the recurrence above,
First, because ,
Next,
The next error term is:
Fix . Then we will bound the th moment of :
Now note that because ,
Therefore the norm of the predictable quadratic variation of the next error term is bounded by:
In addition, the norm of the largest term in this sum is bounded by
Therefore by Lemma 22 and Lemma 24, we have with probability at least , this term is bounded by
Finally, the last term is similarly bounded with probability at least by
which completes the induction. ∎
We can now combine the above lemmas to prove Theorem 1:
B.5 Proof of Theorem 2
We directly follow the proof of Theorem 2 in Damian et al. [6] which is reproduced here for completeness. We begin with the following general CSQ lemma which can be found in Szörényi [38], Damian et al. [6]:
Lemma 19.
Let be a class of functions and be a data distribution such that
Then any correlational statistical query learner requires at least queries of tolerance to output a function in with loss at most .
Appendix C Concentration Inequalities
Lemma 20 (Rosenthal-Burkholder-Pinelis Inequality [39]).
Let be a martingale with martingale difference sequence where . Let
denote the predictable quadratic variation. Then there exists an absolute constant such that for all ,
The above inequality is found in Pinelis [39, Theorem 4.1]. It is often combined with the following simple lemma:
Lemma 21.
For any random variables ,
This has the immediate corollary:
Lemma 22.
Let be a martingale with martingale difference sequence where . Let denote the predictable quadratic variation. Then there exists an absolute constant such that for all ,
We will often use the following corollary of Hölder's inequality to bound the operator norm of a product of two random variables when one has polynomial tails:
Lemma 23.
Let be random variables with . Then,
Proof.
Fix . Then using Holder’s inequality with gives:
Using the fact that have polynomial tails we can bound this by
First, if , we can set which gives
Next, if we can set which gives
which completes the proof. ∎
Finally, the following basic lemma will allow us to easily convert between $L^q$-norm bounds and concentration inequalities:
Lemma 24.
Let and let be a mean zero random variable satisfying
for some . Then with probability at least , .
Proof.
Let . Then,
∎
Appendix D Additional Technical Lemmas
The following lemma extends Steins’s lemma () to the ultraspherical distribution where is the distribution of when :
Lemma 25 (Spherical Stein’s Lemma).
For any ,
Proof.
Recall that the density of the ultraspherical distribution is proportional to $(1 - t^2)^{\frac{d-3}{2}}$ on $[-1, 1]$.
Therefore,
Now we can integrate by parts to get
∎
Lemma 26.
For ,
Proof.
Recall that the PDF of the ultraspherical distribution is proportional to $(1 - t^2)^{\frac{d-3}{2}}$ on $[-1, 1]$.
Using this we have that:
∎
We have the following generalization of Lemma 8:
Corollary 3.
For any with and , there exist such that
Proof.
Expanding the definition of gives:
Now let and note that by Cauchy-Schwarz, . Then,
Now we can use the binomial theorem to expand this. Ignoring constants only depending on :
By Lemma 26, the term is bounded by when is even and when is odd. Therefore this expression is bounded by
Now note that
Therefore, is the dominant term which completes the proof. ∎
Lemma 27 (Adapted from Abbe et al. [4]).
Let be positive constants, and let be a sequence satisfying
Then, if , we have the lower bound
Proof.
Consider the auxiliary sequence . By induction, . To lower bound , we have that
Therefore
Altogether, we get
or
as desired. ∎
Appendix E Additional Experimental Details
To compute the smoothed loss we used the closed form for $\mathcal{L}_\lambda f_w$ (see Section B.3). Experiments were run on 8 NVIDIA A6000 GPUs. Our code is written in JAX [40] and we used Weights and Biases [41] for experiment tracking.