
The Interpolation Phase Transition in Neural Networks:
Memorization and Generalization under Lazy Training

Andrea Montanari and Yiqiao Zhong
Department of Electrical Engineering and Department of Statistics, Stanford University
Abstract

Modern neural networks are often operated in a strongly overparametrized regime: they comprise so many parameters that they can interpolate the training set, even if actual labels are replaced by purely random ones. Despite this, they achieve good prediction error on unseen data: interpolating the training set does not lead to a large generalization error. Further, overparametrization appears to be beneficial in that it simplifies the optimization landscape. Here we study these phenomena in the context of two-layers neural networks in the neural tangent (NT) regime. We consider a simple data model, with isotropic covariate vectors in $d$ dimensions, and $N$ hidden neurons. We assume that both the sample size $n$ and the dimension $d$ are large, and they are polynomially related. Our first main result is a characterization of the eigenstructure of the empirical NT kernel in the overparametrized regime $Nd\gg n$. This characterization implies as a corollary that the minimum eigenvalue of the empirical NT kernel is bounded away from zero as soon as $Nd\gg n$, and therefore the network can exactly interpolate arbitrary labels in the same regime.

Our second main result is a characterization of the generalization error of NT ridge regression including, as a special case, min-$\ell_2$ norm interpolation. We prove that, as soon as $Nd\gg n$, the test error is well approximated by the one of kernel ridge regression with respect to the infinite-width kernel. The latter is in turn well approximated by the error of polynomial ridge regression, whereby the regularization parameter is increased by a ‘self-induced’ term related to the high-degree components of the activation function. The polynomial degree depends on the sample size and the dimension (in particular on $\log n/\log d$).

1 Introduction

Tractability and generalization are two key problems in statistical learning. Classically, tractability is achieved by crafting suitable convex objectives, and generalization by regularizing (or restricting) the function class of interest to guarantee uniform convergence. In modern neural networks, a different mechanism appears to be often at work [NTS15, ZBH+16, BHMM19]. Empirical risk minimization becomes tractable despite non-convexity because the model is overparametrized. In fact, it is so overparametrized that a model perfectly interpolating the training set can be found in the neighborhood of most initializations. Despite this, the resulting model generalizes well to unseen data: the inductive bias produced by gradient-based algorithms is sufficient to select models that generalize well.

Elements of this picture have been rigorously established in special regimes. In particular, it is known that for neural networks with a sufficiently large number of neurons, gradient descent converges quickly to a model with vanishing training error [DZPS18, AZLS19, OS20]. In a parallel research direction, the generalization properties of several examples of interpolating models have been studied in detail [BLLT20, LR20, HMRT19, BHX19, MM21, MRSY19]. The present paper continues along the last direction, by studying the generalization properties of linear interpolating models that arise from the analysis of neural networks.

In this context, many fundamental questions remain challenging. (We refer to Section 4 for pointers to recent progress on these questions.)

  1. Q1.

    When is a neural network sufficiently complex to interpolate $n$ data points? Counting degrees of freedom would suggest that this happens as soon as the number of parameters in the network is larger than $n$. Does this lower bound predict the correct threshold? What are the architectures that achieve this lower bound?

  2. Q2.

    Assume that the answer to the previous question is positive, namely a network with the order of $n$ parameters can interpolate $n$ data points. Can such a network be found efficiently, using standard gradient descent (GD) or stochastic gradient descent (SGD)?

  3. Q3.

    Can we characterize the generalization error above this interpolation threshold? Does it decrease with the number of parameters? What is the nature of the implicit regularization and of the resulting model $\widehat{f}(\boldsymbol{x})$?

Here we address these questions by studying a class of linear models known as ‘neural tangent’ models. Our focus will be on characterizing test error and generalization, i.e., on the last question Q3, but our results are also relevant to Q1 and Q2.

We assume to be given data $\{(\boldsymbol{x}_i,y_i)\}_{i\leq n}$ with i.i.d. $d$-dimensional covariate vectors $\boldsymbol{x}_i\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ (we denote by $\mathbb{S}^{d-1}(r)$ the sphere of radius $r$ in $d$ dimensions). In addressing questions Q1 and Q2, we assume the labels $y_i$ to be arbitrary. For Q3 we will assume $y_i=f_*(\boldsymbol{x}_i)+\varepsilon_i$, where $\varepsilon_1,\ldots,\varepsilon_n$ are independent noise variables with $\mathbb{E}(\varepsilon_i)=0$ and $\mathbb{E}(\varepsilon_i^2)=\sigma_\varepsilon^2$. Crucially, we allow for a general target function $f_*$ under the minimal condition $\mathbb{E}\{f_*(\boldsymbol{x}_i)^2\}<\infty$. We assume both the dimension and sample size to be large and polynomially related, namely $c_0 d\leq n\leq d^{1/c_0}$ for some small constant $c_0$.

A substantial literature [DZPS18, AZLS19, ZCZG20, OS20, LZB20] shows that, under certain training schemes, multi-layer neural networks can be well approximated by linear models with a nonlinear (randomized) featurization map that depends on the network architecture and its initialization. In this paper, we will focus on the simplest such models. Given a set of weights $\boldsymbol{W}=(\boldsymbol{w}_1,\dots,\boldsymbol{w}_N)$, we define the following function class with $Nd$ parameters $\boldsymbol{a}:=(\boldsymbol{a}_1,\dots,\boldsymbol{a}_N)$:

$$\mathcal{F}_{\sf NT}^{N}(\boldsymbol{W}):=\Big\{f(\boldsymbol{x};\boldsymbol{a}):=\frac{1}{\sqrt{Nd}}\sum_{i=1}^{N}\langle\boldsymbol{a}_i,\boldsymbol{x}\rangle\,\sigma'(\langle\boldsymbol{w}_i,\boldsymbol{x}\rangle)\;:\;\boldsymbol{a}_i\in\mathbb{R}^{d}\Big\}\,. \tag{1}$$

In other words, to a vector of covariates $\boldsymbol{x}\in\mathbb{R}^{d}$, the NT model associates a (random) features vector

$$\boldsymbol{\Phi}(\boldsymbol{x})=\frac{1}{\sqrt{Nd}}\big[\sigma'(\langle\boldsymbol{x},\boldsymbol{w}_1\rangle)\,\boldsymbol{x}^{\sf T},\ldots,\sigma'(\langle\boldsymbol{x},\boldsymbol{w}_N\rangle)\,\boldsymbol{x}^{\sf T}\big]\in\mathbb{R}^{Nd}. \tag{2}$$

The model then computes a linear function $f(\boldsymbol{x};\boldsymbol{a})=\langle\boldsymbol{a},\boldsymbol{\Phi}(\boldsymbol{x})\rangle$. We fit the parameters $\boldsymbol{a}$ via minimum $\ell_2$-norm interpolation:

$$\text{minimize}\;\;\|\boldsymbol{a}\|_2\,,\qquad\text{subj. to}\;\;f(\boldsymbol{x}_i;\boldsymbol{a})=y_i\;\;\forall i\leq n\,. \tag{3}$$

We will also study a generalization of min-$\ell_2$-norm interpolation which is given by least-squares regression with a ridge penalty. Interpolation is recovered in the limit in which the ridge parameter tends to $0$.
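To make the featurization map (2) and the min-$\ell_2$-norm problem (3) concrete, here is a minimal NumPy sketch (purely illustrative, not part of the paper's experiments). It assumes a ReLU activation, so that $\sigma'(t)=\mathbf{1}\{t>0\}$, and uses small sizes chosen only for speed; the pseudoinverse returns the minimum-norm solution of the interpolation constraints whenever $\boldsymbol{\Phi}$ has full row rank.

```python
import numpy as np

def nt_features(X, W):
    """NT featurization map of Eq. (2) for ReLU, i.e. sigma'(t) = 1{t > 0}.

    X : (n, d) covariates, W : (N, d) first-layer weights.
    Returns Phi of shape (n, N*d): the i-th row stacks sigma'(<x_i, w_k>) * x_i over k, divided by sqrt(N*d).
    """
    n, d = X.shape
    N = W.shape[0]
    S = (X @ W.T > 0).astype(float)                      # sigma'(<x_i, w_k>), shape (n, N)
    Phi = (S[:, :, None] * X[:, None, :]).reshape(n, N * d)
    return Phi / np.sqrt(N * d)

rng = np.random.default_rng(0)
d, N, n = 20, 50, 200                                    # overparametrized: N*d = 1000 >= n
X = rng.standard_normal((n, d))
X *= np.sqrt(d) / np.linalg.norm(X, axis=1, keepdims=True)   # x_i ~ Unif(S^{d-1}(sqrt(d)))
W = rng.standard_normal((N, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)                # w_k ~ Unif(S^{d-1}(1))
y = rng.standard_normal(n)                                   # arbitrary labels

Phi = nt_features(X, W)
a_hat = np.linalg.pinv(Phi) @ y                          # min-ell_2-norm interpolator of Eq. (3)
print(np.max(np.abs(Phi @ a_hat - y)))                   # ~ 0 when Phi has full row rank
```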

As mentioned, the construction of model (1) and the choice of minimum 2\ell_{2}-norm interpolation (3) are motivated by the analysis of two-layers neural networks. In a nutshell, for a suitable scaling of the initialization, two-layer networks trained via gradient descent are well approximated by the min-norm interpolator described above.

While our focus is not on establishing or expanding the connection between neural networks and neural tangent models, Section 5 discusses the relation between model (1) and two-layer networks, mainly based on [COB19, BMR21]. We also emphasize that the model (1) is the neural tangent model corresponding to a two-layer network in which only first-layer weights are trained. If second-layer weights were trained as well, this model would have to be slightly modified (see Section 5). However, our proofs and results would remain essentially unchanged, at the cost of a substantial notational burden. The reason is intuitively clear: in large dimensions, the number of second-layer weights $N$ is negligible compared to the number of first-layer weights $Nd$.

We next summarize our results. We assume the weights $(\boldsymbol{w}_k)_{k\leq N}$ to be i.i.d. with $\boldsymbol{w}_k\sim\mathrm{Unif}(\mathbb{S}^{d-1}(1))$. In large dimensions, this choice is closely related to one of the most standard initializations of gradient descent, that is $\boldsymbol{w}_k\sim\mathcal{N}(\mathbf{0},\mathbf{I}_d/d)$. In the summary below, $C$ denotes constants that can change from line to line, and various statements hold ‘with high probability’ (i.e., with probability converging to one as $N,d,n\to\infty$).

Interpolation threshold.

Considering —as mentioned above— feature vectors $(\boldsymbol{x}_i)_{i\leq n}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ and arbitrary labels $y_i\in\mathbb{R}$, we prove that, if $Nd/(\log Nd)^C\geq n$, then with high probability there exists $f\in\mathcal{F}_{\sf NT}^N(\boldsymbol{W})$ that interpolates the data, namely $f(\boldsymbol{x}_i;\boldsymbol{a})=y_i$ for all $i\leq n$.

Finding such an interpolator amounts to solving the $n$ linear equations $f(\boldsymbol{x}_i;\boldsymbol{a})=y_i$, $i\leq n$, in the $Nd$ unknowns $\boldsymbol{a}_1,\dots,\boldsymbol{a}_N$, which parametrize $\mathcal{F}_{\sf NT}^N(\boldsymbol{W})$, cf. Eq. (1). Hence the function $f$ can be found efficiently, e.g. via gradient descent with respect to the square loss.

Minimum eigenvalue of the empirical kernel.

In order to prove the previous upper bound on the interpolation threshold, we show that the linear system $f(\boldsymbol{x}_i;\boldsymbol{a})=y_i$, $i\leq n$, has full row rank provided $Nd/(\log Nd)^C\geq n$. In fact, our proof provides quantitative control on the eigenstructure of the associated empirical kernel matrix.

Denoting by $\boldsymbol{\Phi}\in\mathbb{R}^{n\times(Nd)}$ the matrix whose $i$-th row is the $i$-th feature vector $\boldsymbol{\Phi}(\boldsymbol{x}_i)$, the empirical kernel is given by $\mathbf{K}_N:=\boldsymbol{\Phi}\boldsymbol{\Phi}^\top$. We then prove that, for $Nd/(\log Nd)^C\geq n$, $\mathbf{K}_N$ tightly concentrates around its expectation with respect to the weights, $\mathbf{K}:=\mathbb{E}_{\boldsymbol{W}}[\boldsymbol{\Phi}\boldsymbol{\Phi}^\top]$, which can be interpreted as the kernel associated to an infinite-width network. We then prove that the latter is well approximated by $\mathbf{K}^p+\gamma_{>\ell}\mathbf{I}_n$, where $\mathbf{K}^p$ is a polynomial kernel of constant degree $\ell$, and $\gamma_{>\ell}$ is bounded away from zero and depends on the high-degree components of the activation function. This result implies a tight lower bound on the minimum eigenvalue of the kernel $\mathbf{K}_N$, as well as estimates of all the other eigenvalues. The term $\gamma_{>\ell}\mathbf{I}_n$ acts as a ‘self-induced’ ridge regularization.

We note that $\lambda_{\min}(\mathbf{K}_N)$ has a direct algorithmic relevance, as discussed in [DZPS18, COB19, OS20].
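The following sketch (illustrative only, ReLU activation assumed, sizes arbitrary) builds the empirical NT kernel directly from its entrywise formula, Eq. (6) below, and reports its minimum eigenvalue; in the regime of Theorem 3.1 with $\ell=1$ one expects a value near $v(\sigma)=\mathrm{Var}(\sigma'(G))=1/4$ for ReLU, although at these small sizes the agreement is only rough.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, n = 60, 150, 300                        # N*d = 9000 >> n, and d < n << d^2 (so ell = 1)
X = rng.standard_normal((n, d))
X *= np.sqrt(d) / np.linalg.norm(X, axis=1, keepdims=True)   # x_i ~ Unif(S^{d-1}(sqrt(d)))
W = rng.standard_normal((N, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)                # w_k ~ Unif(S^{d-1}(1))

S = (X @ W.T > 0).astype(float)               # sigma'(<x_i, w_k>) for ReLU
K_N = (S @ S.T) * (X @ X.T) / (N * d)         # empirical NT kernel, entrywise as in Eq. (6)
print(np.linalg.eigvalsh(K_N)[0])             # lambda_min(K_N); roughly v(sigma) = 1/4 for ReLU, ell = 1
```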

Generalization error.

Most of our work is devoted to the analysis of the generalization error of the min-norm NT interpolator (3). As mentioned above, we consider general labels of the form $y_i=f_*(\boldsymbol{x}_i)+\varepsilon_i$, for $f_*$ a target function with finite second moment. We prove that, as soon as $Nd/(\log Nd)^C\geq n$, the risk of NT interpolation is close to the one of polynomial ridge regression with kernel $\mathbf{K}^p$ and ridge parameter $\gamma_{>\ell}$.

The degree $\ell$ of the effective polynomial kernel is defined by the condition $d^\ell(\log d)^C\leq n\ll d^{\ell+1}/(\log d)^C$ for $\ell\geq 2$, or $d/C\leq n\ll d^2/(\log d)^C$ for $\ell=1$ (in particular, our results do not cover the case $n\asymp d^\ell$ for an integer $\ell\geq 2$, but they cover $n\asymp d^\alpha$ for any non-integer $\alpha$).

Remarkably, even if the original NT model interpolates the data, the equivalent polynomial regression model does not interpolate and has a positive ‘self-induced’ regularization parameter $\gamma_{>\ell}$.

From a technical viewpoint, the characterization of the generalization behavior is significantly more challenging than the analysis of the eigenstructure of the kernel $\mathbf{K}_N$ at the previous point. Indeed, understanding generalization requires studying the eigenvectors of $\mathbf{K}_N$, and how they change when adding new sample points.

These results provide a clear picture of how neural networks achieve good generalization in the neural tangent regime, despite interpolating the data. First, the model is nonlinear in the input covariates, and sufficiently overparametrized. Thanks to this flexibility it can interpolate the data points. Second, gradient descent selects the min-norm interpolator. This results in a model that is very close to a polynomial $\widehat{f}_\ell(\boldsymbol{x})$ of degree $\ell$ at ‘most’ points $\boldsymbol{x}$ (more formally, the model is close to a polynomial in the $L^2$ sense). Third, because of the large dimension $d$, the empirical kernel matrix contains a portion proportional to the identity, which acts as a self-induced regularization term.

The rest of the paper is organized as follows. In the next section, we illustrate our results through some simple numerical simulations. We formally present our main results in Section 3. We state our characterization of the NT kernel, and of the generalization error of NT regression. In Section 4, we briefly overview related work on interpolation and generalization of neural networks. We discuss the connection between GD-trained neural networks and neural tangent models in Section 5. Section 6 provides some technical background on orthogonal polynomials, which is useful for the proofs. The proofs of the main theorems are outlined in Sections 7 and 8, with most technical work deferred to the appendices.

2 Two numerical illustrations

2.1 Comparing NT models and Kernel Ridge Regression

Figure 1: In both heatmaps, we fix the dimension $d=20$ and use min-$\ell_2$ norm NT regression to fit data generated according to (4). Results are averaged over $n_{\text{rep}}=10$ repetitions. Top: For varying network parameters $Nd$ and sample size $n$, we check if $\mathbf{K}_N$ is singular (we report the empirical probability). Bottom: We calculate the train error of min-$\ell_2$ norm NT regression.

In the first experiment, we generated data according to the model $y_i=f_*(\boldsymbol{x}_i)+\varepsilon_i$, with $\boldsymbol{x}_i\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$, $\varepsilon_i\sim\mathcal{N}(0,\sigma_\varepsilon^2)$, $\sigma_\varepsilon=0.5$, and

$$f_*(\boldsymbol{x})=\sqrt{\tfrac{4}{10}}\,h_1(\langle\boldsymbol{\beta}_*,\boldsymbol{x}\rangle)+\sqrt{\tfrac{4}{10}}\,h_2(\langle\boldsymbol{\beta}_*,\boldsymbol{x}\rangle)+\sqrt{\tfrac{2}{10}}\,h_4(\langle\boldsymbol{\beta}_*,\boldsymbol{x}\rangle)\,. \tag{4}$$

Here $\boldsymbol{\beta}_*$ is a fixed unit norm vector (randomly generated and then fixed throughout), and $(h_k)_{k\geq 0}$ are orthonormal Hermite polynomials, e.g. $h_1(x)=x$, $h_2(x)=(x^2-1)/\sqrt{2}$ (see Section 6 for general definitions). We fix $d=20$, and compute the min-$\ell_2$-norm NT interpolator for ReLU activations $\sigma(t)=\max(t,0)$, and random weights $\boldsymbol{w}_k\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$. We average our results over $n_{\text{rep}}=10$ repetitions.
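A hedged sketch of this data-generating process follows (illustrative only; the Hermite polynomials $h_1,h_2,h_4$ are hard-coded in their orthonormal form, and all sizes are arbitrary choices).

```python
import numpy as np

def h(k, t):
    """Orthonormal Hermite polynomials used in Eq. (4): h_1, h_2, h_4."""
    return {1: t,
            2: (t**2 - 1) / np.sqrt(2),
            4: (t**4 - 6 * t**2 + 3) / np.sqrt(24)}[k]

rng = np.random.default_rng(2)
d, n, sigma_eps = 20, 1000, 0.5
beta = rng.standard_normal(d); beta /= np.linalg.norm(beta)          # fixed unit-norm beta_*
X = rng.standard_normal((n, d))
X *= np.sqrt(d) / np.linalg.norm(X, axis=1, keepdims=True)           # x_i ~ Unif(S^{d-1}(sqrt(d)))
z = X @ beta                                                         # <beta_*, x_i>, approximately N(0, 1)
f_star = np.sqrt(0.4) * h(1, z) + np.sqrt(0.4) * h(2, z) + np.sqrt(0.2) * h(4, z)   # Eq. (4)
y = f_star + sigma_eps * rng.standard_normal(n)
```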

Figure 2: We fix the dimension $d=20$ and use min-$\ell_2$ norm NT regression as before. We calculated the test error using an independent test set of size $4000$. Results are averaged over $n_{\text{rep}}=10$ repetitions. Top: Test errors are plotted for varying $Nd$ and $n$; we cap the errors at $2$ so that the blowup near the dashed line is easy to visualize. Bottom: We examine the test errors at three particular sample sizes $n=46$, $444$, and $4304$. The dashed horizontal lines represent test errors of KRR with the infinite-width kernel $\mathbf{K}$ at the three corresponding sample sizes.

From Figure 1, we observe that $(1)$ the minimum eigenvalue of the kernel becomes strictly positive very sharply as soon as $Nd/n\gtrsim 1$, and $(2)$ as a consequence, the train error vanishes sharply as $Nd/n$ crosses $1$. Both phenomena are captured by our theorems in the next section, although we require the condition $Nd/(\log Nd)^C\geq n$, which is suboptimal by a polylogarithmic factor.

From Figure 2, we make the following observations and remarks.

  1. 1.

    The number of samples $n$ and the number of parameters $Nd$ play a strikingly symmetric role. The test error is large when $Nd\approx n$ (the interpolation threshold) and decreases rapidly when either $Nd$ or $n$ increases (i.e. moving along either horizontal or vertical lines). In the context of the random features model, a form of this symmetry property was established rigorously in [MMM21]. In the present work, we only focus on the overparametrized regime $Nd\gg n$.

  2. 2.

    The test error rapidly decays to a limit value as $Nd$ grows at fixed $n$. We interpret this limit value as the infinite-width limit, and indeed it matches the risk of kernel ridge(-less) regression with respect to the infinite-width kernel $\mathbf{K}$ (dashed lines); see Theorem 3.3.

  3. 3.

    Considering the most favorable case, namely $Nd\gg n$, the test error appears to remain bounded away from zero even when $n\approx d^2$. This appears to be surprising given the simplicity of the target function (4). The phenomenon can be explained by Theorem 3.3, in which we show that for $d^\ell\ll n\ll d^{\ell+1}$ with $\ell$ an integer, NT regression is roughly equivalent to regression with respect to degree-$\ell$ polynomials. In particular, for $n\ll d^3$ it will not capture components of degree larger than $2$ in the target $f_*$ of Eq. (4).

Figure 3: Test/generalization errors of NT kernel ridge regression (NT) and polynomial ridge regression (Poly). We fix $n=4000$ and $d=500$. For each regularization parameter $\lambda$ (which corresponds to one color), we plot two curves (solid: NT, dashed: Poly) that represent $R_{\sf NT}(\lambda)$ and $R_{\rm lin}(\gamma_{\rm eff}(\lambda,\sigma))$ respectively.
Figure 4: Test/generalization errors of NT and Poly. As before, for each regularization parameter $\lambda$ (which corresponds to one color), we plot two curves.

2.2 Comparing NT regression and polynomial regression

In the second experiment, we generated data from a linear model $y_i=\langle\boldsymbol{x}_i,\boldsymbol{\beta}_*\rangle+\varepsilon_i$. As before, $\boldsymbol{x}_i\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$, $\varepsilon_i\sim\mathcal{N}(0,\sigma_\varepsilon^2)$, $\sigma_\varepsilon=0.5$, and $\boldsymbol{\beta}_*$ is a fixed unit norm vector (randomly generated and then fixed throughout). In Figure 3, we fix $n=4000$ and $d=500$, and vary the number of neurons $N\in\{10,20,\ldots,90,100,\ldots,1000\}$. In Figure 4, we fix $N=800$ and $d=500$, and vary the sample size $n\in\{100,200,\ldots,900,1000,1500,\ldots,4000\}$. The results are averaged over $n_{\text{rep}}=10$ repetitions.

Notice that in all of these experiments $n\ll d^2$. The theory developed below (see in particular Theorem 3.3 and Corollary 3.2) implies that the risk of NT ridge regression should be well approximated by the risk of polynomial ridge regression, although with an inflated ridge parameter. For $n\ll d^2$, the polynomial degree is $\ell=1$. If $\lambda\geq 0$ denotes the regularization parameter in NT ridge regression, the equivalent regularization in polynomial regression is predicted to be

$$\gamma=\frac{\lambda+\mathrm{Var}(\sigma'(G))}{\{\mathbb{E}[\sigma'(G)]\}^2}, \tag{5}$$

where $G\sim\mathcal{N}(0,1)$.
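As a sanity check of Eq. (5) in the ReLU case, where $\mathbb{E}[\sigma'(G)]=1/2$ and $\mathrm{Var}(\sigma'(G))=1/4$ so that $\gamma=4\lambda+1$, one can run the following short Monte Carlo computation (illustrative only, not part of the paper's experiments).

```python
import numpy as np

# Monte Carlo check of Eq. (5) for ReLU: sigma'(t) = 1{t > 0}, so E[sigma'(G)] = 1/2 and Var(sigma'(G)) = 1/4.
G = np.random.default_rng(3).standard_normal(10**6)
s = (G > 0).astype(float)
for lam in [0.0, 0.1, 1.0]:
    gamma = (lam + s.var()) / s.mean() ** 2
    print(lam, gamma)                         # approximately 1.0, 1.4, 5.0, i.e. gamma = 4*lambda + 1
```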

In Figures 3 and 4 we fit NT ridge regression (NT) and polynomial ridge regression of degree $\ell=1$ (Poly). We use different pairs of regularization parameters $\lambda$ (for NT) and $\gamma$ (for Poly) satisfying Eq. (5). We observe a close match between the risks of NT regression and polynomial regression, in agreement with the theory established below.

We also trained a two-layer neural network to fit this linear model. Under a specific initialization of the weights, the test error is well aligned with the one from NT KRR and polynomial ridge regression. Details can be found in the appendix.

3 Main results

3.1 Notations

For a positive integer $n$, we denote by $[n]$ the set $\{1,2,\ldots,n\}$. The $\ell_2$ norm of a vector $\boldsymbol{u}\in\mathbb{R}^m$ is denoted by $\|\boldsymbol{u}\|$. We denote by $\mathbb{S}^{d-1}(r)=\{\boldsymbol{u}\in\mathbb{R}^d:\|\boldsymbol{u}\|=r\}$ the sphere of radius $r$ in $d$ dimensions (sometimes we simply write $\mathbb{S}^{d-1}:=\mathbb{S}^{d-1}(1)$).

Let $\mathbf{A}\in\mathbb{R}^{n\times m}$ be a matrix. We use $\sigma_j(\mathbf{A})$ to denote the $j$-th largest singular value of $\mathbf{A}$, and we also denote $\sigma_{\max}(\mathbf{A})=\sigma_1(\mathbf{A})$ and $\sigma_{\min}(\mathbf{A})=\sigma_{\min\{m,n\}}(\mathbf{A})$. If $\mathbf{A}$ is a symmetric matrix, we use $\lambda_j(\mathbf{A})$ to denote its $j$-th largest eigenvalue. We denote by $\|\mathbf{A}\|_{\mathrm{op}}=\max_{\boldsymbol{u}\in\mathbb{S}^{m-1}}\|\mathbf{A}\boldsymbol{u}\|$ the operator norm, by $\|\mathbf{A}\|_{\max}=\max_{i\in[n],j\in[m]}|A_{ij}|$ the maximum norm, by $\|\mathbf{A}\|_F=\big(\sum_{i,j}A_{ij}^2\big)^{1/2}$ the Frobenius norm, and by $\|\mathbf{A}\|_*=\sum_{j=1}^{\min\{n,m\}}\sigma_j(\mathbf{A})$ the nuclear norm. If $\mathbf{A}\in\mathbb{R}^{n\times n}$ is a square matrix, the trace of $\mathbf{A}$ is denoted by $\mathrm{Tr}(\mathbf{A})=\sum_{i\in[n]}A_{ii}$. Positive semi-definite is abbreviated as p.s.d.

We will use $O_d(\cdot)$ and $o_d(\cdot)$ for the standard big-$O$ and small-$o$ notation, where $d$ is the asymptotic variable. We write $a_d=\Omega_d(b_d)$ for scalars $a_d,b_d$ if there exist $d_0,C>0$ such that $a_d\geq Cb_d$ for $d>d_0$. For random variables $\xi_1(d)$ and $\xi_2(d)$, we write $\xi_1(d)=O_{d,\mathbb{P}}(\xi_2(d))$ if for any $\varepsilon>0$, there exist $C_\varepsilon>0$ and $d_\varepsilon>0$ such that

$$\mathbb{P}\big(|\xi_1(d)/\xi_2(d)|>C_\varepsilon\big)\leq\varepsilon,\qquad\text{for all}~d>d_\varepsilon.$$

Similarly, $\xi_1(d)=o_{d,\mathbb{P}}(\xi_2(d))$ if $\xi_1(d)/\xi_2(d)$ converges to $0$ in probability. Occasionally, we use the notation $\widetilde{O}_{d,\mathbb{P}}(\cdot)$ and $\widetilde{o}_{d,\mathbb{P}}(\cdot)$: we write $\xi_1(d)=\widetilde{O}_{d,\mathbb{P}}(\xi_2(d))$ if there exists a constant $C>0$ such that $\xi_1(d)=O_{d,\mathbb{P}}((\log d)^C\xi_2(d))$, and similarly we write $\xi_1(d)=\widetilde{o}_{d,\mathbb{P}}(\xi_2(d))$ if there exists a constant $C>0$ such that $\xi_1(d)=o_{d,\mathbb{P}}((\log d)^C\xi_2(d))$. We may drop the subscript $d$ if there is no risk of confusion.

For a nonnegative sequence $(a_d)_{d\geq 1}$, we write $a_d=\mathrm{poly}(d)$ if there exists a constant $C>0$ such that $a_d\leq Cd^C$.

Throughout, we will use $C,C_0,C_1,C_2,C_3$ to refer to constants that do not depend on $d$. In particular, for notational convenience, the value of $C$ may change from line to line.

Most of our statements apply to settings in which $N,n,d$ all grow to $\infty$ while satisfying certain conditions. Without loss of generality, one can think of such sequences as indexed by $d$, with $N,n$ functions of $d$.

Recall that we say that $A_{N,n,d}$ happens with high probability (abbreviated as w.h.p.) if its probability tends to one as $N,n,d\to\infty$ (in whatever way is specified in the text). In certain proofs, we will say that an event $A_{N,n,d}$ happens with very high probability if for every $\beta>0$, we have $\lim_{d\to\infty}d^\beta\,\mathbb{P}(A_{N,n,d}^c)=0$ (again, we think of $N,n\to\infty$ as $d\to\infty$ in whatever way is prescribed in the text).

3.2 Definitions and assumptions

Throughout, we assume $\boldsymbol{x}_i\sim_{\mathrm{i.i.d.}}\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ and $\boldsymbol{w}_k\sim_{\mathrm{i.i.d.}}\mathrm{Unif}(\mathbb{S}^{d-1})$. To a vector of covariates $\boldsymbol{x}\in\mathbb{R}^d$, the NT model associates a (random) features vector $\boldsymbol{\Phi}(\boldsymbol{x})$ as per Eq. (2). We denote by $\boldsymbol{\Phi}\in\mathbb{R}^{n\times(Nd)}$ the matrix whose $i$-th row contains the feature vector of the $i$-th sample, and by $\mathbf{K}_N:=\boldsymbol{\Phi}\boldsymbol{\Phi}^\top$ the corresponding empirical kernel:

$$\boldsymbol{\Phi}=\begin{bmatrix}\boldsymbol{\Phi}(\boldsymbol{x}_1)^\top\\ \boldsymbol{\Phi}(\boldsymbol{x}_2)^\top\\ \vdots\\ \boldsymbol{\Phi}(\boldsymbol{x}_n)^\top\end{bmatrix}\in\mathbb{R}^{n\times(Nd)},\qquad\mathbf{K}_N:=\boldsymbol{\Phi}\boldsymbol{\Phi}^\top\in\mathbb{R}^{n\times n}\,.$$

The entries of the kernel matrix take the form

$$[\mathbf{K}_N]_{ij}=\frac{1}{Nd}\sum_{k=1}^{N}\sigma'(\langle\boldsymbol{x}_i,\boldsymbol{w}_k\rangle)\,\sigma'(\langle\boldsymbol{x}_j,\boldsymbol{w}_k\rangle)\,\langle\boldsymbol{x}_i,\boldsymbol{x}_j\rangle\,. \tag{6}$$

The infinite-width kernel matrix is given by

$$[\mathbf{K}]_{ij}=\mathbb{E}_{\boldsymbol{w}}\big[\sigma'(\langle\boldsymbol{x}_i,\boldsymbol{w}\rangle)\,\sigma'(\langle\boldsymbol{x}_j,\boldsymbol{w}\rangle)\big]\,\frac{\langle\boldsymbol{x}_i,\boldsymbol{x}_j\rangle}{d}. \tag{7}$$
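A simple Monte Carlo approximation of Eq. (7) for the ReLU case is sketched below (illustrative only; the number of sampled weights `n_mc` is an arbitrary choice). For ReLU, the inner expectation also has the well-known closed form $(\pi-\theta_{ij})/(2\pi)$, where $\theta_{ij}$ is the angle between $\boldsymbol{x}_i$ and $\boldsymbol{x}_j$.

```python
import numpy as np

def infinite_width_kernel_mc(X, n_mc=20000, rng=None):
    """Monte Carlo approximation of the infinite-width NT kernel of Eq. (7), ReLU case.

    [K]_{ij} ~= mean_k sigma'(<x_i, w_k>) sigma'(<x_j, w_k>) * <x_i, x_j> / d, with w_k ~ Unif(S^{d-1}(1)).
    """
    rng = rng or np.random.default_rng(4)
    d = X.shape[1]
    W = rng.standard_normal((n_mc, d))
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    S = (X @ W.T > 0).astype(float)           # sigma'(<x_i, w>) for ReLU
    return (S @ S.T / n_mc) * (X @ X.T) / d
```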

In terms of the featurization map $\boldsymbol{\Phi}$, an NT function $f\in\mathcal{F}_{\sf NT}^N(\boldsymbol{W})$ reads

$$f(\boldsymbol{x};\boldsymbol{a})=\langle\boldsymbol{a},\boldsymbol{\Phi}(\boldsymbol{x})\rangle\,, \tag{8}$$

where $\boldsymbol{a}=(\boldsymbol{a}_1,\dots,\boldsymbol{a}_N)^\top\in\mathbb{R}^{Nd}$.

Assumption 3.1.

Given an arbitrary integer $\ell\geq 1$ and a small constant $c_0>0$, we assume that the following holds for a sufficiently large constant $C_0$ (depending on $c_0$, $\ell$, and on the activation $\sigma$):

$$c_0 d\leq n\leq\frac{d^2}{(\log d)^{C_0}},\qquad\text{if}~\ell=1, \tag{9}$$
$$d^\ell(\log d)^{C_0}\leq n\leq\frac{d^{\ell+1}}{(\log d)^{C_0}},\qquad\text{if}~\ell>1. \tag{10}$$

Throughout, we assume that the activation function $\sigma:\mathbb{R}\to\mathbb{R}$ satisfies the following condition. Note that commonly used activation functions, such as ReLU, sigmoid, tanh, and leaky ReLU, satisfy this condition. Low-degree polynomials are excluded by this assumption. We denote by $\mu_k(\sigma')$ the $k$-th coefficient in the Hermite expansion of $\sigma'$; see its definition in Section 6, especially Eq. (39).

Assumption 3.2 (polynomial growth).

We assume that $\sigma$ is weakly differentiable with weak derivative $\sigma'$ satisfying $|\sigma'(x)|\leq B(1+|x|)^B$ for some finite constant $B>0$, and that $\sum_{k\geq\ell}[\mu_k(\sigma')]^2>0$.

Note that this condition is extremely mild: it is satisfied by any activation function of practical use. The existence of the weak derivative $\sigma'$ is needed for the NT model to make sense at all. The assumption of polynomial growth for $\sigma'$ ensures that we can use harmonic analysis on the sphere to analyze its behavior. Finally, the condition $\sum_{k\geq\ell}[\mu_k(\sigma')]^2>0$ is satisfied as long as $\sigma$ is not a polynomial of degree at most $\ell$.
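The Hermite coefficients $\mu_k(\sigma')$ entering this assumption can be approximated numerically. The sketch below (illustrative only, using Gauss–Hermite quadrature, which is approximate for a non-smooth $\sigma'$ such as the ReLU derivative) recovers $\mu_0=1/2$, $\mu_1=1/\sqrt{2\pi}$, and $\mu_2=0$ for ReLU.

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def hermite_coeffs(sigma_prime, k_max=6, n_quad=200):
    """Approximate mu_k(sigma') = E[sigma'(G) h_k(G)], G ~ N(0,1), with orthonormal Hermite polynomials h_k."""
    x, w = hermegauss(n_quad)                 # Gauss-Hermite nodes/weights for the weight exp(-t^2/2)
    w = w / np.sqrt(2 * np.pi)                # normalize to the standard Gaussian density
    He = [np.ones_like(x), x]                 # probabilists' Hermite: He_{k+1} = t*He_k - k*He_{k-1}
    mu = []
    for k in range(k_max + 1):
        if k >= 2:
            He.append(x * He[k - 1] - (k - 1) * He[k - 2])
        h_k = He[k] / np.sqrt(math.factorial(k))
        mu.append(np.sum(w * sigma_prime(x) * h_k))
    return np.array(mu)

mu = hermite_coeffs(lambda t: (t > 0).astype(float))     # ReLU: sigma'(t) = 1{t > 0}
print(mu[:4])                                 # ~ [0.5, 0.3989, 0.0, -0.163]
print(np.sum(mu[1:] ** 2))                    # truncated lower bound on v(sigma) for ell = 1 (full value: 1/4)
```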

3.3 Structure of the kernel matrix

Given points $\boldsymbol{x}_1,\dots,\boldsymbol{x}_n$, $\boldsymbol{\Phi}$ has full row rank, i.e. $\mathrm{rank}(\boldsymbol{\Phi})=n$, if and only if for any choice of the labels or responses $y_1,\dots,y_n\in\mathbb{R}$, there exists a function $f\in\mathcal{F}_{\sf NT}^N(\boldsymbol{W})$ interpolating those data, i.e. $y_i=f(\boldsymbol{x}_i)$ for all $i\leq n$. This of course requires $Nd\geq n$. Our first result—which is a direct corollary of Theorem 3.1 presented below—shows that this lower bound is roughly correct.

Corollary 3.1.

Assume $(\boldsymbol{x}_i)_{i\leq n}\sim_{\mathrm{i.i.d.}}\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ and $(\boldsymbol{w}_k)_{k\leq N}\sim_{\mathrm{i.i.d.}}\mathrm{Unif}(\mathbb{S}^{d-1}(1))$, and let $B,c_0>0$, $\ell\in\mathbb{N}$ be fixed. Then, there exist constants $C_0,C>0$ such that the following holds.

If Assumptions 3.1 and 3.2 hold with constants $\ell,c_0,C_0,B$, and if $Nd/(\log(Nd))^C\geq n$, then $\mathrm{rank}(\boldsymbol{\Phi})=n$ with high probability. In particular, an NT interpolator exists with high probability for any choice of the responses $(y_i)_{i\leq n}$.

Remark 1.

Note that any NT model (1) can be approximated arbitrarily well by a two-layer neural network with $2N$ neurons. This can be seen by taking $\varepsilon\to 0$ in $f_\varepsilon(\boldsymbol{x})=\sum_{k=1}^N\frac{1}{\varepsilon}\big\{\sigma(\langle\boldsymbol{w}_k,\boldsymbol{x}\rangle+\varepsilon\langle\boldsymbol{a}_k,\boldsymbol{x}\rangle)-\sigma(\langle\boldsymbol{w}_k,\boldsymbol{x}\rangle-\varepsilon\langle\boldsymbol{a}_k,\boldsymbol{x}\rangle)\big\}$. As a consequence, in the above setting, a $2N$-neuron neural network can interpolate $n$ data points with arbitrarily small approximation error, with high probability, provided $Nd/(\log(Nd))^C\geq n$.
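A small numerical illustration of this finite-difference construction follows (not from the paper; ReLU assumed, and a $1/(2\sqrt{Nd})$ rescaling relative to $f_\varepsilon$ above is folded in so that the $\varepsilon\to 0$ limit matches $f(\boldsymbol{x};\boldsymbol{a})$ of Eq. (1)).

```python
import numpy as np

def nt_function(x, W, A):
    """f(x; a) of Eq. (1) for ReLU: (1/sqrt(Nd)) * sum_k <a_k, x> * 1{<w_k, x> > 0}."""
    N, d = W.shape
    return np.sum((A @ x) * (W @ x > 0)) / np.sqrt(N * d)

def two_layer_finite_diff(x, W, A, eps, sigma=lambda t: np.maximum(t, 0.0)):
    """2N-neuron network of Remark 1, rescaled by 1/(2*sqrt(Nd)) so that its eps -> 0 limit equals nt_function."""
    N, d = W.shape
    return np.sum(sigma(W @ x + eps * (A @ x)) - sigma(W @ x - eps * (A @ x))) / (2 * eps * np.sqrt(N * d))

rng = np.random.default_rng(5)
d, N = 20, 30
W = rng.standard_normal((N, d)); W /= np.linalg.norm(W, axis=1, keepdims=True)
A = rng.standard_normal((N, d))
x = rng.standard_normal(d); x *= np.sqrt(d) / np.linalg.norm(x)
print(nt_function(x, W, A), two_layer_finite_diff(x, W, A, eps=1e-4))   # nearly identical values
```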

To the best of our knowledge, this is the first result of this type for regression. The concurrent paper [Dan20] proves a similar memorization result for classification, but exploits in a crucial way the fact that in classification it is sufficient to ensure $y_i f(\boldsymbol{x}_i)>0$. Regression is considered in [BELM20], but the number of required neurons depends on the interpolation accuracy.

While we stated the above as an independent result because of its interest, it is in fact an immediate corollary of a quantitative lower bound on the minimum eigenvalue of the kernel $\mathbf{K}_N\in\mathbb{R}^{n\times n}$, stated below. Given $g:\mathbb{R}\to\mathbb{R}$, we let $\mu_k(g)=\mathbb{E}[g(G)h_k(G)]$ denote its $k$-th Hermite coefficient (here $G\sim\mathcal{N}(0,1)$ and $\mathbb{E}[h_k(G)^2]=1$); see also Section 6, Eq. (39).

Theorem 3.1.

Assume $(\boldsymbol{x}_i)_{i\leq n}\sim_{\mathrm{i.i.d.}}\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ and $(\boldsymbol{w}_k)_{k\leq N}\sim_{\mathrm{i.i.d.}}\mathrm{Unif}(\mathbb{S}^{d-1}(1))$, and let $B,c_0>0$, $\ell\in\mathbb{N}$ be fixed. Then, there exist constants $C_0,C>0$ such that the following holds.

If Assumptions 3.1 and 3.2 hold with constants $\ell,c_0,C_0,B$, and further $n\geq d+1$ and $Nd/(\log(Nd))^C\geq n$, then, defining $v(\sigma):=\sum_{k\geq\ell}[\mu_k(\sigma')]^2$, we have

$$\lambda_{\min}(\mathbf{K}_N)=v(\sigma)+o_{d,\mathbb{P}}(1)\,. \tag{11}$$

If the assumption $n\geq d+1$ is replaced by $n\geq c_0 d$ for a strictly positive constant $c_0$, then $\lambda_{\min}(\mathbf{K}_N)\geq v(\sigma)-o_{d,\mathbb{P}}(1)$.

Remark 2.

In the case $\ell=1$, the eigenvalue lower bound $v(\sigma)$ is simply $v(\sigma)=\mathrm{Var}(\sigma'(G))=\mathbb{E}[\sigma'(G)^2]-\{\mathbb{E}[\sigma'(G)]\}^2$, where $G\sim\mathcal{N}(0,1)$.

The only earlier result comparable to Theorem 3.1 was obtained in [SJL18] which, in a similar setting, proved that, with high probability, $\lambda_{\min}(\mathbf{K}_N)\geq c_*$ for a strictly positive constant $c_*$, provided $N\geq 2d$.

The proof of Theorem 3.1 is presented in Section 7. The key is to show that the NT kernel concentrates around the infinite-width kernel for large enough $N$.

For any constant $\gamma>0$, we define an event $\mathcal{A}_\gamma$ as follows:

$$\mathcal{A}_\gamma=\big\{\mathbf{K}\succeq\gamma\mathbf{I}_n,~\|\boldsymbol{X}\|_{\mathrm{op}}\leq 2(\sqrt{n}+\sqrt{d})\big\}. \tag{12}$$

This event only involves $(\boldsymbol{x}_i)_{i\leq n}$ (formally speaking, it lies in the sigma-algebra generated by $(\boldsymbol{x}_i)_{i\leq n}$). We will show that, if $\gamma>0$ is a constant no larger than $v(\sigma)/2$, then the event $\mathcal{A}_\gamma$ happens with very high probability under Assumption 3.2, provided $n\leq d^{\ell+1}/(\log d)^{C_0}$. Below we will use $\mathbb{P}_{\boldsymbol{w}}$ to denote the probability over the randomness of $\boldsymbol{w}_1,\ldots,\boldsymbol{w}_N$ (equivalently, conditional on $(\boldsymbol{x}_i)_{i\leq n}$).

Theorem 3.2 (Kernel concentration).

Assume $(\boldsymbol{x}_i)_{i\leq n}\sim_{\mathrm{i.i.d.}}\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ and $(\boldsymbol{w}_k)_{k\leq N}\sim_{\mathrm{i.i.d.}}\mathrm{Unif}(\mathbb{S}^{d-1}(1))$. Let $\gamma=v(\sigma)/2$. Then, there exist constants $C',C_0>0$ such that the following holds. Under Assumption 3.2, the event $\mathcal{A}_\gamma$ holds with very high probability, and on the event $\mathcal{A}_\gamma$, for any constant $\beta>0$,

$$d^\beta\cdot\mathbb{P}_{\boldsymbol{w}}\left(\big\|\mathbf{K}^{-1/2}\mathbf{K}_N\mathbf{K}^{-1/2}-\mathbf{I}_n\big\|_{\mathrm{op}}>\sqrt{\frac{(n+d)(\log(nNd))^{C'}}{Nd}}+\frac{(n+d)(\log(nNd))^{C'}}{Nd}\right)=o_d(1).$$

As a consequence, if $n\leq d^{\ell+1}/(\log d)^{C_0}$ (i.e. the upper bound in Assumption 3.1 holds), with the same $\ell$ as in Assumption 3.2, then with very high probability,

$$\big\|\mathbf{K}^{-1/2}\mathbf{K}_N\mathbf{K}^{-1/2}-\mathbf{I}_n\big\|_{\mathrm{op}}\leq\sqrt{\frac{(n+d)(\log(nNd))^{C'}}{Nd}}+\frac{(n+d)(\log(nNd))^{C'}}{Nd}. \tag{13}$$

When the right-hand side of (13) is $o_{d,\mathbb{P}}(1)$, it is clear that the eigenvalues of $\mathbf{K}_N$ are bounded from below, since

$$\big\|\mathbf{K}^{-1/2}\mathbf{K}_N\mathbf{K}^{-1/2}-\mathbf{I}_n\big\|_{\mathrm{op}}\leq\eta'\quad\Longrightarrow\quad(1-\eta')\,\mathbf{K}\preceq\mathbf{K}_N\preceq(1+\eta')\,\mathbf{K}. \tag{14}$$

If we further have $n\leq Nd/(\log(Nd))^C$ and $n\geq c_0 d$ for $c_0>0$ and a sufficiently large constant $C$, then we can take $\eta'=(n(\log(Nd))^{C'}/Nd)^{1/2}$ here.

Remark 3.

If $\sigma'$ is not a polynomial, then $\mathcal{A}_\gamma$ holds with high probability as soon as $n\leq d^{C''}$ for some constant $C''$. Hence, in this case, the stronger assumption $n\leq d^{\ell+1}/(\log d)^{C_0}$ is not needed for the second part of this theorem.

3.4 Test error

In order to study the generalization properties of the NT model, we consider a general regression model for the data distribution. Data $(\boldsymbol{x}_1,y_1),\ldots,(\boldsymbol{x}_n,y_n)$ are i.i.d. with $\boldsymbol{x}_i\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ and $y_i=f_*(\boldsymbol{x}_i)+\varepsilon_i$, where $f_*\in L^2:=L^2(\mathbb{S}^{d-1}(\sqrt{d}),\tau_{d-1})$. Here, $L^2(\mathbb{S}^{d-1}(\sqrt{d}),\tau_{d-1})$ is the space of functions on $\mathbb{S}^{d-1}(\sqrt{d})$ which are square integrable with respect to the uniform measure. In matrix notation, we let $\boldsymbol{X}\in\mathbb{R}^{n\times d}$ denote the matrix whose $i$-th row is $\boldsymbol{x}_i$, $\boldsymbol{y}:=(y_i)_{i\leq n}$, and $\boldsymbol{f}_*=(f_*(\boldsymbol{x}_i))_{i\leq n}$. We then have

$$\boldsymbol{y}=\boldsymbol{f}_*+\boldsymbol{\varepsilon},\qquad\text{where}~\mathrm{Var}(\varepsilon_i)=\sigma_\varepsilon^2. \tag{15}$$

The noise variables $\varepsilon_1,\ldots,\varepsilon_n\sim_{\mathrm{i.i.d.}}\mathbb{P}_\varepsilon$ are assumed to have zero mean and finite second moment, i.e., $\sigma_\varepsilon^2=\mathbb{E}(\varepsilon_1^2)$.

We fit the coefficients $\boldsymbol{a}=(\boldsymbol{a}_1,\dots,\boldsymbol{a}_N)\in\mathbb{R}^{Nd}$ of the NT model using ridge regression, namely

$$\widehat{\boldsymbol{a}}(\lambda):=\arg\min_{\boldsymbol{a}\in\mathbb{R}^{Nd}}\Big\{\sum_{i=1}^n\big(y_i-f(\boldsymbol{x}_i;\boldsymbol{a})\big)^2+\lambda\|\boldsymbol{a}\|^2\Big\}\,, \tag{16}$$

where $f(\,\cdot\,;\boldsymbol{a})$ is defined as per Eq. (1). Explicitly, we have

$$\widehat{\boldsymbol{a}}(\lambda)=\boldsymbol{\Phi}^\top\big(\boldsymbol{\Phi}\boldsymbol{\Phi}^\top+\lambda\mathbf{I}_n\big)^{-1}\boldsymbol{y}. \tag{17}$$

We evaluate this approach on a new input $\boldsymbol{x}_0$ that has the same distribution as the training inputs. Our analysis covers the case $\lambda=0$, in which case we obtain the minimum $\ell_2$-norm interpolator.

The test error is defined as

$$R_{\sf NT}(f_*;\lambda)=\mathbb{E}_{\boldsymbol{x}_0}\big[(f_*(\boldsymbol{x}_0)-\langle\widehat{\boldsymbol{a}}(\lambda),\boldsymbol{\Phi}(\boldsymbol{x}_0)\rangle)^2\big]\,. \tag{18}$$

We occasionally call this the ‘generalization error’, with a slight abuse of terminology (sometimes this term refers to the difference between the test error and the train error $n^{-1}\sum_{i\leq n}(y_i-\langle\widehat{\boldsymbol{a}},\boldsymbol{\Phi}(\boldsymbol{x}_i)\rangle)^2$).
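For reference, the estimator (17) and a Monte Carlo version of the test error (18) can be written in a few lines (illustrative sketch only; $\boldsymbol{\Phi}$ is assumed to be built as in the earlier featurization sketch, and $\lambda=0$ requires $\mathbf{K}_N$ to be nonsingular, which holds in the regime of Theorem 3.1).

```python
import numpy as np

def nt_ridge_fit(Phi, y, lam):
    """Ridge solution of Eq. (17): a_hat = Phi^T (Phi Phi^T + lam I)^{-1} y; lam = 0 gives min-norm interpolation."""
    n = Phi.shape[0]
    return Phi.T @ np.linalg.solve(Phi @ Phi.T + lam * np.eye(n), y)

def nt_test_error(a_hat, Phi_test, f_star_test):
    """Monte Carlo estimate of the test error (18) on fresh covariates x_0."""
    return np.mean((f_star_test - Phi_test @ a_hat) ** 2)
```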

Our main results on the generalization behavior of NT KRR establish its equivalence with simpler methods. Namely, we perform kernel ridge regression with the infinite-width kernel $\mathbf{K}$. Let $\gamma\geq 0$ be any ridge regularization parameter. The prediction function fitted to the data and its associated risk are given by

$$\widehat{f}_{\sf KRR}^{\gamma}(\boldsymbol{x})=\mathbf{K}(\cdot,\boldsymbol{x})^\top(\gamma\mathbf{I}_n+\mathbf{K})^{-1}\boldsymbol{y},\qquad R_{\sf KRR}(f_*;\gamma):=\mathbb{E}_{\boldsymbol{x}_0}\big[(f_*(\boldsymbol{x}_0)-\widehat{f}_{\sf KRR}^{\gamma}(\boldsymbol{x}_0))^2\big].$$

We also define polynomial ridge regression as follows. The infinite-width kernel $K$ can be decomposed uniquely in orthogonal polynomials as $K(\boldsymbol{x},\boldsymbol{x}')=\sum_{k=0}^\infty\gamma_k Q_k^{(d)}(\langle\boldsymbol{x},\boldsymbol{x}'\rangle)$, where $Q_k^{(d)}(z)$ is the Gegenbauer polynomial of degree $k$ (see Lemma 1). We consider truncating the kernel function at degree $\ell$:

$$K^p(\boldsymbol{x},\boldsymbol{x}')=\sum_{k=0}^\ell\gamma_k Q_k^{(d)}(\langle\boldsymbol{x},\boldsymbol{x}'\rangle). \tag{19}$$

The superscript refers to the name “polynomial”. We also define

$$\mathbf{K}^p=\big(K^p(\boldsymbol{x}_i,\boldsymbol{x}_j)\big)_{i,j\leq n},\qquad\mathbf{K}^p(\cdot,\boldsymbol{x})=\big(K^p(\boldsymbol{x}_i,\boldsymbol{x})\big)_{i\leq n}. \tag{20}$$

For example, in the case $\ell=1$, we have $\mathbf{K}^p=\gamma_0\mathbf{1}_n\mathbf{1}_n^\top+\frac{\gamma_1}{d}\boldsymbol{X}\boldsymbol{X}^\top$ (the kernel of linear regression with an intercept). The prediction function fitted to the data and its associated risk are

$$\widehat{f}_{\sf PRR}^{\gamma}(\boldsymbol{x})=\mathbf{K}^p(\cdot,\boldsymbol{x})^\top(\gamma\mathbf{I}_n+\mathbf{K}^p)^{-1}\boldsymbol{y},\qquad R_{\sf PRR}(f_*;\gamma):=\mathbb{E}_{\boldsymbol{x}_0}\big[(f_*(\boldsymbol{x}_0)-\widehat{f}_{\sf PRR}^{\gamma}(\boldsymbol{x}_0))^2\big].$$
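A minimal sketch of this polynomial (degree-$\ell=1$) ridge regression in kernel form follows (illustrative only); the Gegenbauer coefficients `gamma0` and `gamma1` are treated as given inputs, since their values depend on the activation through the decomposition (19).

```python
import numpy as np

def poly_ridge_predict(X_train, y, X_test, gamma0, gamma1, reg):
    """Degree-1 polynomial ridge regression in kernel form: K^p = gamma0 * 1 1^T + (gamma1/d) X X^T.

    Prediction at x: K^p(., x)^T (reg*I + K^p)^{-1} y, as in the display above.
    """
    n, d = X_train.shape
    Kp = gamma0 + (gamma1 / d) * (X_train @ X_train.T)        # (n, n) kernel matrix of Eq. (20)
    Kp_test = gamma0 + (gamma1 / d) * (X_test @ X_train.T)    # rows are K^p(., x_0)^T for each test x_0
    return Kp_test @ np.linalg.solve(reg * np.eye(n) + Kp, y)
```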

Kernel ridge regression and polynomial ridge regression are well understood. Our next result establishes a relation between these risks: the neural tangent model behaves as the polynomial model, albeit with a different value of the regularization parameter.

Theorem 3.3.

Assume $(\boldsymbol{x}_i)_{i\leq n}\sim_{\mathrm{i.i.d.}}\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$ and $(\boldsymbol{w}_k)_{k\leq N}\sim_{\mathrm{i.i.d.}}\mathrm{Unif}(\mathbb{S}^{d-1}(1))$. Recall $v(\sigma):=\sum_{k\geq\ell}[\mu_k(\sigma')]^2$ and let $B,c_0>0$, $\ell\in\mathbb{N}$ be fixed. Then, there exist constants $C_0,C,C'>0$ such that the following holds.

If Assumptions 3.1 and 3.2 hold with constants $B,c_0,\ell$, and $Nd/(\log(Nd))^C\geq n$, then for any $\lambda\geq 0$,

$$R_{\sf NT}(f_*;\lambda)=R_{\sf KRR}(f_*;\lambda)+O_{d,\mathbb{P}}\Big(\tau^2\sqrt{\frac{n(\log(Nd))^{C'}}{Nd}}\Big)=R_{\sf PRR}(f_*;\lambda+v(\sigma))+O_{d,\mathbb{P}}\Big(\tau^2\sqrt{\frac{n(\log(Nd))^{C'}}{Nd}}+\tau^2\sqrt{\frac{n(\log n)^{C}}{d^{\ell+1}}}\Big),$$

where $\tau^2:=\|f_*\|_{L^2}^2+\sigma_\varepsilon^2$.

Example: $n\ll d^2$.

Suppose Assumptions 3.1 and 3.2 hold with $\ell=1$, i.e. $c_0 d\leq n\leq d^2/(\log d)^{C_0}$. In this case, Theorem 3.3 implies that NT regression can only fit the linear component of the target function $f_*(\boldsymbol{x})$. In order to simplify our treatment, we assume that the target is linear: $f_*(\boldsymbol{x})=\langle\boldsymbol{\beta}_*,\boldsymbol{x}\rangle$.

Consider ridge regression with respect to the linear features, with regularization $\gamma\geq 0$:

$$\widehat{\boldsymbol{\beta}}(\gamma):=\arg\min_{\boldsymbol{\beta}\in\mathbb{R}^d}\Big\{\frac{1}{d}\sum_{i=1}^n\big(y_i-\langle\boldsymbol{\beta},\boldsymbol{x}_i\rangle\big)^2+\gamma\|\boldsymbol{\beta}\|_2^2\Big\}\,, \tag{21}$$
$$R_{\rm lin}(f_*;\gamma):=\mathbb{E}_{\boldsymbol{x}_0}\big[(\langle\boldsymbol{\beta}_*,\boldsymbol{x}_0\rangle-\langle\widehat{\boldsymbol{\beta}}(\gamma),\boldsymbol{x}_0\rangle)^2\big]\,. \tag{22}$$

Note that $R_{\rm lin}(f_*;\gamma)$ is essentially the same as $R_{\sf PRR}(f_*;\lambda+v(\sigma))$, with the minor difference that we are not fitting an intercept. The scaling factor $d^{-1}$ is specially chosen for comparison with NT KRR. Theorem 3.3 implies the following correspondence.
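For completeness, the ridge estimator (21), with its $1/d$ scaling of the squared loss, has the closed form sketched below (illustrative only).

```python
import numpy as np

def lin_ridge_fit(X, y, gamma):
    """Ridge estimator of Eq. (21): minimizing (1/d) * sum_i (y_i - <beta, x_i>)^2 + gamma * ||beta||^2
    gives beta_hat = (X^T X / d + gamma * I_d)^{-1} X^T y / d."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X / d + gamma * np.eye(d), X.T @ y / d)
```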

Corollary 3.2 (NT KRR for linear model).

Assume $(\boldsymbol{x}_i)_{i\leq n}\sim_{\mathrm{i.i.d.}}\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$, $(\boldsymbol{w}_k)_{k\leq N}\sim_{\mathrm{i.i.d.}}\mathrm{Unif}(\mathbb{S}^{d-1}(1))$, and $\mathbb{E}(\varepsilon_i^4)\leq C\sigma_\varepsilon^4$. Denote $\tau^2=\|\boldsymbol{\beta}_*\|^2+\sigma_\varepsilon^2$. Then, there exist constants $C_0,C>0$ such that the following holds. Under Assumptions 3.1 and 3.2 with $\ell=1$ and $Nd/(\log(Nd))^C\geq n$,

$$R_{\sf NT}(f_*;\lambda)=R_{\rm lin}(f_*;\gamma_{\rm eff}(\lambda,\sigma))+O_{d,\mathbb{P}}\Big(\tau^2\sqrt{\frac{n(\log d)^C}{Nd}}+\frac{\tau^2}{\sqrt{d}}+\tau^2\sqrt{\frac{n(\log n)^C}{d^2}}\Big)\,,\qquad\text{where} \tag{23}$$
$$\gamma_{\rm eff}(\lambda,\sigma):=\frac{\lambda+v(\sigma)}{\{\mathbb{E}[\sigma'(G)]\}^2}\,. \tag{24}$$

Further,

$$R_{\rm lin}(f_*;\gamma)=\|\boldsymbol{\beta}_*\|_2^2\,\mathscr{B}_{\rm lin}(\gamma)+\sigma_\varepsilon^2\,\mathscr{V}_{\rm lin}(\gamma)+O_{d,\mathbb{P}}(\tau^2/\sqrt{d}), \tag{25}$$
$$\mathscr{B}_{\rm lin}(\gamma):=\frac{\gamma^2}{d}\mathrm{Tr}\Big(\big(\gamma\mathbf{I}_d+\boldsymbol{X}^{\sf T}\boldsymbol{X}/d\big)^{-2}\Big), \tag{26}$$
$$\mathscr{V}_{\rm lin}(\gamma):=\frac{1}{d^2}\mathrm{Tr}\Big(\boldsymbol{X}^{\sf T}\boldsymbol{X}\big(\gamma\mathbf{I}_d+\boldsymbol{X}^{\sf T}\boldsymbol{X}/d\big)^{-2}\Big)\,. \tag{27}$$
Remark 4.

The error term $O_{d,\mathbb{P}}(\tau^2 d^{-1/2})$ in (23) is due to the effect of the intercept. It is easy to derive the asymptotic formulas if $n/d\to\kappa\in(0,\infty)$, where $\kappa$ is a constant [HMRT19]:

$$\mathscr{B}_{\rm lin}(\kappa,\gamma)=\frac{1}{2}\left\{1-\kappa+\sqrt{(\kappa-1+\gamma)^2+4\gamma}-\frac{\gamma(1+\kappa+\gamma)}{\sqrt{(\kappa-1+\gamma)^2+4\gamma}}\right\}+o_{d,\mathbb{P}}(1)\,, \tag{28}$$
$$\mathscr{V}_{\rm lin}(\kappa,\gamma)=\frac{1}{2}\left\{-1+\frac{\kappa+\gamma+1}{\sqrt{(\kappa-1+\gamma)^2+4\gamma}}\right\}+o_{d,\mathbb{P}}(1)\,. \tag{29}$$

Here we emphasized the dependence on $\kappa$. Also, as $\kappa\to\infty$, $\mathscr{B}_{\rm lin}(\kappa,\gamma)=\gamma^2\kappa^{-2}+O(\kappa^{-3})$ and $\mathscr{V}_{\rm lin}(\kappa,\gamma)=\kappa^{-1}+O(\kappa^{-2})$.

In particular, the ridgeless NT model at $\lambda=0$ corresponds to linear regression with regularization $\gamma=v(\sigma)/\{\mathbb{E}[\sigma'(G)]\}^2$. Note that the denominator in (24) is due to the different scaling of the linear term (cf. Eq. (19)). Since $\mathscr{B}_{\rm lin}(\kappa,\gamma)=O(d^2/n^2)$ and $\mathscr{V}_{\rm lin}(\kappa,\gamma)=O(d/n)$ by the formulas (28) and (29), we find that the error term of order $\sqrt{n/Nd}$ in Eq. (23) is negligible for $1\leq n/d\ll N^{1/3}$. We leave to future work the problem of obtaining optimal bounds on the error term.
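The finite-$d$ quantities (26)–(27) and the asymptotic formulas (28)–(29) can be compared numerically with a short script (illustrative only; covariates are drawn on the sphere as in the paper, and the two outputs should roughly agree at moderate $d$).

```python
import numpy as np

def B_V_lin_empirical(X, gamma):
    """Finite-d bias/variance factors of Eqs. (26)-(27)."""
    n, d = X.shape
    M = np.linalg.inv(gamma * np.eye(d) + X.T @ X / d)
    return gamma**2 * np.trace(M @ M) / d, np.trace(X.T @ X @ M @ M) / d**2

def B_V_lin_asymptotic(kappa, gamma):
    """Closed-form limits of Eqs. (28)-(29) as n/d -> kappa."""
    s = np.sqrt((kappa - 1 + gamma) ** 2 + 4 * gamma)
    B = 0.5 * (1 - kappa + s - gamma * (1 + kappa + gamma) / s)
    V = 0.5 * (-1 + (kappa + gamma + 1) / s)
    return B, V

rng = np.random.default_rng(6)
d, n, gamma = 200, 600, 1.0                               # kappa = n/d = 3
X = rng.standard_normal((n, d))
X *= np.sqrt(d) / np.linalg.norm(X, axis=1, keepdims=True)
print(B_V_lin_empirical(X, gamma))
print(B_V_lin_asymptotic(n / d, gamma))
```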

3.5 An upper bound on the memorization capacity

In the regression setting, naively counting degrees of freedom would imply that the memorization capacity of a two-layer network with $N$ neurons is at most given by the number of parameters, namely $N(d+1)$. More careful consideration reveals that, as long as the data $\{(\boldsymbol{x}_i,y_i)\}_{i\leq n}$ are in generic positions, $N=O(1)$ neurons are sufficient to achieve memorization within any fixed accuracy, using a ‘sawlike’ activation function.

These constructions are of course fragile and a non-trivial upper bound on the memorization capacity can be obtained if we constrain the weights. In this section, we prove such an upper bound. We state it for binary classification, but it is immediate to see that it implies an upper bound on regression which is of the same order.

To simplify the argument, we assume $(\boldsymbol{x}_i)_{i\leq n}\sim_{\mathrm{i.i.d.}}\mathcal{N}(\mathbf{0},\mathbf{I}_d)$ (which is very close to the setting $(\boldsymbol{x}_i)_{i\leq n}\sim_{\mathrm{i.i.d.}}\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$) and $y_1,\ldots,y_n\sim_{\mathrm{i.i.d.}}\mathrm{Unif}(\{+1,-1\})$. Consider a subset of two-layers neural networks, obtained by constraining the magnitude of the parameters:

$$\mathcal{F}_{\sf NN}^{N,L}:=\Big\{f(\boldsymbol{x};\boldsymbol{b},\boldsymbol{W})=\sum_{k=1}^N b_k\,\sigma\big(\langle\boldsymbol{w}_k,\boldsymbol{x}\rangle\big):\;\;N^{-1/2}\|\boldsymbol{b}\|\leq L,\ \|\boldsymbol{w}_k\|\leq L,~\forall\,k\in[N]\Big\}.$$

To get binary outputs from $f(\boldsymbol{x})\in\mathbb{R}$, we take the sign. In particular, the label predicted on sample $\boldsymbol{x}_i$ is $\mathrm{sign}(f(\boldsymbol{x}_i))$.

The value $y_i f(\boldsymbol{x}_i)$ is referred to as the margin for input $\boldsymbol{x}_i$. In regression we require $y_i f(\boldsymbol{x}_i)=y_i^2=1$ (the latter equality holds for $\{+1,-1\}$ labels), while in classification we ask for $y_i f(\boldsymbol{x}_i)>0$. The next result states that $Nd$ must be roughly of order $n$ in order for a neural network to fit a nontrivial fraction of the data with margin $\delta$.

Proposition 3.1.

Let Assumptions 3.1 and 3.2 hold, and further assume $(\boldsymbol{x}_i)_{i\leq n}\sim_{\mathrm{i.i.d.}}\mathcal{N}(\mathbf{0},\mathbf{I}_d)$. Fix constants $\eta_1,\eta_2>0$, and let $L=L_d\geq 1$, $\delta=\delta_d>0$ be general functions of $d$. Then there exists a constant $C$ such that the following holds.

Assume that, with probability larger than $\eta_1$, there exists a function $f\in\mathcal{F}_{\sf NN}^{N,L}$ such that

$$\frac{1}{n}\sum_{i=1}^n\mathbf{1}\{y_i f(\boldsymbol{x}_i)>\delta\}\geq\frac{1+\eta_2}{2}. \tag{30}$$

(In words, $f$ achieves margin $\delta$ in at least a fraction $(1+\eta_2)/2$ of the samples.)

Then we must have $n\leq C^{-1}Nd\log(Ld/\delta)$.

If $L$ and $1/\delta$ are upper bounded by a polynomial of $d$, then this result implies an upper bound on the network capacity of order $Nd\log d$. This matches the interpolation and invertibility thresholds of Corollary 3.1 and Theorem 3.1 up to a logarithmic factor. The proof follows a discretization–approximation argument, which can be found in the appendix.

Remark 5.

Notice that the memorization capacity upper bound $C^{-1}Nd\log(Ld/\delta)$ tends to infinity as the margin vanishes, $\delta\to 0$. This is not an artifact of the proof. As mentioned above, if we allow for an arbitrarily small margin, it is possible to construct a ‘sawlike’ activation function $\sigma$ such that the corresponding network correctly classifies $n$ points with binary labels $y_i\in\{+1,-1\}$ despite $n\gg Nd$. Note that our result is different from [YSJ19, BHLM19], in which the activation function is piecewise linear/polynomial.

Remark 6.

While we stated Proposition 3.1 for classification, it has obvious consequences for interpolation in regression. Indeed, considering $y_i\in\{+1,-1\}$, the interpolation constraint $f(\boldsymbol{x}_i)=y_i$ implies $y_i f(\boldsymbol{x}_i)\geq\delta_0=1$.

4 Related work

As discussed in the introduction, we addressed three questions: Q1: What is the maximum number of training samples nn that a network of given width NN can interpolate (or memorize)? Q2: Can such an interpolator be found efficiently? Q3: What are the generalization properties of the interpolator? While questions Q1 and Q2 have some history (which we briefly review next), much less is known about Q3, which is our main goal in this paper.

In the context of binary classification, question Q1 was first studied by Tom Cover [Cov65], who considered the case of a simple perceptron network (N=1N=1) when the feature vectors (𝒙i)in(\text{\boldmath$x$}_{i})_{i\leq n} are in a generic position, and the labels yi{+1,1}y_{i}\in\{+1,-1\} are independent and uniformly random. He proved that this model can memorize nn training samples with high probability if n2d(1ε)n\leq 2d(1-\varepsilon), and cannot memorize them with high probability if n2d(1+ε)n\geq 2d(1+\varepsilon). Following Cover, this maximum number of samples is sometimes referred to as the network capacity but, for greater clarity, we also use the expression network memorization capacity.

The case of two-layers network was studied by Baum [Bau88] who proved that, again for any set of points in general positions, the memorization capacity is at least NdNd. Upper bound of the same order were proved, among others, in [Sak92, Kow94]. Generalizations to multilayer networks were proven recently in [YSJ19, Ver20]. Can these networks be found efficiently? In the context of classification, the recent work of [Dan19] provides an efficient algorithm that can memorize all but a fraction ε\varepsilon of the training samples in polynomial time, provided the NdCn/ε2Nd\geq Cn/\varepsilon^{2}. For the case of Gaussian feature vectors, [Dan20] proves that exact memorization can be achieved efficiently provided NdCn(logd)4Nd\geq Cn(\log d)^{4}.

Here we are interested in achieving memorization in regression, which is more challenging than for classification. Indeed, in binary classification a function f:df:\mathbb{R}^{d}\to\mathbb{R} memorizes the data if yif(𝒙i)>0y_{i}f(\text{\boldmath$x$}_{i})>0 for all ini\leq n. On the other hand, in our setting, memorization amounts to f(𝒙i)=yif(\text{\boldmath$x$}_{i})=y_{i} for all ini\leq n. The techniques developed for binary classification exploit in a crucial way the flexibility provided by the inequality constraint, which we cannot do here.

In concurrent work, [BELM20] studied the interpolation properties of two-layer networks. Generalizing the construction of Baum [Bau88], they show that, for N\geq 4\lceil n/d\rceil, there exists a two-layer ReLU network interpolating n points in generic position. This is however unlikely to be the network produced by gradient-based training. They also construct a model that interpolates the data with error \varepsilon, provided Nd\geq n\log(1/\varepsilon). In contrast, we obtain exact interpolation provided Nd\geq n(\log Nd)^{C}. From a more fundamental point of view, our work does not only construct a network that memorizes the data, but also characterizes the eigenstructure of the kernel matrix. While our paper was under review, further papers on memorization were posted, including [VYS21, PLYS21], which study the effect of depth.

As discussed in the introduction, we focus here on the lazy or neural tangent regime in which weights change only slightly with respect to a random initialization [JGH18]. This regime attracted considerable attention over the last two years, although the focus has been so far on its implications for optimization, rather than on its statistical properties.

It was first shown in [DZPS18] that, for sufficiently overparametrized networks, and under suitable initializations, gradient-based training indeed converges to an interpolator that is well approximated by an NT model. The proof of [DZPS18] required (in the present context) N\geq Cn^{6}/\lambda_{\min}(\mathrm{\bf K})^{4}, where \mathrm{\bf K} is the infinite-width kernel of Eq. (7). This bound was improved over the last two years [DZPS18, AZLS19, LXS+19, ZCZG20, OS20, LZB20, WCL20]. In particular, [OS20] prove that, for Nd\geq Cn^{2}, gradient descent converges to an interpolator. The authors also point out the gap between this result and the natural lower bound Nd\gtrsim n.

A key step in the analysis of gradient descent in the NT regime is to prove that the tangent feature map at the initialization (the matrix 𝚽n×Nd\text{\boldmath$\Phi$}\in\mathbb{R}^{n\times Nd}) or, equivalently, the associated kernel (i.e. the matrix 𝐊N=𝚽𝚽\mathrm{\bf K}_{N}=\text{\boldmath$\Phi$}\text{\boldmath$\Phi$}^{\top}) is nonsingular. Our Theorem 3.1 establishes that this is the case for Ndn(logd)CNd\geq n(\log d)^{C}. As discussed in Section 5, this implies convergence of gradient descent under the near-optimal condition Ndn(logd)CNd\geq n(\log d)^{C}, under a suitable scaling of the weights [COB19]. More generally, our characterization of the eigenstructure of 𝐊N\mathrm{\bf K}_{N} is a foundational step towards a sharper analysis of the gradient descent dynamics.

As mentioned several times, our main contribution is a characterization of the test error of minimum norm regression in the NT model (1). We are not aware of any comparable results.

Upper bounds on the generalization error of neural networks based on NT theory were proved, among others, in [ADH+19, AZLL19, CG19, JT19, CCZG19, NCS19]. These works assume a more general data distribution than ours. However, their objectives are of a very different nature: we characterize the test error of interpolators, while most of these works do not consider interpolators. Our main result is a sharp characterization of the difference between the test error of NT regression (with a finite number of neurons N) and kernel ridge regression (corresponding to N=\infty), and between the latter and polynomial regression. None of the earlier works is sharp enough to control these quantities. In more detail:

  • The upper bound of [ADH+19] applies to interpolators but controls the generalization error by (𝒚,𝐊1𝒚/n)1/2(\langle\text{\boldmath$y$},\mathrm{\bf K}^{-1}\text{\boldmath$y$}\rangle/n)^{1/2} which, in the setting studied in the present paper, is a quantity of order one.

  • The upper bounds of [AZLL19, CG19] do not apply to interpolators since SGD is run in a one-pass fashion. Further they require large overparametrization, namely Nn7N\gtrsim n^{7} in [CG19] and NN depending on the generalization error in [AZLL19]. Finally, they bound generalization error rather than test error, and do not bound the difference between NT regression and kernel ridge regression.

  • The upper bounds of [JT19, CCZG19, NCS19] apply to classification and require a large margin condition. As before, these papers do not bound the difference between NT regression and kernel ridge regression.

Results similar to ours were recently obtained in the context of simpler random features models in [HMRT19, GMMM21, MM21, MRSY19, GMMM20, MMM21]. The models studied in these works correspond to two-layer networks in which the first layer is random and only the second layer is trained. Their analysis is simpler because the featurization map has independent coordinates.

Finally, the closest line of work in the literature is the one characterizing the risk of KRR and KRR interpolators in high dimensions [Bac17, LR20, GMMM21, BM19, LRZ20, MMM21]. It is easy to see that, in the infinite-width limit (N\to\infty at n,d fixed), NT regression converges to kernel ridge(–less) regression (KRR) with the rotationally invariant kernel \mathrm{\bf K}. Our work addresses the natural open question left by these studies: how large should N be for NT regression to approximate KRR with respect to the limit kernel? By considering scalings in which N,n,d are all large and polynomially related, we showed that NT regression already matches KRR when Nd/(\log Nd)^{C}\geq n.

5 Connections with gradient descent training of neural networks

In this section we discuss in greater detail the relation between the NT model (1) studied in the rest of the paper, and fully nonlinear two-layer neural networks. For clarity, we will adopt the following notation for the NT model:

f𝖭𝖳(𝒙;𝒂,𝑾):=1Nk=1N𝒂k,𝒙σ(𝒘k,𝒙).\displaystyle f_{{\sf NT}}(\text{\boldmath$x$};\text{\boldmath$a$},\text{\boldmath$W$}):=\frac{1}{\sqrt{N}}\sum_{k=1}^{N}\langle\text{\boldmath$a$}_{k},\text{\boldmath$x$}\rangle\sigma^{\prime}(\langle\text{\boldmath$w$}_{k},\text{\boldmath$x$}\rangle)\,. (31)

We changed the normalization purely for aesthetic reasons: since the model is linear, this has no impact on the min-norm interpolant. We will compare it to the following neural network

f𝖭𝖭(𝒙;𝑾~):=αNk=12Nbkσ(𝒘~k,𝒙),b1==bN=+1,bN+1==b2N=1.\displaystyle f_{{\sf NN}}(\text{\boldmath$x$};\widetilde{\text{\boldmath$W$}}):=\frac{\alpha}{\sqrt{N}}\sum_{k=1}^{2N}b_{k}\sigma(\langle\widetilde{\text{\boldmath$w$}}_{k},\text{\boldmath$x$}\rangle),~{}~{}b_{1}=\dots=b_{N}=+1\,,\;b_{N+1}=\dots=b_{2N}=-1\,. (32)

Note that the network has 2N2N hidden units and the second layer weights bkb_{k} are fixed. We train the first-layer weights using gradient flow with respect to the empirical risk

\displaystyle\frac{{\rm d}\phantom{t}}{{\rm d}t}\widetilde{\text{\boldmath$W$}}^{t}=-\nabla_{\text{\boldmath$W$}}\widehat{R}_{n}(\widetilde{\text{\boldmath$W$}}^{t})\,,\;\;\;\;\text{where}~{}~{}\widehat{R}_{n}(\widetilde{\text{\boldmath$W$}}):=\frac{1}{n}\sum_{i=1}^{n}\big{(}y_{i}-f_{{\sf NN}}(\text{\boldmath$x$}_{i};\widetilde{\text{\boldmath$W$}})\big{)}^{2}\,. (33)

We use the following initialization:

\displaystyle\widetilde{\text{\boldmath$w$}}^{0}_{k}=\widetilde{\text{\boldmath$w$}}^{0}_{N+k}=\text{\boldmath$w$}_{k}\,\;\;\;\forall k\leq N\,. (34)

Under this initialization, the network evaluates to 0 at t=0: namely, f_{{\sf NN}}(\text{\boldmath$x$};\widetilde{\text{\boldmath$W$}}^{0})=0 for all \text{\boldmath$x$}\in\mathbb{R}^{d}. Let us also emphasize that the weights \text{\boldmath$w$}_{k} in the NT model (31) are chosen to match the ones in the initialization (34): in particular, we can take (\text{\boldmath$w$}_{k})_{k\leq N}\sim_{\mathrm{i.i.d.}}\mathrm{Unif}(\mathbb{S}^{d-1}(1)) to recover the neural tangent model treated in the rest of the paper. This symmetrization is a standard technique [COB19]: it simplifies the analysis precisely because it enforces f_{{\sf NN}}(\text{\boldmath$x$};\widetilde{\text{\boldmath$W$}}^{0})=0. We refer to Remark 7 (ii) for further explanation.
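The following minimal numerical sketch (NumPy; the choice \sigma=\tanh, the dimensions, and all variable names are ours, purely for illustration) checks the two facts just discussed: the symmetrized initialization (34) makes f_{{\sf NN}}(\,\cdot\,;\widetilde{\text{\boldmath$W$}}^{0}) vanish, and the first-order response of the network (32) to a perturbation of its first N weight vectors is \alpha times the NT model (31).

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, alpha, eps = 20, 50, 1.0, 1e-5
sigma, dsigma = np.tanh, lambda t: 1.0 / np.cosh(t) ** 2        # illustrative activation

W = rng.normal(size=(N, d)); W /= np.linalg.norm(W, axis=1, keepdims=True)  # w_k on the unit sphere
W0 = np.vstack([W, W])                                          # initialization (34): both halves equal
b = np.concatenate([np.ones(N), -np.ones(N)])                   # fixed second layer of Eq. (32)

def f_nn(x, Wt):                                                # the network of Eq. (32)
    return alpha / np.sqrt(N) * b @ sigma(Wt @ x)

def f_nt(x, A):                                                 # the NT model of Eq. (31)
    return np.sum((A @ x) * dsigma(W @ x)) / np.sqrt(N)

x = rng.normal(size=d); x *= np.sqrt(d) / np.linalg.norm(x)     # a point on S^{d-1}(sqrt(d))
A = rng.normal(size=(N, d))                                     # an arbitrary perturbation direction

print(f_nn(x, W0))                                              # = 0 by the +/- symmetry
# first-order response of the NN to perturbing the first N rows = alpha times the NT model
print((f_nn(x, W0 + eps * np.vstack([A, np.zeros((N, d))])) - f_nn(x, W0)) / eps,
      alpha * f_nt(x, A))
```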

The next theorem is a slight modification of [BMR21, Theorem 5.4] which in turn is a refined version of the analysis of [COB19, OS20].

Theorem 5.1.

Consider the two-layer neural network of (32) trained with gradient flow from initialization (34), and the associated NT model of Eq. (31). Assume (𝐰k)kNi.i.d.Unif(𝕊d1(1))(\text{\boldmath$w$}_{k})_{k\leq N}\sim_{\mathrm{i.i.d.}}\mathrm{Unif}(\mathbb{S}^{d-1}(1)), the activation function to have bounded second derivative supt|σ′′(t)|C\sup_{t\in\mathbb{R}}|\sigma^{\prime\prime}(t)|\leq C, and its Hermite coefficients to satisfy μk(σ)0\mu_{k}(\sigma)\neq 0 for all k0k\leq\ell_{0} for some constant 0\ell_{0}. Further assume {(𝐱i,yi)}in\{(\text{\boldmath$x$}_{i},y_{i})\}_{i\leq n} are i.i.d. with 𝐱iUnif(𝕊d1(d))\text{\boldmath$x$}_{i}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d})) and yiy_{i} to be CC-subgaussian.

Then there exist constants CiC_{i}, depending uniquely on σ,0\sigma,\ell_{0}, such that if dnd0/(logd)C0d\leq n\leq d^{\ell_{0}}/(\log d)^{C_{0}}, as well as

nNd(logNd)C0 and αC0n2Nd,\displaystyle n\leq\frac{Nd}{(\log Nd)^{C_{0}}}\;\;\;\mbox{ and }\;\;\;\alpha\geq C_{0}\sqrt{\frac{n^{2}}{Nd}}\,, (35)

then, with probability at least 12exp{n/C0}1-2\exp\{-n/C_{0}\}, the following holds.

  1. 1.

    Gradient flow converges exponentially fast to a global minimizer. Specifically, letting λ=λmin(𝐊N)α2d/(4n)\lambda_{*}=\lambda_{\min}(\mathrm{\bf K}_{N})\alpha^{2}d/(4n), we have, for all t0t\geq 0,

    R^n(𝑾~t)R^n(𝑾~0)eλt.\displaystyle\widehat{R}_{n}(\widetilde{\text{\boldmath$W$}}^{t})\leq\widehat{R}_{n}(\widetilde{\text{\boldmath$W$}}^{0})\,e^{-\lambda_{*}t}\,. (36)

    In particular the rate is lower bounded as λC1α2d/n\lambda_{*}\geq C_{1}\alpha^{2}d/n by Theorem 3.1.

  2. 2.

    The model learned by gradient flow and min-2\ell_{2} norm interpolant are similar on test data. Namely, writing f𝖭𝖭(𝑾~):=f𝖭𝖭(;𝑾~)f_{{\sf NN}}(\widetilde{\text{\boldmath$W$}}):=f_{{\sf NN}}(\,\cdot\,;\widetilde{\text{\boldmath$W$}}) and f𝖭𝖳(𝑾):=f𝖭𝖳(;𝒂,𝑾)f_{{\sf NT}}(\text{\boldmath$W$}):=f_{{\sf NT}}(\,\cdot\,;\text{\boldmath$a$},\text{\boldmath$W$}), we have

    lim suptf𝖭𝖭(𝑾~t)f𝖭𝖳(𝒂^,𝑾)L2()C1{1αn2Nd+1α2n5Nd4},\displaystyle\limsup_{t\to\infty}\|f_{{\sf NN}}(\widetilde{\text{\boldmath$W$}}^{t})-f_{{\sf NT}}(\widehat{\text{\boldmath$a$}},\text{\boldmath$W$})\|_{L^{2}(\mathbb{P})}\leq C_{1}\left\{\frac{1}{\alpha}\sqrt{\frac{n^{2}}{Nd}}+\frac{1}{\alpha^{2}}\sqrt{\frac{n^{5}}{Nd^{4}}}\right\}\,, (37)

    for 𝒂^\widehat{\text{\boldmath$a$}} the coefficients of the min-2\ell_{2} norm interpolant of Eq. (3).

Notice that any generic activation function satisfies the condition \mu_{k}(\sigma)\neq 0 for all k\leq\ell_{0}. For instance, a sigmoid or a smoothed ReLU with a generic offset satisfies the assumptions of this theorem.

The only difference with respect to [BMR21, Theorem 5.4] is that we assume 𝒙iUnif(𝕊d1(d))\text{\boldmath$x$}_{i}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d})) while [BMR21] assumes 𝒙i𝒩(𝟎,𝑰d)\text{\boldmath$x$}_{i}\sim{\mathcal{N}}(\mathrm{\bf 0},{\boldsymbol{I}}_{d}). However, it is immediate to adapt the proof of [BMR21], in particular using Theorem 3.1 to control the minimum eigenvalue of the kernel at initialization.

Note that Theorem 5.1 requires the activation function σ\sigma to be smoother than other results in this paper (it requires a bounded second derivative). This assumption is used to bound the Lipschitz constant of the Jacobian 𝐃𝑾f𝖭𝖭(𝒙;𝑾)\mathrm{\bf D}_{\text{\boldmath$W$}}f_{{\sf NN}}(\text{\boldmath$x$};\text{\boldmath$W$}) uniformly over 𝑾W. We expect that a more careful analysis could avoid assuming such a strong uniform bound.

The difference in test errors between the neural network (32) trained via gradient flow and the min-norm NT interpolant studied in this paper can be bounded using Eq. (37) by triangular inequality. Also, while we focus for simplicity on gradient flow, similar results hold for gradient descent with small enough step size.

A few remarks are in order.

Remark 7.

Even among two-layer fully connected neural networks, model (32) presents some simplifications:

  • (i)(i)

Second-layer weights are not trained and are fixed to b_{j}\in\{+1,-1\}. If the second-layer weights are also trained, the NT model (31) needs to be modified as follows:

    f𝖭𝖳(𝒙;𝒂,𝑾):=1Ni=1N𝒂i,𝒙σ(𝒘i,𝒙)+1Ni=1Nai~σ(𝒘i,𝒙).\displaystyle f_{{\sf NT}}(\text{\boldmath$x$};\text{\boldmath$a$},\text{\boldmath$W$}):=\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\langle\text{\boldmath$a$}_{i},\text{\boldmath$x$}\rangle\sigma^{\prime}(\langle\text{\boldmath$w$}_{i},\text{\boldmath$x$}\rangle)+\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\widetilde{a_{i}}\sigma(\langle\text{\boldmath$w$}_{i},\text{\boldmath$x$}\rangle)\,. (38)

    The new model has N(d+1)N(d+1) parameters 𝒂=(𝒂1,,𝒂N;a~1,,a~N)\text{\boldmath$a$}=(\text{\boldmath$a$}_{1},\dots,\text{\boldmath$a$}_{N};\widetilde{a}_{1},\dots,\widetilde{a}_{N}). Notice that, for large dimension, the additional number of parameters is negligible. Indeed, going through the proofs reveals that our analysis can be extended to this case at the price of additional notational burden, but without changing the results.

    We also point out that the model in which only second layer weights are trained, i.e. we set 𝒂i=0\text{\boldmath$a$}_{i}=0 in Eq. (38), has been studied in detail in [GMMM21, MMM21]. These papers support the above parameter-counting heuristics.

  • (ii)(ii)

The specific initialization (34) is chosen so that f_{{\sf NN}}(\text{\boldmath$x$};\widetilde{\text{\boldmath$W$}}^{0})=0 identically. If this initialization were modified (for instance, by taking the (\widetilde{\text{\boldmath$w$}}_{k})_{k\leq 2N} independent), two main elements of the analysis would change. First, the target model f_{*}(\text{\boldmath$x$}) should be replaced by the difference between target and initialization, f_{*}(\text{\boldmath$x$})-f_{{\sf NN}}(\text{\boldmath$x$};\widetilde{\text{\boldmath$W$}}^{0}). Second, and more importantly, the approximation results in Theorem 5.1 become weaker: we refer to [BMR21] for a comparison.

Remark 8.

Within the setting of Theorem 5.1 (in particular yiy_{i} being CC-subgaussian), the null risk fL22\|f_{*}\|^{2}_{L^{2}} is of order one, and therefore we should compare the right-hand side of Eq. (37) with fL2=Θ(1)\|f_{*}\|_{L^{2}}=\Theta(1). We can point at two specific regimes in which this upper bound guarantees that f𝖭𝖭(𝑾~)f𝖭𝖳(𝒂^,𝑾)L2fL2\|f_{{\sf NN}}(\widetilde{\text{\boldmath$W$}}^{\infty})-f_{{\sf NT}}(\widehat{\text{\boldmath$a$}},\text{\boldmath$W$})\|_{L^{2}}\ll\|f_{*}\|_{L^{2}}:

  • (i)(i)

First, letting \alpha grow, this upper bound vanishes. More precisely, it becomes much smaller than one provided \alpha\gg(n^{2}/Nd)^{1/2}\vee(n^{5}/Nd^{4})^{1/4}. By training the two-layer neural network with a large scaling parameter \alpha, we obtain a model that is well approximated by the NT model as soon as the overparametrization condition n\leq Nd/(\log Nd)^{C} is satisfied.

The role of this scaling of the network parameters, and its generality, were first pointed out in [COB19]. It implies that the theory developed in this paper applies to nonlinear neural networks under a specific initialization and training scheme.

  • (ii)(ii)

A standard initialization rule suggests taking \alpha=\Theta(1). In this case, the right-hand side of Eq. (37) becomes small for wide enough networks, namely Nd\gg n^{2}\vee(n^{5}/d^{4}). This condition is stronger than the overparametrization condition under which we carried out our analysis of the NT model.

    This means that, for α=Θ(1)\alpha=\Theta(1) and n(logn)CNdn2(n5/d4)n(\log n)^{C}\ll Nd\lesssim n^{2}\vee(n^{5}/d^{4}), we can apply our analysis to neural tangent models but not to actual neural networks. On the other hand, the condition Ndn2(n5/d4)Nd\gg n^{2}\vee(n^{5}/d^{4}) in Theorem 5.1 is likely to be a proof artifact, and we expect that it will be improved in the future. In fact we believe that the refined characterization of the NT model in the present paper is a foundational step towards such improvements.

Remark 9.

Theorem 5.1 relates the large-time limit of GD-trained neural networks to the minimum \ell_{2}-norm NT interpolator, corresponding to the case \lambda=0 of Eq. (17). The review paper [BMR21] proves a similar bound relating the NN and NT models, trained via gradient descent, at all times t. Our main result, Theorem 3.3, characterizes the test error of NT regression with general \lambda\geq 0. Going through the proof of [BMR21, Theorem 5.4], it can be seen that a non-vanishing \lambda>0 corresponds to regularizing GD training of the two-layer neural network by the penalty \|\text{\boldmath$W$}-\text{\boldmath$W$}^{0}\|_{F}^{2}=\sum_{\ell\leq N}\|\text{\boldmath$w$}_{\ell}-\text{\boldmath$w$}_{\ell}^{0}\|_{2}^{2}.
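Concretely (a schematic form only: the constant c>0 in front of the penalty depends on \lambda, \alpha and on the normalizations of Eqs. (17) and (31)–(32), which we do not track here), this correspondence amounts to running the gradient flow (33) on a ridge-penalized empirical risk of the form

\widehat{R}_{n,\lambda}(\widetilde{\text{\boldmath$W$}}):=\frac{1}{n}\sum_{i=1}^{n}\big{(}y_{i}-f_{{\sf NN}}(\text{\boldmath$x$}_{i};\widetilde{\text{\boldmath$W$}})\big{)}^{2}+c\,\big{\|}\widetilde{\text{\boldmath$W$}}-\widetilde{\text{\boldmath$W$}}^{0}\big{\|}_{F}^{2}\,.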

6 Technical background

This section provides a very short introduction to Hermite polynomials and Gegenbauer polynomials. More background can be found, for example, in [AH12, CC14, GMMM21]. Let \rho be the standard Gaussian measure on \mathbb{R}, namely \rho(dx)=(2\pi)^{-1/2}e^{-x^{2}/2}\;dx. The space L^{2}(\mathbb{R},\rho) is a Hilbert space with the inner product \langle f,g\rangle_{L^{2}(\mathbb{R},\rho)}=\mathbb{E}_{G\sim\rho}[f(G)g(G)].

The Hermite polynomials form a complete orthogonal basis of L2(,ρ)L^{2}(\mathbb{R},\rho). Throughout this paper, we will use normalized Hermite polynomials {hk}k0\{h_{k}\}_{k\geq 0}:

𝔼[hk(G)hj(G)]=δkj.\mathbb{E}\big{[}h_{k}(G)h_{j}(G)\big{]}=\delta_{kj}.

where \delta_{kj} is the Kronecker delta, namely \delta_{kj}=0 if k\neq j and \delta_{kj}=1 if k=j. For example, the first four Hermite polynomials are given by h_{0}(x)=1, h_{1}(x)=x, h_{2}(x)=\frac{1}{\sqrt{2}}(x^{2}-1) and h_{3}(x)=\frac{1}{\sqrt{6}}(x^{3}-3x). For any function f\in L^{2}(\mathbb{R},\rho), we have the decomposition

f=k=0f,hkL2(,ρ)hk.f=\sum_{k=0}^{\infty}\langle f,h_{k}\rangle_{L^{2}(\mathbb{R},\rho)}\;h_{k}.

In particular, if σL2(,ρ)\sigma^{\prime}\in L^{2}(\mathbb{R},\rho), we denote μk=σ,hkL2(,ρ)\mu_{k}=\langle\sigma^{\prime},h_{k}\rangle_{L^{2}(\mathbb{R},\rho)} and have

σ=k=0μkhk.\sigma^{\prime}=\sum_{k=0}^{\infty}\mu_{k}h_{k}. (39)
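As a quick numerical illustration of this expansion (a sketch only, not part of the analysis; it assumes NumPy is available), the coefficients \mu_{k}=\langle\sigma^{\prime},h_{k}\rangle_{L^{2}(\mathbb{R},\rho)} can be estimated by Gauss–Hermite quadrature, using the normalized probabilists' Hermite polynomials defined above.

```python
import numpy as np
from numpy.polynomial import hermite_e as He
from math import factorial, sqrt

def hermite_coeffs(dsigma, kmax=6, n_quad=80):
    # Gauss-HermiteE nodes/weights integrate against exp(-x^2/2);
    # dividing the weights by sqrt(2*pi) turns this into the Gaussian measure rho.
    nodes, weights = He.hermegauss(n_quad)
    weights = weights / np.sqrt(2 * np.pi)
    mus = []
    for k in range(kmax + 1):
        h_k = He.hermeval(nodes, [0] * k + [1]) / sqrt(factorial(k))  # normalized h_k
        mus.append(float(np.sum(weights * dsigma(nodes) * h_k)))
    return np.array(mus)

# Example: sigma = tanh, so sigma'(t) = 1 - tanh(t)^2 is even and the odd mu_k vanish.
print(hermite_coeffs(lambda t: 1.0 - np.tanh(t) ** 2))
```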

Let \tau_{d-1} be the uniform probability measure on \mathbb{S}^{d-1}(\sqrt{d}), \widetilde{\tau}_{d-1}^{1} be the probability measure of \sqrt{d}\,\langle\text{\boldmath$x$},\text{\boldmath$e$}_{1}\rangle where \text{\boldmath$x$}\sim\tau_{d-1}, and \tau_{d-1}^{1} be the probability measure of \langle\text{\boldmath$x$},\text{\boldmath$e$}_{1}\rangle. The Gegenbauer polynomials (Q_{k}^{(d)})_{k=0}^{\infty} form a basis of L^{2}([-d,d],\widetilde{\tau}_{d-1}^{1}) (for simplicity, we may write it as L^{2} unless confusion arises), where Q_{k}^{(d)} is a polynomial of degree k and they satisfy the normalization

Qk(d),Qj(d)L2=1B(d,k)δjk\langle Q_{k}^{(d)},Q_{j}^{(d)}\rangle_{L^{2}}=\frac{1}{B(d,k)}\delta_{jk}

where B(d,k)B(d,k) is a dimension parameter that is monotonically increasing in kk and satisfies B(d,k)=(1+od(1))dk/k!B(d,k)=(1+o_{d}(1))d^{k}/k!. This normalization guarantees Qk(d)(d)=1Q_{k}^{(d)}(d)=1. The following are some properties of Gegenbauer polynomials we will use.

  • (a)(a)

    For 𝒙,𝒚𝕊d1(d)\text{\boldmath$x$},\text{\boldmath$y$}\in\mathbb{S}^{d-1}(\sqrt{d}),

    Qk(d)(𝒙,),Qj(d)(𝒚,)L2(𝕊d1(d),τd1)=1B(d,k)δjkQk(d)(𝒙,𝒚).\big{\langle}Q_{k}^{(d)}(\langle\text{\boldmath$x$},\cdot\rangle),Q_{j}^{(d)}(\langle\text{\boldmath$y$},\cdot\rangle)\big{\rangle}_{L^{2}(\mathbb{S}^{d-1}(\sqrt{d}),\tau_{d-1})}=\frac{1}{B(d,k)}\delta_{jk}Q_{k}^{(d)}(\langle\text{\boldmath$x$},\text{\boldmath$y$}\rangle). (40)
  • (b)(b)

    For 𝒙,𝒚𝕊d1(d)\text{\boldmath$x$},\text{\boldmath$y$}\in\mathbb{S}^{d-1}(\sqrt{d}),

    Qk(d)(𝒙,𝒚)=1B(d,k)i=1B(d,k)Yki(d)(𝒙)Yki(d)(𝒚),Q_{k}^{(d)}(\langle\text{\boldmath$x$},\text{\boldmath$y$}\rangle)=\frac{1}{B(d,k)}\sum_{i=1}^{B(d,k)}Y_{ki}^{(d)}(\text{\boldmath$x$})Y_{ki}^{(d)}(\text{\boldmath$y$}), (41)

    where Yk,1(d),,Yk,B(d,k)(d)Y_{k,1}^{(d)},\ldots,Y_{k,B(d,k)}^{(d)} are normalized spherical harmonics of degree kk; more precisely, each Yk,i(d)Y_{k,i}^{(d)} is a polynomial of degree kk and they satisfy

    Yk,i(d),Ym,j(d)L2(𝕊d1(d),τd1)=δkmδij.\langle Y_{k,i}^{(d)},Y_{m,j}^{(d)}\rangle_{L^{2}(\mathbb{S}^{d-1}(\sqrt{d}),\tau_{d-1})}=\delta_{km}\delta_{ij}. (42)
  • (c)(c)

    Recurrence formula. For t[d,d]t\in[-d,d],

    tdQk(d)(t)=k2k+d2Qk1(d)(t)+k+d22k+d2Qk+1(d)(t).\frac{t}{d}Q_{k}^{(d)}(t)=\frac{k}{2k+d-2}Q_{k-1}^{(d)}(t)+\frac{k+d-2}{2k+d-2}Q_{k+1}^{(d)}(t). (43)
  • (d)(d)

    Connection to Hermite polynomials. If fL2(,ρ)L2([d,d],τd11)f\in L^{2}(\mathbb{R},\rho)\cap L^{2}([-\sqrt{d},\sqrt{d}],\tau_{d-1}^{1}), then

    μk(f)=limdB(d,k)f,Qk(d)(d)L2([d,d],τd11).\mu_{k}(f)=\lim_{d\to\infty}\sqrt{B(d,k)}\,\langle f,Q_{k}^{(d)}(\sqrt{d}\,\cdot)\rangle_{L^{2}([-\sqrt{d},\sqrt{d}],\tau_{d-1}^{1})}. (44)

Note that point (b) and the Cauchy-Schwarz inequality imply \max_{|t|\leq d}|Q_{k}^{(d)}(t)|\leq 1. If the function \sigma^{\prime} satisfies \sigma^{\prime}\in L^{2}([-\sqrt{d},\sqrt{d}],\tau_{d-1}^{1}), then we can decompose \sigma^{\prime} using Gegenbauer polynomials:

σ(x)=k=0λd,k(σ)B(d,k)Qk(d)(dx),where\displaystyle\sigma^{\prime}(x)=\sum_{k=0}^{\infty}\lambda_{d,k}(\sigma^{\prime})B(d,k)Q_{k}^{(d)}(\sqrt{d}\,x),\qquad\text{where} (45)
λd,k(σ):=σ,Qk(d)(d)L2([d,d],τd11).\displaystyle\lambda_{d,k}(\sigma^{\prime}):=\langle\sigma^{\prime},Q_{k}^{(d)}(\sqrt{d}\,\cdot)\big{\rangle}_{L^{2}([-\sqrt{d},\sqrt{d}],\tau_{d-1}^{1})}. (46)

We also have \|\sigma^{\prime}\|_{L^{2}([-\sqrt{d},\sqrt{d}],\tau_{d-1}^{1})}^{2}=\sum_{k=0}^{\infty}B(d,k)[\lambda_{d,k}(\sigma^{\prime})]^{2}. We sometimes drop the subscript d, writing \lambda_{k} for \lambda_{d,k}, if no confusion arises.
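For concreteness, the recurrence (c) gives a simple way to evaluate Q_{k}^{(d)} numerically; the short sketch below (NumPy, our own illustration) builds Q_{k}^{(d)} from Q_{0}^{(d)}=1 and Q_{1}^{(d)}(t)=t/d (the latter follows from the k=0 case of (43) together with Q_{0}^{(d)}=1) and checks the normalization Q_{k}^{(d)}(d)=1.

```python
import numpy as np

def gegenbauer_Q(k, t, d):
    """Evaluate Q_k^{(d)}(t) via the three-term recurrence (43),
    starting from Q_0 = 1 and Q_1(t) = t/d (the k = 0 case of the recurrence)."""
    t = np.asarray(t, dtype=float)
    Q_prev, Q_curr = np.ones_like(t), t / d
    if k == 0:
        return Q_prev
    for j in range(1, k):
        # (t/d) Q_j = j/(2j+d-2) Q_{j-1} + (j+d-2)/(2j+d-2) Q_{j+1}, solved for Q_{j+1}
        Q_next = ((2 * j + d - 2) * (t / d) * Q_curr - j * Q_prev) / (j + d - 2)
        Q_prev, Q_curr = Q_curr, Q_next
    return Q_curr

d = 50
print([float(gegenbauer_Q(k, d, d)) for k in range(6)])   # normalization: all equal 1
```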

7 Kernel invertibility and concentration: Proof of Theorems 3.1 and 3.2

Our analysis starts with a study of the infinite-width kernel 𝐊\mathrm{\bf K}. In our setting, we will show that 𝐊\mathrm{\bf K} is essentially a regularized low-degree polynomial kernel plus a small ‘noise’. Such a decomposition will play an essential role in our analysis of the generalization error.

To ease notation, we will henceforth write the target function f_{*} simply as f.

7.1 Infinite-width kernel decomposition

Recall that we denote by (Qk(d))k0(Q_{k}^{(d)})_{k\geq 0} the Gegenbauer polynomials in dd dimensions, and denote by (Ykt)tB(d,k)(Y_{kt})_{t\leq B(d,k)} the normalized spherical harmonics of degree kk in dd dimensions. We denote by 𝚿=kn×B(d,k)\text{\boldmath$\Psi$}_{=k}\in\mathbb{R}^{n\times B(d,k)} the degree-kk spherical harmonics evaluated at nn data points, and denote by 𝚿k\text{\boldmath$\Psi$}_{\leq k} the concatenation of these matrices of degree no larger than kk; namely,

𝚿=[𝚿=0,,𝚿=],where𝚿=k=(Ykt(𝒙i))in,tB(d,k).\text{\boldmath$\Psi$}_{\leq\ell}=\big{[}\text{\boldmath$\Psi$}_{=0},\ldots,\text{\boldmath$\Psi$}_{=\ell}\big{]},\qquad\text{where}~{}\text{\boldmath$\Psi$}_{=k}=\big{(}Y_{kt}(\text{\boldmath$x$}_{i})\big{)}_{i\leq n,t\leq B(d,k)}.
Lemma 1 (Harmonic decomposition of the infinite-width kernel).

The kernel KK can be decomposed as

K(𝒙,𝒙)=k=0γkQk(d)(𝒙,𝒙),\displaystyle K(\text{\boldmath$x$},\text{\boldmath$x$}^{\prime})=\sum_{k=0}^{\infty}\gamma_{k}Q_{k}^{(d)}(\langle\text{\boldmath$x$},\text{\boldmath$x$}^{\prime}\rangle)\,, (47)

with coefficients:

γ0=[λ1(σ)]2,γk=k+12k+dB(d,k+1)[λk+1(σ)]2+k+d32k+d4B(d,k1)[λk1(σ)]2,\displaystyle\gamma_{0}=\big{[}\lambda_{1}(\sigma^{\prime})\big{]}^{2},\qquad\gamma_{k}=\frac{k+1}{2k+d}B(d,k+1)\big{[}\lambda_{k+1}(\sigma^{\prime})\big{]}^{2}+\frac{k+d-3}{2k+d-4}B(d,k-1)\big{[}\lambda_{k-1}(\sigma^{\prime})\big{]}^{2}, (48)

where the coefficients λk=λd,k\lambda_{k}=\lambda_{d,k} are defined in (46). The convergence in (47) takes place in any of the following interpretations: (i)(i) As functions in L2(𝕊d1(d)×𝕊d1(d),τd1τd1)L^{2}(\mathbb{S}^{d-1}(\sqrt{d})\times\mathbb{S}^{d-1}(\sqrt{d}),\tau_{d-1}\otimes\tau_{d-1}); (ii)(ii) Pointwise, for every 𝐱,𝐱𝕊d1(d)×𝕊d1(d)\text{\boldmath$x$},\text{\boldmath$x$}^{\prime}\in\mathbb{S}^{d-1}(\sqrt{d})\times\mathbb{S}^{d-1}(\sqrt{d}); (iii)(iii) In operator norm as integral operators L2(𝕊d1(d),τd1)L2(𝕊d1(d),τd1)L^{2}(\mathbb{S}^{d-1}(\sqrt{d}),\tau_{d-1})\to L^{2}(\mathbb{S}^{d-1}(\sqrt{d}),\tau_{d-1}).

Finally, γ0=od(1)\gamma_{0}=o_{d}(1) and for k1k\geq 1 we have γk=μk12+od(1)\gamma_{k}=\mu_{k-1}^{2}+o_{d}(1), where we recall that μk\mu_{k} is the kk-th coefficient in the Hermite expansion of σ\sigma^{\prime}.

The use of spherical harmonics and Gegenbauer polynomials to study inner product kernels on the sphere (the N=N=\infty case) is standard since the classical work of Schoenberg [Sch42]. Several recent papers in machine learning use these tools [Bac17, BM19, GMMM21]. On the other hand, quantitative properties when both the sample size nn and the dimension dd are large are much less studied [GMMM21]. The finite width NN case is even more challenging since the kernel is no longer of inner-product type.

Proof.

Recall the Gegenbauer decomposition of σ\sigma^{\prime} in (45), implying that the following holds in L2(𝕊d1)L^{2}(\mathbb{S}^{d-1}) (we omit indicating the measure when this is uniform). For any fixed 𝒙𝕊d1(d)\text{\boldmath$x$}\in\mathbb{S}^{d-1}(\sqrt{d}) (and writing λk(σ):=λd,k(σ)\lambda_{k}(\sigma^{\prime}):=\lambda_{d,k}(\sigma^{\prime})):

σ(𝒙,)=k=0λk(σ)B(d,k)Qk(d)(d𝒙,).\displaystyle\sigma^{\prime}(\langle\text{\boldmath$x$},\,\cdot\,\rangle)=\sum_{k=0}^{\infty}\lambda_{k}(\sigma^{\prime})B(d,k)Q_{k}^{(d)}(\sqrt{d}\,\langle\text{\boldmath$x$},\,\cdot\,\rangle)\,.

We thus obtain, for fixed 𝒙x, 𝒙\text{\boldmath$x$}^{\prime}:

𝔼𝒘[σ(𝒙,𝒘)σ(𝒙,𝒘)]\displaystyle\mathbb{E}_{\text{\boldmath$w$}}\big{[}\sigma^{\prime}(\langle\text{\boldmath$x$},\text{\boldmath$w$}\rangle)\sigma^{\prime}(\langle\text{\boldmath$x$}^{\prime},\text{\boldmath$w$}\rangle)\big{]} =(i)k=0[λk(σ)]2[B(d,k)]2𝔼𝒘[Qk(d)(d𝒙,𝒘)Qk(d)(d𝒙,𝒘)]\displaystyle\stackrel{{\scriptstyle(i)}}{{=}}\sum_{k=0}^{\infty}\big{[}\lambda_{k}(\sigma^{\prime})\big{]}^{2}[B(d,k)]^{2}\mathbb{E}_{\text{\boldmath$w$}}\big{[}Q_{k}^{(d)}(\sqrt{d}\,\langle\text{\boldmath$x$},\text{\boldmath$w$}\rangle)Q_{k}^{(d)}(\sqrt{d}\,\langle\text{\boldmath$x$}^{\prime},\text{\boldmath$w$}\rangle)\big{]}
=(ii)k=0[λk(σ)]2B(d,k)Qk(d)(𝒙,𝒙).\displaystyle\stackrel{{\scriptstyle(ii)}}{{=}}\sum_{k=0}^{\infty}\big{[}\lambda_{k}(\sigma^{\prime})\big{]}^{2}B(d,k)Q_{k}^{(d)}(\langle\text{\boldmath$x$},\text{\boldmath$x$}^{\prime}\rangle).

Here, (i) is due to the orthogonality of Q_{k}^{(d)}(\sqrt{d}\,\langle\text{\boldmath$x$},\,\cdot\,\rangle) and Q_{k^{\prime}}^{(d)}(\sqrt{d}\,\langle\text{\boldmath$x$}^{\prime},\,\cdot\,\rangle) for k\neq k^{\prime}, together with Lemma 24 (which allows exchanging summation with expectation), and (ii) is due to property (40).

Thus, we obtain

K(𝒙,𝒙)=k=0[λk(σ)]2B(d,k)Qk(d)(𝒙,𝒙)𝒙,𝒙d.K(\text{\boldmath$x$},\text{\boldmath$x$}^{\prime})=\sum_{k=0}^{\infty}\big{[}\lambda_{k}(\sigma^{\prime})\big{]}^{2}B(d,k)Q_{k}^{(d)}(\langle\text{\boldmath$x$},\text{\boldmath$x$}^{\prime}\rangle)\frac{\langle\text{\boldmath$x$},\text{\boldmath$x$}^{\prime}\rangle}{d}.

We use the recurrence formula to simplify the above expression.

\displaystyle K(\text{\boldmath$x$},\text{\boldmath$x$}^{\prime})=\sum_{k\geq 1}\frac{k}{2k+d-2}\big{[}\lambda_{k}(\sigma^{\prime})\big{]}^{2}B(d,k)Q_{k-1}^{(d)}(\langle\text{\boldmath$x$},\text{\boldmath$x$}^{\prime}\rangle)+\sum_{k\geq 0}\frac{k+d-2}{2k+d-2}\big{[}\lambda_{k}(\sigma^{\prime})\big{]}^{2}B(d,k)Q_{k+1}^{(d)}(\langle\text{\boldmath$x$},\text{\boldmath$x$}^{\prime}\rangle)
=\sum_{k\geq 0}\frac{k+1}{2k+d}\big{[}\lambda_{k+1}(\sigma^{\prime})\big{]}^{2}B(d,k+1)Q_{k}^{(d)}(\langle\text{\boldmath$x$},\text{\boldmath$x$}^{\prime}\rangle)+\sum_{k\geq 1}\frac{k+d-3}{2k+d-4}\big{[}\lambda_{k-1}(\sigma^{\prime})\big{]}^{2}B(d,k-1)Q_{k}^{(d)}(\langle\text{\boldmath$x$},\text{\boldmath$x$}^{\prime}\rangle)
=\sum_{k=0}^{\infty}\gamma_{k}Q_{k}^{(d)}(\langle\text{\boldmath$x$},\text{\boldmath$x$}^{\prime}\rangle)\,. (49)

We thus proved that Eq. (47) holds pointwise for every 𝒙,𝒙\text{\boldmath$x$},\text{\boldmath$x$}^{\prime}. Calling KK^{\ell} the sum of the first \ell terms on the right hand side, we also obtain KKL2(𝕊d1(d)×𝕊d1(d))0\|K-K^{\ell}\|_{L^{2}(\mathbb{S}^{d-1}(\sqrt{d})\times\mathbb{S}^{d-1}(\sqrt{d}))}\to 0 by an application of Fatou’s lemma. Finally the last norm coincides with Hilbert-Schmidt norm, when we view these kernels as operators, and hence implies convergence in operator norm.

The asymptotic formulas for γk\gamma_{k} follow by the convergence of Gegenbauer expansion to Hermite expansion discussed in Section 6. ∎

Lemma 2 (Structure of infinite-width kernel matrix).

Assume (𝐱i)iniidUnif(𝕊d1(d))(\text{\boldmath$x$}_{i})_{i\leq n}\sim_{iid}\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d})), and the activation function to satisfy Assumption 3.2. Then we can decompose 𝐊\mathrm{\bf K} as follows

𝐊=γ>𝐈n+𝚿𝚲2𝚿+𝚫\displaystyle\mathrm{\bf K}=\gamma_{>\ell}\mathrm{\bf I}_{n}+\text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{2}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}+\text{\boldmath$\Delta$} (50)

where γ>:=k>γk=kμk2+od(1)\gamma_{>\ell}:=\sum_{k>\ell}\gamma_{k}=\sum_{k\geq\ell}\mu_{k}^{2}+o_{d}(1) and

𝚲2:=diag{γ0,B(d,1)1γ1,,B(d,1)1γ1B(d,1),,B(d,)1γ,,B(d,)1γB(d,)}\text{\boldmath$\Lambda$}_{\leq\ell}^{2}:=\mathrm{diag}\big{\{}\gamma_{0},\underbrace{B(d,1)^{-1}\gamma_{1},\ldots,B(d,1)^{-1}\gamma_{1}}_{B(d,1)},\ldots,\underbrace{B(d,\ell)^{-1}\gamma_{\ell},\ldots,B(d,\ell)^{-1}\gamma_{\ell}}_{B(d,\ell)}\big{\}}

Here, γk0\gamma_{k}\geq 0 and γk=μk12+od(1)\gamma_{k}=\mu_{k-1}^{2}+o_{d}(1) (we denote μ1=0\mu_{-1}=0).

Further there exists a constant CC such that, for n(logn)Cd+1n(\log n)^{C}\leq d^{\ell+1}, the remainder term 𝚫\Delta satisfies, with very high probability,

𝚫opn(logn)Cd+1.\big{\|}\text{\boldmath$\Delta$}\big{\|}_{\mathrm{op}}\leq\sqrt{\frac{n(\log n)^{C}}{d^{\ell+1}}}.

Finally, if nCd(logd)2n\geq Cd^{\ell}(\log d)^{2}, with very high probability,

\big{\|}n^{-1}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}\text{\boldmath$\Psi$}_{\leq\ell}-\mathrm{\bf I}_{D}\big{\|}_{\mathrm{op}}\leq C\sqrt{\frac{d^{\ell}(\log d)^{2}}{n}},\qquad\text{where}~{}D:=\sum_{k\leq\ell}B(d,k). (51)

In the case =1\ell=1, the above upper bound can be replaced by Cd/n\sqrt{Cd/n}.

Proof of Lemma 2.

Using Lemma 1, and using property (41) to express

Q_{k}^{(d)}(\langle\text{\boldmath$x$}_{i},\text{\boldmath$x$}_{j}\rangle)=[B(d,k)]^{-1}\text{\boldmath$\Psi$}_{=k}(\text{\boldmath$x$}_{i})\text{\boldmath$\Psi$}_{=k}(\text{\boldmath$x$}_{j})^{\top}

for k\leq\ell, where \text{\boldmath$\Psi$}_{=k}(\text{\boldmath$x$}_{i}) denotes the i-th row of \text{\boldmath$\Psi$}_{=k}, we obtain the decomposition (50), where

𝚫:=k=+1γk(𝑸k𝐈n),\displaystyle\text{\boldmath$\Delta$}:=\sum_{k=\ell+1}^{\infty}\gamma_{k}(\text{\boldmath$Q$}_{k}-\mathrm{\bf I}_{n})\,,

and 𝑸k:=(Qk(d)(𝒙i,𝒙j))i,jn\text{\boldmath$Q$}_{k}:=(Q_{k}^{(d)}(\langle\text{\boldmath$x$}_{i},\text{\boldmath$x$}_{j}\rangle))_{i,j\leq n}. Note that

γ>\displaystyle\gamma_{>\ell} =k+2k2k+d2B(d,k)[λk(σ)]2+kk+d22k+d2B(d,k)[λk(σ)]2\displaystyle=\sum_{k\geq\ell+2}\frac{k}{2k+d-2}B(d,k)[\lambda_{k}(\sigma^{\prime})]^{2}+\sum_{k\geq\ell}\frac{k+d-2}{2k+d-2}B(d,k)[\lambda_{k}(\sigma^{\prime})]^{2}
=k+1k+d22k+d2B(d,k)[λk(σ)]2+k+2B(d,k)[λk(σ)]2\displaystyle=\sum_{\ell\leq k\leq\ell+1}\frac{k+d-2}{2k+d-2}B(d,k)[\lambda_{k}(\sigma^{\prime})]^{2}+\sum_{k\geq\ell+2}B(d,k)[\lambda_{k}(\sigma^{\prime})]^{2}
=kμk2+od(1).\displaystyle=\sum_{k\geq\ell}\mu_{k}^{2}+o_{d}(1).

The last equality involves interchanging the limit limd\lim_{d\to\infty} with the summation. We explain the validity as follows. For any integer M>+2M>\ell+2,

\displaystyle\Big{|}\sum_{k\geq\ell+2}B(d,k)[\lambda_{k}(\sigma^{\prime})]^{2}-\sum_{k\geq\ell+2}\mu_{k}^{2}\Big{|}\leq\sum_{k=\ell+2}^{M}\big{|}B(d,k)[\lambda_{k}(\sigma^{\prime})]^{2}-\mu_{k}^{2}\big{|}+\sum_{k>M}B(d,k)[\lambda_{k}(\sigma^{\prime})]^{2}+\sum_{k>M}\mu_{k}^{2}.

By first taking dd\to\infty and then MM\to\infty, we obtain

\displaystyle\limsup_{d\to\infty}\Big{|}\sum_{k\geq\ell+2}B(d,k)[\lambda_{k}(\sigma^{\prime})]^{2}-\sum_{k\geq\ell+2}\mu_{k}^{2}\Big{|}\leq\lim_{M\to\infty}\limsup_{d\to\infty}\sum_{k>M}B(d,k)[\lambda_{k}(\sigma^{\prime})]^{2},
=lim supdσL2(τd11)2limMlim infdkMB(d,k)[λk(σ)]2\displaystyle=\limsup_{d\to\infty}\|\sigma^{\prime}\|_{L^{2}(\tau^{1}_{d-1})}^{2}-\lim_{M\to\infty}\liminf_{d\to\infty}\sum_{k\leq M}B(d,k)[\lambda_{k}(\sigma^{\prime})]^{2}
=σL2(ρ)2limMkMμk(σ)2=0.\displaystyle=\|\sigma^{\prime}\|_{L^{2}(\rho)}^{2}-\lim_{M\to\infty}\sum_{k\leq M}\mu_{k}(\sigma^{\prime})^{2}=0\,.

Here, in the last line, we recall that \rho is the standard Gaussian measure, and we used dominated convergence to show that \|\sigma^{\prime}\|_{L^{2}(\tau^{1}_{d-1})}^{2}\to\|\sigma^{\prime}\|_{L^{2}(\rho)}^{2}. By Proposition E.1, we have, with very high probability,

supk>𝑸k𝐈nopn(logn)Cd+1.\sup_{k>\ell}\big{\|}\text{\boldmath$Q$}_{k}-\mathrm{\bf I}_{n}\big{\|}_{\mathrm{op}}\leq\sqrt{\frac{n(\log n)^{C}}{d^{\ell+1}}}.

Thus, \text{\boldmath$\Delta$}=\sum_{k>\ell}\gamma_{k}(\text{\boldmath$Q$}_{k}-\mathrm{\bf I}_{n}) satisfies

\|\text{\boldmath$\Delta$}\|_{\mathrm{op}}\leq\gamma_{>\ell}\sup_{k>\ell}\big{\|}\text{\boldmath$Q$}_{k}-\mathrm{\bf I}_{n}\big{\|}_{\mathrm{op}}\leq\sqrt{\frac{n(\log n)^{C}}{d^{\ell+1}}}.

Finally, denote D=\sum_{k\leq\ell}B(d,k). The matrix \text{\boldmath$\Psi$}_{\leq\ell}\in\mathbb{R}^{n\times D} has n i.i.d. rows, which we denote by {\boldsymbol{\psi}}(\text{\boldmath$x$}_{i}), i\leq n, where

\displaystyle{\boldsymbol{\psi}}(\text{\boldmath$x$}):=\big{(}Y_{k,t}(\text{\boldmath$x$})\big{)}_{t\leq B(d,k),\,k\leq\ell}\,. (52)

By orthonormality of the spherical harmonics (42), the covariance of these vectors is 𝔼{𝝍(𝒙)𝝍(𝒙)}=𝐈D\mathbb{E}\{{\boldsymbol{\psi}}(\text{\boldmath$x$}){\boldsymbol{\psi}}(\text{\boldmath$x$})^{\top}\}=\mathrm{\bf I}_{D}. Further, for any 𝒙𝕊d1(d)\text{\boldmath$x$}\in\mathbb{S}^{d-1}(\sqrt{d}), we have

𝝍(𝒙)22\displaystyle\|{\boldsymbol{\psi}}(\text{\boldmath$x$})\|_{2}^{2} =k=0t=1B(d,k)Yk,t(𝒙)2=k=0B(d,k)=D.\displaystyle=\sum_{k=0}^{\ell}\sum_{t=1}^{B(d,k)}Y_{k,t}(\text{\boldmath$x$})^{2}=\sum_{k=0}^{\ell}B(d,k)=D\,. (53)

By [Ver18][Theorem 5.6.1 and Exercise 5.6.4], we obtain

\displaystyle\big{\|}n^{-1}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}\text{\boldmath$\Psi$}_{\leq\ell}-\mathrm{\bf I}_{D}\big{\|}_{\mathrm{op}}\leq C\Big{(}\sqrt{\frac{D(\log D+u)}{n}}+\frac{D(\log D+u)}{n}\Big{)}\,, (54)

with probability at least 12eu1-2e^{-u}. The claim follows by setting u=(logd)2u=(\log d)^{2}, and noting that DCdD\leq C^{\prime}d^{\ell}. ∎

7.2 Concentration of neural tangent kernel: Proof of Theorem 3.2

Let C0>0C_{0}>0 be a sufficiently large constant (which is determined later). Define the truncated function

φ(x)=σ(x)𝟏{|x|C0log(nNd)}.\varphi(x)=\sigma^{\prime}(x)\mathrm{\bf 1}\{|x|\leq C_{0}\log(nNd)\}.

For k\in[N], we also define matrices \mathrm{\bf D}_{k},\mathrm{\bf H}_{k}\in\mathbb{R}^{n\times n}, and truncated kernels \mathrm{\bf K}^{0} and \mathrm{\bf K}_{N}^{0}, as

𝐃k\displaystyle\mathrm{\bf D}_{k} =diag{φ(𝒙1,𝒘k),,φ(𝒙n,𝒘k)},𝐊0=1d𝔼𝒘[𝐃k𝑿𝑿𝐃k]\displaystyle=\mathrm{diag}\big{\{}\varphi(\langle\text{\boldmath$x$}_{1},\text{\boldmath$w$}_{k}\rangle),\ldots,\varphi(\langle\text{\boldmath$x$}_{n},\text{\boldmath$w$}_{k}\rangle)\big{\}},\qquad\mathrm{\bf K}^{0}=\frac{1}{d}\mathbb{E}_{\text{\boldmath$w$}}\big{[}\mathrm{\bf D}_{k}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf D}_{k}\big{]} (55)
𝐇k\displaystyle\mathrm{\bf H}_{k} =1d(𝐊0)1/2𝐃k𝑿𝑿𝐃k(𝐊0)1/2,𝐊N0=1Ndk=1N𝐃k𝑿𝑿𝐃k.\displaystyle=\frac{1}{d}(\mathrm{\bf K}^{0})^{-1/2}\mathrm{\bf D}_{k}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf D}_{k}(\mathrm{\bf K}^{0})^{-1/2},\qquad\mathrm{\bf K}_{N}^{0}=\frac{1}{Nd}\sum_{k=1}^{N}\mathrm{\bf D}_{k}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf D}_{k}. (56)

Note that by definition and the assumption on the activation function, namely Assumption 3.2, we have |φ(x)|B(1+(C0log(nNd))B)|\varphi(x)|\leq B(1+(C_{0}\log(nNd))^{B}).

Note that \mathcal{A}_{\gamma} is a very high probability event as a consequence of Lemma 2. We shall treat \text{\boldmath$x$}_{1},\ldots,\text{\boldmath$x$}_{n} as deterministic vectors such that the conditions defining the event \mathcal{A}_{\gamma} hold.

Step 1: The effect of truncation is small. First, we realize that

𝒘(𝐊N0𝐊N)\displaystyle\mathbb{P}_{\text{\boldmath$w$}}\big{(}\mathrm{\bf K}_{N}^{0}\neq\mathrm{\bf K}_{N}\big{)} 𝒘(maxi,k|𝒙i,𝒘k|>C0log(nNd))Nn(|𝒙1,𝒘1|>C0log(nNd)).\displaystyle\leq\mathbb{P}_{\text{\boldmath$w$}}\big{(}\max_{i,k}|\langle\text{\boldmath$x$}_{i},\text{\boldmath$w$}_{k}\rangle|>C_{0}\log(nNd)\big{)}\leq Nn\mathbb{P}\big{(}|\langle\text{\boldmath$x$}_{1},\text{\boldmath$w$}_{1}\rangle|>C_{0}\log(nNd)\big{)}.

Since a uniform spherical random vector is subgaussian [Ver10][Sect. 5.2.5], we can pick C_{0} sufficiently large such that

𝒘(|𝒙1,𝒘1|>C0log(nNd))Cexp(c(log(nNd))2).\mathbb{P}_{\text{\boldmath$w$}}\big{(}|\langle\text{\boldmath$x$}_{1},\text{\boldmath$w$}_{1}\rangle|>C_{0}\log(nNd)\big{)}\leq C\exp\Big{(}-c(\log(nNd))^{2}\Big{)}. (57)

Thus, with very high probability, 𝐊N0=𝐊N\mathrm{\bf K}_{N}^{0}=\mathrm{\bf K}_{N}.

Furthermore, for i,j[n]i,j\in[n], we have

(𝐊𝐊0)ij\displaystyle(\mathrm{\bf K}-\mathrm{\bf K}^{0})_{ij} =𝔼𝒘[σ(𝒙i,𝒘)σ(𝒙j,𝒘)(1𝟏Ai)𝟏Aj]𝒙i,𝒙jd\displaystyle=\mathbb{E}_{\text{\boldmath$w$}}\big{[}\sigma^{\prime}(\langle\text{\boldmath$x$}_{i},\text{\boldmath$w$}\rangle)\sigma^{\prime}(\langle\text{\boldmath$x$}_{j},\text{\boldmath$w$}\rangle)(1-\mathrm{\bf 1}_{A_{i}})\mathrm{\bf 1}_{A_{j}}\big{]}\frac{\langle\text{\boldmath$x$}_{i},\text{\boldmath$x$}_{j}\rangle}{d}
+𝔼𝒘[σ(𝒙i,𝒘)σ(𝒙j,𝒘)(1𝟏Aj)]𝒙i,𝒙jd=:I1+I2\displaystyle~{}+\mathbb{E}_{\text{\boldmath$w$}}\big{[}\sigma^{\prime}(\langle\text{\boldmath$x$}_{i},\text{\boldmath$w$}\rangle)\sigma^{\prime}(\langle\text{\boldmath$x$}_{j},\text{\boldmath$w$}\rangle)(1-\mathrm{\bf 1}_{A_{j}})\big{]}\frac{\langle\text{\boldmath$x$}_{i},\text{\boldmath$x$}_{j}\rangle}{d}=:I_{1}+I_{2}

where Ai:={|𝒙i,𝒘|C0log(nNd)}A_{i}:=\{|\langle\text{\boldmath$x$}_{i},\text{\boldmath$w$}\rangle|\leq C_{0}\log(nNd)\}. For the first term, we derive

|I1|\displaystyle|I_{1}| (i)|𝔼𝒘[σ(𝒙i,𝒘)σ(𝒙j,𝒘)(1𝟏Ai)𝟏Aj]|\displaystyle\stackrel{{\scriptstyle(i)}}{{\leq}}\Big{|}\mathbb{E}_{\text{\boldmath$w$}}\big{[}\sigma^{\prime}(\langle\text{\boldmath$x$}_{i},\text{\boldmath$w$}\rangle)\sigma^{\prime}(\langle\text{\boldmath$x$}_{j},\text{\boldmath$w$}\rangle)(1-\mathrm{\bf 1}_{A_{i}})\mathrm{\bf 1}_{A_{j}}\big{]}\Big{|}
(ii){𝔼𝒘[(σ(𝒙i,𝒘))4]}1/4{𝔼𝒘[(σ(𝒙j,𝒘))4]}1/4{𝔼𝒘[(1𝟏Ai)2]}1/2\displaystyle\stackrel{{\scriptstyle(ii)}}{{\leq}}\Big{\{}\mathbb{E}_{\text{\boldmath$w$}}\big{[}\big{(}\sigma^{\prime}(\langle\text{\boldmath$x$}_{i},\text{\boldmath$w$}\rangle)\big{)}^{4}\big{]}\Big{\}}^{1/4}\cdot\Big{\{}\mathbb{E}_{\text{\boldmath$w$}}\big{[}\big{(}\sigma^{\prime}(\langle\text{\boldmath$x$}_{j},\text{\boldmath$w$}\rangle)\big{)}^{4}\big{]}\Big{\}}^{1/4}\cdot\Big{\{}\mathbb{E}_{\text{\boldmath$w$}}\big{[}(1-\mathrm{\bf 1}_{A_{i}})^{2}\big{]}\Big{\}}^{1/2}
(iii)C(dB+1){𝒘(|𝒙i,𝒘|>C0log(nNd))}1/2\displaystyle\stackrel{{\scriptstyle(iii)}}{{\leq}}C(d^{B}+1)\Big{\{}\mathbb{P}_{\text{\boldmath$w$}}\big{(}|\langle\text{\boldmath$x$}_{i},\text{\boldmath$w$}\rangle|>C_{0}\log(nNd)\big{)}\Big{\}}^{1/2}

where (i) is due to |𝒙i,𝒙j|𝒙i𝒙j=d|\langle\text{\boldmath$x$}_{i},\text{\boldmath$x$}_{j}\rangle|\leq\|\text{\boldmath$x$}_{i}\|\cdot\|\text{\boldmath$x$}_{j}\|=d, (ii) is due to Hölder’s inequality, and (iii) follows from the polynomial growth assumption on σ\sigma^{\prime}. By (57), we can choose C0C_{0} large enough such that |I1|C(nNd)2|I_{1}|\leq C(nNd)^{-2}. Similarly, we can prove that |I2|C(nNd)2|I_{2}|\leq C(nNd)^{-2}. Thus,

𝐊0𝐊opnmaxij|(𝐊0𝐊)ij|CnNd.\displaystyle\big{\|}\mathrm{\bf K}^{0}-\mathrm{\bf K}\big{\|}_{\mathrm{op}}\leq n\max_{ij}\big{|}(\mathrm{\bf K}^{0}-\mathrm{\bf K})_{ij}\big{|}\leq\frac{C}{nNd}. (58)

Since we work on the event 𝐊γ𝐈n\mathrm{\bf K}\succeq\gamma\mathrm{\bf I}_{n}, this implies

𝐊1/2(𝐊0𝐊)𝐊1/2opCγnNd.\big{\|}\mathrm{\bf K}^{-1/2}(\mathrm{\bf K}^{0}-\mathrm{\bf K})\mathrm{\bf K}^{-1/2}\big{\|}_{\mathrm{op}}\leq\frac{C}{\gamma nNd}. (59)

Step 2: Concentration of the truncated kernel. Next, we will use a matrix concentration inequality [Ver18][Thm. 5.4.1] for \sum_{k=1}^{N}\mathrm{\bf H}_{k}. First, we observe that (58) implies \mathrm{\bf K}^{0}\succeq(\gamma-o_{d}(1))\cdot\mathrm{\bf I}_{n}. Together with the bound on \|\text{\boldmath$X$}\|_{\mathrm{op}} and the deterministic bound on \|\mathrm{\bf D}_{k}\|_{\mathrm{op}}, we obtain the following deterministic bound on \|\mathrm{\bf H}_{k}\|_{\mathrm{op}}:

𝐇kop\displaystyle\|\mathrm{\bf H}_{k}\|_{\mathrm{op}} 1dλmin(𝐊0)𝐃kop2𝑿op2\displaystyle\leq\frac{1}{d\lambda_{\min}(\mathrm{\bf K}^{0})}\|\mathrm{\bf D}_{k}\|_{\mathrm{op}}^{2}\cdot\|\text{\boldmath$X$}\|_{\mathrm{op}}^{2}
1d(γod(1))[B(1+(C0lognNd)B)]24(n+d)2\displaystyle\leq\frac{1}{d(\gamma-o_{d}(1))}\big{[}B(1+(C_{0}\log nNd)^{B})\big{]}^{2}\cdot 4(\sqrt{n}+\sqrt{d})^{2}
C(n+d)d(log(nNd))C.\displaystyle\leq\frac{C(n+d)}{d}\big{(}\log(nNd))^{C}.

where C is a sufficiently large constant. This also implies \|\mathrm{\bf H}_{k}-\mathbb{E}_{\text{\boldmath$w$}}[\mathrm{\bf H}_{k}]\|_{\mathrm{op}}\leq C(n+d)(\log(nNd))^{C}/d, after possibly enlarging C. We will make use of the following simple fact: if \mathrm{\bf A}_{1},\mathrm{\bf A}_{2} are p.s.d. matrices satisfying \mathrm{\bf A}_{1}\preceq\mathrm{\bf A}_{2}, then \text{\boldmath$Q$}\mathrm{\bf A}_{1}\text{\boldmath$Q$}^{\top}\preceq\text{\boldmath$Q$}\mathrm{\bf A}_{2}\text{\boldmath$Q$}^{\top} for any matrix \text{\boldmath$Q$}. Applying this to \mathrm{\bf H}_{k}^{2}, we find

𝐇k2\displaystyle\mathrm{\bf H}_{k}^{2} =1d2(𝐊0)1/2𝐃k𝑿𝑿𝐃k(𝐊0)1𝐃k𝑿𝑿𝐃k(𝐊0)1/2\displaystyle=\frac{1}{d^{2}}(\mathrm{\bf K}^{0})^{-1/2}\mathrm{\bf D}_{k}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf D}_{k}(\mathrm{\bf K}^{0})^{-1}\mathrm{\bf D}_{k}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf D}_{k}(\mathrm{\bf K}^{0})^{-1/2}
1d2(𝐊0)1/2𝐃k𝑿𝑿𝑿𝑿𝐃k(𝐊0)1/2(γod(1))1[B(1+(C0log(nNd))B)]2\displaystyle\preceq\frac{1}{d^{2}}(\mathrm{\bf K}^{0})^{-1/2}\mathrm{\bf D}_{k}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf D}_{k}(\mathrm{\bf K}^{0})^{-1/2}\cdot(\gamma-o_{d}(1))^{-1}\big{[}B(1+(C_{0}\log(nNd))^{B})\big{]}^{2}
1d2(𝐊0)1/2𝐃k𝑿𝑿𝐃k(𝐊0)1/2(γod(1))1[B(1+(C0log(nNd))B)]24(n+d)2.\displaystyle\preceq\frac{1}{d^{2}}(\mathrm{\bf K}^{0})^{-1/2}\mathrm{\bf D}_{k}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf D}_{k}(\mathrm{\bf K}^{0})^{-1/2}\cdot(\gamma-o_{d}(1))^{-1}\big{[}B(1+(C_{0}\log(nNd))^{B})\big{]}^{2}\cdot 4(\sqrt{n}+\sqrt{d})^{2}.

Taking the expectation 𝔼𝒘\mathbb{E}_{\text{\boldmath$w$}}, we get

𝔼𝒘[𝐇k2]1d(γod(1))1[B(1+(C0log(nNd))B)]24(n+d)2𝐈nC(n+d)d(log(nNd))C𝐈n.\displaystyle\mathbb{E}_{\text{\boldmath$w$}}\big{[}\mathrm{\bf H}_{k}^{2}\big{]}\preceq\frac{1}{d}(\gamma-o_{d}(1))^{-1}\big{[}B(1+(C_{0}\log(nNd))^{B})\big{]}^{2}\cdot 4(\sqrt{n}+\sqrt{d})^{2}\cdot\mathrm{\bf I}_{n}\preceq\frac{C(n+d)}{d}\big{(}\log(nNd)\big{)}^{C}\cdot\mathrm{\bf I}_{n}.

This implies

𝔼𝒘(𝐇k𝔼𝒘[𝐇k])2=𝔼𝒘[𝐇k2](𝔼𝒘[𝐇k])2𝔼𝒘[𝐇k2]O(n+ddpolylog(nNd))𝐈n.\mathbb{E}_{\text{\boldmath$w$}}\big{(}\mathrm{\bf H}_{k}-\mathbb{E}_{\text{\boldmath$w$}}[\mathrm{\bf H}_{k}])^{2}=\mathbb{E}_{\text{\boldmath$w$}}[\mathrm{\bf H}_{k}^{2}]-\big{(}\mathbb{E}_{\text{\boldmath$w$}}[\mathrm{\bf H}_{k}]\big{)}^{2}\preceq\mathbb{E}_{\text{\boldmath$w$}}[\mathrm{\bf H}_{k}^{2}]\preceq O\Big{(}\frac{n+d}{d}\mathrm{polylog}(nNd)\Big{)}\cdot\mathrm{\bf I}_{n}.

Now we apply the matrix Bernstein inequality [Ver18][Thm. 5.4.1] to the sum \sum_{k=1}^{N}\big{(}\mathrm{\bf H}_{k}-\mathbb{E}_{\text{\boldmath$w$}}[\mathrm{\bf H}_{k}]\big{)}. We obtain

\displaystyle\mathbb{P}\Big{(}\Big{\|}\sum_{k=1}^{N}\big{(}\mathrm{\bf H}_{k}-\mathbb{E}_{\text{\boldmath$w$}}[\mathrm{\bf H}_{k}]\big{)}\Big{\|}_{\mathrm{op}}\geq t\Big{)}\leq 2n\cdot\exp\left(-\frac{t^{2}/2}{v+Lt/3}\right), (60)

where

L=n+dd(log(nNd))C,v=N(n+d)d(log(nNd))C.\displaystyle L=\frac{n+d}{d}\big{(}\log(nNd)\big{)}^{C},\qquad v=\frac{N(n+d)}{d}\big{(}\log(nNd)\big{)}^{C}\,.

We choose a sufficiently large constant C^{\prime} and set

t=max{N(n+d)(log(nNd))Cd,(n+d)(log(nNd))Cd}t=\max\Big{\{}\sqrt{\frac{N(n+d)(\log(nNd))^{C^{\prime}}}{d}},\frac{(n+d)(\log(nNd))^{C^{\prime}}}{d}\Big{\}}

so that the right-hand side of (60) is no larger than 2n\cdot\exp\big{(}-(\log(nNd))^{2}\big{)}. This proves that with very high probability,

(𝐊0)1/2𝐊N0(𝐊0)1/2𝐈nop(n+d)(log(nNd))CNd+(n+d)(log(nNd))CNd.\big{\|}(\mathrm{\bf K}^{0})^{-1/2}\mathrm{\bf K}_{N}^{0}(\mathrm{\bf K}^{0})^{-1/2}-\mathrm{\bf I}_{n}\big{\|}_{\mathrm{op}}\leq\sqrt{\frac{(n+d)(\log(nNd))^{C^{\prime}}}{Nd}}+\frac{(n+d)(\log(nNd))^{C^{\prime}}}{Nd}. (61)

Step 3: back to original kernel. The inequality (59) implies that for large enough nn, (𝐊0)1/2𝐊1/2opC\|(\mathrm{\bf K}^{0})^{1/2}\mathrm{\bf K}^{-1/2}\|_{\mathrm{op}}\leq C. Therefore,

𝐊1/2(𝐊N0𝐊)𝐊1/2op\displaystyle~{}~{}~{}~{}\big{\|}\mathrm{\bf K}^{-1/2}(\mathrm{\bf K}_{N}^{0}-\mathrm{\bf K})\mathrm{\bf K}^{-1/2}\big{\|}_{\mathrm{op}}
𝐊1/2(𝐊N0𝐊0)𝐊1/2op+𝐊1/2(𝐊0𝐊)𝐊1/2op\displaystyle\leq\big{\|}\mathrm{\bf K}^{-1/2}(\mathrm{\bf K}_{N}^{0}-\mathrm{\bf K}^{0})\mathrm{\bf K}^{-1/2}\big{\|}_{\mathrm{op}}+\big{\|}\mathrm{\bf K}^{-1/2}(\mathrm{\bf K}^{0}-\mathrm{\bf K})\mathrm{\bf K}^{-1/2}\big{\|}_{\mathrm{op}}
𝐊1/2(𝐊0)1/2op(𝐊0)1/2(𝐊N0𝐊0)(𝐊0)1/2op(𝐊0)1/2𝐊1/2op+CγnNd\displaystyle\leq\big{\|}\mathrm{\bf K}^{-1/2}(\mathrm{\bf K}^{0})^{1/2}\big{\|}_{\mathrm{op}}\cdot\big{\|}(\mathrm{\bf K}^{0})^{-1/2}(\mathrm{\bf K}_{N}^{0}-\mathrm{\bf K}^{0})(\mathrm{\bf K}^{0})^{-1/2}\big{\|}_{\mathrm{op}}\cdot\big{\|}(\mathrm{\bf K}^{0})^{1/2}\mathrm{\bf K}^{-1/2}\big{\|}_{\mathrm{op}}+\frac{C}{\gamma nNd}
(n+d)(log(nNd))CNd+(n+d)(log(nNd))CNd+CγnNd.\displaystyle\leq\sqrt{\frac{(n+d)(\log(nNd))^{C^{\prime}}}{Nd}}+\frac{(n+d)(\log(nNd))^{C^{\prime}}}{Nd}+\frac{C}{\gamma nNd}.

Since 1/(γnNd)O(1)C(n+d)(log(nNd))C/(Nd)1/(\gamma nNd)\leq O(1)\cdot C^{\prime}\sqrt{(n+d)(\log(nNd))^{C^{\prime}}/(Nd)}, we can enlarge the constant CC^{\prime} appropriately to obtain that with very high probability,

𝐊1/2(𝐊N0𝐊)𝐊1/2op(n+d)(log(nNd))CNd+(n+d)(log(nNd))CNd.\big{\|}\mathrm{\bf K}^{-1/2}(\mathrm{\bf K}_{N}^{0}-\mathrm{\bf K})\mathrm{\bf K}^{-1/2}\big{\|}_{\mathrm{op}}\leq\sqrt{\frac{(n+d)(\log(nNd))^{C^{\prime}}}{Nd}}+\frac{(n+d)(\log(nNd))^{C^{\prime}}}{Nd}.

This completes the proof of Theorem 3.2.

7.3 Smallest eigenvalue of neural tangent kernel: Proof of Theorem 3.1

By Lemma 2, we have, with high probability,

𝚫opn(logn)Cd+1(logn)C(logd)C0(+1)C(logd)CC0,\displaystyle\|\text{\boldmath$\Delta$}\|_{\mathrm{op}}\leq\sqrt{\frac{n(\log n)^{C}}{d^{\ell+1}}}\leq\sqrt{\frac{(\log n)^{C}}{(\log d)^{C_{0}}}}\leq\sqrt{(\ell+1)^{C}(\log d)^{C-C_{0}}},

where the second inequality is because by Assumption 3.1, d+1n(logd)C0nd^{\ell+1}\geq n(\log d)^{C_{0}}\geq n, so logn(+1)logd\log n\leq(\ell+1)\log d. We choose the constant C0C_{0} to be larger than CC. So by Weyl’s inequality,

|λmin(𝐊)λmin(γ>𝐈n+𝚿𝚲2𝚿)|𝚫op=od,(1).\big{|}\lambda_{\min}(\mathrm{\bf K})-\lambda_{\min}(\gamma_{>\ell}\mathrm{\bf I}_{n}+\text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{2}\text{\boldmath$\Psi$}_{\leq\ell}^{\top})\big{|}\leq\|\text{\boldmath$\Delta$}\|_{\mathrm{op}}=o_{d,\mathbb{P}}(1).

Note that 𝚿𝚲2𝚿\text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{2}\text{\boldmath$\Psi$}_{\leq\ell}^{\top} is always p.s.d., and it has rank at most d+1d+1 if =1\ell=1 and O(d)O(d^{\ell}) if >1\ell>1. So

λmin(γ>𝐈n+𝚿𝚲2𝚿)(i)γ>=v(σ)+od(1)and thusλmin(𝐊)(ii)v(σ)od(1)\lambda_{\min}(\gamma_{>\ell}\mathrm{\bf I}_{n}+\text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{2}\text{\boldmath$\Psi$}_{\leq\ell}^{\top})\stackrel{{\scriptstyle(i)}}{{\geq}}\gamma_{>\ell}=v(\sigma)+o_{d}(1)\qquad\text{and thus}\qquad\lambda_{\min}(\mathrm{\bf K})\stackrel{{\scriptstyle(ii)}}{{\geq}}v(\sigma)-o_{d}(1)

and we can strengthen the inequalities \geq to equalities == in (i), (ii) if n>d+1n>d+1. By Theorem 3.2 and Eq. (14) in its following remark,

(1od,(1))λmin(𝐊)λmin(𝐊N)(1+od,(1))λmin(𝐊).(1-o_{d,\mathbb{P}}(1))\cdot\lambda_{\min}(\mathrm{\bf K})\leq\lambda_{\min}(\mathrm{\bf K}_{N})\leq(1+o_{d,\mathbb{P}}(1))\cdot\lambda_{\min}(\mathrm{\bf K}).

We conclude that λmin(𝐊N)v(σ)od,(1)\lambda_{\min}(\mathrm{\bf K}_{N})\geq v(\sigma)-o_{d,\mathbb{P}}(1) and that the inequality can be replaced by an equality if n>d+1n>d+1.
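As a purely illustrative sanity check in the spirit of this result (not part of the proof; the choice \sigma=\tanh and the values of n,d,N below are ours), one can generate the empirical NT kernel K_{N}(\text{\boldmath$x$}_{i},\text{\boldmath$x$}_{j})=\frac{1}{N}\sum_{k\leq N}\sigma^{\prime}(\langle\text{\boldmath$x$}_{i},\text{\boldmath$w$}_{k}\rangle)\sigma^{\prime}(\langle\text{\boldmath$x$}_{j},\text{\boldmath$w$}_{k}\rangle)\langle\text{\boldmath$x$}_{i},\text{\boldmath$x$}_{j}\rangle/d and watch its minimum eigenvalue detach from zero once Nd is comfortably larger than n.

```python
import numpy as np

rng = np.random.default_rng(1)
dsigma = lambda t: 1.0 / np.cosh(t) ** 2          # sigma'(t) for sigma = tanh

def sphere(m, dim, radius):
    Z = rng.normal(size=(m, dim))
    return radius * Z / np.linalg.norm(Z, axis=1, keepdims=True)

d, n = 30, 300
X = sphere(n, d, np.sqrt(d))                      # rows x_i ~ Unif(S^{d-1}(sqrt(d)))
G = X @ X.T / d                                   # <x_i, x_j> / d
for N in [5, 20, 100, 500]:
    W = sphere(N, d, 1.0)                         # rows w_k ~ Unif(S^{d-1}(1))
    S = dsigma(X @ W.T)                           # S[i, k] = sigma'(<x_i, w_k>)
    K_N = (S @ S.T / N) * G                       # empirical NT kernel matrix
    print(N * d, float(np.linalg.eigvalsh(K_N).min()))
```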

8 Generalization error: Proof outline of Theorem 3.3

In this section we outline the proof of our main result, Theorem 3.3, which characterizes the test error of NT regression. We describe the main proof scheme and how we treat the bias term, and refer to the appendix, where the remaining steps (and most of the technical work) are carried out.

Throughout, we will assume that the setting of that theorem (in particular, Assumptions 3.1 and 3.2) holds. We will further assume that Eq. (10) in Assumption 3.1 holds for the case \ell=1 as well. In Section C we will refine our analysis to eliminate logarithmic factors for \ell=1.

In NT regression the coefficients vector is given by Eq. (16) and, more explicitly, Eq. (17). The prediction function f^𝖭𝖳(𝒙)=𝚽(𝒙),𝒂^\widehat{f}_{{\sf NT}}(\text{\boldmath$x$})=\langle\text{\boldmath$\Phi$}(\text{\boldmath$x$}),\widehat{\text{\boldmath$a$}}\rangle can be written as

f^𝖭𝖳(𝒙)=𝐊N(,𝒙)(λ𝐈n+𝐊N)1𝒚,\widehat{f}_{{\sf NT}}(\text{\boldmath$x$})=\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}\text{\boldmath$y$}\,,

where we denote 𝐊N(,𝒙)=(KN(𝒙1,𝒙),,KN(𝒙n,𝒙))n\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})=(K_{N}(\text{\boldmath$x$}_{1},\text{\boldmath$x$}),\ldots,K_{N}(\text{\boldmath$x$}_{n},\text{\boldmath$x$}))^{\top}\in\mathbb{R}^{n}. Define 𝒇=(f(𝒙1),,f(𝒙n))\text{\boldmath$f$}=(f(\text{\boldmath$x$}_{1}),\ldots,f(\text{\boldmath$x$}_{n}))^{\top} and 𝜺=(ε1,,εn)\text{\boldmath$\varepsilon$}=(\varepsilon_{1},\ldots,\varepsilon_{n})^{\top}. We now decompose the generalization error R𝖭𝖳(λ)R_{{\sf NT}}(\lambda) into three errors.

R𝖭𝖳(λ)\displaystyle R_{{\sf NT}}(\lambda) =𝔼𝒙[(f(𝒙)𝐊N(,𝒙)(λ𝐈n+𝐊N)1𝒚)2]\displaystyle=\mathbb{E}_{\text{\boldmath$x$}}\Big{[}\big{(}f(\text{\boldmath$x$})-\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}\text{\boldmath$y$}\big{)}^{2}\Big{]}
=𝔼𝒙[(f(𝒙)𝐊N(,𝒙)(λ𝐈n+𝐊N)1𝒇)2]\displaystyle=\mathbb{E}_{\text{\boldmath$x$}}\Big{[}\big{(}f(\text{\boldmath$x$})-\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}\text{\boldmath$f$}\big{)}^{2}\Big{]}
+𝜺(λ𝐈n+𝐊N)1𝔼𝒙[𝐊N(,𝒙)𝐊N(,𝒙)](λ𝐈n+𝐊N)1𝜺\displaystyle+\text{\boldmath$\varepsilon$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}\mathbb{E}_{\text{\boldmath$x$}}\big{[}\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})^{\top}\big{]}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}\text{\boldmath$\varepsilon$}
2𝜺(λ𝐈n+𝐊N)1𝔼𝒙[𝐊N(,𝒙)(f(𝒙)𝐊N(,𝒙)(λ𝐈n+𝐊N)1𝒇)]\displaystyle-2\text{\boldmath$\varepsilon$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}\mathbb{E}_{\text{\boldmath$x$}}\big{[}\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})(f(\text{\boldmath$x$})-\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}\text{\boldmath$f$})\big{]}
=:EBiasN+EVarN2ECrossN.\displaystyle=:E_{\mathrm{Bias}}^{N}+E_{\mathrm{Var}}^{N}-2E_{\text{\rm Cross}}^{N}.

In the kernel ridge regression, the prediction function is

f^𝖪𝖱𝖱(𝒙)=𝐊(,𝒙)(λ𝐈n+𝐊)1𝒚.\widehat{f}_{{\sf KRR}}(\text{\boldmath$x$})=\mathrm{\bf K}(\cdot,\text{\boldmath$x$})^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$y$}.

Similarly, we can decompose the associated generalization error R𝖪𝖱𝖱(λ)R_{{\sf KRR}}(\lambda) into three errors.

R𝖪𝖱𝖱(λ)\displaystyle R_{{\sf KRR}}(\lambda) =𝔼𝒙[(f(𝒙)𝐊(,𝒙)(λ𝐈n+𝐊)1𝒚)2]\displaystyle=\mathbb{E}_{\text{\boldmath$x$}}\Big{[}\big{(}f(\text{\boldmath$x$})-\mathrm{\bf K}(\cdot,\text{\boldmath$x$})^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$y$}\big{)}^{2}\Big{]}
=𝔼𝒙[(f(𝒙)𝐊(,𝒙)(λ𝐈n+𝐊)1𝒇)2]\displaystyle=\mathbb{E}_{\text{\boldmath$x$}}\Big{[}\big{(}f(\text{\boldmath$x$})-\mathrm{\bf K}(\cdot,\text{\boldmath$x$})^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$f$}\big{)}^{2}\Big{]}
+𝜺(λ𝐈n+𝐊)1𝔼𝒙[𝐊(,𝒙)𝐊(,𝒙)](λ𝐈n+𝐊)1𝜺\displaystyle+\text{\boldmath$\varepsilon$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathbb{E}_{\text{\boldmath$x$}}\big{[}\mathrm{\bf K}(\cdot,\text{\boldmath$x$})\mathrm{\bf K}(\cdot,\text{\boldmath$x$})^{\top}\big{]}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$\varepsilon$}
2𝜺(λ𝐈n+𝐊)1𝔼𝒙[𝐊(,𝒙)(f(𝒙)𝐊(,𝒙)(λ𝐈n+𝐊)1𝒇)]\displaystyle-2\text{\boldmath$\varepsilon$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathbb{E}_{\text{\boldmath$x$}}\big{[}\mathrm{\bf K}(\cdot,\text{\boldmath$x$})(f(\text{\boldmath$x$})-\mathrm{\bf K}(\cdot,\text{\boldmath$x$})^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$f$})\big{]}
=:EBias+EVar2ECross.\displaystyle=:E_{\mathrm{Bias}}+E_{\mathrm{Var}}-2E_{\text{\rm Cross}}.

The first part of Theorem 3.3, establishing that NT regression is well approximated by kernel ridge regression for overparametrized models, is an immediate consequence of the next statement.

Theorem 8.1 (Reduction to kernel ridge regression).

There exists a constant C^{\prime}>0 such that the following holds. If n(\log(Nd))^{C^{\prime}}\leq Nd, then for any \lambda>0, with high probability,

|EBiasNEBias|ηfL22,|EVarNEVar|ησε2,\displaystyle\big{|}E_{\mathrm{Bias}}^{N}-E_{\mathrm{Bias}}\big{|}\leq\eta\,\|f\|_{L^{2}}^{2},\qquad\big{|}E_{\mathrm{Var}}^{N}-E_{\mathrm{Var}}\big{|}\leq\eta\,\sigma_{\varepsilon}^{2}, (62)
|ECrossNECross|ηfL2σε,whereη=n(Clog(nNd))CNd.\displaystyle\big{|}E_{\text{\rm Cross}}^{N}-E_{\text{\rm Cross}}\big{|}\leq\eta\,\|f\|_{L^{2}}\sigma_{\varepsilon},\qquad\text{where}~{}\eta=\sqrt{\frac{n(C^{\prime}\log(nNd))^{C^{\prime}}}{Nd}}\,. (63)

As a consequence, we have R𝖭𝖳(λ)=R𝖪𝖱𝖱(λ)+Od,((fL22+σε2)n(log(nNd))CNd)R_{{\sf NT}}(\lambda)=R_{{\sf KRR}}(\lambda)+O_{d,\mathbb{P}}\big{(}(\|f\|_{L^{2}}^{2}+\sigma_{\varepsilon}^{2})\sqrt{\frac{n(\log(nNd))^{C^{\prime}}}{Nd}}\big{)}.
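To illustrate numerically what Theorem 8.1 asserts, here is a short sketch under assumptions of our own choosing (\sigma=\tanh, a noisy linear target f_{*}(\text{\boldmath$x$})=\langle\text{\boldmath$\beta$},\text{\boldmath$x$}\rangle, and the infinite-width kernel \mathrm{\bf K} replaced by a very wide NT kernel with M\gg N random neurons as a proxy): once Nd\gg n, NT ridge regression at width N and at the proxy width M produce nearly the same test error.

```python
import numpy as np

rng = np.random.default_rng(2)
dsigma = lambda t: 1.0 / np.cosh(t) ** 2

def sphere(m, dim, radius):
    Z = rng.normal(size=(m, dim))
    return radius * Z / np.linalg.norm(Z, axis=1, keepdims=True)

def nt_kernel(X1, X2, W):
    # K(x, x') = (1/|W|) sum_k sigma'(<x, w_k>) sigma'(<x', w_k>) <x, x'>/d
    return (dsigma(X1 @ W.T) @ dsigma(X2 @ W.T).T / W.shape[0]) * (X1 @ X2.T / X1.shape[1])

d, n, N, M, lam = 30, 200, 1000, 8000, 1e-3
X, Xte = sphere(n, d, np.sqrt(d)), sphere(500, d, np.sqrt(d))
beta = rng.normal(size=d) / np.sqrt(d)
y = X @ beta + 0.1 * rng.normal(size=n)              # noisy linear target

for W in (sphere(N, d, 1.0), sphere(M, d, 1.0)):     # width N vs. a very wide proxy for K
    K, Kte = nt_kernel(X, X, W), nt_kernel(Xte, X, W)
    pred = Kte @ np.linalg.solve(K + lam * np.eye(n), y)
    print(float(np.mean((Xte @ beta - pred) ** 2)))  # the two test errors should be close
```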

Recall the decomposition of the infinite-width kernel into Gegenbauer polynomials introduced in Lemma 1. In Section 3.4 we defined the polynomial kernel Kp(𝒙,𝒙)K^{p}(\text{\boldmath$x$},\text{\boldmath$x$}^{\prime}) by truncating K(𝒙,𝒙)K(\text{\boldmath$x$},\text{\boldmath$x$}^{\prime}) to the degree-\ell polynomials. Namely:

Kp(𝒙,𝒙)=k=0γkQk(d)(𝒙,𝒙).K^{p}(\text{\boldmath$x$},\text{\boldmath$x$}^{\prime})=\sum_{k=0}^{\ell}\gamma_{k}Q_{k}^{(d)}(\langle\text{\boldmath$x$},\text{\boldmath$x$}^{\prime}\rangle). (64)

We also define the matrix 𝐊pn×n\mathrm{\bf K}^{p}\in\mathbb{R}^{n\times n} and vector 𝐊p(,𝒙)n\mathrm{\bf K}^{p}(\cdot,\text{\boldmath$x$})\in\mathbb{R}^{n} as in Eq. (20). In polynomial ridge regression, the prediction function is

f^𝖯𝖱𝖱(𝒙)=𝐊p(,𝒙)((λ+γ>)𝐈n+𝐊p)1𝒚.\widehat{f}_{{\sf PRR}}(\text{\boldmath$x$})=\mathrm{\bf K}^{p}(\cdot,\text{\boldmath$x$})^{\top}((\lambda+\gamma_{>\ell})\mathrm{\bf I}_{n}+\mathrm{\bf K}^{p})^{-1}\text{\boldmath$y$}.

Its associated generalization error R𝖯𝖱𝖱(λ)R_{{\sf PRR}}(\lambda) is also decomposed into three errors.

R𝖯𝖱𝖱(λ)\displaystyle R_{{\sf PRR}}(\lambda) =𝔼𝒙[(f(𝒙)𝐊p(,𝒙)(λ𝐈n+𝐊p)1𝒇)2]\displaystyle=\mathbb{E}_{\text{\boldmath$x$}}\Big{[}\big{(}f(\text{\boldmath$x$})-\mathrm{\bf K}^{p}(\cdot,\text{\boldmath$x$})^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}^{p})^{-1}\text{\boldmath$f$}\big{)}^{2}\Big{]}
+𝜺(λ𝐈n+𝐊p)1𝔼𝒙[𝐊p(,𝒙)𝐊p(,𝒙)](λ𝐈n+𝐊p)1𝜺\displaystyle+\text{\boldmath$\varepsilon$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}^{p})^{-1}\mathbb{E}_{\text{\boldmath$x$}}\big{[}\mathrm{\bf K}^{p}(\cdot,\text{\boldmath$x$})\mathrm{\bf K}^{p}(\cdot,\text{\boldmath$x$})^{\top}\big{]}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}^{p})^{-1}\text{\boldmath$\varepsilon$}
2𝜺(λ𝐈n+𝐊p)1𝔼𝒙[𝐊p(,𝒙)(f(𝒙)𝐊p(,𝒙)(λ𝐈n+𝐊p)1𝒇)]\displaystyle-2\text{\boldmath$\varepsilon$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}^{p})^{-1}\mathbb{E}_{\text{\boldmath$x$}}\big{[}\mathrm{\bf K}^{p}(\cdot,\text{\boldmath$x$})(f(\text{\boldmath$x$})-\mathrm{\bf K}^{p}(\cdot,\text{\boldmath$x$})^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}^{p})^{-1}\text{\boldmath$f$})\big{]}
=:EBiasp(λ)+EVarp(λ)2ECrossp(λ).\displaystyle=:E_{\mathrm{Bias}}^{p}(\lambda)+E_{\mathrm{Var}}^{p}(\lambda)-2E_{\text{\rm Cross}}^{p}(\lambda).

The second part of Theorem 3.3 follows immediately from the following result.

Theorem 8.2 (Reduction to polynomial ridge regression).

There exists a constant C>0C^{\prime}>0 such that the following holds. For any λ>0\lambda>0, with high probability,

|EBias(λ)EBiasp(λ+γ>)|ηfL22,|EVar(λ)EVarp(λ+γ>)|ησε2,\displaystyle\big{|}E_{\mathrm{Bias}}(\lambda)-E_{\mathrm{Bias}}^{p}(\lambda+\gamma_{>\ell})\big{|}\leq\eta^{\prime}\|f\|_{L^{2}}^{2},\qquad\big{|}E_{\mathrm{Var}}(\lambda)-E_{\mathrm{Var}}^{p}(\lambda+\gamma_{>\ell})\big{|}\leq\eta^{\prime}\sigma_{\varepsilon}^{2}, (65)
|ECross(λ)ECrossp(λ+γ>)|ηfL2σε,whereη:=C(logn)Cnd+1\displaystyle\big{|}E_{\text{\rm Cross}}(\lambda)-E_{\text{\rm Cross}}^{p}(\lambda+\gamma_{>\ell})\big{|}\leq\eta^{\prime}\|f\|_{L^{2}}\sigma_{\varepsilon},\qquad\text{where}~{}\eta^{\prime}:=\sqrt{\frac{C^{\prime}(\log n)^{C^{\prime}}n}{d^{\ell+1}}} (66)

As a consequence, we have R𝖪𝖱𝖱(λ)=R𝖯𝖱𝖱(λ+γ>)+Od,((fL22+σε2)C(logn)Cnd+1)R_{{\sf KRR}}(\lambda)=R_{{\sf PRR}}(\lambda+\gamma_{>\ell})+O_{d,\mathbb{P}}\big{(}(\|f\|_{L^{2}}^{2}+\sigma_{\varepsilon}^{2})\sqrt{\frac{C^{\prime}(\log n)^{C^{\prime}}n}{d^{\ell+1}}}\big{)}.
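
The mechanism behind the shift from λ\lambda to λ+γ>\lambda+\gamma_{>\ell} can be isolated in a toy computation: on the training sample the high-degree part of the kernel acts approximately as γ>𝐈n\gamma_{>\ell}\mathrm{\bf I}_{n}, while at a fresh test point it is negligible, so K(·,x) is approximately K^p(·,x). The NumPy sketch below is an idealized illustration in which these approximations are taken to be exact (the low-degree stand-in kernel and the value playing the role of γ>\gamma_{>\ell} are arbitrary); in our setting they hold only up to the error quantified in Theorem 8.2.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_test, lam, gamma_hi = 100, 5, 0.05, 0.3   # gamma_hi plays the role of gamma_{>l}

# A stand-in low-degree (rank-deficient) PSD kernel K^p and its off-sample columns.
Z, Zt = rng.standard_normal((n, 10)), rng.standard_normal((n_test, 10))
Kp, Kp_x = Z @ Z.T / 10, Z @ Zt.T / 10
y = rng.standard_normal(n)

# Idealization: the high-degree part of K equals gamma_hi * I_n on the sample
# and vanishes at fresh test points, so K(., x) = K^p(., x) off-sample.
K = Kp + gamma_hi * np.eye(n)

f_krr = Kp_x.T @ np.linalg.solve(lam * np.eye(n) + K, y)
f_prr = Kp_x.T @ np.linalg.solve((lam + gamma_hi) * np.eye(n) + Kp, y)
print(np.max(np.abs(f_krr - f_prr)))            # zero: the ridge is simply shifted by gamma_hi
```

In this idealized form the two predictors coincide exactly; Theorem 8.2 quantifies how far the actual kernel deviates from this picture.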

8.1 Proof of Theorem 8.1: Bias term

In this section we prove the first inequality in Eq. (62). Since both sides are homogeneous in ff, we will assume without loss of generality that fL2=1\|f\|_{L^{2}}=1.

First let us decompose EBiasNE_{\mathrm{Bias}}^{N}. Define

𝐊(,𝒙)=(K(𝒙1,𝒙),,K(𝒙n,𝒙))n,𝐊(2)=𝔼[𝐊(,𝒙)𝐊(,𝒙)]n×n,\displaystyle\mathrm{\bf K}(\cdot,\text{\boldmath$x$})=\big{(}K(\text{\boldmath$x$}_{1},\text{\boldmath$x$}),\ldots,K(\text{\boldmath$x$}_{n},\text{\boldmath$x$})\big{)}^{\top}\in\mathbb{R}^{n},\qquad\mathrm{\bf K}^{(2)}=\mathbb{E}\big{[}\mathrm{\bf K}(\cdot,\text{\boldmath$x$})\mathrm{\bf K}(\cdot,\text{\boldmath$x$})^{\top}\big{]}\in\mathbb{R}^{n\times n},
𝐊N(,𝒙)=(KN(𝒙1,𝒙),,KN(𝒙n,𝒙))n,𝐊N(2)=𝔼[𝐊N(,𝒙)𝐊N(,𝒙)]n×n.\displaystyle\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})=\big{(}K_{N}(\text{\boldmath$x$}_{1},\text{\boldmath$x$}),\ldots,K_{N}(\text{\boldmath$x$}_{n},\text{\boldmath$x$})\big{)}^{\top}\in\mathbb{R}^{n},\qquad\mathrm{\bf K}^{(2)}_{N}=\mathbb{E}\big{[}\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})^{\top}\big{]}\in\mathbb{R}^{n\times n}.

Then,

EBiasN\displaystyle E_{\mathrm{Bias}}^{N} =𝔼𝒙[(f(𝒙)𝐊N(,𝒙)(λ𝐈n+𝐊N)1𝒇)2]=fL222I1N+I2N\displaystyle=\mathbb{E}_{\text{\boldmath$x$}}\big{[}\big{(}f(\text{\boldmath$x$})-\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}\text{\boldmath$f$}\big{)}^{2}\big{]}=\|f\|_{L^{2}}^{2}-2I_{1}^{N}+I_{2}^{N}

where we define

I1N=𝒇(λ𝐈n+𝐊N)1𝔼𝒙[𝐊N(,𝒙)f(𝒙)],\displaystyle I_{1}^{N}=\text{\boldmath$f$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}\mathbb{E}_{\text{\boldmath$x$}}\big{[}\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})f(\text{\boldmath$x$})\big{]},
I2N=𝒇(λ𝐈n+𝐊N)1𝔼𝒙[𝐊N(,𝒙)𝐊N(,𝒙)](λ𝐈n+𝐊N)1𝒇.\displaystyle I_{2}^{N}=\text{\boldmath$f$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}\mathbb{E}_{\text{\boldmath$x$}}\big{[}\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})^{\top}\big{]}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}\text{\boldmath$f$}.

We also decompose EBias=fL222I1+I2E_{\mathrm{Bias}}=\|f\|_{L^{2}}^{2}-2I_{1}+I_{2}, where

I1=𝒇(λ𝐈n+𝐊)1𝔼𝒙[𝐊(,𝒙)f(𝒙)],\displaystyle I_{1}=\text{\boldmath$f$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathbb{E}_{\text{\boldmath$x$}}\big{[}\mathrm{\bf K}(\cdot,\text{\boldmath$x$})f(\text{\boldmath$x$})\big{]},
I2=𝒇(λ𝐈n+𝐊)1𝔼𝒙[𝐊(,𝒙)𝐊(,𝒙)](λ𝐈n+𝐊)1𝒇.\displaystyle I_{2}=\text{\boldmath$f$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathbb{E}_{\text{\boldmath$x$}}\big{[}\mathrm{\bf K}(\cdot,\text{\boldmath$x$})\mathrm{\bf K}(\cdot,\text{\boldmath$x$})^{\top}\big{]}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$f$}.

Our goal is to show that I1NI_{1}^{N} and I1I_{1} are close, and that I2NI_{2}^{N} and I2I_{2} are close. Let us pause here to explain the challenges and our proof strategy:

  • First, our concentration result controls 𝐊1/2(𝐊N𝐊)𝐊1/2op\|\mathrm{\bf K}^{-1/2}(\mathrm{\bf K}_{N}-\mathrm{\bf K})\mathrm{\bf K}^{-1/2}\|_{\mathrm{op}} but not 𝐊1(𝐊N𝐊)op\|\mathrm{\bf K}^{-1}(\mathrm{\bf K}_{N}-\mathrm{\bf K})\|_{\mathrm{op}}, so it is crucial to balance the matrix differences as we do in the decomposition introduced below.

  • Second, the relation between eigenvalues of 𝐊N\mathrm{\bf K}_{N} and 𝐊\mathrm{\bf K} is not sufficient to control the generalization error (which is evident in the term 𝐊N1𝐊N(2)𝐊N1\mathrm{\bf K}_{N}^{-1}\mathrm{\bf K}_{N}^{(2)}\mathrm{\bf K}_{N}^{-1}). We will therefore characterize the eigenvector structure as well.

  • Third, our previous analysis gives a bound on 𝐊1/2(𝐊N𝐊)𝐊1/2op\|\mathrm{\bf K}^{-1/2}(\mathrm{\bf K}_{N}-\mathrm{\bf K})\mathrm{\bf K}^{-1/2}\|_{\mathrm{op}}, but this does not allow us to control 𝐊N(2)𝐊(2)\mathrm{\bf K}_{N}^{(2)}-\mathrm{\bf K}^{(2)}. We develop a new approach that exploits the independence of (𝒘k)kN(\text{\boldmath$w$}_{k})_{k\leq N}.

For later use, we work in a slightly more general setup: let gL2g\in L^{2} be a function and 𝒉n\text{\boldmath$h$}\in\mathbb{R}^{n} a random vector. We begin the analysis by defining the following differences.

δI11g,𝒉=[𝒉(λ𝐈n+𝐊N)1𝐊N𝒉(λ𝐈n+𝐊)1𝐊N]𝐊N1𝔼𝒙[𝐊N(,𝒙)g(𝒙)],\displaystyle\delta I_{11}^{g,\text{\boldmath$h$}}=\big{[}\text{\boldmath$h$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}\mathrm{\bf K}_{N}-\text{\boldmath$h$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathrm{\bf K}_{N}\big{]}\cdot\mathrm{\bf K}_{N}^{-1}\mathbb{E}_{\text{\boldmath$x$}}\big{[}\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})g(\text{\boldmath$x$})\big{]},
δI12g,𝒉=𝒉(λ𝐈n+𝐊)1𝔼𝒙[(𝐊N(,𝒙)𝐊(,𝒙))g(𝒙)],\displaystyle\delta I_{12}^{g,\text{\boldmath$h$}}=\text{\boldmath$h$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathbb{E}_{\text{\boldmath$x$}}\big{[}(\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})-\mathrm{\bf K}(\cdot,\text{\boldmath$x$}))g(\text{\boldmath$x$})\big{]},
δI21𝒉=[𝒉(λ𝐈n+𝐊N)1𝐊N𝒉(λ𝐈n+𝐊)1𝐊N]𝐊N1𝐊N(2)𝐊N1𝐊N(λ𝐈n+𝐊N)1𝒉,\displaystyle\delta I_{21}^{\text{\boldmath$h$}}=\big{[}\text{\boldmath$h$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}\mathrm{\bf K}_{N}-\text{\boldmath$h$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathrm{\bf K}_{N}\big{]}\cdot\mathrm{\bf K}_{N}^{-1}\mathrm{\bf K}_{N}^{(2)}\mathrm{\bf K}_{N}^{-1}\cdot\mathrm{\bf K}_{N}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}\text{\boldmath$h$},
δI22𝒉=𝒉(λ𝐈n+𝐊)1𝐊N𝐊N1𝐊N(2)𝐊N1[𝐊N(λ𝐈n+𝐊N)1𝒉𝐊N(λ𝐈n+𝐊)1𝒉],\displaystyle\delta I_{22}^{\text{\boldmath$h$}}=\text{\boldmath$h$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathrm{\bf K}_{N}\cdot\mathrm{\bf K}_{N}^{-1}\mathrm{\bf K}_{N}^{(2)}\mathrm{\bf K}_{N}^{-1}\cdot\big{[}\mathrm{\bf K}_{N}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}\text{\boldmath$h$}-\mathrm{\bf K}_{N}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$h$}\big{]},
δI23𝒉=𝒉(λ𝐈n+𝐊)1[𝐊N(2)𝐊(2)](λ𝐈n+𝐊)1𝒉.\displaystyle\delta I_{23}^{\text{\boldmath$h$}}=\text{\boldmath$h$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\big{[}\mathrm{\bf K}_{N}^{(2)}-\mathrm{\bf K}^{(2)}\big{]}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$h$}.

We notice that

I1NI1=δI11f,𝒇+δI12f,𝒇,I2NI2=δI21𝒇+δI22𝒇+δI23𝒇,I_{1}^{N}-I_{1}=\delta I_{11}^{f,\text{\boldmath$f$}}+\delta I_{12}^{f,\text{\boldmath$f$}},\qquad I_{2}^{N}-I_{2}=\delta I_{21}^{\text{\boldmath$f$}}+\delta I_{22}^{\text{\boldmath$f$}}+\delta I_{23}^{\text{\boldmath$f$}},

so we only need to bound these delta terms. Below we state a lemma for this general setup. Note that 𝒇f satisfies the assumption on the random vector therein, because by the law of large numbers n1𝒇2=(1+on,(1))fL22n^{-1}\|\text{\boldmath$f$}\|^{2}=(1+o_{n,\mathbb{P}}(1))\|f\|_{L^{2}}^{2}, so 𝒇C1n\|\text{\boldmath$f$}\|\leq C_{1}\sqrt{n} with high probability.

Lemma 3.

Suppose that, for C1>0C_{1}>0 a constant, we have gL2C1\|g\|_{L^{2}}\leq C_{1}, and that 𝐡n\text{\boldmath$h$}\in\mathbb{R}^{n} is a random vector that satisfies 𝐡C1n\|\text{\boldmath$h$}\|\leq C_{1}\sqrt{n} with high probability. Then, there exists a constant C>0C^{\prime}>0 such that the following bounds hold with high probability.

𝐊N1𝔼𝒙[𝐊N(,𝒙)g(𝒙)]Cn,\displaystyle\left\|\mathrm{\bf K}_{N}^{-1}\mathbb{E}_{\text{\boldmath$x$}}\big{[}\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})g(\text{\boldmath$x$})\big{]}\right\|\leq\frac{C^{\prime}}{\sqrt{n}}, (67)
𝐊N1𝐊N(2)𝐊N1opCn,\displaystyle\left\|\mathrm{\bf K}_{N}^{-1}\mathrm{\bf K}_{N}^{(2)}\mathrm{\bf K}_{N}^{-1}\right\|_{\mathrm{op}}\leq\frac{C^{\prime}}{n}, (68)
𝐊(λ𝐈n+𝐊)1𝒉Cn,\displaystyle\left\|\mathrm{\bf K}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$h$}\right\|\leq C^{\prime}\sqrt{n}, (69)
𝐊N(λ𝐈n+𝐊N)1𝒉Cn,\displaystyle\left\|\mathrm{\bf K}_{N}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}\text{\boldmath$h$}\right\|\leq C^{\prime}\sqrt{n}, (70)
𝐊N(λ𝐈n+𝐊N)1𝒉𝐊N(λ𝐈n+𝐊)1𝒉Cηn.\displaystyle\left\|\mathrm{\bf K}_{N}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}\text{\boldmath$h$}-\mathrm{\bf K}_{N}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$h$}\right\|\leq C^{\prime}\eta\sqrt{n}. (71)

Here, we denote η=n(log(nNd))C/(Nd)\eta=\sqrt{n(\log(nNd))^{C^{\prime}}/(Nd)}. As a consequence, we have, with high probability,

max{|δI11g,𝒉|,|δI21𝒉|,|δI22𝒉|}Cη.\max\{|\delta I_{11}^{g,\text{\boldmath$h$}}|,|\delta I_{21}^{\text{\boldmath$h$}}|,|\delta I_{22}^{\text{\boldmath$h$}}|\}\leq C^{\prime}\eta. (72)

We handle the other two terms δI12g,𝒉\delta I_{12}^{g,\text{\boldmath$h$}} and δI23𝒉\delta I_{23}^{\text{\boldmath$h$}} in a different way. Denote

𝒗=(λ𝐈n+𝐊)1𝒉n,h~(𝒙)=𝐊(,𝒙)(λ𝐈n+𝐊)1𝒉.\displaystyle\text{\boldmath$v$}=(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$h$}\in\mathbb{R}^{n},\qquad\widetilde{h}(\text{\boldmath$x$})=\mathrm{\bf K}(\cdot,\text{\boldmath$x$})^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$h$}.

The function h~\widetilde{h} satisfies h~L22C𝒉2/n\|\widetilde{h}\|_{L^{2}}^{2}\leq C\|\text{\boldmath$h$}\|^{2}/n with high probability by the following lemma, which we will prove in Section B.3.

Lemma 4.

Define 𝐊(2):=𝔼𝐱[𝐊(,𝐱)𝐊(,𝐱)]n×n\mathrm{\bf K}^{(2)}:=\mathbb{E}_{\text{\boldmath$x$}}[\mathrm{\bf K}(\cdot,\text{\boldmath$x$})\mathrm{\bf K}(\cdot,\text{\boldmath$x$})^{\top}]\in\mathbb{R}^{n\times n} or, equivalently,

Kij(2)=(𝔼𝒙[K(𝒙,𝒙i)K(𝒙,𝒙j)])i,jn.\displaystyle K_{ij}^{(2)}=\Big{(}\mathbb{E}_{\text{\boldmath$x$}}[K(\text{\boldmath$x$},\text{\boldmath$x$}_{i})K(\text{\boldmath$x$},\text{\boldmath$x$}_{j})]\Big{)}_{i,j\leq n}\,. (73)

Then, with high probability,

(λ𝐈n+𝐊)1𝐊(2)(λ𝐈n+𝐊)1opCn,\displaystyle\big{\|}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathrm{\bf K}^{(2)}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\big{\|}_{\mathrm{op}}\leq\frac{C}{n}, (74)
(λ𝐈n+𝐊)1𝔼𝒙[𝐊(,𝒙)f(𝒙)]CnfL2.\displaystyle\big{\|}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathbb{E}_{\text{\boldmath$x$}}[\mathrm{\bf K}(\cdot,\text{\boldmath$x$})f(\text{\boldmath$x$})]\big{\|}\leq\frac{C}{\sqrt{n}}\|f\|_{L^{2}}. (75)
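
One heuristic reading of Eq. (74), offered here only as an aside, is that 𝐊(2)=𝔼𝒙[𝐊(,𝒙)𝐊(,𝒙)]\mathrm{\bf K}^{(2)}=\mathbb{E}_{\text{\boldmath$x$}}[\mathrm{\bf K}(\cdot,\text{\boldmath$x$})\mathrm{\bf K}(\cdot,\text{\boldmath$x$})^{\top}] behaves like a Monte Carlo average over points distributed as the training points, so it is roughly of the size of 𝐊2/n\mathrm{\bf K}^{2}/n, and sandwiching by the resolvent then gives an operator norm of order 1/n1/n. The NumPy sketch below evaluates the left-hand side of Eq. (74) for the first-layer NT kernel with σ=ReLU\sigma=\mathrm{ReLU}, for which 𝔼_w[σ'(⟨x_1,w⟩)σ'(⟨x_2,w⟩)] has the closed form (π−θ)/(2π) with θ the angle between x_1 and x_2; the dimensions, Monte Carlo size, and λ are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_mc, lam = 200, 20000, 0.1

def sphere(m):  # m points ~ Unif(S^{d-1}(sqrt(d)))
    Z = rng.standard_normal((m, d))
    return Z * np.sqrt(d) / np.linalg.norm(Z, axis=1, keepdims=True)

def nt_kernel(A, B):
    # K(x, x') = <x, x'>/d * E_w[sigma'(<x,w>) sigma'(<x',w>)], with sigma = ReLU
    cos = np.clip(A @ B.T / d, -1.0, 1.0)
    return cos * (np.pi - np.arccos(cos)) / (2 * np.pi)

for n in (100, 200, 400, 800):
    X, Xmc = sphere(n), sphere(n_mc)
    K = nt_kernel(X, X)
    Kx = nt_kernel(X, Xmc)                       # columns approximate K(., x) for fresh x
    K2 = Kx @ Kx.T / n_mc                        # Monte Carlo estimate of K^{(2)}
    M = np.linalg.inv(lam * np.eye(n) + K)
    print(n, n * np.linalg.norm(M @ K2 @ M, 2))  # n * op-norm stays bounded, cf. Eq. (74)
```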

We can rewrite δI12g,𝒉\delta I_{12}^{g,\text{\boldmath$h$}} and δI23𝒉\delta I_{23}^{\text{\boldmath$h$}} as

δI12g,𝒉\displaystyle\delta I_{12}^{g,\text{\boldmath$h$}} =𝒗𝔼𝒙[(𝐊N(,𝒙)𝐊(,𝒙))g(𝒙)],\displaystyle=\text{\boldmath$v$}^{\top}\mathbb{E}_{\text{\boldmath$x$}}\big{[}(\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})-\mathrm{\bf K}(\cdot,\text{\boldmath$x$}))g(\text{\boldmath$x$})\big{]},
δI23𝒉\displaystyle\delta I_{23}^{\text{\boldmath$h$}} =2𝒗𝔼𝒙[(𝐊N(,𝒙)𝐊(,𝒙))h~(𝒙)]+𝒗𝔼𝒙[(𝐊N(,𝒙)𝐊(,𝒙))(𝐊N(,𝒙)𝐊(,𝒙))]𝒗\displaystyle=2\text{\boldmath$v$}^{\top}\mathbb{E}_{\text{\boldmath$x$}}\big{[}(\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})-\mathrm{\bf K}(\cdot,\text{\boldmath$x$}))\widetilde{h}(\text{\boldmath$x$})\big{]}+\text{\boldmath$v$}^{\top}\mathbb{E}_{\text{\boldmath$x$}}\Big{[}(\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})-\mathrm{\bf K}(\cdot,\text{\boldmath$x$}))(\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})-\mathrm{\bf K}(\cdot,\text{\boldmath$x$}))^{\top}\Big{]}\text{\boldmath$v$}
=:2δI231𝒉+δI232𝒉.\displaystyle=:2\delta I_{231}^{\text{\boldmath$h$}}+\delta I_{232}^{\text{\boldmath$h$}}.

Note that δI232𝒉\delta I_{232}^{\text{\boldmath$h$}} is always nonnegative. We calculate and bound 𝔼𝒘[(δI12g,𝒉)2]\mathbb{E}_{\text{\boldmath$w$}}[(\delta I_{12}^{g,\text{\boldmath$h$}})^{2}], 𝔼𝒘[(δI231𝒉)2]\mathbb{E}_{\text{\boldmath$w$}}[(\delta I_{231}^{\text{\boldmath$h$}})^{2}], and 𝔼𝒘[δI232𝒉]\mathbb{E}_{\text{\boldmath$w$}}[\delta I_{232}^{\text{\boldmath$h$}}], so that we obtain bounds on δI12g,𝒉\delta I_{12}^{g,\text{\boldmath$h$}} and δI23𝒉\delta I_{23}^{\text{\boldmath$h$}} with high probability.

Lemma 5.

Suppose that, for C1>0C_{1}>0 a constant, we have gL2C1\|g\|_{L^{2}}\leq C_{1}, and that 𝐡n\text{\boldmath$h$}\in\mathbb{R}^{n} is a random vector that satisfies 𝐡C1n\|\text{\boldmath$h$}\|\leq C_{1}\sqrt{n} with high probability.

Let 𝐳1,𝐳2,𝐳\text{\boldmath$z$}_{1},\text{\boldmath$z$}_{2},\text{\boldmath$z$} be independent copies of 𝐱x. Then we have

𝔼𝒘[(δI12g,𝒉)2]1N𝒉(λ𝐈n+𝐊)1𝐇1(λ𝐈n+𝐊)1𝒉,\displaystyle\mathbb{E}_{\text{\boldmath$w$}}[(\delta I_{12}^{g,\text{\boldmath$h$}})^{2}]\leq\frac{1}{N}\text{\boldmath$h$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathrm{\bf H}_{1}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$h$}, (76)
𝔼𝒘[(δI231𝒉)2]1N𝒉(λ𝐈n+𝐊)1𝐇2(λ𝐈n+𝐊)1𝒉,\displaystyle\mathbb{E}_{\text{\boldmath$w$}}[(\delta I_{231}^{\text{\boldmath$h$}})^{2}]\leq\frac{1}{N}\text{\boldmath$h$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathrm{\bf H}_{2}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$h$}, (77)
𝔼𝒘[δI232𝒉]1N𝒉(λ𝐈n+𝐊)1𝐇3(λ𝐈n+𝐊)1𝒉,\displaystyle\mathbb{E}_{\text{\boldmath$w$}}[\delta I_{232}^{\text{\boldmath$h$}}]\leq\frac{1}{N}\text{\boldmath$h$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathrm{\bf H}_{3}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$h$}, (78)

where 𝐇m=(Hm(𝒙i,𝒙j))i,jnn×n\mathrm{\bf H}_{m}=(H_{m}(\text{\boldmath$x$}_{i},\text{\boldmath$x$}_{j}))_{i,j\leq n}\in\mathbb{R}^{n\times n} (m=1,2,3)(m=1,2,3) are given by

H1(𝒙1,𝒙2)\displaystyle H_{1}(\text{\boldmath$x$}_{1},\text{\boldmath$x$}_{2}) =𝔼𝒛1,𝒛2,𝒘[σ(𝒙1,𝒘)σ(𝒛1,𝒘)σ(𝒙2,𝒘)σ(𝒛2,𝒘)𝒙1,𝒛1d𝒙2,𝒛2dg(𝒛1)g(𝒛2)]\displaystyle=\mathbb{E}_{\text{\boldmath$z$}_{1},\text{\boldmath$z$}_{2},\text{\boldmath$w$}}\Big{[}\sigma^{\prime}(\langle\text{\boldmath$x$}_{1},\text{\boldmath$w$}\rangle)\sigma^{\prime}(\langle\text{\boldmath$z$}_{1},\text{\boldmath$w$}\rangle)\sigma^{\prime}(\langle\text{\boldmath$x$}_{2},\text{\boldmath$w$}\rangle)\sigma^{\prime}(\langle\text{\boldmath$z$}_{2},\text{\boldmath$w$}\rangle)\frac{\langle\text{\boldmath$x$}_{1},\text{\boldmath$z$}_{1}\rangle}{d}\frac{\langle\text{\boldmath$x$}_{2},\text{\boldmath$z$}_{2}\rangle}{d}g(\text{\boldmath$z$}_{1})g(\text{\boldmath$z$}_{2})\Big{]}
H2(𝒙1,𝒙2)\displaystyle H_{2}(\text{\boldmath$x$}_{1},\text{\boldmath$x$}_{2}) =𝔼𝒛1,𝒛2,𝒘[σ(𝒙1,𝒘)σ(𝒛1,𝒘)σ(𝒙2,𝒘)σ(𝒛2,𝒘)𝒙1,𝒛1d𝒙2,𝒛2dh~(𝒛1)h~(𝒛2)]\displaystyle=\mathbb{E}_{\text{\boldmath$z$}_{1},\text{\boldmath$z$}_{2},\text{\boldmath$w$}}\Big{[}\sigma^{\prime}(\langle\text{\boldmath$x$}_{1},\text{\boldmath$w$}\rangle)\sigma^{\prime}(\langle\text{\boldmath$z$}_{1},\text{\boldmath$w$}\rangle)\sigma^{\prime}(\langle\text{\boldmath$x$}_{2},\text{\boldmath$w$}\rangle)\sigma^{\prime}(\langle\text{\boldmath$z$}_{2},\text{\boldmath$w$}\rangle)\frac{\langle\text{\boldmath$x$}_{1},\text{\boldmath$z$}_{1}\rangle}{d}\frac{\langle\text{\boldmath$x$}_{2},\text{\boldmath$z$}_{2}\rangle}{d}\widetilde{h}(\text{\boldmath$z$}_{1})\widetilde{h}(\text{\boldmath$z$}_{2})\Big{]}
H3(𝒙1,𝒙2)\displaystyle H_{3}(\text{\boldmath$x$}_{1},\text{\boldmath$x$}_{2}) =𝔼𝒛,𝒘[σ(𝒙1,𝒘)σ(𝒙2,𝒘)[σ(𝒛,𝒘)]2𝒙1,𝒛d𝒙2,𝒛d].\displaystyle=\mathbb{E}_{\text{\boldmath$z$},\text{\boldmath$w$}}\Big{[}\sigma^{\prime}(\langle\text{\boldmath$x$}_{1},\text{\boldmath$w$}\rangle)\sigma^{\prime}(\langle\text{\boldmath$x$}_{2},\text{\boldmath$w$}\rangle)[\sigma^{\prime}(\langle\text{\boldmath$z$},\text{\boldmath$w$}\rangle)]^{2}\frac{\langle\text{\boldmath$x$}_{1},\text{\boldmath$z$}\rangle}{d}\frac{\langle\text{\boldmath$x$}_{2},\text{\boldmath$z$}\rangle}{d}\Big{]}.

Moreover, it holds that with high probability,

𝐇1Cd𝐊,𝐇2Cd𝐊,𝐇3Cd𝐊.\mathrm{\bf H}_{1}\preceq\frac{C}{d}\mathrm{\bf K},\qquad\mathrm{\bf H}_{2}\preceq\frac{C}{d}\mathrm{\bf K},\qquad\mathrm{\bf H}_{3}\preceq\frac{C}{d}\mathrm{\bf K}.

As a consequence, with high probability, we have

|δI12g,𝒉|CnlognNd,|δI23𝒉|CnlognNd.\big{|}\delta I_{12}^{g,\text{\boldmath$h$}}\big{|}\leq C\sqrt{\frac{n\log n}{Nd}},\qquad\big{|}\delta I_{23}^{\text{\boldmath$h$}}\big{|}\leq C\sqrt{\frac{n\log n}{Nd}}.

In Lemmas 3 and 5, we set 𝒉=𝒇\text{\boldmath$h$}=\text{\boldmath$f$} and g=fg=f. The conclusions of these two lemmas then yield the bound on the bias term in Theorem 8.1. We defer the proofs of the two lemmas to the appendix.

9 Conclusions

We studied the neural tangent model associated to two-layer neural networks, focusing on its generalization properties in a regime in which the sample size nn, the dimension dd, and the number of neurons NN are all large and polynomially related (c0dnd1/c0c_{0}d\leq n\leq d^{1/c_{0}} for some small constant c0>0c_{0}>0), with the network in the overparametrized regime NdnNd\gg n. We assumed an isotropic model for the covariates (𝒙i)in(\text{\boldmath$x$}_{i})_{i\leq n}, and noisy labels (yi)in(y_{i})_{i\leq n} generated by a general target, yi=f(𝒙i)+εiy_{i}=f_{*}(\text{\boldmath$x$}_{i})+\varepsilon_{i} (with εi\varepsilon_{i} independent noise).

As a fundamental technical step, we obtained a characterization of the eigenstructure of the empirical NT kernel 𝐊N\mathrm{\bf K}_{N} in the overparametrized regime. In particular, for non-polynomial activations, we showed that the minimum eigenvalue of 𝐊N\mathrm{\bf K}_{N} is bounded away from zero as soon as Nd/(logNd)CnNd/(\log Nd)^{C}\geq n. Further, for d(logd)C0nd+1/(logd)C0d^{\ell}(\log d)^{C_{0}}\leq n\leq d^{\ell+1}/(\log d)^{C_{0}}, \ell\in\mathbb{N}, λmin(𝐊N)\lambda_{\min}(\mathrm{\bf K}_{N}) concentrates around a value that is explicitly given in terms of the Hermite decomposition of the activation.

An immediate corollary is that, as soon as Ndn(logd)CNd\geq n(\log d)^{C}, the neural network can exactly interpolate arbitrary labels.

Our most important result is a characterization of the test error of minimum 2\ell_{2}-norm NT regression. We prove that, for Nd/(logNd)CnNd/(\log Nd)^{C}\geq n, the test error is close to that of the N=N=\infty limit (i.e., of kernel ridgeless regression with respect to the expected kernel). The latter is in turn well approximated by polynomial regression with respect to degree-\ell\ell polynomials, as long as d(logd)C0nd+1/(logd)C0d^{\ell}(\log d)^{C_{0}}\leq n\leq d^{\ell+1}/(\log d)^{C_{0}}.

Our analysis offers several insights for statistical practice:

  1. NT models provide an effective randomized approximation to kernel methods. Indeed, their statistical properties are analogous to those of more standard random features methods [RR08], when we compare them at a fixed number of parameters [MMM21]. However, the computational complexity at prediction time of NT models is smaller than that of random feature models if we keep the same number of parameters.

  2. The additional error due to the finite number of hidden units is roughly bounded by n/(Nd)\sqrt{n/(Nd)}.

  3. In high dimensions, the nonlinearity in the activation function produces an effective ‘self-induced’ regularization (a diagonal term in the kernel) and, as a consequence, interpolation can be optimal.

Finally, our characterization of generalization error applies to two-layer neural networks (under specific initializations) as long as the neural tangent theory provides a good approximation for their behavior. In Section 5 we discussed sufficient conditions for this to be the case, but we do not expect these conditions to be sharp.

Acknowledgements

We thank Behrooz Ghorbani and Song Mei for helpful discussion. This work was supported by NSF through award DMS-2031883 and from the Simons Foundation through Award 814639 for the Collaboration on the Theoretical Foundations of Deep Learning. We also acknowledge NSF grants CCF-1714305, IIS-1741162, and the ONR grant N00014-18-1-2729.

Appendix A Additional numerical experiment

The setup of the experiment is similar to the second experiment in Section 2: we fit a linear model yi=𝒙i,𝜷+εiy_{i}=\langle\text{\boldmath$x$}_{i},\text{\boldmath$\beta$}_{*}\rangle+\varepsilon_{i} where the covariates satisfy 𝒙iUnif(𝕊d1(d))\text{\boldmath$x$}_{i}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d})) and the noise variables satisfy εi𝒩(0,σε2)\varepsilon_{i}\sim{\mathcal{N}}(0,\sigma^{2}_{\varepsilon}), with σε=0.5\sigma_{\varepsilon}=0.5. We fix N=800N=800 and d=500d=500, and vary the sample size n{100,200,,900,1000,1500,,4000}n\in\{100,200,\ldots,900,1000,1500,\ldots,4000\}.

Figure 5 illustrates the relation between NT regression, the 2-layer NN, and polynomial ridge regression. We train a two-layer neural network with ReLU activations using the initialization strategy of [COB19]. Namely, we draw (ak0)kNi.i.d.𝒩(0,1)(a_{k}^{0})_{k\leq N}\sim_{\mathrm{i.i.d.}}\mathcal{N}(0,1), (𝒘k0)kNi.i.d.𝒩(𝟎,𝐈d)(\text{\boldmath$w$}_{k}^{0})_{k\leq N}\sim_{\mathrm{i.i.d.}}\mathcal{N}(\mathrm{\bf 0},\mathrm{\bf I}_{d}) and set a~k0=ak0\widetilde{a}_{k}^{0}=-a_{k}^{0}, 𝒘~k0=𝒘k0\widetilde{\text{\boldmath$w$}}_{k}^{0}=\text{\boldmath$w$}_{k}^{0}. We use the Adam optimizer to train the two-layer neural network f(𝒙)=k=1Nakσ(𝒘k,𝒙)+k=1Na~kσ(𝒘~k,𝒙)f(\text{\boldmath$x$})=\sum_{k=1}^{N}a_{k}\sigma(\langle\text{\boldmath$w$}_{k},\text{\boldmath$x$}\rangle)+\sum_{k=1}^{N}\widetilde{a}_{k}\sigma(\langle\widetilde{\text{\boldmath$w$}}_{k},\text{\boldmath$x$}\rangle) with parameters (ak,a~k,𝒘k,𝒘~k)kN(a_{k},\widetilde{a}_{k},\text{\boldmath$w$}_{k},\widetilde{\text{\boldmath$w$}}_{k})_{k\leq N} initialized by (ak0,a~k0,𝒘k0,𝒘~k0)kN(a_{k}^{0},\widetilde{a}_{k}^{0},\text{\boldmath$w$}_{k}^{0},\widetilde{\text{\boldmath$w$}}_{k}^{0})_{k\leq N}. This guarantees that the output is zero at initialization and the parameter scale is much larger than the default, so that we are in the lazy training regime; a minimal sketch of this setup is given below. We compare the generalization error of the trained network with that of PRR, and with the theoretical prediction of Remark 4. The agreement is excellent.
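
The following is a minimal PyTorch sketch of this training setup (symmetric initialization and full-batch Adam). The learning rate and the number of optimization steps are illustrative choices; they are not the settings used to produce Figure 5.

```python
import torch

d, N, n, sigma_eps = 500, 800, 2000, 0.5
X = torch.randn(n, d)
X = X / X.norm(dim=1, keepdim=True) * d ** 0.5        # x_i ~ Unif(S^{d-1}(sqrt(d)))
beta = torch.randn(d) / d ** 0.5
y = X @ beta + sigma_eps * torch.randn(n)             # y_i = <x_i, beta> + eps_i

a0, W0 = torch.randn(N), torch.randn(N, d)            # a_k^0 ~ N(0,1), w_k^0 ~ N(0, I_d)
a   = a0.clone().requires_grad_(True)
a_t = (-a0).clone().requires_grad_(True)              # a~_k^0 = -a_k^0
W   = W0.clone().requires_grad_(True)
W_t = W0.clone().requires_grad_(True)                 # w~_k^0 = w_k^0

def model(x):
    # f(x) = sum_k a_k relu(<w_k,x>) + sum_k a~_k relu(<w~_k,x>); exactly zero at init
    return torch.relu(x @ W.T) @ a + torch.relu(x @ W_t.T) @ a_t

opt = torch.optim.Adam([a, a_t, W, W_t], lr=1e-3)
for _ in range(2000):
    opt.zero_grad()
    loss = ((model(X) - y) ** 2).mean()
    loss.backward()
    opt.step()
```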

Figure 5: Test/generalization errors of NT, Poly, and two-layer neural network (2-NN). We fix N=800,d=500N=800,d=500 and vary nn. For each noise level σε{0.01,0.1,0.3,0.5}\sigma_{\varepsilon}\in\{0.01,0.1,0.3,0.5\} (which corresponds to one color), we plot three curves that represent the test errors for NT, Poly, and 2-NN. All curves of the same color behave similarly.

Appendix B Generalization error: Proof of Theorem 3.3

This section finishes the proof of our main result, Theorem 3.3, which was initiated in Section 8. We use definitions and notations introduced in that section.

B.1 Proof of Theorem 8.1: Variance term and cross term

By homogeneity we can and will assume, without loss of generality, fL2=σε=1\|f\|_{L^{2}}=\sigma_{\varepsilon}=1.

We handle EVarNE_{\mathrm{Var}}^{N} and ECrossNE_{\text{\rm Cross}}^{N} in a way similar to that of EBiasNE_{\mathrm{Bias}}^{N} (mostly following the same proof with 𝒉h set to 𝜺\varepsilon instead of 𝒇f). First we observe that

𝜺(λ𝐈n+𝐊N)1𝐊N(2)(λ𝐈n+𝐊N)1𝜺𝜺(λ𝐈n+𝐊)1𝐊(2)(λ𝐈n+𝐊)1𝜺\displaystyle~{}~{}~{}\text{\boldmath$\varepsilon$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}\mathrm{\bf K}_{N}^{(2)}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}\text{\boldmath$\varepsilon$}-\text{\boldmath$\varepsilon$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathrm{\bf K}^{(2)}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$\varepsilon$}
=[𝜺(λ𝐈n+𝐊N)1𝐊N𝜺(λ𝐈n+𝐊)1𝐊N]𝐊N1𝐊N(2)𝐊N1𝐊N(λ𝐈n+𝐊N)1𝜺\displaystyle=\big{[}\text{\boldmath$\varepsilon$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}\mathrm{\bf K}_{N}-\text{\boldmath$\varepsilon$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathrm{\bf K}_{N}\big{]}\cdot\mathrm{\bf K}_{N}^{-1}\mathrm{\bf K}_{N}^{(2)}\mathrm{\bf K}_{N}^{-1}\cdot\mathrm{\bf K}_{N}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}\text{\boldmath$\varepsilon$}
+𝜺(λ𝐈n+𝐊)1𝐊N𝐊N1𝐊N(2)𝐊N1[𝐊N(λ𝐈n+𝐊N)1𝜺𝐊N(λ𝐈n+𝐊)1𝜺]\displaystyle+\text{\boldmath$\varepsilon$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathrm{\bf K}_{N}\cdot\mathrm{\bf K}_{N}^{-1}\mathrm{\bf K}_{N}^{(2)}\mathrm{\bf K}_{N}^{-1}\cdot\big{[}\mathrm{\bf K}_{N}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}\text{\boldmath$\varepsilon$}-\mathrm{\bf K}_{N}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$\varepsilon$}\big{]}
+𝜺(λ𝐈n+𝐊)1[𝐊N(2)𝐊(2)](λ𝐈n+𝐊)1𝜺\displaystyle+\text{\boldmath$\varepsilon$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\big{[}\mathrm{\bf K}_{N}^{(2)}-\mathrm{\bf K}^{(2)}\big{]}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$\varepsilon$}
=:δJ1+δJ2+δJ3.\displaystyle=:\delta J_{1}+\delta J_{2}+\delta J_{3}.

Note that n1𝜺2=(1+od,(1))σε2n^{-1}\|\text{\boldmath$\varepsilon$}\|^{2}=(1+o_{d,\mathbb{P}}(1))\sigma_{\varepsilon}^{2} by the law of large numbers, and therefore 𝜺C1n\|\text{\boldmath$\varepsilon$}\|\leq C_{1}\sqrt{n} with high probability. Applying Lemma 3 with 𝒉h set to 𝜺\varepsilon, we obtain that the following bounds hold with high probability.

𝐊(λ𝐈n+𝐊)1𝜺Cn,\displaystyle\left\|\mathrm{\bf K}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$\varepsilon$}\right\|\leq C\sqrt{n}, (79)
𝐊N(λ𝐈n+𝐊N)1𝜺Cn,\displaystyle\left\|\mathrm{\bf K}_{N}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}\text{\boldmath$\varepsilon$}\right\|\leq C\sqrt{n}, (80)
𝐊N(λ𝐈n+𝐊N)1𝜺𝐊N(λ𝐈n+𝐊)1𝜺Cηn.\displaystyle\left\|\mathrm{\bf K}_{N}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}\text{\boldmath$\varepsilon$}-\mathrm{\bf K}_{N}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$\varepsilon$}\right\|\leq C\eta\sqrt{n}. (81)

This implies that |δJ1|Cη|\delta J_{1}|\leq C\eta and |δJ2|Cη|\delta J_{2}|\leq C\eta w.h.p. Moreover, denoting

𝒗=(λ𝐈n+𝐊)1𝜺n,h~(𝒙)=𝐊(,𝒙)(λ𝐈n+𝐊)1𝜺,\displaystyle\text{\boldmath$v$}=(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$\varepsilon$}\in\mathbb{R}^{n},\qquad\widetilde{h}(\text{\boldmath$x$})=\mathrm{\bf K}(\cdot,\text{\boldmath$x$})^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$\varepsilon$},

we express δJ3\delta J_{3} as

δJ3\displaystyle\delta J_{3} =2𝒗𝔼𝒙[(𝐊N(,𝒙)𝐊(,𝒙))h~(𝒙)]+𝒗𝔼𝒙[(𝐊N(,𝒙)𝐊(,𝒙))(𝐊N(,𝒙)𝐊(,𝒙))]𝒗\displaystyle=2\text{\boldmath$v$}^{\top}\mathbb{E}_{\text{\boldmath$x$}}\big{[}(\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})-\mathrm{\bf K}(\cdot,\text{\boldmath$x$}))\widetilde{h}(\text{\boldmath$x$})\big{]}+\text{\boldmath$v$}^{\top}\mathbb{E}_{\text{\boldmath$x$}}\Big{[}(\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})-\mathrm{\bf K}(\cdot,\text{\boldmath$x$}))(\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})-\mathrm{\bf K}(\cdot,\text{\boldmath$x$}))^{\top}\Big{]}\text{\boldmath$v$}
=:2δJ31+δJ32.\displaystyle=:2\delta J_{31}+\delta J_{32}.

Applying Lemma 5 in which we set 𝒉=𝜺\text{\boldmath$h$}=\text{\boldmath$\varepsilon$}, we obtain w.h.p.,

𝔼𝒘[(δJ31)2]1N𝜺(λ𝐈n+𝐊)1𝐇2(λ𝐈n+𝐊)1𝜺CnNd\displaystyle\mathbb{E}_{\text{\boldmath$w$}}[(\delta J_{31})^{2}]\leq\frac{1}{N}\text{\boldmath$\varepsilon$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathrm{\bf H}_{2}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$\varepsilon$}\leq\frac{Cn}{Nd}
𝔼𝒘[δJ32]1N𝜺(λ𝐈n+𝐊)1𝐇3(λ𝐈n+𝐊)1𝜺CnNd.\displaystyle\mathbb{E}_{\text{\boldmath$w$}}[\delta J_{32}]\leq\frac{1}{N}\text{\boldmath$\varepsilon$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathrm{\bf H}_{3}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$\varepsilon$}\leq\frac{Cn}{Nd}.

This implies that |δJ3|Cη|\delta J_{3}|\leq C\eta w.h.p. as well. This proves the bound on the variance term in Theorem 8.1.

Now for the cross term, we observe that

ECrossNECross\displaystyle E_{\text{\rm Cross}}^{N}-E_{\text{\rm Cross}} =[𝜺(λ𝐈n+𝐊N)1𝔼𝒙[𝐊N(,𝒙)f(𝒙)]𝜺(λ𝐈n+𝐊)1𝔼𝒙[𝐊(,𝒙)f(𝒙)]]\displaystyle=\Big{[}\text{\boldmath$\varepsilon$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}\mathbb{E}_{\text{\boldmath$x$}}\big{[}\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})f(\text{\boldmath$x$})\big{]}-\text{\boldmath$\varepsilon$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathbb{E}_{\text{\boldmath$x$}}\big{[}\mathrm{\bf K}(\cdot,\text{\boldmath$x$})f(\text{\boldmath$x$})\big{]}\Big{]}
[𝜺(λ𝐈n+𝐊N)1𝐊N(2)(λ𝐈n+𝐊N)1𝒇𝜺(λ𝐈n+𝐊)1𝐊(2)(λ𝐈n+𝐊)1𝒇]\displaystyle-\Big{[}\text{\boldmath$\varepsilon$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}\mathrm{\bf K}_{N}^{(2)}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}\text{\boldmath$f$}-\text{\boldmath$\varepsilon$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathrm{\bf K}^{(2)}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$f$}\Big{]}
=:δJ~1δJ~2.\displaystyle=:\widetilde{\delta J}_{1}-\widetilde{\delta J}_{2}.

For the first term δJ~1\widetilde{\delta J}_{1}, we further decompose:

δJ~1\displaystyle\widetilde{\delta J}_{1} =[𝜺(λ𝐈n+𝐊N)1𝐊N𝜺(λ𝐈n+𝐊)1𝐊N]𝐊N1𝔼𝒙[𝐊N(,𝒙)f(𝒙)],\displaystyle=\big{[}\text{\boldmath$\varepsilon$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}\mathrm{\bf K}_{N}-\text{\boldmath$\varepsilon$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathrm{\bf K}_{N}\big{]}\cdot\mathrm{\bf K}_{N}^{-1}\mathbb{E}_{\text{\boldmath$x$}}\big{[}\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})f(\text{\boldmath$x$})\big{]},
+𝜺(λ𝐈n+𝐊)1𝔼𝒙[(𝐊N(,𝒙)𝐊(,𝒙))f(𝒙)]=:δJ~11+δJ~12.\displaystyle+\text{\boldmath$\varepsilon$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathbb{E}_{\text{\boldmath$x$}}\big{[}(\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})-\mathrm{\bf K}(\cdot,\text{\boldmath$x$}))f(\text{\boldmath$x$})\big{]}=:\widetilde{\delta J}_{11}+\widetilde{\delta J}_{12}.

From Eqs. (67) and (71) in Lemma 3 (setting g=fg=f and 𝒉=𝜺\text{\boldmath$h$}=\text{\boldmath$\varepsilon$}), we have |δJ~11|Cη|\widetilde{\delta J}_{11}|\leq C\eta w.h.p. By Lemma 5 Eq. (76) (setting 𝒉h to 𝜺\varepsilon), w.h.p.,

𝔼𝒘[(δJ~12)2]1N𝜺(λ𝐈n+𝐊)1𝐇1(λ𝐈n+𝐊)1𝜺CnNd,\mathbb{E}_{\text{\boldmath$w$}}[(\widetilde{\delta J}_{12})^{2}]\leq\frac{1}{N}\text{\boldmath$\varepsilon$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathrm{\bf H}_{1}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$\varepsilon$}\leq\frac{Cn}{Nd},

so with high probability |δJ~12|Cη|\widetilde{\delta J}_{12}|\leq C\eta. Thus we proved |δJ~1|Cη|\widetilde{\delta J}_{1}|\leq C\eta.

Finally, we further decompose δJ~2\widetilde{\delta J}_{2} as follows.

δJ~2=[𝜺(λ𝐈n+𝐊N)1𝐊N𝜺(λ𝐈n+𝐊)1𝐊N]𝐊N1𝐊N(2)𝐊N1𝐊N(λ𝐈n+𝐊N)1𝒇,\displaystyle\widetilde{\delta J}_{2}=\big{[}\text{\boldmath$\varepsilon$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}\mathrm{\bf K}_{N}-\text{\boldmath$\varepsilon$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathrm{\bf K}_{N}\big{]}\cdot\mathrm{\bf K}_{N}^{-1}\mathrm{\bf K}_{N}^{(2)}\mathrm{\bf K}_{N}^{-1}\cdot\mathrm{\bf K}_{N}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}\text{\boldmath$f$},
+𝜺(λ𝐈n+𝐊)1𝐊N𝐊N1𝐊N(2)𝐊N1[𝐊N(λ𝐈n+𝐊N)1𝒇𝐊N(λ𝐈n+𝐊)1𝒇],\displaystyle+\text{\boldmath$\varepsilon$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathrm{\bf K}_{N}\cdot\mathrm{\bf K}_{N}^{-1}\mathrm{\bf K}_{N}^{(2)}\mathrm{\bf K}_{N}^{-1}\cdot\big{[}\mathrm{\bf K}_{N}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}\text{\boldmath$f$}-\mathrm{\bf K}_{N}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$f$}\big{]},
+𝜺(λ𝐈n+𝐊)1[𝐊N(2)𝐊(2)](λ𝐈n+𝐊)1𝒇\displaystyle+\text{\boldmath$\varepsilon$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\big{[}\mathrm{\bf K}_{N}^{(2)}-\mathrm{\bf K}^{(2)}\big{]}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$f$}
=:δJ~21+δJ~22+δJ~23.\displaystyle=:\widetilde{\delta J}_{21}+\widetilde{\delta J}_{22}+\widetilde{\delta J}_{23}.

By Lemma 3 (in which we set 𝒉h to 𝒇f) and Eqs. (79)–(81), we have |δJ~21|Cη|\widetilde{\delta J}_{21}|\leq C\eta and |δJ~22|Cη|\widetilde{\delta J}_{22}|\leq C\eta w.h.p. Moreover, denoting

𝒗1=(λ𝐈n+𝐊)1𝒇,h~1(𝒙)=𝐊(,𝒙)(λ𝐈n+𝐊)1𝒇\displaystyle\text{\boldmath$v$}_{1}=(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$f$},\qquad\widetilde{h}_{1}(\text{\boldmath$x$})=\mathrm{\bf K}(\cdot,\text{\boldmath$x$})^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$f$}
𝒗2=(λ𝐈n+𝐊)1𝜺,h~2(𝒙)=𝐊(,𝒙)(λ𝐈n+𝐊)1𝜺,\displaystyle\text{\boldmath$v$}_{2}=(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$\varepsilon$},\qquad\widetilde{h}_{2}(\text{\boldmath$x$})=\mathrm{\bf K}(\cdot,\text{\boldmath$x$})^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$\varepsilon$},

we express δJ~23\widetilde{\delta J}_{23} as

δJ~23\displaystyle\widetilde{\delta J}_{23} =𝒗2𝔼𝒙[(𝐊N(,𝒙)𝐊(,𝒙))h~1(𝒙)]+𝒗1𝔼𝒙[(𝐊N(,𝒙)𝐊(,𝒙))h~2(𝒙)]\displaystyle=\text{\boldmath$v$}_{2}^{\top}\mathbb{E}_{\text{\boldmath$x$}}\big{[}(\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})-\mathrm{\bf K}(\cdot,\text{\boldmath$x$}))\widetilde{h}_{1}(\text{\boldmath$x$})\big{]}+\text{\boldmath$v$}_{1}^{\top}\mathbb{E}_{\text{\boldmath$x$}}\big{[}(\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})-\mathrm{\bf K}(\cdot,\text{\boldmath$x$}))\widetilde{h}_{2}(\text{\boldmath$x$})\big{]}
+𝒗2𝔼𝒙[(𝐊N(,𝒙)𝐊(,𝒙))(𝐊N(,𝒙)𝐊(,𝒙))]𝒗1\displaystyle+\text{\boldmath$v$}_{2}^{\top}\mathbb{E}_{\text{\boldmath$x$}}\Big{[}(\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})-\mathrm{\bf K}(\cdot,\text{\boldmath$x$}))(\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})-\mathrm{\bf K}(\cdot,\text{\boldmath$x$}))^{\top}\Big{]}\text{\boldmath$v$}_{1}
=:δJ~231+δJ~231+δJ~232.\displaystyle=:\widetilde{\delta J}_{231}+\widetilde{\delta J}_{231}^{\prime}+\widetilde{\delta J}_{232}.

In Lemma 5 Eq. (76), we set g=h~1,𝒉=𝜺g=\widetilde{h}_{1},\text{\boldmath$h$}=\text{\boldmath$\varepsilon$} to obtain 𝔼𝒘[(δJ~231)2]Cn/(Nd)\mathbb{E}_{\text{\boldmath$w$}}\big{[}(\widetilde{\delta J}_{231})^{2}\big{]}\leq Cn/(Nd); and we set g=h~2,𝒉=𝒇g=\widetilde{h}_{2},\text{\boldmath$h$}=\text{\boldmath$f$} to obtain 𝔼𝒘[(δJ~231)2]Cn/(Nd)\mathbb{E}_{\text{\boldmath$w$}}\big{[}(\widetilde{\delta J}_{231}^{\prime})^{2}\big{]}\leq Cn/(Nd). For δJ~232\widetilde{\delta J}_{232}, we use

δJ~232\displaystyle\widetilde{\delta J}_{232} =(𝒗1+𝒗2)𝔼𝒙[(𝐊N(,𝒙)𝐊(,𝒙))(𝐊N(,𝒙)𝐊(,𝒙))](𝒗1+𝒗2)\displaystyle=(\text{\boldmath$v$}_{1}+\text{\boldmath$v$}_{2})^{\top}\mathbb{E}_{\text{\boldmath$x$}}\Big{[}(\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})-\mathrm{\bf K}(\cdot,\text{\boldmath$x$}))(\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})-\mathrm{\bf K}(\cdot,\text{\boldmath$x$}))^{\top}\Big{]}(\text{\boldmath$v$}_{1}+\text{\boldmath$v$}_{2})
𝒗1𝔼𝒙[(𝐊N(,𝒙)𝐊(,𝒙))(𝐊N(,𝒙)𝐊(,𝒙))]𝒗1\displaystyle-\text{\boldmath$v$}_{1}^{\top}\mathbb{E}_{\text{\boldmath$x$}}\Big{[}(\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})-\mathrm{\bf K}(\cdot,\text{\boldmath$x$}))(\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})-\mathrm{\bf K}(\cdot,\text{\boldmath$x$}))^{\top}\Big{]}\text{\boldmath$v$}_{1}
𝒗2𝔼𝒙[(𝐊N(,𝒙)𝐊(,𝒙))(𝐊N(,𝒙)𝐊(,𝒙))]𝒗2\displaystyle-\text{\boldmath$v$}_{2}^{\top}\mathbb{E}_{\text{\boldmath$x$}}\Big{[}(\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})-\mathrm{\bf K}(\cdot,\text{\boldmath$x$}))(\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})-\mathrm{\bf K}(\cdot,\text{\boldmath$x$}))^{\top}\Big{]}\text{\boldmath$v$}_{2}
(𝒗1+𝒗2)𝔼𝒙[(𝐊N(,𝒙)𝐊(,𝒙))(𝐊N(,𝒙)𝐊(,𝒙))](𝒗1+𝒗2).\displaystyle\leq(\text{\boldmath$v$}_{1}+\text{\boldmath$v$}_{2})^{\top}\mathbb{E}_{\text{\boldmath$x$}}\Big{[}(\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})-\mathrm{\bf K}(\cdot,\text{\boldmath$x$}))(\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})-\mathrm{\bf K}(\cdot,\text{\boldmath$x$}))^{\top}\Big{]}(\text{\boldmath$v$}_{1}+\text{\boldmath$v$}_{2}).

Note that 𝒇+𝜺Cn\|\text{\boldmath$f$}+\text{\boldmath$\varepsilon$}\|\leq C\sqrt{n} w.h.p., so applying Lemma 5 Eq. (78) with 𝒉=𝒇+𝜺\text{\boldmath$h$}=\text{\boldmath$f$}+\text{\boldmath$\varepsilon$} leads to 𝔼𝒘[δJ~232]Cn/(Nd)\mathbb{E}_{\text{\boldmath$w$}}\big{[}\widetilde{\delta J}_{232}\big{]}\leq Cn/(Nd) w.h.p. Combining the three bounds, we obtain |δJ~23|Cη|\widetilde{\delta J}_{23}|\leq C\eta w.h.p. and therefore |δJ~2|Cη|\widetilde{\delta J}_{2}|\leq C\eta w.h.p. Hence, we have proved the bound on the cross term in Theorem 8.1.

B.2 Proof of Lemma 3

Throughout this subsection, given a sequence of random variables ξd\xi_{d}, we write ξd=o˘d,(1)\xi_{d}=\breve{o}_{d,\mathbb{P}}(1) if and only if for any ε>0\varepsilon>0 and β>0\beta>0, we have limddβ(|ξd|>ε)=0\lim_{d\to\infty}d^{\beta}\mathbb{P}(|\xi_{d}|>\varepsilon)=0. We also assume that

d+1(logn)Cnd(logn)C,nNd(log(Nd))C\displaystyle\frac{d^{\ell+1}}{(\log n)^{C}}\geq n\geq d^{\ell}(\log n)^{C},\qquad n\leq\frac{Nd}{(\log(Nd))^{C}}

for a sufficiently large C>0C>0 such that we can use the following inequalities by Lemma 2 and Theorem 3.2:

𝚫op=o˘d,(1),n1𝚿𝚿𝐈nop=o˘d,(1),𝐊1/2(𝐊N𝐊)𝐊1/2op=o˘d,(1).\displaystyle\big{\|}\text{\boldmath$\Delta$}\big{\|}_{\mathrm{op}}=\breve{o}_{d,\mathbb{P}}(1),\qquad\big{\|}n^{-1}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}\text{\boldmath$\Psi$}_{\leq\ell}-\mathrm{\bf I}_{n}\big{\|}_{\mathrm{op}}=\breve{o}_{d,\mathbb{P}}(1),\qquad\big{\|}\mathrm{\bf K}^{-1/2}(\mathrm{\bf K}_{N}-\mathrm{\bf K})\mathrm{\bf K}^{-1/2}\big{\|}_{\mathrm{op}}=\breve{o}_{d,\mathbb{P}}(1). (82)

In Section C we will refine our analysis to eliminate logarithmic factors for =1\ell=1.

We start with the easiest bounds in Lemma 3.

Proof of Lemma 3, Eqs. (69)–(71).

The first two bounds follow from the observation

𝐊(λ𝐈n+𝐊)1𝒉=𝒉λ(λ𝐈n+𝐊)1𝒉,\displaystyle\mathrm{\bf K}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$h$}=\text{\boldmath$h$}-\lambda(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$h$},
𝐊N(λ𝐈n+𝐊N)1𝒉=𝒉λ(λ𝐈n+𝐊N)1𝒉\displaystyle\mathrm{\bf K}_{N}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}\text{\boldmath$h$}=\text{\boldmath$h$}-\lambda(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}\text{\boldmath$h$}

From the invertibility of 𝐊\mathrm{\bf K} and 𝐊N\mathrm{\bf K}_{N}, with high probability,

λ(λ𝐈n+𝐊)1opλc+λ,λ(λ𝐈n+𝐊N)1opλc+λ.\displaystyle\left\|\lambda(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\right\|_{\mathrm{op}}\leq\frac{\lambda}{c+\lambda},\qquad\left\|\lambda(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}\right\|_{\mathrm{op}}\leq\frac{\lambda}{c+\lambda}.

We conclude that Eqs. (69) and (70) hold with high probability.

Next, we will show that w.h.p.,

𝐊N(λ𝐈n+𝐊N)1𝐊(λ𝐈n+𝐊)1opCη,\displaystyle\left\|\mathrm{\bf K}_{N}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}-\mathrm{\bf K}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\right\|_{\mathrm{op}}\leq C\sqrt{\eta}, (83)
(𝐊N𝐊)(λ𝐈n+𝐊)1𝒉Cηn.\displaystyle\left\|(\mathrm{\bf K}_{N}-\mathrm{\bf K})(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$h$}\right\|\leq C\sqrt{\eta n}. (84)

Once these are proved, we obtain, w.h.p.,

𝐊N(λ𝐈n+𝐊N)1𝒉𝐊N(λ𝐈n+𝐊)1𝒉\displaystyle\left\|\mathrm{\bf K}_{N}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}\text{\boldmath$h$}-\mathrm{\bf K}_{N}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$h$}\right\| 𝐊N(λ𝐈n+𝐊N)1𝐊(λ𝐈n+𝐊)1op𝒉\displaystyle\leq\left\|\mathrm{\bf K}_{N}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}-\mathrm{\bf K}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\right\|_{\mathrm{op}}\cdot\|\text{\boldmath$h$}\|
+(𝐊N𝐊)(λ𝐈n+𝐊)1𝒉\displaystyle+\left\|(\mathrm{\bf K}_{N}-\mathrm{\bf K})(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$h$}\right\| Cηn.\displaystyle\leq C\sqrt{\eta n}.

In order to show (83), we observe that

𝐊N(λ𝐈n+𝐊N)1𝐊(λ𝐈n+𝐊)1\displaystyle~{}~{}~{}\mathrm{\bf K}_{N}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}-\mathrm{\bf K}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}
=λ(λ𝐈n+𝐊N)1+λ(λ𝐈n+𝐊)1\displaystyle=-\lambda(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}+\lambda(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}
=λ(λ𝐈n+𝐊N)1(𝐊N𝐊)(λ𝐈n+𝐊)1\displaystyle=\lambda(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}(\mathrm{\bf K}_{N}-\mathrm{\bf K})(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}
=λ(λ𝐈n+𝐊N)1𝐊N1/2𝐊N1/2𝐊1/2𝐊1/2(𝐊N𝐊)𝐊1/2𝐊1/2(λ𝐈n+𝐊)1.\displaystyle=\lambda(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}\mathrm{\bf K}_{N}^{1/2}\mathrm{\bf K}_{N}^{-1/2}\mathrm{\bf K}^{1/2}\mathrm{\bf K}^{-1/2}(\mathrm{\bf K}_{N}-\mathrm{\bf K})\mathrm{\bf K}^{-1/2}\mathrm{\bf K}^{1/2}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}.

From Theorem 3.2, we know that 𝐊1/2(𝐊N𝐊)𝐊1/2opη\|\mathrm{\bf K}^{-1/2}(\mathrm{\bf K}_{N}-\mathrm{\bf K})\mathrm{\bf K}^{-1/2}\|_{\mathrm{op}}\leq\sqrt{\eta} w.h.p. It also implies

𝐊N1/2𝐊1/2op2=𝐊1/2𝐊N1𝐊1/2op=(𝐊1/2𝐊N𝐊1/2)1op(1η)1.\|\mathrm{\bf K}_{N}^{-1/2}\mathrm{\bf K}^{1/2}\|_{\mathrm{op}}^{2}=\|\mathrm{\bf K}^{1/2}\mathrm{\bf K}_{N}^{-1}\mathrm{\bf K}^{1/2}\big{\|}_{\mathrm{op}}=\|(\mathrm{\bf K}^{-1/2}\mathrm{\bf K}_{N}\mathrm{\bf K}^{-1/2})^{-1}\big{\|}_{\mathrm{op}}\leq(1-\sqrt{\eta})^{-1}.

Also, we notice that

λ1/2𝐊1/2(λ𝐈n+𝐊)1opmaxiλ1/2[λi(𝐊)]1/2λ+λi(𝐊)12,\displaystyle\left\|\lambda^{1/2}\mathrm{\bf K}^{1/2}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\right\|_{\mathrm{op}}\leq\max_{i}\frac{\lambda^{1/2}[\lambda_{i}(\mathrm{\bf K})]^{1/2}}{\lambda+\lambda_{i}(\mathrm{\bf K})}\leq\frac{1}{2},
λ1/2(λ𝐈n+𝐊N)1𝐊N1/2opmaxiλ1/2[λi(𝐊N)]1/2λ+λi(𝐊N)12,\displaystyle\left\|\lambda^{1/2}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}_{N})^{-1}\mathrm{\bf K}_{N}^{1/2}\right\|_{\mathrm{op}}\leq\max_{i}\frac{\lambda^{1/2}[\lambda_{i}(\mathrm{\bf K}_{N})]^{1/2}}{\lambda+\lambda_{i}(\mathrm{\bf K}_{N})}\leq\frac{1}{2},

These bounds then lead to the desired bound in (83). In order to show (84), we recall the notation 𝒗=(λ𝐈n+𝐊)1𝒉\text{\boldmath$v$}=(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$h$}. We also write 𝐊N=N1kN𝐊(k)\mathrm{\bf K}_{N}=N^{-1}\sum_{k\leq N}\mathrm{\bf K}^{(k)} where (𝐊(k))kN(\mathrm{\bf K}^{(k)})_{k\leq N} are independent conditional on (𝒙i)in(\text{\boldmath$x$}_{i})_{i\leq n} and 𝔼𝒘[𝐊(k)]=𝐊\mathbb{E}_{\text{\boldmath$w$}}[\mathrm{\bf K}^{(k)}]=\mathrm{\bf K}. We calculate

𝔼𝒘[(𝐊N𝐊)𝒗2]\displaystyle\mathbb{E}_{\text{\boldmath$w$}}\big{[}\big{\|}(\mathrm{\bf K}_{N}-\mathrm{\bf K})\text{\boldmath$v$}\big{\|}^{2}\big{]} =𝔼𝒘𝒗,(𝐊N𝐊)2𝒗=1N2k,kN𝔼𝒘𝒗,(𝐊(k)𝐊)(𝐊(k)𝐊)𝒗\displaystyle=\mathbb{E}_{\text{\boldmath$w$}}\langle\text{\boldmath$v$},(\mathrm{\bf K}_{N}-\mathrm{\bf K})^{2}\text{\boldmath$v$}\rangle=\frac{1}{N^{2}}\sum_{k,k^{\prime}\leq N}\mathbb{E}_{\text{\boldmath$w$}}\langle\text{\boldmath$v$},(\mathrm{\bf K}^{(k)}-\mathrm{\bf K})(\mathrm{\bf K}^{(k^{\prime})}-\mathrm{\bf K})\text{\boldmath$v$}\rangle (85)
=1N2kN𝔼𝒘𝒗,(𝐊(k)𝐊)2𝒗\displaystyle=\frac{1}{N^{2}}\sum_{k\leq N}\mathbb{E}_{\text{\boldmath$w$}}\langle\text{\boldmath$v$},(\mathrm{\bf K}^{(k)}-\mathrm{\bf K})^{2}\text{\boldmath$v$}\rangle (86)
(i)1N2kN𝔼𝒘𝒗,(𝐊(k))2𝒗\displaystyle\stackrel{{\scriptstyle(i)}}{{\leq}}\frac{1}{N^{2}}\sum_{k\leq N}\mathbb{E}_{\text{\boldmath$w$}}\langle\text{\boldmath$v$},(\mathrm{\bf K}^{(k)})^{2}\text{\boldmath$v$}\rangle (87)

where (i) is due to

𝔼𝒘[(𝐊(k)𝐊)2]=𝔼𝒘[(𝐊(k))2𝐊(k)𝐊𝐊𝐊(k)+𝐊2]=𝔼𝒘[(𝐊(k))2]𝐊2𝔼𝒘[(𝐊(k))2].\mathbb{E}_{\text{\boldmath$w$}}\big{[}(\mathrm{\bf K}^{(k)}-\mathrm{\bf K})^{2}\big{]}=\mathbb{E}_{\text{\boldmath$w$}}\big{[}(\mathrm{\bf K}^{(k)})^{2}-\mathrm{\bf K}^{(k)}\mathrm{\bf K}-\mathrm{\bf K}\mathrm{\bf K}^{(k)}+\mathrm{\bf K}^{2}\big{]}=\mathbb{E}_{\text{\boldmath$w$}}\big{[}(\mathrm{\bf K}^{(k)})^{2}\big{]}-\mathrm{\bf K}^{2}\preceq\mathbb{E}_{\text{\boldmath$w$}}\big{[}(\mathrm{\bf K}^{(k)})^{2}\big{]}.

Recall the notations in the proof of Theorem 3.2, cf. Eqs. (55), (56).

We denote the indicator variable ξ=𝟏{|𝒙i,𝒘k|C0log(nNd),i,k}\xi=\mathrm{\bf 1}\{|\langle\text{\boldmath$x$}_{i},\text{\boldmath$w$}_{k}\rangle|\leq C_{0}\log(nNd),\forall~{}i,k\}. If C0C_{0} is sufficiently large, we have 𝔼𝒘{𝐊(k)(1ξ)op2}C/(nNd)2\mathbb{E}_{\text{\boldmath$w$}}\{\|\mathrm{\bf K}^{(k)}(1-\xi)\|_{\mathrm{op}}^{2}\}\leq C/(nNd)^{2}. Therefore

𝐊(k)=𝐊(k)ξ+𝐊(k)(1ξ)=d1𝐃k𝑿𝑿𝐃kξ+𝐊(k)(1ξ),and thus\displaystyle\mathrm{\bf K}^{(k)}=\mathrm{\bf K}^{(k)}\xi+\mathrm{\bf K}^{(k)}(1-\xi)=d^{-1}\mathrm{\bf D}_{k}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf D}_{k}\xi+\mathrm{\bf K}^{(k)}(1-\xi),\qquad\text{and thus}
𝒗(𝐊(k))2𝒗=𝒗(𝐊(k))2𝒗ξ+𝒗(𝐊(k))2𝒗(1ξ)𝒗(d1𝐃k𝑿𝑿𝐃k)2𝒗+C𝒗2n2N2d2.\displaystyle\text{\boldmath$v$}^{\top}(\mathrm{\bf K}^{(k)})^{2}\text{\boldmath$v$}=\text{\boldmath$v$}^{\top}(\mathrm{\bf K}^{(k)})^{2}\text{\boldmath$v$}\xi+\text{\boldmath$v$}^{\top}(\mathrm{\bf K}^{(k)})^{2}\text{\boldmath$v$}(1-\xi)\leq\text{\boldmath$v$}^{\top}(d^{-1}\mathrm{\bf D}_{k}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf D}_{k})^{2}\text{\boldmath$v$}+\frac{C\|\text{\boldmath$v$}\|^{2}}{n^{2}N^{2}d^{2}}.

Now we work on the event 𝑿op2(n+d)\|\text{\boldmath$X$}\|_{\mathrm{op}}\leq 2(\sqrt{n}+\sqrt{d}), which happens with high probability. Continuing from Eq. (87), we get:

𝔼𝒘[(𝐊N𝐊)𝒗2]\displaystyle\mathbb{E}_{\text{\boldmath$w$}}\big{[}\big{\|}(\mathrm{\bf K}_{N}-\mathrm{\bf K})\text{\boldmath$v$}\big{\|}^{2}\big{]} 1N2d2kN𝔼𝒘[𝒗𝐃k𝑿𝑿𝐃k2𝑿𝑿𝐃k𝒗]+C𝒗2n2N2d2\displaystyle\leq\frac{1}{N^{2}d^{2}}\sum_{k\leq N}\mathbb{E}_{\text{\boldmath$w$}}\Big{[}\text{\boldmath$v$}^{\top}\mathrm{\bf D}_{k}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf D}_{k}^{2}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf D}_{k}\text{\boldmath$v$}\Big{]}+\frac{C\|\text{\boldmath$v$}\|^{2}}{n^{2}N^{2}d^{2}}
n(log(nNd))CN2d2kN𝔼𝒘[𝒗𝐃k𝑿𝑿𝐃k𝒗]+C𝒗2n2N2d2\displaystyle\leq\frac{n(\log(nNd))^{C}}{N^{2}d^{2}}\sum_{k\leq N}\mathbb{E}_{\text{\boldmath$w$}}\Big{[}\text{\boldmath$v$}^{\top}\mathrm{\bf D}_{k}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf D}_{k}\text{\boldmath$v$}\Big{]}+\frac{C\|\text{\boldmath$v$}\|^{2}}{n^{2}N^{2}d^{2}}
=n(log(nNd))CNd𝒗𝐊𝒗+C𝒗2n2N2d2.\displaystyle=\frac{n(\log(nNd))^{C}}{Nd}\text{\boldmath$v$}^{\top}\mathrm{\bf K}\text{\boldmath$v$}+\frac{C\|\text{\boldmath$v$}\|^{2}}{n^{2}N^{2}d^{2}}.

Finally, since 𝐊λ𝐈n+𝐊\mathrm{\bf K}\preceq\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}, w.h.p.,

𝒗𝐊𝒗𝒉(λ𝐈n+𝐊)1(λ𝐈n+𝐊)(λ𝐈n+𝐊)1𝒉1c+λ𝒉2Cn,\displaystyle\text{\boldmath$v$}^{\top}\mathrm{\bf K}\text{\boldmath$v$}\leq\text{\boldmath$h$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$h$}\leq\frac{1}{c+\lambda}\|\text{\boldmath$h$}\|^{2}\leq Cn,
𝒗21c+λ𝒉2Cn.\displaystyle\|\text{\boldmath$v$}\|^{2}\leq\frac{1}{c+\lambda}\|\text{\boldmath$h$}\|^{2}\leq Cn.

We have shown that

𝔼𝒘[(𝐊N𝐊)𝒗2]n2(log(nNd))CNd.\mathbb{E}_{\text{\boldmath$w$}}\big{[}\big{\|}(\mathrm{\bf K}_{N}-\mathrm{\bf K})\text{\boldmath$v$}\big{\|}^{2}\big{]}\leq\frac{n^{2}(\log(nNd))^{C}}{Nd}.

By Markov’s inequality, we then obtain (𝐊N𝐊)𝒗ηn\|(\mathrm{\bf K}_{N}-\mathrm{\bf K})\text{\boldmath$v$}\|\leq\sqrt{\eta n} w.h.p., as claimed in Eq. (84). ∎
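
The mechanism exploited in this proof is that 𝐊N\mathrm{\bf K}_{N} is an average of NN independent one-neuron kernels 𝐊(k)\mathrm{\bf K}^{(k)}, so that (𝐊N𝐊)𝒗(\mathrm{\bf K}_{N}-\mathrm{\bf K})\text{\boldmath$v$} has fluctuations of order N^{-1/2}. As a standalone illustration only, the NumPy sketch below displays this decay for σ=ReLU\sigma=\mathrm{ReLU}, where the expected kernel K(x_1,x_2) = ⟨x_1,x_2⟩/d · 𝔼_w[σ'(⟨x_1,w⟩)σ'(⟨x_2,w⟩)] has the closed form used in the code; the problem sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 300, 200
X = rng.standard_normal((n, d))
X = X * np.sqrt(d) / np.linalg.norm(X, axis=1, keepdims=True)   # rows on S^{d-1}(sqrt(d))
v = rng.standard_normal(n)

# Expected kernel for sigma = ReLU: E_w[sigma'(<x1,w>) sigma'(<x2,w>)] = (pi - angle)/(2 pi).
cos = np.clip(X @ X.T / d, -1.0, 1.0)
K = cos * (np.pi - np.arccos(cos)) / (2 * np.pi)

def K_empirical(N):
    # K_N = N^{-1} sum_k d^{-1} D_k X X^T D_k, with D_k = diag(sigma'(<x_i, w_k>))_i
    W = rng.standard_normal((N, d))
    S = (X @ W.T > 0).astype(float)          # S[i, k] = sigma'(<x_i, w_k>)
    return (S @ S.T / N) * (X @ X.T / d)

for N in (50, 200, 800, 3200):
    print(N, np.linalg.norm((K_empirical(N) - K) @ v))   # decays roughly like N^{-1/2}
```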

Before proving the more difficult inequalities (67), (68), we establish some useful properties of the eigenstructure of 𝐊N\mathrm{\bf K}_{N}. For convenience, we write 𝐃1=𝐃2(1+o˘d,(1))\mathrm{\bf D}_{1}=\mathrm{\bf D}_{2}\cdot(1+\breve{o}_{d,\mathbb{P}}(1)) if 𝐃1,𝐃2\mathrm{\bf D}_{1},\mathrm{\bf D}_{2} are diagonal matrices and maxi|(𝐃1)ii/(𝐃2)ii1|=o˘d,(1)\max_{i}\big{|}(\mathrm{\bf D}_{1})_{ii}/(\mathrm{\bf D}_{2})_{ii}-1\big{|}=\breve{o}_{d,\mathbb{P}}(1); we also write 𝐃1=𝐃2+o˘d,(1)\mathrm{\bf D}_{1}=\mathrm{\bf D}_{2}+\breve{o}_{d,\mathbb{P}}(1) if 𝐃1,𝐃2\mathrm{\bf D}_{1},\mathrm{\bf D}_{2} are diagonal matrices and maxi|(𝐃1)ii(𝐃2)ii|=o˘d,(1)\max_{i}\big{|}(\mathrm{\bf D}_{1})_{ii}-(\mathrm{\bf D}_{2})_{ii}\big{|}=\breve{o}_{d,\mathbb{P}}(1). Denote D=kB(d,k)D=\sum_{k\leq\ell}B(d,k).

Lemma 6 (Kernel eigenvalue structure).

The eigen-decomposition of 𝐊γ>𝐈n\mathrm{\bf K}-\gamma_{>\ell}\mathrm{\bf I}_{n} and 𝐊Nγ>𝐈n\mathrm{\bf K}_{N}-\gamma_{>\ell}\mathrm{\bf I}_{n} can be written in the following form:

𝐊γ>𝐈n=𝐔𝐃𝐔+𝚫(res),\displaystyle\mathrm{\bf K}-\gamma_{>\ell}\mathrm{\bf I}_{n}=\mathrm{\bf U}\mathrm{\bf D}\mathrm{\bf U}^{\top}+\text{\boldmath$\Delta$}^{(\mathrm{res})}, (88)
𝐊Nγ>𝐈n=𝐔N𝐃N𝐔N+𝚫N(res),\displaystyle\mathrm{\bf K}_{N}-\gamma_{>\ell}\mathrm{\bf I}_{n}=\mathrm{\bf U}_{N}\mathrm{\bf D}_{N}\mathrm{\bf U}_{N}^{\top}+\text{\boldmath$\Delta$}^{(\mathrm{res})}_{N}, (89)

where 𝐃,𝐃ND×D\mathrm{\bf D},\mathrm{\bf D}_{N}\in\mathbb{R}^{D\times D} are diagonal matrices that contain DD eigenvalues of 𝐊,𝐊N\mathrm{\bf K},\mathrm{\bf K}_{N} respectively, columns of 𝐔,𝐔Nn×D\mathrm{\bf U},\mathrm{\bf U}_{N}\in\mathbb{R}^{n\times D} are the corresponding eigenvectors and 𝚫(res),𝚫N(res)\text{\boldmath$\Delta$}^{(\mathrm{res})},\text{\boldmath$\Delta$}_{N}^{(\mathrm{res})} correspond to the other eigenvectors (in particular 𝚫(res)𝐔=𝚫N(res)𝐔N=𝟎\text{\boldmath$\Delta$}^{(\mathrm{res})}\mathrm{\bf U}=\text{\boldmath$\Delta$}_{N}^{(\mathrm{res})}\mathrm{\bf U}_{N}=\mathrm{\bf 0}).

The eigenvalues in 𝐃,𝐃N\mathrm{\bf D},\mathrm{\bf D}_{N} have the following structure.

𝐃=diag(γ0n,γ1nd,,γ1ndd,,γ(!)nd,,γ(!)ndB(d,))(1+o˘d,(1))+o˘d,(1),\displaystyle\mathrm{\bf D}=\mathrm{diag}\Big{(}\gamma_{0}n,\underbrace{\frac{\gamma_{1}n}{d},\ldots,\frac{\gamma_{1}n}{d}}_{d},\ldots,\underbrace{\frac{\gamma_{\ell}(\ell!)n}{d^{\ell}},\ldots,\frac{\gamma_{\ell}(\ell!)n}{d^{\ell}}}_{B(d,\ell)}\Big{)}\cdot(1+\breve{o}_{d,\mathbb{P}}(1))+\breve{o}_{d,\mathbb{P}}(1), (90)
𝐃N=diag(γ0n,γ1nd,,γ1ndd,,γ(!)nd,,γ(!)ndB(d,))(1+o˘d,(1))+o˘d,(1).\displaystyle\mathrm{\bf D}_{N}=\mathrm{diag}\Big{(}\gamma_{0}n,\underbrace{\frac{\gamma_{1}n}{d},\ldots,\frac{\gamma_{1}n}{d}}_{d},\ldots,\underbrace{\frac{\gamma_{\ell}(\ell!)n}{d^{\ell}},\ldots,\frac{\gamma_{\ell}(\ell!)n}{d^{\ell}}}_{B(d,\ell)}\Big{)}\cdot(1+\breve{o}_{d,\mathbb{P}}(1))+\breve{o}_{d,\mathbb{P}}(1). (91)

Moreover, the remaining components satisfy 𝚫(res)op=o˘d,(1)\|\text{\boldmath$\Delta$}^{(\mathrm{res})}\|_{\mathrm{op}}=\breve{o}_{d,\mathbb{P}}(1) and 𝚫N(res)op=o˘d,(1)\|\text{\boldmath$\Delta$}^{(\mathrm{res})}_{N}\|_{\mathrm{op}}=\breve{o}_{d,\mathbb{P}}(1).

Proof of Lemma 6.

For convenience, for a symmetric matrix 𝐀\mathrm{\bf A}, we denote by 𝝀(𝐀)\text{\boldmath$\lambda$}(\mathrm{\bf A}) the vector of eigenvalues of 𝐀\mathrm{\bf A}. We write 𝝀(𝐀1)=𝝀(𝐀2)(1+o˘d,(1))\text{\boldmath$\lambda$}(\mathrm{\bf A}_{1})=\text{\boldmath$\lambda$}(\mathrm{\bf A}_{2})\cdot(1+\breve{o}_{d,\mathbb{P}}(1)) if maxi|λi(𝐀1)/λi(𝐀2)1|=o˘d,(1)\max_{i}|\lambda_{i}(\mathrm{\bf A}_{1})/\lambda_{i}(\mathrm{\bf A}_{2})-1|=\breve{o}_{d,\mathbb{P}}(1). From Lemma 2, we can decompose 𝐊\mathrm{\bf K} as

𝐊=𝚿𝚲2𝚿+γ>𝐈n+𝚫.\mathrm{\bf K}=\text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{2}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}+\gamma_{>\ell}\mathrm{\bf I}_{n}+\text{\boldmath$\Delta$}.

We claim that

𝝀(𝚿𝚲2𝚿)=(γ0n,γ1nd,,γ1ndd,,γ(!)nd,,γ(!)ndB(d,),0,,0nD)(1+o˘d,(1)).\text{\boldmath$\lambda$}(\text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{2}\text{\boldmath$\Psi$}_{\leq\ell}^{\top})=\Big{(}\gamma_{0}n,\underbrace{\frac{\gamma_{1}n}{d},\ldots,\frac{\gamma_{1}n}{d}}_{d},\ldots,\underbrace{\frac{\gamma_{\ell}(\ell!)n}{d^{\ell}},\ldots,\frac{\gamma_{\ell}(\ell!)n}{d^{\ell}}}_{B(d,\ell)},\underbrace{0,\ldots,0}_{n-D}\Big{)}\cdot(1+\breve{o}_{d,\mathbb{P}}(1)). (92)

To prove this claim, first note that (82) implies n1/2𝚿op=1+o˘d,(1)n^{-1/2}\|\text{\boldmath$\Psi$}_{\leq\ell}\|_{\mathrm{op}}=1+\breve{o}_{d,\mathbb{P}}(1). Observe that 𝚿𝚲2𝚿\text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{2}\text{\boldmath$\Psi$}_{\leq\ell}^{\top} has rank at most DD. Define 𝑸n×D\text{\boldmath$Q$}\in\mathbb{R}^{n\times D} such that its columns are the left singular vectors of 𝚿\text{\boldmath$\Psi$}_{\leq\ell}. We only need to show

𝝀(𝑸𝚿𝚲2𝚿𝑸)=(γ0n,γ1nd,,γ1ndd,,γ(!)nd,,γ(!)ndB(d,))(1+o˘d,(1)).\text{\boldmath$\lambda$}(\text{\boldmath$Q$}^{\top}\text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{2}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}\text{\boldmath$Q$})=\Big{(}\gamma_{0}n,\underbrace{\frac{\gamma_{1}n}{d},\ldots,\frac{\gamma_{1}n}{d}}_{d},\ldots,\underbrace{\frac{\gamma_{\ell}(\ell!)n}{d^{\ell}},\ldots,\frac{\gamma_{\ell}(\ell!)n}{d^{\ell}}}_{B(d,\ell)}\Big{)}\cdot(1+\breve{o}_{d,\mathbb{P}}(1)). (93)

In order to prove the above claim, we use the eigenvalue min-max principle to express the kk-th eigenvalue of 𝑸𝚿𝚲2𝚿𝑸\text{\boldmath$Q$}^{\top}\text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{2}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}\text{\boldmath$Q$}. We sort the diagonal entries of 𝚲2\text{\boldmath$\Lambda$}_{\leq\ell}^{2} in the descending order and denote the ranks by s1,,sDs_{1},\ldots,s_{D} (break ties arbitrarily). Define the subspaces 𝒱1=span{𝒆s1,,𝒆sk}\mathcal{V}_{1}=\mathrm{span}\{\text{\boldmath$e$}_{s_{1}},\ldots,\text{\boldmath$e$}_{s_{k}}\}, 𝒱2=span{𝒆sk,,𝒆sD}D\mathcal{V}_{2}=\mathrm{span}\{\text{\boldmath$e$}_{s_{k}},\ldots,\text{\boldmath$e$}_{s_{D}}\}\subset\mathbb{R}^{D} and 𝒱1=span{𝒖D:𝚿𝑸𝒖𝒱1},𝒱2=span{𝒖D:𝚿𝑸𝒖𝒱2}\mathcal{V}^{\prime}_{1}=\mathrm{span}\{\text{\boldmath$u$}\in\mathbb{R}^{D}:\text{\boldmath$\Psi$}_{\leq\ell}^{\top}\text{\boldmath$Q$}\text{\boldmath$u$}\in\mathcal{V}_{1}\},\mathcal{V}^{\prime}_{2}=\mathrm{span}\{\text{\boldmath$u$}\in\mathbb{R}^{D}:\text{\boldmath$\Psi$}_{\leq\ell}^{\top}\text{\boldmath$Q$}\text{\boldmath$u$}\in\mathcal{V}_{2}\}. Note that 𝚿𝑸\text{\boldmath$\Psi$}_{\leq\ell}^{\top}\text{\boldmath$Q$} has full rank, so dim(𝒱1)=k\dim(\mathcal{V}_{1}^{\prime})=k and dim(𝒱2)=Dk+1\dim(\mathcal{V}_{2}^{\prime})=D-k+1.

λk(𝑸𝚿𝚲2𝚿𝑸)\displaystyle\lambda_{k}(\text{\boldmath$Q$}^{\top}\text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{2}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}\text{\boldmath$Q$}) =max𝒱:dim(𝒱)=kmin𝟎𝒖𝒱𝒖𝑸𝚿𝚲2𝚿𝑸𝒖𝒖2\displaystyle=\max_{\mathcal{V}:\dim(\mathcal{V})=k}\min_{\mathrm{\bf 0}\neq\text{\boldmath$u$}\in\mathcal{V}}\frac{\text{\boldmath$u$}^{\top}\text{\boldmath$Q$}^{\top}\text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{2}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}\text{\boldmath$Q$}\text{\boldmath$u$}}{\|\text{\boldmath$u$}\|^{2}}
(1o˘d,(1))min𝟎𝒖𝒱1𝒖𝑸𝚿𝚲2𝚿𝑸𝒖𝚿𝑸𝒖2/n\displaystyle\geq(1-\breve{o}_{d,\mathbb{P}}(1))\min_{\mathrm{\bf 0}\neq\text{\boldmath$u$}\in\mathcal{V}_{1}^{\prime}}\frac{\text{\boldmath$u$}^{\top}\text{\boldmath$Q$}^{\top}\text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{2}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}\text{\boldmath$Q$}\text{\boldmath$u$}}{\|\text{\boldmath$\Psi$}_{\leq\ell}^{\top}\text{\boldmath$Q$}\text{\boldmath$u$}\|^{2}/n}
(1o˘d,(1))min𝟎𝒙𝒱1𝒙𝚲2𝒙𝒙2/nn(1o˘d,(1))λk(𝚲2).\displaystyle\geq(1-\breve{o}_{d,\mathbb{P}}(1))\min_{\mathrm{\bf 0}\neq\text{\boldmath$x$}\in\mathcal{V}_{1}}\frac{\text{\boldmath$x$}^{\top}\text{\boldmath$\Lambda$}_{\leq\ell}^{2}\text{\boldmath$x$}}{\|\text{\boldmath$x$}\|^{2}/n}\geq n(1-\breve{o}_{d,\mathbb{P}}(1))\cdot\lambda_{k}(\text{\boldmath$\Lambda$}_{\leq\ell}^{2}).

Similarly,

λk(𝑸𝚿𝚲2𝚿𝑸)\displaystyle\lambda_{k}(\text{\boldmath$Q$}^{\top}\text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{2}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}\text{\boldmath$Q$}) =min𝒱:dim(𝒱)=Dk+1max𝒖𝒱𝒖𝑸𝚿𝚲2𝚿𝑸𝒖𝒖2\displaystyle=\min_{\mathcal{V}:\dim(\mathcal{V})=D-k+1}\max_{\text{\boldmath$u$}\in\mathcal{V}}\frac{\text{\boldmath$u$}^{\top}\text{\boldmath$Q$}^{\top}\text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{2}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}\text{\boldmath$Q$}\text{\boldmath$u$}}{\|\text{\boldmath$u$}\|^{2}}
(1+o˘d,(1))max𝒖𝒱2𝒖𝑸𝚿𝚲2𝚿𝑸𝒖𝚿𝑸𝒖2/n\displaystyle\leq(1+\breve{o}_{d,\mathbb{P}}(1))\max_{\text{\boldmath$u$}\in\mathcal{V}_{2}^{\prime}}\frac{\text{\boldmath$u$}^{\top}\text{\boldmath$Q$}^{\top}\text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{2}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}\text{\boldmath$Q$}\text{\boldmath$u$}}{\|\text{\boldmath$\Psi$}_{\leq\ell}^{\top}\text{\boldmath$Q$}\text{\boldmath$u$}\|^{2}/n}
(1+o˘d,(1))max𝒙𝒱2𝒙𝚲2𝒙𝒙2/nn(1+o˘d,(1))λk(𝚲2).\displaystyle\leq(1+\breve{o}_{d,\mathbb{P}}(1))\max_{\text{\boldmath$x$}\in\mathcal{V}_{2}}\frac{\text{\boldmath$x$}^{\top}\text{\boldmath$\Lambda$}_{\leq\ell}^{2}\text{\boldmath$x$}}{\|\text{\boldmath$x$}\|^{2}/n}\leq n(1+\breve{o}_{d,\mathbb{P}}(1))\cdot\lambda_{k}(\text{\boldmath$\Lambda$}_{\leq\ell}^{2}).

Finally, we recall that B(d,k)=(1+od(1))dk/k!B(d,k)=(1+o_{d}(1))\cdot d^{k}/k!. This then leads to the claim (92).

In the equality 𝐊γ>𝐈n=𝚿𝚲2𝚿+𝚫\mathrm{\bf K}-\gamma_{>\ell}\mathrm{\bf I}_{n}=\text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{2}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}+\text{\boldmath$\Delta$}, we view 𝚫\Delta as the perturbation added to 𝚿𝚲2𝚿\text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{2}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}. By Weyl’s inequality,

𝝀(𝐊γ>𝐈n)𝝀(𝚿𝚲2𝚿)𝚫op=o˘d,(1),\big{\|}\text{\boldmath$\lambda$}(\mathrm{\bf K}-\gamma_{>\ell}\mathrm{\bf I}_{n})-\text{\boldmath$\lambda$}(\text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{2}\text{\boldmath$\Psi$}_{\leq\ell}^{\top})\big{\|}_{\infty}\leq\big{\|}\text{\boldmath$\Delta$}\big{\|}_{\mathrm{op}}=\breve{o}_{d,\mathbb{P}}(1),

This proves (90) about the eigenvalues of 𝐊γ>𝐈n\mathrm{\bf K}-\gamma_{>\ell}\mathrm{\bf I}_{n}.

Finally, from the kernel invertibility result (i.e. Theorem 3.2), we have (1+η)𝐊𝐊N(1η)𝐊(1+\eta)\mathrm{\bf K}\succeq\mathrm{\bf K}_{N}\succeq(1-\eta)\mathrm{\bf K} for some η=o˘d,(1)\eta=\breve{o}_{d,\mathbb{P}}(1). This implies

(1+η)λk(𝐊)λk(𝐊N)(1η)λk(𝐊).(1+\eta)\lambda_{k}(\mathrm{\bf K})\geq\lambda_{k}(\mathrm{\bf K}_{N})\geq(1-\eta)\lambda_{k}(\mathrm{\bf K}).

So all eigenvalues of 𝐊N\mathrm{\bf K}_{N} are up to a factor of 1+o˘d,(1)1+\breve{o}_{d,\mathbb{P}}(1) compared with those of 𝐊\mathrm{\bf K}. This implies that the eigenvalue structure of 𝐊N\mathrm{\bf K}_{N} is similar to that of 𝐊\mathrm{\bf K}. ∎

Lemma 6 indicates that the eigenvalues of 𝐊\mathrm{\bf K} and 𝐊N\mathrm{\bf K}_{N} exhibit a group structure: roughly speaking, for every k=0,1,,k=0,1,\ldots,\ell, there are B(d,)B(d,\ell) eigenvalues centered around γ>+γk(k!)n/(dk)\gamma_{>\ell}+\gamma_{k}(k!)n/(d^{k}). It is convenient to partition the eigenvectors according to such group structure.

We define

𝐔=[𝐕0(0)n×1,𝐕0(1)n×d,,𝐕0()n×B(d,)],𝐔N=[𝐕(0)n×1,𝐕(1)n×d,,𝐕()n×B(d,)]𝐃=diag(D0(0)1×1,𝐃0(1)d×d,,𝐃0()B(d,)×B(d,)),𝐃N=diag(D(0)1×1,𝐃(1)d×d,,𝐃()B(d,)×B(d,))\begin{array}[]{llll}\mathrm{\bf U}&=\big{[}\underbrace{\mathrm{\bf V}_{0}^{(0)}}_{n\times 1},\underbrace{\mathrm{\bf V}_{0}^{(1)}}_{n\times d},\ldots,\underbrace{\mathrm{\bf V}_{0}^{(\ell)}}_{n\times B(d,\ell)}\big{]},&\mathrm{\bf U}_{N}&=\big{[}\underbrace{\mathrm{\bf V}^{(0)}}_{n\times 1},\underbrace{\mathrm{\bf V}^{(1)}}_{n\times d},\ldots,\underbrace{\mathrm{\bf V}^{(\ell)}}_{n\times B(d,\ell)}\big{]}\\ \mathrm{\bf D}&=\mathrm{diag}\big{(}\underbrace{D_{0}^{(0)}}_{1\times 1},\underbrace{\mathrm{\bf D}_{0}^{(1)}}_{d\times d},\ldots,\underbrace{\mathrm{\bf D}_{0}^{(\ell)}}_{B(d,\ell)\times B(d,\ell)}\big{)},&\mathrm{\bf D}_{N}&=\mathrm{diag}\big{(}\underbrace{D^{(0)}}_{1\times 1},\underbrace{\mathrm{\bf D}^{(1)}}_{d\times d},\ldots,\underbrace{\mathrm{\bf D}^{(\ell)}}_{B(d,\ell)\times B(d,\ell)}\big{)}\end{array}

We further define 𝐕0(+1),𝐕(+1)n×(nD)\mathrm{\bf V}_{0}^{(\ell+1)},\mathrm{\bf V}^{(\ell+1)}\in\mathbb{R}^{n\times(n-D)} and 𝐃0(+1),𝐃(+1)(nD)×(nD)\mathrm{\bf D}_{0}^{(\ell+1)},\mathrm{\bf D}^{(\ell+1)}\in\mathbb{R}^{(n-D)\times(n-D)} such that they contain the remaining eigenvectors and eigenvalues of 𝐊\mathrm{\bf K}, 𝐊N\mathrm{\bf K}_{N}, respectively. We also denote groups of eigenvectors/eigenvalues of 𝚿𝚲2𝚿+γ>𝐈n\text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{2}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}+\gamma_{>\ell}\mathrm{\bf I}_{n} by

𝐔s=[𝐕s(0)n×1,𝐕s(1)n×d,,𝐕s()n×B(d,)],𝐃s=diag(Ds(0)1×1,𝐃s(1)d×d,,𝐃s()B(d,)×B(d,)),\mathrm{\bf U}_{s}=\big{[}\underbrace{\mathrm{\bf V}_{s}^{(0)}}_{n\times 1},\underbrace{\mathrm{\bf V}_{s}^{(1)}}_{n\times d},\ldots,\underbrace{\mathrm{\bf V}_{s}^{(\ell)}}_{n\times B(d,\ell)}\big{]},\qquad\mathrm{\bf D}_{s}=\mathrm{diag}\big{(}\underbrace{D_{s}^{(0)}}_{1\times 1},\underbrace{\mathrm{\bf D}_{s}^{(1)}}_{d\times d},\ldots,\underbrace{\mathrm{\bf D}_{s}^{(\ell)}}_{B(d,\ell)\times B(d,\ell)}\big{)}, (94)

and the remaining eigenvectors/eigenvalues are 𝐕s(+1)\mathrm{\bf V}_{s}^{(\ell+1)} and 𝐃s(+1)\mathrm{\bf D}_{s}^{(\ell+1)}. The following is a useful result about the kernel eigenvector structure.

Lemma 7 (Kernel eigenvector structure).

Let kk{0,1,,+1}k\neq k^{\prime}\in\{0,1,\ldots,\ell+1\}. Denote λk=γ>+γk(k!)n/(dk)\lambda_{k}=\gamma_{>\ell}+\gamma_{k}(k!)n/(d^{k}) for k=0,,k=0,\ldots,\ell and λ+1=γ>\lambda_{\ell+1}=\gamma_{>\ell}.

  1. (a)

    Suppose that min{λk/λk,λk/λk}1/4\min\{\lambda_{k}/\lambda_{k^{\prime}},\lambda_{k^{\prime}}/\lambda_{k}\}\leq 1/4. Then,

    (𝐕0(k))𝐕(k)op=o˘d,(1).\big{\|}(\mathrm{\bf V}_{0}^{(k^{\prime})})^{\top}\mathrm{\bf V}^{(k)}\big{\|}_{\mathrm{op}}=\breve{o}_{d,\mathbb{P}}(1).
  2. (b)

    Recall 𝚫(res)\text{\boldmath$\Delta$}^{(\mathrm{res})} defined in (88). Then,

    (𝐕s(k))𝐕0(k)op2𝚫(res)op|λkλk|o˘d,(1).\big{\|}(\mathrm{\bf V}_{s}^{(k^{\prime})})^{\top}\mathrm{\bf V}_{0}^{(k)}\big{\|}_{\mathrm{op}}\leq\frac{2\|\text{\boldmath$\Delta$}^{(\mathrm{res})}\|_{\mathrm{op}}}{\big{|}\lambda_{k}-\lambda_{k^{\prime}}\big{|}-\breve{o}_{d,\mathbb{P}}(1)}.
Proof of Lemma 7.

First, we prove a useful claim, which is a consequence of classical eigenvector perturbation results [DK70]. We will give a short proof for completeness. For a diagonal matrix 𝐃\mathrm{\bf D}, we denote by (𝐃)\mathcal{L}(\mathrm{\bf D}) the smallest interval that covers all diagonal entries in 𝐃\mathrm{\bf D}. If (𝐃0(k))(𝐃(k))=\mathcal{L}(\mathrm{\bf D}_{0}^{(k^{\prime})})\cap\mathcal{L}(\mathrm{\bf D}^{(k)})=\emptyset, then

(𝐕0(k))𝐕(k)op(𝐕0(k))(𝐊N𝐊)𝐕(k)opmini,j|(𝐃0(k))ii(𝐃(k))jj|.\big{\|}(\mathrm{\bf V}_{0}^{(k^{\prime})})^{\top}\mathrm{\bf V}^{(k)}\big{\|}_{\mathrm{op}}\leq\frac{\big{\|}(\mathrm{\bf V}_{0}^{(k^{\prime})})^{\top}(\mathrm{\bf K}_{N}-\mathrm{\bf K})\mathrm{\bf V}^{(k)}\big{\|}_{\mathrm{op}}}{\min_{i,j}\big{|}(\mathrm{\bf D}_{0}^{(k^{\prime})})_{ii}-(\mathrm{\bf D}^{(k)})_{jj}\big{|}}. (95)

To prove this, we observe that

(𝐊N𝐊)𝐕(k)+𝐊𝐕(k)=𝐊N𝐕(k)=𝐕(k)𝐃(k).(\mathrm{\bf K}_{N}-\mathrm{\bf K})\mathrm{\bf V}^{(k)}+\mathrm{\bf K}\mathrm{\bf V}^{(k)}=\mathrm{\bf K}_{N}\mathrm{\bf V}^{(k)}=\mathrm{\bf V}^{(k)}\mathrm{\bf D}^{(k)}.

Left-multiplying both sides by (𝐕0(k))(\mathrm{\bf V}_{0}^{(k^{\prime})})^{\top} and re-arranging the equality, we obtain

(𝐕0(k))(𝐊N𝐊)𝐕(k)\displaystyle-(\mathrm{\bf V}_{0}^{(k^{\prime})})^{\top}(\mathrm{\bf K}_{N}-\mathrm{\bf K})\mathrm{\bf V}^{(k)} =(𝐕0(k))𝐊𝐕(k)(𝐕0(k))𝐕(k)𝐃(k)\displaystyle=(\mathrm{\bf V}_{0}^{(k^{\prime})})^{\top}\mathrm{\bf K}\mathrm{\bf V}^{(k)}-(\mathrm{\bf V}_{0}^{(k^{\prime})})^{\top}\mathrm{\bf V}^{(k)}\mathrm{\bf D}^{(k)}
=𝐃0(k)(𝐕0(k))𝐕(k)(𝐕0(k))𝐕(k)𝐃(k).\displaystyle=\mathrm{\bf D}_{0}^{(k^{\prime})}(\mathrm{\bf V}_{0}^{(k^{\prime})})^{\top}\mathrm{\bf V}^{(k)}-(\mathrm{\bf V}_{0}^{(k^{\prime})})^{\top}\mathrm{\bf V}^{(k)}\mathrm{\bf D}^{(k)}.

Denote 𝐑=(𝐕0(k))𝐕(k)\mathrm{\bf R}=(\mathrm{\bf V}_{0}^{(k^{\prime})})^{\top}\mathrm{\bf V}^{(k)} for simplicity. Without loss of generality, we assume mini(𝐃0(k))ii>maxj(𝐃(k))jj\min_{i}(\mathrm{\bf D}_{0}^{(k^{\prime})})_{ii}>\max_{j}(\mathrm{\bf D}^{(k)})_{jj}. Let 𝒖,𝒗\text{\boldmath$u$},\text{\boldmath$v$} be the top left/right singular vector of 𝐑\mathrm{\bf R}. Then,

𝐃0(k)𝐑𝐑𝐃(k)op\displaystyle\big{\|}\mathrm{\bf D}_{0}^{(k^{\prime})}\mathrm{\bf R}-\mathrm{\bf R}\mathrm{\bf D}^{(k)}\big{\|}_{\mathrm{op}} 𝒖𝐃0(k)𝐑𝒗𝒖𝐑𝐃(k)𝒗=𝐑op𝒖𝐃0(k)𝒖𝐑op𝒗𝐃(k)𝒗\displaystyle\geq\text{\boldmath$u$}^{\top}\mathrm{\bf D}_{0}^{(k^{\prime})}\mathrm{\bf R}\text{\boldmath$v$}-\text{\boldmath$u$}^{\top}\mathrm{\bf R}\mathrm{\bf D}^{(k)}\text{\boldmath$v$}=\|\mathrm{\bf R}\|_{\mathrm{op}}\cdot\text{\boldmath$u$}^{\top}\mathrm{\bf D}_{0}^{(k^{\prime})}\text{\boldmath$u$}-\|\mathrm{\bf R}\|_{\mathrm{op}}\text{\boldmath$v$}^{\top}\mathrm{\bf D}^{(k)}\text{\boldmath$v$}
𝐑opmini(𝐃0(k))ii𝐑opmaxj(𝐃(k))jj.\displaystyle\geq\|\mathrm{\bf R}\|_{\mathrm{op}}\cdot\min_{i}(\mathrm{\bf D}_{0}^{(k^{\prime})})_{ii}-\|\mathrm{\bf R}\|_{\mathrm{op}}\cdot\max_{j}(\mathrm{\bf D}^{(k)})_{jj}.

This proves the claim (95). By the kernel eigenvalue structure (Lemma 6),

mini,j|(𝐃0(k))ii(𝐃(k))jj|\displaystyle\min_{i,j}\big{|}(\mathrm{\bf D}_{0}^{(k^{\prime})})_{ii}-(\mathrm{\bf D}^{(k)})_{jj}\big{|} =|γ>+(1+o˘d,(1))(λkγ>)γ>(1+o˘d,(1))(λkγ>)|+o˘d,(1)\displaystyle=\big{|}\gamma_{>\ell}+(1+\breve{o}_{d,\mathbb{P}}(1))(\lambda_{k}-\gamma_{>\ell})-\gamma_{>\ell}-(1+\breve{o}_{d,\mathbb{P}}(1))(\lambda_{k^{\prime}}-\gamma_{>\ell})\big{|}+\breve{o}_{d,\mathbb{P}}(1)
=|(1+o˘d,(1))λk(1+o˘d,(1))λk|+o˘d,(1).\displaystyle=\big{|}(1+\breve{o}_{d,\mathbb{P}}(1))\lambda_{k}-(1+\breve{o}_{d,\mathbb{P}}(1))\lambda_{k^{\prime}}\big{|}+\breve{o}_{d,\mathbb{P}}(1). (96)

By the assumptions on λk\lambda_{k} and λk\lambda_{k^{\prime}}, with very high probability,

mini,j|(𝐃0(k))ii(𝐃(k))jj|max{λk,λk}/2.\min_{i,j}\big{|}(\mathrm{\bf D}_{0}^{(k^{\prime})})_{ii}-(\mathrm{\bf D}^{(k)})_{jj}\big{|}\geq\max\{\lambda_{k},\lambda_{k^{\prime}}\big{\}}/2. (97)

By the kernel invertibility, Theorem 3.2, we have

(𝐕0(k))(𝐊N𝐊)𝐕(k)op\displaystyle\big{\|}(\mathrm{\bf V}_{0}^{(k^{\prime})})^{\top}(\mathrm{\bf K}_{N}-\mathrm{\bf K})\mathrm{\bf V}^{(k)}\big{\|}_{\mathrm{op}} (𝐕0(k))𝐊1/2op𝐊1/2(𝐊N𝐊)𝐊1/2op𝐊1/2𝐕(k)op\displaystyle\leq\big{\|}(\mathrm{\bf V}_{0}^{(k^{\prime})})^{\top}\mathrm{\bf K}^{1/2}\big{\|}_{\mathrm{op}}\cdot\big{\|}\mathrm{\bf K}^{-1/2}(\mathrm{\bf K}_{N}-\mathrm{\bf K})\mathrm{\bf K}^{-1/2}\big{\|}_{\mathrm{op}}\cdot\big{\|}\mathrm{\bf K}^{1/2}\mathrm{\bf V}^{(k)}\big{\|}_{\mathrm{op}}
o˘d,(1)(𝐕0(k))𝐊1/2op𝐊1/2𝐕(k)op\displaystyle\leq\breve{o}_{d,\mathbb{P}}(1)\cdot\big{\|}(\mathrm{\bf V}_{0}^{(k^{\prime})})^{\top}\mathrm{\bf K}^{1/2}\big{\|}_{\mathrm{op}}\cdot\big{\|}\mathrm{\bf K}^{1/2}\mathrm{\bf V}^{(k)}\big{\|}_{\mathrm{op}}
o˘d,(1)(𝐕0(k))𝐊1/2op𝐊1/2𝐊N1/2op𝐊N1/2𝐕(k)op.\displaystyle\leq\breve{o}_{d,\mathbb{P}}(1)\cdot\big{\|}(\mathrm{\bf V}_{0}^{(k^{\prime})})^{\top}\mathrm{\bf K}^{1/2}\big{\|}_{\mathrm{op}}\cdot\big{\|}\mathrm{\bf K}^{1/2}\mathrm{\bf K}_{N}^{-1/2}\big{\|}_{\mathrm{op}}\cdot\big{\|}\mathrm{\bf K}_{N}^{1/2}\mathrm{\bf V}^{(k)}\big{\|}_{\mathrm{op}}.

Since 𝐊1/2\mathrm{\bf K}^{1/2} has the same eigenvectors as 𝐊\mathrm{\bf K}, we have

(𝐕0(k))𝐊1/2op=[𝐃0(k)]1/2(𝐕0(k))op(1+o˘d,(1))λk+o˘d,(1).\big{\|}(\mathrm{\bf V}_{0}^{(k^{\prime})})^{\top}\mathrm{\bf K}^{1/2}\big{\|}_{\mathrm{op}}=\big{\|}\big{[}\mathrm{\bf D}_{0}^{(k^{\prime})}\big{]}^{1/2}(\mathrm{\bf V}_{0}^{(k^{\prime})})^{\top}\big{\|}_{\mathrm{op}}\leq(1+\breve{o}_{d,\mathbb{P}}(1))\cdot\sqrt{\lambda_{k^{\prime}}}+\breve{o}_{d,\mathbb{P}}(1)\,. (98)

Similarly,

𝐊N1/2𝐕(k)op=𝐕(k)(𝐃(k))1/2op(1+o˘d,(1))λk+o˘d,(1).\big{\|}\mathrm{\bf K}_{N}^{1/2}\mathrm{\bf V}^{(k)}\big{\|}_{\mathrm{op}}=\big{\|}\mathrm{\bf V}^{(k)}(\mathrm{\bf D}^{(k)})^{1/2}\big{\|}_{\mathrm{op}}\leq(1+\breve{o}_{d,\mathbb{P}}(1))\cdot\sqrt{\lambda_{k}}+\breve{o}_{d,\mathbb{P}}(1)\,. (99)

We also note that

𝐊1/2𝐊N1/2op2=1[σmin(𝐊N1/2𝐊1/2)]2=1λmin(𝐊1/2𝐊N𝐊1/2)1+o˘d,(1).\displaystyle\big{\|}\mathrm{\bf K}^{1/2}\mathrm{\bf K}_{N}^{-1/2}\big{\|}_{\mathrm{op}}^{2}=\frac{1}{\big{[}\sigma_{\min}\big{(}\mathrm{\bf K}_{N}^{1/2}\mathrm{\bf K}^{-1/2})\big{]}^{2}}=\frac{1}{\lambda_{\min}(\mathrm{\bf K}^{-1/2}\mathrm{\bf K}_{N}\mathrm{\bf K}^{-1/2})}\leq 1+\breve{o}_{d,\mathbb{P}}(1).

Therefore, we deduce

(𝐕0(k))(𝐊N𝐊)𝐕(k)op=o˘d,(1)λkλk.\big{\|}(\mathrm{\bf V}_{0}^{(k^{\prime})})^{\top}(\mathrm{\bf K}_{N}-\mathrm{\bf K})\mathrm{\bf V}^{(k)}\big{\|}_{\mathrm{op}}=\breve{o}_{d,\mathbb{P}}(1)\cdot\sqrt{\lambda_{k}\lambda_{k^{\prime}}}. (100)

Combining (97) and (100), we have shown that with very high probability,

(𝐕0(k))𝐕(k)opo˘d,(1)λkλkmax{λk,λk}/2o˘d,(1),\big{\|}(\mathrm{\bf V}_{0}^{(k^{\prime})})^{\top}\mathrm{\bf V}^{(k)}\big{\|}_{\mathrm{op}}\leq\frac{\breve{o}_{d,\mathbb{P}}(1)\cdot\sqrt{\lambda_{k}\lambda_{k^{\prime}}}}{\max\{\lambda_{k},\lambda_{k^{\prime}}\}/2}\leq\breve{o}_{d,\mathbb{P}}(1),

which proves (a)(a). Now in order to show (b)(b), we view 𝐊\mathrm{\bf K} as a perturbation of 𝚿𝚲2𝚿+γ>𝐈n\text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{2}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}+\gamma_{>\ell}\mathrm{\bf I}_{n}, cf. Eq. (88). Hence,

(𝐕s(k))𝐕0(k)op(𝐕s(k))𝚫(res)𝐕0(k)opmini,j|(𝐃s(k))ii(𝐃0(k))jj|𝚫(res)opmini,j|(𝐃s(k))ii(𝐃0(k))jj|.\big{\|}(\mathrm{\bf V}_{s}^{(k^{\prime})})^{\top}\mathrm{\bf V}_{0}^{(k)}\big{\|}_{\mathrm{op}}\leq\frac{\big{\|}(\mathrm{\bf V}_{s}^{(k^{\prime})})^{\top}\text{\boldmath$\Delta$}^{(\mathrm{res})}\mathrm{\bf V}_{0}^{(k)}\big{\|}_{\mathrm{op}}}{\min_{i,j}\big{|}(\mathrm{\bf D}_{s}^{(k^{\prime})})_{ii}-(\mathrm{\bf D}_{0}^{(k)})_{jj}\big{|}}\leq\frac{\big{\|}\text{\boldmath$\Delta$}^{(\mathrm{res})}\big{\|}_{\mathrm{op}}}{\min_{i,j}\big{|}(\mathrm{\bf D}_{s}^{(k^{\prime})})_{ii}-(\mathrm{\bf D}_{0}^{(k)})_{jj}\big{|}}. (101)

(The first inequality is proved in the same way as (95).) By Weyl’s inequality, maxj|(𝐃0(k))jj(𝐃s(k))jj|𝚫(res)op\max_{j}|(\mathrm{\bf D}_{0}^{(k)})_{jj}-(\mathrm{\bf D}_{s}^{(k)})_{jj}|\leq\big{\|}\text{\boldmath$\Delta$}^{(\mathrm{res})}\big{\|}_{\mathrm{op}}. Therefore, if mini,j|(𝐃s(k))ii(𝐃s(k))jj|2𝚫(res)op\min_{i,j}\big{|}(\mathrm{\bf D}_{s}^{(k^{\prime})})_{ii}-(\mathrm{\bf D}_{s}^{(k)})_{jj}\big{|}\geq 2\|\text{\boldmath$\Delta$}^{(\mathrm{res})}\|_{\mathrm{op}}, then

(𝐕s(k))𝐕0(k)op\displaystyle\big{\|}(\mathrm{\bf V}_{s}^{(k^{\prime})})^{\top}\mathrm{\bf V}_{0}^{(k)}\big{\|}_{\mathrm{op}} 𝚫(res)opmini,j|(𝐃s(k))ii(𝐃s(k))jj|𝚫(res)op\displaystyle\leq\frac{\big{\|}\text{\boldmath$\Delta$}^{(\mathrm{res})}\big{\|}_{\mathrm{op}}}{\min_{i,j}\big{|}(\mathrm{\bf D}_{s}^{(k^{\prime})})_{ii}-(\mathrm{\bf D}_{s}^{(k)})_{jj}\big{|}-\big{\|}\text{\boldmath$\Delta$}^{(\mathrm{res})}\big{\|}_{\mathrm{op}}}
𝚫(res)opmini,j|(𝐃s(k))ii(𝐃s(k))jj|mini,j|(𝐃s(k))ii(𝐃s(k))jj|/2\displaystyle\leq\frac{\big{\|}\text{\boldmath$\Delta$}^{(\mathrm{res})}\big{\|}_{\mathrm{op}}}{\min_{i,j}\big{|}(\mathrm{\bf D}_{s}^{(k^{\prime})})_{ii}-(\mathrm{\bf D}_{s}^{(k)})_{jj}\big{|}-\min_{i,j}\big{|}(\mathrm{\bf D}_{s}^{(k^{\prime})})_{ii}-(\mathrm{\bf D}_{s}^{(k)})_{jj}\big{|}/2}
=2𝚫(res)opmini,j|(𝐃s(k))ii(𝐃s(k))jj|.\displaystyle=\frac{2\big{\|}\text{\boldmath$\Delta$}^{(\mathrm{res})}\big{\|}_{\mathrm{op}}}{\min_{i,j}\big{|}(\mathrm{\bf D}_{s}^{(k^{\prime})})_{ii}-(\mathrm{\bf D}_{s}^{(k)})_{jj}\big{|}}.

If mini,j|(𝐃s(k))ii(𝐃s(k))jj|<2𝚫(res)op\min_{i,j}\big{|}(\mathrm{\bf D}_{s}^{(k^{\prime})})_{ii}-(\mathrm{\bf D}_{s}^{(k)})_{jj}\big{|}<2\|\text{\boldmath$\Delta$}^{(\mathrm{res})}\|_{\mathrm{op}}, then from a trivial bound we have

(𝐕s(k))𝐕0(k)op\displaystyle\big{\|}(\mathrm{\bf V}_{s}^{(k^{\prime})})^{\top}\mathrm{\bf V}_{0}^{(k)}\big{\|}_{\mathrm{op}} 12𝚫(res)opmini,j|(𝐃s(k))ii(𝐃s(k))jj|.\displaystyle\leq 1\leq\frac{2\big{\|}\text{\boldmath$\Delta$}^{(\mathrm{res})}\big{\|}_{\mathrm{op}}}{\min_{i,j}\big{|}(\mathrm{\bf D}_{s}^{(k^{\prime})})_{ii}-(\mathrm{\bf D}_{s}^{(k)})_{jj}\big{|}}.

as well. In either way,

(𝐕s(k))𝐕0(k)op2𝚫(res)opmini,j|(𝐃s(k))ii(𝐃s(k))jj|2𝚫(res)op|λkλk|o˘d,(1)\big{\|}(\mathrm{\bf V}_{s}^{(k^{\prime})})^{\top}\mathrm{\bf V}_{0}^{(k)}\big{\|}_{\mathrm{op}}\leq\frac{2\big{\|}\text{\boldmath$\Delta$}^{(\mathrm{res})}\big{\|}_{\mathrm{op}}}{\min_{i,j}\big{|}(\mathrm{\bf D}_{s}^{(k^{\prime})})_{ii}-(\mathrm{\bf D}_{s}^{(k)})_{jj}\big{|}}\leq\frac{2\big{\|}\text{\boldmath$\Delta$}^{(\mathrm{res})}\big{\|}_{\mathrm{op}}}{\big{|}\lambda_{k}-\lambda_{k^{\prime}}\big{|}-\breve{o}_{d,\mathbb{P}}(1)}

We state and prove two useful lemmas. Let 𝒆1,,𝒆nn\text{\boldmath$e$}_{1},\ldots,\text{\boldmath$e$}_{n}\in\mathbb{R}^{n} be the canonical basis. For an integer mnm\leq n, define the projection matrix 𝐏=[𝒆1,,𝒆m][𝒆1,,𝒆m]\mathrm{\bf P}=[\text{\boldmath$e$}_{1},\ldots,\text{\boldmath$e$}_{m}][\text{\boldmath$e$}_{1},\ldots,\text{\boldmath$e$}_{m}]^{\top}.

Lemma 8.

Suppose that cnmCnc^{\prime}n\leq m\leq C^{\prime}n, where c,C(0,1)c^{\prime},C^{\prime}\in(0,1) are constants. Then there exists c>0c>0 such that with very high probability,

σmin(𝐏𝐔s)c.\sigma_{\min}(\mathrm{\bf P}\mathrm{\bf U}_{s})\geq c.
Proof of Lemma 8.

Let the singular value decomposition of 𝚿\text{\boldmath$\Psi$}_{\leq\ell} be 𝚿=𝐔𝚿𝐃𝚿𝐕𝚿\text{\boldmath$\Psi$}_{\leq\ell}=\mathrm{\bf U}_{\text{\boldmath$\Psi$}}\mathrm{\bf D}_{\text{\boldmath$\Psi$}}\mathrm{\bf V}_{\text{\boldmath$\Psi$}}^{\top} where 𝐔𝚿n×D,𝐃𝚿D×D,𝐕𝚿D×D\mathrm{\bf U}_{\text{\boldmath$\Psi$}}\in\mathbb{R}^{n\times D},\mathrm{\bf D}_{\text{\boldmath$\Psi$}}\in\mathbb{R}^{D\times D},\mathrm{\bf V}_{\text{\boldmath$\Psi$}}\in\mathbb{R}^{D\times D}. By the concentration result (51) we have σmax(𝐃𝚿)(1+o˘d,(1))n\sigma_{\max}(\mathrm{\bf D}_{\text{\boldmath$\Psi$}})\leq(1+\breve{o}_{d,\mathbb{P}}(1))\sqrt{n}. By definition, 𝚿𝐕𝚿=𝐔𝚿𝐃𝚿\text{\boldmath$\Psi$}_{\leq\ell}\mathrm{\bf V}_{\text{\boldmath$\Psi$}}=\mathrm{\bf U}_{\text{\boldmath$\Psi$}}\mathrm{\bf D}_{\text{\boldmath$\Psi$}}. Left-multiplying both sides by 𝐏\mathrm{\bf P} and rearranging, we have

𝐏𝐔𝚿=𝐏𝚿𝐕𝚿𝐃𝚿1.\mathrm{\bf P}\mathrm{\bf U}_{\text{\boldmath$\Psi$}}=\mathrm{\bf P}\text{\boldmath$\Psi$}_{\leq\ell}\mathrm{\bf V}_{\text{\boldmath$\Psi$}}\mathrm{\bf D}_{\text{\boldmath$\Psi$}}^{-1}.

We use the concentration result (51) to 𝐏𝚿\mathrm{\bf P}\text{\boldmath$\Psi$}_{\leq\ell} (where we view mm as the dimension) and get σmin(𝐏𝚿)(1o˘d,(1))m\sigma_{\min}(\mathrm{\bf P}\text{\boldmath$\Psi$}_{\leq\ell})\geq(1-\breve{o}_{d,\mathbb{P}}(1))\sqrt{m}. This then leads to

σmin(𝐏𝐔𝚿)σmin(𝐏𝚿𝐕𝚿)σmax(𝐃𝚿)=σmin(𝐏𝚿)σmax(𝐃𝚿)(1o˘d,(1))mn.\displaystyle\sigma_{\min}(\mathrm{\bf P}\mathrm{\bf U}_{\text{\boldmath$\Psi$}})\geq\frac{\sigma_{\min}(\mathrm{\bf P}\text{\boldmath$\Psi$}_{\leq\ell}\mathrm{\bf V}_{\text{\boldmath$\Psi$}})}{\sigma_{\max}(\mathrm{\bf D}_{\text{\boldmath$\Psi$}})}=\frac{\sigma_{\min}(\mathrm{\bf P}\text{\boldmath$\Psi$}_{\leq\ell})}{\sigma_{\max}(\mathrm{\bf D}_{\text{\boldmath$\Psi$}})}\geq(1-\breve{o}_{d,\mathbb{P}}(1))\sqrt{\frac{m}{n}}. (102)

So with very high probability, σmin(𝐏𝐔𝚿)c/2\sigma_{\min}(\mathrm{\bf P}\mathrm{\bf U}_{\text{\boldmath$\Psi$}})\geq\sqrt{c^{\prime}}/2. Since 𝚿\text{\boldmath$\Psi$}_{\leq\ell} and 𝚿𝚲2𝚿\text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{2}\text{\boldmath$\Psi$}_{\leq\ell}^{\top} span the same column space, we can find an orthogonal matrix 𝐑D×D\mathrm{\bf R}\in\mathbb{R}^{D\times D} such that 𝐔s=𝐔𝚿𝐑\mathrm{\bf U}_{s}=\mathrm{\bf U}_{\text{\boldmath$\Psi$}}\mathrm{\bf R}. Thus, σmin(𝐏𝐔s)=σmin(𝐏𝐔𝚿)c/2\sigma_{\min}(\mathrm{\bf P}\mathrm{\bf U}_{s})=\sigma_{\min}(\mathrm{\bf P}\mathrm{\bf U}_{\text{\boldmath$\Psi$}})\geq\sqrt{c^{\prime}}/2 with very high probability. ∎

Lemma 9.

Suppose that 𝐔a,𝐔bm×k\mathrm{\bf U}_{a},\mathrm{\bf U}_{b}\in\mathbb{R}^{m\times k} are two matrices with orthonormal column vectors. Let 𝐏m×m\mathrm{\bf P}\in\mathbb{R}^{m\times m} be any orthogonal projection matrix, and 𝐏a=𝐈m𝐔a𝐔a\mathrm{\bf P}_{a}^{\perp}=\mathrm{\bf I}_{m}-\mathrm{\bf U}_{a}\mathrm{\bf U}_{a}^{\top}. Then, we have

σmin(𝐏𝐔b)σmin(𝐏𝐔a)σmin(𝐔a𝐔b)σmax(𝐏a𝐔b)\sigma_{\min}(\mathrm{\bf P}\mathrm{\bf U}_{b})\geq\sigma_{\min}(\mathrm{\bf P}\mathrm{\bf U}_{a})\sigma_{\min}(\mathrm{\bf U}_{a}^{\top}\mathrm{\bf U}_{b})-\sigma_{\max}(\mathrm{\bf P}_{a}^{\perp}\mathrm{\bf U}_{b})
Proof of Lemma 9.

By the variational characterization of singular values and the triangle inequality,

σmin(𝐏𝐔b)=min𝒗=1𝐏𝐔b𝒗\displaystyle\sigma_{\min}(\mathrm{\bf P}\mathrm{\bf U}_{b})=\min_{\|\text{\boldmath$v$}\|=1}\big{\|}\mathrm{\bf P}\mathrm{\bf U}_{b}\text{\boldmath$v$}\big{\|} min𝒗=1{𝐏𝐔a𝐔a𝐔b𝒗𝐏𝐏a𝐔b𝒗}\displaystyle\geq\min_{\|\text{\boldmath$v$}\|=1}\Big{\{}\big{\|}\mathrm{\bf P}\mathrm{\bf U}_{a}\mathrm{\bf U}_{a}^{\top}\mathrm{\bf U}_{b}\text{\boldmath$v$}\big{\|}-\big{\|}\mathrm{\bf P}\mathrm{\bf P}_{a}^{\bot}\mathrm{\bf U}_{b}\text{\boldmath$v$}\big{\|}\Big{\}}
σmin(𝐏𝐔a)min𝒗=1𝐔a𝐔b𝒗max𝒗=1𝐏a𝐔b𝒗\displaystyle\geq\sigma_{\min}(\mathrm{\bf P}\mathrm{\bf U}_{a})\min_{\|\text{\boldmath$v$}\|=1}\big{\|}\mathrm{\bf U}_{a}^{\top}\mathrm{\bf U}_{b}\text{\boldmath$v$}\big{\|}-\max_{\|\text{\boldmath$v$}\|=1}\big{\|}\mathrm{\bf P}_{a}^{\bot}\mathrm{\bf U}_{b}\text{\boldmath$v$}\big{\|}
σmin(𝐏𝐔a)σmin(𝐔a𝐔b)σmax(𝐏a𝐔b).\displaystyle\geq\sigma_{\min}(\mathrm{\bf P}\mathrm{\bf U}_{a})\sigma_{\min}(\mathrm{\bf U}_{a}^{\top}\mathrm{\bf U}_{b})-\sigma_{\max}(\mathrm{\bf P}_{a}^{\bot}\mathrm{\bf U}_{b})\,.

This proves the lemma. ∎

Suppose nn^{\prime} is a positive integer that satisfies nn(1+C2)nn\leq n^{\prime}\leq(1+C_{2})n where C2>0C_{2}>0 is a constant. Denote n0=n+nn_{0}=n+n^{\prime}. Let us sample nn^{\prime} new data 𝒙n+1,,𝒙n0\text{\boldmath$x$}_{n+1},\ldots,\text{\boldmath$x$}_{n_{0}} which are i.i.d. and have the same distribution as 𝒙1\text{\boldmath$x$}_{1}. We introduce the augmented kernel matrix 𝐊~n0×n0\widetilde{\mathrm{\bf K}}\in\mathbb{R}^{n_{0}\times n_{0}} as follows. We define

[𝐊~]ij:=𝐊N(𝒙i,𝒙j)=(𝐊~11𝐊~12𝐊~21𝐊~22).[\widetilde{\mathrm{\bf K}}]_{ij}:=\mathrm{\bf K}_{N}(\text{\boldmath$x$}_{i},\text{\boldmath$x$}_{j})=\left(\begin{array}[]{cc}\widetilde{\mathrm{\bf K}}_{11}&\widetilde{\mathrm{\bf K}}_{12}\\ \widetilde{\mathrm{\bf K}}_{21}&\widetilde{\mathrm{\bf K}}_{22}\end{array}\right). (103)

where 𝐊~11,𝐊~12,𝐊~21,𝐊~22\widetilde{\mathrm{\bf K}}_{11},\widetilde{\mathrm{\bf K}}_{12},\widetilde{\mathrm{\bf K}}_{21},\widetilde{\mathrm{\bf K}}_{22} have size n×nn\times n, n×nn\times n^{\prime}, n×nn^{\prime}\times n, n×nn^{\prime}\times n^{\prime} respectively. Under this definition, clearly 𝐊~11=𝐊N\widetilde{\mathrm{\bf K}}_{11}=\mathrm{\bf K}_{N}. Note that we can express 𝐊N(2)\mathrm{\bf K}_{N}^{(2)} as (recalling that 𝐊N(,𝒙)=[𝐊N(𝒙i,𝒙)]inn\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})=[\mathrm{\bf K}_{N}(\text{\boldmath$x$}_{i},\text{\boldmath$x$})]_{i\leq n}\in\mathbb{R}^{n}):

𝐊N(2)=1ni=1n𝔼𝒙n+1,,𝒙n+n[𝐊N(,𝒙n+i)𝐊N(,𝒙n+i)]=1n𝔼𝒙n+1,,𝒙n0[𝐊~12𝐊~21].\mathrm{\bf K}_{N}^{(2)}=\frac{1}{n^{\prime}}\sum_{i=1}^{n^{\prime}}\mathbb{E}_{\text{\boldmath$x$}_{n+1},\ldots,\text{\boldmath$x$}_{n+n^{\prime}}}\big{[}\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$}_{n+i})\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$}_{n+i})^{\top}\big{]}=\frac{1}{n^{\prime}}\mathbb{E}_{\text{\boldmath$x$}_{n+1},\ldots,\text{\boldmath$x$}_{n_{0}}}\big{[}\widetilde{\mathrm{\bf K}}_{12}\widetilde{\mathrm{\bf K}}_{21}^{\top}\big{]}.

So we can reduce our problem using 𝐊~\widetilde{\mathrm{\bf K}}.

𝐊N1𝐊N(2)𝐊N1op\displaystyle\big{\|}\mathrm{\bf K}_{N}^{-1}\mathrm{\bf K}_{N}^{(2)}\mathrm{\bf K}_{N}^{-1}\big{\|}_{\mathrm{op}} (i)1n𝔼𝒙n+1,,𝒙n0(𝐊~11)1𝐊~12𝐊~21(𝐊~11)1op\displaystyle\stackrel{{\scriptstyle(i)}}{{\leq}}\frac{1}{n^{\prime}}\mathbb{E}_{\text{\boldmath$x$}_{n+1},\ldots,\text{\boldmath$x$}_{n_{0}}}\big{\|}(\widetilde{\mathrm{\bf K}}_{11})^{-1}\widetilde{\mathrm{\bf K}}_{12}\widetilde{\mathrm{\bf K}}_{21}(\widetilde{\mathrm{\bf K}}_{11})^{-1}\big{\|}_{\mathrm{op}}
=(ii)1n𝔼𝒙n+1,,𝒙n0(𝐊~11)1(𝐊~2)11(𝐊~11)1𝐈nop\displaystyle\stackrel{{\scriptstyle(ii)}}{{=}}\frac{1}{n^{\prime}}\mathbb{E}_{\text{\boldmath$x$}_{n+1},\ldots,\text{\boldmath$x$}_{n_{0}}}\big{\|}(\widetilde{\mathrm{\bf K}}_{11})^{-1}(\widetilde{\mathrm{\bf K}}^{2})_{11}(\widetilde{\mathrm{\bf K}}_{11})^{-1}-\mathrm{\bf I}_{n}\big{\|}_{\mathrm{op}}

where (i) follows from Jensen’s inequality, and (ii) is due to

(𝐊~2)11=(𝐊~11)2+𝐊~12𝐊~21(\widetilde{\mathrm{\bf K}}^{2})_{11}=(\widetilde{\mathrm{\bf K}}_{11})^{2}+\widetilde{\mathrm{\bf K}}_{12}\widetilde{\mathrm{\bf K}}_{21}

where (𝐊~2)11(\widetilde{\mathrm{\bf K}}^{2})_{11} denotes the top left n×nn\times n matrix block of 𝐊~2\widetilde{\mathrm{\bf K}}^{2}. Similarly, we define the augmented vector

𝒇~=(f(𝒙i))in0=(𝒇~1𝒇~2),\widetilde{\text{\boldmath$f$}}=(f(\text{\boldmath$x$}_{i}))_{i\leq n_{0}}=\left(\begin{array}[]{c}\widetilde{\text{\boldmath$f$}}_{1}\\ \widetilde{\text{\boldmath$f$}}_{2}\end{array}\right), (104)

and by Jensen’s inequality, we have

𝐊N1𝔼𝒙[𝐊N(,𝒙)f(𝒙)]\displaystyle\big{\|}\mathrm{\bf K}_{N}^{-1}\mathbb{E}_{\text{\boldmath$x$}}\big{[}\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})f(\text{\boldmath$x$})\big{]}\big{\|} 1n𝔼𝒙n+1,,𝒙n0𝐊~111𝐊~12𝒇~2.\displaystyle\leq\frac{1}{n^{\prime}}\mathbb{E}_{\text{\boldmath$x$}_{n+1},\ldots,\text{\boldmath$x$}_{n_{0}}}\big{\|}\widetilde{\mathrm{\bf K}}_{11}^{-1}\widetilde{\mathrm{\bf K}}_{12}\widetilde{\text{\boldmath$f$}}_{2}\big{\|}.

Hence, proving Eqs. (67) and (68) is reduced to studying (𝐊~11)1(𝐊~2)11(𝐊~11)1(\widetilde{\mathrm{\bf K}}_{11})^{-1}(\widetilde{\mathrm{\bf K}}^{2})_{11}(\widetilde{\mathrm{\bf K}}_{11})^{-1} and 𝐊~111𝐊~12𝒇~2\widetilde{\mathrm{\bf K}}_{11}^{-1}\widetilde{\mathrm{\bf K}}_{12}\widetilde{\text{\boldmath$f$}}_{2}, which can be analyzed by making use of the kernel eigenstructure.

Lemma 10.

There exist constant C1,C2>0C_{1},C_{2}>0 such that the following holds. With very high probability,

(𝐊~11)1(𝐊~2)11(𝐊~11)1𝐈nopC1,\displaystyle\big{\|}(\widetilde{\mathrm{\bf K}}_{11})^{-1}(\widetilde{\mathrm{\bf K}}^{2})_{11}(\widetilde{\mathrm{\bf K}}_{11})^{-1}-\mathrm{\bf I}_{n}\big{\|}_{\mathrm{op}}\leq C_{1}, (105)
𝐊~111𝐊~12𝒇~2C1n.\displaystyle\big{\|}\widetilde{\mathrm{\bf K}}_{11}^{-1}\widetilde{\mathrm{\bf K}}_{12}\widetilde{\text{\boldmath$f$}}_{2}\big{\|}\leq C_{1}\sqrt{n}. (106)

Further, with high probability,

1n𝔼𝒙n+1,,𝒙2n(𝐊~11)1(𝐊~2)11(𝐊~11)1𝐈nopC2n,\displaystyle\frac{1}{n^{\prime}}\mathbb{E}_{\text{\boldmath$x$}_{n+1},\ldots,\text{\boldmath$x$}_{2n}}\big{\|}(\widetilde{\mathrm{\bf K}}_{11})^{-1}(\widetilde{\mathrm{\bf K}}^{2})_{11}(\widetilde{\mathrm{\bf K}}_{11})^{-1}-\mathrm{\bf I}_{n}\big{\|}_{\mathrm{op}}\leq\frac{C_{2}}{n}, (107)
1n𝔼𝒙n+1,,𝒙2n𝐊~111𝐊~12𝒇~2C2n.\displaystyle\frac{1}{n^{\prime}}\mathbb{E}_{\text{\boldmath$x$}_{n+1},\ldots,\text{\boldmath$x$}_{2n}}\big{\|}\widetilde{\mathrm{\bf K}}_{11}^{-1}\widetilde{\mathrm{\bf K}}_{12}\widetilde{\text{\boldmath$f$}}_{2}\big{\|}\leq\frac{C_{2}}{\sqrt{n}}. (108)

Consequently, we obtain (67) and (68) in Lemma 3.

Proof of Lemma 10.

Step 1. First we prove (105) and (106). Fix the constant δ0:=1/4\delta_{0}:=1/4. We apply Lemma 6 to the augmented matrix 𝐊~\widetilde{\mathrm{\bf K}} (where nn is replaced by n0n_{0}) and find that its eigenvalues can be partitioned into +2\ell+2 groups, with the first +1\ell+1 groups corresponding to DD eigenvalues. These groups, together with the group of the remaining eigenvalues, are centered around

γ0n0+γ>,γ1n0d+γ>,,γ1(!)n0d+γ>,γ>.\gamma_{0}n_{0}+\gamma_{>\ell},\frac{\gamma_{1}n_{0}}{d}+\gamma_{>\ell},\ldots,\frac{\gamma_{1}(\ell!)n_{0}}{d^{\ell}}+\gamma_{>\ell},\gamma_{>\ell}. (109)

These +2\ell+2 values may be unordered and have small gaps (so that Lemma 7 does not directly apply). So we order them in the descending order and denote the ordered values by λ(1),,λ(+2)\lambda_{(1)},\ldots,\lambda_{(\ell+2)}. Let their corresponding group sizes be s1,s2,,s+2{1,B(d,1),,B(d,),n0D}s_{1},s_{2},\ldots,s_{\ell+2}\in\{1,B(d,1),\ldots,B(d,\ell),n_{0}-D\}. Let 0{1,,+1}\ell_{0}\in\{1,\ldots,\ell+1\} be the largest index kk such that λ(k+1)/λ(k)<δ0\lambda_{(k+1)}/\lambda_{(k)}<\delta_{0}. (If such kk does not exist, then 𝐊~op2γ>(1/δ0)+1\|\widetilde{\mathrm{\bf K}}\|_{\mathrm{op}}\leq 2\gamma_{>\ell}(1/\delta_{0})^{\ell+1} with very high probability, and thus (105) and (106) follow.) Let the eigen-decomposition of 𝐊~\widetilde{\mathrm{\bf K}} be

𝐊~\displaystyle\widetilde{\mathrm{\bf K}} =𝐔~𝐃~𝐔~+𝐊~1(res)\displaystyle=\widetilde{\mathrm{\bf U}}\widetilde{\mathrm{\bf D}}\widetilde{\mathrm{\bf U}}^{\top}+\widetilde{\mathrm{\bf K}}_{1}^{(\mathrm{res})}

so that we only keep eigenvalues/eigenvectors in the largest 0\ell_{0} groups in 𝐔~𝐃~𝐔~\widetilde{\mathrm{\bf U}}\widetilde{\mathrm{\bf D}}\widetilde{\mathrm{\bf U}}^{\top}, and the remaining eigenvalues are in 𝐊~1(res)\widetilde{\mathrm{\bf K}}_{1}^{(\mathrm{res})} (in particular 𝐊~1(res)𝐔~=𝟎\widetilde{\mathrm{\bf K}}_{1}^{(\mathrm{res})}\widetilde{\mathrm{\bf U}}=\mathrm{\bf 0}). Therefore

𝐃~=diag(λ(1)𝐈s1,,λ(0)𝐈s0)(1+o˘d,(1)).\widetilde{\mathrm{\bf D}}=\mathrm{diag}\big{(}\lambda_{(1)}\mathrm{\bf I}_{s_{1}},\ldots,\lambda_{(\ell_{0})}\mathrm{\bf I}_{s_{\ell_{0}}}\big{)}\cdot(1+\breve{o}_{d,\mathbb{P}}(1)).

and 𝐊~1(res)\widetilde{\mathrm{\bf K}}_{1}^{(\mathrm{res})} satisfies

(γ>o˘d,(1))𝐈n0𝐊~1(res)(γ>+C+o˘d,(1))𝐈n0\displaystyle(\gamma_{>\ell}-\breve{o}_{d,\mathbb{P}}(1))\mathrm{\bf I}_{n_{0}}\preceq\widetilde{\mathrm{\bf K}}_{1}^{(\mathrm{res})}\preceq(\gamma_{>\ell}+C+\breve{o}_{d,\mathbb{P}}(1))\mathrm{\bf I}_{n_{0}} (110)

(where C>0C>0 is a constant). Taking the square, we have

𝐊~2\displaystyle\widetilde{\mathrm{\bf K}}^{2} =𝐔~𝐃~2𝐔~+𝐊~2(res),\displaystyle=\widetilde{\mathrm{\bf U}}\widetilde{\mathrm{\bf D}}^{2}\widetilde{\mathrm{\bf U}}^{\top}+\widetilde{\mathrm{\bf K}}_{2}^{(\mathrm{res})},

where 𝐊~2(res)\widetilde{\mathrm{\bf K}}_{2}^{(\mathrm{res})} satisfies (γ>2o˘d,(1))𝐈2n𝐊~2(res)((γ>+C)2+o˘d,(1))𝐈2n(\gamma_{>\ell}^{2}-\breve{o}_{d,\mathbb{P}}(1))\mathrm{\bf I}_{2n}\preceq\widetilde{\mathrm{\bf K}}_{2}^{(\mathrm{res})}\preceq((\gamma_{>\ell}+C)^{2}+\breve{o}_{d,\mathbb{P}}(1))\mathrm{\bf I}_{2n}.

For convenience, we denote D0D_{0} to be the size of 𝐃~\widetilde{\mathrm{\bf D}} (i.e., number of eigenvalues in the main part of 𝐊~\widetilde{\mathrm{\bf K}}). Also denote by 𝐔~1n×D,𝐔~2n×D\widetilde{\mathrm{\bf U}}_{1}\in\mathbb{R}^{n\times D},\widetilde{\mathrm{\bf U}}_{2}\in\mathbb{R}^{n^{\prime}\times D} the submatrix of 𝐔~\widetilde{\mathrm{\bf U}} formed by its top-nn/bottom-nn^{\prime} rows respectively. Using this eigen-decomposition, we have

(𝐊~11)1(𝐊~2)11(𝐊~11)1𝐈nop\displaystyle\big{\|}(\widetilde{\mathrm{\bf K}}_{11})^{-1}(\widetilde{\mathrm{\bf K}}^{2})_{11}(\widetilde{\mathrm{\bf K}}_{11})^{-1}-\mathrm{\bf I}_{n}\big{\|}_{\mathrm{op}} (𝐊~11)1[𝐔~𝐃~2𝐔~]11(𝐊~11)1op\displaystyle\leq\big{\|}(\widetilde{\mathrm{\bf K}}_{11})^{-1}[\widetilde{\mathrm{\bf U}}\widetilde{\mathrm{\bf D}}^{2}\widetilde{\mathrm{\bf U}}^{\top}]_{11}(\widetilde{\mathrm{\bf K}}_{11})^{-1}\big{\|}_{\mathrm{op}} (111)
+(𝐊~11)1op𝐊~2(res)op(𝐊~11)1op+1.\displaystyle~{}~{}+\big{\|}(\widetilde{\mathrm{\bf K}}_{11})^{-1}\big{\|}_{\mathrm{op}}\cdot\big{\|}\widetilde{\mathrm{\bf K}}_{2}^{(\mathrm{res})}\big{\|}_{\mathrm{op}}\cdot\big{\|}(\widetilde{\mathrm{\bf K}}_{11})^{-1}\big{\|}_{\mathrm{op}}+1. (112)

Since 𝐊~11\widetilde{\mathrm{\bf K}}_{11} (or equivalently 𝐊N\mathrm{\bf K}_{N}) has smallest eigenvalue bounded away from zero with very high probability by Theorem 3.2, we can see that the two terms in the second line (112) are bounded with very high probability. To handle the term on the right-hand side of (111), we use the identity

𝐀1𝐇𝐀1\displaystyle\mathrm{\bf A}^{-1}\mathrm{\bf H}\mathrm{\bf A}^{-1} =𝐀01𝐇𝐀01𝐀1(𝐀𝐀0)𝐀01𝐇𝐀01𝐀01𝐇𝐀01(𝐀𝐀0)𝐀1\displaystyle=\mathrm{\bf A}_{0}^{-1}\mathrm{\bf H}\mathrm{\bf A}_{0}^{-1}-\mathrm{\bf A}^{-1}(\mathrm{\bf A}-\mathrm{\bf A}_{0})\mathrm{\bf A}_{0}^{-1}\mathrm{\bf H}\mathrm{\bf A}_{0}^{-1}-\mathrm{\bf A}_{0}^{-1}\mathrm{\bf H}\mathrm{\bf A}_{0}^{-1}(\mathrm{\bf A}-\mathrm{\bf A}_{0})\mathrm{\bf A}^{-1}
+𝐀1(𝐀𝐀0)𝐀01𝐇𝐀01(𝐀𝐀0)𝐀1\displaystyle~{}~{}~{}+\mathrm{\bf A}^{-1}(\mathrm{\bf A}-\mathrm{\bf A}_{0})\mathrm{\bf A}_{0}^{-1}\mathrm{\bf H}\mathrm{\bf A}_{0}^{-1}(\mathrm{\bf A}-\mathrm{\bf A}_{0})\mathrm{\bf A}^{-1} (113)

where 𝐀0,𝐀\mathrm{\bf A}_{0},\mathrm{\bf A} are nonsingular symmetric matrices and 𝐇\mathrm{\bf H} is a symmetric matrix. Define 𝐃~γ=𝐃~γ>𝐈D\widetilde{\mathrm{\bf D}}_{\gamma}=\widetilde{\mathrm{\bf D}}-\gamma_{>\ell}\mathrm{\bf I}_{D}, which is strictly positive definite with very high probability (since by definition, λ(0)δ01γ>\lambda_{(\ell_{0})}\geq\delta_{0}^{-1}\gamma_{>\ell}). We set 𝐀=𝐊~11,𝐀0=[𝐔~𝐃~λ𝐔~]11+γ>𝐈n\mathrm{\bf A}=\widetilde{\mathrm{\bf K}}_{11},\mathrm{\bf A}_{0}=[\widetilde{\mathrm{\bf U}}\widetilde{\mathrm{\bf D}}_{\lambda}\widetilde{\mathrm{\bf U}}^{\top}]_{11}+\gamma_{>\ell}\mathrm{\bf I}_{n}, and 𝐇=[𝐔~𝐃~2𝐔~]11\mathrm{\bf H}=[\widetilde{\mathrm{\bf U}}\widetilde{\mathrm{\bf D}}^{2}\widetilde{\mathrm{\bf U}}^{\top}]_{11}, and it holds that 𝐀1opC\|\mathrm{\bf A}^{-1}\|_{\mathrm{op}}\leq C, 𝐀01opC\|\mathrm{\bf A}_{0}^{-1}\|_{\mathrm{op}}\leq C and 𝐀𝐀0opC\|\mathrm{\bf A}-\mathrm{\bf A}_{0}\|_{\mathrm{op}}\leq C with very high probability by the above. Therefore, with very high probability,

(𝐊~11)1[𝐔~𝐃~2𝐔~]11(𝐊~11)1op\displaystyle~{}~{}~{}~{}\big{\|}(\widetilde{\mathrm{\bf K}}_{11})^{-1}[\widetilde{\mathrm{\bf U}}\widetilde{\mathrm{\bf D}}^{2}\widetilde{\mathrm{\bf U}}^{\top}]_{11}(\widetilde{\mathrm{\bf K}}_{11})^{-1}\big{\|}_{\mathrm{op}}
C([𝐔~𝐃~γ𝐔~]11+γ>𝐈n)1[𝐔~𝐃~2𝐔~]11([𝐔~𝐃~γ𝐔~]11+γ>𝐈n)1op\displaystyle\leq C\cdot\Big{\|}\big{(}[\widetilde{\mathrm{\bf U}}\widetilde{\mathrm{\bf D}}_{\gamma}\widetilde{\mathrm{\bf U}}^{\top}]_{11}+\gamma_{>\ell}\mathrm{\bf I}_{n}\big{)}^{-1}\big{[}\widetilde{\mathrm{\bf U}}\widetilde{\mathrm{\bf D}}^{2}\widetilde{\mathrm{\bf U}}^{\top}\big{]}_{11}\big{(}[\widetilde{\mathrm{\bf U}}\widetilde{\mathrm{\bf D}}_{\gamma}\widetilde{\mathrm{\bf U}}^{\top}]_{11}+\gamma_{>\ell}\mathrm{\bf I}_{n}\big{)}^{-1}\Big{\|}_{\mathrm{op}}
=C(𝐔~1𝐃~γ𝐔~1+γ>𝐈n)1𝐔~1𝐃~2𝐔~1(𝐔~1𝐃~γ𝐔~1+γ>𝐈n)1op\displaystyle=C\cdot\Big{\|}\big{(}\widetilde{\mathrm{\bf U}}_{1}\widetilde{\mathrm{\bf D}}_{\gamma}\widetilde{\mathrm{\bf U}}_{1}^{\top}+\gamma_{>\ell}\mathrm{\bf I}_{n}\big{)}^{-1}\widetilde{\mathrm{\bf U}}_{1}\widetilde{\mathrm{\bf D}}^{2}\widetilde{\mathrm{\bf U}}_{1}^{\top}\big{(}\widetilde{\mathrm{\bf U}}_{1}\widetilde{\mathrm{\bf D}}_{\gamma}\widetilde{\mathrm{\bf U}}_{1}^{\top}+\gamma_{>\ell}\mathrm{\bf I}_{n}\big{)}^{-1}\Big{\|}_{\mathrm{op}}
(i)C(𝐔~1𝐃~γ𝐔~1+γ>𝐈n)1𝐔~1(γ>2𝐈D)𝐔~1(𝐔~1𝐃~γ𝐔~1+γ>𝐈n)1op\displaystyle\stackrel{{\scriptstyle(i)}}{{\leq}}C\cdot\Big{\|}\big{(}\widetilde{\mathrm{\bf U}}_{1}\widetilde{\mathrm{\bf D}}_{\gamma}\widetilde{\mathrm{\bf U}}_{1}^{\top}+\gamma_{>\ell}\mathrm{\bf I}_{n}\big{)}^{-1}\widetilde{\mathrm{\bf U}}_{1}(\gamma_{>\ell}^{2}\mathrm{\bf I}_{D})\widetilde{\mathrm{\bf U}}_{1}^{\top}\big{(}\widetilde{\mathrm{\bf U}}_{1}\widetilde{\mathrm{\bf D}}_{\gamma}\widetilde{\mathrm{\bf U}}_{1}^{\top}+\gamma_{>\ell}\mathrm{\bf I}_{n}\big{)}^{-1}\Big{\|}_{\mathrm{op}}
+C(𝐔~1𝐃~γ𝐔~1+γ>𝐈n)1𝐔~1𝐃~γ2𝐔~1(𝐔~1𝐃~γ𝐔~1+γ>𝐈n)1op\displaystyle+C\cdot\Big{\|}\big{(}\widetilde{\mathrm{\bf U}}_{1}\widetilde{\mathrm{\bf D}}_{\gamma}\widetilde{\mathrm{\bf U}}_{1}^{\top}+\gamma_{>\ell}\mathrm{\bf I}_{n}\big{)}^{-1}\widetilde{\mathrm{\bf U}}_{1}\widetilde{\mathrm{\bf D}}_{\gamma}^{2}\widetilde{\mathrm{\bf U}}_{1}^{\top}\big{(}\widetilde{\mathrm{\bf U}}_{1}\widetilde{\mathrm{\bf D}}_{\gamma}\widetilde{\mathrm{\bf U}}_{1}^{\top}+\gamma_{>\ell}\mathrm{\bf I}_{n}\big{)}^{-1}\Big{\|}_{\mathrm{op}}
=(ii)C+C𝐔~1(𝐔~1𝐔~1+γ>(𝐃~γ)1)2𝐔~1op\displaystyle\stackrel{{\scriptstyle(ii)}}{{=}}C+C\cdot\Big{\|}\widetilde{\mathrm{\bf U}}_{1}\big{(}\widetilde{\mathrm{\bf U}}_{1}^{\top}\widetilde{\mathrm{\bf U}}_{1}+\gamma_{>\ell}(\widetilde{\mathrm{\bf D}}_{\gamma})^{-1}\big{)}^{-2}\widetilde{\mathrm{\bf U}}_{1}^{\top}\Big{\|}_{\mathrm{op}}
(iii)C+C[λmin(𝐔~1𝐔~1)]1\displaystyle\stackrel{{\scriptstyle(iii)}}{{\leq}}C+C\cdot\big{[}\lambda_{\min}(\widetilde{\mathrm{\bf U}}_{1}^{\top}\widetilde{\mathrm{\bf U}}_{1})\big{]}^{-1} (114)

where (i) is because 𝐃~22γ>2𝐈D+2(𝐃~γ)2\widetilde{\mathrm{\bf D}}^{2}\preceq 2\gamma_{>\ell}^{2}\mathrm{\bf I}_{D}+2(\widetilde{\mathrm{\bf D}}_{\gamma})^{2}, (ii) is because by the identity (156) in Lemma 23 we have

(𝐔~1𝐃~γ𝐔~1+γ>𝐈n)1𝐔~1𝐃~γ\displaystyle\big{(}\widetilde{\mathrm{\bf U}}_{1}\widetilde{\mathrm{\bf D}}_{\gamma}\widetilde{\mathrm{\bf U}}_{1}^{\top}+\gamma_{>\ell}\mathrm{\bf I}_{n}\big{)}^{-1}\widetilde{\mathrm{\bf U}}_{1}\widetilde{\mathrm{\bf D}}_{\gamma} =𝐔~1(𝐃~γ)1/2(𝐃~γ1/2𝐔~1𝐔~1𝐃~γ1/2+γ>𝐈D0)1𝐃~γ1/2\displaystyle=\widetilde{\mathrm{\bf U}}_{1}(\widetilde{\mathrm{\bf D}}_{\gamma})^{1/2}\big{(}\widetilde{\mathrm{\bf D}}_{\gamma}^{1/2}\widetilde{\mathrm{\bf U}}_{1}^{\top}\widetilde{\mathrm{\bf U}}_{1}\widetilde{\mathrm{\bf D}}_{\gamma}^{1/2}+\gamma_{>\ell}\mathrm{\bf I}_{D_{0}}\big{)}^{-1}\widetilde{\mathrm{\bf D}}_{\gamma}^{1/2}
=𝐔~1(𝐔~1𝐔~1+γ>𝐃~γ1)1;\displaystyle=\widetilde{\mathrm{\bf U}}_{1}\big{(}\widetilde{\mathrm{\bf U}}_{1}^{\top}\widetilde{\mathrm{\bf U}}_{1}+\gamma_{>\ell}\widetilde{\mathrm{\bf D}}_{\gamma}^{-1}\big{)}^{-1};

and (iii) is due to 𝐔~1op𝐔~op=1\|\widetilde{\mathrm{\bf U}}_{1}\|_{\mathrm{op}}\leq\|\widetilde{\mathrm{\bf U}}\|_{\mathrm{op}}=1.

Similarly, we have

𝐊~111𝐊~12𝒇~2\displaystyle\big{\|}\widetilde{\mathrm{\bf K}}_{11}^{-1}\widetilde{\mathrm{\bf K}}_{12}\widetilde{\text{\boldmath$f$}}_{2}\big{\|} 𝐊~111[𝐔~𝐃~𝐔~]12𝒇~2+𝐊~111(𝐊~1(res))12𝒇~2\displaystyle\leq\big{\|}\widetilde{\mathrm{\bf K}}_{11}^{-1}\big{[}\widetilde{\mathrm{\bf U}}\widetilde{\mathrm{\bf D}}\widetilde{\mathrm{\bf U}}^{\top}\big{]}_{12}\widetilde{\text{\boldmath$f$}}_{2}\big{\|}+\big{\|}\widetilde{\mathrm{\bf K}}_{11}^{-1}(\widetilde{\mathrm{\bf K}}_{1}^{(\mathrm{res})})_{12}\widetilde{\text{\boldmath$f$}}_{2}\big{\|}
𝐊~111𝐔~1𝐃~𝐔~2𝒇~2+𝐊~111op𝐊~1(res)op𝒇~2.\displaystyle\leq\big{\|}\widetilde{\mathrm{\bf K}}_{11}^{-1}\widetilde{\mathrm{\bf U}}_{1}\widetilde{\mathrm{\bf D}}\widetilde{\mathrm{\bf U}}_{2}^{\top}\widetilde{\text{\boldmath$f$}}_{2}\big{\|}+\big{\|}\widetilde{\mathrm{\bf K}}_{11}^{-1}\big{\|}_{\mathrm{op}}\cdot\big{\|}\widetilde{\mathrm{\bf K}}_{1}^{(\mathrm{res})}\big{\|}_{\mathrm{op}}\cdot\big{\|}\widetilde{\text{\boldmath$f$}}_{2}\big{\|}.

It is clear that the second term on the last line is bounded by C𝒇~2C\|\widetilde{\text{\boldmath$f$}}_{2}\| with very high probability, by Eq. (110). For the first term, with very high probability,

𝐊~111[𝐔~𝐃~𝐔~]12𝒇~2\displaystyle\big{\|}\widetilde{\mathrm{\bf K}}_{11}^{-1}\big{[}\widetilde{\mathrm{\bf U}}\widetilde{\mathrm{\bf D}}\widetilde{\mathrm{\bf U}}^{\top}\big{]}_{12}\widetilde{\text{\boldmath$f$}}_{2}\big{\|} C(𝐔~1𝐃~γ𝐔~1+γ>𝐈n)1𝐔~1(γ>𝐈D)𝐔~2𝒇~2\displaystyle\leq C\cdot\Big{\|}\big{(}\widetilde{\mathrm{\bf U}}_{1}\widetilde{\mathrm{\bf D}}_{\gamma}\widetilde{\mathrm{\bf U}}_{1}^{\top}+\gamma_{>\ell}\mathrm{\bf I}_{n}\big{)}^{-1}\widetilde{\mathrm{\bf U}}_{1}(\gamma_{>\ell}\mathrm{\bf I}_{D})\widetilde{\mathrm{\bf U}}_{2}^{\top}\widetilde{\text{\boldmath$f$}}_{2}\Big{\|} (115)
+C(𝐔~1𝐃~γ𝐔~1+γ>𝐈n)1𝐔~1𝐃~γ𝐔~2𝒇~2\displaystyle+C\cdot\Big{\|}\big{(}\widetilde{\mathrm{\bf U}}_{1}\widetilde{\mathrm{\bf D}}_{\gamma}\widetilde{\mathrm{\bf U}}_{1}^{\top}+\gamma_{>\ell}\mathrm{\bf I}_{n}\big{)}^{-1}\widetilde{\mathrm{\bf U}}_{1}\widetilde{\mathrm{\bf D}}_{\gamma}\widetilde{\mathrm{\bf U}}_{2}^{\top}\widetilde{\text{\boldmath$f$}}_{2}\Big{\|}
C𝒇~2+C𝐔~1(𝐔~1𝐔~1+γ>(𝐃~γ)1)1𝐔~2𝒇~2\displaystyle\leq C\|\widetilde{\text{\boldmath$f$}}_{2}\|+C\cdot\Big{\|}\widetilde{\mathrm{\bf U}}_{1}\big{(}\widetilde{\mathrm{\bf U}}_{1}^{\top}\widetilde{\mathrm{\bf U}}_{1}+\gamma_{>\ell}(\widetilde{\mathrm{\bf D}}_{\gamma})^{-1}\big{)}^{-1}\widetilde{\mathrm{\bf U}}_{2}^{\top}\widetilde{\text{\boldmath$f$}}_{2}\Big{\|}
C𝒇~2+C(𝐔~1𝐔~1+γ>(𝐃~γ)1)1op𝒇~2\displaystyle\leq C\|\widetilde{\text{\boldmath$f$}}_{2}\|+C\cdot\Big{\|}\big{(}\widetilde{\mathrm{\bf U}}_{1}^{\top}\widetilde{\mathrm{\bf U}}_{1}+\gamma_{>\ell}(\widetilde{\mathrm{\bf D}}_{\gamma})^{-1}\big{)}^{-1}\Big{\|}_{\mathrm{op}}\cdot\|\widetilde{\text{\boldmath$f$}}_{2}\|
C𝒇~2+C[λmin(𝐔~1𝐔~1)]1𝒇~2.\displaystyle\leq C\|\widetilde{\text{\boldmath$f$}}_{2}\|+C\cdot\big{[}\lambda_{\min}(\widetilde{\mathrm{\bf U}}_{1}^{\top}\widetilde{\mathrm{\bf U}}_{1})\big{]}^{-1}\cdot\|\widetilde{\text{\boldmath$f$}}_{2}\|. (116)

By fL2<C\|f\|_{L^{2}}<C and the law of large numbers, 𝒇~2Cn\|\widetilde{\text{\boldmath$f$}}_{2}\|\leq C\sqrt{n} holds with very high probability, so (114) and (116) indicate that we only need to prove that λmin(𝐔~1𝐔~1)c0\lambda_{\min}(\widetilde{\mathrm{\bf U}}_{1}^{\top}\widetilde{\mathrm{\bf U}}_{1})\geq c_{0}.

Step 2. We next prove a the claimed lower bound on λmin(𝐔~1𝐔~1)\lambda_{\min}(\widetilde{\mathrm{\bf U}}_{1}^{\top}\widetilde{\mathrm{\bf U}}_{1}). We will make use of the augmented infinite-width kernel

[𝐊~0]ij:=K(𝒙i,𝒙j),wherei,jn0.[\widetilde{\mathrm{\bf K}}_{0}]_{ij}:=K(\text{\boldmath$x$}_{i},\text{\boldmath$x$}_{j}),\qquad\text{where}~{}i,j\leq n_{0}.

Notice that the eigen-decomposition of 𝐊~0\widetilde{\mathrm{\bf K}}_{0} takes the same form as the one of 𝐊\mathrm{\bf K} as given in Lemma 6 (with the change that nn should be replaced by n0n_{0}). We denote by 𝐔~0n0×D0\widetilde{\mathrm{\bf U}}_{0}\in\mathbb{R}^{n_{0}\times D_{0}} the groups of eigenvectors that correspond to 𝐔\mathrm{\bf U} in that lemma and have the same eigenvalue structure as 𝐔~\widetilde{\mathrm{\bf U}}. Namely, in the eigen-decomposition of 𝐊~0\widetilde{\mathrm{\bf K}}_{0}, we keep eigenvalues from the largest 0\ell_{0} groups in (109) and let columns of 𝐔~0\widetilde{\mathrm{\bf U}}_{0} be the corresponding eigenvectors. We can express 𝐔~0\widetilde{\mathrm{\bf U}}_{0} and 𝐔~\widetilde{\mathrm{\bf U}} as 0\ell_{0} groups of eigenvectors.

𝐔~0=[𝐕~0(1),𝐕~0(2),,𝐕~0(0)],𝐔~=[𝐕~(1),𝐕~(2),,𝐕~(0)].\widetilde{\mathrm{\bf U}}_{0}=\big{[}\widetilde{\mathrm{\bf V}}_{0}^{(1)},\widetilde{\mathrm{\bf V}}_{0}^{(2)},\ldots,\widetilde{\mathrm{\bf V}}_{0}^{(\ell_{0})}\big{]},\qquad\widetilde{\mathrm{\bf U}}=\big{[}\widetilde{\mathrm{\bf V}}^{(1)},\widetilde{\mathrm{\bf V}}^{(2)},\ldots,\widetilde{\mathrm{\bf V}}^{(\ell_{0})}\big{]}. (117)

The remaining +20\ell+2-\ell_{0} groups of eigenvectors are denoted by 𝐕~0(0+1),,𝐕~0(+2)\widetilde{\mathrm{\bf V}}_{0}^{(\ell_{0}+1)},\ldots,\widetilde{\mathrm{\bf V}}_{0}^{(\ell+2)}, and 𝐕~(0+1),,𝐕~(+2)\widetilde{\mathrm{\bf V}}^{(\ell_{0}+1)},\ldots,\widetilde{\mathrm{\bf V}}^{(\ell+2)}. Note that

𝐔~1𝐔~1=𝐔~𝐏1𝐔~,where𝐏1:=(𝐈n𝟎𝟎𝟎).\ \widetilde{\mathrm{\bf U}}_{1}^{\top}\widetilde{\mathrm{\bf U}}_{1}=\widetilde{\mathrm{\bf U}}^{\top}\mathrm{\bf P}_{1}\widetilde{\mathrm{\bf U}},\qquad\text{where}~{}\mathrm{\bf P}_{1}:=\left(\begin{array}[]{cc}\mathrm{\bf I}_{n}&\mathrm{\bf 0}\\ \mathrm{\bf 0}&\mathrm{\bf 0}\end{array}\right).

Define 𝐏~0=𝐈n0𝐔~0𝐔~0\widetilde{\mathrm{\bf P}}_{0}^{\bot}=\mathrm{\bf I}_{n_{0}}-\widetilde{\mathrm{\bf U}}_{0}\widetilde{\mathrm{\bf U}}_{0}^{\top}. We apply Lemma 9 where we set 𝐔a=𝐔~0\mathrm{\bf U}_{a}=\widetilde{\mathrm{\bf U}}_{0}, 𝐔b=𝐔~\mathrm{\bf U}_{b}=\widetilde{\mathrm{\bf U}}, 𝐏=𝐏~1\mathrm{\bf P}=\widetilde{\mathrm{\bf P}}_{1}, and 𝐏a=𝐏~0\mathrm{\bf P}_{a}^{\perp}=\widetilde{\mathrm{\bf P}}_{0}^{\perp}. This yields

[λmin(𝐔~𝐏1𝐔~)]1/2=σmin(𝐏1𝐔~)σmin(𝐏1𝐔~0)σmin(𝐔~0𝐔~)σmax(𝐏~0𝐔~).\Big{[}\lambda_{\min}(\widetilde{\mathrm{\bf U}}^{\top}\mathrm{\bf P}_{1}\widetilde{\mathrm{\bf U}})\Big{]}^{1/2}=\sigma_{\min}(\mathrm{\bf P}_{1}\widetilde{\mathrm{\bf U}})\geq\sigma_{\min}(\mathrm{\bf P}_{1}\widetilde{\mathrm{\bf U}}_{0})\sigma_{\min}(\widetilde{\mathrm{\bf U}}_{0}^{\top}\widetilde{\mathrm{\bf U}})-\sigma_{\max}(\widetilde{\mathrm{\bf P}}_{0}^{\bot}\widetilde{\mathrm{\bf U}}). (118)

Now we invoke Lemma 7 (a) for the pair (k,k)(k,k^{\prime}) where k0k\leq\ell_{0} and k>0k^{\prime}>\ell_{0}. By the definition of 0\ell_{0}, the assumption of Lemma 7 (a) is satisfied, so (𝐕~0(k))𝐕~(k)opo˘d,(1)\|(\widetilde{\mathrm{\bf V}}_{0}^{(k^{\prime})})^{\top}\widetilde{\mathrm{\bf V}}^{(k)}\|_{\mathrm{op}}\leq\breve{o}_{d,\mathbb{P}}(1). Thus,

σmax(𝐏~0𝐔~)\displaystyle\sigma_{\max}(\widetilde{\mathrm{\bf P}}_{0}^{\bot}\widetilde{\mathrm{\bf U}}) =k>0𝐕~0(k)(𝐕~0(k))𝐔~opk>0(𝐕~0(k))𝐔~op\displaystyle=\big{\|}\sum_{k^{\prime}>\ell_{0}}\widetilde{\mathrm{\bf V}}_{0}^{(k^{\prime})}(\widetilde{\mathrm{\bf V}}_{0}^{(k^{\prime})})^{\top}\widetilde{\mathrm{\bf U}}\big{\|}_{\mathrm{op}}\leq\sum_{k^{\prime}>\ell_{0}}\big{\|}(\widetilde{\mathrm{\bf V}}_{0}^{(k^{\prime})})^{\top}\widetilde{\mathrm{\bf U}}\big{\|}_{\mathrm{op}}
k>0k0(𝐕~0(k))𝐕~(k)op(+2)2o˘d,(1)=o˘d,(1).\displaystyle\leq\sum_{k^{\prime}>\ell_{0}}\sum_{k\leq\ell_{0}}\big{\|}(\widetilde{\mathrm{\bf V}}_{0}^{(k^{\prime})})^{\top}\widetilde{\mathrm{\bf V}}^{(k)}\big{\|}_{\mathrm{op}}\leq(\ell+2)^{2}\cdot\breve{o}_{d,\mathbb{P}}(1)=\breve{o}_{d,\mathbb{P}}(1).

Moreover, we have

σmin(𝐔~0𝐔~)\displaystyle\sigma_{\min}(\widetilde{\mathrm{\bf U}}_{0}^{\top}\widetilde{\mathrm{\bf U}}) =min𝒗=1𝐔~0𝐔~𝒗=(i)min𝒗=1𝐔~0𝐔~0𝐔~𝒗=min𝒗=1𝐔~𝒗𝐏~0𝐔~𝒗\displaystyle=\min_{\|\text{\boldmath$v$}\|=1}\big{\|}\widetilde{\mathrm{\bf U}}_{0}^{\top}\widetilde{\mathrm{\bf U}}\text{\boldmath$v$}\big{\|}\stackrel{{\scriptstyle(i)}}{{=}}\min_{\|\text{\boldmath$v$}\|=1}\big{\|}\widetilde{\mathrm{\bf U}}_{0}\widetilde{\mathrm{\bf U}}_{0}^{\top}\widetilde{\mathrm{\bf U}}\text{\boldmath$v$}\big{\|}=\min_{\|\text{\boldmath$v$}\|=1}\big{\|}\widetilde{\mathrm{\bf U}}\text{\boldmath$v$}-\widetilde{\mathrm{\bf P}}_{0}^{\bot}\widetilde{\mathrm{\bf U}}\text{\boldmath$v$}\big{\|}
min𝒗=1𝐔~𝒗max𝒗=1𝐏~0𝐔~𝒗=(ii)1max𝒗=1𝐏~0𝐔~𝒗\displaystyle\geq\min_{\|\text{\boldmath$v$}\|=1}\big{\|}\widetilde{\mathrm{\bf U}}\text{\boldmath$v$}\big{\|}-\max_{\|\text{\boldmath$v$}\|=1}\big{\|}\widetilde{\mathrm{\bf P}}_{0}^{\bot}\widetilde{\mathrm{\bf U}}\text{\boldmath$v$}\big{\|}\stackrel{{\scriptstyle(ii)}}{{=}}1-\max_{\|\text{\boldmath$v$}\|=1}\big{\|}\widetilde{\mathrm{\bf P}}_{0}^{\bot}\widetilde{\mathrm{\bf U}}\text{\boldmath$v$}\big{\|}
1𝐏~0𝐔~op1o˘d,(1),\displaystyle\geq 1-\big{\|}\widetilde{\mathrm{\bf P}}_{0}^{\bot}\widetilde{\mathrm{\bf U}}\big{\|}_{\mathrm{op}}\geq 1-\breve{o}_{d,\mathbb{P}}(1),

where (i)(i) and (ii)(ii) uses the fact that 𝐔~0𝒂=𝒂\|\widetilde{\mathrm{\bf U}}_{0}\text{\boldmath$a$}\|=\|\text{\boldmath$a$}\| (where 𝒂a is a vector) since 𝐔0\mathrm{\bf U}_{0} has orthonormal columns.

Finally, we claim that

σmin(𝐏1𝐔~0)c,\displaystyle\sigma_{\min}(\mathrm{\bf P}_{1}\widetilde{\mathrm{\bf U}}_{0})\geq c\,, (119)

with very high probability where c>0c>0 is a small constant. We defer the proof of this claim to Step 4. Once it is proved, from Eqs. (118), we deduce that λmin(𝐔~1𝐔~1)c\lambda_{\min}(\widetilde{\mathrm{\bf U}}_{1}^{\top}\widetilde{\mathrm{\bf U}}_{1})\geq c^{\prime}. From (114) and (116), we conclude that, with very high probability,

(𝐊~11)1[𝐔~𝐃~2𝐔~]11(𝐊~11)1opC,\displaystyle\big{\|}(\widetilde{\mathrm{\bf K}}_{11})^{-1}[\widetilde{\mathrm{\bf U}}\widetilde{\mathrm{\bf D}}^{2}\widetilde{\mathrm{\bf U}}^{\top}]_{11}(\widetilde{\mathrm{\bf K}}_{11})^{-1}\big{\|}_{\mathrm{op}}\leq C,
𝐊~111[𝐔~𝐃~𝐔~]12𝒇~2opC,\displaystyle\big{\|}\widetilde{\mathrm{\bf K}}_{11}^{-1}\big{[}\widetilde{\mathrm{\bf U}}\widetilde{\mathrm{\bf D}}\widetilde{\mathrm{\bf U}}^{\top}\big{]}_{12}\widetilde{\text{\boldmath$f$}}_{2}\big{\|}_{\mathrm{op}}\leq C,

and thus we have proved (105) and (106).

Step 3. We only prove (107) since (108) is similar. For simplicity, we denote Z=(𝐊~11)1Z=\big{\|}(\widetilde{\mathrm{\bf K}}_{11})^{-1} (𝐊~2)11(𝐊~11)1𝐈nop(\widetilde{\mathrm{\bf K}}^{2})_{11}(\widetilde{\mathrm{\bf K}}_{11})^{-1}-\mathrm{\bf I}_{n}\big{\|}_{\mathrm{op}}. We know by Theorem 3.2 that there exists a constant c>0c^{\prime}>0 such that λmin(𝐊~11)c\lambda_{\min}(\widetilde{\mathrm{\bf K}}_{11})\geq c^{\prime} with very high probability. Let C=C1C=C_{1} be the same constant as in Eq. (105). For any β>0\beta>0, we also notice that the following event happens with very high probability

𝒙n+1,,𝒙n+n(Z>C)>dβ.\mathbb{P}_{\text{\boldmath$x$}_{n+1},\ldots,\text{\boldmath$x$}_{n+n^{\prime\prime}}}(Z>C)>d^{-\beta}. (120)

(Note that the left-hand side is a random variable depending on 𝒙1,,𝒙n\text{\boldmath$x$}_{1},\ldots,\text{\boldmath$x$}_{n} and 𝒘1,,𝒘N\text{\boldmath$w$}_{1},\ldots,\text{\boldmath$w$}_{N}.) Indeed, for any β>0\beta^{\prime}>0, By Markov’s inequality,

dβ(𝒙n+1,,𝒙2n(Z>C)>dβ)\displaystyle d^{\beta^{\prime}}\,\mathbb{P}\Big{(}\mathbb{P}_{\text{\boldmath$x$}_{n+1},\ldots,\text{\boldmath$x$}_{2n}}(Z>C)>d^{-\beta}\Big{)} dβ+β𝔼[𝒙n+1,,𝒙2n(Z>C)]\displaystyle\leq d^{\beta+\beta^{\prime}}\mathbb{E}\Big{[}\mathbb{P}_{\text{\boldmath$x$}_{n+1},\ldots,\text{\boldmath$x$}_{2n}}(Z>C)\Big{]}
=dβ+β(Z>C)d0.\displaystyle=d^{\beta+\beta^{\prime}}\mathbb{P}(Z>C)\xrightarrow{d\to\infty}0.

Thus, we proved that (120) happens with high probability. Let 𝒜\mathcal{A} be the event such that both λmin(𝐊~11)c\lambda_{\min}(\widetilde{\mathrm{\bf K}}_{11})\geq c^{\prime} and (120) happen. By the assumption on σ\sigma^{\prime}, namely Assumption 3.2, on the event λmin(𝐊~11)c\lambda_{\min}(\widetilde{\mathrm{\bf K}}_{11})\geq c^{\prime}, a deterministic (and naive) bound on 𝐊~max\|\widetilde{\mathrm{\bf K}}\|_{\max} is O(d2B)O(d^{2B}). So on 𝒜\mathcal{A}, we get

Z1+(𝐊~11)1op2𝐊~op2Cd2B.Z\leq 1+\big{\|}(\widetilde{\mathrm{\bf K}}_{11})^{-1}\big{\|}_{\mathrm{op}}^{2}\cdot\big{\|}\widetilde{\mathrm{\bf K}}\big{\|}_{\mathrm{op}}^{2}\leq Cd^{2B}.

Thus, on the event 𝒜\mathcal{A},

𝔼𝒙n+1,,𝒙n+n[Z]\displaystyle\mathbb{E}_{\text{\boldmath$x$}_{n+1},\ldots,\text{\boldmath$x$}_{n+n^{\prime}}}\big{[}Z\big{]} 𝔼𝒙n+1,,𝒙n+n[Z;ZC]+𝔼𝒙n+1,,𝒙n+n[Z;Z>C]\displaystyle\leq\mathbb{E}_{\text{\boldmath$x$}_{n+1},\ldots,\text{\boldmath$x$}_{n+n^{\prime}}}\big{[}Z;Z\leq C\big{]}+\mathbb{E}_{\text{\boldmath$x$}_{n+1},\ldots,\text{\boldmath$x$}_{n+n^{\prime}}}\big{[}Z;Z>C\big{]}
C+Cd2B𝒙n+1,,𝒙n+n(Z>C).\displaystyle\leq C+Cd^{2B}\cdot\mathbb{P}_{\text{\boldmath$x$}_{n+1},\ldots,\text{\boldmath$x$}_{n+n^{\prime}}}\big{(}Z>C\big{)}.

Therefore, 𝔼𝒙n+1,,𝒙n+n[Z]C\mathbb{E}_{\text{\boldmath$x$}_{n+1},\ldots,\text{\boldmath$x$}_{n+n^{\prime}}}\big{[}Z\big{]}\leq C^{\prime} for some C>CC^{\prime}>C.

Step 4. We now prove the claim (119), namely σmin(𝐏1𝐔~0)c\sigma_{\min}(\mathrm{\bf P}_{1}\widetilde{\mathrm{\bf U}}_{0})\geq c which was used in Step 2. We use a similar strategy as in Step 2. Let 𝚿~n0×D\widetilde{\text{\boldmath$\Psi$}}_{\leq\ell}\in\mathbb{R}^{n_{0}\times D} contain the normalized spherical harmonics evaluated on 𝒙1,,𝒙n0\text{\boldmath$x$}_{1},\ldots,\text{\boldmath$x$}_{n_{0}}. Similar to (117), in the eigen-decomposition of 𝚿~𝚲2𝚿~+γ>𝐈n0\widetilde{\text{\boldmath$\Psi$}}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{2}\widetilde{\text{\boldmath$\Psi$}}_{\leq\ell}^{\top}+\gamma_{>\ell}\mathrm{\bf I}_{n_{0}}, we keep eigenvalues from the largest 0\ell_{0} groups in (109) and let 𝐃~s\widetilde{\mathrm{\bf D}}_{s} be those eigenvalues and columns of 𝐔~s\widetilde{\mathrm{\bf U}}_{s} be the corresponding eigenvectors. Then we partition 𝐔~s,𝐃~s\widetilde{\mathrm{\bf U}}_{s},\widetilde{\mathrm{\bf D}}_{s} into 0\ell_{0} groups:

𝐔~s=[𝐕~s(1),𝐕~s(2),,𝐕~s(0)],𝐃~s=diag(D~s(1),𝐃~s(2),,𝐃~s(0)).\widetilde{\mathrm{\bf U}}_{s}=\big{[}\widetilde{\mathrm{\bf V}}_{s}^{(1)},\widetilde{\mathrm{\bf V}}_{s}^{(2)},\ldots,\widetilde{\mathrm{\bf V}}_{s}^{(\ell_{0})}\big{]},\qquad\widetilde{\mathrm{\bf D}}_{s}=\mathrm{diag}\big{(}\widetilde{D}_{s}^{(1)},\widetilde{\mathrm{\bf D}}_{s}^{(2)},\ldots,\widetilde{\mathrm{\bf D}}_{s}^{(\ell_{0})}\big{)}.

The remaining +20\ell+2-\ell_{0} groups of eigenvectors are denoted by 𝐕~s(0+1),,𝐕~s(+2)\widetilde{\mathrm{\bf V}}_{s}^{(\ell_{0}+1)},\ldots,\widetilde{\mathrm{\bf V}}_{s}^{(\ell+2)}. Define 𝐏~s=𝐈n0𝐔~s𝐔~s\widetilde{\mathrm{\bf P}}_{s}^{\bot}=\mathrm{\bf I}_{n_{0}}-\widetilde{\mathrm{\bf U}}_{s}\widetilde{\mathrm{\bf U}}_{s}^{\top}. We apply Lemma 9 again, in which we set 𝐔a=𝐔~s\mathrm{\bf U}_{a}=\widetilde{\mathrm{\bf U}}_{s}, 𝐔b=𝐔~0\mathrm{\bf U}_{b}=\widetilde{\mathrm{\bf U}}_{0}, 𝐏=𝐏1\mathrm{\bf P}=\mathrm{\bf P}_{1}, and 𝐏a=𝐏~s\mathrm{\bf P}_{a}^{\perp}=\widetilde{\mathrm{\bf P}}_{s}^{\perp}. This yields

σmin(𝐏1𝐔~0)σmin(𝐏1𝐔~s)σmin(𝐔~s𝐔~0)σmax(𝐏~s𝐔~0).\displaystyle\sigma_{\min}(\mathrm{\bf P}_{1}\widetilde{\mathrm{\bf U}}_{0})\geq\sigma_{\min}(\mathrm{\bf P}_{1}\widetilde{\mathrm{\bf U}}_{s})\sigma_{\min}(\widetilde{\mathrm{\bf U}}_{s}^{\top}\widetilde{\mathrm{\bf U}}_{0})-\sigma_{\max}(\widetilde{\mathrm{\bf P}}_{s}^{\bot}\widetilde{\mathrm{\bf U}}_{0}).

By Lemma 8, we have σmin(𝐏1𝐔~s)c>0\sigma_{\min}(\mathrm{\bf P}_{1}\widetilde{\mathrm{\bf U}}_{s})\geq c>0 with very high probability for certain constant cc. Because of the gap λ(0+1)δ0λ(0)\lambda_{(\ell_{0}+1)}\leq\delta_{0}\lambda_{(\ell_{0})}, we can apply Lemma 7: for k>0k^{\prime}>\ell_{0} and k0k\leq\ell_{0}, we have

(𝐕~s(k))𝐕~0(k)op2𝚫(res)op|λkλk|o˘d,(1)2𝚫(res)op(1δ0)λ(0)o˘d,(1)3(1+o˘d,(1))𝚫(res)opγ>.\big{\|}(\widetilde{\mathrm{\bf V}}_{s}^{(k^{\prime})})^{\top}\widetilde{\mathrm{\bf V}}_{0}^{(k)}\big{\|}_{\mathrm{op}}\leq\frac{2\|\text{\boldmath$\Delta$}^{(\mathrm{res})}\|_{\mathrm{op}}}{\big{|}\lambda_{k}-\lambda_{k^{\prime}}\big{|}-\breve{o}_{d,\mathbb{P}}(1)}\leq\frac{2\|\text{\boldmath$\Delta$}^{(\mathrm{res})}\|_{\mathrm{op}}}{(1-\delta_{0})\lambda_{(\ell_{0})}-\breve{o}_{d,\mathbb{P}}(1)}\leq\frac{3(1+\breve{o}_{d,\mathbb{P}}(1))\|\text{\boldmath$\Delta$}^{(\mathrm{res})}\|_{\mathrm{op}}}{\gamma_{>\ell}}.

Further 𝚫(res)op=o˘d,(1)\|\text{\boldmath$\Delta$}^{(\mathrm{res})}\|_{\mathrm{op}}=\breve{o}_{d,\mathbb{P}}(1) by Lemma 6. Thus,

σmax(𝐏~s𝐔~0)k>0k0(𝐕~s(k))𝐕~0(k)opo˘d,(1),\sigma_{\max}(\widetilde{\mathrm{\bf P}}_{s}^{\bot}\widetilde{\mathrm{\bf U}}_{0})\leq\sum_{k^{\prime}>\ell_{0}}\sum_{k\leq\ell_{0}}\big{\|}(\widetilde{\mathrm{\bf V}}_{s}^{(k^{\prime})})^{\top}\widetilde{\mathrm{\bf V}}_{0}^{(k)}\big{\|}_{\mathrm{op}}\leq\breve{o}_{d,\mathbb{P}}(1), (121)

Moreover, σmin(𝐔~s𝐔~0)1𝐏~s𝐔~0op1o˘d,(1)\sigma_{\min}(\widetilde{\mathrm{\bf U}}_{s}^{\top}\widetilde{\mathrm{\bf U}}_{0})\geq 1-\|\widetilde{\mathrm{\bf P}}_{s}^{\bot}\widetilde{\mathrm{\bf U}}_{0}\|_{\mathrm{op}}\geq 1-\breve{o}_{d,\mathbb{P}}(1). Combining the lower bounds on σmin(𝐏1𝐔~s)\sigma_{\min}(\mathrm{\bf P}_{1}\widetilde{\mathrm{\bf U}}_{s}), σmin(𝐔~s𝐔~0)\sigma_{\min}(\widetilde{\mathrm{\bf U}}_{s}^{\top}\widetilde{\mathrm{\bf U}}_{0}) and the upper bound (121), we obtain

σmin(𝐏1𝐔~0)(1o˘d,(1))co˘d,(1)\sigma_{\min}(\mathrm{\bf P}_{1}\widetilde{\mathrm{\bf U}}_{0})\geq(1-\breve{o}_{d,\mathbb{P}}(1))\cdot c^{\prime}-\breve{o}_{d,\mathbb{P}}(1)

with very high probability. This proves the claim. ∎

B.3 Proof of Lemma 4

By homogeneity, we will assume, without loss of generality fL2=1\|f\|_{L^{2}}=1.

Recall that, by Lemma 1, we have K(𝒙i,𝒙)=k=0γkQk(d)(𝒙i,𝒙)K(\text{\boldmath$x$}_{i},\text{\boldmath$x$})=\sum_{k=0}^{\infty}\gamma_{k}Q_{k}^{(d)}(\langle\text{\boldmath$x$}_{i},\text{\boldmath$x$}\rangle) (where we replaced 𝒙j\text{\boldmath$x$}_{j} by an independent copy 𝒙x for notational convenience). We derive

K(2)ij\displaystyle K^{(2)}_{ij} =𝔼𝒙[K(𝒙i,𝒙)K(𝒙,𝒙j)]=(i)limMk,m=0Mγkγm𝔼𝒙[Qk(d)(𝒙i,𝒙)Qk(d)(𝒙,𝒙j)]\displaystyle=\mathbb{E}_{\text{\boldmath$x$}}\big{[}K(\text{\boldmath$x$}_{i},\text{\boldmath$x$})K(\text{\boldmath$x$},\text{\boldmath$x$}_{j})\big{]}\stackrel{{\scriptstyle(i)}}{{=}}\lim_{M\to\infty}\sum_{k,m=0}^{M}\gamma_{k}\gamma_{m}\mathbb{E}_{\text{\boldmath$x$}}\big{[}Q_{k}^{(d)}(\langle\text{\boldmath$x$}_{i},\text{\boldmath$x$}\rangle)Q_{k}^{(d)}(\langle\text{\boldmath$x$},\text{\boldmath$x$}_{j}\rangle)\big{]}
=(ii)k=0γk2B(d,k)Qk(d)(𝒙i,𝒙j)\displaystyle\stackrel{{\scriptstyle(ii)}}{{=}}\sum_{k=0}^{\infty}\frac{\gamma_{k}^{2}}{B(d,k)}Q_{k}^{(d)}(\langle\text{\boldmath$x$}_{i},\text{\boldmath$x$}_{j}\rangle)

where in (i) we used limMk>MγkQk(d)(𝒙i,)L2=0\lim_{M\to\infty}\|\sum_{k>M}\gamma_{k}Q_{k}^{(d)}(\langle\text{\boldmath$x$}_{i},\cdot\rangle)\|_{L^{2}}=0 and convergence in L2L^{2}-spaces (Lemma 24), and in (ii) we used the identity (40). Note that

γ>(2)\displaystyle\gamma_{>\ell}^{(2)} :=k>γk2B(d,)1B(d,+1)k>γk2(1+od(1))(!)γ>2d+1.\displaystyle:=\sum_{k>\ell}\frac{\gamma_{k}^{2}}{B(d,\ell)}\leq\frac{1}{B(d,\ell+1)}\sum_{k>\ell}\gamma_{k}^{2}\leq\frac{(1+o_{d}(1))(\ell!)\gamma_{>\ell}^{2}}{d^{\ell+1}}.

Using the identity (41), we have 𝑸k:=(𝑸k(d)(𝒙i,𝒙j))i,jn=[B(d,k)]1𝚽=𝚽=\text{\boldmath$Q$}_{k}:=(\text{\boldmath$Q$}_{k}^{(d)}(\langle\text{\boldmath$x$}_{i},\text{\boldmath$x$}_{j}\rangle))_{i,j\leq n}=[B(d,k)]^{-1}\text{\boldmath$\Phi$}_{=\ell}\text{\boldmath$\Phi$}_{=\ell}^{\top}. Thus, we can express 𝐊(2)\mathrm{\bf K}^{(2)} as

𝐊(2)=𝚿𝚲4𝚿+γ>(2)𝐈n+𝚫(2)\mathrm{\bf K}^{(2)}=\text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{4}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}+\gamma_{>\ell}^{(2)}\mathrm{\bf I}_{n}+\text{\boldmath$\Delta$}^{(2)} (122)

where 𝚫(2):=k>[B(d,k)]1γk2(𝑸k𝐈n)\text{\boldmath$\Delta$}^{(2)}:=\sum_{k>\ell}[B(d,k)]^{-1}\gamma_{k}^{2}(\text{\boldmath$Q$}_{k}-\mathrm{\bf I}_{n}) satisfies, by Proposition E.1, with high probability,

𝚫(2)opγ>(2)supk>𝑸k𝐈dopCd+1n(logn)Cd+1.\|\text{\boldmath$\Delta$}^{(2)}\|_{\mathrm{op}}\leq\gamma_{>\ell}^{(2)}\sup_{k>\ell}\big{\|}\text{\boldmath$Q$}_{k}-\mathrm{\bf I}_{d}\big{\|}_{\mathrm{op}}\leq\frac{C}{d^{\ell+1}}\sqrt{\frac{n(\log n)^{C}}{d^{\ell+1}}}.

Using the decomposition of 𝐊(2)\mathrm{\bf K}^{(2)} in (122) and 𝐊1opC\|\mathrm{\bf K}^{-1}\|_{\mathrm{op}}\leq C from Lemma 2, we have

(λ𝐈n+𝐊)1𝐊(2)(λ𝐈n+𝐊)1op\displaystyle\big{\|}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathrm{\bf K}^{(2)}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\big{\|}_{\mathrm{op}}
(λ𝐈n+𝐊)1𝚿𝚲4𝚿(λ𝐈n+𝐊)1op+(λ𝐈n+𝐊)1op(𝚫(2)op+γ>(2))(λ𝐈n+𝐊)1op\displaystyle\leq\big{\|}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{4}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\big{\|}_{\mathrm{op}}+\|(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\|_{\mathrm{op}}\cdot\big{(}\|\text{\boldmath$\Delta$}^{(2)}\|_{\mathrm{op}}+\gamma_{>\ell}^{(2)}\big{)}\cdot\|(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\|_{\mathrm{op}}
(λ𝐈n+𝐊)1𝚿𝚲4𝚿(λ𝐈n+𝐊)1op+(λ𝐈n+𝐊)1op2Cd+1\displaystyle\leq\big{\|}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{4}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\big{\|}_{\mathrm{op}}+\|(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\|_{\mathrm{op}}^{2}\cdot\frac{C}{d^{\ell+1}}
(λ𝐈n+𝐊)1𝚿𝚲4𝚿(λ𝐈n+𝐊)1op+Cn.\displaystyle\leq\big{\|}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{4}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\big{\|}_{\mathrm{op}}+\frac{C}{n}.

Now we use the identity (113), in which we set 𝐀=λ𝐈n+𝐊,𝐀0=𝚿𝚲2𝚿+γ>𝐈n\mathrm{\bf A}=\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K},\mathrm{\bf A}_{0}=\text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{2}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}+\gamma_{>\ell}\mathrm{\bf I}_{n}, and 𝐇=𝚿𝚲4𝚿\mathrm{\bf H}=\text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{4}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}. It holds that 𝐀1opC,𝐀01opC\|\mathrm{\bf A}^{-1}\|_{\mathrm{op}}\leq C,\|\mathrm{\bf A}_{0}^{-1}\|_{\mathrm{op}}\leq C, and 𝐀𝐀0opC\|\mathrm{\bf A}-\mathrm{\bf A}_{0}\|_{\mathrm{op}}\leq C with high probability. So with high probability,

(λ𝐈n+𝐊)1𝚿𝚲4𝚿(λ𝐈n+𝐊)1op\displaystyle\big{\|}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{4}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\big{\|}_{\mathrm{op}}
C(𝚿𝚲2𝚿+γ>𝐈n)1𝚿𝚲4𝚿(𝚿𝚲2𝚿+γ>𝐈n)1op\displaystyle\leq C\cdot\Big{\|}\big{(}\text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{2}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}+\gamma_{>\ell}\mathrm{\bf I}_{n}\big{)}^{-1}\text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{4}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}\big{(}\text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{2}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}+\gamma_{>\ell}\mathrm{\bf I}_{n}\big{)}^{-1}\Big{\|}_{\mathrm{op}}
(i)C𝚿(𝚿𝚿+γ>𝚲2)2𝚿op\displaystyle\stackrel{{\scriptstyle(i)}}{{\leq}}C\cdot\Big{\|}\text{\boldmath$\Psi$}_{\leq\ell}\big{(}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}\text{\boldmath$\Psi$}_{\leq\ell}+\gamma_{>\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{-2}\big{)}^{-2}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}\Big{\|}_{\mathrm{op}}
C𝚿op2λmin(𝚿𝚿)2\displaystyle\leq C\cdot\big{\|}\text{\boldmath$\Psi$}_{\leq\ell}\big{\|}_{\mathrm{op}}^{2}\cdot\lambda_{\min}\big{(}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}\text{\boldmath$\Psi$}_{\leq\ell}\big{)}^{-2}
(ii)Cn,\displaystyle\stackrel{{\scriptstyle(ii)}}{{\leq}}\frac{C}{n},

where (i) follows from the identity (156) and (ii) follows from (82). This completes the proof of (74).
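
As a quick numerical sanity check (again not part of the argument), the push-through identity (156) invoked in step (i) can be verified directly on a small random instance; the dimensions and the value of the regularizer below are arbitrary:

import numpy as np

rng = np.random.default_rng(0)
n, D, gam = 30, 5, 0.7                              # arbitrary small dimensions; gam plays the role of gamma_{>l}
Psi = rng.standard_normal((n, D))
Lam2 = np.diag(rng.uniform(0.5, 2.0, D))            # stands in for Lambda^2 (invertible)
A = Psi @ Lam2 @ Psi.T + gam * np.eye(n)            # Psi Lambda^2 Psi^T + gamma I_n
lhs = np.linalg.inv(A) @ Psi @ Lam2 @ Lam2 @ Psi.T @ np.linalg.inv(A)
M = Psi.T @ Psi + gam * np.linalg.inv(Lam2)         # Psi^T Psi + gamma Lambda^{-2}
rhs = Psi @ np.linalg.inv(M @ M) @ Psi.T            # Psi (Psi^T Psi + gamma Lambda^{-2})^{-2} Psi^T
print(np.max(np.abs(lhs - rhs)))                    # ~1e-13: the two sides coincide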

In order to prove (75), we will prove that the following holds w.h.p.

sup𝒖𝕊n1|𝔼𝒙[𝒖(λ𝐈n+𝐊)1𝐊(,𝒙)f(𝒙)]|Cn.\sup_{\text{\boldmath$u$}\in\mathbb{S}^{n-1}}\Big{|}\mathbb{E}_{\text{\boldmath$x$}}\big{[}\text{\boldmath$u$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathrm{\bf K}(\cdot,\text{\boldmath$x$})f(\text{\boldmath$x$})\big{]}\Big{|}\leq\frac{C}{\sqrt{n}}.

We use the Cauchy-Schwarz inequality on the left-hand side. Since we assumed, without loss of generality, fL2=1\|f\|_{L^{2}}=1, we only need to show

𝔼𝒙[(𝒖(λ𝐈n+𝐊)1𝐊(,𝒙))2]Cn\mathbb{E}_{\text{\boldmath$x$}}\Big{[}\big{(}\text{\boldmath$u$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathrm{\bf K}(\cdot,\text{\boldmath$x$})\big{)}^{2}\Big{]}\leq\frac{C}{n}

for any unit vector 𝒖u. This is immediate from (74) since

sup𝒖=1𝔼𝒙[(𝒖(λ𝐈n+𝐊)1𝐊(,𝒙))2]\displaystyle\sup_{\|\text{\boldmath$u$}\|=1}\mathbb{E}_{\text{\boldmath$x$}}\Big{[}\big{(}\text{\boldmath$u$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathrm{\bf K}(\cdot,\text{\boldmath$x$})\big{)}^{2}\Big{]}
=sup𝒖=1𝒖𝔼𝒙[(λ𝐈n+𝐊)1𝐊(,𝒙)𝐊(,𝒙)(λ𝐈n+𝐊)1]𝒖\displaystyle=\sup_{\|\text{\boldmath$u$}\|=1}\text{\boldmath$u$}^{\top}\mathbb{E}_{\text{\boldmath$x$}}\Big{[}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathrm{\bf K}(\cdot,\text{\boldmath$x$})\mathrm{\bf K}(\cdot,\text{\boldmath$x$})^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\Big{]}\text{\boldmath$u$}
=(λ𝐈n+𝐊)1𝐊(2)(λ𝐈n+𝐊)1op.\displaystyle=\big{\|}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathrm{\bf K}^{(2)}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\big{\|}_{\mathrm{op}}.

The proof is now complete.

B.4 Proof of Lemma 5

Proof of Lemma 5.

Step 1: Proving (76)–(78). Define, for kNk\leq N,

K(k)(𝒙1,𝒙2)=σ(𝒙1,𝒘k)σ(𝒙2,𝒘k)𝒙1,𝒙2dK^{(k)}(\text{\boldmath$x$}_{1},\text{\boldmath$x$}_{2})=\sigma^{\prime}(\langle\text{\boldmath$x$}_{1},\text{\boldmath$w$}_{k}\rangle)\sigma^{\prime}(\langle\text{\boldmath$x$}_{2},\text{\boldmath$w$}_{k}\rangle)\frac{\langle\text{\boldmath$x$}_{1},\text{\boldmath$x$}_{2}\rangle}{d}

so we have 𝐊N=N1kN𝐊(k)\mathrm{\bf K}_{N}=N^{-1}\sum_{k\leq N}\mathrm{\bf K}^{(k)}. We derive

𝔼𝒘,𝒛1,𝒛2[(δI12g,𝒉)2]\displaystyle\mathbb{E}_{\text{\boldmath$w$},\text{\boldmath$z$}_{1},\text{\boldmath$z$}_{2}}\big{[}(\delta I_{12}^{g,\text{\boldmath$h$}})^{2}\big{]} =1N2k,kN𝔼𝒘,𝒛1,𝒛2[𝒗((𝐊(k)(,𝒛1)𝐊(,𝒛1))(𝐊(k)(𝒛2,)𝐊(𝒛2,))g(𝒛1)g(𝒛2))𝒗]\displaystyle=\frac{1}{N^{2}}\sum_{k,k^{\prime}\leq N}\mathbb{E}_{\text{\boldmath$w$},\text{\boldmath$z$}_{1},\text{\boldmath$z$}_{2}}\Big{[}\text{\boldmath$v$}^{\top}\big{(}(\mathrm{\bf K}^{(k)}(\cdot,\text{\boldmath$z$}_{1})-\mathrm{\bf K}(\cdot,\text{\boldmath$z$}_{1}))(\mathrm{\bf K}^{(k^{\prime})}(\text{\boldmath$z$}_{2},\cdot)-\mathrm{\bf K}(\text{\boldmath$z$}_{2},\cdot))g(\text{\boldmath$z$}_{1})g(\text{\boldmath$z$}_{2})\big{)}\text{\boldmath$v$}\Big{]}
=1N2kN𝒗𝔼𝒘,𝒛1,𝒛2[(𝐊(k)(,𝒛1)𝐊(,𝒛1))(𝐊(k)(𝒛2,)𝐊(𝒛2,))g(𝒛1)g(𝒛2)]𝒗\displaystyle=\frac{1}{N^{2}}\sum_{k\leq N}\text{\boldmath$v$}^{\top}\mathbb{E}_{\text{\boldmath$w$},\text{\boldmath$z$}_{1},\text{\boldmath$z$}_{2}}\Big{[}\big{(}\mathrm{\bf K}^{(k)}(\cdot,\text{\boldmath$z$}_{1})-\mathrm{\bf K}(\cdot,\text{\boldmath$z$}_{1}))\big{(}\mathrm{\bf K}^{(k)}(\text{\boldmath$z$}_{2},\cdot)-\mathrm{\bf K}(\text{\boldmath$z$}_{2},\cdot))g(\text{\boldmath$z$}_{1})g(\text{\boldmath$z$}_{2})\Big{]}\text{\boldmath$v$}
=1N2kN𝒗𝔼𝒘,𝒛1,𝒛2[(𝐊(k)(,𝒛1)𝐊(k)(𝒛2,)𝐊(,𝒛1)𝐊(𝒛2,))g(𝒛1)g(𝒛2)]𝒗\displaystyle=\frac{1}{N^{2}}\sum_{k\leq N}\text{\boldmath$v$}^{\top}\mathbb{E}_{\text{\boldmath$w$},\text{\boldmath$z$}_{1},\text{\boldmath$z$}_{2}}\Big{[}\big{(}\mathrm{\bf K}^{(k)}(\cdot,\text{\boldmath$z$}_{1})\mathrm{\bf K}^{(k)}(\text{\boldmath$z$}_{2},\cdot)-\mathrm{\bf K}(\cdot,\text{\boldmath$z$}_{1})\mathrm{\bf K}(\text{\boldmath$z$}_{2},\cdot)\big{)}g(\text{\boldmath$z$}_{1})g(\text{\boldmath$z$}_{2})\Big{]}\text{\boldmath$v$}

where we used independence of 𝐊(k)\mathrm{\bf K}^{(k)} (conditional on (𝒙i)in(\text{\boldmath$x$}_{i})_{i\leq n}) and 𝔼𝒘[𝐊(k)]=𝐊\mathbb{E}_{\text{\boldmath$w$}}[\mathrm{\bf K}^{(k)}]=\mathrm{\bf K}. Note that

𝒗𝔼𝒘,𝒛1,𝒛2[𝐊(,𝒛1)𝐊(𝒛2,)g(𝒛1)g(𝒛2)]𝒗=(𝒗𝔼𝒛[𝐊(,𝒛)g(𝒛)])20,\displaystyle\text{\boldmath$v$}^{\top}\mathbb{E}_{\text{\boldmath$w$},\text{\boldmath$z$}_{1},\text{\boldmath$z$}_{2}}\big{[}\mathrm{\bf K}(\cdot,\text{\boldmath$z$}_{1})\mathrm{\bf K}(\text{\boldmath$z$}_{2},\cdot)g(\text{\boldmath$z$}_{1})g(\text{\boldmath$z$}_{2})\big{]}\text{\boldmath$v$}=\big{(}\text{\boldmath$v$}^{\top}\mathbb{E}_{\text{\boldmath$z$}}\big{[}\mathrm{\bf K}(\cdot,\text{\boldmath$z$})g(\text{\boldmath$z$})\big{]}\big{)}^{2}\geq 0,

and that

𝔼𝒘,𝒛1,𝒛2[𝐊(k)(,𝒛1)𝐊(k)(𝒛2,)g(𝒛1)g(𝒛2)]=𝐇1.\displaystyle\mathbb{E}_{\text{\boldmath$w$},\text{\boldmath$z$}_{1},\text{\boldmath$z$}_{2}}\Big{[}\mathrm{\bf K}^{(k)}(\cdot,\text{\boldmath$z$}_{1})\mathrm{\bf K}^{(k)}(\text{\boldmath$z$}_{2},\cdot)g(\text{\boldmath$z$}_{1})g(\text{\boldmath$z$}_{2})\Big{]}=\mathrm{\bf H}_{1}.

This proves the bound on \mathbb{E}_{\text{\boldmath$w$},\text{\boldmath$z$}_{1},\text{\boldmath$z$}_{2}}\big{[}(\delta I_{12}^{g,\text{\boldmath$h$}})^{2}\big{]} in (76). A similar bound, namely (77), holds for \mathbb{E}_{\text{\boldmath$w$},\text{\boldmath$z$}_{1},\text{\boldmath$z$}_{2}}\big{[}(\delta I_{231}^{\text{\boldmath$h$}})^{2}\big{]}, in which we replace g with \widetilde{h}.

Next, we derive

𝔼𝒘[δI232𝒉]\displaystyle\mathbb{E}_{\text{\boldmath$w$}}\big{[}\delta I_{232}^{\text{\boldmath$h$}}\big{]} =1N2k,kN𝒗𝔼𝒘,𝒙[(𝐊(k)(,𝒙)𝐊(,𝒙))(𝐊(k)(𝒙,)𝐊(𝒙,))]𝒗\displaystyle=\frac{1}{N^{2}}\sum_{k,k^{\prime}\leq N}\text{\boldmath$v$}^{\top}\mathbb{E}_{\text{\boldmath$w$},\text{\boldmath$x$}}\Big{[}\big{(}\mathrm{\bf K}^{(k)}(\cdot,\text{\boldmath$x$})-\mathrm{\bf K}(\cdot,\text{\boldmath$x$})\big{)}\big{(}\mathrm{\bf K}^{(k^{\prime})}(\text{\boldmath$x$},\cdot)-\mathrm{\bf K}(\text{\boldmath$x$},\cdot)\big{)}\Big{]}\text{\boldmath$v$}
=1N2kN𝒗𝔼𝒘,𝒙[(𝐊(k)(,𝒙)𝐊(,𝒙))(𝐊(k)(𝒙,)𝐊(𝒙,))]𝒗\displaystyle=\frac{1}{N^{2}}\sum_{k\leq N}\text{\boldmath$v$}^{\top}\mathbb{E}_{\text{\boldmath$w$},\text{\boldmath$x$}}\Big{[}\big{(}\mathrm{\bf K}^{(k)}(\cdot,\text{\boldmath$x$})-\mathrm{\bf K}(\cdot,\text{\boldmath$x$})\big{)}\big{(}\mathrm{\bf K}^{(k)}(\text{\boldmath$x$},\cdot)-\mathrm{\bf K}(\text{\boldmath$x$},\cdot)\big{)}\Big{]}\text{\boldmath$v$}
=1N2kN𝒗𝔼𝒘,𝒙[𝐊(k)(,𝒙)𝐊(k)(𝒙,)𝐊(,𝒙)𝐊(𝒙,)]𝒗.\displaystyle=\frac{1}{N^{2}}\sum_{k\leq N}\text{\boldmath$v$}^{\top}\mathbb{E}_{\text{\boldmath$w$},\text{\boldmath$x$}}\Big{[}\mathrm{\bf K}^{(k)}(\cdot,\text{\boldmath$x$})\mathrm{\bf K}^{(k)}(\text{\boldmath$x$},\cdot)-\mathrm{\bf K}(\cdot,\text{\boldmath$x$})\mathrm{\bf K}(\text{\boldmath$x$},\cdot)\Big{]}\text{\boldmath$v$}.

Note that 𝐊(,𝒙)𝐊(𝒙,)\mathrm{\bf K}(\cdot,\text{\boldmath$x$})\mathrm{\bf K}(\text{\boldmath$x$},\cdot) is p.s.d., and that

𝔼𝒘,𝒙[𝐊(k)(,𝒙)𝐊(k)(𝒙,)]=𝐇3.\mathbb{E}_{\text{\boldmath$w$},\text{\boldmath$x$}}\Big{[}\mathrm{\bf K}^{(k)}(\cdot,\text{\boldmath$x$})\mathrm{\bf K}^{(k)}(\text{\boldmath$x$},\cdot)\Big{]}=\mathrm{\bf H}_{3}.

This proves the bound on 𝔼𝒘[δI232𝒉]\mathbb{E}_{\text{\boldmath$w$}}\big{[}\delta I_{232}^{\text{\boldmath$h$}}\big{]} in (78).

Step 2: proving the bounds on 𝐇1,𝐇2,𝐇3\mathrm{\bf H}_{1},\mathrm{\bf H}_{2},\mathrm{\bf H}_{3}. We define the function 𝒒(𝒘)d\text{\boldmath$q$}(\text{\boldmath$w$})\in\mathbb{R}^{d} as follows.

𝒒(𝒘)=1d𝔼𝒛[σ(𝒛,𝒘)g(𝒛)𝒛].\text{\boldmath$q$}(\text{\boldmath$w$})=\frac{1}{d}\mathbb{E}_{\text{\boldmath$z$}}\big{[}\sigma^{\prime}(\langle\text{\boldmath$z$},\text{\boldmath$w$}\rangle)g(\text{\boldmath$z$})\text{\boldmath$z$}\big{]}.

We observe that, for any unit vector 𝒖d\text{\boldmath$u$}\in\mathbb{R}^{d},

\big{|}\langle\text{\boldmath$q$}(\text{\boldmath$w$}),\text{\boldmath$u$}\rangle\big{|}\leq\frac{1}{d}\Big{\{}\mathbb{E}_{\text{\boldmath$z$}}\big{[}(\sigma^{\prime}(\langle\text{\boldmath$z$},\text{\boldmath$w$}\rangle))^{4}\big{]}\Big{\}}^{1/4}\cdot\Big{\{}\mathbb{E}_{\text{\boldmath$z$}}\big{[}g(\text{\boldmath$z$})^{2}\big{]}\Big{\}}^{1/2}\cdot\Big{\{}\mathbb{E}_{\text{\boldmath$z$}}\big{[}\langle\text{\boldmath$u$},\text{\boldmath$z$}\rangle^{4}\big{]}\Big{\}}^{1/4}\leq\frac{C}{d}\|g\|_{L^{2}}\,,

where we used Hölder’s inequality and

limd𝔼𝒛[(σ(𝒛,𝒘))4]=limd𝔼𝒛[(σ(z1))4]=𝔼G𝒩(0,1)[(σ(G))4],\displaystyle\lim_{d\to\infty}\mathbb{E}_{\text{\boldmath$z$}}\big{[}(\sigma^{\prime}(\langle\text{\boldmath$z$},\text{\boldmath$w$}\rangle))^{4}]=\lim_{d\to\infty}\mathbb{E}_{\text{\boldmath$z$}}\big{[}(\sigma^{\prime}(z_{1}))^{4}]=\mathbb{E}_{G\sim\mathcal{N}(0,1)}\big{[}(\sigma^{\prime}(G))^{4}], (123)
limd𝔼𝒛[𝒖,𝒛4]=limd𝔼𝒛[z14]=𝔼G𝒩(0,1)[G4].\displaystyle\lim_{d\to\infty}\mathbb{E}_{\text{\boldmath$z$}}\big{[}\langle\text{\boldmath$u$},\text{\boldmath$z$}\rangle^{4}\big{]}=\lim_{d\to\infty}\mathbb{E}_{\text{\boldmath$z$}}\big{[}z_{1}^{4}\big{]}=\mathbb{E}_{G\sim\mathcal{N}(0,1)}\big{[}G^{4}]. (124)

Since gL2\|g\|_{L^{2}} is independent of 𝒘w, it follows that

\sup_{\|\text{\boldmath$w$}\|=1}\big{\|}\text{\boldmath$q$}(\text{\boldmath$w$})\big{\|}=\sup_{\|\text{\boldmath$w$}\|=\|\text{\boldmath$u$}\|=1}\big{|}\langle\text{\boldmath$q$}(\text{\boldmath$w$}),\text{\boldmath$u$}\rangle\big{|}\leq\frac{C}{d}\|g\|_{L^{2}}.

Using this function, we express 𝐇1\mathrm{\bf H}_{1} as

H1(𝒙1,𝒙2)=𝔼𝒘[σ(𝒙1,𝒘)σ(𝒙2,𝒘)𝒒(𝒘),𝒙1𝒒(𝒘),𝒙2].\displaystyle H_{1}(\text{\boldmath$x$}_{1},\text{\boldmath$x$}_{2})=\mathbb{E}_{\text{\boldmath$w$}}\Big{[}\sigma^{\prime}(\langle\text{\boldmath$x$}_{1},\text{\boldmath$w$}\rangle)\sigma^{\prime}(\langle\text{\boldmath$x$}_{2},\text{\boldmath$w$}\rangle)\langle\text{\boldmath$q$}(\text{\boldmath$w$}),\text{\boldmath$x$}_{1}\rangle\langle\text{\boldmath$q$}(\text{\boldmath$w$}),\text{\boldmath$x$}_{2}\rangle\Big{]}.

Let 𝜶=(α1,,αn)\text{\boldmath$\alpha$}=(\alpha_{1},\ldots,\alpha_{n})^{\top} be any vector, then

𝜶𝐇1𝜶\displaystyle\text{\boldmath$\alpha$}^{\top}\mathrm{\bf H}_{1}\text{\boldmath$\alpha$} =𝔼𝒘[𝒒(𝒘)(i,jnαiαjσ(𝒙i,𝒘)σ(𝒙j,𝒘)𝒙i𝒙j)𝒒(𝒘)]\displaystyle=\mathbb{E}_{\text{\boldmath$w$}}\Big{[}\text{\boldmath$q$}(\text{\boldmath$w$})^{\top}\Big{(}\sum_{i,j\leq n}\alpha_{i}\alpha_{j}\sigma^{\prime}(\langle\text{\boldmath$x$}_{i},\text{\boldmath$w$}\rangle)\sigma^{\prime}(\langle\text{\boldmath$x$}_{j},\text{\boldmath$w$}\rangle)\text{\boldmath$x$}_{i}\text{\boldmath$x$}_{j}^{\top}\Big{)}\text{\boldmath$q$}(\text{\boldmath$w$})\Big{]}
\displaystyle\leq\mathbb{E}_{\text{\boldmath$w$}}\Big{[}\big{\|}\text{\boldmath$q$}(\text{\boldmath$w$})\big{\|}^{2}\cdot\Big{\|}\sum_{i,j\leq n}\alpha_{i}\alpha_{j}\sigma^{\prime}(\langle\text{\boldmath$x$}_{i},\text{\boldmath$w$}\rangle)\sigma^{\prime}(\langle\text{\boldmath$x$}_{j},\text{\boldmath$w$}\rangle)\text{\boldmath$x$}_{i}\text{\boldmath$x$}_{j}^{\top}\Big{\|}_{\mathrm{op}}\Big{]}
\displaystyle\leq\frac{C\|g\|_{L^{2}}^{2}}{d^{2}}\mathbb{E}_{\text{\boldmath$w$}}\Big{[}\mathrm{Tr}\Big{(}\sum_{i,j\leq n}\alpha_{i}\alpha_{j}\sigma^{\prime}(\langle\text{\boldmath$x$}_{i},\text{\boldmath$w$}\rangle)\sigma^{\prime}(\langle\text{\boldmath$x$}_{j},\text{\boldmath$w$}\rangle)\text{\boldmath$x$}_{i}\text{\boldmath$x$}_{j}^{\top}\Big{)}\Big{]}
\displaystyle=\frac{C\|g\|_{L^{2}}^{2}}{d}\sum_{i,j\leq n}\alpha_{i}\alpha_{j}\mathbb{E}_{\text{\boldmath$w$}}\Big{[}\sigma^{\prime}(\langle\text{\boldmath$x$}_{i},\text{\boldmath$w$}\rangle)\sigma^{\prime}(\langle\text{\boldmath$x$}_{j},\text{\boldmath$w$}\rangle)\frac{\langle\text{\boldmath$x$}_{i},\text{\boldmath$x$}_{j}\rangle}{d}\Big{]}
\displaystyle=\frac{C\|g\|_{L^{2}}^{2}}{d}\cdot\text{\boldmath$\alpha$}^{\top}\mathrm{\bf K}\text{\boldmath$\alpha$}

where we used 𝐀opTr(𝐀)\|\mathrm{\bf A}\|_{\mathrm{op}}\leq\mathrm{Tr}(\mathrm{\bf A}) for a p.s.d. matrix 𝐀\mathrm{\bf A}. This shows 𝐇1Cd1gL22𝐊\mathrm{\bf H}_{1}\preceq Cd^{-1}\|g\|_{L^{2}}^{2}\mathrm{\bf K}. The proof for 𝐇2\mathrm{\bf H}_{2} is similar.
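The trace bound used above is elementary (for a p.s.d. matrix the operator norm is the largest eigenvalue, hence at most the sum of the eigenvalues); a one-line numerical illustration:

import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((8, 8))
A = B @ B.T                                          # a random p.s.d. matrix
print(np.linalg.eigvalsh(A).max() <= np.trace(A))    # True: ||A||_op <= Tr(A) since all eigenvalues are nonnegative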

We also define \text{\boldmath$Q$}(\text{\boldmath$w$})\in\mathbb{R}^{d\times d}:

\displaystyle\text{\boldmath$Q$}(\text{\boldmath$w$})=\mathbb{E}_{\text{\boldmath$z$}}\Big{[}\big{(}\sigma^{\prime}(\langle\text{\boldmath$w$},\text{\boldmath$z$}\rangle)\big{)}^{2}\frac{\text{\boldmath$z$}\text{\boldmath$z$}^{\top}}{d}\Big{]}.

By definition, 𝑸(𝒘)\text{\boldmath$Q$}(\text{\boldmath$w$}) is p.s.d. for all 𝒘w. For any 𝒘w and unit vector 𝒖u,

𝒖𝑸(𝒘)𝒖\displaystyle\text{\boldmath$u$}^{\top}\text{\boldmath$Q$}(\text{\boldmath$w$})\text{\boldmath$u$} =1d𝔼𝒛[(σ(𝒘,𝒛))2𝒖,𝒛2]\displaystyle=\frac{1}{d}\mathbb{E}_{\text{\boldmath$z$}}\Big{[}\big{(}\sigma^{\prime}(\langle\text{\boldmath$w$},\text{\boldmath$z$}\rangle)\big{)}^{2}\langle\text{\boldmath$u$},\text{\boldmath$z$}\rangle^{2}\Big{]}
\displaystyle\leq\frac{1}{d}\Big{\{}\mathbb{E}_{\text{\boldmath$z$}}\Big{[}\big{(}\sigma^{\prime}(\langle\text{\boldmath$w$},\text{\boldmath$z$}\rangle)\big{)}^{4}\Big{]}\Big{\}}^{1/2}\cdot\Big{\{}\mathbb{E}_{\text{\boldmath$z$}}\Big{[}\langle\text{\boldmath$u$},\text{\boldmath$z$}\rangle^{4}\Big{]}\Big{\}}^{1/2}\leq\frac{C}{d}

where we used the Cauchy-Schwarz inequality and (123)–(124). This implies

sup𝒘=1𝑸(𝒘)opCd.\sup_{\|\text{\boldmath$w$}\|=1}\big{\|}\text{\boldmath$Q$}(\text{\boldmath$w$})\big{\|}_{\mathrm{op}}\leq\frac{C}{d}.

For any vector 𝜶=(α1,,αn)\text{\boldmath$\alpha$}=(\alpha_{1},\ldots,\alpha_{n})^{\top}, we derive

𝜶𝐇3𝜶\displaystyle\text{\boldmath$\alpha$}^{\top}\mathrm{\bf H}_{3}\text{\boldmath$\alpha$} =1d𝔼𝒘[i,jnαiαjσ(𝒙i,𝒘)σ(𝒙j,𝒘)𝒙i𝑸(𝒘)𝒙j]\displaystyle=\frac{1}{d}\mathbb{E}_{\text{\boldmath$w$}}\Big{[}\sum_{i,j\leq n}\alpha_{i}\alpha_{j}\sigma^{\prime}(\langle\text{\boldmath$x$}_{i},\text{\boldmath$w$}\rangle)\sigma^{\prime}(\langle\text{\boldmath$x$}_{j},\text{\boldmath$w$}\rangle)\text{\boldmath$x$}_{i}^{\top}\text{\boldmath$Q$}(\text{\boldmath$w$})\text{\boldmath$x$}_{j}\Big{]}
=1d𝔼𝒘[Tr(i,jnαiαjσ(𝒙i,𝒘)σ(𝒙j,𝒘)𝒙j𝒙i𝑸(𝒘))]\displaystyle=\frac{1}{d}\mathbb{E}_{\text{\boldmath$w$}}\Big{[}\mathrm{Tr}\Big{(}\sum_{i,j\leq n}\alpha_{i}\alpha_{j}\sigma^{\prime}(\langle\text{\boldmath$x$}_{i},\text{\boldmath$w$}\rangle)\sigma^{\prime}(\langle\text{\boldmath$x$}_{j},\text{\boldmath$w$}\rangle)\text{\boldmath$x$}_{j}\text{\boldmath$x$}_{i}^{\top}\text{\boldmath$Q$}(\text{\boldmath$w$})\Big{)}\Big{]}
1d𝔼𝒘[𝑸(𝒘)opTr(i,jnαiαjσ(𝒙i,𝒘)σ(𝒙j,𝒘)𝒙j𝒙i)]\displaystyle\leq\frac{1}{d}\mathbb{E}_{\text{\boldmath$w$}}\Big{[}\big{\|}\text{\boldmath$Q$}(\text{\boldmath$w$})\big{\|}_{\mathrm{op}}\cdot\mathrm{Tr}\Big{(}\sum_{i,j\leq n}\alpha_{i}\alpha_{j}\sigma^{\prime}(\langle\text{\boldmath$x$}_{i},\text{\boldmath$w$}\rangle)\sigma^{\prime}(\langle\text{\boldmath$x$}_{j},\text{\boldmath$w$}\rangle)\text{\boldmath$x$}_{j}\text{\boldmath$x$}_{i}^{\top}\Big{)}\Big{]}
Cd𝔼𝒘[i,jnαiαjσ(𝒙i,𝒘)σ(𝒙j,𝒘)𝒙i,𝒙jd]\displaystyle\leq\frac{C}{d}\mathbb{E}_{\text{\boldmath$w$}}\Big{[}\sum_{i,j\leq n}\alpha_{i}\alpha_{j}\sigma^{\prime}(\langle\text{\boldmath$x$}_{i},\text{\boldmath$w$}\rangle)\sigma^{\prime}(\langle\text{\boldmath$x$}_{j},\text{\boldmath$w$}\rangle)\frac{\langle\text{\boldmath$x$}_{i},\text{\boldmath$x$}_{j}\rangle}{d}\Big{]}
=Cd𝜶𝐊𝜶,\displaystyle=\frac{C}{d}\text{\boldmath$\alpha$}^{\top}\mathrm{\bf K}\text{\boldmath$\alpha$},

where we used Tr(𝐀1𝐀2)𝐀1opTr(𝐀2)\mathrm{Tr}(\mathrm{\bf A}_{1}\mathrm{\bf A}_{2})\leq\|\mathrm{\bf A}_{1}\|_{\mathrm{op}}\mathrm{Tr}(\mathrm{\bf A}_{2}) for p.s.d. matrices 𝐀1,𝐀2\mathrm{\bf A}_{1},\mathrm{\bf A}_{2}. This proves 𝐇3Cd1𝐊\mathrm{\bf H}_{3}\preceq Cd^{-1}\mathrm{\bf K}.
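The trace inequality used in the last display can likewise be checked on random p.s.d. matrices; this is only an illustration of the inequality, not of the specific matrices \mathrm{\bf H}_{3} and \mathrm{\bf K}:

import numpy as np

rng = np.random.default_rng(2)
B1, B2 = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
A1, A2 = B1 @ B1.T, B2 @ B2.T                        # two random p.s.d. matrices
lhs = np.trace(A1 @ A2)
rhs = np.linalg.eigvalsh(A1).max() * np.trace(A2)    # ||A1||_op * Tr(A2)
print(lhs <= rhs)                                    # True: Tr(A1 A2) = Tr(A2^{1/2} A1 A2^{1/2}) <= ||A1||_op Tr(A2)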

Step 3: Proving the “as a consequence” part. Now we derive bounds on δI12g,𝒉\delta I_{12}^{g,\text{\boldmath$h$}} and δI23𝒉\delta I_{23}^{\text{\boldmath$h$}}. By assumption, gL2\|g\|_{L^{2}} is bounded by a constant.

Also, by assumption 𝒉2/nC\|\text{\boldmath$h$}\|^{2}/n\leq C with high probability. Therefore

𝔼𝒘[(δI12g,𝒉)2]\displaystyle\mathbb{E}_{\text{\boldmath$w$}}\big{[}(\delta I_{12}^{g,\text{\boldmath$h$}})^{2}\big{]} 1Nd𝒉(λ𝐈n+𝐊)1𝐊(λ𝐈n+𝐊)1𝒉1Nd𝒉(λ𝐈n+𝐊)1𝒉\displaystyle\leq\frac{1}{Nd}\text{\boldmath$h$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathrm{\bf K}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$h$}\leq\frac{1}{Nd}\text{\boldmath$h$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$h$}
1cNd𝒉2CnNd.\displaystyle\leq\frac{1}{cNd}\|\text{\boldmath$h$}\|^{2}\leq\frac{Cn}{Nd}.

Similar bounds hold for 𝔼𝒘[(δI231𝒉)2]\mathbb{E}_{\text{\boldmath$w$}}\big{[}(\delta I_{231}^{\text{\boldmath$h$}})^{2}\big{]} and 𝔼𝒘[δI232𝒉]\mathbb{E}_{\text{\boldmath$w$}}\big{[}\delta I_{232}^{\text{\boldmath$h$}}\big{]}. Note that δI232𝒉\delta I_{232}^{\text{\boldmath$h$}} is always nonnegative. By Markov’s inequality, we have w.h.p.,

|δI12g,𝒉|CnlognNd,|δI231𝒉|CnlognNd,|δI232𝒉|CnlognNd.|\delta I_{12}^{g,\text{\boldmath$h$}}|\leq\sqrt{\frac{Cn\log n}{Nd}},\qquad|\delta I_{231}^{\text{\boldmath$h$}}|\leq\sqrt{\frac{Cn\log n}{Nd}},\qquad|\delta I_{232}^{\text{\boldmath$h$}}|\leq\frac{Cn\log n}{Nd}.

Combining the last two bounds then leads to the bound on |δI23𝒉||\delta I_{23}^{\text{\boldmath$h$}}|. ∎

B.5 Proof of Theorem 8.2

By homogeneity we can and will assume, without loss of generality, fL2=σε=1\|f\|_{L^{2}}=\sigma_{\varepsilon}=1. It is convenient to introduce the following matrix 𝐊(p,2)n×n\mathrm{\bf K}^{(p,2)}\in\mathbb{R}^{n\times n}

𝐊(p,2)=(𝔼𝒙[Kp(𝒙i,𝒙)Kp(𝒙,𝒙j)])i,jn.\mathrm{\bf K}^{(p,2)}=\big{(}\mathbb{E}_{\text{\boldmath$x$}}[K^{p}(\text{\boldmath$x$}_{i},\text{\boldmath$x$})K^{p}(\text{\boldmath$x$},\text{\boldmath$x$}_{j})]\big{)}_{i,j\leq n}\,. (125)

We further define 𝐊¯p:=𝐊p+γ>𝑰n\overline{\mathrm{\bf K}}^{p}:=\mathrm{\bf K}^{p}+\gamma_{>\ell}{\boldsymbol{I}}_{n} or, equivalently,

K¯pij\displaystyle\overline{K}^{p}_{ij} =k=0γkQk(d)(𝒙i,𝒙j)+γ>δij.\displaystyle=\sum_{k=0}^{\ell}\gamma_{k}Q_{k}^{(d)}(\langle\text{\boldmath$x$}_{i},\text{\boldmath$x$}_{j}\rangle)+\gamma_{>\ell}\delta_{ij}\,. (126)

We will use the following lemma.

Lemma 11.

Suppose that C1>0C_{1}>0 is a constant and that 𝐡1,𝐡2n\text{\boldmath$h$}_{1},\text{\boldmath$h$}_{2}\in\mathbb{R}^{n} are random vectors that satisfy max{𝐡1,𝐡2}C1n\max\{\|\text{\boldmath$h$}_{1}\|,\|\text{\boldmath$h$}_{2}\|\}\leq C_{1}\sqrt{n} with high probability. Then, there exists a constant C>0C^{\prime}>0 such that the following bounds hold with high probability.

|𝒉1(λ𝐈n+𝐊)1𝐊(2)(λ𝐈n+𝐊)1𝒉2𝒉1(λ𝐈n+𝐊¯p)1𝐊(p,2)(λ𝐈n+𝐊¯p)1𝒉2|C(logn)Cnd+1,\displaystyle\big{|}\text{\boldmath$h$}_{1}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathrm{\bf K}^{(2)}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$h$}_{2}-\text{\boldmath$h$}_{1}^{\top}(\lambda\mathrm{\bf I}_{n}+\overline{\mathrm{\bf K}}^{p})^{-1}\mathrm{\bf K}^{(p,2)}(\lambda\mathrm{\bf I}_{n}+\overline{\mathrm{\bf K}}^{p})^{-1}\text{\boldmath$h$}_{2}\big{|}\leq\sqrt{\frac{C^{\prime}(\log n)^{C^{\prime}}n}{d^{\ell+1}}}, (127)
|𝒉1(λ𝐈n+𝐊)1𝔼𝒙[𝐊(,𝒙)f(𝒙)]𝒉1(λ𝐈n+𝐊¯p)1𝔼𝒙[𝐊p(,𝒙)f(𝒙)]|C(logn)Cnd+1fL2\displaystyle\big{|}\text{\boldmath$h$}_{1}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathbb{E}_{\text{\boldmath$x$}}\big{[}\mathrm{\bf K}(\cdot,\text{\boldmath$x$})f(\text{\boldmath$x$})\big{]}-\text{\boldmath$h$}_{1}^{\top}(\lambda\mathrm{\bf I}_{n}+\overline{\mathrm{\bf K}}^{p})^{-1}\mathbb{E}_{\text{\boldmath$x$}}\big{[}\mathrm{\bf K}^{p}(\cdot,\text{\boldmath$x$})f(\text{\boldmath$x$})\big{]}\big{|}\leq\sqrt{\frac{C^{\prime}(\log n)^{C^{\prime}}n}{d^{\ell+1}}}\|f\|_{L^{2}} (128)

We first show that Theorem 8.2 follows from this lemma. After that, we will prove Lemma 11.

Proof of Theorem 8.2.

First we observe that

EBiasEBiasp\displaystyle E_{\mathrm{Bias}}-E_{\mathrm{Bias}}^{p} =2(𝒇(λ𝐈n+𝐊)1𝔼𝒙[𝐊(,𝒙)f(𝒙)]𝒇(λ𝐈n+𝐊¯p)1𝔼𝒙[𝐊p(,𝒙)f(𝒙)])\displaystyle=-2\Big{(}\text{\boldmath$f$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathbb{E}_{\text{\boldmath$x$}}\big{[}\mathrm{\bf K}(\cdot,\text{\boldmath$x$})f(\text{\boldmath$x$})\big{]}-\text{\boldmath$f$}^{\top}(\lambda\mathrm{\bf I}_{n}+\overline{\mathrm{\bf K}}^{p})^{-1}\mathbb{E}_{\text{\boldmath$x$}}\big{[}\mathrm{\bf K}^{p}(\cdot,\text{\boldmath$x$})f(\text{\boldmath$x$})\big{]}\Big{)}
+(𝒇(λ𝐈n+𝐊)1𝐊(2)(λ𝐈n+𝐊)1𝒇𝒇(λ𝐈n+𝐊¯p)1𝐊(p,2)(λ𝐈n+𝐊¯p)1𝒇).\displaystyle+\Big{(}\text{\boldmath$f$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathrm{\bf K}^{(2)}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$f$}-\text{\boldmath$f$}^{\top}(\lambda\mathrm{\bf I}_{n}+\overline{\mathrm{\bf K}}^{p})^{-1}\mathrm{\bf K}^{(p,2)}(\lambda\mathrm{\bf I}_{n}+\overline{\mathrm{\bf K}}^{p})^{-1}\text{\boldmath$f$}\Big{)}.

In Lemma 11 Eqs. (127) and (128), we set 𝒉1=𝒉2=𝒇\text{\boldmath$h$}_{1}=\text{\boldmath$h$}_{2}=\text{\boldmath$f$}, which yields |EBiasEBiasp|η|E_{\mathrm{Bias}}-E_{\mathrm{Bias}}^{p}|\leq\eta^{\prime}. For the variance term, we apply Lemma 11 Eq. (127) with 𝒉1=𝒉2=𝜺\text{\boldmath$h$}_{1}=\text{\boldmath$h$}_{2}=\text{\boldmath$\varepsilon$}, which yields |EVarEVarp|η|E_{\mathrm{Var}}-E_{\mathrm{Var}}^{p}|\leq\eta^{\prime}.

For the cross term, we observe that

ECrossECrossp\displaystyle E_{\text{\rm Cross}}-E_{\text{\rm Cross}}^{p} =(𝜺(λ𝐈n+𝐊)1𝔼𝒙[𝐊(,𝒙)f(𝒙)]𝜺(λ𝐈n+𝐊¯p)1𝔼𝒙[𝐊p(,𝒙)f(𝒙)])\displaystyle=\Big{(}\text{\boldmath$\varepsilon$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathbb{E}_{\text{\boldmath$x$}}\big{[}\mathrm{\bf K}(\cdot,\text{\boldmath$x$})f(\text{\boldmath$x$})\big{]}-\text{\boldmath$\varepsilon$}^{\top}(\lambda\mathrm{\bf I}_{n}+\overline{\mathrm{\bf K}}^{p})^{-1}\mathbb{E}_{\text{\boldmath$x$}}\big{[}\mathrm{\bf K}^{p}(\cdot,\text{\boldmath$x$})f(\text{\boldmath$x$})\big{]}\Big{)}
(𝜺(λ𝐈n+𝐊)1𝐊(2)(λ𝐈n+𝐊)1𝒇𝜺(λ𝐈n+𝐊¯p)1𝐊(p,2)(λ𝐈n+𝐊¯p)1𝒇).\displaystyle-\Big{(}\text{\boldmath$\varepsilon$}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathrm{\bf K}^{(2)}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$f$}-\text{\boldmath$\varepsilon$}^{\top}(\lambda\mathrm{\bf I}_{n}+\overline{\mathrm{\bf K}}^{p})^{-1}\mathrm{\bf K}^{(p,2)}(\lambda\mathrm{\bf I}_{n}+\overline{\mathrm{\bf K}}^{p})^{-1}\text{\boldmath$f$}\Big{)}.

We apply Lemma 11 with 𝒉1=𝜺\text{\boldmath$h$}_{1}=\text{\boldmath$\varepsilon$} and 𝒉2=𝒇\text{\boldmath$h$}_{2}=\text{\boldmath$f$}. This leads to |ECrossECrossp|η|E_{\text{\rm Cross}}-E_{\text{\rm Cross}}^{p}|\leq\eta^{\prime}, which completes the proof. ∎

Proof of Lemma 11.

Define differences

δI1\displaystyle\delta I_{1}^{\prime} =𝒉1(λ𝐈n+𝐊¯p)1(𝐊(2)𝐊(p,2))(λ𝐈n+𝐊¯p)1𝒉2\displaystyle=\text{\boldmath$h$}_{1}^{\top}(\lambda\mathrm{\bf I}_{n}+\overline{\mathrm{\bf K}}^{p})^{-1}\big{(}\mathrm{\bf K}^{(2)}-\mathrm{\bf K}^{(p,2)})\big{(}\lambda\mathrm{\bf I}_{n}+\overline{\mathrm{\bf K}}^{p})^{-1}\text{\boldmath$h$}_{2}
δI2\displaystyle\delta I_{2}^{\prime} =𝒉1(λ𝐈n+𝐊)1𝐊(2)(λ𝐈n+𝐊)1𝒉2𝒉1(λ𝐈n+𝐊¯p)1𝐊(2)(λ𝐈n+𝐊¯p)1𝒉2,\displaystyle=\text{\boldmath$h$}_{1}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathrm{\bf K}^{(2)}\big{(}\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$h$}_{2}-\text{\boldmath$h$}_{1}^{\top}(\lambda\mathrm{\bf I}_{n}+\overline{\mathrm{\bf K}}^{p})^{-1}\mathrm{\bf K}^{(2)}\big{(}\lambda\mathrm{\bf I}_{n}+\overline{\mathrm{\bf K}}^{p})^{-1}\text{\boldmath$h$}_{2},
δI3\displaystyle\delta I_{3}^{\prime} =𝒉1(λ𝐈n+𝐊¯p)1𝔼𝒙[(𝐊(,𝒙)𝐊p(,𝒙))f(𝒙)],\displaystyle=\text{\boldmath$h$}_{1}^{\top}(\lambda\mathrm{\bf I}_{n}+\overline{\mathrm{\bf K}}^{p})^{-1}\mathbb{E}_{\text{\boldmath$x$}}\big{[}(\mathrm{\bf K}(\cdot,\text{\boldmath$x$})-\mathrm{\bf K}^{p}(\cdot,\text{\boldmath$x$}))f(\text{\boldmath$x$})\big{]},
δI4\displaystyle\delta I_{4}^{\prime} =𝒉1[(λ𝐈n+𝐊)1(λ𝐈n+𝐊¯p)1]𝔼𝒙[𝐊(,𝒙)f(𝒙)].\displaystyle=\text{\boldmath$h$}_{1}^{\top}\big{[}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}-(\lambda\mathrm{\bf I}_{n}+\overline{\mathrm{\bf K}}^{p})^{-1}\big{]}\mathbb{E}_{\text{\boldmath$x$}}\big{[}\mathrm{\bf K}(\cdot,\text{\boldmath$x$})f(\text{\boldmath$x$})\big{]}.

We have

𝒉1(λ𝐈n+𝐊)1𝐊(2)(λ𝐈n+𝐊)1𝒉2𝒉1(λ𝐈n+𝐊¯p)1𝐊(p,2)(λ𝐈n+𝐊¯p)1𝒉2=δI1+δI2,\displaystyle\text{\boldmath$h$}_{1}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathrm{\bf K}^{(2)}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\text{\boldmath$h$}_{2}-\text{\boldmath$h$}_{1}^{\top}(\lambda\mathrm{\bf I}_{n}+\overline{\mathrm{\bf K}}^{p})^{-1}\mathrm{\bf K}^{(p,2)}(\lambda\mathrm{\bf I}_{n}+\overline{\mathrm{\bf K}}^{p})^{-1}\text{\boldmath$h$}_{2}=\delta I_{1}^{\prime}+\delta I_{2}^{\prime},
𝒉1(λ𝐈n+𝐊)1𝔼𝒙[𝐊(,𝒙)f(𝒙)]𝒉1(λ𝐈n+𝐊¯p)1𝔼𝒙[𝐊p(,𝒙)f(𝒙)]=δI3+δI4.\displaystyle\text{\boldmath$h$}_{1}^{\top}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathbb{E}_{\text{\boldmath$x$}}\big{[}\mathrm{\bf K}(\cdot,\text{\boldmath$x$})f(\text{\boldmath$x$})\big{]}-\text{\boldmath$h$}_{1}^{\top}(\lambda\mathrm{\bf I}_{n}+\overline{\mathrm{\bf K}}^{p})^{-1}\mathbb{E}_{\text{\boldmath$x$}}\big{[}\mathrm{\bf K}^{p}(\cdot,\text{\boldmath$x$})f(\text{\boldmath$x$})\big{]}=\delta I_{3}^{\prime}+\delta I_{4}^{\prime}.

We observe that

K¯pij\displaystyle\overline{K}^{p}_{ij} =k=0γkQk(d)(𝒙i,𝒙j)+γ>δij,\displaystyle=\sum_{k=0}^{\ell}\gamma_{k}Q_{k}^{(d)}(\langle\text{\boldmath$x$}_{i},\text{\boldmath$x$}_{j}\rangle)+\gamma_{>\ell}\delta_{ij},
\displaystyle K^{(p,2)}_{ij}=\sum_{k,m=0}^{\ell}\gamma_{k}\gamma_{m}\mathbb{E}_{\text{\boldmath$x$}}\big{[}Q_{k}^{(d)}(\langle\text{\boldmath$x$}_{i},\text{\boldmath$x$}\rangle)Q_{m}^{(d)}(\langle\text{\boldmath$x$},\text{\boldmath$x$}_{j}\rangle)\big{]}=\sum_{k=0}^{\ell}\frac{\gamma_{k}^{2}}{B(d,k)}Q_{k}^{(d)}(\langle\text{\boldmath$x$}_{i},\text{\boldmath$x$}_{j}\rangle).

It follows that

𝐊𝐊¯p=k>γk(𝑸k𝐈n),𝐊(2)𝐊(p,2)=k>γk2B(d,k)𝑸k,\displaystyle\mathrm{\bf K}-\overline{\mathrm{\bf K}}^{p}=\sum_{k>\ell}\gamma_{k}(\text{\boldmath$Q$}_{k}-\mathrm{\bf I}_{n}),\qquad\mathrm{\bf K}^{(2)}-\mathrm{\bf K}^{(p,2)}=\sum_{k>\ell}\frac{\gamma_{k}^{2}}{B(d,k)}\text{\boldmath$Q$}_{k}, (129)
𝔼𝒙[(𝐊(,𝒙)𝐊p(,𝒙))(𝐊(,𝒙)𝐊p(,𝒙))]=k>γk2B(d,k)𝑸k.\displaystyle\mathbb{E}_{\text{\boldmath$x$}}\big{[}(\mathrm{\bf K}(\cdot,\text{\boldmath$x$})-\mathrm{\bf K}^{p}(\cdot,\text{\boldmath$x$}))(\mathrm{\bf K}(\cdot,\text{\boldmath$x$})-\mathrm{\bf K}^{p}(\cdot,\text{\boldmath$x$}))^{\top}\big{]}=\sum_{k>\ell}\frac{\gamma_{k}^{2}}{B(d,k)}\text{\boldmath$Q$}_{k}. (130)

Therefore, with high probability,

𝐊𝐊¯pop\displaystyle\big{\|}\mathrm{\bf K}-\overline{\mathrm{\bf K}}^{p}\big{\|}_{\mathrm{op}} k>γk(𝑸k𝐈n)opγ>supk>𝑸k𝐈nop(i)C(logn)Cnd+1,\displaystyle\leq\Big{\|}\sum_{k>\ell}\gamma_{k}(\text{\boldmath$Q$}_{k}-\mathrm{\bf I}_{n})\Big{\|}_{\mathrm{op}}\leq\gamma_{>\ell}\sup_{k>\ell}\big{\|}\text{\boldmath$Q$}_{k}-\mathrm{\bf I}_{n}\big{\|}_{\mathrm{op}}\stackrel{{\scriptstyle(i)}}{{\leq}}\sqrt{\frac{C(\log n)^{C}n}{d^{\ell+1}}}, (131)
𝐊(2)𝐊(p,2)op\displaystyle\big{\|}\mathrm{\bf K}^{(2)}-\mathrm{\bf K}^{(p,2)}\big{\|}_{\mathrm{op}} =k>γk2B(d,k)𝑸kopk>γk2Cd+1supk>𝑸kop\displaystyle=\Big{\|}\sum_{k>\ell}\frac{\gamma_{k}^{2}}{B(d,k)}\text{\boldmath$Q$}_{k}\Big{\|}_{\mathrm{op}}\leq\sum_{k>\ell}\gamma_{k}^{2}\cdot\frac{C}{d^{\ell+1}}\sup_{k>\ell}\big{\|}\text{\boldmath$Q$}_{k}\big{\|}_{\mathrm{op}} (132)
(ii)γ>2Cd+1(1+od,(1))Cd+1,\displaystyle\stackrel{{\scriptstyle(ii)}}{{\leq}}\gamma_{>\ell}^{2}\cdot\frac{C}{d^{\ell+1}}\cdot(1+o_{d,\mathbb{P}}(1))\leq\frac{C}{d^{\ell+1}}, (133)

where in (i), (ii) we used Proposition E.1. Also, note that 𝐊¯pγ>𝐈n\overline{\mathrm{\bf K}}^{p}\succeq\gamma_{>\ell}\mathrm{\bf I}_{n}, so we must have (λ𝐈n+𝐊¯p)1opγ>1\|(\lambda\mathrm{\bf I}_{n}+\overline{\mathrm{\bf K}}^{p})^{-1}\|_{\mathrm{op}}\leq\gamma_{>\ell}^{-1}. Thus, w.h.p.,

|δI1|𝒉1𝒉2(λ𝐈n+𝐊¯p)1op2𝐊(2)𝐊(p,2)opCnd+1.\big{|}\delta I_{1}^{\prime}\big{|}\leq\|\text{\boldmath$h$}_{1}\|\cdot\|\text{\boldmath$h$}_{2}\|\cdot\big{\|}(\lambda\mathrm{\bf I}_{n}+\overline{\mathrm{\bf K}}^{p})^{-1}\big{\|}_{\mathrm{op}}^{2}\cdot\big{\|}\mathrm{\bf K}^{(2)}-\mathrm{\bf K}^{(p,2)}\big{\|}_{\mathrm{op}}\leq\frac{Cn}{d^{\ell+1}}.

In the identity (113), we set 𝐀0=λ𝐈n+𝐊¯p\mathrm{\bf A}_{0}=\lambda\mathrm{\bf I}_{n}+\overline{\mathrm{\bf K}}^{p}, 𝐀=λ𝐈n+𝐊\mathrm{\bf A}=\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K}, and 𝐇=𝐊(2)\mathrm{\bf H}=\mathrm{\bf K}^{(2)}. It holds that w.h.p., 𝐀01op,𝐀1opC\|\mathrm{\bf A}_{0}^{-1}\|_{\mathrm{op}},\|\mathrm{\bf A}^{-1}\|_{\mathrm{op}}\leq C, 𝐀𝐀0opC(logn)Cn/d+1\|\mathrm{\bf A}-\mathrm{\bf A}_{0}\|_{\mathrm{op}}\leq\sqrt{C(\log n)^{C}n/d^{\ell+1}}. Thus, w.h.p.,

|δI2|\displaystyle\big{|}\delta I_{2}^{\prime}\big{|} C𝒉1𝒉2C(logn)Cnd+1(λ𝐈n+𝐊)1𝐊(2)(λ𝐈n+𝐊)1opCn(logn)Cnd+1Cn\displaystyle\leq C\|\text{\boldmath$h$}_{1}\|\cdot\|\text{\boldmath$h$}_{2}\|\cdot\sqrt{\frac{C(\log n)^{C}n}{d^{\ell+1}}}\cdot\big{\|}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathrm{\bf K}^{(2)}\big{(}\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\big{\|}_{\mathrm{op}}\leq Cn\cdot\sqrt{\frac{(\log n)^{C}n}{d^{\ell+1}}}\cdot\frac{C}{n}
C(logn)Cnd+1\displaystyle\leq\sqrt{\frac{C(\log n)^{C}n}{d^{\ell+1}}}

where we used Lemma 4 Eq. (74). Combining the upper bounds on |δI1||\delta I_{1}^{\prime}| and |δI2||\delta I_{2}^{\prime}|, we arrive at the first inequality in the lemma.

Next, we bound |δI3||\delta I_{3}^{\prime}|. With high probability,

\displaystyle\stackrel{{\scriptstyle(i)}}{{\leq}}\mathbb{E}_{\text{\boldmath$x$}}\Big{[}\big{(}\text{\boldmath$h$}_{1}^{\top}(\lambda\mathrm{\bf I}_{n}+\overline{\mathrm{\bf K}}^{p})^{-1}\big{(}\mathrm{\bf K}(\cdot,\text{\boldmath$x$})-\mathrm{\bf K}^{p}(\cdot,\text{\boldmath$x$})\big{)}\big{)}^{2}\Big{]}\cdot\mathbb{E}_{\text{\boldmath$x$}}\big{[}(f(\text{\boldmath$x$}))^{2}\big{]}
\displaystyle\leq\text{\boldmath$h$}_{1}^{\top}(\lambda\mathrm{\bf I}_{n}+\overline{\mathrm{\bf K}}^{p})^{-1}\mathbb{E}_{\text{\boldmath$x$}}\big{[}\big{(}\mathrm{\bf K}(\cdot,\text{\boldmath$x$})-\mathrm{\bf K}^{p}(\cdot,\text{\boldmath$x$})\big{)}\big{(}\mathrm{\bf K}(\cdot,\text{\boldmath$x$})-\mathrm{\bf K}^{p}(\cdot,\text{\boldmath$x$})\big{)}^{\top}\big{]}(\lambda\mathrm{\bf I}_{n}+\overline{\mathrm{\bf K}}^{p})^{-1}\text{\boldmath$h$}_{1}\cdot\|f\|_{L^{2}}^{2}
(ii)k>Cγk2B(d,k)𝒉1(λ𝐈n+𝐊¯p)1𝑸k(λ𝐈n+𝐊¯p)1𝒉1\displaystyle\stackrel{{\scriptstyle(ii)}}{{\leq}}\sum_{k>\ell}\frac{C\gamma_{k}^{2}}{B(d,k)}\cdot\text{\boldmath$h$}_{1}^{\top}(\lambda\mathrm{\bf I}_{n}+\overline{\mathrm{\bf K}}^{p})^{-1}\text{\boldmath$Q$}_{k}(\lambda\mathrm{\bf I}_{n}+\overline{\mathrm{\bf K}}^{p})^{-1}\text{\boldmath$h$}_{1}
Cd+1k>γk2𝒉12(λ𝐈n+𝐊¯p)12op𝑸kop\displaystyle\leq\frac{C}{d^{\ell+1}}\cdot\sum_{k>\ell}\gamma_{k}^{2}\cdot\|\text{\boldmath$h$}_{1}\|^{2}\cdot\|(\lambda\mathrm{\bf I}_{n}+\overline{\mathrm{\bf K}}^{p}\big{)}^{-1}\|^{2}_{\mathrm{op}}\cdot\|\text{\boldmath$Q$}_{k}\|_{\mathrm{op}}
(iii)Cnd+1γ>2supk>𝑸kop(iv)Cnd+1\displaystyle\stackrel{{\scriptstyle(iii)}}{{\leq}}\frac{Cn}{d^{\ell+1}}\cdot\gamma_{>\ell}^{2}\cdot\sup_{k>\ell}\|\text{\boldmath$Q$}_{k}\|_{\mathrm{op}}\stackrel{{\scriptstyle(iv)}}{{\leq}}\frac{Cn}{d^{\ell+1}}

where (i) follows from the Cauchy-Schwarz inequality, (ii) follows from (130), (iii) is because \|\text{\boldmath$h$}_{1}\|\leq C\sqrt{n} w.h.p. by assumption and \|(\lambda\mathrm{\bf I}_{n}+\overline{\mathrm{\bf K}}^{p})^{-1}\|_{\mathrm{op}}\leq\gamma_{>\ell}^{-1}, and (iv) follows from Proposition E.1. Finally,

|δI4|\displaystyle|\delta I_{4}^{\prime}| 𝒉1(λ𝐈n+𝐊¯p)1𝐊¯p𝐊op(λ𝐈n+𝐊)1𝔼𝒙[𝐊(,𝒙)f(𝒙)]op\displaystyle\leq\big{\|}\text{\boldmath$h$}_{1}^{\top}(\lambda\mathrm{\bf I}_{n}+\overline{\mathrm{\bf K}}^{p})^{-1}\big{\|}\cdot\|\overline{\mathrm{\bf K}}^{p}-\mathrm{\bf K}\|_{\mathrm{op}}\cdot\big{\|}(\lambda\mathrm{\bf I}_{n}+\mathrm{\bf K})^{-1}\mathbb{E}_{\text{\boldmath$x$}}\big{[}\mathrm{\bf K}(\cdot,\text{\boldmath$x$})f(\text{\boldmath$x$})\big{]}\big{\|}_{\mathrm{op}}
(i)CnC(logn)Cnd+1Cn(ii)C(logn)Cnd+1\displaystyle\stackrel{{\scriptstyle(i)}}{{\leq}}C\sqrt{n}\cdot\sqrt{\frac{C(\log n)^{C}n}{d^{\ell+1}}}\cdot\frac{C}{\sqrt{n}}\stackrel{{\scriptstyle(ii)}}{{\leq}}\sqrt{\frac{C(\log n)^{C}n}{d^{\ell+1}}}

where in (i) we used (λ𝐈n+𝐊¯p)1opγ>1\|(\lambda\mathrm{\bf I}_{n}+\overline{\mathrm{\bf K}}^{p})^{-1}\|_{\mathrm{op}}\leq\gamma_{>\ell}^{-1}, Eq. (131), and Lemma 4 Eq. (75). Combining the upper bounds on |δI3||\delta I_{3}^{\prime}| and |δI4||\delta I_{4}^{\prime}|, we arrive at the second inequality of this lemma. ∎

Appendix C Generalization error: improved analysis for =1\ell=1

We will show that for the case =1\ell=1, we can relax the condition

C0(logd)C0dnd2C0(logd)C0toc0dnd2C0(logd)C0.C_{0}(\log d)^{C_{0}}d\leq n\leq\frac{d^{2}}{C_{0}(\log d)^{C_{0}}}\qquad\text{to}\qquad c_{0}d\leq n\leq\frac{d^{2}}{C_{0}(\log d)^{C_{0}}}.

The proof of the generalization error under this relaxed condition follows mostly the proof in Section B.2, with modifications we show in this subsection.

Throughout, we suppose that =1\ell=1 and correspondingly D=d+1D=d+1. The matrix 𝚿n×D\text{\boldmath$\Psi$}_{\leq\ell}\in\mathbb{R}^{n\times D} is given by

𝚿=(𝟏n,𝑿).\text{\boldmath$\Psi$}_{\leq\ell}=\big{(}\mathrm{\bf 1}_{n},\text{\boldmath$X$}\big{)}.

An important difference under the relaxed condition is that n^{-1}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}\text{\boldmath$\Psi$}_{\leq\ell} does not necessarily converge to \mathrm{\bf I}_{D} (cf. Eq. 51). (In fact, if n,d satisfy n/d\to\kappa, the spectrum of n^{-1}\text{\boldmath$X$}^{\top}\text{\boldmath$X$} is characterized by the Marchenko-Pastur distribution.) Nevertheless, if n\geq Cd where C is sufficiently large, we have good control of n^{-1}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}\text{\boldmath$\Psi$}_{\leq\ell}.
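
The following Python sketch illustrates this remark, with Gaussian rows standing in for the spherical covariates: at a fixed aspect ratio n/d=κ the spectrum of n^{-1}X^⊤X spreads over the Marchenko-Pastur bulk [(1-κ^{-1/2})², (1+κ^{-1/2})²] rather than concentrating at 1.

import numpy as np

rng = np.random.default_rng(0)
d, kappa = 400, 4
n = kappa * d
X = rng.standard_normal((n, d))                  # Gaussian rows as a proxy for the spherical covariates
ev = np.linalg.eigvalsh(X.T @ X / n)             # spectrum of n^{-1} X^T X
print(ev.min(), (1 - kappa ** -0.5) ** 2)        # both close to 0.25: lower Marchenko-Pastur edge
print(ev.max(), (1 + kappa ** -0.5) ** 2)        # both close to 2.25: upper Marchenko-Pastur edge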

Lemma 12.

For any constant δ(0,0.1)\delta\in(0,0.1), there exists certain Cδ>1C_{\delta}>1 such that the following holds. If nCδdn\geq C_{\delta}d, then with very high probability,

1n𝚿𝚿𝐈Dopδ.\Big{\|}\frac{1}{n}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}\text{\boldmath$\Psi$}_{\leq\ell}-\mathrm{\bf I}_{D}\Big{\|}_{\mathrm{op}}\leq\delta. (134)

Fix the constant δ=0.01\delta=0.01. Let Cδ>1C_{\delta}>1 be the constant in Lemma 12. First, we establish some useful results under the condition

Cδdnd2C(logd)C,nNd(log(Nd))C,\displaystyle C_{\delta}d\leq n\leq\frac{d^{2}}{C(\log d)^{C}},\qquad n\leq\frac{Nd}{(\log(Nd))^{C}}\,, (135)

for a sufficiently large C>0 such that the following inequalities hold by Lemmas 2 and 12 and Theorem 3.2: in the decomposition \mathrm{\bf K}=\gamma_{>\ell}\mathrm{\bf I}_{n}+\text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{2}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}+\text{\boldmath$\Delta$}\,, we have

\big{\|}\text{\boldmath$\Delta$}\big{\|}_{\mathrm{op}}=\breve{o}_{d,\mathbb{P}}(1),\quad\big{\|}n^{-1}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}\text{\boldmath$\Psi$}_{\leq\ell}-\mathrm{\bf I}_{D}\big{\|}_{\mathrm{op}}\leq\delta+\breve{o}_{d,\mathbb{P}}(1),\quad\big{\|}\mathrm{\bf K}^{-1/2}(\mathrm{\bf K}_{N}-\mathrm{\bf K})\mathrm{\bf K}^{-1/2}\big{\|}_{\mathrm{op}}=\breve{o}_{d,\mathbb{P}}(1)\,. (136)

We strengthen Lemma 6 for the case =1\ell=1.

Lemma 13 (Kernel eigenvalue structure: case =1\ell=1).

Assume that Eq. (135) holds and set =1\ell=1 and D=d+1D=d+1. Then the eigen-decomposition of 𝐊γ>𝐈n\mathrm{\bf K}-\gamma_{>\ell}\mathrm{\bf I}_{n} and 𝐊Nγ>𝐈n\mathrm{\bf K}_{N}-\gamma_{>\ell}\mathrm{\bf I}_{n} can be written in the following form:

𝐊γ>𝐈n=𝐔𝐃𝐔+𝚫(res),\displaystyle\mathrm{\bf K}-\gamma_{>\ell}\mathrm{\bf I}_{n}=\mathrm{\bf U}\mathrm{\bf D}\mathrm{\bf U}^{\top}+\text{\boldmath$\Delta$}^{(\mathrm{res})}, (137)
𝐊Nγ>𝐈n=𝐔N𝐃N𝐔N+𝚫(res)N,\displaystyle\mathrm{\bf K}_{N}-\gamma_{>\ell}\mathrm{\bf I}_{n}=\mathrm{\bf U}_{N}\mathrm{\bf D}_{N}\mathrm{\bf U}_{N}^{\top}+\text{\boldmath$\Delta$}^{(\mathrm{res})}_{N}, (138)

where 𝐃,𝐃ND×D\mathrm{\bf D},\mathrm{\bf D}_{N}\in\mathbb{R}^{D\times D} are diagonal matrices that contain DD eigenvalues of 𝐊,𝐊N\mathrm{\bf K},\mathrm{\bf K}_{N} respectively, columns of 𝐔,𝐔Nn×D\mathrm{\bf U},\mathrm{\bf U}_{N}\in\mathbb{R}^{n\times D} are the corresponding eigenvectors and 𝚫(res),𝚫N(res)\text{\boldmath$\Delta$}^{(\mathrm{res})},\text{\boldmath$\Delta$}_{N}^{(\mathrm{res})} correspond to the other eigenvectors (in particular 𝚫(res)𝐔=𝚫N(res)𝐔N=𝟎\text{\boldmath$\Delta$}^{(\mathrm{res})}\mathrm{\bf U}=\text{\boldmath$\Delta$}_{N}^{(\mathrm{res})}\mathrm{\bf U}_{N}=\mathrm{\bf 0}).

Further, define

𝐃=diag(γ0n,γ1nd,,γ1ndd).\mathrm{\bf D}^{*}=\mathrm{diag}\Big{(}\gamma_{0}n,\underbrace{\frac{\gamma_{1}n}{d},\ldots,\frac{\gamma_{1}n}{d}}_{d}\Big{)}.

Then the eigenvalues have the following structure:

(12δo˘d,(1))𝐃o˘d,(1)𝐃(1+2δ+o˘d,(1))𝐃+o˘d,(1),\displaystyle(1-2\delta-\breve{o}_{d,\mathbb{P}}(1))\cdot\mathrm{\bf D}^{*}-\breve{o}_{d,\mathbb{P}}(1)\leq\mathrm{\bf D}\leq(1+2\delta+\breve{o}_{d,\mathbb{P}}(1))\cdot\mathrm{\bf D}^{*}+\breve{o}_{d,\mathbb{P}}(1), (139)
(12δo˘d,(1))𝐃o˘d,(1)𝐃N(1+2δ+o˘d,(1))𝐃+o˘d,(1).\displaystyle(1-2\delta-\breve{o}_{d,\mathbb{P}}(1))\cdot\mathrm{\bf D}^{*}-\breve{o}_{d,\mathbb{P}}(1)\leq\mathrm{\bf D}_{N}\leq(1+2\delta+\breve{o}_{d,\mathbb{P}}(1))\cdot\mathrm{\bf D}^{*}+\breve{o}_{d,\mathbb{P}}(1). (140)

Here, \leq denotes entrywise comparisons (𝐀𝐀\mathrm{\bf A}\leq\mathrm{\bf A}^{\prime} if and only if AijAijA_{ij}\leq A^{\prime}_{ij} for all i,ji,j). Moreover, the remaining components satisfy 𝚫(res)op=o˘d,(1)\|\text{\boldmath$\Delta$}^{(\mathrm{res})}\|_{\mathrm{op}}=\breve{o}_{d,\mathbb{P}}(1) and 𝚫(res)Nop=o˘d,(1)\|\text{\boldmath$\Delta$}^{(\mathrm{res})}_{N}\|_{\mathrm{op}}=\breve{o}_{d,\mathbb{P}}(1).
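
As an illustrative check of this eigenvalue structure (not part of the proof), one can form the idealized ℓ=1 kernel γ_0 1_n 1_n^⊤ + (γ_1/d) XX^⊤ + γ_{>1} I_n, i.e. the decomposition with the Δ term dropped, and inspect its spectrum; the values of γ_0, γ_1, γ_{>1} below are arbitrary stand-ins:

import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 50
g0, g1, gtail = 1.0, 1.0, 0.5                    # arbitrary stand-ins for gamma_0, gamma_1, gamma_{>1}
Z = rng.standard_normal((n, d))
X = np.sqrt(d) * Z / np.linalg.norm(Z, axis=1, keepdims=True)       # rows uniform on the sphere of radius sqrt(d)
K = g0 * np.ones((n, n)) + (g1 / d) * X @ X.T + gtail * np.eye(n)   # idealized l=1 kernel (Delta dropped)
ev = np.sort(np.linalg.eigvalsh(K))[::-1]
print(ev[0] / (g0 * n))                           # ~ 1: one eigenvalue of size gamma_0 * n
print(ev[1] / (g1 * n / d), ev[d] / (g1 * n / d)) # d eigenvalues of size gamma_1 * n / d, up to roughly a (1 -/+ sqrt(d/n))^2 factor
print(ev[d + 1], gtail)                           # the remaining eigenvalues equal gamma_{>1}: the low-degree part has rank d+1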

Proof of Lemma 13.

In the proof of Lemma 6, instead of claiming (93), we will show that the following modified claim holds.

(12δo˘d,(1))𝐃𝝀(𝑸𝚿𝚲2𝚿𝑸)(1+2δ+o˘d,(1))𝐃.(1-2\delta-\breve{o}_{d,\mathbb{P}}(1))\cdot\mathrm{\bf D}^{*}\leq\text{\boldmath$\lambda$}(\text{\boldmath$Q$}^{\top}\text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{2}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}\text{\boldmath$Q$})\leq(1+2\delta+\breve{o}_{d,\mathbb{P}}(1))\cdot\mathrm{\bf D}^{*}.

To prove this claim, we note that Lemma 12 implies that n^{-1/2}\|\text{\boldmath$\Psi$}_{\leq\ell}\|_{\mathrm{op}}\leq\sqrt{1+\delta}\leq 1+\delta and n^{-1/2}\sigma_{\min}(\text{\boldmath$\Psi$}_{\leq\ell})\geq\sqrt{1-\delta}\geq 1-\delta with very high probability. Following the notation in the proof of Lemma 6, we have

λk(𝑸𝚿𝚲2𝚿𝑸)\displaystyle\lambda_{k}(\text{\boldmath$Q$}^{\top}\text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{2}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}\text{\boldmath$Q$}) =max𝒱:dim(𝒱)=kmin𝟎𝒖𝒱𝒖𝑸𝚿𝚲2𝚿𝑸𝒖𝒖2\displaystyle=\max_{\mathcal{V}:\dim(\mathcal{V})=k}\min_{\mathrm{\bf 0}\neq\text{\boldmath$u$}\in\mathcal{V}}\frac{\text{\boldmath$u$}^{\top}\text{\boldmath$Q$}^{\top}\text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{2}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}\text{\boldmath$Q$}\text{\boldmath$u$}}{\|\text{\boldmath$u$}\|^{2}}
(12δo˘d,(1))min𝟎𝒖𝒱1𝒖𝑸𝚿𝚲2𝚿𝑸𝒖𝚿𝑸𝒖2/n\displaystyle\geq(1-2\delta-\breve{o}_{d,\mathbb{P}}(1))\min_{\mathrm{\bf 0}\neq\text{\boldmath$u$}\in\mathcal{V}_{1}^{\prime}}\frac{\text{\boldmath$u$}^{\top}\text{\boldmath$Q$}^{\top}\text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{2}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}\text{\boldmath$Q$}\text{\boldmath$u$}}{\|\text{\boldmath$\Psi$}_{\leq\ell}^{\top}\text{\boldmath$Q$}\text{\boldmath$u$}\|^{2}/n}
(12δo˘d,(1))min𝟎𝒙𝒱1𝒙𝚲2𝒙𝒙2/n\displaystyle\geq(1-2\delta-\breve{o}_{d,\mathbb{P}}(1))\min_{\mathrm{\bf 0}\neq\text{\boldmath$x$}\in\mathcal{V}_{1}}\frac{\text{\boldmath$x$}^{\top}\text{\boldmath$\Lambda$}_{\leq\ell}^{2}\text{\boldmath$x$}}{\|\text{\boldmath$x$}\|^{2}/n}
n(12δo˘d,(1))λk(𝚲2).\displaystyle\geq n(1-2\delta-\breve{o}_{d,\mathbb{P}}(1))\cdot\lambda_{k}(\text{\boldmath$\Lambda$}_{\leq\ell}^{2}).

Similarly,

λk(𝑸𝚿𝚲2𝚿𝑸)\displaystyle\lambda_{k}(\text{\boldmath$Q$}^{\top}\text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{2}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}\text{\boldmath$Q$}) =min𝒱:dim(𝒱)=Dk+1max𝒖𝒱𝒖𝑸𝚿𝚲2𝚿𝑸𝒖𝒖2\displaystyle=\min_{\mathcal{V}:\dim(\mathcal{V})=D-k+1}\max_{\text{\boldmath$u$}\in\mathcal{V}}\frac{\text{\boldmath$u$}^{\top}\text{\boldmath$Q$}^{\top}\text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{2}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}\text{\boldmath$Q$}\text{\boldmath$u$}}{\|\text{\boldmath$u$}\|^{2}}
(1+2δ+o˘d,(1))max𝒖𝒱2𝒖𝑸𝚿𝚲2𝚿𝑸𝒖𝚿𝑸𝒖2/n\displaystyle\leq(1+2\delta+\breve{o}_{d,\mathbb{P}}(1))\max_{\text{\boldmath$u$}\in\mathcal{V}_{2}^{\prime}}\frac{\text{\boldmath$u$}^{\top}\text{\boldmath$Q$}^{\top}\text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{2}\text{\boldmath$\Psi$}_{\leq\ell}^{\top}\text{\boldmath$Q$}\text{\boldmath$u$}}{\|\text{\boldmath$\Psi$}_{\leq\ell}^{\top}\text{\boldmath$Q$}\text{\boldmath$u$}\|^{2}/n}
(1+2δ+o˘d,(1))max𝒙𝒱2𝒙𝚲2𝒙𝒙2/n\displaystyle\leq(1+2\delta+\breve{o}_{d,\mathbb{P}}(1))\max_{\text{\boldmath$x$}\in\mathcal{V}_{2}}\frac{\text{\boldmath$x$}^{\top}\text{\boldmath$\Lambda$}_{\leq\ell}^{2}\text{\boldmath$x$}}{\|\text{\boldmath$x$}\|^{2}/n}
n(1+2δ+o˘d,(1))λk(𝚲2).\displaystyle\leq n(1+2\delta+\breve{o}_{d,\mathbb{P}}(1))\cdot\lambda_{k}(\text{\boldmath$\Lambda$}_{\leq\ell}^{2}).

We omit the rest of the proof, as it follows closely the proof of Lemma 6. ∎
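
The two-sided bound derived in this proof is an Ostrowski-type sandwich: for Ψ_{≤ℓ} with full column rank and Λ²_{≤ℓ} ⪰ 0, the top D eigenvalues of Ψ_{≤ℓ}Λ²_{≤ℓ}Ψ_{≤ℓ}^⊤ lie between σ_min(Ψ_{≤ℓ})² λ_k(Λ²_{≤ℓ}) and σ_max(Ψ_{≤ℓ})² λ_k(Λ²_{≤ℓ}). A minimal numerical sketch of this fact, with arbitrary dimensions:

import numpy as np

rng = np.random.default_rng(0)
n, D = 40, 6
Psi = rng.standard_normal((n, D))
lam2 = np.sort(rng.uniform(0.1, 3.0, D))[::-1]           # eigenvalues of Lambda^2, in decreasing order
ev = np.sort(np.linalg.eigvalsh(Psi @ np.diag(lam2) @ Psi.T))[::-1][:D]   # top D eigenvalues of Psi Lambda^2 Psi^T
s = np.linalg.svd(Psi, compute_uv=False)
smin2, smax2 = s.min() ** 2, s.max() ** 2
print(np.all(smin2 * lam2 <= ev), np.all(ev <= smax2 * lam2))             # True True: each eigenvalue is sandwiched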

We prove a slight modification of Lemma 7 for the case =1\ell=1.

Lemma 14 (Kernel eigenvector structure: case =1\ell=1).

Assume that Eq. (135) holds and set =1\ell=1 and D=d+1D=d+1. Let kk{0,1,,+1}k\neq k^{\prime}\in\{0,1,\ldots,\ell+1\}. Denote λk=γ>+γk(k!)n/(dk)\lambda_{k}=\gamma_{>\ell}+\gamma_{k}(k!)n/(d^{k}) for k=0,,k=0,\ldots,\ell and λ+1=γ>\lambda_{\ell+1}=\gamma_{>\ell}.

  1. (a)

    Suppose that min{λk/λk,λk/λk}1/4\min\{\lambda_{k}/\lambda_{k^{\prime}},\lambda_{k^{\prime}}/\lambda_{k}\}\leq 1/4. Then,

    (𝐕0(k))𝐕(k)op=o˘d,(1).\big{\|}(\mathrm{\bf V}_{0}^{(k^{\prime})})^{\top}\mathrm{\bf V}^{(k)}\big{\|}_{\mathrm{op}}=\breve{o}_{d,\mathbb{P}}(1).
  2. (b)

    Recall \text{\boldmath$\Delta$}^{(\mathrm{res})} defined in (137). If \big{|}\lambda_{k}-\lambda_{k^{\prime}}\big{|}\geq 4.1\delta\max\{\lambda_{k},\lambda_{k^{\prime}}\}, then

    (𝐕s(k))𝐕0(k)op2𝚫(res)op|λkλk|4δmax{λk,λk}o˘d,(1).\big{\|}(\mathrm{\bf V}_{s}^{(k^{\prime})})^{\top}\mathrm{\bf V}_{0}^{(k)}\big{\|}_{\mathrm{op}}\leq\frac{2\|\text{\boldmath$\Delta$}^{(\mathrm{res})}\|_{\mathrm{op}}}{\big{|}\lambda_{k}-\lambda_{k^{\prime}}\big{|}-4\delta\max\{\lambda_{k},\lambda_{k^{\prime}}\}-\breve{o}_{d,\mathbb{P}}(1)}.
Proof of Lemma 14.

Instead of Eqs. (96)–(97), we have

mini,j|(𝐃0(k))ii(𝐃(k))jj|\displaystyle\min_{i,j}\big{|}(\mathrm{\bf D}_{0}^{(k^{\prime})})_{ii}-(\mathrm{\bf D}^{(k)})_{jj}\big{|} |γ>+(1+o˘d,(1))(λkγ>)γ>(1+o˘d,(1))(λkγ>)|\displaystyle\geq\big{|}\gamma_{>\ell}+(1+\breve{o}_{d,\mathbb{P}}(1))(\lambda_{k}-\gamma_{>\ell})-\gamma_{>\ell}-(1+\breve{o}_{d,\mathbb{P}}(1))(\lambda_{k^{\prime}}-\gamma_{>\ell})\big{|}
2δ|λkγ>|2δ|λkγ>|o˘d,(1)\displaystyle-2\delta|\lambda_{k}-\gamma_{>\ell}|-2\delta|\lambda_{k^{\prime}}-\gamma_{>\ell}|-\breve{o}_{d,\mathbb{P}}(1)
|(1+o˘d,(1))λk(1+o˘d,(1))λk|4δmax{λk,λk}o˘d,(1).\displaystyle\geq\big{|}(1+\breve{o}_{d,\mathbb{P}}(1))\lambda_{k}-(1+\breve{o}_{d,\mathbb{P}}(1))\lambda_{k^{\prime}}\big{|}-4\delta\max\{\lambda_{k},\lambda_{k^{\prime}}\}-\breve{o}_{d,\mathbb{P}}(1).

By the assumptions on λk\lambda_{k} and λk\lambda_{k^{\prime}}, with very high probability,

mini,j|(𝐃0(k))ii(𝐃(k))jj|(1/24δ)max{λk,λk}.\min_{i,j}\big{|}(\mathrm{\bf D}_{0}^{(k^{\prime})})_{ii}-(\mathrm{\bf D}^{(k)})_{jj}\big{|}\geq(1/2-4\delta)\max\{\lambda_{k},\lambda_{k^{\prime}}\big{\}}.

Also, instead of Eqs. (98)–(99), we have

(𝐕0(k))𝐊1/2op=[𝐃0(k)]1/2(𝐕0(k))op(1+2δ+o˘d,(1))λk.\big{\|}(\mathrm{\bf V}_{0}^{(k^{\prime})})^{\top}\mathrm{\bf K}^{1/2}\big{\|}_{\mathrm{op}}=\big{\|}\big{[}\mathrm{\bf D}_{0}^{(k^{\prime})}\big{]}^{1/2}(\mathrm{\bf V}_{0}^{(k^{\prime})})^{\top}\big{\|}_{\mathrm{op}}\leq(\sqrt{1+2\delta}\,+\breve{o}_{d,\mathbb{P}}(1))\cdot\sqrt{\lambda_{k^{\prime}}}.

Similarly,

𝐊N1/2𝐕(k)op=𝐕(k)(𝐃(k))1/2op(1+1+2δ+o˘d,(1))λk.\big{\|}\mathrm{\bf K}_{N}^{1/2}\mathrm{\bf V}^{(k)}\big{\|}_{\mathrm{op}}=\big{\|}\mathrm{\bf V}^{(k)}(\mathrm{\bf D}^{(k)})^{1/2}\big{\|}_{\mathrm{op}}\leq(1+\sqrt{1+2\delta}\,+\breve{o}_{d,\mathbb{P}}(1))\cdot\sqrt{\lambda_{k}}.

These inequalities lead to

(𝐕0(k))𝐕(k)opo˘d,(1)λkλk(11/24δ)max{λk,λk}o˘d,(1),\big{\|}(\mathrm{\bf V}_{0}^{(k^{\prime})})^{\top}\mathrm{\bf V}^{(k)}\big{\|}_{\mathrm{op}}\leq\frac{\breve{o}_{d,\mathbb{P}}(1)\cdot\sqrt{\lambda_{k}\lambda_{k^{\prime}}}}{(1-1/2-4\delta)\max\{\lambda_{k},\lambda_{k^{\prime}}\}}\leq\breve{o}_{d,\mathbb{P}}(1),

which results in the same conclusion as in Lemma 7 (a). To show the modified inequality in (b), we only need to notice that

\min_{i,j}\big{|}(\mathrm{\bf D}_{s}^{(k^{\prime})})_{ii}-(\mathrm{\bf D}_{s}^{(k)})_{jj}\big{|}\geq|\lambda_{k}-\lambda_{k^{\prime}}|-4\delta\max\{\lambda_{k},\lambda_{k^{\prime}}\}-\breve{o}_{d,\mathbb{P}}(1).

The proof is complete. ∎
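
The estimates in this proof are of Davis-Kahan type. The generic sin-Θ bound ‖Ṽ_2^⊤ V_1‖_op ≤ ‖E‖_op divided by the separation between the two eigenvalue groups can be illustrated numerically as follows; the toy construction below does not model the additional (1±2δ) distortion tracked by the constants in Lemma 14(b):

import numpy as np

rng = np.random.default_rng(0)
n, k = 30, 5
a = np.concatenate([rng.uniform(9, 10, k), rng.uniform(0, 1, n - k)])
A = np.diag(a)                                   # unperturbed matrix with two well-separated eigenvalue groups
E = rng.standard_normal((n, n)); E = 0.05 * (E + E.T)
w, V = np.linalg.eigh(A + E)
V1 = np.eye(n)[:, :k]                            # eigenvectors of A for the group in [9, 10]
V2 = V[:, w < 5.0]                               # eigenvectors of A + E whose eigenvalues stay near [0, 1]
gap = 9.0 - w[w < 5.0].max()                     # separation between the two groups
print(np.linalg.norm(V2.T @ V1, 2), np.linalg.norm(E, 2) / gap)   # the first number is bounded by the second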

Recall that we denote the projection matrix \mathrm{\bf P}=[\text{\boldmath$e$}_{1},\ldots,\text{\boldmath$e$}_{m}][\text{\boldmath$e$}_{1},\ldots,\text{\boldmath$e$}_{m}]^{\top}, and that the top-D eigenvectors of \text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$\Lambda$}_{\leq\ell}^{2}\text{\boldmath$\Psi$}_{\leq\ell}^{\top} form \mathrm{\bf U}_{s}. The next lemma adapts Lemma 8 to the present case.

Lemma 15.

Suppose that cnmCnc^{\prime}n\leq m\leq C^{\prime}n, where c,C(0,1)c^{\prime},C^{\prime}\in(0,1) are constants. Also suppose that the condition (135) is satisfied. Then there exist C3,c>0C_{3},c>0 such that if m>(C3+1)dm>(C_{3}+1)d, then with very high probability,

σmin(𝐏𝐔s)c.\sigma_{\min}(\mathrm{\bf P}\mathrm{\bf U}_{s})\geq c.
Proof of Lemma 15.

Following the proof of Lemma 8, we have

σmin(𝐏𝐔𝚿)σmin(𝐏𝚿𝐕𝚿)σmax(𝐃𝚿)=σmin(𝐏𝚿)σmax(𝐃𝚿).\sigma_{\min}(\mathrm{\bf P}\mathrm{\bf U}_{\text{\boldmath$\Psi$}})\geq\frac{\sigma_{\min}(\mathrm{\bf P}\text{\boldmath$\Psi$}_{\leq\ell}\mathrm{\bf V}_{\text{\boldmath$\Psi$}})}{\sigma_{\max}(\mathrm{\bf D}_{\text{\boldmath$\Psi$}})}=\frac{\sigma_{\min}(\mathrm{\bf P}\text{\boldmath$\Psi$}_{\leq\ell})}{\sigma_{\max}(\mathrm{\bf D}_{\text{\boldmath$\Psi$}})}.

By (136), we have n1/2σmax(𝐃𝚿)1+δ+o˘d,(1)n^{-1/2}\sigma_{\max}(\mathrm{\bf D}_{\text{\boldmath$\Psi$}})\leq 1+\delta+\breve{o}_{d,\mathbb{P}}(1). Let 𝑿m×d\text{\boldmath$X$}^{\prime}\in\mathbb{R}^{m\times d} be the matrix formed by the top-mm rows of 𝑿X. Note that

σmin(𝐏𝚿)=min𝒖=1𝐏𝚿𝒖=min𝒖=1[𝟏m,𝑿]𝒖=σmin([𝟏m,𝑿])\sigma_{\min}(\mathrm{\bf P}\text{\boldmath$\Psi$}_{\leq\ell})=\min_{\|\text{\boldmath$u$}\|=1}\big{\|}\mathrm{\bf P}\text{\boldmath$\Psi$}_{\leq\ell}\text{\boldmath$u$}\big{\|}=\min_{\|\text{\boldmath$u$}\|=1}\big{\|}[\mathrm{\bf 1}_{m},\text{\boldmath$X$}^{\prime}]\text{\boldmath$u$}\big{\|}=\sigma_{\min}\big{(}[\mathrm{\bf 1}_{m},\text{\boldmath$X$}^{\prime}]\big{)}

where the last equality holds since md+1=Dm\geq d+1=D. It suffices to show σmin([𝟏m,𝑿])cm>0\sigma_{\min}\big{(}[\mathrm{\bf 1}_{m},\text{\boldmath$X$}^{\prime}]\big{)}\geq c\sqrt{m}>0 for certain constant cc with very high probability.

Let 𝐆:=[𝟏m,𝑿][𝟏m,𝑿]/m\mathrm{\bf G}:=[\mathrm{\bf 1}_{m},\text{\boldmath$X$}^{\prime}]^{\top}[\mathrm{\bf 1}_{m},\text{\boldmath$X$}^{\prime}]/m, and note that

𝐆=[1𝒗𝒗𝐆0],𝒗:=1m(𝑿)𝟏m,𝐆0:=1m(𝑿)𝑿.\displaystyle\mathrm{\bf G}=\left[\begin{matrix}1&\text{\boldmath$v$}^{\top}\\ \text{\boldmath$v$}&\mathrm{\bf G}_{0}\end{matrix}\right]\,,\;\;\;\;\text{\boldmath$v$}:=\frac{1}{m}(\text{\boldmath$X$}^{\prime})^{\top}\mathrm{\bf 1}_{m}\,,\;\;\;\;\mathrm{\bf G}_{0}:=\frac{1}{m}(\text{\boldmath$X$}^{\prime})^{\top}\text{\boldmath$X$}^{\prime}\,.

By concentration of the norm of subgaussian vectors, we have 𝒗22d/m\|\text{\boldmath$v$}\|_{2}\leq 2\sqrt{d/m} with very high probability. By concentration of the eigenvalues of empirical covariance matrices [Ver18], we have λ0:=λmin(𝐆0)13d/m\lambda_{0}:=\lambda_{\min}(\mathrm{\bf G}_{0})\geq 1-3\sqrt{d/m} with very high probability. Further, by the Cauchy-Schwarz inequality,

λmin(𝐆)λmin([1𝒗2𝒗2λ0])=12[1+λ0(1λ0)2+4𝒗22].\displaystyle\lambda_{\min}(\mathrm{\bf G})\geq\lambda_{\min}\left(\left[\begin{matrix}1&\|\text{\boldmath$v$}\|_{2}\\ \|\text{\boldmath$v$}\|_{2}&\lambda_{0}\end{matrix}\right]\right)=\frac{1}{2}\big{[}1+\lambda_{0}-\sqrt{(1-\lambda_{0})^{2}+4\|\text{\boldmath$v$}\|_{2}^{2}}\big{]}\,.

It follows from the above bounds that λmin(𝐆)1/10\lambda_{\min}(\mathrm{\bf G})\geq 1/10 with very high probability, provided we choose C3C_{3} a large enough constant. Since σmin([𝟏m,𝑿])=mλmin(𝐆)\sigma_{\min}\big{(}[\mathrm{\bf 1}_{m},\text{\boldmath$X$}^{\prime}]\big{)}=\sqrt{m\lambda_{\min}(\mathrm{\bf G})}, this implies the desired lower bound. ∎
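
The last two steps of this proof reduce λ_min(G) to a 2×2 eigenvalue problem; a quick numerical check of that reduction, with an arbitrary p.s.d. block G_0 and an arbitrary vector v:

import numpy as np

rng = np.random.default_rng(0)
dim = 8
B = rng.standard_normal((dim, dim)); G0 = B @ B.T / dim          # an arbitrary p.s.d. block G_0
v = 0.3 * rng.standard_normal(dim)
G = np.block([[np.ones((1, 1)), v[None, :]], [v[:, None], G0]])  # G = [[1, v^T], [v, G_0]]
lam0 = np.linalg.eigvalsh(G0).min()
two_by_two = 0.5 * (1 + lam0 - np.sqrt((1 - lam0) ** 2 + 4 * v @ v))   # lambda_min of [[1, ||v||], [||v||, lam0]]
print(np.linalg.eigvalsh(G).min(), two_by_two)                   # the first number is at least the second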

Consider the condition c0dnd2/(C(logd)C)c_{0}d\leq n\leq d^{2}/(C(\log d)^{C}). We choose nn^{\prime} to be the smallest integer such that nn+(c0)1Cδnn^{\prime}\geq n+(c_{0})^{-1}C_{\delta}n. As defined in (103), the augmented kernel matrix 𝐊~\widetilde{\mathrm{\bf K}} has size n0×n0n_{0}\times n_{0}, where n0n_{0} is guaranteed to satisfy n0Cδdn_{0}\geq C_{\delta}d.

By Jensen’s inequality,

𝐊N1𝐊N(2)𝐊N1op1n𝔼𝒙n+1,,𝒙n0(𝐊~11)1(𝐊~2)11(𝐊~11)1𝐈nop,\displaystyle\big{\|}\mathrm{\bf K}_{N}^{-1}\mathrm{\bf K}_{N}^{(2)}\mathrm{\bf K}_{N}^{-1}\big{\|}_{\mathrm{op}}\leq\frac{1}{n^{\prime}}\mathbb{E}_{\text{\boldmath$x$}_{n+1},\ldots,\text{\boldmath$x$}_{n_{0}}}\big{\|}(\widetilde{\mathrm{\bf K}}_{11})^{-1}(\widetilde{\mathrm{\bf K}}^{2})_{11}(\widetilde{\mathrm{\bf K}}_{11})^{-1}-\mathrm{\bf I}_{n}\big{\|}_{\mathrm{op}},
𝐊N1𝔼𝒙[𝐊N(,𝒙)f(𝒙)]1n𝔼𝒙n+1,,𝒙n0𝐊~111𝐊~12𝒇~2\displaystyle\big{\|}\mathrm{\bf K}_{N}^{-1}\mathbb{E}_{\text{\boldmath$x$}}\big{[}\mathrm{\bf K}_{N}(\cdot,\text{\boldmath$x$})f(\text{\boldmath$x$})\big{]}\big{\|}\leq\frac{1}{n^{\prime}}\mathbb{E}_{\text{\boldmath$x$}_{n+1},\ldots,\text{\boldmath$x$}_{n_{0}}}\big{\|}\widetilde{\mathrm{\bf K}}_{11}^{-1}\widetilde{\mathrm{\bf K}}_{12}\widetilde{\text{\boldmath$f$}}_{2}\big{\|}

where 𝒇~\widetilde{\text{\boldmath$f$}} is defined in (104).
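
The subtraction of I_n in the first bound reflects the block identity (K̃²)_{11} = K̃_{11}² + K̃_{12}K̃_{21}, so that the remainder isolates the contribution of the fresh points x_{n+1},…,x_{n_0}. A minimal numerical check of this identity on a generic symmetric positive definite matrix:

import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 4
B = rng.standard_normal((n + m, n + m))
Kt = B @ B.T + np.eye(n + m)                      # a generic symmetric positive definite matrix, split into blocks
K11, K12 = Kt[:n, :n], Kt[:n, n:]
lhs = np.linalg.inv(K11) @ (Kt @ Kt)[:n, :n] @ np.linalg.inv(K11) - np.eye(n)
rhs = np.linalg.inv(K11) @ K12 @ K12.T @ np.linalg.inv(K11)       # (K_11)^{-1} K_12 K_21 (K_11)^{-1}
print(np.max(np.abs(lhs - rhs)))                  # ~1e-12: (K^2)_{11} = K_11^2 + K_12 K_21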

Lemma 16.

There exist constant C,C2>0C,C_{2}>0 such that the following holds. If c0dnd2/(C(logd)C)c_{0}d\leq n\leq d^{2}/(C(\log d)^{C}), then with high probability,

\displaystyle\frac{1}{n^{\prime}}\mathbb{E}_{\text{\boldmath$x$}_{n+1},\ldots,\text{\boldmath$x$}_{n_{0}}}\big{\|}(\widetilde{\mathrm{\bf K}}_{11})^{-1}(\widetilde{\mathrm{\bf K}}^{2})_{11}(\widetilde{\mathrm{\bf K}}_{11})^{-1}-\mathrm{\bf I}_{n}\big{\|}_{\mathrm{op}}\leq\frac{C_{2}}{n}, (141)
\displaystyle\frac{1}{n^{\prime}}\mathbb{E}_{\text{\boldmath$x$}_{n+1},\ldots,\text{\boldmath$x$}_{n_{0}}}\big{\|}\widetilde{\mathrm{\bf K}}_{11}^{-1}\widetilde{\mathrm{\bf K}}_{12}\widetilde{\text{\boldmath$f$}}_{2}\big{\|}\leq\frac{C_{2}}{\sqrt{n}}. (142)

Consequently, we obtain (67) and (68) in Lemma 3 under the relaxed condition c0dnd2/(C(logd)C)c_{0}d\leq n\leq d^{2}/(C(\log d)^{C}).

We note that the remaining proof of Lemma 3 is the same as before. Once Lemma 3 is proved, we will obtain Theorem 3.3 under the relaxed condition for =1\ell=1.

Proof of Lemma 16.

Following the corresponding proof in Section B.2, we have with very high probability,

(𝐊~11)1(𝐊~2)11(𝐊~11)1𝐈nopC+C𝐔~1(𝐔~1𝐔~1+γ>(𝐃~γ)1)2𝐔~1op\displaystyle\big{\|}(\widetilde{\mathrm{\bf K}}_{11})^{-1}(\widetilde{\mathrm{\bf K}}^{2})_{11}(\widetilde{\mathrm{\bf K}}_{11})^{-1}-\mathrm{\bf I}_{n}\big{\|}_{\mathrm{op}}\leq C+C\cdot\Big{\|}\widetilde{\mathrm{\bf U}}_{1}\big{(}\widetilde{\mathrm{\bf U}}_{1}^{\top}\widetilde{\mathrm{\bf U}}_{1}+\gamma_{>\ell}(\widetilde{\mathrm{\bf D}}_{\gamma})^{-1}\big{)}^{-2}\widetilde{\mathrm{\bf U}}_{1}^{\top}\Big{\|}_{\mathrm{op}}
𝐊~111𝐊~12𝒇~2C𝒇~2+C(𝐔~1𝐔~1+γ>(𝐃~γ)1)1op𝒇~2.\displaystyle\big{\|}\widetilde{\mathrm{\bf K}}_{11}^{-1}\widetilde{\mathrm{\bf K}}_{12}\widetilde{\text{\boldmath$f$}}_{2}\big{\|}\leq C\|\widetilde{\text{\boldmath$f$}}_{2}\|+C\cdot\Big{\|}\big{(}\widetilde{\mathrm{\bf U}}_{1}^{\top}\widetilde{\mathrm{\bf U}}_{1}+\gamma_{>\ell}(\widetilde{\mathrm{\bf D}}_{\gamma})^{-1}\big{)}^{-1}\Big{\|}_{\mathrm{op}}\cdot\|\widetilde{\text{\boldmath$f$}}_{2}\|.

As in the corresponding proof in Section B.2, it suffices to show

λmin(𝐔~1𝐔~1+γ>(𝐃~γ)1)c\lambda_{\min}\Big{(}\widetilde{\mathrm{\bf U}}_{1}^{\top}\widetilde{\mathrm{\bf U}}_{1}+\gamma_{>\ell}(\widetilde{\mathrm{\bf D}}_{\gamma})^{-1}\Big{)}\geq c (143)

for certain constant c>0. In the case n>(C_{3}+1)d, the assumptions in Lemmas 13, 14, and 15 are satisfied. Following the corresponding argument in Section B.2, we have

λmin(𝐔~1𝐔~1)c.\lambda_{\min}\Big{(}\widetilde{\mathrm{\bf U}}_{1}^{\top}\widetilde{\mathrm{\bf U}}_{1}\Big{)}\geq c. (144)

Below we consider the case n(C3+1)dn\leq(C_{3}+1)d. One difficulty is that we cannot apply Lemma 15; moreover, if n<dn<d, then the matrix 𝐔~1𝐔~1\widetilde{\mathrm{\bf U}}_{1}^{\top}\widetilde{\mathrm{\bf U}}_{1} is not full rank, so we do not expect (144) to hold.

In order to resolve this issue, we first notice that if γ0n02γ1n0/d\gamma_{0}n_{0}\leq 2\gamma_{1}n_{0}/d, then (143) holds with very high probability. In fact,

λmin(𝐔~1𝐔~1+γ>(𝐃~γ)1)λmin(γ>(𝐃~γ)1)γ>max{γ0n0,γ1n0/d}γ>2γ1n0/d>c.\lambda_{\min}\Big{(}\widetilde{\mathrm{\bf U}}_{1}^{\top}\widetilde{\mathrm{\bf U}}_{1}+\gamma_{>\ell}(\widetilde{\mathrm{\bf D}}_{\gamma})^{-1}\Big{)}\geq\lambda_{\min}\Big{(}\gamma_{>\ell}(\widetilde{\mathrm{\bf D}}_{\gamma})^{-1}\Big{)}\geq\frac{\gamma_{>\ell}}{\max\{\gamma_{0}n_{0},\gamma_{1}n_{0}/d\}}\geq\frac{\gamma_{>\ell}}{2\gamma_{1}n_{0}/d}>c.

From now on, we assume γ0n0>2γ1n0/d\gamma_{0}n_{0}>2\gamma_{1}n_{0}/d. Conforming to our notations in the previous proof, we denote the top eigenvector (which corresponds to the eigenvalue closest to γ0n0\gamma_{0}n_{0}) of 𝐊N,𝐊,𝚿~𝚲~2𝚿~\mathrm{\bf K}_{N},\mathrm{\bf K},\widetilde{\text{\boldmath$\Psi$}}_{\leq\ell}\widetilde{\text{\boldmath$\Lambda$}}_{\leq\ell}^{2}\widetilde{\text{\boldmath$\Psi$}}_{\leq\ell}^{\top}, respectively, by 𝒗~(1),𝒗~0(1),𝒗~s(1)\widetilde{\text{\boldmath$v$}}^{(1)},\widetilde{\text{\boldmath$v$}}_{0}^{(1)},\widetilde{\text{\boldmath$v$}}_{s}^{(1)}. Also, denote 𝐔~=[𝒗~(1),𝐕~(2)]\widetilde{\mathrm{\bf U}}=[\widetilde{\text{\boldmath$v$}}^{(1)},\widetilde{\mathrm{\bf V}}^{(2)}] where 𝐕~(2)n0×d\widetilde{\mathrm{\bf V}}^{(2)}\in\mathbb{R}^{n_{0}\times d}.

First, we make a claim about 𝒗s(1)\text{\boldmath$v$}_{s}^{(1)}. There exists a constant C3>0C_{3}>0 such that if γ0n0C3\gamma_{0}n_{0}\geq C_{3}, then for certain constant c0>0c_{0}^{\prime}>0, with very high probability,

𝐏𝒗~s(1)c0.\big{\|}\mathrm{\bf P}\widetilde{\text{\boldmath$v$}}_{s}^{(1)}\big{\|}\geq c_{0}^{\prime}. (145)

To prove this claim, we express 𝒗~s(1)\widetilde{\text{\boldmath$v$}}_{s}^{(1)} as

𝒗~s(1)=argmax𝒗=1[𝒗𝚿~𝚲~2𝚿~𝒗]=argmax𝒗=1[γ0𝒗,𝟏n02+γ1d𝑿~𝒗2]\widetilde{\text{\boldmath$v$}}_{s}^{(1)}=\mbox{argmax}_{\|\text{\boldmath$v$}\|=1}\Big{[}\text{\boldmath$v$}^{\top}\widetilde{\text{\boldmath$\Psi$}}_{\leq\ell}\widetilde{\text{\boldmath$\Lambda$}}_{\leq\ell}^{2}\widetilde{\text{\boldmath$\Psi$}}_{\leq\ell}^{\top}\text{\boldmath$v$}\Big{]}=\mbox{argmax}_{\|\text{\boldmath$v$}\|=1}\Big{[}\gamma_{0}\langle\text{\boldmath$v$},\mathrm{\bf 1}_{n_{0}}\rangle^{2}+\frac{\gamma_{1}}{d}\|\widetilde{\text{\boldmath$X$}}\text{\boldmath$v$}\|^{2}\Big{]}

where each row of 𝑿~n0×d\widetilde{\text{\boldmath$X$}}\in\mathbb{R}^{n_{0}\times d} is a uniform vector in 𝕊d1(d)\mathbb{S}^{d-1}(\sqrt{d}). Evaluating the maximization problem at 𝒗=𝒗~s(1)\text{\boldmath$v$}=\widetilde{\text{\boldmath$v$}}_{s}^{(1)} and 𝒗=𝟏n0/n0\text{\boldmath$v$}=\mathrm{\bf 1}_{n_{0}}/\sqrt{n_{0}}, we obtain

γ0𝒗~s(1),𝟏n02+γ1d𝑿~𝒗~s(1)2γ0n0+γ1dn0𝑿~𝟏n02.\gamma_{0}\langle\widetilde{\text{\boldmath$v$}}_{s}^{(1)},\mathrm{\bf 1}_{n_{0}}\rangle^{2}+\frac{\gamma_{1}}{d}\|\widetilde{\text{\boldmath$X$}}\widetilde{\text{\boldmath$v$}}_{s}^{(1)}\|^{2}\geq\gamma_{0}n_{0}+\frac{\gamma_{1}}{dn_{0}}\|\widetilde{\text{\boldmath$X$}}\mathrm{\bf 1}_{n_{0}}\|^{2}.

Since 𝑿~op2(n0+d)Cn0\|\widetilde{\text{\boldmath$X$}}\|_{\mathrm{op}}\leq 2(\sqrt{n_{0}}+\sqrt{d})\leq C^{\prime}\sqrt{n_{0}} with very high probability, we have

γ0𝒗~s(1),𝟏n02γ0n0C4n0dγ0n0C4\gamma_{0}\langle\widetilde{\text{\boldmath$v$}}_{s}^{(1)},\mathrm{\bf 1}_{n_{0}}\rangle^{2}\geq\gamma_{0}n_{0}-C_{4}\frac{n_{0}}{d}\geq\gamma_{0}n_{0}-C_{4}^{\prime}

where C4,C4>0C_{4},C_{4}^{\prime}>0 are some constants. Thus, we obtain

𝒗~s(1),𝟏n0/n021C4γ0n0.\langle\widetilde{\text{\boldmath$v$}}_{s}^{(1)},\mathrm{\bf 1}_{n_{0}}/\sqrt{n_{0}}\rangle^{2}\geq 1-\frac{C_{4}^{\prime}}{\gamma_{0}n_{0}}.

We observe that

𝒗~s(1),𝟏n0/n0\displaystyle\langle\widetilde{\text{\boldmath$v$}}_{s}^{(1)},\mathrm{\bf 1}_{n_{0}}/\sqrt{n_{0}}\rangle 𝐏𝒗~s(1),𝟏n0/n0+𝐏𝒗~s(1),𝟏n0/n0\displaystyle\leq\langle\mathrm{\bf P}\widetilde{\text{\boldmath$v$}}_{s}^{(1)},\mathrm{\bf 1}_{n_{0}}/\sqrt{n_{0}}\rangle+\langle\mathrm{\bf P}^{\perp}\widetilde{\text{\boldmath$v$}}_{s}^{(1)},\mathrm{\bf 1}_{n_{0}}/\sqrt{n_{0}}\rangle
(i)𝐏𝒗~s(1)+𝒗~s(1),𝐏𝟏n0/n0\displaystyle\stackrel{{\scriptstyle(i)}}{{\leq}}\big{\|}\mathrm{\bf P}\widetilde{\text{\boldmath$v$}}_{s}^{(1)}\big{\|}+\langle\widetilde{\text{\boldmath$v$}}_{s}^{(1)},\mathrm{\bf P}^{\perp}\mathrm{\bf 1}_{n_{0}}/\sqrt{n_{0}}\rangle
𝐏𝒗~s(1)+n0nn0\displaystyle\leq\big{\|}\mathrm{\bf P}\widetilde{\text{\boldmath$v$}}_{s}^{(1)}\big{\|}+\sqrt{\frac{n_{0}-n}{n_{0}}}

where in (i) we used the Cauchy-Schwarz inequality together with the fact that \langle\mathrm{\bf P}\text{\boldmath$a$},\text{\boldmath$b$}\rangle=\langle\text{\boldmath$a$},\mathrm{\bf P}\text{\boldmath$b$}\rangle for any (symmetric) projection matrix \mathrm{\bf P}. Thus, we deduce that

𝐏𝒗~s(1)\displaystyle\big{\|}\mathrm{\bf P}\widetilde{\text{\boldmath$v$}}_{s}^{(1)}\big{\|} 1C4/(γ0n0)1(n/n0)\displaystyle\geq\sqrt{1-C_{4}^{\prime}/(\gamma_{0}n_{0})}-\sqrt{1-(n/n_{0})}
\displaystyle\stackrel{{\scriptstyle(i)}}{{\geq}}1-C_{4}^{\prime}/(\gamma_{0}n_{0})-\big{(}1-(n/(2n_{0}))\big{)}
(ii)c0C4/(γ0n0)\displaystyle\stackrel{{\scriptstyle(ii)}}{{\geq}}c_{0}-C_{4}^{\prime}/(\gamma_{0}n_{0})

where in (i) we used 1a1a1(a/2)1-a\leq\sqrt{1-a}\leq 1-(a/2) for a(0,1)a\in(0,1), and in (ii) c0>0c_{0}>0 is certain small constant. Therefore, if γ0n02C4\gamma_{0}n_{0}\geq 2C_{4}^{\prime}, then 𝐏𝒗~s(1)c0/2\big{\|}\mathrm{\bf P}\widetilde{\text{\boldmath$v$}}_{s}^{(1)}\big{\|}\geq c_{0}/2. This proves the claim (145).

In order to bound the smallest eigenvalue in (143), we consider two cases.

Case 1: γ0n0<C3\gamma_{0}n_{0}<C_{3}. We have

λmin(𝐔~1𝐔~1+γ>(𝐃~γ)1)\displaystyle\lambda_{\min}\Big{(}\widetilde{\mathrm{\bf U}}_{1}^{\top}\widetilde{\mathrm{\bf U}}_{1}+\gamma_{>\ell}(\widetilde{\mathrm{\bf D}}_{\gamma})^{-1}\Big{)} λmin(γ>(𝐃~γ)1)γ>max{γ0n0,γ1n0/d}\displaystyle\geq\lambda_{\min}\Big{(}\gamma_{>\ell}(\widetilde{\mathrm{\bf D}}_{\gamma})^{-1}\Big{)}\geq\frac{\gamma_{>\ell}}{\max\{\gamma_{0}n_{0},\gamma_{1}n_{0}/d\}}
γ>max{C3,C(n/d)}c.\displaystyle\geq\frac{\gamma_{>\ell}}{\max\{C_{3},C(n/d)\}}\geq c.

Case 2: \gamma_{0}n_{0}\geq C_{3}. We derive two lower bounds on the variational form of the smallest eigenvalue in (143). Let \text{\boldmath$z$}\in\mathbb{S}^{n_{0}-1} be any unit vector, and write \text{\boldmath$z$}=[z_{1},\text{\boldmath$z$}_{-1}]^{\top}. The first lower bound is

𝒛[𝐔~1𝐔~1+γ>(𝐃~γ)1]𝒛\displaystyle~{}~{}\text{\boldmath$z$}^{\top}\Big{[}\widetilde{\mathrm{\bf U}}_{1}^{\top}\widetilde{\mathrm{\bf U}}_{1}+\gamma_{>\ell}(\widetilde{\mathrm{\bf D}}_{\gamma})^{-1}\Big{]}\text{\boldmath$z$}
z12𝐏𝒗~s(1)2+2z1[𝒗~(1)]𝐏𝐕~(2)𝒛1+𝒛1[𝐕~(2)]𝐏𝐕~(2)𝒛1+γ>𝒛(𝐃~γ)1𝒛\displaystyle\geq z_{1}^{2}\big{\|}\mathrm{\bf P}\widetilde{\text{\boldmath$v$}}_{s}^{(1)}\big{\|}^{2}+2z_{1}[\widetilde{\text{\boldmath$v$}}^{(1)}]^{\top}\mathrm{\bf P}\widetilde{\mathrm{\bf V}}^{(2)}\text{\boldmath$z$}_{-1}+\text{\boldmath$z$}_{-1}^{\top}[\widetilde{\mathrm{\bf V}}^{(2)}]^{\top}\mathrm{\bf P}\widetilde{\mathrm{\bf V}}^{(2)}\text{\boldmath$z$}_{-1}+\gamma_{>\ell}\text{\boldmath$z$}^{\top}(\widetilde{\mathrm{\bf D}}_{\gamma})^{-1}\text{\boldmath$z$}
z12𝐏𝒗~s(1)22|z1|𝒛1𝒛12\displaystyle\geq z_{1}^{2}\big{\|}\mathrm{\bf P}\widetilde{\text{\boldmath$v$}}_{s}^{(1)}\big{\|}^{2}-2|z_{1}|\cdot\|\text{\boldmath$z$}_{-1}\|-\|\text{\boldmath$z$}_{-1}\|^{2}
(i)(1𝒛12)c03𝒛1.\displaystyle\stackrel{{\scriptstyle(i)}}{{\geq}}(1-\|\text{\boldmath$z$}_{-1}\|^{2})\cdot c_{0}^{\prime}-3\|\text{\boldmath$z$}_{-1}\|. (146)

where we used the claim (145) in (i). The second lower bound is

𝒛[𝐔~1𝐔~1+γ>(𝐃~γ)1]𝒛\displaystyle\text{\boldmath$z$}^{\top}\Big{[}\widetilde{\mathrm{\bf U}}_{1}^{\top}\widetilde{\mathrm{\bf U}}_{1}+\gamma_{>\ell}(\widetilde{\mathrm{\bf D}}_{\gamma})^{-1}\Big{]}\text{\boldmath$z$} 𝒛[γ>(𝐃~γ)1]𝒛γ>𝒛12cγ1n0/d\displaystyle\geq\text{\boldmath$z$}^{\top}\Big{[}\gamma_{>\ell}(\widetilde{\mathrm{\bf D}}_{\gamma})^{-1}\Big{]}\text{\boldmath$z$}\geq\gamma_{>\ell}\|\text{\boldmath$z$}_{-1}\|^{2}\cdot\frac{c}{\gamma_{1}n_{0}/d}
cγ>𝒛12dγ1nc1𝒛12\displaystyle\geq\frac{c\gamma_{>\ell}\|\text{\boldmath$z$}_{-1}\|^{2}d}{\gamma_{1}n}\geq c_{1}\|\text{\boldmath$z$}_{-1}\|^{2} (147)

where c1>0c_{1}>0 is certain small constant, and in the last inequality we used our assumption n(C3+1)dn\leq(C_{3}+1)d. If 𝒛1min{1/2,c0/12}\|\text{\boldmath$z$}_{-1}\|\leq\min\{1/2,c_{0}^{\prime}/12\}, then by the first lower bound (146), we have

𝒛[𝐔~1𝐔~1+γ>(𝐃~γ)1]𝒛c0/4.\text{\boldmath$z$}^{\top}\Big{[}\widetilde{\mathrm{\bf U}}_{1}^{\top}\widetilde{\mathrm{\bf U}}_{1}+\gamma_{>\ell}(\widetilde{\mathrm{\bf D}}_{\gamma})^{-1}\Big{]}\text{\boldmath$z$}\geq c_{0}/4.

If 𝒛1min{1/2,c0/12}\|\text{\boldmath$z$}_{-1}\|\geq\min\{1/2,c_{0}^{\prime}/12\} instead, then the second lower bound (147) implies that the left-hand side above is lower bounded by a constant. Combining the two cases, we obtain

λmin(𝐔~1𝐔~1+γ>(𝐃~γ)1)c.\lambda_{\min}\Big{(}\widetilde{\mathrm{\bf U}}_{1}^{\top}\widetilde{\mathrm{\bf U}}_{1}+\gamma_{>\ell}(\widetilde{\mathrm{\bf D}}_{\gamma})^{-1}\Big{)}\geq c.

This proves the desired inequality (143). The rest of the proof is the same as in the earlier argument. ∎

Appendix D Generalization error for linear model: proof of Corollary 3.2

Throughout this subsection, let the assumptions of Corollary 3.2 hold. Denote λ¯:=λ+v(σ)\overline{\lambda}:=\lambda+v(\sigma). First, we state and prove a lemma. We will use the simple identity

𝐀1=𝐀01𝐀1(𝐀𝐀0)𝐀01,for all𝐀,𝐀0n×n\mathrm{\bf A}^{-1}=\mathrm{\bf A}_{0}^{-1}-\mathrm{\bf A}^{-1}(\mathrm{\bf A}-\mathrm{\bf A}_{0})\mathrm{\bf A}_{0}^{-1},\qquad\text{for all}~{}\mathrm{\bf A},\mathrm{\bf A}_{0}\in\mathbb{R}^{n\times n}

multiple times.
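As a quick numerical sanity check of this identity (a sketch, not part of the argument; the test matrices below are arbitrary symmetric positive definite choices), one may run the following Python snippet:

import numpy as np

rng = np.random.default_rng(0)
n = 5
G, G0 = rng.standard_normal((n, n)), rng.standard_normal((n, n))
A = G @ G.T / n + np.eye(n)    # arbitrary positive definite test matrix A
A0 = G0 @ G0.T / n + np.eye(n) # arbitrary positive definite test matrix A_0
lhs = np.linalg.inv(A)
rhs = np.linalg.inv(A0) - np.linalg.inv(A) @ (A - A0) @ np.linalg.inv(A0)
print(np.max(np.abs(lhs - rhs)))  # of the order of machine precision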

Lemma 17.

There exist constants c,C>0c,C>0 such that the following holds. With high probability,

𝟏n(λ¯𝐈n+γ1d𝐗𝐗)1𝟏ncnλ¯,\displaystyle\mathrm{\bf 1}_{n}^{\top}\big{(}\overline{\lambda}\mathrm{\bf I}_{n}+\frac{\gamma_{1}}{d}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\big{)}^{-1}\mathrm{\bf 1}_{n}\geq\frac{cn}{\overline{\lambda}},
γ0(λ¯𝐈n+γ0𝟏n𝟏n+γ1d𝐗𝐗)1𝟏nCn.\displaystyle\Big{\|}\gamma_{0}\big{(}\overline{\lambda}\mathrm{\bf I}_{n}+\gamma_{0}\mathrm{\bf 1}_{n}\mathrm{\bf 1}_{n}^{\top}+\frac{\gamma_{1}}{d}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\big{)}^{-1}\mathrm{\bf 1}_{n}\Big{\|}\leq\frac{C}{\sqrt{n}}.
Proof of Lemma 17.

Since 𝑿𝑿opC(n+d)\|\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\|_{\mathrm{op}}\leq C(n+d) with very high probability (by Lemma 21), if n16dn\leq 16d, we have

𝟏n(λ¯𝐈n+γ1d𝑿𝑿)1𝟏nnλmax(λ¯𝐈n+γ1𝑿𝑿/d)nλ¯+Cγ1nCλ¯.\mathrm{\bf 1}_{n}^{\top}\big{(}\overline{\lambda}\mathrm{\bf I}_{n}+\frac{\gamma_{1}}{d}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\big{)}^{-1}\mathrm{\bf 1}_{n}\geq\frac{n}{\lambda_{\max}\big{(}\overline{\lambda}\mathrm{\bf I}_{n}+\gamma_{1}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}/d\big{)}}\geq\frac{n}{\overline{\lambda}+C\gamma_{1}}\geq\frac{n}{C^{\prime}\overline{\lambda}}.

where in the last inequality we used \overline{\lambda}\geq c for a constant c>0, so that \overline{\lambda}+C\gamma_{1}\leq C^{\prime}\overline{\lambda}. If n>16d, we observe that

\displaystyle\mathrm{\bf 1}_{n}^{\top}\big{(}\overline{\lambda}\mathrm{\bf I}_{n}+\frac{\gamma_{1}}{d}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\big{)}^{-1}\mathrm{\bf 1}_{n}-\mathrm{\bf 1}_{n}^{\top}(\overline{\lambda}\mathrm{\bf I}_{n})^{-1}\mathrm{\bf 1}_{n} =-\mathrm{\bf 1}_{n}^{\top}(\overline{\lambda}\mathrm{\bf I}_{n})^{-1}\frac{\gamma_{1}}{d}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\big{(}\overline{\lambda}\mathrm{\bf I}_{n}+\frac{\gamma_{1}}{d}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\big{)}^{-1}\mathrm{\bf 1}_{n}
=-\frac{\gamma_{1}}{\overline{\lambda}d}\mathrm{\bf 1}_{n}^{\top}\text{\boldmath$X$}\big{(}\overline{\lambda}\mathrm{\bf I}_{d}+\frac{\gamma_{1}}{d}\text{\boldmath$X$}^{\top}\text{\boldmath$X$}\big{)}^{-1}\text{\boldmath$X$}^{\top}\mathrm{\bf 1}_{n}.

Since \lambda_{\min}(\text{\boldmath$X$}^{\top}\text{\boldmath$X$})\geq n/4 and \|\text{\boldmath$X$}^{\top}\mathrm{\bf 1}_{n}\|^{2}\leq 2nd with high probability due to Lemma 21, we get

\displaystyle\mathrm{\bf 1}_{n}^{\top}\big{(}\overline{\lambda}\mathrm{\bf I}_{n}+\frac{\gamma_{1}}{d}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\big{)}^{-1}\mathrm{\bf 1}_{n} \geq\frac{n}{\overline{\lambda}}-\frac{\gamma_{1}}{\overline{\lambda}d}\cdot\|\text{\boldmath$X$}^{\top}\mathrm{\bf 1}_{n}\|^{2}\cdot\frac{4d}{\gamma_{1}n}
\geq\frac{n}{\overline{\lambda}}-\frac{\gamma_{1}}{\overline{\lambda}d}\cdot 2nd\cdot\frac{4d}{\gamma_{1}n}\geq\frac{n}{\overline{\lambda}}-\frac{8d}{\overline{\lambda}}
\geq\frac{n}{2\overline{\lambda}}.

In either case, the first inequality of the lemma holds with high probability. Next, we notice that

(λ¯𝐈n+γ0𝟏n𝟏n+γ1d𝑿𝑿)1𝟏n(λ¯𝐈n+γ1d𝑿𝑿)1𝟏n\displaystyle~{}~{}\big{(}\overline{\lambda}\mathrm{\bf I}_{n}+\gamma_{0}\mathrm{\bf 1}_{n}\mathrm{\bf 1}_{n}^{\top}+\frac{\gamma_{1}}{d}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\big{)}^{-1}\mathrm{\bf 1}_{n}-\big{(}\overline{\lambda}\mathrm{\bf I}_{n}+\frac{\gamma_{1}}{d}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\big{)}^{-1}\mathrm{\bf 1}_{n}
=(λ¯𝐈n+γ0𝟏n𝟏n+γ1d𝑿𝑿)1γ0𝟏n𝟏n(λ¯𝐈n+γ1d𝑿𝑿)1𝟏n.\displaystyle=-\big{(}\overline{\lambda}\mathrm{\bf I}_{n}+\gamma_{0}\mathrm{\bf 1}_{n}\mathrm{\bf 1}_{n}^{\top}+\frac{\gamma_{1}}{d}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\big{)}^{-1}\gamma_{0}\mathrm{\bf 1}_{n}\mathrm{\bf 1}_{n}^{\top}\big{(}\overline{\lambda}\mathrm{\bf I}_{n}+\frac{\gamma_{1}}{d}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\big{)}^{-1}\mathrm{\bf 1}_{n}.

By re-arranging the equality, we get

(λ¯𝐈n+γ0𝟏n𝟏n+γ1d𝑿𝑿)1𝟏n\displaystyle\Big{\|}\big{(}\overline{\lambda}\mathrm{\bf I}_{n}+\gamma_{0}\mathrm{\bf 1}_{n}\mathrm{\bf 1}_{n}^{\top}+\frac{\gamma_{1}}{d}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\big{)}^{-1}\mathrm{\bf 1}_{n}\Big{\|} =[1+γ0𝟏n(λ¯𝐈n+γ1d𝑿𝑿)1𝟏n]1(λ¯𝐈n+γ1d𝑿𝑿)1𝟏n\displaystyle=\Big{[}1+\gamma_{0}\mathrm{\bf 1}_{n}^{\top}\big{(}\overline{\lambda}\mathrm{\bf I}_{n}+\frac{\gamma_{1}}{d}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\big{)}^{-1}\mathrm{\bf 1}_{n}\Big{]}^{-1}\Big{\|}\big{(}\overline{\lambda}\mathrm{\bf I}_{n}+\frac{\gamma_{1}}{d}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\big{)}^{-1}\mathrm{\bf 1}_{n}\Big{\|}
n/λ¯1+cγ0n/λ¯=nλ¯+cγ0nCγ0n,\displaystyle\leq\frac{\sqrt{n}/\overline{\lambda}}{1+c\gamma_{0}n/\overline{\lambda}}=\frac{\sqrt{n}}{\overline{\lambda}+c\gamma_{0}n}\leq\frac{C}{\gamma_{0}\sqrt{n}},

which proves the lemma. ∎
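As an aside, the rearrangement used in the last step is equivalent to the Sherman-Morrison formula applied to the vector \mathrm{\bf 1}_{n}. The following minimal Python sketch (not part of the proof; \mathrm{\bf A}_{0} is replaced by an arbitrary positive definite test matrix and g0 is a hypothetical stand-in for \gamma_{0}) verifies the resulting identity numerically:

import numpy as np

rng = np.random.default_rng(1)
n, g0 = 6, 0.7
G = rng.standard_normal((n, n))
A0 = G @ G.T / n + np.eye(n)               # positive definite stand-in for A_0
ones = np.ones(n)
lhs = np.linalg.solve(A0 + g0 * np.outer(ones, ones), ones)
rhs = np.linalg.solve(A0, ones) / (1 + g0 * ones @ np.linalg.solve(A0, ones))
print(np.max(np.abs(lhs - rhs)))           # of the order of machine precision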

Building on this lemma, we prove some useful bounds. For convenience, we define

𝐀=λ¯𝐈n+γ0𝟏n𝟏n+γ1d𝑿𝑿,𝐀0=λ¯𝐈n+γ1d𝑿𝑿.\mathrm{\bf A}=\overline{\lambda}\mathrm{\bf I}_{n}+\gamma_{0}\mathrm{\bf 1}_{n}\mathrm{\bf 1}_{n}^{\top}+\frac{\gamma_{1}}{d}\text{\boldmath$X$}\text{\boldmath$X$}^{\top},\qquad\mathrm{\bf A}_{0}=\overline{\lambda}\mathrm{\bf I}_{n}+\frac{\gamma_{1}}{d}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}. (148)
Lemma 18.

We have

γ1𝐀1𝑿op=Od,(d),γ1𝐀1𝑿F=Od,(d),\displaystyle\gamma_{1}\big{\|}\mathrm{\bf A}^{-1}\text{\boldmath$X$}\big{\|}_{\mathrm{op}}=O_{d,\mathbb{P}}\big{(}\sqrt{d}\big{)},\qquad\gamma_{1}\big{\|}\mathrm{\bf A}^{-1}\text{\boldmath$X$}\big{\|}_{F}=O_{d,\mathbb{P}}\big{(}d\big{)}, (149a)
γ1𝑿𝐀1𝑿op=Od,(d),γ1𝑿𝐀1𝑿F=Od,(d3/2),\displaystyle\gamma_{1}\big{\|}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\text{\boldmath$X$}\big{\|}_{\mathrm{op}}=O_{d,\mathbb{P}}\big{(}d\big{)},\qquad\gamma_{1}\big{\|}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\text{\boldmath$X$}\big{\|}_{F}=O_{d,\mathbb{P}}\big{(}d^{3/2}\big{)}, (149b)
γ1𝐗𝐗𝐀01op=Od,(d),γ1𝐗𝐗𝐀1op=Od,(d).\displaystyle\gamma_{1}\big{\|}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf A}_{0}^{-1}\big{\|}_{\mathrm{op}}=O_{d,\mathbb{P}}(d),\qquad\gamma_{1}\big{\|}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\big{\|}_{\mathrm{op}}=O_{d,\mathbb{P}}(d). (149c)
Proof of Lemma 18.

In this proof, we will use the bounds on the singular values of 𝑿X from Lemma 21. To prove the two bounds in (149a), we observe that

𝐀1𝑿𝐀01𝑿F\displaystyle\big{\|}\mathrm{\bf A}^{-1}\text{\boldmath$X$}-\mathrm{\bf A}_{0}^{-1}\text{\boldmath$X$}\big{\|}_{F} =𝐀1γ0𝟏n𝟏n𝐀01𝑿Fγ0𝐀1𝟏n𝑿𝐀01𝟏n\displaystyle=\big{\|}\mathrm{\bf A}^{-1}\gamma_{0}\mathrm{\bf 1}_{n}\mathrm{\bf 1}_{n}^{\top}\mathrm{\bf A}_{0}^{-1}\text{\boldmath$X$}\big{\|}_{F}\leq\gamma_{0}\big{\|}\mathrm{\bf A}^{-1}\mathrm{\bf 1}_{n}\big{\|}\cdot\big{\|}\text{\boldmath$X$}^{\top}\mathrm{\bf A}_{0}^{-1}\mathrm{\bf 1}_{n}\big{\|}
(i)Cnn𝐀01𝑿opC𝐀01𝑿op\displaystyle\stackrel{{\scriptstyle(i)}}{{\leq}}\frac{C}{\sqrt{n}}\cdot\sqrt{n}\,\big{\|}\mathrm{\bf A}_{0}^{-1}\text{\boldmath$X$}\big{\|}_{\mathrm{op}}\leq C\big{\|}\mathrm{\bf A}_{0}^{-1}\text{\boldmath$X$}\big{\|}_{\mathrm{op}}

where in (i) we used Lemma 17. So by the triangle inequality, we have

𝐀1𝑿opC𝐀01𝑿op,𝐀1𝑿F𝐀01𝑿F+C𝐀01𝑿opC𝐀01𝑿F.\big{\|}\mathrm{\bf A}^{-1}\text{\boldmath$X$}\big{\|}_{\mathrm{op}}\leq C\big{\|}\mathrm{\bf A}_{0}^{-1}\text{\boldmath$X$}\big{\|}_{\mathrm{op}},\qquad\big{\|}\mathrm{\bf A}^{-1}\text{\boldmath$X$}\big{\|}_{F}\leq\big{\|}\mathrm{\bf A}_{0}^{-1}\text{\boldmath$X$}\big{\|}_{F}+C\big{\|}\mathrm{\bf A}_{0}^{-1}\text{\boldmath$X$}\big{\|}_{\mathrm{op}}\leq C\big{\|}\mathrm{\bf A}_{0}^{-1}\text{\boldmath$X$}\big{\|}_{F}.

If n/d>1+c for some constant c>0, then by Lemma 21, with high probability \lambda_{\min}(\text{\boldmath$X$}^{\top}\text{\boldmath$X$})>c^{\prime}n for a constant c^{\prime}>0. Since n\geq c_{0}d, we also have \|\text{\boldmath$X$}\|_{\mathrm{op}}\leq C\sqrt{n} and \|\text{\boldmath$X$}\|_{F}\leq Cn with high probability. Thus,

𝐀01𝑿op=𝑿(λ¯𝐈d+γ1d𝑿𝑿)1op𝑿op(λ¯𝐈d+γ1d𝑿𝑿)1op(i)Cndγ1n,\displaystyle\big{\|}\mathrm{\bf A}_{0}^{-1}\text{\boldmath$X$}\big{\|}_{\mathrm{op}}=\big{\|}\text{\boldmath$X$}\big{(}\overline{\lambda}\mathrm{\bf I}_{d}+\frac{\gamma_{1}}{d}\text{\boldmath$X$}^{\top}\text{\boldmath$X$}\big{)}^{-1}\big{\|}_{\mathrm{op}}\leq\big{\|}\text{\boldmath$X$}\big{\|}_{\mathrm{op}}\cdot\big{\|}\big{(}\overline{\lambda}\mathrm{\bf I}_{d}+\frac{\gamma_{1}}{d}\text{\boldmath$X$}^{\top}\text{\boldmath$X$}\big{)}^{-1}\big{\|}_{\mathrm{op}}\stackrel{{\scriptstyle(i)}}{{\leq}}C\sqrt{n}\cdot\frac{d}{\gamma_{1}n}, (150)
𝐀01𝑿F𝑿F(λ¯𝐈d+γ1d𝑿𝑿)1opCndγ1nCdγ1.\displaystyle\big{\|}\mathrm{\bf A}_{0}^{-1}\text{\boldmath$X$}\big{\|}_{F}\leq\big{\|}\text{\boldmath$X$}\big{\|}_{F}\cdot\big{\|}\big{(}\overline{\lambda}\mathrm{\bf I}_{d}+\frac{\gamma_{1}}{d}\text{\boldmath$X$}^{\top}\text{\boldmath$X$}\big{)}^{-1}\big{\|}_{\mathrm{op}}\leq Cn\cdot\frac{d}{\gamma_{1}n}\leq\frac{Cd}{\gamma_{1}}.

where in (i) we used \|\mathrm{\bf M}_{1}\mathrm{\bf M}_{2}\|_{F}\leq\|\mathrm{\bf M}_{1}\|_{F}\cdot\|\mathrm{\bf M}_{2}\|_{\mathrm{op}} for matrices \mathrm{\bf M}_{1},\mathrm{\bf M}_{2}. If n/d\leq 1+c instead, then \big{\|}\mathrm{\bf A}_{0}^{-1}\text{\boldmath$X$}\big{\|}_{\mathrm{op}}\leq C\big{\|}\text{\boldmath$X$}\big{\|}_{\mathrm{op}}\leq C\sqrt{n}\leq O_{d,\mathbb{P}}(\sqrt{d}), and \big{\|}\mathrm{\bf A}_{0}^{-1}\text{\boldmath$X$}\big{\|}_{F}\leq C\big{\|}\text{\boldmath$X$}\big{\|}_{F}\leq Cn\leq O_{d,\mathbb{P}}(d). In either case, the two bounds in (149a) hold. To prove the two bounds in (149b), we note that

\displaystyle\big{\|}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\text{\boldmath$X$}-\text{\boldmath$X$}^{\top}\mathrm{\bf A}_{0}^{-1}\text{\boldmath$X$}\big{\|}_{\mathrm{op}} =\big{\|}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\gamma_{0}\mathrm{\bf 1}_{n}\mathrm{\bf 1}_{n}^{\top}\mathrm{\bf A}_{0}^{-1}\text{\boldmath$X$}\big{\|}_{\mathrm{op}}\stackrel{{\scriptstyle(i)}}{{\leq}}\big{\|}\text{\boldmath$X$}\big{\|}_{\mathrm{op}}\cdot\big{\|}\gamma_{0}\mathrm{\bf A}^{-1}\mathrm{\bf 1}_{n}\big{\|}\cdot\sqrt{n}\,\big{\|}\mathrm{\bf A}_{0}^{-1}\text{\boldmath$X$}\big{\|}_{\mathrm{op}}
\leq O_{d,\mathbb{P}}(\sqrt{n})\cdot O_{d,\mathbb{P}}(n^{-1/2})\cdot O_{d,\mathbb{P}}(\gamma_{1}^{-1}d)=O_{d,\mathbb{P}}(\gamma_{1}^{-1}d)
\big{\|}\text{\boldmath$X$}^{\top}\mathrm{\bf A}_{0}^{-1}\text{\boldmath$X$}\big{\|}_{\mathrm{op}} =\big{\|}\text{\boldmath$X$}^{\top}\text{\boldmath$X$}\big{(}\overline{\lambda}\mathrm{\bf I}_{d}+\frac{\gamma_{1}}{d}\text{\boldmath$X$}^{\top}\text{\boldmath$X$}\big{)}^{-1}\big{\|}_{\mathrm{op}}\leq\frac{d}{\gamma_{1}}.

where we used the bound (150) in (i). This leads to γ1𝑿𝐀1𝑿op=Od,(d)\gamma_{1}\|\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\text{\boldmath$X$}\|_{\mathrm{op}}=O_{d,\mathbb{P}}(d). Also, the rank of 𝑿𝐀1𝑿\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\text{\boldmath$X$} is at most dd, so

γ1𝑿𝐀1𝑿Fγ1d𝑿𝐀1𝑿op=Od,(d3/2).\gamma_{1}\big{\|}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\text{\boldmath$X$}\big{\|}_{F}\leq\gamma_{1}\sqrt{d}\,\big{\|}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\text{\boldmath$X$}\big{\|}_{\mathrm{op}}=O_{d,\mathbb{P}}(d^{3/2}).

To prove the two bounds in (149c), let the eigenvalue decomposition of \text{\boldmath$X$}\text{\boldmath$X$}^{\top} be \text{\boldmath$X$}\text{\boldmath$X$}^{\top}=\mathrm{\bf U}_{\text{\boldmath$X$}}\mathrm{\bf D}_{\text{\boldmath$X$}}\mathrm{\bf U}_{\text{\boldmath$X$}}^{\top}; then we find

\gamma_{1}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf A}_{0}^{-1}=\mathrm{\bf U}_{\text{\boldmath$X$}}\mathrm{diag}\Big{\{}\frac{\gamma_{1}\lambda_{1}(\text{\boldmath$X$}\text{\boldmath$X$}^{\top})}{\gamma_{1}\lambda_{1}(\text{\boldmath$X$}\text{\boldmath$X$}^{\top})/d+\overline{\lambda}},\ldots,\frac{\gamma_{1}\lambda_{d}(\text{\boldmath$X$}\text{\boldmath$X$}^{\top})}{\gamma_{1}\lambda_{d}(\text{\boldmath$X$}\text{\boldmath$X$}^{\top})/d+\overline{\lambda}}\Big{\}}\mathrm{\bf U}_{\text{\boldmath$X$}}^{\top}.

The diagonal entries are all no larger than dd, so this proves the first bound in (149c). Moreover,

𝑿𝑿𝐀1𝑿𝑿𝐀01op\displaystyle\big{\|}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}-\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf A}_{0}^{-1}\big{\|}_{\mathrm{op}} =𝑿𝑿𝐀01γ0𝟏n𝟏n𝐀1op\displaystyle=\big{\|}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf A}_{0}^{-1}\gamma_{0}\mathrm{\bf 1}_{n}\mathrm{\bf 1}_{n}^{\top}\mathrm{\bf A}^{-1}\big{\|}_{\mathrm{op}}
𝑿𝑿𝐀01opnγ0𝐀1𝟏n\displaystyle\leq\big{\|}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf A}_{0}^{-1}\big{\|}_{\mathrm{op}}\cdot\sqrt{n}\cdot\big{\|}\gamma_{0}\mathrm{\bf A}^{-1}\mathrm{\bf 1}_{n}\big{\|}
Od,(d)nOd,(n1/2)=Od,(d),\displaystyle\leq O_{d,\mathbb{P}}(d)\cdot\sqrt{n}\cdot O_{d,\mathbb{P}}(n^{-1/2})=O_{d,\mathbb{P}}(d),

which proves the second bound. ∎

To study the risk R𝖯𝖱𝖱(λ¯)R_{{\sf PRR}}(\overline{\lambda}) in the linear case, we first notice that Kp(𝒙i,𝒙j)=γ0+γ1d𝒙i𝒙jK^{p}(\text{\boldmath$x$}_{i},\text{\boldmath$x$}_{j})=\gamma_{0}+\frac{\gamma_{1}}{d}\text{\boldmath$x$}_{i}^{\top}\text{\boldmath$x$}_{j}. Thus, we have

𝐊p=γ0𝟏n𝟏n+γ1d𝑿𝑿,𝐊(p,2)=γ02𝟏n𝟏n+γ12d2𝑿𝑿\mathrm{\bf K}^{p}=\gamma_{0}\mathrm{\bf 1}_{n}\mathrm{\bf 1}_{n}^{\top}+\frac{\gamma_{1}}{d}\text{\boldmath$X$}\text{\boldmath$X$}^{\top},\qquad\mathrm{\bf K}^{(p,2)}=\gamma_{0}^{2}\mathrm{\bf 1}_{n}\mathrm{\bf 1}_{n}^{\top}+\frac{\gamma_{1}^{2}}{d^{2}}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}

where the second equality uses \mathbb{E}[\text{\boldmath$x$}]=\mathrm{\bf 0} and \mathbb{E}[\text{\boldmath$x$}\text{\boldmath$x$}^{\top}]=\mathrm{\bf I}_{d}.
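As a quick Monte Carlo sanity check of the expression for \mathrm{\bf K}^{(p,2)} (a sketch, not part of the argument; the values of \gamma_{0},\gamma_{1},d below are arbitrary), one may verify that \mathbb{E}_{\text{\boldmath$x$}}[K^{p}(\text{\boldmath$x$}_{i},\text{\boldmath$x$})K^{p}(\text{\boldmath$x$},\text{\boldmath$x$}_{j})]=\gamma_{0}^{2}+\frac{\gamma_{1}^{2}}{d^{2}}\text{\boldmath$x$}_{i}^{\top}\text{\boldmath$x$}_{j} for \text{\boldmath$x$}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d})):

import numpy as np

rng = np.random.default_rng(3)
d, g0, g1, M = 50, 0.3, 1.2, 200000
def sphere(m):
    # m i.i.d. points uniform on the sphere of radius sqrt(d)
    z = rng.standard_normal((m, d))
    return np.sqrt(d) * z / np.linalg.norm(z, axis=1, keepdims=True)
xi, xj = sphere(1)[0], sphere(1)[0]
x = sphere(M)
Kp_xi = g0 + (g1 / d) * (x @ xi)           # K^p(x_i, x) for every Monte Carlo sample
Kp_xj = g0 + (g1 / d) * (x @ xj)           # K^p(x, x_j)
print(np.mean(Kp_xi * Kp_xj), g0**2 + (g1**2 / d**2) * (xi @ xj))

The two printed numbers agree up to Monte Carlo error of order M^{-1/2}. We now break down the risk.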

R𝖯𝖱𝖱(λ¯)=EBiasp(λ¯)+EVarp(λ¯)+ECrossp(λ¯).\displaystyle R_{{\sf PRR}}(\overline{\lambda})=E_{\mathrm{Bias}}^{p}(\overline{\lambda})+E_{\mathrm{Var}}^{p}(\overline{\lambda})+E_{\text{\rm Cross}}^{p}(\overline{\lambda}).

The expressions for the bias/variance/cross terms are given below.

EBiasp\displaystyle E_{\mathrm{Bias}}^{p} =𝔼𝒙[(f(𝒙)𝐊p(,𝒙)(λ¯𝐈n+𝐊p)1𝒇)2]\displaystyle=\mathbb{E}_{\text{\boldmath$x$}}\Big{[}\big{(}f(\text{\boldmath$x$})-\mathrm{\bf K}^{p}(\cdot,\text{\boldmath$x$})^{\top}(\overline{\lambda}\mathrm{\bf I}_{n}+\mathrm{\bf K}^{p})^{-1}\text{\boldmath$f$}\big{)}^{2}\Big{]}
=fL222𝔼𝒙[f(𝒙)𝐊p(,𝒙)(λ¯𝐈n+𝐊p)1𝒇]+𝒇(λ¯𝐈n+𝐊p)1𝐊(p,2)(λ¯𝐈n+𝐊p)1𝒇\displaystyle=\|f\|_{L^{2}}^{2}-2\mathbb{E}_{\text{\boldmath$x$}}\Big{[}f(\text{\boldmath$x$})\mathrm{\bf K}^{p}(\cdot,\text{\boldmath$x$})^{\top}(\overline{\lambda}\mathrm{\bf I}_{n}+\mathrm{\bf K}^{p})^{-1}\text{\boldmath$f$}\Big{]}+\text{\boldmath$f$}^{\top}(\overline{\lambda}\mathrm{\bf I}_{n}+\mathrm{\bf K}^{p})^{-1}\mathrm{\bf K}^{(p,2)}(\overline{\lambda}\mathrm{\bf I}_{n}+\mathrm{\bf K}^{p})^{-1}\text{\boldmath$f$}
=fL222γ1d𝜷𝑿(λ¯𝐈n+𝐊p)1𝑿𝜷\displaystyle=\|f\|_{L^{2}}^{2}-2\frac{\gamma_{1}}{d}\text{\boldmath$\beta$}_{*}^{\top}\text{\boldmath$X$}^{\top}(\overline{\lambda}\mathrm{\bf I}_{n}+\mathrm{\bf K}^{p})^{-1}\text{\boldmath$X$}\text{\boldmath$\beta$}_{*}
+𝜷𝑿(λ¯𝐈n+𝐊p)1(γ02𝟏n𝟏n+γ12d2𝑿𝑿)(λ¯𝐈n+𝐊p)1𝑿𝜷=:fL222I1(1)+I1,1(2),\displaystyle+\text{\boldmath$\beta$}_{*}^{\top}\text{\boldmath$X$}^{\top}(\overline{\lambda}\mathrm{\bf I}_{n}+\mathrm{\bf K}^{p})^{-1}\Big{(}\gamma_{0}^{2}\mathrm{\bf 1}_{n}\mathrm{\bf 1}_{n}^{\top}+\frac{\gamma_{1}^{2}}{d^{2}}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\Big{)}(\overline{\lambda}\mathrm{\bf I}_{n}+\mathrm{\bf K}^{p})^{-1}\text{\boldmath$X$}\text{\boldmath$\beta$}_{*}=:\|f\|_{L^{2}}^{2}-2I_{1}^{(1)}+I_{1,1}^{(2)},
EVarp\displaystyle E_{\mathrm{Var}}^{p} =𝜺(λ¯𝐈n+𝐊p)1(γ02𝟏n𝟏n+γ12d2𝑿𝑿)(λ¯𝐈n+𝐊p)1𝜺=:Iε,ε(2),\displaystyle=\text{\boldmath$\varepsilon$}^{\top}(\overline{\lambda}\mathrm{\bf I}_{n}+\mathrm{\bf K}^{p})^{-1}\Big{(}\gamma_{0}^{2}\mathrm{\bf 1}_{n}\mathrm{\bf 1}_{n}^{\top}+\frac{\gamma_{1}^{2}}{d^{2}}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\Big{)}(\overline{\lambda}\mathrm{\bf I}_{n}+\mathrm{\bf K}^{p})^{-1}\text{\boldmath$\varepsilon$}=:I_{\varepsilon,\varepsilon}^{(2)},
ECrossp\displaystyle E_{\text{\rm Cross}}^{p} =γ1d𝜺(λ¯𝐈n+𝐊p)1𝑿𝜷𝜺(λ¯𝐈n+𝐊p)1(γ02𝟏n𝟏n+γ12d2𝑿𝑿)(λ¯𝐈n+𝐊p)1𝑿𝜷\displaystyle=\frac{\gamma_{1}}{d}\text{\boldmath$\varepsilon$}^{\top}(\overline{\lambda}\mathrm{\bf I}_{n}+\mathrm{\bf K}^{p})^{-1}\text{\boldmath$X$}\text{\boldmath$\beta$}_{*}-\text{\boldmath$\varepsilon$}^{\top}(\overline{\lambda}\mathrm{\bf I}_{n}+\mathrm{\bf K}^{p})^{-1}\Big{(}\gamma_{0}^{2}\mathrm{\bf 1}_{n}\mathrm{\bf 1}_{n}^{\top}+\frac{\gamma_{1}^{2}}{d^{2}}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\Big{)}(\overline{\lambda}\mathrm{\bf I}_{n}+\mathrm{\bf K}^{p})^{-1}\text{\boldmath$X$}\text{\boldmath$\beta$}_{*}
=:Iε(1)Iε,1(2).\displaystyle=:I_{\varepsilon}^{(1)}-I_{\varepsilon,1}^{(2)}.

To simplify these expressions, we notice that we can assume \text{\boldmath$\beta$}_{*} to be uniformly random on a sphere of given radius r:=\|\text{\boldmath$\beta$}_{*}\|_{2}, namely \text{\boldmath$\beta$}_{*}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(r)). To see why this is the case, write the risk as R_{{\sf PRR}}=R_{{\sf PRR}}(\text{\boldmath$x$}_{1},\ldots,\text{\boldmath$x$}_{n},\text{\boldmath$\beta$}_{*}) to emphasize its dependence on \text{\boldmath$X$} and \text{\boldmath$\beta$}_{*}. Let \mathrm{\bf R}\in\mathcal{O}(d) be a random orthogonal matrix with uniform (Haar) distribution. Then,

R𝖯𝖱𝖱(𝒙1,,𝒙n,𝜷)\displaystyle R_{{\sf PRR}}(\text{\boldmath$x$}_{1},\ldots,\text{\boldmath$x$}_{n},\text{\boldmath$\beta$}_{*}) =R𝖯𝖱𝖱(𝐑𝒙1,,𝐑𝒙n,𝐑𝜷)=dR𝖯𝖱𝖱(𝒙1,,𝒙n,𝐑𝜷).\displaystyle=R_{{\sf PRR}}(\mathrm{\bf R}\text{\boldmath$x$}_{1},\ldots,\mathrm{\bf R}\text{\boldmath$x$}_{n},\mathrm{\bf R}\text{\boldmath$\beta$}_{*})\stackrel{{\scriptstyle d}}{{=}}R_{{\sf PRR}}(\text{\boldmath$x$}_{1},\ldots,\text{\boldmath$x$}_{n},\mathrm{\bf R}\text{\boldmath$\beta$}_{*}).

The first equality follows from straightforward calculation, and the distributional equivalence =d\stackrel{{\scriptstyle d}}{{=}} holds because the distribution of 𝒙i\text{\boldmath$x$}_{i} is orthogonally invariant. Considering therefore, without loss of generality, 𝜷Unif(𝕊d1(r))\text{\boldmath$\beta$}_{*}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(r)), it is easy to further simplify the terms EBiasp,EVarp,ECrosspE_{\mathrm{Bias}}^{p},E_{\mathrm{Var}}^{p},E_{\text{\rm Cross}}^{p}. Indeed, we expect these terms to concentrate around their expectation with respect to 𝜷\text{\boldmath$\beta$}_{*}, 𝜺\varepsilon. This is formally stated in the next lemma.

Lemma 19 (Concentration to the trace).

We have

max{|I(1)ε|,|I(2)ε,1|}=Od,(τ2d),\displaystyle\max\big{\{}\big{|}I^{(1)}_{\varepsilon}\big{|},\big{|}I^{(2)}_{\varepsilon,1}\big{|}\big{\}}=O_{d,\mathbb{P}}\Big{(}\frac{\tau^{2}}{\sqrt{d}}\Big{)}, (151a)
|I1(1)γ1𝜷2d2Tr(𝑿𝐀1𝑿)|=Od,(τ2d),\displaystyle\Big{|}I_{1}^{(1)}-\frac{\gamma_{1}\|\text{\boldmath$\beta$}_{*}\|^{2}}{d^{2}}\mathrm{Tr}\Big{(}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\text{\boldmath$X$}\Big{)}\Big{|}=O_{d,\mathbb{P}}\Big{(}\frac{\tau^{2}}{\sqrt{d}}\Big{)}, (151b)
|I1,1(2)𝜷2dTr(𝑿𝐀1(γ02𝟏n𝟏n+γ12d2𝐗𝐗)𝐀1𝑿)|=Od,(τ2d),\displaystyle\Big{|}I_{1,1}^{(2)}-\frac{\|\text{\boldmath$\beta$}_{*}\|^{2}}{d}\mathrm{Tr}\Big{(}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\Big{(}\gamma_{0}^{2}\mathrm{\bf 1}_{n}\mathrm{\bf 1}_{n}^{\top}+\frac{\gamma_{1}^{2}}{d^{2}}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\Big{)}\mathrm{\bf A}^{-1}\text{\boldmath$X$}\Big{)}\Big{|}=O_{d,\mathbb{P}}\Big{(}\frac{\tau^{2}}{\sqrt{d}}\Big{)}, (151c)
|Iε,ε(2)σε2Tr(𝐀1(γ02𝟏n𝟏n+γ12d2𝐗𝐗)𝐀1)|=Od,(τ2d).\displaystyle\Big{|}I_{\varepsilon,\varepsilon}^{(2)}-\sigma_{\varepsilon}^{2}\mathrm{Tr}\Big{(}\mathrm{\bf A}^{-1}\Big{(}\gamma_{0}^{2}\mathrm{\bf 1}_{n}\mathrm{\bf 1}_{n}^{\top}+\frac{\gamma_{1}^{2}}{d^{2}}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\Big{)}\mathrm{\bf A}^{-1}\Big{)}\Big{|}=O_{d,\mathbb{P}}\Big{(}\frac{\tau^{2}}{\sqrt{d}}\Big{)}. (151d)

To further simplify the terms, we will show the following lemma.

Lemma 20.

We have

|Tr(𝐀1(γ02𝟏n𝟏n+γ12d2𝐗𝐗)𝐀1)γ12d2Tr(𝐀01𝐗𝐗𝐀01)|=Od,(1d),\displaystyle\Big{|}\mathrm{Tr}\Big{(}\mathrm{\bf A}^{-1}\big{(}\gamma_{0}^{2}\mathrm{\bf 1}_{n}\mathrm{\bf 1}_{n}^{\top}+\frac{\gamma_{1}^{2}}{d^{2}}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\big{)}\mathrm{\bf A}^{-1}\Big{)}-\frac{\gamma_{1}^{2}}{d^{2}}\mathrm{Tr}\Big{(}\mathrm{\bf A}_{0}^{-1}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf A}_{0}^{-1}\Big{)}\Big{|}=O_{d,\mathbb{P}}\Big{(}\frac{1}{d}\Big{)}, (152a)
|1dTr(𝑿𝐀1(γ02𝟏n𝟏n+γ12d2𝐗𝐗)𝐀1𝑿)γ12d3Tr(𝑿𝐀01𝐗𝐗𝐀01𝑿)|=Od,(1d),\displaystyle\Big{|}\frac{1}{d}\mathrm{Tr}\Big{(}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\big{(}\gamma_{0}^{2}\mathrm{\bf 1}_{n}\mathrm{\bf 1}_{n}^{\top}+\frac{\gamma_{1}^{2}}{d^{2}}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\big{)}\mathrm{\bf A}^{-1}\text{\boldmath$X$}\Big{)}-\frac{\gamma_{1}^{2}}{d^{3}}\mathrm{Tr}\Big{(}\text{\boldmath$X$}^{\top}\mathrm{\bf A}_{0}^{-1}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf A}_{0}^{-1}\text{\boldmath$X$}\Big{)}\Big{|}=O_{d,\mathbb{P}}\Big{(}\frac{1}{d}\Big{)}, (152b)
|γ1d2Tr(𝑿𝐀1𝑿)γ1d2Tr(𝑿𝐀01𝑿)|=Od,(1d).\displaystyle\Big{|}\frac{\gamma_{1}}{d^{2}}\mathrm{Tr}\Big{(}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\text{\boldmath$X$}\Big{)}-\frac{\gamma_{1}}{d^{2}}\mathrm{Tr}\Big{(}\text{\boldmath$X$}^{\top}\mathrm{\bf A}_{0}^{-1}\text{\boldmath$X$}\Big{)}\Big{|}=O_{d,\mathbb{P}}\Big{(}\frac{1}{d}\Big{)}. (152c)

Once these bounds are proved, by Lemma 23 we have

γ12d2Tr(𝐀01𝑿𝑿𝐀01)=γ12d2Tr(𝑿(λ¯𝐈d+γ1d𝑿𝑿)2𝑿)=1dTr(𝐒(γ𝐈d+𝐒)2),\displaystyle\frac{\gamma_{1}^{2}}{d^{2}}\mathrm{Tr}\Big{(}\mathrm{\bf A}_{0}^{-1}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf A}_{0}^{-1}\Big{)}=\frac{\gamma_{1}^{2}}{d^{2}}\mathrm{Tr}\Big{(}\text{\boldmath$X$}\big{(}\overline{\lambda}\mathrm{\bf I}_{d}+\frac{\gamma_{1}}{d}\text{\boldmath$X$}^{\top}\text{\boldmath$X$}\big{)}^{-2}\text{\boldmath$X$}^{\top}\Big{)}=\frac{1}{d}\mathrm{Tr}\Big{(}\mathrm{\bf S}\big{(}\gamma\mathrm{\bf I}_{d}+\mathrm{\bf S}\big{)}^{-2}\Big{)}, (153a)
γ12d3Tr(𝑿𝐀01𝑿𝑿𝐀01𝑿)=γ12d3Tr(𝑿𝑿(λ¯𝐈d+γ1d𝑿𝑿)2𝑿𝑿)=1dTr(𝐒2(γ𝐈d+𝐒)2),\displaystyle\frac{\gamma_{1}^{2}}{d^{3}}\mathrm{Tr}\Big{(}\text{\boldmath$X$}^{\top}\mathrm{\bf A}_{0}^{-1}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf A}_{0}^{-1}\text{\boldmath$X$}\Big{)}=\frac{\gamma_{1}^{2}}{d^{3}}\mathrm{Tr}\Big{(}\text{\boldmath$X$}^{\top}\text{\boldmath$X$}\big{(}\overline{\lambda}\mathrm{\bf I}_{d}+\frac{\gamma_{1}}{d}\text{\boldmath$X$}^{\top}\text{\boldmath$X$}\big{)}^{-2}\text{\boldmath$X$}^{\top}\text{\boldmath$X$}\Big{)}=\frac{1}{d}\mathrm{Tr}\Big{(}\mathrm{\bf S}^{2}\big{(}\gamma\mathrm{\bf I}_{d}+\mathrm{\bf S}\big{)}^{-2}\Big{)}, (153b)
γ1d2Tr(𝑿𝐀01𝑿)=γ1d2Tr(𝑿𝑿(λ¯𝐈d+γ1d𝑿𝑿)1)=1dTr(𝐒(γ𝐈d+𝐒)1).\displaystyle\frac{\gamma_{1}}{d^{2}}\mathrm{Tr}\Big{(}\text{\boldmath$X$}^{\top}\mathrm{\bf A}_{0}^{-1}\text{\boldmath$X$}\Big{)}=\frac{\gamma_{1}}{d^{2}}\mathrm{Tr}\Big{(}\text{\boldmath$X$}^{\top}\text{\boldmath$X$}\big{(}\overline{\lambda}\mathrm{\bf I}_{d}+\frac{\gamma_{1}}{d}\text{\boldmath$X$}^{\top}\text{\boldmath$X$}\big{)}^{-1}\Big{)}=\frac{1}{d}\mathrm{Tr}\Big{(}\mathrm{\bf S}\big{(}\gamma\mathrm{\bf I}_{d}+\mathrm{\bf S}\big{)}^{-1}\Big{)}. (153c)

where we recall \gamma:=\gamma_{\mbox{\tiny\rm eff}}(\lambda,\sigma)=\overline{\lambda}/\gamma_{1} and denote \mathrm{\bf S}=\text{\boldmath$X$}^{\top}\text{\boldmath$X$}/d.
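Here is a minimal numerical sketch of the first identity, which combines Lemma 23 with the substitution \gamma=\overline{\lambda}/\gamma_{1} (not part of the argument; the sizes and constants are arbitrary test values):

import numpy as np

rng = np.random.default_rng(9)
n, d, lam_bar, g1 = 200, 50, 1.3, 0.9
X = rng.standard_normal((n, d))
A0 = lam_bar * np.eye(n) + (g1 / d) * X @ X.T   # A_0 = lambda_bar I_n + (gamma1/d) X X^T
S = X.T @ X / d
gamma = lam_bar / g1
R = np.linalg.inv(gamma * np.eye(d) + S)
lhs = (g1**2 / d**2) * np.trace(np.linalg.inv(A0) @ X @ X.T @ np.linalg.inv(A0))
rhs = (1 / d) * np.trace(S @ R @ R)
print(lhs, rhs)                                  # equal up to floating-point error

With these identities in hand, we derive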

EBiasp\displaystyle E_{\mathrm{Bias}}^{p} =fL222𝜷2dTr(𝐒(γ𝐈d+𝐒)1)+𝜷2dTr(𝐒2(γ𝐈d+𝐒)2)+Od,(τ2/d)\displaystyle=\|f\|_{L^{2}}^{2}-\frac{2\|\text{\boldmath$\beta$}_{*}\|^{2}}{d}\mathrm{Tr}\Big{(}\mathrm{\bf S}(\gamma\mathrm{\bf I}_{d}+\mathrm{\bf S})^{-1}\Big{)}+\frac{\|\text{\boldmath$\beta$}_{*}\|^{2}}{d}\mathrm{Tr}\Big{(}\mathrm{\bf S}^{2}(\gamma\mathrm{\bf I}_{d}+\mathrm{\bf S})^{-2}\Big{)}+O_{d,\mathbb{P}}(\tau^{2}/\sqrt{d})
=𝜷2dTr(𝐈d)2𝜷2dTr(𝐒(γ𝐈d+𝐒)1)+𝜷2dTr(𝐒2(γ𝐈d+𝐒)2)+Od,(τ2/d)\displaystyle=\frac{\|\text{\boldmath$\beta$}_{*}\|^{2}}{d}\mathrm{Tr}\Big{(}\mathrm{\bf I}_{d}\Big{)}-\frac{2\|\text{\boldmath$\beta$}_{*}\|^{2}}{d}\mathrm{Tr}\Big{(}\mathrm{\bf S}(\gamma\mathrm{\bf I}_{d}+\mathrm{\bf S})^{-1}\Big{)}+\frac{\|\text{\boldmath$\beta$}_{*}\|^{2}}{d}\mathrm{Tr}\Big{(}\mathrm{\bf S}^{2}(\gamma\mathrm{\bf I}_{d}+\mathrm{\bf S})^{-2}\Big{)}+O_{d,\mathbb{P}}(\tau^{2}/\sqrt{d})
=γ2𝜷2dTr((γ𝐈d+𝐒)2)+Od,(τ2/d)\displaystyle=\frac{\gamma^{2}\|\text{\boldmath$\beta$}_{*}\|^{2}}{d}\mathrm{Tr}\Big{(}(\gamma\mathrm{\bf I}_{d}+\mathrm{\bf S})^{-2}\Big{)}+O_{d,\mathbb{P}}(\tau^{2}/\sqrt{d})

where the last equality is due to the identity (\gamma\mathrm{\bf I}_{d}+\mathrm{\bf S})^{2}-2\mathrm{\bf S}(\gamma\mathrm{\bf I}_{d}+\mathrm{\bf S})+\mathrm{\bf S}^{2}=\gamma^{2}\mathrm{\bf I}_{d}. Also,

EVarp=σε2dTr(𝐒(γ𝐈d+𝐒)2)+Od,(τ2/d),ECrossp=Od,(τ2/d).\displaystyle E_{\mathrm{Var}}^{p}=\frac{\sigma_{\varepsilon}^{2}}{d}\mathrm{Tr}\Big{(}\mathrm{\bf S}(\gamma\mathrm{\bf I}_{d}+\mathrm{\bf S})^{-2}\Big{)}+O_{d,\mathbb{P}}(\tau^{2}/\sqrt{d}),\qquad E_{\text{\rm Cross}}^{p}=O_{d,\mathbb{P}}(\tau^{2}/\sqrt{d}).

This proves that

R𝖯𝖱𝖱(λ¯)=γ2𝜷2dTr((γ𝐈d+𝐒)2)+σε2dTr(𝐒(γ𝐈d+𝐒)2)+Od,(τ2/d)\displaystyle R_{{\sf PRR}}(\overline{\lambda})=\frac{\gamma^{2}\|\text{\boldmath$\beta$}_{*}\|^{2}}{d}\mathrm{Tr}\Big{(}(\gamma\mathrm{\bf I}_{d}+\mathrm{\bf S})^{-2}\Big{)}+\frac{\sigma_{\varepsilon}^{2}}{d}\mathrm{Tr}\Big{(}\mathrm{\bf S}(\gamma\mathrm{\bf I}_{d}+\mathrm{\bf S})^{-2}\Big{)}+O_{d,\mathbb{P}}(\tau^{2}/\sqrt{d}) (154)

In analogy with the decomposition of the risk R_{{\sf PRR}}(\overline{\lambda}) and with Lemma 20, one can show that R_{\mbox{\tiny\rm lin}}(\gamma) admits a similar bias-variance decomposition and that its components concentrate around their respective traces, which yields Eqs. (25)-(27). Combining these facts with Theorem 3.3 and (154), we arrive at the claims of Corollary 3.2.
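As a numerical sanity check of the algebraic step behind (154) (a sketch under arbitrary test values of n, d and \gamma, not part of the proof), one may verify that \gamma^{2}\mathrm{Tr}((\gamma\mathrm{\bf I}_{d}+\mathrm{\bf S})^{-2})=\mathrm{Tr}(\mathrm{\bf I}_{d})-2\mathrm{Tr}(\mathrm{\bf S}(\gamma\mathrm{\bf I}_{d}+\mathrm{\bf S})^{-1})+\mathrm{Tr}(\mathrm{\bf S}^{2}(\gamma\mathrm{\bf I}_{d}+\mathrm{\bf S})^{-2}):

import numpy as np

rng = np.random.default_rng(4)
n, d, gamma = 120, 30, 0.8
X = rng.standard_normal((n, d))
S = X.T @ X / d
R = np.linalg.inv(gamma * np.eye(d) + S)   # (gamma I + S)^{-1}
lhs = gamma**2 * np.trace(R @ R)
rhs = d - 2 * np.trace(S @ R) + np.trace(S @ S @ R @ R)
print(lhs, rhs)                            # equal up to floating-point error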

To complete the proof, below we first prove Lemma 20 and then Lemma 19.

Proof of Lemma 20.

To show the claim (152a), first we notice that

0Tr(𝐀1γ02𝟏n𝟏n𝐀1)=γ0𝐀1𝟏n2Od,(1n)0\leq\mathrm{Tr}\Big{(}\mathrm{\bf A}^{-1}\gamma_{0}^{2}\mathrm{\bf 1}_{n}\mathrm{\bf 1}_{n}^{\top}\mathrm{\bf A}^{-1}\Big{)}=\big{\|}\gamma_{0}\mathrm{\bf A}^{-1}\mathrm{\bf 1}_{n}\big{\|}^{2}\leq O_{d,\mathbb{P}}\Big{(}\frac{1}{n}\Big{)}

where we used Lemma 17. Further,

|Tr(𝐀1γ12d2𝑿𝑿𝐀1)Tr(𝐀01γ12d2𝑿𝑿𝐀1)|\displaystyle\Big{|}\mathrm{Tr}\Big{(}\mathrm{\bf A}^{-1}\frac{\gamma_{1}^{2}}{d^{2}}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\Big{)}-\mathrm{Tr}\Big{(}\mathrm{\bf A}_{0}^{-1}\frac{\gamma_{1}^{2}}{d^{2}}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\Big{)}\Big{|} =|Tr(𝐀1γ0𝟏n𝟏n𝐀01γ12d2𝑿𝑿𝐀1)|\displaystyle=\Big{|}\mathrm{Tr}\Big{(}\mathrm{\bf A}^{-1}\gamma_{0}\mathrm{\bf 1}_{n}\mathrm{\bf 1}_{n}^{\top}\mathrm{\bf A}_{0}^{-1}\frac{\gamma_{1}^{2}}{d^{2}}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\Big{)}\Big{|}
γ0𝐀1𝟏nγ12d2𝐀1𝑿𝑿𝐀01𝟏n\displaystyle\leq\big{\|}\gamma_{0}\mathrm{\bf A}^{-1}\mathrm{\bf 1}_{n}\big{\|}\cdot\frac{\gamma_{1}^{2}}{d^{2}}\big{\|}\mathrm{\bf A}^{-1}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf A}_{0}^{-1}\mathrm{\bf 1}_{n}\big{\|}
(i)Cnγ12λ¯d2𝑿𝑿𝐀01opn\displaystyle\stackrel{{\scriptstyle(i)}}{{\leq}}\frac{C}{\sqrt{n}}\cdot\frac{\gamma_{1}^{2}}{\overline{\lambda}d^{2}}\cdot\big{\|}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf A}_{0}^{-1}\big{\|}_{\mathrm{op}}\cdot\sqrt{n}
Cγ1λ¯d=Od,(1d)\displaystyle\leq\frac{C\gamma_{1}}{\overline{\lambda}d}=O_{d,\mathbb{P}}\Big{(}\frac{1}{d}\Big{)}

where in (i) we used Lemma 17 and \|\mathrm{\bf A}^{-1}\|_{\mathrm{op}}\leq\overline{\lambda}^{-1}, and in the last inequality we used \frac{\gamma_{1}}{d}\|\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf A}_{0}^{-1}\|_{\mathrm{op}}\leq 1 from the proof of Lemma 18. Similarly, we have

|Tr(𝐀01γ12d2𝑿𝑿𝐀1)Tr(𝐀01γ12d2𝑿𝑿𝐀01)|=Od,(1d)\Big{|}\mathrm{Tr}\Big{(}\mathrm{\bf A}_{0}^{-1}\frac{\gamma_{1}^{2}}{d^{2}}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\Big{)}-\mathrm{Tr}\Big{(}\mathrm{\bf A}_{0}^{-1}\frac{\gamma_{1}^{2}}{d^{2}}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf A}_{0}^{-1}\Big{)}\Big{|}=O_{d,\mathbb{P}}\Big{(}\frac{1}{d}\Big{)}

which proves (152a). Next, we have

1d|Tr(𝑿𝐀1γ02𝟏n𝟏n𝐀1𝑿)|\displaystyle\frac{1}{d}\Big{|}\mathrm{Tr}\Big{(}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\gamma_{0}^{2}\mathrm{\bf 1}_{n}\mathrm{\bf 1}_{n}^{\top}\mathrm{\bf A}^{-1}\text{\boldmath$X$}\Big{)}\Big{|} 1dγ0𝑿𝐀1𝟏n21d𝑿op2γ0𝐀1𝟏n2\displaystyle\leq\frac{1}{d}\big{\|}\gamma_{0}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\mathrm{\bf 1}_{n}\big{\|}^{2}\leq\frac{1}{d}\big{\|}\text{\boldmath$X$}\big{\|}_{\mathrm{op}}^{2}\cdot\big{\|}\gamma_{0}\mathrm{\bf A}^{-1}\mathrm{\bf 1}_{n}\big{\|}^{2}
Od,(nd)γ0𝐀1𝟏n2Od,(1d)\displaystyle\leq O_{d,\mathbb{P}}\Big{(}\frac{n}{d}\Big{)}\cdot\big{\|}\gamma_{0}\mathrm{\bf A}^{-1}\mathrm{\bf 1}_{n}\big{\|}^{2}\leq O_{d,\mathbb{P}}\Big{(}\frac{1}{d}\Big{)}

where we used Lemma 17. Also,

1d|Tr(𝑿𝐀1γ12d2𝑿𝑿𝐀1𝑿)Tr(𝑿𝐀01γ12d2𝑿𝑿𝐀1𝑿)|\displaystyle\frac{1}{d}\Big{|}\mathrm{Tr}\Big{(}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\frac{\gamma_{1}^{2}}{d^{2}}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\text{\boldmath$X$}\Big{)}-\mathrm{Tr}\Big{(}\text{\boldmath$X$}^{\top}\mathrm{\bf A}_{0}^{-1}\frac{\gamma_{1}^{2}}{d^{2}}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\text{\boldmath$X$}\Big{)}\Big{|}
=γ12d3|Tr(𝑿𝐀1γ0𝟏n𝟏n𝐀01𝑿𝑿𝐀1𝑿)|\displaystyle=\frac{\gamma_{1}^{2}}{d^{3}}\Big{|}\mathrm{Tr}\Big{(}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\gamma_{0}\mathrm{\bf 1}_{n}\mathrm{\bf 1}_{n}^{\top}\mathrm{\bf A}_{0}^{-1}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\text{\boldmath$X$}\Big{)}\Big{|}
=γ12d3|Tr(𝐀1γ0𝟏n𝟏n𝐀01𝑿𝑿𝐀1𝑿𝑿)|\displaystyle=\frac{\gamma_{1}^{2}}{d^{3}}\Big{|}\mathrm{Tr}\Big{(}\mathrm{\bf A}^{-1}\gamma_{0}\mathrm{\bf 1}_{n}\mathrm{\bf 1}_{n}^{\top}\mathrm{\bf A}_{0}^{-1}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\Big{)}\Big{|}
1d3γ0𝐀1𝟏nγ1𝑿𝑿𝐀1opγ1𝑿𝑿𝐀01opn\displaystyle\leq\frac{1}{d^{3}}\big{\|}\gamma_{0}\mathrm{\bf A}^{-1}\mathrm{\bf 1}_{n}\big{\|}\cdot\|\gamma_{1}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\big{\|}_{\mathrm{op}}\cdot\big{\|}\gamma_{1}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf A}_{0}^{-1}\big{\|}_{\mathrm{op}}\cdot\sqrt{n}
Od,(1d)\displaystyle\leq O_{d,\mathbb{P}}\Big{(}\frac{1}{d}\Big{)}

where we used Lemmas 17 and 18 in the last inequality. Similarly, we can also prove

1d|Tr(𝑿𝐀1γ12d2𝑿𝑿𝐀01𝑿)Tr(𝑿𝐀01γ12d2𝑿𝑿𝐀01𝑿)|Od,(1d),\frac{1}{d}\Big{|}\mathrm{Tr}\Big{(}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\frac{\gamma_{1}^{2}}{d^{2}}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf A}_{0}^{-1}\text{\boldmath$X$}\Big{)}-\mathrm{Tr}\Big{(}\text{\boldmath$X$}^{\top}\mathrm{\bf A}_{0}^{-1}\frac{\gamma_{1}^{2}}{d^{2}}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf A}_{0}^{-1}\text{\boldmath$X$}\Big{)}\Big{|}\leq O_{d,\mathbb{P}}\Big{(}\frac{1}{d}\Big{)},

which proves (152b). Finally, we observe

Tr(𝑿𝐀1𝑿)Tr(𝑿𝐀01𝑿)\displaystyle\mathrm{Tr}\Big{(}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\text{\boldmath$X$}\Big{)}-\mathrm{Tr}\Big{(}\text{\boldmath$X$}^{\top}\mathrm{\bf A}_{0}^{-1}\text{\boldmath$X$}\Big{)} =Tr(𝑿𝐀1γ0𝟏n𝟏n𝐀01𝑿)\displaystyle=\mathrm{Tr}\Big{(}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\gamma_{0}\mathrm{\bf 1}_{n}\mathrm{\bf 1}_{n}^{\top}\mathrm{\bf A}_{0}^{-1}\text{\boldmath$X$}\Big{)}
=Tr(𝐀1γ0𝟏n𝟏n𝐀01𝑿𝑿).\displaystyle=\mathrm{Tr}\Big{(}\mathrm{\bf A}^{-1}\gamma_{0}\mathrm{\bf 1}_{n}\mathrm{\bf 1}_{n}^{\top}\mathrm{\bf A}_{0}^{-1}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\Big{)}.

Thus we get

γ1d2|Tr(𝑿𝐀1𝑿)Tr(𝑿𝐀01𝑿)|ndγ0𝐀1𝟏nγ1d𝐀01𝑿𝑿opOd,(1d).\displaystyle\frac{\gamma_{1}}{d^{2}}\Big{|}\mathrm{Tr}\Big{(}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\text{\boldmath$X$}\Big{)}-\mathrm{Tr}\Big{(}\text{\boldmath$X$}^{\top}\mathrm{\bf A}_{0}^{-1}\text{\boldmath$X$}\Big{)}\Big{|}\leq\frac{\sqrt{n}}{d}\big{\|}\gamma_{0}\mathrm{\bf A}^{-1}\mathrm{\bf 1}_{n}\big{\|}\cdot\big{\|}\frac{\gamma_{1}}{d}\mathrm{\bf A}_{0}^{-1}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\big{\|}_{\mathrm{op}}\leq O_{d,\mathbb{P}}\Big{(}\frac{1}{d}\Big{)}.

which proves (152c). ∎

Proof of Lemma 19.

Step 1. We will use the variance formulas for bilinear and quadratic forms from [MM21]: suppose \text{\boldmath$g$} is a vector with i.i.d. entries g_{i}\sim\mathbb{P}_{g} of zero mean and variance \sigma_{g}^{2}, and \text{\boldmath$h$} is an independent vector with i.i.d. entries h_{i}\sim\mathbb{P}_{h} of zero mean and variance \sigma_{h}^{2}. Then, for deterministic matrices \mathrm{\bf A},\mathrm{\bf B} of compatible dimensions,

Var[𝒈𝐀𝒉]=σg2σh2𝐀F2,Var[𝒈𝐁𝒈]=i=1nBii2(𝔼[g4]3σg4)+σg4𝐁F2+σg4Tr(𝐁2).\mathrm{Var}[\text{\boldmath$g$}^{\top}\mathrm{\bf A}\text{\boldmath$h$}]=\sigma_{g}^{2}\sigma_{h}^{2}\|\mathrm{\bf A}\|_{F}^{2},\qquad\mathrm{Var}[\text{\boldmath$g$}^{\top}\mathrm{\bf B}\text{\boldmath$g$}]=\sum_{i=1}^{n}B_{ii}^{2}(\mathbb{E}[g^{4}]-3\sigma_{g}^{4})+\sigma_{g}^{4}\|\mathrm{\bf B}\|_{F}^{2}+\sigma_{g}^{4}\mathrm{Tr}(\mathrm{\bf B}^{2}). (155)

In particular, if \mathrm{\bf B} is symmetric and \mathbb{E}[g_{i}^{4}]\leq C\sigma_{g}^{4}, then the second variance is bounded by O(\sigma_{g}^{4})\cdot\|\mathrm{\bf B}\|_{F}^{2}. In order to apply these identities, we use the Gaussian vector \text{\boldmath$g$}\sim\mathcal{N}(\mathrm{\bf 0},\frac{\|\text{\boldmath$\beta$}_{*}\|^{2}}{d}\mathrm{\bf I}_{d}) as a surrogate for \text{\boldmath$\beta$}_{*}\sim\mathrm{Unif}(\|\text{\boldmath$\beta$}_{*}\|\mathbb{S}^{d-1}), which only incurs a small error since the norm of a Gaussian vector concentrates in high dimensions. We write \widetilde{I}_{\varepsilon}^{(1)},\widetilde{I}_{1}^{(1)},\widetilde{I}_{\varepsilon,\varepsilon}^{(2)},\widetilde{I}_{1,1}^{(2)},\widetilde{I}_{\varepsilon,1}^{(2)} for the surrogate terms obtained by replacing \text{\boldmath$\beta$}_{*} with \text{\boldmath$g$} in the definitions of I_{\varepsilon}^{(1)},I_{1}^{(1)},I_{\varepsilon,\varepsilon}^{(2)},I_{1,1}^{(2)},I_{\varepsilon,1}^{(2)}. By homogeneity, we may assume \|\text{\boldmath$\beta$}_{*}\|=\sigma_{\varepsilon}=1 without loss of generality.
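As a quick Monte Carlo illustration of the first identity in (155) (a sketch, not part of the argument; the entries are standard normal, so \sigma_{g}=\sigma_{h}=1, and \mathrm{\bf A} is an arbitrary test matrix):

import numpy as np

rng = np.random.default_rng(5)
n, M = 20, 100000
A = rng.standard_normal((n, n))
g = rng.standard_normal((M, n))
h = rng.standard_normal((M, n))
vals = np.einsum('mi,ij,mj->m', g, A, h)         # g^T A h for each of the M samples
print(vals.var(), np.linalg.norm(A, 'fro')**2)   # agree up to Monte Carlo error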

Using Lemma 18, we have

Var[I~ε(1)]Cγ1d3𝐀1𝑿F2=Od,(1d),Var[I~1(1)]Cγ12d4𝑿𝐀1𝑿F2=Od,(1d).\displaystyle\mathrm{Var}\big{[}\widetilde{I}_{\varepsilon}^{(1)}\big{]}\leq\frac{C\gamma_{1}}{d^{3}}\big{\|}\mathrm{\bf A}^{-1}\text{\boldmath$X$}\big{\|}_{F}^{2}=O_{d,\mathbb{P}}\Big{(}\frac{1}{d}\Big{)},\qquad\mathrm{Var}\big{[}\widetilde{I}_{1}^{(1)}\big{]}\leq\frac{C\gamma_{1}^{2}}{d^{4}}\big{\|}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\text{\boldmath$X$}\big{\|}_{F}^{2}=O_{d,\mathbb{P}}\Big{(}\frac{1}{d}\Big{)}.

Also, using Lemma 18 and Lemma 20,

1d𝐀1γ02𝟏n𝟏n𝐀1𝑿F2\displaystyle\frac{1}{d}\big{\|}\mathrm{\bf A}^{-1}\gamma_{0}^{2}\mathrm{\bf 1}_{n}\mathrm{\bf 1}_{n}^{\top}\mathrm{\bf A}^{-1}\text{\boldmath$X$}\big{\|}_{F}^{2} =1dγ0𝐀1𝟏n2γ0𝑿𝐀1𝟏n21dγ0𝐀1𝟏n4𝑿op2\displaystyle=\frac{1}{d}\big{\|}\gamma_{0}\mathrm{\bf A}^{-1}\mathrm{\bf 1}_{n}\big{\|}^{2}\cdot\big{\|}\gamma_{0}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\mathrm{\bf 1}_{n}\big{\|}^{2}\leq\frac{1}{d}\big{\|}\gamma_{0}\mathrm{\bf A}^{-1}\mathrm{\bf 1}_{n}\|^{4}\cdot\big{\|}\text{\boldmath$X$}\big{\|}_{\mathrm{op}}^{2}
Od,(1dn2)Od,(n)=Od,(1dn).\displaystyle\leq O_{d,\mathbb{P}}\Big{(}\frac{1}{dn^{2}}\Big{)}\cdot O_{d,\mathbb{P}}(n)=O_{d,\mathbb{P}}\Big{(}\frac{1}{dn}\Big{)}.
1d𝐀1γ12d2𝑿𝑿𝐀1𝑿F2\displaystyle\frac{1}{d}\big{\|}\mathrm{\bf A}^{-1}\frac{\gamma_{1}^{2}}{d^{2}}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\text{\boldmath$X$}\big{\|}_{F}^{2} 1d5γ1𝐀1𝑿op2γ1𝑿𝐀1𝑿F2=Od,(1d).\displaystyle\leq\frac{1}{d^{5}}\big{\|}\gamma_{1}\mathrm{\bf A}^{-1}\text{\boldmath$X$}\big{\|}_{\mathrm{op}}^{2}\cdot\big{\|}\gamma_{1}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\text{\boldmath$X$}\big{\|}_{F}^{2}=O_{d,\mathbb{P}}\Big{(}\frac{1}{d}\Big{)}.

Combining the two bounds proves

Var[I~ε,1(2)]Cd𝐀1(γ02𝟏n𝟏n+γ12d2𝑿𝑿)𝐀1𝑿F2=Od,(1d).\mathrm{Var}\big{[}\widetilde{I}_{\varepsilon,1}^{(2)}\big{]}\leq\frac{C}{d}\Big{\|}\mathrm{\bf A}^{-1}\big{(}\gamma_{0}^{2}\mathrm{\bf 1}_{n}\mathrm{\bf 1}_{n}^{\top}+\frac{\gamma_{1}^{2}}{d^{2}}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\big{)}\mathrm{\bf A}^{-1}\text{\boldmath$X$}\big{\|}_{F}^{2}=O_{d,\mathbb{P}}\Big{(}\frac{1}{d}\Big{)}.

Similar to this, we derive

𝐀1γ02𝟏n𝟏n𝐀1F2=γ0𝐀1𝟏n4=Od,(1n2)Od,(1d2),\displaystyle\big{\|}\mathrm{\bf A}^{-1}\gamma_{0}^{2}\mathrm{\bf 1}_{n}\mathrm{\bf 1}_{n}^{\top}\mathrm{\bf A}^{-1}\big{\|}_{F}^{2}=\big{\|}\gamma_{0}\mathrm{\bf A}^{-1}\mathrm{\bf 1}_{n}\big{\|}^{4}=O_{d,\mathbb{P}}\Big{(}\frac{1}{n^{2}}\Big{)}\leq O_{d,\mathbb{P}}\Big{(}\frac{1}{d^{2}}\Big{)},
𝐀1γ12d2𝑿𝑿𝐀1F2=1d4γ1𝐀1𝑿F2γ1𝐀1𝑿op2=Od,(1d),\displaystyle\big{\|}\mathrm{\bf A}^{-1}\frac{\gamma_{1}^{2}}{d^{2}}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\big{\|}_{F}^{2}=\frac{1}{d^{4}}\big{\|}\gamma_{1}\mathrm{\bf A}^{-1}\text{\boldmath$X$}\big{\|}_{F}^{2}\cdot\big{\|}\gamma_{1}\mathrm{\bf A}^{-1}\text{\boldmath$X$}\big{\|}_{\mathrm{op}}^{2}=O_{d,\mathbb{P}}\Big{(}\frac{1}{d}\Big{)},
1d2𝑿𝐀1γ02𝟏n𝟏n𝐀1𝑿F2=1d2γ0𝑿𝐀1𝟏n41d2𝑿op2γ0𝐀1𝟏n4Od,(1n2),\displaystyle\frac{1}{d^{2}}\big{\|}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\gamma_{0}^{2}\mathrm{\bf 1}_{n}\mathrm{\bf 1}_{n}^{\top}\mathrm{\bf A}^{-1}\text{\boldmath$X$}\big{\|}_{F}^{2}=\frac{1}{d^{2}}\big{\|}\gamma_{0}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\mathrm{\bf 1}_{n}\big{\|}^{4}\leq\frac{1}{d^{2}}\|\text{\boldmath$X$}\|_{\mathrm{op}}^{2}\cdot\big{\|}\gamma_{0}\mathrm{\bf A}^{-1}\mathrm{\bf 1}_{n}\big{\|}^{4}\leq O_{d,\mathbb{P}}\Big{(}\frac{1}{n^{2}}\Big{)},
1d2𝑿𝐀1γ12d2𝑿𝑿𝐀1𝑿F2=1d6γ1𝑿𝐀1𝑿F2γ1𝑿𝐀1𝑿op2=Od,(1d).\displaystyle\frac{1}{d^{2}}\big{\|}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\frac{\gamma_{1}^{2}}{d^{2}}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\text{\boldmath$X$}\big{\|}_{F}^{2}=\frac{1}{d^{6}}\big{\|}\gamma_{1}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\text{\boldmath$X$}\big{\|}_{F}^{2}\cdot\big{\|}\gamma_{1}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\text{\boldmath$X$}\big{\|}_{\mathrm{op}}^{2}=O_{d,\mathbb{P}}\Big{(}\frac{1}{d}\Big{)}.

Combining these bounds gives

Var[I~ε,ε(2)]C𝐀1(γ02𝟏n𝟏n+γ12d2𝑿𝑿)𝐀1F2=Od,(1d),\displaystyle\mathrm{Var}\big{[}\widetilde{I}_{\varepsilon,\varepsilon}^{(2)}\big{]}\leq C\big{\|}\mathrm{\bf A}^{-1}\big{(}\gamma_{0}^{2}\mathrm{\bf 1}_{n}\mathrm{\bf 1}_{n}^{\top}+\frac{\gamma_{1}^{2}}{d^{2}}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\big{)}\mathrm{\bf A}^{-1}\big{\|}_{F}^{2}=O_{d,\mathbb{P}}\Big{(}\frac{1}{d}\Big{)},
Var[I~1,1(2)]Cd2𝑿𝐀1(γ02𝟏n𝟏n+γ12d2𝑿𝑿)𝐀1𝑿F2=Od,(1d).\displaystyle\mathrm{Var}\big{[}\widetilde{I}_{1,1}^{(2)}\big{]}\leq\frac{C}{d^{2}}\big{\|}\text{\boldmath$X$}^{\top}\mathrm{\bf A}^{-1}\big{(}\gamma_{0}^{2}\mathrm{\bf 1}_{n}\mathrm{\bf 1}_{n}^{\top}+\frac{\gamma_{1}^{2}}{d^{2}}\text{\boldmath$X$}\text{\boldmath$X$}^{\top}\big{)}\mathrm{\bf A}^{-1}\text{\boldmath$X$}\big{\|}_{F}^{2}=O_{d,\mathbb{P}}\Big{(}\frac{1}{d}\Big{)}.

Step 2. Finally, to prove the desired results, we connect I~ε(1),I~1(1),I~ε,ε(2),I~1,1(2),I~ε,1(2)\widetilde{I}_{\varepsilon}^{(1)},\widetilde{I}_{1}^{(1)},\widetilde{I}_{\varepsilon,\varepsilon}^{(2)},\widetilde{I}_{1,1}^{(2)},\widetilde{I}_{\varepsilon,1}^{(2)} back to Iε(1),I1(1),Iε,ε(2),I1,1(2),Iε,1(2)I_{\varepsilon}^{(1)},I_{1}^{(1)},I_{\varepsilon,\varepsilon}^{(2)},I_{1,1}^{(2)},I_{\varepsilon,1}^{(2)}. By definition, 𝜷\text{\boldmath$\beta$}_{*} and 𝜷𝒈/𝒈\|\text{\boldmath$\beta$}_{*}\|\text{\boldmath$g$}/\|\text{\boldmath$g$}\| have the same distribution, so there is no loss of generality by writing

Iε(1)=𝜷𝒈I~ε(1),I1(1)=𝜷2𝒈2I~1(1),I1,1(2)=𝜷2𝒈2I~1,1(2),Iε,1(2)=𝜷𝒈I~ε,1(2).I_{\varepsilon}^{(1)}=\frac{\|\text{\boldmath$\beta$}_{*}\|}{\|\text{\boldmath$g$}\|}\widetilde{I}_{\varepsilon}^{(1)},\quad I_{1}^{(1)}=\frac{\|\text{\boldmath$\beta$}_{*}\|^{2}}{\|\text{\boldmath$g$}\|^{2}}\widetilde{I}_{1}^{(1)},\quad I_{1,1}^{(2)}=\frac{\|\text{\boldmath$\beta$}_{*}\|^{2}}{\|\text{\boldmath$g$}\|^{2}}\widetilde{I}_{1,1}^{(2)},\quad I_{\varepsilon,1}^{(2)}=\frac{\|\text{\boldmath$\beta$}_{*}\|}{\|\text{\boldmath$g$}\|}\widetilde{I}_{\varepsilon,1}^{(2)}.

And also I~ε,ε(2)=Iε,ε(2)\widetilde{I}_{\varepsilon,\varepsilon}^{(2)}=I_{\varepsilon,\varepsilon}^{(2)} because there is no 𝜷\text{\boldmath$\beta$}_{*} involved. By concentration of Gaussian vector norms [Ver10, vH14], we have 𝒈/𝜷=1+Od,(d1/2)\|\text{\boldmath$g$}\|/\|\text{\boldmath$\beta$}_{*}\|=1+O_{d,\mathbb{P}}(d^{-1/2}). We observe that

𝔼𝜺,𝒈[I~ε(1)]=0,Var𝜺,𝒈[I~ε(1)]=Od,(1d)|I~ε(1)|=Od,(1d)\displaystyle\mathbb{E}_{\text{\boldmath$\varepsilon$},\text{\boldmath$g$}}[\widetilde{I}_{\varepsilon}^{(1)}]=0,\quad\mathrm{Var}_{\text{\boldmath$\varepsilon$},\text{\boldmath$g$}}[\widetilde{I}_{\varepsilon}^{(1)}]=O_{d,\mathbb{P}}\Big{(}\frac{1}{d}\Big{)}\qquad\Rightarrow\qquad\big{|}\widetilde{I}_{\varepsilon}^{(1)}\big{|}=O_{d,\mathbb{P}}\Big{(}\frac{1}{\sqrt{d}}\Big{)}

by Chebyshev’s inequality, and thus \big{|}I_{\varepsilon}^{(1)}\big{|}=O_{d,\mathbb{P}}(d^{-1/2}). The same argument applies to the term I_{\varepsilon,1}^{(2)}, which leads to the bound \big{|}I_{\varepsilon,1}^{(2)}\big{|}=O_{d,\mathbb{P}}(d^{-1/2}). Also, by a similar argument,

\displaystyle\frac{\|\text{\boldmath$\beta$}_{*}\|^{2}}{\|\text{\boldmath$g$}\|^{2}}\Big{|}\widetilde{I}_{1}^{(1)}-\mathbb{E}_{\text{\boldmath$g$}}[\widetilde{I}_{1}^{(1)}]\Big{|}=O_{d,\mathbb{P}}\Big{(}\frac{1}{\sqrt{d}}\Big{)},\qquad\frac{\|\text{\boldmath$\beta$}_{*}\|^{2}}{\|\text{\boldmath$g$}\|^{2}}\Big{|}\widetilde{I}_{1,1}^{(2)}-\mathbb{E}_{\text{\boldmath$g$}}[\widetilde{I}_{1,1}^{(2)}]\Big{|}=O_{d,\mathbb{P}}\Big{(}\frac{1}{\sqrt{d}}\Big{)},
|Iε,ε(2)𝔼𝜺[Iε,ε(2)]|=Od,(1d).\displaystyle\Big{|}I_{\varepsilon,\varepsilon}^{(2)}-\mathbb{E}_{\text{\boldmath$\varepsilon$}}[I_{\varepsilon,\varepsilon}^{(2)}]\Big{|}=O_{d,\mathbb{P}}\Big{(}\frac{1}{\sqrt{d}}\Big{)}.

Therefore, we deduce

I1(1)=𝜷2𝒈2𝔼𝒈[I~1(1)]+Od,(1d),I1,1(2)=𝜷2𝒈2𝔼𝒈[I~1,1(2)]+Od,(1d),\displaystyle I_{1}^{(1)}=\frac{\|\text{\boldmath$\beta$}_{*}\|^{2}}{\|\text{\boldmath$g$}\|^{2}}\mathbb{E}_{\text{\boldmath$g$}}[\widetilde{I}_{1}^{(1)}]+O_{d,\mathbb{P}}\Big{(}\frac{1}{\sqrt{d}}\Big{)},\qquad I_{1,1}^{(2)}=\frac{\|\text{\boldmath$\beta$}_{*}\|^{2}}{\|\text{\boldmath$g$}\|^{2}}\mathbb{E}_{\text{\boldmath$g$}}[\widetilde{I}_{1,1}^{(2)}]+O_{d,\mathbb{P}}\Big{(}\frac{1}{\sqrt{d}}\Big{)},
Iε,ε(2)=𝔼𝜺[Iε,ε(2)]+Od,(1d).\displaystyle I_{\varepsilon,\varepsilon}^{(2)}=\mathbb{E}_{\text{\boldmath$\varepsilon$}}[I_{\varepsilon,\varepsilon}^{(2)}]+O_{d,\mathbb{P}}\Big{(}\frac{1}{\sqrt{d}}\Big{)}.

Note that 𝒈2/𝜷2=1+Od,(d1/2)\|\text{\boldmath$g$}\|^{2}/\|\text{\boldmath$\beta$}_{*}\|^{2}=1+O_{d,\mathbb{P}}(d^{-1/2}) and 𝔼𝒈[I~1(1)],𝔼𝒈[I~1,1(2)],𝔼𝜺[Iε,ε(2)]\mathbb{E}_{\text{\boldmath$g$}}[\widetilde{I}_{1}^{(1)}],\mathbb{E}_{\text{\boldmath$g$}}[\widetilde{I}_{1,1}^{(2)}],\mathbb{E}_{\text{\boldmath$\varepsilon$}}[I_{\varepsilon,\varepsilon}^{(2)}] are exactly the trace quantities in the statement of Lemma 19. Also note that |𝔼𝒈[I~1(1)]|,|𝔼𝒈[I~1,1(2)]|,|𝔼𝜺[Iε,ε(2)]||\mathbb{E}_{\text{\boldmath$g$}}[\widetilde{I}_{1}^{(1)}]|,|\mathbb{E}_{\text{\boldmath$g$}}[\widetilde{I}_{1,1}^{(2)}]|,|\mathbb{E}_{\text{\boldmath$\varepsilon$}}[I_{\varepsilon,\varepsilon}^{(2)}]| are bounded by a constant due to Lemma 20 and Eqs. (153a)–(153c). This completes the proof.

Appendix E Supporting lemmas and proofs

E.1 Basic lemmas

Lemma 21.

Fix any constant ε0(0,0.1)\varepsilon_{0}\in(0,0.1). With very high probability, 𝐗op(1+ε0)(n+d)\|\text{\boldmath$X$}\|_{\mathrm{op}}\leq(1+\varepsilon_{0})(\sqrt{n}+\sqrt{d}), σmin(𝐗)(n(1+ε0)d)/(1+ε0)\sigma_{\min}(\text{\boldmath$X$})\geq(\sqrt{n}-(1+\varepsilon_{0})\sqrt{d})/(1+\varepsilon_{0}), and 𝐗𝟏n(1+ε0)nd\|\text{\boldmath$X$}^{\top}\mathrm{\bf 1}_{n}\|\leq(1+\varepsilon_{0})\sqrt{nd}.

Proof of Lemma 21.

Let \widetilde{\text{\boldmath$X$}}=[\widetilde{\text{\boldmath$x$}}_{1},\ldots,\widetilde{\text{\boldmath$x$}}_{n}]^{\top}\in\mathbb{R}^{n\times d} be a matrix with i.i.d. standard normal entries, and let \widetilde{\mathrm{\bf D}}_{\text{\boldmath$X$}}=d^{-1/2}\mathrm{diag}\{\|\widetilde{\text{\boldmath$x$}}_{1}\|,\ldots,\|\widetilde{\text{\boldmath$x$}}_{n}\|\}. By [Ver10, Corollary 5.35], \|\widetilde{\text{\boldmath$X$}}\|_{\mathrm{op}}\leq(1+\varepsilon_{0}/2)(\sqrt{n}+\sqrt{d}) with probability at least 1-2\exp(-c(n+d)). By sub-Gaussian concentration, d^{-1/2}\|\widetilde{\text{\boldmath$x$}}_{i}\|\in(1-\varepsilon_{0}/10,1+\varepsilon_{0}/10) with probability at least 1-2\exp(-cd). Taking a union bound, we deduce that with probability at least 1-Cn\exp(-cd),

\displaystyle\big{\|}\text{\boldmath$X$}\big{\|}_{\mathrm{op}} =\big{\|}\widetilde{\mathrm{\bf D}}_{\text{\boldmath$X$}}^{-1}\widetilde{\text{\boldmath$X$}}\big{\|}_{\mathrm{op}}\leq\big{\|}\widetilde{\text{\boldmath$X$}}\big{\|}_{\mathrm{op}}/\lambda_{\min}(\widetilde{\mathrm{\bf D}}_{\text{\boldmath$X$}})<(1+\varepsilon_{0})(\sqrt{n}+\sqrt{d}),
\sigma_{\min}(\text{\boldmath$X$}) =\sigma_{\min}\big{(}\widetilde{\mathrm{\bf D}}_{\text{\boldmath$X$}}^{-1}\widetilde{\text{\boldmath$X$}}\big{)}\geq\sigma_{\min}(\widetilde{\text{\boldmath$X$}})/\lambda_{\max}(\widetilde{\mathrm{\bf D}}_{\text{\boldmath$X$}})>(\sqrt{n}-(1+\varepsilon_{0})\sqrt{d})/(1+\varepsilon_{0}),
\|\text{\boldmath$X$}^{\top}\mathrm{\bf 1}_{n}\| =\|\widetilde{\text{\boldmath$X$}}^{\top}\widetilde{\mathrm{\bf D}}_{\text{\boldmath$X$}}^{-1}\mathrm{\bf 1}_{n}\|\leq\|\widetilde{\text{\boldmath$X$}}^{\top}\mathrm{\bf 1}_{n}\|/\lambda_{\min}(\widetilde{\mathrm{\bf D}}_{\text{\boldmath$X$}})<(1+\varepsilon_{0})\sqrt{nd},

where we used concentration of Gaussian vector norms [Ver10, vH14] in the last inequality. This finishes the proof. ∎
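The following Python sketch (an empirical illustration only; the choice n=4000, d=200 is arbitrary) compares the three quantities in Lemma 21 with their bounds for a matrix whose rows are uniform on \mathbb{S}^{d-1}(\sqrt{d}):

import numpy as np

rng = np.random.default_rng(6)
n, d = 4000, 200
Z = rng.standard_normal((n, d))
X = np.sqrt(d) * Z / np.linalg.norm(Z, axis=1, keepdims=True)   # rows on the sphere
s = np.linalg.svd(X, compute_uv=False)
print(s[0], np.sqrt(n) + np.sqrt(d))                # operator norm vs. its upper bound
print(s[-1], np.sqrt(n) - np.sqrt(d))               # smallest singular value vs. sqrt(n)-sqrt(d)
print(np.linalg.norm(X.T @ np.ones(n)), np.sqrt(n * d))   # ||X^T 1_n|| vs. sqrt(nd)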

Lemma 22.

Let a>0a>0 be any constant, 𝐱𝕊d1(d)\text{\boldmath$x$}^{\prime}\in\mathbb{S}^{d-1}(\sqrt{d}) be fixed, and 𝐱Unif(𝕊d1(d))\text{\boldmath$x$}\sim\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d})). Then |𝐱,𝐱|adlogd|\langle\text{\boldmath$x$},\text{\boldmath$x$}^{\prime}\rangle|\leq a\sqrt{d}\log d holds with very high probability.

Proof of Lemma 22.

By orthogonal invariance, we assume without loss of generality that \text{\boldmath$x$}^{\prime}=\sqrt{d}\,\text{\boldmath$e$}_{1}. It suffices to prove that |\langle\text{\boldmath$x$},\text{\boldmath$e$}_{1}\rangle|\leq a\log d holds with very high probability. Note that \text{\boldmath$x$}\stackrel{{\scriptstyle d}}{{=}}\sqrt{d}\,\text{\boldmath$g$}/\|\text{\boldmath$g$}\| where \text{\boldmath$g$}\sim\mathcal{N}(0,\mathrm{\bf I}_{d}). From basic concentration results [Ver10], we know that \|\text{\boldmath$g$}\|/\sqrt{d}\geq 1/2 and |\langle\text{\boldmath$g$},\text{\boldmath$e$}_{1}\rangle|\leq a\log d/2 hold with very high probability, and on this event |\langle\text{\boldmath$x$},\text{\boldmath$e$}_{1}\rangle|=\sqrt{d}\,|\langle\text{\boldmath$g$},\text{\boldmath$e$}_{1}\rangle|/\|\text{\boldmath$g$}\|\leq a\log d. This proves the desired result. ∎
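An empirical illustration (not part of the proof; d and the number of draws are arbitrary): for independent uniform points on \mathbb{S}^{d-1}(\sqrt{d}), even the maximum of |\langle\text{\boldmath$x$},\text{\boldmath$x$}^{\prime}\rangle| over many draws stays well below \sqrt{d}\log d.

import numpy as np

rng = np.random.default_rng(7)
d, M = 500, 10000
Z = rng.standard_normal((M, d))
X = np.sqrt(d) * Z / np.linalg.norm(Z, axis=1, keepdims=True)   # M points on the sphere
xprime = X[0]
ips = X[1:] @ xprime
print(np.max(np.abs(ips)), np.sqrt(d) * np.log(d))   # max inner product vs. the bound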

Lemma 23.

Let 𝐁m×q\mathrm{\bf B}\in\mathbb{R}^{m\times q} be a matrix and a>0a>0. Then,

(𝐁𝐁+a𝐈m)1𝐁=𝐁(𝐁𝐁+a𝐈q)1\big{(}\mathrm{\bf B}\mathrm{\bf B}^{\top}+a\mathrm{\bf I}_{m}\big{)}^{-1}\mathrm{\bf B}=\mathrm{\bf B}\big{(}\mathrm{\bf B}^{\top}\mathrm{\bf B}+a\mathrm{\bf I}_{q}\big{)}^{-1} (156)

This simple lemma (often called the push-through identity) is straightforward to check: starting from 𝐁(𝐁𝐁+a𝐈q)=(𝐁𝐁+a𝐈m)𝐁\mathrm{\bf B}\big{(}\mathrm{\bf B}^{\top}\mathrm{\bf B}+a\mathrm{\bf I}_{q}\big{)}=\big{(}\mathrm{\bf B}\mathrm{\bf B}^{\top}+a\mathrm{\bf I}_{m}\big{)}\mathrm{\bf B}, left-multiply by (𝐁𝐁+a𝐈m)1\big{(}\mathrm{\bf B}\mathrm{\bf B}^{\top}+a\mathrm{\bf I}_{m}\big{)}^{-1} and right-multiply by (𝐁𝐁+a𝐈q)1\big{(}\mathrm{\bf B}^{\top}\mathrm{\bf B}+a\mathrm{\bf I}_{q}\big{)}^{-1}.
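For concreteness, the identity (156) can also be verified numerically on a random instance (a minimal sketch; the dimensions and the value of a are arbitrary):

```python
import numpy as np

# Numerical check of the push-through identity (156) on a random instance.
rng = np.random.default_rng(0)
m, q, a = 7, 4, 0.3
B = rng.normal(size=(m, q))

lhs = np.linalg.solve(B @ B.T + a * np.eye(m), B)      # (B B^T + a I_m)^{-1} B
rhs = B @ np.linalg.inv(B.T @ B + a * np.eye(q))       # B (B^T B + a I_q)^{-1}
print("max entrywise difference:", np.abs(lhs - rhs).max())   # ~ 1e-15
```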

Lemma 24.

Let \mathcal{H} be an L2L^{2} space and ,\langle\cdot,\cdot\rangle_{\mathcal{H}} be the associated inner product. If f,fm,g,gmf,f_{m},g,g_{m}\in\mathcal{H} for m1m\geq 1, and fmL2ff_{m}\xrightarrow{L^{2}}f, gmL2gg_{m}\xrightarrow{L^{2}}g, then fm,gmmf,g\langle f_{m},g_{m}\rangle_{\mathcal{H}}\xrightarrow{m\to\infty}\langle f,g\rangle_{\mathcal{H}}.

This is a standard continuity property of the L2L^{2} inner product: writing fm,gmf,g=fmf,gm+f,gmg\langle f_{m},g_{m}\rangle_{\mathcal{H}}-\langle f,g\rangle_{\mathcal{H}}=\langle f_{m}-f,g_{m}\rangle_{\mathcal{H}}+\langle f,g_{m}-g\rangle_{\mathcal{H}} and applying the Cauchy-Schwarz inequality yields the claim, since supmgm<\sup_{m}\|g_{m}\|_{\mathcal{H}}<\infty.

E.2 An operator norm bound

Recall that Qk(d)Q_{k}^{(d)} is the degree-kk Gegenbauer polynomial in dd dimensions. The next lemma states that when the degree is high, the Gegenbauer inner-product kernels are very close to the identity matrices. Throughout this subsection, we assume that d+1nd^{\ell+1}\geq n.

Proposition E.1.

Denote 𝐐k=(Qk(d)(𝐱i,𝐱j))i,jn\text{\boldmath$Q$}_{k}=(Q_{k}^{(d)}(\langle\text{\boldmath$x$}_{i},\text{\boldmath$x$}_{j}\rangle))_{i,j\leq n}. Then, there exists a constant C>0C>0 such that with very high probability,

supk>𝑸k𝐈nopCn(logd)Cd+1.\displaystyle\sup_{k>\ell}\big{\|}\text{\boldmath$Q$}_{k}-\mathrm{\bf I}_{n}\big{\|}_{\mathrm{op}}\leq C\sqrt{\frac{n(\log d)^{C}}{d^{\ell+1}}}.

This lemma is a quantitative refinement of [GMMM21, Prop. 3]. Unlike the moment method employed in [GMMM21, Prop. 3], the proof of this lemma uses a specific matrix concentration inequality, namely the matrix version of Freedman’s inequality for martingales, first established by [T+11] (see also [Oli09]).
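As a quick illustration of Proposition E.1, the following minimal numerical sketch computes the deviation of the Gegenbauer kernel matrix from the identity for points sampled uniformly on the sphere. It assumes the normalization Q_k^{(d)}(s) = C_k^{((d-2)/2)}(s/d) / C_k^{((d-2)/2)}(1), so that Q_k^{(d)}(d) = 1 consistently with Q_ii = 1 below; the values of n, d, k are arbitrary illustrative choices, and the printed reference scale omits the polylogarithmic factor.

```python
import numpy as np
from scipy.special import eval_gegenbauer

# Illustration of Proposition E.1: for k large, the Gegenbauer kernel matrix is close to I_n.
# Assumed normalization: Q_k(s) = C_k^{(d-2)/2}(s/d) / C_k^{(d-2)/2}(1).
rng = np.random.default_rng(0)
d, n, k = 50, 200, 3                                            # here n <= d^k

X = rng.normal(size=(n, d))
X = np.sqrt(d) * X / np.linalg.norm(X, axis=1, keepdims=True)   # x_i ~ Unif(S^{d-1}(sqrt(d)))

alpha = (d - 2) / 2.0
cosines = X @ X.T / d                                           # <x_i, x_j> / d
Q = eval_gegenbauer(k, alpha, cosines) / eval_gegenbauer(k, alpha, 1.0)

deviation = np.linalg.norm(Q - np.eye(n), 2)                    # operator norm
print(f"||Q_k - I_n||_op = {deviation:.3f},  reference scale sqrt(n/d^k) = {np.sqrt(n / d**k):.3f}")
```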

Theorem E.1.

Consider a matrix-valued martingale {𝐘i:i=0,1,}\{\text{\boldmath$Y$}_{i}:i=0,1,\ldots\} with respect to the filtration (i)i=0(\mathcal{F}_{i})_{i=0}^{\infty}. The values of 𝐘i\text{\boldmath$Y$}_{i} are symmetric matrices with dimension nn. Let 𝐙i=𝐘i𝐘i1\text{\boldmath$Z$}_{i}=\text{\boldmath$Y$}_{i}-\text{\boldmath$Y$}_{i-1} for all i1i\geq 1. Assume that λmax(𝐙i)L¯\lambda_{\max}(\text{\boldmath$Z$}_{i})\leq\overline{L} almost surely for all i1i\geq 1. Then, for all t0t\geq 0 and v0v\geq 0,

(k0:λmax(𝒀k)tand𝐕kopv)nexp(t2/2v+L¯t/3)\mathbb{P}\Big{(}\exists\,k\geq 0:\lambda_{\max}(\text{\boldmath$Y$}_{k})\geq t~{}\text{and}~{}\|\mathrm{\bf V}_{k}\|_{\mathrm{op}}\leq v\Big{)}\leq n\cdot\exp\left(-\frac{t^{2}/2}{v+\overline{L}t/3}\right)

where 𝐕i\mathrm{\bf V}_{i} is defined as 𝐕i=j=1i𝔼[𝐙j2|j1]\mathrm{\bf V}_{i}=\sum_{j=1}^{i}\mathbb{E}[\text{\boldmath$Z$}_{j}^{2}|\mathcal{F}_{j-1}].
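Before using this inequality, here is a toy numerical sketch of the bound for a particularly simple matrix martingale, namely Y_k = Σ_{i≤k} ε_i A_i with i.i.d. Rademacher signs ε_i and fixed symmetric matrices A_i; in this case Σ_i 𝔼[Z_i²|ℱ_{i-1}] = Σ_i A_i² is deterministic. This is an illustration under these simplifying assumptions only, and all parameter values are arbitrary.

```python
import numpy as np

# Toy illustration of matrix Freedman: Y_k = sum_{i<=k} eps_i A_i with Rademacher eps_i and
# fixed symmetric A_i, so E[Z_i | F_{i-1}] = 0 and sum_i E[Z_i^2 | F_{i-1}] = sum_i A_i^2.
rng = np.random.default_rng(0)
n, T = 20, 200

A = []
for _ in range(T):
    M = rng.normal(size=(n, n))
    A.append((M + M.T) / (2 * np.sqrt(n * T)))                  # small fixed symmetric increments

Lbar = max(np.abs(np.linalg.eigvalsh(M)).max() for M in A)      # a.s. bound on lambda_max(Z_i)
v = np.linalg.eigvalsh(sum(M @ M for M in A)).max()             # ||sum_i E[Z_i^2 | F_{i-1}]||_op

def lambda_max_of_Y():
    eps = rng.choice([-1.0, 1.0], size=T)
    return np.linalg.eigvalsh(sum(e * M for e, M in zip(eps, A))).max()

t = 2.5
samples = np.array([lambda_max_of_Y() for _ in range(2000)])
print("empirical P(lambda_max(Y_n) >= t):", (samples >= t).mean(),
      " Freedman bound:", n * np.exp(-(t**2 / 2) / (v + Lbar * t / 3)))
```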

First, we fix k>k>\ell and treat it as a constant. We also suppress the dependence on kk if there is no confusion. Define a filtration (i)i=0n(\mathcal{F}_{i})_{i=0}^{n} and random matrices (𝒁i)i=1n(\text{\boldmath$Z$}_{i})_{i=1}^{n} as follows. We define j\mathcal{F}_{j} to be the σ\sigma-algebra generated by 𝒙1,,𝒙j\text{\boldmath$x$}_{1},\ldots,\text{\boldmath$x$}_{j}; particularly, 0\mathcal{F}_{0} is the trivial σ\sigma-algebra. We also define the truncated version of 𝑸Q.

𝑸¯=(Qijξij𝟏{ij})i,jn,whereξij=𝟏{|𝒙i,𝒙j|dlogd}\overline{\text{\boldmath$Q$}}=\big{(}Q_{ij}\xi_{ij}\mathrm{\bf 1}\{i\neq j\}\big{)}_{i,j\leq n},\qquad\text{where}~{}\xi_{ij}=\mathrm{\bf 1}\{|\langle\text{\boldmath$x$}_{i},\text{\boldmath$x$}_{j}\rangle|\leq\sqrt{d}\log d\} (157)

By a simple concentration result (Lemma 22), we have, with very high probability,

maxij|𝒙i,𝒙j|dlogd.\max_{i\neq j}|\langle\text{\boldmath$x$}_{i},\text{\boldmath$x$}_{j}\rangle|\leq\sqrt{d}\,\log d.

Note Qii=1Q_{ii}=1 always holds. So 𝑸¯=𝑸𝐈n\overline{\text{\boldmath$Q$}}=\text{\boldmath$Q$}-\mathrm{\bf I}_{n} holds with very high probability. Define

𝒁i:=𝔼[𝑸¯|i]𝔼[𝑸¯|i1].\text{\boldmath$Z$}_{i}:=\mathbb{E}[\overline{\text{\boldmath$Q$}}|\mathcal{F}_{i}]-\mathbb{E}[\overline{\text{\boldmath$Q$}}|\mathcal{F}_{i-1}].

In other words, (𝔼[𝑸¯|i])i=1n(\mathbb{E}[\overline{\text{\boldmath$Q$}}|\mathcal{F}_{i}])_{i=1}^{n} is a Doob martingale with respect to the filtration (i)i=0n(\mathcal{F}_{i})_{i=0}^{n}, and (𝒁i)i=1n(\text{\boldmath$Z$}_{i})_{i=1}^{n} is the resulting martingale difference sequence. Note that trivially 𝑸¯=𝔼[𝑸¯|n]\overline{\text{\boldmath$Q$}}=\mathbb{E}[\overline{\text{\boldmath$Q$}}|\mathcal{F}_{n}], 𝔼[𝑸¯]=𝔼[𝑸¯|0]\mathbb{E}[\overline{\text{\boldmath$Q$}}]=\mathbb{E}[\overline{\text{\boldmath$Q$}}|\mathcal{F}_{0}]. Using this definition, we can express 𝑸¯\overline{\text{\boldmath$Q$}} as

𝑸¯=𝔼[𝑸¯]+i=1n𝒁i.\overline{\text{\boldmath$Q$}}=\mathbb{E}[\overline{\text{\boldmath$Q$}}]+\sum_{i=1}^{n}\text{\boldmath$Z$}_{i}. (158)

By construction, 𝔼[𝑸]=𝐈n\mathbb{E}[\text{\boldmath$Q$}]=\mathrm{\bf I}_{n}. It is not difficult to show that 𝔼[𝑸¯]opCdk\big{\|}\mathbb{E}[\overline{\text{\boldmath$Q$}}]\big{\|}_{\mathrm{op}}\leq Cd^{-k}; see Lemma 25 below. Crucially, we claim that the matrix 𝒁i\text{\boldmath$Z$}_{i} is a rank-22 matrix plus a small perturbation (namely, a matrix with very small operator norm). The perturbation term is due to the effect of truncation.
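The rank-two structure of Z_i can be seen concretely in the following minimal sketch. For illustration only, it replaces the Gegenbauer kernel by the degree-one kernel ⟨x_i, x_j⟩/d, whose conditional expectations vanish exactly, so that no truncation is needed and the perturbation term Δ_i disappears (a simplifying assumption made purely for this illustration).

```python
import numpy as np

# Illustration of the Doob martingale decomposition (158)-(159) with the degree-one kernel
# K_ij = <x_i, x_j>/d, for which E_{x_m}[K_jm] = 0 exactly, so Delta_i = 0 here.
rng = np.random.default_rng(0)
n, d = 8, 30
X = rng.normal(size=(n, d))
X = np.sqrt(d) * X / np.linalg.norm(X, axis=1, keepdims=True)

Kbar = X @ X.T / d
np.fill_diagonal(Kbar, 0.0)                            # off-diagonal part; E[Kbar] = 0

def cond_exp(K, i):
    # E[Kbar | F_i]: entries (j, m) with j, m <= i are revealed; the rest average to zero.
    E = np.zeros_like(K)
    E[:i, :i] = K[:i, :i]
    return E

Z = [cond_exp(Kbar, i) - cond_exp(Kbar, i - 1) for i in range(1, n + 1)]
print("rank of each Z_i:", [np.linalg.matrix_rank(Zi) for Zi in Z])    # at most 2
print("max error of sum_i Z_i - Kbar:", np.abs(sum(Z) - Kbar).max())   # ~ 0
```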

Lemma 25.

Let k+1k\geq\ell+1 be a constant. Then we have the decomposition

𝒁i=𝒆i𝒗i+𝒗i𝒆i+𝚫i,where(𝒗i)j=Q¯ij𝟏{j<i}.\text{\boldmath$Z$}_{i}=\text{\boldmath$e$}_{i}\text{\boldmath$v$}_{i}^{\top}+\text{\boldmath$v$}_{i}\text{\boldmath$e$}_{i}^{\top}+\text{\boldmath$\Delta$}_{i},\qquad\text{where}~{}(\text{\boldmath$v$}_{i})_{j}=\overline{Q}_{ij}\mathrm{\bf 1}\{j<i\}. (159)

Here 𝚫in×n\text{\boldmath$\Delta$}_{i}\in\mathbb{R}^{n\times n} is a certain matrix that satisfies 𝚫iopCdk\|\text{\boldmath$\Delta$}_{i}\|_{\mathrm{op}}\leq Cd^{-k} almost surely, and 𝐯i\text{\boldmath$v$}_{i} satisfies 𝐯iCn(logd)2kdk\|\text{\boldmath$v$}_{i}\|\leq C\sqrt{\frac{n(\log d)^{2k}}{d^{k}}} almost surely. In addition, 𝔼[𝐐¯]opCdk\|\mathbb{E}[\overline{\text{\boldmath$Q$}}]\|_{\mathrm{op}}\leq Cd^{-k}.

Proof of Lemma 25.

In this proof, we will use the identities

𝔼𝒙[Q(𝒙,𝒙)]=0,𝔼𝒙[Q(𝒙,𝒙)2]=B(d,k)1,\mathbb{E}_{\text{\boldmath$x$}}\big{[}Q(\langle\text{\boldmath$x$},\text{\boldmath$x$}^{\prime}\rangle)\big{]}=0,\qquad\mathbb{E}_{\text{\boldmath$x$}}\big{[}Q(\langle\text{\boldmath$x$},\text{\boldmath$x$}^{\prime}\rangle)^{2}\big{]}=B(d,k)^{-1}, (160)

where 𝒙,𝒙i.i.d.Unif(𝕊d1(d))\text{\boldmath$x$},\text{\boldmath$x$}^{\prime}\sim_{\mathrm{i.i.d.}}\mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d})). These identities can be derived from (41).

Let j,m[n]j,m\in[n] be indices. Since 𝒁i\text{\boldmath$Z$}_{i} is symmetric, there is no loss of generality in considering only j<mj<m (note that (𝒁i)jj=0(\text{\boldmath$Z$}_{i})_{jj}=0 if j=mj=m). If m<im<i, then Q¯jm\overline{Q}_{jm} is i1\mathcal{F}_{i-1}-measurable, so [𝒁i]jm=0[\text{\boldmath$Z$}_{i}]_{jm}=0. Next we consider m>im>i. Observe that

|𝔼𝒙m[Q(𝒙j,𝒙m)𝟏{|𝒙j,𝒙m|C1dlogd}]|\displaystyle~{}~{}~{}~{}\Big{|}\mathbb{E}_{\text{\boldmath$x$}_{m}}\big{[}Q(\langle\text{\boldmath$x$}_{j},\text{\boldmath$x$}_{m}\rangle)\mathrm{\bf 1}\big{\{}|\langle\text{\boldmath$x$}_{j},\text{\boldmath$x$}_{m}\rangle|\leq C_{1}\sqrt{d}\,\log d\big{\}}\big{]}\Big{|}
=(i)|𝔼𝒙m[Q(𝒙j,𝒙m)𝟏{|𝒙j,𝒙m|>C1dlogd}]|\displaystyle\stackrel{{\scriptstyle(i)}}{{=}}\Big{|}\mathbb{E}_{\text{\boldmath$x$}_{m}}\big{[}Q(\langle\text{\boldmath$x$}_{j},\text{\boldmath$x$}_{m}\rangle)\mathrm{\bf 1}\big{\{}|\langle\text{\boldmath$x$}_{j},\text{\boldmath$x$}_{m}\rangle|>C_{1}\sqrt{d}\,\log d\big{\}}\big{]}\Big{|}
(ii){𝔼𝒙m[Q(𝒙j,𝒙m)2]}1/2{𝒙m(|𝒙j,𝒙m|>C1dlogd)}1/2\displaystyle\stackrel{{\scriptstyle(ii)}}{{\leq}}\Big{\{}\mathbb{E}_{\text{\boldmath$x$}_{m}}\big{[}Q(\langle\text{\boldmath$x$}_{j},\text{\boldmath$x$}_{m}\rangle)^{2}\big{]}\Big{\}}^{1/2}\cdot\Big{\{}\mathbb{P}_{\text{\boldmath$x$}_{m}}\big{(}|\langle\text{\boldmath$x$}_{j},\text{\boldmath$x$}_{m}\rangle|>C_{1}\sqrt{d}\,\log d\big{)}\Big{\}}^{1/2}
(iii)B(d,k)1/2Od(d2k)=Od(d2k),\displaystyle\stackrel{{\scriptstyle(iii)}}{{\leq}}B(d,k)^{-1/2}\cdot O_{d}(d^{-2k})=O_{d}(d^{-2k}),

where in (i) we used (160), in (ii) we used the Cauchy-Schwarz inequality, and in (iii) we used (160) and Lemma 22. This implies

𝔼[Qjmξjm|i]=Od(d2k),𝔼[Qjmξjm|i1]=Od(d2k).\mathbb{E}\big{[}Q_{jm}\xi_{jm}\big{|}\mathcal{F}_{i}\big{]}=O_{d}(d^{-2k}),\qquad\mathbb{E}\big{[}Q_{jm}\xi_{jm}\big{|}\mathcal{F}_{i-1}\big{]}=O_{d}(d^{-2k}). (161)

Together, these bounds imply |[𝒁i]jm|=Od(d2k)|[\text{\boldmath$Z$}_{i}]_{jm}|=O_{d}(d^{-2k}). Finally, consider m=im=i. A similar argument gives

𝔼[Qjmξjm|i]=Qjmξjm=Q¯jm,𝔼[Qjmξjm|i1]=Od(d2k).\mathbb{E}\big{[}Q_{jm}\xi_{jm}\big{|}\mathcal{F}_{i}\big{]}=Q_{jm}\xi_{jm}=\overline{Q}_{jm},\qquad\mathbb{E}\big{[}Q_{jm}\xi_{jm}\big{|}\mathcal{F}_{i-1}\big{]}=O_{d}(d^{-2k}).

Combining the three cases, we derive the expression (159). The residual satisfies

𝚫iopn𝚫imax=nOd(d2k)Od(dk)\big{\|}\text{\boldmath$\Delta$}_{i}\big{\|}_{\mathrm{op}}\leq n\big{\|}\text{\boldmath$\Delta$}_{i}\big{\|}_{\max}=n\cdot O_{d}(d^{-2k})\leq O_{d}(d^{-k})

where the last inequality is due to dknd^{k}\geq n. Note that (161) also implies

𝔼[𝑸¯]opnmaxjm|𝔼[Qjmξj,m]|Od(nd2k)Od(dk).\|\mathbb{E}[\overline{\text{\boldmath$Q$}}]\|_{\mathrm{op}}\leq n\cdot\max_{jm}\big{|}\mathbb{E}\big{[}Q_{jm}\xi_{j,m}\big{]}\big{|}\leq O_{d}(nd^{-2k})\leq O_{d}(d^{-k}).

By the property (44) of the Gegenbauer polynomials, we know that the coefficients of B(d,k)1/2Q()B(d,k)^{1/2}Q(\cdot) are of constant order, so we have the deterministic bound

|B(d,k)1/2Q(𝒙i,𝒙j)|ξij(1+od(1))hk(𝒙i,𝒙jd)ξijOd((logd)k).\big{|}B(d,k)^{1/2}Q(\langle\text{\boldmath$x$}_{i},\text{\boldmath$x$}_{j}\rangle)\big{|}\xi_{ij}\leq(1+o_{d}(1))\cdot h_{k}\Big{(}\frac{\langle\text{\boldmath$x$}_{i},\text{\boldmath$x$}_{j}\rangle}{\sqrt{d}}\Big{)}\xi_{ij}\leq O_{d}\big{(}(\log d)^{k}\big{)}.

Since B(d,k)=(1+od(1))dk/k!B(d,k)=(1+o_{d}(1))\cdot d^{k}/k!, this gives maxij|Q¯ij|C(logd)2k/dk\max_{ij}|\overline{Q}_{ij}|\leq C\sqrt{(\log d)^{2k}/d^{k}} and thus

𝒗iimaxj<i|Q¯ij|Cn(logd)2kdk.\|\text{\boldmath$v$}_{i}\|\leq\sqrt{i}\cdot\max_{j<i}|\overline{Q}_{ij}|\leq C\sqrt{\frac{n(\log d)^{2k}}{d^{k}}}.

This completes the proof. ∎

Before applying the matrix concentration inequality to the sum i=1n𝒁i\sum_{i=1}^{n}\text{\boldmath$Z$}_{i}, let us define

L:=maxin𝒁iop,V:=i=1n𝔼[𝒁i2|i1]op.L:=\max_{i\leq n}\|\text{\boldmath$Z$}_{i}\|_{\mathrm{op}},\qquad V:=\Big{\|}\sum_{i=1}^{n}\mathbb{E}[\text{\boldmath$Z$}_{i}^{2}|\mathcal{F}_{i-1}]\Big{\|}_{\mathrm{op}}.

Using Lemma 25, we obtain deterministic bounds on LL and VV, as stated below.

Lemma 26.

The following holds almost surely.

maxin𝒁iopCn(logd)2kdk,\displaystyle\max_{i\leq n}\|\text{\boldmath$Z$}_{i}\|_{\mathrm{op}}\leq C\sqrt{\frac{n(\log d)^{2k}}{d^{k}}}, (162)
i=1n𝔼[𝒁i2|i1]opCn(logd)2kdk+nB(d,k)𝑸𝐈nop.\displaystyle\Big{\|}\sum_{i=1}^{n}\mathbb{E}[\text{\boldmath$Z$}_{i}^{2}|\mathcal{F}_{i-1}]\Big{\|}_{\mathrm{op}}\leq C\frac{n(\log d)^{2k}}{d^{k}}+\frac{n}{B(d,k)}\|\text{\boldmath$Q$}-\mathrm{\bf I}_{n}\|_{\mathrm{op}}. (163)
Proof of Lemma 26.

The first inequality follows directly from Lemma 25:

maxin𝒁iop2maxin𝒗i+maxin𝚫iopCn(logd)2kdk+CdkCn(logd)2kdk.\max_{i\leq n}\|\text{\boldmath$Z$}_{i}\|_{\mathrm{op}}\leq 2\max_{i\leq n}\|\text{\boldmath$v$}_{i}\|+\max_{i\leq n}\|\text{\boldmath$\Delta$}_{i}\|_{\mathrm{op}}\leq C\sqrt{\frac{n(\log d)^{2k}}{d^{k}}}+Cd^{-k}\leq C\sqrt{\frac{n(\log d)^{2k}}{d^{k}}}.

To prove the second inequality, we use the decomposition (159).

𝔼[𝒁i2|i1]\displaystyle\mathbb{E}\big{[}\text{\boldmath$Z$}_{i}^{2}|\mathcal{F}_{i-1}\big{]} =𝔼[𝒗i𝒗i|i1]+𝔼[𝒆i𝒆i𝒗i2|i1]+𝚫~i\displaystyle=\mathbb{E}\big{[}\text{\boldmath$v$}_{i}\text{\boldmath$v$}_{i}^{\top}\big{|}\mathcal{F}_{i-1}\big{]}+\mathbb{E}\big{[}\text{\boldmath$e$}_{i}\text{\boldmath$e$}_{i}^{\top}\|\text{\boldmath$v$}_{i}\|^{2}\big{|}\mathcal{F}_{i-1}\big{]}+\widetilde{\text{\boldmath$\Delta$}}_{i}

where 𝚫~in×n\widetilde{\text{\boldmath$\Delta$}}_{i}\in\mathbb{R}^{n\times n} is given by

𝚫~i=𝔼[𝒗i𝒆i𝚫i+𝒆i𝒗i𝚫i+𝚫i𝒗i𝒆i+𝚫i𝒆i𝒗i+𝚫i2|i1].\displaystyle\widetilde{\text{\boldmath$\Delta$}}_{i}=\mathbb{E}\big{[}\text{\boldmath$v$}_{i}\text{\boldmath$e$}_{i}^{\top}\text{\boldmath$\Delta$}_{i}+\text{\boldmath$e$}_{i}\text{\boldmath$v$}_{i}^{\top}\text{\boldmath$\Delta$}_{i}+\text{\boldmath$\Delta$}_{i}\text{\boldmath$v$}_{i}\text{\boldmath$e$}_{i}^{\top}+\text{\boldmath$\Delta$}_{i}\text{\boldmath$e$}_{i}\text{\boldmath$v$}_{i}^{\top}+\text{\boldmath$\Delta$}_{i}^{2}\big{|}\mathcal{F}_{i-1}\big{]}.

Note that we used 𝒆i𝒗i=0\text{\boldmath$e$}_{i}^{\top}\text{\boldmath$v$}_{i}=0 in the above calculation. Using the deterministic bounds on 𝒗i\|\text{\boldmath$v$}_{i}\| and 𝚫iop\|\text{\boldmath$\Delta$}_{i}\|_{\mathrm{op}}, we get

𝚫~iop4𝚫iop𝒗i+𝚫iop2Cdkn(logd)2kdk+Cd2k.\|\widetilde{\text{\boldmath$\Delta$}}_{i}\|_{\mathrm{op}}\leq 4\|\text{\boldmath$\Delta$}_{i}\|_{\mathrm{op}}\cdot\|\text{\boldmath$v$}_{i}\|+\|\text{\boldmath$\Delta$}_{i}\|_{\mathrm{op}}^{2}\leq\frac{C}{d^{k}}\sqrt{\frac{n(\log d)^{2k}}{d^{k}}}+\frac{C}{d^{2k}}.

We also note that in𝔼[𝒆i𝒆i𝒗i2|i1]\sum_{i\leq n}\mathbb{E}\big{[}\text{\boldmath$e$}_{i}\text{\boldmath$e$}_{i}^{\top}\|\text{\boldmath$v$}_{i}\|^{2}\big{|}\mathcal{F}_{i-1}\big{]} is a diagonal matrix with its diagonal entries bounded by maxin𝒗i2\max_{i\leq n}\|\text{\boldmath$v$}_{i}\|^{2}. Thus, we get

in𝔼[𝒁i2|i1]op\displaystyle\Big{\|}\sum_{i\leq n}\mathbb{E}\big{[}\text{\boldmath$Z$}_{i}^{2}|\mathcal{F}_{i-1}\big{]}\Big{\|}_{\mathrm{op}} nmaxin𝔼[𝒗i𝒗i|i1]op+maxin𝒗i2+nmaxin𝚫~iop\displaystyle\leq n\max_{i\leq n}\Big{\|}\mathbb{E}\big{[}\text{\boldmath$v$}_{i}\text{\boldmath$v$}_{i}^{\top}\big{|}\mathcal{F}_{i-1}\big{]}\Big{\|}_{\mathrm{op}}+\max_{i\leq n}\|\text{\boldmath$v$}_{i}\|^{2}+n\max_{i\leq n}\|\widetilde{\text{\boldmath$\Delta$}}_{i}\|_{\mathrm{op}}
nmaxin𝔼[𝒗i𝒗i|i1]op+Cn(logd)2kdk+Cndkn(logd)2kdk+Cnd2k\displaystyle\leq n\max_{i\leq n}\Big{\|}\mathbb{E}\big{[}\text{\boldmath$v$}_{i}\text{\boldmath$v$}_{i}^{\top}\big{|}\mathcal{F}_{i-1}\big{]}\Big{\|}_{\mathrm{op}}+\frac{Cn(\log d)^{2k}}{d^{k}}+\frac{Cn}{d^{k}}\sqrt{\frac{n(\log d)^{2k}}{d^{k}}}+\frac{Cn}{d^{2k}}
nmaxin𝔼[𝒗i𝒗i|i1]op+Cn(logd)2kdk\displaystyle\leq n\max_{i\leq n}\Big{\|}\mathbb{E}\big{[}\text{\boldmath$v$}_{i}\text{\boldmath$v$}_{i}^{\top}\big{|}\mathcal{F}_{i-1}\big{]}\Big{\|}_{\mathrm{op}}+\frac{Cn(\log d)^{2k}}{d^{k}} (164)

where the last inequality is because

ndkn(logd)2kdkndkn(logd)2kdkn(logd)2kdk,nd2kn(logd)2kdk,\frac{n}{d^{k}}\sqrt{\frac{n(\log d)^{2k}}{d^{k}}}\leq\sqrt{\frac{n}{d^{k}}}\cdot\sqrt{\frac{n(\log d)^{2k}}{d^{k}}}\leq\frac{n(\log d)^{2k}}{d^{k}},\qquad\frac{n}{d^{2k}}\leq\frac{n(\log d)^{2k}}{d^{k}},

due to the assumption dknd^{k}\geq n. To further simplify (164), we note that

𝔼[𝒗i𝒗i|i1]jm=0,if mior ji.\displaystyle\mathbb{E}\big{[}\text{\boldmath$v$}_{i}\text{\boldmath$v$}_{i}^{\top}\big{|}\mathcal{F}_{i-1}\big{]}_{jm}=0,\qquad\text{if }m\geq i~{}\text{or }j\geq i.

Now we consider the case where m<im<i and j<ij<i.

𝔼[𝒗i𝒗i|i1]jm\displaystyle\mathbb{E}\big{[}\text{\boldmath$v$}_{i}\text{\boldmath$v$}_{i}^{\top}\big{|}\mathcal{F}_{i-1}\big{]}_{jm} =𝔼𝒙i[Q(𝒙i,𝒙j)Q(𝒙i,𝒙m)ξijξim]\displaystyle=\mathbb{E}_{\text{\boldmath$x$}_{i}}\big{[}Q(\langle\text{\boldmath$x$}_{i},\text{\boldmath$x$}_{j}\rangle)Q(\langle\text{\boldmath$x$}_{i},\text{\boldmath$x$}_{m}\rangle)\xi_{ij}\xi_{im}\big{]}
=𝔼𝒙i[Q(𝒙i,𝒙j)Q(𝒙i,𝒙m)]𝔼𝒙i[Q(𝒙i,𝒙j)Q(𝒙i,𝒙m)(1ξijξim)]\displaystyle=\mathbb{E}_{\text{\boldmath$x$}_{i}}\big{[}Q(\langle\text{\boldmath$x$}_{i},\text{\boldmath$x$}_{j}\rangle)Q(\langle\text{\boldmath$x$}_{i},\text{\boldmath$x$}_{m}\rangle)\big{]}-\mathbb{E}_{\text{\boldmath$x$}_{i}}\big{[}Q(\langle\text{\boldmath$x$}_{i},\text{\boldmath$x$}_{j}\rangle)Q(\langle\text{\boldmath$x$}_{i},\text{\boldmath$x$}_{m}\rangle)(1-\xi_{ij}\xi_{im})\big{]}
=1B(d,k)Q(𝒙j,𝒙m)𝔼𝒙i[QijQim(1ξijξim)]\displaystyle=\frac{1}{B(d,k)}Q(\langle\text{\boldmath$x$}_{j},\text{\boldmath$x$}_{m}\rangle)-\mathbb{E}_{\text{\boldmath$x$}_{i}}\big{[}Q_{ij}Q_{im}(1-\xi_{ij}\xi_{im})\big{]}

where we used the identity (40) in the last equality. We write 1ξijξim=(1ξij)ξim+1ξim1-\xi_{ij}\xi_{im}=(1-\xi_{ij})\xi_{im}+1-\xi_{im} and use the Cauchy-Schwarz inequality to derive

|𝔼𝒙i[QijQim(1ξijξim)]|\displaystyle\Big{|}\mathbb{E}_{\text{\boldmath$x$}_{i}}\big{[}Q_{ij}Q_{im}(1-\xi_{ij}\xi_{im})\big{]}\Big{|} {𝔼𝒙i[Qij2Qim2]}1/2{𝔼𝒙i[(1ξij)2ξim2]}1/2\displaystyle\leq\Big{\{}\mathbb{E}_{\text{\boldmath$x$}_{i}}\big{[}Q_{ij}^{2}Q_{im}^{2}\big{]}\Big{\}}^{1/2}\cdot\Big{\{}\mathbb{E}_{\text{\boldmath$x$}_{i}}\big{[}(1-\xi_{ij})^{2}\xi_{im}^{2}\big{]}\Big{\}}^{1/2}
+{𝔼𝒙i[Qij2Qim2]}1/2{𝔼𝒙i[(1ξim)2]}1/2\displaystyle+\Big{\{}\mathbb{E}_{\text{\boldmath$x$}_{i}}\big{[}Q_{ij}^{2}Q_{im}^{2}\big{]}\Big{\}}^{1/2}\cdot\Big{\{}\mathbb{E}_{\text{\boldmath$x$}_{i}}\big{[}(1-\xi_{im})^{2}\big{]}\Big{\}}^{1/2}
(i)[𝒙i(ξij=0)]1/2+[𝒙i(ξim=0)]1/2\displaystyle\stackrel{{\scriptstyle(i)}}{{\leq}}\big{[}\mathbb{P}_{\text{\boldmath$x$}_{i}}\big{(}\xi_{ij}=0)\big{]}^{1/2}+\big{[}\mathbb{P}_{\text{\boldmath$x$}_{i}}\big{(}\xi_{im}=0)\big{]}^{1/2}
(ii)Od(d2k).\displaystyle\stackrel{{\scriptstyle(ii)}}{{\leq}}O_{d}(d^{-2k}).

where in (i) we used the inequality |Qij|1|Q_{ij}|\leq 1, and in (ii) we used Lemma 22. Therefore,

𝔼[𝒗i𝒗i|i1]1B(d,k)𝑸+Od(nd2k)𝐈n.\mathbb{E}\big{[}\text{\boldmath$v$}_{i}\text{\boldmath$v$}_{i}^{\top}\big{|}\mathcal{F}_{i-1}\big{]}\preceq\frac{1}{B(d,k)}\text{\boldmath$Q$}+O_{d}(nd^{-2k})\cdot\mathrm{\bf I}_{n}.

Thus, we obtain

in𝔼[𝒁i2|i1]opnB(d,k)𝑸op+Cn(logd)2kdknB(d,k)𝑸𝐈nop+Cn(logd)2kdk.\Big{\|}\sum_{i\leq n}\mathbb{E}\big{[}\text{\boldmath$Z$}_{i}^{2}|\mathcal{F}_{i-1}\big{]}\Big{\|}_{\mathrm{op}}\leq\frac{n}{B(d,k)}\|\text{\boldmath$Q$}\|_{\mathrm{op}}+\frac{Cn(\log d)^{2k}}{d^{k}}\leq\frac{n}{B(d,k)}\|\text{\boldmath$Q$}-\mathrm{\bf I}_{n}\|_{\mathrm{op}}+\frac{Cn(\log d)^{2k}}{d^{k}}.

This completes the proof. ∎

Once this lemma is established, we apply Theorem E.1 to the martingales 𝒀m:=im𝒁i\text{\boldmath$Y$}_{m}:=\sum_{i\leq m}\text{\boldmath$Z$}_{i} and 𝒀m:=im(𝒁i)\text{\boldmath$Y$}^{\prime}_{m}:=\sum_{i\leq m}(-\text{\boldmath$Z$}_{i}) (where we simply set 𝒁n+i=𝟎\text{\boldmath$Z$}_{n+i}=\mathrm{\bf 0} for i1i\geq 1), and we obtain the following result. For any t0t\geq 0 and v>0v>0,

(i=1n𝒁iopt,andVv)2nexp(t2/2v+L¯t/3),\mathbb{P}\left(\Big{\|}\sum_{i=1}^{n}\text{\boldmath$Z$}_{i}\Big{\|}_{\mathrm{op}}\geq t,\text{and}~{}V\leq v\right)\leq 2n\cdot\exp\left(-\frac{t^{2}/2}{v+\overline{L}t/3}\right),

where L¯=Cn(logd)2kdk\overline{L}=C\sqrt{\frac{n(\log d)^{2k}}{d^{k}}} is the upper bound on LL in (162). This inequality implies a probability tail bound on i=1n𝒁iop\|\sum_{i=1}^{n}\text{\boldmath$Z$}_{i}\|_{\mathrm{op}}.

(i=1n𝒁iopt)2nexp(t2/2v+L¯t/3)+(V>v).\mathbb{P}\left(\Big{\|}\sum_{i=1}^{n}\text{\boldmath$Z$}_{i}\Big{\|}_{\mathrm{op}}\geq t\right)\leq 2n\cdot\exp\left(-\frac{t^{2}/2}{v+\overline{L}t/3}\right)+\mathbb{P}\big{(}V>v\big{)}. (165)

By taking vv slightly larger than the “typical” value of VV, we can make the probability (V>v)\mathbb{P}\big{(}V>v\big{)} very small; by (163), this probability can in turn be bounded by a tail probability of 𝑸𝐈nop\|\text{\boldmath$Q$}-\mathrm{\bf I}_{n}\|_{\mathrm{op}}. This leads to a recursive inequality between tail probabilities. By abuse of notation, in the next lemma we assume that the constant C1C\geq 1 is chosen to be no smaller than those that appeared in Lemmas 25 and 26 (so that we can invoke both results).

Lemma 27.

Let the constant C1C\geq 1 be no smaller than those in Lemmas 25 and 26. Suppose that

tmax{8Cn(logd)2k+4dk,128n(logd)2B(d,k)}.t\geq\max\Big{\{}8C\sqrt{\frac{n(\log d)^{2k+4}}{d^{k}}},128\frac{n(\log d)^{2}}{B(d,k)}\Big{\}}. (166)

Then, by setting L¯=Cn(logd)2kdk\overline{L}=C\sqrt{\frac{n(\log d)^{2k}}{d^{k}}} and v=t232(logd)2v=\frac{t^{2}}{32(\log d)^{2}}, we have

(𝑸𝐈nopt)(𝑸𝐈nop2t)+Cexp((logd)2)+(𝑸¯𝑸𝐈n).\mathbb{P}\left(\big{\|}\text{\boldmath$Q$}-\mathrm{\bf I}_{n}\big{\|}_{\mathrm{op}}\geq t\right)\leq\mathbb{P}\left(\big{\|}\text{\boldmath$Q$}-\mathrm{\bf I}_{n}\big{\|}_{\mathrm{op}}\geq 2t\right)+C^{\prime}\exp\big{(}-(\log d)^{2}\big{)}+\mathbb{P}\big{(}\overline{\text{\boldmath$Q$}}\neq\text{\boldmath$Q$}-\mathrm{\bf I}_{n}\big{)}. (167)

As a consequence, by iterating the choice of tt at most Od(logd)O_{d}(\log d) times, we obtain that 𝐐𝐈nopt\big{\|}\text{\boldmath$Q$}-\mathrm{\bf I}_{n}\big{\|}_{\mathrm{op}}\leq t holds with very high probability, for any tt satisfying (166).

Proof of Lemma 27.

If tt satisfies (166), then we derive

(𝑸𝐈nop>t)\displaystyle\mathbb{P}\big{(}\|\text{\boldmath$Q$}-\mathrm{\bf I}_{n}\|_{\mathrm{op}}>t\big{)} (𝑸¯op>t)+(𝑸¯𝑸𝐈n)\displaystyle\leq\mathbb{P}\big{(}\|\overline{\text{\boldmath$Q$}}\|_{\mathrm{op}}>t\big{)}+\mathbb{P}\big{(}\overline{\text{\boldmath$Q$}}\neq\text{\boldmath$Q$}-\mathrm{\bf I}_{n}\big{)}
(in𝒁iop+𝔼[𝑸¯]op>t)+(𝑸¯𝑸𝐈n)\displaystyle\leq\mathbb{P}\big{(}\|\sum_{i\leq n}\text{\boldmath$Z$}_{i}\|_{\mathrm{op}}+\big{\|}\mathbb{E}[\overline{\text{\boldmath$Q$}}]\big{\|}_{\mathrm{op}}>t\big{)}+\mathbb{P}\big{(}\overline{\text{\boldmath$Q$}}\neq\text{\boldmath$Q$}-\mathrm{\bf I}_{n}\big{)}
(i)(in𝒁iop>t2)+(𝑸¯𝑸𝐈n)\displaystyle\stackrel{{\scriptstyle(i)}}{{\leq}}\mathbb{P}\Big{(}\big{\|}\sum_{i\leq n}\text{\boldmath$Z$}_{i}\big{\|}_{\mathrm{op}}>\frac{t}{2}\Big{)}+\mathbb{P}\big{(}\overline{\text{\boldmath$Q$}}\neq\text{\boldmath$Q$}-\mathrm{\bf I}_{n}\big{)}
(V>v)+nexp(t2/8v+L¯t/6)+(𝑸¯𝑸𝐈n)\displaystyle\leq\mathbb{P}(V>v)+n\exp\Big{(}-\frac{t^{2}/8}{v+\overline{L}t/6}\Big{)}+\mathbb{P}\big{(}\overline{\text{\boldmath$Q$}}\neq\text{\boldmath$Q$}-\mathrm{\bf I}_{n}\big{)}

where (i) is because 𝔼[𝑸¯]op<t/2\big{\|}\mathbb{E}[\overline{\text{\boldmath$Q$}}]\big{\|}_{\mathrm{op}}<t/2. By Lemma 26 Eq. (163), we get

(V>v)(Cn(logd)2kdk+nB(d,k)𝑸𝐈nop>t232(logd)2).\mathbb{P}(V>v)\leq\mathbb{P}\Big{(}C\frac{n(\log d)^{2k}}{d^{k}}+\frac{n}{B(d,k)}\|\text{\boldmath$Q$}-\mathrm{\bf I}_{n}\|_{\mathrm{op}}>\frac{t^{2}}{32(\log d)^{2}}\Big{)}.

By the assumption on tt, we have

Cn(logd)2kdkt264(logd)2,2tnB(d,k)t264(logd)2.C\frac{n(\log d)^{2k}}{d^{k}}\leq\frac{t^{2}}{64(\log d)^{2}},\qquad\frac{2tn}{B(d,k)}\leq\frac{t^{2}}{64(\log d)^{2}}.

This leads to (V>v)(𝑸𝐈nop>2t)\mathbb{P}(V>v)\leq\mathbb{P}(\|\text{\boldmath$Q$}-\mathrm{\bf I}_{n}\|_{\mathrm{op}}>2t). Also, L¯t/6t2/(32(logd)2)\overline{L}t/6\leq t^{2}/(32(\log d)^{2}), so

nexp(t2/8v+L¯t/6)\displaystyle n\exp\Big{(}-\frac{t^{2}/8}{v+\overline{L}t/6}\Big{)} nexp(t2/8v+t2/(32(logd)2))\displaystyle\leq n\exp\Big{(}-\frac{t^{2}/8}{v+t^{2}/(32(\log d)^{2})}\Big{)}
=nexp(2(logd)2)dkexp(2(logd)2)\displaystyle=n\exp\Big{(}-2(\log d)^{2}\Big{)}\leq d^{k}\cdot\exp\Big{(}-2(\log d)^{2}\Big{)}
Cexp((logd)2).\displaystyle\leq C^{\prime}\exp\Big{(}-(\log d)^{2}\Big{)}.

This proves the inequality (167). We note that there is a naive deterministic bound

𝑸𝐈nopn𝑸𝐈nmax2n.\big{\|}\text{\boldmath$Q$}-\mathrm{\bf I}_{n}\big{\|}_{\mathrm{op}}\leq n\big{\|}\text{\boldmath$Q$}-\mathrm{\bf I}_{n}\big{\|}_{\max}\leq 2n.

We apply (167) with tt replaced in turn by each element of {2t,4t,,2st}\{2t,4t,\ldots,2^{s}t\}, where sklogd/log2s\geq k\log d/\log 2. Summing these inequalities and using the naive deterministic bound, we obtain

(𝑸𝐈nop>t)Od(logd)(exp((logd)2)+(𝑸¯𝑸𝐈n)).\mathbb{P}\left(\big{\|}\text{\boldmath$Q$}-\mathrm{\bf I}_{n}\big{\|}_{\mathrm{op}}>t\right)\leq O_{d}(\log d)\cdot\Big{(}\exp\Big{(}-(\log d)^{2}\Big{)}+\mathbb{P}\big{(}\overline{\text{\boldmath$Q$}}\neq\text{\boldmath$Q$}-\mathrm{\bf I}_{n}\big{)}\Big{)}.

Since 𝑸¯=𝑸𝐈n\overline{\text{\boldmath$Q$}}=\text{\boldmath$Q$}-\mathrm{\bf I}_{n} holds with very high probability, the above inequality implies that 𝑸𝐈nopt\|\text{\boldmath$Q$}-\mathrm{\bf I}_{n}\|_{\mathrm{op}}\leq t also holds with very high probability. ∎

Finally, we are in a position to prove Proposition E.1. Note that Lemma 27 already gives the right bound on 𝑸k𝐈nop\|\text{\boldmath$Q$}_{k}-\mathrm{\bf I}_{n}\|_{\mathrm{op}} for a constant kk. For very large kk, we resort to a different approach.

Proof of Proposition E.1.

Let s1s\geq 1 be a constant integer. Our goal is to prove

ds(supk>𝑸k𝐈nop>Cn(logd)Cd+1)=od(1)\displaystyle d^{s}\cdot\mathbb{P}\Big{(}\sup_{k>\ell}\big{\|}\text{\boldmath$Q$}_{k}-\mathrm{\bf I}_{n}\big{\|}_{\mathrm{op}}>C^{\prime}\sqrt{\frac{n(\log d)^{C^{\prime}}}{d^{\ell+1}}}\Big{)}=o_{d}(1) (168)

for certain sufficiently large constant CC^{\prime}.

Consider kk0:=4+s+7k\geq k_{0}:=4\ell+s+7. Since 𝔼[Qk(𝒙i,𝒙j)2]=B(d,k)1\mathbb{E}\big{[}Q_{k}(\langle\text{\boldmath$x$}_{i},\text{\boldmath$x$}_{j}\rangle)^{2}\big{]}=B(d,k)^{-1}, by Markov’s inequality, for any t>0t>0, we have

(|Qk(𝒙i,𝒙j)|>dt)d2tB(d,k).\mathbb{P}\Big{(}\big{|}Q_{k}(\langle\text{\boldmath$x$}_{i},\text{\boldmath$x$}_{j}\rangle)\big{|}>d^{-t}\Big{)}\leq\frac{d^{2t}}{B(d,k)}.

Thus, by the union bound,

(𝑸k𝐈nop>ndt)(nmaxij|Qk(𝒙i,𝒙j)|>ndt)n2d2tB(d,k).\displaystyle\mathbb{P}\big{(}\big{\|}\text{\boldmath$Q$}_{k}-\mathrm{\bf I}_{n}\big{\|}_{\mathrm{op}}>nd^{-t}\big{)}\leq\mathbb{P}\big{(}n\max_{i\neq j}\big{|}Q_{k}(\langle\text{\boldmath$x$}_{i},\text{\boldmath$x$}_{j}\rangle)\big{|}>nd^{-t}\big{)}\leq\frac{n^{2}d^{2t}}{B(d,k)}.

We choose t=+2t=\ell+2, so that

ndt<nd+1<Cn(logd)Cd+1,k>k0n2d2tB(d,k)d2+2+2tk>k01B(d,k)Cds1nd^{-t}<\frac{n}{d^{\ell+1}}<C\sqrt{\frac{n(\log d)^{C}}{d^{\ell+1}}},\qquad\sum_{k>k_{0}}\frac{n^{2}d^{2t}}{B(d,k)}\leq d^{2\ell+2+2t}\sum_{k>k_{0}}\frac{1}{B(d,k)}\leq Cd^{-s-1}

where we used the inequality k>k01B(d,k)Cdk0\sum_{k>k_{0}}\frac{1}{B(d,k)}\leq Cd^{-k_{0}}. So taking the union bound, we get

(supkk0𝑸k𝐈nop>Cn(logd)Cd+1)k>k0n2d2tB(d,k)Cds1.\displaystyle\mathbb{P}\left(\sup_{k\geq k_{0}}\big{\|}\text{\boldmath$Q$}_{k}-\mathrm{\bf I}_{n}\big{\|}_{\mathrm{op}}>C\sqrt{\frac{n(\log d)^{C}}{d^{\ell+1}}}\right)\leq\sum_{k>k_{0}}\frac{n^{2}d^{2t}}{B(d,k)}\leq Cd^{-s-1}. (169)

Further, we apply Lemma 27 to each k{+1,+2,,k01}k\in\{\ell+1,\ell+2,\ldots,k_{0}-1\}, and we find

ds(sup<k<k0𝑸k𝐈nop>Cn(logd)Cd+1)=od(1).\displaystyle d^{s}\cdot\mathbb{P}\left(\sup_{\ell<k<k_{0}}\big{\|}\text{\boldmath$Q$}_{k}-\mathrm{\bf I}_{n}\big{\|}_{\mathrm{op}}>C\sqrt{\frac{n(\log d)^{C}}{d^{\ell+1}}}\right)=o_{d}(1). (170)

Combining (169) with (170) yields the desired goal (168). ∎

Appendix F Proof for network capacity upper bound

Proof of Theorem 3.1.

We denote by 𝜽=(𝒃,𝐖)\text{\boldmath$\theta$}=(\text{\boldmath$b$},\mathrm{\bf W}) the collection of neural network parameters and by f𝜽f_{\text{\boldmath$\theta$}} the associated function.

Let M=Md>1M=M_{d}>1 be a positive integer to be specified later. We define the discretization set 𝒟M={0,±1M,±2M,±3M,}\mathcal{D}_{M}=\{0,\pm\frac{1}{M},\pm\frac{2}{M},\pm\frac{3}{M},\ldots\} and

ΘL:={𝜽:|b|L,𝒘L,[N]},\displaystyle\Theta_{L}:=\Big{\{}\text{\boldmath$\theta$}:|b_{\ell}|\leq L,\,\|\text{\boldmath$w$}_{\ell}\|\leq L,~{}\forall\,\ell\in[N]\Big{\}}, (171)
ΘL,M:=ΘL{𝜽:b𝒟M,dWk𝒟M,[N],k[d]}.\displaystyle\Theta_{L,M}:=\Theta_{L}\cap\Big{\{}\text{\boldmath$\theta$}:b_{\ell}\in\mathcal{D}_{M},\sqrt{d}\,W_{\ell k}\in\mathcal{D}_{M},~{}\forall\,\ell\in[N],\forall\,k\in[d]\Big{\}}. (172)

Each f𝖭𝖭N,Lf\in\mathcal{F}_{{\sf NN}}^{N,L} is associated with an 𝜽ΘL\text{\boldmath$\theta$}\in\Theta_{L}.

(1) Lower bound on discretized parameter space. We make the following claim: if, with probability at least η1/2\eta_{1}/2, there exists some 𝜽¯ΘL,M\overline{\text{\boldmath$\theta$}}\in\Theta_{L,M} such that at least (1/2+η2/2)n(1/2+\eta_{2}/2)n examples are correctly classified, i.e.,

n1i=1n𝟏{yif𝜽¯(𝒙i)>0}12+η22,n^{-1}\sum_{i=1}^{n}\mathrm{\bf 1}\{y_{i}f_{\overline{\text{\boldmath$\theta$}}}(\text{\boldmath$x$}_{i})>0\}\geq\frac{1}{2}+\frac{\eta_{2}}{2}, (173)

then we have Nd=Ωd(nlog(LM))Nd=\Omega_{d}\big{(}\frac{n}{\log(LM)}\big{)}.

To prove this claim, we treat the input 𝒙i\text{\boldmath$x$}_{i} as deterministic. We derive

𝒚(𝜽¯ΘL,Msuch that|{i:sign(f𝜽¯(𝒙i))=yi}|(1/2+η2/2)n)\displaystyle~{}~{}~{}\mathbb{P}_{\text{\boldmath$y$}}\Big{(}\exists\,\overline{\text{\boldmath$\theta$}}\in\Theta_{L,M}~{}\text{such that}~{}|\{i:\mathrm{sign}(f_{\overline{\text{\boldmath$\theta$}}}(\text{\boldmath$x$}_{i}))=y_{i}\}|\geq(1/2+\eta_{2}/2)n\Big{)} (174)
(i)|ΘL,M|𝒚(|{i:sign(f𝜽¯(𝒙i))=yi}|(1/2+η2/2)n)\displaystyle\stackrel{{\scriptstyle(i)}}{{\leq}}\big{|}\Theta_{L,M}\big{|}\cdot\mathbb{P}_{\text{\boldmath$y$}}\Big{(}|\{i:\mathrm{sign}(f_{\overline{\text{\boldmath$\theta$}}}(\text{\boldmath$x$}_{i}))=y_{i}\}|\geq(1/2+\eta_{2}/2)n\Big{)}
(ii)Od(1)[2πe(4ML+1)]Nd+N𝒚(i=1nξi(1/2+η2/2)n)\displaystyle\stackrel{{\scriptstyle(ii)}}{{\leq}}O_{d}(1)\cdot\big{[}\sqrt{2\pi e}(4ML+1)\big{]}^{Nd+N}\cdot\mathbb{P}_{\text{\boldmath$y$}}\Big{(}\sum_{i=1}^{n}\xi_{i}\geq(1/2+\eta_{2}/2)n\Big{)}

where ξi=𝟏{sign(f𝜽¯(𝒙i))=yi}\xi_{i}=\mathrm{\bf 1}\{\mathrm{sign}(f_{\overline{\text{\boldmath$\theta$}}}(\text{\boldmath$x$}_{i}))=y_{i}\} denotes a Bernoulli random variable with mean 1/21/2. Here, (i) used the union bound and (ii) used the following bound on the cardinality of ΘL,M\Theta_{L,M}

|ΘL,M|=Od(1)[2πe(4ML+1)]Nd+N,|\Theta_{L,M}|=O_{d}(1)\cdot\big{[}\sqrt{2\pi e}(4ML+1)\big{]}^{Nd+N},

which is proved via a packing argument (see Lemma 28). By Hoeffding’s inequality, we have

𝒚(i=1nξi(1/2+η2/2)n)exp(nη22/2).\mathbb{P}_{\text{\boldmath$y$}}\Big{(}\sum_{i=1}^{n}\xi_{i}\geq(1/2+\eta_{2}/2)n\Big{)}\leq\exp\big{(}-n\eta_{2}^{2}/2\big{)}.

Since we assume that the probability of the event in (174) is at least η1/2\eta_{1}/2, we take the logarithm and deduce

log(η1/2)Od(1)N(d+1)log(4LM+1)nη22/2,or simply\displaystyle\log(\eta_{1}/2)\leq O_{d}(1)\cdot N(d+1)\log(4LM+1)-n\eta_{2}^{2}/2,\quad\text{or simply}
n=Od(Ndlog(LM+1)).\displaystyle n=O_{d}\big{(}Nd\log(LM+1)\big{)}.

It then follows that Nd=Ωd(nlog(LM+1))Nd=\Omega_{d}\big{(}\frac{n}{\log(LM+1)}\big{)}.

(2) Projecting θ\theta onto the discretized set. For any zz\in\mathbb{R}, let τ1(z)\tau_{1}(z) and τ2(z)\tau_{2}(z) be the closest and the second closest elements of 𝒟M\mathcal{D}_{M} to zz (breaking ties in an arbitrary way). For a given zz, define a random variable ξ(z){τ1(z),τ2(z)}\xi(z)\in\{\tau_{1}(z),\tau_{2}(z)\} by

ξ(z)={τ1(z),with probability|zτ2(z)||zτ1(z)|+|zτ2(z)|;τ2(z),with probability|zτ1(z)||zτ1(z)|+|zτ2(z)|.\displaystyle\xi(z)=\begin{cases}\tau_{1}(z),&\text{with probability}~{}\displaystyle\frac{|z-\tau_{2}(z)|}{|z-\tau_{1}(z)|+|z-\tau_{2}(z)|};\\ \tau_{2}(z),&\text{with probability}~{}\displaystyle\frac{|z-\tau_{1}(z)|}{|z-\tau_{1}(z)|+|z-\tau_{2}(z)|}.\end{cases}

This definition ensures that 𝔼ξ(z)=z\mathbb{E}\xi(z)=z. Indeed, zτ1(z)z-\tau_{1}(z) and zτ2(z)z-\tau_{2}(z) have the opposite signs, so

z𝔼ξ(z)=(zτ1(z))|zτ2(z)||zτ1(z)|+|zτ2(z)|+(zτ2(z))|zτ1(z)||zτ1(z)|+|zτ2(z)|=0.z-\mathbb{E}\xi(z)=\frac{(z-\tau_{1}(z))|z-\tau_{2}(z)|}{|z-\tau_{1}(z)|+|z-\tau_{2}(z)|}+\frac{(z-\tau_{2}(z))|z-\tau_{1}(z)|}{|z-\tau_{1}(z)|+|z-\tau_{2}(z)|}=0.

Denote b~=ξ(b)\widetilde{b}_{\ell}=\xi(b_{\ell}) and W~k=d1/2ξ(d1/2Wk)\widetilde{W}_{\ell k}=d^{-1/2}\xi(d^{1/2}W_{\ell k}) for all \ell and kk (they are independent random variables), and 𝜽~=(𝒃~,𝐖~)\widetilde{\text{\boldmath$\theta$}}=(\widetilde{\text{\boldmath$b$}},\widetilde{\mathrm{\bf W}}). Then 𝜽~\widetilde{\text{\boldmath$\theta$}} is a random projection of 𝜽ΘL\text{\boldmath$\theta$}\in\Theta_{L} onto the discretized set ΘL,M\Theta_{L,M}.
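A minimal sketch of this randomized rounding (with an arbitrary choice of M): when z lies strictly between two grid points, τ₁(z) and τ₂(z) are simply the two neighboring grid points, so the rule reduces to the floor/ceil scheme below, and the empirical mean matches z.

```python
import numpy as np

# Unbiased randomized rounding of z onto the grid D_M = {0, +-1/M, +-2/M, ...}.
rng = np.random.default_rng(0)

def xi(z, M, rng):
    lower = np.floor(z * M) / M            # neighboring grid points around z
    upper = lower + 1.0 / M
    p_upper = (z - lower) * M              # the two distances to z sum to 1/M, as in the text
    return upper if rng.random() < p_upper else lower

M, z = 10, 0.537
samples = np.array([xi(z, M, rng) for _ in range(200000)])
print("grid values taken :", np.unique(samples))       # {0.5, 0.6}
print("empirical mean    :", samples.mean(), " (target:", z, ")")
```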

(3) Bounding the approximation error. By assumption, with probability greater than η1\eta_{1}, there exists 𝜽ΘL\text{\boldmath$\theta$}\in\Theta_{L} (which depends on the data (𝒙i)in(\text{\boldmath$x$}_{i})_{i\leq n} and (yi)in(y_{i})_{i\leq n}) such that it gives margin greater than δ\delta on at least (1/2+η2)n(1/2+\eta_{2})n examples. Denote by :=(𝐗,𝒚)[n]\mathcal{I}:=\mathcal{I}(\mathrm{\bf X},\text{\boldmath$y$})\subset[n] the set of indices ii such that yif𝜽(𝒙i)>δy_{i}f_{\text{\boldmath$\theta$}}(\text{\boldmath$x$}_{i})>\delta. Then, the assumption is equivalent to ||(1/2+η2)n|\mathcal{I}|\geq(1/2+\eta_{2})n with probability greater than η1\eta_{1}.

We claim

𝐗,𝒚,𝜽~(1ni=1n𝟏{|f𝜽(𝒙i)f𝜽~(𝒙i)|<δ}1η22)1η12.\mathbb{P}_{\mathrm{\bf X},\text{\boldmath$y$},\widetilde{\text{\boldmath$\theta$}}}\Big{(}\frac{1}{n}\sum_{i=1}^{n}\mathrm{\bf 1}\big{\{}|f_{\text{\boldmath$\theta$}}(\text{\boldmath$x$}_{i})-f_{\widetilde{\text{\boldmath$\theta$}}}(\text{\boldmath$x$}_{i})|<\delta\big{\}}\geq 1-\frac{\eta_{2}}{2}\Big{)}\geq 1-\frac{\eta_{1}}{2}. (175)

Here we emphasize that 𝐗,𝒚,𝜽~\mathbb{P}_{\mathrm{\bf X},\text{\boldmath$y$},\widetilde{\text{\boldmath$\theta$}}} is the probability measure over all random variables (namely the data 𝐗\mathrm{\bf X}, 𝒚y and also 𝜽~\widetilde{\text{\boldmath$\theta$}}). Once this claim is proved, we can immediately prove the desired result. In fact, for a given 𝜽¯ΘL,M\overline{\text{\boldmath$\theta$}}\in\Theta_{L,M}, let us define the indicator variable

I𝜽¯=𝟏{1ni=1n𝟏{|f𝜽(𝒙i)f𝜽¯(𝒙i)|<δ}1η22}I_{\overline{\text{\boldmath$\theta$}}}=\mathrm{\bf 1}\Big{\{}\frac{1}{n}\sum_{i=1}^{n}\mathrm{\bf 1}\big{\{}|f_{\text{\boldmath$\theta$}}(\text{\boldmath$x$}_{i})-f_{\overline{\text{\boldmath$\theta$}}}(\text{\boldmath$x$}_{i})|<\delta\big{\}}\geq 1-\frac{\eta_{2}}{2}\Big{\}}

We observe that

𝔼𝐗,𝒚,𝜽~[I𝜽~]=𝔼𝐗,𝒚𝔼𝜽~|𝐗,𝒚[I𝜽~](i)𝔼𝐗,𝒚[max𝜽¯ΘL,M[I𝜽¯]]=(ii)𝐗,𝒚[max𝜽¯ΘL,M[I𝜽¯]=1]\displaystyle~{}~{}~{}~{}\mathbb{E}_{\mathrm{\bf X},\text{\boldmath$y$},\widetilde{\text{\boldmath$\theta$}}}[I_{\widetilde{\text{\boldmath$\theta$}}}]=\mathbb{E}_{\mathrm{\bf X},\text{\boldmath$y$}}\mathbb{E}_{\widetilde{\text{\boldmath$\theta$}}|\mathrm{\bf X},\text{\boldmath$y$}}[I_{\widetilde{\text{\boldmath$\theta$}}}]\stackrel{{\scriptstyle(i)}}{{\leq}}\mathbb{E}_{\mathrm{\bf X},\text{\boldmath$y$}}\Big{[}\max_{\overline{\text{\boldmath$\theta$}}\in\Theta_{L,M}}[I_{\overline{\text{\boldmath$\theta$}}}]\Big{]}\stackrel{{\scriptstyle(ii)}}{{=}}\mathbb{P}_{\mathrm{\bf X},\text{\boldmath$y$}}\Big{[}\max_{\overline{\text{\boldmath$\theta$}}\in\Theta_{L,M}}[I_{\overline{\text{\boldmath$\theta$}}}]=1\Big{]}
=𝐗,𝒚[𝜽¯ΘL,Msuch thatI𝜽¯=1]\displaystyle=\mathbb{P}_{\mathrm{\bf X},\text{\boldmath$y$}}\Big{[}\exists\,\overline{\text{\boldmath$\theta$}}\in\Theta_{L,M}~{}\text{such that}~{}I_{\overline{\text{\boldmath$\theta$}}}=1\Big{]}

where (i) is because the mean is no larger than the maximum and (ii) is because max𝜽¯ΘL,M[I𝜽¯]\max_{\overline{\text{\boldmath$\theta$}}\in\Theta_{L,M}}[I_{\overline{\text{\boldmath$\theta$}}}] takes value 0 or 11. Combining this with (175), we deduce that with probability at least 1η1/21-\eta_{1}/2, there exists an 𝜽¯ΘL,M\overline{\text{\boldmath$\theta$}}\in\Theta_{L,M} such that |f𝜽(𝒙i)f𝜽¯(𝒙i)|<δ|f_{\text{\boldmath$\theta$}}(\text{\boldmath$x$}_{i})-f_{\overline{\text{\boldmath$\theta$}}}(\text{\boldmath$x$}_{i})|<\delta for all ii\in\mathcal{I}^{\prime} where [n]\mathcal{I}^{\prime}\subset[n] satisfies ||(1η2/2)n|\mathcal{I}^{\prime}|\geq(1-\eta_{2}/2)n.

We use the fact that (𝒜1𝒜2)(𝒜1)+(𝒜2)1\mathbb{P}(\mathcal{A}_{1}\cap\mathcal{A}_{2})\geq\mathbb{P}(\mathcal{A}_{1})+\mathbb{P}(\mathcal{A}_{2})-1 for arbitrary events 𝒜1,𝒜2\mathcal{A}_{1},\mathcal{A}_{2} to deduce that with probability at least η1/2\eta_{1}/2, there exists a 𝜽¯ΘL,M\overline{\text{\boldmath$\theta$}}\in\Theta_{L,M} such that |f𝜽(𝒙i)f𝜽¯(𝒙i)|<δ|f_{\text{\boldmath$\theta$}}(\text{\boldmath$x$}_{i})-f_{\overline{\text{\boldmath$\theta$}}}(\text{\boldmath$x$}_{i})|<\delta for all ii\in\mathcal{I}^{\prime}, and in the meantime ||(1/2+η2)n|\mathcal{I}|\geq(1/2+\eta_{2})n.

yif𝜽¯(𝒙i)yif𝜽(𝒙i)|yif𝜽(𝒙i)yif𝜽¯(𝒙i)|=yif𝜽(𝒙i)|f𝜽(𝒙i)f𝜽¯(𝒙i)|>δδ=0.y_{i}f_{\overline{\text{\boldmath$\theta$}}}(\text{\boldmath$x$}_{i})\geq y_{i}f_{\text{\boldmath$\theta$}}(\text{\boldmath$x$}_{i})-|y_{i}f_{\text{\boldmath$\theta$}}(\text{\boldmath$x$}_{i})-y_{i}f_{\overline{\text{\boldmath$\theta$}}}(\text{\boldmath$x$}_{i})|=y_{i}f_{\text{\boldmath$\theta$}}(\text{\boldmath$x$}_{i})-|f_{\text{\boldmath$\theta$}}(\text{\boldmath$x$}_{i})-f_{\overline{\text{\boldmath$\theta$}}}(\text{\boldmath$x$}_{i})|>\delta-\delta=0.

Note that ||||+||n(1/2+η2/2)n|\mathcal{I}\cap\mathcal{I}^{\prime}|\geq|\mathcal{I}|+|\mathcal{I}^{\prime}|-n\geq(1/2+\eta_{2}/2)n. Therefore, the assumption of the discretized case, namely (1), is satisfied; and hence we obtain the desired lower bound.

Below we prove the claim (175). Denote hi=σ(𝒘,𝒙i)h_{i\ell}=\sigma(\langle\text{\boldmath$w$}_{\ell},\text{\boldmath$x$}_{i}\rangle) and Δhi=σ(𝒘,𝒙i)σ(𝒘~,𝒙i)\Delta h_{i\ell}=\sigma(\langle\text{\boldmath$w$}_{\ell},\text{\boldmath$x$}_{i}\rangle)-\sigma(\langle\widetilde{\text{\boldmath$w$}}_{\ell},\text{\boldmath$x$}_{i}\rangle). By the triangle inequality,

|f𝜽(𝒙i)f𝜽~(𝒙i)|\displaystyle|f_{\text{\boldmath$\theta$}}(\text{\boldmath$x$}_{i})-f_{\widetilde{\text{\boldmath$\theta$}}}(\text{\boldmath$x$}_{i})| 1N|=1N(bb~)σ(𝒘,𝒙i)|+1N=1N|b~[σ(𝒘,𝒙i)σ(𝒘~,𝒙i)]|\displaystyle\leq\frac{1}{N}\Big{|}\sum_{\ell=1}^{N}(b_{\ell}-\widetilde{b}_{\ell})\sigma(\langle\text{\boldmath$w$}_{\ell},\text{\boldmath$x$}_{i}\rangle)\Big{|}+\frac{1}{N}\sum_{\ell=1}^{N}\Big{|}\widetilde{b}_{\ell}\big{[}\sigma(\langle\text{\boldmath$w$}_{\ell},\text{\boldmath$x$}_{i}\rangle)-\sigma(\langle\widetilde{\text{\boldmath$w$}}_{\ell},\text{\boldmath$x$}_{i}\rangle)\big{]}\Big{|}
1N|=1N(bb~)hi|+LN=1N|Δhi|,\displaystyle\leq\frac{1}{N}\Big{|}\sum_{\ell=1}^{N}(b_{\ell}-\widetilde{b}_{\ell})h_{i\ell}\Big{|}+\frac{L}{N}\sum_{\ell=1}^{N}|\Delta h_{i\ell}|,
(i)|σ(0)|N|=1N(bb~)|+1N|=1N(bb~)(hiσ(0))|+LN(=1N|Δhi|2)1/2,\displaystyle\stackrel{{\scriptstyle(i)}}{{\leq}}\frac{|\sigma(0)|}{N}\Big{|}\sum_{\ell=1}^{N}(b_{\ell}-\widetilde{b}_{\ell})\Big{|}+\frac{1}{N}\Big{|}\sum_{\ell=1}^{N}(b_{\ell}-\widetilde{b}_{\ell})(h_{i\ell}-\sigma(0))\Big{|}+\frac{L}{\sqrt{N}}\Big{(}\sum_{\ell=1}^{N}|\Delta h_{i\ell}|^{2}\Big{)}^{1/2},
=:Ti,1+Ti,2+Ti,3,\displaystyle=:T_{i,1}+T_{i,2}+T_{i,3}, (176)

where we used the Cauchy-Schwarz inequality in (i).

i) Bounding Ti,1T_{i,1}. Conditioning on 𝐗\mathrm{\bf X} and 𝒚y, the random variables bb~b_{\ell}-\widetilde{b}_{\ell} are independent across \ell, have zero mean, and satisfy |bb~|1/M|b_{\ell}-\widetilde{b}_{\ell}|\leq 1/M. So we can use Hoeffding’s inequality to control the first term Ti,1T_{i,1} in (176).

𝜽~(|σ(0)|N|=1N(bb~)|>|σ(0)|tM)2eNt2/22et2/2.\mathbb{P}_{\widetilde{\text{\boldmath$\theta$}}}\Big{(}\frac{|\sigma(0)|}{N}\big{|}\sum_{\ell=1}^{N}(b_{\ell}-\widetilde{b}_{\ell})\big{|}>\frac{|\sigma(0)|t}{M}\Big{)}\leq 2e^{-Nt^{2}/2}\leq 2e^{-t^{2}/2}.

Taking t=2log(16/η1)t=\sqrt{2\log(16/\eta_{1})} yields the bound

Ti,1C02log(16/η1)MT_{i,1}\leq\frac{C_{0}\sqrt{2\log(16/\eta_{1})}}{M}

with probability at least 1η1/81-\eta_{1}/8, where we used the assumption that |σ(0)|C0|\sigma(0)|\leq C_{0}.

ii) Bounding Ti,2T_{i,2}. By assumption, σ\sigma has weak derivatives, so we have

σ(𝒘,𝒙i)σ(0)=𝒘,𝒙i01σ(t𝒘,𝒙i)dt\sigma(\langle\text{\boldmath$w$}_{\ell},\text{\boldmath$x$}_{i}\rangle)-\sigma(0)=\langle\text{\boldmath$w$}_{\ell},\text{\boldmath$x$}_{i}\rangle\int_{0}^{1}\sigma^{\prime}\big{(}t\langle\text{\boldmath$w$}_{\ell},\text{\boldmath$x$}_{i}\rangle\big{)}\;dt

Since maxi[n]𝒙i22d\max_{i\in[n]}\|\text{\boldmath$x$}_{i}\|_{2}\leq 2\sqrt{d} with probability at least 12necd1-2ne^{-cd}, we have |𝒘,𝒙i|𝒘2𝒙i22Ld|\langle\text{\boldmath$w$}_{\ell},\text{\boldmath$x$}_{i}\rangle|\leq\|\text{\boldmath$w$}_{\ell}\|_{2}\cdot\|\text{\boldmath$x$}_{i}\|_{2}\leq 2L\sqrt{d}. Thus

|σ(𝒘,𝒙i)σ(0)|max|z|2Ld|σ(z)||𝒘,𝒙i|.|\sigma(\langle\text{\boldmath$w$}_{\ell},\text{\boldmath$x$}_{i}\rangle)-\sigma(0)|\leq\max_{|z|\leq 2L\sqrt{d}}|\sigma^{\prime}(z)|\cdot|\langle\text{\boldmath$w$}_{\ell},\text{\boldmath$x$}_{i}\rangle|.

Recall the Assumption 3.2 on the activation function |σ(z)|B(1+|z|)B|\sigma^{\prime}(z)|\leq B(1+|z|)^{B}. We have max|z|2Ld|σ(z)|B[1+(2Ld)B]2B(2Ld)B\max_{|z|\leq 2L\sqrt{d}}|\sigma^{\prime}(z)|\leq B[1+(2L\sqrt{d})^{B}]\leq 2B(2L\sqrt{d})^{B}. Therefore,

i=1n=1N|σ(𝒘,𝒙i)σ(0)|2\displaystyle\sum_{i=1}^{n}\sum_{\ell=1}^{N}|\sigma(\langle\text{\boldmath$w$}_{\ell},\text{\boldmath$x$}_{i}\rangle)-\sigma(0)|^{2} 4B2(2Ld)2Bi=1n=1N|𝒘,𝒙i|2\displaystyle\leq 4B^{2}(2L\sqrt{d})^{2B}\cdot\sum_{i=1}^{n}\sum_{\ell=1}^{N}|\langle\text{\boldmath$w$}_{\ell},\text{\boldmath$x$}_{i}\rangle|^{2}
=4B2(2Ld)2B=1N𝐗𝒘2\displaystyle=4B^{2}(2L\sqrt{d})^{2B}\cdot\sum_{\ell=1}^{N}\|\mathrm{\bf X}\text{\boldmath$w$}_{\ell}\|^{2}
4B2(2Ld)2B=1N𝐗op2𝒘2\displaystyle\leq 4B^{2}(2L\sqrt{d})^{2B}\cdot\sum_{\ell=1}^{N}\|\mathrm{\bf X}\|_{\mathrm{op}}^{2}\|\text{\boldmath$w$}_{\ell}\|^{2}

By standard random matrix theory [Ver10, Cor. 5.35], 𝐗op2n+d3n\|\mathrm{\bf X}\|_{\mathrm{op}}\leq 2\sqrt{n}+\sqrt{d}\leq 3\sqrt{n} with probability at least 12en/21-2e^{-n/2}. Also, 𝒘2L2\|\text{\boldmath$w$}_{\ell}\|^{2}\leq L^{2}. So we get

i=1n=1N|σ(𝒘,𝒙i)σ(0)|236B2L2(2Ld)2BNn.\sum_{i=1}^{n}\sum_{\ell=1}^{N}|\sigma(\langle\text{\boldmath$w$}_{\ell},\text{\boldmath$x$}_{i}\rangle)-\sigma(0)|^{2}\leq 36B^{2}L^{2}(2L\sqrt{d})^{2B}Nn. (177)

Let 1:=1(𝐗,𝒚)[n]\mathcal{I}_{1}:=\mathcal{I}_{1}(\mathrm{\bf X},\text{\boldmath$y$})\subset[n] be the set of i[n]i\in[n] such that

=1N|hiσ(0)|232(η1η2)1×36B2L2(2Ld)2BN.\sum_{\ell=1}^{N}|h_{i\ell}-\sigma(0)|^{2}\leq 32(\eta_{1}\eta_{2})^{-1}\times 36B^{2}L^{2}(2L\sqrt{d})^{2B}N.

Recall the definition hi=σ(𝒘,𝒙i)h_{i\ell}=\sigma(\langle\text{\boldmath$w$}_{\ell},\text{\boldmath$x$}_{i}\rangle). Then, with probability at least 12necd2en/21-2ne^{-cd}-2e^{-n/2}, we have |1|(1η1η2/32)n|\mathcal{I}_{1}|\geq(1-\eta_{1}\eta_{2}/32)n. Indeed, if this were not true, then the number of indices ii that do not satisfy the above inequality would exceed η1η2n/32\eta_{1}\eta_{2}n/32, and thus (177) would be violated.

Now, conditioning on 𝐗,𝒚\mathrm{\bf X},\text{\boldmath$y$}, we can view =1N(bb~)(hiσ(0))\sum_{\ell=1}^{N}(b_{\ell}-\widetilde{b}_{\ell})(h_{i\ell}-\sigma(0)) as a weighted sum of independent sub-gaussian variables bb~b_{\ell}-\widetilde{b}_{\ell}. By Hoeffding’s inequality for sub-gaussian variables, we have

𝜽~(|=1N(bb~)(hiσ(0))|>M1t)2exp(t2/=1N(hiσ(0))2).\mathbb{P}_{\widetilde{\text{\boldmath$\theta$}}}\Big{(}\big{|}\sum_{\ell=1}^{N}(b_{\ell}-\widetilde{b}_{\ell})(h_{i\ell}-\sigma(0))\big{|}>M^{-1}t\Big{)}\leq 2\exp\Big{(}-t^{2}\big{/}\sum_{\ell=1}^{N}(h_{i\ell}-\sigma(0))^{2}\Big{)}.

We take t=CL(2Ld)BNt=CL(2L\sqrt{d})^{B}\sqrt{N} where C>0C>0 is a sufficiently large constant so that the above probability upper bound (right-hand side) is smaller than η1η2/32\eta_{1}\eta_{2}/32. Denote the event i\mathcal{E}_{i} as

i={|=1N(bb~)(hiσ(0))|>CM1L(2Ld)BN}.\mathcal{E}_{i}=\Big{\{}\big{|}\sum_{\ell=1}^{N}(b_{\ell}-\widetilde{b}_{\ell})(h_{i\ell}-\sigma(0))\big{|}>CM^{-1}L(2L\sqrt{d})^{B}\sqrt{N}\Big{\}}.

We have proved 𝜽~(i)η1η2/32\mathbb{P}_{\widetilde{\text{\boldmath$\theta$}}}(\mathcal{E}_{i})\leq\eta_{1}\eta_{2}/32 for i1i\in\mathcal{I}_{1}. Conditioning on |1|(1η1η2/32)n|\mathcal{I}_{1}|\geq(1-\eta_{1}\eta_{2}/32)n, we obtain, by Markov’s inequality,

𝜽~(i=1n𝟏{i}>η2n2)\displaystyle\mathbb{P}_{\widetilde{\text{\boldmath$\theta$}}}\Big{(}\sum_{i=1}^{n}\mathrm{\bf 1}\{\mathcal{E}_{i}\}>\frac{\eta_{2}n}{2}\Big{)} 2η2n𝔼𝜽~[i=1n𝟏{i}]=2η2ni=1n𝜽~(i)\displaystyle\leq\frac{2}{\eta_{2}n}\mathbb{E}_{\widetilde{\text{\boldmath$\theta$}}}\Big{[}\sum_{i=1}^{n}\mathrm{\bf 1}\{\mathcal{E}_{i}\}\Big{]}=\frac{2}{\eta_{2}n}\sum_{i=1}^{n}\mathbb{P}_{\widetilde{\text{\boldmath$\theta$}}}(\mathcal{E}_{i})
2η2ni1𝜽~(i)+2η2ni11η18.\displaystyle\leq\frac{2}{\eta_{2}n}\sum_{i\in\mathcal{I}_{1}}\mathbb{P}_{\widetilde{\text{\boldmath$\theta$}}}(\mathcal{E}_{i})+\frac{2}{\eta_{2}n}\sum_{i\not\in\mathcal{I}_{1}}1\leq\frac{\eta_{1}}{8}.

Let =(𝐗,𝒚,𝜽~)[n]\mathcal{I}^{\prime}=\mathcal{I}^{\prime}(\mathrm{\bf X},\text{\boldmath$y$},\widetilde{\text{\boldmath$\theta$}})\subset[n] be the set of ii such that the complement ic\mathcal{E}_{i}^{c} holds. Then with probability at least 1η1/82necd2en/21-\eta_{1}/8-2ne^{-cd}-2e^{-n/2}, we have ||(1η2/2)n|\mathcal{I}^{\prime}|\geq(1-\eta_{2}/2)n, and for every ii\in\mathcal{I}^{\prime},

Ti,21NCM1L(2Ld)BCM1L(2Ld)B.T_{i,2}\leq\frac{1}{\sqrt{N}}CM^{-1}L(2L\sqrt{d})^{B}\leq CM^{-1}L(2L\sqrt{d})^{B}.

iii) Bounding Ti,3T_{i,3}. By the assumption on σ\sigma^{\prime}, we can write

Δhi=01σ(t𝒘,𝒙i+(1t)𝒘~,𝒙i)dt𝒘𝒘~,𝒙i\Delta h_{i\ell}=\int_{0}^{1}\sigma^{\prime}\big{(}t\langle\text{\boldmath$w$}_{\ell},\text{\boldmath$x$}_{i}\rangle+(1-t)\langle\widetilde{\text{\boldmath$w$}}_{\ell},\text{\boldmath$x$}_{i}\rangle\big{)}\;dt\cdot\langle\text{\boldmath$w$}_{\ell}-\widetilde{\text{\boldmath$w$}}_{\ell},\text{\boldmath$x$}_{i}\rangle

As before, with probability at least 12necd1-2ne^{-cd}, we have |𝒘,𝒙i|𝒘2𝒙i22Ld|\langle\text{\boldmath$w$}_{\ell},\text{\boldmath$x$}_{i}\rangle|\leq\|\text{\boldmath$w$}_{\ell}\|_{2}\cdot\|\text{\boldmath$x$}_{i}\|_{2}\leq 2L\sqrt{d} and similarly |𝒘~,𝒙i|2Ld|\langle\widetilde{\text{\boldmath$w$}}_{\ell},\text{\boldmath$x$}_{i}\rangle|\leq 2L\sqrt{d} for all \ell and ii. This leads to

|Δhi|max|z|2Ld|σ(z)||𝒘𝒘~,𝒙i|.|\Delta h_{i\ell}|\leq\max_{|z|\leq 2L\sqrt{d}}|\sigma^{\prime}(z)|\cdot|\langle\text{\boldmath$w$}_{\ell}-\widetilde{\text{\boldmath$w$}}_{\ell},\text{\boldmath$x$}_{i}\rangle|.

By the assumption on σ\sigma^{\prime}, we have max|z|2Ld|σ(z)|B[1+(2Ld)B]2B(2Ld)B\max_{|z|\leq 2L\sqrt{d}}|\sigma^{\prime}(z)|\leq B[1+(2L\sqrt{d})^{B}]\leq 2B(2L\sqrt{d})^{B}. Therefore, we can bound Ti,3T_{i,3} as follows. For each i[n]i\in[n],

LN(=1N|Δhi|2)1/2\displaystyle\frac{L}{\sqrt{N}}\Big{(}\sum_{\ell=1}^{N}|\Delta h_{i\ell}|^{2}\Big{)}^{1/2} LN2B(2Ld)B(=1N𝒘𝒘~,𝒙i2)1/2\displaystyle\leq\frac{L}{\sqrt{N}}\cdot 2B(2L\sqrt{d})^{B}\cdot\Big{(}\sum_{\ell=1}^{N}\langle\text{\boldmath$w$}_{\ell}-\widetilde{\text{\boldmath$w$}}_{\ell},\text{\boldmath$x$}_{i}\rangle^{2}\Big{)}^{1/2}
2BL(2Ld)BN(𝐖𝐖~)𝒙i\displaystyle\leq\frac{2BL(2L\sqrt{d})^{B}}{\sqrt{N}}\cdot\|(\mathrm{\bf W}-\widetilde{\mathrm{\bf W}})\text{\boldmath$x$}_{i}\|
4BCM1L(2Ld)B\displaystyle\leq 4BCM^{-1}L(2L\sqrt{d})^{B}

with probability at least 12ecd1-2e^{-cd}. This is because 𝒙i2d\|\text{\boldmath$x$}_{i}\|\leq 2\sqrt{d} holds with probability 12ecd1-2e^{-cd}; and also, since each entry of 𝐖𝐖~\mathrm{\bf W}-\widetilde{\mathrm{\bf W}} is independent and bounded by (Md)1(M\sqrt{d})^{-1},

𝜽~((𝐖𝐖~)𝒙i2CM1N)12ecN,\mathbb{P}_{\widetilde{\text{\boldmath$\theta$}}}\Big{(}\|(\mathrm{\bf W}-\widetilde{\mathrm{\bf W}})\text{\boldmath$x$}_{i}\|\leq 2CM^{-1}\sqrt{N}\Big{)}\geq 1-2e^{-cN}, (178)

which is a consequence of Bernstein’s inequality (see [Ver10] Thm. 5.39 for example). Taking the union bound over ii, (178) holds for all i[n]i\in[n] with probability at least 12necd2necN1-2ne^{-cd}-2ne^{-cN}.

(iv) Combining the three bounds. Finally, combining the bounds on Ti,1T_{i,1}, Ti,2T_{i,2}, and Ti,3T_{i,3}, we obtain that with probability at least 1η1/44n(ecN+ecd)2en/21-\eta_{1}/4-4n(e^{-cN}+e^{-cd})-2e^{-n/2},

maxi|f𝜽(𝒙i)f𝜽~(𝒙i)|\displaystyle\max_{i\in\mathcal{I}^{\prime}}|f_{\text{\boldmath$\theta$}}(\text{\boldmath$x$}_{i})-f_{\widetilde{\text{\boldmath$\theta$}}}(\text{\boldmath$x$}_{i})| CM1+CM1L(2Ld)B+CM1L(2Ld)B\displaystyle\leq CM^{-1}+CM^{-1}L(2L\sqrt{d})^{B}+CM^{-1}L(2L\sqrt{d})^{B}
CM1L(Ld)B.\displaystyle\leq CM^{-1}L(L\sqrt{d})^{B}. (179)

Under the asymptotic assumption logn=od(min{N,d})\log n=o_{d}\big{(}\min\{N,d\}\big{)}, we have 4n(ecN+ecd)=od(1)4n(e^{-cN}+e^{-cd})=o_{d}(1), so there exists a sufficiently large d0d_{0} such that 4n(ecN+ecd)+2en/2η1/44n(e^{-cN}+e^{-cd})+2e^{-n/2}\leq\eta_{1}/4 if d>d0d>d_{0}. We choose M=Cδ12L(Ld)BM=C\delta^{-1}\cdot 2L(L\sqrt{d})^{B} (where CC has the same value as in (179)) and get

maxi|f𝜽(𝒙i)f𝜽~(𝒙i)|δ/2<δ.\max_{i\in\mathcal{I}^{\prime}}|f_{\text{\boldmath$\theta$}}(\text{\boldmath$x$}_{i})-f_{\widetilde{\text{\boldmath$\theta$}}}(\text{\boldmath$x$}_{i})|\leq\delta/2<\delta.

with probability at least 1η1/21-\eta_{1}/2. This proves the claim (175). Note that with this choice of MM, log(LM+1)=Od(log(dL/δ))\log(LM+1)=O_{d}\big{(}\log(dL/\delta)\big{)}, so the desired lower bound follows. ∎

Lemma 28.

The cardinality of the discrete set ΘL,M\Theta_{L,M} defined in (172) satisfies the bound

|ΘL,M|O([2πe(4ML+1)]Nd+N).|\Theta_{L,M}|\leq O\Big{(}\big{[}\sqrt{2\pi e}(4ML+1)\big{]}^{Nd+N}\Big{)}.
Proof.

Denote Bd(r)={𝒙d:𝒙r}B^{d}(r)=\{\text{\boldmath$x$}\in\mathbb{R}^{d}:\|\text{\boldmath$x$}\|\leq r\}. Recall the definition of the packing number. We say that a set 𝒜Bd(r)\mathcal{A}\subset B^{d}(r) is an ε\varepsilon-packing of (Bd(r),)(B^{d}(r),\|\cdot\|_{\infty}) if 𝒙1𝒙2>ε\|\text{\boldmath$x$}_{1}-\text{\boldmath$x$}_{2}\|_{\infty}>\varepsilon for every 𝒙1,𝒙2𝒜\text{\boldmath$x$}_{1},\text{\boldmath$x$}_{2}\in\mathcal{A}, 𝒙1𝒙2\text{\boldmath$x$}_{1}\neq\text{\boldmath$x$}_{2}.

D(Bd(r),,ε):=sup{|𝒜|:𝒜is an ε-packing of(Bd(r),)}D(B^{d}(r),\|\cdot\|_{\infty},\varepsilon):=\sup\big{\{}|\mathcal{A}|:\mathcal{A}~{}\text{is an $\varepsilon$-packing of}~{}(B^{d}(r),\|\cdot\|_{\infty})\big{\}}.

By the volume argument, we have the bound

D(Bd(r),,2ε)Vol(Bd(r+dε))εd=(i)od(1)(2πed)d/2(r+dεε)dD(B^{d}(r),\|\cdot\|_{\infty},2\varepsilon)\leq\frac{\mathrm{Vol}\big{(}B^{d}(r+\sqrt{d}\,\varepsilon)\big{)}}{\varepsilon^{d}}\stackrel{{\scriptstyle(i)}}{{=}}o_{d}(1)\cdot\Big{(}\frac{2\pi e}{d}\Big{)}^{d/2}\Big{(}\frac{r+\sqrt{d}\varepsilon}{\varepsilon}\Big{)}^{d} (180)

where in (i) we used the formula of the volume of a Euclidean ball and Stirling’s approximation.
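For completeness, step (i) amounts to the following routine computation, using Vol(B^d(R)) = π^{d/2}R^d/Γ(d/2+1) together with Stirling’s approximation Γ(d/2+1) = (1+o_d(1))·√(πd)·(d/(2e))^{d/2}:

\frac{\mathrm{Vol}\big{(}B^{d}(r+\sqrt{d}\,\varepsilon)\big{)}}{\varepsilon^{d}}=\frac{\pi^{d/2}(r+\sqrt{d}\,\varepsilon)^{d}}{\Gamma(d/2+1)\,\varepsilon^{d}}=\frac{1+o_{d}(1)}{\sqrt{\pi d}}\Big{(}\frac{2\pi e}{d}\Big{)}^{d/2}\Big{(}\frac{r+\sqrt{d}\,\varepsilon}{\varepsilon}\Big{)}^{d}=o_{d}(1)\cdot\Big{(}\frac{2\pi e}{d}\Big{)}^{d/2}\Big{(}\frac{r+\sqrt{d}\varepsilon}{\varepsilon}\Big{)}^{d}.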

Recall that 𝒟M={0,±1M,±2M,}\mathcal{D}_{M}=\{0,\pm\frac{1}{M},\pm\frac{2}{M},\ldots\}. To apply this result to bounding |ΘL,M||\Theta_{L,M}|, first we observe that

|ΘL,M|\displaystyle|\Theta_{L,M}|\leq |{𝒂N:𝒂NL,a𝒟M[N]}|\displaystyle\big{|}\big{\{}\text{\boldmath$a$}\in\mathbb{R}^{N}:\|\text{\boldmath$a$}\|\leq\sqrt{N}\,L,a_{\ell}\in\mathcal{D}_{M}~{}\forall\,\ell\in[N]\big{\}}\big{|} (181)
\displaystyle\cdot =1N|{𝒘d:𝒘L,d[𝒘]j𝒟Mj[d]}|.\displaystyle\prod_{\ell=1}^{N}\big{|}\big{\{}\text{\boldmath$w$}_{\ell}\in\mathbb{R}^{d}:\|\text{\boldmath$w$}_{\ell}\|\leq L,\sqrt{d}\,[\text{\boldmath$w$}_{\ell}]_{j}\in\mathcal{D}_{M}~{}\forall\,j\in[d]\big{\}}\big{|}. (182)

Suppose 𝒂1,𝒂2\text{\boldmath$a$}_{1},\text{\boldmath$a$}_{2} lie in the set on the RHS of (181) and 𝒂1𝒂2\text{\boldmath$a$}_{1}\neq\text{\boldmath$a$}_{2}. Then, by the definition of 𝒟M\mathcal{D}_{M}, we must have 𝒂1𝒂2>12M\|\text{\boldmath$a$}_{1}-\text{\boldmath$a$}_{2}\|_{\infty}>\frac{1}{2M}. This means that the cardinality of the set on the RHS of (181) is bounded by the packing number D(BN(NL),,12M)D(B^{N}(\sqrt{N}\,L),\|\cdot\|_{\infty},\frac{1}{2M}). By (180), we get an upper bound on the cardinality

oN(1)(2πeN)N/2(NL+N14M14M)N=oN(1)(2πe)N/2(4ML+1)N.o_{N}(1)\cdot\Big{(}\frac{2\pi e}{N}\Big{)}^{N/2}\cdot\big{(}\frac{\sqrt{N}\,L+\sqrt{N}\frac{1}{4M}}{\frac{1}{4M}}\Big{)}^{N}=o_{N}(1)\cdot(2\pi e)^{N/2}(4ML+1)^{N}. (183)

Similarly, we can bound the cardinality of each of the NN sets in (182) by

od(1)(2πed)d/2(L+d14Md14Md)d=od(1)(2πe)d/2(4LM+1)d.o_{d}(1)\cdot\Big{(}\frac{2\pi e}{d}\Big{)}^{d/2}\cdot\big{(}\frac{L+\sqrt{d}\frac{1}{4M\sqrt{d}}}{\frac{1}{4M\sqrt{d}}}\Big{)}^{d}=o_{d}(1)\cdot(2\pi e)^{d/2}(4LM+1)^{d}. (184)

Recall that N=N(d)N=N(d)\to\infty as dd\to\infty, so oN(1)o_{N}(1) in (183) can be replaced by od(1)o_{d}(1). Combining (183) and (184), we obtain

|ΘL,M|\displaystyle|\Theta_{L,M}| od(1)[2πe(4ML+1)]N=1N[2πe(4ML+1)]d\displaystyle\leq o_{d}(1)\cdot\big{[}\sqrt{2\pi e}\,(4ML+1)\big{]}^{N}\prod_{\ell=1}^{N}\big{[}\sqrt{2\pi e}\,(4ML+1)\big{]}^{d}
od(1)[2πe(4ML+1)]Nd+N.\displaystyle\leq o_{d}(1)\cdot\big{[}\sqrt{2\pi e}\,(4ML+1)\big{]}^{Nd+N}. ∎

References

  • [ADH+19] Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang, Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks, International Conference on Machine Learning, PMLR, 2019, pp. 322–332.
  • [AH12] Kendall Atkinson and Weimin Han, Spherical harmonics and approximations on the unit sphere: an introduction, vol. 2044, Springer Science, 2012.
  • [AZLL19] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang, Learning and generalization in overparameterized neural networks, going beyond two layers, Advances in Neural Information Processing Systems 32 (2019), 6158–6169.
  • [AZLS19] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song, A convergence theory for deep learning via over-parameterization, International Conference on Machine Learning, 2019, pp. 242–252.
  • [Bac17] Francis Bach, Breaking the curse of dimensionality with convex neural networks, The Journal of Machine Learning Research 18 (2017), no. 1, 629–681.
  • [Bau88] Eric B Baum, On the capabilities of multilayer perceptrons, Journal of Complexity 4 (1988), no. 3, 193–215.
  • [BELM20] Sébastien Bubeck, Ronen Eldan, Yin Tat Lee, and Dan Mikulincer, Network size and weights size for memorization with two-layers neural networks, arXiv:2006.02855 (2020).
  • [BHLM19] Peter L Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian, Nearly-tight vc-dimension and pseudodimension bounds for piecewise linear neural networks., J. Mach. Learn. Res. 20 (2019), 63–1.
  • [BHMM19] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal, Reconciling modern machine-learning practice and the classical bias–variance trade-off, Proceedings of the National Academy of Sciences 116 (2019), no. 32, 15849–15854.
  • [BHX19] Mikhail Belkin, Daniel Hsu, and Ji Xu, Two models of double descent for weak features, arXiv:1903.07571, 2019.
  • [BLLT20] Peter L Bartlett, Philip M Long, Gábor Lugosi, and Alexander Tsigler, Benign overfitting in linear regression, Proceedings of the National Academy of Sciences (2020).
  • [BM19] Alberto Bietti and Julien Mairal, On the inductive bias of neural tangent kernels, arXiv:1905.12173 (2019).
  • [BMR21] Peter L. Bartlett, Andrea Montanari, and Alexander Rakhlin, Deep learning: a statistical viewpoint, Acta Numerica 30 (2021), 87–201.
  • [CC14] Efthimiou Costas and Frye Christopher, Spherical harmonics in pp dimensions, World Scientific, 2014.
  • [CCZG19] Zixiang Chen, Yuan Cao, Difan Zou, and Quanquan Gu, How much over-parameterization is sufficient to learn deep relu networks?, arXiv:1911.12360 (2019).
  • [CG19] Yuan Cao and Quanquan Gu, Generalization bounds of stochastic gradient descent for wide and deep neural networks, Advances in Neural Information Processing Systems 32 (2019), 10836–10846.
  • [COB19] Lenaic Chizat, Edouard Oyallon, and Francis Bach, On lazy training in differentiable programming, Advances in Neural Information Processing Systems, 2019, pp. 2937–2947.
  • [Cov65] Thomas M Cover, Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition, IEEE transactions on electronic computers (1965), no. 3, 326–334.
  • [Dan19] Amit Daniely, Neural networks learning and memorization with (almost) no over-parameterization, arXiv preprint arXiv:1911.09873 (2019).
  • [Dan20]  , Memorizing gaussians with no over-parameterizaion via gradient decent on neural networks, arXiv:2003.12895 (2020).
  • [DK70] Chandler Davis and W. M. Kahan, The rotation of eigenvectors by a perturbation. iii, SIAM Journal on Numerical Analysis 7 (1970), no. 1, pp. 1–46.
  • [DZPS18] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh, Gradient descent provably optimizes over-parameterized neural networks, International Conference on Learning Representations, 2018.
  • [GMMM20] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari, When do neural networks outperform kernel methods?, arXiv:2006.13409 (2020).
  • [GMMM21] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari, Linearized two-layers neural networks in high dimension, The Annals of Statistics 49 (2021), no. 2, 1029 – 1054.
  • [HMRT19] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani, Surprises in high-dimensional ridgeless least squares interpolation, arXiv:1903.08560 (2019).
  • [JGH18] Arthur Jacot, Franck Gabriel, and Clément Hongler, Neural tangent kernel: Convergence and generalization in neural networks, Advances in neural information processing systems, 2018, pp. 8571–8580.
  • [JT19] Ziwei Ji and Matus Telgarsky, Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow relu networks, International Conference on Learning Representations, 2019.
  • [Kow94] Adam Kowalczyk, Counting function theorem for multi-layer networks, Advances in neural information processing systems, 1994, pp. 375–382.
  • [LR20] Tengyuan Liang and Alexander Rakhlin, Just interpolate: Kernel “ridgeless” regression can generalize, Annals of Statistics 48 (2020), no. 3, 1329–1347.
  • [LRZ20] Tengyuan Liang, Alexander Rakhlin, and Xiyu Zhai, On the multiple descent of minimum-norm interpolants and restricted lower isometry of kernels, Proceedings of Thirty Third Conference on Learning Theory (Jacob Abernethy and Shivani Agarwal, eds.), Proceedings of Machine Learning Research, vol. 125, PMLR, 2020, arXiv:1908.10292, pp. 2683–2711.
  • [LXS+19] Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington, Wide neural networks of any depth evolve as linear models under gradient descent, Advances in Neural Information Processing Systems 32 (2019), 8572–8583.
  • [LZB20] Chaoyue Liu, Libin Zhu, and Mikhail Belkin, On the linearity of large non-linear models: when and why the tangent kernel is constant, Advances in Neural Information Processing Systems (H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, eds.), vol. 33, Curran Associates, Inc., 2020, pp. 15954–15964.
  • [MM21] Song Mei and Andrea Montanari, The generalization error of random features regression: Precise asymptotics and double descent curve, Communications on Pure and Applied Mathematics (2021).
  • [MMM21] Song Mei, Theodor Misiakiewicz, and Andrea Montanari, Generalization error of random features and kernel methods: hypercontractivity and kernel matrix concentration, arXiv:2101.10588 (2021).
  • [MRSY19] Andrea Montanari, Feng Ruan, Youngtak Sohn, and Jun Yan, The generalization error of max-margin linear classifiers: High-dimensional asymptotics in the overparametrized regime, arXiv:1911.01544 (2019).
  • [NCS19] Atsushi Nitanda, Geoffrey Chinot, and Taiji Suzuki, Gradient descent can learn less over-parameterized two-layer neural networks on classification problems, arXiv:1905.09870 (2019).
  • [NTS15] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro, In search of the real inductive bias: On the role of implicit regularization in deep learning, ICLR (Workshop), 2015.
  • [Oli09] Roberto Imbuzeiro Oliveira, Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges, arXiv:0911.0600 (2009).
  • [OS20] Samet Oymak and Mahdi Soltanolkotabi, Towards moderate overparameterization: global convergence guarantees for training shallow neural networks, IEEE Journal on Selected Areas in Information Theory (2020).
  • [PLYS21] Sejun Park, Jaeho Lee, Chulhee Yun, and Jinwoo Shin, Provable memorization via deep neural networks using sub-linear parameters, Conference on Learning Theory, PMLR, 2021, pp. 3627–3661.
  • [RR08] Ali Rahimi and Benjamin Recht, Random features for large-scale kernel machines, Advances in Neural Information Processing Systems, 2008, pp. 1177–1184.
  • [Sak92] Akito Sakurai, n-h-1 networks store no less n*h+1 examples, but sometimes no more, IJCNN International Joint Conference on Neural Networks, vol. 3, 1992, pp. 936–941.
  • [Sch42] Isaac J. Schoenberg, Positive definite functions on spheres, Duke Math. J. 9 (1942), 96–108.
  • [SJL18] Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee, Theoretical insights into the optimization landscape of over-parameterized shallow neural networks, IEEE Transactions on Information Theory 65 (2018), no. 2, 742–769.
  • [T+11] Joel A. Tropp, Freedman’s inequality for matrix martingales, Electronic Communications in Probability 16 (2011), 262–270.
  • [Ver10] Roman Vershynin, Introduction to the non-asymptotic analysis of random matrices, arXiv:1011.3027 (2010).
  • [Ver18]  , High-dimensional probability: An introduction with applications in data science, vol. 47, Cambridge University Press, 2018.
  • [Ver20]  , Memory capacity of neural networks with threshold and ReLU activations, arXiv:2001.06938 (2020).
  • [vH14] Ramon van Handel, Probability in high dimension, Tech. report, Princeton University, 2014.
  • [VYS21] Gal Vardi, Gilad Yehudai, and Ohad Shamir, On the optimal memorization power of ReLU neural networks, arXiv:2110.03187 (2021).
  • [WCL20] Weinan E, Chao Ma, and Lei Wu, A comparative analysis of optimization and generalization properties of two-layer neural network and random feature models under gradient descent dynamics, Science China Mathematics 63 (2020), no. 7, 1235.
  • [YSJ19] Chulhee Yun, Suvrit Sra, and Ali Jadbabaie, Small ReLU networks are powerful memorizers: a tight analysis of memorization capacity, Advances in Neural Information Processing Systems, 2019, pp. 15558–15569.
  • [ZBH+16] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals, Understanding deep learning requires rethinking generalization, arXiv:1611.03530 (2016).
  • [ZCZG20] Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu, Gradient descent optimizes over-parameterized deep ReLU networks, Machine Learning 109 (2020), no. 3, 467–492.