
Deformed semicircle law and concentration of nonlinear random matrices for ultra-wide neural networks

Zhichao Wang Department of Mathematics, University of California, San Diego, La Jolla, CA 92093 [email protected]  and  Yizhe Zhu Department of Mathematics, University of California, Irvine, Irvine, CA 92697 [email protected]
Abstract.

In this paper, we investigate a two-layer fully connected neural network of the form $f(X)=\frac{1}{\sqrt{d_{1}}}\boldsymbol{a}^{\top}\sigma(WX)$, where $X\in\mathbb{R}^{d_{0}\times n}$ is a deterministic data matrix, $W\in\mathbb{R}^{d_{1}\times d_{0}}$ and $\boldsymbol{a}\in\mathbb{R}^{d_{1}}$ are random Gaussian weights, and $\sigma$ is a nonlinear activation function. We study the limiting spectral distributions of two empirical kernel matrices associated with $f(X)$: the empirical conjugate kernel (CK) and neural tangent kernel (NTK), beyond the linear-width regime ($d_{1}\asymp n$). We focus on the ultra-wide regime, where the width $d_{1}$ of the first layer is much larger than the sample size $n$. Under appropriate assumptions on $X$ and $\sigma$, a deformed semicircle law emerges as $d_{1}/n\to\infty$ and $n\to\infty$. We first prove this limiting law for generalized sample covariance matrices with some dependency. To specify it for our neural network model, we provide a nonlinear Hanson-Wright inequality that is suitable for neural networks with random weights and Lipschitz activation functions. We also demonstrate non-asymptotic concentrations of the empirical CK and NTK around their limiting kernels in the spectral norm, along with lower bounds on their smallest eigenvalues. As an application, we show that random feature regression induced by the empirical kernel achieves the same asymptotic performance as its limiting kernel regression under the ultra-wide regime. This allows us to calculate the asymptotic training and test errors for random feature regression using the corresponding kernel regression.

1. Introduction

Nowadays, deep neural networks have become one of the leading models in machine learning, and many theoretical results have been established to understand the training and generalization of neural networks. Among them, two kernel matrices are prominent in deep learning theory: the Conjugate Kernel (CK) [Nea95, Wil97, RR07, CS09, DFS16, PLR+16, SGGSD17, LBN+18, MHR+18] and the Neural Tangent Kernel (NTK) [JGH18, DZPS19, AZLS19]. The CK matrix, defined in (1.5), is the Gram matrix of the output of the last hidden layer on the training dataset and has been exploited to study the generalization of random feature regression. The NTK matrix, defined in (1.8), is the Gram matrix of the Jacobian of the neural network with respect to the training parameters and characterizes the performance of a wide neural network trained by gradient flow. Both are related to kernel machines and help us explore the generalization and training process of neural networks.

We are interested in the behaviors of the CK and NTK matrices at random initialization. A recent line of work has proved that these two random kernel matrices converge to their expectations when the width of the network becomes infinitely wide [JGH18, ADH+19b]. Although CK and NTK usually refer to these expected kernels in the literature, we will always call the empirical kernel matrices CK and NTK in this paper, with a slight abuse of terminology.

In this paper, we study the random CK and NTK matrices of a two-layer fully connected neural network with input data $X\in\mathbb{R}^{d_{0}\times n}$, given by $f:\mathbb{R}^{d_{0}\times n}\to\mathbb{R}^{n}$ such that

(1.1)  $f(X):=\frac{1}{\sqrt{d_{1}}}\boldsymbol{a}^{\top}\sigma\left(WX\right),$

where $W\in\mathbb{R}^{d_{1}\times d_{0}}$ is the weight matrix for the first layer, $\boldsymbol{a}\in\mathbb{R}^{d_{1}}$ contains the second-layer weights, and $\sigma$ is a nonlinear activation function applied to the matrix $WX$ entrywise. We assume that all entries of $\boldsymbol{a}$ and $W$ are i.i.d. standard Gaussian $\mathcal{N}(0,1)$ random variables. We will always view the input data $X$ as a deterministic matrix (independent of the random weights in $\boldsymbol{a}$ and $W$) satisfying certain assumptions.

In terms of random matrix theory, we study the difference between these two kernel matrices (CK and NTK) and their expectations with respect to the random weights, showing both asymptotic and non-asymptotic behaviors of these differences as the width of the first hidden layer $d_{1}$ grows faster than the number of samples $n$. As an extension of [FW20], we prove that when $n/d_{1}\to 0$, the centered CK and NTK with appropriate normalization have a limiting eigenvalue distribution given by a deformed semicircle law, determined by the spectrum of the training data and the nonlinear activation function. To prove this global law, we further establish a limiting law theorem for centered sample covariance matrices with dependent structures and a nonlinear version of the Hanson-Wright inequality. These two results are very general, which makes them potentially applicable to scenarios beyond our neural network model. For the non-asymptotic analysis, we establish concentration inequalities between the random kernel matrices and their expectations. As a byproduct, we provide lower bounds on the smallest eigenvalues of the CK and NTK, which are essential for the global convergence of gradient-based optimization methods when training a wide neural network [OS20, NM20, Ngu21]. Because of the non-asymptotic results for kernel matrices, we can also describe how close the performances of random feature regression and the limiting kernel regression are for a general dataset, which allows us to compute the limiting training error and generalization error of random feature regression via its corresponding kernel regression in the ultra-wide regime.

1.1. Nonlinear random matrix theory in neural networks

Recently, the limiting spectra of CK and NTK at random initialization have received increasing attention from a random matrix theory perspective. Most papers focus on the linear-width regime $d_{1}\propto n$, using both the moment method and Stieltjes transforms. Based on moment methods, [PW17] first computed the limiting law of the CK for two-layer neural networks with centered nonlinear activation functions, which is further described as a deformed Marchenko–Pastur law in [Péc19]. This result has been extended to sub-Gaussian weights and input data with real analytic activation functions by [BP21], even for multiple layers with some special activation functions. Later, [ALP22] generalized these results by adding a random bias vector in the pre-activation and allowing a more general input data matrix. Similar results for the two-layer model with a random bias vector and random input data were analyzed in [PS21] by cumulant expansion. In parallel, using the Stieltjes transform, [LLC18] investigated the CK of a one-hidden-layer network with general deterministic input data and Lipschitz activation functions via a deterministic equivalent. [LCM20] further developed a deterministic equivalent for the Fourier feature map. With the help of the Gaussian equivalent technique and operator-valued free probability theory, the limiting spectrum of the NTK with one hidden layer was analyzed in [AP20]. Then the limiting spectra of CK and NTK of a multi-layer neural network with general deterministic input data were fully characterized in [FW20], where the limiting spectrum of the CK is given by the propagation of the Marchenko–Pastur map through the network, while the NTK is approximated by a linear combination of the CKs of each hidden layer. [FW20] illustrated that the pairwise approximate orthogonality assumption on the input data is preserved in all hidden layers. Such a property is useful to approximate the expected CK and NTK. We refer to [GLBP21] for a summary of the recent developments in nonlinear random matrix theory.

Most results in nonlinear random matrix theory focus on the case when $d_{1}$ is proportional to $n$ as $n\to\infty$. We build a random matrix result for both CK and NTK under the ultra-wide regime, where $d_{1}/n\to\infty$ and $n\to\infty$. Beyond its intrinsic interest, this regime exhibits the connection between wide (or overparameterized) neural networks and the kernel learning induced by the limiting kernels of CK and NTK. In this article, we follow the general assumptions on the input data and activation function in [FW20] and study the limiting spectrum of the centered and normalized CK matrix

(1.2)  $\frac{1}{\sqrt{nd_{1}}}\left(Y^{\top}Y-\mathbb{E}[Y^{\top}Y]\right),$

where $Y:=\sigma(WX)$. Similar results for the NTK can be obtained as well. To complete the proofs, we establish a nonlinear version of the Hanson-Wright inequality, variants of which have previously appeared in [LLC18, LCM20]. This nonlinear version generalizes the original Hanson-Wright inequality [HW71, RV13, Ada15] and may have various applications in statistics, machine learning, and other areas. In addition, we derive a deformed semicircle law for normalized sample covariance matrices without independence in columns. This result is of independent interest in random matrix theory as well.

1.2. General sample covariance matrices

We observe that the random matrix $Y\in\mathbb{R}^{d_{1}\times n}$ defined above has independent and identically distributed rows. Hence, $Y^{\top}Y$ is a generalized sample covariance matrix. We first inspect a more general sample covariance matrix $Y$ whose rows are independent copies of some random vector $\mathbf{y}\in\mathbb{R}^{n}$. Assuming $n$ and $d_{1}$ both go to infinity but $n/d_{1}\to 0$, we aim to study the limiting empirical eigenvalue distribution of centered Wishart matrices of the form (1.2) under certain conditions on $\mathbf{y}$. This regime is also related to the ultra-high dimensional setting in statistics [QLY21].

This regime has been studied for decades, starting with [BY88], where $Y$ has i.i.d. entries and $\mathbb{E}[Y^{\top}Y]=d_{1}\operatorname{Id}$. In this setting, the moment method yields the semicircle law. This normalized model also arises in quantum theory with respect to random induced states (see [Aub12, AS17, CYZ18]). The largest eigenvalue of such a normalized sample covariance matrix was considered in [CP12]. Subsequently, [CP15, LY16, YXZ22, QLY21] analyzed the fluctuations of the linear spectral statistics of this model and applied the results to hypothesis testing for the covariance matrix. A spiked model for sample covariance matrices in this regime was recently studied in [Fel21]. This kind of semicircle law also appears in many other random matrix models. For instance, [Jia04] showed this limiting law for normalized sample correlation matrices. The semicircle law for centered sample covariance matrices has already been applied in machine learning as well: [GKZ19] controlled the generalization error of shallow neural networks with quadratic activation functions by the moments of this limiting semicircle law, and [GZR22] derived a semicircle law for the fluctuation matrix between the stochastic batch Hessian and the deterministic empirical Hessian of deep neural networks.

For general sample covariance matrices, [WP14] considered the form $Y=BXA^{1/2}$ with deterministic $A$ and $B$, where $X$ consists of i.i.d. entries with mean zero and variance one. The same result was proved in [Bao12] by a generalized Stein's method. Unlike previous results, [Xie13] tackled the general case, only assuming $Y$ has independent rows with some deterministic covariance $\Phi_{n}$. Though this is similar to our model in Section 4, we consider more general assumptions on each row of $Y$, which can be directly verified in our neural network models.

1.3. Infinite-width kernels and the smallest eigenvalues of empirical kernels

Besides the above asymptotic spectral fluctuation of (1.2), we provide non-asymptotic concentrations of (1.2) in the spectral norm and a corresponding result for the NTK. For infinite-width networks, where $d_{1}\to\infty$ and $n$ is fixed, both CK and NTK converge to their expected kernels. This has been investigated in [DFS16, SGGSD17, LBN+18, MHR+18] for the CK and [JGH18, DZPS19, AZLS19, ADH+19b, LRZ20] for the NTK. Such kernels are also called infinite-width kernels in the literature. In the current work, we present precise probability bounds for the concentration of CK and NTK around their infinite-width kernels, where the difference is of order $\sqrt{n/d_{1}}$. Our results permit more general activation functions and input data $X$ with only pairwise approximate orthogonality, although similar concentrations have been applied in [AKM+17, SY19, AP20, MZ22, HXAP20].

A corollary of our concentration is an explicit lower bound on the smallest eigenvalues of the CK and the NTK. Such extreme eigenvalues of the NTK have been utilized to prove the global convergence of gradient descent algorithms for wide neural networks, since the NTK governs the gradient flow in the training process; see, e.g., [COB19, DZPS19, ADH+19a, SY19, WDW19, OS20, NM20, Ngu21]. The smallest eigenvalue of the NTK is also crucial for proving generalization bounds and memorization capacity in [ADH+19a, MZ22]. Analogously to Theorem 3.1 in [MZ22], our lower bounds are given by the Hermite coefficients of the activation function $\sigma$. In addition, the lower bound of the NTK for multi-layer ReLU networks was analyzed in [NMM21].

1.4. Random feature regression and limiting kernel regression

Another byproduct of our concentration results is to measure the difference in performance between random feature regression with respect to $\frac{1}{\sqrt{d_{1}}}Y$ and the corresponding kernel regression when $d_{1}/n\to\infty$. Random feature regression can be viewed as linear regression on the last hidden layer, and its performance has been studied in, for instance, [PW17, LLC18, MM19, LCM20, GLK+20, HL22, LD21, MMM21, LGC+21] under the linear-width regime (this linear-width regime is also known as the high-dimensional regime, while our ultra-wide regime is also called a highly overparameterized regime in the literature; see [MM19]). In the linear-width regime, the CK matrix $\frac{1}{d_{1}}Y^{\top}Y$ is not concentrated around its expectation

(1.3)  $\Phi:=\mathbb{E}_{\boldsymbol{w}}[\sigma(\boldsymbol{w}^{\top}X)^{\top}\sigma(\boldsymbol{w}^{\top}X)]$

under the spectral norm, where $\boldsymbol{w}$ is a standard normal random vector in $\mathbb{R}^{d_{0}}$. Instead, the limiting spectrum of the CK is exploited to characterize the asymptotic performance and the double descent phenomenon of random feature regression when $n,d_{0},d_{1}\to\infty$ proportionally. Several works have also utilized this regime to depict the performance of ultra-wide random networks by letting $d_{1}/n\to\psi\in(0,\infty)$ first, obtaining the asymptotic performance, and then taking $\psi\to\infty$ (see [MM19, YBM21]). However, there is still a difference between this sequential limit and the ultra-wide regime. Before these results, random feature regression had already attracted significant attention as a random approximation of the RKHS defined by the population kernel function $K:\mathbb{R}^{d_{0}}\times\mathbb{R}^{d_{0}}\to\mathbb{R}$ such that

(1.4)  $K(\mathbf{x},\mathbf{z}):=\mathbb{E}_{\boldsymbol{w}}[\sigma(\langle\boldsymbol{w},\mathbf{x}\rangle)\sigma(\langle\boldsymbol{w},\mathbf{z}\rangle)],$

when the width $d_{1}$ is sufficiently large [RR07, Bac13, RR17, Bac17]. We point out that Theorem 9 of [AKM+17] has the same order $\sqrt{n/d_{1}}$ of approximation as ours, albeit only for random Fourier features.

In our work, the concentration between the empirical kernel induced by $\frac{1}{d_{1}}Y^{\top}Y$ and the population kernel matrix $K$ defined in (1.4) for $X$ leads to control of the differences in training/test errors between random feature regression and kernel regression, which were previously considered by [AKM+17, JSS+20, MZ22, MMM21] in different settings. Specifically, [JSS+20] obtained the same kind of estimate but considered random features sampled from Gaussian processes. Our results explicitly show how large the width $d_{1}$ should be so that random feature regression attains the same asymptotic performance as kernel regression [MMM21]. With these estimates, we can use the limiting test error of the kernel regression to predict the limiting test error of random feature regression as $n/d_{1}\to 0$ and $d_{0},n\to\infty$. We refer to [LR20, LRZ20, LLS21, MMM21], [BMR21, Section 4.3] and references therein for more details on high-dimensional kernel ridge/ridgeless regression. We emphasize that the optimal prediction error of random feature regression in the linear-width regime is actually achieved in the ultra-wide regime, which boils down to the limiting kernel regression; see [MM19, MMM21, YBM21, LGC+21]. This is one of the motivations for studying the ultra-wide regime and the limiting kernel regression.

In the end, we would like to mention the idea of a spectral-norm approximation for the expected kernel $\Phi$, which helps us describe the asymptotic behavior of the limiting kernel regression. For specific activations $\sigma$, the kernel $\Phi$ has an explicit formula, see [LLC18, LC18, LCM20], whereas in general it can be expanded in terms of the Hermite expansion of $\sigma$ [PW17, MM19, FW20]. Thanks to the pairwise approximate orthogonality introduced in [FW20, Definition 3.1], we can approximate $\Phi$ in the spectral norm for general deterministic data $X$. This pairwise approximate orthogonality quantifies how orthogonal the distinct input vectors of $X$ are. Under certain i.i.d. assumptions on $X$ with scaling $d_{0}\propto n^{\alpha}$ for $\alpha\in(0,1]$, [LRZ20] and [BMR21, Section 4.3] determined which degree of the polynomial kernel suffices to approximate $\Phi$. Instead, our theory leverages the approximate orthogonality of general datasets $X$ to obtain a similar approximation. Our analysis indicates that the weaker the orthogonality of $X$, the higher the degree of the polynomial kernel needed to approximate the kernel $\Phi$.

1.5. Preliminaries

Notations

We use $\operatorname{tr}(A)=\frac{1}{n}\sum_{i}A_{ii}$ for the normalized trace of a matrix $A\in\mathbb{R}^{n\times n}$ and $\operatorname{Tr}(A)=\sum_{i}A_{ii}$. Vectors are denoted by lowercase boldface. $\|A\|$ is the spectral norm of a matrix $A$, $\|A\|_{F}$ denotes the Frobenius norm, and $\|\mathbf{x}\|$ is the $\ell_{2}$-norm of a vector $\mathbf{x}$. $A\odot B$ is the Hadamard product of two matrices, i.e., $(A\odot B)_{ij}=A_{ij}B_{ij}$. Let $\mathbb{E}_{\boldsymbol{w}}[\cdot]$ and $\operatorname{Var}_{\boldsymbol{w}}[\cdot]$ be the expectation and variance with respect to the random vector $\boldsymbol{w}$ only. Given any vector $\boldsymbol{v}$, $\operatorname{diag}(\boldsymbol{v})$ is the diagonal matrix whose main diagonal is given by $\boldsymbol{v}$. $\lambda_{\min}(A)$ is the smallest eigenvalue of a Hermitian matrix $A$.


Before stating our main results, we describe our model and assumptions. We first consider the output of the first hidden layer and the empirical Conjugate Kernel (CK):

(1.5)  $Y:=\sigma(WX)\quad\text{and}\quad\frac{1}{d_{1}}Y^{\top}Y.$

Observe that the rows of the matrix $Y$ are independent and identically distributed since only $W$ is random and $X$ is deterministic. Let the $i$-th row of $Y$ be $\mathbf{y}_{i}^{\top}$, for $1\leq i\leq d_{1}$. Then, we obtain a sample covariance matrix,

(1.6)  $Y^{\top}Y=\sum_{i=1}^{d_{1}}\mathbf{y}_{i}\mathbf{y}_{i}^{\top},$

which is the sum of $d_{1}$ independent rank-one random matrices in $\mathbb{R}^{n\times n}$. Let the second moment of any row $\mathbf{y}_{i}$ be (1.3). Later on, we will approximate $\Phi$ based on the assumptions on the input data $X$.

Next, we define the empirical Neural Tangent Kernel (NTK) for (1.1), denoted by $H\in\mathbb{R}^{n\times n}$. From Section 3.3 in [FW20], the $(i,j)$-th entry of $H$ can be explicitly written as

(1.7)  $H_{ij}:=\frac{1}{d_{1}}\sum_{r=1}^{d_{1}}\left(\sigma(\boldsymbol{w}_{r}^{\top}\mathbf{x}_{i})\sigma(\boldsymbol{w}_{r}^{\top}\mathbf{x}_{j})+a_{r}^{2}\sigma^{\prime}(\boldsymbol{w}_{r}^{\top}\mathbf{x}_{i})\sigma^{\prime}(\boldsymbol{w}_{r}^{\top}\mathbf{x}_{j})\mathbf{x}_{i}^{\top}\mathbf{x}_{j}\right),\quad 1\leq i,j\leq n,$

where $\boldsymbol{w}_{r}$ is the $r$-th row of the weight matrix $W$, $\mathbf{x}_{i}$ is the $i$-th column of the matrix $X$, and $a_{r}$ is the $r$-th entry of the output layer $\boldsymbol{a}$. In matrix form, $H$ can be written as

(1.8)  $H:=\frac{1}{d_{1}}\left(Y^{\top}Y+(S^{\top}S)\odot(X^{\top}X)\right),$

where the $\alpha$-th column of $S$ is given by

(1.9)  $\operatorname{diag}(\sigma^{\prime}(W\mathbf{x}_{\alpha}))\boldsymbol{a},\quad\forall\,1\leq\alpha\leq n.$
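To make these definitions concrete, the following minimal NumPy sketch draws the Gaussian weights of Assumption 1.1 and forms the empirical CK matrix $\frac{1}{d_{1}}Y^{\top}Y$ of (1.5) and the NTK matrix $H$ of (1.8); the function name, the choice of activation, and the example data in the usage comment are ours for illustration only.

```python
import numpy as np

def empirical_ck_ntk(X, d1, sigma, dsigma, rng=None):
    """Empirical CK (1/d1) Y^T Y of (1.5) and NTK H of (1.8) for the two-layer model (1.1).

    X : (d0, n) deterministic data matrix whose columns are the inputs.
    sigma, dsigma : activation and its derivative, applied entrywise.
    """
    rng = np.random.default_rng() if rng is None else rng
    d0, n = X.shape
    W = rng.standard_normal((d1, d0))       # first-layer weights, i.i.d. N(0,1)
    a = rng.standard_normal(d1)             # output-layer weights
    Y = sigma(W @ X)                        # Y = sigma(WX), shape (d1, n)
    CK = Y.T @ Y / d1
    S = dsigma(W @ X) * a[:, None]          # alpha-th column = diag(sigma'(W x_alpha)) a, cf. (1.9)
    H = CK + (S.T @ S) * (X.T @ X) / d1     # (1.8): Hadamard product term added to the CK
    return CK, H

# Illustrative usage (tanh is Lipschitz but not normalized as in (1.10) below):
# X = np.random.standard_normal((500, 300)) / np.sqrt(500)
# CK, H = empirical_ck_ntk(X, d1=50000, sigma=np.tanh, dsigma=lambda t: 1 - np.tanh(t)**2)
```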

We introduce the following assumptions on the random weights, the nonlinear activation function $\sigma$, and the input data. These assumptions are largely carried over from [FW20].

Assumption 1.1.

The entries of $W$ and $\boldsymbol{a}$ are i.i.d. $\mathcal{N}(0,1)$ random variables.

Assumption 1.2.

The activation function $\sigma(x)$ is Lipschitz with Lipschitz constant $\lambda_{\sigma}\in(0,\infty)$. Assume that $\sigma$ is centered and normalized with respect to $\xi\sim\mathcal{N}(0,1)$ such that

(1.10)  $\mathbb{E}[\sigma(\xi)]=0,\qquad\mathbb{E}[\sigma^{2}(\xi)]=1.$

Define constants $a_{\sigma}$ and $b_{\sigma}\in\mathbb{R}$ by

(1.11)  $b_{\sigma}:=\mathbb{E}[\sigma^{\prime}(\xi)],\qquad a_{\sigma}:=\mathbb{E}[\sigma^{\prime}(\xi)^{2}].$

Furthermore, $\sigma$ satisfies either of the following:

  (1) $\sigma(x)$ is twice differentiable with $\sup_{x\in\mathbb{R}}|\sigma^{\prime\prime}(x)|\leq\lambda_{\sigma}$, or

  (2) $\sigma(x)$ is a piecewise linear function defined by

    $\sigma(x)=\begin{cases}ax+b,&x>0,\\ cx+b,&x\leq 0,\end{cases}$

    for some constants $a,b,c\in\mathbb{R}$ such that (1.10) holds.

Analogously to [HXAP20], our Assumption 1.2 permits $\sigma$ to be any of the commonly used activation functions, including ReLU, Sigmoid, and Tanh, although we have to center and normalize the activation functions to guarantee (1.10). Such normalized activation functions exclude a trivial spike in the limiting spectra of CK and NTK [BP21, FW20]. The foregoing assumptions ensure the nonlinear Hanson-Wright inequality used in our proofs. As a future direction, going beyond Gaussian weights and Lipschitz activation functions may require different types of concentration inequalities.

Next, we present the conditions on the deterministic input data $X$ and the asymptotic regime for our main results. Define the following $(\varepsilon,B)$-orthonormal property for our data matrix $X$.

Definition 1.3.

For any given $\varepsilon,B>0$, the matrix $X$ is $(\varepsilon,B)$-orthonormal if for any distinct columns $\mathbf{x}_{\alpha},\mathbf{x}_{\beta}$ of $X$, we have

$\big|\|\mathbf{x}_{\alpha}\|_{2}-1\big|\leq\varepsilon,\qquad\big|\|\mathbf{x}_{\beta}\|_{2}-1\big|\leq\varepsilon,\qquad\big|\mathbf{x}_{\alpha}^{\top}\mathbf{x}_{\beta}\big|\leq\varepsilon,$

and also

$\sum_{\alpha=1}^{n}(\|\mathbf{x}_{\alpha}\|_{2}-1)^{2}\leq B^{2},\qquad\|X\|\leq B.$
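As a sanity check of Definition 1.3, the short sketch below computes the smallest $\varepsilon$ and $B$ for which a given data matrix is $(\varepsilon,B)$-orthonormal; the function is our own illustration and not part of the paper's proofs.

```python
import numpy as np

def orthonormal_parameters(X):
    """Smallest (epsilon, B) for which the columns of X are (epsilon, B)-orthonormal
    in the sense of Definition 1.3."""
    norms = np.linalg.norm(X, axis=0)                   # ||x_alpha||_2
    G = X.T @ X
    off = np.abs(G - np.diag(np.diag(G)))               # |x_alpha^T x_beta| for alpha != beta
    eps = max(np.max(np.abs(norms - 1)), np.max(off))
    B = max(np.sqrt(np.sum((norms - 1) ** 2)), np.linalg.norm(X, 2))
    return eps, B

# Columns sampled uniformly from the unit sphere S^{d0-1} (cf. the discussion below):
# d0 = n = 1000
# X = np.random.standard_normal((d0, n)); X /= np.linalg.norm(X, axis=0)
# eps, B = orthonormal_parameters(X)   # expect eps = O(sqrt(log(n)/n)) and B = O(1)
```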
Assumption 1.4.

Let $n,d_{0},d_{1}\to\infty$ such that

  (a) $\gamma:=n/d_{1}\to 0$;

  (b) $X$ is $(\varepsilon_{n},B)$-orthonormal with $n\varepsilon_{n}^{4}\to 0$ as $n\to\infty$;

  (c) the empirical spectral distribution $\hat{\mu}_{0}$ of $X^{\top}X$ converges weakly to a fixed, non-degenerate probability distribution $\mu_{0}\neq\delta_{0}$ on $[0,\infty)$.

In (b) above, the $(\varepsilon_{n},B)$-orthonormal property with $n\varepsilon^{4}_{n}=o(1)$ is a quantitative version of pairwise approximate orthogonality for the column vectors of the data matrix $X\in\mathbb{R}^{d_{0}\times n}$. When $d_{0}\asymp n$, it holds, with high probability, for many random $X$ with independent columns $\mathbf{x}_{\alpha}$, including anisotropic Gaussian vectors $\mathbf{x}_{\alpha}\sim\mathcal{N}(0,\Sigma)$ with $\operatorname{tr}(\Sigma)=1$ and $\|\Sigma\|\lesssim 1/n$, vectors generated by Gaussian mixture models, and vectors satisfying the log-Sobolev inequality or the convex Lipschitz concentration property. See [FW20, Section 3.1] for more details. In particular, when the $\mathbf{x}_{\alpha}$'s are independently sampled from the unit sphere $\mathbb{S}^{d_{0}-1}$, $X$ is $(\varepsilon_{n},B)$-orthonormal with high probability, where $\varepsilon_{n}=O\big(\sqrt{\frac{\log(n)}{n}}\big)$ and $B=O(1)$ as $n\asymp d_{0}$. In this case, for any $\ell>2$, we have $n\varepsilon_{n}^{\ell}\to 0$. In our theory, we always treat $X$ as a deterministic matrix. However, our results also apply to random input $X$ independent of the weights $W$ and $\boldsymbol{a}$ by conditioning on the high-probability event that $X$ satisfies the $(\varepsilon_{n},B)$-orthonormal property. Unlike data vectors with independent entries, our assumption is amenable to real-world datasets [LGC+21] and to establishing $n$-dependent deterministic equivalents as in [LCM20].

The following Hermite polynomials are crucial to the approximation of $\Phi$ in our analysis.

Definition 1.5 (Normalized Hermite polynomials).

The $r$-th normalized Hermite polynomial is given by

$h_{r}(x)=\frac{1}{\sqrt{r!}}(-1)^{r}e^{x^{2}/2}\frac{d^{r}}{dx^{r}}e^{-x^{2}/2}.$

Here $\{h_{r}\}_{r=0}^{\infty}$ form an orthonormal basis of $L^{2}(\mathbb{R},\Gamma)$, where $\Gamma$ denotes the standard Gaussian distribution. For $\sigma_{1},\sigma_{2}\in L^{2}(\mathbb{R},\Gamma)$, the inner product is defined by

$\langle\sigma_{1},\sigma_{2}\rangle=\int_{-\infty}^{\infty}\sigma_{1}(x)\sigma_{2}(x)\frac{e^{-x^{2}/2}}{\sqrt{2\pi}}dx.$

Every function $\sigma\in L^{2}(\mathbb{R},\Gamma)$ can be expanded as a Hermite polynomial expansion

$\sigma(x)=\sum_{r=0}^{\infty}\zeta_{r}(\sigma)h_{r}(x),$

where $\zeta_{r}(\sigma)$ is the $r$-th Hermite coefficient defined by

$\zeta_{r}(\sigma):=\int_{-\infty}^{\infty}\sigma(x)h_{r}(x)\frac{e^{-x^{2}/2}}{\sqrt{2\pi}}dx.$

In the following statements and proofs, we denote $\xi\sim\mathcal{N}(0,1)$. Then for any $k\in\mathbb{N}$, we have

(1.12)  $\zeta_{k}(\sigma)=\mathbb{E}[\sigma(\xi)h_{k}(\xi)].$

In particular, $b_{\sigma}=\mathbb{E}[\sigma^{\prime}(\xi)]=\mathbb{E}[\xi\cdot\sigma(\xi)]=\zeta_{1}(\sigma)$. Let $f_{k}(x)=x^{k}$. We define the inner-product kernel random matrix $f_{k}(X^{\top}X)\in\mathbb{R}^{n\times n}$ by applying $f_{k}$ entrywise to $X^{\top}X$. Define a deterministic matrix

(1.13)  $\Phi_{0}:=\boldsymbol{\mu}\boldsymbol{\mu}^{\top}+\sum_{k=1}^{3}\zeta_{k}(\sigma)^{2}f_{k}(X^{\top}X)+(1-\zeta_{1}(\sigma)^{2}-\zeta_{2}(\sigma)^{2}-\zeta_{3}(\sigma)^{2})\operatorname{Id},$

where the $\alpha$-th entry of $\boldsymbol{\mu}\in\mathbb{R}^{n}$ is $\sqrt{2}\zeta_{2}(\sigma)\cdot(\|\mathbf{x}_{\alpha}\|_{2}-1)$ for $1\leq\alpha\leq n$. We will employ $\Phi_{0}$ as an approximation of the population covariance $\Phi$ in (1.3) in the spectral norm when $n\varepsilon_{n}^{4}\to 0$.
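For readers who wish to compute these quantities numerically, here is a small sketch (ours, not part of the proofs) that evaluates the Hermite coefficients $\zeta_{k}(\sigma)$ of (1.12) by Gauss–Hermite quadrature and assembles the approximation $\Phi_{0}$ of (1.13).

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite_e import hermegauss

def hermite_coeff(sigma, k, n_nodes=200):
    """zeta_k(sigma) = E[sigma(xi) h_k(xi)], xi ~ N(0,1), via Gauss-Hermite quadrature
    for the probabilists' weight e^{-x^2/2}."""
    x, w = hermegauss(n_nodes)
    w = w / np.sqrt(2 * np.pi)                  # normalize to the standard Gaussian density
    He = [np.ones_like(x), x]                   # probabilists' Hermite polynomials He_r
    for r in range(2, k + 1):
        He.append(x * He[r - 1] - (r - 1) * He[r - 2])
    h_k = He[k] / np.sqrt(factorial(k))         # normalized Hermite polynomial h_k
    return np.sum(w * sigma(x) * h_k)

def Phi_0(X, sigma):
    """The deterministic approximation Phi_0 in (1.13) of the population CK."""
    n = X.shape[1]
    zeta = [hermite_coeff(sigma, k) for k in range(4)]
    norms = np.linalg.norm(X, axis=0)
    mu = np.sqrt(2) * zeta[2] * (norms - 1)     # alpha-th entry of the vector mu
    G = X.T @ X
    out = np.outer(mu, mu) + (1 - zeta[1]**2 - zeta[2]**2 - zeta[3]**2) * np.eye(n)
    for k in range(1, 4):
        out += zeta[k]**2 * G**k                # f_k(X^T X) = entrywise k-th power
    return out
```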

For any $n\times n$ Hermitian matrix $A_{n}$ with eigenvalues $\lambda_{1},\dots,\lambda_{n}$, the empirical spectral distribution of $A_{n}$ is defined by

$\mu_{A_{n}}(x)=\frac{1}{n}\sum_{i=1}^{n}\delta_{\lambda_{i}}(x).$

We write $\operatorname{lim\,spec}(A_{n})=\mu$ if $\mu_{A_{n}}\to\mu$ weakly as $n\to\infty$. The main tool we use to study the limiting spectral distribution of a matrix sequence is the Stieltjes transform, defined as follows.

Definition 1.6 (Stieltjes transform).

Let $\mu$ be a probability measure on $\mathbb{R}$. The Stieltjes transform of $\mu$ is the function $s(z)$ defined on the upper half plane $\mathbb{C}^{+}$ by

$s(z)=\int_{\mathbb{R}}\frac{1}{x-z}d\mu(x).$

For any $n\times n$ Hermitian matrix $A_{n}$, the Stieltjes transform of the empirical spectral distribution of $A_{n}$ can be written as $\operatorname{tr}(A_{n}-z\operatorname{Id})^{-1}$. We call $(A_{n}-z\operatorname{Id})^{-1}$ the resolvent of $A_{n}$.

2. Main results

2.1. Spectra of the centered CK and NTK

Our first result is a deformed semicircle law for the CK matrix. Denote by $\tilde{\mu}_{0}=(1-b_{\sigma}^{2})+b_{\sigma}^{2}\mu_{0}$ the distribution of $(1-b_{\sigma}^{2})+b_{\sigma}^{2}\lambda$ with $\lambda$ sampled from the distribution $\mu_{0}$. The limiting law of our centered and normalized CK matrix is described by $\mu_{s}\boxtimes\tilde{\mu}_{0}$, where $\mu_{s}$ is the standard semicircle law and $\boxtimes$ denotes free multiplicative convolution in free harmonic analysis. For full descriptions of free independence and free multiplicative convolution, see [NS06, Lecture 18] and [AGZ10, Section 5.3.3]. The free multiplicative convolution $\boxtimes$ was first introduced in [Voi87] and later found many applications to products of asymptotically free random matrices. The main tool for computing free multiplicative convolutions is the $S$-transform, also invented by [Voi87]. The $S$-transform was recently utilized to study the dynamical isometry of deep neural networks [PSG17, PSG18, XBSD+18, HK21, CH23]. Some basic properties and intriguing examples of free multiplicative convolution with $\mu_{s}$ can also be found in [BZ10, Theorems 1.2, 1.3].

Theorem 2.1 (Limiting spectral distribution for the conjugate kernel).

Suppose Assumptions 1.1, 1.2 and 1.4 on the input matrix $X$ hold. Then the empirical eigenvalue distribution of

(2.1)  $\frac{1}{\sqrt{d_{1}n}}\left(Y^{\top}Y-\mathbb{E}[Y^{\top}Y]\right)$

converges weakly to

(2.2)  $\mu:=\mu_{s}\boxtimes\big((1-b_{\sigma}^{2})+b_{\sigma}^{2}\cdot\mu_{0}\big)=\mu_{s}\boxtimes\tilde{\mu}_{0}$

almost surely as $n,d_{0},d_{1}\to\infty$. Furthermore, if $d_{1}\varepsilon_{n}^{4}\to 0$ as $n,d_{0},d_{1}\to\infty$, then the empirical eigenvalue distribution of

(2.3)  $\sqrt{\frac{d_{1}}{n}}\left(\frac{1}{d_{1}}Y^{\top}Y-\Phi_{0}\right)$

also converges weakly to the probability measure $\mu$ almost surely, whose Stieltjes transform $m(z)$ is defined by

(2.4)  $m(z)+\int_{\mathbb{R}}\frac{d\tilde{\mu}_{0}(x)}{z+\beta(z)x}=0$

for each $z\in\mathbb{C}^{+}$, where $\beta(z)\in\mathbb{C}^{+}$ is the unique solution to

(2.5)  $\beta(z)+\int_{\mathbb{R}}\frac{x\,d\tilde{\mu}_{0}(x)}{z+\beta(z)x}=0.$

Suppose additionally that $b_{\sigma}=0$, i.e., $\mathbb{E}[\sigma^{\prime}(\xi)]=0$. In this case, Theorem 2.1 shows that the limiting spectral distribution of (1.2) is the semicircle law, and from (2.2), the deterministic data matrix $X$ has no effect on the limiting spectrum. See Figure 1 for a cosine-type $\sigma$ with $b_{\sigma}=0$. The only effect of the nonlinearity on $\mu$ is the coefficient $b_{\sigma}$ in the deformation $\tilde{\mu}_{0}$.

Figure 1. Simulations of the empirical eigenvalue distributions of (2.3) and the theoretical prediction (red curves) of the limiting law $\mu$, where the activation function $\sigma(x)\propto\cos(x)$ satisfies Assumption 1.2 with $b_{\sigma}=0$ and $X$ is a standard Gaussian random matrix. Dimension parameters are $n=1.9\times 10^{3}$, $d_{0}=2\times 10^{3}$, $d_{1}=2\times 10^{5}$ (left); $n=2\times 10^{3}$, $d_{0}=1.9\times 10^{3}$, $d_{1}=2\times 10^{5}$ (middle); $n=2\times 10^{3}$, $d_{0}=2\times 10^{3}$, $d_{1}=2\times 10^{5}$ (right).

Figure 2 shows a simulation of the limiting spectral distribution of the CK with activation function $\sigma(x)$ given by $\arctan(x)$ after proper shifting and scaling. More simulations with different activation functions are provided in Appendix B. The red curves are computed from the self-consistent equations (2.4) and (2.5) in Theorem 2.1 (a numerical sketch follows Figure 2). In Section 4, we present general random matrix models with limiting eigenvalue distributions similar to $\mu$, whose Stieltjes transforms are also determined by (2.4) and (2.5).

Figure 2. Simulations of the empirical eigenvalue distributions of (2.3) and the theoretical prediction (red curves) of the limiting law $\mu$, where the activation function $\sigma(x)\propto\arctan(x)$ satisfies Assumption 1.2 and $X$ is a standard Gaussian random matrix: $n=10^{3}$, $d_{0}=10^{3}$, $d_{1}=10^{5}$ (left); $n=10^{3}$, $d_{0}=1.5\times 10^{3}$, $d_{1}=10^{5}$ (middle); $n=1.5\times 10^{3}$, $d_{0}=10^{3}$, $d_{1}=10^{5}$ (right).
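For completeness, a minimal fixed-point sketch for the self-consistent equations (2.4)–(2.5) is given below; it takes a sample (or atom) representation of $\tilde{\mu}_{0}$ and returns the density of $\mu$ by Stieltjes inversion. This is an illustration under our own implementation choices (grid, damping, number of iterations), not an algorithm from the paper.

```python
import numpy as np

def deformed_semicircle_density(x_grid, mu_tilde0, eta=1e-3, n_iter=2000, damping=0.5):
    """Density of mu = mu_s boxtimes tilde{mu}_0 via (2.4)-(2.5) and Stieltjes inversion.

    mu_tilde0 : 1d array of samples/atoms approximating the measure tilde{mu}_0.
    """
    density = np.empty_like(x_grid, dtype=float)
    for j, x in enumerate(x_grid):
        z = x + 1j * eta                      # evaluate slightly above the real axis
        beta = 1j                             # initial guess in the upper half plane
        for _ in range(n_iter):               # damped fixed-point iteration for (2.5)
            beta_new = -np.mean(mu_tilde0 / (z + beta * mu_tilde0))
            beta = damping * beta + (1 - damping) * beta_new
        m = -np.mean(1.0 / (z + beta * mu_tilde0))     # Stieltjes transform m(z) from (2.4)
        density[j] = m.imag / np.pi                    # inversion: density ~ Im m(x + i eta)/pi
    return density

# Sanity check: for b_sigma = 0 we have tilde{mu}_0 = delta_1, and the output should
# recover the standard semicircle density supported on [-2, 2]:
# xs = np.linspace(-2.5, 2.5, 201)
# rho = deformed_semicircle_density(xs, mu_tilde0=np.ones(1))
```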

Theorem 2.1 can be extended to the NTK model as well. Denote by

(2.6)  $\Psi:=\frac{1}{d_{1}}\mathbb{E}[S^{\top}S]\odot(X^{\top}X)\in\mathbb{R}^{n\times n}.$

As an approximation of $\Psi$ in the spectral norm, we define

(2.7)  $\Psi_{0}:=\left(a_{\sigma}-\sum_{k=0}^{2}\eta_{k}^{2}(\sigma)\right)\operatorname{Id}+\sum_{k=0}^{2}\eta_{k}^{2}(\sigma)f_{k+1}(X^{\top}X),$

where the $f_{k}$'s are defined in (1.13), $a_{\sigma}$ is defined in (1.11), and the $k$-th Hermite coefficient of $\sigma^{\prime}$ is

(2.8)  $\eta_{k}(\sigma):=\mathbb{E}[\sigma^{\prime}(\xi)h_{k}(\xi)].$

Then, a similar deformed semicircle law can be obtained for the empirical NTK matrix $H$.

Theorem 2.2 (Limiting spectral distribution of the NTK).

Under Assumptions 1.1, 1.2 and 1.4 on the input matrix $X$, the empirical eigenvalue distribution of

(2.9)  $\sqrt{\frac{d_{1}}{n}}\left(H-\mathbb{E}[H]\right)$

converges weakly to $\mu=\mu_{s}\boxtimes\big((1-b_{\sigma}^{2})+b_{\sigma}^{2}\cdot\mu_{0}\big)$ almost surely as $n,d_{0},d_{1}\to\infty$ and $n/d_{1}\to 0$. Furthermore, if $\varepsilon_{n}^{4}d_{1}\to 0$, then the empirical eigenvalue distribution of

(2.10)  $\sqrt{\frac{d_{1}}{n}}\left(H-\Phi_{0}-\Psi_{0}\right)$

converges weakly to $\mu$ almost surely, where $\Phi_{0}$ and $\Psi_{0}$ are defined in (1.13) and (2.7), respectively.

2.2. Non-asymptotic estimations

With our nonlinear Hanson-Wright inequality (Corollary 3.5), we attain the following concentration bound on the CK matrix in the spectral norm.

Theorem 2.3.

With Assumption 1.1, assume $X$ satisfies $\sum_{i=1}^{n}(\|\mathbf{x}_{i}\|^{2}-1)^{2}\leq B^{2}$ for a constant $B\geq 0$, and $\sigma$ is $\lambda_{\sigma}$-Lipschitz with $\mathbb{E}[\sigma(\xi)]=0$. Then with probability at least $1-4e^{-2n}$,

(2.11)  $\left\|\frac{1}{d_{1}}Y^{\top}Y-\Phi\right\|\leq C\left(\sqrt{\frac{n}{d_{1}}}+\frac{n}{d_{1}}\right)\lambda_{\sigma}^{2}\|X\|^{2}+32B\lambda_{\sigma}^{2}\|X\|\sqrt{\frac{n}{d_{1}}},$

where $C>0$ is a universal constant.

Remark 2.4.

Theorem 2.3 ensures that the empirical spectral measure $\mu_{n}$ of the centered random matrix $\sqrt{\frac{d_{1}}{n}}\left(\frac{1}{d_{1}}Y^{\top}Y-\Phi\right)$ has bounded support for all sufficiently large $n$. Together with the global law in Theorem 2.1, our concentration inequality (2.11) is tight up to a constant factor. Additionally, by the weak convergence of $\mu_{n}$ to $\mu$ proved in Theorem 2.1, we can take the test function $x^{2}$ to obtain that

$\int_{\mathbb{R}}x^{2}d\mu_{n}(x)\to\int_{\mathbb{R}}x^{2}d\mu(x),\quad\text{i.e.,}\quad\frac{\sqrt{d_{1}}}{n}\left\|\frac{1}{d_{1}}Y^{\top}Y-\Phi\right\|_{F}\rightarrow\left(\int_{\mathbb{R}}x^{2}d\mu(x)\right)^{\frac{1}{2}}$

almost surely as $n,d_{1}\to\infty$ and $d_{1}/n\to\infty$. Therefore, the fluctuation of $\frac{1}{d_{1}}Y^{\top}Y$ around $\Phi$ in the Frobenius norm is exactly of order $n/\sqrt{d_{1}}$.

Based on the foregoing estimate, we have the following lower bound on the smallest eigenvalue of the empirical conjugate kernel, denoted by $\lambda_{\min}\left(\frac{1}{d_{1}}Y^{\top}Y\right)$.

Theorem 2.5.

Suppose Assumptions 1.1 and 1.2 hold, $\sigma$ is not a linear function, and $X$ is $(\varepsilon_{n},B)$-orthonormal. Then with probability at least $1-4e^{-2n}$,

$\lambda_{\min}\left(\frac{1}{d_{1}}Y^{\top}Y\right)\geq 1-\sum_{i=1}^{3}\zeta_{i}(\sigma)^{2}-C_{B}\varepsilon_{n}^{2}\sqrt{n}-C\left(\sqrt{\frac{n}{d_{1}}}+\frac{n}{d_{1}}\right)\lambda_{\sigma}^{2}B^{2},$

where $C_{B}$ is a constant depending on $B$. In particular, if $\varepsilon_{n}^{4}n=o(1)$, $B=O(1)$, and $d_{1}=\omega(n)$, then with high probability,

$\lambda_{\min}\left(\frac{1}{d_{1}}Y^{\top}Y\right)\geq 1-\sum_{i=1}^{3}\zeta_{i}(\sigma)^{2}-o(1).$
Remark 2.6.

A related result, [OS20, Theorem 5], assumed $\|\mathbf{x}_{j}\|=1$ for all $j\in[n]$, $\lambda_{\sigma}\leq B$, $|\sigma(0)|\leq B$, and $d_{1}\geq C\log^{2}(n)\frac{n}{\lambda_{\min}(\Phi)}$, and obtained $\frac{1}{d_{1}}\lambda_{\min}(Y^{\top}Y)\geq\lambda_{\min}(\Phi)-o(1)$. We relax the assumption on the column vectors of $X$ and extend the range of $d_{1}$ down to $d_{1}=\Omega(n)$ to guarantee that $\frac{1}{d_{1}}\lambda_{\min}(Y^{\top}Y)$ is lower bounded by an absolute constant, under the extra assumption that $\mathbb{E}[\sigma(\xi)]=0$. This assumption can always be satisfied by shifting the activation function by a suitable constant. Our bound for $\lambda_{\min}(\Phi)$ is derived via the Hermite polynomial expansion, similar to [OS20, Lemma 15]. However, we apply an $\varepsilon$-net argument for the concentration of $\frac{1}{d_{1}}Y^{\top}Y$ around $\Phi$, while a matrix Chernoff concentration bound with truncation was used in [OS20, Theorem 5].


Additionally, a concentration bound for the NTK matrix $H$ is given in the next theorem. Recall that $H$ is defined in (1.8) and the columns of $S$ are defined in (1.9) under Assumption 1.1.

Theorem 2.7.

Suppose $d_{1}\geq\log n$ and $\sigma$ is $\lambda_{\sigma}$-Lipschitz. Then with probability at least $1-n^{-7/3}$,

(2.12)  $\left\|\frac{1}{d_{1}}(S^{\top}S-\mathbb{E}[S^{\top}S])\odot(X^{\top}X)\right\|\leq 10\lambda_{\sigma}^{4}\|X\|^{4}\sqrt{\frac{\log n}{d_{1}}}.$

Moreover, if the assumptions in Theorem 2.3 hold, then with probability at least $1-n^{-7/3}-4e^{-2n}$,

(2.13)  $\|H-\mathbb{E}H\|\leq C\left(\sqrt{\frac{n}{d_{1}}}+\frac{n}{d_{1}}\right)\lambda_{\sigma}^{2}\|X\|^{2}+32B\lambda_{\sigma}^{2}\|X\|\sqrt{\frac{n}{d_{1}}}+10\lambda_{\sigma}^{4}\|X\|^{4}\sqrt{\frac{\log n}{d_{1}}}.$
Remark 2.8.

Compared to Proposition D.3 in [HXAP20], we assume $\boldsymbol{a}$ is a Gaussian vector instead of a Rademacher random vector and obtain a better bound. If $a_{i}\in\{+1,-1\}$, one can apply the matrix Bernstein inequality for sums of bounded random matrices. In our case, the boundedness condition is not satisfied. Section S1.1 in [AP20] applied the matrix Bernstein inequality for sums of bounded random matrices when $\boldsymbol{a}$ is a Gaussian vector, but the boundedness condition does not hold in Equation (S7) of [AP20].

Based on Theorem 2.7, we get a lower bound for the smallest eigenvalue of the NTK.

Theorem 2.9.

Under Assumptions 1.1 and 1.2, suppose that $X$ is $(\varepsilon_{n},B)$-orthonormal, $\sigma$ is not a linear function, and $d_{1}\geq\log n$. Then with probability at least $1-n^{-7/3}$,

$\lambda_{\min}(H)\geq a_{\sigma}-\sum_{k=0}^{2}\eta_{k}^{2}(\sigma)-C_{B}\varepsilon_{n}^{4}n-10\lambda_{\sigma}^{4}B^{4}\sqrt{\frac{\log n}{d_{1}}},$

where $C_{B}$ is a constant depending only on $B$, and $\eta_{k}(\sigma)$ is defined in (2.8). In particular, if $\varepsilon_{n}^{4}n=o(1)$, $B=O(1)$, and $d_{1}=\omega(\log n)$, then with high probability,

$\lambda_{\min}(H)\geq\left(a_{\sigma}-\sum_{k=0}^{2}\eta_{k}^{2}(\sigma)\right)(1-o(1)).$
Remark 2.10.

We relax the assumption in [NMM21] to $d_{1}=\omega(\log n)$ for the two-layer case, and our result applies beyond the ReLU activation function and under more general assumptions on $X$. Our proof strategy also differs from [NMM21]. In [NMM21], the authors used the inequality $\lambda_{\min}((S^{\top}S)\odot(X^{\top}X))\geq\min_{i}\|S_{i}\|_{2}^{2}\cdot\lambda_{\min}(X^{\top}X)$, where $S_{i}$ is the $i$-th column of $S$. Obtaining the lower bound is then reduced to showing the concentration of the $2$-norms of the column vectors of $S$. Here we apply a matrix concentration inequality to $(S^{\top}S)\odot(X^{\top}X)$ and obtain a weaker assumption on $d_{1}$ that still ensures the lower bound on $\lambda_{\min}(H)$.

Remark 2.11.

In Theorems 2.5 and 2.9, we exclude the linear activation function. When $\sigma(x)=x$, it is easy to check that both $\frac{1}{d_{1}}\lambda_{\min}(Y^{\top}Y)$ and $\lambda_{\min}(H)$ are trivially determined by $\lambda_{\min}(X^{\top}X)$, which can be vanishing. In this case, the lower bounds on the smallest eigenvalues of CK and NTK rely on the assumptions on $\mu_{0}$ or the distribution of $X$. For instance, when the entries of $X$ are i.i.d. Gaussian random variables, $\lambda_{\min}(X^{\top}X)$ was analyzed in [Sil85].

2.3. Training and test errors for random feature regression

We apply the results of the preceding sections to the two-layer neural network at random initialization defined in (1.1) to estimate the training and test errors with mean-squared losses for random feature regression under the ultra-wide regime, where $d_{1}/n\to\infty$ and $n\to\infty$. In this model, we take the random features $\frac{1}{\sqrt{d_{1}}}\sigma(WX)$ and consider the regression with respect to $\boldsymbol{\theta}\in\mathbb{R}^{d_{1}}$ based on

$f_{\boldsymbol{\theta}}(X):=\frac{1}{\sqrt{d_{1}}}\boldsymbol{\theta}^{\top}\sigma\left(WX\right),$

with training data $X\in\mathbb{R}^{d_{0}\times n}$ and training labels $\boldsymbol{y}\in\mathbb{R}^{n}$. Considering ridge regression with ridge parameter $\lambda\geq 0$ and the squared loss defined by

(2.14)  $L(\boldsymbol{\theta}):=\|f_{\boldsymbol{\theta}}(X)^{\top}-\boldsymbol{y}\|^{2}+\lambda\|\boldsymbol{\theta}\|^{2},$

the minimizer $\hat{\boldsymbol{\theta}}:=\arg\min_{\boldsymbol{\theta}}L(\boldsymbol{\theta})$ has the explicit form

(2.15)  $\hat{\boldsymbol{\theta}}=\frac{1}{\sqrt{d_{1}}}Y\left(\frac{1}{d_{1}}Y^{\top}Y+\lambda\operatorname{Id}\right)^{-1}\boldsymbol{y},$

where $Y=\sigma(WX)$ is defined in (1.5). When $\sigma$ is nonlinear, by Theorem 2.5, the inverse in (2.15) exists for any $\lambda\geq 0$. Hence, in the following results, we focus on nonlinear activation functions (as noted in Remark 2.11, when $\sigma(x)=x$ the smallest eigenvalue of the CK may vanish; to include the linear activation function, we could alternatively assume that the ridge parameter $\lambda$ is strictly positive and focus on random feature ridge regression). In general, the optimal predictor for these random features with respect to (2.14) is

(2.16)  $\hat{f}_{\lambda}^{(RF)}(\mathbf{x}):=\frac{1}{\sqrt{d_{1}}}\hat{\boldsymbol{\theta}}^{\top}\sigma\left(W\mathbf{x}\right)=K_{n}(\mathbf{x},X)(K_{n}(X,X)+\lambda\operatorname{Id})^{-1}\boldsymbol{y},$

where we define the empirical kernel $K_{n}(\cdot,\cdot):\mathbb{R}^{d_{0}}\times\mathbb{R}^{d_{0}}\to\mathbb{R}$ as

(2.17)  $K_{n}(\mathbf{x},\mathbf{z}):=\frac{1}{d_{1}}\sigma(W\mathbf{x})^{\top}\sigma(W\mathbf{z})=\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}\sigma(\langle\boldsymbol{w}_{i},\mathbf{x}\rangle)\sigma(\langle\boldsymbol{w}_{i},\mathbf{z}\rangle).$

The $n$-dimensional row vector is given by

(2.18)  $K_{n}(\mathbf{x},X)=[K_{n}(\mathbf{x},\mathbf{x}_{1}),\ldots,K_{n}(\mathbf{x},\mathbf{x}_{n})],$

and the $(i,j)$ entry of $K_{n}(X,X)$ is defined by $K_{n}(\mathbf{x}_{i},\mathbf{x}_{j})$, for $1\leq i,j\leq n$.
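The closed forms (2.15)–(2.17) translate directly into a few lines of code; the following NumPy sketch of the random feature ridge predictor is our own illustration, with the activation and the data supplied by the user.

```python
import numpy as np

def random_feature_ridge_predict(X, y, X_test, d1, sigma, lam, rng=None):
    """Random feature ridge predictor hat{f}^{(RF)}_lambda of (2.16).

    X : (d0, n) training inputs, y : (n,) labels, X_test : (d0, m) test inputs.
    """
    rng = np.random.default_rng() if rng is None else rng
    d0, n = X.shape
    W = rng.standard_normal((d1, d0))
    Y = sigma(W @ X)                                     # (d1, n) random features
    Kn = Y.T @ Y / d1                                    # empirical CK, K_n(X, X) of (2.17)
    alpha = np.linalg.solve(Kn + lam * np.eye(n), y)     # (K_n(X,X) + lambda Id)^{-1} y
    K_test = sigma(W @ X_test).T @ Y / d1                # rows are K_n(x, X) as in (2.18)
    return K_test @ alpha

# The kernel predictor (2.19) below is obtained by replacing K_n with the population
# kernel K of (1.4), e.g. using its closed form for specific activations.
```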

Analogously, consider any kernel function $K(\cdot,\cdot):\mathbb{R}^{d_{0}}\times\mathbb{R}^{d_{0}}\to\mathbb{R}$. The optimal kernel predictor with ridge parameter $\lambda\geq 0$ for kernel ridge regression is given by (see [RR07, AKM+17, LR20, JSS+20, LLS21, BMR21] for more details)

(2.19)  $\hat{f}_{\lambda}^{(K)}(\mathbf{x}):=K(\mathbf{x},X)(K(X,X)+\lambda\operatorname{Id})^{-1}\boldsymbol{y},$

where $K(X,X)$ is the $n\times n$ matrix whose $(i,j)$ entry is $K(\mathbf{x}_{i},\mathbf{x}_{j})$, and $K(\mathbf{x},X)$ is a row vector in $\mathbb{R}^{n}$ defined analogously to (2.18). We compare the characteristics of the two predictors $\hat{f}_{\lambda}^{(RF)}(\mathbf{x})$ and $\hat{f}_{\lambda}^{(K)}(\mathbf{x})$ when the kernel function $K$ is defined in (1.4). Denote the optimal predictors for random features and the kernel $K$ on the training data $X$ by

$\hat{f}_{\lambda}^{(RF)}(X)=\left(\hat{f}_{\lambda}^{(RF)}(\mathbf{x}_{1}),\dots,\hat{f}_{\lambda}^{(RF)}(\mathbf{x}_{n})\right)^{\top},\qquad\hat{f}_{\lambda}^{(K)}(X)=\left(\hat{f}_{\lambda}^{(K)}(\mathbf{x}_{1}),\dots,\hat{f}_{\lambda}^{(K)}(\mathbf{x}_{n})\right)^{\top},$

respectively. Notice that, in this case, $K(X,X)\equiv\Phi$ defined in (1.3), and $K_{n}(X,X)$ is the random empirical CK matrix $\frac{1}{d_{1}}Y^{\top}Y$ defined in (1.5).

We aim to compare the training and test errors of these two predictors for ultra-wide random neural networks. Let the training errors of the two predictors be

(2.20)  $E_{\textnormal{train}}^{(K,\lambda)}:=\frac{1}{n}\|\hat{f}_{\lambda}^{(K)}(X)-\boldsymbol{y}\|_{2}^{2}=\frac{\lambda^{2}}{n}\|(K(X,X)+\lambda\operatorname{Id})^{-1}\boldsymbol{y}\|^{2},$
(2.21)  $E_{\textnormal{train}}^{(RF,\lambda)}:=\frac{1}{n}\|\hat{f}_{\lambda}^{(RF)}(X)-\boldsymbol{y}\|_{2}^{2}=\frac{\lambda^{2}}{n}\|(K_{n}(X,X)+\lambda\operatorname{Id})^{-1}\boldsymbol{y}\|^{2}.$

In the following theorem, we show that, with high probability, the training error of the random feature regression model can be approximated by that of the corresponding kernel regression model with the same ridge parameter $\lambda\geq 0$ for ultra-wide neural networks.

Theorem 2.12 (Training error approximation).

Suppose Assumptions 1.1, 1.2 and 1.4 hold, and $\sigma$ is not a linear function. Then, for all large $n$, with probability at least $1-4e^{-2n}$,

(2.22)  $\left|E_{\textnormal{train}}^{(RF,\lambda)}-E_{\textnormal{train}}^{(K,\lambda)}\right|\leq\frac{C_{1}}{\sqrt{nd_{1}}}\left(\sqrt{\frac{n}{d_{1}}}+C_{2}\right)\|\boldsymbol{y}\|^{2},$

where the constants $C_{1}$ and $C_{2}$ depend only on $\lambda$, $B$ and $\sigma$.

Next, to investigate the test errors (or generalization errors), we introduce further assumptions on the data and the target function that we want to learn from the training data. Denote the true regression function by $f^{*}:\mathbb{R}^{d_{0}}\to\mathbb{R}$. Then, the training labels are given by

(2.23)  $\boldsymbol{y}=f^{*}(X)+\boldsymbol{\epsilon}\quad\text{and}\quad f^{*}(X)=(f^{*}(\mathbf{x}_{1}),\ldots,f^{*}(\mathbf{x}_{n}))^{\top},$

where $\boldsymbol{\epsilon}\in\mathbb{R}^{n}$ is the training label noise. For simplicity, we further impose the following assumptions, analogous to [LD21].

Assumption 2.13.

Assume that the target function is a linear function $f^{*}(\mathbf{x})=\langle\boldsymbol{\beta}^{*},\mathbf{x}\rangle$, where the random vector satisfies $\boldsymbol{\beta}^{*}\sim\mathcal{N}(0,\sigma^{2}_{\boldsymbol{\beta}}\operatorname{Id})$. In this case, the training label vector is given by $\boldsymbol{y}=X^{\top}\boldsymbol{\beta}^{*}+\boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\sigma_{\boldsymbol{\epsilon}}^{2}\operatorname{Id})$ is independent of $\boldsymbol{\beta}^{*}\in\mathbb{R}^{d_{0}}$.

Assumption 2.14.

Suppose that the training dataset $X=[\mathbf{x}_{1},\ldots,\mathbf{x}_{n}]\in\mathbb{R}^{d_{0}\times n}$ satisfies the $(\varepsilon_{n},B)$-orthonormal condition with $n\varepsilon_{n}^{4}=o(1)$, and that a test datum $\mathbf{x}\in\mathbb{R}^{d_{0}}$ is independent of $X$ and $\boldsymbol{y}$ such that $\tilde{X}:=[\mathbf{x}_{1},\ldots,\mathbf{x}_{n},\mathbf{x}]\in\mathbb{R}^{d_{0}\times(n+1)}$ is also $(\varepsilon_{n},B)$-orthonormal. For convenience, we further assume that the population covariance of the test data is $\mathbb{E}_{\mathbf{x}}[\mathbf{x}\mathbf{x}^{\top}]=\frac{1}{d_{0}}\operatorname{Id}$.

Remark 2.15.

Our Assumption 2.14 on the test datum $\mathbf{x}$ ensures the same statistical behavior as the training data in $X$, but we do not impose any explicit assumption on the distribution of $\mathbf{x}$. Such assumptions are promising for handling statistical models with real-world data [LC18, LCM20]. Moreover, it is possible to extend our analysis to a general population covariance $\mathbb{E}_{\mathbf{x}}[\mathbf{x}\mathbf{x}^{\top}]$.

For any predictor $\hat{f}$, define the test error (generalization error) by

(2.24)  $\mathcal{L}(\hat{f}):=\mathbb{E}_{\mathbf{x}}[|\hat{f}(\mathbf{x})-f^{*}(\mathbf{x})|^{2}].$

We first present the following approximation of the test error of a random feature predictor via its corresponding kernel predictor.

Theorem 2.16 (Test error approximation).

Suppose that Assumptions 1.1, 1.2, 2.13 and 2.14 hold, and $\sigma$ is not a linear function. Then, for any $\varepsilon\in(0,1/2)$, the difference of the test errors satisfies

(2.25)  $\left|\mathcal{L}(\hat{f}_{\lambda}^{(RF)}(\mathbf{x}))-\mathcal{L}(\hat{f}_{\lambda}^{(K)}(\mathbf{x}))\right|=o\left(\left(n/d_{1}\right)^{\frac{1}{2}-\varepsilon}\right),$

with probability $1-o(1)$, when $n/d_{1}\to 0$ and $n\to\infty$.

Theorems 2.12 and 2.16 verify that random feature regression achieves the same asymptotic errors as the kernel regression, as long as $d_{1}/n\to\infty$. This is closely related to [MMM21, Theorem 1] in a different setting. Based on this, we can compute the asymptotic training and test errors for the random feature model by calculating the corresponding quantities for the kernel regression in the ultra-wide regime where $n/d_{1}\to 0$.

Theorem 2.17 (Asymptotic training and test errors).

Suppose Assumptions 1.1 and 1.2 hold, and $\sigma$ is not a linear function. Suppose the target function $f^{*}$, the training data $X$, and the test datum $\mathbf{x}\in\mathbb{R}^{d_{0}}$ satisfy Assumptions 2.13 and 2.14. For any $\lambda\geq 0$, let the effective ridge parameter be

(2.26)  $\lambda_{\textnormal{eff}}(\lambda,\sigma):=\frac{1+\lambda-b_{\sigma}^{2}}{b_{\sigma}^{2}}.$

If the training data has a limiting eigenvalue distribution $\mu_{0}=\operatorname{lim\,spec}X^{\top}X$ as $n\to\infty$ and $n/d_{0}\to\gamma\in(0,\infty)$, then, as $n/d_{1}\to 0$ and $n\to\infty$, the training error satisfies

(2.27)  $E_{\textnormal{train}}^{(RF,\lambda)}\xrightarrow[]{\mathbb{P}}\frac{\sigma^{2}_{\boldsymbol{\beta}}\lambda^{2}}{\gamma b_{\sigma}^{4}}\mathcal{V}_{K}(\lambda_{\textnormal{eff}}(\lambda,\sigma))+\frac{\sigma^{2}_{\boldsymbol{\epsilon}}\lambda^{2}}{\gamma(1+\lambda-b_{\sigma}^{2})^{2}}\left(\mathcal{B}_{K}(\lambda_{\textnormal{eff}}(\lambda,\sigma))-1+\gamma\right),$

and the test error satisfies

(2.28)  $\mathcal{L}(\hat{f}_{\lambda}^{(RF)}(\mathbf{x}))\xrightarrow[]{\mathbb{P}}\sigma^{2}_{\boldsymbol{\beta}}\,\mathcal{B}_{K}(\lambda_{\textnormal{eff}}(\lambda,\sigma))+\sigma_{\boldsymbol{\epsilon}}^{2}\,\mathcal{V}_{K}(\lambda_{\textnormal{eff}}(\lambda,\sigma)),$

where the bias and variance functions are defined by

(2.29)  $\mathcal{B}_{K}(\nu):=(1-\gamma)+\gamma\nu^{2}\int_{\mathbb{R}}\frac{1}{(x+\nu)^{2}}d\mu_{0}(x),$
(2.30)  $\mathcal{V}_{K}(\nu):=\gamma\int_{\mathbb{R}}\frac{x}{(x+\nu)^{2}}d\mu_{0}(x).$

We emphasize that in the proof of Theorem 2.17, we also obtain $n$-dependent deterministic equivalents for the training/test errors of the kernel regression, which approximate the performance of random feature regression. This is akin to [LCM20, Theorem 3] and [BMR21, Theorem 4.13], but in different regimes. In Figure 3 below, we present test errors of random feature regressions on standard Gaussian random data together with their limits (2.28). For simplicity, we fix $n,d_{0}$, let only $d_{1}\to\infty$, and use the empirical spectral distribution of $X^{\top}X$ to approximate $\mu_{0}$ in $\mathcal{B}_{K}(\lambda_{\textnormal{eff}}(\lambda,\sigma))$ and $\mathcal{V}_{K}(\lambda_{\textnormal{eff}}(\lambda,\sigma))$, which is exactly the $n$-dependent deterministic equivalent (a short numerical sketch follows Figure 3). For a Gaussian random matrix $X$, $\mu_{0}$ is in fact a Marchenko-Pastur law with ratio $\gamma$, so $\mathcal{B}_{K}(\lambda_{\textnormal{eff}}(\lambda,\sigma))$ and $\mathcal{V}_{K}(\lambda_{\textnormal{eff}}(\lambda,\sigma))$ can be computed explicitly according to [LD21, Definition 1].

Figure 3. Simulations of the test errors of random feature regressions with a centered Gaussian random matrix as the input $X$ and regularization parameter $\lambda=10^{-3}$ (left) and $\lambda=10^{-6}$ (right). Here, the activation function $\sigma$ is a re-scaled Sigmoid function, $\sigma_{\boldsymbol{\epsilon}}=1$ and $\sigma_{\boldsymbol{\beta}}=2$. We fix $d_{0}=500$ and vary the sample size $n$ and width $d_{1}$. Test errors in solid lines with error bars are computed using an independent test set of size 5000. We average our results over 50 repetitions. Limiting test errors in black dashed lines are computed by (2.28), with $\mu_{0}$ taken to be the corresponding Marchenko–Pastur distributions.
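To reproduce the dashed curves, only (2.26)–(2.30) are needed; the following sketch evaluates the limiting test error (2.28), using the eigenvalues of $X^{\top}X$ as an $n$-dependent proxy for $\mu_{0}$. The function and its inputs (in particular $b_{\sigma}=\zeta_{1}(\sigma)$ passed in as a number) are our own illustration.

```python
import numpy as np

def limiting_test_error(X, b_sigma, sigma_beta, sigma_eps, lam):
    """Limiting test error (2.28) of random feature regression in the ultra-wide regime,
    with the empirical spectrum of X^T X standing in for mu_0."""
    d0, n = X.shape
    gamma = n / d0
    evals = np.linalg.eigvalsh(X.T @ X)                        # approximates mu_0
    nu = (1 + lam - b_sigma**2) / b_sigma**2                   # effective ridge (2.26)
    B_K = (1 - gamma) + gamma * nu**2 * np.mean(1.0 / (evals + nu)**2)   # bias term (2.29)
    V_K = gamma * np.mean(evals / (evals + nu)**2)                       # variance term (2.30)
    return sigma_beta**2 * B_K + sigma_eps**2 * V_K            # limit in (2.28)
```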
Remark 2.18 (Implicit regularization).

For nonlinear $\sigma$, the effective ridge parameter (2.26) can be viewed as an inflated ridge parameter, since $b_{\sigma}^{2}\in[0,1)$ and $\lambda_{\textnormal{eff}}>\lambda\geq 0$. This $\lambda_{\textnormal{eff}}$ leads to implicit regularization for our random feature and kernel ridge regressions, even for ridgeless regression with $\lambda=0$ [LR20, MZ22, JSS+20, BMR21]. The effective ridge parameter $\lambda_{\textnormal{eff}}$ also reflects the effect of the nonlinearity on the random feature and kernel regressions induced by ultra-wide neural networks.

Remark 2.19.

For convenience, we only consider the linear target function ff^{*}, but in general, the above theorems can also be obtained for nonlinear target functions, for instance, ff^{*} is a nonlinear single-index model. Under (εn,B)(\varepsilon_{n},B)-orthonormal assumption with nεn40n\varepsilon_{n}^{4}\to 0, our expected kernel K(X,X)ΦK(X,X)\equiv\Phi is approximated in terms of

(2.31) limspecK(X,X)=limspec(bσ2XX+(1bσ2)Id),\operatorname{lim\,spec}K(X,X)=\operatorname{lim\,spec}\left(b_{\sigma}^{2}X^{\top}X+(1-b^{2}_{\sigma})\operatorname{Id}\right),

whence this kernel regression can only learn linear functions. So if ff^{*} is nonlinear, the limiting test error decomposes into a linear part given by (2.28) and a nonlinear component that acts as additional noise [BMR21, Theorem 4.13]. For further results on this kernel machine, we refer to [LR20, LRZ20, LLS21, MMM21].

Remark 2.20 (Neural tangent regression).

In parallel to the above results, we can obtain a similar analysis of the limiting training and test errors for random feature regression in (2.16) with empirical NTK given by either Kn(X,X)=1d1(SS)(XX)K_{n}(X,X)=\frac{1}{d_{1}}(S^{\top}S)\odot(X^{\top}X) or Kn(X,X)=HK_{n}(X,X)=H. This random feature regression is also referred to as neural tangent regression [MZ22]. With the help of our concentration results in Theorem 2.7 and the lower bounds on the smallest eigenvalues in Theorem 2.9, we can directly extend the above Theorems 2.12, 2.16 and 2.17 to this neural tangent regression. We omit the proofs in these cases and only state the results as follows.

If Kn(X,X)=1d1(SS)(XX)K_{n}(X,X)=\frac{1}{d_{1}}(S^{\top}S)\odot(X^{\top}X) with expected kernel K(X,X)=ΨK(X,X)=\Psi defined by (2.6), the limiting training and test errors of this neural tangent regression can be approximated by the kernel regression with respect to Ψ\Psi, as long as d1=ω(logn)d_{1}=\omega(\log n). Analogously to (2.31), we have an additional approximation

(2.32) limspecΨ=limspec(bσ2XX+(aσbσ2)Id).\operatorname{lim\,spec}\Psi=\operatorname{lim\,spec}\left(b_{\sigma}^{2}X^{\top}X+(a_{\sigma}-b^{2}_{\sigma})\operatorname{Id}\right).

Under the same assumptions as in Theorem 2.17, with n/d10n/d_{1}\to 0 replaced by d1=ω(logn)d_{1}=\omega(\log n), we can conclude that the test error of this neural tangent regression has the same limit as (2.28), with the effective ridge parameter (2.26) replaced by λeff(λ,σ)=aσ+λbσ2bσ2\lambda_{\textnormal{eff}}(\lambda,\sigma)=\frac{a_{\sigma}+\lambda-b_{\sigma}^{2}}{b_{\sigma}^{2}}. This result is akin to [MZ22, Corollary 3.2] but permits more general assumptions on XX. The limiting training error of this neural tangent regression can be obtained by slightly modifying the coefficient in (2.27).

Similarly, if Kn(X,X)=HK_{n}(X,X)=H defined by (1.8) possesses an expected kernel K(X,X)=Φ+ΨK(X,X)=\Phi+\Psi, this neural tangent regression in (2.16) is close to kernel regression (2.19) with kernel

K(\mathbf{x},\mathbf{z})=\mathbb{E}_{\boldsymbol{w}}[\sigma(\boldsymbol{w}^{\top}\mathbf{x})\sigma(\boldsymbol{w}^{\top}\mathbf{z})]+\mathbb{E}_{\boldsymbol{w}}[\sigma^{\prime}(\boldsymbol{w}^{\top}\mathbf{x})\sigma^{\prime}(\boldsymbol{w}^{\top}\mathbf{z})]\,\mathbf{x}^{\top}\mathbf{z},

under the ultra-wide regime n/d10n/d_{1}\to 0. Combining (2.31) and (2.32), Theorem 2.17 can be directly extended to this neural tangent regression, with (2.26) replaced by λeff(λ,σ)=aσ+1+λ2bσ22bσ2\lambda_{\textnormal{eff}}(\lambda,\sigma)=\frac{a_{\sigma}+1+\lambda-2b_{\sigma}^{2}}{2b_{\sigma}^{2}}. Section 6.1 of [AP20] also computed this limiting test error when the data XX is isotropic Gaussian.
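The effective ridge parameters appearing in this remark are simple functions of a_{\sigma}, b_{\sigma} and \lambda. As a small illustration (the helper functions below are ours and the numerical inputs are arbitrary), they can be evaluated as follows.

```python
def lambda_eff_psi(a_sigma, b_sigma, lam):
    # Effective ridge for the kernel Psi in this remark: (a_sigma + lambda - b_sigma^2) / b_sigma^2.
    return (a_sigma + lam - b_sigma**2) / b_sigma**2

def lambda_eff_ntk(a_sigma, b_sigma, lam):
    # Effective ridge for the NTK kernel Phi + Psi: (a_sigma + 1 + lambda - 2 b_sigma^2) / (2 b_sigma^2).
    return (a_sigma + 1.0 + lam - 2.0 * b_sigma**2) / (2.0 * b_sigma**2)

# Illustrative values only; a_sigma and b_sigma are determined by the activation sigma.
print(lambda_eff_psi(1.0, 0.6, 1e-3), lambda_eff_ntk(1.0, 0.6, 1e-3))
```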

Organization of the paper

The remaining parts of the paper are structured as follows. In Section 3, we first provide a nonlinear Hanson-Wright inequality as a concentration tool for our spectral analysis. Section 4 gives a general theorem for the limiting spectral distributions of generalized centered sample covariance matrices. We prove the limiting spectral distributions for the empirical CK and NTK matrices (Theorem 2.1 and Theorem 2.2) in Section 5. Non-asymptotic estimates in subsection 2.2 are proved in Section 6. In Section 7, we justify the asymptotic results of the training and test errors for the random feature model (Theorem 2.12 and Theorem 2.16). Auxiliary lemmas and additional simulations are included in Appendices.

3. A non-linear Hanson-Wright inequality

We give an improved version of Lemma 1 in [LLC18] with a simple proof based on a Hanson-Wright inequality for random vectors with dependence [Ada15]. This serves as the concentration tool for us to prove the deformed semicircle law in Section 5 and provide bounds on extreme eigenvalues in Section 6. We first define some concentration properties for random vectors.

Definition 3.1 (Concentration property).

Let XX be a random vector in n\mathbb{R}^{n}. We say XX has the KK-concentration property with constant KK if for any 11-Lipschitz function f:nf:\mathbb{R}^{n}\to\mathbb{R}, we have 𝔼|f(X)|<\mathbb{E}|f(X)|<\infty and for any t>0t>0,

(3.1) (|f(X)𝔼f(X)|t)2exp(t2/K2).\displaystyle\mathbb{P}(|f(X)-\mathbb{E}f(X)|\geq t)\leq 2\exp(-t^{2}/K^{2}).

Many distributions of random vectors satisfy the KK-concentration property, including uniform random vectors on the sphere, the unit ball, the Hamming or continuous cube, and uniform random permutations. See [Ver18, Chapter 5] for more details.

Definition 3.2 (Convex concentration property).

Let XX be a random vector in n\mathbb{R}^{n}. We say XX has the KK-convex concentration property with the constant KK if for any 11-Lipschitz convex function f:nf:\mathbb{R}^{n}\to\mathbb{R}, we have 𝔼|f(X)|<\mathbb{E}|f(X)|<\infty and for any t>0t>0,

(|f(X)𝔼f(X)|t)2exp(t2/K2).\displaystyle\mathbb{P}(|f(X)-\mathbb{E}f(X)|\geq t)\leq 2\exp(-t^{2}/K^{2}).

We will apply the following result from [Ada15] to the nonlinear setting.

Lemma 3.3 (Theorem 2.5 in [Ada15]).

Let XX be a mean zero random vector in n\mathbb{R}^{n}. If XX has the KK-convex concentration property, then for any n×nn\times n matrix AA and any t>0t>0,

(|XAX𝔼(XAX)|t)2exp(1Cmin{t22K4AF2,tK2A})\displaystyle\mathbb{P}(|X^{\top}AX-\mathbb{E}(X^{\top}AX)|\geq t)\leq 2\exp\left(-\frac{1}{C}\min\left\{\frac{t^{2}}{2K^{4}\|A\|_{F}^{2}},\frac{t}{K^{2}\|A\|}\right\}\right)

for some universal constant C>1C>1.

Theorem 3.4.

Let 𝐰d0\boldsymbol{w}\in\mathbb{R}^{d_{0}} be a random vector with KK-concentration property, X=(𝐱1,,𝐱n)d0×nX=(\mathbf{x}_{1},\dots,\mathbf{x}_{n})\in\mathbb{R}^{d_{0}\times n} be a deterministic matrix. Define 𝐲=σ(𝐰X)\mathbf{y}=\sigma(\boldsymbol{w}^{\top}X)^{\top}, where σ\sigma is λσ\lambda_{\sigma}-Lipschitz, and Φ=𝔼𝐲𝐲\Phi=\mathbb{E}\mathbf{y}\mathbf{y}^{\top}. Let AA be an n×nn\times n deterministic matrix.

  1. (1)

    If 𝔼[𝐲]=0\mathbb{E}[\mathbf{y}]=0, for any t>0t>0,

    (3.2) (|𝐲A𝐲TrAΦ|t)2exp(1Cmin{t22K4λσ4X4AF2,tK2λσ2X2A}),\displaystyle\mathbb{P}\left(|\mathbf{y}^{\top}A\mathbf{y}-\operatorname{Tr}A\Phi|\geq t\right)\leq 2\exp\left(-\frac{1}{C}\min\left\{\frac{t^{2}}{2K^{4}\lambda_{\sigma}^{4}\|X\|^{4}\|A\|_{F}^{2}},\frac{t}{K^{2}\lambda_{\sigma}^{2}\|X\|^{2}\|A\|}\right\}\right),

    where C>0C>0 is an absolute constant.

  2. (2)

    If 𝔼[𝐲]0\mathbb{E}[\mathbf{y}]\not=0, for any t>0t>0,

    (|𝐲A𝐲TrAΦ|>t)\displaystyle\mathbb{P}\left(|\mathbf{y}^{\top}A\mathbf{y}-\operatorname{Tr}A\Phi|>t\right)\leq~{} 2exp(1Cmin{t24K4λσ4X4AF2,tK2λσ2X2A})\displaystyle 2\exp\left(-\frac{1}{C}\min\left\{\frac{t^{2}}{4K^{4}\lambda_{\sigma}^{4}\|X\|^{4}\|A\|_{F}^{2}},\frac{t}{K^{2}\lambda_{\sigma}^{2}\|X\|^{2}\|A\|}\right\}\right)
    +2exp(t216K2λσ2X2A2𝔼𝐲2).\displaystyle+2\exp\left(-\frac{t^{2}}{16K^{2}\lambda_{\sigma}^{2}\|X\|^{2}\|A\|^{2}\|\mathbb{E}\mathbf{y}\|^{2}}\right).

    for some constant C>0C>0.

Proof.

Let ff be any 11-Lipschitz convex function. Since 𝐲=σ(𝒘X)\mathbf{y}=\sigma(\boldsymbol{w}^{\top}X)^{\top}, f(𝐲)=f(σ(𝒘X))f(\mathbf{y})=f(\sigma(\boldsymbol{w}^{\top}X)^{\top}) is a λσX\lambda_{\sigma}\|X\|-Lipschitz function of 𝒘\boldsymbol{w}. Then by the Lipschitz concentration property of 𝒘\boldsymbol{w} in (3.1), we obtain

(|f(𝐲)𝔼f(𝐲)|t)2exp(t2K2λσ2X2).\displaystyle\mathbb{P}(|f(\mathbf{y})-\mathbb{E}f(\mathbf{y})|\geq t)\leq 2\exp\left(-\frac{t^{2}}{K^{2}\lambda_{\sigma}^{2}\|X\|^{2}}\right).

Therefore, 𝐲\mathbf{y} satisfies the KλσXK\lambda_{\sigma}\|X\|-convex concentration property. Define f~(𝐱)=f(𝐱𝔼𝐲)\tilde{f}(\mathbf{x})=f(\mathbf{x}-\mathbb{E}\mathbf{y}), then f~\tilde{f} is also a convex 11-Lipschitz function and f~(𝐲)=f(𝐲𝔼𝐲)\tilde{f}(\mathbf{y})=f(\mathbf{y}-\mathbb{E}\mathbf{y}). Hence 𝐲~:=𝐲𝔼𝐲\tilde{\mathbf{y}}:=\mathbf{y}-\mathbb{E}\mathbf{y} also satisfies the KλσXK\lambda_{\sigma}\|X\|-convex concentration property. Applying Lemma 3.3 to 𝐲~\tilde{\mathbf{y}}, we have for any t>0t>0,

(3.3) (|𝐲~A𝐲~𝔼(𝐲~A𝐲~)|t)2exp(1Cmin{t22K4λσ4X4AF2,tK2λσ2X2A}).\displaystyle\mathbb{P}(|{\tilde{\mathbf{y}}}^{\top}A\tilde{\mathbf{y}}-\mathbb{E}(\tilde{\mathbf{y}}^{\top}A\tilde{\mathbf{y}})|\geq t)\leq 2\exp\left(-\frac{1}{C}\min\left\{\frac{t^{2}}{2K^{4}\lambda_{\sigma}^{4}\|X\|^{4}\|A\|_{F}^{2}},\frac{t}{K^{2}\lambda_{\sigma}^{2}\|X\|^{2}\|A\|}\right\}\right).

Since 𝔼𝐲~=0\mathbb{E}\tilde{\mathbf{y}}=0, the inequality above implies (3.2). Note that

𝐲~A𝐲~𝔼(𝐲~A𝐲~)=(𝐲A𝐲TrAΦ)𝐲~A𝔼𝐲𝔼𝐲A𝐲~,\displaystyle{\tilde{\mathbf{y}}}^{\top}A\tilde{\mathbf{y}}-\mathbb{E}(\tilde{\mathbf{y}}^{\top}A\tilde{\mathbf{y}})=(\mathbf{y}^{\top}A\mathbf{y}-\operatorname{Tr}A\Phi)-\tilde{\mathbf{y}}^{\top}A\mathbb{E}\mathbf{y}-\mathbb{E}\mathbf{y}^{\top}A\tilde{\mathbf{y}},

Hence,

𝐲A𝐲TrAΦ\displaystyle\mathbf{y}^{\top}A\mathbf{y}-\operatorname{Tr}A\Phi =(𝐲~A𝐲~𝔼(𝐲~A𝐲~))+(𝐲𝔼𝐲)(A+A)𝔼𝐲\displaystyle=({\tilde{\mathbf{y}}}^{\top}A\tilde{\mathbf{y}}-\mathbb{E}(\tilde{\mathbf{y}}^{\top}A\tilde{\mathbf{y}}))+(\mathbf{y}-\mathbb{E}\mathbf{y})^{\top}(A+A^{\top})\mathbb{E}\mathbf{y}
(3.4) =(𝐲~A𝐲~𝔼(𝐲~A𝐲~))+(𝐲(A+A)𝔼𝐲𝔼𝐲(A+A)𝔼𝐲).\displaystyle=({\tilde{\mathbf{y}}}^{\top}A\tilde{\mathbf{y}}-\mathbb{E}(\tilde{\mathbf{y}}^{\top}A\tilde{\mathbf{y}}))+(\mathbf{y}^{\top}(A+A^{\top})\mathbb{E}\mathbf{y}-\mathbb{E}\mathbf{y}^{\top}(A+A^{\top})\mathbb{E}\mathbf{y}).

Since 𝐲(A+A)𝔼𝐲\mathbf{y}^{\top}(A+A^{\top})\mathbb{E}\mathbf{y} is a (2A𝔼𝐲Xλσ)(2\|A\|\|\mathbb{E}\mathbf{y}\|\|X\|\lambda_{\sigma})-Lipschitz function of 𝒘\boldsymbol{w}, by the Lipschitz concentration property of 𝒘\boldsymbol{w}, we have

(3.5) (|(𝐲𝔼𝐲)(A+A)𝔼𝐲|t)2exp(t24K2(A𝔼𝐲Xλσ)2).\displaystyle\mathbb{P}(|(\mathbf{y}-\mathbb{E}\mathbf{y})^{\top}(A+A^{\top})\mathbb{E}\mathbf{y}|\geq t)\leq 2\exp\left(-\frac{t^{2}}{4K^{2}(\|A\|\|\mathbb{E}\mathbf{y}\|\|X\|\lambda_{\sigma})^{2}}\right).

Then combining (3.3), (3.4), and (3.5), we have

(|𝐲A𝐲TrAΦ|t)\displaystyle\mathbb{P}(|\mathbf{y}^{\top}A\mathbf{y}-\operatorname{Tr}A\Phi|\geq t) (|𝐲~A𝐲~𝔼(𝐲~A𝐲~)|t/2)+(|(𝐲𝔼𝐲)(A+A)𝔼𝐲|t/2)\displaystyle\leq\mathbb{P}(|{\tilde{\mathbf{y}}}^{\top}A\tilde{\mathbf{y}}-\mathbb{E}(\tilde{\mathbf{y}}^{\top}A\tilde{\mathbf{y}})|\geq t/2)+\mathbb{P}(|(\mathbf{y}-\mathbb{E}\mathbf{y})^{\top}(A+A^{\top})\mathbb{E}\mathbf{y}|\geq t/2)
2exp(12Cmin{t24K4λσ4X4AF2,tK2λσ2X2A})\displaystyle\leq 2\exp\left(-\frac{1}{2C}\min\left\{\frac{t^{2}}{4K^{4}\lambda_{\sigma}^{4}\|X\|^{4}\|A\|_{F}^{2}},\frac{t}{K^{2}\lambda_{\sigma}^{2}\|X\|^{2}\|A\|}\right\}\right)
+2exp(t216K2λσ2X2A2𝔼𝐲2).\displaystyle\quad+2\exp\left(-\frac{t^{2}}{16K^{2}\lambda_{\sigma}^{2}\|X\|^{2}\|A\|^{2}\|\mathbb{E}\mathbf{y}\|^{2}}\right).

This finishes the proof. ∎

Since the Gaussian random vector 𝒘𝒩(0,Id0)\boldsymbol{w}\sim\mathcal{N}(0,I_{d_{0}}) satisfies the KK-concentration property with K=2K=\sqrt{2} (see for example [BLM13]), we have the following corollary.

Corollary 3.5.

Let 𝒘𝒩(0,Id0)\boldsymbol{w}\sim\mathcal{N}(0,I_{d_{0}}), X=(𝐱1,,𝐱n)d0×nX=(\mathbf{x}_{1},\dots,\mathbf{x}_{n})\in\mathbb{R}^{d_{0}\times n} be a deterministic matrix. Define 𝐲=σ(𝒘X)\mathbf{y}=\sigma(\boldsymbol{w}^{\top}X)^{\top}, where σ\sigma is λσ\lambda_{\sigma}-Lipschitz, and Φ=𝔼𝐲𝐲\Phi=\mathbb{E}\mathbf{y}\mathbf{y}^{\top}. Let AA be an n×nn\times n deterministic matrix.

  1. (1)

    If 𝔼[𝐲]=0\mathbb{E}[\mathbf{y}]=0, for any t>0t>0,

    (3.6) (|𝐲A𝐲TrAΦ|t)2exp(1Cmin{t24λσ4X4AF2,tλσ2X2A})\displaystyle~{}~{}~{}\mathbb{P}\left(|\mathbf{y}^{\top}A\mathbf{y}-\operatorname{Tr}A\Phi|\geq t\right)\leq 2\exp\left(-\frac{1}{C}\min\left\{\frac{t^{2}}{4\lambda_{\sigma}^{4}\|X\|^{4}\|A\|_{F}^{2}},\frac{t}{\lambda_{\sigma}^{2}\|X\|^{2}\|A\|}\right\}\right)

    for some absolute constant C>0C>0.

  2. (2)

    If 𝔼[𝐲]0\mathbb{E}[\mathbf{y}]\not=0, for any t>0t>0,

    (3.7) (|𝐲A𝐲TrAΦ|>t)\displaystyle\mathbb{P}\left(|\mathbf{y}^{\top}A\mathbf{y}-\operatorname{Tr}A\Phi|>t\right)
    \displaystyle\leq~{} 2exp(1Cmin{t28λσ4X4AF2,tλσ2X2A})+2exp(t232λσ2X2A2𝔼𝐲2)\displaystyle 2\exp\left(-\frac{1}{C}\min\left\{\frac{t^{2}}{8\lambda_{\sigma}^{4}\|X\|^{4}\|A\|_{F}^{2}},\frac{t}{\lambda_{\sigma}^{2}\|X\|^{2}\|A\|}\right\}\right)+2\exp\left(-\frac{t^{2}}{32\lambda_{\sigma}^{2}\|X\|^{2}\|A\|^{2}\|\mathbb{E}\mathbf{y}\|^{2}}\right)
    (3.8) \displaystyle\leq~{} 2exp(1Cmin{t28λσ4X4AF2,tλσ2X2A})+2exp(t232λσ2X2A2t0),\displaystyle 2\exp\left(-\frac{1}{C}\min\left\{\frac{t^{2}}{8\lambda_{\sigma}^{4}\|X\|^{4}\|A\|_{F}^{2}},\frac{t}{\lambda_{\sigma}^{2}\|X\|^{2}\|A\|}\right\}\right)+2\exp\left(-\frac{t^{2}}{32\lambda_{\sigma}^{2}\|X\|^{2}\|A\|^{2}t_{0}}\right),

    where

    (3.9) t0:=2λσ2i=1n(𝐱i1)2+2n(𝔼σ(ξ))2,ξ𝒩(0,1).\displaystyle t_{0}:=2\lambda_{\sigma}^{2}\sum_{i=1}^{n}(\|\mathbf{x}_{i}\|-1)^{2}+2n(\mathbb{E}\sigma(\xi))^{2},\quad\xi\sim\mathcal{N}(0,1).
Remark 3.6.

Compared to [LLC18, Lemma 1], we identify the dependence on AF\|A\|_{F} and 𝔼𝐲\mathbb{E}\mathbf{y} in the probability estimate. By using the inequality AFnA\|A\|_{F}\leq\sqrt{n}\|A\|, we obtain a similar inequality to the one in [LLC18] with a better dependence on nn. Moreover, our bound in t0t_{0} is independent of d0d_{0}, while the corresponding term t0t_{0} in [LLC18, Lemma 1] depends on X\|X\| and d0d_{0}. In particular, when 𝔼σ(ξ)=0\mathbb{E}\sigma(\xi)=0 and XX is (εn,B)(\varepsilon_{n},B)-orthonormal, t0t_{0} is of order 1. Hence, (3.8) with the special choice of t0t_{0} is the key ingredient in the proof of Theorem 2.3 to get a concentration of the spectral norm for CK.
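The concentration described in Corollary 3.5 is easy to probe numerically. The following sketch (our own illustration; the choices of X, A and \sigma are arbitrary) draws many copies of \mathbf{y}=\sigma(\boldsymbol{w}^{\top}X)^{\top} with Gaussian \boldsymbol{w}, estimates \Phi by Monte Carlo, and shows that the fluctuations of \mathbf{y}^{\top}A\mathbf{y} around \operatorname{Tr}A\Phi are on the order of \|A\|_{F}, as suggested by (3.6)–(3.8), rather than of order n.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d0, n_mc = 300, 600, 2000
X = rng.standard_normal((d0, n)) / np.sqrt(d0)        # columns of roughly unit norm
A = rng.standard_normal((n, n))
A = (A + A.T) / (2.0 * np.sqrt(n))                    # symmetric test matrix with ||A|| = O(1)
sigma = np.tanh                                       # 1-Lipschitz and E[sigma(xi)] = 0

W = rng.standard_normal((n_mc, d0))
Y = sigma(W @ X)                                      # each row is y^T = sigma(w^T X)
Phi_hat = Y.T @ Y / n_mc                              # Monte Carlo estimate of Phi = E[y y^T]

quad = np.einsum('ij,jk,ik->i', Y, A, Y)              # y^T A y for each sample of w
dev = quad - np.trace(A @ Phi_hat)
print("fluctuation scale:", dev.std())
print("||A||_F =", np.linalg.norm(A), "  n =", n)     # concentration at scale ||A||_F, not n
```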

Proof of Corollary 3.5.

We only need to prove (3.8), since other statements follow immediately by taking K=2K=\sqrt{2}. Let 𝐱i\mathbf{x}_{i} be the ii-th column of XX. Then

𝔼𝐲2\displaystyle\|\mathbb{E}\mathbf{y}\|^{2} =𝔼σ(𝒘X)2=i=1n[𝔼σ(𝒘𝐱i)]2.\displaystyle=\|\mathbb{E}\sigma(\boldsymbol{w}^{\top}X)\|^{2}=\sum_{i=1}^{n}[\mathbb{E}\sigma(\boldsymbol{w}^{\top}\mathbf{x}_{i})]^{2}.

Let ξ𝒩(0,1)\xi\sim\mathcal{N}(0,1). We have

|𝔼σ(𝒘𝐱i)|\displaystyle|\mathbb{E}\sigma(\boldsymbol{w}^{\top}\mathbf{x}_{i})| =|𝔼σ(ξ𝐱i)|𝔼|(σ(ξ𝐱i)σ(ξ))|+|𝔼σ(ξ)|\displaystyle=|\mathbb{E}\sigma(\xi\|\mathbf{x}_{i}\|)|\leq\mathbb{E}|(\sigma(\xi\|\mathbf{x}_{i}\|)-\sigma(\xi))|+|\mathbb{E}\sigma(\xi)|
(3.10) λσ𝔼|ξ(𝐱i1)|+|𝔼σ(ξ)|λσ|𝐱i1|+|𝔼σ(ξ)|.\displaystyle\leq\lambda_{\sigma}\mathbb{E}|\xi(\|\mathbf{x}_{i}\|-1)|+|\mathbb{E}\sigma(\xi)|\leq\lambda_{\sigma}|\|\mathbf{x}_{i}\|-1|+|\mathbb{E}\sigma(\xi)|.

Therefore

(3.11) 𝔼𝐲2\displaystyle\|\mathbb{E}\mathbf{y}\|^{2} i=1n(λσ(𝐱i1)+|𝔼σ(ξ)|)2i=1n2λσ2(𝐱i1)2+2(𝔼σ(ξ))2\displaystyle\leq\sum_{i=1}^{n}(\lambda_{\sigma}(\|\mathbf{x}_{i}\|-1)+|\mathbb{E}\sigma(\xi)|)^{2}\leq\sum_{i=1}^{n}2\lambda_{\sigma}^{2}(\|\mathbf{x}_{i}\|-1)^{2}+2(\mathbb{E}\sigma(\xi))^{2}
=2λσ2i=1n(𝐱i1)2+2n(𝔼σ(ξ))2=t0,\displaystyle=2\lambda_{\sigma}^{2}\sum_{i=1}^{n}(\|\mathbf{x}_{i}\|-1)^{2}+2n(\mathbb{E}\sigma(\xi))^{2}=t_{0},

and (3.8) holds. ∎

We include the following corollary about the variance of 𝐲A𝐲\mathbf{y}^{\top}A\mathbf{y}, which will be used in Section 5 to study the spectrum of the CK and NTK.

Corollary 3.7.

Under the same assumptions of Corollary 3.5, we further assume that t0C1nt_{0}\leq C_{1}n, and A,XC2\|A\|,\|X\|\leq C_{2}. Then as nn\to\infty,

1n2𝔼[|𝐲A𝐲TrAΦ|2]0.\frac{1}{n^{2}}\mathbb{E}\left[\left|\mathbf{y}^{\top}A\mathbf{y}-\operatorname{Tr}A\Phi\right|^{2}\right]\to 0.
Proof.

Notice that AFnA\|A\|_{F}\leq\sqrt{n}\|A\|. Thanks to Corollary 3.5 (2), we have that for any t>0t>0,

(3.12) (1n|𝐲A𝐲TrAΦ|>t)4exp(Cnmin{t2,t}),\mathbb{P}\left(\frac{1}{n}\left|\mathbf{y}^{\top}A\mathbf{y}-\operatorname{Tr}A\Phi\right|>t\right)\leq 4\exp\left(-Cn\min\{t^{2},t\}\right),

where the constant C>0C>0 depends only on C1,C2C_{1},C_{2}, λσ\lambda_{\sigma}, and KK. Therefore, we can compute the variance in the following way:

𝔼[1n2|𝐲A𝐲TrAΦ|2]=\displaystyle\mathbb{E}\left[\frac{1}{n^{2}}\left|\mathbf{y}^{\top}A\mathbf{y}-\operatorname{Tr}A\Phi\right|^{2}\right]=~{} 0(1n2|𝐲A𝐲TrAΦ|2>s)𝑑s\displaystyle\int_{0}^{\infty}\mathbb{P}\left(\frac{1}{n^{2}}\left|\mathbf{y}^{\top}A\mathbf{y}-\operatorname{Tr}A\Phi\right|^{2}>s\right)ds
\displaystyle\leq~{} 40exp(Cnmin{s,s})𝑑s\displaystyle 4\int_{0}^{\infty}\exp\left(-Cn\min\{s,\sqrt{s}\}\right)ds
=\displaystyle=~{} 401exp(Cns)𝑑s+41+exp(Cns)𝑑s0,\displaystyle 4\int_{0}^{1}\exp\left(-Cn\sqrt{s}\right)ds+4\int_{1}^{+\infty}\exp\left(-Cns\right)ds\rightarrow 0,

as nn\rightarrow\infty. Here, we use the dominated convergence theorem for the first integral in the last line. ∎

4. Limiting law for general centered sample covariance matrices

This section can be read independently of the subsequent sections: it focuses on the generalized sample covariance matrix in which the feature dimension is much smaller than the sample size. We will later specialize such sample covariance matrices to our neural network applications. Under certain weak assumptions, we prove that the limiting eigenvalue distribution of the normalized sample covariance matrix satisfies two self-consistent equations, which together describe a deformed semicircle law. The results of this section exhibit a degree of universality: they hold for various random matrix models beyond our neural network setting and may be of independent interest.

Theorem 4.1.

Suppose 𝐲1,,𝐲dn\mathbf{y}_{1},\ldots,\mathbf{y}_{d}\in\mathbb{R}^{n} are independent random vectors with the same distribution as a random vector 𝐲n\mathbf{y}\in\mathbb{R}^{n}. Assume that 𝔼[𝐲]=0\mathbb{E}[\mathbf{y}]=\textbf{0}, 𝔼[𝐲𝐲]=Φnn×n\mathbb{E}[\mathbf{y}\mathbf{y}^{\top}]=\Phi_{n}\in\mathbb{R}^{n\times n}, where Φn\Phi_{n} is a deterministic matrix whose limiting eigenvalue distribution is μΦδ0\mu_{\Phi}\not=\delta_{0}. Assume ΦnC\|\Phi_{n}\|\leq C for some constant CC. Define An:=dn(1di=1d𝐲i𝐲iΦn)A_{n}:=\sqrt{\frac{d}{n}}\left(\frac{1}{d}\sum_{i=1}^{d}\mathbf{y}_{i}\mathbf{y}_{i}^{\top}-\Phi_{n}\right) and R(z):=(AnzId)1R(z):=(A_{n}-z\operatorname{Id})^{-1}. For any z+z\in\mathbb{C}^{+} and any deterministic matrix DnD_{n} with DnC\|D_{n}\|\leq C, suppose that as n,dn,d\rightarrow\infty and n/d0n/d\rightarrow 0,

(4.1) trR(z)Dn𝔼[trR(z)Dn]a.s.0,\operatorname{tr}R(z)D_{n}-\mathbb{E}\left[\operatorname{tr}R(z)D_{n}\right]\overset{\textnormal{a.s.}}{\longrightarrow}0,

and

(4.2) 1n2𝔼[|𝐲Dn𝐲TrDnΦn|2]0.\frac{1}{n^{2}}\mathbb{E}\left[\left|\mathbf{y}^{\top}D_{n}\mathbf{y}-\operatorname{Tr}D_{n}\Phi_{n}\right|^{2}\right]\to 0.

Then the empirical eigenvalue distribution of matrix AnA_{n} weakly converges to μ\mu almost surely, whose Stieltjes transform m(z)m(z) is defined by

(4.3) m(z)+dμΦ(x)z+β(z)x=0m(z)+\int\frac{d\mu_{\Phi}(x)}{z+\beta(z)x}=0

for each z+z\in\mathbb{C}^{+}, where β(z)+\beta(z)\in\mathbb{C}^{+} is the unique solution to

(4.4) β(z)+xdμΦ(x)z+β(z)x=0.\beta(z)+\int\frac{xd\mu_{\Phi}(x)}{z+\beta(z)x}=0.

In particular, μ=μsμΦ\mu=\mu_{s}\boxtimes\mu_{\Phi}.

Remark 4.2.

In [Xie13], it was assumed that dn3𝔼|𝐲Dn𝐲TrDnΦn|20,\frac{d}{n^{3}}\mathbb{E}\left|\mathbf{y}^{\top}D_{n}\mathbf{y}-\operatorname{Tr}D_{n}\Phi_{n}\right|^{2}\to 0, where n3/dn^{3}/d\rightarrow\infty and n/d0n/d\to 0 as nn\to\infty. By a martingale difference argument, this condition implies (4.1). However, we are not able to verify a certain step in the proof of [Xie13]. Hence, we will not directly adopt the result of [Xie13] but consider a more general situation without assuming n3/dn^{3}/d\rightarrow\infty. The weakest conditions we found are conditions (4.1) and (4.2), which can be verified in our nonlinear random model.

The self-consistent equations we derived are consistent with the results in [Bao12, Xie13], where they studied the empirical spectral distribution of separable sample covariance matrices in the regime n/d0n/d\to 0 under different assumptions. When nn\rightarrow\infty and n/d0n/d\rightarrow 0, our goal is to prove that the Stieltjes transform mn(z)m_{n}(z) of the empirical eigenvalue distribution of AnA_{n} and βn(z):=tr[R(z)Φn]\beta_{n}(z):=\operatorname{tr}[R(z)\Phi_{n}] converge pointwise to m(z)m(z) and β(z)\beta(z), respectively.
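The self-consistent equations (4.3)–(4.4) can be solved numerically by a simple fixed-point iteration once \mu_{\Phi} is replaced by a discrete approximation, for instance the empirical eigenvalue distribution of \Phi_{n}. The sketch below (ours, with an arbitrary toy spectrum and an untuned damped iteration) computes \beta(z) and m(z) on a grid slightly above the real axis and recovers the limiting density of \mu=\mu_{s}\boxtimes\mu_{\Phi} by Stieltjes inversion.

```python
import numpy as np

def solve_self_consistent(z, phi_eigs, n_iter=500, damping=0.5):
    """Damped fixed-point iteration for (4.4), then (4.3), with mu_Phi replaced
    by the empirical measure of `phi_eigs` (e.g. the eigenvalues of Phi_n)."""
    beta = 1j * np.ones_like(z)                        # start in the upper half plane
    lam = phi_eigs[:, None]
    for _ in range(n_iter):
        beta_new = -np.mean(lam / (z + beta * lam), axis=0)
        beta = damping * beta + (1 - damping) * beta_new
    m = -np.mean(1.0 / (z + beta * lam), axis=0)       # Stieltjes transform of the limit mu
    return beta, m

# Toy spectrum for Phi_n: half of the eigenvalues at 1, half at 3.
phi_eigs = np.concatenate([np.full(150, 1.0), np.full(150, 3.0)])
grid = np.linspace(-5.0, 5.0, 400) + 0.01j             # points slightly above the real axis
beta, m = solve_self_consistent(grid, phi_eigs)
density = m.imag / np.pi                               # approximate density via Stieltjes inversion
print(density.max(), np.trapz(density, grid.real))     # total mass should be close to 1
```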

For the rest of this section, we first prove a series of lemmas to get nn-dependent deterministic equivalents related to (4.3) and (4.4) and then deduce the proof of Theorem 4.1 at the end of this section. Recall An=dn(1di=1d𝐲i𝐲iΦn)A_{n}=\sqrt{\frac{d}{n}}\left(\frac{1}{d}\sum_{i=1}^{d}\mathbf{y}_{i}\mathbf{y}_{i}^{\top}-\Phi_{n}\right), R(z)=(AnzId)1R(z)=(A_{n}-z\operatorname{Id})^{-1}, and 𝐲\mathbf{y} is a random vector independent of AnA_{n} with the same distribution of 𝐲i\mathbf{y}_{i}.

Lemma 4.3.

Under the assumptions of Theorem 4.1, for any z+z\in\mathbb{C}^{+}, as d,nd,n\rightarrow\infty,

(4.5) trD+z𝔼[trR(z)D]+𝔼[1n𝐲DR(z)𝐲1n𝐲R(z)𝐲1+nd1n𝐲R(z)𝐲]=o(1),\operatorname{tr}D+z\mathbb{E}[\operatorname{tr}R(z)D]+\mathbb{E}\left[\frac{\frac{1}{n}\mathbf{y}^{\top}DR(z)\mathbf{y}\cdot\frac{1}{n}\mathbf{y}^{\top}R(z)\mathbf{y}}{1+\sqrt{\frac{n}{d}}\frac{1}{n}\mathbf{y}^{\top}R(z)\mathbf{y}}\right]=o(1),

where Dn×nD\in\mathbb{R}^{n\times n} is any deterministic matrix such that DC\|D\|\leq C, for some constant CC.

Proof.

Let z=u+iv+z=u+iv\in\mathbb{C}^{+} where uu\in\mathbb{R} and v>0v>0. Let

R^:=(1dnj=1d+1𝐲j𝐲jdnΦnzId)1,\hat{R}:=\left(\frac{1}{\sqrt{dn}}\sum_{j=1}^{d+1}\mathbf{y}_{j}\mathbf{y}_{j}^{\top}-\sqrt{\frac{d}{n}}\Phi_{n}-z\operatorname{Id}\right)^{-1},

where 𝐲j\mathbf{y}_{j}’s are independent copies of 𝐲\mathbf{y} defined in Theorem 4.1. Notice that, for any deterministic matrix Dn×nD\in\mathbb{R}^{n\times n},

(4.6) D\displaystyle D =R^(1dnj=1d+1𝐲j𝐲jdnΦnzId)D\displaystyle=\hat{R}\left(\frac{1}{\sqrt{dn}}\sum_{j=1}^{d+1}\mathbf{y}_{j}\mathbf{y}_{j}^{\top}-\sqrt{\frac{d}{n}}\Phi_{n}-z\operatorname{Id}\right)D
(4.7) =1dnR^(i=1d+1𝐲i𝐲i)DdnR^ΦnDzR^D.\displaystyle=\frac{1}{\sqrt{dn}}\hat{R}\left(\sum_{i=1}^{d+1}\mathbf{y}_{i}\mathbf{y}_{i}^{\top}\right)D-\sqrt{\frac{d}{n}}\hat{R}\Phi_{n}D-z\hat{R}D.

Without loss of generality, we assume D1\|D\|\leq 1. Taking normalized trace, we have

(4.8) trD+ztr[R^D]=1dn1ni=1d+1𝐲iDR^𝐲idntr[R^ΦnD].\operatorname{tr}D+z\operatorname{tr}[\hat{R}D]=\frac{1}{\sqrt{dn}}\frac{1}{n}\sum_{i=1}^{d+1}\mathbf{y}_{i}^{\top}D\hat{R}\mathbf{y}_{i}-\sqrt{\frac{d}{n}}\operatorname{tr}[\hat{R}\Phi_{n}D].

For each 1id+11\leq i\leq d+1, Sherman–Morrison formula (Lemma A.2) implies

(4.9) R^=R(i)R(i)𝐲i𝐲iR(i)dn+𝐲iR(i)𝐲i,\hat{R}=R^{(i)}-\frac{R^{(i)}\mathbf{y}_{i}\mathbf{y}_{i}^{\top}R^{(i)}}{\sqrt{dn}+\mathbf{y}_{i}^{\top}R^{(i)}\mathbf{y}_{i}},

where the leave-one-out resolvent R(i)R^{(i)} is defined as

R(i):=(1dn1jd+1,ji𝐲j𝐲jdnΦnzId)1.R^{(i)}:=\left(\frac{1}{\sqrt{dn}}\sum_{1\leq j\leq d+1,j\neq i}\mathbf{y}_{j}\mathbf{y}_{j}^{\top}-\sqrt{\frac{d}{n}}\Phi_{n}-z\operatorname{Id}\right)^{-1}.

Hence, by (4.9), we obtain

(4.10) 1dn1ni=1d+1𝐲iDR^𝐲i=1ni=1d+1𝐲iDR(i)𝐲idn+𝐲iR(i)𝐲i.\frac{1}{\sqrt{dn}}\frac{1}{n}\sum_{i=1}^{d+1}\mathbf{y}_{i}^{\top}D\hat{R}\mathbf{y}_{i}=\frac{1}{n}\sum_{i=1}^{d+1}\frac{\mathbf{y}_{i}^{\top}DR^{(i)}\mathbf{y}_{i}}{\sqrt{dn}+\mathbf{y}_{i}^{\top}R^{(i)}\mathbf{y}_{i}}.

Combining equations (4.8) and (4.10) and taking expectations on both sides implies

trD+z𝔼[trR^D]=\displaystyle\operatorname{tr}D+z\mathbb{E}[\operatorname{tr}\hat{R}D]=~{} 1ni=1d+1𝔼[𝐲iDR(i)𝐲idn+𝐲iR(i)𝐲i]dn𝔼trR^ΦnD\displaystyle\frac{1}{n}\sum_{i=1}^{d+1}\mathbb{E}\left[\frac{\mathbf{y}_{i}^{\top}DR^{(i)}\mathbf{y}_{i}}{\sqrt{dn}+\mathbf{y}_{i}^{\top}R^{(i)}\mathbf{y}_{i}}\right]-\sqrt{\frac{d}{n}}\mathbb{E}\operatorname{tr}\hat{R}\Phi_{n}D
(4.11) =\displaystyle=~{} d+1n𝔼[𝐲DR(z)𝐲dn+𝐲R(z)𝐲]dn𝔼trR^ΦnD,\displaystyle\frac{d+1}{n}\mathbb{E}\left[\frac{\mathbf{y}^{\top}DR(z)\mathbf{y}}{\sqrt{dn}+\mathbf{y}^{\top}R(z)\mathbf{y}}\right]-\sqrt{\frac{d}{n}}\mathbb{E}\operatorname{tr}\hat{R}\Phi_{n}D,

because, by assumption, the 𝐲i\mathbf{y}_{i}’s have the same distribution as the vector 𝐲\mathbf{y} for all i[d+1]i\in[d+1]. With (4.11), to prove (4.5), we will first show that when n,dn,d\rightarrow\infty,

(4.12) dn(𝔼[trR^ΦnD]𝔼[trR(z)ΦnD])=o(1),\displaystyle\sqrt{\frac{d}{n}}\left(\mathbb{E}[\operatorname{tr}\hat{R}\Phi_{n}D]-\mathbb{E}[\operatorname{tr}R(z)\Phi_{n}D]\right)=o(1),
(4.13) 𝔼[trR^D]𝔼[trR(z)D]=o(1),\displaystyle\mathbb{E}[\operatorname{tr}\hat{R}D]-\mathbb{E}[\operatorname{tr}R(z)D]=o(1),
(4.14) 1n𝔼[𝐲DR(z)𝐲dn+𝐲R(z)𝐲]=o(1).\displaystyle\frac{1}{n}\mathbb{E}\left[\frac{\mathbf{y}^{\top}DR(z)\mathbf{y}}{\sqrt{dn}+\mathbf{y}^{\top}R(z)\mathbf{y}}\right]=o(1).

Recall that

\hat{R}-R(z)=-\frac{1}{\sqrt{dn}}R(z)\left(\mathbf{y}_{d+1}\mathbf{y}_{d+1}^{\top}\right)\hat{R},

and spectral norms R^,R(z)1/v\|\hat{R}\|,\|R(z)\|\leq 1/v due to Proposition C.2 in [FW20]. Notice that ΦnC\|\Phi_{n}\|\leq C. Hence, we can deduce that

dn|𝔼[trR^ΦnD]𝔼[trR(z)ΦnD]|\displaystyle\sqrt{\frac{d}{n}}\left|\mathbb{E}[\operatorname{tr}\hat{R}\Phi_{n}D]-\mathbb{E}[\operatorname{tr}R(z)\Phi_{n}D]\right|\leq~{} 1n𝔼[|trR(z)𝐲d+1𝐲d+1R^ΦnD|]\displaystyle\frac{1}{n}\mathbb{E}[|\operatorname{tr}R(z)\mathbf{y}_{d+1}\mathbf{y}_{d+1}^{\top}\hat{R}\Phi_{n}D|]
\displaystyle\leq~{} 1n2𝔼[R^ΦnDR(z)𝐲d+12]\displaystyle\frac{1}{n^{2}}\mathbb{E}[\|\hat{R}\Phi_{n}DR(z)\|\cdot\|\mathbf{y}_{d+1}\|^{2}]
\displaystyle\leq~{} \frac{C}{v^{2}n^{2}}\mathbb{E}[\operatorname{Tr}\mathbf{y}_{d+1}\mathbf{y}_{d+1}^{\top}]=\frac{C\operatorname{Tr}\Phi_{n}}{v^{2}n^{2}}\leq\frac{C^{2}}{v^{2}n}\rightarrow 0,

as n.n\rightarrow\infty. The same argument can be applied to the error of 𝔼[trR^D]𝔼[trR(z)D]\mathbb{E}[\operatorname{tr}\hat{R}D]-\mathbb{E}[\operatorname{tr}R(z)D]. Therefore (4.12) and (4.13) hold. For (4.14), we denote 𝐲~:=𝐲/(nd)1/4\tilde{\mathbf{y}}:=\mathbf{y}/(nd)^{1/4} and observe that

(4.15) 1n𝔼[𝐲DR(z)𝐲dn+𝐲R(z)𝐲]=1n𝔼[𝐲~DR(z)𝐲~1+𝐲~R(z)𝐲~].\frac{1}{n}\mathbb{E}\left[\frac{\mathbf{y}^{\top}DR(z)\mathbf{y}}{\sqrt{dn}+\mathbf{y}^{\top}R(z)\mathbf{y}}\right]=\frac{1}{n}\mathbb{E}\left[\frac{\tilde{\mathbf{y}}^{\top}DR(z)\tilde{\mathbf{y}}}{1+\tilde{\mathbf{y}}^{\top}R(z)\tilde{\mathbf{y}}}\right].

Let R(z)=i=1n1λiz𝐮i𝐮iR(z)=\sum_{i=1}^{n}\frac{1}{\lambda_{i}-z}\mathbf{u}_{i}\mathbf{u}_{i}^{\top} be the eigen-decomposition of R(z)R(z). Then

(4.16) 𝐲~R(z)𝐲~/𝐲~2=i=1n1λiz(𝐮i,𝐲~)2𝐲~2:=1xz𝑑μ𝐲~\displaystyle\tilde{\mathbf{y}}^{\top}R(z)\tilde{\mathbf{y}}/\|\tilde{\mathbf{y}}\|^{2}=\sum_{i=1}^{n}\frac{1}{\lambda_{i}-z}\frac{(\langle\mathbf{u}_{i},\tilde{\mathbf{y}}\rangle)^{2}}{\|\tilde{\mathbf{y}}\|^{2}}:=\int\frac{1}{x-z}d\mu_{\tilde{\mathbf{y}}}

is the Stieltjes transform of a discrete measure μ𝐲~=i=1n(𝐮i,𝐲~)2𝐲~2δλi\mu_{\tilde{\mathbf{y}}}=\sum_{i=1}^{n}\frac{(\langle\mathbf{u}_{i},\tilde{\mathbf{y}}\rangle)^{2}}{\|\tilde{\mathbf{y}}\|^{2}}\delta_{\lambda_{i}}. Then, we can control the real part of 𝐲~R(z)𝐲~\tilde{\mathbf{y}}^{\top}R(z)\tilde{\mathbf{y}} by Lemma A.4:

(4.17) |Re(𝐲~R(z)𝐲~)|v1/2𝐲~(Im(𝐲~R(z)𝐲~))1/2.\left|\operatorname{Re}(\tilde{\mathbf{y}}^{\top}R(z)\tilde{\mathbf{y}})\right|\leq v^{-1/2}\|\tilde{\mathbf{y}}\|\left(\operatorname{Im}(\tilde{\mathbf{y}}^{\top}R(z)\tilde{\mathbf{y}})\right)^{1/2}.

We now separately consider two cases in the following:

  • If the right-hand side of the above inequality (4.17) is at most 1/21/2, then

    |1+𝐲~R(z)𝐲~||1+Re(𝐲~R(z)𝐲~)|12,\left|1+\tilde{\mathbf{y}}^{\top}R(z)\tilde{\mathbf{y}}\right|\geq\left|1+\operatorname{Re}(\tilde{\mathbf{y}}^{\top}R(z)\tilde{\mathbf{y}})\right|\geq\frac{1}{2},

    which results in

    (4.18) |𝐲~DR(z)𝐲~1+𝐲~R(z)𝐲~|Cdn𝐲2.\left|\frac{\tilde{\mathbf{y}}^{\top}DR(z)\tilde{\mathbf{y}}}{1+\tilde{\mathbf{y}}^{\top}R(z)\tilde{\mathbf{y}}}\right|\leq\frac{C}{\sqrt{dn}}\|\mathbf{y}\|^{2}.
  • When v1/2𝐲~(Im(𝐲~R(z)𝐲~))1/2>1/2v^{-1/2}\|\tilde{\mathbf{y}}\|\left(\operatorname{Im}(\tilde{\mathbf{y}}^{\top}R(z)\tilde{\mathbf{y}})\right)^{1/2}>1/2, we know that

    |𝐲~DR(z)𝐲~1+𝐲~R(z)𝐲~|\displaystyle\left|\frac{\tilde{\mathbf{y}}^{\top}DR(z)\tilde{\mathbf{y}}}{1+\tilde{\mathbf{y}}^{\top}R(z)\tilde{\mathbf{y}}}\right|\leq~{} 𝐲~DR(z)𝐲~|Im(1+𝐲~R(z)𝐲~)|=𝐲~DR(z)𝐲~𝐲~Im(R(z))𝐲~\displaystyle\frac{\|\tilde{\mathbf{y}}^{\top}D\|\|R(z)\tilde{\mathbf{y}}\|}{|\operatorname{Im}(1+\tilde{\mathbf{y}}^{\top}R(z)\tilde{\mathbf{y}})|}=\frac{\|\tilde{\mathbf{y}}^{\top}D\|\|R(z)\tilde{\mathbf{y}}\|}{\tilde{\mathbf{y}}^{\top}\operatorname{Im}(R(z))\tilde{\mathbf{y}}}
    (4.19) \displaystyle\leq~{} 𝐲~D(v𝐲~Im(R(z))𝐲~)1/22𝐲~D𝐲~vC𝐲2vnd,\displaystyle\frac{\|\tilde{\mathbf{y}}^{\top}D\|}{\left(v\tilde{\mathbf{y}}^{\top}\operatorname{Im}(R(z))\tilde{\mathbf{y}}\right)^{1/2}}\leq\frac{2\|\tilde{\mathbf{y}}^{\top}D\|\|\tilde{\mathbf{y}}\|}{v}\leq\frac{C\|\mathbf{y}\|^{2}}{v\sqrt{nd}},

    where we exploit the fact that (see also Equation (A.1.11) in [BS10])

    R(z)𝐲~=(𝐲~R(z¯)R(z)𝐲~)1/2=(1v𝐲~Im(R(z))𝐲~)1/2.\|R(z)\tilde{\mathbf{y}}\|=(\tilde{\mathbf{y}}^{\top}R(\bar{z})R(z)\tilde{\mathbf{y}})^{1/2}=\left(\frac{1}{v}\tilde{\mathbf{y}}^{\top}\operatorname{Im}(R(z))\tilde{\mathbf{y}}\right)^{1/2}.

Finally, combining (4.18) and (4.19) in the above two cases, we can conclude the asymptotic result (4.14), since 𝔼𝐲2=TrΦnCn\mathbb{E}\|\mathbf{y}\|^{2}=\operatorname{Tr}\Phi_{n}\leq Cn by the assumptions of Theorem 4.1.

Then with (4.12), (4.13), and (4.14), we get

(4.20) trD+z𝔼[trR(z)D]=𝔼[dn1n𝐲DR(z)𝐲1+1dn𝐲R(z)𝐲dntrR(z)ΦnD]+o(1),\operatorname{tr}D+z\mathbb{E}[\operatorname{tr}R(z)D]=\mathbb{E}\left[\frac{\sqrt{\frac{d}{n}}\frac{1}{n}\mathbf{y}^{\top}DR(z)\mathbf{y}}{1+\frac{1}{\sqrt{dn}}\mathbf{y}^{\top}R(z)\mathbf{y}}-\sqrt{\frac{d}{n}}\operatorname{tr}R(z)\Phi_{n}D\right]+o(1),

as nn\rightarrow\infty. We use the notation 𝔼𝐲\mathbb{E}_{\mathbf{y}} to denote the expectation taken only with respect to the random vector 𝐲\mathbf{y}, conditioning on all other independent random variables. So the conditional expectation is 𝔼𝐲[1n𝐲DR(z)𝐲]=trDR(z)Φn\mathbb{E}_{\mathbf{y}}\left[\frac{1}{n}\mathbf{y}^{\top}DR(z)\mathbf{y}\right]=\operatorname{tr}DR(z)\Phi_{n} and

𝔼[1n𝐲DR(z)𝐲]=𝔼[𝔼𝐲[1n𝐲DR(z)𝐲]]=𝔼[trR(z)ΦnD].\mathbb{E}\left[\frac{1}{n}\mathbf{y}^{\top}DR(z)\mathbf{y}\right]=\mathbb{E}\left[\mathbb{E}_{\mathbf{y}}\left[\frac{1}{n}\mathbf{y}^{\top}DR(z)\mathbf{y}\right]\right]=\mathbb{E}[\operatorname{tr}R(z)\Phi_{n}D].

Therefore, based on (4.20), the conclusion (4.5) holds. ∎

In the next lemma, we apply the quadratic concentration condition (4.2) to simplify (4.5).

Lemma 4.4.

Under the assumptions of Theorem 4.1, condition (4.2) of Theorem 4.1 implies that

(4.21) 𝔼[1n𝐲DR(z)𝐲1n𝐲R(z)𝐲1+nd1n𝐲R(z)𝐲]=𝔼[trDR(z)ΦntrR(z)Φn1+ndtrR(z)Φn]+o(1),\mathbb{E}\left[\frac{\frac{1}{n}\mathbf{y}^{\top}DR(z)\mathbf{y}\cdot\frac{1}{n}\mathbf{y}^{\top}R(z)\mathbf{y}}{1+\sqrt{\frac{n}{d}}\frac{1}{n}\mathbf{y}^{\top}R(z)\mathbf{y}}\right]=\mathbb{E}\left[\frac{\operatorname{tr}DR(z)\Phi_{n}\operatorname{tr}R(z)\Phi_{n}}{1+\sqrt{\frac{n}{d}}\operatorname{tr}R(z)\Phi_{n}}\right]+o(1),

for each z+z\in\mathbb{C}^{+} and any deterministic matrix DD with DC\|D\|\leq C.

Proof.

Let us denote

δn:=1n𝐲DR(z)𝐲1n𝐲R(z)𝐲1+nd1n𝐲R(z)𝐲trDR(z)ΦntrR(z)Φn1+ndtrR(z)Φn,\delta_{n}:=\frac{\frac{1}{n}\mathbf{y}^{\top}DR(z)\mathbf{y}\cdot\frac{1}{n}\mathbf{y}^{\top}R(z)\mathbf{y}}{1+\sqrt{\frac{n}{d}}\frac{1}{n}\mathbf{y}^{\top}R(z)\mathbf{y}}-\frac{\operatorname{tr}DR(z)\Phi_{n}\operatorname{tr}R(z)\Phi_{n}}{1+\sqrt{\frac{n}{d}}\operatorname{tr}R(z)\Phi_{n}},
Q1:=1n𝐲DR(z)𝐲,Q2:=1n𝐲R(z)𝐲,Q_{1}:=\frac{1}{n}\mathbf{y}^{\top}DR(z)\mathbf{y},\quad Q_{2}:=\frac{1}{n}\mathbf{y}^{\top}R(z)\mathbf{y},

Q¯1:=𝔼𝐲[Q1]=trDR(z)Φn\bar{Q}_{1}:=\mathbb{E}_{\mathbf{y}}[Q_{1}]=\operatorname{tr}DR(z)\Phi_{n}, and Q¯2:=𝔼𝐲[Q2]=trR(z)Φn\bar{Q}_{2}:=\mathbb{E}_{\mathbf{y}}[Q_{2}]=\operatorname{tr}R(z)\Phi_{n}. In other words, δn\delta_{n} can be expressed by

δn=\displaystyle\delta_{n}=~{} Q1Q21+ndQ2Q¯1Q¯21+ndQ¯2\displaystyle\frac{Q_{1}Q_{2}}{1+\sqrt{\frac{n}{d}}Q_{2}}-\frac{\bar{Q}_{1}\bar{Q}_{2}}{1+\sqrt{\frac{n}{d}}\bar{Q}_{2}}
=\displaystyle=~{} Q1(Q2+dn)1+ndQ2dnQ11+ndQ2Q¯1(Q¯2+dn)1+ndQ¯2+dnQ¯11+ndQ¯2\displaystyle\frac{Q_{1}\left(Q_{2}+\sqrt{\frac{d}{n}}\right)}{1+\sqrt{\frac{n}{d}}Q_{2}}-\frac{\sqrt{\frac{d}{n}}Q_{1}}{1+\sqrt{\frac{n}{d}}Q_{2}}-\frac{\bar{Q}_{1}\left(\bar{Q}_{2}+\sqrt{\frac{d}{n}}\right)}{1+\sqrt{\frac{n}{d}}\bar{Q}_{2}}+\frac{\sqrt{\frac{d}{n}}\bar{Q}_{1}}{1+\sqrt{\frac{n}{d}}\bar{Q}_{2}}
\displaystyle=~{} \sqrt{\frac{d}{n}}(Q_{1}-\bar{Q}_{1})+\frac{\sqrt{\frac{d}{n}}(\bar{Q}_{1}-Q_{1})}{1+\sqrt{\frac{n}{d}}\bar{Q}_{2}}+\frac{\sqrt{\frac{n}{d}}Q_{1}\sqrt{\frac{d}{n}}(Q_{2}-\bar{Q}_{2})}{\left(1+\sqrt{\frac{n}{d}}\bar{Q}_{2}\right)\left(1+\sqrt{\frac{n}{d}}Q_{2}\right)}.

Observe that 𝔼[Q¯i]=𝔼[Qi]\mathbb{E}[\bar{Q}_{i}]=\mathbb{E}[Q_{i}] for i=1,2i=1,2. Thus, δn\delta_{n} has the same expectation as the last term

\Delta_{n}:=\frac{Q_{1}(Q_{2}-\bar{Q}_{2})}{\left(1+\sqrt{\frac{n}{d}}\bar{Q}_{2}\right)\left(1+\sqrt{\frac{n}{d}}Q_{2}\right)},

since we can first take the expectation for 𝐲\mathbf{y} conditioning on the resolvent R(z)R(z) and then take the expectation for R(z)R(z). Besides, notice that |Q¯1|,|Q¯2|Cv|\bar{Q}_{1}|,|\bar{Q}_{2}|\leq\frac{C}{v} uniformly. Hence, ndQ¯2\sqrt{\frac{n}{d}}\bar{Q}_{2} converges to zero uniformly and there exists some constant C>0C>0 such that

(4.22) |11+ndQ¯2|C,\left|\frac{1}{1+\sqrt{\frac{n}{d}}\bar{Q}_{2}}\right|\leq C,

for all large dd and nn. In addition, observe that

ndQ11+ndQ2=𝐲~DR(z)𝐲~1+𝐲~R(z)𝐲~,\frac{\sqrt{\frac{n}{d}}Q_{1}}{1+\sqrt{\frac{n}{d}}Q_{2}}=\frac{\tilde{\mathbf{y}}^{\top}DR(z)\tilde{\mathbf{y}}}{1+\tilde{\mathbf{y}}^{\top}R(z)\tilde{\mathbf{y}}},

where 𝐲~\tilde{\mathbf{y}} is defined in the proof of Lemma 4.3. In terms of (4.18) and (4.19), we verify that

(4.23) |Q11+ndQ2|C𝐲2n,\left|\frac{Q_{1}}{1+\sqrt{\frac{n}{d}}Q_{2}}\right|\leq\frac{C\|\mathbf{y}\|^{2}}{n},

where C>0C>0 is some constant depending on vv. Next, recall that condition (4.2) implies that

(4.24) 𝔼(Q2Q¯2)20and𝔼(𝐲2/ntrΦn)20\mathbb{E}(Q_{2}-\bar{Q}_{2})^{2}\to 0\quad\text{and}\quad\mathbb{E}(\|\mathbf{y}\|^{2}/n-\operatorname{tr}\Phi_{n})^{2}\to 0

as nn\rightarrow\infty. The first convergence is derived by viewing Dn=R(z)D_{n}=R(z) and taking expectation conditional on R(z)R(z). To sum up, we can bound |Δn||\Delta_{n}| based on (4.22) and (4.23) in the subsequent way:

|Δn|\displaystyle|\Delta_{n}|\leq~{} C𝐲2n|Q¯2Q2|C|𝐲2/ntrΦn||Q¯2Q2|+C|trΦn||Q¯2Q2|.\displaystyle\frac{C\|\mathbf{y}\|^{2}}{n}|\bar{Q}_{2}-Q_{2}|\leq C\left|\|\mathbf{y}\|^{2}/n-\operatorname{tr}\Phi_{n}\right|\cdot|\bar{Q}_{2}-Q_{2}|+C\left|\operatorname{tr}\Phi_{n}\right|\cdot|\bar{Q}_{2}-Q_{2}|.

Here, |trΦn|Φn|\operatorname{tr}\Phi_{n}|\leq\|\Phi_{n}\| and Φn\|\Phi_{n}\| is uniformly bounded by some constant. Then, by Hölder’s inequality, (4.24) implies that 𝔼[|Δn|]0\mathbb{E}[|\Delta_{n}|]\rightarrow 0 as nn tends to infinity. This shows that 𝔼[δn]=𝔼[Δn]\mathbb{E}[\delta_{n}]=\mathbb{E}[\Delta_{n}] converges to zero. ∎

Lemma 4.5.

Under the assumptions of Theorem 4.1, we can conclude that

limn,d(trD+z𝔼[trR(z)D]+𝔼[trDR(z)Φn]𝔼[trR(z)Φn])=0\lim_{n,d\rightarrow\infty}\left(\operatorname{tr}D+z\mathbb{E}[\operatorname{tr}R(z)D]+\mathbb{E}\left[\operatorname{tr}DR(z)\Phi_{n}\right]\mathbb{E}\left[\operatorname{tr}R(z)\Phi_{n}\right]\right)=0

holds for each z+z\in\mathbb{C}^{+} and deterministic matrix DD with uniformly bounded spectral norm.

Proof.

Based on Lemma 4.3 and Lemma 4.4, (4.21) and (4.5) yield

trD+z𝔼[trR(z)D]+𝔼[trDR(z)ΦntrR(z)Φn1+ndtrR(z)Φn]=o(1).\operatorname{tr}D+z\mathbb{E}[\operatorname{tr}R(z)D]+\mathbb{E}\left[\frac{\operatorname{tr}DR(z)\Phi_{n}\operatorname{tr}R(z)\Phi_{n}}{1+\sqrt{\frac{n}{d}}\operatorname{tr}R(z)\Phi_{n}}\right]=o(1).

As |trR(z)D||\operatorname{tr}R(z)D| and |trR(z)DΦn||\operatorname{tr}R(z)D\Phi_{n}| are bounded by some constants uniformly and almost surely, for sufficiently large dd and nn, |ndtrR(z)Φn|<1/2|\sqrt{\frac{n}{d}}\operatorname{tr}R(z)\Phi_{n}|<1/2 and

|𝔼[trDR(z)ΦntrR(z)Φn1+ndtrR(z)Φn]𝔼[trDR(z)ΦntrR(z)Φn]|\displaystyle\left|\mathbb{E}\left[\frac{\operatorname{tr}DR(z)\Phi_{n}\operatorname{tr}R(z)\Phi_{n}}{1+\sqrt{\frac{n}{d}}\operatorname{tr}R(z)\Phi_{n}}\right]-\mathbb{E}\left[\operatorname{tr}DR(z)\Phi_{n}\operatorname{tr}R(z)\Phi_{n}\right]\right|
\displaystyle\leq~{} 𝔼[|trR(z)D||trR(z)DΦn||ndtrR(z)Φn1+ndtrR(z)Φn|]2Cnd0,\displaystyle\mathbb{E}\left[|\operatorname{tr}R(z)D|\cdot|\operatorname{tr}R(z)D\Phi_{n}|\cdot\left|\frac{\sqrt{\frac{n}{d}}\operatorname{tr}R(z)\Phi_{n}}{1+\sqrt{\frac{n}{d}}\operatorname{tr}R(z)\Phi_{n}}\right|\right]\leq 2C\sqrt{\frac{n}{d}}\rightarrow 0,

as n/d0n/d\rightarrow 0. Hence,

(4.25) trD+z𝔼[trR(z)D]+𝔼[trDR(z)ΦntrR(z)Φn]=o(1).\operatorname{tr}D+z\mathbb{E}[\operatorname{tr}R(z)D]+\mathbb{E}\left[\operatorname{tr}DR(z)\Phi_{n}\operatorname{tr}R(z)\Phi_{n}\right]=o(1).

Considering Dn=ΦnD_{n}=\Phi_{n} in (4.1), we can get almost sure convergence for trDR(z)Φn(trR(z)Φn𝔼[trR(z)Φn])\operatorname{tr}DR(z)\Phi_{n}\cdot(\operatorname{tr}R(z)\Phi_{n}-\mathbb{E}\left[\operatorname{tr}R(z)\Phi_{n}\right]) to zero. Thus by dominated convergence theorem,

limn𝔼[trDR(z)Φn(trR(z)Φn𝔼[trR(z)Φn])]0.\lim_{n\rightarrow\infty}\mathbb{E}\left[\operatorname{tr}DR(z)\Phi_{n}\cdot(\operatorname{tr}R(z)\Phi_{n}-\mathbb{E}\left[\operatorname{tr}R(z)\Phi_{n}\right])\right]\rightarrow 0.

So we can replace the third term on the left-hand side of (4.25) with

𝔼[trDR(z)Φn]𝔼[trR(z)Φn]\mathbb{E}\left[\operatorname{tr}DR(z)\Phi_{n}\right]\mathbb{E}\left[\operatorname{tr}R(z)\Phi_{n}\right]

to obtain the conclusion. ∎

Proof of Theorem 4.1.

Fix any z+z\in\mathbb{C}^{+}. Denote the Stieltjes transform of empirical spectrum of AnA_{n} and its expectation by mn(z):=trR(z)m_{n}(z):=\operatorname{tr}R(z) and m¯n(z):=𝔼[mn(z)]\bar{m}_{n}(z):=\mathbb{E}[m_{n}(z)] respectively. Let βn(z):=trR(z)Φn\beta_{n}(z):=\operatorname{tr}R(z)\Phi_{n} and β¯n(z):=𝔼[βn(z)]\bar{\beta}_{n}(z):=\mathbb{E}[\beta_{n}(z)]. Notice that mn(z),m¯n(z),βnm_{n}(z),\bar{m}_{n}(z),\beta_{n} and β¯n(z)\bar{\beta}_{n}(z) are all in +\mathbb{C}^{+} and uniformly and almost surely bounded by some constant. By choosing D=IdD=\operatorname{Id} in Lemma 4.5, we conclude

(4.26) limn,d(1+zm¯n(z)+β¯n(z)2)=0.\lim_{n,\,d\rightarrow\infty}\left(1+z\bar{m}_{n}(z)+\bar{\beta}_{n}(z)^{2}\right)=0.

Likewise, in Lemma 4.5, we consider D=(β¯n(z)Φn+zId)1ΦnD=\left(\bar{\beta}_{n}(z)\Phi_{n}+z\operatorname{Id}\right)^{-1}\Phi_{n}. Let

U=(β¯n(z)Φn+zId)1.U=\left(\bar{\beta}_{n}(z)\Phi_{n}+z\operatorname{Id}\right)^{-1}.

Because Φn\|\Phi_{n}\| is uniformly bounded, DCU\|D\|\leq C\|U\|. In terms of Lemma A.6, we only need a lower bound on the imaginary part of U1=β¯n(z)Φn+zIdU^{-1}=\bar{\beta}_{n}(z)\Phi_{n}+z\operatorname{Id}. Observe that Im(U1)=Imβ¯n(z)Φn+vIdvId\operatorname{Im}(U^{-1})=\operatorname{Im}\bar{\beta}_{n}(z)\,\Phi_{n}+v\operatorname{Id}\succeq v\operatorname{Id} since λmin(Φn)0\lambda_{\min}(\Phi_{n})\geq 0 and Imβ¯n(z)>0\operatorname{Im}\bar{\beta}_{n}(z)>0. Thus, DCv1\|D\|\leq Cv^{-1} for all nn. Meanwhile, we have the equation β¯n(z)ΦnD=ΦnzD\bar{\beta}_{n}(z)\Phi_{n}D=\Phi_{n}-zD and hence,

β¯n(z)𝔼[trR(z)ΦnD]=𝔼[trR(z)ΦnD]𝔼[trR(z)Φn]=β¯n(z)z𝔼[trR(z)D].\bar{\beta}_{n}(z)\mathbb{E}[\operatorname{tr}R(z)\Phi_{n}D]=\mathbb{E}[\operatorname{tr}R(z)\Phi_{n}D]\mathbb{E}[\operatorname{tr}R(z)\Phi_{n}]=\bar{\beta}_{n}(z)-z\mathbb{E}[\operatorname{tr}R(z)D].

So applying Lemma 4.5 again, we have another limiting equation trD+β¯n(z)0\operatorname{tr}D+\bar{\beta}_{n}(z)\rightarrow 0. In other words,

(4.27) limn,d(tr(β¯n(z)Φn+zId)1Φn+β¯n(z))=0.\lim_{n,\,d\rightarrow\infty}\left(\operatorname{tr}\left(\bar{\beta}_{n}(z)\Phi_{n}+z\operatorname{Id}\right)^{-1}\Phi_{n}+\bar{\beta}_{n}(z)\right)=0.

Thanks to the identity

β¯n(z)tr(β¯n(z)Φn+zId)1Φn1=ztr(β¯n(z)Φn+zId)1,\displaystyle\bar{\beta}_{n}(z)\operatorname{tr}\left(\bar{\beta}_{n}(z)\Phi_{n}+z\operatorname{Id}\right)^{-1}\Phi_{n}-1=-z\operatorname{tr}\left(\bar{\beta}_{n}(z)\Phi_{n}+z\operatorname{Id}\right)^{-1},

we can modify (4.26) and (4.27) to get

(4.28) limn,d(m¯n(z)+tr(β¯n(z)Φn+zId)1)=0.\lim_{n,\,d\rightarrow\infty}\left(\bar{m}_{n}(z)+\operatorname{tr}\left(\bar{\beta}_{n}(z)\Phi_{n}+z\operatorname{Id}\right)^{-1}\right)=0.

Since β¯n(z)\bar{\beta}_{n}(z) and m¯n(z)\bar{m}_{n}(z) are uniformly bounded, for any subsequence in nn, there is a further convergent sub-subsequence. We denote the limit of such sub-subsequence by β(z)\beta(z) and m(z)+m(z)\in\mathbb{C}^{+} respectively. Hence, by (4.27) and (4.28), one can conclude

limn,d(β(z)+tr(β(z)Φn+zId)1Φn)=0.\lim_{n,\,d\rightarrow\infty}\left(\beta(z)+\operatorname{tr}\left(\beta(z)\Phi_{n}+z\operatorname{Id}\right)^{-1}\Phi_{n}\right)=0.

Because of the convergence of the empirical eigenvalue distribution of Φn\Phi_{n}, we obtain the fixed point equation (4.4) for β(z)\beta(z). Analogously, we can also obtain (4.3) for m(z)m(z) and β(z)\beta(z). The existence and the uniqueness of the solutions to (4.3) and (4.4) are proved in [BZ10, Theorem 2.1] and [WP14, Section 3.4], which implies the convergence of m¯n(z)\bar{m}_{n}(z) and β¯n(z)\bar{\beta}_{n}(z) to m(z)m(z) and β(z)\beta(z) governed by the self-consistent equations (4.3) and (4.4) as nn\rightarrow\infty, respectively.

Then, by virtue of condition (4.1) in Theorem 4.1, we know mn(z)m¯n(z)a.s.0m_{n}(z)-\bar{m}_{n}(z)\overset{\text{a.s.}}{\longrightarrow}0 and βn(z)β¯n(z)a.s.0\beta_{n}(z)-\bar{\beta}_{n}(z)\overset{\text{a.s.}}{\longrightarrow}0. Therefore, the empirical Stieltjes transform mn(z)m_{n}(z) converges to m(z)m(z) almost surely for each z+z\in\mathbb{C}^{+}. Recall that the Stieltjes transform of μ\mu is m(z)m(z). By the standard Stieltjes continuity theorem (see for example, [BS10, Theorem B.9]), this finally concludes the weak convergence of empirical eigenvalue distribution of AnA_{n} to μ\mu.


Now we show μ=μsμΦ\mu=\mu_{s}\boxtimes\mu_{\Phi}. The fixed point equations (4.3) and (4.4) induce

(4.29) β2(z)+1+zm(z)=0,\beta^{2}(z)+1+zm(z)=0,

since β(z)+\beta(z)\in\mathbb{C}^{+} for any z+z\in\mathbb{C}^{+}. Together with (4.3), we attain the same self-consistent equations for the convergence of the empirical spectral distribution of the Wigner-type matrix studied in [BZ10, Theorem 1.1].

Define WnW_{n}, the nn-by-nn Wigner matrix, as a Hermitian matrix with independent entries

{Wn[i,j]:𝔼[Wn[i,j]]=0, 𝔼[Wn[i,j]2]=1, 1ijn}.\{W_{n}[i,j]:\mathbb{E}[W_{n}[i,j]]=0,\text{ }\mathbb{E}[W_{n}[i,j]^{2}]=1,\text{ }1\leq i\leq j\leq n\}.

The Wigner-type matrix studied in [BZ10, Definition 1.2] is indeed 1nΦn1/2WnΦn1/2\frac{1}{\sqrt{n}}\Phi_{n}^{1/2}W_{n}\Phi_{n}^{1/2}. Hence, such Wigner-type matrix 1nΦn1/2WnΦn1/2\frac{1}{\sqrt{n}}\Phi_{n}^{1/2}W_{n}\Phi_{n}^{1/2} has the same limiting spectral distribution as AnA_{n} defined in Theorem 4.1. Both limits are determined by self-consistent equations (4.3) and (4.29).

On the other hand, based on [AGZ10, Theorem 5.4.5], 1nWn\frac{1}{\sqrt{n}}W_{n} and Φn\Phi_{n} are almost surely asymptotically free, i.e. the empirical distribution of {1nWn,Φn}\{\frac{1}{\sqrt{n}}W_{n},\Phi_{n}\} converges almost surely to the law of {𝐬,𝐝}\{\bf s,d\}, where s and d are two free non-commutative random variables (s is a semicircle element and d has the law μΦ\mu_{\Phi}). Thus, the limiting spectral distribution μ\mu of 1nΦn1/2WnΦn1/2\frac{1}{\sqrt{n}}\Phi_{n}^{1/2}W_{n}\Phi_{n}^{1/2} is the free multiplicative convolution between μs\mu_{s} and μΦ\mu_{\Phi}. This implies μ=μsμΦ\mu=\mu_{s}\boxtimes\mu_{\Phi} in our setting. ∎
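The agreement between these two limiting spectra can be visualized directly. In the hedged sketch below (our own experiment with a toy choice of \Phi_{n}; the sizes and eigenvalues are arbitrary), we compare the eigenvalues of A_{n} built from Gaussian vectors with covariance \Phi_{n} against those of the Wigner-type matrix \frac{1}{\sqrt{n}}\Phi_{n}^{1/2}W_{n}\Phi_{n}^{1/2}.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 20000                               # n/d small, as in Theorem 4.1

# Toy covariance Phi_n with two distinct eigenvalues.
phi_eigs = np.concatenate([np.full(n // 2, 1.0), np.full(n // 2, 3.0)])
Phi = np.diag(phi_eigs)
Phi_sqrt = np.diag(np.sqrt(phi_eigs))

# A_n = sqrt(d/n) * ( (1/d) sum_i y_i y_i^T - Phi_n ), with y_i ~ N(0, Phi_n).
Y = rng.standard_normal((d, n)) @ Phi_sqrt
A_n = np.sqrt(d / n) * (Y.T @ Y / d - Phi)

# Wigner-type matrix (1/sqrt(n)) Phi^{1/2} W Phi^{1/2}; the diagonal variance of W
# differs slightly from the definition above, which does not affect the limit.
G = rng.standard_normal((n, n))
W = (G + G.T) / np.sqrt(2.0)                    # symmetric, off-diagonal variance 1
B_n = Phi_sqrt @ W @ Phi_sqrt / np.sqrt(n)

eig_A = np.linalg.eigvalsh(A_n)
eig_B = np.linalg.eigvalsh(B_n)
# A few quantiles of the two spectra; they should be close for large n and d.
print(np.quantile(eig_A, [0.1, 0.5, 0.9]))
print(np.quantile(eig_B, [0.1, 0.5, 0.9]))
```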

5. Proof of Theorem 2.1 and Theorem 2.2

To prove Theorem 2.1, we first establish the following proposition to analyze the difference between Stieltjes transform of (2.1) and its expectation. This will assist us to verify condition (4.1) in Theorem 4.1. The proof is based on [FW20, Lemma E.6].

Proposition 5.1.

Let Dn×nD\in\mathbb{R}^{n\times n} be any deterministic symmetric matrix with a uniformly bounded spectral norm. Following the notation of Theorem 2.1, assume XC\|X\|\leq C for some constant CC and that Assumption 1.2 holds. Let R(z)R(z) be the resolvent

(1d1n(YY𝔼[YY])zId)1,\left(\frac{1}{\sqrt{d_{1}n}}\left(Y^{\top}Y-\mathbb{E}[Y^{\top}Y]\right)-z\operatorname{Id}\right)^{-1},

for any fixed z+z\in\mathbb{C}^{+}. Then, there exist constants c,n0>0c,n_{0}>0 such that for all n>n0n>n_{0} and any t>0t>0,

(|trR(z)D𝔼[trR(z)D]|>t)2ecnt2.\mathbb{P}\left(\left|\operatorname{tr}R(z)D-\mathbb{E}[\operatorname{tr}R(z)D]\right|>t\right)\leq 2e^{-cnt^{2}}.
Proof.

Define function F:d1×d0F:\mathbb{R}^{d_{1}\times d_{0}}\rightarrow\mathbb{R} by F(W):=trR(z)DF(W):=\operatorname{tr}R(z)D. Fix any W,Δd1×d0W,\Delta\in\mathbb{R}^{d_{1}\times d_{0}} where ΔF=1\|\Delta\|_{F}=1, and let Wt=W+tΔW_{t}=W+t\Delta. We want to verify F(W)F(W) is a Lipschitz function in WW with respect to the Frobenius norm. First, recall

R(z)1=1d1nσ(WX)σ(WX)d1nΦzId,R(z)^{-1}=\frac{1}{\sqrt{d_{1}n}}\sigma(WX)^{\top}\sigma(WX)-\sqrt{\frac{d_{1}}{n}}\Phi-z\operatorname{Id},

where the last two terms are deterministic with respect to WW. Hence,

vec(Δ)(F(W))\displaystyle\operatorname{vec}(\Delta)^{\top}(\nabla F(W)) =ddt|t=0F(Wt)\displaystyle=\frac{d}{dt}\Big{|}_{t=0}F(W_{t})
=trR(z)(ddt|t=0R(z)1)R(z)D\displaystyle=-\operatorname{tr}R(z)\left(\frac{d}{dt}\Big{|}_{t=0}R(z)^{-1}\right)R(z)D
=1d1ntrR(z)(ddt|t=0σ(WtX)σ(WtX))R(z)D\displaystyle=-\frac{1}{\sqrt{d_{1}n}}\operatorname{tr}R(z)\left(\frac{d}{dt}\Big{|}_{t=0}\sigma(W_{t}X)^{\top}\sigma(W_{t}X)\right)R(z)D
=2d1ntrR(z)(σ(WX)ddt|t=0σ(WtX))R(z)D\displaystyle=-\frac{2}{\sqrt{d_{1}n}}\operatorname{tr}R(z)\left(\sigma(WX)^{\top}\cdot\frac{d}{dt}\Big{|}_{t=0}\sigma(W_{t}X)\right)R(z)D
=2d1ntrR(z)(σ(WX)(σ(WX)(ΔX)))R(z)D,\displaystyle=-\frac{2}{\sqrt{d_{1}n}}\operatorname{tr}R(z)\left(\sigma(WX)^{\top}\cdot\left(\sigma^{\prime}(WX)\odot(\Delta X)\right)\right)R(z)D,

where \odot is the Hadamard product, and σ\sigma^{\prime} is applied entrywise. Here we utilize the formula

R(z)=R(z)((R(z)1))R(z)\partial R(z)=-R(z)(\partial(R(z)^{-1}))R(z)

and R(z)=R(z)R(z)=R(z)^{\top}. Lemma A.6 in Appendix A implies that R(z)1|Imz|.\|R(z)\|\leq\frac{1}{|\operatorname{Im}z|}. Therefore, based on the assumption of DD, we have

|vec(Δ)(F(W))|Cd1nR(z)σ(WX)σ(WX)(ΔX),\Big{|}\operatorname{vec}(\Delta)^{\top}(\nabla F(W))\Big{|}\leq\frac{C}{\sqrt{d_{1}n}}\|R(z)\sigma(WX)^{\top}\|\cdot\|\sigma^{\prime}(WX)\odot(\Delta X)\|,

for some constant C>0C>0. For the first term in the product on the right-hand side,

(1d1nR(z)σ(WX))2\displaystyle\left(\frac{1}{\sqrt{d_{1}n}}\|R(z)\sigma(WX)^{\top}\|\right)^{2}
=\displaystyle= 1d1nR(z)(1d1nσ(WX)σ(WX))R(z)\displaystyle\frac{1}{\sqrt{d_{1}n}}\left\|R(z)\left(\frac{1}{\sqrt{d_{1}n}}\sigma(WX)^{\top}\sigma(WX)\right)R(z)^{*}\right\|
\displaystyle\leq 1d1n(R(z)R(z)1R(z)+R(z)(d1nΦ+zId)R(z))\displaystyle\frac{1}{\sqrt{d_{1}n}}\left(\|R(z)R(z)^{-1}R(z)^{*}\|+\left\|R(z)\left(\sqrt{\frac{d_{1}}{n}}\Phi+z\operatorname{Id}\right)R(z)^{*}\right\|\right)
\displaystyle\leq 1d1n(R(z)+R(z)2(d1nΦ+|z|))Cn.\displaystyle\frac{1}{\sqrt{d_{1}n}}\left(\|R(z)\|+\|R(z)\|^{2}\left(\sqrt{\frac{d_{1}}{n}}\|\Phi\|+|z|\right)\right)\leq\frac{C}{n}.

For the second term,

σ(WX)(ΔX)σ(WX)(ΔX)FλσΔXFλσΔFXC.\|\sigma^{\prime}(WX)\odot(\Delta X)\|\leq\|\sigma^{\prime}(WX)\odot(\Delta X)\|_{F}\leq\lambda_{\sigma}\|\Delta X\|_{F}\leq\lambda_{\sigma}\|\Delta\|_{F}\cdot\|X\|\leq C.

Thus, |vec(Δ)(F(W))|C/n|\operatorname{vec}(\Delta)^{\top}(\nabla F(W))|\leq C/\sqrt{n}. This holds for every Δ\Delta such that ΔF=1\|\Delta\|_{F}=1, so F(W)F(W) is C/nC/\sqrt{n}-Lipschitz in WW with respect to the Frobenius norm. Then the result follows from the Gaussian concentration inequality for Lipschitz functions. ∎

Next, we investigate the approximation of Φ=𝔼𝒘[σ(𝒘X)σ(𝒘X)]\Phi=\mathbb{E}_{\boldsymbol{w}}[\sigma(\boldsymbol{w}^{\top}X)^{\top}\sigma(\boldsymbol{w}^{\top}X)] via the Hermite polynomials {hk}k0\{h_{k}\}_{k\geq 0}. The orthogonality of Hermite polynomials allows us to write Φ\Phi as a series of kernel matrices. Then we only need to estimate each kernel matrix in this series. The proof is directly based on [GMMM19, Lemma 2]. The only difference is that we consider the deterministic input data XX with the (εn,B)(\varepsilon_{n},B)-orthonormal property, while in Lemma 2 of [GMMM19], the matrix XX is formed by independent Gaussian vectors.

Lemma 5.2.

Recall the definition of Φ0\Phi_{0} in (1.13). If XX is (εn,B)(\varepsilon_{n},B)-orthonormal and Assumption 1.2 holds, then we have the spectral norm bound

ΦΦ0CBεn2n,\|\Phi-\Phi_{0}\|\leq C_{B}\varepsilon_{n}^{2}\sqrt{n},

where CBC_{B} is a constant depending on BB. Suppose that εn2n0\varepsilon_{n}^{2}\sqrt{n}\to 0 as nn\to\infty, then ΦC\|\Phi\|\leq C uniformly for some constant CC independent of nn.

Proof.

By Assumption 1.2, we know that

\zeta_{0}(\sigma)=0,\quad\sum_{k=1}^{\infty}\zeta_{k}^{2}(\sigma)=\mathbb{E}[\sigma(\xi)^{2}]=1.

For any fixed tt, σ(tx)L2(,Γ)\sigma(tx)\in L^{2}(\mathbb{R},\Gamma). This is because σ(x)L2(,Γ)\sigma(x)\in L^{2}(\mathbb{R},\Gamma) is a Lipschitz function, so |σ(tx)σ(x)|λσ|txx||\sigma(tx)-\sigma(x)|\leq\lambda_{\sigma}|tx-x|, and by the triangle inequality we have, for ξ𝒩(0,1)\xi\sim\mathcal{N}(0,1),

(5.1) 𝔼(σ(tξ)2)\displaystyle\mathbb{E}(\sigma(t\xi)^{2}) 𝔼(|σ(ξ)|+λσ|tξξ|)2<.\displaystyle\leq\mathbb{E}(|\sigma(\xi)|+\lambda_{\sigma}|t\xi-\xi|)^{2}<\infty.

For 1αn1\leq\alpha\leq n, let σα(x):=σ(𝐱αx)\sigma_{\alpha}(x):=\sigma(\|\mathbf{x}_{\alpha}\|x); the Hermite expansion of σα\sigma_{\alpha} can then be written as

σα(x)=k=0ζk(σα)hk(x),\sigma_{\alpha}(x)=\sum_{k=0}^{\infty}\zeta_{k}(\sigma_{\alpha})h_{k}(x),

where the coefficient ζk(σα)=𝔼[σα(ξ)hk(ξ)]\zeta_{k}(\sigma_{\alpha})=\mathbb{E}[\sigma_{\alpha}(\xi)h_{k}(\xi)]. Let unit vectors be 𝐮α=𝐱α/𝐱α\mathbf{u}_{\alpha}=\mathbf{x}_{\alpha}/\|\mathbf{x}_{\alpha}\|, for 1αn1\leq\alpha\leq n. So for 1α,βn1\leq\alpha,\beta\leq n, the (α,β)(\alpha,\beta) entry of Φ\Phi is

Φαβ=𝔼[σ(𝒘𝐱α)σ(𝒘𝐱β)]=𝔼[σα(ξα)σβ(ξβ)],\Phi_{\alpha\beta}=\mathbb{E}[\sigma(\boldsymbol{w}^{\top}\mathbf{x}_{\alpha})\sigma(\boldsymbol{w}^{\top}\mathbf{x}_{\beta})]=\mathbb{E}[\sigma_{\alpha}(\xi_{\alpha})\sigma_{\beta}(\xi_{\beta})],

where (ξα,ξβ)=(𝒘𝐮α,𝒘𝐮β)(\xi_{\alpha},\xi_{\beta})=(\boldsymbol{w}^{\top}\mathbf{u}_{\alpha},\boldsymbol{w}^{\top}\mathbf{u}_{\beta}) is a Gaussian random vector with mean zero and covariance

(5.2) (1𝐮α𝐮β𝐮α𝐮β1).\begin{pmatrix}1&\mathbf{u}_{\alpha}^{\top}\mathbf{u}_{\beta}\\ \mathbf{u}_{\alpha}^{\top}\mathbf{u}_{\beta}&1\end{pmatrix}.

By the orthogonality of Hermite polynomials with respect to Γ\Gamma and Lemma A.5, we can obtain

𝔼[hj(ξα)hk(ξβ)]=𝔼[hj(𝒘𝐮α)hk(𝒘𝐮β)]=δj,k(𝐮α𝐮β)k,\mathbb{E}[h_{j}(\xi_{\alpha})h_{k}(\xi_{\beta})]=\mathbb{E}[h_{j}(\boldsymbol{w}^{\top}\mathbf{u}_{\alpha})h_{k}(\boldsymbol{w}^{\top}\mathbf{u}_{\beta})]=\delta_{j,k}(\mathbf{u}_{\alpha}^{\top}\mathbf{u}_{\beta})^{k},

which leads to

(5.3) Φαβ=k=0ζk(σα)ζk(σβ)(𝐮α𝐮β)k.\Phi_{\alpha\beta}=\sum_{k=0}^{\infty}\zeta_{k}(\sigma_{\alpha})\zeta_{k}(\sigma_{\beta})(\mathbf{u}_{\alpha}^{\top}\mathbf{u}_{\beta})^{k}.

For any kk\in\mathbb{N}, let TkT_{k} be an nn-by-nn matrix with (α,β)(\alpha,\beta)-th entry

(5.4) (Tk)αβ:=ζk(σα)ζk(σβ)(𝐮α𝐮β)k.\displaystyle(T_{k})_{\alpha\beta}:=\zeta_{k}(\sigma_{\alpha})\zeta_{k}(\sigma_{\beta})(\mathbf{u}_{\alpha}^{\top}\mathbf{u}_{\beta})^{k}.

Specifically, for any kk\in\mathbb{N}, we have

Tk=Dkfk(XX)Dk,T_{k}=D_{k}f_{k}(X^{\top}X)D_{k},

where DkD_{k} is the diagonal matrix diag(ζk(σα)/𝐱αk)α[n]\operatorname{diag}(\zeta_{k}(\sigma_{\alpha})/\|\mathbf{x}_{\alpha}\|^{k})_{\alpha\in[n]}.
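The expansion (5.3) and the matrices T_{k} in (5.4) are straightforward to evaluate numerically. The sketch below (our own illustration; the quadrature degree, truncation level K and data X are arbitrary) computes the coefficients \zeta_{k}(\sigma_{\alpha}) by Gauss–Hermite quadrature with the orthonormal probabilists' Hermite polynomials and compares the truncated series \sum_{k\leq K}T_{k} with a Monte Carlo estimate of \Phi.

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval

rng = np.random.default_rng(2)
d0, n, K = 400, 50, 12
X = rng.standard_normal((d0, n)) / np.sqrt(d0)       # nearly orthonormal columns
sigma = np.tanh

# Gauss-Hermite_e quadrature nodes/weights, renormalized for expectations under N(0, 1).
nodes, weights = hermegauss(80)
weights = weights / weights.sum()
norms = np.array([math.sqrt(math.factorial(k)) for k in range(K + 1)])

def zeta(t):
    # Coefficients zeta_k(sigma_alpha) for sigma_alpha(x) = sigma(t * x),
    # with h_k the orthonormal (probabilists') Hermite polynomials.
    vals = sigma(t * nodes)
    return np.array([np.sum(weights * vals * hermeval(nodes, np.eye(K + 1)[k])) / norms[k]
                     for k in range(K + 1)])

col_norms = np.linalg.norm(X, axis=0)
Z = np.vstack([zeta(t) for t in col_norms])          # (n, K+1) Hermite coefficients
U = X / col_norms                                    # unit vectors u_alpha
G = U.T @ U

# Truncation of (5.3), i.e. the sum of the matrices T_k in (5.4) for k <= K.
Phi_series = sum(np.outer(Z[:, k], Z[:, k]) * G**k for k in range(K + 1))

# Monte Carlo estimate of Phi = E_w[sigma(w^T X)^T sigma(w^T X)] for comparison.
W = rng.standard_normal((20000, d0))
Y = sigma(W @ X)
Phi_mc = Y.T @ Y / W.shape[0]
print("max entrywise gap:", np.abs(Phi_series - Phi_mc).max())
```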

First, we consider a twice differentiable σ\sigma as in Assumption 1.2. Similarly to [GMMM19, Equation (26)], for any tt with |t1|εn|t-1|\leq\varepsilon_{n}, we take the Taylor approximation of σ(tx)\sigma(tx) at the point xx: there exists η\eta between txtx and xx such that

σ(tx)σ(x)=σ(x)x(t1)+12σ′′(η)x2(t1)2.\sigma(tx)-\sigma(x)=\sigma^{\prime}(x)x(t-1)+\frac{1}{2}\sigma^{\prime\prime}(\eta)x^{2}(t-1)^{2}.

Replacing xx by ξ\xi and taking expectation, since σ′′\sigma^{\prime\prime} is uniformly bounded, we can get

(5.5) |𝔼[σ(tξ)σ(ξ)]𝔼[σ(ξ)ξ](t1)|C|t1|2Cεn2,\left|\mathbb{E}\left[\sigma(t\xi)-\sigma(\xi)\right]-\mathbb{E}[\sigma^{\prime}(\xi)\xi](t-1)\right|\leq C|t-1|^{2}\leq C\varepsilon_{n}^{2},

For k1k\geq 1, the Lipschitz condition for σ\sigma yields

(5.6) |ζk(σα)ζk(σ)|C|𝐱α1|𝔼[|ξ||hk(ξ)|]Cεn,\displaystyle|\zeta_{k}(\sigma_{\alpha})-\zeta_{k}(\sigma)|\leq C\left|\|\mathbf{x}_{\alpha}\|-1\right|\cdot\mathbb{E}[|\xi|\cdot|h_{k}(\xi)|]\leq C\varepsilon_{n},

where constant CC does not depend on kk. As for piece-wise linear σ\sigma, it is not hard to see

(5.7) 𝔼[σ(tξ)σ(ξ)]=𝔼[σ(ξ)ξ](t1).\mathbb{E}\left[\sigma(t\xi)-\sigma(\xi)\right]=\mathbb{E}[\sigma^{\prime}(\xi)\xi](t-1).

Now, we approximate each TkT_{k} separately based on (5.5), (5.6) and (5.7). Denote by diag(A)\operatorname{diag}(A) the diagonal part of a matrix AA.

(1) Approximation for k4(Tkdiag(Tk))\sum_{k\geq 4}(T_{k}-\operatorname{diag}(T_{k})). At first, we estimate the L2L^{2} norm with respect to Γ\Gamma of the function σα\sigma_{\alpha}. Recall that σαL2=𝔼[σα(ξ)2]1/2.\|\sigma_{\alpha}\|_{L^{2}}=\mathbb{E}[\sigma_{\alpha}(\xi)^{2}]^{1/2}. Because σL2=1\|\sigma\|_{L^{2}}=1 and σ\sigma is a Lipschitz function, we have

(5.8) sup1αnσσαL2=𝔼[(σ(ξ)σα(ξ))2]1/2\displaystyle\sup_{1\leq\alpha\leq n}\|\sigma-\sigma_{\alpha}\|_{L^{2}}=\mathbb{E}[(\sigma(\xi)-\sigma_{\alpha}(\xi))^{2}]^{1/2}\leq~{} C|𝐱α1|,\displaystyle C|\|\mathbf{x}_{\alpha}\|-1|,
(5.9) sup1αnσαL2\displaystyle\sup_{1\leq\alpha\leq n}\|\sigma_{\alpha}\|_{L^{2}}\leq~{} 1+Cεn.\displaystyle 1+C\varepsilon_{n}.

Hence, σαL2\|\sigma_{\alpha}\|_{L^{2}} is uniformly bounded with some constant for all large nn. Next, we estimate the off-diagonal entries of TkT_{k} when k4k\geq 4. From (5.4), we obtain that

k4(Tkdiag(Tk))\displaystyle\left\|\sum_{k\geq 4}(T_{k}-\operatorname{diag}(T_{k}))\right\|\leq~{} k4(Tkdiag(Tk))Fk4Tkdiag(Tk)F\displaystyle\left\|\sum_{k\geq 4}(T_{k}-\operatorname{diag}(T_{k}))\right\|_{F}\leq\sum_{k\geq 4}\left\|T_{k}-\operatorname{diag}(T_{k})\right\|_{F}
\displaystyle\leq~{} k4(supαβ|𝐮α𝐮β|k)[α,β=1nζk(σα)2ζk(σβ)2]12\displaystyle\sum_{k\geq 4}\left(\sup_{\alpha\neq\beta}|\mathbf{u}_{\alpha}^{\top}\mathbf{u}_{\beta}|^{k}\right)\left[\sum_{\alpha,\beta=1}^{n}\zeta_{k}(\sigma_{\alpha})^{2}\zeta_{k}(\sigma_{\beta})^{2}\right]^{\frac{1}{2}}
\displaystyle\leq~{} (supαβ|𝐮α𝐮β|4)α=1nk=0ζk(σα)2\displaystyle\left(\sup_{\alpha\neq\beta}|\mathbf{u}_{\alpha}^{\top}\mathbf{u}_{\beta}|^{4}\right)\sum_{\alpha=1}^{n}\sum_{k=0}^{\infty}\zeta_{k}(\sigma_{\alpha})^{2}
(5.10) \displaystyle\leq~{} n(supαβ|𝐱α𝐱β|4𝐱α4𝐱β4)sup1αnσαL22Cnεn4,\displaystyle n\cdot\left(\sup_{\alpha\neq\beta}\frac{|\mathbf{x}_{\alpha}^{\top}\mathbf{x}_{\beta}|^{4}}{\|\mathbf{x}_{\alpha}\|^{4}\|\mathbf{x}_{\beta}\|^{4}}\right)\sup_{1\leq\alpha\leq n}\|\sigma_{\alpha}\|_{L^{2}}^{2}\leq Cn\cdot\varepsilon_{n}^{4},

when nn is sufficiently large.

(2) Approximation for T0T_{0}. Recall 𝔼[σ(ξ)]=0\mathbb{E}[\sigma(\xi)]=0 and by Gaussian integration by part,

𝔼[σ(ξ)ξ]=𝔼[ξ0ξσ(x)x𝑑x]=𝔼[ξ2σ(ξ)]𝔼[ξ0ξσ(x)𝑑x]=𝔼[ξ2σ(ξ)]𝔼[σ(ξ)].\mathbb{E}[\sigma^{\prime}(\xi)\xi]=\mathbb{E}[\xi\int_{0}^{\xi}\sigma^{\prime}(x)xdx]=\mathbb{E}[\xi^{2}\sigma(\xi)]-\mathbb{E}[\xi\int_{0}^{\xi}\sigma(x)dx]=\mathbb{E}[\xi^{2}\sigma(\xi)]-\mathbb{E}[\sigma(\xi)].

Then, we have

𝔼[σ(ξ)ξ]=𝔼[(ξ21)σ(ξ)]=𝔼[2h2(ξ)σ(ξ)]=2ζ2(σ).\displaystyle\mathbb{E}[\sigma^{\prime}(\xi)\xi]=\mathbb{E}[(\xi^{2}-1)\sigma(\xi)]=\mathbb{E}[\sqrt{2}h_{2}(\xi)\sigma(\xi)]=\sqrt{2}\zeta_{2}(\sigma).

If σ\sigma is twice differentiable, then 𝔼[σ′′(ξ)]=2ζ2(σ)\mathbb{E}[\sigma^{\prime\prime}(\xi)]=\sqrt{2}\zeta_{2}(\sigma) as well.
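As a quick numerical sanity check of the identity \mathbb{E}[\sigma^{\prime}(\xi)\xi]=\sqrt{2}\zeta_{2}(\sigma)=\mathbb{E}[(\xi^{2}-1)\sigma(\xi)], the following short sketch (ours, with an arbitrary smooth activation) compares the two Monte Carlo estimates.

```python
import numpy as np

rng = np.random.default_rng(3)
xi = rng.standard_normal(2_000_000)
softplus = lambda x: np.log1p(np.exp(x))        # a smooth activation, for illustration only
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))    # its derivative

lhs = np.mean(sigmoid(xi) * xi)                 # E[sigma'(xi) xi]
rhs = np.mean((xi**2 - 1.0) * softplus(xi))     # E[(xi^2 - 1) sigma(xi)] = sqrt(2) zeta_2(sigma)
print(lhs, rhs)                                 # the two estimates agree up to Monte Carlo error
```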

Thus, taking t=𝐱αt=\|\mathbf{x}_{\alpha}\| in (5.5) and (5.7) implies that for any 1αn,1\leq\alpha\leq n,

(5.11) |ζ0(σα)2ζ2(σ)(𝐱α1)|Cεn2.\left|\zeta_{0}(\sigma_{\alpha})-\sqrt{2}\zeta_{2}(\sigma)(\|\mathbf{x}_{\alpha}\|-1)\right|\leq C\varepsilon_{n}^{2}.

Define 𝝂:=(ζ0(σ1),,ζ0(σn))\boldsymbol{\nu}^{\top}:=(\zeta_{0}(\sigma_{1}),\ldots,\zeta_{0}(\sigma_{n})), then T0=𝝂𝝂T_{0}=\boldsymbol{\nu}\boldsymbol{\nu}^{\top}. Recall the definition of 𝝁\boldsymbol{\mu} in (1.13). Then, (5.11) ensures that

𝝁𝝂Cnεn2.\|\boldsymbol{\mu}-\boldsymbol{\nu}\|\leq C\sqrt{n}\varepsilon_{n}^{2}.

Applying the (εn,B)(\varepsilon_{n},B)-orthonormal property of 𝐱α\mathbf{x}_{\alpha} yields

(5.12) 𝝁2=2ζ2(σ)2α=1n(𝐱α1)22ζ2(σ)2α=1n(𝐱α21)22B2ζ2(σ)2.\|\boldsymbol{\mu}\|^{2}=2\zeta_{2}(\sigma)^{2}\sum_{\alpha=1}^{n}(\|\mathbf{x}_{\alpha}\|-1)^{2}\leq 2\zeta_{2}(\sigma)^{2}\sum_{\alpha=1}^{n}(\|\mathbf{x}_{\alpha}\|^{2}-1)^{2}\leq 2B^{2}\zeta_{2}(\sigma)^{2}.

Hence the difference between T0T_{0} and 𝝁𝝁\boldsymbol{\mu}\boldsymbol{\mu}^{\top} is controlled by

(5.13) T0𝝁𝝁𝝁𝝂(2𝝁+𝝂𝝁)Cnεn2.\|T_{0}-\boldsymbol{\mu}\boldsymbol{\mu}^{\top}\|\leq\|\boldsymbol{\mu}-\boldsymbol{\nu}\|\left(2\|\boldsymbol{\mu}\|+\|\boldsymbol{\nu}-\boldsymbol{\mu}\|\right)\leq C\sqrt{n}\varepsilon_{n}^{2}.

(3) Approximation for TkT_{k} for k=1,2,3k=1,2,3. For 0k30\leq k\leq 3, Assumption 1.4 and (5.6) show that

|ζk(σα)/𝐱αkζk(σ)|\displaystyle\left|\zeta_{k}(\sigma_{\alpha})/\|\mathbf{x}_{\alpha}\|^{k}-\zeta_{k}(\sigma)\right|\leq~{} 1𝐱αk[|ζk(σα)ζk(σ)|+|ζk(σ)||𝐱αk1|]\displaystyle\frac{1}{\|\mathbf{x}_{\alpha}\|^{k}}\left[\left|\zeta_{k}(\sigma_{\alpha})-\zeta_{k}(\sigma)\right|+|\zeta_{k}(\sigma)|\cdot|\|\mathbf{x}_{\alpha}\|^{k}-1|\right]
(5.14) \displaystyle\leq~{} Cεn+C1|𝐱α1|(1εn)kC2εn,\displaystyle\frac{C\varepsilon_{n}+C_{1}|\|\mathbf{x}_{\alpha}\|-1|}{(1-\varepsilon_{n})^{k}}\leq C_{2}\varepsilon_{n},

when n is sufficiently large. Notice that T_{k}=D_{k}f_{k}(X^{\top}X)D_{k}, where D_{k}:=\operatorname{diag}\left(\zeta_{k}(\sigma_{\alpha})/\|\mathbf{x}_{\alpha}\|^{k}\right)_{\alpha\in[n]} is a diagonal matrix. Hence, by (5.14),

Dkζk(σ)IdC2εn.\|D_{k}-\zeta_{k}(\sigma)\operatorname{Id}\|\leq C_{2}\varepsilon_{n}.

And for k=1,2,3,k=1,2,3, by the triangle inequality,

Tkζk(σ)2fk(XX)=Dkfk(XX)Dkζk(σ)2fk(XX)\displaystyle\|T_{k}-\zeta_{k}(\sigma)^{2}f_{k}(X^{\top}X)\|=\|D_{k}f_{k}(X^{\top}X)D_{k}-\zeta_{k}(\sigma)^{2}f_{k}(X^{\top}X)\|
\displaystyle\leq~{} Dkζk(σ)Idfk(XX)(|ζk(σ)|+Dkζk(σ)Id)Cεnfk(XX).\displaystyle\|D_{k}-\zeta_{k}(\sigma)\operatorname{Id}\|\cdot\|f_{k}(X^{\top}X)\|(|\zeta_{k}(\sigma)|+\|D_{k}-\zeta_{k}(\sigma)\operatorname{Id}\|)\leq C\varepsilon_{n}\|f_{k}(X^{\top}X)\|.

When k=1k=1, f1(XX)=XXf_{1}(X^{\top}X)=X^{\top}X and XXX2B2\|X^{\top}X\|\leq\|X\|^{2}\leq B^{2}. When k=2k=2,

f2(XX)=(XX)(XX).f_{2}(X^{\top}X)=(X^{\top}X)\odot(X^{\top}X).

From Lemma A.1 in Appendix A, we have that

(5.15) f2(XX)max1α,βn|𝐱α𝐱β|X2B2(1+εn).\|f_{2}(X^{\top}X)\|\leq\max_{1\leq\alpha,\beta\leq n}|\mathbf{x}_{\alpha}^{\top}\mathbf{x}_{\beta}|\cdot\|X\|^{2}\leq B^{2}(1+\varepsilon_{n}).

So the left-hand side of (5.15) is bounded. Analogously, we can verify f3(XX)\|f_{3}(X^{\top}X)\| is also bounded. Therefore, we have

(5.16) Tkζk(σ)2fk(XX)Cεn,\|T_{k}-\zeta_{k}(\sigma)^{2}f_{k}(X^{\top}X)\|\leq C\varepsilon_{n},

for some constant CC and k=1,2,3k=1,2,3 when nn is sufficiently large.

(4) Approximation for k4diag(Tk)\sum_{k\geq 4}\operatorname{diag}(T_{k}). Since 𝐮α𝐮α=1\mathbf{u}_{\alpha}^{\top}\mathbf{u}_{\alpha}=1, we know

\sum_{k\geq 4}\operatorname{diag}(T_{k})=\operatorname{diag}\left(\sum_{k\geq 4}\zeta_{k}(\sigma_{\alpha})^{2}\right)_{\alpha\in[n]}=\operatorname{diag}\left(\|\sigma_{\alpha}\|_{L^{2}}^{2}-\sum_{k=0}^{3}\zeta_{k}(\sigma_{\alpha})^{2}\right)_{\alpha\in[n]}.

First, by (5.8) and (5.9), we can claim that

|σαL221|=|σαL22σL22|CσασL2Cεn.\displaystyle|\|\sigma_{\alpha}\|_{L^{2}}^{2}-1|=|\|\sigma_{\alpha}\|_{L^{2}}^{2}-\|\sigma\|_{L^{2}}^{2}|\leq C\|\sigma_{\alpha}-\sigma\|_{L^{2}}\leq C\varepsilon_{n}.

Second, in terms of (5.14), we obtain

|ζk(σα)2ζk(σ)2|C|ζk(σα)ζk(σ)|Cεn,|\zeta_{k}(\sigma_{\alpha})^{2}-\zeta_{k}(\sigma)^{2}|\leq C|\zeta_{k}(\sigma_{\alpha})-\zeta_{k}(\sigma)|\leq C\varepsilon_{n},

for k=1,2k=1,2 and 33. Combining these together, we conclude that

k4diag(Tk)(1ζ1(σ)2ζ2(σ)2ζ3(σ)2)Id\displaystyle\left\|\sum_{k\geq 4}\operatorname{diag}(T_{k})-(1-\zeta_{1}(\sigma)^{2}-\zeta_{2}(\sigma)^{2}-\zeta_{3}(\sigma)^{2})\operatorname{Id}\right\|
(5.17) \displaystyle\leq\quad\max_{1\leq\alpha\leq n}\left|(\|\sigma_{\alpha}\|_{L^{2}}^{2}-1)-\sum_{k=0}^{3}(\zeta_{k}(\sigma_{\alpha})^{2}-\zeta_{k}(\sigma)^{2})\right|\leq C\varepsilon_{n}.

Recall

Φ0=𝝁𝝁+k=13ζk(σ)2fk(XX)+(1ζ1(σ)2ζ2(σ)2ζ3(σ)2)Id.\Phi_{0}=\boldsymbol{\mu}\boldsymbol{\mu}^{\top}+\sum_{k=1}^{3}\zeta_{k}(\sigma)^{2}f_{k}(X^{\top}X)+(1-\zeta_{1}(\sigma)^{2}-\zeta_{2}(\sigma)^{2}-\zeta_{3}(\sigma)^{2})\operatorname{Id}.

Combining the approximations (5.10), (5.13), (5.16), and (5.17), we finally obtain

(5.18) ΦΦ0C(εn+nεn2+nεn4)Cnεn2,\displaystyle\|\Phi-\Phi_{0}\|\leq C\left(\varepsilon_{n}+\sqrt{n}\varepsilon_{n}^{2}+n\varepsilon_{n}^{4}\right)\leq C\sqrt{n}\varepsilon_{n}^{2},

for some constant C>0, since \sqrt{n}\varepsilon_{n}^{2}\to 0. The spectral norm bound on \Phi follows directly from the spectral norm bound on \Phi_{0}, which is based on (5.12) and (5.15), together with (5.18). ∎

Remark 5.3 (Optimality of εn\varepsilon_{n}).

For general deterministic data X, our pairwise orthogonality assumption with rate n\varepsilon_{n}^{4}=o(1) is optimal for the approximation of \Phi by \Phi_{0} in the spectral norm. If we relax the decay rate of \varepsilon_{n} in Assumption 1.4, the above approximation may require including higher-degree terms f_{k}(X^{\top}X), k\geq 4, in \Phi_{0}, which would invalidate some of our subsequent results and simplifications. After the initial completion of this paper, this weaker regime was considered in our follow-up work [WZ22].

Next, we provide an additional estimate for \Phi, this time in the Frobenius norm, to further simplify the limiting spectral distribution of \Phi.

Lemma 5.4.

If Assumptions 1.2 and 1.4 hold, then Φ\Phi has the same limiting spectrum as bσ2XX+(1bσ2)Idb_{\sigma}^{2}X^{\top}X+(1-b_{\sigma}^{2})\operatorname{Id} when nn\to\infty, i.e.

limspecΦ=limspec(bσ2XX+(1bσ2)Id)=bσ2μ0+(1bσ2).\operatorname{lim\,spec}\Phi=\operatorname{lim\,spec}\left(b_{\sigma}^{2}X^{\top}X+(1-b_{\sigma}^{2})\operatorname{Id}\right)=b_{\sigma}^{2}\mu_{0}+(1-b_{\sigma}^{2}).
Proof.

By the definition of b_{\sigma}, we know that b_{\sigma}=\zeta_{1}(\sigma). As a direct consequence of Lemma 5.2, the limiting spectrum of \Phi is identical to the limiting spectrum of \Phi_{0}. To prove this lemma, it suffices to control the Frobenius norm of the difference between \Phi_{0} and \zeta_{1}(\sigma)^{2}X^{\top}X+(1-\zeta_{1}(\sigma)^{2})\operatorname{Id}. Notice that

Φ0ζ1(σ)2XX(1ζ1(σ)2)Id\displaystyle\Phi_{0}-\zeta_{1}(\sigma)^{2}X^{\top}X-(1-\zeta_{1}(\sigma)^{2})\operatorname{Id}
=\displaystyle=~{} 𝝁𝝁+ζ2(σ)2f2(XX)+ζ3(σ)2f3(XX)(ζ2(σ)2+ζ3(σ)2)Id.\displaystyle\boldsymbol{\mu}\boldsymbol{\mu}^{\top}+\zeta_{2}(\sigma)^{2}f_{2}(X^{\top}X)+\zeta_{3}(\sigma)^{2}f_{3}(X^{\top}X)-(\zeta_{2}(\sigma)^{2}+\zeta_{3}(\sigma)^{2})\operatorname{Id}.

By the definition of vector 𝝁\boldsymbol{\mu} and the assumption of XX, we have

(5.19) 𝝁𝝁F=𝝁2=2ζ22(σ)α=1n(𝐱α1)22ζ22(σ)B2.\|\boldsymbol{\mu}\boldsymbol{\mu}^{\top}\|_{F}=\|\boldsymbol{\mu}\|^{2}=2\zeta_{2}^{2}(\sigma)\sum_{\alpha=1}^{n}(\|\mathbf{x}_{\alpha}\|-1)^{2}\leq 2\zeta_{2}^{2}(\sigma)B^{2}.

For k=2,3k=2,3, the Frobenius norm can be controlled by

fk(XX)IdF2=\displaystyle\|f_{k}(X^{\top}X)-\operatorname{Id}\|_{F}^{2}=~{} α,β=1n((𝐱α𝐱β)kδαβ)2\displaystyle\sum_{\alpha,\beta=1}^{n}\left((\mathbf{x}_{\alpha}^{\top}\mathbf{x}_{\beta})^{k}-\delta_{\alpha\beta}\right)^{2}
\displaystyle\leq~{} n(n1)εn2k+α=1n(𝐱α2k1)2n2εn2k+Cnεn2.\displaystyle n(n-1)\varepsilon_{n}^{2k}+\sum_{\alpha=1}^{n}(\|\mathbf{x}_{\alpha}\|^{2k}-1)^{2}\leq n^{2}\varepsilon_{n}^{2k}+Cn\varepsilon_{n}^{2}.

Hence, as nn\to\infty, we have

1n𝝁𝝁F2,1nfk(XX)IdF20, for k=2,3,\frac{1}{n}\|\boldsymbol{\mu}\boldsymbol{\mu}^{\top}\|_{F}^{2},\quad\frac{1}{n}\|f_{k}(X^{\top}X)-\operatorname{Id}\|_{F}^{2}\to 0,~{}\text{ for }~{}k=2,3,

as nεn40n\varepsilon_{n}^{4}\to 0. Then we conclude that

1nΦ0ζ1(σ)2XX(1ζ1(σ)2)IdF2C(nεn4+εn2)0.\frac{1}{n}\|\Phi_{0}-\zeta_{1}(\sigma)^{2}X^{\top}X-(1-\zeta_{1}(\sigma)^{2})\operatorname{Id}\|_{F}^{2}\leq C(n\varepsilon_{n}^{4}+\varepsilon_{n}^{2})\to 0.

Hence, limspecΦ\operatorname{lim\,spec}\Phi is the same as limspec(ζ1(σ)2XX+(1ζ1(σ)2)Id)\operatorname{lim\,spec}(\zeta_{1}(\sigma)^{2}X^{\top}X+(1-\zeta_{1}(\sigma)^{2})\operatorname{Id}) when n,n\to\infty, due to Lemma A.7 in Appendix A. ∎
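As a numerical illustration of the mechanism behind Lemma 5.4 (a sanity-check sketch only; the nearly orthonormal X is generated at random here), the normalized Frobenius norms \frac{1}{n}\|f_{k}(X^{\top}X)-\operatorname{Id}\|_{F}^{2} for k=2,3 are indeed negligible, so only the linear term X^{\top}X contributes to the limiting spectrum.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d0 = 300, 30_000                          # d0 >> n: columns of X are nearly orthonormal
X = rng.standard_normal((d0, n)) / np.sqrt(d0)

G = X.T @ X
for k in (2, 3):
    Fk = G**k                                # Hadamard power f_k(X^T X)
    # (1/n) ||f_k(X^T X) - Id||_F^2 is tiny, consistent with n * eps_n^4 = o(1)
    print(k, np.linalg.norm(Fk - np.eye(n), 'fro')**2 / n)
```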

Moreover, the proof of Lemma 5.4 can be modified to prove (2.32), so we omit its proof. Now, based on Corollary 3.7, Proposition 5.1, Lemma 5.2, and Lemma 5.4, applying Theorem 4.1 for general sample covariance matrices, we can finish the proof of Theorem 2.1.

Proof of Theorem 2.1.

Based on Corollary 3.7 and Proposition 5.1, we can verify the conditions (4.1) and (4.2) in Theorem 4.1. By Lemma 5.2 and Lemma 5.4, we know that the limiting eigenvalue distributions of Φ\Phi and (1bσ2)Id+bσ2XX(1-b_{\sigma}^{2})\operatorname{Id}+b_{\sigma}^{2}X^{\top}X are identical and Φ\|\Phi\| is uniformly bounded. So the limiting eigenvalue distribution of Φ\Phi denoted by μΦ\mu_{\Phi} is just (1bσ2)+bσ2μ0(1-b_{\sigma}^{2})+b_{\sigma}^{2}\mu_{0}. Hence, the first conclusion of Theorem 2.1 follows from Theorem 4.1.

For the second part of this theorem, we consider the difference

1n1d1n(YY𝔼[YY])1d1n(YYd1Φ0)F2\displaystyle\frac{1}{n}\left\|\frac{1}{\sqrt{d_{1}n}}\left(Y^{\top}Y-\mathbb{E}[Y^{\top}Y]\right)-\frac{1}{\sqrt{d_{1}n}}\left(Y^{\top}Y-d_{1}\Phi_{0}\right)\right\|^{2}_{F}
\displaystyle\leq~{} d1n2ΦΦ0F2d1nΦΦ02d1εn40,\displaystyle\frac{d_{1}}{n^{2}}\|\Phi-\Phi_{0}\|_{F}^{2}\leq\frac{d_{1}}{n}\|\Phi-\Phi_{0}\|^{2}\leq d_{1}\varepsilon_{n}^{4}\to 0,

where we employ Lemma 5.2 and the assumption d1εn4=o(1)d_{1}\varepsilon_{n}^{4}=o(1). Thus, because of Lemma A.7, 1d1n(YYd1Φ0)\frac{1}{\sqrt{d_{1}n}}\left(Y^{\top}Y-d_{1}\Phi_{0}\right) has the same limiting eigenvalue distribution as (2.1), μs((1bσ2)+bσ2μ0)\mu_{s}\boxtimes((1-b_{\sigma}^{2})+b_{\sigma}^{2}\mu_{0}). This finishes the proof of Theorem 2.1. ∎
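The following simulation is a sanity check of Theorem 2.1 in a simple special case (it is not used in any proof). The activation is a centered, normalized absolute value, chosen only because its expected kernel \Phi then has a closed form; since this activation is even, b_{\sigma}=0, and for nearly orthonormal random X one expects the spectrum of \frac{1}{\sqrt{d_{1}n}}(Y^{\top}Y-d_{1}\Phi) to be approximately semicircular, with bulk edges near \pm 2 under the standard normalization.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d0, d1, chunk = 300, 6_000, 30_000, 2_000   # ultra-wide: d1/n = 100, n/d0 small

X = rng.standard_normal((d0, n)) / np.sqrt(d0)
c2 = 1 - 2 / np.pi
sigma = lambda t: (np.abs(t) - np.sqrt(2 / np.pi)) / np.sqrt(c2)   # E sigma(xi)=0, E sigma(xi)^2=1

# Closed-form Phi for this activation, using E|uv| = (2/pi)(sqrt(1-rho^2) + rho*arcsin(rho))
norms = np.linalg.norm(X, axis=0)
rho = np.clip((X.T @ X) / np.outer(norms, norms), -1.0, 1.0)
E_abs = np.outer(norms, norms) * (2 / np.pi) * (np.sqrt(1 - rho**2) + rho * np.arcsin(rho))
Phi = (E_abs - (2 / np.pi) * (norms[:, None] + norms[None, :]) + 2 / np.pi) / c2

YtY = np.zeros((n, n))
for _ in range(d1 // chunk):                   # accumulate Y^T Y without storing all of W
    Yc = sigma(rng.standard_normal((chunk, d0)) @ X)
    YtY += Yc.T @ Yc

eigs = np.linalg.eigvalsh((YtY - d1 * Phi) / np.sqrt(d1 * n))
print(eigs.min(), eigs.max())                  # roughly -2 and 2: the semicircle bulk edges
```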


Next, we move on to study the empirical NTK and its limiting eigenvalue distribution. As before, we first verify that the empirical NTK concentrates around its expectation and then approximate this expectation by a deterministic matrix that depends only on the input data matrix X and the nonlinear activation \sigma. The following lemma can be obtained from (2.12) in Theorem 2.7.

Lemma 5.5.

Suppose that Assumption 1.1 holds, supx|σ(x)|λσ\sup_{x\in\mathbb{R}}|\sigma^{\prime}(x)|\leq\lambda_{\sigma} and XB\|X\|\leq B. Then if d1=ω(logn)d_{1}=\omega(\log n), we have

(5.20) 1d1(SS)(XX)𝔼[(SS)(XX)]0,\displaystyle\frac{1}{d_{1}}\left\|(S^{\top}S)\odot(X^{\top}X)-\mathbb{E}[(S^{\top}S)\odot(X^{\top}X)]\right\|\to 0,

almost surely as n,d0,d1n,d_{0},d_{1}\to\infty. Moreover, if d1/nd_{1}/n\to\infty as nn\to\infty, then almost surely

(5.21) 1nd1(SS)(XX)𝔼[(SS)(XX)]0.\frac{1}{\sqrt{nd_{1}}}\left\|(S^{\top}S)\odot(X^{\top}X)-\mathbb{E}[(S^{\top}S)\odot(X^{\top}X)]\right\|\to 0.
Lemma 5.6.

Suppose XX is (εn,B)(\varepsilon_{n},B)-orthonormal. Under Assumption 1.2, we have

(5.22) ΨΨ0CBεn4n,\|\Psi-\Psi_{0}\|\leq C_{B}\varepsilon_{n}^{4}n,

where Ψ\Psi and Ψ0\Psi_{0} are defined in (2.6) and (2.7), respectively, and CBC_{B} is a constant depending on BB.

Proof.

We can directly apply methods in the proof of Lemma 5.2. Notice that (1.7) and (1.9) imply

𝔼[SS]=d1𝔼[σ(𝒘X)σ(𝒘X)],\mathbb{E}[S^{\top}S]=d_{1}\mathbb{E}[\sigma^{\prime}(\boldsymbol{w}^{\top}X)^{\top}\sigma^{\prime}(\boldsymbol{w}^{\top}X)],

for any standard Gaussian random vector 𝒘𝒩(0,Id)\boldsymbol{w}\sim\mathcal{N}(0,\operatorname{Id}). Recall that (2.8) defines the kk-th coefficient of Hermite expansion of σ(x)\sigma^{\prime}(x) by ηk(σ)\eta_{k}(\sigma) for any k.k\in\mathbb{N}. Then, Assumption 1.2 indicates bσ=η0(σ)b_{\sigma}=\eta_{0}(\sigma) and aσ=k=0ηk2(σ)a_{\sigma}=\sum_{k=0}^{\infty}\eta_{k}^{2}(\sigma). For 1αn1\leq\alpha\leq n, we introduce ϕα(x):=σ(𝐱αx)\phi_{\alpha}(x):=\sigma^{\prime}(\|\mathbf{x}_{\alpha}\|x) and the Hermite expansion of this function as

ϕα(x)=k=0ζk(ϕα)hk(x),\phi_{\alpha}(x)=\sum_{k=0}^{\infty}\zeta_{k}(\phi_{\alpha})h_{k}(x),

where the coefficient \zeta_{k}(\phi_{\alpha})=\mathbb{E}[\phi_{\alpha}(\xi)h_{k}(\xi)]. Let \mathbf{u}_{\alpha}=\mathbf{x}_{\alpha}/\|\mathbf{x}_{\alpha}\| for 1\leq\alpha\leq n. Then, for 1\leq\alpha,\beta\leq n, the (\alpha,\beta)-entry of \Psi is

\Psi_{\alpha\beta}=\mathbb{E}[\phi_{\alpha}(\xi_{\alpha})\phi_{\beta}(\xi_{\beta})]\cdot(\mathbf{x}_{\alpha}^{\top}\mathbf{x}_{\beta}),

where (ξα,ξβ)=(𝒘𝐮α,𝒘𝐮β)(\xi_{\alpha},\xi_{\beta})=(\boldsymbol{w}^{\top}\mathbf{u}_{\alpha},\boldsymbol{w}^{\top}\mathbf{u}_{\beta}) is a Gaussian random vector with mean zero and covariance (5.2). Following the derivation of formula (5.3), we obtain

(5.23) Ψαβ=k=0ζk(ϕα)ζk(ϕβ)𝐱αk𝐱βk(𝐱α𝐱β)k+1.\Psi_{\alpha\beta}=\sum_{k=0}^{\infty}\frac{\zeta_{k}(\phi_{\alpha})\zeta_{k}(\phi_{\beta})}{\|\mathbf{x}_{\alpha}\|^{k}\|\mathbf{x}_{\beta}\|^{k}}(\mathbf{x}_{\alpha}^{\top}\mathbf{x}_{\beta})^{k+1}.

For any kk\in\mathbb{N}, let Tkn×nT_{k}\in\mathbb{R}^{n\times n} be an nn-by-nn matrix with (α,β)(\alpha,\beta) entry

(Tk)αβ:=ζk(ϕα)ζk(ϕβ)𝐱αk𝐱βk(𝐱α𝐱β)k+1.(T_{k})_{\alpha\beta}:=\frac{\zeta_{k}(\phi_{\alpha})\zeta_{k}(\phi_{\beta})}{\|\mathbf{x}_{\alpha}\|^{k}\|\mathbf{x}_{\beta}\|^{k}}(\mathbf{x}_{\alpha}^{\top}\mathbf{x}_{\beta})^{k+1}.

We can write Tk=Dkfk+1(XX)DkT_{k}=D_{k}f_{k+1}(X^{\top}X)D_{k} for any kk\in\mathbb{N}, where DkD_{k} is diag(ζk(ϕα)/𝐱αk)\operatorname{diag}(\zeta_{k}(\phi_{\alpha})/\|\mathbf{x}_{\alpha}\|^{k}). Then, adopting the proof of (5.16), we can similarly conclude that

Tkηk2(σ)fk+1(XX)Cεn,\|T_{k}-\eta_{k}^{2}(\sigma)f_{k+1}(X^{\top}X)\|\leq C\varepsilon_{n},

for some constant CC and k=0,1,2k=0,1,2, when nn is sufficiently large. Likewise, (5.10) indicates

k3(Tkdiag(Tk))Cεn4n,\left\|\sum_{k\geq 3}(T_{k}-\operatorname{diag}(T_{k}))\right\|\leq C\varepsilon_{n}^{4}n,

and a similar proof of (5.17) implies that

k3diag(Tk)(aσk=02ηk2(σ))IdCεn.\left\|\sum_{k\geq 3}\operatorname{diag}(T_{k})-\left(a_{\sigma}-\sum_{k=0}^{2}\eta_{k}^{2}(\sigma)\right)\operatorname{Id}\right\|\leq C\varepsilon_{n}.

Based on these approximations, we can conclude the result of this lemma. ∎

Proof of Theorem 2.2.

The first part of the statement is a direct consequence of (5.21) and Theorem 2.1. Denote A:=\sqrt{\frac{d_{1}}{n}}\left(H-\mathbb{E}[H]\right) and B:=\sqrt{\frac{d_{1}}{n}}\left(\frac{1}{d_{1}}Y^{\top}Y-\Phi\right). Observe that

BA=1nd1[(SS)(XX)𝔼[(SS)(XX)]].B-A=\frac{1}{\sqrt{nd_{1}}}\left[(S^{\top}S)\odot(X^{\top}X)-\mathbb{E}[(S^{\top}S)\odot(X^{\top}X)]\right].

Hence, (5.21) implies that \|B-A\|\to 0 as n\to\infty, and therefore the limiting laws of A and B are identical by Lemma A.3.

The second part follows from Lemma 5.2 and Lemma 5.6. From (1.8) and (2.6), \mathbb{E}[H]=\Phi+\Psi. Then, almost surely,

d1n(H𝔼[H])d1n(HΦ0Ψ0)=d1nΦ0+Ψ0𝔼[H]\displaystyle\left\|\sqrt{\frac{d_{1}}{n}}\left(H-\mathbb{E}[H]\right)-\sqrt{\frac{d_{1}}{n}}\left(H-\Phi_{0}-\Psi_{0}\right)\right\|=\sqrt{\frac{d_{1}}{n}}\left\|\Phi_{0}+\Psi_{0}-\mathbb{E}[H]\right\|
\displaystyle\leq~{} d1n(ΦΦ0+ΨΨ0)d1n(nεn2+nεn4)0,\displaystyle\sqrt{\frac{d_{1}}{n}}\left(\|\Phi-\Phi_{0}\|+\|\Psi-\Psi_{0}\|\right)\leq\sqrt{\frac{d_{1}}{n}}\left(\sqrt{n}\varepsilon_{n}^{2}+n\varepsilon_{n}^{4}\right)\to 0,

as εn4d10\varepsilon_{n}^{4}d_{1}\to 0 by the assumption of Theorem 2.2. Therefore, the limiting eigenvalue distribution of (2.10) is the same as (2.9). ∎

6. Proof of the concentration for extreme eigenvalues

In this section, we obtain estimates of the extreme eigenvalues of the CK and NTK studied in Section 5. The limiting spectral distribution of \frac{1}{\sqrt{d_{1}n}}(Y^{\top}Y-\mathbb{E}[Y^{\top}Y]) describes the bulk behavior of the spectrum, while an estimate of the extreme eigenvalues shows that the eigenvalues are confined to a finite interval with high probability. We first provide a non-asymptotic bound on the concentration of \frac{1}{d_{1}}Y^{\top}Y in the spectral norm. The proof is based on the Hanson-Wright inequality proved in Section 3 and an \varepsilon-net argument.

Proof of Theorem 2.3.

Recall notations in Section 1. Define

M:=\displaystyle M:=~{} 1d1nYY=1d1ni=1d1𝐲i𝐲i,\displaystyle\frac{1}{\sqrt{d_{1}n}}Y^{\top}Y=\frac{1}{\sqrt{d_{1}n}}\sum_{i=1}^{d_{1}}\mathbf{y}_{i}\mathbf{y}_{i}^{\top},
M𝔼M=\displaystyle M-\mathbb{E}M=~{} 1d1ni=1d1(𝐲i𝐲i𝔼[𝐲i𝐲i])=1d1ni=1d1(𝐲i𝐲iΦ),\displaystyle\frac{1}{\sqrt{d_{1}n}}\sum_{i=1}^{d_{1}}(\mathbf{y}_{i}\mathbf{y}_{i}^{\top}-\mathbb{E}[\mathbf{y}_{i}\mathbf{y}_{i}^{\top}])=\frac{1}{\sqrt{d_{1}n}}\sum_{i=1}^{d_{1}}(\mathbf{y}_{i}\mathbf{y}_{i}^{\top}-\Phi),

where 𝐲i=σ(𝒘iX)\mathbf{y}_{i}^{\top}=\sigma(\boldsymbol{w}_{i}^{\top}X).

For any fixed 𝐳𝕊n1\mathbf{z}\in\mathbb{S}^{n-1}, we have

𝐳(M𝔼M)𝐳\displaystyle\mathbf{z}^{\top}(M-\mathbb{E}M)\mathbf{z} =1d1ni=1d1[𝐳,𝐲i2𝐳Φ𝐳]\displaystyle=\frac{1}{\sqrt{d_{1}n}}\sum_{i=1}^{d_{1}}[\langle\mathbf{z},\mathbf{y}_{i}\rangle^{2}-\mathbf{z}^{\top}\Phi\mathbf{z}]
=1d1ni=1d1[𝐲i(𝐳𝐳)𝐲iTr(Φ𝐳𝐳)]\displaystyle=\frac{1}{\sqrt{d_{1}n}}\sum_{i=1}^{d_{1}}[\mathbf{y}_{i}^{\top}(\mathbf{z}\mathbf{z}^{\top})\mathbf{y}_{i}-\operatorname{Tr}(\Phi\mathbf{z}\mathbf{z}^{\top})]
(6.1) =(𝐲1,,𝐲d1)A𝐳(𝐲1,,𝐲d1)Tr(A𝐳Φ~),\displaystyle=(\mathbf{y}_{1},\dots,\mathbf{y}_{d_{1}})^{\top}A_{\mathbf{z}}(\mathbf{y}_{1},\dots,\mathbf{y}_{d_{1}})-\operatorname{Tr}(A_{\mathbf{z}}\tilde{\Phi}),

where

A𝐳=1d1n[𝐳𝐳𝐳𝐳]nd1×nd1,Φ~=[ΦΦ]nd1×nd1,\displaystyle A_{\mathbf{z}}=\frac{1}{\sqrt{d_{1}n}}\begin{bmatrix}\mathbf{z}\mathbf{z}^{\top}&&\\ &\ddots&\\ &&\mathbf{z}\mathbf{z}^{\top}\end{bmatrix}\in\mathbb{R}^{nd_{1}\times nd_{1}},\quad\tilde{\Phi}=\begin{bmatrix}\Phi&&\\ &\ddots&\\ &&\Phi\end{bmatrix}\in\mathbb{R}^{nd_{1}\times nd_{1}},

and column vector (𝐲1,,𝐲d1)nd1(\mathbf{y}_{1},\dots,\mathbf{y}_{d_{1}})\in\mathbb{R}^{nd_{1}} is the concatenation of column vectors 𝐲1,,𝐲d1\mathbf{y}_{1},\dots,\mathbf{y}_{d_{1}}. Then

(𝐲1,,𝐲d1)=σ((𝒘1,,𝒘d1)X~)\displaystyle(\mathbf{y}_{1},\dots,\mathbf{y}_{d_{1}})^{\top}=\sigma((\boldsymbol{w}_{1},\dots,\boldsymbol{w}_{d_{1}})^{\top}\tilde{X})

with block matrix

X~=[XX].\displaystyle\tilde{X}=\begin{bmatrix}X&&\\ &\ddots&\\ &&X\end{bmatrix}.

Notice that

A𝐳=1d1n,A𝐳F=1n,X~=X.\displaystyle\|A_{\mathbf{z}}\|=\frac{1}{\sqrt{d_{1}n}},\quad\|A_{\mathbf{z}}\|_{F}=\frac{1}{\sqrt{n}},\quad\|\tilde{X}\|=\|X\|.

Denote 𝐲~=(𝐲1,,𝐲d1)\tilde{\mathbf{y}}=(\mathbf{y}_{1},\dots,\mathbf{y}_{d_{1}}). With (3.11), we obtain

𝔼𝐲~2\displaystyle\|\mathbb{E}\tilde{\mathbf{y}}\|^{2} =d1𝔼𝐲2d1(2λσ2i=1n(𝐱i21)2+2n(𝔼σ(ξ))2)\displaystyle=d_{1}\|\mathbb{E}\mathbf{y}\|^{2}\leq d_{1}\left(2\lambda_{\sigma}^{2}\sum_{i=1}^{n}(\|\mathbf{x}_{i}\|^{2}-1)^{2}+2n(\mathbb{E}\sigma(\xi))^{2}\right)
=d1(2λσ2i=1n(𝐱i21)2)2d1λσ2B2,\displaystyle=d_{1}\left(2\lambda_{\sigma}^{2}\sum_{i=1}^{n}(\|\mathbf{x}_{i}\|^{2}-1)^{2}\right)\leq 2d_{1}\lambda_{\sigma}^{2}B^{2},

where the last line is from the assumptions on XX and σ\sigma. When B0B\not=0, applying (3.8) to (6.1) implies

(|(𝐲1,,𝐲d1)A𝐳(𝐲1,,𝐲d1)Tr(A𝐳Φ~)|t)\displaystyle\mathbb{P}\left(|(\mathbf{y}_{1},\dots,\mathbf{y}_{d_{1}})^{\top}A_{\mathbf{z}}(\mathbf{y}_{1},\dots,\mathbf{y}_{d_{1}})-\operatorname{Tr}(A_{\mathbf{z}}\tilde{\Phi})|\geq t\right)
\displaystyle\leq~{} 2exp(1Cmin{t2n8λσ4X4,td1nλσ2X2})+2exp(t2d1n32λσ2X2𝔼𝐲~2)\displaystyle 2\exp\left(-\frac{1}{C}\min\left\{\frac{t^{2}n}{8\lambda_{\sigma}^{4}\|X\|^{4}},\frac{t\sqrt{d_{1}n}}{\lambda_{\sigma}^{2}\|X\|^{2}}\right\}\right)+2\exp\left(-\frac{t^{2}d_{1}n}{32\lambda_{\sigma}^{2}\|X\|^{2}\|\mathbb{E}\tilde{\mathbf{y}}\|^{2}}\right)
\displaystyle\leq~{} 2exp(1Cmin{t2n8λσ4X4,td1nλσ2X2})+2exp(t2n64λσ4B2X2).\displaystyle 2\exp\left(-\frac{1}{C}\min\left\{\frac{t^{2}n}{8\lambda_{\sigma}^{4}\|X\|^{4}},\frac{t\sqrt{d_{1}n}}{\lambda_{\sigma}^{2}\|X\|^{2}}\right\}\right)+2\exp\left(-\frac{t^{2}n}{64\lambda_{\sigma}^{4}B^{2}\|X\|^{2}}\right).

Let 𝒩\mathcal{N} be a 1/21/2-net on 𝕊n1\mathbb{S}^{n-1} with |𝒩|5n|\mathcal{N}|\leq 5^{n} (see e.g. [Ver18, Corollary 4.2.13]), then

M𝔼M2sup𝐳𝒩|𝐳(M𝔼M)𝐳|.\displaystyle\|M-\mathbb{E}M\|\leq 2\sup_{\mathbf{z}\in\mathcal{N}}|\mathbf{z}^{\top}(M-\mathbb{E}M)\mathbf{z}|.

Taking a union bound over 𝒩\mathcal{N} yields

(M𝔼M2t)\displaystyle\mathbb{P}(\|M-\mathbb{E}M\|\geq 2t)\leq~{} 2exp(nlog51Cmin{t2n16λσ4X4,td1n2λσ2X2})\displaystyle 2\exp\left(n\log 5-\frac{1}{C}\min\left\{\frac{t^{2}n}{16\lambda_{\sigma}^{4}\|X\|^{4}},\frac{t\sqrt{d_{1}n}}{2\lambda_{\sigma}^{2}\|X\|^{2}}\right\}\right)
+2exp(nlog5t2n64λσ4B2X2).\displaystyle+2\exp\left(n\log 5-\frac{t^{2}n}{64\lambda_{\sigma}^{4}B^{2}\|X\|^{2}}\right).

We then can set

t=(8C+8Cnd1)λσ2X2+16Bλσ2X,t=\left(8\sqrt{C}+8C\sqrt{\frac{n}{d_{1}}}\right)\lambda_{\sigma}^{2}\|X\|^{2}+16B\lambda_{\sigma}^{2}\|X\|,

to conclude

(M𝔼M(16C+16Cnd1)λσ2X2+32Bλσ2X)4e2n.\displaystyle\mathbb{P}\left(\|M-\mathbb{E}M\|\geq\left(16\sqrt{C}+16C\sqrt{\frac{n}{d_{1}}}\right)\lambda_{\sigma}^{2}\|X\|^{2}+32B\lambda_{\sigma}^{2}\|X\|\right)\leq 4e^{-2n}.

Since

1d1YYΦ=nd1M𝔼M,\displaystyle\left\|\frac{1}{d_{1}}Y^{\top}Y-\Phi\right\|=\sqrt{\frac{n}{d_{1}}}\|M-\mathbb{E}M\|,

the upper bound in (2.11) is then verified. When B=0B=0, we can apply (3.6) and follow the same steps to get the desired bound. ∎
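As a numerical sanity check of the rate in Theorem 2.3 (illustration only; the activation is again the centered absolute value, for which \Phi has a closed form), the deviation \|\frac{1}{d_{1}}Y^{\top}Y-\Phi\| in the spectral norm decays at roughly the \sqrt{n/d_{1}} rate as the width grows.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d0, chunk = 200, 4_000, 2_000
X = rng.standard_normal((d0, n)) / np.sqrt(d0)
c2 = 1 - 2 / np.pi
sigma = lambda t: (np.abs(t) - np.sqrt(2 / np.pi)) / np.sqrt(c2)

# Closed-form expected CK Phi for this activation
norms = np.linalg.norm(X, axis=0)
rho = np.clip((X.T @ X) / np.outer(norms, norms), -1.0, 1.0)
E_abs = np.outer(norms, norms) * (2 / np.pi) * (np.sqrt(1 - rho**2) + rho * np.arcsin(rho))
Phi = (E_abs - (2 / np.pi) * (norms[:, None] + norms[None, :]) + 2 / np.pi) / c2

for d1 in (2_000, 20_000, 100_000):
    YtY = np.zeros((n, n))
    for _ in range(d1 // chunk):
        Yc = sigma(rng.standard_normal((chunk, d0)) @ X)
        YtY += Yc.T @ Yc
    dev = np.linalg.norm(YtY / d1 - Phi, 2)          # spectral-norm deviation
    print(d1, dev, np.sqrt(n / d1))                  # deviation shrinks at the sqrt(n/d1) rate
```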

By the concentration inequality in Theorem 2.3, we can get a lower bound on the smallest eigenvalue of the conjugate kernel 1d1YY\frac{1}{d_{1}}Y^{\top}Y as follows.

Lemma 6.1.

Assume XX satisfies i=1n(𝐱i21)2B2\sum_{i=1}^{n}(\|\mathbf{x}_{i}\|^{2}-1)^{2}\leq B^{2} for a constant B>0B>0, and σ\sigma is λσ\lambda_{\sigma}-Lipschitz with 𝔼σ(ξ)=0\mathbb{E}\sigma(\xi)=0. Then with probability at least 14e2n1-4e^{-2n},

(6.2) λmin(1d1YY)λmin(Φ)C(nd1+nd1)λσ2X232Bλσ2Xnd1.\displaystyle\lambda_{\min}\left(\frac{1}{d_{1}}Y^{\top}Y\right)\geq\lambda_{\min}(\Phi)-C\left(\sqrt{\frac{n}{d_{1}}}+\frac{n}{d_{1}}\right)\lambda_{\sigma}^{2}\|X\|^{2}-32B\lambda_{\sigma}^{2}\|X\|\sqrt{\frac{n}{d_{1}}}.
Proof.

By Weyl’s inequality [AGZ10, Corollary A.6], we have

\displaystyle\left|\lambda_{\min}\left(\frac{1}{d_{1}}Y^{\top}Y\right)-\lambda_{\min}(\Phi)\right|\leq\left\|\frac{1}{d_{1}}Y^{\top}Y-\Phi\right\|.

Then (6.2) follows from (2.11). ∎

The lower bound in (6.2) relies on λmin(Φ)\lambda_{\min}(\Phi). Under certain assumptions on XX and σ\sigma, we can guarantee that λmin(Φ)\lambda_{\min}(\Phi) is bounded below by an absolute constant.

Lemma 6.2.

Assume σ\sigma is not a linear function and σ(x)\sigma(x) is Lipschitz. Then

(6.3) sup{k:ζk(σ)2>0}=.\displaystyle\sup\{k\in\mathbb{N}:\zeta_{k}(\sigma)^{2}>0\}=\infty.
Proof.

Suppose that \sup\{k\in\mathbb{N}:\zeta_{k}(\sigma)^{2}>0\} is finite. Then \sigma is a polynomial; since \sigma is not linear by assumption, its degree is at least 2, which contradicts the fact that \sigma is Lipschitz. Hence, (6.3) holds. ∎

Lemma 6.3.

Assume Assumption 1.2 holds, σ\sigma is not a linear function, and XX satisfies (εn,B)(\varepsilon_{n},B)-orthonormal property. Then,

(6.4) λmin(Φ)1ζ1(σ)2ζ2(σ)2ζ3(σ)2CBεn2n.\displaystyle\lambda_{\min}(\Phi)\geq 1-\zeta_{1}(\sigma)^{2}-\zeta_{2}(\sigma)^{2}-\zeta_{3}(\sigma)^{2}-C_{B}\varepsilon_{n}^{2}\sqrt{n}.
Remark 6.4.

This bound does not hold when \sigma is a linear function. Indeed, if \sigma is linear, then under Assumption 1.2 we must have \sigma(x)=x and \Phi=X^{\top}X, so there is no lower bound on \lambda_{\min}(\Phi) in terms of the Hermite coefficients of \sigma alone.

Proof of Lemma 6.3.

From Lemma 5.2, under our assumptions, we know that

ΦΦ0CBεn2n.\|\Phi-\Phi_{0}\|\leq C_{B}\varepsilon_{n}^{2}\sqrt{n}.

where Φ0\Phi_{0} is given by (1.13). Thus, λmin(Φ)λmin(Φ0)CBεn2n,\lambda_{\min}(\Phi)\geq\lambda_{\min}(\Phi_{0})-C_{B}\varepsilon_{n}^{2}\sqrt{n},

and, from Weyl’s inequality [AGZ10, Theorem A.5], we have

λmin(Φ0)k=13ζk(σ)2λmin(fk(XX))+(1ζ1(σ)2ζ2(σ)2ζ3(σ)2).\displaystyle\lambda_{\min}(\Phi_{0})\geq\sum_{k=1}^{3}\zeta_{k}(\sigma)^{2}\lambda_{\min}(f_{k}(X^{\top}X))+(1-\zeta_{1}(\sigma)^{2}-\zeta_{2}(\sigma)^{2}-\zeta_{3}(\sigma)^{2}).

Note that fk(XX)=KkKk,f_{k}(X^{\top}X)=K_{k}^{\top}K_{k}, where Kkd0k×nK_{k}\in\mathbb{R}^{d_{0}^{k}\times n}, and each column of KkK_{k} is given by the kk-th Kronecker product 𝐱i𝐱i\mathbf{x}_{i}\otimes\cdots\otimes\mathbf{x}_{i}. Hence, fk(XX)f_{k}(X^{\top}X) is positive semi-definite.

Therefore,

λmin(Φ0)(1ζ1(σ)2ζ2(σ)2ζ3(σ)2).\displaystyle\lambda_{\min}(\Phi_{0})\geq(1-\zeta_{1}(\sigma)^{2}-\zeta_{2}(\sigma)^{2}-\zeta_{3}(\sigma)^{2}).

Since \sigma is not linear but Lipschitz, (6.3) holds for \sigma. Therefore, (1-\zeta_{1}(\sigma)^{2}-\zeta_{2}(\sigma)^{2}-\zeta_{3}(\sigma)^{2})=\sum_{k=4}^{\infty}\zeta_{k}(\sigma)^{2}>0, and then (6.4) holds. ∎
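The factorization f_{k}(X^{\top}X)=K_{k}^{\top}K_{k} used in the proof above can be verified directly on a small example (illustration only):

```python
import numpy as np

rng = np.random.default_rng(4)
d0, n, k = 5, 4, 3
X = rng.standard_normal((d0, n))

# Columns of K_k are k-fold Kronecker powers of the columns of X
cols = []
for j in range(n):
    v = X[:, j]
    w = v.copy()
    for _ in range(k - 1):
        w = np.kron(w, v)
    cols.append(w)
Kk = np.stack(cols, axis=1)                      # shape (d0**k, n)

print(np.allclose(Kk.T @ Kk, (X.T @ X)**k))      # Hadamard power = Gram of Kronecker powers -> True
```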

Theorem 2.5 then follows directly from Lemma 6.1 and Lemma 6.3.


Next, we move on to non-asymptotic estimations for NTK. Recall that the empirical NTK matrix HH is given by (1.8) and the α\alpha-th column of SS is defined by diag(σ(W𝐱α))𝒂\operatorname{diag}(\sigma^{\prime}(W\mathbf{x}_{\alpha}))\boldsymbol{a}, for 1αn1\leq\alpha\leq n, in (1.9).

The ii-th row of SS is given by 𝐳i:=σ(𝒘iX)ai,\mathbf{z}_{i}^{\top}:=\sigma^{\prime}(\boldsymbol{w}_{i}^{\top}X)a_{i}, and 𝔼[𝐳i]=0\mathbb{E}[\mathbf{z}_{i}]=0, where aia_{i} is the ii-th entry of 𝒂\boldsymbol{a}. Define Dα=diag(σ(𝒘αX)aα),D_{\alpha}=\operatorname{diag}(\sigma^{\prime}(\boldsymbol{w}_{\alpha}^{\top}X)a_{\alpha}), for 1αd11\leq\alpha\leq d_{1}. We can rewrite (SS)(XX)(S^{\top}S)\odot(X^{\top}X) as

(S^{\top}S)\odot(X^{\top}X)=\sum_{\alpha=1}^{d_{1}}D_{\alpha}X^{\top}XD_{\alpha}.

Let us define LL and further expand it as follows:

(6.5) L\displaystyle L :=1d1(SS𝔼[SS])(XX)\displaystyle:=\frac{1}{d_{1}}(S^{\top}S-\mathbb{E}[S^{\top}S])\odot(X^{\top}X)
=1d1i=1d1(𝐳i𝐳i𝔼[𝐳i𝐳i])(XX)\displaystyle=\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}(\mathbf{z}_{i}\mathbf{z}_{i}^{\top}-\mathbb{E}[\mathbf{z}_{i}\mathbf{z}_{i}^{\top}])\odot(X^{\top}X)
(6.6) =1d1i=1d1(Di(XX)Di𝔼[Di(XX)Di])=1d1i=1d1Zi.\displaystyle=\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}\left(D_{i}(X^{\top}X)D_{i}-\mathbb{E}[D_{i}(X^{\top}X)D_{i}]\right)=\frac{1}{d_{1}}\sum_{i=1}^{d_{1}}Z_{i}.

Here ZiZ_{i} is a centered random matrix, and we can apply matrix Bernstein’s inequality to show the concentration of LL. Since ZiZ_{i} does not have an almost sure bound on the spectral norm, we will use the following sub-exponential version of the matrix Bernstein inequality from [Tro12].

Lemma 6.5 ([Tro12], Theorem 6.2).

Let Z_{1},\ldots,Z_{d_{1}} be independent Hermitian random matrices of size n\times n. Assume that, for each i,

𝔼Zi=0,𝔼[Zip]12p!Rp2a2,\displaystyle\mathbb{E}Z_{i}=0,\quad\left\|\mathbb{E}[Z_{i}^{p}]\right\|\leq\frac{1}{2}p!R^{p-2}a^{2},

for any integer p2p\geq 2. Then for all t0t\geq 0,

(6.7) (i=1d1Zit)nexp(t22d1a2+2Rt).\displaystyle\mathbb{P}\left(~{}\left\|\sum_{i=1}^{d_{1}}Z_{i}\right\|\geq t\right)\leq n\exp\left(-\frac{t^{2}}{2d_{1}a^{2}+2Rt}\right).
Proof of Theorem 2.7.

From (6.6), 𝔼Zi=0\mathbb{E}Z_{i}=0, and

Zi\displaystyle\|Z_{i}\| Di2XX+𝔼Di2XXC1(ai2+1),\displaystyle\leq\|D_{i}\|^{2}\|XX^{\top}\|+\mathbb{E}\|D_{i}\|^{2}\|XX^{\top}\|\leq C_{1}(a_{i}^{2}+1),

where C1=λσ2X2C_{1}=\lambda_{\sigma}^{2}\|X\|^{2} and where ai𝒩(0,1)a_{i}\sim\mathcal{N}(0,1) is the ii-th entry of the second layer weight 𝒂\boldsymbol{a}. Then

𝔼[Zip]𝔼Zip\displaystyle\|\mathbb{E}[Z_{i}^{p}]\|\leq\mathbb{E}\|Z_{i}\|^{p} C12p𝔼(ai2+1)pC12pk=1p(pk)(2k1)!!\displaystyle\leq C_{1}^{2p}\mathbb{E}(a_{i}^{2}+1)^{p}\leq C_{1}^{2p}\sum_{k=1}^{p}\binom{p}{k}(2k-1)!!
=C12pp!k=1p(2k1)!!k!(pk)!C12pp!k=1p2k2(2C12)pp!.\displaystyle=C_{1}^{2p}p!\sum_{k=1}^{p}\frac{(2k-1)!!}{k!(p-k)!}\leq C_{1}^{2p}p!\sum_{k=1}^{p}2^{k}\leq 2(2C_{1}^{2})^{p}p!.

So we can take R=2C12,a2=8C14R=2C_{1}^{2},a^{2}=8C_{1}^{4} in (6.7) and obtain

(i=1d1Zit)nexp(t216d1C14+4C12t).\displaystyle\mathbb{P}\left(\left\|\sum_{i=1}^{d_{1}}Z_{i}\right\|\geq t\right)\leq n\exp\left(-\frac{t^{2}}{16d_{1}C_{1}^{4}+4C_{1}^{2}t}\right).

Hence, LL defined in (6.5) has a probability bound:

(Lt)=(1d1i=1d1Zit)nexp(t2d116C14+4C12t).\displaystyle\mathbb{P}\left(\|L\|\geq t\right)=\mathbb{P}\left(\frac{1}{d_{1}}\left\|\sum_{i=1}^{d_{1}}Z_{i}\right\|\geq t\right)\leq n\exp\left(-\frac{t^{2}d_{1}}{16C_{1}^{4}+4C_{1}^{2}t}\right).

Take t=10C_{1}^{2}\sqrt{\log n/d_{1}}. Under the assumption that d_{1}\geq\log n, we conclude that, with probability at least 1-n^{-7/3},

(6.8) L10C12lognd1.\displaystyle\|L\|\leq 10C_{1}^{2}\sqrt{\frac{\log n}{d_{1}}}.

Thus, as a corollary, the two statements in Lemma 5.5 follow from (6.8). Meanwhile, since

H𝔼H1d1YYΦ+L,\|H-\mathbb{E}H\|\leq\left\|\frac{1}{d_{1}}Y^{\top}Y-\Phi\right\|+\|L\|,

the bound in (2.13) follows from Theorem 2.3 and (6.8). ∎
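The bound (6.8) can be sanity-checked numerically (illustration only). Here we take \sigma^{\prime}(x)=\operatorname{sign}(x)/\sqrt{1-2/\pi}, for which \mathbb{E}[S^{\top}S]/d_{1} has the closed-form entries \frac{2}{\pi}\arcsin(\rho_{\alpha\beta})/(1-2/\pi) by Grothendieck's identity, so the centered matrix L can be formed exactly.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d0, d1, chunk = 200, 4_000, 20_000, 2_000
X = rng.standard_normal((d0, n)) / np.sqrt(d0)
c2 = 1 - 2 / np.pi                                  # sigma'(x) = sign(x)/sqrt(c2), so |sigma'| <= 1/sqrt(c2)

G = X.T @ X
norms = np.linalg.norm(X, axis=0)
rho = np.clip(G / np.outer(norms, norms), -1.0, 1.0)
ESS = (2 / np.pi) * np.arcsin(rho) / c2             # E[(1/d1) S^T S], via Grothendieck's identity

SS = np.zeros((n, n))
for _ in range(d1 // chunk):
    Wc = rng.standard_normal((chunk, d0))
    a = rng.standard_normal((chunk, 1))
    Sc = a * np.sign(Wc @ X) / np.sqrt(c2)          # rows z_i^T = a_i * sigma'(w_i^T X)
    SS += Sc.T @ Sc

L = (SS / d1 - ESS) * G                             # (1/d1)(S^T S - E[S^T S]) Hadamard (X^T X)
C1 = np.linalg.norm(X, 2)**2 / c2                   # C1 = lambda_sigma^2 ||X||^2
print(np.linalg.norm(L, 2), 10 * C1**2 * np.sqrt(np.log(n) / d1))   # empirical ||L|| vs. the bound (6.8)
```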

We now provide a lower bound on \lambda_{\min}(H) based on Theorem 2.7.

Proof of Theorem 2.9.

Note that from (1.8), (2.6) and (6.5), we have

λmin(H)\displaystyle\lambda_{\min}(H) 1d1λmin((SS)(XX))\displaystyle\geq\frac{1}{d_{1}}\lambda_{\min}((S^{\top}S)\odot(X^{\top}X))
1d1λmin((𝔼SS)(XX))L=λmin(Ψ)L.\displaystyle\geq\frac{1}{d_{1}}\lambda_{\min}((\mathbb{E}S^{\top}S)\odot(X^{\top}X))-\|L\|=\lambda_{\min}(\Psi)-\|L\|.

Then with Lemma 5.6, we can get

λmin(H)\displaystyle\lambda_{\min}(H) λmin(Ψ0)Cεn4nL(aσk=02ηk2(σ))Cεn4nL.\displaystyle\geq\lambda_{\min}(\Psi_{0})-C\varepsilon_{n}^{4}n-\|L\|\geq\left(a_{\sigma}-\sum_{k=0}^{2}\eta_{k}^{2}(\sigma)\right)-C\varepsilon_{n}^{4}n-\|L\|.

Therefore, from Theorem 2.7, with probability at least 1n7/31-n^{-7/3},

λmin(H)\displaystyle\lambda_{\min}(H) aσk=02ηk2(σ)Cεn4n10λσ4X4lognd1\displaystyle\geq a_{\sigma}-\sum_{k=0}^{2}\eta_{k}^{2}(\sigma)-C\varepsilon_{n}^{4}n-10\lambda_{\sigma}^{4}\|X\|^{4}\sqrt{\frac{\log n}{d_{1}}}
aσk=02ηk2(σ)Cεn4n10λσ4B4lognd1.\displaystyle\geq a_{\sigma}-\sum_{k=0}^{2}\eta_{k}^{2}(\sigma)-C\varepsilon_{n}^{4}n-10\lambda_{\sigma}^{4}B^{4}\sqrt{\frac{\log n}{d_{1}}}.

Since \sigma is Lipschitz and non-linear, we know that |\sigma^{\prime}(x)| is bounded and \sigma^{\prime}(x) is not a constant function. Suppose that \sigma^{\prime}(x) had only finitely many non-zero Hermite coefficients. Then \sigma^{\prime}(x) would be a polynomial, so \sigma(x) would be either a linear function or a polynomial of degree at least 2; the former contradicts non-linearity and the latter contradicts the Lipschitz assumption. Hence, the Hermite coefficients of \sigma^{\prime} satisfy

(6.9) sup{k:ηk2(σ)>0}= and aσk=02ηk2(σ)=k=3ηk2(σ)>0.\displaystyle\sup\{k\in\mathbb{N}:\eta_{k}^{2}(\sigma)>0\}=\infty~{}\text{ and }~{}a_{\sigma}-\sum_{k=0}^{2}\eta_{k}^{2}(\sigma)=\sum_{k=3}^{\infty}\eta_{k}^{2}(\sigma)>0.

This finishes the proof. ∎

7. Proof of Theorem 2.12 and Theorem 2.17

By definition, the random matrix K_{n}(X,X) is \frac{1}{d_{1}}Y^{\top}Y, and the kernel matrix K(X,X)=\Phi is defined in (1.3). These two matrices have already been analyzed in Theorem 2.3 and Theorem 2.5, so we will apply these results to estimate the difference between the training error of random feature regression and that of the corresponding kernel regression.

Proof of Theorem 2.12.

Denote Kλ:=(K+λId)K_{\lambda}:=(K+\lambda\operatorname{Id}). From the definitions of training errors in (2.20) and (2.21), we have

|Etrain(RF,λ)Etrain(K,λ)|=1n|f^λ(RF)(X)𝒚2f^λ(K)(X)𝒚2|\displaystyle\left|E_{\textnormal{train}}^{(RF,\lambda)}-E_{\textnormal{train}}^{(K,\lambda)}\right|=\frac{1}{n}\left|\|\hat{f}_{\lambda}^{(RF)}(X)-\boldsymbol{y}\|^{2}-\|\hat{f}_{\lambda}^{(K)}(X)-\boldsymbol{y}\|^{2}\right|
=\displaystyle=~{} λ2n|Tr[(K(X,X)+λId)2𝒚𝒚]Tr[(Kn(X,X)+λId)2𝒚𝒚]|\displaystyle\frac{\lambda^{2}}{n}\left|\operatorname{Tr}[(K(X,X)+\lambda\operatorname{Id})^{-2}\boldsymbol{y}\boldsymbol{y}^{\top}]-\operatorname{Tr}[(K_{n}(X,X)+\lambda\operatorname{Id})^{-2}\boldsymbol{y}\boldsymbol{y}^{\top}]\right|
=\displaystyle=~{} λ2n|𝒚[(K(X,X)+λId)2(Kn(X,X)+λId)2]𝒚|\displaystyle\frac{\lambda^{2}}{n}\left|\boldsymbol{y}^{\top}\left[(K(X,X)+\lambda\operatorname{Id})^{-2}-(K_{n}(X,X)+\lambda\operatorname{Id})^{-2}\right]\boldsymbol{y}\right|
\displaystyle\leq~{} λ2n(K(X,X)+λId)2(Kn(X,X)+λId)2𝒚2\displaystyle\frac{\lambda^{2}}{n}\|(K(X,X)+\lambda\operatorname{Id})^{-2}-(K_{n}(X,X)+\lambda\operatorname{Id})^{-2}\|\cdot\|\boldsymbol{y}\|^{2}
(7.1) \displaystyle\leq~{} \frac{\lambda^{2}\|\boldsymbol{y}\|^{2}}{n\lambda_{\min}^{2}(K(X,X))\lambda_{\min}^{2}(K_{n}(X,X))}\left\|K_{\lambda}^{2}-(K_{n}(X,X)+\lambda\operatorname{Id})^{2}\right\|.

Here, in (7.1), we employ the identity

(7.2) A1B1=B1(BA)A1,A^{-1}-B^{-1}=B^{-1}(B-A)A^{-1},

for A=(K(X,X)+\lambda\operatorname{Id})^{2} and B=(K_{n}(X,X)+\lambda\operatorname{Id})^{2}, together with the facts that \|(K(X,X)+\lambda\operatorname{Id})^{-1}\|\leq\lambda_{\min}^{-1}(K(X,X)) and \|(K_{n}(X,X)+\lambda\operatorname{Id})^{-1}\|\leq\lambda_{\min}^{-1}(K_{n}(X,X)). Before providing uniform upper bounds for \lambda_{\min}^{-2}(K(X,X)) and \lambda_{\min}^{-2}(K_{n}(X,X)) in (7.1), we first bound the last factor in (7.1) as follows:

(K(X,X)+λId)2(Kn(X,X)+λId)2\displaystyle\|(K(X,X)+\lambda\operatorname{Id})^{2}-(K_{n}(X,X)+\lambda\operatorname{Id})^{2}\|
=\displaystyle=~{} K2(X,X)Kn2(X,X)+2λ(K(X,X)Kn(X,X))\displaystyle\|K^{2}(X,X)-K_{n}^{2}(X,X)+2\lambda(K(X,X)-K_{n}(X,X))\|
\displaystyle\leq~{} K2(X,X)Kn2(X,X)+2λ(K(X,X)Kn(X,X))\displaystyle\|K^{2}(X,X)-K_{n}^{2}(X,X)\|+2\lambda\|(K(X,X)-K_{n}(X,X))\|
\displaystyle\leq~{} (Kn(X,X)K(X,X)+2K(X,X)+2λ)K(X,X)Kn(X,X)\displaystyle\Big{(}\|K_{n}(X,X)-K(X,X)\|+2\|K(X,X)\|+2\lambda\Big{)}\cdot\|K(X,X)-K_{n}(X,X)\|
(7.3) \displaystyle\leq~{} C(nd1+C)nd1.\displaystyle C\left(\sqrt{\frac{n}{d_{1}}}+C\right)\sqrt{\frac{n}{d_{1}}}.

for some constant C>0C>0, with probability at least 14e2n1-4e^{-2n}, where the last bound in (7.3) is due to Theorem 2.3 and Lemma A.9 in Appendix A. Additionally, combining Theorem 2.3 and Theorem 2.5, we can easily get

(7.4) (Kn(X,X)+λId)1λmin1(Kn(X,X))C\|(K_{n}(X,X)+\lambda\operatorname{Id})^{-1}\|\leq\lambda_{\min}^{-1}(K_{n}(X,X))\leq C

for all large n and some universal constant C, on the same event where (7.3) holds. Lemma 6.3 also shows that \lambda_{\min}^{-1}(K(X,X))\leq C for all large n. Hence, with these upper bounds for \lambda_{\min}^{-2}(K(X,X)) and \lambda_{\min}^{-2}(K_{n}(X,X)), (2.22) follows from the bounds in (7.1) and (7.3). ∎
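As an illustration of Theorem 2.12 (a sketch only; the labels \boldsymbol{y} follow a toy linear-plus-noise model and the activation is the closed-form example used earlier), the gap between the two training errors shrinks as d_{1} grows, in line with (2.22). The training error of kernel ridge regression is computed through its residual \lambda(K+\lambda\operatorname{Id})^{-1}\boldsymbol{y}, as in the display above.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d0, chunk, lam = 200, 2_000, 2_000, 0.5
X = rng.standard_normal((d0, n)) / np.sqrt(d0)
y = X.T @ rng.standard_normal(d0) / np.sqrt(d0) + 0.1 * rng.standard_normal(n)   # toy labels

c2 = 1 - 2 / np.pi
sigma = lambda t: (np.abs(t) - np.sqrt(2 / np.pi)) / np.sqrt(c2)

norms = np.linalg.norm(X, axis=0)
rho = np.clip((X.T @ X) / np.outer(norms, norms), -1.0, 1.0)
E_abs = np.outer(norms, norms) * (2 / np.pi) * (np.sqrt(1 - rho**2) + rho * np.arcsin(rho))
K = (E_abs - (2 / np.pi) * (norms[:, None] + norms[None, :]) + 2 / np.pi) / c2   # limiting CK

def train_err(Kmat):
    # residual of kernel ridge regression: y - Kmat (Kmat + lam*I)^{-1} y = lam (Kmat + lam*I)^{-1} y
    r = lam * np.linalg.solve(Kmat + lam * np.eye(n), y)
    return np.mean(r**2)

for d1 in (2_000, 20_000, 100_000):
    Kn = np.zeros((n, n))
    for _ in range(d1 // chunk):
        Yc = sigma(rng.standard_normal((chunk, d0)) @ X)
        Kn += Yc.T @ Yc / d1
    print(d1, abs(train_err(Kn) - train_err(K)))      # gap is O(sqrt(n/d1)), cf. (2.22)
```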

For ease of notation, we denote K:=K(X,X)K:=K(X,X) and Kn:=Kn(X,X)K_{n}:=K_{n}(X,X). Hence, from (2.24), we can further decompose the test errors for KK and KnK_{n} into

(7.5) (f^λ(K))=\displaystyle\mathcal{L}(\hat{f}^{(K)}_{\lambda})=~{} 𝔼𝐱[|f(𝐱)|2]+Tr[(K+λId)1𝒚𝒚(K+λId)1𝔼𝐱[K(𝐱,X)K(𝐱,X)]]\displaystyle\mathbb{E}_{\mathbf{x}}[|f^{*}(\mathbf{x})|^{2}]+\operatorname{Tr}\left[(K+\lambda\operatorname{Id})^{-1}\boldsymbol{y}\boldsymbol{y}^{\top}(K+\lambda\operatorname{Id})^{-1}\mathbb{E}_{\mathbf{x}}[K(\mathbf{x},X)^{\top}K(\mathbf{x},X)]\right]
2Tr[(K+λId)1𝒚𝔼𝐱[f(𝐱)K(𝐱,X)]],\displaystyle~{}-2\operatorname{Tr}\left[(K+\lambda\operatorname{Id})^{-1}\boldsymbol{y}\mathbb{E}_{\mathbf{x}}[f^{*}(\mathbf{x})K(\mathbf{x},X)]\right],
(7.6) (f^λ(RF))=\displaystyle\mathcal{L}(\hat{f}^{(RF)}_{\lambda})=~{} 𝔼𝐱[|f(𝐱)|2]+Tr[(Kn+λId)1𝒚𝒚(Kn+λId)1𝔼𝐱[Kn(𝐱,X)Kn(𝐱,X)]]\displaystyle\mathbb{E}_{\mathbf{x}}[|f^{*}(\mathbf{x})|^{2}]+\operatorname{Tr}\left[(K_{n}+\lambda\operatorname{Id})^{-1}\boldsymbol{y}\boldsymbol{y}^{\top}(K_{n}+\lambda\operatorname{Id})^{-1}\mathbb{E}_{\mathbf{x}}[K_{n}(\mathbf{x},X)^{\top}K_{n}(\mathbf{x},X)]\right]
2Tr[(Kn+λId)1𝒚𝔼𝐱[f(𝐱)Kn(𝐱,X)]].\displaystyle~{}-2\operatorname{Tr}\left[(K_{n}+\lambda\operatorname{Id})^{-1}\boldsymbol{y}\mathbb{E}_{\mathbf{x}}[f^{*}(\mathbf{x})K_{n}(\mathbf{x},X)]\right].

Let us denote

E1:=\displaystyle E_{1}:= Tr[(Kn+λId)1𝒚𝒚(Kn+λId)1𝔼𝐱[Kn(𝐱,X)Kn(𝐱,X)]],\displaystyle\operatorname{Tr}\left[(K_{n}+\lambda\operatorname{Id})^{-1}\boldsymbol{y}\boldsymbol{y}^{\top}(K_{n}+\lambda\operatorname{Id})^{-1}\mathbb{E}_{\mathbf{x}}[K_{n}(\mathbf{x},X)^{\top}K_{n}(\mathbf{x},X)]\right],
E¯1:=\displaystyle\bar{E}_{1}:= Tr[(K+λId)1𝒚𝒚(K+λId)1𝔼𝐱[K(𝐱,X)K(𝐱,X)]],\displaystyle\operatorname{Tr}\left[(K+\lambda\operatorname{Id})^{-1}\boldsymbol{y}\boldsymbol{y}^{\top}(K+\lambda\operatorname{Id})^{-1}\mathbb{E}_{\mathbf{x}}[K(\mathbf{x},X)^{\top}K(\mathbf{x},X)]\right],
E2:=\displaystyle E_{2}:= Tr[(Kn+λId)1𝒚𝜷𝔼𝐱[𝐱Kn(𝐱,X)]],\displaystyle\operatorname{Tr}\left[(K_{n}+\lambda\operatorname{Id})^{-1}\boldsymbol{y}\boldsymbol{\beta}^{*\top}\mathbb{E}_{\mathbf{x}}[\mathbf{x}K_{n}(\mathbf{x},X)]\right],
E¯2:=\displaystyle\bar{E}_{2}:= Tr[(K+λId)1𝒚𝜷𝔼𝐱[𝐱K(𝐱,X)]].\displaystyle\operatorname{Tr}\left[(K+\lambda\operatorname{Id})^{-1}\boldsymbol{y}\boldsymbol{\beta}^{*\top}\mathbb{E}_{\mathbf{x}}[\mathbf{x}K(\mathbf{x},X)]\right].

As we can see, to compare the test errors between random feature and kernel regression models, we need to control |E1E¯1||E_{1}-\bar{E}_{1}| and |E2E¯2||E_{2}-\bar{E}_{2}|. Firstly, it is necessary to study the concentrations of

𝔼𝐱[K(𝐱,X)K(𝐱,X)Kn(𝐱,X)Kn(𝐱,X)]\mathbb{E}_{\mathbf{x}}[K(\mathbf{x},X)^{\top}K(\mathbf{x},X)-K_{n}(\mathbf{x},X)^{\top}K_{n}(\mathbf{x},X)]

and

𝔼𝐱[f(𝐱)(K(𝐱,X)Kn(𝐱,X))].\mathbb{E}_{\mathbf{x}}\left[f^{*}(\mathbf{x})\left(K(\mathbf{x},X)-K_{n}(\mathbf{x},X)\right)\right].
Lemma 7.1.

Under Assumption 1.2 for σ\sigma and Assumption 2.14 for 𝐱\mathbf{x} and XX, with probability at least 14e2n1-4e^{-2n}, we have

(7.7) Kn(𝐱,X)K(𝐱,X)Cnd1,\displaystyle\left\|K_{n}(\mathbf{x},X)-K(\mathbf{x},X)\right\|\leq C\sqrt{\frac{n}{d_{1}}},

where C>0C>0 is a universal constant. Here, we only consider the randomness of the weight matrix in Kn(𝐱,X)K_{n}(\mathbf{x},X) defined by (2.17) and (2.18).

Proof.

We consider X~=[𝐱1,,𝐱n,𝐱]\tilde{X}=[\mathbf{x}_{1},\ldots,\mathbf{x}_{n},\mathbf{x}], its corresponding kernels Kn(X~,X~)K_{n}(\tilde{X},\tilde{X}) and K(X~,X~)(n+1)×(n+1)K(\tilde{X},\tilde{X})\in\mathbb{R}^{(n+1)\times(n+1)}. Under Assumption 2.14, we can directly apply Theorem 2.3 to get the concentration of Kn(X~,X~)K_{n}(\tilde{X},\tilde{X}) around K(X~,X~)K(\tilde{X},\tilde{X}), namely,

(7.8) Kn(X~,X~)K(X~,X~)Cnd1,\left\|K_{n}(\tilde{X},\tilde{X})-K(\tilde{X},\tilde{X})\right\|\leq C\sqrt{\frac{n}{d_{1}}},

with probability at least 14e2n1-4e^{-2n}. Meanwhile, we can write Kn(X~,X~)K_{n}(\tilde{X},\tilde{X}) and K(X~,X~)K(\tilde{X},\tilde{X}) as block matrices:

Kn(X~,X~)=(Kn(X,X)Kn(X,𝐱)Kn(𝐱,X)Kn(𝐱,𝐱)) and K(X~,X~)=(K(X,X)K(X,𝐱)K(𝐱,X)K(𝐱,𝐱)).K_{n}(\tilde{X},\tilde{X})=\begin{pmatrix}K_{n}(X,X)&K_{n}(X,\mathbf{x})\\ K_{n}(\mathbf{x},X)&K_{n}(\mathbf{x},\mathbf{x})\end{pmatrix}~{}\text{ and }~{}K(\tilde{X},\tilde{X})=\begin{pmatrix}K(X,X)&K(X,\mathbf{x})\\ K(\mathbf{x},X)&K(\mathbf{x},\mathbf{x})\end{pmatrix}.

Since the 2\ell_{2}-norm of any row is bounded above by the spectral norm of its entire matrix, we complete the proof of (7.7). ∎

Lemma 7.2.

Assume that the training labels satisfy Assumption 2.13 and that \|X\|\leq B. Then, for any deterministic A\in\mathbb{R}^{n\times n}, we have

Var(𝒚A𝒚),Var(𝜷A𝒚)cAF2,\displaystyle\operatorname{Var}\left(\boldsymbol{y}^{\top}A\boldsymbol{y}\right),\operatorname{Var}\left(\boldsymbol{\beta}^{*\top}A\boldsymbol{y}\right)\leq c\|A\|_{F}^{2},

where constant cc only depends on σ𝛃,\sigma_{\boldsymbol{\beta}}, σϵ\sigma_{\boldsymbol{\epsilon}} and BB. Moreover,

𝔼[𝒚A𝒚]=σ𝜷2TrAXX+σϵ2TrA,𝔼[𝜷A𝒚]=σ𝜷2TrAX.\mathbb{E}[\boldsymbol{y}^{\top}A\boldsymbol{y}]=\sigma^{2}_{\boldsymbol{\beta}}\operatorname{Tr}AX^{\top}X+\sigma^{2}_{\boldsymbol{\epsilon}}\operatorname{Tr}A,\quad\mathbb{E}[\boldsymbol{\beta}^{*\top}A\boldsymbol{y}]=\sigma^{2}_{\boldsymbol{\beta}}\operatorname{Tr}AX^{\top}.
Proof.

We follow the idea in Lemma C.8 of [MM19] and control the variance of a Gaussian quadratic form via

(7.9) Var(𝒈A𝒈)=AF2+Tr(A2)2AF2,\operatorname{Var}(\boldsymbol{g}^{\top}A\boldsymbol{g})=\|A\|_{F}^{2}+\operatorname{Tr}(A^{2})\leq 2\|A\|_{F}^{2},

for any deterministic square matrix AA and standard normal random vector 𝒈\boldsymbol{g}. Notice that the quadratic form

(7.10) 𝒚A𝒚=𝒈(σ𝜷2XAXσϵσ𝜷XAσϵσ𝜷AXσϵ2A)𝒈,\boldsymbol{y}^{\top}A\boldsymbol{y}=\boldsymbol{g}^{\top}\begin{pmatrix}\sigma_{\boldsymbol{\beta}}^{2}XAX^{\top}&\sigma_{\boldsymbol{\epsilon}}\sigma_{\boldsymbol{\beta}}XA\\ \sigma_{\boldsymbol{\epsilon}}\sigma_{\boldsymbol{\beta}}AX^{\top}&\sigma_{\boldsymbol{\epsilon}}^{2}A\end{pmatrix}\boldsymbol{g},

where 𝒈\boldsymbol{g} is a standard Gaussian random vector in d0+n\mathbb{R}^{d_{0}+n}. Similarly, the second quadratic form can be written as

𝜷A𝒚=𝒈(σ𝜷2AXσϵσ𝜷A𝟎𝟎)𝒈.\boldsymbol{\beta}^{*\top}A\boldsymbol{y}=\boldsymbol{g}^{\top}\begin{pmatrix}\sigma_{\boldsymbol{\beta}}^{2}AX^{\top}&\sigma_{\boldsymbol{\epsilon}}\sigma_{\boldsymbol{\beta}}A\\ \mathbf{0}&\mathbf{0}\end{pmatrix}\boldsymbol{g}.

Let

A~1:=(σ𝜷2XAXσϵσ𝜷XAσϵσ𝜷AXσϵ2A),A~2:=(σ𝜷2AXσϵσ𝜷A𝟎𝟎).\tilde{A}_{1}:=\begin{pmatrix}\sigma_{\boldsymbol{\beta}}^{2}XAX^{\top}&\sigma_{\boldsymbol{\epsilon}}\sigma_{\boldsymbol{\beta}}XA\\ \sigma_{\boldsymbol{\epsilon}}\sigma_{\boldsymbol{\beta}}AX^{\top}&\sigma_{\boldsymbol{\epsilon}}^{2}A\end{pmatrix},\quad\tilde{A}_{2}:=\begin{pmatrix}\sigma_{\boldsymbol{\beta}}^{2}AX^{\top}&\sigma_{\boldsymbol{\epsilon}}\sigma_{\boldsymbol{\beta}}A\\ \mathbf{0}&\mathbf{0}\end{pmatrix}.

By (7.9), we know Var(𝒚A𝒚)2A~1F2\operatorname{Var}\left(\boldsymbol{y}^{\top}A\boldsymbol{y}\right)\leq 2\|\tilde{A}_{1}\|_{F}^{2} and Var(𝜷A𝒚)2A~2F2\operatorname{Var}\left(\boldsymbol{\beta}^{*\top}A\boldsymbol{y}\right)\leq 2\|\tilde{A}_{2}\|_{F}^{2}. Since

A~1F2=σ𝜷4XAXF2+σϵ2σ𝜷2XAF2+σϵ2σ𝜷2AXF2+σϵ4AF2cAF2\|\tilde{A}_{1}\|^{2}_{F}=\sigma_{\boldsymbol{\beta}}^{4}\|XAX^{\top}\|^{2}_{F}+\sigma^{2}_{\boldsymbol{\epsilon}}\sigma^{2}_{\boldsymbol{\beta}}\|XA\|^{2}_{F}+\sigma^{2}_{\boldsymbol{\epsilon}}\sigma^{2}_{\boldsymbol{\beta}}\|AX^{\top}\|_{F}^{2}+\sigma_{\boldsymbol{\epsilon}}^{4}\|A\|_{F}^{2}\leq c\|A\|_{F}^{2}

and similarly \|\tilde{A}_{2}\|_{F}^{2}\leq c\|A\|_{F}^{2} for a constant c, we can complete the proof. ∎
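The identity (7.9) is easy to verify by simulation (illustration only; it holds for a general, not necessarily symmetric, square matrix A):

```python
import numpy as np

rng = np.random.default_rng(7)
m = 6
A = rng.standard_normal((m, m))
g = rng.standard_normal((500_000, m))

q = np.einsum('ij,jk,ik->i', g, A, g)                  # quadratic forms g^T A g, one per row of g
print(np.var(q))                                       # Monte Carlo estimate of Var(g^T A g)
print(np.linalg.norm(A, 'fro')**2 + np.trace(A @ A))   # ||A||_F^2 + Tr(A^2) <= 2 ||A||_F^2
```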

As a remark, in Lemma 7.2, for simplicity, we only provide a variance control for the quadratic forms to obtain convergence in probability in the following proofs of Theorems 2.16 and 2.17. However, we can actually apply Hanson-Wright inequalities in Section 3 to get more precise probability bounds and consider non-Gaussian distributions for 𝜷\boldsymbol{\beta}^{*} and ϵ\boldsymbol{\epsilon}.

Proof of Theorem 2.16.

Based on the preceding expansions of (f^λ(RF)(𝐱))\mathcal{L}(\hat{f}_{\lambda}^{(RF)}(\mathbf{x})) and (f^λ(K)(𝐱))\mathcal{L}(\hat{f}_{\lambda}^{(K)}(\mathbf{x})) in (7.5) and (7.6), we need to control the right-hand side of

|(f^λ(RF)(𝐱))(f^λ(K)(𝐱))||E1E¯1|+2|E¯2E2|.\displaystyle\left|\mathcal{L}(\hat{f}_{\lambda}^{(RF)}(\mathbf{x}))-\mathcal{L}(\hat{f}_{\lambda}^{(K)}(\mathbf{x}))\right|\leq\left|E_{1}-\bar{E}_{1}\right|+2\left|\bar{E}_{2}-E_{2}\right|.

In what follows, we first establish the concentration of E_{1} and E_{2} with respect to the Gaussian random vectors \boldsymbol{\beta}^{*} and \boldsymbol{\epsilon}. Then, we apply Theorem 2.3 and Lemma 7.1 to complete the proof of (2.25). For simplicity, we start with the second term

|E¯2E2|\displaystyle\left|\bar{E}_{2}-E_{2}\right|\leq~{} |𝜷𝔼𝐱[𝐱(Kn(𝐱,X)K(𝐱,X))](Kn+λId)1𝒚|\displaystyle\left|\boldsymbol{\beta}^{*\top}\mathbb{E}_{\mathbf{x}}[\mathbf{x}(K_{n}(\mathbf{x},X)-K(\mathbf{x},X))](K_{n}+\lambda\operatorname{Id})^{-1}\boldsymbol{y}\right|
+|𝜷𝔼𝐱[𝐱K(𝐱,X)]((Kn+λId)1(K+λId)1)𝒚|\displaystyle+\left|\boldsymbol{\beta}^{*\top}\mathbb{E}_{\mathbf{x}}[\mathbf{x}K(\mathbf{x},X)]\left((K_{n}+\lambda\operatorname{Id})^{-1}-(K+\lambda\operatorname{Id})^{-1}\right)\boldsymbol{y}\right|
(7.11) \displaystyle\leq~{} |I1I¯1|+|I2I¯2|+|I¯1|+|I¯2|,\displaystyle|I_{1}-\bar{I}_{1}|+|I_{2}-\bar{I}_{2}|+|\bar{I}_{1}|+|\bar{I}_{2}|,

where I1I_{1} and I2I_{2} are quadratic forms defined below

I1:=\displaystyle I_{1}:=~{} 𝜷𝔼𝐱[𝐱(Kn(𝐱,X)K(𝐱,X))](Kn+λId)1𝒚,\displaystyle\boldsymbol{\beta}^{*\top}\mathbb{E}_{\mathbf{x}}[\mathbf{x}(K_{n}(\mathbf{x},X)-K(\mathbf{x},X))](K_{n}+\lambda\operatorname{Id})^{-1}\boldsymbol{y},
I2:=\displaystyle I_{2}:=~{} 𝜷𝔼𝐱[𝐱K(𝐱,X)]((Kn+λId)1(K+λId)1)𝒚,\displaystyle\boldsymbol{\beta}^{*\top}\mathbb{E}_{\mathbf{x}}[\mathbf{x}K(\mathbf{x},X)]\left((K_{n}+\lambda\operatorname{Id})^{-1}-(K+\lambda\operatorname{Id})^{-1}\right)\boldsymbol{y},

and their expectations with respect to random vectors 𝜷\boldsymbol{\beta}^{*} and ϵ\boldsymbol{\epsilon} are denoted by

I¯1:=𝔼ϵ,𝜷[I1]=\displaystyle\bar{I}_{1}:=\mathbb{E}_{\boldsymbol{\epsilon},\,\boldsymbol{\beta}^{*}}[I_{1}]=~{} σ𝜷2Tr(𝔼𝐱[𝐱(Kn(𝐱,X)K(𝐱,X))](Kn+λId)1X),\displaystyle\sigma_{\boldsymbol{\beta}}^{2}\operatorname{Tr}\left(\mathbb{E}_{\mathbf{x}}[\mathbf{x}(K_{n}(\mathbf{x},X)-K(\mathbf{x},X))](K_{n}+\lambda\operatorname{Id})^{-1}X^{\top}\right),
I¯2:=𝔼ϵ,𝜷[I2]=\displaystyle\bar{I}_{2}:=\mathbb{E}_{\boldsymbol{\epsilon},\,\boldsymbol{\beta}^{*}}[I_{2}]=~{} σ𝜷2Tr(((Kn+λId)1(K+λId)1)X𝔼𝐱[𝐱K(𝐱,X)]).\displaystyle\sigma_{\boldsymbol{\beta}}^{2}\operatorname{Tr}\left(\left((K_{n}+\lambda\operatorname{Id})^{-1}-(K+\lambda\operatorname{Id})^{-1}\right)X^{\top}\mathbb{E}_{\mathbf{x}}[\mathbf{x}K(\mathbf{x},X)]\right).

We first consider the randomness of the weight matrix in KnK_{n} and define the event \mathcal{E} where both (7.4) and (7.8) hold. Then, Theorem 2.5 and the proof of Lemma 7.1 indicate that event \mathcal{E} occurs with probability at least 14e2n1-4e^{-2n} for all large nn. Notice that \mathcal{E} does not rely on the randomness of test data 𝐱\mathbf{x}.

We now consider A=𝔼𝐱[𝐱(Kn(𝐱,X)K(𝐱,X))](Kn+λId)1A=\mathbb{E}_{\mathbf{x}}[\mathbf{x}(K_{n}(\mathbf{x},X)-K(\mathbf{x},X))](K_{n}+\lambda\operatorname{Id})^{-1} in Lemma 7.2. Conditioning on event \mathcal{E}, we have

(7.12) AF2\displaystyle\|A\|_{F}^{2}\leq~{} 𝔼𝐱[𝐱(Kn(𝐱,X)K(𝐱,X))F2](Kn+λId)1X2\displaystyle\mathbb{E}_{\mathbf{x}}\left[\left\|\mathbf{x}(K_{n}(\mathbf{x},X)-K(\mathbf{x},X))^{\top}\right\|_{F}^{2}\right]\cdot\left\|(K_{n}+\lambda\operatorname{Id})^{-1}X^{\top}\right\|^{2}
(7.13) \displaystyle\leq~{} X2(Kn+λId)12𝔼𝐱[𝐱2Kn(𝐱,X)K(𝐱,X)2]Cnd1,\displaystyle\|X\|^{2}\left\|(K_{n}+\lambda\operatorname{Id})^{-1}\right\|^{2}\cdot\mathbb{E}_{\mathbf{x}}\left[\|\mathbf{x}\|^{2}\|K_{n}(\mathbf{x},X)-K(\mathbf{x},X)\|^{2}\right]\leq C\frac{n}{d_{1}},

for some constant CC, where we utilize the assumption 𝔼[𝐱2]=1\mathbb{E}[\|\mathbf{x}\|^{2}]=1. Hence, based on Lemma 7.2, we know Varϵ,𝜷(I1)cn/d1\operatorname{Var}_{\boldsymbol{\epsilon},\,\boldsymbol{\beta}^{*}}(I_{1})\leq cn/d_{1}, for some constant cc. By Chebyshev’s inequality and event \mathcal{E},

(7.14) (|I1I¯1|(n/d1)1ε2)c(nd1)ε+4e2n,\mathbb{P}\left(|I_{1}-\bar{I}_{1}|\geq(n/d_{1})^{\frac{1-\varepsilon}{2}}\right)\leq c\left(\frac{n}{d_{1}}\right)^{\varepsilon}+4e^{-2n},

for any ε(0,1/2)\varepsilon\in(0,1/2). Hence, (d1/n)12ε|I1I¯1|=o(1)\left(d_{1}/n\right)^{\frac{1}{2}-\varepsilon}\cdot|I_{1}-\bar{I}_{1}|=o(1) with probability 1o(1)1-o(1), when n/d10n/d_{1}\to 0 and nn\to\infty.

Likewise, when A=𝔼𝐱[𝐱K(𝐱,X)]((Kn+λId)1(K+λId)1)A=\mathbb{E}_{\mathbf{x}}[\mathbf{x}K(\mathbf{x},X)]\left((K_{n}+\lambda\operatorname{Id})^{-1}-(K+\lambda\operatorname{Id})^{-1}\right), we can apply (7.2) and

(7.15) K(𝐱,X)K(X~,X~)Cλσ2B2,\|K(\mathbf{x},X)\|\leq\|K(\tilde{X},\tilde{X})\|\leq C\lambda_{\sigma}^{2}B^{2},

due to Lemma A.9 in Appendix A, to obtain AF2Cn/d1\|A\|_{F}^{2}\leq Cn/d_{1} conditionally on event \mathcal{E}. Then, similarly, Lemma 7.2 shows Varϵ,𝜷(I2)cn/d1\operatorname{Var}_{\boldsymbol{\epsilon},\,\boldsymbol{\beta}^{*}}(I_{2})\leq cn/d_{1}. Therefore, (7.14) also holds for |I2I¯2||I_{2}-\bar{I}_{2}|.

Moreover, conditioning on the event \mathcal{E},

(7.16) |I¯1|=\displaystyle|\bar{I}_{1}|=~{} σ𝜷2|𝔼𝐱[(Kn(𝐱,X)K(𝐱,X))(Kn+λId)1X𝐱]|\displaystyle\sigma_{\boldsymbol{\beta}}^{2}\left|\mathbb{E}_{\mathbf{x}}\left[(K_{n}(\mathbf{x},X)-K(\mathbf{x},X))(K_{n}+\lambda\operatorname{Id})^{-1}X^{\top}\mathbf{x}\right]\right|
(7.17) \displaystyle\leq~{} σ𝜷2𝔼𝐱[𝐱Kn(𝐱,X)K(𝐱,X)X(Kn+λId)1],\displaystyle\sigma_{\boldsymbol{\beta}}^{2}\mathbb{E}_{\mathbf{x}}\left[\|\mathbf{x}\|\cdot\left\|K_{n}(\mathbf{x},X)-K(\mathbf{x},X)\right\|\cdot\|X\|\cdot\left\|(K_{n}+\lambda\operatorname{Id})^{-1}\right\|\right],
(7.18) \displaystyle\leq~{} σ𝜷2𝔼𝐱[𝐱2]12𝔼𝐱[Kn(𝐱,X)K(𝐱,X)2]12X(Kn+λId)1Cnd1,\displaystyle\sigma_{\boldsymbol{\beta}}^{2}\mathbb{E}_{\mathbf{x}}\left[\|\mathbf{x}\|^{2}\right]^{\frac{1}{2}}\mathbb{E}_{\mathbf{x}}\left[\left\|K_{n}(\mathbf{x},X)-K(\mathbf{x},X)\right\|^{2}\right]^{\frac{1}{2}}\|X\|\left\|(K_{n}+\lambda\operatorname{Id})^{-1}\right\|\leq C\sqrt{\frac{n}{d_{1}}},

for some constant CC. In the same way, with (7.15), |I¯2|Cnd1|\bar{I}_{2}|\leq C\sqrt{\frac{n}{d_{1}}} on the event \mathcal{E}. Therefore, from (7.11), we can conclude |E¯2E2|=o((n/d1)1/2ε)\left|\bar{E}_{2}-E_{2}\right|=o\left(\left(n/d_{1}\right)^{1/2-\varepsilon}\right) for any ε(0,1/2)\varepsilon\in(0,1/2), with probability 1o(1)1-o(1), when n/d10n/d_{1}\to 0 and nn\to\infty.

Analogously, the first term |E¯1E1|\left|\bar{E}_{1}-E_{1}\right| is controlled by the following four quadratic forms

|E¯1E1|i=14|𝒚Ai𝒚|,\left|\bar{E}_{1}-E_{1}\right|\leq\sum_{i=1}^{4}\left|\boldsymbol{y}^{\top}A_{i}\boldsymbol{y}\right|,

where we define J_{i}:=\boldsymbol{y}^{\top}A_{i}\boldsymbol{y} for 1\leq i\leq 4 and

A1:=\displaystyle A_{1}:=~{} (Kn+λId)1𝔼𝐱[Kn(𝐱,X)(Kn(𝐱,X)K(𝐱,X))](Kn+λId)1,\displaystyle(K_{n}+\lambda\operatorname{Id})^{-1}\mathbb{E}_{\mathbf{x}}[K_{n}(\mathbf{x},X)^{\top}\left(K_{n}(\mathbf{x},X)-K(\mathbf{x},X)\right)](K_{n}+\lambda\operatorname{Id})^{-1},
A2:=\displaystyle A_{2}:=~{} (Kn+λId)1𝔼𝐱[(Kn(𝐱,X)K(𝐱,X))K(𝐱,X)](Kn+λId)1,\displaystyle(K_{n}+\lambda\operatorname{Id})^{-1}\mathbb{E}_{\mathbf{x}}[\left(K_{n}(\mathbf{x},X)-K(\mathbf{x},X)\right)^{\top}K(\mathbf{x},X)](K_{n}+\lambda\operatorname{Id})^{-1},
A3:=\displaystyle A_{3}:=~{} ((Kn+λId)1(K+λId)1)𝔼𝐱[K(𝐱,X)K(𝐱,X)](Kn+λId)1,\displaystyle\left((K_{n}+\lambda\operatorname{Id})^{-1}-(K+\lambda\operatorname{Id})^{-1}\right)\mathbb{E}_{\mathbf{x}}[K(\mathbf{x},X)^{\top}K(\mathbf{x},X)](K_{n}+\lambda\operatorname{Id})^{-1},
A4:=\displaystyle A_{4}:=~{} (K+λId)1𝔼𝐱[K(𝐱,X)K(𝐱,X)]((Kn+λId)1(K+λId)1).\displaystyle(K+\lambda\operatorname{Id})^{-1}\mathbb{E}_{\mathbf{x}}[K(\mathbf{x},X)^{\top}K(\mathbf{x},X)]\left((K_{n}+\lambda\operatorname{Id})^{-1}-(K+\lambda\operatorname{Id})^{-1}\right).

Similarly to (7.13) and (7.18), it is not hard to verify that \|A_{i}\|_{F}\leq C\sqrt{n/d_{1}} and |\mathbb{E}_{\boldsymbol{\epsilon},\,\boldsymbol{\beta}^{*}}[J_{i}]|\leq C\sqrt{n/d_{1}} conditioning on the event \mathcal{E}. Then, as in (7.14), we can invoke Lemma 7.2 for each A_{i}, apply Chebyshev’s inequality, and conclude that \left|\bar{E}_{1}-E_{1}\right|=o\left(\left(n/d_{1}\right)^{1/2-\varepsilon}\right) with probability 1-o(1) when d_{1}/n\to\infty, for any \varepsilon\in(0,1/2). ∎

Lemma 7.3.

With Assumptions 1.2 and 2.14, for (εn,B)(\varepsilon_{n},B)-orthonormal XX, we have that

(7.19) 𝔼𝐱[K(𝐱,X)K(𝐱,X)]bσ4d0XX\displaystyle\left\|\mathbb{E}_{\mathbf{x}}[K(\mathbf{x},X)^{\top}K(\mathbf{x},X)]-\frac{b_{\sigma}^{4}}{d_{0}}X^{\top}X\right\|\leq 𝔼𝐱[K(𝐱,X)K(𝐱,X)]bσ4d0XXFCnεn2,\displaystyle\left\|\mathbb{E}_{\mathbf{x}}[K(\mathbf{x},X)^{\top}K(\mathbf{x},X)]-\frac{b_{\sigma}^{4}}{d_{0}}X^{\top}X\right\|_{F}\leq C\sqrt{n}\varepsilon_{n}^{2},
(7.20) 𝔼𝐱[𝐱K(𝐱,X)]bσ2d0X\displaystyle\left\|\mathbb{E}_{\mathbf{x}}[\mathbf{x}K(\mathbf{x},X)]-\frac{b_{\sigma}^{2}}{d_{0}}X\right\|\leq 𝔼𝐱[𝐱K(𝐱,X)]bσ2d0XFCnεn2,\displaystyle\left\|\mathbb{E}_{\mathbf{x}}[\mathbf{x}K(\mathbf{x},X)]-\frac{b_{\sigma}^{2}}{d_{0}}X\right\|_{F}\leq C\sqrt{n}\varepsilon_{n}^{2},

for some constant C>0C>0.

Proof.

By Lemma A.8, we have an entrywise approximation

|K(𝐱,𝐱i)bσ2𝐱𝐱i|Cλσεn2,|K(\mathbf{x},\mathbf{x}_{i})-b_{\sigma}^{2}\mathbf{x}^{\top}\mathbf{x}_{i}|\leq C\lambda_{\sigma}\varepsilon_{n}^{2},

for any 1in1\leq i\leq n. Hence, K(𝐱,X)bσ2𝐱XCλσnεn2.\|K(\mathbf{x},X)-b_{\sigma}^{2}\mathbf{x}^{\top}X\|\leq C\lambda_{\sigma}\sqrt{n}\varepsilon_{n}^{2}. Assumption 2.14 of 𝐱\mathbf{x} implies that bσ4d0XX=bσ4𝔼𝐱[X𝐱𝐱X]\frac{b_{\sigma}^{4}}{d_{0}}X^{\top}X=b_{\sigma}^{4}\mathbb{E}_{\mathbf{x}}[X^{\top}\mathbf{x}\mathbf{x}^{\top}X]. Then, we can verify (7.19) based on the following approximation

𝔼𝐱[K(𝐱,X)K(𝐱,X)]bσ4d0XXF𝔼𝐱[K(𝐱,X)K(𝐱,X)bσ4X𝐱𝐱XF]\displaystyle\left\|\mathbb{E}_{\mathbf{x}}[K(\mathbf{x},X)^{\top}K(\mathbf{x},X)]-\frac{b_{\sigma}^{4}}{d_{0}}X^{\top}X\right\|_{F}\leq\mathbb{E}_{\mathbf{x}}\left[\left\|K(\mathbf{x},X)^{\top}K(\mathbf{x},X)-b_{\sigma}^{4}X^{\top}\mathbf{x}\mathbf{x}^{\top}X\right\|_{F}\right]
\displaystyle\leq~{} 𝔼𝐱[K(𝐱,X)(K(𝐱,X)bσ2𝐱X)F+bσ2(K(𝐱,X)bσ2X𝐱)𝐱XF]\displaystyle\mathbb{E}_{\mathbf{x}}\left[\left\|K(\mathbf{x},X)^{\top}\left(K(\mathbf{x},X)-b_{\sigma}^{2}\mathbf{x}^{\top}X\right)\right\|_{F}+b_{\sigma}^{2}\left\|\left(K(\mathbf{x},X)^{\top}-b_{\sigma}^{2}X^{\top}\mathbf{x}\right)\mathbf{x}^{\top}X\right\|_{F}\right]
\displaystyle\leq~{} 𝔼𝐱[K(𝐱,X)bσ2𝐱X(K(𝐱,X)+bσ2𝐱X)]Cnεn2,\displaystyle\mathbb{E}_{\mathbf{x}}\left[\|K(\mathbf{x},X)-b_{\sigma}^{2}\mathbf{x}^{\top}X\|\left(\|K(\mathbf{x},X)\|+\|b_{\sigma}^{2}\mathbf{x}^{\top}X\|\right)\right]\leq C\sqrt{n}\varepsilon_{n}^{2},

for some universal constant CC. The same argument can also be employed to prove (7.20), so details will be omitted here. ∎

Proof of Theorem 2.17.

From (2.22) and (2.25), we can easily conclude that

(7.21) Etrain(RF,λ)Etrain(K,λ)\displaystyle E_{\textnormal{train}}^{(RF,\lambda)}-E_{\textnormal{train}}^{(K,\lambda)} 0,\displaystyle\overset{\mathbb{P}}{\to}0,
(7.22) (f^λ(RF)(𝐱))(f^λ(K)(𝐱))\displaystyle\mathcal{L}(\hat{f}_{\lambda}^{(RF)}(\mathbf{x}))-\mathcal{L}(\hat{f}_{\lambda}^{(K)}(\mathbf{x})) 0,\displaystyle\overset{\mathbb{P}}{\to}0,

as nn\to\infty and n/d10n/d_{1}\to 0. Therefore, to study the training error Etrain(RF,λ)E_{\textnormal{train}}^{(RF,\lambda)} and the test error (f^λ(RF)(𝐱))\mathcal{L}(\hat{f}_{\lambda}^{(RF)}(\mathbf{x})) of random feature regression, it suffices to analyze the asymptotic behaviors of Etrain(K,λ)E_{\textnormal{train}}^{(K,\lambda)} and (f^λ(K)(𝐱))\mathcal{L}(\hat{f}_{\lambda}^{(K)}(\mathbf{x})) for the kernel regression, respectively. In the rest of the proof, we will first analyze the test error (f^λ(K)(𝐱))\mathcal{L}(\hat{f}_{\lambda}^{(K)}(\mathbf{x})) and then compute the training error Etrain(K,λ)E_{\textnormal{train}}^{(K,\lambda)} under the ultra-wide regime.

Recall that Kλ=(K+λId)K_{\lambda}=(K+\lambda\operatorname{Id}) and the test error is given by

(7.23) (f^λ(K))=\displaystyle\mathcal{L}(\hat{f}^{(K)}_{\lambda})=~{} 1d0𝜷2+L12L2,\displaystyle\frac{1}{d_{0}}\|\boldsymbol{\beta}^{*}\|^{2}+L_{1}-2L_{2},

where L1:=𝒚Kλ1𝔼𝐱[K(𝐱,X)K(𝐱,X)]Kλ1𝒚L_{1}:=\boldsymbol{y}^{\top}K_{\lambda}^{-1}\mathbb{E}_{\mathbf{x}}[K(\mathbf{x},X)^{\top}K(\mathbf{x},X)]K_{\lambda}^{-1}\boldsymbol{y}, L2:=𝜷𝔼𝐱[𝐱K(𝐱,X)]Kλ1𝒚L_{2}:=\boldsymbol{\beta}^{*\top}\mathbb{E}_{\mathbf{x}}[\mathbf{x}K(\mathbf{x},X)]K_{\lambda}^{-1}\boldsymbol{y}. The spectral norm of KλK_{\lambda} is bounded from above and the smallest eigenvalue is bounded from below by some positive constants.

We first focus on the last two terms L1L_{1} and L2L_{2} in the test error. Let us define

L~1:=bσ4d0𝒚Kλ1XXKλ1𝒚andL~2:=bσ2d0𝜷XKλ1𝒚.\widetilde{L}_{1}:=\frac{b_{\sigma}^{4}}{d_{0}}\boldsymbol{y}^{\top}K_{\lambda}^{-1}X^{\top}XK_{\lambda}^{-1}\boldsymbol{y}\quad\text{and}\quad\widetilde{L}_{2}:=\frac{b_{\sigma}^{2}}{d_{0}}\boldsymbol{\beta}^{*\top}XK_{\lambda}^{-1}\boldsymbol{y}.

Then, we obtain two quadratic forms

L1L~1=\displaystyle L_{1}-\widetilde{L}_{1}= 𝒚Kλ1(𝔼𝐱[K(𝐱,X)K(𝐱,X)]bσ4d0XX)Kλ1𝒚=:𝒚A1𝒚,\displaystyle\boldsymbol{y}^{\top}K_{\lambda}^{-1}\left(\mathbb{E}_{\mathbf{x}}[K(\mathbf{x},X)^{\top}K(\mathbf{x},X)]-\frac{b_{\sigma}^{4}}{d_{0}}X^{\top}X\right)K_{\lambda}^{-1}\boldsymbol{y}=:\boldsymbol{y}^{\top}A_{1}\boldsymbol{y},
L2L~2=\displaystyle L_{2}-\widetilde{L}_{2}= 𝜷(𝔼𝐱[𝐱K(𝐱,X)]bσ2d0X)Kλ1𝒚=:𝜷A2𝒚,\displaystyle\boldsymbol{\beta}^{*\top}\left(\mathbb{E}_{\mathbf{x}}[\mathbf{x}K(\mathbf{x},X)]-\frac{b_{\sigma}^{2}}{d_{0}}X\right)K_{\lambda}^{-1}\boldsymbol{y}=:\boldsymbol{\beta}^{*\top}A_{2}\boldsymbol{y},

where A1F\|A_{1}\|_{F} and A2F\|A_{2}\|_{F} are at most Cnεn2C\sqrt{n}\varepsilon_{n}^{2} for some constant C>0C>0, due to Lemma 7.3. Hence, applying Lemma 7.2 for these two quadratic forms, we have Var(LiL~i)cnεn40\operatorname{Var}(L_{i}-\widetilde{L}_{i})\leq cn\varepsilon_{n}^{4}\to 0 as nn\to\infty. Additionally, Lemma 7.2 and the proof of Lemma 7.3 verify that 𝔼[𝒚A1𝒚]\mathbb{E}[\boldsymbol{y}^{\top}A_{1}\boldsymbol{y}] and 𝔼[𝜷A2𝒚]\mathbb{E}[\boldsymbol{\beta}^{*\top}A_{2}\boldsymbol{y}] are vanishing as nn\to\infty. Therefore, LiL~iL_{i}-\widetilde{L}_{i} converges to zero in probability for i=1,2i=1,2. So we can move to analyze L~1\widetilde{L}_{1} and L~2\widetilde{L}_{2} instead. Copying the above procedure, we can separately compute the variances of L~1\widetilde{L}_{1} and L~2\widetilde{L}_{2} with respect to 𝜷\boldsymbol{\beta}^{*} and ϵ\boldsymbol{\epsilon}, and then apply Lemma 7.2. Then, |L~1L¯1||\widetilde{L}_{1}-\bar{L}_{1}| and |L~2L¯2||\widetilde{L}_{2}-\bar{L}_{2}| will converge to zero in probability as n,d0n,d_{0}\to\infty, where

L¯1:=\displaystyle\bar{L}_{1}:=~{} 𝔼ϵ,𝜷[L~1]=bσ4σ𝜷2nd0trKλ1XXKλ1XX+bσ4σϵ2nd0trKλ1XXKλ1,\displaystyle\mathbb{E}_{\boldsymbol{\epsilon},\,\boldsymbol{\beta}^{*}}[\widetilde{L}_{1}]=\frac{b_{\sigma}^{4}\sigma_{\boldsymbol{\beta}}^{2}n}{d_{0}}\operatorname{tr}K_{\lambda}^{-1}X^{\top}XK_{\lambda}^{-1}X^{\top}X+\frac{b_{\sigma}^{4}\sigma_{\boldsymbol{\epsilon}}^{2}n}{d_{0}}\operatorname{tr}K_{\lambda}^{-1}X^{\top}XK_{\lambda}^{-1},
L¯2:=\displaystyle\bar{L}_{2}:=~{} 𝔼ϵ,𝜷[L~2]=bσ2σ𝜷2nd0trKλ1XX.\displaystyle\mathbb{E}_{\boldsymbol{\epsilon},\,\boldsymbol{\beta}^{*}}[\widetilde{L}_{2}]=\frac{b_{\sigma}^{2}\sigma_{\boldsymbol{\beta}}^{2}n}{d_{0}}\operatorname{tr}K_{\lambda}^{-1}X^{\top}X.
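These two expectations follow from a direct computation. A brief sketch, assuming the linear training model \boldsymbol{y}=X^{\top}\boldsymbol{\beta}^{*}+\boldsymbol{\epsilon} with independent \boldsymbol{\beta}^{*}\sim\mathcal{N}(0,\sigma_{\boldsymbol{\beta}}^{2}\operatorname{Id}) and \boldsymbol{\epsilon}\sim\mathcal{N}(0,\sigma_{\boldsymbol{\epsilon}}^{2}\operatorname{Id}), so that \mathbb{E}[\boldsymbol{y}\boldsymbol{y}^{\top}]=\sigma_{\boldsymbol{\beta}}^{2}X^{\top}X+\sigma_{\boldsymbol{\epsilon}}^{2}\operatorname{Id} and \mathbb{E}[\boldsymbol{y}\boldsymbol{\beta}^{*\top}]=\sigma_{\boldsymbol{\beta}}^{2}X^{\top}, is

\mathbb{E}_{\boldsymbol{\epsilon},\,\boldsymbol{\beta}^{*}}[\widetilde{L}_{1}]=\frac{b_{\sigma}^{4}}{d_{0}}\operatorname{Tr}\left(K_{\lambda}^{-1}X^{\top}XK_{\lambda}^{-1}\,\mathbb{E}[\boldsymbol{y}\boldsymbol{y}^{\top}]\right)\quad\text{and}\quad\mathbb{E}_{\boldsymbol{\epsilon},\,\boldsymbol{\beta}^{*}}[\widetilde{L}_{2}]=\frac{b_{\sigma}^{2}}{d_{0}}\operatorname{Tr}\left(XK_{\lambda}^{-1}\,\mathbb{E}[\boldsymbol{y}\boldsymbol{\beta}^{*\top}]\right),

and writing \operatorname{Tr}=n\operatorname{tr} recovers the two displayed formulas.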

For the last approximation step, we define K¯(X,X):=bσ2XX+(1bσ2)Id\bar{K}(X,X):=b_{\sigma}^{2}X^{\top}X+(1-b_{\sigma}^{2})\operatorname{Id} and

(7.24) K¯λ:=bσ2XX+(1+λbσ2)Id.\bar{K}_{\lambda}:=b_{\sigma}^{2}X^{\top}X+(1+\lambda-b_{\sigma}^{2})\operatorname{Id}.

We aim to replace KλK_{\lambda} by K¯λ\bar{K}_{\lambda} in L¯1\bar{L}_{1} and L¯2\bar{L}_{2}. Recalling the identity (7.2), we have

Kλ1K¯λ1=K¯λ1(K¯(X,X)K(X,X))Kλ1.\displaystyle K_{\lambda}^{-1}-\bar{K}_{\lambda}^{-1}=\bar{K}_{\lambda}^{-1}\left(\bar{K}(X,X)-K(X,X)\right)K_{\lambda}^{-1}.

Since σ\sigma is not a linear function, its Hermite expansion contains a nonzero coefficient of degree other than one, and hence 1bσ2>01-b_{\sigma}^{2}>0. Then, with (7.4), the proof of Lemma 5.4 indicates

(7.25) Kλ1K¯λ1FCn2εn4+nεn2,\displaystyle\left\|K_{\lambda}^{-1}-\bar{K}_{\lambda}^{-1}\right\|_{F}\leq C\sqrt{n^{2}\varepsilon_{n}^{4}+n\varepsilon_{n}^{2}},

where we apply the fact that λmin(K¯(X,X))1bσ2>0\lambda_{\min}(\bar{K}(X,X))\geq 1-b_{\sigma}^{2}>0. Let us denote

(7.26) L10:=\displaystyle L_{1}^{0}:=~{} bσ4σ𝜷2nd0trK¯λ1XXK¯λ1XX+bσ4σϵ2nd0trK¯λ1XXK¯λ1,\displaystyle\frac{b_{\sigma}^{4}\sigma_{\boldsymbol{\beta}}^{2}n}{d_{0}}\operatorname{tr}\bar{K}_{\lambda}^{-1}X^{\top}X\bar{K}_{\lambda}^{-1}X^{\top}X+\frac{b_{\sigma}^{4}\sigma_{\boldsymbol{\epsilon}}^{2}n}{d_{0}}\operatorname{tr}\bar{K}_{\lambda}^{-1}X^{\top}X\bar{K}_{\lambda}^{-1},
(7.27) L20:=\displaystyle L_{2}^{0}:=~{} bσ2σ𝜷2nd0trK¯λ1XX.\displaystyle\frac{b_{\sigma}^{2}\sigma_{\boldsymbol{\beta}}^{2}n}{d_{0}}\operatorname{tr}\bar{K}_{\lambda}^{-1}X^{\top}X.

Notice that for any matrices A,Bn×nA,B\in\mathbb{R}^{n\times n}, we have ABFABF\|AB\|_{F}\leq\|A\|\|B\|_{F} and |Tr(AB)|AFBF|\operatorname{Tr}(AB)|\leq\|A\|_{F}\|B\|_{F}. Then, with the help of (7.25) and uniform bounds on the spectral norms of XXX^{\top}X, Kλ1K_{\lambda}^{-1} and K¯λ1\bar{K}_{\lambda}^{-1}, we obtain that

|L¯1L10|\displaystyle|\bar{L}_{1}-L^{0}_{1}|\leq~{} bσ4σ𝜷2d0|TrKλ1XX(Kλ1K¯λ1)XX|+bσ4σ𝜷2d0|Tr(Kλ1K¯λ1)XXK¯λ1XX|\displaystyle\frac{b_{\sigma}^{4}\sigma_{\boldsymbol{\beta}}^{2}}{d_{0}}\left|\operatorname{Tr}K_{\lambda}^{-1}X^{\top}X(K_{\lambda}^{-1}-\bar{K}_{\lambda}^{-1})X^{\top}X\right|+\frac{b_{\sigma}^{4}\sigma_{\boldsymbol{\beta}}^{2}}{d_{0}}\left|\operatorname{Tr}(K_{\lambda}^{-1}-\bar{K}_{\lambda}^{-1})X^{\top}X\bar{K}_{\lambda}^{-1}X^{\top}X\right|
+bσ4σϵ2d0|Tr(Kλ1K¯λ1)XXK¯λ1|+bσ4σϵ2d0|TrKλ1XX(Kλ1K¯λ1)|\displaystyle+\frac{b_{\sigma}^{4}\sigma_{\boldsymbol{\epsilon}}^{2}}{d_{0}}\left|\operatorname{Tr}(K_{\lambda}^{-1}-\bar{K}_{\lambda}^{-1})X^{\top}X\bar{K}_{\lambda}^{-1}\right|+\frac{b_{\sigma}^{4}\sigma_{\boldsymbol{\epsilon}}^{2}}{d_{0}}\left|\operatorname{Tr}K_{\lambda}^{-1}X^{\top}X(K_{\lambda}^{-1}-\bar{K}_{\lambda}^{-1})\right|
\displaystyle\leq~{} Cnd0Kλ1K¯λ1FCnd0nεn4+εn20,\displaystyle\frac{C\sqrt{n}}{d_{0}}\left\|K_{\lambda}^{-1}-\bar{K}_{\lambda}^{-1}\right\|_{F}\leq C\frac{n}{d_{0}}\sqrt{n\varepsilon_{n}^{4}+\varepsilon_{n}^{2}}\to 0,

as nn\to\infty, n/d0γn/d_{0}\to\gamma and nεn40n\varepsilon_{n}^{4}\to 0. Combining all the approximations above, we conclude that LiL_{i} and Li0L_{i}^{0} have the same limit in probability for i=1,2i=1,2. On the other hand, based on the assumptions on XX and the definitions in (7.24), (7.26) and (7.27), it is not hard to check that

limnL10=\displaystyle\lim_{n\to\infty}L_{1}^{0}=~{} bσ4σ𝜷2γx2(bσ2x+1+λbσ2)2𝑑μ0(x)+bσ4σϵ2γx(bσ2x+1+λbσ2)2𝑑μ0(x),\displaystyle b_{\sigma}^{4}\sigma_{\boldsymbol{\beta}}^{2}\gamma\int_{\mathbb{R}}\frac{x^{2}}{(b_{\sigma}^{2}x+1+\lambda-b_{\sigma}^{2})^{2}}d\mu_{0}(x)+b_{\sigma}^{4}\sigma_{\boldsymbol{\epsilon}}^{2}\gamma\int_{\mathbb{R}}\frac{x}{(b_{\sigma}^{2}x+1+\lambda-b_{\sigma}^{2})^{2}}d\mu_{0}(x),
limnL20=\displaystyle\lim_{n\to\infty}L_{2}^{0}=~{} bσ2σ𝜷2γxbσ2x+1+λbσ2𝑑μ0(x).\displaystyle b_{\sigma}^{2}\sigma_{\boldsymbol{\beta}}^{2}\gamma\int_{\mathbb{R}}\frac{x}{b_{\sigma}^{2}x+1+\lambda-b_{\sigma}^{2}}d\mu_{0}(x).

Therefore, L1L_{1} and L2L_{2} converge in probability to the above limits, respectively, as nn\to\infty. Finally, we apply the concentration of the quadratic form 𝜷𝜷\boldsymbol{\beta}^{*\top}\boldsymbol{\beta}^{*} in (7.23) to obtain 1d0𝜷2σ𝜷2\frac{1}{d_{0}}\|\boldsymbol{\beta}^{*}\|^{2}\xrightarrow[]{\mathbb{P}}\sigma^{2}_{\boldsymbol{\beta}}. Then, by (7.22), we obtain the limit in (2.28) for the test error (f^λ(RF))\mathcal{L}(\hat{f}^{(RF)}_{\lambda}). As a byproduct, L10L_{1}^{0} and L20L_{2}^{0} also provide an nn-dependent deterministic equivalent of (f^λ(RF))\mathcal{L}(\hat{f}^{(RF)}_{\lambda}).
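As a quick sanity check on these formulas, note that the traces in (7.26)–(7.27) are integrals of rational functions against the empirical spectrum of X^{\top}X, since \bar{K}_{\lambda} is a polynomial in X^{\top}X; this is precisely the mechanism behind the limits above. The following minimal NumPy sketch verifies this identity for an illustrative Gaussian data matrix and placeholder values of b_{\sigma}^{2}, \sigma_{\boldsymbol{\beta}}^{2}, \sigma_{\boldsymbol{\epsilon}}^{2} and \lambda, and also evaluates the analogous training-error quantities appearing in (7.28)–(7.29) below.

import numpy as np

# The traces in (7.26)-(7.27) equal spectral integrals against the empirical
# spectrum of X^T X, because \bar{K}_lambda is a polynomial in X^T X.
# All constants below are placeholders chosen only for illustration.
rng = np.random.default_rng(0)
n, d0 = 400, 600
b2, sb2, se2, lam = 0.6, 1.0, 0.25, 0.1          # b_sigma^2, sigma_beta^2, sigma_eps^2, lambda
gamma = n / d0

X = rng.standard_normal((d0, n)) / np.sqrt(d0)   # columns have approximately unit norm
G = X.T @ X
Kbar = b2 * G + (1 + lam - b2) * np.eye(n)       # \bar{K}_lambda in (7.24)
Kinv = np.linalg.inv(Kbar)
tr = lambda A: np.trace(A) / n                   # normalized trace

L1_trace = b2**2 * sb2 * gamma * tr(Kinv @ G @ Kinv @ G) + b2**2 * se2 * gamma * tr(Kinv @ G @ Kinv)
L2_trace = b2 * sb2 * gamma * tr(Kinv @ G)

eig = np.linalg.eigvalsh(G)                      # empirical spectrum of X^T X
den = b2 * eig + 1 + lam - b2
L1_spec = b2**2 * sb2 * gamma * np.mean(eig**2 / den**2) + b2**2 * se2 * gamma * np.mean(eig / den**2)
L2_spec = b2 * sb2 * gamma * np.mean(eig / den)
print(abs(L1_trace - L1_spec), abs(L2_trace - L2_spec))   # both of order machine precision

# The same eigenvalues also evaluate the training-error quantities in (7.28)-(7.29).
train_beta = sb2 * lam**2 * np.mean(eig / den**2)
train_eps = se2 * lam**2 * np.mean(1.0 / den**2)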

Analogously, thanks to Lemma 7.2, the training error Etrain(K,λ)=λ2n𝒚Kλ2𝒚E_{\textnormal{train}}^{(K,\lambda)}=\frac{\lambda^{2}}{n}\boldsymbol{y}^{\top}K_{\lambda}^{-2}\boldsymbol{y} concentrates around its expectation with respect to 𝜷\boldsymbol{\beta}^{*} and ϵ\boldsymbol{\epsilon}, namely σ𝜷2λ2trKλ2XX+σϵ2λ2trKλ2\sigma_{\boldsymbol{\beta}}^{2}\lambda^{2}\operatorname{tr}K_{\lambda}^{-2}X^{\top}X+\sigma_{\boldsymbol{\epsilon}}^{2}\lambda^{2}\operatorname{tr}K_{\lambda}^{-2}. Moreover, because of (7.25), we can further replace Kλ2K_{\lambda}^{-2} with K¯λ2\bar{K}_{\lambda}^{-2}, where K¯λ\bar{K}_{\lambda} is defined in (7.24). Hence, asymptotically,

|Etrain(K,λ)σ𝜷2λ2trK¯λ2XXσϵ2λ2trK¯λ2|0,\left|E_{\textnormal{train}}^{(K,\lambda)}-\sigma_{\boldsymbol{\beta}}^{2}\lambda^{2}\operatorname{tr}\bar{K}_{\lambda}^{-2}X^{\top}X-\sigma_{\boldsymbol{\epsilon}}^{2}\lambda^{2}\operatorname{tr}\bar{K}_{\lambda}^{-2}\right|\xrightarrow[]{\mathbb{P}}0,

where as n,d0n,d_{0}\to\infty,

(7.28) limnσ𝜷2λ2trK¯λ2XX=\displaystyle\lim_{n\to\infty}\sigma_{\boldsymbol{\beta}}^{2}\lambda^{2}\operatorname{tr}\bar{K}_{\lambda}^{-2}X^{\top}X=~{} σ𝜷2λ2x(bσ2x+1+λbσ2)2𝑑μ0(x),\displaystyle\sigma_{\boldsymbol{\beta}}^{2}\lambda^{2}\int_{\mathbb{R}}\frac{x}{(b_{\sigma}^{2}x+1+\lambda-b_{\sigma}^{2})^{2}}d\mu_{0}(x),
(7.29) limnσϵ2λ2trK¯λ2=\displaystyle\lim_{n\to\infty}\sigma_{\boldsymbol{\epsilon}}^{2}\lambda^{2}\operatorname{tr}\bar{K}_{\lambda}^{-2}=~{} σϵ2λ21(bσ2x+1+λbσ2)2𝑑μ0(x).\displaystyle\sigma_{\boldsymbol{\epsilon}}^{2}\lambda^{2}\int_{\mathbb{R}}\frac{1}{(b_{\sigma}^{2}x+1+\lambda-b_{\sigma}^{2})^{2}}d\mu_{0}(x).

The last two limits follow from μ0=limspecXX\mu_{0}=\operatorname{lim\,spec}X^{\top}X as n,d0n,d_{0}\to\infty. Therefore, by (7.21), we obtain our final result (2.27) in Theorem 2.17. ∎

Acknowledgements

Z.W. is partially supported by NSF DMS-2055340 and NSF DMS-2154099. Y.Z. is partially supported by NSF-Simons Research Collaborations on the Mathematical and Scientific Foundations of Deep Learning. This material is based upon work supported by the National Science Foundation under Grant No. DMS-1928930 while Y.Z. was in residence at the Mathematical Sciences Research Institute in Berkeley, California, during the Fall 2021 semester for the program “Universality and Integrability in Random Matrix Theory and Interacting Particle Systems”. Z.W. would like to thank Denny Wu for his valuable suggestions and comments. Both authors would like to thank Lucas Benigni, Ioana Dumitriu, and Kameron Decker Harris for their helpful discussion.

Appendix A Auxiliary lemmas

Lemma A.1 (Equation (3.7.9) in [Joh90]).

Let A,BA,B be two n×nn\times n matrices with AA positive semidefinite, and let ABA\odot B denote the Hadamard product of AA and BB. Then,

(A.1) ABmaxi,j|Aij|B.\|A\odot B\|\leq\max_{i,j}|A_{ij}|\cdot\|B\|.
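A quick numerical illustration of (A.1) on a random instance (a NumPy sketch, with arbitrary matrix sizes):

import numpy as np

# Check the Hadamard-product bound (A.1): A is positive semidefinite,
# B is arbitrary, and the norm is the spectral norm.
rng = np.random.default_rng(1)
n = 50
M = rng.standard_normal((n, n))
A = M @ M.T                                   # positive semidefinite
B = rng.standard_normal((n, n))
lhs = np.linalg.norm(A * B, 2)                # spectral norm of the Hadamard product A o B
rhs = np.abs(A).max() * np.linalg.norm(B, 2)  # max_{i,j}|A_ij| * ||B||
assert lhs <= rhs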
Lemma A.2 (Sherman–Morrison formula, [Bar51]).

Suppose An×nA\in\mathbb{R}^{n\times n} is an invertible square matrix and 𝐮,𝐯n\mathbf{u},\mathbf{v}\in\mathbb{R}^{n} are column vectors. Then

(A.2) (A+𝐮𝐯)1=A1A1𝐮𝐯A11+𝐯A1𝐮.\displaystyle(A+\mathbf{u}\mathbf{v}^{\top})^{-1}=A^{-1}-\frac{A^{-1}\mathbf{u}\mathbf{v}^{\top}A^{-1}}{1+\mathbf{v}^{\top}A^{-1}\mathbf{u}}.
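A short numerical check of (A.2) on a random instance (a NumPy sketch):

import numpy as np

# Verify the Sherman-Morrison formula (A.2) numerically.
rng = np.random.default_rng(2)
n = 6
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned invertible matrix
u, v = rng.standard_normal(n), rng.standard_normal(n)
Ainv = np.linalg.inv(A)
lhs = np.linalg.inv(A + np.outer(u, v))
rhs = Ainv - np.outer(Ainv @ u, v @ Ainv) / (1.0 + v @ Ainv @ u)
print(np.max(np.abs(lhs - rhs)))                  # of order machine precision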
Lemma A.3 (Theorem A.45 in [BS10]).

Let A,BA,B be two n×nn\times n Hermitian matrices. Then AA and BB have the same limiting spectral distribution if AB0\|A-B\|\to 0 as nn\to\infty.

Lemma A.4 (Theorem B.11 in [BS10]).

Let z=x+iv,v>0z=x+iv\in\mathbb{C},v>0 and s(z)s(z) be the Stieltjes transform of a probability measure. Then |Res(z)|v1/2Ims(z)|\operatorname{Re}s(z)|\leq v^{-1/2}\sqrt{\operatorname{Im}s(z)}.

Lemma A.5 (Lemma D.2 in [NM20]).

Let 𝐱,𝐲d\mathbf{x},\mathbf{y}\in\mathbb{R}^{d} such that 𝐱=𝐲=1\|\mathbf{x}\|=\|\mathbf{y}\|=1 and 𝐰𝒩(0,Id)\boldsymbol{w}\sim\mathcal{N}(0,I_{d}). Let hjh_{j} be the jj-th normalized Hermite polynomial given in (1.5). Then

(A.3) 𝔼𝒘[hj(𝒘,𝐱)hk(𝒘,𝐲)]=δjk𝐱,𝐲k.\displaystyle\mathbb{E}_{\boldsymbol{w}}[h_{j}(\langle\boldsymbol{w},\mathbf{x}\rangle)h_{k}(\langle\boldsymbol{w},\mathbf{y}\rangle)]=\delta_{jk}\langle\mathbf{x},\mathbf{y}\rangle^{k}.
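A Monte Carlo check of (A.3) (a NumPy sketch; here h_{j}=\mathrm{He}_{j}/\sqrt{j!} is the normalized probabilists' Hermite polynomial, and the dimension, indices, and sample size are placeholders chosen only for illustration):

import numpy as np
from numpy.polynomial.hermite_e import hermeval
from math import factorial, sqrt

def h(j, t):
    # normalized probabilists' Hermite polynomial h_j = He_j / sqrt(j!)
    c = np.zeros(j + 1)
    c[j] = 1.0
    return hermeval(t, c) / sqrt(factorial(j))

rng = np.random.default_rng(3)
d, N = 5, 500_000
x = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
y = np.array([0.6, 0.8, 0.0, 0.0, 0.0])      # unit vectors with <x, y> = 0.6
W = rng.standard_normal((N, d))              # rows are samples of w ~ N(0, I_d)
for j, k in [(2, 2), (3, 3), (2, 3)]:
    est = np.mean(h(j, W @ x) * h(k, W @ y))
    exact = (x @ y) ** k if j == k else 0.0
    print(j, k, round(est, 3), exact)        # estimates match up to Monte Carlo error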
Lemma A.6 (Proposition C.2 in [FW20]).

Suppose M=U+iVn×nM=U+iV\in\mathbb{C}^{n\times n}, U,VU,V are real symmetric, and VV is invertible with σmin(V)c0>0\sigma_{\min}(V)\geq c_{0}>0. Then MM is invertible with σmin(M)c0\sigma_{\min}(M)\geq c_{0}.

Lemma A.7 (Proposition C.3 in [FW20]).

Let M,M~M,\widetilde{M} be two sequences of n×nn\times n Hermitian matrices satisfying

1nMM~F20\frac{1}{n}\|M-\widetilde{M}\|_{F}^{2}\to 0

as nn\to\infty. Suppose that, as nn\to\infty, limspecM=ν\operatorname{lim\,spec}M=\nu for a probability distribution ν\nu on \mathbb{R}, then limspecM~=ν\operatorname{lim\,spec}\widetilde{M}=\nu.

Lemma A.8.

Recall the definition of Φ\Phi in (1.3). Under Assumption 1.2, if XX is (ε,B)(\varepsilon,B)-orthonormal with sufficiently small ε\varepsilon, then for a universal constant C>0C>0 and any αβ[n]\alpha\neq\beta\in[n], we have

(A.4) |Φαβbσ2𝐱α𝐱β|\displaystyle|\Phi_{\alpha\beta}-b_{\sigma}^{2}\mathbf{x}_{\alpha}^{\top}\mathbf{x}_{\beta}|\leq~{} Cε2,\displaystyle C\varepsilon^{2},
(A.5) |𝔼𝒘[σ(𝒘𝐱α)]|\displaystyle|\mathbb{E}_{\boldsymbol{w}}[\sigma(\boldsymbol{w}^{\top}\mathbf{x}_{\alpha})]|\leq~{} Cε.\displaystyle C\varepsilon.
Proof.

When σ\sigma is twice differentiable in Assumption 1.2, this result follows from Lemma D.3 in [FW20]. When σ\sigma is a piece-wise linear function defined in case 2 of Assumption 1.2, the second inequality follows from (5.7) with t=𝐱αt=\|\mathbf{x}_{\alpha}\|. For the first inequality, the Hermite expansion of Φαβ\Phi_{\alpha\beta} is given by (5.3) with coefficients ζk(σα)=𝔼[σ(𝐱αξ)hk(ξ)]\zeta_{k}(\sigma_{\alpha})=\mathbb{E}[\sigma(\|\mathbf{x}_{\alpha}\|\xi)h_{k}(\xi)] for kk\in\mathbb{N}. Observe that the piece-wise linear function in case 2 of Assumption 1.2 satisfies

(A.6) ζk(σα)=\displaystyle\zeta_{k}(\sigma_{\alpha})=~{} 𝐱αζk(σ), for k1,\displaystyle\|\mathbf{x}_{\alpha}\|\zeta_{k}(\sigma),~{}\text{ for }~{}k\geq 1,
(A.7) ζ0(σα)=\displaystyle\zeta_{0}(\sigma_{\alpha})=~{} b(1𝐱α),\displaystyle b(1-\|\mathbf{x}_{\alpha}\|),

because of condition (1.10) for σ\sigma. Recall 𝐮α=𝐱α/𝐱α\mathbf{u}_{\alpha}=\mathbf{x}_{\alpha}/\|\mathbf{x}_{\alpha}\| and ζ1(σ)=bσ\zeta_{1}(\sigma)=b_{\sigma}. Then, analogously to the derivation of (5.10), there exists some constant C>0C>0 such that

(A.8) |Φαβbσ2𝐱α𝐱β|=\displaystyle|\Phi_{\alpha\beta}-b_{\sigma}^{2}\mathbf{x}_{\alpha}^{\top}\mathbf{x}_{\beta}|=~{} |k1ζk(σα)ζk(σβ)(𝐮α𝐮β)k|\displaystyle\left|\sum_{k\neq 1}\zeta_{k}(\sigma_{\alpha})\zeta_{k}(\sigma_{\beta})(\mathbf{u}_{\alpha}^{\top}\mathbf{u}_{\beta})^{k}\right|
(A.9) \displaystyle\leq~{} b2(1𝐱α)(1𝐱β)+|𝐱α𝐱β|2𝐱α𝐱βσL22Cε2,\displaystyle b^{2}(1-\|\mathbf{x}_{\alpha}\|)(1-\|\mathbf{x}_{\beta}\|)+\frac{|\mathbf{x}_{\alpha}^{\top}\mathbf{x}_{\beta}|^{2}}{\|\mathbf{x}_{\alpha}\|\|\mathbf{x}_{\beta}\|}\|\sigma\|_{L^{2}}^{2}\leq C\varepsilon^{2},

for ε(0,1)\varepsilon\in(0,1) and (ε,B)(\varepsilon,B)-orthonormal XX. This completes the proof. ∎

With the above lemma, the proof of Lemma D.4 in [FW20] directly yields the following lemma.

Lemma A.9.

Under the same assumptions as in Lemma A.8, there exists a constant C>0C>0 such that K(X,X)CB2\|K(X,X)\|\leq CB^{2}. Additionally, with Assumption 2.14, we have K(X~,X~)CB2\|K(\tilde{X},\tilde{X})\|\leq CB^{2}.

Appendix B Additional simulations

Figures 4 and 5 provide additional simulations for the eigenvalue distribution described in Theorem 2.1 with different activation functions and scalings. Here, we plot histograms of the empirical eigenvalue distributions of the centered CK matrices, together with the limiting spectra computed from the self-consistent equations. All input data matrices XX are standard Gaussian random matrices. Interestingly, in Figure 5, we observe an outlier emerging outside the bulk of the spectrum for the piece-wise linear activation function defined in case 2 of Assumption 1.2. Analyzing the emergence of this outlier would be an interesting direction for future work. A minimal sketch of the simulation pipeline is provided after Figure 5.

Figure 4. Simulations of the empirical eigenvalue distributions of (2.3) and the theoretical prediction (red curves) of the limiting law μ\mu for activation functions satisfying Assumption 1.2: σ(x)\sigma(x)\propto Sigmoid (first row) and the linear function σ(x)=x\sigma(x)=x (second row). Here n=103n=10^{3}, d0=103d_{0}=10^{3} and d1=105d_{1}=10^{5} (left); n=103n=10^{3}, d0=1.5×103d_{0}=1.5\times 10^{3} and d1=105d_{1}=10^{5} (middle); n=1.5×103n=1.5\times 10^{3}, d0=103d_{0}=10^{3} and d1=105d_{1}=10^{5} (right).
Figure 5. Simulations of the empirical eigenvalue distributions of (2.3) and the theoretical prediction (red curves) of the limiting law μ\mu, where the activation function σ(x)\sigma(x)\propto ReLU satisfies case 2 of Assumption 1.2. Here n=103n=10^{3}, d0=103d_{0}=10^{3} and d1=105d_{1}=10^{5} (left); n=103n=10^{3}, d0=800d_{0}=800 and d1=105d_{1}=10^{5} (middle); n=800n=800, d0=103d_{0}=10^{3} and d1=105d_{1}=10^{5} (right).
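The following is a minimal sketch of the simulation pipeline behind such histograms (NumPy/Matplotlib). For concreteness it uses the linear activation \sigma(x)=x from the second row of Figure 4, for which the expected kernel equals X^{\top}X in closed form; the centering and rescaling \sqrt{d_{1}/n}\left(\frac{1}{d_{1}}\sigma(WX)^{\top}\sigma(WX)-X^{\top}X\right) below is a stand-in and should be taken exactly as in (2.3), and the matrix sizes are reduced for speed.

import numpy as np
import matplotlib.pyplot as plt

# Sketch of the eigenvalue-histogram simulation (reduced sizes; the exact
# centering and normalization should follow (2.3)).
rng = np.random.default_rng(0)
n, d0, d1 = 500, 500, 20000

X = rng.standard_normal((d0, n)) / np.sqrt(d0)   # Gaussian input data
W = rng.standard_normal((d1, d0))                # Gaussian first-layer weights
Y = W @ X                                        # sigma(WX) with sigma(x) = x
CK = Y.T @ Y / d1                                # empirical conjugate kernel
Phi = X.T @ X                                    # E_W[CK] for the linear activation
A = np.sqrt(d1 / n) * (CK - Phi)                 # centered, rescaled CK matrix

plt.hist(np.linalg.eigvalsh(A), bins=60, density=True)
plt.xlabel("eigenvalue")
plt.ylabel("density")
plt.show()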

References

  • [Ada15] Radoslaw Adamczak. A note on the Hanson-Wright inequality for random vectors with dependencies. Electronic Communications in Probability, 20, 2015.
  • [ADH+19a] Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, pages 322–332. PMLR, 2019.
  • [ADH+19b] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 8141–8150, 2019.
  • [AGZ10] Greg W Anderson, Alice Guionnet, and Ofer Zeitouni. An introduction to random matrices. Cambridge university press, 2010.
  • [AKM+17] Haim Avron, Michael Kapralov, Cameron Musco, Christopher Musco, Ameya Velingker, and Amir Zandieh. Random Fourier features for kernel ridge regression: Approximation bounds and statistical guarantees. In International Conference on Machine Learning, pages 253–262. PMLR, 2017.
  • [ALP22] Ben Adlam, Jake A Levinson, and Jeffrey Pennington. A random matrix perspective on mixtures of nonlinearities in high dimensions. In International Conference on Artificial Intelligence and Statistics, pages 3434–3457. PMLR, 2022.
  • [AP20] Ben Adlam and Jeffrey Pennington. The neural tangent kernel in high dimensions: Triple descent and a multi-scale theory of generalization. In International Conference on Machine Learning, pages 74–84. PMLR, 2020.
  • [AS17] Guillaume Aubrun and Stanisław J Szarek. Alice and Bob meet Banach, volume 223. American Mathematical Soc., 2017.
  • [Aub12] Guillaume Aubrun. Partial transposition of random states and non-centered semicircular distributions. Random Matrices: Theory and Applications, 1(02):1250001, 2012.
  • [AZLS19] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pages 242–252, 2019.
  • [Bac13] Francis Bach. Sharp analysis of low-rank kernel matrix approximations. In Conference on Learning Theory, pages 185–209. PMLR, 2013.
  • [Bac17] Francis Bach. On the equivalence between kernel quadrature rules and random feature expansions. The Journal of Machine Learning Research, 18(1):714–751, 2017.
  • [Bao12] Zhigang Bao. Strong convergence of ESD for the generalized sample covariance matrices when p/np/n\to 0. Statistics & Probability Letters, 82(5):894–901, 2012.
  • [Bar51] Maurice S Bartlett. An inverse matrix adjustment arising in discriminant analysis. The Annals of Mathematical Statistics, 22(1):107–111, 1951.
  • [BLM13] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013.
  • [BMR21] Peter L Bartlett, Andrea Montanari, and Alexander Rakhlin. Deep learning: a statistical viewpoint. Acta numerica, 30:87–201, 2021.
  • [BP21] Lucas Benigni and Sandrine Péché. Eigenvalue distribution of some nonlinear models of random matrices. Electronic Journal of Probability, 26:1–37, 2021.
  • [BS10] Zhidong Bai and Jack W Silverstein. Spectral analysis of large dimensional random matrices, volume 20. Springer, 2010.
  • [BY88] Zhidong Bai and Y. Q. Yin. Convergence to the semicircle law. The Annals of Probability, pages 863–875, 1988.
  • [BZ10] ZD Bai and LX Zhang. The limiting spectral distribution of the product of the Wigner matrix and a nonnegative definite matrix. Journal of Multivariate Analysis, 101(9):1927–1949, 2010.
  • [CH23] Benoit Collins and Tomohiro Hayase. Asymptotic freeness of layerwise Jacobians caused by invariance of multilayer perceptron: The Haar orthogonal case. Communications in Mathematical Physics, 397(1):85–109, 2023.
  • [COB19] Lénaïc Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. Advances in Neural Information Processing Systems, 32:2937–2947, 2019.
  • [CP12] Binbin Chen and Guangming Pan. Convergence of the largest eigenvalue of normalized sample covariance matrices when pp and nn both tend to infinity with their ratio converging to zero. Bernoulli, 18(4):1405–1420, 2012.
  • [CP15] Binbin Chen and Guangming Pan. CLT for linear spectral statistics of normalized sample covariance matrices with the dimension much larger than the sample size. Bernoulli, 21(2):1089–1133, 2015.
  • [CS09] Youngmin Cho and Lawrence K Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems, pages 342–350, 2009.
  • [CYZ18] Benoît Collins, Zhi Yin, and Ping Zhong. The PPT square conjecture holds generically for some classes of independent states. Journal of Physics A: Mathematical and Theoretical, 51(42):425301, 2018.
  • [DFS16] Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances In Neural Information Processing Systems, pages 2253–2261, 2016.
  • [DZPS19] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2019.
  • [Fel21] Michael J Feldman. Spiked singular values and vectors under extreme aspect ratios. arXiv preprint arXiv:2104.15127, 2021.
  • [FW20] Zhou Fan and Zhichao Wang. Spectra of the conjugate kernel and neural tangent kernel for linear-width neural networks. In Advances in Neural Information Processing Systems, volume 33, pages 7710–7721. Curran Associates, Inc., 2020.
  • [GKZ19] David Gamarnik, Eren C Kızıldağ, and Ilias Zadik. Stationary points of shallow neural networks with quadratic activation function. arXiv preprint arXiv:1912.01599, 2019.
  • [GLBP21] Jungang Ge, Ying-Chang Liang, Zhidong Bai, and Guangming Pan. Large-dimensional random matrix theory and its applications in deep learning and wireless communications. Random Matrices: Theory and Applications, page 2230001, 2021.
  • [GLK+20] Federica Gerace, Bruno Loureiro, Florent Krzakala, Marc Mézard, and Lenka Zdeborová. Generalisation error in learning with random features and the hidden manifold model. In International Conference on Machine Learning, pages 3452–3462. PMLR, 2020.
  • [GMMM19] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Limitations of lazy training of two-layers neural networks. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 9111–9121, 2019.
  • [GZR22] Diego Granziol, Stefan Zohren, and Stephen Roberts. Learning rates as a function of batch size: A random matrix theory approach to neural network training. J. Mach. Learn. Res, 23:1–65, 2022.
  • [HK21] Tomohiro Hayase and Ryo Karakida. The spectrum of Fisher information of deep networks achieving dynamical isometry. In International Conference on Artificial Intelligence and Statistics, pages 334–342. PMLR, 2021.
  • [HL22] Hong Hu and Yue M Lu. Universality laws for high-dimensional learning with random features. IEEE Transactions on Information Theory, 2022.
  • [HW71] David Lee Hanson and Farroll Tim Wright. A bound on tail probabilities for quadratic forms in independent random variables. The Annals of Mathematical Statistics, 42(3):1079–1083, 1971.
  • [HXAP20] Wei Hu, Lechao Xiao, Ben Adlam, and Jeffrey Pennington. The surprising simplicity of the early-time learning dynamics of neural networks. In Advances in Neural Information Processing Systems, volume 33, pages 17116–17128. Curran Associates, Inc., 2020.
  • [JGH18] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: convergence and generalization in neural networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 8580–8589, 2018.
  • [Jia04] Tiefeng Jiang. The limiting distributions of eigenvalues of sample correlation matrices. Sankhyā: The Indian Journal of Statistics, pages 35–48, 2004.
  • [Joh90] C.R. Johnson. Matrix Theory and Applications. AMS Short Course Lecture Notes. American Mathematical Society, 1990.
  • [JSS+20] Arthur Jacot, Berfin Simsek, Francesco Spadaro, Clément Hongler, and Franck Gabriel. Implicit regularization of random feature models. In International Conference on Machine Learning, pages 4631–4640. PMLR, 2020.
  • [LBN+18] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as Gaussian processes. In International Conference on Learning Representations, 2018.
  • [LC18] Zhenyu Liao and Romain Couillet. On the spectrum of random features maps of high dimensional data. In International Conference on Machine Learning, pages 3063–3071. PMLR, 2018.
  • [LCM20] Zhenyu Liao, Romain Couillet, and Michael W. Mahoney. A random matrix analysis of random Fourier features: beyond the Gaussian kernel, a precise phase transition, and the corresponding double descent. In 34th Conference on Neural Information Processing Systems, 2020.
  • [LD21] Licong Lin and Edgar Dobriban. What causes the test error? Going beyond bias-variance via ANOVA. Journal of Machine Learning Research, 22(155):1–82, 2021.
  • [LGC+21] Bruno Loureiro, Cédric Gerbelot, Hugo Cui, Sebastian Goldt, Florent Krzakala, Marc Mézard, and Lenka Zdeborová. Learning curves of generic features maps for realistic datasets with a teacher-student model. Advances in Neural Information Processing Systems, 34, 2021.
  • [LLC18] Cosme Louart, Zhenyu Liao, and Romain Couillet. A random matrix approach to neural networks. The Annals of Applied Probability, 28(2):1190–1248, 2018.
  • [LLS21] Fanghui Liu, Zhenyu Liao, and Johan Suykens. Kernel regression in high dimensions: Refined analysis beyond double descent. In International Conference on Artificial Intelligence and Statistics, pages 649–657. PMLR, 2021.
  • [LR20] Tengyuan Liang and Alexander Rakhlin. Just interpolate: Kernel “ridgeless” regression can generalize. The Annals of Statistics, 48(3):1329–1347, 2020.
  • [LRZ20] Tengyuan Liang, Alexander Rakhlin, and Xiyu Zhai. On the multiple descent of minimum-norm interpolants and restricted lower isometry of kernels. In Conference on Learning Theory, pages 2683–2711. PMLR, 2020.
  • [LY16] Zeng Li and Jianfeng Yao. Testing the sphericity of a covariance matrix when the dimension is much larger than the sample size. Electronic Journal of Statistics, 10(2):2973–3010, 2016.
  • [MHR+18] Alexander G de G Matthews, Jiri Hron, Mark Rowland, Richard E Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations, 2018.
  • [MM19] Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics, 2019.
  • [MMM21] Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Generalization error of random feature and kernel methods: hypercontractivity and kernel matrix concentration. Applied and Computational Harmonic Analysis, 2021.
  • [MZ22] Andrea Montanari and Yiqiao Zhong. The interpolation phase transition in neural networks: Memorization and generalization under lazy training. The Annals of Statistics, 50(5):2816–2847, 2022.
  • [Nea95] Radford M Neal. Bayesian learning for neural networks. PhD thesis, University of Toronto, 1995.
  • [Ngu21] Quynh Nguyen. On the proof of global convergence of gradient descent for deep relu networks with linear widths. In International Conference on Machine Learning, pages 8056–8062. PMLR, 2021.
  • [NM20] Quynh Nguyen and Marco Mondelli. Global convergence of deep networks with one wide layer followed by pyramidal topology. In 34th Conference on Neural Information Processing Systems, volume 33, 2020.
  • [NMM21] Quynh Nguyen, Marco Mondelli, and Guido F Montufar. Tight bounds on the smallest eigenvalue of the neural tangent kernel for deep relu networks. In International Conference on Machine Learning, pages 8119–8129. PMLR, 2021.
  • [NS06] Alexandru Nica and Roland Speicher. Lectures on the combinatorics of free probability, volume 13. Cambridge University Press, 2006.
  • [OS20] Samet Oymak and Mahdi Soltanolkotabi. Toward moderate overparameterization: Global convergence guarantees for training shallow neural networks. IEEE Journal on Selected Areas in Information Theory, 1(1):84–105, 2020.
  • [Péc19] S Péché. A note on the Pennington-Worah distribution. Electronic Communications in Probability, 24, 2019.
  • [PLR+16] Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems, pages 3360–3368, 2016.
  • [PS21] Vanessa Piccolo and Dominik Schröder. Analysis of one-hidden-layer neural networks via the resolvent method. Advances in Neural Information Processing Systems, 34, 2021.
  • [PSG17] Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. Advances in neural information processing systems, 30, 2017.
  • [PSG18] Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. The emergence of spectral universality in deep networks. In International Conference on Artificial Intelligence and Statistics, pages 1924–1932. PMLR, 2018.
  • [PW17] Jeffrey Pennington and Pratik Worah. Nonlinear random matrix theory for deep learning. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
  • [QLY21] Jiaxin Qiu, Zeng Li, and Jianfeng Yao. Asymptotic normality for eigenvalue statistics of a general sample covariance matrix when p/np/n\to\infty and applications. arXiv preprint arXiv:2109.06701, 2021.
  • [RR07] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Proceedings of the 20th International Conference on Neural Information Processing Systems, pages 1177–1184, 2007.
  • [RR17] Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
  • [RV13] Mark Rudelson and Roman Vershynin. Hanson-wright inequality and sub-gaussian concentration. Electronic Communications in Probability, 18:1–9, 2013.
  • [SGGSD17] Samuel S Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep information propagation. In International Conference on Learning Representations, 2017.
  • [Sil85] Jack W Silverstein. The smallest eigenvalue of a large dimensional Wishart matrix. The Annals of Probability, pages 1364–1368, 1985.
  • [SY19] Zhao Song and Xin Yang. Quadratic suffices for over-parametrization via matrix Chernoff bound. arXiv preprint arXiv:1906.03593, 2019.
  • [Tro12] Joel A Tropp. User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics, 12(4):389–434, 2012.
  • [Ver18] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.
  • [Voi87] Dan Voiculescu. Multiplication of certain non-commuting random variables. Journal of Operator Theory, pages 223–235, 1987.
  • [WDW19] Xiaoxia Wu, Simon S Du, and Rachel Ward. Global convergence of adaptive gradient methods for an over-parameterized neural network. arXiv preprint arXiv:1902.07111, 2019.
  • [Wil97] Christopher KI Williams. Computing with infinite networks. In Advances in Neural Information Processing Systems, pages 295–301, 1997.
  • [WP14] Lili Wang and Debashis Paul. Limiting spectral distribution of renormalized separable sample covariance matrices when p/n0p/n\to 0. Journal of Multivariate Analysis, 126:25–52, 2014.
  • [WZ22] Zhichao Wang and Yizhe Zhu. Overparameterized random feature regression with nearly orthogonal data. arXiv preprint arXiv:2211.06077, 2022.
  • [XBSD+18] Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel Schoenholz, and Jeffrey Pennington. Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. In International Conference on Machine Learning, pages 5393–5402. PMLR, 2018.
  • [Xie13] Junshan Xie. Limiting spectral distribution of normalized sample covariance matrices with p/n0p/n\to 0. Statistics & Probability Letters, 83:543–550, 2013.
  • [YBM21] Zitong Yang, Yu Bai, and Song Mei. Exact gap between generalization error and uniform convergence in random feature models. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 11704–11715. PMLR, 18–24 Jul 2021.
  • [YXZ22] Long Yu, Jiahui Xie, and Wang Zhou. Testing Kronecker product covariance matrices for high-dimensional matrix-variate data. Biometrika, 11 2022. asac063.