Deformed semicircle law and concentration of nonlinear random matrices for ultra-wide neural networks
Abstract.
In this paper, we investigate a two-layer fully connected neural network of the form $f(X)=\frac{1}{\sqrt{d_1}}\boldsymbol{a}^\top\sigma(WX)$, where $X\in\mathbb{R}^{d_0\times n}$ is a deterministic data matrix, $W\in\mathbb{R}^{d_1\times d_0}$ and $\boldsymbol{a}\in\mathbb{R}^{d_1}$ are random Gaussian weights, and $\sigma$ is a nonlinear activation function. We study the limiting spectral distributions of two empirical kernel matrices associated with $f(X)$: the empirical conjugate kernel (CK) and neural tangent kernel (NTK), beyond the linear-width regime ($d_1\asymp n$). We focus on the ultra-wide regime, where the width $d_1$ of the first layer is much larger than the sample size $n$. Under appropriate assumptions on $X$ and $\sigma$, a deformed semicircle law emerges as $d_1/n\to\infty$ and $n\to\infty$. We first prove this limiting law for generalized sample covariance matrices with some dependency. To specify it for our neural network model, we provide a nonlinear Hanson-Wright inequality that is suitable for neural networks with random weights and Lipschitz activation functions. We also demonstrate non-asymptotic concentrations of the empirical CK and NTK around their limiting kernels in the spectral norm, along with lower bounds on their smallest eigenvalues. As an application, we show that random feature regression induced by the empirical kernel achieves the same asymptotic performance as its limiting kernel regression in the ultra-wide regime. This allows us to compute the asymptotic training and test errors for random feature regression using the corresponding kernel regression.
1. Introduction
Nowadays, deep neural networks have become one of the leading models in machine learning, and many theoretical results have been established to understand the training and generalization of neural networks. Among them, two kernel matrices are prominent in deep learning theory: the Conjugate Kernel (CK) [Nea95, Wil97, RR07, CS09, DFS16, PLR+16, SGGSD17, LBN+18, MHR+18] and the Neural Tangent Kernel (NTK) [JGH18, DZPS19, AZLS19]. The CK matrix, defined in (1.5), is the Gram matrix of the output of the last hidden layer on the training dataset and has been exploited to study the generalization of random feature regression. The NTK matrix, defined in (1.8), is the Gram matrix of the Jacobian of the neural network with respect to the trainable parameters, and it characterizes the performance of a wide neural network trained by gradient flow. Both are related to kernel machines and help us explore the generalization and training process of neural networks.
We are interested in the behaviors of CK and NTK matrices at random initialization. A recent line of work has proved that these two random kernel matrices will converge to their expectations when the width of the network becomes infinitely wide [JGH18, ADH+19b]. Although CK and NTK are usually referred to as these expected kernels in literature, we will always call CK and NTK the empirical kernel matrices in this paper, with a slight abuse of terminology.
In this paper, we study the random CK and NTK matrices of a two-layer fully connected neural network with input data $X\in\mathbb{R}^{d_0\times n}$, given by
(1.1)  $f(X)=\frac{1}{\sqrt{d_1}}\boldsymbol{a}^\top\sigma(WX)$,
where $W\in\mathbb{R}^{d_1\times d_0}$ is the weight matrix for the first layer, $\boldsymbol{a}\in\mathbb{R}^{d_1}$ collects the second-layer weights, and $\sigma$ is a nonlinear activation function applied to the matrix entrywise. We assume that all entries of $W$ and $\boldsymbol{a}$ are independently and identically distributed as standard Gaussian $\mathcal{N}(0,1)$. We will always view the input data $X$ as a deterministic matrix (independent of the random weights in $W$ and $\boldsymbol{a}$) satisfying certain assumptions.
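To fix ideas, the following minimal numerical sketch instantiates (1.1) at random initialization. The dimensions, the particular shifted and rescaled ReLU, and the unit-normalized data columns are illustrative choices and not taken from the paper; only the structure $f(X)=\boldsymbol{a}^\top\sigma(WX)/\sqrt{d_1}$ and the Gram matrix of the hidden-layer output are taken from the setup above.

```python
import numpy as np

rng = np.random.default_rng(0)
d0, d1, n = 50, 4000, 200          # input dimension, width (d1 >> n), sample size

# Data with (approximately) unit-norm, nearly orthogonal columns.
X = rng.standard_normal((d0, n))
X /= np.linalg.norm(X, axis=0, keepdims=True)

# Random Gaussian weights as in Assumption 1.1.
W = rng.standard_normal((d1, d0))   # first-layer weights
a = rng.standard_normal(d1)         # second-layer weights

# A centered and normalized activation (shifted, rescaled ReLU; any Lipschitz sigma works).
def sigma(t):
    r = np.maximum(t, 0.0)
    return (r - 1/np.sqrt(2*np.pi)) / np.sqrt(0.5 - 1/(2*np.pi))

Y = sigma(W @ X)                    # first-layer output, d1 x n
f_out = a @ Y / np.sqrt(d1)         # network output f(X), shape (n,)

CK = Y.T @ Y / d1                   # empirical conjugate kernel, n x n
print(f_out.shape, CK.shape)
```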
In terms of random matrix theory, we study the difference between these two kernel matrices (CK and NTK) and their expectations with respect to the random weights, showing both asymptotic and non-asymptotic behaviors of these differences when the width $d_1$ of the first hidden layer grows faster than the number of samples $n$. As an extension of [FW20], we prove that when $n/d_1\to 0$, the centered CK and NTK with appropriate normalization have a limiting eigenvalue distribution given by a deformed semicircle law, determined by the spectrum of the training data and the nonlinear activation function. To prove this global law, we further set up a limiting law theorem for centered sample covariance matrices with dependent structures and a nonlinear version of the Hanson-Wright inequality. These two results are very general, which makes them potentially applicable to scenarios beyond our neural network model. For the non-asymptotic analysis, we establish concentration inequalities between the random kernel matrices and their expectations. As a byproduct, we provide lower bounds on the smallest eigenvalues of the CK and NTK, which are essential for the global convergence of gradient-based optimization methods when training a wide neural network [OS20, NM20, Ngu21]. Because of the non-asymptotic results for kernel matrices, we can also quantify how close the performance of random feature regression is to that of the limiting kernel regression for a general dataset, which allows us to compute the limiting training error and generalization error of random feature regression via its corresponding kernel regression in the ultra-wide regime.
1.1. Nonlinear random matrix theory in neural networks
Recently, the limiting spectra of CK and NTK at random initialization have received increasing attention from a random matrix theory perspective. Most of the papers focus on the linear-width regime , using both the moment method and Stieltjes transforms. Based on moment methods, [PW17] first computed the limiting law of the CK for two-layer neural networks with centered nonlinear activation functions, which is further described as a deformed Marchenko–Pastur law in [Péc19]. This result has been extended to sub-Gaussian weights and input data with real analytic activation functions by [BP21], even for multiple layers with some special activation functions. Later, [ALP22] generalized their results by adding a random bias vector in pre-activation and a more general input data matrix. Similar results for the two-layer model with a random bias vector and random input data were analyzed in [PS21] by cumulant expansion. In parallel, by Stieltjes transform, [LLC18] investigated the CK of a one-hidden-layer network with general deterministic input data and Lipschitz activation functions via some deterministic equivalent. [LCM20] further developed a deterministic equivalent for the Fourier feature map. With the help of the Gaussian equivalent technique and operator-valued free probability theory, the limit spectrum of NTK with one-hidden layer has been analyzed in [AP20]. Then the limit spectra of CK and NTK of a multi-layer neural network with general deterministic input data have been fully characterized in [FW20], where the limiting spectrum of CK is given by the propagation of the Marchenko–Pastur map through the network, while the NTK is approximated by the linear combination of CK’s of each hidden layer. [FW20] illustrated that the pairwise approximate orthogonality assumption on the input data is preserved in all hidden layers. Such a property is useful to approximate the expected CK and NTK. We refer to [GLBP21] as a summary of the recent development in nonlinear random matrix theory.
Most of the results in nonlinear random matrix theory focus on the case where the width $d_1$ is proportional to the sample size $n$ as $n\to\infty$. We establish random matrix results for both CK and NTK in the ultra-wide regime, where $d_1/n\to\infty$ and $n\to\infty$. Beyond its intrinsic interest, this regime exhibits the connection between wide (or overparameterized) neural networks and the kernel learning induced by the limiting kernels of CK and NTK. In this article, we will follow the general assumptions on the input data and activation function in [FW20] and study the limiting spectrum of the centered and normalized CK matrix
(1.2) |
where . Similar results for the NTK can be obtained as well. To complete the proofs, we establish a nonlinear version of Hanson-Wright inequality, which has previously appeared in [LLC18, LCM20]. This nonlinear version is a generalization of the original Hanson-Wright inequality [HW71, RV13, Ada15], and may have various applications in statistics, machine learning, and other areas. In addition, we also derive a deformed semicircle law for normalized sample covariance matrices without independence in columns. This result is of independent interest in random matrix theory as well.
1.2. General sample covariance matrices
We observe that the random matrix defined above has independent and identically-distributed rows. Hence, is a generalized sample covariance matrix. We first inspect a more general sample covariance matrix whose rows are independent copies of some random vector . Assuming and both go to infinity but , we aim to study the limiting empirical eigenvalue distribution of centered Wishart matrices in the form of (1.2) with certain conditions on . This regime is also related to the ultra-high dimensional setting in statistics [QLY21].
This regime has been studied for decades starting in [BY88], where has i.i.d. entries and . In this setting, by the moment method, one can obtain the semicircle law. This normalized model also arises in quantum theory with respect to random induced states (see [Aub12, AS17, CYZ18]). The largest eigenvalue of such a normalized sample covariance matrix has been considered in [CP12]. Subsequently, [CP15, LY16, YXZ22, QLY21] analyzed the fluctuations for the linear spectral statistics of this model and applied this result to hypothesis testing for the covariance matrix. A spiked model for sample covariance matrices in this regime was recently studied in [Fel21]. This kind of semicircle law also appears in many other random matrix models. For instance, [Jia04] showed this limiting law for normalized sample correlation matrices. Also, the semicircle law for centered sample covariance matrices has already been applied in machine learning: [GKZ19] controlled the generalization error of shallow neural networks with quadratic activation functions by the moments of this limiting semicircle law; [GZR22] derived a semicircle law of the fluctuation matrix between stochastic batch Hessian and the deterministic empirical Hessian of deep neural networks.
For general sample covariance, [WP14] considered the form with deterministic and , where consists of i.i.d. entries with mean zero and variance one. The same result has been proved in [Bao12] by generalized Stein’s method. Unlike previous results, [Xie13] tackled the general case, only assuming has independent rows with some deterministic covariance . Though this is similar to our model in Section 4, we will consider more general assumptions on each row of , which can be directly verified in our neural network models.
1.3. Infinite-width kernels and the smallest eigenvalues of empirical kernels
Besides the above asymptotic spectral fluctuation of (1.2), we provide non-asymptotic concentration of (1.2) in the spectral norm and a corresponding result for the NTK. In infinite-width networks, where $n$ and $d_0$ are fixed while $d_1\to\infty$, both CK and NTK converge to their expected kernels. This has been investigated in [DFS16, SGGSD17, LBN+18, MHR+18] for the CK and in [JGH18, DZPS19, AZLS19, ADH+19b, LRZ20] for the NTK. Such kernels are also called infinite-width kernels in the literature. In the current work, we present precise probability bounds for the concentration of CK and NTK around their infinite-width kernels, where the difference vanishes at a rate governed by $n/d_1$. Our results permit more general activation functions and input data satisfying only pairwise approximate orthogonality, although similar concentration bounds have been applied in [AKM+17, SY19, AP20, MZ22, HXAP20].
A corollary of our concentration is the explicit lower bounds of the smallest eigenvalues of the CK and the NTK. Such extreme eigenvalues of the NTK have been utilized to prove the global convergence of gradient descent algorithms of wide neural networks since the NTK governs the gradient flow in the training process, see, e.g., [COB19, DZPS19, ADH+19a, SY19, WDW19, OS20, NM20, Ngu21]. The smallest eigenvalue of NTK is also crucial for proving generalization bounds and memorization capacity in [ADH+19a, MZ22]. Analogous to Theorem 3.1 in [MZ22], our lower bounds are given by the Hermite coefficients of the activation function . Besides, the lower bound of NTK for multi-layer ReLU networks is analyzed in [NMM21].
1.4. Random feature regression and limiting kernel regression
Another byproduct of our concentration results is to measure the difference in performance between the random feature regression with respect to $\sigma(WX)$ and the corresponding kernel regression when $d_1/n\to\infty$. Random feature regression can be viewed as linear regression on the output of the last hidden layer, and its performance has been studied in, for instance, [PW17, LLC18, MM19, LCM20, GLK+20, HL22, LD21, MMM21, LGC+21] under the linear-width regime (this linear-width regime is also known as the high-dimensional regime, while our ultra-wide regime is also called the highly overparameterized regime in the literature; see [MM19]). In this regime, the CK matrix is not concentrated around its expectation
(1.3) |
under the spectral norm, where $\boldsymbol{w}$ is a standard normal random vector in $\mathbb{R}^{d_0}$. Nevertheless, the limiting spectrum of the CK is exploited to characterize the asymptotic performance and the double descent phenomenon of random feature regression when $n$ and $d_1$ grow proportionally. Several works have also utilized this regime to depict the performance of the ultra-wide random network by first letting $n,d_1\to\infty$ proportionally to obtain the asymptotic performance, and then taking the aspect ratio $d_1/n$ to infinity (see [MM19, YBM21]). However, there is still a difference between this sequential limit and the ultra-wide regime. Before these results, random feature regression had already attracted significant attention because it is a random approximation of the RKHS defined by the population kernel function such that
(1.4) |
when the width $d_1$ is sufficiently large [RR07, Bac13, RR17, Bac17]. We point out that Theorem 9 of [AKM+17] has the same order of approximation as ours, albeit only for random Fourier features.
In our work, the concentration of the empirical kernel induced by $\sigma(WX)$ around the population kernel matrix defined in (1.4) leads to control of the differences between the training/test errors of random feature regression and those of kernel regression, which were previously studied in [AKM+17, JSS+20, MZ22, MMM21] in different settings. Specifically, [JSS+20] obtained the same kind of estimate but considered random features sampled from Gaussian processes. Our results explicitly show how large the width $d_1$ should be so that random feature regression attains the same asymptotic performance as kernel regression [MMM21]. With these estimates, we can use the limiting test error of the kernel regression to predict the limiting test error of random feature regression as $n\to\infty$ and $d_1/n\to\infty$. We refer to [LR20, LRZ20, LLS21, MMM21], [BMR21, Section 4.3], and references therein for more details on high-dimensional kernel ridge/ridgeless regression. We emphasize that the optimal prediction error of random feature regression in the linear-width regime is actually achieved in the ultra-wide regime, which boils down to the limiting kernel regression; see [MM19, MMM21, YBM21, LGC+21]. This is one of the motivations for studying the ultra-wide regime and the limiting kernel regression.
In the end, we would like to mention the idea of spectral-norm approximation for the expected kernel, which helps us describe the asymptotic behavior of the limiting kernel regression. For specific activations $\sigma$, the kernel has an explicit formula, see [LLC18, LC18, LCM20], whereas in general it can be expanded in terms of the Hermite expansion of $\sigma$ [PW17, MM19, FW20]. Thanks to the pairwise approximate orthogonality introduced in [FW20, Definition 3.1], we can approximate the expected kernel in the spectral norm for general deterministic data $X$. This pairwise approximate orthogonality quantifies how close to orthogonal the different input vectors of $X$ are. Under certain i.i.d. assumptions on the data, [LRZ20] and [BMR21, Section 4.3] determined, for a polynomial scaling between the sample size and the dimension, which degree of the polynomial kernel suffices to approximate the expected kernel. Instead, our theory leverages the approximate orthogonality of general datasets to obtain a similar approximation. Our analysis suggests that the weaker the orthogonality of $X$ is, the higher the degree of the polynomial kernel needed to approximate the kernel.
1.5. Preliminaries
Notations
We use as the normalized trace of a matrix and . Denote vectors by lowercase boldface. is the spectral norm for matrix , denotes the Frobenius norm, and is the -norm of any vector . is the Hadamard product of two matrices, i.e., . Let and be the expectation and variance only with respect to random vector . Given any vector , is a diagonal matrix where the main diagonal elements are given by . is the smallest eigenvalue of any Hermitian matrix .
Before stating our main results, we describe our model with assumptions. We first consider the output of the first hidden layer and empirical Conjugate Kernel (CK):
(1.5) |
Observe that the rows of matrix are independent and identically distributed since only is random and is deterministic. Let the -th row of be , for . Then, we obtain a sample covariance matrix,
(1.6) |
which is the sum of independent rank-one random matrices in . Let the second moment of any row be (1.3). Later on, we will approximate based on the assumptions of input data .
Next, we define the empirical Neural Tangent Kernel (NTK) for (1.1), denoted by . From Section 3.3 in [FW20], the -th entry of can be explicitly written as
(1.7) |
where is the -th row of the weight matrix , is the -th column of the data matrix , and is the -th entry of the output layer . In matrix form, the NTK can be written as
(1.8) |
where the -th column of is given by
(1.9) |
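As a complement to (1.7)-(1.9), the following sketch computes an empirical NTK Gram matrix for the two-layer model directly from the Jacobian. It differentiates with respect to both $W$ and $\boldsymbol{a}$, so the matrix splits into a first-layer term and a conjugate-kernel term; whether the second-layer term is part of the convention in (1.8)-(1.9) should be checked against the paper, and the plain ReLU here is only an illustrative activation.

```python
import numpy as np

rng = np.random.default_rng(1)
d0, d1, n = 30, 2000, 100
X = rng.standard_normal((d0, n)); X /= np.linalg.norm(X, axis=0)
W = rng.standard_normal((d1, d0)); a = rng.standard_normal(d1)

def sigma(t):  return np.maximum(t, 0.0)      # illustrative activation
def dsigma(t): return (t > 0).astype(float)   # its derivative (defined a.e.)

Z = W @ X                 # pre-activations, d1 x n
S = dsigma(Z)             # sigma'(w_alpha^T x_i), d1 x n

# Gradient w.r.t. the first layer: for sample i the Jacobian is (1/sqrt(d1)) (a * S[:, i]) x_i^T,
# so the corresponding Gram matrix is (X^T X) entrywise-multiplied by a weighted Gram of S.
ntk_W = (X.T @ X) * (((a[:, None] * S).T @ (a[:, None] * S)) / d1)

# Gradient w.r.t. the second layer gives the conjugate-kernel term.
ntk_a = sigma(Z).T @ sigma(Z) / d1

H = ntk_W + ntk_a         # empirical NTK when both layers are trained
print(H.shape)
```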
We introduce the following assumptions on the random weights, the nonlinear activation function $\sigma$, and the input data. These assumptions are essentially carried over from [FW20].
Assumption 1.1.
The entries of $W$ and $\boldsymbol{a}$ are i.i.d. standard Gaussian $\mathcal{N}(0,1)$ random variables.
Assumption 1.2.
The activation function $\sigma$ is a Lipschitz function with a bounded Lipschitz constant. Assume that $\sigma$ is centered and normalized with respect to $\xi\sim\mathcal{N}(0,1)$ such that
(1.10) |
Define constants and by
(1.11) |
Furthermore, $\sigma$ satisfies either of the following:
- (1) $\sigma$ is twice differentiable with a uniformly bounded second derivative, or
- (2) $\sigma$ is a piecewise linear function.
Analogously to [HXAP20], our Assumption 1.2 permits $\sigma$ to be any of the commonly used activation functions, including ReLU, Sigmoid, and Tanh, although we have to center and normalize the activation functions to guarantee (1.10). Such normalized activation functions exclude a trivial spike in the limiting spectra of CK and NTK [BP21, FW20]. The foregoing assumptions make our nonlinear Hanson-Wright inequality applicable in the proofs. As a future direction, going beyond Gaussian weights and Lipschitz activation functions may require different types of concentration inequalities.
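The centering and normalization in (1.10) can also be carried out numerically. The sketch below does this for ReLU via Gauss-Hermite quadrature; the quadrature order and the Monte Carlo check are illustrative choices, and for standard activations the constants can of course be computed in closed form.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss  # probabilists' Gauss-Hermite nodes/weights

nodes, weights = hermegauss(100)
weights = weights / weights.sum()                   # probability weights for expectations over N(0,1)

def normalize(sigma_raw):
    """Return t -> (sigma_raw(t) - mean) / std, centered and normalized w.r.t. N(0,1)."""
    vals = sigma_raw(nodes)
    mean = np.dot(weights, vals)
    std = np.sqrt(np.dot(weights, (vals - mean) ** 2))
    return lambda t: (sigma_raw(t) - mean) / std

relu = normalize(lambda t: np.maximum(t, 0.0))
xi = np.random.default_rng(2).standard_normal(10**6)
print(relu(xi).mean(), (relu(xi) ** 2).mean())      # approximately 0 and 1, matching (1.10)
```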
Next, we present the conditions of the deterministic input data and the asymptotic regime for our main results. Define the following -orthonormal property for our data matrix .
Definition 1.3.
For any given , a matrix is -orthonormal if, for any distinct columns in , we have
and also
Assumption 1.4.
Let such that
- (a) ;
- (b) is -orthonormal such that as ;
- (c) the empirical spectral distribution of converges weakly to a fixed and non-degenerate probability distribution on .
In (b) above, the $\varepsilon$-orthonormal property with $\varepsilon\to 0$ is a quantitative version of pairwise approximate orthogonality for the column vectors of the data matrix $X$. It holds, with high probability, for many random matrices $X$ with independent columns, including anisotropic Gaussian vectors, vectors generated by Gaussian mixture models, and vectors satisfying the log-Sobolev inequality or the convex Lipschitz concentration property. See [FW20, Section 3.1] for more details. Specifically, when the columns are independently sampled from the unit sphere, $X$ is $\varepsilon$-orthonormal with high probability, with $\varepsilon$ vanishing as the input dimension grows. In our theory, we always treat $X$ as a deterministic matrix. However, our results also apply to random input $X$ independent of the weights $W$ and $\boldsymbol{a}$ by conditioning on the high-probability event that $X$ satisfies the $\varepsilon$-orthonormal property. Unlike assumptions requiring data vectors with independent entries, our assumption is well suited to analyzing real-world datasets [LGC+21] and to establishing data-dependent deterministic equivalents as in [LCM20].
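As a sanity check of the pairwise conditions behind Definition 1.3, the following sketch samples unit-sphere columns and measures their maximal pairwise inner product. The rate $\sqrt{\log n/d_0}$ used for comparison is an assumed benchmark of the order suggested above, and the operator-norm part of the $\varepsilon$-orthonormality conditions is not checked here.

```python
import numpy as np

rng = np.random.default_rng(3)
d0, n = 200, 500

# Columns sampled independently from the unit sphere S^{d0-1}.
X = rng.standard_normal((d0, n))
X /= np.linalg.norm(X, axis=0, keepdims=True)

G = X.T @ X
off_diag = np.abs(G - np.diag(np.diag(G))).max()   # max |<x_i, x_j>| over i != j
norm_dev = np.abs(np.diag(G) - 1.0).max()          # max | ||x_i||^2 - 1 | (zero here by construction)

eps = np.sqrt(np.log(n) / d0)                      # assumed benchmark rate sqrt(log n / d0)
print(off_diag, norm_dev, eps)                     # off-diagonals are of order eps, up to a constant
```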
The following Hermite polynomials are crucial to the approximation of in our analysis.
Definition 1.5 (Normalized Hermite polynomials).
The -th normalized Hermite polynomial is given by
Here form an orthonormal basis of , where denotes the standard Gaussian distribution. For , the inner product is defined by
Every function can be expanded as a Hermite polynomial expansion
where is the -th Hermite coefficient defined by
In the following statements and proofs, we denote . Then for any , we have
(1.12) |
Specifically, . Let . We define the inner-product kernel random matrix by applying entrywise to . Define a deterministic matrix
(1.13) |
where the -th entry of is for . We will employ as an approximation of the population covariance in (1.3) in the spectral norm when .
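The Hermite coefficients entering (1.12)-(1.13) can be computed numerically. The sketch below uses the standard convention $h_k = \mathrm{He}_k/\sqrt{k!}$ for the normalized probabilists' Hermite polynomials (our reading of Definition 1.5) and the centered ReLU from before; Parseval's identity then recovers the normalization $\mathbb{E}\,\sigma(\xi)^2=1$.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss
from math import factorial

nodes, w = hermegauss(100)
w = w / w.sum()                                    # probability weights for N(0,1)

def h(k, x):
    """Normalized probabilists' Hermite polynomial He_k(x)/sqrt(k!)."""
    c = np.zeros(k + 1); c[k] = 1.0
    return np.polynomial.hermite_e.hermeval(x, c) / np.sqrt(factorial(k))

def sigma(t):                                      # centered/normalized ReLU as before
    r = np.maximum(t, 0.0)
    return (r - 1/np.sqrt(2*np.pi)) / np.sqrt(0.5 - 1/(2*np.pi))

zeta = np.array([np.dot(w, sigma(nodes) * h(k, nodes)) for k in range(12)])
print(zeta[:4])                 # zeta_0 is approximately 0 by the centering in (1.10)
print((zeta**2).sum())          # partial Parseval sum, close to E sigma(xi)^2 = 1
```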
For any $n\times n$ Hermitian matrix $A$ with eigenvalues $\lambda_1,\dots,\lambda_n$, the empirical spectral distribution of $A$ is defined by
$\mu_A=\frac{1}{n}\sum_{i=1}^{n}\delta_{\lambda_i}.$
We write $\mu_{A_n}\Rightarrow\mu$ if $\mu_{A_n}$ converges weakly to $\mu$ as $n\to\infty$. The main tool we use to study the limiting spectral distribution of a matrix sequence is the Stieltjes transform, defined as follows.
Definition 1.6 (Stieltjes transform).
Let $\mu$ be a probability measure on $\mathbb{R}$. The Stieltjes transform of $\mu$ is the function defined on the upper half plane $\mathbb{C}^+$ by
$m_\mu(z)=\int_{\mathbb{R}}\frac{\mathrm{d}\mu(x)}{x-z},\quad z\in\mathbb{C}^+.$
For any $n\times n$ Hermitian matrix $A$, the Stieltjes transform of the empirical spectral distribution of $A$ can be written as $m_{\mu_A}(z)=\frac{1}{n}\operatorname{tr}\big[(A-zI)^{-1}\big]$. We call $(A-zI)^{-1}$ the resolvent of $A$.
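For concreteness, the resolvent-based formula for the Stieltjes transform of an empirical spectral distribution can be evaluated numerically as follows; the Wigner matrix and the comparison with the semicircle density are only an illustration.

```python
import numpy as np

def stieltjes_esd(A, z):
    """Stieltjes transform of the ESD of a Hermitian matrix A:
       m(z) = (1/n) tr (A - z I)^{-1} = (1/n) sum_i 1/(lambda_i - z), Im(z) > 0."""
    lam = np.linalg.eigvalsh(A)
    return np.mean(1.0 / (lam - z))

rng = np.random.default_rng(4)
n = 500
G = rng.standard_normal((n, n))
Wig = (G + G.T) / np.sqrt(2 * n)            # a Wigner matrix, semicircle law on [-2, 2]

z = 0.3 + 0.05j
m_emp = stieltjes_esd(Wig, z)
rho_sc = np.sqrt(4 - 0.3**2) / (2 * np.pi)  # semicircle density at Re(z)
print(m_emp.imag / np.pi, rho_sc)           # close for small Im(z), since Im m(E+i0)/pi = density
```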
2. Main results
2.1. Spectra of the centered CK and NTK
Our first result is a deformed semicircle law for the CK matrix. Denote by the distribution of with sampled from the distribution . The limiting law of our centered and normalized CK matrix is depicted by , where is the standard semicircle law and the notation is the free multiplicative convolution in free harmonic analysis. For full descriptions of free independence and free multiplicative convolution, see [NS06, Lecture 18] and [AGZ10, Section 5.3.3]. The free multiplicative convolution was first introduced in [Voi87], which later has many applications for products of asymptotic free random matrices. The main tool for computing free multiplicative convolution is the -transform, invented by [Voi87]. -transform was recently utilized to study the dynamical isometry of deep neural networks [PSG17, PSG18, XBSD+18, HK21, CH23]. Some basic properties and intriguing examples for free multiplicative convolution with can also be found in [BZ10, Theorems 1.2, 1.3].
Theorem 2.1 (Limiting spectral distribution for the conjugate kernel).
Suppose Assumptions 1.1, 1.2 and 1.4 of the input matrix hold, the empirical eigenvalue distribution of
(2.1) |
converges weakly to
(2.2) |
almost surely as . Furthermore, if as , then the empirical eigenvalue distribution of
(2.3) |
also converges weakly to the probability measure almost surely, whose Stieltjes transform is defined by
(2.4) |
for each , where is the unique solution to
(2.5) |
Suppose that we additionally have , i.e. . In this case, our Theorem 2.1 shows that the limiting spectral distribution of (1.2) is the semicircle law, and from (2.2), the deterministic data matrix does not have an effect on the limiting spectrum. See Figure 1 for a cosine-type with . The only effect of the nonlinearity in is the coefficient in the deformation .
[Figure 1]
Figure 2 shows a simulation of the limiting spectral distribution of the CK with an activation function given by a properly shifted and scaled nonlinearity. More simulations with different activation functions are provided in Appendix B. The red curves are computed from the self-consistent equations (2.4) and (2.5) in Theorem 2.1. In Section 4, we present general random matrix models with a similar limiting eigenvalue distribution, whose Stieltjes transform is also determined by (2.4) and (2.5).
[Figure 2]
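A simulation in the spirit of Figures 1 and 2 can be produced as follows. The $1/\sqrt{n d_1}$ rescaling is our reading of the normalization in (1.2) and (2.1), and the expectation $\mathbb{E}_W[Y^\top Y]$ is replaced by a crude Monte Carlo average over independent weight draws; both choices are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(5)
d0, d1, n = 100, 10000, 300               # ultra-wide: n/d1 is small

X = rng.standard_normal((d0, n))
X /= np.linalg.norm(X, axis=0, keepdims=True)

def sigma(t):                              # centered/normalized ReLU, as in Assumption 1.2
    r = np.maximum(t, 0.0)
    return (r - 1/np.sqrt(2*np.pi)) / np.sqrt(0.5 - 1/(2*np.pi))

def gram():
    W = rng.standard_normal((d1, d0))
    Y = sigma(W @ X)
    return Y.T @ Y                         # un-normalized Gram of the first-layer output

# Crude Monte Carlo estimate of E_W[Y^T Y] from a few independent weight draws.
EYY = sum(gram() for _ in range(8)) / 8.0

A = (gram() - EYY) / np.sqrt(n * d1)       # centered, rescaled CK fluctuation matrix (cf. (1.2), (2.1))
eigs = np.linalg.eigvalsh(A)
print(eigs.min(), eigs.max())              # eigenvalues stay on a bounded interval (deformed semicircle)
```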
Theorem 2.1 can be extended to the NTK model as well. Denote by
(2.6) |
As an approximation of in the spectral norm, we define
(2.7) |
where ’s are defined in (1.13), is defined in (1.11), and the -th Hermite coefficient of is
(2.8) |
Then, a similar deformed semicircle law can be obtained for the empirical NTK matrix.
Theorem 2.2 (Limiting spectral distribution of the NTK).
Under Assumptions 1.1, 1.2 and 1.4 of the input matrix , the empirical eigenvalue distribution of
(2.9) |
weakly converges to almost surely as and . Furthermore, suppose that , then the empirical eigenvalue distribution of
(2.10) |
weakly converges to almost surely, where and are defined in (1.13) and (2.7), respectively.
2.2. Non-asymptotic estimations
With our nonlinear Hanson-Wright inequality (Corollary 3.5), we attain the following concentration bound on the CK matrix in the spectral norm.
Theorem 2.3.
With Assumption 1.1, assume satisfies for a constant , and is -Lipschitz with . Then with probability at least ,
(2.11) |
where is a universal constant.
Remark 2.4.
Theorem 2.3 ensures that the empirical spectral measure of the centered random matrix has a bounded support for all sufficiently large . Together with the global law in Theorem 2.1, our concentration inequality (2.11) is tight up to a constant factor. Additionally, by the weak convergence of to proved in Theorem 2.1, we can take a test function to obtain that
almost surely, as and . Therefore, the fluctuation of around under the Frobenius norm is exactly of order .
Based on the foregoing estimation, we have the following lower bound on the smallest eigenvalue of the empirical conjugate kernel, denoted by .
Theorem 2.5.
Remark 2.6.
A related result in [OS20, Theorem 5] assumed for all , , and obtained We relax the assumption on the column vectors of , and extend the range of down to , to guarantee that is lower bounded by an absolute constant, with an extra assumption that . This assumption can always be satisfied by shifting the activation function with a proper constant. Our bound for is derived via Hermite polynomial expansion, similar to [OS20, Lemma 15]. However, we apply an -net argument for concentration bound for around , while a matrix Chernoff concentration bound with truncation was used in [OS20, Theorem 5].
Additionally, the concentration for the NTK matrix can be obtained in the next theorem. Recall that is defined by (1.8) and the columns of are defined by (1.9) with Assumption 1.1.
Theorem 2.7.
Suppose , and is -Lipschitz. Then with probability at least ,
(2.12) |
Moreover, if the assumptions in Theorem 2.3 hold, then with probability at least ,
(2.13) |
Remark 2.8.
Compared to Proposition D.3 in [HXAP20], we assume is a Gaussian vector instead of a Rademacher random vector and attain a better bound. If , then one can apply matrix Bernstein inequality for the sum of bounded random matrices. In our case, the boundedness condition is not satisfied. Section S1.1 in [AP20] applied matrix Bernstein inequality for the sum of bounded random matrices when is a Gaussian vector, but the boundedness condition does not hold in Equation (S7) of [AP20].
Based on Theorem 2.7, we get a lower bound for the smallest eigenvalue of the NTK.
Theorem 2.9.
Remark 2.10.
We relax the corresponding assumption in [NMM21] for the two-layer case, and our result is applicable beyond the ReLU activation function and under more general assumptions on $X$. Our proof strategy is different from that of [NMM21]. In [NMM21], the authors used an inequality in terms of the columns of the feature matrix, so obtaining the lower bound reduces to showing concentration of the norms of these column vectors. Here we instead apply a matrix concentration inequality, which yields a weaker assumption for ensuring the lower bound on the smallest eigenvalue of the NTK.
Remark 2.11.
In Theorems 2.5 and 2.9, we exclude the linear activation function. When $\sigma$ is linear, it is easy to check that the lower bounds for both the CK and the NTK are trivially determined by the smallest eigenvalue of $X^\top X$, which can be vanishing. In this case, the lower bounds on the smallest eigenvalues of CK and NTK rely on the assumption on $X$ or on the distribution of $X$. For instance, when the entries of $X$ are i.i.d. Gaussian random variables, the smallest singular value of $X$ has been analyzed in [Sil85].
2.3. Training and test errors for random feature regression
We apply the results of the preceding sections to a two-layer neural network at random initialization defined in (1.1), to estimate the training errors and test errors with mean-square losses for random feature regression under the ultra-wide regime where and . In this model, we take the random feature and consider the regression with respect to based on
with training data and training labels . Considering the ridge regression with ridge parameter and squared loss defined by
(2.14) |
we can conclude that the minimization has an explicit solution
(2.15) |
where the empirical CK matrix is defined in (1.5). When $\sigma$ is nonlinear, by Theorem 2.5, the inverse in (2.15) exists for any ridge parameter $\lambda\ge 0$. Hence, in the following results, we will focus on nonlinear activation functions (as stated in Remark 2.11, when $\sigma$ is linear, the smallest eigenvalue of the CK may vanish; to include the linear activation function, we can alternatively assume that the ridge parameter is strictly positive and focus on random feature ridge regression). In general, the optimal predictor for this random feature model with respect to (2.14) is
(2.16) |
where we define an empirical kernel as
(2.17) |
The -dimension row vector is given by
(2.18) |
and the entry of is defined by , for .
Analogously, consider any kernel function . The optimal kernel predictor with a ridge parameter for the kernel ridge regression is given by (see [RR07, AKM+17, LR20, JSS+20, LLS21, BMR21] for more details)
(2.19) |
where is an matrix such that its entry is , and is a row vector in similarly with (2.18). We compare the characteristics of the two different predictors and when the kernel function is defined in (1.4). Denote the optimal predictors for random features and kernel on training data by
respectively. Notice that, in this case, defined in (1.3) and is the random empirical CK matrix defined in (1.5).
We aim to compare the training and test errors for these two predictors in ultra-wide random neural networks, respectively. Let training errors of these two predictors be
(2.20) | ||||
(2.21) |
In the following theorem, we show that, with high probability, the training error of the random feature regression model can be approximated by the corresponding kernel regression model with the same ridge parameter for ultra-wide neural networks.
Theorem 2.12 (Training error approximation).
Next, to investigate the test errors (or generalization errors), we introduce further assumptions on the data and the target function that we want to learn from training data. Denote the true regression function by . Then, the training labels are defined by
(2.23) |
where is the training label noise. For simplicity, we further impose the following assumptions, analogously to [LD21].
Assumption 2.13.
Assume that the target function is a linear function $f^*(\boldsymbol{x})=\boldsymbol{\beta}^\top\boldsymbol{x}$, where the random vector $\boldsymbol{\beta}$ satisfies a suitable moment condition. In this case, the training label vector is given by $\boldsymbol{y}=X^\top\boldsymbol{\beta}+\boldsymbol{\epsilon}$, where the noise $\boldsymbol{\epsilon}$ is independent of $\boldsymbol{\beta}$.
Assumption 2.14.
Suppose that the training dataset $X$ satisfies the $\varepsilon$-orthonormal condition, and that a test datum is independent of $X$ and $\boldsymbol{\beta}$ such that the augmented data matrix is also $\varepsilon$-orthonormal. For convenience, we further impose a fixed population covariance on the test data.
Remark 2.15.
Our Assumption 2.14 of the test data ensures the same statistical behavior as training data in , but we do not have any explicit assumption of the distribution of . It is promising to adopt such assumptions to handle statistical models with real-world data [LC18, LCM20]. Besides, it is possible to extend our analysis to general population covariance for .
For any predictor , define the test error (generalization error) by
(2.24) |
We first present the following approximation of the test error of a random feature predictor via its corresponding kernel predictor.
Theorem 2.16 (Test error approximation).
Theorems 2.12 and 2.16 verify that the random feature regression achieves the same asymptotic errors as the kernel regression, as long as . This is closely related to [MMM21, Theorem 1] with different settings. Based on that, we can compute the asymptotic training and test errors for the random feature model by calculating the corresponding quantities for the kernel regression in the ultra-wide regime where .
Theorem 2.17 (Asymptotic training and test errors).
Suppose Assumptions 1.1 and 1.2 hold, and is not a linear function. Suppose the target function , training data and test data satisfy Assumptions 2.13 and 2.14. For any , let the effective ridge parameter be
(2.26) |
If the training data has some limiting eigenvalue distribution as and , then when and , the training error satisfies
(2.27) |
and the test error satisfies
(2.28) |
where the bias and variance functions are defined by
(2.29) | ||||
(2.30) |
We emphasize that in the proof of Theorem 2.17, we also obtain data-dependent deterministic equivalents for the training/test errors of the kernel regression, which approximate the performance of random feature regression. This is akin to [LCM20, Theorem 3] and [BMR21, Theorem 4.13], but in different regimes. In Figure 3 below, we present simulated test errors of random feature regression on standard Gaussian random data together with their limits in (2.28). For simplicity, we fix the ridge parameter, only let the width grow, and use the empirical spectral distribution of the data to approximate the limiting distribution in the bias and variance functions, which is in fact the data-dependent deterministic equivalent. However, for a Gaussian random data matrix, the limiting spectral distribution is a Marchenko-Pastur law, so the bias and variance can be computed explicitly according to [LD21, Definition 1].
[Figure 3]
Remark 2.18 (Implicit regularization).
For nonlinear , the effective ridge parameter (2.26) can be viewed as an inflated ridge parameter since and . This leads to implicit regularization for our random feature and kernel ridge regressions even for the ridgeless regression with [LR20, MZ22, JSS+20, BMR21]. This effective ridge parameter also shows the effect of the nonlinearity in the random feature and kernel regressions induced by ultra-wide neural networks.
Remark 2.19.
For convenience, we only consider the linear target function , but in general, the above theorems can also be obtained for nonlinear target functions, for instance, is a nonlinear single-index model. Under -orthonormal assumption with , our expected kernel is approximated in terms of
(2.31) |
hence, this kernel regression can only learn linear functions. So if the target is nonlinear, the limiting test error should be decomposed into a linear part as in (2.28) and a nonlinear component that acts as noise [BMR21, Theorem 4.13]. For more results on this kernel machine, we refer to [LR20, LRZ20, LLS21, MMM21].
Remark 2.20 (Neural tangent regression).
In parallel to the above results, we can obtain a similar analysis of the limiting training and test errors for random feature regression in (2.16) with empirical NTK given by either or . This random feature regression also refers to neural tangent regression [MZ22]. With the help of our concentration results in Theorem 2.7 and the lower bound of the smallest eigenvalues in Theorem 2.9, we can directly extend the above Theorems 2.12, 2.16 and 2.17 to this neural tangent regression. We omit the proofs in these cases and only state the results as follows.
If with expected kernel defined by (2.6), the limiting training and test errors of this neural tangent regression can be approximated by the kernel regression with respect to , as long as . Analogously to (2.31), we have an additional approximation
(2.32) |
Under the same assumptions of Theorem 2.17 and replacing with , we can conclude that the test error of this neural tangent regression has the same limit as (2.28) but changing the effective ridge parameter (2.26) into . This result is akin to [MZ22, Corollary 3.2] but permits more general assumptions on . The limiting training error of this neural tangent regression can be obtained by slightly modifying the coefficient in (2.27).
Similarly, if defined by (1.8) possesses an expected kernel , this neural tangent regression in (2.16) is close to kernel regression (2.19) with kernel
under the ultra-wide regime, . Combining (2.31) and (2.32), Theorem 2.17 can directly be extended to this neural tangent regression but replacing (2.26) with . Section 6.1 of [AP20] also calculated this limiting test error when data is isotropic Gaussian.
Organization of the paper
The remaining parts of the paper are structured as follows. In Section 3, we first provide a nonlinear Hanson-Wright inequality as a concentration tool for our spectral analysis. Section 4 gives a general theorem for the limiting spectral distributions of generalized centered sample covariance matrices. We prove the limiting spectral distributions for the empirical CK and NTK matrices (Theorem 2.1 and Theorem 2.2) in Section 5. Non-asymptotic estimates in subsection 2.2 are proved in Section 6. In Section 7, we justify the asymptotic results of the training and test errors for the random feature model (Theorem 2.12 and Theorem 2.16). Auxiliary lemmas and additional simulations are included in Appendices.
3. A non-linear Hanson-Wright inequality
We give an improved version of Lemma 1 in [LLC18] with a simple proof based on a Hanson-Wright inequality for random vectors with dependence [Ada15]. This serves as the concentration tool for us to prove the deformed semicircle law in Section 5 and provide bounds on extreme eigenvalues in Section 6. We first define some concentration properties for random vectors.
Definition 3.1 (Concentration property).
Let be a random vector in . We say has the -concentration property with constant if for any -Lipschitz function , we have and for any ,
(3.1) |
There are many distributions of random vectors satisfying -concentration property, including uniform random vectors on the sphere, unit ball, hamming or continuous cube, uniform random permutation, etc. See [Ver18, Chapter 5] for more details.
Definition 3.2 (Convex concentration property).
Let be a random vector in . We say has the -convex concentration property with the constant if for any -Lipschitz convex function , we have and for any ,
We will apply the following result from [Ada15] to the nonlinear setting.
Lemma 3.3 (Theorem 2.5 in [Ada15]).
Let be a mean zero random vector in . If has the -convex concentration property, then for any matrix and any ,
for some universal constant .
Theorem 3.4.
Let be a random vector with -concentration property, be a deterministic matrix. Define , where is -Lipschitz, and . Let be an deterministic matrix.
-
(1)
If , for any ,
(3.2) where is an absolute constant.
-
(2)
If , for any ,
for some constant .
Proof.
Let be any -Lipschitz convex function. Since , is a -Lipschitz function of . Then by the Lipschitz concentration property of in (3.1), we obtain
Therefore, satisfies the -convex concentration property. Define , then is also a convex -Lipschitz function and . Hence also satisfies the -convex concentration property. Applying Theorem 3.3 to , we have for any ,
(3.3) |
Since , the inequality above implies (3.2). Note that
Hence,
(3.4) |
Since is a -Lipschitz function of , by the Lipschitz concentration property of , we have
(3.5) |
Then combining (3.3), (3), and (3.5), we have
This finishes the proof. ∎
Since the Gaussian random vector satisfies the -concentration inequality with (see for example [BLM13]), we have the following corollary.
Corollary 3.5.
Let , be a deterministic matrix. Define , where is -Lipschitz, and . Let be an deterministic matrix.
-
(1)
If , for any ,
(3.6) for some absolute constant .
-
(2)
If , for any ,
(3.7) (3.8) where
(3.9)
Remark 3.6.
Compared to [LLC18, Lemma 1], we identify the dependence on and in the probability estimate. By using the inequality , we obtain a similar inequality to the one in [LLC18] with a better dependence on . Moreover, our bound in is independent of , while the corresponding term in [LLC18, Lemma 1] depends on and . In particular, when and is -orthonormal, is of order 1. Hence, (3.8) with the special choice of is the key ingredient in the proof of Theorem 2.3 to get a concentration of the spectral norm for CK.
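A crude Monte Carlo illustration of Corollary 3.5 is sketched below, with the vector built entrywise as $\sigma(\boldsymbol{x}^\top\boldsymbol{x}_i)$ for a standard Gaussian $\boldsymbol{x}$ and unit-norm columns $\boldsymbol{x}_i$ (our reading of the setup; the dimensions, the centered ReLU, and the test matrix $A$ are arbitrary illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(7)
d0, n, trials = 300, 200, 500

X = rng.standard_normal((d0, n))
X /= np.linalg.norm(X, axis=0, keepdims=True)        # columns x_i with unit norm
A = rng.standard_normal((n, n)); A = (A + A.T) / 2   # a fixed deterministic test matrix

def sigma(t):
    return np.maximum(t, 0.0) - 1/np.sqrt(2*np.pi)   # centered, 1-Lipschitz ReLU

def quad_form():
    x = rng.standard_normal(d0)                      # x ~ N(0, I_{d0})
    y = sigma(X.T @ x)                               # y_i = sigma(x^T x_i), mean zero here
    return y @ A @ y

samples = np.array([quad_form() for _ in range(trials)])
dev95 = np.quantile(np.abs(samples - samples.mean()), 0.95)
print(samples.mean(), dev95, np.linalg.norm(A, 'fro'))
# typical deviations of the quadratic form from its mean are on the scale of ||A||_F,
# consistent with the Gaussian-tail part of the concentration bound
```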
Proof of Corollary 3.5.
We only need to prove (3.8), since other statements follow immediately by taking . Let be the -th column of . Then
Let . We have
(3.10) |
Therefore
(3.11) | ||||
and (3.8) holds.
∎
We include the following corollary about the variance of , which will be used in Section 5 to study the spectrum of the CK and NTK.
Corollary 3.7.
Under the same assumptions of Corollary 3.5, we further assume that , and . Then as ,
Proof.
Notice that . Thanks to Theorem 3.5 (2), we have that for any ,
(3.12) |
where constant only relies on , , and . Therefore, we can compute the variance in the following way:
as . Here, we use the dominated convergence theorem for the first integral in the last line. ∎
4. Limiting law for general centered sample covariance matrices
Independent of the subsequent sections, this section focuses on the generalized sample covariance matrix where the dimension of the feature is much smaller than the sample size. We will later interpret such sample covariance matrix specifically for our neural network applications. Under certain weak assumptions, we prove the limiting eigenvalue distribution of the normalized sample covariance matrix satisfies two self-consistent equations, which are subsumed into a deformed semicircle law. Our findings in this section demonstrate some degree of universality, indicating that they hold across various random matrix models and may have implications for other related fields.
Theorem 4.1.
Suppose are independent random vectors with the same distribution of a random vector . Assume that , , where is a deterministic matrix whose limiting eigenvalue distribution is . Assume for some constant . Define and . For any and any deterministic matrices with , suppose that as and ,
(4.1) |
and
(4.2) |
Then the empirical eigenvalue distribution of matrix weakly converges to almost surely, whose Stieltjes transform is defined by
(4.3) |
for each , where is the unique solution to
(4.4) |
In particular, .
Remark 4.2.
In [Xie13], it was assumed that where and as . By martingale difference, this condition implies (4.1). However, we are not able to verify a certain step in the proof of [Xie13]. Hence, we will not directly adopt the result of [Xie13] but consider a more general situation without assuming . The weakest conditions we found are conditions (4.1) and (4.2), which can be verified in our nonlinear random model.
The self-consistent equations we derive are consistent with the results in [Bao12, Xie13], where the empirical spectral distributions of separable sample covariance matrices were studied in this regime under different assumptions. When the dimension and the sample size both tend to infinity with their ratio tending to zero, our goal is to prove that the Stieltjes transforms of the relevant empirical eigenvalue distributions converge pointwise to the limits defined by (4.3) and (4.4), respectively.
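The simplest instance of Theorem 4.1, with i.i.d. Gaussian rows and a diagonal covariance, can be simulated as follows. The $1/\sqrt{nd}$ rescaling and the subtraction of $d\,\Sigma$ are our reading of the (elided) definition of the matrix in Theorem 4.1 and should be checked against the display there; the two-atom covariance spectrum is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(8)
n, d = 300, 30000                                    # dimension n much smaller than sample size d

# Population covariance with a two-atom limiting spectrum (eigenvalues 1 and 3).
evals = np.concatenate([np.ones(n // 2), 3.0 * np.ones(n - n // 2)])
Y = rng.standard_normal((d, n)) * np.sqrt(evals)     # rows y_i = Sigma^{1/2} g_i, g_i ~ N(0, I_n)

A = (Y.T @ Y - d * np.diag(evals)) / np.sqrt(n * d)  # centered, rescaled sample covariance
spec = np.linalg.eigvalsh(A)
print(spec.min(), spec.max())   # bounded support, described by the deformed semicircle law
```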
For the rest of this section, we first prove a series of lemmas to get -dependent deterministic equivalents related to (4.3) and (4.4) and then deduce the proof of Theorem 4.1 at the end of this section. Recall , , and is a random vector independent of with the same distribution of .
Lemma 4.3.
Under the assumptions of Theorem 4.1, for any , as ,
(4.5) |
where is any deterministic matrix such that , for some constant .
Proof.
Let where and . Let
where ’s are independent copies of defined in Theorem 4.1. Notice that, for any deterministic matrix ,
(4.6) | ||||
(4.7) |
Without loss of generality, we assume . Taking normalized trace, we have
(4.8) |
For each , Sherman–Morrison formula (Lemma A.2) implies
(4.9) |
where the leave-one-out resolvent is defined as
Hence, by (4.9), we obtain
(4.10) |
Combining equations (4.8) and (4.10), and applying expectation at both sides implies
(4.11) |
because of the assumption that all ’s have the same distribution as vector for all . With (4.11), to prove (4.5), we will first show that when ,
(4.12) | |||
(4.13) | |||
(4.14) |
Recall that
and spectral norms due to Proposition C.2 in [FW20]. Notice that . Hence, we can deduce that
as The same argument can be applied to the error of . Therefore (4.12) and (4.13) hold. For (4.14), we denote and observe that
(4.15) |
Let be the eigen-decomposition of . Then
(4.16) |
is the Stieltjes transform of a discrete measure . Then, we can control the real part of by Lemma A.4:
(4.17) |
We now separately consider two cases in the following:
- •
- •
Lemma 4.4.
Proof.
Let us denote
, and . In other words, can be expressed by
Observe that for . Thus, has the same expectation as the last term
since we can first take the expectation for conditioning on the resolvent and then take the expectation for . Besides, notice that uniformly. Hence, converges to zero uniformly and there exists some constant such that
(4.22) |
for all large and . In addition, observe that
where is defined in the proof of Lemma 4.3. In terms of (4.18) and (4.19), we verify that
(4.23) |
where is some constant depending on . Next, recall that condition (4.2) exposes that
(4.24) |
as . The first convergence is derived by viewing and taking expectation conditional on . To sum up, we can bound based on (4.22) and (4.23) in the subsequent way:
Here, and is uniformly bounded by some constant. Then, by Hölder’s inequality, (4.24) implies that , as approaches infinity. This shows that converges to zero.
∎
Lemma 4.5.
Under assumptions of Theorem 4.1, we can conclude that
holds for each and deterministic matrix with uniformly bounded spectral norm.
Proof.
Based on Lemma 4.3 and Lemma 4.4, (4.21) and (4.5) yield
As and are bounded by some constants uniformly and almost surely, for sufficiently large and , and
as . Hence,
(4.25) |
Considering in (4.1), we can get almost sure convergence for to zero. Thus by dominated convergence theorem,
So we can replace the third term at the right-hand side of (4.25) with
to obtain the conclusion. ∎
Proof of Theorem 4.1.
Fix any . Denote the Stieltjes transform of empirical spectrum of and its expectation by and respectively. Let and . Notice that and are all in and uniformly and almost surely bounded by some constant. By choosing in Lemma 4.5, we conclude
(4.26) |
Likewise, in Lemma 4.5, we consider . Let
Because is uniformly bounded, . In terms of Lemma A.6, we only need to provide a lower bound for the imaginary part of . Observe that since and . Thus, for all . Meanwhile, we have the equation and hence,
So applying Lemma 4.5 again, we have another limiting equation . In other words,
(4.27) |
Thanks to the identity
we can modify (4.26) and (4.27) to get
(4.28) |
Since and are uniformly bounded, for any subsequence in , there is a further convergent sub-subsequence. We denote the limit of such sub-subsequence by and respectively. Hence, by (4.27) and (4.28), one can conclude
Because of the convergence of the empirical eigenvalue distribution of , we obtain the fixed point equation (4.4) for . Analogously, we can also obtain (4.3) for and . The existence and the uniqueness of the solutions to (4.3) and (4.4) are proved in [BZ10, Theorem 2.1] and [WP14, Section 3.4], which implies the convergence of and to and governed by the self-consistent equations (4.3) and (4.4) as , respectively.
Then, by virtue of condition (4.1) in Theorem 4.1, we know and . Therefore, the empirical Stieltjes transform converges to almost surely for each . Recall that the Stieltjes transform of is . By the standard Stieltjes continuity theorem (see for example, [BS10, Theorem B.9]), this finally concludes the weak convergence of empirical eigenvalue distribution of to .
Now we show . The fixed point equations (4.3) and (4.4) induce
(4.29) |
since for any . Together with (4.3), we attain the same self-consistent equations for the convergence of the empirical spectral distribution of the Wigner-type matrix studied in [BZ10, Theorem 1.1].
Define , the -by- Wigner matrix, as a Hermitian matrix with independent entries
The Wigner-type matrix studied in [BZ10, Definition 1.2] is indeed . Hence, such Wigner-type matrix has the same limiting spectral distribution as defined in Theorem 4.1. Both limits are determined by self-consistent equations (4.3) and (4.29).
On the other hand, based on [AGZ10, Theorem 5.4.5], and are almost surely asymptotically free, i.e. the empirical distribution of converges almost surely to the law of , where s and d are two free non-commutative random variables (s is a semicircle element and d has the law ). Thus, the limiting spectral distribution of is the free multiplicative convolution between and . This implies in our setting. ∎
5. Proof of Theorem 2.1 and Theorem 2.2
To prove Theorem 2.1, we first establish the following proposition to analyze the difference between the Stieltjes transform of (2.1) and its expectation. This will assist us in verifying condition (4.1) of Theorem 4.1. The proof is based on [FW20, Lemma E.6].
Proposition 5.1.
Proof.
Define function by . Fix any where , and let . We want to verify is a Lipschitz function in with respect to the Frobenius norm. First, recall
where the last two terms are deterministic with respect to . Hence,
where is the Hadamard product, and is applied entrywise. Here we utilize the formula
and . Lemma A.6 in Appendix A implies that Therefore, based on the assumption of , we have
for some constant . For the first term in the product on the right-hand side,
For the second term,
Thus, . This holds for every such that , so is -Lipschitz in with respect to the Frobenius norm. Then the result follows from the Gaussian concentration inequality for Lipschitz functions. ∎
Next, we investigate the approximation of via the Hermite polynomials . The orthogonality of Hermite polynomials allows us to write as a series of kernel matrices. Then we only need to estimate each kernel matrix in this series. The proof is directly based on [GMMM19, Lemma 2]. The only difference is that we consider the deterministic input data with the -orthonormal property, while in Lemma 2 of [GMMM19], the matrix is formed by independent Gaussian vectors.
Lemma 5.2.
Proof.
By Assumption 1.2, we know that
For any fixed , . This is because is a Lipschitz function and by triangle inequality , we have, for ,
(5.1) |
For , let and the Hermite expansion of can be written as
where the coefficient . Let unit vectors be , for . So for , the entry of is
where is a Gaussian random vector with mean zero and covariance
(5.2) |
By the orthogonality of Hermite polynomials with respect to and Lemma A.5, we can obtain
which leads to
(5.3) |
For any , let be an -by- matrix with -th entry
(5.4) |
Specifically, for any , we have
where is the diagonal matrix .
First, we consider the case where $\sigma$ is twice differentiable as in Assumption 1.2. Similarly to [GMMM19, Equation (26)], for any and , we take the Taylor approximation of at the point , so there exists between and such that
Replacing by and taking expectation, since is uniformly bounded, we can get
(5.5) |
For , the Lipschitz condition for yields
(5.6) |
where constant does not depend on . As for piece-wise linear , it is not hard to see
(5.7) |
Now, we begin to approximate separately based on (5.5), (5.6) and (5.7). Denote the diagonal submatrix of a matrix .
(1) Approximation for . At first, we estimate the norm with respect to of the function . Recall that Because and is a Lipschitz function, we have
(5.8) | ||||
(5.9) |
Hence, is uniformly bounded with some constant for all large . Next, we estimate the off-diagonal entries of when . From (5.4), we obtain that
(5.10) |
when is sufficiently large.
(2) Approximation for . Recall and by Gaussian integration by part,
Then, we have
If is twice differentiable, then as well.
Thus, taking in (5.5) and (5.7) implies that for any
(5.11) |
Define , then . Recall the definition of in (1.13). Then, (5.11) ensures that
Applying the -orthonormal property of yields
(5.12) |
Hence the difference between and is controlled by
(5.13) |
(3) Approximation for for . For , Assumption 1.4 and (5.6) show that
(5.14) |
when is sufficiently large. Notice that , where is the diagonal matrix. Hence, by (5.14),
And for by the triangle inequality,
When , and . When ,
From Lemma A.1 in Appendix A, we have that
(5.15) |
So the left-hand side of (5.15) is bounded. Analogously, we can verify is also bounded. Therefore, we have
(5.16) |
for some constant and when is sufficiently large.
(4) Approximation for . Since , we know
First, by (5.8) and (5.9), we can claim that
Second, in terms of (5.14), we obtain
for and . Combining these together, we conclude that
(5.17) |
Recall
In terms of approximations (5.10), (5.13), (5.16) and (5.17), we can finally manifest
(5.18) |
for some constant as . The spectral norm bound of is directly deduced by the spectral norm bound of based on (5.12) and (5.15), together with (5.18). ∎
Remark 5.3 (Optimality of ).
For general deterministic data $X$, our pairwise orthogonality assumption with the stated rate is optimal for the approximation of the expected kernel in the spectral norm. If we relax the decay rate of $\varepsilon$ in Assumption 1.4, the above approximation may require including higher-degree terms, which would invalidate some of our subsequent results and simplifications. Subsequent to the initial completion of this paper, this weaker regime has been considered in our follow-up work [WZ22].
Next, we continue to provide an additional estimate for , but in the Frobenius norm to further simplify the limiting spectral distribution of .
Proof.
By the definition of , we know that . As a direct deduction of Lemma 5.2, the limiting spectrum of is identical to the limiting spectrum of . To prove this lemma, it suffices to check the Frobenius norm of the difference between and . Notice that
By the definition of vector and the assumption of , we have
(5.19) |
For , the Frobenius norm can be controlled by
Hence, as , we have
as . Then we conclude that
Hence, is the same as when due to Lemma A.7 in Appendix A. ∎
Moreover, the proof of Lemma 5.4 can be modified to prove (2.32), so we omit its proof. Now, based on Corollary 3.7, Proposition 5.1, Lemma 5.2, and Lemma 5.4, applying Theorem 4.1 for general sample covariance matrices, we can finish the proof of Theorem 2.1.
Proof of Theorem 2.1.
Based on Corollary 3.7 and Proposition 5.1, we can verify the conditions (4.1) and (4.2) in Theorem 4.1. By Lemma 5.2 and Lemma 5.4, we know that the limiting eigenvalue distributions of and are identical and is uniformly bounded. So the limiting eigenvalue distribution of denoted by is just . Hence, the first conclusion of Theorem 2.1 follows from Theorem 4.1.
Next, we move to study the empirical NTK and its corresponding limiting eigenvalue distribution. Similarly, we first verify that such NTK concentrates around its expectation and then simplify this expectation by some deterministic matrix only depending on the input data matrix and nonlinear activation . The following lemma can be obtained from (2.12) in Theorem 2.7.
Lemma 5.5.
Suppose that Assumption 1.1 holds, and . Then if , we have
(5.20) |
almost surely as . Moreover, if as , then almost surely
(5.21) |
Lemma 5.6.
Proof.
We can directly apply methods in the proof of Lemma 5.2. Notice that (1.7) and (1.9) imply
for any standard Gaussian random vector . Recall that (2.8) defines the -th coefficient of Hermite expansion of by for any Then, Assumption 1.2 indicates and . For , we introduce and the Hermite expansion of this function as
where the coefficient . Let , for . So for , the -entry of is
where is a Gaussian random vector with mean zero and covariance (5.2). Following the derivation of formula (5.3), we obtain
(5.23) |
For any , let be an -by- matrix with entry
We can write for any , where is . Then, adopting the proof of (5.16), we can similarly conclude that
for some constant and , when is sufficiently large. Likewise, (5.10) indicates
and a similar proof of (5.17) implies that
Based on these approximations, we can conclude the result of this lemma. ∎
6. Proof of the concentration for extreme eigenvalues
In this section, we obtain the estimates of the extreme eigenvalues for the CK and NTK we studied in Section 5. The limiting spectral distribution of tells us the bulk behavior of the spectrum. An estimation of the extreme eigenvalues will show that the eigenvalues are confined in a finite interval with high probability. We first provide a non-asymptotic bound on the concentration of under the spectral norm. The proof is based on the Hanson-Wright inequality we proved in Section 3 and an -net argument.
Proof of Theorem 2.3.
For any fixed , we have
(6.1) |
where
and column vector is the concatenation of column vectors . Then
with block matrix
Notice that
Denote . With (3.11), we obtain
where the last line is from the assumptions on and . When , applying (3.8) to (6.1) implies
Let be a -net on with (see e.g. [Ver18, Corollary 4.2.13]), then
Taking a union bound over yields
We can then set
to conclude
Since
the upper bound in (2.11) is then verified. When , we can apply (3.6) and follow the same steps to get the desired bound. ∎
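The following Monte Carlo sketch illustrates the concentration quantified by Theorem 2.3: as the width grows with the sample size fixed, the empirical CK approaches its expectation in spectral norm. The expected kernel is itself approximated by a very wide reference network, and the tanh activation and all sizes are illustrative assumptions rather than the exact setting of the theorem.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 120, 100
X = rng.standard_normal((d, n)) / np.sqrt(d)      # columns with norm ~ 1
sigma = np.tanh

def empirical_ck(width):
    """Empirical conjugate kernel (1/width) * sigma(W X)^T sigma(W X)."""
    W = rng.standard_normal((width, d))
    Phi = sigma(W @ X)
    return Phi.T @ Phi / width

K_ref = empirical_ck(200_000)                     # proxy for the expected CK
for d1 in [500, 2000, 8000]:
    dev = np.linalg.norm(empirical_ck(d1) - K_ref, 2)
    print(f"d1 = {d1:5d}   spectral-norm deviation ~ {dev:.4f}")
```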
By the concentration inequality in Theorem 2.3, we can get a lower bound on the smallest eigenvalue of the conjugate kernel as follows.
Lemma 6.1.
Assume satisfies for a constant , and is -Lipschitz with . Then with probability at least ,
(6.2) |
The lower bound in (6.2) relies on . Under certain assumptions on and , we can guarantee that is bounded below by an absolute constant.
Lemma 6.2.
Assume is not a linear function and is Lipschitz. Then
(6.3) |
Proof.
Suppose that is finite. Then, by our assumption, is a polynomial of degree at least , which contradicts the fact that is Lipschitz. Hence, (6.3) holds. ∎
Lemma 6.3.
Assume Assumption 1.2 holds, is not a linear function, and satisfies -orthonormal property. Then,
(6.4) |
Remark 6.4.
This bound does not hold when is a linear function. Indeed, if is linear, then under Assumption 1.2 we must have and . In that case, we cannot obtain a lower bound on from the Hermite coefficients of .
Proof of Lemma 6.3.
and, from Weyl’s inequality [AGZ10, Theorem A.5], we have
Note that where , and each column of is given by the -th Kronecker product . Hence, is positive semi-definite.
∎
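The positive semi-definiteness used above comes from the Kronecker-power feature map: if the i-th column of the feature matrix is the k-fold Kronecker product of the i-th data point with itself, then the associated Gram matrix has entries (x_iᵀx_j)^k and is therefore automatically positive semi-definite. A small numerical check, with illustrative sizes, is given below.

```python
import numpy as np
from functools import reduce

rng = np.random.default_rng(3)
d, n, k = 5, 8, 3
X = rng.standard_normal((d, n))

# Explicit feature map: columns are k-fold Kronecker powers (dimension d**k).
F = np.stack([reduce(np.kron, [X[:, i]] * k) for i in range(n)], axis=1)

G_feat = F.T @ F                       # Gram matrix of the feature map
G_pow = (X.T @ X) ** k                 # entrywise k-th power of the Gram matrix
print(np.allclose(G_feat, G_pow))                  # True
print(np.linalg.eigvalsh(G_pow).min() >= -1e-10)   # PSD up to round-off
```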
Next, we move on to non-asymptotic estimates for the NTK. Recall that the empirical NTK matrix is given by (1.8) and the -th column of is defined by , for , in (1.9).
The -th row of is given by and , where is the -th entry of . Define for . We can rewrite as
Let us define and further expand it as follows:
(6.5) | ||||
(6.6) |
Here is a centered random matrix, and we can apply the matrix Bernstein inequality to show the concentration of . Since does not have an almost sure bound on the spectral norm, we will use the following sub-exponential version of the matrix Bernstein inequality from [Tro12].
Lemma 6.5 ([Tro12], Theorem 6.2).
Let be independent Hermitian matrices of size . Assume
for any integer . Then for all ,
(6.7) |
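The scaling captured by this inequality can be checked numerically: a sum of N independent, centered Hermitian matrices has spectral norm of order √N (up to logarithmic and sub-exponential factors), far below the trivial bound of order N. In the sketch below the centered scalars g_i² − 1 mimic the squared second-layer weights appearing in the decomposition of Ψ, and the rank-one directions are illustrative choices, not the actual per-neuron summands.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
for N in [100, 400, 1600, 6400]:
    V = rng.standard_normal((N, n))
    V /= np.linalg.norm(V, axis=1, keepdims=True)       # unit vectors v_i
    g = rng.standard_normal(N)
    # S = sum_i (g_i^2 - 1) v_i v_i^T : independent, centered, Hermitian summands
    S = (V * (g ** 2 - 1)[:, None]).T @ V
    norm_S = np.abs(np.linalg.eigvalsh(S)).max()
    print(f"N = {N:5d}   ||S|| = {norm_S:7.2f}   ||S||/N = {norm_S / N:.4f}")
```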
Proof of Theorem 2.7.
From (6.6), , and
where and where is the -th entry of the second layer weight . Then
So we can take in (6.7) and obtain
Hence, defined in (6.5) has a probability bound:
Take . Under the assumption that , we conclude that, with probability at least ,
(6.8) |
Thus, as a corollary, the two statements in Lemma 5.5 follow from (6.8). Meanwhile, since
We now proceed to provide a lower bound of from Theorem 2.7.
Proof of Theorem 2.9.
Note that from (1.8), (2.6) and (6.5), we have
Then with Lemma 5.6, we can get
Therefore, from Theorem 2.7, with probability at least ,
Since is Lipschitz and non-linear, we know that is not a linear function (in particular, not a constant) and is bounded. If had only finitely many non-zero Hermite coefficients, then would be a polynomial, which is a contradiction. Hence, the Hermite coefficients of satisfy
(6.9) |
This finishes the proof. ∎
7. Proof of Theorem 2.12 and Theorem 2.17
By definition, the random matrix is and the kernel matrix is defined in (1.3). These two matrices have already been analyzed in Theorem 2.3 and Theorem 2.5, so we will apply those results to estimate the difference between the training errors of random feature regression and the corresponding kernel regression.
Proof of Theorem 2.12.
Denote . From the definitions of training errors in (2.20) and (2.21), we have
(7.1) |
Here, in (7.1), we employ the identity
(7.2) |
for and , and the fact that and . Next, before providing uniform upper bounds for and in (7.1), we first bound the last term of (7.1) as follows:
(7.3) |
for some constant , with probability at least , where the last bound in (7.3) is due to Theorem 2.3 and Lemma A.9 in Appendix A. Additionally, combining Theorem 2.3 and Theorem 2.5, we can easily get
(7.4) |
for all large and some universal constant , on the same event where (7.3) holds. Lemma 6.3 also shows for all large . Hence, with the upper bounds for and , (2.22) follows from the bounds in (7.1) and (7.3). ∎
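For readers who wish to see Theorem 2.12 in action, the following Monte Carlo sketch compares the training error of random feature ridge regression with that of ridge regression using (a proxy for) the limiting kernel. The ridge normalization, the tanh activation, the random labels, and all sizes are illustrative assumptions; the limiting kernel is approximated by a very wide reference network rather than computed in closed form.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, lam = 120, 100, 1e-1
X = rng.standard_normal((d, n)) / np.sqrt(d)
y = rng.standard_normal(n)                      # arbitrary bounded labels
sigma = np.tanh

def empirical_ck(width):
    W = rng.standard_normal((width, d))
    Phi = sigma(W @ X)
    return Phi.T @ Phi / width

def train_err(K):
    # Ridge predictor on the training set: y_hat = K (K + lam I)^{-1} y.
    resid = y - K @ np.linalg.solve(K + lam * np.eye(n), y)
    return np.mean(resid ** 2)

err_kernel = train_err(empirical_ck(200_000))   # proxy for the limiting kernel
for d1 in [500, 2000, 8000]:
    gap = abs(train_err(empirical_ck(d1)) - err_kernel)
    print(f"d1 = {d1:5d}   |training-error gap| = {gap:.5f}")
```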
For ease of notation, we denote and . Hence, from (2.24), we can further decompose the test errors for and into
(7.5) | ||||
(7.6) | ||||
Let us denote
To compare the test errors of the random feature and kernel regression models, we need to control and . First, we study the concentrations of
and
Lemma 7.1.
Proof.
We consider , its corresponding kernels and . Under Assumption 2.14, we can directly apply Theorem 2.3 to get the concentration of around , namely,
(7.8) |
with probability at least . Meanwhile, we can write and as block matrices:
Since the -norm of any row of a matrix is bounded above by the spectral norm of that matrix, we complete the proof of (7.7). ∎
Lemma 7.2.
Assume that the training labels satisfy Assumption 2.13 and . Then, for any deterministic , we have
where constant only depends on and . Moreover,
Proof.
We follow the idea of Lemma C.8 in [MM19] and control the variance of the quadratic form for the Gaussian random vector via
(7.9) |
for any deterministic square matrix and standard normal random vector . Notice that the quadratic form
(7.10) |
where is a standard Gaussian random vector in . Similarly, the second quadratic form can be written as
Let
By (7.9), we know and . Since
and similarly for a constant , we can complete the proof. ∎
As a remark, in Lemma 7.2 we only provide, for simplicity, a variance bound for the quadratic forms, which suffices for convergence in probability in the proofs of Theorems 2.16 and 2.17 below. One could instead apply the Hanson-Wright inequalities in Section 3 to obtain more precise probability bounds and to handle non-Gaussian distributions for and .
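The variance computation behind Lemma 7.2 reduces to a classical Gaussian identity: for a standard Gaussian vector ξ and a deterministic symmetric matrix A, Var(ξᵀAξ) = 2‖A‖_F². The exact form of (7.9) is elided above, so the following Monte Carlo sketch only checks this standard identity; the matrix size and number of samples are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
n, trials = 50, 100_000
B = rng.standard_normal((n, n))
A = (B + B.T) / 2                                  # symmetric test matrix

Xi = rng.standard_normal((trials, n))
quad = np.sum((Xi @ A) * Xi, axis=1)               # xi^T A xi for each sample
print("empirical Var(xi^T A xi):", quad.var())
print("2 * ||A||_F^2:           ", 2 * np.linalg.norm(A, 'fro') ** 2)
```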
Proof of Theorem 2.16.
Based on the preceding expansions of and in (7.5) and (7.6), we need to control the right-hand side of
In what follows, we first establish the concentrations of and with respect to the normal random vectors and , respectively. Then, we apply Theorem 2.3 and Lemma 7.1 to complete the proof of (2.25). For simplicity, we start with the second term
(7.11) |
where and are quadratic forms defined below
and their expectations with respect to random vectors and are denoted by
We first consider the randomness of the weight matrix in and define the event on which both (7.4) and (7.8) hold. Then, Theorem 2.5 and the proof of Lemma 7.1 indicate that the event occurs with probability at least for all large . Notice that does not depend on the randomness of the test data .
We now consider in Lemma 7.2. Conditioning on event , we have
(7.12) | ||||
(7.13) |
for some constant , where we utilize the assumption . Hence, based on Lemma 7.2, we know , for some constant . By Chebyshev’s inequality and event ,
(7.14) |
for any . Hence, with probability , when and .
Likewise, when , we can apply (7.2) and
(7.15) |
due to Lemma A.9 in Appendix A, to obtain conditionally on event . Then, similarly, Lemma 7.2 shows . Therefore, (7.14) also holds for .
Moreover, conditioning on the event ,
(7.16) | ||||
(7.17) | ||||
(7.18) |
for some constant . In the same way, with (7.15), on the event . Therefore, from (7.11), we can conclude for any , with probability , when and .
Analogously, the first term is controlled by the following four quadratic forms
where we define by for and
Similarly to (7.13) and (7.18), it is not hard to verify and conditioned on the event . Then, as in (7.14), we can invoke Lemma 7.2 for each quadratic form, apply Chebyshev's inequality, and conclude with probability when , for any . ∎
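Complementing the training-error sketch after Theorem 2.12, the following Monte Carlo sketch compares the test errors of random feature ridge regression and (a proxy for) the limiting kernel ridge regression on fresh test points, in the spirit of Theorem 2.16. The noiseless linear teacher, the tanh activation, the ridge level, and all sizes are illustrative assumptions, and the limiting kernel is again approximated by a very wide reference model.

```python
import numpy as np

rng = np.random.default_rng(7)
n, n_test, d, lam = 120, 200, 80, 1e-1
X = rng.standard_normal((d, n)) / np.sqrt(d)
X_test = rng.standard_normal((d, n_test)) / np.sqrt(d)
beta = rng.standard_normal(d)
y, y_test = beta @ X, beta @ X_test           # noiseless linear teacher
sigma = np.tanh

def test_err(width):
    W = rng.standard_normal((width, d))
    Phi, Phi_test = sigma(W @ X), sigma(W @ X_test)
    K = Phi.T @ Phi / width                   # empirical CK on training data
    K_x = Phi_test.T @ Phi / width            # cross kernel, test vs. training
    pred = K_x @ np.linalg.solve(K + lam * np.eye(n), y)
    return np.mean((pred - y_test) ** 2)

err_kernel = test_err(100_000)                # proxy for kernel ridge regression
for d1 in [500, 2000, 8000]:
    print(f"d1 = {d1:5d}   |test-error gap| = {abs(test_err(d1) - err_kernel):.5f}")
```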
Lemma 7.3.
Proof.
Proof of Theorem 2.17.
From (2.22) and (2.25), we can easily conclude that
(7.21) | ||||
(7.22) |
as and . Therefore, to study the training error and the test error of random feature regression, it suffices to analyze the asymptotic behaviors of and for the kernel regression, respectively. In the rest of the proof, we will first analyze the test error and then compute the training error under the ultra-wide regime.
Recall that and the test error is given by
(7.23) |
where , . The spectral norm of is bounded from above and the smallest eigenvalue is bounded from below by some positive constants.
We first focus on the last two terms and in the test error. Let us define
Then, we obtain two quadratic forms
where and are at most for some constant , due to Lemma 7.3. Hence, applying Lemma 7.2 to these two quadratic forms, we have as . Additionally, Lemma 7.2 and the proof of Lemma 7.3 verify that and vanish as . Therefore, converges to zero in probability for . So we can instead analyze and . Repeating the above procedure, we can separately compute the variances of and with respect to and , and then apply Lemma 7.2. Then, and will converge to zero in probability as , where
To obtain the last approximation, we define and
(7.24) |
We aim to replace by in and . Recalling the identity (7.2), we have
Since is not a linear function, . Then, with (7.4), the proof of Lemma 5.4 indicates
(7.25) |
where we apply the fact that . Let us denote
(7.26) | ||||
(7.27) |
Notice that, for any matrices , . Then, with the help of (7.25) and the uniform bounds on the spectral norms of , and , we obtain that
as , and . Combining all the approximations, we conclude that and have identical limits in probability for . On the other hand, based on the assumption of and definitions in (7.24), (7.26) and (7.27), it is not hard to check that
Therefore, and converge in probability to the above limits, respectively, as . Finally, we apply the concentration of the quadratic form in (7.23) to get . Then, by (7.22), we obtain the limit in (2.28) for the test error . As a byproduct, we can also use and to form an -dependent deterministic equivalent of .
Thanks to Lemma 7.2, the training error, , analogously concentrates around its expectation with respect to and , which is . Moreover, because of (7.25), we can further replace with defined in (7.24). Hence, asymptotically,
where as ,
(7.28) | ||||
(7.29) |
The last two limits are due to as . Therefore, by (7.21), we obtain our final result (2.27) in Theorem 2.17. ∎
Acknowledgements
Z.W. is partially supported by NSF DMS-2055340 and NSF DMS-2154099. Y.Z. is partially supported by NSF-Simons Research Collaborations on the Mathematical and Scientific Foundations of Deep Learning. This material is based upon work supported by the National Science Foundation under Grant No. DMS-1928930 while Y.Z. was in residence at the Mathematical Sciences Research Institute in Berkeley, California, during the Fall 2021 semester for the program “Universality and Integrability in Random Matrix Theory and Interacting Particle Systems”. Z.W. would like to thank Denny Wu for his valuable suggestions and comments. Both authors would like to thank Lucas Benigni, Ioana Dumitriu, and Kameron Decker Harris for their helpful discussion.
Appendix A Auxiliary lemmas
Lemma A.1 (Equation (3.7.9) in [Joh90]).
Let be two matrices, be positive semidefinite, and be the Hadamard product between and . Then,
(A.1) |
Lemma A.2 (Sherman–Morrison formula, [Bar51]).
Suppose is an invertible square matrix and are column vectors. Then
(A.2) |
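As a quick sanity check of (A.2), the following sketch verifies the Sherman-Morrison identity (A + uvᵀ)⁻¹ = A⁻¹ − A⁻¹uvᵀA⁻¹ / (1 + vᵀA⁻¹u) numerically on a small random instance; the matrix size is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 6
A = rng.standard_normal((n, n)) + n * np.eye(n)    # well-conditioned test matrix
u, v = rng.standard_normal(n), rng.standard_normal(n)

A_inv = np.linalg.inv(A)
lhs = np.linalg.inv(A + np.outer(u, v))
rhs = A_inv - np.outer(A_inv @ u, v @ A_inv) / (1 + v @ A_inv @ u)
print(np.allclose(lhs, rhs))   # True
```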
Lemma A.3 (Theorem A.45 in [BS10]).
Let be two Hermitian matrices. Then and have the same limiting spectral distribution if as .
Lemma A.4 (Theorem B.11 in [BS10]).
Let and be the Stieltjes transform of a probability measure. Then .
Lemma A.5 (Lemma D.2 in [NM20]).
Let such that and . Let be the -th normalized Hermite polynomial given in (1.5). Then
(A.3) |
Lemma A.6 (Proposition C.2 in [FW20]).
Suppose , are real symmetric, and is invertible with . Then is invertible with .
Lemma A.7 (Proposition C.3 in [FW20]).
Let be two sequences of Hermitian matrices satisfying
as . Suppose that, as , for a probability distribution on , then .
Lemma A.8.
Proof.
When is twice differentiable in Assumption 1.2, this result follows from Lemma D.3 in [FW20]. When is a piece-wise linear function defined in case 2 of Assumption 1.2, the second inequality follows from (5.7) with . For the first inequality, the Hermite expansion of is given by (5.3) with coefficients for . Observe that the piece-wise linear function in case 2 of Assumption 1.2 satisfies
(A.6) | ||||
(A.7) |
because of condition (1.10) for . Recall and . Then, analogously to the derivation of (5.10), there exists some constant such that
(A.8) | ||||
(A.9) |
for and -orthonormal . This completes the proof of this lemma. ∎
With the above lemma, the proof of Lemma D.4 in [FW20] directly yields the following lemma.
Appendix B Additional simulations
Figures 4 and 5 provide additional simulations of the eigenvalue distribution described in Theorem 2.1 for different activation functions and scalings. Here, we plot the empirical eigenvalue distributions of centered CK matrices as histograms, together with the limiting spectra computed from the self-consistent equations. All the input data 's are standard random Gaussian matrices. Interestingly, in Figure 5, we observe an outlier emerging outside the bulk distribution for the piece-wise linear activation function defined in case 2 of Assumption 1.2. Analyzing the emergence of this outlier would be an interesting direction for future work.
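A simplified version of this experiment can be reproduced with the sketch below, which histograms the eigenvalues of a centered, normalized empirical CK for Gaussian data. A centered ReLU is used so that the expected kernel has the closed arc-cosine form E[relu(u)relu(v)] = (√(1−ρ²) + ρ(π − arccos ρ)) / (2π) for unit-norm inputs; the √(d1/n) normalization, the activation, and all sizes are illustrative choices and may differ from the exact scaling in Theorem 2.1.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(9)
n, d, d1 = 400, 600, 12000
X = rng.standard_normal((d, n))
X /= np.linalg.norm(X, axis=0)                 # unit-norm data columns

c = 1.0 / np.sqrt(2 * np.pi)                   # E[relu(xi)] for xi ~ N(0,1)
sigma = lambda t: np.maximum(t, 0.0) - c       # centered ReLU

W = rng.standard_normal((d1, d))
Phi = sigma(W @ X)
CK = Phi.T @ Phi / d1                          # empirical conjugate kernel

# Expected kernel of the centered ReLU via the arc-cosine formula.
rho = np.clip(X.T @ X, -1.0, 1.0)
K = (np.sqrt(1 - rho ** 2) + rho * (np.pi - np.arccos(rho))) / (2 * np.pi) - c ** 2

eigs = np.linalg.eigvalsh(np.sqrt(d1 / n) * (CK - K))
plt.hist(eigs, bins=60, density=True)
plt.title(f"Centered CK spectrum, n={n}, d1={d1}")
plt.show()
```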
References
- [Ada15] Radoslaw Adamczak. A note on the Hanson-Wright inequality for random vectors with dependencies. Electronic Communications in Probability, 20, 2015.
- [ADH+19a] Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, pages 322–332. PMLR, 2019.
- [ADH+19b] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 8141–8150, 2019.
- [AGZ10] Greg W Anderson, Alice Guionnet, and Ofer Zeitouni. An introduction to random matrices. Cambridge university press, 2010.
- [AKM+17] Haim Avron, Michael Kapralov, Cameron Musco, Christopher Musco, Ameya Velingker, and Amir Zandieh. Random Fourier features for kernel ridge regression: Approximation bounds and statistical guarantees. In International Conference on Machine Learning, pages 253–262. PMLR, 2017.
- [ALP22] Ben Adlam, Jake A Levinson, and Jeffrey Pennington. A random matrix perspective on mixtures of nonlinearities in high dimensions. In International Conference on Artificial Intelligence and Statistics, pages 3434–3457. PMLR, 2022.
- [AP20] Ben Adlam and Jeffrey Pennington. The neural tangent kernel in high dimensions: Triple descent and a multi-scale theory of generalization. In International Conference on Machine Learning, pages 74–84. PMLR, 2020.
- [AS17] Guillaume Aubrun and Stanisław J Szarek. Alice and Bob meet Banach, volume 223. American Mathematical Soc., 2017.
- [Aub12] Guillaume Aubrun. Partial transposition of random states and non-centered semicircular distributions. Random Matrices: Theory and Applications, 1(02):1250001, 2012.
- [AZLS19] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pages 242–252, 2019.
- [Bac13] Francis Bach. Sharp analysis of low-rank kernel matrix approximations. In Conference on Learning Theory, pages 185–209. PMLR, 2013.
- [Bac17] Francis Bach. On the equivalence between kernel quadrature rules and random feature expansions. The Journal of Machine Learning Research, 18(1):714–751, 2017.
- [Bao12] Zhigang Bao. Strong convergence of ESD for the generalized sample covariance matrices when . Statistics & Probability Letters, 82(5):894–901, 2012.
- [Bar51] Maurice S Bartlett. An inverse matrix adjustment arising in discriminant analysis. The Annals of Mathematical Statistics, 22(1):107–111, 1951.
- [BLM13] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013.
- [BMR21] Peter L Bartlett, Andrea Montanari, and Alexander Rakhlin. Deep learning: a statistical viewpoint. Acta numerica, 30:87–201, 2021.
- [BP21] Lucas Benigni and Sandrine Péché. Eigenvalue distribution of some nonlinear models of random matrices. Electronic Journal of Probability, 26:1–37, 2021.
- [BS10] Zhidong Bai and Jack W Silverstein. Spectral analysis of large dimensional random matrices, volume 20. Springer, 2010.
- [BY88] Zhidong Bai and Y. Q. Yin. Convergence to the semicircle law. The Annals of Probability, pages 863–875, 1988.
- [BZ10] ZD Bai and LX Zhang. The limiting spectral distribution of the product of the Wigner matrix and a nonnegative definite matrix. Journal of Multivariate Analysis, 101(9):1927–1949, 2010.
- [CH23] Benoit Collins and Tomohiro Hayase. Asymptotic freeness of layerwise Jacobians caused by invariance of multilayer perceptron: The Haar orthogonal case. Communications in Mathematical Physics, 397(1):85–109, 2023.
- [COB19] Lénaïc Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. Advances in Neural Information Processing Systems, 32:2937–2947, 2019.
- [CP12] Binbin Chen and Guangming Pan. Convergence of the largest eigenvalue of normalized sample covariance matrices when and both tend to infinity with their ratio converging to zero. Bernoulli, 18(4):1405–1420, 2012.
- [CP15] Binbin Chen and Guangming Pan. CLT for linear spectral statistics of normalized sample covariance matrices with the dimension much larger than the sample size. Bernoulli, 21(2):1089–1133, 2015.
- [CS09] Youngmin Cho and Lawrence K Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems, pages 342–350, 2009.
- [CYZ18] Benoît Collins, Zhi Yin, and Ping Zhong. The PPT square conjecture holds generically for some classes of independent states. Journal of Physics A: Mathematical and Theoretical, 51(42):425301, 2018.
- [DFS16] Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances In Neural Information Processing Systems, pages 2253–2261, 2016.
- [DZPS19] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2019.
- [Fel21] Michael J Feldman. Spiked singular values and vectors under extreme aspect ratios. arXiv preprint arXiv:2104.15127, 2021.
- [FW20] Zhou Fan and Zhichao Wang. Spectra of the conjugate kernel and neural tangent kernel for linear-width neural networks. In Advances in Neural Information Processing Systems, volume 33, pages 7710–7721. Curran Associates, Inc., 2020.
- [GKZ19] David Gamarnik, Eren C Kızıldağ, and Ilias Zadik. Stationary points of shallow neural networks with quadratic activation function. arXiv preprint arXiv:1912.01599, 2019.
- [GLBP21] Jungang Ge, Ying-Chang Liang, Zhidong Bai, and Guangming Pan. Large-dimensional random matrix theory and its applications in deep learning and wireless communications. Random Matrices: Theory and Applications, page 2230001, 2021.
- [GLK+20] Federica Gerace, Bruno Loureiro, Florent Krzakala, Marc Mézard, and Lenka Zdeborová. Generalisation error in learning with random features and the hidden manifold model. In International Conference on Machine Learning, pages 3452–3462. PMLR, 2020.
- [GMMM19] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Limitations of lazy training of two-layers neural networks. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 9111–9121, 2019.
- [GZR22] Diego Granziol, Stefan Zohren, and Stephen Roberts. Learning rates as a function of batch size: A random matrix theory approach to neural network training. J. Mach. Learn. Res, 23:1–65, 2022.
- [HK21] Tomohiro Hayase and Ryo Karakida. The spectrum of Fisher information of deep networks achieving dynamical isometry. In International Conference on Artificial Intelligence and Statistics, pages 334–342. PMLR, 2021.
- [HL22] Hong Hu and Yue M Lu. Universality laws for high-dimensional learning with random features. IEEE Transactions on Information Theory, 2022.
- [HW71] David Lee Hanson and Farroll Tim Wright. A bound on tail probabilities for quadratic forms in independent random variables. The Annals of Mathematical Statistics, 42(3):1079–1083, 1971.
- [HXAP20] Wei Hu, Lechao Xiao, Ben Adlam, and Jeffrey Pennington. The surprising simplicity of the early-time learning dynamics of neural networks. In Advances in Neural Information Processing Systems, volume 33, pages 17116–17128. Curran Associates, Inc., 2020.
- [JGH18] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: convergence and generalization in neural networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 8580–8589, 2018.
- [Jia04] Tiefeng Jiang. The limiting distributions of eigenvalues of sample correlation matrices. Sankhyā: The Indian Journal of Statistics, pages 35–48, 2004.
- [Joh90] C.R. Johnson. Matrix Theory and Applications. AMS Short Course Lecture Notes. American Mathematical Society, 1990.
- [JSS+20] Arthur Jacot, Berfin Simsek, Francesco Spadaro, Clément Hongler, and Franck Gabriel. Implicit regularization of random feature models. In International Conference on Machine Learning, pages 4631–4640. PMLR, 2020.
- [LBN+18] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as Gaussian processes. In International Conference on Learning Representations, 2018.
- [LC18] Zhenyu Liao and Romain Couillet. On the spectrum of random features maps of high dimensional data. In International Conference on Machine Learning, pages 3063–3071. PMLR, 2018.
- [LCM20] Zhenyu Liao, Romain Couillet, and Michael W. Mahoney. A random matrix analysis of random Fourier features: beyond the Gaussian kernel, a precise phase transition, and the corresponding double descent. In 34th Conference on Neural Information Processing Systems, 2020.
- [LD21] Licong Lin and Edgar Dobriban. What causes the test error? Going beyond bias-variance via ANOVA. Journal of Machine Learning Research, 22(155):1–82, 2021.
- [LGC+21] Bruno Loureiro, Cédric Gerbelot, Hugo Cui, Sebastian Goldt, Florent Krzakala, Marc Mézard, and Lenka Zdeborová. Learning curves of generic features maps for realistic datasets with a teacher-student model. Advances in Neural Information Processing Systems, 34, 2021.
- [LLC18] Cosme Louart, Zhenyu Liao, and Romain Couillet. A random matrix approach to neural networks. The Annals of Applied Probability, 28(2):1190–1248, 2018.
- [LLS21] Fanghui Liu, Zhenyu Liao, and Johan Suykens. Kernel regression in high dimensions: Refined analysis beyond double descent. In International Conference on Artificial Intelligence and Statistics, pages 649–657. PMLR, 2021.
- [LR20] Tengyuan Liang and Alexander Rakhlin. Just interpolate: Kernel “ridgeless” regression can generalize. The Annals of Statistics, 48(3):1329–1347, 2020.
- [LRZ20] Tengyuan Liang, Alexander Rakhlin, and Xiyu Zhai. On the multiple descent of minimum-norm interpolants and restricted lower isometry of kernels. In Conference on Learning Theory, pages 2683–2711. PMLR, 2020.
- [LY16] Zeng Li and Jianfeng Yao. Testing the sphericity of a covariance matrix when the dimension is much larger than the sample size. Electronic Journal of Statistics, 10(2):2973–3010, 2016.
- [MHR+18] Alexander G de G Matthews, Jiri Hron, Mark Rowland, Richard E Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations, 2018.
- [MM19] Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics, 2019.
- [MMM21] Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Generalization error of random feature and kernel methods: hypercontractivity and kernel matrix concentration. Applied and Computational Harmonic Analysis, 2021.
- [MZ22] Andrea Montanari and Yiqiao Zhong. The interpolation phase transition in neural networks: Memorization and generalization under lazy training. The Annals of Statistics, 50(5):2816–2847, 2022.
- [Nea95] Radford M Neal. Bayesian learning for neural networks. PhD thesis, University of Toronto, 1995.
- [Ngu21] Quynh Nguyen. On the proof of global convergence of gradient descent for deep relu networks with linear widths. In International Conference on Machine Learning, pages 8056–8062. PMLR, 2021.
- [NM20] Quynh Nguyen and Marco Mondelli. Global convergence of deep networks with one wide layer followed by pyramidal topology. In 34th Conference on Neural Information Processing Systems, volume 33, 2020.
- [NMM21] Quynh Nguyen, Marco Mondelli, and Guido F Montufar. Tight bounds on the smallest eigenvalue of the neural tangent kernel for deep relu networks. In International Conference on Machine Learning, pages 8119–8129. PMLR, 2021.
- [NS06] Alexandru Nica and Roland Speicher. Lectures on the combinatorics of free probability, volume 13. Cambridge University Press, 2006.
- [OS20] Samet Oymak and Mahdi Soltanolkotabi. Toward moderate overparameterization: Global convergence guarantees for training shallow neural networks. IEEE Journal on Selected Areas in Information Theory, 1(1):84–105, 2020.
- [Péc19] S Péché. A note on the Pennington-Worah distribution. Electronic Communications in Probability, 24, 2019.
- [PLR+16] Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems, pages 3360–3368, 2016.
- [PS21] Vanessa Piccolo and Dominik Schröder. Analysis of one-hidden-layer neural networks via the resolvent method. Advances in Neural Information Processing Systems, 34, 2021.
- [PSG17] Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. Advances in neural information processing systems, 30, 2017.
- [PSG18] Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. The emergence of spectral universality in deep networks. In International Conference on Artificial Intelligence and Statistics, pages 1924–1932. PMLR, 2018.
- [PW17] Jeffrey Pennington and Pratik Worah. Nonlinear random matrix theory for deep learning. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
- [QLY21] Jiaxin Qiu, Zeng Li, and Jianfeng Yao. Asymptotic normality for eigenvalue statistics of a general sample covariance matrix when and applications. arXiv preprint arXiv:2109.06701, 2021.
- [RR07] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Proceedings of the 20th International Conference on Neural Information Processing Systems, pages 1177–1184, 2007.
- [RR17] Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
- [RV13] Mark Rudelson and Roman Vershynin. Hanson-Wright inequality and sub-Gaussian concentration. Electronic Communications in Probability, 18:1–9, 2013.
- [SGGSD17] Samuel S Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep information propagation. In International Conference on Learning Representations, 2017.
- [Sil85] Jack W Silverstein. The smallest eigenvalue of a large dimensional Wishart matrix. The Annals of Probability, pages 1364–1368, 1985.
- [SY19] Zhao Song and Xin Yang. Quadratic suffices for over-parametrization via matrix Chernoff bound. arXiv preprint arXiv:1906.03593, 2019.
- [Tro12] Joel A Tropp. User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics, 12(4):389–434, 2012.
- [Ver18] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.
- [Voi87] Dan Voiculescu. Multiplication of certain non-commuting random variables. Journal of Operator Theory, pages 223–235, 1987.
- [WDW19] Xiaoxia Wu, Simon S Du, and Rachel Ward. Global convergence of adaptive gradient methods for an over-parameterized neural network. arXiv preprint arXiv:1902.07111, 2019.
- [Wil97] Christopher KI Williams. Computing with infinite networks. In Advances in Neural Information Processing Systems, pages 295–301, 1997.
- [WP14] Lili Wang and Debashis Paul. Limiting spectral distribution of renormalized separable sample covariance matrices when . Journal of Multivariate Analysis, 126:25–52, 2014.
- [WZ22] Zhichao Wang and Yizhe Zhu. Overparameterized random feature regression with nearly orthogonal data. arXiv preprint arXiv:2211.06077, 2022.
- [XBSD+18] Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel Schoenholz, and Jeffrey Pennington. Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. In International Conference on Machine Learning, pages 5393–5402. PMLR, 2018.
- [Xie13] Junshan Xie. Limiting spectral distribution of normalized sample covariance matrices with . Statistics & Probability Letters, 83:543–550, 2013.
- [YBM21] Zitong Yang, Yu Bai, and Song Mei. Exact gap between generalization error and uniform convergence in random feature models. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 11704–11715. PMLR, 18–24 Jul 2021.
- [YXZ22] Long Yu, Jiahui Xie, and Wang Zhou. Testing Kronecker product covariance matrices for high-dimensional matrix-variate data. Biometrika, 11 2022. asac063.